Posts Tagged ‘#xml’

How to use MarkLogic with Open Government?

May 10, 2013

 

How to use MarkLogic with US Congress Bill Data?

 

I’m a big advocate for open government data projects. One of the more interesting efforts is the US Congress legislation data that is published every night on GitHub.

Here’s the link. => https://github.com/unitedstates/congress/wiki

For more about GitHub, please see this Clay Shirky TED talk.

Clay Shirky would like to use GitHub as a tool to manage the legislation process. He calls it a new “form of arguing”. It’s a great idea but GitHub is not an easy tool for government legislators yet. It could use some ease-of-use tweaks for folks not familiar with software development.

In any case, Clay mentioned the posting of US Congress Bill data that occurs nightly. Posting the data is a great example of government transparency.  This data can be used to measure congressional productivity and maybe even accountability.

This data makes it possible to provide a real-time (well nightly) digital Report Card. It can be used to measure any perceptions of Congressional Gridlock that is often reported in the news.

This blog post will show how to do this but before I start here’s some notes about the US Congress that will help when we look at the data.

Notes on the US Congress and the Legislative Process:

  1. Consists of 2 houses: Senate and House of Representatives.
  2. There are 535 voting members: 435 in the House of Representatives and 100 in the Senate.
  3. Each Congressional Session is 2 years.
  4. Current Session is the 113th Congress which runs from January 3, 2013 to January 3, 2015.
  5. A key objective is to create legislation that improves and advances the entire country.
  6. Any member of House or Senate can introduce legislation for debate. It is done by proposing a bill.
  7. Congressional bill is introduced and championed by a sponsor.

  8. Bill is then sent to appropriate congressional committee to decide if the House or Senate will vote on the bill.

  9. Senate has 17 committees and 70 subcommittees.

  10. House has 23 committees and 104 subcommittees.

  11. Each committee focuses on a specific policy area.

  12. Bill is passed when a majority vote is received in both the House and Senate.

  13. President must sign the bill to pass it into law or veto it.

  14. Congressional Productivity can be measured by the number of bills that get signed by the president.

 

The Huffington Post news outlet published an article about Congressional Productivity of the 112th Congress at the end of 2012. The article noted that the 111th Congress had 383 bills signed into law by the president. The 112th congress had only 283 bills signed by the president.

One could conclude that the 112th Congress was less productive than the 111th Congress but there are other factors to consider.

Here’s the Huffington Post article.

http://www.huffingtonpost.com/2012/12/28/congress-unproductive_n_2371387.html

 

Questions:

Is it possible to use the nightly data to verify or track Congressional Productivity on a daily basis?

How difficult would it be to show data visualizations for each congress?

The answer to both questions is a most definite yes but how?

Here’s a simple data set that I’d like to visualize and build from.

Congress

Years Bills Proposed Bills Signed Bill Sign Rate
110 2007-2008 14,042 460 3.28%
111 2009-2010 13,675 383 2.80%
112 2011-2012 12,299 283 2.30%

 

MarkLogic Ingestion and Enrichment

This is an easy problem for MarkLogic. MarkLogic is typically used to ingest, enrich, search and discover.

Once the data is in MarkLogic, the following questions can be easily answered.

  1. How many bills signed by president by year?
  2. Who are the most productive congress members by year?
  3. What is the bill proposed to bill signed ratio by year?

I’ve created a MarkLogic database and ingested the US Congress Bill data from the previous Congressional Sessions. The data ranges from the 93rd Congress (1974 to 1975) to the 112th Congress (2011 to 2012).

For the current session, 113th Congress, I created a scheduled task that runs every night at 11:00 PM EST (8:00 PM PST). The scheduled task ingests the data and enriches it with some simple presidential data.

I’ve also used the MarkLogic App Builder tool to create a quick web application that shows the bills signed by year. It can also be used to search and discover.

The App Builder and ingestion code is posted here. => Source Code

The web application is here. => http://ps4.demo.marklogic.com:8004/

Please give it a try.

Please be sure to try the clickable pie chart and bar chart widgets.

To see the most productive Congress Members by Year, press the "Enacted Type” public facet in the top left corner. Once pressed, you will see the results by year.

 

                     image

 

The Top 30 Most Productive Congress Members are listed in the Sponsor Facet (see image). This sponsor list shows the most productive members for the past 32 years in a descending order.

The number in parenthesis is the number of bills that they sponsored that were signed by the President of the United States.

 

                  image

I use the Sponsor classifier because a Sponsor is a senator or house representative who introduces a bill or amendment and is the bill’s chief advocate. The process of sponsoring a bill requires significant effort. So we can conclude that this could be a good metric for a congress member’s productivity.

You can use the web app to drill down further to discover bills proposed or productive members by year, by president, by cosponsor, by subject, etc..

The remaining blog post and screencast will focus on the following topics.

  1. How to programmatically get the data zip file?
  2. How to ingest json documents?
  3. How to set up an automated daily ingestion task?
  4. Is ACID Compliance needed for this One-Way System?
  5. How to use the SQL API?
  6. How to connect and discover with Tableau and Excel?
  7. Code is posted.

 

1. How to programmatically get the data zip file?

The specific ingestion code is posted in this file. => ingest-bill-data.xqy

The URL to retrieve the nightly congress data points to a zip file.

http://unitedstates.sunlightfoundation.com/congress/data/113.zip

The following code snippet does the following:

  1. http get request to retrieve the zip file.
  2. Iterate the zip manifest to get file names.
  3. For each file name, retrieve the file.
  4. For each file, convert json to an xml document.
  5. For each xml document, transform and enrich.
  6. Save newly transformed/enriched document.

 

Here’s the code snippet.

declare variable $get-options :=
  <options xmlns="xdmp:http">
    <format xmlns="xdmp:document-get">binary</format>
  </options>;

declare variable $BASEURL :=
 "http://unitedstates.sunlightfoundation.com/congress/data/";

let $url  := fn:concat($BASEURL, "113.zip")
let $zip := xdmp:http-get($url, $get-options)[2]

let $docs :=
  for $uri in xdmp:zip-manifest($zip)//zip:part/text()
    let $jdoc := xdmp:zip-get($zip, $uri)
    let $xdoc := json:transform-from-json($jdoc)
    let $doc  := local:load-house-bill-doc($xdoc)
    order by $uri
      return
        $uri

return fn:count($docs)

2. How to ingest json documents?

The data that is posted on the GitHub site is in a json format. A good rule of thumb on json versus xml is to use json as a wire protocol for the “last mile” and then use XML for the data store.

The key advantage of XML over json is Namespaces and Schema. Namespaces and Schema provides capabilities that ultimately makes developers much more productive than if a simple json data store is used. But this is a topic for another day.

Ingesting json data into MarkLogic is easily done using XQuery. XQuery is my preferred ETL (extract, transform and load) tool. I use it to convert json, transform (and enrich) the data into more efficient structures and then store as XML.

The ingestion code is posted here.  => ingest-bill-data.xqy

In a future post, I’ll talk about adding a triple which is the best way to capture relationships with other congressional bills.

The json format of the US Congress Bill data is well documented on the GitHub Wiki page.

 

Here’s an example of the json structure used to hold subjects and summary data.

{ 
  "subjects": [ 
    "Abortion", 
    "Administrative law and regulatory procedures", 
    "Adoption and foster care" 
  ], 
  "summary": { 
    "as": "Public Law", 
    "date": "2010-03-23", 
    "text": "Patient Protection and Affordable Care Act" 
  } 
}

 

The above json is transformed and stored in XML as follows.

<subjects> 
  <subject>Abortion</subject> 
  <subject> 
     Administrative law and regulatory procedures 
  </subject> 
  <subject>Adoption and foster care</subject> 
</subjects> 
<summary> 
  <summary-as>Public Law</summary-as> 
  <summary-date>2010-03-23</summary-date> 
  <summary-text> 
    Patient Protection and Affordable Care Act 
  </summary-text> 
</summary>
 
Here’s the code snippet used to create the above xml node.
element {fn:QName($NS,"subjects")}
{
  for $item at $n in $node/*:subjects/node()
    return
      element {fn:QName($NS,"subject")} {$item/text()}
},
element {fn:QName($NS,"summary")}
{
  element {fn:QName($NS,"summary-as")}
      {$node/*:summary/*:as/text()},
      
  element {fn:QName($NS,"summary-date")}
      {$node/*:summary/*:date/text()},
      
  element {fn:QName($NS,"summary-text")}
      {$node/*:summary/*:text/text()}
},

For more detail, see line 406 in file ingest-bill-data.xqy.

 

3. How to set up an automated daily ingestion task?

The Most Productive Congress Member web application was created by MarkLogic’s App Builder tool.

The code was automatically generated and deployed.

The App Builder tool deploys code using the following directory structure.

image 

 

Please take note of the custom directory. As per name, any custom code can be safely deployed to the custom directory. The App Builder tool will not clobber any code that resides in this directory.

To create the nightly ingestion task, the ingestion code (ingest-bill-data.xqy) was copied to the /custom/schedule directory as shown above.

The next step is to add the scheduled task using the MarkLogic Admin tool.

       image

The code is set to run daily at 8:00 PM PST (11:00 PM EST).

image

Once the task is set, the web application is ready for use.

As before, please try it. => http://ps4.demo.marklogic.com:8004/

Be sure to select the Enacted Type “public” facet which will filter out the bills proposed versus the bills passed. You can also try search. Be sure to drill down to the underlying XML document (see the 3rd red circle in the image). This link will take you to an html table view of the data. There is also a raw XML view (see images).

 

image

 

Underlying XML Document HTML Table View.

image

 

Underlying XML Document XML View.

image

 

4. Is ACID Compliance needed for this One-Way system?

There’s a big debate in the NoSQL community about database consistency.

Data consistency refers to how a database is able to handle updates when failures occur. There are two general approaches, ACID and BASE.

  1. ACID systems provide consistency at the expense of availability.
  2. BASE systems provide availability at the expense of consistency.

I mention it because this application is mostly queries (read-only) and does not require rigorous ACID transactions.

ACID is not needed because data flows mostly one-way. However, if this app were to get user specific features (e.g., user profile, saved searches, user bookmarks, user workspaces, etc.). These features would turn the system into a mission critical two-way system requiring high consistency. High consistently is also referred to as durability.

But this is a topic for another day too.

5. How to use the SQL API?

MarkLogic version 6 added a new SQL API and ODBC service that is very useful for analytics.

I’ll quickly walk through the steps needed to set up a MarkLogic SQL View using the App Builder tool.

Steps:

  1. Select the database and then press the configure button (see image).
  2. Scroll to the bottom of the page and press the “Add New” button. The image already shows that view that was created (us_congress_view).
  3. Enter name, schema, localname (root node) and namespace.
  4. Select the desired range indexes. Try to avoid the Cartesian product. In this example, I intentionally left out subject, cosponser and committee because there are many subjects, cosponsors, and committees associated to a single bill. If these items were included then the number of records would jump to 7.5 million instead of 276,000.
  5. Press the update view button to commit the changes.
  6. Test using qconsole and the new SQL API (see following code).

 

xdmp:sql( 
"select uri, bill_type, congress, year, president_name,
sponsor, enacted_type, enacted_congress,
enacted_number, status, status_at, subjects_top_term
from us_congress_view limit 100", "format" 
)

 

Image 1.

image

Image 2.

image

 

Image 3.

image

Image 4.

image

 

6. How to connect and discover with Tableau and Excel?

Once the SQL View has been created, the next step is to connect the data to Tableau and Excel.

This requires an ODBC service. Creating the ODBC service is very similar to creating an XDBC service.

Go to the Admin Console > App Servers and then press the Create ODBC tab.

image

Create the service and make sure that the desired database is selected.

image

 

The next step is to create the ODBC System DSN using the 32 Bit ODBC Data Source Administrator.

Steps:

  1. Launch the 32 Bit ODBC Data Source Administrator. For windows the command is:                           C:\Windows\SysWOW64\odbcad32.exe
  2. Select the System DSN tab of the ODBC Data Source Admin tool.
  3. Press the Add Button.
  4. Select the MarkLogic SQL (X86) item.
  5. Enter the configuration settings.
  6. Press the Test button to verify.
  7. Good to Go.

Image 1.

image

Image 2.

image

 

Steps to connect to Excel:

  1. Launch Microsoft Excel. I recommend using Excel 2010.
  2. Select the Data > From Other Sources > From Microsoft Query
  3. Observe “Choose Data Source” dialog box and then select MarkLogicSQL.
  4. Observe “Query Wizard” and choose the desired view.
  5. Choose the desired columns.
  6. Set the filter (if desired).
  7. Set the desired sort.
  8. Choose “Return Data to Microsoft Excel” and press ok.
  9. Wait ~2 minutes for the spreadsheet to respond.
  10. Observe results in the spreadsheet (see image 6).

Image 1.

image

Image 2.

image

Image 3.

image

Image 4.

image

Image 5.

image

Image 6.

image

 

Connecting to Tableau is very similar to Excel. I’ll cover more detail in the screencast but here’s the steps.

Steps to connect to Tableau:

  1. Launch Tableau.
  2. Press the Connect to Data link on the home page.
  3. Select the “Other Databases (ODBC)” option on the lower left.
  4. Observe the “Generic ODBC Connection” dialog box.
  5. Select the MarkLogicSQL DSN.
  6. Press the Connect button and observe the connection attributes appear.
  7. Select the “main” schema (or equivalent).
  8. Press the table search icon in the lower right and observe the option to select the table.
  9. Select the table and observe the connection name.
  10. Press the OK button.
  11. Wait ~2 minutes for Tableau to respond.
  12. Choose the “live data” option.
  13. Observe dimensions and measure in the left side.
  14. Drag “Number of record” to the “rows”.
  15. drag the congress and president_name dimension to the “columns”.
  16. Choose the bar chart to observe a visual.

Image 1.

image

Image 2.

image

Image 3.

image

Image 4.

image

 

That should be enough to get started using MarkLogic and Tableau. MarkLogic-Tableau combination eliminates the need for expensive data warehouses.

More importantly, MarkLogic-Tableau provides the much needed data discovery. It provides the ability to surface the data in easier and more meaningful ways.

I believe this discovery capability is especially useful for the US Congress Data and the many Open Government Data projects.

Anyway, that is all for now.

Please be sure to watch the screencast where I can provide some more detail.

Advertisements