RSuite, MarkLogic and the DITA Open Toolkit

April 17, 2015

 

RSuite DITA Transform Demo

 

The above video has a brief introduction to RSuite and the DITA Open Toolkit.

I had the great pleasure of being a speaker at last week’s MarkLogic World 2015 San Francisco event. I’ll be writing much more about my MarkLogic MEAN Stack presentation later but for now I‘d like to give a good shout out to RSuite.

While at MarkLogic World, I also had the great pleasure of reconnecting with the wonderful RSuite team. I’m a big advocate of the RSuite content management system.

This past summer, I had an amazing opportunity to work at HarperCollins, where I helped deploy their new content management services using RSuite. My interest in RSuite stemmed from my work as a professional services consultant at MarkLogic. MarkLogic has a loyal base of customers in the media/publishing industry.

Media/publishing customers choose MarkLogic to store their content, which consists of text-based documents and binary assets (photos, audio, video) along with the respective metadata.

MarkLogic lowered its pricing in 2013, which made it more affordable to store binary assets in MarkLogic. Keeping the binary assets together with the text-based content also greatly simplifies the infrastructure and management overhead.

The RSuite CMS has the following features:

  1. Workflow
  2. DITA Transforms – provides “multi-channel” output.
  3. Role Based Security
  4. Distribution

The RSuite secret sauce is the DITA Open Toolkit. The other key component is MarkLogic.

The workflow engine is provided by jBPM, which uses MySQL to store the workflow configurations and drive the finite state machine.

The DITA Open Toolkit provides the “multi-channel output” feature needed by most publishers. This is the ability to render the content to many formats, such as PDF, ePub, XHTML, Adobe InDesign, Word docx, and more.

DITA stands for Darwin Information Typing Architecture. It is an XML data model for authoring and publishing.

Eliot Kimber is a driving force behind DITA for Publishers. Here are some useful links.

  1. http://dita4publishers.sourceforge.net/d4p-user-guide/index.html
  2. http://sourceforge.net/u/drmacro/profile/

Publishers should avoid using XHTML as the storage format for book content, for many reasons. The industry-standard formats are DITA and DocBook because they provide a higher level of abstraction that makes “multi-channel output” much easier. These standards are also more flexible when providing custom publishing services.

The DITA format is especially interesting because of the specialization feature that makes XML structures polymorphic.

Some key points about DITA:

  1. Topic Oriented
  2. Each Topic is a separate XML file
  3. DocBook is Book Oriented
  4. DITA Initial Spec in 2001
  5. DocBook Initial Spec in 1991
  6. Core DITA Topic Types are Concept, Task, and Reference
  7. Specialization: This is subtyping, where new topic types are derived from existing topic types.
  8. The Darwin term is used because the polymorphic specializations provide an evolution path.
  9. A DITA Map XML document is used to stitch the Topic XML documents together (a minimal example follows this list).
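For illustration, here is a minimal sketch of a DITA map that stitches three topics together. The file names and titles are made up; a real publication map would be more elaborate.

<map>
  <title>Sample Publication</title>
  <topicref href="overview-concept.dita" type="concept">
    <topicref href="setup-task.dita" type="task"/>
    <topicref href="settings-reference.dita" type="reference"/>
  </topicref>
</map>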

 

I had an opportunity to chat with Norm Walsh about it at last week’s MarkLogic World event. Norm is the author of DocBook: The Definitive Guide. He’s also an active member of a few of the XML/JSON standards committees.

DITA is a competing standard to DocBook. Norm wrote this interesting blog post about DITA back in October 2005.

http://norman.walsh.name/2005/10/21/dita

My question is which one has better support for semantic annotations. Most content these days is semantically enriched using multiple Ontologies. This is needed so that SPARQL queries can be used to provide Dynamic Semantic Publishing services.

Norm was quick to tell me that DocBook 5.1 now supports RDFa. I’ll definitely investigate.

For DITA, there’s an interesting DITA to RDF transform in the works here.  http://colin.maudry.com/dita-rdf/

Bob DuCharme, author of Learning SPARQL, has this nice blog post on using RDFa with DITA and DocBook.

http://www.devx.com/semantic/Article/42543

In addition to the screencast above, the following screencasts will take a deeper dive into the RSuite software and DITA Open Toolkit.

Please take a look. Hopefully, the screencasts will shed some more light on the need to store content using a higher level of abstraction (DITA or DocBook).

The screencasts will also show the value of RSuite as a full blown Content Management and Digital Asset Management (DAM) Solution.

 

DITA Open Toolkit Demo

 


 

RSuite Architecture and Code

 


What is the MarkLogic Document Discovery App?

August 29, 2013

 

MarkLogic Document Discovery

 

One of my favorite MarkLogic applications posted on GitHub is Document Discovery.

It’s a favorite because it provides a Quick Start tutorial for developers looking to learn about MarkLogic’s Content Process Framework (CPF) and the new binary document support.

Developers will also learn about Field Value Query and “How to annotate and add custom ratings?”.

You can access the source code and get the installation instructions in the readme section on the GitHub project page.

However, this blog post will provide more detail because it required additional changes to get it working properly on my MarkLogic 6 server.

You can see it in action here. => http://ps4.demo.marklogic.com:8009/

Be sure to watch the embedded screencast at the top of the page too.

Background

As noted in the readme, this is a simple application that was built using a customized Application Builder app. Unfortunately, it uses the older MarkLogic 5 Application Builder code. The latest Application Builder that comes with MarkLogic 6 (ML6) has undergone a significant architectural change. The ML6 version generates code that is much more declarative as it utilizes XSLT with the ML6 REST API.

A good write up on how to customize the ML6 App Builder Code is posted here.

http://docs.marklogic.com/guide/app-builder/custom#id_93564

I’ll have a newer version of the Document Discovery app that is built with the ML6 App Builder in a future blog post. For now, this version is better as a tutorial because it makes it easier to follow the code for:

  1. CPF
  2. Binary File Metadata
  3. Field Value Query
  4. How to annotate and add custom ratings?

Content Process Framework (CPF)

I like to refer to the MarkLogic CPF as a Finite State Machine (FSM) for documents. In this case, the document’s state will change or transition from one state to another via a triggering event or condition.

A typical FSM is defined by a list of its states, and the triggering condition for each transition.

For MarkLogic CPF this is done with Pipelines. A pipeline is a configuration file that contains the triggering conditions with respective actions for each state transition.

This application will use CPF for entity enrichment. In this case the entity will be an arbitrary binary file.

I find that entity enrichment is the most common use case for CPF.

Of course, the online Content Processing Framework Guide is the ultimate source to help you become fluent in CPF.

http://docs.marklogic.com/guide/cpf/overview

In the mean time, this post will provide the quick start steps needed to get this Document Discovery app configured to use a CPF pipeline.

In this case, the CPF pipeline will execute a document “scan” on an arbitrary binary document whenever it is inserted into the database.

The triggering event for this will be xdmp:document-insert(), which puts the document into the “initial” state.
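For illustration, here is a minimal sketch of the kind of insert that kicks off the pipeline once CPF is installed on the database. The local file path is hypothetical; the database URI reuses the example document from later in this post.

(: load a binary file and insert it; CPF moves the new document into the
   "initial" state and the pipeline's action then runs against it :)
xdmp:document-insert(
  "/content/451 Report.docx",
  xdmp:document-get("/tmp/451 Report.docx",
    <options xmlns="xdmp:document-get">
      <format>binary</format>
    </options>)
)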

Binary File Support

A new ability to ingest a wide range of binary files was added to MarkLogic in version 5. This feature is referred to as an ISYS Document Filter.

The ISYS (Document) Filter was provided by a company called ISYS Search Software, Inc. The company has since changed its name to Perceptive Search, Inc.

The new ISYS filter extracts metadata from a wide range of binary documents (Word, PowerPoint, JPG, GIF, Excel, etc.). The supported binary file types are listed here.

This metadata extraction capability is exposed through an API called xdmp:document-filter().

More details are provided in Chapter 16 of the Search Developer’s Guide.

Here’s an example of the API usage with response.

Code snippet:

xdmp:document-filter(fn:doc("/content/451 Report.docx"))

 

API Response:

image

Please note the attributes used in the meta elements shown above.

The CPF “initial” pipeline code will modify the above meta elements to be as follows.

image

 

The 3 elements in the blue circle were added later when the user adds a comment and/or rating to the document.
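The actual pipeline action ships with the project’s /cpf/ingest code, but as a rough sketch of the core idea (the URI and the wrapper element name are illustrative, not the app’s exact markup), it boils down to something like this:

(: extract metadata from the binary document and attach it to the document's
   properties so it becomes searchable alongside the binary itself :)
let $uri  := "/content/451 Report.docx"
let $meta := xdmp:document-filter(fn:doc($uri))//*:meta
return
  xdmp:document-set-properties($uri, <filter-meta>{$meta}</filter-meta>)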

 

Field Value Query

In this app, the Author facet was implemented as a field value query. A field value query is used because the search facet also needs to include the <Typist> element.

Example:

The following XML shows 3 elements that are valid Authors (Last_Author, Author, and Typist).

image

 

Adding a Field Value Index to the database provides an ability to unify the values during a search. It also provides a weighting mechanism where the <Last_Author> element can be boosted to have a higher relevance than the <Typist> element.

 

Here’s a snapshot of the admin page used to configure it.

image

 

For more detail, see Understanding Field Word and Value Query Constructors in Chapter 4 of the Search Developer’s Guide.
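As a quick illustration, a search against that field might look like the sketch below, assuming the field is named Author as in the admin snapshot above; the value is made up.

(: matches documents where Last_Author, Author, or Typist carries the value,
   with whatever weights the field configuration assigns to each element :)
cts:search(
  fn:collection(),
  cts:field-value-query("Author", "Jane Doe")
)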

 

How to annotate and add custom ratings?

JavaScript for the annotation and custom ratings is in a file called /src/custom/appjs.js.

 

Here’s the source code directory structure.

 

image

 

The JavaScript code will display the User Comments dialog box to capture the commenter’s name and comment as shown.

 

image

The code that provides the web form post action is in /src/custom/appfuncations.xqy

 

The post action sends the request to a file called /insertComment.xqy, which then does either an xdmp:node-insert-child() for new comments or an xdmp:node-replace() for updates.
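As a rough sketch of that insert-or-replace logic (the URI and element names are illustrative, not the app’s exact markup):

(: $meta points at the document's metadata fragment; replace the existing
   comment if there is one, otherwise append a new comment element :)
let $meta := fn:doc("/content/451 Report.docx.xml")/*
let $comment :=
  <comment>
    <name>Jane Doe</name>
    <text>Great report!</text>
  </comment>
return
  if ($meta/comment) then
    xdmp:node-replace($meta/comment[1], $comment)
  else
    xdmp:node-insert-child($meta, $comment)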

 

A similar approach is done for the user ratings capability. Here’s a snapshot of the UI.

 

image

 

Installation

The source code that is currently posted on GitHub did not work initially.

I made the following changes to make it work.

  1. Config File – modify the Author field value index
  2. Config File – make compatible with MarkLogic 6
  3. Config File – include a dedicated modules and triggers DB.
  4. Pipeline – modify pipeline xml to use code in the /cpf/ingest directory.
  5. Deployed source code to a modules database instead of the file system.
  6. Added additional logging to verify pipeline code process.

The source code is posted here. => http://sdrv.ms/15Coe6U

Here’s the steps that I used to get the application running properly.

Installation Steps:

  1. Import the configuration package that’s located in the /config directory. This package will create and configure the 3 databases (DocumentDiscovery, doument-discovery-modules, and Document-Discovery-Triggers).
  2. Install content processing on the database. See section 4.5 of the CPF Guide. => http://docs.marklogic.com/guide/cpf
  3. Install the pipeline that is in the /ingest directory.

    This is done by copying the 2 files in the /ingest directory to the /Modules/MarkLogic/cpf/ingest directory. You may need to create this directory directly under the modules directory of your MarkLogic installation (see next image). Once copied, then follow the normal pipeline loading process.

  4. Ensure that only the following pipelines are enabled for the default domain in the DocumentDiscovery database (DO NOT ENABLE DOCUMENT CONVERSION).

    Install: Document Filtering (XHTML), Status change handling, and the custom Meta Pipeline. You’ll need to load the Meta Pipeline configuration file.

  5. Deploy Source Code – be sure to modify the http server to use the proper modules database.
  6. Load data using the admin console, Information Studio or qconsole.
  7. Observe ability to search and view the binary documents!

 

Image: Shows CPF Pipeline Code located in the MarkLogic installation directory.

image

 

Image: Document Discovery App showing Search, Facets, and User Star Ratings

image

 

Conclusion

Hopefully, this post will help you get started using MarkLogic for Document Discovery.

The exciting news is the soon to be released MarkLogic 7 (ML7).

ML7 will have new semantic technology features that will take this simple document repository to a whole new level.

This is because ML7 will have the ability to add richer metadata to each document. The richer metadata takes the form of triples.

A triple is a way to model a fact as a "Subject, Predicate, Object". Many triples can be added to a document as sets of facts. These facts can then be incorporated into queries that can make inferences about the “subjects”.

Example:

  1. “Jane Doe” : “graduatedFrom” : “Columbia University”
  2. “Jane Doe” : “graduationYear” : “2001”
  3. “Jane Doe” : “AuthorOf” : “This document: ISBN-125”
  4. “Sports Medicine” : “MainTopicOf” : “This document: ISBN-125”
  5. “Henry James” : “AuthorOf” : “This document: ISBN-125”
  6. “This document: ISBN-125” : “refersTo” : “document ISBN-545”
  7. “Sandra Day” : “graduatedFrom” : “Columbia University”
  8. “Sandra Day” : ”graduationYear” : “2001”
  9. “Sandra Day” : “AuthorOf” : “document ISBN-545”

 

From these facts, the following inferences can be made:

  1. Henry James and Jane Doe co-authored document ISBN-125.
  2. Sandra Day and Jane Doe were college classmates

 

In this example, the facts (triples) are used for knowledge discovery.

 

In this case, the knowledge discovery or inferences can be accomplished with minimal coding effort using simple data structures and a rich query API.
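For a taste of what that could look like with the ML7 semantics API, here is a hedged sketch. The predicate IRIs simply mirror the illustrative facts above; a real model would use full IRIs from an ontology, and the triples would already be loaded or embedded in the documents.

import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: find pairs of people who graduated from the same school in the same year,
   i.e., the "college classmates" inference from the facts above :)
sem:sparql('
  SELECT ?personA ?personB
  WHERE {
    ?personA <graduatedFrom>  ?school .
    ?personA <graduationYear> ?year .
    ?personB <graduatedFrom>  ?school .
    ?personB <graduationYear> ?year .
    FILTER (?personA != ?personB)
  }
')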

 

So please stay tuned!

How to use MarkLogic with Open Government?

May 10, 2013

 

How to use MarkLogic with US Congress Bill Data?

 

I’m a big advocate for open government data projects. One of the more interesting efforts is the US Congress legislation data that is published every night on GitHub.

Here’s the link. => https://github.com/unitedstates/congress/wiki

For more about GitHub, please see this Clay Shirky TED talk.

Clay Shirky would like to use GitHub as a tool to manage the legislation process. He calls it a new “form of arguing”. It’s a great idea but GitHub is not an easy tool for government legislators yet. It could use some ease-of-use tweaks for folks not familiar with software development.

In any case, Clay mentioned the posting of US Congress Bill data that occurs nightly. Posting the data is a great example of government transparency.  This data can be used to measure congressional productivity and maybe even accountability.

This data makes it possible to provide a real-time (well, nightly) digital Report Card. It can be used to test the perception of Congressional gridlock that is often reported in the news.

This blog post will show how to do this, but before I start, here are some notes about the US Congress that will help when we look at the data.

Notes on the US Congress and the Legislative Process:

  1. Consists of 2 houses: Senate and House of Representatives.
  2. There are 535 voting members: 435 in the House of Representatives and 100 in the Senate.
  3. Each Congressional Session is 2 years.
  4. Current Session is the 113th Congress which runs from January 3, 2013 to January 3, 2015.
  5. A key objective is to create legislation that improves and advances the entire country.
  6. Any member of House or Senate can introduce legislation for debate. It is done by proposing a bill.
  7. Congressional bill is introduced and championed by a sponsor.

  8. Bill is then sent to appropriate congressional committee to decide if the House or Senate will vote on the bill.

  9. Senate has 17 committees and 70 subcommittees.

  10. House has 23 committees and 104 subcommittees.

  11. Each committee focuses on a specific policy area.

  12. Bill is passed when a majority vote is received in both the House and Senate.

  13. President must sign the bill to pass it into law or veto it.

  14. Congressional Productivity can be measured by the number of bills that get signed by the president.

 

The Huffington Post published an article about the productivity of the 112th Congress at the end of 2012. The article noted that the 111th Congress had 383 bills signed into law by the president, while the 112th Congress had only 283.

One could conclude that the 112th Congress was less productive than the 111th Congress but there are other factors to consider.

Here’s the Huffington Post article.

http://www.huffingtonpost.com/2012/12/28/congress-unproductive_n_2371387.html

 

Questions:

Is it possible to use the nightly data to verify or track Congressional Productivity on a daily basis?

How difficult would it be to show data visualizations for each congress?

The answer to both questions is a most definite yes but how?

Here’s a simple data set that I’d like to visualize and build from.

Congress   Years       Bills Proposed   Bills Signed   Bill Sign Rate
110        2007-2008   14,042           460            3.28%
111        2009-2010   13,675           383            2.80%
112        2011-2012   12,299           283            2.30%

 

MarkLogic Ingestion and Enrichment

This is an easy problem for MarkLogic. MarkLogic is typically used to ingest, enrich, search and discover.

Once the data is in MarkLogic, the following questions can be easily answered.

  1. How many bills were signed by the president each year?
  2. Who were the most productive members of Congress each year?
  3. What is the ratio of bills proposed to bills signed each year?

I’ve created a MarkLogic database and ingested the US Congress Bill data from the previous Congressional Sessions. The data ranges from the 93rd Congress (1973 to 1975) to the 112th Congress (2011 to 2012).

For the current session, 113th Congress, I created a scheduled task that runs every night at 11:00 PM EST (8:00 PM PST). The scheduled task ingests the data and enriches it with some simple presidential data.

I’ve also used the MarkLogic App Builder tool to create a quick web application that shows the bills signed by year. It can also be used to search and discover.

The App Builder and ingestion code is posted here. => Source Code

The web application is here. => http://ps4.demo.marklogic.com:8004/

Please give it a try.

Please be sure to try the clickable pie chart and bar chart widgets.

To see the most productive Congress Members by Year, press the Enacted Type “public” facet in the top left corner. Once pressed, you will see the results by year.

 

                     image

 

The Top 30 Most Productive Congress Members are listed in the Sponsor Facet (see image). This sponsor list shows the most productive members for the past 32 years in a descending order.

The number in parenthesis is the number of bills that they sponsored that were signed by the President of the United States.

 

                  image

I use the Sponsor classifier because a Sponsor is a senator or house representative who introduces a bill or amendment and is the bill’s chief advocate. The process of sponsoring a bill requires significant effort. So we can conclude that this could be a good metric for a congress member’s productivity.

You can use the web app to drill down further to discover bills proposed or productive members by year, by president, by cosponsor, by subject, etc.

The remainder of this blog post and the screencast will focus on the following topics.

  1. How to programmatically get the data zip file?
  2. How to ingest json documents?
  3. How to set up an automated daily ingestion task?
  4. Is ACID Compliance needed for this One-Way System?
  5. How to use the SQL API?
  6. How to connect and discover with Tableau and Excel?
  7. Code is posted.

 

1. How to programmatically get the data zip file?

The specific ingestion code is posted in this file. => ingest-bill-data.xqy

The URL to retrieve the nightly congress data points to a zip file.

http://unitedstates.sunlightfoundation.com/congress/data/113.zip

The following code snippet does the following:

  1. http get request to retrieve the zip file.
  2. Iterate the zip manifest to get file names.
  3. For each file name, retrieve the file.
  4. For each file, convert json to an xml document.
  5. For each xml document, transform and enrich.
  6. Save newly transformed/enriched document.

 

Here’s the code snippet.

(: namespace for zip manifest parts, and the MarkLogic JSON conversion library :)
declare namespace zip = "xdmp:zip";

import module namespace json = "http://marklogic.com/xdmp/json"
  at "/MarkLogic/json/json.xqy";

declare variable $get-options :=
  <options xmlns="xdmp:http">
    <format xmlns="xdmp:document-get">binary</format>
  </options>;

declare variable $BASEURL :=
 "http://unitedstates.sunlightfoundation.com/congress/data/";

(: 1. HTTP GET the zip file; [2] selects the response body ([1] is the response headers) :)
let $url  := fn:concat($BASEURL, "113.zip")
let $zip := xdmp:http-get($url, $get-options)[2]

let $docs :=
  (: 2. iterate the zip manifest to get the file names :)
  for $uri in xdmp:zip-manifest($zip)//zip:part/text()
    (: 3. retrieve each file, 4. convert json to xml,
       5./6. transform, enrich and save
       (local:load-house-bill-doc is defined later in ingest-bill-data.xqy) :)
    let $jdoc := xdmp:zip-get($zip, $uri)
    let $xdoc := json:transform-from-json($jdoc)
    let $doc  := local:load-house-bill-doc($xdoc)
    order by $uri
      return
        $uri

(: return the number of files processed :)
return fn:count($docs)

2. How to ingest json documents?

The data that is posted on the GitHub site is in a json format. A good rule of thumb on json versus xml is to use json as a wire protocol for the “last mile” and then use XML for the data store.

The key advantages of XML over json are namespaces and schemas. Namespaces and schemas provide capabilities that ultimately make developers much more productive than if a simple json data store were used. But this is a topic for another day.

Ingesting json data into MarkLogic is easily done using XQuery. XQuery is my preferred ETL (extract, transform and load) tool. I use it to convert json, transform (and enrich) the data into more efficient structures and then store as XML.

The ingestion code is posted here.  => ingest-bill-data.xqy

In a future post, I’ll talk about adding a triple which is the best way to capture relationships with other congressional bills.

The json format of the US Congress Bill data is well documented on the GitHub Wiki page.

 

Here’s an example of the json structure used to hold subjects and summary data.

{ 
  "subjects": [ 
    "Abortion", 
    "Administrative law and regulatory procedures", 
    "Adoption and foster care" 
  ], 
  "summary": { 
    "as": "Public Law", 
    "date": "2010-03-23", 
    "text": "Patient Protection and Affordable Care Act" 
  } 
}

 

The above json is transformed and stored in XML as follows.

<subjects> 
  <subject>Abortion</subject> 
  <subject> 
     Administrative law and regulatory procedures 
  </subject> 
  <subject>Adoption and foster care</subject> 
</subjects> 
<summary> 
  <summary-as>Public Law</summary-as> 
  <summary-date>2010-03-23</summary-date> 
  <summary-text> 
    Patient Protection and Affordable Care Act 
  </summary-text> 
</summary>
 
Here’s the code snippet used to create the above xml node.
element {fn:QName($NS,"subjects")}
{
  for $item at $n in $node/*:subjects/node()
    return
      element {fn:QName($NS,"subject")} {$item/text()}
},
element {fn:QName($NS,"summary")}
{
  element {fn:QName($NS,"summary-as")}
      {$node/*:summary/*:as/text()},
      
  element {fn:QName($NS,"summary-date")}
      {$node/*:summary/*:date/text()},
      
  element {fn:QName($NS,"summary-text")}
      {$node/*:summary/*:text/text()}
},

For more detail, see line 406 in file ingest-bill-data.xqy.

 

3. How to set up an automated daily ingestion task?

The Most Productive Congress Member web application was created by MarkLogic’s App Builder tool.

The code was automatically generated and deployed.

The App Builder tool deploys code using the following directory structure.

image 

 

Please take note of the custom directory. As the name implies, any custom code can be safely deployed to the custom directory. The App Builder tool will not clobber any code that resides in this directory.

To create the nightly ingestion task, the ingestion code (ingest-bill-data.xqy) was copied to the /custom/schedule directory as shown above.

The next step is to add the scheduled task using the MarkLogic Admin tool.

       image

The code is set to run daily at 8:00 PM PST (11:00 PM EST).

image

Once the task is set, the web application is ready for use.

As before, please try it. => http://ps4.demo.marklogic.com:8004/

Be sure to select the Enacted Type “public” facet, which narrows the bills proposed down to the bills that were actually enacted. You can also try search. Be sure to drill down to the underlying XML document (see the 3rd red circle in the image). This link will take you to an HTML table view of the data. There is also a raw XML view (see images).

 

image

 

Underlying XML Document HTML Table View.

image

 

Underlying XML Document XML View.

image

 

4. Is ACID Compliance needed for this One-Way system?

There’s a big debate in the NoSQL community about database consistency.

Data consistency refers to how a database is able to handle updates when failures occur. There are two general approaches, ACID and BASE.

  1. ACID systems provide consistency at the expense of availability.
  2. BASE systems provide availability at the expense of consistency.

I mention it because this application is mostly queries (read-only) and does not require rigorous ACID transactions.

ACID is not needed because data flows mostly one-way. However, if this app were to get user-specific features (e.g., user profiles, saved searches, bookmarks, workspaces), those features would turn the system into a mission-critical two-way system requiring strong consistency and durability.

But this is a topic for another day too.

5. How to use the SQL API?

MarkLogic version 6 added a new SQL API and ODBC service that is very useful for analytics.

I’ll quickly walk through the steps needed to set up a MarkLogic SQL View using the App Builder tool.

Steps:

  1. Select the database and then press the configure button (see image).
  2. Scroll to the bottom of the page and press the “Add New” button. The image already shows the view that was created (us_congress_view).
  3. Enter name, schema, localname (root node) and namespace.
  4. Select the desired range indexes. Try to avoid a Cartesian product. In this example, I intentionally left out subject, cosponsor and committee because there are many subjects, cosponsors, and committees associated with a single bill. If these items were included then the number of records would jump to 7.5 million instead of 276,000.
  5. Press the update view button to commit the changes.
  6. Test using qconsole and the new SQL API (see following code).

 

xdmp:sql( 
"select uri, bill_type, congress, year, president_name,
sponsor, enacted_type, enacted_congress,
enacted_number, status, status_at, subjects_top_term
from us_congress_view limit 100", "format" 
)

 

Image 1.

image

Image 2.

image

 

Image 3.

image

Image 4.

image

 

6. How to connect and discover with Tableau and Excel?

Once the SQL View has been created, the next step is to connect the data to Tableau and Excel.

This requires an ODBC service. Creating the ODBC service is very similar to creating an XDBC service.

Go to the Admin Console > App Servers and then press the Create ODBC tab.

image

Create the service and make sure that the desired database is selected.

image

 

The next step is to create the ODBC System DSN using the 32 Bit ODBC Data Source Administrator.

Steps:

  1. Launch the 32 Bit ODBC Data Source Administrator. For Windows, the command is C:\Windows\SysWOW64\odbcad32.exe
  2. Select the System DSN tab of the ODBC Data Source Admin tool.
  3. Press the Add Button.
  4. Select the MarkLogic SQL (X86) item.
  5. Enter the configuration settings.
  6. Press the Test button to verify.
  7. Good to Go.

Image 1.

image

Image 2.

image

 

Steps to connect to Excel:

  1. Launch Microsoft Excel. I recommend using Excel 2010.
  2. Select the Data > From Other Sources > From Microsoft Query
  3. Observe “Choose Data Source” dialog box and then select MarkLogicSQL.
  4. Observe “Query Wizard” and choose the desired view.
  5. Choose the desired columns.
  6. Set the filter (if desired).
  7. Set the desired sort.
  8. Choose “Return Data to Microsoft Excel” and press ok.
  9. Wait ~2 minutes for the spreadsheet to respond.
  10. Observe results in the spreadsheet (see image 6).

Image 1.

image

Image 2.

image

Image 3.

image

Image 4.

image

Image 5.

image

Image 6.

image

 

Connecting to Tableau is very similar to Excel. I’ll cover more detail in the screencast, but here are the steps.

Steps to connect to Tableau:

  1. Launch Tableau.
  2. Press the Connect to Data link on the home page.
  3. Select the “Other Databases (ODBC)” option on the lower left.
  4. Observe the “Generic ODBC Connection” dialog box.
  5. Select the MarkLogicSQL DSN.
  6. Press the Connect button and observe the connection attributes appear.
  7. Select the “main” schema (or equivalent).
  8. Press the table search icon in the lower right and observe the option to select the table.
  9. Select the table and observe the connection name.
  10. Press the OK button.
  11. Wait ~2 minutes for Tableau to respond.
  12. Choose the “live data” option.
  13. Observe dimensions and measure in the left side.
  14. Drag “Number of Records” to the “rows”.
  15. Drag the congress and president_name dimensions to the “columns”.
  16. Choose the bar chart to observe a visual.

Image 1.

image

Image 2.

image

Image 3.

image

Image 4.

image

 

That should be enough to get started using MarkLogic and Tableau. The MarkLogic-Tableau combination can eliminate the need for expensive data warehouses.

More importantly, MarkLogic-Tableau provides much-needed data discovery. It provides the ability to surface the data in easier and more meaningful ways.

I believe this discovery capability is especially useful for the US Congress Data and the many Open Government Data projects.

Anyway, that is all for now.

Please be sure to watch the screencast where I can provide some more detail.

How to add Authentication to a MarkLogic App?

February 4, 2013

 

Learn how to add authentication to a MarkLogic Roxy App.

In this blog post, I will show the code needed to add a simple authentication service.

In the previous post, I created the two-column layout where the top of the home page had a login form. At the time, the login form was not wired up.

For this post, I will wire up the login form. To do this, I’ll show how to build a simple authentication service that searches a user database, verifies the password, and generates an authentication token that will expire after 5 minutes.

This demo application will also show how to provide a simple RESTful API for search. This Search API will utilize the authentication token to restrict access to the search service.

The app will not show any role based restricted views. A more fully featured role based access control will be shown in a future post.

This demo will use the boing-roxy demo that is posted here.  http://ps2.demo.marklogic.com:8090/

A zip file containing the source code for this demo application is posted here.  => source code

Overall Approach

The solution will use the following items to authenticate a user and create an application token that gives the user access to the RESTful API for a 5 minute period.

  1. Registration Form – used to create the user profile documents. This form should only be visible to an admin user but is currently visible to all for demo purposes.
  2. User Directory – Each user will have a dedicated user directory in the MarkLogic database.
  3. User Profile Document – User profile document (/users/janedoe/profile.xml) will reside in the user’s directory. It will contain the username/password. The username must be unique. It can also be used to store the user’s role and organization information. This demo will not utilize a party management solution but it can be extended to do so.
  4. Session Document – A session document will be created when a user successfully logs in. The Session Document will be stored in the User Directory.
  5. Authentication Token – the token will be stored in the session document. Each RESTful API request must include the token in its header.
  6. ROXY Router – Will be the checkpoint or the single point of entry for each request. This is where the user token is verified for each RESTful API request. A key function called auth:findSessionByToken() verifies the session and dispatches the request if valid.
  7. Login – If the username and password is valid, the token is created. If a token already exists and has not yet expired then the same token will be used.
  8. Session Expiration – Session Expiration will be 5 minutes from initial login. The 5 minute duration is for demo purposes. Typical session expiration duration is 24 hours. Session expiration time will be UTC based. UTC is Coordinated Universal Time.
  9. Logout – Terminates the session by deleting the session document that contains the token.

 

Related Notes:

  1. Passwords are never part of the RESTful API transport except the Login API request.
  2. “Remember Me” cookie – This solution can support a “Remember Me” cookie where the token is stored in the cookie and not the password. Remember Me cookies typically expire after 90 days which is longer than the token expiration period.
  3. Verify API – A good approach for refreshing the token stored in a “90 day login cookie” is described here. A Verify API is typically used to verify the Username and Token. If they match then a new token is generated whenever the existing token has expired. The 90-day cookie web app will need to call the Verify API to refresh the token stored in the cookie.
  4. Passwords are currently stored in the User Profile doc but they are MD5 hashed.
  5. Current solution shows how to use the MarkLogic Search API with the user profile document, session document and token to provide an adequate security solution.
  6. OAuth2 – Open Authorization version 2 (OAuth2) is a widely used protocol that provides a federated user profile solution. The key benefit for this example is that user passwords do not need to be stored in MarkLogic. However, this is a topic for a future post. The OAuth2 developer details are here: https://developers.google.com/accounts/docs/OAuth2

 

1. Registration Form

The registration form creates the user profile data.

You can access it here. http://ps2.demo.marklogic.com:8090/user

Here’s a snapshot view.

image

 

2. User Directory

The registration form above creates the user profile data that is stored in a User Profile Document in the respective user directory. The Session Document is also stored in the User Directory.

Example User Directory URIs

/users/grusso/profile.xml
/users/grusso/session/2ccda41cd|2/4/2013 8:28:41 PM.xml

 

3. User Profile Document

Here’s the format of the simple user profile document. Please note that the password is stored as an MD5 hash.

<user-profile>
  <firstname>Gary</firstname>
  <lastname>Russo</lastname>
  <username>grusso</username>
  <password>5f4dcc3b5aa765d61d8327deb882cf99</password>
  <created>2013-02-02T13:45:42.856718-08:00</created>
  <modified>2013-02-03T17:51:59.585527-08:00</modified>
</user-profile>

Roxy code that generates the user profile document is:

  1. user controller – /apps/controllers/user.xqy
  2. user model – /apps/models/user-model.xqy
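The real model code is in the files above, but here is a hedged sketch of the save step; the function and variable names are made up for illustration, and xdmp:md5 produces the kind of password hash shown in the example profile.

(: hash the password with xdmp:md5 and write the profile into the user's directory :)
declare function local:save-profile(
  $first as xs:string, $last as xs:string,
  $username as xs:string, $password as xs:string)
{
  xdmp:document-insert(
    fn:concat("/users/", $username, "/profile.xml"),
    <user-profile>
      <firstname>{$first}</firstname>
      <lastname>{$last}</lastname>
      <username>{$username}</username>
      <password>{xdmp:md5($password)}</password>
      <created>{fn:current-dateTime()}</created>
      <modified>{fn:current-dateTime()}</modified>
    </user-profile>)
};

local:save-profile("Gary", "Russo", "grusso", "password")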

 

4. Session Document:

Here’s the format of the session document.

<session user-sid="2cc3bdfc38bf03c63a4de6bda41cd91b|2/4/2013 8:28:41 PM">
  <username>grusso</username>
  <created>2013-02-04T20:25:41.891354-08:00</created>
  <expiration>2013-02-04T20:28:41.891354-08:00</expiration>
</session>

Please note that the @user-sid attribute in the above XML is the authentication token.

Session Document URI:

/users/grusso/session/2cc3bdfc38bf03c63a4de6bda41cd91b|2/4/2013 8:28:41 PM.xml

The above document URI has the expiration date/time appended to it. Some JavaScript client code will use the appended expiration date/time to trigger a token refresh.

Roxy code that creates and deletes the session document is:

  1. web login controller – /apps/controllers/appbuilder.xqy
  2. rest login controller – /apps/controllers/login.xqy
  3. rest logout controller – /apps/controllers/logout.xqy 
  4. authentication model – /apps/models/authentication.xqy
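The exact code is in authentication.xqy, but here is a hedged sketch of the session creation step. It follows the user-sid format shown above (hash, pipe, expiration), although the real code formats the appended date differently and reuses an unexpired session if one exists.

declare variable $SESSION-TIMEOUT := xs:dayTimeDuration("PT5M");  (: 5 minutes for the demo :)

declare function local:create-session($username as xs:string) as xs:string
{
  let $expires := fn:current-dateTime() + $SESSION-TIMEOUT
  (: token = hash of username plus current time, then "|" and the expiration :)
  let $token :=
    fn:concat(
      xdmp:md5(fn:concat($username, xs:string(fn:current-dateTime()))),
      "|", xs:string($expires))
  return (
    xdmp:document-insert(
      fn:concat("/users/", $username, "/session/", $token, ".xml"),
      <session user-sid="{$token}">
        <username>{$username}</username>
        <created>{fn:current-dateTime()}</created>
        <expiration>{$expires}</expiration>
      </session>),
    $token
  )
};

local:create-session("grusso")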

 

5. Authentication Token:

In the above example, the authentication token is the following string:

2cc3bdfc38bf03c63a4de6bda41cd91b|2/4/2013 8:28:41 PM

This string needs to be added to the request header of each RESTful API request. If it isn’t, the response will be a “401 unauthorized” error. The token must be prefixed with “X-Auth-Token” as follows.

X-Auth-Token: 2cc3bdfc38bf03c63a4de6bda41cd91b|2/4/2013 8:28:41 PM

The source code that extracts the X-Auth-Token value is in the router code. See line 87 of  /src/app/lib/router.xqy.

let $token := xdmp:get-request-header("X-Auth-Token")

If using Firefox Poster tool, the header can be added as shown.

 

image

 

6. ROXY Router:

As discussed above, the router is the checkpoint for all HTTP requests. It is the ideal place to apply security policy logic such as:

  1. Token check
  2. Requests per minute
  3. Maximum Requests per day

This post only handles the token check but this code can be extended to support all security policy logic.

The following XQuery code handles the token check. Please note that certain requests (e.g., login, ping) bypass the token check.

 

let $valid-request :=
  if(fn:not($config:SESSION-AUTHENTICATE)) then fn:true()
  else if(xs:string($controller) = ("ping")) then fn:true()
  else if(xs:string($controller) = ("login")) then fn:true()
  else if(xs:string($controller) = ("logout")) then fn:true()
  else if(xs:string($controller) = ("verify")) then fn:true()
  else
  (
    let $token := xdmp:get-request-header("X-Auth-Token")
    return
      if($token) then
      (
        let $valid-session := auth:findSessionByToken($token)
        return
          if($valid-session) then
          (
            fn:true(),
            auth:cacheSession($valid-session)
          )
          else
            fn:false()
      )
      else fn:false()
  )

7. Login Code:

The login code does the following:

  1. Find user profile document – Searches the user profile documents using the username.
  2. Check password – If a document with the username is found, then the password is checked.
  3. Find session document by username – If the password matches, the code looks for a session document with its respective expiration date.
  4. Session Document – If the session has not expired, the current session document is used. If it has expired, it is deleted and a new session document containing a new Authentication Token is created.
  5. Return Authentication Token

This code resides in the following files:

  1. rest login controller – /apps/controllers/login.xqy
  2. authentication model – /apps/models/authentication.xqy

The URI to request a Login is:

http://ps2.demo.marklogic.com:8090/login

The username and password are bundled into the request using the Authorization header.

So the request header will need this:

Authorization: Basic Z3J1c3NvOnBhc3N3b3Jk

The string after the word Basic is the base64-encoded username and password.

Here’s the code that extracts the username/password; it lives in the login controller (/src/app/controllers/login.xqy).

declare function c:main() as item()*
{
  let $userPwd  :=
    xdmp:base64-decode(
      fn:string(
        fn:tokenize(
          xdmp:get-request-header("Authorization"), "Basic ")[2]
      )
    )
  let $username :=
    fn:string(
      (xdmp:get-request-header("username"),
           fn:tokenize($userPwd, ":")[1])[1]
    )
    
  let $password :=
    fn:string(
      (xdmp:get-request-header("password"),
           fn:tokenize($userPwd, ":")[2])[1]
    )

  let $result   := auth:login($username, $password)

  return
  (
    ch:add-value("res-code", xs:int($result/json:responseCode) ),
    ch:add-value("res-message", xs:string($result/json:message) ),
    ch:add-value("result",  $result),
    ch:add-value(
          "res-header", 
          element header {
            element Date {fn:current-dateTime()},
            element Content-Type
            {
              req:get("req-header")/content-type/fn:string()
            }
          }
        )
  )
};

The code that searches for a session document by username and its expiration date/time uses the following function. Please note the element range index query.

declare function auth:findSessionByUser($username)
{
  let $query :=
    cts:and-query((
      cts:directory-query(auth:sessionDirectory($username),"infinity"),
      cts:element-range-query(
            xs:QName("expiration"),">",
            auth:getCurrentDateTimeUTC())
    ))

  let $uri := cts:uris("",("document","limit=1"), $query )
  
  return
    fn:doc($uri)
};
 

8. Session Expiration:

The code to check the session expiration is invoked by the router code.

See line 89 in /src/app/lib/router.xqy

auth:findSessionByToken($token)

Here’s the code. Please note the element range query.

declare function auth:findSessionByToken($token as xs:string)
{
  let $query :=
    cts:and-query((
      cts:element-attribute-value-query(
        xs:QName("session"),
        xs:QName("user-sid"),
        $token
      ),
      cts:element-range-query(
        xs:QName("expiration"),">",
            auth:getCurrentDateTimeUTC()
      )
    ))

  let $uri := cts:uris("",("document","limit=1"), $query )
  let $doc := fn:doc($uri)
  
  let $current := fn:current-dateTime()
  
  return
    if ($doc) then
    (
      let $expiration := xs:dateTime($doc//expiration)
      let $diff := ($expiration - $current)
      
      return
      (
        if($diff < ($auth:SESSION-TIMEOUT div 2) ) then
          (: less than half the timeout is left, so push the expiration forward
             (assumes $auth:SESSION-TIMEOUT is an xs:dayTimeDuration) :)
          xdmp:node-replace(
              $doc//expiration/text(),
              text{fn:current-dateTime() + $auth:SESSION-TIMEOUT}
            )
        else (),
        $doc/session
      )
    )
    else ()
};
 

9. Logout Code:

The logout code terminates the session by deleting the session document. Here’s the code.

declare function auth:logout($username as xs:string)
{
    let $session := auth:findSessionByUser($username)
    let $user := auth:userFind($username)

    let $token :=
      if($session) then
        $session/session/@user-sid/fn:string()
      else ()
      
    let $__ := auth:clearSession($username)
        return
           <json:object type="object">
              <json:responseCode>200</json:responseCode>
              <json:message>Logout Successful - Token Deleted</json:message>
              <json:authToken>{$token}</json:authToken>
              <json:username>{$user/username/text()}</json:username>
              <json:fullName>
              {
                fn:string-join
                ((($user/firstName,$user/firstname)[1],
                  ($user/lastName,$user/lastname)[1]), " ")
              }
              </json:fullName>
           </json:object>
};
 

Conclusion

Hopefully, the authentication code described in this demo application has been informative. It shows an approach that I have recently used in a ROXY Application.

I will be building on this solution in future posts. The most pressing next step is to add support for OAuth v2 and role based restricted views. So stay tuned.

As always, please let me know in the comments section if any further clarifications or details are needed.

Creating a MarkLogic Search Widget

November 12, 2012

 

How to create a MarkLogic Search Widget?

Intro

Once a highly searchable MarkLogic data repository has been created, a common request is to provide a Search Widget. A search widget is a simple web search box that can be added to any web page. Users can then use the search box to submit a search request to the MarkLogic database.

The client-side widget consists of HTML, CSS and JavaScript. The JavaScript calls a MarkLogic REST API asynchronously. The MarkLogic REST API processes the request and then sends the results back to the web page. The client-side JavaScript receives the results and renders accordingly.

The search widget test page shown in this screencast is at the following link.

http://www.garyrusso.org/blog/testpage1.htm

The screencast walks through the process of creating a MarkLogic Search Widget using the Roxy Framework.

The key topics discussed are:

  1. Recap of the previous Roxy screencast.
  2. Search Widget versus MarkLogic 6 Visual Widget
  3. transform-results override – See config.xqy and snippet-lib.xqy
  4. Add a new Search Controller using Roxy command line.
  5. Cross Domain AJAX using JSONP
  6. JSON Response
  7. jQuery code used to send, receive and render.

1. Recap

The previous screencast showed the Roxy two column and three column layouts. It also showed the time saving Roxy deployer tool.

A revised version of the original screencast web application is shown at the following link.

Two-Column Layout – http://ps2.demo.marklogic.com:8090/

Be sure to use the “Open XML” link in the search results column to view the underlying XML fragment.

2. Search Widget versus MarkLogic 6 Visualization Widget

The screencast builds a fully featured search box widget using some jQuery code, HTML and CSS.

However, MarkLogic 6 comes with a few prebuilt visualization widgets. This is a good opportunity to mention the MarkLogic 6 Visualization Widgets.

You can see them in action here. http://ps2.demo.marklogic.com:8004/

Be sure to try the pie chart filters.

3. transform-results function override

The proper way to write custom snippet code is to override the transform-results function. The screencast shows the search option with the respective code. For more info, see Chapter 2 of the Search Developers Guide.
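For reference, the override is declared in the search options. Below is a hedged sketch; the namespace, module path, and function name are placeholders for whatever snippet-lib.xqy actually defines.

import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

let $options :=
  (: point transform-results at the custom snippeting function in snippet-lib.xqy :)
  <options xmlns="http://marklogic.com/appservices/search">
    <transform-results apply="my-snippet"
        ns="http://example.com/snippet"
        at="/app/views/helpers/snippet-lib.xqy"/>
  </options>
return search:search("superhero", $options)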

4. Add a new Search Controller using Roxy command line

Use the Roxy create command to create a new controller with the respective view code.

See the Roxy command line help for the details.

ml create –help

ml create controller –help

5. Cross Domain AJAX using JSONP

Cross-domain AJAX requests can be a security risk and are restricted in most modern browsers. The standard workaround is to wrap the JSON string in a JavaScript function call. This approach is called JSON-with-Padding, or JSONP.

More details on JSONP are here. http://www.json-p.org/
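On the MarkLogic side, the padding itself is straightforward. Here is a hedged sketch of what the controller can do; the callback parameter name matches the jQuery settings shown later, and the JSON string is a stand-in for the real serialized response.

(: read the callback name from the request and wrap the JSON payload in it :)
let $callback := xdmp:get-request-field("callback", "srCallback")
let $json     := '{"paginationInfo": {}, "facetInfo": [], "results": []}'
return (
  xdmp:set-response-content-type("application/javascript"),
  fn:concat($callback, "(", $json, ");")
)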

6. JSON Response

The over-the-wire transport from MarkLogic to a browser client is most efficiently done using the lightweight JSON format. The JSON structure used in this screencast consists of 3 parts:

  1. paginationInfo
  2. facetInfo
  3. results

The following link shows the raw JSON format.

http://ps2.demo.marklogic.com:8090/search.json?q=superhero&pg=4&ps=15

7. jQuery code used to send, receive and render.

The code that initiates the call to MarkLogic search is as follows.

getSearchResults = function() {

    if (querystring(‘q’) != "")
    {
      $.ajax({
          url: getSearchUrl(querystring(‘pg’)),
          type: ‘GET’,
          dataType: "jsonp",
          jsonp : "callback",
          jsonpCallback: "srCallback"
      });
    }
    else
    {  // Code to clear the results pane…
    }
  };

A callback function called srCallback(data) receives the results from MarkLogic.

The jQuery code is available on the widget test page by viewing source within the browser.

8. Widget Test Page

Be sure to use the following widget test page links.

  1. http://www.garyrusso.org/blog/testpage1.htm
  2. http://www.garyrusso.org/blog/testpage1.htm?q=New+York+City&pg=2&ps=100

Conclusion

The screencast builds from the previous session. The objective is to educate and raise awareness of MarkLogic’s agile database development capabilities.

There are many more topics to cover so please stay tuned.

Grokking MarkLogic’s Roxy Framework

August 17, 2012
 
Quick Walk Through of the MarkLogic Roxy Framework

Intro

For the past few months, I’ve been heads down working in the exciting Big Data software development world. I prefer to call it the post-relational document database world. My focus has been the technology around Big Data but there’s also an amazing social aspect. We see groups like Code For America, Data without Borders (DataKind) and NYC Open Data using Big Data to drive social change.

But social aspects are a topic for a later time. For now, it’s all about the code.

MarkLogic

Of course, MarkLogic is my preferred big data platform. I use it to de-normalize data so that the database engine and search engine can be the same thing.

For those unfamiliar with MarkLogic, it’s a document data store that was designed to handle extremely large amounts of unstructured data using the XML technology stack (XML, XQuery, XSLT, etc.).

The term “unstructured data” is often a topic for debate. We typically see structured data which I consider to be relational data or schema validated XML documents. There’s also semi-structured and unstructured data. One could argue that all data has structure. I typically consider semi-structured data to be partially validated XML or JSON documents. I consider unstructured data to be a set of XML or JSON documents that may have a common header but also have a payload that has no structure and can contain anything (xml, text, binary).

Document databases like MarkLogic, MongoDB, and Couchbase do away with the need to shred data into rows and columns. Aside from being unnecessary, it’s also not feasible when dealing with petabytes of data.

A key capability that MarkLogic provides is agile database development. A MarkLogic developer has the ability to ingest large amounts of data while having very little knowledge of the underlying data structures. Once the data is ingested, indexes can be added and data structures tweaked to provide the desired results. This agile development process ultimately leads to higher developer productivity, higher quality, and quicker time to market.

Semantic Linking

Document databases are also ideal for document linking. We’re just starting to realize the value of linking documents semantically. See Kurt Cagle’s Balisage 2012 paper for an interesting approach to linking documents by appending an “assertion node” to each document. The assertion node contains a “triple store” that’s used to describe the document’s relationship with other documents. These Semantic links can then be used for semantic reasoning which is a topic for another day.

Roxy Framework

Now that I gave some background, let’s talk about building MarkLogic apps using Roxy. I build most of my MarkLogic apps using the Roxy framework.

Roxy (RObust XquerY framework) is a well-designed Model-View-Controller framework for XQuery.

For now, MarkLogic’s primary API is XQuery. However, stay tuned. A rich Java and C# API is coming soon. Of course, there’s also the MarkLogic RESTful API called Corona.

I’m personally a big advocate for XQuery. XQuery is a fully fledged dynamic functional programming language. You can accomplish a lot with a small amount of code. For more info see Nuno Job’s XQuery Presentation.

Roxy’s big 3 features that make developers immediately productive are:

  1. MVC – Write code using Model View Controller (MVC) pattern.
  2. Test Facility – facilitates Test Driven Development (TDD)
  3. Deployer – simplifies the deployment process.

You can get Roxy here. => http://github.com/marklogic/roxy

The Roxy MVC utilizes ideas from:

  1. Ruby on Rails – http://guides.rubyonrails.org/
  2. Cake PHP – http://cakephp.org/
  3. DRY (don’t repeat yourself)

I won’t drill down into the MVC mechanics right now, but it’s worth noting the following image. It shows the Ruby on Rails style “convention over configuration” URL-to-MVC routing. I’ll discuss it further in a future screencast.

roxy-url

The screencast above will show a simple example of ingesting blog post data from the blog site Boing Boing. You can get a copy of the Boing Boing blog archive here. This blog archive file contains 63,999 blog posts.

I built two simple MarkLogic search apps using this data set.

  1. App Builder Version – good for a quick demo of the search capability but difficult to extend.
  2. Roxy Version – more flexible, easier to modify.

App Builder Version is here. => http://ps.demo.marklogic.com:8043/

Roxy Version is here. => http://ps.demo.marklogic.com:8090/

The code is zipped as boing-roxy-code.zip and is posted in the following directory.

     => http://sdrv.ms/12HMD6y

A C# Interface for Dependency Injection

May 27, 2010

This is a follow-up to Jesse Liberty’s Answering A C# Question blog post which compares two equivalent code examples to illustrate the value of interfaces:

  1. No interface example
  2. Interface with Dependency Injection (DI) example

Both examples use fictitious Notepad functionality with File and Twitter capability.

Example #1 does not use an interface, and the lines of code (LOC) count is 49.

Example #2 uses a Writer interface with Parameter Dependency Injection. The Notepad’s dependent objects (e.g., FileManager and TwitterManager) are passed as parameters (aka injected) to the worker method. In this case, the LOC count is 57.

It’s interesting to note that the interface example has slightly more code. The big win is looser coupling, which makes the code much easier to maintain and more testable. I’ll have more about the testability in a future post.


Example 1 – No Interface

using System.IO;
using System;

namespace Interfaces
{
   class Program
   {
      static void Main( string[] args )
      {
         var np = new NotePad();
         np.NotePadMainMethod();
      }
   }

   class NotePad
   {
      private string text = "Hello world";

      public void NotePadMainMethod()
      {
         Console.WriteLine("Notepad interacts with user.");
         Console.WriteLine("Provides text writing surface.");
         Console.WriteLine("User pushes a print button.");
         Console.WriteLine("Notepad responds by asking ");
         Console.WriteLine("FileManager to print file...");
         Console.WriteLine("");

         var fm = new FileManager();
         fm.Print(text);

         var tm = new TwitterManager();
         tm.Tweet(text);
      }
   }

   class FileManager
   {
      public void Print(string text)
      {
         Console.WriteLine("Pretends to backup old version file." );
         Console.WriteLine("Then prints text sent to me." );
         Console.WriteLine("printing {0}" , text );

         var writer = new StreamWriter( @"HelloWorld.txt", true );

         writer.WriteLine( text );
         writer.Close();
      }
   }

   class TwitterManager
   {
      public void Tweet( string text )
      {
         // write to twitter
         Console.WriteLine("TwitterManager: " + text);
      }
   }
}

Example 2 – Writer Interface with Parameter Dependency Injection

using System.IO;
using System;

namespace Interfaces
{
   class Program
   {
      static void Main( string[] args )
      {
         var np = new NotePad();

         var fm = new FileManager();
         var tm = new TwitterManager();

         np.NotePadMainMethod(fm); // parameter injection
         np.NotePadMainMethod(tm); // parameter injection
      }
   }

   class NotePad
   {
      private string text = "Hello world";

      public void NotePadMainMethod(Writer w)
      {
         Console.WriteLine("Notepad interacts with user.");
         Console.WriteLine("Provides text writing surface.");
         Console.WriteLine("User pushes a print button.");
         Console.WriteLine("Notepad responds by asking ");
         Console.WriteLine("FileManager to print file...");
         Console.WriteLine("");

         w.Write(text);
      }
   }

   // Writer Interface
   interface Writer
   {
      void Write(string whatToWrite);
   }

   class FileManager : Writer  // Inherits Writer Interface
   {
      // Implements Write Interface Method
      public void Write(string text)
      {
         // write to a file
         Console.WriteLine("FileManager: " + text);
      }

      public void Print(string text)
      {
         Console.WriteLine("Pretends to backup old version file." );
         Console.WriteLine("Then prints text sent to me." );
         Console.WriteLine("printing {0}" , text );

         var writer = new StreamWriter(@"HelloWorld.txt", true);

         writer.WriteLine(text);
         writer.Close();
      }
   }

   class TwitterManager : Writer  // Inherits Writer Interface
   {
      // Implements Write Interface Method
      public void Write( string text )
      {
         // write to Twitter stream
         Console.WriteLine("TwitterManager: " + text);
      }
   }
}

Cloud Computing is an Overused Buzz-phrase

May 21, 2010

There are many meteorological puns associated with the term “cloud computing”. The term represents a huge paradigm shift in the way backend software services are delivered.

This article covers much of the confusion associated with this ambiguous and overused phrase.

From a software developer perspective, the deployment model and elasticity are the key differentiators for cloud services.

I consider a cloud service to be a system that can host my software and hide the complexity of the server farm (e.g., routers, load balancers, SSL accelerators, etc.).

Amazon popularized the term “Elastic Cloud” when they launched their core cloud component, EC2, back in August 2006. EC2 stands for Elastic Compute Cloud. Elasticity is the infrastructure’s ability to automatically scale up and scale down as needed.

Elasticity is a big deal. It dramatically simplifies the deployment and administration process. It means that software developers don’t need to worry about infrastructure as much and can focus on coding the business process.

I consider Amazon, Google and Microsoft to be the big 3 cloud vendors. They have the elasticity expertise and server farms to support high volume cloud apps.

There are also Oracle, Salesforce.com, Rackspace and others, but IMO they are not generic cloud platforms.

For more about the non-developer cloud computing perspective, this Wikipedia article is a great reference.

2010: The Year and Decade for 4 Screens and a Cloud

January 4, 2010

Rob Enderle’s blog post has it right. 2010 will be the year and start of the cloud decade.

I’d like to take it a step further. The coming wave of ubiquitous ‘democratized’ data services, with eager clients waiting to consume them, will take the internet to a dramatic new level. Microsoft’s three screens and a cloud vision speaks to it, but I believe it’s more about “4 screens with data services”. I consider the data services to be more relevant. The cloud is the engine, but the 24/7 data services it provides will be life changing and business transforming.

Thanks to 3G, pending 4G and whatever comes after, the data services will come from highly reliable mobile data pipes that can be consumed while driving a car, riding a bicycle, at the doctor’s office or exercising at the gym.

The data services are democratized because the data being provided was once only available to a select few. Opening up the data to software developers and entrepreneurs can be a catalyst for positive change. The democratization of data trend is an unstoppable force that has the power to accelerate innovation to help solve some of the world’s problems and improve the quality of life for all.

The US Chief Information Officer, Vivek Kundra, understands the power of democratized data. He spearheaded a new web site for this called Data.gov. Another great example is the City of New York’s recent NYC Big Apps Contest. Microsoft is also getting involved with their new Dallas service.

Regarding the 4 screens, not 3, I expect the data services to be designed to support the following clients.

 

  1. Large Screen – the 10-foot-away living room experience.
  2. Desktop/Tablet Screen – the multi-touch tablet and PC monitor experience.
  3. Small Screen – smartphones (e.g., Blackberry, Palm, iPhone, iPod Touch, Android, Win Phone 7, ZuneHD, etc.).
  4. Car Dashboard Screen – Ford Sync, Fiat Blue&Me, Kia UVO and General Motors OnStar.

Listing the Car Dashboard may be a bit premature, but I expect to see at least 25 million "connected" cars sold during this coming decade. In less than a year, the Microsoft Ford Sync system has already exceeded 1 million in US-only sales. These systems are just starting to go global with Kia’s UVO and Fiat’s Blue&Me systems. I expect “Connected Cars” consuming mission critical data services to become the norm within 5 years.

Examples of the mission critical and revenue generating data services are the real-time location-aware contextual ads or electronic billboards. Some of this is already available in the Ford Sync system. I consider it the first commercially viable Augmented Reality solution. I expect Car Dashboard solutions to eventually provide windshield “heads up display” driving directions that can also show the nearest movie listings, nearest Thai restaurants, closest hospitals, etc.

Aside from the Car Dashboard services, data services will come in many flavors. The more popular services will be the entertainment and news services:

  1. Video – Netflix, Hulu, Youtube, Boxee
  2. Music – iTunes, Pandora, Zune
  3. Games – Xbox Live, SONY Playstation Network
  4. Books – Amazon Kindle, Nook, PDF, Audible
  5. News – NY Times, CNN, MSNBC, ABC, CBS and all of the Radio News Feeds
  6. Sports – ESPN

 

There will be Quality of Life services such as:

  1. Health medical record services – HealthVault
  2. Real-time Traffic – Calculate Quickest Travel Time
  3. Air/Pollen Quality – What will the air/pollen be like on December 31st at 5:30 PM.
  4. Population Growth versus Food Supply – Expected food supply in Somalia over the next 3 years.
  5. Malaria Cases/Birth Rates/Life Expectancies by Region
  6. Violence Levels in Iraq and by Region
  7. Airport Security Wait Times – Security Check Wait Time at Gate #4 in LAX, etc.
  8. Crime Stats by Region
  9. High School Education Quality by Region

 

The list of potential services is endless.

Much of this data is already available but is not in a format that can be easily used or consumed by the 4 screens mentioned.

I’ll leave it to the developers and entrepreneurs to pioneer.

Ten years from now, I am confident that we’ll all be grateful for this new cloud computing/data services era.

 

Microsoft PDC09 Connected Show Podcast #21

December 31, 2009

Wanted to thank Peter Laudati and Dmitry Lyalin for having me on Connected Show Podcast #21 with Sara Chipps recently.

We covered a lot of ground during the podcast and Peter posted all of the relevant links.

There was a lot of great news at PDC09 but I was most fascinated by Microsoft’s support for building Java Apps with Windows Azure.

For the Java Developers out there, here’s the “Building Java Apps with Windows Azure” session.

 
