Posts Tagged ‘MarkLogic’

RSuite, MarkLogic and the DITA Open Toolkit

April 17, 2015


RSuite DITA Transform Demo


The above video has a brief introduction to RSuite and the DITA Open Toolkit.

I had the great pleasure of being a speaker at last week’s MarkLogic World 2015 San Francisco event. I’ll be writing much more about my MarkLogic MEAN Stack presentation later, but for now I’d like to give a shout-out to RSuite.

While at MarkLogic World, I also had the great pleasure of reconnecting with the wonderful RSuite team. I’m a big advocate of the RSuite content management system.

This past summer, I had an amazing opportunity to work at HarperCollins, where I helped deploy their new content management services using RSuite. My interest in RSuite stemmed from my work as a professional services consultant at MarkLogic, which has a loyal base of customers in the media/publishing industry.

Media/publishing customers choose MarkLogic to store their content, which consists of text-based documents and binary assets (photos, audio, video) along with their respective metadata.

MarkLogic lowered its pricing in 2013, which made it more affordable to store binary assets in MarkLogic. Keeping the binary assets together with the text-based content also greatly simplifies the infrastructure and management overhead.

The RSuite CMS has the following features:

  1. Workflow
  2. DITA Transforms – provides “multi-channel” output.
  3. Role Based Security
  4. Distribution

The RSuite secret sauce is the DITA Open Toolkit. The other key component is MarkLogic.

The workflow engine is provided by jBPM which uses MySQL to store the workflow configurations and drives the finite state machine.

The DITA Open Toolkit provides the “multi-channel output” capability needed by most publishers: the ability to render content to many formats, such as PDF, ePub, XHTML, Adobe InDesign, and Word (docx).

DITA stands for Darwin Information Typing Architecture. It is an XML data model for authoring and publishing.

Eliot Kimber is a leading figure in the DITA community.


Publishers should avoid using XHTML as the storage format for book content for many reasons. The industry-standard formats are DITA and DocBook, because they provide a higher level of abstraction that makes “multi-channel output” much easier. These standards are also more flexible when providing custom publishing services.

The DITA format is especially interesting because of the specialization feature that makes XML structures polymorphic.

Some key points about DITA:

  1. Topic Oriented
  2. Each Topic is a separate XML file
  3. DocBook is Book Oriented
  4. DITA Initial Spec in 2001
  5. DocBook Initial Spec in 1991
  6. Core DITA Topic Types are Concept, Task, and Reference
  7. Specialization: subtyping, where new topic types are derived from existing ones.
  8. The “Darwin” in the name refers to the evolution path that polymorphic specialization provides.
  9. A DITA Map XML document is used to stitch the Topic XML documents together.
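To make the map/topic relationship concrete, here is a minimal, hypothetical DITA map that stitches topic files together (the file names and title are made up for illustration):

```xml
<!-- Hypothetical example: one map referencing standalone topic files -->
<map title="Sample Guide">
  <topicref href="overview.dita" type="concept">
    <!-- Topics can nest; each href points to a separate topic XML file -->
    <topicref href="install-server.dita" type="task"/>
    <topicref href="api-reference.dita" type="reference"/>
  </topicref>
</map>
```

The DITA Open Toolkit walks a map like this to assemble the individual topics into each output format.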


I had an opportunity to chat with Norm Walsh about this at the MarkLogic World event. Norm is the author of DocBook: The Definitive Guide. He’s also an active member of a few of the XML/JSON standards committees.

DITA is a competing standard to DocBook. Norm wrote this interesting blog post about DITA back in October 2005.

My question is: which one has better support for semantic annotations? Most content these days is semantically enriched using multiple ontologies, which makes it possible for SPARQL queries to provide Dynamic Semantic Publishing services.

Norm was quick to tell me that DocBook 5.1 now supports RDFa. I’ll definitely investigate.

For DITA, there’s an interesting DITA to RDF transform in the works here.

Bob DuCharme, author of Learning SPARQL, has this nice blog post on using RDFa with DITA and DocBook.

In addition to the screencast above, the following screencasts will take a deeper dive into the RSuite software and DITA Open Toolkit.

Please take a look. Hopefully, the screencast will shed some more light on the need to store content using a higher level of abstraction (DITA or DocBook).

The screencasts will also show the value of RSuite as a full blown Content Management and Digital Asset Management (DAM) Solution.


DITA Open Toolkit Demo




RSuite Architecture and Code



What is the MarkLogic Document Discovery App?

August 29, 2013


MarkLogic Document Discovery


One of my favorite MarkLogic applications posted on GitHub is Document Discovery.

It’s a favorite because it provides a Quick Start tutorial for developers looking to learn about MarkLogic’s Content Processing Framework (CPF) and the new binary document support.

Developers will also learn about Field Value Query and “How to annotate and add custom ratings?”.

You can access the source code and get the installation instructions in the readme section on the GitHub project page.

However, this blog post will provide more detail, because additional changes were required to get it working properly on my MarkLogic 6 server.

You can see it in action here. =>

Be sure to watch the embedded screencast at the top of the page too.


As noted in the readme, this is a simple application that was built using a customized Application Builder app. Unfortunately, it uses the older MarkLogic 5 Application Builder code. The latest Application Builder that comes with MarkLogic 6 (ML6) has undergone a significant architectural change. The ML6 version generates code that is much more declarative as it utilizes XSLT with the ML6 REST API.

A good write up on how to customize the ML6 App Builder Code is posted here.

I’ll have a newer version of the Document Discovery app that is built with the ML6 App Builder in a future blog post. For now, this version is better as a tutorial because it makes it easier to follow the code for:

  1. CPF
  2. Binary File Metadata
  3. Field Value Query
  4. How to annotate and add custom ratings?

Content Processing Framework (CPF)

I like to refer to the MarkLogic CPF as a Finite State Machine (FSM) for documents. In this case, the document’s state will change or transition from one state to another via a triggering event or condition.

A typical FSM is defined by a list of its states, and the triggering condition for each transition.

For MarkLogic CPF this is done with Pipelines. A pipeline is a configuration file that contains the triggering conditions with respective actions for each state transition.
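Outside of MarkLogic, the pipeline idea can be sketched as a tiny finite state machine. Here’s a rough Python sketch (the state and event names are illustrative, not the actual CPF state URIs):

```python
# Sketch: (state, event) -> (next state, action), loosely mirroring how a
# CPF pipeline transitions a document and runs an action on each transition.
TRANSITIONS = {
    ("initial", "on-insert"): ("scanned", "extract metadata"),
    ("scanned", "on-enrich"): ("enriched", "add annotations"),
}

def fire(state, event):
    """Return (next_state, action); unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), (state, None))

state, action = fire("initial", "on-insert")
# The document is now in the "scanned" state and the action would run.
```

A CPF pipeline file plays the role of the TRANSITIONS table: it declares the states, the triggering conditions, and the action module to execute on each transition.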

This application will use CPF for entity enrichment. In this case the entity will be an arbitrary binary file.

I find that entity enrichment is the most common use case for CPF.

Of course, the online Content Processing Framework Guide is the ultimate source to help you become fluent in CPF.

In the meantime, this post will provide the quick-start steps needed to get the Document Discovery app configured to use a CPF pipeline.

In this case, the CPF pipeline will execute a document “scan” on an arbitrary binary document whenever it is inserted into the database.

The triggering event is the xdmp:document-insert() call, which puts the document into the “initial” state.

Binary File Support

The ability to ingest a wide range of binary files was added to MarkLogic in version 5. This feature is referred to as the ISYS Document Filter.

The ISYS (Document) Filter was provided by a company called ISYS Search Software, Inc. The company has since changed its name to Perceptive Search, Inc.

The new ISYS filter extracts metadata from a wide range of binary documents (Word, PowerPoint, JPG, GIF, Excel, etc.). The supported binary file types are listed here.

This metadata extraction capability is exposed through an API called xdmp:document-filter().

More details are provided in Chapter 16 of the Search Developer’s Guide.

Here’s an example of the API usage with response.

Code snippet:

xdmp:document-filter(fn:doc("/content/451 Report.docx"))


API Response:


Please note the attributes used in the meta elements shown above.

The CPF “initial” pipeline code will modify the above meta elements to be as follows.



The 3 elements in the blue circle were added later when the user adds a comment and/or rating to the document.


Field Value Query

In this app, the Author facet was implemented as a Field Value Query. A field value is used because the search facet also needs to include the <Typist> element.


The following XML shows 3 elements that are valid Authors (Last_Author, Author, and Typist).



Adding a Field Value Index to the database provides an ability to unify the values during a search. It also provides a weighting mechanism where the <Last_Author> element can be boosted to have a higher relevance than the <Typist> element.
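Conceptually, the field acts as one virtual element spanning several real ones, each with its own weight. A rough Python sketch of the scoring idea (the weight values are made up for illustration; MarkLogic computes relevance internally):

```python
# Illustrative field definition: three source elements, with <Last_Author>
# boosted above <Author> and <Typist>. Weights are assumptions for this sketch.
FIELD_WEIGHTS = {"Last_Author": 2.0, "Author": 1.0, "Typist": 0.5}

def field_matches(doc, value):
    """Sum the weighted contributions of `value` across the field's elements."""
    score = 0.0
    for element, text in doc:  # doc: list of (element name, text) pairs
        if element in FIELD_WEIGHTS and text == value:
            score += FIELD_WEIGHTS[element]
    return score

doc = [("Last_Author", "Jane Doe"), ("Typist", "Jane Doe"), ("Author", "John Roe")]
# "Jane Doe" matches via two elements (2.0 + 0.5); "John Roe" via one (1.0).
```

The point is that one query over the unified field finds a value no matter which source element holds it, while the weights keep <Last_Author> hits more relevant than <Typist> hits.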


Here’s a snapshot of the admin page used to configure it.



For more detail, see Understanding Field Word and Value Query Constructors in Chapter 4 of the Search Developer’s Guide.


How to annotate and add custom ratings?

JavaScript for the annotation and custom ratings is in a file called /src/custom/appjs.js.


Here’s the source code directory structure.




The JavaScript code will display the User Comments dialog box to capture the Commenter’s Name and Comment as shown.



The code that provides the web form post action is in /src/custom/appfuncations.xqy


The post action sends the request to a file called /insertComment.xqy, which then does either an xdmp:node-insert() for new comments or an xdmp:node-replace() for updates.


A similar approach is done for the user ratings capability. Here’s a snapshot of the UI.





The source code that is currently posted on GitHub did not work initially.

I made the following changes to make it work.

  1. Config File – modify the Author field value index
  2. Config File – make compatible with MarkLogic 6
  3. Config File – include a dedicated modules and triggers DB.
  4. Pipeline – modify pipeline xml to use code in the /cpf/ingest directory.
  5. Deployed the source code to a modules database instead of the file system.
  6. Added additional logging to verify pipeline code process.

The source code is posted here. =>

Here are the steps that I used to get the application running properly.

Installation Steps:

  1. Import the configuration package that’s located in the /config directory. This package will create and configure the 3 databases (DocumentDiscovery, doument-discovery-modules, and Document-Discovery-Triggers).
  2. Install content processing on the database. See section 4.5 of the CPF Guide. =>
  3. Install the pipeline that is in the /ingest directory.

    This is done by copying the 2 files in the /ingest directory to the /Modules/MarkLogic/cpf/ingest directory. You may need to create this directory directly under the modules directory of your MarkLogic installation (see next image). Once copied, then follow the normal pipeline loading process.

  4. Ensure that only the following pipelines are enabled for the default domain in the DocumentDiscovery database (DO NOT ENABLE DOCUMENT CONVERSION).

    Install: Document Filtering (XHTML), Status change handling, and the custom Meta Pipeline. You’ll need to load the Meta Pipeline configuration file.

  5. Deploy Source Code – be sure to modify the http server to use the proper modules database.
  6. Load data using the admin console, Information Studio or qconsole.
  7. Observe ability to search and view the binary documents!


Image: Shows CPF Pipeline Code located in the MarkLogic installation directory.



Image: Document Discovery App showing Search, Facets, and User Star Ratings




Hopefully, this post will help you get started using MarkLogic for Document Discovery.

The exciting news is the soon to be released MarkLogic 7 (ML7).

ML7 will have new semantic technology features that will take this simple document repository to a whole new level.

This is because ML7 will have the ability to add richer metadata to each document, in the form of triples.

A triple is a way to model a fact as a "Subject, Predicate, Object" statement. Many triples can be added to a document as sets of facts. These facts can then be incorporated into queries that make inferences about the “subjects”.


  1. “Jane Doe” : “graduatedFrom” : “Columbia University”
  2. “Jane Doe” : “graduationYear” : “2001”
  3. “Jane Doe” : “AuthorOf” : “This document: ISBN-125”
  4. “Sports Medicine” : “MainTopicOf” : “This document: ISBN-125”
  5. “Henry James” : “AuthorOf” : “This document: ISBN-125”
  6. “This document: ISBN-125” : “refersTo” : “document ISBN-545”
  7. “Sandra Day” : “graduatedFrom” : “Columbia University”
  8. “Sandra Day” : “graduationYear” : “2001”
  9. “Sandra Day” : “AuthorOf” : “document ISBN-545”


From these facts, the following inferences can be made:

  1. Henry James and Jane Doe co-authored document ISBN-125.
  2. Sandra Day and Jane Doe were college classmates.


In this example, the facts (triples) are used for knowledge discovery.


In this case, the knowledge discovery, or inference, can be accomplished with minimal coding effort using simple data structures and a rich query API.
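As a sketch of how little code such inferences need, here are the example facts modeled as (subject, predicate, object) tuples in Python (the “This document: ISBN-125” strings are shortened to plain ISBN identifiers for readability):

```python
# The example facts as (subject, predicate, object) triples.
TRIPLES = [
    ("Jane Doe", "graduatedFrom", "Columbia University"),
    ("Jane Doe", "graduationYear", "2001"),
    ("Jane Doe", "AuthorOf", "ISBN-125"),
    ("Henry James", "AuthorOf", "ISBN-125"),
    ("Sandra Day", "graduatedFrom", "Columbia University"),
    ("Sandra Day", "graduationYear", "2001"),
    ("Sandra Day", "AuthorOf", "ISBN-545"),
]

def objects(subject, predicate):
    """All objects asserted for a given subject/predicate pair."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}

def co_authors(doc):
    """Everyone who authored the same document."""
    return {s for s, p, o in TRIPLES if p == "AuthorOf" and o == doc}

def classmates(a, b):
    """Same school and same graduation year."""
    same_school = bool(objects(a, "graduatedFrom") & objects(b, "graduatedFrom"))
    same_year = bool(objects(a, "graduationYear") & objects(b, "graduationYear"))
    return same_school and same_year

# co_authors("ISBN-125") finds both Jane Doe and Henry James;
# classmates("Jane Doe", "Sandra Day") holds via Columbia University and 2001.
```

In MarkLogic, the same inferences would be expressed as SPARQL queries over the stored triples rather than hand-written set operations.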


So please stay tuned!

How to use MarkLogic with Open Government?

May 10, 2013


How to use MarkLogic with US Congress Bill Data?


I’m a big advocate for open government data projects. One of the more interesting efforts is the US Congress legislation data that is published every night on GitHub.

Here’s the link. =>

For more about GitHub, please see this Clay Shirky TED talk.

Clay Shirky would like to use GitHub as a tool to manage the legislation process. He calls it a new “form of arguing”. It’s a great idea but GitHub is not an easy tool for government legislators yet. It could use some ease-of-use tweaks for folks not familiar with software development.

In any case, Clay mentioned the posting of US Congress Bill data that occurs nightly. Posting the data is a great example of government transparency.  This data can be used to measure congressional productivity and maybe even accountability.

This data makes it possible to provide a real-time (well, nightly) digital Report Card. It can be used to measure the perceptions of Congressional Gridlock that are often reported in the news.

This blog post will show how to do this, but before I start, here are some notes about the US Congress that will help when we look at the data.

Notes on the US Congress and the Legislative Process:

  1. Consists of 2 houses: Senate and House of Representatives.
  2. There are 535 voting members: 435 in the House of Representatives and 100 in the Senate.
  3. Each Congressional Session is 2 years.
  4. Current Session is the 113th Congress which runs from January 3, 2013 to January 3, 2015.
  5. A key objective is to create legislation that improves and advances the entire country.
  6. Any member of House or Senate can introduce legislation for debate. It is done by proposing a bill.
  7. Congressional bill is introduced and championed by a sponsor.

  8. The bill is then sent to the appropriate congressional committee, which decides if the House or Senate will vote on the bill.

  9. Senate has 17 committees and 70 subcommittees.

  10. House has 23 committees and 104 subcommittees.

  11. Each committee focuses on a specific policy area.

  12. Bill is passed when a majority vote is received in both the House and Senate.

  13. President must sign the bill to pass it into law or veto it.

  14. Congressional Productivity can be measured by the number of bills that get signed by the president.


The Huffington Post news outlet published an article about Congressional Productivity of the 112th Congress at the end of 2012. The article noted that the 111th Congress had 383 bills signed into law by the president. The 112th congress had only 283 bills signed by the president.

One could conclude that the 112th Congress was less productive than the 111th Congress but there are other factors to consider.

Here’s the Huffington Post article.



Is it possible to use the nightly data to verify or track Congressional Productivity on a daily basis?

How difficult would it be to show data visualizations for each congress?

The answer to both questions is a definite yes, but how?

Here’s a simple data set that I’d like to visualize and build from.


Congress   Years       Bills Proposed   Bills Signed   Bill Sign Rate
110        2007-2008   14,042           460            3.28%
111        2009-2010   13,675           383            2.80%
112        2011-2012   12,299           283            2.30%
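The Bill Sign Rate column is just signed bills divided by proposed bills. A quick Python check of the numbers above:

```python
# (bills proposed, bills signed) per Congress, taken from the table above.
congresses = {
    110: (14042, 460),
    111: (13675, 383),
    112: (12299, 283),
}

def sign_rate(proposed, signed):
    """Percentage of proposed bills signed into law, rounded to 2 places."""
    return round(100.0 * signed / proposed, 2)

rates = {c: sign_rate(p, s) for c, (p, s) in congresses.items()}
# Matches the table: 110 -> 3.28, 111 -> 2.8, 112 -> 2.3
```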


MarkLogic Ingestion and Enrichment

This is an easy problem for MarkLogic. MarkLogic is typically used to ingest, enrich, search and discover.

Once the data is in MarkLogic, the following questions can be easily answered.

  1. How many bills signed by president by year?
  2. Who are the most productive congress members by year?
  3. What is the bill proposed to bill signed ratio by year?

I’ve created a MarkLogic database and ingested the US Congress Bill data from the previous Congressional Sessions. The data ranges from the 93rd Congress (1973 to 1975) to the 112th Congress (2011 to 2012).

For the current session, 113th Congress, I created a scheduled task that runs every night at 11:00 PM EST (8:00 PM PST). The scheduled task ingests the data and enriches it with some simple presidential data.

I’ve also used the MarkLogic App Builder tool to create a quick web application that shows the bills signed by year. It can also be used to search and discover.

The App Builder and ingestion code is posted here. => Source Code

The web application is here. =>

Please give it a try.

Please be sure to try the clickable pie chart and bar chart widgets.

To see the most productive Congress Members by Year, select the “public” value in the Enacted Type facet in the top left corner. Once selected, you will see the results by year.




The Top 30 Most Productive Congress Members are listed in the Sponsor Facet (see image). This sponsor list shows the most productive members over the past 32 years in descending order.

The number in parenthesis is the number of bills that they sponsored that were signed by the President of the United States.



I use the Sponsor classifier because a Sponsor is a senator or house representative who introduces a bill or amendment and is the bill’s chief advocate. Sponsoring a bill requires significant effort, so it can serve as a good metric for a congress member’s productivity.
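The sponsor metric boils down to counting signed bills per sponsor. A small Python sketch with made-up records (the real app derives these counts from the ingested bill documents):

```python
from collections import Counter

# Hypothetical bill records: (sponsor, was_signed_by_president)
bills = [
    ("Rep. A", True), ("Rep. A", True), ("Rep. B", True),
    ("Rep. A", False), ("Rep. C", False),
]

# Count only bills that were actually signed into law, like the Sponsor facet.
signed_by_sponsor = Counter(sponsor for sponsor, signed in bills if signed)

top = signed_by_sponsor.most_common()
# Most productive sponsors first: Rep. A (2), then Rep. B (1)
```

In MarkLogic, the equivalent counting is what a range-index facet on the sponsor element gives you for free.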

You can use the web app to drill down further to discover bills proposed or productive members by year, by president, by cosponsor, by subject, etc.

The remaining blog post and screencast will focus on the following topics.

  1. How to programmatically get the data zip file?
  2. How to ingest json documents?
  3. How to set up an automated daily ingestion task?
  4. Is ACID Compliance needed for this One-Way System?
  5. How to use the SQL API?
  6. How to connect and discover with Tableau and Excel?
  7. Code is posted.


1. How to programmatically get the data zip file?

The specific ingestion code is posted in this file. => ingest-bill-data.xqy

The URL to retrieve the nightly congress data points to a zip file.

The following code snippet does the following:

  1. http get request to retrieve the zip file.
  2. Iterate the zip manifest to get file names.
  3. For each file name, retrieve the file.
  4. For each file, convert json to an xml document.
  5. For each xml document, transform and enrich.
  6. Save newly transformed/enriched document.


Here’s the code snippet.

declare variable $get-options :=
  <options xmlns="xdmp:http">
    <format xmlns="xdmp:document-get">binary</format>
  </options>;

declare variable $BASEURL := "...";  (: nightly data URL, elided in the original post :)

let $url  := fn:concat($BASEURL, "")
let $zip  := xdmp:http-get($url, $get-options)[2]

let $docs :=
  for $uri in xdmp:zip-manifest($zip)//zip:part/text()
  let $jdoc := xdmp:zip-get($zip, $uri)
  let $xdoc := json:transform-from-json($jdoc)
  let $doc  := local:load-house-bill-doc($xdoc)
  order by $uri
  return $doc

return fn:count($docs)

2. How to ingest json documents?

The data that is posted on the GitHub site is in a json format. A good rule of thumb on json versus xml is to use json as a wire protocol for the “last mile” and then use XML for the data store.

The key advantage of XML over json is namespaces and schemas. They provide capabilities that ultimately make developers much more productive than a simple json data store would. But this is a topic for another day.

Ingesting json data into MarkLogic is easily done using XQuery. XQuery is my preferred ETL (extract, transform and load) tool. I use it to convert json, transform (and enrich) the data into more efficient structures and then store as XML.

The ingestion code is posted here.  => ingest-bill-data.xqy

In a future post, I’ll talk about adding a triple which is the best way to capture relationships with other congressional bills.

The json format of the US Congress Bill data is well documented on the GitHub Wiki page.


Here’s an example of the json structure used to hold subjects and summary data.

  "subjects": [
    "Administrative law and regulatory procedures",
    "Adoption and foster care"
  ],
  "summary": {
    "as": "Public Law",
    "date": "2010-03-23",
    "text": "Patient Protection and Affordable Care Act"
  }

The above json is transformed and stored in XML as follows.

<subjects>
  <subject>Administrative law and regulatory procedures</subject>
  <subject>Adoption and foster care</subject>
</subjects>
<summary>
  <summary-as>Public Law</summary-as>
  <summary-date>2010-03-23</summary-date>
  <summary-text>Patient Protection and Affordable Care Act</summary-text>
</summary>

Here’s the code snippet used to create the above xml nodes.

element {fn:QName($NS,"subjects")} {
  for $item at $n in $node/*:subjects/node()
  return element {fn:QName($NS,"subject")} {$item/text()}
},
element {fn:QName($NS,"summary")} {
  element {fn:QName($NS,"summary-as")}   {$node/*:summary/*:as/text()},
  element {fn:QName($NS,"summary-date")} {$node/*:summary/*:date/text()},
  element {fn:QName($NS,"summary-text")} {$node/*:summary/*:text/text()}
}

For more detail, see line 406 in file ingest-bill-data.xqy.


3. How to set up an automated daily ingestion task?

The Most Productive Congress Member web application was created by MarkLogic’s App Builder tool.

The code was automatically generated and deployed.

The App Builder tool deploys code using the following directory structure.



Please take note of the custom directory. As the name implies, any custom code can be safely deployed there; the App Builder tool will not clobber any code that resides in this directory.

To create the nightly ingestion task, the ingestion code (ingest-bill-data.xqy) was copied to the /custom/schedule directory as shown above.

The next step is to add the scheduled task using the MarkLogic Admin tool.


The code is set to run daily at 8:00 PM PST (11:00 PM EST).


Once the task is set, the web application is ready for use.

As before, please try it. =>

Be sure to select the Enacted Type “public” facet which will filter out the bills proposed versus the bills passed. You can also try search. Be sure to drill down to the underlying XML document (see the 3rd red circle in the image). This link will take you to an html table view of the data. There is also a raw XML view (see images).




Underlying XML Document HTML Table View.



Underlying XML Document XML View.



4. Is ACID Compliance needed for this One-Way system?

There’s a big debate in the NoSQL community about database consistency.

Data consistency refers to how a database is able to handle updates when failures occur. There are two general approaches, ACID and BASE.

  1. ACID systems provide consistency at the expense of availability.
  2. BASE systems provide availability at the expense of consistency.

I mention it because this application is mostly queries (read-only) and does not require rigorous ACID transactions.

ACID is not needed because the data flows mostly one way. However, if this app were to gain user-specific features (e.g., user profiles, saved searches, bookmarks, workspaces), those features would turn it into a mission-critical two-way system requiring high consistency and durability.

But this is a topic for another day too.

5. How to use the SQL API?

MarkLogic version 6 added a new SQL API and ODBC service that is very useful for analytics.

I’ll quickly walk through the steps needed to set up a MarkLogic SQL View using the App Builder tool.


  1. Select the database and then press the configure button (see image).
  2. Scroll to the bottom of the page and press the “Add New” button. The image already shows the view that was created (us_congress_view).
  3. Enter name, schema, localname (root node) and namespace.
  4. Select the desired range indexes. Try to avoid the Cartesian product. In this example, I intentionally left out subject, cosponsor and committee because there are many subjects, cosponsors, and committees associated with a single bill. If these items were included, then the number of records would jump to 7.5 million instead of 276,000.
  5. Press the update view button to commit the changes.
  6. Test using qconsole and the new SQL API (see following code).


xdmp:sql(
  "select uri, bill_type, congress, year, president_name,
          sponsor, enacted_type, enacted_congress,
          enacted_number, status, status_at, subjects_top_term
   from us_congress_view limit 100", "format")


Image 1.


Image 2.



Image 3.


Image 4.



6. How to connect and discover with Tableau and Excel?

Once the SQL View has been created, the next step is to connect the data to Tableau and Excel.

This requires an ODBC service. Creating the ODBC service is very similar to creating an XDBC service.

Go to the Admin Console > App Servers and then press the Create ODBC tab.


Create the service and make sure that the desired database is selected.



The next step is to create the ODBC System DSN using the 32 Bit ODBC Data Source Administrator.


  1. Launch the 32 Bit ODBC Data Source Administrator. For Windows, the command is: C:\Windows\SysWOW64\odbcad32.exe
  2. Select the System DSN tab of the ODBC Data Source Admin tool.
  3. Press the Add Button.
  4. Select the MarkLogic SQL (X86) item.
  5. Enter the configuration settings.
  6. Press the Test button to verify.
  7. Good to Go.

Image 1.


Image 2.



Steps to connect to Excel:

  1. Launch Microsoft Excel. I recommend using Excel 2010.
  2. Select the Data > From Other Sources > From Microsoft Query
  3. Observe “Choose Data Source” dialog box and then select MarkLogicSQL.
  4. Observe “Query Wizard” and choose the desired view.
  5. Choose the desired columns.
  6. Set the filter (if desired).
  7. Set the desired sort.
  8. Choose “Return Data to Microsoft Excel” and press ok.
  9. Wait ~2 minutes for the spreadsheet to respond.
  10. Observe results in the spreadsheet (see image 6).

Image 1.


Image 2.


Image 3.


Image 4.


Image 5.


Image 6.



Connecting to Tableau is very similar to Excel. I’ll cover more detail in the screencast but here’s the steps.

Steps to connect to Tableau:

  1. Launch Tableau.
  2. Press the Connect to Data link on the home page.
  3. Select the “Other Databases (ODBC)” option on the lower left.
  4. Observe the “Generic ODBC Connection” dialog box.
  5. Select the MarkLogicSQL DSN.
  6. Press the Connect button and observe the connection attributes appear.
  7. Select the “main” schema (or equivalent).
  8. Press the table search icon in the lower right and observe the option to select the table.
  9. Select the table and observe the connection name.
  10. Press the OK button.
  11. Wait ~2 minutes for Tableau to respond.
  12. Choose the “live data” option.
  13. Observe dimensions and measure in the left side.
  14. Drag “Number of Records” to the Rows shelf.
  15. Drag the congress and president_name dimensions to the Columns shelf.
  16. Choose the bar chart to see a visualization.

Image 1.


Image 2.


Image 3.


Image 4.



That should be enough to get started using MarkLogic and Tableau. The MarkLogic-Tableau combination eliminates the need for expensive data warehouses.

More importantly, MarkLogic-Tableau provides much-needed data discovery: the ability to surface the data in easier and more meaningful ways.

I believe this discovery capability is especially useful for the US Congress Data and the many Open Government Data projects.

Anyway, that is all for now.

Please be sure to watch the screencast where I can provide some more detail.

How to add Authentication to a MarkLogic App?

February 4, 2013


Learn how to add authentication to a MarkLogic Roxy App.

In this blog post, I will show the code needed to add a simple authentication service.

In the previous post, I created a two-column layout where the top of the home page had a login form. At the time, the login form was not wired up.

For this post, I will wire up the login form. To do this, I’ll show how to build a simple authentication service that searches a user database, verifies the password, and generates an authentication token that will expire after 5 minutes.

This demo application will also show how to provide a simple RESTful API for search. This Search API will utilize the authentication token to restrict access to the search service.

The app will not show any role based restricted views. A more fully featured role based access control will be shown in a future post.

This demo will use the boing-roxy demo that is posted here.

A zip file containing the source code for this demo application is posted here.  => source code

Overall Approach

The solution will use the following items to authenticate a user and create an application token that gives the user access to the RESTful API for a 5 minute period.

  1. Registration Form – used to create the user profile documents. This form should only be visible to an admin user but is currently visible to all for demo purposes.
  2. User Directory – Each user will have a dedicated user directory in the MarkLogic database.
  3. User Profile Document – User profile document (/users/janedoe/profile.xml) will reside in the user’s directory. It will contain the username/password. The username must be unique. It can also be used to store the user’s role and organization information. This demo will not utilize a party management solution but it can be extended to do so.
  4. Session Document – A session document will be created when a user successfully logs in. The Session Document will be stored in the User Directory.
  5. Authentication Token – the token will be stored in the session document. Each RESTful API request must include the token in its header.
  6. ROXY Router – Will be the checkpoint or the single point of entry for each request. This is where the user token is verified for each RESTful API request. A key function called Find-Session-by-Token() verifies the session and dispatches the request if valid.
  7. Login – If the username and password is valid, the token is created. If a token already exists and has not yet expired then the same token will be used.
  8. Session Expiration – Sessions expire 5 minutes after the initial login. The 5-minute duration is for demo purposes; a typical session expiration is 24 hours. Session expiration times are UTC (Coordinated Universal Time) based.
  9. Logout – Terminates the session by deleting the session document that contains the token.
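The steps above can be sketched in a few lines of Python (the token format and helper names are assumptions for this sketch, not the actual Roxy code; the 5-minute TTL matches the demo):

```python
import hashlib
import secrets
from datetime import datetime, timedelta, timezone

SESSION_TTL = timedelta(minutes=5)  # demo value; 24 hours is more typical

def create_session(username):
    """Create a session token and its UTC-based expiration time.

    The hex-token format here is an assumption made for illustration.
    """
    raw = secrets.token_bytes(16) + username.encode()
    token = hashlib.md5(raw).hexdigest()
    expires_at = datetime.now(timezone.utc) + SESSION_TTL
    return token, expires_at

def session_valid(expires_at):
    """A session is valid until its UTC expiration time passes."""
    return datetime.now(timezone.utc) < expires_at

token, expires_at = create_session("janedoe")
# session_valid(expires_at) is True immediately after login.
```

In the real app, the token and expiration live in the session document under the user directory, and the router checks them on every RESTful request.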


Related Notes:

  1. Passwords are never part of the RESTful API transport, except for the Login API request.
  2. “Remember Me” cookie – This solution can support a “Remember Me” cookie, where the token (not the password) is stored in the cookie. Remember Me cookies typically expire after 90 days, which is longer than the token expiration period.
  3. Verify API – A good approach for refreshing the token stored in a “90 day login cookie” is described here. A Verify API is typically used to verify the username and token; if they match and the existing token has expired, a new token is generated. The 90-day cookie web app calls the Verify API to refresh the token stored in the cookie.
  4. Passwords are currently stored in the User Profile document, but only as MD5 hashes.
  5. The current solution shows how to use the MarkLogic Search API with the user profile document, session document and token to provide an adequate security solution.
  6. OAuth2 – OAuth 2.0 (Open Authorization) is a widely used protocol that provides a federated user profile solution. Its key benefit for this example is that user passwords would not need to be stored in MarkLogic at all. However, that is a topic for a future post. The OAuth2 developer details are here.


1. Registration Form

The registration form creates the user profile data.

You can access it here.

Here’s a snapshot view.



2. User Directory

The registration form above creates the user profile data that is stored in a User Profile Document in the respective user directory. The Session Document is also stored in the User Directory.

Example User Directory URIs

/users/grusso/session/2ccda41cd|2/4/2013 8:28:41 PM.xml


3. User Profile Document

Here’s the format of the simple user profile document. Please note that the password is stored as an MD5 hash.


Roxy code that generates the user profile document is:

  1. user controller – /apps/controllers/user.xqy
  2. user model – /apps/models/user-model.xqy


4. Session Document:

Here’s the format of the session document.

<session user-sid="2cc3bdfc38bf03c63a4de6bda41cd91b|2/4/2013 8:28:41 PM"/>

Please note that the @user-sid attribute in the above XML is the authentication token.

Session Document URI:

/users/grusso/session/2cc3bdfc38bf03c63a4de6bda41cd91b|2/4/2013 8:28:41 PM.xml

The above document URI has the expiration date/time appended to it. Some JavaScript client code will use the appended expiration date/time to trigger a token refresh.

Roxy code that creates and deletes the session document is:

  1. web login controller – /apps/controllers/appbuilder.xqy
  2. rest login controller – /apps/controllers/login.xqy
  3. rest logout controller – /apps/controllers/logout.xqy 
  4. authentication model – /apps/models/authentication.xqy


5. Authentication Token:

In the above example, the authentication token is the following string:

2cc3bdfc38bf03c63a4de6bda41cd91b|2/4/2013 8:28:41 PM

This string must be included in the request header of each RESTful API request; otherwise the response is a “401 Unauthorized” error. The token is sent in the X-Auth-Token header as follows.

X-Auth-Token: 2cc3bdfc38bf03c63a4de6bda41cd91b|2/4/2013 8:28:41 PM

The source code that extracts the X-Auth-Token value is in the router code. See line 87 of  /src/app/lib/router.xqy.

let $token := xdmp:get-request-header("X-Auth-Token")

If using Firefox Poster tool, the header can be added as shown.
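In application code, attaching the token looks roughly like this. The endpoint URL below is illustrative; only the X-Auth-Token header name comes from the demo:

```javascript
// Build a request descriptor carrying the authentication token.
function buildAuthedRequest(url, token) {
  return {
    url: url,
    method: 'GET',
    headers: { 'X-Auth-Token': token } // omit this and the server replies 401
  };
}

const req = buildAuthedRequest(
  '/some/restful/endpoint',
  '2cc3bdfc38bf03c63a4de6bda41cd91b|2/4/2013 8:28:41 PM'
);
```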




6. ROXY Router:

As discussed above, the router is the checkpoint for all HTTP requests. It is the ideal place to apply security policy logic such as:

  1. Token check
  2. Requests per minute
  3. Maximum Requests per day

This post only handles the token check, but the code can be extended to support the other security policies.

The following XQuery code handles the token check. Please note that certain requests (e.g., login, ping) bypass the token check.


let $valid-request :=
  if (fn:not($config:SESSION-AUTHENTICATE)) then fn:true()
  else if (xs:string($controller) = ("ping")) then fn:true()
  else if (xs:string($controller) = ("login")) then fn:true()
  else if (xs:string($controller) = ("logout")) then fn:true()
  else if (xs:string($controller) = ("verify")) then fn:true()
  else
    let $token := xdmp:get-request-header("X-Auth-Token")
    return
      if ($token) then
        let $valid-session := auth:findSessionByToken($token)
        return fn:exists($valid-session)
      else fn:false()
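The other policies listed above (requests per minute, maximum requests per day) could be layered into the same checkpoint. Here is a minimal fixed-window sketch in JavaScript; the limit and the in-memory Map are assumptions, and a real deployment would persist the counters in MarkLogic so they survive across requests:

```javascript
const MAX_PER_MINUTE = 60;  // illustrative limit
const counters = new Map(); // token -> { minute, count }

// Allow a request only while the token's count for the current
// one-minute window is at or below the limit.
function allowRequest(token, nowMs = Date.now()) {
  const minute = Math.floor(nowMs / 60000);
  const entry = counters.get(token);
  if (!entry || entry.minute !== minute) {
    counters.set(token, { minute: minute, count: 1 }); // new window
    return true;
  }
  entry.count += 1;
  return entry.count <= MAX_PER_MINUTE;
}
```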

7. Login Code:

The login code does the following:

  1. Find user profile document – Searches the user profile documents by username.
  2. Check password – If a document with the username is found, the password is checked.
  3. Find session document by username – If the password matches, the code looks for a session document with its respective expiration date.
  4. Session Document – If the session has not expired, the current session document is reused. If it has expired, it is deleted and a new session document containing a new authentication token is created.
  5. Return Authentication Token

This code resides in the following files:

  1. rest login controller – /apps/controllers/login.xqy
  2. authentication model – /apps/models/authentication.xqy

The URI to request a Login is:

The username and password are bundled into the request using the Authorization header.

So the request header will need this:

Authorization: Basic Z3J1c3NvOnBhc3N3b3Jk

The string after the word Basic is the base64-encoded username and password. Note that base64 is an encoding, not encryption, so the login request should always be sent over HTTPS.

The code that extracts the username and password is in the login controller (/src/app/controllers/login.xqy).

declare function c:main() as item()*
{
  let $userPwd  := xdmp:base64-decode(
      fn:tokenize(xdmp:get-request-header("Authorization"), "Basic ")[2])
  let $username := fn:tokenize($userPwd, ":")[1]
  let $password := fn:tokenize($userPwd, ":")[2]

  let $result   := auth:login($username, $password)
  return (
    ch:add-value("res-code", xs:int($result/json:responseCode)),
    ch:add-value("res-message", xs:string($result/json:message)),
    ch:add-value("result", $result),
    element header {
      element Date { fn:current-dateTime() },
      element Content-Type { "application/json" } (: value assumed :)
    }
  )
};
The code that searches for a session document by username and its expiration date/time uses the following function. Please note the element range index query.

declare function auth:findSessionByUser($username as xs:string)
{
  (: element names are assumed; note the element range query on expiration :)
  let $query := cts:and-query((
    cts:element-value-query(xs:QName("username"), $username),
    cts:element-range-query(xs:QName("expiration"), ">", fn:current-dateTime())))
  let $uri := cts:uris("", ("document", "limit=1"), $query)
  return fn:doc($uri)
};

8. Session Expiration:

The code to check the session expiration is invoked by the router code.

See line 89 in /src/app/lib/router.xqy


Here’s the code. Please note the element range query.

declare function auth:findSessionByToken($token as xs:string)
{
  (: the query definition is reconstructed; element/attribute names are assumed :)
  let $query := cts:element-attribute-value-query(
      xs:QName("session"), xs:QName("user-sid"), $token)
  let $uri := cts:uris("", ("document", "limit=1"), $query)
  let $doc := fn:doc($uri)
  let $current := fn:current-dateTime()
  return
    if ($doc) then
      let $expiration := xs:dateTime($doc//expiration)
      let $diff := $expiration - $current
      return
        (: once less than half the timeout window remains, treat the
           session as expired so the client refreshes its token :)
        if ($diff < ($auth:SESSION-TIMEOUT div 2)) then ()
        else $doc
    else ()
};
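The expiration arithmetic can be restated in JavaScript. The half-timeout threshold mirrors the $auth:SESSION-TIMEOUT div 2 check in the XQuery above:

```javascript
const SESSION_TIMEOUT_MS = 5 * 60 * 1000; // matches the demo's 5-minute window

// Once less than half of the timeout window remains before the session's
// expiration, the session is due for a refresh.
function needsRefresh(expiration, now = new Date()) {
  const remaining = expiration.getTime() - now.getTime();
  return remaining < SESSION_TIMEOUT_MS / 2;
}
```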

9. Logout Code:

The logout code terminates the session by deleting the session document. Here’s the code.

declare function auth:logout($username as xs:string)
{
  let $session := auth:findSessionByUser($username)
  let $user := auth:userFind($username)

  let $token :=
    if ($session) then fn:string($session/@user-sid)
    else ()
  let $__ := auth:clearSession($username)
  return
    <json:object type="object">
      <json:message>Logout Successful - Token Deleted</json:message>
      <json:user>{fn:string-join((($user/firstName, $user/firstname)[1],
          ($user/lastName, $user/lastname)[1]), " ")}</json:user>
    </json:object>
};


Hopefully, the authentication code described in this demo application has been informative. It shows an approach that I have recently used in a ROXY Application.

I will be building on this solution in future posts. The most pressing next step is to add support for OAuth v2 and role based restricted views. So stay tuned.

As always, please let me know if any further clarifications or details are needed in the comments section.

Creating a MarkLogic Search Widget

November 12, 2012


How to create a MarkLogic Search Widget?


Once a highly searchable MarkLogic data repository has been created, a common request is to provide a Search Widget. A search widget is a simple web search box that can be added to any web page. Users can then use the search box to submit a search request to the MarkLogic database.

The client-side widget consists of HTML, CSS and JavaScript. The JavaScript calls a MarkLogic REST API asynchronously. MarkLogic processes the request and sends the results back to the web page, where the client-side JavaScript renders them.

The search widget test page shown in this screen cast is on the following link.

The screencast walks through the process of creating a MarkLogic Search Widget using the Roxy Framework.

The key topics discussed are:

  1. Recap of the previous Roxy screencast.
  2. Search Widget versus MarkLogic 6 Visual Widget
  3. transform-results override – See config.xqy and snippet-lib.xqy
  4. Add a new Search Controller using Roxy command line.
  5. Cross Domain AJAX using JSONP
  6. JSON Response
  7. jQuery code used to send, receive and render.

1. Recap

The previous screencast showed the Roxy two column and three column layouts. It also showed the time saving Roxy deployer tool.

A revised version of the original screencast web application shown in the following link.

Two-Column Layout –

Be sure to use the “Open XML” link in the search results column to view the underlying XML fragment.

2. Search Widget versus MarkLogic 6 Visualization Widget

The screencast builds a fully featured search box widget using some jQuery code, HTML and CSS.

However, MarkLogic 6 comes with a few prebuilt visualization widgets. This is a good opportunity to mention the MarkLogic 6 Visualization Widgets.

You can see them in action here.

Be sure to try the pie chart filters.

3. transform-results function override

The proper way to write custom snippet code is to override the transform-results function. The screencast shows the search option with the respective code. For more info, see Chapter 2 of the Search Developers Guide.

4. Add a new Search Controller using Roxy command line

Use the Roxy create command to create a new controller with the respective view code.

See the Roxy command line help for the details.

ml create --help

ml create controller --help

5. Cross Domain AJAX using JSONP

Cross-domain AJAX requests can be a security risk and are restricted in most modern browsers. The standard workaround is to wrap the JSON string in a JavaScript function call. This approach is called JSON-with-Padding, or JSONP.

More details on JSONP are here.
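The server half of JSONP can be sketched in one function. The callback name srCallback comes from this screencast; the wrapping itself is the “padding”:

```javascript
// Wrap the JSON payload in a call to the callback name that the client
// supplied in the request, producing an executable script response.
function padJson(callbackName, payload) {
  return callbackName + '(' + JSON.stringify(payload) + ');';
}
```

The browser then loads this response via a script tag, which invokes the named function with the payload.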

6. JSON Response

The over-the-wire transport from MarkLogic to a browser client is most efficiently done using the lightweight JSON format. The JSON structure used in this screencast consists of three parts:

  1. paginationInfo
  2. facetInfo
  3. results

The following link shows the raw JSON format.
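A skeletal version of that structure, with illustrative field names (the screencast’s exact schema may differ):

```javascript
// The three top-level parts named above; inner fields are assumptions.
const response = {
  paginationInfo: { start: 1, pageLength: 10, total: 0 },
  facetInfo: {},
  results: []
};
```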

7. jQuery code used to send, receive and render.

The code that initiates the call to MarkLogic search is as follows.

getSearchResults = function() {
    if (querystring('q') != "") {
        $.ajax({
            url: getSearchUrl(querystring('pg')),
            type: 'GET',
            dataType: "jsonp",
            jsonp: "callback",
            jsonpCallback: "srCallback"
        });
    } else {
        // Code to clear the results pane…
    }
};

A callback function called srCallback(data) receives the results from MarkLogic.

The jQuery code is available on the widget test page by viewing source within the browser.

8. Widget Test Page

Be sure to use the following widget test page links.



The screencast builds from the previous session. The objective is to educate and raise awareness of MarkLogic’s agile database development capabilities.

There are many more topics to cover so please stay tuned.