Posts Tagged ‘Semantic’

RSuite, MarkLogic and the DITA Open Toolkit

April 17, 2015


RSuite DITA Transform Demo


The video above gives a brief introduction to RSuite and the DITA Open Toolkit.

I had the great pleasure of being a speaker at last week’s MarkLogic World 2015 San Francisco event. I’ll be writing much more about my MarkLogic MEAN Stack presentation later, but for now I’d like to give a shout-out to RSuite.

While at MarkLogic World, I also had the pleasure of reconnecting with the wonderful RSuite team. I’m a big advocate of the RSuite content management system.

This past summer, I had an amazing opportunity to work at HarperCollins, where I helped deploy their new content management services using RSuite. My interest in RSuite stemmed from my work as a professional services consultant at MarkLogic. MarkLogic has a loyal base of customers in the media/publishing industry.

Media/publishing customers choose MarkLogic to store their content, which consists of text-based documents and binary assets (photos, audio, video) along with their respective metadata.

MarkLogic lowered its pricing in 2013, which made it more affordable to store binary assets in MarkLogic. Keeping the binary assets together with the text-based content also greatly simplifies the infrastructure and management overhead.

The RSuite CMS has the following features:

  1. Workflow
  2. DITA Transforms – provides “multi-channel” output.
  3. Role Based Security
  4. Distribution

The RSuite secret sauce is the DITA Open Toolkit. The other key component is MarkLogic.

The workflow engine is provided by jBPM, which uses MySQL to store the workflow configurations and drives the finite state machine.

The DITA Open Toolkit provides the “multi-channel output” feature needed by most publishers: the ability to render the same content to many formats, such as PDF, ePub, XHTML, Adobe InDesign, and Word (.docx).

DITA stands for Darwin Information Typing Architecture. It is an XML data model for authoring and publishing.

Eliot Kimber is a driving force behind DITA. Here are some useful links.


Publishers should avoid using XHTML as the storage format for book content for many reasons. The industry-standard formats are DITA and DocBook, because they provide a higher level of abstraction that makes “multi-channel output” much easier. These standards are also more flexible when providing custom publishing services.

The DITA format is especially interesting because of the specialization feature that makes XML structures polymorphic.

Some key points about DITA:

  1. Topic Oriented
  2. Each Topic is a separate XML file
  3. DocBook is Book Oriented
  4. DITA Initial Spec in 2001
  5. DocBook Initial Spec in 1991
  6. Core DITA Topic Types are Concept, Task, and Reference
  7. Specialization: subtyping, where new topic types are derived from existing ones.
  8. The “Darwin” in the name refers to the way polymorphic specializations provide an evolution path.
  9. A DITA Map XML document is used to stitch the Topic XML documents together.
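As a rough illustration of points 6 and 9 (the file names here are hypothetical), a DITA map stitches standalone topic files into a single deliverable:

```xml
<!-- book.ditamap: stitches standalone topic files into one publication -->
<map>
  <title>Sample Publication</title>
  <topicref href="overview.dita" type="concept">
    <topicref href="installing.dita" type="task"/>
    <topicref href="api-reference.dita" type="reference"/>
  </topicref>
</map>
```

The DITA Open Toolkit then renders the map, not the individual topics, which is what makes the same topic reusable across many deliverables.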


I had an opportunity to chat with Norm Walsh about it at this week’s MarkLogic World event. Norm is the author of DocBook: The Definitive Guide. He’s also an active member of a few XML/JSON standards committees.

DITA is a competing standard to DocBook. Norm wrote this interesting blog post about DITA back in October 2005.

My question was: which one has better support for semantic annotations? Most content these days is semantically enriched using multiple ontologies. This is needed so that SPARQL queries can be used to provide Dynamic Semantic Publishing services.
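To make the use case concrete, a Dynamic Semantic Publishing query over enriched content might look something like this sketch (the `pub:` prefix and predicate names are hypothetical, not a published ontology):

```sparql
# Find documents whose main topic is "Sports Medicine", with their authors
PREFIX pub: <http://example.com/publishing#>

SELECT ?document ?author
WHERE {
  ?document pub:mainTopic "Sports Medicine" .
  ?author   pub:authorOf  ?document .
}
```

Queries like this are why the storage format’s support for semantic annotations matters: the annotations are what the triples are extracted from.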

Norm was quick to tell me that DocBook 5.1 now supports RDFa. I’ll definitely investigate.

For DITA, there’s an interesting DITA to RDF transform in the works here.

Bob DuCharme, author of Learning SPARQL, has this nice blog post on using RDFa with DITA and DocBook.

In addition to the screencast above, the following screencasts take a deeper dive into the RSuite software and the DITA Open Toolkit.

Please take a look. Hopefully, the screencasts will shed some more light on the need to store content using a higher level of abstraction (DITA or DocBook).

The screencasts will also show the value of RSuite as a full-blown Content Management and Digital Asset Management (DAM) solution.


DITA Open Toolkit Demo




RSuite Architecture and Code



What is the MarkLogic Document Discovery App?

August 29, 2013


MarkLogic Document Discovery


One of my favorite MarkLogic applications posted on GitHub is Document Discovery.

It’s a favorite because it provides a Quick Start tutorial for developers looking to learn about MarkLogic’s Content Processing Framework (CPF) and the new binary document support.

Developers will also learn about Field Value Query and “How to annotate and add custom ratings?”.

You can access the source code and get the installation instructions in the readme section on the GitHub project page.

However, this blog post will provide more detail, because additional changes were required to get it working properly on my MarkLogic 6 server.

You can see it in action here. =>

Be sure to watch the embedded screencast at the top of the page too.


As noted in the readme, this is a simple application that was built using a customized Application Builder app. Unfortunately, it uses the older MarkLogic 5 Application Builder code. The latest Application Builder that comes with MarkLogic 6 (ML6) has undergone a significant architectural change. The ML6 version generates code that is much more declarative as it utilizes XSLT with the ML6 REST API.

A good write-up on how to customize the ML6 App Builder code is posted here.

I’ll have a newer version of the Document Discovery app that is built with the ML6 App Builder in a future blog post. For now, this version is better as a tutorial because it makes it easier to follow the code for:

  1. CPF
  2. Binary File Metadata
  3. Field Value Query
  4. How to annotate and add custom ratings?

Content Processing Framework (CPF)

I like to refer to the MarkLogic CPF as a Finite State Machine (FSM) for documents. In this case, the document’s state will change or transition from one state to another via a triggering event or condition.

A typical FSM is defined by a list of its states and the triggering condition for each transition.

For MarkLogic CPF, this is done with pipelines. A pipeline is a configuration file that contains the triggering conditions and the respective actions for each state transition.
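As a sketch of what that configuration looks like (the action module path is hypothetical; the state URIs follow the stock CPF conventions):

```xml
<pipeline xmlns="http://marklogic.com/cpf/pipelines">
  <pipeline-name>Meta Pipeline</pipeline-name>
  <state-transition>
    <annotation>Scan a newly inserted binary document</annotation>
    <state>http://marklogic.com/states/initial</state>
    <on-success>http://marklogic.com/states/processed</on-success>
    <on-failure>http://marklogic.com/states/error</on-failure>
    <execute>
      <action>
        <!-- hypothetical action module -->
        <module>/cpf/ingest/scan-binary.xqy</module>
      </action>
    </execute>
  </state-transition>
</pipeline>
```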

This application will use CPF for entity enrichment. In this case the entity will be an arbitrary binary file.

I find that entity enrichment is the most common use case for CPF.

Of course, the online Content Processing Framework Guide is the ultimate source to help you become fluent in CPF.

In the meantime, this post will provide the quick-start steps needed to get this Document Discovery app configured to use a CPF pipeline.

In this case, the CPF pipeline will execute a document “scan” on an arbitrary binary document whenever it is inserted into the database.

The triggering event for this will be the call to xdmp:document-insert(), which puts the document into the “initial” state.

Binary File Support

The ability to ingest a wide range of binary files was added in MarkLogic version 5. This feature is referred to as the ISYS Document Filter.

The ISYS (Document) Filter was provided by a company called ISYS Search Software, Inc. The company has since changed its name to Perceptive Search, Inc.

The new ISYS filter extracts metadata from a wide range of binary documents (Word, PowerPoint, JPEG, GIF, Excel, etc.). The supported binary formats are listed here.

This metadata extraction capability is exposed through an API called xdmp:document-filter().

More details are provided in Chapter 16 of the Search Developer’s Guide.

Here’s an example of the API usage with its response.

Code snippet:

xdmp:document-filter(fn:doc("/content/451 Report.docx"))


API Response:
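The response is an XHTML document whose head carries the extracted metadata as meta elements. A representative shape, with illustrative values, is:

```xml
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta name="content-type" content="application/msword"/>
    <meta name="Author" content="Jane Doe"/>
    <meta name="Last_Author" content="John Smith"/>
    <meta name="Typist" content="Pat Jones"/>
  </head>
  <body>
    <p>The extracted document text appears here...</p>
  </body>
</html>
```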


Please note the attributes used in the meta elements shown above.

The CPF “initial” pipeline code will modify the above meta elements to be as follows.



The 3 elements in the blue circle are added later, when the user adds a comment and/or rating to the document.


Field Value Query

In this app, the Author facet was implemented as a Field Value Query. A field value query is used because the search facet also needs to include the <Typist> element.


The following XML shows 3 elements that are valid Authors (Last_Author, Author, and Typist).



Adding a Field Value Index to the database unifies these values during a search. It also provides a weighting mechanism: the <Last_Author> element can be boosted to have a higher relevance than the <Typist> element.
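Once the field is configured (the field name “Author” below is an assumption, not necessarily what the app uses), querying it is a one-liner with the cts API. This is just a sketch:

```xquery
(: Search across all elements configured in the "Author" field,
   i.e. Last_Author, Author, and Typist :)
cts:search(
  fn:collection(),
  cts:field-value-query("Author", "Jane Doe")
)
```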


Here’s a snapshot of the admin page used to configure it.



For more detail, see Understanding Field Word and Value Query Constructors in Chapter 4 of the Search Developer’s Guide.


How to annotate and add custom ratings?

JavaScript for the annotation and custom ratings is in a file called /src/custom/appjs.js.


Here’s the source code directory structure.




The JavaScript code will display the User Comments dialog box to capture the commenter’s name and comment, as shown.



The code that provides the web form post action is in /src/custom/appfuncations.xqy.


The post action sends the request to a file called /insertComment.xqy, which then does either an xdmp:node-insert-child() for new comments or an xdmp:node-replace() for updates.
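In outline, the insert-or-replace logic looks something like this sketch (the $uri and $new-comment bindings and the meta element shape are assumptions, not the app’s actual code):

```xquery
(: Insert a new comment meta element, or replace the existing one :)
let $doc := fn:doc($uri)
let $existing := ($doc//*:meta[@name eq "comment"])[1]
return
  if (fn:exists($existing))
  then xdmp:node-replace($existing, $new-comment)
  else xdmp:node-insert-child(($doc//*:head)[1], $new-comment)
```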


A similar approach is used for the user ratings capability. Here’s a snapshot of the UI.





The source code currently posted on GitHub did not work for me out of the box.

I made the following changes to make it work.

  1. Config File – modified the Author field value index
  2. Config File – made it compatible with MarkLogic 6
  3. Config File – included a dedicated modules and triggers database
  4. Pipeline – modified the pipeline XML to use code in the /cpf/ingest directory
  5. Deployed the source code to a modules database instead of the file system
  6. Added additional logging to verify the pipeline code process

The source code is posted here. =>

Here are the steps that I used to get the application running properly.

Installation Steps:

  1. Import the configuration package that’s located in the /config directory. This package will create and configure the 3 databases (DocumentDiscovery, doument-discovery-modules, and Document-Discovery-Triggers).
  2. Install content processing on the database. See section 4.5 of the CPF Guide. =>
  3. Install the pipeline that is in the /ingest directory.

    This is done by copying the 2 files in the /ingest directory to the /Modules/MarkLogic/cpf/ingest directory. You may need to create this directory directly under the modules directory of your MarkLogic installation (see next image). Once copied, then follow the normal pipeline loading process.

  4. Ensure that only the following pipelines are enabled for the default domain in the DocumentDiscovery database (DO NOT ENABLE DOCUMENT CONVERSION).

    Install: Document Filtering (XHTML), Status change handling, and the custom Meta Pipeline. You’ll need to load the Meta Pipeline configuration file.

  5. Deploy Source Code – be sure to modify the http server to use the proper modules database.
  6. Load data using the admin console, Information Studio or qconsole.
  7. Observe ability to search and view the binary documents!


Image: Shows CPF Pipeline Code located in the MarkLogic installation directory.



Image: Document Discovery App showing Search, Facets, and User Star Ratings




Hopefully, this post will help you get started using MarkLogic for Document Discovery.

The exciting news is the soon-to-be-released MarkLogic 7 (ML7).

ML7 will have new semantic technology features that will take this simple document repository to a whole new level.

This is because ML7 will have the ability to add richer metadata to each document. The richer metadata comes in the form of triples.

A triple is a way to model a fact as a “Subject, Predicate, Object”. Many triples can be added to a document as sets of facts. These facts can then be incorporated into queries that can make inferences about the “subjects”. For example:


  1. “Jane Doe” : “graduatedFrom” : “Columbia University”
  2. “Jane Doe” : “graduationYear” : “2001”
  3. “Jane Doe” : “AuthorOf” : “This document: ISBN-125”
  4. “Sports Medicine” : “MainTopicOf” : “This document: ISBN-125”
  5. “Henry James” : “AuthorOf” : “This document: ISBN-125”
  6. “This document: ISBN-125” : “refersTo” : “document ISBN-545”
  7. “Sandra Day” : “graduatedFrom” : “Columbia University”
  8. “Sandra Day” : “graduationYear” : “2001”
  9. “Sandra Day” : “AuthorOf” : “document ISBN-545”


From these facts, the following inferences can be made:

  1. Henry James and Jane Doe co-authored document ISBN-125.
  2. Sandra Day and Jane Doe were college classmates.


In this example, the facts (triples) are used for knowledge discovery.


In this case, the knowledge discovery, or inferencing, can be accomplished with minimal coding effort using simple data structures and a rich query API.
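For example, the classmates inference could be expressed in ML7 with sem:sparql() roughly like this (the ex: predicate IRIs are hypothetical, and real triples would use IRIs for subjects rather than plain strings):

```xquery
(: Find pairs of people who graduated from the same school in the same year :)
sem:sparql('
  PREFIX ex: <http://example.com/facts#>
  SELECT ?a ?b
  WHERE {
    ?a ex:graduatedFrom ?school ; ex:graduationYear ?year .
    ?b ex:graduatedFrom ?school ; ex:graduationYear ?year .
    FILTER (?a != ?b)
  }
')
```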


So please stay tuned!