What is the MarkLogic Document Discovery App?

August 29, 2013

 

MarkLogic Document Discovery

 

One of my favorite MarkLogic applications posted on GitHub is Document Discovery.

It’s a favorite because it provides a Quick Start tutorial for developers looking to learn about MarkLogic’s Content Processing Framework (CPF) and the new binary document support.

Developers will also learn about field value queries and how to add annotations and custom ratings.

You can access the source code and get the installation instructions in the readme section on the GitHub project page.

This blog post goes into more detail, though, because the app needed additional changes before it worked properly on my MarkLogic 6 server.

You can see it in action here. => http://ps4.demo.marklogic.com:8009/

Be sure to watch the embedded screencast at the top of the page too.

Background

As noted in the readme, this is a simple application that was built using a customized Application Builder app. Unfortunately, it uses the older MarkLogic 5 Application Builder code. The latest Application Builder that comes with MarkLogic 6 (ML6) has undergone a significant architectural change. The ML6 version generates code that is much more declarative as it utilizes XSLT with the ML6 REST API.

A good write-up on how to customize the ML6 App Builder code is posted here.

http://docs.marklogic.com/guide/app-builder/custom#id_93564

I’ll have a newer version of the Document Discovery app that is built with the ML6 App Builder in a future blog post. For now, this version is better as a tutorial because it makes it easier to follow the code for:

  1. CPF
  2. Binary File Metadata
  3. Field Value Query
  4. How to annotate and add custom ratings?

Content Processing Framework (CPF)

I like to refer to the MarkLogic CPF as a Finite State Machine (FSM) for documents. In this case, the document’s state will change or transition from one state to another via a triggering event or condition.

A typical FSM is defined by a list of its states, and the triggering condition for each transition.

For MarkLogic CPF this is done with Pipelines. A pipeline is a configuration file that contains the triggering conditions with respective actions for each state transition.

This application will use CPF for entity enrichment. In this case the entity will be an arbitrary binary file.

I find that entity enrichment is the most common use case for CPF.

Of course, the online Content Processing Framework Guide is the ultimate source to help you become fluent in CPF.

http://docs.marklogic.com/guide/cpf/overview

In the meantime, this post will provide the quick start steps needed to get this Document Discovery app configured to use a CPF pipeline.

In this case, the CPF pipeline will execute a document “scan” on an arbitrary binary document whenever it is inserted into the database.
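To make the “scan” step a little more concrete, here’s a minimal sketch of what a CPF action module generally looks like. It follows the standard action template described in the CPF Guide. It is not the project’s actual module, and the log message and the placeholder for the enrichment work are my own illustration.

```xquery
xquery version "1.0-ml";

import module namespace cpf = "http://marklogic.com/cpf"
  at "/MarkLogic/cpf/cpf.xqy";

(: CPF supplies these two variables when it runs an action module. :)
declare variable $cpf:document-uri as xs:string external;
declare variable $cpf:transition as node() external;

(: Only proceed if the document is still in the state this transition expects. :)
if (cpf:check-transition($cpf:document-uri, $cpf:transition))
then
  try {
    (: Illustrative log line; the enrichment work would go here. :)
    xdmp:log(fn:concat("Scanning ", $cpf:document-uri)),
    (: Move the document to the transition's on-success state. :)
    cpf:success($cpf:document-uri, $cpf:transition, ())
  }
  catch ($e) {
    (: Move the document to the transition's on-failure state. :)
    cpf:failure($cpf:document-uri, $cpf:transition, $e, ())
  }
else ()
```

The check-then-succeed-or-fail shape is what makes the finite state machine analogy work: each action either advances the document to the next state or parks it in an error state.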

The triggering event is the xdmp:document-insert() call, which puts the document into the “initial” state.
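As a rough illustration of that triggering event, the sketch below reads a binary file from the file system and inserts it into the database. Both paths are placeholders of mine, not values from the project. It also assumes CPF is already installed on the target database with a domain that covers the URI.

```xquery
xquery version "1.0-ml";

(: Placeholder file system path and database URI. :)
xdmp:document-insert(
  "/content/example.docx",
  xdmp:document-get("/tmp/example.docx"))
```

Once the insert commits, the CPF triggers on the database should pick up the new document and start it through the enabled pipelines.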

Binary File Support

A new ability to ingest a wide range of binary files was added to MarkLogic in version 5. This feature is referred to as an ISYS Document Filter.

The ISYS (Document) Filter was provided by a company called ISYS Search Software, Inc. The company has since changed its name to Perceptive Search, Inc.

The new ISYS filter extracts metadata from a wide range of binary documents (Word, PowerPoint, JPG, GIF, Excel, etc.). The supported binary file formats are listed here.

This metadata extraction capability is exposed through an API function called xdmp:document-filter().

More details are provided in Chapter 16 of the Search Developer’s Guide.

Here’s an example of the API usage with response.

Code snippet:

xdmp:document-filter(fn:doc("/content/451 Report.docx"))

 

API Response:

image

Please note the attributes used in the meta elements shown above.
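If you’d like to look at those attributes without the screenshot, a sketch like the following lists each meta element as a name/value pair. It assumes the filter output is XHTML whose meta elements carry name and content attributes, and it reuses the document URI from the snippet above.

```xquery
xquery version "1.0-ml";

declare namespace xhtml = "http://www.w3.org/1999/xhtml";

(: Run the filter once, then print every meta element as "name = content". :)
for $meta in xdmp:document-filter(fn:doc("/content/451 Report.docx"))//xhtml:meta
return fn:concat($meta/@name, " = ", $meta/@content)
```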

The CPF “initial” pipeline code will modify the above meta elements to be as follows.

image

 

The 3 elements in the blue circle are added later, when the user adds a comment and/or rating to the document.

 

Field Value Query

In this app, the Author facet was implemented as a Field Value Query. A field is used because the facet needs to cover more than one element, including the <Typist> element.

Example:

The following XML shows 3 elements that are valid Authors (Last_Author, Author, and Typist).

image

 

Adding a Field Value Index to the database makes it possible to treat the values of all three elements as one during a search. It also provides a weighting mechanism, so the <Last_Author> element can be boosted to have a higher relevance than the <Typist> element.
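Here’s a minimal sketch of what a search against such a field could look like. I’m assuming the field is named "Author" and that field value searches are enabled for it in the database configuration. The author name is just an example value.

```xquery
xquery version "1.0-ml";

(: "Author" is the assumed field name; "Jane Doe" is an example value. :)
cts:search(
  fn:doc(),
  cts:field-value-query("Author", "Jane Doe", "case-insensitive"))
```

Because the query targets the field rather than a single element, a document should match whether the name appears in Last_Author, Author, or Typist.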

 

Here’s a snapshot of the admin page used to configure it.

image

 

For more detail, see Understanding Field Word and Value Query Constructors in Chapter 4 of the Search Developer’s Guide.

 

How to annotate and add custom ratings?

JavaScript for the annotation and custom ratings is in a file called /src/custom/appjs.js.

 

Here’s the source code directory structure.

 

image

 

The JavaScript code will display the User Comments dialog box to capture the Commenter’s Name and Comment as shown.

 

image

The code that provides the web form post action is in /src/custom/appfunctions.xqy.

 

The post action sends the request to a file called /insertComment.xqy, which then uses either one of the xdmp:node-insert-*() functions for new comments or xdmp:node-replace() for updates.
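The insert-or-replace logic can be sketched roughly as follows. This is not the project’s insertComment.xqy: the request field names ("uri", "name", "comment") and the element names are hypothetical, and I’m assuming xdmp:node-insert-child() for the insert case.

```xquery
xquery version "1.0-ml";

(: Hypothetical request field names; the real form may use others. :)
let $uri  := xdmp:get-request-field("uri")
let $name := xdmp:get-request-field("name", "anonymous")
let $body := xdmp:get-request-field("comment", "")

(: Hypothetical element names for the stored comment. :)
let $new :=
  <comment>
    <commenter>{$name}</commenter>
    <body>{$body}</body>
  </comment>

let $root     := fn:doc($uri)/*
let $existing := $root/comment[commenter = $name]
return
  if (fn:empty($root)) then
    fn:error(xs:QName("NO-SUCH-DOCUMENT"), fn:concat("No document at ", $uri))
  else if (fn:exists($existing)) then
    (: This commenter already has a comment, so update it in place. :)
    xdmp:node-replace($existing[1], $new)
  else
    (: First comment from this commenter, so append it to the root element. :)
    xdmp:node-insert-child($root, $new)
```

Keying the lookup on the commenter’s name is a design choice for this sketch. It keeps one comment per commenter, which is what turns a second submission into an update rather than a duplicate.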

 

A similar approach is used for the user ratings capability. Here’s a snapshot of the UI.

 

image

 

Installation

The source code that is currently posted on GitHub did not work initially.

I made the following changes to make it work.

  1. Config File – modify the Author field value index
  2. Config File – make compatible with MarkLogic 6
  3. Config File – include a dedicated modules and triggers DB.
  4. Pipeline – modify pipeline xml to use code in the /cpf/ingest directory.
  5. Deployed source code to a modules database instead of the file system.
  6. Added additional logging to verify pipeline code process.

The source code is posted here. => http://sdrv.ms/15Coe6U

Here are the steps that I used to get the application running properly.

Installation Steps:

  1. Import the configuration package that’s located in the /config directory. This package will create and configure the 3 databases (DocumentDiscovery, doument-discovery-modules, and Document-Discovery-Triggers).
  2. Install content processing on the database. See section 4.5 of the CPF Guide. => http://docs.marklogic.com/guide/cpf
  3. Install the pipeline that is in the /ingest directory.

    This is done by copying the 2 files in the /ingest directory to the /Modules/MarkLogic/cpf/ingest directory. You may need to create this directory directly under the modules directory of your MarkLogic installation (see next image). Once the files are copied, follow the normal pipeline loading process.

  4. Ensure that only the following pipelines are enabled for the default domain in the DocumentDiscovery database (DO NOT ENABLE DOCUMENT CONVERSION).

    Install: Document Filtering (XHTML), Status change handling, and the custom Meta Pipeline. You’ll need to load the Meta Pipeline configuration file.

  5. Deploy Source Code – be sure to modify the http server to use the proper modules database.
  6. Load data using the admin console, Information Studio or qconsole.
  7. Verify that you can search and view the binary documents!

 

Image: Shows CPF Pipeline Code located in the MarkLogic installation directory.

image

 

Image: Document Discovery App showing Search, Facets, and User Star Ratings

image

 

Conclusion

Hopefully, this post will help you get started using MarkLogic for Document Discovery.

The exciting news is the soon-to-be-released MarkLogic 7 (ML7).

ML7 will have new semantic technology features that will take this simple document repository to a whole new level.

This is because ML7 will have the ability to add richer metadata to each document. This richer metadata takes the form of triples.

A triple is a way to model a fact as a "Subject, Predicate, Object". Many triples can be added to a document as a set of facts. These facts can then be incorporated into queries that can make inferences about the “subjects”.

Example:

  1. “Jane Doe” : “graduatedFrom” : “Columbia University”
  2. “Jane Doe” : “graduationYear” : “2001”
  3. “Jane Doe” : “AuthorOf” : “This document: ISBN-125”
  4. “Sports Medicine” : “MainTopicOf” : “This document: ISBN-125”
  5. “Henry James” : “AuthorOf” : “This document: ISBN-125”
  6. “This document: ISBN-125” : “refersTo” : “document ISBN-545”
  7. “Sandra Day” : “graduatedFrom” : “Columbia University”
  8. “Sandra Day” : “graduationYear” : “2001”
  9. “Sandra Day” : “AuthorOf” : “document ISBN-545”

 

From these facts, the following inferences can be made:

  1. Henry James and Jane Doe co-authored document ISBN-125.
  2. Sandra Day and Jane Doe were college classmates
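To show how little code the second inference needs, here’s a small sketch in plain XQuery. It does not use any ML7 semantic API. The triples are modeled as simple XML elements of my own invention (triple, s, p, o), holding only the graduation facts from the list above.

```xquery
xquery version "1.0-ml";

(: Facts 1, 2, 7 and 8 from the example, as made-up XML elements. :)
let $triples := (
  <triple><s>Jane Doe</s><p>graduatedFrom</p><o>Columbia University</o></triple>,
  <triple><s>Jane Doe</s><p>graduationYear</p><o>2001</o></triple>,
  <triple><s>Sandra Day</s><p>graduatedFrom</p><o>Columbia University</o></triple>,
  <triple><s>Sandra Day</s><p>graduationYear</p><o>2001</o></triple>
)
(: Pair up two different people with the same school and the same year. :)
for $a in $triples[p = "graduatedFrom"],
    $b in $triples[p = "graduatedFrom"]
where $a/o = $b/o
  and $a/s lt $b/s
  and $triples[s = $a/s and p = "graduationYear"]/o
      = $triples[s = $b/s and p = "graduationYear"]/o
return fn:concat($a/s, " and ", $b/s, " were classmates at ", $a/o)
```

The `lt` comparison on the subjects is only there so each pair is reported once and nobody is paired with themselves.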

 

In this example, the facts (triples) are used for knowledge discovery.

 

In this case, the knowledge discovery or inferences can be accomplished with minimal coding effort, using simple data structures and a rich query API.

 

So please stay tuned!
