While at MarkLogic World, I also had the great pleasure of reconnecting with the wonderful RSuite team. I’m a big advocate of the RSuite content management system.
This past summer, I had an amazing opportunity to work at HarperCollins, where I helped deploy their new content management services using RSuite. My interest in RSuite stemmed from my work as a professional services consultant at MarkLogic. MarkLogic has a loyal base of customers in the media/publishing industry.
Media/publishing customers choose MarkLogic to store their content, which consists of text-based documents and binary assets (photos, audio, video) along with the respective metadata.
MarkLogic lowered its pricing in 2013, making it more affordable to store binary assets in MarkLogic. Keeping the binary assets together with the text-based content also greatly simplifies the infrastructure and management overhead.
The RSuite secret sauce is the DITA Open Toolkit. The other key component is MarkLogic.
The workflow engine is provided by jBPM which uses MySQL to store the workflow configurations and drives the finite state machine.
The DITA Open Toolkit provides the “multi-channel output” feature needed by most publishers. This is the ability to render the content to many formats, such as PDF, ePub, XHTML, Adobe InDesign, Word (docx), and more.
DITA stands for Darwin Information Typing Architecture. It is an XML data model for authoring and publishing.
Eliot Kimber is a driving force behind DITA. Here are some useful links.
Publishers should avoid using XHTML as the storage format for book content for many reasons. The industry-standard formats are DITA and DocBook because they provide a higher level of abstraction that makes “multi-channel output” much easier. These standards are also more flexible when providing custom publishing services.
The DITA format is especially interesting because of the specialization feature that makes XML structures polymorphic.
Some key points about DITA:
Topic Oriented
Each Topic is a separate XML file
DocBook is Book Oriented
DITA Initial Spec in 2001
DocBook Initial Spec in 1991
Core DITA Topic Types are Concept, Task, and Reference
Specialization: This is subtyping, where new topic types are derived from existing ones.
The term Darwin is used because the polymorphic specializations provide an evolution path.
A DITA Map XML document is used to stitch the Topic XML documents together.
I had an opportunity to chat with Norm Walsh about it at this week’s MarkLogic World event. Norm is the author of DocBook: The Definitive Guide. He’s also an active member in a few of the XML/JSON standards committees.
DITA is a competing standard to DocBook. Norm wrote this interesting blog post about DITA back in October 2005.
My question is which one has better support for semantic annotations. Most content these days is semantically enriched using multiple ontologies. This is needed so that SPARQL queries can be used to provide Dynamic Semantic Publishing services.
In addition to the screencast above, the following screencasts will take a deeper dive into the RSuite software and DITA Open Toolkit.
Please take a look. Hopefully, the screencasts will shed some more light on the need to store content using a higher level of abstraction (DITA or DocBook).
The screencasts will also show the value of RSuite as a full blown Content Management and Digital Asset Management (DAM) Solution.
Be sure to watch the embedded screencast at the top of the page too.
Background
As noted in the readme, this is a simple application that was built using a customized Application Builder app. Unfortunately, it uses the older MarkLogic 5 Application Builder code. The latest Application Builder that comes with MarkLogic 6 (ML6) has undergone a significant architectural change. The ML6 version generates code that is much more declarative as it utilizes XSLT with the ML6 REST API.
A good write up on how to customize the ML6 App Builder Code is posted here.
I’ll have a newer version of the Document Discovery app that is built with the ML6 App Builder in a future blog post. For now, this version is better as a tutorial because it makes it easier to follow the code for:
CPF
Binary File Metadata
Field Value Query
Annotations and custom ratings
Content Processing Framework (CPF)
I like to refer to the MarkLogic CPF as a Finite State Machine (FSM) for documents. In this case, the document’s state will change or transition from one state to another via a triggering event or condition.
A typical FSM is defined by a list of its states, and the triggering condition for each transition.
For MarkLogic CPF this is done with Pipelines. A pipeline is a configuration file that contains the triggering conditions with respective actions for each state transition.
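To make the FSM idea concrete, here is a minimal sketch in Python (not MarkLogic code) of a document lifecycle driven by transitions. The state and event names are hypothetical illustrations of what a CPF pipeline configures declaratively.

```python
# A minimal finite state machine for a document lifecycle.
# States, events, and actions are hypothetical; a CPF pipeline
# declares the equivalent triggering conditions and actions in XML.

TRANSITIONS = {
    # (current state, triggering event) -> (next state, action)
    ("initial", "on-insert"):   ("processing", "extract-metadata"),
    ("processing", "on-done"):  ("enriched",   "log-success"),
    ("processing", "on-error"): ("failed",     "log-error"),
}

def transition(state, event):
    """Return (next_state, action) for a triggering event, or stay put."""
    next_state, action = TRANSITIONS.get((state, event), (state, None))
    return next_state, action

state = "initial"
state, action = transition(state, "on-insert")  # document inserted
print(state, action)
```

The table of transitions plays the role of the pipeline configuration file: each entry pairs a triggering condition with the action to run for that state change.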
This application will use CPF for entity enrichment. In this case the entity will be an arbitrary binary file.
I find that entity enrichment is the most common use case for CPF.
Of course, the online Content Processing Framework Guide is the ultimate source to help you become fluent in CPF.
In the meantime, this post will provide the quick-start steps needed to get this Document Discovery app configured to use a CPF pipeline.
In this case, the CPF pipeline will execute a document “scan” on an arbitrary binary document whenever it is inserted into the database.
The triggering event for this will be xdmp:document-insert(), which puts the document into the “initial” state.
Binary File Support
The ability to ingest a wide range of binary files was added in MarkLogic version 5. This feature is referred to as the ISYS Document Filter.
The ISYS (Document) Filter was provided by a company called ISYS Search Software, Inc. The company has since changed its name to Perceptive Search, Inc.
The ISYS filter extracts metadata from a wide range of binary documents (Word, PowerPoint, JPG, GIF, Excel, etc.). The supported binary file types are listed here.
Please note the attributes used in the meta elements shown above.
The CPF “initial” pipeline code will modify the above meta elements to be as follows.
The 3 elements in the blue circle were added later when the user adds a comment and/or rating to the document.
Field Value Query
In this app, the Author facet was implemented as a field value query. A field value query is used because the facet also needs to include the <Typist> node.
Example:
The following XML shows 3 elements that are valid Authors (Last_Author, Author, and Typist).
Adding a Field Value Index to the database provides an ability to unify the values during a search. It also provides a weighting mechanism where the <Last_Author> element can be boosted to have a higher relevance than the <Typist> element.
Here’s a snapshot of the admin page used to configure it.
JavaScript for the annotation and custom ratings is in a file called /src/custom/appjs.js.
Here’s the source code directory structure.
The JavaScript code will display the User Comments dialog box to capture the commenter’s name and comment, as shown.
The code that provides the web form post action is in /src/custom/appfuncations.xqy
The post action sends the request to a file called /insertComment.xqy, which then does either an xdmp:node-insert() for new comments or an xdmp:node-replace() for updates.
A similar approach is done for the user ratings capability. Here’s a snapshot of the UI.
Installation
The source code that is currently posted on GitHub did not work initially.
I made the following changes to make it work.
Config File – modify the Author field value index
Config File – make compatible with MarkLogic 6
Config File – include a dedicated modules and triggers DB.
Pipeline – modify pipeline xml to use code in the /cpf/ingest directory.
Deployed the source code to a modules database instead of the file system.
Added additional logging to verify the pipeline process.
Here are the steps that I used to get the application running properly.
Installation Steps:
Import the configuration package that’s located in the /config directory. This package will create and configure the 3 databases (DocumentDiscovery, doument-discovery-modules, and Document-Discovery-Triggers).
Install the pipeline that is in the /ingest directory.
This is done by copying the 2 files in the /ingest directory to the /Modules/MarkLogic/cpf/ingest directory. You may need to create this directory directly under the modules directory of your MarkLogic installation (see next image). Once copied, then follow the normal pipeline loading process.
Ensure that only the following pipelines are enabled for the default domain in the DocumentDiscovery database (DO NOT ENABLE DOCUMENT CONVERSION).
Install: Document Filtering (XHTML), Status change handling, and the custom Meta Pipeline. You’ll need to load the Meta Pipeline configuration file.
Deploy Source Code – be sure to modify the http server to use the proper modules database.
Load data using the admin console, Information Studio or qconsole.
Observe ability to search and view the binary documents!
Image: Shows CPF Pipeline Code located in the MarkLogic installation directory.
Image: Document Discovery App showing Search, Facets, and User Star Ratings
Conclusion
Hopefully, this post will help you get started using MarkLogic for Document Discovery.
The exciting news is the soon to be released MarkLogic 7 (ML7).
ML7 will have new semantic technology features that will take this simple document repository to a whole new level.
This is because ML7 will have the ability to add richer metadata to each document. The richer metadata takes the form of triples.
A triple models a fact as a "Subject, Predicate, Object" statement. Many triples can be added to a document as sets of facts. These facts can then be incorporated into queries that make inferences about the “subjects”.
From these facts, the following inferences can be made:
Henry James and Jane Doe co-authored document ISBN-125.
Sandra Day and Jane Doe were college classmates
In this example, the facts (triples) are used for knowledge discovery.
In this case, the knowledge discovery, or inference, can be accomplished with minimal coding effort using simple data structures and a rich query API.
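To illustrate, here is a small Python sketch (not MarkLogic's triple store or SPARQL) that stores subject-predicate-object facts and infers co-authorship. The fact data mirrors the examples above.

```python
# Facts as (subject, predicate, object) triples. The authorship and
# classmate facts mirror the examples in the text.
facts = [
    ("Henry James", "authored",     "ISBN-125"),
    ("Jane Doe",    "authored",     "ISBN-125"),
    ("Sandra Day",  "classmate-of", "Jane Doe"),
]

def objects(subject, predicate):
    """All objects asserted for a given subject/predicate pair."""
    return {o for s, p, o in facts if s == subject and p == predicate}

def co_authors(doc):
    """Infer co-authorship: everyone who authored the same document."""
    return {s for s, p, o in facts if p == "authored" and o == doc}

print(co_authors("ISBN-125"))
```

A real triple store generalizes this pattern: the query engine walks the fact set for you, so inferences like "who co-authored ISBN-125" become declarative queries rather than hand-written loops.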
I’m a big advocate for open government data projects. One of the more interesting efforts is the US Congress legislation data that is published every night on GitHub.
Clay Shirky would like to use GitHub as a tool to manage the legislation process. He calls it a new “form of arguing”. It’s a great idea, but GitHub is not yet an easy tool for government legislators. It could use some ease-of-use tweaks for folks not familiar with software development.
In any case, Clay mentioned the posting of US Congress Bill data that occurs nightly. Posting the data is a great example of government transparency. This data can be used to measure congressional productivity and maybe even accountability.
This data makes it possible to provide a real-time (well, nightly) digital report card. It can be used to measure the perceptions of congressional gridlock that are often reported in the news.
This blog post will show how to do this, but before I start, here are some notes about the US Congress that will help when we look at the data.
Notes on the US Congress and the Legislative Process:
Consists of 2 houses: Senate and House of Representatives.
There are 535 voting members: 435 in the House of Representatives and 100 in the Senate.
Each Congressional Session is 2 years.
Current Session is the 113th Congress which runs from January 3, 2013 to January 3, 2015.
A key objective is to create legislation that improves and advances the entire country.
Any member of the House or Senate can introduce legislation for debate by proposing a bill.
A congressional bill is introduced and championed by a sponsor.
The bill is then sent to the appropriate congressional committee, which decides if the House or Senate will vote on it.
The Senate has 17 committees and 70 subcommittees.
The House has 23 committees and 104 subcommittees.
Each committee focuses on a specific policy area.
A bill is passed when it receives a majority vote in both the House and Senate.
The President must sign the bill into law or veto it.
Congressional Productivity can be measured by the number of bills that get signed by the president.
The Huffington Post published an article about the productivity of the 112th Congress at the end of 2012. The article noted that the 111th Congress had 383 bills signed into law by the president, while the 112th Congress had only 283.
One could conclude that the 112th Congress was less productive than the 111th Congress but there are other factors to consider.
Is it possible to use the nightly data to verify or track Congressional Productivity on a daily basis?
How difficult would it be to show data visualizations for each congress?
The answer to both questions is a most definite yes. But how?
Here’s a simple data set that I’d like to visualize and build from.
| Congress | Years     | Bills Proposed | Bills Signed | Bill Sign Rate |
|----------|-----------|----------------|--------------|----------------|
| 110      | 2007-2008 | 14,042         | 460          | 3.28%          |
| 111      | 2009-2010 | 13,675         | 383          | 2.80%          |
| 112      | 2011-2012 | 12,299         | 283          | 2.30%          |
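The sign rate is simply bills signed divided by bills proposed. A quick Python check reproduces the percentages from the table above:

```python
# Reproduce the Bill Sign Rate column: bills signed / bills proposed.
data = {
    110: (14042, 460),  # (bills proposed, bills signed), 2007-2008
    111: (13675, 383),  # 2009-2010
    112: (12299, 283),  # 2011-2012
}

def sign_rate(congress):
    """Percentage of proposed bills that were signed, rounded to 2 places."""
    proposed, signed = data[congress]
    return round(100.0 * signed / proposed, 2)

for c in sorted(data):
    print(c, f"{sign_rate(c)}%")
```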
MarkLogic Ingestion and Enrichment
This is an easy problem for MarkLogic. MarkLogic is typically used to ingest, enrich, search and discover.
Once the data is in MarkLogic, the following questions can be easily answered.
How many bills were signed by the president each year?
Who are the most productive members of Congress each year?
What is the ratio of bills proposed to bills signed each year?
I’ve created a MarkLogic database and ingested the US Congress Bill data from previous congressional sessions. The data ranges from the 93rd Congress (1973 to 1975) to the 112th Congress (2011 to 2012).
For the current session, 113th Congress, I created a scheduled task that runs every night at 11:00 PM EST (8:00 PM PST). The scheduled task ingests the data and enriches it with some simple presidential data.
I’ve also used the MarkLogic App Builder tool to create a quick web application that shows the bills signed by year. It can also be used to search and discover.
The App Builder and ingestion code is posted here. => Source Code
To see the most productive Congress members by year, select the “public” value of the “Enacted Type” facet in the top left corner. Once selected, you will see the results by year.
The Top 30 Most Productive Congress Members are listed in the Sponsor Facet (see image). This sponsor list shows the most productive members for the past 32 years in a descending order.
The number in parenthesis is the number of bills that they sponsored that were signed by the President of the United States.
I use the Sponsor classifier because a sponsor is a senator or House representative who introduces a bill or amendment and is the bill’s chief advocate. Sponsoring a bill requires significant effort, so it can serve as a good metric for a member’s productivity.
You can use the web app to drill down further to discover bills proposed or productive members by year, by president, by cosponsor, by subject, etc..
The remaining blog post and screencast will focus on the following topics.
How to programmatically get the data zip file?
How to ingest json documents?
How to set up an automated daily ingestion task?
Is ACID Compliance needed for this One-Way System?
How to use the SQL API?
How to connect and discover with Tableau and Excel?
1. How to programmatically get the data zip file?

declare variable $get-options :=
  <options xmlns="xdmp:http">
    <format xmlns="xdmp:document-get">binary</format>
  </options>;

declare variable $BASEURL :=
  "http://unitedstates.sunlightfoundation.com/congress/data/";

let $url := fn:concat($BASEURL, "113.zip")
let $zip := xdmp:http-get($url, $get-options)[2]
let $docs :=
  for $uri in xdmp:zip-manifest($zip)//zip:part/text()
  let $jdoc := xdmp:zip-get($zip, $uri)
  let $xdoc := json:transform-from-json($jdoc)
  let $doc  := local:load-house-bill-doc($xdoc)
  order by $uri
  return $uri
return fn:count($docs)
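For comparison, here is the same fetch-unzip-parse flow sketched in Python. To keep it self-contained and runnable, the zip is built in memory rather than fetched over HTTP, and the entry names are made up to resemble the real bill data layout.

```python
import io
import json
import zipfile

def parse_bill_zip(zip_bytes):
    """Walk a zip manifest and parse each JSON entry, mirroring the
    xdmp:zip-manifest / xdmp:zip-get loop in the XQuery above."""
    bills = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for uri in sorted(zf.namelist()):
            bills[uri] = json.loads(zf.read(uri))
    return bills

# Build a tiny in-memory zip standing in for 113.zip (hypothetical data).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("113/bills/hr/hr1/data.json",
                json.dumps({"bill_id": "hr1-113"}))
    zf.writestr("113/bills/s/s1/data.json",
                json.dumps({"bill_id": "s1-113"}))

bills = parse_bill_zip(buf.getvalue())
print(len(bills))
```

In a live script you would replace the in-memory buffer with the bytes of the downloaded 113.zip; the manifest walk and per-entry JSON parse stay the same.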
2. How to ingest json documents?
The data posted on the GitHub site is in json format. A good rule of thumb on json versus xml is to use json as a wire protocol for the “last mile” and XML for the data store.
The key advantages of XML over json are namespaces and schemas. Namespaces and schemas provide capabilities that ultimately make developers much more productive than a plain json data store would. But this is a topic for another day.
Ingesting json data into MarkLogic is easily done using XQuery. XQuery is my preferred ETL (extract, transform and load) tool. I use it to convert the json, transform (and enrich) the data into more efficient structures, and then store it as XML.
In a future post, I’ll talk about adding a triple which is the best way to capture relationships with other congressional bills.
The json format of the US Congress Bill data is well documented on the GitHub Wiki page.
Here’s an example of the json structure used to hold subjects and summary data.
{
"subjects": [
"Abortion",
"Administrative law and regulatory procedures",
"Adoption and foster care"
],
"summary": {
"as": "Public Law",
"date": "2010-03-23",
"text": "Patient Protection and Affordable Care Act"
}
}
The above json is transformed and stored in XML as follows.
<subjects>
<subject>Abortion</subject>
<subject>
Administrative law and regulatory procedures
</subject>
<subject>Adoption and foster care</subject>
</subjects>
<summary>
<summary-as>Public Law</summary-as>
<summary-date>2010-03-23</summary-date>
<summary-text>
Patient Protection and Affordable Care Act
</summary-text>
</summary>
Here’s the code snippet used to create the above xml node.
element {fn:QName($NS,"subjects")}
{
  for $item at $n in $node/*:subjects/node()
  return
    element {fn:QName($NS,"subject")} {$item/text()}
},
element {fn:QName($NS,"summary")}
{
  element {fn:QName($NS,"summary-as")}
    {$node/*:summary/*:as/text()},
  element {fn:QName($NS,"summary-date")}
    {$node/*:summary/*:date/text()},
  element {fn:QName($NS,"summary-text")}
    {$node/*:summary/*:text/text()}
},
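For readers who want to prototype the same subjects/summary transform outside XQuery, here is a sketch in Python using only the standard library. The element names mirror the XML shown earlier; the namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

def to_xml(doc):
    """Transform the json subjects/summary structure into the XML
    shape shown in the post (no namespace, for brevity)."""
    root = ET.Element("bill")
    subjects = ET.SubElement(root, "subjects")
    for s in doc.get("subjects", []):
        ET.SubElement(subjects, "subject").text = s
    summary = ET.SubElement(root, "summary")
    for key in ("as", "date", "text"):
        ET.SubElement(summary, f"summary-{key}").text = doc["summary"][key]
    return root

doc = {
    "subjects": ["Abortion", "Adoption and foster care"],
    "summary": {"as": "Public Law", "date": "2010-03-23",
                "text": "Patient Protection and Affordable Care Act"},
}
xml = to_xml(doc)
print(ET.tostring(xml, encoding="unicode"))
```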
3. How to set up an automated daily ingestion task?

The App Builder code was automatically generated and deployed. The App Builder tool deploys code using the following directory structure.
Please take note of the custom directory. As the name suggests, any custom code can be safely deployed there; the App Builder tool will not clobber any code that resides in this directory.
To create the nightly ingestion task, the ingestion code (ingest-bill-data.xqy) was copied to the /custom/schedule directory as shown above.
The next step is to add the scheduled task using the MarkLogic Admin tool.
The code is set to run daily at 8:00 PM PST (11:00 PM EST).
Once the task is set, the web application is ready for use.
Be sure to select the “public” value of the “Enacted Type” facet, which filters the bills proposed down to the bills passed. You can also try search. Be sure to drill down to the underlying XML document (see the 3rd red circle in the image). This link will take you to an html table view of the data. There is also a raw XML view (see images).
Underlying XML Document HTML Table View.
Underlying XML Document XML View.
4. Is ACID Compliance needed for this One-Way system?
There’s a big debate in the NoSQL community about database consistency.
Data consistency refers to how a database handles updates when failures occur. There are two general approaches: ACID and BASE.
ACID systems provide consistency at the expense of availability.
BASE systems provide availability at the expense of consistency.
I mention it because this application is mostly queries (read-only) and does not require rigorous ACID transactions.
ACID is not needed here because the data flows mostly one-way. However, if this app were to add user-specific features (e.g., user profiles, saved searches, bookmarks, workspaces, etc.), those features would turn the system into a mission-critical two-way system requiring strong consistency and durability.
But this is a topic for another day too.
5. How to use the SQL API?
MarkLogic version 6 added a new SQL API and ODBC service that is very useful for analytics.
I’ll quickly walk through the steps needed to set up a MarkLogic SQL View using the App Builder tool.
Select the database and then press the configure button (see image).
Scroll to the bottom of the page and press the “Add New” button. The image already shows the view that was created (us_congress_view).
Enter name, schema, localname (root node) and namespace.
Select the desired range indexes, but try to avoid a Cartesian product. In this example, I intentionally left out subject, cosponsor, and committee because many subjects, cosponsors, and committees are associated with a single bill. If these items were included, the number of records would jump to 7.5 million instead of 276,000.
Press the update view button to commit the changes.
Test using qconsole and the new SQL API (see following code).
6. How to connect and discover with Tableau and Excel?
Once the SQL View has been created, the next step is to connect the data to Tableau and Excel.
This requires an ODBC service. Creating the ODBC service is very similar to creating an XDBC service.
Go to the Admin Console > App Servers and then press the Create ODBC tab.
Create the service and make sure that the desired database is selected.
The next step is to create the ODBC System DSN using the 32 Bit ODBC Data Source Administrator.
Steps:
Launch the 32 Bit ODBC Data Source Administrator. For windows the command is: C:\Windows\SysWOW64\odbcad32.exe
Select the System DSN tab of the ODBC Data Source Admin tool.
Press the Add Button.
Select the MarkLogic SQL (X86) item.
Enter the configuration settings.
Press the Test button to verify.
Good to Go.
Image 1.
Image 2.
Steps to connect to Excel:
Launch Microsoft Excel. I recommend using Excel 2010.
Select the Data > From Other Sources > From Microsoft Query
Observe “Choose Data Source” dialog box and then select MarkLogicSQL.
Observe “Query Wizard” and choose the desired view.
Choose the desired columns.
Set the filter (if desired).
Set the desired sort.
Choose “Return Data to Microsoft Excel” and press ok.
Wait ~2 minutes for the spreadsheet to respond.
Observe results in the spreadsheet (see image 6).
Image 1.
Image 2.
Image 3.
Image 4.
Image 5.
Image 6.
Connecting to Tableau is very similar to Excel. I’ll cover more detail in the screencast but here’s the steps.
Steps to connect to Tableau:
Launch Tableau.
Press the Connect to Data link on the home page.
Select the “Other Databases (ODBC)” option on the lower left.
Observe the “Generic ODBC Connection” dialog box.
Select the MarkLogicSQL DSN.
Press the Connect button and observe the connection attributes appear.
Select the “main” schema (or equivalent).
Press the table search icon in the lower right and observe the option to select the table.
Select the table and observe the connection name.
Press the OK button.
Wait ~2 minutes for Tableau to respond.
Choose the “live data” option.
Observe dimensions and measure in the left side.
Drag “Number of Records” to “Rows”.
Drag the congress and president_name dimensions to “Columns”.
Choose the bar chart to observe a visual.
Image 1.
Image 2.
Image 3.
Image 4.
That should be enough to get started using MarkLogic and Tableau. The MarkLogic-Tableau combination eliminates the need for an expensive data warehouse.
More importantly, it provides much-needed data discovery: the ability to surface the data in easier and more meaningful ways.
I believe this discovery capability is especially useful for the US Congress Data and the many Open Government Data projects.
Anyway, that is all for now.
Please be sure to watch the screencast, where I provide some more detail.
Learn how to add authentication to a MarkLogic Roxy App.
In this blog post, I will show the code needed to add a simple authentication service.
In the previous post, I created the two-column layout where the top of the home page had a login form. At the time, the login form was not wired up.
For this post, I will wire up the login form. To do this, I’ll show how to build a simple authentication service that searches a user database, verifies the password, and generates an authentication token that will expire after 5 minutes.
This demo application will also show how to provide a simple RESTful API for search. This Search API will utilize the authentication token to restrict access to the search service.
The app will not show any role based restricted views. A more fully featured role based access control will be shown in a future post.
A zip file containing the source code for this demo application is posted here. => source code
Overall Approach
The solution will use the following items to authenticate a user and create an application token that gives the user access to the RESTful API for a 5 minute period.
Registration Form – used to create the user profile documents. This form should only be visible to an admin user but is currently visible to all for demo purposes.
User Directory – Each user will have a dedicated user directory in the MarkLogic database.
User Profile Document – User profile document (/users/janedoe/profile.xml) will reside in the user’s directory. It will contain the username/password. The username must be unique. It can also be used to store the user’s role and organization information. This demo will not utilize a party management solution but it can be extended to do so.
Session Document – A session document will be created when a user successfully logs in. The Session Document will be stored in the User Directory.
Authentication Token – the token will be stored in the session document. Each RESTful API request must include the token in its header.
ROXY Router – The checkpoint, or single point of entry, for each request. This is where the user token is verified for each RESTful API request. A key function called auth:findSessionByToken() verifies the session and dispatches the request if valid.
Login – If the username and password are valid, the token is created. If a token already exists and has not yet expired, then the same token will be used.
Session Expiration – Sessions expire 5 minutes after the initial login. The 5-minute duration is for demo purposes; a typical session expiration is 24 hours. Session expiration times are UTC (Coordinated Universal Time) based.
Logout – Terminates the session by deleting the session document that contains the token.
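The token and expiration pieces of the list above can be sketched in Python. The session layout and helper names here are hypothetical; the actual app implements this in XQuery against session documents.

```python
import secrets
from datetime import datetime, timedelta, timezone

SESSION_DURATION = timedelta(minutes=5)  # demo value from the post

def create_session(username):
    """Create a session record with a random token and a UTC expiration."""
    now = datetime.now(timezone.utc)
    return {
        "username": username,
        "token": secrets.token_hex(16),        # opaque authentication token
        "expiration": now + SESSION_DURATION,  # UTC-based, as in the post
    }

def is_valid(session, now=None):
    """A session is valid until its UTC expiration passes."""
    now = now or datetime.now(timezone.utc)
    return now < session["expiration"]

session = create_session("janedoe")
print(is_valid(session))
```

Keeping the expiration in UTC avoids timezone ambiguity when the client and server compare times, which is why the post standardizes on it.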
Related Notes:
Passwords are never part of the RESTful API transport except the Login API request.
“Remember Me” cookie – This solution can support a “Remember Me” cookie where the token is stored in the cookie and not the password. Remember Me cookies typically expire after 90 days which is longer than the token expiration period.
Verify API – A good approach for refreshing the token stored in a “90 day login cookie” is described here. A Verify API is typically used to verify the Username and Token. If they match then a new token is generated whenever the existing token has expired. The 90-day cookie web app will need to call the Verify API to refresh the token stored in the cookie.
Passwords are currently stored in the User Profile doc but they are MD5 hashed.
Current solution shows how to use the MarkLogic Search API with the user profile document, session document and token to provide an adequate security solution.
OAuth2 – Open Authorization version 2 (OAuth2) is a widely used protocol that provides a federated user profile solution. The key benefit for this example is that user passwords do not need to be stored in MarkLogic. However, this is a topic for a future post. The OAuth2 developer details are here: https://developers.google.com/accounts/docs/OAuth2
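As noted in the list above, the demo MD5-hashes the stored passwords. Here is a sketch of that check in Python; MD5 is considered weak today, so salted schemes like bcrypt or PBKDF2 are preferable in production.

```python
import hashlib

def md5_hash(password):
    """Hash a password with MD5, as the demo's user profile docs do.
    MD5 is shown only to mirror the post; prefer salted bcrypt/PBKDF2."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()

def check_password(stored_hash, candidate):
    """Compare the hash of a login attempt against the stored hash."""
    return md5_hash(candidate) == stored_hash

stored = md5_hash("password")
print(stored)
```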
1. Registration Form
The registration form above creates the user profile data, which is stored in a User Profile Document in the respective user directory. The Session Document is also stored in the User Directory.
4. Session Document:

The above document URI has the expiration date/time appended to it. Some JavaScript client code will use the appended expiration date/time to trigger a token refresh.
Roxy code that creates and deletes the session document is:
web login controller – /apps/controllers/appbuilder.xqy
5. Authentication Token:

This string needs to be added to the request header of each RESTful API request; otherwise, the response will be a “401 Unauthorized” error. The token must be prefixed with “X-Auth-Token” as follows.
The source code that extracts the X-Auth-Token value is in the router code. See line 87 of /src/app/lib/router.xqy.
let $token := xdmp:get-request-header("X-Auth-Token")
If using Firefox Poster tool, the header can be added as shown.
6. ROXY Router:
As discussed above, the router is the checkpoint for all http requests. It is the ideal place to apply a security policy logic such as:
Token check
Requests per minute
Maximum Requests per day
This post only handles the token check but this code can be extended to support all security policy logic.
The following XQuery code handles the token check. Please note that certain requests (e.g., login, ping) bypass the token check.
let $valid-request :=
  if (fn:not($config:SESSION-AUTHENTICATE)) then fn:true()
  else if (xs:string($controller) = ("ping")) then fn:true()
  else if (xs:string($controller) = ("login")) then fn:true()
  else if (xs:string($controller) = ("logout")) then fn:true()
  else if (xs:string($controller) = ("verify")) then fn:true()
  else
  (
    let $token := xdmp:get-request-header("X-Auth-Token")
    return
      if ($token) then
      (
        let $valid-session := auth:findSessionByToken($token)
        return
          if ($valid-session) then
          (
            fn:true(),
            auth:cacheSession($valid-session)
          )
          else
            fn:false()
      )
      else fn:false()
  )
7. Login Code:
The login code does the following:
Find user profile document – Searches the user profile documents using the username.
Check password – If a document with the username is found then the password is checked.
Find session document by username – If the password matches, the code looks for a session document with its respective expiration date.
Session Document – If the session has not expired, the current session document is used. If the session document has expired, it is deleted and a new session document containing a new Authentication Token is created.
The username and password are bundled into the request using the Authorization header.
So the request header will need this:
Authorization: Basic Z3J1c3NvOnBhc3N3b3Jk
The string after the word Basic is the base64-encoded username:password pair (note that base64 is an encoding, not encryption).
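The header value can be reproduced in a couple of lines of Python; decoding the example string above recovers the demo credentials.

```python
import base64

def basic_auth_header(username, password):
    """Build an HTTP Basic Authorization header value."""
    creds = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + creds.decode("ascii")

def decode_basic(header):
    """Recover username/password from a Basic Authorization header."""
    encoded = header.split("Basic ")[1]
    username, password = base64.b64decode(encoded).decode("utf-8").split(":", 1)
    return username, password

print(decode_basic("Basic Z3J1c3NvOnBhc3N3b3Jk"))
```

Because base64 is trivially reversible like this, the Authorization header must always travel over HTTPS.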
The code that extracts the username/password is in the login controller (/src/app/controllers/login.xqy).
declare function c:main() as item()*
{
  let $userPwd :=
    xdmp:base64-decode(
      fn:string(
        fn:tokenize(
          xdmp:get-request-header("Authorization"), "Basic ")[2]
      )
    )
  let $username :=
    fn:string(
      (xdmp:get-request-header("username"),
       fn:tokenize($userPwd, ":")[1])[1]
    )
  let $password :=
    fn:string(
      (xdmp:get-request-header("password"),
       fn:tokenize($userPwd, ":")[2])[1]
    )
  let $result := auth:login($username, $password)
  return
  (
    ch:add-value("res-code", xs:int($result/json:responseCode)),
    ch:add-value("res-message", xs:string($result/json:message)),
    ch:add-value("result", $result),
    ch:add-value(
      "res-header",
      element header {
        element Date {fn:current-dateTime()},
        element Content-Type
        {
          req:get("req-header")/content-type/fn:string()
        }
      }
    )
  )
};
The code that searches for a session document by username and its expiration date/time uses the following function. Please note the element range index query.
declare function auth:findSessionByUser($username)
{
  let $query :=
    cts:and-query((
      cts:directory-query(auth:sessionDirectory($username), "infinity"),
      cts:element-range-query(
        xs:QName("expiration"), ">",
        auth:getCurrentDateTimeUTC())
    ))
  let $uri := cts:uris("", ("document","limit=1"), $query)
  return
    fn:doc($uri)
};
8. Session Expiration:
The code to check the session expiration is invoked by the router code.
See line 89 in /src/app/lib/router.xqy
auth:findSessionByToken($token)
Here’s the code. Please note the element range query.
declare function auth:findSessionByToken($token as xs:string)
{
let $query :=
cts:and-query((
cts:element-attribute-value-query(
xs:QName("session"),
xs:QName("user-sid"),
$token
),
cts:element-range-query(
xs:QName("expiration"),">",
auth:getCurrentDateTimeUTC()
)
))
let $uri := cts:uris("",("document","limit=1"), $query )
let $doc := fn:doc($uri)
let $current := fn:current-dateTime()
return
if ($doc) then
(
let $expiration := xs:dateTime($doc//expiration)
let $diff := ($expiration - $current)
return
(
if($diff < ($auth:SESSION-TIMEOUT div 2) ) then
xdmp:node-replace(
$doc//expiration/text(),
text{fn:current-dateTime() + $auth:SESSION-TIMEOUT}
)
else (),
$doc/session
)
)
else ()
};
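The renewal rule in the middle of that function (extend the session only when less than half of the timeout remains) can be sketched in isolation; the names and the timeout value below are hypothetical:

```python
import time

SESSION_TIMEOUT = 3600  # seconds; a dayTimeDuration in the XQuery version

def check_session(session, now=None):
    """Return the session if still valid, None if expired.
    Renews the expiration when less than half of the timeout remains."""
    now = time.time() if now is None else now
    if session["expiration"] <= now:
        return None                    # expired: the caller deletes the document
    if session["expiration"] - now < SESSION_TIMEOUT / 2:
        session["expiration"] = now + SESSION_TIMEOUT  # sliding renewal
    return session
```

Renewing only past the halfway point avoids rewriting the session document on every request, which keeps the database write load down.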
9. Logout Code:
The logout code terminates the session by deleting the session document. Here’s the code.
declare function auth:logout($username as xs:string)
{
let $session := auth:findSessionByUser($username)
let $user := auth:userFind($username)
let $token :=
if($session) then
$session/session/@user-sid/fn:string()
else ()
let $__ := auth:clearSession($username)
return
<json:object type="object">
<json:responseCode>200</json:responseCode>
<json:message>Logout Successful - Token Deleted</json:message>
<json:authToken>{$token}</json:authToken>
<json:username>{$user/username/text()}</json:username>
<json:fullName>
{
fn:string-join
((($user/firstName,$user/firstname)[1],
($user/lastName,$user/lastname)[1]), " ")
}
</json:fullName>
</json:object>
};
Conclusion
I hope the authentication code described in this demo application has been informative. It shows an approach I recently used in a Roxy application.
I will be building on this solution in future posts. The most pressing next step is to add support for OAuth 2.0 and role-based restricted views. So stay tuned.
As always, please let me know in the comments section if any further clarifications or details are needed.
Once a highly searchable MarkLogic data repository has been created, a common request is to provide a Search Widget. A search widget is a simple web search box that can be added to any web page. Users can then use the search box to submit a search request to the MarkLogic database.
The client-side widget consists of HTML, CSS and JavaScript. The JavaScript calls a MarkLogic REST API asynchronously. The MarkLogic REST API processes the request and then sends the results back to the web page. The client-side JavaScript receives the results and renders them accordingly.
The search widget test page shown in this screen cast is on the following link.
The proper way to write custom snippet code is to override the transform-results function. The screencast shows the search option with the respective code. For more info, see Chapter 2 of the Search Developers Guide.
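For reference, the override is wired up in the search options XML by pointing transform-results at your own module; the function name, namespace URI, and module path below are placeholders, not the ones from the screencast:

```xml
<options xmlns="http://marklogic.com/appservices/search">
  <!-- apply names the XQuery function, ns its namespace, at its module path -->
  <transform-results apply="my-snippet"
                     ns="http://example.com/my-snippeter"
                     at="/app/lib/my-snippeter.xqy"/>
</options>
```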
4. Add a new Search Controller using Roxy command line
Use the Roxy create command to create a new controller with the respective view code.
See the Roxy command line help for the details.
ml create --help
ml create controller --help
5. Cross Domain AJAX using JSONP
Cross-domain AJAX requests can be a security risk and are restricted in most modern browsers. The secure workaround is to wrap the JSON string in a JavaScript function call. This approach is called JSON with Padding, or JSONP.
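On the server side, the padding step is just string concatenation: serialize the response to JSON and wrap it in a call to the callback function the client named in the request (the function and parameter names below are illustrative):

```python
import json

def jsonp(callback, payload):
    """Wrap a JSON-serializable payload in a JavaScript function call
    so the browser can load it via a <script> tag (JSONP)."""
    return "{}({});".format(callback, json.dumps(payload))

print(jsonp("handleResults", {"total": 2, "results": ["a", "b"]}))
# → handleResults({"total": 2, "results": ["a", "b"]});
```

The browser then executes the returned script, and the named callback receives the parsed results object.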
The over-the-wire transport from MarkLogic to a browser client is most efficiently done using the lightweight JSON format. The JSON structure used in this screencast consists of 3 parts:
The screencast builds from the previous session. The objective is to educate and raise awareness of MarkLogic’s agile database development capabilities.
There are many more topics to cover so please stay tuned.
Quick Walk Through of the MarkLogic Roxy Framework
Intro
For the past few months, I’ve been heads down working in the exciting Big Data software development world. I prefer to call it the post-relational document database world. My focus has been the technology around Big Data but there’s also an amazing social aspect. We see groups like Code For America, Data without Borders (DataKind) and NYC Open Data using Big Data to drive social change.
But social aspects are a topic for a later time. For now, it’s all about the code.
MarkLogic
Of course, MarkLogic is my preferred big data platform. I use it to de-normalize data so that the database engine and search engine can be the same thing.
For those unfamiliar with MarkLogic, it’s a document data store that was designed to handle extremely large amounts of unstructured data using the XML technology stack (XML, XQuery, XSLT, etc.).
The term “unstructured data” is often a topic for debate. We typically see structured data which I consider to be relational data or schema validated XML documents. There’s also semi-structured and unstructured data. One could argue that all data has structure. I typically consider semi-structured data to be partially validated XML or JSON documents. I consider unstructured data to be a set of XML or JSON documents that may have a common header but also have a payload that has no structure and can contain anything (xml, text, binary).
Document databases like MarkLogic, MongoDB, and Couchbase do away with the need to shred data into rows and columns. Aside from being unnecessary, it’s also not feasible when dealing with petabytes of data.
A key capability that MarkLogic provides is agile database development. A MarkLogic developer has the ability to ingest large amounts of data while having very little knowledge of the underlying data structures. Once the data is ingested, indexes can be added and data structures tweaked to provide the desired results. This agile development process ultimately leads to higher developer productivity, higher quality, and quicker time to market.
Semantic Linking
Document databases are also ideal for document linking. We’re just starting to realize the value of linking documents semantically. See Kurt Cagle’s Balisage 2012 paper for an interesting approach to linking documents by appending an “assertion node” to each document. The assertion node contains a “triple store” that’s used to describe the document’s relationship with other documents. These Semantic links can then be used for semantic reasoning which is a topic for another day.
Roxy Framework
Now that I gave some background, let’s talk about building MarkLogic apps using Roxy. I build most of my MarkLogic apps using the Roxy framework.
Roxy (RObust XquerY framework) is a well-designed Model-View-Controller framework for XQuery.
For now, MarkLogic’s primary API is XQuery. However, stay tuned. A rich Java and C# API is coming soon. Of course, there’s also the MarkLogic RESTful API called Corona.
I won’t drill down on the MVC mechanics right now, but it’s worth noting the following image. It shows the Ruby on Rails-style “convention over configuration” URL-to-MVC routing. I’ll discuss it further in a future screencast.
The screencast above shows a simple example of ingesting blog post data from the blog site Boing Boing. You can get a copy of the Boing Boing blog archive here. This archive file contains 63,999 blog posts.
I built two simple MarkLogic search apps using this data set.
App Builder Version – good for a quick demo of the search capability but difficult to extend.
This is a follow-up to Jesse Liberty’s Answering A C# Question blog post which compares two equivalent code examples to illustrate the value of interfaces:
Both examples use fictitious Notepad functionality with File and Twitter capability.
Example #1 does not use an interface and the line of code (LOC) count is 49.
Example #2 uses a Writer interface with Parameter Dependency Injection. The Notepad’s dependent objects (e.g., FileManager and TwitterManager) are passed as parameters (aka injected) to the worker method. In this case, the LOC count is 57.
It’s interesting to note that the interface example has slightly more code. The big win is looser coupling, which makes the code easier to maintain and test. I’ll have more about testability in a future post.
Example 1 – No Interface
using System.IO;
using System;
namespace Interfaces
{
class Program
{
static void Main( string[] args )
{
var np = new NotePad();
np.NotePadMainMethod();
}
}
class NotePad
{
private string text = "Hello world";
public void NotePadMainMethod()
{
Console.WriteLine("Notepad interacts with user.");
Console.WriteLine("Provides text writing surface.");
Console.WriteLine("User pushes a print button.");
Console.WriteLine("Notepad responds by asking ");
Console.WriteLine("FileManager to print file...");
Console.WriteLine("");
var fm = new FileManager();
fm.Print(text);
var tm = new TwitterManager();
tm.Tweet(text);
}
}
class FileManager
{
public void Print(string text)
{
Console.WriteLine("Pretends to backup old version file." );
Console.WriteLine("Then prints text sent to me." );
Console.WriteLine("printing {0}" , text );
var writer = new StreamWriter( @"HelloWorld.txt", true );
writer.WriteLine( text );
writer.Close();
}
}
class TwitterManager
{
public void Tweet( string text )
{
// write to twitter
Console.WriteLine("TwitterManager: " + text);
}
}
}
Example 2 – Writer Interface with Parameter Dependency Injection
using System.IO;
using System;
namespace Interfaces
{
class Program
{
static void Main( string[] args )
{
var np = new NotePad();
var fm = new FileManager();
var tm = new TwitterManager();
np.NotePadMainMethod(fm); // parameter injection
np.NotePadMainMethod(tm); // parameter injection
}
}
class NotePad
{
private string text = "Hello world";
public void NotePadMainMethod(Writer w)
{
Console.WriteLine("Notepad interacts with user.");
Console.WriteLine("Provides text writing surface.");
Console.WriteLine("User pushes a print button.");
Console.WriteLine("Notepad responds by asking ");
Console.WriteLine("the injected Writer to write...");
Console.WriteLine("");
w.Write(text);
}
}
// Writer Interface
interface Writer
{
void Write(string whatToWrite);
}
class FileManager : Writer // Implements the Writer interface
{
// Implements Write Interface Method
public void Write(string text)
{
// write to a file
Console.WriteLine("FileManager: " + text);
}
public void Print(string text)
{
Console.WriteLine("Pretends to backup old version file." );
Console.WriteLine("Then prints text sent to me." );
Console.WriteLine("printing {0}" , text );
var writer = new StreamWriter(@"HelloWorld.txt", true);
writer.WriteLine(text);
writer.Close();
}
}
class TwitterManager : Writer // Implements the Writer interface
{
// Implements Write Interface Method
public void Write( string text )
{
// write to Twitter stream
Console.WriteLine("TwitterManager: " + text);
}
}
}
There are many meteorological puns associated with the term “cloud computing”. The term represents a huge paradigm shift in the way backend software services are delivered.
This article covers much of the confusion associated with this ambiguous and overused phrase.
From a software developer perspective, the deployment model and elasticity are the key differentiators for cloud services.
I consider a cloud service to be a system that can host my software and hide the complexity of the server farm (e.g., routers, load balancers, SSL accelerators, etc.).
Amazon popularized the term “elastic cloud” when it launched its core cloud component, EC2 (Elastic Compute Cloud), back in August 2006. Elasticity is the infrastructure’s ability to automatically scale up and down as needed.
Elasticity is a big deal. It dramatically simplifies deployment and administration. It means that software developers don’t need to worry as much about infrastructure and can focus on coding the business process.
I consider Amazon, Google and Microsoft to be the big 3 cloud vendors. They have the elasticity expertise and server farms to support high volume cloud apps.
There are also Oracle, Salesforce.com, Rackspace and others, but in my opinion they are not generic cloud platforms.
For more about the non-developer cloud computing perspective, this Wikipedia article is a great reference.
Rob Enderle’s blog post has it right. 2010 will be the year and start of the cloud decade.
I’d like to take it a step further. The coming wave of ubiquitous “democratized” data services, with eager clients waiting to consume them, will take the internet to a dramatic new level. Microsoft’s “three screens and a cloud” vision speaks to this, but I believe it’s more about “four screens with data services”. I consider the data services to be more relevant: the cloud is the engine, but the 24/7 data services it provides will be life-changing and business-transforming.
Thanks to 3G, pending 4G and whatever comes after, the data services will come from highly reliable mobile data pipes that can be consumed while driving a car, riding a bicycle, at the doctor’s office or exercising at the gym.
The data services are democratized because the data being provided was once only available to a select few. Opening up the data to software developers and entrepreneurs can be a catalyst for positive change. The democratization of data trend is an unstoppable force that has the power to accelerate innovation to help solve some of the world’s problems and improve the quality of life for all.
The US Chief Information Officer, Vivek Kundra, understands the power of democratized data. He spearheaded a new web site for this called Data.gov. Another great example is the City of New York’s recent NYC Big Apps Contest. Microsoft is also getting involved with their new Dallas service.
Regarding the 4 screens, not 3, I expect the data services to be designed to support the following clients.
Listing the Car Dashboard may be a bit premature, but I expect to see at least 25 million “connected” cars sold during this coming decade. In less than a year, the Microsoft Ford Sync system has already exceeded 1 million sales in the US alone. These systems are just starting to go global with Kia’s UVO and Fiat’s Blue&Me systems. I expect “connected cars” consuming mission-critical data services to become the norm within 5 years.
Examples of mission-critical and revenue-generating data services are real-time, location-aware contextual ads, or electronic billboards. Some of this is already available in the Ford Sync system. I consider it the first commercially viable augmented reality solution. I expect Car Dashboard solutions to eventually provide windshield “heads-up display” driving directions that can also show the nearest movie listings, nearest Thai restaurants, closest hospitals, etc.
Aside from the Car Dashboard services, data services will come in many flavors. The more popular services will be the entertainment and news services:
Video – Netflix, Hulu, YouTube, Boxee
Music – iTunes, Pandora, Zune
Games – Xbox Live, Sony PlayStation Network
Books – Amazon Kindle, Nook, PDF, Audible
News – NY Times, CNN, MSNBC, ABC, CBS and all of the radio news feeds
Sports – ESPN
There will be Quality of Life services such as:
Health medical record services – HealthVault
Real-time Traffic – Calculate Quickest Travel Time
Air/Pollen Quality – What will the air/pollen quality be like on December 31st at 5:30 PM?
Population Growth versus Food Supply – Expected food supply in Somalia over the next 3 years.
Malaria Cases/Birth Rates/Life Expectancies by Region
Violence Levels in Iraq and by Region
Airport Security Wait Times – Security Check Wait Time at Gate #4 in LAX, etc.
Crime Stats by Region
High School Education Quality by Region
The list of potential services is endless.
Much of this data is already available but is not in a format that can be easily used or consumed by the 4 screens mentioned.
I’ll leave it to the developers and entrepreneurs to pioneer.
Ten years from now, I am confident that we’ll all be grateful for this new cloud computing/data services era.