Experience with MarkLogic at Elsevier

Bradley P. Allen and Darin McBeath, Elsevier Labs

Presentation at NoSQL Now 2011

San Jose, CA, USA

2011-08-25

• Elsevier, part of the Reed Elsevier group, is a world-leading publisher of scientific, technical and medical full-text literature. 7,000 employees in over 70 offices worldwide publish more than 2,500 journal titles and 11,000 online books.

Elsevier: who we are

• Global audience

– 15 million doctors, nurses and health professionals

– 10 million+ researchers in 4,500 institutes

– 5 million students

• Global community

– 7,000 editors

– 70,000 editorial board members

– 200,000 referees

– 500,000+ authors

• Global market

– North America

– Europe

– Asia-Pacific

• MarkLogic is used pervasively throughout our business

– Science and Technology

– Health Sciences

– Operations

• It is also a strategic technology for our sister Reed Elsevier organization LexisNexis

• We were an early adopter of MarkLogic

– Began working with MarkLogic in 2001

MarkLogic at Elsevier


• Company was committed to XML standard for content representation

• Vision of building Web services on top of XML content repositories

• Enabling new information solutions through reuse and mashup of existing journal and book content

• Relational technologies not a good fit

Motivations for MarkLogic adoption


| Business | Product | Description | MarkLogic Features Used | Launched |
|---|---|---|---|---|
| Science & Technology | Scopus | The largest abstract and citation database, containing both peer-reviewed research literature and quality web sources. Contains 50+ million abstracts. Original application that used MarkLogic. | Repository, Transformation, and some extensions (such as fast/accurate counting) | 2005 |
| Science & Technology | Scopus Custom Data | Offline version of Scopus | Repository, Transformation | 2007 |
| Science & Technology | EMBASE | Biomedical database with over 24 million indexed records | Repository, Search, Transformation | 2008 |
| Science & Technology | Methods Navigator | Task-specific search for experimental methods and protocols across 40,000 articles | Repository, Content Processing Framework | 2010 |
| Science & Technology | HazMat Navigator | Chemical safety database based on Bretherick's Handbook of Reactive Chemical Hazards, others | | 2010 |
| Science & Technology | SciVal Funding | Database of current research funding opportunities and award information | | 2010 |
| Health Sciences | Books | 1000 books supporting multiple Health Sciences applications (HESI, NursingConsult, MDConsult) | Repository, ability to present content quickly/easily by chapter, section, paragraph | 2006 |
| Health Sciences | Health Connect | Health Sciences journal platform | Repository, Search, Transformation | 2007 |
| Health Sciences | Linked Data Repository | 500,000 content enhancement metadata documents. 100% XQuery application. | Repository, XPath and a handful of proprietary extensions | 2011 |
| Operations | ConSyn | Batch retrieval service for 10+ million journal articles | Search, Repository, Task Server, Zip, Security, Transformation | 2010 |

MarkLogic applications at Elsevier
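
The "present content quickly/easily by chapter, section, paragraph" capability listed for the Books product can be illustrated with a minimal sketch. It is not Elsevier's schema or MarkLogic's API: the element names (`book`, `chapter`, `section`, `para`), the sample text, and the helper `fragment_text` are all hypothetical, and Python's standard-library ElementTree (which supports only a subset of XPath) stands in for the XQuery/XPath a MarkLogic application would use.

```python
import xml.etree.ElementTree as ET

# Hypothetical book markup; real publisher schemas are far richer.
BOOK_XML = """
<book title="Sample Handbook">
  <chapter id="c1" title="Basics">
    <section id="c1s1" title="Overview">
      <para>First paragraph.</para>
      <para>Second paragraph.</para>
    </section>
  </chapter>
  <chapter id="c2" title="Advanced">
    <section id="c2s1" title="Details">
      <para>Third paragraph.</para>
    </section>
  </chapter>
</book>
"""


def fragment_text(root, path):
    """Return the stripped text of every element matching an ElementTree path."""
    return [(el.text or "").strip() for el in root.findall(path)]


book = ET.fromstring(BOOK_XML)

# Chapter level: list the chapter titles.
chapter_titles = [ch.get("title") for ch in book.findall("chapter")]

# Section level: all paragraphs of one section in one chapter.
section_paras = fragment_text(book, "chapter[@id='c1']/section[@id='c1s1']/para")

# Paragraph level: the second paragraph of any section in chapter c1.
second_para = fragment_text(book, "chapter[@id='c1']/section/para[2]")

print(chapter_titles)
print(section_paras)
print(second_para)
```

Because the document hierarchy is the storage model, each level of granularity is a path expression rather than a join across tables, which is the fit with XML content that the slides describe.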


• MarkLogic brings us two big benefits

– Excellent fit with how we represent our content

– Tools (XQuery, XSLT) that support working with that content representation

• Those benefits come with challenges, some old, some new

– Developer productivity and adoption

– Standards and interoperability

– Software ecosystem

– Total solution fit

– TCO relative to other solutions

MarkLogic benefits and challenges at Elsevier


• XQuery can be a powerful language for rapid prototyping

– Can support writing complete web applications

• Experienced XQuery resources are difficult to find

– Especially relative to emerging JSON/Web framework resources

• Difficult to motivate developers committed to more mainstream frameworks, patterns, and languages

Developer productivity and adoption


• Vendors view XQuery in different ways: some view it as a query language, some as a transformation language, some as a programming language, all of the above, etc.

• These disparate views often lead to confusion in the community as to what XQuery really is

• XQuery interoperability is currently difficult, and it is doubtful that it will ever be practical beyond simple applications

– Groups such as EXPath will help tidy up some interfaces, but far more work needs to be done

– Elsevier Labs has investigated this issue in the context of the SciVal Showcase application using 4 different XQuery engines (MarkLogic, eXist, 28ms, and XQIB)

– This experiment highlighted the differences in the implementations (and the looseness of the W3C recommendation)

Standards and interoperability


• The ecosystem around XQuery and MarkLogic is lacking

– Few open source or third-party modules or language bindings are available

• The IDEs and debugging tools (while vastly improved) are still not on par with those for other query languages

Software ecosystem


• MarkLogic started out as an XML database solution

• It has added functionality (e.g. free text search) that has matured over the years

– This is a big part of its intended use at LexisNexis

• We struggle to understand the tradeoffs between a single solution vs. a composition of best-of-breed solutions (e.g. MarkLogic standalone vs. MarkLogic integrated with Solr)

Total solution fit


• Traditional enterprise software licensing can lead to significant costs

• NoSQL document database solutions with business models based on open source plus support services are an emerging alternative

• Still working on determining TCO tradeoff between the two in an enterprise context

TCO relative to other solutions


• NoSQL before it was cool

• But there are emerging differences between the document stores for traditional vs. Internet publishing

– XML/XQuery/XSLT vs. JSON/UnSQL/JavaScript

– Manual scale-out vs. automated scale-out

• Overhead of legacy standards can be a drag

– Where is XML in its adoption lifecycle?

– How does HTML5 fit in?

MarkLogic in the context of NoSQL in general


• Persisting as the foundation of content repository efforts

– XML legacy drives continued use

• Turnkey SaaS for publishing, newer NoSQL solutions competing for attention

– Solutions that layer XML processing and query technologies on top of non-XML NoSQL stores are beginning to appear (e.g. Ambrosoft’s XML DB project)

• Design choices driven by consumer Internet use cases may not yield as good a fit to information publishing as MarkLogic

– Emphasis on join-free queries and use-case-driven indexing

• We are watching to see how emerging best practices and design patterns from the consumer Internet that are good fits will be supported going forward

– Auto-scaling

– Web application frameworks

– HTML5

Future use of MarkLogic at Elsevier


• We were an early adopter of MarkLogic

• Over ten years it has become a mature product that we rely on extensively across our business

• The response of MarkLogic to the emergence of NoSQL document stores, non-XML document serializations and application design patterns from the consumer Internet is of keen interest to us

Summary
