Experience with MarkLogic at Elsevier
DESCRIPTION
Elsevier is the world's largest publisher of scientific, technical and medical (STM) content. An early adopter of XML as a standard representation for content, Elsevier has used MarkLogic in the development of a range of information access and discovery solutions for its customers. This presentation will cover Elsevier's experience with XML-centric content management systems in general and MarkLogic's technology in particular, describing Elsevier's initial adoption and uptake of the technology, its current use within the Elsevier suite of online products and solutions, and opportunities for future use. We will describe design patterns for content repositories within a publishing context that have emerged during our use of the technology, and touch on a number of issues we have encountered, including XQuery and its adoption within the developer community, the challenges facing XML from new representations for documents and metadata such as JSON and RDF, and the delivery of search applications based on XML infrastructure.
TRANSCRIPT
Experience with MarkLogic at Elsevier
Bradley P. Allen and Darin McBeath, Elsevier Labs
Presentation at NoSQL Now 2011
San Jose, CA, USA
2011-08-25
• Elsevier, part of the Reed Elsevier group, is a world-leading publisher of scientific, technical and medical full-text literature. 7,000 employees in over 70 offices worldwide publish more than 2,500 journal titles and 11,000 online books.
Elsevier: who we are
• Global audience
– 15 million doctors, nurses and health professionals
– 10 million+ researchers in 4,500 institutes
– 5 million students
• Global community
– 7,000 editors
– 70,000 editorial board members
– 200,000 referees
– 500,000+ authors
• Global market
– North America
– Europe
– Asia-Pacific
2
• MarkLogic is used pervasively throughout our business
– Science and Technology
– Health Sciences
– Operations
• It is also a strategic technology for our sister Reed Elsevier organization LexisNexis
• We were an early adopter of MarkLogic
– Began working with MarkLogic in 2001
MarkLogic at Elsevier
3
• Company was committed to XML standard for content representation
• Vision of building Web services on top of XML content repositories
• Enabling new information solutions through reuse and mashup of existing journal and book content
• Relational technologies not a good fit
Motivations for MarkLogic adoption
4
Business | Product | Description | MarkLogic Features Used | Launched
Science & Technology | Scopus | The largest abstract and citation database, containing both peer-reviewed research literature and quality web sources. Contains 50+ million abstracts. Original application that used MarkLogic. | Repository, Transformation, and some extensions (such as fast/accurate counting) | 2005
Science & Technology | Scopus Custom Data | Offline version of Scopus | Repository, Transformation | 2007
Science & Technology | EMBASE | Biomedical database with over 24 million indexed records | Repository, Search, Transformation | 2008
Science & Technology | Methods Navigator | Task-specific search for experimental methods and protocols across 40,000 articles | Repository, Content Processing Framework | 2010
Science & Technology | HazMatNavigator | Chemical safety database based on Bretherick's Handbook of Reactive Chemical Hazards, others | Repository, Content Processing Framework | 2010
Science & Technology | SciVal Funding | Database of current research funding opportunities and award information | Repository, Content Processing Framework | 2010
Health Sciences | Books | 1000 books supporting multiple Health Sciences applications (HESI, NursingConsult, MDConsult) | Repository, ability to present content quickly/easily by chapter, section, paragraph | 2006
Health Sciences | Health Connect | Health Sciences journal platform | Repository, Search, Transformation | 2007
Health Sciences | Linked Data Repository | 500,000 content enhancement metadata documents. 100% XQuery application. | Repository, XPath and a handful of proprietary extensions | 2011
Operations | ConSyn | Batch retrieval service for 10+ million journal articles | Search, Repository, Task Server, Zip, Security, Transformation | 2010
MarkLogic applications at Elsevier
5
• MarkLogic brings us two big benefits
– Excellent fit with how we represent our content
– Tools (XQuery, XSLT) that support working with that content representation
• Those benefits come with challenges, some old, some new
– Developer productivity and adoption
– Standards and interoperability
– Software ecosystem
– Total solution fit
– TCO relative to other solutions
MarkLogic benefits and challenges at Elsevier
6
• XQuery can be a powerful language for rapid prototyping
– Can support writing complete web applications
• Experienced XQuery resources are difficult to find
– Especially relative to emerging JSON/Web framework resources
• Difficult to motivate developers committed to more mainstream frameworks, patterns, and languages
Developer productivity and adoption
7
• Vendors view XQuery in different ways: some view it as a query language, some as a transformation language, some as a programming language, all of the above, etc.
• These disparate views often lead to confusion in the community as to what XQuery really is
• XQuery interoperability is currently difficult, and it is doubtful that it will ever extend beyond simple applications
– Groups such as EXPath will help tidy up some interfaces, but there is far more work that needs to be done
– Elsevier Labs has investigated this issue in the context of the SciVal Showcase application using 4 different XQuery engines (MarkLogic, eXist, 28ms, and XQIB)
– This experiment highlighted the differences in the implementations (and the looseness of the W3C recommendation)
Standards and interoperability
8
• The ecosystem around XQuery and MarkLogic is lacking
– Not a tremendous amount of open source and/or 3rd party modules or language bindings
• The IDEs and debugging tools (while vastly improved) are still not on par with those for other query languages
Software ecosystem
9
• MarkLogic started out as an XML database solution
• It has added functionality (e.g. free text search) and matured over the years
– This is a big part of its intended use at LexisNexis
• We struggle to understand the tradeoffs between a single solution vs. a composition of best-of-breed solutions (e.g. MarkLogic standalone vs. MarkLogic integrated with Solr)
Total solution fit
10
• Traditional enterprise software licensing can lead to significant costs
• NoSQL document database solutions with business models based on open source plus support services are an emerging alternative
• Still working on determining TCO tradeoff between the two in an enterprise context
TCO relative to other solutions
11
• NoSQL before it was cool
• But there are emerging differences between the document stores for traditional vs. Internet publishing
– XML/XQuery/XSLT vs. JSON/UnSQL/JavaScript
– Manual scale-out vs. automated scale-out
• Overhead of legacy standards can be a drag
– Where is XML in its adoption lifecycle?
– How does HTML5 fit in?
MarkLogic in the context of NoSQL in general
12
• Persisting as foundation of content repository efforts
– XML legacy drives continued use
• Turnkey SaaS for publishing, newer NoSQL solutions competing for attention
– Solutions that layer XML processing and query technologies on top of non-XML NoSQL stores are beginning to appear (e.g. Ambrosoft’s XML DB project)
• Design choices driven by consumer Internet use cases may not yield as good a fit to information publishing as MarkLogic
– Emphasis on join-free queries and use-case-driven indexing
• We are watching to see how emerging best practices and design patterns associated with consumer Internet that are good fits are supported moving forward
– Auto-scaling
– Web application frameworks
– HTML5
Future use of MarkLogic at Elsevier
13
• We were an early adopter of MarkLogic
• Over ten years it has become a mature product that we rely on extensively across our business
• The response of MarkLogic to the emergence of NoSQL document stores, non-XML document serializations and application design patterns from the consumer Internet is of keen interest to us
Summary
14