Northeastern DB Class: Introduction to MarkLogic NoSQL, April 2016
TRANSCRIPT
MarkLogic
Matt Turner, CTO Media & Entertainment
Introduction to MarkLogic NoSQL
COPYRIGHT 2016 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.
Outline
  Something's Happening Here
  The Old and the New
  Data models
  Data access
  Discussion
Analysis, Operations, Access
DATA makes an impact
Stress on Traditional Data Approaches
Complexity
  Structured
  Unstructured
  Semi-structured
  Raw
  Streams of data
Constant change
  Agile analytics
  Fail-fast
Volume, Velocity, Variety
Volume
  Many months of system log files
  Every tweet
  Years of articles
  Relative to current size of operation
Velocity
  Streams of customer feedback to determine sentiment
  Real-time risk analysis
  Real-time Business Intelligence
Variety
  Database feeds
  Raw logs
  Web crawl data
  Articles
  Multi-media
ALSO: questions!
Examples
  Big Data: Gartner coined the "three Vs" description
  Data: Petabyte scale
  Nodes: Thousands
Leader Quadrant
Online Transaction Processing RDBMS
(May 2002)
Leader Quadrant
Operational DBMS
(Oct 2014)
Traditional Mainstays
Upstarts Storm the Field
MarkLogic: Best Operational Data Warehouse
(Aug 2014)
WHAT BUSINESSES WANT
A Unified, Actionable 360° View of Data
THE REALITY
Data Is In Silos
  Data is spread across disconnected databases
  M&A outpaces the speed of data integration
  Data needs to be delivered in real time
The Massive Cost of Integrating Data From Silos
  80% of time WASTED by data scientists just wrangling data
  60% OF THE COST of data warehouse projects is on ETL
  $36 Billion in SPENDING in 2015 on creating relational data silos
Sources:
80% of time spent by data scientists on just wrangling data
  "Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets." Steve Lohr. For Big-Data Scientists, Janitor Work Is Key Hurdle to Insights. The New York Times. August 17, 2014.
60% of the cost of data warehouse projects is on ETL
  In a report sponsored by Informatica, analysts at TDWI estimate between 60% and 80% of the total cost of a data warehouse project may be taken up by ETL software and processes.
$36 Billion in spending on database management systems in 2015
  Gartner. Forecast: Enterprise Software Markets, Worldwide, 2011-2018, 4Q14. 2014.
THE IT CHALLENGE
Relational Databases with ETL Sacrifice Agility, Timeliness, and Cost
  All future data needs must be predictable
  New SQL queries require database re-indexing
  Siloed database changes require ETL re-writes
[Diagram: OLTP, archives, data marts, warehouse, and reference data, connected to one another by ETL]
It's Complicated
[Diagram: OLTP, warehouse, data marts, archives, reference data, search, and unstructured sources (video, audio, signals, logs, streams, social, documents, messages, metadata), connected to one another by ETL]
The OLD: Let's Design the Application
(And pretend it's the 80s)
Let's Begin
Cast Members
Name  Hair Colour  Fulltime Employee?  Car type
Paul  Blond        Y
Alex  Auburn       Y                   Porsche
Dom   Black        Y                   Hummer
Column names: name, hr_colr, flltme_empl, car_tp
How many characters wide should this be? 8? 16? 32?
New Schema: Extend Ours!
name  hr_colr  flltme_empl  car_tp
Paul  Blond    Y
Alex  Auburn   Y            porsche
Dom   Black    Y            Hummer

house_road    town      city    postcode
11d Yonge Pk  Finsbury  London  N4 3NU
              Reading   London  N43

Hang on
If this table had 10k rows, issues?
  First create new big schema
  Then import rows across
  Delete old table?
  Maybe not, legacy programs might use it!
What if we want to select Road only?
  Split out again
More extensions?
  House name and number?
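The "create new big schema, then import rows across" steps can be sketched with Python's built-in sqlite3 module. This is a minimal illustration only: the table name cast_members_v2 and the column widths are assumptions made for the example, not something the slide specifies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Original schema from the slide (the column widths are illustrative guesses).
conn.execute(
    "CREATE TABLE cast_members ("
    " name VARCHAR(16), hr_colr VARCHAR(8),"
    " flltme_empl CHAR(1), car_tp VARCHAR(16))"
)
conn.executemany(
    "INSERT INTO cast_members VALUES (?, ?, ?, ?)",
    [("Paul", "Blond", "Y", None),
     ("Alex", "Auburn", "Y", "Porsche"),
     ("Dom", "Black", "Y", "Hummer")],
)

# Step 1: create the new, bigger schema (hypothetical name) with address columns.
conn.execute(
    "CREATE TABLE cast_members_v2 ("
    " name VARCHAR(16), hr_colr VARCHAR(8),"
    " flltme_empl CHAR(1), car_tp VARCHAR(16),"
    " house_road VARCHAR(32), town VARCHAR(16),"
    " city VARCHAR(16), postcode VARCHAR(8))"
)

# Step 2: import every existing row across; the new columns start out NULL.
conn.execute(
    "INSERT INTO cast_members_v2 (name, hr_colr, flltme_empl, car_tp)"
    " SELECT name, hr_colr, flltme_empl, car_tp FROM cast_members"
)

# Step 3: the old table is left in place, since legacy programs might use it.
migrated = conn.execute(
    "SELECT name, house_road FROM cast_members_v2 ORDER BY name"
).fetchall()
print(migrated)
```

With 10k rows the same three steps apply, only slower, and every later extension (splitting out the road, adding a house name) repeats them.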
There is another way!
Create a new table and point to it from the old one!
name  hr_colr  flltme_empl  car_tp   Address
Paul  Blond    Y
Alex  Auburn   Y            porsche
Dom   Black    Y            Hummer

house_road    town      city    postcode
11d Yonge Pk  Finsbury  London  N4 3NU
              Reading   London  N43
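A minimal sketch of "point to it from the old one", again using sqlite3. The address_id column, the addresses table name, and the pairing of Paul with the Yonge Pk address are assumptions for illustration; the slide does not say which cast member lives where.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The new table holds the address columns.
conn.execute(
    "CREATE TABLE addresses ("
    " id INTEGER PRIMARY KEY, house_road TEXT,"
    " town TEXT, city TEXT, postcode TEXT)"
)
# The old table gains one pointer column (hypothetical name: address_id).
conn.execute(
    "CREATE TABLE cast_members ("
    " name TEXT, hr_colr TEXT, flltme_empl TEXT, car_tp TEXT,"
    " address_id INTEGER REFERENCES addresses(id))"
)

conn.execute(
    "INSERT INTO addresses VALUES (1, '11d Yonge Pk', 'Finsbury', 'London', 'N4 3NU')"
)
# Assumption: Paul is linked to address 1; the others have no address yet.
conn.executemany(
    "INSERT INTO cast_members VALUES (?, ?, ?, ?, ?)",
    [("Paul", "Blond", "Y", None, 1),
     ("Alex", "Auburn", "Y", "Porsche", None),
     ("Dom", "Black", "Y", "Hummer", None)],
)

# Reading a cast member together with an address now requires a join.
rows = conn.execute(
    "SELECT c.name, a.house_road FROM cast_members c"
    " LEFT JOIN addresses a ON c.address_id = a.id ORDER BY c.name"
).fetchall()

# Selecting the road only is a query against the new table.
roads = conn.execute("SELECT house_road FROM addresses").fetchall()
print(rows, roads)
```

The pointer avoids widening the original table, at the price of a join on every read that needs the address.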
Now Let's Store Something More . . . Complicated
Transcript / Book
  Info
    Title = NL April 14
    Author = SNL Cast
  Section
    Chapter
      Page
        Paragraph = I love penguins because
      Page
        Paragraph = On the subject of food
    Chapter
      Page
  Section
    Chapter
    Chapter
    Chapter
      Paragraph
      Paragraph

title            author   Section
I love Penguins  S. Lion
Issues with Sections? How many columns?
Don't Forget Taxonomies
Hierarchical levels of metadata
  Fixed to a specific business purpose
  Can't be re-used in new contexts
  Each record can only be associated with one level
  How many category fields?
[Taxonomy tree diagram. Category: Feature, Series. Remaining nodes in slide order: Action, Drama, Comedy, Documentary; Cable, Broadcast; Drama, Comedy, Action, Drama, Family, Documentary]
Result
  Requires everything to be defined up front
  Data to be transformed and processed to fit the system
  Needs to be redone as information changes
  Costly to create, maintain and only captures part of the data!

Title  ProductionDate  Category  AssetType  Length
Film1  3/1/14          Feature   HD Master  2:40
Show1  6/4/13          Series    HD720      0:40
Film2  6/4/05          Feature   Archive    1:55

[Taxonomy tree diagram repeated from the previous slide, ending in a question mark]
Traditional Technology
[Diagram: OLTP, warehouse, data marts, archives, reference data, search, and unstructured sources (video, audio, signals, logs, streams, social, documents, messages, metadata), connected to one another by ETL]
*NOTE: We only did this little bit! Remember?
The NEW! Enter NoSQL

Category: Key-value
  Description:
    Persistent hash-table on steroids
    Typically no single modeling paradigm (e.g. columns can be primitives, data structures, binaries, etc.)
  Examples: Amazon DynamoDB, Redis, Riak

Category: Columnar
  Description:
    Similar to K-V in some ways
    Columns may be arranged in groups (families)
    Data types are usually the expected primitives
    Works well with value crunching (e.g. time series)
  Examples: HBase, Cassandra

Category: Document
  Description:
    URI-mapped (i.e. keyed) documents in lieu of rows
    Supports structured and unstructured content
    Nested context
  Examples: MarkLogic, MongoDB, Couchbase

Category: Graph
  Description:
    Deals with inter-object graphs
    Relationship oriented
    Think object cache (with pointers) on steroids
  Examples: Neo4J, AllegroGraph, InfoGrid
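As a rough illustration of how the four categories differ in shape, here is the same cast member held in plain Python structures standing in for each kind of store. Nothing here is a real database API; the keys, the column-family names, the URI, and the "drives" relationship are all made up for the sketch.

```python
import json

# Key-value: the store only knows the key; the value is an opaque blob.
key_value_store = {
    "cast:alex": b'{"name": "Alex", "hr_colr": "Auburn", "car_tp": "Porsche"}',
}

# Columnar: each row key maps to column families, each holding primitive columns.
column_store = {
    "alex": {
        "profile": {"name": "Alex", "hr_colr": "Auburn"},
        "assets": {"car_tp": "Porsche"},
    },
}

# Document: a URI maps to a nested document instead of a row.
document_store = {
    "/cast/alex.json": {
        "name": "Alex",
        "hair": {"colour": "Auburn"},
        "car": {"type": "Porsche"},
    },
}

# Graph: facts are edges between nodes.
graph_edges = [("Alex", "drives", "Porsche")]

# The same question ("what does Alex drive?") asked of each shape.
from_key_value = json.loads(key_value_store["cast:alex"])["car_tp"]
from_columns = column_store["alex"]["assets"]["car_tp"]
from_document = document_store["/cast/alex.json"]["car"]["type"]
from_graph = [o for s, p, o in graph_edges if s == "Alex" and p == "drives"][0]
print(from_key_value, from_columns, from_document, from_graph)
```

Only the key-value version has to decode the whole value before it can look inside; the other three expose structure the store itself could index.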
THE DESIRED SOLUTION
A Database That Integrates Data Better, Faster, with Less Cost
The MarkLogic Alternative
An Operational and Transactional Enterprise NoSQL Database

EASY TO GET DATA IN: Flexible Data Model
  Data ingested as is (no ETL)
  Structured and unstructured data
  Data and metadata together
  Adapts to changing data and changing data structures

EASY TO GET DATA OUT: Ask Anything Universal Index
  Index once and query endlessly
  Real-time and lightning fast
  Query across JSON, XML, text, geospatial, and semantic triples in one database

100% TRUSTED: Enterprise Ready
  Reliable data and transactions (100% ACID compliant)
  Out-of-the-box automatic failover, replication, and backup/recovery
  Enterprise-grade security and Common Criteria certified
The SNL App
Flexible Data Model
Schema-agnostic, structure-aware
  No need to define up front
  Matched to complex content and metadata data modeling
  Data is managed in its most accessible, natural form
    XML, JSON, RDF, geospatial
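A small sketch of what "schema-agnostic, structure-aware" can look like in practice, using the transcript from the earlier slide as a nested Python dict. The key names (info, sections, chapters, pages, paragraphs) and the added "network" field are assumptions for illustration; this is plain Python, not MarkLogic's API.

```python
import json

# The transcript as one nested document; no table design is needed first.
transcript = {
    "info": {"title": "NL April 14", "author": "SNL Cast"},
    "sections": [
        {"chapters": [
            {"pages": [
                {"paragraphs": ["I love penguins because"]},
                {"paragraphs": ["On the subject of food"]},
            ]},
        ]},
    ],
}

def find_values(node, key):
    """Yield every value stored under `key`, at any depth of the document."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)

# Structure-aware: ask for paragraphs without knowing how deeply they nest.
paragraphs = [p for group in find_values(transcript, "paragraphs") for p in group]

# Schema-agnostic: a new field (hypothetical) is added without any migration.
transcript["info"]["network"] = "NBC"

# The document round-trips through JSON unchanged.
round_trip = json.loads(json.dumps(transcript))
print(paragraphs, round_trip["info"]["network"])
```

Compare this with the relational version, where each new level of nesting raised the question of how many columns or tables to add.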
Instead of THIS
Do it like THIS!
Search and Query
Search to find answers in documents, relationships, and metadata
  Automatic indexing of every data value, text and data structure
  Specialized indexes for data values (analytics, facets, sorting), geospatial and triples
  All updated in the context of ACID transactions to ensure data integrity and real-time access
  Accessible via fully programmable search API with full-text search, type-ahead suggestions, facets, snippeting, highlighted search terms, proximity boosting, relevance ranking, and language support
Rich Query Capability
  JavaScript, XQuery, SPARQL
  In-database MapReduce
  Full-text Search
  Semantic Search
  Geospatial Search
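The idea of indexing every value, word, and piece of structure at load time can be sketched as a toy inverted index. This is a generic illustration under simplifying assumptions (flat documents, whitespace tokenizing, made-up URIs), not a description of how MarkLogic's index is implemented.

```python
from collections import defaultdict

# Three documents keyed by (hypothetical) URI, using the cast data from earlier.
documents = {
    "/cast/paul.json": {"name": "Paul", "hair": "Blond"},
    "/cast/alex.json": {"name": "Alex", "hair": "Auburn", "car": "Porsche"},
    "/cast/dom.json": {"name": "Dom", "hair": "Black", "car": "Hummer"},
}

# One index, built once at load time, covering structure, values, and words.
index = defaultdict(set)
for uri, doc in documents.items():
    for field, value in doc.items():
        index[("field", field)].add(uri)           # the document has this field
        index[("value", field, value)].add(uri)    # this field holds this value
        for word in str(value).lower().split():
            index[("word", word)].add(uri)         # this word appears somewhere

def search(*terms):
    """Return the URIs matching all terms by intersecting their posting sets."""
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []

has_car = search(("field", "car"))
hummer_owner = search(("field", "car"), ("word", "hummer"))
print(has_car, hummer_owner)
```

Because every query is a set intersection over the same index, a new kind of question does not require a new index to be designed first.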
Timing
Context
Who's Smarter?
VS
Do domestic dogs interpret pointing as a command? Animal Cognition (2012): 1-12, November 09, 2012
By Scheider, Linda; Kaminski, Juliane; Call, Josep; Tomasello, Michael
Context!
Machines Don't Get Context . . .
Manu Sporny, Founder/CEO - Digital Bazaar, Inc.
http://www.cambridgesemantics.com/semantic-university/what-is-linked-data
Enter Semantics!
Manu Sporny, Founder/CEO - Digital Bazaar, Inc.
http://www.cambridgesemantics.com/semantic-university/what-is-linked-data
Semantics
Enterprise triple store, document store, and database combined
  Store and query billions of facts and relationships
  Leverage ontologies for domain- and role-specific context access to data and documents
  Efficient metadata management with relationships to ontologies
  Standards-based for ease of use and integration
    RDF, SPARQL, and standard REST interfaces
Semantics to Model Relationships
Data model to manage relationships and link together data
  Triples describe single facts
  Collections of facts describe complex real-world scenarios
[Diagram: Chevy, SNL, and NBC linked by "isOn" relationships]
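A toy in-memory triple store makes the "single facts" idea concrete. Two points are assumptions here: the reading of the diagram as (Chevy, isOn, SNL) and (SNL, isOn, NBC), and the decision to treat isOn as transitive so that a third fact can be derived. Real triple stores would express this with RDF and SPARQL rather than Python tuples.

```python
# Each fact is one (subject, predicate, object) triple.
# Assumed reading of the slide's diagram:
triples = {
    ("Chevy", "isOn", "SNL"),
    ("SNL", "isOn", "NBC"),
}

def match(store, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

def infer_transitive(store, predicate):
    """Add (a, predicate, c) whenever (a, predicate, b) and (b, predicate, c) exist."""
    closed = set(store)
    changed = True
    while changed:
        changed = False
        for a, p1, b in list(closed):
            for c, p2, d in list(closed):
                if p1 == p2 == predicate and b == c:
                    derived = (a, predicate, d)
                    if derived not in closed:
                        closed.add(derived)
                        changed = True
    return closed

# Assumption: "isOn" is transitive, so Chevy isOn NBC can be derived.
inferred = infer_transitive(triples, "isOn")
chevy_facts = match(inferred, s="Chevy")
print(chevy_facts)
```

The stored collection never states that Chevy is on NBC; that fact emerges from combining two simpler ones, which is the sense in which collections of facts describe larger scenarios.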
Ontologies Instead of Categories
Actually model information as it is in the real world
  Not limited to a single purpose
  Ontologies for all categories of metadata
  Even impossible categories like fictional worlds
NoSQL and Semantics!
Real-Time Analytics
Range indexes can be used for
  Faceted search
  Aggregation and visualization
  Analytics, including custom user-defined functions
  Co-occurrence
  SQL, ODBC, and BI integration
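To show why a sorted index over values helps with facets, ranges, and aggregates, here is a toy range index built with Python's bisect module. The records reuse the asset table from the earlier slide, with titles read as Film1, Show1, and Film2 and lengths converted to minutes; that reading, and the whole structure, are assumptions for illustration rather than MarkLogic's implementation.

```python
import bisect
from collections import Counter

# Asset records (assumed reading of the earlier table; lengths in minutes).
assets = [
    {"title": "Film1", "category": "Feature", "minutes": 160},
    {"title": "Show1", "category": "Series", "minutes": 40},
    {"title": "Film2", "category": "Feature", "minutes": 115},
]

# A toy range index: (value, title) pairs kept in sorted order.
length_index = sorted((a["minutes"], a["title"]) for a in assets)
length_keys = [minutes for minutes, _ in length_index]

def titles_between(low, high):
    """Titles whose length lies in [low, high], found by binary search."""
    start = bisect.bisect_left(length_keys, low)
    end = bisect.bisect_right(length_keys, high)
    return [title for _, title in length_index[start:end]]

# Faceted search: counts per category value.
category_facets = Counter(a["category"] for a in assets)

# Aggregation straight from the indexed values.
total_minutes = sum(length_keys)

long_titles = titles_between(60, 180)
print(category_facets, total_minutes, long_titles)
```

The range lookup touches only the index, never the documents themselves, which is what makes this kind of analytics fast enough to run at query time.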
Scalability, Elasticity and Cloud
Massive enterprise scalability and elasticity
  Scale horizontally in clusters on commodity hardware to hundreds of nodes, petabytes of data, and billions of documents
  Process thousands of multi-document multi-statement transactions per second
  Start small and scale up or down to meet capacity and performance demands without over-provisioning or over-spending
  Fully cloud enabled for automated deployment and management on EC2
  Leverage dynamic configurations with Tiered Storage
[Diagram: a cluster of E-NODEs and D-NODEs]
  Add nodes to scale out
  Automated failover
Result: Enterprise-ready to power mission critical products
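"Add nodes to scale out" can be illustrated with a generic hash-partitioning sketch: each document URI is hashed to pick a data node, so adding a node lowers the share each node holds. This is a textbook technique shown under made-up node names and URIs; it is not a description of how MarkLogic actually assigns documents to D-nodes.

```python
import hashlib

def assign_node(uri, nodes):
    """Pick a node for a URI using a stable hash (generic modulo partitioning)."""
    digest = hashlib.sha256(uri.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

def distribute(uris, nodes):
    """Group URIs by the node they are assigned to."""
    placement = {node: [] for node in nodes}
    for uri in uris:
        placement[assign_node(uri, nodes)].append(uri)
    return placement

# Hypothetical document URIs and node names.
uris = [f"/docs/{i}.json" for i in range(1000)]
three_nodes = distribute(uris, ["d-node-1", "d-node-2", "d-node-3"])
four_nodes = distribute(uris, ["d-node-1", "d-node-2", "d-node-3", "d-node-4"])

for label, placement in (("3 nodes", three_nodes), ("4 nodes", four_nodes)):
    print(label, {node: len(docs) for node, docs in placement.items()})
```

A drawback of plain modulo partitioning is that changing the node count reassigns most documents; production systems typically use a rebalancing scheme that moves far fewer.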
Use Case: Deliver Better Information
Present information based on relationships
Go beyond traditional technology with depth of content
Drive efficiency using semantic approach to tagging
Use Case: Go Beyond Search
Concept instead of keyword search
Related content and information drive the content discovery and new interactions
  SNL40 continuous viewing
Dynamically tailored to the user's specific attributes or activity
Use Case: Integrate Data
Integrate data across the automoti
Bob Pilz
Taxonomy Manager
Mitchell1
Semantics-driven search
[Graph diagram]
Nodes:
  Talent: Kristen Wiig
  Episode 4: Anne Hathaway and Killers
  Character: Maharelle Sister
  Season 34
  Segment: The Lawrence Welk Show
  Date: 10/4/08
  Era
Relationships: Acted in, Part of, Played, Aired on, Includes
Intelligent recommendation engine