Geoffrey Hendrey @geoffhendrey
Architecture for real-time ad-hoc query on distributed filesystems
Motivation
• Big Data is more opaque than small data
  – Spreadsheets choke
  – BI tools can’t scale
  – Small samples often fail to replicate issues
• Engineers, data scientists, analysts need:
  – Faster “time to answer” on Big Data
  – Rapid “find, quantify, extract”
• Solve “I don’t know what I don’t know”
• This is NOT about looking up items in a product catalog (i.e. not a consumer search problem)
Scaling search with classic sharding
Classic “side system” approach
• Definition of KLUDGE: “a system and especially a computer system made up of poorly matched components” –Merriam-Webster

[Diagram: Hadoop cluster connected to a separate search cluster, with an undefined data path between them]
Classic “search toolkit”
• Built around fulltext use case
• Inverted indexes optimized for on-the-fly ranking of results
  – TF-IDF
  – Okapi BM-25
• Yet never able to fully realize Google-style search capability
• Issues:
  – Phrase detection
  – Pseudo-synonymy
  – Open-loop architecture
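The ranking formulas named above can be illustrated with a minimal sketch (the corpus and document names here are hypothetical): an inverted index records, per term, which documents contain it, and TF-IDF weights each posting by term frequency scaled against how rare the term is across the corpus.

```python
import math
from collections import defaultdict

# Hypothetical toy corpus
docs = {
    "d1": "hadoop cluster search cluster",
    "d2": "search index ranking",
    "d3": "hadoop mapreduce job",
}

# Build an inverted index: term -> {doc: term frequency}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term in text.split():
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

def tf_idf(term, doc_id):
    """TF-IDF weight: term frequency times inverse document frequency."""
    postings = index.get(term, {})
    if doc_id not in postings:
        return 0.0
    tf = postings[doc_id]
    idf = math.log(len(docs) / len(postings))
    return tf * idf

print(tf_idf("cluster", "d1"))  # "cluster" occurs twice in d1, in 1 of 3 docs
```

Okapi BM-25 refines the same idea with term-frequency saturation and document-length normalization; the posting lists are the shared substrate either way.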
Big data ad-hoc query
• Not typically a fulltext “document search” problem
• Data is structured, mixed structured, and denormalized
  – Log lines
  – JSON records
  – CSV files
  – Hadoop native formats (SequenceFile)
• Ranking is explicit (ORDER BY), not relevance based
• Sometimes “needle in haystack” (support, debugging)
• Sometimes “haystack in haystack” (summary analytics, segmentation)
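The two query shapes above can be sketched over a few hypothetical CSV log records: the “needle” case is an explicit filter, and the “haystack in haystack” case is an explicit ORDER BY, with no relevance scoring anywhere.

```python
import csv
import io

# Hypothetical log records in CSV form: (host, status, latency_ms)
raw = """web1,200,12
web2,500,340
web3,200,9
web2,500,410
"""
rows = list(csv.reader(io.StringIO(raw)))

# Needle in haystack: filter down to the failing requests (support, debugging)
errors = [r for r in rows if r[1] == "500"]

# Haystack in haystack: explicit ranking, ORDER BY latency_ms DESC
ranked = sorted(rows, key=lambda r: int(r[2]), reverse=True)

print([r[0] for r in errors])  # hosts that returned 500s
print(ranked[0])               # slowest request overall
```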
Dremel MPP query execution tree
Finer points of Dremel architecture
• MapReduce friendly
• In-situ approach is DFS friendly
• Excels at aggregation; not so much for needle-in-haystack
• Column storage format accelerates MapReduce (less extraneous data pushed through)
• But in some regards still a “side system”
• Applications must explicitly store their data in a columnar format
• “Massive” is both a benefit and a hazard
  – Complex (operationally and WRT query execution)
  – Queries can execute quickly…on huge clusters
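Why column storage pushes less extraneous data through a scan can be shown with a minimal sketch (the table contents are hypothetical): a query that touches one column reads only that column’s values, while a row-oriented scan drags every field of every row along.

```python
# Hypothetical table: 3 columns, 5 rows
rows = [
    ("alice", 34, "athens"),
    ("bob",   29, "boston"),
    ("carol", 41, "athens"),
    ("dave",  25, "denver"),
    ("erin",  38, "athens"),
]

# Row-oriented scan: every field of every row passes through the scan,
# even though the query only needs the city column.
row_fields_read = sum(len(r) for r in rows)

# Column-oriented storage: each column stored contiguously, so a query
# on "city" reads only that column's values.
columns = {
    "name": [r[0] for r in rows],
    "age":  [r[1] for r in rows],
    "city": [r[2] for r in rows],
}
col_fields_read = len(columns["city"])

athens_count = sum(1 for c in columns["city"] if c == "athens")
print(row_fields_read, col_fields_read, athens_count)
```

The gap widens with column count; this is the point of Dremel-style formats, at the cost of requiring applications to store data columnar in the first place.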
Crawled In-Situ Index Architecture
[Diagram: MapReduce jobs over HDFS crawl the data to build an in-situ index, which a simple search application queries inside the Hadoop cluster]
Benefits to crawled in-situ index
• No changes to application data format
  – CSV
  – JSON
  – SequenceFile
• Clear “separation of concerns” between data and index
• Indexes become “disposable”: easily built, easily thrown away
• There is no “side system” that needs to be maintained
• Use the MapReduce “hammer” to pound a nail
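The MapReduce “hammer” here can be sketched minimally (file names and contents are hypothetical): the mapper walks data files left in their native format and emits (token, (file, byte offset)) pairs, and the reducer groups the postings into a disposable index that points back into the originals.

```python
from collections import defaultdict

# Hypothetical in-place data files; in production these would live in HDFS/S3.
files = {
    "logs/part-0.csv": "web1,200,athens\nweb2,500,boston\n",
    "logs/part-1.csv": "web3,200,athens\n",
}

def map_phase(path, content):
    """Mapper: emit (token, (file, byte offset of its line)) for each field."""
    offset = 0
    for line in content.splitlines(keepends=True):
        for token in line.strip().split(","):
            yield token, (path, offset)
        offset += len(line)

def reduce_phase(pairs):
    """Reducer: group postings by token into the in-situ index."""
    index = defaultdict(list)
    for token, posting in pairs:
        index[token].append(posting)
    return index

pairs = [kv for path, content in files.items() for kv in map_phase(path, content)]
index = reduce_phase(pairs)
print(index["athens"])  # postings point back into the original files
```

Because the index holds only pointers, rebuilding it is just another MapReduce job, which is what makes it “easily built, easily thrown away.”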
Architect for Elasticity
[Diagram: application crawls data in AWS S3 via JetS3t over HTTP, building the index with Elastic MapReduce on EC2 m1.large instances]
Interesting: you don’t actually need to have Hadoop installed…
Declarative Crawl Indexing
[Diagram: same crawled in-situ index architecture, with the crawl reading declarative instructions from a parse.json file alongside the data in HDFS]

parse.json: {"filter":"column[4]==\"athens\""}
• Indexer reads declarative instructions from in-situ file
• “Pull” vs. traditional “push” indexing approach
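A minimal sketch of the pull model, reusing the filter syntax shown in the parse.json example (the filter grammar and record contents here are assumptions for illustration): the crawler reads the declarative rule from the in-situ file and applies it to each record it encounters, rather than applications pushing records into an indexer.

```python
import json
import re

# Declarative instructions as they might appear in an in-situ parse.json file
instructions = json.loads('{"filter": "column[4]==\\"athens\\""}')

# Parse the tiny filter language column[N]=="value" (hypothetical grammar)
m = re.fullmatch(r'column\[(\d+)\]=="(.*)"', instructions["filter"])
col, value = int(m.group(1)), m.group(2)

# The crawler pulls the rule and evaluates it against each record
records = [
    "1,web1,200,12,athens",
    "2,web2,500,340,boston",
    "3,web3,200,9,athens",
]
matched = [r for r in records if r.split(",")[col] == value]
print(matched)
```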
Thin index
• Index size is small because data is a holistic part of the system
• Data does not need to be “put into” the search system and replicated in the index

[Diagram: the in-situ index built by the crawl points back at the data itself rather than holding a copy of it]
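A thin index can be sketched minimally (the data and keys are hypothetical): the index stores only byte offsets into the original file, and a lookup resolves hits by seeking back into the data itself, so nothing is replicated into the search system.

```python
import io

# Original data stays in place; the index holds only byte offsets into it.
data = "web1,200,athens\nweb2,500,boston\nweb3,200,athens\n"

# Thin index: key -> byte offsets of matching lines (no record copies)
index = {}
offset = 0
for line in data.splitlines(keepends=True):
    key = line.strip().split(",")[2]
    index.setdefault(key, []).append(offset)
    offset += len(line)

def lookup(key):
    """Resolve hits by seeking into the original data, not the index."""
    f = io.StringIO(data)  # stand-in for an HDFS file handle
    results = []
    for off in index.get(key, []):
        f.seek(off)
        results.append(f.readline().strip())
    return results

print(lookup("athens"))
```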
Lazy data loading
[Diagram: the execution runtime lazily pulls data and index blocks from HDFS/MapReduce on demand, through an LRU index cache]
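The lazy-pull-plus-LRU-cache pattern in the diagram can be sketched as follows (block names and contents are hypothetical): index blocks are fetched from the DFS only on first use, cached, and evicted least-recently-used first when the cache is full.

```python
from collections import OrderedDict

# Hypothetical remote index blocks living on the DFS
remote_blocks = {"blk-a": [0, 16], "blk-b": [32], "blk-c": [48, 64]}
fetches = []  # track which blocks were actually pulled over the wire

class LRUIndexCache:
    """Pull index blocks only on first use; evict least recently used."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, block_id):
        if block_id in self.cache:
            self.cache.move_to_end(block_id)  # mark as recently used
            return self.cache[block_id]
        fetches.append(block_id)              # lazy pull from remote storage
        block = remote_blocks[block_id]
        self.cache[block_id] = block
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used
        return block

cache = LRUIndexCache(capacity=2)
cache.get("blk-a"); cache.get("blk-b")
cache.get("blk-a")   # cache hit, no remote fetch
cache.get("blk-c")   # evicts blk-b, the least recently used
print(fetches)
```

The same structure applies to the data blocks themselves, which is what keeps the execution runtime from touching anything the query never asks for.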
Column Oriented Approach
Contact Info
Email: [email protected]
Private Beta: http://vertascale.com