Geoffrey Hendrey @geoffhendrey
Architecture for real-time ad-hoc query on distributed filesystems
Motivation
• Big Data is more opaque than small data
  – Spreadsheets choke
  – BI tools can’t scale
  – Small samples often fail to replicate issues
• Engineers, data scientists, analysts need:
  – Faster “time to answer” on Big Data
  – Rapid “find, quantify, extract”
• Solve “I don’t know what I don’t know”
• This is NOT about looking up items in a product catalog (i.e. not a consumer search problem)
Scaling search with classic sharding
Classic “side system” approach
• Definition of KLUDGE: “a system and especially a computer system made up of poorly matched components” –Merriam-Webster

[Diagram: Hadoop cluster connected to a separate search cluster, with an undefined data path between them]
Classic “search toolkit”
• Built around fulltext use case
• Inverted indexes optimized for on-the-fly ranking of results
  – TF-IDF
  – Okapi BM-25
• Yet never able to fully realize Google-style search capability
• Issues:
  – Phrase detection
  – Pseudo-synonymy
  – Open-loop architecture
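The ranking formulas named above can be illustrated with a minimal sketch (the corpus and document names here are hypothetical): an inverted index records, per term, which documents contain it, and TF-IDF weights each posting by term frequency scaled against how rare the term is across the corpus.

```python
import math
from collections import defaultdict

# Hypothetical toy corpus
docs = {
    "d1": "hadoop cluster search cluster",
    "d2": "search index ranking",
    "d3": "hadoop mapreduce job",
}

# Build an inverted index: term -> {doc: term frequency}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term in text.split():
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

def tf_idf(term, doc_id):
    """TF-IDF weight: term frequency times inverse document frequency."""
    postings = index.get(term, {})
    if doc_id not in postings:
        return 0.0
    tf = postings[doc_id]
    idf = math.log(len(docs) / len(postings))
    return tf * idf

print(tf_idf("cluster", "d1"))  # "cluster" occurs twice in d1, in 1 of 3 docs
```

Okapi BM-25 refines the same idea with term-frequency saturation and document-length normalization; the posting lists are the shared substrate either way.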
Big data ad-hoc query
• Not typically a fulltext “document search” problem
• Data is structured, mixed structured, and denormalized
  – Log lines
  – JSON records
  – CSV files
  – Hadoop native formats (SequenceFile)
• Ranking is explicit (ORDER BY), not relevance based
• Sometimes “needle in haystack” (support, debugging)
• Sometimes “haystack in haystack” (summary analytics, segmentation)
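The two query shapes above can be sketched over a few hypothetical CSV log records: the “needle” case is an explicit filter, and the “haystack in haystack” case is an explicit ORDER BY, with no relevance scoring anywhere.

```python
import csv
import io

# Hypothetical log records in CSV form: (host, status, latency_ms)
raw = """web1,200,12
web2,500,340
web3,200,9
web2,500,410
"""
rows = list(csv.reader(io.StringIO(raw)))

# Needle in haystack: filter down to the failing requests (support, debugging)
errors = [r for r in rows if r[1] == "500"]

# Haystack in haystack: explicit ranking, ORDER BY latency_ms DESC
ranked = sorted(rows, key=lambda r: int(r[2]), reverse=True)

print([r[0] for r in errors])  # hosts that returned 500s
print(ranked[0])               # slowest request overall
```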
Dremel MPP query execution tree
Finer points of Dremel architecture
• MapReduce friendly
• In-situ approach is DFS friendly
• Excels at aggregation; not so much for needle-in-haystack
• Column storage format accelerates MapReduce (less extraneous data pushed through)
• But in some regards still a “side system”
• Applications must explicitly store their data in a columnar format
• “Massive” is both a benefit and a hazard
  – Complex (operationally and WRT query execution)
  – Queries can execute quickly…on huge clusters
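Why column storage pushes less extraneous data through a scan can be shown with a minimal sketch (the table contents are hypothetical): a query that touches one column reads only that column’s values, while a row-oriented scan drags every field of every row along.

```python
# Hypothetical table: 3 columns, 5 rows
rows = [
    ("alice", 34, "athens"),
    ("bob",   29, "boston"),
    ("carol", 41, "athens"),
    ("dave",  25, "denver"),
    ("erin",  38, "athens"),
]

# Row-oriented scan: every field of every row passes through the scan,
# even though the query only needs the city column.
row_fields_read = sum(len(r) for r in rows)

# Column-oriented storage: each column stored contiguously, so a query
# on "city" reads only that column's values.
columns = {
    "name": [r[0] for r in rows],
    "age":  [r[1] for r in rows],
    "city": [r[2] for r in rows],
}
col_fields_read = len(columns["city"])

athens_count = sum(1 for c in columns["city"] if c == "athens")
print(row_fields_read, col_fields_read, athens_count)
```

The gap widens with column count; this is the point of Dremel-style formats, at the cost of requiring applications to store data columnar in the first place.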
Crawled In-Situ Index Architecture
[Diagram: MapReduce jobs over HDFS crawl the data to build an in-situ index, which a simple search application queries inside the Hadoop cluster]
Benefits to crawled in-situ index
• No changes to application data format
  – CSV
  – JSON
  – SequenceFile
• Clear “separation of concerns” between data and index
• Indexes become “disposable”: easily built, easily thrown away
• There is no “side system” that needs to be maintained
• Use the MapReduce “hammer” to pound a nail
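The MapReduce “hammer” here can be sketched minimally (file names and contents are hypothetical): the mapper walks data files left in their native format and emits (token, (file, byte offset)) pairs, and the reducer groups the postings into a disposable index that points back into the originals.

```python
from collections import defaultdict

# Hypothetical in-place data files; in production these would live in HDFS/S3.
files = {
    "logs/part-0.csv": "web1,200,athens\nweb2,500,boston\n",
    "logs/part-1.csv": "web3,200,athens\n",
}

def map_phase(path, content):
    """Mapper: emit (token, (file, byte offset of its line)) for each field."""
    offset = 0
    for line in content.splitlines(keepends=True):
        for token in line.strip().split(","):
            yield token, (path, offset)
        offset += len(line)

def reduce_phase(pairs):
    """Reducer: group postings by token into the in-situ index."""
    index = defaultdict(list)
    for token, posting in pairs:
        index[token].append(posting)
    return index

pairs = [kv for path, content in files.items() for kv in map_phase(path, content)]
index = reduce_phase(pairs)
print(index["athens"])  # postings point back into the original files
```

Because the index holds only pointers, rebuilding it is just another MapReduce job, which is what makes it “easily built, easily thrown away.”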
Architect for Elasticity
[Diagram: application crawls data in AWS S3 via JetS3t over HTTP, building the index with Elastic MapReduce on EC2 m1.large instances]
Interesting: you don’t actually need to have Hadoop installed…
Declarative Crawl Indexing
[Diagram: same crawled in-situ index architecture, with the crawl reading declarative instructions from a parse.json file alongside the data in HDFS]

parse.json: {"filter":"column[4]==\"athens\""}
• Indexer reads declarative instructions from in-situ file
• “Pull” vs. traditional “push” indexing approach
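A minimal sketch of the pull model, reusing the filter syntax shown in the parse.json example (the filter grammar and record contents here are assumptions for illustration): the crawler reads the declarative rule from the in-situ file and applies it to each record it encounters, rather than applications pushing records into an indexer.

```python
import json
import re

# Declarative instructions as they might appear in an in-situ parse.json file
instructions = json.loads('{"filter": "column[4]==\\"athens\\""}')

# Parse the tiny filter language column[N]=="value" (hypothetical grammar)
m = re.fullmatch(r'column\[(\d+)\]=="(.*)"', instructions["filter"])
col, value = int(m.group(1)), m.group(2)

# The crawler pulls the rule and evaluates it against each record
records = [
    "1,web1,200,12,athens",
    "2,web2,500,340,boston",
    "3,web3,200,9,athens",
]
matched = [r for r in records if r.split(",")[col] == value]
print(matched)
```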
Thin index
• Index size is small because data is a holistic part of the system
• Data does not need to be “put into” the search system and replicated in the index

[Diagram: the in-situ index built by the crawl points back at the data itself rather than holding a copy of it]
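A thin index can be sketched minimally (the data and keys are hypothetical): the index stores only byte offsets into the original file, and a lookup resolves hits by seeking back into the data itself, so nothing is replicated into the search system.

```python
import io

# Original data stays in place; the index holds only byte offsets into it.
data = "web1,200,athens\nweb2,500,boston\nweb3,200,athens\n"

# Thin index: key -> byte offsets of matching lines (no record copies)
index = {}
offset = 0
for line in data.splitlines(keepends=True):
    key = line.strip().split(",")[2]
    index.setdefault(key, []).append(offset)
    offset += len(line)

def lookup(key):
    """Resolve hits by seeking into the original data, not the index."""
    f = io.StringIO(data)  # stand-in for an HDFS file handle
    results = []
    for off in index.get(key, []):
        f.seek(off)
        results.append(f.readline().strip())
    return results

print(lookup("athens"))
```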
Lazy data loading
[Diagram: the execution runtime lazily pulls data and index blocks from HDFS/MapReduce on demand, through an LRU index cache]
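The lazy-pull-plus-LRU-cache pattern in the diagram can be sketched as follows (block names and contents are hypothetical): index blocks are fetched from the DFS only on first use, cached, and evicted least-recently-used first when the cache is full.

```python
from collections import OrderedDict

# Hypothetical remote index blocks living on the DFS
remote_blocks = {"blk-a": [0, 16], "blk-b": [32], "blk-c": [48, 64]}
fetches = []  # track which blocks were actually pulled over the wire

class LRUIndexCache:
    """Pull index blocks only on first use; evict least recently used."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, block_id):
        if block_id in self.cache:
            self.cache.move_to_end(block_id)  # mark as recently used
            return self.cache[block_id]
        fetches.append(block_id)              # lazy pull from remote storage
        block = remote_blocks[block_id]
        self.cache[block_id] = block
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used
        return block

cache = LRUIndexCache(capacity=2)
cache.get("blk-a"); cache.get("blk-b")
cache.get("blk-a")   # cache hit, no remote fetch
cache.get("blk-c")   # evicts blk-b, the least recently used
print(fetches)
```

The same structure applies to the data blocks themselves, which is what keeps the execution runtime from touching anything the query never asks for.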
Column Oriented Approach
Contact Info
Email: [email protected]
Private Beta: http://vertascale.com