Geoffrey Hendrey @geoffhendrey: Architecture for real-time ad-hoc query on distributed filesystems

Upload: jayme

Post on 24-Feb-2016


TRANSCRIPT

Page 1: Geoffrey Hendrey @ geoffhendrey

Geoffrey Hendrey @geoffhendrey

Architecture for real-time ad-hoc query on distributed filesystems

Page 2: Geoffrey Hendrey @ geoffhendrey

Motivation
• Big Data is more opaque than small data
  – Spreadsheets choke
  – BI tools can’t scale
  – Small samples often fail to replicate issues
• Engineers, data scientists, analysts need:
  – Faster “time to answer” on Big Data
  – Rapid “find, quantify, extract”
• Solve “I don’t know what I don’t know”
• This is NOT about looking up items in a product catalog (i.e. not a consumer search problem)

Page 3: Geoffrey Hendrey @ geoffhendrey

Scaling search with classic sharding

Page 4: Geoffrey Hendrey @ geoffhendrey

Classic “side system” approach
• Definition of KLUDGE: “a system and especially a computer system made up of poorly matched components” – Merriam-Webster

[Diagram labels: Hadoop, Search Cluster, “?????”]

Page 5: Geoffrey Hendrey @ geoffhendrey

Classic “search toolkit”
• Built around the fulltext use case
• Inverted indexes optimized for on-the-fly ranking of results
  – TF-IDF
  – Okapi BM25
• Yet never able to fully realize Google-style search capability
• Issues:
  – Phrase detection
  – Pseudo synonymy
  – Open loop architecture
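To make the “on-the-fly ranking” point concrete, here is a minimal sketch of Okapi BM25 scoring over tokenized documents. The parameter defaults (`k1=1.2`, `b=0.75`) are common textbook choices, and the function name and sample documents are made up for illustration; nothing here is taken from a specific toolkit.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF (the "+ 1" variant keeps the value non-negative).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Illustrative documents only.
docs = [
    "error connecting to athens data center".split(),
    "athens athens athens".split(),
    "request served from paris".split(),
]
scores = bm25_scores(["athens"], docs)
```

Every score depends on corpus-wide statistics (document frequency, average length), which is why this style of index is tuned for relevance ranking rather than for the explicit ordering that ad-hoc queries ask for.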

Page 6: Geoffrey Hendrey @ geoffhendrey

Big data ad-hoc query
• Not typically a fulltext “document search” problem
• Data is structured, mixed structured, and denormalized
  – Log lines
  – JSON records
  – CSV files
  – Hadoop native formats (SequenceFile)
• Ranking is explicit (ORDER BY), not relevance based
• Sometimes “needle in haystack” (support, debugging)
• Sometimes “haystack in haystack” (summary analytics, segmentation)
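The two query shapes can be sketched on a handful of JSON records. The field names and values below are hypothetical; the point is only that ordering is stated explicitly and that the same data serves both a selective lookup and a full summary.

```python
import json
from collections import Counter

# Hypothetical JSON log records; field names are illustrative only.
raw_lines = [
    '{"ts": 3, "city": "athens", "status": 500, "latency_ms": 870}',
    '{"ts": 1, "city": "paris", "status": 200, "latency_ms": 35}',
    '{"ts": 2, "city": "athens", "status": 200, "latency_ms": 41}',
    '{"ts": 4, "city": "athens", "status": 500, "latency_ms": 912}',
]
records = [json.loads(line) for line in raw_lines]

# "Needle in haystack": find the few failing records, ranked explicitly
# (the equivalent of ORDER BY latency_ms DESC), not by relevance.
needles = sorted(
    (r for r in records if r["status"] == 500),
    key=lambda r: r["latency_ms"],
    reverse=True,
)

# "Haystack in haystack": summarize every record (GROUP BY city).
hits_per_city = Counter(r["city"] for r in records)
```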

Page 7: Geoffrey Hendrey @ geoffhendrey

Dremel MPP query execution tree

Page 8: Geoffrey Hendrey @ geoffhendrey

Finer points of Dremel architecture
• MapReduce friendly
• In-situ approach is DFS friendly
• Excels at aggregation. Not so much for needle-in-haystack.
• Column storage format accelerates MapReduce (less extraneous data pushed through)
• But in some regards still a “side system”
• Applications must explicitly store their data in a columnar format
• “Massive” is both a benefit and a hazard
  – Complex (operationally and with respect to query execution)
  – Queries can execute quickly…on huge clusters
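A toy illustration of why column storage plus a tree of servers suits aggregation: each leaf scans only the columns the query names and emits partial sums, and parent nodes merge them. This is a sketch of the general idea under made-up data and function names, not Dremel's actual implementation.

```python
# Hypothetical records split across two leaf servers, stored column-wise:
# each leaf holds one list per column instead of one tuple per row.
leaf_shards = [
    {"city": ["athens", "paris", "athens"], "bytes": [120, 300, 80]},
    {"city": ["paris", "athens"], "bytes": [50, 200]},
]

def leaf_aggregate(shard, group_col, value_col):
    """Leaf level: scan only the two needed columns, emit partial sums."""
    partial = {}
    for key, value in zip(shard[group_col], shard[value_col]):
        partial[key] = partial.get(key, 0) + value
    return partial

def merge_partials(partials):
    """Intermediate/root level: combine partial sums from child nodes."""
    total = {}
    for partial in partials:
        for key, value in partial.items():
            total[key] = total.get(key, 0) + value
    return total

# Equivalent in spirit to: SELECT city, SUM(bytes) ... GROUP BY city
result = merge_partials(
    leaf_aggregate(shard, "city", "bytes") for shard in leaf_shards
)
```

A needle-in-haystack lookup gains little from this layout, since every leaf still has to scan its whole column to find a handful of matches.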

Page 9: Geoffrey Hendrey @ geoffhendrey

Crawled In-Situ Index Architecture

[Diagram labels: HDFS, MapReduce, Data Crawl, In-situ Index, SimpleSearch, Application, Hadoop]

Page 10: Geoffrey Hendrey @ geoffhendrey

Benefits to crawled In-Situ index
• No changes to application data format
  – CSV
  – JSON
  – SequenceFile
• Clear “separation of concerns” between data and index
• Indexes become “disposable”: easily built, easily thrown away
• There is no “side system” that needs to be maintained
• Use the MapReduce “hammer” to pound a nail
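The crawl can be pictured as a small map/reduce job that reads files where they already live and emits pointers into them. The sketch below runs in plain Python rather than on Hadoop; the file paths, the naive comma split, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical in-place data: file path -> raw CSV text, left untouched.
files = {
    "logs/a.csv": "1,alice,athens\n2,bob,paris\n",
    "logs/b.csv": "3,carol,athens\n",
}

def map_phase(path, text):
    """Emit (term, (path, byte_offset)) for every field of every line."""
    offset = 0
    for line in text.splitlines(keepends=True):
        # Naive split; real CSV would need quoting-aware parsing.
        for field in line.rstrip("\n").split(","):
            yield field, (path, offset)
        offset += len(line.encode("utf-8"))

def reduce_phase(pairs):
    """Group the emitted pointers by term to form posting lists."""
    index = defaultdict(list)
    for term, pointer in pairs:
        index[term].append(pointer)
    return dict(index)

index = reduce_phase(
    pair for path, text in sorted(files.items()) for pair in map_phase(path, text)
)
```

Because the index holds only pointers derived from the data, it can be deleted and rebuilt by rerunning the crawl, which is what makes it “disposable”.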

Page 11: Geoffrey Hendrey @ geoffhendrey

Architect for Elasticity

[Diagram labels: AWS S3, Elastic MapReduce, JetS3t, EC2 (m1.large), Application, Crawl, Index, HTTP]

Interesting: you don’t actually need to have Hadoop installed…

Page 12: Geoffrey Hendrey @ geoffhendrey

Declarative Crawl Indexing

[Diagram labels: HDFS, MapReduce, Data Crawl, In-situ Index, SimpleSearch, Application, Hadoop]

Parse.json:
{"filter":"column[4]==\"athens\""}

• Indexer reads declarative instructions from in-situ file
• “Pull” vs. traditional “push” indexing approach
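One way an indexer might consume such an instruction file is sketched below. The slides do not specify the expression language, so this handles only the single form shown, `column[N]=="literal"`, and assumes column numbers are zero-based; `compile_filter` and the sample rows are hypothetical.

```python
import json
import re

# The instruction file content shown on the slide (quotes normalized).
parse_json = '{"filter": "column[4]==\\"athens\\""}'

# Assumption: only the single form column[N]=="literal" is handled here.
FILTER_RE = re.compile(r'^column\[(\d+)\]\s*==\s*"([^"]*)"$')

def compile_filter(spec_text):
    """Turn the declarative filter into a predicate over a list of fields."""
    expr = json.loads(spec_text)["filter"]
    match = FILTER_RE.match(expr)
    if match is None:
        raise ValueError(f"unsupported filter expression: {expr!r}")
    col, wanted = int(match.group(1)), match.group(2)
    # Assumption: column numbers are zero-based.
    return lambda fields: len(fields) > col and fields[col] == wanted

keep = compile_filter(parse_json)
rows = [
    "1,alice,42,gold,athens".split(","),
    "2,bob,17,silver,paris".split(","),
]
indexed_rows = [row for row in rows if keep(row)]
```

The data owner drops the instruction file next to the data and the crawler picks it up on its next pass, which is the “pull” direction: nothing has to be pushed into a search system.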

Page 13: Geoffrey Hendrey @ geoffhendrey

Thin index
• Index size is small because data is a holistic part of the system
• Data does not need to be “put into” the search system and replicated in the index.

[Diagram labels: HDFS, MapReduce, Data Crawl, In-situ Index, Data, Index]
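A thin index can be as little as a map from terms to byte offsets, with records fetched from the original file on demand. In this sketch an in-memory `BytesIO` stands in for a file on the distributed filesystem, and the offsets and `fetch` helper are illustrative.

```python
import io

# Hypothetical data file; in practice this would live on the DFS.
data = io.BytesIO(b"1,alice,athens\n2,bob,paris\n3,carol,athens\n")

# Thin index: terms map to byte offsets only; no record content is copied.
thin_index = {"athens": [0, 27], "paris": [15]}

def fetch(term):
    """Resolve index pointers by seeking into the original data."""
    records = []
    for offset in thin_index.get(term, []):
        data.seek(offset)
        records.append(data.readline().decode("utf-8").rstrip("\n"))
    return records
```

The index grows with the number of pointers, not with the size of the records they point to.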

Page 14: Geoffrey Hendrey @ geoffhendrey

Lazy data loading

[Diagram labels: HDFS, MapReduce, Data Crawl, Execution Runtime, Data, Index, LRU Index Cache, Lazy Pull (twice)]
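The diagram suggests an execution runtime that pulls index segments only when a query first needs them and keeps recent ones in an LRU cache. A minimal sketch of that pattern follows; the class name, segment names, and the dictionary standing in for index files on HDFS are assumptions.

```python
from collections import OrderedDict

class LazyIndexCache:
    """LRU cache that pulls index segments from storage only on first use."""

    def __init__(self, loader, capacity=2):
        self.loader = loader            # callable: segment name -> segment
        self.capacity = capacity
        self.segments = OrderedDict()   # least recently used entry first
        self.pulls = 0                  # number of trips to backing storage

    def get(self, name):
        if name in self.segments:
            self.segments.move_to_end(name)    # mark as most recently used
            return self.segments[name]
        segment = self.loader(name)            # the "lazy pull"
        self.pulls += 1
        self.segments[name] = segment
        if len(self.segments) > self.capacity:
            self.segments.popitem(last=False)  # evict least recently used
        return segment

# Hypothetical backing store standing in for index files on HDFS.
store = {
    "seg-a": {"athens": [0]},
    "seg-b": {"paris": [15]},
    "seg-c": {"rome": [27]},
}
cache = LazyIndexCache(store.__getitem__, capacity=2)
cache.get("seg-a")
cache.get("seg-b")
cache.get("seg-a")  # cache hit, no pull
cache.get("seg-c")  # third pull; evicts seg-b, the least recently used
```

Nothing is loaded at startup, so the runtime's memory footprint tracks what queries actually touch rather than the total index size.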

Page 15: Geoffrey Hendrey @ geoffhendrey

Column Oriented Approach