The Meta of Hadoop - COMAD 2012

Post on 11-Jun-2015

DESCRIPTION

What do you talk about to a hall full of database gurus? Instead of science - my talk focused on the art. What made Hadoop successful? What can we learn from it? What principles work well in building software for large scale services? What are some interesting unsolved problems in a world overrun by open-source (and VC investments :-))

TRANSCRIPT

The Meta of Hadoop

Joydeep Sen Sarma
Ex-Facebook DI Lead, Founder Qubole

Intro

• File/Database Systems developer (ex-NetApp/Oracle)
• Yahoo (2005-07), Facebook (2007-11)
• @Facebook:
  – SysAdmin: operated massive Hadoop/Hive installs
  – Architect: conceived/wrote Apache Hive, made HBase@FB happen
  – Herded cats: first manager of Data Infra team
  – IT engineer/DBA: built ETL tools, warehouse/reporting for FB Virtual Currency
• Founder, Qubole Inc. (2011-)

Why Hadoop Succeeded

• Complete Solution and Extensible
  – Useful to Engineers, Data Scientists, Analysts
  – Performance isn’t everything
  – Agile: businesses move much faster than before
• Market Dynamics
  – Captive Super-Reference Customer: Yahoo
  – Had the early market to itself for a long time
• Separation of Compute and Storage
  – Parallel Computing != Database

Why Hadoop Succeeded

• Data Consolidation!
  – Just store everything in HDFS
  – MR/Hive/Pig can chew anything
• Lights Out Architecture
  – Low System Operational Cost
  – Low Data Management Cost
    • Don’t need Data Priests

Meta Takeaways

Adaptive Lights-Out Software

• Successful efforts:
  – Automatic map-join/skew join implementations
  – Automatic local mode, resource cache
• Failed:
  – Statistics: alter table analyze table
  – Pre-Bucketing tables
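As an illustration of what "automatic map-join" means, here is a minimal sketch that picks a join strategy from observed input sizes and, when one side is small, joins by building an in-memory hash table and streaming the other side. The function names and the 25 MB threshold are assumptions made for this example; this is not Hive's implementation.

```python
from collections import defaultdict


def choose_join_strategy(left_bytes, right_bytes, small_table_limit=25_000_000):
    """Pick a join plan from input sizes (hypothetical helper, assumed threshold)."""
    if min(left_bytes, right_bytes) <= small_table_limit:
        return "map_join"      # small side fits in memory: no shuffle needed
    return "shuffle_join"      # both sides large: fall back to a sort/shuffle join


def map_join(small_rows, large_rows, key):
    """Hash-join two iterables of dicts on `key`, holding only the small side in memory."""
    table = defaultdict(list)
    for row in small_rows:
        table[row[key]].append(row)
    # Stream the large side and probe the hash table row by row.
    for row in large_rows:
        for match in table.get(row[key], ()):
            yield {**match, **row}
```

The point of the slide is that the system makes this choice at run time from what it observes, rather than relying on hints or precomputed statistics supplied by an operator.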

Learning Frameworks for Systems Software

Adaptive Lights-Out Software

• Caching + Prefetching is Adaptive
  – Replication is not
  – Can bridge gap between Compute and Storage
• Page Cache over Disk >> In-memory
  – Degrades gracefully
• Provide APIs, not packages

Murphy’s Law

• No Trusted Components

• Defend everything
  – Rate-Limit access to every resource
  – Log and Monitor everything

• Clear and Overwhelming Force
  – Oversize it!

• Think QOS from Day-1
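One common way to rate-limit access to a resource is a token bucket. The sketch below is a generic illustration, not something taken from the talk; the class name, parameters, and injectable clock are assumptions for the example.

```python
import time


class TokenBucket:
    """Allow about `rate` operations per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock              # injectable so the limiter can be tested
        self.tokens = float(capacity)   # start with a full bucket
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True and consume tokens if the request may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A caller would wrap each access to the shared resource in `if bucket.allow(): ...` and reject or queue the request otherwise, logging the rejection so that it is visible in monitoring.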

Open Source

• Small is Beautiful
  – Build small, easy-to-use, easy-to-understand components
  – Redis!
• Iterative Small Changes
  – Operators HATE large releases
  – Hive (2 weeks) vs. Hadoop (2 years?)

Opportunities

Interesting Problems - I

• Collaborative Analysis
  – Most analysis is repeated
  – Tracking and Searching historical analysis
• Consistency Aware Querying
  – OLAP: Snapshots instead of live tables
  – OLTP: Lookup stale caches instead of master

Interesting Problems - II

• SQL is Rope
  – Better than procedural, but still Rope
  – Higher Level templates: moving averages
• Data = Mutating + Immutable
  – Immutable data is easy to manage
  – Cheap: One copy per data center (Facebook Haystack)
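To make the "higher level templates" idea concrete, here is a hedged sketch of a template that expands a moving-average request into a standard SQL window query, so the analyst does not write the window clause by hand. The function name and the table and column names are hypothetical, and the identifiers are assumed to be trusted (the sketch does no quoting or escaping).

```python
def moving_average_sql(table, value_col, order_col, window):
    """Build a SQL query computing a trailing moving average over `window` rows."""
    if window < 1:
        raise ValueError("window must be at least 1")
    preceding = window - 1  # the current row counts as part of the window
    return (
        f"SELECT {order_col}, {value_col}, "
        f"AVG({value_col}) OVER (ORDER BY {order_col} "
        f"ROWS BETWEEN {preceding} PRECEDING AND CURRENT ROW) AS moving_avg "
        f"FROM {table}"
    )


# Example: a 7-row trailing average over a hypothetical daily_signups table.
query = moving_average_sql("daily_signups", "signups", "day", 7)
```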

Think Services, not Software

• Software is getting less interesting
  – Even Distributed Systems Software
• Run/Operate long-running, hot services
  – Innovate inside this boundary

Q&A
