The Meta of Hadoop - COMAD 2012
Post on 11-Jun-2015
The Meta of Hadoop
Joydeep Sen Sarma
Ex-Facebook DI Lead, Founder Qubole
Intro
• File/Database Systems developer (ex-NetApp/Oracle)
• Yahoo (2005-07), Facebook (2007-11)
• @Facebook:
  – SysAdmin: operated massive Hadoop/Hive installs
  – Architect: conceived/wrote Apache Hive; made HBase@FB happen
  – Herded cats: first manager of Data Infra team
  – IT engineer/DBA: built ETL tools, warehouse/reporting for FB Virtual Currency
• Founder, Qubole Inc. (2011-)
Why Hadoop Succeeded
• Complete Solution and Extensible
  – Useful to Engineers, Data Scientists, Analysts
  – Performance isn't everything
  – Agile: businesses move much faster than before
• Market Dynamics
  – Captive super-reference customer: Yahoo
  – Had the early market to itself for a long time
• Separation of Compute and Storage
  – Parallel Computing != Database
Why Hadoop Succeeded
• Data Consolidation!
  – Just store everything in HDFS
  – MR/Hive/Pig can chew anything
• Lights Out Architecture
  – Low System Operational Cost
  – Low Data Management Cost
    • Don't need Data Priests
Meta Takeaways
Adaptive Lights-Out Software
• Successful efforts:
  – Automatic map-join/skew join implementations
  – Automatic local mode, resource cache
• Failed:
  – Statistics: alter table analyze table
  – Pre-bucketing tables
Learning Frameworks for Systems Software
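
The automatic map-join idea can be illustrated as a planner that picks a strategy from input sizes, so the user never has to hint the join. This is a minimal sketch, not Hive's implementation: the names `choose_join_strategy` and `map_join` and the 25 MB default threshold are assumptions made for illustration.

```python
def choose_join_strategy(left_bytes, right_bytes, small_table_limit=25_000_000):
    """Hypothetical planner rule: use a map-side join when one input is small.

    small_table_limit is an assumed threshold, not a value from the talk.
    """
    if min(left_bytes, right_bytes) <= small_table_limit:
        return "map_join"
    return "shuffle_join"


def map_join(small_rows, big_rows, key):
    """Hash the small side in memory, then stream the big side past it."""
    index = {}
    for row in small_rows:
        index.setdefault(row[key], []).append(row)
    for row in big_rows:
        # Rows from the big side with no match are dropped (inner join).
        for match in index.get(row[key], []):
            yield {**match, **row}
```

The point of the "lights-out" framing is that the decision is made by the system from observed sizes rather than by an operator tuning each query.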
Adaptive Lights-Out Software
• Caching + Prefetching is Adaptive
  – Replication is not
  – Can bridge gap between Compute and Storage
• Page Cache over Disk >> In-memory
  – Degrades gracefully
• Provide APIs, not packages
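
One way to read "caching is adaptive" is that the cache contents follow the access pattern without anyone deciding in advance which data is hot. The sketch below is an illustrative LRU block cache; the class name `LRUBlockCache` and the `fetch` callback (standing in for a read from remote storage) are assumptions, not something described in the talk.

```python
from collections import OrderedDict


class LRUBlockCache:
    """Keeps the most recently used blocks; cold blocks fall out on their own."""

    def __init__(self, capacity, fetch):
        self.capacity = capacity
        self.fetch = fetch          # hypothetical callback: block_id -> data
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)   # mark as most recently used
            self.hits += 1
            return self.blocks[block_id]
        self.misses += 1
        data = self.fetch(block_id)             # slow path: go to storage
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)     # evict least recently used
        return data
```

When the working set outgrows the capacity, the hit rate drops but reads still succeed through the slow path, which is the graceful degradation the slide contrasts with a purely in-memory design.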
Murphy’s Law
• No Trusted Components
• Defend everything
  – Rate-limit access to every resource
  – Log and monitor everything
• Clear and Overwhelming Force
  – Oversize it!
• Think QOS from Day-1
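
Rate-limiting access to a resource is commonly done with a token bucket. The following is a minimal sketch of that technique, chosen here for illustration; the talk does not name a specific algorithm, and the class name and the injectable `clock` parameter are assumptions.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock            # injectable so the limiter can be tested
        self.tokens = capacity
        self.last = clock()

    def allow(self, cost=1):
        now = self.clock()
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                  # caller should reject, queue, or back off
```

Placing one such limiter in front of each shared resource, keyed by client, is one way to start on QoS from day one rather than retrofitting it after the first outage.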
Open Source
• Small is Beautiful
  – Build small, easy-to-use/understand components
  – Redis!
• Iterative Small Changes
  – Operators HATE large releases
  – Hive (2 weeks) vs. Hadoop (2 years?)
Opportunities
Interesting Problems - I
• Collaborative Analysis
  – Most analysis is repeat analysis
  – Tracking and searching historical analysis
• Consistency Aware Querying
  – OLAP: snapshots instead of live tables
  – OLTP: look up stale caches instead of master
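
The OLTP half of consistency-aware querying can be sketched as a read API where the caller states how much staleness it tolerates, and the system serves from cache whenever that bound allows. This is an illustrative sketch only: `StaleTolerantStore`, the `max_staleness` parameter, and the dict standing in for the master are all assumptions.

```python
import time


class StaleTolerantStore:
    """Serves cached values when they are fresh enough for the caller."""

    def __init__(self, master, clock=time.monotonic):
        self.master = master          # dict standing in for the authoritative store
        self.clock = clock
        self.cache = {}               # key -> (value, fetched_at)

    def get(self, key, max_staleness):
        now = self.clock()
        entry = self.cache.get(key)
        if entry is not None and now - entry[1] <= max_staleness:
            return entry[0]           # stale but acceptable: skip the master
        value = self.master[key]      # too old or missing: read the master
        self.cache[key] = (value, now)
        return value
```

A caller that passes `max_staleness=0` forces a master read unless the entry was fetched at the same clock instant, while a dashboard might pass several seconds and rarely touch the master at all.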
Interesting Problems - II
• SQL is Rope
  – Better than procedural, but still Rope
  – Higher-level templates: moving averages
• Data = Mutating + Immutable
  – Immutable data is easy to manage
  – Cheap: one copy per data center (Facebook Haystack)
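
A "higher-level template" for moving averages could be as simple as a function that expands a few parameters into the window-function SQL an analyst would otherwise write by hand. The sketch below is one possible shape; the function name `moving_average_sql` and the example table and column names are hypothetical.

```python
def moving_average_sql(table, value_col, order_col, window):
    """Expand a moving-average request into standard SQL with a window frame.

    Identifiers are interpolated as-is, so they must come from trusted input.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return (
        f"SELECT {order_col}, {value_col}, "
        f"AVG({value_col}) OVER (ORDER BY {order_col} "
        f"ROWS BETWEEN {window - 1} PRECEDING AND CURRENT ROW) AS moving_avg "
        f"FROM {table}"
    )


# Hypothetical usage: a 7-row trailing average over a daily sales table.
query = moving_average_sql("daily_sales", "revenue", "day", 7)
```

The template removes the part of the "rope" that is easiest to get wrong, namely the frame bounds, while leaving the result as ordinary SQL that any engine with window functions can run.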
Think Services, not Software
• Software is getting less interesting
  – Even Distributed Systems Software
• Run/Operate long-running, hot services
  – Innovate inside this boundary
Q&A