In the Days of Kings and Priests
• Computers and Data: Crown Jewels• Executives depend on computers• But cannot work with
them directly
• The DBA “Priesthood”• And their Acronymia: EDW, BI, OLAP
• The “architected” EDW“There is no point in bringing data … into the data
warehouse environment without integrating it.” — Bill Inmon,
Building the Data Warehouse, 2005
New Realities
• TB disks < $100• Everything is data• Rise of data-driven culture• Very publicly espoused
by Google, Wired, etc.• Sloan Digital Sky Survey,
Terraserver, etc.
The quest for knowledge used to begin with grand theories.
Now it begins with massive amounts of data.
Welcome to the Petabyte Age.
The New Practitioners
Hal Varian, UC Berkeley, Chief Economist @ Google
“Looking for a career where your services will be in high demand?
… Provide a scarce, complementary service to something that is getting
ubiquitous and cheap.
So what’s ubiquitous and cheap? Data.
And what is complementary to data? Analysis.
the sexy job in the next ten years will be statisticians
The New Practitioners
• Aggressively Datavorous• Statistically savvy• Diverse in training, tools
MAD Skills
•Magnetic• attract data and practitioners
• Agile• rapid iteration: ingest, analyze,
productionalize
• Deep• sophisticated analytics in Big Data
[Cohen, et al. VLDB 09]
Dev tools 4 analytics: reality• Current focus: engines/languages for scalable analytics • Scalable analytics algorithms are a small % of analyst’s life
Dev tools 4 analytics: reality• Current focus: engines/languages for scalable analytics • Scalable analytics algorithms are a small % of analyst’s life
software development
visualization
data product management
collaboration/networking
focus on Deepnot enough on Agilenot enough on Magnetic
dp
Analytics Coding Landscape• Single-node stat packages (R, Matlab, SAS, etc.)• domain-specific languages for linear algebra and statistics• diverse set of open libraries (e.g. CRAN library for R)• scalability limits: in-core, no parallelism
• MapReduce ecosystem• Google, MS Dryad, Hadoop open source• low-level single-node coding (Java), easy data-parallelization• SQL-like convenience languages above (Hive, Pig)• emerging open analytics toolkits (Mahout, Pregel)
• SQL + extensions (user-defined functions)• more powerful than many realize• declarative coding, easy data-parallelism• poor support for extension developers (varies by vendor)• emerging open analytics toolkits (MADlib, Hazy)
Analytics Takeaways
• little real dev difference between mapreduce and SQL• hadoop has more energetic dev tools development• SQL provides more breadth (of function, install base, HR)• lines are blurring
• serious barrier: porting the R/SAS/Matlab ecosystem• will take a decade to develop data-parallel equivalent to CRAN• algorithmic challenge, not just a coding challenge• no shortcut here (at least for MMP)
• in sum• analytics will be a “swiss army knife” approach for years to come• think portfolio. foster community, open libraries (MADlib/Mahout)
big systems c. 2011
• features:• data-centric• distributed• highly available• scalable/elastic• lots of new/custom code
• programming is becoming hard2
• (parallelism + asynchrony + failure) × (software engineering)
root cause of hardness• order is pervasive in the von neumann model• state: an ordered array of cells• logic: an ordered array of instructions
• terrible match for distributed systems
typical solution: shared storage• distributed storage replaces RAM• imposes/enforces order• e.g. via transactions or other consistency mechanisms
• shift: data-centric development• storage is not persistence — it is a programming model
• this has always been true • the cloud makes it pervasive
Dropping ACID?• early exposition: the “transaction concept” [Gray VLDB 1981]
• many think distributed ACID transactions are infeasible today• cross-site transactions coordination⇒
• ⇒ waiting • ⇒ queue buildups• ⇒ unpredictable problems
• a major lesson of Internet companies: Brewer’s “CAP theorem”• though implications being revisited
• by now, this lesson is kool-aid in the open source community…
NoSQL• “not only SQL”• really not about SQL per se.
• focus on two things:• distributed storage with “loose consistency”, not ACID.• data models that are simpler than SQL schemas
• key/value stores, documents• i.e. similar to distributed memory!
• examples• BigTable (Google), Hbase (Yahoo/Hadoop), Cassandra
(Facebook/DataStax), Sherpa (Yahoo), Dynamo (Amazon), Voldemort (LinkedIn), …
• cloud services (AppEngine, Azure)
Homework puzzle
Given:1. use storage layer for distributed coordination (order)2. use NoSQL’s loose consistency for availability
Q: how do programmers reason about order and correctness?
Homework puzzle
Given:1. use storage layer for distributed coordination (order)2. use NoSQL’s loose consistency for availability
Q: how do programmers reason about order and correctness?A: very carefully.
correctness?
ACID
• general correctness via theoretical foundations• read/write: serializability• coordination/consensus
loose consistency
• app-specific correctness via design maxims• semantic assertions• custom compensation
concerns: latency, availability concerns: hard to trust, test
application logic
system infrastructure
theoretical foundation
application logic
quicksand
system infrastructure
the shift
a vacuum here• state of the art: each app reasons about consistency
• e.g. by making use of a locking service (a la Apache Zookeeper)• e.g. by reasoning about “eventual consistency” of the storage system
• this is, arguably, hard3
• (sw eng) * (distribution) * (false abstractions)
• don’t take my word for it• Gunawi’s FATE uncovered 16 fault-recovery bugs in Hadoop FS
[NSDI ’11]
focus on storage systemsnot enough on developers
CALM <~ bloom
BOOM team
joe hellerstein ras bodik
peter alvaro neil conway bill marczak haryadi gunawi thibaud hottelier
desire: best of both worlds
• theoretical foundation for correctness under loose consistency
• embodiment of theory in a programming framework
theoretical foundation
application logic
quicksand
system infrastructure
our approach
• disorderly programming• state: unordered collections• logic: unordered statements
• implications• default: partitioning, concurrency• ordering (data, logic) explicit, special-case
• but can this make ordering decisions simpler?
monotonicity
monotonic code non-monotonic code
• info accumulation• the more you know,
the more you know
• e.g. map, filter, join
• belief revision• new inputs can
change your mind;need to “seal” input
• e.g. counts, state update
CALM Theorem• CALM: consistency as logical monotonicity• monotonic code eventually consistent⇒• non-monotonic coordinate only at non-monotonic ⇒ points of order
• conjectures at pods 2010 conference[Hellerstein, SIGMOD Record 2010]
• formulations and theorems in 2011[Ameloot,et al., PODS 2011]
practical implications• compiler can identify non-monotonic “points of order”• inject coordination code• or mark uncoordinated results as “tainted”
• compiler can help programmer think about coordination costs
• easy to do this with the right language…
background: BOOM Analytics
• 2005-2010: designed a distributed logic language called Overlog
• 2009-2010: rebuilt Hadoop File System and scheduler in Overlog• no kidding – API-compatible with Hadoop, comparable performance• win 1: Orders Of Magnitude smaller, 4 person-months dev time• win 2 (more important) : evolvability
• fixed HDFS single point of failure via Paxos-in-Overlog (6 person-weeks)• fixed HDFS scaling limits via state partitioning (1 day!)
[Alvaro et al., Eurosys 2010]
Lines of Java Lines of Overlog
HDFS 21,700 0
BOOM-FS 1,431 469
we became greedy for more
• time to build a language for real programmers. approach:• craft a disorderly DSL for distributed systems• embed in popular host languages. (I chose ruby first.)
• embody the CALM theorem in programmer tools• identify points of order in code• synthesize coordination logic, or inject “taint” tracking• high-level analysis/debuggers to pinpoint tricky ordering issues
<~ bloom
bud (bloom under development)• bloom embedded as a DSL in ruby• domain-specific code analysis tools• alpha released April, 2011 at http://bloom-lang.net • goodies• code analysis tools• library/example sandbox• EC2 deployment utilities
% gem install bud
classic example: shopping cart
• replicated, a la Amazon Dynamo
• challenge: guarantee eventual consistency of replicas• maxim: use commutative operations• easier said than done!
• Bloom/CALM paper shows compiler analysis (i.e. proofs) of the design maxims for correctness, efficiency
[Alvaro, et al. CIDR 2011]
conclusion• CALM theorem• what is coordination for? non-monotonicity.• pinpoint non-monotonic points of order
• coordination or taint tracking
• Bloom• declarative, disorderly DSL for distributed programming
• bud: organic Ruby embedding• CALM analysis of monotonicity
• synthesize coordination/compensation• released to the dev community this spring
• “friends-and-family” alpha at http://bloom-lang.net
influence propagation…?
• Technology Review TR10 2010:• “The question that we ask is simple: is the technology
likely to change the world?”
• Fortune Magazine 2010 Top in Tech:• “Some of our choices may surprise you.”
• Twittersphere:• “Read this. Read this now.”
more?
http://bloom-lang.nethttp://boom.cs.berkeley.edu
thanks to:Microsoft Research
Yahoo! ResearchIBM Research
NSFAFOSR
Consensus in Logic [Alvaro, et al. NetDB 2009]BOOM Analytics [Alvaro, et al., Eurosys 2010]Declarative Imperative [Hellerstein, SIGMOD Record 3/2010]CALM + Bloom [Alvaro, et al. CIDR 2011]
dp = datapeople
facilitating interactions between people and data throughout the analytic
lifecycle.
http://deepresearch.org
dp
Jeff HeerStanford
Tapan ParikhBerkeley
Maneesh AgrawalaBerkeley
Joe HellersteinBerkeley
Sean Diana RaviKandel MacLean Parikh
Kuang Nicholas WesleyChen Kong Willett
dpwranglerintelligent data xformation
commentspace social data analysis
usher/shreddr first-mile data entry
socialflowsmining, visualizing & browsing email
madlibparallel in-database analytics
Remember the missing pieces!
software development
visualization
data product management
collaboration/networking
data in the first mile• usher• shreddr
http://www.shreddr.orgChen, et al. CIDR 2011
“MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematics, statistical and machine-learning methods for structured and unstructured data.”
http://www.madlib.net
• 02.03.11• “friends and family” alpha release• BSD license• initial ports: PostgreSQL, Greenplum • initial contributors: Berkeley, EMC/Greenplum
• summer 2011• beta release• new contributor pipeline for ports and methods
more?
http://www.deepresearch.org
wrangler: [kandel, et al. CHI 2011]commentspace: [willett, et al. CHI 2011]first mile: [chen, et al. CIDR 2011]adaptive feedback: [chen, et al. UIST 2010]usher: [chen, et al., ICDE 2010]
thanks to: National Science FoundationLightspeed Venture Partners
Yahoo! ResearchEMC/Greenplum
SurveyMonkey
why ruby?• “Bud uses a Ruby-flavored syntax, but this is not fundamental;
we have experimented with analogous Bloom embeddings in other languages including Python, Erlang and Scala, and they look similar in structure.”
what about erlang?• “we did a simple Bloom prototype DSL in Erlang (which we
cannot help but call “Bloomerlang”), and there is a natural correspondence between Bloom-style distributed rules and Erlang actors. However there is no requirement for Erlang programs to be written in the disorderly style of Bloom. It is not obvious that typical Erlang programs are significantly more amenable to a useful points-of-order analysis than programs written in any other functional language. For example, ordered lists are basic constructs in functional languages, and without program annotation or deeper analysis than we need to do in Bloom, any code that modifies lists would need be marked as a point of order, much like our destructive shopping cart”
CALM analysis for traditional languages?• We believe that Bloom’s “disorderly by default” style
encourages order-independent programming, and we know that its roots in database theory helped produce a simple but useful program analysis technique. While we would be happy to see the analysis “ported” to other distributed programming environments, it may be that design patterns using Bloom-esque disorderly programming are the natural way to achieve this.
sw dev space (by function)
• analytics & batch processing
• scalable systems and apps above
• service/business logic• client/mobile/browser apps• other domain-specific universes• gaming/graphics• embedded/realtime• telecomm
sw dev space (by culture)
• enterprise sw vendors• general solutions, QA/release cycles, support contracts• custom dev tools geared toward lock-in
• winning, mature internet players (GOOG, AMZN)• in-house code, agile dev, investment in innovation• partial moves to expose dev environments/tools
• open-source & immature internet players• start with clones of GOOG/AMZN c. 5-10 years ago• energetic, fascinating developer culture
• data scientists• many are tools-agnostic, some religious• another energetic, fascinating developer culture
roll-your-own culture
• google/amazon ethos• most code not open source
• open-source clones maturing but limited• years behind, limited-scope innovation• viable (with on-site hackers)• commercial support emerging
• most important: supportive hacker culture• new generation learning by doing, sharing• this ecosystem will improve over time• better tools needed here, emerging