
big data and the cloud: programming futures

joe hellerstein

roadmap

• status report
• analytics
• scalable systems
• research
• calm <~ bloom
• dp

In the Days of Kings and Priests

• Computers and Data: Crown Jewels
• Executives depend on computers
  • But cannot work with them directly
• The DBA “Priesthood”
  • And their Acronymia: EDW, BI, OLAP
• The “architected” EDW
  • “There is no point in bringing data … into the data warehouse environment without integrating it.” (Bill Inmon, Building the Data Warehouse, 2005)

http://www.flickr.com/photos/oknovokght/2189461072/

http://www.flickr.com/photos/cote/54408562/

http://www.flickr.com/photos/pelegrino/2434076076/

New Realities

• TB disks < $100
• Everything is data
• Rise of data-driven culture
  • Very publicly espoused by Google, Wired, etc.
  • Sloan Digital Sky Survey, Terraserver, etc.

The quest for knowledge used to begin with grand theories.

Now it begins with massive amounts of data.

Welcome to the Petabyte Age.

http://www.wired.com/images/press/pdf/1607cover.pdf

The New Practitioners

Hal Varian, UC Berkeley, Chief Economist @ Google

“Looking for a career where your services will be in high demand?

… Provide a scarce, complementary service to something that is getting

ubiquitous and cheap.

So what’s ubiquitous and cheap? Data.

And what is complementary to data? Analysis.”

“… the sexy job in the next ten years will be statisticians.”

http://people.ischool.berkeley.edu/~hal/

The New Practitioners

• Aggressively Datavorous
• Statistically savvy
• Diverse in training, tools

http://www.flickr.com/photos/ucumari/356615093/

http://www.flickr.com/photos/ashleighozment/2397538892/

http://www.flickr.com/photos/volk/1379996042/

http://www.flickr.com/photos/pelegrino/2434076076/

MAD Skills

• Magnetic
  • attract data and practitioners
• Agile
  • rapid iteration: ingest, analyze, productionalize
• Deep
  • sophisticated analytics in Big Data

[Cohen, et al. VLDB 09]

http://www.flickr.com/photos/tmartin/32010732/

http://www.flickr.com/photos/wmcbride1965/3605220261/

http://www.flickr.com/photos/sickening/287923931/

Dev tools 4 analytics: reality
• Current focus: engines/languages for scalable analytics
• Scalable analytics algorithms are a small % of analyst’s life

[Figure labels: software development; visualization; data product management; collaboration/networking]

• focus on Deep
• not enough on Agile
• not enough on Magnetic

dp

Analytics Coding Landscape
• Single-node stat packages (R, Matlab, SAS, etc.)
  • domain-specific languages for linear algebra and statistics
  • diverse set of open libraries (e.g. CRAN library for R)
  • scalability limits: in-core, no parallelism
• MapReduce ecosystem
  • Google, MS Dryad, Hadoop open source
  • low-level single-node coding (Java), easy data-parallelization
  • SQL-like convenience languages above (Hive, Pig)
  • emerging open analytics toolkits (Mahout, Pregel)
• SQL + extensions (user-defined functions)
  • more powerful than many realize
  • declarative coding, easy data-parallelism
  • poor support for extension developers (varies by vendor)
  • emerging open analytics toolkits (MADlib, Hazy)
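To make "SQL + extensions (user-defined functions)" concrete, here is a minimal sketch of a user-defined aggregate. It assumes Python's standard-library sqlite3 as a stand-in engine; the `Slope` class, the `obs` table and `fit_slopes` are hypothetical names, and this is not how MADlib or Hazy are implemented. The aggregate keeps only additive running sums, which is the property that lets a parallel database compute partial states per partition and combine them; SQLite itself runs it on a single node.

```python
import sqlite3


class Slope:
    """User-defined aggregate: least-squares slope of y on x.

    The state is a handful of running sums, so partial states from
    different partitions could be merged by simple addition.
    """

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def step(self, x, y):
        # Called once per input row.
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def finalize(self):
        # Called once per group after the last row.
        denom = self.n * self.sxx - self.sx ** 2
        if denom == 0:
            return None  # slope undefined (fewer than two distinct x values)
        return (self.n * self.sxy - self.sx * self.sy) / denom


def fit_slopes(rows):
    """rows: iterable of (group, x, y). Returns {group: slope}."""
    conn = sqlite3.connect(":memory:")
    conn.create_aggregate("slope", 2, Slope)
    conn.execute("CREATE TABLE obs (grp TEXT, x REAL, y REAL)")
    conn.executemany("INSERT INTO obs VALUES (?, ?, ?)", rows)
    result = dict(conn.execute("SELECT grp, slope(x, y) FROM obs GROUP BY grp"))
    conn.close()
    return result
```

The analyst-facing part stays declarative: one `SELECT ... GROUP BY` fits a model per group, and the engine decides how to schedule it.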

Analytics Takeaways

• little real dev difference between mapreduce and SQL
  • hadoop has more energetic dev tools development
  • SQL provides more breadth (of function, install base, HR)
  • lines are blurring
• serious barrier: porting the R/SAS/Matlab ecosystem
  • will take a decade to develop data-parallel equivalent to CRAN
  • algorithmic challenge, not just a coding challenge
  • no shortcut here (at least for MMP)
• in sum
  • analytics will be a “swiss army knife” approach for years to come
  • think portfolio. foster community, open libraries (MADlib/Mahout)
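A small illustration of the "little real dev difference" point: the same per-key count written once in map/reduce style and once in SQL. This is a sketch under stated assumptions (plain Python generators stand in for a MapReduce runtime, the standard-library sqlite3 stands in for a parallel SQL engine, and the `visits` table and function names are hypothetical).

```python
import sqlite3
from collections import defaultdict


def map_phase(records):
    # Emit (key, 1) for every (visitor, page) record.
    for visitor, _page in records:
        yield visitor, 1


def reduce_phase(pairs):
    # Group by key and sum the values.
    groups = defaultdict(int)
    for key, value in pairs:
        groups[key] += value
    return dict(groups)


def count_mapreduce(records):
    return reduce_phase(map_phase(records))


def count_sql(records):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (visitor TEXT, page TEXT)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", records)
    out = dict(conn.execute(
        "SELECT visitor, COUNT(*) FROM visits GROUP BY visitor"))
    conn.close()
    return out
```

Both versions name a grouping key and an associative combine step; what differs is mostly the surrounding tooling.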

roadmap

• status report
• analytics
• scalable systems
• research
• calm <~ bloom
• dp

big systems c. 2011

• features:
  • data-centric
  • distributed
  • highly available
  • scalable/elastic
  • lots of new/custom code
• programming is becoming hard²
  • (parallelism + asynchrony + failure) × (software engineering)

root cause of hardness
• order is pervasive in the von neumann model
  • state: an ordered array of cells
  • logic: an ordered array of instructions
• terrible match for distributed systems

typical solution: shared storage
• distributed storage replaces RAM
• imposes/enforces order
  • e.g. via transactions or other consistency mechanisms
• shift: data-centric development
  • storage is not persistence; it is a programming model
  • this has always been true
  • the cloud makes it pervasive

Dropping ACID?
• early exposition: the “transaction concept” [Gray VLDB 1981]
• many think distributed ACID transactions are infeasible today
  • cross-site transactions ⇒ coordination
  • ⇒ waiting
  • ⇒ queue buildups
  • ⇒ unpredictable problems
• a major lesson of Internet companies: Brewer’s “CAP theorem”
  • though implications being revisited
• by now, this lesson is kool-aid in the open source community…

NoSQL
• “not only SQL”
  • really not about SQL per se.
• focus on two things:
  • distributed storage with “loose consistency”, not ACID.
  • data models that are simpler than SQL schemas
    • key/value stores, documents
    • i.e. similar to distributed memory!
• examples
  • BigTable (Google), HBase (Yahoo/Hadoop), Cassandra (Facebook/DataStax), Sherpa (Yahoo), Dynamo (Amazon), Voldemort (LinkedIn), …
  • cloud services (AppEngine, Azure)

Homework puzzle

Given:
1. use storage layer for distributed coordination (order)
2. use NoSQL’s loose consistency for availability

Q: how do programmers reason about order and correctness?
A: very carefully.
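Why the puzzle is hard can be seen in a few lines. The sketch below is an illustration only (the function names are hypothetical and no real storage system is modeled): it delivers the same operations to a replica in every possible order and counts how many distinct final states result. Overwrites depend on arrival order; set insertions do not.

```python
from itertools import permutations


def apply_overwrites(ops):
    """Replica state is a dict; each op overwrites a key (last arrival wins)."""
    state = {}
    for key, value in ops:
        state[key] = value
    return state


def apply_inserts(ops):
    """Replica state is a set; each op inserts a fact."""
    state = set()
    for fact in ops:
        state.add(fact)
    return state


def freeze(state):
    # Canonical, hashable form of a replica state so outcomes can be compared.
    if isinstance(state, dict):
        return frozenset(state.items())
    return frozenset(state)


def distinct_outcomes(apply, ops):
    """Number of different final states over all delivery orders of ops."""
    return len({freeze(apply(order)) for order in permutations(ops)})
```

With the two operations `("cart", "book")` and `("cart", "pen")`, the overwrite replica can end in two different states depending on delivery order, while the insert replica always ends in the same one.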

correctness?

ACID
• general correctness via theoretical foundations
  • read/write: serializability
  • coordination/consensus
• concerns: latency, availability

loose consistency
• app-specific correctness via design maxims
  • semantic assertions
  • custom compensation
• concerns: hard to trust, test

[Figure labels: application logic; system infrastructure; theoretical foundation; application logic; quicksand. Caption: the shift]

a vacuum here
• state of the art: each app reasons about consistency
  • e.g. by making use of a locking service (a la Apache ZooKeeper)
  • e.g. by reasoning about “eventual consistency” of the storage system
• this is, arguably, hard³
  • (sw eng) * (distribution) * (false abstractions)
• don’t take my word for it
  • Gunawi’s FATE uncovered 16 fault-recovery bugs in Hadoop FS [NSDI ’11]

• focus on storage systems
• not enough on developers

CALM <~ bloom

roadmap

• status report
• analytics
• scalable systems
• research
• calm <~ bloom
• dp

calm <~ bloom
disorderly programming for distributed systems

BOOM team

joe hellerstein ras bodik

peter alvaro neil conway bill marczak haryadi gunawi thibaud hottelier

desire: best of both worlds

• theoretical foundation for correctness under loose consistency

• embodiment of theory in a programming framework

[Figure labels: theoretical foundation; application logic; quicksand]


our approach

• disorderly programming
  • state: unordered collections
  • logic: unordered statements
• implications
  • default: partitioning, concurrency
  • ordering (data, logic) explicit, special-case

• but can this make ordering decisions simpler?

progress

• CALM consistency (maxims ⇒ theorems)

• Bloom language (theorems ⇒ programming)

monotonicity

monotonic code
• info accumulation
  • the more you know, the more you know
• e.g. map, filter, join

non-monotonic code
• belief revision
  • new inputs can change your mind; need to “seal” input
• e.g. counts, state update
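The distinction can be checked mechanically on a toy stream. This is a minimal sketch with hypothetical helper names, not Bloom code: each query is re-run as inputs arrive, and we test whether every output contains the previous one, i.e. whether anything already asserted is ever retracted.

```python
def evens(seen):
    """Monotonic: a filter. More input can only add output."""
    return {x for x in seen if x % 2 == 0}


def count(seen):
    """Non-monotonic: the answer is a single fact that later input replaces."""
    return {len(seen)}


def outputs_over_time(query, stream):
    """Run the query after each arrival and collect the successive outputs."""
    seen, history = set(), []
    for item in stream:
        seen.add(item)
        history.append(query(seen))
    return history


def only_grows(history):
    """True if every output contains the one before it (nothing retracted)."""
    return all(a <= b for a, b in zip(history, history[1:]))
```

For the stream `[2, 3, 4]` the filter's outputs only grow, so any prefix of its output is safe to act on early. The count emits 1, then 2, then 3: each answer contradicts the last, so a consumer has to know the input is sealed before trusting it.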

an aside
gamblers?

intuition
• counting requires waiting
• waiting requires counting
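Both halves of the intuition show up in even the smallest vote tally. The class below is a hypothetical sketch (the names `SealedCounter`, `receive`, `sealed` and `tally` are assumptions for illustration): the tally cannot be released until all input has arrived, and the only way to know all input has arrived is to count who has reported.

```python
class SealedCounter:
    """Counts 'yes' votes, but answers only once every expected voter reported."""

    def __init__(self, expected_voters):
        self.expected = set(expected_voters)
        self.votes = {}

    def receive(self, voter, vote):
        # Messages may arrive in any order and may be duplicated;
        # keying by voter makes redelivery harmless.
        if voter in self.expected:
            self.votes[voter] = vote

    def sealed(self):
        # Waiting requires counting: input is known to be complete only
        # by counting the voters heard from so far.
        return len(self.votes) == len(self.expected)

    def tally(self):
        # Counting requires waiting: no answer until the input is sealed.
        if not self.sealed():
            return None
        return sum(1 for v in self.votes.values() if v == "yes")
```

Until the last expected voter reports, `tally()` returns `None` rather than a number that a later message could contradict.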

CALM Theorem
• CALM: consistency as logical monotonicity
  • monotonic code ⇒ eventually consistent
  • non-monotonic ⇒ coordinate only at non-monotonic points of order
• conjectures at PODS 2010 conference [Hellerstein, SIGMOD Record 2010]
• formulations and theorems in 2011 [Ameloot, et al., PODS 2011]

practical implications
• compiler can identify non-monotonic “points of order”
  • inject coordination code
  • or mark uncoordinated results as “tainted”

• compiler can help programmer think about coordination costs

• easy to do this with the right language…
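A toy version of such an analysis, to show the shape of the idea. This is a sketch under stated assumptions, not Bloom's actual analysis: a program is modeled as a list of `(output, operator, inputs)` rules, the set of operators treated as monotonic is a hypothetical hand-written whitelist, and anything else is conservatively flagged.

```python
# Operators assumed monotonic for this illustration; everything else is
# conservatively treated as a potential point of order.
MONOTONIC_OPS = {"map", "filter", "join", "union"}


def points_of_order(rules):
    """rules: list of (output, op, inputs). Returns outputs of flagged rules."""
    return [out for out, op, _inputs in rules if op not in MONOTONIC_OPS]


def tainted(rules):
    """Outputs that depend, directly or transitively, on a point of order
    that has not been protected by coordination."""
    marked = set(points_of_order(rules))
    changed = True
    while changed:  # propagate taint downstream until nothing new is marked
        changed = False
        for out, _op, inputs in rules:
            if out not in marked and marked.intersection(inputs):
                marked.add(out)
                changed = True
    return marked
```

For a pipeline that filters, joins, counts and then maps the count into a report, only the count is flagged as a point of order, and the report inherits its taint. A tool can then either insert coordination before the count or carry the "tainted" label through to the consumer.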

<~ bloom

background: BOOM Analytics

• 2005-2010: designed a distributed logic language called Overlog

• 2009-2010: rebuilt Hadoop File System and scheduler in Overlog
  • no kidding: API-compatible with Hadoop, comparable performance
  • win 1: Orders Of Magnitude smaller, 4 person-months dev time
  • win 2 (more important): evolvability
    • fixed HDFS single point of failure via Paxos-in-Overlog (6 person-weeks)
    • fixed HDFS scaling limits via state partitioning (1 day!)

[Alvaro et al., Eurosys 2010]

          Lines of Java   Lines of Overlog
HDFS      21,700          0
BOOM-FS   1,431           469

we became greedy for more

• time to build a language for real programmers. approach:
  • craft a disorderly DSL for distributed systems
  • embed in popular host languages. (I chose ruby first.)
• embody the CALM theorem in programmer tools
  • identify points of order in code
  • synthesize coordination logic, or inject “taint” tracking
  • high-level analysis/debuggers to pinpoint tricky ordering issues

<~ bloom

bud (bloom under development)
• bloom embedded as a DSL in ruby
• domain-specific code analysis tools
• alpha released April, 2011 at http://bloom-lang.net
• goodies
  • code analysis tools
  • library/example sandbox
  • EC2 deployment utilities

% gem install bud

http://bloom-lang.net

classic example: shopping cart

• replicated, a la Amazon Dynamo

• challenge: guarantee eventual consistency of replicas
  • maxim: use commutative operations
  • easier said than done!

• Bloom/CALM paper shows compiler analysis (i.e. proofs) of the design maxims for correctness, efficiency

[Alvaro, et al. CIDR 2011]
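One way to follow the "commutative operations" maxim, as a minimal sketch. This is not the design from the cited paper and not Bloom code; `CartReplica` and its methods are hypothetical. Each replica keeps a grow-only set of cart actions, replicas merge by set union (commutative, associative, idempotent), and only checkout, which sums the actions, is order-sensitive.

```python
from collections import Counter


class CartReplica:
    def __init__(self):
        # Grow-only log of actions: (request_id, item, delta).
        # The request_id makes redelivered messages harmless.
        self.actions = set()

    def record(self, request_id, item, delta):
        self.actions.add((request_id, item, delta))

    def merge(self, other):
        # Set union: replicas converge whatever the merge order.
        self.actions |= other.actions

    def checkout(self):
        # Non-monotonic step: sums over the whole log, so it is only safe
        # once the replica knows no more actions are coming.
        totals = Counter()
        for _request_id, item, delta in self.actions:
            totals[item] += delta
        return {item: n for item, n in totals.items() if n > 0}
```

Adds and removes can reach replicas in any order, or twice, and the replicas still converge after merging. What remains is the "easier said than done" part: deciding when the log is complete enough to run `checkout`.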

conclusion
• CALM theorem
  • what is coordination for? non-monotonicity.
  • pinpoint non-monotonic points of order
  • coordination or taint tracking
• Bloom
  • declarative, disorderly DSL for distributed programming
  • bud: organic Ruby embedding
  • CALM analysis of monotonicity
  • synthesize coordination/compensation
  • released to the dev community this spring
    • “friends-and-family” alpha at http://bloom-lang.net

influence propagation…?

• Technology Review TR10 2010:
  • “The question that we ask is simple: is the technology likely to change the world?”
• Fortune Magazine 2010 Top in Tech:
  • “Some of our choices may surprise you.”
• Twittersphere:
  • “Read this. Read this now.”

more?

http://bloom-lang.net
http://boom.cs.berkeley.edu

thanks to: Microsoft Research, Yahoo! Research, IBM Research, NSF, AFOSR

Consensus in Logic [Alvaro, et al. NetDB 2009]
BOOM Analytics [Alvaro, et al., Eurosys 2010]
Declarative Imperative [Hellerstein, SIGMOD Record 3/2010]
CALM + Bloom [Alvaro, et al. CIDR 2011]

roadmap

• status report
• analytics
• scalable systems
• research
• calm <~ bloom
• dp

dp = datapeople

facilitating interactions between people and data throughout the analytic lifecycle.

http://deepresearch.org

dp

Jeff Heer, Stanford
Tapan Parikh, Berkeley
Maneesh Agrawala, Berkeley
Joe Hellerstein, Berkeley

Sean Kandel, Diana MacLean, Ravi Parikh
Kuang Chen, Nicholas Kong, Wesley Willett

dp
• wrangler: intelligent data xformation
• commentspace: social data analysis
• usher/shreddr: first-mile data entry
• socialflows: mining, visualizing & browsing email
• madlib: parallel in-database analytics

Remember the missing pieces!


[Figure labels: visualization; dp]



wrangler

http://vis.stanford.edu/wrangler
Kandel, et al. SIGCHI 2011

commentspace

http://www.commentspace.net
Willett, et al. SIGCHI 2011

data in the first mile
• usher
• shreddr

http://www.shreddr.org
Chen, et al. CIDR 2011

Database

http://shreddr.org

shredding

column-oriented data entry


select the snips that are not ‘MICHAEL’

“MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.”

http://www.madlib.net

• 02.03.11
  • “friends and family” alpha release
  • BSD license
  • initial ports: PostgreSQL, Greenplum
  • initial contributors: Berkeley, EMC/Greenplum
• summer 2011
  • beta release
  • new contributor pipeline for ports and methods

more?

http://www.deepresearch.org

wrangler: [kandel, et al. CHI 2011]
commentspace: [willett, et al. CHI 2011]
first mile: [chen, et al. CIDR 2011]
adaptive feedback: [chen, et al. UIST 2010]
usher: [chen, et al., ICDE 2010]

thanks to: National Science Foundation, Lightspeed Venture Partners, Yahoo! Research, EMC/Greenplum, SurveyMonkey

backup slides follow

why ruby?
• “Bud uses a Ruby-flavored syntax, but this is not fundamental; we have experimented with analogous Bloom embeddings in other languages including Python, Erlang and Scala, and they look similar in structure.”

what about erlang?
• “we did a simple Bloom prototype DSL in Erlang (which we cannot help but call “Bloomerlang”), and there is a natural correspondence between Bloom-style distributed rules and Erlang actors. However there is no requirement for Erlang programs to be written in the disorderly style of Bloom. It is not obvious that typical Erlang programs are significantly more amenable to a useful points-of-order analysis than programs written in any other functional language. For example, ordered lists are basic constructs in functional languages, and without program annotation or deeper analysis than we need to do in Bloom, any code that modifies lists would need be marked as a point of order, much like our destructive shopping cart”

CALM analysis for traditional languages?
• We believe that Bloom’s “disorderly by default” style encourages order-independent programming, and we know that its roots in database theory helped produce a simple but useful program analysis technique. While we would be happy to see the analysis “ported” to other distributed programming environments, it may be that design patterns using Bloom-esque disorderly programming are the natural way to achieve this.

sw dev space (by function)

• analytics & batch processing

• scalable systems and apps above

• service/business logic
• client/mobile/browser apps
• other domain-specific universes
  • gaming/graphics
  • embedded/realtime
  • telecomm

sw dev space (by culture)

• enterprise sw vendors
  • general solutions, QA/release cycles, support contracts
  • custom dev tools geared toward lock-in
• winning, mature internet players (GOOG, AMZN)
  • in-house code, agile dev, investment in innovation
  • partial moves to expose dev environments/tools
• open-source & immature internet players
  • start with clones of GOOG/AMZN c. 5-10 years ago
  • energetic, fascinating developer culture
• data scientists
  • many are tools-agnostic, some religious
  • another energetic, fascinating developer culture

roll-your-own culture

• google/amazon ethos
  • most code not open source
• open-source clones maturing but limited
  • years behind, limited-scope innovation
  • viable (with on-site hackers)
  • commercial support emerging
• most important: supportive hacker culture
  • new generation learning by doing, sharing
  • this ecosystem will improve over time
  • better tools needed here, emerging
