a new partnership for cross-scale, cross-domain escience

Post on 10-May-2015

DESCRIPTION

Overview of the Moore Foundation-funded collaborative project between CMU and UW to advance eScience given at the Microsoft eScience 2009 workshop

TRANSCRIPT

A New Partnership for eScience

Bill Howe, UW

Ed Lazowska, UW

Garth Gibson, CMU

Christos Faloutsos, CMU

Peter Lee, CMU (DARPA)

Chris Mentzel, Moore

http://escience.washington.edu

3/12/09 Bill Howe, eScience Institute 4

The University of Washington eScience Institute

Rationale
• The exponential increase in sensors is transitioning all fields of science and engineering from data-poor to data-rich
• Techniques and technologies include sensors and sensor networks, databases, data mining, machine learning, visualization, cluster/cloud computing
• If these techniques and technologies are not widely available and widely practiced, UW will cease to be competitive

Mission
• Help position the University of Washington at the forefront of research both in modern eScience techniques and technologies, and in the fields that depend upon them

Strategy
• Bootstrap a cadre of Research Scientists
• Add faculty in key fields
• Build out a “consultancy” of students and non-research staff

Staff and Funding

Funding
• $1M/year direct appropriation from WA State Legislature
• $1.5M from Gordon and Betty Moore Foundation (joint with CMU)
• Multiple proposals outstanding

Staffing
• Dave Beck, Research Scientist: Biosciences and software eng.
• Jeff Gardner, Research Scientist: Astrophysics and HPC
• Bill Howe, Research Scientist: Databases, visualization, DISC
• Ed Lazowska, Director
• Erik Lundberg (50%), Operations Director
• Mette Peters, Health Sciences Liaison
• Chance Reschke, Research Engineer: large scale computing platforms

…plus a senior faculty search underway
…plus a “consultancy” of students and professional staff

All science is reducing to a database problem

Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
New model: “Download the world” (Data acquired en masse, in support of many hypotheses)

• Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
• Medicine: ubiquitous digital records, MRI, ultrasound
• Oceanography: high-resolution models, cheap sensors, satellites
• Biology: lab automation, high-throughput sequencing

“Increase data collection exponentially with FlowCam!”

The long tail is getting fatter:

notebooks become spreadsheets (MB), spreadsheets become databases (GB), databases become clusters (TB), clusters become clouds (PB)

[Figure: The Long Tail. Axes: data volume, rank. Labels: CERN (~15PB/year), LSST (~100PB), PanSTARRS (~40PB), SDSS (~100TB), CARMEN (~50TB), Ocean Modelers, Seismologists, Microbiologists, <Spreadsheet users>]

Researchers with growing data management challenges but limited resources for cyberinfrastructure
• No dedicated IT staff
• Over-reliance on inadequate but familiar tools

“The future is already here. It’s just not very evenly distributed.” -- William Gibson

Case Study: Armbrust Lab

Armbrust Lab Tech Roadmap

[Figure: tools placed along two axes, scalability (workstation/server to cluster/cloud) and specialization (general tasks to specific tasks). Tools shown: Excel, R, RDBMS, BioPython, AnnoJ, NCBI BLAST, Phred/Phrap, ClustalW, MAQ, CLC Genomics Machine, WebBlast*, PPlacer*, CloudBurst, Hadoop/Dryad, Parallel Databases, Azure, AWS, and one entry marked “?”. Legend: Past, Present, Soon, Other tools]

What Does Scalable Mean?

Operationally:
• In the past: “Works even if data doesn’t fit in main memory”
• Now: “Can make use of 1000s of cheap computers”

Formally:
• In the past: polynomial time and space. If you have N data items, you must do no more than N^k operations
• Soon: log-linear time and linear space. If you have N data items, you must do no more than N log(N) operations

Soon, you’ll only get one pass at the data
So you better make that one pass count
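As a minimal sketch of what "one pass" can look like in practice, the Python below keeps a running mean and variance while reading each value exactly once, using Welford's update rule. The class and function names (`RunningStats`, `summarize`) are illustrative assumptions, not something from the talk.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Mean and variance maintained in a single pass (Welford's method)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Each value is seen once and then discarded: O(1) memory.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 for an empty stream.
        return self.m2 / self.count if self.count else 0.0


def summarize(stream):
    """Consume any iterable once and return its running statistics."""
    stats = RunningStats()
    for x in stream:
        stats.update(x)
    return stats


stats = summarize(iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
print(stats.count, stats.mean, stats.variance)
```

The work is linear in N and the state is constant-size, so the same loop applies whether the stream is a list, a file, or a sensor feed.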

A Goal: Cross-Scale Solutions

Gracefully scale up
• from files to databases to cluster to cloud
• from MB to GB to TB to PB

“Gracefully” means: logical data independence
• no expensive ETL migration projects

“Gracefully” means: everyone can use it
• Hackers / Computational Scientists
• Lab/Field Scientists
• The Public
• K12
• Legislators

System | Data Model | Operations | Services
GPL | * | * | None for free
Workflow | * | arbitrary boxes-and-arrows | typing, provenance, Pegasus-style resource mapping, task parallelism
SQL / Relational Algebra | Relations | Select, Project, Join, Aggregate, … | optimization, physical data independence, indexing, parallelism
MapReduce | [(key,value)] | Map, Reduce | massive data parallelism, fault tolerance, scheduling
Pig | Nested Relations | RA-like, with Nest/Flatten | optimization, monitoring, scheduling
DryadLINQ | IQueryable, IEnumerable | RA + Apply + Partitioning | typing, massive data parallelism, fault tolerance
MPI | Arrays/Matrices | 70+ ops | data parallelism, full control

MapReduce

Many tasks process big data, produce big data
Want to use hundreds or thousands of CPUs
• ... but this needs to be easy
Parallel databases exist, but require DBAs and $$$$
• …and do not easily scale to thousands of computers

MapReduce is a lightweight framework, providing:
• Automatic parallelization and distribution
• Fault-tolerance
• I/O scheduling
• Status and monitoring

A complete DryadLINQ program

public class LogEntry
{
    public string user, ip;
    public string page;

    public LogEntry(string line)
    {
        string[] fields = line.Split(' ');
        this.user = fields[8];
        this.ip = fields[9];
        this.page = fields[5];
    }
}

public class UserPageCount
{
    public string user, page;
    public int count;

    public UserPageCount(string usr, string page, int cnt)
    {
        this.user = usr;
        this.page = page;
        this.count = cnt;
    }
}

PartitionedTable<string> logs = PartitionedTable.Get<string>(@"file:\\…\logfile.pt");

var logentries = from line in logs
                 where !line.StartsWith("#")
                 select new LogEntry(line);

var user = from access in logentries
           where access.user.EndsWith(@"\ulfar")
           select access;

var accesses = from access in user
               group access by access.page into pages
               select new UserPageCount("ulfar", pages.Key, pages.Count());

var htmAccesses = from access in accesses
                  where access.page.EndsWith(".htm")
                  orderby access.count descending
                  select access;

htmAccesses.ToPartitionedTable(@"file:\\…\results.pt");

slide source: Christophe Poulain, MSR

Relational Databases

Pre-relational DBMS brittleness: if your data changed, your application often broke.

Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.

[Diagram: files and pointers, then relations (physical data independence), then views (logical data independence)]

“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.”

Key Idea: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation

Relational Databases

Databases are especially, but not exclusively, effective at “Needle in Haystack” problems:
• Extracting small results from big datasets
• Transparently provide “old style” scalability
• Your query will always* finish, regardless of dataset size.
• Indexes are easily built and automatically used when appropriate

CREATE INDEX seq_idx ON sequence(seq);

SELECT seq FROM sequence WHERE seq = 'GATTACGATATTA';

*almost
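A small sketch of the "automatically used" claim, assuming SQLite (via Python's standard `sqlite3` module) as a stand-in engine; the talk does not name one. The three sample rows are made up for illustration. `EXPLAIN QUERY PLAN` is SQLite-specific and reports the access path the planner picked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequence (seq TEXT)")
conn.executemany(
    "INSERT INTO sequence (seq) VALUES (?)",
    [("GATTACGATATTA",), ("CCGTTAGGACTAA",), ("TTGACCATGCAAG",)],  # toy data
)
conn.execute("CREATE INDEX seq_idx ON sequence(seq)")

query = "SELECT seq FROM sequence WHERE seq = ?"
needle = ("GATTACGATATTA",)

matches = conn.execute(query, needle).fetchall()

# The last column of each EXPLAIN QUERY PLAN row is a human-readable
# description of how SQLite intends to execute the statement.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, needle)]

print(matches)
print(plan)
```

The query text never mentions `seq_idx`; the planner decides to use it, which is the point of physical data independence.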

Key Idea: Data Independence

[Diagram: views (logical data independence), relations (physical data independence), files and pointers]

Views:
SELECT * FROM my_sequences

Relations:
SELECT seq FROM ncbi_sequences WHERE seq = 'GATTACGATATTA';

Files and pointers:
f = fopen('table_file');
fseek(10030440);
while (True) {
  fread(&buf, 1, 8192, f);
  if (buf == GATTACGATATTA) {
    . . .

Key Idea: An Algebra of Tables

[Diagram: a query plan built from select, project, and two join operators]

Other operators: aggregate, union, difference, cross product
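To make the operators concrete, here is a minimal sketch that treats a table as a list of dicts and implements select, project, and an equi-join as plain functions. The tables `sequences` and `hits` and their columns are hypothetical, chosen only for illustration.

```python
def select(rows, predicate):
    """Keep the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


def project(rows, columns):
    """Keep only the named columns of every row."""
    return [{col: row[col] for col in columns} for row in rows]


def join(left, right, key):
    """Equi-join on a shared column, using a hash index on the right side."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Hypothetical tables.
sequences = [
    {"seq_id": 1, "organism": "diatom"},
    {"seq_id": 2, "organism": "copepod"},
]
hits = [
    {"seq_id": 1, "score": 40},
    {"seq_id": 1, "score": 12},
    {"seq_id": 2, "score": 33},
]

# Operators compose because each one takes tables and returns a table.
result = project(
    select(join(sequences, hits, "seq_id"), lambda row: row["score"] > 30),
    ["organism", "score"],
)
print(result)
```

Because every operator maps tables to tables, expressions can be nested and rearranged freely, which is what makes algebraic reasoning possible.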

Key Idea: Algebraic Optimization

N = ((z*2)+((z*3)+0))/1

Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x

Apply rules 1, 3, 4, 2:
N = (2+3)*z

two operations instead of five, no division operator

Same idea works with the Relational Algebra!
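One relational analogue is pushing a selection below a join: both plans return the same rows, but the rewritten plan feeds far fewer rows into the join. The sketch below uses hypothetical `orders` and `lines` tables and a simple in-memory join; it illustrates the law and is not a description of how any particular optimizer works.

```python
def join(left, right, key):
    """Equi-join two lists of dicts on a shared column."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Hypothetical data: 100 orders, 10 of them placed "today".
orders = [
    {"item": i % 5, "date": "today" if i % 10 == 0 else "earlier"}
    for i in range(100)
]
lines = [{"item": i, "name": f"item-{i}"} for i in range(5)]


def is_today(row):
    return row["date"] == "today"


# Plan A: join everything, then select.
joined = join(orders, lines, "item")
plan_a = [row for row in joined if is_today(row)]

# Plan B: select first (pushed below the join), then join.
todays_orders = [row for row in orders if is_today(row)]
plan_b = join(todays_orders, lines, "item")

print(plan_a == plan_b)                  # same answer
print(len(joined), len(todays_orders))   # rows materialized before the filter vs. rows fed to the join
```

An optimizer can apply this rewrite without looking at the data, in the same way the arithmetic laws were applied without knowing z.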

Shared Nothing Parallel Databases

• Teradata
• Greenplum
• Netezza
• Aster Data Systems
• DATAllegro (Microsoft)
• Vertica
• MonetDB (recently commercialized as “Vectorwise”)

Case Study: Astrophysics Simulation

N-body Astrophysics Simulation

• 15 years in dev

• 10^9 particles

• Gravity

• Months to run

• 7.5 million CPU hours

• 500 timesteps

• Big Bang to now

Simulations from Tom Quinn’s Lab, work by Sarah Loebman, YongChul Kwon, Bill Howe, Jeff Gardner, Magda Balazinska

Q1: Find Hot Gas

SELECT id

FROM gas

WHERE temp > 150000

Single Node: Query 1

[Chart; dataset sizes: 169 MB, 1.4 GB, 36 GB]

Multiple Nodes: Query 1

Database Z

Multiple Nodes: Query 2

Database Z

Q4: Gas Deletion

SELECT gas1.id
FROM gas1
FULL OUTER JOIN gas2
ON gas1.id = gas2.id
WHERE gas2.id IS NULL

Particles removed between two timesteps
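In standard SQL a missing match is tested with IS NULL, because a comparison written as = NULL never evaluates to true. The sketch below shows the difference on toy data, assuming SQLite through Python's `sqlite3` module. It swaps in LEFT OUTER JOIN for the FULL OUTER JOIN used above, since older SQLite versions lack full outer joins and the left join is sufficient for finding rows of gas1 with no partner in gas2. The particle ids are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gas1 (id INTEGER PRIMARY KEY);  -- earlier timestep
    CREATE TABLE gas2 (id INTEGER PRIMARY KEY);  -- later timestep
    INSERT INTO gas1 (id) VALUES (1), (2), (3), (4);
    INSERT INTO gas2 (id) VALUES (1), (3);
""")

# Anti-join: rows of gas1 that found no partner in gas2.
removed = conn.execute("""
    SELECT gas1.id
    FROM gas1
    LEFT OUTER JOIN gas2 ON gas1.id = gas2.id
    WHERE gas2.id IS NULL
    ORDER BY gas1.id
""").fetchall()

# "= NULL" evaluates to NULL rather than true, so no row passes the filter.
with_equals = conn.execute("""
    SELECT gas1.id
    FROM gas1
    LEFT OUTER JOIN gas2 ON gas1.id = gas2.id
    WHERE gas2.id = NULL
""").fetchall()

print(removed)
print(with_equals)
```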

Single Node: Query 4

Multiple Nodes: Query 4

Ease of Use

gas43 = FOREACH rawGas43 GENERATE $0 AS pid:long;
gas60 = FOREACH rawGas60 GENERATE $0 AS pid:long;
groupedGas = COGROUP gas43 BY pid, gas60 BY pid;

selectedGas = FOREACH groupedGas GENERATE
    FLATTEN((IsEmpty(gas43) ? null : gas43)) AS s43,
    FLATTEN((IsEmpty(gas60) ? null : gas60)) AS s60;

destroyed = FILTER selectedGas BY s60 is null;

Visualization and Mashups

Dancing with Data

Data explosion, again

Data growth is outpacing Moore’s Law
Why?
• Cost of acquisition has dropped through the floor
• Every pairwise comparison of datasets generates a new dataset: N^2 growth

So: Scalable analysis is necessary
But: Scalable analysis is hard

It’s not just the size….

Corollary: # of apps scales as N^2

Every pairwise comparison motivates a new application

To keep up, we need to entrain new programmers, make existing programmers more productive, or both

Satellite Images + Crime Incidence Reports

Twitter Feed + Flickr Stream

Zooplankton and Temperature

<Vis movie>

Why Visualization?

• High bandwidth of the human visual cortex
• Query-writing presumes a precise goal
• Try this in SQL: “What does the salt wedge look like?”

Data Product Ensembles

source: Antonio Baptista, Center for Coastal Margin Observation and Prediction

Example: Find matching sequences

Given a set of sequences
Find all sequences equal to “GATTACGATATTA”

Example System: Teradata

AMP = unit of parallelism

Example System: Teradata

SELECT * FROM Orders o, Lines i WHERE o.item = i.item AND o.date = today()

Find all orders from today, along with the items ordered

[Query plan: scan Order o, then select date = today(); scan Item i; join the two on o.item = i.item]

Example System: Teradata

[Diagram: each of AMP 1, AMP 2, and AMP 3 runs scan Order o, then select date = today(), then hash h(item); each hashed row is sent to AMP 1, AMP 2, or AMP 3 according to h(item)]

Example System: Teradata

[Diagram: each of AMP 1, AMP 2, and AMP 3 runs scan Item i, then hash h(item); each hashed row is sent to AMP 1, AMP 2, or AMP 3 according to h(item)]

Example System: Teradata

AMP 1: join o.item = i.item; contains all orders and all lines where hash(item) = 1
AMP 2: join o.item = i.item; contains all orders and all lines where hash(item) = 2
AMP 3: join o.item = i.item; contains all orders and all lines where hash(item) = 3
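The redistribution scheme on these slides can be sketched in a few lines of Python. This is a single-process simulation under stated assumptions: Python's built-in `hash` stands in for h(item), AMPs are numbered 0 to 2 rather than 1 to 3, the date filter is applied once rather than on each AMP, and the table contents are hypothetical. It is not Teradata's implementation.

```python
from collections import defaultdict

NUM_AMPS = 3  # three units of parallelism, as in the slides


def amp_for(item):
    """Pick the AMP responsible for a join key (stand-in for h(item))."""
    return hash(item) % NUM_AMPS


def redistribute(rows, key):
    """Send every row to the AMP chosen by hashing its join key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[amp_for(row[key])].append(row)
    return partitions


def local_join(orders, items, key):
    """Ordinary hash join, run independently on one AMP."""
    index = defaultdict(list)
    for item in items:
        index[item[key]].append(item)
    return [{**o, **i} for o in orders for i in index.get(o[key], [])]


def parallel_join(orders, items, key, today):
    todays_orders = [o for o in orders if o["date"] == today]
    order_parts = redistribute(todays_orders, key)
    item_parts = redistribute(items, key)
    result = []
    for amp in range(NUM_AMPS):
        # Matching rows share a hash value, so they always meet on one AMP.
        result.extend(local_join(order_parts[amp], item_parts[amp], key))
    return result


# Hypothetical tables.
orders = [
    {"order_id": 1, "item": 10, "date": "2009-03-12"},
    {"order_id": 2, "item": 11, "date": "2009-03-11"},
    {"order_id": 3, "item": 12, "date": "2009-03-12"},
]
items = [
    {"item": 10, "name": "widget"},
    {"item": 11, "name": "gadget"},
    {"item": 12, "name": "gizmo"},
]

result = parallel_join(orders, items, "item", "2009-03-12")
print(sorted((r["order_id"], r["name"]) for r in result))
```

Because both tables are partitioned with the same hash function, no AMP ever needs a row held by another AMP during the join itself.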

MapReduce Programming Model

Input & Output: each a set of key/value pairs

Programmer specifies two functions:

map (in_key, in_value) -> list(out_key, intermediate_value)
• Processes input key/value pair
• Produces set of intermediate pairs

reduce (out_key, list(intermediate_value)) -> list(out_value)
• Combines all intermediate values for a particular key
• Produces a set of merged output values (usually just one)

Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell

slide source: Google, Inc.

Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. 
the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood.

Example: Document Processing

Example: Word length histogram

How many “big”, “medium”, and “small” words are used?

Big = Yellow = 10+ letters

Medium = Red = 5..9 letters

Small = Blue = 2..4 letters

Tiny = Pink = 1 letter

Split the document into chunks and process each chunk on a different computer

Map Task 1 (204 words), Chunk 1, emits (key, value) pairs: (yellow, 17) (red, 77) (blue, 107) (pink, 3)

Map Task 2 (190 words), Chunk 2, emits (key, value) pairs: (yellow, 20) (red, 71) (blue, 93) (pink, 6)

Map task 1: (yellow, 17) (red, 77) (blue, 107) (pink, 3)
Map task 2: (yellow, 20) (red, 71) (blue, 93) (pink, 6)

“Shuffle step”: values are grouped by key
(yellow, 17) (yellow, 20)
(red, 77) (red, 71)
(blue, 93) (blue, 107)
(pink, 6) (pink, 3)

Reduce tasks:
(yellow, 37)
(red, 148)
(blue, 200)
(pink, 9)
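The three stages of this example can be sketched as ordinary Python functions. The function names and the tokenization rule (split on whitespace, keep letters only) are assumptions; the slides do not say how words were counted, so the map step is not expected to reproduce the slide's per-chunk numbers. The shuffle and reduce steps are run on the map outputs quoted in the slides.

```python
from collections import defaultdict


def color_of(word):
    """Bucket a word by length, using the color legend from the slides."""
    n = len(word)
    if n >= 10:
        return "yellow"  # big
    if n >= 5:
        return "red"     # medium
    if n >= 2:
        return "blue"    # small
    return "pink"        # tiny


def map_task(chunk):
    """Count words per color within one chunk; emit (color, count) pairs."""
    counts = defaultdict(int)
    for token in chunk.split():
        letters = "".join(ch for ch in token if ch.isalpha())
        if letters:
            counts[color_of(letters)] += 1
    return list(counts.items())


def shuffle(map_outputs):
    """Group the values from all map tasks by key."""
    grouped = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """Merge all values for one key into a single total."""
    return key, sum(values)


# Map outputs as reported on the slides.
task1 = [("yellow", 17), ("red", 77), ("blue", 107), ("pink", 3)]
task2 = [("yellow", 20), ("red", 71), ("blue", 93), ("pink", 6)]

totals = dict(reduce_task(k, v) for k, v in shuffle([task1, task2]).items())
print(totals)
print(map_task("a decent respect to the opinions of mankind"))
```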

New Example: What does this do?

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, 1);

reduce(String output_key, Iterator intermediate_values):
  // output_key: word
  // output_values: ????
  int result = 0;
  for each v in intermediate_values:
    result += v;
  Emit(result);
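For readers who want to run the pseudocode, here is a minimal in-memory sketch: the two functions translate directly, and a small driver plays the role of the framework by grouping intermediate values by key. The driver `map_reduce` and the names `word_map` and `sum_reduce` are illustrative assumptions; a real framework distributes this work across machines. Running it shows that the program counts how often each word occurs.

```python
from collections import defaultdict


def map_reduce(inputs, map_fn, reduce_fn):
    """Run map over (key, value) inputs, group by key, then reduce each group."""
    intermediate = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, value in map_fn(in_key, in_value):
            intermediate[out_key].append(value)
    return {
        out_key: list(reduce_fn(out_key, iter(intermediate[out_key])))
        for out_key in sorted(intermediate)
    }


def word_map(doc_name, contents):
    # Emit (word, 1) for every word in the document.
    for word in contents.split():
        yield word, 1


def sum_reduce(word, counts):
    # Add up the 1s emitted for this word.
    yield sum(counts)


documents = [("doc1", "to be or not to be")]
print(map_reduce(documents, word_map, sum_reduce))
```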

slide source: Google, Inc.

Before RDBMS: if your data changed, your application broke.

Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.

“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” -- E.F. Codd 1979

Key Ideas: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation

Relational Database Management Systems (RDBMS)

MapReduce is a Nascent Database Engine

Access Methods and Scheduling:

Query Language:

Query Optimizer:

Pig Latin

Graphics taken from: hadoop.apache.org and research.yahoo.com/node/90

MapReduce and Hadoop

MR introduced by Google
• Published paper in OSDI 2004
• MR: high-level programming model and implementation for large-scale parallel data processing

Hadoop
• Open source MR implementation
• Yahoo!, Facebook, New York Times

A Query Language for MR: Pig Latin

High-level, SQL-like dataflow language for MR
Goal: Sweet spot between SQL and MR
Applies SQL-like, high-level language constructs to accomplish low-level MR programming.

operators:
• LOAD
• STORE
• FILTER
• FOREACH … GENERATE
• GROUP

binary operators:
• JOIN
• COGROUP
• UNION

other support:
• UDFs
• COUNT
• SUM
• AVG
• MIN/MAX

Additional operators: http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm

New Task: k-mer Similarity

Given a set of database sequences and a set of query sequences

Return the top N similar pairs, where similarity is defined as the number of common k-mers

Pig Latin program

D = LOAD 'db_sequences.fasta' USING FASTA() AS (did, dsequence);
Q = LOAD 'query_sequences.fasta' USING FASTA() AS (qid, qsequence);

Kd = FOREACH D GENERATE did, FLATTEN(kmers(7, dsequence)) AS kmer;
Kq = FOREACH Q GENERATE qid, FLATTEN(kmers(7, qsequence)) AS kmer;

R = JOIN Kd BY kmer, Kq BY kmer;

G = GROUP R BY (qid, did);
-- each group holds the joined rows for one (qid, did) pair
C = FOREACH G GENERATE FLATTEN(group) AS (qid, did), COUNT(R) AS score;
T = FILTER C BY score > 4;

STORE T INTO 'seqs.txt';
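The same computation can be sketched on a single machine in plain Python, which may help when checking a cluster run against a small input. Two assumptions are labeled here: similarity counts distinct shared k-mers (the slides do not say whether repeats count), and the function names and sample sequences are hypothetical.

```python
from collections import Counter, defaultdict


def kmers(k, sequence):
    """Return the set of distinct length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def top_similar_pairs(db, queries, k, n):
    """Score every (query id, database id) pair by shared k-mers; return the top n."""
    # Index: k-mer -> ids of database sequences containing it (the join key).
    index = defaultdict(set)
    for did, seq in db.items():
        for kmer in kmers(k, seq):
            index[kmer].add(did)

    # Probe the index with each query k-mer and count matches per pair.
    scores = Counter()
    for qid, seq in queries.items():
        for kmer in kmers(k, seq):
            for did in index.get(kmer, ()):
                scores[(qid, did)] += 1
    return scores.most_common(n)


# Hypothetical toy input with k = 3 to keep the example readable.
db = {"d1": "GATTACA", "d2": "CCCCCCC"}
queries = {"q1": "ATTACG"}
print(top_similar_pairs(db, queries, k=3, n=5))
```

The index lookup plays the role of the JOIN on kmer, and the counter plays the role of GROUP followed by COUNT.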

New Task: Alignment

RMAP alignment implemented in Hadoop
Michael Schatz, CloudBurst: highly sensitive read mapping with MapReduce, Bioinformatics 25(11), April 2009

Goal: Align reads to a reference genome

Overview:
• Map: Split reads and reference into k-mers
• Reduce: for matching k-mers, find end-to-end alignments (seed and extend)

MapReduce Overhead

Elastic MapReduce

• Custom Jar: Java
• Streaming: any language that can read/write stdin/stdout
• Pig: simple data flow language
• Hive: SQL

Taxonomy of Parallel Architectures

Easiest to program, but $$$$

Scales to 1000s of computers
