MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)

Myria: Scalable Analytics as a Service. Bill Howe, PhD, University of Washington, with Dan Suciu, Magda Balazinska, Dan Halperin, and many students. MMDS 2014, Berkeley CA

Upload: bill-howe

Post on 10-May-2015





DESCRIPTION

A talk I gave at the MMDS workshop in June 2014 on the Myria system, as well as some of Seung-Hee Bae's work on scalable graph clustering. https://mmds-data.org/

TRANSCRIPT

1. Myria: Scalable Analytics as a Service. Bill Howe, PhD, University of Washington, with Dan Suciu, Magda Balazinska, Dan Halperin, and many students. MMDS 2014, Berkeley CA

2. Today: Three observations about Big Data; Myria: Scalable Analytics as a Service; Parallel Flow-based Graph Clustering (if time, but there won't be). 7/10/2014 Bill Howe, UW 2/57

3. How can we deliver 1000 little SDSSs to anyone who wants one?

4. How much time do you spend handling data as opposed to doing science? Mode answer: 90%

5. [Bar chart: Benchmark 1 and Benchmark 2; series: Old system, Your system, Our system; axis 0 to 120] A typical Computer Science paper. (slide src: Dan Halperin)

6. [Bar chart: Benchmark 1 and Benchmark 2; series: Old system, Your system, Our system, What people use; axis 0 to 12500] The reality of the situation. (slide src: Dan Halperin)

7. "[This was hard] due to the large amount of data (e.g. data indexes for data retrieval, dissection into data blocks and processing steps, order in which steps are performed to match memory/time requirements, file formats required by software used). In addition we actually spend quite some time in iterations fixing problems with certain features (e.g. capping ENCODE data), testing features and feature products to include, identifying useful test data sets, adjusting the training data (e.g. 1000G vs human-derived variants). So roughly 50% of the project was testing and improving the model, 30% figuring out how to do things (engineering) and 20% getting files and getting them into the right format. I guess in total [I spent] 6 months [on this project]. At least 3 months on issues of scale, file handling, and feature engineering." -- Martin Kircher, Genome Sciences
   Why? 3k NSF postdocs in 2010; $50k / postdoc; at least 50% overhead; maybe $75M annually at NSF alone?

8. Data Science Workflow: 1) Preparing to run a model; 2) Running the model; 3) Interpreting the results. Gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, massaging: "80% of the work" -- Aaron Kimball. "The other 80% of the work."

9. Observation 1: Your cool algorithmic problem is not the bottleneck.

10. Symbolic Reasoning and Algebraic Optimization
    N = ((z*2)+((z*3)+0))/1
    Algebraic Laws:
    1. (+) identity: x+0 = x
    2. (/) identity: x/1 = x
    3. (*) distributes: (n*x+n*y) = n*(x+y)
    4. (*) commutes: x*y = y*x
    Apply rules 1, 3, 4, 2: N = (2+3)*z
    Two operations instead of five, no division operator.
    Every database does this kind of optimization every time you issue a query.

11. SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp , x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp , w.category as nc_category , CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp