meetup asu 150113_upload

A NEW PLATFORM FOR A NEW ERA

© Copyright 2013 Pivotal. All rights reserved.

What we will cover in today’s Meetup

•  Data Science for Biomedicine
   –  Challenges
   –  Platforms, processes, and tools

•  Use Cases Leveraging Data Science for Biomedicine
   –  Genomics: Distributed GWAS
   –  Image Processing: Massively Parallel Cell Counting
   –  Healthcare: Predicting asthma-related hospital admissions

•  Wrap Up & Questions


Challenges


Challenge: The ‘big-ness’ of big data

Oil Exploration Medical Imaging

Video Surveillance Mobile Sensors

Stock Market Gene Sequencing

Smart Grids Social Media

FACEBOOK UPLOADS 250 MILLION PHOTOS EACH DAY

COST TO SEQUENCE ONE GENOME HAS FALLEN FROM $100M IN 2001 TO $10K IN 2011 TO $1K IN 2014

READING SMART METERS EVERY 15 MINUTES IS 3000X MORE DATA INTENSIVE

OIL RIGS GENERATE 25000 DATA POINTS PER SECOND


Challenge: Diverse data

•  Medications
•  Family History
•  Lab tests
•  Clinical Narratives
•  Imaging
•  Environment
•  Medical History
•  Sensors & Mobile
•  Genetics
•  Molecular Diagnostics


Solutions: New environments & tools

•  FLEXIBLE (VARIETY/VELOCITY): HDFS STORAGE AND MPP ARCHITECTURES DISTRIBUTE STORAGE AND PREVENT DATA MOVEMENT
•  SCALABLE (PETABYTES OF DATA): DISTRIBUTED COMPUTATION FOR PARALLELIZATION
•  ENABLING (RAPIDLY EVOLVING FIELD OF DATA SCIENCE AND TOOLS): OPEN-SOURCE LIBRARY FOR MACHINE LEARNING AT SCALE AND FRAMEWORK TO ACCESS COMMON LANGUAGES
•  ACCESSIBLE (MANY EXISTING LIBRARIES, TOOLS AND EXPERTISE): SQL ENGINE AND ODBC/JDBC CONNECTIONS TO HADOOP


Solutions: Leverage Diverse Data

Create predictive models at scale
•  Integrate data from various sources to build larger models to improve statistics and inference
•  Enable parallelized execution of libraries

[Figure: ROC curves (true positive rate vs. false positive rate) for models built from a growing set of data sources: Medical History, Clinician Notes, Genetics, Imaging]


Platforms


Multiple platforms with a single, simple goal: Distributed storage with in-place computation

Hadoop

MPP Database

SQL-on-Hadoop



MPP Database

Think of it as multiple PostgreSQL servers: a master and segments (workers)

Rows are distributed across segments by a particular field (or randomly)
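To make the row-distribution idea concrete, here is a minimal sketch of hashing rows to segments by a distribution field. The function name `distribute`, the `patient_id` field, and the choice of CRC32 as the hash are assumptions made for illustration; they are not how Greenplum actually implements distribution.

```python
import zlib
from collections import defaultdict


def distribute(rows, key, n_segments):
    """Assign each row to a segment by hashing the value of its distribution key."""
    segments = defaultdict(list)
    for row in rows:
        # crc32 is deterministic across runs, unlike Python's built-in hash() for strings.
        segment_id = zlib.crc32(str(row[key]).encode("utf-8")) % n_segments
        segments[segment_id].append(row)
    return segments


# Hypothetical patient rows; "patient_id" plays the role of the distribution field.
rows = [{"patient_id": i % 5, "lab_value": 10 * i} for i in range(20)]
segments = distribute(rows, key="patient_id", n_segments=4)

for segment_id in sorted(segments):
    print(segment_id, len(segments[segment_id]))
```

Because all rows with the same key value land on the same segment, a per-key aggregate can be computed locally on each segment without moving data.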



Hadoop

Think of it as a distributed file system with very large blocks of data

•  Schema on read allows flexibility for a variety of datasets
•  Compute using a variety of paradigms (e.g. MapReduce)

[Figure: a Name Node and Data Nodes 1-4, with numbered data blocks replicated across the data nodes]



SQL-on-Hadoop

Think of it as distributed PostgreSQL (GPDB) on Hadoop
•  SQL compliant
•  World-class query optimizer
•  Interactive query
•  Horizontal scalability
•  Robust data management
•  Common Hadoop formats
•  Deep analytics


The landscape of technology for big data

MapReduce
•  Use case: Batch processing of large volumes of data
•  Challenge: Not optimal for highly iterative methods (file I/O bottleneck), functions over windows
•  Sample application: Word count on tweets

SQL
•  Use case: Analytics on large-scale structured data
•  Challenge: Requires restructuring of data to manipulate very large files
•  Sample application: Predicting mortality on clinical data from diverse sources

HAMSTER/MPI, GraphLab
•  Use case: Operations on very large matrices
•  Challenge: Requires knowledge of OpenMP, mis-used for embarrassingly parallel problems
•  Sample application: Protein docking, molecular dynamics



Choosing the right environment for different analytics challenges

MapReduce
•  Clinical Narratives: Many documents with no shared processing
•  Imaging: Good for processing many images rapidly
•  Genetics: Read mapping

SQL
•  Clinical Narratives: Information retrieval
•  Imaging: In-database processing of very large images stored as a table
•  Genetics: BAM file manipulations, counts

HAMSTER/MPI, GraphLab
•  Imaging: Processing very large images (e.g. FFT)
•  Genetics: Multiple sequence alignment

Process & Tools


PIVOTAL DATA SCIENCE TOOLKIT

A large and varied tool box!

1 Find Data
•  Platforms: Pivotal Greenplum DB, Pivotal HD, Hadoop (other), SAS HPA, AWS

2 Write Code
•  Editing Tools: Vi/Vim, Emacs, Smultron, TextWrangler, Eclipse, Notepad++, IPython, Sublime, RStudio
•  Languages: SQL, Bash scripting, C, C++, C#, Java, Python, R

3 Run Code
•  Interfaces: pgAdminIII, psql, psycopg2, Terminal, Cygwin, PuTTY, WinSCP

4 Write Code for Big Data
•  In-Database: SQL, PL/Python, PL/Java, PL/R, PL/pgSQL
•  Hadoop: HAWQ, Pig, Hive, Java

5 Implement Algorithms
•  Libraries: MADlib
•  Java: Mahout
•  R: (Too many to list!)
•  Text: OpenNLP, NLTK, GPText
•  C++: OpenCV
•  Python: NumPy, SciPy, scikit-learn, Pandas
•  Programs: Alpine Miner, RStudio, MATLAB, SAS, Stata

6 Show Results
•  Visualization: python-matplotlib, python-networkx, D3.js, Tableau, GraphViz, Gephi, R (ggplot2, lattice, shiny), Excel

7 Collaborate
•  Sharing Tools: Chorus, Confluence, Socialcast, GitHub, Google Drive & Hangouts



Data Review Feature Creation Model Building Operationalization


MADlib In-Database Functions

Predictive Modeling Library

Linear Systems
•  Sparse and Dense Solvers

Matrix Factorization
•  Singular Value Decomposition (SVD)
•  Low-Rank

Generalized Linear Models
•  Linear Regression
•  Logistic Regression
•  Multinomial Logistic Regression
•  Cox Proportional Hazards Regression
•  Elastic Net Regularization
•  Sandwich Estimators (Huber-White, clustered, marginal effects)

Machine Learning Algorithms
•  Principal Component Analysis (PCA)
•  Association Rules (Affinity Analysis, Market Basket)
•  Topic Modeling (Parallel LDA)
•  Decision Trees
•  Ensemble Learners (Random Forests)
•  Support Vector Machines
•  Conditional Random Field (CRF)
•  Clustering (K-means)
•  Cross Validation

Hypothesis Testing
•  Chi-Squared test
•  F-test & t-test
•  ANOVA
•  Kolmogorov-Smirnov
•  Mann-Whitney test
•  Wilcoxon signed-rank test
•  Correlation
•  Summary

Support Modules
•  Array Operations
•  Sparse Vectors
•  Random Sampling
•  Probability Functions



Linear Regression: Streaming Algorithm

•  Finding linear dependencies between variables
•  How to compute with a single scan?


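Linear Regression: Parallel Computation

$$X^T y = \sum_i x_i^T y_i$$

Each segment computes the partial product over its own rows, and the master adds the partial results:

$$X^T y = X_1^T y_1 + X_2^T y_2$$

[Figure: Segment 1 and Segment 2 each hold a slice of X^T and y; the master combines their partial products]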
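The single-scan, segment-wise computation can be sketched as follows. This is an illustrative sketch, not MADlib's implementation: it assumes one predictor plus an intercept, accumulates X^T X alongside X^T y (both are needed for the normal equations), and uses hypothetical helper names (`segment_partials`, `merge`, `solve_2x2`) and toy data.

```python
def segment_partials(rows):
    """Single scan over one segment's rows: accumulate X^T X and X^T y."""
    xtx = [[0.0, 0.0], [0.0, 0.0]]
    xty = [0.0, 0.0]
    for x, y in rows:
        features = (1.0, x)  # intercept term plus one predictor
        for i in range(2):
            xty[i] += features[i] * y
            for j in range(2):
                xtx[i][j] += features[i] * features[j]
    return xtx, xty


def merge(a, b):
    """Master step: partial sums from two segments simply add."""
    xtx = [[a[0][i][j] + b[0][i][j] for j in range(2)] for i in range(2)]
    xty = [a[1][i] + b[1][i] for i in range(2)]
    return xtx, xty


def solve_2x2(xtx, xty):
    """Solve (X^T X) beta = X^T y for the 2x2 case with Cramer's rule."""
    (a, b), (c, d) = xtx
    det = a * d - b * c
    intercept = (d * xty[0] - b * xty[1]) / det
    slope = (a * xty[1] - c * xty[0]) / det
    return intercept, slope


# Toy data following y = 2x + 1, split across two hypothetical segments.
segment_1 = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
segment_2 = [(3.0, 7.0), (4.0, 9.0)]

merged = merge(segment_partials(segment_1), segment_partials(segment_2))
intercept, slope = solve_2x2(*merged)
print(intercept, slope)
```

The merged sums are the same as those from one scan over all rows, which is why the work can be split across segments without changing the fitted coefficients.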


Performing a linear regression on 10 million rows in seconds

Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.


Data Parallelism
•  Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks
•  Also known as ‘explicit parallelism’
•  Examples:
   –  Count a deck of cards by dividing it up between people in this room: Count in parallel
   –  MapReduce
   –  map() function in Python
   –  apply() family of functions in R
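The card-counting example can be written with Python's built-in `map()`. This sketch runs the tasks one after another; the point is that the tasks share nothing, so a parallel map (for example `multiprocessing.Pool.map`) could be swapped in without changing the logic. The number of people and the `count_pile` helper are illustrative assumptions.

```python
def count_pile(pile):
    """One 'person' counts their own pile without talking to anyone else."""
    return len(pile)


deck = list(range(52))  # a standard 52-card deck
n_people = 4
piles = [deck[i::n_people] for i in range(n_people)]  # deal the cards round-robin

pile_counts = list(map(count_pile, piles))  # independent tasks: the "map" step
total = sum(pile_counts)                    # combine partial counts: the "reduce" step
print(pile_counts, total)
```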


PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}

•  Allows users to write Greenplum/HAWQ/PostgreSQL functions in R, Python, Java, Perl, pgsql or C
•  The interpreter/VM of the language ‘X’ is installed on each node of the cluster
•  Data Parallelism: PL/X piggybacks on MPP architecture

[Figure: SQL enters at the Master Host (with a Standby Master) and is dispatched over the Interconnect to the Segment Hosts, each running several Segments]


Genomics Use Case: Massively-Parallel GWAS Study


In-database genome-wide association study

[Figure: Master Servers and Segment Servers joined by a Network Interconnect, queried through SQL & R]

COVARIATES (one row per individual)

Indiv   1   2   10
1       F   23  18
2       M   39  41
3       M   50  23
…
N       F   19  24

GENOTYPES (wide form, one column per SNP)

SNP 1   SNP 2   SNP M
AA      CC      TT
AT      CG      TT
AA      GG      TC
TT      CG      TC

GENOTYPES (long form, one row per individual and SNP)

Indiv   SNP   Geno
1       1     AA
2       1     AT
3       1     AA
1       2     CC
2       2     CG
3       2     GG
…
N       M     TC

The genotypes for SNP1, SNP2, ..., SNPM are spread across the segment servers, which compute Pval1, Pval2, ..., PvalM and LOR1, LOR2, ..., LORM in parallel.

RESULTS

SNP   P-value
1     2.34x10^-21
2     0.395
3     7.15x10^-17
…
M     0.000142

•  In-database computation of ~500,000 loci for thousands of individuals occurs rapidly and in parallel
•  Results are easily manipulated and explored
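As a rough illustration of the per-SNP computation, the sketch below groups long-form genotype rows by SNP and computes a p-value and a log odds ratio for each group. The slides do not state which test was used, so everything here is an assumption: a case/control phenotype (`is_case`), a known minor allele per SNP (`minor_allele`), a 2x2 carrier-by-status chi-square test with 1 degree of freedom, and LOR read as "log odds ratio" with a 0.5 continuity correction. The data are made up.

```python
import math
from collections import defaultdict


def snp_association(genotypes, is_case, minor_allele):
    """Per-SNP 2x2 carrier-by-status test; returns {snp: {"p": ..., "lor": ...}}."""
    # Cell order: a = carrier & case, b = carrier & control,
    #             c = non-carrier & case, d = non-carrier & control.
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for indiv, snp, geno in genotypes:
        carrier = minor_allele[snp] in geno
        idx = (0 if carrier else 2) + (0 if is_case[indiv] else 1)
        tables[snp][idx] += 1

    results = {}
    for snp, (a, b, c, d) in tables.items():
        n = a + b + c + d
        margins = (a + b) * (c + d) * (a + c) * (b + d)
        chi2 = n * (a * d - b * c) ** 2 / margins if margins else 0.0
        p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square survival function, 1 df
        lor = math.log((a + 0.5) * (d + 0.5) / ((b + 0.5) * (c + 0.5)))
        results[snp] = {"p": p_value, "lor": lor}
    return results


# Hypothetical long-form GENOTYPES rows: (indiv, snp, geno).
genotypes = [
    (1, 1, "AT"), (2, 1, "TT"), (3, 1, "AA"), (4, 1, "AA"),
    (1, 2, "CG"), (2, 2, "CC"), (3, 2, "CG"), (4, 2, "CC"),
]
is_case = {1: True, 2: True, 3: False, 4: False}  # hypothetical phenotype
minor_allele = {1: "T", 2: "G"}                   # hypothetical minor alleles

results = snp_association(genotypes, is_case, minor_allele)
for snp in sorted(results):
    print(snp, results[snp])
```

Each SNP's table depends only on that SNP's rows, so distributing the genotype table by SNP lets every segment work on its own SNPs independently.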


Visualize & analyze genomics data without movement

Generate relevant plots using tools like Tableau immediately after parallel statistical analysis in-database on Pivotal technology

Simply select SNPs of interest and visualize additional patient data or metrics stored in the same database!

Rapidly explore additional data sources, like mapped reads, to shorten time to insights. Data is available on the same platform, no data movement required!



Image Processing Use Case: Massively-Parallel Cell Counting


Image source: Tissuepathology.com


An image is simply an array of pixels


Representing an image in a table

HAWQ or GPDB enables rapid processing of multiple or extremely large images in parallel without memory limitations

[Figure: a 3x3 source image (rows 0-2, columns 0-2) and its structured form, a table with one row per pixel and columns row, col, intsy]


Translating image processing to simple SQL

HAWQ or GPDB enables rapid processing of multiple or extremely large images in parallel without memory limitations

•  No data movement required
•  Simple SQL queries for data exploration

Function: Distribution of pixel intensities

SQL:
    SELECT intsy, count(*)
    FROM tbl
    GROUP BY intsy

Output:
    150, 5
    215, 4
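The image-as-table idea and the intensity histogram can be mirrored in a few lines of Python. The 3x3 image below is an assumption: the slide image is not recoverable, so the pixel values were chosen to agree with the outputs quoted on the slides (five pixels at 150, four at 215).

```python
from collections import Counter

# Hypothetical 3x3 source image, indexed as image[row][col].
image = [
    [150, 150, 150],
    [150, 215, 215],
    [150, 215, 215],
]

# Structured form: one (row, col, intsy) record per pixel, like the table "tbl".
tbl = [(r, c, intsy) for r, row in enumerate(image) for c, intsy in enumerate(row)]

# Equivalent of: SELECT intsy, count(*) FROM tbl GROUP BY intsy
histogram = Counter(intsy for _, _, intsy in tbl)
for intsy in sorted(histogram):
    print(intsy, histogram[intsy])
```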


What about windows of pixels?

Function: Neighboring pixel values (no diagonals)

SQL:
    SELECT row, col,
           array[intsy,
                 LAG(intsy)  OVER (col_wdw),
                 LEAD(intsy) OVER (col_wdw),
                 LAG(intsy)  OVER (row_wdw),
                 LEAD(intsy) OVER (row_wdw)
           ] intsy_wdw
    FROM tbl
    WINDOW col_wdw AS (PARTITION BY col ORDER BY row),
           row_wdw AS (PARTITION BY row ORDER BY col)

Output:
    1, 1, [215, 150, 215, 150, 215]


Window functions for image processing

What about 8-connected kernels? Each diagonal becomes its own partition: pixels on the same diagonal share the value row-col (diag1) or row+col (diag2).

SQL:
    SELECT row, col,
           array[intsy,
                 LAG(intsy)  OVER (col_wdw),
                 LEAD(intsy) OVER (col_wdw),
                 LAG(intsy)  OVER (row_wdw),
                 LEAD(intsy) OVER (row_wdw),
                 LAG(intsy)  OVER (diag1_wdw),
                 LEAD(intsy) OVER (diag1_wdw),
                 LAG(intsy)  OVER (diag2_wdw),
                 LEAD(intsy) OVER (diag2_wdw)
           ] intsy_wdw
    FROM tbl
    WINDOW col_wdw   AS (PARTITION BY col ORDER BY row),
           row_wdw   AS (PARTITION BY row ORDER BY col),
           diag1_wdw AS (PARTITION BY (row-col) ORDER BY col),
           diag2_wdw AS (PARTITION BY (row+col) ORDER BY col)

Output:
    1, 1, [215, 150, 215, 150, 215, 150, 215, 150, 150]
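To see why partitioning by row-col and row+col reaches the diagonal neighbors, the window logic can be emulated in plain Python. The helper `lag_lead` is a hypothetical stand-in for LAG/LEAD over a named window, and the 3x3 image is an assumption chosen to be consistent with the outputs quoted on the slides.

```python
from collections import defaultdict


def lag_lead(tbl, partition_key, order_key):
    """Emulate LAG(intsy) and LEAD(intsy) OVER (PARTITION BY ... ORDER BY ...)."""
    partitions = defaultdict(list)
    for pixel in tbl:
        partitions[partition_key(pixel)].append(pixel)
    lag, lead = {}, {}
    for pixels in partitions.values():
        pixels.sort(key=order_key)
        for i, (row, col, _) in enumerate(pixels):
            # None plays the role of SQL NULL at the edge of a partition.
            lag[(row, col)] = pixels[i - 1][2] if i > 0 else None
            lead[(row, col)] = pixels[i + 1][2] if i + 1 < len(pixels) else None
    return lag, lead


# Hypothetical 3x3 image, indexed as image[row][col].
image = [
    [150, 150, 150],
    [150, 215, 215],
    [150, 215, 215],
]
tbl = [(r, c, v) for r, row in enumerate(image) for c, v in enumerate(row)]

windows = [
    lag_lead(tbl, lambda p: p[1], lambda p: p[0]),         # col_wdw
    lag_lead(tbl, lambda p: p[0], lambda p: p[1]),         # row_wdw
    lag_lead(tbl, lambda p: p[0] - p[1], lambda p: p[1]),  # diag1_wdw
    lag_lead(tbl, lambda p: p[0] + p[1], lambda p: p[1]),  # diag2_wdw
]


def intsy_wdw(row, col):
    """Center pixel followed by LAG/LEAD from each of the four windows."""
    out = [image[row][col]]
    for lag, lead in windows:
        out += [lag[(row, col)], lead[(row, col)]]
    return out


print(intsy_wdw(1, 1))
print(intsy_wdw(0, 0))
```

Corner and edge pixels get `None` for neighbors that fall outside the image, just as LAG and LEAD return NULL at the ends of a partition.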


Smoothing (noise removal)

•  Make each pixel intensity value similar to its neighbors by averaging the intensity values in the surrounding neighborhood.
•  Smoothing using a uniform box filter:

    SELECT row, col, madlib.array_mean(intsy_wdw)
    FROM (
        SELECT row, col,
               array[intsy,
                     LAG(intsy) OVER (col_wdw),   LEAD(intsy) OVER (col_wdw),
                     LAG(intsy) OVER (row_wdw),   LEAD(intsy) OVER (row_wdw),
                     LAG(intsy) OVER (diag1_wdw), LEAD(intsy) OVER (diag1_wdw),
                     LAG(intsy) OVER (diag2_wdw), LEAD(intsy) OVER (diag2_wdw)
               ] intsy_wdw
        FROM tbl
        WINDOW col_wdw   AS (PARTITION BY col ORDER BY row),
               row_wdw   AS (PARTITION BY row ORDER BY col),
               diag1_wdw AS (PARTITION BY (row-col) ORDER BY col),
               diag2_wdw AS (PARTITION BY (row+col) ORDER BY col)
    ) windowed

[Figure: 4x4 pixel grids with rows and columns numbered 0-3]


Smoothing (noise removal)

•  Make each pixel intensity value similar to its neighbors by averaging the intensity values in the surrounding neighborhood.
•  Smoothing using a Gaussian filter:

    SELECT row, col, madlib.array_dot(intsy_wdw,
           array[.2, .125, .125, .125, .125, .075, .075, .075, .075])
    FROM (
        SELECT row, col,
               array[intsy,
                     LAG(intsy) OVER (col_wdw),   LEAD(intsy) OVER (col_wdw),
                     LAG(intsy) OVER (row_wdw),   LEAD(intsy) OVER (row_wdw),
                     LAG(intsy) OVER (diag1_wdw), LEAD(intsy) OVER (diag1_wdw),
                     LAG(intsy) OVER (diag2_wdw), LEAD(intsy) OVER (diag2_wdw)
               ] intsy_wdw
        FROM tbl
        WINDOW col_wdw   AS (PARTITION BY col ORDER BY row),
               row_wdw   AS (PARTITION BY row ORDER BY col),
               diag1_wdw AS (PARTITION BY (row-col) ORDER BY col),
               diag2_wdw AS (PARTITION BY (row+col) ORDER BY col)
    ) windowed

Weights: .2 for the center pixel, .125 for each of the four row/column neighbors, .075 for each of the four diagonal neighbors.

[Figure: 4x4 pixel grids with rows and columns numbered 0-3, and the grid of filter weights]
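Both filters amount to a weighted sum over the nine-pixel window, which the following sketch makes explicit. The weight order follows the array in the SQL (center, column neighbors, row neighbors, diag1, diag2). Leaving border pixels unchanged is an assumption made here to sidestep missing neighbors, and the 4x4 test image is made up.

```python
# Offsets (d_row, d_col) in the same order as the SQL array:
# center, col LAG/LEAD, row LAG/LEAD, diag1 LAG/LEAD, diag2 LAG/LEAD.
OFFSETS = [
    (0, 0),
    (-1, 0), (1, 0),
    (0, -1), (0, 1),
    (-1, -1), (1, 1),
    (1, -1), (-1, 1),
]

BOX = [1.0 / 9.0] * 9  # uniform box filter: plain average of the window
GAUSSIAN = [.2, .125, .125, .125, .125, .075, .075, .075, .075]  # weights from the slide


def smooth(image, weights):
    """Weighted average over the 3x3 neighborhood of every interior pixel."""
    n_rows, n_cols = len(image), len(image[0])
    out = [row[:] for row in image]  # border pixels are left unchanged (assumption)
    for r in range(1, n_rows - 1):
        for c in range(1, n_cols - 1):
            out[r][c] = sum(
                w * image[r + dr][c + dc] for w, (dr, dc) in zip(weights, OFFSETS)
            )
    return out


# Hypothetical 4x4 image: flat background of 100 with one noisy pixel at (1, 1).
image = [
    [100, 100, 100, 100],
    [100, 190, 100, 100],
    [100, 100, 100, 100],
    [100, 100, 100, 100],
]

box_smoothed = smooth(image, BOX)
gauss_smoothed = smooth(image, GAUSSIAN)
print(box_smoothed[1][1], gauss_smoothed[1][1])
```

The box filter spreads the spike evenly over its neighborhood, while the Gaussian-style weights keep more of the center pixel and give the diagonal neighbors the least influence.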


Image Processing Pipeline For Object Counting

Pipeline stages, in order:
•  Original image
•  Smoothing: Average over window of pixels
•  Thresholding: Select pixels under intensity threshold
•  Morphological Operations: Min/max over window of pixels
•  Object Detection: Connected components
•  Object Counting: Select components with size filter

Image name     # Cells
Tma_001.jpg    359
Tma_002.jpg    1892
Tma_003.jpg    871
…              …
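The last stages of the pipeline (thresholding, connected components, size filter) can be sketched in plain Python. This is a single-machine illustration, not the in-database implementation from the talk: it skips the smoothing and morphological steps, uses 4-connectivity and a breadth-first flood fill, and the function name, threshold, and test image are assumptions.

```python
from collections import deque


def count_objects(image, threshold, min_size):
    """Count connected groups of dark pixels that contain at least min_size pixels."""
    n_rows, n_cols = len(image), len(image[0])
    # Thresholding: keep pixels under the intensity threshold.
    mask = [[image[r][c] < threshold for c in range(n_cols)] for r in range(n_rows)]
    seen = [[False] * n_cols for _ in range(n_rows)]
    count = 0
    for r in range(n_rows):
        for c in range(n_cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Object detection: flood-fill one 4-connected component.
            size = 0
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                cr, cc = queue.popleft()
                size += 1
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    nr, nc = cr + dr, cc + dc
                    if (0 <= nr < n_rows and 0 <= nc < n_cols
                            and mask[nr][nc] and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            # Object counting: apply the size filter.
            if size >= min_size:
                count += 1
    return count


# Hypothetical grayscale image: bright background (250) with three dark blobs.
image = [
    [250, 250, 250, 250, 250, 250],
    [250,  40,  50, 250, 250, 250],
    [250,  45,  60, 250,  30, 250],
    [250, 250, 250, 250, 250, 250],
    [ 55,  35, 250, 250, 250, 250],
]

print(count_objects(image, threshold=128, min_size=2))
```

Raising `min_size` discards small specks that survive thresholding, which is the role the size filter plays at the end of the pipeline.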


Healthcare Use Case: Predicting Asthma-Related Hospital Admissions


Code-a-Thon Details - Logistics
•  24-Hour Data Science Code-a-Thon
•  Four finalist vendors:
   –  Pivotal
   –  Cloudera
   –  Hortonworks, and
   –  IBM
•  Number of resources per vendor is 5
•  Final deliverable is a 15-minute presentation to senior leaders, executives, doctors, and pharmacists


Code-a-Thon Details - Data
•  Air Quality Data
   –  Air Pollutants and California Air Resources Board (ARB) Data
   –  Daily Particulate matter (PM 10 and 2.5) and Ozone (O3) measurements
•  Medication Order History
   –  4 years of anonymized medication order history
   –  Encounter data
      ▪  Encounter Type
      ▪  Encounter Date
      ▪  Diagnosis
      ▪  Patient Demographics
         -  Age/Gender/Zip Code
      ▪  Details of the Prescription
         -  Medication
         -  Therapeutic Class
         -  Expiration Date
   –  Dispense data
      ▪  Refill Date/ Location


Challenge #1: Raw Air Quality Data
•  Measured at 77 stations
•  Dispersed in 50 zip codes
•  Only 6% of customer population lives in a zip code where there is an air station

Any analysis that focuses only on zip codes with air stations would be incomplete


Challenge #1, Step 1. Shepard Interpolation
•  Calculate air miles between all zip codes
•  Populate the air quality measures at zip codes with no stations with inverse distance weighted averages from nearby air stations
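A minimal sketch of inverse-distance-weighted (Shepard) interpolation follows. Several details are assumptions rather than facts from the talk: "air miles" is computed as great-circle (haversine) distance, the weight is 1/d^2 (power 2), every station contributes regardless of distance, and the coordinates and measurements are made up.

```python
import math

EARTH_RADIUS_MILES = 3958.8


def air_miles(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in miles between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def shepard(target, stations, power=2):
    """Inverse-distance-weighted average of station values at target (lat, lon)."""
    numerator = denominator = 0.0
    for lat, lon, value in stations:
        d = air_miles(target[0], target[1], lat, lon)
        if d == 0.0:
            return value  # the target coincides with a station
        weight = 1.0 / d ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


# Hypothetical stations: (latitude, longitude, measured value).
stations = [(34.0, -118.0, 10.0), (34.0, -117.0, 20.0)]

midpoint_estimate = shepard((34.0, -117.5), stations)  # zip centroid halfway between
near_first = shepard((34.0, -117.9), stations)         # zip centroid close to station 1
print(midpoint_estimate, near_first)
```

A zip code halfway between two stations receives their plain average, and the estimate moves toward a station's reading as the zip code gets closer to it.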


Challenge #1, Step 2. Determine zip codes where asthma is over-represented
-  We calculated the prevalence of asthma for the overall population and each zip code
-  We determined whether the distribution of disease prevalence is significantly different for a zip code by running a chi-square test at the zip code level
-  The cut-off for p-value is 0.05
-  The standardized residuals are plotted
   Red: over-represented asthma
   Green: under-represented asthma
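One way to run such a zip-code-level test is sketched below. The exact form of the test is not given on the slide, so this is an assumed variant: each zip code's asthma count is compared with the count expected under the overall prevalence, using a 1-degree-of-freedom chi-square statistic (the square of the standardized residual). The zip code labels and counts are made up.

```python
import math


def zip_code_tests(counts, alpha=0.05):
    """counts: {zip_code: (n_patients, n_asthma)} -> {zip_code: (residual, p_value, label)}."""
    total_n = sum(n for n, _ in counts.values())
    total_k = sum(k for _, k in counts.values())
    prevalence = total_k / total_n  # overall asthma prevalence

    results = {}
    for zip_code, (n, k) in counts.items():
        expected = n * prevalence
        # Standardized residual: observed minus expected, in standard-deviation units.
        residual = (k - expected) / math.sqrt(expected * (1.0 - prevalence))
        chi2 = residual ** 2
        p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square survival function, 1 df
        if p_value >= alpha:
            label = "not significant"
        elif residual > 0:
            label = "over"   # would be plotted red
        else:
            label = "under"  # would be plotted green
        results[zip_code] = (residual, p_value, label)
    return results


# Hypothetical counts per zip code: (patients, patients with asthma).
counts = {"A": (1000, 150), "B": (1000, 100), "C": (1000, 50)}
results = zip_code_tests(counts)
for zip_code in sorted(results):
    print(zip_code, results[zip_code])
```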


Challenge #1, Step 3. Spatial Alignment


Challenge #2: Predicting Asthma Admissions Findings
•  Prior Hospitalization: Our analysis found that patients who had asthma-related hospitalizations in the last 12 months were 4.85 times more likely to have a hospitalization (any) in the next 3 months than patients who had no asthma hospitalizations in the last 12 months.
•  Socio-economic status: Of the various socio-economic status features we tried, the percent population under 50K is the one that was significant.
•  Age Under 10 and Age Above 60: Compared to the reference group (patients between the ages of 10 and 60), these two age groups have an increased likelihood (~24% and ~10%) of being hospitalized in the next 3 months.
•  History of Unfilled Medication: If a patient had an unfilled medication in their history, then, ceteris paribus, they are 13% more likely to have a hospitalization (p = 2.7e-06).


Application #1: Asthma Population Management Application

Application #2: Asthma Management Application


Technology Adoption Journey of a Major Healthcare Provider

•  Prove that better technology can speed up discovery: Code-a-thon
•  Prove that better technology can improve model quality: Length of Stay Modeling
•  Prove that technology is accessible to my clinicians and researchers: Comorbidity Feature Generation App
•  Prove that data science can help in areas other than clinical analytics: Fraud Detection for Accounts Payable
•  Prove that, once trained, our scientists can get to insights as quickly as the Pivotal DS team: EDIP Modeling in 4 days


Check out the Pivotal Data Science Blog! http://blog.pivotal.io/data-science-pivotal
