meetup asu 150113_upload

A NEW PLATFORM FOR A NEW ERA

Upload: jay-etchings

Post on 08-Aug-2015


TRANSCRIPT

Page 1: Meetup asu 150113_upload

A NEW PLATFORM FOR A NEW ERA

Page 2: Meetup asu 150113_upload

2 © Copyright 2013 Pivotal. All rights reserved.

What we will cover in today’s Meetup

•  Data Science for Biomedicine
–  Challenges
–  Platforms, processes, and tools

•  Use Cases Leveraging Data Science for Biomedicine
–  Genomics: Distributed GWAS
–  Image Processing: Massively Parallel Cell Counting
–  Healthcare: Predicting asthma-related hospital admissions

•  Wrap Up & Questions

Page 3: Meetup asu 150113_upload


Challenges

Page 4: Meetup asu 150113_upload


Challenge: The ‘big-ness’ of big data

Oil Exploration Medical Imaging

Video Surveillance Mobile Sensors

Stock Market Gene Sequencing

Smart Grids Social Media

FACEBOOK UPLOADS 250 MILLION PHOTOS EACH DAY

COST TO SEQUENCE ONE GENOME HAS FALLEN FROM $100M IN 2001 TO $10K IN 2011 TO $1K IN 2014

READING SMART METERS EVERY 15 MINUTES IS 3000X MORE DATA INTENSIVE

OIL RIGS GENERATE 25000 DATA POINTS PER SECOND

Page 5: Meetup asu 150113_upload


Challenge: Diverse data

•  Medications
•  Family History
•  Lab tests
•  Clinical Narratives
•  Imaging
•  Environment
•  Medical History
•  Sensors & Mobile
•  Genetics
•  Molecular Diagnostics

Page 6: Meetup asu 150113_upload


Solutions: New environments & tools

HDFS STORAGE AND MPP ARCHITECTURES DISTRIBUTE STORAGE AND PREVENT DATA MOVEMENT

VARIETY/VELOCITY

DISTRIBUTED COMPUTATION FOR PARALLELIZATION

PETABYTES OF DATA

OPEN-SOURCE LIBRARY FOR MACHINE LEARNING AT SCALE AND FRAMEWORK TO ACCESS COMMON LANGUAGES

RAPIDLY EVOLVING FIELD OF DATA SCIENCE AND TOOLS

SQL ENGINE AND ODBC/JDBC CONNECTIONS TO HADOOP

MANY EXISTING LIBRARIES, TOOLS AND EXPERTISE

FLEXIBLE

SCALABLE

ENABLING

ACCESSIBLE

Page 7: Meetup asu 150113_upload


Solutions: Leverage Diverse Data

Create predictive models at scale
•  Integrate data from various sources to build larger models to improve statistics and inference
•  Enable parallelized execution of libraries

[Figure: true positive rate vs. false positive rate curves for models built on progressively more data sources, from Medical History alone up to Medical History, Genetics, Imaging, and Clinician Notes combined]

Page 8: Meetup asu 150113_upload


Platforms

Page 9: Meetup asu 150113_upload


Multiple platforms with a single, simple goal: Distributed storage with in-place computation

Hadoop

MPP Database

SQL-on-Hadoop

Page 10: Meetup asu 150113_upload


Multiple platforms with a single, simple goal: Distributed storage with in-place computation

Think of it as multiple PostgreSQL servers

Segments/Workers

Master

Rows are distributed across segments by a particular field (or randomly)
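
The row-distribution idea can be sketched in a few lines of Python. This is an illustration only: the `segment_for` and `distribute` helpers, the CRC32 hash, and the four-segment cluster are assumptions made for the example, not the database's actual hashing scheme.

```python
import random
import zlib

NUM_SEGMENTS = 4  # assumed cluster size, for illustration only


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Pick a segment by hashing the distribution field (hypothetical helper)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def distribute(rows, field=None, num_segments=NUM_SEGMENTS, seed=0):
    """Assign each row (a dict) to a segment by `field`, or randomly if field is None."""
    rng = random.Random(seed)
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        if field is None:
            target = rng.randrange(num_segments)
        else:
            target = segment_for(row[field], num_segments)
        segments[target].append(row)
    return segments


# Two visits for each of six patients; distributing by patient_id keeps
# all rows of one patient on the same segment.
rows = [{"patient_id": i, "visit": v} for i in range(6) for v in range(2)]
by_patient = distribute(rows, field="patient_id")
at_random = distribute(rows)
```

Distributing by a field keeps rows that share that value together, which helps joins and per-key aggregates run without moving data; random distribution only aims for an even spread.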

Hadoop

MPP Database

SQL-on-Hadoop

Page 11: Meetup asu 150113_upload


Multiple platforms with a single, simple goal: Distributed storage with in-place computation

Think of it as a distributed file system with very large blocks of data

Schema on read allows flexibility for a variety of datasets

Compute using a variety of paradigms (e.g. MapReduce)

Hadoop

MPP Database

[Diagram: one Name Node and Data Nodes 1 to 4 holding numbered data blocks (1, 2, 3), each block stored on more than one Data Node]

SQL-on-Hadoop

Page 12: Meetup asu 150113_upload


Multiple platforms with a single, simple goal: Distributed storage with in-place computation

Think of it as distributed PostgreSQL (GPDB) on Hadoop
•  SQL compliant
•  World-class query optimizer
•  Interactive query
•  Horizontal scalability
•  Robust data management
•  Common Hadoop formats
•  Deep analytics

Hadoop

MPP Database

SQL-on-Hadoop


Page 16: Meetup asu 150113_upload


The landscape of technology for big data

MapReduce
•  Use case: Batch processing of large volumes of data
•  Challenges: Not optimal for highly iterative methods (file I/O bottleneck), functions over windows
•  Sample application: Word count on tweets

SQL
•  Use case: Analytics on large-scale structured data
•  Challenges: Requires restructuring of data to manipulate very large files
•  Sample application: Predicting mortality on clinical data from diverse sources

HAMSTER/MPI, GraphLab
•  Use case: Operations on very large matrices
•  Challenges: Requires knowledge of OpenMP, mis-used for embarrassingly parallel problems
•  Sample application: Protein docking, molecular dynamics

Page 17: Meetup asu 150113_upload


Choosing the right environment for different analytics challenges

MapReduce
•  Clinical Narratives: Many documents with no shared processing
•  Imaging: Good for processing many images rapidly
•  Genetics: Read mapping

SQL
•  Clinical Narratives: Information retrieval
•  Imaging: In-database processing of very large images stored as a table
•  Genetics: BAM file manipulations, counts

HAMSTER/MPI, GraphLab
•  Imaging: Processing very large images (e.g. FFT)
•  Genetics: Multiple sequence alignment

Page 18: Meetup asu 150113_upload


Process & Tools

Page 20: Meetup asu 150113_upload


PIVOTAL DATA SCIENCE TOOLKIT

A large and varied tool box!

1 Find Data
Platforms •  Pivotal Greenplum DB •  Pivotal HD •  Hadoop (other) •  SAS HPA •  AWS

2 Write Code
Editing Tools •  Vi/Vim •  Emacs •  Smultron •  TextWrangler •  Eclipse •  Notepad++ •  IPython •  Sublime •  RStudio
Languages •  SQL •  Bash scripting •  C •  C++ •  C# •  Java •  Python •  R

3 Run Code
Interfaces •  pgAdminIII •  psql •  psycopg2 •  Terminal •  Cygwin •  PuTTY •  WinSCP

4 Write Code for Big Data
In-Database •  SQL •  PL/Python •  PL/Java •  PL/R •  PL/pgSQL
Hadoop •  HAWQ •  Pig •  Hive •  Java

5 Implement Algorithms
Libraries •  MADlib
Java •  Mahout
R •  (Too many to list!)
Text •  OpenNLP •  NLTK •  GPText
C++ •  OpenCV
Python •  NumPy •  SciPy •  scikit-learn •  Pandas
Programs •  Alpine Miner •  RStudio •  MATLAB •  SAS •  Stata

6 Show Results
Visualization •  python-matplotlib •  python-networkx •  D3.js •  Tableau •  GraphViz •  Gephi •  R (ggplot2, lattice, shiny) •  Excel

7 Collaborate
Sharing Tools •  Chorus •  Confluence •  Socialcast •  GitHub •  Google Drive & Hangouts

Data Review •  Feature Creation •  Model Building •  Operationalization

Page 21: Meetup asu 150113_upload


MADlib In-Database Functions

Predictive Modeling Library

Linear Systems
•  Sparse and Dense Solvers

Matrix Factorization
•  Singular Value Decomposition (SVD)
•  Low-Rank

Generalized Linear Models
•  Linear Regression
•  Logistic Regression
•  Multinomial Logistic Regression
•  Cox Proportional Hazards Regression
•  Elastic Net Regularization
•  Sandwich Estimators (Huber-White, clustered, marginal effects)

Machine Learning Algorithms
•  Principal Component Analysis (PCA)
•  Association Rules (Affinity Analysis, Market Basket)
•  Topic Modeling (Parallel LDA)
•  Decision Trees
•  Ensemble Learners (Random Forests)
•  Support Vector Machines
•  Conditional Random Field (CRF)
•  Clustering (K-means)
•  Cross Validation

Hypothesis Testing
•  Chi-Squared test
•  F-test & t-test
•  ANOVA
•  Kolmogorov-Smirnov
•  Mann-Whitney test
•  Wilcoxon signed-rank test
•  Correlation
•  Summary

Support Modules
•  Array Operations
•  Sparse Vectors
•  Random Sampling
•  Probability Functions

Collaborators:

Page 22: Meetup asu 150113_upload


Linear Regression: Streaming Algorithm
•  Finding linear dependencies between variables
•  How to compute with a single scan?

Page 23: Meetup asu 150113_upload


Linear Regression: Parallel Computation

$$X^T y = \sum_i x_i^T y_i$$

Page 24: Meetup asu 150113_upload


Linear Regression: Parallel Computation

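$$X_1^T y_1 + X_2^T y_2 = X^T y$$

[Diagram: the rows of $X^T$ and $y$ are split between Segment 1 and Segment 2; the partial products $X_1^T y_1$ and $X_2^T y_2$ are added on the Master to give $X^T y$]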
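
The per-segment sums can be sketched in plain Python. This is a minimal sketch under stated assumptions, not MADlib's implementation: the helper names (`partial_sums`, `combine`, `solve`) and the toy data are hypothetical, and the sketch also accumulates $X^T X$ (which is a sum over rows in the same way as $X^T y$) so that the normal equations $X^T X \beta = X^T y$ can be solved on the master.

```python
def partial_sums(rows):
    """One segment's contribution: (X^T X, X^T y) accumulated in a single scan."""
    k = len(rows[0][0])
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for x, y in rows:
        for a in range(k):
            xty[a] += x[a] * y
            for b in range(k):
                xtx[a][b] += x[a] * x[b]
    return xtx, xty


def combine(parts):
    """Master step: add the per-segment partial sums element-wise."""
    k = len(parts[0][1])
    xtx = [[sum(p[0][a][b] for p in parts) for b in range(k)] for a in range(k)]
    xty = [sum(p[1][a] for p in parts) for a in range(k)]
    return xtx, xty


def solve(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = (m[r][n] - s) / m[r][r]
    return beta


# Toy data following y = 1 + 2*x exactly; each x vector is [1, x] (intercept, slope).
segment_1 = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0)]
segment_2 = [([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)]

xtx, xty = combine([partial_sums(segment_1), partial_sums(segment_2)])
beta = solve(xtx, xty)  # approximately [1.0, 2.0]
```

Each segment reads its own rows once and returns only a small k x k matrix and a length-k vector, so the amount of data sent to the master does not grow with the number of rows.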


Page 26: Meetup asu 150113_upload


Performing a linear regression on 10 million rows in seconds

Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.

Page 27: Meetup asu 150113_upload


Data Parallelism
•  Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks
•  Also known as ‘explicit parallelism’
•  Examples:
–  Count a deck of cards by dividing it up between people in this room: Count in parallel
–  MapReduce
–  map() function in Python
–  apply() family of functions in R
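
The card-counting example maps directly onto Python's `map()`. The sketch below is illustrative only; the four-way split and the use of a thread pool are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def count_cards(pile):
    """Each 'person' counts only their own pile; no communication is needed."""
    return len(pile)


# A 52-card deck dealt into four piles.
deck = [(rank, suit) for suit in "SHDC" for rank in range(1, 14)]
piles = [deck[i::4] for i in range(4)]

# Serial version with the built-in map().
serial_total = sum(map(count_cards, piles))

# Same computation with the piles counted concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_total = sum(pool.map(count_cards, piles))
```

Because the per-pile counts are independent, the only combining step is the final sum, which is what makes the problem data parallel.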

Page 28: Meetup asu 150113_upload


PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}
•  Allows users to write Greenplum/HAWQ/PostgreSQL functions in the R, Python, Java, Perl, pgsql or C languages
•  The interpreter/VM of the language ‘X’ is installed on each node of the cluster
•  Data Parallelism:
–  PL/X piggybacks on MPP architecture

[Diagram: SQL enters at the Master Host (with a Standby Master), which connects over the Interconnect to four Segment Hosts, each running two Segments]

Page 29: Meetup asu 150113_upload


Genomics Use Case: Massively-Parallel GWAS Study

Page 30: Meetup asu 150113_upload


In-database genome-wide association study

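[Diagram: Master Servers and Segment Servers joined by a Network Interconnect and driven with SQL & R]

COVARIATES

Indiv   Covariates
        1    2    10
1       F    23   18
2       M    39   41
3       M    50   23
N       F    19   24

Genotype matrix (one row per individual, one column per SNP)

SNP
1    2    M
AA   CC   TT
AT   CG   TT
AA   GG   TC
TT   CG   TC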


Page 34: Meetup asu 150113_upload


In-database genome-wide association study

[Diagram: Master Servers and Segment Servers joined by a Network Interconnect and driven with SQL & R. Each SNP (SNP1, SNP2, ..., SNPM) yields a p-value (Pval1, Pval2, ..., PvalM) and an LOR (LOR1, LOR2, ..., LORM).]

COVARIATES

Indiv   Covariates
        1    2    10
1       F    23   18
2       M    39   41
3       M    50   23
N       F    19   24

GENOTYPES

Indiv   SNP   Geno
1       1     AA
2       1     AT
3       1     AA
1       2     CC
2       2     CG
3       2     GG
N       M     TC

RESULTS

SNP   P-value
1     2.34 x 10^-21
2     0.395
3     7.15 x 10^-17
M     0.000142

•  In-database computation of ~500,000 loci for thousands of individuals occurs rapidly and in parallel

•  Results are easily manipulated and explored
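
A per-SNP computation of this kind can be sketched in Python. The slides do not say which statistical test the SQL & R code runs, so this sketch substitutes a simple allelic 2 x 2 chi-square test with one degree of freedom; the case/control labels, the `snp_association` and `gwas` helpers, the 0.5 correction in the log odds ratio, and the toy data are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def snp_association(genotypes, is_case):
    """Allelic 2x2 test for one SNP: returns (p_value, log_odds_ratio)."""
    alleles = sorted({a for _, geno in genotypes for a in geno})
    tested = alleles[-1]  # arbitrary choice of the counted allele
    # counts[group] = [tested-allele count, other-allele count]
    counts = {True: [0, 0], False: [0, 0]}
    for indiv, geno in genotypes:
        for allele in geno:
            counts[is_case[indiv]][0 if allele == tested else 1] += 1
    a, b = counts[True]
    c, d = counts[False]
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square tail, 1 d.f.
    # 0.5 is added to every cell so the ratio is defined when a cell is zero.
    lor = math.log(((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5)))
    return p_value, lor


def gwas(rows, is_case):
    """Group long-format (indiv, snp, geno) rows by SNP and test each SNP independently."""
    by_snp = defaultdict(list)
    for indiv, snp, geno in rows:
        by_snp[snp].append((indiv, geno))
    return {snp: snp_association(g, is_case) for snp, g in by_snp.items()}


# Toy data: individuals 1 and 2 are cases, 3 and 4 are controls (assumed labels).
is_case = {1: True, 2: True, 3: False, 4: False}
rows = [
    (1, 1, "AA"), (2, 1, "AA"), (3, 1, "TT"), (4, 1, "TT"),
    (1, 2, "CG"), (2, 2, "CC"), (3, 2, "CG"), (4, 2, "CC"),
]
results = gwas(rows, is_case)
```

Since each SNP is tested using only its own rows, grouping by SNP is the natural unit of parallel work, matching the long GENOTYPES layout shown above.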

Page 35: Meetup asu 150113_upload


Visualize & analyze genomics data without movement

Generate relevant plots using tools like Tableau immediately after parallel statistical analysis in-database on Pivotal technology

Page 36: Meetup asu 150113_upload


Visualize & analyze genomics data without movement

Simply select SNPs of interest and visualize additional patient data or metrics stored in the same database!

Page 37: Meetup asu 150113_upload


Visualize & analyze genomics data without movement

Rapidly explore additional data sources, like mapped reads, to shorten time to insights. Data is available on the same platform, no data movement required!

Page 38: Meetup asu 150113_upload


Image Processing Use Case: Massively-Parallel Cell Counting

Page 39: Meetup asu 150113_upload


Tissuepathology.com

Page 40: Meetup asu 150113_upload


An image is simply an array of pixels

Page 41: Meetup asu 150113_upload


Representing an image in a table

HAWQ or GPDB enables rapid processing of multiple or extremely large images in parallel without memory limitations

[Figure: a 3 x 3 source image (rows 0 to 2, columns 0 to 2) and its structured form, a table with columns row, col, and intsy holding one record per pixel: (0, 0), (0, 1), (0, 2), (1, 0), ..., (2, 2)]
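
The flattening of an image into (row, col, intsy) records can be sketched as follows. The `image_to_rows` helper is hypothetical, and the pixel values are an assumption: they are chosen to agree with the query outputs shown on the following slides (five pixels of intensity 150, four of 215), since the source image itself is not recoverable from the transcript.

```python
def image_to_rows(image):
    """Flatten a 2-D list of intensities into (row, col, intsy) records."""
    return [
        (r, c, intsy)
        for r, pixel_row in enumerate(image)
        for c, intsy in enumerate(pixel_row)
    ]


# Assumed 3 x 3 image, for illustration only.
image = [
    [150, 150, 150],
    [150, 215, 215],
    [150, 215, 215],
]
tbl = image_to_rows(image)
```

Each record is self-describing, so the rows can be spread across segments in any order and still be reassembled or queried by position.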

Page 42: Meetup asu 150113_upload


Translating image processing to simple SQL

HAWQ or GPDB enables rapid processing of multiple or extremely large images in parallel without memory limitations
•  No data movement required
•  Simple SQL queries for data exploration

Function: Distribution of pixel intensities

SQL:
    SELECT intsy, count(*)
    FROM tbl
    GROUP BY intsy

Output:
    150, 5
    215, 4

[Figure: the 3 x 3 source image and its structured form, a table with columns row, col, and intsy]
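
The GROUP BY query has a direct Python analogue, sketched here with `collections.Counter`. The table contents are an assumption chosen to be consistent with the output shown on the slide.

```python
from collections import Counter

# (row, col, intsy) records for an assumed 3 x 3 image.
tbl = [
    (0, 0, 150), (0, 1, 150), (0, 2, 150),
    (1, 0, 150), (1, 1, 215), (1, 2, 215),
    (2, 0, 150), (2, 1, 215), (2, 2, 215),
]

# Equivalent of: SELECT intsy, count(*) FROM tbl GROUP BY intsy
histogram = sorted(Counter(intsy for _, _, intsy in tbl).items())
```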


Page 44: Meetup asu 150113_upload


What about windows of pixels?

Function: Neighboring pixel values (no diagonals)

SQL:
    SELECT row, col,
      array[intsy,
        LAG(intsy) OVER (col_wdw),
        LEAD(intsy) OVER (col_wdw),
        LAG(intsy) OVER (row_wdw),
        LEAD(intsy) OVER (row_wdw)
      ] intsy_wdw
    FROM tbl
    WINDOW col_wdw AS (PARTITION BY col ORDER BY row),
           row_wdw AS (PARTITION BY row ORDER BY col)

Output:
    1, 1, [215, 150, 215, 150, 215]

[Figure: the 3 x 3 source image and its structured form, a table with columns row, col, and intsy]

Page 45: Meetup asu 150113_upload


Window functions for image processing

What about 8-connected kernels?


Page 46: Meetup asu 150113_upload


Window functions for image processing

diag1: row-col diag2: row+col


Page 47: Meetup asu 150113_upload


Window functions for image processing

Function: Neighboring pixel values (including diagonals)

SQL:
    SELECT row, col,
      array[intsy,
        LAG(intsy) OVER (col_wdw),
        LEAD(intsy) OVER (col_wdw),
        LAG(intsy) OVER (row_wdw),
        LEAD(intsy) OVER (row_wdw),
        LAG(intsy) OVER (diag1_wdw),
        LEAD(intsy) OVER (diag1_wdw),
        LAG(intsy) OVER (diag2_wdw),
        LEAD(intsy) OVER (diag2_wdw)
      ] intsy_wdw
    FROM tbl
    WINDOW col_wdw AS (PARTITION BY col ORDER BY row),
           row_wdw AS (PARTITION BY row ORDER BY col),
           diag1_wdw AS (PARTITION BY (row-col) ORDER BY col),
           diag2_wdw AS (PARTITION BY (row+col) ORDER BY col)

Output:
    1, 1, [215, 150, 215, 150, 215, 150, 215, 150, 150]
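
The way LAG and LEAD over the four windows pick out a pixel's neighbors can be emulated in Python. This is a sketch for intuition, not how the database executes the query; the `lag_lead` and `neighborhoods` helpers are hypothetical, and the pixel values are an assumption chosen to be consistent with the outputs on the slides.

```python
from collections import defaultdict


def lag_lead(tbl, partition_key, order_key):
    """Emulate LAG/LEAD(intsy) OVER (PARTITION BY ... ORDER BY ...).

    Returns {(row, col): (lag, lead)}, with None where SQL would give NULL.
    """
    parts = defaultdict(list)
    for row, col, intsy in tbl:
        parts[partition_key(row, col)].append((order_key(row, col), row, col, intsy))
    out = {}
    for members in parts.values():
        members.sort()
        for i, (_, row, col, _) in enumerate(members):
            lag = members[i - 1][3] if i > 0 else None
            lead = members[i + 1][3] if i + 1 < len(members) else None
            out[(row, col)] = (lag, lead)
    return out


def neighborhoods(tbl):
    """Build [self, col lag/lead, row lag/lead, diag1 lag/lead, diag2 lag/lead] per pixel."""
    windows = [
        lag_lead(tbl, lambda r, c: c, lambda r, c: r),      # col_wdw
        lag_lead(tbl, lambda r, c: r, lambda r, c: c),      # row_wdw
        lag_lead(tbl, lambda r, c: r - c, lambda r, c: c),  # diag1_wdw
        lag_lead(tbl, lambda r, c: r + c, lambda r, c: c),  # diag2_wdw
    ]
    result = {}
    for row, col, intsy in tbl:
        values = [intsy]
        for window in windows:
            values.extend(window[(row, col)])
        result[(row, col)] = values
    return result


# Assumed 3 x 3 image in (row, col, intsy) form.
tbl = [
    (0, 0, 150), (0, 1, 150), (0, 2, 150),
    (1, 0, 150), (1, 1, 215), (1, 2, 215),
    (2, 0, 150), (2, 1, 215), (2, 2, 215),
]
nbrs = neighborhoods(tbl)
center = nbrs[(1, 1)]  # [215, 150, 215, 150, 215, 150, 215, 150, 150]
```

Pixels on one diagonal share the same value of row-col, and pixels on the other share row+col, which is why partitioning by those expressions turns diagonal neighbors into adjacent rows of a window. Border pixels get None for neighbors that fall outside the image.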

Page 48: Meetup asu 150113_upload


Smoothing (noise removal)
•  Make each pixel intensity value similar to its neighbors by averaging the intensity values in the surrounding neighborhood.
•  Smoothing using a uniform box filter:

    SELECT row, col, madlib.array_mean(intsy_wdw)
    FROM (
      SELECT row, col, array[intsy,
        LAG(intsy) OVER (col_wdw), LEAD(intsy) OVER (col_wdw),
        LAG(intsy) OVER (row_wdw), LEAD(intsy) OVER (row_wdw),
        LAG(intsy) OVER (diag1_wdw), LEAD(intsy) OVER (diag1_wdw),
        LAG(intsy) OVER (diag2_wdw), LEAD(intsy) OVER (diag2_wdw)
      ] intsy_wdw
      FROM tbl
      WINDOW col_wdw AS (PARTITION BY col ORDER BY row),
             row_wdw AS (PARTITION BY row ORDER BY col),
             diag1_wdw AS (PARTITION BY (row-col) ORDER BY col),
             diag2_wdw AS (PARTITION BY (row+col) ORDER BY col)
    ) AS t

[Figure: two 4 x 4 pixel grids with rows and columns numbered 0 to 3]
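
A uniform box filter can also be sketched directly on a 2-D array in Python. This is a sketch under an assumption about borders: it averages only the neighbors that exist, which may differ from how the SQL version treats the NULLs produced at the image edge. The `box_smooth` helper and the pixel values are hypothetical.

```python
def box_smooth(image):
    """Replace each pixel by the mean of its 3 x 3 neighborhood (clipped at the border)."""
    n_rows, n_cols = len(image), len(image[0])
    smoothed = []
    for r in range(n_rows):
        out_row = []
        for c in range(n_cols):
            window = [
                image[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, n_rows))
                for cc in range(max(c - 1, 0), min(c + 2, n_cols))
            ]
            out_row.append(sum(window) / len(window))
        smoothed.append(out_row)
    return smoothed


# Assumed 3 x 3 image, for illustration only.
image = [
    [150, 150, 150],
    [150, 215, 215],
    [150, 215, 215],
]
smoothed = box_smooth(image)
```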

Page 49: Meetup asu 150113_upload


Smoothing (noise removal)
•  Make each pixel intensity value similar to its neighbors by averaging the intensity values in the surrounding neighborhood.
•  Smoothing using a Gaussian filter:

    SELECT row, col, madlib.array_dot(intsy_wdw,
      array[.2,.125,.125,.125,.125,.075,.075,.075,.075])
    FROM (
      SELECT row, col, array[intsy,
        LAG(intsy) OVER (col_wdw), LEAD(intsy) OVER (col_wdw),
        LAG(intsy) OVER (row_wdw), LEAD(intsy) OVER (row_wdw),
        LAG(intsy) OVER (diag1_wdw), LEAD(intsy) OVER (diag1_wdw),
        LAG(intsy) OVER (diag2_wdw), LEAD(intsy) OVER (diag2_wdw)
      ] intsy_wdw
      FROM tbl
      WINDOW col_wdw AS (PARTITION BY col ORDER BY row),
             row_wdw AS (PARTITION BY row ORDER BY col),
             diag1_wdw AS (PARTITION BY (row-col) ORDER BY col),
             diag2_wdw AS (PARTITION BY (row+col) ORDER BY col)
    ) AS t

Kernel weights (center .2, horizontal and vertical neighbors .125, diagonal neighbors .075):

    .075   .125   .075
    .125   .2     .125
    .075   .125   .075

[Figure: two 4 x 4 pixel grids with rows and columns numbered 0 to 3]
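
The weighted version differs from the box filter only in replacing the plain mean with a dot product against the kernel. The sketch below uses the weights from the query; the `weighted_smooth` helper, the decision to leave border pixels unchanged, and the pixel values are assumptions made to keep the example short.

```python
# Weights from the slide: center .2, edge neighbors .125, diagonal neighbors .075.
KERNEL = [
    [0.075, 0.125, 0.075],
    [0.125, 0.2, 0.125],
    [0.075, 0.125, 0.075],
]


def weighted_smooth(image, kernel=KERNEL):
    """Apply the 3 x 3 kernel to interior pixels; border pixels are left unchanged."""
    n_rows, n_cols = len(image), len(image[0])
    out = [list(pixel_row) for pixel_row in image]
    for r in range(1, n_rows - 1):
        for c in range(1, n_cols - 1):
            out[r][c] = sum(
                kernel[dr + 1][dc + 1] * image[r + dr][c + dc]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
            )
    return out


# Assumed 3 x 3 image, for illustration only.
image = [
    [150, 150, 150],
    [150, 215, 215],
    [150, 215, 215],
]
smoothed = weighted_smooth(image)
kernel_total = sum(sum(kernel_row) for kernel_row in KERNEL)
```

The weights sum to 1, so a region of constant intensity is left unchanged by the filter.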

Page 50: Meetup asu 150113_upload


Image Processing Pipeline For Object Counting

•  Original
•  Smoothing: Average over window of pixels
•  Thresholding: Select pixels under intensity threshold
•  Morphological Operations: Min/max over window of pixels
•  Object Detection: Connected components
•  Object Counting: Select components with size filter

Image name     # Cells
Tma_001.jpg    359
Tma_002.jpg    1892
Tma_003.jpg    871
…              …
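
The last stages of the pipeline (thresholding, connected components, size filter) can be sketched in Python. This is a simplified single-machine illustration, not the in-database implementation: it skips the smoothing and morphological steps, uses a 4-connected flood fill, and the `count_objects` helper, threshold, and toy image are all assumptions.

```python
def count_objects(image, threshold, min_size=1, max_size=None):
    """Count connected groups of below-threshold pixels whose size passes the filter."""
    n_rows, n_cols = len(image), len(image[0])
    # Thresholding: keep pixels under the intensity threshold.
    mask = [[image[r][c] < threshold for c in range(n_cols)] for r in range(n_rows)]
    seen = [[False] * n_cols for _ in range(n_rows)]
    sizes = []
    for r in range(n_rows):
        for c in range(n_cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Object detection: flood fill one 4-connected component.
            stack = [(r, c)]
            seen[r][c] = True
            size = 0
            while stack:
                y, x = stack.pop()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < n_rows and 0 <= nx < n_cols:
                        if mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
            sizes.append(size)
    # Object counting: keep components that pass the size filter.
    return sum(
        1 for s in sizes if s >= min_size and (max_size is None or s <= max_size)
    )


# Toy image: two dark blobs (4 and 2 pixels) and one isolated dark pixel.
image = [
    [200, 200, 200, 200, 200, 200],
    [200, 50, 50, 200, 200, 200],
    [200, 50, 50, 200, 40, 200],
    [200, 200, 200, 200, 40, 200],
    [60, 200, 200, 200, 200, 200],
]
all_components = count_objects(image, threshold=100)
cells = count_objects(image, threshold=100, min_size=2)
```

The size filter is what separates plausible cells from single-pixel noise in this toy example.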

Page 51: Meetup asu 150113_upload


Healthcare Use Case: Predicting Asthma-Related Hospital Admissions

Page 52: Meetup asu 150113_upload


Code-a-Thon Details - Logistics
•  24-Hour Data Science Code-a-Thon
•  Four finalist vendors:
–  Pivotal
–  Cloudera
–  Hortonworks, and
–  IBM
•  Number of resources per vendor is 5
•  Final deliverable is a 15 minute presentation to senior leaders, executives, doctors, and pharmacists

Page 53: Meetup asu 150113_upload


Code-a-Thon Details - Data
•  Air Quality Data
–  Air Pollutants and California Air Resource Board (ARB) Data
–  Daily Particulate matter (PM 10 and 2.5) and Ozone (O3) measurements
•  Medication Order History
–  4 years of anonymized medication order history
–  Encounter data
   ▪  Encounter Type
   ▪  Encounter Date
   ▪  Diagnosis
   ▪  Patient Demographics
      -  Age/Gender/Zip Code
   ▪  Details of the Prescription
      -  Medication
      -  Therapeutic Class
      -  Expiration Date
–  Dispense data
   ▪  Refill Date/Location

Page 54: Meetup asu 150113_upload


Challenge #1

Raw Air Quality Data
•  Measured at 77 stations
•  Dispersed in 50 zip codes
•  Only 6% of customer population lives in a zip code where there is an air station

Any analysis that focuses only on zip codes with air stations would be incomplete

Page 55: Meetup asu 150113_upload


Challenge #1

Step 1. Shepard Interpolation
•  Calculate air miles between all zip codes
•  Populate the air quality measures at zip codes with no stations with inverse distance weighted averages from nearby air stations
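
Inverse distance weighting can be sketched in Python as follows. Several details are assumptions, since the slide does not give them: zip codes are represented by latitude/longitude points, "air miles" is taken to be great-circle (haversine) distance, the weight exponent is 2, and the `haversine_miles` and `shepard` helpers and station values are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def shepard(target, stations, power=2):
    """Inverse-distance-weighted average of station values at `target`.

    `stations` is a list of ((lat, lon), value) pairs.
    """
    numerator = denominator = 0.0
    for location, value in stations:
        distance = haversine_miles(target, location)
        if distance == 0.0:
            return value  # the target coincides with a station
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


# Two made-up stations on the same meridian with made-up readings.
stations = [((34.0, -118.0), 10.0), ((36.0, -118.0), 20.0)]
midpoint_value = shepard((35.0, -118.0), stations)
near_first = shepard((34.1, -118.0), stations)
```

Nearby stations dominate the estimate because the weights fall off with the square of the distance; restricting the sum to stations within some radius would be a natural refinement.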

Page 56: Meetup asu 150113_upload


Challenge #1

Step 2. Determine zip codes where asthma is over-represented
-  We calculated the prevalence of asthma for the overall population and each zip code
-  We determined whether the distribution of disease prevalence is significantly different for a zip code by running a chi-square test at the zip code level
-  The cut-off for p-value is 0.05
-  The standardized residuals are plotted
   Red: over-represented asthma
   Green: under-represented asthma
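
A zip-code-level test of this kind can be sketched in Python. The slide does not specify the exact formulation, so the sketch assumes a 2 x 2 table (inside the zip code vs. everywhere else, asthma vs. no asthma), a one-degree-of-freedom chi-square test, and a Pearson residual, (observed - expected) / sqrt(expected), for the asthma count. The `zip_test` helper and the counts are hypothetical.

```python
import math


def zip_test(asthma_in_zip, pop_in_zip, asthma_total, pop_total, alpha=0.05):
    """Return (p_value, residual, significant) for one zip code.

    Assumes every margin of the 2 x 2 table is non-zero.
    """
    a = asthma_in_zip                    # asthma, inside the zip code
    b = pop_in_zip - a                   # no asthma, inside the zip code
    c = asthma_total - a                 # asthma, elsewhere
    d = (pop_total - pop_in_zip) - c     # no asthma, elsewhere
    chi2 = pop_total * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square tail, 1 d.f.
    expected = pop_in_zip * asthma_total / pop_total
    residual = (a - expected) / math.sqrt(expected)
    return p_value, residual, p_value < alpha


# Overall prevalence is 100 / 1000 = 10% in both made-up examples.
typical = zip_test(asthma_in_zip=10, pop_in_zip=100, asthma_total=100, pop_total=1000)
elevated = zip_test(asthma_in_zip=30, pop_in_zip=100, asthma_total=100, pop_total=1000)
```

A positive residual with a significant p-value corresponds to a zip code that would be drawn in red on the map, and a negative one to green.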

Page 57: Meetup asu 150113_upload


Challenge #1

Step 3. Spatial Alignment

Page 58: Meetup asu 150113_upload


Challenge #2

Predicting Asthma Admissions Findings
•  Prior Hospitalization: Our analysis found that patients who had asthma-related hospitalizations in the last 12 months were 4.85 times more likely to have a hospitalization (any) in the next 3 months than patients who had no asthma hospitalizations in the last 12 months.
•  Socio-economic status: Of the various socio-economic status features we tried, the percent population under 50K was the one that was significant.
•  Age Under 10 and Age Above 60: Compared to the reference group (patients aged between 10 and 60), these two age groups have an increased likelihood (~24% and ~10%) of being hospitalized in the next 3 months.
•  History of Unfilled Medication: If a patient had an unfilled medication in their history, then ceteris paribus, they are 13% more likely to have a hospitalization (p = 2.7e-06).

Page 59: Meetup asu 150113_upload


Asthma Population Management Application

Application #1

Page 60: Meetup asu 150113_upload


Asthma Management Application

Application #2

Page 61: Meetup asu 150113_upload


Technology Adoption Journey of a Major Healthcare Provider

Prove that better technology can speed up discovery
•  Code-a-thon

Prove that better technology can improve model quality
•  Length of Stay Modeling

Prove that technology is accessible to my clinicians and researchers
•  Comorbidity Feature Generation App

Prove that data science can help in areas other than clinical analytics
•  Fraud Detection for Accounts Payable

Prove that, once trained, our scientists can get to insights as quickly as the Pivotal DS team
•  EDIP Modeling in 4 days

Page 62: Meetup asu 150113_upload


Check out the Pivotal Data Science Blog! http://blog.pivotal.io/data-science-pivotal

Page 63: Meetup asu 150113_upload

A NEW PLATFORM FOR A NEW ERA