2015 jsm session big data: modeling, tools, analytics, & training...

7/30/2015

1

2015 JSM Session

Big Data: Modeling, Tools, Analytics, & Training

Organizer: Ivo D. Dinov, Statistics Online Computational Resource (SOCR), Michigan

Moderator/Chair: Robin Jeffries, CSU Chico

Logistics:

o Date/Time: Mon, 8/10/2015, 10:30 AM - 12:20 PM
o Venue: Washington State Convention Center, CC-606

Session: #211347, Sponsor: Statistical Computing

Speakers and Topics

o Ivo Dinov (Michigan), Management, Modeling & Analytic Challenges of Big Biomedical Data
o Max Robinson & Gustavo Glusman (ISB), ESPALIERS: A visualization method for Big Data
o Moo K. Chung (Wisconsin), The computational challenges of constructing and visualizing large-scale brain networks
o Ravi Madduri (Argonne/Chicago), Big Data Services: Globus Online, Galaxy, GridFTP
o Barzan Mozafari, Large-scale Data Intensive Systems, Big Data, Interactive Data Processing

http://wiki.socr.umich.edu/index.php/SOCR_Events_JSS_2015

Management, Modeling & Analytic Challenges of Big Biomedical Data

Ivo D. Dinov

Statistics Online Computational Resource
University of Michigan

www.SOCR.umich.edu

Outline

• Availability, Sharing, Aggregation and Services
• Classical Data Science vs. Innovative Big Data Analytics
– Amateur Scientists vs. “Experts”
– Data Scientists vs. Practitioners
– Domain‐specific vs. Trans‐disciplinary knowledge
• Commercial vs. Open‐source Resourceome
• Rapid Big Data Evolution
• Big Data IT proliferation
• Data democratization
• Big Data is incredibly time, space, protocol, context sensitive
• Big Data Science Training (opportunities and challenges)

Characteristics of Big Data

Mixture of quantitative & qualitative estimates (Dinov, et al., 2014)

Example: analyzing observational data on thousands of Parkinson’s disease patients based on tens of thousands of signature biomarkers derived from multi‐source imaging, genetics, clinical, physiologic, phenomics and demographic data elements.

Software developments, student training, service platforms and methodological advances associated with Big Data Discovery Science all present existing opportunities for learners, educators, researchers, practitioners and policy makers.

[Table fragment; surviving labels: Incompleteness, Size, Expected Increase]

Availability, Sharing, Aggregation & Services

• There will be over 10 billion mobile‐connected devices in 2016; i.e., there will be 1.3 mobile devices per capita

[Figure: bubble chart by industry sector; axes: Percent Growth and Big Data Value Potential Index; bubble size ~ relative size of GDP. Sources: U.S. Bureau of Labor Statistics, McKinsey Global Institute]

Industry sectors: Computer & Electronic Products; Information Services; Manufacturing; Admin, support & waste management; Transportation & Warehousing; Wholesale Trade; Professional Services; Healthcare Providers; Real Estate and Rental; Finance and Insurance; Utilities; Retail Trade; Government; Accommodation & Food; Arts & Entertainment; Corporate Management; Other Services; Construction; Education Services; Natural Resources

Democratization of Big Data Science

• Doctorate training is not mandatory nor does it guarantee appropriate Big Data expertise
• Lower barriers of entry
• Demand for constant “Continuing Education” and self‐training
• Dichotomy between theoretical and empirical sciences
• Differences between fundamental knowledge and experimental skills (big data properties closely approximate core scientific principles)


Domain‐specific vs. Trans‐disciplinary knowledge

[Diagram: Big Data Science and three groups of disciplines]
• Medical Sciences, Social Sciences, Environmental Sciences, ...
• Math/Stats, Physics, Biology, Chemistry, ...
• Engineering, Computer Science, Bioinformatics, Biomath/Biostats, ...

Big Data Resourceome

• There is an explosion of open‐data‐science resources
– www.data.gov
– www.ncbi.nlm.nih.gov/gap
• Spawning of a number of industries and enterprises blending proprietary and open‐source data, code, documentation, expert‐support, infrastructure and services
• Big Data to Knowledge Initiative: www.BD2K.org
• Google Cloud Platform (GCP)
• Amazon Web Services (AWS)

Big Data Analytics Resourceome

http://bd2k.org/BigDataResourceome.html

Big Data (BD) → Information (I) → Knowledge (K) → Action (A)

Big Data: Raw Observations; Data Aggregation; Data Scrubbing; Semantic‐Mapping
Information: Processed Data; Data Fusion; Summary Stats; Derived Biomarkers
Knowledge: Maps, Models; Causal Inference; Networks, Analytics; Linkages, Associations
Action: Actionable Decisions; Treatment Regimens; Forecasts, Predictions; Healthcare Outcomes

National Big Data to Knowledge (BD2K) Centers

1. Center for Causal Modeling and Discovery of Biomedical Knowledge from Big Data (U Pitt, Cooper, Bahar, Berg)
2. Center for Predictive Computational Phenotyping (U Wisconsin, Craven)
3. National Center for Mobility Data Integration to Insight (The Mobilize Center) (Stanford, Delp)
4. KnowEng, a Scalable Knowledge Engine for Large‐Scale Genomic Data (U Illinois Urbana‐Champaign, Han, Sinha, Sorg, Weinshilboum)
5. Center for Big Data in Translational Genomics (UC Santa Cruz, Haussler, Patterson)
6. Patient‐Centered Information Commons (Harvard, Kohane)
7. Center of Excellence for Mobile Sensor Data‐to‐Knowledge (MD2K) (U Memphis, Kumar)
8. Center for Expanded Data Annotation and Retrieval (CEDAR) (Stanford, Musen)
9. A Community Effort to Translate Protein Data to Knowledge: An Integrated Platform (UCLA, Ping, Lindsey, Su, Watson)
10. ENIGMA Center for Worldwide Medicine, Imaging, and Genomics (USC, Thompson)
11. Big Data for Discovery Science (USC, Toga)

www.BD2K.org

Kryder’s law: Exponential Growth of Data

Dinov, et al., 2014

[Figure: Increase of Imaging Resolution. Series: Cryo_Byte, Cryo_Short, Cryo_Color; resolution scale: 1 µm, 10 µm, 100 µm, 1 mm, 1 cm]

[Figure: Data volume increases faster than computational power. Series: Neuroimaging (GB), Genomics_BP (GB), Moore’s Law (1000's Trans/CPU); periods: 1985‐1989 through 2010‐2014, plus 2015‐2019 (estimated)]

Walter, Sci Am, 2005
Storage doubles every 1.2 yrs
Computational power doubles every 1.6 years
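The two doubling times quoted above imply that the gap between stored data and available computing power itself grows exponentially. A minimal sketch of that arithmetic follows; the function names are illustrative and only the two doubling times come from the slide.

```python
# Doubling times quoted on the slide; everything else here is illustrative.
STORAGE_DOUBLING_YEARS = 1.2
COMPUTE_DOUBLING_YEARS = 1.6


def doubling_factor(years: float, doubling_time: float) -> float:
    """Growth factor after `years` for a quantity doubling every `doubling_time` years."""
    return 2.0 ** (years / doubling_time)


def gap_factor(years: float) -> float:
    """Ratio of storage growth to compute growth after `years`."""
    return (doubling_factor(years, STORAGE_DOUBLING_YEARS)
            / doubling_factor(years, COMPUTE_DOUBLING_YEARS))


def gap_doubling_time() -> float:
    """Years needed for the storage/compute ratio itself to double."""
    return 1.0 / (1.0 / STORAGE_DOUBLING_YEARS - 1.0 / COMPUTE_DOUBLING_YEARS)


if __name__ == "__main__":
    for years in (1, 5, 10):
        print(f"after {years:2d} years the gap has grown by a factor of {gap_factor(years):.2f}")
    print(f"the gap doubles every {gap_doubling_time():.1f} years")
```

Under these two figures the storage-to-compute ratio doubles about every 4.8 years, since 1 / (1/1.2 - 1/1.6) = 4.8.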


Big Data Challenges

• Complexity: Volume, Heterogeneity
• Inference: High-throughput, Expeditive, Adaptive
• Representation: Incompleteness (space, time, measure)
• Modeling: Constraints, Bio, Optimization, Computation

Energy & Life‐Span of Big Data

Energy encapsulates the holistic information content included in the data (exponentially increasing with time), which may often represent a significant portion of the joint distribution of the underlying process

Life‐span refers to the relevance and importance of the Data (exponentially decreasing with time)

[Figure: Exponential Growth of Big Data (Size, Complexity, Importance) and Exponential Value Decay of Static Big Data, plotted against Time; data fixation time points: T=10, T=12, T=14, T=16]

http://www.aaas.org/news/big‐data‐blog‐part‐v‐interview‐dr‐ivo‐dinov
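The energy and life-span idea can be illustrated with two exponentials: a growing collection, and a snapshot whose value decays once it is frozen. The sketch below is a toy model only. The growth and decay rates and the function names are assumptions, and only the fixation times T=10, 12, 14, 16 come from the slide.

```python
import math


def data_energy(t: float, growth_rate: float = 0.3) -> float:
    """Information content ("energy") of a continuously growing collection at time t.

    growth_rate is an assumed illustrative value, not a figure from the talk.
    """
    return math.exp(growth_rate * t)


def static_value(t: float, fixation_time: float,
                 growth_rate: float = 0.3, decay_rate: float = 0.5) -> float:
    """Value at time t of a snapshot frozen at `fixation_time`.

    Up to fixation the snapshot tracks the growing collection; afterwards its
    value decays exponentially because it is no longer updated.
    """
    if t <= fixation_time:
        return data_energy(t, growth_rate)
    peak = data_energy(fixation_time, growth_rate)
    return peak * math.exp(-decay_rate * (t - fixation_time))


if __name__ == "__main__":
    for fixation in (10, 12, 14, 16):  # fixation times shown on the slide
        row = ", ".join(f"{static_value(t, fixation):8.2f}" for t in (10, 14, 18))
        print(f"T={fixation}: value at t=10, 14, 18 -> {row}")
```

In this toy model a later fixation time starts from a higher peak, but every frozen snapshot loses value at the same assumed rate.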

End‐to‐end Pipeline Workflow Solutions

Dinov, et al., 2014, Front. Neuroinform.; Dinov, et al., Brain Imaging & Behavior, 2013


Neurodegenerative Disease Applications: Imaging‐Genetic Biomarker Interactions in Alzheimer’s Disease

Goals
Investigate AD subjects (age 55 – 65) using the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database to understand early‐onset (EO) cognitive impairment using neuroimaging and genetics biomarkers

Data
• 9 EO‐AD and 27 EO‐MCI
• Derived 15 most impactful neuroimaging markers (out of 336 morphometry measures)
• Obtained 20 most significant single nucleotide polymorphisms (SNPs) associated with specific neuroimaging biomarkers (out of 620K SNPs)

Approach
Global Shape Analysis (GSA) Pipeline workflow

Moon, Dinov et al., 2015, Psych. Investig.
Moon, Dinov et al., 2015, J Neuroimaging

Results: AD Imaging‐Genetic Biomarker Interactions

Identified associations between neuroimaging phenotypes and genotypes for the cohort of 36 EO subjects

Overall most significant associations:
• rs7718456 (Chr 15) and L_hippocampus (volume)
• rs7718456 and R_hippocampus (volume)

For the 27 EO‐MCI’s, most significant associations:
• rs6446443 (Chr 4, JAKMIP1 janus kinase and microtubule interacting protein 1 gene) and R_superior_frontal_gyrus (volume)
• rs17029131 (Chr 2) and L_Precuneus (volume)

Moon, Dinov et al., 2015, Psych. Investig.


[Figure labels: Shape Index, Curvedness]

Results: AD Imaging‐Genetic Biomarker Interactions

QC Protocol

A. QC PLINK workflow
B. Genetic association study


Manhattan plot for all the single nucleotide polymorphisms (SNPs)
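A Manhattan plot places each SNP at its genomic position on the horizontal axis and at -log10(p) on the vertical axis, so strong associations appear as tall points. The sketch below computes those coordinates only. The record layout and the toy values are hypothetical and are not results from the study.

```python
import math


def manhattan_points(snps):
    """Convert (chromosome, position, p_value) records into plot coordinates.

    Returns a list of (x, y) pairs where x is a cumulative position with
    chromosomes laid end to end, and y is -log10(p_value).
    """
    ordered = sorted(snps, key=lambda s: (s[0], s[1]))

    # Largest position observed on each chromosome (a stand-in for its length).
    max_pos = {}
    for chrom, pos, _ in ordered:
        max_pos[chrom] = max(max_pos.get(chrom, 0), pos)

    # Horizontal offset of each chromosome = total length of earlier ones.
    offsets, running = {}, 0
    for chrom in sorted(max_pos):
        offsets[chrom] = running
        running += max_pos[chrom]

    return [(offsets[chrom] + pos, -math.log10(p)) for chrom, pos, p in ordered]


# Hypothetical toy records: (chromosome, base-pair position, p-value).
toy_snps = [(1, 100, 0.5), (1, 900, 1e-3), (2, 50, 1e-8), (2, 400, 0.05)]
points = manhattan_points(toy_snps)
```

The resulting pairs can be passed to any scatter-plot routine, typically with one color per chromosome.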


GWAS Imaging‐Genetics Approach

• SNPs
– E.g., C/T polymorphism
• Model
– Phenotype: let Y_i be the imaging‐biomarker for the i-th subject
– Genotype: let X_i be the genotype of the i-th subject at a particular SNP, coded 0, 1, or 2
• SOCR Multivariate Regression Models
– [model equations lost in extraction; surviving fragments: ∑, ε]
– Stat analysis: [coefficient] ≠ 0
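The regression formulas on this slide did not survive extraction. A standard additive-genotype linear model consistent with the surviving fragments (a sum, an error term ε, and a test against zero) could be written as below. The coefficient names β_0, β_1, γ_k and the covariates Z_ik are assumptions and may differ from the original slide.

```latex
\begin{align*}
Y_i &= \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad X_i \in \{0, 1, 2\},\\
Y_i &= \beta_0 + \beta_1 X_i + \sum_{k} \gamma_k Z_{ik} + \varepsilon_i,\\
H_0 &: \beta_1 = 0 \quad \text{vs.} \quad H_1 : \beta_1 \neq 0 .
\end{align*}
```

Here Y_i is the imaging biomarker and X_i the genotype code of subject i, as defined on the slide. The second line adds optional covariates such as age or sex.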

Genotype‐Phenotype Relation

Parents   B    A
B         BB   BA
A         BA   AA
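The genotype coding (0, 1, or 2) and the per-SNP regression can be sketched in a few lines. This is a minimal illustration using ordinary least squares, not the SOCR implementation. The choice of "A" as the counted allele, the function names, and the toy data are all assumptions.

```python
import math


def additive_dosage(genotype: str, counted_allele: str = "A") -> int:
    """Count copies of `counted_allele` in a two-letter genotype such as 'BA'."""
    return sum(1 for allele in genotype if allele == counted_allele)


def snp_association(dosages, phenotypes):
    """Least-squares fit of phenotype on dosage.

    Returns (slope, intercept, t_statistic) for the slope.
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotypes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((y - intercept - slope * x) ** 2
                      for x, y in zip(dosages, phenotypes))
    std_err = math.sqrt(residual_ss / (n - 2) / sxx)
    t_stat = slope / std_err if std_err > 0 else math.copysign(math.inf, slope)
    return slope, intercept, t_stat


# Hypothetical toy data: genotypes and a made-up regional volume per subject.
genotypes = ["BB", "BA", "AB", "AA", "BB", "BA", "AA", "BB"]
volumes = [3.9, 3.6, 3.5, 3.1, 4.0, 3.4, 3.0, 4.1]

dosages = [additive_dosage(g) for g in genotypes]
slope, intercept, t_stat = snp_association(dosages, volumes)
```

A genome-wide scan repeats this fit for every SNP and every imaging phenotype, which is why the resulting p-values need a multiple-testing correction before interpretation.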

SNP‐Neuroimaging interactions in Alzheimer’s Disease

Overall Associations
• rs7718456 and L_hippocampus (volume)
• rs7718456 and R_hippocampus (volume)

EO‐MCI associations
• rs6446443 and R_superior_frontal_gyrus (volume)
• rs17029131 and L_Precuneus (volume)


Neurodegenerative Disease Applications: Imaging‐Genetic Biomarker Interactions in Alzheimer’s Disease

Goals
Investigate AD subjects (age 55 – 65) using the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database to understand early‐onset (EO) cognitive impairment using neuroimaging and genetics biomarkers

Results
• Generated mean geometric models of the left and right hippocampi, middle frontal gyri, and the middle temporal gyri.
• Computed linear model statistical maps at each vertex on the triangulated shapes.
• The dependent variable was the radial distance morphometry measure (deviation of individual shape model from mean shape atlas); the independent regressors included diagnosis, age, education years, APOE (ε4), MMSE, visiting times, and logical memory (immediate and delayed recall).

Approach
Local Shape Analysis (LSA) Pipeline workflow

Moon, Dinov et al., 2015, J Neuroimaging
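The vertex-wise linear model described in the results can be written out as follows. This is a sketch assembled from the listed regressors. The coefficient labels, the variable abbreviations, and the absence of interaction terms are assumptions, so the published model may differ.

```latex
\begin{align*}
r_i(v) &= \beta_0(v) + \beta_1(v)\,\mathrm{Dx}_i + \beta_2(v)\,\mathrm{Age}_i
          + \beta_3(v)\,\mathrm{Edu}_i + \beta_4(v)\,\mathrm{APOE4}_i \\
       &\quad + \beta_5(v)\,\mathrm{MMSE}_i + \beta_6(v)\,\mathrm{Visits}_i
          + \beta_7(v)\,\mathrm{LM}^{\mathrm{imm}}_i
          + \beta_8(v)\,\mathrm{LM}^{\mathrm{del}}_i + \varepsilon_i(v)
\end{align*}
```

Here r_i(v) is the radial distance of subject i at vertex v. Testing a coefficient against zero at every vertex yields one p-value per vertex, which is what a P-map displays.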

Neurodegenerative Disease Applications: Imaging‐Genetic Biomarker Interactions in Alzheimer’s Disease

Moon, Dinov et al., 2015, J Neuroimaging

P‐maps (A) for Lt. hippocampus, P‐maps (B,C) for Rt. Hippocampus, P‐maps (D,E) for Lt. middle frontal gyrus, P‐maps (F,G,H) for Rt. middle frontal gyrus, P‐maps (I, J) for Lt. middle temporal gyrus.

SOCR Big Data Dashboard

http://socr.umich.edu/HTML5/Dashboard

Web‐service combining and integrating multi‐source socioeconomic and medical datasets

Big data analytic processing

Interface for exploratory navigation, manipulation and visualization

Adding/removing of visual queries and interactive exploration of multivariate associations

Powerful HTML5 technology enabling mobile on‐demand computing

Husain, et al., 2015, J Big Data

SOCR Dashboard (Exploratory Big Data Analytics): Data Fusion

http://socr.umich.edu/HTML5/Dashboard
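Data fusion here means combining records from several sources that share an identifier. The sketch below shows the basic join on a common key. It is not the Dashboard's implementation, which the slides do not describe, and the field names and values are hypothetical.

```python
def fuse_by_key(*sources):
    """Inner-join several {key: {field: value}} datasets on their shared keys.

    Only keys present in every source are kept; fields from later sources
    overwrite same-named fields from earlier ones.
    """
    if not sources:
        return {}
    shared_keys = set(sources[0])
    for source in sources[1:]:
        shared_keys &= set(source)

    fused = {}
    for key in sorted(shared_keys):
        record = {}
        for source in sources:
            record.update(source[key])
        fused[key] = record
    return fused


# Hypothetical toy sources keyed by a region identifier.
socioeconomic = {
    "R1": {"median_income": 52000},
    "R2": {"median_income": 61000},
    "R3": {"median_income": 47000},
}
medical = {
    "R1": {"hospital_beds": 310},
    "R3": {"hospital_beds": 120},
    "R4": {"hospital_beds": 95},
}

fused = fuse_by_key(socioeconomic, medical)
```

An inner join drops regions missing from any source (R2 and R4 in this toy example). An outer join would keep them with gaps, which is the kind of incompleteness a later QC step has to handle.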

SOCR Dashboard (Exploratory Big Data Analytics): Data QC

http://socr.umich.edu/HTML5/Dashboard


Training: Big Data Skills

1) Listening: streams, information and language, analyzing sentiment, intent and trends;

2) Looking: searching, indexing and memory management of heterogeneous datasets; Loading: Raw, derived or indexed data as well as meta‐data;

3) Programming: Handling Map‐Reduce/HDFS, No‐SQL DB, protocol provenance, pipeline workflows;

4) Inferring: Principles of data analyses, Bayesian modeling, inference, uncertainty and quantification of likelihoods; Connecting: Reasoning, logic and its limits, dealing with uncertainty; Analytics: Regression, feature selection, dimensionality reduction, temporal patterns;

5) Learning: Classification, clustering, mining, information extraction, knowledge retrieval, decision making;

6) Predicting: Forecasting, neural models, deep learning, and research topics;

7) Summarizing: Presentation of data, processing protocol, analytics provenance, visualization
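The Map-Reduce pattern named under the Programming skill can be illustrated with a single-process word count. This sketch only mimics the three phases in memory. It does not use Hadoop or HDFS, and the function names are illustrative.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document: str):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values of each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


documents = ["big data big models", "data science"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
```

In a real cluster the map and reduce calls run in parallel on different machines, and the shuffle step moves data between them over the network.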

Training: Core Proficiencies

The Data Science Certificate program aims to ensure that students awarded this certificate would have the following experiences:

1) (Modeling) Understanding of core Data Science principles, assumptions and applications

2) (Technology) Knowledge of basic protocols for data management, processing, computation, information extraction & visualization

3) (Practice) Hands‐on experience with modeling tools and technology resources in a real project setting

University of Michigan Data Science Training

http://midas.umich.edu/training

Paradigm Shift in Higher Ed & Training

A natural parallel to the 4‐pillar scientific discovery paradigm (experimental, theoretical, computational, and data sciences) can be developed for higher ed learning

The new paradigm shift in science and technology education and training involves 4 complementary abilities and skills necessary to handle, understand and interpret complex phenomena using Big Data observations:

o Literacy, which provides the deep domain‐specific knowledge (e.g., math, phy, bio, eng, stats, enviro, social, etc.),

o Numeracy, which develops the skills to process hard information, manipulate numbers, and understand analytical models,

o Algorithmacy, which provides the foundation for developing, optimizing and evaluating protocols, pseudo code, and (jointly with numeracy) facilitates programming, and

o Datacy, which is the new skillset required to understand natural processes, via Big Data proxies, manipulate, extract and synthesize information, develop and evaluate predictive and decision support models.

Acknowledgments

Funding
NIH: P20 NR015331, U54 EB020406, P50 NS091856, P30 DK089503
NSF: DUE 1416953, 0716055, 1023115

Collaborators
• SOCR: Alexandr Kalinin, Selvam Palanimalai, Syed Husain, Ashwini Khare, Rami Elkest, Abhishek Chowdhury, Patrick Tan, Gary Chan, Andy Foglia, Pratyush Pati, Brian Zhang, Juana Sanchez, Dennis Pearl, Kyle Siegrist, Nicolas Christou, Rob Gould

• AtheyLab: Alex Ade, Ari Allyn‐Feuer, Gen Zheng, John Wiley, David Dilworth, Walter Meixner, Jeffrey R. de Wet, Kevin Smith, Amy Creekmore, Indika Rajapakse, Gerry Higgins, Robert W. Veltri, Brian Athey

• LONI/INI: Arthur Toga, Roger Woods, Jack Van Horn, Zhuowen Tu, Yonggang Shi, David Shattuck, Elizabeth Sowell, Katherine Narr, Anand Joshi, Shantanu Joshi, Paul Thompson, Luminita Vese, Stan Osher, Stefano Soatto, Seok Moon, Junning Li, Young Sung, Fabio Macciardi, Federica Torri, Carl Kesselman
