2015 jsm session big data: modeling, tools, analytics, & training...
TRANSCRIPT
-
7/30/2015
1
2015 JSM Session
Big Data: Modeling, Tools, Analytics, & Training
Organizer: Ivo D. Dinov, Statistics Online Computational Resource (SOCR), Michigan
Moderator/Chair: Robin Jeffries, CSU Chico
Logistics:
o Date/Time: Mon, 8/10/2015, 10:30 AM - 12:20 PM
o Venue: Washington State Convention Center, CC-606
Session: #211347, Sponsor: Statistical Computing
Speakers and Topics
o Ivo Dinov (Michigan), Management, Modeling & Analytic Challenges of Big Biomedical Data
o Max Robinson & Gustavo Glusman, ISB, ESPALIERS: A visualization method for Big Data
o Moo K. Chung (Wisconsin), The computational challenges of constructing and visualizing large-scale brain networks
o Ravi Madduri, Argonne/Chicago, Big Data Services: Globus Online, Galaxy, GridFTP
o Barzan Mozafari, Large-scale Data Intensive Systems, Big Data, Interactive Data Processing
http://wiki.socr.umich.edu/index.php/SOCR_Events_JSS_2015
Management, Modeling & Analytic Challenges of Big Biomedical Data
Ivo D. Dinov
Statistics Online Computational Resource
University of Michigan
www.SOCR.umich.edu
Outline
• Availability, Sharing, Aggregation and Services
• Classical Data Science vs. Innovative Big Data Analytics
– Amateur Scientists vs. “Experts”
– Data Scientists vs. Practitioners
– Domain‐specific vs. Trans‐disciplinary knowledge
• Commercial vs. Open‐source Resourceome
• Rapid Big Data Evolution
• Big Data IT proliferation
• Data democratization
• Big Data is incredibly time, space, protocol, context sensitive
• Big Data Science Training (opportunities and challenges)
Characteristics of Big Data
Mixture of quantitative & qualitative estimates Dinov, et al. (2014)
Example: analyzing observational data of 1,000’s of Parkinson’s disease patients based on 10,000’s of signature biomarkers derived from multi‐source imaging, genetics, clinical, physiologic, phenomics and demographic data elements.
Software developments, student training, service platforms and methodological advances associated with Big Data Discovery Science all present exciting opportunities for learners, educators, researchers, practitioners and policy makers.
[Figure: Big Data characteristics (Size, Incompleteness) and their expected increase]
Availability, Sharing, Aggregation & Services
• There will be over 10 billion mobile‐connected devices in 2016; i.e., there will be 1.3 mobile devices per capita
[Figure: bubble chart by industry sector; axes: Percent Growth and Big Data Value Potential Index; bubble size ~ relative size of GDP. Sources: U.S. Bureau of Labor Statistics, McKinsey Global Institute]
Industry Sectors: Computer & Electronic Products; Information Services; Manufacturing; Admin, support & waste management; Transportation & Warehousing; Wholesale Trade; Professional Services; Healthcare Providers; Real Estate and Rental; Finance and Insurance; Utilities; Retail Trade; Government; Accommodation & Food; Arts & Entertainment; Corporate Management; Other Services; Construction; Education Services; Natural Resources
Democratization of Big Data Science
• Doctorate training is not mandatory, nor does it guarantee appropriate Big Data expertise
• Lower barriers of entry
• Demand for constant “Continuing Education” and self‐training
• Dichotomy between theoretical and empirical sciences
• Differences between fundamental knowledge and experimental skills (big data properties closely approximate core scientific principles)
Domain‐specific vs. Trans‐disciplinary knowledge
[Diagram: Big Data Science, drawing on three groups of disciplines]
• Medical Sciences, Social Sciences, Environmental Sciences, ....
• Math/Stats, Physics, Biology, Chemistry, ....
• Engineering, Computer Science, Bioinformatics, Biomath/Biostats, ....
Big Data Resourceome
• There is an explosion of open‐data‐science resources
– www.data.gov
– www.ncbi.nlm.nih.gov/gap
• Spawning of a number of industries and enterprises blending proprietary and open‐source data, code, documentation, expert‐support, infrastructure and services
• Big Data to Knowledge Initiative: www.BD2K.org
• Google Cloud Platform (GCP)
• Amazon Web Services (AWS)
Big Data Analytics Resourceome
http://bd2k.org/BigDataResourceome.html
Big Data (BD) | Information (I) | Knowledge (K) | Action (A)
Raw Observations | Processed Data | Maps, Models | Actionable Decisions
Data Aggregation | Data Fusion | Causal Inference | Treatment Regimens
Data Scrubbing | Summary Stats | Networks, Analytics | Forecasts, Predictions
Semantic‐Mapping | Derived Biomarkers | Linkages, Associations | Healthcare Outcomes
National Big Data to Knowledge (BD2K) Centers
1. Center for Causal Modeling and Discovery of Biomedical Knowledge from Big Data (U Pitt, Cooper, Bahar, Berg)
2. Center for Predictive Computational Phenotyping (U Wisconsin, Craven)
3. National Center for Mobility Data Integration to Insight (The Mobilize Center) (Stanford, Delp)
4. KnowEng, a Scalable Knowledge Engine for Large‐Scale Genomic Data (U Illinois Urbana‐Champaign, Han, Sinha, Sorg, Weinshilboum)
5. Center for Big Data in Translational Genomics (UC Santa Cruz, Haussler, Patterson)
6. Patient‐Centered Information Commons (Harvard, Kohane)
7. Center of Excellence for Mobile Sensor Data‐to‐Knowledge (MD2K) (U Memphis, Kumar)
8. Center for Expanded Data Annotation and Retrieval (CEDAR) (Stanford, Musen)
9. A Community Effort to Translate Protein Data to Knowledge: An Integrated Platform (UCLA, Ping, Lindsey, Su, Watson)
10. ENIGMA Center for Worldwide Medicine, Imaging, and Genomics (USC, Thompson)
11. Big Data for Discovery Science (USC, Toga)
www.BD2K.org
Kryder’s law: Exponential Growth of Data
Dinov, et al., 2014
[Figure, panel 1: Cryo_Byte, Cryo_Short and Cryo_Color data sizes (axis 0 to 6E+15) at imaging resolutions of 1 cm, 1 mm, 100 µm, 10 µm and 1 µm]
[Figure, panel 2: Neuroimaging (GB), Genomics_BP (GB) and Moore’s Law (1000's Trans/CPU) (axis 0 to 15000000) for 1985‐1989, 1990‐1994, 1995‐1999, 2000‐2004, 2005‐2009, 2010‐2014 and 2015‐2019 (estimated)]
Increase of imaging resolution
Data volume increases faster than computational power
Walter, Sci Am, 2005: storage doubles every 1.2 years
Computational power doubles every 1.6 years
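The two doubling times quoted on this slide can be compared with a short calculation. The sketch below is illustrative only: it assumes clean exponential growth at exactly the stated rates, and the function names are hypothetical.

```python
# Doubling times quoted on the slide (years).
STORAGE_DOUBLING_YEARS = 1.2
COMPUTE_DOUBLING_YEARS = 1.6


def growth_factor(years, doubling_time_years):
    """Multiplicative growth after `years`, assuming steady doubling."""
    return 2.0 ** (years / doubling_time_years)


def storage_to_compute_gap(years):
    """Ratio of storage growth to compute growth after `years`."""
    storage = growth_factor(years, STORAGE_DOUBLING_YEARS)
    compute = growth_factor(years, COMPUTE_DOUBLING_YEARS)
    return storage / compute


if __name__ == "__main__":
    for horizon in (1.2, 4.8, 9.6):
        print(horizon, round(storage_to_compute_gap(horizon), 3))
```

Under these assumptions the ratio of storage to compute doubles every 1 / (1/1.2 - 1/1.6) = 4.8 years, which is one way to read "data volume increases faster than computational power".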
Big Data Challenges
• Complexity: Volume, Heterogeneity
• Inference: High-throughput, Expeditive, Adaptive
• Representation: Incompleteness (space, time, measure)
• Modeling: Constraints, Bio, Optimization, Computation
Energy & Life‐Span of Big Data
• Energy encapsulates the holistic information content included in the data (exponentially increasing with time), which may often represent a significant portion of the joint distribution of the underlying process
• Life‐span refers to the relevance and importance of the data (exponentially decreasing with time)
[Figure: x‐axis Time; curves: Exponential Growth of Big Data (Size, Complexity, Importance) and Exponential Value Decay of Static Big Data, drawn from data fixation time points T=10, T=12, T=14, T=16]
http://www.aaas.org/news/big‐data‐blog‐part‐v‐interview‐dr‐ivo‐dinov
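The energy and life‐span idea can be sketched numerically. The slides give no formula, so everything below is an assumption made for illustration: energy grows as exp(growth_rate * t), a snapshot fixed at time t_fix keeps the energy it had at fixation, and its value then decays as exp(-decay_rate * (t - t_fix)). The rates and the function name are hypothetical.

```python
import math


def snapshot_value(t, t_fix, growth_rate=0.5, decay_rate=1.0):
    """Value at time t of a dataset frozen at time t_fix.

    Assumed model (not from the slides): energy at fixation is
    exp(growth_rate * t_fix); value then decays exponentially.
    """
    if t < t_fix:
        return 0.0  # the snapshot does not exist yet
    energy_at_fixation = math.exp(growth_rate * t_fix)
    return energy_at_fixation * math.exp(-decay_rate * (t - t_fix))


if __name__ == "__main__":
    # Fixation time points taken from the figure labels.
    for t_fix in (10, 12, 14, 16):
        print(t_fix, [round(snapshot_value(t, t_fix), 1) for t in (16, 18)])
```

With these assumed forms, a later fixation point always starts higher than an earlier snapshot has decayed to, which matches the qualitative picture of growing data and decaying static snapshots.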
End‐to‐end Pipeline Workflow Solutions
Dinov, et al., 2014, Front. Neuroinform.; Dinov, et al., Brain Imaging & Behavior, 2013
Neurodegenerative Disease Applications: Imaging‐Genetic Biomarker Interactions in Alzheimer’s Disease
Goals
• Investigate AD subjects (age 55 – 65) using the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database to understand early‐onset (EO) cognitive impairment using neuroimaging and genetics biomarkers
Data
• 9 EO‐AD and 27 EO‐MCI
• Derived 15 most impactful neuroimaging markers (out of 336 morphometry measures)
• Obtained 20 most significant single nucleotide polymorphisms (SNPs) associated with specific neuroimaging biomarkers (out of 620K SNPs)
Approach
• Global Shape Analysis (GSA) Pipeline workflow
Moon, Dinov et al., 2015, Psych. Investig.
Moon, Dinov et al., 2015, J Neuroimaging
Results: AD Imaging‐Genetic Biomarker Interactions
• Identified associations between neuroimaging phenotypes and genotypes for the cohort of 36 EO subjects
• Overall most significant associations:
– rs7718456 (Chr 15) and L_hippocampus (volume)
– rs7718456 and R_hippocampus (volume)
• For the 27 EO‐MCI’s, most significant associations:
– rs6446443 (Chr 4, JAKMIP1, janus kinase and microtubule interacting protein 1 gene) and R_superior_frontal_gyrus (volume)
– rs17029131 (Chr 2) and L_Precuneus (volume)
Moon, Dinov et al., 2015, Psych. Investig.
[Figure labels: Shape Index, Curvedness]
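The formulas behind the Shape Index and Curvedness labels did not survive in the slide text. As a hedged illustration, the sketch below uses the commonly cited Koenderink and van Doorn definitions in terms of the two principal curvatures; whether the talk used exactly these forms, and this sign convention, is an assumption.

```python
import math


def shape_index(k1, k2):
    """Shape index in [-1, 1] from two principal curvatures.

    Assumed convention: S = (2/pi) * arctan((kmax + kmin) / (kmax - kmin)),
    so equal positive curvatures give +1 and a symmetric saddle gives 0.
    """
    k_max, k_min = max(k1, k2), min(k1, k2)
    if k_max == 0.0 and k_min == 0.0:
        return float("nan")  # undefined on a flat patch
    # atan2 handles the umbilic case k_max == k_min without dividing by zero.
    return (2.0 / math.pi) * math.atan2(k_max + k_min, k_max - k_min)


def curvedness(k1, k2):
    """Curvedness C = sqrt((k1^2 + k2^2) / 2); zero only on a flat patch."""
    return math.sqrt((k1 * k1 + k2 * k2) / 2.0)


if __name__ == "__main__":
    for pair in ((1.0, 1.0), (1.0, 0.0), (1.0, -1.0), (-1.0, -1.0)):
        print(pair, round(shape_index(*pair), 3), round(curvedness(*pair), 3))
```

Shape index describes the type of local shape independent of scale, while curvedness measures how strongly the surface bends.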
Results: AD Imaging‐Genetic Biomarker Interactions
QC Protocol
A. QC Plink workflow
B. Genetic association study
Results: AD Imaging‐Genetic Biomarker Interactions
Manhattan plot for all the single nucleotide polymorphisms (SNPs)
GWAS Imaging‐Genetics Approach
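• SNPs
– E.g., C/T polymorphism
• Model
– Phenotype: let Y_i be the imaging biomarker for the i‐th subject
– Genotype: let X_i ∈ {0, 1, 2} be the genotype of the i‐th subject at a particular SNP
• SOCR Multivariate Regression Models
– Regression of Y_i on X_i, with a summation term (∑) and an error term (ε)
– Stat analysis: test whether the genotype effect ≠ 0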
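A single‐SNP association test of this kind can be sketched as follows. The slide's regression equations are not legible in the text, so the additive model Y_i = b0 + b1 * X_i + e_i used here is an assumption (a common GWAS choice), covariates are omitted for brevity, and the function name is hypothetical.

```python
import math


def additive_snp_test(phenotype, genotype):
    """Fit Y = b0 + b1 * X by least squares; return (b1, t statistic for b1).

    `genotype` holds minor-allele counts coded 0, 1 or 2.
    A large |t| is evidence against the null hypothesis b1 = 0.
    """
    n = len(phenotype)
    mean_x = sum(genotype) / n
    mean_y = sum(phenotype) / n
    sxx = sum((x - mean_x) ** 2 for x in genotype)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(genotype, phenotype))
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    rss = sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(genotype, phenotype))
    se_beta1 = math.sqrt(rss / (n - 2) / sxx)  # standard error of the slope
    return beta1, beta1 / se_beta1


if __name__ == "__main__":
    # Made-up toy data, for illustration only.
    x = [0, 1, 2, 0, 1, 2, 0, 1, 2, 1]
    noise = [0.1, -0.1] * 5
    y = [1.0 + 0.5 * xi + ei for xi, ei in zip(x, noise)]
    print(additive_snp_test(y, x))
```

In a genome‐wide scan this test would be repeated for every SNP, and the resulting p‐values are what a Manhattan plot displays.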
Genotype‐Phenotype Relation
Parents | B | A
B | BB | BA
A | BA | AA
SNP‐Neuroimaging interactions in Alzheimer’s Disease
• Overall associations:
– rs7718456 and L_hippocampus (volume)
– rs7718456 and R_hippocampus (volume)
• EO‐MCI associations:
– rs6446443 and R_superior_frontal_gyrus (volume)
– rs17029131 and L_Precuneus (volume)
Neurodegenerative Disease Applications: Imaging‐Genetic Biomarker Interactions in Alzheimer’s Disease
Goals
• Investigate AD subjects (age 55 – 65) using the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database to understand early‐onset (EO) cognitive impairment using neuroimaging and genetics biomarkers
Results
• Generated mean geometric models of the left and right hippocampi, middle frontal gyri, and middle temporal gyri.
• Computed linear‐model statistical maps at each vertex of the triangulated shapes.
• The dependent variable was the radial distance morphometry measure (deviation of the individual shape model from the mean shape atlas); the independent regressors included diagnosis, age, education years, APOE (ε4), MMSE, visiting times, and logical memory (immediate and delayed recall).
Approach
• Local Shape Analysis (LSA) Pipeline workflow
Moon, Dinov et al., 2015, J Neuroimaging
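The mean shape model and the per‐vertex deviation measure can be illustrated with a toy sketch. It assumes that all subjects' meshes have the same number of vertices in corresponding order, and it uses plain Euclidean distance to the mean shape as the deviation; the LSA pipeline's actual registration and distance computation are not described in the slides, and the function names are hypothetical.

```python
import math


def mean_shape(shapes):
    """Average corresponding vertices across subjects.

    `shapes` is a list of meshes; each mesh is a list of (x, y, z) vertices
    in the same order for every subject (an assumption of this sketch).
    """
    n_subjects = len(shapes)
    n_vertices = len(shapes[0])
    return [
        tuple(sum(shape[v][axis] for shape in shapes) / n_subjects
              for axis in range(3))
        for v in range(n_vertices)
    ]


def vertex_deviation(shape, atlas):
    """Euclidean distance from each vertex to the matching atlas vertex."""
    return [math.dist(p, q) for p, q in zip(shape, atlas)]


if __name__ == "__main__":
    subject_a = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    subject_b = [(0.0, 0.0, 2.0), (4.0, 0.0, 0.0)]
    atlas = mean_shape([subject_a, subject_b])
    print(atlas, vertex_deviation(subject_a, atlas))
```

Each vertex's deviation values across subjects would then serve as the dependent variable in a separate linear model, giving one p‐value per vertex.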
Neurodegenerative Disease Applications: Imaging‐Genetic Biomarker Interactions in Alzheimer’s Disease
P‐maps: (A) Lt. hippocampus; (B, C) Rt. hippocampus; (D, E) Lt. middle frontal gyrus; (F, G, H) Rt. middle frontal gyrus; (I, J) Lt. middle temporal gyrus.
Moon, Dinov et al., 2015, J Neuroimaging
SOCR Big Data Dashboard
http://socr.umich.edu/HTML5/Dashboard
Web‐service combining and integrating multi‐source socioeconomic and medical datasets
Big data analytic processing
Interface for exploratory navigation, manipulation and visualization
Adding/removing of visual queries and interactive exploration of multivariate associations
Powerful HTML5 technology enabling mobile on‐demand computing
Husain, et al., 2015, J Big Data
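Combining multi‐source datasets typically comes down to joining records on a shared identifier. The sketch below shows that idea only; it is not the Dashboard's implementation, and the key name `region_id` and the field names are hypothetical.

```python
def fuse_by_key(primary, secondary, key):
    """Inner-join two lists of dict records on `key`.

    Records without a match in `secondary` are dropped; on a field-name
    clash the value from `primary` wins.
    """
    lookup = {row[key]: row for row in secondary}
    fused = []
    for row in primary:
        match = lookup.get(row[key])
        if match is not None:
            fused.append({**match, **row})
    return fused


if __name__ == "__main__":
    # Made-up records, for illustration only.
    socioeconomic = [{"region_id": 1, "income": 50}, {"region_id": 2, "income": 60}]
    medical = [{"region_id": 1, "obesity_rate": 0.3}]
    print(fuse_by_key(socioeconomic, medical, "region_id"))
```

Dropping unmatched records is one possible policy; keeping them with missing fields flagged is the other common choice, and the subsequent data QC step would need to handle those gaps.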
SOCR Dashboard (Exploratory Big Data Analytics): Data Fusion
http://socr.umich.edu/HTML5/Dashboard
SOCR Dashboard (Exploratory Big Data Analytics): Data QC
http://socr.umich.edu/HTML5/Dashboard
Training: Big Data Skills
1) Listening: streams, information and language, analyzing sentiment, intent and trends;
2) Looking: searching, indexing and memory management of heterogeneous datasets; Loading: Raw, derived or indexed data as well as meta‐data;
3) Programming: Handling Map‐Reduce/HDFS, No‐SQL DB, protocol provenance, pipeline workflows;
4) Inferring: Principles of data analyses, Bayesian modeling, inference, uncertainty and quantification of likelihoods; Connecting: Reasoning, logic and its limits, dealing with uncertainty; Analytics: Regression, feature selection, dimensionality reduction, temporal patterns;
5) Learning: Classification, clustering, mining, information extraction, knowledge retrieval, decision making;
6) Predicting: Forecasting, neural models, deep learning, and research topics;
7) Summarizing: Presentation of data, processing protocol, analytics provenance, visualization
Training: Core Proficiencies
The Data Science Certificate program aims to ensure that students awarded this certificate would have the following experiences:
1) (Modeling) Understanding of core Data Science principles, assumptions and applications
2) (Technology) Knowledge of basic protocols for data management, processing, computation, information extraction & visualization
3) (Practice) Hands‐on experience with modeling tools and technology resources in a real project setting
University of Michigan Data Science Training
http://midas.umich.edu/training
Paradigm Shift in Higher Ed & Training
A natural parallel to the 4‐pillar scientific discovery paradigm (experimental, theoretical, computational, and data sciences) can be developed for higher ed learning
The new paradigm shift in science and technology education and training involves 4 complementary abilities and skills necessary to handle, understand and interpret complex phenomena using Big Data observations:
o Literacy, which provides the deep domain‐specific knowledge (e.g., math, phy, bio, eng, stats, enviro, social, etc.),
o Numeracy, which develops the skills to process hard information, manipulate numbers, and understand analytical models,
o Algorithmacy, which provides the foundation for developing, optimizing and evaluating protocols, pseudo code, and (jointly with numeracy) facilitates programming, and
o Datacy, which is the new skillset required to understand natural processes via Big Data proxies; to manipulate, extract and synthesize information; and to develop and evaluate predictive and decision support models.
Acknowledgments
Funding
NIH: P20 NR015331, U54 EB020406, P50 NS091856, P30 DK089503
NSF: DUE 1416953, 0716055, 1023115
Collaborators
• SOCR: Alexandr Kalinin, Selvam Palanimalai, Syed Husain, Ashwini Khare, Rami Elkest, Abhishek Chowdhury, Patrick Tan, Gary Chan, Andy Foglia, Pratyush Pati, Brian Zhang, Juana Sanchez, Dennis Pearl, Kyle Siegrist, Nicolas Christou, Rob Gould
• AtheyLab: Alex Ade, Ari Allyn‐Feuer, Gen Zheng, John Wiley, David Dilworth, Walter Meixner, Jeffrey R. de Wet, Kevin Smith, Amy Creekmore, Indika Rajapakse, Gerry Higgins, Robert W. Veltri, Brian Athey
• LONI/INI: Arthur Toga, Roger Woods, Jack Van Horn, Zhuowen Tu, Yonggang Shi, David Shattuck, Elizabeth Sowell, Katherine Narr, Anand Joshi, Shantanu Joshi, Paul Thompson, Luminita Vese, Stan Osher, Stefano Soatto, Seok Moon, Junning Li, Young Sung, Fabio Macciardi, Federica Torri, Carl Kesselman