2015 jsm session big data: modeling, tools, analytics, & training...

6
7/30/2015 1 2015 JSM Session Big Data: Modeling, Tools, Analytics, & Training Organizer: Ivo D. Dinov, Statistics Online Computational Resource (SOCR), Michigan Moderator/Chair: Robin Jeffries, CSU Chico Logistics: o Date/Time: Mon, 8/10/2015, 10:30 AM - 12:20 PM o Venue: Washington State Convention Center, CC-606 Session: #211347, Sponsor: Statistical Computing Speakers and Topics o Ivo Dinov (Michigan), Management, Modeling & Analytic Challenges of Big Biomedical Data o Max Robinson & Gustavo Glusman, ISB ESPALIERS: A visualization method for Big Data o Moo K. Chung (Wisconsin), The computational challenges of constructing and visualizing large-scale brain networks o Ravi Madduri, Argonne/Chicago, Big Data Services: Globus Online, Galaxy, GridFTP o Barzan Mozafari, Large-scale Data Intensive Systems, Big Data, Interactive Data Processing http://wiki.socr.umich.edu/index.php/SOCR_Events_JSS_2015 Management, Modeling & Analytic Challenges of Big Biomedical Data Ivo D. Dinov Statistics Online Computational Resource University of Michigan www.SOCR.umich.edu Outline Availability, Sharing, Aggregation and Services Classical Data Science vs. Innovative Big Data Analytics Amateur Scientists vs. “Experts” Data Scientists vs. Practitioners Domainspecific vs. Transdisciplinary knowledge Commercial vs. Opensource Resourceome Rapid Big Data Evolution Big Data IT proliferation Data democratization Big Data is incredibly time, space, protocol, context sensitive Big Data Science Training (opportunities and challenges) Characteristics of Big Data Mixture of quantitative & qualitative estimates Dinov, et al. (2014) Example: analyzing observational data of 1,000’s Parkinson’s disease patients based on 10,000’s signature biomarkers derived from multisource imaging, genetics, clinical, physiologic, phenomics and demographic data elements. Software developments, student training, service platforms and methodological advances associated with the Big Data Discovery Science all present existing opportunities for learners, educators, researchers, practitioners and policy makers Incompleteness Size Expected Increase Availability, Sharing, Aggregation & Services There will be over 10 billion mobileconnected devices in 2016; i.e., there will be 1.3 mobile devices per capita U.S. Bureau of Labor Statistics McKinsey Global Institute Percent Growth Big Data Value Potential Index Bubble Size ~ Relative size of GDP Industry Sector Computer & Electronic Products Information Services Manufacturing Admin, support & waste management Transportation & Warehousing Wholesale Trade Professional Services Healthcare Providers Real Estate and Rental Finance and Insurance Utilities Retail Trade Government Accomodation & Food Arts & Enterntainment Corporate Management Other Services Construction Education Services Natural Resources Democratization of Big Data Science Doctorate training is not mandatory nor does it guarantee appropriate Big Data expertise Lower barriers of entry Demand for constant “Continuing Education” and selftraining Dichotomy between theoretical and empirical sciences Differences between fundamental knowledge and experimental skills (big data properties closely approximate core scientific principles)

Upload: others

Post on 02-Feb-2021

0 views

Category:

Documents


0 download

TRANSCRIPT

  • 7/30/2015

    1

    2015 JSM Session

    Big Data: Modeling, Tools, Analytics, & Training

    Organizer: Ivo D. Dinov, Statistics Online Computational Resource (SOCR), MichiganModerator/Chair: Robin Jeffries, CSU Chico Logistics:

    o Date/Time: Mon, 8/10/2015, 10:30 AM - 12:20 PM o Venue: Washington State Convention Center, CC-606

    Session: #211347, Sponsor: Statistical Computing

    Speakers and Topics

    o Ivo Dinov (Michigan), Management, Modeling & Analytic Challenges of Big Biomedical Datao Max Robinson & Gustavo Glusman, ISB ESPALIERS: A visualization method for Big Datao Moo K. Chung (Wisconsin), The computational challenges of constructing and visualizing

    large-scale brain networks o Ravi Madduri, Argonne/Chicago, Big Data Services: Globus Online, Galaxy, GridFTPo Barzan Mozafari, Large-scale Data Intensive Systems, Big Data, Interactive Data Processing

    http://wiki.socr.umich.edu/index.php/SOCR_Events_JSS_2015

    Management, Modeling & Analytic Challenges of Big

    Biomedical DataIvo D. Dinov

    Statistics Online Computational ResourceUniversity of Michigan

    www.SOCR.umich.edu

    Outline• Availability, Sharing, Aggregation and Services • Classical Data Science vs. Innovative Big Data Analytics

    – Amateur Scientists vs. “Experts”– Data Scientists vs. Practitioners– Domain‐specific vs. Trans‐disciplinary knowledge 

    • Commercial vs. Open‐source Resourceome• Rapid Big Data Evolution• Big Data IT proliferation• Data democratization• Big Data is incredibly time, space, protocol, context sensitive• Big Data Science Training (opportunities and challenges)

    Characteristics of Big Data

    Mixture of quantitative & qualitative estimates                           Dinov, et al. (2014)

    Example: analyzing observational data of 1,000’s Parkinson’s disease patients based on 10,000’s signature biomarkers derived from multi‐source imaging, genetics, clinical, physiologic, phenomics and demographic data elements. 

    Software developments, student training, service platforms and methodological advances associated with the Big Data Discovery Science all present existing opportunities for learners, educators, researchers, practitioners and policy makers

    Incompleteness

    SizeExpected Increase

    Availability, Sharing, Aggregation & Services• There will be over 10 billion mobile‐connected devices in 2016; 

    i.e., there will be 1.3 mobile devices per capita

    U.S. Bureau of Labor StatisticsMcKinsey Global Institute 

    Percen

    t Growth

    Big Data Value Potential Index

    Bubble Size ~ Relative size of GDP

    Industry SectorComputer & Electronic ProductsInformation ServicesManufacturingAdmin, support & waste managementTransportation & WarehousingWholesale TradeProfessional ServicesHealthcare ProvidersReal Estate and RentalFinance and InsuranceUtilitiesRetail TradeGovernmentAccomodation & FoodArts & Enterntainment Corporate ManagementOther ServicesConstruction Education ServicesNatural Resources

    Democratization of Big Data Science

    • Doctorate training is not mandatory nor does it guarantee appropriate Big Data expertise

    • Lower barriers of entry• Demand for constant  “Continuing Education” 

    and self‐training• Dichotomy between theoretical and empirical 

    sciences• Differences between fundamental knowledge 

    and experimental skills (big data properties closely approximate core scientific principles) 

  • 7/30/2015

    2

    Big DataScience

    Medical SciencesSocial SciencesEnvironmental Sciences....

    Math/StatsPhysicsBiologyChemistry....

    EngineeringComputer ScienceBioinformaticsBiomath/Biostats....

    Domain‐specific vs.Trans‐disciplinary knowledge 

    Big Data Resourceome• There is an explosion of open‐data‐science resources

    – www.data.gov– www.ncbi.nlm.nih.gov/gap

    • Spawning of a number of industries and enterprises blending proprietary and open‐source data, code, documentation, expert‐support, infrastructure and services

    • Big Data to Knowledge Initiative: www.BD2K.org• Google Cloud Platform (GCP)• Amazon Web Services (AWS)

    Big Data Analytics Resourceome

    http://bd2k.org/BigDataResourceome.html

    BD

    Big Data Information Knowledge ActionRaw Observations Processed Data Maps, Models Actionable DecisionsData Aggregation Data Fusion Causal Inference  Treatment RegimensData Scrubbing Summary Stats Networks, Analytics Forecasts, Predictions

    Semantic‐Mapping Derived Biomarkers Linkages, Associations Healthcare Outcomes

    I K A

    National Big Data to Knowledge (BD2K) Centers

    1. Center for Causal Modeling and Discovery of Biomedical Knowledge from Big Data (U Pitt, Cooper, Bahar, Berg)

    2. Center for Predictive Computational Phenotyping (U Wisconsin, Craven 3. National Center for Mobility Data Integration to Insight (The Mobilize Center) (Stanford, 

    Delp)4. KnowEng, a Scalable Knowledge Engine for Large‐Scale Genomic Data The (U Illinois

    Urbana‐Champaign, Han, Sinha, Sorg, Weinshilboum)5. Center for Big Data in Translational Genomics (UC Santa Cruz, Haussler, Patterson)6. Patient‐Centered Information Commons (Harvard, Kohane)7. Center of Excellence for Mobile Sensor Data‐to‐Knowledge (MD2K) (U Memphis, Kumar) 8. Center for Expanded Data Annotation and Retrieval (CEDAR) (Stanford, Musen)9. A Community Effort to Translate Protein Data to Knowledge: An Integrated Platform 

    (UCLA, Ping, Lindsey, Su, Watson)10. ENIGMA Center for Worldwide Medicine, Imaging, and Genomics (USC, Thompson)11. Big Data for Discovery Science (USC, Toga) 

    www.BD2K.org

    Kryder’s law: Exponential Growth of Data 

    Dinov, et al., 2014

    Gryo_Byte

    Cryo_Short

    Cryo_Color0

    2E+15

    4E+15

    6E+15

    1 µm10 µm

    100 µm1mm

    1cm

    Gryo_ByteCryo_ShortCryo_Color

    Neuroimaging(GB)Genomics_BP(GB)Moore’s Law (1000'sTrans/CPU)0

    5000000

    10000000

    15000000

    1985‐19891990‐1994 1995‐1999 2000‐2004 2005‐20092010‐2014

    2015‐2019(estimated)

    Neuroimaging(GB)Genomics_BP(GB)Moore’s Law (1000'sTrans/CPU)

    Increase ofImaging Resolution

    Data volumeIncreases faster than computational power

    Walter, Sci Am, 2005storage doubles every 1.2 yrsComputational power double each 1.6 years

  • 7/30/2015

    3

    ComplexityVolume,

    Heterogeneity

    InferenceHigh-throughputExpeditive,Adaptive

    RepresentationIncompleteness

    (space, time, measure)

    ModelingConstraints, Bio,Optimization,Computation

    Big Data Challenges Energy & Life‐Span of Big Data Energy encapsulates the holistic information content included in the data (exponentially 

    increasing with time), which may often represent a significant portion of the joint distribution of the underlying process 

    Life‐span refers to the relevance and importance of the Data (exponentially decreasing with time)

    Time

    Exponential Growth of Big Data (Size, Complexity, Importance)

    Exponential Value Decay of Static Big Data

    Time of Data Fixation

    T=10T=12T=14T=16

    Data FixationTime Points

    http://www.aaas.org/news/big‐data‐blog‐part‐v‐interview‐dr‐ivo‐dinov

    End‐to‐end Pipeline Workflow Solutions 

    Dinov, et al., 2014, Front. Neuroinform.;                               Dinov, et al., Brain Imaging & Behavior, 2013

    End‐to‐end Pipeline Workflow Solutions 

    Dinov, et al., 2014, Front. Neuroinform.;                               Dinov, et al., Brain Imaging & Behavior, 2013

    Neurodegenerative Disease Applications:Imaging‐Genetic Biomarker Interactions in Alzheimer’s DiseaseGoals

    Investigate AD subjects (age 55 – 65) using Neuroimaging Initiative (ADNI) database to understand early‐onset (EO) cognitive impairment using neuroimaging and genetics biomarkers

    Data9 EO‐AD and 27 EO‐MCIDerived 15 most impactful neuroimaging markers (out of 336 morphometry measures)Obtained 20 most significant single nucleotide polymorphisms (SNPs) associated with specific neuroimaging biomarkers (out of 620K SNPs)

    ApproachGlobal Shape Analysis (GSA) Pipeline workflow

    Moon, Dinov et al., 2015, Psych. Investig.Moon, Dinov et al., 2015, J Neuroimaging

    Results: AD Imaging‐Genetic Biomarker Interactions

    Identified associations between neuroimaging phenotypes and genotypes for the cohort of 36 EO subjects

    Overall most significant associations: rs7718456 (Chr 15) and L_hippocampus (volume) rs7718456 and R_hippocampus (volume)

    For the 27 EO‐MCI’s, most significant associations  rs6446443 (Chr 4, JAKMIP1 janus kinase and microtubule interacting 

    protein 1 gene) and R_superior_frontal_gyrus (volume) rs17029131 (Chr 2) and L_Precuneus (volume)

    Moon, Dinov et al., 2015, Psych. Investig.

  • 7/30/2015

    4

    Shap

    e Inde

    x

    =

    Curved

    ness

    =

    0

    Results: AD Imaging‐Genetic Biomarker Interactions Results: AD Imaging‐Genetic Biomarker Interactions

    QC Protocol

    A. QC Plink workflowB. Genetic association 

    study

    Results: AD Imaging‐Genetic Biomarker Interactions

    Manhattan plot for all the single nucleotide polymorphisms (SNPs) 

    Results: AD Imaging‐Genetic Biomarker Interactions

    GWAS Imaging‐Genetics Approach

    • SNPs– E.g., C/T polymorphism

    • Model– Phenotype: Yi be the imaging‐biomarker for ith subject– Genotype:   Xi be the genotype ith subject at a particular: 

    SNP0, 1, or2,

    • SOCR Multivariate Regression Models–

    – , ∑ ε– Stat analysis:  ≠ 0

    Genotype‐Phenotype Relation

    ParentsB A

    B BB BAA BA AA

    SNP‐Neuroimaging interactions in Alzheimer’s Disease

    Overall Associations rs7718456 and L_hippocampus(volume) rs7718456 and R_hippocampus(volume)

    EO‐MCI associations  rs6446443 R_superior_frontal_gyrus(volume) rs17029131 and L_Precuneus(volume)

  • 7/30/2015

    5

    Neurodegenerative Disease Applications:Imaging‐Genetic Biomarker Interactions in Alzheimer’s DiseaseGoals

    Investigate AD subjects (age 55 – 65) using Neuroimaging Initiative (ADNI) database to understand early‐onset (EO) cognitive impairment using neuroimaging and genetics biomarkers

    ResultsGenerated mean geometric models of the left and right hippocampi, middle frontal gyri, and the middle temporal gyri. Linear model statistical maps, at each vertex on the triangulated shapes. Dependent variable was the radial distance morphometry measure (deviation of individual shape model from mean shape atlas) and independent regressors including diagnosis, age, education years, APOE (ε4), MMSE, visiting times, and logical memory (immediate and delayed recall).

    ApproachLocal Shape Analysis (LSA) Pipeline workflow

    Moon, Dinov et al., 2015, J Neuroimaging

    Neurodegenerative Disease Applications:Imaging‐Genetic Biomarker Interactions in Alzheimer’s Disease

    Moon, Dinov et al., 2015, Neuroimaging.

    Neurodegenerative Disease Applications:Imaging‐Genetic Biomarker Interactions in Alzheimer’s Disease

    Moon, Dinov et al., 2015, J Neuroimaging

    P‐maps (A) for Lt. hippocampus, P‐maps (B,C) for Rt. Hippocampus, P‐maps (D,E) for Lt. middle frontal gyrus, P‐maps (F,G,H) for Rt. middle frontal gyrus, P‐maps (I, J) for Lt. middle temporal gyrus.

    SOCR Big Data Dashboardhttp://socr.umich.edu/HTML5/Dashboard

    Web‐service combining and integrating multi‐source socioeconomic and medical datasets 

    Big data analytic processing

    Interface for exploratory navigation, manipulation and visualization

    Adding/removing of visual queries and interactive exploration of multivariate associations

    Powerful HTML5 technology enabling mobile on‐demand computing

    Husain, et al., 2015, J Big Data

    SOCR Dashboard (Exploratory Big Data Analytics): Data Fusion

    http://socr.umich.edu/HTML5/Dashboard

    SOCR Dashboard (Exploratory Big Data Analytics): Data QC

    http://socr.umich.edu/HTML5/Dashboard

  • 7/30/2015

    6

    Training: Big Data Skills

    1) Listening: streams, information and language, analyzing sentiment, intent and trends;

    2) Looking: searching, indexing and memory management of heterogeneous datasets; Loading: Raw, derived or indexed data as well as meta‐data;  

    3) Programming: Handling Map‐Reduce/HDFS, No‐SQL DB, protocol provenance, pipeline workflows; 

    4) Inferring: Principals of data analyses, Bayesian modeling, inference, uncertainty and quantification of likelihoods; Connecting: Reasoning, logic and its limits, dealing with uncertainty; Analytics: Regression, feature selection, dimensionality reduction, temporal patterns; 

    5) Learning: Classification, clustering, mining, information extraction, knowledge retrieval, decision making; 

    6) Predicting: Forecasting, neural models, deep learning, and research topics; 7) Summarizing: Presentation of data, processing protocol, analytics provenance, 

    visualization

    Training: Core ProficienciesThe Data Science Certificate program aims to ensure that students awarded this certificate would have the following experiences:

    1) (Modeling) Understanding of core Data Science principles, assumptions and applications

    2) (Technology) Knowledge of basic protocols for data management, processing, computation, information extraction & visualization

    3) (Practice) Hands‐on experience with modeling tools and technology resources in a real project setting

    University of Michigan Data Science Training

    http://midas.umich.edu/training

    Paradigm Shift in Higher Ed & Training A natural parallel to the 4‐pillar scientific discovery paradigm (experimental, theoretical, computational, and data sciences) can be developed for higher ed learning

    The new paradigm shift in science and technology education and training involves 4 complementary abilities and skills necessary to handle, understand and interpret complex phenomena using Big Data observations:

    o Literacy, which provides the deep domain‐specific knowledge (e.g., math, phy, bio, eng, stats, enviro, social, etc.), 

    o Numeracy, which develops the skills to process hard information, manipulate numbers, and understand analytical models,

    o Algorithmacy, which provides the foundation for developing, optimizing and evaluating protocols, pseudo code, and (jointly with numeracy) facilitates programming, and 

    o Datacy, which is the new skillset required to understand natural processes, via Big Data proxies, manipulate, extract and synthesize information, develop and evaluate predictive and decision support models. 

    Acknowledgments 

    FundingNIH: P20 NR015331, U54 EB020406, P50 NS091856, P30 DK089503NSF: DUE 1416953, 0716055, 1023115

    Collaborators • SOCR: Alexandr Kalinin, Selvam Palanimalai, Syed Husain, Ashwini Khare, Rami Elkest, Abhishek Chowdhury, 

    Patrick Tan, Gary Chan, Andy Foglia, Pratyush Pati, Brian Zhang, Juana Sanchez, Dennis Pearl, Kyle Siegrist, Nicolas Christou, Rob Gould

    • AtheyLab: Alex Ade, Ari Allyn‐Feuer,  Gen Zheng, John Wiley, David Dilworth, Walter Meixner, Jeffrey R. de Wet, Kevin Smith, Amy Creekmore, Indika Rajapakse, Gerry Higgins, Robert W. Veltri,  Brian Athey

    • LONI/INI: Arthur Toga, Roger Woods, Jack Van Horn, Zhuowen Tu, Yonggang Shi, David Shattuck, Elizabeth Sowell, Katherine Narr, Anand Joshi, Shantanu Joshi, Paul Thompson, Luminita Vese, Stan Osher, Stefano Soatto, Seok Moon, Junning Li, Young Sung, Fabio Macciardi, Federica Torri, Carl Kesselman