Industry Perspective: Big Data and Big Data Analytics David Barnes Program Director Emerging Internet Technologies IBM Software Group

Post on 04-Feb-2021


  • Industry Perspective: Big Data and Big Data Analytics

    David Barnes, Program Director, Emerging Internet Technologies, IBM Software Group

  • What is Big Data?

  • The Adjacent Possible

  • Inexpensive disk + Increased processing power

    + Data Warehouse + The Web

    + X

    = Big Data

    X = sensors used to gather climate information, posts to social media sites, digital pictures and videos, transaction records, cell phone GPS signals, and more.

  • © 2010 IBM Corporation

    161 exabytes of data were created in 2006, 3 million times the amount of information contained in all the books ever written.

    In 2010 the number reached 988 exabytes.

    IDC estimates that 1.8 zettabytes were created and replicated in 2011.
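
    As a rough cross-check of these figures, the compound annual growth rate implied by the 2006 and 2010 estimates can be worked out directly. The sketch below is an illustration only: the function name `cagr` is an assumption, and the inputs are simply the exabyte figures quoted on the slide.

```python
def cagr(start: float, end: float, years: float) -> float:
    """Compound annual growth rate between two values over `years` years."""
    return (end / start) ** (1.0 / years) - 1.0

# Figures quoted on the slide, in exabytes.
created_2006 = 161.0
created_2010 = 988.0

rate = cagr(created_2006, created_2010, years=4)
print(f"Implied growth 2006-2010: {rate:.0%} per year")
```

    The result comes out a little under 60% per year.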

  • Every day, people create the equivalent of 2.5 quintillion bytes of data from sensors, mobile devices, online transactions, and social networks.

    Every month people send one billion Tweets and post 30 billion messages on Facebook.

    90% (or more) of the world’s data is unstructured.

  • The true nature of information

  • Is noisy

    Is often dirty

    Is often full of valuable information

    Unstructured Data

  • The Big Data Imperative

    Big Data has swept into every industry and business function.

    Businesses need to put the power of Big Data analytics in the hands of their business employees; the term "Data Scientist" is somewhat misleading.

    “Leaders in every sector will have to grapple with the implications of big data, not just a few data-oriented managers.” – McKinsey Global Institute

    Big Data Business Patterns

    Computational Journalism

    Chief Legal Officer

    Retail Business Planner

    IT Systems Management

    Pharma - Clinical Trials

    Business Fraud Detection

    Evidence Based Medicine

    Web Archiving

    . . .

  • Today’s Problem

    Data is growing at a compound annual growth rate of 60% per year

    Storage capacity continues to increase dramatically

    Storage access speeds have not kept up

    At a transfer speed of 500 MB/sec, 1 terabyte of data will require ~30 mins to read from a single drive
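
    The read-time arithmetic on this slide can be reproduced in a few lines. This is a back-of-the-envelope sketch, not a benchmark: it assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), a sustained transfer rate, and no seek or coordination overhead. The `drives` parameter is an assumption that models an even split across drives read in parallel.

```python
def read_time_seconds(size_bytes: float, rate_bytes_per_sec: float, drives: int = 1) -> float:
    """Time to read `size_bytes` when the data is split evenly across `drives`
    drives that are read in parallel, each at `rate_bytes_per_sec`."""
    if drives < 1:
        raise ValueError("drives must be at least 1")
    return size_bytes / (rate_bytes_per_sec * drives)

TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)

single = read_time_seconds(1 * TB, 500 * MB)
parallel = read_time_seconds(1 * TB, 500 * MB, drives=100)
print(f"1 drive:    {single:.0f} s (~{single / 60:.0f} min)")
print(f"100 drives: {parallel:.0f} s")
```

    Under these assumptions a single drive needs 2,000 seconds (about 33 minutes), and 100 drives reading in parallel need 20 seconds.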

    Enter Map/Reduce

    • Automates the mechanisms of large-scale distributed computation (i.e. work distribution, load balancing, replication, failure/recovery)

    • Divide & Conquer: 1 terabyte split among 100 drives will require ~20 seconds to read

    • The M/R parallel processing model provides a cost-effective framework for a new generation of analytic applications on unstructured or semi-structured data
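
    The shape of the Map/Reduce model can be shown with a word count run in a single process. This is a minimal sketch and not the Hadoop API: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are assumptions made for illustration. A real framework would run the map and reduce calls on many machines and take care of the work distribution, replication and recovery listed above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def map_phase(document: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(splits: List[str]) -> Dict[str, int]:
    """Run map over every split, group by key, then reduce each group."""
    mapped = [pair for split in splits for pair in map_phase(split)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

# Each string stands in for one input split that could live on a separate drive.
splits = ["big data is big", "data is noisy"]
counts = word_count(splits)
print(counts)
```

    Because each call to `map_phase` only sees its own split and each call to `reduce_phase` only sees one key, both phases can be spread across machines without changing the result.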

  • Requirement: A New Class of Big Data Applications

    Big Data analytics must be brought to the line-of-business user.

    • Leverage easy-to-use manipulation metaphors

    • Use natural language technologies for analytics

    • Provide rich visualizations to quickly identify insights

  • Demo: Buyer Sentiment Analysis

  • Social Media: Chilean Earthquake 2010

    The 2010 Chilean earthquake was the fifth largest earthquake in recorded history

    The affected areas suffered major devastation: buildings, airports, hospitals, prisons, bridges, and roads were severely damaged

    Land-based communications systems suffered major outages

    The wireless 3G infrastructure remained intact and operational

  • Social Media: Chilean Earthquake 2010

    Social networking on wireless networks became the major form of communication

    Extreme Blue students collected 226 million Tweets, then analyzed and categorized them by incident type and location

    Tweets included questions such as "Can I get food?", "Can I get gas?" and "Are the bridges down?", as well as images

    The results were visualized

    Completed in ~12 weeks
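
    Categorizing Tweets by incident type and location can be illustrated with a simple keyword matcher. The deck does not describe the method the students used, so everything below is hypothetical: the category names, the keyword lists, the sample locations, and the functions `categorize` and `tally`. A production system would more likely rely on natural language processing than on substring checks.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical keyword lists; the categories used in the actual project are not given.
CATEGORIES: Dict[str, List[str]] = {
    "food": ["food", "water"],
    "fuel": ["gas", "fuel"],
    "infrastructure": ["bridge", "road"],
}

def categorize(text: str) -> str:
    """Return the first category whose keyword appears in the text, else 'other'."""
    lowered = text.lower()
    for category, keywords in CATEGORIES.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "other"

def tally(tweets: List[Tuple[str, str]]) -> Counter:
    """Count tweets per (location, category) pair."""
    return Counter((location, categorize(text)) for text, location in tweets)

# Sample (text, location) pairs; the locations are made up for the example.
tweets = [
    ("Can I get food?", "Concepcion"),
    ("Can I get gas?", "Santiago"),
    ("Are the bridges down?", "Concepcion"),
]
print(tally(tweets))
```

    Counts keyed by location and category are the kind of aggregate that could then be plotted on a map or chart.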

  • Big Data = Volume, Variety and Velocity

    • Volume - Scale from terabytes to zettabytes

    • Variety - Relational and non-relational data types from an ever-expanding variety of sources

    • Velocity - Streaming data and large volume data movement

  • The supercomputer is based on over 1,200 high-powered IBM System x servers and can perform 150 trillion calculations per second, equivalent to 30 million calculations per Danish citizen per second.

    Vestas expects its data sets will grow to 20-plus petabytes over the next four years.

  • © 2011 IBM Corporation

    Seton Healthcare Family: Reducing CHF readmission to improve care

    Business Challenge

    Seton Healthcare strives to reduce the occurrence of high-cost Congestive Heart Failure (CHF) readmissions by proactively identifying patients likely to be readmitted on an emergent basis.

    What’s Smart?

    IBM Content and Predictive Analytics for Healthcare solution will help to better target and understand high-risk CHF patients for care management programs by:

    • Utilizing natural language processing to extract key elements from unstructured History and Physical, Discharge Summaries, Echocardiogram Reports, and Consult Notes

    • Leveraging predictive models that have demonstrated high positive predictive value against extracted elements of structured and unstructured data

    • Providing an interface through which providers can intuitively navigate, interpret and take action

    Smarter Business Outcomes

    • Seton will be able to proactively target care management and reduce re-admission of CHF patients.

    • Teaming unstructured content with predictive analytics, Seton will be able to identify patients likely for re-admission and introduce early interventions to reduce cost, mortality

    IBM solution

    • IBM Content and Predictive Analytics for Healthcare

    • IBM Cognos Business Intelligence

    • IBM BAO solution services

    “IBM Content and Predictive Analytics for Healthcare uses the same type of natural language processing as IBM Watson, enabling us to leverage information in new ways not possible before. We can access an integrated view of relevant clinical and operational information to drive more informed decision making and optimize patient and operational outcomes.”
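
    The slide refers to models with a high positive predictive value (PPV). As a reminder of what that metric measures, the sketch below computes PPV from counts of correct and incorrect high-risk flags. The function name and the counts are made up for illustration and are not Seton results.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """PPV = TP / (TP + FP): the share of patients flagged as high risk
    who were in fact readmitted."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no positive predictions to evaluate")
    return true_positives / flagged

# Made-up counts for illustration only; not Seton results.
ppv = positive_predictive_value(true_positives=40, false_positives=10)
print(f"PPV: {ppv:.0%}")
```

    A high PPV matters for care management because each flagged patient consumes intervention resources, so false alarms carry a direct cost.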

  • IBM Content and Predictive Analytics for Healthcare: The Seton CHF Readmission Solution

    Unstructured Data (Cerner Clinical Documentation: History and Physical, Discharge Summary, Echocardiogram.)

    Structured Data (Avega Cost Data, DSS Admission History, DSS Procedure History, Cerner Clinical Events)

    Raw Information

    Search and Visually Explore (Mine)

    Monitor, Dashboard and Report (Cognos BI)

    Question and Answer*

    Custom Solutions

    Dynamic Multimode Interaction

    IBM Content and Predictive Analytics

    Content Analytics

    • Natural Language Processing

    • Medical Fact and Relationship Extraction (Annotation)

    • Trend, Pattern, Anomaly, Deviation Analysis

    Predictive Analytics

    • Predictive Scoring and Probability Analysis

    Analyzed and Visualized Information

    Health Integration Framework

    Data Warehouse and Model

    Master Data Management

    Advanced Case Management

    Business Analytics

    Partners (HLI)

    Specialized Research

    IBM Watson for Healthcare

    Confirm hypotheses or seek alternative ideas with confidence based responses from learned knowledge*

    Utilizing natural language processing to extract key elements from unstructured History and Physical and Discharge Summary

    Leveraging predictive models that have demonstrated high positive predictive value against extracted elements of structured and unstructured data

    Providing an interface through which providers can intuitively navigate, interpret and take action

  • The Data We Thought Would Be Useful … Wasn’t

    • 113 candidate predictors from structured and unstructured data sources

    • Structured data was less reliable than unstructured data, which increased the reliance on unstructured data

    New Unexpected Indicators Emerged … Highly Predictive Model

    • 18 accurate indicators or predictors (see next slide)

    Predictor Analysis          % Encounters Structured Data    % Encounters Unstructured Data
    Ejection Fraction (LVEF)    2%                              74%
    Smoking Indicator           35% (65% Accurate)              81% (95% Accurate)
    Living Arrangements

  • Visualizing the Results: Readmissions Dashboard

    The Cognos dashboard reporting system can help in monitoring the key clinical, operational and financial metrics. More importantly, it makes it possible to track down the top priority cases for case management.

    1. Clinical Statistics: admission count, readmission count and readmission rate

    2. Operational Statistic: counts of different length of stay periods

    3. Financial Statistic: total direct cost by total admission and by readmission

    4. Mortality: mortality rate

    5. Average length of stay

    6. Average direct cost by total admission and by readmission only

    7. PA Model Score: distribution of propensity of readmission
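
    Several of the dashboard statistics listed on this slide are simple aggregates over encounter records. The sketch below shows one way they might be computed. The `Encounter` fields, the function name, the sample values, and the definition of readmission rate as readmissions divided by all admissions are all assumptions; the deck does not give Seton's data model or formulas.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Encounter:
    """One hospital admission (hypothetical record layout)."""
    length_of_stay_days: float
    direct_cost: float
    is_readmission: bool
    died: bool

def dashboard_metrics(encounters: List[Encounter]) -> Dict[str, float]:
    """Aggregate the clinical, operational and financial statistics."""
    if not encounters:
        raise ValueError("no encounters")
    n = len(encounters)
    readmits = [e for e in encounters if e.is_readmission]
    return {
        "admission_count": n,
        "readmission_count": len(readmits),
        "readmission_rate": len(readmits) / n,  # assumed definition
        "mortality_rate": sum(e.died for e in encounters) / n,
        "avg_length_of_stay": sum(e.length_of_stay_days for e in encounters) / n,
        "avg_direct_cost": sum(e.direct_cost for e in encounters) / n,
        "avg_direct_cost_readmission": (
            sum(e.direct_cost for e in readmits) / len(readmits) if readmits else 0.0
        ),
    }

# Made-up sample data for illustration only.
sample = [
    Encounter(4, 8000.0, False, False),
    Encounter(6, 12000.0, True, False),
    Encounter(2, 4000.0, False, False),
    Encounter(8, 16000.0, True, True),
]
metrics = dashboard_metrics(sample)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

    The model score distribution (item 7) is left out because it depends on the predictive model itself, which the deck does not specify.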

  • USC Annenberg School of Communications

  • InfoSphere Streams

  • Big Data Platform Vision

    Bringing Big Data to the Enterprise

    Big Data Solutions

    Client and Partner Solutions

    Big Data User Environments

    Developers, End Users, Administrators

    Big Data Enterprise Engines

    Internet Scale Analytics, Streaming Analytics

    Open Source Foundational Components

    Hadoop, MapReduce, HDFS, HBase, Pig, Lucene, Jaql

    AGENTS

    INTEGRATION

    Marketing: Unica

    Warehouse Appliances: Netezza

    Data Warehouse: InfoSphere Warehouse

    Database: DB2

    Analytics: SPSS

    Business Intelligence: Cognos

    Master Data Mgmt: InfoSphere MDM