

    About Deloitte

Deloitte refers to one or more of Deloitte Touche Tohmatsu Limited, a UK private company limited by guarantee, and its network of member firms, each of which is a legally separate and independent entity. Please see www.deloitte.com/about for a detailed description of the legal structure of Deloitte Touche Tohmatsu Limited and its member firms.

Please see www.deloitte.com/us/about for a detailed description of the legal structure of Deloitte LLP and its subsidiaries. Certain services may not be available to attest clients under the rules and regulations of public accounting.

This publication contains general information only, and none of Deloitte Touche Tohmatsu Limited, its member firms, or its and their affiliates are, by means of this publication, rendering accounting, business, financial, investment, legal, tax, or other professional advice or services. This publication is not a substitute for such professional advice or services, nor should it be used as a basis for any decision or action that may affect your finances or your business. Before making any decision or taking any action that may affect your finances or your business, you should consult a qualified professional adviser.

None of Deloitte Touche Tohmatsu Limited, its member firms, or its and their respective affiliates shall be responsible for any loss whatsoever sustained by any person who relies on this publication.

    Copyright 2012 Deloitte Development LLC. All rights reserved.

    Member of Deloitte Touche Tohmatsu Limited

    ISSUE 12 | 2013

    Complimentary article reprint

BY JAMES GUSZCZA, DAVID STEIER, JOHN LUCKER, VIVEKANAND GOPALKRISHNAN, AND HARVEY LEWIS > ILLUSTRATION BY JON KRAUSE

Too Big to Ignore

When does big data provide big value?


On the one hand, it is easy to see why big data has generated so much excitement both in the business press and in the larger culture. New business models, scientific breakthroughs, and profound societal transformations are all observable effects of big data.

Familiar examples abound. Google Translate exploits word associations in massive databases of free-form text to yield a tool that can, if you wish, instantly translate Icelandic to Indonesian. Social, political, economic, and professional relationships, and the societal and marketing mores surrounding them, are rapidly evolving along with the evolution of social networking and social media technologies. Political campaigns increasingly harness the power of social networks and social media to raise funds, motivate constituencies, and get out the vote. Companies use detailed databases about their customers' behavior and lifestyles not only to better target them, but to create such innovative data products as playlists, newsfeeds, next-best offers, and recommendation engines for items ranging from airplane tickets to romantic partners. The raw material for all of these innovations, and surely many more to come, is large amounts of data.

So the topic is undoubtedly important. Yet at the same time, much of the language that has come to surround big data conveys a muddled conception of what data, big or otherwise, means to the majority of organizations pursuing analytics strategies. Big data is indeed a signature issue of our time. But it is also shrouded in hyperbole and confusion, which can be a breeding ground for strategic errors. In short, big data is a big deal, but it is time to separate the signal from the noise.

    V IS FOR

Of course business applications of data analysis and predictive modeling go back decades.5 For example, credit scoring dates back to the ENIAC-era late 1950s, and actuaries have long analyzed industrial-strength data to price insurance contracts and, more recently, to set aside appropriate loss reserves as well as guide underwriting and claim adjustment decisions. And in the decade since the appearance of Michael Lewis's Moneyball, statistical approaches to improved business decision making have spread to realms as disparate as improving patient safety, making better hiring decisions, and warding off lapses in child support payments. What then is new about big data?

The term is somewhat hard to pin down in part because it is commonly used in at least two distinct senses. In everyday discussions, "big data" is increasingly used as shorthand for the granular and varied data sources that go into the sorts of projects described above.6 Here "big" is used in the sense of "as rich and detailed as practical given the business context." Such data is big relative to the small, clean, easily accessible datasets that can easily be manipulated in spreadsheets and have traditionally been fodder for mainstream academic statistical research. While colloquial, this is actually a useful conception of big data for reasons that will become clear.

More officially, big data denotes data sources whose very size creates problems for standard data management and analysis tools. Examples include the data continuously emanating in vast quantities from digital sensors, audio and video recording devices, mobile computing devices, Internet searches, social networking and media technologies, and so on. Such examples motivate Doug Laney's widely accepted "three Vs" characterization of big data:7

Volume: Here, "big" is often taken to mean multiple terabyte- or petabyte-class data, motivating the use of such highly parallel next-generation data management technologies as MapReduce, Hadoop, and NoSQL (a minimal sketch of the map/reduce pattern follows this list).8

Variety: Big data goes beyond numbers in databases, a.k.a. structured data. Such unstructured data sources as Internet search log files, tweets, call center transcripts, telecom messages, email, and data from sensor networks, video, and photographs can equally well be considered data. The multi-structured nature of big data in part accounts for its large volume and often high degree of messiness and noisiness.

Velocity: Because much of it emanates from sensors, web search logs, real-time feeds, or mobile devices, big data is often generated continuously and at a rapid clip.
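The Volume item above names MapReduce as a representative technology. As a rough illustration only (plain Python rather than an actual Hadoop job, with made-up documents), the map/reduce pattern splits work into a map step that emits key-value pairs, a shuffle step that groups them by key, and a reduce step that aggregates each group; the classic example is counting words across many documents:

```python
from collections import defaultdict

def map_phase(doc):
    # Map step: emit a (word, 1) pair for every word in one document.
    return [(word.lower(), 1) for word in doc.split()]

def shuffle(pairs):
    # Shuffle step: group all emitted pairs by key (the word).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce step: aggregate (here, sum) the values for each key.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data is big", "data about data"]          # illustrative "documents"
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(shuffle(pairs)))                     # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

Frameworks such as Hadoop matter because the map and reduce steps can be farmed out to many machines when the data is too large for any single one; the logic itself stays roughly this simple.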

Big data is said to be different and revolutionary because its size, scope, and fluidity are so great as to enable it to "speak to us," often in real time, in ways not seen before. This point of view is clearly articulated in Chris Anderson's influential essay, "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete":9

There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

The thinking is that in the pre-big data era, statistical science was necessary to make up for the inherent limitations of incomplete data samples. Statisticians and scientists were forced to cleanse, hypothesize, sample, model, and analyze data to arrive at contingent lessons that are at least consistent with, and at best strongly supported by, limited data. Today, so the thinking goes, organizations increasingly have access to something approaching a complete sample. Therefore the process of learning from data becomes akin more to a problem in algorithm and architecture design than one of learning from and quantifying uncertain knowledge using statistical science.

The mode of thought represented by Anderson's essay can be seductive. But for reasons we will explore, most organizations would do well to resist this particular seduction.

    IS BIG DIFFERENT?

Fitzgerald: The rich are different from you and me.

Hemingway: Yes, they have more money.

Google Translate is a case in point of how big data can be dramatically different. In their widely cited article, "The Unreasonable Effectiveness of Data,"10 Google researchers Alon Halevy, Peter Norvig, and Fernando Pereira describe an approach to translation that hinges on mining the messy patterns from enormous collections of translations that exist "in the wild." The approach is notable in that it bypasses traditional statistical methodology involving laboriously cleansing, sampling, exploring, and modeling data. The very size and completeness of the data now available employ word associations to do the work that linguistic rules and complicated models did in previous, less effective, approaches.

    Halevy, Norvig, and Pereira write:

Invariably, simple models and a lot of data trump more elaborate models based on less data. Currently, statistical translation models consist mostly of large memorized phrase tables that give candidate mappings between specific source- and target-language phrases.
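To make the "memorized phrase table" idea concrete, a statistical translation model can be pictured as a scored lookup from source phrases to candidate target phrases. The fragment below is a toy, hypothetical sketch (the entries and scores are invented for illustration, not drawn from Google's system):

```python
# Hypothetical phrase table: source phrase -> list of (candidate translation, score).
# In a real system the scores would be estimated from enormous parallel corpora.
phrase_table = {
    "bonjour": [("hello", 0.7), ("good morning", 0.3)],
    "le monde": [("the world", 0.9), ("everyone", 0.1)],
}

def best_candidate(source_phrase):
    # Return the highest-scoring candidate, or echo the phrase if it is unknown.
    candidates = phrase_table.get(source_phrase, [])
    return max(candidates, key=lambda c: c[1])[0] if candidates else source_phrase

print(best_candidate("bonjour"))   # hello
print(best_candidate("le monde"))  # the world
```

The point of the quotation is that, at sufficient scale, such memorized mappings outperform hand-built linguistic rules.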

Analogous discussions could be made, for example, about recommendation engines, online marketing experiments performed in rapid succession, and the use of Internet search data to predict flu outbreaks or "nowcast" economic trends. Its near-completeness and real-time nature can make big data an "unreasonably effective" (to use Halevy, Norvig, and Pereira's phrase) force for innovative data products.

In short, why ask why? In many applications, we only care about a good enough next answer, decision, or recommendation. And this is what big data, processed by the right computer algorithms, gives us. Or does it?


    HOW DATA SCIENCE IS A SCIENCE

The whole of science is nothing more than a refinement of everyday thinking.

    Albert Einstein

As important as Google Translate and any number of other big data products are, cursory thinking about their significance can muddle important issues ranging from the epistemological to the economic to the strategic.

First, a fundamental distinction should not be forgotten: Data is not the same thing as information. In itself, data is nothing more than an inert raw material: uninterpreted symbols on a page or in a database. Furthermore, the size of a dataset is often a poor proxy for the amount of useful information it contains. A single Eminem video accounts for more of the world's exabytes of data than do the complete works of Einstein. A typical collection of holiday snapshots from the Galapagos Islands will occupy more disk space than On the Origin of Species.

Second, it is deeply misguided to view the skills needed to convert raw data into usable information in software engineering terms. Contrary to the view articulated by Chris Anderson, these skills are inherently scientific in nature. Indeed, they are increasingly labeled by the helpful umbrella term "data science."11 Data science should be viewed as a synthesis of statistical science, computer science, and domain knowledge appropriate to the problem at hand.12 The term originated from a far-sighted group of statisticians who understood that as data continues to grow in volume and availability, the ability to interact with, visually explore, and generally compute with data becomes an inescapable part of doing serious statistics.13

Business projects with data science at their core, a.k.a. business analytics projects, have three primary phases: design, analysis, and execution. Critical thinking, scientific judgment, creativity, and pragmatism are inherent to each of them. It is, generally speaking, unrealistic to expect that any of these phases could be automated or outsourced to computer algorithms processing big data.

Strategy and design: Perhaps a data scientist's most important skill is in understanding (and often helping to articulate) an organization's questions, problems, or strategic challenges and then translating them into the design of one or more data analysis projects. John Tukey's famous slogan captures the need for such a strategy-led approach: "Better to have an approximate answer to the right question than a precise answer to the wrong question."

To illustrate, consider a hypothetical chief medical officer of a hospital group seeking to improve patient outcomes. Should the focus of data analysis be on preventing medical errors? Identifying physicians at high risk of being sued for malpractice? Identifying patients with potentially high medical utilization? Identifying patients likely to fall off their treatments? Predictive approaches to guiding diagnosis and treatment decisions? Even after the various alternatives have been articulated and prioritized, there remains the design challenge of outlining a sensible series of data analysis and/or predictive modeling steps. Indeed the decision of which issues to tackle and in what order might be partially informed by the opinion of a seasoned data scientist regarding their relative feasibility.

Analytical process: The technical aspect of an analytics project itself comprises three distinct phases: data scrubbing and preparation, data analysis, and validation. Each requires judgment, generally making it nonautomatic in character.

The need for statistical programming (computing with data) is rightly emphasized because data comes in such messy and disparate forms: scanned documents, web logs, electronic medical records, billing records and other financial transactions, geospatial coordinates, audio/video streams, and increasingly free-form unstructured text and beyond. But it would be mistaken to conceive of this process purely as a programming challenge. Analogous to Halevy, Norvig, and Pereira's observation that more data trumps more elaborate models, we have repeatedly found that the creation of innovative data features (or "synthetic variables") from raw data elements does more to enhance the effectiveness of analytics projects than does elaborate methodology. Examples of creating explanatory or predictive power where none existed before include calculating the distance between two addresses; calculating social network centrality using one or more shared associations; using behavioral data in a novel way; or creating index variables to serve as proxies for latent traits not directly observable.14 In short, characterizing the feature creation aspect of data analysis as a programming exercise would be akin to characterizing creative writing as word processing.15
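To ground the first of these examples, here is a minimal sketch of the "distance between two addresses" feature. The coordinates and the business framing (distance from a policyholder's home to a claims office) are hypothetical; in practice the addresses would first be geocoded to latitude and longitude:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (latitude, longitude) points, in kilometers.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Hypothetical record: the computed distance becomes a new synthetic variable.
home = (40.7128, -74.0060)     # geocoded home address (New York)
office = (41.8781, -87.6298)   # geocoded claims office (Chicago)
record = {"policy_id": "A-123", "home_to_office_km": round(haversine_km(*home, *office), 1)}
print(record)                  # roughly 1,145 km
```

The predictive value, if any, comes from the domain insight that such a distance might proxy for something (travel friction, service patterns, fraud opportunity), not from the few lines of arithmetic.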

After data scrubbing comes data exploration, analysis, and validation. Each of these steps should be viewed as an exercise in creative scientific investigation, aided with the tools of data visualization, statistics, and machine learning. Many datasets used in analytics projects contain considerable amounts of random noise, spurious correlations, ambiguity, and redundancy alongside the abiding "signal" that we wish to capture in a model or analysis.



A particularly diabolical and underappreciated problem is known in the vernacular as "multiple comparisons." Imagine 100 people in a room flipping fair coins. Because so many coins are being repeatedly flipped, it is likely that at least one of the coins will, by chance, produce so many heads or so many tails that the coin will be found to be "biased" with a high degree of statistical significance.16 This is despite the fact that the coin thus selected is in reality no different from the other coins and is likely to produce heads in roughly half of the future flips.
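A small simulation makes the coin-room story concrete. The parameters (100 people, 50 flips each, a 5 percent significance threshold) are arbitrary choices for illustration:

```python
import random
from math import comb

def two_sided_p_value(heads, n=50, p=0.5):
    # Exact binomial p-value: probability of a result at least as extreme as the one
    # observed, assuming the coin is fair.
    observed = abs(heads - n * p)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1) if abs(k - n * p) >= observed)

random.seed(0)
people, flips = 100, 50
head_counts = [sum(random.random() < 0.5 for _ in range(flips)) for _ in range(people)]
suspicious = [h for h in head_counts if two_sided_p_value(h, flips) < 0.05]
print(f"{len(suspicious)} of {people} perfectly fair coins look 'biased' at the 5% level")
```

Testing enough fair coins, or enough correlations in a big dataset, guarantees some spurious "discoveries"; the more comparisons, the more of them.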

In fields such as medicine, drug testing, and psychology, essentially this phenomenon is known as the "file drawer problem": A purely mechanical approach of filing away nonsignificant (or sometimes unwanted) results and publicizing significant results systematically yields spurious findings that will likely not hold up over time.17 The xkcd cartoon reproduced here conveys the issue humorously and succinctly.18

While this is a problem for any large-scale analysis involving many possible data dimensions and relationships, it is particularly acute in the age of big data and brute-force algorithms for extracting "interesting" relationships. Furthermore, as will be discussed below, the cost of such false positives should be taken into account.

The problem of multiple comparisons is but one of many reasons why data science should not, in general, be viewed as the brute-force harvesting of patterns from stores of big data. For others, see the inset, "Putting the Science in Data Science."

Business execution: As the statistician George Box wrote, "All models are wrong, but some are useful." A major implication of this statement is that it is generally advisable to blend model indications with common sense and expert judgment to arrive at a decision. This happens often even in bona fide big data applications. For example, if Jim uses Google Translate to write to a francophone friend he will hopefully know enough to correct various mistakes and change the second person singular from the formal vous to the familiar tu. Similarly, the recommendations of book or movie collaborative filtering algorithms are often useful but other times should be taken with a grain of salt.19



In these everyday examples, the stakes are low, and it is easy to blend model indications (translations or recommendations) with judgment founded on experience (our prior beliefs about the correct translation or what we will enjoy). In more high-stakes business analytics applications, such as using models to make medical diagnoses, screening potential employees, security rating, fraud investigation, or complex loan or insurance underwriting decisions, the blending of expert judgment with model indications should be viewed as a risk management issue of strategic importance.20 It is important that the data scientist guiding the analytical process play a role in communicating the assumptions and limitations of the analysis or model so that these issues are effectively addressed.21 This is yet another reason why data science should not be framed in purely programming or engineering terms and why big data should not be characterized as an automatic source of reliable predictions regardless of context.

Amassing repositories of big data and purchasing software, therefore, is not sufficient for business analytics. Data scientists, a.k.a. people, are essential to the process. It should also be kept in mind that in many situations, the appropriate data, at least to start, will not be big in the 3V sense, and much can be done with open-source statistical computing software. Again, domain knowledge and scientific judgment are important factors in such decisions.

A venerable motto of computer science is GIGO: garbage in, garbage out. Perhaps an analogous motto for the nascent field of data science ought to be NIINO: no insight in, none out.

    WHAT HAPPENS IN VAGUENESS STAYS IN VAGUENESS

It is natural to find the discussions of big data in the business press and blogosphere bewildering. The field is inherently multidisciplinary, and terms such as "big data," "business analytics," and "data science" mean different things to different people.22

Confusion also may arise for a more fundamental reason: Concepts relating to statistical uncertainty simply do not come naturally to the human mind. The same body of psychological research that underpins behavioral economics also suggests that we are very poor natural statisticians. We are naturally prone to find spurious information in data where none exists, latch on to causal narratives that are unsupported by sketchy statistical evidence, ignore population base rates when estimating probabilities for individual cases, be overconfident in our judgments, and generally be "fooled by randomness."23 There is little wonder that misleading narratives about big data have multiplied.


PUTTING THE SCIENCE IN DATA SCIENCE

(OR, "THOSE WHO IGNORE STATISTICS ARE CONDEMNED TO REINVENT IT.")24

Not only is information different from data in general; there is no automatic, purely algorithmic way to extract the right islands of information from oceans of raw data. Generally speaking, this process requires a combination of domain knowledge, creativity, critical thinking, an understanding of statistical reasoning, and the ability to visualize and program with data. Theory and sound causal understanding remain important checks on the innate human tendency to be fooled by randomness. Far from making the scientific method obsolete, increasing volumes of data are making data science a core strategic capability of many organizations. Here are some reasons why.

Data contains too few patterns: In many situations, one can fit many models to, and draw a variety of conclusions from, the same data. Examples are all around and include forecasting the profitability of a cohort of insurance policies, estimating Value at Risk (VaR) of a portfolio of securities, evaluating the effectiveness of an ad campaign or human resources policy, analyzing a financial time series, and predicting the outcome of a political election. While more data is often helpful, it is misleading to characterize these activities as "data driven." Rather they are driven by the creative and judgmental application of scientific principles of data analysis. Data, big or otherwise, is an input into the process, not the driving force of the process.

Human creativity and domain knowledge are often necessary to create synthetic data features. Examples include body mass index, measures of social network centrality, distances between relevant physical addresses, composite measures of employee performance, and proxy variables for unobservable latent traits. Once again data is a raw material, not a source of automatic insight.

Data contains too many patterns: Datasets contain a mixture of signal and noise, and many big data sources have low signal-to-noise ratios, perhaps earning a fourth V: vagueness. While statistical and machine learning techniques continue to improve in their ability to separate signal from noise, they are often best viewed as tools that facilitate an inherently judgmental process. Examples include selecting which variables and variable interactions should be considered for inclusion in a model, which data points should be excluded or down-weighted for various reasons, and whether various apparently linear or nonlinear relationships among variables are real or spurious.

The problem of multiple comparisons: It often happens that the more associations you test, the more apparently significant patterns you will detect, even when nothing is actually happening. This is generally regarded as a basic fact of statistical life and becomes more challenging as datasets become larger and messier.


Big datasets are unnecessarily big: Summary statistics can be sufficient for the task at hand. An elementary example: Suppose a coin has been repeatedly tossed. Assuming a binomial process, two numbers (the total number of tosses and the number of heads) contain as much information about the probability of heads on the next toss as a complete history of the previous tosses. More bytes do not always translate to more information. Scores and composite indices (such as credit scores, social media sentiment scores, or lifestyle clusters) can reduce hundreds of data elements to a handful of numbers. Often a small, carefully chosen sample of data contains as much usable information as a large messy dataset.
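A sketch of the coin example, with arbitrary simulated tosses: the likelihood of the full toss-by-toss history depends on the data only through the two summary numbers, so nothing relevant is lost by keeping just those two.

```python
import random
from math import log

random.seed(1)
history = [random.random() < 0.6 for _ in range(1000)]   # full record of tosses (True = heads)
n, heads = len(history), sum(history)                    # the two summary numbers

def log_likelihood_full(p, tosses):
    # Walks the entire toss-by-toss history.
    return sum(log(p) if t else log(1 - p) for t in tosses)

def log_likelihood_summary(p, n, heads):
    # Uses only (number of tosses, number of heads).
    return heads * log(p) + (n - heads) * log(1 - p)

for p in (0.4, 0.5, 0.6, 0.7):
    assert abs(log_likelihood_full(p, history) - log_likelihood_summary(p, n, heads)) < 1e-6

print(f"{heads} heads in {n} tosses summarize everything the full history says about p")
```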

Big datasets are too small: Another broadly accepted fact of statistical life is the "curse of dimensionality." In many situations, even the largest conceivable data is sparse because of the large number of dimensions involved and/or a rare outcome variable (such as fraud or infrequent purchases). Imagine, for example, a marketing researcher confronting a dataset containing few or no purchases for many of the hundreds of different products offered; an actuary confronted with setting professional liability insurance rates for a multitude of specialty/geography combinations; or a geneticist analyzing many thousands of gene combinations for a relatively small patient population.

Biased sampling frames: In 1936 the Literary Digest conducted a poll that received over 2 million responses predicting that Alfred Landon would prevail over Franklin Roosevelt by a double-digit margin. In fact, Roosevelt won by a landslide. The Literary Digest erred in conducting its poll by telephone, which at the time was disproportionately used by the wealthy. In this sense the huge sample was still too small. Analogously today, petabytes of social media data aren't guaranteed to accurately reflect the membership or sentiments of the population of interest.25
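A toy simulation of the sampling-frame problem (all proportions below are invented for illustration, not the 1936 figures): when the trait being measured is correlated with membership in the sampling frame, even an enormous sample drawn from that frame misses the population value.

```python
import random

random.seed(3)
population = []
for _ in range(200_000):
    wealthy = random.random() < 0.3                            # assumed share of wealthy voters
    has_phone = random.random() < (0.9 if wealthy else 0.2)    # phone ownership skews wealthy
    supports_landon = random.random() < (0.7 if wealthy else 0.35)
    population.append((has_phone, supports_landon))

true_share = sum(s for _, s in population) / len(population)
frame = [s for has_phone, s in population if has_phone]        # the "huge" phone-based sample
print(f"True support: {true_share:.1%}; estimate from the phone frame: {sum(frame)/len(frame):.1%}")
```

Making the frame bigger does not fix the bias; only a frame representative of the population does.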


The patterns in the data are trivial: While "garbage in, garbage out" is a well-known danger in data science, a less frequently noted GIGO is "gigabytes in, generalities out." Often, an automatic approach to data analysis results in generalities that are obvious, nonactionable, or both. For example, a first attempt at a recommendation engine might suggest the obvious, most popular movies or songs to most people; similarly a novice analysis of insurance claim severity might uncover obvious facts such as back injuries costing more than fractured arms. Human insight is required to design data analysis approaches that go beyond the obvious.

On the other hand, obviousness is sometimes a good thing. Some of the most valuable models insightfully combine a number of fairly obvious variables. The value of such models is due less to incorporating surprise nuggets of insight (although those are always nice) than to outperforming the human mind's ability to quantify their relative importance and interactions. Also important is that such models help enforce consistency on groups of cognitively bounded, and perhaps distracted and tired, human professionals.26

The patterns in the data are diabolically misleading: A famous example of Simpson's Paradox occurred when the University of California, Berkeley was sued for gender bias in its graduate school admissions decisions: In 1973 Berkeley admitted 44 percent of its male graduate school applicants but only 35 percent of its female applicants. However, when the data was broken down by department, the apparent bias disappeared. Why? Women tended to apply to departments (such as humanities) with lower admission rates (doh!). Similarly, naive analyses can spuriously suggest that higher prices lead to higher demand, or that marketing campaigns can lead to lower sales.
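The reversal is easy to reproduce with made-up numbers (the figures below are purely hypothetical, not Berkeley's actual data): two departments, each admitting women at a rate at least as high as men, yet a lower overall rate for women because most women applied to the more selective department.

```python
# Hypothetical admissions data: department -> gender -> (applicants, admitted).
data = {
    "Dept A (less selective)": {"men": (800, 480), "women": (100, 65)},
    "Dept B (more selective)": {"men": (200, 40),  "women": (900, 190)},
}

def rate(applicants, admitted):
    return admitted / applicants

for dept, by_gender in data.items():
    m, w = by_gender["men"], by_gender["women"]
    print(f"{dept}: men {rate(*m):.0%}, women {rate(*w):.0%}")   # women fare better in each dept

# Aggregating across departments reverses the comparison.
men_total = [sum(x) for x in zip(*(d["men"] for d in data.values()))]
women_total = [sum(x) for x in zip(*(d["women"] for d in data.values()))]
print(f"Overall: men {rate(*men_total):.0%}, women {rate(*women_total):.0%}")
```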

Studies find that highly intelligent women tend to marry men who are less intelligent than they are.27 Top-performing companies tend to do worse over time. These facts do not require sociological or management-theoretic explanations. They are instances of regression to the mean. The concept was first identified by Francis Galton, a cousin of Charles Darwin and the inventor of regression analysis, when he studied such phenomena as tall parents having shorter offspring (and vice versa). Misunderstanding of this phenomenon can lead to such bad business decisions as paying for past performance of sports players who have experienced winning streaks, or movie genres and franchises whose moment has passed.
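A quick simulation of regression to the mean (all parameters arbitrary): observed performance is modeled as stable skill plus period-specific luck, and the top performers of period 1 are, as a group, noticeably closer to average in period 2 with no underlying change at all.

```python
import random

random.seed(2)
N = 10_000
skill = [random.gauss(0, 1) for _ in range(N)]
period1 = [s + random.gauss(0, 1) for s in skill]   # skill plus luck in period 1
period2 = [s + random.gauss(0, 1) for s in skill]   # same skill, fresh luck in period 2

# Select the top 1% of performers in period 1 and track the same people in period 2.
top = sorted(range(N), key=lambda i: period1[i], reverse=True)[: N // 100]
avg1 = sum(period1[i] for i in top) / len(top)
avg2 = sum(period2[i] for i in top) / len(top)
print(f"Top 1% averaged {avg1:.2f} in period 1 but only {avg2:.2f} in period 2")
```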


Nevertheless, avoiding misconceptions about big data should be regarded as a prerequisite for avoiding analytics projects with negative ROI. First, data should not be confused with useful information. Deriving insight from data, as discussed above, should generally not be viewed as a form of programming or software implementation but as a type of scientific investigation requiring the judgmental evaluation of ambiguous data.

It is equally important to pay attention to the economic and strategic aspects of big data. While the potential benefits of big data get a lot of attention, less attention is given to the costs. Many of these costs are straightforward: It costs money to acquire, store, back up, secure, integrate, manage, audit, document, and make available any data source.28 And while inexpensive multi-terabyte drives are available at the local electronics store, they are not characteristic of the enterprise hardware many organizations use to manage big data.

More subtle economic points can be equally relevant. First, economic decisions should be made at the margins. Therefore, a big data project should not be evaluated in isolation but in terms of how much insight or predictive power it is likely to add over and above a less costly analytics project.

Second, big data projects often carry opportunity costs. Many organizations have a menu of potential analytics projects with limited resources to execute them. For a large retailer wishing to make real-time next-best offers or an Internet company aiming to enjoy the "winner take all" effect of network externalities, big data naturally rises to the top of the list of priorities. For many other organizations, a more likely path to analytics success is paved with a sequence of smaller, well-targeted projects, with the benefits of one helping fund the next. For such organizations, a focus on big data could be an expensive distraction.

Finally, the human capital and organizational costs required to work with, analyze, and act upon big data should not be underestimated. Data science skills are clearly in demand and, therefore, can be difficult and expensive to acquire. While this is subject to change as supply ramps up to meet demand, a deeper and more abiding point was made over 40 years ago by the artificial intelligence pioneer and management theorist Herbert Simon. Simon wrote that:

In an information-rich world, the wealth of information means a dearth of something else: a scarcity of whatever it is that information consumes. What information consumes is rather obvious: It consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention and a need to allocate that attention efficiently among the overabundance of information sources that might consume it.29

To be sure, there is ample evidence that when properly analyzed and acted upon, data helps enable organizations to make decisions more accurately, consistently, and economically.30 But this does not imply the cost-effectiveness of big data technology in any particular application.

Moneyball is often held up as a prime example of the sort of transformative innovation enabled by data-driven decision making. In fact, variations on "Big Data: Moneyball for Business" are common names for articles about the topic. But Billy Beane's achievement did not result from using terabyte- or petabyte-class data. As author Michael Lewis recounts, it resulted from an inspired use of the right data (for example on-base percentage rather than batting average) to address the right business opportunity (an inefficient market for talent due to a widespread culture of intuition-driven decision making). Much the same could be said about the use of statistical analysis and predictive models to help guide medical decisions; loan or insurance underwriting decisions; hiring, admissions, and resource allocation decisions; and so on.

This last point sheds light on the "data is the new oil" metaphor. The metaphor is helpful in driving home the point that data should be treated as a valuable asset that, analogous to oil, can be refined to help power insights and better decisions. But unlike oil, data is not an undifferentiated quantity. More volumes and varieties of data are not necessarily better. Indeed, if not pursued strategically, they can lead to missed opportunities, expensive distractions, and what Simon called a "poverty of attention." Rather than let the data tail wag the strategic dog, it is crucial to begin with a plan within which the organization can judge which data is the right data and which analysis is the right analysis.

    TWO CHEERS FOR BIG DATA

Is big data a big deal? Yes. Instances of innovative big data-driven products, business models, and scientific breakthroughs are already common and are likely to multiply in the future.

But for a leader seeking an appropriate business analytics strategy specific to his or her organization, the answer to this question is less straightforward. There is indeed abundant evidence that organizations, large and small, public and private, benefit from employing the analysis of data to guide strategic, operational, and tactical decisions. But this does not always entail processing terabyte- or petabyte-class data on Hadoop clusters.

Above all, business strategy should guide the choice of data and technology, not vice versa. Laying out a clear vision and analytical roadmap helps avoid being sidetracked by narratives in which the phrase "big data" is defined one way and used in another; data is conflated with usable information; the judgment-infused process of data science is mischaracterized as mining patterns and associations from raw data; and the economics of big data are downplayed.


Properly harnessed, the right data can indeed be an organization's new oil. But it is important not to lose sight of two fundamental points. First, analytics initiatives ultimately do not begin with data; they begin with clearly articulated problems to be addressed and opportunities to be pursued. Second, more data does not guarantee better decisions. But the right data, properly analyzed and acted upon, often does. Organizations that lose sight of these principles risk experiencing big data not as the new oil, but as the new turmoil. DR

James Guszcza is the National Predictive Analytics lead for Deloitte Consulting LLP's Actuarial, Risk, and Analytics practice.

David Steier is a director with Deloitte Consulting LLP and leads the Deloitte Analytics Solutions Group.

John Lucker is a principal with Deloitte Consulting LLP and Global Advanced Analytics & Modeling Market Offering leader and a US leader of Deloitte Analytics.

Vivekanand Gopalkrishnan is a director with Deloitte & Touche Financial Advisory Services Pte Ltd Singapore and leads Deloitte Analytics Institute Asia.

Harvey Lewis is a director with Deloitte UK and leads the research program for Deloitte Analytics in the UK.

    Endnotes

1. Thomas H. Davenport, Paul Barth, and Randy Bean, "How Big Data is Different," MIT Sloan Management Review, July 30, 2012.

2. Andrew McAfee and Erik Brynjolfsson, "Big Data: The Management Revolution," Harvard Business Review, October 2012.

3. "Big Data: The Next Frontier for Innovation, Competition, and Productivity," McKinsey Global Institute.

4. Chris Anderson, "The End of Theory: Big Data Makes the Scientific Method Obsolete," Wired, June 2008.

5. Or perhaps centuries. The article "Beyond the Numbers: Analytics as a Strategic Capability" pointed out that actuarial science, which dates back 250 years, might reasonably be considered an early instance of what is today called business analytics.

6. For example, see "The New Boss: Big Data," The Wall Street Journal, September 20, 2012. This article is about the use of predictive models to improve hiring decisions. Here, "big data" is used in the increasingly common colloquial sense of supporting a business analytics application, not the sense suggesting the involvement of terabyte- or petabyte-class data.

7. See Doug Laney's recent blog post for a link to the original research note as well as observations about the conception's widespread adoption.

8. A petabyte, which equals 1 million gigabytes and 1,000 terabytes, is approximately the storage space consumed by the movie Avatar, one-twentieth of the storage space consumed by the photographs comprising Google Street View, and the amount of data generated by the CERN Large Hadron Collider (LHC) experiments each second.

9. See .

10. IEEE Intelligent Systems, March/April 2009.


11. Unfortunately, like the term "big data," "data science" has been garbled in the popular press to the extent that prominent analysts have come to express frustration with it. In the blog associated with her Columbia University Introduction to Data Science course, Rachel Schutt wrote, "I don't call myself a data scientist. I call myself a statistician. I refuse to be called a data scientist because as it's currently used, it's a meaningless, arbitrary marketing term." Similarly, the mathematician/blogger Cathy O'Neil declared, "There are far too many posers out there in the land of data scientists, and it's getting to the point where I'm starting to regret throwing my hat into that ring." We sympathize with Schutt's and O'Neil's sentiments and attempt to use the term in a way that is consistent with their writings.

12. In a blog entry, Drew Conway memorably used a Venn diagram to define data science. According to Conway's characterization, data science is the intersection of three circles representing substantive expertise, math and statistics knowledge, and hacking skills. Conway's excellent definition could perhaps be improved with the addition of the abilities to identify and frame problems as well as communicate effectively.

13. The phrase "data science" in the sense used here originated in a prescient 2001 essay by the Bell Labs statistician and data visualization pioneer William Cleveland, "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics." Among other things, Cleveland called for "computing with data" and the evaluation of statistical computing tools to be added to traditional university curricula. It is notable that Cleveland was a collaborator of the other legendary Bell Labs statisticians John Tukey and John Chambers. Tukey is known as the father of exploratory data analysis [EDA] and Chambers was the inventor of the S (as in "statistics") statistical computing language. The original S language, which implemented many of Tukey's and Cleveland's data exploration and data visualization ideas, is the basis for the open-source R statistical computing environment. Today, R is the lingua franca of applied statistics and has played no small role in bringing Cleveland's vision of applied statistics as data science to life. Ironically, although Tukey himself never used computers, he coined the terms "bit" and "software," and Chambers discussed him as an inspiration for the development of S. See "John Tukey and Software" by John Chambers.

14. For example, one of the most successful innovations with regard to personal insurance applications of the 1990s was the adoption of credit scoring to select and price personal auto and homeowner policies. Not only is credit score predictive, it is one of the most powerful rating variables in situations where its use is allowed. This is likely because credit behavior is a proxy for non-directly measurable traits that also manifest themselves in risky behavior. It would be misleading to characterize such innovations as mechanical processes of putting numbers into an algorithm and letting computers find the relevant patterns. Rather, deciding to test this novel application of financial stability information was a minor innovative breakthrough.

15. Which in turn brings to mind Truman Capote's famous takedown of Jack Kerouac: "That's not writing, that's typing."

16. So-called p-values are often used as measures of "significance," which can be thought of as a type of surprise. A result with a very low p-value (such as 18 heads in 20 tosses) means that we would be very surprised to get such an extreme (or more extreme) result were the coin in fact fair. The problem is that tossing a large number of fair coins, analogous to performing a large number of medical trials or examining a large number of correlations in a big database, is virtually guaranteed to produce a number of such "significant" results purely due to random chance.

17. The epidemiologist John Ioannidis made essentially this point about medical research in a widely cited 2005 article entitled "Why Most Published Research Findings Are False." Ioannidis's point is illustrated by the recent New England Journal of Medicine article "Chocolate Consumption, Cognitive Function, and Nobel Laureates" by Franz Messerli. The article reports a close, significant linear correlation (r=0.791, p


21. This was a major theme of our previous Deloitte Review, Issue 10, article, "A Delicate Balance."

22. Even the academic statistics community itself has undergone considerable change in recent decades. Some comments from a recent blog post of the Carnegie Mellon statistician Cosma Shalizi vividly convey a sense both of this change, as well as the heterogeneity of university-level statistical training: "Suppose you were exposed to that subject as a sub-cabalistic ritual of manipulating sums of squares and magical tables according to rules justified (if at all) only by a transparently false origin myth, that is to say, you had to endure what is still an all-too-common sort of intro. stats. class, or, perhaps worse, a 'research methods' class whose content had fossilized before you were born. Suppose you then looked at the genuinely impressive things done by the best of those who call themselves 'data scientists.' Well then no wonder you think 'This is something new and wonderful,' and I would not blame you in the least for not connecting it with statistics. Perhaps you might find some faint resemblance, but it would be like comparing a child's toy wagon to a Ducati. Modern statistics is not like that, and has not been for decades... the skills of a 'data scientist' are those of a modern statistician."

23. See Thinking, Fast and Slow by Daniel Kahneman. Kahneman reported that his landmark research exploring systematic biases in human cognition and leading to behavioral economics was motivated by his experience teaching statistics to university students in Israel. He found what he was teaching to be very unintuitive. This observation ran counter to a then-prominent theory that humans are natural statisticians. In the language of Thinking, Fast and Slow, "System 1" thinking is rapid and biased towards belief and causal narratives rather than skepticism and rational analysis. System 1-style thinking accounts for the bulk of our mental operations and, here's the rub, is very poor at statistical reasoning. "System 2" thinking is slow, effortful, and strives for logical coherence rather than unsupported narrative coherence. The phrase "fooled by randomness" is of course borrowed from the Nassim Nicholas Taleb book of the same name. Regarding Taleb, Kahneman writes, "The trader-philosopher-statistician Nassim Taleb could also be considered a psychologist." Taleb suggests that we humans constantly fool ourselves by constructing flimsy accounts of the past and believing they are true. See our previous Deloitte Review article "A Delicate Balance" for a discussion of how the Kahneman school's findings relate to some of the organizational biases that can impede analytics projects.

24. This quote is attributed to the Stanford statistician Brad Efron, who is best known as the inventor of the statistical technique known as bootstrapping. Another Efron quote partially inspired this essay: "In some ways I think that scientists have misled themselves into thinking that if you collect enormous amounts of data you are bound to get the right answer ... the fallacy that if we could just get it all inside the computer we would get the answer."

25. An intriguing election-year example concerns differing treatments of the presidential candidates in the traditional and social media. The authors thank our colleague Michael Greene for bringing this example to their attention.

26. A dramatic example of "decision fatigue" (a.k.a. "ego depletion") is judges' parole decisions being strongly influenced by time of day, and presumably blood sugar level.

27. In Thinking, Fast and Slow (Chapter 17), Daniel Kahneman provided this example to illustrate how regression to the mean tends to be misinterpreted in causal terms.

28. It is interesting to speculate whether behavioral economics helps account for an apparent bias toward hoarding data that one often encounters. The tendency of humans to feel losses more keenly than commensurate gains is known in the behavioral economics literature as "loss aversion." As the availability of computer storage power grows exponentially, perhaps there is a tendency to the loss avoidance strategy of storing as much data as possible to ward off the fear of future feelings of loss that would result from having thrown out something useful. This is not to say that organizations should never err on the side of keeping information with no obvious or immediate use; only that the process should be guided by cost/benefit estimates rather than a default strategy of keeping nearly everything.

29. H. A. Simon (1971), "Designing Organizations for an Information-Rich World," in Martin Greenberger, Computers, Communication, and the Public Interest, Baltimore, MD: The Johns Hopkins Press, pp. 40–41.

30. Popular books such as Moneyball and Ian Ayres's Super Crunchers have offered considerable anecdotal evidence for this, as have our previous Deloitte Review articles "Irrational Expectations" and "Beyond the Numbers." For corroborating academic evidence, see "Strength in Numbers: How Does Data-Driven Decisionmaking Affect Firm Performance?" by Erik Brynjolfsson, Lorin Hitt, and Heekyung Kim. The authors find that data-driven decision making was associated with 5 to 6 percent higher productivity in a sample of 179 publicly traded firms studied.