Big Data Now: Current Perspectives from O'Reilly Radar

Upload: mike-skinner

Post on 05-Apr-2018


  • 7/31/2019 Big Data Now Current Perspectives From OReilly Radar Copy


    Big Data Now

    Beijing · Cambridge · Farnham · Köln · Sebastopol · Tokyo


    Big Data Now

    Printing History:


    Table of Contents

    Foreword

    1. Data Science and Data Tools

    2. Data Issues

    3. The Application of Data: Products and Processes

    4. The Business of Data


    Foreword


    CHAPTER 1

    Data Science and Data Tools

    What is data science?


    Flu trends


    Where data comes from


    1956 disk drive


    Working with data at scale


    Making data tell its story

    Data scientists


    Hiring trends for data science


    The SMAQ stack for big data


    MapReduce


    Hadoop MapReduce

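    // Needs java.io.IOException, java.util.StringTokenizer,
    // org.apache.hadoop.io.* and org.apache.hadoop.mapreduce.*

    public static class Map
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split the input line on whitespace and emit (word, 1) per token.
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the counts collected for this word and emit the total.
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }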
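The same word count can be traced without a cluster. What follows is a minimal in-memory sketch in Python, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, and the `shuffle` step only stands in for the grouping by key that the framework performs between the map and reduce phases.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    return [(token, 1) for token in line.split()]


def shuffle(pairs):
    """Group emitted values by key (assumed stand-in for the framework's shuffle)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Calling `word_count(["the quick fox", "the lazy dog"])` would count "the" twice and every other word once. Because the three steps are separate functions, each record passes through the same stages as in the Java listing.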


    Other implementations

    Storage


    Hadoop Distributed File System

    HBase, the Hadoop Database


    Hive

    Cassandra and Hypertable


    NoSQL database implementations of MapReduce


    Integration with SQL databases

    Integration with streaming data sources

    Commercial SMAQ solutions


    Query


    Pig

    input = LOAD 'input/sentences.txt' USING TextLoader();
    words = FOREACH input GENERATE FLATTEN(TOKENIZE($0));
    grouped = GROUP words BY $0;
    counts = FOREACH grouped GENERATE group, COUNT(words);
    ordered = ORDER counts BY $0;
    STORE ordered INTO 'output/wordCount' USING PigStorage();


    (defmapcatop split [sentence]
      (seq (.split sentence "\\s+")))

    (?<- (stdout) [?word ?count]
      (sentence ?s) (split ?s :> ?word)
      (c/count ?count))

    Search with Solr

    Conclusion


    Scraping, cleaning, and selling big data


    Data hand tools


    $ grep '599 [A-Z][A-Z]' rudx-log.txt | colrm 1 72 | head -2
    VR
    MO
    ...

    $ grep '599 [A-Z][A-Z]' rudx-log.txt | colrm 1 72 | sort |\
      uniq | head -2
    AD
    AL


    $ grep '599 [A-Z][A-Z]' rudx-log.txt | colrm 1 72 | sort | uniq | wc
    38 38 342

    $ grep '599 [A-Z][A-Z]' rudx-log.txt | awk '{print $2 " " $11}' |\
      sort | uniq
    14000 AD
    14000 AL
    14000 AN
    ...

    $ grep '599 [A-Z][A-Z]' rudx-log.txt | awk '{print $2 " " $11}' |\
      sort | uniq | grep 21000 | wc
    20 40 180

    $ grep '599 [A-Z][A-Z]' rudx-log.txt | awk '{print $2 " " $11}' |\
      sort | uniq | grep 14000 | wc
    26 52 234

    ...


    $ grep '599 [A-Z][A-Z]' `find . -name rudx-log.txt -print` |\
      awk '{print $2 " " $11}' | sort | uniq | grep 14000 | wc

    48 96 432

    ...

    ./2008/rudx-log.txt:QSO: 14000 CW 2008-03-15 1526 W1JQ 599 0054 UA6YW 599 AD
    ./2009/rudx-log.txt:QSO: 14000 CW 2009-03-21 1225 W1JQ 599 0015 RG3K 599 VR
    ...


    $ find . -name rudx-log.txt -print | xargs grep '599 [A-Z][A-Z]' |\
      awk '{print $2 " " $11}' | grep 14000 | sort | uniq | wc

    48 96 432
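For readers who would rather script this than chain shell tools, the pipeline above can be approximated in Python. This is a hedged sketch, not an exact equivalent: the function names are hypothetical, the field positions (2 and 11) are taken from the awk command, and the code compares the band field exactly, where `grep 14000` would match that string anywhere in the line. It also skips lines with fewer than 11 fields, where awk would print an empty second column.

```python
import re

# Same pattern the grep commands use: a "599" report followed by a two-letter code.
REPORT = re.compile(r"599 [A-Z][A-Z]")


def distinct_pairs(lines):
    """Return the sorted, de-duplicated (band, code) pairs (grep | awk | sort | uniq)."""
    pairs = set()
    for line in lines:
        if not REPORT.search(line):
            continue
        fields = line.split()
        if len(fields) >= 11:
            # awk's $2 and $11 are fields[1] and fields[10] with 0-based indexing.
            pairs.add((fields[1], fields[10]))
    return sorted(pairs)


def count_for_band(lines, band):
    """Count distinct codes seen on one band (the trailing grep BAND | wc step)."""
    return sum(1 for pair_band, _ in distinct_pairs(lines) if pair_band == band)
```

Given the lines of one or more log files, `count_for_band(lines, "14000")` plays the role of the first number that `wc` prints. A filename prefix added by grep across several files (as in `./2008/rudx-log.txt:QSO:`) contains no space, so the field positions stay the same.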


    $ find . -name rudx-log.txt -print | xargs grep '599 [A-Z][A-Z]' |\
      awk '{print $2 " " $11}' | pv | grep 14000 | sort | uniq | wc
    3.41kB 0:00:00 [ 20kB/s] [
    48 96 432


    Hadoop: What it is, how it works, and what it can do


    Four free data tools for journalists (and snoops)


    bit.ly


    The quiet rise of machine learning


    Where the semantic web stumbled, linked data will succeed


    Social data is an oracle waiting for a question


    The challenges of streaming real-time data


    There's no definition

    Time for the community to rally

    Why you can't really anonymize your data


    CHAPTER 3

    The Application of Data: Products and Processes

    How the Library of Congress is building the Twitter archive

    Data journalism and data tools

    Data journalism, data tools, and the newsroom stack

    The newsroom stack

    The data analysis path is built on curiosity, followed by action

    How data and analytics can improve education

    Data science is a pipeline between academic disciplines

    Big data and open source unlock genetic secrets

    Visualization deconstructed: Mapping Facebook's friendships

    Mapping Facebook's friendships

    Data science democratized

    CHAPTER 4

    The Business of Data

    Big data and the innovator's dilemma

    There's no such thing as big data

    Building data startups: Fast, big, and focused

    Setting the stage: The attack of the exponentials

    Leveraging the big data stack

    Fast data

    Big analytics

    Focused services

    Democratizing big data

    Data markets aren't coming: They're already here

    An iTunes model for data

    Data is a currency

    Big data: An opportunity in search of a metaphor

    Data and the human-machine connection
