public - google i_o 2012- crunching big data with bigquery

Upload: pumpido2

Post on 03-Apr-2018

225 views

Category:

Documents


0 download

TRANSCRIPT

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    1/63

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    2/63

    Crunching Big Data

    with Big Query

    Ryan Boyd, Developer Advocate

    Jordan Tigani, Software Engineer

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    3/63

    How BIG is big?

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    4/63

    1 million rows?

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    5/63

    millionmillionmillion

    millionmillionmillion

    millionmillionmillion

    million

    10 million rows?

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    6/63

    1 million1 million1 million

    1 million1 million1 million

    1 million1 million1 million

    1 million

    1 million1 million1 million

    1 million1 million1 million

    1 million1 million1 million

    1 million

    1 million1 million1 million

    1 million1 million1 million

    1 million1 million1 million

    1 million

    1 1 1

    1 1 1

    1 1 1

    1

    100 million rows

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    7/63

    1 millionmillion 1 million

    1 million1 million

    1 million1 million1 million1 million1 million1 million

    1 million

    1 million1 million

    1 million1 million1 million1 million1 million1 million

    1 million

    1 million1 million

    1 million1 million1 million1 million1 million1 million

    1 million

    1 m1 m

    1 m1 m1 m1 m1 m1 m

    1 m

    1 million1 million

    1 million1 million

    1 million1 million1 million

    1 million

    1 million1 million

    1 million1 million1 million1 million1 million1 million

    1 million

    1 million1 million

    1 million1 million1 million1 million1 million1 million

    1 million

    11

    11111

    million

    1

    1

    millionmillionmillion

    million

    million

    1 million

    millionmillion

    million

    1 million1 million1 million

    1 million

    1 million1 million1 million

    1 million

    1 million1 million1 million

    1 million1 million1 million

    1 million1 million1 million

    1 million

    1 m1 m1 m

    1 m1 m1 m

    1 m1 m1 m

    1 m

    1 million1 million1 million1 million1 million1 million1 million1 million1 million

    1 million1 million1 million1 million1 million1 million1 million1 million1 million

    1 million1 million1 million1 million1 million1 million1 million1 million1 million

    500 million rows

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    8/63

    Big Data at Google

    60 hours

    100 million gigaby

    425 million users

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    9/63

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    10/63

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    11/63

    Google's internal technology:

    Dremel

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    12/63

    Big Data at Google - Finding top installed market a

    SELECT

    top(appId, 20) AS app,

    count(*) AS countFROM installlog.2012;

    ORDER BY

    count DESC

    Result in ~20 seconds!

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    13/63

    Big Data at Google - Finding slow servers

    SELECT

    count(*) AS count, source_machine AS machine

    FROM product.product_log.liveWHERE

    elapsed_time > 4000

    GROUP BY

    source_machine

    ORDER BY

    count DESC

    Result in ~20 seconds!

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    14/63

    BigQuery gives you this power

    Store data with reliability, redundancy andconsistency

    Go from data to meaning

    Quickly!

    At scale ...

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    15/63

    How are developers using it?

    Game and social media analytics

    Advertising campaign optimization

    Sensor data analysis

    Infrastructure monitoring

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    16/63

    Show the power

    Loading your data Running your queries Underlying architecture design

    Advanced queries

    Agenda

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    17/63

    Let's dive in!

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    18/63

    BigQuery UI

    bigquery.cloud.google.com

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    19/63

    Loading your Data

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    20/63

    Ingestion: Data format

    birth_record

    parent_id_mother

    parent_id_father

    plurality

    is_malerace

    weight

    parents

    id

    race

    age

    cigarette_usstate

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    21/63

    Ingestion: Data format

    birth_record

    mother_racemother_age

    mother_cigarette_use

    mother_state

    father_race

    father_age

    father_cigarette_usefather_state

    plurality

    is_male

    race

    weight

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    22/63

    Ingestion: Data format

    1969,1969,1,20,,AL,TRUE,1,7.813,AL,1,2

    1971,1971,5,7,,NY,FALSE,1,7.213,MA,5,72001,2001,12,5,,CA,TRUE,2,6.427,CA,12,

    CSV

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    23/63

    Running your Queries

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    24/63

    Java

    Python .NET PHP

    JavaScript

    Apps Script ... more ...

    Libraries

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    25/63

    It's REST

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    26/63

    BigQuery architecture

    Developing intuition about BigQuery

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    27/63

    Disk storage

    Relational Database Architecture: B-TreeRoot

    0-999

    Level 1

    100-199

    Level 1

    700-799

    Level 1

    400-499

    Level 2

    60-69

    Level 2

    410-419

    Level 2

    430-439

    Level 2

    750-759

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    28/63

    Disk storage

    Relational Database Architecture: Finding a ValueRoot

    0-999

    Level 1

    100-199

    Level 1

    700-799

    Level 1

    400-499

    Level 2

    60-69

    Level 2

    410-419

    Level 2

    430-439

    Level 2

    750-759

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    29/63

    If you do a table scan over a 1TB table,

    you're going to have a bad time.

    Anonymous

    16th century Italian Philosopher-Monk

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    30/63

    Reading 1 TB/ second from disk: 10k+ disks

    Processing 1 TB / sec: 5k processors

    Goal: Perform a 1 TB table scan in 1 secondParallelize Parallelize Parallelize!

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    31/63

    Data access: Column Store

    Record Oriented Storage Column Oriented S

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    32/63

    "Why not MapReduce?

    Anonymous

    Reddit User

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    33/63

    Distributed Storage (e.g. GFS)

    MapReduce... how does it work?Cont

    Worker 0 Worker 1 Worker 2 Worker 3

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    34/63

    Distributed Storage (e.g. GFS)

    MapReduceCont

    Worker 0 Worker 1 Worker 2 Worker 3

    1. Map!

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    35/63

    Distributed Storage (e.g. GFS)

    MapReduceCont

    Worker 0 Worker 1 Worker 2 Worker 3

    2. Reduce!

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    36/63

    Distributed Storage (e.g. GFS)

    MapReduceCont

    Worker 0 Worker 1 Worker 2 Worker 3

    3. Profit!*

    * Don't forget to shuffle Multiple passes may be req

    Void where prohibited

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    37/63

    Gray's third law for big data:Bring computations to the data, rather

    than data to the computations.

    Jim Gray

    Database Pioneer

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    38/63

    Distributed Storage (e.g. GFS)

    BigQuery Architecture: Computation treeMixer 0

    Mixer 1

    Shard 0-8

    Mixer 1

    Shard 17-24

    Mixer 1

    Shard 9-16

    Shard 0 Shard 10 Shard 12 Shard 20

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    39/63

    Distributed Storage (e.g. GFS)

    BigQuery Architecture: Finding a valueMixer 0

    Mixer 1

    Shard 0-8

    Mixer 1

    Shard 17-24

    Mixer 1

    Shard 9-16

    Shard 0 Shard 10 Shard 12 Shard 20

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    40/63

    SELECT COUNT(foo), MAX(foo), STDDEV(f

    FROM ...

    BigQuery SQL Example: Simple aggregates

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    41/63

    SELECT ... FROM ....

    WHERE REGEXP_MATCH(url, "\.com$")

    AND userCONTAINS 'test'

    BigQuery SQL Example: Complex Processing

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    42/63

    SELECT COUNT(*) FROM

    (SELECT foo ..... )GROUP BY foo

    BigQuery SQL Example: Nested SELECT

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    43/63

    BigQuery SQL Example: Small JOIN

    SELECT huge_table.fooFROM huge_table

    JOIN small_table

    ON small_table.foo = small_table.foo

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    44/63

    Distributed Storage (e.g. GFS)

    BigQuery Architecture: Small JoinMixer 0

    Mixer 1

    Shard 0-8

    Mixer 1

    Shard 17-24

    Shard 0 Shard 20

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    45/63

    SELECT foo, barFROM huge_table

    Where huge_table is very large and no filter is applied

    Fix with:

    ... LIMIT 100

    BigQuery SQL Example: Response too large

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    46/63

    SELECT ... FROM ... GROUP BYuser_id

    Where number of unique users is very large.

    Fix with:

    ... WHERE HASH(user_id) % 10 = 0

    BigQuery SQL Example: Internal response too larg

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    47/63

    SELECT user_id, COUNT(user_id) ...

    GROUP BY user_idORDER BY user_id DESC

    Where number of unique users is very large.

    Fix with:

    SELECT TOP(user_id, 20), count(user_id) ...

    BigQuery SQL Example: Internal response too larg

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    48/63

    Wikipedia:

    "GitHub is a web-based hosting service for softwaredevelopment projects ... GitHub is ... the most popularsource hosting site"

    Using GitHub timeline dataset

    Advanced Query Demo

    http://en.wikipedia.org/wiki/Shared_web_hosting_servicehttp://en.wikipedia.org/wiki/Shared_web_hosting_servicehttp://en.wikipedia.org/wiki/Shared_web_hosting_servicehttp://en.wikipedia.org/wiki/Shared_web_hosting_service
  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    49/63

    What is big data, anyway? BigQuery's Not MapReduce What's BigQuery good for? How to think about query execution

    Summary

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    50/63

    SELECT questions FROM audience

    SELECT 'Thank You!'

    FROM ryan, jordan

    http://developers.google.com/bigquery

    @ryguyrg http://profiles.google.com/ryan.boyd

    @tigani https://plus.google.com/115600841849663767233

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    51/63

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    52/63

    Titles are formatted as Open Sans with bold applied andfont size is set at 30pts

    Vertical position for title is .3 Vertical position for bullet text is 1.54

    Title capitalization is title case

    Subtitle capitalization is title case

    Titles and subtitles should never have a period at the end

    Presentation Bullet Slide Layout

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    53/63

    Titles are formatted as Open Sans with bold applied and

    font size is set at 30pts Vertical position for title is .3

    Vertical position for subtitle is 1.1

    Vertical position for bullet text is 2

    Title capitalization is title case

    Subtitle capitalization is title case

    Titles and subtitles should never have a period at the end

    Subtitle Placeholder

    Bullet Slide With Subtitle Placeholder

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    54/63

    Color Palette

    67

    135

    253

    244

    74

    63

    255

    209

    77

    13

    168

    97

    Flat Color Secondary GraysGradient

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    55/63

    Graphic Element Styles and Arrows

    Rounded Boxes

    HTML

    CodeBoxes

    Box Title

    Body Copy

    Goes Here

    Box Title

    Body Copy

    Goes Here

    Arrows

    Content Container Boxes

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    56/63

    Pie Chart ExampleSubtitle Placeholder

    Chart Title

    source: place source info here

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    57/63

    Column Chart ExampleSubtitle Placeholder

    source: place source info here

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    58/63

    Line Chart ExampleSubtitle Placeholder

    source: place source info here

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    59/63

    Table Option ASubtitle Placeholder

    Column 1 Column 2 Column 3 Col

    Row 1 placeholder placeholder placeholder plac

    Row 2 placeholder placeholder placeholder place

    Row 3 placeholder placeholder placeholder plac

    Row 4 placeholder placeholder placeholder plac

    Row 5 placeholder placeholder placeholder plac

    Row 6 placeholder placeholder placeholder plac

    Row 7 placeholder placeholder placeholder place

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    60/63

    Table Option BSubtitle Placeholder

    Header 1 placeholder placeholder placeholder

    Header 2 placeholder placeholder placeholder

    Header 3 placeholder placeholder placeholder

    Header 4 placeholder placeholder placeholder

    Header 5 placeholder placeholder placeholder

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    61/63

    Segue SlideSubtitle Placeholder

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    62/63

    This is an example ofquote text.

    Name

    Company

  • 7/28/2019 PUBLIC - Google I_O 2012- Crunching Big Data With BigQuery

    63/63

    Code Slide With Subtitle PlaceholderSubtitle Placeholder

    // Say hello world until the user starts questioning

    // the meaningfulness of their existence.function helloWorld(world) {

    for (vari = 42;--i >= 0;) {

    alert (Hello + String(world));

    }

    }

    p { color: pink }

    p { color: blue }

    u { color: umber }