bigquery architecture

51

Upload: others

Post on 09-Feb-2022

11 views

Category:

Documents


0 download

TRANSCRIPT

Crunching Big Data with BigQuery

Ryan Boyd, Developer Advocatehttp://profiles.google.com/ryan.boyd@ryguyrg

XLDBTuesday, September 11th

2012

How BIG is big?

1 million rows?

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

10 million rows?

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

100 million rows?

1 million

1 million1 million 1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million

1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million

1 million

1 million1 million1 million

1 million1 million1 million1 million

1 million

1 million

1 million1 million1 million

1 million1 million1 million1 million

1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

500 million rows!

1 million

Big Data at Google

60 hours100 million gigabytes425 million users

Google's internal technology: Dremel

Big Data at Google - Finding top installed market apps

SELECT top(appId, 20) AS app, count(*) AS countFROM installlog.2012;ORDER BY count DESC

Result in ~20 seconds!

Big Data at Google - Finding slow servers

SELECT count(*) AS count, source_machine AS machineFROM product.product_log.liveWHERE elapsed_time > 4000GROUP BY source_machineORDER BY count DESC

Result in ~20 seconds!

BigQuery gives you this power

Store data with reliability, redundancy and consistency

Go from data to meaning

Quickly!

At scale ...

How are developers using it?

Game and social media analytics

Advertising campaign optimization

Sensor data analysis

Infrastructure monitoring

● Show the power● Running your queries● Preparing your data● Visualizing your data● Underlying architecture design● Latest and greatest features

Agenda

Let's dive in!

It's an API

Google Cloud Storage

Upload your Data

BigQuery

Your data is loaded

Google Cloud Storage

Load your data into BigQuery"jobReference":{ "projectId":"605902584318"}, "configuration":{ "load":{ "destinationTable":{ "projectId":"605902584318", "datasetId":"my_dataset", "tableId":"widget_sales"}, "sourceUris":[ "gs://widget-sales-data/2012080100.csv"], "schema":{ "fields":[{ "name":"widget", "type":"string"}, ...

POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs

Query Away!

"jobReference":{ "projectId":"605902584318"}, "query":"SELECT TOP(widget, 50), COUNT(*) AS sale_count FROM widget_sales", "maxResults":100, "apiVersion":"v2"}

POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs

● Java● Python● .NET● PHP● JavaScript● Apps Script● ... more ...

Libraries

Libraries - Example JavaScript query

var request = gapi.client.bigquery.jobs.query({ 'projectId': project_id, 'timeoutMs': '30000', 'query': 'SELECT state, AVG(mother_age) AS theav FROM [publicdata:samples.natality] WHERE year=2000 AND ever_born=1 GROUP BY state ORDER BY theav DESC;'});

request.execute(function(response) { console.log(response); $.each(response.result.rows, function(i, item) { ...

BigQuery UI

bigquery.cloud.google.com

Preparing your Data

Schema definition

birth_record

parent_id_motherparent_id_fatherpluralityis_maleraceweight

parents

idraceagecigarette_usestate

Schema definition

birth_record

mother_racemother_agemother_cigarette_usemother_statefather_racefather_agefather_cigarette_usefather_statepluralityis_maleraceweight

Schema definition - sharding

birth_record_2011

mother_racemother_agemother_cigarette_usemother_statefather_racefather_agefather_cigarette_usefather_statepluralityis_maleraceweight

birth_record_2012

mother_racemother_agemother_cigarette_usemother_statefather_racefather_agefather_cigarette_usefather_statepluralityis_maleraceweight

birth_record_2013

birth_record_2014

birth_record_2015

birth_record_2016

Data format

1969,1969,1,20,,AL,TRUE,1,7.813,AL,1,20,true1971,1971,5,7,,NY,FALSE,1,7.213,MA,5,7,true2001,2001,12,5,,CA,TRUE,2,6.427,CA,12,5,true

CSV

Tools to prepare your data

● App Engine MapReduce● Commercial ETL tools

● Pervasive● Informatica● Talend

● UNIX command-line

Visualizing your Data

Commercial visualization tools

Google Spreadsheets

Custom code and the Google Visualization API

BigQuery architecture

“ If you do a table scan over a 1TB table, you're going to have a bad time. ”

Anonymous16th century Italian Philosopher-Monk

● Reading 1 TB/ second from disk:● 10k+ disks

● Processing 1 TB / sec:● 5k processors

Goal: Perform a 1 TB table scan in 1 second

Parallelize Parallelize Parallelize!

Data access: Column Store

Record Oriented Storage Column Oriented Storage

Distributed Storage (e.g. GFS)

BigQuery Architecture

Mixer 0

Mixer 1Shard 0-8

Mixer 1Shard 17-24

Mixer 1Shard 9-16

Shard 0 Shard 10 Shard 12 Shard 24Shard 20

Running your Queries

SELECT COUNT(foo), MAX(foo), STDDEV(foo) FROM ...

BigQuery SQL Example: Simple aggregates

SELECT ... FROM ....WHERE REGEXP_MATCH(url, "\.com$") AND user CONTAINS 'test'

BigQuery SQL Example: Complex Processing

SELECT COUNT(*) FROM (SELECT foo ..... )GROUP BY foo

BigQuery SQL Example: Nested SELECT

BigQuery SQL Example: Small JOIN

SELECT huge_table.foo FROM huge_tableJOIN small_table ON small_table.foo = huge_table.foo

Distributed Storage (e.g. GFS)

BigQuery Architecture: Small Join

Mixer 0

Mixer 1Shard 0-8

Mixer 1Shard 17-24

Shard 0 Shard 24Shard 20

Other new features!

Batch queries!

● Don't need interactive queries for some jobs?● priority: "BATCH"

● API● Column-based datastore● Full table scans FAST● Aggregates● Commercial tool support● Use cases

That's it

SELECT questions FROM audience

SELECT 'Thank You!' FROM ryan

http://developers.google.com/bigquery

@ryguyrg http://profiles.google.com/ryan.boyd