big data analytics with google bigquery, by javier ramirez, datawaki, at span conf

42
by javier ramirez @supercoco9 Big Data Analytics with Google BigQuery https://teowaki.com https://datawaki.com

Upload: javier-ramirez

Post on 19-Jun-2015

487 views

Category:

Software


0 download

DESCRIPTION

In this talk I explain why we decided to use BigQuery as our analytics backend in datawaki.com and teowaki.com I describe the architecture of BigQuery, what you can do with it, and how other people are using it Talk delivered at span conference (spanconf.io) in London, 2014

TRANSCRIPT

Page 1: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

by javier ramirez@supercoco9

Big Data Analyticswith

Google BigQuery

https://teowaki.com

https://datawaki.com

Page 2: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf
Page 3: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

INPUT /

OUTPUT

Big Data's #1 Enemy

Page 4: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

Read one terabyte ofdata inone second

javier ramirez @supercoco9 https://teowaki.com

Page 5: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures.

Ed Dumbill program chair for the O’Reilly Strata Conference

javier ramirez @supercoco9 https://teowaki.com

Page 6: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

bigdata is doing a fullscan to 330MM rows, matching them against a regexp, and getting the result (223MM rows) in just 5 seconds

javier ramirez @supercoco9 https://teowaki.com

Javier Ramirezimpresionable teowaki founder

Page 7: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

javier ramirez @supercoco9 https://teowaki.com nosqlmatters 2013

REST API +

AngularJS web as an API client

Page 8: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

1. non intrusive metrics2. keep the history3. interactive queries4. cheap5. extra ball: real time

javier ramirez @supercoco9 https://teowaki.com

Page 9: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

javier ramirez @supercoco9 https://teowaki.com

Page 10: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

Apache HadoopApache Cassandra

Apache SparkApache Storm

Amazon Redshift

javier ramirez @supercoco9 https://teowaki.com

Page 11: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

bigdata is cool but...

expensive cluster

hard to set up and monitor

not interactive enough

Page 12: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

Our choice:

Google BigQuery

Data analysis as a service

http://developers.google.com/bigquery

javier ramirez @supercoco9 https://teowaki.com

Page 13: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

Based on Dremel

Specifically designed for interactive queries over petabytes of real-time data

javier ramirez @supercoco9 https://teowaki.com

Page 14: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

• Analysis of crawled web documents.• Tracking install data for applications on Android Market.• Crash reporting for Google products.• OCR results from Google Books.• Spam analysis.• Debugging of map tiles on Google Maps.• Tablet migrations in managed Bigtable instances.• Results of tests run on Google’s distributed build system.• Disk I/O statistics for hundreds of thousands of disks.• Resource monitoring for jobs run in Google’s data centers.• Symbols and dependencies in Google’s codebase.

What Dremel is used for in Google

Page 15: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

INDEXES

Data Scientists's#1 Enemy

Page 16: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

javier ramirez @supercoco9 https://teowaki.com

in BigQuery everything is a full-scan*

*Over a ridiculously fast distributed filesystem.Dremel design goal: 1TB/sec. It was exceeded

BigQuery delivers ~ 50Gb/Sec.

Page 17: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

Columnarstorage

javier ramirez @supercoco9 https://teowaki.com

Page 18: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

Colossus filesystemDistributed/redundantParallel readsUltra fast network

javier ramirez @supercoco9 https://teowaki.com

Page 19: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

highly distributed execution using a tree

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

Page 20: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

Getting started...

Page 21: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

create dataset and tables

Page 22: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

loading data

You can feed flat CSV-like files or nested JSON objects

javier ramirez @supercoco9 https://teowaki.com

Page 23: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

javier ramirez @supercoco9 https://teowaki.com

bq cli

bq load --nosynchronous_mode --encoding UTF-8 --field_delimiter 'tab' --max_bad_records 100 --source_format CSV api.stats 20131014T11-42-05Z.gz

Page 24: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

web console screenshot

javier ramirez @supercoco9 https://teowaki.com

Page 25: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

javier ramirez @supercoco9 https://teowaki.com

it's just sql, plus...

analytical SQL functions.correlations.

window functions.views.

JSON fields.timestamped tables.

Page 26: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

Things you always wanted to try but were too scared to

javier ramirez @supercoco9 https://teowaki.com

select count(*) from publicdata:samples.wikipedia

where REGEXP_MATCH(title, "[0-9]*") AND wp_namespace = 0;

223,163,387Query complete (5.6s elapsed, 9.13 GB processed, Cost: 32¢)

Page 27: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf
Page 28: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

Global Database of Events, Language and Tone

quarter billion rows30 yearsupdated daily

http://gdeltproject.org/data.html#googlebigquery

Page 29: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

SELECT Year, Actor1Name, Actor2Name, Count FROM (SELECT Actor1Name, Actor2Name, Year, COUNT(*) Count, RANK() OVER(PARTITION BY YEAR ORDER BY Count DESC) rankFROM (SELECT Actor1Name, Actor2Name, Year FROM [gdelt-bq:full.events] WHERE Actor1Name < Actor2Name and Actor1CountryCode != '' and Actor2CountryCode != '' and Actor1CountryCode!=Actor2CountryCode), (SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, Year FROM [gdelt-bq:full.events] WHERE Actor1Name > Actor2Name and Actor1CountryCode != '' and Actor2CountryCode != '' and Actor1CountryCode!=Actor2CountryCode),WHERE Actor1Name IS NOT nullAND Actor2Name IS NOT nullGROUP EACH BY 1, 2, 3HAVING Count > 100)

WHERE rank=1ORDER BY Year

Page 30: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf
Page 31: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

javier ramirez @supercoco9 https://teowaki.com

our most active user

Page 32: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

javier ramirez @supercoco9 https://teowaki.com

10 request we should be caching

Page 33: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

javier ramirez @supercoco9 http://teowaki.com

5 most created resources

select uri, count(*) total from stats where method = 'POST' group by URI;

Page 34: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

javier ramirez @supercoco9 http://teowaki.com

...but

/users/javier/shouts/users/rgo/shouts/teams/javier-community/links/teams/nosqlmatters-cgn/links

Page 35: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

javier ramirez @supercoco9 http://teowaki.com

5 most created resources

Page 36: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

what is it being used for?

Page 37: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf
Page 38: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf
Page 39: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

Analysing weather information

Finding patterns in e-commerce

Match online/offline behaviour

Log analysys

Analysing inventory/booking data

...

Page 40: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

warning: BigQuery is not open source and not for free

$26 per stored TB

$5 per processed TB*the 1st TB processed every month is free of charge

javier ramirez @supercoco9 https://teowaki.com

Page 41: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf
Page 42: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

Find related links at

https://teowaki.com/teams/javier-community/link-categories/bigquery-talk

Thanks

javier ramirez@supercoco9

https://teowaki.com

https://datawaki.com