big data with bigquery, presented at devoxxuk 2014 by javier ramirez from teowaki

39
@supercoco9 #devoxxBigQuery Big Data with Google BigQuery Javier Ramirez @supercoco9 https://teowaki.com

Upload: javier-ramirez

Post on 10-May-2015

1.115 views

Category:

Software


2 download

DESCRIPTION

Big data is amazing. You can get insights from your users, find interesting patterns and have lots of geek fun. Problem is big data usually means many servers, a complex set up, intensive monitoring and a steep learning curve. All those things cost money. If you don’t have the money, you are losing all the fun. In my talk I show you how you can use Google BigQuery to manage big data from your application using a hosted solution. And you can start with less than $1 per month.

TRANSCRIPT

Page 1: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#devoxxBigQuery

Big Data with Google BigQuery

Javier Ramirez@supercoco9https://teowaki.com

Page 2: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

Managing Big Data with BigQuery

Javier Ramirez

•Writing software since 1996

•Web dev. since 1999 (C++, JAVA, PHP, Ruby, JS...)

•Founder of https://teowaki.com

•Google Developer Expert on the Cloud Platform

Page 3: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@YourTwitterHandle@supercoco9#DevoxxBigquery

BIG

BIG

DAT

A

DAT

A

Page 4: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@YourTwitterHandle@supercoco9#DevoxxBigquery

BIG

BIG

SERVER

S

SERVER

S

Page 5: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@YourTwitterHandle@supercoco9#DevoxxBigquery

BIG

BIG

DEV

OPS

DEV

OPS

Page 6: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@YourTwitterHandle@supercoco9#DevoxxBigquery

BIG

BIG

MONEY

MONEY

Page 7: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

bigdata is cool but...

hard to set up and monitor

expensive cluster

not interactive enough

Page 8: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

bigdata is doing a fullscan to 330MM rows, matching them against a regexp, and getting the result (223MM rows) in just 5 seconds

Page 9: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

Google BigQuery

Data analysis as a service

http://developers.google.com/bigquery

Page 10: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

Based on “Dremel”

Specifically designed for interactive queries over

petabytes of real-time data

Page 11: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

Your only worries

•Load data

•Query the dataset

Page 12: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

loading data.

You just send the data in

text (or JSON) format

Page 13: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

up to 100K inserts per second

in stream mode

Page 14: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

It's just SQL

select name from USERS order by date;

select count(*) from users;

select max(date) from USERS;

select sum(total) from ORDERS group by user;

Page 15: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

Subselect and joins out of the box

SELECT Year, Actor1Name, Actor2Name, Count FROM (SELECT Actor1Name, Actor2Name, Year, COUNT(*) Count, RANK() OVER(PARTITION BY YEAR ORDER BY Count DESC) rankFROM

(SELECT Actor1Name, Actor2Name, Year FROM [gdelt-bq:full.events] WHERE Actor1Name < Actor2Name and Actor1CountryCode != '' and Actor2CountryCode != '' and Actor1CountryCode!=Actor2CountryCode), (SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, Year FROM [gdelt-bq:full.events] WHERE Actor1Name > Actor2Name and Actor1CountryCode != '' and Actor2CountryCode != '' and Actor1CountryCode!=Actor2CountryCode),

WHERE Actor1Name IS NOT nullAND Actor2Name IS NOT nullGROUP EACH BY 1, 2, 3HAVING Count > 100)

WHERE rank=1ORDER BY Year

http://gdeltproject.org/data.html#googlebigquery

Page 16: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

specific extensions for analytics

withinflattennest

stddev

topfirstlastnth

variance

var_popvar_samp

covar_popcovar_samp

quantiles

correlations

Page 17: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

Things you always wanted to try but were too scared to

select count(*) from publicdata:samples.wikipedia where REGEXP_MATCH(title, "[0-9]*") AND wp_namespace = 0;

223,163,387 Query complete (5.6s elapsed, 9.13 GB processed, Cost: 32¢)

Page 18: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

columnar storage

https://cookbook.experiencesaphana.com/crm/what-is-crm-on-hana/technology-innovation/row-vs-column-based/

Page 19: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

highly distributed execution using a tree

Page 20: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

web console screenshot

Page 21: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

country segmented traffic

Page 22: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigqueryjavier ramirez @supercoco9 https://teowaki.com

window functions

Page 23: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

our most active user

Page 24: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

Worldwide events in the last 36 years

SELECT Year, Actor1Name, Actor2Name, Count FROM (SELECT Actor1Name, Actor2Name, Year, COUNT(*) Count, RANK() OVER(PARTITION BY YEAR ORDER BY Count DESC) rankFROM

(SELECT Actor1Name, Actor2Name, Year FROM [gdelt-bq:full.events] WHERE Actor1Name < Actor2Name and Actor1CountryCode != '' and Actor2CountryCode != '' and Actor1CountryCode!=Actor2CountryCode), (SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, Year FROM [gdelt-bq:full.events] WHERE Actor1Name > Actor2Name and Actor1CountryCode != '' and Actor2CountryCode != '' and Actor1CountryCode!=Actor2CountryCode),

WHERE Actor1Name IS NOT nullAND Actor2Name IS NOT nullGROUP EACH BY 1, 2, 3HAVING Count > 100)

WHERE rank=1ORDER BY Year

http://gdeltproject.org/data.html#googlebigquery

Page 25: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki
Page 26: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki
Page 27: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

SELECT repository_name, repository_language, repository_description, COUNT(repository_name) as cnt,repository_urlFROM github.timelineWHERE type="WatchEvent"AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC("#{yesterday} 20:00:00")AND repository_url IN (

SELECT repository_urlFROM github.timelineWHERE type="CreateEvent"AND PARSE_UTC_USEC(repository_created_at) >= PARSE_UTC_USEC('#{yesterday}

20:00:00')AND repository_fork = "false"AND payload_ref_type = "repository"GROUP BY repository_url

)GROUP BY repository_name, repository_language, repository_description, repository_urlHAVING cnt >= 5ORDER BY cnt DESCLIMIT 25

Page 28: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

Page 29: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki
Page 30: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki
Page 31: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

Automation with Apps Script

●Read from BigQuery

●Create a spreadsheet on Drive

●E-mail it everyday as a PDF

https://developers.google.com/apps-script/

Page 32: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki
Page 33: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki
Page 34: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki
Page 35: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

bigquery pricing

$26 per stored TB1000000 rows => $0.00416 / month

£0.00243 / month

$5 per processed TB1 full scan = 160 MB

1 count = 0 MB1 full scan over 1 column = 5.4 MB100 GB => $0.05 / month £0.03

AppsScripts is for free

Page 36: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

£0.054307 / month*

per 1MM rows

*the 1st 1TB every month is free of charge**assumming your rows have web server logs-like info

price per month

Page 37: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

ig

Page 38: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@YourTwitterHandle#DVXFR14{session hashtag} @supercoco9#devoxxBigquery

THAN

KS!

Javier Ramirez@supercoco9https://teowaki.com

Related links at:

https://teowaki.com/teams/javier-community/link-categories/bigquery-talk

Page 39: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

Thanks / Creative Commons

•Presentation Template — Guillaume LaForge

•The Queen — A prestigious heritage with some inspiration from The Sex Pistols and funny Devoxxians

•Girl with a Balloon — Banksy

•Tube — Michael Keen