how bigquery broke my heart

18
how BigQuery broke my heart Gabe Hamilton

Upload: gabehamilton

Post on 15-Jan-2015

2.592 views

Category:

Technology


1 download

DESCRIPTION

BigQuery is Google's columnar, massively parallel data querying solution. This talk explores using it as an ad-hoc reporting solution and the limitations present in May 2013.

TRANSCRIPT

Page 1: How BigQuery broke my heart

how

BigQuery broke my heart

Gabe Hamilton

Page 2: How BigQuery broke my heart

Reporting Solutions Smackdown

We are evaluating replacements for SQL Server for our Reporting & Business Intelligence backend.

Many TBs of data.

Closer to SQL the less report migration we need to do.

We like saving money.

Page 3: How BigQuery broke my heart

Solutions we've been testingRedshift

BigQuery

CouchDB

MongoDB

Cassandra

TerraData

Oracle

Page 4: How BigQuery broke my heart

Plus various changes to our design

Some of these are necessary for certain technologies.

Denormalization

Sharding strategies

Nested data

Tune our existing Star Schema and Tables

Page 5: How BigQuery broke my heart

BigQuery is

A massively parallel datastore

Columnar

Queries are SQL Select statements

Uses a Tree structure to distribute across nodes

Page 6: How BigQuery broke my heart

How many nodes?

10,000 nodes!

Page 7: How BigQuery broke my heart

And what price?

3.5 cents /GBResourcePricing

Query cost is per GB in the columns processed

Interactive Queries $0.035

Batch Queries $0.02

Storage $0.12 (per GB/month)

Page 8: How BigQuery broke my heart

Which is great for our big queries

A gnarly query that looks at 200GB of data costs $7.50 in BigQuery.

If that takes 2 hours to run on a $60/hr cluster of a competing technology...

It's a little more complicated because in theory several of those queries could run simultaneously on the competing tech.

Still, that's 4 X cheaper plus the speed improvement.

Page 9: How BigQuery broke my heart

Example: Github data from past year3.5 GB Table

SELECT type, count(*) as num FROM [publicdata:samples.github_timeline] group by type order by num desc;

Query complete (1.1s elapsed, 75.0 MB processed)Event Type numPushEvent 2,686,723CreateEvent 964,830WatchEvent 581,029IssueCommentEvent 507,724GistEvent 366,643IssuesEvent 305,479ForkEvent 180,712PullRequestEvent 173,204FollowEvent 156,427GollumEvent 104,808

Cost $0.0026

or 5 for a penny

Page 10: How BigQuery broke my heart

It was love at first type.

Page 11: How BigQuery broke my heart

But Then...

Reality

Page 12: How BigQuery broke my heart

Uploaded our test dataset

Which is 250GB

Docs are good, tools are good.

Hurdle 1: only one join per query.

Ok, rewrite as ugly nested selects...

Page 13: How BigQuery broke my heart

Result

Page 14: How BigQuery broke my heart

Round 2

No problem, I had seen that joins were somewhat experimental.

Try the denormalized version of the data.

SELECT ProductId, StoreId, ProductSizeId, InventoryDate, avg(InventoryQuantity) as InventoryQuantity FROM BigDataTest.denorm GROUP EACH BY ProductId, StoreId, ProductSizeId, InventoryDate

1st error message helpfully says, try GROUP EACH BY

Page 15: How BigQuery broke my heart

Final Result

Page 16: How BigQuery broke my heart

It's not you, it's me The documentation had some semi-useful information:

Because the system is interactive, queries that produce a large number of groups might fail. The use of the TOP function instead of GROUP BY might solve the problem.

However, the BigQuery TOP function only operates on one column.

At this point I had jumped through enough hoops. I posted on Stack Overflow, the official support channel according to the docs, and have gotten no response.

Page 17: How BigQuery broke my heart

Epilogue

Simplifying my query down to two grouping columns did cause it to run with a limit statement.

SELECT ProductId, StoreId, avg(InventoryQuantity) as InventoryQuantity FROM BigDataTest.denorm GROUP each BY ProductId, StoreId Limit 1000Query complete (4.5s elapsed, 28.1 GB processed)

Without a limit it gives Error: Response too large to return.Perhaps there is still hope for me and BigQuery...

Page 18: How BigQuery broke my heart

Me

Like this talk?

@gabehamilton

My twitter feed is just technical stuff.

or slideshare.net/gabehamilton