How BigQuery broke my heart
Gabe Hamilton
Reporting Solutions Smackdown
We are evaluating replacements for SQL Server for our Reporting & Business Intelligence backend.
Many TBs of data.
The closer we stay to SQL, the less report migration we need to do.
We like saving money.
Solutions we've been testing
Redshift
BigQuery
CouchDB
MongoDB
Cassandra
Teradata
Oracle
Plus various changes to our design
Some of these are necessary for certain technologies.
Denormalization
Sharding strategies
Nested data
Tune our existing Star Schema and Tables
BigQuery is
A massively parallel datastore
Columnar
Queries are SQL Select statements
Uses a Tree structure to distribute across nodes
How many nodes?
10,000 nodes!
And what price?
3.5 cents / GB
Query cost is per GB of data in the columns processed.

Resource             Pricing
Interactive Queries  $0.035 per GB
Batch Queries        $0.02 per GB
Storage              $0.12 per GB per month
Which is great for our big queries
A gnarly query that scans 200 GB of data costs about $7.00 in BigQuery (200 GB x $0.035/GB).
If that query takes 2 hours to run on a $60/hr cluster of a competing technology, that's $120 per query.
It's a little more complicated because, in theory, several of those queries could run simultaneously on the competing cluster, spreading out its hourly cost.
Even if four queries share that cluster ($30 each), BigQuery is still about 4x cheaper, plus the speed improvement.
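As a sanity check, the arithmetic behind that comparison can be sketched in a few lines (the $60/hr cluster rate and the four-way concurrency are the scenario's assumptions, not measured numbers; at the stated $0.035/GB interactive rate, 200 GB works out to about $7):

```python
# Back-of-the-envelope cost comparison for the scenario above.
# All rates are illustrative, taken from the talk's stated pricing.

BIGQUERY_INTERACTIVE_PER_GB = 0.035  # $ per GB processed (interactive tier)

gb_processed = 200
bigquery_cost = gb_processed * BIGQUERY_INTERACTIVE_PER_GB  # about $7.00

cluster_rate_per_hour = 60   # hypothetical competing cluster
hours = 2
concurrent_queries = 4       # assume 4 queries share the cluster's cost
cluster_cost_per_query = cluster_rate_per_hour * hours / concurrent_queries  # $30.00

print(f"BigQuery: ${bigquery_cost:.2f}, cluster: ${cluster_cost_per_query:.2f}")
print(f"BigQuery is about {cluster_cost_per_query / bigquery_cost:.0f}x cheaper")
```

Even with generous concurrency on the competing cluster, the per-query gap stays around 4x.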
Example: GitHub data from the past year (3.5 GB table)
SELECT type, COUNT(*) AS num FROM [publicdata:samples.github_timeline] GROUP BY type ORDER BY num DESC;
Query complete (1.1s elapsed, 75.0 MB processed)

Event Type          num
PushEvent           2,686,723
CreateEvent         964,830
WatchEvent          581,029
IssueCommentEvent   507,724
GistEvent           366,643
IssuesEvent         305,479
ForkEvent           180,712
PullRequestEvent    173,204
FollowEvent         156,427
GollumEvent         104,808
Cost $0.0026
or about 4 for a penny
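That per-query cost follows directly from the pricing model: MB processed times the interactive rate (a sketch; 75.0 MB is the figure reported by the query above, and 1 GB is treated as 1000 MB for a round estimate):

```python
# Cost of the GitHub sample query: 75.0 MB processed at $0.035/GB interactive.
mb_processed = 75.0
interactive_rate_per_gb = 0.035  # $ per GB, stated interactive pricing

cost = (mb_processed / 1000) * interactive_rate_per_gb  # about a quarter of a cent
print(f"${cost:.4f}")
```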
It was love at first type.
But Then...
Reality
Uploaded our test dataset
Which is 250GB
Docs are good, tools are good.
Hurdle 1: only one join per query.
Ok, rewrite as ugly nested selects...
Result
Round 2
No problem, I had seen that joins were somewhat experimental.
Try the denormalized version of the data.
SELECT ProductId, StoreId, ProductSizeId, InventoryDate, AVG(InventoryQuantity) AS InventoryQuantity FROM BigDataTest.denorm GROUP EACH BY ProductId, StoreId, ProductSizeId, InventoryDate
The first error message helpfully says: try GROUP EACH BY.
Final Result
It's not you, it's me
The documentation had some semi-useful information:
Because the system is interactive, queries that produce a large number of groups might fail. The use of the TOP function instead of GROUP BY might solve the problem.
However, the BigQuery TOP function only operates on one column.
At this point I had jumped through enough hoops. I posted on Stack Overflow, the official support channel according to the docs, and got no response.
Epilogue
Simplifying my query down to two grouping columns did get it to run, with a LIMIT clause.
SELECT ProductId, StoreId, AVG(InventoryQuantity) AS InventoryQuantity FROM BigDataTest.denorm GROUP EACH BY ProductId, StoreId LIMIT 1000
Query complete (4.5s elapsed, 28.1 GB processed)
Without a limit it gives: Error: Response too large to return.
Perhaps there is still hope for me and BigQuery...
Me
Like this talk?
@gabehamilton
My twitter feed is just technical stuff.
or slideshare.net/gabehamilton