google bigquery
Post on 07-Jan-2017
1.979 Views
Preview:
TRANSCRIPT
Google BigQuery
About myselfMatthias Feys
work @Datatonic: - big data (with Google Cloud) - machine learning - data visualizations (Tableau/Spotfire)
Google Qualified Cloud Developer
contact:
- @FsMatt- matthias@datatonic.com
About DatatonicDatatonic is a team of data science experts that help corporations unleash the power of data. They use Google Cloud Platform, data visualisation technologies (like Tableau or Spotfire) and machine learning to build breakthrough solutions. Either as expert advisors, or built as a fully managed solution with support end-to-end (A3S).
Some references:
@teamdatatonic
● What is BigQuery?● How does it scale/work?● How does it compare to:
- NoSQL datastores- MapReduce
● Demo● Pricing Model● Best Practices
This talk
What is BigQuery?
“BigQuery is a fully-managed and cloud-based interactive query service for massive datasets.”
It’s the externalization of Dremel, one of
Google’s core technologies
What is BigQuery? (2)BigQuery Service is available via:
● Web UI (bigquery.cloud.google.com)● console (gcloud)● API (+ client libraries)● external tools (Tableau, Excel, …)● ODBC connector
How Does it Scale?
Dremel Architecture
Data Model/Storage:- Columnar Storage- Nested/Repeated Fields- No Index! -> Single Full Table Scan (from disk)
Query Execution:- Tree Architecture- Using tens of thousands machines over fast Google network (+1Petabit/s)
Columnar Storage
● Traffic minimization:○ only read selected
columns
● Higher Compression Ratio:○ Similar values in the
same column○ From 1:3 → 1:10
Tree Architecture- root server: ->receives query + reads table metadata ->rewrites the query(s) ->sends queries to the next level <-returns final query results
- intermediate servers: ->(similar steps) <-parallel partial aggregation
- leaf servers: ->actually scan (parts) of the table <-send data to intermediate servers
NoSQL Datastore vs. BigQuery?
NoSQL Datastore
● Index based (expected queries)
● Read-write
BigQuery
● Non-index based (ad hoc queries)
● Read-only (append-only)
MapReduce vs. BigQuery?
MapReduce
● High latency● Flexible (complex)
batch processing● Unstructured data
BigQuery
● Low latency ● SQL-like queries
● Structured data
Demo’s
Pricing Model
Category Price Note
Storage Cost $0.020 per GB, per month
Query Cost $5 per TB 1st TB per month is free
Best Practices
Denormalize / Pre-Join Where Possible● Best performance● Only pay for the columns you need● Nested/repeated fields!
Relational Database Design Denormalized Nested/Repeated (JSON)
Table Sharding● You pay for what you read
→ Read less, pay less● Table wildcards allow for easy reading over multiple tables
https://cloud.google.com/bigquery/query-reference#tablewildcardfunctions
Optimize for Query vs. Storage CostsCommon Queries?
- Materialized views(save intermediate results in tables)with pre-aggregated data:
→ faster + cheaper queries
- Store data in multiple tables:- table for daily data- table for weekly data- table for monthly data
Narrow the Table ScansYou only pay for the columns you read
Don’t use “SELECT *” !!!
Table DecoratorsOnly way to avoid doing full table scans!
Allows undeleting tables
options:
● snapshot decorator + range decorator● relative value + absolute values
https://cloud.google.com/bigquery/table-decorators
https://cloud.google.com/bigquery/table-decorators
Query optimizationsQuery Plan
https://cloud.google.com/bigquery/query-plan-explanation
Big Data Reference Architecture
Questions?You can reach me at:
- mail: matthias@datatonic.com- Twitter: @FsMatt
top related