Google BigQuery 101 & What’s New

TRANSCRIPT

Page 1: Google BigQuery 101 & What’s New


Google BigQuery 101 & What’s New
Vadim Solovey - CTO, DoIT International
Google Cloud Developer Expert | Authorized Trainer
[email protected]

Page 2: Google BigQuery 101 & What’s New


About me..

Vadim Solovey - CTO, DoIT International
Google Cloud Developer Expert | AWS Solutions [email protected]

Page 3: Google BigQuery 101 & What’s New


Agenda

1. Google BigQuery 101

2. Partitioned Tables

3. Standard SQL & New DML Statements

4. New Formats

5. Cost Optimization

6. Q & A

Page 4: Google BigQuery 101 & What’s New


BigQuery 101

Google’s Highly Distributed Columnar Database optimized for Analytics

● Fully managed NoOps service

● Multi-petabyte scale & zero sizing required

● Ingestion + Analytics + Storage + API

● No indexes, only full table scans (!)

● Pre-integrated with other Google Cloud services:

○ Dataproc (Hadoop/Spark)

○ Dataflow (ETL for streaming data)

○ Google Cloud Storage
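To make the full-scan model concrete, here is a minimal sketch of a query against one of the public sample tables (table and column names come from the bigquery-public-data.samples dataset; any project with BigQuery enabled can run it):

#standardSQL
-- Aggregates word counts per corpus; BigQuery scans only the referenced columns.
SELECT corpus, SUM(word_count) AS total_words
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus
ORDER BY total_words DESC
LIMIT 5;

Because there are no indexes, the bytes billed depend only on the columns read (corpus and word_count here), not on the LIMIT or on how selective the filter is.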

Page 5: Google BigQuery 101 & What’s New


BigQuery 101

Continued...

● Supports nested and repeated fields/columns

● Google’s SQL Dialect

● Query results are cached for up to 24 hours (no charge)

● Charged for storage ($10-$20 per TB/month) and for data scans ($5/TB)

○ No idle costs

○ Highly cost optimizable
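As a sketch of how repeated (ARRAY) columns look in queries, the standard SQL dialect flattens them with UNNEST (the inline orders data below is made up purely for illustration):

#standardSQL
-- A tiny inline table with a repeated 'tags' column, flattened via UNNEST.
WITH orders AS (
  SELECT 1 AS order_id, ['books', 'music'] AS tags UNION ALL
  SELECT 2, ['games']
)
SELECT order_id, tag
FROM orders, UNNEST(tags) AS tag;

In legacy SQL the equivalent flattening is done with FLATTEN(); nested RECORD fields are addressed with dotted paths in both dialects.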

Page 6: Google BigQuery 101 & What’s New


Based on Dremel

BigQuery in 60 Seconds

[Diagram: Dremel serving tree — Mixer 0 at the root fans out to Mixer 1 nodes, which fan out to Leaf nodes reading columnar data from the Google File System (GFS)]

● Long Lived Shared Tree

● Mixer = Master & Reducer

● Leaf = Mapper

● Partial Reduction

● Diskless Data Flow

Columnar Storage:

● Execution Independent

● Reduces Disk Time

Page 7: Google BigQuery 101 & What’s New


Demo

Page 8: Google BigQuery 101 & What’s New


What’s New

New features:

● Table Partitions

● Insert/Update/Delete DML

● Standard ANSI SQL 2011

● Identity and Access Management

● Stackdriver for Monitoring

● New data formats for import/export

Page 9: Google BigQuery 101 & What’s New


Table Partitions

A new way to shard data to minimize the amount of data scanned by a query:

● Integrated with Streaming API for easy partition creation and update

● _PARTITIONTIME pseudo-column

● The current release supports partitioning by DAY

Creating a partitioned table (using the CLI):
● bq mk --time_partitioning_type=DAY mydataset.table1
● bq mk --time_partitioning_type=DAY --time_partitioning_expiration=259200 mydataset.table2

Accessing partitioned data:
● Query all partitions: SELECT * FROM mydataset.table
● Query a specific partition: SELECT * FROM mydataset.table$20161109
● Query a range: SELECT * FROM mydataset.table WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01') AND TIMESTAMP('2016-01-02')
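A specific day’s partition can also be targeted directly when loading, using the same $YYYYMMDD decorator; a sketch, where the Cloud Storage path is a placeholder and mydataset.table1 is the partitioned table created above:

bq load --source_format=CSV 'mydataset.table1$20160101' gs://my-bucket/data_20160101.csv

Filtering on _PARTITIONTIME in the WHERE clause, as in the range query above, is what prunes partitions and therefore reduces the bytes scanned and billed.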

Page 10: Google BigQuery 101 & What’s New


Insert, Update & Delete DML

BigQuery is not append-only anymore ;-)

Data Manipulation Language (DML) supports these statements:
● INSERT
● UPDATE
● DELETE

Every statement is an implicit transaction; there are no multi-statement transactions yet.

Quotas:
● Maximum UPDATE/DELETE statements per day per table: 48
● Maximum UPDATE/DELETE statements per day per project: 500
● Maximum INSERT statements per day per table: 1,000
● Maximum INSERT statements per day per project: 10,000
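A minimal sketch of the three statements in the standard SQL dialect (mydataset.inventory and its columns are hypothetical; each statement is submitted as its own query job, since there is no multi-statement scripting):

#standardSQL
-- DML requires the standard SQL dialect; UPDATE and DELETE must include a WHERE clause.
INSERT INTO mydataset.inventory (product, quantity) VALUES ('top load washer', 10);
UPDATE mydataset.inventory SET quantity = quantity - 1 WHERE product = 'top load washer';
DELETE FROM mydataset.inventory WHERE quantity = 0;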

Page 11: Google BigQuery 101 & What’s New


Standard SQL

Full ANSI SQL 2011

● With extensions to support nested and repeated fields

● ‘Legacy SQL’ is still supported

Set the desired dialect using a prefix, e.g.:
● #legacySQL or #standardSQL

#standardSQL
SELECT weight_pounds, state, year, gestation_weeks
FROM `bigquery-public-data.samples.natality`
ORDER BY weight_pounds DESC
LIMIT 10;

Page 12: Google BigQuery 101 & What’s New


Import/Export Formats

Data can be imported from (and exported to) the following formats:

● CSV files

● JSON

● AVRO

● PARQUET
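A sketch of how the format is chosen with the CLI (bucket and table names are placeholders; Avro carries its own schema, so no schema needs to be supplied on load):

bq load --source_format=AVRO mydataset.events gs://my-bucket/events.avro
bq extract --destination_format=NEWLINE_DELIMITED_JSON mydataset.events gs://my-bucket/events.json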

Page 13: Google BigQuery 101 & What’s New


Cost Optimization Tips

Some query optimization strategies:

● Use CONTAINS() instead of REGEXP_MATCH() where possible

● Sometimes a sample of the data is enough. Use the HASH() function to sample the data.

● Use JSON_EXTRACT() if you have raw, unstructured JSON in your data

● Avoid non-deterministic functions such as NOW() to improve cache hits

● Don’t query the table you stream data into (its cache is invalidated immediately)

● Keep query results < 128 MB, otherwise they won’t be cached either

● Use the __TABLES__ & __DATASET__ meta-tables for housekeeping tasks
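Two of those tips as legacy SQL sketches (the dataset, table, and column names are placeholders):

-- ~10% sample of a table using HASH() on a key column
SELECT user_id, event
FROM [mydataset.events]
WHERE ABS(HASH(user_id)) % 10 == 0

-- Per-table storage from the __TABLES__ meta-table, for housekeeping
SELECT table_id, row_count, size_bytes / POW(1024, 3) AS size_gb
FROM [mydataset.__TABLES__]
ORDER BY size_gb DESC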

Page 14: Google BigQuery 101 & What’s New


Are you paying too much?

BigQuery is a columnar datastore, and maximum performance is achieved on denormalized data sets:

● Pre-filter with a destination table when running many similar queries (sharing a WHERE clause)

● Use static tables to optimize BigQuery’s cache
○ If streaming/uploading frequently, create daily/hourly ‘snapshots’ and query them instead of the primary table

● Always prefer storage over compute!

● Set TableExpiration on datasets/partitions for automatic data lifecycle management

● Fetch only the required columns in your SELECT clause

● Use dryRun & EXPLAIN to find the most cost-efficient query (see the sketch below)

● Set Cost Controls to cap your BigQuery spending
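A dry run estimates a query’s cost before it executes; a minimal sketch using the CLI (the query itself is just an example):

bq query --dry_run --use_legacy_sql=false 'SELECT corpus, SUM(word_count) FROM `bigquery-public-data.samples.shakespeare` GROUP BY corpus'

The dry run reports the bytes the query would process without running it (and without charge); multiplying that by the $5/TB scan price gives the expected cost of the real run.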