
Page 1: You might be paying too much for BigQuery

You might be paying too much for BigQuery

Ryuji Tamagawa @ Osaka, Japan

Page 2: You might be paying too much for BigQuery

Agenda

about me

about BigQuery

Basics

Advanced

Tips & Tricks

Page 3: You might be paying too much for BigQuery

Agenda

about me

about BigQuery

Basics

Advanced

Tips & Tricks

What you pay for when using BigQuery

How BigQuery runs your queries

Page 4: You might be paying too much for BigQuery

Agenda

about me

about BigQuery

Basics

Advanced

Tips & Tricks

Selecting columns

Table decorators

Dividing tables

How query cache works

Page 5: You might be paying too much for BigQuery

Agenda

about me

about BigQuery

Basics

Advanced

Tips & Tricks

CPUs & Network are FOR FREE

Subqueries optimized

Repeated Fields

Page 6: You might be paying too much for BigQuery

About me

Software engineer working for an ISV, from architecture design to troubleshooting in the field

Translator working with O’Reilly Japan

‘Google BigQuery Analytics’ is the 25th book I’ve translated

Active in GCPUG, especially in #bq_sushi

A bed for 6 cats

Page 7: You might be paying too much for BigQuery
Page 8: You might be paying too much for BigQuery

About BigQuery (for those who don’t know yet)

Fully managed structured data store, queryable with SQL

Very easy to use

Fast, in that there is almost no slowdown as your data grows big

Cost-effective

Built on Google’s infrastructure components

Page 9: You might be paying too much for BigQuery

About BigQuery (for those who don’t know yet)

Basic operations are available within the Web UI

You can do a ‘dry run’ from the Web UI to check the amount of data that will be scanned

You can use the command line interface (bq) to integrate BigQuery into your workflow

APIs are provided for Python and Java

Page 10: You might be paying too much for BigQuery

BigQuery is for analytics

Essentially, the data model is the same as in relational databases, with some extensions

BigQuery is for analytics, not for transaction processing

You can insert rows (batch or streaming), but cannot update or delete them.

There are no indexes - tables are always read by full scan

You can insert rows from GCS or via HTTP in CSV or JSON format.

Page 11: You might be paying too much for BigQuery

Basics

You might be paying too much for BigQuery

Page 12: You might be paying too much for BigQuery

What you pay for

Storage - $0.020 per GB / month

Queries - $5 per TB processed (scanned)

Streaming inserts - $0.01 per 100,000 rows until July 20, 2015. After July 20, 2015, $0.01 per 200 MB, with individual rows calculated using a 1 KB minimum size.

What matters is the storage

Page 13: You might be paying too much for BigQuery

A simple example

Load 1 TB of data into a new table every day; keep each table for a month

Query the daily data 5 times every day for aggregation

For storage: 1 TB * 30 tables = 30,000 GB → $0.020 * 30,000 = $600 / month

For queries: 1 TB * 5 queries * 30 days = 150 TB → $5 * 150 = $750 / month
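A quick sanity check of this arithmetic in plain Python, using the prices quoted on the previous slide:

```python
# Prices from the pricing slide: storage $0.020/GB/month, queries $5/TB.
STORAGE_USD_PER_GB_MONTH = 0.020
QUERY_USD_PER_TB = 5.0

daily_table_tb = 1     # one 1 TB table loaded per day
tables_kept = 30       # each daily table is kept for a month
queries_per_day = 5    # each query scans the full 1 TB daily table

storage = STORAGE_USD_PER_GB_MONTH * daily_table_tb * 1000 * tables_kept
queries = QUERY_USD_PER_TB * daily_table_tb * queries_per_day * 30

print(f"storage: ${storage:.0f}/month")  # storage: $600/month
print(f"queries: ${queries:.0f}/month")  # queries: $750/month
```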

Page 14: You might be paying too much for BigQuery

How your data is stored

Your data is stored

1. on thousands of disks (depending on its size)

2. in columnar format (ColumnIO or something similar)

3. compressed (however, the cost is based on the uncompressed size)
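Because you are billed for the uncompressed size of just the columns you read, you can estimate a query's cost from the schema alone. A minimal sketch, assuming BigQuery's documented per-type data sizes (numeric and timestamp types 8 bytes, BOOLEAN 1 byte, STRING 2 bytes plus its UTF-8 length); the schema and row count here are made up:

```python
# Billed bytes per value (uncompressed), per BigQuery's data size rules.
FIXED_BYTES = {"INTEGER": 8, "FLOAT": 8, "TIMESTAMP": 8, "BOOLEAN": 1}

def estimated_scan_tb(rows, columns, avg_string_len=20):
    """Estimate billed terabytes for scanning the given columns."""
    per_row = sum(
        2 + avg_string_len if col_type == "STRING" else FIXED_BYTES[col_type]
        for col_type in columns.values()
    )
    return rows * per_row / 1e12

# Hypothetical 10-billion-row log table.
schema = {"user_id": "INTEGER", "ts": "TIMESTAMP",
          "url": "STRING", "latency_ms": "FLOAT"}
rows = 10_000_000_000

print(estimated_scan_tb(rows, schema))                  # SELECT *   -> 0.46 TB
print(estimated_scan_tb(rows, {"user_id": "INTEGER"}))  # one column -> 0.08 TB
```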

Page 15: You might be paying too much for BigQuery

How BigQuery runs your query

Requested rows are read from DFS and sent to compute nodes

Compute nodes (there could be thousands) form a processing tree on the fly

Results are written back to DFS as a table (anonymous or named)

[Figure: data flows from the distributed file storage layer (tables) through a tree of compute nodes to the results]

Page 16: You might be paying too much for BigQuery

How BigQuery runs your query

When doing a JOIN between large tables or a GROUP BY over a large dataset, keys need to be hashed, and the associated data is sent to a node chosen by the hash value, for in-memory joining or grouping (see the sketch below).
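The idea is ordinary hash partitioning. A toy sketch of the technique in Python - an illustration only, not Google's implementation:

```python
# Route each row to a compute node by hashing its JOIN/GROUP BY key:
# all rows sharing a key meet on the same node, so that node can join
# or group them in memory. hash() is stable within a single run.
from collections import defaultdict

def shuffle(rows, key, num_nodes):
    nodes = defaultdict(list)
    for row in rows:
        nodes[hash(row[key]) % num_nodes].append(row)
    return nodes

rows = [{"user": "alice", "n": 1},
        {"user": "bob",   "n": 2},
        {"user": "alice", "n": 3}]

for node_id, part in shuffle(rows, "user", num_nodes=4).items():
    print(node_id, part)  # both 'alice' rows land on the same node
```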

[Figure: a ‘Shuffle’ stage between two layers of compute nodes, redistributing data by key hash on its way from the distributed file storage layer to the results]

Page 17: You might be paying too much for BigQuery

Advanced

You might be paying too much for BigQuery

Page 18: You might be paying too much for BigQuery

Narrowing your scan is the key

BigQuery does not have indexes - it always does a full scan

BigQuery uses columnar storage, so selecting only the columns you need lowers the cost

[Figure: a table grid with columns C1-C4 and rows R1-R9]

Page 19: You might be paying too much for BigQuery

Narrowing your scan is the key

SELECT C1, C2, C3, C4 FROM t

You pay for every scanned cell (red-filled in the figure)

[Figure: with SELECT C1, C2, C3, C4, every cell of every row is scanned]

Page 20: You might be paying too much for BigQuery

Narrowing your scan is the key

SELECT C1, C3 FROM t

You’ll pay only for C1 & C3

[Figure: only the cells in columns C1 and C3 are scanned]

You shouldn’t ‘SELECT *’ unintentionally

Page 21: You might be paying too much for BigQuery

Narrowing your scan is the key

BigQuery’s tables can have virtually any number of rows

Watch out: all of those rows will be scanned, no matter what WHERE clause you use in your queries

There are 2 ways to work around this:

table decorators

dividing tables

[Figure: a table grid whose rows run from R1 up to R99999999994 - every one of them is scanned]

Page 22: You might be paying too much for BigQuery

Narrowing your scan is the key

Snapshot decorators:

limit your scan to a snapshot of the table at a given time

SELECT … FROM t@1430665200000

Time-range decorators:

limit your scan to a given time range

SELECT … FROM t@-1430751600000

The time you pass must be within the last 7 days

[Figure: table rows tagged with the date they were added: 4/1, 5/3, 5/5, and 5/8]
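Decorator timestamps are milliseconds since the Unix epoch. A small helper for composing them - a sketch, with a hypothetical table name mydataset.t:

```python
from datetime import datetime, timedelta, timezone

def ms(dt):
    """Milliseconds since the Unix epoch, the unit table decorators use."""
    return int(dt.timestamp() * 1000)

now = datetime.now(timezone.utc)
day_ago = now - timedelta(days=1)  # must be within the last 7 days

snapshot_ref = f"mydataset.t@{ms(day_ago)}"         # the table as of 24h ago
range_ref = f"mydataset.t@{ms(day_ago)}-{ms(now)}"  # rows added in that range

print(snapshot_ref)
print(range_ref)
```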

Page 23: You might be paying too much for BigQuery

Table arrangement (estimated)

A batch insert creates an inserted ‘block’

Recently inserted blocks (within the last 7 days) are kept separate from the table’s ‘main’ block

Blocks older than 7 days are merged into the table’s ‘main’ block

Streaming-inserted rows are not stored in blocks but in Bigtable

[Figure: table rows tagged with the date they were added: 4/1, 5/3, 5/5, and 5/8]

This is my estimation: as far as I know, Google hasn’t officially said anything about this.

[Figure: as of 2015/5/8 - a ‘Main’ block plus separate blocks for 5/3, 5/5, and 5/8]

Page 24: You might be paying too much for BigQuery

Table arrangement (estimated)

A batch insert creates an inserted ‘block’

Recently inserted blocks (within the last 7 days) are kept separate from the table’s ‘main’ block

Blocks older than 7 days are merged into the table’s ‘main’ block

Streaming-inserted rows are not stored in blocks but in Bigtable

[Figure: as of 2015/5/11 - the 5/3 block has been merged into the ‘Main’ block; the blocks for 5/5 and 5/8 remain separate]

If you focus on the last 7 days, decorators are very useful for saving costs

Page 25: You might be paying too much for BigQuery

Narrowing your scan is the key

Tables are often split by date in BigQuery

You can easily union them within the FROM clause, separated by commas (a BQ-specific notation)

The TABLE_DATE_RANGE function is useful, e.g.:

SELECT … FROM (TABLE_DATE_RANGE(sample.T, TIMESTAMP(‘2015-05-01’), TIMESTAMP(‘2015-05-10’)))

[Figure: daily tables T20150401, T20150501, and T20150510, each holding rows timestamped through its day]
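If you generate such queries from code, a small helper keeps the date arithmetic out of the SQL. A sketch; sample.T is the slide’s table prefix and the column names are made up:

```python
from datetime import date

def table_date_range_query(columns, prefix, start, end):
    """Build a legacy-SQL query over date-sharded tables."""
    return (
        f"SELECT {', '.join(columns)} "
        f"FROM (TABLE_DATE_RANGE({prefix}, "
        f"TIMESTAMP('{start.isoformat()}'), "
        f"TIMESTAMP('{end.isoformat()}')))"
    )

print(table_date_range_query(["c1", "c2"], "sample.T",
                             date(2015, 5, 1), date(2015, 5, 10)))
```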

Page 26: You might be paying too much for BigQuery

Narrowing your scan is the key

With a traditional RDBMS, you usually don’t split tables like this: expensive ‘Enterprise’ editions support such features, but they cost you time in design, operation, and maintenance

In BigQuery, splitting tables like this sometimes even makes your queries faster

The difference is architectural: BigQuery is designed from the ground up to read and process data from many disks with many compute nodes

[Figure: daily tables T20150401, T20150503, and T20150503-1]

Page 27: You might be paying too much for BigQuery

Narrowing your scan is the key

[Figure: daily tables T20150501 through T20150510 in the DFS layer, all read in parallel by a tree of compute nodes into the results]

Actually, any single table is stored on many disks, and the data of one table is read by many nodes.

Page 28: You might be paying too much for BigQuery

Using query cache

The result of a query is saved to an anonymous dataset, under a name generated from the query text plus the names and last-update timestamps of the tables involved.

When a query is executed, BigQuery first checks whether a cached result exists.

If the query returns the cached result, it costs nothing.

Page 29: You might be paying too much for BigQuery

Query cache is free

Applications like dashboards, which run almost the same queries again and again, can save costs by using the query cache

You could write code that saves query results somewhere for later reuse, to avoid re-running the same query - but often you don’t have to bother, because the query cache does the same for you

Page 30: You might be paying too much for BigQuery

Query cache is enabled when:

The query is deterministic (e.g. it doesn’t call NOW())

The table does NOT have a streaming buffer

The result of the query was not saved to a named table

Note that a large result (>128 MB) cannot be cached, because in that case you must specify ‘allowLargeResults’, which in turn requires saving the result to a named table.

Page 31: You might be paying too much for BigQuery

Tips & Tricks

You might be paying too much for BigQuery

Page 32: You might be paying too much for BigQuery

Trade-offs - time & cost

Generally, normalizing your data model:

makes the data smaller, which in BigQuery means you pay less

but may use more CPU time and network traffic, especially when you run complex queries across large tables

Page 33: You might be paying too much for BigQuery

Trade-offs - time & cost

We tend to think of ‘cost’ in terms of CPU, network, and storage - the on-premises way

When using BigQuery:

You don’t pay money for CPU or network

Queries that consume a lot of CPU and/or network (e.g. queries using the EACH keyword) simply take more time

If you don’t have to run queries interactively, they can be run in batch mode with less cost, given an ‘appropriate’ schema.

Page 34: You might be paying too much for BigQuery

[Figure: the shuffle diagram again - two layers of compute nodes between the distributed file storage layer (tables) and the results]

Page 35: You might be paying too much for BigQuery

FREE!

[Figure: the same processing-tree diagram - the compute nodes and the network between them are the part you get for free]

Page 36: You might be paying too much for BigQuery

Subqueries optimized

For example, if you have several types of log records in one table and want to join them with different tables depending on the type, you don’t have to worry:

SELECT id, desc FROM
  (SELECT l.id AS id, s1.desc AS desc
   FROM samples.log l
   JOIN samples.subTable1 s1 ON l.value = s1.subid
   WHERE l.type = 0) t1,
  (SELECT l.id AS id, s2.desc AS desc
   FROM samples.log l
   JOIN samples.subTable2 s2 ON l.value = s2.subid
   WHERE l.type = 1) t2

(In legacy BigQuery SQL, the comma between t1 and t2 means UNION ALL.)

This query scans the log table only once.

[Figure: compute nodes reading subTable1, log, and subTable2 from the DFS layer]

Page 37: You might be paying too much for BigQuery

Repeated fields

You can store array-like data in a row

This is not a standardized feature of SQL

It’s like a ‘materialized view’ or a pre-joined table - it can be compact to store and fast to query

You need a good understanding of how it works, or you will get unexpected results

Don’t use too complex a schema (e.g. deeply nested repeated fields)

The functions for repeated fields are useful, but watch out for combinatorial explosion (e.g. with FLATTEN), as the sketch below shows
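To see why careless flattening explodes, take one row with two independent repeated fields - plain Python on made-up data:

```python
# Flattening one row on two independent repeated fields yields their
# cross product: m * n output rows (and scanned cells) per input row.
from itertools import product

row = {"id": 1,
       "tags": ["a", "b", "c"],       # repeated field, 3 values
       "clicks": [10, 20, 30, 40]}    # repeated field, 4 values

flat = [(row["id"], t, c) for t, c in product(row["tags"], row["clicks"])]
print(len(flat))  # 12 rows (3 * 4) from a single original row
```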

Page 38: You might be paying too much for BigQuery

Thank you for listening.

Questions?