(almost) serverless analytics system with bigquery & appengine

19
Gabriel PREDA @eRadical (Almost) Serverless Analytics System with BigQuery & AppEngine

Upload: gabriel-preda

Post on 15-Feb-2017

125 views

Category:

Software


0 download

TRANSCRIPT

Page 1: (Almost) Serverless Analytics System with BigQuery & AppEngine

Gabriel PREDA @eRadical

(Almost) Serverless Analytics Systemwith BigQuery & AppEngine

Page 2: (Almost) Serverless Analytics System with BigQuery & AppEngine

Agenda

Going Serverless withAppEngine & TasksPub/Sub, DataStore

BigQueryLoad

BatchStreaming Inserts

QueryUDF

Export...some BigQueries...

Page 3: (Almost) Serverless Analytics System with BigQuery & AppEngine

AeonsSome years ago...

~ 500,000 - 2,000,000 events / day(on average)

Page 4: (Almost) Serverless Analytics System with BigQuery & AppEngine

Some time ago...

~2,000,000 - 22,000,000 events / dayDec 2014: 57,430,000 events / day

1 day to recompute » 12 hours

Page 5: (Almost) Serverless Analytics System with BigQuery & AppEngine

NOW()

22,000,000 - 70,000,000 events / dayAVG » 40,000,000 events / day

Processing ~30GB-70GB / dayRecompute 1 day » 10-20 minutes

Page 6: (Almost) Serverless Analytics System with BigQuery & AppEngine

serverless?

Desired for: https://www.innertrends.com

Page 7: (Almost) Serverless Analytics System with BigQuery & AppEngine

other... (almost) serverless products

Cloud Functions (alpha - Node.JS)

Cloud DataFlow (Java, Python - beta)

Page 8: (Almost) Serverless Analytics System with BigQuery & AppEngine

BigQuery

https://cloud.google.com/bigquery/docs/

Page 9: (Almost) Serverless Analytics System with BigQuery & AppEngine

BigQuery - data types● STRING - UTF-8 (2 bytes + encoded string size)

● BYTES - base64 encoded (except in Avro)

● INTEGER - 64-bit signed (8 bytes)

● FLOAT (8 bytes)

● BOOLEAN - true/false, 1/0 only in CSV (1 byte)

● TIMESTAMP ex:”2014-08-19 12:41:35.220 UTC” (8 bytes)

● DATE, TIME, DATETIME - limited support in Legacy SQL

● RECORD - a collection of fields (size of fields)https://cloud.google.com/bigquery/data-types

Page 10: (Almost) Serverless Analytics System with BigQuery & AppEngine

BigQuery -> loadData()Formats: CSV, JSON (newline delimited), Avro, Parquet (experimental)Tools: Web UI, bq, APISource:

local files,Cloud Storage, [demo]Cloud Datastore (backup files),POST requests,SQL DML*Google Sheets

- Federated Data Sources- Streaming Inserts

Page 11: (Almost) Serverless Analytics System with BigQuery & AppEngine

BigQuery -> loadData()

bq load ...

Page 12: (Almost) Serverless Analytics System with BigQuery & AppEngine

BigQuery -> loadData()

Got some rows?

Page 13: (Almost) Serverless Analytics System with BigQuery & AppEngine

BigQuery -> SELECT … FROM surprise…

query: SELECT { * | field_path.* | expression } [ [ AS ] alias ] [ , ... ] [ FROM from_body [ WHERE bool_expression ] [ OMIT RECORD IF bool_expression] [ GROUP [ EACH ] BY [ ROLLUP ] { field_name_or_alias } [ , ... ] ] [ HAVING bool_expression ] [ ORDER BY field_name_or_alias [ { DESC | ASC } ] [, ... ] ] [ LIMIT n ] ];from_body: from_item [, ...] | # Warning: Comma means UNION ALL here from_item [ join_type ] JOIN [ EACH ] from_item [ ON join_predicate ] | (FLATTEN({ table_name | (query) }, field_name_or_alias)) | table_wildcard_functionfrom_item: { table_name | (query) } [ [ AS ] alias ]join_type: { INNER | [ FULL ] [ OUTER ] | RIGHT [ OUTER ] | LEFT [ OUTER ] | CROSS }

Page 14: (Almost) Serverless Analytics System with BigQuery & AppEngine

BigQuery -> SELECT … FROM surprise…

Date-Partitioned Tables [demo]

Table Decorators - See the past w/ @

Table Wildcard Functions - TABLE_DATE_RANGE() & TABLE_QUERY()

Interesting functions- DateTime » UTC_USEC_TO_DAY/HOUR/MONTH/WEEK/YEAR()

» Shifts a UNIX timestamp in microseconds to the beginning of the period it occurs in.- JSON_EXTRACT[_SCALAR]()- URL functions » HOST(), DOMAIN(), TLD()- REGEXP_MATCH(), REGEXP_EXTRACT()

Page 15: (Almost) Serverless Analytics System with BigQuery & AppEngine

bigquery.defineFunction( 'expandAssetLibrary', // Name of the function exported to SQL ['user_id', 'video_id', 'stage_settings'], // Names of input columns [ {'name': 'user_id', 'type': 'integer'}, // Output schema {'name': 'video_id', 'type': 'string'}, {'name': 'asset', 'type': 'string'} ], expandAssetLibrary // Reference to JavaScript UDF);

function expandAssetLibrary(row, emit) { …………………………

emit({ user_id: row.user_id, video_id: row.video_id, asset: ss.url.replace('http://', ''));}

BigQuery -> User Defined Functions

Page 16: (Almost) Serverless Analytics System with BigQuery & AppEngine

BigQuery -> DML

Standard SQL only

Maximum UPDATE/DELETE statements per day per table: 48

Maximum UPDATE/DELETE statements per day per project: 500

Maximum INSERT statements per day per table: 1,000

Maximum INSERT statements per day per project: 10,000

Page 17: (Almost) Serverless Analytics System with BigQuery & AppEngine

BigQuery -> export()

To: Google Cloud Storage

Format: CSV, JSON [.gz], Avro

…1G files

Page 18: (Almost) Serverless Analytics System with BigQuery & AppEngine

BigQuery -> some (Big)Queries

SELECT year, count(1)

FROM [bigquery-public-data:samples.natality]

WHERE father_age < 18

GROUP BY year

ORDER BY year

SELECT year, count(1)

FROM [bigquery-public-data:samples.natality]

WHERE mother_age < 18

GROUP BY year

ORDER BY year

SELECT table_id, row_count, CEIL(size_bytes/POW(1024, 3)) AS gb

FROM [bigquery-public-data:ghcn_m.__TABLES__] ORDER BY gb DESC

Page 19: (Almost) Serverless Analytics System with BigQuery & AppEngine

BigQuery -> some (Big)Queries

SELECT REGEXP_EXTRACT(path, r'.*\.(.*)$') AS file_extension,

COUNT(1) AS k

FROM [bigquery-public-data:github_repos.files]

GROUP BY file_extension

ORDER BY k DESC

LIMIT 20

SELECT table_id, row_count,

CEIL(size_bytes/POW(1024, 3)) AS gb

FROM [bigquery-public-data:github_repos.__TABLES__]

ORDER BY gb DESC