Google Dataflow 小試 (A Quick Try of Google Dataflow) - JCConf.tw

Google Dataflow 小試 Simon Su @ LinkerNetworks {Google Developer Expert}

Posted on 24-Jul-2020



var simon = {
  /** I am at GCPUG.TW **/
  // say() was not defined on the slide; a minimal stand-in so the last line runs
  say: function (msg) { console.log(msg); }
};
simon.aboutme = 'http://about.me/peihsinsu';
simon.nodejs = 'http://opennodes.arecord.us';
simon.googleshare = 'http://gappsnews.blogspot.tw';
simon.nodejsblog = 'http://nodejs-in-example.blogspot.tw';
simon.blog = 'http://peihsinsu.blogspot.com';
simon.slideshare = 'http://slideshare.net/peihsinsu/';
simon.email = '[email protected]';
simon.say('Good luck to everybody!');


1st Wave: Colocation

Your kit, someone else’s building. Yours to manage.

2nd Wave: Virtualized Data Centers

Standard virtual kit for rent. Still yours to manage.

3rd Wave (Next): An actual, global elastic cloud

Invest your energy in great apps.

Storage, Processing, Memory, Network

Clusters

Containers

Distributed Storage, Processing & Machine Learning

Assembly required → True On Demand Cloud


Foundation: Infrastructure & Operations

The Gear that Powers Google

Data Services

Breakthrough Insights, Breakthrough Applications

Application Runtime Services

Enabling No-Touch Operations


Capture: Pub/Sub, Logs, App Engine, BigQuery streaming

Process: Dataflow, Dataproc

Store: Cloud Storage, Cloud Datastore (NoSQL), Cloud SQL (MySQL), BigQuery Storage

Analyze: BigQuery, Dataproc, Larger Hadoop Ecosystem


Devices: Smart devices, IoT devices and sensors

Physical or virtual servers as frontend data receivers

Strong queue service for handling large-scale data injection

MapReduce servers for large data transformation, in batch or streaming mode

Large-scale data store for storing data and serving query workloads

Scale: 1M devices, 16.6K events/sec, 43B events/month
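The scale figures on this slide agree with each other if every device reports about once a minute. The reporting interval and the 30-day month below are assumptions, not something the slide states. A quick check:

```python
# Back-of-the-envelope check of the slide's scale figures.
# Assumptions (not stated on the slide): each device sends one event
# per minute, and a month has 30 days.
DEVICES = 1_000_000
SECONDS_PER_EVENT = 60                    # assumed reporting interval per device
SECONDS_PER_MONTH = 30 * 24 * 60 * 60     # assumed 30-day month

events_per_sec = DEVICES / SECONDS_PER_EVENT
events_per_month = events_per_sec * SECONDS_PER_MONTH

print(f"{events_per_sec / 1e3:.1f}K events/sec")
print(f"{events_per_month / 1e9:.1f}B events/month")
```

Under those assumptions the results land within rounding of the slide's 16.6K events/sec and 43B events/month.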


Timeline axis: 2002, 2004, 2006, 2008, 2010, 2012, 2013

Technologies shown: GFS, MapReduce, Big Table, Dremel, Colossus, Flume, Spanner, MillWheel


GCP

Managed Service

User Code & SDK

Work Manager

Deploy & Schedule

Monitoring UI

Job Manager

Progress & Logs


ETL

• Movement

• Filtering

• Enrichment

• Shaping

Analysis

• Reduction

• Batch computation

• Continuous computation

Orchestration

• Composition

• External orchestration

• Simulation


• Online cloud import (Cloud Storage Transfer Service)

• Offline import (third party)

• Regional buckets

• Object lifecycle management

• Object versioning

• ACLs

• Object change notification


MapReduce phase and its Dataflow counterpart:

Map: ParDo

Shuffle: GroupByKey

Reduce: ParDo
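The Map/Shuffle/Reduce correspondence can be sketched as a word count in plain Python. This is not the Dataflow SDK; the function names and the sample lines are made up for illustration, and each function only mimics the role of the transform named in its comment.

```python
from collections import defaultdict

def map_phase(lines):
    # Plays the role of the first ParDo: one line in, many (word, 1) pairs out.
    return [(word, 1) for line in lines for word in line.lower().split()]

def shuffle_phase(pairs):
    # Plays the role of GroupByKey: gather all values that share a key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Plays the role of the second ParDo: one (word, [1, 1, ...]) in, one (word, n) out.
    return {word: sum(ones) for word, ones in grouped.items()}

lines = ["go hawks", "go green bay", "hawks win"]   # illustrative input
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)
```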


● A Directed Acyclic Graph of data processing transformations

● Can be submitted to the Dataflow Service for optimization and execution, or executed on an alternate runner, e.g. Spark

● May include multiple inputs and multiple outputs

● May encompass many logical MapReduce operations

● PCollections flow through the pipeline
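To make the "directed acyclic graph with multiple inputs and outputs" idea concrete, here is a minimal sketch in plain Python. It is not how the Dataflow Service executes or optimizes a pipeline; the step names, the `steps` table and the `run` helper are hypothetical, and the standard-library `graphlib` module (Python 3.9+) supplies the dependency ordering.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: step name -> (upstream step names, function of their outputs).
# Two inputs (read_a, read_b) and two outputs (evens, total).
steps = {
    "read_a": ([], lambda: [1, 2, 3]),
    "read_b": ([], lambda: [10, 20]),
    "merge":  (["read_a", "read_b"], lambda a, b: a + b),
    "evens":  (["merge"], lambda xs: [x for x in xs if x % 2 == 0]),
    "total":  (["merge"], lambda xs: [sum(xs)]),
}

def run(steps):
    """Run every step after its upstream steps; a cycle raises graphlib.CycleError."""
    graph = {name: deps for name, (deps, _) in steps.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        deps, fn = steps[name]
        results[name] = fn(*(results[dep] for dep in deps))
    return results

results = run(steps)
print(results["evens"], results["total"])
```

Each entry in `results` stands in for the collection that flows along one edge of the graph.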


Your Source/Sink Here

❯ Read from standard Google Cloud Platform data sources

• GCS, Pub/Sub, BigQuery, Datastore

❯ Write your own custom source by teaching Dataflow how to read it in parallel

• Currently for bounded sources only

❯ Write to GCS, BigQuery, Pub/Sub

• More coming…

❯ Can use a combination of text, JSON, XML, Avro formatted data


❯ A collection of data of type T in a pipeline - PCollection<T>

❯ May be either bounded or unbounded in size

❯ Created by using a PTransform to:

• Build from a java.util.Collection

• Read from a backing data store

• Transform an existing PCollection

❯ Often contains key-value pairs, using KV<K, V>

{Seahawks, NFC, Champions, Seattle, ...}

{..., “NFC Champions #GreenBay”, “Green Bay #superbowl!”, ... “#GoHawks”, ...}


● A step, or a processing operation, that transforms data

○ Convert format, group, filter data

● Types of transforms

○ ParDo

○ GroupByKey

○ Combine

○ Flatten

■ If you have multiple PCollection objects that contain the same data type, you can merge them into a single logical PCollection using the Flatten transform
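The idea behind Flatten can be sketched in plain Python as concatenating several same-typed collections into one. This is only an analogy for the transform, not the SDK call; the `flatten` helper and the two sample lists are hypothetical.

```python
from itertools import chain

def flatten(*collections):
    """Merge several collections of the same element type into one logical collection."""
    return list(chain.from_iterable(collections))

# Hypothetical inputs: two collections that both hold tweet strings.
tweets_a = ["NFC Champions #GreenBay", "#GoHawks"]
tweets_b = ["Green Bay #superbowl!"]

all_tweets = flatten(tweets_a, tweets_b)
print(all_tweets)
```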


❯ Processes each element of a PCollection independently using a user-provided DoFn

❯ Corresponds to both the Map and Reduce phases in Hadoop, i.e. ParDo->GBK->ParDo

❯ Useful for

• Filtering a data set

• Formatting or converting the type of each element in a data set

• Extracting parts of each element in a data set

• Performing computations on each element in a data set

{Seahawks, NFC, Champions, Seattle, ...}

KeyBySessionId

{KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, …}
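A rough plain-Python picture of ParDo: a user-supplied function is applied to each element on its own and may emit zero, one or many outputs. The `par_do` helper and the `key_by_first_letter` function are hypothetical stand-ins, not SDK classes, and keying by first letter is only an assumption about how the slide's example keys (S, C, N) were derived.

```python
def par_do(elements, do_fn):
    """Apply do_fn to each element independently and yield everything it emits."""
    for element in elements:
        yield from do_fn(element)

def key_by_first_letter(word):
    """DoFn stand-in: filter out blanks, then format each word as a (key, value) pair."""
    word = word.strip()
    if word:                            # filtering: blanks emit nothing
        yield (word[0].upper(), word)   # formatting: wrap as a key-value pair

words = ["Seahawks", "NFC", "Champions", "Seattle", ""]
keyed = list(par_do(words, key_by_first_letter))
print(keyed)
```

Because no element depends on any other, the same function could be run on many workers at once, which is the property that makes this kind of step easy to parallelize.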


GroupByKey

• Takes a PCollection of key-value pairs and gathers up all values with the same key

• Corresponds to the shuffle phase in Hadoop

{KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...}

{KV<S, {Seahawks, Seattle, …}>, KV<N, {NFC, …}>, KV<C, {Champions, …}>}

Wait a minute… How do you do a GroupByKey on an unbounded PCollection?
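The transcript does not include a slide answering the question about unbounded collections. One common answer in Dataflow is windowing: elements are assigned to finite windows by timestamp, and grouping happens per window. The plain-Python sketch below assumes fixed 60-second windows and invented timestamps, and it ignores late data and triggers entirely.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed fixed window size

def window_start(timestamp):
    """Return the start of the fixed window that contains this timestamp."""
    return timestamp - (timestamp % WINDOW_SECONDS)

def group_by_key_per_window(timestamped_pairs):
    """Group values by (window start, key) so each window can be emitted on its own."""
    grouped = defaultdict(list)
    for timestamp, (key, value) in timestamped_pairs:
        grouped[(window_start(timestamp), key)].append(value)
    return dict(grouped)

# (event time in seconds, (key, value)); the timestamps are illustrative only.
events = [
    (5,   ("S", "Seahawks")),
    (42,  ("C", "Champions")),
    (61,  ("S", "Seattle")),
    (75,  ("N", "NFC")),
    (118, ("S", "Seahawks")),
]
result = group_by_key_per_window(events)
print(result)
```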


Cloud Pub/Sub

Publishers: Publisher A, Publisher B, Publisher C

Topics: Topic A, Topic B, Topic C

Subscriptions: Subscription XA, Subscription XB, Subscription YC, Subscription ZC

Subscribers: Subscriber X (receives Message 1, Message 2), Subscriber Y (receives Message 3), Subscriber Z (receives Message 3)

• Globally redundant

• Low latency (sub sec.)

• N to N coupling

• Batched read/write

• Push & Pull

• Guaranteed Delivery

• Auto expiration
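The topic and subscription layout in this diagram can be mimicked with a small in-memory model: every subscription attached to a topic gets its own copy of each message published there. The `PubSub` class below is a hypothetical stand-in, not the Cloud Pub/Sub client API, and the assignment of Message 1, 2 and 3 to Topics A, B and C is an assumption read off the diagram.

```python
from collections import defaultdict

class PubSub:
    """In-memory stand-in for topics and subscriptions (not the Cloud Pub/Sub API)."""

    def __init__(self):
        self.subscriptions = defaultdict(list)  # topic -> subscription names
        self.queues = defaultdict(list)         # subscription name -> pending messages

    def subscribe(self, topic, subscription):
        self.subscriptions[topic].append(subscription)

    def publish(self, topic, message):
        # Every subscription on the topic receives its own copy of the message.
        for subscription in self.subscriptions[topic]:
            self.queues[subscription].append(message)

    def pull(self, subscription):
        # Hand over all pending messages and clear the queue.
        messages, self.queues[subscription] = self.queues[subscription], []
        return messages

bus = PubSub()
bus.subscribe("Topic A", "Subscription XA")
bus.subscribe("Topic B", "Subscription XB")
bus.subscribe("Topic C", "Subscription YC")
bus.subscribe("Topic C", "Subscription ZC")

bus.publish("Topic A", "Message 1")   # assumed message-to-topic mapping
bus.publish("Topic B", "Message 2")
bus.publish("Topic C", "Message 3")

subscriber_x = bus.pull("Subscription XA") + bus.pull("Subscription XB")
subscriber_y = bus.pull("Subscription YC")
subscriber_z = bus.pull("Subscription ZC")
print(subscriber_x, subscriber_y, subscriber_z)
```

Subscribers Y and Z both see Message 3 because each has its own subscription on Topic C, which is the "N to N coupling" listed above.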


Datalab: An easy tool for analysis and reporting

BigQuery: An interactive analysis service
