Big Data on Google Cloud
Tu Pham - IO Extended 2017
CTO @ Dyno, a data-as-a-service company
Technologies: Java, Python, all kinds of databases, and cloud platforms from Google, AWS, and Azure.
Interests: Cloud computing / architecture, technology evolution, distributed systems.
Husband, Father, GDE, Open source contributor.
Tu Pham
Photo: Lars Kruse, Aarhus Universitet
About Dyno:
- Tech marketing & digital agency
For the past 17 years, Google has been building out the world’s fastest, most powerful, highest quality cloud infrastructure on the planet.
Images by Connie Zhou
Google Cloud Platform is built on the same infrastructure that powers Google.
Images by Connie Zhou
Google’s Platform
“[Google's] ability to build, organize, and operate a huge network of servers and fiber-optic cables with an efficiency and speed that rocks physics on its heels. This is what makes Google Google: its physical network, its thousands of fiber miles, and those many thousands of servers that, in aggregate, add up to the mother of all clouds.”
- Wired
77 peering locations
Yes, We Can Power That
Web, Mobile, Storage & Database
Big Data, Highly Scalable System, Data Mining
Google Cloud Platform
Google’s Mission
“Organize the world’s information and make it universally accessible and useful.”
By 2020, there will be 8 billion connected smartphones, 2X more than today.
And 32 billion connected “IoT” devices, 6X more than today.
Source: Boston Consulting Group, The Mobile Revolution: How Mobile Technologies Drive a Trillion-Dollar Impact; IDC, 2015
Exploring the Cloud
IaaS: Infrastructure-as-a-Service
PaaS: Platform-as-a-Service
SaaS: Software-as-a-Service
Google Cloud Platform
Google Compute Engine
• Flexible Infrastructure
  • Custom VM Size
  • Online Disk Resizing
• Network
  • Internal Network
  • Firewall
  • Load Balancing
  • External IP Address
• Billing
  • Sustained Usage Discounts
  • Preemptible VM
App Engine
• Fully Managed Platform
• Popular Programming Language Support
• Flexible and Scalable Application Storage
• Auto-scaling
• Versioning and Traffic Splitting
• Local Developer Tools
• Third-party Frameworks and Extensions
Pub/Sub
• Global Presence
• Flexible Delivery Options
  • Pull
  • Push
• Data Reliability
• Flow Control
• Data Security and Protection
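The pull and push delivery options listed for Pub/Sub can be illustrated with a small in-memory sketch. This is not the Cloud Pub/Sub client library: `ToyTopic`, `subscribe_push`, and `pull` are hypothetical names invented for illustration, and the sketch only shows the difference between the service calling a subscriber (push) and a subscriber asking for messages at its own pace (pull).

```python
from collections import deque
from typing import Callable, Deque, List


class ToyTopic:
    """Hypothetical in-memory topic; an illustration, not the Cloud Pub/Sub API."""

    def __init__(self) -> None:
        # Messages waiting for pull subscribers.
        self._pull_queue: Deque[str] = deque()
        # Callbacks standing in for push endpoints.
        self._push_endpoints: List[Callable[[str], None]] = []

    def subscribe_push(self, endpoint: Callable[[str], None]) -> None:
        self._push_endpoints.append(endpoint)

    def publish(self, message: str) -> None:
        # Push delivery: every registered endpoint is called right away.
        for endpoint in self._push_endpoints:
            endpoint(message)
        # Pull delivery: the message waits until a subscriber asks for it.
        self._pull_queue.append(message)

    def pull(self, max_messages: int = 10) -> List[str]:
        # The subscriber decides how many messages it takes per request.
        batch: List[str] = []
        while self._pull_queue and len(batch) < max_messages:
            batch.append(self._pull_queue.popleft())
        return batch


pushed: List[str] = []
topic = ToyTopic()
topic.subscribe_push(pushed.append)
for event in ["click", "view", "purchase"]:
    topic.publish(event)

first_batch = topic.pull(max_messages=2)
second_batch = topic.pull(max_messages=2)
```

Capping `max_messages` on the pull side is a simple stand-in for the flow control bullet: the consumer never receives more than it asked for.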
Cloud Dataflow
• Reliable & Consistent Processing
• Unified Programming Model
• Intelligent Work Scheduling
• Auto Scaling
• Monitoring
• Open Source
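One way to read the "unified programming model" bullet is that the same transform runs over bounded (batch) and unbounded (stream) input. The sketch below is plain Python, not the Dataflow or Apache Beam API; `count_errors_per_window` and `event_stream` are hypothetical names, and the windows are count-based rather than time-based to keep the example small.

```python
from itertools import islice
from typing import Iterable, Iterator


def count_errors_per_window(events: Iterable[str], window_size: int) -> Iterator[int]:
    """Count 'error' events in consecutive fixed-size windows of the input."""
    iterator = iter(events)
    while True:
        window = list(islice(iterator, window_size))
        if not window:
            return  # bounded input is exhausted
        yield sum(1 for event in window if event == "error")


# Batch: a bounded list of events.
batch_input = ["ok", "error", "ok", "error", "error", "ok"]
batch_result = list(count_errors_per_window(batch_input, window_size=3))


def event_stream() -> Iterator[str]:
    """Hypothetical unbounded source that alternates 'ok' and 'error' forever."""
    while True:
        yield "ok"
        yield "error"


# Stream: the same function over an unbounded generator; take two windows only.
stream_result = list(islice(count_errors_per_window(event_stream(), window_size=4), 2))
```

The transform is written once; only the source changes between the batch and the stream case.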
Cloud Storage
• Versioning
• Static Sites
• Resumable Transfers
• Object Change Notifications
• TB Scale
Cloud SQL
• Fully Managed
• Ease of Use
• Highly Reliable
• Flexible Charging
• Security, Availability, Durability
• Easy Migration & Data Portability
• Optimized MySQL Versions
BigQuery
• Fully Managed Big Data Analytics Service
• Supports SQL
• Fast
• Scalable
• Flexible and Familiar
• Security and Reliability
Dataproc
• Includes
  • Apache Hadoop
  • Apache Pig
  • Apache Hive
  • Apache Spark
• Fast and Scalable Data Processing
• Flexible Virtual Machines
• Resizable Cluster
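The Hadoop and Spark jobs that Dataproc runs build on the map, shuffle, and reduce pattern. The sketch below shows that pattern as a single-process word count in plain Python. It does not use Hadoop, Spark, or any Dataproc API, and the function names are chosen for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the grouped values to get one count per word."""
    return {key: sum(values) for key, values in grouped.items()}


lines = ["big data on google cloud", "big query big table"]
mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = reduce_phase(shuffle(mapped))
```

On a real cluster the map and reduce phases run in parallel across many workers; here they run sequentially so the data flow stays visible.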
Datalab
• Powerful Data Exploration
• Scalable
• Data Management
• Visualization
• Open Source (Jupyter)
Google’s Data Services for everyone
A common configuration
[Diagram: events, metrics, etc. (stream); raw logs, files, assets, Google Analytics data, etc. (batch); Cloud Datalab; visualization and BI; applications and reports; co-workers draw conclusions]
A serverless big data stack that scales automatically
10+ Years of Tackling Big Data Problems
[Timeline figure, 2002 to 2015]
Google Papers: GFS, MapReduce, BigTable, Dremel, FlumeJava, MillWheel, PubSub
Open Source: Apache Beam, TensorFlow
Google Cloud Products: BigQuery, Pub/Sub, Dataflow, Bigtable
Transform Data into Actions
Databases
Storage
Data Preparation & Processing
Analytics
Exploration & Collaboration
Advanced Analytics & Intelligence
Mobile apps
Sensors and devices
Web apps
Relational
Key-value
Document
SQL
Wide column
Object
Stream processing
Batch processing
Data preparation
Federated query
Data catalog
Data exploration
Data visualization
Developers
Data scientists
Business analysts
Development environment for Machine Learning
Pre-Trained Machine Learning models
Data Ingestion
Messaging
Logs
Transform Data into Actions
Data Preparation & Processing
Cloud Dataflow
Cloud Dataproc
Exploration & Collaboration
Google BigQuery
Cloud Datalab
Google Analytics 360
Cloud Dataproc
Mobile apps
Sensors and devices
Web apps
Developers
Data scientists
Business analysts
Data Ingestion
Cloud Pub/Sub
App Engine
Databases/Storage
Cloud SQL
Cloud Bigtable
Cloud Datastore
Cloud Storage
Analytics
Google BigQuery
Google Analytics 360
Cloud Dataproc
Google Drive
Advanced Analytics & Intelligence
Cloud Machine Learning
Translate API
Vision API
Speech API
Google Cloud Dataproc
Apache Spark and Apache Hadoop should be fast, easy, and cost-effective.
Traditional Spark and Hadoop clusters
Google Cloud Dataproc
Google Cloud Dataproc - under the hood
Applications on the cluster
Dataproc Jobs
GCP Products
Spark
PySpark
Spark SQL
MapReduce
Pig
Hive
Dataproc Cluster
Spark & Hadoop OSS
Cloud Dataproc Agent
Google Cloud Services
Dataproc Jobs Features Data Outputs
Easy, fast, cost-effective
Fast: Things take seconds to minutes, not hours or weeks
Easy: Be an expert with your data, not your data infrastructure
Cost-effective: Pay for exactly what you use
Running Hadoop on Google Cloud
[Comparison figure: On Premise, Vendor Hadoop, bdutil (Free OSS Toolkit), and Dataproc (Managed Hadoop), each marked Google Managed or Customer Managed across: Custom Code, Monitoring/Health, Dev Integration, Scaling (shown as Manual Scaling in one column), Job Submission, GCP Connectivity, Deployment, Creation]
Cloud Dataproc - integrated
Cloud Dataproc is natively integrated with several Google Cloud Platform products as part of an integrated data platform.
Storage
Operations
Data
Where Cloud Dataproc fits into GCP
Google Bigtable (HBase)
Google BigQuery (Analytics, Data warehouse)
Stackdriver Logging (Logging Ops.)
Google Cloud Dataflow (Batch/Stream Processing)
Google Cloud Storage (HCFS/HDFS)
Stackdriver Monitoring (Monitoring)
Google BigQuery makes complex data analysis simple
Scales automatically
No setup or administration
Stream up to 100,000 rows per second
Easily integrates with third-party software
Google BigQuery Performance Example?
Running an inefficient regular expression over 100 billion rows in less than 60 seconds
Source: https://cloud.google.com/blog/big-data/2016/01/anatomy-of-a-bigquery-query
Google BigQuery
The Power of Google Dremel for everyone
Storage, Compute
Fast Ingest, Query
Terabit Network
Before: 1000-core Hadoop cluster = 2.5 hours
After: Making ad hoc queries with BigQuery < 5 min
● 500+ Games
● Hundreds of Analysts
● Terabytes of Data Daily
“Right at the start of the partnership we were able to reduce time to insight from 96 hours to 30 minutes by using BigQuery, allowing us to react in real time to customer needs and provide better service.”
Gary Sanders, Head of the bank's digital analytics function
https://www.finextra.com/newsarticle/28566/lloyds-partners-google-on-data-analytics
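The two before/after figures quoted above (2.5 hours down to under 5 minutes, and 96 hours down to 30 minutes) can be turned into speedup factors with a short calculation. The helper name `speedup` is made up for this sketch, and because the slide says "< 5 min", the first result is a lower bound.

```python
def speedup(before_minutes: float, after_minutes: float) -> float:
    """Return how many times faster the 'after' duration is than the 'before' one."""
    return before_minutes / after_minutes


# 1000-core Hadoop cluster (2.5 hours) vs. ad hoc BigQuery queries (under 5 minutes).
hadoop_vs_bigquery = speedup(2.5 * 60, 5)

# Time to insight in the quote: 96 hours vs. 30 minutes.
time_to_insight = speedup(96 * 60, 30)

print(f"Ad hoc queries: at least {hadoop_vs_bigquery:.0f}x faster")
print(f"Time to insight: {time_to_insight:.0f}x faster")
```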
Big Data Challenges At Dyno
- Multi-TB data warehouse
- Raw input: > 100 GB of new raw data per day (structured & unstructured)
- 65 online data sources
- Unlimited offline data sources
- Data quality problems every day
- From user information & behavior to user interest & intention
- Managing a high-performance, cost-effective system
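To put the daily intake figure in perspective, here is a rough yearly estimate. It assumes a constant 100 GB per day (the slide gives this as a minimum) and decimal units (1 TB = 1000 GB); both are assumptions of this sketch, so the result is a lower bound.

```python
DAILY_RAW_GB = 100   # lower bound quoted on the slide
DAYS_PER_YEAR = 365
GB_PER_TB = 1000     # decimal terabytes, an assumption of this sketch

yearly_raw_tb = DAILY_RAW_GB * DAYS_PER_YEAR / GB_PER_TB

print(f"At least {yearly_raw_tb} TB of new raw data per year")
```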
JOIN THE FLIGHT - WE ARE HIRING
IO Extended 2017
Twitter: @phamptu
Email: [email protected]
Frontend Developer: goo.gl/EY8RvV
Backend Developer: goo.gl/BnmmK6