donghui zhang [email protected] 2017-5 … · ibm ung intel sap osoft apple acle...
TRANSCRIPT
Your Background Familiar with big-data analytics?
Value = show you what’s “under the hood”.
Familiar with big-data platform?
Mostly review; Value = think about my opinions.
Just curious?
Value = general awareness.
Not interested in big data?
You are in the wrong room.
http://BigAnalyticsPlatform.com 2 (C) 2017 Donghui Zhang
Disclaimer The opinions expressed on this site are mine and
do not necessarily represent those of my employer.
BigAnalyticsPlatform.com is my personal blogging site. I currently work at Facebook.
3 http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
Content Why
What
History
Technical How-Tos
Career Advice
Conclusions
http://BigAnalyticsPlatform.com 4 (C) 2017 Donghui Zhang
Why Big Data? Data Grows Fast Data in the world:
10 billion TB
90% was produced in the last 2 years!
5
Source: Mikal Khoso. “How Much Data is Produced Every Day?” http://www.northeastern.edu/levelblog/2016/05/13/how-much-data-produced-every-day
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
Why Big-Data Platform? Platform can be a competitive advantage.
Enable junior developers to quickly create robust applications.
Google thinks of itself as a systems engineering company.
6
Quote source: Todd Hoff. “Google Architecture”. http://highscalability.com/google-architecture
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
7
Data source: Yahoo Finance on 1/3/2017.
159 208 174
106
504
616
156
357
547
234 222
338
0100200300400500600700
IBM
Sam
sun
g
Inte
l
SA
P
Mic
roso
ft
Ap
ple
Ora
cle
Am
azo
n
Go
og
le
Ten
cen
t
Ali
bab
a
Fac
ebo
ok
1911 193819681972 1975 1976 197719941998199819992004
Ma
rke
t ca
p (
bil
lio
n$)
Company + year founded
All biggies have big data platforms
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
8
Data source: Yahoo Finance on 1/3/2017.
159 208 174
106
504
616
156
357
547
234 222
338
0100200300400500600700
IBM
Sam
sun
g
Inte
l
SA
P
Mic
roso
ft
Ap
ple
Ora
cle
Am
azo
n
Go
og
le
Ten
cen
t
Ali
bab
a
Fac
ebo
ok
1911 193819681972 1975 1976 197719941998199819992004
Ma
rke
t ca
p (
bil
lio
n$)
Company + year founded
All biggies have big data platforms
top 3 cloud service providers
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
9
Data source: Yahoo Finance on 1/3/2017.
159 208 174
106
504
616
156
357
547
234 222
338
0100200300400500600700
IBM
Sam
sun
g
Inte
l
SA
P
Mic
roso
ft
Ap
ple
Ora
cle
Am
azo
n
Go
og
le
Ten
cen
t
Ali
bab
a
Fac
ebo
ok
1911 193819681972 1975 1976 197719941998199819992004
Ma
rke
t ca
p (
bil
lio
n$)
Company + year founded
All biggies have big data platforms
Larry Ellison: “Amazon’s lead is
over”
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
10
Data source: Yahoo Finance on 1/3/2017.
159 208 174
106
504
616
156
357
547
234 222
338
0100200300400500600700
IBM
Sam
sun
g
Inte
l
SA
P
Mic
roso
ft
Ap
ple
Ora
cle
Am
azo
n
Go
og
le
Ten
cen
t
Ali
bab
a
Fac
ebo
ok
1911 193819681972 1975 1976 197719941998199819992004
Ma
rke
t ca
p (
bil
lio
n$)
Company + year founded
All biggies have big data platforms
Apple “Pie”
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
11
Data source: Yahoo Finance on 1/3/2017.
159 208 174
106
504
616
156
357
547
234 222
338
0100200300400500600700
IBM
Sam
sun
g
Inte
l
SA
P
Mic
roso
ft
Ap
ple
Ora
cle
Am
azo
n
Go
og
le
Ten
cen
t
Ali
bab
a
Fac
ebo
ok
1911 193819681972 1975 1976 197719941998199819992004
Ma
rke
t ca
p (
bil
lio
n$)
Company + year founded
All biggies have big data platforms
Samsung bought Joyant
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
12
Data source: Yahoo Finance on 1/3/2017.
159 208 174
106
504
616
156
357
547
234 222
338
0100200300400500600700
IBM
Sam
sun
g
Inte
l
SA
P
Mic
roso
ft
Ap
ple
Ora
cle
Am
azo
n
Go
og
le
Ten
cen
t
Ali
bab
a
Fac
ebo
ok
1911 193819681972 1975 1976 197719941998199819992004
Ma
rke
t ca
p (
bil
lio
n$)
Company + year founded
All biggies have big data platforms
Alibaba 2015: 377 sec (3,377 nodes Apsara) Tencent 2016: 134 sec (512 nodes OpenPower) Gray sort. See http://sortbenchmark.org
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
Content Why
What
History
Technical How-Tos
Career Advice
Conclusions
http://BigAnalyticsPlatform.com 13 (C) 2017 Donghui Zhang
What is Big Data? Big data sets
e.g. “This year our users uploaded 10X more videos; we have big data now.”
big volume, big variety, or big velocity
exceed existing data processing capabilities
Big data analytics
e.g. “We use big data to predict stock trends.”
Big data stack
software
platform
infrastructure
14 http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
The Big Data Stack
15
Analytics
Infrastructure Think IaaS such as AWS EC2. Networked VMs.
Platform Think PaaS such as Google App Engine. A platform for developing software.
Analytics Software Think SaaS such as Microsoft Office 365. Software that Data Scientists can use.
Reports, docs, ad hoc scripts...
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
Google Stack
16
Infrastructure
Platform
Products
Custom-built machines; RedHat Linux
GFS/Colossus, BigTable, Spanner, MapReduce/Cloud Dataflow, Chubby,
Borg/Omega
search, advertising, gmail, docs, maps, youtube, cloud platform, …
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
Sample Open-Source Stack
17
Infrastructure
Platform
Analytics Software
Analytics
VMs
Spark on YARN with Hive
Tableau, scikit-learn
Python scripts
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
5 V’s of Big Data Volume
Variety
Velocity
Veracity
Value
18
5V’s source: Jason Williamson. “The 4 V’s of Big Data”. http://www.dummies.com/careers/find-a-job/the-4-vs-of-big-data
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
5 V’s of Big Data Volume
Variety
Velocity
Value
Veracity
19
“Your small data can be my big data!”
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
5 V’s of Big Data Volume
Variety
Velocity
Value
Veracity
20
Lessons
• A key feature missing in RDBMS is variety.
RDBMS guru: “Put you data in a database!” Scientist: “My data is not relational.” RDBMS guru: “Make your data relational!” Scientist: “But it is not relational!”
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
5 V’s of Big Data Volume
Variety
Velocity
Value
Veracity
21
Streaming. ETL ELT: Load first, transform later.
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
5 V’s of Big Data Volume
Variety
Velocity
Value
Veracity
22
Lessons
• Do big data for increasing business value, not for tech.
• Read a book on building a startup.
http://BigAnalyticsPlatform.com
Source: Frank McSherry. “Scalability! But at what COST?” http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html
If you are going to use a big data system for yourself,
see if it is faster than your laptop.
Frank McSherry
(C) 2017 Donghui Zhang
5 V’s of Big Data Volume
Variety
Velocity
Value
Veracity
23
Source: Philip Russom. “Best Practices for Data Lake Management”. https://tdwi.org/research/2016/10/checklist-data-lake-management.aspx
Lessons
• Use Data Lakes, not Data Swamps.
• Read Russom’s “Best Practices for Data lake Management”.
Data scientist: “My analysis suggested this billion-dollar action.” Manager: “Where was the data from?”
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
Content Why
What
History
Technical How-Tos
Career Advice
Conclusions
http://BigAnalyticsPlatform.com 24 (C) 2017 Donghui Zhang
Big Data History
25
What goes around comes around.
Mike Stonebraker
Everything has prior art.
David DeWitt
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
Big Data History 1969: relational model (Edgar F. Codd*)
1976: System R by IBM (Jim Gray*; transactions)
1986: Postgres (Mike Stonebraker*; ADT)
1990: Gamma (David DeWitt; shared nothing)
2004: MapReduce (Jeff Dean; flexibility)
2005: “One size doesn’t fit all” (Mike Stonebraker)
2006: Hadoop (Doug Cutting)
2011: Spark (Matei Zaharia)
2017: Death of shared nothing (David DeWitt)
26
* Turing Award Winners (1981, 1998, 2014). http://amturing.acm.org/byyear.cfm
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
Big Data History
27
Lessons
• Don’t reinvent the wheels. • Read the editors’ intro for “the red book”. • Read "Architecture of a Database System". • Study favorite posts on HighScalability.
The red book: Bailis, Hellerstein, Stonebraker. “Readings in Database Systems”, 5th Ed. http://www.redbook.io HighScalability: http://highscalability.com/all-time-favorites
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
Content Why
What
History
Technical How-Tos
Career Advice
Conclusions
http://BigAnalyticsPlatform.com 28 (C) 2017 Donghui Zhang
How to Scale to Many Servers?
29
When your data is small
http://BigAnalyticsPlatform.com
clients
server
(C) 2017 Donghui Zhang
How to Scale to Many Servers?
30
Use a load balancer
http://BigAnalyticsPlatform.com
clients
LB
servers
(C) 2017 Donghui Zhang
How to Scale to Many Servers? Round-Robin DNS, Point of Presence, multi-level LB.
http://BigAnalyticsPlatform.com 31
LB clients
servers
POP
POP
POP
POP
POP
(C) 2017 Donghui Zhang
Image source: Abhijeet Desai. "Google Cluster Architecture". http://www.slideshare.net/abhijeetdesai/google-cluster-architecture
Google Cluster at the Beginning
32 http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
33
Google Belgium Data Center
Image source: Malte Schwarzkopf. "What does it take to make Google work at scale". https://docs.google.com/presentation/d/1OvJStE8aohGeI3y5BcYX8bBHwoHYCPu99A3KTTZElr0
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
34
Image source: Malte Schwarzkopf. "What does it take to make Google work at scale". https://docs.google.com/presentation/d/1OvJStE8aohGeI3y5BcYX8bBHwoHYCPu99A3KTTZElr0
Google Belgium Data Center
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
Google Data Centers About 40 data centers
About 2 million machines
Machines are organized in containers each having 1,160 machines
30 racks of 40 machines
Sometimes double stacked
35
Data sources: James Pearn, “How many servers does Google have?” https://plus.google.com/+JamesPearn/posts/VaQu9sNxJuY “Learn How Google Works: in Gory Detail”. http://www.ppcblog.com/how-google-works
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
Google Data Size Data too large
130 trillion pages
Index 100 PB (stacking 2TB drives up: 0.8 mile)
Demand too much
3 billion searches per day (or 35K per second)
36
Data sources: https://www.google.com/insidesearch/howsearchworks/thestory http://www.seobook.com/learn-seo/infographics/how-search-works.php http://www.ppcblog.com/how-google-works
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
How to Evaluate a Distributed System Well-known goals
Useful (solve your business need)
Performant (high throughput, low latency)
Elastic (you may add/remove nodes)
Scalable (adding nodes improves performance)
Fault tolerant (deal with failures)
In addition, I’d advocate
Flexible (scaling, model, interface, architecture)
http://BigAnalyticsPlatform.com 37 (C) 2017 Donghui Zhang
Shared Nothing Shared Storage
http://BigAnalyticsPlatform.com 38
Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program
For 30 years, DW were shared nothing.
Now they are all shared storage.
Gamma Teradata Netezza Vertica
DB2/PE SQL Server PDW
Greenplum Asterdata
SciDB
Redshift Spectrum Snowflake Microsoft SQL DW Google BigQuery
(C) 2017 Donghui Zhang
Why Shared Storage? Flexible Scaling!
http://BigAnalyticsPlatform.com 39
Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program
in minutes
(C) 2017 Donghui Zhang
Case Study: Snowflake (flexible scaling)
S3 DATA
STORAGE
COMPUTE
LAYER
VIRTUAL
WAREHOUSE
N
1
N
2
N
3
N
4 CLUSTER OF EC2 INSTANCES
DATA CACHE
VIRTUAL
WAREHOUSE
N
1
N
2
VIRTUAL
WAREHOUSE
N
1
N
2
N
3
N
4
N
5
N
6
N
7
N
8
CLOUD
SERVICES
AUTHENTICATION & ACCESS CONTROL
QUERY
OPTIMIZER
TRANSACTION
MANAGER
INFRASTRUCTURE
MANAGER SECURITY
METADATA
STORAGE
Database tables stored here
These disks are strictly used as
caches
40
Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
Case Study: Spark
http://BigAnalyticsPlatform.com 41
SparkSQL ML Streaming GraphX
Spark Core
RDD API DataFrame API
Standalone YARN MESOS Local
Java/Scala/Python/R shell/script
(C) 2017 Donghui Zhang
Case Study: Spark (Flexible Model)
http://BigAnalyticsPlatform.com 42
SparkSQL ML Streaming GraphX
Spark Core
RDD API DataFrame API
Standalone YARN MESOS Local
Java/Scala/Python/R shell/script
Not only SQL, but also ML, streaming, graph.
(C) 2017 Donghui Zhang
Case Study: Spark (Flexible Interface)
http://BigAnalyticsPlatform.com 43
SparkSQL ML Streaming GraphX
Spark Core
RDD API DataFrame API
Standalone YARN MESOS Local
Java/Scala/Python/R shell/script
You could access Spark using traditional JDBC.
Also, interactive session (in multiple languages).
Also, submit a script as a task.
(C) 2017 Donghui Zhang
Case Study: Spark (Flexible Architecture)
http://BigAnalyticsPlatform.com 44
SparkSQL ML Streaming GraphX
Spark Core
RDD API DataFrame API
Standalone YARN MESOS Local
Java/Scala/Python/R shell/script
May deploy on top of existing YARN or MESOS.
Could also be standalone.
Possible to add components.
(C) 2017 Donghui Zhang
How to Evaluate a Distributed System
http://BigAnalyticsPlatform.com 45
Lessons
• Flexibility is an important metric. • Spark is a flexible system. • Cloud DW: shared storage.
(C) 2017 Donghui Zhang
In addition to well-known goals
Useful, Performant, Elastic, Scalable, Fault tolerant
I’d advocate
Flexible (scaling, model, interface, architecture)
Content Why
What
History
Technical How-Tos
Career Advice
Conclusions
http://BigAnalyticsPlatform.com 46 (C) 2017 Donghui Zhang
Growing Need for Big Data Jobs
47
Source: https://www.indeed.com/jobtrends
10X in 5 years
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
Big Data Roles Chief Data Officer
Data Scientist
Data Engineer
Solutions Architect
Big Data Strategist
...... at least 15 more
48
Source: “Top 20 Big Data jobs and their responsibilities”. http://bigdata-madesimple.com/top-20-big-data-jobs-and-their-responsibilities
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
If You Want to Do Analytics Python
Numpy, Jupyter Notebook
Machine Learning
Scikit-learn
Practice at http://DrivenData.org
http://BigAnalyticsPlatform.com 49 (C) 2017 Donghui Zhang
If You Want to Do Big Data Platform Only for senior engineers
Practice at http://LeetCode.com
Embrace open source
Assemble a solution; don’t build from scratch
Consulting business: target medium-sized companies
http://BigAnalyticsPlatform.com 50 (C) 2017 Donghui Zhang
If You Want to Build A Startup Read some books about building a startup
Don’t assume you know users’ pain point
Throw away prototype code
Three key people must have good working relationship: What-To-Do, How-To-Do, and When-To-Do
When in doubt, keep it simple
Strive for a clean API (external and internal)
Do one thing really well first
http://BigAnalyticsPlatform.com 51 (C) 2017 Donghui Zhang
Stonebraker’s Startup Loop while (true)
{
1. Talk with users to find their pain;
2. Brainstorm with professors;
3. Recruit students to build a prototype;
4. Draw a quadrant; E.g.
5. Co-found a VC-backed startup;
6. Play banjo; write papers; give talks; receive awards;
}
E.g. Streambase, Vertica, VoltDB, Paradigm4, Tamr, …
E.g. Received ACM Turing Award 2014
52
Small Big
Simple
Complex
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
Content Why
What
History
Technical How-Tos
Career Advice
Conclusions
http://BigAnalyticsPlatform.com 53 (C) 2017 Donghui Zhang
Conclusions All “biggies” have big-data platform
Shared nothing shared storage
Leverage on open source: pick/compose/expand
Flexibility is a key metric for distributed systems
http://BigAnalyticsPlatform.com 54 (C) 2017 Donghui Zhang