Alex Cheng of Baidu: "Big Data: A New Frontier"
Big Data: A New Frontier
Alex Cheng, VP Baidu 2013-4-12
5 billion+ Search Queries
~4 million Posts on PostBar
~500 million Users
100 million+ Mobile Search Users
~500,000 Business Clients
Every day at Baidu
Storage
Processing
Analytics & Prediction
Data Intelligence
Volume
Velocity
Variety
Value
Web Pages & Links 100+ PB
Logs 100+ PB
UGC 1 PB
Web
News
PostBar
Encyclopedia
Knows
Searches, Clicks, Posts etc.
1 petabyte = 2x National Library of China
Logs 100+ PB
UGC 1+ PB
[Chart: data volume by year, 2005-2012, drawn in blocks of 100 PB]
• 95% of the data was created within the last 3 years
• 100 PB of new data is processed every day
Growth: 100%+ YoY
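The growth rate and the "95% within the last 3 years" figure can be related with a little arithmetic. The sketch below assumes the total volume multiplies by a constant factor each year and that nothing is deleted; neither assumption is stated on the slides, and the function names are illustrative.

```python
def share_created_recently(yoy_growth: float, years: int) -> float:
    """Fraction of today's total that was created in the last `years` years.

    Assumes (not stated in the slides) that the total volume is multiplied
    by (1 + yoy_growth) every year and that no data is ever deleted.
    """
    factor = (1.0 + yoy_growth) ** years
    return 1.0 - 1.0 / factor


def growth_needed(share: float, years: int) -> float:
    """Constant YoY growth rate at which `share` of the data is `years` old or newer."""
    return (1.0 / (1.0 - share)) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    # Doubling every year (100% YoY) puts 87.5% of the data in the last 3 years.
    print(f"{share_created_recently(1.0, 3):.1%}")
    # A 95% share within 3 years corresponds to roughly 171% YoY growth.
    print(f"{growth_needed(0.95, 3):.0%}")
```

Under these assumptions, a 95% share is consistent with the "100%+" label: it implies growth somewhat faster than doubling each year.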
Hardware Innovations
• Custom ARM-based Servers
• Gigabit Switches
• Custom SSD/Flash Storage
TCO -25%
Density +70%
PUE 1.18 / 1.37 (#1)
Non-cooling hours 48%
Custom Rack Uptime Efficiency 10x
Performance 2x
Cost -48%
Baidu Cloud IDC, Yangquan, Shanxi, China
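For readers unfamiliar with PUE (power usage effectiveness): it is the ratio of total facility energy to the energy delivered to IT equipment, so 1.0 would mean no overhead at all. The sketch below only illustrates what values such as 1.18 and 1.37 mean in terms of overhead; the slide does not say what the two figures refer to, and the function names and sample energy numbers are illustrative.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power usage effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


def overhead_share(pue_value: float) -> float:
    """Fraction of total facility energy that does not reach IT equipment
    (cooling, power conversion losses, lighting, and so on)."""
    return 1.0 - 1.0 / pue_value


if __name__ == "__main__":
    # Hypothetical meter readings: 118 kWh drawn by the site, 100 kWh used by IT gear.
    print(round(pue(118.0, 100.0), 2))
    for value in (1.18, 1.37):
        print(f"PUE {value}: {overhead_share(value):.1%} of energy is overhead")
```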
Software Innovations
• Global Optimization
• Multiple Replication
• Data Distribution
• Partial Update
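The slides do not describe how replication and data distribution are implemented. As a generic illustration only, the sketch below places each data block on several distinct nodes using rendezvous (highest-random-weight) hashing, a standard technique chosen here for brevity; the node names, block IDs, and function name are hypothetical.

```python
import hashlib


def replica_nodes(block_id: str, nodes: list[str], replicas: int = 3) -> list[str]:
    """Pick `replicas` distinct nodes for a block via rendezvous hashing.

    Each (node, block) pair gets a pseudo-random weight; the block is stored
    on the nodes with the highest weights. This is an illustrative technique,
    not the scheme described in the slides.
    """
    def weight(node: str) -> str:
        return hashlib.sha256(f"{node}:{block_id}".encode("utf-8")).hexdigest()

    return sorted(nodes, key=weight, reverse=True)[:replicas]


if __name__ == "__main__":
    cluster = ["node-a", "node-b", "node-c", "node-d", "node-e"]
    print(replica_nodes("block-42", cluster))
```

A useful property of this scheme is that removing a node that holds no replica of a block leaves that block's placement unchanged, so only the data on a failed node has to move.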
MONOLITHIC HW
TRADITIONAL RELATIONAL DATABASE
DIRECT RECORD ACCESS OR QUERIES
TRADITIONAL SERVER STACK
MAPREDUCE
NOSQL DATABASE
PARALLEL RELATIONAL DATABASE
HADOOP
DISTRIBUTED HARDWARE
NEW SERVER STACK
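To make the MapReduce element of the new stack concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases, counting queries in a search log. The tab-separated log format and all function names are assumptions made for illustration; a real Hadoop job would distribute these phases across machines.

```python
from collections import defaultdict


def map_phase(log_line: str):
    """Emit (query, 1) for one log line. Assumed format: '<user>\\t<query>'."""
    _user, query = log_line.split("\t", 1)
    yield query, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]):
    """Sum the counts emitted for one query."""
    return key, sum(values)


def run_job(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(["u1\tweather", "u2\tnews", "u3\tweather"]))
```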
• Real-time online learning
• Tens of billions of training samples
• Billions of complex features
Feature extraction
Model Training
Models
Query
Advanced Search Module
CTR-server
Logs
Offline
Online
Big Data + Web Search
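The slides name real-time online learning for click-through-rate (CTR) prediction but not the algorithm. As one common way such a loop can work, the sketch below uses logistic regression updated by stochastic gradient descent, one logged impression at a time. The class name, feature strings, and learning rate are all illustrative assumptions.

```python
import math


class OnlineCTRModel:
    """Logistic regression over sparse binary features, trained one sample at a time.

    An illustrative stand-in for the CTR server in the slides, not Baidu's model.
    """

    def __init__(self, learning_rate: float = 0.1):
        self.lr = learning_rate
        self.weights: dict[str, float] = {}

    def predict(self, features: list[str]) -> float:
        """Estimated click probability for an impression with these features."""
        z = sum(self.weights.get(f, 0.0) for f in features)
        z = max(-35.0, min(35.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features: list[str], clicked: bool) -> float:
        """One SGD step on the log loss; returns the prediction made before the update."""
        p = self.predict(features)
        gradient = p - (1.0 if clicked else 0.0)
        for f in features:
            self.weights[f] = self.weights.get(f, 0.0) - self.lr * gradient
        return p


if __name__ == "__main__":
    model = OnlineCTRModel()
    # Hypothetical log stream: ad A is clicked for this query, ad B is not.
    for _ in range(200):
        model.update(["query=q", "ad=A"], clicked=True)
        model.update(["query=q", "ad=B"], clicked=False)
    print(model.predict(["query=q", "ad=A"]), model.predict(["query=q", "ad=B"]))
```

Because each update touches only the features present in one sample, the model can keep learning from the log stream while it serves predictions, which is the offline/online split the diagram suggests.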
• Real-time Dictionary Updates
• Dynamic Result Modeling
• High-frequency Inputs Recommendation
Big Data + IME
User Input
NLP Module
Consolidated Search Result
On-Device Quick Search
Cloud-based Dictionary
Device-based Dictionary
Output
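The IME diagram shows candidates from a device-based dictionary and a cloud-based dictionary being merged into a consolidated result. The slides give no detail on the merge, so the sketch below is a guess at the simplest version: look up the typed prefix in both dictionaries, keep the best score per candidate, and rank. The dictionary layout, scores, and function name are hypothetical.

```python
def consolidate(prefix: str, device_dict: dict, cloud_dict: dict, limit: int = 5) -> list[str]:
    """Merge candidate words for `prefix` from both dictionaries.

    Each dictionary maps a typed prefix to a list of (word, score) pairs
    (a hypothetical layout). A word found in both keeps its higher score.
    """
    best: dict[str, float] = {}
    for source in (device_dict, cloud_dict):
        for word, score in source.get(prefix, []):
            best[word] = max(score, best.get(word, float("-inf")))
    # Highest score first; ties broken by the word itself for a stable order.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _score in ranked[:limit]]


if __name__ == "__main__":
    device = {"bai": [("白", 0.6), ("百", 0.5)]}
    cloud = {"bai": [("百度", 0.9), ("百", 0.7)]}
    print(consolidate("bai", device, cloud))
```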
Voice
Images
• 10+ Billion Training Examples
• Heterogeneous Features
• Intensive Computing
Deep Learning
The Future of Big Data “Digital Universe”
[Chart: size of the “Digital Universe” in exabytes, 2009-2020; y-axis ticks at 10,000, 20,000, 30,000, and 40,000]
Machine-generated Sensor Data
“Anytime, Anywhere, Any Devices”
Smartphone, Smart Home, Wearable Devices, Smart Car, …