big data: it's more than volume, paypal
DESCRIPTION
In this presentation, Nachum Shacham talks about the uses and qualities of Big Data and how they are utilised at PayPal, where he works. He talks about the ultimate goal of extracting business value, as well as unlocking the true value of your data through use of algorithms and sufficient data further down the long tail.
TRANSCRIPT
BIG DATA: IT’S MORE THAN VOLUME
Nachum Shacham
PayPal
Big Data Innovation Summit
April 2013
IT’S BIG-DATA TIME!
Volume: big platforms
Variety: multiple data types
Velocity: fast response
Value: a treasure of patterns
Ultimate Goal: Extract business value
TECHNOLOGY HYPE CYCLE
DM Tech Forum
BIG DATA
MIXED SIGNALS FROM THE PUNDITS
• Data Lake
• “Needle in a hay stack”
• “All hay no needles”
• “Yet another fad”
• “Noth’n new: we’ve been analyzing data for 30 years”
• “Store’em and they’ll come”
• “Don’t ever discard data”
• “$524.752MM ROI in 3 years”
• “Smart” …
• “Hadoop is free”
• “Just…”
USE YOUR OWN FILTER
• Sift facts from MBS
• Seek factual 1-liners
• See through metaphors
• Discount “Smart” (data, algorithms, systems)
• Be skeptical
UNLOCK THE VALUE IN BIG DATA
• Data Trumps Algorithms
• Sufficient data further down the long tail
• Wisdom of the crowd: effective recommendations
• Combine signals from different media
BUSINESS VALUE IN BIG DATA
RISK ANALYSIS
IDENTIFY INFLUENCERS IN SOCIAL GRAPH
ONLINE ADS
REVENUE OPTIMIZATION
FRAUD DETECTION AND PREVENTION
LET’S DIG INTO BIG DATA
• Define KPIs
• Explore
• Model & Measure
• Visualize signals
• Test
• Question test results
• Rinse and Repeat
BIG-DATA ANALYTICS: FROM SEMI-STRUCTURED DATA TO BUSINESS SIGNALS
MapAttempt TASK_TYPE="SETUP" TASKID="task_201212150932_52151_m_000051" TASK_ATTEMPT_ID="attempt_201212150932_52151_m_000051_0" TASK_STATUS="SUCCESS" Task TASKID="task_201212150932_52151_m_000051" TASK_TYPE="SETUP" TASK_STATUS="SUCCESS" FINISH_TIME="1355822133162" COUNTERS="{(FileSystemCounters)(FileSystemCounters)[(FILE_BYTES_WRITTEN)
• Similar goals, different challenges
• Leverage familiar tools for fast adoption
MPP PLATFORMS AS WORKBENCHES FOR BIG DATA AND THEIR TOOLS
Platforms: Cloud, RDBMS Data Warehouse, Hadoop
Tools: Hive, Pig, Java, Scala, SQL, Oozie, Streaming, Python, R, HBase, SQL++, MapReduce
CLASSES OF ANALYTICS JOBS
Big Data:
• Data organization for BI
• A few large models
• Many small models
DATA MANIPULATION, GRAPHICS
MODEL BUILDING, CROSS VALIDATION
PROBLEM FORMULATION, MR
MATCH THE JOB TO THE PLATFORM
R: THE TOOL FOR ALL ANALYTICS STEPS
• Data Sourcing
• Data Preparation
• Exploratory Data Analysis
• Predictive Models
• Visualization
• Reporting
BI-LINGUAL HADOOP STREAMING: LARGE SCALE PARALLEL PREDICTIVE MODELING
Data files
Mapper.py (text processing):
• Process lines
• Set sorting key and value
• Output <key, value>
Shuffle and sort
Reducer.R (model per segment):
• Collect segment data marked by key
• Process segment data
• Output processed segment data
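The mapper and reducer roles on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used at PayPal: the input layout (`segment,start_ms,end_ms`) is hypothetical, the reducer is written in Python here as a stand-in for the deck's Reducer.R, and a per-segment mean stands in for the "model per segment". The helper `run_locally` is also hypothetical; it emulates the shuffle/sort step in one process so the flow can be tried without a cluster.

```python
from itertools import groupby
from statistics import mean


def mapper(lines):
    """Process lines, set the sorting key and value, output <key, value>."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if len(fields) < 3:
            continue  # skip malformed lines
        # Assumed layout: segment,start_ms,end_ms
        segment, start_ms, end_ms = fields[0], int(fields[1]), int(fields[2])
        yield f"{segment}\t{end_ms - start_ms}"


def reducer(sorted_lines):
    """Collect the data marked by each key, process it, output the result."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for segment, group in groupby(pairs, key=lambda kv: kv[0]):
        durations = [int(value) for _, value in group]
        # Stand-in for "model per segment": here just a count and a mean.
        yield f"{segment}\t{len(durations)}\t{mean(durations):.1f}"


def run_locally(lines):
    """Emulate map -> shuffle/sort -> reduce in a single process."""
    return list(reducer(sorted(mapper(lines))))
```

Because the reducer only relies on its input arriving grouped by key, the same function body could read from standard input when run under Hadoop Streaming, where the framework performs the sort.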
SEMI-STRUCTURED DATA TABULAR DATA
Meta VERSION="1" .
Job JOBID="job_201212150932_52151" JOBNAME="DataFilter" USER="user1234" SUBMIT_TIME="1355822133394" JOBCONF="hdfs://tmp/hadoop-hadoop/mapred/staging/user1234/\.staging/job_201212150932_52151/job\.xml" VIEW_JOB=" " MODIFY_JOB=" " JOB_QUEUE="B" .
Job JOBID="job_201212150932_52151" JOB_PRIORITY="NORMAL" .
Job JOBID="job_201212150932_52151" LAUNCH_TIME="1355822223576" TOTAL_MAPS="50" TOTAL_REDUCES="0" JOB_STATUS="PREP" .
Task TASKID="task_201212150932_52151_m_000051" TASK_TYPE="SETUP" START_TIME="1355822133148" SPLITS="" .
MapAttempt TASK_TYPE="SETUP" TASKID="task_201212150932_52151_m_000051" TASK_ATTEMPT_ID="attempt_201212150932_52151_m_000051_0" START_TIME="1355822133545" TRACKER_NAME="tracker_dn0492\.ebay\.com:localhost\.localdomain/127\.0\.0\.1:33613" HTTP_PORT="50060" .
MapAttempt TASK_TYPE="SETUP" TASKID="task_201212150932_52151_m_000051" TASK_ATTEMPT_ID="attempt_201212150932_52151_m_000051_0" TASK_STATUS="SUCCESS" Task TASKID="task_201212150932_52151_m_000051" TASK_TYPE="SETUP" TASK_STATUS="SUCCESS" FINISH_TIME="1355822133162" COUNTERS="{(FileSystemCounters)(FileSystemCounters)[(FILE_BYTES_WRITTEN)(FILE_BYTES_WRITTEN)(27089)]}{(org\.apache\.hadoop\.mapred\.Task$Counter)(Map-Reduce Framework)[(SPILLED_RECORDS)(Spilled Records)(0)]}" .
Job JOBID="job_201212150932_52151" JOB_STATUS="RUNNING" .
Task TASKID="task_201212150932_52151_m_000001" TASK_TYPE="MAP" START_TIME="1355822133163"
attempt,201212171719,248176,m,000013,0,1355499674337,1355499903213,MAP,SUCCESS,default,rack3,lvsaishdc3dn0109,0109
attempt,2012121771719,248176,m,000464,0,1355501042650,1355501253259,MAP,SUCCESS,default,rack5,lvsaishdc3dn0217,0217
attempt,2012121771719,248176,m,000626,0,1355501212902,1355501366476,MAP,SUCCESS,default,rack17,lvsaishdc3dn0776,0776
attempt,2012121771719,248176,m,001193,0,1355499673762,1355499887662,MAP,SUCCESS,default,rack8,lvsaishdc3dn0366,036
attempt,2012121771719,248176,m,001355,0,1355499673545,1355499908182,MAP,SUCCESS,default,rack9,lvsaishdc3dn0386,0386
attempt,2012121771719,248176,m,001517,0,1355501266524,1355501470527,MAP,SUCCESS,default,rack5,lvsaishdc3dn0236,0236
attempt,2012121771719,248176,m,001850,0,1355501303142,1355501486691,MAP,SUCCESS,default,rack5,lvsaishdc3dn0235,0235
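The step from the semi-structured records to the tabular rows above might look roughly like the following. It is a sketch only: the record grammar (a record type followed by `KEY="value"` pairs, with `\.` as an escaped dot) is inferred from the excerpt on this slide, and the function names and the choice of columns are hypothetical.

```python
import re

# KEY="value" pairs; values may contain backslash-escaped characters.
PAIR = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_record(line):
    """Split one history record into its type and a dict of attributes."""
    record_type, _, rest = line.strip().partition(" ")
    # Undo the "\." escaping seen in host names and paths in the excerpt.
    attrs = {key: value.replace("\\.", ".") for key, value in PAIR.findall(rest)}
    return record_type, attrs


def to_row(attrs, columns):
    """Flatten selected attributes into one comma-separated row."""
    return ",".join(attrs.get(col, "") for col in columns)
```

Missing attributes become empty fields, so every row has the same number of columns, which is what downstream tabular tools expect.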
FROM TABULAR DATA TO BI
PARALLEL SEGMENTED MODELING
[Diagram: MAPPERS and REDUCERS, each running R]
MODELS BUILT ON LARGE DATASETS
Meta VERSION="1" .
Job JOBID="job_201112150932_52151" JOBNAME="DataFilter" USER="user1234" LAUNCH_TIME="1324801865576"
TIME INTERVAL DATA
CONCURRENCY
PERCENTILES
TIME SERIES
WORD COUNT
REPRESENTATION
AVOID RAM LIMITATIONS
R STAT PROCESSING
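One way to read the "time interval data" to "concurrency" step on this slide is as a sweep over start and end events. The sketch below assumes each task is a `(start, end)` pair of timestamps with the end exclusive; the function names are hypothetical, and the deck does not say which method was actually used.

```python
def concurrency_profile(intervals):
    """Return [(time, running_count)] for every distinct event time.

    intervals: iterable of (start, end) pairs, end exclusive.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a task begins
        events.append((end, -1))    # a task finishes
    # Tuples sort by time first; at equal times -1 sorts before +1,
    # so a task ending at t does not overlap one starting at t.
    events.sort()
    profile, running = [], 0
    for time, delta in events:
        running += delta
        if profile and profile[-1][0] == time:
            profile[-1] = (time, running)  # keep one entry per time
        else:
            profile.append((time, running))
    return profile


def peak_concurrency(intervals):
    """Largest number of intervals running at any event time."""
    return max((count for _, count in concurrency_profile(intervals)), default=0)
```

The profile needs only the sorted events, not the full set of open intervals, which is one way to keep memory use modest on long logs.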
Cloud
R LEVERAGING RDBMS POWER
teradataR, SciDB-R
TERADATAR FUNCTIONS (SAMPLE)
Function Name    What it does
td.zscore        Z-score transformation
td.t.paired      T test, paired
td.cor           Correlation matrix
td.f.oneway      One-way F test
td.factanal      Factor analysis
td.freq          Frequency analysis
td.hist          Histograms
td.kmeans        K-means clustering
td.ks            Kolmogorov-Smirnov test
td.mode          Mode value of column
td.tapply        Apply a function over a database column
td.summary       Like R summary()
td.quantiles     Quantile values
td.rank          Rank
ANALYSIS OF A TABLE WITH > 1B ROWS
> library(RJDBC)
> library(teradataR)
> tdConnect("TD_WH", uid = tdlogin, pwd = tdpwd, database = "myVDM")
> system.time(myTbldf <- td.data.frame("myTbl"))
   user  system elapsed
  0.092   0.054 140.071
> dim(myTbldf)
[1] 1,131,670,269 9
> system.time(cor <- td.cor(myTbldf[3:9]))
   user  system elapsed
  0.021   0.003   6.722
          C         D          E          F          G         H          I
C 1.0000000 0.7096425 0.22154483 0.24186862 0.13354501 0.4954111 0.19577803
D 0.7096425 1.0000000 0.24272691 0.27590234 0.13358632 0.4279517 0.14634683
E 0.2215448 0.2427269 1.00000000 0.08940507 0.03734827 0.1631614 0.04401034
F 0.2418686 0.2759023 0.08940507 1.00000000 0.07664496 0.1686094 0.04744032
G 0.1335450 0.1335863 0.03734827 0.07664496 1.00000000 0.1247046 0.05837435
H 0.4954111 0.4279517 0.16316144 0.16860940 0.12470460 1.0000000 0.35395733
I 0.1957780 0.1463468 0.04401034 0.04744032 0.05837435 0.3539573 1.00000000
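The idea of delegating the heavy lifting to the database can be illustrated with a small sketch. It is not teradataR's implementation, which the deck does not show: here Python's built-in `sqlite3` stands in for the warehouse, `db_correlation` is a hypothetical helper, and the correlation uses the textbook sums formula, which can lose precision on badly scaled data.

```python
import math
import sqlite3


def db_correlation(conn, table, x, y):
    """Pearson correlation of columns x and y, aggregated inside the database.

    Only six numbers cross the connection, regardless of table size.
    table, x and y are interpolated into SQL, so pass trusted identifiers only.
    """
    query = (
        f"SELECT COUNT(*), SUM({x}), SUM({y}), "
        f"SUM({x} * {x}), SUM({y} * {y}), SUM({x} * {y}) FROM {table}"
    )
    n, sx, sy, sxx, syy, sxy = conn.execute(query).fetchone()
    cov = n * sxy - sx * sy      # n^2 times the covariance
    var_x = n * sxx - sx * sx    # n^2 times the variance of x
    var_y = n * syy - sy * sy    # n^2 times the variance of y
    return cov / math.sqrt(var_x * var_y)
```

A small in-memory table is enough to try it: create a table with two numeric columns, insert a few rows, and call `db_correlation(conn, "t", "c", "d")`. The client does constant work; the scan and the sums run where the data lives.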
CONCLUSION
• Big data is here. See through the hype
• Analyze big data to extract value
• Multiple technologies & analytics tools are out there
• Match platform, tools and approach
• Delegate massive processing to big clusters
Step Up, Dig In, & Have fun
QUESTIONS?
BIG DATA EMPOWERS ALGORITHMS
Banko & Brill, “Scaling to Very Very Large Corpora for Natural Language Disambiguation”