An Introduction to Predictive Analytics with Big Data and Open Source tools
Joe Heary
CTO & VP of Technical Operations
Zimmerman Associates, Inc. (ZAI)
November 5, 2015
What is Predictive Analytics?
“A variety of statistical techniques from modeling, machine learning, and data
mining that analyze current and historical facts to make predictions about future, or otherwise
unknown, events.” - Wikipedia
11/5/2015 Leveraging Data to Lead 2
Predicting the Future
• Not really about “predicting the future”
• About using Data, Statistical Models, and Machine Learning to identify the likelihood of future outcomes from which we make decisions
• Produce new insights that lead to better actions
Machine Learning
• Evolved from pattern recognition and computational learning theory in artificial intelligence
• Construction of algorithms that can learn from data
• Algorithms build models from example inputs to make data-driven predictions rather than following static program instructions
Siegel, E. (2013). Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die. Hoboken: Wiley
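The contrast between a learned model and static program instructions can be made concrete with a minimal sketch. It fits a straight line to example inputs by ordinary least squares and then predicts an unseen value. The function names and the three example points are hypothetical, chosen only for illustration, and least squares stands in for whichever learning algorithm a real project would use.

```python
def fit_line(xs, ys):
    """Learn slope and intercept from example (x, y) pairs by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training examples (not from the presentation).
model = fit_line([1, 2, 3], [2, 4, 6])
print(predict(model, 4))  # 8.0
```

Nothing in `predict` hard-codes the relationship between input and output; it comes entirely from the examples passed to `fit_line`.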
What is Big Data?
“Big data is a collection of data from traditional and digital sources inside and outside your company that
represents a source for ongoing discovery and analysis.”
-- Lisa Arthur, Forbes / CMO Network
Refers to the AMOUNT of data in terms of:
• VOLUME: the amount of data being generated
• VARIETY: the type of data (pictures, videos, text, audio, etc.)
• VELOCITY: the speed at which data is created or changes
• VERACITY: the truthfulness or adherence to the truth
• VALUE: the relative value of data to an organization
Big Data due to convergence of…
• Moore’s Law
• Mobile Computing
• Social Networking
• Cloud Computing
Data Growth
• Atlantic Ocean = (est.) 100 billion billion gallons of water
• As of 2010, we create 2.5 quintillion (10^18) bytes of data daily
• If 1 gallon = 1 byte…
- Ken Gabriel, Director of DARPA, March 2012
• The Atlantic Ocean could only contain the data created in 2010
- Eric Schmidt, CEO of Google, 2010
• Approx. 80% of all data is “unstructured”
Social Media’s Impact on Data Growth
2010: Eric Schmidt, then CEO of Google, estimates that we now create as much data every 2 days as we did from the dawn of time through 2003
Source: Skloog Blog
Data Processing before Big Data
NoSQL and Hadoop
• Hadoop: Big Data software framework for storing data and running applications on clusters of commodity hardware. Has the ability to handle virtually limitless concurrent tasks or jobs.
• NoSQL: Non-relational database in which data is stored and accessed using a model other than the tabular relationships typical of Relational Database Management Systems (RDBMS)
SQL vs. NoSQL
Vaes, Karim. "Database Variants Explained : SQL or NoSQL? Is That Really the Question?" Random Thoughts on Various Topics by an Information Technology Architect. Karim Vaes, 21 Jan. 2015. Web. 3 Nov. 2015.
NoSQL DBs Classified by Data Model
• Column: Accumulo, Cassandra, Druid, HBase, Vertica
• Document: Clusterpoint, Apache CouchDB, Couchbase, MarkLogic, MongoDB, OrientDB
• Key-value: Dynamo, FoundationDB, MemcacheDB, Redis, Riak, FairCom c-treeACE, Aerospike, OrientDB
• Graph: Allegro, Neo4J, InfiniteGraph, OrientDB, Virtuoso, Stardog
• Multi-model: OrientDB, FoundationDB, ArangoDB, Alchemy Database, CortexDB
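To make the data models listed here less abstract, the sketch below shapes one record four ways using plain Python structures. The customer record, keys, and the `bought_by` helper are hypothetical and do not reflect the API of any product named on the slide; real stores add persistence, indexing, and query languages on top of these shapes.

```python
import json

# Key-value: the store sees only an opaque value behind a key.
key_value = {"customer:42": json.dumps({"name": "Ada"})}

# Document: a nested, self-describing record that can be queried by field.
document = {"_id": 42, "name": "Ada", "orders": [{"sku": "A1", "qty": 2}]}

# Column (wide-column): row key -> column family -> column -> value.
column_family = {"42": {"profile": {"name": "Ada"}, "orders": {"A1": 2}}}

# Graph: nodes plus labeled edges between them.
nodes = {42: {"type": "customer"}, "A1": {"type": "product"}}
edges = [(42, "BOUGHT", "A1")]


def bought_by(customer_id):
    """Follow BOUGHT edges from a customer node to product nodes."""
    return [dst for src, label, dst in edges
            if src == customer_id and label == "BOUGHT"]


print(bought_by(42))  # ['A1']
```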
Hadoop Distributed Filesystem (HDFS)
• Brings compute resources to the data
• Implements MapReduce to aggregate into useable summary data
Hadoop Distributed Filesystem (HDFS)
[Figure: a Client and a Name Node connected over a TCP/IP network to Data Nodes A, B, C, and D; blocks 1 through 5 are replicated across the Data Nodes, each block number appearing three times.]
Name Node metadata: Data X -> 1, 2, 3; Data Y -> 4, 5
• Name Node contains metadata and location of the data
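The Name Node's role as a lookup table can be sketched in a few lines. The file-to-block mapping (Data X -> 1, 2, 3 and Data Y -> 4, 5) comes from the slide; the block placement does not. The round-robin `place_blocks` helper and the replication factor of three are assumptions made for illustration, and real HDFS uses a rack-aware placement policy rather than this toy scheme.

```python
# File-to-block metadata as shown on the slide.
FILE_BLOCKS = {"Data X": [1, 2, 3], "Data Y": [4, 5]}

# Assumed cluster layout: four Data Nodes, three replicas per block.
DATA_NODES = ["A", "B", "C", "D"]
REPLICATION = 3


def place_blocks(blocks, nodes, replication):
    """Toy placement: put each block on `replication` distinct nodes, round-robin."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement


BLOCK_LOCATIONS = place_blocks([1, 2, 3, 4, 5], DATA_NODES, REPLICATION)


def locate(file_name):
    """What a client learns from the Name Node: the blocks and where replicas live."""
    return {block: BLOCK_LOCATIONS[block] for block in FILE_BLOCKS[file_name]}


print(locate("Data Y"))  # {4: ['D', 'A', 'B'], 5: ['A', 'B', 'C']}
```

The client asks the Name Node only for locations; the block contents themselves are read directly from the Data Nodes.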
MapReduce in Hadoop Filesystem
[Figure: Big Data as four Input Data splits, each fed to a Map task, followed by Shuffle/Sort, two Reduce tasks, an Aggregate step, and the Output.]
• No rows of data like RDBMS, only key-value pairs
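The Map, Shuffle/Sort, and Reduce stages can be illustrated with a single-process word count. This is a minimal sketch of the programming model only, not Hadoop's API: the function names and input strings are hypothetical, and a real job would run the map and reduce tasks in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (key, value) pair for every word in one input split."""
    for word in split.split():
        yield word.lower(), 1


def shuffle_sort(pairs):
    """Shuffle/Sort: group all values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(splits):
    """Aggregate: run all stages and collect the output as key-value pairs."""
    mapped = [pair for split in splits for pair in map_phase(split)]
    return dict(reduce_phase(key, values) for key, values in shuffle_sort(mapped))


print(run_job(["big data", "data to lead", "big big data"]))
# {'big': 3, 'data': 3, 'lead': 1, 'to': 1}
```

Every stage consumes and produces key-value pairs, which is why the data never needs to be arranged into rows.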
Marketing Campaign
• 1,000,000 prospects
• $2 each to mail ($2M)
• 1% (1 out of 100) will buy (10,000)
• $220 revenue per sale
Revenue: $220 x 10,000 = $2,200,000
Cost: $2 x 1,000,000 = $2,000,000
Profit = $2,200,000 - $2,000,000 = $200,000
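The arithmetic for this mailing can be checked with a short sketch. The figures are the ones on the slide; the function and parameter names are hypothetical.

```python
def campaign_profit(prospects, response_rate, cost_per_mail, revenue_per_sale):
    """Profit of a mailing: revenue from buyers minus the cost of every piece mailed."""
    buyers = round(prospects * response_rate)
    revenue = revenue_per_sale * buyers
    cost = cost_per_mail * prospects
    return revenue - cost


# 1,000,000 prospects, 1% response, $2 per mailing, $220 per sale.
baseline = campaign_profit(1_000_000, 0.01, 2, 220)
print(baseline)  # 200000
```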
Assigning a Predictive Score
Siegel, E. (2013). Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die. Hoboken: Wiley
Targeted Marketing with PA
• PA results tell us which prospects are likely to respond
• Identify the 25% of prospects on the list who are 3x more likely to respond
• 1M reduced to 250,000 with a 3% response rate (7,500)
• $220 revenue per sale
Revenue: $220 x 7,500 = $1,650,000
Cost: $2 x 250,000 = $500,000
Profit = $1,650,000 - $500,000 = $1,150,000 (a 475% increase in profit over the $200,000 from the untargeted mailing)
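The targeted figures follow from the untargeted campaign once the list fraction and the lift in response rate are known. A minimal sketch, assuming the slide's numbers (25% of the list, 3x the 1% base response rate, $2 per mailing, $220 per sale); the function names are hypothetical.

```python
def targeted_campaign(prospects, base_rate, list_fraction, lift,
                      cost_per_mail=2, revenue_per_sale=220):
    """Mail only the top-scored fraction of the list, which responds at base_rate * lift."""
    mailed = round(prospects * list_fraction)
    buyers = round(mailed * base_rate * lift)
    profit = revenue_per_sale * buyers - cost_per_mail * mailed
    return mailed, buyers, profit


def percent_increase(old, new):
    """Relative change from old to new, as a percentage."""
    return (new - old) / old * 100


mailed, buyers, profit = targeted_campaign(1_000_000, 0.01, 0.25, 3)
print(mailed, buyers, profit)             # 250000 7500 1150000
print(percent_increase(200_000, profit))  # 475.0
```

Most of the gain comes from the $1,500,000 saved on mailings that were unlikely to produce a sale, even though the number of buyers drops from 10,000 to 7,500.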
Recommendations: Similar to Others
Recommendations: Closer to Home
Top 20 Open Source PA Software
• There are several Open Source and Freeware products available to perform Predictive Analytics
• “R” is one of the most popular, but the link below will provide plenty to choose from
http://www.predictiveanalyticstoday.com/top-predictive-analytics-freeware-software/
Wrap-up and bring it home
• Convergence of technology leads to Big Data
• Your best bet is listening to what the data tells you rather than asking for an answer to a question that you already know the answer to
• The real benefit of Predictive Analytics is the ability to find patterns in data that you were not aware of before
• Creating new markets and new opportunities based on data analysis
• Using Predictive Analytics with Big Data is truly using data to lead!
Question & Answer