dunning strata-2012-27-02
1©MapR Technologies - Confidential
Expect More from Hadoop!
My Background
University, Startups
– Aptex, MusicMatch, ID Analytics, Veoh
– big data since before it was big
Open source
– even before the internet
– Apache Hadoop, Mahout, Zookeeper, Drill
– bought the beer at first HUG
MapR
Founding member of Apache Drill
MapR Technologies
Enterprise quality distribution for Hadoop
– Many extensions beyond basic Hadoop
Super strong team
– Long history of successful startups
Strong supporter of Apache Drill
– and open source in general
meta-Hadoop?
meta
Meta- (from Greek: μετά = "after", "beyond", "with", "adjacent", "self"), is a…
Beyond ≠ Answering yesterday’s problems
Philosophy First
What is History?
The study of the past
(what came before now)
What is the future?
(it comes after now)
But the future also has a past!
the future of the past is not the past of the future
Do you remember the future?
Those are yesterday’s answers
and also the seeds of tomorrow
Guys wearing Fedoras
Hadoop has a history
Hadoop also has a future
The Old Future of Hadoop
Implementing yet another Google paper
– Map-reduce and HDFS, and YARN and Tez
– more and more, but not really different
Eco-system additions (more Google papers)
– simpler programming (Hive and Pig and Crunch) (Sawzall, FlumeJava, etc.)
– key-value store (Bigtable)
– ad hoc query (Dremel)
– also not really different
Stands apart from other computing
– required by HDFS and other limitations
The New Future of Hadoop
Real-time processing
– Combines real-time and long-time
Integration with traditional IT
– No need to stand apart
Integration with new technologies
– Solr, Node.js, Twisted all should work directly on Hadoop
Fast and flexible computation
– Drill logical plan language
Example #1: Search Abuse
History matrix
One row per user
One column per thing
Recommendation based on cooccurrence
Cooccurrence gives item-item mapping
One row and column per thing
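The two matrices described above can be sketched in a few lines. This is a minimal illustration, not the Mahout implementation: the user names, item names, and the helper functions `history_matrix` and `cooccurrence` are all hypothetical, and raw counts stand in for whatever weighting or filtering a production system would apply.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy interaction log: (user, thing) pairs.
history = [
    ("alice", "hadoop"), ("alice", "solr"),
    ("bob", "hadoop"), ("bob", "solr"), ("bob", "storm"),
    ("carol", "storm"),
]

def history_matrix(pairs):
    """History matrix, stored sparsely: one row per user, holding the
    set of things (columns) that user has interacted with."""
    rows = defaultdict(set)
    for user, thing in pairs:
        rows[user].add(thing)
    return dict(rows)

def cooccurrence(rows):
    """Cooccurrence matrix: one row and column per thing, counting the
    users whose history contains both things."""
    counts = defaultdict(lambda: defaultdict(int))
    for things in rows.values():
        for a, b in combinations(sorted(things), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return {a: dict(bs) for a, bs in counts.items()}

rows = history_matrix(history)
cooc = cooccurrence(rows)
print(cooc["hadoop"])  # things that cooccur with "hadoop", with user counts
```

The result is an item-item mapping: looking up one thing returns the other things that share users with it.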
Cooccurrence matrix can also be implemented as a search index
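One way to read this claim: treat each item as a document whose indicator field lists the items it cooccurs with, then use a user's history as the query. The sketch below uses a plain in-memory inverted index as a stand-in for Solr; the `indicators` data and the functions `build_index` and `recommend` are assumptions made for illustration, and scoring by the number of matching terms is a simplification of what a real search engine would do.

```python
# Hypothetical indicator lists, as if derived from a cooccurrence matrix:
# each item is a "document" whose indicator field lists cooccurring items.
indicators = {
    "hadoop": ["solr", "storm"],
    "solr": ["hadoop", "storm"],
    "storm": ["hadoop", "solr"],
    "drill": ["hadoop"],
}

def build_index(docs):
    """Invert the indicator field: term -> set of item documents containing it."""
    index = {}
    for item, terms in docs.items():
        for term in terms:
            index.setdefault(term, set()).add(item)
    return index

def recommend(index, user_history, k=3):
    """Query the index with the user's history and rank items by the
    number of history terms that match their indicator field."""
    seen = set(user_history)
    scores = {}
    for term in seen:
        for item in index.get(term, ()):
            if item not in seen:  # skip items the user already has
                scores[item] = scores.get(item, 0) + 1
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda i: (-scores[i], i))[:k]

index = build_index(indicators)
recs = recommend(index, ["hadoop", "solr"])
print(recs)
```

With this framing, serving recommendations is an ordinary search request, so the recommender can share the indexing and sharding machinery of the search tier.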
[Diagram: Solr indexing. Labels: Complete history, Cooccurrence (Mahout), Item meta-data, Solr indexers, Index shards]
[Diagram: Solr search. Labels: User history, Web tier, Item meta-data, Solr indexers, Index shards]
Objective Results
At a very large credit card company
History is all transactions, all web interaction
Processing time cut from 20 hours per day to 3
Recommendation engine load time decreased from 8 hours to 3 minutes
Scaling Estimates – Twitter Fire hose
Old School – 8+ separate clusters, 20-25 nodes
– >3 Kafka nodes
– >2 TwitterLogger
– 5-10 Hadoop
– >3 Storm
– 3 zookeepers (or not?)
– NAS for web storage
– >2 web servers
MapR – one platform
– 5-10 nodes total, any node does any job
– Full HA included, backups included, disaster recovery included
Example #2: Web Technology
[Diagram: Fast analysis (Storm). Labels: Real-time data, Raw logs, Analytic output]
[Diagram: Large analysis (map-reduce). Labels: Raw logs, Analytic output]
[Diagram: Presentation tier (d3 + node.js). Labels: Browser query, Analytic output, Raw logs]
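In Example #2 the presentation tier reads analytic output that comes from both the fast (Storm) path and the large (map-reduce) path. The slides do not say how the two outputs are combined, so the following is only a sketch under one assumption: the batch job covers everything up to a cutoff time, the real-time path covers everything after it, and a query adds the two. The function names, the `(timestamp, tag)` log format, and the sample data are all hypothetical.

```python
from collections import Counter

def batch_counts(raw_logs, cutoff):
    """Large analysis: count events at or before the batch cutoff time."""
    return Counter(tag for t, tag in raw_logs if t <= cutoff)

def realtime_counts(raw_logs, cutoff):
    """Fast analysis: count only events newer than the batch cutoff."""
    return Counter(tag for t, tag in raw_logs if t > cutoff)

def answer_query(batch, realtime, tag):
    """Presentation tier: add both analytic outputs to answer a browser query."""
    return batch[tag] + realtime[tag]

# Hypothetical raw logs as (timestamp, hashtag) pairs.
raw_logs = [(1, "#mapr"), (2, "#strataconf"), (3, "#mapr"), (5, "#mapr")]
cutoff = 3  # assumed time of the last completed batch run

batch = batch_counts(raw_logs, cutoff)
realtime = realtime_counts(raw_logs, cutoff)
total = answer_query(batch, realtime, "#mapr")
print(total)
```

Because the two paths split the log at the cutoff, no event is counted twice, and the real-time counts can be discarded each time a new batch run completes.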
Old School Storm: Complex architecture
[Diagram. Labels: Twitter API, TwitterLogger, Kafka Cluster (three), Kafka API, Storm, Flume, Hadoop, HDFS Data, NAS, Web Data, Web-server, http]
MapR: One Platform with Streaming Writes
[Diagram. Labels: Twitter API, TwitterLogger, Catcher (two), Topic Queue, Storm, MapR, NFS (four), Web Data, Web-server, http, Optional MapReduce, HDFS API]
Users can also run extended analytics/MapReduce on the stored data
Objective Results
Real-time + long-time analysis is seamless
Web tier can be rooted directly on Hadoop cluster
No need to move data
The future is not what we thought it would be
It is better!
Get Involved!
Tweet: #strataconf #mapr @ted_dunning
Get Involved!
Join Apache Drill!
– [email protected]
– Follow @apachedrill
Join MapR!
– [email protected]
Download these slides
– http://www.mapr.com/company/events/strata-conference-2-2-27-13
Contact me:
– [email protected]
– [email protected]
– @ted_dunning