Tech4Africa - Opportunities around Big Data
Agenda
• Hardware, Software, Data
• Situational Applications
• Big Data
Situational Applications
– eaghra (Flickr)
Web 2.0 Era Topic Map
• Situational Applications
• Mashups
• Data Explosion
• Social Platforms
• Enterprise SOA
• LAMP
• Publishing Platforms
• Inexpensive Storage
• Web 2.0: produce & process new information
Big Data
– blmiers2 (Flickr)
The data just keeps growing…
1024 GIGABYTES = 1 TERABYTE
1024 TERABYTES = 1 PETABYTE
1024 PETABYTES = 1 EXABYTE

1 PETABYTE = 13.3 years of HD video
20 PETABYTES = amount of data processed by Google daily
5 EXABYTES = all words ever spoken by humanity
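The unit ladder above is just repeated multiplication by 1024; a few lines of Python make the scale concrete:

```python
# Sanity-check the storage-unit arithmetic above (binary units, powers of 1024).
GIGABYTE = 1024 ** 3            # bytes
TERABYTE = 1024 * GIGABYTE
PETABYTE = 1024 * TERABYTE
EXABYTE = 1024 * PETABYTE

print(TERABYTE // GIGABYTE)     # 1024 gigabytes per terabyte
print(PETABYTE // TERABYTE)     # 1024 terabytes per petabyte
print(EXABYTE // PETABYTE)      # 1024 petabytes per exabyte
```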
Web as a Platform
• Web 1.0 - Connecting Machines: Infrastructure
• Web 2.0 - Connecting People: API Foundation (Facebook, Twitter, LinkedIn, Google, Netflix, PayPal, eBay, Pandora, New York Times)

The Fractured Web
• Service Economy: a service for this, a service for that
• App Economy for Devices: an app for this, an app for that
• Web 2.0: data exhaust of historical and real-time data
• Real-time data: mobile, set-top boxes, tablets, etc.
• Sensor Web: an instrumented and monitored world, with multiple sensors in your pocket

Opportunity
Data Deluge! But filter patterns can help…
Kakadu (Flickr)
Filtering With Search
Filtering Socially
Filtering Visually
But filter patterns force you down a pre-processed path
M.V. Jantzen (Flickr)
What if you could ask your own questions?
– wowwzers (Flickr)
– MrB-MMX (Flickr)
And go from discovering Something about Everything…
To discovering Everything about Something ?
How do we do this?
Let's examine a few techniques for Gathering, Storing, Processing & Delivering Data @ Scale.
Gathering Data
Data Marketplaces
Gathering Data
Apache Nutch (Web Crawler)
Storing, Reading and Processing - Apache Hadoop
• Cluster technology with a single master, scaling out with multiple slaves
• It consists of two runtimes: the Hadoop Distributed File System (HDFS) and Map/Reduce
• As data is copied onto HDFS, it is split into blocks and replicated to other machines for redundancy
• A self-contained job (workload) is written in Map/Reduce and submitted to the Hadoop master, which in turn distributes the job to the slaves in the cluster
• Jobs run on data held on the local disks of the machines they are sent to, ensuring data locality
• Node (slave) failures are handled automatically: Hadoop may execute or re-execute a job on any node in the cluster
Want to know more? “Hadoop – The Definitive Guide (2nd Edition)”
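The Map/Reduce flow above can be sketched in a few lines of single-process Python. This is illustrative only; a real Hadoop job is written against the Hadoop Map/Reduce API and distributed across the cluster, and the sample input lines here are invented.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big opportunity", "data at scale"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

The same three-phase structure holds at cluster scale; Hadoop's contribution is running the map and reduce functions next to the data blocks on each slave.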
Delivering Data @ Scale
• Structured data
• Low latency & random access
• Column stores (Apache HBase or Apache Cassandra):
  • faster seeks
  • better compression
  • simpler scale-out
• De-normalized - data is written as it is intended to be queried
Want to know more? “HBase – The Definitive Guide” & “Cassandra High Performance Cookbook”
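The "write data as it is intended to be queried" idea can be sketched with a plain dict standing in for an HBase/Cassandra-style row. The row key, column names, and sample values below are illustrative assumptions, not a real schema:

```python
# Hedged sketch: a dict modeling a de-normalized wide row in a column store.
# The access pattern drives the layout: everything about one company lives
# in one row, so a single random-access read answers the query (no joins).
store = {}

def put(company, year, column, value):
    row_key = company                     # row key = the thing we query by
    store.setdefault(row_key, {})[f"{year}:{column}"] = value

put("acme", 2010, "amount", 5_000_000)
put("acme", 2011, "amount", 12_000_000)
put("acme", 2011, "investor", "BigFund")

# One seek fetches the whole history -- the de-normalization pays off here.
row = store["acme"]
print(sorted(row))  # ['2010:amount', '2011:amount', '2011:investor']
```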
Storing, Processing & Delivering: Hadoop + NoSQL
• Gather: log files stream into HDFS via a Flume connector; web data arrives via a Nutch crawl; relational data (JDBC, e.g. MySQL) is imported with a Sqoop connector
• Read/Transform (Apache Hadoop): clean and filter data, transform and enrich data - often multiple Hadoop jobs
• Serve: results are copied to a NoSQL repository through a NoSQL connector/API; the application queries it with low latency
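The gather / clean-and-filter / transform-and-enrich / serve stages of the pipeline can be mimicked with chained Python generators. The log lines, filtering rule, and output shape below are invented stand-ins for whatever Flume, the Hadoop jobs, and the NoSQL store would actually handle:

```python
raw_log_lines = [
    "2011-07-01 GET /companies/acme 200",
    "2011-07-01 GET /favicon.ico 404",
    "2011-07-02 GET /companies/globex 200",
]

def gather(lines):                  # stand-in for Flume/Nutch/Sqoop ingest
    yield from lines

def clean(records):                 # Hadoop job 1: drop noise
    for r in records:
        if " 200" in r and "/companies/" in r:
            yield r

def enrich(records):                # Hadoop job 2: parse into structured rows
    for r in records:
        date, _, path, _ = r.split()
        yield {"date": date, "company": path.rsplit("/", 1)[-1]}

def serve(rows):                    # load into the low-latency store
    return {row["company"]: row for row in rows}

table = serve(enrich(clean(gather(raw_log_lines))))
print(sorted(table))  # ['acme', 'globex']
```

Each stage only sees the previous stage's output, which is exactly why "often multiple Hadoop jobs" chain together so naturally.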
Some things to keep in mind…
– Kanaka Menehune (Flickr)
Some things to keep in mind…
• Processing arbitrary types of data (unstructured, semi-structured, structured) requires normalizing data with many different kinds of readers
Hadoop is really great at this!
• However, readers won’t really help you process truly unstructured data such as prose. For that you’re going to have to get handy with Natural Language Processing. But this is really hard.
Consider using parsing services & APIs like Open Calais
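To see why prose resists simple readers, consider a naive capitalized-phrase "entity extractor". The sample sentence is invented; the false positive it produces is the point, and it is the kind of ambiguity NLP services like Open Calais exist to resolve:

```python
import re

text = "Yesterday Acme Corp raised funding from Sequoia Capital in Boston."

# Grab runs of capitalized words as candidate entities.
candidates = [c.strip() for c in re.findall(r"(?:[A-Z][a-z]+\s?)+", text)]
print(candidates)
# ['Yesterday Acme Corp', 'Sequoia Capital', 'Boston']
# "Yesterday" is wrongly glued onto "Acme Corp": a sentence-initial capital
# defeats the rule immediately. Real NLP needs grammar and context, not regex.
```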
Want to know more? “Programming Pig” (O’REILLY)
Open Calais (Gnosis)
Statistical real-time decision making
• Capture historical information
• Use machine learning to build decision-making models (such as classification, clustering & recommendation)
• Mesh real-time events (such as sensor data) against the models to make automated decisions
Want to know more? “Mahout in Action”
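The recommendation case can be sketched with item co-occurrence counting, the idea behind Mahout's item-based recommenders (which run the same computation as Hadoop jobs over far larger histories). The user histories below are invented:

```python
from collections import Counter
from itertools import combinations

# Historical information: sets of items each user interacted with.
histories = [
    {"hadoop", "hbase", "pig"},
    {"hadoop", "hbase"},
    {"hadoop", "cassandra"},
]

# Build the model: count how often each pair of items appears together.
cooc = Counter()
for items in histories:
    for a, b in combinations(sorted(items), 2):
        cooc[(a, b)] += 1
        cooc[(b, a)] += 1

def recommend(item, k=1):
    """Recommend the items most often seen alongside `item`."""
    scores = Counter({b: n for (a, b), n in cooc.items() if a == item})
    return [it for it, _ in scores.most_common(k)]

print(recommend("hbase"))  # ['hadoop'] -- seen together twice
```

Meshing a real-time event against the model is then just a lookup: an incoming "user viewed hbase" event triggers `recommend("hbase")`.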
Tech Bubble?
What does the Data Say?
– Pascal Terjan (Flickr)
Apache Nutch
• Identify optimal seed URLs & crawl to a depth of 2:
  http://www.crunchbase.com/companies?c=a&q=privately_held
• Crawl data is stored in segment dirs on the HDFS
Making the data STRUCTURED
• Prelim filtering on URL
• Retrieving HTML
• Company POJO, then tab-delimited output
Aargh!
My viz tool requires zipcodes to plot geospatially!
Apache Pig script to join on (City, State) to get the zip code and write the results to Vertica:

ZipCodes = LOAD 'demo/zipcodes.txt' USING PigStorage('\t') AS
    (State:chararray, City:chararray, ZipCode:int);
CrunchBase = LOAD 'demo/crunchbase.txt' USING PigStorage('\t') AS
    (Company:chararray, City:chararray, State:chararray, Sector:chararray,
     Round:chararray, Month:int, Year:int, Investor:chararray, Amount:int);
CrunchBaseZip = JOIN CrunchBase BY (City, State), ZipCodes BY (City, State);
STORE CrunchBaseZip INTO
    '{CrunchBaseZip(Company varchar(40), City varchar(40), State varchar(40),
      Sector varchar(40), Round varchar(40), Month int, Year int,
      Investor varchar(40), Amount int)}'
    USING com.vertica.pig.VerticaStorer('VerticaServer','OSCON','5433','dbadmin','');
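For readers without a Pig runtime, the JOIN step is sketched below in plain Python; the sample rows are invented stand-ins for the tab-separated files the script loads:

```python
# Match CrunchBase rows to zip codes on the composite key (City, State),
# mirroring: JOIN CrunchBase BY (City, State), ZipCodes BY (City, State).
zipcodes = [
    ("CA", "San Francisco", 94105),   # (State, City, ZipCode)
    ("MA", "Boston", 2110),
]
crunchbase = [
    # (Company, City, State, Sector, Round, Month, Year, Investor, Amount)
    ("Acme", "San Francisco", "CA", "Web", "A", 6, 2011, "BigFund", 5000000),
]

zip_by_city_state = {(city, state): z for state, city, z in zipcodes}

joined = [
    row + (zip_by_city_state[(row[1], row[2])],)
    for row in crunchbase
    if (row[1], row[2]) in zip_by_city_state
]
print(joined[0][-1])  # 94105 -- the zip code the viz tool needed
```

Pig's JOIN does the same matching as a distributed Map/Reduce job, so the dataset never has to fit in one machine's memory.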
Total Tech Investments By Year
Investment Funding By Sector
Total Investments By Zip Code for all Sectors
$7.3 Billion in San Francisco
$2.9 Billion in Mountain View
$1.2 Billion in Boston
$1.7 Billion in Austin
Total Investments By Zip Code for Consumer Web
$1.2 Billion in Chicago
$600 Million in Seattle
$1.7 Billion in San Francisco
Total Investments By Zip Code for BioTech
$1.3 Billion in Cambridge
$528 Million in Dallas
$1.1 Billion in San Diego
Questions?