Yahoo! Hack India: Hyderabad 2013 | Building Data Products at Scale
DESCRIPTION
Sanket Patil speaking on Building Data Products At Scale
TRANSCRIPT
BUILDING DATA PRODUCTS AT SCALE
DATAWEAVE: WHAT DO WE DO?
• Aggregate large amounts of data publicly available on the web, and serve it to businesses in readily usable forms
• Serve actionable data through APIs, Visualizations, and Dashboards
• Provide a reporting and analytics layer on top of datasets and APIs
DATAWEAVE PLATFORM
[Diagram]
• Sources (unstructured, spread across sources, and temporally changing): Pricing Data, Open Government Data, Social Media Data
• Core: Big Data Platform
• Outputs: Data APIs, API Feeds, Data Services, Dashboards, Visualizations and Widgets
HOW DOES IT WORK - 1?
• Crawling/Scraping: from a large number of data sources
• Cleaning/Deduplication: remove as much noise as possible
• Data Normalization: represent related data together in standard forms
HOW DOES IT WORK - 2?
• Store/Index: store optimally to support several complex queries
• Create "Views": on top of data for easy consumption, through APIs, visualizations, dashboards, and reports
• Package data as a product: to solve a bunch of related pain points in a certain domain (e.g., PriceWeave for retail)
AGGREGATION AND EXTRACTION
[Diagram]
• Extraction Layer: offline extraction of factual data
• Aggregation Layer: distributed crawler infrastructure
• Source: public data on the web
AGGREGATION LAYER
Customized crawler infrastructure
• vertical-specific crawlers
• capable of crawling the "deep web"
Highly scalable
• 500+ websites on a daily basis
• more with the addition of hardware
Robust to failures (404s, timeouts, server restarts)
• stateless distributed workers
• crawl state maintained in a separate data store
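The last two points, stateless workers and crawl state kept in a separate store, can be sketched as follows. This is only an illustration, not DataWeave's implementation: `CrawlStateStore`, `worker_step`, and `fake_fetch` are hypothetical names, the store is an in-memory stand-in for a real data store, and the retry limit of 3 is an assumed value.

```python
from collections import deque


class CrawlStateStore:
    """In-memory stand-in for the separate data store holding crawl state."""

    def __init__(self, urls, max_attempts=3):  # max_attempts is an assumed value
        self.pending = deque(urls)
        self.attempts = {}
        self.done = {}
        self.failed = set()
        self.max_attempts = max_attempts

    def claim(self):
        """Hand out the next URL to crawl, or None when nothing is pending."""
        return self.pending.popleft() if self.pending else None

    def mark_done(self, url, body):
        self.done[url] = body

    def mark_failed(self, url):
        """Requeue the URL until it has failed max_attempts times."""
        self.attempts[url] = self.attempts.get(url, 0) + 1
        if self.attempts[url] < self.max_attempts:
            self.pending.append(url)
        else:
            self.failed.add(url)


def worker_step(store, fetch):
    """One unit of work. The worker itself keeps no state between calls,
    so any worker can die and be replaced without losing crawl progress."""
    url = store.claim()
    if url is None:
        return False
    try:
        store.mark_done(url, fetch(url))
    except Exception:  # a 404, timeout, etc.; a real crawler would be more selective
        store.mark_failed(url)
    return True


def fake_fetch(url):
    """Hypothetical fetcher that simulates a 404 for some URLs."""
    if "missing" in url:
        raise LookupError("404")
    return "<html>" + url + "</html>"


store = CrawlStateStore(["http://a.example/p1", "http://a.example/missing"])
while worker_step(store, fake_fetch):
    pass
```

Because all progress lives in the store, scaling out would amount to running `worker_step` in more processes against a shared store.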
DATA EXTRACTION LAYER
• Extract as many data points from crawled pages as possible
• Completely offline process, independent of crawling
• Highly parallelized -- scales in a straightforward manner
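A minimal sketch of offline, parallel extraction, assuming crawled pages are already stored as `(url, html)` pairs. The page markup, the regular expressions, and the field names are hypothetical. A thread pool stands in for whatever parallel runtime is actually used; for CPU-bound parsing a process pool or a cluster would be the more likely choice.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical patterns for an assumed page layout.
TITLE_RE = re.compile(r"<h1[^>]*>(.*?)</h1>", re.S)
PRICE_RE = re.compile(r'class="price"[^>]*>\s*(?:Rs\.?|INR)?\s*([\d,]+(?:\.\d+)?)')


def extract(page):
    """Pull data points out of one stored page. It needs no network access and
    no shared state, which is what makes the work easy to parallelize."""
    url, html = page
    title = TITLE_RE.search(html)
    price = PRICE_RE.search(html)
    return {
        "url": url,
        "title": title.group(1).strip() if title else None,
        "price": float(price.group(1).replace(",", "")) if price else None,
    }


def extract_all(pages, workers=4):
    """Run extraction over a crawl snapshot; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract, pages))


snapshot = [
    ("http://shop.example/1", '<h1> Phone X </h1><span class="price">Rs. 12,499</span>'),
    ("http://shop.example/2", "<h1>No price here</h1>"),
]
records = extract_all(snapshot)
```

Missing fields come back as `None` rather than raising, so one malformed page does not stop the batch.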
NORMALIZATION
[Diagram]
• Normalization Layer: machine learning techniques to remove noise, fill gaps in data, represent data, and cluster it
• Extraction Layer: offline extraction of factual data
• Knowledge Base
NORMALIZATION LAYER
• Remove noise, remove duplicates
• Gather data from multiple sources and fill "gaps" in info
• Normalize data points to a standard internal representation
• Cluster related data together (Machine Learning techniques)
• Build a "knowledge base" -- continuous learning
• "Human in the loop" for data validation
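The deduplication, normalization, and clustering steps above can be illustrated with a small sketch. It is deliberately crude: the token-signature `cluster_key` is a stand-in for the machine learning techniques the slide mentions, not a description of them, and the record fields (`source`, `title`, `price`) are assumed for the example. Gap filling, the knowledge base, and human validation are not shown.

```python
import re
from collections import defaultdict


def normalize_title(title):
    """Standard internal form: lowercase, punctuation removed, whitespace collapsed."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return " ".join(cleaned.split())


def cluster_key(title):
    """Order-insensitive token signature, a crude stand-in for ML clustering."""
    return " ".join(sorted(set(normalize_title(title).split())))


def normalize(records):
    """Drop exact duplicates, then group related records into clusters."""
    seen = set()
    clusters = defaultdict(list)
    for rec in records:
        fingerprint = (rec["source"], normalize_title(rec["title"]), rec["price"])
        if fingerprint in seen:  # same listing crawled twice
            continue
        seen.add(fingerprint)
        clusters[cluster_key(rec["title"])].append(rec)
    return dict(clusters)


raw = [
    {"source": "shop-a", "title": "Phone X (Black)", "price": 12499.0},
    {"source": "shop-a", "title": "phone x  black", "price": 12499.0},
    {"source": "shop-b", "title": "Black Phone-X", "price": 12999.0},
]
clusters = normalize(raw)
```

Here the second record is dropped as a duplicate of the first, and the two remaining listings from different sources end up in one cluster.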
DATA STORAGE AND SERVING
[Diagram]
• Outputs: Data APIs, Visualizations, Dashboards, Reports
• Serving Layer (highly responsive): indexes, views, filters, pre-computed results
• Distributed Data Storage: crawl snapshots, processed data, clustered data
DATA STORAGE LAYER
• Store snapshots of crawl data -- never throw away raw data!
• Store processed data -- both individual data points as well as "clusters" of related data points
• Distributed data stores
• Highly scalable -- add more hardware
• Highly available -- replication
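One way to read "never throw away raw data" is as an append-only snapshot store keyed by URL and crawl date, so that extraction can be re-run later over old crawls. The sketch below assumes that reading. `SnapshotStore` is a hypothetical in-memory class, and distribution and replication are left out.

```python
class SnapshotStore:
    """Append-only store of raw crawl snapshots, keyed by (url, crawl_date)."""

    def __init__(self):
        self._snapshots = {}

    def put(self, url, crawl_date, raw):
        """Store a snapshot; existing snapshots are never overwritten."""
        key = (url, crawl_date)
        if key in self._snapshots:
            raise ValueError(f"snapshot already stored: {key}")
        self._snapshots[key] = raw

    def history(self, url):
        """All snapshots of one URL, oldest first (ISO dates sort as strings)."""
        return sorted(
            (date, raw) for (u, date), raw in self._snapshots.items() if u == url
        )


store = SnapshotStore()
store.put("http://shop.example/1", "2013-08-02", "price=12999")
store.put("http://shop.example/1", "2013-08-01", "price=12499")

# Because raw data is kept, a new or fixed extractor can be re-run over history.
history = store.history("http://shop.example/1")
prices = [int(raw.split("=")[1]) for _, raw in history]
```

Keeping snapshots immutable also fits the temporally changing nature of the data: each crawl date adds a record instead of replacing the previous one.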
SERVING LAYER
This is the system as far as a user is concerned!
Must be highly responsive
Process data offline and periodically push it to the serving layer
• create indexes for fast data retrieval
• create views to serve queries that are known a priori
• minimize computation to the extent possible
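A minimal sketch of this pattern: an offline job builds an index and a view for a query known in advance, then pushes both to a serving layer that only does lookups. The names `build_serving_data` and `ServingLayer`, the record fields, and the "lowest price per product" query are all assumptions made for the example.

```python
from collections import defaultdict


def build_serving_data(records):
    """Offline job: build an index plus a pre-computed view from processed records."""
    index_by_product = defaultdict(list)
    for rec in records:
        index_by_product[rec["product"]].append(rec)
    # View for a query known a priori: the cheapest offer for each product.
    lowest_price_view = {
        product: min(recs, key=lambda r: r["price"])
        for product, recs in index_by_product.items()
    }
    return dict(index_by_product), lowest_price_view


class ServingLayer:
    """Answers requests with dictionary lookups only; no computation per query."""

    def __init__(self):
        self.index, self.lowest = {}, {}

    def push(self, index, lowest):
        """Called periodically by the offline job to swap in fresh data."""
        self.index, self.lowest = index, lowest

    def offers(self, product):
        return self.index.get(product, [])

    def lowest_price(self, product):
        return self.lowest.get(product)


processed = [
    {"product": "phone-x", "source": "shop-a", "price": 12499.0},
    {"product": "phone-x", "source": "shop-b", "price": 12999.0},
    {"product": "tablet-y", "source": "shop-a", "price": 8990.0},
]
serving = ServingLayer()
serving.push(*build_serving_data(processed))
cheapest = serving.lowest_price("phone-x")
```

The trade-off is freshness: answers are only as recent as the last push, in exchange for request handling that does almost no work.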
THANK YOU
Sanket Patil
[email protected]
+91-9900063093
2013 Dataweave
On Facebook: www.facebook.com/DataWeave
Catch us on Twitter: @dataweavein
www.dataweave.in