Yahoo! Hack India: Hyderabad 2013 | Building Data Products at Scale

BUILDING DATA PRODUCTS AT SCALE

http://dataweave.in/

DATAWEAVE: WHAT WE DO

• Aggregate large amounts of data publicly available on the web, and serve it to businesses in readily usable forms

• Serve actionable data through APIs, Visualizations, and Dashboards

• Provide a reporting and analytics layer on top of datasets and APIs

DATAWEAVE PLATFORM

Data sources (unstructured, spread across sources, and temporally changing)

• Pricing Data

• Open Government Data

• Social Media Data

Big Data Platform

Delivered as

• Data APIs

• API Feeds

• Data Services

• Dashboards

• Visualizations and Widgets

HOW DOES IT WORK? (1)

• Crawling/Scraping: from a large number of data sources

• Cleaning/Deduplication: remove as much noise as possible

• Data Normalization: represent related data together in standard forms
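The cleaning and normalization steps can be illustrated with a minimal sketch. It assumes scraped records are plain dictionaries with hypothetical `title`, `price`, and `source` fields; the field names and the cleaning rules are illustrative assumptions, not DataWeave's actual schema.

```python
import re

def normalize(record):
    """Map a raw scraped record to a standard form (hypothetical schema)."""
    # Collapse whitespace and lowercase the title.
    title = re.sub(r"\s+", " ", record["title"]).strip().lower()
    # Pull the first number out of the price text, e.g. "Rs. 1,299.00" -> 1299.0
    match = re.search(r"\d[\d,]*(?:\.\d+)?", record["price"])
    price = float(match.group().replace(",", "")) if match else None
    return {"title": title, "price": price, "source": record["source"]}

def deduplicate(records):
    """Drop records that are identical after normalization, keeping the first."""
    seen = set()
    unique = []
    for rec in records:
        key = (rec["title"], rec["price"], rec["source"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

raw = [
    {"title": "  Acme  Phone X ", "price": "Rs. 1,299.00", "source": "shop-a"},
    {"title": "acme phone x", "price": "1299", "source": "shop-a"},
    {"title": "Acme Phone X", "price": "Rs. 1,350", "source": "shop-b"},
]
clean = deduplicate([normalize(r) for r in raw])
```

Normalizing before deduplicating matters here: the first two records only become recognizable as duplicates once both are in the same standard form.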

HOW DOES IT WORK? (2)

• Store/Index: store optimally to support several complex queries

• Create "Views": on top of data for easy consumption, through APIs, visualizations, dashboards, and reports

• Package data as a product: to solve a bunch of related pain points in a certain domain (e.g., PriceWeave for retail)

AGGREGATION AND EXTRACTION

Extraction Layer: Offline Extraction of Factual Data

Aggregation Layer: Distributed Crawler Infrastructure

Public Data on the Web

AGGREGATION LAYER

Customized crawler infrastructure
• vertical-specific crawlers
• capable of crawling the "deep web"

Highly scalable
• 500+ websites on a daily basis
• more with the addition of hardware

Robust to failures (404s, timeouts, server restarts)
• stateless distributed workers
• crawl state maintained in a separate data store
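One way to picture stateless workers with externally held crawl state is the sketch below. `CrawlStateStore`, `run_worker`, and `fake_fetch` are hypothetical names introduced for illustration; the store is an in-memory stand-in for a separate data store, and the retry limit is an assumption.

```python
from collections import deque

class CrawlStateStore:
    """In-memory stand-in for the separate data store that holds crawl state."""

    def __init__(self, urls, max_attempts=3):
        self.pending = deque(urls)
        self.attempts = {}
        self.done = {}
        self.failed = set()
        self.max_attempts = max_attempts

    def claim(self):
        """Hand out the next URL to crawl, or None when the queue is empty."""
        return self.pending.popleft() if self.pending else None

    def report_success(self, url, body):
        self.done[url] = body

    def report_failure(self, url):
        self.attempts[url] = self.attempts.get(url, 0) + 1
        if self.attempts[url] < self.max_attempts:
            self.pending.append(url)  # retry later, possibly on another worker
        else:
            self.failed.add(url)      # give up after max_attempts

def run_worker(store, fetch):
    """Stateless worker: all progress is recorded in the store, not in the worker."""
    while (url := store.claim()) is not None:
        try:
            store.report_success(url, fetch(url))
        except Exception:
            store.report_failure(url)

calls = {}

def fake_fetch(url):
    """Simulated fetcher: one permanent 404 and one timeout that recovers."""
    calls[url] = calls.get(url, 0) + 1
    if url.endswith("/missing"):
        raise LookupError("404")
    if url.endswith("/slow") and calls[url] == 1:
        raise TimeoutError("timed out")
    return f"<html>{url}</html>"

store = CrawlStateStore([
    "http://example.com/a",
    "http://example.com/slow",
    "http://example.com/missing",
])
run_worker(store, fake_fetch)
```

Because the worker keeps nothing between iterations, it can be restarted or replaced at any point and the crawl resumes from whatever the store says is still pending.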

DATA EXTRACTION LAYER

• Extract as many data points from crawled pages as possible

• Completely offline process, independent of crawling

• Highly parallelized -- scales in a straightforward manner
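A minimal sketch of offline, parallel extraction, assuming crawled pages are already stored as `(url, html)` pairs. The `TitleAndLinks` parser and the choice of data points (page title and links) are illustrative assumptions. A thread pool keeps the example self-contained; a real deployment would more likely spread this work over processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser

class TitleAndLinks(HTMLParser):
    """Collect the page title and all link targets from one HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def extract(page):
    """Extract data points from one stored page; needs no network access."""
    url, html = page
    parser = TitleAndLinks()
    parser.feed(html)
    parser.close()
    return {"url": url, "title": parser.title.strip(), "links": parser.links}

stored_pages = [
    ("http://example.com/1", "<html><title> Item One </title><a href='/2'>next</a></html>"),
    ("http://example.com/2", "<html><title>Item Two</title></html>"),
]

# Each page is processed independently, so the work splits cleanly across workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    extracted = list(pool.map(extract, stored_pages))
```

Since `extract` depends only on its own input page, adding workers requires no coordination between them, which is what makes the scaling straightforward.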

NORMALIZATION

Normalization Layer: Machine Learning Techniques

• Remove Noise

• Fill Gaps in Data

• Represent Data

• Clustering

Extraction Layer: Offline Extraction of Factual Data

Knowledge Base

NORMALIZATION LAYER

• Remove noise, remove duplicates

• Gather data from multiple sources and fill "gaps" in info

• Normalize data points to a standard internal representation

• Cluster related data together (Machine Learning techniques)

• Build a "knowledge base" -- continuous learning

• "Human in the loop" for data validation
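The gap-filling and clustering steps can be sketched as follows. This is a deliberately simple, rule-based stand-in: it groups records by a hypothetical brand-and-model key rather than using machine learning, and the field names are assumptions made for illustration.

```python
from collections import defaultdict

def cluster_key(record):
    """Hypothetical rule: records with the same brand and model are one product."""
    return (record["brand"].strip().lower(), record["model"].strip().lower())

def cluster(records):
    """Group related records together under a shared key."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[cluster_key(rec)].append(rec)
    return dict(clusters)

def fill_gaps(members):
    """Merge a cluster into one record, taking the first non-empty value per field."""
    merged = {}
    for rec in members:
        for field, value in rec.items():
            if field == "source":
                continue
            if merged.get(field) in (None, "") and value not in (None, ""):
                merged[field] = value
    merged["sources"] = sorted(r["source"] for r in members)
    return merged

records = [
    {"brand": "Acme", "model": "X100", "color": None, "weight_g": 140, "source": "shop-a"},
    {"brand": "acme ", "model": "x100", "color": "black", "weight_g": None, "source": "shop-b"},
    {"brand": "Zed", "model": "Q1", "color": "red", "weight_g": 90, "source": "shop-a"},
]

# A tiny "knowledge base": one merged entry per cluster of related records.
knowledge_base = {key: fill_gaps(members) for key, members in cluster(records).items()}
```

Neither source has the full picture of the Acme X100 on its own; the merged entry carries both the weight from one shop and the color from the other.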

DATA STORAGE AND SERVING

Data APIs, Visualizations, Dashboards, Reports

Serving Layer: Highly Responsive

• Indexes

• Views

• Filters

• Pre-Computed Results

Distributed Data Storage

• Crawl Snapshots

• Processed Data

• Clustered Data

DATA STORAGE LAYER

• Store snapshots of crawl data -- never throw away raw data!

• Store processed data -- both individual data points as well as "clusters" of related data points

• Distributed data stores

• Highly scalable -- add more hardware

• Highly available -- replication
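A minimal sketch of the "never throw away raw data" idea combined with replication. `SnapshotStore` is a hypothetical class; each replica is a plain dictionary standing in for one storage node, and writing to every replica is a simplifying assumption.

```python
import hashlib
import time

class SnapshotStore:
    """Append-only store: every crawl of a URL is kept, nothing is overwritten."""

    def __init__(self, replicas=2):
        # Each dict stands in for one storage node.
        self.replicas = [dict() for _ in range(replicas)]

    def put(self, url, html, crawled_at=None):
        snapshot = {
            "crawled_at": crawled_at if crawled_at is not None else time.time(),
            "sha1": hashlib.sha1(html.encode("utf-8")).hexdigest(),
            "html": html,
        }
        for node in self.replicas:
            # Append a copy to the URL's history on every replica.
            node.setdefault(url, []).append(dict(snapshot))

    def history(self, url):
        """Return all snapshots of a URL from the first replica that has them."""
        for node in self.replicas:
            if url in node:
                return node[url]
        return []

store = SnapshotStore(replicas=2)
store.put("http://example.com/p", "<p>Rs. 100</p>", crawled_at=1)
store.put("http://example.com/p", "<p>Rs. 90</p>", crawled_at=2)

store.replicas[0].clear()  # simulate losing one node
versions = store.history("http://example.com/p")
```

Keeping every snapshot means extraction and normalization can be re-run later with improved logic, and the history itself captures how the data changed over time.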

SERVING LAYER

This is the system as far as a user is concerned!

Must be highly responsive

Process data offline and periodically push it to the serving layer

• create indexes for fast data retrieval

• create views to serve queries that are known a priori

• minimize computation to the extent possible
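The pattern of processing offline and pushing results to the serving layer can be sketched like this. The "lowest price per product" view, the category index, and the `ServingLayer` class are hypothetical examples chosen for illustration.

```python
from collections import defaultdict

def build_serving_data(processed):
    """Offline job: build an index and a pre-computed view from processed data."""
    index = defaultdict(list)  # index: category -> product names
    lowest_price = {}          # view: product -> cheapest known offer
    for row in processed:
        if row["product"] not in index[row["category"]]:
            index[row["category"]].append(row["product"])
        best = lowest_price.get(row["product"])
        if best is None or row["price"] < best["price"]:
            lowest_price[row["product"]] = {"price": row["price"], "source": row["source"]}
    return {"index": dict(index), "lowest_price": lowest_price}

class ServingLayer:
    """Answers queries with lookups only; no computation at request time."""

    def __init__(self):
        self.data = {"index": {}, "lowest_price": {}}

    def push(self, data):
        self.data = data  # swap in the latest offline results

    def products_in(self, category):
        return self.data["index"].get(category, [])

    def cheapest(self, product):
        return self.data["lowest_price"].get(product)

processed = [
    {"product": "phone x", "category": "phones", "price": 1350.0, "source": "shop-b"},
    {"product": "phone x", "category": "phones", "price": 1299.0, "source": "shop-a"},
    {"product": "lamp q", "category": "home", "price": 400.0, "source": "shop-a"},
]

serving = ServingLayer()
serving.push(build_serving_data(processed))
```

All the comparison work happens in `build_serving_data`; by the time a user asks for the cheapest offer, the answer is a single dictionary lookup.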

THANK YOU

Sanket Patil

[email protected]

+91-9900063093

2013 Dataweave

On Facebook: www.facebook.com/DataWeave

Catch us on Twitter: @dataweavein

www.dataweave.in
