Transcript
Page 1: Data Engineering 101: Building your first data product

Jonathan DinuCo-Founder, Zipfian [email protected]

@clearspandex

@ZipfianAcademy

Data Engineering 101: Building your first data product

May 4th, 2014

Page 2: Data Engineering 101: Building your first data product

Today

• whoami

• Nws Rdr (News Reader)

• The What, Why, and How of Data Products

• Data Engineering

• Building a Pipeline

• Productionizing the Products

• Q&A

Questions? tweet @zipfianacademy #pydata

Page 3: Data Engineering 101: Building your first data product

Formerly

Questions? tweet @zipfianacademy #pydata

Page 4: Data Engineering 101: Building your first data product

Formerly

Questions? tweet @zipfianacademy #pydata

Page 5: Data Engineering 101: Building your first data product

Currently

Questions? tweet @zipfianacademy #pydata

Page 6: Data Engineering 101: Building your first data product

Today Disclaimer:

All characters appearing in this presentation are

fictitious. Any resemblance to real persons, living

or dead, is purely coincidental.

Questions? tweet @zipfianacademy #pydata

Page 7: Data Engineering 101: Building your first data product

Today Disclaimer:

This presentation contains strong opinions that

you may or may not agree with. All thoughts are

my own.

Jonathan DinuCo-Founder, Zipfian [email protected]

@clearspandex

Questions? tweet @zipfianacademy #pydata

Page 8: Data Engineering 101: Building your first data product

Today

• whoami

• Nws Rdr (News Reader)

• The What, Why, and How of Data Products

• Data Engineering

• Building a Pipeline

• Productionizing the Products

• Creating Value for Users

• Q&A

Questions? tweet @zipfianacademy #pydata

Page 9: Data Engineering 101: Building your first data product

nwsrdr (News Reader)

Source: http://www.groovypost.com/wp-content/uploads/2013/05/Bookmark-Button.png

OR

nwsrdr+ nwrsrdr

+ nwrsrdr

+ nwrsrdr

nwsrdr

getnews.com/bookmarklet

When browsing the web simply click the +nwsrdr to save any page to nwsrdr

Get nwsrdr on your desktop

Questions? tweet @zipfianacademy #pydata

Page 10: Data Engineering 101: Building your first data product

nwsrdr

• Auto-categorize Articles

• Find Similar Articles

• Recommend articles

• Suggest Feeds to Follow

• No Ads!

It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious!

Questions? tweet @zipfianacademy #pydata

Page 11: Data Engineering 101: Building your first data product

nwsrdr

It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious!

• Naive Bayes (classification)

• Clustering (unsupervised learning)

• Collaborative Filtering

• Triangle Closing

• Real Business Model!

Questions? tweet @zipfianacademy #pydata

Page 12: Data Engineering 101: Building your first data product

Today

• whoami

• Nws Rdr (News Reader)

• The What, Why, and How of Data Products

• Data Engineering

• Building a Pipeline

• Productionizing the Products

• Q&A

Questions? tweet @zipfianacademy #pydata

Page 13: Data Engineering 101: Building your first data product

OR

Data Products

Product Built on Data(that you sell)

Questions? tweet @zipfianacademy #pydata

Page 14: Data Engineering 101: Building your first data product

OR

Data Products

Product that Generates Data

Questions? tweet @zipfianacademy #pydata

Page 15: Data Engineering 101: Building your first data product

OR

Data Products

Product that Generates Data(that you sell)

Questions? tweet @zipfianacademy #pydata

Page 16: Data Engineering 101: Building your first data product

OR

Data Products

Product that Generates Data(that you sell)

i.e. Facebook

Questions? tweet @zipfianacademy #pydata

Page 17: Data Engineering 101: Building your first data product

OR

Data Products

Questions? tweet @zipfianacademy #pydata Source: http://gifgif.media.mit.edu/

Page 18: Data Engineering 101: Building your first data product

OR

Data Products

Source: http://www.adamlaiacano.com/post/57703317453/data-generating-productsQuestions? tweet @zipfianacademy #pydata

Page 19: Data Engineering 101: Building your first data product

OR

Data Generating Products

Source: http://www.adamlaiacano.com/post/57703317453/data-generating-productsQuestions? tweet @zipfianacademy #pydata

Page 20: Data Engineering 101: Building your first data product

Products that enhance a users’ experience the more “data” a user

provides

Data Generating Products

Ex: Recommender Systems

Questions? tweet @zipfianacademy #pydata

Page 21: Data Engineering 101: Building your first data product

Today

• whoami

• Nws Rdr (News Reader)

• The What, Why, and How of Data Products

• Data Engineering

• Building a Pipeline

• Productionizing the Products

• Q&A

Questions? tweet @zipfianacademy #pydata

Page 22: Data Engineering 101: Building your first data product

OR

Data Science

Questions? tweet @zipfianacademy #pydata

Page 23: Data Engineering 101: Building your first data product

i.e. solve more problems than you create

Data Science

Questions? tweet @zipfianacademy #pydata

Page 24: Data Engineering 101: Building your first data product

Source: http://estoyentretenido.com/wp-content/uploads/2012/11/Jackie-Chan-Meme.jpg

But.... How?!?!?!!?

Data Science

Questions? tweet @zipfianacademy #pydata

Page 25: Data Engineering 101: Building your first data product

Data Engineering

Source: http://www.schooljotter.com/imagefolders/lady/Class_3/Engineer-It-1350063721.PNG

Questions? tweet @zipfianacademy #pydata

Page 26: Data Engineering 101: Building your first data product

Data Engineering

Source: http://www.schooljotter.com/imagefolders/lady/Class_3/Engineer-It-1350063721.PNG

!

Questions? tweet @zipfianacademy #pydata

Page 27: Data Engineering 101: Building your first data product

OR

Data Engineering

Questions? tweet @zipfianacademy #pydata

Page 28: Data Engineering 101: Building your first data product

Prepared Data

Test Set

Training Set Train

ModelSampling

EvaluateCross

Validation

Data Science

Questions? tweet @zipfianacademy #pydata

Page 29: Data Engineering 101: Building your first data product

Raw Data

Cleaned Data

Scrubbing

Prepared DataVectorization

New Data

Test Set

Training Set Train

ModelSampling

EvaluateCross

Validation

Cleaned Data

Prepared DataVectorizationScrubbing

Predict

Labels/Classes

Data Engineering

Questions? tweet @zipfianacademy #pydata

Page 30: Data Engineering 101: Building your first data product

Today

• whoami

• Nws Rdr (News Reader)

• The What, Why, and How of Data Products

• Data Engineering

• Building a Pipeline

• Productionizing the Products

• Q&A

Questions? tweet @zipfianacademy #pydata

Page 31: Data Engineering 101: Building your first data product

What

• Naive Bayes (classification)

• Clustering (unsupervised learning)

• Collaborative Filtering

• Triangle Closing

• Real Business Model

Questions? tweet @zipfianacademy #pydata

Page 32: Data Engineering 101: Building your first data product

nwsrdr

• Auto-categorize Articles

• Find Similar Articles

• Recommend articles

• No Ads!

It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious!

Questions? tweet @zipfianacademy #pydata

Page 33: Data Engineering 101: Building your first data product

nwsrdr

It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious!

• Naive Bayes (classification)

• Clustering (unsupervised learning)

• Collaborative Filtering

• Triangle Closing

• Real Business Model!

Questions? tweet @zipfianacademy #pydata

Page 34: Data Engineering 101: Building your first data product

Source: http://media.tumblr.com/tumblr_lakcynCyG31qbzcoy.jpg

Abstraction (Cake)

How

(ABK)

Questions? tweet @zipfianacademy #pydata

Page 35: Data Engineering 101: Building your first data product

Obligatory Name Drop

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

requests

BeautifulSoup4

pandas

pymongo

scikit-learn/NLTK

Questions? tweet @zipfianacademy #pydata

Page 36: Data Engineering 101: Building your first data product

Obligatory Name Drop

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

requests

BeautifulSoup4

pandas

pymongo

scikit-learn/NLTK

Questions? tweet @zipfianacademy #pydata

Page 37: Data Engineering 101: Building your first data product

Obligatory Name Drop

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

requests

BeautifulSoup4

pandas

pymongo

Flask

yHat

scikit-learn/NLTK

Questions? tweet @zipfianacademy #pydata

Page 38: Data Engineering 101: Building your first data product

Obligatory Name Drop

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

requests

BeautifulSoup4

pandas

pymongo

Flask

yHat

At Scale Locally

scrapy

Hadoop Streaming (w/ BeautifulSoup4)

mrjob or Mortar (w/ Python UDF)

Snakebite (HDFS)

MLlib (pySpark)

Flask

yHat

scikit-learn/NLTK

Questions? tweet @zipfianacademy #pydata

Page 39: Data Engineering 101: Building your first data product

Obligatory Name Drop

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

At Scale

Flask

yHat

scrapy

Hadoop Streaming (w/ BeautifulSoup4)

mrjob or Mortar (w/ Python UDF)

Snakebite (HDFS)

MLlib (pySpark)

requests

BeautifulSoup4

pandas

pymongo

Flask

yHat

Locally

scikit-learn/NLTK

Questions? tweet @zipfianacademy #pydata

Page 40: Data Engineering 101: Building your first data product

Pipeline

Iteration 0:

• Find out how much data

• Run locally

• Experiment

Questions? tweet @zipfianacademy #pydata

Page 41: Data Engineering 101: Building your first data product

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

At Scale

Flask

yHat

scrapy

Hadoop Streaming (w/ BeautifulSoup4)

mrjob or Mortar (w/ Python UDF)

Snakebite (HDFS)

MLlib (pySpark)

requests

BeautifulSoup4

pandas

pymongo

Flask

yHat

Locally

scikit-learn/NLTK

Questions? tweet @zipfianacademy #pydata

Acquire

Page 42: Data Engineering 101: Building your first data product

Retrieve Meta-data for ALL NYT articles

Questions? tweet @zipfianacademy #pydata

Acquire

Page 43: Data Engineering 101: Building your first data product

api_key='xxxxxxxxxxxxx'!!!!url = 'http://api.nytimes.com/svc/search/v2/articlesearch.json?fq=section_name.contains:("Arts" "Business Day" "Opinion" "Sports" "U.S." "World")&sort=newest&api-key=' + api_key!!!!# make an API request!api = requests.get(url)!

Questions? tweet @zipfianacademy #pydata

Acquire

Page 44: Data Engineering 101: Building your first data product

# parse resulting JSON and insert into a mongoDB collection!for content in api.json()['response']['docs']:! if not collection.find_one(content):! collection.insert(content)! !!# only returns 10 per page!"There are only %i docuemtns returned 0_o" % \!! len(api.json()[‘response']['docs'])!

Questions? tweet @zipfianacademy #pydata

Acquire

Page 45: Data Engineering 101: Building your first data product

# there are many more than 10 articles however!total_art = articles_left = api.json()['response']['meta']['hits']!!!print "There are currently %s articles in the NYT archive" % total_art!!!#=> There are currently 15277775 articles in the NYT archive

Questions? tweet @zipfianacademy #pydata

Acquire

Page 46: Data Engineering 101: Building your first data product

Gotchas!

• Rate Limiting

• Page Limiting

Questions? tweet @zipfianacademy #pydata

Acquire

Page 47: Data Engineering 101: Building your first data product

Iterate

Iteration 1:

• (Meaningful) Sample of Data

• Prototype — “Close the Loop”

Questions? tweet @zipfianacademy #pydata

Page 48: Data Engineering 101: Building your first data product

Retrieve Meta-data for ALL NYT articles

Questions? tweet @zipfianacademy #pydata

Acquire

(take 2)

Page 49: Data Engineering 101: Building your first data product

# let us loop (and hopefully not hit our rate limit)!

while articles_left > 0 and page_count < max_pages:!

more_articles = requests.get(url + "&page=" + str(page) + "&end_date=" + str(last_date))!

print "Inserting page " + str(page)!

# make sure it was successful!

if more_articles.status_code == 200:!

for content in more_articles.json()['response']['docs']:!

latest_article = parser.parse(content['pub_date']).strftime("%Y%m%d")!

if not collection.find_one(content) and content['document_type'] == 'article':!

print "No dups"!

try:!

print "Inserting article " + str(content['headline'])!

collection.insert(content)!

except errors.DuplicateKeyError:!

print "Duplicates"!

continue!

else:!

print "In collection already”!

! ! …

Iteration 0.5

Questions? tweet @zipfianacademy #pydata

Acquire

Page 50: Data Engineering 101: Building your first data product

articles_left -= 10! page += 1! page_count += 1! cursor_count += 1! final_page = max(final_page, page)! else:! if more_articles.status_code == 403:! print "Sleepy..."! # account for rate limiting! time.sleep(2)! elif cursor_count > 100:! print "Adjusting date”!! ! ! ! # account for page limiting! cursor_count = 0! page = 0! last_date = latest_article! else:! print "ERRORS: " + str(more_articles.status_code)! cursor_count = 0! page = 0! last_date = latest_article!

Questions? tweet @zipfianacademy #pydata

Acquire

Page 51: Data Engineering 101: Building your first data product

Download HTML content of

articles from NYT.com

Questions? tweet @zipfianacademy #pydata

Acquire

(and store in MongoDB™)

Page 52: Data Engineering 101: Building your first data product

Acquire# now we can get some content!!#limit = 100!limit = 10000!!for article in collection.find({'html' : {'$exists' : False} }):! if limit and limit > 0:! if not article.has_key('html') and article['document_type'] == 'article':! limit -= 1! print article['web_url']! html = requests.get(article['web_url'] + "?smid=tw-nytimes")! ! if html.status_code == 200:! soup = BeautifulSoup(html.text)! ! # serialize html! collection.update({ '_id' : article['_id'] }, { '$set' : !! ! ! ! ! ! ! ! ! ! ! ! ! { 'html' : unicode(soup), 'content' : [] } !! ! ! ! ! ! ! ! ! ! ! ! } )! ! for p in soup.find_all('div', class_='articleBody'):! collection.update({ '_id' : article['_id'] }, { '$push' : !! ! ! ! ! ! ! ! ! ! ! ! ! ! { 'content' : p.get_text() !! ! ! ! ! ! ! ! ! ! ! ! ! } })!

Questions? tweet @zipfianacademy #pydata

Page 53: Data Engineering 101: Building your first data product

Parse

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

At Scale

Flask

yHat

scrapy

Hadoop Streaming (w/ BeautifulSoup4)

mrjob or Mortar (w/ Python UDF)

Snakebite (HDFS)

MLlib (pySpark)

requests

BeautifulSoup4

pandas

pymongo

scikit-learn/NLTK

Flask

yHat

Locally

Questions? tweet @zipfianacademy #pydata

Page 54: Data Engineering 101: Building your first data product

Parse HTML with BeautifulSoup

and Extract the article Body

Questions? tweet @zipfianacademy #pydata

(and store in MongoDB™)

Parse

Page 55: Data Engineering 101: Building your first data product

# parse HTML content of articles!

for article in collection.find({'html' : {'$exists' : True} }):!

print article['web_url']!

soup = BeautifulSoup(article['html'], 'html.parser')!

arts = soup.find_all('div', class_='articleBody')!

!

if len(arts) == 0:!

arts = soup.find_all('p', class_=‘story-body-text')!

!! ! …

Questions? tweet @zipfianacademy #pydata

Parse

Page 56: Data Engineering 101: Building your first data product

Store

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

At Scale

Flask

yHat

scrapy

Hadoop Streaming (w/ BeautifulSoup4)

mrjob or Mortar (w/ Python UDF)

Snakebite (HDFS)

MLlib (pySpark)

requests

BeautifulSoup4

pandas

pymongo

Flask

yHat

Locally

scikit-learn/NLTK

Questions? tweet @zipfianacademy #pydata

Page 57: Data Engineering 101: Building your first data product

for p in arts:! collection.update({ '_id' : article['_id'] }, { '$push' : !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! { 'content' : p.get_text() } !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! })!

Questions? tweet @zipfianacademy #pydata

Store

Page 58: Data Engineering 101: Building your first data product

Explore

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

At Scale

Flask

yHat

scrapy

Hadoop Streaming (w/ BeautifulSoup4)

mrjob or Mortar (w/ Python UDF)

Snakebite (HDFS)

MLlib (pySpark)

requests

BeautifulSoup4

pandas

pymongo

Flask

yHat

Locally

scikit-learn/NLTK

Questions? tweet @zipfianacademy #pydata

Page 59: Data Engineering 101: Building your first data product

Exploratory Data Analysis with pandas

Questions? tweet @zipfianacademy #pydata

Explore

Page 60: Data Engineering 101: Building your first data product

articles.describe()!# ! ! text section!# count 1405 1405!# unique 1397 10!!fig = plt.figure()!# histogram of section counts!articles['section'].value_counts().plot(kind='bar')

Questions? tweet @zipfianacademy #pydata

Explore

Page 61: Data Engineering 101: Building your first data product

Questions? tweet @zipfianacademy #pydata

Explore

Page 62: Data Engineering 101: Building your first data product

error with NYT API

Questions? tweet @zipfianacademy #pydata

Explore

Page 63: Data Engineering 101: Building your first data product

api_key='xxxxxxxxxxxxx'!!!!url = 'http://api.nytimes.com/svc/search/v2/articlesearch.json?fq=section_name.contains:("Arts" "Business Day" "Opinion" "Sports" "U.S." "World")&sort=newest&api-key=' + api_key!!!!# make an API request!api = requests.get(url)!

Questions? tweet @zipfianacademy #pydata

Explore

error with NYT API

Page 64: Data Engineering 101: Building your first data product

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

At Scale

Flask

yHat

scrapy

Hadoop Streaming (w/ BeautifulSoup4)

mrjob or Mortar (w/ Python UDF)

Snakebite (HDFS)

MLlib (pySpark)

requests

BeautifulSoup4

pandas

pymongo

Flask

yHat

Locally

scikit-learn/NLTK

Questions? tweet @zipfianacademy #pydata

Vectorize

Page 65: Data Engineering 101: Building your first data product

Tokenize article text and

create feature vectors with NLTK

Questions? tweet @zipfianacademy #pydata

Vectorize

Page 66: Data Engineering 101: Building your first data product

Vectorize

wnl = nltk.WordNetLemmatizer()!!def tokenize_and_normalize(chunks):! words = [ tokenize.word_tokenize(sent) for sent in tokenize.sent_tokenize("".join(chunks)) ]! flatten = [ inner for sublist in words for inner in sublist ]! stripped = [] !! for word in flatten: ! if word not in stopwords.words('english'):! try:! stripped.append(word.encode('latin-1').decode('utf8').lower())! except:! print "Cannot encode: " + word! ! no_punks = [ word for word in stripped if len(word) > 1 ] ! return [wnl.lemmatize(t) for t in no_punks]!

Questions? tweet @zipfianacademy #pydata

Page 67: Data Engineering 101: Building your first data product

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

At Scale

Flask

yHat

scrapy

Hadoop Streaming (w/ BeautifulSoup4)

mrjob or Mortar (w/ Python UDF)

Snakebite (HDFS)

MLlib (pySpark)

requests

BeautifulSoup4

pandas

pymongo

Flask

yHat

Locally

scikit-learn/NLTK

Questions? tweet @zipfianacademy #pydata

Train

Page 68: Data Engineering 101: Building your first data product

Train and score a model with scikit-learn

Questions? tweet @zipfianacademy #pydata

Train

Page 69: Data Engineering 101: Building your first data product

# cross validate!from sklearn.cross_validation import train_test_split!!xtrain, xtest, ytrain, ytest = !! ! ! ! ! ! ! train_test_split(X, labels, test_size=0.3)!!# train a model!alpha = 1!multi_bayes = MultinomialNB(alpha=alpha)!!multi_bayes.fit(xtrain, ytrain)!multi_bayes.score(xtest, ytest)

Questions? tweet @zipfianacademy #pydata

Train

Page 70: Data Engineering 101: Building your first data product

Gotchas!

• Model only exists locally on Laptop

• Not Automated for realtime prediction

Questions? tweet @zipfianacademy #pydata

Train

Page 71: Data Engineering 101: Building your first data product

Exposé

Questions? tweet @zipfianacademy #pydata

Page 72: Data Engineering 101: Building your first data product

Iteration 2:

• Expose your model

• Automate your processes

Questions? tweet @zipfianacademy #pydata

Exposé

Page 73: Data Engineering 101: Building your first data product

Getting that model off your lap(top)

Questions? tweet @zipfianacademy #pydata

Exposé

Page 74: Data Engineering 101: Building your first data product

Source: http://pixel.nymag.com/imgs/daily/vulture/2012/03/09/09_joan-taylor.o.jpg/a_560x0.jpg

Questions? tweet @zipfianacademy #pydata

Exposé

Page 75: Data Engineering 101: Building your first data product

A model is just a function

Questions? tweet @zipfianacademy #pydata

Exposé

Page 76: Data Engineering 101: Building your first data product

Inputs...

Questions? tweet @zipfianacademy #pydata

Exposé

Page 77: Data Engineering 101: Building your first data product

Outputs...

Questions? tweet @zipfianacademy #pydata

Exposé

Page 78: Data Engineering 101: Building your first data product

Serialize your model with pickle (or cPickle or joblib)

Questions? tweet @zipfianacademy #pydata

Persistence

Page 79: Data Engineering 101: Building your first data product

Source: http://www.glogster.com/mrsallenballard/pickles-i-love-em-/g-6mevh13be74mgnc9i8qifa0

Persistence

Questions? tweet @zipfianacademy #pydata

Page 80: Data Engineering 101: Building your first data product

Persistence

SerDes

• Disk

• Database

• Memory

Questions? tweet @zipfianacademy #pydata

Page 81: Data Engineering 101: Building your first data product

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

At Scale

Flask

yHat

scrapy

Hadoop Streaming (w/ BeautifulSoup4)

mrjob or Mortar (w/ Python UDF)

Snakebite (HDFS)

MLlib (pySpark)

requests

BeautifulSoup4

pandas

pymongo

Flask

yHat

Locally

scikit-learn/NLTK

Questions? tweet @zipfianacademy #pydata

Exposé

Page 82: Data Engineering 101: Building your first data product

Deploy your Model to yHat

Questions? tweet @zipfianacademy #pydata

Exposé

Page 83: Data Engineering 101: Building your first data product

class DocumentClassifier(YhatModel):! @preprocess(in_type=dict, out_type=dict)! def execute(self, data):! featureBody = vectorizer.transform([data['content']])! result = multi_bayes.predict(featureBody)! list_res = result.tolist()! return {"section_name": list_res}!!clf = DocumentClassifier()!yh = Yhat("[email protected]", “xxxxxx",!! ! ! ! ! ! ! ! ! ! ! ! ! "http://cloud.yhathq.com/")!yh.deploy("documentClassifier", DocumentClassifier, globals())

Questions? tweet @zipfianacademy #pydata

Exposé

Page 84: Data Engineering 101: Building your first data product

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

At Scale

Flask

yHat

scrapy

Hadoop Streaming (w/ BeautifulSoup4)

mrjob or Mortar (w/ Python UDF)

Snakebite (HDFS)

MLlib (pySpark)

requests

BeautifulSoup4

pandas

pymongo

Flask (on Heroku)

yHat

Locally

scikit-learn/NLTK

Questions? tweet @zipfianacademy #pydata

Present

Page 85: Data Engineering 101: Building your first data product

Create a Flask application to expose your model on the web

Questions? tweet @zipfianacademy #pydata

Present

Page 86: Data Engineering 101: Building your first data product

yh = Yhat("<USERNAME>", "<API KEY>", "http://cloud.yhathq.com/")[email protected]('/')def index(): return app.send_static_file('index.html')[email protected]('/predict', methods=['POST'])def predict(): article = request.form['article'] results = yh.predict("documentClf", { 'content': article }) return jsonify({"results": results})

Questions? tweet @zipfianacademy #pydata

Present

Page 87: Data Engineering 101: Building your first data product

Pipeline

Only Data should Flow

Questions? tweet @zipfianacademy #pydata

Page 88: Data Engineering 101: Building your first data product

DataRemember to Remember

(Lineage)

Acquisition

Parse

Storage

Transform/Explore

Vectorization

Train

Model

Expose

Presentation

Questions? tweet @zipfianacademy #pydata

Pipeline

Page 89: Data Engineering 101: Building your first data product

Immutable append only set of Raw Data

Computation is a view on data

*Lambda Architecture by Nathan MarzQuestions? tweet @zipfianacademy #pydata

Pipeline

Page 90: Data Engineering 101: Building your first data product

Functional Data Science

• Modularity

• Define interfaces

• Separate data from computation

• Data Lineage

Functional

Questions? tweet @zipfianacademy #pydata

Page 91: Data Engineering 101: Building your first data product

Need Robust and Flexible Pipeline!

Questions? tweet @zipfianacademy #pydata

Pipeline

Page 92: Data Engineering 101: Building your first data product

Whatever you do, DO NOT cross the streams

Questions? tweet @zipfianacademy #pydata

Pipeline

Page 93: Data Engineering 101: Building your first data product

NYT API

MongoDB

BeautifulSoup

Feature Matrixscikit-learn

Web App

ModelDeploy

yHat

HerokuPOST

Predict

Predicted Section

Where we are

NLTK

scikit-learn

Questions? tweet @zipfianacademy #pydata

Page 94: Data Engineering 101: Building your first data product

Gotchas!

• Only have a static subset of articles

• Pipeline not automated for re-training

Questions? tweet @zipfianacademy #pydata

Gotchas

Page 95: Data Engineering 101: Building your first data product

Today

• whoami

• Nws Rdr (News Reader)

• The What, Why, and How of Data Products

• Data Engineering

• Building a Pipeline

• Productionizing the Products

• Q&A

Questions? tweet @zipfianacademy #pydata

Page 96: Data Engineering 101: Building your first data product

Iteration 3:

Source: http://vninja.net/wordpress/wp-content/uploads/2013/03/KCaAutomate.pngQuestions? tweet @zipfianacademy #pydata

Iterate

Page 97: Data Engineering 101: Building your first data product

NYT API

MongoDB

cron

Feature Matrixscikit-learn

Web App

ModelDeploy

yHat

HerokuPOST

Predict

Predicted Section

Where we are

NLTK

scikit-learn

Questions? tweet @zipfianacademy #pydata

Amazon EC2

Page 98: Data Engineering 101: Building your first data product

testing

Start small (data) and fast

(development)

testing

Increase size of data set

Optimize and productionize

PROFIT!

$$$

Questions? tweet @zipfianacademy #pydata

How to Scale

Page 99: Data Engineering 101: Building your first data product

How to Scale

testing

Develop locally

testing

Distribute computation

(run on cluster)

Tune parameters

PROFIT!

$$$

Questions? tweet @zipfianacademy #pydata

Can also use a streaming algorithm or

single machine disk based “medium data”

technologies (i.e. database or memory

mapped files)

Page 100: Data Engineering 101: Building your first data product

Products

If you build it...

Questions? tweet @zipfianacademy #pydata

Page 101: Data Engineering 101: Building your first data product

Source: http://nateemery.com/wp-content/uploads/2013/05/field-of-dreams.jpg

Products

Questions? tweet @zipfianacademy #pydata

Page 102: Data Engineering 101: Building your first data product

Today

• whoami

• Nws Rdr (News Reader)

• The What, Why, and How of Data Products

• Data Engineering

• Building a Pipeline

• Productionizing the Products

• Q&A

Questions? tweet @zipfianacademy #pydata

Page 103: Data Engineering 101: Building your first data product

Q & A

Q&AQuestions? tweet @zipfianacademy #pydata

Page 104: Data Engineering 101: Building your first data product

Zipfian Academy

@ZipfianAcademy

Data Science & Data Engineering 12-week Bootcamp (May 12th & Sep 8th)

Weekend Workshops

http://zipfianacademy.com/apply

http://zipfianacademy.com/workshops

Next: Interactive Visualizations w/ d3.js (June 7th)

Questions? tweet @zipfianacademy #pydata

Page 105: Data Engineering 101: Building your first data product

Thank You!

Jonathan DinuCo-Founder, Zipfian [email protected]

@clearspandex

@ZipfianAcademy

http://zipfianacademy.com

Questions? tweet @zipfianacademy #pydata

Page 106: Data Engineering 101: Building your first data product

Appendix

Questions? tweet @zipfianacademy #pydata

Page 107: Data Engineering 101: Building your first data product

Data Sources

Obtain(ranked by ease of use)

1. DaaS -- Data as a service

2. Bulk Download

3. APIs

4. Web Scraping

Questions? tweet @zipfianacademy #pydata

Page 108: Data Engineering 101: Building your first data product

DaaS(Data as a Service)

• Time Series/Numeric: Quandl

• Financial Modeling: Quantopian

• Email Contextualization: Rapleaf

• Location and POI: Factual

Data Sources

Questions? tweet @zipfianacademy #pydata

Page 109: Data Engineering 101: Building your first data product

Bulk Download(just like the good ol’ days)

• File Transfer Protocol (FTP): CDC

• Amazon Web Services: Public Datasets

• Infochimps: Data Marketplace

• Academia: UCI Machine Learning Repository

Data Sources

Questions? tweet @zipfianacademy #pydata

Page 110: Data Engineering 101: Building your first data product

APIs(if it’s not RESTed, I’m not buying)

• Geographic: Foursquare

• Social: Facebook

• Audio: Rdio

• Content: Tumblr

• Realtime: Twitter

• Hidden: Yahoo Finance

Data Sources

Questions? tweet @zipfianacademy #pydata

Page 111: Data Engineering 101: Building your first data product

Web Scraping

1. wget and curl

2. Web Spider/Crawler

3. API scraping

4. Manual Download

(DIY for life)

Data Sources

Questions? tweet @zipfianacademy #pydata

Page 112: Data Engineering 101: Building your first data product

• Delimited Values

• TSV

• CSV

• WSV

• JSON

• XML

• Ad Hoc Formats (avoid these if you can)

Data Formats

Questions? tweet @zipfianacademy #pydata

Page 113: Data Engineering 101: Building your first data product

• JSON is made up of hash tables and arrays • Hash tables: { “foo” : 1, “bar” : 2, baz : “3” } • Arrays: [1, 2, 3] • Arrays of arrays: [[1, 2, 3], [‘foo’, ‘bar’, ‘baz’]] • Array of hashes: [{‘foo’:1, ‘bar’:2}, {‘baz’:3}] • Hashes of hashes: {‘foo’: {‘bar’: 2, ‘baz’: 3}}

Questions? tweet @zipfianacademy #pydata

Data Formats

Page 114: Data Engineering 101: Building your first data product

{"widget": {! "debug": "on",! "window": {! "title": "Sample Konfabulator Widget",! "name": "main_window",! "width": 500,! "height": 500! },! "image": { ! "src": "Images/Sun.png",! "name": "sun1",! "hOffset": 250,! "vOffset": 250,! "alignment": "center"! },! "text": {! "data": "Click Here",! "size": 36,! "style": "bold",! "name": "text1",! "hOffset": 250,! "vOffset": 100,! "alignment": "center",! "onMouseUp": "sun1.opacity = (sun1.opacity / 100) * 90;"! }!}} !

Questions? tweet @zipfianacademy #pydata

Data Formats

Page 115: Data Engineering 101: Building your first data product

• XML is a recursive self-describing container <container>

<item>Foo</item> <item>Bar</item>

<container> <item attr=”SomethingAboutBaz”>Baz</item>

</container> </item>

<container>

Questions? tweet @zipfianacademy #pydata

Data Formats

Page 116: Data Engineering 101: Building your first data product

<widget>! <debug>on</debug>! <window title="Sample Konfabulator Widget">! <name>main_window</name>! <width>500</width>! <height>500</height>! </window>! <image src="Images/Sun.png" name="sun1">! <hOffset>250</hOffset>! <vOffset>250</vOffset>! <alignment>center</alignment>! </image>! <text data="Click Here" size="36" style="bold">! <name>text1</name>! <hOffset>250</hOffset>! <vOffset>100</vOffset>! <alignment>center</alignment>! <onMouseUp>! sun1.opacity = (sun1.opacity / 100) * 90;! </onMouseUp>! </text>!</widget>!

Questions? tweet @zipfianacademy #pydata

Data Formats

Page 117: Data Engineering 101: Building your first data product

• Ad hoc data formats • Fixed-width (Census data) • Graph Edgelists • Voting records • etc.

Questions? tweet @zipfianacademy #pydata

Data Formats

Page 118: Data Engineering 101: Building your first data product

• 7-5-5 format •Sam foo bar!•Roger baz 6!•Jane 314 99

Questions? tweet @zipfianacademy #pydata

Data Formats

Page 119: Data Engineering 101: Building your first data product

• Directed Graph Format 1 2!

1 3!

1 4!

2 3!

4 4

Questions? tweet @zipfianacademy #pydata

Data Formats

Page 120: Data Engineering 101: Building your first data product

• Directed Graph Format 1 2!

1 3!

1 4!

2 3!

4 4

Questions? tweet @zipfianacademy #pydata

Data Formats

Page 121: Data Engineering 101: Building your first data product

Programming languages like Python, Ruby, and R have built in parsers for data formats such as

JSON and CSV. For other esoteric formats you will

probably have to write your own

Questions? tweet @zipfianacademy #pydata

Data Formats


Top Related