PyData Berlin Meetup
TRANSCRIPT

Helping travelers make better hotel choices
500 million times a month*
Steffen Wenz, CTO TrustYou
What does TrustYou do?
For every hotel on the planet, provide a summary of traveler reviews.

✓ Excellent hotel!*
✓ Nice building (“Clean, hip & modern, excellent facilities”)
✓ Great view (« Vue superbe », French: “Superb view”)
✓ Great for partying (“Nice weekend getaway or for partying”)
✗ Solo travelers complain about TVs
ℹ You should check out Reichstag, KaDeWe & Gendarmenmarkt.

*) nhow Berlin (Full summary)
TrustYou Architecture
[Diagram: Crawling → Semantic Analysis → DB → TrustYou Analytics / API (200 million reqs/month) → Kayak, …]
Crawling
Basic crawling setup
[Diagram: Seed URLs (/find?q=Berlin, /find?q=Munich) feed a Frontier of discovered URLs: /meetup/BerlinPyData, /meetup/BerlinCyclists, /find?q=Munich&page=2, /meetup/BerlinPolitics, /find?q=Munich&page=3]
… if only it were so easy
[Same diagram, but the Frontier now fills with junk: endless pagination like /find?q=Munich&page=99999999999…, duplicate URLs such as /meetup/BerlinCyclists, broken links like facebok.com/meetup, …]
Scrapy
● Build your own web crawlers
  ○ Extract data via CSS selectors, XPath, regexes …
  ○ Handles queuing, request parallelism, cookies, throttling …
● Comprehensive and well-designed
● Commercial support by http://scrapinghub.com/
Intro to Scrapy

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    name = "my_spider"

    # start with this URL
    start_urls = ["http://www.meetup.com/find/?allMeetups=true&radius=50&userFreeform=Berlin"]

    # follow these URLs, and call self.parse_meetup to extract data from them
    rules = [
        Rule(LinkExtractor(allow=[
            "^http://www.meetup.com/[^/]+/$",
        ]), callback="parse_meetup"),
    ]

    def parse_meetup(self, response):
        # extract data about the meetup from the HTML
        m = MeetupItem()
        yield m
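MeetupItem isn't shown on the slide; a minimal sketch of what it might look like, with fields matching the JSON output on the next slide (the CSS selectors in parse_meetup are hypothetical, for illustration only):

import scrapy

class MeetupItem(scrapy.Item):
    # fields matching the crawl output below
    url = scrapy.Field()
    name = scrapy.Field()
    members = scrapy.Field()

    # parse_meetup could then fill it roughly like this
    # (made-up selectors; the real page structure differs):
    #
    # def parse_meetup(self, response):
    #     yield MeetupItem(
    #         url=response.url,
    #         name=response.css("h1::text").extract_first(),
    #         members=response.css(".members-count::text").extract_first(),
    #     )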
Try it out!

$ scrapy crawl city -a city=Berlin -t jsonlines -o - 2>/dev/null
{"url": "http://www.meetup.com/Making-Customers-Happy-Berlin/", "name": "eCommerce - Making Customers Happy - Berlin", "members": "774"}
{"url": "http://www.meetup.com/Berlin-Scrum-Meetup/", "name": "Berlin Scrum Meetup", "members": "368"}
{"url": "http://www.meetup.com/Clojure-Berlin/", "name": "The Clojure Conspiracy (Berlin)", "members": "545"}
{"url": "http://www.meetup.com/appliedJavascript/", "name": "Applied Javascript", "members": "494"}
{"url": "http://www.meetup.com/englishconversationclubberlin/", "name": "English Conversation Club Berlin", "members": "1"}
{"url": "http://www.meetup.com/Berlin-Nights-Out-and-Daylight-Catch-Up/", "name": "Berlin Nights Out and Daylight Catch Up", "members": "1"}
...

Full code on GitHub, dump of all Berlin meetups. (Note: Meetup also has an API …)
[Chart: Number of registered meetups]
Crawling at TrustYou scale
● 2 - 3 million new reviews/week
● Customers want alerts 8 - 24h after review publication!
● Smart crawl frequency & depth, but still high overhead
● Pools of constantly refreshed EC2 proxy IPs
● Direct API connections with many sites
Crawling at TrustYou scale
● Custom framework very similar to Scrapy
● Runs on Hadoop cluster (100 nodes)
● … though the problem is not 100% suitable for MapReduce:
  ○ Nodes mostly waiting
  ○ Coordination/messaging between nodes required (see the sketch below):
    ■ Distributed queue
    ■ Rate limiting
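Not from the talk: a toy, single-process illustration of per-domain rate limiting. In the real system this state would have to live in a distributed store shared by the Hadoop nodes; here everything is hypothetical and local:

import collections
import time

class RateLimiter(object):
    """Toy per-domain rate limiter (single process only)."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self.last_request = collections.defaultdict(float)

    def wait(self, domain):
        # sleep until at least min_interval has passed for this domain
        elapsed = time.time() - self.last_request[domain]
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request[domain] = time.time()

limiter = RateLimiter(min_interval=2.0)
limiter.wait("meetup.com")  # returns immediately on first call
limiter.wait("meetup.com")  # sleeps ~2s before the second request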
Textual Data
Treating textual data
[Pipeline: raw text → sentence splitting → tokenization → stopword filtering → stemming]
Tokenization

>>> import nltk
>>> raw = ("We are always looking for interesting talks, locations to "
...        "host meetups and enthusiastic volunteers. Please get in "
...        "touch using info@pydata.berlin.")
>>> nltk.sent_tokenize(raw)
['We are always looking for interesting talks, locations to host meetups and enthusiastic volunteers.', 'Please get in touch using info@pydata.berlin.']
>>> nltk.word_tokenize(raw)
['We', 'are', 'always', 'looking', 'for', 'interesting', 'talks', ',', 'locations', 'to', 'host', 'meetups', 'and', 'enthusiastic', 'volunteers.', 'Please', 'get', 'in', 'touch', 'using', 'info', '@', 'pydata.berlin', '.']
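The remaining pipeline steps, stopword filtering and stemming, aren't shown in the talk; a minimal sketch using NLTK's English stopword list and Porter stemmer, assuming the stopwords corpus is downloaded:

>>> from nltk.corpus import stopwords
>>> from nltk.stem.porter import PorterStemmer
>>> stop = set(stopwords.words("english"))
>>> content = [t for t in nltk.word_tokenize(raw)
...            if t.isalpha() and t.lower() not in stop]
>>> [PorterStemmer().stem(t) for t in content][:5]
['alway', 'look', 'interest', 'talk', 'locat']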
“great rooms” → JJ NN
“great hotel” → JJ NN
“rooms are terrible” → NN VB JJ
“hotel is terrible” → NN VB JJ
Grammars and Parsing
>>> nltk.pos_tag(nltk.word_tokenize("hotel is terrible"))
[('hotel', 'NN'), ('is', 'VBZ'), ('terrible', 'JJ')]
>>> grammar = nltk.CFG.fromstring("""
... OPINION -> NN COP JJ
... OPINION -> JJ NN
... NN -> 'hotel' | 'rooms'
... COP -> 'is' | 'are'
... JJ -> 'great' | 'terrible'
... """)
>>> parser = nltk.ChartParser(grammar)
>>> sent = nltk.word_tokenize("great rooms")
>>> for tree in parser.parse(sent):
...     print tree
(OPINION (JJ great) (NN rooms))
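The same grammar also covers the copula pattern through its first production:

>>> for tree in parser.parse(nltk.word_tokenize("hotel is terrible")):
...     print tree
(OPINION (NN hotel) (COP is) (JJ terrible))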
WordNet

>>> from nltk.corpus import wordnet as wn
>>> wn.morphy('coded', wn.VERB)
'code'
>>> wn.synsets("python")
[Synset('python.n.01'), Synset('python.n.02'), Synset('python.n.03')]
>>> wn.synset('python.n.01').hypernyms()
[Synset('boa.n.02')]
>>> # meh :/
Semantic Analysis at TrustYou
● “Nice room”
● “Room wasn’t so great”
● “The air-conditioning was so powerful that we were cold in the room even when it was off.”
● “อาหารรสชาติดี” (Thai: “The food tastes good”)
● “خدمة جيدة” (Arabic: “Good service”)
● 20 languages
● Linguistic system (morphology, taggers, grammars, parsers …)
● Hadoop: scale out CPU
  ○ ~1B opinions in DB
● Python for ML & NLP libraries
Word2Vec
● Map words to vectors
● “Step up” from bag-of-words model
● ‘Cats’ and ‘dogs’ should be similar, because they occur in similar contexts
>>> m["python"]
array([-0.1351, -0.1040, -0.0823, -0.0287, 0.3709,
-0.0200, -0.0325, 0.0166, 0.3312, -0.0928,
-0.0967, -0.0199, -0.2498, -0.4445, -0.0445,
# ...
-1.0090, -0.2553, 0.2686, -0.4121, 0.3116,
-0.0639, -0.3688, -0.0273, -0.1266, -0.2606,
-0.1549, 0.0023, 0.0084, 0.2169, 0.0060],
dtype=float32)
Fun with Word2Vec

>>> import gensim
>>> # trained from 100k meetup descriptions!
>>> m = gensim.models.Word2Vec.load("data/word2vec")
>>> m.most_similar(positive=["python"])[:3]
[(u'javascript', 0.8382717370986938), (u'php', 0.8266388773918152), (u'django', 0.8189617991447449)]
>>> m.doesnt_match(["python", "c++", "javascript"])
'c++'
>>> m.most_similar(positive=["berlin"])[:3]
[(u'paris', 0.8339072465896606), (u'lisbon', 0.7986686825752258), (u'holland', 0.7970746755599976)]
>>> m.most_similar(positive=["ladies"])[:3]
[(u'girls', 0.8175351619720459), (u'mamas', 0.745951771736145), (u'gals', 0.7336771488189697)]
ML @ TrustYou
● gensim doc2vec model to create hotel embeddings
● Used, together with other features, for various classifiers (sketch below)
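Not TrustYou's production code: a minimal sketch of per-hotel embeddings with gensim's Doc2Vec, assuming reviews are available as (hotel_id, text) pairs. Names and parameters follow gensim 4.x (older versions use size and model.docvecs); hotel_reviews is a hypothetical stand-in:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# hypothetical input: (hotel_id, review_text) pairs
hotel_reviews = [
    ("nhow-berlin", "Clean, hip & modern, excellent facilities"),
    ("nhow-berlin", "Nice weekend getaway or for partying"),
    ("hotel-x", "Rooms are terrible"),
]

# tag every review with its hotel ID; Doc2Vec learns one
# vector per tag, i.e. one embedding per hotel
docs = [TaggedDocument(words=text.lower().split(), tags=[hotel_id])
        for hotel_id, text in hotel_reviews]

model = Doc2Vec(docs, vector_size=100, min_count=1, epochs=20)

# the learned hotel embedding, usable as classifier features
vec = model.dv["nhow-berlin"]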
Workflow Management & Scaling Up

Luigi
● Build complex pipelines of batch jobs
  ○ Dependency resolution
  ○ Parallelism
  ○ Resume failed jobs
● Some support for Hadoop
● Pythonic replacement for Oozie
● Can be combined with Pig, Hive
import luigi

class MyTask(luigi.Task):
    def requires(self):
        return DependentTask()

    def output(self):
        return luigi.LocalTarget("data/my_task_output")

    def run(self):
        with self.output().open("w") as out:
            out.write("foo")
Luigi tasks vs. Makefiles

The same task as a Makefile rule: output() is the target, requires() the prerequisites, run() the recipe:

data/my_task_output: DependentTask
	run ...
Example: Wrap crawl in Luigi task

import os
import subprocess

import luigi

class CrawlTask(luigi.Task):
    city = luigi.Parameter()

    def output(self):
        output_path = os.path.join("data", "{}.jsonl".format(self.city))
        return luigi.LocalTarget(output_path)

    def run(self):
        tmp_output_path = self.output().path + "_tmp"
        subprocess.check_output(["scrapy", "crawl", "city",
                                 "-a", "city={}".format(self.city),
                                 "-o", tmp_output_path, "-t", "jsonlines"])
        os.rename(tmp_output_path, self.output().path)
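Not on the slide: assuming the task lives in a file crawl.py (hypothetical name), a luigi.run() entry point lets you invoke it with Luigi's local scheduler:

if __name__ == "__main__":
    luigi.run()

$ python crawl.py CrawlTask --city Berlin --local-scheduler

Note the design choice in run(): writing to a _tmp path and renaming at the end keeps the output atomic; if the crawl dies halfway, the target never exists and Luigi simply re-runs the task.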
Luigi dependency graphs
Hadoop!
● MapReduce: programming model for distributed computation problems
● Express your algorithm as a sequence of operations (toy example below):
  a. Map: do a linear pass over your data, emit (k, v)
  b. (Distributed sort)
  c. Reduce: linear pass over all (k, v) for the same k
● Python on Hadoop: Hadoop Streaming, MRJob, Luigi (just go learn PySpark instead)
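Not from the talk: a single-machine toy run of the map → sort → reduce flow (word counting), just to make the model concrete:

import itertools

def mapper(line):
    for word in line.split():
        yield word, 1          # Map: emit (k, v)

def reducer(key, values):
    yield key, sum(values)     # Reduce: combine all values for one key

lines = ["great hotel", "great rooms"]
pairs = sorted(kv for line in lines for kv in mapper(line))  # (distributed) sort
for key, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
    for result in reducer(key, (v for _, v in group)):
        print(result)          # ('great', 2), ('hotel', 1), ('rooms', 1)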
Luigi Hadoop integration

class HadoopTask(luigi.hadoop.JobTask):
    def output(self):
        return luigi.HdfsTarget("output_in_hdfs")

    def requires(self):
        return {
            "some_task": SomeTask(),
            "some_other_task": SomeOtherTask()
        }

    def mapper(self, line):
        key, value = line.rstrip().split("\t")
        yield key, value

    def reducer(self, key, values):
        yield key, ", ".join(values)
What happens when it runs:
1. Your input data is sitting in the distributed file system (HDFS)
2. Luigi creates a .tar.gz of your code; Hadoop moves it onto the machines
3. mapper() gets run (distributed)
4. Data gets re-sorted by key
5. reducer() gets run (distributed)
6. Output gets saved in HDFS
Beyond MapReduce
● Batch, never real time
● Slow even for batch (lots of disk IO)
● Limited expressiveness (remedies/crutches: MRJob, Pig, Hive)
● Spark: more complete Python support
Workflows at TrustYou
We’re hiring! [email protected]