a living hell - lessons learned in eight years of parsing real estate data
DESCRIPTION
Slides of a talk delivered by Ed Freyfogle (@freyfogle) at #csvconf in Berlin on 15 July 2014.
TRANSCRIPT
A living hell: lessons learned in eight years of processing real estate listings
Ed Freyfogle
CSVConf Berlin
15 July 2014
Residential property search engine in nine markets
3-4 million unique users per month
Processing close to 20M listings daily
Extensive experience / painful lessons in ETL, geocoding, deduping, ...
http://www.nestoria.com
What we do
Real estate is a complex, high-value transaction. Our goal is to be:
Simple
Comprehensive
Fast (user time and time to market)
Where does the data come from?
Seller
Agent 1
Agent 2
Agent 3
Portal 1
Portal 2
Portal 3
Plenty of chances for data to go bad
Where we do it
India
Very, very good at:
Cricket
Amazing cuisine
World’s largest democracy
Too many other things to list here
Utterly fucking terrible at:
Real Estate data quality
Addresses / Geodata
Must garbage in be garbage out?
Can we turn multiple bits of shit into something useful?
What we really do
Chaos → something useful
Caveat: I love our clients
All the examples you are about to see are theoretical *wink, wink*
Examples / Horror stories
Us: “Please set up an automated data transfer. Thx!”
Them: “It’s impossible to export the data from the database”
Them: “Just crawl our website”
Them: “Let’s do incremental updates to save bandwidth”
Them: “I’ll just send you an email when there is new stuff … starting when I get back from holiday”
Getting the data
zip or tar full of subdirs, names of which change with each upload
filename “feed.xml?key=SsKpyM62QN0RbqCwnaAc”
One file per agent; when a file is not supplied, no way to know if it is missing due to error or intentionally
Format A on Monday, B on Tuesday, ...
Fun with files
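One defensive way to cope with archives whose subdirectory names change on every upload is to ignore the directory structure entirely and match on the file name alone. The sketch below is only an illustration: the `.xml` suffix, the one-file-per-agent naming (`agent1.xml`), and the helper names are assumptions, not a description of any real partner feed.

```python
import io
import zipfile
from pathlib import PurePosixPath


def base_name(member):
    """File name without directories or a trailing '?query' left over from a URL."""
    return PurePosixPath(member).name.split("?", 1)[0]


def find_feed_members(archive_bytes, suffix=".xml"):
    """Return archive members whose file name ends in `suffix`, whatever the subdir."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return sorted(
            name for name in zf.namelist()
            if not name.endswith("/") and base_name(name).lower().endswith(suffix)
        )


def missing_agents(expected_agents, members):
    """Agents we expected a file for but did not receive (hypothetical naming)."""
    received = {PurePosixPath(base_name(m)).stem for m in members}
    return sorted(set(expected_agents) - received)


# Build a tiny in-memory archive with made-up names to demonstrate.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("upload_2014_07_15/agents/agent1.xml", "<listings/>")
    zf.writestr("batch-0042/agent3.xml?key=EXAMPLE", "<listings/>")
    zf.writestr("README.txt", "hello")

members = find_feed_members(buf.getvalue())
absent = missing_agents(["agent1", "agent2", "agent3"], members)
print(members)
print(absent)
```

Reporting the absent agents does not say whether the omission was intentional, but it does turn a silent gap into something a human can ask the partner about.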
<Description>Residential Plot available in Suncity&lt;br/&gt; &lt;br&gt;&lt;br/&gt; &lt;br&gt;SUNCITY PROJECT&lt;br/&gt; &lt;br&gt;&lt;br/&gt; &lt;br&gt;A complete township...
"&gt;" - for when you really, really want to be sure you've escaped your XML
anyone?
XML, LOL
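Descriptions that arrive with HTML escaped once, twice or more can be handled by unescaping until the text stops changing and only then stripping tags. This is a minimal sketch using the standard library; the function names, the round limit, and the sample string are illustrative assumptions, and a text that legitimately contains an escaped entity would be over-decoded by it.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")


def unescape_fully(text, max_rounds=5):
    """Apply html.unescape until the text stops changing (or max_rounds is hit)."""
    for _ in range(max_rounds):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    return text


def clean_description(raw):
    """Decode nested entities, replace tags with spaces, and collapse whitespace."""
    text = TAG_RE.sub(" ", unescape_fully(raw))
    return " ".join(text.split())


# Made-up, doubly escaped sample in the spirit of the slide.
sample = "Plot in Suncity&amp;lt;br/&amp;gt; &amp;lt;br&amp;gt;SUNCITY PROJECT"
print(clean_description(sample))
```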
One 500 MB file of XML
On a single line … to save space
Go grep yourself
Newlines, newlines, newlines
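A single-line 500 MB file defeats line-oriented tools like grep, but a streaming XML parser does not care about newlines. The sketch below uses `xml.etree.ElementTree.iterparse` from the standard library; the `listing` tag and its child fields are assumed names, and it assumes each listing is a direct child of the root element.

```python
import io
import xml.etree.ElementTree as ET


def iter_listings(stream, tag="listing"):
    """Yield one dict per <listing> element without loading the whole file."""
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield {child.tag: (child.text or "").strip() for child in elem}
            root.clear()  # drop already-processed children so memory stays flat


# Tiny stand-in for the real feed: everything on one line, no newlines at all.
data = (b"<feed><listing><id>1</id><title>Flat</title></listing>"
        b"<listing><id>2</id><title>Plot</title></listing></feed>")
listings = list(iter_listings(io.BytesIO(data)))
print(listings)
```

In real use the stream would be an open file object, so only one listing at a time needs to sit in memory.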
Choose your delimiter wisely - ^B
So simple even a child could get it wrong
Microsoft quotes vs. ASCII quotes
Excel vs. CSV
CSV, LOL
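Python's `csv` module accepts any single character as a delimiter, including control characters such as ^B (`\x02`), and typographic quotes can be normalised before parsing. The following is a sketch under the assumption that the curly quotes are being used as field quotes; if they also occur inside free text, blind replacement could break the parse. The function name and sample data are made up.

```python
import csv
import io

# Map "Microsoft" (typographic) quotes to their ASCII equivalents.
SMART_QUOTES = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
})


def read_rows(text, delimiter="\x02"):
    """Parse delimited text after normalising typographic quotes to ASCII."""
    cleaned = text.translate(SMART_QUOTES)
    # newline="" leaves line endings alone so csv can handle newlines in quoted fields.
    return list(csv.reader(io.StringIO(cleaned, newline=""), delimiter=delimiter))


# ^B-delimited sample with curly quotes around a field that contains a newline.
sample = "id\x02description\r\n1\x02\u201cBright flat,\nnear park\u201d\r\n"
rows = read_rows(sample)
print(rows)
```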
Them: “we will send the data in X (where X is large industry player) format”
Us: “not even X uses that format”
Them: “We use X format, but changed it slightly so we could ….”
Us: *sigh*
Wrong tool for right job
Are they really unique?
Are they unique across time?
Partner re-uses numeric unique ids … in case there is ever a shortage of numbers
Unique identifiers
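Both questions about identifiers can be checked mechanically: count IDs within a feed, and compare each ID against what it pointed to in earlier feeds. This is a rough sketch; the field names (`id`, `address`, `type`) and the idea of fingerprinting a listing by address and type are assumptions for illustration only.

```python
from collections import Counter


def duplicate_ids(listings):
    """IDs that appear more than once within a single feed."""
    counts = Counter(item["id"] for item in listings)
    return sorted(i for i, n in counts.items() if n > 1)


def reused_ids(listings, seen):
    """IDs seen before with a different fingerprint, i.e. probably recycled.

    `seen` maps id -> fingerprint from earlier feeds and is updated in place.
    """
    reused = []
    for item in listings:
        fingerprint = (item["address"], item["type"])
        previous = seen.get(item["id"])
        if previous is not None and previous != fingerprint:
            reused.append(item["id"])
        seen[item["id"]] = fingerprint
    return reused


# Made-up feeds: ID 2 is duplicated on Monday, ID 1 is recycled on Tuesday.
monday = [
    {"id": 1, "address": "12 MG Road", "type": "flat"},
    {"id": 2, "address": "3 Lake View", "type": "plot"},
    {"id": 2, "address": "3 Lake View", "type": "plot"},
]
tuesday = [{"id": 1, "address": "7 Park Street", "type": "plot"}]

seen = {}
print(duplicate_ids(monday))
print(reused_ids(monday, seen))
print(reused_ids(tuesday, seen))
```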
I’m ranting
Topics we haven’t yet even touched upon:
Character encodings
Geocoding / Parsing addresses
Image processing/classification at scale
Parsing free text descriptions
Deduplication
Too many other things to list here
Never trust, check everything, every single time
Tests, tests, tests, tests
Embrace UNIX philosophy of many small tools in a chain
Reuse rather than reinvent (but not always)
Technology helps manage the problem; it is not “the solution”.
Problems are almost always cultural, not technical
What have we learned?
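The "check everything" and "many small tools in a chain" lessons combine naturally: each step does one check, passes good records on, and records why it rejected the rest. The sketch below uses Python generators as the chain; the tab-separated three-field layout and the specific checks are invented for the example and are not the actual pipeline.

```python
def parse(lines, rejects):
    """Split tab-separated lines into records; reject lines with the wrong shape."""
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            rejects.append((f"line {number}", "wrong field count"))
            continue
        yield dict(zip(("id", "price", "city"), fields))


def check_price(records, rejects):
    """Keep records whose price is a positive integer, converting it on the way."""
    for rec in records:
        try:
            price = int(rec["price"])
        except ValueError:
            price = 0
        if price > 0:
            yield {**rec, "price": price}
        else:
            rejects.append((rec["id"], "bad price"))


def check_city(records, rejects):
    """Keep records that have a non-blank city."""
    for rec in records:
        if rec["city"].strip():
            yield rec
        else:
            rejects.append((rec["id"], "missing city"))


lines = ["1\t250000\tPune\n", "2\tcall us\tDelhi\n", "3\t90000\t \n", "garbage\n"]
rejects = []
good = list(check_city(check_price(parse(lines, rejects), rejects), rejects))
print(good)
print(rejects)
```

Because each stage is small and independent, each one can have its own tests, and a new check is just one more link in the chain.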
Misaligned incentives
Technology laggards
Apathy
Ignorance
Why do they hate us?
Tricked you - there is of course no single perfect solution
Closest thing is dialog, ideally face to face.
People generally want to do the right thing; they need help to know why and how to do it.
One five-minute conversation is often more useful than five months of email
The solution
Unless you hate life, do NOT try to scrape real estate data
Re-read the line above.
Our API: http://nestoria.com/api
One more thing
http://nestoria.com and http://nestoria.com/api
http://devblog.nestoria.com - our dev blog
http://www.lokku.com - our parent company
http://opencagedata.com - all your geocoding are belong to us
Twitter: @nestoria, @lokku, @opencagedata, @freyfogle
Slides will be on http://slideshare.net/lokku later today
Learn more