a living hell - lessons learned in eight years of parsing real estate data

A living hell: lessons learned in eight years of processing real estate listings
Ed Freyfogle, CSVConf Berlin, 15 July 2014

Upload: lokku

Post on 23-Aug-2014


DESCRIPTION

Slides of the talk delivered by Ed Freyfogle (@freyfogle) at #csvconf in Berlin on 15 July 2014.

TRANSCRIPT

Page 1: A living hell - lessons learned in eight years of parsing real estate data

A living hell: lessons learned in eight years of processing real estate listings

Ed Freyfogle

CSVConf Berlin

15 July 2014

Page 2:

Residential property search engine in nine markets

3-4 million unique users per month

Processing close to 20M listings daily

Extensive experience / painful lessons in ETL, geocoding, deduping, ...

http://www.nestoria.com

Page 3: What we do

Real estate is a complex, high-value transaction. Our goal is to be:

Simple

Comprehensive

Fast (user time and time to market)

Page 9: Where does the data come from?

Seller

Agent 1

Agent 2

Agent 3

Portal 1

Portal 2

Portal 3

Plenty of chances for data to go bad

Page 10: Where we do it

Page 12: India

Very, very good at:

Cricket

Amazing cuisine

World’s largest democracy

Too many other things to list here

Utterly fucking terrible at:

Real Estate data quality

Addresses / Geodata

Page 14: What we really do

Must garbage in be garbage out?

Can we turn multiple bits of shit into something useful?

something useful

Chaos
Page 15: Examples / Horror stories

Caveat: I love our clients

All the examples you are about to see are theoretical *wink, wink*

Page 16: Getting the data

Us: “Please set up an automated data transfer. Thx!”

Them: “It’s impossible to export the data from the database”

Them: “Just crawl our website”

Them: “Let’s do incremental updates to save bandwidth”

Them: “I’ll just send you an email when there is new stuff … starting when I get back from holiday”

Page 17: Fun with files

zip or tar full of subdirs, names of which change with each upload

filename “feed.xml?key=SsKpyM62QN0RbqCwnaAc”

One file per agent; when a file is not supplied, there is no way to know if it is missing due to error or intentionally

Format A on Monday, B on Tuesday, ...
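A minimal sketch of one way to cope with the file problems on this slide: search the whole archive for feed files instead of relying on directory names, and strip a leaked query string from the file name. The function names, the `.xml` suffix and the sample archive contents are assumptions made for illustration, not Nestoria's actual code.

```python
import io
import zipfile
from pathlib import PurePosixPath


def clean_name(raw_name):
    """Return the bare file name, dropping directories and any '?key=...' suffix."""
    return PurePosixPath(raw_name.split("?", 1)[0]).name


def find_feeds(archive, suffix=".xml"):
    """List archive members that look like feeds, whatever subdirectory they sit in."""
    return [
        name
        for name in archive.namelist()
        if clean_name(name).lower().endswith(suffix)
    ]


# Build a tiny in-memory archive (made-up names) to show the idea.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("upload_2014_07_15/agent_1/feed.xml?key=abc", "<feed/>")
    archive.writestr("upload_2014_07_15/readme.txt", "hello")

with zipfile.ZipFile(buffer) as archive:
    feeds = find_feeds(archive)
```

This does nothing for the "missing file" case: telling an intentional omission from an error still needs an agreement with the sender, for example an explicit manifest.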

Page 18: XML, LOL

<Description>Residential Plot available in Suncity&amp;lt;br/&amp;gt;&#13;&amp;lt;br&amp;gt;&amp;lt;br/&amp;gt;&#13;&amp;lt;br&amp;gt;SUNCITY PROJECT&amp;lt;br/&amp;gt;&#13;&amp;lt;br&amp;gt;&amp;lt;br/&amp;gt;&#13;&amp;lt;br&amp;gt;A complete township...

"&amp;gt;" - for when you really, really want to be sure you've escaped your XML

&#13; anyone?
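One possible way to recover readable text from a description like the one on this slide is to unescape repeatedly until the text stops changing, then turn the `<br>` tags into newlines. This is a sketch under assumptions: the helper names are invented here, and unescaping "until stable" will also decode text that was meant to stay escaped, so it only suits feeds known to be over-escaped.

```python
import html
import re


def unescape_fully(text, max_rounds=5):
    """Apply html.unescape until the text stops changing (or max_rounds is hit)."""
    for _ in range(max_rounds):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    return text


def clean_description(raw):
    """Decode stacked escaping, drop carriage returns, turn <br> into newlines."""
    text = unescape_fully(raw).replace("\r", "")
    text = re.sub(r"<br\s*/?>", "\n", text)
    return re.sub(r"\n{2,}", "\n", text).strip()


# Shortened version of the slide's example, as it appears in the raw file.
raw = "Residential Plot available in Suncity&amp;lt;br/&amp;gt;&#13;&amp;lt;br&amp;gt;SUNCITY PROJECT"
cleaned = clean_description(raw)
```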

Page 19: Go grep yourself

One 500 MB file of XML

On a single line … to save space
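Line-oriented tools are of little use on a single-line file, but a streaming parser does not care about line breaks. The sketch below uses `xml.etree.ElementTree.iterparse` from the Python standard library; the `listing` tag and the sample feed are hypothetical, since the slides do not show the real schema.

```python
import io
import xml.etree.ElementTree as ET


def iter_listings(source, tag="listing"):
    """Yield one dict per <listing> element without building the whole tree first."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield {child.tag: (child.text or "") for child in elem}
            elem.clear()  # drop the element's children once they have been read


# A single-line feed, standing in for the 500 MB original.
feed = io.BytesIO(
    b"<feed><listing><id>1</id><price>100</price></listing>"
    b"<listing><id>2</id><price>200</price></listing></feed>"
)
listings = list(iter_listings(feed))
```

`source` can also be a file name. The emptied elements stay attached to the root, so memory still grows slowly on very large feeds; that is acceptable for a sketch but worth handling in production code.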

Page 20: CSV, LOL

Newlines, newlines,

newlines

Choose your delimiter wisely - ^B

So simple even a child could get it wrong

Microsoft quotes vs. ASCII quotes

Excel vs. CSV
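A sketch of how Python's standard `csv` module could handle three of the problems on this slide: a `^B` (`\x02`) delimiter, newlines inside quoted fields, and Microsoft "smart" quotes. Converting smart quotes to ASCII quotes before parsing is an assumption that only holds if the sender used them as field quoting; it would mangle descriptions that contain typographic quotes as ordinary text.

```python
import csv
import io

# Map Microsoft "smart" quotes to their ASCII equivalents.
SMART_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def read_rows(text, delimiter="\x02"):
    """Parse delimited text, tolerating newlines inside quoted fields."""
    # newline="" stops StringIO from translating line endings, so the csv
    # module sees embedded newlines exactly as they were sent.
    stream = io.StringIO(text.translate(SMART_QUOTES), newline="")
    return list(csv.reader(stream, delimiter=delimiter))


# Made-up two-row sample: ^B delimiter, smart quotes, newline inside a field.
sample = "id\x02description\r\n1\x02\u201cTwo bed flat\nwith garden\u201d\r\n"
rows = read_rows(sample)
```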

Page 21: Wrong tool for right job

Them: “we will send the data in X (where X is a large industry player) format”
Us: “not even X uses that format”

Them: “We use X format, but changed it slightly so we could ….”
Us: *sigh*

Page 22: Unique identifiers

Are they really unique?

Are they unique across time?

Partner re-uses numeric unique ids … in case there is ever a shortage of numbers
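Both questions on this slide can be checked mechanically. The sketch below flags ids repeated within one feed, and ids that reappear in a later feed attached to what looks like a different property. The field names (`id`, `address`, `property_type`) and the choice of "stable" fields are assumptions for illustration only.

```python
from collections import Counter


def duplicate_ids(listings):
    """Return ids that occur more than once within a single feed."""
    counts = Counter(item["id"] for item in listings)
    return sorted(i for i, n in counts.items() if n > 1)


def reused_ids(previous, current, stable_fields=("address", "property_type")):
    """Return ids whose supposedly stable fields changed between two snapshots."""
    seen = {item["id"]: tuple(item.get(f) for f in stable_fields) for item in previous}
    return sorted({
        item["id"]
        for item in current
        if item["id"] in seen
        and seen[item["id"]] != tuple(item.get(f) for f in stable_fields)
    })


# Made-up snapshots: id 1001 now points at a different property, 1002 is sent twice.
monday = [
    {"id": "1001", "address": "12 High St", "property_type": "flat"},
    {"id": "1002", "address": "3 Park Rd", "property_type": "house"},
]
tuesday = [
    {"id": "1001", "address": "7 Mill Lane", "property_type": "house"},
    {"id": "1002", "address": "3 Park Rd", "property_type": "house"},
    {"id": "1002", "address": "3 Park Rd", "property_type": "house"},
]
```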

Page 23: I’m ranting

Topics we haven’t yet even touched upon:

Character encodings

Geocoding / Parsing addresses

Image processing/classification at scale

Parsing free text descriptions

Deduplication

Too many other things to list here

Page 24: What have we learned?

Never trust, check everything, every single time

Tests, tests, tests, tests

Embrace UNIX philosophy of many small tools in a chain

Reuse rather than reinvent (but not always)

Technology helps manage the problem; it is not “the solution”.

Problems are almost always cultural, not technical
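As an illustration of "check everything, every single time", a small per-listing check might look like the sketch below. The required fields and the rules are assumptions chosen for the example; the slides do not say which checks Nestoria actually runs.

```python
REQUIRED = ("id", "price", "latitude", "longitude")


def check_listing(listing):
    """Return a list of problems found in one listing (empty means it passed)."""
    problems = [f"missing {name}" for name in REQUIRED if listing.get(name) in (None, "")]
    if problems:
        return problems
    try:
        price = float(listing["price"])
        lat = float(listing["latitude"])
        lon = float(listing["longitude"])
    except (TypeError, ValueError):
        return ["non-numeric price or coordinates"]
    if price <= 0:
        problems.append("price must be positive")
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        problems.append("coordinates out of range")
    return problems
```

A function this small is easy to cover with tests and easy to chain with other single-purpose steps, which fits the UNIX-style approach the slide recommends.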

Page 25: Why do they hate us?

Misaligned incentives

Technology laggards

Apathy

Ignorance

Page 26: The solution

Tricked you - there is of course no single perfect solution

Closest thing is dialog, ideally face to face.

People generally want to do the right thing; they need help to know why and how to do it.

One five-minute conversation is often more useful than five months of email

Page 27: One more thing

Unless you hate life, do NOT try to scrape real estate data

Re-read the line above.

Our API: http://nestoria.com/api

Page 28: Learn more

http://nestoria.com and http://nestoria.com/api

http://devblog.nestoria.com - our dev blog

http://www.lokku.com - our parent company

http://opencagedata.com - all your geocoding are belong to us

Twitter: @nestoria, @lokku, @opencagedata, @freyfogle

Slides will be on http://slideshare.net/lokku later today