sea amsterdam 2014 november 19

GoDataDrivenPROUDLY PART OF THE XEBIA GROUP

Real time data driven applications

Giovanni LanzaniData Whisperer

Using Python + pandas as back end

Who am I?

2008-2012: PhD Theoretical Physics

2012-2013: KPMG

2013-Now: GoDataDriven

GoDataDriven

Feedback

@gglanzani

GoDataDriven

Real-time, data driven app?

• No store and retrieve;

• Store, {transform, enrich, analyse} and retrieve;

• Real-time: retrieve is not a batch process;

• App: something your mother could use:

SELECT attendees FROM pydataberlin2014 WHERE password = '1234';

GoDataDriven

Get insight about event impact

GoDataDriven

Is it Big Data?

• Raw logs are in the order of 40TB;

• We use Hadoop for storing, enriching and pre-processing;

• (10 nodes, 24TB per nodes)

GoDataDriven

Challenges

1. Privacy;2. Huge pile of data;3. Real-time retrieval;

4. Some real-time analysis.

GoDataDriven

1. Privacy

GoDataDriven

3. Real-time retrieval

• Harder than it looks;

• Large data;

• Retrieval is by giving date, center location + radius.

GoDataDriven

4. (Some) real-time analysis

GoDataDriven

Architecture

AngularJS app.py

helper.py

REST

Front-end Back-end

JSON

GoDataDriven

JS - 1

GoDataDriven

JS - 2

Flask

from flask import Flaskapp = Flask(__name__)

@app.route("/hello")def hello(): return "Hello World!"

if __name__ == "__main__": app.run()

GoDataDriven

app.py example

@app.route('/api/<postcode>/<date>/<radius>', methods=['GET'])@app.route('/api/<postcode>/<date>', methods=['GET'])def datapoints(postcode, date, radius=1.0): ... stats, timeline, points = helper.get_json(postcode, date, radius) return … # returns a JSON object for AngularJS

GoDataDriven

data example

date hour id_activity postcode hits delta sbi

2013-01-01 12 1234 1234AB 35 22 1

2013-01-08 12 1234 1234AB 45 35 1

2013-01-01 11 2345 5555ZB 2 1 2

2013-01-08 11 2345 5555ZB 55 2 2

GoDataDriven

Levenshtein distancedef levenshtein(s, t): if len(s) == 0: return len(t) if len(t) == 0: return len(s) cost = 0 if s[-1] == t[-1] else 1 return min(levenshtein(s[:-1], t) + 1, levenshtein(s, t[:-1]) + 1, levenshtein(s[:-1], t[:-1]) + cost)

levenshtein("search", “engine”) # 6 levenshtein("Amsterdam", “meetup") # 7levenshtein("Lplein", “Leidseplein") # 5

GoDataDriven

Aside

• Memoize every recursive algorithm in Python!from cytoolz import functoolz

@functoolz.memoizedef memo_levenshtein(s, t): if len(s) == 0: return len(t) if len(t) == 0: return len(s) cost = 0 if s[-1] == t[-1] else 1 return min(levenshtein(s[:-1], t) + 1, levenshtein(s, t[:-1]) + 1, levenshtein(s[:-1], t[:-1]) + cost)

GoDataDriven

Aside

• Memoize the algorithm if you’re using Python!a = “Marco Polo str.”b = “Marco Polo straat”%timeit memo_levenshtein(a, b) # 213ns%timeit levenshtein(a, b) # 1.84 µs

GoDataDriven

helper.py example

def get_json(postcode, date, radius): ...

lat, lon = get_lat_lon(postcode) postcodes = get_postcodes(postcode, radius) data = get_data(postcodes, dates)

stats = get_statistics(data, sbi) timeline = get_timeline(data, sbi)

return stats, timeline, data.to_json(orient='records')

GoDataDriven

helper.py example

def get_statistics(data, sbi): sbi_df = data[data.sbi == sbi] # select * from data where sbi = sbi hits = sbi_df.hits.sum() # select sum(hits) from … delta_hits = sbi_df.delta.sum() # select sum(delta) from … if delta_hits: percentage = (hits - delta_hits) / delta_hits else: percentage = 0

return {"sbi": sbi, "total": hits, "percentage": percentage}

GoDataDriven

helper.py exampledef get_timeline(data, sbi): df_sbi = data.groupby([“date”, “hour", “sbi"]).aggregate(sum) # select sum(hits), sum(delta) from data group by date, hour, sbi return df_sbi

GoDataDriven

helper.py exampledef get_json(postcode, date, radius): ... lat, lon = get_lat_lon(postcode) postcodes = get_postcodes(postcode, radius) dates = date.split(';') data = get_data(postcodes, dates)

stats = get_statistics(data) timeline = get_timeline(data, dates)


GoDataDriven

Who has my data?

• First iteration was a (pre)-POC, less data (3GB vs 500GB);

• Time constraints;

• Oeps:

import pandas as pd...source_data = pd.read_csv("data.csv", …)...def get_data(postcodes, dates): result = filter_data(source_data, postcodes, dates) return result

GoDataDriven

Advantage of “everything is a df ”

Pro:

• Fast!!

• Use what you know

• NO DBA’s!

• We all love CSV’s!

Contra:

• Doesn’t scale;

• Huge startup time;

• NO DBA’s!

• We all hate CSV’s!

GoDataDriven

If you want to go down this path

• Set the dataframe index wisely;

• Align the data to the index:

• Beware of modifications of the original dataframe!source_data.sort_index(inplace=True)

GoDataDriven

If you want to go down this path

The reason pandas is faster is because I came up with a better algorithm

GoDataDriven

New architecture

data = get_data(db, postcodes, dates)data = get_data(postcodes, dates)

database.py

Data

psycopg2

AngularJS app.py

helper.py

REST

Front-end Back-end

JSON

GoDataDriven

Handling geo-datadef get_json(postcode, date, radius): """ ... """ lat, lon = get_lat_lon(postcode) postcodes = get_postcodes(postcode, radius) dates = date.split(';') data = get_data(postcodes, dates)

stats = get_statistics(data) timeline = get_timeline(data, dates)


GoDataDriven

Issues?!

• With a radius of 10km, in Amsterdam, you get 10k postcodes. You need to do this in your SQL:

• Index on date and postcode, but single queries running more than 20 minutes.

SELECT * FROM datapoints WHERE date IN date_array

AND postcode IN postcode_array;

GoDataDriven

Postgres + Postgis (2.x)

PostGIS is a spatial database extender for PostgreSQL. Supports geographic objects allowing location queries

SELECT *FROM datapointsWHERE ST_DWithin(lon, lat, 1500)AND dates IN ('2013-02-30', '2013-02-31');-- every point within 1.5km -- from (lat, lon) on imaginary dates

Other db’s?

GoDataDriven

Steps to solve it

1. Align data on disk by date;2. Use the temporary table trick:

3. Lose precision: 1234AB→1234

CREATE TEMPORARY TABLE tmp (postcodes STRING NOT NULL PRIMARY KEY);INSERT INTO tmp (postcodes) VALUES postcode_array;

SELECT * FROM tmp JOIN datapoints d ON d.postcode = tmp.postcodes WHERE d.dt IN dates_array;

GoDataDriven

Take home messages

1. Geospatial problems are “hard” and can kill your queries;

2. Not everybody has infinite resources: be smart and KISS!

3. SQL or NoSQL? (Size, schema)

GoDataDriven

We’re hiring / Questions? / Thank you!

@[email protected]

Giovanni LanzaniData Whisperer

mailto:[email protected]

sea amsterdam 2014 november 19

Data & Analytics