

Page 1: MLGotchas V4 Presented · Welcome To Our Toolbox Our Opinionated Views ! •Last SIG event in 2017 … • Open source ML with a focus on H 2O. •This event … • End end-to-end

ML Gotchas
Mary-Ann & Phil Claridge

15 February 2018

www.mandrel.com @MandrelSystems [email protected]

1© 2018 Mandrel Systems www.mandrel.com @MandrelSystems


Editorial - More Slides

• The following slides were presented on 15th February.

• If you would like an extended slide set including screen shots of much of the demonstration please email: [email protected]


Welcome To Our Toolbox
Our Opinionated Views !

• Last SIG event in 2017 …
  • Open source ML with a focus on H2O.
• This event …
  • End-to-end development experience.
  • “Gotchas” on the way.
  • Making money from closed source ML
  • More than just Python and SciKit
• Ask about our full toolbox over Pizza !


In 20+ Min ….
http://carpark.mandrel.com
Both desktop and mobile ready

• Build a web based predictive model for Cambridge Car Parks.


Tooling & Source Code

• We are going to use BigML to …
  • Build a JavaScript machine learning model
  • Model directly pasted into a VueJs web page
• Note
  • No server side for simple demo
  • Design pattern usable for mobile apps.
  • BigML can output models in many languages:
    • JS, Python, Java, C#, and even Excel.
• BigML Link
  • www.bigml.com
  • Start playing with provided datasets (e.g. Titanic)
• Web App Source
  • http://philclaridge.com
  • See the “readme.md” file for instructions on how to build & test.
  • You will also need
    • https://nodejs.org
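
As a rough illustration of the "no server side" pattern: an exported decision tree is essentially a set of nested conditionals, so it can be pasted into a page or app and called like any other function. The sketch below is in Python rather than JavaScript. The function name, the input fields and every threshold and return value are invented for illustration; none of them come from BigML's export or from the real car park model.

```python
def predict_percent_free(timebin: int, weekend: bool) -> float:
    """Hypothetical stand-in for an exported decision tree.

    timebin is the 15-minute slot of the day (0-95). All thresholds and
    returned values are made up for illustration only.
    """
    if timebin < 32:            # before 08:00
        return 90.0
    if weekend:
        if timebin < 72:        # 08:00-17:59 on Saturday/Sunday
            return 20.0
        return 60.0
    if timebin < 68:            # 08:00-16:59 on a weekday
        return 40.0
    return 70.0


print(predict_percent_free(timebin=62, weekend=False))
```

Because the model is just code, the client needs no network call to make a prediction, which is what makes the same pattern usable in a mobile app.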


Smart Cambridge and Data
http://smartcambridge.org

• Bus data derived from live bus feeds from Smart Cambridge.
  • Large archive of historical data.
• Smart City Data includes:
  • For each Car Park - Every 30 sec
    • Date Time
    • Capacity
    • Spaces Free & Occupied
• Data for just over a year
  • Oct 2016 – Jan 2018


Tooling …

• Data Science …
  • Python & Pandas to Wrangle
  • BigML to build a model
• Web App
  • Vue.js as framework
    • With vue-cli template to build skeleton
  • Bootstrap3 for screen layout & Highcharts
    • Including VueStrap
• Common IDE
  • Both Python and JavaScript developed in IntelliJ
  • Also use Visual Studio Code
• FYI, strategically we use:
  • BigML for basic ML and simple neural nets
  • H2O for more controlled machine learning
    • With Spark to wrangle very large data sets
  • TensorFlow etc. for hand-crafted neural nets


Is This Correct?

• Build a web based predictive model for Cambridge Car Parks.


Handover


Where should I park?

• … To be most sure of finding a space
  • Assuming I am driving into town at a known time, and all the town centre car parks are otherwise equally preferable
• Could be re-phrased as
  • Which car park is most likely to have at least n spaces?
  • Which car park is most likely to have at least n% free?
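
One simple way to read "most likely to have at least n spaces" is as an empirical frequency: for each car park, the share of historical samples in the relevant time slot that had n or more free spaces. The sketch below takes that reading. It is a baseline, not the model built in the talk, and the car park names and sample values are made up.

```python
from collections import defaultdict


def best_car_park(samples, timebin, n):
    """Return (parking_id, probability) for the car park whose historical
    samples in `timebin` most often had at least `n` free spaces."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for parking_id, sample_bin, spaces_free in samples:
        if sample_bin != timebin:
            continue
        totals[parking_id] += 1
        if spaces_free >= n:
            hits[parking_id] += 1
    if not totals:
        raise ValueError("no samples for that time bin")
    probs = {p: hits[p] / totals[p] for p in totals}
    best = max(probs, key=probs.get)
    return best, probs[best]


# Made-up samples: (parking_id, timebin, spaces_free)
samples = [
    ("car-park-a", 62, 5), ("car-park-a", 62, 40), ("car-park-a", 62, 12),
    ("car-park-b", 62, 30), ("car-park-b", 62, 25), ("car-park-b", 62, 8),
    ("car-park-a", 20, 300),
]
print(best_car_park(samples, timebin=62, n=20))
```

The "at least n% free" variant is the same calculation with `spaces_free` replaced by a percentage of capacity.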




Data Available

• City Data includes:
  • For each Car Park
    • For 15 months
    • Every 30 sec
    • Date Time, Capacity, Spaces Free, Spaces Occupied
  • ~600k samples
• Data for ONLY just over a year (Oct 2016 – Jan 2018)
  • There is only 1 Easter bank holiday
  • There are only 2 ‘first day of January sales’
  • There is only 1 first day of new school year
• Have to be aware of risks of overfitting
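
The "only 1" and "only 2" counts can be checked mechanically for any calendar day. The sketch below counts how often a given month/day falls inside the data span. The exact start and end dates are assumptions, since the slides only give "Oct 2016 – Jan 2018".

```python
from datetime import date, timedelta


def occurrences(month, day, start, end):
    """Count how many times a given month/day falls within [start, end]."""
    count = 0
    current = start
    while current <= end:
        if (current.month, current.day) == (month, day):
            count += 1
        current += timedelta(days=1)
    return count


# Assumed span; the slides only say "Oct 2016 – Jan 2018".
START, END = date(2016, 10, 1), date(2018, 1, 31)

print(occurrences(1, 1, START, END))    # 1 January
print(occurrences(4, 17, START, END))   # 17 April (Easter Monday in 2017)
```

A feature that is "on" for only one or two days in the whole training set gives a model very little evidence to generalise from, which is the overfitting risk the slide points at.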


Screenshot – Car park data file


,parking_id,spaces_capacity,spaces_free,spaces_occupied,epoch
0,grafton-east-car-park,874,398,476,1477665147
1,grafton-east-car-park,874,406,468,1477665447
2,grafton-east-car-park,874,411,463,1477665747
3,grafton-east-car-park,874,415,459,1477666048
4,grafton-east-car-park,874,427,447,1477666347
…
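
A quick sanity check on a file like this is to look at the gaps between consecutive `epoch` values and to confirm that free plus occupied equals capacity. The sketch below does both for the five rows shown, using only the standard library. The helper names are illustrative.

```python
import csv
import io

# The five rows shown on the slide.
SAMPLE = """\
,parking_id,spaces_capacity,spaces_free,spaces_occupied,epoch
0,grafton-east-car-park,874,398,476,1477665147
1,grafton-east-car-park,874,406,468,1477665447
2,grafton-east-car-park,874,411,463,1477665747
3,grafton-east-car-park,874,415,459,1477666048
4,grafton-east-car-park,874,427,447,1477666347
"""


def sample_intervals(text):
    """Seconds between consecutive samples, in file order."""
    rows = list(csv.DictReader(io.StringIO(text)))
    epochs = [int(r["epoch"]) for r in rows]
    return [later - earlier for earlier, later in zip(epochs, epochs[1:])]


def inconsistent_rows(text):
    """Rows where free + occupied does not equal capacity."""
    bad = []
    for r in csv.DictReader(io.StringIO(text)):
        total = int(r["spaces_free"]) + int(r["spaces_occupied"])
        if total != int(r["spaces_capacity"]):
            bad.append(r)
    return bad


print(sample_intervals(SAMPLE))
print(len(inconsistent_rows(SAMPLE)))
```

For these rows the gaps come out at roughly 300 seconds, which is worth comparing against the documented feed rate when verifying a data load.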


Planned process

• For single car park
  • Concatenate all data into single file
  • Load into BigML
  • Build decision tree for quick look to
    • Verify data load
  • Export model
  • Build Web App to predict parking availability
  • Build more complex model
    • Add features to data and repeat
    • Check feature importance
• Extend to multiple car parks
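
The "concatenate all data into single file" step can be sketched with the standard library alone. The version below refuses to merge files whose headers differ, which surfaces format changes early. The file names and columns in the demo are made up, and the talk itself used Pandas for wrangling.

```python
import csv
import tempfile
from pathlib import Path


def concatenate_csv(paths, out_path):
    """Append rows from CSV files with identical headers into one file.

    Returns the number of data rows written.
    """
    written = 0
    header = None
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in paths:
            with open(path, newline="") as src:
                reader = csv.reader(src)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"{path}: header differs from first file")
                for row in reader:
                    writer.writerow(row)
                    written += 1
    return written


# Small demo with two made-up input files.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "part1.csv"
    second = Path(tmp) / "part2.csv"
    merged = Path(tmp) / "merged.csv"
    first.write_text("parking_id,spaces_free,epoch\na,1,100\na,2,400\n")
    second.write_text("parking_id,spaces_free,epoch\na,3,700\n")
    total = concatenate_csv([first, second], merged)
    merged_lines = merged.read_text().splitlines()

print(total, merged_lines)
```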


Stages where we found ‘Gotchas’


• For single car park: Picked Grand Arcade, with major closures, so non-representative data
• Concatenate all data into single file: Varying data formats, especially dates
• Load into BigML: More varying formats
• Build decision tree for quick look: Overfitting
• Verify data load: Weird data (car park closures, change in field use, change in time format, month wraps)
• Check feature importance
• Export model
• Build Web App to predict parking availability: More varying formats in exported model
• Build more complex model: Different requirements on data category labelling; data quantity doesn’t justify more complex model
• Add features: Had to select to reduce risk of overfitting
• Extend to multiple car parks: Non-representative data for 3/5 car parks
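
For the varying date formats, one defensive approach is to normalise every timestamp through a single function that tries a known list of formats and fails loudly on anything else. The sketch below assumes that approach. The format list is hypothetical, because the slides do not say which formats were encountered, and treating string timestamps as UTC is a further assumption.

```python
from datetime import datetime, timezone

# Hypothetical formats; the slides do not list the ones actually encountered.
CANDIDATE_FORMATS = (
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%dT%H:%M:%S",
    "%d/%m/%Y %H:%M",
)


def parse_timestamp(value: str) -> datetime:
    """Parse epoch seconds or one of the candidate string formats.

    String timestamps are assumed to be UTC (an assumption for this sketch).
    """
    text = value.strip()
    if text.isdigit():
        return datetime.fromtimestamp(int(text), tz=timezone.utc)
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unrecognised timestamp: {value!r}")


print(parse_timestamp("1477665147"))
print(parse_timestamp("28/10/2016 14:32"))
```

Raising on unknown input, rather than guessing, means a "change in time format" part-way through the archive shows up as an error at load time instead of as silently wrong features.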


Added features

• Done
  • Day of week
  • Month
  • Weekday/weekend
  • Public Holiday
  • Time of day (15 min bins)
• Planned
  • School holiday
  • Weather (rainfall, temperature)
  • Major events (Days before/after Christmas, Science week, Race For Life, Strawberry Fair/Winter Fair, student arrival/departure …)
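
The "Done" features can all be derived from the timestamp and the space counts. The sketch below uses the column names that appear in enumcarparksandpercent.csv (`dayofweek`, `month`, `weekend`, `publichol`, `timebin`, `percent_free`). The holiday set is a small illustrative sample, not the authors' list, and the function name is made up.

```python
from datetime import date, datetime

# Illustrative, incomplete holiday list (an assumption, not the authors' list).
PUBLIC_HOLIDAYS = {date(2016, 12, 26), date(2016, 12, 27), date(2017, 1, 2)}


def derive_features(dtm: str, spaces_free: int, spaces_capacity: int) -> dict:
    """Derive calendar and occupancy features from one sample."""
    ts = datetime.strptime(dtm, "%Y-%m-%d %H:%M:%S")
    dayofweek = ts.weekday()  # Monday=0 ... Sunday=6
    return {
        "dayofweek": dayofweek,
        "month": ts.month,
        "weekend": int(dayofweek >= 5),
        "publichol": int(ts.date() in PUBLIC_HOLIDAYS),
        "timebin": ts.hour * 4 + ts.minute // 15,  # 15-minute bins, 0-95
        "percent_free": 100.0 * spaces_free / spaces_capacity,
    }


row = derive_features("2016-10-28 15:32:27", 398, 874)
print(row)
```

For this input the derived values agree with the first data row of the file: day of week 4, time bin 62 and about 45.54% free.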


Screenshot – Car park data file – enumcarparksandpercent.csv


,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,parking_id,spaces_capacity,spaces_free,spaces_occupied,epoch,dtm,day,dayofweek,hour,minute,month,publichol,timebin,weekend,year,percent_free,car_park_id,car_park_0,car_park_1,car_park_2,car_park_3,car_park_4
0,0,0,0,grafton-east-car-park,874,398,476,1477665147,2016-10-28 15:32:27,28,4,15,32,10,0,62,0,2016,45.5377574371,0,1,0,0,0,0
1,1,1,1,grafton-east-car-park,874,406,468,1477665447,2016-10-28 15:37:27,28,4,15,37,10,0,62,0,2016,46.4530892449,0,1,0,0,0,0
2,2,2,2,grafton-east-car-park,874,411,463,1477665747,2016-10-28 15:42:27,28,4,15,42,10,0,62,0,2016,47.0251716247,0,1,0,0,0,0
3,3,3,3,grafton-east-car-park,874,415,459,1477666048,2016-10-28 15:47:28,28,4,15,47,10,0,63,0,2016,47.4828375286,0,1,0,0,0,0
4,4,4,4,grafton-east-car-park,874,427,447,1477666347,2016-10-28 15:52:27,28,4,15,52,10,0,63,0,2016,48.8558352403,0,1,0,0,0,0
5,5,5,5,grafton-east-car-park,874,433,441,1477666647,2016-10-28 15:57:27,28,4,15,57,10,0,63,0,2016,49.5423340961,0,1,0,0,0,0
6,6,6,6,grafton-east-car-park,874,440,434,1477666947,2016-10-28 16:02:27,28,4,16,2,10,0,64,0,2016,50.3432494279,0,1,0,0,0,0
7,7,7,7,grafton-east-car-park,874,444,430,1477667247,2016-10-28 16:07:27,28,4,16,7,10,0,64,0,2016,50.8009153318,0,1,0,0,0,0
8,8,8,8,grafton-east-car-park,874,455,419,1477667547,2016-10-28 16:12:27,28,4,16,12,10,0,64,0,2016,52.0594965675,0,1,0,0,0,0
…
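
The `Unnamed: 0` style columns in this header look like the usual side effect of saving a Pandas DataFrame with its index and reading it back, repeated over several wrangling passes. That is an inference from the header, not something the slides state. The sketch below reproduces one round trip and shows that writing with `index=False` avoids the extra column.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {"parking_id": ["grafton-east-car-park"], "spaces_free": [398]}
)

# Default to_csv() writes the row index as a column with an empty header...
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)  # ...which read_csv names "Unnamed: 0"

# Writing with index=False keeps the file free of the extra column.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
clean = pd.read_csv(without_index)

print(list(reloaded.columns))
print(list(clean.columns))
```

Leftover index columns are harmless in themselves, but they need to be ignored when configuring the dataset so that a row counter is not offered to the model as a feature.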


Mary-Ann’s Demo
Email [email protected] for slides of demo screenshots

• Attach the data source. (Takes too long to do here)
• Configure the data source (data types)
• Configure the dataset (ignored fields)
• Look at the data
• Build the model
• Visually check it looks reasonable
• And here is the JS ….
