working with the divvy data set

Working With The Divvy Dataset

Pratik Agrawal

IntroductionOver the past couple of years Divvy has organized data challenges for invigorating someinnovation in the Chicago Data Science community as well as learn new ways to visualizeand manage the bike rental system.

ProblemThere are always Divvy vans that ferry bikes around from station to station based on the lackor surplus of bikes at a given location. This movement of bikes is labor and time intensive.Both of which are high costs that Divvy has to bear. It would be nice to be able to predict thevolume of rentals, and allow for precise scheduling. In this project I have decided to work with daily rental volume (total rides) as my targetvariable, and as this is a supervised learning problem the techniques that would be used areas follows- a) Lasso Regression b) Ridge Regression c) Elastic Net d) Gradient Boosted Regression

Data Setsa) Divvy data set 2015 Q1 & Q2 b) Route Information data- In order to enrich the data set with more information, I decided toinclude distance information (route calculation from HERE.com Route Calculation API) foreach origin/destination pair in the dataset. c) Weather data- weather data from Wunderground.com was downloaded for the periodpertaining to the Divvy data set.

In [1]:

In [2]:

The Dataset

1. Lets read the readme.txt file supplied with the dataset, and see what all features areincluded in this data set

Even though the file is for the 2013 dataset, the columns have not changed much in thecurrent year

Populating the interactive namespace from numpy and matplotlib

import pandas as pdimport matplotlib.pyplot as pltimport gcimport seaborn as sns%pylab inline

import warningswarnings.filterwarnings('ignore')

In [3]: readme_txt_file=open("./week-1/Divvy_Stations_Trips_2013/README.txt",'r'for line in readme_txt_file.readlines(): if line!=None: print line

This file contains metadata for both the Trips and Stations table.

For more information, see the contest page at http://DivvyBikes.com/datachallenge (http://DivvyBikes.com/datachallenge) or email questions to [email protected].

Metadata for Trips Table:

Variables:

trip_id: ID attached to each trip taken

starttime: day and time trip started, in CST

stoptime: day and time trip ended, in CST

bikeid: ID attached to each bike

tripduration: time of trip in seconds

from_station_name: name of station where trip originated

to_station_name: name of station where trip terminated

from_station_id: ID of station where trip originated

to_station_id: ID of station where trip terminated

usertype: "Customer" is a rider who purchased a 24-Hour Pass; "Subscriber" is a rider who purchased an Annual Membership

gender: gender of rider

birthyear: birth year of rider

Notes:

* First row contains column names

http://divvybikes.com/datachallenge

From the above information we have a good idea about what the dataset looks like.Under normal circumstances such a clean set is hard to come by. The meta dataprovided is actually very useful, since that is another feature missing fromdatasets.

2. Read the files

In [3]:

Lets take a quick peek at the head for each data frame

* Total records = 759,789

* Trips that did not include a start or end date were removed from original table.

* Gender and birthday are only available for Subscribers

Metadata for Stations table:

Variables:

name: station name

latitude: station latitude

longitude: station longitude

dpcapacity: number of total docks at each station as of 2/7/2014

online date: date the station went live in the system

df_trips_1=pd.read_csv("../../data/Divvy_Trips_2015-Q1Q2/Divvy_Trips_2015-Q1.csv"df_trips_2=pd.read_csv("../../data/Divvy_Trips_2015-Q1Q2/Divvy_Trips_2015-Q2.csv"df_stations=pd.read_csv("../../data/Divvy_Trips_2015-Q1Q2/Divvy_Stations_2015.csv"df_trips = pd.concat([df_trips_1,df_trips_2])

In [5]:

In [6]:

The dataframes above can be joined on from_station_id/to_station_id and id

Lets look at the shape of the dataframes

In [7]:

In [8]:

Joining the origin station id with the data from the stations data frame

Out[5]: trip_id starttime stoptime bikeid tripduration from_station_id from_station_name

0 47384543/31/201523:58

4/1/20150:03

1095 299 117Wilton Ave &Belmont Ave

1 47384503/31/201523:59

4/1/20150:15

537 940 43Michigan Ave &Washington St

2 47384493/31/201523:59

4/1/20150:11

2350 751 162Damen Ave &Wellington Ave

3 47384483/31/201523:59

4/1/20150:19

938 1240 51Clark St & RandolphSt

4 47384453/31/201523:54

4/1/20150:15

379 1292 134Peoria St & JacksonBlvd

Out[6]: id name latitude longitude dpcapacity landmark

0 2 Michigan Ave & Balbo Ave 41.872293 -87.624091 35 541

1 3 Shedd Aquarium 41.867226 -87.615355 31 544

2 4 Burnham Harbor 41.856268 -87.613348 23 545

3 5 State St & Harrison St 41.874053 -87.627716 23 30

4 6 Dusable Harbor 41.885042 -87.612795 31 548

Out[7]: (1096239, 12)

Out[8]: (474, 6)

df_trips.head()

df_stations.head()

df_trips.shape

df_stations.shape

In [4]:

In [12]:

In [13]:

Joining the destination station id with the data from the stations data frame

In [5]:

In [15]:

Out[12]: (1096239, 18)

1 47384313/31/201523:42

3/31/201523:47


2 47383863/31/201523:04

3/31/201523:07


3 47383033/31/201522:19

3/31/201522:22


4 47380893/31/201521:07

3/31/201521:10


Out[15]: (1096239, 24)

df_from=pd.merge(df_trips,df_stations,left_on="from_station_id",right_on

df_from.shape

df_from.head()

df_divvy=pd.merge(df_from,df_stations,left_on="to_station_id",right_on=

df_divvy.shape

In [16]:

Lets try a sample call to the HERE maps api

In [16]:

To use the here.com api, one has to register as a developer, and is limited to a 100Kcalls/month

For security purposes the application id and code for my dev user has not beenincluded in the api call made below.

In [17]:

In [18]:

Lets take a look at what the HERE Calcuate Route API response looks like


1096234 53484275/27/20157:04

5/27/20157:21

2817 1023 428Dorchester Ave &63rd St

1096235 53382095/26/201510:38

5/26/201510:53

2819 912 428Dorchester Ave &63rd St

1096236 56704226/16/201518:01

6/16/201518:16

3113 869 95Stony Island Ave &64th St

1096237 53750755/28/201515:49

5/28/201516:04

2004 892 391 Halsted St & 69th St

1096238 56118586/13/20159:36

6/13/20159:42

4703 374 388 Halsted St & 63rd St

5 rows × 24 columns

df_divvy.tail()

from urllib2 import urlopenfrom StringIO import StringIOimport simplejson

url = urlopen('http://route.cit.api.here.com/routing/7.2/calculateroute.json?app_id=APP_ID&app_code=APP_CODE&waypoint0=geo!41.90096,-87.623777&waypoint1=geo!41.91468,-87.64332&mode=fastest;car;traffic:disabled'

json_array = simplejson.loads(url)

In [19]:

Out[19]:

json_array

{u'response': {u'language': u'en-us', u'metaInfo': {u'interfaceVersion': u'2.6.18', u'mapVersion': u'8.30.60.106', u'moduleVersion': u'7.2.63.0-1185', u'timestamp': u'2015-12-08T21:37:18Z'}, u'route': [{u'leg': [{u'end': {u'label': u'W Menomonee St', u'linkId': u'+19805890', u'mappedPosition': {u'latitude': 41.9146268, u'longitude': -87.6433185}, u'mappedRoadName': u'W Menomonee St', u'originalPosition': {u'latitude': 41.9146799, u'longitude': -87.64332}, u'shapeIndex': 60, u'sideOfStreet': u'left', u'spot': 0.1862745, u'type': u'stopOver'}, u'length': 3620, u'maneuver': [{u'_type': u'PrivateTransportManeuverType', u'id': u'M1', u'instruction': u'Head toward N Michigan Ave on E Lake Shore Dr. Go for 23 m.', u'length': 23, u'position': {u'latitude': 41.9008181, u'longitude': -87.6237659}, u'travelTime': 9}, {u'_type': u'PrivateTransportManeuverType', u'id': u'M2', u'instruction': u'Turn right onto N Lake Shore Dr (US-41 N). Go for 1.1 km.', u'length': 1120, u'position': {u'latitude': 41.900804, u'longitude': -87.6240349}, u'travelTime': 76}, {u'_type': u'PrivateTransportManeuverType', u'id': u'M3', u'instruction': u'Keep right toward North Ave/IL-64/Lasalle Dr. Go for 316 m.', u'length': 316, u'position': {u'latitude': 41.9106638, u'longitude': -87.6257515}, u'travelTime': 40}, {u'_type': u'PrivateTransportManeuverType', u'id': u'M4', u'instruction': u'Turn left onto W La Salle Dr. Go for 886 m.',

u'length': 886, u'position': {u'latitude': 41.9133461, u'longitude': -87.6259875}, u'travelTime': 114}, {u'_type': u'PrivateTransportManeuverType', u'id': u'M5', u'instruction': u'Turn right onto W North Ave (IL-64). Go for 853 m.', u'length': 853, u'position': {u'latitude': 41.9111681, u'longitude': -87.6331329}, u'travelTime': 123}, {u'_type': u'PrivateTransportManeuverType', u'id': u'M6', u'instruction': u'Turn right onto N Larrabee St. Go for 403 m.', u'length': 403, u'position': {u'latitude': 41.9109857, u'longitude': -87.6434219}, u'travelTime': 66}, {u'_type': u'PrivateTransportManeuverType', u'id': u'M7', u'instruction': u'Turn right onto W Menomonee St. Go for 19 m.', u'length': 19, u'position': {u'latitude': 41.9146228, u'longitude': -87.6435506}, u'travelTime': 3}, {u'_type': u'PrivateTransportManeuverType', u'id': u'M8', u'instruction': u'Arrive at W Menomonee St. Your destination is on the left.', u'length': 0, u'position': {u'latitude': 41.9146268, u'longitude': -87.6433185}, u'travelTime': 0}], u'start': {u'label': u'E Lake Shore Dr', u'linkId': u'-858448508', u'mappedPosition': {u'latitude': 41.9008181, u'longitude': -87.6237659}, u'mappedRoadName': u'E Lake Shore Dr', u'originalPosition': {u'latitude': 41.9009599, u'longitude': -87.6237771}, u'shapeIndex': 0, u'sideOfStreet': u'right', u'spot': 0.0247934, u'type': u'stopOver'}, u'travelTime': 431}],

To access the distance between the two points provided in the API request, we canlook at the summary section of the JSON object

In [22]:

Similarly we can access other parameters such as base time and traffic time (bothhave been provided for vehicle based routing). This API however does not provideestimates as to how traffic affects the bicycle times.

In [23]:

u'mode': {u'feature': [], u'trafficMode': u'disabled', u'transportModes': [u'car'], u'type': u'fastest'}, u'summary': {u'_type': u'RouteSummaryType', u'baseTime': 431, u'distance': 3620, u'flags': [u'park'], u'text': u'The trip takes 3.6 km and 7 mins.', u'trafficTime': 431, u'travelTime': 431}, u'waypoint': [{u'label': u'E Lake Shore Dr', u'linkId': u'-858448508', u'mappedPosition': {u'latitude': 41.9008181, u'longitude': -87.6237659}, u'mappedRoadName': u'E Lake Shore Dr', u'originalPosition': {u'latitude': 41.9009599, u'longitude': -87.6237771}, u'shapeIndex': 0, u'sideOfStreet': u'right', u'spot': 0.0247934, u'type': u'stopOver'}, {u'label': u'W Menomonee St', u'linkId': u'+19805890', u'mappedPosition': {u'latitude': 41.9146268, u'longitude': -87.6433185}, u'mappedRoadName': u'W Menomonee St', u'originalPosition': {u'latitude': 41.9146799, u'longitude': -87.64332}, u'shapeIndex': 60, u'sideOfStreet': u'left', u'spot': 0.1862745, u'type': u'stopOver'}]}]}}

3620

base_time: 431 traffic_time: 431

print json_array['response']['route'][0]['summary']['distance']

print "base_time: ",json_array['response']['route'][0]['summary']['baseTime'print "traffic_time: ",json_array['response']['route'][0]['summary']['trafficTime'

Lets create a function to query the HERE.com Calculate Route API for any twolocations. And also test this with the first two rows of the data set

In [24]:

In [25]:

In [26]:

Now lets do a simple reduction in the number of calls made to the HERE.comAPI

In [29]:

Out[25]: base_time distance json_array traffic_time

0 208 923 {u'response': {u'route': [{u'leg': [{u'start':... 208

1 208 923 {u'response': {u'route': [{u'leg': [{u'start':... 208

Out[26]: {u'_type': u'RouteSummaryType', u'baseTime': 208, u'distance': 923, u'text': u'The trip takes 923 m and 3 mins.', u'trafficTime': 208, u'travelTime': 208}

def calc_dist_time(x): url = urlopen('http://route.cit.api.here.com/routing/7.2/calculateroute.json?app_id=APP_ID&app_code=APP_CODE&waypoint0=geo!%s,%s&waypoint1=geo!%s,%s&mode=fastest;car;traffic:disabled' json_array = simplejson.loads(url) base_time=json_array['response']['route'][0]['summary']['baseTime'] traffic_time=json_array['response']['route'][0]['summary']['trafficTime' distance=json_array['response']['route'][0]['summary']['distance'] return pd.Series({'base_time':base_time, 'traffic_time':traffic_time, 'distance':distance, 'json_array':json_array})df_dist=df_divvy.head(2).apply(calc_dist_time,axis=1)

df_dist

df_dist.json_array[0]['response']['route'][0]['summary']

df_temp=df_divvy.drop_duplicates(["latitude_x","longitude_x","latitude_y"

In [30]:

As can be seen from above, the number of calls that will need to be made to theHERE.com API is 65K, which is well below the monthly quota. This can be furtherreduced by removing the duplicated between x-y and y-x combinations of thelocations.

Note: I tried Google Maps API (only a few thousand free calls, and throttled/deniedthereafter), as well as Open Street Maps API, and only found HERE.com API to be themost responsive, and best in class in terms of quota.

Lets run the query for each combination of location in this reduced dataset

I already ran the code below prior to forming this notebook, and had saved the results of thequeries. Hence you will not see execution numbers for some of the code blocks

In [ ]:

In [33]:

In [66]:

In [76]:


0 47384543/31/201523:58

4/1/20150:03


181 44479911/17/201515:26

1/17/201515:57

645 1859 43Michigan Ave &Washington St

184 46315883/14/201518:20

3/14/201518:38

1226 1103 162Damen Ave &Wellington Ave

192 47356463/31/201517:16

3/31/201517:37

1312 1296 51Clark St & RandolphSt

6/8/2015 6/8/2015 Peoria St & Jackson

df_temp

df_dist=df_temp.apply(calc_dist_time,axis=1)

df_dist_matrix = df_divvy[["latitude_x","longitude_x","latitude_y","longitude_y"

df_dist_time = pd.merge(df_dist_matrix,df_dist,left_on="ix",right_on="ix"

df_dist_time["key"] = "%s_%s_%s_%s"%(str(df_dist_time.latitude_x), str(df_dist_time.longitude_x), str(df_dist_time.latitude_y), str(df_dist_time.longitude_y))

In [80]:

In [89]:

In [17]:

In [18]:

In [19]:

In [20]:

In [39]:

Lets save this data set

In [102]:

We can free up the memory, by forcing garbage collection. I've done this as there is lotof data held in memory, and there is no further use for it.

Out[18]: ['Unnamed: 0', 'ix', 'latitude_x', 'longitude_x', 'latitude_y', 'longitude_y', 'base_time', 'distance', 'json_array', 'traffic_time']

Out[39]: 28

df_divvy["key"] = "%s_%s_%s_%s"%(str(df_divvy.latitude_x), str(df_divvy.longitude_x), str(df_divvy.latitude_y), str(df_divvy.longitude_y))

df_dist_time.to_csv('../../data/Divvy_Trips_2015-Q1Q2/lat_lon_dist_time.csv'

df_dist_time = pd.read_csv('../../data/Divvy_Trips_2015-Q1Q2/lat_lon_dist_time.csv'

list(df_dist_time.columns)

df_divvy = pd.merge(df_divvy,df_dist_time,left_on=["latitude_x","longitude_x"

df_divvy.drop(["ix","json_array"],axis=1,inplace=True)

len(list(df_divvy.columns))

df_divvy.to_csv('../../data/Divvy_Trips_2015-Q1Q2/complete-data.csv')

In [8]:

Lets also download weather information for each day of Q1 & Q2 2015. For thispurpose I downloaded the weather history from wunderground.com

Note: Code execution resumes from here, as code above requires a dev account to makecalls to HERE.com

In [22]:

Out[8]: 114

Out[22]:

CDTMaxTemperatureF

MeanTemperatureF

MinTemperatureF

MaxDewPointF

MeanDewPointF

MinDewpointF

0 1/1/15 32 25 17 16 11 4

1 1/2/15 36 28 20 22 19 15

2 1/3/15 37 34 31 36 32 22

3 1/4/15 36 21 5 35 22 -5

4 1/5/15 10 5 -1 5 -3 -10


df_dist=[]df_dist_matrix=[]df_dist_time=[]df_from=[]df_trips=[]df_trips_1=[]df_trips_2=[]df_divvy_all=[]import gcgc.collect()

weather = pd.read_csv('../../data/Divvy_Trips_2015-Q1Q2/CustomWeather.csv'weather.head()

In [23]:

Lets clean the column names, and get rid of the leading white space

In [24]:

Out[23]: ['CDT', 'Max TemperatureF', 'Mean TemperatureF', 'Min TemperatureF', 'Max Dew PointF', 'MeanDew PointF', 'Min DewpointF', 'Max Humidity', ' Mean Humidity', ' Min Humidity', ' Max Sea Level PressureIn', ' Mean Sea Level PressureIn', ' Min Sea Level PressureIn', ' Max VisibilityMiles', ' Mean VisibilityMiles', ' Min VisibilityMiles', ' Max Wind SpeedMPH', ' Mean Wind SpeedMPH', ' Max Gust SpeedMPH', 'PrecipitationIn', ' CloudCover', ' Events', ' WindDirDegrees']

list(weather.columns)

weather.columns=[c.strip(" ") for c in weather.columns]

In [25]:

Lets convert the date feature of both df_divvy and weather data to the sklearn datetimeobject.

In [26]:

In [27]:

In [28]:

Out[25]: ['CDT', 'Max TemperatureF', 'Mean TemperatureF', 'Min TemperatureF', 'Max Dew PointF', 'MeanDew PointF', 'Min DewpointF', 'Max Humidity', 'Mean Humidity', 'Min Humidity', 'Max Sea Level PressureIn', 'Mean Sea Level PressureIn', 'Min Sea Level PressureIn', 'Max VisibilityMiles', 'Mean VisibilityMiles', 'Min VisibilityMiles', 'Max Wind SpeedMPH', 'Mean Wind SpeedMPH', 'Max Gust SpeedMPH', 'PrecipitationIn', 'CloudCover', 'Events', 'WindDirDegrees']

list(weather.columns)

df_divvy.drop("Unnamed: 0",axis=1,inplace=True)

df_divvy["date"]=df_divvy.starttime.apply(lambda x: x.split(" ")[0])

df_divvy["date"]=pd.to_datetime(df_divvy.date)

In [29]:

In [30]:

In [31]:


0 47384543/31/201523:58

4/1/20150:03


1 47312163/31/20158:03

3/31/20158:08


2 47298483/30/201521:22

3/30/201521:27


3 47296723/30/201520:42

3/30/201520:51


4 47153903/27/201521:26

3/27/201521:31



Out[31]:

CDTMaxTemperatureF

MeanTemperatureF

MinTemperatureF

MaxDewPointF

MeanDewPointF

MinDewpointF

02015-01-01

32 25 17 16 11 4

12015-01-02

36 28 20 22 19 15

22015-01-03

37 34 31 36 32 22

32015-01-04

36 21 5 35 22 -5

42015-01-05

10 5 -1 5 -3 -10


df_divvy.head()

weather["CDT"]=pd.to_datetime(weather.CDT)

weather.head()

In [32]:

Analysis

EDA

Now that we have all the data in order, lets take a look at where these stations arelocated on the map. We will also plot a random sampling of the user types (subscribersv/s customers) and the stations they travel between

In [6]:

weather.PrecipitationIn=weather.PrecipitationIn.convert_objects(convert_numeric

from IPython.display import HTMLimport folium

def inline_map(map): """ Embeds the HTML source of the map directly into the IPython notebook. This method will not work if the map depends on any files (json data). Also this uses the HTML5 srcdoc attribute, which may not be supported in all browsers. """ map._build_map() return HTML('<iframe srcdoc="{srcdoc}" style="width: 100%; height: 510px; border: none"></iframe>'

def embed_map(map, path="map.html"): """ Embeds a linked iframe to the map into the IPython notebook. Note: this method will not capture the source of the map into the notebook. This method should work for all maps (as long as they use relative urls). """ map.create_map(path=path) return HTML('<iframe src="files/{path}" style="width: 100%; height: 510px; border: none"></iframe>'

In [7]:

As can be seen from the above map-

Out[7]:

map_osm = folium.Map(location=[41.9065732,-87.7142335],tiles='Stamen Toner'for i in range(0,df_stations.shape[0]): map_osm.circle_marker(location=[df_stations.latitude[i], df_stations fill_color='blue')

np.random.seed(123)numbers = np.arange(1,1000000)np.random.shuffle(numbers)

for i in range(1,10000): if(df_divvy.usertype[numbers[i]]=="Subscriber"): map_osm.line([[df_divvy.latitude_x[numbers[i]],df_divvy.longitude_x else: map_osm.line([[df_divvy.latitude_x[numbers[i]],df_divvy.longitude_xinline_map(map_osm)

a) Subscribers are marked with the red lines. b) Customers are market with green lines. c) Subscribers tend to use this service more as a daily commute option versus customerswho use this for shorter distances. d) Customers tend to run the bikes in the more tourist-y areas (Lake Shore Trail, Loop,Millenium Park). e) The bike stations on the periphery of the map see the least traffic.

Note: The above map is interactive, so you should be able to zoom in/out and panthroughout.

1. Lets look at the distances travelled by usertype (Customer v/s Subscriber)

In [35]:

It is clear from the above plots that the subscribers in general ride longer distances, as wellas contribute to the majority of bike rentals. However, the customers (or tourists/one-timeriders) also contribute to a significant number of rides. Within the customers, we can think ofthe riders as-

1. tourists- riders, who rent bikes on weekends and Thursdays.2. daily-riders- who do not have active subscription, and are riding these bikes on

Monday-Wednesday.

2. Lets look at who are the most active bike renters in the subscriberscategory-

fig = plt.figure()fig.set_figheight(9)fig.set_figwidth(15)fig.suptitle("Distance Bins for Customer/Subscriber", fontsize=16)ax = plt.subplot("211")df_temp = df_divvy[df_divvy.usertype=="Customer"]df_temp[df_temp.distance<df_temp.distance.quantile(0.95)].distance.histplt.ylim(0,25000)ax.set_ylabel("Distance in meters")ax.set_title("Customer",fontsize=14)

ax = plt.subplot("212")df_temp = df_divvy[df_divvy.usertype=="Subscriber"]df_temp[df_temp.distance<df_temp.distance.quantile(0.95)].distance.histax.set_title("Subscriber",fontsize=14)ax.set_ylabel("Distance in meters")plt.ylim(0,25000)plt.xlabel("Rental Volume")plt.show()

In [36]:

In [37]:

In [38]:

In [39]:

From the above graph and table we see that the millenials are the largest group ofsubscribers.

Out[39]: birthyear Customer Subscriber

61 1986 0 45352

63 1988 2 44295

62 1987 0 42418

60 1985 13 41319

59 1984 0 40523

64 1989 0 39337

58 1983 0 36683

57 1982 0 32490

65 1990 0 32280

56 1981 0 29144

def explore(x): return pd.Series({"Subscriber":np.sum((x.usertype=="Subscriber")).astype "Customer":np.sum((x.usertype=="Customer")).astypedf_birthyear_agg=df_divvy.groupby("birthyear").apply(explore)

df_birthyear_agg.reset_index(inplace=True)

plot(df_birthyear_agg.birthyear,df_birthyear_agg.Subscriber)plt.show()

df_birthyear_agg.sort(["Subscriber"],ascending=False).head(10)

One can also note that there are a few subscribers with the age of 100 and over. It wouldseem that these subscribers have not reported their correct age, or if they have, then theyare in the pink of health.

3. Lets now look at how the weather affects bike rental volumes. For this purpose wewill roll up bike rentals to the day.

a) First we will take a look at the mean temperature and total ridership

Here we will create a few new features- a) total rides: the total number of rentals for the day b) average trip duration for the day (in seconds) c) average trip distance for the day (in meters) d) birth_year_diff_86 - the difference in birth year from 1986. This is based on the precedinganalysis.

In [40]:

In [41]:

In [42]:

def roll_up(x): return pd.Series({"total_rides":np.count_nonzero(x), "avg_trip_duration_s":np.mean(x.tripduration), "avg_distance_m":np.mean(x.distance), "male":np.count_nonzero(x.gender=="Male"), "female":np.count_nonzero(x.gender=="Female"), "birth_year_diff_86":np.mean(1986-x.birthyear)})df_divvy_group=df_divvy.groupby(["usertype","date"]).apply(roll_up)

df_divvy_group.reset_index(inplace=True)

df_divvy_group = pd.merge(df_divvy_group,weather,left_on="date", right_on

In [43]: fig = plt.figure()fig.set_figheight(9)fig.set_figwidth(15)fig.suptitle("Distance Bins for Customer/Subscriber", fontsize=16)ax = plt.subplot("211")df_temp = df_divvy_group[df_divvy_group.usertype=="Customer"]df_temp['Mean TemperatureF'].hist(alpha=0.7,bins=100,color="blue")plt.ylim(0,20)ax.set_ylabel("Mean Temperature")ax.set_title("Customer",fontsize=14)

ax = plt.subplot("212")df_temp = df_divvy_group[df_divvy_group.usertype=="Subscriber"]df_temp['Mean TemperatureF'].hist(alpha=0.7,bins=100,color="red")ax.set_title("Subscriber",fontsize=14)ax.set_ylabel("Mean Temperature")plt.ylim(0,20)plt.xlabel("Rental Volume")plt.show()

In [44]:

Clearly there is a relationship between total ridership and the temperature. The relationshipseems to be slightly exponential for Customers v/s Subscribers. Subscribers can beseen hiring bikes at much lower temperatures.

b) Now lets look at the precipitation in inches and how that affects the ridership

In [45]:

fig = plt.figure()fig.set_figheight(9)fig.set_figwidth(15)fig.suptitle("Temperature and the rider", fontsize=16)ax = plt.subplot("211")df_temp = df_divvy_group[df_divvy_group.usertype=="Customer"]ax.scatter(df_temp["Mean TemperatureF"],df_temp.total_rides,color="blue"ax.set_title("Customer",fontsize=14)

ax = plt.subplot("212")df_temp = df_divvy_group[df_divvy_group.usertype=="Subscriber"]ax.scatter(df_temp["Mean TemperatureF"],df_temp.total_rides,color="red"ax.set_title("Subscriber",fontsize=14)plt.show()

def fun_sum(x): return pd.Series({"TotalRidership":np.sum(x.total_rides)})

In [46]:

As can be seen, precipitation results in a drastic drop in ridership.

c) How does wind speed affect the total rider volume?

fig = plt.figure()fig.set_figheight(9)fig.set_figwidth(15)fig.suptitle("Precipitation and the rider volume", fontsize=16)ax = plt.subplot("211")df_temp = df_divvy_group[df_divvy_group.usertype=="Customer"]df_=df_temp.groupby("PrecipitationIn").apply(fun_sum)df_.reset_index(inplace=True)ax.plot(df_.PrecipitationIn,df_.TotalRidership,color="blue")ax.set_title("Customer",fontsize=14)

df_temp = df_divvy_group[df_divvy_group.usertype=="Subscriber"]df_=df_temp.groupby("PrecipitationIn").apply(fun_sum)df_.reset_index(inplace=True)ax.plot(df_.PrecipitationIn,df_.TotalRidership,color="red")ax.set_title("Subscriber",fontsize=14)plt.show()

In [47]:

As we see from the above graph, the rider volume is affected by the wind speed, howeverthere are multiple sections in this graph. We see that the rider volume increases between 0 -7 mph, however there is a sudden dip at 8mph. This could probably be attributed to fewerdays with 8 mph wind speeds, and hence a lower total ridership volume. We notice that rightafter 9 mph the total rider volume starts a steady decline.

d) Lets look at day of the week and how that affects ridership

In [48]:

fig = plt.figure()fig.set_figheight(9)fig.set_figwidth(15)fig.suptitle("Mean Wind Speed (mph) and the rider volume", fontsize=16)ax = plt.subplot("211")df_temp = df_divvy_group[df_divvy_group.usertype=="Customer"]df_=df_temp.groupby("Mean Wind SpeedMPH").apply(fun_sum)df_.reset_index(inplace=True)ax.plot(df_['Mean Wind SpeedMPH'],df_.TotalRidership,color="blue")ax.set_title("Customer",fontsize=14)

df_temp = df_divvy_group[df_divvy_group.usertype=="Subscriber"]df_=df_temp.groupby("Mean Wind SpeedMPH").apply(fun_sum)df_.reset_index(inplace=True)ax.plot(df_['Mean Wind SpeedMPH'],df_.TotalRidership,color="red")ax.set_title("Subscriber",fontsize=14)plt.show()

df_divvy_group["day_of_year"] = df_divvy_group.date.dt.dayofyeardf_divvy_group["day_of_week_mon_is_0"] = df_divvy_group.date.dt.dayofweek

In [49]:

We can see from the above graphs that there is a difference between the Customer andSubscriber rider characteristic. Customers ride more on weekends, and subscribers ridemore on weekdays. An idea to explore- if we explore the difference between weekend v/sweekday

In [50]:

fig = plt.figure()fig.set_figheight(9)fig.set_figwidth(15)ax = plt.subplot("211")fig.suptitle("Day of week and the rider volume", fontsize=16)df_temp = df_divvy_group[df_divvy_group.usertype=="Customer"]df_=df_temp.groupby("day_of_week_mon_is_0").apply(fun_sum)df_.reset_index(inplace=True)ax.plot(df_.day_of_week_mon_is_0,df_.TotalRidership,color="blue")

df_temp = df_divvy_group[df_divvy_group.usertype=="Subscriber"]df_=df_temp.groupby("day_of_week_mon_is_0").apply(fun_sum)df_.reset_index(inplace=True)ax.plot(df_.day_of_week_mon_is_0,df_.TotalRidership,color="red")plt.show()

df_divvy_group["IsWeekend"] = (df_divvy_group.day_of_week_mon_is_0>4).astype

In [248]:

In [52]:

Model BuildingWe are going to build a few different models with a different selection of features for eachgroup of models.

1. Models being built-

fig = plt.figure()fig.set_figheight(9)fig.set_figwidth(15)ax = plt.subplot("211")fig.suptitle("Is Weekend? and the rider volume", fontsize=16)df_temp = df_divvy_group[df_divvy_group.usertype=="Customer"]df_=df_temp.groupby("IsWeekend").apply(fun_sum)df_.reset_index(inplace=True)ax.plot(df_.IsWeekend,df_.TotalRidership,color="blue")

ax = plt.subplot("212")df_temp = df_divvy_group[df_divvy_group.usertype=="Subscriber"]df_=df_temp.groupby("IsWeekend").apply(fun_sum)df_.reset_index(inplace=True)ax.plot(df_.IsWeekend,df_.TotalRidership,color="red")plt.show()

df_divvy_group.to_csv('../../data/Divvy_Trips_2015-Q1Q2/data-weather-distance.csv'

a) Lasso Regression b) Ridge Regression c) Gradient Boosted Regressor d) Elastic Net

2. Train/Test:70/30

3. Feature scaling: Enabled

4. Grid Search CV: 10 Fold CV

5. Separate models for Customer and Subscriber user types

6. Models for feature sets-

a) All data except day of week b) All data c) Temperature, Precipitation, and Birth Year Diff From 1986 d) All from c) and dummy coded day of week feature

In [137]:

Model Building Code

import sklearn.cross_validation as cvimport sklearn.metrics as mtimport sklearn.linear_model as lmimport sklearn.ensemble as ensembleimport sklearn.preprocessing as psfrom sklearn.grid_search import RandomizedSearchCVfrom sklearn.grid_search import GridSearchCV

In [189]:

Configuration to drive Model Building code

def MSECalc(y, y_pred): return round(mt.mean_squared_error(y,y_pred),8)

def ModelScorer(pred_train, y_train, pred_test, y_test): mse_train = MSECalc(y_train, pred_train) mse_test = MSECalc(y_test, pred_test) return mse_train, mse_test

def ModelBuilder_lasso(X_train,y_train, X_test, config): model = GridSearchCV(lm.Lasso(),param_grid=config["params"]["lasso" model.fit(X_train,y_train) return model.predict(X_train), model.predict(X_test), model.best_params_

def ModelBuilder_ridge(X_train,y_train, X_test, config): model = GridSearchCV(lm.Ridge(),param_grid=config["params"]["ridge" model.fit(X_train,y_train) return model.predict(X_train), model.predict(X_test), model.best_params_

def ModelBuilder_en(X_train,y_train, X_test, config): model = GridSearchCV(lm.ElasticNet(),param_grid=config["params"]["en" model.fit(X_train,y_train) return model.predict(X_train), model.predict(X_test), model.best_params_

def ModelBuilder_gbr(X_train,y_train, X_test, config): model = GridSearchCV(ensemble.GradientBoostingRegressor(),param_grid model.fit(X_train,y_train) return model.predict(X_train), model.predict(X_test), model.best_params_

def ModelComparator(X,y, config): X=ps.scale(X) y=ps.scale(y) X_train, X_test, y_train, y_test = cv.train_test_split(X,y, test_size pred_train={} pred_test={} mse_train={} mse_test={} params={} for model in config["models"]: if "lasso" in model: pred_train[model], pred_test[model], params[model]=ModelBuilder_lasso if "ridge" in model: pred_train[model], pred_test[model], params[model]=ModelBuilder_ridge if "en" in model: pred_train[model], pred_test[model], params[model]=ModelBuilder_en if "gbr" in model: pred_train[model], pred_test[model], params[model]=ModelBuilder_gbr mse_train[model], mse_test[model] = ModelScorer(pred_train[model return mse_train, mse_test, params

In [150]:

Code for dummy coding of categoricals

In [94]:

Method that encapsulates running all models, as well aggregating all scores anddetails about the run

config={"models":["lasso","ridge","en","gbr"], "params":{"lasso":{"alpha":[0.001,0.01,0.1,1], "tol":[0.0001,0.001,0.01,0.1,1]}, "ridge":{"alpha":[0.001,0.01,0.1,1], "tol":[0.0001,0.001,0.01,0.1,1]}, "en":{"tol":np.linspace(0.0001, 0.1, num=15), "alpha":[0.001,0.01,0.1,1], "l1_ratio":np.linspace(0.01, 1, num=15)}, "gbr":{"learning_rate":np.linspace(0.05, 1, num=15), "min_samples_leaf":range(1,10), "min_samples_split":range(1,5)}}, "cv":10}mse_train={}mse_test={}

def dummy_coding(x,col_names): sep={} for col in col_names: vals=list(x[col].unique()) for val in vals: sep["%s_%s"%(col,val)] = (x[col]==val).astype(int) return sep

In [226]:

Variable to catch all scores

In [227]:

Models built with different feature sets-

a) All data except day of week

def run_models(user,features,X,y,config): train, test, param = ModelComparator(X,y,config) mse_test = [] mse_train = [] params = [] models = [] usertype = [] feature_set = [] scores_df = pd.DataFrame() for model in config["models"]: models.append(model) mse_train.append(train[model]) mse_test.append(test[model]) params.append(param[model]) usertype.append(user) feature_set.append(features) scores_df["feature_set"] = feature_set scores_df["usertype"] = usertype scores_df["model"] = models scores_df["mse_train"] = mse_train scores_df["mse_test"] = mse_test scores_df["rmse_train"] = np.sqrt(scores_df.mse_train) scores_df["rmse_test"] = np.sqrt(scores_df.mse_test) scores_df["params"] = params return scores_df

scores = []

In [228]:

b) All data

In [230]:

c) Weather Data Only

X=df_divvy_group[df_divvy_group.usertype=="Subscriber"]X_cust=df_divvy_group[df_divvy_group.usertype=="Customer"]

X.drop(["usertype","date","CDT","Events","day_of_week_mon_is_0"],axis=1X_cust.drop(["usertype","date","CDT","Events","birth_year_diff_86","female"

X_cust.dropna(inplace=True)X.dropna(inplace=True)y_cust=X_cust.total_ridesy=X.total_rides

X_cust.drop("total_rides",axis=1,inplace=True)X.drop("total_rides",axis=1,inplace=True)

scores.append(run_models("subscriber","all_except_dow",X,y,config))scores.append(run_models("customer","all_except_dow",X_cust,y_cust,config


X.drop(["usertype","date","CDT","Events"],axis=1,inplace=True)X_cust.drop(["usertype","date","CDT","Events","birth_year_diff_86","female"


X=pd.concat([X,pd.DataFrame(dummy_coding(X,["day_of_week_mon_is_0"]))],X_cust=pd.concat([X_cust,pd.DataFrame(dummy_coding(X_cust,["day_of_week_mon_is_0"

X_cust.drop(["total_rides","day_of_week_mon_is_0"],axis=1,inplace=True)X.drop(["total_rides","day_of_week_mon_is_0"],axis=1,inplace=True)

scores.append(run_models("subscriber","all_features",X,y,config))scores.append(run_models("customer","all_features",X_cust,y_cust,config

In [231]:

In [232]:

d) Weather data and day of week


X=X[["total_rides","Mean TemperatureF","PrecipitationIn","birth_year_diff_86"X_cust=X[["total_rides","Mean TemperatureF","PrecipitationIn"]]pd.tools.plotting.scatter_matrix(X,figsize=(15,10))plt.show()


X_cust.drop("total_rides",axis=1,inplace=True)X.drop("total_rides",axis=1,inplace=True)

scores.append(run_models("subscriber","temp_prec_birth",X,y,config))scores.append(run_models("customer","temp_prec_birth",X_cust,y_cust,config

In [234]:

In [235]:

In [236]:


X=X[["total_rides","Mean TemperatureF","PrecipitationIn","birth_year_diff_86"X_cust=X[["total_rides","Mean TemperatureF","PrecipitationIn","day_of_week_mon_is_0"

X=pd.concat([X,pd.DataFrame(dummy_coding(X,["day_of_week_mon_is_0"]))],X_cust=pd.concat([X_cust,pd.DataFrame(dummy_coding(X_cust,["day_of_week_mon_is_0"pd.tools.plotting.scatter_matrix(X,figsize=(15,10))plt.show()


X_cust.drop(["total_rides","day_of_week_mon_is_0"],axis=1,inplace=True)X.drop(["total_rides","day_of_week_mon_is_0"],axis=1,inplace=True)

scores.append(run_models("subscriber","temp_prec_birth_dow",X,y,config))scores.append(run_models("customer","temp_prec_birth_dow",X_cust,y_cust

scores_df = pd.concat(scores)

In [240]:

Out[240]: feature_set usertype model mse_train mse_test rmse_train rmse_test

1 all_except_dow subscriber ridge2.200000e-07

2.600000e-07 0.000469 0.000510

1 all_features subscriber ridge2.000000e-07

3.300000e-07 0.000447 0.000574

0 all_except_dow subscriber lasso1.340000e-06

6.900000e-07 0.001158 0.000831

2 all_except_dow subscriber en1.340000e-06

6.900000e-07 0.001158 0.000831

0 all_features subscriber lasso1.340000e-06

6.900000e-07 0.001158 0.000831

2 all_features subscriber en1.340000e-06

6.900000e-07 0.001158 0.000831

3 all_features subscriber gbr1.035000e-05

1.207340e-03 0.003217 0.034747

3 all_except_dow subscriber gbr1.305000e-05

1.214040e-03 0.003612 0.034843

2 temp_prec_birth_dow customer en1.813251e-01

1.317287e-01 0.425823 0.362944

2 temp_prec_birth_dow subscriber en1.863900e-01

1.328632e-01 0.431729 0.364504

1 temp_prec_birth_dow subscriber ridge1.732589e-01

1.335119e-01 0.416244 0.365393

1 temp_prec_birth_dow customer ridge1.738899e-01

1.417032e-01 0.417001 0.376435

0 temp_prec_birth_dow customer lasso1.745095e-01

1.431364e-01 0.417743 0.378334

0 temp_prec_birth_dow subscriber lasso1.744810e-01

1.434022e-01 0.417709 0.378685

3 temp_prec_birth_dow subscriber gbr4.007046e-02 1.722666e-01 0.200176 0.415050

scores_df.sort("mse_test")

3 temp_prec_birth_dow customer gbr2.641917e-02

1.903296e-01 0.162540 0.436268

3 temp_prec_birth subscriber gbr4.543861e-02

1.954518e-01 0.213163 0.442099

3 all_except_dow customer gbr1.000000e-07

1.957464e-01 0.000316 0.442432

0 temp_prec_birth subscriber lasso2.103654e-01

2.080503e-01 0.458656 0.456125

2 temp_prec_birth subscriber en2.090449e-01

2.097234e-01 0.457214 0.457956

1 temp_prec_birth subscriber ridge2.087435e-01

2.105014e-01 0.456885 0.458804

3 all_features customer gbr9.000000e-08

2.334332e-01 0.000300 0.483149

2 all_except_dow customer en3.629435e-01

2.941157e-01 0.602448 0.542324

3 temp_prec_birth customer gbr1.401808e-01

3.056549e-01 0.374407 0.552861

0 temp_prec_birth customer lasso2.751552e-01

3.060788e-01 0.524552 0.553244

2 temp_prec_birth customer en2.748168e-01

3.096775e-01 0.524230 0.556487

1 temp_prec_birth customer ridge2.746825e-01

3.125255e-01 0.524102 0.559040

2 all_features customer en3.495963e-01

3.150416e-01 0.591267 0.561286

0 all_except_dow customer lasso3.445035e-01

6.175809e-01 0.586944 0.785863

3.289846e-

Analysis of results

1. For user type : Subscribera) It's interesting to see that the subscriber model that performed the best (and best overallcompared to customer models as well) was the one with the entire feature set (exceptday_of_week_mon_is_0)-

0 all_features customer lasso 01 7.095680e-01 0.573572 0.842359

1 all_except_dow customer ridge3.086640e-01

8.848603e-01 0.555575 0.940670

1 all_features customer ridge2.946653e-01

1.034366e+00 0.542831 1.017038

In [188]:

b) The best performing model was- Ridge Regression, MSE : 2.600000e-07, RMSE :0.000510 c) Tuned Parameters- alpha: 0.001, tol: 0.0001 d) Another interesting fact to note is that the top performing models for the Subscriber userwere all linear models. e) Models created with only the weather data performed on the lower end of the spectrum forSubscribers. This shows that Subscribers are less influenced by changes in weatherconditions when it comes to renting Divvy bikes.

2. For the user type : Customer

Out[188]: ['usertype', 'date', 'avg_distance_m', 'avg_trip_duration_s', 'birth_year_diff_86', 'female', 'male', 'total_rides', 'CDT', 'Max TemperatureF', 'Mean TemperatureF', 'Min TemperatureF', 'Max Dew PointF', 'MeanDew PointF', 'Min DewpointF', 'Max Humidity', 'Mean Humidity', 'Min Humidity', 'Max Sea Level PressureIn', 'Mean Sea Level PressureIn', 'Min Sea Level PressureIn', 'Max VisibilityMiles', 'Mean VisibilityMiles', 'Min VisibilityMiles', 'Max Wind SpeedMPH', 'Mean Wind SpeedMPH', 'Max Gust SpeedMPH', 'PrecipitationIn', 'CloudCover', 'Events', 'WindDirDegrees', 'day_of_year', 'day_of_week_mon_is_0', 'IsWeekend']

list(df_divvy_group.columns)

a) The best performing model for Customer was trained with only weather data and day ofweek. Note: The scores list the best customer model with feature set inclusive of birth year.This is not true for the model, and is only a labeling issue. b) The best performing model was- Elastic Net, MSE : 1.317287e-01, RMSE : 0.362944 c) Tuned Parameters-

In [249]:

d) In the case of models for Customers it can be noted that the top performing models are alllinear models. e) Customers are people who rent only for the day. These rental decisions can be affected byweather conditions, as customers could be visitors etc., who are not prepared for theweather, and hence decide on the fly about renting a bike. Subscribers on the other hand areusing bikes more for commuting to work, as can be seen from total rider volume by user typefor a given day of the week (EDA section).

Out[249]: {'alpha': 0.1, 'l1_ratio': 0.080714285714285711, 'tol': 0.078592857142857145}

scores_df[(scores_df.feature_set=="temp_prec_birth_dow") & (scores_df.usertype

working with the divvy data set

Documents