working with the divvy data set
DESCRIPTION
There are always Divvy vans that ferry bikes around from station to station based on the lack or surplus of bikes at a given location. This movement of bikes is labor and time intensive. Both of which are high costs that Divvy has to bear. It would be nice to be able to predict the volume of rentals, and allow for precise scheduling.In this project I have decided to work with daily rental volume (total rides) as my target variable, and as this is a supervised learning problem the techniques that would be used are as follows-a) Lasso Regressionb) Ridge Regressionc) Elastic Netd) Gradient Boosted RegressionTRANSCRIPT
Working With The Divvy Dataset
Pratik Agrawal
IntroductionOver the past couple of years Divvy has organized data challenges for invigorating someinnovation in the Chicago Data Science community as well as learn new ways to visualizeand manage the bike rental system.
ProblemThere are always Divvy vans that ferry bikes around from station to station based on the lackor surplus of bikes at a given location. This movement of bikes is labor and time intensive.Both of which are high costs that Divvy has to bear. It would be nice to be able to predict thevolume of rentals, and allow for precise scheduling. In this project I have decided to work with daily rental volume (total rides) as my targetvariable, and as this is a supervised learning problem the techniques that would be used areas follows- a) Lasso Regression b) Ridge Regression c) Elastic Net d) Gradient Boosted Regression
Data Setsa) Divvy data set 2015 Q1 & Q2 b) Route Information data- In order to enrich the data set with more information, I decided toinclude distance information (route calculation from HERE.com Route Calculation API) foreach origin/destination pair in the dataset. c) Weather data- weather data from Wunderground.com was downloaded for the periodpertaining to the Divvy data set.
In [1]:
In [2]:
The Dataset
1. Lets read the readme.txt file supplied with the dataset, and see what all features areincluded in this data set
Even though the file is for the 2013 dataset, the columns have not changed much in thecurrent year
Populating the interactive namespace from numpy and matplotlib
import pandas as pdimport matplotlib.pyplot as pltimport gcimport seaborn as sns%pylab inline
import warningswarnings.filterwarnings('ignore')
In [3]: readme_txt_file=open("./week-1/Divvy_Stations_Trips_2013/README.txt",'r'for line in readme_txt_file.readlines(): if line!=None: print line
This file contains metadata for both the Trips and Stations table.
For more information, see the contest page at http://DivvyBikes.com/datachallenge (http://DivvyBikes.com/datachallenge) or email questions to [email protected].
Metadata for Trips Table:
Variables:
trip_id: ID attached to each trip taken
starttime: day and time trip started, in CST
stoptime: day and time trip ended, in CST
bikeid: ID attached to each bike
tripduration: time of trip in seconds
from_station_name: name of station where trip originated
to_station_name: name of station where trip terminated
from_station_id: ID of station where trip originated
to_station_id: ID of station where trip terminated
usertype: "Customer" is a rider who purchased a 24-Hour Pass; "Subscriber" is a rider who purchased an Annual Membership
gender: gender of rider
birthyear: birth year of rider
Notes:
* First row contains column names
From the above information we have a good idea about what the dataset looks like.Under normal circumstances such a clean set is hard to come by. The meta dataprovided is actually very useful, since that is another feature missing fromdatasets.
2. Read the files
In [3]:
Lets take a quick peek at the head for each data frame
* Total records = 759,789
* Trips that did not include a start or end date were removed from original table.
* Gender and birthday are only available for Subscribers
Metadata for Stations table:
Variables:
name: station name
latitude: station latitude
longitude: station longitude
dpcapacity: number of total docks at each station as of 2/7/2014
online date: date the station went live in the system
df_trips_1=pd.read_csv("../../data/Divvy_Trips_2015-Q1Q2/Divvy_Trips_2015-Q1.csv"df_trips_2=pd.read_csv("../../data/Divvy_Trips_2015-Q1Q2/Divvy_Trips_2015-Q2.csv"df_stations=pd.read_csv("../../data/Divvy_Trips_2015-Q1Q2/Divvy_Stations_2015.csv"df_trips = pd.concat([df_trips_1,df_trips_2])
In [5]:
In [6]:
The dataframes above can be joined on from_station_id/to_station_id and id
Lets look at the shape of the dataframes
In [7]:
In [8]:
Joining the origin station id with the data from the stations data frame
Out[5]: trip_id starttime stoptime bikeid tripduration from_station_id from_station_name
0 47384543/31/201523:58
4/1/20150:03
1095 299 117Wilton Ave &Belmont Ave
1 47384503/31/201523:59
4/1/20150:15
537 940 43Michigan Ave &Washington St
2 47384493/31/201523:59
4/1/20150:11
2350 751 162Damen Ave &Wellington Ave
3 47384483/31/201523:59
4/1/20150:19
938 1240 51Clark St & RandolphSt
4 47384453/31/201523:54
4/1/20150:15
379 1292 134Peoria St & JacksonBlvd
Out[6]: id name latitude longitude dpcapacity landmark
0 2 Michigan Ave & Balbo Ave 41.872293 -87.624091 35 541
1 3 Shedd Aquarium 41.867226 -87.615355 31 544
2 4 Burnham Harbor 41.856268 -87.613348 23 545
3 5 State St & Harrison St 41.874053 -87.627716 23 30
4 6 Dusable Harbor 41.885042 -87.612795 31 548
Out[7]: (1096239, 12)
Out[8]: (474, 6)
df_trips.head()
df_stations.head()
df_trips.shape
df_stations.shape
In [4]:
In [12]:
In [13]:
Joining the destination station id with the data from the stations data frame
In [5]:
In [15]:
Out[12]: (1096239, 18)
1 47384313/31/201523:42
3/31/201523:47
68 260 117Wilton Ave &Belmont Ave
2 47383863/31/201523:04
3/31/201523:07
422 186 117Wilton Ave &Belmont Ave
3 47383033/31/201522:19
3/31/201522:22
1672 145 117Wilton Ave &Belmont Ave
4 47380893/31/201521:07
3/31/201521:10
2720 200 117Wilton Ave &Belmont Ave
Out[15]: (1096239, 24)
df_from=pd.merge(df_trips,df_stations,left_on="from_station_id",right_on
df_from.shape
df_from.head()
df_divvy=pd.merge(df_from,df_stations,left_on="to_station_id",right_on=
df_divvy.shape
In [16]:
Lets try a sample call to the HERE maps api
In [16]:
To use the here.com api, one has to register as a developer, and is limited to a 100Kcalls/month
For security purposes the application id and code for my dev user has not beenincluded in the api call made below.
In [17]:
In [18]:
Lets take a look at what the HERE Calcuate Route API response looks like
Out[16]: trip_id starttime stoptime bikeid tripduration from_station_id from_station_name
1096234 53484275/27/20157:04
5/27/20157:21
2817 1023 428Dorchester Ave &63rd St
1096235 53382095/26/201510:38
5/26/201510:53
2819 912 428Dorchester Ave &63rd St
1096236 56704226/16/201518:01
6/16/201518:16
3113 869 95Stony Island Ave &64th St
1096237 53750755/28/201515:49
5/28/201516:04
2004 892 391 Halsted St & 69th St
1096238 56118586/13/20159:36
6/13/20159:42
4703 374 388 Halsted St & 63rd St
5 rows × 24 columns
df_divvy.tail()
from urllib2 import urlopenfrom StringIO import StringIOimport simplejson
url = urlopen('http://route.cit.api.here.com/routing/7.2/calculateroute.json?app_id=APP_ID&app_code=APP_CODE&waypoint0=geo!41.90096,-87.623777&waypoint1=geo!41.91468,-87.64332&mode=fastest;car;traffic:disabled'
json_array = simplejson.loads(url)
In [19]:
Out[19]:
json_array
{u'response': {u'language': u'en-us', u'metaInfo': {u'interfaceVersion': u'2.6.18', u'mapVersion': u'8.30.60.106', u'moduleVersion': u'7.2.63.0-1185', u'timestamp': u'2015-12-08T21:37:18Z'}, u'route': [{u'leg': [{u'end': {u'label': u'W Menomonee St', u'linkId': u'+19805890', u'mappedPosition': {u'latitude': 41.9146268, u'longitude': -87.6433185}, u'mappedRoadName': u'W Menomonee St', u'originalPosition': {u'latitude': 41.9146799, u'longitude': -87.64332}, u'shapeIndex': 60, u'sideOfStreet': u'left', u'spot': 0.1862745, u'type': u'stopOver'}, u'length': 3620, u'maneuver': [{u'_type': u'PrivateTransportManeuverType', u'id': u'M1', u'instruction': u'Head toward <span class="toward_street">N Michigan Ave</span> on <span class="street">E Lake Shore Dr</span>. <span class="distance-description">Go for <span class="length">23 m</span>.</span>', u'length': 23, u'position': {u'latitude': 41.9008181, u'longitude': -87.6237659}, u'travelTime': 9}, {u'_type': u'PrivateTransportManeuverType', u'id': u'M2', u'instruction': u'Turn <span class="direction">right</span> onto <span class="next-street">N Lake Shore Dr</span> <span class="number">(US-41 N)</span>. <span class="distance-description">Go for <span class="length">1.1 km</span>.</span>', u'length': 1120, u'position': {u'latitude': 41.900804, u'longitude': -87.6240349}, u'travelTime': 76}, {u'_type': u'PrivateTransportManeuverType', u'id': u'M3', u'instruction': u'Keep <span class="direction">right</span> toward <span class="sign"><span lang="en">North Ave</span>/<span lang="en">IL-64</span>/<span lang="en">Lasalle Dr</span></span>. <span class="distance-description">Go for <span class="length">316 m</span>.</span>', u'length': 316, u'position': {u'latitude': 41.9106638, u'longitude': -87.6257515}, u'travelTime': 40}, {u'_type': u'PrivateTransportManeuverType', u'id': u'M4', u'instruction': u'Turn <span class="direction">left</span> onto <span class="next-street">W La Salle Dr</span>. <span class="distance-description">Go for <span class="length">886 m</span>.</span>',
u'length': 886, u'position': {u'latitude': 41.9133461, u'longitude': -87.6259875}, u'travelTime': 114}, {u'_type': u'PrivateTransportManeuverType', u'id': u'M5', u'instruction': u'Turn <span class="direction">right</span> onto <span class="next-street">W North Ave</span> <span class="number">(IL-64)</span>. <span class="distance-description">Go for <span class="length">853 m</span>.</span>', u'length': 853, u'position': {u'latitude': 41.9111681, u'longitude': -87.6331329}, u'travelTime': 123}, {u'_type': u'PrivateTransportManeuverType', u'id': u'M6', u'instruction': u'Turn <span class="direction">right</span> onto <span class="next-street">N Larrabee St</span>. <span class="distance-description">Go for <span class="length">403 m</span>.</span>', u'length': 403, u'position': {u'latitude': 41.9109857, u'longitude': -87.6434219}, u'travelTime': 66}, {u'_type': u'PrivateTransportManeuverType', u'id': u'M7', u'instruction': u'Turn <span class="direction">right</span> onto <span class="next-street">W Menomonee St</span>. <span class="distance-description">Go for <span class="length">19 m</span>.</span>', u'length': 19, u'position': {u'latitude': 41.9146228, u'longitude': -87.6435506}, u'travelTime': 3}, {u'_type': u'PrivateTransportManeuverType', u'id': u'M8', u'instruction': u'Arrive at <span class="street">W Menomonee St</span>. Your destination is on the left.', u'length': 0, u'position': {u'latitude': 41.9146268, u'longitude': -87.6433185}, u'travelTime': 0}], u'start': {u'label': u'E Lake Shore Dr', u'linkId': u'-858448508', u'mappedPosition': {u'latitude': 41.9008181, u'longitude': -87.6237659}, u'mappedRoadName': u'E Lake Shore Dr', u'originalPosition': {u'latitude': 41.9009599, u'longitude': -87.6237771}, u'shapeIndex': 0, u'sideOfStreet': u'right', u'spot': 0.0247934, u'type': u'stopOver'}, u'travelTime': 431}],
To access the distance between the two points provided in the API request, we canlook at the summary section of the JSON object
In [22]:
Similarly we can access other parameters such as base time and traffic time (bothhave been provided for vehicle based routing). This API however does not provideestimates as to how traffic affects the bicycle times.
In [23]:
u'mode': {u'feature': [], u'trafficMode': u'disabled', u'transportModes': [u'car'], u'type': u'fastest'}, u'summary': {u'_type': u'RouteSummaryType', u'baseTime': 431, u'distance': 3620, u'flags': [u'park'], u'text': u'The trip takes <span class="length">3.6 km</span> and <span class="time">7 mins</span>.', u'trafficTime': 431, u'travelTime': 431}, u'waypoint': [{u'label': u'E Lake Shore Dr', u'linkId': u'-858448508', u'mappedPosition': {u'latitude': 41.9008181, u'longitude': -87.6237659}, u'mappedRoadName': u'E Lake Shore Dr', u'originalPosition': {u'latitude': 41.9009599, u'longitude': -87.6237771}, u'shapeIndex': 0, u'sideOfStreet': u'right', u'spot': 0.0247934, u'type': u'stopOver'}, {u'label': u'W Menomonee St', u'linkId': u'+19805890', u'mappedPosition': {u'latitude': 41.9146268, u'longitude': -87.6433185}, u'mappedRoadName': u'W Menomonee St', u'originalPosition': {u'latitude': 41.9146799, u'longitude': -87.64332}, u'shapeIndex': 60, u'sideOfStreet': u'left', u'spot': 0.1862745, u'type': u'stopOver'}]}]}}
3620
base_time: 431 traffic_time: 431
print json_array['response']['route'][0]['summary']['distance']
print "base_time: ",json_array['response']['route'][0]['summary']['baseTime'print "traffic_time: ",json_array['response']['route'][0]['summary']['trafficTime'
Lets create a function to query the HERE.com Calculate Route API for any twolocations. And also test this with the first two rows of the data set
In [24]:
In [25]:
In [26]:
Now lets do a simple reduction in the number of calls made to the HERE.comAPI
In [29]:
Out[25]: base_time distance json_array traffic_time
0 208 923 {u'response': {u'route': [{u'leg': [{u'start':... 208
1 208 923 {u'response': {u'route': [{u'leg': [{u'start':... 208
Out[26]: {u'_type': u'RouteSummaryType', u'baseTime': 208, u'distance': 923, u'text': u'The trip takes <span class="length">923 m</span> and <span class="time">3 mins</span>.', u'trafficTime': 208, u'travelTime': 208}
def calc_dist_time(x): url = urlopen('http://route.cit.api.here.com/routing/7.2/calculateroute.json?app_id=APP_ID&app_code=APP_CODE&waypoint0=geo!%s,%s&waypoint1=geo!%s,%s&mode=fastest;car;traffic:disabled' json_array = simplejson.loads(url) base_time=json_array['response']['route'][0]['summary']['baseTime'] traffic_time=json_array['response']['route'][0]['summary']['trafficTime' distance=json_array['response']['route'][0]['summary']['distance'] return pd.Series({'base_time':base_time, 'traffic_time':traffic_time, 'distance':distance, 'json_array':json_array})df_dist=df_divvy.head(2).apply(calc_dist_time,axis=1)
df_dist
df_dist.json_array[0]['response']['route'][0]['summary']
df_temp=df_divvy.drop_duplicates(["latitude_x","longitude_x","latitude_y"
In [30]:
As can be seen from above, the number of calls that will need to be made to theHERE.com API is 65K, which is well below the monthly quota. This can be furtherreduced by removing the duplicated between x-y and y-x combinations of thelocations.
Note: I tried Google Maps API (only a few thousand free calls, and throttled/deniedthereafter), as well as Open Street Maps API, and only found HERE.com API to be themost responsive, and best in class in terms of quota.
Lets run the query for each combination of location in this reduced dataset
I already ran the code below prior to forming this notebook, and had saved the results of thequeries. Hence you will not see execution numbers for some of the code blocks
In [ ]:
In [33]:
In [66]:
In [76]:
Out[30]: trip_id starttime stoptime bikeid tripduration from_station_id from_station_name
0 47384543/31/201523:58
4/1/20150:03
1095 299 117Wilton Ave &Belmont Ave
181 44479911/17/201515:26
1/17/201515:57
645 1859 43Michigan Ave &Washington St
184 46315883/14/201518:20
3/14/201518:38
1226 1103 162Damen Ave &Wellington Ave
192 47356463/31/201517:16
3/31/201517:37
1312 1296 51Clark St & RandolphSt
6/8/2015 6/8/2015 Peoria St & Jackson
df_temp
df_dist=df_temp.apply(calc_dist_time,axis=1)
df_dist_matrix = df_divvy[["latitude_x","longitude_x","latitude_y","longitude_y"
df_dist_time = pd.merge(df_dist_matrix,df_dist,left_on="ix",right_on="ix"
df_dist_time["key"] = "%s_%s_%s_%s"%(str(df_dist_time.latitude_x), str(df_dist_time.longitude_x), str(df_dist_time.latitude_y), str(df_dist_time.longitude_y))
In [80]:
In [89]:
In [17]:
In [18]:
In [19]:
In [20]:
In [39]:
Lets save this data set
In [102]:
We can free up the memory, by forcing garbage collection. I've done this as there is lotof data held in memory, and there is no further use for it.
Out[18]: ['Unnamed: 0', 'ix', 'latitude_x', 'longitude_x', 'latitude_y', 'longitude_y', 'base_time', 'distance', 'json_array', 'traffic_time']
Out[39]: 28
df_divvy["key"] = "%s_%s_%s_%s"%(str(df_divvy.latitude_x), str(df_divvy.longitude_x), str(df_divvy.latitude_y), str(df_divvy.longitude_y))
df_dist_time.to_csv('../../data/Divvy_Trips_2015-Q1Q2/lat_lon_dist_time.csv'
df_dist_time = pd.read_csv('../../data/Divvy_Trips_2015-Q1Q2/lat_lon_dist_time.csv'
list(df_dist_time.columns)
df_divvy = pd.merge(df_divvy,df_dist_time,left_on=["latitude_x","longitude_x"
df_divvy.drop(["ix","json_array"],axis=1,inplace=True)
len(list(df_divvy.columns))
df_divvy.to_csv('../../data/Divvy_Trips_2015-Q1Q2/complete-data.csv')
In [8]:
Lets also download weather information for each day of Q1 & Q2 2015. For thispurpose I downloaded the weather history from wunderground.com
Note: Code execution resumes from here, as code above requires a dev account to makecalls to HERE.com
In [22]:
Out[8]: 114
Out[22]:
CDTMaxTemperatureF
MeanTemperatureF
MinTemperatureF
MaxDewPointF
MeanDewPointF
MinDewpointF
0 1/1/15 32 25 17 16 11 4
1 1/2/15 36 28 20 22 19 15
2 1/3/15 37 34 31 36 32 22
3 1/4/15 36 21 5 35 22 -5
4 1/5/15 10 5 -1 5 -3 -10
5 rows × 23 columns
df_dist=[]df_dist_matrix=[]df_dist_time=[]df_from=[]df_trips=[]df_trips_1=[]df_trips_2=[]df_divvy_all=[]import gcgc.collect()
weather = pd.read_csv('../../data/Divvy_Trips_2015-Q1Q2/CustomWeather.csv'weather.head()
In [23]:
Lets clean the column names, and get rid of the leading white space
In [24]:
Out[23]: ['CDT', 'Max TemperatureF', 'Mean TemperatureF', 'Min TemperatureF', 'Max Dew PointF', 'MeanDew PointF', 'Min DewpointF', 'Max Humidity', ' Mean Humidity', ' Min Humidity', ' Max Sea Level PressureIn', ' Mean Sea Level PressureIn', ' Min Sea Level PressureIn', ' Max VisibilityMiles', ' Mean VisibilityMiles', ' Min VisibilityMiles', ' Max Wind SpeedMPH', ' Mean Wind SpeedMPH', ' Max Gust SpeedMPH', 'PrecipitationIn', ' CloudCover', ' Events', ' WindDirDegrees']
list(weather.columns)
weather.columns=[c.strip(" ") for c in weather.columns]
In [25]:
Lets convert the date feature of both df_divvy and weather data to the sklearn datetimeobject.
In [26]:
In [27]:
In [28]:
Out[25]: ['CDT', 'Max TemperatureF', 'Mean TemperatureF', 'Min TemperatureF', 'Max Dew PointF', 'MeanDew PointF', 'Min DewpointF', 'Max Humidity', 'Mean Humidity', 'Min Humidity', 'Max Sea Level PressureIn', 'Mean Sea Level PressureIn', 'Min Sea Level PressureIn', 'Max VisibilityMiles', 'Mean VisibilityMiles', 'Min VisibilityMiles', 'Max Wind SpeedMPH', 'Mean Wind SpeedMPH', 'Max Gust SpeedMPH', 'PrecipitationIn', 'CloudCover', 'Events', 'WindDirDegrees']
list(weather.columns)
df_divvy.drop("Unnamed: 0",axis=1,inplace=True)
df_divvy["date"]=df_divvy.starttime.apply(lambda x: x.split(" ")[0])
df_divvy["date"]=pd.to_datetime(df_divvy.date)
In [29]:
In [30]:
In [31]:
Out[29]: trip_id starttime stoptime bikeid tripduration from_station_id from_station_name
0 47384543/31/201523:58
4/1/20150:03
1095 299 117Wilton Ave &Belmont Ave
1 47312163/31/20158:03
3/31/20158:08
719 313 117Wilton Ave &Belmont Ave
2 47298483/30/201521:22
3/30/201521:27
168 310 117Wilton Ave &Belmont Ave
3 47296723/30/201520:42
3/30/201520:51
2473 595 117Wilton Ave &Belmont Ave
4 47153903/27/201521:26
3/27/201521:31
1614 312 117Wilton Ave &Belmont Ave
5 rows × 28 columns
Out[31]:
CDTMaxTemperatureF
MeanTemperatureF
MinTemperatureF
MaxDewPointF
MeanDewPointF
MinDewpointF
02015-01-01
32 25 17 16 11 4
12015-01-02
36 28 20 22 19 15
22015-01-03
37 34 31 36 32 22
32015-01-04
36 21 5 35 22 -5
42015-01-05
10 5 -1 5 -3 -10
5 rows × 23 columns
df_divvy.head()
weather["CDT"]=pd.to_datetime(weather.CDT)
weather.head()
In [32]:
Analysis
EDA
Now that we have all the data in order, lets take a look at where these stations arelocated on the map. We will also plot a random sampling of the user types (subscribersv/s customers) and the stations they travel between
In [6]:
weather.PrecipitationIn=weather.PrecipitationIn.convert_objects(convert_numeric
from IPython.display import HTMLimport folium
def inline_map(map): """ Embeds the HTML source of the map directly into the IPython notebook. This method will not work if the map depends on any files (json data). Also this uses the HTML5 srcdoc attribute, which may not be supported in all browsers. """ map._build_map() return HTML('<iframe srcdoc="{srcdoc}" style="width: 100%; height: 510px; border: none"></iframe>'
def embed_map(map, path="map.html"): """ Embeds a linked iframe to the map into the IPython notebook. Note: this method will not capture the source of the map into the notebook. This method should work for all maps (as long as they use relative urls). """ map.create_map(path=path) return HTML('<iframe src="files/{path}" style="width: 100%; height: 510px; border: none"></iframe>'
In [7]:
As can be seen from the above map-
Out[7]:
map_osm = folium.Map(location=[41.9065732,-87.7142335],tiles='Stamen Toner'for i in range(0,df_stations.shape[0]): map_osm.circle_marker(location=[df_stations.latitude[i], df_stations fill_color='blue')
np.random.seed(123)numbers = np.arange(1,1000000)np.random.shuffle(numbers)
for i in range(1,10000): if(df_divvy.usertype[numbers[i]]=="Subscriber"): map_osm.line([[df_divvy.latitude_x[numbers[i]],df_divvy.longitude_x else: map_osm.line([[df_divvy.latitude_x[numbers[i]],df_divvy.longitude_xinline_map(map_osm)
a) Subscribers are marked with the red lines. b) Customers are market with green lines. c) Subscribers tend to use this service more as a daily commute option versus customerswho use this for shorter distances. d) Customers tend to run the bikes in the more tourist-y areas (Lake Shore Trail, Loop,Millenium Park). e) The bike stations on the periphery of the map see the least traffic.
Note: The above map is interactive, so you should be able to zoom in/out and panthroughout.
1. Lets look at the distances travelled by usertype (Customer v/s Subscriber)
In [35]:
It is clear from the above plots that the subscribers in general ride longer distances, as wellas contribute to the majority of bike rentals. However, the customers (or tourists/one-timeriders) also contribute to a significant number of rides. Within the customers, we can think ofthe riders as-
1. tourists- riders, who rent bikes on weekends and Thursdays.2. daily-riders- who do not have active subscription, and are riding these bikes on
Monday-Wednesday.
2. Lets look at who are the most active bike renters in the subscriberscategory-
fig = plt.figure()fig.set_figheight(9)fig.set_figwidth(15)fig.suptitle("Distance Bins for Customer/Subscriber", fontsize=16)ax = plt.subplot("211")df_temp = df_divvy[df_divvy.usertype=="Customer"]df_temp[df_temp.distance<df_temp.distance.quantile(0.95)].distance.histplt.ylim(0,25000)ax.set_ylabel("Distance in meters")ax.set_title("Customer",fontsize=14)
ax = plt.subplot("212")df_temp = df_divvy[df_divvy.usertype=="Subscriber"]df_temp[df_temp.distance<df_temp.distance.quantile(0.95)].distance.histax.set_title("Subscriber",fontsize=14)ax.set_ylabel("Distance in meters")plt.ylim(0,25000)plt.xlabel("Rental Volume")plt.show()
In [36]:
In [37]:
In [38]:
In [39]:
From the above graph and table we see that the millenials are the largest group ofsubscribers.
Out[39]: birthyear Customer Subscriber
61 1986 0 45352
63 1988 2 44295
62 1987 0 42418
60 1985 13 41319
59 1984 0 40523
64 1989 0 39337
58 1983 0 36683
57 1982 0 32490
65 1990 0 32280
56 1981 0 29144
def explore(x): return pd.Series({"Subscriber":np.sum((x.usertype=="Subscriber")).astype "Customer":np.sum((x.usertype=="Customer")).astypedf_birthyear_agg=df_divvy.groupby("birthyear").apply(explore)
df_birthyear_agg.reset_index(inplace=True)
plot(df_birthyear_agg.birthyear,df_birthyear_agg.Subscriber)plt.show()
df_birthyear_agg.sort(["Subscriber"],ascending=False).head(10)
One can also note that there are a few subscribers with the age of 100 and over. It wouldseem that these subscribers have not reported their correct age, or if they have, then theyare in the pink of health.
3. Lets now look at how the weather affects bike rental volumes. For this purpose wewill roll up bike rentals to the day.
a) First we will take a look at the mean temperature and total ridership
Here we will create a few new features- a) total rides: the total number of rentals for the day b) average trip duration for the day (in seconds) c) average trip distance for the day (in meters) d) birth_year_diff_86 - the difference in birth year from 1986. This is based on the precedinganalysis.
In [40]:
In [41]:
In [42]:
def roll_up(x): return pd.Series({"total_rides":np.count_nonzero(x), "avg_trip_duration_s":np.mean(x.tripduration), "avg_distance_m":np.mean(x.distance), "male":np.count_nonzero(x.gender=="Male"), "female":np.count_nonzero(x.gender=="Female"), "birth_year_diff_86":np.mean(1986-x.birthyear)})df_divvy_group=df_divvy.groupby(["usertype","date"]).apply(roll_up)
df_divvy_group.reset_index(inplace=True)
df_divvy_group = pd.merge(df_divvy_group,weather,left_on="date", right_on
In [43]: fig = plt.figure()fig.set_figheight(9)fig.set_figwidth(15)fig.suptitle("Distance Bins for Customer/Subscriber", fontsize=16)ax = plt.subplot("211")df_temp = df_divvy_group[df_divvy_group.usertype=="Customer"]df_temp['Mean TemperatureF'].hist(alpha=0.7,bins=100,color="blue")plt.ylim(0,20)ax.set_ylabel("Mean Temperature")ax.set_title("Customer",fontsize=14)
ax = plt.subplot("212")df_temp = df_divvy_group[df_divvy_group.usertype=="Subscriber"]df_temp['Mean TemperatureF'].hist(alpha=0.7,bins=100,color="red")ax.set_title("Subscriber",fontsize=14)ax.set_ylabel("Mean Temperature")plt.ylim(0,20)plt.xlabel("Rental Volume")plt.show()
In [44]:
Clearly there is a relationship between total ridership and the temperature. The relationshipseems to be slightly exponential for Customers v/s Subscribers. Subscribers can beseen hiring bikes at much lower temperatures.
b) Now lets look at the precipitation in inches and how that affects the ridership
In [45]:
fig = plt.figure()fig.set_figheight(9)fig.set_figwidth(15)fig.suptitle("Temperature and the rider", fontsize=16)ax = plt.subplot("211")df_temp = df_divvy_group[df_divvy_group.usertype=="Customer"]ax.scatter(df_temp["Mean TemperatureF"],df_temp.total_rides,color="blue"ax.set_title("Customer",fontsize=14)
ax = plt.subplot("212")df_temp = df_divvy_group[df_divvy_group.usertype=="Subscriber"]ax.scatter(df_temp["Mean TemperatureF"],df_temp.total_rides,color="red"ax.set_title("Subscriber",fontsize=14)plt.show()
def fun_sum(x): return pd.Series({"TotalRidership":np.sum(x.total_rides)})
In [46]:
As can be seen, precipitation results in a drastic drop in ridership.
c) How does wind speed affect the total rider volume?
fig = plt.figure()fig.set_figheight(9)fig.set_figwidth(15)fig.suptitle("Precipitation and the rider volume", fontsize=16)ax = plt.subplot("211")df_temp = df_divvy_group[df_divvy_group.usertype=="Customer"]df_=df_temp.groupby("PrecipitationIn").apply(fun_sum)df_.reset_index(inplace=True)ax.plot(df_.PrecipitationIn,df_.TotalRidership,color="blue")ax.set_title("Customer",fontsize=14)
df_temp = df_divvy_group[df_divvy_group.usertype=="Subscriber"]df_=df_temp.groupby("PrecipitationIn").apply(fun_sum)df_.reset_index(inplace=True)ax.plot(df_.PrecipitationIn,df_.TotalRidership,color="red")ax.set_title("Subscriber",fontsize=14)plt.show()
In [47]:
As we see from the above graph, the rider volume is affected by the wind speed, howeverthere are multiple sections in this graph. We see that the rider volume increases between 0 -7 mph, however there is a sudden dip at 8mph. This could probably be attributed to fewerdays with 8 mph wind speeds, and hence a lower total ridership volume. We notice that rightafter 9 mph the total rider volume starts a steady decline.
d) Lets look at day of the week and how that affects ridership
In [48]:
fig = plt.figure()fig.set_figheight(9)fig.set_figwidth(15)fig.suptitle("Mean Wind Speed (mph) and the rider volume", fontsize=16)ax = plt.subplot("211")df_temp = df_divvy_group[df_divvy_group.usertype=="Customer"]df_=df_temp.groupby("Mean Wind SpeedMPH").apply(fun_sum)df_.reset_index(inplace=True)ax.plot(df_['Mean Wind SpeedMPH'],df_.TotalRidership,color="blue")ax.set_title("Customer",fontsize=14)
df_temp = df_divvy_group[df_divvy_group.usertype=="Subscriber"]df_=df_temp.groupby("Mean Wind SpeedMPH").apply(fun_sum)df_.reset_index(inplace=True)ax.plot(df_['Mean Wind SpeedMPH'],df_.TotalRidership,color="red")ax.set_title("Subscriber",fontsize=14)plt.show()
df_divvy_group["day_of_year"] = df_divvy_group.date.dt.dayofyeardf_divvy_group["day_of_week_mon_is_0"] = df_divvy_group.date.dt.dayofweek
In [49]:
We can see from the above graphs that there is a difference between the Customer andSubscriber rider characteristic. Customers ride more on weekends, and subscribers ridemore on weekdays. An idea to explore- if we explore the difference between weekend v/sweekday
In [50]:
fig = plt.figure()fig.set_figheight(9)fig.set_figwidth(15)ax = plt.subplot("211")fig.suptitle("Day of week and the rider volume", fontsize=16)df_temp = df_divvy_group[df_divvy_group.usertype=="Customer"]df_=df_temp.groupby("day_of_week_mon_is_0").apply(fun_sum)df_.reset_index(inplace=True)ax.plot(df_.day_of_week_mon_is_0,df_.TotalRidership,color="blue")
df_temp = df_divvy_group[df_divvy_group.usertype=="Subscriber"]df_=df_temp.groupby("day_of_week_mon_is_0").apply(fun_sum)df_.reset_index(inplace=True)ax.plot(df_.day_of_week_mon_is_0,df_.TotalRidership,color="red")plt.show()
df_divvy_group["IsWeekend"] = (df_divvy_group.day_of_week_mon_is_0>4).astype
In [248]:
In [52]:
Model BuildingWe are going to build a few different models with a different selection of features for eachgroup of models.
1. Models being built-
fig = plt.figure()fig.set_figheight(9)fig.set_figwidth(15)ax = plt.subplot("211")fig.suptitle("Is Weekend? and the rider volume", fontsize=16)df_temp = df_divvy_group[df_divvy_group.usertype=="Customer"]df_=df_temp.groupby("IsWeekend").apply(fun_sum)df_.reset_index(inplace=True)ax.plot(df_.IsWeekend,df_.TotalRidership,color="blue")
ax = plt.subplot("212")df_temp = df_divvy_group[df_divvy_group.usertype=="Subscriber"]df_=df_temp.groupby("IsWeekend").apply(fun_sum)df_.reset_index(inplace=True)ax.plot(df_.IsWeekend,df_.TotalRidership,color="red")plt.show()
df_divvy_group.to_csv('../../data/Divvy_Trips_2015-Q1Q2/data-weather-distance.csv'
a) Lasso Regression b) Ridge Regression c) Gradient Boosted Regressor d) Elastic Net
2. Train/Test:70/30
3. Feature scaling: Enabled
4. Grid Search CV: 10 Fold CV
5. Separate models for Customer and Subscriber user types
6. Models for feature sets-
a) All data except day of week b) All data c) Temperature, Precipitation, and Birth Year Diff From 1986 d) All from c) and dummy coded day of week feature
In [137]:
Model Building Code
import sklearn.cross_validation as cvimport sklearn.metrics as mtimport sklearn.linear_model as lmimport sklearn.ensemble as ensembleimport sklearn.preprocessing as psfrom sklearn.grid_search import RandomizedSearchCVfrom sklearn.grid_search import GridSearchCV
In [189]:
Configuration to drive Model Building code
def MSECalc(y, y_pred): return round(mt.mean_squared_error(y,y_pred),8)
def ModelScorer(pred_train, y_train, pred_test, y_test): mse_train = MSECalc(y_train, pred_train) mse_test = MSECalc(y_test, pred_test) return mse_train, mse_test
def ModelBuilder_lasso(X_train,y_train, X_test, config): model = GridSearchCV(lm.Lasso(),param_grid=config["params"]["lasso" model.fit(X_train,y_train) return model.predict(X_train), model.predict(X_test), model.best_params_
def ModelBuilder_ridge(X_train,y_train, X_test, config): model = GridSearchCV(lm.Ridge(),param_grid=config["params"]["ridge" model.fit(X_train,y_train) return model.predict(X_train), model.predict(X_test), model.best_params_
def ModelBuilder_en(X_train,y_train, X_test, config): model = GridSearchCV(lm.ElasticNet(),param_grid=config["params"]["en" model.fit(X_train,y_train) return model.predict(X_train), model.predict(X_test), model.best_params_
def ModelBuilder_gbr(X_train,y_train, X_test, config): model = GridSearchCV(ensemble.GradientBoostingRegressor(),param_grid model.fit(X_train,y_train) return model.predict(X_train), model.predict(X_test), model.best_params_
def ModelComparator(X,y, config): X=ps.scale(X) y=ps.scale(y) X_train, X_test, y_train, y_test = cv.train_test_split(X,y, test_size pred_train={} pred_test={} mse_train={} mse_test={} params={} for model in config["models"]: if "lasso" in model: pred_train[model], pred_test[model], params[model]=ModelBuilder_lasso if "ridge" in model: pred_train[model], pred_test[model], params[model]=ModelBuilder_ridge if "en" in model: pred_train[model], pred_test[model], params[model]=ModelBuilder_en if "gbr" in model: pred_train[model], pred_test[model], params[model]=ModelBuilder_gbr mse_train[model], mse_test[model] = ModelScorer(pred_train[model return mse_train, mse_test, params
In [150]:
Code for dummy coding of categoricals
In [94]:
Method that encapsulates running all models, as well aggregating all scores anddetails about the run
config={"models":["lasso","ridge","en","gbr"], "params":{"lasso":{"alpha":[0.001,0.01,0.1,1], "tol":[0.0001,0.001,0.01,0.1,1]}, "ridge":{"alpha":[0.001,0.01,0.1,1], "tol":[0.0001,0.001,0.01,0.1,1]}, "en":{"tol":np.linspace(0.0001, 0.1, num=15), "alpha":[0.001,0.01,0.1,1], "l1_ratio":np.linspace(0.01, 1, num=15)}, "gbr":{"learning_rate":np.linspace(0.05, 1, num=15), "min_samples_leaf":range(1,10), "min_samples_split":range(1,5)}}, "cv":10}mse_train={}mse_test={}
def dummy_coding(x,col_names): sep={} for col in col_names: vals=list(x[col].unique()) for val in vals: sep["%s_%s"%(col,val)] = (x[col]==val).astype(int) return sep
In [226]:
Variable to catch all scores
In [227]:
Models built with different feature sets-
a) All data except day of week
def run_models(user,features,X,y,config): train, test, param = ModelComparator(X,y,config) mse_test = [] mse_train = [] params = [] models = [] usertype = [] feature_set = [] scores_df = pd.DataFrame() for model in config["models"]: models.append(model) mse_train.append(train[model]) mse_test.append(test[model]) params.append(param[model]) usertype.append(user) feature_set.append(features) scores_df["feature_set"] = feature_set scores_df["usertype"] = usertype scores_df["model"] = models scores_df["mse_train"] = mse_train scores_df["mse_test"] = mse_test scores_df["rmse_train"] = np.sqrt(scores_df.mse_train) scores_df["rmse_test"] = np.sqrt(scores_df.mse_test) scores_df["params"] = params return scores_df
scores = []
In [228]:
b) All data
In [230]:
c) Weather Data Only
X=df_divvy_group[df_divvy_group.usertype=="Subscriber"]X_cust=df_divvy_group[df_divvy_group.usertype=="Customer"]
X.drop(["usertype","date","CDT","Events","day_of_week_mon_is_0"],axis=1X_cust.drop(["usertype","date","CDT","Events","birth_year_diff_86","female"
X_cust.dropna(inplace=True)X.dropna(inplace=True)y_cust=X_cust.total_ridesy=X.total_rides
X_cust.drop("total_rides",axis=1,inplace=True)X.drop("total_rides",axis=1,inplace=True)
scores.append(run_models("subscriber","all_except_dow",X,y,config))scores.append(run_models("customer","all_except_dow",X_cust,y_cust,config
X=df_divvy_group[df_divvy_group.usertype=="Subscriber"]X_cust=df_divvy_group[df_divvy_group.usertype=="Customer"]
X.drop(["usertype","date","CDT","Events"],axis=1,inplace=True)X_cust.drop(["usertype","date","CDT","Events","birth_year_diff_86","female"
X_cust.dropna(inplace=True)X.dropna(inplace=True)y_cust=X_cust.total_ridesy=X.total_rides
X=pd.concat([X,pd.DataFrame(dummy_coding(X,["day_of_week_mon_is_0"]))],X_cust=pd.concat([X_cust,pd.DataFrame(dummy_coding(X_cust,["day_of_week_mon_is_0"
X_cust.drop(["total_rides","day_of_week_mon_is_0"],axis=1,inplace=True)X.drop(["total_rides","day_of_week_mon_is_0"],axis=1,inplace=True)
scores.append(run_models("subscriber","all_features",X,y,config))scores.append(run_models("customer","all_features",X_cust,y_cust,config
In [231]:
In [232]:
d) Weather data and day of week
X=df_divvy_group[df_divvy_group.usertype=="Subscriber"]X_cust=df_divvy_group[df_divvy_group.usertype=="Customer"]
X=X[["total_rides","Mean TemperatureF","PrecipitationIn","birth_year_diff_86"X_cust=X[["total_rides","Mean TemperatureF","PrecipitationIn"]]pd.tools.plotting.scatter_matrix(X,figsize=(15,10))plt.show()
X_cust.dropna(inplace=True)X.dropna(inplace=True)y_cust=X_cust.total_ridesy=X.total_rides
X_cust.drop("total_rides",axis=1,inplace=True)X.drop("total_rides",axis=1,inplace=True)
scores.append(run_models("subscriber","temp_prec_birth",X,y,config))scores.append(run_models("customer","temp_prec_birth",X_cust,y_cust,config
In [234]:
In [235]:
In [236]:
X=df_divvy_group[df_divvy_group.usertype=="Subscriber"]X_cust=df_divvy_group[df_divvy_group.usertype=="Customer"]
X=X[["total_rides","Mean TemperatureF","PrecipitationIn","birth_year_diff_86"X_cust=X[["total_rides","Mean TemperatureF","PrecipitationIn","day_of_week_mon_is_0"
X=pd.concat([X,pd.DataFrame(dummy_coding(X,["day_of_week_mon_is_0"]))],X_cust=pd.concat([X_cust,pd.DataFrame(dummy_coding(X_cust,["day_of_week_mon_is_0"pd.tools.plotting.scatter_matrix(X,figsize=(15,10))plt.show()
X_cust.dropna(inplace=True)X.dropna(inplace=True)y_cust=X_cust.total_ridesy=X.total_rides
X_cust.drop(["total_rides","day_of_week_mon_is_0"],axis=1,inplace=True)X.drop(["total_rides","day_of_week_mon_is_0"],axis=1,inplace=True)
scores.append(run_models("subscriber","temp_prec_birth_dow",X,y,config))scores.append(run_models("customer","temp_prec_birth_dow",X_cust,y_cust
scores_df = pd.concat(scores)
In [240]:
Out[240]: feature_set usertype model mse_train mse_test rmse_train rmse_test
1 all_except_dow subscriber ridge2.200000e-07
2.600000e-07 0.000469 0.000510
1 all_features subscriber ridge2.000000e-07
3.300000e-07 0.000447 0.000574
0 all_except_dow subscriber lasso1.340000e-06
6.900000e-07 0.001158 0.000831
2 all_except_dow subscriber en1.340000e-06
6.900000e-07 0.001158 0.000831
0 all_features subscriber lasso1.340000e-06
6.900000e-07 0.001158 0.000831
2 all_features subscriber en1.340000e-06
6.900000e-07 0.001158 0.000831
3 all_features subscriber gbr1.035000e-05
1.207340e-03 0.003217 0.034747
3 all_except_dow subscriber gbr1.305000e-05
1.214040e-03 0.003612 0.034843
2 temp_prec_birth_dow customer en1.813251e-01
1.317287e-01 0.425823 0.362944
2 temp_prec_birth_dow subscriber en1.863900e-01
1.328632e-01 0.431729 0.364504
1 temp_prec_birth_dow subscriber ridge1.732589e-01
1.335119e-01 0.416244 0.365393
1 temp_prec_birth_dow customer ridge1.738899e-01
1.417032e-01 0.417001 0.376435
0 temp_prec_birth_dow customer lasso1.745095e-01
1.431364e-01 0.417743 0.378334
0 temp_prec_birth_dow subscriber lasso1.744810e-01
1.434022e-01 0.417709 0.378685
3 temp_prec_birth_dow subscriber gbr4.007046e-02 1.722666e-01 0.200176 0.415050
scores_df.sort("mse_test")
3 temp_prec_birth_dow customer gbr2.641917e-02
1.903296e-01 0.162540 0.436268
3 temp_prec_birth subscriber gbr4.543861e-02
1.954518e-01 0.213163 0.442099
3 all_except_dow customer gbr1.000000e-07
1.957464e-01 0.000316 0.442432
0 temp_prec_birth subscriber lasso2.103654e-01
2.080503e-01 0.458656 0.456125
2 temp_prec_birth subscriber en2.090449e-01
2.097234e-01 0.457214 0.457956
1 temp_prec_birth subscriber ridge2.087435e-01
2.105014e-01 0.456885 0.458804
3 all_features customer gbr9.000000e-08
2.334332e-01 0.000300 0.483149
2 all_except_dow customer en3.629435e-01
2.941157e-01 0.602448 0.542324
3 temp_prec_birth customer gbr1.401808e-01
3.056549e-01 0.374407 0.552861
0 temp_prec_birth customer lasso2.751552e-01
3.060788e-01 0.524552 0.553244
2 temp_prec_birth customer en2.748168e-01
3.096775e-01 0.524230 0.556487
1 temp_prec_birth customer ridge2.746825e-01
3.125255e-01 0.524102 0.559040
2 all_features customer en3.495963e-01
3.150416e-01 0.591267 0.561286
0 all_except_dow customer lasso3.445035e-01
6.175809e-01 0.586944 0.785863
3.289846e-
Analysis of results
1. For user type : Subscribera) It's interesting to see that the subscriber model that performed the best (and best overallcompared to customer models as well) was the one with the entire feature set (exceptday_of_week_mon_is_0)-
0 all_features customer lasso 01 7.095680e-01 0.573572 0.842359
1 all_except_dow customer ridge3.086640e-01
8.848603e-01 0.555575 0.940670
1 all_features customer ridge2.946653e-01
1.034366e+00 0.542831 1.017038
In [188]:
b) The best performing model was- Ridge Regression, MSE : 2.600000e-07, RMSE :0.000510 c) Tuned Parameters- alpha: 0.001, tol: 0.0001 d) Another interesting fact to note is that the top performing models for the Subscriber userwere all linear models. e) Models created with only the weather data performed on the lower end of the spectrum forSubscribers. This shows that Subscribers are less influenced by changes in weatherconditions when it comes to renting Divvy bikes.
2. For the user type : Customer
Out[188]: ['usertype', 'date', 'avg_distance_m', 'avg_trip_duration_s', 'birth_year_diff_86', 'female', 'male', 'total_rides', 'CDT', 'Max TemperatureF', 'Mean TemperatureF', 'Min TemperatureF', 'Max Dew PointF', 'MeanDew PointF', 'Min DewpointF', 'Max Humidity', 'Mean Humidity', 'Min Humidity', 'Max Sea Level PressureIn', 'Mean Sea Level PressureIn', 'Min Sea Level PressureIn', 'Max VisibilityMiles', 'Mean VisibilityMiles', 'Min VisibilityMiles', 'Max Wind SpeedMPH', 'Mean Wind SpeedMPH', 'Max Gust SpeedMPH', 'PrecipitationIn', 'CloudCover', 'Events', 'WindDirDegrees', 'day_of_year', 'day_of_week_mon_is_0', 'IsWeekend']
list(df_divvy_group.columns)
a) The best performing model for Customer was trained with only weather data and day ofweek. Note: The scores list the best customer model with feature set inclusive of birth year.This is not true for the model, and is only a labeling issue. b) The best performing model was- Elastic Net, MSE : 1.317287e-01, RMSE : 0.362944 c) Tuned Parameters-
In [249]:
d) In the case of models for Customers it can be noted that the top performing models are alllinear models. e) Customers are people who rent only for the day. These rental decisions can be affected byweather conditions, as customers could be visitors etc., who are not prepared for theweather, and hence decide on the fly about renting a bike. Subscribers on the other hand areusing bikes more for commuting to work, as can be seen from total rider volume by user typefor a given day of the week (EDA section).
Out[249]: {'alpha': 0.1, 'l1_ratio': 0.080714285714285711, 'tol': 0.078592857142857145}
scores_df[(scores_df.feature_set=="temp_prec_birth_dow") & (scores_df.usertype