Building a Standard for Open Bikeshare Data

Upload: mobility-lab

Post on 27-Jan-2015


Originally published at Michael Schade’s Mystery Incorporated Blog

March 2nd, 2014

Should the bikeshare industry adopt an open data standard? As bikesharing spreads to more cities, having a common method for accessing and analyzing data will become more important. We know that transit systems work best when agencies concentrate on their core mission. Transit agencies are not in the information technology business; all they should do is release their data to let third parties build apps that let passengers use the systems.

To use open data, programmers need to know: Where is the data? What are the files called? Which fields are available? What are the fields called?

Bikesharing systems should adopt the standard of having a “data” page which can be found by appending “data” immediately after the main URL. This is what many U.S. government web sites are doing (like justice.gov/data, dot.gov/data, and state.gov/data). It would be awesome to have consistent URLs like capitalbikeshare.com/data and velib.paris.fr/data.
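As a minimal sketch of that convention, the helper below derives the data page from a system's main URL. The function name `data_page_url` and the choice to default to `https` when no scheme is given are assumptions made for illustration; they are not part of any existing standard.

```python
from urllib.parse import urlsplit, urlunsplit


def data_page_url(main_url: str) -> str:
    """Return the conventional "/data" page for a system's main URL."""
    # Assumption: default to https when the caller passes a bare host name.
    if "://" not in main_url:
        main_url = "https://" + main_url
    parts = urlsplit(main_url)
    # Keep only the scheme and host, then append the fixed "/data" path.
    return urlunsplit((parts.scheme, parts.netloc, "/data", "", ""))


print(data_page_url("capitalbikeshare.com"))
print(data_page_url("http://velib.paris.fr/"))
```

With a rule this simple, an app only needs to know a system's home page to find its data.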

To standardize what the files are called, we have to decide how many files are used, and what formats to use. Some systems do not separate the station information data (which is static) from the station status data (which is dynamic). The Capital Bikeshare XML file and the Bixi Montreal XML file are examples of combining both static and dynamic data in a single file (both use the Bixi public bike system). This might be more convenient in some cases, but for systems that frequently update their displays, it wastes a lot of bandwidth. This process could be made more efficient by using two files. JCDecaux, which manages many bikesharing systems in Europe, separates the static data from the dynamic real-time data.

Denver’s B-cycle doesn’t seem to offer any data at all, though Denver’s Open Data Catalog does offer a variety of formats for data about B-cycle Stations. I doubt this is the true, live system data, because the station locations are given as street addresses rather than latitude and longitude coordinates.

In addition to information needed by apps, we also need historic data in order to analyze how people use the system. The most common kind is system metrics, such as the type released by Bay Area Bikeshare. This typically shows ridership and membership totals, and is good for showing how the system has grown. It would be updated at the end of each day.

Planners and analysts rely on two other types of historic data: trip history information shows every trip made within a certain period, and station history data shows the status of the stations within a certain period. The best example of the former is the Capital Bikeshare trip history data page, which releases a new data set every quarter. The latter is sometimes recorded by enthusiasts on their own initiative, such as the CaBi Tracker website. In San Francisco, Eric Fisher keeps a daily log of Bay Area Bikeshare stats at trafficways.org/babs (I used his data in Probing Data from Bay Area Bikeshare).

The trip history and station history files need a naming convention to reflect the content’s date range. CaBi’s largest quarterly file is 72.5MB, for the 572,919 trips in the 2nd quarter of 2012 (they have now started zipping the files). A filename format like trips-2012-3-1-to-2012-5-30.csv would work well.
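A short sketch of how such a naming convention could be generated and read back. It follows the example filename above literally (no zero padding); the function names `trip_filename` and `parse_trip_filename` are hypothetical.

```python
import re
from datetime import date

# Matches names such as trips-2012-3-1-to-2012-5-30.csv
_TRIP_FILE = re.compile(
    r"^trips-(\d{4})-(\d{1,2})-(\d{1,2})-to-(\d{4})-(\d{1,2})-(\d{1,2})\.csv$"
)


def trip_filename(start: date, end: date) -> str:
    """Build a trip history filename that encodes its date range."""
    return (
        f"trips-{start.year}-{start.month}-{start.day}"
        f"-to-{end.year}-{end.month}-{end.day}.csv"
    )


def parse_trip_filename(name: str) -> tuple[date, date]:
    """Recover the (start, end) dates encoded in a trip history filename."""
    match = _TRIP_FILE.match(name)
    if match is None:
        raise ValueError(f"not a trip history filename: {name!r}")
    y1, m1, d1, y2, m2, d2 = map(int, match.groups())
    return date(y1, m1, d1), date(y2, m2, d2)


print(trip_filename(date(2012, 3, 1), date(2012, 5, 30)))
```

Because the range is recoverable from the name alone, a program can pick the files it needs without opening any of them.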

While the systems are expected to protect their customers’ privacy by not including customer IDs, users should be able to download their own personal trip history files, and those files should use the same format as the main trip history files.

Finally, there should be a standard way of summarizing general information about the entire system: who provides the equipment, who runs the system, which jurisdictions participate, where the system is located and what its boundaries are, what the hours of operation are, what the operating season is, and what the URL and other contact info are. And to really integrate all the various systems, we could also benefit from having the URL for a standard-size logo image, plus the system’s colors. This system information file should also include the data found in a manifest file, namely, a list of all the associated open-data files.

The system information should include definitions of available membership types. This might merit being listed as a separate table. Each membership type should include the cost and duration. We also need to know how long rides can be, and what the charges are for going beyond the time limit. For example, the CaBi pricing rules say rides are free for the first 30 minutes; going up to 30 minutes longer costs $2.00 for casual members (those with 1- or 3-day memberships) and $1.50 for subscribers. In contrast, the Citi Bike pricing rules say rides are free for the first 45 minutes; going up to 30 minutes longer costs $4.00 for those with 24-hour & 7-day passes, and $2.50 for those with annual memberships.
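To illustrate how a membership table could drive fee calculations, here is a sketch that encodes only the figures quoted above. The class and field names are hypothetical, and because the text describes only the first 30-minute overage block, the function returns `None` for longer trips instead of guessing at later tiers.

```python
from dataclasses import dataclass
from typing import Optional

OVERAGE_BLOCK_MINUTES = 30  # the "up to 30 minutes longer" block quoted above


@dataclass(frozen=True)
class MembershipType:
    system: str
    name: str
    free_minutes: int         # ride time included at no extra charge
    first_overage_fee: float  # fee for the first overage block, in dollars


def usage_fee(membership: MembershipType, trip_minutes: int) -> Optional[float]:
    """Return the usage fee, or None when the trip exceeds the tiers given here."""
    if trip_minutes <= membership.free_minutes:
        return 0.0
    if trip_minutes <= membership.free_minutes + OVERAGE_BLOCK_MINUTES:
        return membership.first_overage_fee
    return None  # later pricing tiers are not described in the text


CABI_CASUAL = MembershipType("CaBi", "casual", 30, 2.00)
CABI_SUBSCRIBER = MembershipType("CaBi", "subscriber", 30, 1.50)
CITI_PASS = MembershipType("Citi Bike", "24-hour or 7-day pass", 45, 4.00)
CITI_ANNUAL = MembershipType("Citi Bike", "annual", 45, 2.50)

print(usage_fee(CABI_CASUAL, 45))
print(usage_fee(CITI_ANNUAL, 40))
```

A complete table would also carry each membership's cost and duration, plus the later overage tiers, which are omitted here because the text does not give them.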

This table summarizes the six types of bikesharing data:

System information: general info
Station information: a mostly-static list of all stations
Station status: the number of available bikes and docks
System metrics: membership and trip totals
Trip history: every trip made during a given period
Station history: a history of the station status list

Here’s how I would organize the files. I’ll use ▶ to indicate a primary key (one that must be unique within the system), and ▷ to indicate a foreign key (one that references another table’s primary key, and which must exist).

The station information data is the information most likely to be shared by bikeshare systems. At the very least, it includes the latitude & longitude coordinates for every station, and the name. The file is fairly static, changing mostly when new stations are added.

Here are the fields I would include, compared with CaBi (DC), Vélib (Paris), and Denver’s B-cycle to see what names they use.

Station information

| proposal | CaBi | Vélib | B-cycle |
| --- | --- | --- | --- |
| stationid ▶ | id, terminalName | number | GLOBALID |
| name | name | name | STATION_NAME |
| address | (not used) | address | STATION_ADDRESS, ADDRESS_LINE1, ADDRESS_LINE2 |
| region | (not used) | (not used) | CITY, STATE |
| zip | (not used) | (not used) | ZIP |
| lat | lat | latitude | (not used) |
| lng | long | longitude | (not used) |
| installed | installDate | (not used) | (not used) |
| removed | removalDate | (not used) | (not used) |
| public | public | (not used) | (not used) |
| capacity | (not used) | (not used) | NUM_DOCKS |
| message | (not used) | (not used) | (not used) |

Most systems don’t use a region field, but for multi-jurisdictional systems, it is important to know which jurisdiction manages each station. For example, Capital Bikeshare operates within DC, Montgomery County, Arlington, and Alexandria. Bay Area Bikeshare operates within San Francisco, Redwood City, Palo Alto, Mountain View, and San Jose. Nice Ride operates within Minneapolis and St Paul. Other systems could use this field to track which neighborhood the station is in.

Vélib appends the postal code & city to the address field, but these would be better as separate fields. For example, the Bastille Richard Lenoir station has an address of “2 BOULEVARD RICHARD LENOIR – 75011 PARIS”, but this should be just “2 BOULEVARD RICHARD LENOIR”, with a zip of “75011” and a city of “Paris.” And there is no reason for Vélib to use all-uppercase letters. The data should be in the proper mixed-case (using French rules for capitalization), and programs can easily convert to uppercase if they wish.
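The split described above can be sketched with a regular expression. This assumes every combined address ends with a dash, a five-digit postal code, and the city name, as in the Bastille Richard Lenoir example; the function name is hypothetical. The street address is left in its original case, since French capitalization rules are more involved than a simple title-case conversion.

```python
import re

# street address, then a hyphen or en dash, then a 5-digit postal code and the city
_COMBINED = re.compile(r"^(?P<address>.+?)\s*[-–]\s*(?P<zip>\d{5})\s+(?P<city>.+)$")


def split_velib_address(raw: str) -> dict:
    """Split a combined Vélib address into address, zip, and city parts."""
    match = _COMBINED.match(raw.strip())
    if match is None:
        # No recognizable postal code suffix: leave the address untouched.
        return {"address": raw.strip(), "zip": None, "city": None}
    return {
        "address": match["address"],
        "zip": match["zip"],
        "city": match["city"].title(),  # "PARIS" becomes "Paris"
    }


print(split_velib_address("2 BOULEVARD RICHARD LENOIR – 75011 PARIS"))
```

The fact that a consumer has to guess at a pattern like this is the argument for publishing the fields separately in the first place.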

I would suggest a message field so systems can communicate that a station will be shutting down early or moving to a new location, or that the rebalancing van might not be able to service a station during a snow storm.

Denver has other fields that should be considered for a standard. “PROPERTY_TYPE” shows whether the station’s location is Private or Public. This could be expanded to show exactly who the property owner or responsible agency is. “POWER_TYPE” has values of Solar Only, Wired Only, and Solar with Wire Backup.

Cities often provide temporary stations. The station ID should correspond to a specific location. If a station returns to the same location for an annual event, it should re-use the old ID.
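As a rough sketch, the proposed station information record could be modeled as below, together with a converter from CaBi's field names to the proposed ones. Choosing CaBi's `id` (rather than `terminalName`) as the station ID, the string-to-number conversions, and the sample values are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StationInfo:
    stationid: str                   # primary key, tied to a specific location
    name: str
    lat: float
    lng: float
    address: Optional[str] = None
    region: Optional[str] = None     # managing jurisdiction or neighborhood
    zip: Optional[str] = None
    installed: Optional[str] = None
    removed: Optional[str] = None
    public: Optional[bool] = None
    capacity: Optional[int] = None
    message: Optional[str] = None


# CaBi field name -> proposed field name (from the comparison table above)
CABI_TO_PROPOSAL = {
    "id": "stationid", "name": "name", "lat": "lat", "long": "lng",
    "installDate": "installed", "removalDate": "removed", "public": "public",
}


def from_cabi(record: dict) -> StationInfo:
    """Convert one CaBi-style record, ignoring fields outside the proposal."""
    renamed = {CABI_TO_PROPOSAL[k]: v for k, v in record.items() if k in CABI_TO_PROPOSAL}
    public = renamed.get("public")
    return StationInfo(
        stationid=str(renamed["stationid"]),
        name=renamed["name"],
        lat=float(renamed["lat"]),
        lng=float(renamed["lng"]),
        installed=renamed.get("installed"),
        removed=renamed.get("removed"),
        public=None if public is None else public == "true",
    )


# Made-up sample record; nbBikes is dynamic data and is deliberately dropped.
sample = {"id": "1", "name": "Example St & 1st Ave", "lat": "38.9",
          "long": "-77.03", "public": "true", "nbBikes": "5"}
print(from_cabi(sample))
```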

The station status file should have the smallest amount of data needed to describe the current state of each station. This is the file that will be called most often, potentially thousands of times per minute, so every byte counts. And many people will be querying this data from mobile devices, another reason to keep the file size as small as possible. Here’s how I would design the standard for this file, compared with CaBi (DC) and Denver’s B-cycle to see what names they use. Because I couldn’t find Denver’s XML feed, I used CityBike’s Denver JSON feed.

Station status

| proposal | CaBi | Denver B-cycle |
| --- | --- | --- |
| stationid ▷ | id, terminalName | id, idx |
| bikes | nbBikes | bikes |
| docks | nbEmptyDocks | free |
| open | locked | (not used) |
| time | lastCommWithServer | timestamp |

The bikes and docks numbers will generally add up to the capacity value in the station information file, but if there are non-functioning bikes or docks, the total could be smaller. The open field would be true or false. Sometimes stations are temporarily closed, perhaps because they have become inaccessible. The time value shows the last time the station communicated with the server. This is useful to determine if the data might no longer be accurate, such as during a power outage.

Notice we don’t duplicate any of the fields in the station information file, other than our foreign key, the stationid field.
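The sketch below shows how an app might combine the two files: it looks up the static record through the `stationid` foreign key, derives the number of non-functioning bikes or docks from the capacity, and flags data that may be out of date. The sample values and the 600-second staleness threshold are made up for illustration, and `time` is assumed to be Unix time in seconds.

```python
MAX_AGE_SECONDS = 600  # hypothetical threshold for "possibly out of date"

# Static file, keyed by the primary key (sample values are made up).
STATION_INFO = {"S1": {"name": "Example Station", "capacity": 15}}

# Dynamic file: only the foreign key plus the fast-changing fields.
STATION_STATUS = [
    {"stationid": "S1", "bikes": 6, "docks": 7, "open": True, "time": 1393718400},
]


def describe(status_row: dict, info_by_id: dict, now: int) -> dict:
    """Join one status row with its static record via the stationid key."""
    # A KeyError here means the foreign key points at a station that does not exist.
    info = info_by_id[status_row["stationid"]]
    bikes, docks = status_row["bikes"], status_row["docks"]
    return {
        "name": info["name"],
        "bikes": bikes,
        "docks": docks,
        # Capacity not accounted for by working bikes or empty docks.
        "out_of_service": max(info["capacity"] - bikes - docks, 0),
        # True when the station has not reported to the server recently.
        "stale": now - status_row["time"] > MAX_AGE_SECONDS,
    }


print(describe(STATION_STATUS[0], STATION_INFO, now=1393718400 + 60))
```

An app would download the station information file once and refresh only the much smaller status file.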

The trip history file also needs to be as compact as possible, not because people will be downloading it frequently, but because these files could be used to store millions of records.

Trip history

startdate
startstation ▷
enddate
endstation ▷
bikeid
usertype

The duration of each trip can be computed on-the-fly and doesn’t need to be included in the file. The startstation and endstation values link up to the stationid field in the station information file. The usertype field describes the type of membership the rider has.
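Computing the duration on the fly is a one-line subtraction, as this sketch shows. It assumes a CSV trip history file with the proposed field names and with `startdate` and `enddate` stored as Unix time in seconds; the sample row is made up.

```python
import csv
import io

# Made-up sample in the proposed layout; times are Unix seconds.
SAMPLE = """startdate,startstation,enddate,endstation,bikeid,usertype
1393750800,S1,1393752000,S2,B0123,subscriber
"""


def trip_durations(csv_text: str) -> list[int]:
    """Return each trip's duration in seconds, derived from its two timestamps."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [int(row["enddate"]) - int(row["startdate"]) for row in reader]


for seconds in trip_durations(SAMPLE):
    print(f"{seconds // 60} minutes")
```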

Though few systems release trip history data on a regular basis, there have been occasions when systems have released data in support of a visualization contest. The Hubway Data Visualization Challenge took place in 2013, and included demographic data about the rider of each trip: residential zip code, year of birth, and sex. The Divvy Data Challenge (for Chicago) is currently underway; its data includes riders’ year of birth and sex.

The station history file should be a list of every change in status (available bikes and docks) for every station, listed in chronological order. In order to avoid having to repeat the state of the entire system when only a few stations have new values, the file should start with every station, and thereafter list a station only when it has changed. The initial value would be needed in order to compute the state of any later times recorded in the file.

Station history

stationid ▷
bikes
docks
open
time
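Reconstructing the system's state at a given moment from such a file amounts to replaying the rows in order and keeping the latest one per station. Here is a minimal sketch, assuming rows are tuples in the proposed field order and `time` is a number that increases through the file; the sample rows are made up.

```python
def state_at(history, when):
    """Replay a chronological station history and return each station's state at `when`."""
    state = {}
    for stationid, bikes, docks, is_open, time in history:
        if time > when:
            break  # rows are chronological, so nothing later can apply
        # A later row for the same station replaces the earlier one.
        state[stationid] = {"bikes": bikes, "docks": docks, "open": is_open, "time": time}
    return state


# The file starts with every station, then lists only changes.
HISTORY = [
    ("A", 5, 10, True, 1000),
    ("B", 8, 3, True, 1000),
    ("A", 4, 11, True, 1060),
    ("B", 9, 2, True, 1120),
    ("A", 3, 12, True, 1180),
]

print(state_at(HISTORY, 1100))
```

At time 1100 in this sample, station A reflects its update at 1060 while station B still carries its initial row, which is why the initial snapshot of every station is required.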

The dominant data formats nowadays are XML and JSON. CSV is also a good choice, as long as the data fits a tabular format of simple rows and columns. For CSV files, the order of fields should be consistent.

The values of the fields are numeric, string, Boolean, and timestamp. Boolean is easily expressed as “true” or “false,” and Unix time is a common way of recording date and time.
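A small sketch of how a consumer might decode the two less obvious value types. It assumes Booleans arrive as the literal text “true” or “false” and that timestamps are Unix time in seconds (some feeds use milliseconds, which would need dividing by 1000 first); the function names are hypothetical.

```python
from datetime import datetime, timezone


def parse_bool(text: str) -> bool:
    """Decode the literal strings "true" and "false"."""
    lowered = text.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise ValueError(f"not a Boolean value: {text!r}")


def parse_time(text: str) -> datetime:
    """Decode Unix time (seconds since 1970-01-01 UTC) into an aware datetime."""
    return datetime.fromtimestamp(int(text), tz=timezone.utc)


print(parse_bool("true"))
print(parse_time("1393718400").isoformat())
```

Unix time avoids ambiguity about time zones and date formats, and the conversion to a local display format can be left to each app.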

When bikesharing open data is published in a standard form, developers and analysts can make it easier for the public to discover and use bikesharing systems across the globe, as the Bike Share Map by Oliver O’Brian does. The vendors, operators, and managing jurisdictions should work together to create a standard that can be used by everyone.