Exploring Gentrification in NYC
Using taxi trips, property sales, and permits as indicators
Ben Jakubowski (buj201), Waqid Munawar Volli (wmv214), and Pedro Lambert (pl1529)
Final Project Report
Abstract
Gentrification is an often-discussed phenomenon in New York City and other large cities. In the mid-20th century, several factors (including white flight, development of the interstate highway system, and invention of the modern suburb) reduced overall demand for residential property in urban centers. However, over the past several decades this trend has reversed, and demand for residential property in cities has dramatically increased. This in turn has caused property values and rents to rise and neighborhood compositions to change; in short, it has caused many urban neighborhoods to “gentrify”. Because gentrification can displace lower-income city residents and disrupt or destroy communities, it has elicited policy responses (for example, a recent significant zoning change in NYC). Given (i) the significant impacts of gentrification on individuals and communities, (ii) the need for appropriate policy responses, and (iii) the general public interest in gentrification as a social phenomenon, we aimed to apply data science and big data techniques and technologies to create a dashboard for exploring various potential indicators of gentrification in NYC over the years 2010-2013.
GitHub Repos:
Hadoop spatial inference: https://github.com/plambert/ds1004
Local Data Processing and Merging: https://github.com/buj201/ds1004finalproject
Data visualization and Exploration: https://github.com/waqidvolli/BigDataProject
Project report: https://docs.google.com/a/nyu.edu/document/d/1WqwR18j0mbpJvMBvb2JIQrldIVJ1Mms1V_McsXAjOc/edit?usp=sharing
Table of Contents:
Abstract
Introduction
Experimental Techniques and Methods
Results and Discussion
References
Individual Project Member Contributions
Introduction
For this project, we developed an interactive dashboard that allows users to explore gentrification in NYC using taxi trips, property sales, and permits as indicators. Gentrification is a phenomenon of increasing public interest in NYC and across the country. As seen in Figure 1, Google News searches for ‘gentrification’ from the US have increased markedly since 2008.
Locally, gentrification-related housing issues have also been the target of policy interventions, most recently changes to the zoning code passed by the City Council on March 22nd, 2016 [1]. While gentrification is often discussed, it is a complex phenomenon and is sometimes inconsistently or ambiguously defined. We take as our definition “the arrival of wealthier people in an existing urban district, a related increase in rents and property values, and changes in the district’s character and culture” [2]. Note this definition perhaps masks the human costs of gentrification: ‘changes in the district’s character and culture’ often means that longtime residents have been priced out (due to rising rents and property values) and forced to move to less expensive neighborhoods, displacing (or disrupting or destroying) existing communities. While this definition has its shortcomings, it frames our project objective and methods well, since several components of our project map directly onto it:
Component: “Existing urban district”
Reflection in project: We chose to use Neighborhood Tabulation Areas (NTAs) as our existing urban district. The NYC Department of City Planning created NTAs by “aggregating census tracts into 195 neighborhood-like areas” [3]. Figure 2 shows the 195 NTA boundaries. This is an appropriate unit of analysis because gentrification is experienced at the neighborhood level, and NTAs are intermediate between the overly coarse Community District and the overly granular Census Tract.

Component: “Arrival of wealthier people”
Reflection in project: We measure the arrival of wealthier people to a neighborhood using the number of taxi pickups and dropoffs in that neighborhood. This assumes individual wealth positively correlates with number of taxi trips taken. This assumption finds some support (for example, there is some evidence of a positive correlation between census tract per-capita income and number of taxi dropoffs in that tract) [4].

Component: “Increase in rents and property values”
Reflection in project: To determine the change in property values at the NTA level, we computed the median per-unit residential sale price in each neighborhood.

Component: “Change in district’s character and culture”
Reflection in project: This is perhaps the most difficult component of gentrification to assess using data science techniques. Our proxy for this component is the number of permit applications from each NTA for a number of different permit categories. However, it is worth noting the limitations of an analytic approach to the study of gentrification: change in neighborhood culture is often more meaningfully described through ground-level reporting (for example, interviewing longtime neighborhood residents, as in [8]).

Table 1: Connection between operational definition of gentrification and project components

Importantly, our project required the use of big data infrastructure, since our measure of the arrival of wealthier people was the number of taxi trips with pickups and dropoffs in each neighborhood. The other feature sets (property sales and permit applications) did not require big data infrastructure, but our exploration of gentrification is improved because we used big data tools: since the combined size of the taxi trip data from 2010-2013 is 28.6 GB, we would not have been able to include taxi trip features in our dashboard without big data methods.
Experimental Techniques and Methods
Raw data sets: The raw data sets used to generate NTA-level features in our study were:

Dataset: Taxi Trips [5]
Source: https://uofi.app.box.com/v/nyctaxidata
Size: 28.6 GB
Description: NYC Taxi Data for years 2010-2013 (697,622,444 trips), obtained through FOIL request
Features used: Pickup latitude/longitude and date/time; dropoff latitude/longitude and date/time

Dataset: Property Sales
Source: https://www1.nyc.gov/site/finance/taxes/propertyannualizedsalesupdate.page
Size: 94.6 MB
Description: All property sales in NYC for years 2010-2013 (~2,949,000 sales, including all categories, not just residential properties)
Features used: Borough/block/lot numbers (for merge); address (for NYC Geoclient API query); number of residential units in property; number of commercial units in property; sale price and date

Dataset: Multiagency permits
Source: https://data.cityofnewyork.us/CityGovernment/MultiAgencyPermits/xfyiuyt5
Size: 1.04 GB
Description: Permits data from two different data sources: DOB Jobs Permits and DOHMH Permits (4,823,781 permits)
Features used: Permit type and date; borough/block/lot numbers (for merge)

Table 2: Datasets used for feature generation

As mentioned in the introduction, big data infrastructure was used to process the taxi trips, while the other datasets were small enough to process without it. In addition to the datasets that provided features related to gentrification, we also used a number of supplementary datasets to support NTA identification:
Dataset: NYC PLUTO
Source: http://www1.nyc.gov/site/planning/datamaps/opendata.page
Size: 432.9 MB
Description: Property information at the Tax Lot level. Used to map properties to NTAs, using tax lot as an identifier

Dataset: 2010 Census Tract to 2010 NTA Equivalency
Source: http://www1.nyc.gov/site/planning/datamaps/opendata/dwnnynta.page
Size: 105 KB
Description: Map from census tract to NTA

Dataset: NTA GeoJSON
Source: https://data.cityofnewyork.us/CityGovernment/NeighborhoodTabulationAreas/cpf4rkhq
Size: 4.2 MB
Description: GeoJSON used for NTA spatial inference on taxi trip pickups and dropoffs

Table 3: Supplemental datasets used for NTA inference
Data Manipulation: Creating our Tidy Dataset
Using these raw data, we constructed a tidy dataset multi-indexed by NTA and year-month. Again, we chose to use NTA as the geographic unit of analysis since gentrification is primarily experienced at the neighborhood level. We chose to use month as the temporal unit of analysis, since gentrification is a slow process and aggregating counts per month achieved some smoothing of low-frequency event counts (such as counts of certain types of permit, and number of residential property sales).
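As an illustration of this layout, the following is a minimal sketch of a dataset keyed by (NTA, year-month), built by merging per-source feature tables. It uses plain dictionaries rather than whatever tooling the project actually used; the function name `merge_features`, the column names, and the NTA codes are hypothetical. Missing counts are filled with 0, matching the report's later note that missing values implied zero counts for the taxi and permit features.

```python
def merge_features(feature_tables, fill=0):
    """Merge tables keyed by (nta, year_month) into one tidy table.

    Each table maps (nta, year_month) -> {column_name: value}.
    Columns absent for a given key are filled with `fill`.
    """
    columns = sorted({col for t in feature_tables for row in t.values() for col in row})
    keys = sorted({key for t in feature_tables for key in t})
    merged = {}
    for key in keys:
        row = {col: fill for col in columns}  # start from zero counts
        for t in feature_tables:
            row.update(t.get(key, {}))        # overwrite with observed values
        merged[key] = row
    return merged


# Hypothetical toy inputs (codes and numbers are made up for illustration).
taxi = {
    ("NTA_A", "2010-01"): {"taxi_pickups": 120, "taxi_dropoffs": 95},
    ("NTA_B", "2010-01"): {"taxi_pickups": 7, "taxi_dropoffs": 9},
}
permits = {
    ("NTA_A", "2010-01"): {"new_building_permits": 3},
}

tidy = merge_features([taxi, permits])
for key, row in tidy.items():
    print(key, row)
```

Each row of the result corresponds to one neighborhood in one month, which is the unit the dashboard features are reported at.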
Final tidy dataset (features used in dashboard)

Taxi features:
Total number of taxi pickups and dropoffs

Permit features (total number of permits for each type):
Retail food process
Plumbing
Building alteration
Child care application tracking system
Full term mobile food vending unit
Seasonal mobile food vending unit
Mobile food unit
Foundation
Building equipment
Physician
New building
Building sign
Food service establishment
Building equipment work

Residential property sale features:
Median residential unit sale price
Total number of sales of strictly residential properties

Table 4: Feature sets in final, tidy dataset

Generating these features required substantial feature engineering. We proceed to describe the specific feature engineering approach taken for each dataset. For the taxi data, the total number of pickups and dropoffs was obtained for each NTA, for each year-month between 2010-01 and 2013-12, as follows:
1. First, the data was obtained from the University of Illinois repository New York City Taxi Trip Data (2010-2013) [5].
2. The pickup and dropoff coordinates and datetimes were retained, while other features were dropped. This reduced dataset was loaded into an AWS S3 bucket.
3. An AWS EMR cluster was provisioned with
   a. instance count: 5
   b. instance type: m3.xlarge
4. A bootstrap action was used to install dependencies (shapely, RTree, and others; see the GitHub repo for the script) on the cluster.
5. Next, NTA polygons were indexed using an RTree.
6. Streaming Hadoop was then used to (Map) tag each taxi trip with the dropoff NTA and pickup NTA, and (Reduce) aggregate pickup and dropoff counts by NTA and year_month. Specifically, for each taxi trip:
   a. The mapper used the RTree index for efficient spatial inference of the pickup and dropoff neighborhoods. It emitted two key/value pairs for each taxi trip:
      i. Key1: pickup year_month and NTA code
      ii. Value1: 1, pickup
      iii. Key2: dropoff year_month and NTA code
      iv. Value2: 1, dropoff
   b. The reducer aggregated the counts by year_month and NTA code, emitting two key/value pairs:
      i. Key1 and key2: year_month and NTA code
      ii. Value1: total number of taxi pickups
      iii. Value2: total number of taxi dropoffs
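The map and reduce steps above can be sketched in plain Python as follows. This is not the project's Hadoop streaming script (see the GitHub repo for that). To keep the sketch dependency-free, a bounding-box pre-filter stands in for the RTree index and a ray-casting test stands in for Shapely's point-in-polygon check. The NTA codes, polygon coordinates, and trip record layout are all hypothetical.

```python
from collections import defaultdict

# Toy stand-ins for NTA polygons (hypothetical codes and coordinates).
NTAS = {
    "NTA_A": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)],
    "NTA_B": [(2.0, 0.0), (4.0, 0.0), (4.0, 2.0), (2.0, 2.0)],
}


def bbox(poly):
    xs = [x for x, _ in poly]
    ys = [y for _, y in poly]
    return min(xs), min(ys), max(xs), max(ys)


# Bounding boxes play the role of the spatial index: a cheap pre-filter
# before the exact point-in-polygon test.
BBOXES = {code: bbox(poly) for code, poly in NTAS.items()}


def point_in_polygon(x, y, poly):
    """Ray casting: toggle on each edge crossed by a ray going right from (x, y)."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_nta(lon, lat):
    for code, (xmin, ymin, xmax, ymax) in BBOXES.items():
        if xmin <= lon <= xmax and ymin <= lat <= ymax:
            if point_in_polygon(lon, lat, NTAS[code]):
                return code
    return None  # point falls outside every NTA


def mapper(trip):
    """Emit ((year_month, nta), (1, kind)) for the pickup and the dropoff."""
    p_dt, p_lon, p_lat, d_dt, d_lon, d_lat = trip
    p_nta = find_nta(p_lon, p_lat)
    d_nta = find_nta(d_lon, d_lat)
    if p_nta is not None:
        yield (p_dt[:7], p_nta), (1, "pickup")
    if d_nta is not None:
        yield (d_dt[:7], d_nta), (1, "dropoff")


def reducer(pairs):
    """Sum pickup and dropoff counts per (year_month, nta) key."""
    totals = defaultdict(lambda: {"pickup": 0, "dropoff": 0})
    for key, (count, kind) in pairs:
        totals[key][kind] += count
    return dict(totals)


# Hypothetical trips: (pickup datetime, lon, lat, dropoff datetime, lon, lat).
trips = [
    ("2010-01-15 08:30:00", 1.0, 1.0, "2010-01-15 08:50:00", 3.0, 1.0),
    ("2010-01-31 23:55:00", 3.5, 0.5, "2010-02-01 00:10:00", 0.5, 0.5),
    ("2010-02-03 12:00:00", 1.5, 1.5, "2010-02-03 12:20:00", 9.0, 9.0),
]
pairs = [kv for trip in trips for kv in mapper(trip)]
counts = reducer(pairs)
for key in sorted(counts):
    print(key, counts[key])
```

Note that a trip contributes to two different keys when its pickup and dropoff fall in different neighborhoods or months, which is why the mapper emits the two events separately.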
For the permit application data, the following approach was taken to feature engineering:
1. First, the borough/block/lot (BBL), permit issuance date, and permit type description were retained from the multiagency permits data, while the other features were dropped.
2. Next, BBL was used to join the permits data with the NYC PLUTO dataset. This allowed us to match each permit with the corresponding NTA.
3. Permits were then grouped by NTA and year_month, and the total number of permit applications of each type was determined. Note this produced counts for 109 different types of permits.
4. Our next challenge was to select an interesting subset of permit type counts for inclusion in the final dataset. We approached this problem as follows:
a. Since our objective is to use permits data to compare how gentrification is differentially affecting NYC neighborhoods, we decided interesting permit features would have the largest between-neighborhood variance.
b. Thus, we determined the average number of permits for each permit type, for each neighborhood (averaging over the 48 months in the dataset). We then ranked each permit type based on the between-neighborhood variance of these averages, and retained the 14 features with the highest between-neighborhood variance. Note the top 14 features were selected (as opposed to, say, the top 8) since the top 14 features included the most relevant building permit types (i.e. foundation, new building, equipment).
c. Note this approach admitted a number of features that would likely have been dropped if we had selected features based on our intuition or literature regarding gentrification. For example, we likely would not have included the physician permit counts if we had selected features based on our prior beliefs regarding features that differentiated NYC neighborhoods.
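The selection rule in step 4 can be sketched as follows. This is an illustrative reading rather than the project's code: it assumes months with no recorded permits count as zero when averaging, and it uses population variance (the report does not say which variance estimator was used; with the same number of neighborhoods for every permit type, the ranking is the same either way). The function name, NTA codes, and counts are hypothetical.

```python
from collections import defaultdict
from statistics import pvariance


def rank_permit_types(counts, ntas, n_months, top_k):
    """Rank permit types by between-neighborhood variance of monthly averages.

    counts: dict mapping (nta, year_month, permit_type) -> number of permits.
    Returns (top_k permit types, variance per permit type).
    """
    totals = defaultdict(float)
    permit_types = set()
    for (nta, _year_month, ptype), n in counts.items():
        totals[(nta, ptype)] += n
        permit_types.add(ptype)

    variances = {}
    for ptype in permit_types:
        # Average over all months; neighborhoods with no permits average 0.
        averages = [totals.get((nta, ptype), 0.0) / n_months for nta in ntas]
        variances[ptype] = pvariance(averages)

    # Highest variance first; ties broken alphabetically for determinism.
    ranked = sorted(variances, key=lambda p: (-variances[p], p))
    return ranked[:top_k], variances


# Hypothetical toy counts for two neighborhoods over two months.
toy_counts = {
    ("NTA_A", "2010-01", "new building"): 10,
    ("NTA_A", "2010-02", "new building"): 14,
    ("NTA_B", "2010-01", "new building"): 2,
    ("NTA_A", "2010-01", "plumbing"): 5,
    ("NTA_A", "2010-02", "plumbing"): 5,
    ("NTA_B", "2010-01", "plumbing"): 4,
    ("NTA_B", "2010-02", "plumbing"): 6,
    ("NTA_B", "2010-02", "sign"): 2,
}
selected, variances = rank_permit_types(
    toy_counts, ntas=["NTA_A", "NTA_B"], n_months=2, top_k=2
)
print(selected)
```

In the toy data, plumbing permits are common but identical across the two neighborhoods on average, so they rank last. This is the behavior the rule is designed for: it favors permit types that differentiate neighborhoods over types that are merely frequent.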
For the sales data, the following approach was taken to feature engineering:
1. First, we chose to restrict our analysis to properties that met the following requirements:
a. They contained at least one residential unit and exactly zero commercial units. This inclusion criterion allowed us to determine the per-unit residential sale price by simply taking (total sale price) / (number of residential units). If we had allowed buildings with commercial units into our dataset, we would not have had a reasonable method for dividing the sale price into the component attributable to residential property value and the component attributable to commercial property value.
b. The sale price was at least $1000. This requirement was necessary due to the large number of properties with sale prices of $0, $1, or $100. While $1000 is a very low sale price, we intended this threshold to simply exclude the most obviously trivial sale prices (which we attribute to property transfers within families or other similar entities).
2. Next, we constructed the feature dollar_per_unit: the price per residential unit, defined as (total sale price) / (number of residential units).
3. Our next task was to tag each property sale with its containing NTA. Unlike taxi trips, we did not have latitude and longitude coordinates for property sales; thus, we couldn’t use our MapReduce framework to tag sales by NTA. Instead, we used a two-step approach:
a. First, we used the approach employed to tag the permit applications, namely using BBL to join the sales data with PLUTO (and ultimately merge in the NTA).
b. While this approach was relatively successful, it missed an unacceptably large number of records, since
“Each unit in a building that is a condominium is defined by the Department of Finance as a separate tax lot. To make condominium information more compatible with parcel information, the Department of City Planning aggregated condominium unit tax lot information [in PLUTO] so that each condominium complex within a tax block is represented by only one tax lot record.” [6]
Due to this mismatch between the property sales records and PLUTO, we next queried the NYC Geoclient API [7] by property address to obtain the missed NTAs. Note this approach was somewhat slower (per record) than spatial inference using latitude and longitude coordinates, but the sales data did not include coordinates.
4. After tagging the records with an NTA using one of these two approaches (direct join with PLUTO, or through the Geoclient API), the sales data were grouped by year_month and NTA and two aggregate features were computed:
a. Median per-residential-unit sales price
b. Total number of residential properties (not units) sold
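The filtering and aggregation steps for the sales data can be sketched as follows. The $1000 threshold and the dollar_per_unit definition come from the report; the record field names, the function name, and the toy sales are hypothetical, and the records are assumed to already carry an NTA tag.

```python
from collections import defaultdict
from statistics import median

MIN_SALE_PRICE = 1000  # threshold stated in the report


def aggregate_sales(sales):
    """Return median dollar_per_unit and sale count per (year_month, nta)."""
    per_unit = defaultdict(list)
    for s in sales:
        # Keep strictly residential properties: >= 1 residential unit, 0 commercial.
        if s["residential_units"] < 1 or s["commercial_units"] != 0:
            continue
        # Drop trivial prices such as $0, $1, or $100.
        if s["sale_price"] < MIN_SALE_PRICE:
            continue
        dollar_per_unit = s["sale_price"] / s["residential_units"]
        per_unit[(s["year_month"], s["nta"])].append(dollar_per_unit)
    return {
        key: {"median_dollar_per_unit": median(values), "n_sales": len(values)}
        for key, values in per_unit.items()
    }


# Hypothetical toy sales records.
toy_sales = [
    {"nta": "NTA_A", "year_month": "2010-01", "residential_units": 2, "commercial_units": 0, "sale_price": 600000},
    {"nta": "NTA_A", "year_month": "2010-01", "residential_units": 1, "commercial_units": 0, "sale_price": 500000},
    {"nta": "NTA_A", "year_month": "2010-01", "residential_units": 4, "commercial_units": 0, "sale_price": 400000},
    {"nta": "NTA_A", "year_month": "2010-01", "residential_units": 1, "commercial_units": 1, "sale_price": 900000},
    {"nta": "NTA_A", "year_month": "2010-01", "residential_units": 1, "commercial_units": 0, "sale_price": 1},
    {"nta": "NTA_B", "year_month": "2010-02", "residential_units": 3, "commercial_units": 0, "sale_price": 900000},
]
sales_features = aggregate_sales(toy_sales)
for key in sorted(sales_features):
    print(key, sales_features[key])
```

The sale count is a count of properties, not units, so a four-unit building contributes one sale and one per-unit price to the median.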
Finally, we merged these three data sets (taxi trips, permits, and residential property sales) by NTA and year_month to produce our final analytic dataset (described in Table 4). Note that in our final dataset missing values were replaced with 0, since missing values implied 0 counts (for permit and taxi features).
Issues Encountered
Through the process, we encountered a number of issues. First, our objective was to explore gentrification at the spatial scale defined by Neighborhood Tabulation Areas, and on the temporal scale of months. Thus, our primary data processing problem was assigning each record to the appropriate NTA. We solved this problem in three ways:
First, we used MapReduce for fast spatial inference of taxi pickup and dropoff neighborhoods. This approach was feasible since each pickup and dropoff point was recorded with latitude/longitude coordinates.
In contrast, the permit dataset lacked geographic coordinates; however, it included BBL values, which supported a join with the comprehensive PLUTO property dataset.
While merging the permits dataset and PLUTO was successful, BBL values in the residential property sales dataset did not adequately map onto the PLUTO BBL attribute. Thus, we had to use the additional approach of tagging sales with NTA by querying the NYC Geoclient API by property address.
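The join-then-fallback tagging used for the sales data can be sketched as follows. No real API is called here: `fake_address_lookup` is a hypothetical stand-in for a Geoclient query by address, and the BBL values, addresses, and NTA codes are invented for illustration.

```python
def tag_with_nta(records, bbl_to_nta, lookup_by_address):
    """Tag each record with an NTA: BBL join first, address lookup as fallback."""
    tagged, unmatched = [], []
    for rec in records:
        nta = bbl_to_nta.get(rec["bbl"])      # step 1: join on BBL (PLUTO-style map)
        source = "pluto_join"
        if nta is None:
            nta = lookup_by_address(rec["address"])  # step 2: slower per-record lookup
            source = "address_lookup"
        if nta is None:
            unmatched.append(rec)
        else:
            tagged.append({**rec, "nta": nta, "nta_source": source})
    return tagged, unmatched


# Hypothetical toy data standing in for PLUTO and for the address service.
PLUTO_BBL_TO_NTA = {"1000010001": "NTA_A"}
ADDRESS_TO_NTA = {"10 EXAMPLE ST": "NTA_B"}


def fake_address_lookup(address):
    """Stand-in for an address-based lookup; returns None when nothing matches."""
    return ADDRESS_TO_NTA.get(address)


toy_records = [
    {"bbl": "1000010001", "address": "1 EXAMPLE AVE"},   # found via BBL join
    {"bbl": "1000021101", "address": "10 EXAMPLE ST"},   # e.g. a condo unit lot missing from the join
    {"bbl": "1000031102", "address": "UNKNOWN"},         # resolved by neither step
]
tagged, unmatched = tag_with_nta(toy_records, PLUTO_BBL_TO_NTA, fake_address_lookup)
print([(t["bbl"], t["nta"], t["nta_source"]) for t in tagged])
print(len(unmatched))
```

Trying the join first keeps the per-record lookup to the minority of records the join misses, which matters because the lookup is the slower of the two steps.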
Detailed Experimental Setup
For our big data analysis (spatial inference over the taxi data using Hadoop), we used the following experimental setup:
Hadoop configuration: 5 nodes (1 master, 4 slave)
Tools used: In addition to streaming Hadoop on AWS EMR, we also made use of a number of external libraries and tools, primarily for spatial inference:
Shapely: Used for point-in-polygon tests
RTree: Used to accelerate our Hadoop geotagging through construction of an RTree spatial index over the NTA polygons
nyc_geoclient: Python binding for the NYC Geoclient API
CartoDB: Used for visualization
Performance of approach (running times of the scripts): The sales and permit data were processed locally because they were smaller datasets (geotagging was primarily achieved through a join with PLUTO). Thus we do not report running times for these scripts. The MapReduce jobs (over the 26 GB of taxi data) ran in approximately 5 hours per year of data (given the stated configuration with 5 nodes). Runtimes can be reduced through use of larger clusters.
Optimizations to speed up code: We sped up the code through three main optimizations. First, we used an RTree over the NTA polygons to speed up spatial inference of dropoff and pickup neighborhoods. Second, prior to beginning the MapReduce jobs, we reduced the data size by locally dropping all unnecessary features and retaining only those features required in our analysis. Finally, we sped up the code by using additional datasets (PLUTO and the 2010 Census Tract to 2010 NTA Equivalency) for inference of property sale and permit NTA. While we could have potentially tagged records in these datasets by NTA using spatial inference (potentially using Hadoop), it was more efficient to find the necessary supplemental datasets and achieve NTA tagging through a simple merge.
Results and Discussion
Again, our objective was to develop a tool that supports visual exploration of neighborhood-level changes in NYC over the years 2010-2013, using taxi trips, permits, and residential property sales as potential measures of gentrification. This tool is valuable to the public, since gentrification is a concern shared by many New Yorkers; it is also potentially valuable to policy makers as they design policy interventions to address the negative impacts of gentrification in and across New York communities. Given our objective of developing a tool to support user interaction with the reduced neighborhood-level data, we do not present conclusive results regarding particular changes observed over the years 2010-2013. Instead, we present example visuals (maps and charts) produced using our tool. First, Figure 3 shows the user interface for our tool.
The tool allows the user to visualize the spatial distribution of residential property sales, permit applications, and taxi trips in a map (produced using CartoDB). Moreover, the user can see the average rate of change for each feature over a 6-month window, setting the window using a slider. Finally, the user can interact with the map to drill down to one specific neighborhood and see how that neighborhood has changed over time. Figure 4 presents a number of sample maps illustrating the tool’s mapping functionality.
Figure 4.1: This map shows the number of taxi dropoffs and number of new building permits for a subset of neighborhoods.
Figure 4.2: This map shows the number of taxi pickups and number of new building permits for a larger view of New York City.
Figure 4: Sample maps generated by users using the dashboard
Figure 5 presents a number of sample charts at the specific neighborhood level. These charts show:
Figure 5.1: This chart shows the number of new building permits (selected from a dropdown menu), the median residential unit sale price, and the number of taxi trips for a user-specified neighborhood (selected from the map), for a user-specified 6-month window (selected with a slider).
Figure 5.2: This chart shows the average number of taxi trips for NYC (to provide a reference value), and the number per month for the specified neighborhood.
Figure 5.3: This chart shows the average dollars per residential unit for NYC (to provide a reference value), and the average dollars per residential unit for the specified neighborhood.
Figure 5.4: This chart shows the average number of permits of the selected type for NYC (to provide a reference value), and the number per month for the specified neighborhood.
Figure 5: Subcharts presented to the user in our dashboard
Again, these are intended as examples of the types of visuals that can be produced using our tool. The full functionality of the dashboard will be apparent through interactive demonstrations during the Monday, 05/16 presentation session.
Conclusion
Gentrification is a growing problem in New York City and across the country. As more people have moved to large urban areas, rent and property values have increased and the composition and culture of neighborhoods have changed. This has, in turn, driven many longtime residents out of these gentrifying neighborhoods (in large part because they are priced out). Given the impacts of gentrification, tools that allow a user to visualize this phenomenon through proxy measures including taxi trips, property sales, and permits are valuable to both the public and policy makers. Public users can see how their neighborhood has changed over the years 2010-2013 relative to the rest of the city, and policy makers can use the dashboard to develop insight into the city and potentially design both citywide and neighborhood-specific policy interventions.
References
1. Goodman, David J. March 22nd, 2016. New York Passes Rent Rules to Blunt Gentrification. New York Times. http://www.nytimes.com/2016/03/23/nyregion/newyorkcouncilpasseszoningchangesdeblasiosought.html
2. Grant, Benjamin. June 17th, 2003. What is Gentrification? American Documentary, Inc. POV. http://www.pbs.org/pov/flagwars/whatisgentrification/
3. Donnley, Frank. March 31st, 2016. New York City Data. Baruch College. http://guides.newman.baruch.cuny.edu/c.php?g=188226&p=1243123
4. Huang, Roger. 2016. Do Rich People Take More Taxis? Springboard. https://www.springboard.com/blog/dorichpeopletakemoretaxis/
5. Donovan, Brian and Work, Daniel B. 2014. New York City Taxi Trip Data (2010-2013). 1.0. University of Illinois at Urbana-Champaign. Dataset. http://dx.doi.org/10.13012/J8PN93H8
6. NYC Department of City Planning. March 2016. NYC PLUTO Data Dictionary (16v1). http://www1.nyc.gov/assets/planning/download/pdf/datamaps/opendata/pluto_datadictionary.pdf
7. NYC Developer Portal. NYC Geoclient API. https://developer.cityofnewyork.us/api/geoclientapi
8. Kaysen, Ronda. May 13th, 2016. Priced Out of a Childhood Home. New York Times. http://www.nytimes.com/2016/05/15/realestate/pricedoutofmychildhoodhome.html
Individual Project Member Contributions
Waqid Volli: Waqid was primarily responsible for implementing the dashboard visualizations.
Pedro Lambert: Pedro was primarily responsible for implementing the Hadoop taxi trip spatial inference and data aggregation.
Ben Jakubowski: Ben was primarily responsible for the local data processing tasks (permits and property sales) and merging the three datasets after processing.
All: All group members collaborated on each component and contributed to high-level project design.