
Page 1: Exploring Gentrification in NYC - Waqidwaqid.com/bigdata-report.pdfGentrification is an often discussed phenomenon in New York City and other large urban cities. In the mid 20 th century,

Exploring Gentrification in NYC Using taxi trips, property sales, and permits as indicators

Ben Jakubowski (buj201), Waqid Munawar Volli (wmv214), and Pedro Lambert (pl1529)

Final Project Report

Abstract

Gentrification is an often-discussed phenomenon in New York City and other large cities. In the mid-20th century, several factors (including white flight, development of the interstate highway system, and invention of the modern suburb) reduced overall demand for residential property in urban centers. However, over the past several decades this trend has reversed, and demand for residential property in cities has dramatically increased. This in turn has caused property values and rents to rise and neighborhood compositions to change; in short, it has caused many urban neighborhoods to “gentrify”. Because gentrification can displace lower-income city residents and disrupt or destroy communities, it has elicited policy responses (for example, a recent significant zoning change in NYC). Given (i) the significant impacts of gentrification on individuals and communities, (ii) the need for appropriate policy responses, and (iii) the general public interest in gentrification as a social phenomenon, we aimed to apply data science and big data techniques and technologies to create a dashboard for exploring various potential indicators of gentrification in NYC over the years 2010-2013.

Github Repos:
- Hadoop spatial inference: https://github.com/p-lambert/ds-1004
- Local Data Processing and Merging: https://github.com/buj201/ds1004finalproject
- Data visualization and Exploration: https://github.com/waqidvolli/BigDataProject

Project report: https://docs.google.com/a/nyu.edu/document/d/1WqwR18j0mbpJvM-Bvb2JIQrldIVJ1Mms1V_McsXAjOc/edit?usp=sharing

Table of Contents:
- Abstract
- Introduction
- Experimental Techniques and Methods
- Results and Discussion
- References
- Individual Project Member Contributions


Introduction

For this project, we developed an interactive dashboard that allows users to explore gentrification in NYC using taxi trips, property sales, and permits as indicators. Gentrification is a phenomenon of increasing public interest in NYC and across the country. As seen in figure 1, Google News searches for ‘gentrification’ from the US have increased markedly since 2008.

Locally, gentrification-related housing issues have also been the target of policy interventions, most recently changes to the zoning code passed by the City Council on March 22nd, 2016 [1]. While gentrification is often discussed, it is a complex phenomenon and is sometimes inconsistently or ambiguously defined. We take as our definition “the arrival of wealthier people in an existing urban district, a related increase in rents and property values, and changes in the district’s character and culture” [2]. Note that this definition perhaps masks the human costs of gentrification: ‘changes in the district’s character and culture’ often means that long-time residents have been priced out (due to rising rents and property values) and forced to move to less-expensive neighborhoods, displacing (or disrupting or destroying) existing communities. While this definition has its shortcomings, it frames our project objective and methods well, since several components of our project map directly onto it:

Component: “Existing urban district”
Reflection in project: We chose to use Neighborhood Tabulation Areas (NTAs) as our existing urban district. The NYC Department of City Planning created NTAs by “aggregating census tracts into 195 neighborhood-like areas” [3]. Figure 2 shows the 195 NTA boundaries. This is an appropriate unit of analysis because gentrification is experienced at the neighborhood level, and NTAs are intermediate between the overly-coarse Community District and the overly-granular Census Tract.

Component: “Arrival of wealthier people”
Reflection in project: We measure the arrival of wealthier people to a neighborhood using the number of taxi pick-ups and drop-offs in that neighborhood. This assumes individual wealth positively correlates with the number of taxi trips taken. This assumption finds some support (for example, there is some evidence of a positive correlation between census tract per-capita income and the number of taxi drop-offs in that tract) [4].

Component: “Increase in rents and property values”
Reflection in project: To determine the change in property values at the NTA level, we computed the median per-unit residential sale price in each neighborhood.

Component: “Change in district’s character and culture”
Reflection in project: This is perhaps the most difficult component of gentrification to assess using data science techniques. Our proxy for this component is the number of permit applications from each NTA for a number of different permit categories. However, it is worth noting the limitations of an analytic approach to the study of gentrification: change in neighborhood culture is often more meaningfully described through ground-level reporting (for example, interviewing longtime neighborhood residents, as in [8]).

Table 1: Connection between operational definition of gentrification and project components

Importantly, our project required the use of big data infrastructure, since our measure of the arrival of wealthier people was the number of taxi trips with pick-ups and drop-offs in each neighborhood. The other feature sets (property sales and permit applications) did not require big data infrastructure, but our exploration of gentrification is improved because we used big data tools: since the combined size of the taxi trip data from 2010-2013 is 28.6 GB, we would not have been able to include taxi trip features in our dashboard without big data methods.

Experimental Techniques and Methods

Raw data sets

The raw data sets used to generate NTA-level features in our study were:

Dataset: Taxi Trips [5]
Source: https://uofi.app.box.com/v/nyctaxidata
Size: 28.6 GB
Description: NYC Taxi Data for years 2010-2013 (697,622,444 trips), obtained through FOIL request
Features used:
- Pick-up latitude/longitude and date/time
- Drop-off latitude/longitude and date/time

Dataset: Property Sales
Source: https://www1.nyc.gov/site/finance/taxes/property-annualized-sales-update.page
Size: 94.6 MB
Description: All property sales in NYC for years 2010-2013 (~2,949,000 sales, including all categories, not just residential properties)
Features used:
- Borough/block/lot numbers (for merge)
- Address (for NYC Geoclient API query)
- Number of residential units in property
- Number of commercial units in property
- Sale price and date

Dataset: Multi-agency permits
Source: https://data.cityofnewyork.us/City-Government/Multi-Agency-Permits/xfyi-uyt5
Size: 1.04 GB
Description: Permits data from two different data sources, DOB Jobs Permits and DOHMH Permits (4,823,781 permits)
Features used:
- Permit type and date
- Borough/block/lot numbers (for merge)

Table 2: Datasets used for feature generation

As mentioned in the introduction, big data infrastructure was used to process the taxi trips, while the other datasets did not require it due to their smaller sizes. In addition to the datasets that provided features related to gentrification, we also used a number of supplementary datasets to support NTA identification:

Dataset: NYC PLUTO
Source: http://www1.nyc.gov/site/planning/data-maps/open-data.page
Size: 432.9 MB
Description: Property information at the tax lot level. Used to map properties to NTAs, using tax lot as an identifier

Dataset: 2010 Census Tract to 2010 NTA Equivalency
Source: http://www1.nyc.gov/site/planning/data-maps/open-data/dwn-nynta.page
Size: 105 KB
Description: Map from census tract to NTA

Dataset: NTA GeoJSON
Source: https://data.cityofnewyork.us/City-Government/Neighborhood-Tabulation-Areas/cpf4-rkhq
Size: 4.2 MB
Description: GeoJSON used for NTA spatial inference on taxi trip pick-ups and drop-offs

Table 3: Supplemental datasets used for NTA inference

Data Manipulation: Creating our Tidy Dataset

Using these raw data, we constructed a tidy dataset multi-indexed by NTA and year-month. Again, we chose to use NTA as the geographic unit of analysis since gentrification is primarily experienced at the neighborhood level. We chose to use month as the temporal unit of analysis, since gentrification is a slow process and aggregating counts per month achieved some smoothing of low-frequency event counts (such as counts of certain types of permit, and number of residential property sales).
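As a rough illustration of this layout, the sketch below merges per-source tables into one table keyed by (NTA, year-month), filling absent values with 0 as the report does for its final dataset. The function name, feature names, NTA codes, and values are illustrative assumptions, not taken from the project code.

```python
def build_tidy(feature_tables):
    """Merge per-source tables into one table keyed by (nta, year_month).

    feature_tables maps a feature name to {(nta, year_month): value}.
    A feature with no value for a key is filled with 0.
    """
    keys = set()
    for table in feature_tables.values():
        keys.update(table)
    tidy = {}
    for key in sorted(keys):
        tidy[key] = {name: table.get(key, 0)
                     for name, table in feature_tables.items()}
    return tidy


# Illustrative values only (not real counts).
taxi_pickups = {("BK73", "2010-01"): 1200, ("MN17", "2010-01"): 98000}
new_building_permits = {("BK73", "2010-01"): 3}

tidy = build_tidy({
    "taxi_pickups": taxi_pickups,
    "new_building_permits": new_building_permits,
})
print(tidy[("MN17", "2010-01")])
```

Keying every row by the same (NTA, year-month) pair is what lets the taxi, permit, and sales features be compared side by side in the dashboard.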

Final tidy dataset (features used in dashboard)

Taxi features:
- Total number of taxi pick-ups and drop-offs

Permit features (total number of permits for):
- Retail food process
- Plumbing
- Building alteration
- Child care application tracking system
- Full term mobile food vending unit
- Seasonal mobile food vending unit
- Mobile food unit
- Foundation
- Building equipment
- Physician
- New building
- Building Sign
- Food service establishment
- Building equipment work

Residential property sale features:
- Median residential unit sale price
- Total number of sales of strictly residential properties

Table 4: Feature sets in final, tidy dataset

Generating these features required substantial feature engineering. We proceed to describe the specific feature engineering approach taken for each dataset. For the taxi data, the total number of pick-ups and drop-offs was obtained for each NTA, for each year-month between 2010-01 and 2013-12, as follows:

1. First, the data was obtained from the University of Illinois repository New York City Taxi Trip Data (2010-2013) [5].

2. The pickup and dropoff coordinates and datetimes were retained, while other features were dropped. This reduced dataset was loaded into an AWS S3 bucket.

3. An AWS EMR cluster was provisioned with
   a. --instance-count 5
   b. --instance-type m3.xlarge

4. A bootstrap action was used to install dependencies (shapely, RTree, and others; see the github repo for the script) on the cluster.

5. Next, NTA polygons were indexed using an RTree.

6. Streaming Hadoop was then used to (Map) tag each taxi trip with the drop-off NTA and pick-up NTA, and (Reduce) aggregate pickup and dropoff counts by NTA and year_month. Specifically, for each taxi trip:
   a. The mapper used the RTree index for efficient spatial inference of the pick-up and drop-off neighborhoods. It emitted two key/value pairs for each taxi trip:
      i. Key1: pick-up year_month and NTA code
      ii. Value1: 1, pick-up
      iii. Key2: drop-off year_month and NTA code
      iv. Value2: 1, drop-off
   b. The reducer aggregated the counts by year_month and NTA code, emitting two key/value pairs:
      i. Key1 and key2: year_month and NTA code
      ii. Value1: total number of taxi pick-ups
      iii. Value2: total number of taxi drop-offs
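The map and reduce steps above can be sketched in plain Python. This is a simplified stand-in under stated assumptions, not the project's script: a bounding-box pre-filter plus a ray-casting test replaces the RTree index and Shapely point-in-polygon test, the record field names are hypothetical, and the two unit squares stand in for NTA polygons.

```python
from collections import defaultdict


def bbox(polygon):
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def point_in_polygon(x, y, polygon):
    """Ray casting: count polygon edges crossed by a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_nta(lon, lat, ntas):
    """ntas: list of (code, bbox, polygon). Cheap bbox check first, exact test second."""
    for code, (min_x, min_y, max_x, max_y), polygon in ntas:
        if min_x <= lon <= max_x and min_y <= lat <= max_y:
            if point_in_polygon(lon, lat, polygon):
                return code
    return None


def mapper(trip, ntas):
    """Emit ((year_month, nta), (1, kind)) for the pick-up and drop-off of one trip."""
    for kind in ("pickup", "dropoff"):
        nta = find_nta(trip[kind + "_lon"], trip[kind + "_lat"], ntas)
        if nta is not None:
            year_month = trip[kind + "_datetime"][:7]  # "2010-01-15 ..." -> "2010-01"
            yield (year_month, nta), (1, kind)


def reducer(pairs):
    """Sum pick-up and drop-off counts per (year_month, nta) key."""
    counts = defaultdict(lambda: {"pickup": 0, "dropoff": 0})
    for key, (n, kind) in pairs:
        counts[key][kind] += n
    return dict(counts)


# Two adjacent unit squares standing in for NTA polygons (hypothetical codes).
square_a = [(0, 0), (1, 0), (1, 1), (0, 1)]
square_b = [(1, 0), (2, 0), (2, 1), (1, 1)]
ntas = [("A", bbox(square_a), square_a), ("B", bbox(square_b), square_b)]

trips = [{
    "pickup_lon": 0.5, "pickup_lat": 0.5, "pickup_datetime": "2010-01-15 08:30:00",
    "dropoff_lon": 1.5, "dropoff_lat": 0.5, "dropoff_datetime": "2010-01-15 08:45:00",
}]
counts = reducer(pair for trip in trips for pair in mapper(trip, ntas))
print(counts)
```

In a streaming Hadoop job the mapper and reducer would instead read lines from standard input and write tab-separated key/value lines to standard output; the in-memory version here only shows the keying and counting logic.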

For the permit application data, the following approach was taken to feature engineering:

1. First, the borough/block/lot (BBL), permit issuance date, and permit type description were retained from the multi-agency permits data, while the other features were dropped.

2. Next, BBL was used to join the permits data with the NYC PLUTO dataset. This allowed us to match each permit with the corresponding NTA.

3. Permits were then grouped by NTA and year_month, and the total number of permit applications of each type was determined. Note this produced counts for 109 different types of permits.

4. Our next challenge was to select an interesting subset of permit type counts for inclusion in the final dataset. We approached this problem as follows:

   a. Since our objective is to use permits data to compare how gentrification is differentially affecting NYC neighborhoods, we decided interesting permit features would have the largest between-neighborhood variance.

   b. Thus, we determined the average number of permits for each permit type, for each neighborhood (averaging over the 48 months in the dataset). We then ranked each permit type based on the between-neighborhood variance of these averages, and retained the 14 features with the highest between-neighborhood variance. Note the top 14 features were selected (as opposed to, say, the top 8) since the top 14 features included the most relevant building permit types (i.e. foundation, new building, equipment).

   c. Note this approach admitted a number of features that would likely have been dropped if we had selected features based on our intuition or the literature regarding gentrification. For example, we likely would not have included the physician permit counts if we had selected features based on our prior beliefs regarding features that differentiated NYC neighborhoods.
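The ranking in step 4 can be sketched as follows. This is a minimal illustration under assumptions: the function name and input layout are hypothetical, and population variance is used because the report does not say which variance estimator was applied (with the same set of neighborhoods for every permit type, the ranking is the same either way).

```python
from collections import defaultdict
from statistics import pvariance


def rank_permit_types(monthly_counts, n_months, top_k):
    """Rank permit types by between-neighborhood variance of monthly averages.

    monthly_counts maps (nta, permit_type, year_month) -> permit count.
    """
    totals = defaultdict(lambda: defaultdict(int))  # permit_type -> nta -> total
    ntas = set()
    for (nta, permit_type, _), count in monthly_counts.items():
        totals[permit_type][nta] += count
        ntas.add(nta)

    variances = {}
    for permit_type, by_nta in totals.items():
        # Neighborhoods with no permits of this type contribute an average of 0.
        averages = [by_nta.get(nta, 0) / n_months for nta in ntas]
        variances[permit_type] = pvariance(averages)
    return sorted(variances, key=variances.get, reverse=True)[:top_k]


# Illustrative counts for two neighborhoods over two months.
counts = {
    ("A", "plumbing", "2010-01"): 10, ("A", "plumbing", "2010-02"): 10,
    ("B", "plumbing", "2010-01"): 10, ("B", "plumbing", "2010-02"): 10,
    ("A", "new building", "2010-01"): 8,
}
top = rank_permit_types(counts, n_months=2, top_k=1)
print(top)
```

Here plumbing permits are common but identical across the two neighborhoods, so they carry no between-neighborhood variance, while new building permits are rarer but concentrated in one neighborhood and therefore rank first.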

For the sales data, the following approach was taken to feature engineering:

1. First, we chose to restrict our analysis to properties that met the following requirements:

   a. They contained at least one residential unit and exactly zero commercial units. This inclusion criterion allowed us to determine the per-unit residential sale price by simply taking (total sale price) / (number of residential units). If we had allowed buildings with commercial units into our dataset, we would not have had a reasonable method for dividing the sale price into the component attributable to residential property value and the component attributable to commercial property value.

   b. The sale price was at least $1000. This requirement was necessary due to the large number of properties with sale prices of $0, $1, or $100. While $1000 is a very low sale price, we intended this threshold simply to exclude the most obviously trivial sale prices (which we attribute to property transfers within families or other similar entities).

2. Next, we constructed the feature dollar_per_unit: the price per residential unit, defined as (total sale price) / (number of residential units).

3. Our next task was to tag each property sale with its containing NTA. Unlike taxi trips, property sales did not come with latitude and longitude coordinates; thus, we couldn’t use our MapReduce framework to tag sales by NTA. Instead, we used a two-step approach:

   a. First, we used the approach employed to tag the permit applications, namely using BBL to join the sales data with PLUTO (and ultimately merge in the NTA).

   b. While this approach was relatively successful, it missed an unacceptably large number of records, since

      “Each unit in a building that is a condominium is defined by the Department of Finance as a separate tax lot. To make condominium information more compatible with parcel information, the Department of City Planning aggregated condominium unit tax lot information [in PLUTO] so that each condominium complex within a tax block is represented by only one tax lot record.” [6]

      Due to this mismatch between the property sales records and PLUTO, we next queried the NYC Geoclient API [7] by property address to obtain the missed NTAs. Note this approach was somewhat slower (per record) than spatial inference using latitudes and longitudes, but the sales data did not include these features.

4. After tagging the records with an NTA using one of these two approaches (direct join with PLUTO, or through the Geoclient API), the sales data were grouped by year_month and NTA and two aggregate features were computed:

   a. Median per-residential-unit sales price
   b. Total number of residential properties (not units) sold
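The filtering and aggregation in steps 1, 2, and 4 can be sketched as below, assuming each sale has already been tagged with an NTA. The record field names and sample sales are hypothetical; the $1000 threshold and the dollar_per_unit definition come from the report.

```python
from collections import defaultdict
from statistics import median

MIN_SALE_PRICE = 1000  # threshold stated in the report


def aggregate_sales(sales):
    """Compute median per-unit price and sale count per (nta, year_month).

    Each sale is a dict with keys: nta, year_month, sale_price,
    residential_units, commercial_units (hypothetical field names).
    """
    prices = defaultdict(list)
    for sale in sales:
        # Keep strictly residential properties only.
        if sale["residential_units"] < 1 or sale["commercial_units"] != 0:
            continue
        # Drop trivial prices such as $0, $1, or $100.
        if sale["sale_price"] < MIN_SALE_PRICE:
            continue
        dollar_per_unit = sale["sale_price"] / sale["residential_units"]
        prices[(sale["nta"], sale["year_month"])].append(dollar_per_unit)
    return {
        key: {"median_dollar_per_unit": median(values), "n_sales": len(values)}
        for key, values in prices.items()
    }


# Illustrative sales in one neighborhood and month.
sales = [
    {"nta": "A", "year_month": "2010-01", "sale_price": 500000,
     "residential_units": 1, "commercial_units": 0},
    {"nta": "A", "year_month": "2010-01", "sale_price": 900000,
     "residential_units": 3, "commercial_units": 0},
    {"nta": "A", "year_month": "2010-01", "sale_price": 1,
     "residential_units": 1, "commercial_units": 0},   # trivial price, dropped
    {"nta": "A", "year_month": "2010-01", "sale_price": 2000000,
     "residential_units": 4, "commercial_units": 1},   # mixed use, dropped
]
features = aggregate_sales(sales)
print(features)
```

Note that n_sales counts properties rather than units, matching the second aggregate feature listed above.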

Finally, we merged these three data sets (taxi trips, permits, and residential property sales) by NTA and year_month to produce our final analytic dataset (described in table 4). Note that in our final dataset missing values were replaced with 0, since missing values implied 0 counts (for permit and taxi features).

Issues Encountered

Through the process, we encountered a number of issues. First, our objective was to explore gentrification at the spatial scale defined by Neighborhood Tabulation Areas, and on the temporal scale of months. Thus, our primary data processing problem was assigning each record to the appropriate NTA. We solved this problem in three ways:

First, we used MapReduce for fast spatial inference of taxi pick-up and drop-off neighborhoods. This approach was feasible since each pick-up and drop-off point was recorded with latitude/longitude coordinates.

In contrast, the permit dataset lacked geographic coordinates; however, it included BBL values, which supported a join with the comprehensive PLUTO property dataset.

While merging the permits dataset and PLUTO was successful, BBL values in the residential property sales dataset did not adequately map onto the PLUTO BBL attribute. Thus, we had to use the additional approach of tagging sales with NTA by querying the NYC Geoclient API by property address.
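The join-then-fallback tagging used for the sales data can be sketched as follows. This is an illustration under assumptions: `geocode_address` is a hypothetical callable standing in for a Geoclient API query (it is not the nyc_geoclient interface), and the BBL values, address, and NTA code are made up.

```python
def tag_with_nta(records, bbl_to_nta, geocode_address):
    """Tag each record with an NTA: BBL lookup first, address geocoding second.

    bbl_to_nta: dict built from a PLUTO-style table (BBL -> NTA code).
    geocode_address: hypothetical callable, address -> NTA code or None.
    """
    tagged = []
    for record in records:
        nta = bbl_to_nta.get(record["bbl"])
        source = "pluto"
        if nta is None:
            # Fallback for lots missing from the lookup, e.g. condominium unit lots.
            nta = geocode_address(record["address"])
            source = "geoclient" if nta is not None else "unmatched"
        tagged.append({**record, "nta": nta, "nta_source": source})
    return tagged


# Made-up lookup tables for illustration.
bbl_to_nta = {"3012340001": "BK73"}
fake_geocoder = {"10 Example St, Brooklyn": "BK73"}.get

records = [
    {"bbl": "3012340001", "address": "8 Example St, Brooklyn"},
    {"bbl": "3012341001", "address": "10 Example St, Brooklyn"},
    {"bbl": "3099990001", "address": "Unknown"},
]
tagged = tag_with_nta(records, bbl_to_nta, fake_geocoder)
print([r["nta_source"] for r in tagged])
```

Trying the local lookup first keeps the slower per-record API query for only those sales that the join misses.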

Detailed Experimental Setup

For our big data analysis (spatial inference over the taxi data using Hadoop), we used the following experimental setup:

- Hadoop configuration:
  - 5 nodes (1 master, 4 slave)

- Tools used
  - In addition to streaming Hadoop on AWS EMR, we also made use of a number of external libraries for spatial inference and visualization:
    - Shapely: used for point-in-polygon tests
    - RTree: used to accelerate our Hadoop geotagging through construction of an RTree spatial index over the NTA polygons
    - nyc_geoclient: Python binding for the NYC Geoclient API
    - CartoDB: used for visualization

- Performance of approach (running times of the scripts)
  - The sales and permit data were processed locally because they were smaller datasets (geotagging was primarily achieved through a join with PLUTO). Thus we do not report the running time of these scripts. The MapReduce jobs (over the 26 GB of taxi data) ran in approximately 5 hours per year of data (given the stated configuration with 5 nodes). Runtimes can be reduced through use of larger clusters.

- Optimizations to speed up code
  - We sped up the code through three main optimizations. First, we used an RTree over the NTA polygons to speed up spatial inference of drop-off and pick-up neighborhoods. Second, prior to beginning the MapReduce jobs, we reduced the data size by locally dropping all unnecessary features and retaining only those features required in our analysis. Finally, we sped up the code by using additional datasets (PLUTO and the 2010 Census Tract to 2010 NTA Equivalency) for inference of property sale and permit NTA. While we could have potentially tagged records in these datasets by NTA using spatial inference (potentially using Hadoop), it was more efficient to find the necessary supplemental datasets to achieve NTA tagging through a simple merge.

Results and Discussion

Again, our objective was to develop a tool that supports visual exploration of neighborhood-level changes in NYC over the years 2010-2013, using taxi trips, permits, and residential property sales as potential measures of gentrification. This tool is valuable to the public, since gentrification is a concern shared by many New Yorkers; it is also potentially valuable to policy makers as they design policy interventions to address the negative impacts of gentrification in and across New York communities. Given our objective of developing a tool to support user interaction with the reduced neighborhood-level data, we do not present conclusive results regarding particular changes observed over the years 2010-2013. Instead, we present example visuals (maps and charts) produced using our tool. First, figure 3 shows the user interface for our tool.

The tool allows the user to visualize the spatial distribution of residential property sales, permit applications, and taxi trips in a map (produced using CartoDB). Moreover, the user can also see the average rate of change for each feature over a 6-month window, with the user setting the 6-month time window using a slider. Finally, the user can interact with the map to drill down to one specific neighborhood, and see how that neighborhood has changed over time. Figure 4 presents a number of sample maps illustrating the tool’s mapping functionality.
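The report does not state how the average rate of change over a window is computed. One plausible reading, sketched here purely as an assumption, is the mean month-over-month difference across the selected 6-month window; the function name and sample values are illustrative.

```python
def average_rate_of_change(series, start, window=6):
    """Mean month-over-month change over `window` consecutive monthly values.

    series: monthly values for one feature in one neighborhood, in time order.
    start: index of the first month in the window (the slider position).
    """
    values = series[start:start + window]
    if len(values) < 2:
        raise ValueError("need at least two months to compute a rate of change")
    diffs = [later - earlier for earlier, later in zip(values, values[1:])]
    return sum(diffs) / len(diffs)


# Illustrative monthly taxi drop-off counts for one neighborhood.
dropoffs = [100, 110, 120, 130, 140, 150, 90]
rate = average_rate_of_change(dropoffs, start=0)
print(rate)
```

Under this definition the result equals (last value minus first value) divided by the number of monthly steps, so it summarizes the net trend across the window rather than month-to-month volatility.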

Figure 4.1: This map shows the number of taxi dropoffs and number of new building permits for a subset of neighborhoods.

Figure 4.2: This map shows the number of taxi pickups and number of new building permits for a larger view of New York City.

Figure 4: Sample maps generated by users using the dashboard

Figure 5 presents a number of sample charts at the specific neighborhood level. These charts show:

Figure 5.1: This chart shows the number of new building permits (selected from a dropdown menu), the median residential unit sale price, and the number of taxi trips for a user-specified neighborhood (selected from the map), for a user-specified 6-month window (selected with a slider).

Figure 5.2: This chart shows the average number of taxi trips for NYC (to provide a reference value), and the number per month for the specified neighborhood.

Figure 5.3: This chart shows the average dollars per residential unit for NYC (to provide a reference value), and the average dollars per residential unit for the specified neighborhood.

Figure 5.4: This chart shows the average number of permits of the selected type for NYC (to provide a reference value), and the number per month for the specified neighborhood.

Figure 5: Subcharts presented to the user in our dashboard

Again, these are intended as examples of the types of visuals that can be produced using our tool. The full functionality of the dashboard will be apparent through interactive demonstrations during the Monday, 05/16 presentation session.

Conclusion

Gentrification is a growing problem in New York City and across the country. As more people have moved to large urban areas, rents and property values have increased and the composition and culture of neighborhoods have changed. This has, in turn, driven many long-time residents out of these gentrifying neighborhoods (in large part because they are priced out). Given the impacts of gentrification, tools that allow a user to visualize this phenomenon through proxy measures including taxi trips, property sales, and permits are valuable to both the public and policy makers. Public users can see how their neighborhood changed over the years 2010-2013 relative to the rest of the city, and policy makers can use the dashboard to develop insight into the city and potentially design both city-wide and neighborhood-specific policy interventions.

References

1. Goodman, David J. March 22nd, 2016. New York Passes Rent Rules to Blunt Gentrification. New York Times. http://www.nytimes.com/2016/03/23/nyregion/new-york-council-passes-zoning-changes-de-blasio-sought.html

2. Grant, Benjamin. June 17th, 2003. What is Gentrification? American Documentary, Inc. POV. http://www.pbs.org/pov/flagwars/what-is-gentrification/

3. Donnelly, Frank. March 31st, 2016. New York City Data. Baruch College. http://guides.newman.baruch.cuny.edu/c.php?g=188226&p=1243123

4. Huang, Roger. 2016. Do Rich People Take More Taxis? Springboard. https://www.springboard.com/blog/do-rich-people-take-more-taxis/

5. Donovan, Brian and Work, Daniel B. 2014. New York City Taxi Trip Data (2010-2013). 1.0. University of Illinois at Urbana-Champaign. Dataset. http://dx.doi.org/10.13012/J8PN93H8

6. NYC Department of City Planning. March 2016. NYC PLUTO Data Dictionary (16v1). http://www1.nyc.gov/assets/planning/download/pdf/data-maps/open-data/pluto_datadictionary.pdf

7. NYC Developer Portal. NYC Geoclient API. https://developer.cityofnewyork.us/api/geoclient-api

8. Kaysen, Ronda. May 13th, 2016. Priced Out of a Childhood Home. New York Times. http://www.nytimes.com/2016/05/15/realestate/priced-out-of-my-childhood-home.html

Individual Project Member Contributions

Waqid Volli: Waqid was primarily responsible for implementing the dashboard visualizations.

Pedro Lambert: Pedro was primarily responsible for implementing the Hadoop taxi trip spatial inference and data aggregation.

Ben Jakubowski: Ben was primarily responsible for the local data processing tasks (permits and property sales) and merging the three datasets after processing.

All: All group members collaborated on each component, and contributed to high-level project design.