analyzing concert data to predict ticket price...

Analyzing Concert Data to Predict Ticket Price Markups

Evan PaulApril 2016

https://github.com/epsilon670/predicting_ticket_markups



Background● Buyers of concert tickets are able to re-sell them on StubHub.com, often

at a markup compared to face values

● The price one is willing to pay for a ticket on StubHub is influenced by many factors

○ Is the show sold out?○ How popular is the artist?○ How soon is the show?

● Can we use features to predict the price markup of a concert ticket on StubHub?

Hypothesis

Variables such as the number of days until a show, whether the show is sold out or not, and artist popularity can be used to predict the price markup of concert tickets on StubHub.com.

● Data was gathered from 3 primary sources:○ StubHub.com API

■ Event details and ticket prices○ Webpage scrapes of SongKick.com

■ Ticket Face values and whether shows were sold out or not○ EchoNest.com API

■ Artist metadata and popularity data

Data Sources

StubHub API Data

● Artist● Date of show● # of days until show (from 3/13/16)● Lowest available StubHub ticket price● Venue name● City

Data Gathered from Scraping SongKick.com● Ticket Vendor

○ E.g., Ticketmaster, TicketFly, EventBrite, etc.

● Ticket Face Value● Whether the show is sold out or not

(as of 3/13/16)

Artist Data from EchoNest● Artist “Familiarity” score

○ Measures how well known an artist is (cont. values between 0 and 1)

● Artist “Discovery” score○ Measures the current “discovery” level of an artist (cont. values between 0 and 1)○ I.e., artist who is relatively unknown but is currently getting many plays gets a high score

● Artist “hotttnesss” score○ Measures how much people are sharing an artist currently (cont. values between 0 and 1)

● Number of blogs published recently about artist● Number of news articles published recently● Number of reviews published recently● How many years an artist has been active

Data Collection● Collected data for concerts

from 16 metropolitan areas in USA

● Resulted in 3,126 concerts total

● All data was collected on March 13th, 2016

Data Limitations and Challenges

Data Limitations and Challenges● SongKick did not have complete

data for every show○ 3,126 events total

○ SongKick webpages only had valid ticket info for 1,436 of them

● Some shows were not marked as “sold out” on SongKick when they were actually sold out in reality

○ sold_out feature is thus underrepresented in our data

Data Challenges● StubHub also had some major outliers

Data Challenges● StubHub also had some major outliers ???

Cleaned Data● After removing outliers and bad data, we were left with 1,192 valid

concerts with the following markup characteristics:● Mean ticket markup: $40.87

● Standard Deviation: 20.7

● Min markup: $7.26○ Charlie Puth @ Theatre of Living Arts,

Philadelphia, PA

● Max markup: $104.90○ Robert Plant @ The Moody Theater, Austin, TX

Let’s try to predict ticket markup

Features● Used the following concert features to attempt to predict ticket markup:

○ 'face_value' - original ticket price (in USD)○ 'sold_out' - 1 if show was sold out, 0 if not sold out○ 'days_to_show' - integer for # of days from data collection date (3/13/16) to concert○ 'num_blogs' - integer for # of blog posts about artist recently○ 'num_news' - integer for # of news articles written about artist recently○ 'num_reviews' - integer for # of reviews written about artist recently○ 'discovery' - EchoNest discovery score between 0 and 1○ 'familiarity' - EchoNest familiarity score between 0 and 1○ 'hotttnesss' - EchoNest “hotttnesss” score between 0 and 1○ 'num_years_active' - integer for # of years an artist has been active

Sample Feature Data Frame

Random Forest Regressor● Used RandomForestRegressor from sklearn.

ensemble

● Split data into training set (66%) and test set (33%)

● Tuned model and found best results with 8 max_features and 5,000 trees

● Model with these parameters produced MSE of ~386.65 when run with test data

● This MSE means, the model’s prediction was off by about $19.66 on average when run on test data

Sample Predictions with RF Model● Avett Brothers @ Chicago Theatre in Chicago, IL on 4/21/2016

○ Ticket Face value: $45.00○ Minimum StubHub Price: $82.74○ Actual Markup: $37.74○ Model’s predicted markup: $36.78 (off by $0.96 - pretty good!)

● Avett Brothers @ Chicago Theatre in Chicago, IL on 4/23/2016○ Ticket Face value: $45.00○ Minimum StubHub Price: $120.50○ Actual Markup: $75.50○ Model’s predicted markup: $37.01 (off by $38.49 - eh…)

● Wait! These are 2 predictions for the same artist only 2 days apart! What gives?

One of the shows is sold out!April 21st Show - not sold out(StubHub markup=$37.74)

April 23rd Show - sold out!(StubHub markup=$75.50)

But our data did not capture this...

...because SongKick does not have accurate sold_out status

Not marked as sold out in data...

Re-run Prediction with correct sold_out value● Let’s try re-running our prediction algorithm with the correct sold_out

value for the Avett Bros’ April 23rd show

● Knowing event was sold out, RF model predicts a markup of $45.50○ Originally predicted a markup of $37.01 with 0 sold_out value○ Actual StubHub markup: $75.50○ Not an amazing improvement, but still better

● Lack of correct sold_out values from SongKick may explain why sold_out was an insignificant feature in prediction model

MSE was high with Random Forest Regressor. Can we do better?

Let’s try turning this into a classification problem...● Random Forest didn’t allow us to predict ticket prices very precisely

● But maybe we can predict the range that a markup is in

● Let’s create buckets for different markup ranges:

○ Bucket 1: $0 - $25 ■ 293 observations

○ Bucket 2: $25 - $37■ 299 observations

○ Bucket 3: $37-$52■ 303 observations

○ Bucket 4: >$52■ 297 observations

Random Forest Classifier● RandomForestClassifier yielded best results with max_features value of 4

○ Classifier made correct predictions ~38.3% of the time

● Let’s try Boosting

● Ideal parameters for GradientBoostingClassifier:○ Learning rate: 0.05○ Number of trees: 4,000○ Max depth: 4

● This Boosting algorithm allowed us to predict the markup range for concerts in our test set with 41.6% accuracy

○ Better than nothing, but still not great

Can we interpret anything using the data?

Let’s Try Linear Regression● Linear regression is prone to outliers, so let’s make sure our data isn’t too

skewed● Box plot of raw markup values:

● Let’s take the logs of our data

Let’s Try Linear Regression● Looks much better

Lasso Regression to Find Best Variables● First scaled the data

● Then found ideal alpha for Lasso: a = -3

● Then checked the Lasso coefficients

● Then checked correlation matrix

○ hotttnesss was highly correlated with all variables except sold_out (-0.01)

● Decided to use hotttnesss and sold_out for regression

Running Linear Regression● Used Linear Regression on hotttnesss and sold_out values to predict the

logarithm of ticket price markups

● Used the Statsmodel python package to get p-values, R^2, and coefficients:

● R^2 is low (~0.01)● But coefficient P-values are

significant!○ 0.046 and 0.001

● Model may not capture much variability, but results are significant

Interpreting Linear Regression Results● Hotttnesss coefficient is 0.3267

○ “hotttnesss” = how much people are currently talking about/sharing artist online

● Sold_out coefficient is 0.2531● Interpretation: holding all other variables fixed...

○ For every increase of 0.1 in EchoNest’s hotttness metric, the StubHub ticket price markup increases by ~3.3%*

○ If a show sells out, the StubHub ticket price markup increases by ~25%*

*The prediction values were the logarithms of ticket markups, so we interpret coefficients as % increases rather than absolute increases

Limitations of this Analysis

Data Limitations● SongKick did not always give us correct sold_out values

○ Only had ~80 out of 1,200 shows marked as “sold out”○ Impact of a show being sold out is likely underestimated in models from this dataset

● Only looked at minimum StubHub ticket price to compute markup○ Future studies might look at differing price levels - e.g., VIP sections vs. GA

● Data came from 16 U.S. metros, so conclusions are limited to concerts in those cities

○ Future studies might look at wider concert data across additional geos

Model Limitations● Interpretation from Linear Regression is based on the assumption that

the data is linear○ This may not be true - low R^2 value suggests that linear model doesn’t capture much

variability

● Did not include some variables that may explain additional variability○ Metro area for concert○ Day of the week of show (e.g., weekday vs. weekends)

analyzing concert data to predict ticket price...

Documents