Analyzing Concert Data to Predict Ticket Price Markups
Evan Paul
April 2016
https://github.com/epsilon670/predicting_ticket_markups
Background
● Buyers of concert tickets are able to re-sell them on StubHub.com, often at a markup compared to face value
● The price one is willing to pay for a ticket on StubHub is influenced by many factors
  ○ Is the show sold out?
  ○ How popular is the artist?
  ○ How soon is the show?
● Can we use these features to predict the price markup of a concert ticket on StubHub?
Hypothesis
Variables such as the number of days until a show, whether the show is sold out or not, and artist popularity can be used to predict the price markup of concert tickets on StubHub.com.
Data
● Data was gathered from 3 primary sources:
  ○ StubHub.com API
    ■ Event details and ticket prices
  ○ Webpage scrapes of SongKick.com
    ■ Ticket face values and whether shows were sold out or not
  ○ EchoNest.com API
    ■ Artist metadata and popularity data
Data Sources
StubHub API Data
● Artist
● Date of show
● # of days until show (from 3/13/16)
● Lowest available StubHub ticket price
● Venue name
● City
Data Gathered from Scraping SongKick.com
● Ticket Vendor
  ○ E.g., Ticketmaster, TicketFly, EventBrite, etc.
● Ticket Face Value
● Whether the show is sold out or not (as of 3/13/16)
Artist Data from EchoNest
● Artist “Familiarity” score
  ○ Measures how well known an artist is (continuous values between 0 and 1)
● Artist “Discovery” score
  ○ Measures the current “discovery” level of an artist (continuous values between 0 and 1)
  ○ I.e., an artist who is relatively unknown but is currently getting many plays gets a high score
● Artist “hotttnesss” score
  ○ Measures how much people are currently sharing an artist (continuous values between 0 and 1)
● Number of blogs published recently about artist
● Number of news articles published recently
● Number of reviews published recently
● How many years an artist has been active
Data Collection
● Collected data for concerts from 16 metropolitan areas in the USA
● Resulted in 3,126 concerts total
● All data was collected on March 13th, 2016
Data Limitations and Challenges
● SongKick did not have complete data for every show
  ○ 3,126 events total
  ○ SongKick webpages only had valid ticket info for 1,436 of them
● Some shows were not marked as “sold out” on SongKick even though they were actually sold out
  ○ Sold-out shows are thus underrepresented in the sold_out feature of our data
Data Challenges
● StubHub also had some major outliers
Cleaned Data
● After removing outliers and bad data, we were left with 1,192 valid concerts with the following markup characteristics:
● Mean ticket markup: $40.87
● Standard deviation: $20.70
● Min markup: $7.26
  ○ Charlie Puth @ Theatre of Living Arts, Philadelphia, PA
● Max markup: $104.90
  ○ Robert Plant @ The Moody Theater, Austin, TX
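As a rough illustration of how the markup figures in this deck are derived, the sketch below subtracts face value from the lowest StubHub price. The column names are hypothetical (the deck does not show its schema), and the two rows simply reuse the Avett Brothers numbers quoted in the sample predictions.

```python
import pandas as pd

# Hypothetical column names; the rows reuse the Avett Brothers figures
# quoted in this deck's sample predictions.
concerts = pd.DataFrame({
    "artist": ["Avett Brothers", "Avett Brothers"],
    "face_value": [45.00, 45.00],
    "min_stubhub_price": [82.74, 120.50],
})

# Markup = lowest StubHub listing minus the ticket's face value.
concerts["markup"] = (concerts["min_stubhub_price"] - concerts["face_value"]).round(2)

print(concerts[["artist", "markup"]])

# The same summary statistics the slide reports, computed on this toy frame.
print(concerts["markup"].agg(["mean", "std", "min", "max"]))
```

On the full cleaned dataset, the same `agg` call would presumably yield the mean, standard deviation, min, and max listed above.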
Let’s try to predict ticket markup
Features
● Used the following concert features to attempt to predict ticket markup:
  ○ 'face_value' - original ticket price (in USD)
  ○ 'sold_out' - 1 if show was sold out, 0 if not sold out
  ○ 'days_to_show' - integer for # of days from data collection date (3/13/16) to concert
  ○ 'num_blogs' - integer for # of blog posts about artist recently
  ○ 'num_news' - integer for # of news articles written about artist recently
  ○ 'num_reviews' - integer for # of reviews written about artist recently
  ○ 'discovery' - EchoNest discovery score between 0 and 1
  ○ 'familiarity' - EchoNest familiarity score between 0 and 1
  ○ 'hotttnesss' - EchoNest “hotttnesss” score between 0 and 1
  ○ 'num_years_active' - integer for # of years an artist has been active
Sample Feature Data Frame
Random Forest Regressor
● Used RandomForestRegressor from sklearn.ensemble
● Split data into training set (66%) and test set (33%)
● Tuned model and found best results with max_features of 8 and 5,000 trees
● Model with these parameters produced an MSE of ~386.65 when run on the test data
● The square root of this MSE (the RMSE) is about $19.66, so the model’s predictions on the test data were typically off by roughly $20
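A minimal sketch of the workflow on this slide, assuming scikit-learn and synthetic data in place of the real concert features (the target formula and row count are made up). It also shows that the $19.66 figure is the square root of the reported MSE. The deck's 5,000 trees are reduced to 200 here only to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the 10 concert features (NOT the real data).
n_rows, n_features = 400, 10
X = rng.random((n_rows, n_features))
# Made-up target so the example has something to learn.
y = 40 + 30 * X[:, 0] + 10 * X[:, 1] + rng.normal(0, 5, n_rows)

# Hold out roughly one third of the rows for testing, as on the slide.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# max_features=8 follows the deck; n_estimators is reduced from 5,000 for speed.
model = RandomForestRegressor(n_estimators=200, max_features=8, random_state=0)
model.fit(X_train, y_train)

mse = mean_squared_error(y_test, model.predict(X_test))
rmse = np.sqrt(mse)
print(f"MSE: {mse:.2f}, RMSE: {rmse:.2f}")

# The deck's reported MSE of ~386.65 corresponds to an RMSE of about $19.66.
deck_rmse = np.sqrt(386.65)
print(f"RMSE implied by the deck's MSE: {deck_rmse:.2f}")
```

RMSE is in the same units as the target (dollars), which is why it is easier to read than the raw MSE.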
Sample Predictions with RF Model
● Avett Brothers @ Chicago Theatre in Chicago, IL on 4/21/2016
  ○ Ticket face value: $45.00
  ○ Minimum StubHub price: $82.74
  ○ Actual markup: $37.74
  ○ Model’s predicted markup: $36.78 (off by $0.96 - pretty good!)
● Avett Brothers @ Chicago Theatre in Chicago, IL on 4/23/2016
  ○ Ticket face value: $45.00
  ○ Minimum StubHub price: $120.50
  ○ Actual markup: $75.50
  ○ Model’s predicted markup: $37.01 (off by $38.49 - eh…)
● Wait! These are 2 predictions for the same artist only 2 days apart! What gives?
One of the shows is sold out!
April 21st Show - not sold out (StubHub markup = $37.74)
April 23rd Show - sold out! (StubHub markup = $75.50)
But our data did not capture this...
...because SongKick does not have accurate sold_out status
Not marked as sold out in data...
Re-run Prediction with correct sold_out value
● Let’s try re-running our prediction algorithm with the correct sold_out value for the Avett Brothers’ April 23rd show
● Knowing the event was sold out, the RF model predicts a markup of $45.50
  ○ Originally predicted a markup of $37.01 with a sold_out value of 0
  ○ Actual StubHub markup: $75.50
  ○ Not an amazing improvement (error drops from $38.49 to $30.00), but still better
● Lack of correct sold_out values from SongKick may explain why sold_out was an insignificant feature in the prediction model
MSE was high with Random Forest Regressor. Can we do better?
Let’s try turning this into a classification problem...
● Random Forest didn’t allow us to predict ticket prices very precisely
● But maybe we can predict the range that a markup is in
● Let’s create buckets for different markup ranges:
  ○ Bucket 1: $0 - $25
    ■ 293 observations
  ○ Bucket 2: $25 - $37
    ■ 299 observations
  ○ Bucket 3: $37 - $52
    ■ 303 observations
  ○ Bucket 4: >$52
    ■ 297 observations
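The bucketing above can be sketched with pandas' `cut`. The bucket edges come from the slide; the sample markups are illustrative only, and the handling of boundary values (for example, which bucket a markup of exactly $25 falls into) is an assumption, since the deck does not say.

```python
import numpy as np
import pandas as pd

# Illustrative markups only; not the real 1,192-concert dataset.
markups = pd.Series([7.26, 24.99, 25.00, 36.78, 37.74, 51.00, 75.50, 104.90])

# Bucket edges from the slide. right=False makes each bucket include its
# lower edge, so a $25.00 markup lands in bucket 2 (an assumption).
edges = [0, 25, 37, 52, np.inf]
buckets = pd.cut(markups, bins=edges, labels=[1, 2, 3, 4], right=False)

# Count how many observations fall into each bucket.
print(buckets.value_counts().sort_index())
```

The four buckets on the slide hold 293, 299, 303, and 297 observations, so the edges appear to have been chosen close to the quartiles of the markup distribution.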
Random Forest Classifier
● RandomForestClassifier yielded best results with a max_features value of 4
  ○ Classifier made correct predictions ~38.3% of the time
● Let’s try Boosting
● Ideal parameters for GradientBoostingClassifier:
  ○ Learning rate: 0.05
  ○ Number of trees: 4,000
  ○ Max depth: 4
● This Boosting algorithm allowed us to predict the markup range for concerts in our test set with 41.6% accuracy
  ○ Better than nothing (guessing among four roughly equal buckets would be right about 25% of the time), but still not great
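To put the accuracy numbers in context, here is a hedged sketch that compares a GradientBoostingClassifier against a most-frequent-class baseline. The data is synthetic and the class construction is made up. The learning rate and max depth follow the slide, while the tree count is cut from 4,000 to 200 to keep the example quick.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features and four roughly equal classes (NOT the real data).
X = rng.random((600, 10))
signal = X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.3, 600)
y = np.digitize(signal, np.quantile(signal, [0.25, 0.5, 0.75]))  # labels 0..3

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# Baseline: always predict the most common bucket seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# learning_rate and max_depth follow the deck; n_estimators is reduced for speed.
gbc = GradientBoostingClassifier(
    learning_rate=0.05, n_estimators=200, max_depth=4, random_state=0
)
gbc.fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
gbc_acc = gbc.score(X_test, y_test)
print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"boosting accuracy: {gbc_acc:.3f}")
```

Reporting the baseline next to the model's accuracy makes it easier to judge how much signal the features actually carry.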
Can we interpret anything using the data?
Let’s Try Linear Regression
● Linear regression is sensitive to outliers, so let’s make sure our data isn’t too skewed
● Box plot of raw markup values:
● Let’s take the logs of the markup values
● After the log transform, the distribution looks much better
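The effect of the log transform can be checked numerically as well as with a box plot. The sketch below uses synthetic, right-skewed values (a log-normal sample, not the real markups) and a hand-written skewness helper, which is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic right-skewed "markups" (log-normal); NOT the real data.
markup = rng.lognormal(mean=3.5, sigma=0.5, size=1000)

def skewness(values):
    """Sample skewness: mean of cubed z-scores (near 0 for symmetric data)."""
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

# The log is only defined for positive values; every markup here is positive.
log_markup = np.log(markup)

raw_skew = skewness(markup)
log_skew = skewness(log_markup)
print(f"skewness before log: {raw_skew:.2f}")
print(f"skewness after log:  {log_skew:.2f}")
```

In the cleaned dataset the minimum markup is $7.26, so taking logs is well defined for every concert.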
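Lasso Regression to Find Best Variables
● First scaled the data
● Then found the ideal alpha for Lasso: a = -3 (presumably an exponent on a log scale, since a Lasso penalty cannot be negative)
● Then checked the Lasso coefficients
● Then checked the correlation matrix
  ○ hotttnesss was highly correlated with every other variable except sold_out (correlation of -0.01)
● Decided to use hotttnesss and sold_out for regression, since the two are nearly uncorrelated with each other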
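The scale-then-Lasso steps can be sketched as follows, assuming scikit-learn's `StandardScaler` and `LassoCV`. The data is synthetic, and the alpha grid and 5-fold cross-validation are assumptions, because the deck does not say how the ideal alpha was searched for.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 10 features, only the first two drive the target.
X = rng.random((500, 10))
y = 3.0 + 0.3 * X[:, 0] + 0.25 * X[:, 1] + rng.normal(0, 0.1, 500)

# Scale first so the L1 penalty treats every feature on the same footing.
X_scaled = StandardScaler().fit_transform(X)

# Search alphas on a log10 grid from 1e-5 to 1e1 with 5-fold cross-validation.
alphas = np.logspace(-5, 1, 25)
lasso = LassoCV(alphas=alphas, cv=5, random_state=0).fit(X_scaled, y)

best_log10_alpha = np.log10(lasso.alpha_)
kept = np.flatnonzero(lasso.coef_)  # indices of features with non-zero weight

print(f"best alpha: {lasso.alpha_:.5f} (log10 = {best_log10_alpha:.2f})")
print(f"features kept: {kept.tolist()}")
print(np.round(np.corrcoef(X, rowvar=False)[:3, :3], 2))  # corner of correlation matrix
```

Features whose coefficients shrink to exactly zero are the ones Lasso considers dispensable at the chosen alpha.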
Running Linear Regression
● Used Linear Regression on hotttnesss and sold_out values to predict the logarithm of ticket price markups
● Used the statsmodels Python package to get p-values, R^2, and coefficients:
● R^2 is low (~0.01)
● But the coefficient p-values are significant!
  ○ 0.046 and 0.001
● The model explains only about 1% of the variance in log markup, but both coefficients are statistically significant at the 5% level
Interpreting Linear Regression Results
● Hotttnesss coefficient is 0.3267
  ○ “hotttnesss” = how much people are currently talking about/sharing artist online
● Sold_out coefficient is 0.2531
● Interpretation: holding all other variables fixed...
  ○ For every increase of 0.1 in EchoNest’s hotttnesss metric, the StubHub ticket price markup increases by ~3.3%*
  ○ If a show sells out, the StubHub ticket price markup increases by ~25%*
*The predicted values were the logarithms of ticket markups, so we interpret coefficients as approximate % increases rather than absolute increases. The approximation is loosest for the larger sold_out coefficient: assuming natural logs, exp(0.2531) - 1 is closer to 29%.
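The percentage interpretation can be worked through directly from the two reported coefficients. This sketch assumes the regression used natural logs (the deck only says "logarithm"), in which case a change of d in a feature multiplies the markup by exp(coefficient * d).

```python
import math

# Coefficients reported in the deck.
hotttnesss_coef = 0.3267
sold_out_coef = 0.2531

# Assuming natural logs: a change of d in a feature multiplies the markup
# by exp(coef * d), so the percentage change is (exp(coef * d) - 1) * 100.
hot_pct = (math.exp(hotttnesss_coef * 0.1) - 1) * 100  # hotttnesss up by 0.1
sold_pct = (math.exp(sold_out_coef * 1) - 1) * 100     # sold_out from 0 to 1

print(f"+0.1 hotttnesss: {hot_pct:.1f}%")
print(f"sold out:        {sold_pct:.1f}%")
```

For small coefficients, reading the coefficient itself as a percentage is a close approximation (3.3% either way); for the larger sold_out coefficient the exact figure is noticeably higher than 25%.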
Limitations of this Analysis
Data Limitations
● SongKick did not always give us correct sold_out values
  ○ Only had ~80 out of 1,200 shows marked as “sold out”
  ○ Impact of a show being sold out is likely underestimated in models from this dataset
● Only looked at minimum StubHub ticket price to compute markup
  ○ Future studies might look at differing price levels - e.g., VIP sections vs. GA
● Data came from 16 U.S. metros, so conclusions are limited to concerts in those cities
  ○ Future studies might look at wider concert data across additional geos
Model Limitations
● Interpretation from Linear Regression is based on the assumption that the data is linear
  ○ This may not be true - low R^2 value suggests that the linear model doesn’t capture much variability
● Did not include some variables that may explain additional variability
  ○ Metro area for concert
  ○ Day of the week of show (e.g., weekday vs. weekends)