
MIS 6324 Business Analytics with SAS

Dating Application

Group 3

Vaibhav Pande, Mary Gramer,Tejasvi Ramdas Sagar, Ritesh KP,Foram Gohil

11/27/2016


Executive Summary

Young adults in the twenty-first century are among the busiest and most technology-tethered generations. When they are not juggling school, careers, or hobbies, many of them are glued to their smartphones surfing the web, too preoccupied to meet new people. In this cultural environment, it is difficult for young, single adults to find potential dating partners.

Using data from twenty-one speed dating events to create a new dating app, we can connect two individuals based on their interests and preferences, thus expediting the dating process. The app will direct the user to rate other users’ profiles based not only on the other user’s image, but also on how much he or she likes the other user based on their profile information. The profiles will include demographic information, shared interests, and other attributes such as fun factor and attractiveness. After evaluating each user’s preferences and ratings, the app will suggest partners who have similar interests and matching preferences.

After comparing the accuracies and true positive rates of various models created using SAS Enterprise Miner, we selected a decision tree to predict the target variable. The data was first processed with a replacement node in SAS. Our model can predict whether or not an individual will be interested in dating another person based on their attributes and interests with an 80.5% true positive rate and 81.1% accuracy.

Project Motivation

The current popular mobile applications for meeting other singles, such as Tinder or Bumble, do not consider a person’s preferences or personality; the only deciding factor in whether two people converse is their pictures. This inefficient system causes singles to waste their time messaging people who do not share any of their interests. After spending perhaps hours chatting, two people may realize that they are not interested in going on a face-to-face date with their ‘match.’ Using the speed dating data, we can create a superior dating app for young adults.

Description of Data

The dataset includes observations from twenty-one speed dating events (also called waves) in which each person was paired with five to twenty-two partners of the opposite gender for four minutes each. Before, during, and after the event, participants were asked to rate multiple characteristics about themselves and about each of the partners they met. Every participant identified which attributes in a partner are most important to them, rated each partner they met on these same attributes (called the ‘scorecard’ for each member), and indicated whether they would like to go on a second date with the partner.


The scorecard given to each participant after the date is as follows:

SCORECARD

YOUR ID NUMBER:

Circle “Yes” or “No” below the ID number of each person you meet to indicate whether you would like to see him or her again. Rate their attributes on a scale of 1-10: (1=awful, 10=great). If you haven’t formed an opinion based on your conversation, fill in N/A, but please fill in all boxes. This will be TOTALLY confidential and will NOT be shared with anyone. Then, answer the remaining questions for each person you meet.

ID #: 1 2 3 4 5 6 7 8 9 10 (one answer column per person met)

Item                                          Variable   Response
Decision                                                 yes / no (1=yes, 0=no)
Attractive                                    attr       1-10 (1=awful, 10=great)
Sincere                                       sinc       1-10 (1=awful, 10=great)
Intelligent                                   intel      1-10 (1=awful, 10=great)
Fun                                           fun        1-10 (1=awful, 10=great)
Ambitious                                     amb        1-10 (1=awful, 10=great)
Shared Interests/Hobbies                      shar       1-10 (1=awful, 10=great)
Overall, how much do you like this person?    like       1-10 (1=don't like at all, 10=like a lot)
How probable do you think it is that this
person will say 'yes' for you?                prob       1-10 (1=not probable, 10=extremely probable)
Have you met this person before?              met        yes / no (1=yes, 2=no)

In the data set, each observation represents a meeting between a participant and a partner. The observation includes all the information collected about the participant, including demographics, preferences, how they scored their partner, how their partner scored them, and whether both people agreed to go on a second date.

Prior to modeling the data, there were many discrepancies and non-uniformities among the variables to be reconciled. For four of the speed dating events (numbers six to nine), the participants rated their preference for each of the six attributes on a scale of 1-10. For the remaining events, participants expressed their preference by allocating 100 points among the same six attributes. To make these variables consistent across waves, the ratings from speed dating events six to nine have been scaled so that they sum to 100 points.

We used the following formula to scale the data for waves 6-9:

$$\text{Rating}_{\text{scaled}} = \frac{100}{\sum \text{Attribute Ratings}} \times \text{Rating}_{\text{original}}$$
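As a rough illustration of the scaling formula above, the following Python sketch rescales one participant's six attribute ratings so that they sum to 100 points. The function name `scale_to_100` and the sample ratings are assumptions made for this example; they do not come from the dataset.

```python
def scale_to_100(ratings):
    """Rescale one participant's attribute ratings so they sum to 100 points."""
    total = sum(ratings.values())
    if total == 0:
        raise ValueError("ratings must not all be zero")
    # Rating_scaled = (100 / sum of attribute ratings) * Rating_original
    return {name: 100 * value / total for name, value in ratings.items()}


# Hypothetical 1-10 ratings from one participant in waves 6-9.
original = {"attr": 8, "sinc": 6, "intel": 7, "fun": 9, "amb": 5, "shar": 5}
scaled = scale_to_100(original)

print(scaled["attr"])        # 20.0
print(sum(scaled.values()))  # 100.0
```

After scaling, each value can be read the same way as the 100-point allocations used in the other waves.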

The target variable we have selected is decision (dec). In the app we are developing, we are primarily concerned with making the right recommendations for a person.

We rejected the following binary attributes from the data:
Match (when both people agree to go on a second date)
dec_o (decision of the partner to go on a second date)
Num_in_3 (how many of your matches have you been on a date with so far)

Match and dec_o were rejected because Match is simply the combination of the participant’s decision and the partner’s decision (dec_o), so together they largely reveal our target variable, decision. If we kept these two variables, the model would predict the target variable (decision to go on a second date) with close to 100% accuracy. We rejected Num_in_3 because more than 90% of its observations were missing.
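The leakage concern can be made concrete with a small Python sketch. It assumes, as the variable descriptions above state, that a match is recorded only when both people say yes. The helper names `is_match` and `recover_dec` are hypothetical.

```python
from itertools import product


def is_match(dec, dec_o):
    """A match is recorded only when both participant and partner say yes."""
    return int(dec == 1 and dec_o == 1)


def recover_dec(match_value, dec_o):
    """Try to read the participant's decision off the two rejected variables."""
    if dec_o == 1:
        return match_value  # partner said yes, so match equals dec
    return None             # partner said no: match is 0 whatever dec was


for dec, dec_o in product([0, 1], repeat=2):
    print(dec, dec_o, recover_dec(is_match(dec, dec_o), dec_o))
```

Whenever the partner said yes, the target can be read directly from Match, so a model given these inputs would be learning the answer rather than the participant's preferences.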

After processing the dataset, we explored the observations to gain a better understanding of the data. Interesting aspects include:

Overall Match Rate: 16.5%
Individual ‘Yes’ Rate: 42%
Age Range of Participants: 18-55 (Mean: 26.3, St. Deviation: 3.6, Skewness: 1.07)

Using interactive decision trees in SAS, we chose several initial nodes to split the data on, and then let SAS decide how to split the tree into subsequent branches. This method shows how the target variable varies among participants of different genders, races, and ages, and by the season in which the event was held. Results are shown below.

Gender:


Note: ‘0’ represents female, ‘1’ represents male. As the tree shows, females are more conservative in whom they choose to go on a second date with. On average, women said ‘yes’ to only 37.4% of males, while men said ‘yes’ to 46.57% of females.

Race:

The decision rate varies among races. Black/African American participants said ‘yes’ to 51.2% of partners, while European/Caucasian participants said ‘yes’ to 38.79% of partners. The percentages for the other races lie somewhere in between.

Age:


SAS Enterprise Miner split the tree into two age ranges, under 38.5 and over 38.5 years old. For participants under 38.5 years old, like was the most important attribute when deciding whether they want to go on a second date. However, for participants over 38.5 years old, the most important factor was how ‘fun’ they found their partner to be.

Season:

We found a slight difference in the outcome of the decision variable when we chose to split the decision tree based on the season in which the speed dating event was held. The tree shows that people are more likely to say ‘yes’ to any given date if the speed dating event is held in winter.

These nuances in the data help us understand how the decision variable is affected by a user’s demographics.

In the dataset, the binary target variable ‘Decision’ is ‘yes’ in 41.99 percent of observations. If we take a simple model that predicts every ‘Decision’ as ‘no’, its misclassification rate would be 41.99 percent.

BI Model:


We partitioned the data into training (70%), validation (20%), and test (10%) sets. We ran all the classifiers with different sampling techniques, including simple random and stratified sampling, and got the best results using stratified sampling. For observations with missing values for certain variables, we used the replacement node to replace the missing class variables with a dot so that SAS recognizes them as missing. After data pre-processing, we ran the following models:

Regression with replacement node
Regression with replacement, variable selection and impute node
Regression with replacement, variable selection, impute and transform variables
Dmine regression with replacement, variable selection, impute and variable transformation
Neural networks with variable selection
Decision tree
Decision tree using variable selection and replacement node
Gradient boosting with replacement
Decision tree with replacement node
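The 70/20/10 stratified partition described above was done with SAS Enterprise Miner. As a minimal sketch of the same idea outside SAS, the Python below splits each target class separately so that every partition keeps the overall ‘yes’ rate. The function name, the seed, and the synthetic rows are assumptions for illustration only.

```python
import random
from collections import defaultdict


def stratified_partition(rows, target, fractions=(0.7, 0.2, 0.1), seed=42):
    """Split rows into train/validation/test while preserving the target mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target]].append(row)

    train, validation, test = [], [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_train = round(fractions[0] * len(members))
        n_valid = round(fractions[1] * len(members))
        train.extend(members[:n_train])
        validation.extend(members[n_train:n_train + n_valid])
        test.extend(members[n_train + n_valid:])  # remainder goes to test
    return train, validation, test


# Synthetic rows only: 1,000 observations with a 42% 'yes' rate.
rows = [{"id": i, "dec": 1 if i < 420 else 0} for i in range(1000)]
train, validation, test = stratified_partition(rows, "dec")

print(len(train), len(validation), len(test))  # 700 200 100
print(sum(r["dec"] for r in train) / len(train))
```

A simple random split would hit the same partition sizes but could let the share of ‘yes’ decisions drift between partitions, which is the difference stratification removes.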

Impute node: The dataset has numerous missing values. To address this issue, the mean of each respective variable was used to replace missing interval values, and the mode of each respective variable was used to replace missing ordinal values.
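The imputation rule can be sketched in a few lines of Python. This is not the SAS Impute node itself, only an illustration of mean and mode replacement; the `impute` helper and the sample columns are hypothetical.

```python
from statistics import mean, mode


def impute(values, strategy):
    """Fill None entries with the mean (interval) or the mode (ordinal)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else mode(observed)
    return [fill if v is None else v for v in values]


ages = [24, None, 27, 30, None]   # interval variable: use the mean
go_out = [1, 2, None, 2, 3]       # ordinal variable: use the mode

print(impute(ages, "mean"))    # [24, 27, 27, 30, 27]
print(impute(go_out, "mode"))  # [1, 2, 2, 2, 3]
```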


Variable selection: Since we have many attributes, the variable selection node was used to let SAS automatically choose the variables that most affect the target variable, ‘decision.’

Variable transformation: Certain attributes were highly positively skewed. These variables have been transformed using the log function. This method gave superior results to other methods such as inverse or square root.
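To show why a log transform helps with positive skew, the sketch below computes the skewness of a right-skewed sample before and after taking logs. The sample values are invented for illustration and are not taken from the dataset; the log transform also assumes all values are strictly positive.

```python
import math


def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Illustrative right-skewed values (a long tail of large observations).
raw = [1, 1, 2, 2, 3, 4, 6, 10, 25, 60]
logged = [math.log(v) for v in raw]

print(round(skewness(raw), 2), round(skewness(logged), 2))
```

The log compresses the long right tail, so the transformed sample has a noticeably smaller skewness than the raw one.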

We altered the ‘maximum branch’ parameter in every decision tree and got the best results when the ‘maximum branch’ was set to 4 for the Decision tree with replacement node.

We ran forward, backward, and stepwise selection for every regression node. We got the best results with the ‘model selection’ parameter set to none for the Regression with impute node.

Model comparison results:

The model comparison node shows that the best model selected by SAS Enterprise Miner is Gradient Boosting with replacement node, with a misclassification rate of 18.1 percent.

ROC curve for all Models


For our application, we are more interested in the true positive rate of the model because we will be making recommendations, and it is better to recommend people to whom the user is more likely to say yes.
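The distinction between accuracy and true positive rate can be illustrated with a short Python sketch. The labels below are made up for the example; only the definitions (accuracy as the share of correct predictions, true positive rate as the share of actual ‘yes’ decisions that were predicted ‘yes’) are standard.

```python
def confusion_counts(actual, predicted):
    """Return (tp, tn, fp, fn) for binary labels coded 1=yes, 0=no."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(actual, predicted):
    tp, tn, fp, fn = confusion_counts(actual, predicted)
    return (tp + tn) / (tp + tn + fp + fn)


def true_positive_rate(actual, predicted):
    tp, _, _, fn = confusion_counts(actual, predicted)
    return tp / (tp + fn)


# Illustrative labels only (4 'yes' and 6 'no' decisions).
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
model = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
always_no = [0] * 10

print(accuracy(actual, model), true_positive_rate(actual, model))          # 0.7 0.75
print(accuracy(actual, always_no), true_positive_rate(actual, always_no))  # 0.6 0.0
```

The always-‘no’ predictor reaches 60% accuracy on these labels while never recommending anyone, which is why accuracy alone is a poor guide for a recommendation app.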

True positive rates:

Model                                    True positive rate
Dmine Regression                         71.4%
Regression with impute                   74.2%
Regression with transformed variables    73.8%
Neural Network                           31.5%
Gradient boosting                        75.2%
Decision tree                            72.7%
Decision tree with variable selection    75.8%
Decision tree with replacement           80.2%

Even though gradient boosting gives the best misclassification rate, we have chosen the Decision tree with replacement as our BI model based on its higher true positive rate. The decision tree has a true positive rate of 80.2 percent, whereas gradient boosting has a true positive rate of 75.2 percent. Please see the attached document containing the image of the decision tree.

Conclusion:


The decision tree uses these variables to split upon, and the root node selected is like.

Some interesting results:

All the ratings are on a scale of 1 to 10.

If the user likes a person at 8 or higher → the user rates them 7.5 or higher on attractiveness → the user thinks the probability of getting a match is 3 or higher, then there is an 86.28 percent chance that the user will say yes.

If the user likes a person at 8 or higher → the user rates them at least 4 and less than 7 on attractiveness → the user estimates the number of matches (match_es) to be greater than 1.5 and they are of the same race, then there is a 60 percent chance that the user will say yes.

If the user likes the person at 5.5 or higher and less than 6.5 → and the user is from London, England, there is a 100 percent chance of saying yes; but if the user is from Alabama, Texas, or Argentina, there is a 68.12 percent chance of saying no.

If the user likes a person less than 5.5 → and the user is a lawyer, then there is a 93.16 percent chance that the user will say no to the other person. Similarly, if the user is in the field of Informatics or Psychology, the user will say no 100 percent of the time, and if the user is a journalist, there is an 83 percent chance of saying yes.
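To show how such tree paths translate into scoring logic, here is a small Python sketch that encodes only the first two paths listed above. It is not the full decision tree from SAS Enterprise Miner; the function name `chance_of_yes` and the parameter `same_race` are hypothetical, and any path not described in the report returns `None`.

```python
def chance_of_yes(like, attr=None, prob=None, match_es=None, same_race=None):
    """Return the reported share of 'yes' decisions for two listed tree paths."""
    if like >= 8 and attr is not None:
        # Path 1: like >= 8, attractiveness >= 7.5, expected probability >= 3.
        if attr >= 7.5 and prob is not None and prob >= 3:
            return 0.8628
        # Path 2: like >= 8, 4 <= attractiveness < 7, match_es > 1.5, same race.
        if 4 <= attr < 7 and match_es is not None and match_es > 1.5 and same_race:
            return 0.60
    return None  # path not covered by the rules quoted in the report


print(chance_of_yes(like=9, attr=8, prob=5))                         # 0.8628
print(chance_of_yes(like=8, attr=5, match_es=2, same_race=True))     # 0.6
print(chance_of_yes(like=4))                                         # None
```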

Overview of the application:


In the mobile application, users create their profiles with some pictures and a description of themselves. The users are asked to specify their preferences, such as the age range of their partners, the location range, and whether they are interested in meaningful friendships or relationships.

The user is shown the profiles of people who match the user’s preferences and is asked to rate each profile on features such as attractiveness, fun, and how much they like the overall profile of the person.

Based on these ratings, our BI model generates a list of potential partners with whom the user is likely to be compatible, with an option to start a chat.

After a significant user base has been established, we will be able to design a recommendation system that increases accuracy by selecting the profiles that similar users have matched with.

References:

Data source: Kaggle.com

Columbia Business School. Ray Fisman and Sheena. Gender Differences in Mate Selection: Evidence from a Speed Dating Experiment. https://www.kaggle.com/annavictoria/speed-dating-experiment
