1
Categorical Data Analysis in Python
By Jaidev Deshpande
Data Scientist, DataCulture Analytics
twitter.com/jaidevd
2
Problem: Who's likely to attend the next meetup?
● Who comes often?
● Men / Women?
● Where do you live? How far from the venue?
● Proficiency with Python (Beginner / Intermediate / Advanced)?
● Area of interest?
3
Something like..
Attendees    Features
             Attendance (%)   Gender   Pincode   Proficiency in Python   Interest            ...
attendee_1   80               M        411013    Intermediate            Web                 ...
attendee_2   30               F        411040    Advanced                Test / Automation   ...
attendee_3   55               M        411001    Beginners               Scientific          ...
...          ...              ...      ...       ...                     ...                 ...
● 1. Numerical features – continuous and quantitative
● 2. Categorical features – discrete and qualitative
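A minimal sketch of the numerical / categorical split, using the three example attendees from the table above. The column names are hypothetical, and storing the pincode as a string is an assumption made here so that it is treated as a label rather than a quantity.

```python
import pandas as pd

# The three example attendees; column names are illustrative, not from the talk.
attendees = pd.DataFrame(
    {
        "attendance_pct": [80, 30, 55],
        "gender": ["M", "F", "M"],
        # Assumption: pincodes are kept as strings, since they are labels.
        "pincode": ["411013", "411040", "411001"],
        "proficiency": ["Intermediate", "Advanced", "Beginners"],
        "interest": ["Web", "Test / Automation", "Scientific"],
    },
    index=["attendee_1", "attendee_2", "attendee_3"],
)

# Numerical features: continuous and quantitative.
numerical = attendees.select_dtypes(include="number")
# Categorical features: discrete and qualitative.
categorical = attendees.select_dtypes(exclude="number")

print(list(numerical.columns))
print(list(categorical.columns))
```

Only the attendance column ends up on the numerical side; everything else needs some encoding before arithmetic makes sense.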
4
Common Numerical Operations on Data
● Obviously – add, subtract, multiply, divide
● Statistical moments
● Operations in vector spaces
  – Distance measures
  – Slicing
5
Comparison of Operations
Numerical Data
Add, subtract, multiply, divide
Mean, Variance, Standard Deviation
Vector Spaces – the very idea of 'measuring'
Categorical Data (Strings, etc)
What's the product of two strings?
The average pincode of two areas?
&%%#&$$*&!!!!
At least get some numbers!
6
One-hot Encoding
● [Apples,       [0, 0, 1;
   Oranges,  →    0, 1, 0;
   Mangoes]       1, 0, 0]
● sklearn.preprocessing.OneHotEncoder
● sklearn.feature_extraction.DictVectorizer
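A small sketch of one-hot encoding the fruit list with sklearn.preprocessing.OneHotEncoder. Note that the encoder orders its columns by sorted category name, so the column layout it produces need not match the illustrative matrix on the slide.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical feature (a single column) with three samples.
fruits = np.array([["Apples"], ["Oranges"], ["Mangoes"]])

encoder = OneHotEncoder()  # returns a sparse matrix by default
encoded = encoder.fit_transform(fruits).toarray()

# Columns follow the sorted categories learned during fit.
print(encoder.categories_[0])
print(encoded)
```

Each row contains exactly one 1, in the column belonging to that sample's category.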
7
Original Data
Attendees    Features
             Attendance (%)   Gender   Pincode         Proficiency in Python   Interest        ...
attendee_1   80               [0 1]    [1 0 0 … 0]     [0 1 0]                 [1 0 0 0 0 0]   ...
attendee_2   30               [1 0]    [0 1 0 … 0]     [1 0 0]                 [0 1 0 0 0 0]   ...
attendee_3   55               [0 1]    [0 0 1 … 0]     [0 0 1]                 [0 0 1 0 0 0]   ...
...          ...              ...      ...             ...                     ...             ...
8
Curse of Dimensionality
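A rough illustration of why one-hot encoding runs into this: every categorical feature expands into as many columns as it has distinct values. The widths for gender (2), proficiency (3) and interest (6) are read off the encoded table above; the pincode count is a hypothetical number, since the slide elides it.

```python
# Number of distinct categories per feature.
n_categories = {
    "gender": 2,        # from the encoded table
    "pincode": 50,      # ASSUMPTION: hypothetical count, elided on the slide
    "proficiency": 3,   # from the encoded table
    "interest": 6,      # from the encoded table
}

original_width = len(n_categories)          # one column per categorical feature
one_hot_width = sum(n_categories.values())  # one column per distinct category

print(original_width, one_hot_width)
```

A single high-cardinality feature such as the pincode accounts for most of the added columns.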
9
Correspondence Analysis
● Contingency tables (pandas.crosstab)
proficiency  advanced  beginner  intermediate
gender
F                   1         0             0
M                   0         1             1
● Different numerical measures
● Perceptual maps
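The contingency table above can be reproduced with pandas.crosstab from the gender and proficiency of the three example attendees. This is a sketch: the column names and the lowercase labels are choices made here to match the printed table.

```python
import pandas as pd

# Gender and proficiency of attendee_1, attendee_2 and attendee_3.
attendees = pd.DataFrame(
    {
        "gender": ["M", "F", "M"],
        "proficiency": ["intermediate", "advanced", "beginner"],
    }
)

# Rows are genders, columns are proficiency levels, cells are counts.
table = pd.crosstab(attendees["gender"], attendees["proficiency"])
print(table)
```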
10
Correspondence Analysis
● How are proficiencies related w.r.t. gender? (Row profiles)
● How are genders related w.r.t. proficiency? (Column profiles)
  – Cosine similarity
  – Correlation / Covariance
● How are they interrelated?
  – Weighted chi-squared distance
● Can the dimensionality be reduced?
  – Singular value decomposition / PCA
  – sklearn.decomposition.PCA
  – sklearn.decomposition.TruncatedSVD
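A sketch of row profiles, column profiles and the weighted chi-squared distance in NumPy. It uses the proficiency-by-interest counts from the sample problem slide; the helper name chi2_distance is hypothetical, and the distance is the standard textbook definition (squared profile differences divided by the column masses), not necessarily the exact computation used in the talk.

```python
import numpy as np

# Rows: advanced, beginner, intermediate; columns: automation, scientific, web.
counts = np.array([[8, 1, 7],
                   [13, 9, 35],
                   [7, 1, 19]], dtype=float)

total = counts.sum()

# Row profiles: each row divided by its row total (rows sum to 1).
row_profiles = counts / counts.sum(axis=1, keepdims=True)
# Column profiles: each column divided by its column total (columns sum to 1).
col_profiles = counts / counts.sum(axis=0, keepdims=True)
# Column masses: the share of all observations that falls in each column.
col_masses = counts.sum(axis=0) / total


def chi2_distance(p, q, masses):
    """Chi-squared distance between two row profiles, weighted by column masses."""
    return np.sqrt(np.sum((p - q) ** 2 / masses))


# Distance between the 'advanced' and 'beginner' row profiles.
d = chi2_distance(row_profiles[0], row_profiles[1], col_masses)
print(d)
```

Dividing by the column masses keeps frequent columns from dominating the distance.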
11
Sample Problem
● Consider the proficiency and interest features from the original problem
● Fake data with 100 observations
● Contingency matrix:
              automation  scientific  web
advanced               8           1    7
beginner              13           9   35
intermediate           7           1   19
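One way to run a simple correspondence analysis on this matrix is an SVD of the standardized residuals. The following is a plain NumPy sketch of that textbook procedure, not the code behind the talk's results; variable names are illustrative.

```python
import numpy as np

rows = ["advanced", "beginner", "intermediate"]
cols = ["automation", "scientific", "web"]
N = np.array([[8, 1, 7],
              [13, 9, 35],
              [7, 1, 19]], dtype=float)

P = N / N.sum()      # correspondence matrix (relative frequencies)
r = P.sum(axis=1)    # row masses
c = P.sum(axis=0)    # column masses

# Standardized residuals from the independence model r * c^T.
expected = np.outer(r, c)
S = (P - expected) / np.sqrt(expected)

U, sigma, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates; the first two columns give a 2-D perceptual map.
row_coords = (U * sigma) / np.sqrt(r)[:, None]
col_coords = (Vt.T * sigma) / np.sqrt(c)[:, None]

# Inertia per dimension; the total equals chi-squared / number of observations.
inertia = sigma ** 2
total_inertia = inertia.sum()

for name, xy in zip(rows, row_coords[:, :2]):
    print(name, xy)
for name, xy in zip(cols, col_coords[:, :2]):
    print(name, xy)
```

For a 3 x 3 table at most two dimensions carry any inertia, so the 2-D map loses nothing here.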
12
Results
13
Source and Tutorials
● http://github.com/motherbox/mca