Categorical Data Analysis in Python

By Jaidev Deshpande, Data Scientist, DataCulture Analytics (twitter.com/jaidevd)

Posted on 16-Jul-2015

TRANSCRIPT

Page 1: Categorical Data Analysis in Python

Categorical Data Analysis in Python

By Jaidev Deshpande
Data Scientist, DataCulture Analytics

twitter.com/jaidevd

Page 2: Categorical Data Analysis in Python

Problem: Who's likely to attend the next meetup?

● Who comes often?
● Men / Women?
● Where do you live? How far from the venue?
● Proficiency with Python (Beginner / Intermediate / Advanced)?
● Area of interest?

Page 3: Categorical Data Analysis in Python

Something like...

             Features
Attendees    Attendance (%)  Gender  Pincode  Proficiency in Python  Interest           ...
attendee_1   80              M       411013   Intermediate           Web                ...
attendee_2   30              F       411040   Advanced               Test / Automation  ...
attendee_3   55              M       411001   Beginner               Scientific         ...
...          ...             ...     ...      ...                    ...                ...

● 1. Numerical features – continuous and quantitative
● 2. Categorical features – discrete and qualitative
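The split between the two feature types can be checked mechanically. A minimal sketch, assuming pandas is available and using only the three example rows from the table above. The column names are illustrative shorthand, and the pincodes are stored as strings on the assumption that they are labels rather than quantities.

```python
import pandas as pd

# The three example attendees from the table above.
# Column names are illustrative shorthand, not taken from the slides.
attendees = pd.DataFrame(
    {
        "attendance": [80, 30, 55],                  # numerical: percentage
        "gender": ["M", "F", "M"],                   # categorical
        "pincode": ["411013", "411040", "411001"],   # categorical, kept as strings
        "proficiency": ["Intermediate", "Advanced", "Beginner"],
        "interest": ["Web", "Test / Automation", "Scientific"],
    },
    index=["attendee_1", "attendee_2", "attendee_3"],
)

# Separate the columns by dtype.
numerical = attendees.select_dtypes(include="number").columns.tolist()
categorical = attendees.select_dtypes(exclude="number").columns.tolist()

print("numerical:", numerical)
print("categorical:", categorical)
```

Had the pincodes been stored as integers, select_dtypes would have counted them as numerical, even though arithmetic on them is meaningless.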

Page 4: Categorical Data Analysis in Python

Common Numerical Operations on Data

● Obviously: add, subtract, multiply, divide
● Statistical moments
● Operations in vector spaces
  – Distance measures
  – Slicing
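The operations listed above could look like this in NumPy. This is only a sketch: the attendance values come from the example table, while the second column (distance from the venue in km) is hypothetical and added purely so that the vectors have more than one dimension.

```python
import numpy as np

# Rows: attendee_1..attendee_3. Column 0 is attendance (%) from the example
# table; column 1 is a HYPOTHETICAL distance from the venue in km.
X = np.array([
    [80.0, 2.5],
    [30.0, 7.0],
    [55.0, 4.0],
])

# Arithmetic: element-wise operations on a whole column.
attendance_fraction = X[:, 0] / 100.0

# Statistical moments: mean and (population) variance of each feature.
means = X.mean(axis=0)
variances = X.var(axis=0)

# Vector-space operation: Euclidean distance between attendee_1 and attendee_2.
dist_1_2 = np.linalg.norm(X[0] - X[1])

# Slicing: attendance of the first two attendees only.
subset = X[:2, 0]

print(means, variances, dist_1_2, subset)
```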

Page 5: Categorical Data Analysis in Python

Comparison of Operations

Numerical Data                                Categorical Data (Strings, etc.)
Add, subtract, multiply, divide               What's the product of two strings?
Mean, variance, standard deviation            The average pincode of two areas?
Vector spaces – the very idea of 'measuring'  &%%#&$$*&!!!!

At least get some numbers!

Page 6: Categorical Data Analysis in Python

One-hot Encoding

● [Apples, Oranges, Mangoes] is encoded as
  [0, 0, 1;
   0, 1, 0;
   1, 0, 0]
● sklearn.preprocessing.OneHotEncoder
● sklearn.feature_extraction.DictVectorizer
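A minimal sketch of the fruit example with sklearn.preprocessing.OneHotEncoder, assuming a recent scikit-learn release that accepts string categories directly. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical feature with three observations, as a column vector.
fruits = np.array([["Apples"], ["Oranges"], ["Mangoes"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; densify it for display.
encoded = encoder.fit_transform(fruits).toarray()

print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded)                 # one row per observation, a single 1 per row
```

The encoder orders its columns by sorted category name, so the exact bit pattern differs from the one on the slide. What matters is that every row contains exactly one 1.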

Page 7: Categorical Data Analysis in Python

Original Data

             Features
Attendees    Attendance (%)  Gender  Pincode        Proficiency in Python  Interest         ...
attendee_1   80              [0 1]   [1 0 0 … 0]    [0 1 0]                [1 0 0 0 0 0]    ...
attendee_2   30              [1 0]   [0 1 0 … 0]    [1 0 0]                [0 1 0 0 0 0]    ...
attendee_3   55              [0 1]   [0 0 1 … 0]    [0 0 1]                [0 0 1 0 0 0]    ...
...          ...             ...     ...            ...                    ...              ...

Page 8: Categorical Data Analysis in Python

Curse of Dimensionality
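One way to see the problem: one-hot encoding adds a column for every distinct value of every categorical feature. A minimal sketch, assuming pandas and the three example attendees from the earlier table (column names are illustrative shorthand):

```python
import pandas as pd

# The three example attendees; column names are illustrative shorthand.
attendees = pd.DataFrame(
    {
        "attendance": [80, 30, 55],
        "gender": ["M", "F", "M"],
        "pincode": ["411013", "411040", "411001"],
        "proficiency": ["Intermediate", "Advanced", "Beginner"],
        "interest": ["Web", "Test / Automation", "Scientific"],
    },
    index=["attendee_1", "attendee_2", "attendee_3"],
)

# Replace each categorical column by one indicator column per distinct value.
encoded = pd.get_dummies(
    attendees,
    columns=["gender", "pincode", "proficiency", "interest"],
    dtype=int,
)

print(attendees.shape, "->", encoded.shape)
```

Even with three rows the table more than doubles in width, and every additional distinct pincode would add one more column.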

Page 9: Categorical Data Analysis in Python

Correspondence Analysis

● Contingency tables (pandas.crosstab)

proficiency  advanced  beginner  intermediate
gender
F                   1         0             0
M                   0         1             1

● Different numerical measures
● Perceptual maps
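A contingency table like the one above can be built with pandas.crosstab. A minimal sketch using the gender and proficiency values of the three example attendees (the column names are illustrative):

```python
import pandas as pd

# Gender and proficiency of the three example attendees.
attendees = pd.DataFrame(
    {
        "gender": ["M", "F", "M"],
        "proficiency": ["intermediate", "advanced", "beginner"],
    }
)

# Count how often each (gender, proficiency) pair occurs.
table = pd.crosstab(attendees["gender"], attendees["proficiency"])
print(table)
```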

Page 10: Categorical Data Analysis in Python

Correspondence Analysis

● How are proficiencies related w.r.t. gender? (Row profiles)
● How are genders related w.r.t. proficiency? (Column profiles)
  – Cosine similarity
  – Correlation / Covariance
● How are they interrelated?
  – Weighted chi-squared distance
● Can the dimensionality be reduced?
  – Singular value decomposition / PCA
  – sklearn.decomposition.PCA
  – sklearn.decomposition.TruncatedSVD
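The profile and distance measures above can be sketched in NumPy on the small gender-by-proficiency table from the previous slide. The terms "correspondence matrix" and "mass" follow common correspondence-analysis usage rather than the slides, and the helper functions are illustrative, not part of any library.

```python
import numpy as np

# Contingency table from the previous slide:
# rows = gender (F, M); columns = proficiency (advanced, beginner, intermediate).
N = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])

P = N / N.sum()            # correspondence matrix (relative frequencies)
row_mass = P.sum(axis=1)   # share of observations in each row
col_mass = P.sum(axis=0)   # share of observations in each column

# Profiles: each row (or column) rescaled so that it sums to 1.
row_profiles = P / row_mass[:, None]
col_profiles = P / col_mass[None, :]


def cosine_similarity(a, b):
    """Cosine of the angle between two profile vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def chi2_distance(a, b, mass):
    """Euclidean distance with each squared difference divided by its mass."""
    return float(np.sqrt(np.sum((a - b) ** 2 / mass)))


cos_f_m = cosine_similarity(row_profiles[0], row_profiles[1])
d_f_m = chi2_distance(row_profiles[0], row_profiles[1], col_mass)
print(cos_f_m, d_f_m)
```

Dividing by the column masses gives rare categories more weight in the distance than common ones, which is the "weighted" part of the chi-squared distance.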

Page 11: Categorical Data Analysis in Python

Sample Problem

● Consider the proficiency and interest features from the original problem
● Fake data with 100 observations
● Contingency matrix:

              automation  scientific  web
advanced               8           1    7
beginner              13           9   35
intermediate           7           1   19
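A minimal sketch of correspondence analysis on this contingency matrix, using a plain NumPy singular value decomposition. This is a textbook formulation under stated assumptions, not necessarily how the mca package or the original talk computed its results; all variable names are illustrative.

```python
import numpy as np

rows = ["advanced", "beginner", "intermediate"]   # proficiency
cols = ["automation", "scientific", "web"]        # interest
N = np.array([[8.0, 1.0, 7.0],
              [13.0, 9.0, 35.0],
              [7.0, 1.0, 19.0]])

n = N.sum()            # number of observations
P = N / n              # correspondence matrix
r = P.sum(axis=1)      # row masses
c = P.sum(axis=0)      # column masses

# Standardized residuals: deviation from independence, scaled by the masses.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# The SVD of the residuals yields the principal axes.
U, sigma, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates of the rows and columns.
row_coords = (U * sigma) / np.sqrt(r)[:, None]
col_coords = (Vt.T * sigma) / np.sqrt(c)[:, None]

# Total inertia and the share captured by each axis.
inertia = float(np.sum(sigma ** 2))
explained = sigma ** 2 / inertia

# The first two coordinates of each point are what a perceptual map plots.
for name, xy in zip(rows, row_coords[:, :2]):
    print("row", name, xy)
for name, xy in zip(cols, col_coords[:, :2]):
    print("col", name, xy)
print("share of inertia per axis:", explained)
```

For a 3 x 3 table at most two axes carry information, so a two-dimensional map loses nothing here; the total inertia equals the chi-squared statistic of the table divided by the number of observations.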

Page 12: Categorical Data Analysis in Python

Results

Page 13: Categorical Data Analysis in Python

Source and Tutorials

● http://github.com/motherbox/mca