Type Independent Correction of Sample Selection Bias via Structural Discovery and Re-balancing. Jiangtao Ren¹, Xiaoxiao Shi¹, Wei Fan², Philip S. Yu². ¹Sun Yat-sen University, China; ²IBM T.J. Watson; ³University of Illinois at Chicago

Upload: matthew-stanley

Post on 27-Mar-2015


TRANSCRIPT

Page 1:

Type Independent Correction of Sample Selection Bias via Structural Discovery and Re-balancing

Jiangtao Ren¹, Xiaoxiao Shi¹, Wei Fan², Philip S. Yu²

¹Sun Yat-sen University, China; ²IBM T.J. Watson; ³University of Illinois at Chicago

Page 2:

What is sample selection bias?

Inductive learning: training data (x,y) is sampled from the universe of examples.

In many applications, training data (x,y) is not sampled randomly.

  Insurance and mortgage data: you only know about the people to whom you gave a policy.

  School data: students self-select.

There are different possibilities for how (x,y) is selected (Zadrozny'04). S=1 denotes that (x,y) is chosen.

  S is independent of x and y: total random sample.

  S depends on y but not on x: class bias.

  S depends on x but not on y: feature bias.

  S depends on both x and y: both class and feature bias.
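The four selection types above can be illustrated with a small simulation. This is a minimal sketch, not taken from the slides: the synthetic population, the selection probabilities (0.5, 0.8, 0.2, 0.9, 0.1), and the helper names `make_population`, `selection_mask`, and `summarize` are all assumptions chosen for illustration.

```python
import numpy as np

def make_population(n, rng):
    # Hypothetical universe: one feature x, and a binary label y correlated with x.
    x = rng.normal(size=n)
    y = (x + 0.5 * rng.normal(size=n) > 0).astype(int)
    return x, y

def selection_mask(x, y, kind, rng):
    """Return a boolean mask s; s[i] is True when example i is selected (S=1)."""
    if kind == "random":      # S independent of x and y
        p = np.full(len(x), 0.5)
    elif kind == "class":     # S depends on y only
        p = np.where(y == 1, 0.8, 0.2)
    elif kind == "feature":   # S depends on x only
        p = np.where(x > 0, 0.8, 0.2)
    elif kind == "both":      # S depends on both x and y
        p = np.where((x > 0) & (y == 1), 0.9, 0.1)
    else:
        raise ValueError(f"unknown selection type: {kind}")
    return rng.random(len(x)) < p

def summarize(n=20000, seed=0):
    # Compare the positive rate and the mean of x in each selected sample
    # against the full population.
    rng = np.random.default_rng(seed)
    x, y = make_population(n, rng)
    out = {"population": (y.mean(), x.mean())}
    for kind in ("random", "class", "feature", "both"):
        s = selection_mask(x, y, kind, rng)
        out[kind] = (y[s].mean(), x[s].mean())
    return out

for name, (pos_rate, mean_x) in summarize().items():
    print(f"{name:>10}: P(y=1)={pos_rate:.2f}  mean x={mean_x:+.2f}")
```

Under these assumed probabilities, only the random sample should track the population statistics; the other three shift the class ratio, the feature distribution, or both.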

Ubiquitous: loan approval, drug screening, weather forecasting, ad campaigns, fraud detection, user profiling, biomedical informatics, intrusion detection, insurance, etc.

Page 3:

Our method

Key ideas:

Original Dataset -> Structural Discovery -> Structural Rebalance -> Corrected Dataset

Structural Discovery: automatic clustering

Structural Rebalance:

  1. The same proportion

  2. Select "trustful" ones

  3. Label by neighbors

Advantages:

  1. Type independent

  2. Model independent

  3. Straightforward
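One possible reading of the pipeline on this slide, as a rough sketch rather than the authors' implementation. The slide gives only keywords, so the following are assumptions: plain k-means with a fixed `k` stands in for "automatic clustering", "the same proportion" is read as keeping the same fraction of examples from every cluster, "trustful" is read as closest to the cluster center, and "label by neighbors" is read as copying the label of the nearest labeled example in the same cluster. The names `kmeans` and `rebalance` and the use of `-1` for unlabeled examples are hypothetical.

```python
import numpy as np

def kmeans(X, k, rng, n_iter=20):
    # Farthest-first initialization followed by standard Lloyd iterations.
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dist.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1), centers

def rebalance(X, y, k, proportion, rng):
    """y holds class labels, with -1 marking unlabeled examples (an assumption)."""
    labeled = y >= 0
    if not labeled.any():
        raise ValueError("need at least one labeled example")
    assign, centers = kmeans(X, k, rng)   # structural discovery
    keep_X, keep_y = [], []
    for j in range(k):
        idx = np.flatnonzero(assign == j)
        if len(idx) == 0:
            continue
        # 1. Keep the same proportion of examples from every cluster.
        n_keep = max(1, int(round(proportion * len(idx))))
        # 2. "Trustful" examples: here, the ones closest to the cluster center.
        d = np.linalg.norm(X[idx] - centers[j], axis=1)
        chosen = idx[np.argsort(d)[:n_keep]]
        # 3. Label by neighbors: nearest labeled example in the cluster,
        #    falling back to all labeled examples if the cluster has none.
        lab_idx = idx[labeled[idx]]
        if len(lab_idx) == 0:
            lab_idx = np.flatnonzero(labeled)
        for i in chosen:
            if y[i] >= 0:
                label = y[i]
            else:
                nearest = lab_idx[np.linalg.norm(X[lab_idx] - X[i], axis=1).argmin()]
                label = y[nearest]
            keep_X.append(X[i])
            keep_y.append(label)
    return np.array(keep_X), np.array(keep_y)

# Usage: two blobs, with labels heavily skewed toward the first blob.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(10, 0.5, (100, 2))])
y = np.full(200, -1)
y[:40] = 0        # 40 labeled examples in the first blob
y[100:105] = 1    # only 5 labeled examples in the second blob
X_new, y_new = rebalance(X, y, k=2, proportion=0.2, rng=rng)
print(np.bincount(y_new))
```

Nothing in the sketch depends on which of the four bias types produced the skewed labels, and the corrected dataset can be passed to any learner, which is one way to read the "type independent" and "model independent" advantages listed on the slide.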