Type Independent Correction of Sample Selection Bias via
Structural Discovery and Re-balancing
Jiangtao Ren1
Xiaoxiao Shi1
Wei Fan2
Philip S. Yu2
1Sun Yat-Sen University, China 2IBM T.J. Watson 3University of Illinois at Chicago
What is sample selection bias?
Inductive learning: training data (x,y) is sampled from the universe of examples.
In many applications, training data (x,y) is not sampled randomly.
Insurance and mortgage data: you only know the outcomes for the people to whom you gave a policy.
School data: students self-select.
There are different possibilities for how (x,y) is selected (Zadrozny’04). S=1 denotes that (x,y) is chosen.
S is independent of x and y: total random sample.
S is dependent on y, not on x: class bias.
S is dependent on x, not on y: feature bias.
S is dependent on both x and y: both class and feature bias.
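The four selection cases can be made concrete with a small simulation. This is only an illustrative sketch: the synthetic universe, the helper names (`make_universe`, `select`, `summarize`) and all selection probabilities are assumptions chosen for the example and do not come from the slides.

```python
import random

def make_universe(n=10000, seed=0):
    """Synthetic universe (assumption): x ~ U(0,1) and P(y=1 | x) = x."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        y = 1 if rng.random() < x else 0
        data.append((x, y))
    return data

def select(data, prob_s, seed=1):
    """Keep (x, y) with probability P(S=1 | x, y) given by prob_s."""
    rng = random.Random(seed)
    return [(x, y) for x, y in data if rng.random() < prob_s(x, y)]

# One selection rule per case; the probabilities are arbitrary illustrations.
SELECTORS = {
    "random":       lambda x, y: 0.3,                       # S independent of x and y
    "class_bias":   lambda x, y: 0.5 if y == 1 else 0.1,    # S depends on y only
    "feature_bias": lambda x, y: 0.5 if x > 0.5 else 0.1,   # S depends on x only
    "both":         lambda x, y: 0.6 if (x > 0.5 and y == 1) else 0.1,
}

def summarize(sample):
    """Return (mean of x, fraction of y == 1) for a sample."""
    n = len(sample)
    return sum(x for x, _ in sample) / n, sum(y for _, y in sample) / n

universe = make_universe()
stats = {name: summarize(select(universe, p)) for name, p in SELECTORS.items()}

for name, (mean_x, pos_rate) in stats.items():
    print(f"{name:13s} mean x = {mean_x:.3f}  P(y=1) = {pos_rate:.3f}")
```

In this setup the random sample should roughly keep the universe's statistics, class bias inflates the fraction of positives, and feature bias shifts the distribution of x.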
Ubiquitous: Loan Approval, Drug Screening, Weather Forecasting, Ad Campaign, Fraud Detection, User Profiling, Biomedical Informatics, Intrusion Detection, Insurance, etc.
Our method
Key ideas:
Original Dataset -> Structural Discovery -> Structural Rebalance -> Corrected Dataset
Structural Discovery: Automatic Clustering
Structural Rebalance:
1. The same proportion
2. Select “trustful” ones
3. Label by neighbors
Advantages:
1. Type Independent
2. Model Independent
3. Straightforward
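The pipeline of clustering, drawing the same proportion from each cluster, selecting “trustful” examples and labeling by neighbors can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: a plain k-means with a fixed k stands in for the automatic clustering, “trustful” is interpreted as "closest to the cluster center", and neighbor labeling is done with the single nearest labeled example. All function names and the toy data are hypothetical.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialization
    (an assumed stand-in for the 'automatic clustering' step)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = mean(members)
    return centers, assign

def rebalance(labeled, unlabeled, k, ratio):
    """Draw the same proportion `ratio` of examples from every cluster."""
    points = [x for x, _ in labeled] + list(unlabeled)
    centers, assign = kmeans(points, k)
    corrected = []
    for c in range(k):
        idx = [i for i, a in enumerate(assign) if a == c]
        if not idx:
            continue
        quota = max(1, round(ratio * len(idx)))            # 1. the same proportion
        by_center = sorted(idx, key=lambda i: dist2(points[i], centers[c]))
        lab = [i for i in by_center if i < len(labeled)]
        unl = [i for i in by_center if i >= len(labeled)]
        chosen = [labeled[i] for i in lab[:quota]]          # 2. "trustful" = near the center (assumption)
        for i in unl[:quota - len(chosen)]:                 # 3. label by nearest labeled neighbor
            nearest = min(labeled, key=lambda xy: dist2(xy[0], points[i]))
            chosen.append((points[i], nearest[1]))
        corrected.extend(chosen)
    return corrected

# Toy data (assumption): the labeled sample over-represents the cluster near (0, 0).
labeled = [((0.1 * i, 0.0), 0) for i in range(8)] + [((10.0, 10.0), 1)]
unlabeled = [(0.1 * i, 0.0) for i in range(8, 10)] + \
            [(10.0 + 0.1 * i, 10.0) for i in range(1, 10)]

corrected = rebalance(labeled, unlabeled, k=2, ratio=0.4)
print(len(corrected), sum(y for _, y in corrected))
```

On this toy data the labeled set starts with eight examples of one class and one of the other, while the corrected set draws equally from both clusters. Because the sketch only resamples and relabels data, any downstream learner can be trained on the result.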