Address Standardization with Latent Semantic Association
Address Standardization with Latent Semantic Association
Authors: Honglei Guo, Huijia Zhu, Zhili Guo, XiaoXun Zhang, and Zhong Su
Publication: KDD '09
Advisor: Chia-Hui Chang
Presenter: Chia-Yi Huang
2010/08/12
Outline
- Introduction
- Related Works
- Latent Semantic Association Method
- Address Standardization Using LaSA Model and Informative Sampling
- Experiments
- Conclusions
Introduction
- Motivation
- Approaches
- Related Works
Introduction
- Address data are highly irregular: most of them are generated by different people at different times.
- Addresses should be converted to a standard, consistent format.
  ◦ Ex: "1101 Kitchawan Road, Route 134, Yorktown Heights, N.Y. 10598"
  ◦ [House No.: 1101], [Street: Kitchawan Road], [Route: Route 134], [City: Yorktown Heights], [State: N.Y.], [Zip: 10598]
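The example above can be sketched as a token-level labeling task (a toy illustration; the tag names here are ad hoc, not the paper's tag set):

```python
# Toy illustration: address standardization viewed as assigning a
# semantic label to each token of the raw address, then merging
# consecutive tokens that share a label into fields.
tokens = ["1101", "Kitchawan", "Road", "Route", "134",
          "Yorktown", "Heights", "N.Y.", "10598"]
labels = ["HouseNo", "Street", "Street", "Route", "Route",
          "City", "City", "State", "Zip"]

def segments(tokens, labels):
    """Merge consecutive tokens sharing a label into (label, value) fields."""
    out = []
    for tok, lab in zip(tokens, labels):
        if out and out[-1][0] == lab:
            out[-1] = (lab, out[-1][1] + " " + tok)
        else:
            out.append((lab, tok))
    return out
```

Running `segments(tokens, labels)` recovers the six fields shown above, e.g. `("Street", "Kitchawan Road")` and `("Zip", "10598")`.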
Introduction (cont.)
- Latent semantic association (LaSA) minimizes human effort and augments the size of the labeled training data set.
- The address standardization model is learned from LaSA features and informative samples.
Latent Semantic Association Method
- Virtual Context Document
- Learning LaSA Model
Latent Semantic Association Model
- To minimize human effort, we expect to use ps(x, y) to approximate pt(x, y).
  ◦ X: feature space used to represent word instances.
  ◦ Y: set of semantic labels.
  ◦ ps(x, y), pt(x, y): underlying distributions of the labeled training data set and the target data set, respectively.
- The LaSA model θs,t captures latent semantic association among words from the unlabeled domain data.
  ◦ Better augments the training data set.
  ◦ Enhances the estimated distribution so that it better approximates the real domain distribution.
Learning LaSA Model from Virtual Context Documents
- Virtual context document: given a word xi, the virtual context document of xi collects its context feature sets over all samples containing it, vd_xi = {F(xi, s1), …, F(xi, sn)}.
  ◦ F(xi, sk): context feature set of xi in the address sample sk, 1 ≤ k ≤ n.
  ◦ n: total number of samples in the corpus that contain xi.
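A minimal sketch of building virtual context documents, assuming a simple token-window context feature set (the paper's actual feature set is richer):

```python
from collections import defaultdict

def context_features(tokens, i, window=3):
    """Context feature set of the word at position i: the surrounding
    tokens within a +/-window view (a simplified stand-in for F(xi, sk))."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]

def build_virtual_documents(samples, window=3):
    """vd_x: concatenation of x's context feature sets over every
    sample in the corpus that contains x."""
    vd = defaultdict(list)
    for tokens in samples:
        for i, x in enumerate(tokens):
            vd[x].extend(context_features(tokens, i, window))
    return vd
```

For example, with samples `[["a", "b", "c"], ["b", "d"]]` and `window=1`, the virtual context document of "b" collects features from both samples: `["a", "c", "d"]`.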
Learning LaSA Model from Virtual Context Documents (cont.)
- Given vd_xi = {f1, …, fj, …, fm}, each context feature is weighted by pointwise mutual information:
  Weight(fj, xi) = log2 [ P(fj, xi) / (P(fj) P(xi)) ]
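The weighting can be sketched as follows, estimating the probabilities as relative frequencies over (feature, word) co-occurrence pairs (an assumption for illustration; the transcript does not spell out the estimator):

```python
import math
from collections import Counter

def pmi_weights(vd):
    """Weight(f, x) = log2( P(f, x) / (P(f) P(x)) ), with probabilities
    estimated as relative frequencies over the (feature, word) pairs in
    the virtual context documents vd: {word x -> list of features f}."""
    pair, f_cnt, x_cnt = Counter(), Counter(), Counter()
    for x, feats in vd.items():
        for f in feats:
            pair[(f, x)] += 1
            f_cnt[f] += 1
            x_cnt[x] += 1
    n = sum(pair.values())
    return {
        (f, x): math.log2((c / n) / ((f_cnt[f] / n) * (x_cnt[x] / n)))
        for (f, x), c in pair.items()
    }
```

Features that co-occur with a word more often than chance get positive weights; features that are common everywhere get weights near or below zero.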
Learning LaSA Model
- Latent Dirichlet allocation (LDA) imposes a Dirichlet distribution on the topic-mixture weights corresponding to the documents in the corpus.
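For intuition, LDA inference can be sketched as a minimal collapsed Gibbs sampler in plain Python (this is an illustrative sketch, not the paper's training setup; hyperparameters are arbitrary):

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Minimal collapsed Gibbs sampler for LDA over token-list documents.
    Returns per-document topic counts and per-topic word counts."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})          # vocabulary size
    z = [[rng.randrange(n_topics) for _ in d] for d in docs]  # token topics
    ndk = [[0] * n_topics for _ in docs]           # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                            # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                        # remove current assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # resample topic proportional to (doc-topic) * (topic-word)
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + V * beta) for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk, nkw
```

In the paper's setting, the "documents" would be the virtual context documents, so words sharing similar contexts end up associated with the same latent topics.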
Learning LaSA Model (cont.)
Address Standardization Using LaSA Model and Informative Sampling
- RRM Classifier
- Latent Semantic Association Feature
- Informative Sampling
Address Standardization Using LaSA Model
- Address standardization is viewed as a sequential classification problem, solved with a Robust Risk Minimization (RRM) classifier.
- Latent semantic association feature settings:
  ◦ Frequency threshold: 10
  ◦ Number of topics N: 50
  ◦ Context view window size: {-3, +3}
Informative Sampling
- The informative sample selection method uses a variant of uncertainty sampling: the more uncertain fragments a sample contains, the more informative it is.
- Given an address sample Si = {tokj}, j = 1, …, N:
  ◦ tokj: the j-th token unit in Si.
- The confidence score of Si is computed from:
  ◦ Score(tokj): confidence score of tokj in Si.
  ◦ TokNum(Si): total number of token units in Si.
  ◦ UncNum(Si): number of uncertain units in Si.
- Token units with a low confidence score (i.e., Score(tokj) ≤ α) are considered uncertain units.
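The slide's exact scoring formula was shown as a figure and is not in the transcript; purely for illustration, the following sketch combines the three quantities above, averaging token confidence and discounting by the fraction of uncertain units:

```python
def uncertain_count(token_scores, alpha=0.6):
    """UncNum(Si): number of tokens with confidence <= alpha."""
    return sum(1 for s in token_scores if s <= alpha)

def sample_confidence(token_scores, alpha=0.6):
    """Illustrative sample-level confidence (NOT the slide's exact formula):
    mean token confidence scaled down by the uncertain-unit ratio."""
    tok_num = len(token_scores)                  # TokNum(Si)
    unc_num = uncertain_count(token_scores, alpha)
    return (sum(token_scores) / tok_num) * (1 - unc_num / tok_num)

def select_informative(samples, alpha=0.6, k=1):
    """Pick the k least-confident (hence most informative) samples,
    where each sample is a list of per-token confidence scores."""
    return sorted(samples, key=lambda s: sample_confidence(s, alpha))[:k]
```

A sample with one very uncertain token, e.g. `[0.3, 0.9]`, scores lower than a uniformly confident sample `[0.9, 0.9]`, so it is selected for annotation first.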
Informative Sampling (cont.)
Experiments
- Data set
Experiments (cont.)
- Performance enhancement by the LaSA model:
  ◦ Relative F-measure enhancement
  ◦ Relative error reduction
Experiments (cont.)
- Training data reduction by the LaSA feature
Cumulative impact of LaSA model and informative sampling
Cumulative impact of LaSA model and informative sampling (cont.)
Conclusions
- The LaSA-Info method achieves more than a 45% reduction in error over the state-of-the-art RRM model trained on the same material.
- Compared to the supervised learning method, the presented approach requires only 5% as much annotated data to achieve the same level of performance.