transactional data mining
DESCRIPTION
A talk to the Bay Area ACM chapter about mining transactional and textual data.
TRANSCRIPT
![Page 1: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/1.jpg)
Mining Transactional Data
Ted Dunning - 2004
![Page 2: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/2.jpg)
Outline
● What are LLR tests?
– What value have they shown?
● What are transactional values?
– How can we define LLR tests for them?
● How can these methods be applied?
– Modeling architecture examples
● How new is this?
![Page 3: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/3.jpg)
Log-likelihood Ratio Tests
● Theorem due to Chernoff showed that the generalized log-likelihood ratio is asymptotically χ² distributed in many useful cases
● Most well-known statistical tests are either approximately or exactly LLR tests
– Includes z-test, F-test, t-test, Pearson's χ²
● Pearson's χ² is an approximation valid for large expected counts ... G² is the exact form for multinomial contingency tables
![Page 4: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/4.jpg)
Mathematical Definition
● Ratio of maximum likelihood under the null hypothesis to the unrestricted maximum likelihood
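$$\lambda = \frac{\max_{\theta \in \Omega_0} l(X \mid \theta)}{\max_{\theta \in \Omega} l(X \mid \theta)}$$

$$\text{d.o.f.} = \dim \Omega - \dim \Omega_0$$

● −2 log λ is asymptotically χ² distributed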
![Page 5: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/5.jpg)
Comparison of Two Observations
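● Two independent observations, X₁ and X₂, can be compared to determine whether they are from the same distribution

$$\lambda = \frac{\max_{\theta \in \Omega} l(X_1 \mid \theta)\, l(X_2 \mid \theta)}{\max_{\theta_1 \in \Omega,\, \theta_2 \in \Omega} l(X_1 \mid \theta_1)\, l(X_2 \mid \theta_2)}$$

$$\text{d.o.f.} = \dim \Omega, \qquad (\theta_1, \theta_2) \in \Omega \times \Omega$$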
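As a concrete instance of the two-observation test, the sketch below assumes each observation is a binomial count (k successes in n trials), so Ω is the single success probability and d.o.f. = 1. The binomial model and the function names are illustrative assumptions, not something taken from the talk.

```python
import math

def binom_loglik(k, n, p):
    """Binomial log-likelihood of k successes in n trials.

    The binomial coefficient is dropped because it cancels in the ratio.
    Terms with a zero count are skipped so that 0 * log(0) counts as 0.
    """
    ll = 0.0
    if k > 0:
        ll += k * math.log(p)
    if n - k > 0:
        ll += (n - k) * math.log(1.0 - p)
    return ll

def two_sample_llr(k1, n1, k2, n2):
    """Return -2 log(lambda) for the hypothesis that both samples share one p."""
    p1, p2 = k1 / n1, k2 / n2          # unrestricted maximum likelihood estimates
    p = (k1 + k2) / (n1 + n2)          # pooled estimate under the null hypothesis
    log_lambda = (binom_loglik(k1, n1, p) + binom_loglik(k2, n2, p)
                  - binom_loglik(k1, n1, p1) - binom_loglik(k2, n2, p2))
    return -2.0 * log_lambda

# Compare against the chi-squared critical value for 1 d.o.f. (about 3.84 at 5%)
same = two_sample_llr(10, 100, 10, 100)
different = two_sample_llr(10, 100, 30, 100)
```

With identical rates the statistic is zero; with 10% versus 30% it is far above the 5% critical value.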
![Page 6: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/6.jpg)
History of LLR Tests for “Text”
● Statistics of Surprise and Coincidence
● Genomic QA tools
● Luduan
● HNC text-mining, preference mining
● MusicMatch recommendation engine
![Page 7: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/7.jpg)
How Useful is LLR?
● A test in 1997 showed that a query construction system using LLR (Luduan) decreased the error rate of the best document routing system (Inquery) by approximately 5x at 10% recall and nearly 2x at 20% recall
● Language and species ID programs showed similar improvements versus state of the art
● Previously unsuspected structure around intron splice sites was discovered using LLR tests
![Page 8: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/8.jpg)
TREC Document Routing Results
[Figure: Luduan vs Inquery. Precision (vertical axis, 0 to 1) against recall (horizontal axis, 0 to 1) for Inquery, Luduan, and Convectis.]
![Page 9: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/9.jpg)
What are Transactional Variables?
● A transactional sequence is a sequence of transactions.
● Transactions are instances of a symbol and (optionally) a time and an amount:
$$Z = z_1 \ldots z_N, \qquad z_i = \langle \sigma_i, t_i, x_i \rangle, \qquad \sigma_i \in \Sigma \text{ (an alphabet of symbols)}, \qquad t_i, x_i \in \mathbb{R}$$
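One way to hold such sequences in code is a small record type with optional time and amount fields. This is only a sketch: the `Transaction` class, its field names, and the sample values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Transaction:
    """One element z_i of a transactional sequence (hypothetical layout)."""
    symbol: str                      # sigma_i, drawn from the alphabet
    time: Optional[float] = None     # t_i, if the data carries timing
    amount: Optional[float] = None   # x_i, if the data carries amounts

# Text: symbols only
text: List[Transaction] = [Transaction(w) for w in "the quick brown fox".split()]

# Timed symbols without amounts (times here are made-up fractional years)
violations = [Transaction("speeding", time=2001.3),
              Transaction("stop-sign", time=2003.7)]

# Symbol, time (days since the first transaction) and amount
card = [Transaction("Cash Advance", 0.0, 300.0),
        Transaction("Groceries", 1.0, 79.0),
        Transaction("Fuel", 4.0, 21.0)]
```

Keeping time and amount optional lets the same type cover text, timed histories, and financial histories.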
![Page 10: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/10.jpg)
Example - Text
● A textual document is a transactional sequence without times or amounts
$$Z = \sigma_1 \ldots \sigma_N, \qquad \sigma_i \in \Sigma$$
![Page 11: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/11.jpg)
Example – Traffic Violation History
● A history of traffic violations is a (hopefully empty) sequence of violation types and associated dates (times)
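$$Z = z_1 \ldots z_N, \qquad z_i = \langle \sigma_i, t_i \rangle, \qquad \sigma_i \in \{\text{stop-sign}, \text{speeding}, \text{DUI}, \ldots\}, \qquad t_i \in \mathbb{R}$$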
![Page 12: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/12.jpg)
Example – Speech Transcript
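● A conversation between a and b can be rendered as a sequence of transactions containing words spoken by either a or b at particular times:

$$Z = z_1 \ldots z_N, \qquad z_i = \langle \sigma_i, t_i \rangle, \qquad \sigma_i \in \{a, b\} \times \Sigma, \qquad t_i \in \mathbb{R}$$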
![Page 13: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/13.jpg)
Example – Financial History
● A credit card history can be viewed as a transactional sequence with merchant code, date (=time) and amount:
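$$Z = z_1 \ldots z_N, \qquad z_i = \langle \sigma_i, t_i, x_i \rangle, \qquad \sigma_i \in \Sigma, \qquad t_i \in \mathbb{R}$$

| Date | Merchant code | Amount ($) |
|----------|------------------|------|
| 9/03/03 | Cash Advance | 300 |
| 9/04/03 | Groceries | 79 |
| 9/07/03 | Fuel | 21 |
| 9/10/03 | Groceries | 42 |
| 9/23/03 | Department Store | 173 |
| 10/03/03 | Payment | -600 |
| 10/09/03 | Hotel & Motel | 104 |
| 10/17/03 | Rental Cars | 201 |
| 10/24/03 | Lufthansa | 838 |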
![Page 14: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/14.jpg)
Proposed Evolution
[Diagram labels: Text; Luduan, etc; LLR tests (twice); Augmented Data; Transactional Data; Data Augmentation; Transaction Mining]
![Page 15: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/15.jpg)
LLR for Transaction Sequence
● Assuming reasonable interactions between timing, symbol selection and amount distribution, LLR test can be decomposed
● Two major terms remain, one for symbols and timing together, one for amounts
$$\text{LLR} = \text{LLR}_{\text{symbols \& timing}} + \text{LLR}_{\text{amounts}}$$
![Page 16: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/16.jpg)
Anecdotal Observations
● Symbol selection often looks multinomial, or (rarely) Markov
● Timing is often nearly Poisson (but rate depends on which symbol)
● Distribution of amount appears to depend on symbol, but generally not on inter-transaction timing. Mixed discrete/continuous distributions are common in financial settings
![Page 17: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/17.jpg)
Transaction Sequence Distributions
● Mixed Poisson distributions give desired symbol/timing behavior
● Amount distribution depends on symbol
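$$p(Z) = \prod_{\sigma \in \Sigma} \frac{(\mu_\sigma T)^{k_\sigma} e^{-\mu_\sigma T}}{k_\sigma!} \prod_{i=1 \ldots N} p(x_i \mid \sigma_i)$$

$$p(Z) = \left[ N! \prod_{\sigma \in \Sigma} \frac{\pi_\sigma^{k_\sigma}}{k_\sigma!} \right] \left[ \frac{(\mu T)^N e^{-\mu T}}{N!} \right] \prod_{i=1 \ldots N} p(x_i \mid \sigma_i)$$

$$\mu_\sigma = \mu \pi_\sigma, \qquad \sum_{\sigma \in \Sigma} \pi_\sigma = 1$$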
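The two forms of p(Z) can be checked numerically. The sketch below evaluates only the symbol and timing part (the amount factor is identical in both forms and is left out); the function names and the example rates are assumptions chosen for illustration.

```python
import math
from collections import Counter

def log_p_per_symbol(counts, rates, T):
    """Log of the product of independent Poisson counts, one per symbol.

    rates maps each symbol to mu_sigma; symbols that never occur still
    contribute their exp(-mu_sigma * T) factor.
    """
    total = 0.0
    for s, mu_s in rates.items():
        k = counts.get(s, 0)
        total += k * math.log(mu_s * T) - mu_s * T - math.lgamma(k + 1)
    return total

def log_p_factored(counts, mu, pi, T):
    """Log of (multinomial over symbols) times (Poisson for the total count N)."""
    n = sum(counts.get(s, 0) for s in pi)
    log_multinomial = math.lgamma(n + 1)
    for s, pi_s in pi.items():
        k = counts.get(s, 0)
        log_multinomial += k * math.log(pi_s) - math.lgamma(k + 1)
    log_poisson_total = n * math.log(mu * T) - mu * T - math.lgamma(n + 1)
    return log_multinomial + log_poisson_total

# Hypothetical model: overall rate mu split across three symbols by pi
pi = {"fuel": 0.5, "groceries": 0.3, "hotel": 0.2}
mu, T = 2.0, 10.0
rates = {s: mu * p for s, p in pi.items()}   # mu_sigma = mu * pi_sigma
counts = Counter(["fuel"] * 9 + ["groceries"] * 7 + ["hotel"] * 2)

per_symbol = log_p_per_symbol(counts, rates, T)
factored = log_p_factored(counts, mu, pi, T)
```

Both values agree up to floating-point error, which is what allows the symbol choice and the overall timing to be tested separately.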
![Page 18: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/18.jpg)
LLR for Multinomial
● Easily expressed as entropy of contingency table
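$$\begin{array}{cccc|c}
k_{11} & k_{12} & \cdots & k_{1n} & k_{1*} \\
k_{21} & k_{22} & \cdots & k_{2n} & k_{2*} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
k_{m1} & k_{m2} & \cdots & k_{mn} & k_{m*} \\
\hline
k_{*1} & k_{*2} & \cdots & k_{*n} & k_{**}
\end{array}$$

With N = k_{**}, π_{ij} = k_{ij} / k_{**}, π_{i*} = k_{i*} / k_{**} and π_{*j} = k_{*j} / k_{**}:

$$-2 \log \lambda = 2N \left[ \sum_{ij} \pi_{ij} \log \pi_{ij} - \sum_i \pi_{i*} \log \pi_{i*} - \sum_j \pi_{*j} \log \pi_{*j} \right]$$

$$-\log \lambda = \sum_{ij} k_{ij} \log \frac{k_{ij}\, k_{**}}{k_{i*}\, k_{*j}} = \sum_{ij} k_{ij} \log \frac{\pi_{ij}}{\pi_{i*}\, \pi_{*j}}$$

$$\text{d.o.f.} = (m-1)(n-1)$$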
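The multinomial statistic is short to compute directly from the table of counts. The following is a minimal sketch in plain Python; the function name `multinomial_llr` is an assumption, and cells with zero counts are skipped on the usual convention that 0 log 0 = 0.

```python
import math

def multinomial_llr(table):
    """Return (-2 log lambda, d.o.f.) for an m x n contingency table of counts."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, k in enumerate(row):
            if k > 0:  # 0 * log(0) is taken as 0
                stat += k * math.log(k * total / (row_sums[i] * col_sums[j]))
    dof = (len(row_sums) - 1) * (len(col_sums) - 1)
    return 2.0 * stat, dof

# Rows are two observations, columns are symbols
independent, dof = multinomial_llr([[10, 20], [20, 40]])   # proportional rows
dependent, _ = multinomial_llr([[10, 0], [0, 10]])         # disjoint rows
```

Proportional rows give a statistic of zero, while completely disjoint rows give 40 log 2, about 27.7, on one degree of freedom.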
![Page 19: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/19.jpg)
LLR for Poisson Mixture
● Easily expressed using timed contingency table
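$$\begin{array}{cccc|c}
k_{11} & k_{12} & \cdots & k_{1n} & t_1 \\
k_{21} & k_{22} & \cdots & k_{2n} & t_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
k_{m1} & k_{m2} & \cdots & k_{mn} & t_m \\
\hline
k_{*1} & k_{*2} & \cdots & k_{*n} & t_*
\end{array}$$

With μ_{ij} = k_{ij} / t_i and μ_{*j} = k_{*j} / t_*:

$$-\log \lambda = \sum_{ij} k_{ij} \log \left( \frac{k_{ij}}{t_i} \cdot \frac{t_*}{k_{*j}} \right) = \sum_{ij} k_{ij} \log \frac{\mu_{ij}}{\mu_{*j}}$$

$$\text{d.o.f.} = (m-1)\, n$$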
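The timed version differs from the multinomial case only in comparing per-row rates (count divided by observation time) with pooled rates. A minimal sketch, assuming each row comes with a positive observation time; the function name is hypothetical.

```python
import math

def poisson_mixture_llr(table, times):
    """Return (-2 log lambda, d.o.f.) for a timed contingency table.

    table[i][j] is the count of symbol j in observation i,
    times[i] is the length of time over which observation i was collected.
    """
    col_sums = [sum(col) for col in zip(*table)]
    t_total = sum(times)
    stat = 0.0
    for i, row in enumerate(table):
        for j, k in enumerate(row):
            if k > 0:  # 0 * log(0) is taken as 0
                row_rate = k / times[i]                # mu_ij
                pooled_rate = col_sums[j] / t_total    # mu_*j
                stat += k * math.log(row_rate / pooled_rate)
    dof = (len(table) - 1) * len(col_sums)
    return 2.0 * stat, dof

# Second observation has twice the counts but also twice the time: same rates
same_rate, dof = poisson_mixture_llr([[10, 20], [20, 40]], [1.0, 2.0])
# Same time window, three times the count: rates differ
faster, _ = poisson_mixture_llr([[10], [30]], [1.0, 1.0])
```

Unlike the multinomial test, this one reacts to a change in overall rate even when the mix of symbols is unchanged.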
![Page 20: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/20.jpg)
LLR for Normal Distribution
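$$p(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\, \sigma} e^{-\frac{(x - \mu)^2}{2 \sigma^2}}$$

$$\hat\mu = \frac{\sum x_i}{N}, \qquad \hat\sigma^2 = \frac{\sum (x_i - \hat\mu)^2}{N}$$

$$-2 \log \lambda = 2 \left[ N_1 \log \frac{\hat\sigma}{\hat\sigma_1} + N_2 \log \frac{\hat\sigma}{\hat\sigma_2} \right]$$

$$\text{d.o.f.} = 2$$

● Assume X₁ and X₂ are normally distributed
● Null hypothesis of identical mean and variance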
![Page 21: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/21.jpg)
Calculations
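$$p(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\, \sigma} e^{-\frac{(x - \mu)^2}{2 \sigma^2}}$$

$$\hat\mu = \frac{\sum_i x_i}{N}, \qquad \hat\sigma^2 = \frac{\sum_i (x_i - \hat\mu)^2}{N}$$

$$\begin{aligned}
\log \lambda &= \log p(X_1 \mid \hat\mu, \hat\sigma) + \log p(X_2 \mid \hat\mu, \hat\sigma) - \log p(X_1 \mid \hat\mu_1, \hat\sigma_1) - \log p(X_2 \mid \hat\mu_2, \hat\sigma_2) \\
&= - \sum_{i=1 \ldots N_1} \left[ \log \sqrt{2\pi} + \log \hat\sigma + \frac{(x_{1i} - \hat\mu)^2}{2 \hat\sigma^2} \right] - \sum_{i=1 \ldots N_2} \left[ \log \sqrt{2\pi} + \log \hat\sigma + \frac{(x_{2i} - \hat\mu)^2}{2 \hat\sigma^2} \right] \\
&\quad + \sum_{i=1 \ldots N_1} \left[ \log \sqrt{2\pi} + \log \hat\sigma_1 + \frac{(x_{1i} - \hat\mu_1)^2}{2 \hat\sigma_1^2} \right] + \sum_{i=1 \ldots N_2} \left[ \log \sqrt{2\pi} + \log \hat\sigma_2 + \frac{(x_{2i} - \hat\mu_2)^2}{2 \hat\sigma_2^2} \right]
\end{aligned}$$

At the maximum likelihood estimates the quadratic terms sum to (N₁ + N₂)/2 under both hypotheses, so they cancel along with the log √(2π) terms, leaving only the log σ terms:

$$-2 \log \lambda = 2 \left[ N_1 \log \frac{\hat\sigma}{\hat\sigma_1} + N_2 \log \frac{\hat\sigma}{\hat\sigma_2} \right]$$

$$\text{d.o.f.} = 2$$

● Assume X₁ and X₂ are normally distributed
● Null hypothesis of identical mean and variance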
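The closed form can be checked against a direct evaluation of the four log-likelihoods. The sketch below is pure Python with hypothetical function names, and it assumes each sample has non-zero variance (otherwise the logarithm is undefined).

```python
import math

def normal_mle(xs):
    """Maximum likelihood mean and standard deviation (divides by N, not N-1)."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return mu, sigma

def normal_loglik(xs, mu, sigma):
    """Sum of log p(x | mu, sigma) over the sample."""
    return sum(-math.log(math.sqrt(2 * math.pi)) - math.log(sigma)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

def normal_llr(x1, x2):
    """Closed form: -2 log lambda = 2 [N1 log(s/s1) + N2 log(s/s2)]."""
    _, s1 = normal_mle(x1)
    _, s2 = normal_mle(x2)
    _, s = normal_mle(list(x1) + list(x2))   # pooled estimate under the null
    return 2.0 * (len(x1) * math.log(s / s1) + len(x2) * math.log(s / s2))

def normal_llr_direct(x1, x2):
    """Same statistic computed term by term from the log-likelihoods."""
    mu, s = normal_mle(list(x1) + list(x2))
    mu1, s1 = normal_mle(x1)
    mu2, s2 = normal_mle(x2)
    log_lambda = (normal_loglik(x1, mu, s) + normal_loglik(x2, mu, s)
                  - normal_loglik(x1, mu1, s1) - normal_loglik(x2, mu2, s2))
    return -2.0 * log_lambda

a = [1.0, 2.0, 3.0]
b = [11.0, 12.0, 13.0]
closed = normal_llr(a, b)
direct = normal_llr_direct(a, b)
```

The two routes give the same number, which confirms that the quadratic and constant terms cancel as the derivation claims.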
![Page 22: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/22.jpg)
Transactional Data in Context
[Figure labels: 1.2, 34 years, male]
Real-world input often consists of one or more bags of transactional values combined with an assortment of conventional numerical or categorical values.
Extracting information from the transactional data can be difficult and is often, therefore, not done.
![Page 23: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/23.jpg)
Real World Target Variables
[Figure labels: Labeled as Red; Mislabeled Instances; Secondary Labels; a; b]
![Page 24: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/24.jpg)
Luduan Modeling Methodology
● Use LLR tests to find exemplars (query terms) from secondary label sets
● Create positive and negative secondary label models for each class of transactional data
● Cluster using output of all secondary label models and all conventional data
● Test clusters for stability
● Use distance to cluster centroids and/or secondary label models as derived input variables
![Page 25: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/25.jpg)
Example #1- Auto Insurance
● Predict probability of attrition and loss for auto insurance customers
● Transactional variables include
– Claim history
– Traffic violation history
– Geographical code of residence(s)
– Vehicles owned
● Observed attrition and loss define past behavior
![Page 26: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/26.jpg)
Derived Variables
● Split training data according to observable classes
– These include attrition and loss > 0
● Define LLR variables for each class/variable combination
● These 2 × m × v derived variables can be used for clustering (spectral, k-means, neural gas ...)
● Proximities in LLR space to clusters are the new modeling variables
![Page 27: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/27.jpg)
Results
● Conventional NN modeling by competent analyst was able to explain 2% of variance
– No significant difference on training/test data
● Models built using Luduan-based cluster proximity variables were able to explain 70% of variance (KS approximately 0.4)
– No significant difference on training/test data
![Page 28: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/28.jpg)
Example #2 – Fraud Detection
● Predict probability that an account is likely to result in charge-off due to payment fraud
● Transactional variables include
– Zip code
– Recent payments and charges
– Recent non-monetary transactions
● Bad payments, charge-off, delinquency are observable behavioral outcomes
![Page 29: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/29.jpg)
Derived Variables
● Split training data according to observable classes (charge-off, NSF payment, delinquency)
● Define LLR variables for each class/variable combination
● These 2 × m × v derived variables can be used directly as model variables
● No results available for publication
![Page 30: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/30.jpg)
Example #3 – E-commerce monitor
● Detect malfunctions or changes in behavior of e-commerce system due to fraud or system failure
● Transaction variables include (time, SKU, amount)
● Desired output is alarm for operational staff
![Page 31: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/31.jpg)
Derived Variables
● Time warp derived as product of smoothed daily and weekly sales rates
● Time warp updated monthly to account for seasonal variations
● Warped time used in transactions
● Warped time since last transaction ≈ LLR in single product/single price case
● Full LLR allows testing for significant difference in Champion/Challenger e-commerce optimizer
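A time warp of this kind can be sketched as the running total of the expected sales rate, so that one unit of warped time always carries the same expected number of sales. The version below assumes time is measured in hours from the start of a week and that the smoothed profiles are given as 24 hourly and 7 daily multipliers; these inputs, the function names, and the piecewise-constant rate are all assumptions for illustration, not the talk's implementation.

```python
def make_time_warp(hour_of_day_rate, day_of_week_rate):
    """Build warp(t) from a 24-entry hourly profile and a 7-entry daily profile.

    warp(t) is the expected rate accumulated from hour 0 up to t (t >= 0, in hours).
    """
    def rate(hour_index):
        # Expected rate during a given hour: product of the two profiles
        return (hour_of_day_rate[hour_index % 24]
                * day_of_week_rate[(hour_index // 24) % 7])

    def warp(t):
        whole = int(t)
        total = sum(rate(h) for h in range(whole))   # completed hours
        return total + (t - whole) * rate(whole)     # partial current hour
    return warp

# Hypothetical shop with no sales for the first 8 hours of each day
quiet_nights = make_time_warp([0.0] * 8 + [1.5] * 16, [1.0] * 7)

def warped_gap(warp, last_time, now):
    """Warped time elapsed since the last transaction."""
    return warp(now) - warp(last_time)

overnight = warped_gap(quiet_nights, 23.0, 32.0)   # 23:00 to 08:00 next day
daytime = warped_gap(quiet_nights, 10.0, 19.0)     # 10:00 to 19:00 same day
```

Nine silent hours overnight correspond to little warped time, while nine silent daytime hours correspond to much more, so only the latter would look alarming.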
![Page 32: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/32.jpg)
Transductive Derived Variables
● All objective segmentations of data provide new LLR variables
● Cross product of model outputs versus objective segmentation provide additional LLR variables for second level model derivation
● Comparable to Luduan query construction technique
– TREC pooled evaluation technique provided cross product of relevance versus perceived relevance
![Page 33: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/33.jpg)
Relationship To Risk Tables
● Risk tables are estimates of relative risk for each value of a single symbolic variable
– Useful with variables such as post-code of primary residence
– Ad hoc smoothing used to deal with small counts
● Not usually applied to symbol sequences
● Risk tables ignore time entirely
● Risk tables require considerable analyst finesse
![Page 34: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/34.jpg)
Relationship to Known Techniques
● Clock-tick symbols
– Time-embedded symbols viewed as sequences of symbols along with “ticks” that occur at fixed time intervals
– Allows multinomial LLR as poor man's mixed Poisson LLR
● Not a well known technique, not used in production models
● Difficulties in choosing time resolution and counting period
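The clock-tick idea can be sketched as a small preprocessing step that interleaves a tick symbol into a timed sequence, after which ordinary multinomial counting applies. The function name, the `<tick>` marker, and the choice of tick length are assumptions for illustration.

```python
def add_clock_ticks(events, tick, t_end, tick_symbol="<tick>"):
    """Interleave tick symbols into a time-sorted list of (symbol, time) pairs.

    One tick is emitted for every multiple of `tick` in (0, t_end];
    times are assumed to lie in [0, t_end].
    """
    out = []
    ticks_emitted = 0
    for symbol, t in events:
        # Emit every tick that falls at or before this event
        while (ticks_emitted + 1) * tick <= t:
            out.append(tick_symbol)
            ticks_emitted += 1
        out.append(symbol)
    # Emit the remaining ticks up to the end of the counting period
    while (ticks_emitted + 1) * tick <= t_end:
        out.append(tick_symbol)
        ticks_emitted += 1
    return out

sequence = add_clock_ticks([("a", 0.5), ("b", 2.5), ("c", 2.7)],
                           tick=1.0, t_end=4.0)
```

The result depends directly on the tick length and on `t_end`, which is the time resolution and counting period difficulty noted above.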
![Page 35: Transactional Data Mining](https://reader033.vdocuments.us/reader033/viewer/2022052412/55946a701a28ab972b8b45e8/html5/thumbnails/35.jpg)
Conclusions
● Theoretical properties of transaction variables are well defined
● Similarities to known techniques indicate low probability of gross failure
● Similarity to Luduan techniques suggests high probability of superlative performance
● Transactional LLR statistics define similarity metrics useful for clustering