A Suffix Tree Approach to Text Classification Applied to Email Filtering
Rajesh Pampapathi, Boris Mirkin, Mark Levene
School of Computer Science and Information Systems
Birkbeck College, University of London
Introduction – Outline
- Motivation: Examples of Spam
- Suffix Tree construction
- Document scoring and classification
- Experiments and results
- Conclusion
Buy cheap medications online, no prescription needed.
We have Viagra, Pherentermine, Levitra, Soma, Ambien, Tramadol and many more products.
No embarrasing trips to the doctor, get it delivered directly to your door.
Experienced reliable service.
Most trusted name brands.
For your solution click here: http://www.webrx-doctor.com/?rid=1000
1. Standard spam mail
zygotes zoogenous zoometric zygosphene zygotactic zygoid zucchettos zymolysis zoopathy zygophyllaceous zoophytologist zygomaticoauricular zoogeologist zymoid zoophytish zoospores zygomaticotemporal zoogonous zygotenes zoogony zymosis zuza zoomorphs zythum zoonitic zyzzyva zoophobes zygotactic zoogenous zombies zoogrpahy zoneless zoonic zoom zoosporic zoolatrous zoophilous zymotically zymosterol
FreeHYSHKRODMonthQGYIHOCSupply.IHJBUMDSTIPLIBJTJUBIYYXFN
* GetJIIXOLDViagraPWXJXFDUUTabletsNXZXVRCBX <http://healthygrow.biz/index.php?id=2>
zonally zooidal zoospermia zoning zoonosology zooplankton zoochemical zoogloeal zoological zoologist zooid zoosphere zoochemical
& Safezoonal andNGASXHBPnatural
& TestedQLOLNYQandEAVMGFCapproved
zonelike zoophytes zoroastrians zonular zoogloeic zoris zygophore zoograft zoophiles zonulas zygotic zymograms zygotene zootomical zymes zoodendrium zygomata zoometries zoographist zygophoric zoosporangium zygotes zumatic zygomaticus zorillas zoocurrent zooxanthella zyzzyvas zoophobia zygodactylism zygotenes zoopathological noZFYFEPBmas <http://healthygrow.biz/remove.php>
5. Embedded message (plus word salad)
Buy meds online and get it shipped to your door Find out more here <http://www.gowebrx.com/?rid=1001>
a publications website accepted definition. known are can Commons the be definition. Commons UK great public principal work Pre-Budget but an can Majesty's many contains statements statements titles (eg includes have website. health, these Committee Select undertaken described may publications
4. Word salads
Creating a Suffix Tree
[Figure: suffix tree built from the strings MEET and FEET. Below ROOT, each node is labelled with a character and, in parentheses, its frequency; frequencies in the figure range from (1) to (4).]
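The construction on this slide can be sketched in a few lines. This is a minimal illustration, assuming an uncompressed character tree in which every suffix of every string is inserted and each node counts how many suffixes pass through it; the names `Node`, `build_suffix_tree` and `show` are hypothetical and not taken from the talk.

```python
class Node:
    """One character in the tree, plus the number of suffixes that pass through it."""

    def __init__(self):
        self.children = {}  # character -> Node
        self.freq = 0


def build_suffix_tree(strings):
    """Insert every suffix of every string, incrementing frequencies along each path."""
    root = Node()
    for s in strings:
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node.children.setdefault(ch, Node())
                node.freq += 1
    return root


def show(node, depth=0):
    """Print one node per line as 'character (frequency)', indented by depth."""
    for ch, child in sorted(node.children.items()):
        print("  " * depth + f"{ch} ({child.freq})")
        show(child, depth + 1)


tree = build_suffix_tree(["MEET", "FEET"])
show(tree)
```

For MEET and FEET this gives thirteen nodes below the root, with the root-level E carrying frequency 4 because the suffixes EET and ET each occur in both strings.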
Levels of Information
Characters: the alphabet (and their frequencies) of a class.
Matches: between query strings and a class.
s = nviaXgraU>Tabl$$$ets
t = xv^ia$graTab£££lets
Matches(s, t) = {v, ia, gra, Tab, l, ets, $}
- But what about overlapping matches?
Trees: properties of the class as a whole.
~ size
~ density (complexity)
Document Similarity Measure
The score for a document, d, is the sum of the scores for each suffix:
$$\mathrm{SCORE}(d, T) = \frac{1}{\tau} \sum_{i=0}^{n} \mathrm{score}(d(i), T)$$
d(i) is the suffix of d beginning at the ith letter
tau ($\tau$) is a tree normalisation coefficient
Substring Similarity Measure
Score for match, m = m0m1m2…mn, is score(m):
$$\mathrm{score}(m) = v(m|T) \sum_{t=0}^{n} \Phi[p(m_t)]$$
T is the tree profile of the class.
v(m|T) is a normalisation coefficient based on the properties of T.
p(mt) is the probability of the character, mt, of the match m.
Φ[p] is a significance function.
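The two scoring formulas can be sketched together. This is an illustration under stated assumptions rather than the authors' implementation: the tree is an uncompressed character tree stored as nested dictionaries, the match m for a suffix is taken to be its longest prefix found in the tree, and p(mt) is estimated as a node's frequency divided by the total frequency of all children of the same parent. The slides do not spell out that estimate, so it is an assumption here, as are all function names.

```python
def build_suffix_tree(strings):
    """Uncompressed suffix tree as nested dicts: {"freq": int, "children": {char: node}}."""
    root = {"freq": 0, "children": {}}
    for s in strings:
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["children"].setdefault(ch, {"freq": 0, "children": {}})
                node["freq"] += 1
    return root


def score_match(suffix, tree, phi=lambda p: 1.0, v=1.0):
    """score(m) = v(m|T) * sum over t of phi[p(m_t)], m being the matched prefix of suffix."""
    total, node = 0.0, tree
    for ch in suffix:
        child = node["children"].get(ch)
        if child is None:
            break  # the match ends at the first character not in the tree
        siblings = sum(c["freq"] for c in node["children"].values())
        total += phi(child["freq"] / siblings)  # assumed estimate of p(m_t)
        node = child
    return v * total


def score_document(d, tree, phi=lambda p: 1.0, tau=1.0):
    """SCORE(d, T) = (1 / tau) * sum over i of score(d(i), T), for every suffix d(i)."""
    return sum(score_match(d[i:], tree, phi) for i in range(len(d))) / tau


tree = build_suffix_tree(["MEET", "FEET"])
print(score_document("MEET", tree))                    # constant phi
print(score_document("MEET", tree, phi=lambda p: p))   # linear phi
print(score_document("SEAT", tree))                    # only E and T match
```

With the defaults (constant Φ, v = 1, tau = 1) the score is simply the total number of matched characters over all suffixes, which makes the effect of each normalisation easy to compare.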
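Decision Mechanism
$$\frac{\mathrm{SCORE}(d, T_H)}{\mathrm{SCORE}(d, T_S)} > \text{threshold} \;\Rightarrow\; \text{HAM}$$
$$\frac{\mathrm{SCORE}(d, T_H)}{\mathrm{SCORE}(d, T_S)} \le \text{threshold} \;\Rightarrow\; \text{SPAM}$$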
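A threshold rule of this kind might look as follows. The sketch assumes the ratio compared against the threshold is the ham-tree score over the spam-tree score, with larger ratios meaning ham; the default threshold of 1.0 and the handling of a zero spam score are further assumptions made for the example, and `classify` is a hypothetical name.

```python
def classify(score_ham, score_spam, threshold=1.0):
    """Return "HAM" if score_ham / score_spam exceeds the threshold, otherwise "SPAM"."""
    if score_spam == 0:
        # Assumption of this sketch: with no overlap with the spam tree the ratio
        # is treated as infinite, unless the ham score is zero as well.
        return "HAM" if score_ham > 0 else "SPAM"
    return "HAM" if score_ham / score_spam > threshold else "SPAM"


print(classify(10.0, 2.0))                 # ratio 5.0
print(classify(2.0, 10.0))                 # ratio 0.2
print(classify(3.0, 2.0, threshold=2.0))   # ratio 1.5, below a stricter threshold
```

Raising the threshold makes the classifier more reluctant to call a message ham, which trades ham-side errors against spam-side errors.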
Specifications of Φ[p] (character level)
Constant: 1
Linear: $p$
Square: $p^{2}$
Root: $p^{0.5}$
Logit: $\ln(p) - \ln(1-p)$
Sigmoid: $(1 + \exp(-p))^{-1}$
Note: Logit and Sigmoid need to be adjusted to fit in the range [0,1]
Significance function
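For reference, the six significance functions can be written directly as code. This sketch gives the raw, unadjusted forms only: the slide notes that Logit and Sigmoid must be rescaled to [0,1] but does not say how, so no rescaling is attempted. The dictionary name `SIGNIFICANCE` is hypothetical.

```python
import math

# Raw forms of the significance functions; p is a probability in [0, 1].
SIGNIFICANCE = {
    "constant": lambda p: 1.0,
    "linear": lambda p: p,
    "square": lambda p: p ** 2,
    "root": lambda p: p ** 0.5,
    "logit": lambda p: math.log(p) - math.log(1 - p),  # undefined at p = 0 and p = 1
    "sigmoid": lambda p: 1.0 / (1.0 + math.exp(-p)),
}

for name, phi in SIGNIFICANCE.items():
    print(f"{name:8s} phi(0.5) = {phi(0.5):.4f}")
```

Note that the raw logit is negative for p below 0.5 and the raw sigmoid never falls below 0.5 on [0,1], which is why some adjustment is needed before either is used as a score contribution.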
Threshold Variation ~ Significance functions ~
Match normalisation
Match unnormalised: 1
Match permutation normalised:
$$v(m|T) = \frac{f(m|T)}{\sum_{i \in (m^{*}|T)} f(i)}$$
Match length normalised:
$$v(m|T) = \frac{f(m|T)}{\sum_{i \in (m'|T)} f(i)}$$
m* is the set of all strings formed by permutations of m
m’ is the set of all strings of length equal to length of m
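The two normalised coefficients can be sketched on a small tree. The example assumes f(m|T) is the frequency stored at the node reached by following m from the root of an uncompressed character tree, and that only strings actually present in T contribute to the sums; both readings are assumptions, and the function names are hypothetical.

```python
from itertools import permutations


def build_suffix_tree(strings):
    """Uncompressed suffix tree as nested dicts: {"freq": int, "children": {char: node}}."""
    root = {"freq": 0, "children": {}}
    for s in strings:
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["children"].setdefault(ch, {"freq": 0, "children": {}})
                node["freq"] += 1
    return root


def freq(m, tree):
    """f(m|T): frequency of the node reached by m, or 0 if m is not in the tree."""
    node = tree
    for ch in m:
        node = node["children"].get(ch)
        if node is None:
            return 0
    return node["freq"]


def permutation_normalised(m, tree):
    """f(m|T) over the summed frequency of every distinct permutation of m."""
    total = sum(freq("".join(p), tree) for p in set(permutations(m)))
    return freq(m, tree) / total if total else 0.0


def length_normalised(m, tree):
    """f(m|T) over the summed frequency of every string in T as long as m."""
    level = [tree]
    for _ in m:
        level = [child for node in level for child in node["children"].values()]
    total = sum(node["freq"] for node in level)
    return freq(m, tree) / total if total else 0.0


tree = build_suffix_tree(["MEET", "FEET"])
print(permutation_normalised("EET", tree))  # EET is the only permutation present
print(length_normalised("EET", tree))       # MEE, FEE and EET share depth 3
```

Enumerating permutations grows factorially with the length of m, so this direct form is only practical for short matches.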
Match normalisation
MUN: match unnormalised; MPN: permutation normalised; MLN: length normalised
Threshold Variation ~ match normalisation ~
Constant significance function, unnormalised
Constant significance function, match normalised
Specifications of tau
Unnormalised: 1
Size(T): The total number of nodes
Density(T): The average number of children of internal nodes
AvFreq(T): Average frequency of nodes
Tree normalisation
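The three tree statistics are simple to compute once the tree exists. The sketch below assumes an uncompressed character tree stored as nested dictionaries, leaves the root out of every count, and treats any non-root node with at least one child as internal; the slides do not state these conventions, so they are assumptions, as are the function names.

```python
def build_suffix_tree(strings):
    """Uncompressed suffix tree as nested dicts: {"freq": int, "children": {char: node}}."""
    root = {"freq": 0, "children": {}}
    for s in strings:
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["children"].setdefault(ch, {"freq": 0, "children": {}})
                node["freq"] += 1
    return root


def nodes(tree):
    """Yield every node below the root."""
    stack = list(tree["children"].values())
    while stack:
        node = stack.pop()
        yield node
        stack.extend(node["children"].values())


def size(tree):
    """Size(T): total number of nodes."""
    return sum(1 for _ in nodes(tree))


def density(tree):
    """Density(T): average number of children of internal nodes."""
    counts = [len(n["children"]) for n in nodes(tree) if n["children"]]
    return sum(counts) / len(counts) if counts else 0.0


def av_freq(tree):
    """AvFreq(T): average frequency of nodes."""
    freqs = [n["freq"] for n in nodes(tree)]
    return sum(freqs) / len(freqs) if freqs else 0.0


tree = build_suffix_tree(["MEET", "FEET"])
print(size(tree), density(tree), av_freq(tree))
```

Any of these values can be passed as tau when scoring a document, so that scores from a large spam tree and a small ham tree become comparable.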
Androutsopoulos et al. (2000)
~ Ling-Spam Corpus ~

Classifier           Pre-processing           Number of Features   Spam Recall Error   Spam Precision Error
Naïve Bayes (NB)     Lemmatizer + Stop-List   100                  17.22%              0.51%
Suffix Tree (ST)     None                     N/A                  2.50%               0.21%
Naïve Bayes* (NB*)   Lemmatizer + Stop-List   Unlimited            0.84%               2.86%

Classifier           Pre-processing           Number of Features   Spam Recall Error   Spam Precision Error
Naïve Bayes (NB)     Lemmatizer + Stop-List   300                  36.95%              0%
Suffix Tree (ST)     None                     N/A                  3.96%               0%
Naïve Bayes* (NB*)   Lemmatizer + Stop-List   Unlimited            10.42%              0%
~ Ling-BKS Corpus ~

Classifier           Pre-processing           False Positive Rate   False Negative Rate
Suffix Tree (ST)     None                     0%                    0%
Naïve Bayes* (NB*)   Lemmatizer + Stop-List   0%                    12.25%
~ SpamAssassin Corpus ~

Classifier           Pre-processing           False Positive Rate   False Negative Rate
Suffix Tree (ST)     None                     3.50%                 3.25%
Naïve Bayes* (NB*)   Lemmatizer + Stop-List   10.50%                1.50%
Conclusions
Good overall classifier
- improvement on naïve Bayes
- but there’s still room for improvement
Can one method ever maintain 100% accuracy?
Extending the classifier
Applications to other domains
- web page classification
Future Work - ODP
Computational Performance

Data Set             Training (s)   Av. Spam (ms)   Av. Ham (ms)   Av. Peak Mem.
LS-FULL (7.40MB)     63             843             659            765MB
LS-11 (1.48MB)       36             221             206            259MB
SAeh-11 (5.16MB)     155            504             2528           544MB
BKS-LS-11 (1.12MB)   41             161             222            345MB
Experimental Data Sets
Ling-Spam (LS)
Spam (481) collected by Androutsopoulos et al.
Ham (2412) from online linguists’ bulletin board
Spam Assassin
- Easy (SAe)
- Hard (SAh)
Spam (1876) and ham (4176) examples donated
BBK
Spam (652) collected by Birkbeck
Androutsopoulos et al. (2000)
~ Ling-Spam Corpus ~

Classifier Configuration   Threshold   No. of Attrib.   Spam Recall   Spam Precision
Bare                       0.5         50               81.10%        96.85%
Stop-List                  0.5         50               82.35%        97.13%
Lemmatizer                 0.5         100              82.35%        99.02%
Lemmatizer + Stop-List     0.5         100              82.78%        99.49%
Bare                       0.9         200              76.94%        99.46%
Stop-List                  0.9         200              76.11%        99.47%
Lemmatizer                 0.9         100              77.57%        99.45%
Lemmatizer + Stop-List     0.9         100              78.41%        99.47%
Bare                       0.999       200              73.82%        99.43%
Stop-List                  0.999       200              73.40%        99.43%
Lemmatizer                 0.999       300              63.67%        100.00%
Lemmatizer + Stop-List     0.999       300              63.05%        100.00%
Androutsopoulos et al. (2000)
~ Ling-Spam Corpus ~

Classifier           Configuration            Spam Recall Error   Spam Precision Error
Naïve Bayes (NB)     Lemmatizer + Stop-List   17.22%              0.51%
Suffix Tree (ST)     N/A                      2.5%                0.21%
Naïve Bayes* (NB*)   Lemmatizer + Stop-List   0.84%               2.86%

Classifier           Configuration            Spam Recall Error   Spam Precision Error
Naïve Bayes (NB)     Lemmatizer + Stop-List   36.95%              0%
Suffix Tree (ST)     N/A                      3.96%               0%
Naïve Bayes* (NB*)   Lemmatizer + Stop-List   10.42%              0%
~ SpamAssassin Corpus ~

Classifier           Configuration            Spam Recall   Spam Precision
Naïve Bayes (NB)     Lemmatizer + Stop-List   82.78%        99.49%
Suffix Tree (ST)     N/A                      97.50%        99.79%
Naïve Bayes* (NB*)   Lemmatizer + Stop-List   99.16%        97.14%
Vector Space Model
“What then?” sang Plato’s ghost, “What then?”
W. B. Yeats

what   host   plate   Plato   ghost   then   sang   book
2      0      0       1       1       2      1      0

Word Probability
P(w = ‘what’) = 50/1000 = 0.05
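A short sketch of the vector space representation: each document becomes a vector of counts over a fixed vocabulary, and a word probability is a count divided by a total. The tokenisation rule (lower-case runs of letters) is an assumption made for this example, as are the function names; the vocabulary and the 50/1000 figure are those on the slide.

```python
import re
from collections import Counter


def word_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in text, ignoring case."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word.lower()] for word in vocabulary]


def word_probability(word_count, total_words):
    """Relative frequency of a word among all word occurrences."""
    return word_count / total_words


vocabulary = ["what", "host", "plate", "Plato", "ghost", "then", "sang", "book"]
line = "“What then?” sang Plato’s ghost, “What then?”"

print(word_vector(line, vocabulary))  # [2, 0, 0, 1, 1, 2, 1, 0]
print(word_probability(50, 1000))     # 0.05
```

Because only whole words are counted, "host" and "plate" score zero even though they are close to "ghost" and "Plato" character by character, which is the limitation the suffix tree approach is meant to address.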
Creating Profiles
[Figure: building a profile; surviving label: Mark]
Profiles
Mark Levene: data, databases, information, search, engines
Mike Hu: data, intelligence, criminal, computational, police
Classification
[Figure: profiles of Boris Mirkin, Mark Levene and Mike Hu, with similarity scores S_BM, S_ML and S_MH]
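Naïve Bayes (similarity measure)
For a document d = {d1 d2 d3 … dm} and set of classes c = {c1, c2 ... cJ}:
$$P(c_j \mid d) \propto P(c_j) \prod_{i=1}^{m} P(d_i \mid c_j) \qquad (1)$$
Where:
$$\tilde{P}(c_j) = \frac{N_j}{N} \qquad (2)$$
$$\tilde{P}(d_i \mid c_j) = \frac{1 + n_{ij}}{M + \sum_{k=1}^{M} n_{kj}} \qquad (3)$$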
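The naïve Bayes measure can be sketched as follows. The symbols are not defined on the slide, so this example assumes the usual multinomial reading: N is the number of training documents and N_j the number in class c_j, n_ij is the number of occurrences of word i in class c_j, and M is the vocabulary size. Logarithms are used to avoid underflow, and the tiny training set is hypothetical data invented for the illustration.

```python
import math
from collections import Counter


def train(docs_by_class):
    """docs_by_class maps a class label to a list of tokenised documents."""
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d}
    n_docs = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for label, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d)
        model[label] = {
            "prior": len(docs) / n_docs,    # equation (2): N_j / N
            "counts": counts,               # n_ij
            "total": sum(counts.values()),  # sum over k of n_kj
        }
    return model, len(vocab)


def log_posterior(doc, label, model, M):
    """log of P(c_j) * product of P(d_i | c_j), using the smoothed estimate (3)."""
    c = model[label]
    score = math.log(c["prior"])
    for w in doc:
        score += math.log((1 + c["counts"][w]) / (M + c["total"]))
    return score


def classify(doc, model, M):
    """Pick the class with the highest (log) posterior, equation (1)."""
    return max(model, key=lambda label: log_posterior(doc, label, model, M))


# Hypothetical training data, for illustration only.
training = {
    "spam": [["buy", "cheap", "meds"], ["cheap", "meds", "online"]],
    "ham": [["suffix", "tree", "paper"], ["tree", "paper", "review"]],
}
model, M = train(training)
print(classify(["cheap", "meds"], model, M))
print(classify(["tree", "review"], model, M))
```

The add-one term in the numerator keeps a word that never occurred in a class from driving the whole product to zero.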
Criticisms
Pre-processing:
- Stop-word removal
- Word stemming/lemmatisation
- Punctuation and formatting
Smallest unit of consideration is a word.
Classes (and documents) are bags of words, i.e. each word is independent of all others.
Word Dependencies
Boris Mirkin: data intelligence clustering computational means
Mike Hu: data intelligence criminal computational means
Word Inflections
Intellig- OR intelligent
Intelligent
Intelligence
Intelligentsia
Intelligible
Success measures
Here #(X → Y) is the number of class-X messages classified as Y, with S for spam and H for ham.
Recall is the proportion of correctly classified examples of a class.
If SR is spam recall, then (1 - SR) gives the proportion of false negatives.
$$SR = \frac{\#(S \to S)}{\#(S \to S) + \#(S \to H)}$$
Precision is the proportion assigned to a class which are true members of that class. It is a measure of the number of true positives.
If SP is spam precision, then (1 – SP) would give the proportion of false positives.
$$SP = \frac{\#(S \to S)}{\#(S \to S) + \#(H \to S)}$$
Success measures
True Positive Rate (TPR) is the proportion of correctly classified examples of the ‘positive’ class.
Spam is typically taken as the positive class, so TPR is then the number of spam classified as spam over the total number of spam.
$$TPR = \frac{\#(\text{Spam} \to \text{Spam})}{\text{TotalSpam}}$$
False Positive Rate (FPR) is the proportion of the ‘negative’ class erroneously assigned to the ‘positive’ class.
Ham is typically taken as the negative class, so FPR is then the number of ham classified as spam over the total number of ham.
$$FPR = \frac{\#(\text{Ham} \to \text{Spam})}{\text{TotalHam}}$$
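The four success measures all come from the same four counts, as this sketch shows. The function name and the counts in the usage example are hypothetical and are not results from the experiments; the sketch also assumes every denominator is non-zero.

```python
def spam_metrics(ss, sh, hs, hh):
    """Success measures from classification counts.

    ss: spam classified as spam    sh: spam classified as ham
    hs: ham classified as spam     hh: ham classified as ham
    """
    return {
        "SR": ss / (ss + sh),   # spam recall
        "SP": ss / (ss + hs),   # spam precision
        "TPR": ss / (ss + sh),  # true positive rate, spam as the positive class
        "FPR": hs / (hs + hh),  # false positive rate, ham as the negative class
    }


# Hypothetical counts, for illustration only.
m = spam_metrics(ss=90, sh=10, hs=5, hh=195)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With spam as the positive class, spam recall and TPR are the same quantity, while spam precision and FPR differ because they divide the ham-as-spam count by different totals.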
Classifier Structure
Training Data
Profiling Method
Profile Representation
Similarity/Comparison Measure
Decision Mechanism or Classification Criterion
Decision
[Diagram labels: Spam, Ham, ?]
Classification using a suffix tree
Method of profiling is construction of the tree
(no pre-processing, no post-processing)
The tree is a profile of the class.
Similarity measure?
Decision mechanism?
Threshold Variation ~ match normalisation ~
Constant significance function, unnormalised
Constant significance function, match normalised
SPE = spam precision error; HPE = ham precision error
Threshold Variation ~ Significance functions ~
Root function, no normalisation
Logit function, no normalisation
SPE = spam precision error; HPE = ham precision error
Threshold Variation
Constant significance function (unnormalised)
SPE = spam precision error; HPE = ham precision error