unsupervised detection of anomalous text
![Page 1: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/1.jpg)
Unsupervised Detection of Anomalous Text
David Guthrie
The University of Sheffield
![Page 2: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/2.jpg)
Textual Anomalies
• Computers are routinely used to detect differences from what is normal or expected
– fraud
– network attacks
• Principal focus of this research is to similarly detect text that is irregular
• We view text that deviates from its context as a type of anomaly
![Page 3: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/3.jpg)
New Document Collection
Anomalous Documents?
![Page 4: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/4.jpg)
Anomalous Documents?
Find text that is unusual
![Page 5: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/5.jpg)
New Document
Anomalous Segments?
![Page 6: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/6.jpg)
New Document
Anomalous Segments?
![Page 7: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/7.jpg)
New Document
Anomalous Segments?
![Page 8: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/8.jpg)
New Document
Anomalous Segments?
Anomalous
![Page 9: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/9.jpg)
Motivation
• Plagiarism
– Writing style of plagiarized passages is anomalous with respect to the rest of the author's work
– Detect such passages because the writing is “odd”, not by using external resources (web)
• Improving Corpora
– Automatically gathered corpora can contain errors. Improve their integrity and homogeneity.
• Unsolicited Email
– E.g. spam constructed from sentences
• Undesirable Bulletin Board or Wiki posts
– E.g. rants on Wikipedia
![Page 10: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/10.jpg)
Goals
• To develop a general approach which recognizes:
– different dimensions of anomaly
– fairly small segments (50 to 100 words)
– Multiple anomalous segments
![Page 11: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/11.jpg)
Unsupervised
• For this task we assume there is no training data available to characterize “normal” or “anomalous” language
• When we first look at a document we have no idea which segments are “normal” and which are “anomalous”
• Segments are anomalous with respect to the rest of the document not to a training corpus
![Page 12: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/12.jpg)
Outlier Detection
• Treat the problem as a type of outlier detection
• We aim to find pieces of text in a corpus that differ significantly from the majority of text in that corpus and thus are ‘outliers’
![Page 13: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/13.jpg)
Characterizing Text
• 166 features computed for every piece of text (many of which have been used successfully for genre classification by Biber, Kessler, Argamon, …)
• Simple Surface Features
• Readability Measures
• POS Distributions (RASP)
• Vocabulary Obscurity
• Emotional Affect (General Inquirer Dictionary)
![Page 14: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/14.jpg)
Readability Measures
• Attempt to provide a rough indication of the reading level required for a text
• Purported to correspond to how “easily” a text is read
• Work well for differentiating certain texts (scores are Flesch Reading Ease)
– Romeo & Juliet 84
– Plato’s Republic 69
– Comic Books 92
– Sports Illustrated 63
– New York Times 39
– IRS Code -6
![Page 15: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/15.jpg)
Readability Measures
• Flesch-Kincaid Reading Ease
• Flesch-Kincaid Grade Level
• Gunning-Fog Index
• Coleman-Liau Formula
• Automated Readability Index
• Lix Formula
• SMOG Index

$$ \text{GradeLevel} = 11.8 \left( \frac{\text{total syllables}}{\text{total words}} \right) + 0.39 \left( \frac{\text{total words}}{\text{total sentences}} \right) - 15.59 $$

$$ \text{ReadingEase} = 206.835 - 1.015 \left( \frac{\text{total words}}{\text{total sentences}} \right) - 84.6 \left( \frac{\text{total syllables}}{\text{total words}} \right) $$

$$ \text{FogIndex} = 0.4 \left( \left( \frac{\text{total words}}{\text{total sentences}} \right) + \left( \frac{\text{words with 3 or more syllables}}{\text{words}} \right) \times 100 \right) $$

$$ \text{Coleman-Liau} = 5.89 \left( \frac{\text{total characters}}{\text{total words}} \right) - 0.3 \left( \frac{\text{total sentences}}{\text{total words}} \times 100 \right) - 15.8 $$

$$ \text{ARI} = 4.71 \left( \frac{\text{total characters}}{\text{total words}} \right) + 0.5 \left( \frac{\text{total words}}{\text{total sentences}} \right) - 21.43 $$

$$ \text{Lix} = \left( \frac{\text{total words}}{\text{total sentences}} \right) + 100 \left( \frac{\text{total words with at least 6 letters}}{\text{total words}} \right) $$

$$ \text{SMOG} = 3 + \sqrt{\frac{\text{words with 3 or more syllables} \times 30}{\text{total sentences}}} $$
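As a rough illustration of how three of these scores could be computed, here is a minimal Python sketch. It is not the implementation used in this work: the sentence splitter, the word tokenizer, and especially the vowel-group syllable counter `count_syllables` are simplifying assumptions, so scores will differ from those of a careful implementation.

```python
import re


def count_syllables(word: str) -> int:
    """Crude syllable estimate (assumption): count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def readability(text: str) -> dict:
    """Compute Reading Ease, Grade Level and ARI from raw text."""
    # Naive segmentation: sentences end at ., ! or ?; words are runs of letters.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    chars_per_word = sum(len(w) for w in words) / len(words)

    return {
        "reading_ease": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        "grade_level": 11.8 * syllables_per_word + 0.39 * words_per_sentence - 15.59,
        "ari": 4.71 * chars_per_word + 0.5 * words_per_sentence - 21.43,
    }


if __name__ == "__main__":
    print(readability("The cat sat. The dog ran."))
```

Each score becomes one feature of a segment; the remaining formulas follow the same pattern with different counts.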
![Page 16: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/16.jpg)
Obscurity of Vocabulary
• Implemented new features to capture vocabulary richness used in a segment of text
• Lists of most frequent words in Gigaword
• Measure distribution of words in a segment of text in each group of words
– Top 1,000 words
– Top 5,000 words
– Top 10,000 words
– Top 50,000 words
– Top 100,000 words
– Top 200,000 words
– Top 300,000 words
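One way to read this feature is as the fraction of a segment's tokens that fall inside each "top k" list. The sketch below assumes a frequency-ordered word list `ranked_words` (most frequent first) is already available; the function name, the lower-casing, and the cumulative interpretation of the bands are assumptions for illustration, not details taken from the slides.

```python
def obscurity_profile(tokens, ranked_words,
                      cutoffs=(1_000, 5_000, 10_000, 50_000,
                               100_000, 200_000, 300_000)):
    """Fraction of tokens found among the top-k most frequent words, per cutoff k."""
    # 1-based frequency rank of every word in the reference list.
    rank = {word: i + 1 for i, word in enumerate(ranked_words)}
    total = len(tokens)
    profile = {}
    for k in cutoffs:
        in_band = sum(1 for t in tokens if rank.get(t.lower(), float("inf")) <= k)
        profile[k] = in_band / total if total else 0.0
    return profile


if __name__ == "__main__":
    # Tiny hypothetical frequency list standing in for a Gigaword-derived one.
    ranked = ["the", "of", "cat", "sat"]
    print(obscurity_profile(["The", "cat", "sat", "zebra"], ranked, cutoffs=(2, 4)))
```

A segment with unusually obscure vocabulary would show low fractions even for the larger lists.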
![Page 17: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/17.jpg)
Part-of-Speech
• All segments are passed through the RASP (Robust and Accurate Statistical Parser) part-of-speech tagger
• All words tagged with one of 155 part-of-speech tags from the CLAWS 2 tagset
![Page 18: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/18.jpg)
Part-of-Speech
• Ratio of adjectives to nouns
• % of sentences that begin with a subordinating or coordinating conjunction (but, so, then, yet, if, because, unless, or…)
• % articles
• % prepositions
• % pronouns
• % adjectives
• % conjunctions
• Diversity of POS trigrams

$$ \text{POS Trigram Density} = \left( \frac{\text{Number of different POS trigrams}}{\text{Total number of POS trigrams}} \right) \times 100 $$
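The trigram measure can be sketched in a few lines of Python. The input is assumed to be a list of POS tags for one segment, already produced by a tagger; the tags in the example are placeholders rather than actual CLAWS 2 tags.

```python
def pos_trigram_density(tags):
    """Percentage of POS trigrams in a segment that are distinct."""
    trigrams = [tuple(tags[i:i + 3]) for i in range(len(tags) - 2)]
    if not trigrams:
        return 0.0  # fewer than three tags: no trigrams to measure
    return 100 * len(set(trigrams)) / len(trigrams)


if __name__ == "__main__":
    # Hypothetical tag sequence: 5 trigrams, 3 of them distinct.
    print(pos_trigram_density(["DT", "NN", "VB", "DT", "NN", "VB", "DT"]))
```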
![Page 19: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/19.jpg)
Morphological Analysis
• Texts are also run through the RASP morphological analyser, which produces word lemmas and inflectional affixes
• Gather statistics about the percentage of passive sentences and amount of nominalization
– were → be + ed
– made → make + ed
– thinking → think + ing
– apples → apple + s
![Page 20: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/20.jpg)
Rank Features
• Store lists ordered by the frequency of occurrence of certain stylistic phenomena
– Most frequent POS trigrams list
– Most frequent POS bigram list
– Most frequent POS list
– Most frequent Articles list
– Most frequent Prepositions list
– Most frequent Conjunctions list
– Most frequent Pronouns list
![Page 21: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/21.jpg)
List Rank Similarity
• To calculate the similarity between two segments' lists, we use Spearman’s Rank Correlation measure

$$ \text{SpearmanRankCorrelation} = 1 - \frac{6 \sum d^2}{N(N^2 - 1)} $$
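A small Python sketch of this formula follows, where d is the difference between an item's positions in the two lists and N is the number of items. It assumes both lists rank exactly the same items with no ties; how items that appear in only one list are handled is not stated on the slide, so the sketch simply rejects that case.

```python
def spearman_rank_correlation(list_a, list_b):
    """Spearman's rank correlation between two orderings of the same items."""
    if len(list_a) < 2 or len(set(list_a)) != len(list_a):
        raise ValueError("need at least two distinct items")
    if set(list_a) != set(list_b) or len(list_a) != len(list_b):
        raise ValueError("both lists must rank the same items (assumption of this sketch)")

    position_b = {item: r for r, item in enumerate(list_b)}
    n = len(list_a)
    # d is the difference in rank of each item between the two lists.
    sum_d2 = sum((r - position_b[item]) ** 2 for r, item in enumerate(list_a))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


if __name__ == "__main__":
    print(spearman_rank_correlation(["the", "a", "an"], ["the", "an", "a"]))
```

Identical orderings give 1 and fully reversed orderings give -1, so `1 - correlation` can serve as a distance between two rank features.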
![Page 22: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/22.jpg)
Sentiment
• General Inquirer Dictionary (Developed by social science department at Harvard)
• 7,800 words tagged with 114 categories:
– Positive
– Negative
– Strong
– Weak
– Active
– Passive
– Overstated
– Understated
– Agreement
– Disagreement
– Negate
– Casual slang
– Think
– Know
– Compare
– Person Relations
– Need
– Power Gain
– Power Loss
– Affection
– Work
– and many more …
![Page 23: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/23.jpg)
Representation
• Characterize each piece of text (document, segment, paragraph, …) in our corpus as a vector of features
• Use these vectors to construct a matrix, X, which has number of rows equal to the pieces of text in the corpus and number of columns equal to the number of features
![Page 24: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/24.jpg)
Feature Matrix
X
Represent each piece of text as a vector of features
Document or corpus
![Page 25: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/25.jpg)
Feature Matrix
X
Identify outlying Text
Document or corpus
![Page 26: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/26.jpg)
Approaches
• Mean Distance: • Compute average distance from other segments
• Comp Distance: • compute a segment’s difference from its complement
• SDE Distance• Find the projection of the data where segments appear farthest
![Page 27: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/27.jpg)
Mean Distance
![Page 28: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/28.jpg)
Finding Outlying Segments
Feature Matrix: one row per segment (1 … n), one column per feature (f1 … fn)
• Calculate the distance from segment 1 to segment 2
Dist = 0.5
![Page 29: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/29.jpg)
Finding Outlying Segments
Feature Matrix: one row per segment (1 … n), one column per feature (f1 … fn)
• Calculate the distance from segment 1 to segment 3
Dist = 0.3
![Page 30: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/30.jpg)
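Finding Outlying Segments
Feature Matrix: one row per segment (1 … n), one column per feature (f1 … fn)
Build a Distance Matrix

| seg | 1 | 2 | 3 | 4 | 5 | 6 | 7 | n |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0.5 | 0.3 | 0.4 | 0.6 | 0.1 | 0.9 | … |
| 2 | 0.5 | 0 | 0.2 | 0.2 | 0.1 | 0.2 | 0.7 | … |
| 3 | 0.3 | 0.2 | 0 | 0.1 | 0.6 | 0.1 | 0.3 | … |
| 4 | 0.4 | 0.2 | 0.1 | 0 | 0.3 | 0.8 | 0.8 | … |
| 5 | 0.6 | 0.1 | 0.6 | 0.3 | 0 | 0.3 | 0.2 | … |
| 6 | 0.1 | 0.2 | 0.1 | 0.8 | 0.3 | 0 | 0.8 | … |
| 7 | 0.9 | 0.7 | 0.3 | 0.8 | 0.2 | 0.8 | 0 | … |
| n | … | … | … | … | … | … | … | 0 |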
![Page 31: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/31.jpg)
Finding Outlying Segments
Feature Matrix: one row per segment (1 … n), one column per feature (f1 … fn)
Choose the segment that is most different
Distance Matrix (as on the previous slide)
Outlier
![Page 32: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/32.jpg)
Ranking Segments
Diagram labels: Feature Matrix (segments 1 … n by features f1 … fn), Distance Matrix (segments by segments), List of Segments
Produce a Ranking of Segments

| Segment Number | Distance |
|---|---|
| 7 | 372.4 |
| 1 | 345.1 |
| 3 | 211.2 |
| 5 | 121.9 |
| 2 | 59.1 |
| 4 | 36.2 |
| 6 | 23.4 |
| … | … |
| n | |
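The Mean Distance procedure can be sketched directly from a distance matrix: average each row (ignoring the zero diagonal) and sort. The sketch below uses the first seven rows and columns of the illustrative distance matrix from the earlier slide; the function name is hypothetical, and the resulting order need not match the ranking table above, whose values are a separate illustration.

```python
def rank_by_mean_distance(dist):
    """Rank segments (1-based) by average distance to all other segments."""
    n = len(dist)
    scores = []
    for i in range(n):
        mean = sum(dist[i][j] for j in range(n) if j != i) / (n - 1)
        scores.append((i + 1, mean))
    # Most outlying segment first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# First 7 x 7 block of the distance matrix shown on the slides.
DIST = [
    [0.0, 0.5, 0.3, 0.4, 0.6, 0.1, 0.9],
    [0.5, 0.0, 0.2, 0.2, 0.1, 0.2, 0.7],
    [0.3, 0.2, 0.0, 0.1, 0.6, 0.1, 0.3],
    [0.4, 0.2, 0.1, 0.0, 0.3, 0.8, 0.8],
    [0.6, 0.1, 0.6, 0.3, 0.0, 0.3, 0.2],
    [0.1, 0.2, 0.1, 0.8, 0.3, 0.0, 0.8],
    [0.9, 0.7, 0.3, 0.8, 0.2, 0.8, 0.0],
]

if __name__ == "__main__":
    for segment, mean in rank_by_mean_distance(DIST):
        print(segment, round(mean, 3))
```

On this block, segment 7 has the largest average distance and would be reported as the most anomalous.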
![Page 33: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/33.jpg)
Distance Measures
Cosine Similarity Measure

$$ s = \frac{\sum_{i=1}^{p} x_i y_i}{\sqrt{\sum_{i=1}^{p} x_i^2 \, \sum_{i=1}^{p} y_i^2}}, \qquad d = 1 - s $$

City Block Distance (formula shown as an image on the slide)
Euclidean Distance (formula shown as an image on the slide)
Pearson Correlation Coefficient: d = 1 - r (formula for r shown as an image on the slide)
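For reference, the four measures can be written out in plain Python. Only the cosine formula survives in this transcript; the city block, Euclidean and Pearson versions below are the standard textbook definitions and are assumed, not confirmed, to match the slide.

```python
import math


def cosine_distance(x, y):
    """d = 1 - s, where s is the cosine similarity (undefined for zero vectors)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return 1 - dot / norm


def city_block_distance(x, y):
    """Sum of absolute feature differences (Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean_distance(x, y):
    """Square root of the summed squared feature differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """d = 1 - r, where r is the Pearson correlation (undefined for constant vectors)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1 - cov / (sd_x * sd_y)


if __name__ == "__main__":
    u, v = [3.0, 4.0, 0.0], [0.0, 0.0, 0.0]
    print(city_block_distance(u, v), euclidean_distance(u, v))
```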
![Page 34: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/34.jpg)
Standardizing Variables
• Desirable for all variables to have about the same influence
• We can express them each as deviations from their means in units of standard deviations (Z score)
• Or standardize all variables to have a minimum of zero and a maximum of one
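Both standardizations operate column by column on the feature matrix. The sketch below is a minimal illustration; using the population standard deviation and mapping constant columns to zero are assumptions, since the slide does not say how these cases are treated.

```python
import statistics


def z_scores(column):
    """Deviation from the mean in units of (population) standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)
    return [(v - mean) / sd if sd else 0.0 for v in column]


def min_max(column):
    """Rescale so the smallest value is 0 and the largest is 1."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in column]


def standardize(matrix, method=z_scores):
    """Apply a column-wise standardization to a row-major feature matrix."""
    columns = [method(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]


if __name__ == "__main__":
    X = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
    print(standardize(X, min_max))
```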
![Page 35: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/35.jpg)
Comp Distance
![Page 36: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/36.jpg)
Distance from complement
New Document or corpus
![Page 37: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/37.jpg)
Distance from complement
Segment the text
![Page 38: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/38.jpg)
Distance from complement
Feature matrix: one row per segment (1 … n), one column per feature (f1 … fn)
Characterize one segment
![Page 39: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/39.jpg)
Distance from complement
Feature matrix: one row per segment (1 … n), one column per feature (f1 … fn)
Characterize the complement of the segment
![Page 40: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/40.jpg)
Distance from complement
Feature matrix: one row per segment (1 … n), one column per feature (f1 … fn)
Compute the distance between the two vectors
D=.4
![Page 41: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/41.jpg)
Distance from complement
Feature matrix: one row per segment (1 … n), one column per feature (f1 … fn)
For all segments
D = 0.4
![Page 42: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/42.jpg)
Distance from complement
Feature matrix: one row per segment (1 … n), one column per feature (f1 … fn)
Compute distance between segments
D = 0.6
D = 0.4
![Page 43: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/43.jpg)
Rank by distance from complement
• Next, segments are ranked by their distance from the complement
• In this scenario we can make good use of list features
| Segment Number | Distance from complement |
|---|---|
| 7 | 372.4 |
| 1 | 345.1 |
| 3 | 211.2 |
| 5 | 121.9 |
| 2 | 59.1 |
| 4 | 36.2 |
| 6 | 23.4 |
| … | … |
| n | |
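The Comp Distance procedure can be sketched as a loop: characterize one segment, characterize everything else as a single text, and measure the distance between the two feature vectors. The two features in `toy_features` are stand-ins chosen for brevity, the city block distance is one possible choice, and standardization is omitted; none of this is meant to reproduce the actual 166-feature setup.

```python
def toy_features(tokens):
    """Two illustrative features: mean word length and % of words with >= 6 letters."""
    n = len(tokens)
    return [
        sum(len(t) for t in tokens) / n,
        100 * sum(1 for t in tokens if len(t) >= 6) / n,
    ]


def city_block(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def rank_by_complement_distance(segments, features=toy_features):
    """Rank segments (1-based) by distance from the rest of the document."""
    scores = []
    for i, segment in enumerate(segments):
        # The complement is all text in the document except this segment.
        complement = [t for j, other in enumerate(segments) if j != i for t in other]
        scores.append((i + 1, city_block(features(segment), features(complement))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    doc = [
        ["the", "cat", "sat"],
        ["a", "dog", "ran"],
        ["extraordinarily", "complicated", "terminology"],
        ["the", "sun", "set"],
    ]
    print(rank_by_complement_distance(doc))
```

Because the complement is treated as one long text, features that need a lot of data, such as the ranked lists, are better estimated here than they would be for each short segment in isolation.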
![Page 44: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/44.jpg)
SDE Dist
![Page 45: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/45.jpg)
SDE
• Use the Stahel-Donoho Estimator (SDE) to identify outliers
• Project the data down to one dimension and measure the outlyingness of each piece of text in that dimension
• For every piece of text, the goal is to find a projection of the data that maximizes its robust z-score
• Especially suited to data with a large number of dimensions (features)
![Page 46: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/46.jpg)
Outliers are ‘hidden’
![Page 47: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/47.jpg)
![Page 48: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/48.jpg)
Robust z-score of the furthest point is < 3
![Page 49: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/49.jpg)
![Page 50: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/50.jpg)
![Page 51: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/51.jpg)
Robust z-score for the triangles in this projection is > 12 std dev
![Page 52: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/52.jpg)
![Page 53: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/53.jpg)
![Page 54: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/54.jpg)
Outliers are clearly visible
![Page 55: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/55.jpg)
![Page 56: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/56.jpg)
SDE
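$$ SD(x_i) = \max_{a} \frac{\left| x_i a - \operatorname{median}_j (x_j a) \right|}{\operatorname{mad}_j (x_j a)} $$

• Where a is a direction (unit length vector),
• x_i a is the projection of row x_i onto direction a, and
• mad is the median absolute deviation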
![Page 57: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/57.jpg)
Anomalies have a large SD
• The distances for each piece of text SD(xi) are then sorted and all pieces of text are ranked
| Segment Number | SDE Distance |
|---|---|
| 7 | 372.4 |
| 1 | 345.1 |
| 3 | 211.2 |
| 5 | 121.9 |
| 2 | 59.1 |
| 4 | 36.2 |
| 6 | 23.4 |
| … | … |
| n | |
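The maximum over all directions cannot be computed exactly in general, so a common approximation is to try many random unit directions and keep, for each row, the largest robust z-score seen. The sketch below takes that approach; the number of directions, the random sampling scheme and the function name are assumptions, and the slides do not say how directions were actually chosen.

```python
import math
import random
import statistics


def sde_outlyingness(X, n_directions=500, seed=0):
    """Approximate Stahel-Donoho outlyingness of each row of X via random projections."""
    rng = random.Random(seed)
    n_features = len(X[0])
    scores = [0.0] * len(X)
    for _ in range(n_directions):
        # Draw a random direction and scale it to unit length.
        a = [rng.gauss(0, 1) for _ in range(n_features)]
        norm = math.sqrt(sum(v * v for v in a))
        a = [v / norm for v in a]

        # Project every row onto the direction (one dimension).
        proj = [sum(x * w for x, w in zip(row, a)) for row in X]
        med = statistics.median(proj)
        mad = statistics.median(abs(p - med) for p in proj)
        if mad == 0:
            continue  # degenerate projection, skip it

        # Keep the largest robust z-score seen so far for each row.
        for i, p in enumerate(proj):
            scores[i] = max(scores[i], abs(p - med) / mad)
    return scores


if __name__ == "__main__":
    cluster = [[(i % 5) * 0.1, (i // 5) * 0.1] for i in range(20)]
    data = cluster + [[10.0, 10.0]]
    scores = sde_outlyingness(data)
    print(scores.index(max(scores)))  # index of the most outlying row
```

Sorting the returned scores in descending order gives the ranking of segments.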
![Page 58: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/58.jpg)
Experiments
• In each experiment we randomly select 50 segments of text from a corpus and insert one piece of text from a different source to act as an ‘outlier’
• Rank segments
• We varied the size of the pieces of text from 100 to 1000 words
![Page 59: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/59.jpg)
Normal Population Anomalous Population
Creating Test Documents
![Page 60: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/60.jpg)
Normal Population Anomalous Population
Creating Test Documents
![Page 61: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/61.jpg)
Creating Test Documents
Normal Population Anomalous Population
![Page 62: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/62.jpg)
Test Document
Creating Test Documents
Now we attempt to spot this anomaly
![Page 63: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/63.jpg)
Creating Test Documents
Normal Population Anomalous Population
Test documents are created for every segment in anomalous text
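The construction of test documents, together with a top-k evaluation of a ranking method, can be sketched as follows. The helper names, the random insertion position and the `rank_fn` interface (a function returning segment indices, most anomalous first) are assumptions made for illustration.

```python
import random


def make_test_documents(normal_segments, anomalous_segments, n_normal=50, seed=0):
    """Yield (document, anomaly_position): one test document per anomalous segment."""
    rng = random.Random(seed)
    for anomaly in anomalous_segments:
        document = rng.sample(normal_segments, n_normal)  # random normal segments
        position = rng.randrange(n_normal + 1)            # where the outlier goes
        document.insert(position, anomaly)
        yield document, position


def top_k_accuracy(test_documents, rank_fn, k=5):
    """Fraction of documents whose inserted segment is ranked among the top k."""
    hits = total = 0
    for document, position in test_documents:
        ranking = rank_fn(document)  # segment indices, most anomalous first
        hits += position in ranking[:k]
        total += 1
    return hits / total if total else 0.0


if __name__ == "__main__":
    normal = [f"normal segment {i}" for i in range(60)]
    anomalous = ["AN ODD SEGMENT THAT IS MUCH LONGER THAN THE OTHERS"]

    # Toy ranking function: longest segments are treated as most anomalous.
    def by_length(doc):
        return sorted(range(len(doc)), key=lambda i: len(doc[i]), reverse=True)

    print(top_k_accuracy(make_test_documents(normal, anomalous), by_length, k=5))
```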
![Page 64: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/64.jpg)
New Document
System Output: a ranking
Most Anomalous
2nd Most Anomalous
3rd Most Anomalous
![Page 65: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/65.jpg)
Author Tests
• Compare 8 different authors
» Bronte
» Carroll
» Doyle
» Eliot
» James
» Kipling
» Tennyson
» Wells
• 56 pairs of authors
• For each author we use 50,000 words
• For each pair at least 50 different paragraphs from one author are inserted into the other author, one at a time.
• Tests are run using different segment sizes:
» 100 words
» 500 words
» 1000 words
![Page 66: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/66.jpg)
Authorship Anomalies - Top 5 Ranking
![Page 67: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/67.jpg)
Fact versus Opinion Tests
• Testing whether opinion can be detected in a factual story
• Opinion text is editorials from 4 newspapers totalling 28,200 words
» Guardian
» New Statesman
» New York Times
» Daily Telegraph
• Factual text is randomly chosen from the Gigaword and consists of 4 different 78,000 word segments, one from each of the 4 newswire services:
» Agence France Press English Service
» Associated Press Worldstream English Service
» The New York Times Newswire Service
» The Xinhua News Agency English Service
• Each opinion text paragraph is inserted into each newswire service one at a time, for at least 28 insertions on each newswire
• Tests are run using different paragraph sizes
![Page 68: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/68.jpg)
Opinion Anomalies - Top 5 Ranking
![Page 69: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/69.jpg)
News versus Anarchist Cookbook
• Very different genre from newswire. The writing is much more procedural (e.g. instructions to build telephone phreaking devices) and also very informal (e.g. “When the fuse contacts the balloon, watch out!!!”)
• Randomly insert one segment from the Anarchist Cookbook and attempt to identify outliers
– This is repeated 200 times for each segment size (100, 500, and 1,000 words)
![Page 70: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/70.jpg)
Anarchist Cookbook Anomalies - Top 5 Ranking
![Page 71: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/71.jpg)
News versus Chinese MT
• 35 thousand words of Chinese news articles were hand-picked (Wei Liu) and translated into English using Google’s Chinese to English translation engine
• Similar genre to English newswire, but the translations are far from perfect and so the language use is very odd
• Test collections are created for each segment size as before
![Page 72: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/72.jpg)
Chinese MT Anomalies - Top 5 Ranking
![Page 73: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/73.jpg)
Results
• Comp Distance anomaly detection produced best overall results, closely followed by SDE dist.
• City block distance measure always produced good rankings (with all methods)
![Page 74: Unsupervised Detection of Anomalous Text](https://reader036.vdocuments.us/reader036/viewer/2022062500/56814fdc550346895dbda444/html5/thumbnails/74.jpg)
Conclusions
• Variations in text can be viewed as a type of anomaly or outlier and can be successfully detected using automatic unsupervised techniques
• Stylistic features and distributions of the rarity of words are good choices for characterizing text and detecting a broad range of anomalies.
• The procedure that measures a piece of text's distance from its textual complement performs best
• Accuracy for anomaly detection improves considerably as we increase the length of our segments.