Beyond Verbal Identity: Contextual Features for Intertext Discovery
C. W. Forstall (1), L. Galli Milić (1), N. Coffee (2) and D. Nelis (1)
1. Université de Genève 2. University at Buffalo, the State University of New York
Motivation
We use the text-reuse detection tool Tesserae to locate potentially interesting allusions in Flavian epic poetry. While thematic resemblance at the scene level is often important to establishing the connection between two passages, and thus the significance of an allusion, Tesserae presently focuses on localized re-use of specific phrases and may miss higher-level contextual cues. We are testing the viability of larger-scale, “thematic” features targeted at the scene or paragraph level. Our goal is to modify the rankings of verbal correspondences identified by Tesserae according to the similarity of the respective contents of the phrases. For example, the pair of phrases below was ranked 379th of 912 results by Tesserae, but in the context of systematic structural similarity (see right), otherwise lower-ranking text-reuse becomes more interesting.
Valerius Flaccus, Argonautica 5
[BOOK DIVISION]
5.1-70a: Mariandyni; death and burial of Idmon and Tiphys; Erginus chosen as helmsman.
5.70b-176: Departure, voyage along southern coast of Black Sea; Argonauts pass the Chalybes, Carambis and Prometheus.
5.177-216: Evening and arrival in the Phasis. Prayer of Jason.
5.217-277: Invocation of a Muse (dea) and the situation in Colchis.
5.278-295: Divine intervention: Juno and Minerva. War.
5.296-328: Argonauts make their way to the city and palace of Aietes.
Methods

Corpus and text preparation
Our corpus was primarily epic, enlarged to include Ovid’s Heroides and Seneca’s Medea, which we felt might show affinities of style and content to our text of interest, Valerius Flaccus’ Argonautica. Each sample was 30 lines of text. Inflected forms were reduced to lemmata, using methods comparable to those in Tesserae. All preprocessing and subsequent analysis was done using R, with the help of the cluster, mclust, tm and topicmodels packages.

Unsupervised classification
We used k-means clustering to search for stable clusters of passages that shared similar language across works. Clustering was performed on two different feature sets:
1) TF-IDF weighted scores for all the words in the corpus common to two or more 30-line samples. Each sample was represented by a vector of approximately 8,000 weighted frequencies.
2) A set of 50 topics generated using latent Dirichlet allocation (LDA). Each sample was represented by 50 values, its scores for each of the topics.
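As an illustration of the first feature set, the following is a minimal sketch in Python (standard library only), not the R code used for the poster. The function names `chunk_lines` and `tfidf_vectors`, the handling of a short final sample, and the particular TF-IDF variant (relative term frequency times the natural log of inverse document frequency) are assumptions made for the example; the weighting actually applied may differ.

```python
import math
from collections import Counter


def chunk_lines(lines, size=30):
    """Split a list of verse lines into consecutive samples of `size` lines.

    Assumption: a short remainder at the end is kept as its own sample.
    """
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def tfidf_vectors(samples, min_df=2):
    """Return (vocabulary, vectors), with one TF-IDF vector per sample.

    `samples` is a list of token lists (assumed already lemmatized).
    Only words occurring in at least `min_df` samples are kept, which
    mirrors the "common to two or more samples" filter.
    """
    n = len(samples)
    df = Counter()
    for tokens in samples:
        df.update(set(tokens))  # count each word once per sample
    vocab = sorted(w for w, c in df.items() if c >= min_df)
    vectors = []
    for tokens in samples:
        tf = Counter(tokens)
        total = len(tokens) or 1
        vectors.append([(tf[w] / total) * math.log(n / df[w]) for w in vocab])
    return vocab, vectors


# Toy data standing in for lemmatized samples (hypothetical).
samples = [
    "arma vir cano troia".split(),
    "vir navis mare cano".split(),
    "mare navis ventus".split(),
]
vocab, vectors = tfidf_vectors(samples)
```

Words that appear in only one sample (here `arma`, `troia`, `ventus`) are dropped from the vocabulary, so every remaining dimension can contribute to similarity between at least two samples.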
Correlation between clusterings
We tested correlation between the clusters generated by k-means using the adjusted Rand index. For two classifications, this index measures their agreement beyond what is expected by chance.
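The adjusted Rand index can be computed directly from the contingency counts of two labelings. The sketch below is a plain-Python version of the standard formula, written for illustration; the function name `adjusted_rand_index` is ours, and the poster's analysis used R packages rather than this code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: trivially identical
    # Pairs of items placed together in both labelings.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together in each labeling separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs  # chance agreement
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single class in both labelings
    return (together_both - expected) / (maximum - expected)


same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])   # same partition, renamed classes
cross = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # partitions that cut across each other
```

The index depends only on which items are grouped together, not on the class names, so relabeled but identical partitions score 1, while partitions that agree less than chance would predict score below 0.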
The box plot at right shows correlation between k-means clustering and true authorship, over 10 repetitions of the clustering for each treatment: TF-IDF scores on the left, and LDA topic scores on the right. We chose k = 11, the number of authors in the corpus. LDA was effective at reducing the otherwise significant impact of authorship on the classification.
Cluster stability
We varied k, the number of classes, from 2 to 12, and for each value of k we generated 15 clusterings. Adjusted Rand indices were calculated for each of the 105 possible pairs of clusterings for a given value of k. The distributions of these (right) provide an indication of the stability of each configuration of classes: small numbers of classes are highly stable; among larger values of k, divisions into 6 and 7 classes are most stable.
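The stability procedure can be sketched as follows: run k-means several times from different random starts and score every pair of runs with the adjusted Rand index. This is an illustrative Python version using only the standard library; the helper names (`kmeans`, `stability_scores`), the basic Lloyd-style k-means, the seeding, and the toy data are all assumptions, and the poster's results were produced in R.

```python
import random
from collections import Counter
from itertools import combinations
from math import comb


def adjusted_rand_index(a, b):
    """Adjusted Rand index between two labelings of the same items."""
    total = comb(len(a), 2)
    both = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    in_a = sum(comb(c, 2) for c in Counter(a).values())
    in_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = in_a * in_b / total
    maximum = (in_a + in_b) / 2
    return 1.0 if maximum == expected else (both - expected) / (maximum - expected)


def kmeans(points, k, rng, max_iter=100):
    """Basic k-means with random initial centers; returns one label per point."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sum((x - c) ** 2 for x, c in zip(p, centers[j])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def stability_scores(points, k, runs=15, seed=0):
    """Adjusted Rand index for every pair among `runs` independent clusterings."""
    rng = random.Random(seed)
    clusterings = [kmeans(points, k, rng) for _ in range(runs)]
    return [adjusted_rand_index(a, b) for a, b in combinations(clusterings, 2)]


# Toy two-dimensional data (hypothetical), two loose groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
scores = stability_scores(points, k=2)
```

With 15 runs there are 15 × 14 / 2 = 105 pairs, which is the number of scores summarized in each box of the stability plot.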
[Figure: box plots, "lda 50 topics". x-axis: classes (2 to 12); y-axis: adjusted rand score (0.3 to 1.0).]
Vergil, Aeneid 7
[BOOK DIVISION]
7.1-7: Death and burial of Caieta; departure.
7.8-24: Voyage along the coast; Trojans pass Circe’s land.
7.25-36: Dawn and arrival in Tiber.
7.37-106: Invocation of the Muse Erato and the situation in Latium.
7.107-147: Meal. Prayer of Aeneas; sacrifice.
7.148-285: Trojans make their way to the city and palace of Latinus.
7.286-640: Divine intervention: Juno and Allecto. War.
Thematic similarity
We see similar thematic elements in the openings of Aeneid 7 and Valerius Flaccus’ Argonautica 5, in both cases at (what was likely) the mid-point of the narrative.
[Figure: histogram, "Pairwise adjusted rand index". x-axis: randscores.giant.test (0.2 to 0.8); y-axis: Frequency (0 to 2000).]
Below: a close-up showing only Vergil’s Aeneid and Valerius Flaccus’ Argonautica. This is the type of result that we are looking for: samples fall into multiple classes and are not segregated by author.
The figures above show the author effect graphically: for the TF-IDF features the distinctness of authors such as Ovid, Seneca, Lucan, Silius Italicus and Corippus from the central cloud is apparent. Under the LDA treatment, only Ovid maintained the same degree of separation.
The effects of authorship
Topic stability
To test the stability of LDA, we generated 100 different LDA models of 50 topics, performing k-means clustering on each one with k = 7. The figure at right shows the distribution of adjusted Rand index values for the 4950 pairwise comparisons between the 100 classifications produced. Correlation is consistent but low, at around 0.25, with one or two outlier cases having high agreement.
Sample results
Above: book 7 of the Aeneid. The first half of the book, which features more peaceful content, alternates between classes 6 and 7, the most general of the epic classes. The preparations for war in the book’s second half group with class 2. Two passages affiliate with more author-specific groups: Juno’s speech at 286 falls in the group dominated by Ovid’s Metamorphoses, while the single brief battle scene groups with Lucan’s Civil War in class 3.
[Figure: "Vergil Aeneid 7". x-axis: first verse of sample; y-axis: class (2 to 7).]
[Figure: box plots, "correlation with authorship". x-axis: tf-idf, lda; y-axis: adjusted rand index (0.2 to 0.7).]
Below we show one example of k-means clustering into 7 classes, taken from the topic stability experiments described above. Point size shows how often, in 100 different tests, each sample fell into the class shown here.
[Figure: "Close-up: Vergil vs. Valerius Flaccus". Axes: PC1, PC2. Legend: classification (class 1 to class 7); authorship (F = valerius_flaccus, V = vergil).]
[Figure: "Effects of authorship: TF-IDF by author". Axes: PC1, PC2. Legend: baebius_italicus, catullus, corippus, ennius, lucan, ovid, seneca, silius_italicus, statius, valerius_flaccus, vergil.]
[Figure: "Effects of authorship: LDA by author". Axes: PC1, PC2. Legend: baebius_italicus, catullus, corippus, ennius, lucan, ovid, seneca, silius_italicus, statius, valerius_flaccus, vergil.]
[Figure: "k-means classification of lda". Axes: PC1, PC2. Legend: classification (class 1 to class 7); authorship (baebius_italicus, catullus, corippus, ennius, lucan, ovid, seneca, silius_italicus, statius, valerius_flaccus, vergil).]
… etenim dat candida certam nox Helicen. (Val. Flac. 5.70)
adspirant aurae in noctem nec candida cursus luna negat, splendet tremulo sub lumine pontus. (Verg. Aen. 7.8)