Detecting and Describing Historical Periods in a Large Corpora
DESCRIPTION
Many historic periods (or events) are remembered by slogans, expressions or words that are strongly linked to them. Educated people are also able to determine whether a particular word or expression is related to a specific period in human history. The present paper aims to establish correlations between significant historic periods (or events) and the texts written in those periods. To achieve this, we have developed a system that automatically links words (and topics discovered using Latent Dirichlet Allocation) to periods of time in recent history. For this analysis to be relevant and conclusive, it must be undertaken on a representative set of texts written throughout history. To this end, instead of relying on manually selected texts, the Google Books Ngram corpus has been chosen as a basis for the analysis. Although it provides only word n-gram statistics for the texts written in a given year, the resulting time series can be used to provide insights about the most important periods and events in recent history, by automatically linking them with specific keywords or even LDA topics.
TRANSCRIPT
Detecting and Describing Historical Periods in a Large Corpora
Tiberiu Popa, Traian Rebedea, Costin Chiru
University Politehnica of Bucharest
Faculty of Automatic Control and Computers
Outline
• Context
• Architecture
• Historical Features Detection
• Topic Modeling for Historically Relevant Documents
• Results
• Future Work & Conclusions
11 Nov 2014, 26th IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2014
Context
• Many historic events are remembered by slogans, expressions or words that are strongly linked to them
• Try to establish the correlation between significant historic events and the texts written in that period
• The analysis should be based on a representative set of texts written throughout history
Context
• The outcome should contain:
– A separation of the years into groups such that the years in each group are related to a specific event
– A short description for each group of years
• Examples:
– 1858–1864, 1867–1868: rebel, confederate, secession, vicksburg, chattanooga
– 1969–1981: pollution, nixon, slavery, blacks, urbanization
Google Books Ngrams
• Corpus that contains statistics extracted from over 5 million books, or about 4% of all books ever published (in English)
• Due to copyright restrictions, only frequency statistics are provided for each word
• Frequencies ranging from unigrams to 5-grams
• Books from 1500 to 2008 (nowadays)
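As a rough illustration of how per-year statistics turn into a time series, the sketch below parses rows and divides each count by a yearly total. It assumes the four-column, tab-separated layout `ngram, year, match_count, volume_count` used by the 2012 release of the corpus; other releases may differ. The sample rows, the `totals` values and the function name `build_series` are made up for illustration, and normalizing by yearly totals is an assumption rather than something the slides state.

```python
from collections import defaultdict


def build_series(lines, totals):
    """Return {ngram: {year: relative_frequency}} from tab-separated rows."""
    series = defaultdict(dict)
    for line in lines:
        ngram, year, match_count, _volume_count = line.rstrip("\n").split("\t")
        year = int(year)
        # Divide by the total token count of that year so that years with
        # more scanned books do not dominate the series (assumed step).
        series[ngram][year] = int(match_count) / totals[year]
    return dict(series)


# Made-up sample rows and yearly totals, for illustration only.
rows = [
    "secession\t1859\t120\t40",
    "secession\t1860\t300\t75",
    "secession\t1861\t900\t150",
]
totals = {1859: 1_000_000, 1860: 1_200_000, 1861: 1_500_000}
series = build_series(rows, totals)
```

Each n-gram then maps to one value per year, which is the kind of series the detection methods on the following slides operate on.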
Google Books Ngrams
• For each word, the associated time series is denoted by
Related Work
• Culturomics – quantitative analysis of culture
– Computational investigation of cultural trends (e.g. using Google Books, or other corpora over a large period of time)
– “can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology”
• Semantic evolution of words over time
– Topics over time
– Time influences the meaning of a word => change of topics/meanings over time
• Evolution of the topics in a specific research field (e.g. computational linguistics) over time using topic models
– Showed the rise of probabilistic models in NLP
Architecture
[Diagram: Google Books N-grams -> Historically Relevant Documents -> Relevant Historical Topics]
Detection of Historical Features
• A special case of bursty feature detection
• Detects periods of increased activity in the time series
• For each n-gram, it must also assign a “bursty” weight to each year (an integer between 0 and 10)
Double Change
• Peaks usually consist of a period of abrupt increase, followed by another period of abrupt decrease
• Compute the relative change from one year to another
• The bursty weight rt depends on
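The double-change idea can be sketched as follows. This is a simplified single-year version, not the authors' formula: the `threshold` value, the scoring rule (the weaker of the rise and the fall, scaled to 0..10) and the function names are all assumptions made for illustration.

```python
def relative_changes(series):
    """Relative change between consecutive years: (x[t] - x[t-1]) / x[t-1]."""
    return [
        (cur - prev) / prev if prev > 0 else 0.0
        for prev, cur in zip(series, series[1:])
    ]


def double_change_weights(series, threshold=0.5, max_weight=10):
    """Weight years where an abrupt rise is directly followed by an abrupt fall."""
    changes = relative_changes(series)
    weights = [0] * len(series)
    # changes[t - 1] is the change into year t, changes[t] the change out of it.
    for t in range(1, len(series) - 1):
        rise, fall = changes[t - 1], -changes[t]
        if rise >= threshold and fall >= threshold:
            # Assumed scoring: the weaker of the two changes, scaled and capped.
            weights[t] = min(max_weight, round(min(rise, fall) * max_weight))
    return weights


# Made-up frequency series with a single spike at index 3.
freq = [1.0, 1.0, 1.1, 3.0, 1.2, 1.0, 1.0]
weights = double_change_weights(freq)
```

Only the spike year receives a non-zero weight here; a fuller version would presumably handle rises and falls that stretch over several years.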
Linear Model
• Approximate the frequency time series by a piecewise linear function
– Fit lines to the graph of the time series by considering larger and larger intervals until the error rises above a given threshold
• The bursty weight rt depends on the logarithm of the slope
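A minimal sketch of such a greedy piecewise-linear fit is given below. The error measure (maximum absolute residual of a least-squares line), the `max_error` value, and the `slope_weight` mapping with its hypothetical `base_slope` reference are assumptions; the slides only state that the weight depends on the logarithm of the slope.

```python
import math


def fit_line(ys, start, end):
    """Least-squares line through (t, ys[t]) for start <= t <= end.

    Returns (slope, intercept, max_abs_error)."""
    n = end - start + 1
    ts = range(start, end + 1)
    mean_t = sum(ts) / n
    mean_y = sum(ys[start:end + 1]) / n
    var_t = sum((t - mean_t) ** 2 for t in ts)
    slope = sum((t - mean_t) * (ys[t] - mean_y) for t in ts) / var_t
    intercept = mean_y - slope * mean_t
    error = max(abs(ys[t] - (slope * t + intercept)) for t in ts)
    return slope, intercept, error


def piecewise_linear(ys, max_error):
    """Grow each interval until the fit error exceeds max_error, then start anew."""
    segments, start = [], 0
    while start < len(ys) - 1:
        end = start + 1
        slope, _, _ = fit_line(ys, start, end)
        while end + 1 < len(ys):
            new_slope, _, err = fit_line(ys, start, end + 1)
            if err > max_error:
                break
            end, slope = end + 1, new_slope
        segments.append((start, end, slope))
        start = end  # the next segment begins where this one stops
    return segments


def slope_weight(slope, base_slope, max_weight=10):
    """Assumed mapping: log2 of the slope relative to a reference slope, clipped to 0..max_weight."""
    if slope <= base_slope:
        return 0
    return min(max_weight, round(math.log2(slope / base_slope)))


# Made-up series: a steady rise followed by a plateau.
ys = [0, 1, 2, 3, 3, 3, 3]
segments = piecewise_linear(ys, max_error=0.1)
```

On this toy series the fit splits into a rising segment and a flat one; only the rising segment would receive a non-zero weight.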
Gaussian Model
• Peaks are usually bell-shaped, so try to fit a Gaussian distribution
• First, normalize the time series to get a probability distribution
• Then, try to approximate it with a normal distribution
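The normalize-then-fit steps can be sketched as below. The slides do not say how the normal distribution is fitted; moment matching (using the mean and variance of the normalized series) is an assumption chosen for simplicity, and all function names are hypothetical.

```python
import math


def normalize(series):
    """Scale a non-negative series so that it sums to 1."""
    total = sum(series)
    return [x / total for x in series]


def fit_gaussian(probs):
    """Moment matching: mean and standard deviation over the indices 0..n-1."""
    mean = sum(t * p for t, p in enumerate(probs))
    var = sum(p * (t - mean) ** 2 for t, p in enumerate(probs))
    return mean, math.sqrt(var)


def gaussian_pmf(mean, std, length):
    """Normal density sampled at 0..length-1 and renormalized to sum to 1."""
    dens = [math.exp(-0.5 * ((t - mean) / std) ** 2) for t in range(length)]
    total = sum(dens)
    return [d / total for d in dens]


# Made-up, roughly bell-shaped window of yearly counts.
window = [1, 4, 9, 12, 9, 4, 1]
probs = normalize(window)
mean, std = fit_gaussian(probs)
fitted = gaussian_pmf(mean, std, len(window))
```

The two lists `probs` and `fitted` are distributions over the same years, so they can be compared directly with a distance measure.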
Gaussian Model
• Last, use the earth mover’s distance (EMD) to compare the normalized time series with the fitted normal distribution
• Select non-overlapping intervals that have an EMD lower than 0.3 in a greedy fashion from left to right
• The bursty weight rt depends on the change of the fitted Gaussian (max vs. min value) for each discrete interval
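The distance and selection steps might look like the sketch below. For two distributions over the same unit-spaced bins, the one-dimensional earth mover's distance equals the sum of absolute differences between their cumulative sums, which is what `emd_1d` computes. How the authors scale the distance and enumerate candidate intervals is not given in the slides, so the candidate list (with made-up EMD values) and the function names are assumptions.

```python
def emd_1d(p, q):
    """Earth mover's distance between two distributions on the same unit-spaced bins."""
    total, cum = 0.0, 0.0
    for a, b in zip(p, q):
        cum += a - b       # running difference of the two CDFs
        total += abs(cum)  # mass that still has to cross this bin boundary
    return total


def select_intervals(candidates, max_emd=0.3):
    """Greedy left-to-right choice of non-overlapping intervals below max_emd.

    candidates: iterable of (start, end, emd) with inclusive bounds."""
    chosen, last_end = [], -1
    for start, end, emd in sorted(candidates):
        if emd < max_emd and start > last_end:
            chosen.append((start, end))
            last_end = end
    return chosen


# Moving all mass by one bin costs exactly 1.
shift_cost = emd_1d([0, 1, 0], [0, 0, 1])

# Made-up candidate intervals with made-up EMD values.
candidates = [(0, 6, 0.12), (4, 10, 0.05), (8, 14, 0.25), (15, 20, 0.45), (21, 27, 0.2)]
chosen = select_intervals(candidates)
```

Note that the greedy pass keeps the first acceptable interval it meets, so the better-fitting candidate `(4, 10)` is skipped because it overlaps `(0, 6)`.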
Detection of Historical Features – Comparison
• It is difficult to measure which of these three methods performs best at detecting and characterizing historically relevant peaks
• Need a dataset created with the help of historians
Historically Relevant Documents
• Each year is viewed as a document
• The weight of a term in a specific year is given by rt
– For all terms that have rt > 0
• Try to cluster these documents and summarize each cluster
• Use LDA (Latent Dirichlet Allocation) to extract topics
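The year-as-document construction can be sketched as a document-term matrix whose rows are years and whose entries are the bursty weights. The input layout `{term: {year: weight}}`, the sample weights and the function name are assumptions for illustration. Because the weights are non-negative integers, they behave like term counts, so a matrix of this shape could be passed to an LDA implementation that accepts count matrices; the slides do not say which implementation was used.

```python
def build_year_documents(weights):
    """Turn {term: {year: r_t}} into (years, vocab, matrix).

    matrix[i][j] is the weight of vocab[j] in years[i]; terms and years
    that never get a weight above 0 are left out."""
    years = sorted({y for per_year in weights.values()
                    for y, r in per_year.items() if r > 0})
    vocab = sorted(t for t, per_year in weights.items()
                   if any(r > 0 for r in per_year.values()))
    row = {year: i for i, year in enumerate(years)}
    col = {term: j for j, term in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in years]
    for term, per_year in weights.items():
        for year, r in per_year.items():
            if r > 0:
                matrix[row[year]][col[term]] = r
    return years, vocab, matrix


# Made-up weights, for illustration only.
weights = {
    "secession": {1860: 7, 1861: 10, 1862: 4},
    "vicksburg": {1863: 9, 1864: 3},
    "the": {1860: 0, 1861: 0},  # never bursty, so it is dropped
}
years, vocab, matrix = build_year_documents(weights)
```

Dropping terms with no positive weight keeps common words out of the documents, so the topics are built only from words that were bursty in some year.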
Results
• Topic modeling (e.g. LDA) allows each document to capture a mixture of topics
• The analysis of the topics shows that most years have a predominant topic (over 50% in the corresponding mixture)
• The table contains a post-processed version of the topics for the last century
• Manually removed the noisy words that appeared in the top 10 words for each topic
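One way to go from per-year topic mixtures to groups of years is sketched below: pick the topic whose share exceeds 50% for each year, then merge consecutive years with the same predominant topic into ranges. The mixtures are made up, and the grouping rule and function names are assumptions rather than the authors' procedure.

```python
def dominant_topics(mixtures, min_share=0.5):
    """Map each year to its predominant topic, or None if no topic exceeds min_share."""
    result = {}
    for year, shares in mixtures.items():
        best = max(range(len(shares)), key=shares.__getitem__)
        result[year] = best if shares[best] > min_share else None
    return result


def group_years(dominant):
    """Merge consecutive years with the same predominant topic into (first, last) ranges."""
    periods = {}
    for year in sorted(dominant):
        topic = dominant[year]
        if topic is None:
            continue
        ranges = periods.setdefault(topic, [])
        if ranges and ranges[-1][1] == year - 1:
            ranges[-1] = (ranges[-1][0], year)  # extend the current range
        else:
            ranges.append((year, year))         # start a new range
    return periods


# Made-up topic mixtures for a few years (two topics).
mixtures = {
    1858: [0.8, 0.2], 1859: [0.7, 0.3], 1860: [0.9, 0.1],
    1861: [0.4, 0.6], 1862: [0.45, 0.55],
    1863: [0.5, 0.5],  # no topic strictly above 50%
    1864: [0.75, 0.25],
}
periods = group_years(dominant_topics(mixtures))
```

A topic can end up with several disjoint ranges, which matches the shape of results such as "1858–1864, 1867–1868" for a single topic.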
Results – American Civil War
• Topic for the American Civil War (1858-1864, 1867-1868)
• Double change bursty feature detection
Results – WWI
• Topic for World War I and the peace treaty (1916-1920)
• Gaussian model bursty feature detection
Results – pre-WWII
• Topic for the period before World War II (1932-1936)
• Linear model peak detection
Future Work
• Exploring alternatives
– Computing the historical relevance of a word has a lot of potential for improvement, both in finding new definitions and in finding ways to combine the existing ones
– Are topic models really the key to understanding historically relevant documents?
• Improve the validation
– Build a corpus, with the help of historians and linguists, that contains a set of “historically relevant” peaks and periods
Conclusions
• Theoretical framework for identifying historic periods and events
• Linking these periods with words and LDA topics extracted from large corpora of texts
• Important concept: historical relevance of a word
• Several methods for computing the historically relevant features
Questions?
Discussion
This work has been funded by the Sectorial Operational Programme Human Resources Development 2007-2013 of the Romanian Ministry of European Funds through the Financial Agreement POSDRU/159/1.5/S/132397