Detecting and Describing Historical Periods in a Large Corpora

Tiberiu Popa, Traian Rebedea, Costin Chiru, University Politehnica of Bucharest, Faculty of Automatic Control and Computers

Upload: traian-rebedea

Post on 08-Jul-2015

DESCRIPTION

Many historic periods (or events) are remembered by slogans, expressions or words that are strongly linked to them. Educated people are also able to determine whether a particular word or expression is related to a specific period in human history. The present paper aims to establish correlations between significant historic periods (or events) and the texts written in that period. In order to achieve this, we have developed a system that automatically links words (and topics discovered using Latent Dirichlet Allocation) to periods of time in recent history. For this analysis to be relevant and conclusive, it must be undertaken on a representative set of texts written throughout history. To this end, instead of relying on manually selected texts, the Google Books Ngram corpus has been chosen as a basis for the analysis. Although it provides only word n-gram statistics for the texts written in a given year, the resulting time series can be used to provide insights about the most important periods and events in recent history, by automatically linking them with specific keywords or even LDA topics.

TRANSCRIPT

Page 1: Detecting and Describing Historical Periods in a Large Corpora

Detecting and Describing Historical Periods in a Large Corpora

Tiberiu Popa, Traian Rebedea, Costin Chiru

University Politehnica of Bucharest

Faculty of Automatic Control and Computers

Page 2: Detecting and Describing Historical Periods in a Large Corpora

Outline

• Context

• Architecture

• Historical Features Detection

• Topic Modeling for Historically Relevant Documents

• Results

• Future Work & Conclusions

11 Nov 2014, 26th IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2014

Page 3: Detecting and Describing Historical Periods in a Large Corpora

Context

• Many historic events are remembered by slogans, expressions or words that are strongly linked to them

• Try to establish the correlation between significant historic events and the texts written in that period

• The analysis should be based on a representative set of texts written throughout history

Page 4: Detecting and Describing Historical Periods in a Large Corpora

Context

• The outcome should contain:

– A separation of the years into groups, such that all years in a group are related to a specific event

– A short description for each group of years

• Examples:

– 1858–1864, 1867–1868: rebel, confederate, secession, vicksburg, chattanooga

– 1969–1981: pollution, nixon, slavery, blacks, urbanization

Page 5: Detecting and Describing Historical Periods in a Large Corpora

Google Books Ngrams

• Corpus that contains statistics extracted from over 5 million books, or about 4% of all books ever published (in English)

• Due to copyright restrictions, only frequency statistics are provided for each word

• Frequencies ranging from unigrams to 5-grams

• Books from 1500 to 2008 (nowadays)
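As a rough illustration of how the per-year statistics can be turned into one time series per word, the sketch below assumes tab-separated records whose first three fields are the n-gram, the year and the match count; any further fields are ignored. The function name `build_time_series` and the sample rows are hypothetical and serve only as an example.

```python
from collections import defaultdict


def build_time_series(lines):
    """Group raw records into {ngram: {year: match_count}}.

    Assumption: each record is tab-separated and starts with
    ngram, year, match_count; any additional fields are ignored.
    """
    series = defaultdict(dict)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed records
        ngram, year, count = fields[0], int(fields[1]), int(fields[2])
        # Sum counts if the same (ngram, year) pair appears more than once
        series[ngram][year] = series[ngram].get(year, 0) + count
    return dict(series)


# Made-up sample rows, for illustration only
sample = [
    "secession\t1859\t120\t40",
    "secession\t1860\t310\t95",
    "secession\t1861\t980\t210",
]
series = build_time_series(sample)
print(series["secession"])
```

In practice the raw counts would usually be divided by the total number of words published in each year, so that the series reflects relative rather than absolute frequency.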

Page 6: Detecting and Describing Historical Periods in a Large Corpora

Google Books Ngrams

• For each word, the associated time series is denoted by

Page 7: Detecting and Describing Historical Periods in a Large Corpora

Related Work

• Culturomics – quantitative analysis of culture

– Computational investigation of cultural trends (e.g. using Google Books, or other corpora over a large period of time)

– “can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology”

• Semantic evolution of words over time

– Topics over time

– Time influences the meaning of a word => change of topics/meanings over time

• Evolution of the topics in a specific research field (e.g. computational linguistics) over time using topic models

– Showed the rise of probabilistic models in NLP

Page 8: Detecting and Describing Historical Periods in a Large Corpora

Architecture

Pipeline (diagram): Google Books N-grams → Historically Relevant Documents → Relevant Historical Topics

Page 9: Detecting and Describing Historical Periods in a Large Corpora

Detection of Historical Features

• A special case of bursty feature detection

• Detects periods of increased activity in the time series

• For each n-gram, it must also assign a “bursty” weight to each year (an integer between 0 and 10)

Page 10: Detecting and Describing Historical Periods in a Large Corpora

Double Change

• Peaks usually consist of a period of abrupt increase, followed by another period of abrupt decrease

• Compute the relative change from one year to another

• The bursty weight rt depends on
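The double-change idea can be sketched as follows. Since the exact dependence of rt is not given above, the peak rule (a relative rise into a year followed by a relative fall right after it, both above a `threshold`) and the scaling to an integer between 0 and 10 are assumptions made only for this example. It also treats a peak as a single year, which simplifies the multi-year periods of increase and decrease described on the slide. The function names are hypothetical.

```python
def relative_changes(freqs):
    """Relative change from each year to the next for a list of yearly frequencies."""
    changes = []
    for prev, cur in zip(freqs, freqs[1:]):
        changes.append((cur - prev) / prev if prev > 0 else 0.0)
    return changes


def double_change_weights(freqs, threshold=0.5, max_weight=10):
    """Assign an integer weight in [0, max_weight] to each year.

    Assumption: year t is a peak when the series rises by more than
    `threshold` into t and falls by more than `threshold` right after t.
    The weight scales with the smaller of the two changes.
    """
    changes = relative_changes(freqs)
    weights = [0] * len(freqs)
    for t in range(1, len(freqs) - 1):
        rise = changes[t - 1]   # relative change from t-1 to t
        fall = -changes[t]      # relative drop from t to t+1
        if rise > threshold and fall > threshold:
            strength = min(rise, fall)
            weights[t] = min(max_weight, max(1, round(strength * max_weight)))
    return weights


# Made-up frequencies with a single spike in the third year
weights = double_change_weights([10, 11, 30, 12, 11])
print(weights)
```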

Page 11: Detecting and Describing Historical Periods in a Large Corpora

Linear Model

• Approximate the frequency time series by a piecewise linear function

– Fit lines to the graph of the time series by considering larger and larger intervals until the error rises above a given threshold

• The bursty weight rt depends on the logarithm of the slope
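A minimal sketch of the piecewise linear approximation is given below. It grows each segment one year at a time and stops when the largest residual of a least-squares line exceeds `max_error`. The choice of the maximum absolute residual as the error measure, and the `log1p`-based mapping from slope to weight, are assumptions for illustration; the slides only state that the weight depends on the logarithm of the slope.

```python
import math


def fit_line(ys, start, end):
    """Least-squares line through (t, ys[t]) for t in [start, end].

    Returns (slope, intercept, max absolute residual).
    """
    xs = range(start, end + 1)
    n = end - start + 1
    mean_x = sum(xs) / n
    mean_y = sum(ys[start:end + 1]) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (ys[x] - mean_y) for x in xs)
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    err = max(abs(ys[x] - (slope * x + intercept)) for x in xs)
    return slope, intercept, err


def piecewise_linear(ys, max_error):
    """Greedy segmentation: extend a segment while the fit error stays within max_error."""
    segments = []
    start = 0
    while start < len(ys) - 1:
        end = start + 1
        slope, _, _ = fit_line(ys, start, end)
        while end + 1 < len(ys):
            cand_slope, _, err = fit_line(ys, start, end + 1)
            if err > max_error:
                break
            end, slope = end + 1, cand_slope
        segments.append((start, end, slope))
        start = end  # consecutive segments share their boundary point
    return segments


def slope_weight(slope, max_weight=10):
    """Hypothetical weight: rounded log1p of a positive slope, capped at max_weight."""
    if slope <= 0:
        return 0
    return min(max_weight, round(math.log1p(slope)))


# Made-up series: flat, then a steady rise, then flat again
segments = piecewise_linear([0, 0, 0, 10, 20, 30, 30, 30], max_error=0.5)
print(segments)
```

Only rising segments receive a non-zero weight in this sketch; in practice the slope would probably be normalized by the overall frequency of the word before taking the logarithm.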

Page 12: Detecting and Describing Historical Periods in a Large Corpora

Gaussian Model

• Peaks are usually bell-shaped, so try to fit a Gaussian distribution

• First, normalize the time series to get a probability distribution

• Then, try to approximate it with a normal distribution
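The two steps above can be sketched as follows, assuming that the normal distribution is fitted by matching the mean and standard deviation of the normalized series (the slides do not say which fitting procedure was used). The function names and the sample counts are hypothetical.

```python
import math


def normalize(freqs):
    """Scale yearly frequencies so that they sum to 1."""
    total = sum(freqs)
    if total <= 0:
        raise ValueError("time series has no mass to normalize")
    return [f / total for f in freqs]


def fit_normal(probs, years):
    """Moment matching: mean and standard deviation of the distribution over years."""
    mean = sum(p * y for p, y in zip(probs, years))
    var = sum(p * (y - mean) ** 2 for p, y in zip(probs, years))
    return mean, math.sqrt(var)


def discretized_normal(mean, std, years):
    """Normal density evaluated at each year, renormalized to sum to 1."""
    dens = [math.exp(-0.5 * ((y - mean) / std) ** 2) for y in years]
    total = sum(dens)
    return [d / total for d in dens]


# Made-up, symmetric counts around 1917
years = list(range(1914, 1921))
probs = normalize([1, 4, 9, 12, 9, 4, 1])
mean, std = fit_normal(probs, years)
fitted = discretized_normal(mean, std, years)
print(round(mean, 2), round(std, 2))
```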

Page 13: Detecting and Describing Historical Periods in a Large Corpora

Gaussian Model

• Last, use the earth mover’s distance (EMD) to compute the similarity between the normalized time series and the fitted normal distribution

• Select non-overlapping intervals that have an EMD lower than 0.3 in a greedy fashion from left to right

• The bursty weight rt depends on the change of the fitted Gaussian (max vs. min value) for each discrete interval
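For one-dimensional distributions over the same evenly spaced support, the EMD reduces to the sum of absolute differences between the two cumulative distributions, which the sketch below uses. The greedy selection is one possible reading of "from left to right": candidates are scanned in order of start year and the first acceptable, non-overlapping ones are kept. The units behind the 0.3 threshold are not stated, so the default here is only a placeholder. The candidate intervals and their EMD values are made up.

```python
from itertools import accumulate


def emd_1d(p, q):
    """EMD between two distributions on the same unit-spaced support.

    Computed as the sum of absolute differences of the cumulative sums.
    """
    if len(p) != len(q):
        raise ValueError("distributions must share the same support")
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


def select_intervals(candidates, max_emd=0.3):
    """Greedy left-to-right choice of non-overlapping intervals.

    candidates: iterable of (start_year, end_year, emd) tuples.
    """
    chosen = []
    last_end = None
    for start, end, emd in sorted(candidates):
        if emd >= max_emd:
            continue  # fit is not close enough to a Gaussian
        if last_end is not None and start <= last_end:
            continue  # overlaps an interval that was already selected
        chosen.append((start, end))
        last_end = end
    return chosen


# Hypothetical candidates with made-up EMD values
candidates = [
    (1914, 1920, 0.12),
    (1918, 1925, 0.05),
    (1932, 1936, 0.45),
    (1939, 1946, 0.20),
]
selected = select_intervals(candidates)
print(selected)
```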

Page 14: Detecting and Describing Historical Periods in a Large Corpora

Detection of Historical Features – Comparison

• Difficult to measure which of these three methods performs best at detecting and characterizing historically relevant peaks

• Need a dataset created with the help of historians

Page 15: Detecting and Describing Historical Periods in a Large Corpora

Historically Relevant Documents

• Each year is viewed as a document

• The weight of a term in a specific year is given by rt

– For all terms that have rt > 0

• Try to cluster these documents and summarize each cluster

• Use LDA (Latent Dirichlet Allocation) to extract topics
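A small sketch of the "year as document" construction follows. It assumes the bursty weights are already available as a mapping from term to per-year weight; the names `build_year_documents` and `to_bag_of_words`, and the sample weights, are hypothetical. Repeating each term rt times is one possible way to pass integer weights to an LDA implementation that expects word counts.

```python
from collections import defaultdict


def build_year_documents(term_weights):
    """Turn {term: {year: weight}} into {year: {term: weight}}, keeping weights > 0."""
    docs = defaultdict(dict)
    for term, per_year in term_weights.items():
        for year, weight in per_year.items():
            if weight > 0:
                docs[year][term] = weight
    return dict(docs)


def to_bag_of_words(doc):
    """Expand a weighted document into tokens by repeating each term `weight` times."""
    tokens = []
    for term in sorted(doc):
        tokens.extend([term] * doc[term])
    return tokens


# Made-up weights, for illustration only
term_weights = {
    "confederate": {1861: 8, 1862: 9, 1900: 0},
    "secession": {1861: 10},
}
docs = build_year_documents(term_weights)
print(docs[1861])
print(to_bag_of_words({"rebel": 2, "vicksburg": 1}))
```

The resulting token lists (one per year) can then be handed to any LDA implementation that accepts a bag-of-words corpus.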

Page 16: Detecting and Describing Historical Periods in a Large Corpora

Results

• Topic modeling (e.g. LDA) allows each document to capture a mixture of topics

• The analysis of the topics shows that most years have a predominant topic (over 50% in the corresponding mixture)

• The table contains a post-processed version of the topics for the last century

• Manually removed the noisy words that appeared in the top 10 words for each topic
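To illustrate how predominant topics could be turned into groups of years, the sketch below takes a per-year topic mixture (as produced by any LDA implementation), keeps the years where one topic exceeds 50%, and merges consecutive years that share that topic. The mixtures shown are made up and the function names are hypothetical; this is not necessarily the post-processing used for the results table.

```python
def dominant_topics(year_topics, min_share=0.5):
    """Return {year: topic index} for years where one topic exceeds min_share."""
    result = {}
    for year, shares in year_topics.items():
        best = max(range(len(shares)), key=shares.__getitem__)
        if shares[best] > min_share:
            result[year] = best
    return result


def group_periods(dominant):
    """Merge consecutive years with the same dominant topic into (first, last, topic)."""
    periods = []
    for year in sorted(dominant):
        topic = dominant[year]
        if periods and periods[-1][2] == topic and periods[-1][1] == year - 1:
            periods[-1] = (periods[-1][0], year, topic)
        else:
            periods.append((year, year, topic))
    return periods


# Made-up topic mixtures for a two-topic model
mixtures = {
    1916: [0.7, 0.3],
    1917: [0.8, 0.2],
    1918: [0.6, 0.4],
    1919: [0.45, 0.55],
    1920: [0.5, 0.5],
}
periods = group_periods(dominant_topics(mixtures))
print(periods)
```

A topic may end up with several disjoint periods, which matches groupings such as 1858–1864 and 1867–1868 for a single topic.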

Page 17: Detecting and Describing Historical Periods in a Large Corpora

Results – American Civil War

• Topic for the American Civil War (1858-1864, 1867-1868)

• Double change bursty feature detection

Page 18: Detecting and Describing Historical Periods in a Large Corpora

Results – WWI

• Topic for World War I and the peace treaty (1916-1920)

• Gaussian model bursty feature detection

Page 19: Detecting and Describing Historical Periods in a Large Corpora

Results – pre-WWII

• Topic for the period before World War II (1932-1936)

• Linear model peak detection

Page 20: Detecting and Describing Historical Periods in a Large Corpora

Future Work

• Exploring alternatives

– Computing the historical relevance of a word has a lot of potential for improvement, both in finding new definitions and in finding ways to combine the existing ones

– Are topic models really the key to understanding historically relevant documents?

• Improve the validation

– Build a corpus, with the help of historians and linguists, that contains a set of “historically relevant” peaks and periods

Page 21: Detecting and Describing Historical Periods in a Large Corpora

Conclusions

• Theoretical framework for identifying historic periods and events

• Linking these periods with words and LDA topics extracted from large corpora of texts

• Important concept: historical relevance of a word

• Several methods for computing the historically relevant features

Page 22: Detecting and Describing Historical Periods in a Large Corpora

Questions?

Discussion

This work has been funded by the Sectorial Operational Programme Human Resources Development 2007-2013 of the Romanian Ministry of European Funds through the Financial Agreement POSDRU/159/1.5/S/132397
