
On Modeling Temporal Dynamics, Forgetting and Remembering for Intelligent Information Access

Advanced Methods for IR

Dr. Nattiya Kanhabua L3S Research Center Hannover, Germany

24 June 2015


Motivation
• Three Aspects for Information Management and Retrieval: Modeling (1) Temporal Dynamics, (2) Forgetting, and (3) Remembering

Four Selected Papers
• Kanhabua et al., Learning to Detect Event-Related Queries for Web Search, In TempWeb'2015 at WWW'2015
• Nguyen, Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification, In ECIR'2014
• Ceroni et al., To Keep or not to Keep: An Expectation-oriented Photo Selection Method for Personal Photo Collections, In ICMR'2015
• Kanhabua et al., What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalysts for Collective Memory in Wikipedia, In JCDL'2014

Conclusion

Outline


Motivation

Temporal Web Dynamics
• Unprecedented growth and change of data on the Web
• Changes occur in many aspects, e.g., size, content, structure and user interactions or queries
• Size: web pages are added/deleted all the time
• Content: web pages are edited/modified
• Query: users’ information needs change

Web and index sizes (timeline, 1995 to 2015)
• 2000: First billion-URL index. The world’s largest! ≈5000 PCs in clusters!
• 2004: Index grows to 4.2 billion pages
• 2008: Google counts 1 trillion unique URLs
• 2009: TBs or PBs of data/index. Tens of thousands of PCs
• 2015: ? (http://www.worldwidewebsize.com/)

Content/Structure Changes

Implications: Crawling, Indexing, Ranking

Fig. 1 Categorization of document collections with content changes over time.

Changes in User Behavior

Implications: Query Analysis, Ranking

Fig. 2 Categorization of queries with temporal information needs.

http://www.google.com/insights/search

Temporal Query Examples

A temporal query consists of:
• Query keywords
• Temporal expressions

A document consists of:
• Terms, i.e., bag-of-words
• Publication time and temporal expressions

[Berberich et al., ECIR 2010]
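The query and document model above can be sketched as two small record types. This is only an illustration: the class names, field names and the `time_overlap` helper are hypothetical and are not taken from Berberich et al.; temporal expressions are kept as plain strings such as "07/2006".

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class TemporalQuery:
    keywords: List[str]                                        # query keywords
    time_expressions: List[str] = field(default_factory=list)  # e.g. "06/2006"


@dataclass
class TemporalDocument:
    terms: List[str]                                           # bag-of-words
    pub_time: date                                             # publication time
    content_times: List[str] = field(default_factory=list)     # temporal expressions in the text


def time_overlap(query: TemporalQuery, doc: TemporalDocument) -> bool:
    """True if query and document share at least one temporal expression."""
    return bool(set(query.time_expressions) & set(doc.content_times))


query = TemporalQuery(["Germany", "World", "Cup"], ["06/2006", "07/2006"])
doc = TemporalDocument(["final", "berlin"], date(2006, 7, 9), ["07/2006"])
print(time_overlap(query, doc))
```

A real system would normalize temporal expressions to intervals before comparing them; exact string matching is used here only to keep the sketch short.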

Implications for Search

[Figure: search over the Temporal Web. Query side: Determining Search Intent for time-sensitive queries, e.g., Term: {Germany, World, Cup}, Time: {06/2006, 07/2006}. Document side: Semantic Annotation yields annotated documents with Term: {w1, w2, …, wn} and Time: {PubTime(di), ContentTime(di)}. Matching/ranking against D2006 produces the retrieved results.]

Three aspects for Intelligent Information Access (Management + Retrieval): (1) Temporal Dynamics, (2) Forgetting, (3) Remembering

Five Selected Papers
• Kanhabua et al., Learning to Detect Event-Related Queries for Web Search, In TempWeb Workshop at WWW'2015.
• Nguyen, Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification, In ECIR'2014.
• Ceroni et al., To Keep or not to Keep: An Expectation-oriented Photo Selection Method for Personal Photo Collections, In ICMR'2015.
• Kanhabua et al., What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalysts for Collective Memory in Wikipedia, In JCDL'2014.
• Tran et al., Back to the Past: Supporting Interpretations of Forgotten Stories by Time-aware Re-Contextualization, In WSDM'2015.

Highlight Research


Learning to Detect Event-related Queries

Temporal queries are a significant fraction of Web search queries (Nunes et al., 2008; Zhang et al., 2010)
• 13.8% of explicit temporal queries
• 17.1% of implicit temporal queries

Characteristics:
• Certain temporal patterns, i.e., spikes, periodicity (hourly or daily), seasonality and trends
• Underlying temporal information needs without temporal patterns observed

Tasks:
• Understand temporal search intent
• Enable advanced enhancement techniques
• Automatic method for detecting events in search streams

Example queries: US Election 2016, Brazil FIFA World Cup

N. Kanhabua and K. Nørvåg, Determining Time of Queries for Re-ranking Search Results, In ECDL'2010.

N. Kanhabua, T. N. Nguyen and W. Nejdl, Learning to Detect Event-Related Queries for Web Search, In TempWeb Workshop at WWW'2015.

N. Kanhabua, R. Blanco and K. Nørvåg, Temporal Information Retrieval, Foundations and Trends in IR, 2015.

Preliminaries

Data model:
• Set of queries Q issued at different time points
• Set of clicked URLs U and click-through data
• Temporal document collection D
• q: keywords or term(q), and hitting time(q)
• yq: time series data extracted from Q, U and D

Two-step approach:
1. Automatically extract a set of candidate queries {q1, ..., qn} from Q
2. Classify candidates as event-related queries {e1, ..., em} using machine learning techniques

Identifying Event Candidates

Time and keyword-based clustering:

Step 1: Partition query logs into one-week windows
• Group queries from the same event
• A window possibly contains multiple, unrelated events

Step 2: Cluster queries by lexical similarity
• Pre-process and sort queries alphabetically
• Compute Jaccard similarity of a query pair

Example cluster "Easter": easter 2006, easter 2007, easter 20crafts, easter activities, easter animation, easter animations, easter background, easter basket, easter bread, easter bucket, easter bunny, easter bunny decorations, easter bunny lights
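The lexical clustering step can be sketched as follows. This is a minimal illustration and not the authors' implementation: word-level tokens, the greedy single-pass grouping and the function names are assumptions. The 0.2 threshold is borrowed from the cleansing-step parameters listed under Experiments.

```python
def jaccard(a, b):
    """Jaccard similarity between the word sets of two queries."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def cluster_queries(queries, threshold=0.2):
    """Greedy single pass over alphabetically sorted queries (assumed strategy).

    A query joins the current cluster if it is similar enough to that
    cluster's first query; otherwise it opens a new cluster.
    """
    clusters = []
    for q in sorted(set(queries)):
        if clusters and jaccard(clusters[-1][0], q) > threshold:
            clusters[-1].append(q)
        else:
            clusters.append([q])
    return clusters


# Queries from one (hypothetical) weekly partition of the log.
week = ["easter bunny", "easter 2006", "easter basket", "ncaa bracket", "ncaa scores"]
for cluster in cluster_queries(week):
    print(cluster)
```

Sorting first keeps lexically close queries adjacent, so a single pass is enough for this toy input; a production version would likely compare against more than the cluster's first member.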

Event-related Query Classification

Classify a query as event-related or not:
• Periodic and seasonal events
• Popular and trending events
• Sporadic (rare) and unseen events
• General time-sensitive queries
• Underlying temporal information needs

Features:
• Time-series features, e.g., seasonality or trends
• Popularity-based features, e.g., click-through and burstiness
• Statistical features, e.g., probability distribution of results: temporal KL-divergence and skewness (kurtosis)

Seasonality

Detect seasonal queries (Shokouhi, 2011)
• Examples: annual events, e.g., US Open and Easter, or a 4-year recurring event, e.g., FIFA World Cup
• Method: time-series decomposition using Holt-Winters adaptive exponential smoothing
• Input: time-series data extracted from external document collections, YD
• Compute a cosine similarity as seasonality, where Y is the original time-series data and S is the seasonality component

[Figure: time series for the queries "World cup" and "Easter"]
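A rough sketch of the seasonality feature, i.e., the cosine similarity between Y and S. To stay dependency-free, the seasonal component here is a per-position average over the cycle; this is a simple stand-in chosen for illustration and is not the Holt-Winters decomposition named on the slide. Function names and the `period` parameter are assumptions.

```python
import math


def seasonal_component(y, period):
    """Crude seasonal estimate: mean of the values at each position in the cycle.

    Stand-in for a proper decomposition such as Holt-Winters.
    """
    means = []
    for p in range(period):
        vals = y[p::period]
        means.append(sum(vals) / len(vals))
    return [means[i % period] for i in range(len(y))]


def cosine(u, v):
    """Cosine similarity between two equal-length series."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def seasonality(y, period):
    """Cosine similarity between the series Y and its seasonal component S."""
    return cosine(y, seasonal_component(y, period))


periodic = [0, 0, 9, 0, 0, 0, 9, 0, 0, 0, 9, 0]    # spike at the same position of every cycle
irregular = [9, 0, 0, 0, 0, 0, 0, 3, 0, 5, 0, 0]   # spikes at unrelated positions
print(seasonality(periodic, 4), seasonality(irregular, 4))
```

A perfectly periodic series scores close to 1, while a series whose spikes do not line up with the cycle scores lower.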

Autocorrelation

• Detect trending events by their predictability
• Cross correlation of a time series with itself, or between its past and future values at different time lags
• The stronger the inter-day dependencies, the higher the autocorrelation value
• With lag=1, the 2nd time series is shifted by one day; this is called 1st-order autocorrelation
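A minimal sketch of lag-k autocorrelation on a daily series, using the standard sample estimate (covariance between the series and its shifted copy, divided by the variance). The slide does not show which estimator was used, so this particular form is an assumption.

```python
def autocorrelation(y, lag=1):
    """Sample autocorrelation of y at the given lag (lag=1 is first-order)."""
    n = len(y)
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y)
    if var == 0 or lag >= n:
        return 0.0
    # Covariance between the series and a copy of itself shifted by `lag` days.
    cov = sum((y[t] - mean) * (y[t + lag] - mean) for t in range(n - lag))
    return cov / var


trend = [1, 2, 3, 4, 5, 6, 7, 8]   # steadily rising query volume
noisy = [1, 8, 2, 7, 1, 8, 2, 7]   # day-to-day values alternate
print(autocorrelation(trend), autocorrelation(noisy))
```

The rising series has a clearly positive first-order autocorrelation, whereas the alternating series is negative, matching the intuition that stronger inter-day dependence gives a higher value.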

Temporal KL-divergence

• Analyze a temporal distribution in a result set
• Measure the difference between the distribution over time of top-k documents of q and the document collection C
• P(t|q) is the probability of generating a publication date t given q
• P(t|C) is the probability of a publication date t in the collection
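As a sketch, the temporal KL-divergence can be computed as the sum over time bins t of P(t|q) · log(P(t|q) / P(t|C)). The month-level binning, the add-one smoothing and the function names below are assumptions made for illustration; the slide does not specify them.

```python
import math
from collections import Counter


def distribution(timestamps, bins, eps=1.0):
    """Relative frequency of each time bin, with add-eps smoothing (assumed)."""
    counts = Counter(timestamps)
    total = len(timestamps) + eps * len(bins)
    return {b: (counts[b] + eps) / total for b in bins}


def temporal_kl(top_k_dates, collection_dates):
    """KL divergence between P(t|q), from the top-k results, and P(t|C)."""
    bins = sorted(set(top_k_dates) | set(collection_dates))
    p_q = distribution(top_k_dates, bins)
    p_c = distribution(collection_dates, bins)
    return sum(p_q[t] * math.log(p_q[t] / p_c[t]) for t in bins)


months = ["2006-03", "2006-04", "2006-05", "2006-06"]
collection = months * 5          # publication dates spread evenly over the collection
flat = months * 2                # top-k results of a query with no temporal focus
bursty = ["2006-06"] * 8         # top-k results concentrated in a single month
print(temporal_kl(flat, collection), temporal_kl(bursty, collection))
```

A query whose results follow the collection's distribution scores near zero; a query whose results cluster in one period scores higher.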

Surprise score

• Detect unseen events or surprisingly popular queries (Radinsky et al., 2012)
• Assume an unplanned event is happening when there is a significant prediction error
• Compute the sum of squared errors of prediction (SSE) using a simple linear regression model
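A small sketch of the surprise score as the SSE of a least-squares line. It assumes the line is fitted to the whole series against time steps 0..n-1 (an in-sample fit); the slide does not say whether the original method fits on past data and predicts ahead, so treat this as one possible reading.

```python
def surprise_score(y):
    """SSE of a least-squares line fitted to y against time steps 0..n-1."""
    n = len(y)
    x_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (v - y_mean) for x, v in enumerate(y))
    slope = sxy / sxx if sxx else 0.0
    intercept = y_mean - slope * x_mean
    # Sum of squared differences between observed and predicted values.
    return sum((v - (intercept + slope * x)) ** 2 for x, v in enumerate(y))


steady = [10, 12, 14, 16, 18, 20]   # volume grows linearly: fully predictable
spike = [10, 12, 14, 16, 18, 90]    # sudden jump on the last day
print(surprise_score(steady), surprise_score(spike))
```

The linear series yields an SSE of essentially zero, while the series with an unexpected jump yields a large one.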

Experiments

Query logs:
• Two datasets, i.e., AOL and MSN
• AOL: 30M queries, March 1 - May 31, 2006
• MSN: 15M queries from May 2006

Temporal collection:
• The New York Times Annotated Corpus
• 1.8M documents from 1987 - 2007

Setting:
• HeidelTime (Strötgen & Gertz, 2010) for time extraction and OpenNLP for entity extraction
• Cleansing-step parameters: Jaccard similarity threshold>0.2; edit distance<3; overlap n-gram=2
• For burstiness features, default parameters for the burst detection technique provided by CISHELL

In total, 837 event-related queries

Experimental results

Feature selection
• Study high-impact (best) features
• Investigate their importance independent from classification algorithms
• InfoGainAttributeEval method in WEKA

Our findings:
• Discriminative features are mostly derived from D and Q
• TemporalKL and kurtosis are among the influential features
• Trend-based features, such as autocorrelation, burst weight, and trending level, play an important role
• Seasonality computed from Q has less impact than the one extracted from D

Mining Dynamic Subtopics for Search Diversification

[Figure: timeline of subtopics for the query "ncaa": march madness began (14/03/2006), ncaa women tournament began (18/03/2006), final four began (01/04/2006)]

T. N. Nguyen and N. Kanhabua, Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification. In ECIR’2014


However, we are facing:
• Dramatic increase in content creation (e.g., digital photos)
• Increasing use of mobile devices with restricted capacity
• Information overload and changing professional + private lives
• Lack of systematic preservation is inadvertent forgetting

Forgetting plays a crucial role for human remembering and our lives (needs to focus, stress on important information, and forget details)

A Computer that forgets? Intentionally?? And in context of preservation???

Shouldn't there be something like forgetting in digital memories as well?

Forget IT

Managed forgetting ≠ automatic deletion

Managed forgetting = to remember the right information

Managed Forgetting

Based on:
• Careful information value assessment
• Forgetting strategies via policies
• Forgetting options to integrate final manual checking before deletion
• Combination with multi-tier storage solution

Managed forgetting ≠ automatic deletion

Instead: range of forgetting options, e.g.
• Resource condensation
• Change of indexing & ranking
• Reduction of redundancy

Automatic Deletion?

decreasing memory buoyancy

Use of tiers

Contextualized Remembering

Aim:
• Bring back information into active use in a meaningful way even if a lot of time has passed
• Aim for semantic level of preservation

Based on:
• Take into account relevant parts of context when moving to archive
• Increase contextualization of preserved content
• Consider context evolution over time (evolution-aware contextualization)

A. Ceroni, N. K. Tran, N. Kanhabua and C. Niederée, Bridging Temporal Context Gaps using Time-Aware Re-Contextualization, SIGIR’2014

Evolution-aware Contextualization & Re-contextualization

[Figure: evolution-aware contextualization and re-contextualization over time t. Components: Information System, Archival Information System, Context of Interpretation. Stages: Contextualization, Context-aware Preservation, Semantic Evolution Detection, Evolution-aware Contextualization, Re-contextualization. Document D with contexts C, C‘, C‘‘, C‘‘‘ and preserved items Pres(D‘), Pres(C‘), Pres(C‘‘). Drivers of change: human forgetting, change in focus, structural changes; semantic evolution, structural evolution, terminology evolution.]

Low-investment and expectation-oriented photo selection
• Low-investment: no manual (semantic) annotation required; a model is generated by learning from training data with relevance judgments
• Expectation-oriented: the model aims at predicting which photos are perceived by users as important and likely to be selected

Investigation of the role of coverage in personal photo selection
• State-of-the-art methods consider coverage as a primary criterion
• Our work: an expectation-oriented method that explicitly models coverage with other features for importance prediction

To Keep or not to Keep: Personal Photo Selection

A. Ceroni, V. Solachidis, C. Niederée, O. Papadopoulou, N. Kanhabua and V. Mezaris, To Keep or not to Keep: An Expectation-oriented Photo Selection Method for Personal Photo Collections. In ACM ICMR'2015.

M. Fu, A. Ceroni, V. Solachidis, C. Niederée, O. Papadopoulou, N. Kanhabua and V. Mezaris, Investigating Human Behaviors in Selecting Personal Photos to Preserve Memories, In IEEE ICME Workshop HMMP’2015.

A. Ceroni, V. Solachidis, M. Fu, N. Kanhabua, O. Papadopoulou, C. Niederée, and V. Mezaris, Learning Personalized Expectation-Oriented Photo Selection Models for Personal Photo Collections, In IEEE ICME HMMP’2015.

Personal Photo Dataset

Statistics: 91 photo collections (18,147 photos)

Expectation-oriented Selection
• Importance prediction: learn selection behaviors from real personal photos
• Features: image quality, concepts, faces, clusters, near-duplicates

Hybrid Selection
• Combining importance prediction with explicitly modeling coverage
• 3 hybrid methods incorporating coverage in different ways

Approach Overview

Overall Findings
• Predicting photo importance, based on user expectations as well as considering image- and collection-level features, improves the selection performance
• Coverage is not a primary criterion when selecting personal photos for preservation

[Figure: example photos marked as selected and not selected. Captions: The meeting place; The Gammelstad Church Town (UNESCO)]

“Collective memory is a socially constructed, common image (memory) of the past of a community, which frames its current understanding and actions.” [Halbwachs, 1950]

Crowd phenomenon and important to societal processes

Not static as determined by the concerns of the present

From Individual Memories to Collective Memory

M. Halbwachs, On collective memory. Chicago: The University of Chicago Press, 1950 (Translation).

Flashbulb memories in cognitive psychology
• A study of remembering of high-impact events, e.g., the British Royal Wedding or the September 11 attacks
• Aspects: details, confidence, consistency of memory over time, impact of media coverage
• Qualitative study: limited number of events and users

We propose a 3-step approach, for a given event:
1. Compute “remembering scores” of past events within the same category
2. Rank related past events by the computed remembering scores
3. Identify features (e.g., time, location) having a high correlation with remembering

Our approach

Remembering scores: a linear combination of three features:
1. Cross-correlation coefficient (CCF)
2. Sum of squared error (SSE)
3. Skewness (Kurtosis)

Measuring Signals for Memory Revival

Remembering = α•CCF + β•SSE + γ•Kurtosis
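The linear combination and the subsequent ranking of past events can be sketched as below. The equal weights, the assumption that the three features are already normalized to comparable ranges, and the example feature values are all hypothetical; the slides do not give the values of α, β and γ.

```python
def remembering_score(ccf, sse, kurtosis, alpha=1 / 3, beta=1 / 3, gamma=1 / 3):
    """Remembering = alpha*CCF + beta*SSE + gamma*Kurtosis (weights are assumed)."""
    return alpha * ccf + beta * sse + gamma * kurtosis


def rank_events(features):
    """Rank past events by remembering score, highest first.

    `features` maps an event name to a (ccf, sse, kurtosis) tuple whose
    values are assumed to be normalized beforehand.
    """
    return sorted(features, key=lambda e: remembering_score(*features[e]), reverse=True)


# Made-up feature values for three past events in one category.
past_events = {
    "event A": (0.9, 0.4, 0.5),
    "event B": (0.2, 0.1, 0.3),
    "event C": (0.6, 0.8, 0.7),
}
print(rank_events(past_events))
```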

Features for Triggered Remembering

Temporal similarity:
• Time distance between two events (in days, months or years)
• Time distance based on exponential decay functions

Location similarity:
• Map a geographic hierarchy of event locations as follows: city -> state -> country -> neighbor countries -> continent
• Assign 4 scale values: 4 to same city, 3 to state, 2 to country, 1 to continent

Impact of events:
• Damaged area/properties/cost/fatalities
• Magnitude (for earthquake events)
• Highest winds, lowest pressure (for Atlantic hurricanes)
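The temporal and location similarity features can be sketched as two small functions. The decay rate, the score of 0 for events that share no level, and the dictionary layout of a location are assumptions. The "neighbor countries" level is left out because the slide assigns no scale value to it.

```python
import math

# Scale values as stated on the slide; finest matching level wins.
LOCATION_SCALE = {"city": 4, "state": 3, "country": 2, "continent": 1}


def temporal_similarity(days_apart, decay_rate=0.001):
    """Exponential decay of similarity with time distance (decay_rate is assumed)."""
    return math.exp(-decay_rate * abs(days_apart))


def location_similarity(loc_a, loc_b):
    """Return the scale value of the finest level shared by both locations.

    Returns 0 when nothing matches (an assumption, not stated on the slide).
    """
    for level in ("city", "state", "country", "continent"):
        if loc_a.get(level) and loc_a.get(level) == loc_b.get(level):
            return LOCATION_SCALE[level]
    return 0


christchurch = {"city": "Christchurch", "state": "Canterbury",
                "country": "New Zealand", "continent": "Oceania"}
darfield = {"city": "Darfield", "state": "Canterbury",
            "country": "New Zealand", "continent": "Oceania"}
port_au_prince = {"city": "Port-au-Prince", "state": "Ouest",
                  "country": "Haiti", "continent": "North America"}

print(location_similarity(christchurch, darfield))        # same state
print(location_similarity(christchurch, port_au_prince))  # nothing shared
print(temporal_similarity(30), temporal_similarity(365))
```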

N. Kanhabua and K. Nørvåg: Determining time of queries for re-ranking search results. In ECDL 2010

J. Strötgen, M. Gertz, and C. Junghans: An event-centric model for multilingual document similarity. In SIGIR 2011

Category: Atlantic Hurricane
• Hurricane Hanna triggers remembering of Hurricane Gustav, the most recent hurricane that struck the area of Puerto Rico and the East Coast
• Hurricane Sandy triggers remembering of the 1991 Perfect Storm, which initially formed around the Canada area and is high impact and most destructive

Category: Aviation accidents
• Mixture of impact factors, such as time and location
• Qantas Flight 32 (incident on 4 November 2010) triggers remembering of (1) Qantas Flight 30 and British Airways Flight 9 (both going to Australia), and (2) Aero Caribbean Flight 883 (most recent event)

[Figure labels: Most recent; Same destination; Deadliest (two aircraft collided); Concorde]

Category: Earthquakes
• 2010 Canterbury earthquake triggers 2010 Haiti earthquake (recent and high-impact) and two close-by events, and high-impact historical earthquakes
• 2011 Christchurch earthquake shows locality focus, i.e., people seem to be interested in the previous events in the same region
• For the June 2011 Christchurch earthquake, the remembered events are dominated by the two predecessor events

Look beyond single events, especially if there are several events in temporal and local proximity.

Category: Terrorist incidents
• Interesting observation: semantic similarity between events
• June 2012 Kaduna church bombings trigger remembering of other religion-related terror attacks
• 2008 Mumbai attacks trigger remembering of terror attacks on business, entertainment and hotel targets


Thank you!

Questions or suggestions?

