On Modeling Temporal Dynamics Forgetting and Remembering for
Intelligent Information Access
Advanced Methods for IR
Dr. Nattiya Kanhabua L3S Research Center Hannover, Germany
24 June 2015
Outline
Motivation
Three Aspects for Information Management and Retrieval
• Modeling (1) Temporal Dynamics, (2) Forgetting, and (3) Remembering
Four Selected Papers
• Kanhabua et al., Learning to Detect Event-Related Queries for Web Search, In TempWeb'2015 at WWW'2015
• Nguyen, Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification, In ECIR'2014
• Ceroni et al., To Keep or not to Keep: An Expectation-oriented Photo Selection Method for Personal Photo Collections, In ICMR'2015
• Kanhabua et al., What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalysts for Collective Memory in Wikipedia, In JCDL'2014
Conclusion
Motivation
Temporal Web Dynamics
Unprecedented growth and change of data on the Web
Changes occur in many aspects, e.g., size, content, structure, and user interactions or queries.
• Size: web pages are added/deleted all the time
• Content: web pages are edited/modified
• Query: users’ information needs change
Web and index sizes (timeline 1995 to 2015)
• 2000: First billion-URL index. The world’s largest! ≈5000 PCs in clusters!
• 2004: Index grows to 4.2 billion pages
• 2008: Google counts 1 trillion unique URLs
• 2009: TBs or PBs of data/index. Tens of thousands of PCs
• Current size: ? (http://www.worldwidewebsize.com/)
Content/Structure Changes
Implications: Crawling, Indexing, Ranking
Fig. 1 Categorization of document collections with content changes over time.
Changes in User Behavior
Implications: Query Analysis, Ranking
Fig. 2 Categorization of queries with temporal information needs.
http://www.google.com/insights/search
Temporal Query Examples
A temporal query consists of:
• Query keywords
• Temporal expressions
A document consists of:
• Terms, i.e., bag-of-words
• Publication time and temporal expressions
[Berberich et al., ECIR 2010]
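The query and document model above can be sketched as two small records. This is a minimal illustration only: the class names, the month-string time format, and the `matches` rule are assumptions made for the example, not the model of Berberich et al.

```python
from dataclasses import dataclass, field


@dataclass
class TemporalQuery:
    keywords: set                             # query keywords
    times: set = field(default_factory=set)   # temporal expressions, e.g. {"2006-06"}


@dataclass
class TemporalDocument:
    terms: set                                # bag-of-words, simplified to a set here
    pub_time: str                             # publication time, e.g. "2006-07"
    content_times: set = field(default_factory=set)  # temporal expressions in the text


def matches(query: TemporalQuery, doc: TemporalDocument) -> bool:
    """True when the document shares a keyword and, if the query names a time, a time."""
    term_hit = bool(query.keywords & doc.terms)
    doc_times = doc.content_times | {doc.pub_time}
    time_hit = not query.times or bool(query.times & doc_times)
    return term_hit and time_hit


query = TemporalQuery({"germany", "world", "cup"}, {"2006-06", "2006-07"})
doc = TemporalDocument({"world", "cup", "final"}, pub_time="2006-07")
print(matches(query, doc))
```

A real system would rank rather than filter, and would treat temporal expressions as intervals with uncertainty instead of exact month strings.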
Implications for Search
[Diagram labels: query; Time-sensitive queries; Determining Search Intent, with Term: {Germany, World, Cup} and Time: {06/2006, 07/2006}; Temporal Web; Semantic Annotation; Annotated documents, with Term: {w1, w2, …, wn} and Time: {PubTime(di), ContentTime(di)}; D2006; matching/ranking; Retrieved results]
Highlight Research
Three aspects for Intelligent Information Access (Management + Retrieval)
• (1) Temporal Dynamics (2) Forgetting (3) Remembering
Five Selected Papers
• Kanhabua et al., Learning to Detect Event-Related Queries for Web Search, In TempWeb Workshop at WWW'2015.
• Nguyen, Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification, In ECIR'2014.
• Ceroni et al., To Keep or not to Keep: An Expectation-oriented Photo Selection Method for Personal Photo Collections, In ICMR'2015.
• Kanhabua et al., What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalysts for Collective Memory in Wikipedia, In JCDL'2014.
• Tran et al., Back to the Past: Supporting Interpretations of Forgotten Stories by Time-aware Re-Contextualization, In WSDM'2015.
Learning to Detect Event-related Queries
Temporal queries are a significant fraction of Web search queries (Nunes et al., 2008; Zhang et al., 2010)
• 13.8% of explicit temporal queries
• 17.1% of implicit temporal queries
Characteristics:
• Certain temporal patterns, i.e., spikes, periodicity (hourly or daily), seasonality and trends
• Underlying temporal information needs without temporal patterns observed
Tasks:
• Understand temporal search intent
• Enable advanced enhancement techniques
• Automatic method for detecting events in search streams
Examples: US Election 2016, Brazil FIFA World Cup
N. Kanhabua and K. Nørvåg, Determining Time of Queries for Re-ranking Search Results, In ECDL'2010.
N. Kanhabua, T. N. Nguyen and W. Nejdl, Learning to Detect Event-Related Queries for Web Search, In TempWeb Workshop at WWW'2015.
N. Kanhabua, R. Blanco and K. Nørvåg, Temporal Information Retrieval, Foundations and Trends in IR, 2015.
Preliminaries
Data model:
• Set of queries Q issued at different time points
• Set of clicked URLs U and click-through data
• Temporal document collection D
• q: keywords or term(q), and hitting time(q)
• yq: time-series data extracted from Q, U and D
Two-step approach:
1. Automatically extract a set of candidate queries {q1, ..., qn} from Q
2. Classify candidates as event-related queries {e1, ..., em} using machine learning techniques
Identifying Event Candidates
Time and keyword-based clustering:
Step 1: Partition query logs into one-week intervals
• Group queries from the same event
• Possibly contain multiple, unrelated events
Step 2: Cluster queries by lexical similarity
• Pre-process and sort queries alphabetically
• Compute Jaccard similarity of a query pair
Example cluster "Easter": easter 2006, easter 2007, easter 20crafts, easter activities, easter animation, easter animations, easter background, easter basket, easter bread, easter bucket, easter bunny, easter bunny decorations, easter bunny lights
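The lexical clustering step can be sketched as follows. This is a simplified illustration, not the authors' implementation: word-level tokens, the greedy single-pass grouping, and the function names are assumptions; the 0.2 threshold is the Jaccard threshold reported in the experimental setting.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two queries."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def cluster_queries(queries, threshold=0.2):
    """Greedy single-pass clustering over alphabetically sorted queries.

    A query joins the current cluster when its similarity to the cluster's
    first query exceeds the threshold; otherwise it starts a new cluster.
    """
    clusters = []
    for q in sorted(set(queries)):
        if clusters and jaccard(clusters[-1][0], q) > threshold:
            clusters[-1].append(q)
        else:
            clusters.append([q])
    return clusters


week = ["easter bunny", "ncaa", "easter 2006", "ncaa women tournament", "easter basket"]
for cluster in cluster_queries(week):
    print(cluster)
```

Sorting first means that queries sharing a leading keyword end up adjacent, so one linear pass is enough to group them.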
Event-related Query Classification
Classify a query as event-related or not:
• Periodic and seasonal events
• Popular and trending events
• Sporadic (rare) and unseen events
• General time-sensitive queries
• Underlying temporal information needs
Features:
• Time-series features, e.g., seasonality or trends
• Popularity-based features, e.g., click-through and burstiness
• Statistical features, e.g., probability distribution of results: temporal KL-divergence and skewness (kurtosis)
Seasonality
Detect seasonal queries (Shokouhi, 2011)
• E.g., annual events such as US Open and Easter, or a 4-year recurring event such as FIFA World Cup
• Method: time-series decomposition using Holt-Winters adaptive exponential smoothing
• Input: time-series data extracted from external document collections, YD
• Compute a cosine similarity as seasonality: cos(Y, S) = (Y · S) / (‖Y‖ ‖S‖), where Y is the original time-series data and S is the seasonality component
Example queries (figures): World cup, Easter
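The seasonality feature can be illustrated with a short sketch. Note the substitution: instead of Holt-Winters decomposition, the seasonal component S is approximated here by per-phase means, which is an assumption made to keep the example self-contained; only the cosine similarity between Y and S follows the slide. The function names and the period value are hypothetical.

```python
import math


def seasonal_component(y, period):
    """Naive stand-in for the seasonal component: the mean of each phase, repeated."""
    if period <= 0 or period > len(y):
        raise ValueError("period must be between 1 and len(y)")
    means = [sum(y[i::period]) / len(y[i::period]) for i in range(period)]
    return [means[i % period] for i in range(len(y))]


def cosine_similarity(y, s):
    """Cosine of the angle between two equally long series."""
    dot = sum(a * b for a, b in zip(y, s))
    norm_y = math.sqrt(sum(a * a for a in y))
    norm_s = math.sqrt(sum(b * b for b in s))
    if norm_y == 0 or norm_s == 0:
        return 0.0
    return dot / (norm_y * norm_s)


def seasonality(y, period):
    """Seasonality score: similarity between the series and its seasonal component."""
    return cosine_similarity(y, seasonal_component(y, period))


periodic = [0, 10, 0, 10, 0, 10]   # repeats exactly every 2 steps
irregular = [0, 10, 7, 1, 3, 0]    # no clear period-2 pattern
print(seasonality(periodic, 2), seasonality(irregular, 2))
```

A perfectly periodic series coincides with its seasonal component and scores 1; the less of the signal the seasonal component explains, the lower the score.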
Autocorrelation
• Detect trending events by their predictability
• Cross-correlation of a time series with itself, i.e., between its past and future values at different time lags
• The stronger the inter-day dependencies, the higher the value of autocorrelation
• With lag=1, the 2nd time series is shifted by one day; this is called 1st-order autocorrelation
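As a worked illustration, the standard sample autocorrelation at a given lag can be computed as below. This is a sketch of the textbook estimator; the exact normalization used in the paper is not shown on the slide, so treat that choice as an assumption.

```python
def autocorrelation(y, lag=1):
    """Sample autocorrelation of series y at the given lag (lag=1 is 1st-order)."""
    n = len(y)
    if lag <= 0 or lag >= n:
        raise ValueError("lag must satisfy 0 < lag < len(y)")
    mean = sum(y) / n
    denom = sum((v - mean) ** 2 for v in y)
    if denom == 0:
        return 0.0  # a constant series carries no day-to-day dependency signal
    num = sum((y[t] - mean) * (y[t + lag] - mean) for t in range(n - lag))
    return num / denom


trending = [1, 2, 3, 4, 5]        # each day resembles the previous one
alternating = [1, -1, 1, -1]      # each day contradicts the previous one
print(autocorrelation(trending), autocorrelation(alternating))
```

The steadily rising series yields a positive value and the alternating one a negative value, matching the intuition that stronger inter-day dependencies give higher autocorrelation.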
Temporal KL-divergence
• Analyze a temporal distribution in a result set
• Measure the difference between the distribution over time of the top-k documents of q and that of the document collection C:
  KL(P(t|q) || P(t|C)) = Σ_t P(t|q) · log( P(t|q) / P(t|C) )
• P(t|q) is the probability of generating a publication date t given q
• P(t|C) is the probability of a publication date t in the collection
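A minimal sketch of this feature, assuming both distributions are maximum-likelihood estimates from publication dates (here month strings) and that every top-k date also occurs in the collection. The function name and date format are hypothetical, and no smoothing is applied.

```python
import math
from collections import Counter


def temporal_kl(top_k_dates, collection_dates):
    """KL divergence between P(t|q), from the top-k results, and P(t|C)."""
    q_counts = Counter(top_k_dates)
    c_counts = Counter(collection_dates)
    n_q, n_c = len(top_k_dates), len(collection_dates)
    kl = 0.0
    for t, count in q_counts.items():
        if c_counts[t] == 0:
            raise ValueError(f"date {t!r} does not occur in the collection")
        p_tq = count / n_q            # P(t|q)
        p_tc = c_counts[t] / n_c      # P(t|C)
        kl += p_tq * math.log(p_tq / p_tc)
    return kl


collection = ["2006-03", "2006-03", "2006-04", "2006-04"]
print(temporal_kl(["2006-03", "2006-04"], collection))   # same shape as the collection
print(temporal_kl(["2006-03", "2006-03"], collection))   # concentrated in one month
```

A result set spread over time like the collection scores 0, while results concentrated in a few periods score higher, which is the signal of a time-sensitive query.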
Surprise score
• Detect unseen events or surprisingly popular queries (Radinsky et al., 2012)
• Assume an unplanned event is happening when there is a significant prediction error
• Compute the sum of squared errors of prediction (SSE) using a simple linear regression model
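One way to read the surprise score is sketched below: fit a least-squares line to the historical query volume, extrapolate it, and sum the squared errors on the newly observed values. The split into history and observed window and the function names are assumptions for illustration, not details taken from the paper.

```python
def fit_line(y):
    """Ordinary least-squares line through (0, y[0]), (1, y[1]), ...; returns (slope, intercept)."""
    n = len(y)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = (n - 1) / 2
    mean_y = sum(y) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (v - mean_y) for x, v in enumerate(y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def surprise_score(history, observed):
    """Sum of squared prediction errors (SSE) of the extrapolated line on observed values."""
    slope, intercept = fit_line(history)
    start = len(history)
    return sum(
        (value - (intercept + slope * (start + i))) ** 2
        for i, value in enumerate(observed)
    )


history = [1, 2, 3, 4]                     # steady growth in daily query volume
print(surprise_score(history, [5, 6]))     # continues the trend
print(surprise_score(history, [5, 16]))    # sudden spike on the second day
```

The series that continues the trend scores close to zero, while the spike produces a large SSE, flagging a possibly unplanned event.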
Experiments
Query logs:
• Two datasets, i.e., AOL and MSN
• AOL: 30M queries, March 1 - May 31, 2006
• MSN: 15M queries from May 2006
Temporal collection:
• The New York Times Annotated Corpus
• 1.8M documents from 1987 - 2007
Setting:
• HeidelTime (Strötgen & Gertz, 2010) for time extraction and OpenNLP for entity extraction
• Cleansing-step parameters: Jaccard similarity threshold>0.2; edit distance<3; overlap n-gram=2
• For burstiness features, default parameters for the burst detection technique provided by CISHELL
In total, 837 event-related queries
Experimental results
Feature selection
• Study high-impact (best) features
• Investigate their importance independently of classification algorithms
• InfoGainAttributeEval method in WEKA
Our findings:
• Discriminative features are mostly derived from D and Q
• TemporalKL and kurtosis are among the influential features
• Trend-based features, such as autocorrelation, burst weight, and trending level, play an important role
• Seasonality computed from Q has less impact than the one extracted from D
Mining Dynamic Subtopics for Search Diversification
Query: ncaa
• 14/03/2006: march madness began
• 18/03/2006: ncaa women tournament began
• 01/04/2006: final four began
T. N. Nguyen and N. Kanhabua, Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification. In ECIR’2014
However, we are facing:
• Dramatic increase in content creation (e.g., digital photos)
• Increasing use of mobile devices with restricted capacity
• Information overload and changing professional and private lives
• Lack of systematic preservation, which leads to inadvertent forgetting
Forgetting plays a crucial role in human remembering and in our lives (the need to focus, to stress important information, and to forget details)
A Computer that forgets? Intentionally?? And in context of preservation???
Shouldn't there be something like forgetting in digital memories as well?
Forget IT
Managed forgetting ≠ automatic deletion
Managed forgetting = to remember the right information
Managed Forgetting
Based on:
• Careful information value assessment
• Forgetting strategies via policies
• Forgetting options to integrate final manual checking before deletion
• Combination with multi-tier storage solution
Managed forgetting ≠ automatic deletion
Instead: range of forgetting options, e.g.
• Resource condensation
• Change of indexing & ranking
• Reduction of redundancy
Automatic Deletion?
decreasing memory buoyancy
Use of tiers
Contextualized Remembering
Aim:
• Bring back information into active use in a meaningful way even if a lot of time has passed
• Aim for semantic level of preservation
Based on:
• Take into account relevant parts of context when moving to archive
• Increase contextualization of preserved content
• Consider context evolution over time (evolution-aware contextualization)
A. Ceroni, N. K. Tran, N. Kanhabua and C. Niederée, Bridging Temporal Context Gaps using Time-Aware Re-Contextualization, SIGIR’2014
Evolution-aware Contextualization & Re-contextualization
[Diagram labels: Context of Interpretation over time t (C, C‘, C‘‘, C‘‘‘); Information System; Archival Information System; Contextualization; Context-aware Preservation; Evolution-aware Contextualization; Re-contextualization; Semantic Evolution Detection; Human Forgetting, Change in focus, Structural changes; Semantic evolution, Structural evolution, Terminology evolution; D, Pres(D‘), Pres(C‘), Pres(C‘‘)]
To Keep or not to Keep: Personal Photo Selection
Low-investment and expectation-oriented photo selection
• Low-investment: no manual (semantic) annotation required; a model is generated by learning from training data with relevance judgments
• Expectation-oriented: the model aims at predicting which photos are perceived by users as important and likely to be selected
Investigation of the role of coverage in personal photo selection
• State-of-the-art methods consider coverage as a primary criterion
• Our work: an expectation-oriented method that explicitly models coverage with other features for importance prediction
A. Ceroni, V. Solachidis, C. Niederée, O. Papadopoulou, N. Kanhabua and V. Mezaris, To Keep or not to Keep: An Expectation-oriented Photo Selection Method for Personal Photo Collections. In ACM ICMR'2015.
M. Fu, A. Ceroni, V. Solachidis, C. Niederée, O. Papadopoulou, N. Kanhabua and V. Mezaris, Investigating Human Behaviors in Selecting Personal Photos to Preserve Memories, In IEEE ICME Workshop HMMP’2015.
A. Ceroni, V. Solachidis, M. Fu, N. Kanhabua, O. Papadopoulou, C. Niederée, and V. Mezaris, Learning Personalized Expectation-Oriented Photo Selection Models for Personal Photo Collections, In IEEE ICME HMMP’2015.
Personal Photo Dataset
Statistics: 91 photo collections (18,147 photos)
Approach Overview
Expectation-oriented Selection
• Importance prediction: learn selection behaviors from real personal photos
• Features: image quality, concepts, faces, clusters, near-duplicates
Hybrid Selection
• Combining importance prediction with explicitly modeling coverage
• 3 hybrid methods incorporating coverage in different ways
Overall Findings
• Predicting photo importance, based on user expectations as well as considering image- and collection-level features, improves the selection performance
• Coverage is not a primary criterion when selecting personal photos for preservation
selected not selected
The meeting place The Gammelstad Church Town (UNESCO)
From Individual Memories to Collective Memory
“Collective memory is a socially constructed, common image (memory) of the past of a community, which frames its current understanding and actions.” [Halbwachs, 1950]
• Crowd phenomenon and important to societal processes
• Not static, as determined by the concerns of the present
M. Halbwachs, On collective memory. Chicago: The University of Chicago Press, 1950 (Translation).
Flashbulb memories in cognitive psychology
• A study of remembering of high-impact events, e.g., the British Royal Wedding or the September 11 attacks
• Aspects: details, confidence, consistency of memory over time, impact of media coverage
• Qualitative study: limited number of events and users
Our approach
We propose a 3-step approach, for a given event:
1. Compute “remembering scores” of past events within the same category
2. Rank related past events by the computed remembering scores
3. Identify features (e.g., time, location) having a high correlation with remembering
Measuring Signals for Memory Revival
Remembering scores: a linear combination of three features:
1. Cross-correlation coefficient (CCF)
2. Sum of squared error (SSE)
3. Skewness (Kurtosis)
Remembering = α·CCF + β·SSE + γ·Kurtosis
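The scoring and ranking steps can be sketched as below. The slide does not give the weights, so the equal default values for α, β and γ are an assumption, as are the function names and the expectation that the three signals have been normalized to comparable ranges beforehand.

```python
def remembering_score(ccf, sse, kurtosis, alpha=1 / 3, beta=1 / 3, gamma=1 / 3):
    """Linear combination of the three signals; the default weights are placeholders."""
    return alpha * ccf + beta * sse + gamma * kurtosis


def rank_past_events(signals, **weights):
    """Rank past events by remembering score, highest first.

    signals maps an event name to its (ccf, sse, kurtosis) triple.
    """
    scored = {
        name: remembering_score(*triple, **weights)
        for name, triple in signals.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


# Hypothetical, already-normalized signal values for two past events.
signals = {
    "past event A": (0.9, 0.5, 0.5),
    "past event B": (0.1, 0.2, 0.3),
}
print(rank_past_events(signals))
print(rank_past_events(signals, alpha=0.5, beta=0.25, gamma=0.25))
```

Keeping the weights as keyword arguments makes it easy to tune them per event category once correlation with real remembering behavior has been measured.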
Features for Triggered Remembering
Temporal similarity:
• Time distance between two events (in days, months or years)
• Time distance based on exponential decay functions
Location similarity:
• Map a geographic hierarchy of event locations as follows: city -> state -> country -> neighbor countries -> continent
• Assign 4 scale values: 4 to same city, 3 to state, 2 to country, 1 to continent
Impact of Events:
• Damaged area/properties/cost/fatalities
• Magnitude (for earthquake events)
• Highest winds, lowest pressure (for Atlantic hurricanes)
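The temporal and location features can be illustrated with a small sketch. The decay rate, the tuple representation of a location, and the function names are assumptions for this example; the "neighbor countries" level is left out because the slide assigns no scale value to it.

```python
import math


def temporal_similarity(days_apart, decay_rate=0.001):
    """Exponential decay of similarity with the time distance between two events."""
    return math.exp(-decay_rate * abs(days_apart))


def location_similarity(loc_a, loc_b):
    """Scale value 0 to 4 from the deepest shared level of the hierarchy.

    Locations are (continent, country, state, city) tuples, so a shared city
    only counts when the state, country and continent match as well.
    """
    score = 0
    for a, b in zip(loc_a, loc_b):
        if a != b:
            break
        score += 1
    return score


# Placeholder place names, used only to show the scale values.
here = ("Continent X", "Country Y", "State Z", "City 1")
same_state = ("Continent X", "Country Y", "State Z", "City 2")
elsewhere = ("Continent W", "Country V", "State U", "City 3")

print(location_similarity(here, here), location_similarity(here, same_state),
      location_similarity(here, elsewhere))
print(temporal_similarity(0), temporal_similarity(365))
```

Comparing tuples from the top of the hierarchy downward reproduces the scale on the slide: 4 for the same city, 3 for the same state, 2 for the same country, 1 for the same continent, and 0 otherwise.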
N. Kanhabua and K. Nørvåg: Determining time of queries for re-ranking search results. In ECDL 2010
J. Strötgen, M. Gertz, and C. Junghans: An event-centric model for multilingual document similarity. In SIGIR 2011
Category: Atlantic Hurricane
• Hurricane Hanna triggers remembering of Hurricane Gustav, the most recent hurricane to strike the area of Puerto Rico and the East Coast
• Hurricane Sandy triggers remembering of the 1991 Perfect Storm, which initially formed around the Canada area and is high-impact and most destructive
Category: Aviation accidents
Mixture of impact factors, such as time and location
• Qantas Flight 32 (engine failure on 4 November 2010) triggers remembering of (1) Qantas Flight 30 and British Airways Flight 9 (both going to Australia), and (2) Aero Caribbean Flight 883 (most recent event)
[Figure annotations: most recent; same destination; deadliest (two aircraft collided); Concorde]
Category: Earthquakes
• 2010 Canterbury earthquake triggers remembering of the 2010 Haiti earthquake (recent and high-impact), two close-by events, and high-impact historical earthquakes
• 2011 Christchurch earthquake shows a locality focus, i.e., people seem to be interested in previous events in the same region
• For the June 2011 Christchurch earthquake, the remembered events are dominated by the two predecessor events
Look beyond single events, especially if there are several events in temporal and local proximity.
Category: Terrorist incidents
Interesting observation: semantic similarity between events
• June 2012 Kaduna church bombings trigger remembering of other religion-related terror attacks
• 2008 Mumbai attacks trigger remembering of terror attacks on business, entertainment and hotel targets
Thank you!
Questions or suggestions?