understanding & evaluating search sessions
DESCRIPTION
A talk given in the University of Leeds School of Computing, on the nature of extended search sessions, and on evaluating/measuring learning/sensemaking during longer research sessions.
TRANSCRIPT
Dr Max L. Wilson http://cs.nott.ac.uk/~mlw/
Extended Searching Sessions and Evaluating Success
Dr Max L. Wilson, Mixed Reality Lab
University of Nottingham, UK
Friday, 10 May 13
Studying Extended Search Success In Observable Natural Sessions (SESSIONS)
Extended Searching Sessions and Evaluating Sensemaking Success
About Me
Study 1: The Real Nature of Sessions
Study 2: Evaluating Sensemaking Success
About me
MEng & PhD in Southampton
Taught in Swansea for 3 years
Moved to Nottingham April 2012
About Me
UIST2008
JCDL2008
My PhD
Bates, M. J. (1979a). Idea tactics. Journal of the American Society for Information Science, 30(5):280–289.
Bates, M. J. (1979b). Information search tactics. Journal of the American Society for Information Science, 30(4):205–214.
Belkin, N. J., Marchetti, P. G., and Cool, C. (1993). Braque: design of an interface to support user interaction in information retrieval. Information Processing and Management, 29(3):325–344.
My PhD
Wilson, M. L., schraefel, m. c., and White, R. W. (2009). Evaluating advanced search interfaces using established information-seeking models. Journal of the American Society for Information Science and Technology, 60(7):1407–1422.
Search User Interface Design
My Team
Horia Maior, Matthew Pike, Jon Hurlock, Paul Brindley, Zenah Alkubaisy
Chaoyu (Kelvin) Ye (Study 1)
Mathew Wilson (Study 2)
Extended Searching Sessions and Evaluating Sensemaking Success
About Me
Study 1: The Real Nature of Sessions
Study 2: Evaluating Sensemaking Success
People Searching the Web
Elsweiler, D., Wilson M. L. and Kirkegaard-Lunn, B. (2011) Understanding Casual-leisure Information Behaviour. In Spink, A. and Heinstrom, J. (Eds) New Directions in Information Behaviour. Emerald Group Publishing Limited, pp 211-241.
The Search Communities
Ingwersen, P., Jarvelin, K., 2005. The turn: integration of information seeking and retrieval in context. Springer, Berlin, Germany.
The IR Community
• Focused on Accuracy
• Are these results relevant?
• How many are relevant?
• Did we get all the relevant ones?
The Search Communities
The IS Community
• Focused on Success
• Did they find the right result?
• How long did they take?
• How many interactions?
Ingwersen, P., Jarvelin, K., 2005. The turn: integration of information seeking and retrieval in context. Springer, Berlin, Germany.
The Search Communities
The IB Community
• Focused on Quality
• Did they do a good job?
• How did the UI affect the task?
• Was the higher level motivating task achieved more successfully?
Ingwersen, P., Jarvelin, K., 2005. The turn: integration of information seeking and retrieval in context. Springer, Berlin, Germany.
The Search Communities
“Relatively” well known
“Naively estimated” - Study 1
“Simplistically” measured - Study 2
Work Tasks
• Work tasks - typically considered work-led information-intensive activities that lead to searching
• Can be out-of-work - like planning holidays, or buying a car
• We’ve begun looking at motivating ‘tasks’ outside of work
Casual Leisure Work Tasks
‘explore’, and ‘search’ in their past, present, and future tenses. 12 seed-terms were used to query Twitter each hour, with the 100 newest tweets being stored each time. Our corpus contains information about hundreds of thousands of real human searching scenarios and information needs; some examples are shown in Figure 1.
To investigate the information behaviours described in the corpus, we embarked on a large-scale qualitative, inductive analysis of these tweets using a grounded theory approach. With the aim of building a taxonomy of searching scenarios and their features, we have so far coded 2500 tweets in approx. 40 hrs of manual coding time. Already, we have begun to develop a series of dimensions and learned, ourselves, a great deal about the kinds of search scenarios that people experience in both the physical and digital domains.
To date, we have identified 10 dimensions within our taxonomy, 6 of which were common in the dataset and have become fairly stable. We will present this taxonomy in future work, when more tweets have been coded and the taxonomy is complete. Further, once the taxonomy is stable and has been tested for validity, we will use alternative automatic or crowd-sourcing techniques to gain a better idea of how important the factors are and how they relate. Here, however, we will highlight some of the casual-leisure search behaviours documented so far.
4.1 Need-less browsing
Much like the desire to pass time at the television, we saw many examples (some shown in Table 3) of people passing time, typically associated with the ‘browsing’ keyword.
1) ... I’m not even *doing* anything useful... just browsing eBay aimlessly...
2) to do list today: browse the Internet until fasting break time..
3) ... just got done eating dinner and my family is watching the football. Rather browse on the laptop
4) I’m at the dolphin mall. Just browsing.
Table 3: Example tweets where the browsing activity is need-less.
From the collected tweets it is clear that the information needs in these situations are often not only fuzzy, but typically absent. The aim appears to be focused on the activity, where the measure of success would be in how much they enjoyed the process, or how long they managed to spend ‘wasting time’. If we model these situations by how they manage to make sense of the domain, or how they progress in defining their information need, then we are likely to provide the wrong types of support, e.g. these users may not want to be supported in defining what they are trying to find on eBay, nor be given help to refine their requirements. We should also point out, however, that time-wasting browsing was not always associated with positive emotions (Table 4).
1) It’s happening again. I’m browsing @Etsy. Crap.
2) browsing ASOS again. tsk.
3) hmmm, just realizd I’ve been browsing ted.com for the last 3 hours.
Table 4: Example tweets where the information-need-less browsing has created negative emotions.
The addictive nature of these activities came through repeatedly and suggests perhaps that support is needed to curtail exploration when it is not appropriate.
4.2 Exploring for the experience
Mostly related to the exploration of a novel physical space, we saw many people exploring with family and friends. The aim in these situations (see Table 5) is often not to find specific places, but to spend time with family.
1) exploring the neighbourhood with my baby!
2) What a beautiful day to be outside playing and exploring with the kids:)
3) Into the nineties and exploring dubstep [music] while handling lots of small to-dos
Table 5: Example tweets where the experience outweighs the things found.
In these cases, the goal may be to investigate or learn about the place, but the focus of the activity is less on the specific knowledge gained than on the experience itself. Another point of note is that in these situations people regularly tried to behave in such a way that accidental or serendipitous discoveries were engendered. While examples 1) and 2) are physical-world examples, it is easy to imagine digital-world equivalents, such as exploring the Disney website with your children.
Below we attempt to combine the characteristics we have discovered to create an initial definition of what we refer to as casual search.
5. CASUAL SEARCH
We have seen many examples of casual information behaviours in these recent projects, but here we highlight the factors that make them different from our understanding of Information Retrieval, Information Seeking, Exploratory Search, and Sensemaking. First, we should highlight that it is not specifically their information-need-less nature that breaks the model of exploratory search, although some examples were without an information need entirely. The differentiators are more in the motivation and reasoning for searching, where all of our prior models of search are typically oriented towards finding information, but casual search is typically motivated by more hedonistic reasons. We present the following defining points for casual search tasks:
• In casual search the information found tends to be of secondary importance to the experience of finding.
• The success of casual search tasks is usually not dependent on actually finding the information being sought.
• Casual search tasks are often motivated by being in, or wanting to achieve, a particular mood or state. Tasks often relate at a higher level to the quality of life and health of the individual.
• Casual search tasks are frequently associated with very under-defined or absent information needs.
These defining points break our models of searching in several ways. First, our models focus on an information need, where casual search often does not. Second, we measure success in regards to finding the information rather than the experience of searching. Third, the motivating scenarios we use are work-tasks, which often is not appropriate in casual search.
Wilson, M. L. and Elsweiler, D. (2010) Casual-leisure Searching: the Exploratory Search scenarios that break our current models. In: 4th HCIR Workshop , Aug 22 2010. pp 28-31.
People Searching the Web
Elsweiler, D., Wilson M. L. and Kirkegaard-Lunn, B. (2011) Understanding Casual-leisure Information Behaviour. In Spink, A. and Heinstrom, J. (Eds) New Directions in Information Behaviour. Emerald Group Publishing Limited, pp 211-241.
Sessions
• Traditionally examined by analysing logs for stats
• In the 90s, it was suggested sessions are broken by gaps of ~25 mins - more recently by ~5 mins
• BUT evidence shows web use typically interleaves tasks - AND tabs make this all much harder
• Has become a big focus at Dagstuhls/workshops
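The time-gap heuristic mentioned here (a new session whenever the gap between consecutive log events exceeds a threshold) can be sketched as follows. This is an illustrative sketch, not code from the talk; the event format and threshold are assumptions:

```python
from datetime import datetime, timedelta

def split_sessions(events, gap=timedelta(minutes=5)):
    """Split a time-ordered list of (timestamp, action) log events into
    sessions, starting a new session whenever the gap between consecutive
    events exceeds the threshold (~5 mins in recent work, ~25 in the 90s)."""
    sessions = []
    current = []
    last_time = None
    for time, action in events:
        if last_time is not None and time - last_time > gap:
            sessions.append(current)
            current = []
        current.append((time, action))
        last_time = time
    if current:
        sessions.append(current)
    return sessions

# Hypothetical log: a 38-minute gap splits these events into two sessions.
log = [
    (datetime(2013, 5, 10, 9, 0), "query: leeds talks"),
    (datetime(2013, 5, 10, 9, 2), "click result"),
    (datetime(2013, 5, 10, 9, 40), "query: buy a car"),
]
print(len(split_sessions(log)))  # 2
```

The talk's point is precisely that this heuristic is naive: paused-in-tabs sessions and interleaved tasks both break the single-threshold assumption.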
Search Trails
• Aimed at finding common end locations for queries
• An interesting step towards sessions though
• Most involved some trail features (not query+click)
White, Ryen W., and Steven M. Drucker. “Investigating behavioral variability in web search.” In Proc. WWW 2007. ACM.
Top Sessions as Seen by Bing
Bailey et al, User task understanding: a web search engine perspective, NII Shonan, 8 Oct 2012
Study 1: Investigating Extended Sessions
What on earth is happening here?
Study 1: Interview Method
Send & Preprocess History - a history artefact, approx 300 items
How would you define a session? (10 mins)
Mark out history into sessions, starting recently + create ‘cards’ of varying types of ‘sessions’ (20-30 mins; 15-20 cards)
Open card sort + closed card sort (30-50 mins)
Outputs: interview recording, cards, card sorts, marked history file, log data
Study 1: Data
• Rich discussion of ~20 Sessions per participant
• Currently: 7 participants and ~120 sessions - richly described and compared
• Aiming for: 12 participants and 200+ sessions at first
Study 1: Questions for Sessions
1) Where was this done (e.g. work vs home vs mobile)
2) With who (collaborative?)
3) For who (shared task?)
4) Devices involved (whether devices affect things)
5) Length of the Session (how do they define long?)
6) Successful or not (for future measurement insights)
At some point: tried to learn these for each session
Study 1: A Card
Study 1: Card Sorting
• We aimed first to let them define the dimensions - this lets us see how they define things - how do they self-categorise different sessions
• We then had some targeted card sorts - for who, duration, difficulty, importance, location - what’s short vs long? - what’s important vs not? - how do people divide work vs home, etc.
Study 1: Example Card Sorts
Study 1: Preliminary Findings
• avg 21 cards per person, inc. ~8 sessions of 5+ mins - ~4 work & ~4 leisure
• 18.6% of those extended sessions involved task switches
• avg length: 17.5 mins; avg #queries: 3.55
• short: a third said <30s, a third said <1m, a third said <30m
• long: a third said >1 hour, a third said >5 mins
Study 1: Preliminary Findings
• longest sessions: entertainment, work prep, news, shopping
• longest leisure: 22-76mins youtube, 28mins news
• most important: work, money, urgent shopping
• least important: leisure, entertainment, free time
• most difficult: technical work prep
Study 1: Preliminary Findings
• Huge divide over where sessions start or stop - many people considered a session to span a large break - paused and left in tabs
• One person divided a single topical episode by phases - and phases were sessions - e.g. broadening/confused stage vs successful focus stage
• One person divided a single topical episode by major sources - moved from web searching to video searching on the same topic
What is a session?
Implications for where/when to measure success
Study 1: What is a session?
Single topic - changing purpose
Study 1: What is a session?
Single topic - pausing sessions
Study 1: What is a session?
Low-query extended sessions
Study 1: Other observations
• Seeing an informal relationship between who tasks are for - and skewed importance - including for another person, or for a group - and slow sequential interactions (as they talk to others)
• Seeing a strong low-query correlation with entertainment - seeing serious-leisure more similar to work tasks
• Hard tasks have high query loads - and are related to rare or new areas
Study 1: Summary
• We’re beginning to get some real insight into real sessions
• Already identifying examples where time-splitting isn’t sufficient - but intention changing is common
• We’re seeing possible common patterns of overlapping sessions
• We haven’t finished!
Study 2: Evaluating Sensemaking
“Simplistically” measured - Study 2
Wilson, M. J. and Wilson, M. L. (2012) A Comparison of Techniques for Measuring Sensemaking and Learning within Participant-Generated Summaries. In: JASIST (accepted).
Study 2: “Simplistically” measured
• If learning is closed: then a quiz - “closed” determines WHAT should be learned - can measure recall, but also recognition if cued by the question
• If learning is open: a) sub-topic count (integer) & topic quality (judged Likert); b) simple count of facts (integer) and statements (integer)
• These do not measure how “good” the learning was
Study 2: Measuring “Depth” of Learning
• A theory from Education
• As learning improves you progress up the diagram
• You begin to ‘understand’ - then critically ‘analyze’ - then ‘evaluate’ information, etc.
Image from: http://www.nwlink.com/~donclark/hrd/bloom.html
Study 2: Developed 3 Scales
• 12 participants performed 3 learning tasks - mix of high and low prior knowledge
• 1) Write summary of knowledge, 2) Learn, 3) Write summary
• 36 pairs of pre/post summaries - 18 high prior knowledge - 18 low prior knowledge
Study 2: Developed 3 Scales
• Inductive grounded theory analysis
• 3 rounds of 6 high and 6 low pairs analysed by 2 researchers
• Validated by an external judge
• Until high Fleiss’ kappa scores, i.e. ‘substantial agreement’
Study 2: Measure 1: D-Qual
We went through three major iterations of refining our measurements until we reached
‘substantial agreement’, according to Landis and Koch (1977), between judges. For final
validation of our scores, we used Fleiss’ Kappa (Fleiss, 1971) to determine the agreement
between the two authors and an independent third judge. Our Fleiss Kappa scores are reported
inline below as we describe the scales we produced.
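As a worked sketch of the agreement statistic referred to here, Fleiss' kappa for a set of items each rated by the same number of judges can be computed as below. The rating matrix is a made-up toy example, not the study's data:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for inter-judge agreement.

    ratings[i][j] = number of judges who assigned category j to item i;
    every item must be rated by the same number of judges."""
    n_items = len(ratings)
    n_judges = sum(ratings[0])
    n_categories = len(ratings[0])
    # Proportion of all assignments falling into each category.
    p_cat = [sum(row[j] for row in ratings) / (n_items * n_judges)
             for j in range(n_categories)]
    # Observed agreement on each item, averaged across items.
    p_item = [(sum(c * c for c in row) - n_judges) /
              (n_judges * (n_judges - 1)) for row in ratings]
    p_obs = sum(p_item) / n_items
    # Agreement expected by chance.
    p_exp = sum(p * p for p in p_cat)
    return (p_obs - p_exp) / (1 - p_exp)

# Toy example: 4 summaries scored on a binary scale (e.g. D-Crit) by 3 judges.
toy = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(round(fleiss_kappa(toy), 3))  # 0.333
```

On the Landis and Koch scale cited in the text, values of 0.61-0.80 count as 'substantial agreement', which is the band the study's 0.64, 0.58 and 0.74 scores sit in or near.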
3.2.2 The measures produced by our process
Our first measure for depth of learning was ‘D-Qual’, shown in Table 1, which
focused on the quality of recalled facts by their usefulness and was measured on a four-point
scale ranging from irrelevant or useless facts (0 points) to facts that showed a level of
technical understanding (3 points). The emphasis on usefulness in this measure meant that it
was closer to the “understanding” level of Bloom’s revised taxonomy, rather than simply
“remembering”. It was important to differentiate between the two levels as many poor
summaries, as determined by the authors during the coding session, simply listed many
redundantly obvious facts (“A labrador is a dog”) rather than describing them in sentences
and summaries. For D-Qual, the judges achieved a Fleiss kappa of 0.64.
Rating Description
0 Facts are irrelevant to the subject; Facts hold no useful information or advice.
1 Facts are generalised to the overall subject matter; Facts hold little useful information or advice.
2 Facts fulfil the required information need and are useful.
3 A level of technical detail is given via at least one key term associated with the technology of the subject; Statistics are given.
Table 1: Quality of Facts (D-Qual).
Many of the better summaries interpreted facts into more intelligent statements. To
identify this, D-Intrp (Table 2) measured summaries in how they synthesised facts and
statements to draw conclusions and deductions (Bloom’s “analysing”) using a 3-point scale.
This ranged from simply listing facts with no further interpretation (0 points) to structured
combinations in patterns (2 points). The judges achieved a Fleiss kappa of 0.58 for D-Intrp.
Measure understanding rather than remembering
Study 2: Measure 2: D-Intrp
Rating Description
0 Facts contained within one statement with no association.
1 Association of two useful or detailed facts: ‘A -> B’
2 Association of multiple useful or detailed facts: ‘A+B->C’; ‘A->B->C’; ‘A->B∴C’
Table 2: Interpretation of data into statements (D-Intrp).
D-Crit reflected Bloom’s concept of “evaluating” by identifying statements that
compared facts, or used facts to raise questions about other statements. The measurement for
D-Crit was either true (1 point) or false (0 points), as shown in Table 3. A Fleiss kappa of
0.74 was achieved.
Rating Description
0 Facts are listed with no further thought or analysis.
1 Both advantages and disadvantages listed; Comparisons drawn between items; Participant deduced his or her own questions.
Table 3: Use of critique (D-Crit).
We did not produce a scale for level three of Anderson’s revised version of Bloom’s taxonomy, “applying”, since the act of writing a summary would not require the participant to carry out a procedure that has been learned. This level of learning was thus not identifiable in our corpus of summaries. Similarly, the highest level, “creating”, also goes beyond writing about a topic, to more practical elements of learning, and so was also left out.
4 Evaluation and Comparison of Measures
Having developed our new measures from our initial sample set of written
summaries, we performed a larger user study using a similar protocol. Our new measure was
compared with the two other common analytical measures of written summaries: fact
counting and topic analysis. We used the same study protocol that was pilot tested in our
initial study, refining the Work Task descriptions and procedure slightly. One clear example
of the improvements, beyond the wording of tasks, was to change the medium of written
Measure analysing capabilities
Study 2: Measure 3: D-Crit
Measure evaluating capabilities
Study 2: Evaluating these measures
Compare against Counting & Topic measures
while facts were defined as individual pieces of information either explicitly listed or
contained within statements. Finally, using these two sub-measures we also created ‘F-Ratio’
which represented the ratio of facts per statement.
To measure breadth and depth of topics, we first outlined some common topics that were found in the six tasks of the pilot study (i.e. for buying a dog the topics were history of the breed, health concerns, caring for the dog, and personality). Then, to measure breadth (‘T-Count’), we counted the number of topics that the participant covered in their summary. To measure depth (‘T-Depth’), each topic was measured on a 4-point scale ranging from not covered (0 points) to detailed focused coverage (3 points) and averaged.
As the process of learning is primarily internal it is difficult to measure it objectively.
For this reason our measures of learning focused on the difference between pre- and post-task
knowledge held by the participant.
Code      Measurement                                        Scale
D-Qual    Recall of facts                                    0 – 3 points
D-Intrp   Interpretation of data into statements             0 – 2 points
D-Crit    Critique                                           0 – 1 point
F-Fact    Number of facts                                    Count
F-State   Number of statements                               Count
F-Ratio   Ratio of facts per statement                       Average
T-Count   Number of topics covered (breadth of knowledge)    Count
T-Depth   Level of topic focus (depth of knowledge)          0 – 3 points, averaged
Table 4: Outline of coding scheme used for analysis.
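To make the scheme concrete, the eight measures of Table 4 could be encoded as a simple lookup, with the F-Ratio arithmetic spelled out. This sketch and its names are illustrative, not code from the study:

```python
# Table 4's coding scheme: judged measures carry a (min, max) scale,
# open-ended count measures are marked None.
CODING_SCHEME = {
    "D-Qual":  ("Recall of facts", (0, 3)),
    "D-Intrp": ("Interpretation of data into statements", (0, 2)),
    "D-Crit":  ("Critique", (0, 1)),
    "F-Fact":  ("Number of facts", None),
    "F-State": ("Number of statements", None),
    "F-Ratio": ("Ratio of facts per statement", None),
    "T-Count": ("Number of topics covered", None),
    "T-Depth": ("Level of topic focus", (0, 3)),
}

def f_ratio(n_facts, n_statements):
    """F-Ratio as defined in the text: facts per statement."""
    return n_facts / n_statements

# A summary containing 12 facts spread across 4 statements:
print(f_ratio(12, 4))  # 3.0
```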
5 Results
Before beginning, the data from two participants were removed from the analysis. A
first-pass sanity check over the collected summaries revealed that they had misunderstood the
tasks set. One chose to describe their own feelings and history relating to the task topic, rather
than trying to answer the task. Another described what they intended to search for in their pre-task summaries, meaning that they could not be compared to other pre-task summaries or used to measure information gain. The analyses below relate to the remaining 34 participants.
With each participant creating 3 pairs of summaries (pre- and post-task), a total of 204
summaries, or 102 pairs of pre- and post-task summaries, were analysed using all the
• Can you differentiate pre- & post- task summaries?
• Can you differentiate high & low prior knowledge?
• How long do summaries need to be?
Study 2: Analysing summaries
Pre-task example
Study 2: Analysing summaries
Post-task example
Study 2: Results
knowledge, especially for pre-task summaries, which can possibly be explained by the fact that participants who wrote shorter summaries based on high prior knowledge are more likely to concentrate on a single topic.
          All                        Pre-task                   Post-task
D-Qual    U(68) = 537.5, p = 0.32    U(34) = 125, p = 0.28      U(34) = 148, p = 0.46
D-Intrp   U(68) = 642, p = 0.21      U(34) = 145, p = 0.47      U(34) = 174, p = 0.16
D-Crit    U(68) = 570, p = 0.47      U(34) = 140, p = 0.47      U(34) = 144.5, p = 0.49
F-Fact    t(66) = -0.4, p = 0.35     t(32) = -0.75, p = 0.23    t(32) = -0.25, p = 0.4
F-State   t(66) = -0.21, p = 0.42    t(32) = -0.4, p = 0.35     t(32) = -0.17, p = 0.43
F-Ratio   t(66) = 0.2, p = 0.42      t(32) = 0.31, p = 0.38     t(32) = -0.04, p = 0.48
T-Count   t(66) = -0.35, p = 0.36    t(32) = 0.43, p = 0.34     t(32) = -1.01, p = 0.16
T-Depth   U(68) = 721, p = 0.04 *    U(34) = 194.5, p = 0.04 *  U(34) = 168, p = 0.21
Table 12: Comparing high and low prior knowledge in shorter summaries. * Indicates significant results.
          All                        Pre-task                   Post-task
D-Qual    U(68) = 390, p = 0.01 *    U(34) = 89.5, p = 0.03 *   U(34) = 113.5, p = 0.18
D-Intrp   U(68) = 497.5, p = 0.16    U(34) = 158.5, p = 0.29    U(34) = 95, p = 0.06
D-Crit    U(68) = 693.5, p = 0.08    U(34) = 189, p = 0.05 *    U(34) = 154, p = 0.32
F-Fact    t(66) = 1.62, p = 0.06     t(32) = 0.64, p = 0.26     t(32) = 1, p = 0.16
F-State   t(66) = 1, p = 0.16        t(32) = 0.29, p = 0.39     t(32) = 0.79, p = 0.22
F-Ratio   t(66) = 0.86, p = 0.2      t(32) = 0.31, p = 0.38     t(32) = 0.21, p = 0.42
T-Count   t(66) = 3.44, p = 0.0005 * t(32) = 1.92, p = 0.03 *   t(32) = 2.82, p = 0.004 *
T-Depth   U(68) = 572, p = 0.48      U(34) = 163, p = 0.25      U(34) = 142, p = 0.48
Table 13: Comparing high and low prior knowledge in longer summaries. * Indicates significant results.
Conversely, however, some measures were able to differentiate between high and low
prior knowledge, even after the task, when summaries were longer, as shown in Table 13.
Looking at the longer pre-task summaries we find that D-Qual shows signs of significant
difference along with critique (D-Crit) and the number of topics covered (T-Count). This
indicates that use of critique in pre-task summaries is a strong differentiator, but only in
longer examples. Like before, however, D-Crit’s significance is lost in the post-task
summary, perhaps indicating that all post-task summaries include some level of critique. A
more sensitive measure of critique (D-Crit) may be required and studied in future work.
Unlike in our initial analysis, however, we find that one measure (T-Count) is able to tell the
difference between high and low prior knowledge, in both pre- and post-task summaries, if
they are longer. Again, this indicates that designing tasks such that participants write longer
summaries may make it easier for measures to measure learning.
Pretty obvious - as you can see
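The tables report Mann-Whitney U statistics for the judged ordinal measures (D-Qual, D-Intrp, D-Crit, T-Depth) and t-statistics for the count measures. A minimal sketch of the U statistic itself, on made-up scores rather than the study's data:

```python
def mann_whitney_u(sample_a, sample_b):
    """Mann-Whitney U statistic for two independent samples: over every
    cross-sample pair, count 1 when a > b and 0.5 for a tie. Appropriate
    for ordinal scores such as the 0-3 D-Qual ratings; p-values then come
    from the U distribution (e.g. via scipy.stats.mannwhitneyu)."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

# Hypothetical D-Qual ratings for high- vs low-prior-knowledge groups.
high_prior = [3, 2, 2, 3, 1]
low_prior = [1, 2, 0, 1, 1]
print(mann_whitney_u(high_prior, low_prior))  # 21.5
```

The t-tests used for the count measures (F-Fact, F-State, F-Ratio, T-Count) assume roughly interval-scaled data, which is why the judged scales get the rank-based test instead.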
Study 2: Results
• 1) Most measures could identify learning (between pre- and post-task summaries) - more robust with longer summaries
despite being shorter, while others were poor quality and much longer. There are situations,
therefore, where the length of the summaries may require a more thoughtful consideration.
6.4 Recommendations
To identify learning, all measures detailed here were generally effective, but both the length of the summaries and the prior knowledge held by the participant should be taken into consideration. Table 14 provides an overview of the strengths and weaknesses of each measure, and recommendations are made below. While serving as a guide, readers should refer back to the full text in our results section for more detail before using them in a study.
[Table 14 compares each measure (D-Qual, D-Intrp, D-Crit, F-Fact, F-State, F-Ratio, T-Count, T-Depth) on whether it identifies learning (high/low prior knowledge; short/long summaries), identifies prior knowledge (short/long summaries; pre/post-task), and ignores summary length.]
Table 14: Overview of measure suitability.
If participants have written shorter summaries (here averaged to around 90 words) then learning is only really noticeable if those participants began with low prior knowledge, where measures such as the quality of facts (D-Qual), simple fact and statement counting (F-Fact, F-State) and topic coverage (T-Count) can be used to determine an increase of knowledge. If short summaries are written based on high prior knowledge then only simple fact and statement counting (F-Fact, F-State) and the depth of topics (T-Depth) reflected an increase.
If participants have written longer summaries (here averaged to around 180 words)
measures such as the quality and number of facts (D-Qual and F-Fact, respectively), ratio of
facts to statements (F-Ratio) and topic depth (T-Depth) can be used in both high and low prior
knowledge situations. Additionally, when the participant has high prior knowledge the
interpretation of facts (F-State) can be used.
When attempting to determine prior knowledge we were only able to use topic depth
(T-Depth) effectively when looking at shorter summaries. Using longer summaries allows
Study 2: Results
• 2) Only some were good at identifying prior knowledge - these required long pre-task summaries to be written
• 3) Our measures were the most robust to length of summary - others require pushing participants beyond 200 words
Study 2: Conclusions
• We proposed a new measure based on depth of learning - demonstrating higher levels of thinking
• This was more robust to the size of the written summary - good at both long and short summaries while measuring learning - able to determine if someone has existing high knowledge
• All measures did surprisingly well at measuring learning
• Ours was the most robust for determining prior knowledge level
• Future work: behaviour of good vs. bad learners
Talk Summary
• Search communities are trying to move beyond simple tasks - more than result quality and time to target
• The current focus is on understanding sessions - which has primarily meant splitting logs by time gaps
• Our work: 1) moving beyond assumptions about sessions, 2) introducing new methods to evaluate sensemaking
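The time-gap session splitting mentioned above is conventionally done along these lines. This is an illustrative sketch, not any particular search engine's implementation; the 30-minute cutoff is a common but arbitrary choice, which is exactly the assumption the talk questions.

```python
# Split a user's query log into "sessions" wherever the gap between
# consecutive timestamps exceeds a cutoff. The cutoff (30 minutes here)
# is an assumed convention, not a property of real search sessions.

from datetime import datetime, timedelta

def split_sessions(timestamps, gap=timedelta(minutes=30)):
    """Group query timestamps into sessions by inter-query gap."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= gap:
            sessions[-1].append(t)   # continue the current session
        else:
            sessions.append([t])     # gap too large: start a new session
    return sessions

# Example log: two queries five minutes apart, then one nearly two hours later.
log = [datetime(2013, 5, 10, 9, 0),
       datetime(2013, 5, 10, 9, 5),
       datetime(2013, 5, 10, 11, 0)]
print(len(split_sessions(log)))  # 2 sessions
```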
• There’s a long way to go before search engines know what we’re doing beyond a query (and its immediate refinements) - and a long way before we do
• We also still need to measure: - success in decision making (like online shopping) - success in entertainment sessions