Understanding & Evaluating Search Sessions


DESCRIPTION

A talk given in the University of Leeds School of Computing, on the nature of extended search sessions, and on evaluating/measuring learning/sensemaking during longer research sessions.

TRANSCRIPT

Page 1: Understanding & Evaluating Search Sessions

Dr Max L. Wilson http://cs.nott.ac.uk/~mlw/

Extended Searching Sessions and Evaluating Success

Dr Max L. Wilson, Mixed Reality Lab

University of Nottingham, UK

Page 2: Understanding & Evaluating Search Sessions

Studying Extended Search Success In Observable Natural Sessions

SESSIONS

Page 3: Understanding & Evaluating Search Sessions

Extended Searching Sessions and Evaluating Sensemaking Success

About Me

Study 1: The Real Nature of Sessions

Study 2: Evaluating Sensemaking Success

Page 5: Understanding & Evaluating Search Sessions

About Me

Page 6: Understanding & Evaluating Search Sessions

Page 7: Understanding & Evaluating Search Sessions

UIST2008

JCDL2008

Page 8: Understanding & Evaluating Search Sessions

My PhD

Bates, M. J. (1979a). Idea tactics. Journal of the American Society for Information Science, 30(5):280–289.

Bates, M. J. (1979b). Information search tactics. Journal of the American Society for Information Science, 30(4):205–214.

Belkin, N. J., Marchetti, P. G., and Cool, C. (1993). Braque: design of an interface to support user interaction in information retrieval. Information Processing and Management, 29(3):325–344.

Page 9: Understanding & Evaluating Search Sessions

My PhD

Wilson, M. L., schraefel, m. c., and White, R. W. (2009). Evaluating advanced search interfaces using established information-seeking models. Journal of the American Society for Information Science and Technology, 60(7):1407–1422.

Page 10: Understanding & Evaluating Search Sessions

Search User Interface Design

Page 11: Understanding & Evaluating Search Sessions

My Team

Horia Maior, Matthew Pike, Jon Hurlock, Paul Brindley, Zenah Alkubaisy

Chaoyu (Kelvin) Ye (Study 1)

Mathew Wilson (Study 2)

Page 12: Understanding & Evaluating Search Sessions

Extended Searching Sessions and Evaluating Sensemaking Success

About Me

Study 1: The Real Nature of Sessions

Study 2: Evaluating Sensemaking Success

Page 13: Understanding & Evaluating Search Sessions

People Searching the Web

Elsweiler, D., Wilson M. L. and Kirkegaard-Lunn, B. (2011) Understanding Casual-leisure Information Behaviour. In Spink, A. and Heinstrom, J. (Eds) New Directions in Information Behaviour. Emerald Group Publishing Limited, pp 211-241.

Page 14: Understanding & Evaluating Search Sessions

The Search Communities

Ingwersen, P., Jarvelin, K., 2005. The turn: integration of information seeking and retrieval in context. Springer, Berlin, Germany.

The IR Community

• Focused on Accuracy

• Are these results relevant?

• How many are relevant?

• Did we get all the relevant ones?

Page 15: Understanding & Evaluating Search Sessions

The Search Communities

The IS Community

• Focused on Success

• Did they find the right result?

• How long did they take?

• How many interactions?

Ingwersen, P., Jarvelin, K., 2005. The turn: integration of information seeking and retrieval in context. Springer, Berlin, Germany.

Page 16: Understanding & Evaluating Search Sessions

The Search Communities

The IB Community

• Focused on Quality

• Did they do a good job?

• How did the UI affect the task?

• Was the higher-level motivating task achieved more successfully?

Ingwersen, P., Jarvelin, K., 2005. The turn: integration of information seeking and retrieval in context. Springer, Berlin, Germany.

Page 17: Understanding & Evaluating Search Sessions

The Search Communities

“Relatively” well known

“Naively estimated” - Study 1

“Simplistically” measured - Study 2

Page 18: Understanding & Evaluating Search Sessions

Work Tasks

• Work tasks - typically considered work-led, information-intensive activities that lead to searching

• Can be out-of-work - like planning holidays, or buying a car

• We’ve begun looking at motivating ‘tasks’ outside of work

Page 19: Understanding & Evaluating Search Sessions

Casual Leisure Work Tasks

‘explore’, and ‘search’ in their past, present, and future tenses. 12 seed-terms were used to query Twitter each hour, with the 100 newest tweets being stored each time. Our corpus contains information about hundreds of thousands of real human searching scenarios and information needs; some examples are shown in Figure 1.

To investigate the information behaviours described in the corpus, we embarked on a large-scale qualitative, inductive analysis of these tweets using a grounded theory approach. With the aim of building a taxonomy of searching scenarios and their features, we have so far coded 2500 tweets in approx. 40 hrs of manual coding time. Already, we have begun to develop a series of dimensions and learned, ourselves, a great deal about the kinds of search scenarios that people experience in both the physical and digital domains.

To date, we have identified 10 dimensions within our taxonomy, 6 of which were common in the dataset and have become fairly stable. We will present this taxonomy in future work, when more tweets have been coded and the taxonomy is complete. Further, once the taxonomy is stable and has been tested for validity, we will use alternative automatic or crowd-sourcing techniques to gain a better idea of how important the factors are and how they relate. Here, however, we will highlight some of the casual-leisure search behaviours documented so far.

4.1 Need-less browsing

Much like the desire to pass time at the television, we saw many examples (some shown in Table 3) of people passing time typically associated with the ‘browsing’ keyword.

1) ... I’m not even *doing* anything useful... just browsing eBay aimlessly...

2) to do list today: browse the Internet until fasting break time..

3) ... just got done eating dinner and my family is watching the football. Rather browse on the laptop

4) I’m at the dolphin mall. Just browsing.

Table 3: Example tweets where the browsing activity is need-less.

From the collected tweets it is clear that the information-need in these situations is often not only fuzzy, but typically absent. The aim appears to be focused on the activity, where the measure of success would be in how much they enjoyed the process, or how long they managed to spend ‘wasting time’. If we model these situations by how they manage to make sense of the domain, or how they progress in defining their information-need, then we are likely to provide the wrong types of support, e.g. these users may not want to be supported in defining what they are trying to find on eBay, nor be given help to refine their requirements. We should also point out, however, that time-wasting browsing was not always associated with positive emotions (Table 4).

1) It’s happening again. I’m browsing @Etsy. Crap.

2) browsing ASOS again. tsk.

3) hmmm, just realizd I’ve been browsing ted.com for the last 3 hours.

Table 4: Example tweets where the information-need-less browsing has created negative emotions.

The addictive nature of these activities came through repeatedly and suggests perhaps that support is needed to curtail exploration when it is not appropriate.

4.2 Exploring for the experience

Mostly related to the exploration of a novel physical space, we saw many people exploring with family and friends. The aim in these situations (see Table 5) is often not to find specific places, but to spend time with family.

1) exploring the neighbourhood with my baby!

2) What a beautiful day to be outside playing and exploring with the kids :)

3) Into the nineties and exploring dubstep [music] while handling lots of small to-dos

Table 5: Example tweets where the experience outweighs the things found.

In these cases, the goal may be to investigate or learn about the place, but the focus of the activity is less on the specific knowledge gained than on the experience itself. Another point of note is that in these situations people regularly tried to behave in such a way that accidental or serendipitous discoveries were engendered. While examples 1) and 2) are physical-world examples, it is easy to imagine digital-world equivalents, such as exploring the Disney website with your children.

Below we attempt to combine the characteristics we have discovered to create an initial definition of what we refer to as casual search.

5. CASUAL SEARCH

We have seen many examples of casual information behaviours in these recent projects, but here we highlight the factors that make them different from our understanding of Information Retrieval, Information Seeking, Exploratory Search, and Sensemaking. First, we should highlight that it is not specifically their information-need-less nature that breaks the model of exploratory search, although some examples were without an information need entirely. The differentiators are more in the motivation and reasoning for searching, where all of our prior models of search are typically oriented towards finding information, but casual search is typically motivated by more hedonistic reasons. We present the following defining points for casual search tasks:

• In casual search the information found tends to be of secondary importance to the experience of finding.

• The success of casual search tasks is usually not dependent on actually finding the information being sought.

• Casual search tasks are often motivated by being in or wanting to achieve a particular mood or state. Tasks often relate at a higher level to quality of life and health of the individual.

• Casual search tasks are frequently associated with very under-defined or absent information needs.

These defining points break our models of searching in several ways. First, our models focus on an information need, where casual search often does not. Second, we measure success in regards to finding the information rather than the experience of searching. Third, the motivating scenarios we use are work-tasks, which often is not appropriate in casual search.

Wilson, M. L. and Elsweiler, D. (2010) Casual-leisure Searching: the Exploratory Search scenarios that break our current models. In: 4th HCIR Workshop , Aug 22 2010. pp 28-31.
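The hourly seed-term collection described at the start of this excerpt is simple to sketch. The Python below is a minimal illustration only, not the authors' collection code: it assumes a caller-supplied fetch function standing in for whatever Twitter search client was actually used, and the 12 seed terms ('explore' and 'search' in their various tenses) are not enumerated here.

```python
import json
import time
from typing import Callable, Iterable

Tweet = dict  # whatever record structure the search client returns

def collect_hourly(
    fetch_newest: Callable[[str, int], Iterable[Tweet]],  # hypothetical stand-in for a Twitter search client
    seed_terms: list[str],         # the study used 12: 'explore'/'search' in past, present, and future tenses
    corpus_path: str = "tweets.jsonl",
    per_term: int = 100,           # the 100 newest tweets per term, as described above
    sweeps: int = 1,               # number of hourly sweeps to run
) -> None:
    """Append the newest tweets for each seed term to a JSON-lines corpus, once per hour."""
    for sweep in range(sweeps):
        with open(corpus_path, "a", encoding="utf-8") as out:
            for term in seed_terms:
                for tweet in fetch_newest(term, per_term):
                    out.write(json.dumps({"sweep": sweep, "seed": term, "tweet": tweet}) + "\n")
        if sweep + 1 < sweeps:
            time.sleep(60 * 60)    # wait an hour before the next sweep
```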

Page 20: Understanding & Evaluating Search Sessions

People Searching the Web

Elsweiler, D., Wilson M. L. and Kirkegaard-Lunn, B. (2011) Understanding Casual-leisure Information Behaviour. In Spink, A. and Heinstrom, J. (Eds) New Directions in Information Behaviour. Emerald Group Publishing Limited, pp 211-241.

Page 21: Understanding & Evaluating Search Sessions

• Traditionally examined by analysing logs for stats

• In the 90s, it was suggested sessions are broken by gaps of ~25mins - more recently by ~5mins (see the sketch below)

• BUT evidence shows web use typically interleaves tasks - AND tabs make this all much harder

• Has become a big focus at Dagstuhl seminars and workshops

Sessions
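To make the time-gap heuristic above concrete, here is a minimal Python sketch (an illustration, not any particular study's tooling) that splits a chronologically ordered log into sessions wherever the gap between consecutive events exceeds a threshold; the ~25-minute and ~5-minute cut-offs are the ones mentioned on the slide, and the event structure is assumed.

```python
from datetime import datetime, timedelta

# Each log event is assumed to be a (timestamp, action) pair, e.g. a query or a page visit.
Event = tuple[datetime, str]

def split_into_sessions(events: list[Event], gap: timedelta = timedelta(minutes=25)) -> list[list[Event]]:
    """Split a chronologically ordered log into sessions at inactivity gaps longer than `gap`."""
    sessions: list[list[Event]] = []
    for event in sorted(events, key=lambda e: e[0]):
        if sessions and event[0] - sessions[-1][-1][0] <= gap:
            sessions[-1].append(event)   # close enough to the previous event: same session
        else:
            sessions.append([event])     # first event, or gap exceeded: start a new session
    return sessions

# The same toy log segmented with the older ~25min and the more recent ~5min heuristic.
log = [
    (datetime(2013, 5, 10, 9, 0), "query: sensemaking measures"),
    (datetime(2013, 5, 10, 9, 3), "click: result 2"),
    (datetime(2013, 5, 10, 9, 20), "query: bloom's taxonomy"),
    (datetime(2013, 5, 10, 9, 40), "click: result 1"),
]
print(len(split_into_sessions(log, timedelta(minutes=25))))  # -> 1 session
print(len(split_into_sessions(log, timedelta(minutes=5))))   # -> 3 sessions
```

The point the later slides make is that this kind of purely temporal split can disagree with how people themselves delimit their sessions.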

Page 22: Understanding & Evaluating Search Sessions

Search Trails

• Aimed at finding common end locations for queries

• An interesting step towards sessions though

• Most involved some trail features (not just query+click) - a simple trail extraction is sketched below

White, Ryen W., and Steven M. Drucker. "Investigating behavioral variability in web search." in Proc WWW 2007 . ACM
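As an illustration of the trail idea (a sketch under an assumed log structure, not White and Drucker's actual method): a trail starts at a query and collects the pages visited until the next query or a long pause, and the final page of each trail is its end location.

```python
from collections import Counter
from datetime import datetime, timedelta

# Assumed log events: (timestamp, kind, value) where kind is "query" or "visit".
Event = tuple[datetime, str, str]

def extract_trails(events: list[Event], timeout: timedelta = timedelta(minutes=30)) -> list[tuple[str, list[str]]]:
    """Group a log into trails: a query followed by the pages visited
    until the next query or a long pause. Returns (query, [pages]) pairs."""
    trails: list[tuple[str, list[str]]] = []
    last_time: datetime | None = None
    for when, kind, value in sorted(events, key=lambda e: e[0]):
        timed_out = last_time is not None and when - last_time > timeout
        if kind == "query":
            trails.append((value, []))          # every query starts a new trail
        elif trails and not timed_out:
            trails[-1][1].append(value)         # a page visit continues the current trail
        last_time = when
    return trails

def common_end_locations(trails: list[tuple[str, list[str]]]) -> Counter:
    """Tally the final page of each trail per query, i.e. its end location."""
    ends: Counter = Counter()
    for query, pages in trails:
        if pages:
            ends[(query, pages[-1])] += 1
    return ends
```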

Page 23: Understanding & Evaluating Search Sessions

Top Sessions as Seen by Bing

Bailey et al, User task understanding: a web search engine perspective, NII Shonan, 8 Oct 2012

Page 24: Understanding & Evaluating Search Sessions

Top Sessions as Seen by Bing

Bailey et al, User task understanding: a web search engine perspective, NII Shonan, 8 Oct 2012

Page 25: Understanding & Evaluating Search Sessions

Top Sessions as Seen by Bing

Bailey et al, User task understanding: a web search engine perspective, NII Shonan, 8 Oct 2012

Page 26: Understanding & Evaluating Search Sessions

Study 1: Investigating Extended Sessions

What on earth is happening here?

Page 27: Understanding & Evaluating Search Sessions

Study 1: Interview Method

• Send & preprocess history - a history artefact of approx 300 items

• "How would you define a session?" - mark out the history into sessions, starting with the most recent, and create 'cards' of the varying types of 'sessions' (15-20 cards)

• Open card sort + closed card sort

• Timings shown on the slide: 10 mins / 20-30 mins / 30-50 mins

• Outputs: interview recording, cards, card sorts, marked history file, log data

Page 28: Understanding & Evaluating Search Sessions

Study 1: Data

• Rich discussion of ~20 Sessions per participant

• Currently: 7 participants and ~120 sessions - richly described and compared

• Aiming for: 12 participants and 200+ sessions at first

Page 29: Understanding & Evaluating Search Sessions

Study 1: Questions for Sessions

1) Where was this done (e.g. work vs home vs mobile)

2) With whom (collaborative?)

3) For whom (shared task?)

4) Devices involved (whether devices affect things)

5) Length of the Session (how do they define long?)

6) Successful or not (for future measurement insights)

At some point in each interview, we tried to learn these for each session

Page 30: Understanding & Evaluating Search Sessions

Study 1: A Card

Page 31: Understanding & Evaluating Search Sessions

Study 1: A Card

Page 32: Understanding & Evaluating Search Sessions

Study 1: Card Sorting

• We aimed first to let them define the dimensions - this lets us see how they define things - how do they self-categorise different sessions?

• We then had some targeted card sorts - for whom, duration, difficulty, importance, location - what's short vs long? - what's important vs not? - how do people divide work vs home, etc.

Page 33: Understanding & Evaluating Search Sessions

Study 1: Example Card Sorts

Page 34: Understanding & Evaluating Search Sessions

Page 35: Understanding & Evaluating Search Sessions

Study 1: Preliminary Findings

• avg 21 cards per person, inc ~8 sessions of 5+mins - ~4 work & ~4 leisure

• 18.6% of those extended sessions involved task switches

• avg length: 17.5mins; avg #queries: 3.55

• short: a third said <30s, a third said <1m, a third said <30m

• long: a third said >1hour, a third said >5mins

Page 36: Understanding & Evaluating Search Sessions

Study 1: Preliminary Findings

• longest sessions: entertainment, work prep, news, shopping

• longest leisure: 22-76mins youtube, 28mins news

• most important: work, money, urgent shopping

• least important: leisure, entertainment, free time

• most difficult: technical work prep

Page 37: Understanding & Evaluating Search Sessions

Study 1: Preliminary Findings

• Huge divide over where sessions start or stop - many people considered a session to span a large break - paused and left in tabs

• One person divided a single topical episode by phases - and phases were sessions - e.g. broadening/confused stage vs successful focus stage

• One person divided single topical episode by major sources - moved from web searching to video searching on same topic

What is a session?

Implications for where/when to measure success

Page 38: Understanding & Evaluating Search Sessions

Study 1: What is a session? Single topic - changing purpose

Page 39: Understanding & Evaluating Search Sessions

Study 1: What is a session? Single topic - pausing sessions

Page 40: Understanding & Evaluating Search Sessions

Study 1: What is a session? Low-query extended sessions

Page 41: Understanding & Evaluating Search Sessions

Study 1: Other observations

• Seeing an informal relationship between who tasks are for - and skewed importance - including for another person, or for a group - and slow sequential interactions (as they talk to others)

• Seeing a strong low-query correlation with entertainment - seeing serious-leisure more similar to work tasks

• Hard tasks have high query loads - and are related to rare or new areas

Page 42: Understanding & Evaluating Search Sessions

Study 1: Summary

• We’re beginning to get some real insight into real sessions

• Already identifying examples where time-splitting isn't sufficient - but intention changing is common

• We're seeing possible common patterns of overlapping sessions

• We haven't finished!

Page 43: Understanding & Evaluating Search Sessions

Study 2: Evaluating Sensemaking

“Simplistically” measured - Study 2

Wilson, M. J. and Wilson, M. L. (2012) A Comparison of Techniques for Measuring Sensemaking and Learning within Participant-Generated Summaries. In: JASIST (accepted).

Page 44: Understanding & Evaluating Search Sessions

Study 2: “Simplistically” measured

• If learning is closed: then a quiz - "closed" determines WHAT should be learned - can measure recall, but also recognition if cued by the question

• If learning is open: a) sub-topic count (integer) & topic quality (judged Likert scale) b) simple count of facts (integer) and statements (integer)

• These do not measure how “good” the learning was

Page 45: Understanding & Evaluating Search Sessions

Study 2: Measuring “Depth” of Learning

• A theory from Education

• As learning improves, you progress up the diagram

• You begin to 'understand' - then critically 'analyze' - then 'evaluate' information, etc.

Image from: http://www.nwlink.com/~donclark/hrd/bloom.html

Page 46: Understanding & Evaluating Search Sessions

Study 2: Developed 3 Scales

• 12 participants performed 3 learning tasks - mix of high and low prior knowledge

• 1) Write summary of knowledge, 2) Learn, 3) Write summary

• 36 pairs of pre/post summaries - 18 high prior knowledge - 18 low prior knowledge

Page 47: Understanding & Evaluating Search Sessions

Study 2: Developed 3 Scales

• Inductive Grounded Theory analysis

• 3 rounds of 6 high and 6 low pairs analysed by 2 researchers

• Validated by an external judge

• Until high Fleiss' Kappa scores were reached, i.e. 'substantial agreement'

Page 48: Understanding & Evaluating Search Sessions

Study 2: Measure 1: D-Qual

We went through three major iterations of refining our measurements until we reached 'substantial agreement', according to Landis and Koch (1977), between judges. For final validation of our scores, we used Fleiss' Kappa (Fleiss, 1971) to determine the agreement between the two authors and an independent third judge. Our Fleiss Kappa scores are reported inline below as we describe the scales we produced.
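For readers unfamiliar with the statistic, a minimal Python implementation of Fleiss' kappa is sketched below (not the authors' own code); it follows the standard formulation, and the Landis and Koch (1977) reading of 0.61-0.80 as 'substantial agreement' is the interpretation used above. The toy ratings are invented for illustration.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for a matrix of shape (n_items, n_categories), where each cell
    holds how many raters assigned that item to that category, and every item is
    rated by the same number of raters."""
    n_items, _ = counts.shape
    n_raters = counts.sum(axis=1)[0]
    p_cat = counts.sum(axis=0) / (n_items * n_raters)                              # category proportions
    p_item = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))  # per-item agreement
    p_bar, p_e = p_item.mean(), (p_cat ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 3 judges rating 4 summaries on the 4-point D-Qual scale (categories 0-3).
ratings = np.array([
    [0, 0, 3, 0],   # all three judges rated this summary a 2
    [0, 2, 1, 0],
    [0, 0, 1, 2],
    [3, 0, 0, 0],
])
print(round(fleiss_kappa(ratings), 2))  # -> 0.53 for this toy data; 0.61-0.80 would be 'substantial agreement'
```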

3.2.2 The measures produced by our process

Our first measure for depth of learning was 'D-Qual', shown in Table 1, which focused on the quality of recalled facts by their usefulness and was measured on a four-point scale ranging from irrelevant or useless facts (0 points) to facts that showed a level of technical understanding (3 points). The emphasis on usefulness in this measure meant that it was closer to the "understanding" level of Bloom's revised taxonomy, rather than simply "remembering". It was important to differentiate between the two levels as many poor summaries, as determined by the authors during the coding session, simply listed many redundantly obvious facts ("A labrador is a dog") rather than describing them in sentences and summaries. For D-Qual, the judges achieved a Fleiss kappa of 0.64.

Rating Description

0 Facts are irrelevant to the subject; Facts hold no useful information or advice.

1 Facts are generalised to the overall subject matter; Facts hold little useful information or advice.

2 Facts fulfil the required information need and are useful.

3 A level of technical detail is given via at least one key term associated with the technology of the subject; Statistics are given.

Table 1: Quality of Facts (D-Qual).

Many of the better summaries interpreted facts into more intelligent statements. To identify this, D-Intrp (Table 2) measured summaries in how they synthesised facts and statements to draw conclusions and deductions (Bloom's "analysing") using a 3-point scale. This ranged from simply listing facts with no further interpretation (0 points) to structured combinations in patterns (2 points). The judges achieved a Fleiss kappa of 0.58 for D-Intrp.

Measure understanding rather than remembering

Page 49: Understanding & Evaluating Search Sessions

Study 2: Measure 2: D-Intrp

Rating Description

0 Facts contained within one statement with no association.

1 Association of two useful or detailed facts: ‘A -> B’

2 Association of multiple useful or detailed facts: ‘A+B->C’; ‘A->B->C’; ‘A->B∴C’

Table 2: Interpretation of data into statements (D-Intrp).

D-Crit reflected Bloom's concept of "evaluating" by identifying statements that compared facts, or used facts to raise questions about other statements. The measurement for D-Crit was either true (1 point) or false (0 points), as shown in Table 3. A Fleiss kappa of 0.74 was achieved.

Rating Description

0 Facts are listed with no further thought or analysis.

1 Both advantages and disadvantages listed; Comparisons drawn between items; Participant deduced his or her own questions.

Table 3: Use of critique (D-Crit).

We did not produce a scale for level three of Anderson's revised version of Bloom's taxonomy, "applying", since the act of writing a summary would not require the participant to carry out a procedure that has been learned. This level of learning was thus not identifiable in our corpus of summaries. Similarly, the highest level, "creating", also goes beyond writing about a topic, to more practical elements of learning, and so was also left out.

4 Evaluation and Comparison of Measures

Having developed our new measures from our initial sample set of written summaries, we performed a larger user study using a similar protocol. Our new measure was compared with the two other common analytical measures of written summaries: fact counting and topic analysis. We used the same study protocol that was pilot tested in our initial study, refining the Work Task descriptions and procedure slightly. One clear example of the improvements, beyond the wording of tasks, was to change the medium of written

Measure analysing capabilities

Page 50: Understanding & Evaluating Search Sessions

Study 2: Measure 3: D-Crit - Measure evaluating capabilities


Page 51: Understanding & Evaluating Search Sessions

Study 2: Evaluating these measures - Compare against Counting & Topic measures

while facts were defined as individual pieces of information either explicitly listed or contained within statements. Finally, using these two sub-measures we also created 'F-Ratio', which represented the ratio of facts per statement.

To measure breadth and depth of topics, we first outlined some common topics that were found in the six tasks of the pilot study (i.e. for buying a dog the topics were history of the breed, health concerns, caring for the dog, and personality). Then, to measure breadth ('T-Count'), we counted the number of topics that the participant covered in their summary. To measure depth ('T-Depth'), each topic was measured on a 4-point scale ranging from not covered (0 points) to detailed focused coverage (3 points) and averaged.

As the process of learning is primarily internal, it is difficult to measure it objectively. For this reason our measures of learning focused on the difference between pre- and post-task knowledge held by the participant.

Code | Measurement | Scale
D-Qual | Recall of facts | 0–3 points
D-Intrp | Interpretation of data into statements | 0–2 points
D-Crit | Critique | 0–1 point
F-Fact | Number of facts | Count
F-State | Number of statements | Count
F-Ratio | Ratio of facts per statement | Average
T-Count | Number of topics covered (breadth of knowledge) | Count
T-Depth | Level of topic focus (depth of knowledge) | 0–3 points, averaged

Table 4: Outline of coding scheme used for analysis.
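As a concrete reading of Table 4, the sketch below models one coded summary in Python and derives the count-based measures from the judged codes. The class and field names are illustrative (not from the paper), and treating each D-measure as a single judged rating per summary is an assumption.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class CodedSummary:
    """One judged pre- or post-task summary, coded with the scheme in Table 4.
    Field names are illustrative, not taken from the paper."""
    d_qual: int                     # quality of facts (0-3)
    d_intrp: int                    # interpretation of facts into statements (0-2)
    d_crit: int                     # use of critique (0 or 1)
    facts_per_statement: list[int]  # number of facts coded inside each statement
    topic_depths: dict[str, int]    # depth of coverage (0-3) for each predefined topic

    @property
    def f_state(self) -> int:   # F-State: number of statements
        return len(self.facts_per_statement)

    @property
    def f_fact(self) -> int:    # F-Fact: number of facts
        return sum(self.facts_per_statement)

    @property
    def f_ratio(self) -> float: # F-Ratio: facts per statement
        return self.f_fact / self.f_state if self.f_state else 0.0

    @property
    def t_count(self) -> int:   # T-Count: breadth, topics covered at all
        return sum(1 for depth in self.topic_depths.values() if depth > 0)

    @property
    def t_depth(self) -> float: # T-Depth: depth, averaged over the predefined topics
        return mean(self.topic_depths.values()) if self.topic_depths else 0.0

def gain(pre: "CodedSummary", post: "CodedSummary", measure: str) -> float:
    """Post minus pre score on one measure, mirroring the study's pre/post comparison."""
    return getattr(post, measure) - getattr(pre, measure)

# Example: a short post-task summary about buying a dog.
post = CodedSummary(d_qual=2, d_intrp=1, d_crit=1,
                    facts_per_statement=[2, 1, 3],
                    topic_depths={"history": 1, "health": 2, "care": 0, "personality": 3})
print(post.f_fact, post.f_state, round(post.f_ratio, 2), post.t_count, post.t_depth)  # 6 3 2.0 3 1.5
```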

5 Results

Before beginning, the data from two participants were removed from the analysis. A first-pass sanity check over the collected summaries revealed that they had misunderstood the tasks set. One chose to describe their own feelings and history relating to the task topic, rather than trying to answer the task. Another described what they intended to search for in their pre-task summaries, meaning that they could not be compared to other pre-task summaries or used to measure their information gain. The analyses below relate to the remaining 34 participants. With each participant creating 3 pairs of summaries (pre- and post-task), a total of 204 summaries, or 102 pairs of pre- and post-task summaries, were analysed using all the

• Can you differentiate pre- & post- task summaries?

• Can you differentiate high & low prior knowledge?

• How long do summaries need to be?

Page 52: Understanding & Evaluating Search Sessions

Study 2: Analysing summaries - Pre-task example

Page 53: Understanding & Evaluating Search Sessions

Study 2: Analysing summaries - Post-task example

Page 54: Understanding & Evaluating Search Sessions

Study 2: Results

knowledge, especially for pre-task summaries, which can possibly be explained by the fact that participants who wrote shorter summaries based on high prior knowledge are more likely to concentrate on a single topic.

Measure | All | Pre-task | Post-task
D-Qual | U(68) = 537.5, p = 0.32 | U(34) = 125, p = 0.28 | U(34) = 148, p = 0.46
D-Intrp | U(68) = 642, p = 0.21 | U(34) = 145, p = 0.47 | U(34) = 174, p = 0.16
D-Crit | U(68) = 570, p = 0.47 | U(34) = 140, p = 0.47 | U(34) = 144.5, p = 0.49
F-Fact | t(66) = -0.4, p = 0.35 | t(32) = -0.75, p = 0.23 | t(32) = -0.25, p = 0.4
F-State | t(66) = -0.21, p = 0.42 | t(32) = -0.4, p = 0.35 | t(32) = -0.17, p = 0.43
F-Ratio | t(66) = 0.2, p = 0.42 | t(32) = 0.31, p = 0.38 | t(32) = -0.04, p = 0.48
T-Count | t(66) = -0.35, p = 0.36 | t(32) = 0.43, p = 0.34 | t(32) = -1.01, p = 0.16
T-Depth | U(68) = 721, p = 0.04 * | U(34) = 194.5, p = 0.04 * | U(34) = 168, p = 0.21

Table 12: Comparing high and low prior knowledge in shorter summaries. * Indicates significant results.

Measure | All | Pre-task | Post-task
D-Qual | U(68) = 390, p = 0.01 * | U(34) = 89.5, p = 0.03 * | U(34) = 113.5, p = 0.18
D-Intrp | U(68) = 497.5, p = 0.16 | U(34) = 158.5, p = 0.29 | U(34) = 95, p = 0.06
D-Crit | U(68) = 693.5, p = 0.08 | U(34) = 189, p = 0.05 * | U(34) = 154, p = 0.32
F-Fact | t(66) = 1.62, p = 0.06 | t(32) = 0.64, p = 0.26 | t(32) = 1, p = 0.16
F-State | t(66) = 1, p = 0.16 | t(32) = 0.29, p = 0.39 | t(32) = 0.79, p = 0.22
F-Ratio | t(66) = 0.86, p = 0.2 | t(32) = 0.31, p = 0.38 | t(32) = 0.21, p = 0.42
T-Count | t(66) = 3.44, p = 0.0005 * | t(32) = 1.92, p = 0.03 * | t(32) = 2.82, p = 0.004 *
T-Depth | U(68) = 572, p = 0.48 | U(34) = 163, p = 0.25 | U(34) = 142, p = 0.48

Table 13: Comparing high and low prior knowledge in longer summaries. * Indicates significant results.

Conversely, however, some measures were able to differentiate between high and low prior knowledge, even after the task, when summaries were longer, as shown in Table 13. Looking at the longer pre-task summaries we find that D-Qual shows signs of significant difference, along with critique (D-Crit) and the number of topics covered (T-Count). This indicates that use of critique in pre-task summaries is a strong differentiator, but only in longer examples. Like before, however, D-Crit's significance is lost in the post-task summary, perhaps indicating that all post-task summaries include some level of critique. A more sensitive measure of critique (D-Crit) may be required and studied in future work. Unlike in our initial analysis, however, we find that one measure (T-Count) is able to tell the difference between high and low prior knowledge, in both pre- and post-task summaries, if they are longer. Again, this indicates that designing tasks such that participants write longer summaries may make it easier for measures to measure learning.
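The U and t statistics in Tables 12 and 13 come from standard non-parametric and parametric comparisons of the coded scores between the high and low prior-knowledge groups (Mann-Whitney U for the ordinal measures, t-tests for the counts, as the tables show). A minimal sketch of such a comparison with SciPy, using made-up scores rather than the study's data, would look like this:

```python
from scipy.stats import mannwhitneyu, ttest_ind

# Illustrative per-summary scores for the two prior-knowledge groups (invented data).
high_prior = [2, 3, 1, 2, 3, 2]   # e.g. D-Qual ratings (ordinal, 0-3)
low_prior = [1, 1, 2, 0, 1, 2]

# Ordinal depth measures (D-Qual, D-Intrp, D-Crit, T-Depth): Mann-Whitney U.
u_stat, p_u = mannwhitneyu(high_prior, low_prior, alternative="two-sided")

# Count-based measures (F-Fact, F-State, F-Ratio, T-Count): independent-samples t-test.
high_facts = [9, 12, 7, 10, 11, 8]
low_facts = [6, 8, 5, 9, 7, 6]
t_stat, p_t = ttest_ind(high_facts, low_facts)

print(f"U = {u_stat:.1f}, p = {p_u:.2f}")
print(f"t = {t_stat:.2f}, p = {p_t:.3f}")
```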

Pretty obvious - as you can see

Page 55: Understanding & Evaluating Search Sessions

Study 2: Results

• 1) Most measures could identify learning (between pre-post) - more robust with longer summaries

despite being shorter, while others were poor quality and much longer. There are situations, therefore, where the length of the summaries may require more thoughtful consideration.

6.4 Recommendations

To identify learning, all measures detailed here were generally effective, but both the length of the summaries and the prior knowledge held by the participant should be taken into consideration. Table 14 provides an overview of the strengths and weaknesses of each measure, and recommendations are made below. While serving as a guide, readers should refer back to the full text in our results section for more detail before using them in a study.

Table 14: Overview of measure suitability. [The per-measure tick-marks did not survive the transcript. Column groups: Identifies Learning (High, Low, Short, Long, Pre, Post); Identifies Prior Knowledge (Short, Long, Pre, Post); Ignores Length. Rows: D-Qual, D-Intrp, D-Crit, F-Fact, F-State, F-Ratio, T-Count, T-Depth.]

If participants have written shorter summaries (here averaged to around 90 words) then learning is only really noticeable if those participants began with low prior knowledge, where measures such as the quality of facts (D-Qual), simple fact and statement counting (F-Fact, F-State) and topic coverage (T-Count) can be used to determine an increase of knowledge. If short summaries are written based on high prior knowledge then only simple fact and statement counting (F-Fact, F-State) and the depth of topics (T-Depth) reflected an increase.

If participants have written longer summaries (here averaged to around 180 words) measures such as the quality and number of facts (D-Qual and F-Fact, respectively), ratio of facts to statements (F-Ratio) and topic depth (T-Depth) can be used in both high and low prior knowledge situations. Additionally, when the participant has high prior knowledge the interpretation of facts (F-State) can be used.

When attempting to determine prior knowledge we were only able to use topic depth (T-Depth) effectively when looking at shorter summaries. Using longer summaries allows
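Read as a decision aid, the recommendations above can be captured in a few lines. The sketch below is only an illustrative encoding of the short-vs-long and high-vs-low prior-knowledge advice in the preceding paragraphs; the ~90 and ~180 word figures are the averages quoted, and the 135-word midpoint used as a cut-off is an assumption, not something the paper specifies.

```python
def recommended_measures(summary_words: int, high_prior_knowledge: bool) -> list[str]:
    """Illustrative encoding of the recommendations above: which measures to use
    for detecting learning, given summary length and prior knowledge."""
    short = summary_words < 135   # assumed midpoint between the ~90 and ~180 word averages
    if short and not high_prior_knowledge:
        return ["D-Qual", "F-Fact", "F-State", "T-Count"]
    if short and high_prior_knowledge:
        return ["F-Fact", "F-State", "T-Depth"]
    # Longer summaries: usable regardless of prior knowledge.
    measures = ["D-Qual", "F-Fact", "F-Ratio", "T-Depth"]
    if high_prior_knowledge:
        measures.append("F-State")  # "interpretation of facts (F-State)" per the text above
    return measures

print(recommended_measures(80, high_prior_knowledge=False))   # short summary, low prior knowledge
```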

Page 56: Understanding & Evaluating Search Sessions

Study 2: Results

• 2) Only some were good at identifying prior knowledge - these required long pre-task summaries to be written


Page 57: Understanding & Evaluating Search Sessions

Study 2: Results

• 3) Our measures were the most robust to length of summary - others require pushing participants beyond 200 words


Page 58: Understanding & Evaluating Search Sessions

Study 2: Conclusions

• We proposed a new measure based on depth of learning

- demonstrating higher levels of thinking

• This was more robust to the size of the written summary - good at long and short, while measuring learning - able to determine if someone has existing high knowledge

• All measures did surprisingly well, for measuring learning

• Ours was most robust for determining prior knowledge level

• Future work: behaviour between good vs bad learners

Page 59: Understanding & Evaluating Search Sessions

Talk Summary

• Search communities are trying to move beyond simple tasks - more than result quality, and time to target

• Currently focusing on understanding sessions - which has primarily meant splitting logs by time gaps

• Our work: 1) moving beyond assumptions about sessions, 2) introducing new methods to evaluate sensemaking

Page 60: Understanding & Evaluating Search Sessions

Talk Summary

• There’s a long way to go before search engines know what we’re doing beyond a query (and immediate refinements) - there’s a long way before we do

• Also - we still need to measure: - success in decision making (like online shopping) - success in entertainment sessions
