
Research © 2008 Yahoo!

Generating Succinct Titles for Web URLs

Kunal Punera

joint work with Deepayan Chakrabarti and Ravi Kumar

Yahoo! Research

Agenda

• Motivation

• Our Approach

• Comparison with Previous Work

• Experimental Results

Titles on Search Results Page

• HTML Titles

– Too long

– Can be missing

– Non-HTML results (pictures, video and audio clips)

• Other Apps

– Site-map generation

Titles for “Quicklinks”

• Strict length restrictions

• Links displayed in context of home page

Quicklink Titles

Homepage Context

Agenda

• Motivation

• Our Approach

• Comparison with Previous Work

• Experimental Results

“Sources” of Information about URLs (URL: http://www.barackobama.com/issues/)

URL-Tokens “barack obama issues”

Web page content

(HTMLTitle, KeyPhrases)

“Barack Obama | Change We Can Believe In | Issues”

“Issues”, “Civil Rights”, “Defense”, “Economy”

Anchor text on incoming links (IntrasiteAT, IntersiteAT, HomepageAT)

“Issues”, “Economic Issues”

“Barack Obama’s Plan for America”

Search engine queries

(QueryView, QueryClick, QueryClickPos1)

“obama issues”, “obama platform”, “obama campaign issues”, “barack obama platform”

User generated tags

(DeliciousTags)

“obama campaign platform”, “cool”, “nice webpage”

Source Instances

Central Idea

Words from title and context (if applicable) are preferentially used by sources in constructing instances.

The degree of this preference is source-dependent.
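One way to read the central idea is as a per-source word mixture. The sketch below is a minimal illustration, assuming each word of an instance is drawn from the title, the context, or a general vocabulary according to a source-specific preference triple. The function name, the uniform choice within each pool, and the example vocabulary are assumptions made for illustration, not details from the talk.

```python
import random

def generate_instance(title, context, vocabulary, prefs, length, rng):
    """Draw `length` words; each word comes from the title, the context,
    or the general vocabulary.

    prefs = (p_title, p_context, p_vocab) is the source's preference triple
    (hypothetical values; the talk learns these per source).
    """
    pools = [title.split(), context.split(), vocabulary]
    words = []
    for _ in range(length):
        # Pick a pool according to the source's preferences ...
        pool = rng.choices(pools, weights=prefs, k=1)[0]
        # ... then pick a word uniformly from it (a simplifying assumption).
        words.append(rng.choice(pool))
    return " ".join(words)

rng = random.Random(0)
title = "obama issues"
context = "barack obama change we can believe in"
vocabulary = ["platform", "campaign", "policy", "plan"]

# A source that strongly prefers title words, e.g. anchor text.
anchor_like = generate_instance(title, context, vocabulary, (0.8, 0.1, 0.1), 2, rng)
```

A source with a triple such as (0.2, 0.6, 0.2) would instead produce instances dominated by context words.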

Generation of Instances (URL: http://www.barackobama.com/issues/)

Word pools: QuicklinkTitle, HomepageAbstract (Context), GeneralVocabulary

QueryClick Source: obama issues; obama campaign issues; barack obama platform; platform for obama campaign…

IntrasiteAT Source: Issues; Foreign Policy; Economic Issues; Yes We Can

HTMLTitle Source: “Barack Obama | Change We Can Believe In | Issues”

…

Generation parameters per source: QueryClick 0.5/0.4/0.1, IntrasiteAT 0.8/0.1/0.1, HTMLTitle 0.2/0.6/0.2

Learning Source Generation Parameters (URL: http://www.barackobama.com/issues/)

GIVEN: QuicklinkTitle, HomepageAbstract (Context), GeneralVocabulary, and the source instances

QueryClick Source: obama issues; obama platform; obama campaign issues; barack obama platform

IntrasiteAT Source: Issues; Foreign Policy; Economic Issues; Yes We Can

HTMLTitle Source: “Barack Obama | Change We Can Believe In | Issues”

…

UNKNOWN: generation parameters (--/--/-- for each source)

Learn parameter values that maximize the probability of generating the instances.
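The learning step can be illustrated with a small expectation-maximization loop. This is a sketch under stated assumptions, not the authors' procedure: the talk only says the parameters maximize the probability of generating the instances. Here the three word distributions are held fixed (relative word frequency in the title and in the context, plus a small floor for the general vocabulary), and only one source's preference triple is fitted. `word_probs`, `learn_preferences`, and the floor value are hypothetical.

```python
from collections import Counter

def word_probs(title, context, vocab_probs, floor=1e-6):
    """Return a function w -> (P_title(w), P_context(w), P_vocab(w))."""
    t = Counter(title.lower().split())
    c = Counter(context.lower().split())
    nt, nc = sum(t.values()), sum(c.values())

    def probs(w):
        # Words missing from the vocabulary table get a small floor
        # probability so that every word can be generated by some pool.
        return (t[w] / nt, c[w] / nc, vocab_probs.get(w, floor))

    return probs

def learn_preferences(words, probs, iterations=50):
    """EM for one source's mixture weights, with the pools held fixed."""
    prefs = [1 / 3, 1 / 3, 1 / 3]
    for _ in range(iterations):
        totals = [0.0, 0.0, 0.0]
        for w in words:
            joint = [p * q for p, q in zip(prefs, probs(w))]
            z = sum(joint)
            for k in range(3):
                totals[k] += joint[k] / z  # E-step: responsibility of pool k
        prefs = [x / len(words) for x in totals]  # M-step: average responsibility
    return prefs

probs = word_probs("issues", "barack obama change we can believe in", {})
intrasite_words = "issues foreign policy economic issues yes we can".split()
intrasite_prefs = learn_preferences(intrasite_words, probs)
```

In practice the words would be pooled over many training URLs for the same source, each with its own known title and context.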

Finding Best Quicklink Title (URL: http://www.barackobama.com/issues/)

UNKNOWN: QuicklinkTitle

GIVEN: HomepageAbstract (Context), GeneralVocabulary, and the source instances

QueryClick Source: obama issues; obama platform; obama campaign issues; barack obama platform

IntrasiteAT Source: Issues; Foreign Policy; Economic Issues; Yes We Can

HTMLTitle Source: “Barack Obama | Change We Can Believe In | Issues”

…

LEARNT: generation parameters (QueryClick 0.5/0.4/0.1, IntrasiteAT 0.8/0.1/0.1, HTMLTitle 0.2/0.6/0.2)

Select the title for which the probability of generating the instances is maximum.
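The selection step can be sketched as an argmax over candidate titles. The code below is an illustration only: it assumes a word-level mixture with relative-frequency pools and a constant floor for the general vocabulary, and it scores each candidate by the unweighted log-likelihood of all instances. The function names and the floor value are hypothetical.

```python
import math

def word_prob(word, title, context, prefs, vocab_floor=1e-4):
    """Mixture probability of one word given a candidate title and the context."""
    p_title, p_context, p_vocab = prefs
    t, c = title.lower().split(), context.lower().split()
    return (p_title * t.count(word) / len(t)
            + p_context * c.count(word) / len(c)
            + p_vocab * vocab_floor)

def log_likelihood(candidate, context, instances_by_source, prefs_by_source):
    """Sum of log-probabilities of every instance word, over all sources."""
    total = 0.0
    for source, instances in instances_by_source.items():
        for instance in instances:
            for word in instance.lower().split():
                total += math.log(
                    word_prob(word, candidate, context, prefs_by_source[source]))
    return total

def best_title(candidates, context, instances_by_source, prefs_by_source):
    """Return the candidate under which the instances are most probable."""
    return max(candidates, key=lambda cand: log_likelihood(
        cand, context, instances_by_source, prefs_by_source))

instances = {
    "QueryClick": ["obama issues", "obama campaign issues"],
    "IntrasiteAT": ["Issues", "Economic Issues"],
}
prefs = {"QueryClick": (0.5, 0.4, 0.1), "IntrasiteAT": (0.8, 0.1, 0.1)}
chosen = best_title(["Issues", "Donate"], "barack obama", instances, prefs)
```

The sketch requires a non-zero vocabulary preference for every source so that no word has zero probability.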

Objective Function

$$
\alpha_{\text{len}} \cdot \log P_{\text{len}} \;+\; \sum_{s \in \text{sources}} \frac{\alpha_s}{\#\,\text{instances in } s} \sum_{w \in \text{instances}(s)} \log P(w \mid \text{title}, \text{context})
$$

• Sources have different numbers of instances (source-specific normalization)

– QueryClick vs. HTMLTitle

• Sources are associated with the target web object to different degrees (source-specific weights)

– QueryClick vs. QueryView

– Comments on YouTube etc.

• Can account for dependent sources
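The normalized, weighted objective can be written in a few lines. This is a sketch assuming the per-instance log-probabilities have already been computed under the learnt generation parameters; the function and argument names are hypothetical.

```python
def objective(len_log_prob, instance_log_probs, source_weights, len_weight=1.0):
    """Weighted, per-source-normalized log-likelihood of the instances.

    instance_log_probs maps each source to a list with one entry per
    instance: log P(instance | title, context).
    """
    score = len_weight * len_log_prob
    for source, log_probs in instance_log_probs.items():
        if not log_probs:
            continue  # a source with no instances contributes nothing
        # Dividing by the instance count keeps prolific sources (e.g. queries)
        # from swamping single-instance sources (e.g. the HTML title).
        score += source_weights[source] * sum(log_probs) / len(log_probs)
    return score

example = objective(
    -1.0,
    {"QueryClick": [-2.0, -4.0], "HTMLTitle": [-3.0]},
    {"QueryClick": 0.5, "HTMLTitle": 2.0},
)
```

With all weights set to 1 and the division removed, this reduces to the plain sum of log-probabilities.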

Learning Source Weights

• With known source generation parameters, the objective is a linear function of the source weights

• We learn weights that rank the candidate titles correctly

– We use the linear ranking SVM described in Joachims, “Optimizing search engines using clickthrough data”, KDD 2002
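Because the objective is linear in the source weights, ranking candidates reduces to learning a weight vector over per-source feature values. The sketch below is a simplified stand-in for a ranking SVM solver: it runs plain subgradient descent on a pairwise hinge loss over feature differences. The feature vectors, hyperparameters, and function name are illustrative assumptions.

```python
def learn_source_weights(pairs, n_features, epochs=200, lr=0.1, reg=0.01):
    """Learn weights w so that w . better > w . worse for each training pair.

    pairs: list of (better_features, worse_features); each feature is a
    candidate's normalized log-likelihood under one source (hypothetical data).
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            for k in range(n_features):
                # Subgradient of reg/2 * |w|^2 + max(0, 1 - w . diff)
                grad = reg * w[k] - (diff[k] if margin < 1.0 else 0.0)
                w[k] -= lr * grad
    return w

def score(w, features):
    return sum(wi * fi for wi, fi in zip(w, features))

training_pairs = [
    ([-1.0, -5.0], [-3.0, -4.0]),
    ([-0.5, -6.0], [-2.5, -2.0]),
]
weights = learn_source_weights(training_pairs, n_features=2)
```

Each pair would come from a URL with a judged title: the judged title's features are "better", any other candidate's features are "worse".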

Where do Title Candidates come from?

• Instances of some sources of information

• Not all sources used

– Ungrammatical (URL-Tokens)

– Misspellings (QueryView)

– Sometimes irrelevant (DeliciousTags)

• We clean some instances to obtain more candidates

– Removing website name
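The cleaning step might look like the following. The talk only mentions removing the website name; splitting on common title separators and the function name `clean_candidates` are assumptions made for this sketch.

```python
import re

def clean_candidates(instance, site_name):
    """Split an instance on ' | ', ' - ' or ' : ' and drop the site name."""
    segments = [s.strip() for s in re.split(r"\s+[|:\-]\s+", instance)]
    return [s for s in segments if s and s.lower() != site_name.lower()]

cleaned = clean_candidates(
    "Barack Obama | Change We Can Believe In | Issues", "Barack Obama")
# cleaned == ["Change We Can Believe In", "Issues"]
```

Each surviving segment becomes an additional title candidate alongside the uncleaned instance.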

Agenda

• Motivation

• Our Approach

• Comparison with Previous Work

• Experimental Results

Comparisons with Previous Work

• Our title generation is an “extractive” approach

– Avoids modeling grammatical correctness of titles

• Only learn parameters at the source level

– Less training data needed

• Combine information from external sources

– Can obtain titles for objects with no text content

• Respect constraints placed by context of title use

BMW: Banko et al., Headline Generation based on Statistical Translation, ACL 2000

• Rank headline candidates using 3 factors

– Likelihood of seeing candidate words in a title

– Likelihood of most likely sequence of the words in candidate

– Likelihood of length of candidate

• Lots of parameters

– to model word being in title

– to model bi-grams

– to combine the above 3 factors
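To make the parameter count concrete, a three-factor score of this kind can be sketched as below. This illustrates the three factors listed above and is not a reimplementation of Banko et al.: the probability tables, the floor value, and the combination weights are hypothetical, and the sketch scores the words in the order given rather than searching for the most likely ordering.

```python
import math

def three_factor_score(words, p_in_title, p_bigram, p_length,
                       weights=(1.0, 1.0, 1.0), floor=1e-6):
    """Combine word-selection, word-ordering and length log-likelihoods."""
    # One parameter per vocabulary word.
    selection = sum(math.log(p_in_title.get(w, floor)) for w in words)
    # One parameter per bigram.
    ordering = sum(math.log(p_bigram.get(pair, floor))
                   for pair in zip(words, words[1:]))
    # One parameter per candidate length.
    length = math.log(p_length.get(len(words), floor))
    a, b, c = weights
    return a * selection + b * ordering + c * length

bmw_like = three_factor_score(
    ["obama", "issues"],
    {"obama": 0.5, "issues": 0.5},
    {("obama", "issues"): 0.25},
    {2: 0.5},
)
```

The per-word and per-bigram tables grow with the vocabulary, which is the contrast with learning only a handful of parameters per source.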

Agenda

• Motivation

• Our Approach

• Comparison with Previous Work

• Experimental Results

Empirical Evaluation

• Two Tasks

– Generating Quicklink titles (manually judged data)

– Generating Web Page Titles

• Metrics

– F-measure, Jaccard, Exact Match, Longest Common Subsequence

• Baselines

– Sources of information our system uses

– BMW: Banko et al., ACL 2000
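The four metrics can be computed at the word level as follows. The talk does not spell out the exact definitions, so the lowercased whitespace tokenization and the set-based precision and recall here are assumptions.

```python
def tokens(title):
    return title.lower().split()

def f_measure(predicted, reference):
    """Harmonic mean of word-level precision and recall."""
    p, r = set(tokens(predicted)), set(tokens(reference))
    overlap = len(p & r)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

def jaccard(predicted, reference):
    """Size of the word intersection over size of the word union."""
    p, r = set(tokens(predicted)), set(tokens(reference))
    return len(p & r) / len(p | r) if p | r else 0.0

def exact_match(predicted, reference):
    return float(tokens(predicted) == tokens(reference))

def lcs_length(predicted, reference):
    """Length, in words, of the longest common subsequence."""
    p, r = tokens(predicted), tokens(reference)
    table = [[0] * (len(r) + 1) for _ in range(len(p) + 1)]
    for i, pw in enumerate(p, 1):
        for j, rw in enumerate(r, 1):
            if pw == rw:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(p)][len(r)]
```

For example, `jaccard("Economic Issues", "Issues")` is 0.5 and `lcs_length("obama campaign issues", "obama issues")` is 2.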

Quicklinks Title Task

Approach        F-measure   Jaccard   Exact Match
Our Approach    0.81        0.75      0.63
HomepageAT      0.70        0.66      0.58
IntrasiteAT     0.43        0.41      0.35
IntersiteAT     0.36        0.32      0.25
HTMLTitle       0.37        0.27      0.05
KeyPhrases      0.25        0.19      0.07

• HomepageAT is a very competitive baseline

• IntrasiteAT better than IntersiteAT

• Our system’s performance approaches inter-judge agreement values

Quicklinks Title Task: Learning Rates

• Very few datapoints needed

– Learning parameters at source level helps

Quicklinks Title Task: Source Weights

• Having Source weights and normalization helps

Web Page Title Task

Approach        F-measure   Jaccard   LCS
Our Approach    0.53        0.41      3.44
HomepageAT      0.45        0.34      2.7
KeyPhrases      0.41        0.31      2.54
QueryClick      0.31        0.23      2.1
IntersiteAT     0.29        0.21      1.8
BMW             0.12        0.10      --

• Our approach beats competition

– BMW not suited to this task

– Often page text doesn’t describe page well

• HomepageAT surprisingly effective

Conclusions

• Our approach combines various sources of information to select titles

• It selects titles that respect constraints of length and context

• We empirically showed the effectiveness of our approach

• Future Work

– Deeper language features in selecting titles

– Uniform quicklinks titles across websites

– Contexts of different types

Questions

Thank you.

Copyright Yahoo! 2008. No publication or distribution allowed without written permission.