IR, IE and QA over Social Media
Social media (blogs, community QA, news aggregators)
  Complementary to “traditional” news sources (Rathergate)
  Grow faster than “traditional” web content, gap widening
    Traditional/published: 4 GB/day; social media: 10 GB/day [from Andrew Tomkins/Yahoo!, “Future of Web Search”, May 2007]
Research challenges
  Low(er) quality
  Content more dynamic
  User interactions (ratings, comments, link structure) are crucial both to retrieve documents and to evaluate extracted information
Finding High Quality Content for IE/QA
Goal: find high-quality content (accurate & well-presented)
Setting: Community QA (Yahoo! Answers)
  Classifying social media (e.g., cQA) is substantially different from document classification
Sources of information
  Content analysis
  Usage data (page views, etc.)
  Community ratings, link analysis
General framework for quality estimation in social media
Graph-based model of contributor relationships, combined with content and usage analysis
Can identify high-quality items with accuracy close to the level of human agreement
E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne, Finding High Quality Content in Social Media, in Proc. of WSDM 2008
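As a rough illustration of combining a contributor graph with content and usage signals, the sketch below computes a PageRank-style authority score over hypothetical (asker, answerer) edges and merges it with a few simple features. The function names, the feature set, and the record fields (text, author, page_views, thumbs_up) are assumptions made for this example; they are not the model or features from the cited paper.

```python
from collections import defaultdict


def contributor_authority(edges, damping=0.85, iterations=50):
    """PageRank-style score over directed (asker, answerer) edges.

    An edge means the asker received an answer from the answerer, so
    score flows toward users who answer many questions.
    """
    nodes = sorted({user for edge in edges for user in edge})
    out_links = defaultdict(list)
    for asker, answerer in edges:
        out_links[asker].append(answerer)

    n = len(nodes)
    score = {user: 1.0 / n for user in nodes}
    for _ in range(iterations):
        # Mass held by users with no outgoing edges is spread uniformly.
        dangling = sum(score[u] for u in nodes if not out_links[u])
        new_score = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            for v in out_links[u]:
                new_score[v] += damping * score[u] / len(out_links[u])
        score = new_score
    return score


def quality_features(answer, authority):
    """Combine content, usage, rating and graph signals (all hypothetical)."""
    text = answer["text"]
    return {
        "num_words": len(text.split()),                                         # content
        "capital_ratio": sum(c.isupper() for c in text) / max(len(text), 1),   # content
        "page_views": answer["page_views"],                                     # usage
        "thumbs_up": answer["thumbs_up"],                                       # community rating
        "author_authority": authority.get(answer["author"], 0.0),              # graph
    }


if __name__ == "__main__":
    edges = [("alice", "bob"), ("carol", "bob"), ("bob", "dave")]
    authority = contributor_authority(edges)
    answer = {"text": "Use a UPS.", "author": "bob", "page_views": 10, "thumbs_up": 3}
    print(quality_features(answer, authority))
```

The resulting feature dictionaries could then be fed to any standard classifier trained on items labeled as high or low quality.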
Finding Relevant Content for IE/QA
Goal: given a query, rank social content (cQA) by expected relevance and quality
Approach: learn ranking functions specifically for social media retrieval
Features
  Textual content: relevance, stylistics, language models
  User interactions: link structure, discussion threads
  User ratings: incorporate user-provided content ratings
Method: gradient boosting (GBRank)
  Developed a new objective function for learning a ranking function from (noisy) preference data
Results
  Outperforms both the Yahoo! default ranking and a naïve ranking by user votes
  Can be made robust to ratings spam
[same authors, to appear in AIRWeb 2008]
J. Bian, Y. Liu, E. Agichtein and H. Zha. Finding the Right Facts in the Crowd: Factoid Question Answering over Social Media, to appear in Proc. of WWW 2008
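To make the idea of learning a ranking function from preference data concrete, here is a minimal sketch that fits a scorer on pairs of the form (preferred answer, less preferred answer) using a squared hinge loss over pairs, the general kind of pairwise objective GBRank is built on. It deliberately swaps in a linear scorer trained by stochastic gradient descent in place of gradient-boosted trees, and it does not reproduce the new objective function mentioned above. The feature layout (a relevance score and a normalized vote count) and all values are made up for illustration.

```python
def score(weights, features):
    """Linear scoring function h(x) = w . x."""
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise(pairs, num_features, margin=1.0, learning_rate=0.05, epochs=200):
    """Fit weights so that h(better) exceeds h(worse) by at least `margin`.

    pairs: list of (better_features, worse_features) tuples.
    Loss per pair: 0.5 * max(0, h(worse) - h(better) + margin) ** 2
    """
    weights = [0.0] * num_features
    for _ in range(epochs):
        for better, worse in pairs:
            violation = score(weights, worse) - score(weights, better) + margin
            if violation > 0:
                # Gradient of the pair loss w.r.t. weights is
                # violation * (worse - better); step against it.
                for j in range(num_features):
                    weights[j] -= learning_rate * violation * (worse[j] - better[j])
    return weights


if __name__ == "__main__":
    # Hypothetical features: [textual relevance, normalized user votes].
    pairs = [
        ([0.9, 0.2], [0.3, 0.8]),
        ([0.7, 0.5], [0.2, 0.4]),
        ([0.8, 0.1], [0.4, 0.1]),
    ]
    weights = train_pairwise(pairs, num_features=2)
    candidates = {"a1": [0.9, 0.2], "a2": [0.3, 0.8], "a3": [0.7, 0.5]}
    ranking = sorted(candidates, key=lambda k: score(weights, candidates[k]), reverse=True)
    print(weights, ranking)
```

In this toy data the first pair prefers a relevant answer with few votes over a less relevant answer with many votes, so a ranker trained this way need not agree with a naïve ordering by votes.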