Web IR/NLP Group (WING) @ NUS
Min-Yen Kan
School of Computing
National University of Singapore
http://wing.comp.nus.edu.sg/
MSRA Web-Scale NLP Workshop (Daedeok, Korea)
Min-Yen Kan
Web IR/NLP Group @ NUS
Support staff (undergraduate)
• System administrators
• System programmers
Undergraduate Projects
• 4 this year (ask me about topics)
PI: Min-Yen KAN (NLP and IR/DL)
Postdoc:
• Su Nam KIM (Multiword Expressions)
PhDs:
• Hendra SETIAWAN (Stat MT)
• Long QIU (Scenario Templates)
• Yee Fan TAN (Web Record Linkage)
• Jin ZHAO (Math IR)
• Jesse PRABAWA (UI/HCI for DLs)
• Ziheng LIN (Summarization)
One of many groups doing this type of research at NUS
Today: NLP first, then DL
Information Extraction
• Keyphrase Extraction
– Idea: Use section information as evidence (ICADL 07)
• Scenario Template Generation (Long Qiu)
– Aim: to generate database rows from similar news events
Charley landed further south on the Gulf Coast than predicted, … The hurricane … was weakened and is moving over South Carolina
At least 21 missing after the storm hit … But Tokage had weakened by the time it passed over Tokyo, where it had left little damage before moving out to sea.
– Model context and cluster to convergence using EM (EMNLP 06)
Using less data
• URL Classification (WWW 04)
http://www.usatoday.com/stories/080502/ent/hilton.html
http://www.cancersupportgroup.org/forum/230.html
– Classifies thousands of URLs per minute, at about two-thirds of full-text accuracy
– Useful for focused crawling, web mining applications
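The slide does not say which features the WWW 04 classifier uses. As a rough illustration of classifying from the URL alone, the sketch below splits the two example URLs into word-like tokens that a classifier could consume. The function name `url_tokens` and the tokenization rule are assumptions made for this example, not the published method.

```python
import re
from urllib.parse import urlparse


def url_tokens(url):
    """Split a URL's host and path into lowercase letter runs and digit runs.

    Hypothetical feature extractor: the scheme, query, and fragment are ignored.
    """
    parsed = urlparse(url)
    text = f"{parsed.netloc} {parsed.path}".lower()
    # Letter runs and digit runs become separate tokens; punctuation is dropped.
    return re.findall(r"[a-z]+|\d+", text)


if __name__ == "__main__":
    for example in (
        "http://www.usatoday.com/stories/080502/ent/hilton.html",
        "http://www.cancersupportgroup.org/forum/230.html",
    ):
        print(url_tokens(example))
```

Tokens such as `ent` or `forum` hint at the page topic without fetching the page, which is what makes this kind of approach cheap enough for focused crawling.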
Question-Answering (Hang Cui)
• Our Approaches to QA
– Use of external resources from Web & WordNet (SIGIR 04)
– Employ dependency & SRL for answer extraction (SIGIR 05, 06)
– Soft pattern analysis of definitional patterns (WWW 05)
– Explore temporal relationships and events
– Extend techniques to precise passage retrieval
– Came 2nd (in 2003, 2004 & 2005) in TREC QA Task
– Licensed technology to company in legal search
• Current focus
– Relation-based IE & QA: continue focus on linguistic knowledge
– Ontology-based Interactive QA: leverage domain knowledge
– Searching for answers and mining terminology from the Web
Summarization (Ziheng Lin)
• Document Concept Lattice Model (IPM 07)
– Aim: find the list of sentences that results in minimal information loss
– Extract key concept terms, and build concept lattice
– Perform sentence extraction that covers max concept terms
– Participated in DUC, came in 1st (2005) and 2nd (2006)
• Pioneered iterative construction model for graph-based summarization (DUC 07)
[Figure: three steps over doc1, doc2, doc3, with sentences s1, s2, s3 added one at a time]
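The step on this slide of extracting sentences that cover the most concept terms can be illustrated with a plain greedy coverage heuristic. This is not the Document Concept Lattice model itself, which the slide does not detail. It is a minimal sketch that assumes concept terms are single lowercase words and that a sentence "covers" a term when the word appears in it. The name `greedy_cover` and the toy sentences are made up for the example.

```python
def greedy_cover(sentences, concepts, budget):
    """Greedily pick up to `budget` sentence indices.

    Each round takes the sentence that covers the most concept terms
    not yet covered by earlier picks (ties go to the earliest sentence).
    """
    remaining = set(concepts)
    # Concept terms mentioned by each sentence (naive whitespace word match).
    mentions = [set(s.lower().split()) & remaining for s in sentences]
    chosen = []
    while remaining and len(chosen) < budget:
        best = max(range(len(sentences)), key=lambda i: len(mentions[i] & remaining))
        gain = mentions[best] & remaining
        if not gain:
            break  # no sentence adds new concept terms
        chosen.append(best)
        remaining -= gain
    return chosen


if __name__ == "__main__":
    toy_sentences = [
        "the hurricane weakened over land",
        "the storm left little damage",
        "the hurricane caused damage",
    ]
    toy_concepts = {"hurricane", "weakened", "storm", "damage"}
    print(greedy_cover(toy_sentences, toy_concepts, budget=2))
```

On the toy input, the first two sentences together cover all four concept terms, so the third sentence is never selected.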
Statistical Machine Translation (Hendra Setiawan)
Source: 表单 是 网页 上 的 数据 输 域 的 集合
Word-for-word: a form is a page on data entry fields of a coll.
Reordered source: 表单 是 集合 的 数据 输 域 的 上 网页
After reordering: a form is a coll. of data entry fields on a page
Reordering steps:
– 上 网页: on a page
– 数据 输 域 的 上 网页: data entry fields on a page
– 集合 的 数据 输 域 的 上 网页: a coll. of data entry fields on a page
Function Word Based Reordering (ACL 07)
Commercial record linkage (Yee Fan Tan)
• Addresses
– Dongwon Lee, 110 E. Foster Ave. #410, State College, PA, 16802
– LEE Dong, 110 East Foster Avenue Apartment 410, Univ. Park, PA 16802-2343
• Products
– Honda Fit vs. Honda Jazz
– Apple iPod Nano 4GB vs. 4GB iPod nano 4GB
• Idea: use web as additional context for disambiguation and clustering (JCDL 06, WIDM 07)
• Placed 3rd in Web People Search Task (WEPS 2007)
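To see why extra context helps, it is useful to check how a purely surface-level comparison fares on the address pair from this slide. The sketch below is a simple token-overlap (Jaccard) baseline, not the group's web-based method. The helper names `record_tokens` and `jaccard` are assumptions for the example.

```python
import re


def record_tokens(record):
    """Lowercase a record and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", record.lower()))


def jaccard(a, b):
    """Token-set Jaccard similarity between two record strings."""
    ta, tb = record_tokens(a), record_tokens(b)
    if not ta and not tb:
        return 0.0  # two empty records: define similarity as 0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    addr_a = "Dongwon Lee, 110 E. Foster Ave. #410, State College, PA, 16802"
    addr_b = "LEE Dong, 110 East Foster Avenue Apartment 410, Univ. Park, PA 16802-2343"
    print(round(jaccard(addr_a, addr_b), 3))
```

The two strings share only 6 of their 18 distinct tokens, because abbreviations (E. vs. East, Ave. vs. Avenue) and alternative place names do not match on the surface. That gap is the motivation for bringing in outside evidence such as web context.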
Multi(ple) Extensions
• Multimodal Alignment
– Lyrics with Audio (ACM MM 04)
– Slides with Paper (JCDL 07)
• Current and future work:
– Extracted Terminology with User Tagging
[Figure panels: Text in Focus; Slide in Focus]
Focusing on the User
• Understanding user searches better
– Known item search (JCDL 2005)
– Faceted classification of web queries (WebQ 2007)
• Building better user interfaces (Jesse Prabawa)
– Revisiting library catalog interfaces to better support searching (JCDL 2007)
Putting it all together
We’re building a niche academic research repository
– e.g., MS Libra, CiteSeer, DBLP, Google Scholar
What? Another one? What’s the catch?
– User interaction and community involvement are central
– Overcome faults of imperfect machine learning
– Platform for researching how web-scale NLP can actively involve user feedback, and mechanisms for channeling it
What about Web NLP / IR?
– My group emphasizes practical outcomes and deliverables
– Find research within industry and practical problems
– Multilingual, multimedia, web-as-data angles likely to continue
Other pointers (NUS-wide)
• Text Processing Seminar (with archived slides)
http://wing.comp.nus.edu.sg/chimetext
• Machine Learning (Graphical Models) Reading Group
http://groups.google.com/group/mlnus/
• NLP Reading Group
http://wing.comp.nus.edu.sg/NLPReading/index.php/Main_Page
<AD>
Shameless plug for my group: http://wing.comp.nus.edu.sg
</AD>
Thanks for listening!