Multilingual Synchronization focusing on Wikipedia
Eun-kyung Kim
2011-02-25
Introduction
• Wikipedia: a multilingual encyclopedia
  – Supports over 270 languages
    • English, German, Spanish, French, Chinese, Arabic, …
  – Allows cross-lingual navigation with inter-language links
    • Inter-language links: hyperlinks from a page in one Wikipedia language edition to one or more nearly or exactly equivalent pages in other language editions
  – The quantity of data differs across languages
    • Other language editions often lack information compared to the English edition
    • Multilingual statistics as of Feb. 2011
      » English: 3.5 million articles (most dominant)
      » French: 1 million articles (3rd)
      » Korean: 156,290 articles (22nd)
Goal of M-Sync
• Multilingual Synchronization
  – Synchronizing the contents of Wikipedia across multiple languages
    • Linking contents among multiple languages
    • Combining them into a synthesis
  – The various Wikipedia language editions
    • can offer more precise and detailed information based on different intentions, backgrounds, and cultures
    • can fill the gaps between languages so that readers acquire integrated knowledge
Two types of M-Sync
• Factual synchronization
  – Filling in missing information
• Cultural synchronization
  – Improving unknown information
Factual Synchronization
Approach: Factual Synchronization
• Hypothesis
  – X is a key fact in L1 → X’ should be a key fact in L2
    • where X’ is the term corresponding to X in a different language
  – Assumption
    » Inter-language links are accurate links that connect two pages about the same entity or concept in different languages
• Key facts of Wikipedia come from structured data such as:
  – Infobox
Infobox
• An infobox
  – is a fixed-format table
  – presents a summary of some unifying aspect that the articles share and improves navigation to other interrelated articles
  – contains facts and statistics
Example of Infobox Asymmetry (Absence)
Factual Synchronization
• Infobox Synchronization
  – Similar template structures
    • Template alignment
    • Contents translation
      – Wikipedia dictionary-based
      – Google Translation API-based
• Application
  – Seed data for article generation
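The alignment and translation steps can be sketched as follows. This is a minimal illustration, not the M-Sync implementation: the infobox contents, the `template_alignment` mapping, the `wikipedia_dictionary` (imagined as harvested from inter-language links), and the function names are all assumptions made for the example.

```python
# Toy infoboxes (hypothetical data): property name -> value, one dict per edition.
infobox_en = {"name": "Seoul", "country": "South Korea"}
infobox_ko = {"이름": "서울"}

# Assumed output of template alignment: English property -> Korean property.
template_alignment = {"name": "이름", "country": "나라"}

# Assumed dictionary harvested from inter-language links (English -> Korean titles).
wikipedia_dictionary = {"Seoul": "서울", "South Korea": "대한민국"}


def translate(value, dictionary):
    """Dictionary-based translation; values without an entry are kept unchanged."""
    return dictionary.get(value, value)


def synchronize_infobox(source, target, alignment, dictionary):
    """Return a copy of `target` with missing aligned properties filled from `source`."""
    result = dict(target)
    for source_property, value in source.items():
        target_property = alignment.get(source_property)
        if target_property is None or target_property in result:
            continue  # unaligned property, or the target already has a value
        result[target_property] = translate(value, dictionary)
    return result


synced = synchronize_infobox(
    infobox_en, infobox_ko, template_alignment, wikipedia_dictionary
)
print(synced)
```

Existing target values are never overwritten in this sketch, so it only covers the "absence" case and leaves conflicting values untouched.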
Limitations of Infobox Synchronization
• Missing key properties
  – e.g., the symptoms of a disease
• Infobox conflicts
  – How to select the target
    • Infobox A in L1 vs. Infobox B in L2: A, B, or C?
• User dissatisfaction
  – Wikipedia is not a parallel corpus
  – Cultural differences must not be ignored
    • Most articles in different languages are created independently by different users and maintained independently by different communities
Cultural Synchronization
Distributions on Multilingual Overlaps
[Figure: distribution of multilingual overlaps across EN, FR, ES, RU, ZH, KO, and AR; axis range 0 to 800,000]
Culturally Unique Contents
Cultural Synchronization
• Synthesizing missing information according to each edition's background knowledge and characteristics
  – focusing on how to add new information from other language resources
Approach: Cultural Synchronization
• Hypothesis
  – X has a topic model M1 in L1 → X’ may have a topic model M2 in L2
    • where X’ is the term corresponding to X in a different language
    • where M1 and M2 have different topic distributions according to their topical intentions
  – Assumption
    » Inter-language links are accurate links that connect two pages about the same entity or concept in different languages
• Topic model
  – A type of statistical model for discovering the abstract "topics" that occur in a collection of documents
Topic Model on Links
• Latent Dirichlet allocation (LDA)
  – The most common topic model currently in use
• Each document may be viewed as a mixture of various topics
  – If the observations are words collected into documents, LDA posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics
    » The topic distribution is assumed to have a Dirichlet prior
  – Specifically, links from a word w to a document d depend directly on how frequent the topic of w is in d
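The generative assumptions above can be written compactly. What follows is the standard LDA formulation with links standing in for words; the symbols \alpha, \beta, \theta_d, \phi_k, z_{d,n}, and w_{d,n} are notation introduced here for illustration and do not appear on the slides.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic mixture of document } d \\
\phi_k &\sim \mathrm{Dirichlet}(\beta) && \text{distribution of topic } k \text{ over links} \\
z_{d,n} &\sim \mathrm{Multinomial}(\theta_d) && \text{topic assigned to the } n\text{-th link of } d \\
w_{d,n} &\sim \mathrm{Multinomial}(\phi_{z_{d,n}}) && \text{the observed link}
\end{align*}
```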
Links on the Web
• Links
  – navigate to a web page with more detailed information
  – point to previously published web pages with similar or related content
• Understanding the influence of each link can substantially benefit many applications
  – e.g., multilingual sync
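[Figure: link graph around the Korean Wikipedia article 메뚜기 (grasshopper), with linked pages 메뚜기목 (Orthoptera), 귀뚜라미 (cricket), 베짱이 (katydid), 방아깨비, 풀무치 (migratory locust), 농업 (agriculture), 예멘 (Yemen), 사우디아라비아 (Saudi Arabia), 해충 (pest), 여치 (bush cricket), 벼메뚜기 (rice grasshopper)]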
Link types of Wikipedia
• Internal links to other pages in the wiki
  – Syntax usage: [[Main Page]]
• External links to other websites
• Interwiki links to other websites registered to the wiki in advance
  – Unlike internal links, interwiki links do not use page existence detection
  – Syntax usage: [[wikipedia:Sunflower]]
• Interlanguage links to other websites registered as other language versions of the wiki
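The four link types can be told apart from the wikitext syntax alone. The sketch below is a simplified illustration: the prefix sets `LANGUAGE_CODES` and `INTERWIKI_PREFIXES` are small assumed subsets (a real wiki keeps its own registered list), the function name is hypothetical, and namespace prefixes such as `Category:` are simply treated as internal links.

```python
import re

# Assumed subset of language codes that mark interlanguage links.
LANGUAGE_CODES = {"en", "fr", "es", "ru", "zh", "ko", "ar", "de"}
# Assumed subset of interwiki prefixes registered in advance.
INTERWIKI_PREFIXES = {"wikipedia", "wiktionary", "commons"}

# [[Target]] or [[Target|label]]; group 1 is the link target.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")
# [http://example.org label]; the lookbehind skips the inner "[" of "[[".
EXTERNAL = re.compile(r"(?<!\[)\[(https?://[^\s\]]+)[^\]]*\]")


def classify_links(wikitext):
    """Group the links found in `wikitext` by link type."""
    links = {"internal": [], "interwiki": [], "interlanguage": [], "external": []}
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        prefix, separator, _ = target.partition(":")
        prefix = prefix.strip().lower()
        if separator and prefix in LANGUAGE_CODES:
            links["interlanguage"].append(target)
        elif separator and prefix in INTERWIKI_PREFIXES:
            links["interwiki"].append(target)
        else:
            links["internal"].append(target)
    links["external"] = EXTERNAL.findall(wikitext)
    return links


sample = (
    "See [[Main Page]] and [[wikipedia:Sunflower]]. "
    "[http://example.org Example] [[ko:메뚜기]] [[Locust|locusts]]"
)
links = classify_links(sample)
```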
Multilingual Synchronization Process
[Figure: process flow, top to bottom]
• Input: Wikipedia data for L1, L2, …, LN
• Preprocessing (target page selection)
• Extracting links
• Modeling on influence links (per language: L1, L2, …, LN)
• Finding missing links according to the model
• Translating links into target languages to sync
• Computing similarity between existing and new links
• Unifying synchronized data
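The later steps of the process (finding missing links, translating them, and comparing them with existing links) can be illustrated with plain set operations. This sketch deliberately replaces the model-based selection with a simple set difference and uses Jaccard similarity as a stand-in measure; both choices, the toy link sets, the `en_to_ko` dictionary, and the function names are assumptions rather than the method from the slides.

```python
def translate_links(links, dictionary):
    """Map link titles into the target language; untranslatable titles are dropped."""
    return {dictionary[title] for title in links if title in dictionary}


def find_missing_links(source_links, target_links, dictionary):
    """Translated source links that the target page does not contain yet."""
    return translate_links(source_links, dictionary) - set(target_links)


def jaccard_similarity(a, b):
    """Overlap between two link sets, from 0.0 (disjoint) to 1.0 (identical)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy link sets for one article in two editions (hypothetical data).
en_links = {"Orthoptera", "Agriculture", "Pest", "Swarm"}
ko_links = {"메뚜기목", "농업"}
# Assumed dictionary built from inter-language links (English -> Korean titles).
en_to_ko = {"Orthoptera": "메뚜기목", "Agriculture": "농업", "Pest": "해충"}

missing = find_missing_links(en_links, ko_links, en_to_ko)
similarity = jaccard_similarity(translate_links(en_links, en_to_ko), ko_links)
```

"Swarm" has no dictionary entry in this toy data, so it cannot be proposed for the Korean page; links without an inter-language counterpart would need a separate translation step.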
Modeling on links
• Example of links in multiple-language Wikipedias
  – Different Wikipedias have different viewpoints and different concerns (fig.)
  – Some links are newly added and others are deleted by users over time
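How many topics on a document? Example: Section Headings of “Autism”
• Korean: 역사 (History); 증상 (Symptoms); 사회적 성장 (Social development); 의사 소통 (Communication); 바깥 고리 (External links)
• English: Characteristics; Social development; Communication; Repetitive behavior; Other symptoms; Classification; Causes; Mechanism; Pathophysiology; Neuropsychology; Screening; Diagnosis; Management; Prognosis; Epidemiology; History; References; External links
• French: Présentation générale (General overview); Notion de spectre autistique (Notion of the autism spectrum); Catégorisation des troubles liés à l'autisme (Categorization of autism-related disorders); L'autisme infantile (Infantile autism); Le syndrome de Rett (Rett syndrome); Le syndrome d'Asperger (Asperger syndrome); Épidémiologie (Epidemiology); Par pays (By country); En France (In France); Au Maroc (In Morocco); Dépistage et diagnostic (Screening and diagnosis); Traitement (Treatment); Pathologies associées (Associated conditions); Histoire de la notion (History of the concept); Théorisation de l'autisme (Theorization of autism); L'approche psychanalytique (The psychoanalytic approach); Théorie de l'esprit (Theory of mind); Origine, test de ''Sally et Anne'' (Origin, the "Sally and Anne" test); Remise en cause et évolution du concept (Challenges to and evolution of the concept); Désordre du traitement temporo-spatial des informations sensorielles (Disorder of temporo-spatial processing of sensory information); Recherche sur les causes (étiologie) (Research on causes (etiology)); La théorie de l'origine vaccinale (The vaccine-origin theory); La théorie de l'intoxication aux métaux lourds (The heavy-metal poisoning theory); Anomalies cérébrales et défauts du placenta (Brain abnormalities and placental defects); Causes génétiques (Genetic causes); Aire de perception de la voix (Voice perception area); Voir aussi (See also); Articles connexes (Related articles); Bibliographie (Bibliography); Généraliste (General); Témoignages, biographie (Testimonies, biography); Littérature (Literature); Vidéo et cinéma (Video and cinema); Liens externes (External links); Références (References)
• Chinese: 定義 (Definition); 特徵 (Characteristics); 社交發展 (Social development); 感官系統 (Sensory system); 溝通的困難 (Communication difficulties); 病因 (Causes); 自閉症與超常智商的聯繫 (Link between autism and exceptional IQ); 世界自閉症日 (World Autism Day); 治疗 (Treatment); 社会关注 (Public attention); 相关作品 (Related works); 电影 (Films); 參見 (See also); 參考資料 (References); 外部連結 (External links)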
How many topics on a document? Example: Translated Section Headings of “Autism”
• Korean (original): 역사; 증상; 사회적 성장; 의사 소통; 바깥 고리
• English, translated into Korean: 특성; 사회 개발; 통신; 반복적인 행동; 기타 증상; 분류; 원인; 기구; Pathophysiology; Neuropsychology; 전형; 진단; 관리; 예지; 역학; 역사; 참고 문헌; 외부 링크
• French, translated into Korean: 개요; 자폐증 스펙트럼의 의미; 장애의 분류가 자폐증과 관련; 유아의 자폐증; Rett 증후군; 아스퍼거 증후군; 역학; 국가별; 프랑스에서; 모로코에서; 심사 및 진단; 치료; 관련 질병; 개념의 역사; 자폐증의 Theorization; psychoanalytic 접근; 마음의 이론; 원래는 앤과 test''Sally''and; 도전과 개념을 변화; Temporomandibular 장애 치료 공간 감각 정보; 연구 원인 ( 병인 )에; 예방 접종 뒤에 이론; 중금속 중독의 이론; 뇌 이상과 태반의 결함; 유전적인 원인; 음성 인식 분야; 참고; 관련 기사; 서지; 일반; 증거 전기; 문학; 비디오 및 필름; 외부 링크; 참고 문헌
• Chinese, translated into Korean: 정의; 특징; 사회 개발; 관능 시스템; 의사 소통의 어려움; 원인; 특별한 지능 지수와 자폐증 링크; 세계 자폐증의 날; 치료; 사회 관심사; 관련 작품; 영화; 보기; 외부 링크
Technical Process Review
LDA-based Topic Model
• LDA
  – Document → Topics → Words (links)
    • Links: semantic key terms of a document
    • No word boundary detection required
  – Extract all links from target pages in 5 languages
    • Link extractor in Python: http://swrc.kaist.ac.kr/msync/
    • Link information can be extracted from the Wikipedia dump database
  – How many topics are selected?
    • According to sections
      – Section: a page can and should be divided into sections, using the section heading syntax
    • Section heading extraction in shell: http://swrc.kaist.ac.kr/msync/
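Choosing the number of topics from the sections can be sketched in a few lines. This is not the extraction script referenced above (which is a shell tool); it is a hypothetical Python equivalent that relies on the standard MediaWiki heading syntax (`== Heading ==`), and using the raw heading count as the topic count is an assumption made for illustration.

```python
import re

# A heading line: 2 to 6 "=" signs, the title, then the same run of "=" signs.
HEADING = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def section_headings(wikitext):
    """Return (level, title) pairs for every section heading in `wikitext`."""
    return [(len(m.group(1)), m.group(2)) for m in HEADING.finditer(wikitext)]


def number_of_topics(wikitext):
    """Assumed rule: one topic per section heading."""
    return len(section_headings(wikitext))


page = "Intro text\n== History ==\ntext\n=== Early ===\n==Symptoms==\n"
headings = section_headings(page)
topic_count = number_of_topics(page)
```

Structural sections such as "References" or "External links" are counted like any other heading here; whether to exclude them is left open.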
Advanced LDA-based Topic Model
• We generate a topic model with the out-going hypertext of a document
• We generate a topic model with the in-coming hypertext of a document
[Figure: a central Document node linked to pages AAA, BBB, CCC, DDD, EEE, FFF, ZZZ and 111, 222, 333]
• A specific modeling method is required! (Novelty)
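The two views of a document can be built from a list of (source page, target page) link pairs. The sketch below is only an illustration: the edge list is toy data, the function name is hypothetical, and reading AAA/BBB as out-going and 111/222 as in-coming is an assumption about the figure.

```python
from collections import defaultdict


def build_link_documents(edges):
    """Build two link 'documents' per page from (source, target) link pairs.

    outgoing[page] lists the pages that `page` links to;
    incoming[page] lists the pages that link to `page`.
    """
    outgoing = defaultdict(list)
    incoming = defaultdict(list)
    for source, target in edges:
        outgoing[source].append(target)
        incoming[target].append(source)
    return dict(outgoing), dict(incoming)


# Toy edge list (hypothetical): "Document" links out to AAA and BBB,
# and is linked to from 111 and 222.
edges = [
    ("Document", "AAA"),
    ("Document", "BBB"),
    ("111", "Document"),
    ("222", "Document"),
]
outgoing, incoming = build_link_documents(edges)
```

Each list can then be treated as the bag of "words" for one document when fitting a topic model, once for the out-going view and once for the in-coming view.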
Example: Out-going Hypertext
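[Figure: out-going links of the Korean Wikipedia article 메뚜기 (grasshopper): 메뚜기목 (Orthoptera), 귀뚜라미 (cricket), 방아깨비, 풀무치 (migratory locust), 농업 (agriculture), 예멘 (Yemen), 사우디아라비아 (Saudi Arabia), 여치 (bush cricket), 벼메뚜기 (rice grasshopper), 알 (egg), 곰팡이 (fungus), 아프리카 (Africa), 중동 (Middle East), 천적 (natural enemy), 거미 (spider), 사마귀 (mantis), 때까치 (shrike), 개구리 (frog), 구약성서 (Old Testament), 야훼 (Yahweh), 출애굽기 (Book of Exodus), 베짱이 (katydid), 해충 (pest)]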
Example: In-coming Hypertext
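[Figure: in-coming links of the Korean Wikipedia article 메뚜기 (grasshopper): 메뚜기과 (Acrididae), 여치 (bush cricket), 타임 보칸 (Time Bokan), 진드기 (tick), 초원 (grassland), 최진실 (Choi Jin-sil), 코뿔새과 (hornbills), 신사임당 (Shin Saimdang), 백악기 (Cretaceous), 땅돼지 (aardvark), 메뚜기아목 (Caelifera), 메뚜기목 (Orthoptera), 가면라이더_OOO (Kamen Rider OOO), 타임_크라이시스_시리즈의_등장인물 (characters of the Time Crisis series), 민족 무용 (folk dance), 딱다기, 호랑이 (tiger), 탄문, 콩고_민주_공화국_요리 (cuisine of the Democratic Republic of the Congo), 애벌레 (larva), 프레리도그 (prairie dog), 땅늑대 (aardwolf), 벼 (rice), 아스테카문명 (Aztec civilization), 유재석 (Yoo Jae-suk), 무한도전 (Infinite Challenge)]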
Contributions
• To show the diverse topic distributions of related entities across several language Wikipedias, depending on different topical intentions
• To make Wikipedia pages more shareable among multilingual users, weighted by their culturally biased interests
• To provide seed data (seed keywords) for completing articles in a multilingual manner, or for guiding users in generating new Wikipedia articles