
Aug. 14, 2012 2012 IASLOD Linking Korean Resources to LOD: Issues in Localization Mun Y. Yi

- 1 - Agenda
  - Project Scope
  - System Architecture
  - Silk in Action
  - Korean Traditional Knowledge Data
  - Localization Issues

- 2 - LOD2 Work Packages
The project is structured into twelve consecutively numbered work packages (WPs). WP1 to WP6 are concerned with development of the LOD2 Stack, and WP7 to WP9 are designed to extensively validate and demonstrate the developed technology on the basis of a carefully selected and representative set of demonstrator applications with potentially great impact. WP10 (SWC) is devoted to training, awareness, and dissemination. WP11 is concerned with exploitation and standardization activities, as well as technical coordination with other projects. WP12 is designed for high-level project coordination, reporting to the EC, and activities related to the resolution of IPR and maintenance of the Consortium Agreement.

- 3 - Simplified LOD2 Stack High-Level Architecture
The main result of LOD2 will be the LOD2 Stack, an integrated distribution of aligned tools that support the whole life cycle of Linked Data, from creation through enrichment, interlinking, and fusion to maintenance.

- 4 - Project Scope: Tasks & Deliverables
In Task 4.1, a semi-automatic machine learning technique will be developed and implemented to simplify the creation of mappings between knowledge bases and the assessment of their quality. KAIST will contribute to this task by providing a platform for automatic linking with Korean, Chinese, and Japanese RDF resources.
Task 4.1 Semi-Automatic Data Interlinking
  - University Leipzig
  - Digital Enterprise Research Institute
  - Free University Berlin
  - KAIST
Deliverable 4.1.1 First Linking Assist Release, Due Date: M18 (2012-02)
Deliverable 4.1.3 Korean Resource Linking Assist Release, Due Date: M24 (2012-08)
Deliverable 4.1.4 Asian Resource Linking Assist Release, Due Date: M30 (2013-02)

- 5 - Project Scope: Tasks & Deliverables (Contd)
In Task 4.5, methods for fusing data about a single concept from multiple different sources will be devised and implemented. KAIST will work on the fusion of multilingual DBpedia datasets, thereby addressing issues that also affect other multilingual resources.
Task 4.5 Link Data Fusion
  - University Leipzig
  - Digital Enterprise Research Institute
  - Free University Berlin
  - KAIST
Deliverable 4.5.1 Initial Release of Data Fusion Component, Due Date: M24 (2012-08)
Deliverable 4.5.3 Korean Data Fusion Assistant, Due Date: M30 (2013-02)
Deliverable 4.5.4 Asian Data Fusion Assistant, Due Date: M36 (2013-08)

- 6 - Phased Approaches
The project has been carried out in three iterative cycles. Each cycle focuses on specific tasks, and lessons learned are carried into the next cycle. In the 1st cycle, preliminary RDF data was generated. During the 2nd cycle, we localized Silk to support Korean resource linking. The 3rd cycle focuses on enhancing data quality.
1st Cycle (~Feb. 2012): Understanding of the Task Domain
  - Semantic Web
  - LOD2 Concept
  - Software Architecture
  - Data Model (Relational2RDF)
  - Pilot Project: Korean Traditional Recipe data
2nd Cycle (~July 2012): Implementation of Korean Resource Linking Assistant
  - Silk Localization
  - Linking with Silk Framework
  - Internal publication
3rd Cycle (~Aug. 2012): Quality Enhancement
  - Linking Quality
  - Publish to the LOD2 cloud

- 7 - Silk in Action
URL: http://lod.kaist.ac.kr/silk-workbench/
A file or a SPARQL endpoint can serve as a source or a target.
  - Define a project
  - Define a source & a target
  - Define a task
  - Define an output
  - And then click Open

- 8 - Silk in Action (Contd)
Multiple operators can be combined for complex tasks. Outputs can be displayed or written to a file. Interim results can be exported as final results or used as training data sets for machine learning. The learned algorithm can then be used to generate the final links.
  - Define a source & a target from Property Paths
  - Define operator(s)
  - Click GenerateLinks
  - Click Start

- 9 - Korean Traditional Knowledge Portal

- 10 - Korean Traditional Knowledge
Data includes:
  - Food (3,236 records): food name, food type, recipe, ingredients, cooking process (images)
  - Medicine, sickness, and treatment (38,121 records)
  - Agriculture (2,775 units)
  - Life (4,438 units)

- 11 - System Architecture
  - Proprietary RDFgen for transforming the relational model to the RDF model
  - Silk for link generation
  - Virtuoso triple store for serving RDF
[Figure: Source Data in Relational DB -> Transformation (RDFgen*) -> Link Creation (Silk, with new Korean similarity measures, linking to DBpedia) -> Publication (Virtuoso triple store holding RDF links, instances, and ontology)]

- 12 - Key Linking Issues
  - Data Preprocessing
  - Address Encoding: URI vs. IRI
  - Korean String Similarity Measure
  - Handling Transliterated Data

- 13 - Data Preprocessing: Mapping Relation to RDF
Our goal is to make the recipes of Korean traditional food open. Original data from the relational database were transformed into tables by object-relational mapping. Related ontologies for recipes: LinkedRecipe.com, www.mindswap.org. Tool and IngredientPortion are not implemented at this phase.
Relational               RDF
Table name               Class name
PK column value          Subject
Non-PK column name       Predicate
FK column value          Object (used as URI; RDF link)
Non-FK column value      Object (used as string; literal triple)
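The mapping rules on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the proprietary RDFgen tool: the namespace `http://example.org/resource/`, the function name `row_to_triples`, and the sample table and column names are all hypothetical.

```python
from urllib.parse import quote

# Hypothetical namespace; the real dataset's base URI is not given in the slides.
BASE = "http://example.org/resource/"


def row_to_triples(table, pk, row, fks=None):
    """Map one relational row to RDF-style triples following the slide's rules.

    table: table name (becomes the class name)
    pk:    name of the primary-key column (its value becomes the subject)
    row:   dict of column name -> value
    fks:   dict of foreign-key column name -> referenced table name
    """
    fks = fks or {}
    subject = f"<{BASE}{table}/{quote(str(row[pk]), safe='')}>"
    # Table name -> class name
    triples = [(subject, "rdf:type", f"<{BASE}{table}>")]
    for column, value in row.items():
        if column == pk or value is None:
            continue
        # Non-PK column name -> predicate
        predicate = f"<{BASE}{table}#{column}>"
        if column in fks:
            # FK column value -> object used as a URI (an RDF link)
            obj = f"<{BASE}{fks[column]}/{quote(str(value), safe='')}>"
        else:
            # Non-FK column value -> object used as a string literal
            escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        triples.append((subject, predicate, obj))
    return triples


if __name__ == "__main__":
    sample = {"id": 7, "name": "Bibimbap", "type_id": 3}
    for triple in row_to_triples("Food", "id", sample, {"type_id": "FoodType"}):
        print(" ".join(triple), ".")
```

Keeping foreign keys as URIs rather than literals is what turns the relational references into RDF links that a tool such as Silk can later follow.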

- 14 - Handling Non-Latin Data
Resources may be described in non-Latin characters. It is not known whether the tools support non-Latin characters.
[Figure: Writing systems of the world today (source: Wikipedia)]

- 15 - Address Encoding
The URI is a core component of linked data: URIs are used as names for things. A URI allows only US-ASCII characters in the name of a resource.
W3C recommendation for URIs: UTF-8 character set & URI encoding. Use the UTF-8 character set for URIs, and encode special/non-Latin characters using %.
ex) http://ko.wikipedia.org/wiki/%EB%B2%A0%EB%A5%BC%EB%A6%B0
But it's hard to understand what it is.
Another W3C recommendation: IRI (Internationalized Resource Identifier)
ex) http://ko.wikipedia.org/wiki/베를린
Now we can understand what it means. But some characters look so similar that the chance of spoofing increases.
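The relationship between the two example addresses can be checked with Python's standard library. This is a minimal sketch; the helper names `encode_segment` and `decode_segment` are made up for illustration and are not part of Silk.

```python
from urllib.parse import quote, unquote


def encode_segment(segment: str) -> str:
    """Percent-encode one path segment as UTF-8 bytes (IRI form -> URI form)."""
    return quote(segment, safe="")


def decode_segment(segment: str) -> str:
    """Decode a percent-encoded UTF-8 path segment (URI form -> IRI form)."""
    return unquote(segment)


if __name__ == "__main__":
    prefix = "http://ko.wikipedia.org/wiki/"
    print(prefix + encode_segment("베를린"))
    print(prefix + decode_segment("%EB%B2%A0%EB%A5%BC%EB%A6%B0"))
```

Each Hangul syllable takes three bytes in UTF-8, which is why a three-syllable name expands to nine percent-encoded bytes.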

- 16 - Localization: Silk Workbench Address Encoding
Silk Workbench is a GUI for the generation of links. It displays encoded URIs as is, so it's hard to understand a non-Latin dataset. Decoding the URIs lets a non-Latin dataset be displayed in its native language, so it's a lot easier to work with.

- 17 - Localization: Korean String Similarity Measures
Two kinds of Korean resources exist: resources in Korean and resources in transliterated Korean. We need to calculate similarity distances for both of them.
  - Resources in Korean: e.g., Korean DBpedia; most of the resources in Korea
  - Resources in transliterated Korean (such as "bibimbap"): e.g., English DBpedia; most of the resources abroad
The Korean alphabet has 14 consonants and 10 vowels (together with consonant clusters and diphthongs).
Most of the comparators in Silk are based on string comparison, e.g., Levenshtein distance. However, writing systems differ from language to language. So are comparators designed for the Latin (Roman) alphabet appropriate for the Korean alphabet?
String similarity distance measures for Korean:
  - KorED
  - GrpSim
  - OneDSim2
  - KorPhoD (our approach) = (sD - 1) * 3 + min(pD), where sD is the syllable distance and pD is the phoneme distance
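To see why a character-level comparator can misjudge Korean strings, the sketch below compares edit distance over whole syllable blocks with edit distance over their phonemes (jamo). It is not KorPhoD, KorED, GrpSim, or OneDSim2: it simply assumes Unicode NFD normalization as the way to split a Hangul syllable into its jamo, and the function names are hypothetical.

```python
import unicodedata


def levenshtein(a, b):
    """Classic edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute x by y
        prev = cur
    return prev[-1]


def to_jamo(text):
    """Split precomposed Hangul syllables into conjoining jamo via NFD."""
    return unicodedata.normalize("NFD", text)


def jamo_distance(a, b):
    """Edit distance counted in phonemes (jamo) rather than syllable blocks."""
    return levenshtein(to_jamo(a), to_jamo(b))


if __name__ == "__main__":
    for s, t in [("감", "논"), ("칼국수", "칼국시")]:
        print(s, t, "syllables:", levenshtein(s, t), "jamo:", jamo_distance(s, t))
```

At the syllable level, both pairs are one substitution apart, although the first pair shares no phoneme and the second differs in a single vowel. A phoneme-aware measure can tell these cases apart.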

- 18 - Localization: Korean String Similarity Measures (Contd)
Several Korean similarity distances exist that reflect the characteristics of the Korean alphabet. We devised a new measure based on the distribution of phonemes (KorPhoD). We implemented a KoreanPhonemeDistance operator in Silk and used it to build links among Korean resources.
[Table: Application of Edit Distance to Korean Resources. Columns: Source, Target, Levenshtein Distance, Actual Edit Operation, Differences in phonemes, Differences in syllables]
[Table: Comparison of Similarity Measures for Korean. Columns: Source, Target, Levenshtein Distance, KorED, GrpSim, OneDSim2, KorPhoD]
Performance Comparison
  - Precision: 1.28% vs. 17.78% (roughly 14 times higher)
  - F-score: 0.0223 vs. 0.0896 (about four times more effective at finding correct links)

- 19 - Localization: Transliterated Korean Similarity Measures
Two kinds of transliteration involve Korean: from English to Korean and from Korean to English. For now, we focus on transliteration from Korean to English to build links for resources in Korean. The biggest problem is that various algorithms for transliterating Korean into English have been in use.
  - From English to Korean: "Digital" has several Korean spellings
  - From Korean to English: one dish name becomes Kalguksu, Kalguksoo, Kalgugsoo, ...
Transliteration algorithms for Korean:
  - McCune-Reischauer (1937): the official standard in the past (from 1984 to 2000). Uses breves (indicating a short vowel), apostrophes, and diereses (indicating that a vowel is sounded in a separate syllable).
  - Yale (1942)
  - Revised Romanization (2000): the current official standard. Generally similar to MR, but uses no diacritics or apostrophes, and uses distinct letters for the consonant pairs t/d, k/g, ch/j, and p/b, etc.
  - and probably many more
We found that many academic and government websites still use MR more often. Silk does not have phonetic similarity measures, though, e.g., Soundex.
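A phonetic measure such as Soundex can absorb the spelling variation between romanizations. The following is a minimal sketch of standard American Soundex in Python, written for illustration only. It is not part of Silk and it is not the KoTlit measure.

```python
# Letter -> digit table of American Soundex; vowels, y, h, and w carry no code.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the four-character Soundex code of a word ('' if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters that share a code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept after it
    return (result + "000")[:4]


if __name__ == "__main__":
    for name in ["Kalguksu", "Kalguksoo", "Kalgugsoo"]:
        print(name, soundex(name))
```

All three spellings of the dish name map to the same code, which helps explain why Soundex tends toward high recall. Because many unrelated words also collapse to the same code, precision can suffer.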

- 20 - Localization: Transliterated Korean Similarity Measures (Contd)
We compare performance from both a string similarity perspective and a phonetic similarity perspective. Levenshtein shows good precision, and Soundex shows good recall. KoTlit shows good performance for both precision and recall, and we are still optimizing the algorithms.
Performance Comparison
M.R. (Relevant: 6669)
Measure         Retrieved   Ret. & Rel.   Precision(%)   Recall(%)
Levenshtein*         2875          2770          96.35       41.54
Soundex            386432          5969           1.54       89.50
KoTlit               4469          4241          94.90       63.59
* threshold: 0
R.R. (Relevant: 6669)
Measure         Retrieved   Ret. & Rel.   Precision(%)   Recall(%)
Levenshtein*         5552          5237          94.33       78.53
Soundex            348187          6188           1.78       92.79
KoTlit               5977          5641          94.38       84.59
* threshold: 0
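The precision and recall columns follow directly from the counts in the tables. The short Python sketch below recomputes them for the M.R. rows. The function names are illustrative, and the F1 score is added only as the usual harmonic mean; the slides do not report it for these rows.

```python
def precision(ret_and_rel, retrieved):
    """Share of retrieved links that are correct."""
    return ret_and_rel / retrieved


def recall(ret_and_rel, relevant):
    """Share of correct links that were retrieved."""
    return ret_and_rel / relevant


def f1(ret_and_rel, retrieved, relevant):
    """Harmonic mean of precision and recall."""
    return 2 * ret_and_rel / (retrieved + relevant)


if __name__ == "__main__":
    relevant = 6669
    # (retrieved, retrieved and relevant) for the M.R. rows of the table
    rows = {"Levenshtein": (2875, 2770),
            "Soundex": (386432, 5969),
            "KoTlit": (4469, 4241)}
    for name, (retrieved, hit) in rows.items():
        print(f"{name:12s} P={precision(hit, retrieved):.2%} "
              f"R={recall(hit, relevant):.2%} F1={f1(hit, retrieved, relevant):.4f}")
```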

- 21 - Concluding Remarks
  - Localization issues are important for Asian and other countries that use non-Latin scripts.
  - Each language needs its own similarity measures, for both string similarity and phonetic similarity.
  - Silk is likely to become a key linking assistant program for LOD.
  - LOD is a major movement to define the next version of the Internet.

- 22 - Thank you! Mun Yong Yi KAIST http://kslab.kaist.ac.kr mail: [email protected]@kaist.ac.kr

