

Exploring the keyword space in large learning resource aggregations: the case of GLOBE

LACRO'13 workshop, April 2013, Leuven, Belgium

Miguel‐Angel Sicilia, Salvador Sánchez‐Alonso, Elena Garcia‐Barriocanal, Julià Minguillón, Enayat Rajabi


Agenda

• Introduction

• GLOBE materials

• Keywords and classifications

• Interlinking to Linked Open Data

• Discussion and conclusion


Introduction

• Huge number of e-learning resources available on-line, for free or by subscription

• Several initiatives aim at federating e-learning systems to unlock the educational content hidden in their repositories (e.g. GLOBE)

• The use of the IEEE LOM standard + OAI‐PMH has facilitated the deployment of such collections


Background

• How well do different metadata elements describe and categorize the resource space?

• IEEE LOM proposes around 50 different elements, including keyword and classification:

    • keywords are intended for the description of topics in any existing language

    • classification refers to classifying the LOs (learning objects)

• Some experimental studies exist on the actual use of IEEE LOM (e.g. Friesen (2004), Ochoa et al. (2011))


Background – IEEE LOM standard


GLOBE Materials

• GLOBE (Global Learning Objects Brokered Exchange) enables sharing and reuse among several learning object repositories

• We harvested GLOBE through OAI-PMH and obtained around 770,000 metadata records

• The most frequent language is English (as also pointed out by Ochoa et al. (2011)), while a large number of resources have no language declared
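
The harvesting step can be illustrated with a minimal Python sketch of an OAI-PMH `ListRecords` loop that follows resumption tokens. This is not the authors' harvester: the function names, the `oai_lom` metadata prefix, and the injectable `fetch` callable are assumptions made for illustration, and a real endpoint may advertise a different prefix.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def parse_list_records(xml_text):
    """Return (records, resumption_token) from one ListRecords response."""
    root = ET.fromstring(xml_text)
    records = root.findall(f".//{OAI_NS}record")
    token_el = root.find(f".//{OAI_NS}resumptionToken")
    # An absent or empty resumptionToken element marks the last page.
    token = token_el.text.strip() if token_el is not None and token_el.text else None
    return records, token


def harvest(base_url, metadata_prefix="oai_lom", fetch=None):
    """Yield every <record> element, following resumption tokens.

    metadata_prefix is an assumed value; fetch is a hypothetical hook
    (URL -> response text) so the loop can be exercised offline.
    """
    if fetch is None:
        fetch = lambda url: urlopen(url).read().decode("utf-8")
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        records, token = parse_list_records(fetch(f"{base_url}?{urlencode(params)}"))
        yield from records
        if not token:
            break
        # A resumed request carries only the verb and the resumptionToken.
        params = {"verb": "ListRecords", "resumptionToken": token}
```

Passing a stub as `fetch` makes it possible to check the paging logic without contacting a repository.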


Language of resource in GLOBE

Language                          # metadata records
en, english, eng, en‐US, en‐gb    392,682
nl                                97,976
x‐none, none, blank               77,555
de, de‐AT, de‐DE                  49,807
es‐EC, es                         47,816
it, ITA                           23,102
hu‐HU, hu                         20,316
is                                8,804
ca                                8,066
fr                                6,414
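
Grouping variants such as "en", "english" and "en‐US" into one row requires normalizing the raw language values. The sketch below shows one way this could be done in Python; the alias table, the set of "missing" markers, and the "undeclared" bucket name are assumptions that cover only the variants listed in the table, not the authors' actual procedure.

```python
from collections import Counter

# Hypothetical alias table: only the spelled-out variants seen in the table.
ALIASES = {"english": "en", "eng": "en", "ita": "it"}
# Values treated as "no language declared" (assumed from the table row).
MISSING = {"", "none", "x-none"}


def normalize_language(raw):
    """Map a raw language value to a bare lowercase primary subtag."""
    # Fold case and replace the typographic hyphen (U+2010) with ASCII "-".
    value = (raw or "").strip().lower().replace("\u2010", "-")
    if value in MISSING:
        return "undeclared"
    value = ALIASES.get(value, value)
    # Drop any region subtag: "en-us" -> "en", "de-at" -> "de".
    return value.split("-")[0]


def language_distribution(values):
    """Count records per normalized language."""
    return Counter(normalize_language(v) for v in values)
```

For example, `language_distribution(["en", "en-US", "english", None])` counts three records under "en" and one under "undeclared".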


GLOBE materials: Keyword

• There are around 5.5 million keywords in the sample (~7 keywords per resource)

• A large number of keywords were generated via machine translation (referenced by codes starting with “x-mt-”)

• Around 3.2 million keywords appear to have been generated by humans (~4 keywords per resource)

• Resource frequencies are high even for relatively large keyword counts (beyond 15 keywords per resource), which might be attributed to automated extraction
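
The per-resource averages follow from the record count: 5.5 million / 770,000 ≈ 7.1 and 3.2 million / 770,000 ≈ 4.2. Separating machine-translated keywords and building the keywords-per-resource distribution could be sketched as follows. The `(resource_id, language_code, text)` tuple layout and the function names are assumptions for illustration; only the "x-mt-" prefix comes from the slides.

```python
from collections import Counter

MT_PREFIX = "x-mt-"  # prefix named in the slides for machine-translated keywords


def split_keywords(keywords):
    """Separate machine-translated keywords from the remaining ones.

    keywords: iterable of (resource_id, language_code, text) tuples
    (an assumed layout, not the GLOBE storage format).
    """
    human, machine = [], []
    for resource_id, lang, text in keywords:
        target = machine if (lang or "").lower().startswith(MT_PREFIX) else human
        target.append((resource_id, lang, text))
    return human, machine


def keywords_per_resource(keywords):
    """Histogram: number of keywords -> number of resources with that many."""
    per_resource = Counter(resource_id for resource_id, _, _ in keywords)
    return Counter(per_resource.values())
```

A long tail in the resulting histogram (many resources with more than 15 keywords) would be the pattern the last bullet attributes to automated extraction.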


GLOBE materials: Keyword


GLOBE materials: Classification

• A total of ~700k classifications distributed across ~500k resources were found, with ~1 million taxon entries

• About 92% of all the resources have at most two classifications, and only 187 resources have more than 10.

• Only 43 different classification purposes were found, with “discipline” accounting for about 60% and “Technical design” for around 18%. The latter comes from a vocabulary specific to the MACE project. Another 11% of the purposes were blank.

• Keywords and classifications were matched against each other for the same resources (~270k coincidences)
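
Matching keywords against classification entries of the same resource can be sketched as a per-resource set intersection. The dictionary layout, the function name, and the case-insensitive comparison below are assumptions; the slides do not state how strings were compared in this step.

```python
def count_coincidences(keywords_by_resource, taxons_by_resource):
    """Count keyword/taxon-entry pairs that coincide within the same resource.

    Both arguments map a resource id to a list of strings (assumed layout).
    Comparison ignores case and surrounding whitespace, which is an
    assumption of this sketch.
    """
    total = 0
    for resource_id, keywords in keywords_by_resource.items():
        kw = {k.strip().lower() for k in keywords}
        tx = {t.strip().lower() for t in taxons_by_resource.get(resource_id, ())}
        total += len(kw & tx)
    return total
```

Only strings attached to the same resource id are compared, so a keyword on one resource never matches a taxon entry on another.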


GLOBE materials: Classification

Taxon entry language    # records
en, en‐US               539,568
unspecified             180,643
ca                      158,340
es                      88,261
fr                      19,734
sv                      17,488
de                      13,687
nl                      12,954
ro                      12,038
it                      10,352


Interlinking to other resources: DBpedia


Interlinking to other resources: DBpedia

• In Linked Open Data, RDF links exposed through the Web express relationships between elements

• DBpedia is the central dataset, and most interlinking tools provide automated ways to interlink with it

• Keywords and classifications can be approached from the perspective of external data sources

• Keywords and classifications could be linked to a large knowledge base, e.g. DBpedia (less than 30% of them matched)
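
A match rate of this kind can be estimated by comparing distinct terms with a set of DBpedia labels using exact string equality. The sketch below assumes the labels are already available locally as plain strings; the function name and that local label set are hypothetical, and no DBpedia API is called.

```python
def exact_match_rate(terms, dbpedia_labels):
    """Fraction of distinct terms that equal some label exactly.

    terms: keywords or taxon entries; dbpedia_labels: label strings
    assumed to have been extracted from DBpedia beforehand.
    """
    labels = set(dbpedia_labels)
    distinct = set(terms)
    if not distinct:
        return 0.0
    matched = sum(1 for term in distinct if term in labels)
    return matched / len(distinct)
```

Because the comparison is exact, "algebra" does not match the label "Algebra", which illustrates how lexical variants lower the measured rate.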


Discussion and conclusion

• English dominates the distribution of languages in GLOBE, with only a few other languages substantially represented

• A considerable number of keywords were generated via machine translation.

• English again dominates the linguistic space of classifications

• Classifications yield a more concise representation, as is evident from the contrast between the more than 3 million keywords (excluding machine translation) and the 1 million classification entries


Discussion and conclusion

• The amount of coincidence with lexical variants in DBpedia entries is limited, and there is no significant difference between keywords and classifications, so both appear to have similar potential for interlinking.

• It is important to highlight that the coincidences analysed were based on exact string matching, without any consideration of polysemy or lexical variants.

• It should be noted that GLOBE must be considered a highly heterogeneous repository in several respects, as described by Ochoa et al. (2011), including the way the metadata is created in the repositories, ranging from automatic creation to quality-controlled internal mechanisms


Thank you