
Multi-Lingual LSA with Serbian and Croatian: An Investigative Case Study

Colin Layfield, Dragan Ivanovic, Joel Azzopardi

3rd International KEYSTONE Conference


Layfield/Draganovic/Azzopardi IKC 2017

Problem

We want to create search functionality for dissertations in various Balkan-region languages:
• Serbian
• Croatian
• Montenegrin
• Bosnian

Multi-Lingual Search

Multi-lingual search problem, special case:
• Existence of these four “political” languages
• In practice, very similar

Multi-Lingual Search

Other approaches: roughly 2 categories (e.g. Dhavachelvan & Pothula (2011), Dwivedi & Chandra (2016), Sharma & Morwal (2015))
• Query and document translation
  - Parallel corpora
  - Dictionary
  - Machine translation

Multi-Lingual LSA

Requires a parallel corpus.

[Diagram: each Language A document is paired with its Language B counterpart, and the pair is concatenated into a single parallel document (A + B).]

LSA

LSA transforms the term-by-document matrix using Singular Value Decomposition:

$A = T S D^{T}$

[Diagram: parallel documents (A + B) are used to build the semantic space.]
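
A minimal sketch of the decomposition on this slide, assuming NumPy is available. The toy matrix, the helper name `build_semantic_space`, and the choice of k = 2 are illustrative assumptions and are not taken from the slides.

```python
import numpy as np

def build_semantic_space(A, k):
    """Return rank-k factors (T_k, S_k, D_k) so that A ~ T_k @ diag(S_k) @ D_k.T."""
    # full_matrices=False gives the thin SVD: T is terms x r, Dt is r x docs.
    T, S, Dt = np.linalg.svd(A, full_matrices=False)
    # Keep only the k largest singular values and their vectors.
    return T[:, :k], S[:k], Dt[:k, :].T

# Hypothetical term-by-document matrix: 4 terms (rows) x 3 documents (columns).
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

T_k, S_k, D_k = build_semantic_space(A, k=2)
A_k = T_k @ np.diag(S_k) @ D_k.T  # rank-2 approximation of A
```

Each row of `D_k` is one document's position in the reduced semantic space; documents are compared there rather than in the original term space.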

Previous Work

Dumais et al. (1997) experimented with ML-LSA:
• Parallel corpus (Canadian Parliamentary Q&A)
• 2,482 documents, each in both French and English
• 1,500 randomly selected for training
• Remaining 982 used as test set for “mate retrieval”

[Diagram: the semantic space is built from the 1,500 training documents; the English (E) and French (F) versions of the 982 test documents are then compared.]
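
The mate-retrieval test can be sketched as follows. This is a simplified illustration, assuming each document is already represented as a vector in the semantic space and that cosine similarity is the comparison measure; the function names and toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mate_retrieval_accuracy(queries, candidates):
    """Fraction of queries whose most similar candidate is their own mate.

    Assumes queries[i] and candidates[i] are the two language versions
    of the same document.
    """
    hits = 0
    for i, q in enumerate(queries):
        best = max(range(len(candidates)), key=lambda j: cosine(q, candidates[j]))
        if best == i:
            hits += 1
    return hits / len(queries)

# Hypothetical 2-dimensional document vectors for two languages.
lang_a = [[1.0, 0.1], [0.2, 1.0], [0.9, 0.8]]
lang_b = [[0.9, 0.2], [0.1, 0.9], [1.0, 0.3]]
accuracy = mate_retrieval_accuracy(lang_a, lang_b)
```

A query counts as a hit only when its own translation ranks first among all test documents in the other language.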

Our Approach

Repeat the previous experiment, with and without a parallel semantic space.

Hypothesis: given the similarity of Croatian and Serbian, are we able to use a semantic space composed of documents from each language, rather than concatenated documents from a parallel set of documents?

The Data

We used the SETIMES dataset (Tyers & Alperen 2010):
• Collection of parallel news articles in the Balkan languages taken from http://www.setimes.com
• 9 languages overall
• Data consisted of:
  - 170,466 aligned sentences
  - 29,391 news articles

Method

Two sets of experiments were run with the dataset:
1. Perform mate-retrieval test using classic ML-LSA
2. Perform mate-retrieval test where each document in the semantic space consists of only a single language
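
The difference between the two setups lies in how the training documents for the semantic space are assembled. The sketch below shows one possible reading, assuming the corpus is available as (Croatian, Serbian) text pairs; the function names and the toy pair are hypothetical, and the slides do not say exactly how the single-language documents were selected.

```python
def combined_documents(pairs):
    """Classic ML-LSA: one training document per pair, both languages concatenated."""
    return [f"{croatian} {serbian}" for croatian, serbian in pairs]

def single_language_documents(pairs):
    """Second setup: every training document contains text from only one language."""
    docs = []
    for croatian, serbian in pairs:
        docs.append(croatian)
        docs.append(serbian)
    return docs

# Toy parallel pair (illustrative text, not taken from SETIMES).
pairs = [("novi vlak", "novi voz")]
combined = combined_documents(pairs)
separate = single_language_documents(pairs)
```

In the combined setup, "vlak" and "voz" co-occur in the same training document; in the single-language setup they are linked only through shared vocabulary such as "novi".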

Method

Additional features:
• Each test run 10 times
• Document number varied for semantic space
  - 1,000, 2,500 and 5,000
• 5,000 documents chosen from those outside of semantic space for testing
• For each run, documents randomly chosen to create semantic space
• Tried with TF-IDF and Log-Entropy weighting
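
For reference, the two weighting schemes can be sketched as below. These are common textbook variants (TF-IDF as tf × log(N/df); log-entropy as log(1 + tf) times an entropy-based global weight); the slides do not state which exact formulas were used, so treat the details as assumptions. The sketch also assumes at least two documents.

```python
import math

def tf_idf(counts):
    """counts[i][j] = raw frequency of term i in document j."""
    n_docs = len(counts[0])
    weighted = []
    for row in counts:
        df = sum(1 for tf in row if tf > 0)  # number of documents containing the term
        idf = math.log(n_docs / df) if df else 0.0
        weighted.append([tf * idf for tf in row])
    return weighted

def log_entropy(counts):
    """Local weight log(1 + tf), global weight 1 + sum(p * log p) / log(n_docs)."""
    n_docs = len(counts[0])
    weighted = []
    for row in counts:
        gf = sum(row)  # total occurrences of the term in the collection
        entropy = 0.0
        for tf in row:
            if tf > 0:
                p = tf / gf
                entropy += p * math.log(p)
        g = 1.0 + entropy / math.log(n_docs)  # 0 for evenly spread terms, 1 for concentrated ones
        weighted.append([g * math.log(1.0 + tf) for tf in row])
    return weighted

# Hypothetical counts: term 0 occurs once in every document, term 1 only in document 0.
counts = [[1, 1, 1, 1], [4, 0, 0, 0]]
tfidf_weights = tf_idf(counts)
le_weights = log_entropy(counts)
```

Both schemes give zero weight to a term that is spread evenly over all documents and favour terms concentrated in a few documents.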


Results

ML-LSA Results Cro->Srb (%)

Semantic space size   1000   2500   5000
L-E Comb              99.1   98.8   98.3
L-E                   90.2   86.7   83.7
TF-IDF Comb           98.1   97.7   97.5
TF-IDF                82.3   79     75.3

Results

ML-LSA Results Cro->Srb (Folding in with straight TF) (%)

Semantic space size   1000   2500   5000
L-E Comb              99.6   99.3   99
L-E                   94.2   91.2   89.6
TF-IDF Comb           99     98.8   98.5
TF-IDF                87.2   80.1   76.3
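
Folding in projects a document that was not part of the semantic space into that space using the standard LSA formula $\hat{d} = d^{T} T_k S_k^{-1}$. The sketch below assumes NumPy and a toy matrix; `fold_in` is a hypothetical helper name. With "straight TF", the vector passed in would hold raw term counts rather than weighted values.

```python
import numpy as np

def fold_in(term_vector, T_k, S_k):
    """Project a term vector into the k-dimensional semantic space."""
    # d_hat = d^T T_k S_k^{-1}; dividing by S_k applies the inverse of the diagonal matrix.
    return (term_vector @ T_k) / S_k

# Hypothetical term-by-document matrix: 4 terms (rows) x 3 documents (columns).
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
T, S, Dt = np.linalg.svd(A, full_matrices=False)
k = 2
T_k, S_k, D_k = T[:, :k], S[:k], Dt[:k, :].T

# Sanity check: folding in a training document recovers its row of D_k.
d0_hat = fold_in(A[:, 0], T_k, S_k)

# A new document given as raw term counts over the same 4 terms.
new_doc_hat = fold_in(np.array([2.0, 0.0, 1.0, 0.0]), T_k, S_k)
```

Folding in avoids recomputing the SVD for every test document, at the cost of the new document having no influence on the space itself.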

Results

ML-LSA Results Top 5 Cro (%)

Semantic space size   1000   2500   5000
L-E Comb              99.6   99.1   98.7
L-E                   94.5   91.4   88.6
TF-IDF Comb           98.9   98.2   97.9
TF-IDF                89.2   86.3   83.2
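
One way to compute a top-5 figure is sketched below. It assumes that "Top 5" means the mate appears among the five most similar documents, which the slides do not state explicitly; the function name and toy similarity matrix are hypothetical.

```python
def top_k_accuracy(similarities, k=5):
    """Fraction of queries whose mate is among the k most similar candidates.

    similarities[i][j] is the similarity of query i to candidate j;
    the mate of query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(similarities):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarities)

# Hypothetical similarity matrix for 3 queries and 3 candidates.
sims = [
    [0.9, 0.8, 0.1],
    [0.7, 0.6, 0.2],
    [0.3, 0.9, 0.5],
]
top1 = top_k_accuracy(sims, k=1)
top2 = top_k_accuracy(sims, k=2)
```

With k = 1 this reduces to the strict mate-retrieval measure; larger k is more forgiving when near-duplicate documents crowd the top of the ranking.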


Results

Average Same Similarity Values

Semantic space size   1000    2500    5000
L-E Comb              0.947   0.954   0.957
L-E                   0.8     0.813   0.809
TF-IDF Comb           0.951   0.956   0.959
TF-IDF                0.777   0.797   0.799

Results

Average Best Croatian Similarity

Semantic space size   1000    2500    5000
L-E Comb              0.561   0.635   0.662
L-E                   0.595   0.662   0.686
TF-IDF Comb           0.613   0.686   0.72
TF-IDF                0.64    0.705   0.737

Results

[Chart: Best Croatian and Same Similarity Scores. x-axis: semantic space size (1000, 2500, 5000). Series: BCS-L-E Comb, BCS-L-E, BCS-TF-IDF Comb, BCS-TF-IDF, SS-L-E Comb, SS-L-E, SS-TF-IDF Comb, SS-TF-IDF. The chart overlays the average best Croatian similarity (BCS) and average same similarity (SS) values from the two preceding slides.]

Conclusions

Promising results, but:
• Not as good as expected
• Some surprising features
  - Increase in similarity values with increase in semantic space size
  - Decrease in mate matching results due to the above
  - Behavior of folding in documents using straight TF

Future Work

1. Experiment using a sentence-level semantic space
2. Use a mixed-language semantic space of non-parallel documents
3. Detect the most important terms from the query and use those
4. Investigate folding into a weighted semantic space using TF (and others?)

Questions? Comments? Suggestions?

References

Dhavachelvan, P., & Pothula, S. (2011). A Review on the Cross and Multilingual Information Retrieval. International Journal of Web & Semantic Technology, 2(4), 115–124.

Dumais, S. T., Letsche, T. A., Littman, M. L., & Landauer, T. K. (1997). Automatic Cross-Language Retrieval Using Latent Semantic Indexing. AAAI Technical Report SS-97-05, 18–24.

Dwivedi, S., & Chandra, G. (2016). A Survey on Cross Language Information Retrieval. International Journal on Cybernetics & Informatics, 5(1), 127–142.

Sharma, M., & Morwal, S. (2015). A Survey on Cross Language Information Retrieval. International Journal of Advanced Research in Computer and Communication Engineering, 4(2), 384–387. http://doi.org/10.17148/IJARCCE.2015.4287

Tyers, F. M., & Alperen, M. S. (2010). South-East European Times: A parallel corpus of Balkan languages. In Proceedings of the Workshop on Exploitation of Multilingual Resources and Tools for Central and (South) Eastern European Languages, LREC 2010 (pp. 49–53). Retrieved from http://xixona.dlsi.ua.es/~fran/publications/lrec2010.pdf

Young, P. G. (1994). Cross-Language Information Retrieval Using Latent Semantic Indexing. University of Tennessee, Knoxville.