identifying similar text documents

19
Identifying similar text documents Andr´ e Santos [email protected] November 2011

Upload: andrefsantos

Post on 18-Dec-2014

637 views

Category:

Technology


2 download

DESCRIPTION

Slides from a ligthning talk oabout the Perl module Text::Perfide::BookPairs, presented on the I International Per-fide Workshops, at University of MInho, 2011.

TRANSCRIPT

Page 1: Identifying similar text documents

Identifying similar text documents

Andre [email protected]

November 2011

Page 2: Identifying similar text documents

What we get

Andre Santos [email protected] Identifying similar text documents

Page 3: Identifying similar text documents

Duplicated versions

Andre Santos [email protected] Identifying similar text documents

Page 4: Identifying similar text documents

Duplicated versions

Andre Santos [email protected] Identifying similar text documents

Page 5: Identifying similar text documents

Candidate pairs

Andre Santos [email protected] Identifying similar text documents

Page 6: Identifying similar text documents

Candidate pairs

Andre Santos [email protected] Identifying similar text documents

Page 7: Identifying similar text documents

Candidate pairs

Andre Santos [email protected] Identifying similar text documents

Page 8: Identifying similar text documents

What this is really about

similarity

Andre Santos [email protected] Identifying similar text documents

Page 9: Identifying similar text documents

It’s all LIEs!

Language Independent Element (LIE)

Terms which are usually kept untouched duringtranslation.

Year references (e.g. “1977”)

Proper names (e.g. “Sherlock Holmes”)

Andre Santos [email protected] Identifying similar text documents

Page 10: Identifying similar text documents

It’s all LIEs!

Language Independent Element (LIE)

Terms which are usually kept untouched duringtranslation.

Year references (e.g. “1977”)

Proper names (e.g. “Sherlock Holmes”)

Andre Santos [email protected] Identifying similar text documents

Page 11: Identifying similar text documents

It’s all LIEs!

Language Independent Element (LIE)

Terms which are usually kept untouched duringtranslation.

Year references (e.g. “1977”)

Proper names (e.g. “Sherlock Holmes”)

Andre Santos [email protected] Identifying similar text documents

Page 12: Identifying similar text documents

Measuring similarity

similarity(A,B) =|ALIEs ∩ BLIEs||ALIEs ∪ BLIEs|

Andre Santos [email protected] Identifying similar text documents

Page 13: Identifying similar text documents

Measuring similarity

Andre Santos [email protected] Identifying similar text documents

Page 14: Identifying similar text documents

pairbooks

Similarity values

< 0.2 Documents are not related

> 0.4 Documents are candidate pairs

> 0.9 Documents are near duplicates

1.0 Documents are duplicates

Languages

High similarity, same language: (Near) duplicates

High similarity, different language: Candidate pairs

Andre Santos [email protected] Identifying similar text documents

Page 15: Identifying similar text documents

Behold, pairbooks!

~ $ pairbooks PT_list.txt ES_list.txt

PTBR__Umberto_EcoO_nome_da_rosa.txt

(0.227) [6954,7382] ES__Umberto_EcoEl_Nombre_de_la_Rosa(...)

(0.018) [6954,11408] ES__Umberto_EcoEl_Pendulo_De_Foucau(...)

(0.018) [6954,5604] ES__Umberto_EcoDiario_Minimo__2.txt(...)

PTBR__Umberto_EcoO_Pendulo_de_Focault.txt

(0.391) [11276,11408] ES__Umberto_EcoEl_Pendulo_De_Foucau(...)

(0.042) [11276,6024] ES__Umberto_EcoLa_busqueda_de_la_Le(...)

(0.035) [11276,5604] ES__Umberto_EcoDiario_Minimo__2.txt

(...)

Andre Santos [email protected] Identifying similar text documents

Page 16: Identifying similar text documents

Perfect LIEs do not exist

Year references

Can be confused with page numbersHeaders/footers can contain them(publishing year, copyright, . . . )

Proper names

Sometimes are translated (e.g. “SaoTome”, “Judas Tome”, etc)Some languages use different scripts(e.g. Russian)Some languages have declensions

. . .Andre Santos [email protected] Identifying similar text documents

Page 17: Identifying similar text documents

How to improve LIEs (future work)

accept a list of equivalent words

accept a list of stop words

. . .

Andre Santos [email protected] Identifying similar text documents

Page 18: Identifying similar text documents

Give me one of those!

CPANhttp://search.cpan.org/perldoc?pairbooks

Developer version

requires Linux, Perl

Incomplete documentation

Andre Santos [email protected] Identifying similar text documents

Page 19: Identifying similar text documents

Identifying similar text documents

Andre [email protected]

November 2011