identifying similar text documents

Identifying similar text documents Andr´ e Santos [email protected] November 2011

Upload: andrefsantos

Post on 18-Dec-2014

637 views

Category:

Technology

2 download

Report

Download

Embed Size (px):

DESCRIPTION

Slides from a ligthning talk oabout the Perl module Text::Perfide::BookPairs, presented on the I International Per-fide Workshops, at University of MInho, 2011.

TRANSCRIPT

Identifying similar text documents

Andre [email protected]

November 2011

Page 2: Identifying similar text documents

What we get

Andre Santos [email protected] Identifying similar text documents

Page 3: Identifying similar text documents

Duplicated versions

Andre Santos [email protected] Identifying similar text documents

Page 4: Identifying similar text documents

Duplicated versions

Andre Santos [email protected] Identifying similar text documents

Page 5: Identifying similar text documents

Candidate pairs

Andre Santos [email protected] Identifying similar text documents

Page 6: Identifying similar text documents

Candidate pairs

Andre Santos [email protected] Identifying similar text documents

Page 7: Identifying similar text documents

Candidate pairs

Andre Santos [email protected] Identifying similar text documents

Page 8: Identifying similar text documents

What this is really about

similarity

Andre Santos [email protected] Identifying similar text documents

Page 9: Identifying similar text documents

It’s all LIEs!

Language Independent Element (LIE)

Terms which are usually kept untouched duringtranslation.

Year references (e.g. “1977”)

Proper names (e.g. “Sherlock Holmes”)

Andre Santos [email protected] Identifying similar text documents

Page 10: Identifying similar text documents

It’s all LIEs!

Language Independent Element (LIE)

Terms which are usually kept untouched duringtranslation.

Year references (e.g. “1977”)

Proper names (e.g. “Sherlock Holmes”)

Andre Santos [email protected] Identifying similar text documents

Page 11: Identifying similar text documents

It’s all LIEs!

Language Independent Element (LIE)

Terms which are usually kept untouched duringtranslation.

Year references (e.g. “1977”)

Proper names (e.g. “Sherlock Holmes”)

Andre Santos [email protected] Identifying similar text documents

Page 12: Identifying similar text documents

Measuring similarity

similarity(A,B) =|ALIEs ∩ BLIEs||ALIEs ∪ BLIEs|

Andre Santos [email protected] Identifying similar text documents

Page 13: Identifying similar text documents

Measuring similarity

Andre Santos [email protected] Identifying similar text documents

Page 14: Identifying similar text documents

pairbooks

Similarity values

< 0.2 Documents are not related

> 0.4 Documents are candidate pairs

> 0.9 Documents are near duplicates

1.0 Documents are duplicates

Languages

High similarity, same language: (Near) duplicates

High similarity, different language: Candidate pairs

Andre Santos [email protected] Identifying similar text documents

Page 15: Identifying similar text documents

Behold, pairbooks!

~ $ pairbooks PT_list.txt ES_list.txt

PTBR__Umberto_EcoO_nome_da_rosa.txt

(0.227) [6954,7382] ES__Umberto_EcoEl_Nombre_de_la_Rosa(...)

(0.018) [6954,11408] ES__Umberto_EcoEl_Pendulo_De_Foucau(...)

(0.018) [6954,5604] ES__Umberto_EcoDiario_Minimo__2.txt(...)

PTBR__Umberto_EcoO_Pendulo_de_Focault.txt

(0.391) [11276,11408] ES__Umberto_EcoEl_Pendulo_De_Foucau(...)

(0.042) [11276,6024] ES__Umberto_EcoLa_busqueda_de_la_Le(...)

(0.035) [11276,5604] ES__Umberto_EcoDiario_Minimo__2.txt

(...)

Andre Santos [email protected] Identifying similar text documents

Page 16: Identifying similar text documents

Perfect LIEs do not exist

Year references

Can be confused with page numbersHeaders/footers can contain them(publishing year, copyright, . . . )

Proper names

Sometimes are translated (e.g. “SaoTome”, “Judas Tome”, etc)Some languages use different scripts(e.g. Russian)Some languages have declensions

. . .Andre Santos [email protected] Identifying similar text documents

Page 17: Identifying similar text documents

How to improve LIEs (future work)

accept a list of equivalent words

accept a list of stop words

. . .

Andre Santos [email protected] Identifying similar text documents

Page 18: Identifying similar text documents

Give me one of those!

CPANhttp://search.cpan.org/perldoc?pairbooks

Developer version

requires Linux, Perl

Incomplete documentation

Andre Santos [email protected] Identifying similar text documents

http://search.cpan.org/perldoc?pairbooks

Identifying similar text documents

Andre [email protected]

November 2011

Understanding and identifying text structure 1

Identifying Comparative Sentences in Text Documents

IDENTIFYING POLARITY IN DIFFERENT TEXT TYPES - … · IDENTIFYING POLARITY IN DIFFERENT TEXT TYPES Hille Pajupuu, ... or sentences have been annotated as either positive and ... According

Draft Rules – Edits Highlighted - Oregon Docs/caohim2rules.pdfDraft Rules – Edits Highlighted Key to Identifying Changed Text: Deleted Text New/inserted text Division 245 CLEANER

Identifying Types of Arguments in Text: First Steps towards an Identification Procedure

Discriminating Among Word Meanings by Identifying Similar Contexts

Identifying consumers’ arguments in text swaie at ekaw 2012 10-09

What can measures of text comprehension tell us … · What can measures of text comprehension tell us ... similar representational processes as reading comprehension, ... of text

A Brief Survey of Text Mining · Text Mining = Text Data Mining. Text mining can be also deﬁned — similar to data mining — as the application of algorithms and methods from

Identifying Text Type

Identifying and Analyzing Arguments in a Text Argumentation in (Con)Text Symposium, Jan. 4, 2007, Bergen

Connecting Pieces in a Text: Strategies for Identifying Inferences

5.1 Identifying Similar Figures - Big Ideas Math 7/05/g7_05...Exercises 9 and 10 on page 198. ... 5.1 Lesson Key Vocabulary similar ﬁ gures, p. 196 ... Common Error EXAMPLE 2 Identifying

1share.its.ac.id/pluginfile.php/891/mod_resource/content/... · Web viewexamples, C2, A2, P3 4. identifying the structure of text, formulating the structure of text based on the text

Identifying the Author’s Purpose · Identifying the Author’s Purpose ... summarizing text, and distinguishing main ideas from details commensurate with content area needs

Reading Levels Overview - The Syracuse City School ... Levels... · Describe qualitative dimensions of text ... Books at similar levels have similar characteristics. ... Readers often

Identifying problems and solutions in scientific text

Do not crawl in the dust different ur ls similar text

Identifying Text Structure. What nonfiction text have you willingly read in the last 24 hours?

7-3: Identifying Similar Triangles

A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text

d3jc3ahdjad7x7.cloudfront.net · 480 Chapter 8 Similarity Similar Triangles IDENTIFYING SIMILAR TRIANGLES In this lesson, you will continue the study of similar polygons by looking

Identifying signs of schizophrenia in Twitter using text ...studentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · Identifying signs of schizophrenia in Twitter

Identifying the Pathways for Meaning Circulation using Text Network Analysis

A Brief Introduction to Text Types - · PDF fileA Brief Introduction to Text Types ... their characteristics Text Type 1. Narrative 2. ... write a similar text based on the GS and

Extending Business Intelligence with Text Exploration ... · iKnow. iKnow is truly text exploration technology that uses a unique approach to text analysis. Instead of identifying

Similar text analysis

Draft Rules – Edits Highlighted Docs... · DRAFT . Draft Rules – Edits Highlighted. Key to Identifying Changed Text: Deleted Text New/inserted text . DEPARTMENT OF ENVIRONMENTAL

Identifying Linguistic Cues that Distinguish Text Types: A … · 2019. 4. 29. · Identifying Linguistic Cues that Distinguish Text Types 363 Dijk and Kintsch's (1983) model of discourse

Translating and identifying possible problems in a text

Arrays in structured text. Structured text Structured Text is an imperative language similar to Pascal (If, While, etc..) Structed Text is also a procedural

Identifying people in poverty: A multidimensional ... · Barnett Paper 18-03 Identifying people in poverty 4 measures suffer from certain validity problems (and mainly similar problems)

Identifying Similar Web Pages Based On