Weekly Progress Report, 11th to 16th July 2016 - Ayush Pareek


Weekly Progress Report

11th to 16th July 2016

-Ayush Pareek

This week, I’m doing a study on n-grams and

their possible applications for topic detection

in multiple documents.

(Details in subsequent slides)

In the fields of computational

linguistics and probability, an n-gram is a

contiguous sequence of n items from a

given sequence of text or speech. The items can

be phonemes, syllables, letters, words or base

pairs according to the application. The n-grams

typically are collected from a text or speech

corpus. When the items are words, n-grams may

also be called shingles.[5]

An n-gram of size 1 is referred to as a

"unigram"; size 2 is a "bigram" (or, less

commonly, a "digram"); size 3 is a "trigram".

Larger sizes are sometimes referred to by the

value of n, e.g., "four-gram", "five-gram", and

so on.

N-grams have been used in summarization

and summary evaluation [1, 2, 3].

This means that the n-gram S_{i, i+n-1} can be found as a substring of length n of the original text, spanning from the i-th to the (i+n-1)-th character of the original text.

Consider the following sentence:

This is a sentence.

Word unigrams: this, is, a, sentence

Word bigrams: this is, is a, a sentence

Character bigrams: th, hi, is, s , a, ...

Character 4-grams: this, his , is , ...

Time Complexity = O(size of the text) = O(|T|)
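The extraction above can be sketched in a few lines of Python. This is an illustrative sketch only; the tokenization (lowercasing, stripping the final period, splitting on whitespace) is an assumption, not something the report specifies.

```python
def char_ngrams(text, n):
    """All contiguous character n-grams of text.

    A single sliding-window pass over the text, so the cost of
    extraction is O(|T|), as noted above.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Contiguous word n-grams (naive whitespace tokenization)."""
    words = text.lower().rstrip(".").split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "This is a sentence."
print(word_ngrams(sentence, 1))  # -> ['this', 'is', 'a', 'sentence']
print(word_ngrams(sentence, 2))  # -> ['this is', 'is a', 'a sentence']
print(char_ngrams(sentence.lower(), 2)[:4])
```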

Application of this method to the sentence

Do you like this summary? with a requested n-gram size of 3 would return: {`Do ', `o y', ` yo', `you', `ou ', `u l', ` li', `lik', `ike', `ke ', `e t', `

th', `thi',`his', `is ', `s s', ` su', `sum', `umm', `mma', `mar', `ary', `ry?'}

while an algorithm taking disjoint n-grams would return

{`Do ', `you', ` li', `ke ', `thi', `s s', `umm', `ary'} (and `?' would probably be omitted).
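The two extraction strategies can be sketched as follows. In the disjoint variant, dropping a trailing chunk shorter than n is what omits the final `?`, matching the behaviour described above.

```python
def sliding_ngrams(text, n):
    # every overlapping window of n characters
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def disjoint_ngrams(text, n):
    # non-overlapping chunks; a trailing chunk shorter than n is dropped
    return [text[i:i + n] for i in range(0, len(text) - n + 1, n)]


s = "Do you like this summary?"
print(sliding_ngrams(s, 3))   # 23 overlapping trigrams, starting 'Do ', 'o y', ...
print(disjoint_ngrams(s, 3))  # -> ['Do ', 'you', ' li', 'ke ', 'thi', 's s', 'umm', 'ary']
```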


There are three methods which can be used to create n-gram graphs, based on how neighbourhood between adjacent n-grams is computed in a text.

In general, a fixed-width window of characters (or words) around a given n-gram N0 is used, with all characters (or words) within the window considered to be neighbours of N0.

These neighbours are represented as connected vertices in the text graph.

The edge connecting the neighbours is weighted, indicating for example the distance between the neighbours or the number of co-occurrences within the text.

1) The Non-Symmetric Approach

2) The Symmetric Approach

3) The Gauss-normalized symmetric

approach (details omitted)

For the string abcdef, the figure shows the n-gram graphs for the non-symmetric approach, the symmetric approach, and the Gauss-normalized symmetric approach, respectively.
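A minimal sketch of the symmetric approach. Two assumptions here are mine, not the report's: the window is measured in n-gram positions (rather than characters), and edge weights count co-occurrences; the exact window semantics in [4] may differ.

```python
from collections import defaultdict


def ngram_graph(text, n=3, window=2):
    """Symmetric-approach n-gram graph.

    Vertices are the character n-grams of the text; an undirected edge
    joins two n-grams occurring within `window` positions of each
    other, weighted by the number of such co-occurrences.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    weights = defaultdict(int)
    for i, g in enumerate(grams):
        for j in range(i + 1, min(i + window + 1, len(grams))):
            edge = tuple(sorted((g, grams[j])))  # undirected edge
            weights[edge] += 1
    return dict(weights)


print(ngram_graph("abcdef", n=3, window=2))
```

For abcdef this yields the five edges abc-bcd, abc-cde, bcd-cde, bcd-def and cde-def, each with weight 1.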

1) Transitivity, indicating n-th order relations: The graph in itself is a structure that maintains

information about the 'neighbor-of-a-neighbor'. This means that if A is related to B through an edge or a path in the graph, and B is related to C through another edge or path, then, if the proximity relation is considered transitive, we can deduce that A is related to C.

The length of the path between A and C can offer information about this indirect proximity relation. This can be further refined, if the edges have been assigned weights indicating the degree of proximity between the connected vertices.
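The 'neighbor-of-a-neighbor' deduction amounts to a shortest-path query over the graph. A minimal, unweighted sketch (illustrative only; with weighted edges one would use Dijkstra's algorithm instead of breadth-first search):

```python
from collections import deque


def path_length(adjacency, a, c):
    """Breadth-first search: length of the shortest path from a to c,
    used here as a toy measure of indirect proximity.
    Returns None if no path exists."""
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        v, d = frontier.popleft()
        if v == c:
            return d
        for w in adjacency.get(v, ()):
            if w not in seen:
                seen.add(w)
                frontier.append((w, d + 1))
    return None


# A neighbours B, B neighbours C: A and C are second-order related
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
print(path_length(adjacency, "A", "C"))  # -> 2
```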

Major Features of the n-gram Graph

2) Language independence/neutrality: When used in Natural Language Processing, the n-gram graph representation makes no assumption about the underlying language. This makes the representation fully language-neutral and applicable even independently of writing orientation (left-to-right or right-to-left) when character n-grams are used.

Moreover, the fact that the method reaches the sub-word level has proved useful in all the cases where a word appears in different forms, e.g. due to differences in writing style, intersection of word types and so forth.

Given two instances of the n-gram graph representation, G1 and G2, there are a number of operators that can be applied on G1 and G2 to provide the n-gram graph equivalents of union, intersection and other such operators of set theory.

For example, let the merging of G1 and G2 corresponding to the union operator in set theory be:

G3 = G1 ∪ G2, which is implemented by adding all edges from both graphs to a third one, while making sure no duplicate edges are created. Two edges are considered duplicates of each other when they share identical vertices.
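A sketch of the union operator, representing each graph as a dict mapping an undirected edge (a sorted vertex pair) to its weight. The text above only requires that no duplicate edges be created; which weight a duplicate edge keeps is unspecified, so keeping the first graph's weight here is my assumption.

```python
def union(g1, g2):
    """Merge two n-gram graphs: all edges of both appear exactly once.

    For an edge present in both graphs, the weight from g1 is kept
    (an assumption; the report does not specify this choice).
    """
    merged = dict(g1)
    for edge, w in g2.items():
        if edge not in merged:  # skip duplicates: identical vertex pairs
            merged[edge] = w
    return merged


g1 = {("a", "b"): 1.0}
g2 = {("a", "b"): 5.0, ("b", "c"): 2.0}
print(union(g1, g2))  # -> {('a', 'b'): 1.0, ('b', 'c'): 2.0}
```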

The intersection operator ∩ (G × G → G) for two graphs returns a graph with the

common edges of the two operand graphs,

with the averaged weights of the original

edges assigned as edge weights.

The averaged weights make sure that we

keep common edges and their weights are

assigned to the closest possible value to both

the original graphs: the average.
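With the same dict-of-edges representation as before (an undirected edge, i.e. a sorted vertex pair, mapped to its weight), the intersection described above is a one-liner:

```python
def intersect(g1, g2):
    """Intersection of two n-gram graphs: keep only the common edges,
    each assigned the average of its two original weights."""
    return {e: (g1[e] + g2[e]) / 2.0 for e in g1.keys() & g2.keys()}


g1 = {("a", "b"): 2.0, ("b", "c"): 4.0}
g2 = {("a", "b"): 4.0, ("c", "d"): 1.0}
print(intersect(g1, g2))  # -> {('a', 'b'): 3.0}
```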


The intersection operator, on the other hand,

can be used to determine the common

subgraphs of different document graphs.

Thus, if we take the running intersection of a

number of documents, the final resulting

graph would possibly contain n-grams of the

common-topic of the documents.
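The running-intersection idea can be illustrated end to end. The three documents below are toy inputs invented for illustration; the graph construction follows the symmetric approach sketched earlier, with the same assumed window semantics.

```python
from collections import defaultdict
from functools import reduce


def ngram_graph(text, n=3, window=2):
    # character n-grams; an edge marks co-occurrence within `window` positions
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    weights = defaultdict(int)
    for i, g in enumerate(grams):
        for j in range(i + 1, min(i + window + 1, len(grams))):
            weights[tuple(sorted((g, grams[j])))] += 1
    return dict(weights)


def intersect(g1, g2):
    # common edges only, with averaged weights
    return {e: (g1[e] + g2[e]) / 2.0 for e in g1.keys() & g2.keys()}


docs = [
    "the summary of the match",
    "a summary of results",
    "this summary of everything",
]
common = reduce(intersect, (ngram_graph(d) for d in docs))

# Vertices surviving the running intersection are candidate n-grams
# of the documents' shared topic.
survivors = sorted({v for edge in common for v in edge})
print(survivors)
```

Here the n-grams of the shared phrase "summary of" (such as 'sum', 'umm', 'ary') survive, while n-grams unique to any one document are filtered out by the intersection.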

How to extract back possible topic/keywords from the resulting graph (after applying the intersection operator)?

The final graph would also contain vertices made from the n-grams of stop-words (noise), since it has not been preprocessed. How should we filter it? Or should we preprocess the documents first? (Some methods for removing noise have already been studied in the context of summary evaluation.)

[1] Michele Banko and Lucy Vanderwende. Using n-grams to understand the nature of summaries. In Daniel Marcu, Susan Dumais, and Salim Roukos, editors, HLT-NAACL 2004: Short Papers, pages 1-4, Boston, Massachusetts, USA, May 2004. Association for Computational Linguistics.

[2] Chin-Yew Lin and Eduard Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 71-78, Morristown, NJ, USA, 2003. Association for Computational Linguistics.

[3] T. Copeck and S. Szpakowicz. Vocabulary usage in newswire summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 19-26. Association for Computational Linguistics, 2004.

[4] George Giannakopoulos. Testing the Use of N-gram Graphs in Summarization Sub-tasks. 2009.

[5] Broder, Andrei Z.; Glassman, Steven C.; Manasse, Mark S.; Zweig, Geoffrey (1997). "Syntactic clustering of the web". Computer Networks and ISDN Systems 29 (8): 1157-1166.