1 text analysis. 2 t indexing t matrix representations t term extraction and analysis t term...

1

Text Analysis

2

Indexing

Matrix Representations

Term Extraction and Analysis

Term Association

Lexical Measures of Term Significance

Document Similarity

Problems of using an uncontrolled vocabulary

3

1. Indexing

Indexing: the act of assigning index terms to a document

manually or automatically

Indexing language (vocabulary): controlled or uncontrolled

controlled: limited to a predefined set of index terms

uncontrolled: allows use of any term that fits some broad criteria

4

1. Indexing

Purpose:

to permit easy location of documents by topic

to define topic areas, and hence relate one document to another

to predict the relevance of a given document to a specified information need

Characteristics:

exhaustivity: the breadth of coverage of the index terms

specificity: the depth of coverage

5

Manual indexing

generally, manual indexing uses uncontrolled indexing

Problems:

lack of consistency: exhaustivity and specificity differ from one indexer to another

using a controlled vocabulary introduces a different problem: it can be hard to represent a document's content accurately

indexer-user mismatch: the same concept is expressed with different terms; this is hard to resolve even with a controlled vocabulary

6

Manual indexing (continued)

Characterizing the occurrence of terms

link: terms occur together or have a semantic relationship

ex) "digital" and "computer", linked using a conjunction

role: indicates the term's function or usage

ex) the name of a flower may appear in a botanical definition, or in a sentence describing its use in a garden

expressed using prepositional phrases

Cross-referencing: enhances the usability of an indexing language

See, See also (RT), Broader term (BT), Narrower term (NT)

7

Automatic indexing

an algorithm determines the index terms, almost always based on frequency of occurrence

Guiding principles:

words can be divided into two subsets: grammatical/relational and content-bearing

among content-bearing words, a word that occurs more often is more important

a word whose occurrence in a document differs significantly from its average occurrence in the document collection can be used to distinguish that document

8

Automatic indexing (continued)

Does not settle the issue of a controlled vocabulary vs. an uncontrolled one

Recent trends:

use of linguistic knowledge: syntactic structure; semantics and concepts

ex) DR-LINK (both): distinguishes proper nouns, common nouns, and so on

inferencing techniques

A major use of the index:

inverted file: lists the documents containing each term

matching terms to documents is performed only once (the result is shared by all queries)
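
The inverted file idea can be sketched as follows. This is a minimal illustration, not a production index: the function names `build_inverted_file` and `search`, the whitespace tokenization, and the three toy documents are all assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # naive whitespace tokenization (assumption)
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search(inverted, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(inverted.get(terms[0], []))
    for term in terms[1:]:
        result &= set(inverted.get(term, []))
    return sorted(result)

# Hypothetical toy collection: document id -> text
docs = {
    1: "digital computer design",
    2: "computer music",
    3: "digital music archive",
}
inverted = build_inverted_file(docs)   # built once, reused by every query
print(search(inverted, "digital music"))
```

The index is built a single time; each query then only intersects the stored document lists instead of rescanning the documents.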

9

2. Matrix Representation

many-to-many relationship between terms and documents

Three matrices are used to make these relationships explicit:

term-document matrix

term-term matrix

document-document matrix

10

2. Matrix Rep.(continued)

term-document matrix, A

rows: vocabulary terms; columns: documents

0: the term does not occur; 1 or N: the term occurs

term-term matrix, T

rows, columns: vocabulary terms

nonzero (1 or N): the ith and jth terms occur together in some document or have some other relationship

11

2. Matrix Rep.(continued)

document-document matrix, D

rows, columns: documents

nonzero: the documents have some terms in common or have some other relationship

ex) author in common

These matrices are sparse, so storing the empty cells should be avoided

ex) use a list of terms instead of the term-document matrix

each term has a list of documents attached to it

when frequency matters, store "frequency-document identifier" pairs
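
A small sketch of the three matrices and of the sparse alternative. The helper names, the toy documents, and the choice to derive T and D as products of the binary term-document matrix (co-occurrence counts) are assumptions for illustration; the slides also allow "some other relationship" to define nonzero entries.

```python
def term_document_matrix(docs, vocabulary):
    """A[i][j] = number of times term i occurs in document j (0 if absent)."""
    return [[doc.split().count(term) for doc in docs] for term in vocabulary]

def binarize(matrix):
    """Replace every nonzero count by 1."""
    return [[1 if x else 0 for x in row] for row in matrix]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(x, y):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def sparse_postings(A, vocabulary):
    """Per term: (frequency, document index) pairs for the nonzero cells only."""
    return {term: [(f, j) for j, f in enumerate(row) if f]
            for term, row in zip(vocabulary, A)}

# Hypothetical toy data
docs = ["digital computer", "computer music computer", "digital music"]
vocabulary = ["computer", "digital", "music"]

A = term_document_matrix(docs, vocabulary)
B = binarize(A)
T = matmul(B, transpose(B))   # T[i][j]: number of documents containing both term i and term j
D = matmul(transpose(B), B)   # D[i][j]: number of vocabulary terms shared by documents i and j
postings = sparse_postings(A, vocabulary)
print(postings)
```

Only the nonzero cells of A end up in `postings`, which is the point of replacing the matrix by a term list.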

12

3. Term Extraction and Analysis

Frequency variation: one basis for selecting automatic indexing terms

Zipf's law: rank × frequency ≈ constant

if the words are ranked in order of decreasing frequency

implies that frequency drops off rapidly among the most frequent words

Frequently occurring words:

grammatical necessity: the, of, and, and a

half of any given text is made up of approximately 250 words
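
To check the rank × frequency relation on a text, the words can be ranked and the product tabulated. This is a sketch only: `rank_frequency` is a hypothetical helper, ties are broken alphabetically (an arbitrary choice), and the one-sentence sample is far too small to show Zipf-like behaviour; a real test needs a large corpus.

```python
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(text.lower().split())
    # Sort by decreasing frequency; break ties alphabetically (arbitrary choice)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]

text = "the cat and the dog and the bird saw the cat"
rows = rank_frequency(text)
for row in rows:
    print(row)
```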

13

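3. Term Extraction and Analysis

Why frequent words are unsuitable as index terms:

nearly every document contains them

they are unrelated to the main ideas of the document

Why rare words are unsuitable as index terms:

they may be related to the ideas of the document, but a search on such a word returns too few documents

inability to retrieve many documents

Two thresholds for defining index terms:

upper: excludes high-frequency terms

lower: excludes rare words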
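
The two thresholds can be applied as a simple frequency filter. A minimal sketch, assuming collection-wide word counts and hypothetical cutoff values `lower=2` and `upper=3` chosen only to suit the toy data:

```python
from collections import Counter

def select_index_terms(docs, lower, upper):
    """Keep terms whose collection frequency f satisfies lower <= f <= upper."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    return sorted(term for term, f in counts.items() if lower <= f <= upper)

# Hypothetical toy collection
docs = [
    "the digital computer and the analog computer",
    "the music of the digital age",
    "the philosophy of music",
]
terms = select_index_terms(docs, lower=2, upper=3)
print(terms)
```

Here "the" is cut by the upper threshold and the one-off words by the lower one, but "of" survives, which suggests that frequency thresholds alone are a coarse filter and the cutoffs need tuning per collection.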

14

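3. Term Extraction and Analysis

Zipf's law is only a general guideline

if 100 words each occur exactly once, the formula does not hold: each of them gets a different rank

it contradicts "the most frequent 20% of the text words account for 70% of term usage."

f = k r^(-1)

the total number of word occurrences is the area under this curve, which can be obtained by integration; however, this total area is infinite

hence no finite portion is 70% of the total area

the same holds for f = k r^(-(1 - ε)) (ε > 0) and f = k r^(-(1 + ε)) (ε > 0)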
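
The divergence for the inverse-rank curve can be checked directly. As a sketch, assuming rank is treated as a continuous variable and the integration starts at r = 1 (the slide does not state the limits, so both are assumptions):

```latex
\int_{1}^{R} k\,r^{-1}\,dr \;=\; k \ln R \;\longrightarrow\; \infty
\qquad \text{as } R \to \infty ,
\qquad\text{so for any fixed } R_0:\quad
\frac{k \ln R_0}{k \ln R} \;\longrightarrow\; 0 .
```

Under these assumptions the share of the area covered by any fixed range of ranks shrinks toward zero as more ranks are included, so it cannot stay at 70%.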

15

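4. Term Association

Word pairs or phrases with sufficiently high frequency should be included in the indexing vocabulary

word proximity: depends on a given number of intervening words, on the words appearing in the same sentence, etc.

word order, punctuation

Different kinds of document collections must be considered

ex) "digital computer" is significant in medical or music collections, but it is too frequent to be significant in a computing collection and too rare to be significant in a philosophy collection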
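
Word proximity within a sentence can be counted with a sliding window. A minimal sketch: `cooccurring_pairs` and its `max_intervening` parameter are hypothetical names, pairs are ordered (word order matters), and punctuation handling is left out.

```python
from collections import Counter

def cooccurring_pairs(sentences, max_intervening=1):
    """Count ordered word pairs separated by at most max_intervening words
    within the same sentence."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i, first in enumerate(words):
            # The window covers the next (max_intervening + 1) words
            for second in words[i + 1 : i + 2 + max_intervening]:
                counts[(first, second)] += 1
    return counts

# Hypothetical toy sentences
sentences = [
    "the digital computer stores music",
    "a digital music computer",
    "digital computer music",
]
pairs = cooccurring_pairs(sentences, max_intervening=1)
print(pairs[("digital", "computer")], pairs[("digital", "music")])
```

Pairs whose count passes a chosen frequency threshold would be candidates for the indexing vocabulary; the threshold itself depends on the collection, as the "digital computer" example suggests.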

16

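5. Lexical Measures of Term Significance

development of an indexing language begins with analysis of the words and phrases occurring in the documents

frequency per document -> frequency over the whole collection

a term list is more practical than a term-document matrix, because of sparseness

Frequency of a word phrase:

it cannot be computed directly from the frequencies of its component words, but its range is known

f(AB) ≤ min(f(A), f(B))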
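
A quick check of the bound on phrase frequency. This sketch assumes a phrase means two adjacent words, and `phrase_frequency` is a hypothetical helper:

```python
def phrase_frequency(words, first, second):
    """Number of positions where `first` is immediately followed by `second`."""
    return sum(1 for a, b in zip(words, words[1:]) if (a, b) == (first, second))

words = "digital computer and digital music on a computer".split()
f_a = words.count("digital")
f_b = words.count("computer")
f_ab = phrase_frequency(words, "digital", "computer")
print(f_a, f_b, f_ab, f_ab <= min(f_a, f_b))
```

Both words occur twice, yet the phrase occurs only once: the component frequencies give an upper bound, not the phrase frequency itself.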

17

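5. Lexical Measures of Term Significance

absolute term frequency can be very misleading

documents and document collections vary in size

relative term frequency: a value adjusted for these sizes and characteristics

within a document: frequency in the document / length of the document (number of words)

over the whole collection:

total frequency of the word / sum of the frequencies of all words in the collection

number of documents containing the word / total number of documents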
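
The three relative frequencies can be computed side by side. A minimal sketch with hypothetical names and toy data; it assumes whitespace tokenization and that no document is empty:

```python
def relative_frequencies(docs, term):
    """Relative frequency of `term` per document and over the collection."""
    tokenized = [doc.lower().split() for doc in docs]
    # Frequency in the document / document length (assumes non-empty documents)
    per_document = [words.count(term) / len(words) for words in tokenized]
    total_occurrences = sum(words.count(term) for words in tokenized)
    total_words = sum(len(words) for words in tokenized)
    containing = sum(1 for words in tokenized if term in words)
    return {
        "per_document": per_document,
        "collection": total_occurrences / total_words,
        "document_fraction": containing / len(tokenized),
    }

docs = ["digital computer", "computer music computer computer", "digital music"]
rel = relative_frequencies(docs, "computer")
print(rel)
```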

18

5. Lexical Measures of Term Significance (continued)

Inverse document frequency weight

Signal-to-noise ratio

Term discrimination value

19

Inverse Document Frequency Weight

The frequency of occurrence of a term is weighted by the number of documents that contain the term

a term that appears in many documents gets a low weight

inverse document frequency (idf): log2(N / d_k) + 1 = log2 N - log2 d_k + 1

d_k: the number of documents containing the term k

N: the number of documents in the collection

minimum value = 1 (reached when d_k = N)

20

Inverse Document Frequency Weight(continued)

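inverse document frequency weight (tf.idf): w_ik = f_ik [log2 N - log2 d_k + 1]

f_ik: the frequency of term k in document i

increases with the frequency of the term in the document

decreases with the number of documents containing the term

log function: insensitive to growth in the size of the collection

if the collection doubles in size (with d_k unchanged), the idf value increases by only 1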
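
The weighting can be sketched in a few lines. The function names and the toy collection are assumptions; the formula is the one on the slide, with f_ik taken as the raw count of term k in document i.

```python
import math

def idf(n_docs, doc_freq):
    """log2(N / d_k) + 1; equals 1 when the term occurs in every document."""
    return math.log2(n_docs / doc_freq) + 1

def tf_idf(docs):
    """Return weights[i][k] = f_ik * idf_k for each document i and each term k it contains."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = {}
    for words in tokenized:
        for term in set(words):          # count each document once per term
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return [
        {term: words.count(term) * idf(n_docs, doc_freq[term]) for term in set(words)}
        for words in tokenized
    ]

# Hypothetical toy collection
docs = ["the digital computer", "the computer music computer", "the digital music", "the archive"]
weights = tf_idf(docs)
print(weights[1]["computer"], weights[0]["the"])
```

In this toy collection "the" occurs in every document and gets the minimum idf of 1, while doubling N with d_k fixed, for example `idf(8, 2)` against `idf(4, 2)`, raises the idf by exactly 1.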
