catégorisation automatisée de contenus documentaires : la


Page 1:

GammaWare Technology

June 2002

Yiftach Ravid, VP R&D

GammaSite Inc.

[email protected]

Page 2:

Overview

- The challenge

- Taxonomies

- Classification

- Categorization

- Focused Crawler

- Q&A

Page 3:

The challenge: Generate Structured Taxonomies of text repositories

Generate a structured taxonomy of huge text repositories

[Diagram labels: XML, Word, Domino, Web, Catalogues, Forms, Mail, Internal DB; Structured Data; Unstructured Data; Business, Relevant Content; Information; Application; Services]

Page 4:

Taxonomy

Page 5:

What is a Taxonomy

Taxonomy:
- Taxis = arrangement or division
- Nomos = law

The science of classification according to a pre-determined system

The best-known use of taxonomy is in biology: taxonomies of animals and plants

Page 6:

Web Taxonomy

Best-known use of taxonomies: Web portals or directories, where Internet sites are classified into hierarchical topics

General:
- Yahoo! http://www.yahoo.com/
- Open Directory http://www.dmoz.org/
- LookSmart http://www.looksmart.com/r?country=uk

Topical:
- Business.Com http://www.business.com/
- HealthWeb http://www.healthweb.org/
- Education Planet http://www.educationplanet.com/

Page 7:

Taxonomy - Sample

Page 8:

Taxonomy vs. Thesaurus

Criteria | Taxonomy | Thesaurus
Focus | Documents and their organization | Terms used in the organization
Usage | Classification of documents: documents are classified into categories/terms | Indexing documents: terms are attached to documents
Retrieval | Mainly browsing | Keyword queries
Size | Restricted to the necessary terms | Very large (terms may be added freely)

Page 9:

Classification

Page 10:

What is a Classifier

Concept (Topic, Subject): An abstract or generic idea generalized from particular instances [Merriam Webster]

Classifier:
- A function on a concept (category) and on an object (document)
- Returns a number between 0 and 1 called the confidence rate

Confidence rate: measures the confidence that the object (document) belongs to (should be classified into) the concept (category)
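
The definition above can be illustrated with a minimal Python sketch. It is not GammaWare's classifier: the keyword lists, the `keyword_classifier` function, and the scoring rule (fraction of category keywords found in the document) are assumptions invented for this example. It only shows the shape of the function: a category and a document go in, a confidence rate between 0 and 1 comes out.

```python
from typing import Callable

# A classifier maps (category, document) to a confidence rate in [0, 1].
Classifier = Callable[[str, str], float]

# Hypothetical keyword lists per category, invented for this illustration.
CATEGORY_KEYWORDS = {
    "sports": {"match", "team", "score", "league"},
    "finance": {"market", "stock", "bank", "shares"},
}


def keyword_classifier(category: str, document: str) -> float:
    """Confidence = fraction of the category's keywords found in the document."""
    keywords = CATEGORY_KEYWORDS[category]
    words = set(document.lower().split())
    return len(keywords & words) / len(keywords)


# Two of the four "sports" keywords appear in this document.
confidence = keyword_classifier("sports", "The team won the match")
```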

Page 11:

Methods for Automatic Classification

Rule based
- Pre-defined set of rules
- Advantage: incorporating prior knowledge
- Disadvantages: extreme reliance on man-made rules; costly in terms of man-hours

Linguistics
- Use of morphology, syntax and semantics
- Not multilingual, demands many training examples

Machine Learning

Page 12:

What is Machine Learning

Machine Learning is the study of computer algorithms that automatically improve performance through “experience”

Page 13:

Sample for Machine Learning

DOGS CATS

Page 14:

Discriminating Features

Q1: Who is this person?

Q2: What are the most discriminating features?

Page 15:

Discriminating Features

Answer: Lips Eyes

Page 16:

Discriminating Features

The “Margaret Thatcher effect”

Page 17:

Supervised Inductive Learning

A process where:

A learning algorithm is provided with a set of labeled instances, positive and negative examples (a training set)

Using the training set, the learning algorithm generates a classifier

The quality of the classifier is measured via its ability to perform well on novel instances (a test set)
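
The process above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the learner is a small naive-Bayes-style word-count model chosen for brevity (the slides do not say which algorithm GammaWare uses), equal class priors are assumed, and the dog/cat training and test sentences are invented.

```python
import math
from collections import Counter


def train(training_set):
    """Learn per-label word counts from (text, is_positive) pairs and return a classifier."""
    counts = {True: Counter(), False: Counter()}
    for text, label in training_set:
        counts[label].update(text.lower().split())
    vocab = set(counts[True]) | set(counts[False])
    totals = {label: sum(counts[label].values()) for label in counts}

    def classifier(text):
        """Return a confidence in [0, 1] that the text is a positive instance."""
        log_odds = 0.0
        for word in text.lower().split():
            # Laplace-smoothed word likelihoods under each label.
            p_pos = (counts[True][word] + 1) / (totals[True] + len(vocab))
            p_neg = (counts[False][word] + 1) / (totals[False] + len(vocab))
            log_odds += math.log(p_pos / p_neg)
        return 1 / (1 + math.exp(-log_odds))

    return classifier


# Labeled instances: positive examples are about dogs, negative ones about cats.
training_set = [
    ("dog barks loud", True), ("dog fetches ball", True),
    ("cat purrs softly", False), ("cat climbs tree", False),
]
# Novel instances, not seen during training.
test_set = [("loud dog", True), ("cat in tree", False)]

classifier = train(training_set)
correct = sum((classifier(text) >= 0.5) == label for text, label in test_set)
accuracy = correct / len(test_set)
```

The classifier is judged only on `test_set`, which shares no sentence with `training_set`; that separation is what "novel instances" means here.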

Page 18:

Supervised Inductive Learning Example

[Figure labels: Training, Test, errors, correct]

Page 19:

Evaluating a Classifier

Category Classifier

Page 20:

Recall and Precision

Use a confusion matrix to count (G/B = classified Good/Bad, Y/N = true label Yes/No):

                   True Label
                   Yes    No     Total
Classified Good    70     50     120
Classified Bad     30     150    180
Total              100    200    300

Precision (P) = GY / (GY + GN) = 70 / (70+50) = 0.58

Recall (R) = GY / (GY + BY) = 70 / (70+30) = 0.70

F-measure (F) = 2/(1/P + 1/R) = 2*GY/(GY+GN+GY+BY) = 2*70/(120+100) = 0.64

Accuracy (A) = (GY+BN)/(GY+GN+BY+BN) = 220 / 300 = 0.73
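
As a quick check of the arithmetic, the four measures can be computed directly from the confusion-matrix counts. This is a small sketch; the function name `evaluate` and its argument names are chosen here for illustration, following the slide's GY/GN/BY/BN cell labels.

```python
def evaluate(gy, gn, by, bn):
    """Compute precision, recall, F-measure and accuracy from confusion-matrix counts.

    gy: classified Good, true label Yes    gn: classified Good, true label No
    by: classified Bad,  true label Yes    bn: classified Bad,  true label No
    """
    precision = gy / (gy + gn)
    recall = gy / (gy + by)
    # Harmonic mean of precision and recall, written in terms of counts.
    f_measure = 2 * gy / (2 * gy + gn + by)
    accuracy = (gy + bn) / (gy + gn + by + bn)
    return precision, recall, f_measure, accuracy


# Counts taken from the slide's table.
p, r, f, a = evaluate(gy=70, gn=50, by=30, bn=150)
print(f"P={p:.2f} R={r:.2f} F={f:.2f} A={a:.2f}")
```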

Page 21:

Supervised Statistical Machine Learning

A Supervised Inductive Learning method that is based on statistics obtained from the training set

Benefits:
- Generality and flexibility: successfully applied across a broad spectrum of problems
- Multilingual
- Low labor costs

Page 22:

How to Classify documents

Pre-defined fields (structured data):
- Author
- Title
- Date

Content (unstructured data):
- From title, main text, emphasized text
- All words
- All 2-word and 3-word sequences, etc.
- Phrases, synonyms, etc.
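
Extracting content features of this kind can be sketched as follows. The helper names `word_ngrams` and `extract_features` are hypothetical, tokenization is a plain whitespace split, and phrases and synonyms are left out; the sketch only covers single words and word sequences of length 2 and 3.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return all sequences of n consecutive words in the text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def extract_features(text, max_n=3):
    """Count every 1-word, 2-word, ... max_n-word sequence in the text."""
    features = Counter()
    for n in range(1, max_n + 1):
        features.update(word_ngrams(text, n))
    return features


features = extract_features("Herbal tea specialist")
```

A title or emphasized text could be passed through the same function separately, so that a learner can weight features by where they occurred.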

Page 23:

Getting Started

Page 24:

GammaWare Work Flow

Requirements

Design the Taxonomy

Seeding Process

Check Seed

Train Classifiers

Catalogue Documents

Improve Classifiers

Ready

Page 25:

Requirements

Initial parameters and decisions:
- Level of percolation, which affects recall and precision
- Multi label: maximum number of categories into which a document can be classified
- Types of training documents: full text or keywords; different types per category
- List of stop words: common words in the language used and also in the topic
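
Two of these decisions, the stop-word list and the multi-label limit, can be sketched in Python. Everything named here is hypothetical: the `Requirements` class, its fields, and especially `min_confidence`, which is an assumption added for the example and is not a documented GammaWare parameter. The level of percolation is not modeled because the slides do not define it.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    # Maximum number of categories a document may be classified into.
    max_labels: int = 2
    # Assumed cut-off on the confidence rate (not from the slides).
    min_confidence: float = 0.5
    # Common words in the language and in the topic, ignored as features.
    stop_words: frozenset = frozenset({"the", "a", "of", "and"})


def remove_stop_words(text, req):
    """Drop stop words before the text is used for training or classification."""
    return [w for w in text.lower().split() if w not in req.stop_words]


def assign_categories(confidences, req):
    """Keep the highest-confidence categories, at most req.max_labels of them."""
    ranked = sorted(confidences.items(), key=lambda item: item[1], reverse=True)
    accepted = [cat for cat, conf in ranked if conf >= req.min_confidence]
    return accepted[:req.max_labels]


req = Requirements()
words = remove_stop_words("The history of tea", req)
labels = assign_categories(
    {"tea": 0.9, "coffee": 0.7, "health": 0.6, "sports": 0.1}, req
)
```

Raising `max_labels` tends to favor recall, while lowering it (or raising the cut-off) tends to favor precision.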

Page 26:

Taxonomy

A taxonomy is constructed according to:
- User/Business needs: who will be using the taxonomy
- Data: content of documents for classification

A good taxonomy:
- requires critical attention to both the definition and application of categories and their labels
- is simple and intuitive

How: Using the Expert Tool

Page 27:

Seeding process

Seeding process: each category within the taxonomy needs to be given a few examples of relevant documents of the same type that the user seeks to catalog

- An average of 3-6 relevant documents per category
- Seeds can be either “positive seeds” or “negative seeds” for each category
- For better results, training documents should have a structure similar to the documents for classification

How: Using the Expert Tool

Page 28:

Check Seed

Check seed: Classify the seeds into the taxonomy

Output: An HTML page (browsed by the Expert tool)

For each category shows the cataloging results for all the relevant seeds.

Why: Helps locate seeding problems:
- Seeds that are multi labeled
- Problems in taxonomy structure

How: Using the GammaWare Manager

Page 29:

Train Classifiers

Train: Train classifiers for all categories

Output: A classifier file (gcl extension) for each category

Why: The classifiers are used for categorization.

How: Using the GammaWare Manager

Page 30:

Classify Documents

Categorization: Catalogue documents into a Taxonomy

Output: A table in a database

Why: This is why we are here.

How: Using the GammaWare Manager

Page 31:

Improve Classifiers

Methods to improve classification results using the Expert Tool:
- Re-design the taxonomy
- Seed problems:
  - More examples
  - Add new seeds
  - Drag and drop documents from classification view
  - Negative “seeds”
- Modify Categorization and Train parameters

Page 32:

Categorization

Page 33:

Hierarchical Categorization

Goal: Classify a document into the appropriate sub-topic(s) in the taxonomy

Difficulties:
- Many sub-topics
- A document may fall into several sub-topics
- Classifiers are not perfect
- Must control “Recall” and “Precision” according to the client’s needs

Page 34:

Hierarchical Categorization

Divide and Conquer solution:
- Solve the problem level by level
- At each level decompose the problem into several, smaller sized classification sub-problems

Note: ignoring interactions between sub-problems can yield poor results
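
A level-by-level descent through a taxonomy can be sketched as below. This is not the patent-pending GammaWare method, which the slides do not describe. The taxonomy, the keyword sets, the `confidence` scoring rule, and the threshold are all invented for illustration, and the sketch deliberately treats each sub-problem independently, which is exactly the simplification the note above warns about.

```python
# Hypothetical taxonomy: each node lists its child categories.
TAXONOMY = {
    "root": ["health", "business"],
    "health": ["nutrition", "fitness"],
    "business": ["finance"],
}

# Hypothetical keywords standing in for trained per-category classifiers.
KEYWORDS = {
    "health": {"healthy", "diet"},
    "business": {"market", "company"},
    "nutrition": {"nutrition", "diet"},
    "fitness": {"exercise", "gym"},
    "finance": {"stock", "bank"},
}


def confidence(category, document):
    """Toy confidence rate: fraction of the category's keywords in the document."""
    words = set(document.lower().split())
    return len(KEYWORDS[category] & words) / len(KEYWORDS[category])


def categorize(document, node="root", threshold=0.5):
    """Descend the taxonomy one level at a time, following confident children only."""
    results = []
    for child in TAXONOMY.get(node, []):
        if confidence(child, document) >= threshold:
            deeper = categorize(document, child, threshold)
            # Keep the child itself when no sub-topic is confident enough.
            results.extend(deeper or [child])
    return results


specific = categorize("herbal tea is part of a healthy diet and good nutrition")
general = categorize("healthy living")
```

At each level only the children of an accepted category are tested, so the number of classifiers evaluated stays small even when the taxonomy has many sub-topics. The threshold is one place where recall and precision can be traded against each other.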

Patent Pending on Categorization

Page 35:

Focused Crawler

Page 36:

Topic Specific Crawling

Hyper-linked networks (Intranet, Internet). Two options:
- Crawl the network, then apply classification schemes to filter relevant documents.
- Using classification schemes, crawl the network while teaching the crawler to imitate (intelligent) human surfing strategies.

Retrieve all documents that are relevant to a specific topic of interest

Page 37:

Simple Crawling

Crawling: The process of retrieving documents from the net

[Figure label: Starting Document]

The network is huge:
- Storage
- Network Time

Good for general-purpose search engines

Page 38:

Focused Crawling via Link Classifiers

Link classifier: Decision according to the context of the link

- Analyze the context of the link
- Link “My brother new born child”: Link is irrelevant
- Link “Herbal tea specialist”: Retrieve the URL

Page 39:

Focused Crawler – The Learning Process

Crawler Classifier: Checks if the document is good for crawling

- Link Classifier: analyzes the link “Herbal tea specialist”
- Retrieve the content of the link
- Crawler Classifier: checks the retrieved document
- Send acknowledgment to the “link classifier” - Learning Process
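
The feedback loop between the two classifiers can be sketched on a tiny in-memory network. All of it is assumed for illustration: the `WEB` dictionary stands in for real HTTP fetching, the crawler classifier is a one-keyword test, and the link classifier is a simple per-word weight table updated by +1 or -1. The slides do not specify how GammaWare's classifiers learn.

```python
import heapq
from collections import defaultdict

# Hypothetical network: url -> (page text, [(link text, target url), ...]).
WEB = {
    "start": ("links page", [("Herbal tea specialist", "tea"),
                             ("My brother new born child", "baby")]),
    "tea": ("herbal tea blends and infusions", [("Green tea shop", "shop")]),
    "baby": ("photos of the family", []),
    "shop": ("buy green tea online", []),
}


def crawler_classifier(text):
    """Toy document classifier: is the page about the topic of interest (tea)?"""
    return "tea" in text.lower().split()


class LinkClassifier:
    """Scores a link from the words around it and learns from acknowledgments."""

    def __init__(self):
        self.weights = defaultdict(float)

    def score(self, link_text):
        words = link_text.lower().split()
        return sum(self.weights[w] for w in words) / max(len(words), 1)

    def feedback(self, link_text, relevant):
        # Acknowledgment from the crawler classifier: reward or penalize the words.
        delta = 1.0 if relevant else -1.0
        for w in link_text.lower().split():
            self.weights[w] += delta


def focused_crawl(start, link_clf, min_score=0.0, budget=10):
    """Follow the most promising links first; return the relevant pages found."""
    relevant, seen = [], {start}
    queue = [(0.0, start, None)]  # (negated link score, url, link text)
    while queue and budget > 0:
        _, url, link_text = heapq.heappop(queue)
        budget -= 1
        text, links = WEB[url]  # "retrieve the content of the link"
        is_relevant = crawler_classifier(text)
        if link_text is not None:
            link_clf.feedback(link_text, is_relevant)
        if is_relevant:
            relevant.append(url)
        for text_of_link, target in links:
            score = link_clf.score(text_of_link)
            if target not in seen and score >= min_score:
                seen.add(target)
                heapq.heappush(queue, (-score, target, text_of_link))
    return relevant


link_clf = LinkClassifier()
found = focused_crawl("start", link_clf)
```

On the first crawl the link classifier knows nothing, so every link is followed. After the acknowledgments it scores “Herbal tea specialist” above zero and “My brother new born child” below zero, so a second crawl with a positive `min_score` would skip the irrelevant link without retrieving it.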

Page 40:

GammaWare API

Page 41:

Architecture - Basic

[Diagram labels: Customer Client, GammaWare API, CORBA, GammaWare Proxy, Proxy Client, GammaWare Software, ODBC, Relational Database, File System, GW File System, Document Management, Web, Notes Domino, Outlook]

Page 42:

Multiple Servers

Scalability and Availability

[Diagram labels: Client, GammaWare Proxy, GammaWare Server, GammaWare Server 2, GammaWare Server 3, GammaWare Server 4, Database]

Page 43:

Q & A