Text mining in digital collections
CHASE: Going digital
Deirdre Lungley
February 6, 2013
Outline: Introduction, Text Mining, Classification
Introduction: Bridgeman Digital Art Library, Bridgeman Categories, Sample Classification Data
Bridgeman Categories
ID    Category
2     Oriental Miniatures
7     Maps
9     Posters
12    Arms, Armour & Militaria
15    Botanical
18    Clocks, Watches, Barometers & Sundials
20    Costume & Fashion
21    Enamels
22    Ephemera
24    Furniture
25    Glass
27    Icons
29    Inventions
30    Jewellery (see also Semi-precious stones)
31    Juvenilia / Children's Toys & Games
33    Lighting
35    Medicine
38    Mythology Mythological Myth
40    Animals
41    Mosaics
44    Semi-precious Stones (see also Jewellery)
46    Science
47    Sculpture
51    Sports and Leisure
56    Trade Emblems, City Crests, Coats of Arms
1126  CHOIR BOOKS
5000  The Arts and Entertainment
5001  Ancient and World Cultures
5002  Architecture
5003  Business and Industry
5004  Places
5005  Science and Medicine
5006  History
5007  Religion and Belief
5010  Travel and Transport
5011  Plants and Animals
5013  Emotions and Ideas
Sample Classification Data
Query/Clicked URL | Gold Standard Annotations | Classifier Predictions
monster woman | 5007 : Religion and Belief | 5007 : Religion and Belief
Dulle Griet raiding Hell | 5 : Allegory / Allegorical
38 : Mythology Mythological Myth
nuno | 5007 : Religion and Belief | 5007 : Religion and Belief
The Fishermen from the Polyptych of St. Vincent | 42 : Personalities | 5012 : Land and Sea
42 : Personalities
girl poor | 5009 : People and Society | 5009 : People and Society
A Peasant Girl Gathering Faggots in a Wood | 5012 : Land and Sea
Text Mining: Python & NLTK, Web Services, Sample Code (1) – Wikify text
Tools of the trade
Python:
  High-level language
  Many standard libraries, e.g., XML parser
Natural Language Toolkit (NLTK):
  A platform for building Python programs to work with human language data (nltk.org)
Why?
  Glue between applications
  Data preparation for tools such as Weka
  Allows programmatic access to web services
Example Web Service – WikipediaMiner
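Programmatic access to a web service of this kind can be sketched in a few lines of Python. This is a minimal sketch, not the service's documented API: the endpoint URL and the query parameter name `source` are placeholders assumed for illustration, and the response is returned unparsed.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint: replace with the address of a real wikify-style service.
SERVICE_URL = "http://example.org/services/wikify"


def build_request_url(text, base_url=SERVICE_URL):
    """Return a GET URL that passes `text` to the service as a query parameter."""
    # The parameter name "source" is an assumption made for this sketch.
    return base_url + "?" + urlencode({"source": text})


def call_service(text, base_url=SERVICE_URL, timeout=10):
    """Send the text to the service and return the raw response body as a string."""
    with urlopen(build_request_url(text, base_url), timeout=timeout) as response:
        return response.read().decode("utf-8")


print(build_request_url("monster woman"))
```

`urlencode` takes care of escaping spaces and punctuation, so arbitrary titles or queries can be passed through unchanged.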
Sample Python XML parsing – Wikify RSS title
Sample Python XML parsing – Wikify RSS title (Output)
Classification: Supervised Learning - Basics, Sample Code (2) – Classify BBC RSS Feeds
Supervised Learning - Basics
Classifier (Model) built from:
  Positive/Negative examples (labelled data)
  Features - present/absent for a given label
Test data built using:
  Present/absent classifier features
Case Study - Support Vector Machine (SVM) Classifier:
  Finds the maximum-margin separating hyperplane; the training points on the margin are the support vectors
  Used extensively in research
  Here – treat as black box – default settings
SVMLight data format:
  <target> <feature>:<value> ... <feature>:<value>
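The data format above can be produced with a short helper. This is a minimal sketch: the function name `to_svmlight_line` and the feature ids are illustrative. It assumes present/absent features, so only the features that are present are written, each with value 1.

```python
def to_svmlight_line(target, features):
    """Format one example as '<target> <feature>:<value> ...'.

    `features` maps integer feature ids (1 or greater) to numeric values.
    """
    # SVMlight expects feature ids in increasing order on each line.
    pairs = " ".join(f"{fid}:{value}" for fid, value in sorted(features.items()))
    return f"{target} {pairs}".rstrip()


# A positive and a negative example with made-up feature ids.
print(to_svmlight_line(1, {7: 1, 3: 1, 12: 1}))   # 1 3:1 7:1 12:1
print(to_svmlight_line(-1, {2: 1, 12: 1}))        # -1 2:1 12:1
```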
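Supervised classification pipeline (figure):
  Training: Training Examples -> Feature Extractor -> Pos/Neg labelled feature sets -> Learning tool -> Classifier model
  Prediction: Test Examples -> Feature Extractor -> Test feature sets -> Classifier model -> Predictions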
Case study pipeline (figure):
  Training Data: Project Gutenberg Catalogue, learning tool: SVM_Learn
  Test Data: BBC RSS Feed, predictions from: SVM_Classify
Training Data – Project Gutenberg
Case Study Task: Classify BBC RSS feeds
  Retrieve & parse BBC RSS feed
  Create Classification Features
    Casefolding
    Tokenisation
    Stemming
    Stopwords
  Classify (test data → predictions)
    Output to file on disk
    Call command
    Read file
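The classify step (output to file on disk, call command, read file) can be sketched as below. This assumes the SVMlight `svm_classify` binary is on the PATH and that a model file already exists; the function names and file paths are illustrative.

```python
import subprocess


def write_examples(lines, path):
    """Output to file on disk: one SVMlight-formatted example per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")


def classify_command(test_path, model_path, predictions_path):
    """Build the svm_classify call: example file, model file, output file."""
    return ["svm_classify", test_path, model_path, predictions_path]


def read_predictions(path):
    """Read file: svm_classify writes one decision value per test example."""
    with open(path, encoding="utf-8") as handle:
        values = [float(line) for line in handle if line.strip()]
    # A positive value means the example falls on the positive side of the hyperplane.
    return [(1 if value > 0 else -1, value) for value in values]


def classify(lines, test_path, model_path, predictions_path):
    """Run the three steps in order; requires svm_classify to be installed."""
    write_examples(lines, test_path)
    subprocess.run(classify_command(test_path, model_path, predictions_path), check=True)
    return read_predictions(predictions_path)
```

Going through files keeps the classifier a black box: Python only prepares the input and interprets the sign of each returned value.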
Retrieve & parse RSS feed
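A minimal sketch of this step using only the standard library (`urllib` and the `xml.etree.ElementTree` parser). The function names are illustrative, the feed URL is left as a parameter, and the sample feed content is made up so that the parsing part runs without network access.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def fetch_feed(url, timeout=10):
    """Download the feed and return its XML as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def parse_items(rss_xml):
    """Return a (title, description) pair for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        description = item.findtext("description", default="").strip()
        items.append((title, description))
    return items


# Made-up feed content, used here instead of a live feed.
SAMPLE = """<rss version="2.0"><channel><title>Example feed</title>
<item><title>First headline</title><description>First summary.</description></item>
<item><title>Second headline</title><description>Second summary.</description></item>
</channel></rss>"""

for title, description in parse_items(SAMPLE):
    print(title, "-", description)
```

With a live feed, `parse_items(fetch_feed(url))` would return the same structure.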
Retrieve & parse RSS feed (Output)
Text to Features
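The four feature-creation steps (casefolding, tokenisation, stemming, stopwords) can be sketched with the standard library alone. This is a simplified stand-in: the stopword list is a tiny illustrative one and `crude_stem` only strips a few suffixes, whereas in practice NLTK's stopwords corpus and `PorterStemmer` would take their place. The sample headline is made up.

```python
import re

# A tiny illustrative stopword list; a fuller list would be used in practice.
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}


def crude_stem(token):
    """Strip a few common suffixes. A stand-in for a real stemmer, not the Porter algorithm."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def text_to_features(text):
    """Casefold, tokenise, remove stopwords, stem; return the set of features present."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {crude_stem(token) for token in tokens if token not in STOPWORDS}


print(sorted(text_to_features("Flooding hits towns in the north")))
# ['flood', 'hit', 'north', 'town']
```

Returning a set matches the present/absent view of features: a word either occurs in the text or it does not, and counts are ignored.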
Text to Features (Output)
Classify: Test data → predictions (Output)
Training Data – Project Gutenberg
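Creating labelled training data can be sketched as follows. The assumptions are stated plainly: each catalogue record has already been reduced to a set of features plus a subject label, the examples below are made up, and one category is treated as positive with all others negative.

```python
def build_vocabulary(feature_sets):
    """Assign each distinct feature an integer id starting at 1 (SVMlight ids must be positive)."""
    vocabulary = {}
    for features in feature_sets:
        for feature in sorted(features):
            vocabulary.setdefault(feature, len(vocabulary) + 1)
    return vocabulary


def training_lines(examples, positive_label, vocabulary):
    """Label each example +1 if it belongs to `positive_label`, otherwise -1."""
    lines = []
    for features, label in examples:
        target = "+1" if label == positive_label else "-1"
        ids = sorted(vocabulary[f] for f in features if f in vocabulary)
        lines.append(" ".join([target] + [f"{i}:1" for i in ids]))
    return lines


# Made-up examples: (set of features, subject label).
EXAMPLES = [
    ({"histori", "england"}, "History"),
    ({"plant", "garden"}, "Science"),
    ({"histori", "rome"}, "History"),
]

vocab = build_vocabulary(features for features, _ in EXAMPLES)
for line in training_lines(EXAMPLES, "History", vocab):
    print(line)
```

The same vocabulary has to be reused when the test data is converted, so that a given feature id means the same word in both files.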
Create training data (Output)
References:
The Regex Coach
Thank You!