1
SIMS 290-2: Applied Natural Language Processing
Preslav Nakov
October 6, 2004
2
Today
The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo
3
The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo
4
20 Newsgroups Data Set
http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/
Source: originally collected by Ken Lang
Content and structure:
– approximately 20,000 newsgroup documents (19,997 originally; 18,828 without duplicates)
– partitioned evenly across 20 different newsgroups
Some categories are strongly related (and thus hard to discriminate):
– computers: comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
– rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
– sci.crypt, sci.electronics, sci.med, sci.space
– misc.forsale
– talk.politics.misc, talk.politics.guns, talk.politics.mideast
– talk.religion.misc, alt.atheism, soc.religion.christian
5
Sample Posting: “talk.politics.guns”
From: [email protected] (C. D. Tavares)
Subject: Re: Congress to review ATF's status
In article <[email protected]>, [email protected] (Larry Cipriani) writes:
> WASHINGTON (UPI) -- As part of its investigation of the deadly
> confrontation with a Texas cult, Congress will consider whether the
> Bureau of Alcohol, Tobacco and Firearms should be moved from the
> Treasury Department to the Justice Department, senators said Wednesday.
> The idea will be considered because of the violent and fatal events
> at the beginning and end of the agency's confrontation with the Branch
> Davidian cult.
Of course. When the catbox begines to smell, simply transfer its
contents into the potted plant in the foyer.
"Why Hillary! Your government smells so... FRESH!"
--
[email protected] --If you believe that I speak for my company,
OR [email protected] write today for my special Investors' Packet...
Annotated parts: from, subject, reply, “… writes:”, signature
These need special handling during feature extraction…
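One possible way to give these parts special handling is sketched below. It is only an illustration, not the course code: the function name `split_posting`, the rule that the header ends at the first non-header line, and the use of a lone `--` line as the signature delimiter are all assumptions.

```python
import re

# Header lines look like "Name: value" (assumption: the header ends at the
# first line that does not match this pattern, typically a blank line).
HEADER_RE = re.compile(r"^([A-Za-z-]+):\s*(.*)$")

def split_posting(text):
    """Split a posting into header fields, quoted reply text, body and signature."""
    parts = {"header": {}, "quoted": [], "body": [], "signature": []}
    state = "header"
    for line in text.splitlines():
        stripped = line.strip()
        if state == "header":
            match = HEADER_RE.match(line)
            if match:
                parts["header"][match.group(1).lower()] = match.group(2)
                continue
            state = "body"  # first non-header line ends the header
        if state == "signature":
            parts["signature"].append(line)
        elif stripped == "--":  # assumed signature delimiter
            state = "signature"
        elif stripped.startswith(">") or stripped.endswith("writes:"):
            parts["quoted"].append(line)  # reply material, not the author's own words
        elif stripped:
            parts["body"].append(line)
    return parts
```

With such a split, features could be extracted from `parts["body"]` only, or from each part separately with a distinguishing prefix.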
6
The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo
7
Slide adapted from Eibe Frank's
WEKA: The Bird
Copyright: Martin Kramer ([email protected]), University of Waikato, New Zealand
8
WEKA: Terminology
Some synonyms/explanations for the terms used by WEKA, which may differ from what we adopted:
– Attribute: feature
– Relation: collection of examples
– Instance: a single example (WEKA's Instances object is the collection in use)
– Class: category
9
Slide adapted from Eibe Frank's
WEKA: The Software Toolkit
Machine learning/data mining software in Java
GNU License
Used for research, education and applications
Complements “Data Mining” by Witten & Frank
Main features:
– data pre-processing tools
– learning algorithms
– evaluation methods
– graphical interface (incl. data visualization)
– environment for comparing learning algorithms
http://www.cs.waikato.ac.nz/ml/weka
10
Slide adapted from Eibe Frank's
WEKA GUI Chooser
java -Xmx1000M -jar weka.jar
11
Slide adapted from Eibe Frank's
Our Toy Example
We demonstrate WEKA on a toy example:
3 categories from “20 Newsgroups”:
– misc.forsale
– rec.sport.hockey
– comp.graphics
20 documents per category
features:
– words converted to lowercase
– frequency 2 or more required
– stopwords removed
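The three feature rules above can be sketched in a few lines of Python. This is an illustration under assumptions, not the script used for the slides: the stopword list is a tiny stand-in, words are taken to be runs of letters, and "frequency" is read as the total count over the whole collection.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption; a real list is much longer).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}

def select_features(documents, min_freq=2, stopwords=STOPWORDS):
    """Return the sorted vocabulary of lowercase, non-stopword words
    that occur at least min_freq times in the collection."""
    counts = Counter()
    for doc in documents:
        words = re.findall(r"[a-z]+", doc.lower())  # lowercase, letters only
        counts.update(w for w in words if w not in stopwords)
    return sorted(w for w, c in counts.items() if c >= min_freq)
```

For example, `select_features(["The puck hit the goalie", "Goalie saves the puck", "Graphics card for sale"])` keeps only `goalie` and `puck`, since every other non-stopword occurs once.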
12
Slide adapted from Eibe Frank's
Explorer: Pre-Processing The Data
WEKA can import data from:
– files: ARFF, CSV, C4.5, binary
– URL
– SQL database (using JDBC)
Pre-processing tools (filters) are used for:
– discretization, normalization, resampling, attribute selection, transforming and combining attributes, etc.
13
The Preprocessing Tab
Screenshot annotations:
– Preprocessing
– Classification
– Filter selection
– Manual attribute selection
– Statistical attribute selection
– List of attributes (last: class variable)
– Statistics about the values of the selected attribute
– Frequency and categories for the selected attribute
14
Slide adapted from Eibe Frank's
Explorer: Building “Classifiers”
Classifiers in WEKA are models for:
– classification (predict a nominal class)
– regression (predict a numerical quantity)
Learning algorithms:
– Naïve Bayes, decision trees, kNN, support vector machines, multi-layer perceptron, logistic regression, etc.
Meta-classifiers:
– cannot be used alone
– always combined with a learning algorithm
– examples: boosting, bagging etc.
15
The Classification Tab
Screenshot annotations:
– Choice of classifier
– Class attribute: the attribute whose value is to be predicted from the values of the remaining ones. Default is the last attribute. Here (in our toy example) it is named “class”.
– Cross-validation: split the data into e.g. 10 folds, then 10 times train on 9 folds and test on the remaining one
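The cross-validation procedure described above can be sketched as a generator of train/test index splits. This is a simplified illustration: the function name `k_fold_splits` is hypothetical, and the sketch neither shuffles nor stratifies the data, which a real evaluation would usually do.

```python
def k_fold_splits(n_examples, k=10):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_examples))
    # Deal the examples round-robin into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # Train on the other k-1 folds.
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

For the toy example (60 documents, 10 folds) each round trains on 54 documents and tests on the remaining 6, and every document is tested exactly once.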
16
Choosing a classifier
17
18
False: Gaussian
True: kernels (better)
displays synopsis and options
numerical to nominal conversion by discretization
outputs additional information
19
20
21
all other numbers can be obtained from it
different/easy class
accuracy
22
Confusion matrix
Contains information about the actual and the predicted classification

| | predicted – | predicted + |
|---|---|---|
| true – | a | b |
| true + | c | d |

All measures can be derived from it:
– accuracy: (a+d)/(a+b+c+d)
– recall: d/(c+d) => R
– precision: d/(b+d) => P
– F-measure: 2PR/(P+R)
– false positive (FP) rate: b/(a+b)
– true negative (TN) rate: a/(a+b)
– false negative (FN) rate: c/(c+d)
These extend to more than 2 classes: see previous lecture slides for details
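As a worked example, the formulas above translate directly into code. The function name `confusion_measures` is hypothetical; the cell names a, b, c, d follow the matrix layout on this slide (rows are the true class, columns the predicted class), and the sketch assumes no denominator is zero.

```python
def confusion_measures(a, b, c, d):
    """Derive the standard measures from a 2x2 confusion matrix.

    a: true -, predicted -    b: true -, predicted +
    c: true +, predicted -    d: true +, predicted +
    """
    recall = d / (c + d)
    precision = d / (b + d)
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "recall": recall,
        "precision": precision,
        "f_measure": 2 * precision * recall / (precision + recall),
        "fp_rate": b / (a + b),
        "tn_rate": a / (a + b),
        "fn_rate": c / (c + d),
    }
```

With a=50, b=10, c=5, d=35 this gives accuracy 0.85, recall 0.875 and precision 35/45 (about 0.778).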
23
Predictions Output
Outputs the probability distribution for each example
24
Predictions Output
Probability distribution for a wrong example: predicted 1 instead of 3
Naïve Bayes makes incorrect conditional independence assumptions and is typically over-confident in its prediction, regardless of whether it is correct or not.
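A small numeric illustration of why the independence assumption leads to over-confidence: if the same piece of evidence shows up as several perfectly correlated features, Naïve Bayes multiplies its likelihood ratio in once per copy. The two-class setup, the function name `posterior` and the numbers are illustrative assumptions, not taken from the slides.

```python
def posterior(likelihood_ratio, copies, prior=0.5):
    """Posterior of class 1 in a two-class Naive Bayes model when one piece
    of evidence is counted `copies` times as if each copy were independent."""
    odds = (prior / (1 - prior)) * likelihood_ratio ** copies
    return odds / (1 + odds)
```

A feature that favours class 1 by 2:1 gives a posterior of about 0.667 when counted once, but about 0.999 when ten redundant copies of it are treated as independent.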
25
Error Visualization
26
Error Visualization
Little squares designate errors
Axes show example number
27
Slide adapted from Eibe Frank's
Explorer: Attribute Selection
Find which attributes are the most predictive ones
Two parts:
– search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking
– evaluation method: information gain, chi-squared, etc.
Very flexible: WEKA allows (almost) arbitrary combinations of these two
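To make one of the evaluation methods concrete, here is a minimal sketch of information gain for a single discrete feature. It follows the textbook definition (class entropy minus the expected entropy after splitting on the feature) and is not WEKA's implementation; the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in class entropy obtained by splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [l for v, l in zip(feature_values, labels) if v == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder
```

A feature that separates two equally frequent classes perfectly has a gain of 1 bit; a feature unrelated to the class has a gain of 0. Ranking features by this score corresponds to the "ranking" search method combined with the information gain evaluator.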
28
Individual Features Ranking
29
misc.forsale
comp.graphics
rec.sport.hockey
Individual Features Ranking
30
misc.forsale
comp.graphics
rec.sport.hockey
???
random number seed
Individual Features Ranking
31
Slide adapted from Jakulin, Bratko, Smrke, Demšar and Zupan's
Feature Interactions
[Diagram: features A and B and category C. Labels: importance of feature A, importance of feature B, feature correlation, 2-way interactions]
32
Slide adapted from Jakulin, Bratko, Smrke, Demšar and Zupan's
Feature Interactions
3-Way Interaction: what is common to A, B and C together and cannot be inferred from pairs of features.
[Diagram: features A and B and category C. Labels: importance of feature A, importance of feature B]
33
Slide adapted from Guozhu Dong's
Feature Subsets Selection
Problem illustration:
– Full set
– Empty set
– Enumeration
Search:
– Exhaustive/Complete (enumeration, branch and bound)
– Heuristic (sequential forward/backward)
– Stochastic (generate/evaluate)
Individual features or subsets generation/evaluation
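The heuristic "sequential forward" search can be sketched as a greedy loop: start from the empty set and repeatedly add the single feature that improves the subset score the most. The `score` callback and the function name `forward_selection` are hypothetical; in practice the score would be some subset evaluator such as cross-validated accuracy.

```python
def forward_selection(features, score, max_features=None):
    """Greedy sequential forward selection.

    features: iterable of candidate feature names
    score:    function mapping a list of features to a number (higher is better)
    Returns (selected_features, best_score).
    """
    selected, best = [], score([])
    remaining = list(features)
    while remaining and (max_features is None or len(selected) < max_features):
        # Try adding each remaining feature and keep the best candidate.
        candidate = max(remaining, key=lambda f: score(selected + [f]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best:
            break  # no single addition helps any more
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected, best
```

Unlike exhaustive enumeration, which considers all 2^n subsets, this evaluates at most on the order of n^2 subsets, at the price of possibly missing the best one.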
34
Features Subsets Selection
35
misc.forsale
comp.graphics
rec.sport.hockey
17,309 subsets considered
21 attributes selected
Features Subsets Selection
36
Saving the Selected Features
All we can do from this tab is to save the buffer in a text file. Not very useful...
But we can also perform feature selection during the pre-processing step...(the following slides)
37
Features Selection on Preprocessing
38
Features Selection on Preprocessing
39
Features Selection on Preprocessing
679 attributes: 678 + 1 (for the class)
40
Features Selection on Preprocessing
Just 22 attributes remain: 21 + 1 (for the class)
41
Run Naïve Bayes With the 21 Features
higher accuracy
21 Attributes
42
different/easy class
accuracy
(AGAIN) Naïve Bayes With All Features
All 679 attributes (repeated slide)
43
Some Important Algorithms
WEKA sometimes uses unexpected names for algorithms
Here is how to find the algorithms Barbara introduced:
– Naïve Bayes: weka.classifiers.bayes.NaiveBayes
– Perceptron: weka.classifiers.functions.VotedPerceptron
– Winnow: weka.classifiers.functions.Winnow
– Decision tree: weka.classifiers.trees.J48
– Support vector machines: weka.classifiers.functions.SMO
– k nearest neighbor: weka.classifiers.lazy.IBk
Some of these are more sophisticated versions of the classic algorithms
– e.g. I cannot find the classic Naïve Bayes in WEKA (although there are 5 available implementations).
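These class names can also be used outside the GUI, since each classifier can be run from the command line. The sketch below only builds the command; it assumes `weka.jar` sits in the current directory, reuses the memory setting from the GUI Chooser slide, and relies on WEKA's standard `-t` (training file) and `-x` (number of cross-validation folds) options. The helper `weka_command` and the file name `toy.arff` are hypothetical.

```python
# Short names mapped to the WEKA class names listed above.
CLASSIFIERS = {
    "naive_bayes": "weka.classifiers.bayes.NaiveBayes",
    "perceptron": "weka.classifiers.functions.VotedPerceptron",
    "decision_tree": "weka.classifiers.trees.J48",
    "svm": "weka.classifiers.functions.SMO",
    "knn": "weka.classifiers.lazy.IBk",
}

def weka_command(classifier, arff_path, folds=10, jar="weka.jar"):
    """Build the argument list for a cross-validated WEKA run."""
    return ["java", "-Xmx1000M", "-cp", jar, CLASSIFIERS[classifier],
            "-t", arff_path, "-x", str(folds)]
```

The resulting list can be passed to `subprocess.run(weka_command("naive_bayes", "toy.arff"))` on a machine where Java and WEKA are installed.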
44
The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo
45
Slide adapted from Eibe Frank's
Performing Experiments
Experimenter makes it easy to compare the performance of different learning schemes
Problems:
– classification
– regression
Results: written into file or database
Evaluation options:
– cross-validation
– learning curve
– hold-out
Can also iterate over different parameter settings
Significance-testing built in!
46
Experiments Setup
47
Experiments Setup
48
Experiments Setup
CSV file: can be opened in Excel
datasets
algorithms
49
Experiments Setup
50
Experiments Setup
51
Experiments Setup
52
Experiments Setup
53
Experiments Setup
accuracy
SVM is the best
Decision tree is the worst
SVM is statistically significantly better than Naïve Bayes
Decision tree is statistically significantly worse than Naïve Bayes
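To give an idea of what a significance test on per-fold results involves, here is a plain paired t statistic computed over matched accuracy values. This is a textbook sketch, not necessarily the test WEKA's Experimenter applies, and the accuracy values in the usage note are made up for illustration.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """t statistic for the mean per-fold difference between two classifiers.

    scores_a, scores_b: accuracies measured on the same folds, in the same order.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / std_err
```

For the made-up accuracies `[0.90, 0.92, 0.88, 0.91]` versus `[0.85, 0.86, 0.84, 0.87]` the statistic is about 9.9 with 3 degrees of freedom, well above the two-sided 5% critical value of 3.182.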
54
Experiments: Excel
Results are output into a CSV file, which can be read in Excel!
55
The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo
56
Slide adapted from Eibe Frank's
WEKA File Format: ARFF
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
Annotations:
– Numerical attribute (e.g. age)
– Nominal attribute (e.g. sex)
– Missing value (“?”)
– Other attribute types: String, Date
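A minimal sketch of how a file in this format could be produced from Python. The helper `to_arff` is hypothetical and deliberately simple: it handles numeric and nominal attributes and missing values, and does not quote names or values that contain spaces or commas.

```python
def to_arff(relation, attributes, rows):
    """Render a dataset as ARFF text.

    attributes: list of (name, kind) pairs, where kind is the string "numeric"
                or a list of nominal values
    rows:       list of value tuples; None marks a missing value
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            kind = "{" + ", ".join(kind) + "}"  # nominal attribute
        lines.append("@attribute %s %s" % (name, kind))
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"
```

For example, `to_arff("toy", [("age", "numeric"), ("class", ["present", "not_present"])], [(63, "not_present"), (None, "present")])` produces a header with two `@attribute` lines followed by the data rows `63,not_present` and `?,present`.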
57
WEKA File Format: Sparse ARFF
Value 0 is not represented explicitly
Same header (i.e. @relation and @attribute tags); the @data section is different
Instead of
@data
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"
we have
@data
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}
This is especially useful for textual data (why?)
But! Problems with feature selection: cannot save results
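The conversion from a full row to a sparse row can be sketched as follows. The helper `to_sparse_row` is hypothetical; it uses 0-based attribute indices as in the example above and quotes only values that contain a space.

```python
def to_sparse_row(values):
    """Render one data row in sparse ARFF form, omitting zero values."""
    pairs = []
    for index, value in enumerate(values):
        if value in (0, "0"):
            continue  # zeros are implicit in the sparse format
        text = str(value)
        if " " in text:
            text = '"%s"' % text  # quote values containing spaces
        pairs.append("%d %s" % (index, text))
    return "{" + ", ".join(pairs) + "}"
```

Applied to the two rows above, `to_sparse_row([0, "X", 0, "Y", "class A"])` and `to_sparse_row([0, 0, "W", 0, "class B"])` reproduce the sparse lines on this slide. With word features, most entries of a document vector are zero, so the sparse form is usually far shorter.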
58
Python Interface to WEKA
Works on the 20 newsgroups collection
Extracts the features:
– currently words
– easy to modify, just change one or more of: extract_features_and_freqs(), is_feature_good(), build_stoplist()
Allows filtering out:
– the stopwords
– the infrequent features
Features are weighted by their frequency in the document (term frequency)
Produces an ARFF file to be used by WEKA
59
Python Interface to WEKA
Allows you to specify:
– which subset of classes to consider
– the number of documents for each class
– the minimum feature frequency
– a regular expression pattern a feature should match
– whether to remove the stopwords
– whether to convert words to lowercase
– the kind of output to produce: sparse (i.e., feature = value) or full vector (list of values)
60
Python Interface to WEKA: How To
Requires "20_newsgroups" and "stopwords" to be installed
To get things working under Windows:
– open “__init__.py”
– in the code below, substitute “/” with “\\”
#####################################################
# 20 Newsgroups
groups = [(ng, ng+'/.*') for ng in '''
    alt.atheism rec.autos sci.space
    comp.graphics rec.motorcycles soc.religion.christian
    comp.os.ms-windows.misc rec.sport.baseball talk.politics.guns
    comp.sys.ibm.pc.hardware rec.sport.hockey talk.politics.mideast
    comp.sys.mac.hardware sci.crypt talk.politics.misc
    comp.windows.x sci.electronics talk.religion.misc
    misc.forsale sci.med'''.split()]
twenty_newsgroups = SimpleCorpusReader(
    '20_newsgroups', '20_newsgroups/', '.*/.*', groups,
    description_file='../20_newsgroups.readme')
del groups # delete temporary variable
61
Python Interface to WEKA
The Main Function
62
Python Interface to WEKA
Example Usage
Python dictionary
Estimated over the whole set! cross-validation: OK; test/train: not OK
Use 1
63
Python Interface to WEKA
Functions You Will Probably Want To Modify
convert to lowercase
Also: stemming!
Also: word+POS!
Also: compounds!
64
Python Interface to WEKA
You might want to add… Stemming
Porter stemmer
>>> cats = Token(TEXT='cats', POS='NN')
>>> from nltk.stemmer.porter import *
>>> porter = PorterStemmer()
>>> porter.stem(cats)
>>> print cats
<POS='NN', STEM='cat', TEXT='cats'>
WordNet stemmer
– morphy: morphological analyzer
– you need the following packages installed: nltk.wordnet, nltk-contrib.pywordnet
>>> from nltk_contrib.pywordnet.stemmer import *
>>> morphy('dogs')
'dog'
65
Python Interface to WEKA
You might want to add… TF.IDF
TF.IDF: $t_{ij} \cdot \log(N/n_i)$
TF: $t_{ij}$
– $t_{ij}$: frequency of term i in document j
– this is how features are currently weighted
IDF: $\log(N/n_i)$
– $n_i$: number of documents containing term i
– $N$: total number of documents
Modify the function extract_features_and_freqs_forall()
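The weighting defined above can be sketched as a standalone function. This is not the body of `extract_features_and_freqs_forall()`, whose code is not shown here; the function name `tf_idf`, the input format (one token list per document) and the use of the natural logarithm are assumptions.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute TF.IDF weights t_ij * log(N / n_i).

    documents: list of token lists, one per document
    Returns one {term: weight} dictionary per document.
    """
    n_docs = len(documents)                 # N
    doc_freq = Counter()                    # n_i: documents containing term i
    for tokens in documents:
        doc_freq.update(set(tokens))
    weighted = []
    for tokens in documents:
        term_freq = Counter(tokens)         # t_ij
        weighted.append({term: count * math.log(n_docs / doc_freq[term])
                         for term, count in term_freq.items()})
    return weighted
```

A term that occurs in every document gets weight 0, because log(N/N) = 0, while a term concentrated in a few documents is weighted up.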
66
The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo
67
Summary
The 20 Newsgroups Text Collection
WEKA: The Toolkit
– Explorer: classification, feature selection
– Experimenter
– ARFF file format
Python Interface to WEKA
– feature extraction
– stemming
– weighting: TF.IDF
WEKA: Real-time Demo