vocabulary spectral analysis as an exploratory tool for scientific web intelligence mike thelwall...
Post on 22-Dec-2015
![Page 1: Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence Mike Thelwall Professor of Information Science University of Wolverhampton](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d785503460f94a5aeaa/html5/thumbnails/1.jpg)
Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence
Mike Thelwall
Professor of Information Science
University of Wolverhampton
![Page 2: Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence Mike Thelwall Professor of Information Science University of Wolverhampton](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d785503460f94a5aeaa/html5/thumbnails/2.jpg)
Contents
Introduction to Scientific Web Intelligence
Introduction to the Vector Space Model
Vocabulary Spectral Analysis
Low frequency words
![Page 3: Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence Mike Thelwall Professor of Information Science University of Wolverhampton](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d785503460f94a5aeaa/html5/thumbnails/3.jpg)
Part 1
Scientific Web Intelligence
![Page 4: Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence Mike Thelwall Professor of Information Science University of Wolverhampton](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d785503460f94a5aeaa/html5/thumbnails/4.jpg)
Scientific Web Intelligence
Applying web mining and web intelligence techniques to collections of academic/scientific web sites
Uses links and text
Objective: to identify patterns and visualize relationships between web sites and subsites
Objective: to report to users causal information about relationships and patterns
![Page 5: Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence Mike Thelwall Professor of Information Science University of Wolverhampton](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d785503460f94a5aeaa/html5/thumbnails/5.jpg)
Academic Web Mining
Step 1: Cluster domains by subject content, using text and links
Step 2: Identify patterns and create visualizations for relationships
Step 3: Incorporate user feedback and reason reporting into visualization
This presentation deals with Step 1, deriving subject-based clusters of academic webs from text analysis
![Page 6: Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence Mike Thelwall Professor of Information Science University of Wolverhampton](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d785503460f94a5aeaa/html5/thumbnails/6.jpg)
Part 2
Introduction to the Vector Space Model
![Page 7: Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence Mike Thelwall Professor of Information Science University of Wolverhampton](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d785503460f94a5aeaa/html5/thumbnails/7.jpg)
Overview
The Vector Space Model (VSM) is a way of representing documents through the words that they contain
It is a standard technique in Information Retrieval
The VSM allows decisions to be made about which documents are similar to each other and to keyword queries
![Page 8: Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence Mike Thelwall Professor of Information Science University of Wolverhampton](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d785503460f94a5aeaa/html5/thumbnails/8.jpg)
How it works: Overview
Each document is broken down into a word frequency table
The tables are called vectors and can be stored as arrays
A vocabulary is built from all the words in all documents in the system
Each document is represented as a vector indexed against the vocabulary
![Page 9: Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence Mike Thelwall Professor of Information Science University of Wolverhampton](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d785503460f94a5aeaa/html5/thumbnails/9.jpg)
Example
Document A: “A dog and a cat.”
– Word frequencies: a 2, dog 1, and 1, cat 1
Document B: “A frog.”
– Word frequencies: a 1, frog 1
![Page 10: Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence Mike Thelwall Professor of Information Science University of Wolverhampton](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d785503460f94a5aeaa/html5/thumbnails/10.jpg)
Example, continued
The vocabulary contains all words used
– a, dog, and, cat, frog
The vocabulary needs to be sorted
– a, and, cat, dog, frog
![Page 11: Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence Mike Thelwall Professor of Information Science University of Wolverhampton](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d785503460f94a5aeaa/html5/thumbnails/11.jpg)
Example, continued
Document A: “A dog and a cat.”
– Vector: (2,1,1,1,0)
Document B: “A frog.”
– Vector: (1,0,0,0,1)

| Document | a | and | cat | dog | frog |
|---|---|---|---|---|---|
| A | 2 | 1 | 1 | 1 | 0 |
| B | 1 | 0 | 0 | 0 | 1 |
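The example above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's software: the function names `tokenize`, `build_vocabulary` and `to_vector` are made up for this sketch, and the tokenizer simply assumes that words are runs of letters.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words (an assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(documents):
    """Return the sorted list of distinct words across all documents."""
    return sorted({word for doc in documents for word in tokenize(doc)})

def to_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

doc_a = "A dog and a cat."
doc_b = "A frog."
vocabulary = build_vocabulary([doc_a, doc_b])
vector_a = to_vector(doc_a, vocabulary)
vector_b = to_vector(doc_b, vocabulary)

print(vocabulary)  # ['a', 'and', 'cat', 'dog', 'frog']
print(vector_a)    # [2, 1, 1, 1, 0]
print(vector_b)    # [1, 0, 0, 0, 1]
```

Sorting the vocabulary fixes the position of every word, so the same index always refers to the same word in every document vector.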
![Page 12: Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence Mike Thelwall Professor of Information Science University of Wolverhampton](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d785503460f94a5aeaa/html5/thumbnails/12.jpg)
Measuring inter-document similarity
For two vectors d and d’ the cosine similarity between d and d’ is given by:

$$\cos(d, d') = \frac{d \cdot d'}{|d|\,|d'|}$$

Here d · d’ is the dot product of d and d’, calculated by multiplying corresponding frequencies together and summing the results, and |d| is the length of d
The cosine measure gives the cosine of the angle between the vectors in a high-dimensional virtual space
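As a worked illustration, the cosine measure can be applied to the two example vectors for Documents A and B. The sketch below assumes the usual definition (dot product divided by the product of the vector lengths); the function name `cosine_similarity` is chosen here for illustration only.

```python
import math

def cosine_similarity(d1, d2):
    """Dot product of d1 and d2 divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(d1, d2))
    norm1 = math.sqrt(sum(x * x for x in d1))
    norm2 = math.sqrt(sum(y * y for y in d2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm1 * norm2)

vector_a = [2, 1, 1, 1, 0]  # "A dog and a cat."
vector_b = [1, 0, 0, 0, 1]  # "A frog."
similarity = cosine_similarity(vector_a, vector_b)
print(round(similarity, 4))  # 0.5345
```

The only shared word is "a", so the dot product is 2 × 1 = 2, and the lengths are √7 and √2, giving 2/√14.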
![Page 13: Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence Mike Thelwall Professor of Information Science University of Wolverhampton](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d785503460f94a5aeaa/html5/thumbnails/13.jpg)
Stopword lists
Commonly occurring words are unlikely to give useful information and may be removed from the vocabulary to speed processing
– E.g. “in”, “a”, “the”
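Stopword removal can be sketched as a simple filter applied before the vocabulary is built. The three-word list below contains only the examples named on the slide; real stopword lists are much longer, and the helper name `content_words` is hypothetical.

```python
import re

# Only the slide's three examples; an actual list would be far larger.
STOPWORDS = {"in", "a", "the"}

def content_words(text, stopwords=STOPWORDS):
    """Tokenize the text and drop any word found in the stopword list."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in stopwords]

remaining = content_words("A dog and a cat.")
print(remaining)  # ['dog', 'and', 'cat']
```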
![Page 14: Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence Mike Thelwall Professor of Information Science University of Wolverhampton](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d785503460f94a5aeaa/html5/thumbnails/14.jpg)
Normalised term frequency (tf)
A normalised measure of the importance of a word to a document is its frequency, divided by the maximum frequency of any term in the document
This is known as the tf factor.
Document A: raw frequency vector: (2,1,1,1,0), tf vector: (1, 0.5, 0.5, 0.5, 0)
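The normalisation for Document A can be reproduced with a short sketch. The function name `tf_vector` is illustrative, and returning all zeros for an empty document is a convention assumed here.

```python
def tf_vector(raw):
    """Divide each raw frequency by the largest frequency in the document."""
    max_freq = max(raw, default=0)
    if max_freq == 0:
        return [0.0 for _ in raw]  # empty document: nothing to normalise
    return [count / max_freq for count in raw]

tf_a = tf_vector([2, 1, 1, 1, 0])
print(tf_a)  # [1.0, 0.5, 0.5, 0.5, 0.0]
```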
![Page 15: Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence Mike Thelwall Professor of Information Science University of Wolverhampton](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d785503460f94a5aeaa/html5/thumbnails/15.jpg)
Inverse document frequency (idf)
A calculation designed to make rare words more important than common words
The idf of word i is given by

$$\mathrm{idf}_i = \log \frac{N}{n_i}$$

Where N is the total number of documents and n_i is the number that contain word i
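A small sketch of the idf calculation for the two example documents follows. The slide does not state the base of the logarithm, so the natural logarithm is assumed here; the function name `idf_weights` is illustrative.

```python
import math

def idf_weights(vectors):
    """Return log(N / n_i) for each word i, where n_i counts documents containing it."""
    n_docs = len(vectors)
    n_words = len(vectors[0])
    weights = []
    for i in range(n_words):
        containing = sum(1 for v in vectors if v[i] > 0)
        # A word that occurs in no document gets weight 0 by convention here.
        weights.append(math.log(n_docs / containing) if containing else 0.0)
    return weights

vectors = [
    [2, 1, 1, 1, 0],  # Document A over (a, and, cat, dog, frog)
    [1, 0, 0, 0, 1],  # Document B
]
idf = idf_weights(vectors)
print([round(w, 4) for w in idf])  # [0.0, 0.6931, 0.6931, 0.6931, 0.6931]
```

The word "a" occurs in both documents, so its idf is log(2/2) = 0, while every other word occurs in only one document and receives log 2.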
![Page 16: Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence Mike Thelwall Professor of Information Science University of Wolverhampton](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d785503460f94a5aeaa/html5/thumbnails/16.jpg)
tf-idf
The tf-idf weighting scheme multiplies the tf and idf factors for each word
Words are important for a document if they are frequent relative to other words in the document and rare in other documents
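Putting the two factors together for the example documents gives the sketch below. It assumes the max-frequency tf normalisation and a natural-logarithm idf; the function name `tf_idf` is chosen for this illustration.

```python
import math

def tf_idf(vectors):
    """Weight each raw frequency by tf (count / max count) times idf (log N / n_i)."""
    n_docs = len(vectors)
    n_words = len(vectors[0])
    doc_freq = [sum(1 for v in vectors if v[i] > 0) for i in range(n_words)]
    idf = [math.log(n_docs / df) if df else 0.0 for df in doc_freq]
    weighted = []
    for v in vectors:
        max_freq = max(v) or 1  # avoid dividing by zero for an empty document
        weighted.append([(count / max_freq) * w for count, w in zip(v, idf)])
    return weighted

vectors = [
    [2, 1, 1, 1, 0],  # Document A over (a, and, cat, dog, frog)
    [1, 0, 0, 0, 1],  # Document B
]
tfidf_a, tfidf_b = tf_idf(vectors)
print([round(w, 4) for w in tfidf_a])  # [0.0, 0.3466, 0.3466, 0.3466, 0.0]
print([round(w, 4) for w in tfidf_b])  # [0.0, 0.0, 0.0, 0.0, 0.6931]
```

Although "a" is the most frequent word in Document A, it appears in every document, so its weight drops to zero.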
![Page 17: Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence Mike Thelwall Professor of Information Science University of Wolverhampton](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d785503460f94a5aeaa/html5/thumbnails/17.jpg)
Part 3
Vocabulary Spectral Analysis
![Page 18: Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence Mike Thelwall Professor of Information Science University of Wolverhampton](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d785503460f94a5aeaa/html5/thumbnails/18.jpg)
Subject-clustering academic webs through text similarity 1
1. Create a collection of virtual documents consisting of all web pages sharing a common domain name in a university.
– Doc. 1 = cs.auckland.ac.uk 14,521 pgs
– Doc. 2 = www.auckland.ac.nz 3,463 pgs
– …
– Doc. 760 = www.vuw.ac.nz 4,125 pgs
![Page 19: Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence Mike Thelwall Professor of Information Science University of Wolverhampton](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d785503460f94a5aeaa/html5/thumbnails/19.jpg)
Subject-clustering academic webs through text similarity 2
2. Convert each virtual document into a tf-idf word vector
3. Identify clusters using k-means and VSM cosine measures
4. Rank words for importance in each ‘natural’ cluster
– Cluster Membership Indicator
5. Manually filter out high-ranking words in undesired clusters
– Destroys the natural clustering of the data to uncover weaker subject clustering
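Step 3 might be sketched roughly as follows. This is a simplified k-means variant that assigns each document to the centroid with the highest cosine similarity; it is an assumption about how the step could be implemented, not the presenter's actual procedure. Seeding the centroids with the first k documents is a naive choice made here to keep the toy example deterministic, and the four toy vectors are invented.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def cosine(u, v):
    """Cosine similarity of two unit-length vectors is their dot product."""
    return sum(x * y for x, y in zip(u, v))

def cosine_kmeans(vectors, k, iterations=20):
    """Cluster documents by repeatedly assigning each to its most similar centroid."""
    docs = [normalize(v) for v in vectors]
    centroids = [list(d) for d in docs[:k]]  # assumption: seed with first k docs
    labels = None
    for _step in range(iterations):
        new_labels = [
            max(range(k), key=lambda c: cosine(d, centroids[c])) for d in docs
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [d for d, label in zip(docs, labels) if label == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return labels

# Invented tf-idf-like vectors: documents 0 and 2 share words, as do 1 and 3.
toy_vectors = [
    [1.0, 0.9, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.8],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.1, 0.8, 1.0],
]
labels = cosine_kmeans(toy_vectors, k=2)
print(labels)  # [0, 1, 0, 1]
```

In practice the seeding, the number of clusters and the stopping rule all affect the result, so the clusters found this way are a starting point for the word filtering in step 5 rather than a final answer.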
![Page 20: Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence Mike Thelwall Professor of Information Science University of Wolverhampton](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d785503460f94a5aeaa/html5/thumbnails/20.jpg)
Cluster Membership Indicator
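$$\mathrm{cmi}(C, i) = \frac{\sum_{j \in C} w_{ij}}{|C|} - \frac{\sum_{j \notin C} w_{ij}}{n - |C|}$$

For a cluster C of documents and tf-idf weights w_ij (n being the total number of documents)
The next slide shows the top CMI weights for an undesired non-subject cluster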
![Page 21: Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence Mike Thelwall Professor of Information Science University of Wolverhampton](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d785503460f94a5aeaa/html5/thumbnails/21.jpg)
| Word | Frequency | Domains | CMI |
|---|---|---|---|
| massey | 32991 | 364 | 0.30587 |
| palmerston | 9023 | 305 | 0.09137 |
| and | 1883534 | 674 | 0.0794 |
| the | 3605107 | 689 | 0.0746 |
| of | 2263812 | 683 | 0.06782 |
| in | 1317941 | 655 | 0.06556 |
| north | 21348 | 414 | 0.06431 |
| students | 127178 | 550 | 0.05753 |
| research | 186161 | 546 | 0.05687 |
| a | 1254004 | 659 | 0.05616 |
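A ranking like the one in the table could be produced with a sketch along these lines. It assumes that the Cluster Membership Indicator of a word is its mean tf-idf weight over documents inside the cluster minus its mean weight over documents outside it; that reading of the formula, the function names and the toy weights are all assumptions made for illustration.

```python
def cluster_membership_indicator(weights, cluster, word):
    """Mean weight of `word` inside the cluster minus its mean weight outside.

    weights[j][i] is the tf-idf weight of word i in document j;
    cluster is a set of document indices.
    """
    inside = [weights[j][word] for j in range(len(weights)) if j in cluster]
    outside = [weights[j][word] for j in range(len(weights)) if j not in cluster]
    mean_in = sum(inside) / len(inside) if inside else 0.0
    mean_out = sum(outside) / len(outside) if outside else 0.0
    return mean_in - mean_out

def rank_words(weights, cluster, vocabulary):
    """Sort the vocabulary by decreasing indicator value for the cluster."""
    scores = {
        word: cluster_membership_indicator(weights, cluster, i)
        for i, word in enumerate(vocabulary)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Invented weights for four documents over a two-word vocabulary.
toy_vocabulary = ["massey", "research"]
toy_weights = [
    [0.8, 0.1],
    [0.6, 0.2],
    [0.0, 0.3],
    [0.1, 0.2],
]
ranking = rank_words(toy_weights, cluster={0, 1}, vocabulary=toy_vocabulary)
print(ranking[0][0])  # massey
```

Words at the top of such a ranking for an undesired cluster, such as a university name, are the candidates for manual removal from the vocabulary.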
![Page 22: Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence Mike Thelwall Professor of Information Science University of Wolverhampton](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d785503460f94a5aeaa/html5/thumbnails/22.jpg)
Eliminating low frequency words
Can test whether removing low frequency words increases or decreases subject clustering tendency
– E.g. are they spelling mistakes?
Need partially correct subject clusters
Compare similarity of documents within cluster to similarity with documents outside cluster
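The comparison described above can be sketched as the average similarity between documents in the same subject minus the average similarity between documents in different subjects, computed before and after dropping words that occur in too few documents. Cosine similarity is used here as a stand-in for the similarity measure, and the function names, subject labels and toy vectors are all invented for illustration.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norms if norms else 0.0

def filter_rare_words(vectors, min_docs):
    """Keep only the words that occur in at least `min_docs` documents."""
    keep = [
        i for i in range(len(vectors[0]))
        if sum(1 for v in vectors if v[i] > 0) >= min_docs
    ]
    return [[v[i] for i in keep] for v in vectors]

def clustering_tendency(vectors, subjects):
    """Mean same-subject similarity minus mean different-subject similarity."""
    intra, inter = [], []
    for (i, u), (j, v) in combinations(enumerate(vectors), 2):
        (intra if subjects[i] == subjects[j] else inter).append(cosine(u, v))
    mean_intra = sum(intra) / len(intra) if intra else 0.0
    mean_inter = sum(inter) / len(inter) if inter else 0.0
    return mean_intra - mean_inter

# Invented data: the last two words each occur in a single document.
toy_vectors = [
    [3, 2, 0, 0, 1, 0],
    [2, 3, 0, 0, 0, 0],
    [0, 0, 3, 2, 0, 1],
    [0, 0, 2, 3, 0, 0],
]
toy_subjects = ["physics", "physics", "law", "law"]

before = clustering_tendency(toy_vectors, toy_subjects)
after = clustering_tendency(filter_rare_words(toy_vectors, min_docs=2), toy_subjects)
print(after > before)  # True
```

In this toy case removing the single-document words raises the score slightly; whether that holds for real data is exactly what the test is meant to find out.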
![Page 23: Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence Mike Thelwall Professor of Information Science University of Wolverhampton](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d785503460f94a5aeaa/html5/thumbnails/23.jpg)
Eliminating low frequency words
[Figure: line chart of intra-subject average correlation minus inter-subject average correlation (y-axis, -0.3 to 0.5) against minimum domains containing word (x-axis, 2 to 640), with one line per subject: Law, Psychology, Architecture, Sport, Maths, Planning, Social studies, Engineering, Languages, Physics, Chemistry, Business, Education, Medicine, Env. Sci., Food, Computing, Biology, General, Arts]
![Page 24: Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence Mike Thelwall Professor of Information Science University of Wolverhampton](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d785503460f94a5aeaa/html5/thumbnails/24.jpg)
Summary
For text based academic subject web site clustering:
– Need to select vocabularies to break natural clustering and allow subject clustering
– Consider ignoring low frequency words because they do not have high clustering power
– Need to automate the manual element as far as possible
The results can then form the basis of a visualization that can give feedback to the user on inter-subject connections