Semantics & data mining & document processing
Nima Kaviani
School of Interactive Arts and Technology
Simon Fraser University - SURREY
Towards Semantic Web Mining [5]
The idea is to combine two fast-developing research areas: the Semantic Web and Web Mining.
The Semantic Web can be used to improve the results of web mining by exploiting the new semantic structures in the web.
Web mining, in turn, can help build the Semantic Web: it learns definitions of structures for knowledge organization (concepts and instances) and populates those structures.
Web Mining
Definition: the application of data mining techniques to the content, structure and usage of web resources
Web Content Mining: a form of text mining that extracts data from the content of web pages
Web Structure Mining: extracts information that resides in the structure of hypertext (the idea behind links, also used for page ranking)
Web Usage Mining: mines the record of the requests made by users in order to capture user behavior
Semantic Web
Definition: adding semantic annotations to web documents so that they can be easily understood by humans and read by machines for further inference
Ontology Learning: semi-automatic extraction of semantics from the web to create an ontology.
Mapping and Merging Ontologies: merging different ontologies to build a new domain-specific ontology (described by Davis)
Instance Learning: automatic or semi-automatic methods to extract information from web-related documents, either to help in annotating new documents or to extract additional information from existing unstructured or partially structured documents.
Creating an Ontology
An ontology is a conceptualization of a domain in a human-understandable but machine-readable format. It is a quadruple of entities, attributes, relationships and axioms [3].
Steps in creating an ontology for the data [8]:
- Determining the scope of the ontology
- Reusing existing ontologies
- Enumerating all the concepts needed
- Defining the taxonomy
- Defining the properties
- Defining facets of the concepts
- Defining instances
Annotations on the slide: normally performed by an ontology engineer; can be performed semi-automatically.
Ontology Learning
Why do we try to make the ontology learning automatic (semi-automatic)?
The source data is usually stored in unstructured, semi-structured (HTML, XML) or structured (databases) form and must be processed before it can be used to create the ontology [3].
Laborious and cumbersome task
Time consuming
Dynamic nature of available domains
Lack of tools and guidelines [1]
Semi-Automatic Ontology Learning
It aims to integrate a multitude of disciplines in order to facilitate the construction of ontologies [12]. Because some of the available information is tacit, human intervention is always required [5].
Steps in building an ontology automatically:
- Acquisition of concepts
- Establishment of concept taxonomies
- Discovery of non-taxonomic conceptual relations
- Pruning the generated ontology
Acquisition of concepts and establishing taxonomic relations
Using IR and NLP techniques, concepts can be extracted quite efficiently.
Techniques are normally a combination of the two methods below, with a tendency to rely more heavily on one of them:
- Computational linguistics
- Information retrieval
Methods used in acquiring the concepts-1
Computational linguistic approach
- Pre-processing the text to extract dependencies and single-word nouns
  - POS tagger [11] (part-of-speech tagging) [2, 4]
  - Minipar (state-of-the-art dependency parser) [1]
- Extracting multi-word noun phrases [2]
  - Shallow-parse the text
  - Select the word phrases with interesting POS-tag patterns
  - Decide for each phrase whether it is a noun phrase
- Extracting taxonomic relations [14] (regular expressions to extract taxonomies)
  - Uses regular expressions to find IS-A relations
  - Defines patterns such as: NP {, NP}* {,} or other NP
  - Example: Bruises, wounds, broken bones, or other injuries
The overall process is a combination of the tools below [12]:
- Tokenizer: regular expressions to find nouns
- Lexicon: a big repository of stems
- Lexical analyzer: mixes the results of the two tools above and extracts new concepts
- Chunk parser: works on phrases to generate syntactic dependency relations; uses the POS tagger
- Heuristics: include correlations besides linguistics-based dependency relations
  - Yield more terms, with low precision and high recall (recall is more important in learning)
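A minimal sketch of the "NP {, NP}* {,} or other NP" pattern, assuming that a noun phrase can be approximated by one to three plain words. A real system would take noun phrases from the chunk parser; the names NP, PATTERN and extract_isa are hypothetical and only serve this illustration.

```python
import re

# Assumption: a noun phrase (NP) is approximated as one to three words.
# This over-matches in running text; a chunker or POS tagger should supply real NPs.
NP = r"[A-Za-z]+(?: [A-Za-z]+){0,2}"

# Pattern from the slide: NP {, NP}* {,} or other NP
PATTERN = re.compile(rf"({NP}(?:, {NP})*),? or other ({NP})")


def extract_isa(sentence):
    """Return (hyponym, hypernym) pairs found in the sentence."""
    pairs = []
    for match in PATTERN.finditer(sentence):
        hypernym = match.group(2).lower()
        # Every NP in the comma-separated list is a hyponym of the last NP.
        for hyponym in match.group(1).split(", "):
            pairs.append((hyponym.lower(), hypernym))
    return pairs


pairs = extract_isa("Bruises, wounds, broken bones, or other injuries")
```

For the slide's example phrase, each listed item (bruises, wounds, broken bones) is paired with the hypernym "injuries".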
Methods used in acquiring the concepts-2
An information retrieval method using term weightings [12]
- Count the relevant terms and extract the more frequent ones as concepts

$$tfidf_{l,d} = lef_{l,d} \cdot \log\left(\frac{|D|}{df_l}\right)$$

$$tfidf_l = \sum_{d \in D} tfidf_{l,d}$$

- lef_{l,d}: the frequency of term l in document d
- df_l: the number of documents in the corpus D in which term l occurs
- cf_l: the total number of occurrences of term l in the corpus D
Methods to find the taxonomy
- Clustering (starts from scratch and uses distributional data about words)
- Classification (uses an available hierarchy and refines it)
- Lexico-syntactic patterns (regular expressions)
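The term weighting can be sketched as follows, assuming the usual tf-idf form: the frequency of term l in document d is multiplied by log(|D| / df_l), and the per-document weights are summed over the corpus. The function name tfidf_scores and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(corpus):
    """corpus: list of documents, each a list of terms. Returns {term: summed tf-idf}."""
    n_docs = len(corpus)
    # df: number of documents in which each term occurs.
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    scores = Counter()
    for doc in corpus:
        lef = Counter(doc)  # frequency of each term in this document
        for term, freq in lef.items():
            scores[term] += freq * math.log(n_docs / df[term])
    return dict(scores)


corpus = [
    ["heart", "disease", "heart"],
    ["heart", "surgery"],
    ["disease", "therapy", "therapy"],
]
scores = tfidf_scores(corpus)
```

Terms with the highest summed weight are the candidates for concepts. A term that occurs in every document gets weight zero, which is the intended effect of the logarithmic factor.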
Methods used in acquiring the concepts-3
Combines information extraction with ontologies in a bootstrapping process [9]
- The ontology is used to improve the quality of extraction
- The extracted information is used to improve the ontology
- The idea is to use indicative terms to find informative terms, and then to use informative terms to find new indicators
- The method tries to extract patterns that relate indicators and informative concepts
Methods used in acquiring the concepts-4
Specific purpose concept and taxonomy extraction [7]
Methodology: examine the neighborhood of the initial keywords
- The anterior word (the word immediately before a keyword) classifies it (in English)
- The posterior word (the word immediately after a keyword) represents the domain (in English)
- Example: coronary heart disease
The method sends the keyword as a query to a search engine, extracts the anterior and posterior words of the keyword from the results, and decides whether each candidate is an instance or a subclass.
Clustering is performed according to the degree of coincidence.
Synonyms are found by using the discovered constraints and omitting the initial keyword.
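The extraction of anterior words can be illustrated with a small sketch. It assumes that the search-engine results are already available as a list of text snippets; the function anterior_words and the snippets are hypothetical, and the filtering of stop words and the instance/subclass decision from [7] are not modeled.

```python
import re
from collections import Counter


def anterior_words(keyword, snippets):
    """Count the words that appear immediately before the keyword in the snippets."""
    pattern = re.compile(rf"\b([a-z]+) {re.escape(keyword)}\b", re.IGNORECASE)
    counts = Counter()
    for snippet in snippets:
        for match in pattern.finditer(snippet):
            counts[match.group(1).lower()] += 1
    return counts


# Hypothetical snippets standing in for search-engine results.
snippets = [
    "Coronary heart disease is common.",
    "Risk factors for coronary heart disease",
    "congenital heart disease in children",
]
counts = anterior_words("heart disease", snippets)
```

Frequent anterior words such as "coronary" then become candidate subclasses of the keyword, for example "coronary heart disease".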
Current Status
Results
- IR and computational linguistics can solve the problem
- Current methods try to derive concepts and form taxonomic relations using the biggest available corpus, the World Wide Web
Problems to be solved
- Current efforts mostly use hand-crafted concept hierarchies
- Synonyms for a set of available concepts can hardly be found
- The process of discovering synonyms from the synonyms already found can hardly be automated
Establishment of non-taxonomic relations between concepts
The most important and challenging task in building an ontology.
Finding concepts and taxonomic relations is simpler than constructing non-taxonomic relations between concepts.
These approaches are generally a combination of Natural Language Processing and Machine Learning
Methods proposed to establish relations-1
Clustering [13]
ASIUM: software based on an unsupervised clustering method
- Does not require any annotation of texts by hand
- Learns knowledge in the form of:
  - Subcategorization frames, e.g. <to travel> <subject: human> <by: vehicle>
    - subject is the syntactic role
    - by is the preposition
    - human and vehicle are the restrictions of selection
  - Ontologies
Methods proposed to establish relations-1
Pre-processing the text
- SYLEX provides the training text, which consists of the attachments of verbs to noun phrases and clauses
- The first step takes the training text as input and generates instantiated subcategorization frames as output, such as <verb> <subject> <object>
Methods proposed to establish relations-1
Clustering algorithm
- Factorizing similar instantiated subcategorization frames
- Clustering algorithm used in ASIUM:
  - Links represent generality relations
  - Breadth-first, bottom-up clustering
  - Two classes are aggregated
  - Distance is defined as the proportion of common head words in the two clusters, taking into account their frequencies
  - Clusters with a distance less than the threshold are aggregated
  - The threshold does not change between levels
  - Only the available clusters in the same level are taken into account
Methods proposed to establish relations-1
$$dist(C_1, C_2) = 1 - \frac{FC_1 \cdot \frac{N_{comm}}{card(C_1)} + FC_2 \cdot \frac{N_{comm}}{card(C_2)}}{\sum_{i=1}^{card(C_1)} f(word_i^{C_1}) + \sum_{i=1}^{card(C_2)} f(word_i^{C_2})}$$

- card(C_1) and card(C_2): the number of different head words in clusters C_1 and C_2
- N_{comm}: the number of different common head words between C_1 and C_2
- FC_j: the sum of the frequencies of the head words of C_j
- word_i^{C_j}: the i-th head word of cluster C_j
- f(word_i^{C_j}): its frequency
- N_{comm} / card(C_j): minimizes the influence of word frequencies
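A minimal sketch of this distance, following the definitions exactly as the slide states them; in particular it assumes that FC_j is the sum of the frequencies of all head words of C_j, which may differ in detail from [13]. Clusters are represented as dictionaries from head word to frequency, and the name asium_distance is hypothetical.

```python
def asium_distance(c1, c2):
    """c1, c2: non-empty dicts mapping head word -> frequency. Returns dist(C1, C2)."""
    n_comm = len(c1.keys() & c2.keys())  # number of common head words
    fc1 = sum(c1.values())  # assumption: FC_1 sums all head-word frequencies of C1
    fc2 = sum(c2.values())
    weighted = fc1 * n_comm / len(c1) + fc2 * n_comm / len(c2)
    return 1 - weighted / (fc1 + fc2)


# Hypothetical clusters of head words with their frequencies.
c1 = {"car": 3, "bicycle": 1}
c2 = {"car": 2, "train": 2, "motorbike": 1}
d = asium_distance(c1, c2)
```

Under this reading, identical clusters have distance 0 and clusters without common head words have distance 1, so clusters whose distance falls below the threshold are the ones that share most of their head words.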
Methods proposed to establish relations-1
This generalization turns instantiated subcategorization frames into subcategorization frames.
The cooperation of the user is required in the process of building the ontology:
- The user labels the clusters
- The user validates the new clusters
  - Rejects those words that restrict the given verbs
  - Partitions new clusters into sub-clusters that would not have been identified before
- Clusters in each level must be validated before proceeding to the next level
- The user can partition the clusters and label sub-concepts if they find the newly generated classes useless or meaningless
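Example
[Figure: instantiated subcategorization frames for the verbs drive and travel. Subjects: father, neighbor; father, mother. Objects and prepositional arguments (prepositions: by, using): car, motorbike, train, bicycle. Factorizing yields the head-word sets {car, bicycle} and {car, train, motorbike}; clustering merges them into {car, train, motorbike, bicycle}. Labels shown in the figure: "Motorized vehicle" and "Passenger".]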
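The factorizing step can be sketched as grouping instantiated frames by verb and syntactic role or preposition, and collecting the head words seen in each slot together with their frequencies. The tuple format, the function factorize and the sample frames are assumptions made for this illustration, not the format used by ASIUM.

```python
from collections import Counter, defaultdict


def factorize(instantiated_frames):
    """Group (verb, role, head_word) triples into one head-word Counter per (verb, role)."""
    factored = defaultdict(Counter)
    for verb, role, head_word in instantiated_frames:
        factored[(verb, role)][head_word] += 1
    return dict(factored)


# Hypothetical instantiated frames in the spirit of the example.
frames = [
    ("travel", "by", "car"),
    ("travel", "by", "train"),
    ("travel", "by", "car"),
    ("drive", "object", "car"),
    ("drive", "object", "motorbike"),
]
factored = factorize(frames)
```

Each resulting head-word set is a basic cluster; the bottom-up clustering then aggregates the sets that are close to each other.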
Methods proposed to establish relations-2
Generalized association rules [10]
- A set of transactions is defined
- Each transaction consists of a set of items, where each item is from a set of concepts
- Two factors are considered in estimating the relevance of two different concepts X_k and Y_k in an association rule X_k ⇒ Y_k:
  - Support: the percentage of transactions that contain both X_k and Y_k
  - Confidence: the percentage of the transactions containing X_k in which Y_k also appears
- Some changes have been applied to the basic association rule algorithm to make it suitable for finding associations at the right level of the taxonomy
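The two measures can be sketched as follows for a rule X ⇒ Y, assuming that transactions and rule sides are Python sets of concept names. This covers only the basic measures, not the taxonomy-aware generalization of [10]; the function names and the sample transactions are hypothetical.

```python
def support(transactions, x, y):
    """Fraction of all transactions that contain every item of x and of y."""
    if not transactions:
        return 0.0
    both = sum(1 for t in transactions if x <= t and y <= t)
    return both / len(transactions)


def confidence(transactions, x, y):
    """Fraction of the transactions containing x that also contain y."""
    with_x = [t for t in transactions if x <= t]
    if not with_x:
        return 0.0
    return sum(1 for t in with_x if y <= t) / len(with_x)


# Hypothetical transactions over a small set of concepts.
transactions = [
    {"hotel", "room"},
    {"hotel", "room", "price"},
    {"hotel", "price"},
    {"city"},
]
s = support(transactions, {"hotel"}, {"room"})
c = confidence(transactions, {"hotel"}, {"room"})
```

Here the rule hotel ⇒ room holds in two of the four transactions, and in two of the three transactions that mention hotel.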
Methods proposed to establish relations-3
Fuzzy Formal Concept Analysis (FFCA) [3]
- FCA is based on lattice theory and is used for conceptual knowledge discovery
- The hierarchical relationship of concepts is organized as a lattice rather than a tree
- The method uses a citation database to generate concepts
- Steps in generating an ontology using this method:
  - FFCA
  - Concept clustering
  - Ontology generation
Current Status
Results
- The proposed methods have reduced the amount of effort required from a human engineer
Problems to be solved
- They all consider a single-layer generalization; in many cases, however, a multi-layer generalization would result in a better hierarchy
- Humans still play a key role in designing the ontology, and the quality of the design depends on their work
Pruning the generated hierarchy
The generated ontology contains concepts that are not interesting and should be removed.
Methods used to remove uninteresting nodes:
- A rule-based method with the following conditions [6]:
  - Nodes without a domain node are removed
  - Intermediate nodes with the following properties are removed:
    - Nodes without siblings
    - Nodes that are not the root of any concept
    - Conditions which are held in the ontology
- IR techniques [12]:
  - Consider term frequencies: compare the frequency of the current term in the domain with its frequency in a generic corpus, and remove the term if its frequency in the domain is lower than its frequency in the generic corpus
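The frequency-based pruning can be sketched as follows. It assumes that relative frequencies (occurrences divided by corpus size) are compared, which the slide does not specify; the function prune_terms and the toy corpora are hypothetical.

```python
from collections import Counter


def prune_terms(domain_tokens, generic_tokens):
    """Keep the domain terms that are at least as frequent in the domain as in the generic corpus."""
    domain = Counter(domain_tokens)
    generic = Counter(generic_tokens)
    n_domain = len(domain_tokens)
    n_generic = len(generic_tokens)
    kept = []
    for term, count in domain.items():
        domain_rel = count / n_domain
        generic_rel = generic[term] / n_generic if n_generic else 0.0
        # Assumption: relative frequencies are compared, not raw counts.
        if domain_rel >= generic_rel:
            kept.append(term)
    return sorted(kept)


domain_tokens = ["hotel", "hotel", "room", "the", "the"]
generic_tokens = ["the"] * 6 + ["room", "city", "car", "tree"]
kept = prune_terms(domain_tokens, generic_tokens)
```

In this toy example "the" is dropped because it is relatively more frequent in the generic corpus, while "hotel" and "room" are kept as domain terms.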
Conclusion
Progress can be seen in building ontologies that use web pages, rather than static texts, as their instances.
There is no clear, well-defined way to evaluate automatically built ontologies; instead, they are compared with hand-crafted ones.
This hampers the comparison of two semi-automatically built ontologies.
References
1. Sabou, M., Wroe, C., Goble, C., and Mishne, G. Learning domain ontologies for Web service descriptions: an experiment in bioinformatics. In Proceedings of the 14th international Conference on World Wide Web (Chiba, Japan, May 10 - 14, 2005). WWW '05. ACM Press, New York, NY, 2005.
2. van Hage, W. R., de Rijke, M., and Marx, M. Information Retrieval Support for Ontology Construction and Use. In Proceedings of the 3rd International Semantic Web Conference, Jan 2004, Pages 518-533, LNCS, Springer 2004.
3. Quan, T. T., Hui, S. C., Fong, A. C. M., and Cao, T. H. Automatic Generation of Ontology for Scholarly Semantic Web. In Proceedings of the 3rd International Semantic Web Conference, Jan 2004, Pages 726-740, LNCS, Springer 2004.
4. Sabou, M., Wroe, C., Goble, C., and Mishne, G. Learning domain ontologies for Web service descriptions: an experiment in bioinformatics. In Proceedings of the 14th international Conference on World Wide Web (Chiba, Japan, May 10 - 14, 2005). WWW '05. ACM Press, New York, NY, 2005.
5. Berendt, B., Hotho, A., and Stumme, G. Towards semantic web mining. In I. Horrocks and J. Hendler (Eds.), The Semantic Web - ISWC 2002. In Proceedings of the 1st International Semantic Web Conference, June 9-12th, 2002, Sardinia, Italy, pages 264--278. LNCS, Heidelberg, Germany: Springer, 2002.
6. Navigli, R. and Velardi, P.: Learning Domain Ontologies from Document Warehouses and Dedicated Web Sites. In Computational Linguistics, Volume 30, Issue 2. June 2004.
7. Sánchez, D. and Moreno, A. Web Mining Techniques for Automatic Discovery of Medical Knowledge. In Proceedings of Artificial Intelligence in Medicine, 10th Conference on Artificial Intelligence in Medicine, AIME 2005, Aberdeen, UK, July 23-27, 2005.
8. Noy, N. F., and McGuinness, D. L. Ontology Development 101: A Guide to Creating Your First Ontology. Knowledge Systems Laboratory, March, 2001.
9. Kavalec, M., Svatek, V. Information Extraction and Ontology Learning Guided by Web Directory. In ECAI Workshop on NLP and ML for ontology engineering, Lyon 2002.
10. Maedche, A. and Staab, S. 2000. Mining Ontologies from Text. In Proceedings of the 12th European Workshop on Knowledge Acquisition, Modeling and Management. Pages 189-202, LNCS, vol. 1937. Springer, London, 2000.
11. Schmid, H. Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, pages 44--49, Manchester, UK, 1994.
12. Maedche, A. and Staab, S. 2001. Ontology Learning for the Semantic Web. IEEE Intelligent Systems 16, 2, Mar. 2001.
13. Faure, D. and Nédellec, C. ASIUM: Learning subcategorization frames and restrictions of selection. In the 10th Conference on Machine Learning (ECML 98) -- Workshop on Text Mining, Chemnitz, Germany, April 1998.
14. Hearst, M. Automatic Acquisition of Hyponyms from Large Text Corpora. In Proceedings of the 14th International Conference on Computational Linguistics, Nantes, France, 1992.