Duane Searsmith
Automated Learning Group
National Center for Supercomputing Applications
University of [email protected]: (217) 244-9129
http://alg.ncsa.uiuc.edu
Michael Welge, Director, [email protected]
Loretta Auvil, Project Manager, [email protected], (217) 265-8021
July 9, 2004
Text Mining with D2K/T2K
alg | Automated Learning Group
Outline
• Text Mining Brief Intro
  • Unsupervised
  • Supervised
  • Information Extraction
  • …
• ALG Technology Pieces
• Demonstrations
• Discussion
What is text mining?
• In simplified, practical terms, it is the extraction of a relatively small amount of information of interest from a large mass of text data.
But …
• You might not know what you’re looking for.
  • Discovering patterns in the haystack. (clustering, mining associations)
• How to recognize a needle.
  • Sifting through the haystack. (model building, supervised learning)
• Just the facts please.
  • Enumerating the make and model of every needle. (information extraction)
Common Tasks for Text Mining & Analysis
• Information retrieval
• Automatic grouping (clustering) of documents
• (Active) Classification
• Information extraction
• Topic detection and tracking
• Automatic summarization
• “Understanding” text and question answering
• Machine Translation
Text Preprocessing
• Preprocessing (Text -> Numeric Representation)
  • Tokenization
  • Sentence Splitting
  • Part-of-Speech Tagging
  • Term Normalization (Stemming)
  • Filtering (Stops)
  • Chunking
  • Term Extraction
  • Filtering (Again)
  • Term Weighting
  • Other Transformations
• Resource Taxing
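A few of the preprocessing steps above (tokenization, stop filtering, term normalization, term weighting) can be sketched in a handful of lines. This is a minimal illustration, not T2K code: the stop list, the `crude_stem` suffix stripper (a stand-in for a real stemmer), the sample documents, and the tf * log(N / df) weighting are all assumptions chosen for brevity.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop list (assumption); a real one would be much larger.
STOP_WORDS = {"a", "an", "and", "are", "the", "of", "in", "is", "to", "on"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(token):
    """Strip a few common suffixes. Hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stop words, then normalize the remaining terms."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    term_lists = [preprocess(d) for d in docs]
    n_docs = len(term_lists)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for terms in term_lists for term in set(terms))
    weighted = []
    for terms in term_lists:
        counts = Counter(terms)
        weighted.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return weighted

docs = [
    "The miners are mining text in the archive",
    "Clustering groups the documents",
    "Text clustering and text mining",
]
vectors = tfidf(docs)
```

Each document ends up as a sparse numeric vector, which is the "Text -> Numeric Representation" the slide refers to. Even this toy version makes several passes over every document, which hints at why full preprocessing is resource taxing.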
Clustering: Document Self-Organization
• Agglomerative (bottom up)
  • Quadratic time complexity
  • Sampling
    • Random
    • Partition
• Hard vs. Soft
• Unsupervised method
• Common to all of these approaches is some heuristic for measuring similarity between documents and document groups (e.g., term co-occurrence)
[Figure: strongly similar arcs are kept; weakly similar arcs are broken]
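The "keep strong arcs, break weak arcs" idea can be sketched as a threshold graph over pairwise document similarity, where each connected component becomes one hard cluster. This is an illustrative sketch only: cosine similarity over raw term counts, the 0.3 threshold, and the sample documents are assumptions, not the method used in T2K.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def threshold_clusters(docs, threshold=0.3):
    """Group documents connected by arcs whose similarity meets the threshold."""
    vectors = [Counter(d.lower().split()) for d in docs]
    parent = list(range(len(docs)))  # union-find forest, one tree per cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Comparing every pair of documents is the quadratic step.
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)  # strong arc: keep it
            # Weak arcs are never added, which is the same as breaking them.

    clusters = {}
    for i in range(len(docs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

docs = [
    "text mining finds patterns in text",
    "mining patterns from text collections",
    "jini services look up distributed resources",
    "distributed resources and jini lookup services",
]
print(threshold_clusters(docs))
```

The all-pairs loop is where the quadratic cost comes from, which is why sampling matters for large collections.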
How to Recognize a Needle
• To classify your data you often need to build a model.
• To build a model you typically need examples from a “teacher” – metaphorically speaking.
• Finding good examples can be hard.
• T2K can also use active learning to help find good examples faster, which makes model building easier.
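One common form of active learning is uncertainty sampling: ask the "teacher" to label the example the current model is least sure about. The sketch below illustrates that idea with a toy term-overlap classifier; the slides do not say which strategy T2K uses, so the scoring function, the margin criterion, and the sample data are all assumptions.

```python
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.lower().split())

def score(vec, profile):
    """Overlap between a document and a class profile (summed term counts)."""
    return sum(count * profile[term] for term, count in vec.items())

def most_uncertain(labeled, unlabeled):
    """Index of the unlabeled document with the smallest top-two score margin.

    Assumes the labeled set already contains at least two classes.
    """
    profiles = {}
    for text, label in labeled:
        profiles.setdefault(label, Counter()).update(vectorize(text))

    def margin(text):
        vec = vectorize(text)
        scores = sorted((score(vec, p) for p in profiles.values()), reverse=True)
        return scores[0] - scores[1]

    return min(range(len(unlabeled)), key=lambda i: margin(unlabeled[i]))

labeled = [
    ("stock market prices fell", "finance"),
    ("team wins the final match", "sports"),
]
unlabeled = [
    "market prices rise",
    "the match ended",
    "team owners sell stock",
]
query = most_uncertain(labeled, unlabeled)
print(unlabeled[query])
```

Here the third document matches both classes equally, so it is the one handed to the teacher next. Labeling it tells the model more than labeling a document it already classifies with a wide margin.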
Pattern Mining
• Finding frequent item sets -> Rule Discovery
• Many methods: Apriori, CHARM, FP-growth, CLOSET
• Working with Jiawei Han and students -- Hwanjo Yu and Xiaolei Li
• Application: topic tree construction
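As a rough illustration of frequent item set mining leading to rule discovery, here is a minimal Apriori sketch that treats each document as a set of terms. It is a textbook-style version, not the ALG implementation; the sample term sets and the minimum support of 2 are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    a, c = frozenset(antecedent), frozenset(consequent)
    return frequent[a | c] / frequent[a]

docs = [
    {"text", "mining", "cluster"},
    {"text", "mining"},
    {"text", "cluster"},
    {"mining", "rules"},
]
frequent = apriori(docs, min_support=2)
rule_conf = confidence(frequent, {"text"}, {"mining"})
```

With these four documents, `{"text", "mining"}` and `{"text", "cluster"}` are frequent pairs, and the rule text -> mining holds in two of the three documents that contain "text".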
Just the Facts Please
• Finding a document that has the information you need is often not the end goal.
• To extract information you must first recognize it – you need to build a model, and that means you need to have examples.
• Levels of IE: What’s hard and what’s harder?
D2K
D2K Features
• Extension of existing API
  • Provides the capability to programmatically connect modules and set properties.
  • Allows D2K-driven applications to be developed.
  • Provides ability to pause and restart an itinerary.
• Enhanced Distributed Computing
  • Allows modules that are re-entrant to be executed remotely.
  • Uses Jini services to look up distributed resources.
  • Includes interface for specifying the runtime layout of a distributed itinerary.
• Processor Status Overlay
  • Shows utilization of distributed computing resources.
• Distributed Checkpointing
• Resource Manager
  • Provides a mechanism for treating selected data structures as if they were stored in global memory.
  • Provides memory space that is accessible from multiple modules running locally as well as remotely.
• Batch Processing / Web Services
D2K Overview
D2K/T2K/I2K - Data, Text, and Image Analysis
Information Visualization
The Technology Pieces
• The Engine (distributed, parallelized, persistent)
• Core Modules (building blocks)
  • T2K is a specialized set of modules for text analysis
  • I2K is a specialized set of modules for image analysis
• D2K Toolkit (rapid development environment)
• ThemeWeaver is an independent application that uses the D2K engine to run algorithms constructed from T2K modules. It is a demonstration platform.
• Other D2K-driven applications (StreamLined, EMO, …)
[Diagram labels: D2K Engine, Core Modules, T2K, I2K, Toolkit, Applications]
T2K Core
• Tokenization
• POS Tagging
• Stemming
• Chunking
• Filters
• Term Weighting
• Supervised / Unsupervised Learning
• GATE Integration
• Pattern Mining
• Text Streams
• Summarization
T2K Core 1.0 (Beta)
ThemeWeaver
ThemeWeaver: Prototype Text Clustering Application
• Hard clustering algorithms
  • Modified k-means (3 sampling methods)
• Soft clustering
  • Suffix-tree-based algorithm
  • Can be used for longer documents
• Visualizations
  • “Single link” graph representation
  • Dendrogram cluster tree
  • Clusters over time
• Drill-down and backtrack UI
• D2K/T2K driven
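The slides do not describe how ThemeWeaver modifies k-means or what its three sampling methods are. As one plausible reading, the sketch below fits plain k-means on a random sample of the documents and then assigns every document to its nearest centroid, so the iterative step runs on the sample only. The function names, the dense two-dimensional points, and the parameters are all hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two dense vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))

def kmeans(points, k, iterations=20, rng=None):
    """Plain Lloyd's k-means; returns the final centroids."""
    rng = rng or random.Random(0)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its group;
        # keep the old centroid if the group is empty.
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

def sampled_kmeans(points, k, sample_size, seed=0):
    """Fit centroids on a random sample, then label every point (hard clustering)."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    centroids = kmeans(sample, k, rng=rng)
    return [nearest(p, centroids) for p in points]

points = [
    [0, 0], [0, 1], [1, 0], [1, 1],
    [10, 10], [10, 11], [11, 10], [11, 11],
]
labels = sampled_kmeans(points, k=2, sample_size=6)
```

Every document receives exactly one label, which is what makes this a hard clustering. A soft method, such as the suffix tree approach on the slide, may place a document in more than one cluster.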
The ALG Team
Staff: Loretta Auvil, Peter Bajcsy, Colleen Bushell, Dora Cai, David Clutter, Lisa Gatzke, Vered Goren, Chris Navarro, Greg Pape, Tom Redman, Duane Searsmith, Andrew Shirk, Anca Suvaiala, David Tcheng, Michael Welge
Students: Tyler Alumbaugh, Bradley Berkin, Jacob Biehl, John Cassel, Peter Groves, Olubanji Iyun, Sang-Chul Lee, Young-Jin Lee, Xiaolei Li, Brian Navarro, Scott Ramon, Sunayana Saha, Martin Urban, Bei Yu, Hwanjo Yu
* Demo / Discussion *