Text Mining with D2K/T2K

Post on 24-Feb-2016

Duane Searsmith
Automated Learning Group
National Center for Supercomputing Applications
University of Illinois
dsears@ncsa.uiuc.edu
Office: (217) 244-9129
http://alg.ncsa.uiuc.edu

Michael Welge, Director, welge@ncsa.uiuc.edu
Loretta Auvil, Project Manager, lauvil@ncsa.uiuc.edu, (217) 265-8021

July 9, 2004

Text Mining with D2K/T2K

alg | Automated Learning Group

Outline

• Text Mining Brief Intro
  • Unsupervised
  • Supervised
  • Information Extraction
  • …

• ALG Technology Pieces

• Demonstrations

• Discussion

What is text mining?

• In simplified, practical terms, it is the extraction of a relatively small amount of information of interest from a large volume of text data.

But …

• You might not know what you’re looking for.
  • Discovering patterns in the haystack. (clustering, mining associations)

• How to recognize a needle.
  • Sifting through the haystack. (model building, supervised learning)

• Just the facts please.
  • Enumerating the make and model of every needle. (information extraction)

Common Tasks for Text Mining & Analysis

• Information retrieval

• Automatic grouping (clustering) of documents

• (Active) Classification

• Information extraction

• Topic detection and tracking

• Automatic summarization

• “Understanding” text and question answering

• Machine Translation

Text Preprocessing

• Preprocessing (Text -> Numeric Representation)
  • Tokenization
  • Sentence Splitting
  • Part-of-Speech Tagging
  • Term Normalization (Stemming)
  • Filtering (Stops)
  • Chunking
  • Term Extraction
  • Filtering (Again)
  • Term Weighting
  • Other Transformations

• Resource Taxing
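
The text-to-numeric pipeline above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not T2K code: the function names, the tiny stop list, the crude suffix-stripping "stemmer" (a stand-in for a real one such as Porter's), and the tf * log(N / df) weighting are all choices made for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    term_lists = [preprocess(d) for d in docs]
    doc_freq = Counter(term for terms in term_lists for term in set(terms))
    n_docs = len(docs)
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        vectors.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors

docs = [
    "Mining the text of documents",
    "Clustering documents in the haystack",
    "Finding needles in the haystack",
]
for vector in tfidf(docs):
    print(vector)
```

Even this toy version passes over every document several times, which hints at why the slide calls preprocessing resource taxing.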

Clustering: Document Self-Organization

• Agglomerative (bottom up)
• Quadratic time complexity
• Sampling
  • Random
  • Partition

• Hard vs. Soft

• Unsupervised method

• The basic notion behind all of these approaches is some heuristic for measuring similarity between documents and document groups (term co-occurrence)

[Figure: strongly similar arcs are kept; weakly similar arcs are broken]
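
As a rough illustration of the "keep strong arcs, break weak arcs" idea, the sketch below links documents whose cosine similarity over raw term counts meets a threshold and returns the connected groups (a single-link, hard clustering). The similarity measure, the threshold value, and all names are assumptions for the example; the slides do not specify how T2K measures similarity.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def single_link_clusters(docs, threshold=0.3):
    """Group documents connected by arcs whose similarity meets the threshold."""
    vectors = [Counter(d.lower().split()) for d in docs]
    parent = list(range(len(docs)))  # union-find forest, one tree per cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Every pair is compared, hence the quadratic cost noted on the slide.
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)  # strong arc: keep it (merge groups)
            # weak arc: do nothing, which leaves it "broken"

    clusters = {}
    for i in range(len(docs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

docs = [
    "text mining finds patterns in text",
    "mining patterns from text data",
    "image analysis of satellite data",
    "satellite image segmentation",
]
print(single_link_clusters(docs))
```

Lowering the threshold keeps more arcs and merges groups; raising it breaks more arcs and fragments them. Sampling, as listed on the slide, is one way to avoid comparing every pair.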

How to Recognize a Needle

• To classify your data you often need to build a model.

• To build a model you typically need examples from a “teacher” – metaphorically speaking.

• Finding good examples can be hard.

• T2K can also use active learning to help find good examples faster, making model building easier.
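
One common form of active learning is uncertainty sampling: the current model picks the unlabeled document it is least sure about, and the "teacher" labels only that one. The sketch below assumes a deliberately simple model (term counts pooled per class) and hypothetical names; it shows the selection step only and is not a description of how T2K implements active learning.

```python
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.lower().split())

def overlap(vec, class_counts):
    """Sum of the class's term counts over the terms present in the document."""
    return sum(class_counts[t] for t in vec)

def most_uncertain(labeled, unlabeled):
    """Return the index of the unlabeled document with the smallest class margin.

    labeled: list of (text, label) pairs with label in {"pos", "neg"}.
    """
    class_counts = {"pos": Counter(), "neg": Counter()}
    for text, label in labeled:
        class_counts[label].update(vectorize(text))

    def margin(text):
        vec = vectorize(text)
        return abs(overlap(vec, class_counts["pos"]) - overlap(vec, class_counts["neg"]))

    return min(range(len(unlabeled)), key=lambda i: margin(unlabeled[i]))

labeled = [("stock market rises", "pos"), ("football match tonight", "neg")]
unlabeled = [
    "market closes higher",
    "football market merger",
    "match tonight postponed",
]
query = most_uncertain(labeled, unlabeled)
print("Ask the teacher to label:", unlabeled[query])
```

In a full loop, the newly labeled document would be appended to `labeled` and the selection repeated, so each request for a label goes to the example the model finds hardest.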

Pattern Mining

• Finding frequent item sets -> Rule Discovery

• Many methods: Apriori, CHARM, FP-Growth, CLOSET

• Working with Jiawei Han and students -- Hwanjo Yu and Xiaolei Li

• Application: topic tree construction
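
To make frequent item set mining concrete, here is a minimal Apriori sketch that treats each document as a set of terms. It is a textbook-style illustration with assumed names and toy data, not the implementation used in D2K/T2K.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset (frozenset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = {frozenset([item]) for item in items}  # 1-item candidates
    frequent = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

documents = [
    {"text", "mining", "cluster"},
    {"text", "mining"},
    {"text", "cluster"},
    {"image", "mining"},
]
result = apriori(documents, min_support=2)
for itemset, support in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

Rules follow from the support counts: for example, the confidence of "text -> mining" is the support of {text, mining} divided by the support of {text}.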

Just the Facts Please

• Finding a document that has the information you need is often not the end goal.

• To extract information you must first recognize it – you need to build a model, and that means you need to have examples.

• Levels of IE: What’s hard and what’s harder?

D2K

D2K Features

• Extension of existing API
  • Provides the capability to programmatically connect modules and set properties.
  • Allows D2K-driven applications to be developed.
  • Provides ability to pause and restart an itinerary.

• Enhanced Distributed Computing
  • Allows modules that are re-entrant to be executed remotely.
  • Uses Jini services to look up distributed resources.
  • Includes interface for specifying the runtime layout of a distributed itinerary.

• Processor Status Overlay
  • Shows utilization of distributed computing resources.

• Distributed Checkpointing

• Resource Manager
  • Provides a mechanism for treating selected data structures as if they were stored in global memory.
  • Provides memory space that is accessible from multiple modules running locally as well as remotely.

• Batch Processing / Web Services

D2K Overview

D2K/T2K/I2K - Data, Text, and Image Analysis

Information Visualization

The Technology Pieces

• The Engine (distributed, parallelized, persistent)
• Core Modules (building blocks)
• T2K is a specialized set of modules for text analysis
• I2K is a specialized set of modules for image analysis
• D2K Toolkit (rapid development environment)
• ThemeWeaver is an independent application that uses the D2K engine to run algorithms constructed from T2K modules. It is a demonstration platform.
• Other D2K driven applications (StreamLined, EMO, …)

[Figure labels: D2K Engine, Core Modules, T2K, I2K, Toolkit, Applications]

T2K Core

• Tokenization
• POS Tagging
• Stemming
• Chunking
• Filters
• Term Weighting
• Supervised / Unsupervised Learning
• GATE Integration
• Pattern Mining
• Text Streams
• Summarization

T2K Core 1.0 (Beta)

ThemeWeaver

ThemeWeaver: Prototype Text Clustering Application

• Hard clustering algorithms
  • Modified K-means (3 sampling methods)

• Soft clustering
  • Suffix tree based algorithm
  • Can be used for longer documents

• Visualizations
  • “Single link” graph representation
  • Dendrogram cluster tree
  • Clusters over time

• Drill down and backtrack UI

• D2K/T2K Driven
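
For readers unfamiliar with hard clustering, the sketch below is plain Lloyd's k-means with centroids seeded from a random sample of the points. The slides do not describe ThemeWeaver's modifications or its three sampling methods, so this should be read only as the standard baseline; all names and the toy data are assumptions.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on equal-length numeric tuples.

    Returns one cluster index per point (each point belongs to exactly one cluster).
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # seed centroids from a random sample
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment

# Toy 2-D points standing in for document vectors.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = kmeans(points, k=2)
print(labels)
```

A soft clustering method would instead allow a document to belong to more than one cluster, which is the distinction the slide draws between the two families.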

The ALG Team

Staff: Loretta Auvil, Peter Bajcsy, Colleen Bushell, Dora Cai, David Clutter, Lisa Gatzke, Vered Goren, Chris Navarro, Greg Pape, Tom Redman, Duane Searsmith, Andrew Shirk, Anca Suvaiala, David Tcheng, Michael Welge

Students: Tyler Alumbaugh, Bradley Berkin, Jacob Biehl, John Cassel, Peter Groves, Olubanji Iyun, Sang-Chul Lee, Young-Jin Lee, Xiaolei Li, Brian Navarro, Scott Ramon, Sunayana Saha, Martin Urban, Bei Yu, Hwanjo Yu

* Demo / Discussion *
