

Page 1: Duane Searsmith Automated Learning Group National Center for Supercomputing Applications University of Illinois dsears@ncsa.uiuc.edu Office: (217) 244-9129

Duane Searsmith

Automated Learning Group
National Center for Supercomputing Applications
University of Illinois
dsears@ncsa.uiuc.edu
Office: (217) 244-9129
http://alg.ncsa.uiuc.edu

Michael Welge, Director, [email protected]
Loretta Auvil, Project Manager, lauvil@ncsa.uiuc.edu, (217) 265-8021

July 9, 2004

Text Mining with D2K/T2K

Page 2

alg | Automated Learning Group

Outline

• Text Mining Brief Intro
  • Unsupervised
  • Supervised
  • Information Extraction
  • …

• ALG Technology Pieces

• Demonstrations

• Discussion

Page 3

What is text mining?

• In simplified, practical terms, text mining is the extraction of a relatively small amount of information of interest from a large mass of text data.

But …

• You might not know what you’re looking for.
  • Discovering patterns in the haystack. (clustering, mining associations)
• How to recognize a needle.
  • Sifting through the haystack. (model building, supervised learning)
• Just the facts please.
  • Enumerating the make and model of every needle. (information extraction)

Page 4

Common Tasks for Text Mining & Analysis

• Information retrieval

• Automatic grouping (clustering) of documents

• (Active) Classification

• Information extraction

• Topic detection and tracking

• Automatic summarization

• “Understanding” text and question answering

• Machine Translation

Page 5

Text Preprocessing

• Preprocessing (Text -> Numeric Representation)
  • Tokenization
  • Sentence Splitting
  • Part-of-Speech Tagging
  • Term Normalization (Stemming)
  • Filtering (Stops)
  • Chunking
  • Term Extraction
  • Filtering (Again)
  • Term Weighting
  • Other Transformations

• Resource Taxing
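
A few of the preprocessing steps listed above can be sketched in plain Python. This is only an illustration of the text-to-numeric idea, not T2K's implementation: the stop list, the suffix-stripping `crude_stem` helper, and the tf * log(N / df) weighting are all assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stop list, not the one any real toolkit ships with.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then normalize the remaining terms."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


def tfidf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    term_lists = [preprocess(doc) for doc in documents]
    doc_freq = Counter(term for terms in term_lists for term in set(terms))
    n_docs = len(documents)
    weighted = []
    for terms in term_lists:
        counts = Counter(terms)
        weighted.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return weighted
```

Calling `tfidf(["mining text", "mining data"])` gives the shared term a weight of zero and the distinguishing terms a positive weight, which is the point of term weighting. Sentence splitting, part-of-speech tagging, and chunking are left out because they need trained models or larger rule sets.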

Page 6

Clustering: Document Self-Organization

• Agglomerative (bottom up)
• Quadratic time complexity
• Sampling
  • Random
  • Partition
• Hard vs. Soft
• Unsupervised method
• The basic notion behind all of these approaches is some heuristic for measuring similarity between documents and document groups (term co-occurrence).

[Figure: strongly similar arcs kept, weakly similar arcs broken]
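
The "keep strong arcs, break weak arcs" idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the ALG algorithm: similarity is cosine similarity over raw term counts, the `threshold` value is arbitrary, and the function name `cluster_by_arcs` is made up for this example. Documents that remain connected by kept arcs form one hard cluster, which is equivalent to cutting a single-link hierarchy at that similarity threshold.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_by_arcs(documents, threshold=0.3):
    """Keep arcs with similarity >= threshold; connected documents form a cluster."""
    vectors = [Counter(doc.lower().split()) for doc in documents]
    parent = list(range(len(documents)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Comparing every pair of documents is what makes this quadratic.
    for i, j in combinations(range(len(documents)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(documents)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

The pairwise loop is where the quadratic cost comes from, which is why sampling the collection before clustering is attractive for large document sets.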

Page 7

How to Recognize a Needle

• To classify your data you often need to build a model.

• To build a model you typically need examples from a “teacher” – metaphorically speaking.

• Finding good examples can be hard.

• T2K can also use active learning to help find good examples faster, making model building easier.
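
One common form of active learning is uncertainty sampling: ask the "teacher" to label the example the current model is least sure about. The sketch below assumes a deliberately simple model (one term-count centroid per class, compared by cosine similarity); the slides do not say which strategy or classifier T2K uses, and the function name `next_to_label` is hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def centroid(texts):
    """Sum the term counts of all labeled examples in one class."""
    total = Counter()
    for text in texts:
        total.update(vectorize(text))
    return total


def next_to_label(positive, negative, unlabeled):
    """Return the unlabeled document the current model is least sure about."""
    pos_c, neg_c = centroid(positive), centroid(negative)

    def margin(text):
        # A small margin means the document looks about equally like both classes.
        vec = vectorize(text)
        return abs(cosine(vec, pos_c) - cosine(vec, neg_c))

    return min(unlabeled, key=margin)
```

In a full loop, the returned document would be shown to a person, added to `positive` or `negative`, and the selection repeated until the model is good enough.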

Page 8

Pattern Mining

• Finding frequent item sets -> Rule Discovery

• Many methods: Apriori, CHARM, FP-growth, CLOSET

• Working with Jiawei Han and students -- Hwanjo Yu and Xiaolei Li

• Application: topic tree construction
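
As an illustration of frequent item set mining, here is a minimal sketch of the Apriori idea, the simplest of the methods named above. It assumes each document has already been reduced to a set of terms (a "transaction") and that `min_support` is an absolute document count; it makes no claim about how the ALG implementations work.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}
    frequent = {}
    size = 1
    while candidates:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent itemsets into candidates one item larger, then prune
        # any candidate that has an infrequent subset (the Apriori property).
        size += 1
        joined = {a | b for a in level for b in level if len(a | b) == size}
        candidates = {
            c
            for c in joined
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
    return frequent
```

Rules such as "documents containing text tend to contain mining" can then be read off by comparing the count of an itemset with the counts of its subsets.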

Page 9

Just the Facts Please

• Finding a document that has the information you need is often not the end goal.

• To extract information you must first recognize it – you need to build a model, and that means you need to have examples.

• Levels of IE: What’s hard and what’s harder?

Page 10

D2K

Page 11

D2K Overview

D2K Features

• Extension of existing API
  • Provides the capability to programmatically connect modules and set properties.
  • Allows D2K-driven applications to be developed.
  • Provides ability to pause and restart an itinerary.
• Enhanced Distributed Computing
  • Allows modules that are re-entrant to be executed remotely.
  • Uses Jini services to look up distributed resources.
  • Includes interface for specifying the runtime layout of a distributed itinerary.
• Processor Status Overlay
  • Shows utilization of distributed computing resources.
• Distributed Checkpointing
• Resource Manager
  • Provides a mechanism for treating selected data structures as if they were stored in global memory.
  • Provides memory space that is accessible from multiple modules running locally as well as remotely.
• Batch Processing / Web Services

Page 12

D2K/T2K/I2K - Data, Text, and Image Analysis

Information Visualization

Page 13

The Technology Pieces

• The Engine (distributed, parallelized, persistent)
• Core Modules (building blocks)
• T2K is a specialized set of modules for text analysis.
• I2K is a specialized set of modules for image analysis.
• D2K Toolkit (rapid development environment)
• ThemeWeaver is an independent application that uses the D2K engine to run algorithms constructed from T2K modules. It is a demonstration platform.
• Other D2K-driven applications (StreamLined, EMO, …)

[Figure: D2K Engine, Core Modules, T2K, I2K, Toolkit, Applications]

Page 14

T2K Core

• Tokenization
• POS Tagging
• Stemming
• Chunking
• Filters
• Term Weighting
• Supervised / Unsupervised Learning

• GATE Integration

• Pattern Mining
• Text Streams
• Summarization

T2K Core 1.0 (Beta)

Page 15

ThemeWeaver

Page 16

ThemeWeaver: Prototype Text Clustering Application

• Hard clustering algorithms
  • Modified k-means (3 sampling methods)
• Soft clustering
  • Suffix-tree-based algorithm
  • Can be used for longer documents
• Visualizations
  • “Single link” graph representation
  • Dendrogram cluster tree
  • Clusters over time

• Drill down and backtrack UI

• D2K/T2K Driven
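
For readers unfamiliar with hard clustering, the sketch below shows plain k-means with centroids seeded from a random sample of the data. It is not ThemeWeaver's modified k-means, whose changes and three sampling methods the slides do not describe; it assumes documents have already been converted to equal-length numeric vectors, and the `iterations` and `seed` parameters are illustrative.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on equal-length numeric vectors; returns a cluster index per point."""
    rng = random.Random(seed)
    # Seed the centroids from a random sample of the points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins exactly one nearest centroid (hard clustering).
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment
```

Each point ends up in exactly one cluster, which is what distinguishes this from the soft clustering approach, where a document may belong to several clusters.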

Page 17

The ALG Team

Staff
Loretta Auvil, Peter Bajcsy, Colleen Bushell, Dora Cai, David Clutter, Lisa Gatzke, Vered Goren, Chris Navarro, Greg Pape, Tom Redman, Duane Searsmith, Andrew Shirk, Anca Suvaiala, David Tcheng, Michael Welge

Students
Tyler Alumbaugh, Bradley Berkin, Jacob Biehl, John Cassel, Peter Groves, Olubanji Iyun, Sang-Chul Lee, Young-Jin Lee, Xiaolei Li, Brian Navarro, Scott Ramon, Sunayana Saha, Martin Urban, Bei Yu, Hwanjo Yu

Page 18

* Demo / Discussion *