cni, 3rd april 2006 slide 1 uk national centre for text mining: activities and plans dr. robert...

CNI, 3rd April 2006 Slide 1

UK National Centre for Text Mining:

Activities and Plans

Dr. Robert Sanderson, Dept. of Computer Science, University of Liverpool

[email protected]

http://www.nactem.ac.uk


Overview

Text Mining?

NaCTeM

Consortium Components

Service Infrastructure

Future Work


Centre for ...

National Centre for ... what was that?

Ticks Mining! TEXT


... Text Mining?

Text Mining: No canonical definition

Commonly used definition, adapted from the one for Data Mining:

Data Mining: “The non-trivial extraction of implicit, previously unknown, and potentially useful information from data.”

Text Mining: “The non-trivial extraction of previously unknown, interesting facts from an invariably large collection of texts.”



Typical Data Mining Functions:

Classification

Association Rule Mining

Clustering

Useful when applied to texts, but they don't fulfil the definition, as they don't discover “facts”.

Information Retrieval also doesn't discover facts.



Need to understand the meaning of the text:

Part of Speech tagging

Clauses

Named Entity Recognition

Find correlations of entities

Infer information from logical chains

Result: New Knowledge
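The last two steps above (correlating entities, then inferring along a chain) can be sketched in a few lines. This is only an illustration under stated assumptions: the entity names are hypothetical placeholders, each sentence is assumed to have already been reduced to the set of named entities found in it, and "inference" is taken to mean the simplest possible chain, A with B and B with C suggesting A with C.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy input: the named entities recognised in each sentence.
sentences = [
    {"gene_A", "protein_B"},
    {"protein_B", "disease_C"},
]

def cooccurrences(sentences):
    """Map each entity to the set of entities it shares a sentence with."""
    links = defaultdict(set)
    for entities in sentences:
        for a, b in combinations(sorted(entities), 2):
            links[a].add(b)
            links[b].add(a)
    return links

def inferred_links(links):
    """Return pairs (a, c) joined through some b but never seen together."""
    found = set()
    for neighbours in links.values():
        for a, c in combinations(sorted(neighbours), 2):
            if c not in links.get(a, set()):
                found.add((a, c))
    return found

candidates = inferred_links(cooccurrences(sentences))
print(candidates)
```

Each pair returned is a candidate for new knowledge, not a fact: it still needs checking by a domain expert.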


Other Benefits

Plus a lot more:

Improved document classification

Automatic semantic annotation of documents

Improved access -- search by semantics and concepts

Improved clustering of documents by concept

Summarization

Visualization techniques


Event Extraction

Extract events from the text along with information about the participants

Can be modeled as relationships between named entities

Extracting events allows discovery of hidden temporal correlations

e.g.: Google refuses to announce plans. Google's stock falls.

Improves understanding of the semantics, improving the functions based around those semantics
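As a rough sketch of the idea, events can be held as small records (time, action, participating entities) and candidate temporal correlations found by pairing events that share a participant within a short window. The `Event` class, the `follow_ons` helper, the seven-day window, the dates and the "Acme" entity are all assumptions made for illustration; none of them come from the NaCTeM tools.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Event:
    when: date
    action: str
    participants: frozenset  # named entities taking part in the event

def follow_ons(events, max_days=7):
    """Return (earlier, later) pairs that share a participant within max_days."""
    ordered = sorted(events, key=lambda e: e.when)
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if (second.when - first.when).days > max_days:
                break  # events are sorted, so later ones are further away
            if first.participants & second.participants:
                pairs.append((first, second))
    return pairs

# Hypothetical dates attached to the example on the slide.
events = [
    Event(date(2006, 1, 10), "refuses to announce plans", frozenset({"Google"})),
    Event(date(2006, 1, 11), "stock falls", frozenset({"Google"})),
    Event(date(2006, 1, 11), "opens office", frozenset({"Acme"})),
]

pairs = follow_ons(events)
for earlier, later in pairs:
    print(earlier.action, "->", later.action)
```

A pairing like this only flags a possible correlation; it says nothing about causation.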


NaCTeM

Hosted at University of Manchester

Participants: Universities of Manchester, Liverpool, Salford

Plus: San Diego Supercomputer Centre, University of Tokyo, University of Geneva, University of California, Berkeley

Six full-time posts for 3 years (2005-2007)

Plus active board of directors and experts

Current Director: Professor Jun'ichi Tsujii from U.Tokyo

Funding: JISC, BBSRC, EPSRC


NaCTeM Aims

Provide text mining oriented services

Facilitate access to text mining resources

User support, advice, training and consultancy

Participate in international research

Formulate best practice guidelines

Increase awareness of text mining in all domains

Develop links with industrial partners involved in text mining


Components

Liverpool: Cheshire3 (Information framework)

Manchester: CAFETIERE (Entity recognition, event extraction)

Salford: TerMine (Automatic term recognition)

SDSC: Storage Resource Broker (Data grid)

UC Berkeley: Cheshire, TM/IR expertise

U.Tokyo: GENIA, ENJU (Text analysis tools)

U.Geneva: User studies and evaluation


Cheshire3

Information Processing Framework

Liverpool and UC Berkeley

Standards based: XML, SRU, Unicode, etc.

Scalable: Single machine to Grid (PVM, MPI, SRB)

Extensible: Python + C, Object Oriented with stable API

Work ongoing to integrate Data Mining tools and other information processing applications
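To make "standards based" concrete, here is a minimal sketch of what a search against an SRU endpoint looks like: an ordinary HTTP GET whose query string carries a CQL query. The parameter names (`operation`, `version`, `query`, `maximumRecords`) are those of SRU 1.1; the base URL and the query are hypothetical, and the helper `sru_search_url` is not part of Cheshire3.

```python
from urllib.parse import urlencode

def sru_search_url(base_url, cql_query, maximum_records=10):
    """Build an SRU 1.1 searchRetrieve URL for the given CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "maximumRecords": str(maximum_records),
    }
    return base_url + "?" + urlencode(params)

# Hypothetical endpoint and query, for illustration only.
url = sru_search_url("http://example.org/sru/demo", 'dc.title any "text mining"')
print(url)
```

Because the request is just a URL, any client that can issue an HTTP GET and parse the XML response can use such a service.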


Cheshire3 Examples

Integrated tools from other participants in preparation for the NaCTeM service infrastructure.

Medline: 4350 records/second using 60 concurrent processes on SDSC's TeraGrid cluster

440 seconds to index 1 field from 16 million MARC records

Distributed network of Archival Descriptions in the UK

NARA ERA prototype system with SDSC
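For a sense of scale, the two throughput figures above can be reduced to comparable rates. The per-process number assumes the Medline load was spread evenly over the 60 processes, which the slide does not state; the slide also does not say how many processes the MARC run used.

```python
# Figures taken from the slide.
medline_records_per_second = 4350
medline_processes = 60
marc_records = 16_000_000
marc_seconds = 440

# Assumption: work was divided evenly between the processes.
medline_per_process = medline_records_per_second / medline_processes

# Aggregate rate for indexing one field of the MARC records.
marc_records_per_second = marc_records / marc_seconds

print(f"Medline: {medline_per_process:.1f} records/second per process")
print(f"MARC: {marc_records_per_second:.0f} records/second overall")
```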


CAFETIERE

Entity Recognition and Annotation

University of Manchester

Discovers named entities in part of speech tagged text

Discovers temporal events referring to those entities

Integration of ontologies and term processing

Rules based


CAFETIERE Example


TerMine

Automatic Term Recognition

University of Salford/Manchester

Discovers important terms

Assigns 'C-value' score to rank terms

Interaction with terminology databases for term management
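The ranking step can be illustrated with the C-value formula as published for multi-word term recognition: a candidate term a with |a| words and frequency f(a) scores log2|a| * f(a) if it never appears inside a longer candidate, and log2|a| * (f(a) - mean frequency of the longer candidates containing it) otherwise. The sketch below is a simplification under stated assumptions: candidates and their counts are hypothetical and supplied by hand, and the linguistic filtering a real system applies first is left out, so it should not be read as TerMine's implementation.

```python
from math import log2

def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run of words in `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

def c_values(freq):
    """Score each candidate term (a tuple of words) from its frequency."""
    scores = {}
    for a, f_a in freq.items():
        # Longer candidates in which `a` appears nested.
        containers = [b for b in freq if len(b) > len(a) and contains(b, a)]
        if containers:
            nested_mean = sum(freq[b] for b in containers) / len(containers)
            scores[a] = log2(len(a)) * (f_a - nested_mean)
        else:
            scores[a] = log2(len(a)) * f_a
    return scores

# Hypothetical candidate terms and counts.
freq = {
    ("text", "mining"): 10,
    ("text", "mining", "service"): 4,
    ("national", "text", "mining", "service"): 2,
}

scores = c_values(freq)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 2))
```

Subtracting the nested mean stops a short phrase from ranking highly when it mostly occurs as part of longer terms.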


TerMine Example


U. Tokyo Tools

Natural Language Parsing

University of Tokyo

Tagger, Chunker, ENJU, GENIA

Necessary for any text mining application

Fast and accurate

http://www-tsujii.is.s.u-tokyo.ac.jp/hiiragi/

http://www-tsujii.is.s.u-tokyo.ac.jp/CytoSailing/


Tokyo Tools Example


Tokyo Tools Example 2


Service Infrastructure

NaCTeM will allow UK researchers to perform text mining on their own data in combination with other accessible resources (e.g. other data sets, ontologies, etc.)

Requirements:

Lots of processing power

Lots of storage capacity

Easily extensible/configurable service framework

Access to cutting edge TM, DM and IR tools



Processing provided by UK National Grid Service

Data Storage via SDSC's Storage Resource Broker

Important to store multiple versions of each document

Cheshire3 provides the Grid-enabled information infrastructure

Plus information retrieval and data mining tools

Manchester and Tokyo provide the text mining tools

Stable tools integrated into Cheshire3 already



Initial NaCTeM services will be focused on the bio domain:

Bio-informatics is a growing field

Interest from both academic and corporate sectors

Large datasets/services available (MeSH, Medline, ...)

Web portal interaction

Then expand into other areas, such as Social Sciences and Historical text analysis.


Future Work

Services for other domains

GUI Workflow configuration

Integration of user-developed services and applications

Maximizing workflow potential with 'smart' components

Standardizing annotation schemas

Conference/Workshop

Other?


Thank You

Questions?

...

Reception!
