CNI, 3rd April 2006 Slide 1
UK National Centre for Text Mining:
Activities and Plans
Dr. Robert Sanderson
Dept. of Computer Science
University of Liverpool
http://www.nactem.ac.uk
CNI, 3rd April 2006 Slide 2
Overview
Text Mining?
NaCTeM
Consortium Components
Service Infrastructure
Future Work
CNI, 3rd April 2006 Slide 3
Centre for ...
National Centre for ... what was that?
Ticks Mining! TEXT
CNI, 3rd April 2006 Slide 4
... Text Mining?
Text Mining: No canonical definition
Commonly used definition based on Data Mining:
“The non-trivial extraction of implicit, previously unknown, and potentially useful information from data.”
“The non-trivial extraction of previously unknown, interesting facts from an invariably large collection of texts.”
CNI, 3rd April 2006 Slide 5
... Text Mining?
Typical Data Mining Functions:
Classification
Association Rule Mining
Clustering
Useful when applied to texts, but these don't fulfill the definition as they don't discover “facts”.
Information Retrieval also doesn't discover facts.
CNI, 3rd April 2006 Slide 6
... Text Mining?
Need to understand the meaning of the text:
Part of Speech tagging
Clauses
Named Entity Recognition
Find correlations of entities
Infer information from logical chains
Result: New Knowledge
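The chain from entity recognition to inferred information can be sketched in a few lines. This is only a toy illustration, not any NaCTeM tool: the gazetteer lookup stands in for real part of speech tagging and named entity recognition, and the names `GAZETTEER`, `find_entities`, `cooccurrences` and `inferred_links`, along with the two example sentences, are invented for the sketch.

```python
from collections import Counter
from itertools import combinations

# Hypothetical gazetteer standing in for a real named entity recogniser.
GAZETTEER = {"fish oil", "blood viscosity", "raynaud's disease"}

def find_entities(sentence):
    """Return the gazetteer entries that occur in the sentence (case-insensitive)."""
    lowered = sentence.lower()
    return sorted(e for e in GAZETTEER if e in lowered)

def cooccurrences(sentences):
    """Count how often each pair of entities appears in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        for pair in combinations(find_entities(sentence), 2):
            counts[pair] += 1
    return counts

def inferred_links(counts):
    """Propose (A, C, via B) links where A-B and B-C co-occur but A-C never does."""
    neighbours = {}
    for a, b in counts:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    proposals = set()
    for b, linked in neighbours.items():
        for a, c in combinations(sorted(linked), 2):
            if c not in neighbours[a]:
                proposals.add((a, c, b))
    return proposals

# Illustrative sentences only; not taken from a real corpus.
sentences = [
    "Fish oil reduces blood viscosity.",
    "Blood viscosity is raised in Raynaud's disease.",
]
counts = cooccurrences(sentences)
links = inferred_links(counts)
```

Here `links` holds one proposed connection between two entities that never appear together, reached through the entity they share. A real system would replace the substring lookup with proper linguistic analysis and would rank proposals rather than list them all.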
CNI, 3rd April 2006 Slide 7
Other Benefits
Plus a lot more:
Improved document classification
Automatic semantic annotation of documents
Improved access -- search by semantics and concepts
Improved clustering of documents by concept
Summarization
Visualization techniques
CNI, 3rd April 2006 Slide 8
Event Extraction
Extract events from the text along with information about the participants
Can be modeled as relationships between named entities
Extracting events allows discovery of hidden temporal correlations
e.g. Google refuses to announce plans. Google's stock falls.
Improves understanding of the semantics, improving the functions based around those semantics
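One possible way to model such events and look for temporal correlations is sketched below. It is an assumption-laden illustration rather than how CAFETIERE or any NaCTeM component represents events: the `Event` class, the `followed_by` helper, the `max_days` window and the dates are all hypothetical (the slide gives no dates).

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Event:
    """An extracted event: something a named entity took part in at a given time."""
    entity: str   # the named entity taking part
    action: str   # normalised description of what happened
    when: date    # hypothetical timestamp; invented for this sketch

def followed_by(events, first_action, second_action, max_days=7):
    """Return (earlier, later) event pairs for the same entity within max_days."""
    ordered = sorted(events, key=lambda e: e.when)
    pairs = []
    for i, earlier in enumerate(ordered):
        if earlier.action != first_action:
            continue
        for later in ordered[i + 1:]:
            gap = (later.when - earlier.when).days
            if (later.entity == earlier.entity
                    and later.action == second_action
                    and 0 <= gap <= max_days):
                pairs.append((earlier, later))
    return pairs

# The slide's example, with made-up dates.
events = [
    Event("Google", "refuses to announce plans", date(2006, 3, 1)),
    Event("Google", "stock falls", date(2006, 3, 2)),
]
pairs = followed_by(events, "refuses to announce plans", "stock falls")
```

Counting how often such pairs recur across a large collection is what would turn a single coincidence into a candidate correlation.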
CNI, 3rd April 2006 Slide 9
NaCTeM
Hosted at University of Manchester
Participants: Universities of Manchester, Liverpool, Salford
Plus: San Diego Supercomputer Centre, University of Tokyo, University of Geneva, University of California Berkeley
Six full time posts for 3 years (2005-2007)
Plus active board of directors and experts
Current Director: Professor Jun'ichi Tsujii from U.Tokyo
Funding: JISC, BBSRC, EPSRC
CNI, 3rd April 2006 Slide 10
NaCTeM Aims
Provide text mining oriented services
Facilitate access to text mining resources
User support, advice, training and consultancy
Participate in international research
Formulate best practice guidelines
Increase awareness of text mining in all domains
Develop links with industrial partners involved in text mining
CNI, 3rd April 2006 Slide 11
Components
Liverpool: Cheshire3 (Information framework)
Manchester: CAFETIERE (Entity recognition, event extraction)
Salford: TerMine (Automatic term recognition)
SDSC: Storage Resource Broker (Data grid)
UC Berkeley: Cheshire, TM/IR expertise
U.Tokyo: GENIA, ENJU (Text analysis tools)
U.Geneva: User studies and evaluation
CNI, 3rd April 2006 Slide 12
Cheshire3
Information Processing Framework
Liverpool and UC Berkeley
Standards based: XML, SRU, Unicode, etc.
Scalable: Single machine to Grid (PVM, MPI, SRB)
Extensible: Python + C, Object Oriented with stable API
Work ongoing to integrate Data Mining tools and other information processing applications
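As an illustration of what "standards based" means for a client, an SRU search is just an HTTP GET with a CQL query in the URL. The sketch below builds such a URL using standard SRU 1.1 parameter names; the endpoint `http://example.org/sru/demo` and the function `sru_search_url` are hypothetical, and nothing here is taken from the Cheshire3 API itself.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def sru_search_url(base_url, cql_query, max_records=10):
    """Build an SRU 1.1 searchRetrieve URL for the given CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "maximumRecords": str(max_records),
    }
    return base_url + "?" + urlencode(params)

# Hypothetical endpoint; substitute a real SRU server's base URL.
url = sru_search_url("http://example.org/sru/demo", 'dc.title = "text mining"')

# Round-trip check: the query survives URL encoding unchanged.
decoded = parse_qs(urlsplit(url).query)
```

The response to such a request is an XML document, which is why the same client code can talk to any SRU server regardless of the system behind it.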
CNI, 3rd April 2006 Slide 13
Cheshire3 Examples
Integrated tools from other participants in preparation for NaCTeM service infrastructure.
Medline: 4350 records/second using 60 concurrent processes on SDSC's Teragrid cluster
440 seconds to index 1 field from 16 million MARC records
Distributed network of Archival Descriptions in the UK
NARA ERA prototype system with SDSC
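The two throughput figures above can be put on a common footing with some quick arithmetic. The numbers come straight from the slide; the variable names are just for this sketch. Note that the slide does not say how many processes the MARC run used, and the two jobs do different work, so the rates are not directly comparable.

```python
# Medline run: figures from the slide.
medline_rate = 4350                       # records per second, all processes together
processes = 60
per_process = medline_rate / processes    # records per second per process

# MARC run: figures from the slide (process count not stated there).
marc_records = 16_000_000
marc_seconds = 440
marc_rate = marc_records / marc_seconds   # records per second, one field indexed

print(f"Medline: {per_process:.1f} records/s per process")
print(f"MARC:    {marc_rate:,.0f} records/s overall")
```

That works out to 72.5 records per second per process for Medline, and roughly 36,000 records per second for the single-field MARC index.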
CNI, 3rd April 2006 Slide 14
CAFETIERE
Entity Recognition and Annotation
University of Manchester
Discovers named entities in part of speech tagged text
Discovers temporal events referring to those entities
Integration of ontologies and term processing
Rules based
CNI, 3rd April 2006 Slide 15
CAFETIERE Example
CNI, 3rd April 2006 Slide 16
TerMine
Automatic Term Recognition
University of Salford/Manchester
Discovers important terms
Assigns 'C-value' score to rank terms
Interaction with terminology databases for term management
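For readers unfamiliar with the C-value score, the sketch below follows the commonly published formulation: a candidate term's frequency is weighted by the log of its length in words, and a term that also appears nested inside longer candidates has the average frequency of those longer candidates subtracted first. This is a simplified illustration and not TerMine's actual code; the helper names and the frequencies are invented, and candidate extraction (normally done with linguistic filters) is assumed to have already happened.

```python
import math

def contains(longer, shorter):
    """True if the word sequence `shorter` occurs contiguously inside `longer`."""
    n = len(shorter)
    return len(longer) > n and any(
        longer[i:i + n] == shorter for i in range(len(longer) - n + 1)
    )

def c_values(freq):
    """Map each candidate term (a tuple of words) to its C-value.

    freq: dict mapping candidate term -> corpus frequency.
    Single-word terms score 0 because log2(1) == 0; the measure targets
    multi-word terms.
    """
    scores = {}
    for term, f in freq.items():
        # Frequencies of the longer candidates that contain this term.
        longer = [f_b for other, f_b in freq.items() if contains(other, term)]
        if longer:
            f = f - sum(longer) / len(longer)
        scores[term] = math.log2(len(term)) * f
    return scores

# Invented frequencies, purely for illustration.
freq = {
    ("text", "mining"): 10,
    ("text", "mining", "service"): 4,
}
scores = c_values(freq)
```

With these made-up counts, "text mining" scores 6.0 (its 10 occurrences minus the 4 that belong to the longer term), while "text mining service" scores 4 × log2(3), about 6.34, so the longer, more specific term ranks slightly higher.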
CNI, 3rd April 2006 Slide 17
TerMine Example
CNI, 3rd April 2006 Slide 18
U. Tokyo Tools
Natural Language Parsing
University of Tokyo
Tagger, Chunker, ENJU, GENIA
Necessary for any text mining application
Fast and accurate
http://www-tsujii.is.s.u-tokyo.ac.jp/hiiragi/
http://www-tsujii.is.s.u-tokyo.ac.jp/CytoSailing/
CNI, 3rd April 2006 Slide 19
Tokyo Tools Example
CNI, 3rd April 2006 Slide 20
Tokyo Tools Example 2
CNI, 3rd April 2006 Slide 21
Service Infrastructure
NaCTeM will allow UK researchers to perform text mining on their own data in combination with other accessible resources (e.g. other data sets, ontologies etc.)
Requirements:
Lots of processing power
Lots of storage capacity
Easily extensible/configurable service framework
Access to cutting edge TM, DM and IR tools
CNI, 3rd April 2006 Slide 22
Service Infrastructure
Processing provided by UK National Grid Service
Data Storage via SDSC's Storage Resource Broker
Important to store multiple versions of each document
Cheshire3 provides the Grid enabled information infrastructure
Plus information retrieval and data mining tools
Manchester and Tokyo provide the text mining tools
Stable tools integrated into Cheshire3 already
CNI, 3rd April 2006 Slide 23
Service Infrastructure
Initial NaCTeM services will be focused on the bio domain:
Bio-informatics is a growing field
Interest from both academic and corporate sectors
Large datasets/services available (MeSH, Medline, ...)
Web portal interaction
Then expand into other areas, such as Social Sciences and Historical text analysis.
CNI, 3rd April 2006 Slide 24
Future Work
Services for other domains
GUI Workflow configuration
Integration of user developed services and applications
Maximizing workflow potential with 'smart' components
Standardizing annotation schemas
Conference/Workshop
Other?
CNI, 3rd April 2006 Slide 25
Thank You
Questions?
...
Reception!