OpenMinTeD: its uses and benefits for the social sciences
TRANSCRIPT
1
twitter.com/openminted_eu
Peter Mutschke
ITOC Workshop Philadelphia – February 20, 2016
Open Mining Infrastructure for Text & Data (OpenMinTeD)
2
Goal of Text Mining
implementation of transformational processes that …
uncover knowledge in unstructured text:
- salient content items
- hidden relationships between content items
… to assist researchers and scientific data curators in making sense of the textual data
3
The phases of text mining
taken from ICT2015 presentation (N. Manola)
[Figure: the text mining process in four stages (Stage 1 to Stage 4), with the labels Information Retrieval, NLP Analysis, Entity Recognition, Information Extraction, Data Mining and Knowledge Discovery]
OPENMINTED - The Open Mining Infrastructure for Text and Data
4
Challenges
Text Mining (TM) remains a fragmented set of tools
TM requires particular technological and analytical skills
as well as domain knowledge
no shared knowledge of how to apply TM
lack of a central infrastructure
(may rule out use of TM for small research groups)
high entry costs:
need to share infrastructure costs
5
Putting it all together
OpenMinTeD: establish an open and sustainable Text and Data Mining (TDM) platform and infrastructure where researchers can collaboratively create, discover, share and re-use knowledge from a wide range of text-based scientific and scholarly sources
6
OpenMinTeD – working on many fronts
ACCESSIBLE CONTENT: via standardised programmatic interfaces and access rules
DISCOVERABLE SERVICES: well-documented, easily discoverable text mining services and workflows which process, analyse and annotate text
EFFICIENT PROCESSING: operate on public e-Infrastructures via standardised APIs
TDM COMMUNITIES: different scientific communities have different challenges
VALUE ADDED APPS: community-driven applications to illustrate the value of the infrastructure; engage with industry
taken from ICT2015 presentation (N. Manola)
7
Bridging the gap between different communities
8
The project
Starts: June 2015
Duration: 3 years
16 Partners:
- 6 mining research groups
- 3 content providers
- 1 data center
- 1 library association
- 2 legal experts
- 6 community related partners
- 2 SMEs
PARTNERS
Athena RIC, Univ. of Manchester (NaCTeM), Univ. of Darmstadt, INRA, EMBL-EBI, Agro-Know, LIBER, Univ. of Amsterdam, Open University UK, EPFL, CNIO, Univ. of Sheffield (GATE), GESIS, GRNET, Frontiers, Univ. of Stirling
taken from ICT2015 presentation (N. Manola)
9
OpenMinTeD users
TM consumers: to advance their science
Service providers: to enhance their tools
TM researchers: to share their algorithms
Content providers: to enrich their content
10
Infrastructural approach
OpenMinTeD does not build new services, but adopts and adapts existing services for new communities
Focuses on interoperability across text mining services and content providers
Creates an open & collaborative space for researchers to use the best-fitting text mining services available
11
The architecture
[Figure: layered architecture of the OpenMinTeD infrastructure]
Users: researchers, curators, text-miners and new services developers
Platform services: Registry, Workflow Management, Auth2 & Policy management, Annotator, Accounting
Layer 1: Interoperability of text mining services (platforms or components): mining platforms, proprietary architectures
Layer 2: Interoperability of language resources & corpora: language resources and corpora registry service; language resources; Publisher text corpus, OpenAIRE/CORE text corpus, PMC text corpus, other text corpora
Layer 3: Interoperability to shared storage and computing resources: data centres, data centre in public cloud
taken from ICT2015 presentation (N. Manola)
12
Bottom-up approach
OpenMinTeD works with 4 use cases, which provide requirements and evaluate the results:
Research Analytics, Social Sciences, Agriculture, Life Sciences
taken from ICT2015 presentation (N. Manola)
13
Science-driven approach
14
GESIS: Infrastructure for the
Social Sciences
15
GESIS Research Data Cycle
[Figure: GESIS research data cycle with the phases Study planning, Data collection, Data analysis, Archiving and registering, Searching]
16
Difficulties in Information Seeking
17
Problems Processing Search Results
18
Usefulness of TM enhanced search services
19
Social Science use case
Develop and evaluate methods for automatic detection and linking of named entities in Social Science publications, in order to advance reliable and context-sensitive retrieval and linking of relevant entities
20
Enhancing Search in Text and Data
classical named entity recognition and disambiguation of relevant entities (names, places, organizations, terms) to enhance automatic indexing
recognition of vague variable mentions to enhance linking of data and publications
enrichment of data with context information from text to enhance retrievability of data sets
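To make the first item more concrete, the following is a minimal sketch of dictionary-based entity recognition with a simple context-based disambiguation step. It is not the method used in the project. The gazetteer, the entity identifiers, the cue words and the function names are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical gazetteer: surface form -> candidate entities. Each candidate
# has an identifier, a type, and context cue words used for disambiguation.
GAZETTEER = {
    "ISSP": [
        {"id": "survey:ISSP", "type": "survey", "cues": set()},
    ],
    "Paris": [
        {"id": "place:Paris_France", "type": "place", "cues": {"france", "city"}},
        {"id": "place:Paris_Texas", "type": "place", "cues": {"texas", "usa"}},
    ],
}


def tokenize(text):
    """Return the set of lowercased word tokens in text."""
    return set(re.findall(r"\w+", text.lower()))


def recognize(text):
    """Return (surface form, entity id) pairs for gazetteer entries found in text."""
    context = tokenize(text)
    found = []
    for surface, candidates in GAZETTEER.items():
        if re.search(r"\b" + re.escape(surface) + r"\b", text):
            # Disambiguation: prefer the candidate whose cue words overlap
            # most with the surrounding text (first candidate wins on ties).
            best = max(candidates, key=lambda c: len(c["cues"] & context))
            found.append((surface, best["id"]))
    return found


print(recognize("Respondents of the ISSP in Paris, Texas were asked ..."))
```

A real system would replace the hand-written gazetteer with trained models and curated knowledge bases. The sketch only shows how recognition (finding the mention) and disambiguation (choosing among candidates) are separate steps.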
21
Identifying references to survey variables
Olga Nešporová, Zdeněk R. Nešpor (2009). “Religion: An Unsolved Problem for the Modern Czech Nation”
ISSP 2008
Link Database
v39: Believe in life after death
v40: Believe in Heaven
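As an illustration of the linking idea on this slide, here is a minimal sketch that matches sentences from a publication against survey variable labels by token overlap. The two variable labels are taken from the slide. The stopword list, the crude five-character stemming, the threshold and the function names are assumptions made for this example and do not describe the project's actual method.

```python
import re

# Variable labels as shown on the slide (ISSP 2008).
VARIABLES = {
    "v39": "Believe in life after death",
    "v40": "Believe in Heaven",
}

# Assumed minimal stopword list for this example.
STOPWORDS = {"in", "the", "a", "an", "of", "is"}


def normalize(text):
    """Lowercase, drop stopwords, and truncate tokens to 5 characters.

    The truncation is a crude stand-in for stemming, so that
    'believe', 'believed' and 'belief' all map to 'belie'.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t[:5] for t in tokens if t not in STOPWORDS}


def link_variables(sentence, threshold=0.75):
    """Return (variable id, score) pairs whose label is covered by the sentence.

    The score is the fraction of normalized label tokens found in the sentence.
    """
    sentence_tokens = normalize(sentence)
    links = []
    for var_id, label in VARIABLES.items():
        label_tokens = normalize(label)
        score = len(label_tokens & sentence_tokens) / len(label_tokens)
        if score >= threshold:
            links.append((var_id, round(score, 2)))
    return links


print(link_variables("Few respondents believed in heaven."))
print(link_variables("Belief in life after death remains rare."))
```

Token overlap only catches mentions that reuse the wording of the variable label. Truly vague mentions (for example, paraphrases) would need semantic similarity measures, which this sketch does not attempt.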
22
Benefits from user perspective
semantic search: understanding the contextual meaning of (search) terms
fuzzy phrase search: search for attitudes, survey questions in texts (under vagueness)
link retrieval: search and retrieval of links between text and data
dataset retrieval: facilitating search for research data in data catalogues at the level of items and variables
23
Contact us
www.openminted.eu
twitter.com/openminted_eu
facebook.com/openminted
bit.do/openmintedlinkedin
vimeo.com/openminted
bit.do/openmintedplus