text mining and data integration

Posted on 10-May-2015

Lars Juhl Jensen

Text mining and data integration

exponential growth

~45 seconds per paper

information retrieval

named entity recognition

information extraction

association networks

data integration

information retrieval

find the relevant papers

ad hoc retrieval

user-specified query

“yeast AND cell cycle”

PubMed

indexing

fast lookup

stemming

word endings

dynamic query expansion

MeSH terms
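
The retrieval steps above (indexing for fast lookup, stemming of word endings, query expansion) can be sketched in a few lines. This is a minimal illustration, not how PubMed is implemented: the suffix-stripping `stem` function is a crude stand-in for a real stemmer, the `SYNONYMS` table is a hypothetical substitute for MeSH-based expansion, and the four documents are made up.

```python
from collections import defaultdict

def stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lower-case, drop punctuation, and stem every word."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return [stem(tok) for tok in cleaned.split()]

def build_index(docs):
    """Inverted index: stemmed term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search_and(index, query, synonyms=None):
    """AND together all query terms; each term may be expanded to synonyms."""
    synonyms = synonyms or {}
    result = None
    for term in tokenize(query):
        variants = {stem(v) for v in synonyms.get(term, {term})} | {term}
        postings = set()
        for variant in variants:
            postings |= index.get(variant, set())
        result = postings if result is None else result & postings
    return result or set()

# Hypothetical expansion table (not real MeSH data).
SYNONYMS = {"yeast": {"yeast", "saccharomyces"}}

# Made-up documents for illustration.
docs = {
    1: "Cell cycles in budding yeast",
    2: "Yeast genetics and the cell cycle",
    3: "Cell cycle regulation in human cells",
    4: "Saccharomyces cerevisiae cell cycle checkpoints",
}
index = build_index(docs)
print(sorted(search_and(index, "yeast cell cycle")))            # [1, 2]
print(sorted(search_and(index, "yeast cell cycle", SYNONYMS)))  # [1, 2, 4]
```

Stemming lets "cycles" match "cycle", and the expansion table recovers document 4, which never uses the word "yeast".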

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

no tool will find that

named entity recognition

computer

as smart as a dog

teach it specific tricks

identify the concepts

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

comprehensive lexicon

CDC2

cyclin dependent kinase 1

orthographic variation

upper- and lower-case

CDC2

Cdc2

spaces and hyphens

cyclin dependent kinase 1

cyclin-dependent kinase 1

prefixes and postfixes

CDC2

hCDC2

“black list”

SDS
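
The lexicon-matching ideas above (case, spaces and hyphens, prefixes, black list) can be sketched as a small dictionary tagger. This is an illustration only, not the tagger used in the talk: the `NAMES` table, the single `"h"` prefix, and the longest-match-first scan are assumptions, and "SDS" is included as a lexicon entry purely so the black list has something to suppress.

```python
import re

# Illustrative lexicon: name -> identifier (assumed, not from a real database).
NAMES = {
    "CDC2": "CDK1",
    "cyclin dependent kinase 1": "CDK1",
    "SDS": "SDS",
}
BLACKLIST = {"sds"}   # names that are too ambiguous to tag
PREFIXES = ("h",)     # hypothetical species prefix, as in "hCDC2"

def normalize(name):
    """Ignore case, spaces, and hyphens."""
    return re.sub(r"[\s\-]+", "", name.lower())

def build_lexicon(names):
    return {normalize(name): ident for name, ident in names.items()}

def lookup(candidate, lexicon, blacklist):
    key = normalize(candidate)
    if key in blacklist:
        return None
    if key in lexicon:
        return lexicon[key]
    for prefix in PREFIXES:
        rest = key[len(prefix):]
        if key.startswith(prefix) and rest in lexicon and rest not in blacklist:
            return lexicon[rest]
    return None

def tag(text, lexicon, blacklist, max_words=4):
    """Scan left to right, preferring the longest matching name."""
    tokens = re.findall(r"[A-Za-z0-9\-]+", text)
    hits, i = [], 0
    while i < len(tokens):
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            ident = lookup(candidate, lexicon, blacklist)
            if ident:
                hits.append((candidate, ident))
                i += n
                break
        else:
            i += 1
    return hits

lexicon = build_lexicon(NAMES)
text = "hCDC2, Cdc2 and cyclin-dependent kinase 1 were run on SDS gels."
print(tag(text, lexicon, BLACKLIST))
# [('hCDC2', 'CDK1'), ('Cdc2', 'CDK1'), ('cyclin-dependent kinase 1', 'CDK1')]
```

All three spellings resolve to the same identifier, while the blacklisted "SDS" is left untagged.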

scalable implementation

>10 km

<10 hours

augmented browsing

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

Reflect

Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009

O’Donoghue et al., Journal of Web Semantics, 2010

information extraction

formalize the facts

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

two approaches

co-mentioning

counting

within documents

within paragraphs

within sentences

co-mentioning score
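
Counting co-mentions at the three levels above can be sketched as follows. This is a simplified illustration, not the scoring scheme from the talk: the `WEIGHTS` values are arbitrary, the corpus is a toy one, and entity recognition is assumed to have already turned each sentence into a set of names.

```python
from collections import Counter
from itertools import combinations

# Arbitrary illustrative weights; closer co-mentions count for more.
WEIGHTS = {"document": 1.0, "paragraph": 2.0, "sentence": 3.0}

def pairs(entities):
    """All unordered pairs of distinct entities, as sorted tuples."""
    return {tuple(sorted(p)) for p in combinations(set(entities), 2)}

def comention_counts(corpus, weights=WEIGHTS):
    """corpus: documents -> paragraphs -> sentences -> sets of entity names."""
    counts = Counter()
    for document in corpus:
        par_pairs, sent_pairs, doc_entities = set(), set(), set()
        for paragraph in document:
            par_entities = set()
            for sentence in paragraph:
                sent_pairs |= pairs(sentence)
                par_entities |= set(sentence)
            par_pairs |= pairs(par_entities)
            doc_entities |= par_entities
        # Each pair is counted once per document at each level it reaches.
        for pair in pairs(doc_entities):
            counts[pair] += (weights["document"]
                             + weights["paragraph"] * (pair in par_pairs)
                             + weights["sentence"] * (pair in sent_pairs))
    return counts

corpus = [
    [[{"Cdc28", "Swe1"}, {"Cdc5"}], [{"Clb2"}]],  # document 1: two paragraphs
    [[{"Cdc5", "Swe1"}]],                          # document 2: one sentence
]
counts = comention_counts(corpus)
print(counts[("Cdc28", "Swe1")])  # 6.0 (same sentence in document 1)
print(counts[("Cdc5", "Swe1")])   # 9.0 (same paragraph once, same sentence once)
print(counts[("Clb2", "Swe1")])   # 1.0 (same document only)
```

A usable score would also need to correct for how often each name is mentioned overall, which this sketch leaves out.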

NLP

Natural Language Processing

grammatical analysis

part-of-speech tagging

multiword detection

semantic tagging

sentence parsing

Gene and protein names

Cue words for entity recognition

Verbs for relation extraction

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

extract stated facts

high precision

poor recall
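
A single hand-written rule shows why this style of extraction tends toward high precision and poor recall. The sketch below is not the grammar-based pipeline from the talk: it swaps in a plain regular expression, and the `GENES` set stands in for the output of entity recognition.

```python
import re

# Assumed output of named entity recognition.
GENES = {"CYC1", "CYC7", "HAP1"}

# One rule: "expression of <targets> is controlled by <regulator>".
PATTERN = re.compile(
    r"expression of (?P<targets>.+?) is controlled by (?P<regulator>\w+)",
    re.IGNORECASE,
)

def extract(sentence):
    """Return (regulator, relation, target) triples stated by the sentence."""
    match = PATTERN.search(sentence)
    if not match or match.group("regulator") not in GENES:
        return []
    targets = [w for w in re.findall(r"\w+", match.group("targets")) if w in GENES]
    return [(match.group("regulator"), "regulates expression of", t) for t in targets]

stated = "The expression of the cytochrome genes CYC1 and CYC7 is controlled by HAP1"
print(extract(stated))
# [('HAP1', 'regulates expression of', 'CYC1'), ('HAP1', 'regulates expression of', 'CYC7')]

paraphrase = "HAP1 drives transcription of CYC1"
print(extract(paraphrase))  # [] (same fact, but the rule does not cover this wording)
```

The rule rarely fires on the wrong thing, but every new phrasing needs its own rule.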

text corpus

most use abstracts

few use full-text articles

no access

PDF files

layout-aware extraction

my corpus

~22 million abstracts

~4 million articles

association networks

guilt by association

STRING

Szklarczyk, Franceschini et al., Nucleic Acids Research, 2011

computational predictions

gene fusion

Korbel et al., Nature Biotechnology, 2004

gene neighborhood

Korbel et al., Nature Biotechnology, 2004

phylogenetic profiles

Korbel et al., Nature Biotechnology, 2004
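
The idea behind phylogenetic profiles is that genes present and absent in the same set of genomes are likely to be functionally linked. A minimal sketch, using Jaccard similarity as a simple stand-in for whatever measure a production system uses, with made-up profiles over six hypothetical genomes:

```python
from itertools import combinations

def jaccard(profile_a, profile_b):
    """Shared presences divided by genomes where either gene is present."""
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0

# Presence (1) or absence (0) of each gene in six hypothetical genomes.
profiles = {
    "geneA": [1, 1, 0, 1, 0, 1],
    "geneB": [1, 1, 0, 1, 0, 1],
    "geneC": [1, 0, 1, 0, 1, 0],
}

similarity = {
    (g1, g2): jaccard(profiles[g1], profiles[g2])
    for g1, g2 in combinations(sorted(profiles), 2)
}
for pair, score in sorted(similarity.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

Here geneA and geneB share an identical profile and score 1.0, while geneC overlaps with either of them in only one genome.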

a real example

Cell

Cellulosomes

Cellulose

experimental data

gene coexpression

physical interactions

Jensen & Bork, Science, 2008

curated knowledge

pathways

Letunic & Bork, Trends in Biochemical Sciences, 2008

many databases

different formats

different identifiers

variable quality

not comparable

hard work

quality scores

von Mering et al., Nucleic Acids Research, 2005

calibrate vs. gold standard

von Mering et al., Nucleic Acids Research, 2005
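
Calibration against a gold standard turns raw, source-specific scores into comparable quality scores. The sketch below shows one simple way to do it, by binning raw scores and measuring the fraction of gold-standard positives in each bin. The bin edges, the example pairs, and the `labels` table are all made up, and a real calibration would typically fit a smooth curve rather than use a few bins.

```python
from bisect import bisect_right

def calibrate(scored, labels, edges):
    """Return (lower bin edge, fraction of true pairs) for each non-empty bin.

    scored: list of (pair, raw_score)
    labels: pair -> True/False for pairs covered by the gold standard
    edges:  sorted lower edges of the bins; the last bin is open-ended
    """
    hits = [0] * len(edges)
    totals = [0] * len(edges)
    for pair, raw in scored:
        if pair not in labels:      # gold standard says nothing about this pair
            continue
        b = bisect_right(edges, raw) - 1
        if b < 0:                   # raw score below the first bin
            continue
        totals[b] += 1
        hits[b] += labels[pair]
    return [(edges[i], hits[i] / totals[i]) for i in range(len(edges)) if totals[i]]

scored = [
    (("a", "b"), 6.0), (("a", "c"), 5.5), (("b", "c"), 3.0), (("a", "d"), 2.5),
    (("b", "d"), 1.0), (("c", "d"), 0.5), (("c", "e"), 0.7), (("x", "y"), 9.0),
]
labels = {
    ("a", "b"): True, ("a", "c"): True, ("b", "c"): True, ("a", "d"): False,
    ("b", "d"): False, ("c", "d"): False, ("c", "e"): True,
}
curve = calibrate(scored, labels, edges=[0.0, 2.0, 5.0])
for lower, precision in curve:
    print(f"raw score >= {lower}: {precision:.2f}")
```

Once every evidence type has such a curve, a raw score from any source can be read as an estimated probability of being correct, which is what makes the sources comparable.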

data integration

general approach

suite of web resources

STITCH

STRING + 300k chemicals

Kuhn et al., Nucleic Acids Research, 2012

COMPARTMENTS

subcellular localization

compartments.jensenlab.org

TISSUES

tissue expression

tissues.jensenlab.org

DISEASES

disease genes

unification

curated knowledge

text mining

experimental data

computational predictions

common identifiers

quality scores
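
Once every evidence type uses common identifiers and calibrated scores, the evidence for a pair can be merged. A minimal sketch, assuming the scores are probabilities and the evidence types are independent; the identifiers in `ID_MAP` and the records are hypothetical.

```python
from collections import defaultdict
from math import prod

# Hypothetical source-specific identifiers mapped to common ones.
ID_MAP = {
    "dbA:001": "CDC28", "dbB:x17": "CDC28",
    "dbA:002": "SWE1",  "dbB:x42": "SWE1",
}

def unify(records, id_map):
    """Group scores by (common id pair, evidence type), keeping the best score."""
    evidence = defaultdict(dict)
    for source, id1, id2, score in records:
        if id1 not in id_map or id2 not in id_map:
            continue  # cannot be mapped to a common identifier
        pair = tuple(sorted((id_map[id1], id_map[id2])))
        evidence[pair][source] = max(score, evidence[pair].get(source, 0.0))
    return evidence

def combine(scores):
    """Probability that at least one independent line of evidence is correct."""
    return 1.0 - prod(1.0 - s for s in scores)

records = [
    ("text mining", "dbA:001", "dbA:002", 0.6),
    ("experimental data", "dbB:x42", "dbB:x17", 0.5),
    ("experimental data", "dbB:x42", "dbB:unknown", 0.9),
]
evidence = unify(records, ID_MAP)
combined = {pair: combine(by_type.values()) for pair, by_type in evidence.items()}
print(combined)  # the CDC28/SWE1 pair combines 0.6 and 0.5 into roughly 0.8
```

Two moderate lines of evidence yield a higher combined score than either alone, which only makes sense because both were first put on the same scale.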

visualization

dissemination

web interfaces

evidence viewers

web services

diseases.jensenlab.org

bulk download

thank you!
