STRING: Large-scale data and text mining

Post on 27-Jun-2015

STRING: Large-scale data and text mining

Lars Juhl Jensen

association networks

guilt by association

biological systems

protein networks

STRING

1100+ genomes

computational predictions

gene fusion

Korbel et al., Nature Biotechnology, 2004

gene neighborhood

Korbel et al., Nature Biotechnology, 2004

phylogenetic profiles

Korbel et al., Nature Biotechnology, 2004
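
The phylogenetic-profile idea can be sketched in a few lines: genes that are present and absent in the same genomes are candidates for a functional link. The gene names, genome labels, threshold, and the choice of Jaccard similarity below are illustrative assumptions, not the scoring scheme STRING itself uses.

```python
from itertools import combinations

# Hypothetical presence/absence data: gene -> genomes that contain an ortholog.
profiles = {
    "geneA": {"g1", "g2", "g4", "g5"},
    "geneB": {"g1", "g2", "g4", "g5"},
    "geneC": {"g2", "g3"},
    "geneD": {"g1", "g2", "g4"},
}


def jaccard(a, b):
    """Genomes shared by both genes divided by genomes containing either."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def profile_links(profiles, min_similarity=0.7):
    """Return (gene1, gene2, score) for pairs with similar profiles, best first."""
    links = []
    for g1, g2 in combinations(sorted(profiles), 2):
        score = jaccard(profiles[g1], profiles[g2])
        if score >= min_similarity:
            links.append((g1, g2, score))
    return sorted(links, key=lambda link: -link[2])


links = profile_links(profiles)
for g1, g2, score in links:
    print(f"{g1} - {g2}: {score:.2f}")
```

With this toy data, geneA and geneB share an identical profile and rank first, while geneC, which occurs in a different set of genomes, is not linked to anything.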

a real example

Cell

Cellulosomes

Cellulose

experimental data

gene coexpression

protein interactions

Jensen & Bork, Science, 2008

curated knowledge

complexes

pathways

Letunic & Bork, Trends in Biochemical Sciences, 2008

many databases

different formats

different identifiers

variable quality

not comparable

not same species

hard work

(Ph.D. students)

common identifiers

quality scores

von Mering et al., Nucleic Acids Research, 2005

score calibration

von Mering et al., Nucleic Acids Research, 2005
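
One way to picture score calibration is to bin the raw scores of an evidence channel and ask what fraction of the pairs in each bin is confirmed by a trusted reference set. The sketch below does exactly that; the pairs, raw scores, reference set, and bin edge are all made up for illustration, and the actual procedure in von Mering et al. may differ in its details.

```python
from bisect import bisect_right


def calibrate(scored_pairs, gold_standard, bin_edges):
    """Fraction of reference-supported pairs in each raw-score bin.

    scored_pairs:  iterable of (pair, raw_score)
    gold_standard: set of pairs treated as true functional links (assumed given)
    bin_edges:     ascending raw-score thresholds that separate the bins
    """
    hits = [0] * (len(bin_edges) + 1)
    totals = [0] * (len(bin_edges) + 1)
    for pair, raw in scored_pairs:
        i = bisect_right(bin_edges, raw)
        totals[i] += 1
        hits[i] += pair in gold_standard
    # Empty bins have no estimate, so they are reported as None.
    return [h / t if t else None for h, t in zip(hits, totals)]


def calibrated_score(raw, bin_edges, precision_per_bin):
    """Look up the calibrated value for a new raw score."""
    return precision_per_bin[bisect_right(bin_edges, raw)]


# Hypothetical raw scores from one evidence channel.
scored = [
    (("a", "b"), 0.9), (("a", "c"), 0.8), (("c", "d"), 0.6),
    (("b", "c"), 0.4), (("a", "d"), 0.3), (("b", "d"), 0.2),
]
gold = {("a", "b"), ("a", "c"), ("b", "c")}
edges = [0.5]

precision = calibrate(scored, gold, edges)
print(precision)
print(calibrated_score(0.85, edges, precision))
```

Because every channel is mapped onto the same scale, scores from otherwise incomparable sources can be compared and combined.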

homology-based transfer

Franceschini et al., Nucleic Acids Research, 2013

missing most of the data

text mining

>10 km

too much to read

computer

comprehensive lexicon

CDC2

cyclin dependent kinase 1

expansion rules

hCdc2

CDC2

flexible matching

cyclin-dependent kinase 1

cyclin dependent kinase 1

“black list”

SDS
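
The lexicon, expansion rule, flexible matching, and black list can be combined in a small dictionary-based tagger. This is a minimal sketch under stated assumptions: the two-entry lexicon, the identifier labels, the single "h" prefix rule, and the example sentence are hypothetical, and a production tagger would use a far larger dictionary and rule set.

```python
import re

# Hypothetical mini-lexicon: identifier -> names found in the dictionary.
LEXICON = {
    "CDK1": ["CDC2", "cyclin dependent kinase 1"],
    "SDS": ["SDS"],
}

# Names that are too ambiguous in running text to be tagged at all.
BLACKLIST = {"sds"}


def name_pattern(name):
    """Build a case-insensitive regex that tolerates spelling variation."""
    tokens = re.split(r"[\s-]+", name)
    # Flexible matching: a space, a hyphen, or nothing between tokens.
    body = r"[\s-]?".join(re.escape(token) for token in tokens)
    # Expansion rule (assumed): single-token symbols may carry an "h" prefix.
    prefix = "h?" if len(tokens) == 1 else ""
    return re.compile(
        r"(?<![A-Za-z0-9])" + prefix + body + r"(?![A-Za-z0-9])",
        re.IGNORECASE,
    )


def tag(text):
    """Return sorted (offset, matched text, identifier) tuples."""
    found = []
    for identifier, names in LEXICON.items():
        for name in names:
            if name.lower() in BLACKLIST:
                continue
            for match in name_pattern(name).finditer(text):
                found.append((match.start(), match.group(0), identifier))
    return sorted(found)


text = "hCdc2, also called cyclin-dependent kinase 1, was separated by SDS-PAGE."
matches = tag(text)
for offset, surface, identifier in matches:
    print(offset, surface, identifier)
```

Both spellings resolve to the same identifier, and the black-listed "SDS" in "SDS-PAGE" is left untagged.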

co-mentioning

counting

within documents

within paragraphs

within sentences
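
Counting co-mentions at the three levels could look roughly like the sketch below. The name list, the toy document, and the naive paragraph and sentence splitting are assumptions for illustration; how the three counts are weighted into a final score is not stated on the slides and is left out.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical names to look for; a real system would use the full lexicon.
NAMES = ["CYC1", "CYC7", "HAP1"]


def mentions(text):
    """Names that occur at least once in the text."""
    return {n for n in NAMES if re.search(r"\b" + re.escape(n) + r"\b", text)}


def count_comentions(documents):
    """Count co-mentioned name pairs per document, paragraph, and sentence."""
    counts = {level: Counter() for level in ("document", "paragraph", "sentence")}
    for doc in documents:
        paragraphs = [p for p in doc.split("\n\n") if p.strip()]
        # Naive sentence splitting on end punctuation followed by whitespace.
        sentences = [
            s for p in paragraphs
            for s in re.split(r"(?<=[.!?])\s+", p) if s.strip()
        ]
        units_by_level = {
            "document": [doc],
            "paragraph": paragraphs,
            "sentence": sentences,
        }
        for level, units in units_by_level.items():
            for unit in units:
                for pair in combinations(sorted(mentions(unit)), 2):
                    counts[level][pair] += 1
    return counts


doc = (
    "The expression of CYC1 and CYC7 is controlled by HAP1. This is well known."
    "\n\n"
    "CYC1 encodes a cytochrome. HAP1 is a regulator."
)
counts = count_comentions([doc])
for level, counter in counts.items():
    print(level, dict(counter))
```

In the toy document, CYC1 and HAP1 are co-mentioned in two paragraphs but in only one sentence, which shows why the three levels carry different amounts of evidence.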

natural language processing

Gene and protein names

Cue words for entity recognition

Verbs for relation extraction

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
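
Once a sentence has been chunked like this, a relation can be read off with a simple pattern on the verb phrase. The sketch below handles only the single passive pattern "is controlled by" and assumes that `nxpg` chunks hold gene or protein names; both are illustrative assumptions, and a real pipeline would cover many more verbs and constructions.

```python
import re

# Innermost chunks assumed to contain gene/protein names.
NXPG = re.compile(r"\[nxpg ([^\[\]]+)\]")


def names_in(chunked_text):
    """Split every nxpg chunk on 'and' or commas into individual names."""
    names = []
    for chunk in NXPG.findall(chunked_text):
        names.extend(n for n in re.split(r"\s+and\s+|\s*,\s*", chunk) if n)
    return names


def extract_regulation(chunked_sentence, verb="is controlled by"):
    """Return (regulator, target) pairs for the pattern 'X is controlled by Y'."""
    left, separator, right = chunked_sentence.partition(verb)
    if not separator:
        return []
    # Passive voice: the targets precede the verb, the regulators follow it.
    return [(reg, tgt) for reg in names_in(right) for tgt in names_in(left)]


sentence = (
    "[nxexpr The expression of [nxgene the cytochrome genes "
    "[nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]"
)
relations = extract_regulation(sentence)
print(relations)
```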

text corpus

~2 million full-text articles

~22 million abstracts

Exercise 1

Go to http://string-db.org

Query for Mt H37Rv adhD (Rv3086)

Change between different views

Check evidence for adhD–lipR link

Extend network to 50 interactors
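
The exercise uses the web interface, but a similar query can be made programmatically. The sketch below builds a request URL for the STRING REST API; the `interaction_partners` method and the `identifiers`, `species`, and `limit` parameters follow the public API as I understand it and should be checked against the current documentation. The taxon identifier 83332 for M. tuberculosis H37Rv and the helper names are assumptions of this example, not something stated on the slides.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

STRING_API = "https://string-db.org/api"


def partners_url(identifiers, species, limit=None, output_format="tsv"):
    """Build a URL asking STRING for interaction partners of the given proteins."""
    params = {
        # Multiple identifiers are separated by a carriage return (%0D).
        "identifiers": "\r".join(identifiers),
        "species": species,
    }
    if limit is not None:
        params["limit"] = limit
    return f"{STRING_API}/{output_format}/interaction_partners?{urlencode(params)}"


def fetch(url):
    """Download the response as text (requires network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


# adhD (Rv3086) in M. tuberculosis H37Rv, extended to 50 interactors.
url = partners_url(["Rv3086"], species=83332, limit=50)
print(url)
# table = fetch(url)  # uncomment to run the query
```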

Exercise 2

Go to the paper PMC2995261

Extract the protein names in table 1

Create STRING network of them

Change to “advanced” mode

Analyze for clusters and enrichment
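
Enrichment analysis asks whether an annotation term occurs in a protein list more often than expected by chance. A common way to test this is a one-sided hypergeometric test, sketched below; the numbers are invented, and the slides do not say which statistic or multiple-testing correction STRING applies.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """Probability of seeing at least k annotated proteins by chance.

    k: annotated proteins in the query list
    n: size of the query list
    K: annotated proteins in the background
    N: size of the background
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical example: 3 of 3 query proteins carry a term that
# annotates only 3 of the 10 proteins in the background.
p = hypergeom_pvalue(k=3, n=3, K=3, N=10)
print(f"p = {p:.4f}")
```

When many terms are tested at once, the resulting p-values need a multiple-testing correction before they are interpreted.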

multi-page tables

related resources

general approach

curated knowledge

experimental data

text mining

computational predictions

common identifiers

quality scores

score calibration

visualization

protein networks

string-db.org

chemical networks

stitch-db.org

subcellular localization

compartments.jensenlab.org

tissue expression

tissues.jensenlab.org

disease associations

diseases.jensenlab.org

Work on your own data

string-db.org

stitch-db.org

compartments.jensenlab.org

tissues.jensenlab.org

diseases.jensenlab.org
