Data Mining - University of Virginia · people.virginia.edu/~wrp/cshl06/pdf/birney_ensmart.pdf



EnsMart

Ewan Birney, European Bioinformatics Institute (EBI)

Data Mining...

• ...Is more than a buzz-word

• Most molecular biology is moving away from one-gene-at-a-time approaches

• Needs to make and work with Gene Lists


Complex Disease Association

GenomeScan

Linkage peaks

Animal models

Candidate Genes

SNP choosing and validation

Patient vs Control association studies

MicroArray

Experiment done on Platform A: interesting spots from A

Experiment done on Platform B: interesting spots from B

Integrate spots to form Gene Set

Dump orthologs and promoters
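
The "integrate spots to form gene set" step can be sketched as below. This is only an illustration: the spot-to-gene mappings, the spot names and the gene names are all made up, and whether the gene set should be the union or the intersection of the two platforms depends on the study, so the sketch returns both.

```python
# Hypothetical spot-to-gene annotation for two array platforms.
# In practice these mappings would come from the platform annotation files.
PLATFORM_A_TO_GENE = {"A_spot_1": "GENE_A", "A_spot_2": "GENE_B", "A_spot_3": "GENE_C"}
PLATFORM_B_TO_GENE = {"B_spot_9": "GENE_B", "B_spot_7": "GENE_D"}


def spots_to_genes(spots, spot_to_gene):
    """Map spot IDs to gene IDs, dropping spots that have no known gene."""
    return {spot_to_gene[s] for s in spots if s in spot_to_gene}


def integrate(spots_a, spots_b):
    """Combine interesting spots from two platforms into gene sets."""
    genes_a = spots_to_genes(spots_a, PLATFORM_A_TO_GENE)
    genes_b = spots_to_genes(spots_b, PLATFORM_B_TO_GENE)
    return {"either": genes_a | genes_b, "both": genes_a & genes_b}


gene_sets = integrate(["A_spot_1", "A_spot_2", "A_spot_x"], ["B_spot_9", "B_spot_7"])
print(sorted(gene_sets["either"]))
print(sorted(gene_sets["both"]))
```

Working at the gene level rather than the spot level is what makes results from different platforms comparable at all.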


Computational Screen

Dump Gene regions from set of protein kinase genes

Formulate Clever method in house

Display results in Genome context

On biologist-friendly web display

Data mining problems...

• You need all the data in one place to provide data

• The natural language for database queries (SQL) is not... so natural!

• Often SQL queries are very slow on normalised databases

• Often there is additional analysis which needs to occur


EnsMart

• SQL queries are slow:

– transform the data into a query optimised, read-only database

• Additional analysis is needed

– Precompute additional analysis for all items (disk space is cheap!)

• You need all the data in one place

– Federate databases (BioMart)

Normalised databases

[Diagram: tables Gene, Transcripts, Exons, Sequence and External Reference, linked by one-to-many (>1) relationships]

Five (six) table join for “genes with this set of Affymetrix IDs on this chromosome”
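
To make the cost of such a query concrete, here is a minimal sketch of a five-table join using Python's built-in `sqlite3`. The schema is a hypothetical, heavily simplified stand-in and not the real Ensembl schema; every table name, column name and identifier is an assumption made for illustration.

```python
import sqlite3

# Hypothetical, much-simplified normalised schema (NOT the real Ensembl tables).
SCHEMA = """
CREATE TABLE seq_region (seq_region_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, stable_id TEXT, seq_region_id INTEGER);
CREATE TABLE transcript (transcript_id INTEGER PRIMARY KEY, gene_id INTEGER);
CREATE TABLE xref (xref_id INTEGER PRIMARY KEY, source TEXT, accession TEXT);
CREATE TABLE transcript_xref (transcript_id INTEGER, xref_id INTEGER);
"""


def build_normalised_db():
    """Create an in-memory database holding a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO seq_region VALUES (?, ?)", [(1, "1"), (2, "7")])
    conn.executemany("INSERT INTO gene VALUES (?, ?, ?)",
                     [(10, "GENE_A", 1), (11, "GENE_B", 1), (12, "GENE_C", 2)])
    conn.executemany("INSERT INTO transcript VALUES (?, ?)",
                     [(100, 10), (101, 10), (102, 11), (103, 12)])
    conn.executemany("INSERT INTO xref VALUES (?, ?, ?)",
                     [(1000, "affy", "probe_1"), (1001, "affy", "probe_2"),
                      (1002, "affy", "probe_3")])
    conn.executemany("INSERT INTO transcript_xref VALUES (?, ?)",
                     [(100, 1000), (101, 1000), (102, 1001), (103, 1002)])
    return conn


def genes_for_probes_on_chromosome(conn, probe_ids, chromosome):
    """Genes carrying any of the given probe IDs on one chromosome."""
    placeholders = ",".join("?" for _ in probe_ids)
    sql = f"""
        SELECT DISTINCT g.stable_id
        FROM xref x
        JOIN transcript_xref tx ON tx.xref_id = x.xref_id
        JOIN transcript t ON t.transcript_id = tx.transcript_id
        JOIN gene g ON g.gene_id = t.gene_id
        JOIN seq_region s ON s.seq_region_id = g.seq_region_id
        WHERE x.source = 'affy'
          AND x.accession IN ({placeholders})
          AND s.name = ?
        ORDER BY g.stable_id
    """
    return [row[0] for row in conn.execute(sql, [*probe_ids, chromosome])]


conn = build_normalised_db()
result = genes_for_probes_on_chromosome(conn, ["probe_1", "probe_3"], "1")
print(result)
```

Even in this toy form the question "which genes?" has to walk through every intermediate table, which is the behaviour the slide is complaining about.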


Mart Transformation

Normalised to query optimised (reverse star schema)
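
A rough sketch of what such a transformation could look like, again with `sqlite3`. It assumes a simplified source schema and invents the table names `gene_main` and `gene_affy_dm`; it shows the general idea (one wide, precomputed main table per focus object plus a dimension table per one-to-many attribute), not the actual EnsMart build process.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalised source tables (hypothetical, simplified, made-up rows).
CREATE TABLE seq_region (seq_region_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, stable_id TEXT, seq_region_id INTEGER);
CREATE TABLE transcript (transcript_id INTEGER PRIMARY KEY, gene_id INTEGER);
CREATE TABLE transcript_affy (transcript_id INTEGER, probe TEXT);

INSERT INTO seq_region VALUES (1, '1'), (2, '7');
INSERT INTO gene VALUES (10, 'GENE_A', 1), (11, 'GENE_B', 1), (12, 'GENE_C', 2);
INSERT INTO transcript VALUES (100, 10), (101, 10), (102, 11), (103, 12);
INSERT INTO transcript_affy VALUES
    (100, 'probe_1'), (101, 'probe_1'), (102, 'probe_2'), (103, 'probe_3');

-- Mart build: one wide main table per focus object (gene), with the
-- chromosome name and a precomputed transcript count stored on the row.
CREATE TABLE gene_main AS
SELECT g.gene_id, g.stable_id, s.name AS chromosome,
       COUNT(t.transcript_id) AS transcript_count
FROM gene g
JOIN seq_region s ON s.seq_region_id = g.seq_region_id
LEFT JOIN transcript t ON t.gene_id = g.gene_id
GROUP BY g.gene_id;

-- One dimension table per one-to-many attribute, keyed directly by gene_id.
CREATE TABLE gene_affy_dm AS
SELECT DISTINCT t.gene_id, a.probe
FROM transcript_affy a
JOIN transcript t ON t.transcript_id = a.transcript_id;

CREATE INDEX idx_gene_affy_probe ON gene_affy_dm (probe);
""")

# The same question now needs a single join against the main table.
rows = conn.execute("""
    SELECT m.stable_id, m.transcript_count
    FROM gene_main m
    JOIN gene_affy_dm d ON d.gene_id = m.gene_id
    WHERE d.probe IN ('probe_1', 'probe_3') AND m.chromosome = '1'
""").fetchall()
print(rows)
```

The trade-off is the one the slides name: the mart duplicates data and must be rebuilt when the source changes, which is acceptable for a read-only database when disk space is cheap.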

Web User interface

• Web based

• Wizard like

• “dataset” (focus)

• “filter” - restrictions

• output

– columns to show

– sequence


Set based work

• EnsMart can export sets of IDs (Ensembl, Affymetrix, Uniprot)...

• EnsMart can also filter on a given set of IDs

– (give me all the chromosome locations of genes defined by my Affymetrix information)
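
The set-based pattern (filter on one ID type, export another) can be sketched in a few lines of plain Python. This does not use the EnsMart interface; `query_mart`, the field names and every row in `MART_ROWS` are hypothetical and exist only to show the shape of the operation.

```python
# Toy mart rows; all identifiers and coordinates are made up for illustration.
MART_ROWS = [
    {"ensembl_id": "GENE_A", "affy_id": "probe_1", "uniprot_id": "PROT_A",
     "chromosome": "1", "start": 1000, "end": 5000},
    {"ensembl_id": "GENE_B", "affy_id": "probe_2", "uniprot_id": "PROT_B",
     "chromosome": "1", "start": 8000, "end": 9500},
    {"ensembl_id": "GENE_C", "affy_id": "probe_3", "uniprot_id": "PROT_C",
     "chromosome": "7", "start": 200, "end": 900},
    {"ensembl_id": "GENE_A", "affy_id": "probe_4", "uniprot_id": "PROT_A",
     "chromosome": "1", "start": 1000, "end": 5000},
]


def query_mart(rows, filter_field, id_set, output_fields):
    """Keep rows whose filter_field is in id_set; return unique output tuples in row order."""
    wanted = set(id_set)
    seen = set()
    results = []
    for row in rows:
        if row[filter_field] not in wanted:
            continue
        record = tuple(row[field] for field in output_fields)
        if record not in seen:  # several probes may hit the same gene
            seen.add(record)
            results.append(record)
    return results


my_probes = {"probe_1", "probe_3", "probe_4"}

# Filter on a set of Affymetrix IDs, output chromosome locations.
locations = query_mart(MART_ROWS, "affy_id", my_probes,
                       ["ensembl_id", "chromosome", "start", "end"])

# Same filter, but export a different ID type for use elsewhere.
uniprot_ids = [r[0] for r in query_mart(MART_ROWS, "affy_id", my_probes, ["uniprot_id"])]

print(locations)
print(uniprot_ids)
```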

BioMart

Ensembl specific

Only runs from

www.ensembl.org

Made generic

Multiple installations

Query federation (Arek Kasprzyk)


BioMart

[Diagram: Any schema, data-mart schema, user interface; Mart Builder and Mart config (XML specification) sit between the stages]

BioMart

• www.ebi.ac.uk/biomart

• Google for BioMart

• Ensembl

• Uniprot

• MSD structures


BioMart future...

• ArrayExpress (gene expression dataset)

• WormBase

• ...others

Cross-Internet Mart…

WWW

Mutant Stock Sample Mart

Ensembl Genome Mart

Firewall

Array Express Expression Atlas Mart

Mart Query Building Software

Give me all the genes mapped within phenotype X in my samples that are also at least 4 fold upregulated in kidney. Give me all the ht SNPs in all the genes…
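
One way to picture such a cross-mart query: each mart answers its own sub-question with a set of gene IDs, and the query-building software intersects the sets. The sketch below is an assumption about the general shape only. The `Mart` class, the `federated_query` helper and all rows are invented for illustration, and real marts would be queried over the network rather than held in memory.

```python
from dataclasses import dataclass


@dataclass
class Mart:
    """A stand-in for one independently hosted mart (hypothetical)."""
    name: str
    rows: list

    def gene_ids(self, predicate):
        """Return the gene IDs of all rows satisfying the predicate."""
        return {row["gene_id"] for row in self.rows if predicate(row)}


# Made-up contents for a local sample mart and a remote expression mart.
sample_mart = Mart("mutant_stock_samples", [
    {"gene_id": "GENE_A", "phenotype": "X"},
    {"gene_id": "GENE_B", "phenotype": "X"},
    {"gene_id": "GENE_C", "phenotype": "Y"},
])
expression_mart = Mart("expression_atlas", [
    {"gene_id": "GENE_A", "tissue": "kidney", "fold_change": 5.2},
    {"gene_id": "GENE_B", "tissue": "kidney", "fold_change": 1.3},
    {"gene_id": "GENE_C", "tissue": "kidney", "fold_change": 8.0},
    {"gene_id": "GENE_A", "tissue": "liver", "fold_change": 0.9},
])


def federated_query(*subqueries):
    """Intersect the gene ID sets returned by each (mart, predicate) pair."""
    result = None
    for mart, predicate in subqueries:
        ids = mart.gene_ids(predicate)
        result = ids if result is None else result & ids
    return result or set()


# Genes mapped within phenotype X in my samples that are also
# at least 4 fold upregulated in kidney.
hits = federated_query(
    (sample_mart, lambda r: r["phenotype"] == "X"),
    (expression_mart, lambda r: r["tissue"] == "kidney" and r["fold_change"] >= 4),
)
print(sorted(hits))
```

Because only ID sets cross between marts, the private sample data can stay behind the firewall while public marts contribute their part of the answer.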