data management for quantitative biology - database systems, may 7, 2015, dr. marius codrea

56
Dr. Sven Nahnsen/Dr. Marius Codrea, Quantitative Biology Center (QBiC) Data Management for Quantitative Biology Lecture 4: Database systems

Upload: qbictue

Post on 07-Aug-2015

145 views

Category:

Education


1 download

TRANSCRIPT

Page 1: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Dr. Sven Nahnsen/Dr. Marius Codrea,

Quantitative Biology Center (QBiC)

Data Management for Quantitative Biology

Lecture 4: Database systems

Page 2: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Database systems in modern data-driven

biomedical research

I. Typical research scenario

From samples to data

From data to databases and back

From data/databases to information

II. How to?

Page 3: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Typical research scenario

What gene(s) / protein(s) are responsible for a specific disease

Page 4: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Typical research scenarios

From samples to data

Page 5: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Millions of reads in fastq format

Page 6: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Alignment/mapping to the target genome

Which organism?

Where to get it from?

Which version?

Annotations?

Databases & Repositories

Page 7: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Digitization of biological samples

Similar for other omics technologies

Next GenerationSequencing

f(x):->{1,0}^n

Page 8: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Shotgun Proteomics

8

Page 9: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Tousands of raw spectra

Page 10: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

LC-MS/MS experiment Fragment m/z values

Theoretical fragment m/z

Compare

Q9NSC5|HOME3_HUMAN Homer protein homolog 3 - Homo sapiens (Human)

MSTAREQPIFSTRAHVFQIDPATKRNWIPAGKHALTVSYFYDATRNVYRIISIGGAKA

IINSTVTPNMTFTKTSQKFGQWDSRANTVYGLGFASEQHLTQFAEKFQEVKEAAR

LAREKSQDGGELTSPALGLASHQVPPSPLVSANGPGEEKLFRSQSADAPGPTERER

LKKMLSEGSVGEVQWEAEFFALQDSNNKLAGALREANAAAAQWRQQLEAQRAE

AERLRQRVAELEAQAASEVTPTGEKEGLGQGQSLEQLEALVQTKDQEIQTLKSQT

GGPREALEAAEREETQQKVQDLETRNAELEHQLRAMERSLEEARAERERARAEV

GRAAQLLDVSLFELSELREGLARLAEAAP

569.24

572.33

580.30

581.46

582.63

606.32

610.24

616.14

569.24

572.33

580.30

581.46

582.63

606.32

610.24

616.14

569.24

574.83

580.70

580.92

579.99

603.92

611.14

616.74

569.24

572.33

580.30

581.46

582.63

606.32

610.24

616.14

569.24

572.33

580.30

581.46

582.63

606.32

610.24

616.14

569.24

572.33

580.30

581.46

582.63

606.32

610.24

616.14

569.24

572.33

580.30

581.46

582.63

606.32

610.24

616.14

1 QRESTATDILQK 18.77

2 EIEEDSLEGLKK 14.78

Score hits

Theoretical spectra

m/z

[%]

m/z

[%]

m/z

[%]

Experimental spectra

m/z

RT

Protein Sequence

Peptide identification

Page 11: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Protein target DATABASE

Page 12: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Example Resources

● ENSEMBL + BioMart● http://www.ensembl.org

● National Center for Biotechnology Information (NCBI) + BLAST● http://www.ncbi.nlm.nih.gov/

● The European Bioinformatics Institute (EMBL-EBI) + Clustal● http://www.ebi.ac.uk/services

● UniProt● http://www.uniprot.org/

● University of California, Santa Cruz (UCSC)● https://genome.ucsc.edu/

Page 13: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Resources

Page 14: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea
Page 15: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea
Page 16: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea
Page 17: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea
Page 18: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea
Page 19: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea
Page 20: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea
Page 21: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Example Repositories

Page 22: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea
Page 23: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea
Page 24: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea
Page 25: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

https://genevestigator.com/gv/

Page 26: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Selected database systems

I. Relational databases

MySQL

II.NoSQL databases

MongoDB

Page 27: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Database systems in modern data-driven

biomedical research

I. Typical research scenarios

1.From samples to data (numbers in a file) = digitization

2.From data to databases and back

= use resources (e.g., human genome) and contribute to

repositories (data + scientific observations + metadata)

3.From data/databases to information

= data mining

II. How to?

Page 28: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Many database design & concepts

http://dataconomy.com/wp-content/uploads/2014/07/fig2large.jpg28

Page 29: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Databases

DB = "A database is an organized collection of data" http://en.wikipedia.org/wiki/Database

DB = DB + data model for the application at hand (business logic) + implementation

DB = DB + database management system (DBMS). Software than enables:

29

CRUD

• Create entries

• Read (retrieve)

• Update / edit

• Delete

DB = DB + Administration (User privilages, monitoring)

Page 30: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Selected database systems

I. Relational databases

MySQL

II.NoSQL databases

MongoDB

Page 31: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Specific characteristics MongoDB vs MySQL

More details here: http://db-engines.com/en/system/MongoDB%3BMySQL

System Property MongoDB MySQL

Initial release 2009 1995

Current release 3.0.2, April 2015 5.6.24, April 2015

Triggers No Yes

MapReduce Yes No

Foreign keys No Yes

Transaction concepts No ACID*

*A database transaction, must be Atomic, Consistent, Isolated and Durable.

Page 32: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Relational databases

● A plausible experiment where samples are collected from different mice before and after some treatment

● High redundancy

● Cumbersome to maintain/update

Page 33: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Mice tableSamples table

● Split the data into RELATED tables● Low redundancy● Easier to maintain/update (e.g., add some genotype information to mice)

1:M one-to-many relationship

Relational databases (Normalization)

Page 34: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Terminology

Fields

Record 1

Record 6

Primary keyPrimary

key

Foreign KeyRef

Mice.Mouse_number

● Table rows are called "records"● Table columns are called "fields"● The values of the primary keys uniquely identifies the rows of the table● The foreign key uniquely links the rows of the host table to 1 record in the referencing table

Mice table

Samples table

Page 35: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Databases

DB = "A database is an organized collection of data" http://en.wikipedia.org/wiki/Database

DB = DB + data model for the application at hand (business logic) + implementation

DB = DB + database management system (DBMS). Software than enables:

35

CRUD

• Create entries

• Read (retrieve or search)

• Update / edit

• Delete

DB = DB + Administration (User privilages, monitoring)

Page 36: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Structured Query Language (SQL)

● SQL is a standard language for creating, accessing and modifying relational databases

● MySQL implements SQL database management

Page 37: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Connect to the server and create the database

mysql ­u username ­p ­h localhost

CREATE database mouse_experiment;

USE mouse_experiment;

Page 38: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Create the tables and insert values

CREATE TABLE mice (  Mouse_number SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,  Gender enum('Male','Female','NA') DEFAULT 'Female',  Age decimal(4,2) DEFAULT NULL,  Treatment VARCHAR(50) NOT NULL,  PRIMARY KEY (Mouse_number) );

INSERT INTO mice (Gender, Age, Treatment) VALUES ('Male','3','Vitamin A'),('Male','2','Vitamin B'),('Female','2.5','Vitamin A'),('Female','3','Vitamin B'),('Male','4','Vitamin A'),('Female','2','Vitamin B');

No Mouse_number! It is the task of the DBMS to generate UNIQUE id's

Page 39: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Create the tables and insert values

CREATE TABLE samples (  Sample_ID SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,  Mouse_number SMALLINT UNSIGNED NOT NULL,  Timepoint VARCHAR(15) NOT NULL,  PRIMARY KEY (Sample_ID),  FOREIGN KEY (Mouse_number)         REFERENCES mice(Mouse_number)        ON DELETE CASCADE )ENGINE=InnoDB DEFAULT CHARSET=utf8;

Page 40: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Create the tables and insert values

Page 41: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Queries

SELECT field1, field2,...fieldN from table_name[WHERE Clause][OFFSET M ][LIMIT N]

SELECT * from table_name[WHERE Clause][OFFSET M ][LIMIT N]    

Generic syntax

Page 42: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Queries

SELECT * , COUNT(*) as count_per_gender from mice group by Gender, Treatment;

“How many males and how many females per treatment?”

Page 43: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

JOIN queries

SELECT Sample_ID, Treatment, Timepoint, mice.Mouse_number from samples join mice 

on samples.Mouse_number = mice.Mouse_number where mice.Mouse_number=2; 

“What samples do I have from mouse number 2?”

Page 44: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Delete queries

“Mouse number 3 went wrong. Let's just delete it.”

SELECT * from samples;

DELETE from mice where Mouse_number = 3;

SELECT * from samples;

Where are these two samples gone?

Page 45: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

They were deleted!

CREATE TABLE samples (  Sample_ID SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,  Mouse_number SMALLINT UNSIGNED NOT NULL,  Timepoint VARCHAR(15) NOT NULL,  PRIMARY KEY (Sample_ID),  FOREIGN KEY (Mouse_number)         REFERENCES mice(Mouse_number)        ON DELETE CASCADE )ENGINE=InnoDB DEFAULT CHARSET=utf8;

Does it make sense not to leave “orphan” samples in the system?

Page 46: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Extending the database

Mice table

Samples table1:M one-to-many relationship

Protein tableM:M many-to-many relationship

Page 47: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Indexing

● Ultimately, the data is stored in files on disks

● With large amounts of data (tens of million of records), sequential searching becomes not feasible

● MySQL uses B-trees

● “In computer science, a B-tree is a tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time.” http://en.wikipedia.org/wiki/B-tree

Page 48: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Indexing B-trees

Source: http://en.wikipedia.org/wiki/B-tree

Page 49: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Summary

● Database design requires domain knowledge, including example usecases

● Normalization

● Primary and Foreign Keys

● Implement a MySQL database

● Queries & Join Queries

● Indices

Page 50: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea
Page 51: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea
Page 52: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea
Page 53: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea

Let us revisit ENSEMBL

Page 54: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea
Page 55: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea
Page 56: Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr. Marius Codrea