data management for quantitative biology - database systems, may 7, 2015, dr. marius codrea

Dr. Sven Nahnsen/Dr. Marius Codrea,

Quantitative Biology Center (QBiC)

Data Management for Quantitative Biology

Lecture 4: Database systems

Database systems in modern data-driven

biomedical research

I. Typical research scenario

From samples to data

From data to databases and back

From data/databases to information

II. How to?

Typical research scenario

What gene(s) / protein(s) are responsible for a specific disease

Typical research scenarios

From samples to data

Millions of reads in fastq format

Alignment/mapping to the target genome

Which organism?

Where to get it from?

Which version?

Annotations?

Databases & Repositories

Digitization of biological samples

Similar for other omics technologies

Next GenerationSequencing

f(x):->{1,0}^n

Shotgun Proteomics

8

Tousands of raw spectra

LC-MS/MS experiment Fragment m/z values

Theoretical fragment m/z

Compare

Q9NSC5|HOME3_HUMAN Homer protein homolog 3 - Homo sapiens (Human)

MSTAREQPIFSTRAHVFQIDPATKRNWIPAGKHALTVSYFYDATRNVYRIISIGGAKA

IINSTVTPNMTFTKTSQKFGQWDSRANTVYGLGFASEQHLTQFAEKFQEVKEAAR

LAREKSQDGGELTSPALGLASHQVPPSPLVSANGPGEEKLFRSQSADAPGPTERER

LKKMLSEGSVGEVQWEAEFFALQDSNNKLAGALREANAAAAQWRQQLEAQRAE

AERLRQRVAELEAQAASEVTPTGEKEGLGQGQSLEQLEALVQTKDQEIQTLKSQT

GGPREALEAAEREETQQKVQDLETRNAELEHQLRAMERSLEEARAERERARAEV

GRAAQLLDVSLFELSELREGLARLAEAAP

569.24

572.33

580.30

581.46

582.63

606.32

610.24

616.14

569.24

572.33

580.30

581.46

582.63

606.32

610.24

616.14

569.24

574.83

580.70

580.92

579.99

603.92

611.14

616.74

569.24

572.33

580.30

581.46

582.63

606.32

610.24

616.14

569.24

572.33

580.30

581.46

582.63

606.32

610.24

616.14

569.24

572.33

580.30

581.46

582.63

606.32

610.24

616.14

569.24

572.33

580.30

581.46

582.63

606.32

610.24

616.14

1 QRESTATDILQK 18.77

2 EIEEDSLEGLKK 14.78

Score hits

Theoretical spectra

m/z

[%]

m/z

[%]

m/z

[%]

Experimental spectra

m/z

RT

Protein Sequence

Peptide identification

Protein target DATABASE

Example Resources

● ENSEMBL + BioMart● http://www.ensembl.org

● National Center for Biotechnology Information (NCBI) + BLAST● http://www.ncbi.nlm.nih.gov/

● The European Bioinformatics Institute (EMBL-EBI) + Clustal● http://www.ebi.ac.uk/services

● UniProt● http://www.uniprot.org/

● University of California, Santa Cruz (UCSC)● https://genome.ucsc.edu/

Resources

Example Repositories

https://genevestigator.com/gv/

Selected database systems

I. Relational databases

MySQL

II.NoSQL databases

MongoDB

Database systems in modern data-driven

biomedical research

I. Typical research scenarios

1.From samples to data (numbers in a file) = digitization

2.From data to databases and back

= use resources (e.g., human genome) and contribute to

repositories (data + scientific observations + metadata)

3.From data/databases to information

= data mining

II. How to?

Many database design & concepts

http://dataconomy.com/wp-content/uploads/2014/07/fig2large.jpg28

Databases

DB = "A database is an organized collection of data" http://en.wikipedia.org/wiki/Database

DB = DB + data model for the application at hand (business logic) + implementation

DB = DB + database management system (DBMS). Software than enables:

29

CRUD

• Create entries

• Read (retrieve)

• Update / edit

• Delete

DB = DB + Administration (User privilages, monitoring)

http://en.wikipedia.org/wiki/Database

Selected database systems

I. Relational databases

MySQL

II.NoSQL databases

MongoDB

Specific characteristics MongoDB vs MySQL

More details here: http://db-engines.com/en/system/MongoDB%3BMySQL

System Property MongoDB MySQL

Initial release 2009 1995

Current release 3.0.2, April 2015 5.6.24, April 2015

Triggers No Yes

MapReduce Yes No

Foreign keys No Yes

Transaction concepts No ACID*

*A database transaction, must be Atomic, Consistent, Isolated and Durable.

Relational databases

● A plausible experiment where samples are collected from different mice before and after some treatment

● High redundancy

● Cumbersome to maintain/update

Mice tableSamples table

● Split the data into RELATED tables● Low redundancy● Easier to maintain/update (e.g., add some genotype information to mice)

1:M one-to-many relationship

Relational databases (Normalization)

Terminology

Fields

Record 1

Record 6

Primary keyPrimary

key

Foreign KeyRef

Mice.Mouse_number

● Table rows are called "records"● Table columns are called "fields"● The values of the primary keys uniquely identifies the rows of the table● The foreign key uniquely links the rows of the host table to 1 record in the referencing table

Mice table

Samples table

Databases

DB = "A database is an organized collection of data" http://en.wikipedia.org/wiki/Database

DB = DB + data model for the application at hand (business logic) + implementation

DB = DB + database management system (DBMS). Software than enables:

35

CRUD

• Create entries

• Read (retrieve or search)

• Update / edit

• Delete

DB = DB + Administration (User privilages, monitoring)

http://en.wikipedia.org/wiki/Database

Structured Query Language (SQL)

● SQL is a standard language for creating, accessing and modifying relational databases

● MySQL implements SQL database management

Connect to the server and create the database

mysql u username p h localhost

CREATE database mouse_experiment;

USE mouse_experiment;

Create the tables and insert values

CREATE TABLE mice ( Mouse_number SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT, Gender enum('Male','Female','NA') DEFAULT 'Female', Age decimal(4,2) DEFAULT NULL, Treatment VARCHAR(50) NOT NULL, PRIMARY KEY (Mouse_number) );

INSERT INTO mice (Gender, Age, Treatment) VALUES ('Male','3','Vitamin A'),('Male','2','Vitamin B'),('Female','2.5','Vitamin A'),('Female','3','Vitamin B'),('Male','4','Vitamin A'),('Female','2','Vitamin B');

No Mouse_number! It is the task of the DBMS to generate UNIQUE id's


CREATE TABLE samples ( Sample_ID SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT, Mouse_number SMALLINT UNSIGNED NOT NULL, Timepoint VARCHAR(15) NOT NULL, PRIMARY KEY (Sample_ID), FOREIGN KEY (Mouse_number) REFERENCES mice(Mouse_number) ON DELETE CASCADE )ENGINE=InnoDB DEFAULT CHARSET=utf8;

Queries

SELECT field1, field2,...fieldN from table_name[WHERE Clause][OFFSET M ][LIMIT N]

SELECT * from table_name[WHERE Clause][OFFSET M ][LIMIT N]

Generic syntax

Queries

SELECT * , COUNT(*) as count_per_gender from mice group by Gender, Treatment;

“How many males and how many females per treatment?”

JOIN queries

SELECT Sample_ID, Treatment, Timepoint, mice.Mouse_number from samples join mice

on samples.Mouse_number = mice.Mouse_number where mice.Mouse_number=2;

“What samples do I have from mouse number 2?”

Delete queries

“Mouse number 3 went wrong. Let's just delete it.”

SELECT * from samples;

DELETE from mice where Mouse_number = 3;

SELECT * from samples;

Where are these two samples gone?

They were deleted!

CREATE TABLE samples ( Sample_ID SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT, Mouse_number SMALLINT UNSIGNED NOT NULL, Timepoint VARCHAR(15) NOT NULL, PRIMARY KEY (Sample_ID), FOREIGN KEY (Mouse_number) REFERENCES mice(Mouse_number) ON DELETE CASCADE )ENGINE=InnoDB DEFAULT CHARSET=utf8;

Does it make sense not to leave “orphan” samples in the system?

Extending the database

Mice table

Samples table1:M one-to-many relationship

Protein tableM:M many-to-many relationship

Indexing

● Ultimately, the data is stored in files on disks

● With large amounts of data (tens of million of records), sequential searching becomes not feasible

● MySQL uses B-trees

● “In computer science, a B-tree is a tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time.” http://en.wikipedia.org/wiki/B-tree

Indexing B-trees

Source: http://en.wikipedia.org/wiki/B-tree

Summary

● Database design requires domain knowledge, including example usecases

● Normalization

● Primary and Foreign Keys

● Implement a MySQL database

● Queries & Join Queries

● Indices

Let us revisit ENSEMBL

data management for quantitative biology - database systems, may 7, 2015, dr. marius codrea

Education

modern data

db data model

databases db

data numbers

data mining

selected database systems

protein target database

databases repositories