data management for quantitative biology - database systems, may 7, 2015, dr. marius codrea
TRANSCRIPT
Dr. Sven Nahnsen/Dr. Marius Codrea,
Quantitative Biology Center (QBiC)
Data Management for Quantitative Biology
Lecture 4: Database systems
Database systems in modern data-driven
biomedical research
I. Typical research scenario
From samples to data
From data to databases and back
From data/databases to information
II. How to?
Typical research scenario
What gene(s) / protein(s) are responsible for a specific disease
Typical research scenarios
From samples to data
Millions of reads in fastq format
Alignment/mapping to the target genome
Which organism?
Where to get it from?
Which version?
Annotations?
Databases & Repositories
Digitization of biological samples
Similar for other omics technologies
Next GenerationSequencing
f(x):->{1,0}^n
Shotgun Proteomics
8
Tousands of raw spectra
LC-MS/MS experiment Fragment m/z values
Theoretical fragment m/z
Compare
Q9NSC5|HOME3_HUMAN Homer protein homolog 3 - Homo sapiens (Human)
MSTAREQPIFSTRAHVFQIDPATKRNWIPAGKHALTVSYFYDATRNVYRIISIGGAKA
IINSTVTPNMTFTKTSQKFGQWDSRANTVYGLGFASEQHLTQFAEKFQEVKEAAR
LAREKSQDGGELTSPALGLASHQVPPSPLVSANGPGEEKLFRSQSADAPGPTERER
LKKMLSEGSVGEVQWEAEFFALQDSNNKLAGALREANAAAAQWRQQLEAQRAE
AERLRQRVAELEAQAASEVTPTGEKEGLGQGQSLEQLEALVQTKDQEIQTLKSQT
GGPREALEAAEREETQQKVQDLETRNAELEHQLRAMERSLEEARAERERARAEV
GRAAQLLDVSLFELSELREGLARLAEAAP
569.24
572.33
580.30
581.46
582.63
606.32
610.24
616.14
569.24
572.33
580.30
581.46
582.63
606.32
610.24
616.14
569.24
574.83
580.70
580.92
579.99
603.92
611.14
616.74
569.24
572.33
580.30
581.46
582.63
606.32
610.24
616.14
569.24
572.33
580.30
581.46
582.63
606.32
610.24
616.14
569.24
572.33
580.30
581.46
582.63
606.32
610.24
616.14
569.24
572.33
580.30
581.46
582.63
606.32
610.24
616.14
1 QRESTATDILQK 18.77
2 EIEEDSLEGLKK 14.78
Score hits
Theoretical spectra
m/z
[%]
m/z
[%]
m/z
[%]
Experimental spectra
m/z
RT
Protein Sequence
Peptide identification
Protein target DATABASE
Example Resources
● ENSEMBL + BioMart● http://www.ensembl.org
● National Center for Biotechnology Information (NCBI) + BLAST● http://www.ncbi.nlm.nih.gov/
● The European Bioinformatics Institute (EMBL-EBI) + Clustal● http://www.ebi.ac.uk/services
● UniProt● http://www.uniprot.org/
● University of California, Santa Cruz (UCSC)● https://genome.ucsc.edu/
Resources
Example Repositories
https://genevestigator.com/gv/
Selected database systems
I. Relational databases
MySQL
II.NoSQL databases
MongoDB
Database systems in modern data-driven
biomedical research
I. Typical research scenarios
1.From samples to data (numbers in a file) = digitization
2.From data to databases and back
= use resources (e.g., human genome) and contribute to
repositories (data + scientific observations + metadata)
3.From data/databases to information
= data mining
II. How to?
Many database design & concepts
http://dataconomy.com/wp-content/uploads/2014/07/fig2large.jpg28
Databases
DB = "A database is an organized collection of data" http://en.wikipedia.org/wiki/Database
DB = DB + data model for the application at hand (business logic) + implementation
DB = DB + database management system (DBMS). Software than enables:
29
CRUD
• Create entries
• Read (retrieve)
• Update / edit
• Delete
DB = DB + Administration (User privilages, monitoring)
Selected database systems
I. Relational databases
MySQL
II.NoSQL databases
MongoDB
Specific characteristics MongoDB vs MySQL
More details here: http://db-engines.com/en/system/MongoDB%3BMySQL
System Property MongoDB MySQL
Initial release 2009 1995
Current release 3.0.2, April 2015 5.6.24, April 2015
Triggers No Yes
MapReduce Yes No
Foreign keys No Yes
Transaction concepts No ACID*
*A database transaction, must be Atomic, Consistent, Isolated and Durable.
Relational databases
● A plausible experiment where samples are collected from different mice before and after some treatment
● High redundancy
● Cumbersome to maintain/update
Mice tableSamples table
● Split the data into RELATED tables● Low redundancy● Easier to maintain/update (e.g., add some genotype information to mice)
1:M one-to-many relationship
Relational databases (Normalization)
Terminology
Fields
Record 1
Record 6
Primary keyPrimary
key
Foreign KeyRef
Mice.Mouse_number
● Table rows are called "records"● Table columns are called "fields"● The values of the primary keys uniquely identifies the rows of the table● The foreign key uniquely links the rows of the host table to 1 record in the referencing table
Mice table
Samples table
Databases
DB = "A database is an organized collection of data" http://en.wikipedia.org/wiki/Database
DB = DB + data model for the application at hand (business logic) + implementation
DB = DB + database management system (DBMS). Software than enables:
35
CRUD
• Create entries
• Read (retrieve or search)
• Update / edit
• Delete
DB = DB + Administration (User privilages, monitoring)
Structured Query Language (SQL)
● SQL is a standard language for creating, accessing and modifying relational databases
● MySQL implements SQL database management
Connect to the server and create the database
mysql u username p h localhost
CREATE database mouse_experiment;
USE mouse_experiment;
Create the tables and insert values
CREATE TABLE mice ( Mouse_number SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT, Gender enum('Male','Female','NA') DEFAULT 'Female', Age decimal(4,2) DEFAULT NULL, Treatment VARCHAR(50) NOT NULL, PRIMARY KEY (Mouse_number) );
INSERT INTO mice (Gender, Age, Treatment) VALUES ('Male','3','Vitamin A'),('Male','2','Vitamin B'),('Female','2.5','Vitamin A'),('Female','3','Vitamin B'),('Male','4','Vitamin A'),('Female','2','Vitamin B');
No Mouse_number! It is the task of the DBMS to generate UNIQUE id's
Create the tables and insert values
CREATE TABLE samples ( Sample_ID SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT, Mouse_number SMALLINT UNSIGNED NOT NULL, Timepoint VARCHAR(15) NOT NULL, PRIMARY KEY (Sample_ID), FOREIGN KEY (Mouse_number) REFERENCES mice(Mouse_number) ON DELETE CASCADE )ENGINE=InnoDB DEFAULT CHARSET=utf8;
Create the tables and insert values
Queries
SELECT field1, field2,...fieldN from table_name[WHERE Clause][OFFSET M ][LIMIT N]
SELECT * from table_name[WHERE Clause][OFFSET M ][LIMIT N]
Generic syntax
Queries
SELECT * , COUNT(*) as count_per_gender from mice group by Gender, Treatment;
“How many males and how many females per treatment?”
JOIN queries
SELECT Sample_ID, Treatment, Timepoint, mice.Mouse_number from samples join mice
on samples.Mouse_number = mice.Mouse_number where mice.Mouse_number=2;
“What samples do I have from mouse number 2?”
Delete queries
“Mouse number 3 went wrong. Let's just delete it.”
SELECT * from samples;
DELETE from mice where Mouse_number = 3;
SELECT * from samples;
Where are these two samples gone?
They were deleted!
CREATE TABLE samples ( Sample_ID SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT, Mouse_number SMALLINT UNSIGNED NOT NULL, Timepoint VARCHAR(15) NOT NULL, PRIMARY KEY (Sample_ID), FOREIGN KEY (Mouse_number) REFERENCES mice(Mouse_number) ON DELETE CASCADE )ENGINE=InnoDB DEFAULT CHARSET=utf8;
Does it make sense not to leave “orphan” samples in the system?
Extending the database
Mice table
Samples table1:M one-to-many relationship
Protein tableM:M many-to-many relationship
Indexing
● Ultimately, the data is stored in files on disks
● With large amounts of data (tens of million of records), sequential searching becomes not feasible
● MySQL uses B-trees
● “In computer science, a B-tree is a tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time.” http://en.wikipedia.org/wiki/B-tree
Indexing B-trees
Source: http://en.wikipedia.org/wiki/B-tree
Summary
● Database design requires domain knowledge, including example usecases
● Normalization
● Primary and Foreign Keys
● Implement a MySQL database
● Queries & Join Queries
● Indices
Let us revisit ENSEMBL