data management for quantitative biology - database systems (continued) lims and e-lab books by dr....
TRANSCRIPT
Dr. Sven Nahnsen/Dr. Marius Codrea,
Quantitative Biology Center (QBiC)
Data Management for Quantitative Biology
Lecture 5: Database systems (continued)
LIMS and E-lab books
Many database design & concepts
http://dataconomy.com/wp-content/uploads/2014/07/fig2large.jpg2
Databases
DB = "A database is an organized collection of data" http://en.wikipedia.org/wiki/Database
DB = DB + data model for the application at hand (business logic) + implementation
DB = DB + database management system (DBMS). Software than enables:
3
CRUD
• Create entries
• Read (retrieve)
• Update / edit
• Delete
DB = DB + Administration (User privilages, monitoring)
Selected database systems
I. Relational databases
MySQL
II.NoSQL databases
MongoDB
Specific characteristics MongoDB vs MySQL
More details here: http://db-engines.com/en/system/MongoDB%3BMySQL
System Property MongoDB MySQL
Initial release 2009 1995
Current release 3.0.2, April 2015 5.6.24, April 2015
Triggers No Yes
MapReduce Yes No
Foreign keys No Yes
Transaction concepts No ACID*
*A database transaction, must be Atomic, Consistent, Isolated and Durable.
Fields
Record 1
Record 6
Primary keyPrimary
key
Foreign KeyRef
Mice.Mouse_number
● The values of the primary keys uniquely identifies the rows of the table● The foreign key uniquely links the rows of the host table to 1 record in the referencing table
Mice table
Samples table
Terminology - Relational databases
Mice tableSamples table
Samples are RELATED to mice
1:N one-to-many relationship
Relational databases (Normalization)
Foreign Keys
CREATE TABLE samples ( Sample_ID SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT, Mouse_number SMALLINT UNSIGNED NOT NULL, Timepoint VARCHAR(15) NOT NULL, PRIMARY KEY (Sample_ID), FOREIGN KEY (Mouse_number) REFERENCES mice(Mouse_number) ON DELETE CASCADE )ENGINE=InnoDB DEFAULT CHARSET=utf8;
Queries
SELECT * , COUNT(*) as count_per_gender from mice group by Gender, Treatment;
“How many males and how many females per treatment?”
JOIN queries
SELECT Sample_ID, Treatment, Timepoint, mice.Mouse_number from samples join mice
on samples.Mouse_number = mice.Mouse_number where mice.Mouse_number=2;
“What samples do I have from mouse number 2?”
Relational „facts“
1.Rigid schema (once the structure is defined, it may be difficult to adjust)
2.Normalization introduces/requires additional tables, joins, indices and it scatters data
3.Each field in each record has a single value of a pre-defined type
Mice tableSamples table
1:M one-to-many relationship
Relational „facts“ 1
Generalization to other Projects/Experiments in the lab?
Rigid schema (once the structure is defined, it may be difficult to adjust)
Organisms table
Samples tableBROKEN 1:N one-to-many relationship
Relational „facts“ 1
Mice table
Deleted relationship
CREATE TABLE samples ( Sample_ID SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT, Mouse_number SMALLINT UNSIGNED NOT NULL, Timepoint VARCHAR(15) NOT NULL, PRIMARY KEY (Sample_ID), FOREIGN KEY (Mouse_number) REFERENCES mice(Mouse_number) ON DELETE CASCADE )ENGINE=InnoDB DEFAULT CHARSET=utf8;
Organisms table
Projects table
Relational „facts“ 1
Projects table
Relational „facts“ 1 Users table
Many users can be involved in many projects. With many roles?
Projects_Users table
Projects table
Relational „facts“ 2
Users table
Normalization introduces/requires additional tables, joins, indices and scatters data
Projects_Users table
CREATE INDEX usr on Project_Users (User_ID);
Relational „facts“ 3
Each field in each record has a single value of a pre-defined type
Primary Key Field 1 Field 2 Field 3
A 2-D map (tuples)
Relational „facts“ 3
A single value ?!?
What if a person has 2 affiliations and thus 2 addresses, 2 phone numbers, etc?
Normalization? Again?
NoSQL
DB = "A database is an organized collection of data" http://en.wikipedia.org/wiki/Database
● Can we allow for “some” heterogeneity of the data?
● Can the records be highly similar but not necessarily identical? (e.g., most of the users having just 1 phone number but others more?)
MongoDB is a document-oriented DB
{ Mouse_number: “1”, Gender: “Male”, Age: 3, Treatment: “Vitamin A” }
Field:value pairs
Document ~ Record
http://www.mongodb.org/
MongoDB Documents
{ Mouse_number: “1”, Gender: “Male”, Age: 3, Treatment: “Vitamin A” }
Field:value pairs
Documents are BSON files (binary JSON)
Closely resemble structures in programming languages (key-value association)
Each field can be
● NULL● Single value (integer, string, etc)● An array of many values● Other embedded documents● A reference to another document
MongoDB Collections
Documents are stored in Collections
Collection ~ Table
{ Mouse_number: “6”, Gender: “Female”, Age: 2, Treatment: “Vitamin B” , }
Different representation – The challenge remains the same:
Model the relationships between data
Organisms
Projects
Users
AffiliationsSamples
N
N
N
N
N
N
11
Design & operational mechanisms
MySQL
● Primary Key
● Foreign Key
● Join Tables
MongoDB
● Unique ID
● References
● Embedding
MongoDB – Field types
{ _id: <ObjectID1> Username: { first_name: “Hans”, last_name: “Meyer” }, Gender: “Male”, Age: 30, Phones: [“+490777”, “+350777”],
Affiliations_id: <UUID_affiliation>}
Users document
● array
● embedded document
● reference
● Unique ID
MongoDB – Field types
● Unique ID _id: <ObjectID1>
Acts as a primary key
ObjectId is a 12-byte BSON type, constructed using:
● a 4-byte value representing the seconds since the Unix time,● a 3-byte machine identifier,● a 2-byte process id, and● a 3-byte counter, starting with a random value.
http://docs.mongodb.org/manual/reference/object-id/
ObjectId("507f1f77bcf86cd799439011")
MongoDB – Field types
● array Phones: [“+490777”, “+350777”]
● Upon indexing, each value in the array is in the index
● Query for ANY matching value
MongoDB – Field types
{ _id: <ObjectID1> Username: { first_name: “Hans”, last_name: “Meyer” },
Gender: “Male”,}
● embedded document
● Pre-joined data?
● Can be indexed
● Query at any level on any field
MongoDB – Field types
{ _id: <ObjectID1> Username: { first_name: “Hans”, last_name: “Meyer” }, Gender: “Male”, Age: 30, Phones: [“+490777”, “+350777”],
Affiliations_id: <UUID_affiliation>}
Users document
● reference
Affiliations document
{ _id: <UUID_affiliation> Name: “My lab”, Address: “Tübingen”}
Where is the catch?
● "In MongoDB, write operations are atomic at the document level, and no single write operation can atomically affect more than one document or more than one collection."
● OK, then references (normalized model) are not really Foreign Keys that the DB engine resolves. "Client-side applications must issue follow-up queries to resolve the references".(see next slide)
● “A denormalized data model with embedded data combines all related data for a represented entity in a single document. This facilitates atomic write operations since a single write operation can insert or update the data for an entity.”
● OK, denormalize. Maximum default document size is 16MB.
http://docs.mongodb.org/manual/
Foreign key „ON DELETE CASCADE“
“Mouse number 3 went wrong. Let's just delete it.”
SELECT * from samples;
DELETE from mice where Mouse_number = 3;
SELECT * from samples;
Where are these two samples gone?
The key challenge
Find the right structure of the documents (references and embedded documents) that best fit
● the requirements of the application (queries, updates) -data usage
● the performance of the database engine
Model the relationships between data
Organisms
Projects
Users
AffiliationsSamples
N
N
N
N
N
N
11
Model the relationships between data 1:N
Organisms
Samples
N
1
OrganismsSample_ids: [ ]
SamplesOrganism_id:
OR?
Depends on the most frequent question?
● What samples do I have from Organism X ?● Where Sample Y came from?
● How many samples? Reach the 16MB limit?
● Organism embeds multiple samples
Relational „facts“ 3
Each field in each record has a single value of a pre-defined type
Primary Key Field 1 Field 2 Field 3
A 2-D map (tuples)
MongoDB
Nested documents
_id Field 1 Field 2 Field 3
Queries
{ _id: 4, Project_ID: 2, Species: “human”, Gender :””, Age: 30, Treatment:”Vaccine A”}
Organisms
db.organisms.insert( { Project_ID: 2, Species: “human”, Gender :””, Age: 30, Treatment: ”Vaccine A”}
Queries
{ _id: 4, Project_ID: 2, Species: “human”, Gender :””, Age: 30, Treatment:”Vaccine A”}
Organisms
db.organisms.find( { Project_ID: { $eq : 2} })
SELECT * from organismsWHERE Project_ID = 2;
Queries
{ _id: 4, Project_ID: 2, Species: “human”, Gender :””, Age: 30, Treatment:”Vaccine A”}
Organisms db.organisms.find( { $and: [{Species: /h.*/}, {Age: {$gt: 20 }}]})
SELECT * from organisms WHERE Species like 'h%' AND Age > 20;
Schema flexibility
{ _id: 4, Project_ID: 2, Species: “human”, Gender :””, Age: 30, Treatment:”Vaccine A”}
Organisms
{ _id: 14, Project_ID: 5, Species: “human”, Gender :”Female”, Age: 10, Genetic_background: “WT”}
Data IS the schema!
Queries
{ _id: 4, Project_ID: 2, Species: “human”, Gender :””, Age: 30, Treatment:”Vaccine A”}
Organisms db.organisms.find( { Genetic_background: $exists: true } })
SELECT ???
Model the relationships between data 1:N
Organisms
Samples
N
1
OrganismsSample_ids: [ ]
SamplesOrganism_id:
OR?
● Organism embeds multiple samples
MongoDB
Nested documents
_id Field 1 Field 2 Field 3
Queries
{ _id: 4, Project_ID: 2, Species: “human”, Gender :””, Age: 30, Samples: [ { _id: 10, Timepoint:”5h”},
{ _id: 11, Timepoint:”24h” } ], Treatment:”Vaccine A”}
Organisms
db.organisms.find( { '_id': '4', 'Samples._id':'11' } )
db.organisms.find( { '_id': '4', 'Samples.Timepoint':'5h' } )
Summary
● Database design requires technical and substantial domain specific knowledge
● Normalization
● Indices
MySQL
● Primary Key
● Foreign Key
● Join Tables
MongoDB
● Unique ID
● References
● Embedding
Hint: http://en.wikipedia.org/wiki/Category:Web_application_frameworks
Laboratory information management system (LIMS)
Organisms
Projects
Users
AffiliationsSamples
N
N
N
N
N
N
11
An underlying data structure of a simple LIMS design
LIMS definition
http://en.wikipedia.org/wiki/Laboratory_information_management_system
„A Laboratory Information Management System (LIMS), sometimes referred to as a Laboratory Information System (LIS) or Laboratory Management System (LMS), is a software-based laboratory and information management system that offers a set of key features that support a modern laboratory's operations.“
LIMS properties and functionality
http://en.wikipedia.org/wiki/Laboratory_information_management_system
● Meta data of any sample entering the laboratory
● Tracking of processes throughout sample treatment and preparation; scheduling of the sample and the associated analytical workload
● Quality control associated with the sample and the utilized equipment and inventory
● Inspection, approval, and compilation of the sample data for reporting and/or further analysis
Advantages of LIMS
50
● Fewer transcription errors
● Faster sample processing
● Real-time control of data and metadata
● Reproducibility of experimental processes
● Direct electronic reporting to clients
● Despite many advantages,...
Disadvantages of LIMS
● Customization of LIMS
● Interface is required
● Adequate validation to ensure data quality
With a good LIMS in place, we can consider Electronic Laboratory Notebooks
Electronic laboratory notebooks (ELN)
http://en.wikipedia.org/wiki/Electronic_lab_notebook
An electronic lab notebook (also known as electronic laboratory notebook, or ELN) is a computer program designed to replace paper laboratory notebooks. Lab notebooks in general are used by scientists, engineers and technicians to document research, experiments and procedures performed in a laboratory. A lab notebook is often maintained to be a legal document and may be used in a court of as evidence.
Prominent use-case: review process
http://rushthecourt.net/mag/wp-content/uploads/2010/09/Three-Ring-Binders.jpg
● You submit a paper
● Several months of review process is not unlikely
● Reviewers ask for a more detailed description of the experiments you did two years back
Traditional Paper Lab Books
ELN, a survey
Journal of Laboratory Automation 18(3) 229–234, 2012 Society for Laboratory Automation and Screening
DOI: 10.1177/2211068212471834
Examples of ELN software
Practical issues
● Lab technicians “have only two hands”
● Labs are often not equipped with desktop PCs
● Data security of ELNs opposes challenges
● Scientists are classically reluctant adopters
● There is activation energy required to change work habits
● In academic science there is no formal obligation
● Establishment requires stringent modeling (see previous slides on databases) or significant investments into existing tools
Mobile application of ELNs
Nature Methods 8, 541–543 (2011) doi:10.1038/nmeth.1631
● Handwriting capture technology
● All functionality as on paper● Sketch and manipulate
equations● Draw figures
● All notes can be linked, reordered, archived, edited, tagged, annotated and bundled in virtual 'notebooks' representing different projects
Easy solutions
Evernote as lab notebook
Journal of Laboratory Automation 18(3) 229–234, 2012 Society for Laboratory Automation and Screening
DOI: 10.1177/2211068212471834