msc bioinformatics 2006 biocomputing databases and perl 13 th february 2006
TRANSCRIPT
MSc Bioinformatics 2006
BiocomputingDatabases and Perl
13th February 2006
MSc Bioinformatics 2006
Client
HTTP
Web Server
CGI
Application ServerRequest
Response
MSc Bioinformatics 2006
e.g. PC + Apache +PERL (with CGI module) + MySQL
Client
HTTP
Web Server
CGI DB API
DB Server +DB
Application Server
MSc Bioinformatics 2006
The Database Tier• stores and retrieves data.
• manages updates, allows simultaneous (concurrent) access from web servers, provides security, ensures the integrity of data
• requires complex software.
– Database Management Systems (DBMSs)
• searches and manages data that's stored in databases
• a database is a collection of related data, and an application can have more than one database
– interface is accessed using SQL (Standard Query Language) that's used to define and manipulate databases and data
MSc Bioinformatics 2006
Why use a database server?
• more than one user who needs to access the data at the same time.
• a moderate amount of data• relationships between the stored data items• more than one kind of data object• constraints that must be rigidly enforced on the data,
such as field lengths, field types etc..• If data must be queried to produce reports or results.• large amount of data that must be searched quickly.• If security is important• When adding, deleting, or modifying data is a complex
process.
MSc Bioinformatics 2006
Relational databases
• Entities and Attributes– An entity is a person, place, event, or thing about
which data is collected– An entity set is a named collection of entities sharing
common characteristics– Attributes are characteristics of the entity– Attributes are also called fields
• Tables– A table holds related entities or an entity set– Also called relations– Comprised of rows and columns
MSc Bioinformatics 2006
Tables
• Two-dimensional structure with rows and columns• Each row (tuple) represents a single entity• Columns represent attributes• Row/column intersection represents single value• Tables must have an attribute to uniquely identify each
row, a unique key• Column values all have same data format• Each column has range of values called attribute domain• Order of the rows and columns is immaterial to the
DBMS
MSc Bioinformatics 2006
Fingerprint
Motif
Sequence
MSc Bioinformatics 2006
Entity-relationship model
MSc Bioinformatics 2006
Relational logical model
MSc Bioinformatics 2006
MSc Bioinformatics 2006
mySQL
• Command-line interpreter– UNIX/Mac
/usr/local/bin/mysql -uhugh -pshhh
– WindowsStart > All Programs > mySQL > mySQL Server 5.0 > mySQL command line client
MSc Bioinformatics 2006
Managing databases• Creating databases
mysql> CREATE DATABASE phyloprints;
• To use that databaseuse phyloprints;
• Creating tables
CREATE TABLE "fingerprint" ( "fprint_accn" varchar(15), "identifier" varchar(15), "motifs" int2, "date" date, "up_date" date, "family_title" text, "family_doc“ varchar(18), "category" text, "second_accn" varchar(15), "pseudo" bool);
SHOW CREATE TABLE wine;
MSc Bioinformatics 2006
Table = fingerprint +----------------------------------+----------------------------------+-------+ | Field | Type | Length| +----------------------------------+----------------------------------+-------+ | fprint_accn | varchar() | 15 | | identifier | varchar() | 15 | | motifs | int2 | 2 | | date | date | 4 | | up_date | date | 4 | | family_title | text | var | | family_doc | varchar() | 18 | | category | text | var | | second_accn | varchar() | 15 | | pseudo | bool | 1 | +----------------------------------+----------------------------------+-------+
Table = motif +----------------------------------+----------------------------------+-------+ |Field | Type |Length | +----------------------------------+----------------------------------+-------+ |fprint_accn | varchar() | 18 | | motif | int2 | 2 | | repeat | varchar() | 4 | | seqn_accn | varchar() | 15 | | seqn_fragment | varchar() | 35 | | start_position | int4 | 4 | | inter_motif_dist | int4 | 4 | | initial | bool | 1 | | final | bool | 1 |
+----------------------------------+----------------------------------+-------+
attributes
MSc Bioinformatics 2006
Modifiers
• Are applied to attributes– NOT NULL (row can’t exist without this
attribute having a value)
identifier varchar(15) NOT NULL,
– DEFAULT (sets the data to the value that follows when no data is supplied)pseudo" bool DEFAULT f,
MSc Bioinformatics 2006
Keys
• Primary key – one or more attributes that uniquely identify a row in a table– It is essential that every table has one
PRIMARY KEY (fprint_accn), key names*fprint_accn,identifier) type=MyISAM;
MSc Bioinformatics 2006
Removing databases and tables
• DROP statement (make sure you want to do this – it doesn’t check!)
DROP TABLE seqn;
DROP DATABASE phyloprints;
MSc Bioinformatics 2006
Inserting/Updating/Deleting Data
• Inserting DataINSERT INTO fingerprint VALUES (‘PR90008’,
‘CRYSTALLIN’,0,2000-10-11,2000-10-11,’Crystallin pseudo-signature’,’PDOC90008’,’Others’,t);
• Inserting lots of data rowsINSERT INTO fingerprint VALUES (‘PR90008’,
‘CRYSTALLIN’,0,2000-10-11,2000-10-11,’Crystallin pseudo-signature’,’PDOC90008’,’Others’,t),(‘PR00240’,’ADRENRGCA1DR’,7,1996-08-14,2000-02-18,’Alpha-1D adrenergic receptor signature’,’PDOC00240’, f);
MSc Bioinformatics 2006
• DELETE statement– All data
DELETE FROM fingerprint; – Deleting specific rows – use WHEREDELETE FROM fingerprint WHERE identifer = ‘CRYSTALLIN’;
• UPDATE statementUPDATE fingerprint SET fprint_accn = ‘PR00001’ WHERE identifier=‘CRYSTALLIN’;
MSc Bioinformatics 2006
Querying with SQL
• Key operators– SELECT
– PROJECT
– JOIN
• Other operators– INTERSECT
– UNION (union compatible tables)
– DIFFERENCE
– PRODUCT
– DIVIDE
MSc Bioinformatics 2006
Querying databases with SQL
• SELECT statementSELECT identifier, fprint_accn FROM fingerprint;
SELECT * FROM fingerprint;
• WHERE clauseSELECT * from fingerprint WHERE fprint_accn=‘PR00001’;
MSc Bioinformatics 2006
• select identifier,fprint_accn from fingerprint where pseudo='t' ORDER BY fprint_accn;
• select fprint_accn,COUNT(*) from fingerprint where pseudo='t' GROUP BY fprint_accn;
• select DISTINCT seqn_accn from motif;• select distinct seqn_accn from motif LIMIT
10;
MSc Bioinformatics 2006
Join queries
• output data that's based on relationships between two or more tables
MSc Bioinformatics 2006
Diagram taken from Jane Mabey
MSc Bioinformatics 2006 Diagram taken from Jane Mabey
MSc Bioinformatics 2006
Querying databases using Perl
#! /usr/bin/perl -w # intro6.pl - connect to MySQL, retrieve data, write plain text output use strict; use DBI;
my ($dbh, $sth, $count); $dbh = DBI->connect ("DBI:mysql:host=localhost;database=webdb", "webdev", "webdevpass", {PrintError => 0, RaiseError => 1});
$sth = $dbh->prepare ("SELECT name, wins, losses FROM teams"); $sth->execute (); $count = 0; while (my @val = $sth->fetchrow_array ()) { printf "name = %s, wins = %d, losses = %d\n", $val[0], $val[1], $val[2]; ++
$count; } print "count rows total\n"; $sth->finish (); $dbh->disconnect (); exit (0);
MSc Bioinformatics 2006
%./intro6.pl name = Fargo-Moorhead Twins, wins = 36, losses = 16 name = Winnipeg Maroons, wins = 24, losses = 26 name = Minot Why Nots, wins = 19, losses = 23 name = Warren Wanderers, wins = 16, losses = 30 4 rows total
MSc Bioinformatics 2006
Perl example…use DBI; my $dbh = DBI->connect('DBI:MySql:payroll') or die "Couldn't connect to database: " .
DBI->errstr; my $sth = $dbh->prepare('SELECT * FROM people WHERE lastname = ?') or die "Couldn't
prepare statement: " . $dbh->errstr; print "Enter name> "; while ($lastname = <>) { # Read input from the user my @data; chomp $lastname; $sth->execute($lastname) # Execute the query or die "Couldn't execute statement: " . $sth->errstr; # Read the matching records and print them out while (@data = $sth->fetchrow_array()) {
my $firstname = $data[1]; my $id = $data[2]; print "\t$id: $firstname $lastname\n";
} if ($sth->rows == 0) {
print "No names matched `$lastname'.\n\n"; } $sth->finish; print "\n"; print "Enter name> "; } $dbh->disconnect;
MSc Bioinformatics 2006
Query databases over the Web
#! /usr/bin/perl -w # intro7.pl - connect to MySQL, retrieve data, write HTML output
use strict; use DBI; use CGI qw(:standard); my ($dbh, $sth, $count); $dbh = DBI->connect ("DBI:mysql:host=localhost;database=webdb", "webdev",
"webdevpass", {PrintError => 0, RaiseError => 1}); $sth = $dbh->prepare ("SELECT name, wins, losses FROM teams"); $sth->execute (); print header(), start_html ("team data"); $count = 0; while (my @val = $sth->fetchrow_array ()) {
print p (sprintf ("name = %s, wins = %d, losses = %d\n", $val[0], $val[1], $val[2])); ++$count;
} print p ("$count rows total"), end_html (); $sth->finish (); $dbh->disconnect (); exit (0);
MSc Bioinformatics 2006
More about data retrieval with DBI.pm
• fetchrow_arrayref
$sth = $dbh->prepare ("SELECT name, wins, losses FROM teams");
$sth->execute ();
while (my $ref = $sth->fetchrow_arrayref ()) {
printf "name = %s, wins = %d, losses = %d\n", $ref->[0], $ref->[1], $ref->[2];
}
$sth->finish ();
MSc Bioinformatics 2006
#!/usr/bin/perl
use DBI;
my $db = “sequence”;my $server = ‘localhost’;my $user = ‘root’;my $passwd = ‘passwd’;
my $dbConnection = DBI->connect(“dbi::mysql:$db:$server”,$user,$passwd)
my $query = “show tables”;my $sql = $dbConnection->prepare($query);$sql ->execute();while(my $row = $sql->fetchrow_arrayref){
print join(“\t”,@$row),”\n”;}$dbConnection->disconnect;exit;
MSc Bioinformatics 2006
Higher retrieval methods
• selectrow_array()– to retrieve a single row or a single column
value • selectcol_arrayref()
– returns the first column of a result set, as a reference to an array of values.
• selectall_arrayref() – retrieves the entire result set as a matrix and
returns a reference to it
MSc Bioinformatics 2006
Inserting data in the databasesub new_employee { # Arguments: database handle; first and last names of new
employee; # department ID number for new employee's work assignment
my ($dbh, $first, $last, $department) = @_; my ($insert_handle, $update_handle); my $insert_handle = $dbh->prepare_cached('INSERT INTO
employees VALUES (?,?,?)'); my $update_handle = $dbh->prepare_cached('UPDATE
departments SET num_members = num_members + 1 WHERE id = ?');
die "Couldn't prepare queries; aborting" unless defined $insert_handle && defined $update_handle;
$insert_handle->execute($first, $last, $department) or return 0; $update_handle->execute($department) or return 0; return 1; # Success
}
MSc Bioinformatics 2006
Inserting data in the databasesub new_employee { # Arguments: database handle; first and last names of new
employee; # department ID number for new employee's work assignment
my ($dbh, $first, $last, $department) = @_; my ($insert_handle, $update_handle); my $insert_handle = $dbh->prepare_cached('INSERT INTO employees
VALUES (?,?,?)'); my $update_handle = $dbh->prepare_cached('UPDATE departments SET
num_members = num_members + 1 WHERE id = ?'); die "Couldn't prepare queries; aborting" unless defined
$insert_handle && defined $update_handle; my $success = 1; $success &&= $insert_handle->execute($first, $last, $department);
$success &&= $update_handle->execute($department); my $result = ($success ? $dbh->commit : $dbh->rollback); unless
($result) { die "Couldn't finish transaction: " . $dbh->errstr } return $success;
}
MSc Bioinformatics 2006
• A short cut for DELETING, UPDATING and INSERTING data use the do statement
$dbh->do('DELETE FROM people WHERE age > 65');
MSc Bioinformatics 2006
Reminders
• Executable CGI files
• Configuring Apache to read cgi files from cgi-bin
• In httpd.conf
• ScriptAlias /cgi-bin/ "C:/Program Files/Apache Group/Apache2/cgi-bin/"