1 sql, databases, and ensembl modules. 2 please look for next lecture ensembl api tutorial:
TRANSCRIPT
1
SQL, Databases, and Ensembl Modules
2
Please look for next lecture
Ensembl API Tutorial:http://www.ensembl.org/info/software/core/core_tutorial.html
3
Survey of Databases(if time)
Ensembl web interface
Ensembl Biomart
4
RDBMS• Oracle and Sybase (many others)
– industry standard, commercial products– development and management tools
• PostgresSQL– full-featured relational DBMS– open source– found in most linux distributions– handles unusual datatypes well which adds flexibility for future
extensions• MySQL
– open-source relational DBMS– easy to setup and use– Linux/Windows/Mac
• Each has variations– SQL, datatypes, functions, features
5
Structured Query Language (SQL)
• SQL is the standard language used to create, modify, maintain, and query relational databases.
• SQL commands are issued within the context of the DBMS interface
• SQL commands can be passed to the DBMS by other programs (such as perl, PHP, Java, manually, etc.)
• Many venders have extensions (Sybase-SQL)• SQL Example:• select FIELD from TABLE where CONDITION• select TABLE.FIELD from TABLE where CONDITION• select program from software where program like 'blast%';
6
SQL datatypes
• datatype -- describes the data stored in a particular column of a table
• typically is either numeric or character strings
• SQL defines subtypes that set different upper limits on the size of text or numerical data
• Also -- special types such as DATE, MONEY
7
SQL datatypes
• INT• FLOAT• REAL (larger float)• CHAR -- fixed length text string• TEXT -- variable length text• BLOB -- variable length binary field• DECIMAL -- real number stored as character string• DATE • TIMESTAMP -- value updates every time the record is
modified• ENUM -- limited set of options (numeric or named)• SET -- value is one of a limited set
8
Creating a new table in the database
CREATE TABLE tablename (columnane type [modifiers], columanme type [modifiers])
create table software_package ( packid INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
packname VARCHAR(100),packurl VARCHAR(255),function TEXT,keyword ENUM,os SET,format SET,archfile VARCHAR(255) );
9
Inserting data
INSERT INTO table(colname1, colname2, columname3) VALUES ('value1','value2','value3')
10
Accessing Data -- SELECT
SELECT [fields] FROM [table] where [condition]
select * from [table] where [condition]
select packname, packurl from software
select packname, packurl from software where keyword = "sequence alignment";
select packname, packurl from software where packname like "B%";
11
Joining multiple tables
• SELECT can "join" two related tables• relationships between tables are created by
replicating information ("primary key") from one table as a "foreign key" in another table
select attribute from table1, table2 where primarykey1 = foreignkey2
SELECT packname, packurl, reference_date FROM software, reference WHERE software.package_id = reference.package_id AND reference_date >= 1998;
12
ExampleAgentsAgentID First Last HireDate Phone
100 Mike Hernandez 5/16/95 553-3992101 Greg Piercy 10/15/95 790-3992102 Kathy Ehrlich 3/1/96 351-4993
ClientsClientID AgentID First Last Phone City
9001 100 Stewart Jameson 553-3999 El Paso9002 101 Shannon McLain 555-5555 Atlanta9003 102 Estela Pundt 541-4993 El Paso
EntertainersEntertainerIDAgentID First Last ….
3000 100 John Slade …3001 101 Mark Jebavy …3002 102 Teresa Weiss …
EngagementsClientID EntertainerIDEngagementDate StartTime StopTime
9003 3001 4/1/96 1:00 PM 3:30 PM9002 3000 4/13/96 9:00 PM 1:30 AM9001 3002 5/2/96 3:00 PM 6:00 PM
select Clients.Last, Clients.First, Clients.Phone from Clients where Clients.City = “El Paso” order by Clients.Last, Clients.First
select Last, First, Phone from Clients where City = "El Paso" order by Last, First
13
SQL Basics• select * from table_name• select count(*) from table_name• Find start times of Entertainers where Jameson is the Client• Nested query (may not always be supported)
– select StartTime from Engagements where ClientID = (select ClientID from Clients where Last = “Jameson”)
• Join– select Engagements.StartTime, Clients.ClientID from Engagements,
Clients where Engagements.ClientID = Clients.ClientID and Client.Last = “Jameson”
– select Engagements.StartTime from Engagements,Clients where Engagements.ClientID = Clients.ClientID and Client.Last = “Jameson”
• insert into TABLE values (val1, val2, …)• insert into TABLE (FIELDa, FIELDb, …FIELDn) values (value1,
value2, …)• insert into Clients values (“Terry”, “Braun”, 11/10/03,555-6666)
14
…as a spreadsheetAgents
Clients
Entertainers
Entertainers
Engagements
Engagements
Agents
First
Last HireDate
Phone
ClientID
AgentID
First
Last Phone
City EntertainerID
AgentID
First
Last ClientID
EntertainerID
EngagementDate
StartTime
StopTime
100 Mike Hernandez
5/17/99
553-3992
9001 100 Stewart
Jameson
553-3999
El Paso
3000 100 John Slade
9001 3002 5/3/00
3:00 PM
6:00 PM
101 Greg Piercy
10/16/99
790-3992
9002 101 Shannon
McLain
555-5555
Atlanta
3001 101 Mark Jebavy
9002 3000 4/14/00
9:00 PM
1:30 AM
102 Kathy
Ehrlich
3/2/00
351-4993
9003 102 Estela
Pundt
541-4993
El Paso
3002 102 Teresa
Weiss
9003 3001 4/2/00
1:00 PM
3:30 PM
Flattened table structure, and lined up the rows.
Appears to be okay….
Except…
The relationships in this trivial example are more complicated than a
simple flat structure such as a spreadsheet.
Example) The first Engagement is between Client 9001, and Entertainer 3002
15
SQL Basics
delete TABLE where FIELD = value
update TABLE set FIELD = value
drop DATABASE
16
exon
transcript_id
exon_num
sequence_start
sequence_stop
intron
transcript_id
intron_num
sequence_start
sequence_stop
primer_pair
id
transcript_id
left_primer_id
right_primer_id
transcript
id
sequence_id
source
source_id
sequence
id
target_id
type
sequence
chr_name
strand
genomic_start
genomic_stop
source
source_id
refresh
target
id
date
gene_name
description
accession
status
project
id
name
description
date
set_table
id
project_id
name
date
description
set_target_join
set_id
target_id
rank
cas_rank
cas_options
17
exon
transcript_id
exon_num = 3
sequence_start
sequence_stop
intron
transcript_id
intron_num = 3
sequence_start
sequence_stop
primer_pair
id
transcript_id
left_primer_id
right_primer_id
transcript
id
sequence_id
source = Ensembl
source_id
sequence
id
target_id
type = nucleotide
sequence = ATG…
chr_name = 15
strand = 1
genomic_start = 15,123,120
genomic_stop = 16,378,131
source
source_id
refresh
target
id
date
gene_name = BBS4
description
accession
status
project
id
name = pro1
description
date
set_table
id
project_id
name = testset
date
description
target_set_info
set_id
target_id
rank = 5
cas_rank
cas_options
Sample Data
18
Extracting Data from a DB
• This simple figure provides me with enough understanding of the database structure to extract data
19
Local SQL Example:• We have gene name (BBS4), and would like to find project name,
and the exon number, start/stops• select id from target where gene_name = "BBS4"
– 11• select set_table.id from set_table, target_set_info where
target_set_info.set_id = set_table.id and target_set_info.target_id = TARGET_ID– 3
• select project.name from project, set_table where set_table.project_id = project.id and set_table.id = SET_ID– project333
• select id from sequence where target_id = TARGET_ID– 33
• select id from transcript where sequence_id = SEQUENCE_ID– 68
• select exon_num, sequence_start, sequence_stop from exon where transcript_id = TRANSCRIPT_ID
• Whew.
20
MySQL demo (gscr)
• mysql -h gscr -u tabraun –p
• use trapss;
21
TrAPSS Database
Show tables;
Show columns from table TABLE_NAME;
22
exon
transcript_id
exon_num = 3
sequence_start
sequence_stop
intron
transcript_id
intron_num = 3
sequence_start
sequence_stop
primer_pair
id
transcript_id
left_primer_id
right_primer_id
transcript
id
sequence_id
source = Ensembl
source_id
sequence
id
target_id
type = nucleotide
sequence = ATG…
chr_name = 15
strand = 1
genomic_start = 15,123,120
genomic_stop = 16,378,131
source
source_id
refresh
target
id
date
gene_name = BBS4
description
accession
status
project
id
name = pro1
description
date
set_table
id
project_id
name =testset
date
description
target_set_info
set_id
target_id
rank = 5
cas_rank
cas_options
select id from target where gene_name = "BBS4";
23
exon
transcript_id
exon_num = 3
sequence_start
sequence_stop
intron
transcript_id
intron_num = 3
sequence_start
sequence_stop
primer_pair
id
transcript_id
left_primer_id
right_primer_id
transcript
id
sequence_id
source = Ensembl
source_id
sequence
id
target_id
type = nucleotide
sequence = ATG…
chr_name = 15
strand = 1
genomic_start = 15,123,120
genomic_stop = 16,378,131
source
source_id
refresh
target
id = 23
date
gene_name = BBS4
description
accession
status
project
id
name = pro1
description
date
set_table
id
project_id
name = testset
date
description
target_set_info
set_id
target_id = 23
rank = 5
cas_rank
cas_options
select set_table.id from set_table, target_set_info where target_set_info.set_id = set_table.id and target_set_info.target_id = 23; (3 rows ==> in 3 sets)
24
exon
transcript_id
exon_num = 3
sequence_start
sequence_stop
intron
transcript_id
intron_num = 3
sequence_start
sequence_stop
primer_pair
id
transcript_id
left_primer_id
right_primer_id
transcript
id
sequence_id
source = Ensembl
source_id
sequence
id
target_id
type = nucleotide
sequence = ATG…
chr_name = 15
strand = 1
genomic_start = 15,123,120
genomic_stop = 16,378,131
source
source_id
refresh
target
id
date
gene_name = BBS4
description
accession
status
project
id
name = pro1
description
date
set_table
id = 7
project_id
name =testset
date
description
target_set_info
set_id =
target_id
rank = 5
cas_rank
cas_options
select project.name from project, set_table where set_table.project_id = project.id and set_table.id = 7;
25
exon
transcript_id
exon_num = 3
sequence_start
sequence_stop
intron
transcript_id
intron_num = 3
sequence_start
sequence_stop
primer_pair
id
transcript_id
left_primer_id
right_primer_id
transcript
id
sequence_id
source = Ensembl
source_id
sequence
id
target_id = 23
type = nucleotide
sequence = ATG…
chr_name = 15
strand = 1
genomic_start = 15,123,120
genomic_stop = 16,378,131
source
source_id
refresh
target
id
date
gene_name = BBS4
description
accession
status
project
id
name = pro1
description
date
set_table
id
project_id
name = testset
date
description
target_set_info
set_id
target_id
rank = 5
cas_rank
cas_options
select id from sequence where target_id = 23;
26
exon
transcript_id
exon_num = 3
sequence_start
sequence_stop
intron
transcript_id
intron_num = 3
sequence_start
sequence_stop
primer_pair
id
transcript_id
left_primer_id
right_primer_id
transcript
id
sequence_id = 2216
source = Ensembl
source_id
sequence
id
target_id
type = nucleotide
sequence = ATG…
chr_name = 15
strand = 1
genomic_start = 15,123,120
genomic_stop = 16,378,131
source
source_id
refresh
target
id
date
gene_name = BBS4
description
accession
status
project
id
name = pro1
description
date
set_table
id
project_id
name = testset
date
description
target_set_info
set_id
target_id
rank = 5
cas_rank
cas_options
select id from transcript where sequence_id = 2216;
27
exon
transcript_id = 5053
exon_num = 3
sequence_start
sequence_stop
intron
transcript_id
intron_num = 3
sequence_start
sequence_stop
primer_pair
id
transcript_id
left_primer_id
right_primer_id
transcript
id= 5053
sequence_id
source = Ensembl
source_id
sequence
id
target_id
type = nucleotide
sequence = ATG…
chr_name = 15
strand = 1
genomic_start = 15,123,120
genomic_stop = 16,378,131
source
source_id
refresh
target
id
date
gene_name = BBS4
description
accession
status
project
id
name = pro1
description
date
set_table
id
project_id
name = testset
date
description
target_set_info
set_id
target_id
rank = 5
cas_rank
cas_options
select exon_num, sequence_start, sequence_stop from exon where transcript_id = 5053;
28
SQL Examples – One Large Query
select distinct exon_num, project.name, sequence_start, sequence_stop from exon, transcript, sequence, target, target_set_info, set_table, project where target.gene_name = "BBS4" and project.name = "pro1" and set_table.name = "testset" and set_table.project_id = project.id and target_set_info.set_id = set_table.id and target_set_info.target_id = target.id and sequence.target_id = target.id and transcript.sequence_id = sequence.id and exon.transcript_id = transcript.id order by exon_num;
29
Executing SQL commands from a text file
http://www.mysql.org/doc/refman/5.0/en/batch-commands.html• However, it is also possible to put your SQL statements in a file and then tell
mysql to read its input from that file. To do so, create a text file text_file that contains the statements you wish to execute. Then invoke mysql as shown here:
• shell> mysql db_name < text_file
• If you place a USE db_name statement as the first statement in the file, it is unnecessary to specify the database name on the command line:
• shell> mysql < text_file
• If you are already running mysql, you can execute an SQL script file using the source or \. command:
• mysql> source file_name• mysql> \. file_name
30
Remotely
mysql -h gscr.eng.uiowa.edu -u tabraun -p
31
Setting up MySQL
• server -- database application that is running on the machine where the data is stored– runs as a "daemon" on Unix machines– a process that is always on, listening for and responding
to requests (from clients)– mysqld -- server
• client -- programs that connect to the server and request data– can be on same machine as server, or– located on other machines and connected by network– mysql -- client– mysqladmin -- administration tool
32
MySLQ help
http://dev.mysql.com/
Show tables;
Show columns from table TABLE_NAME;
Show TrAPSS (time permitting)
33
Survey of Databases(if time)
Ensembl web interface
Ensembl Ensmart