11/10/2016
Sequence Analysis
Introduction to Bioinformatics BIMM34
October 2016
Gabriel Teku, Department of Experimental Medical Science | Faculty of Medicine | Lund University
Sequence analysis
Learning outcomes
Learn some relevant features found in biological
sequences
Get hands-on experience with popular
sequence/bioinformatics analysis platforms
Become familiar with open source vs commercial
sequence/bioinformatics software
Sequence analysis
Part 1
Features
Motifs
Domains
Part 2
Galaxy
EMBOSS
Software for sequence analysis
Sequence analysis
Part 1
Sequence analysis
Features
Motifs
Domains
Sequence analysis: definition
… refers to the process of subjecting a DNA, RNA or
peptide sequence to any of a wide range of
analytical methods to understand its features,
function, structure, or evolution...
http://en.wikipedia.org/wiki/Sequence_analysis
Exercise 1: quick sequence analysis
1. Obtain the protein sequence encoded by your
chosen gene
2. Obtain the CDS sequence for the protein
3. Translate the CDS sequence obtained above
4. Compare the translated CDS to the protein sequence
obtained from 1 above
http://www.ebi.ac.uk/Tools/st
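Steps 3 and 4 of the exercise can also be sketched in a few lines of Python. This is only an illustration, assuming the standard genetic code and a CDS that starts in reading frame 1; the function name translate and the toy sequence are made up for the example, and the web tool above remains the intended route.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds):
    """Translate a CDS in reading frame 1, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks codons with ambiguous bases
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical toy CDS and the protein it should encode.
toy_cds = "ATGGCCAAGTAA"
toy_protein = "MAK"
print(translate(toy_cds) == toy_protein)
```

Comparing the translated string with the database protein sequence is then a simple equality check; any difference points to a wrong reading frame, a different isoform, or a non-standard genetic code.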
Types of sequence analysis
Searching databases
Sequence alignments
Feature analyses
What is a feature?
Sequence features are groups of nucleotides or amino
acids that confer certain characteristics upon a gene or
protein, and may be important for its overall function.
http://www.ebi.ac.uk/Tools/st
Protein features
Gene features
Exercise 2: on features
1. Explore the features along your protein from UniProt
2. View the protein’s structure in the PDB by following
the 3D structure link
3. Identify the functional motif(s) of the protein
PROSITE link from UniProt → Family & Domains
4. What are the motif(s), as represented in the
database entry?
Motifs
Short, conserved sequence patterns
Associated with specific function(s)
Binding site
Active site
~ 10 - 30 amino acids
Prosite
Motifs
Prosite landing page
Motifs: prosite
From CDS to protein sequence
Statistically significant motifs
Functional motifs
Protein family by virtue of similar functional sites
Prosite motifs search method
Pattern development
Pattern from literature
Profiles
Based on signature patterns
Sensitivity
Specificity
Pattern development in general
Literature curated patterns
published
curated
tested against Swiss-Prot for specificity
Prosite pattern development
New patterns
start with review article
alignment of proteins from article
focus on biologically important regions
create core pattern
Prosite pattern development
New patterns (contd)
Search Swiss-Prot using core sites
Retain/discard core pattern
Refine core pattern and repeat search
Pattern development
Syntax for Prosite patterns
one-letter codes for amino acids, e.g. G=Gly
elements separated by a hyphen, “-”
“X” used where any amino acid is accepted,
Ambiguities indicated by [ ],
e.g. [AG] means Ala or Gly,
Amino acids that are not accepted at a given position are
listed between curly braces, “{ }”,
e.g. {AG} means any amino acid except Ala and Gly,
Syntax for Prosite patterns contd
The number of repeats is placed in parentheses, “( )”,
e.g. [AG](2,4) means Ala or Gly between 2 and 4 times,
a pattern is anchored to the N-terminal or C-terminal by
“<” and “>”, respectively.
Example alignment
G H E G V G K V V K L G A G A
G H E K K G Y F E D R G P S A
G H E G Y G G R S R G G G Y S
G H E F E G P K G C G A L Y I
G H E L R G T T F M P A L E C
Resulting pattern
G-H-E-X(2)-G-X(5)-[GA]-X(3)
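The syntax rules above map almost one-to-one onto regular expressions. The sketch below is a simplified converter written for illustration only: the function name prosite_to_regex is hypothetical, and it covers just the elements listed on these slides, not the full Prosite specification.

```python
import re


def prosite_to_regex(pattern):
    """Convert a Prosite-style pattern (slide subset only) to a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        anchor_start = element.startswith("<")   # pattern anchored at the N-terminus
        anchor_end = element.endswith(">")       # pattern anchored at the C-terminus
        element = element.strip("<>")
        # Split off an optional repeat count such as (2) or (2,4).
        core, _, repeat = element.partition("(")
        core = core.replace("X", ".").replace("x", ".")    # any amino acid
        core = core.replace("{", "[^").replace("}", "]")   # excluded amino acids
        if repeat:
            core += "{" + repeat.rstrip(")") + "}"
        parts.append(("^" if anchor_start else "") + core + ("$" if anchor_end else ""))
    return "".join(parts)


# The five aligned sequences from the slide, written without spaces.
sequences = [
    "GHEGVGKVVKLGAGA",
    "GHEKKGYFEDRGPSA",
    "GHEGYGGRSRGGGYS",
    "GHEFEGPKGCGALYI",
    "GHELRGTTFMPALEC",
]
regex = prosite_to_regex("G-H-E-X(2)-G-X(5)-[GA]-X(3)")
print(regex)
print(all(re.fullmatch(regex, s) for s in sequences))
```

All five sequences of the alignment match the converted pattern, which is a quick way to check that a hand-written pattern really describes the alignment it was derived from.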
Exercise 3
1. Search for a motif of your protein from Prosite
2. Interpret the motif
Methodology
Pattern development
Pattern from literature
New patterns
Profiles
Motifs: prosite
Profiles
Popular approaches
position weight matrix
HMM
Toy example of multiple aligned nucleotide sequences
             1  2  3  4  5  6
Sequence 1   A  T  G  T  C  G
Sequence 2   A  A  G  A  C  T
Sequence 3   T  A  C  T  C  A
Sequence 4   C  G  G  A  G  G
Sequence 5   A  A  C  C  T  G
Convert multiple aligned sequences to raw frequency table
     1      2      3      4      5      6      Row freq
A    0.6    0.6    -      0.4    -      0.2    0.3
T    0.2    0.2    -      0.4    0.2    0.2    0.2
G    -      0.2    0.6    -      0.2    0.6    0.27
C    0.2    -      0.4    0.2    0.6    -      0.23
Normalize values by dividing them by row frequency
     1      2      3      4      5      6      Row freq
A    2.0    2.0    -      1.33   -      0.67   0.3
T    1.0    1.0    -      2.0    1.0    1.0    0.2
G    -      0.74   2.22   -      0.74   2.22   0.27
C    0.87   -      1.74   0.87   2.61   -      0.23
PSSM: convert the values to log base 2
     1      2      3      4      5      6
A    1.0    1.0    -      0.41   -      -0.58
T    0.0    0.0    -      1.0    0.0    0.0
G    -      -0.43  1.15   -      -0.43  1.15
C    -0.2   -      0.8    -0.2   1.38   -
How well does a new sequence, AACTCG, fit the PSSM?
A A C T C G
Sum of log odds score = 1.0 + 1.0 + 0.8 + 1.0 + 1.38 + 1.15 = 6.33
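The three steps (raw frequencies, division by the row frequency, log base 2) and the final scoring can be reproduced with a short Python sketch. The names build_pssm and score are chosen for this example only, and unobserved bases are stored as None where the tables show "-"; real tools usually add pseudocounts instead.

```python
import math

# The five aligned toy sequences from the slide.
alignment = ["ATGTCG", "AAGACT", "TACTCA", "CGGAGG", "AACCTG"]
ALPHABET = "ATGC"


def build_pssm(seqs):
    """Return one dict per alignment column mapping base -> log2 odds score."""
    n_seqs, length = len(seqs), len(seqs[0])
    # Row frequency: overall frequency of each base across the whole alignment.
    row_freq = {b: sum(s.count(b) for s in seqs) / (n_seqs * length) for b in ALPHABET}
    pssm = []
    for pos in range(length):
        column = [s[pos] for s in seqs]
        scores = {}
        for b in ALPHABET:
            freq = column.count(b) / n_seqs
            # Bases never seen in this column get no score ("-" in the tables).
            scores[b] = math.log2(freq / row_freq[b]) if freq > 0 else None
        pssm.append(scores)
    return pssm


def score(seq, pssm):
    """Sum the log odds scores; None if a base was never observed at its position."""
    total = 0.0
    for base, column in zip(seq, pssm):
        if column[base] is None:
            return None
        total += column[base]
    return total


pssm = build_pssm(alignment)
print(round(score("AACTCG", pssm), 2))
```

Without intermediate rounding the score comes out at about 6.31, slightly below the 6.33 obtained from the rounded table values.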
Building a profile from PSSM
Multiple sequence alignments with gaps
Gap penalties
Profile = PSSM that includes gap penalties
Fine tuning gap parameters to achieve good profiles
Building a profile: PSI-BLAST
Query sequence → BLAST → sequence homologs
Sequence homologs → MSA → profile
Profile → BLAST → additional homologs
Additional homologs incorporated into the profile → new profile
Iterate process
MEME Suite Example
Exercise 4
1. BLAST your protein against the UniProt proteins.
2. Select the first 5 hits and download the sequences in FASTA format.
3. Launch the MEME program at http://meme-suite.org/
4. Using the downloaded sequence file above, search for possible motifs using the MEME program.
5. Compare the results to those from Prosite.
6. Leave the results open for later.
Profiles from Hidden Markov Models
More efficient
From speech recognition
Based on Markov Models
Statistical approach
Some motif resources
PROSITE
PRINTS
SMART
InterPro
http://www.ebi.ac.uk/interpro/about.html
Domains introduction
Longer than motifs
conserved sequence patterns
Independent structural and functional unit
Average length, 100 aa
May or may not include motifs along their boundaries
Domains
HMMs are applied in domain identification because of their
robustness.
Some domain databases include
Pfam-A
Pfam-B
Prodom
SCOP
CATH
MEME suite
Exercise 5
1. Identify the domain(s) of your protein.
2. Explain how you accomplished the task.
PART 1
Learn some relevant features found in biological
sequences
Get hands-on experience with popular
sequence/bioinformatics analysis platforms
Become familiar with open source vs commercial
sequence/bioinformatics software
PART 2
Learning outcomes
Learn to use tools and workflow on
sequence/bioinformatics platforms
Familiarize yourself with open source vs
commercial software for sequence/bioinformatics
analysis
PART 2
Galaxy
EMBOSS
Software for sequence analysis
Galaxy
https://usegalaxy.org/
One-stop shop
from single sequence to Next Generation
Sequencing
Open source
Large community
Galaxy main public server
http://galaxy.bmc.lu.se/
Galaxy main server start page
Galaxy hands-on
Open
teku-galaxy.omv.lu.se:8080
Familiarize yourself with the interface
Create an account
Question for galaxy 101 tutorial
Which coding exon has the highest number of
single nucleotide polymorphisms (SNPs) on
chromosome 22?
Exercise 6
Complete the galaxy 101 tutorial
https://github.com/nekrut/galaxy/wiki/Galaxy101-1
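The logic behind the tutorial question (assign SNPs to exons, count per exon, pick the largest count) can be sketched outside Galaxy as well. The exon names, coordinates and SNP positions below are invented toy data, and count_snps_per_exon is a hypothetical helper; the tutorial itself works on real chromosome 22 tracks with Galaxy tools.

```python
from collections import Counter

# Hypothetical toy data: exons as (name, start, end) half-open intervals,
# and SNPs as single positions on the same chromosome.
exons = [("exon_A", 100, 200), ("exon_B", 300, 450), ("exon_C", 500, 520)]
snps = [105, 150, 310, 320, 330, 449, 450, 510]


def count_snps_per_exon(exons, snps):
    """Count how many SNP positions fall inside each exon interval."""
    counts = Counter()
    for name, start, end in exons:
        counts[name] = sum(1 for pos in snps if start <= pos < end)
    return counts


counts = count_snps_per_exon(exons, snps)
top_exon, top_count = counts.most_common(1)[0]
print(top_exon, top_count)
```

This brute-force comparison is fine for a handful of intervals; for whole-chromosome data, interval tools such as those in Galaxy are the practical choice.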
EMBOSS
The European Molecular Biology Open Software
Suite
Large user community
Available on the web, many OS, servers and
stand-alone
If you know how to use one, then you know how
to use all (sort of)
Mature and stable
EMBOSS
What is it good for?
Sequence alignment
Database search with sequence patterns
Motif identification and domain analysis
Nucleotide sequence pattern analysis
Sequence Analysis Introduction
http://emboss.sourceforge.net/
EMBOSS FROM SOURCEFORGE
EMBOSS programs on galaxy platform
Many other portals
http://www.ebi.ac.uk/Tools/emboss/
http://emboss.bioinformatics.nl/
http://imed.med.ucm.es/EMBOSS/
http://www.bioinformatics2.wsu.edu/emboss/
http://pro.genomics.purdue.edu/emboss/
Quick introduction to CpG islands for the next exercise
Regions of DNA with a high density of CG dinucleotides
200 – 500 nucleotides long,
enriched in C and G
Enriched in CpG dinucleotides
The p in CpG represents the
phosphodiester bond between the C and G
nucleotides
Mostly occur within the promoters of eukaryotic
genes
When methylated, lock the gene in an inactive state
Help identify the transcription start site of a gene
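The enrichment described above is usually quantified with two numbers: the GC fraction and the observed/expected CpG ratio, (number of CpG × length) / (number of C × number of G). The sketch below computes both for one sequence window; cpg_stats is a name made up for this example, and the cut-offs mentioned afterwards are commonly cited criteria, not necessarily the exact defaults of any particular tool.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # CpG dinucleotides cannot overlap each other
    gc_fraction = (c_count + g_count) / n if n else 0.0
    # Observed CpG count relative to the count expected from base composition.
    obs_exp = (cpg_count * n) / (c_count * g_count) if c_count and g_count else 0.0
    return gc_fraction, obs_exp


# Hypothetical short window, far shorter than a real CpG island.
gc_fraction, obs_exp = cpg_stats("ACGTCGCG")
print(gc_fraction, round(obs_exp, 2))
```

A window is typically called part of a CpG island when it is at least 200 nucleotides long, has a GC fraction above 0.5 and an observed/expected ratio above 0.6.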
Exercise 7
1. In Galaxy, list all tools in the EMBOSS tool section that analyze
CpG islands (hint: type “cpg” in the search field)
2. Access the documentation for two of these tools, preferably
cpgplot and newcpgreport
3. Write down the expected result
4. Run the tools on the human gene ELANE
5. Interpret the results.
Software for sequence analysis
Learning outcomes
Familiarize yourself with open source vs
commercial software for
sequence/bioinformatics analysis
Software for sequence analysis
Open source tools
Commercial tools
Software for sequence analysis
Websites with links to open source tools and services
http://www.ebi.ac.uk/services
http://www.ncbi.nlm.nih.gov/guide/sequence-analysis/
http://bioinformatics.ca/links_directory/
http://seqanswers.com/wiki/Software/list
Software for sequence analysis
Open source
GNU general public licenses (GNU GPL)
Continuous evolution of code
Community supported
Enables peer-review (reproducibility)
Examples
EMBOSS
mothur
Software for sequence analysis
Mothur website
Software for sequence analysis
Commercial tools
Proprietary
Expensive licenses
Streamline research
Improve productivity
Examples
Geneious (Biomatters Ltd., Auckland, New Zealand)
CLC Genomics Workbench (CLC bio, Aarhus, Denmark)
Sequencher (Gene Codes Corporation, MI, USA)
Commercial software and feature offerings
Software                 Cost (USD)  Free trial (days)  Platform  NGS analyses  Database searching  Plug-ins  Workflow  Teaching suitability
Avadis NGS               $4500       20                 M, W, L   ✓             ✗                   ✗         ✓         ✗
CLC Genomics Workbench   $5500       30                 M, W, L   ✓             ✓                   ✓         ✓         ✓
CodonCode Aligner        $720        30                 M, W      ✓             ✗                   ✗         ✗         ✓
Genamics Expression      $295        30                 W         ✗             ✓                   ✓         ✗         ✗
Geneious                 $795        14                 M, W, L   ✓             ✓                   ✓         ✓         ✓
Full Lasergene Suite     $5950       30                 M, W      ✓             ✓                   ✓         ✓         ✓
MacVector & Assembler    $300        21                 M         ✓             ✓                   ✗         ✗         ✓
NextGENe                 $4049       35                 W         ✓             ✗                   ✗         ✗         ✗
Sequencher               $2500       30                 M, W      ✓             ✓                   ✓         ✗         ✓
VectorNTI Advance        $600        30                 W         ✗             ✓                   ✗         ✓         ✓
Exercise 8: on software for sequence analysis
1. Go to the website of one of the commercial
software packages in the previous slide.
2. Familiarize yourself with the software.
3. Now, search for open source software that
performs similar or the same tasks as the
commercial software.
4. Compare the two software packages.
Summary of PART 2
Familiarity with sequence analysis tools and
workflow on the galaxy platform
Familiarity with sequence analysis software
Open source tools
Commercial software