11/10/2016

Sequence Analysis

Introduction to Bioinformatics BIMM34

October 2016

Gabriel Teku Department of Experimental Medical Science |Faculty of Medicine | Lund University

Sequence analysis

Learning outcomes

Learn some relevant features found in biological sequences

Get hands-on experience with popular sequence/bioinformatics analysis platforms

Become familiar with open source vs commercial sequence/bioinformatics software

Sequence analysis

Part 1

Features

Motifs

Domains

Part 2

Galaxy

EMBOSS

Software for sequence analysis

Sequence analysis

Part 1

Sequence analysis

Features

Motifs

Domains

Sequence analysis: definition

… refers to the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution...

http://en.wikipedia.org/wiki/Sequence_analysis

Exercise 1: quick sequence analysis

1. Obtain the protein sequence encoded by your chosen gene

2. Obtain the CDS sequence for the protein

3. Translate the CDS sequence obtained above

4. Compare the translated CDS to the protein sequence obtained in step 1

http://www.ebi.ac.uk/Tools/st
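
Steps 3 and 4 of Exercise 1 can also be checked locally. The following is a minimal Python sketch, assuming the standard genetic code and a CDS that starts in frame. The function name `translate` and the toy sequences are illustrative only and are not part of the course material.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate an in-frame CDS, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # "X" marks an unrecognized codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical toy data: replace with the CDS and protein you downloaded.
toy_cds = "ATGGGTCATGAATAA"
toy_protein = "MGHE"
print(translate(toy_cds) == toy_protein)
```

If the comparison is false for a real gene, possible reasons include a different reading frame, a non-standard genetic code, or a different isoform.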

Types of sequence analysis

Searching databases

Sequence alignments

Feature analyses

What is a feature

Sequence features are groups of nucleotides or amino acids that confer certain characteristics upon a gene or protein, and may be important for its overall function.

http://www.ebi.ac.uk/Tools/st

Protein features

Gene features

Exercise 2: on features

1. Explore the features along your protein in UniProt

2. View the protein’s structure in the PDB by following the 3D structure link

3. Identify the functional motif(s) of the protein (PROSITE link from UniProt → Family & Domains)

4. What motif(s) does the database entry describe?

Motifs

Short, conserved sequence patterns

Associated with specific function(s)

Binding site

Active site

~ 10 - 30 amino acids

Prosite

Motifs

Prosite landing page

Motifs: prosite

From CDS to protein sequence

Statistically significant motifs

Functional motifs

Protein family by virtue of similar functional sites

Prosite motifs search method

Pattern development

Pattern from literature

Profiles

Based on signature patterns

Sensitivity

Specificity

Pattern development in general

Literature curated patterns

published

curated

tested against Swiss-Prot for specificity

Prosite pattern development

New patterns

start with review article

alignment of proteins from article

focus on biologically important regions

create core pattern

Prosite pattern development

New patterns (contd)

Search Swiss-Prot using core sites

Retain/discard core pattern

Refine core pattern and repeat search

Pattern development

Syntax for Prosite patterns

one-letter codes for amino acids, e.g. G=Gly

elements separated by a hyphen, “-”

“X” is used where any amino acid is accepted

Ambiguities are indicated by [ ], e.g. [AG] means Ala or Gly

Amino acids that are not accepted at a given position are listed between curly braces, “{ }”, e.g. {AG} means any amino acid except Ala and Gly

Syntax for Prosite patterns (contd)

The number of repeats is placed between parentheses, “( )”, e.g. [AG](2,4) means Ala or Gly between 2 and 4 times

A pattern is anchored to the N-terminus or C-terminus by “<” and “>”, respectively.

G H E G V G K V V K L G A G A

G H E K K G Y F E D R G P S A

G H E G Y G G R S R G G G Y S

G H E F E G P K G C G A L Y I

G H E L R G T T F M P A L E C

[Figure: residues observed at each variable position of the alignment above, leading to the pattern below]

G-H-E-X(2)-G-X(5)-[GA]-X(3)
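
The syntax rules above map almost one-to-one onto regular expressions. Below is a minimal Python sketch of such a conversion, checked against the five aligned sequences from the slide. The function name `prosite_to_regex` is hypothetical, and the sketch covers only the rules listed in these slides; it accepts both `X` (as written here) and the lowercase `x` used in PROSITE entries.

```python
import re

# One pattern element: a residue, x/X, [..] or {..}, optionally followed by (n) or (n,m).
ELEMENT = re.compile(r"(x|X|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored to the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored to the C-terminus
            suffix, element = "$", element[:-1]
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core in ("x", "X"):
            core = "."                   # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # excluded residues
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


regex = prosite_to_regex("G-H-E-X(2)-G-X(5)-[GA]-X(3)")
alignment = [
    "GHEGVGKVVKLGAGA",
    "GHEKKGYFEDRGPSA",
    "GHEGYGGRSRGGGYS",
    "GHEFEGPKGCGALYI",
    "GHELRGTTFMPALEC",
]
print(regex)
print(all(re.fullmatch(regex, seq) for seq in alignment))
```

To scan a longer protein sequence for the motif, `re.search(regex, sequence)` can be used in place of `re.fullmatch`.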

Exercise 3

1. Search for a motif of your protein from Prosite

2. Interpret the motif

Methodology

Pattern development

Pattern from literature

New patterns

Profiles

Motifs: prosite

Profiles

Popular approaches

position weight matrix

HMM

Position     1 2 3 4 5 6

Sequence 1   A T G T C G

Sequence 2   A A G A C T

Sequence 3   T A C T C A

Sequence 4   C G G A G G

Sequence 5   A A C C T G

Toy example of multiple aligned nucleotide sequences

     1      2      3      4      5      6      Row freq

A    0.6    0.6    -      0.4    -      0.2    0.3

T    0.2    0.2    -      0.4    0.2    0.2    0.2

G    -      0.2    0.6    -      0.2    0.6    0.27

C    0.2    -      0.4    0.2    0.6    -      0.23

Convert the multiple aligned sequences to a raw frequency table

     1      2      3      4      5      6      Row freq

A    2.0    2.0    -      1.33   -      0.67   0.3

T    1.0    1.0    -      2.0    1.0    1.0    0.2

G    -      0.74   2.22   -      0.74   2.22   0.27

C    0.87   -      1.74   0.87   2.61   -      0.23

Normalize the values by dividing them by the row frequency

     1      2      3      4      5      6

A    1.0    1.0    -      0.41   -      -0.58

T    0.0    0.0    -      1.0    0.0    0.0

G    -      -0.43  1.15   -      -0.43  1.15

C    -0.2   -      0.8    -0.2   1.38   -

PSSM: convert the values to log base 2 (“-” marks a residue not observed at that position)

How well does a new sequence, AACTCG, fit the PSSM?

Sum of log odds score = 1.0 + 1.0 + 0.8 + 1.0 + 1.38 + 1.15 = 6.33
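
The three steps (column frequencies, division by the overall base frequency, log base 2) and the scoring of a new sequence can be reproduced with a short Python sketch. The function names `build_pssm` and `score` are illustrative, and treating an unobserved residue as an impossible match is an assumption; real tools usually add pseudocounts instead. Because the code does not round intermediate values, it gives about 6.31 rather than the 6.33 obtained by summing the rounded table entries.

```python
import math

# Toy alignment from the slides.
alignment = ["ATGTCG", "AAGACT", "TACTCA", "CGGAGG", "AACCTG"]


def build_pssm(seqs, alphabet="ATGC"):
    """Return one dict of log2-odds scores per alignment column."""
    n_seqs, length = len(seqs), len(seqs[0])
    total = n_seqs * length
    # Overall frequency of each base in the alignment ("Row freq" in the tables).
    background = {b: sum(s.count(b) for s in seqs) / total for b in alphabet}
    pssm = []
    for pos in range(length):
        column = [s[pos] for s in seqs]
        scores = {}
        for b in alphabet:
            freq = column.count(b) / n_seqs
            # None marks a base never observed at this position ("-" in the tables).
            scores[b] = math.log2(freq / background[b]) if freq > 0 else None
        pssm.append(scores)
    return pssm


def score(seq, pssm):
    """Sum the log-odds scores; unobserved residues give minus infinity."""
    total = 0.0
    for pos, base in enumerate(seq):
        value = pssm[pos][base]
        if value is None:
            return float("-inf")
        total += value
    return total


pssm = build_pssm(alignment)
print(round(score("AACTCG", pssm), 2))
```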

Building a profile from PSSM

Multiple sequence alignments with gaps

Gap penalties

Profile = PSSM that includes gap penalties

Fine tuning gap parameters to achieve good profiles

Building a profile: PSI-BLAST

Query sequence → BLAST → sequence homologs → MSA → profile → BLAST → additional homologs → incorporated into a new profile → iterate the process

[Figure: profile matrices labelled by residue (A, B, C, E, ...) and position (1, 2, ...)]

MEME Suite Example

Exercise 4

1. BLAST your protein against the UniProt proteins.

2. Select the first 5 hits and download the sequences in FASTA format.

3. Launch the MEME program at http://meme-suite.org/

4. Using the downloaded sequence file, search for possible motifs with the MEME program.

5. Compare the results with those from Prosite.

6. Leave the results open for later.

Profiles from Hidden Markov Models

More efficient

Borrowed from speech recognition

Based on Markov Models

Statistical approach

Some motif resources

PROSITE

PRINTS

SMART

InterPro

http://www.ebi.ac.uk/interpro/about.html

Domains introduction

Longer than motifs

conserved sequence patterns

Independent structural and functional unit

Average length: about 100 aa

May or may not include motifs along the boundaries

Domains

HMMs are applied in domain identification because of their robustness.

Some domain databases include

Pfam-A

Pfam-B

Prodom

SCOP

CATH

MEME suite

Exercise 5

1. Identify the domain(s) of your protein.

2. Explain how you accomplished the task.

PART 1

Learn some relevant features found in biological sequences

Get hands-on experience with popular sequence/bioinformatics analysis platforms

Become familiar with open source vs commercial sequence/bioinformatics software

PART 2

Learning outcomes

Learn to use tools and workflows on sequence/bioinformatics platforms

Familiarize yourself with open source vs commercial software for sequence/bioinformatics analysis

PART 2

Galaxy

EMBOSS

Software for sequence analysis

Galaxy

https://usegalaxy.org/

One-stop shop

From single sequence to Next Generation Sequencing

Open source

Large community

Galaxy main public server

http://galaxy.bmc.lu.se/

Galaxy main server start page

Galaxy hands-on

Open teku-galaxy.omv.lu.se:8080

Familiarize yourself with the interface

Create an account

Question for the Galaxy 101 tutorial

Which coding exon has the highest number of single nucleotide polymorphisms (SNPs) on chromosome 22?

Exercise 6

Complete the Galaxy 101 tutorial

https://github.com/nekrut/galaxy/wiki/Galaxy101-1
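
The logic behind the tutorial question (overlap SNPs with exons, count per exon, pick the largest) can be sketched in plain Python. This is only an illustration of the idea, not the Galaxy workflow itself. The exon names, coordinates, and SNP positions are made-up toy data, and half-open intervals (start included, end excluded) are an assumption borrowed from the BED convention.

```python
from collections import Counter

# Hypothetical toy data on a single chromosome: (name, start, end) and SNP positions.
exons = [("exon_A", 100, 200), ("exon_B", 300, 450), ("exon_C", 500, 520)]
snps = [120, 150, 310, 320, 330, 449, 510, 700]


def count_snps_per_exon(exons, snp_positions):
    """Count the SNPs that fall inside each exon (half-open intervals)."""
    counts = Counter()
    for name, start, end in exons:
        counts[name] = sum(1 for pos in snp_positions if start <= pos < end)
    return counts


counts = count_snps_per_exon(exons, snps)
best_exon, best_count = counts.most_common(1)[0]
print(best_exon, best_count)
```

In Galaxy, the same three steps correspond to joining the exon and SNP datasets on overlapping intervals, grouping by exon, and sorting by the count.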

EMBOSS

The European Molecular Biology Open Software Suite

Large user community

Available on the web, for many operating systems, on servers and stand-alone

If you know how to use one program, then you know how to use all (sort of)

Mature and stable

EMBOSS

What is it good for?

Sequence alignment

Database search with sequence patterns

Motif identification and domain analysis

Nucleotide sequence pattern analysis

http://emboss.sourceforge.net/

EMBOSS FROM SOURCEFORGE

EMBOSS programs on galaxy platform

Many other portals

http://www.ebi.ac.uk/Tools/emboss/

http://emboss.bioinformatics.nl/

http://imed.med.ucm.es/EMBOSS/

http://www.bioinformatics2.wsu.edu/emboss/

http://pro.genomics.purdue.edu/emboss/

Quick CpG islands for next exercise

Regions of DNA with a high density of CG dinucleotides

200 – 500 nucleotides long, enriched in CG

The p in CpG represents the phosphodiester bond between the C and G nucleotides

Mostly occurs within the promoters of eukaryotic genes

Locks the gene in an inactive state when methylated

Helps identify the transcription start site of a gene
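
Before running the EMBOSS tools in the next exercise, it may help to see how CpG enrichment can be quantified. The sketch below computes the GC content and the observed/expected CpG ratio of a sequence and scans it with a sliding window. The thresholds (GC content above 50%, ratio above 0.6, 200-nucleotide window) are commonly cited criteria and are assumptions here, not values taken from the slides; check the cpgplot documentation for the settings it actually uses. The function names are illustrative.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    observed = seq.count("CG")
    gc_content = (c + g) / n if n else 0.0
    expected = c * g / n if n else 0.0   # CpGs expected if C and G were independent
    ratio = observed / expected if expected else 0.0
    return gc_content, ratio


def candidate_windows(seq, window=200, min_gc=0.5, min_ratio=0.6):
    """Return start positions of windows that pass both (assumed) thresholds."""
    hits = []
    for start in range(0, max(len(seq) - window + 1, 0)):
        gc_content, ratio = cpg_stats(seq[start:start + window])
        if gc_content > min_gc and ratio > min_ratio:
            hits.append(start)
    return hits


# Toy sequence: a CG-rich stretch followed by an AT-rich stretch.
toy = "CG" * 100 + "AT" * 100
print(cpg_stats("CGCGCGCG"))
print(len(candidate_windows(toy)) > 0)
```

Overlapping candidate windows would normally be merged into islands; that step is left out to keep the sketch short.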

Exercise 7

1. In the EMBOSS toolshed on Galaxy, list all tools that analyze CpG islands (hint: type “cpg” in the search field)

2. Access the documentation for two of these tools, preferably cpgplot and newcpgreport

3. Write down the expected result

4. Run the tools on the human gene ELANE

5. Interpret the results.

Software for sequence analysis

Learning outcomes

Familiarize yourself with open source vs commercial software for sequence/bioinformatics analysis

Software for sequence analysis

Open source tools

Commercial tools

Software for sequence analysis

Websites with links to open source tools and services

http://www.ebi.ac.uk/services

http://www.ncbi.nlm.nih.gov/guide/sequence-analysis/

http://bioinformatics.ca/links_directory/

http://seqanswers.com/wiki/Software/list

Software for sequence analysis

Open source

GNU General Public License (GNU GPL)

Continuous evolution of code

Community supported

Enables peer-review (reproducibility)

Examples

EMBOSS

mothur

Software for sequence analysis

Mothur website

Software for sequence analysis

Commercial tools

Proprietary

Expensive licenses

Streamline research

Improve productivity

Examples

Geneious (Biomatters Ltd., Auckland, New Zealand)

CLC Genomics Workbench (CLC bio, Aarhus, Denmark)

Sequencher (Gene Codes Corporation, MI, USA)

Commercial software and feature offerings

Software                 Cost (USD)  Free trial (days)  Platform  NGS analyses  Database searching  Plug-ins  Workflow  Teaching suitability

Avadis NGS               $4500       20                 M, W, L   ✓             ✗                   ✗         ✓         ✗

CLC Genomics Workbench   $5500       30                 M, W, L   ✓             ✓                   ✓         ✓         ✓

CodonCode Aligner        $720        30                 M, W      ✓             ✗                   ✗         ✗         ✓

Genamics Expression      $295        30                 W         ✗             ✓                   ✓         ✗         ✗

Geneious                 $795        14                 M, W, L   ✓             ✓                   ✓         ✓         ✓

Full Lasergene Suite     $5950       30                 M, W      ✓             ✓                   ✓         ✓         ✓

MacVector & Assembler    $300        21                 M         ✓             ✓                   ✗         ✗         ✓

NextGENe                 $4049       35                 W         ✓             ✗                   ✗         ✗         ✗

Sequencher               $2500       30                 M, W      ✓             ✓                   ✓         ✗         ✓

VectorNTI Advance        $600        30                 W         ✗             ✓                   ✗         ✓         ✓

Exercise 8: on software for sequence analysis

1. Go to the website of one of the commercial software packages on the previous slide.

2. Familiarize yourself with the software.

3. Now, search for open source software that performs the same or similar tasks as the commercial software.

4. Compare the two software packages.

Summary of PART 2

Familiarity with sequence analysis tools and workflows on the Galaxy platform

Familiarity with sequence analysis software

Open source tools

Commercial software