BIOINFORMATICS: An Introduction

Post on 09-Aug-2020

Page 1: BIOINFORMATICS: An Introduction · A brief history of progress…. •The first protein to be sequenced – insulin •The first complete sequencing of an enzyme, ribonuclease in

BIOINFORMATICS:

An Introduction

What is

‘Bioinformatics’?

• The term was first coined in 1988 by Dr. Hwa Lim

• The original definition was:

“a collective term for data compilation, organisation, analysis and dissemination”

Simple

Concept

• The application of computer science and

engineering techniques to biological

analysis.

• The creation of repeatable, reusable,

intelligent software

That means….

• Using information technology to help solve

biological problems by designing novel

and incisive algorithms and methods of

analyses

And…

• It also serves to establish innovative

software and create new/maintain existing

databases of information, allowing open

access to the records held within them

It’s a huge

subject

• ‘Bioinformatics’ - the new ‘buzz word’ in the scientific community

• It is an umbrella term spanning genomics, proteomics, evolution and computer science

• It is now necessary for scientists to be inter-disciplinary

Why?

• The data is collected from a variety of sources

• The terminology is specific to its particular

branch of science

• To make the data easily and universally

interpretable by scientists

What data?

• Biologists have been classifying data on

species of plants and animals since the

17th century

• The knowledge acquired has grown in step with advances in technology

A brief history

of progress….

• Genetics began when Mendel demonstrated his laws of heredity with varieties of peas and flowers in 1865

• The refinement of the compound microscope in the 19th century (the instrument itself was invented around 1600)

A brief history

of progress….

• The first protein to be sequenced – insulin

• The first complete sequencing of an enzyme, ribonuclease, in 1960

• The sequencing of the first complete genome of a free-living organism (Haemophilus influenzae), published in 1995

A brief history

of progress….

• We have since moved on to technologies

permitting the sequencing, recombination and

cloning of DNA

The Human Genome Project

The Human

Genome Project

• In 1990 the Human Genome Project (HGP) was launched by the United States Department of Energy and the National Institutes of Health

• Goals:

HGP

• To identify all chemical base pairs and all genes that make up the 23 chromosome pairs found in human DNA

• To develop the next generation of methods for simulating cellular behaviour and pathways

HGP

• Ultimately to devise means to apply IT to

the modelling of cellular functions as

specified by the enormous datasets

The ‘omic’

revolution

• Bioinformatics has been split into various subjects:

• Genomics – the sequencing and annotation of genomes

• Proteomics – the description of the complete set of proteins a particular genome codes for

Sequence

Databases

• Since it became possible to elucidate protein and nucleic acid sequences, they have been determined at an ever-increasing rate.

• These sequences were originally printed in research journals

Sequence

Databases

• Their enormous numbers and lengths

(particularly for genome sequences) make

it no longer practical to do so

• It is far more useful to have sequences in

computer-accessible form

Sequence

Databases

• As an example of a sequence database, let

us describe the annotated protein

sequence database named SWISS-PROT

• A sequence record in SWISS-PROT begins with the protein’s ID code, of the form X_Y

Sequence

Databases

• X is a mnemonic of up to four characters indicating the protein name

e.g., CYC for cytochrome c

e.g., HBA for hemoglobin α chain

Sequence

Databases

• Y is an identification code of up to five characters indicating the protein’s biological source, usually consisting of the first three letters of the genus and the first two letters of the species

e.g., CANFA for Canis familiaris (dog)
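
The X_Y naming convention described above can be illustrated with a small parser. This is only a sketch under the length limits stated in these slides (up to four characters for X, up to five for Y). The names `parse_entry_name` and `species_code` are hypothetical helpers, and the ID `CYC_CANFA` is simply assembled from the two examples above rather than looked up in the database.

```python
from typing import NamedTuple


class EntryName(NamedTuple):
    protein: str  # X: mnemonic for the protein name
    species: str  # Y: code for the biological source


def parse_entry_name(entry: str, max_protein: int = 4, max_species: int = 5) -> EntryName:
    """Split an ID of the form X_Y and check the length limits from the slides."""
    protein, sep, species = entry.partition("_")
    if not sep or not protein or not species:
        raise ValueError(f"expected an ID of the form X_Y, got {entry!r}")
    if not (protein.isalnum() and species.isalnum()):
        raise ValueError(f"unexpected characters in {entry!r}")
    if len(protein) > max_protein or len(species) > max_species:
        raise ValueError(f"mnemonic or source code too long in {entry!r}")
    return EntryName(protein, species)


def species_code(genus: str, species: str) -> str:
    """Usual rule: first three letters of the genus plus first two of the species."""
    return (genus[:3] + species[:2]).upper()
```

With these helpers, `parse_entry_name("CYC_CANFA")` splits the ID into `CYC` and `CANFA`, and `species_code("Canis", "familiaris")` rebuilds the `CANFA` source code.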


Sequence

alignment

• One can quantify the sequence similarity of two polypeptides or two DNAs by determining the number of aligned residues that are identical

Sequence

alignment

• For example, human and dog cytochrome c, which differ in 11 of their 104 residues, are 89% identical:

[(104 - 11) / 104] × 100 ≈ 89%
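
The percent-identity calculation above can be written as a short function. This is a minimal sketch that assumes the two sequences are already aligned to the same length and that `-` marks a gap; the function names are illustrative.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of aligned positions holding the same residue in both sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("sequences must not be empty")
    # A position counts as identical only if both residues match and neither is a gap.
    matches = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


def percent_identity_from_counts(length: int, differences: int) -> float:
    """Same calculation when only the length and number of differences are known."""
    return 100.0 * (length - differences) / length
```

For the cytochrome c figures quoted above, `percent_identity_from_counts(104, 11)` gives roughly 89.4, which rounds to the 89% on the slide.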

The Homology of Distantly Related

Proteins May Be Difficult to Recognize

• Mutation is a stochastic (probabilistic or random) process

• At every stage of evolution each residue has

an equal chance of mutation

The Homology of distantly related

proteins may be difficult to recognize

• The relative evolutionary distances between neighboring branch points are expressed as the number of amino acid differences per 100 residues of the protein, or PAM units (Percentage of Accepted point Mutations)

The Homology of distantly related

proteins may be difficult to recognize

• Assume that we have a 100-residue protein in which all point mutations have an equal probability of being accepted and occur at a constant rate. Then, at an evolutionary distance of one PAM unit, the original and evolved proteins are 99% identical

The Homology of distantly related

proteins may be difficult to recognize

• At an evolutionary distance of two PAM units

they are 98% identical

• Whereas at 50 PAM units they are 61%

identical

(0.99)^1 × 100 = 99%

(0.99)^2 × 100 ≈ 98%

(0.99)^50 × 100 ≈ 61%
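
The three values above follow from compounding a 99% chance of a residue surviving each PAM unit unchanged. The sketch below reproduces them under the simplified model used on these slides (equal acceptance probability, constant rate); the function name and the default of 0.99 are illustrative.

```python
def expected_identity(pam_units: int, per_unit_identity: float = 0.99) -> float:
    """Expected percent identity after the given number of PAM units."""
    if pam_units < 0:
        raise ValueError("pam_units must be non-negative")
    # Each PAM unit leaves a given residue unchanged with probability 0.99,
    # so the surviving fraction after n units is 0.99 ** n.
    return 100.0 * per_unit_identity ** pam_units


for n in (1, 2, 50):
    print(f"{n:>3} PAM units: {expected_identity(n):.0f}% identical")
```

Because the surviving fraction is multiplied by 0.99 at every step, the curve decays exponentially and never reaches zero.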

The Homology of distantly related

proteins may be difficult to recognize

• Mutational events may result in the insertion

or deletion of one or more residues within a

chain

SQMCILFKAQMNYGH

MFYACRLPMGAHYWL

The Homology of distantly related

proteins may be difficult to recognize

• If we allowed unlimited gapping:

SQMCILFKAQMNYGH

--M---F-----Y--ACRLPMGAHYWL

• Thus we cannot allow unlimited gapping to maximize the match between two peptides, but neither can we forbid all gapping, because insertions and deletions really do occur

The Homology of distantly related

proteins may be difficult to recognize

• Consequently, for each allowed gap we must impose some sort of penalty in our alignment algorithm that strikes a balance between:

- finding the best alignment between distantly related peptides

- rejecting improper alignments
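
The slides do not name a particular algorithm, so the sketch below uses Needleman-Wunsch-style dynamic programming as one standard way to align two sequences with a gap penalty. The scoring values (match +1, mismatch -1, gap -2) and the function name `global_align` are assumptions chosen for illustration, not values taken from the source.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Global alignment with a linear gap penalty; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align two residues
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))
```

For instance, `global_align("ACGT", "AGT")` returns a score of 1 with the alignment `ACGT` over `A-GT`: three matches minus one gap penalty. Calling it with `gap=0` removes the cost of gapping, which mimics the unlimited-gapping case and shows why some nonzero penalty is needed.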

The Homology of distantly related

proteins may be difficult to recognize

• Unrelated proteins will exhibit sequence identities in the range 15% to 25%

• Yet distantly related proteins may have similar levels of sequence identity

• This is the origin of the ‘twilight zone’

The Homology of distantly related

proteins may be difficult to recognize

• A plot of percent identity vs. evolutionary distance is an exponential curve that approaches but never equals zero

The Homology of distantly related

proteins may be difficult to recognize

• Differentiating homologous proteins in the twilight zone from those that are unrelated requires sophisticated alignment algorithms

Sequence Alignment Using Dot Matrices
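
The slide that illustrated dot matrices did not survive in this transcript, so the following is a sketch of the standard construction rather than the author's own figure: cell (i, j) is marked when residue i of the first sequence equals residue j of the second, and runs of marks along a diagonal suggest aligned segments. The function names are illustrative.

```python
def dot_matrix(a: str, b: str) -> list[list[bool]]:
    """Rows follow sequence a, columns follow sequence b; True marks identical residues."""
    return [[x == y for y in b] for x in a]


def render(a: str, b: str) -> str:
    """Plain-text picture of the dot matrix, with '*' for a match and '.' otherwise."""
    lines = ["  " + b]
    for residue, row in zip(a, dot_matrix(a, b)):
        lines.append(residue + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("SQMCILFKAQMNYGH", "MFYACRLPMGAHYWL"))
```

Comparing a sequence with itself fills the main diagonal, while two unrelated sequences give only scattered marks.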

The Homology of distantly related

proteins may be difficult to recognize

The Homology of distantly related

proteins may be difficult to recognize

• The following sections of this introduction to bioinformatics will be demonstrated from the soft copy of the textbook BIOCHEMISTRY by D. Voet & J. Voet, 3rd Edition

Biochemical Interactions

Software

• Learning Objectives:

1. To understand the alignment process

2. To understand how natural selection affects the likelihood of an amino acid substitution being accepted

3. To understand the basis of sophisticated alignment programs such as BLAST