Protein Modules
AIMS
To introduce the concept of multidomain proteins
OBJECTIVES
To define the terms associated with analysis of multidomain proteins
To introduce the major secondary databases
To select an appropriate secondary database for analysis of protein domains
To carry out an analysis to establish the domain structure of a protein
To ascribe likely biological functions to protein domains
When the amino acid sequences of two proteins are compared and found to exhibit significant similarity, they are assumed to be evolutionarily related, i.e. they are homologues
There are two classes of homologue: orthologues and paralogues
Orthologous genes are descended from a single ancestral gene, and their divergence in different organisms simply parallels speciation
Paralogous genes are descended from copies of a gene that duplicated within a single ancestral genome
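The distinction can be illustrated with a small sketch. It assumes a gene tree in which every ancestral node has already been labelled as a speciation or a duplication event; two genes are then called orthologues if their most recent common ancestor is a speciation node, and paralogues if it is a duplication node. The tree below is a simplified, hypothetical globin example, and the names PARENT, EVENT, ancestors and classify are invented for illustration only.

```python
# Toy gene tree (hypothetical labels): each gene or ancestor maps to its parent node.
PARENT = {
    "human_alpha": "alpha_ancestor",
    "mouse_alpha": "alpha_ancestor",
    "human_beta": "beta_ancestor",
    "mouse_beta": "beta_ancestor",
    "alpha_ancestor": "globin_duplication",
    "beta_ancestor": "globin_duplication",
}

# Event assumed to have occurred at each ancestral node.
EVENT = {
    "alpha_ancestor": "speciation",
    "beta_ancestor": "speciation",
    "globin_duplication": "duplication",
}


def ancestors(node):
    """Return the ancestral nodes of `node`, nearest first."""
    path = []
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def classify(gene_a, gene_b):
    """Classify two homologous genes by the event at their last common ancestor."""
    ancestors_b = set(ancestors(gene_b))
    for node in ancestors(gene_a):
        if node in ancestors_b:
            return "orthologues" if EVENT[node] == "speciation" else "paralogues"
    raise ValueError("the two genes share no ancestor in this tree")


for pair in [("human_alpha", "mouse_alpha"), ("human_alpha", "human_beta")]:
    print(pair, classify(*pair))
```

In practice the hard part is inferring the event labels from real sequence data; the sketch only shows how the definitions apply once those labels are known.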
A substantial proportion of all proteins are composed of more than one domain
A domain is defined as sequentially consecutive residues in a protein that can fold up independently of other parts of the protein
Crystallographers commonly refer to domains as folds and the term module is also used
The domain/module is the fundamental unit of protein structure
Inter-domain splicing, fusion, deletion, duplication and shuffling have occurred frequently during evolution, whereas intra-domain rearrangements have occurred rarely
When two homologous proteins are aligned, there are one or more regions where sequence identity is particularly high, and these regions frequently enable the definition of motifs or signature sequences that are diagnostic (see Module 4)
Any particular domain may have one or more characteristic motifs
Domains/modules, motifs/signature sequences constitute the content of many secondary databases and are of enormous value in attempting to predict the function and structure of new proteins
Low complexity regions
The individual domains of multidomain proteins are frequently separated from each other by regions of low complexity, also referred to as linker sequences
Long stretches of repeated residues, particularly proline, glutamine, serine or threonine, often indicate linker sequences
The program SEG detects such low complexity regions and can be used as part of BLAST to mask off segments of the query sequence that have low compositional complexity
This leaves the biologically interesting regions of the query sequence available for matching against database sequences
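The idea of masking low complexity regions can be sketched as follows. This is not the SEG algorithm itself, only a simplified stand-in: it slides a fixed window along the sequence, computes the Shannon entropy of the residue composition in each window, and replaces every residue in a low-entropy window with X. The function names, the window length of 12, the threshold of 2.2 bits and the example sequence are all illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Shannon entropy (bits per residue) of the residue composition."""
    total = len(segment)
    counts = Counter(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq: str, window: int = 12, threshold: float = 2.2) -> str:
    """Replace every residue that falls in a low-entropy window with 'X'."""
    seq = seq.upper()
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


# Made-up sequence: two varied regions separated by a proline/glutamine-rich linker.
example = "MKVLAAGIVGLCARHEW" + "PQPQPQPQPQPQPQPQ" + "DKTFWYRSNCEGHILM"
print(mask_low_complexity(example))
```

Because windows overlap, a few residues flanking the repetitive stretch are masked as well. The masked sequence keeps its original length, so match positions reported by a later database search still map back to the unmasked query.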
Secondary (pattern) databases
Analysis of the primary protein sequence databases, usually through multiple sequence alignments, has led to the identification of sequence patterns (motifs, signatures, blocks, profiles) common to homologous proteins or protein modules
These motifs, usually ~10-20 amino acids in length, commonly correspond to key functional or structural elements, often domains/modules, and are extremely useful in identifying such features in new uncharacterized proteins
An unknown protein is often too distantly related to any protein of known sequence to detect its resemblance by overall sequence alignment, but it can potentially be identified by the occurrence in its sequence of a particular motif
There are a number of programs that allow an unknown protein to be searched against databases of motifs, profiles, etc.
Pfam is a collection of multiple alignments and profile hidden Markov models of protein domain families, which is based on proteins from both SWISS-PROT and SP-TrEMBL
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs
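To make the idea of searching a sequence for a pattern concrete, the sketch below translates a PROSITE-style pattern into a Python regular expression and reports where it occurs in a sequence. It handles only a subset of the pattern syntax: single residues, x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) repeat counts, and the < and > terminal anchors. The function names and the test sequence are made up for illustration; the pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern.

```python
import re

# One pattern element: optional '<', a residue specification,
# an optional repeat count such as (3) or (2,4), and optional '>'.
ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern (subset of the syntax) into a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = m.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except those listed
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(("^" if start else "") + core + ("$" if end else ""))
    return "".join(parts)


def find_motif(pattern: str, seq: str) -> list:
    """Return 0-based start positions of all (possibly overlapping) matches."""
    regex = "(?=" + prosite_to_regex(pattern) + ")"
    return [m.start() for m in re.finditer(regex, seq.upper())]


print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_motif("N-{P}-[ST]-{P}", "MANGTLNPSAANKSQ"))
```

Short patterns such as this one match by chance in many unrelated sequences, so a hit is a suggestion to follow up rather than proof of function. The profile and hidden Markov model approaches used by resources such as Pfam are more sensitive, and this sketch does not attempt them.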