comparative genomics and target discovery maarten sollewijn gelpke mdi, organon

Comparative genomics and Target discovery

Maarten Sollewijn Gelpke

MDI, Organon

What is comparative genomics?
What can we learn from comparative genomics?
What is target discovery?
What are the implications of comparative genomics for target discovery?
What issues in target discovery can be addressed by comparative genomics?

Overview

Introduction to genomes and sequencing. Comparative genomics aspects. Phylogenomics concepts. Examples of comparative genomics.

Sequence availability

Availability of gene and protein sequences has increased enormously during the last 2 decades.

Current capacity of the main sequencing centers is >3 Gb per month per center.

This will increase again dramatically with the development of new superfast sequencing techniques.

Currently > 100Gbases

Genomes sequenced

1995: First bacterial genomes sequenced (H. influenzae and M. genitalium)
1996: The yeast genome
1997: E. coli K12
1998: C. elegans
1999: Full sequence of chr. 22
2000: D. melanogaster genome and chr. 21; A. thaliana
2001: Human draft
2002: Mouse, Ciona, rice, Fugu, Anopheles
2003-2005: Chimpanzee; human finished, rat, chicken
Xenopus, zebrafish

Genome sequencing

Evolutionary relationship between metazoans (multicellular animals) that have been sequenced or are due for sequencing.

[Figure: phylogenetic tree of sequenced metazoans. Clade labels: metazoans, nematodes, insects, chordates, tunicates, vertebrates, fish, tetrapods, amphibia, amniota, aves, mammalia, primates, rodents, bovidae, carnivores.]

Genome sequencing

BAC fingerprinting shotgun approach: accurate but laborious!

Shotgun sequencing (WGS)

[Figure: sequencing workflow. BAC clone (100-200 kb) or whole genome (30 Mb - 3 Gb) → sheared DNA (1.0-2.0 kb) → sequencing templates → random reads → assembly → finishing. The assembled consensus sequence is annotated with typical problems: low base quality, gaps, single-stranded regions and (inverted) mis-assemblies.]
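The assembly step can be illustrated with a toy greedy overlap merger. This is a minimal sketch, not how production assemblers work: the function names and the `min_len` parameter are assumptions made for illustration, reads are assumed error-free and on the same strand, and repeats would cause exactly the kind of mis-assembly listed in the figure.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the longest overlap.

    Returns a list of contigs; more than one contig means a gap remains.
    """
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: the contigs are separated by gaps
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGTA", "GCGTACGT", "ACGTTAGC"]))
```

Reads that share no sufficiently long overlap are left as separate contigs, which corresponds to the gaps that the finishing phase has to close.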

Genome sequencing

Current state of sequenced ‘organisms’: >316 prokaryotes, >27 archaea, >280 eukaryotes (complete, in assembly or in progress), >1600 viruses and >500 mitochondria/chloroplasts.

Some ongoing genome sequencing projects: poplar, gibbon, platypus, Drosophila species, a variety of pathogenic fungi and bacteria, etc. Meta-genomic projects on environmental samples (soil, deep sea, waste sites).

Future of genome sequencing?

New complete genomes. New low-redundancy genomes. New (low-redundancy) genome areas. Meta-genomics. Sequencing of microbial communities. Sequencing of extinct species.

40,000-year-old cave bear: 26k, 21 genes.
45,000-year-old Neanderthal: 75k; diverged from the human lineage ~315,000 years ago.

Comparative genomics

Discover what lies hidden in genomic sequence by comparing sequence information.

Main areas: whole genome alignment, gene prediction, regulatory element prediction, phylogenomics, pharmacogenetics.

Affected by evolutionary aspects: mutational forces (introduce random mutations), selection pressures, the ratio of non-synonymous to synonymous substitutions, and mutation rates lower or higher than neutral.
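To make the non-synonymous versus synonymous distinction concrete, the sketch below classifies codon differences between two aligned coding sequences using the standard genetic code. It is a simplified illustration under stated assumptions: the sequences are gap-free and in frame, the function name is hypothetical, and the result is a raw count per differing codon. A real dN/dS estimate also normalises by the number of synonymous and non-synonymous sites, which is not done here.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_substitutions(seq_a: str, seq_b: str):
    """Return (synonymous, non_synonymous) counts of differing codons."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned, gap-free and a multiple of 3 long")
    syn = nonsyn = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1      # same amino acid: silent change
        else:
            nonsyn += 1   # amino acid replaced
    return syn, nonsyn


if __name__ == "__main__":
    # CTT -> CTC keeps leucine; GAA -> GAT changes Glu to Asp.
    print(count_substitutions("ATGCTTGAA", "ATGCTCGAT"))
```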

Comparing sequences, methods.

Pairwise comparison of sequences (alignments): proteins or genes. A variety of local alignment tools, like BLAST, Smith-Waterman, etc.

Multiple sequence comparisons (ClustalW, Muscle, etc.).

Results may depend on alignment settings.
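As an illustration of local alignment and of its sensitivity to settings, here is a minimal Smith-Waterman scoring sketch with a linear gap penalty. The scoring values (`match`, `mismatch`, `gap`) are arbitrary assumptions chosen for the example; real tools use substitution matrices and affine gap penalties.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere, which makes it local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    # The same pair of sequences scores differently under different gap penalties.
    print(smith_waterman("ACGTACGT", "ACGTTACGT", gap=-2))
    print(smith_waterman("ACGTACGT", "ACGTTACGT", gap=-10))
```

With the cheap gap the best alignment spans both sequences with one gap; with the expensive gap the best alignment is a shorter ungapped stretch, so the reported alignment itself changes with the settings.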

Comparing sequences, methods.

Whole genome comparisons: large stretches of sequence. Divergence up to 450 Mya (fugu-human) with sufficient similarity remaining. Tools: BLAT, BLASTZ, Phusion/BlastN.

Seeding strategy → alignment extension → gapped alignments
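The first two stages of this pipeline can be sketched as exact k-mer seeding followed by ungapped extension. This is a simplified illustration, not the algorithm of BLAT or BLASTZ: the function names and the seed length `k` are assumptions, extension here stops at the first mismatch, and the final gapped-alignment stage is omitted.

```python
from collections import defaultdict


def find_seeds(query: str, target: str, k: int = 4):
    """Return (query_pos, target_pos) pairs where an exact k-mer is shared."""
    index = defaultdict(list)
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in index.get(query[i:i + k], [])]


def extend_seed(query: str, target: str, i: int, j: int, k: int):
    """Extend a seed in both directions without gaps while bases match.

    Returns (query_start, target_start, length) of the extended match.
    """
    start_q, start_t = i, j
    while start_q > 0 and start_t > 0 and query[start_q - 1] == target[start_t - 1]:
        start_q -= 1
        start_t -= 1
    end_q, end_t = i + k, j + k
    while end_q < len(query) and end_t < len(target) and query[end_q] == target[end_t]:
        end_q += 1
        end_t += 1
    return start_q, start_t, end_q - start_q


if __name__ == "__main__":
    q, t = "TTACGTACCA", "GGGACGTACGG"
    matches = {extend_seed(q, t, i, j, 4) for i, j in find_seeds(q, t, 4)}
    print(sorted(matches))
```

Several overlapping seeds collapse into the same extended match, which is why the results are collected in a set.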

Whole genome comparison

Conservation of synteny! Cross-reference any genetic traits (diseases!) from one organism (e.g. mouse) to genes in the syntenic regions in the other organism (e.g. human).

Genome expansion and contraction. Genome duplications and segmental duplications: an important mechanism for generating new genes.

(G+C) content, CpG islands. Do they reflect different mutational or DNA repair processes?

Repeats. Transposable elements are a main force in reshaping genomes. TEs (or remainders thereof) can be used to measure evolutionary forces acting on the genome. Neutral mutation rate.
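The (G+C) content and CpG statistics mentioned above are simple to compute for a sequence window. The sketch below uses the common observed/expected CpG ratio, (number of CpG × length) / (number of C × number of G); the function names are assumptions, and the slides do not state which island criteria were used.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0  # no CpG is possible without both bases
    return seq.count("CG") * len(seq) / (c * g)


if __name__ == "__main__":
    window = "CGCGCGCG"
    print(gc_content(window), cpg_obs_exp(window))
```

In practice these values are computed in sliding windows along the genome, and windows that combine high (G+C) content with a high observed/expected ratio are candidate CpG islands.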

Gene prediction

Comparing sequences has contributed enormously to the accuracy of gene prediction.

Evidence-based method: use cDNAs, ESTs and proteins from various organisms, and apply gene feature rules.

[Figure: proteins, clustered ESTs and cDNAs combined into a gene model.]

Gene prediction

De novo methods: alignment of genomic sequences, splicing rules and other gene features.

De novo gene prediction by comparing sequences attempts to model negative selection against mutations. Regions with fewer mutations are conserved because mutations there were detrimental to the organism.

Prediction of similar proteins in both genomes.

Newly predicted protein in mouse and human, similar to the disease-related gene dystrophin.

Regulatory element prediction

The complexity of higher eukaryotes and their relatively low number of genes can be explained partially through the importance of transcriptional regulation.

Identification of RE’s will have an extensive impact in understanding gene expression patterns (expression intensity, tissue specificity), relations within expression patterns and inferring biological systems or networks.

Regulatory element prediction

No formal models for regulatory motifs. Attempt to find conserved regions or motifs based on the global alignment of similar sequences of different organisms (phylogenetic footprinting).

Which species to compare? Evolutionary distance? What regions around gene models to investigate? 5' and 3' flanking regions, introns? Take expression patterns into account? How does evolution affect REs?
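Phylogenetic footprinting can be sketched as a scan for windows of conserved columns in a multiple alignment. This is a minimal illustration under stated assumptions: the alignment is given as equal-length strings with `-` for gaps, and the function name, `window` and `min_identity` are hypothetical parameters rather than settings from any published method.

```python
def conserved_windows(alignment, window: int = 6, min_identity: float = 1.0):
    """Return start columns of windows in which the fraction of fully
    conserved columns is at least min_identity."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    # A column is conserved when every species has the same non-gap base.
    conserved = [len(set(col)) == 1 and col[0] != "-" for col in zip(*alignment)]
    hits = []
    for start in range(length - window + 1):
        fraction = sum(conserved[start:start + window]) / window
        if fraction >= min_identity:
            hits.append(start)
    return hits


if __name__ == "__main__":
    aln = ["AATGACGTCC",
           "CTTGACGTAG",
           "GCTGACGTTA"]
    print(conserved_windows(aln, window=6))
```

The open questions on the slide map onto the inputs: the choice of species and their evolutionary distance determines how informative a conserved column is, and the choice of region determines which alignment is scanned.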

Phylogenomics

Comparison of genes and gene products across a number of species (whole genomes), characterizing homologues and gaining insights into the evolutionary process itself.

Pharmacophylogenomics is the use of phylogenomics in aid of drug discovery, through improved target selection and validation.

Orthology and paralogy

Orthologs: genes in different species that arose from a single gene in the most recent common ancestor, by speciation.

Paralogs: genes in the same species that arose from a single gene in an ancestral species, by a process of gene duplication.

Phylogenetic tree of gene X

Target orthology

Species differences frequently affect the progression of targets and compounds. Orthology maps in combination with expression studies may explain these differences.

Establishing orthology: reciprocal highest-scoring BLAST hit; conservation of synteny; gene loss or rate of evolution issues.

Orthology does not guarantee common function (functional shift). Extensive sequence divergence. High ratios of non-synonymous to synonymous nucleotide substitutions. Comparison of regulatory regions?
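The reciprocal highest-scoring hit criterion can be sketched as follows. The input format (nested dictionaries of hit scores per query gene) and the gene names are hypothetical; in practice the scores would come from parsed BLAST output, and this simple version does not handle ties, gene loss or unequal evolutionary rates.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring hit."""
    return {query: max(hits, key=hits.get) for query, hits in scores.items() if hits}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


if __name__ == "__main__":
    # Hypothetical scores: hA/mA and hB/mB are reciprocal best hits,
    # while hC's best hit (mA2) prefers hA, so hC gets no ortholog call.
    human_vs_mouse = {"hA": {"mA": 900, "mA2": 700}, "hB": {"mB": 500}, "hC": {"mA2": 300}}
    mouse_vs_human = {"mA": {"hA": 880}, "mA2": {"hA": 690, "hC": 310}, "mB": {"hB": 510}}
    print(reciprocal_best_hits(human_vs_mouse, mouse_vs_human))
```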

Target paralogy

Key insights in large pharmacologically relevant families (NRs, GPCRs) can be gained from paralogy analysis.

Paralogy is interrelated with several other gene-to-function relationships that can seriously affect the suitability of genes as drug targets.

[Figure: schematic representation of various mappings of genes to functions (function, protein, gene): paralogy, pleiotropy, redundancy, heteromery, crosstalk, alternative transcription.]

Pleiotropy: suggested to precede paralogy. Relaxed substrate or ligand specificity. Multiple protein domains. Tissue or cellular localization.

Redundancy: total or partial redundancy of function. Directly linked to paralogy. Robustness against gene knock-outs (target validation). Examples: PPAR-δ / PPAR-α in skeletal muscle; PXR / FXR in bile acid signaling; dopamine transporters / serotonin transporters in adjacent neurons.

Heteromery: formation of heteromers between paralogs. Known examples in major classes of drug targets: GPCRs (GABA-B receptors), NRs (formation of heterodimers with retinoid X receptors, RXR), ion channels.

Crosstalk: combination of pleiotropy and redundancy. May be regulated in time and space (expression and localization). Example: action of cytokines (interleukins) on immune cell types.

Alternative transcription: intermediate between paralogy and pleiotropy, ‘paralogy in place’. Increases the effective size of the genome (an estimated >30% of human genes show alternative transcription!).

Effects on drug discovery

Functional shifts, pleiotropy and redundancy can be good or bad news for drug discovery.

Functional shifts: misleading or unavailable animal model; animal toxicity irrelevant for humans.

Pleiotropy: unintended drug effects; opportunities for multiple indications.

Redundancy: disease resistant to treatment (multi-functionality); highly selective treatment for complex diseases.

Pharmacogenetics

Within-species comparative genomics: single nucleotide polymorphisms (SNPs). Current focus is on coding regions, expected to expand to sites of transcription regulation. Determine the site of a SNP and the allele frequencies from ethnic or multi-ethnic panels of individuals (e.g. 100).

Pharmacogenetics (PGx): relate SNP information to efficacy and safety issues during the drug development process.

Efficacy PGx: select/predict drug responders, increase confidence in a certain drug in development.

Safety PGx: identification of individuals with adverse effects to a drug.
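Determining allele frequencies from a genotyped panel is a short calculation. The sketch below assumes a biallelic SNP with alleles labelled 1 and 2 and genotype counts for 1-1, 1-2 and 2-2 individuals; the function name and the example counts are made up for illustration.

```python
def allele_frequencies(n11: int, n12: int, n22: int):
    """Return (freq_allele_1, freq_allele_2) from genotype counts.

    Each individual carries two alleles, so a panel of N individuals
    contributes 2N alleles in total.
    """
    total_alleles = 2 * (n11 + n12 + n22)
    if total_alleles == 0:
        raise ValueError("panel is empty")
    freq_1 = (2 * n11 + n12) / total_alleles
    return freq_1, 1.0 - freq_1


if __name__ == "__main__":
    # Hypothetical panel of 100 individuals.
    print(allele_frequencies(n11=49, n12=42, n22=9))
```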

Examples

New genes and REs from yeast genomes.
Multi-species comparisons from targeted genomic regions.
Comparative genomics at the vertebrate extremes.
Pharmacogenetics in drug efficacy.

Comparison of yeast species to identify genes and regulatory elements. (Kellis et al, Nature 2003)

Saccharomyces cerevisiae and 3 related species.

7x coverage WGS of each species. Assembly of draft genome sequence. S. cerevisiae genome aligned to the others using ORFs as seeds.

Most ORFs have 1:1 matches. Considerable conserved synteny. Most genomic rearrangements are clustered in telomeric regions. Local gene family expansion/contraction, creating phenotypic diversity over evolutionary time.

Balance between conservation and divergence allows for accurate gene identification and recognition of REs as well!

Identification of genes

Original S. cerevisiae genome (1996): 6275 ORFs.
Re-analysis and other evidence (2002): 6062 ORFs.
This study validates all ORFs using a reading frame conservation score (very sensitive): 5538 ORFs, 20 unresolved, 504 rejected ORFs!

In addition to gene recognition, also largely improved gene structure definitions (start, stop, intron).

Identification of regulatory elements

REs are difficult to identify: short (6-15 bp), sequence variation, few known rules.

De novo discovery of REs directly from genomic sequence. Develop a motif conservation score system based on known motifs. 78 motifs discovered, overlapping with 36 of 55 known motifs.

Putative annotation of motifs using adjacent genes (GO): 25 of 42 new motifs show high category annotation correlation.

Discovery of combinatorial control of REs.

Applications to the human genome? Increase the number of species in the comparison to improve the low signal-to-noise ratio in humans.

Multi species comparisons from targeted genomic regions. (Thomas et al, Nature 2003)

Comparing targeted regions in multiple evolutionarily diverse vertebrates (conservation is less likely to occur by chance).

ENCODE project: 44 genomic regions (14 manually selected, some of which are disease related; 30 random) of diverse gene density and non-exonic conservation. Species include primates, bat, alligator, elephant, cat, emu, leopard, salmon, etc.

Initial analysis: 1.8 Mb on chromosome 7 containing 10 genes, including CFTR, from 12 species. Detection of ~1000 multi-species conserved sequences, of which >60% would not be detected by a 2-species comparison.

Comparative genomics at the vertebrate extremes (Boffelli et al, Nat Rev Genet 2004)

What can be learned from comparisons of genomes that are distant or closely related in evolution?

Distant comparisons reveal the most constrained sequence elements. Most of the conserved human-fish non-coding sequences are found near genes with roles in embryonic development. Mutations in these sequences can have an important role in human disease.

Human-Fugu conservation of non-coding sequence in the DACH gene area (development of brain, limbs, sensory organs).

Validation of identified enhancer regions by driving expression of a reporter in mouse embryos.

Comparative genomics at the vertebrate extremes

Intraspecies sequence comparisons allow identification of species-specific sequences. Phylogenetic shadowing. Requires a high rate of polymorphism.

Comparison among primates shows human-specific sequences. Analysis of the regulatory sequence of ApoA (involved in human heart disease).

A. Mutation rate analysis of the Ciona intestinalis 5' region of the forkhead gene. B. Validation of identified potential regulatory elements in Ciona larvae.

Pharmacogenetics in drug efficacy

Efficacy PGx for an obesity drug. Compare genotypes 1-1, 1-2 and 2-2.
