
Post on 25-Sep-2020


From the Human Genome Project to today: the evolution of sequencing techniques, genomic and proteomic analysis, and future prospects

David Horner, Dipartimento di Bioscienze, Università degli Studi di Milano

How is DNA sequenced?

•  Sanger sequencing (1978 – today):
   –  Relatively high costs
   –  Sample preparation takes a long time
   –  Produces few LONG reads (1000 nt)
   –  Few sequencing errors

Sanger sequencing (1978)

Genome  

1)  Fragment the genome "randomly" and clone the fragments into plasmids

2)  Sequence one fragment (chosen at random)

3)  Identify an overlapping clone, sequence it and build a longer fragment

4)  Go back to step 2 (until the end!)

How big are genomes?

[Figure: genome size ranges on a logarithmic axis from 10^4 to 10^11 bp for viruses, plasmids, bacteria, fungi, plants, algae, insects, mollusks, bony fish, amphibians, reptiles, birds and mammals]

Sanger sequencing (1990s)

96 reactions in parallel, 1000 nt per reaction

Robot!

1981 • Sinclair ZX-81

Computer  

Whole  Genome  Shotgun  Approach  

Assembly  by  overlap  

Repeated sequences

[Figure labels: unique sequences, repeated sequences]

If the repeated sequences are shorter than the sequencing reads, there is no problem

A   B   C  

A   B   C  

Repeated sequences

If they are longer, WE CANNOT ASSEMBLE!

A   B   C  ?

A   C   B  ?

Steps  to  Assemble  a  Genome  

1. Find overlapping reads

2. Merge some “good” pairs of reads into longer contigs

3. Link contigs to form supercontigs

4. Derive consensus sequence ..ACGATTACAATAGGTT..
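The first two assembly steps (find overlaps, merge the best pairs) can be illustrated with a toy greedy assembler. This is only a minimal sketch under simplifying assumptions: reads are error-free, all on the same strand, and free of repeats. The function names, the `min_len` parameter and the three example reads are hypothetical and chosen for illustration. Real assemblers also use mate pairs to build supercontigs and derive a consensus from multiple alignments, which this sketch does not attempt.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two sequences that share the longest suffix/prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no overlaps left: the remaining contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Hypothetical reads cut from the consensus fragment shown on the slide.
reads = ["ACGATTAC", "TTACAATA", "AATAGGTT"]
print(greedy_assemble(reads))
```

With a repeat longer than the reads, several merges become equally good and a greedy choice like this one can join the wrong pieces, which is the problem described on the repeated-sequences slides.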

Some terminology

read: a 500-900 long word that comes out of the sequencer

mate pair: a pair of reads from the two ends of the same insert fragment

contig: a contiguous sequence formed by several overlapping reads, with no gaps

supercontig (scaffold): an ordered and oriented set of contigs, usually linked by mate pairs

consensus sequence: the sequence derived from the multiple alignment of the reads in a contig

Contigs and scaffolds

Shotgun sequencing

1. Genome fragmentation

2. Library

3. Sequences

4. Genome assembly by overlap

Timeline  

Meet  Your  Genome  

(The wheat genome (16.9 Gbp) is more than 5 times bigger than the human genome, and 80% of it consists of repetitive sequences)

The  Human  Genome  

How COMPLEX is the genome?

(The WHEAT genome (16.9 Gbp) is more than 5 TIMES larger than the human genome. 80% of it consists of repeated elements.)

The human genome: c. 3 Gb

Physical  Mapping  

Top  down  sequencing  

1. Genome fragmentation

2. Physical map

3. Subclone library

4. Sequence clones by walking or by SHOTGUN strategy

Human  Genome  Project  16/02/2001      

OK, we have sequenced the genome. Now what?

Where are the genes? Sequence cDNA (mRNA) and align it to the genome

But which genes/alleles are responsible for phenotypes of interest?

We need to compare the genomes of many different individuals and use statistics to understand complex phenotypes. That is, we need to sequence MANY individuals of the same species and associate phenotypes with genotypes: Genome Wide Association Studies (GWAS)
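As a rough illustration of what associating a phenotype with a genotype means statistically, the sketch below computes a Pearson chi-square statistic on allele counts for a single variant in cases and controls. It is a minimal example, not the method used in any particular study: the function name and the counts are hypothetical, and real GWAS analyses repeat a test of this kind across very many variants and must correct for multiple testing and population structure.

```python
def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 allele-count table."""
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_totals[r] * col_totals[c] / total
            chi2 += (table[r][c] - expected) ** 2 / expected
    return chi2


# Hypothetical counts: the alternative allele is seen 60 times out of 100 in cases
# and 40 times out of 100 in controls.
statistic = allelic_chi_square(60, 40, 40, 60)
print(f"chi-square = {statistic:.2f}")
```

For a single test with one degree of freedom, a statistic above 3.84 corresponds to p < 0.05. A genome-wide scan needs a far stricter threshold because so many variants are tested.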

“GWAS” + “Human” in the literature: before 2004, 60 articles; from 2004 onwards, more than 14000 articles. More than 10000 human genomes have been sequenced since 2004.

How was it done?

Revolutionary techniques in molecular genetics

•  Molecular cloning
•  Sanger sequencing
•  PCR
•  Gel electrophoresis
•  Blotting (Southern/Northern/Western etc.)
•  Expression cloning (microarrays)

Next Generation Sequencing

•  (Massively parallel / second generation)
•  HIGH throughput (lots of data)
•  Relatively low cost
•  Applicable across many fields

Read  Length  is  Not  As  Important  For  Resequencing  

[Figure: percentage of paired k-mers with a uniquely assignable location (0% to 100%) against k-mer read length (8 to 20 bp), for E. coli and human]
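The idea behind the figure, that longer k-mers are more likely to map to a single location, can be explored with a small function that measures how many k-mers of a sequence occur exactly once. This is a simplified sketch: it counts single (unpaired) k-mers on one strand only, whereas the figure refers to paired k-mers, and the function name and toy sequence are hypothetical.

```python
from collections import Counter


def unique_kmer_fraction(sequence, k):
    """Fraction of k-mer positions in `sequence` whose k-mer occurs exactly once."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    return sum(1 for kmer in kmers if counts[kmer] == 1) / len(kmers)


# Toy sequence containing the repeat "ACG"; longer k-mers span the repeat and become unique.
toy_genome = "ACGTACGA"
for k in (2, 3, 5):
    print(k, round(unique_kmer_fraction(toy_genome, k), 2))
```

Run on a real genome, the same calculation would show the fraction of uniquely placeable k-mers rising with k, and rising more slowly for a large, repeat-rich genome than for a small bacterial one.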

Cost per megabase of DNA sequence

Next-Generation Sequencing

Illumina / Solexa Genome Analyzer, HiSeq 2000 (150x2 bp, 600 Gb / run)

Applied Biosystems SOLiD 4 System™ (100x2 bp, 400 Gb / run)

Roche / 454 Genome Sequencer FLX Titanium (800 bp, 800 Mb / run)

Ion Proton

PacBio

A number of platforms using different strategies and chemistries, and with different throughput are entering the market.

When has a genome been fully sequenced?

Fold coverage   % sequenced
0.25            22
0.5             39
0.75            53
1               63
2               87.5
3               95
4               98.2
5               99.4
6               99.75
7               99.91
8               99.97
9               99.99
10              99.995
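The coverage table is close to what a simple Poisson model predicts. If reads land uniformly at random, the probability that a given base is covered at least once at fold coverage c is 1 - e^(-c). The slides do not state which model produced the table, so treat this as an assumption; the function name is hypothetical. The formula agrees with the tabulated values to within about one percentage point.

```python
import math


def fraction_sequenced(fold_coverage):
    """Expected fraction of the genome covered by at least one read (Poisson model)."""
    return 1.0 - math.exp(-fold_coverage)


for c in (0.25, 0.5, 0.75, 1, 2, 3, 4, 5, 10):
    print(f"{c:>5}x  {100 * fraction_sequenced(c):.3f}%")
```

Under this model the uncovered fraction shrinks exponentially but never reaches zero, which is one way to read the question on the slide: "fully sequenced" is always a matter of how small a gap one is willing to accept.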

Illumina  

• Bridge  PCR  

• Sequencing  by  synthesis  using  fluorescent  reversible  terminators      

Technology Overview: Solexa/Illumina Sequencing

http://www.illumina.com/

Immobilize DNA to Surface

Source:    www.illumina.com  


Bridge  PCR  

•  DNA fragments are flanked with adaptors.
•  A flat surface is coated with two types of primers, corresponding to the adaptors.
•  Amplification proceeds in cycles, with one end of each bridge tethered to the surface.
•  Used by Solexa.

Sequence Colonies

The bases are “reversible terminators”: only one base can be added per cycle. The terminators are then modified so that the next round of extension can occur.

Sequence Colonies

Each base carries a different fluorophore (color). It is excited by a laser, and the color is read.
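The cycle described on these slides (add one terminated base, image its color, unblock, repeat) can be mimicked in a few lines. This is a toy model only: it ignores strand orientation, clusters and signal noise, and the dye colors in `DYE_COLOR` are hypothetical placeholders rather than the colors used by any instrument.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
# Hypothetical dye colors, chosen only for illustration.
DYE_COLOR = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
COLOR_TO_BASE = {color: base for base, color in DYE_COLOR.items()}


def sequence_by_synthesis(template, cycles):
    """Return the colors imaged in each cycle and the read called from them."""
    colors = []
    for position in range(min(cycles, len(template))):
        # One terminated base complementary to the template is added; extension stops.
        incorporated = COMPLEMENT[template[position]]
        # Laser excitation: the color of the incorporated base is imaged.
        colors.append(DYE_COLOR[incorporated])
        # The terminator block and dye are then removed, allowing the next cycle.
    read = "".join(COLOR_TO_BASE[color] for color in colors)
    return colors, read


colors, read = sequence_by_synthesis("ACGT", cycles=3)
print(colors, read)
```

The number of cycles sets the read length, which is why the platforms on the following slides are described by figures such as 150x2 bp.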

Illumina sequencers: sequencing-by-synthesis coupled with bridge amplification

Available versions:

§  HiSeq 2000 (up to 600 Gb, 250x2 bp reads)
§  HiSeq 1000 (up to 300 Gb, 250x2 bp reads)
§  Genome Analyzer (up to 95 Gb, 150x2 bp reads)
§  MiSeq platform (up to 6 Gb, 250x2 bp reads)

Since 2008

SNP calling

•  The basic principle is simple!

•  This looks like a homozygous SNP

ACTTTTGCCCTGTGTCTAAAATGCGTCGTAGCATGT   reference
ACTTTTGCCCTGTGACTAAAATG                read1
    TTGCCCTGTGACTAAAATGCGT             read2
     TGCCCTGTGACTAAAATGCGTA            read3
      GCCCTGTGACTAAAATGCGTAG           read4
      GCCCTGTGACTAAAATGCGTAG           read5
        CCTGTGACTAAAATGCGTAGTAG        read6

SNP calling

•  And this one looks heterozygous

ACTTTTGCCCTGTGTCTAAAATGCGTCGTAGCATGT   reference
ACTTTTGCCCTGTGACTAAAATG                read1
    TTGCCCTGTGTCTAAAATGCGT             read2
     TGCCCTGTGACTAAAATGCGTA            read3
      GCCCTGTGTCTAAAATGCGTAG           read4
      GCCCTGTGACTAAAATGCGTAG           read5
        CCTGTGTCTAAAATGCGTAGTAG        read6
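The principle behind the two SNP calling slides can be sketched as a pileup: stack the aligned reads over the reference, count the bases seen at one position, and decide from the counts. The sketch below uses the reference and reads from the slides. The read offsets are the positions at which each read lines up with the reference, and `min_depth` and `min_fraction` are hypothetical thresholds chosen for illustration. Real callers also weigh base and mapping quality, which this example ignores.

```python
from collections import Counter

REFERENCE = "ACTTTTGCCCTGTGTCTAAAATGCGTCGTAGCATGT"

# (offset on the reference, read sequence), as aligned on the slides.
HOMOZYGOUS_READS = [
    (0, "ACTTTTGCCCTGTGACTAAAATG"), (4, "TTGCCCTGTGACTAAAATGCGT"),
    (5, "TGCCCTGTGACTAAAATGCGTA"), (6, "GCCCTGTGACTAAAATGCGTAG"),
    (6, "GCCCTGTGACTAAAATGCGTAG"), (8, "CCTGTGACTAAAATGCGTAGTAG"),
]
HETEROZYGOUS_READS = [
    (0, "ACTTTTGCCCTGTGACTAAAATG"), (4, "TTGCCCTGTGTCTAAAATGCGT"),
    (5, "TGCCCTGTGACTAAAATGCGTA"), (6, "GCCCTGTGTCTAAAATGCGTAG"),
    (6, "GCCCTGTGACTAAAATGCGTAG"), (8, "CCTGTGTCTAAAATGCGTAGTAG"),
]


def pileup(reads):
    """Map each reference position to a Counter of the read bases aligned there."""
    columns = {}
    for offset, sequence in reads:
        for i, base in enumerate(sequence):
            columns.setdefault(offset + i, Counter())[base] += 1
    return columns


def call_genotype(reference, reads, position, min_depth=4, min_fraction=0.2):
    """Classify one reference position (0-based) from the bases piled up on it."""
    counts = pileup(reads).get(position, Counter())
    depth = sum(counts.values())
    if depth < min_depth:
        return "no call"
    # Keep only alleles supported by a reasonable share of the reads.
    alleles = sorted(b for b, n in counts.items() if n / depth >= min_fraction)
    ref_base = reference[position]
    if alleles == [ref_base]:
        return "homozygous reference"
    if len(alleles) == 1:
        return f"homozygous SNP {ref_base}>{alleles[0]}"
    return "heterozygous " + "/".join(alleles)


print(call_genotype(REFERENCE, HOMOZYGOUS_READS, 14))
print(call_genotype(REFERENCE, HETEROZYGOUS_READS, 14))
```

Position 14 (0-based) is the T in the reference where the reads on the slides carry an A. With only six reads, a 3/3 split is convincing, but at low depth a single sequencing error can look like a heterozygous site, which is why depth thresholds matter.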

On average, we expect to find a SNP (Single Nucleotide Polymorphism) between 2 human individuals about every 2000 bases.

99.95% identity

In a genome of about 3 Gb, that is maybe 1,500,000 differences!

Structural Variation (SV)

•  Any DNA sequence alteration other than a single nucleotide substitution
   –  copy number variations (CNV)
   –  transposon movement
   –  expansion of trinucleotide and other simple repeats
   –  insertions-deletions (indels)
   –  translocations
   –  inversions
•  The vast majority of SV events are small indels
•  Human genomes differ more as a consequence of structural variation than of single-base-pair differences*
   –  causal events in hereditary diseases
   –  somatic SV
   –  markers for GWAS / mapping studies

 


Copy Number Variation (CNVs)

So... how representative is the reference genome?

Applications of NGS platforms

•  DNA sequencing
   -  Genome resequencing (SNPs, CNV, GWAS)
   -  De novo sequencing
   -  Identification of genome structural variants (cancer genome)
   -  3D chromatin interactions
   -  Epigenomics (chromatin state and genome methylation)
   -  Metagenomics (taxonomic analysis of environmental samples)

•  RNA sequencing
   -  Qualitative and quantitative analysis of the transcriptome
   -  Identification and characterization of miRNAs and other ncRNAs
   -  RNA editing
   -  Metatranscriptomics (functional analysis of environmental samples)
