(shotgun) - read the docs · twospeciicconcepts: •...

67
(Shotgun) sequencing Titus Brown

Upload: others

Post on 26-Apr-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

(Shotgun)  sequencing  Titus  Brown    

Page 2: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Three  basic  problems  

A. B.

XXX

X

XX

X

X

XX

X

C.

XX

X X

XX

Resequencing,  coun2ng,  and  assembly.  

Page 3: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

1.  Resequencing  analysis  

A. B.

XXX

X

XX

X

X

XX

X

C.

XX

X X

XX

We  know  a  reference  genome,  and  want  to  find  variants  (blue)  in  a  background  of  errors  (red)  

Page 4: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

2.  Counting  We  have  a  reference  genome  (or  gene  set)  and  want  to  

know  how  much  we  have.    Think  gene  expression/microarrays.  A. B.

XXX

X

XX

X

X

XX

X

C.

XX

X X

XX

Page 5: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

3.  Assembly  We  don’t  have  a  genome  or  any  reference,  and  we  

want  to  construct  one.  (This  is  how  all  new  genomes  are  sequenced.)  

A. B.

XXX

X

XX

X

X

XX

X

C.

XX

X X

XX

Page 6: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and
Page 7: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Outline  •  Shotgun  sequencing  •  The  magic  of  polonies,  and  how  Illumina  sequencing  works  

•  Sequencing  depth,  read  length,  and  coverage  • Paired-­‐end  sequencing  and  insert  sizes  • Coverage  bias  •  Long  reads:  PacBio  and  Nanopore  sequencing    

Page 8: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Shotgun  sequencing  It  was  the  best  of  2mes,  it  was  the  worst  of  2mes,  it  was  

the  age  of  wisdom,  it  was  the  age  of  foolishness      

It  was  the  best  of  2mes,  it  was  the  wor  ,  it  was  the  worst  of  2mes,  it  was  the    isdom,  it  was  the  age  of  foolishness  

mes,  it  was  the  age  of  wisdom,  it  was  th    

Page 9: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and
Page 10: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Two  speciAic  concepts:  •  First,  sequencing  everything  at  random  is  very  much  easier  than  sequencing  a  specific  gene  region.    (For  example,  it  will  soon  be  easier  and  cheaper  to  shotgun-­‐sequence  all  of  E.  coli  then  it  is  to  get  a  single  good  plasmid  sequence.)  

•  Second,  if  you  are  sequencing  on  a  2-­‐D  substrate  (wells,  or  surfaces,  or  whatnot)  then  any  increase  in  density  (smaller  wells,  or  beRer  imaging)  leads  to  a  squared  increase  in  the  number  of  sequences  yielded.  

Page 11: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Random  sampling  =>  deep  sampling  needed  

Typically  10-­‐100x  needed  for  robust  recovery  (300  Gbp  for  human)  

11  

Page 12: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

“Coverage”  Genome (unknown)

X XX

XX

XX

X

XX

XX

X

X

Reads(randomly chosen;

have errors)

X

XX

“Coverage”  is  simply  the  average  number  of  reads  that  overlap  each  true  base  in  genome.  

 Here,  the  coverage  is  ~10  –  just  draw  a  line  straight  down  from  the  top  

through  all  of  the  reads.   12  

Page 13: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Illumina  yields  the  deepest  sequencing  available  •  MiSeq  

•  30  million  reads  per  run  •  300  base  paired-­‐end  reads  

•  HiSeq  2500  RR/X  10  •  6  billion  reads  per  run  •  150  base  paired-­‐end  reads  

•  PacBio  •  44,000  reads  per  run  •  8500  bp  in  length  

hRp://flxlexblog.wordpress.com/2014/06/11/developments-­‐in-­‐next-­‐genera2on-­‐sequencing-­‐june-­‐2014-­‐edi2on/  

Page 14: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Illumina  basics  

hRp://ted.b2.cornell.edu/cgi-­‐bin/epigenome/method-­‐1.cgi  

(See  hRp://seqanswers.com/forums/showthread.php?t=21  for  details)  

Page 15: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

A  movie  of  Illumina  sequencing:  

     

hRps://www.youtube.com/watch?v=tuD-­‐ST5B3QA#t=61  

Page 16: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

What  goes  wrong  with  basic  assumptions?  •  Not  all  sequence  is  as  easily  sequenced  as  other,  depending  on  your  sequencing  technology  (e.g.  GC/AT  bias);  

•  Some  RNA  not  be  as  accessible  as  others  (secondary  structure);  

Page 17: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

FASTQ  •  @895:1:1:1246:14654/1  •  CAGGCGCCCACCACCGTGCCCTCCAACCTGATGGT  •  +  •  ][aaX__aa[`ZUZ[NONNFNNNNNO_____^RQ_  •  @895:1:1:1246:14654/2  •  ACTGGGCGTAGACGGTGTCCTCATCGGCACCAGC  •  +  •  \UJUWSSV[JQQWNP]]SZ]ZWU^]ZX][^TXR`  •  @895:1:1:1252:19493/1  •  CCGGCGTGGTTGGTGAGGTCACTGAGCTTCATGTC  •  +  •  OOOKONNNNN__`R]O[TGTRSY[IUZ]]]__X__  

Page 18: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Read  length  and  reconstructability  

Whiteford  et  al.,  Nuc.  Acid  Res,  2005  

Page 19: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

“Reconstructability”  • Assembling  new  genomes  or  transcriptomes…  

• Haplotyping  -­‐  think  human  gene2cs  &  viruses,  both.  

Page 20: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Repeats!  (and  shared  exons)  A

R

R B

C D

Page 21: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Longer  reads  …  OR  …  Paired-­‐end/mate  pair  sequencing  

A

R

R B

C D

longer reads

paired ends

Page 22: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Paired-­‐end  sequencing  

hRp://vallandingham.me/RNA_seq_differen2al_expression.html  

Page 23: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and
Page 24: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Mate-­‐pair  sequencing  (long  insert)  

Page 25: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Longer  reads  • PacBio  • Moleculo  • Nanopore  

Page 26: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

hRp://www.melanieswan.com/FOLS.html  

Page 27: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Moleculo  (Illumina)  

hRp://nextgenseek.com/2013/07/illumina-­‐announces-­‐moleculo-­‐long-­‐read-­‐technology-­‐and-­‐phasing-­‐as-­‐service/  

Page 28: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

hRp://labs.mcb.harvard.edu/branton/projects-­‐NanoporeSequencing.htm  

Page 29: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Actual  yields  •  MiSeq  

•  30  million  reads  per  run  •  300  base  paired-­‐end  reads  

•  HiSeq  2500  RR/X  10  •  6  billion  reads  per  run  •  150  base  paired-­‐end  reads  

•  PacBio  •  44,000  reads  per  run  •  8500  bp  in  length  

hRp://flxlexblog.wordpress.com/2014/06/11/developments-­‐in-­‐next-­‐genera2on-­‐sequencing-­‐june-­‐2014-­‐edi2on/  

Page 30: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and
Page 31: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Your  basic  data  (FASTQ)  •  @895:1:1:1246:14654/1  •  CAGGCGCCCACCACCGTGCCCTCCAACCTGATGGT  •  +  •  ][aaX__aa[`ZUZ[NONNFNNNNNO_____^RQ_  •  @895:1:1:1246:14654/2  •  ACTGGGCGTAGACGGTGTCCTCATCGGCACCAGC  •  +  •  \UJUWSSV[JQQWNP]]SZ]ZWU^]ZX][^TXR`  •  @895:1:1:1252:19493/1  •  CCGGCGTGGTTGGTGAGGTCACTGAGCTTCATGTC  •  +  •  OOOKONNNNN__`R]O[TGTRSY[IUZ]]]__X__  

Page 32: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Mapping  

•  Many  fast  &  efficient  computa2onal  solu2ons  exist.  •  You  have  to  figure  out  how  to  choose  parameters  to  maximize  sensi2vity/specificity,  and  when  to  validate.  

U.  Colorado  hRp://genomics-­‐course.jasondk.org/?p=395  

Page 33: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Assembly  Reassemble  random  fragments  computa2onally.  

UMD  assembly  primer  (cbcb.umd.edu)  

Page 34: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Shotgun  sequencing  It  was  the  best  of  2mes,  it  was  the  wor  ,  it  was  the  worst  of  2mes,  it  was  the    isdom,  it  was  the  age  of  foolishness  

mes,  it  was  the  age  of  wisdom,  it  was  th        

It  was  the  best  of  2mes,  it  was  the  worst  of  2mes,  it  was  the  age  of  wisdom,  it  was  the  age  of  foolishness  

 

Page 35: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Where  does  #  of  reads  count?  

A. B.

XXX

X

XX

X

X

XX

X

C.

XX

X X

XX

Resequencing,  coun2ng,  and  assembly.  

Page 36: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Where  does  reconstructability  matter?  

A. B.

XXX

X

XX

X

X

XX

X

C.

XX

X X

XX

Resequencing,  coun2ng,  and  assembly.  

Page 37: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Summary  •  Coverage  maRers  for  SNP  calls  and  assembly;  

•  #  of  reads  maRers  for  coun2ng;  

•  Length  of  reads  maRers  for  reconstructability  (assembly  &  haplotyping);  

•  Illumina  is  s2ll  “best”  for  high  coverage;  •  PacBio  and  Moleculo  =>  genome  assembly;  • Nanopore:  s2ll  tricky  but  lots  of  progress  being  made.  

 

Page 38: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Bad  data  I  asked:  hRps://twiRer.com/c2tusbrown/status/624721875252420608  

I  received:  •  hRp://www.bioinfo-­‐core.org/index.php/Interes2ng_NGS_failures  

•  hRp://bioinfo-­‐core.org/index.php/9th_Discussion-­‐28_October_2010  

•  hRps://biomickwatson.wordpress.com/2013/01/21/ten-­‐things-­‐to-­‐consider-­‐when-­‐choosing-­‐an-­‐ngs-­‐supplier/  

Page 39: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Sequencing  Bloopers  Simon  Andrews  Tim  Stevens  

Page 40: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Technical  sequencer  problems  

Page 41: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Manifold  burst  in  cycle  26  

Page 42: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

SpeciAic  cycles  lost  

Page 43: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

No  priming  /signal  (Wrong  adapters  used)  

Read  1   Read  2    (barcode)  

Page 44: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Tile  Problems  -­‐  Overclustering  

Page 45: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Tile  Problems  –  Consistent  tile  fail  

Page 46: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Tile  problems  –  transient  tile  fail  

Page 47: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Incorrect  Phred  Scores  

Found  LOTS  of  examples  of  this  in  the  SRA  

“the  NCBI  SRA  makes  all  its  data  available  as  standard  Sanger  FASTQ  files    (even  if  originally  from  a  Solexa/Illumina  machine)”  

Nucleic  Acids  Res.  2010  Apr;  38(6):  1767–1771.  

1

10

100

1000

10000

100000

1000000

33 36 39 42 45 48 51 54 57 60 63 66 69 72 75 78 81 84 87 90 93 96 99 102 105 108 111 114 117 120 123 126

SRR619473  

Phred33  (Sanger)  Phred64  (Illumina)  

Page 48: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Data  Extraction  

Page 49: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Wrong  barcode  annotation  

0

5

10

15

20

25

30

35

40

AGTTCCG ATGTCAG AGTCAAC CAGATCA AGTCAAA GGCTACA ATCACGA TGACCAA ATGTAAG AGTTACG

Expected  barcode  

Not  expected  barcode  

Page 50: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Contaminated  Barcode  Stocks  

0

2

4

6

8

10

12

14

16

18

20

GCCAAT ACTTGA CAGATC TTAGGC CTTGTA ACAGTG GATCAG TTAAGC CGGATC

Barcode  Frequency

Expected  barcode  

Not  expected  barcode  

Page 51: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Odd  sequence  composition  

Page 52: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Read  through  adapter  

Page 53: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Adapter dimer overload  

Sequence   %   Possible Source  CCTAAGGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAA   9.42   Illumina Single End PCR Primer 1  TCAATGAAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAA   7.30   Illumina Single End PCR Primer 1  GAGACTCAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAA   5.65   Illumina Single End PCR Primer 1  

gi|372098977|ref|NT_039624.8| Mus musculus chr16 GRCm38 CTGGAAGGGAGAAAAGTCCAAACATTCTGGCTCTAACTTCT ||||||||||||||||||||||||||||||||| || |||| CTGGAAGGGAGAAAAGTCCAAACATTCTGGCTCCAAGTTCT gi|372098992|ref|NT_039500.8| Mus musculus chr10 GRCm38 CTTTCTCTATCTGAATTATAAACAAAAGCACACAGGCCCGCTTACATTTACATGATAAAATGTGCACTTTG |||||||||| || |||||||||||||||||||||||||||||||| ||||||||||||||| | |||| CTTTCTCTATATGCATTATAAACAAAAGCACACAGGCCCGCTTACAGGGACATGATAAAATGTGAAATTTG (Single-cell Hi-C)  

Page 54: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Positional  Sequence  Bias  Application  SpeciAic  –  BS-­‐Seq  

Page 55: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Positional  Sequence  Biases  Expected  -­‐  RRBS  

Also  reports  of  a  ‘Chinese  CRO’  whose  RRBS  libraries  have  the  MspI  sites  missing  due  to    their  proprietary  and  unexplained  pre-­‐processing  

Page 56: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Positional  Sequence  Biases  Unavoidable  –  RNA-­‐Seq  

Page 57: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Positional  Sequence  Biases  Unexpected  –  Doubled  Adapters  

Page 58: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Overrepresented  Individual  Sequences  •  Adapter  dimers  •  rRNA  •  Satellite  sequences  

Page 59: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

My  data  doesn’t  map  well…  

Page 60: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Contaminated  with  guessable  sequence  

www.bioinforma2cs.babraham.ac.uk/projects/fastq_screen  

Page 61: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Contaminated  with  guessable  sequence  

CRUK  Mul2-­‐genome  alignment  system  (MGA)  

Page 62: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Contamination  with  unguessable  sequence  

>AF431889 AF431889.1 Acinetobacter lwoffii type IIs modification Query: 1 cggtgagcaggcattagaaattgattttttagaaggtgtgttgaagaaactgggccgctt 60 |||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||| Sbjct: 4661 cggtgagcaggcattagaaattgattttttagaaggtgtgttgaagaaactgggtcgctt 4720 >GQ352402 GQ352402.1 Acinetobacter baumannii strain AbSK-17 plasmid Query: 1 ggtgagcagtggtttacatggttaattgaacaagacatcaacttctgcattcgtg 55 ||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 8213 ggtgagcagtggtttacatggttaattgaacaagacatcaacttctgcattcgtg 8159 >AF431889 AF431889.1 Acinetobacter lwoffii type IIs modification Query: 1 acttgctgcgattaaagcagaaaaaacacttgctgaattgagtgct 46 |||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 4484 acttgctgcgattaaagcagaaaaaacacttgctgaattgagtgct 4529

Page 63: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

TAGC  Plots  

hRps://github.com/blaxterlab/blobology  

Assemble    

Filter  con2gs    

Plot  %GC  vs  Coverage    

Sample  and  blast  

Page 64: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Reagent  contamination  

Molbio  grade  water  is  not  the  same  as  DNA  free  water  –  heat  treated  but  DNA  survives  

Page 65: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Later  this  week  -­‐-­‐  Many  different  approaches  to  evalua2ng  quality/mismatches:    1.  Quality-­‐score  based  (FastQC  etc)  

2.  Composi2on  based  (FastQC  etc)  

3.  Reference  based  (“I  know  what  the  answer  should  look  like”)  

4.  Assembly-­‐graph  /  k-­‐mer  based  

Page 66: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

Reference  &  quality-­‐score  independent  approaches  (k-­‐mers)  

Zhang  et  al.,  hRps://peerj.com/preprints/890/  

Page 67: (Shotgun) - Read the Docs · Twospeciicconcepts: • First,&sequencing&everything&atrandom &is&very&much&easier& than&sequencing&aspecific&gene&region.&&(For&example,&itwill& soon&be&easier&and

…from  a  well  known  data  set…  

Zhang  et  al.,  hRps://peerj.com/preprints/890/