RNF: a method and tools to evaluate NGS read mappers
Karel Břinda, Valentina Boeva, Gregory Kucherov
Introduction
Aligning reads to a reference sequence is a fundamental step in numerous bioinformatics pipelines. The sensitivity and precision of the mapping tool can critically affect the accuracy of produced results.
Read simulators combined with alignment evaluation tools provide the most straightforward way to evaluate and compare mappers.
In the absence of standards for encoding read origins, every evaluation tool had to be made explicitly compatible with the simulator used to generate reads.
To overcome this obstacle, we have created a format, RNF (Read Naming Format), and an associated software package, RnfTools.
RNF
Description: Read Naming Format, a generic format for assigning read names with encoded information about original positions.
Specification: http://karel-brinda.github.io/rnf-spec/
RnfTools
Description: An associated software package of RNF-compatible programs, based on Snakemake [2]. All employed external programs are installed automatically when they are needed.
Components:
i) MIShmash: Pipeline applying one of the popular read simulating tools (among DwgSim, Art, Mason, CuReSim, etc.) and transforming the generated reads into RNF format.
ii) LAVEnder: Tool for evaluating read mappers using RNF reads.
Source code and documentation:
http://github.com/karel-brinda/rnftools
http://rnftools.rtfd.org
Prerequisites:
– Unix-like system (Linux, OS X, etc.)
– Python 3.2+
Installation using Pip:
> pip install rnftools
Installation using Easy Install:
> easy_install rnftools
References
[1] K. Břinda, V. Boeva, G. Kucherov. RNF: a general framework to evaluate NGS read mappers. arXiv:1504.00556 [q-bio.GN], 2015.
[2] J. Köster and S. Rahmann. Snakemake – a scalable bioinformatics workflow engine. Bioinformatics 28(19): 2520–2522, 2012.
Read Naming Format
sim__0043fd1__(3,13,F,027871,027970),(3,13,R,029171,029270)__[paired_end],C:[100=,42=1X47=]
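Parts of the name (separated by double underscores):
Prefix: sim
Read tuple ID: 0043fd1
Segments of reads: (3,13,F,027871,027970),(3,13,R,029171,029270)
Suffix (with comments and extensions): [paired_end],C:[100=,42=1X47=]
Each segment: (Genome ID, Chromosome ID, Direction, Leftmost coordinate, Rightmost coordinate)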
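The anatomy of the name above can be illustrated with a small parser. This is a minimal sketch, not the RnfTools implementation: the names `Segment`, `ReadTuple`, and `parse_rnf_name` are hypothetical, and the sketch assumes that the four parts are separated by double underscores and that each segment is a parenthesized 5-tuple, as in the example name.

```python
import re
from typing import List, NamedTuple


class Segment(NamedTuple):
    genome_id: int
    chromosome_id: int
    direction: str  # "F", "R", or "N", as seen in the poster examples
    left: int       # leftmost coordinate
    right: int      # rightmost coordinate


class ReadTuple(NamedTuple):
    prefix: str
    read_tuple_id: str  # kept as a string; no numeric base is assumed
    segments: List[Segment]
    suffix: str


# One segment: (genome, chromosome, direction, left, right)
_SEGMENT_RE = re.compile(r"\((\d+),(\d+),([FRN]),(\d+),(\d+)\)")


def parse_rnf_name(name: str) -> ReadTuple:
    """Split a long read name into prefix, read tuple ID, segments, suffix."""
    parts = name.split("__", 3)  # at most 4 parts; the suffix stays whole
    if len(parts) != 4:
        raise ValueError("expected 4 parts separated by '__': " + name)
    prefix, read_tuple_id, segments_str, suffix = parts
    segments = [
        Segment(int(g), int(c), d, int(left), int(right))
        for g, c, d, left, right in _SEGMENT_RE.findall(segments_str)
    ]
    if not segments:
        raise ValueError("no segments found in: " + name)
    return ReadTuple(prefix, read_tuple_id, segments, suffix)
```

Zero-padded coordinates such as 027871 are converted to plain integers, so the padding width does not matter when decoding.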
Example of simulated read tuples
Coor 12345678901234-5678901234567890123456789
Source 1 - reference genome
chr 1 ATGTTAGATAA-GATAGCTGTGCTAGTAGGCAGTCAGCCC
chr 2 ttcttctggaa-gaccttctcctcctgcaaataaa
Source 2 - generator of random sequences
READS:
r001 ATG-TAGATA ->
r002/1 TTAGATAACGA ->
r002/2 <- TCAG-CGGG
r003/1 tgcaaataa ->
r003/2 gaa-gacc-t ->
r004 ATAGCT............TCAG ->
r005 GTAGG ->
<- agacctt
<- TCGACACG
r006 ATATCACATCATTAGACACTA
Their corresponding RNF names
read tuple  LRN                                                            SRN
r001        sim__1__(1,1,F,01,10)__[single_end]                            #1
r002        sim__2__(1,1,F,04,14),(1,1,R,31,39)__[paired_end]              #2
r003        sim__3__(1,2,F,09,17),(1,2,F,25,33)__[mate_pair]               #3
r004        sim__4__(1,1,F,15,36)__[spliced],C:[6=12N4=]                   #4
r005        sim__5__(1,1,R,15,22),(1,1,F,25,29),(1,2,R,05,11)__[chimeric]  #5
r006        rnd__6__(2,0,N,00,00)__[random]                                #6
LRN: Long read name.
SRN: Short read name. SRNs are used only if an LRN exceeds 255 characters (maximum allowed read name length in SAM). In that case, an SRN-LRN correspondence file must be created.
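The LRN/SRN rule can be sketched as follows. This is only an illustration under stated assumptions: the function names, the `width` padding parameter, and the in-memory `correspondence` dictionary (standing in for the SRN-LRN correspondence file, whose format is not described here) are all hypothetical. The SRN form `#<ID>` follows the examples in the table above.

```python
MAX_NAME_LENGTH = 255  # limit quoted on the poster for read names in SAM


def format_segment(genome_id, chromosome_id, direction, left, right, width=2):
    """Render one segment; coordinates are zero-padded to `width` digits."""
    return "({},{},{},{},{})".format(
        genome_id, chromosome_id, direction,
        str(left).zfill(width), str(right).zfill(width),
    )


def long_read_name(prefix, read_tuple_id, segments, suffix, width=2):
    """Join prefix, read tuple ID, segments, and suffix with '__'."""
    segs = ",".join(format_segment(*s, width=width) for s in segments)
    return "__".join([prefix, str(read_tuple_id), segs, suffix])


def choose_name(lrn, read_tuple_id, correspondence):
    """Return the LRN if it fits the limit; otherwise return an SRN and
    record the SRN -> LRN pair in `correspondence` (a plain dict here)."""
    if len(lrn) <= MAX_NAME_LENGTH:
        return lrn
    srn = "#{}".format(read_tuple_id)
    correspondence[srn] = lrn
    return srn
```

With the segments of r002 from the table, `long_read_name("sim", 2, [(1, 1, "F", 4, 14), (1, 1, "R", 31, 39)], "[paired_end]")` reproduces the LRN listed there.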
Evaluation of read mappers using RNF-compatible programs
[Figure: evaluation pipeline. Read simulation: Genome 1, Genome 2, ..., Genome n (FASTA) → Read simulator with RNF encoding → Reads (FASTQ). Mapper evaluation: Reads (FASTQ) → Mapper → Alignment (BAM) → Mapper evaluation tool with RNF decoding → Report (TXT/HTML).]
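To make the RNF decoding step concrete, the sketch below classifies a single reported alignment against the true origin decoded from the read name. It is a simplified illustration, not LAVEnder's actual logic: the function name, the `tolerance` parameter and its default, and the representation of an alignment as a `(chromosome_id, direction, leftmost_coordinate)` tuple are assumptions. The category names are borrowed from the legends of the detailed graphs on this poster; thresholded and multimapped reads are not handled.

```python
def classify_alignment(segment, reference_genome_id, alignment, tolerance=5):
    """Classify one alignment against the origin encoded in the read name.

    segment:   (genome_id, chromosome_id, direction, left, right)
    alignment: (chromosome_id, direction, left), or None if unmapped
    tolerance: allowed distance between leftmost coordinates (hypothetical)
    """
    genome_id, chromosome_id, direction, left, _right = segment
    # A read should be mapped only if it comes from the reference genome.
    should_be_mapped = genome_id == reference_genome_id

    if alignment is None:
        if should_be_mapped:
            return "Unmapped incorrectly"
        return "Unmapped correctly"

    if not should_be_mapped:
        return "Mapped, should be unmapped"

    aln_chromosome_id, aln_direction, aln_left = alignment
    same_place = (
        aln_chromosome_id == chromosome_id
        and aln_direction == direction
        and abs(aln_left - left) <= tolerance
    )
    return "Mapped correctly" if same_place else "Mapped to wrong position"
```

Because the origin travels inside the read name, an evaluation tool written this way needs no knowledge of which simulator produced the reads.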
RnfTools – example of usage
Steps:
1. Simulation of reads. 200,000 reads were simulated by DwgSim using MIShmash:
– 100,000 reads from a human genome (HG38),
– 100,000 reads from a mouse genome (MM10).
2. Mapping. All reads were mapped to HG38 by i) Yara, ii) Bwa-Mem, iii) Bwa-Sw, and iv) Bowtie2.
3. Evaluation. The obtained BAM files were evaluated using LAVEnder.
Figure → Comparison of the mappers with respect to correctly mapped reads.
Figure ↘ Detailed graph for Yara.
Figure ↓ Detailed graph for Bwa-Mem.
[Figure: Correctly mapped reads in all reads which should be mapped. x-axis: FDR in mapping (#wrongly mapped reads / #mapped reads), logarithmic, ticks from 10^-4 to 10^0. y-axis: #correctly mapped reads / #reads which should be mapped, 50 % to 100 %. Curves: BWA-MEM, BWA-SW, Bowtie2, YARA.]
[Figure: BWA-MEM. x-axis: FDR in mapping (#wrongly mapped reads / #mapped reads), logarithmic, ticks at 10^-2 and 10^-1. y-axis: Part of all reads (%), 0 % to 100 %. Categories: Unmapped correctly; Unmapped incorrectly; Thresholded correctly; Thresholded incorrectly; Multimapped; Mapped, should be unmapped; Mapped to wrong position; Mapped correctly.]
[Figure: YARA. x-axis: FDR in mapping (#wrongly mapped reads / #mapped reads), logarithmic, ticks at 10^-2 and 10^-1. y-axis: Part of all reads (%), 0 % to 100 %. Categories: Unmapped correctly; Unmapped incorrectly; Thresholded correctly; Thresholded incorrectly; Multimapped; Mapped, should be unmapped; Mapped to wrong position; Mapped correctly.]
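The two quantities on the axes of the comparison graph follow directly from their definitions. The sketch below computes them from read counts; the function name and the counts in the usage note are hypothetical, and it assumes that every mapped read is counted as either correctly or wrongly mapped.

```python
def mapping_metrics(correctly_mapped, wrongly_mapped, should_be_mapped):
    """Return (fdr, correct_ratio) from three read counts.

    fdr           = #wrongly mapped reads / #mapped reads
    correct_ratio = #correctly mapped reads / #reads which should be mapped
    """
    # Assumption: mapped reads are exactly the correct plus the wrong ones.
    mapped = correctly_mapped + wrongly_mapped
    fdr = wrongly_mapped / mapped if mapped else 0.0
    correct_ratio = (
        correctly_mapped / should_be_mapped if should_be_mapped else 0.0
    )
    return fdr, correct_ratio
```

For instance, with 90 correctly mapped reads, 10 wrongly mapped reads, and 100 reads that should be mapped, the FDR is 10/100 and the ratio of correctly mapped reads is 90/100.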