The Bioinformatics Challenges and Approaches to Analyze NGS Data
TRANSCRIPT
Next-generation sequencing (NGS) technologies have revolutionized the speed and detail with which genomic and transcriptomic information can be obtained, opening novel research possibilities in regulatory, developmental and cancer biology. However, these advances also bring a great challenge: analyzing and interpreting the huge amount of data that the experiments generate. The continuous development of new algorithms has facilitated data analysis, but it can also make the choice among them confusing. A comparative analysis of the algorithms is therefore important for choosing the best method. Here we used different bioinformatics algorithms to prepare a standard pipeline for the various steps involved in ChIP-Seq data analysis. We compared different algorithms for alignment, duplicate removal and peak calling.
1) Comparison of different alignment software
The Bioinformatics Challenges and Approaches to Analyze NGS Data
Yogita Sharma1, Elisa Fiorito2, Siv Gilfillan3 & Toni Hurtado*1
1,2,3 Nordic EMBL Partnership, Center for Molecular Medicine Norway (NCMM), University of Oslo, Norway.
*Department of Genetics, Institute for Cancer Research, The Norwegian Radium Hospital, University of Oslo, Norway
RESULTS
ABSTRACT
Selective inhibition of HER2 signalling pathway reveals novel functions of FOXA1 in breast cancer
1. Lapatinib reduces FOXA1 binding interaction in HER2+/ER+ breast cancer cells
[Figure: heat map of FOXA1 sites in TAM-R cells, control vs. lapatinib (site counts given in the panel: 33,758, 10,096 and 7,699 FOXA1 sites); FOXA1 binding in tamoxifen-resistant MCF-7 cells (HER2+/ER+), plotted as binding intensity (sum of reads) against distance from center of binding site (bp), control vs. lapatinib; FOXA1 genome distribution in TAM-R cells (% of FOXA1 sites, intergenic vs. intragenic), TAMR-lapatinib vs. TAMR-control.]
Breast cancer cell proliferation results from many different factors that activate multiple cellular signaling pathways, which are currently targeted by therapies. Yet, how genomic pathways are influenced upon cell-signaling disruption has been poorly assessed. We explored how the inhibition of the HER2 signaling pathway influences the function of the transcription factor FOXA1. By means of specific inhibitors targeting the kinase activity of HER2 (Lapatinib), we have identified the HER2 signaling pathway as a key supervisor of FOXA1 function in the HER2+ breast cancer subtype. The HER2 pathway controls the binding of FOXA1 towards key genes required for proliferation in HER2+ cancer cells. Importantly, the same mechanisms of FOXA1 regulation are needed to induce cell proliferation in ER+ breast cancer cells that do not overexpress HER2. We also explored the function of FOXA1 upon treatment with Herceptin, a monoclonal antibody that binds to the extracellular domain of HER2. Here, we demonstrate that treatment of breast cancer cells with Herceptin produces the activation of cytokine signaling pathways. Importantly, the cytokine pathway activation reprograms the chromatin interactions of FOXA1 towards genes playing a key role in the initiation of cell-mediated cytotoxicity. Altogether, these findings support the idea that FOXA1 integrates input signals originating from multiple cell-signaling pathways to generate output responses that culminate in control of proliferation and the response to anti-cancer therapies.
TEAM AND RESOURCES
NCMM-EMBL Breast Cancer Group
Siv Gilfillan, Engineer
Elisa Fiorito, PhD student
Madhu Katika, Post-doc
Yogita Sharma, Bioinformatician (starting in June 2013)
Elena González, Research assistant (starting in August 2013)
Baoyan Bai, Post-doc (starting in August 2013)
Madhumohan R. Katika1,2, Siv Gilfillan1, Yogita Sharma1, Anne-Lise Børresen-Dale2 and Antoni Hurtado1,2
1 Nordic EMBL Partnership, Center for Molecular Medicine Norway (NCMM), University of Oslo, Norway
2 Department of Genetics, Institute for Cancer Research, The Norwegian Radium Hospital, University of Oslo, Norway
RESULTS
ABSTRACT
2. Lapatinib inhibits FOXA1 binding towards genes crucial for proliferation and migration
[Figure: genes with FOXA1 binding (TAM-R cells), control vs. lapatinib, plotted as binding intensity against distance from center of binding site. Pathway analysis categories: cell cycle stages; endocrine system disorders; cell morphology; nucleic acid metabolism; immunological diseases; cell death and survival; cell growth and proliferation; cellular movement, invasion; DNA replication and repair; cellular assembly and organization. Blot labels include HER2 and actin, for Herceptin, lapatinib and non treated cells.]
3. FOXA1 integrates the signal of different growth factors to induce cell proliferation in HER2+/ER+ breast cancer cell lines
[Figure: cell confluence (%) over time (days and hours) in BT474 cells and TAM-R cells, siCONTROL vs. siFOXA1; cell growth (growth media conditions: low serum and estrogen depletion); Western blot of chromatin fraction in BT474 cells (growth media conditions: low serum and estrogen depletion), probed for FOXA1 and H3.]
4. The growth factors EGF and Heregulin induce proliferation in ER+/HER2- cells activating different genomic pathways
[Figure: cell confluence (%) over time (hours) and cell growth in MCF-7 cells (growth media conditions: low serum and estrogen depletion), siCONTROL vs. siFOXA1; FOXA1 depletion and ER depletion panels with the treatments EGF, EGF + Fulvestrant (anti-ER), Heregulin, and Heregulin + Fulvestrant (anti-ER); Western blot of chromatin fraction in MCF-7 cells (growth media conditions: low serum and estrogen depletion), probed for FOXA1 and H3.]
5. Herceptin increases FOXA1 binding towards genes playing a key role in the activation of cytokine signaling pathways (HER2+/ER+ breast cancer cells)
[Figure: heat map (BT474 cells) of FOXA1 binding sites; FOXA1 genome distribution in BT474 cells (% of FOXA1 sites, intergenic vs. intragenic), BT474-Herceptin vs. BT474-control; binding intensity (sum of reads) against distance from center of binding site (bp), control vs. Herceptin, for 18,706 reduced sites with Herceptin and 50,646 induced sites with Herceptin.]
Pathway analysis: 1. IL8 signaling; 2. Dendritic cell maturation; 3. NFAT signaling; 4. NF-kB signaling; 5. IL17 signaling; 6. Natural killer cell signaling; 7. IL6 signaling; 8. IL 10 signaling; 9. IL 1 signaling; 10. IL 15 signaling; 11. IL 17a signaling; 12. TGF signaling; 13. CXCR4 signaling
6. Lapatinib inhibits the binding of FOXA1 towards regions enriched with GATA3 and ERE motifs in endometrial cancer cells (Ishikawa cell line)
[Figure: blot labels EGFR and HER-2, with lapatinib; heat map of FOXA1 (Ishikawa cells, estrogen depleted media), control vs. lapatinib; binding intensity (sum of reads) against distance from center of binding site (bp); 69,355 sites in control treated cells; motif discovery (FOXA1 sites).]
Evaluation Type                      Genome Size(s)   Read Length   Read Count
Accuracy: Varying Error Rate         3Gbp, 500Mbp     50bp          500,000
Accuracy: Varying Indel Size         3Gbp, 500Mbp     50bp          500,000
Accuracy: Varying Indel Frequency    3Gbp, 500Mbp     50bp          500,000
Table 1. Experimental setup for each simulation type: genome size(s), read length, and read count.
Fig. 1: Human genome: accuracy with varying error rate. (a) shows mapping quality threshold 0, (b) shows threshold 10 and (c) shows the proportion of reads that have mapping quality of at least 10. -R and -S suffixes denote relaxed and strict accuracy, respectively.
[Axes: error rate (0.001, 0.010, 0.100) against accuracy in (a) and (b) and against used read ratio in (c). Tools: Bowtie, BWA, MrFast-R, MrFast-S, MrsFast-R, MrsFast-S, Novoalign, SOAP.]
4.1.1 Varying Error Rate. The accuracy of all algorithms on the human genome for varying error rate is compared in Figure 1. The results for quality threshold 0 (accepting all reads) are shown in Figure 1a, whereas 1b shows the mapping accuracy when considering reads of quality ≥ 10. We can see that Bowtie, BWA and Novoalign are the most sensitive to mapping quality threshold at high error rates; their accuracy significantly increases as reads of mapping quality 0 are discarded. SOAP's mapping accuracy is quite high even at quality threshold 0, which is consistent with its intended usage for genotyping SNPs. Figure 1c shows the proportion of
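The threshold comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `MappedRead` record, the function name and the toy reads are hypothetical, and a read is counted as correct only when the reported position equals the simulated one, which is a simplification of the strict and relaxed accuracy definitions used in the figures.

```python
from dataclasses import dataclass


@dataclass
class MappedRead:
    """One simulated read (hypothetical record, not an aligner output format)."""
    true_pos: int    # position the read was simulated from
    mapped_pos: int  # position reported by the aligner
    mapq: int        # mapping quality reported by the aligner


def accuracy_at_threshold(reads, threshold):
    """Return (accuracy, used_read_ratio) for reads with mapq >= threshold."""
    used = [r for r in reads if r.mapq >= threshold]
    if not used:
        # No read passes the filter, so accuracy is undefined.
        return None, 0.0
    correct = sum(1 for r in used if r.mapped_pos == r.true_pos)
    return correct / len(used), len(used) / len(reads)


# Toy data (invented for illustration only).
reads = [
    MappedRead(100, 100, 37),
    MappedRead(250, 250, 20),
    MappedRead(400, 912, 0),  # ambiguous hit placed at the wrong locus
    MappedRead(530, 530, 0),
]

acc_q0, ratio_q0 = accuracy_at_threshold(reads, 0)
acc_q10, ratio_q10 = accuracy_at_threshold(reads, 10)
```

On this toy input, raising the threshold from 0 to 10 discards the two quality-0 reads, so accuracy rises while the used read ratio falls, which is the trade-off that panels (a) to (c) of Fig. 1 display.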
Fig. 2: Human genome: comparison of reported accuracy vs. theoretical accuracy for 0.1% base call error rate (only tools that report meaningful quality scores are included). (a) shows a comparison of the theoretical accuracy for each mapping quality score vs. each tool's accuracy at that quality threshold. (b) shows the proportion of reads with a mapping quality greater than or equal to each threshold value.
[Axes: threshold (0, 4, 10, 20, 30, 40) against accuracy in (a) and against used read ratio in (b). Tools: Bowtie, BWA, Novoalign, SOAP, plus the theoretical curve in (a).]
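As a rough guide to what a theoretical accuracy curve of this kind looks like, the sketch below assumes that mapping quality is Phred-scaled, i.e. Q = -10 log10(P(wrong position)), as in the SAM format. The poster does not state the formula behind Fig. 2, so both that assumption and the function name are illustrative.

```python
def theoretical_accuracy(mapq):
    """Probability that a read with Phred-scaled mapping quality `mapq`
    is placed correctly, assuming Q = -10 * log10(P(wrong))."""
    return 1.0 - 10.0 ** (-mapq / 10.0)


# Thresholds taken from the x-axis of Fig. 2.
thresholds = [0, 4, 10, 20, 30, 40]
curve = {q: theoretical_accuracy(q) for q in thresholds}
```

Under this assumption a quality of 10 corresponds to 90% expected accuracy and a quality of 20 to 99%, which gives a baseline against which each tool's reported qualities can be judged.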
Downloaded from http://bioinformatics.oxfordjournals.org/ at Odontologisk Fakultetsbibliothek on July 30, 2013
Fig. 3: Human genome: accuracy with varying indel sizes. (a) shows mapping quality threshold 0, (b) shows threshold 10 and (c) shows the proportion of reads that have mapping quality of at least 10. -R and -S suffixes denote relaxed and strict accuracy, respectively. At indel sizes 10 and 16, SOAP discards all reads, producing missing values in (c).
[Axes: mean indel size (2, 4, 7, 10, 16) against accuracy in (a) and (b) and against used read ratio in (c). Tools: Bowtie, BWA, MrFast-R, MrFast-S, MrsFast-R, MrsFast-S, Novoalign, SOAP.]
Fig. 4: Human genome: accuracy with varying indel frequencies. (a) shows mapping quality threshold 0, (b) shows threshold 10 and (c) shows the proportion of reads that have mapping quality of at least 10. -R and -S suffixes denote relaxed and strict accuracy, respectively.
[Axes: indel frequency (1e-05, 1e-04, 0.001, 0.01) against accuracy in (a) and (b) and against used read ratio in (c). Tools: Bowtie, BWA, MrFast-R, MrFast-S, MrsFast-R, MrsFast-S, Novoalign, SOAP.]
2) Removal of Duplicates
3) Peak Calling
                      MACS                      CCAT
Total no. of peaks    6031 (3h), 17047 (12h)    8764 (3h), 19208 (12h)
Unique Peaks (3h)     3094                      5660
Unique Peaks (12h)    4767                      5879
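One way to obtain counts of caller-specific peaks like those in the table is to compare the two peak lists by genomic overlap. The sketch below is only an assumption about how such a comparison could be done: the poster does not say how unique peaks were defined, the one-base overlap criterion is a choice made here, and the coordinates are invented.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def unique_peaks(peaks, other):
    """Peaks in `peaks` that overlap no peak in `other` (simple pairwise scan)."""
    return [p for p in peaks if not any(overlaps(p, q) for q in other)]


# Toy peak lists (invented coordinates, not the poster's data).
macs_peaks = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 50, 150)]
ccat_peaks = [("chr1", 250, 400), ("chr2", 500, 700)]

macs_only = unique_peaks(macs_peaks, ccat_peaks)
ccat_only = unique_peaks(ccat_peaks, macs_peaks)
```

The pairwise scan is quadratic and is meant for clarity; for peak lists of the size reported above, sorting the intervals per chromosome or using an interval tree would be the more practical choice.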
[Figure: ChIP regions over chromosome (3h & 12h). For each time point there are two panels: distribution of peak heights, and ChIP regions (peaks) over chromosomes, with chromosome size (bp) plotted for chromosomes 1 to 22, M, X and Y.]
CONCLUSIONS
1) We found that Bowtie performs well for ChIP-Seq experiments, provided that the indels are short.
2) Samtools and Picard both removed the same number of duplicates, in the same regions.
3) CCAT and MACS differ from each other.
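To make the duplicate-removal step in conclusion 2 concrete, the following sketch treats reads that share chromosome, start position and strand as duplicates and keeps the first one seen. This is a deliberately simplified, hypothetical criterion written for illustration; Samtools and Picard apply their own, more detailed rules, and the read records here are invented.

```python
def remove_duplicates(reads):
    """Keep the first read seen for each (chrom, pos, strand) key."""
    seen = set()
    kept = []
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"])
        if key in seen:
            continue  # same position and strand as an earlier read
        seen.add(key)
        kept.append(read)
    return kept


# Toy reads (invented for illustration only).
reads = [
    {"name": "r1", "chrom": "chr1", "pos": 100, "strand": "+"},
    {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+"},  # duplicate of r1
    {"name": "r3", "chrom": "chr1", "pos": 100, "strand": "-"},
    {"name": "r4", "chrom": "chr2", "pos": 100, "strand": "+"},
]

kept = remove_duplicates(reads)
removed = len(reads) - len(kept)
```

Comparing the `removed` count and the affected positions between two tools is the kind of check that underlies the statement that both removed the same number of duplicates in the same regions.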