Robert Schmieder - rschmieder@gmail.com Quality control with PRINSEQ

Last modified: January 16, 2011 Page 1 of 15

Quality control

Sequencing technologies are not perfect, and quality control (QC) is an essential step to ensure that the data used for downstream analysis is not compromised by low-quality sequences, sequence artifacts, or sequence contamination that might lead to erroneous conclusions. The easiest way to perform QC is to look at summary statistics of the data. Different programs can produce those statistics. Web applications allow users to easily share and discuss the results with other people without transferring large data files. The following QC steps are implemented in, and all graphics were generated by, PRINSEQ (http://prinseq.sourceforge.net).

Content:

• Necessary resources
• Uploading data to the PRINSEQ web version
• Number and length of sequences
• Base qualities
• GC content
• Poly-A/T tails
• Ambiguous bases
• Sequence duplications
• Sequence complexity
• Tag sequences
• Sequence contamination
• Assembly quality measures
• References

Necessary resources

Hardware

Computer connected to the Internet

Software

Up-to-date Web browser (Firefox, Safari, Chrome, Internet Explorer, …)

Files

FASTA file with sequence data
QUAL file with quality scores (if available)

FASTQ file (as alternative format)

Uploading data to the PRINSEQ web version

To upload a new dataset in FASTA and QUAL format (or FASTQ format) to PRINSEQ, follow these steps:

1. Go to http://prinseq.sourceforge.net

2. Click on “Use PRINSEQ” in the top menu on the right (the latest PRINSEQ web version should load)

3. Click on “Upload new data”

4. Select your FASTA and QUAL files or your FASTQ file and click “Submit”

After the data is parsed and processed successfully, the user interface will show a menu on the left and a message in the main panel as shown below.

Notes

After clicking the submit button, a status bar (not a progress bar) will be displayed until the file upload is completed. During the data processing, several progress bars will show the progress of the data parsing and statistics calculation steps.

Possible problems

1. The PRINSEQ web interface does not load/is not visible.

You only see this and nothing else happens:

Solution: Make sure that you have JavaScript activated in your browser, as this is required to load and use PRINSEQ's web interface.

2. The upload status bar does not disappear.

After clicking on the submit button you see this and it does not disappear:

Solution: The first thing to check is if the file is still uploading. The easiest way to do this is by checking the loading icon in your browser.

If you see the loading icon (right) instead of the PRINSEQ icon (left), your file is still uploading and you should give it more time. If you see the PRINSEQ icon instead of the loading icon, your file did not upload completely and this caused an error. If you have a slow connection to the Internet or try to upload large files, the connection to the web server can time out before the upload is completed. If you did not upload compressed files, try to compress your files with any of the supported compression algorithms (ZIP, GZIP, …). In rare cases, the issue can also be caused by certain Firefox plugins or extensions. If possible, use an alternative browser to test if this was the case. If the browser caused the problem, updating Firefox and the plugins/extensions to the latest version might solve the problem.
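If no compression tool is at hand, a file can be GZIP-compressed before upload with a few lines of Python. This is only a minimal sketch using the standard library; the function name `gzip_file` and the file name in the usage note are illustrative and not part of PRINSEQ.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(src_path):
    """Write a GZIP-compressed copy of src_path next to it and return its path."""
    src = Path(src_path)
    dst = src.with_name(src.name + ".gz")
    # Stream the data so that large sequence files are not loaded into memory.
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

For example, `gzip_file("reads.fasta")` would create `reads.fasta.gz` in the same directory and leave the original file untouched.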

Number and length of sequences

Check those numbers to make sure they approximately match the manufacturer's estimates. If your numbers are off by too much, check the raw data and filter statistics in "454BaseCallerMetrics" and "454QualityFilterMetrics".

Length distribution

The length distribution can be used as a quality measure for the sequencing run. You would expect a normal distribution for the best result. However, most sequencing results show a slowly increasing and then a steeply falling distribution. The plots in PRINSEQ mark the mean length (M) and the length for one and two standard deviations (1SD and 2SD), which can help to decide where to set length thresholds for the data preprocessing. If any of the sequences is longer than 100 bp, the lengths will be binned in the plots generated by PRINSEQ. The number of sequences for each bin is then shown instead of the number of sequences for a single length (values might therefore be bigger than shown in the table for non-binned lengths).
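The markers described above (M, 1SD, 2SD) can be reproduced outside the web interface to pick candidate length thresholds. The following is a minimal sketch, not PRINSEQ's code; the function name is hypothetical, and the use of the population standard deviation is an assumption.

```python
import statistics


def length_summary(lengths):
    """Return mean, standard deviation and the 1SD/2SD ranges of read lengths."""
    mean = statistics.mean(lengths)
    # Assumption: population standard deviation; the tool may use another estimator.
    sd = statistics.pstdev(lengths)
    return {
        "mean": mean,
        "sd": sd,
        "1sd": (mean - sd, mean + sd),
        "2sd": (mean - 2 * sd, mean + 2 * sd),
    }
```

The lower bounds of the `1sd` or `2sd` ranges are natural candidates for a minimum-length filter, and the upper bounds for a maximum-length filter.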

The following two datasets have approximately the same number of sequences; however, the length distributions look different.

Both distributions have the highest number of sequences around 500 bp, but for the first dataset the mean of the sequence lengths is higher and the standard deviation is lower. A certain number of shorter reads might be expected, but if the sample contained mainly longer fragments, this number should be low.

Assuming that both samples contained enough fragments of at least 500 bp and all fragments were sequenced with the same number of cycles (sequencing flows), we would expect that the majority of the sequences would have approximately the same length. The higher amount of shorter reads in the second dataset suggests that those reads might have been of lower quality and were trimmed during the signal processing. If the sample contained many short fragments, the shorter reads might be from those fragments and not of lower quality.

Minimum and maximum read length

Sequences in the SFF files can be as short as 40 bp (shorter sequences are filtered during signal processing). For multiplexed samples, the MID trimmed sequences can be as short as 28 bp (assuming a 12 bp MID tag). Such short sequences can cause problems during, for example, database searches to find similar sequences. Short sequences are more likely to match at a random position by chance than longer sequences and may therefore result in false positive functional or taxonomical assignments. Furthermore, short sequences are likely to be quality trimmed during the signal-processing step and of lower quality with possible sequencing errors.

In some cases, sequences can be much longer than several standard deviations above the mean length (e.g. 1,500+ bp for a 500 bp mean length with a 100 bp standard deviation). Those sequences should be used with caution as they likely contain long stretches of homopolymer runs as in the following example. Homopolymers are a known issue of pyrosequencing technologies such as 454/Roche [1].

aactttaaccttttaaaacccccttaaaaaaactttaaaccccgtaaaccccccgggttt
ttttttaaaaaaccgttttttacgggggtttaccccgttttaccggggttttgggggttt
taaaaaaaacgggttttaaacgggttaacccccgggttttccgggggtttaaaaagtttt
tttaaacgggggttttcccgtaaaaaaaaaaccccgtttaaaaaaaggggttaaaaaaaa
aaggggttaaccccccggggtttaaaaaaaaccttttttttttttaaaaaaaacgttttt
tttttttaaaaggggttttttttacgggggtaaacgggggggttaaaaaaaaaccccccc
cggggggttttaaaaaaaaaacccccggttttaaaaaaccccgttttaacccctttaaaa
aaaaaacgggggggttttaaaaaaaaaagggggttttttttttttaaaaacccgttttta
aaaaccccccgttttttaacccgggttaaaccccccccgggggggtaaaacccccccccc
ggggtaaccccctttttttaaaacccccccccgttttttacccgggggtttttacccccg
gggggggtaaaaaaacggggggtttttttttttttaaaaccggggttttttttttttaaa
ccccggtttttaaaaaccggtttttaccccggggggttttacccccgggggggggttttt
aaaacccccggtttaaaactttaaaaacccgggtaaccccggggttttaaaaaaaaaaaa
aaaaccccccccgttaaaaaaaaaaaacccgttttttttttaaaaaaaacccccccccgg
ttttaaaaccccccccgggggtttttaccccggggttttaaaaaaaacccgtttaaaaaa
accgggttttttaaaggggttttaaacccccccccc
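Suspiciously long reads like the one above can be screened for homopolymer runs before they are used. The following is a minimal sketch; the function names and the minimum run length of 5 are illustrative assumptions, not PRINSEQ settings.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=5):
    """Return (base, run_length) for each run of identical bases >= min_len."""
    runs = []
    for base, group in groupby(seq.lower()):
        n = sum(1 for _ in group)
        if n >= min_len:
            runs.append((base, n))
    return runs


def homopolymer_fraction(seq, min_len=5):
    """Fraction of the sequence that lies inside homopolymer runs >= min_len."""
    if not seq:
        return 0.0
    return sum(n for _, n in homopolymer_runs(seq, min_len)) / len(seq)
```

A read in which a large fraction of bases falls inside such runs is a candidate for removal or at least for closer inspection.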

Base qualities

As for Sanger sequencing, next-generation sequencers produce data with linearly degrading quality across the read. The quality scores for 454/Roche sequencers are Phred-based since version 1.1.03, ranging from 0 to 40. Phred values are log-scaled, where a quality score of 10 represents a 1 in 10 chance of an incorrect base call and a quality score of 20 represents a 1 in 100 chance of an incorrect base call.
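The log scale mentioned above corresponds to the standard Phred relation Q = -10 · log10(P), where P is the probability of an incorrect base call. A short sketch of the conversion in both directions (function names are illustrative):

```python
import math


def phred_to_error_probability(q):
    """Probability of an incorrect base call for Phred quality score q."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred quality score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)
```

With this relation, a score of 10 gives 0.1 (1 in 10) and a score of 20 gives 0.01 (1 in 100), matching the values in the text.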

In PRINSEQ, the quality scores are plotted across the reads using box plots. The x-axis indicates the absolute position if all reads are no longer than 100 bp and the relative position (in % of read length) if any read is longer than 100 bp. For datasets with any read longer than 100 bp, a second plot shows binned quality values to preserve their absolute positions. This plot is helpful to identify quality scores at the end of longer reads, which would otherwise be grouped with the ends of the shorter reads. The following example shows the quality scores across the read length for fragments sequenced with GS FLX using the Titanium kit. The sequences with low quality scores at the ends should be trimmed during data preprocessing.

In addition to the decrease in quality across the read, regions with homopolymer stretches will tend to have lower quality scores. Huse et al. [1] found that sequences with an average score below 25 had more errors than those with higher averages. Therefore, it is helpful to take a look at the average (or mean) quality scores. PRINSEQ provides a plot that shows the distribution of sequence mean quality scores of a dataset, as shown below. The majority of the sequences should have high mean quality scores.
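The mean quality check can be sketched as follows, assuming the quality scores of each read have already been parsed into a list of integers (for example from a QUAL file). The function names are hypothetical; the default threshold of 25 follows the observation by Huse et al. [1] cited above.

```python
def mean_quality(scores):
    """Mean Phred score of one read, given its per-base scores."""
    return sum(scores) / len(scores)


def fraction_below_threshold(reads_scores, threshold=25):
    """Fraction of reads whose mean quality score is below the threshold."""
    if not reads_scores:
        return 0.0
    low = sum(1 for scores in reads_scores if mean_quality(scores) < threshold)
    return low / len(reads_scores)
```

A dataset in which this fraction is high would show the bulk of its distribution at the low end of the mean quality plot.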

Low quality sequences can cause problems during downstream analysis. Most assemblers or aligners do not take into account quality scores when processing the data. The errors in the reads can complicate the assembly process and might cause misassemblies or make an assembly impossible.

GC content

The GC content distribution of most samples should follow a normal distribution. In some cases, a bi-modal distribution can be observed, especially for metagenomic datasets. The GC content plot in PRINSEQ marks the mean GC content (M) and the GC content for one and two standard deviations (1SD and 2SD). This can help to decide where to set the GC content thresholds, if a GC content filter will be applied. The plot can also be used to find the thresholds or range to select sequences from a bi-modal distribution.
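A per-sequence GC content and the mean and standard deviation across a dataset can be computed as in the sketch below. Excluding ambiguous bases from the denominator and using the population standard deviation are assumptions of this sketch, not documented PRINSEQ behavior.

```python
import statistics


def gc_content(seq):
    """GC content in percent, counting only unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100 * (seq.count("G") + seq.count("C")) / acgt


def gc_summary(seqs):
    """Mean and standard deviation of the GC content over all sequences."""
    values = [gc_content(s) for s in seqs]
    return statistics.mean(values), statistics.pstdev(values)
```

Candidate filter thresholds are then the mean minus and plus one or two standard deviations, in the same way as for the length distribution.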

Poly-A/T tails

Poly-A/T tails are considered repeats of As or Ts at the sequence ends. In PRINSEQ, the minimum length of a tail is 5 bp and sequences that contain only As or Ts are counted for both ends. A small number of tails can occur even after trimming poly-A/T tails. For example, a sequence that ends with AAAAATTTTT and that has been trimmed for the poly-T will still contain the poly-A.

Trimming poly-A/T tails can reduce the number of false positives during database searches, as long tails tend to align well to sequences with low complexity or sequences with tails (e.g. viral sequences) in the database.
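The tail definition above (a run of As or Ts at either end, at least 5 bp long) can be sketched as follows. This is an illustration of the definition only, with hypothetical function names, not PRINSEQ's implementation.

```python
def _leading_run(seq):
    """Length of the run of As or of Ts at the start of seq (0 if none)."""
    if not seq or seq[0] not in "AT":
        return 0
    return len(seq) - len(seq.lstrip(seq[0]))


def poly_at_tails(seq, min_len=5):
    """Return (5'-tail length, 3'-tail length); tails shorter than min_len count as 0."""
    seq = seq.upper()
    five = _leading_run(seq)
    three = _leading_run(seq[::-1])
    return (five if five >= min_len else 0, three if three >= min_len else 0)
```

For a read ending in AAAAATTTTT, the sketch reports a 3' tail of 5 (the poly-T); after that tail is removed, the read ends in AAAAA and a tail of 5 is reported again, which is the residual effect described above.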

Ambiguous bases

Sequences can contain the ambiguous base N for positions that could not be identified as a particular base. A high number of Ns can be a sign of a low quality sequence or even dataset. If no quality scores are available, the sequence quality can be inferred from the percent of Ns found in a sequence or dataset. Huse et al. [1] found that the presence of any ambiguous base calls was a sign of overall poor sequence quality.

Ambiguous bases can cause problems during downstream analysis. Assemblers such as Velvet and aligners such as SSAHA2 or BWA use a 2-bit encoding system to represent nucleotides, as it offers a space-efficient way to store sequences. For example, the nucleotides A, C, G and T might be 2-bit encoded as 00, 01, 10 and 11. The 2-bit encoding, however, only allows storing the four nucleotides, and any additional ambiguous base cannot be represented. The different programs deal with the problem in different ways. Some programs replace ambiguous bases with a random base (e.g. BWA [2]) and others with a fixed base (e.g. SSAHA2 and Velvet replace Ns with As [3]). This can result in misassemblies or false mapping of sequences to a reference sequence and therefore, sequences with a high number of Ns should be removed before downstream analysis.
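The effect of a 2-bit encoding on ambiguous bases can be illustrated with a small sketch. It uses the example mapping from the text (A=00, C=01, G=10, T=11) and the fixed-base strategy of replacing N with A; it is not the actual storage code of any of the named programs.

```python
ENCODING = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODING = "ACGT"


def encode_2bit(seq, replacement="A"):
    """Pack a sequence into an integer, two bits per base.

    Any base outside A/C/G/T (e.g. N) has no code of its own and is
    replaced by `replacement` (illustrative fixed-base strategy).
    """
    value = 0
    for base in seq.upper():
        if base not in ENCODING:
            base = replacement
        value = (value << 2) | ENCODING[base]
    return value


def decode_2bit(value, length):
    """Unpack `length` bases from an integer produced by encode_2bit."""
    return "".join(
        DECODING[(value >> (2 * (length - 1 - i))) & 0b11] for i in range(length)
    )
```

Encoding and then decoding `ACNT` yields `ACAT`: the N is silently turned into an A, and the information that the position was ambiguous is lost.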

Sequence duplications

Real or artificial? Assuming a random sampling of the genomic material in an environment such as in metagenomic studies, reads should not start at the same position and have the same errors (at least not in the numbers that they have been observed in most metagenomes). Gomez-Alvarez et al. [5] investigated the problem in more detail and did not find a specific pattern or location on the sequencing plate that could explain the duplications.

Duplicates can arise when there are too few fragments present at any stage prior to sequencing, especially during any PCR step. Furthermore, the theoretical idea of one micro-reactor containing one bead for 454/Roche sequencing does not always translate into practice where many beads can be found in a single micro-reactor. Unfortunately, artificial duplicates are difficult to distinguish from exactly overlapping reads that naturally occur within deep sequence samples.

The number of expected sequence duplicates highly depends on the depth of the library, the type of library being sequenced (whole genome, transcriptome, 16S, metagenome, ...), and the sequencing technology used. The sequence duplicates can be defined using different methods. Exact duplicates are identical sequence copies, whereas 5' or 3' duplicates are sequences that are identical with the 5' or 3' end of a longer sequence. Considering the double-stranded nature of DNA, duplicates could also be considered sequences that are identical with the reverse complement of another sequence.
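The duplicate definitions above can be expressed as a pairwise check. The sketch below only illustrates the definitions; the function names and returned labels are hypothetical, and a real implementation would avoid comparing all pairs of reads.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (N stays N)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def duplicate_type(query, other):
    """Return how `query` duplicates `other`, or None if it does not."""
    q, o = query.upper(), other.upper()
    if q == o:
        return "exact"
    if len(q) < len(o):
        if o.startswith(q):
            return "5' duplicate"
        if o.endswith(q):
            return "3' duplicate"
    if q == reverse_complement(o):
        return "reverse complement exact"
    return None
```

For instance, `ACG` is a 5' duplicate of `ACGTT`, `GTT` is a 3' duplicate of it, and `AACGT` is identical to its reverse complement.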

The different plots in PRINSEQ can be helpful to investigate the degree of sequence duplications in a dataset. The following plot shows the number of sequence duplicates for different lengths. The distribution of duplicates should be similar to the length distribution of the dataset. The number of 5' duplicates is higher for shorter sequences (as observed in the example below), suggesting that exact sequence duplicates may have been trimmed during signal processing.

The number of exact duplicates is often higher than the number of 5' and 3' duplicates as in the following example.

PRINSEQ offers additional plots to investigate the sequence duplicates from different points of view. The plot showing the sequence duplication levels (with number of sequences with one duplicate, two duplicates, three duplicates, …) can be used to identify the distribution of duplicates (e.g. do many sequences have only a few duplicates). The plot showing the highest number of duplicates for a single sequence (top 100) can help to identify if only a few sequences have many duplicates (e.g. as a result of specific PCR amplification) and what the highest duplication numbers are.

Depending on the dataset and downstream analysis, consider filtering sequence duplicates. The main purpose of removing duplicates is to mitigate the effects of PCR amplification bias introduced during library construction. In addition, removing duplicates can result in computational benefits by reducing the number of sequences that need to be processed and by lowering the memory requirements. Sequence duplicates can also impact abundance or expression measures and can result in false variant (SNP) calling. The example below shows the alignment of sequences against a reference sequence (first line; gray in the original). The sequence duplicates (starting at the same position) suggest a possibly false frequency of base C at the position marked with a caret (bold in the original).

...ACCACACGTGTTGTGTACATGAACACAGTATATGAGCATACAGAT...
                      TGAACACAGTCTATGAGCATACAGAT...
                      TGAACACAGTCTATGAGCATACAGAT...
                      TGAACACAGTCTATGAGCATACAGAT...
                      TGAACACAGTCTATGAGCATACAGAT...
                      TGAACACAGTCTATGAGCATACAGAT...
               GTGTACATGAACACAGTATATGAGCATACAGAT...
          GTGTTGTGTACATGAACACAGTATATGAGCATACAGAT...
                                ^

Keep in mind that the number of sequence duplicates also depends on the experiment. For short-read datasets with high coverage such as in ultra-deep sequencing or genome re-sequencing datasets, eliminating singletons can present an easy way of dramatically reducing the number of error-prone reads.

Sequence complexity

Genome sequences can exhibit intervals with low complexity, which may be part of the sequence dataset when using random sampling techniques. Low-complexity sequences are defined as having commonly found stretches of nucleotides with limited information content (e.g. the dinucleotide repeat CACACACACA). Such sequences can produce a large number of high-scoring but biologically insignificant results in database searches. The complexity of a sequence can be estimated using many different approaches. PRINSEQ calculates the sequence complexity using the DUST and Entropy approaches as they present two commonly used examples.

The DUST approach is adapted from the algorithm used to mask low-complexity regions during BLAST search preprocessing [6]. The scores are computed based on how often different trinucleotides occur and are scaled from 0 to 100. Higher scores imply lower complexity and complexity scores above 7 can be considered low-complexity. A sequence of homopolymer repeats (e.g. TTTTTTTTT) has a score of 100, of dinucleotide repeats (e.g. TATATATATA) has a score around 49, and of trinucleotide repeats (e.g. TAGTAGTAGTAG) has a score around 32.
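A DUST-style score for a single window can be sketched as follows. The raw score follows the triplet-count formula of Morgulis et al. [6]; the scaling to 0-100 (dividing by the score of a homopolymer of the same length) and the use of one 64 bp window are assumptions of this sketch. Under these assumptions it reproduces the approximate values quoted above for 64 bp repeats, but it should not be read as PRINSEQ's exact implementation.

```python
from collections import Counter


def dust_score(window):
    """DUST-style complexity score of one window, scaled to 0..100.

    Assumption: the raw score sum(c * (c - 1) / 2) / (l - 1) over the
    triplet counts c is divided by the score of a homopolymer with the
    same number of triplets l, which equals l / 2.
    """
    window = window.upper()
    triplets = [window[i:i + 3] for i in range(len(window) - 2)]
    n = len(triplets)
    if n < 2:
        return 0.0
    counts = Counter(triplets)
    raw = sum(c * (c - 1) / 2 for c in counts.values()) / (n - 1)
    return 100 * raw / (n / 2)
```

For 64 bp windows this gives 100 for a homopolymer, about 49 for a TA repeat, and about 32 for a TAG repeat.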

The Entropy approach evaluates the entropy of trinucleotides in a sequence. The entropy values are scaled from 0 to 100 and lower entropy values imply lower complexity. A sequence of homopolymer repeats (e.g. TTTTTTTTT) has an entropy value of 0, of dinucleotide repeats (e.g. TATATATATA) has a value around 16, and of trinucleotide repeats (e.g. TAGTAGTAGTAG) has a value around 26. Sequences with an entropy value below 70 can be considered low-complexity.
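The entropy values quoted above can be approximated with a Shannon entropy over trinucleotide frequencies. Taking the logarithm to base 64 (the number of possible trinucleotides) and multiplying by 100 is an assumption of this sketch; it was chosen because it yields values close to the examples in the text, not because the source states it.

```python
import math
from collections import Counter


def trinucleotide_entropy(seq):
    """Shannon entropy of trinucleotide frequencies, scaled to 0..100.

    Assumption: logarithm to base 64, so that a sequence in which all
    64 trinucleotides are equally frequent would score 100.
    """
    seq = seq.upper()
    triplets = [seq[i:i + 3] for i in range(len(seq) - 2)]
    if not triplets:
        return 0.0
    total = len(triplets)
    counts = Counter(triplets)
    entropy = sum(-(c / total) * math.log(c / total, 64) for c in counts.values())
    return 100 * entropy
```

Under this assumption, TTTTTTTTT scores 0, TATATATATA scores about 16.7, and TAGTAGTAGTAG scores about 26.2.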

Tag sequences

Tag sequences are artifacts at the ends of sequence reads such as multiplex identifiers, adapters, and primer sequences that were introduced during pre-amplification with primer-based methods. The base frequencies across the reads present an easy way to check for tag sequences. If the distribution seems uneven (high frequencies for certain bases over several positions), it could indicate some residual tag sequences. The following three examples show the base frequencies of datasets with no tag sequence, multiplex identifier (MID) tag sequence, and whole transcriptome amplified (WTA) tag sequence.
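The base frequency check can be sketched as a per-position count over the first positions of all reads. The function name and the default of 20 positions are illustrative choices, not PRINSEQ parameters.

```python
from collections import Counter


def base_frequencies(seqs, n_positions=20):
    """Per-position base frequencies for the first n_positions of the reads.

    Returns one dict per position mapping A, C, G, T, N to its frequency;
    the dict is empty for positions that no read reaches.
    """
    freqs = []
    for pos in range(n_positions):
        counts = Counter(s[pos].upper() for s in seqs if len(s) > pos)
        total = sum(counts.values())
        freqs.append({b: counts[b] / total for b in "ACGTN"} if total else {})
    return freqs
```

A single base with a frequency close to 1 over several consecutive positions hints at a residual tag. To inspect the 3' end, the same function can be applied to the reversed reads.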

Those tag sequences should be trimmed using a program such as TagCleaner (http://tagcleaner.sourceforge.net) [4]. The input to any such trimming program should be untrimmed reads (e.g. not quality trimmed), as this will allow easier and more accurate identification of tag sequences. PRINSEQ can be used after tag sequence trimming to check if the tags were removed sufficiently. In addition to the frequency plots, PRINSEQ estimates if the dataset contains tag sequences. The probabilities for a tag sequence at the 5'- or 3'-end require a certain number of sequences (10,000 should be sufficient). A percentage below 40% does not always suggest a tag sequence, especially if it cannot be observed from the base frequencies. The estimation does not work for sequence datasets that target a single locus (e.g. 16S) and should only be used for randomly sequenced samples such as metagenomes.

Sequence contamination

Sequences obtained from impure nucleic acid preparations may contain DNA from sources other than the sample. Those sequence contaminations are a serious concern to the quality of the data used for downstream analysis, possibly causing erroneous conclusions. The dinucleotide odds ratios as calculated by PRINSEQ use the information content in the sequences of a dataset and can be used to identify possible contamination [7]. Furthermore, dinucleotide abundances have been shown to capture the majority of variation in genome signatures and can be used to compare a metagenome to other microbial or viral metagenomes. PRINSEQ uses principal component analysis (PCA) to group metagenomes from similar environments based on dinucleotide abundances. This can help to investigate if the correct sample was sequenced, as viral and microbial metagenomes show distinct patterns. As samples might be processed using different protocols or sequenced using different techniques, this feature should be used with caution.

The PCA plots in PRINSEQ show how the user metagenome (represented by a red dot) groups with other metagenomes (blue dots). Since the plots are generated for microbial and viral metagenomes separately, they are marked with an M or V (top left corner). The percentages in parentheses show the explained variation in the first and second principal component. The plots are generated using preprocessed data from published metagenomes that were sequenced using the 454/Roche sequencing platform. If sequences contain tag sequences or are targeted to a certain locus (e.g. 16S), this approach will not be able to group the user data to metagenomes from the same environment. The plot above shows how a microbial metagenome might be related to other microbial metagenomes. (This plot suggests that the metagenome is likely a marine metagenome sampled in a coastal region.)

The following plots show how a viral metagenome does not group with the microbial metagenomes (left) but closely with other mosquito metagenomes (right).

PRINSEQ additionally lists the dinucleotide relative abundance odds ratios for the uploaded dataset. Anomalies in the odds ratios can be used to identify discrepancies in metagenomes such as human DNA contamination (depression of the CG dinucleotide frequency).
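A dinucleotide relative abundance odds ratio is commonly defined as the observed dinucleotide frequency divided by the product of the two mononucleotide frequencies, ρ(XY) = f(XY) / (f(X) · f(Y)). The sketch below uses this basic single-strand form as an assumption; whether PRINSEQ additionally pools each sequence with its reverse complement is not stated in this document.

```python
from collections import Counter


def dinucleotide_odds_ratios(seq):
    """Odds ratio f(XY) / (f(X) * f(Y)) for all 16 dinucleotides of one sequence."""
    seq = seq.upper()
    valid = set("ACGT")
    mono = Counter(b for b in seq if b in valid)
    # Skip dinucleotides that contain an ambiguous base.
    di = Counter(
        seq[i:i + 2] for i in range(len(seq) - 1) if set(seq[i:i + 2]) <= valid
    )
    n_mono, n_di = sum(mono.values()), sum(di.values())
    ratios = {}
    for x in "ACGT":
        for y in "ACGT":
            if n_di == 0 or mono[x] == 0 or mono[y] == 0:
                ratios[x + y] = 0.0
                continue
            expected = (mono[x] / n_mono) * (mono[y] / n_mono)
            ratios[x + y] = (di[x + y] / n_di) / expected
    return ratios
```

A ratio well below 1 for CG would correspond to the CG depression mentioned above as a sign of human DNA contamination.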

Assembly quality measures

The Nxx contig size is a weighted median that is defined as the length of the smallest contig C in the sorted list of all contigs where the cumulative length from the largest contig to contig C is at least xx% of the total length (sum of contig lengths). Replace xx by the preferred value such as 90 to get the N90 contig size. The higher the Nxx value, the higher the rate of longer contigs and the better the dataset. If the dataset does not contain contigs or scaffolds, this information can be ignored.
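The definition above translates directly into a short routine. This is a minimal sketch of that definition; the function name and signature are illustrative.

```python
def nxx(contig_lengths, xx=50):
    """Nxx contig size: length of the contig at which the cumulative length,
    summed from the largest contig downwards, first reaches xx% of the total."""
    lengths = sorted(contig_lengths, reverse=True)
    target = sum(lengths) * xx / 100
    cumulative = 0
    for length in lengths:
        cumulative += length
        if cumulative >= target:
            return length
    return 0  # only reached for an empty list of contigs
```

For contigs of 80, 70, 50, 40, 30 and 20 bp (290 bp in total), the N50 is 70 because 80 + 70 = 150 is the first cumulative sum of at least 145, and the N90 is 30 because 270 is the first cumulative sum of at least 261.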

References

1. Huse S, Huber J, Morrison H, Sogin M, Welch D: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology 2007, 8:R143.

2. Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009, 25:1754-1760.

3. Li H, Homer N: A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform 2010.

4. Schmieder R, Lim YW, Rohwer F, Edwards R: TagCleaner: Identification and removal of tag sequences from genomic and metagenomic datasets. BMC Bioinformatics 2010, 11:341.

5. Gomez-Alvarez V, Teal TK, Schmidt TM: Systematic artifacts in metagenomes from complex microbial communities. ISME J 2009, 3:1314-1317.

6. Morgulis A, Gertz EM, Schäffer AA, Agarwala R: A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J. Comput. Biol 2006, 13:1028-1040.

7. Willner D, Thurber RV, Rohwer F: Metagenomic signatures of 86 microbial and viral metagenomes. Environ. Microbiol 2009.
