![Page 1](https://reader033.vdocuments.us/reader033/viewer/2022042208/5eab9e09e85f39666e71ffdd/html5/thumbnails/1.jpg)
BMMB 852: Applied Bioinformatics
Week 4, Lecture 8
István Albert
Bioinformatics Consulting Center, Penn State, 2015
![Page 2](https://reader033.vdocuments.us/reader033/viewer/2022042208/5eab9e09e85f39666e71ffdd/html5/thumbnails/2.jpg)
You’ll need a “good” text editor
Absolutely essential features:
• Needs to be able to show you whitespace (so that you can distinguish between tabs and spaces)
• Needs to be able to change line ending formats (Windows/Unix/Mac)
Handy features:
• Syntax highlighting
• Line numbers
![Page 3](https://reader033.vdocuments.us/reader033/viewer/2022042208/5eab9e09e85f39666e71ffdd/html5/thumbnails/3.jpg)
There are many options; one possible choice:
![Page 4](https://reader033.vdocuments.us/reader033/viewer/2022042208/5eab9e09e85f39666e71ffdd/html5/thumbnails/4.jpg)
The most annoying problems are caused by invisible characters
• Tabs vs spaces (when you copy/paste from the web, tabs turn into spaces!)
• Newlines of the wrong type (yes, invisible line endings can have types) → Unix, Mac, Windows
• Always use UNIX line endings!
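The invisible characters above can be made visible from the command line. The following is a minimal sketch, not part of the lecture: `check_whitespace` and `to_unix` are hypothetical helper names, and the code assumes standard `grep` and `tr` are available.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; the names are not from the lecture.

# Literal tab and carriage-return characters, built with printf for portability.
TAB=$(printf '\t')
CR=$(printf '\r')

# Report how many lines contain a tab or end with a carriage return (Windows CRLF).
check_whitespace() {
    # grep -c exits non-zero when nothing matches, hence the "|| true".
    tabs=$(grep -c "$TAB" "$1" || true)
    crlf=$(grep -c "$CR\$" "$1" || true)
    echo "lines with tabs: $tabs"
    echo "lines with CRLF endings: $crlf"
}

# Print the file with Windows line endings converted to Unix ones.
to_unix() {
    tr -d '\r' < "$1"
}
```

For a file named `samples.txt` (a placeholder name), `check_whitespace samples.txt` reports the counts and `to_unix samples.txt > samples.unix.txt` writes a converted copy. Note that `to_unix` only handles Windows (CRLF) endings; on a file with old Mac (CR-only) endings it would join all lines together.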
![Page 5](https://reader033.vdocuments.us/reader033/viewer/2022042208/5eab9e09e85f39666e71ffdd/html5/thumbnails/5.jpg)
Short Read Archive
It is (partially) documented and “sort of logical” – but only “sort of”
![Page 6](https://reader033.vdocuments.us/reader033/viewer/2022042208/5eab9e09e85f39666e71ffdd/html5/thumbnails/6.jpg)
SRA – Sequence Read Archive naming conventions
NCBI BioProject: PRJN... - the overall description of a single research initiative; a project will typically relate to multiple samples and datasets
NCBI BioSample: SAMN… and/or SRS… in SRA - a description of biological source material; each physically unique specimen should be registered as a single BioSample with a unique set of attributes
SRA Experiment: SRX… - a unique sequencing library for a specific sample
SRA Run: SRR… ERR… - a manifest of data file(s) linked to a given sequencing library (experiment)
There is cross-linking between SRA and NCBI
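The prefixes listed above can be turned into a small lookup. This is only a sketch under stated assumptions: `classify_accession` is a hypothetical helper, it checks nothing but the prefix, and it covers only the prefixes named on this slide.

```shell
#!/bin/sh
# classify_accession is a hypothetical helper; it looks only at the prefix
# and knows only the prefixes named on the slide.
classify_accession() {
    case "$1" in
        PRJN*)      echo "BioProject" ;;
        SAMN*|SRS*) echo "BioSample" ;;
        SRX*)       echo "SRA Experiment" ;;
        SRR*|ERR*)  echo "SRA Run" ;;
        *)          echo "unknown" ;;
    esac
}

# Example with the run accession used in the homework.
classify_accession SRR1553610
```

The final line prints `SRA Run`. Anything with an unlisted prefix falls through to `unknown` rather than being guessed.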
![Page 7](https://reader033.vdocuments.us/reader033/viewer/2022042208/5eab9e09e85f39666e71ffdd/html5/thumbnails/7.jpg)
Full list of prefixes
![Page 8](https://reader033.vdocuments.us/reader033/viewer/2022042208/5eab9e09e85f39666e71ffdd/html5/thumbnails/8.jpg)
Visit the BioProject for the data
![Page 9](https://reader033.vdocuments.us/reader033/viewer/2022042208/5eab9e09e85f39666e71ffdd/html5/thumbnails/9.jpg)
Web-based download of the data
![Page 10](https://reader033.vdocuments.us/reader033/viewer/2022042208/5eab9e09e85f39666e71ffdd/html5/thumbnails/10.jpg)
That’s not ALL – when it comes to biological data distribution, confusion is the rule.
• The Gene Expression Omnibus also stores results from functional genomic experiments → but the raw data links back to SRA.
• GEO was originally designed for microarray data, later augmented for high throughput sequencing
• These organizations appear to be monolithic and it is not clear what entity is responsible for them, who makes what decisions and why.
• This is why groups of scientists want to form their own independently run information repositories.
![Page 11](https://reader033.vdocuments.us/reader033/viewer/2022042208/5eab9e09e85f39666e71ffdd/html5/thumbnails/11.jpg)
GEO nomenclature
Words that start with G usually refer to GEO:
• GPL… will be a platform
• GSM… indicates a sample
• GSE… indicates a series
The sequencing data links back to SRA – there are other tools to read GEO data.
![Page 12](https://reader033.vdocuments.us/reader033/viewer/2022042208/5eab9e09e85f39666e71ffdd/html5/thumbnails/12.jpg)
Getting data from SRA
• You will need to install a software package called sra-toolkit
• This package can fetch and unpack data from SRA
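As a rough sketch of what fetching and unpacking can look like, the wrapper below builds a `fastq-dump` command for one run. The wrapper name `fetch_run`, the `DRY_RUN` switch and the default subset size are assumptions made for illustration; `fastq-dump`, `-X` and `--split-files` come from sra-toolkit itself, which must be installed for a real download.

```shell
#!/bin/sh
# fetch_run is a hypothetical wrapper name; fastq-dump itself ships with sra-toolkit.
# Usage: fetch_run RUN [MAX_SPOTS]
fetch_run() {
    run="$1"
    limit="${2:-10000}"   # assumed default subset size, chosen only to keep downloads short
    # -X limits how many spots are dumped; --split-files writes paired reads to separate files.
    cmd="fastq-dump -X $limit --split-files $run"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$cmd"       # show the command without contacting SRA
    else
        $cmd
    fi
}

# Dry run: prints the command instead of downloading anything.
DRY_RUN=1 fetch_run SRR1553610
```

Dropping `DRY_RUN=1` runs the command for real and needs network access. Limiting the number of spots is one way to work with a subset when a full run is too slow to download.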
![Page 13](https://reader033.vdocuments.us/reader033/viewer/2022042208/5eab9e09e85f39666e71ffdd/html5/thumbnails/13.jpg)
Downloading and accessing FASTQ data
• Work through the SRA toolkit examples
• Become familiar with the terminology, accessing data, identifying runs
![Page 14](https://reader033.vdocuments.us/reader033/viewer/2022042208/5eab9e09e85f39666e71ffdd/html5/thumbnails/14.jpg)
Homework 8
• Download and unpack at least five SRR runs (use subsets if it seems too slow).
• Run a fastqc report on each.
• Which run do you like most and why? Show one plot that you think shows good quality data.
• How many sequences are in each run? Check the number for at least one run via SRA website.
• What does the following command do:
fastq-dump -X 10 -Z SRR1553610
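For the question about the number of sequences per run, one possible approach is to count lines in the unpacked file. This is a sketch, not the expected solution: `count_reads` is a hypothetical helper, and it assumes an uncompressed FASTQ file in which every record takes exactly four lines (no wrapped sequences).

```shell
#!/bin/sh
# count_reads is a hypothetical helper name used for illustration.
# Assumes an uncompressed FASTQ file with exactly four lines per record.
count_reads() {
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}
```

Calling `count_reads SRR1553610_1.fastq` (an assumed file name, of the kind produced when paired reads are split into separate files) prints the number of records in that file, which can then be compared with the count shown on the SRA website.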