computational skills course week 1
DESCRIPTION
Computational Skills Course week 1. Mike Gilchrist NIMR May-July 2011. WEEK ONE. Introduction and scope of course. The command line, the nature of computer data, and some useful command line tools for modifying text files. Main topics. The command line Data and text files - PowerPoint PPT PresentationTRANSCRIPT
Computational Skills Courseweek 1
Mike GilchristNIMR May-July 2011
WEEK ONE
Introduction and scope of course. The command line, the nature of computer data, and some useful
command line tools for modifying text files.
Main topics
The command line
Data and text files
Simple command line tools
Third party bioinformatics tools
Databases and SQL
3GL’s – industrial strength computing
Scripts and pipelines
What will not appear and why...
Axioms
Google it
You cannot break your computer
Familiarity breeds content
Bugs are always your fault(and can always be found - eventually)
There are many ways of doing any give task(but one of them may be much better)
Computational biologists tend to obfuscate
GNU = Gnu is Not Unix
WEEK ONE
Basic concepts of computer based data, the command line, and some
essential command line utilities.
The terminal window
At the command line
D:\projects\current> program -U jackyO -n 25 -i E:\data\experiment-1.txt > output.txt
program = program.exe or -rwxr-xr-x 1 migil sysbio 18K Mar 12 16:16 program
[migil@INFO-LINUX2 gcc]$ hts_utils
PATH
F:\data\projects\khokha\exon-capture\blast\data-12may11.txt
Also the PATH environment variable which tells the computer where to look for programs
Computer data
10001011010010110010001110110101011010100101000010100101...
a ‘byte’
128 64 32 16 8 4 2 1 = 35 (base 10)
00000000 = 011111111 = 255
256 possible characters in the simplest form of computer data
The ascii ‘alphabet’
00000000 = 0 \0 (the null terminator)
00001001 = 9 \t TAB00001010 = 10 \n (line feed)00001011 = 13 \r (return)
00100000 = 32 ‘ ‘ (space)
00110000 = 48 ‘0’
00111001 = 57 ‘9’: : : :
01000001 = 65 ‘A’
01011010 = 90 ‘Z’: : : :
@HWUSI-EAS582_0299:3:1:87:82/1GGCGACGATACATTCGGATGTCTGCCCTATCAACTTTCGAT+HWUSI-EAS582_0299:3:1:87:82/1ffbffffff^UYbb\ffbee_M_f]UV^JYRXbSdf`bfff@HWUSI-EAS582_0299:3:1:87:1949/1CGGTTCAGCAGGAATGCCGAGACCGATATCGTATGCCGTCT+HWUSI-EAS582_0299:3:1:87:1949/1_ebhhghghgedgceg\fggfghggg_]eZ\WWcacMb_ce@HWUSI-EAS582_0299:3:1:87:1688/1CGGGCATCTAAGGGCATCACAGACCTGTTATTGCTCGATCT+HWUSI-EAS582_0299:3:1:87:1688/1ffbdd`bfffbf`ZfZddddcbfZfff]ffaddYdeffedd@HWUSI-EAS582_0299:3:1:87:1773/1CGCTTGTTTCCTGATGTCACATGACAACACAAGATCGATAA+HWUSI-EAS582_0299:3:1:87:1773/1gfgggggfggggfgggZggfddggggfag\fggafgdgggg@HWUSI-EAS582_0299:3:1:87:1000/1ATTGTGCAAGTCTCCCAATGTCGATTTAATGAAATCCCTAC+HWUSI-EAS582_0299:3:1:87:1000/1ggeggegeggcgggfdaaffdgggggfggggggggggdggd@HWUSI-EAS582_0299:3:1:87:842/1GGAGTATGGTTGCAAAGCTGAAACTTAAAGGAATTGACGGA+HWUSI-EAS582_0299:3:1:87:842/1gggggggggggggggggggggggfgggggggfggggggggg@HWUSI-EAS582_0299:3:1:88:1826/1TTCGGACACACTGGGCCCAGATGGCTTCTTGGATTTAGGGG+HWUSI-EAS582_0299:3:1:88:1826/1cWbgggcfgfc`g^gc^ddgg\ggbcd\`OfdZW`d`dJdg
A typical text data file
The line ending/platform issue
582_0299:3:1:87:82/1\n\rGATGTCTGCCCTATCAACTTTCGAT\n\r582_0299:3:1:87:82/1\n\rfbee_M_f]UV^JYRXbSdf`bfff\n\r
582_0299:3:1:87:82/1\nGATGTCTGCCCTATCAACTTTCGAT\n582_0299:3:1:87:82/1\nfbee_M_f]UV^JYRXbSdf`bfff\n
DOS (Windows) Unix/Linux (newer Macs)
Programs which read text in a line at a time may give the wrong result if the text file is from an incompatible platform...
If you are moving a file from Linux -> Windows run unix2dos first.>unix2dos <FILENAME>
582_0299:3:1:87:82/1\n\rGATGTCTGCCCTATCAACTTTCGAT\n\r582_0299:3:1:87..
(really)
Command line utilities for manipulating data files
http://sourceforge.net/projects/gnuwin32/files/>grep
finds lines matching ‘patterns’
>sed
swaps one pattern for another
>awk
operates on ‘fields’ in a record
>cat>tail>more/less>unix2dos/dos2unix
AND THEN THERE ARE PIPES... ‘|’AND REDIRECTS... ‘>’
grepA typical fasta sequence file (Xenopus tropicalis mRNAs)
:ACAAGCGATCTTGTAGAGCAATTCCAGCAACACTTAAAGGGACTTCTGTCTGTACTTACCAAACTGACAAAAAAGGCCAATCTACTGACAAACTCTTACAAAAAGCAGATTGGCATTGGTGCTCCGAGCAGT>ENSXETG00000025789|ENSXETT00000019729|nodal2ATGGCAGCCCTAGGAGCCCTCTTTTTATTTGCCATGGCCTCCCTTGTGCACGGGAAGCCCATTCATTCAGACAGAAAAGGAGCTAAAATCCCTCTGGCAGGATCTAACCTGGGATACAAGAAATCCAGCAATTCATATGGTTCCAGACTGTCGCAGGGTATGAGATACCCCCCTTCCATGATGCAGTTATACCAGACTCTGATTTTGGGGAATGATACGGATCTGTCAATCCTGGAATATCCCATCCTGCAGGAATCTGATGCCGTTCTAAGCCTCATTGCAAAAAGTTGCGATGTAGTGGGCAATCGATGGACATTGTCCTTTGACATGTCTTCTATATCGAGCAGCAATGAGCTGAAATTGGCCGAGTTGAGGATCCGCCTCCCTTCCTTTGAAAGGTCCCAGGAT>ENSXETG00000004295|ENSXETT00000014760|neurod4AGCCCCGGACTCATTGATATTGGGCACAGCGAGTCTGCCTGGGAGCTGTCCAGCACTCCATGCTCCTGAAATAACTTGGGCAACAAGTCCGATCTGCCCGCTACTCTGTGCCTCCAGCTCAGGCCCGGGGAGAGGGACCCTGCTGAGCAGGACTCAGGACACTGTTTGAAGATCACATCAAATTCTGCTAATATGTCGGAGATAGTCAGTGTGCATGGGTGGATGGAGGAAGCCCTTAGTTCCCAGGATGAGATGGAGAGGAATCAGCGGCAATCTGCCTATGATATCATTTCAGGTCTGAGTCACGAGGAAAGGTGTAGCATAGATGGAGAAGATGATGATGAAGAAGAAGAGGATGGAGAGAAACCAAAAAAGAGGGGACCCAAAAAAAAGAAGATGACCAAGGCTAGACTGGAGAGGTTTCGTGTGCGCAGAGTAAAAGCCAATGCCAGGGAGCGCACCAGAATGCATGGACTTAATGATGCCCTAGAAAATTTAAGGAGGGTCATGCCTTGCTATTCCAAAACACAAAA>ENSXETG00000006455|ENSXETT00000014172|camk2gAGACAAGAAACAGTGGAATGTTTGAGAAAGTTCAATGCACGGAGAAAGCTTAAAGGTGCAATTCTCACAACAATGTTGGTTTCTCGGAATTTTTCTGGAATTGCATTTGGATGCCGAAAAGCTGCATCCACTGTCCCGTGTACCTCTTCAACGGGGGACACTATAACTGGTGTTGGCAGGCAGACCTCCGCCCCTGTTGTGGCCGCCACCAGTGCTGCCAACTTAGTCGAGCAAGCTGCCAAGAGTTTGTTGAACAAGAAGACAGATGGTGTCAAGCCACAGACCAACAACAAAAACAGCATAATAAGCCCTGCAAAAGAAAACCCCCCATTGCAGACATCAATGGAACCTCAAACAACTGTTGTCCACAATGCAACTGATGGGATAAAAGGATCAACAGAGAGTTGTAACACCACCACT:
grep
F:dev\sql>grep ACGTACGT my-sequence-file.fastaGAGAAGAACGTACGTGAGTGCAACCCATTCCTGGACCCGGAGATGGTGCGATTCCTCTGGTATTAAGAAAGAAAAGTTACGTACGTTGATAGACCTTGTAAGTGAAGAGAAGATGTTAGATTCAACCCAACTTACTATGTTACTATTGCTTCATTCCTTTTCACGTACGTCTGGTCTCAAGGAATTAATAACCAGGATTTTGAAGGGGATTGCTACGTACGTCGCAGGTTATCCGGTGGA::
F:dev\sql>grep –c ACGTACGT my-sequence-file.fasta76
F:dev\sql>grep nodal my-sequence-file.fasta>ENSXETG00000025789|ENSXETT00000019729|nodal2>ENSXETG00000009009|ENSXETT00000019730|nodal3>ENSXETG00000025789|ENSXETT00000019728|nodal2>ENSXETG00000009008|ENSXETT00000019726|nodal1>ENSXETG00000017442|ENSXETT00000037932|nodal5.2>ENSXETG00000016779|ENSXETT00000036596|nodal5>ENSXETG00000016778|ENSXETT00000036593|nodal6>ENSXETG00000023748|ENSXETT00000051228|nodal
F:dev\sql>grep –c “>” my-sequence-file.fasta28937
F:dev\sql>grep “>” my-sequence-file.fasta > my-sequence-file-def-lines.txt
grep reads the input file line by line and reports on each line containing the pattern
sed
F:dev\sql>grep nodal my-sequence-file.fasta > tmp.txt
F:dev\sql>more tmp.txt>ENSXETG00000025789|ENSXETT00000019729|nodal2>ENSXETG00000009009|ENSXETT00000019730|nodal3>ENSXETG00000025789|ENSXETT00000019728|nodal2>ENSXETG00000009008|ENSXETT00000019726|nodal1>ENSXETG00000017442|ENSXETT00000037932|nodal5.2>ENSXETG00000016779|ENSXETT00000036596|nodal5>ENSXETG00000016778|ENSXETT00000036593|nodal6>ENSXETG00000023748|ENSXETT00000051228|nodal
F:\dev\sql>sed "s/nodal/xnr/" tmp.txt>ENSXETG00000025789|ENSXETT00000019729|xnr2>ENSXETG00000009009|ENSXETT00000019730|xnr3>ENSXETG00000025789|ENSXETT00000019728|xnr2>ENSXETG00000009008|ENSXETT00000019726|xnr1>ENSXETG00000017442|ENSXETT00000037932|xnr5.2>ENSXETG00000016779|ENSXETT00000036596|xnr5>ENSXETG00000016778|ENSXETT00000036593|xnr6>ENSXETG00000023748|ENSXETT00000051228|xnr
stream editor: reads input line at a time, and operates on the line as requested – typically to make a substitution. E.g. Replace groups of space with a TAB, etc.
sed
F:dev\sql>more tmp.txt>ENSXETG00000025789|ENSXETT00000019729|nodal2>ENSXETG00000009009|ENSXETT00000019730|nodal3>ENSXETG00000025789|ENSXETT00000019728|nodal2>ENSXETG00000009008|ENSXETT00000019726|nodal1>ENSXETG00000017442|ENSXETT00000037932|nodal5.2>ENSXETG00000016779|ENSXETT00000036596|nodal5>ENSXETG00000016778|ENSXETT00000036593|nodal6>ENSXETG00000023748|ENSXETT00000051228|nodal
F:\dev\sql>sed "s/0/_/" tmp.txt>ENSXETG_0000025789|ENSXETT00000019729|nodal2>ENSXETG_0000009009|ENSXETT00000019730|nodal3>ENSXETG_0000025789|ENSXETT00000019728|nodal2>ENSXETG_0000009008|ENSXETT00000019726|nodal1>ENSXETG_0000017442|ENSXETT00000037932|nodal5.2>ENSXETG_0000016779|ENSXETT00000036596|nodal5>ENSXETG_0000016778|ENSXETT00000036593|nodal6>ENSXETG_0000023748|ENSXETT00000051228|nodal
F:\dev\sql>sed "s/0/_/g" tmp.txt>ENSXETG______25789|ENSXETT______19729|nodal2>ENSXETG_______9__9|ENSXETT______1973_|nodal3>ENSXETG______25789|ENSXETT______19728|nodal2>ENSXETG_______9__8|ENSXETT______19726|nodal1>ENSXETG______17442|ENSXETT______37932|nodal5.2>ENSXETG______16779|ENSXETT______36596|nodal5>ENSXETG______16778|ENSXETT______36593|nodal6>ENSXETG______23748|ENSXETT______51228|nodal
The pipe
F:dev\sql>grep “>” my-sequence-file.fasta > tmp.txt
F:\dev\sql>sed "s/nodal/xnr/" tmp.txt>ENSXETG00000025789|ENSXETT00000019729|xnr2>ENSXETG00000009009|ENSXETT00000019730|xnr3>ENSXETG00000025789|ENSXETT00000019728|xnr2>ENSXETG00000009008|ENSXETT00000019726|xnr1>ENSXETG00000017442|ENSXETT00000037932|xnr5.2>ENSXETG00000016779|ENSXETT00000036596|xnr5>ENSXETG00000016778|ENSXETT00000036593|xnr6>ENSXETG00000023748|ENSXETT00000051228|xnr
F:dev\sql>grep “>” my-sequence-file.fasta | sed "s/nodal/xnr/">ENSXETG00000025789|ENSXETT00000019729|xnr2>ENSXETG00000009009|ENSXETT00000019730|xnr3>ENSXETG00000025789|ENSXETT00000019728|xnr2>ENSXETG00000009008|ENSXETT00000019726|xnr1>ENSXETG00000017442|ENSXETT00000037932|xnr5.2>ENSXETG00000016779|ENSXETT00000036596|xnr5>ENSXETG00000016778|ENSXETT00000036593|xnr6>ENSXETG00000023748|ENSXETT00000051228|xnr
(n)awk
3 06/04/2011 06/04/2011 Ashleigh Howes OK [email protected] Immunoreg O'Garra2 06/04/2011 06/04/2011 Leona Gabrysova OK [email protected] Immunoreg O'Garra1 06/04/2011 06/04/2011 Eric Dang (ok) [email protected] Immunoreg O'Garra
07/04/2011 07/04/2011 Guilherme Neves OK [email protected] Molecular Neuro Pachnis5 08/04/2011 08/04/2011 Mary Wu OK [email protected] Systems Biol Smith24 08/04/2011 08/04/2011 Jose Saldanha OK [email protected] Math Biol Goldstein8 08/04/2011 08/04/2011 Laurent Dupays OK [email protected] Dev Biol Mohun7 08/04/2011 08/04/2011 Veronique Duboc OK [email protected] Dev Biol Logan22 09/04/2011 09/04/2011 Madhu Kadekoppala OK [email protected] Parasitology Holder21 09/04/2011 09/04/2011 George Gentsch OK [email protected] Systems Biol Smith13 10/04/2011 11/04/2011 Siggi Sato OK [email protected] Parasitology Holder
10/04/2011 10/04/2011 James Briscoe OK [email protected] Dev Neurobiol Briscoe26 10/04/2011 17/04/2011 Alex Watson OK [email protected] Systems Biol Smith
10/04/2011 Mustafa Khokha OK [email protected] YALE
Awk: manipulates data in structured, tabular, format - reads input one line at a time, but operates on fields.
F:dev\sql> nawk -F\t "{ print $4, $7; }" I:\transfer\workshop.txtAshleigh [email protected] [email protected] [email protected] [email protected] [email protected]:
F:dev\sql> nawk -F\t "{ print $4, $7; }" I:\transfer\workshop.txt | sed "s/@/ AT /“Ashleigh ahowes AT nimr.mrc.ac.ukLeona lgabrys AT nimr.mrc.ac.ukEric edang AT nimr.mrc.ac.ukGuilherme gneves AT nimr.mrc.ac.ukMary mwu AT nimr.mrc.ac.uk:
Now let’s go and get some data...
Download the set of gene locus coordinates for your model organism from BioMart (Ensembl).
Practise using grep, sed and (n)awl on this data.
E.g. How many nodal genes are there? Etc.
Simple exercise
>gi|10047086|ref|NP_061821.1| mitogen-inducible gene 6 protein [Homo sapiens] MSIAGVAAQEIRVPLKTGFLHNGRAMGNMRKTYWSSRSEFKNNFLNIDPITMAYSLNSSAQERLIPLGHASKSAPMNGHCFAENGPSQKSSLPPLLIPPSENLGPHEEDQVVCGFKKLTVNGVCASTPPLTPIKNSPSLFPCAPLCERGSRPLPPLPISEALSLDDTDCE>gi|10047090|ref|NP_055147.1| small muscle protein, X-linked [Homo sapiens] MNMSKQPVSNVRAIQANINIPMGAFRPGAGQPPRRKECTPEVEEGVPPTSDEEKKPIPGAKKLPGPAVNLSEIQNIKSELKYVPKAEQ>gi|10047100|ref|NP_057387.1| WW domain binding protein 5 [Homo sapiens]MKSCQKMEGKPENESEPKHEEEPKPEEKPEEEEKLEEEAKAKGTFRERLIQSLQEFKEDIHNRHLSNEDMFREVDEIDEIRRVRNKLIVMRWKVNRNHPYPYLM >gi|10047102|ref|NP_057388.1| ribosomal protein L24-like [Homo sapiens]MRIEKCYFCSGPIYPGHGMMFVRNDCKVFRFCKSKCHKNFKKKRNPRKVRWTKAFRKAAGKELTVDNSFEFEKRRNEPIKYQRELWNKTIDAMKRVEEIKQKRQAKFIMNRLKKNKELQKVQDIKEVKQNIHLIRAPLAGKGKQLEEKMVQQLQEDVDMEDAP
Here is a small test, driven by a real need.
Take the following tiny section of a fasta file of human proteins from NCBI, and put it in your own fasta/text file.
Then see if you can edit it to produce new versions of the fasta file by:
(a) removing the NCBI IDs string, i.e. ->>mitogen-inducible gene 6 protein [Homo sapiens]MSIAGVAAQEIRVPLKTGFLHNGRAMGNMRKTYWSSRSEFKNNFLNIDPITMAYSLNSSAQERLIPLGHASKSAPMNGHCFAENGPSQKSSLPPLLIPPSENLGPHEEDQVVCGFKKLTVetc.
(b) removing everything except the accession number>NP_061821.1MSIAGVAAQEIRVPLKTGFLHNGRAMGNMRKTYWSSRSEFKNNFLNIDPITMAYSLNSSAQERLIPLGHASKSAPMNGHCFAENGPSQKSSLPPLLIPPSENLGPHEEDQVVCGFKKLTVetc.
(c) identify and substitute the species name in the square brackets>gi|10047086|ref|NP_061821.1| mitogen-inducible gene 6 protein (species)MSIAGVAAQEIRVPLKTGFLHNGRAMGNMRKTYWSSRSEFKNNFLNIDPITMAYSLNSSAQERLIPLGHASKSAPMNGHCFAENGPSQKSSLPPLLIPPSENLGPHEEDQVVCGFKKLTV etc.
Catches and other platform specific stuff
Paths: folder dividers ‘/’ in mac/linux ‘\’ in windows
Quotes marks: you may have to experiment with double and single quotes, and even sometimes the backquote. With sed you may or may not have to quote the command string, i.e. >sed ‘s/nodal/xnr/’ OR >sed “s/nodal/xnr/” or no quotes at all.
Unix/Linux is case sensitive: Hello != hello
^C (control-C) is your get out of jail free card!
For next week
Download and install MySQL…
This seems to work easily on Windows, but may be rather complicated the Mac, make sure you follow instructions fairly closely.
The server needs to be ‘running in the background’ for you to be able to log on and create databases and tables, etc.