Unix Essentials
Bingbing Yuan
Next Hot Topics: Unix – Beyond Basics (Mon Oct 20th at 1pm)
Objectives
• Unix Overview
• Whitehead Resources
• Unix Commands
• BaRC Resources
• LSF
Objectives: Hands-on
• Parsing Human Body Index (HBI) array data
Goal: Process a large data file to extract important information: find genes of interest, sort expression values, and subset the data for further investigation.
Advantages of Unix
• Processing files with thousands, or millions, of
lines
How many reads are in my fastq file?
Sort by gene name or expression values
• Many programs run on Unix only
Command-line tools
• Automate repetitive tasks or commands
Scripting
• Other software, such as Excel, cannot handle large files efficiently
• Open source
Scientific computing resources
Shared packages/programs
https://tak.wi.mit.edu
Installed
packages/programs
Request new
packages/programs
Login
• Requesting a tak account http://iona.wi.mit.edu/bio/software/unix/bioinfoaccount.php
• Windows
PuTTY or Cygwin
Xming: setup X-windows for graphical display
• Macs
Access through Terminal
Connecting to tak for Windows
Command prompt: user@tak ~$
Unix Commands
• General syntax: command, options, arguments
Command
Options or switches (zero or more)
Arguments (zero or more)
Example: uniq -c myFile.txt
Options can be combined
ls -l -a or ls -la
• Manual (man) page: man uniq
• One-line description: whatis ls
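A minimal sketch of the command, options, arguments pattern, assuming a throwaway directory made with mktemp (the file names are invented for illustration). It checks that giving options separately or combined produces the same listing.

```shell
# Scratch directory with hypothetical files, so nothing real is touched.
demo_dir=$(mktemp -d)
touch "$demo_dir/visible.txt" "$demo_dir/.hidden.txt"

# command = ls, options = -l and -a, argument = the directory
separate=$(ls -l -a "$demo_dir")
combined=$(ls -la "$demo_dir")

# Both spellings should print the same long listing, hidden file included.
echo "$combined"
```

The -a option is what makes the file starting with a dot appear in the listing.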
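Unix Directory Structure
[Figure: directory tree. The root (/) contains home, dev, bin, nfs, lab, ... Directories shown beneath these: jdoe, BaRC_Public, solexa_public, solexa_lodish, page. Example full paths: /lab/page and /home/jdoe]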
Accessing Shared Resources at Whitehead
• Unix
/nfs/BaRC_Public
/lab/solexa_public
/lab/page
• Windows (access using Start Menu Search)
\\wi-files1\BaRC_Public
\\wi-files1\fink_lab
\\wi-files2\page
\\wi-htdata\solexa_public
• Macs (access using Go > Connect to Server…)
cifs://wi-files1/BaRC_Public
cifs://wi-htdata/solexa_public
Where’s my lab’s share?
• http://wi-inside.wi.mit.edu/departments/it/services/filestorage/labshares
Directory Contents
• List files/directories
ls lists the contents of a directory
ls -l includes additional info (e.g. permissions, time stamp)
Options:
-l long listing
-h human readable
thiruvil@tak /nfs/BaRC_Public$ ls -l
total 4740
drwxrwxr-x 5 gbell barc 4096 2012-03-16 15:56 apps/
drwxrwxr-x 4 gbell barc 4096 2011-10-18 09:48 BaRC_code/
drwxrwxrwx 5 gbell barc 4096 2012-09-17 15:03 Bartel_Lab/
drwxrwsrwx 3 gbell barc 4096 2012-05-04 16:17 Cheeseman_Lab/
drwxrwsrwx 3 byuan barc 4096 2010-11-23 14:22 chip_seq/
drwxrwsrwx 2 gbell barc 4096 2012-02-21 16:26 CMT/
-rw-r--r-- 1 gbell barc 192568 2012-10-10 10:14 du.20121010a.txt
Labeled columns: permissions, owner, group, size (bytes), time stamp, file or directory
Permissions
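drwxrwxr-x
Type (first character): directory (d), symbolic link (l)
The next nine characters are three groups of three: user, group, others
r read
w write
x execute
• Use chmod to change permissions
user (u), group (g), others (o), all (a)
chmod u+x foo.pl (user can execute)
chmod g-w foo.pl (group can't write)
• A "permission denied" error means you lack the required permission on the file or directory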
thiruvil@tak /nfs/BaRC_Public$ ls -l myFile.txt
-rw-r--r-- 1 thiruvil barc 0 2012-10-10 13:32 myFile.txt
thiruvil@tak /nfs/BaRC_Public$ chmod g+w myFile.txt
thiruvil@tak /nfs/BaRC_Public$ ll myFile.txt
-rw-rw-r-- 1 thiruvil barc 0 2012-10-10 13:32 myFile.txt
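The two chmod forms from the slide can be tried safely on a scratch file. This is only a sketch: the file is a hypothetical empty foo.pl in a temporary directory, and chmod 664 is used first so the starting permissions are known.

```shell
# Hypothetical scratch file; chmod 664 gives a known starting point.
perm_dir=$(mktemp -d)
perm_file="$perm_dir/foo.pl"
touch "$perm_file"
chmod 664 "$perm_file"
before=$(ls -l "$perm_file" | cut -c1-10)   # first 10 characters = type + permissions

chmod u+x "$perm_file"   # user can now execute
chmod g-w "$perm_file"   # group can no longer write
after=$(ls -l "$perm_file" | cut -c1-10)

echo "before: $before"   # -rw-rw-r--
echo "after:  $after"    # -rwxr--r--
```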
Navigating in Unix
• pwd: print working directory
byuan@tak ~$ pwd
/home/byuan
• cd: change directory
cd fink_lab # if you are in /lab
cd to home directory
cd ~
cd to directory above
cd ..
cd to a specific directory
cd /nfs/BaRC_Public
• A "No such file or directory" error means the path you gave does not exist (check spelling and case)
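A short sketch of moving around with cd and pwd. It assumes a temporary directory tree that stands in for /lab (the names nav_root and no_such_lab are made up), so it can be run anywhere without touching real shares.

```shell
start_dir=$(pwd)                      # remember where we began
nav_root=$(mktemp -d)                 # hypothetical stand-in for a real file system
mkdir -p "$nav_root/lab/fink_lab"

cd "$nav_root/lab"
cd fink_lab                           # relative path: works because we are in lab
inside=$(basename "$(pwd)")
cd ..                                 # up one level
parent=$(basename "$(pwd)")

# A path that does not exist fails with "No such file or directory".
if cd "$nav_root/lab/no_such_lab" 2>/dev/null; then
    missing="found"
else
    missing="no such directory"
fi

cd "$start_dir"                       # go back to where we started
echo "$inside $parent $missing"
```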
Organizing Files and Directories
• Commands
mkdir make a directory
mkdir my_foo
rmdir remove a directory (must be empty)
rmdir my_foo
mv move or rename a file/directory
mv myOldFile myNewFile
cp copy a file
cp myOldFile myNewFile
rm remove or delete a file
rm myFile
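The commands above can be chained into one small session. This sketch works inside a temporary directory (org_root and myCopy are invented names) so that rm and rmdir cannot remove anything important.

```shell
org_root=$(mktemp -d)
mkdir "$org_root/my_foo"                               # make a directory
echo "some data" > "$org_root/myOldFile"

cp "$org_root/myOldFile" "$org_root/my_foo/myCopy"     # copy: the original remains
mv "$org_root/myOldFile" "$org_root/myNewFile"         # rename: the old name is gone
rm "$org_root/my_foo/myCopy"                           # delete the copy
rmdir "$org_root/my_foo"                               # succeeds because the directory is now empty

remaining=$(ls "$org_root")
echo "$remaining"                                      # myNewFile
```

Note that rm and rmdir do not ask for confirmation and there is no trash to recover from.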
Unix Tips
• Use the up/down arrow keys to reuse previous commands
• Ctrl-c: stop a process that is running
• Tab-completion:
– Complete commands/file names
• Unix is case-sensitive
Getting Files
• Getting files or directories
Files: wget http://www.broadinstitute.org/igv/projects/downloads/IGV_2.1.17.zip
Directories from (outside) servers: scp -r [email protected]:/broad/lab/works .
(Un)Compressing Files
• .gz file
Compress: gzip -c expression.txt > expression.gz (without -c, gzip replaces the file with expression.txt.gz)
Uncompress: gunzip expression.gz
• .tar.gz file
Compress: tar -czvf myFiles.tar.gz myFiles
Uncompress: tar -xzvf myFiles.tar.gz
Options
-c create an archive (files to archive, archive from files)
-x extract an archive (archive to files, files from archive)
-f FILE name of archive
-v be verbose, list all files being archived/extracted
-z create/extract archive with gzip/gunzip
• View compressed files using zmore, zgrep
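A round trip through gzip and tar, sketched on invented data in a temporary directory. The -v option is left out only to keep the output quiet, and tar's -C option (change to a directory first) is an addition that is not on the slide.

```shell
gz_root=$(mktemp -d)
mkdir "$gz_root/myFiles"
printf 'geneA\t1.5\ngeneB\t2.0\n' > "$gz_root/myFiles/expression.txt"
original=$(cat "$gz_root/myFiles/expression.txt")

# -c writes the compressed stream to stdout, so the original file is kept.
gzip -c "$gz_root/myFiles/expression.txt" > "$gz_root/expression.gz"
roundtrip=$(gunzip -c "$gz_root/expression.gz")

# Archive the directory, then extract it somewhere else.
tar -czf "$gz_root/myFiles.tar.gz" -C "$gz_root" myFiles
mkdir "$gz_root/extracted"
tar -xzf "$gz_root/myFiles.tar.gz" -C "$gz_root/extracted"
extracted=$(cat "$gz_root/extracted/myFiles/expression.txt")

echo "$roundtrip"
```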
Editing a File
• Command-line editors
pico
nano
emacs (emacs -nw)
vi
• Graphical editors (Windows users need an X-windows emulator)
Note: may not be part of standard installation
nedit
gedit
xemacs
• Put an & at the end of the command line to run a graphical editor in the background so that you can continue to use the terminal window
e.g. gedit myFile.txt &
Viewing a File
• Display a file page by page
more myFile.txt
Use Enter to scroll one line, space for the next page, and q to quit
• Display first 15 lines of a file
head -n 15 myFile.txt
• Display last 15 lines of a file
tail -n 15 myFile.txt
• Show all contents of a file
cat myFile.txt
Show hidden characters (^M or carriage return)
cat -A myFile.txt
• Display number of lines in a file
wc -l myFile.txt
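To see what these viewers return, here is a sketch on a generated 100-line file (one number per line, made with seq). It uses cat -v instead of cat -A because -v is available on more systems; both show a carriage return as ^M.

```shell
view_file=$(mktemp)
seq 1 100 > "$view_file"                          # hypothetical file: 1, 2, ..., 100

first_three=$(head -n 3 "$view_file" | paste -sd, -)   # join lines with commas for display
last_two=$(tail -n 2 "$view_file" | paste -sd, -)
line_count=$(wc -l < "$view_file" | tr -d ' ')         # strip padding some systems add

# A Windows-style line ending shows up as ^M.
crlf=$(printf 'gene\r\n' | cat -v)

echo "$first_three | $last_two | $line_count | $crlf"
```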
Output Redirection and Piping
• Write output of a command to a file
Write to output file (creates the file, or overwrites it unless the noclobber option is set)
sort myFile.txt > myFile_sorted.txt
Replace output file (overwrites even when noclobber is set)
sort myFile.txt >| myFile_sorted.txt
Append to output file
sort myFile.txt >> myFile_sorted.txt
• Piping "|": use output of one command as input for another command
sort myFile.txt | more
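The difference between >, >| and >> only shows when the shell's noclobber option is on (set -C). The sketch below assumes a three-line input file made up for the purpose and counts lines after each step.

```shell
redir_root=$(mktemp -d)
in_file="$redir_root/myFile.txt"
out_file="$redir_root/myFile_sorted.txt"
printf 'b\na\nc\n' > "$in_file"

sort "$in_file" > "$out_file"        # create the file: 3 lines
sort "$in_file" >> "$out_file"       # append: 6 lines
after_append=$(wc -l < "$out_file" | tr -d ' ')

set -C                               # noclobber on: plain > refuses to overwrite
if (sort "$in_file" > "$out_file") 2>/dev/null; then
    plain_overwrite="allowed"
else
    plain_overwrite="refused"
fi
sort "$in_file" >| "$out_file"       # >| overwrites anyway: back to 3 lines
set +C                               # noclobber off again
after_force=$(wc -l < "$out_file" | tr -d ' ')

first_sorted=$(sort "$in_file" | head -n 1)   # pipe: sort's output feeds head

echo "$after_append $plain_overwrite $after_force $first_sorted"
```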
Parsing a File: cut
• Select columns of interest
cut -f 9,12-15 myGeneValues.txt > col_9.12to15.txt
Options:
-f output only these fields
-d field delimiter
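A small sketch of cut on an invented three-column table. Fields are separated by tabs unless -d says otherwise, which the second command illustrates with a comma.

```shell
cut_file=$(mktemp)
# Hypothetical tab-delimited data: gene, tissue, value
printf 'gene\ttissue\tvalue\nTMEM131\tliver\t7.2\nTMEM49\theart\t5.1\n' > "$cut_file"

cols_1_3=$(cut -f 1,3 "$cut_file")                # columns 1 and 3, tab-delimited
header=$(echo "$cols_1_3" | head -n 1)
second_csv=$(printf 'a,b,c\n' | cut -d , -f 2)    # -d switches the delimiter to a comma

echo "$cols_1_3"
```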
Parsing a File: sort and uniq
• Sort on column(s)
sort -k 3,3 myGeneExpression.txt | more
Options:
-n numerical sort
-r reverse
-k pos1,pos2 start a key at pos1, end it at pos2
• Get only unique entries (ensure the file is sorted before running uniq)
uniq mySortedGenes.txt > myUniqGenes.txt
Options:
-c count entries
-d print only duplicated entries
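How sort and uniq combine, sketched on invented expression values. The awk step is not from the slides; it is only there to tidy the padded output of uniq -c.

```shell
su_file=$(mktemp)
# Hypothetical tab-delimited data: gene, tissue, value
printf 'TMEM49\theart\t5.1\nTMEM131\tliver\t7.2\nTMEM49\tliver\t12.4\n' > "$su_file"

# Numeric (n), descending (r) sort on column 3 only.
top_gene=$(sort -k 3,3nr "$su_file" | head -n 1 | cut -f 1)

# uniq only collapses adjacent lines, so sort the gene names first.
gene_counts=$(cut -f 1 "$su_file" | sort | uniq -c | awk '{print $2 "=" $1}' | paste -sd, -)
dup_genes=$(cut -f 1 "$su_file" | sort | uniq -d)   # only names seen more than once

echo "$top_gene | $gene_counts | $dup_genes"
```

Without the n modifier, 12.4 would sort before 5.1 because the values would be compared as text.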
Regular Expressions
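• Pattern matching
• Easier to search
• Commonly used regular expressions
Example: list all txt files (here * is a shell wildcard rather than a regular expression)
ls *.txt
Regular Expression   Matches
.                    Any single character
*                    Zero or more of the preceding character (as a shell wildcard: any string)
^                    Beginning of a line
$                    End of a line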
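The same symbols behave differently in file-name wildcards and in regular expressions, so a side-by-side sketch may help. File and gene names here are invented, and grep -c (count matching lines) is used only to make the results easy to compare.

```shell
re_root=$(mktemp -d)
touch "$re_root/a.txt" "$re_root/b.txt" "$re_root/notes.md"
printf 'TMEM131\nTMEM49\nATMEM9\nTM\n' > "$re_root/genes.list"

# Shell wildcard: * matches any run of characters in a file name.
txt_count=$(ls "$re_root"/*.txt | wc -l | tr -d ' ')

# Regular expressions, as used by grep:
starts=$(grep -c '^TMEM' "$re_root/genes.list")    # lines beginning with TMEM
ends9=$(grep -c '9$' "$re_root/genes.list")        # lines ending in 9
dot=$(grep -c '^TMEM.9$' "$re_root/genes.list")    # . stands for exactly one character
star=$(grep -c '^TM*$' "$re_root/genes.list")      # M* means zero or more M's

echo "$txt_count $starts $ends9 $dot $star"
```

Quoting the pattern keeps the shell from expanding * or $ before grep sees them.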
Searching Within a File
• grep (global regular expression print)
• Find words, or patterns, occurring in lines of a file
grep TMEM geneList.txt
TMEM131
TMEM9B
TMEM14C
TMEM66
TMEM49
Options:
-v select non-matching lines
-i ignore case
-n print line number
Example: get TMEM genes whose names do not end with 9
grep TMEM geneList.txt | grep -v "9$" | more
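A sketch of the grep options on an invented gene list, assuming one gene name per line. The last pipeline keeps TMEM genes and then drops any whose name ends in 9.

```shell
grep_file=$(mktemp)
# Hypothetical gene list, one name per line (tmem66 is lower case on purpose).
printf 'TMEM131\nACTB\nTMEM9B\nTMEM14C\ntmem66\nTMEM49\n' > "$grep_file"

tmem=$(grep TMEM "$grep_file" | paste -sd, -)      # case-sensitive match
tmem_any_case=$(grep -ic tmem "$grep_file")        # -i ignores case, -c counts lines
line_no=$(grep -n ACTB "$grep_file")               # -n prefixes the line number

# Keep TMEM lines, then use -v to drop those ending in 9.
not_9=$(grep TMEM "$grep_file" | grep -v '9$' | paste -sd, -)

echo "$tmem | $tmem_any_case | $line_no | $not_9"
```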
BaRC Resources
• jura.wi.mit.edu
BaRC SOP
http://barcwiki.wi.mit.edu/wiki/SOPs
BaRC Scripts
Running Scripts on Unix
• Perl: bed2gff.pl
• R: run_rma_customCDF.R
• Python: myScript.py
• Matlab: matlab -nodesktop -nosplash myScript.m
• Java Archive (JAR): java -Xmx1000m -jar /usr/local/share/IGVTools/igv.jar
Running Programs/Tools on Unix
• bedtools
bedtools intersect -a myGenes1.bed -b myGenes2.bed
Other utilities: http://code.google.com/p/bedtools/wiki/Usage
• samtools
samtools view myFile.bam
Other utilities: http://samtools.sourceforge.net/samtools.shtml
• Fastx toolkit
fastx_quality_stats -i mySeq.fastq -o fastxStats_mySeq
• FastQC
fastqc mySeq.fastq
• BLAST
blastp -task blastp -db myProtDB.fa -query myProt.fa -out out.txt
Commonly Used Data Locations at Whitehead
Location              Description
/nfs/genomes          Genome data: gff, gtf, fasta, bowtie indexed files, blat indexed file, etc. for several organisms
/nfs/seq/Data         Sequence data, including blast databases, for several organisms
/nfs/BaRC_datasets    Large (array/NGS) datasets: HBI, HBM 2.0
Scientific computing resources
LSF cluster jobs
https://tak.wi.mit.edu
Load Sharing Facility (LSF) Cluster
• More computing power
• Multiple jobs running at the same time
LSF Commands
• bsub to submit jobs
bsub wc -l reads.fq
bsub "sort foo.txt > sorted.txt"
Options:
-e error file
-o standard out file
-m machine
-u email address
• bjobs to view your jobs
bjobs
• bkill to kill a job
bkill 237878
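A hedged sketch of submitting a job with output and error files. The log file names are invented, foo.txt is a placeholder, and the script only calls bsub when it is actually installed, so it can be read or run on a machine without LSF.

```shell
# Hypothetical job and log names; adjust to real files before use.
job_cmd="sort foo.txt > sorted.txt"
out_log="sort_job.out"
err_log="sort_job.err"

if ! command -v bsub >/dev/null 2>&1; then
    lsf_status="bsub not available"
    echo "$lsf_status; would run: bsub -o $out_log -e $err_log \"$job_cmd\""
elif bsub -o "$out_log" -e "$err_log" "$job_cmd"; then
    # The command is passed as one quoted string so the > redirection
    # happens inside the job, not in the shell that calls bsub.
    lsf_status="submitted"
    bjobs                            # list your pending and running jobs
else
    lsf_status="submission failed"
fi
```

Without the quotes, the redirection would capture bsub's own message in sorted.txt instead of the sorted data.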
Further Reading
• BaRC: Unix Info
http://iona.wi.mit.edu/bio/education/unix_intro.php
• LSF Cluster (incl. examples)
http://iona.wi.mit.edu/bio/bioinfo/docs/LSF_help.php
• Whitehead IT Scientific Computing Tutorials
http://wi-inside.wi.mit.edu/departments/it/services/scientificcomputing/scitutorials