productivity tips - introduction to linux for bioinformatics

Productivity Joachim Jacob 8 and 15 November 2013

Upload: bits

Post on 07-Dec-2014

DESCRIPTION

Part 6 of the training "Introduction to linux for bioinformatics". Some useful tips to improve your bioinformatics scripts.

TRANSCRIPT

Page 1: Productivity tips - Introduction to linux for bioinformatics

Productivity

Joachim Jacob, 8 and 15 November 2013

Page 2: Productivity tips - Introduction to linux for bioinformatics

Multiple commands

In bash, several commands can be put on one line when separated by ";". They run one after the other, whether or not the previous one succeeded.

$ wget http://homepage.tudelft.nl/19j49/t-SNE_files/tSNE_linux.tar.gz ; tar xvfz tSNE_linux.tar.gz

Page 3: Productivity tips - Introduction to linux for bioinformatics

Multiple commands

Commands on one line can also be separated by && or ||.

&& Only execute the next command if the preceding one finished successfully.

$ curl corz.org/ip && echo

|| (not a pipe!) - The inverse of the above. Only execute the next command if the preceding one did not finish successfully.
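The three separators can be compared side by side. The sketch below uses the built-in commands true and false as stand-ins for a succeeding and a failing command; the variable name result is only there for illustration.

```shell
# 'true' always succeeds (exit status 0), 'false' always fails (exit status 1).
result=""

# ';'  : the second command runs regardless of the first one's exit status.
false ; result="${result}semicolon "

# '&&' : the second command runs only if the first one succeeded.
true  && result="${result}and-ran "
false && result="${result}and-skipped "

# '||' : the second command runs only if the first one failed.
false || result="${result}or-ran "
true  || result="${result}or-skipped "

echo "$result"
```

A typical use is `mkdir results && cd results`, which only changes directory when the directory could actually be created.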

Page 4: Productivity tips - Introduction to linux for bioinformatics

Piping a list of files with xargs

A pipe passes the output of one command to the input of the next.

$ ls | less

Some commands require the file name to be passed as an argument, instead of the content of the file. E.g. this doesn't work:

$ ls | file
Usage: file [-bchikLlNnprsvz0] [--apple] [--mime-encoding] [--mime-type]
            [-e testname] [-F separator] [-f namefile] [-m magicfiles] file ...
       file -C [-m magicfiles]
       file [--help]

Page 5: Productivity tips - Introduction to linux for bioinformatics

Piping a list of files with xargs

xargs passes the output of a command as a list of arguments to another program.

$ ls | xargs file
bin: directory
buddy.sh: Bourne-Again shell script, ASCII text executable
Compression_exercise: directory
Desktop: directory
Documents: directory
Downloads: directory
FastQValidator.0.1.1.tgz: gzip compressed data, from Unix, last modified: Fri Oct 19 16:44:23 2012
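One caveat: xargs splits its input on whitespace by default, so a file name containing a space is passed as two arguments. A common workaround, sketched here under the assumption that find and xargs support NUL separators (-print0 and -0, available in the GNU and BSD versions), is to separate names with NUL characters. The temporary directory and file names are made up for the example, and basename stands in for file.

```shell
# Scratch directory with one file name that contains a space (hypothetical names).
workdir=$(mktemp -d)
touch "$workdir/reads 1.fastq" "$workdir/reads2.fastq"

# find -print0 separates the names with NUL characters, and xargs -0 splits
# only on NUL, so "reads 1.fastq" stays one argument.
# -n 1 runs the command once per file name.
names=$(find "$workdir" -type f -name '*.fastq' -print0 \
        | xargs -0 -n 1 basename \
        | LC_ALL=C sort)

echo "$names"
```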

Page 6: Productivity tips - Introduction to linux for bioinformatics

.bashrc

~/.bashrc is a hidden configuration file for bash in your home directory.

It configures the prompt in your terminal.
It contains aliases to commands.
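As a rough illustration, lines like the following could be appended to ~/.bashrc. The prompt format and the alias are assumptions chosen for this sketch, not taken from the slides; \u, \h and \w are bash prompt escapes for the user name, the host name and the working directory.

```shell
# Hypothetical additions to ~/.bashrc (illustrative values only).

# Prompt of the form user@host:current-directory$
PS1='\u@\h:\w\$ '

# Alias: long listing with human-readable file sizes.
alias ll='ls -lh'
```

Changes take effect in newly opened terminals, or in the current one after running `source ~/.bashrc`.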

Page 7: Productivity tips - Introduction to linux for bioinformatics

alias example

When you enter a command, bash first checks whether the first word on the command line is an alias. If it is, bash replaces that word with the text of the alias before running the command.

You can specify aliases in .bashrc. An example:

Page 8: Productivity tips - Introduction to linux for bioinformatics

Alias example

Some interesting aliases

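alias ll='ls -lh'
alias dirsize="du -sh */"
alias uncom='grep -v -E "^#|^$"'
alias hosts="cat /etc/hosts"
alias dedup="awk '! x[\$0]++'"

Note the \$0 in the last alias: inside double quotes the shell would otherwise replace $0 with its own name when the alias is defined, and awk would never see it.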

Aliases are perfectly suited for storing one-liners: find some at
https://wikis.utexas.edu/display/bioiteam/Scott%27s+list+of+linux+one-liners
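A minimal sketch of defining and using such an alias, here a deduplication one-liner based on awk. The chromosome names and the temporary file are made up for the example. Aliases are normally only expanded in interactive shells, so the first line switches expansion on for the case where the snippet is run as a script.

```shell
# Enable alias expansion when this runs as a (non-interactive) bash script.
shopt -s expand_aliases 2>/dev/null || true

# \$0 keeps the shell from expanding $0 inside the double quotes,
# so awk receives the literal program:  ! x[$0]++
alias dedup="awk '! x[\$0]++'"

# Keep only the first occurrence of every line (illustrative input).
outfile=$(mktemp)
printf 'chr1\nchr2\nchr1\nchr3\nchr2\n' | dedup > "$outfile"

cat "$outfile"
```

Unlike `sort | uniq`, this keeps the lines in their original order.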

Page 10: Productivity tips - Introduction to linux for bioinformatics

Finding stuff: locate

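Extremely quick and convenient: locate

However, it won't find the newest files you created. First you need to update the database by running (usually as root): updatedb

It accepts wildcards. Quote or escape them so the shell does not expand them before locate sees them. Example:
$ locate '*.sam'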

Bonus: How to filter on a certain location?

Page 11: Productivity tips - Introduction to linux for bioinformatics

Finding stuff: find

More elaborate tool to find stuff:
$ find -name alignment.sam

Without any options, find simply lists everything below the current directory. Narrow the search with options such as:
-name : to search on the name of the file
-type : to search for the type: (f)ile, (d)irectory, (l)ink
-perm : to search for the permissions (octal such as 111, or symbolic such as u=rwx)
…

This is the power tool to find stuff.
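The three options can be tried out safely in a scratch directory. This is a sketch with made-up file names; wc -l only counts the matches so that the result is easy to check.

```shell
# Scratch directory with a few hypothetical files.
workdir=$(mktemp -d)
mkdir "$workdir/results"
touch "$workdir/alignment.sam" "$workdir/results/alignment.sam" "$workdir/notes.txt"
touch "$workdir/run.sh"
chmod 755 "$workdir/run.sh"

# -name: match on the file name, in all subdirectories.
by_name=$(find "$workdir" -name 'alignment.sam' | wc -l | tr -d ' ')

# -type d: directories only (the scratch directory itself and results/).
dirs=$(find "$workdir" -type d | wc -l | tr -d ' ')

# -perm 755: files whose permission bits are exactly rwxr-xr-x.
executables=$(find "$workdir" -type f -perm 755 | wc -l | tr -d ' ')

echo "by name: $by_name, directories: $dirs, mode 755: $executables"
```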

Page 13: Productivity tips - Introduction to linux for bioinformatics

Finding stuff: find

The most powerful option of find:
-exec : execute a command on the found entities.

$ find -name \*.gz
./DRR000542_2.fastq.subset.gz
./DRR000542_1.fastq.subset.gz
./DRR000545_2.fastq.subset.gz
./DRR000545_1.fastq.subset.gz
$ find -name \*.gz -exec gunzip {} \;
$ ls
DRR000542_1.fastq.subset DRR000545_1.fastq.subset
DRR000542_2.fastq.subset DRR000545_2.fastq.subset
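In -exec, {} is replaced by the found file name and \; marks the end of the command, which is then run once per file. Ending with + instead passes many file names to a single invocation. A small sketch in a scratch directory, with file names invented for the example:

```shell
# Scratch directory with two small gzip-compressed files (hypothetical names).
workdir=$(mktemp -d)
printf 'read1\n' | gzip > "$workdir/sample_1.fastq.gz"
printf 'read2\n' | gzip > "$workdir/sample_2.fastq.gz"

# '{} \;' : gunzip is started once for every file that was found.
find "$workdir" -name '*.gz' -exec gunzip {} \;

# '{} +'  : all found names are appended to one single 'wc -l' call.
find "$workdir" -name '*.fastq' -exec wc -l {} +

# Count what is left, to confirm that everything was decompressed.
remaining=$(find "$workdir" -name '*.gz' | wc -l | tr -d ' ')
unpacked=$(find "$workdir" -name '*.fastq' | wc -l | tr -d ' ')
```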

Page 14: Productivity tips - Introduction to linux for bioinformatics

Command substitution in bash

In bash, the output of commands can be directly stored in a variable. Put the command between back-ticks.

$ test=`ls -l`
$ echo $test
total 7929624 -rw-rw-r-- 1 joachim joachim 15326 May 10 2013 0538c2b.jpg -rw-rw-r-- 1 joachim joachim 4914797 Nov 8 16:15 18d7alY
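Bash also accepts the form $(command), which does the same as back-ticks and is easier to nest. The sketch below, using made-up file names in a scratch directory, additionally shows why the output above appears on a single line: without double quotes around the variable, the shell splits the value into words and echo joins them with spaces.

```shell
# Scratch directory with two hypothetical files.
workdir=$(mktemp -d)
touch "$workdir/a.sam" "$workdir/b.sam"

# $(...) stores the output of the command in a variable, like back-ticks do.
listing=$(ls "$workdir")

# Unquoted: newlines are lost, everything ends up on one line.
unquoted=$(echo $listing)

# Quoted: the line breaks of the original output are preserved.
quoted=$(echo "$listing")

# Command substitution can wrap a whole pipeline.
n_files=$(ls "$workdir" | wc -l | tr -d ' ')

echo "unquoted: $unquoted"
echo "number of files: $n_files"
```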

Page 15: Productivity tips - Introduction to linux for bioinformatics

Command substitution in bash

A variable can also contain a list. A list contains several entities (e.g. files).

Extracting the first 1,000,000 lines from compressed text files:

for filename in `ls DRR00054*tar.gz`; \
do zcat $filename | head -n 1000000 \
> ${filename%.gz}.subset; done

The output of ls is put in a list. 'for' assigns the file names one after the other to the variable filename. This variable is used in the one-liner zcat | head, and ${filename%.gz} strips the trailing .gz to build the name of the output file.
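A variant of this loop that can be run as-is is sketched below. It makes a few assumptions that are not in the slides: small test files are generated with seq, only 10 lines are kept, gzip -dc is used as an equivalent of zcat, and the loop iterates over a wildcard directly, which also copes with spaces in file names.

```shell
# Scratch directory with two small compressed text files (hypothetical names).
workdir=$(mktemp -d)
seq 1 50 | gzip > "$workdir/sampleA.txt.gz"
seq 1 5  | gzip > "$workdir/sampleB.txt.gz"

n_lines=10   # number of lines to keep per file (illustrative value)

# The wildcard expands to the list of matching files; no 'ls' is needed.
for filename in "$workdir"/sample*.gz; do
    # ${filename%.gz} removes the trailing ".gz" from the name.
    gzip -dc "$filename" | head -n "$n_lines" > "${filename%.gz}.subset"
done

ls "$workdir"
```

A file shorter than the requested number of lines is simply copied in full, as happens here with sampleB.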

Page 16: Productivity tips - Introduction to linux for bioinformatics

Keywords

.bashrc

;

alias

prompt

locate

find

Command substitution

Write in your own words what the terms mean

Page 17: Productivity tips - Introduction to linux for bioinformatics

Break