4.managingdata

Upload: parisha-singh

Post on 03-Jun-2018

224 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/12/2019 4.Managingdata

    1/46

    Managing data

    Joachim Jacob8 and 15 November 2013

  • 8/12/2019 4.Managingdata

    2/46

    Bioinformatics data

    Historically, bioinformatics has alays !sed te"t filesto store data#

    $enban% record

    &'B file e"cer(t

    HMM (rofile

  • 8/12/2019 4.Managingdata

    3/46

    N$) data

    *he N$) machines s(it a lot of data, stored in plaintext files# *hese files are m!lti(le gigabytes in si+e#

  • 8/12/2019 4.Managingdata

    4/46

    *i(s for managing N$) data

    1#hen yo! move the data, do it in its smallest form#

    Compressthe data#

    2# hen yo! !n(ac% the data, leave it here it is#

    Symbolic links(oint to the data in differentfolders#

    3# &rovide eno!gh storage for yo!r data#

    choose yo!r file system type isely

  • 8/12/2019 4.Managingdata

    5/46

    -om(ression. tools in /in!"

    htt(.#lin!"lin%s#comarticle2011022001103-om(ression*ools#html

    nd some more e"ist###

    http://www.linuxlinks.com/article/20110220091109939/CompressionTools.htmlhttp://www.linuxlinks.com/article/20110220091109939/CompressionTools.html
  • 8/12/2019 4.Managingdata

    6/46

    *i(s

    idely !sed com(ression tools.$N +i( 4gzipBloc% )orting com(ression 4bzip2

    *y(ically, com(ression tools or% on one file#Ho to com(ress directories and their contents6

  • 8/12/2019 4.Managingdata

    7/46

    *ar itho!t com(ression

    *ar 4*a(e rchive is a tool for bundling a set of filesor directories into a single archive.*he res!ltingfile is called a tar ball#

    )ynta" to create a tarball.$ tar -cf archive.tar file1 file2

    )ynta" to e"tract.$ tar -xvf /path/to/archive.tar

  • 8/12/2019 4.Managingdata

    8/46

    -om(ression. a ty(ical case

    rchiving and com(ression mostly occ!r together#*he most !sed formats are tar.gzor tar.bz.*hesefiles are the res!lt of two(rocesses#

    Archiving4tar

    Compressing4g+i( or b+i(2

  • 8/12/2019 4.Managingdata

    9/46

    -om(ression. on yo!r des%to(

  • 8/12/2019 4.Managingdata

    10/46

    -om(ression. on yo!r des%to(

  • 8/12/2019 4.Managingdata

    11/46

    -om(ression. on the command line

    Taris the tool for creating #tar archives, b!t it cancom(ress in one go, ith the + or 7 o(tion#

    Creatinga com(ressed tar archive.$ tar cvfz mytararchive.tar.gz docs/$ tar cvfj mytararchive.tar.bz docs/

    Decompressinga com(ressed tar archive$ tar xvfz mytararchive.tar.gz$ tar xvfj mytararchive.tar.bz

    create -om(ression techni!e

    e"tract filesfilesverbose

  • 8/12/2019 4.Managingdata

    12/46

    'e9com(ression

    *o com(ress one or more files.$ gzip [options] file$ bzip2 [options] file

    *o decom(ress one or more files.$ gunzip [options] file(s)$ bunzip2 [options] file(s)

  • 8/12/2019 4.Managingdata

    13/46

    *i(s

    Many com(ression tools on the command line allotoread compressed files4instead of first !n(ac%ingthen reading#

    $ zcat file(s)$ bzcat file(s)

    -om(ression is alays a balancebeteen time and

    com(ression ratio# $+i( is faster, b+i(2 com(ressesharder#

    :f com(ression is im(ortant to yo!. benchmar%;

  • 8/12/2019 4.Managingdata

    14/46

  • 8/12/2019 4.Managingdata

    15/46

    )ymlin%s

    &ay attention# )omething very convenient;

    symbolic link4or symlin% is a filehich (oints to thelocation of the lin%ed9to file# =o! can do anything ith the

    symlin% that yo! can do on the original file# s yo! movethe original file from its location, the symlin% is >dead>#

    ?

    'onloads

    &ro7ects

    @ice

    B!tterfly

    )e!ences

    nnotation

    alignment.sam

  • 8/12/2019 4.Managingdata

    16/46

    )ymlin%s

    *o create a symlin%, move to the folder in here the symlin%m!st be created, and e"ec!te ln#

    ?

    'onloads

    &ro7ects

    @ice

    B!tterfly

    )e!ences

    nnotation

    alignment.sam

    ~/Projects cd Butterfly~/Butterfly ln -s ../Rice/e!uences/alignment.sam"in#$to$alignment.sam

  • 8/12/2019 4.Managingdata

    17/46

    )ymlin%s

    ?

    'onloads

    &ro7ects

    @ice

    B!tterflyink!to!alignment.sam

    )e!ences

    nnotation

    alignment.samalignment.samalignment.sam

    ~/Projects cd Butterfly~/Butterfly ln -s ../Rice/e!uences/alignment.sam"in#$to$alignment.sam~/Butterfly ls -lh "in#$to$alignment.samlr%xr%xr%x & joachim joachim '' (ct )) &'*'+"in#$to$alignment.sam -, ../e!uences/alignment.sam

    *he symlin% is created# =o! can chec% ith ls.*o delete a symlin%, !se unlin#.

  • 8/12/2019 4.Managingdata

    18/46

  • 8/12/2019 4.Managingdata

    19/46

    'is%s and storage

    :f yo! dive into bioinformatics, yo! ill have tomanage dis%s and storage#

    *o ty(es of dis%s

    " solid state disks/o ca(acity, high s(eed, random rites

    " spinning hard disksHigh ca(acity, >normal> s(eed,se!ential rites#

    htt(.en#i%i(edia#orgi%i)olid9stateAdrivehtt(.en#i%i(edia#orgi%iHardAdis%

    http://en.wikipedia.org/wiki/Solid-state_drivehttp://en.wikipedia.org/wiki/Hard_diskhttp://en.wikipedia.org/wiki/Hard_diskhttp://en.wikipedia.org/wiki/Solid-state_drive
  • 8/12/2019 4.Managingdata

    20/46

    dis% is a device

    ia the terminal, sho the dis%s !sing

    $ sudo fdis# -l-sudo ass%ord for joachim*

    0is# /dev/sda* &1.' 2B3 &1'45&'&1&)bytes...

    0is# /dev/sdb* 166+ 7B3 166+&+&+&) bytes...

  • 8/12/2019 4.Managingdata

    21/46

    dis% is divided into (artitions

    dis% can be divided in (arts, called (artitions#

    n internal diskhich r!ns an o(erating system is!s!ally divided in (artitions, one for each f!nctions#

    n external diskis !s!ally not divided in (artitions#

  • 8/12/2019 4.Managingdata

    22/46

    -hec% o!t the dis% !tility tool

  • 8/12/2019 4.Managingdata

    23/46

    *he system dis%

    Name of the dis%

  • 8/12/2019 4.Managingdata

    24/46

    *he system dis%

    Name c!rrently highlighted (artition

  • 8/12/2019 4.Managingdata

    25/46

    *he system dis%

    &lace in the directory str!ct!rehere the (artition can be accessed

  • 8/12/2019 4.Managingdata

    26/46

    n e"am(le of an )B dis%

    9

    &lace in the directory str!ct!rehere the (artition can be accessed

  • 8/12/2019 4.Managingdata

    27/46

    n e"am(le of an )B dis%

    *he )B dis% is >mo!nted> a!tomatically on thedirectory tree !nder #media#

  • 8/12/2019 4.Managingdata

    28/46

    n e"am(le of an )B dis%

    9*his is the ty(e of file systemon the (artition#

    *he (artition is said to be formatted

    in C*32 4in this case#

  • 8/12/2019 4.Managingdata

    29/46

    Cile system formats

    By defa!lt, many )B flash dis%s are formatted in$AT%

    Dther ty(es are N*C), e"tE, FC)#

    $AT%&G ma" E$B files'T$SG ma"im!m (ortability 4also for !se !nder indos(xt)G defa!lt file system in /in!",

    htt(.en#i%i(edia#orgi%iCileAsystemCileAsystemsAandAo(eratingAsystems

    http://en.wikipedia.org/wiki/File_system#File_systems_and_operating_systemshttp://en.wikipedia.org/wiki/File_system#File_systems_and_operating_systems
  • 8/12/2019 4.Managingdata

    30/46

    n e"am(le of an )B dis%

    Cirst !nmo!nt the device#

    Ne"t, choose format the device#

    * &

  • 8/12/2019 4.Managingdata

    31/46

    Cormat dis%s ith dis% !tility

    -hoose the ty(e of filesystem yo! ant to be onthat device#

  • 8/12/2019 4.Managingdata

    32/46

    Cormat dis%s ith dis% !tility

  • 8/12/2019 4.Managingdata

    33/46

    Cormat dis%s ith dis% !tility

    =o! don>t ant to %no all the commands that or%behind the gnome9dis%9!tility for yo!#

    B!t if yo! do.9 mo!nt9 !mo!nt9 fdis%9 m%fs

    =o! can read the man (ages and search for g!ides onthe internet if yo! ant to get to %no these 4o!t ofsco(e for this co!rse#

  • 8/12/2019 4.Managingdata

    34/46

    -hec%ing storage s(ace

    By defa!lt >dis% !sage analy+er>#

  • 8/12/2019 4.Managingdata

    35/46

    -hec%ing storage s(ace

    Bon!s. IE'ir)tat# Not installed by defa!lt#

  • 8/12/2019 4.Managingdata

    36/46

    -hec%ing storage s(ace

    Bon!s. IE'ir)tat# Not installed by defa!lt#

  • 8/12/2019 4.Managingdata

    37/46

    IE'irstat is a I'< (ac%age

    @ehearsal. hat is I'

  • 8/12/2019 4.Managingdata

    38/46

    )(ace left on dis%s ith df

    *o chec% the storage that is !sed on thedifferent dis%s#

    ~/ df -h8ilesystem ize 9sed :vail 9se; 7ounted on/dev/sda& &)2

  • 8/12/2019 4.Managingdata

    39/46

    *he si+e of directories

    *o chec% the si+e of files or directories#~/ du -sh ?

  • 8/12/2019 4.Managingdata

    40/46

    ildcards on the command line

    ildcards are !sed to describe the names offiles#dirs#

    + ,Dn that (osition, the character may be one of the

    characters beteen K,e#g# saniti+sz,ationmatches. sanitisation and sanitization

    -Dn that (osition, any character is alloed#e#g# saniti-ationmatches. sanitisation, sanitiration, ###

    LDn that (osition, any length of string is alloed

    e#g# s matches. san, sdd, sanitisation, sam#alignment,###

  • 8/12/2019 4.Managingdata

    41/46

    ildcards on the command line

    Many tools that re!ire an argument to (oint tofiles or directories acce(t these ildcards#

    ~/ du -sh 0o?

  • 8/12/2019 4.Managingdata

    42/46

    ildcards on the command line

    Many tools that re!ire an argument to (oint tofiles or directories acce(t these ildcards#

    ~/ du -sh 0o?'.4= 0ocuments)42 0o%nloads

  • 8/12/2019 4.Managingdata

    43/46

  • 8/12/2019 4.Managingdata

    44/46

    ildcards on the command line

    Many tools that re!ire an argument to (oint tofiles or directories acce(t these ildcards#

    ~/ ls ?.fast!ARR&'5

  • 8/12/2019 4.Managingdata

    45/46

    Ieyords

    -om(ression

    rchive

    )ymbolic lin%

    mo!nting

    Cile system format

    (artition

    @ec!rsively

    dfd!

    !nlin%

    rite in yo!r on ords hat the terms mean

  • 8/12/2019 4.Managingdata

    46/46

    Brea%