8/12/2019 4.Managingdata
Managing data
Joachim Jacob
8 and 15 November 2013
Bioinformatics data
Historically, bioinformatics has always used text files to store data.
- Genbank record
- PDB file excerpt
- HMM profile
NGS data
NGS machines produce a lot of data, stored in plain-text files. These files are multiple gigabytes in size.
Tips for managing NGS data
1. When you move the data, do it in its smallest form.
   Compress the data.
2. When you unpack the data, leave it where it is.
   Symbolic links point to the data in different folders.
3. Provide enough storage for your data.
   Choose your file system type wisely.
Compression: tools in Linux
http://www.linuxlinks.com/article/20110220091109939/CompressionTools.html
And some more exist...
Tips
Widely used compression tools:
- GNU zip (gzip)
- Block Sorting compression (bzip2)
Typically, compression tools work on one file. How to compress directories and their contents?
Tar without compression
Tar (Tape Archive) is a tool for bundling a set of files or directories into a single archive. The resulting file is called a tarball.
Syntax to create a tarball:
$ tar -cf archive.tar file1 file2
Syntax to extract:
$ tar -xvf /path/to/archive.tar
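A minimal sketch of the create and extract steps, run in a scratch directory. The file names and contents are made up for illustration; the -t (list) and -C (extract into a directory) options are additions not shown on the slide.

```shell
set -eu

workdir=$(mktemp -d)              # scratch directory, removed at the end
cd "$workdir"

printf 'ACGT\n' > file1           # two small example files (hypothetical content)
printf 'TTGA\n' > file2

tar -cf archive.tar file1 file2   # c = create, f = name of the archive file

listing=$(tar -tf archive.tar)    # t = list the contents without extracting
echo "$listing"

mkdir extracted
tar -xf archive.tar -C extracted  # x = extract, -C = extract into this directory
restored=$(cat extracted/file1)
echo "$restored"

cd /
rm -rf "$workdir"
```

Listing with -t before extracting is a cheap way to see what a tarball will write into the current directory.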
Compression: a typical case
Archiving and compression mostly occur together. The most used formats are tar.gz and tar.bz2. These files are the result of two processes:
- Archiving (tar)
- Compressing (gzip or bzip2)
Compression: on your desktop
Compression: on the command line
Tar is the tool for creating .tar archives, but it can compress in one go, with the z or j option.
Creating a compressed tar archive:
$ tar cvfz mytararchive.tar.gz docs/
$ tar cvfj mytararchive.tar.bz2 docs/
Decompressing a compressed tar archive:
$ tar xvfz mytararchive.tar.gz
$ tar xvfj mytararchive.tar.bz2
c: create, x: extract, v: verbose, f: archive file, z or j: compression technique (gzip or bzip2)
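The one-step archive and compress cycle can be sketched as follows. The docs/ directory and its content are hypothetical, and only the gzip variant (z) is run here; j should behave the same way where bzip2 is installed.

```shell
set -eu

workdir=$(mktemp -d)
cd "$workdir"

mkdir docs
printf 'notes\n' > docs/readme.txt   # hypothetical example file

tar -czf docs.tar.gz docs/           # c = create, z = compress with gzip, f = archive name
contents=$(tar -tzf docs.tar.gz)     # list the compressed archive without unpacking it
echo "$contents"

mkdir restore
tar -xzf docs.tar.gz -C restore      # x = extract, -C = target directory
restored=$(cat restore/docs/readme.txt)

cd /
rm -rf "$workdir"
```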
(De)compression
To compress one or more files:
$ gzip [options] file
$ bzip2 [options] file
To decompress one or more files:
$ gunzip [options] file(s)
$ bunzip2 [options] file(s)
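A small round trip with gzip, as a sketch. The file reads.txt and its repetitive content are invented for the example; note that gzip replaces the original file with the .gz file, and gunzip does the reverse.

```shell
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Write 1000 identical lines: repetitive data compresses well
i=0
while [ "$i" -lt 1000 ]; do
    echo "ACGTACGTACGTACGT"
    i=$((i + 1))
done > reads.txt
original_size=$(($(wc -c < reads.txt)))

gzip reads.txt                       # reads.txt is replaced by reads.txt.gz
compressed_size=$(($(wc -c < reads.txt.gz)))

gunzip reads.txt.gz                  # reads.txt.gz is replaced by reads.txt
restored_size=$(($(wc -c < reads.txt)))

echo "original: $original_size bytes, compressed: $compressed_size bytes"

cd /
rm -rf "$workdir"
```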
Tips
Many compression tools on the command line let you read compressed files directly (instead of first unpacking, then reading).
$ zcat file(s)
$ bzcat file(s)
Compression is always a balance between time and compression ratio. Gzip is faster, bzip2 compresses harder.
If compression is important to you: benchmark!
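A rough sketch of both tips, assuming a GNU/Linux system: reading a compressed file with zcat, and a very coarse benchmark loop. The test file is synthetic and tiny, and timing uses whole seconds only, so use one of your own data files for meaningful numbers. Tools that are not installed are skipped.

```shell
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Synthetic test data (hypothetical content)
i=0
while [ "$i" -lt 5000 ]; do
    echo "read_$i ACGTTGCAACGT"
    i=$((i + 1))
done > reads.txt
original_size=$(($(wc -c < reads.txt)))

# Read a compressed file without unpacking it on disk
gzip -c reads.txt > reads.txt.gz     # -c writes to stdout and keeps the original
line_count=$(($(zcat reads.txt.gz | wc -l)))
echo "lines in compressed file: $line_count"

# Coarse benchmark: size and elapsed seconds per tool
for tool in gzip bzip2; do
    command -v "$tool" > /dev/null || continue   # skip tools that are not installed
    start=$(date +%s)
    "$tool" -c reads.txt > "reads.$tool"
    end=$(date +%s)
    echo "$tool: $(($(wc -c < "reads.$tool"))) bytes in $((end - start)) s"
done
gzip_size=$(($(wc -c < reads.gzip)))

cd /
rm -rf "$workdir"
```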
Symlinks
Pay attention. Something very convenient!
A symbolic link (or symlink) is a file which points to the location of the linked-to file. You can do anything with the symlink that you can do on the original file. If you move the original file from its location, the symlink is 'dead'.
~
  Downloads
  Projects
    Rice
      Sequences
        alignment.sam
      Annotation
    Butterfly
Symlinks
To create a symlink, move to the folder where the symlink must be created, and execute ln.
~/Projects $ cd Butterfly
~/Butterfly $ ln -s ../Rice/Sequences/alignment.sam Link_to_alignment.sam
Symlinks
~
  Downloads
  Projects
    Rice
      Sequences
        alignment.sam
      Annotation
    Butterfly
      Link_to_alignment.sam
~/Butterfly $ ls -lh Link_to_alignment.sam
lrwxrwxrwx 1 joachim joachim ... Link_to_alignment.sam -> ../Rice/Sequences/alignment.sam
The symlink is created. You can check with ls. To delete a symlink, use unlink.
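The whole symlink life cycle can be sketched in a scratch directory that mimics the folder layout of the slides. The file content is a placeholder, and readlink (which prints the path stored in a link) is an addition not mentioned on the slides.

```shell
set -eu

workdir=$(mktemp -d)
mkdir -p "$workdir/Projects/Rice/Sequences" "$workdir/Projects/Butterfly"
echo "example alignment" > "$workdir/Projects/Rice/Sequences/alignment.sam"

cd "$workdir/Projects/Butterfly"
ln -s ../Rice/Sequences/alignment.sam Link_to_alignment.sam

target=$(readlink Link_to_alignment.sam)   # the path stored inside the link
content=$(cat Link_to_alignment.sam)       # reading goes through to the original file

# Moving the original file leaves a dead link behind
mv ../Rice/Sequences/alignment.sam ../Rice/alignment.sam
if [ -e Link_to_alignment.sam ]; then link_state=alive; else link_state=dead; fi

# unlink removes the link only, never the file it pointed to
unlink Link_to_alignment.sam
if [ -e ../Rice/alignment.sam ]; then original_state=kept; else original_state=lost; fi

echo "target=$target link=$link_state original=$original_state"

cd /
rm -rf "$workdir"
```

Because the link stores a relative path, it is resolved from the folder that contains the link, which is why the slides create it from inside Butterfly.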
Disks and storage
If you dive into bioinformatics, you will have to manage disks and storage.
Two types of disks:
- solid state disks: low capacity, high speed, random writes
- spinning hard disks: high capacity, 'normal' speed, sequential writes
http://en.wikipedia.org/wiki/Solid-state_drive
http://en.wikipedia.org/wiki/Hard_disk
A disk is a device
Via the terminal, show the disks using
$ sudo fdisk -l
[sudo] password for joachim:
Disk /dev/sda: ... GB, ... bytes
...
Disk /dev/sdb: ... MB, ... bytes
...
A disk is divided into partitions
A disk can be divided into parts, called partitions.
An internal disk which runs an operating system is usually divided into partitions, one for each function.
An external disk is usually not divided into partitions.
Check out the disk utility tool
The system disk
Name of the disk
The system disk
Name of the currently highlighted partition
The system disk
Place in the directory structure where the partition can be accessed
An example of a USB disk
- Place in the directory structure where the partition can be accessed
An example of a USB disk
The USB disk is 'mounted' automatically on the directory tree under /media.
An example of a USB disk
- This is the type of file system on the partition.
The partition is said to be formatted in FAT32 (in this case).
File system formats
By default, many USB flash disks are formatted in FAT32.
Other types are NTFS, ext4, ...
FAT32: max 4 GB per file
NTFS: maximum portability (also for use under Windows)
Ext4: default file system in many Linux distributions
http://en.wikipedia.org/wiki/File_system#File_systems_and_operating_systems
An example of a USB disk
First unmount the device.
Next, choose to format the device.
Format disks with disk utility
Choose the type of file system you want to be on that device.
Format disks with disk utility
You don't want to know all the commands that work behind the gnome-disk-utility for you.
But if you do:
- mount
- umount
- fdisk
- mkfs
You can read the man pages and search for guides on the internet if you want to get to know these (out of scope for this course).
Checking storage space
Installed by default: 'Disk Usage Analyzer'.
Checking storage space
Bonus: K4DirStat. Not installed by default.
K4DirStat is a KDE package
Rehearsal: what is KDE?
Space left on disks with df
To check the storage that is used on the different disks:
~/ $ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 ...
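As a sketch, df can also be limited to the file system that holds one directory, and its output can be parsed in a script. The -P option (POSIX output format) and -k (sizes in KiB) are additions not shown on the slide; column 4 of that format is the available space.

```shell
set -eu

# Human-readable report for the file system holding the current directory
df -h .

# Script-friendly variant: POSIX format, sizes in KiB, second line, fourth column
avail_kb=$(df -Pk . | awk 'NR == 2 { print $4 }')
echo "Available: $avail_kb KiB"
```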
The size of directories
To check the size of files or directories:
~/ $ du -sh *
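A small sketch of du in a scratch directory with one tiny and one larger folder (both made up for the example). Sorting the output of du -sk, which reports sizes in KiB, is a common way to find the biggest entry; that step is an addition to the slide.

```shell
set -eu

workdir=$(mktemp -d)
cd "$workdir"

mkdir small big
printf 'x' > small/a.txt
yes ACGTACGT | head -n 200000 > big/b.txt   # roughly 1.8 MB of text

du -sh *                                    # s = summarise per argument, h = human-readable

# Largest entry: sizes in KiB, sorted numerically, last line, name column
largest=$(du -sk * | sort -n | tail -n 1 | cut -f 2)
echo "largest entry: $largest"

cd /
rm -rf "$workdir"
```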
Wildcards on the command line
Wildcards are used to describe the names of files and directories.
[ ]  On that position, the character may be one of the characters between [ ], e.g. saniti[sz]ation matches: sanitisation and sanitization
?  On that position, any single character is allowed, e.g. saniti?ation matches: sanitisation, sanitiration, ...
*  On that position, a string of any length is allowed, e.g. s* matches: san, sdd, sanitisation, sam.alignment, ...
Wildcards on the command line
Many tools that require an argument to point to files or directories accept these wildcards.
~/ $ du -sh Do*
... Documents
... Downloads
Wildcards on the command line
Many tools that require an argument to point to files or directories accept these wildcards.
~/ $ ls *.fastq
...
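The three wildcards can be tried out safely on empty files, as in this sketch. The file names reuse the examples from the slides plus one non-matching name (other); the shell expands each pattern to the matching names before the command runs.

```shell
set -eu

workdir=$(mktemp -d)
cd "$workdir"
touch sanitisation sanitization sanitiration san sam.alignment other

brackets=$(echo saniti[sz]ation)   # [sz]: one of the listed characters
question=$(echo saniti?ation)      # ?: exactly one arbitrary character
star_count=$(($(ls -d s* | wc -l)))  # *: any string, here every name starting with s

echo "[sz] matches: $brackets"
echo "?    matches: $question"
echo "s*   matches $star_count names"

cd /
rm -rf "$workdir"
```

Using echo with a pattern is a harmless way to preview what a wildcard will match before passing it to a command such as rm.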
Keywords
Compression
Archive
Symbolic link
mounting
File system format
partition
Recursively
df
du
unlink
Write in your own words what the terms mean.
Break