Post on 25-Dec-2015

1

Powering up your graduate experience

A survey of computational tools and approaches for biologists

https://onish.web.unc.edu/firstyeargrads/

Erin Osborne Nishimura

2

Summer Workshop Series

3

Powering up your graduate experience

• What types of computational tools and approaches are available?

• How can I use some of the resources available at UNC?
– Hands-on introduction to Killdevil

• Benefits of honing your computational prowess

• How can I learn the next steps?

4

WHAT TYPES OF COMPUTATIONAL TOOLS AND APPROACHES ARE AVAILABLE?

5

What are we talking about when we talk about computational biology?

6

Computational thinking is generalizable outside of the field of genomics

• Genomics
• Microscopy image processing
• Structural biology processing
• Melding and merging datasets
• Renaming files in batch
• Grabbing specific information out of files
• Tailor-made specialized scripts
• Automation

7

What hardware do we use to get this done?

• Brain
• Physical notes & notebooks
• Computers
– Local computers
– Virtual computers
– Clusters
– Cloud
– Software

8

Computer Resources: Get a physical computer

• Computers
– Local computers
– Virtual computers
– Clusters
– Cloud
– Software

UNC Student Store sells computers: http://store.unc.edu/
Buy a computer at a discount. Get lifetime support from UNC.

9

Computer Resources: Use a virtual computer

• Computers
– Local computers
– Virtual computers
– Clusters
– Cloud
– Software

Research Computing hosts the Virtual Computing Lab: https://vcl.unc.edu/index.php?mode=selectauth
You can virtually use Microsoft and Linux computers and install tailored software for individual or group use.

10

Computer Resources: Use a high-throughput Linux cluster

• Computers
– Local computers
– Virtual computers
– Clusters
– Cloud
– Software

UNC ITS (Information Technology Services) manages two main computational clusters: Killdevil (new) and Kure (old)
http://its.unc.edu/service/compute-servers-clusters/

11

Computer Resources: Get on the cloud

• Computers
– Local computers
– Virtual computers
– Clusters
– Cloud
– Software

Google: https://cloud.google.com/genomics/what-is-google-genomics
Amazon: http://aws.amazon.com/health/life-sciences/

12

Software is available to buy, borrow or obtain for free

Bioinformatics has site licenses available for checkout. Share spendy software.
http://bioinformatics.unc.edu/software/

ITS has a lot of software for free or for purchase at a discounted price (EndNote, discounted; MATLAB, free; SecureShell, free).
http://software.sites.unc.edu/software/

ITS Virtual Lab: Virtually use spendy software for free (latest Adobe, Mathematica, SPSS, etc.).
https://virtuallab.unc.edu/Citrix/ITSLabsSFWeb/

Kure and Killdevil come with loadable modules

13

Becoming proficient with new computing environments and languages

• Linux – An operating system, an environment, a lifestyle

• Shell scripting – Great for pipelines

• Python – A general-purpose, high-level programming language. Highly readable and writable.

• Perl – A general-purpose, high-level programming language. Great with text files.

• JavaScript – A general-purpose, high-level programming language. Specialized for web applications and apps; commonly used in plugins (e.g., ImageJ).

• R – A high-level programming language and software environment specialized for statistics and large data management.

• MATLAB – A computing environment and programming language. Costs money.

• MySQL – Database organization

• Others?

[Side labels grouping the list on this slide: LINUX, PROGRAMMING LANGUAGES, MATH]

14

A HANDS-ON INTRO TO KILLDEVIL

How can I use some of these resources?

15

Computer Resources

• Computers
– Local computers
– Clusters
– Cloud
– Software

UNC ITS (Information Technology Services) manages two main computational clusters: Killdevil (new) and Kure (old)
http://its.unc.edu/service/compute-servers-clusters/

16

What is Killdevil?

17

No seriously, what is Killdevil?

• A high performance computer cluster

• Linux operating system
• 1 login node
• 774 compute nodes
– 48–96 GB memory per node
– 12–16 CPU cores per node
• 2 large memory nodes (1 TB)
• 12 Graphics Processor (GPU) nodes
• File systems for storage

18

No seriously, what is Killdevil?

19

Getting onto Killdevil

• Mac OS & Linux machines:
– Link to Killdevil through “Terminal”
– Open “Terminal” (in Applications -> Utilities)
– Type this: ssh <yourOnyen>@killdevil.unc.edu
– Add password when prompted

• PC:
– Open SSH Secure Shell Client
– Click on “Quick Connect”
– Hostname = killdevil.unc.edu
– Username = <yourOnyen>
– Port Number = 22
– Add password when prompted

$ ssh erinosb@killdevil.unc.edu

20

Getting onto Killdevil -- demo

21

Keeping a computational notebook

22

Navigating UNIX: getting oriented

• Commands
• Manuals
• Your first two commands:
– whoami
– date
• Getting help with manuals:
– man <command>
– “spacebar” to scroll
– Type “q” to exit

$ whoami
erinosb

$ date
Thu Apr 9 13:24:09 EDT 2015

$ man whoami
q

23

Navigating UNIX – paths and directories

• pwd – Print Working Directory

• cd – Change Directory: cd <directoryname>

• ls – List Contents

$ pwd

$ cd /nas02/home/

$ ls
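The three commands above can be tried without touching any real files. The following is only a minimal sketch, assuming a throwaway directory created with mktemp; the names `practice` and `notes.txt` are hypothetical and not part of the workshop materials.

```shell
# Build a scratch area so nothing in your real home directory is touched.
scratch=$(mktemp -d)
mkdir "$scratch/practice"
touch "$scratch/practice/notes.txt"

cd "$scratch/practice"   # cd: change into the new directory
here=$(pwd)              # pwd: report where we are now
contents=$(ls)           # ls: list what is in this directory

echo "Working directory: $here"
echo "Contents: $contents"
```

Saving the output of pwd and ls in variables is just a convenience for the sketch; at the prompt you would simply type each command and read the result.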

24

The file structure

• Directories and sub-directories are “folders”
• Some important directories on Killdevil (http://help.unc.edu/help/getting-started-on-killdevil/#P63_6342):
– ms/
– netscr/
– ~
• Making a new directory: mkdir <directoryname>
• Removing a directory: rm -ri <directoryname>

$ mkdir 1_courses
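Making and removing directories can be rehearsed in a scratch location first. This is a hedged sketch rather than part of the original slides: the subdirectory names `week1` and `data` are hypothetical, and it uses `rmdir`, which only removes empty directories, as a cautious alternative to the interactive `rm -ri`.

```shell
# Work inside a throwaway directory (an assumption, for safe practice).
scratch=$(mktemp -d)
cd "$scratch"

mkdir 1_courses                 # make one directory
mkdir -p 1_courses/week1/data   # -p also creates any missing parent directories
rmdir 1_courses/week1/data      # rmdir refuses to delete a directory that is not empty

remaining=$(ls 1_courses)
echo "Left inside 1_courses: $remaining"
```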

25

A few key tips and tricks

• Naming conventions
• Auto complete with TAB
• What if I get stuck?
– CTRL+C
• Get me out of here
– Q
– CTRL+C
– CTRL+D
– quit
– logout
– logoff
– logout()
– bye
– quit()
– q()
– exit
• What if I need help?
– man <command>
– <command> -h
– <command> --help
– <command>
• GOOGLE it!
– Use language name in search

26

Exercise 1: Move up and down paths

A) Type the following command:
$ cd
Where are you right now? Write down this exact location in your notebook.

B) Enter the following command:
$ cd /
Now where are you?

C) List the contents of this directory. Do you see the directory nas02? Change into that directory.

D) Use cd and ls to navigate down the file structure back to your original location in step A.

E) Type this command:
$ cd ..
Now where are you? What did cd .. do?

F) Use cd .. to go up multiple folders. Type ‘cd ..’ followed by ‘ls’ and ‘pwd’. Keep repeating these steps until you go back up to /

G) Now type this command. Where are you now?
$ cd <the-exact-location-path-you-wrote-in-step-A>

H) Now type:
$ cd -
Where are you now?

I) Navigate back down to your home directory through each directory in your path. This time, try typing <TAB> after typing the first three letters of each directory name to initiate autocomplete. What happens?

27

Making and Removing files

• Making a file: touch <filename>
• Removing a file: rm -i <filename>

-i is an option. Commands follow this general pattern:

$ command [-OPTIONS] <requiredfiles>

$ touch testfile1.txt

$ rm -i testfile1.txt
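To see what touch and rm actually do, here is a small sketch run in a scratch directory (an assumption for safety). It uses `rm -f` only because a script cannot answer the `-i` prompt; at the keyboard, `rm -i` remains the safer habit.

```shell
scratch=$(mktemp -d)
cd "$scratch"

touch testfile1.txt                          # creates an empty file
size=$(wc -c < testfile1.txt | tr -d ' ')    # a freshly touched file holds 0 bytes
echo "Size in bytes: $size"

rm -f testfile1.txt                          # -f skips the confirmation prompt
if [ -e testfile1.txt ]; then
    status="still there"
else
    status="removed"
fi
echo "testfile1.txt is $status"
```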

28

Exercise 2: Creating a directory tree

1) Make a directory structure in your home directory that you can use. If you already have a home directory structure you like, you can skip this and just create the course directory (150413_FirstYearGradCourse) and the subdirectories.

$ cd #This will move you into your home directory.

Try this command:

$ tree

What do you see?

2) Use mkdir to create directories and subdirectories within your home directory so that tree will generate a “map” of your files that looks like this:

.
|-- 150413_FirstYearGradCourse
    |-- exercise03
    |-- exercise05

3) Put a file labeled 0_exercise03_README.txt in the exercise03 directory.

4) Navigate into the directory exercise03. Inside this directory, make a new directory called ‘exercise10’. You realize that you don’t actually want the exercise10 directory as a subdirectory within exercise03. You would prefer it to be listed as a directory within 150413_FirstYearGradCourse, just like exercise03. Figure out how to move the directory “exercise10” out of the directory “exercise03”. You will use the command ‘mv’ to do this.

Hint: Try using Google to search for information about ‘mv’.
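If the hint leaves you stuck, the general shape of mv can be seen in a sketch that deliberately uses different, hypothetical directory names (`project`, `inner`, `misplaced`, `renamed`) so the exercise itself is still yours to solve.

```shell
scratch=$(mktemp -d)
cd "$scratch"
mkdir -p project/inner/misplaced

# mv <source> <destination>: when the destination is an existing directory,
# the source is moved inside it.
mv project/inner/misplaced project/

# When the destination does not exist yet, mv renames the source instead.
mv project/misplaced project/renamed

layout=$(ls project)
echo "$layout"
```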

29

Getting files onto and off of Kure

• SFTP clients
– Cyberduck, Mozilla – SSH/SFTP
• Set it up, then drag and drop
• scp: scp <filename> <onyen>@killdevil.unc.edu:/path/

$ pwd
mylaptop/erin/

$ scp TFs.tar.gz erinosb@killdevil.unc.edu:/nas02/home/e/r/erinosb/1_courses/150413_firstYearGrads/exercise03/

30

Decompressing directories with tar and gzip

• tar
• To “extract” a directory from a compressed archive: tar -zxvf <dir.tar.gz>
• To “create” a compressed archive of a directory: tar -zcvf <dir.tar.gz> <dir>
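A round trip makes the tar flags easier to remember. The sketch below is illustrative only: `demo_dir`, `example.csv` and `unpacked` are hypothetical names, and the `v` (verbose) flag is left out so the output stays short.

```shell
scratch=$(mktemp -d)
cd "$scratch"
mkdir demo_dir
echo "gene,family" > demo_dir/example.csv

tar -zcf demo_dir.tar.gz demo_dir      # c = create, z = gzip, f = archive file name
listing=$(tar -ztf demo_dir.tar.gz)    # t = list the contents without extracting

mkdir unpacked
tar -zxf demo_dir.tar.gz -C unpacked   # x = extract; -C chooses where to unpack
restored=$(cat unpacked/demo_dir/example.csv)

echo "$listing"
echo "Restored line: $restored"
```

Listing an archive with `-t` before extracting is a cheap way to check what it will drop into your directory.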

31

Exercise 3: Transcription Factors

A) Download a zipped directory of transcription factors called “TFs.tar.gz” from https://onish.web.unc.edu/firstyeargrads/ and save it somewhere on your local computer.

B) Upload the file to Killdevil, into the ~/1_courses/150413_FirstYearGrads/exercise03 directory.

MAC users – Use scp or an SFTP client.

PC users – Use Secure Shell or an SFTP client.

C) Did you put your zipped directory in the wrong place? See if you can figure out how to use the mv (move) command to put it in the right place.

D) Expand your zipped TF directory using tar (see the slide on tar and gzip).

E) See what just happened using ls. Now navigate into your directory using cd and see what is in there. What’s inside?

F) more, less and head allow you to look inside files. Use the man pages of these commands to figure out what they do and how they work. Peek into one or two of the enclosed csv files. What do they look like?

G) Now try this command:

$ wc Athaliana_TFs.csv

What do you see? What does wc do?

H) Chain commands together. There are many ways to string multiple commands together and execute them from a single line. What do the following commands do?

$ wc Athaliana_TFs.csv; wc Celegans_TFs.csv; wc Dmelanogaster_TFs.csv
$ head Athaliana_TFs.csv | wc

I) Count all the .csv files using:

$ wc *.csv

J) Start your first program. Make an empty file called “wordCounter.sh” using touch.
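The chaining ideas in steps H and I can be explored on small made-up files before trying the real data. This is a sketch under that assumption; `first.csv` and `second.csv` are hypothetical stand-ins, not files from TFs.tar.gz.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf 'a,1\nb,2\nc,3\n' > first.csv
printf 'd,4\ne,5\n' > second.csv

# ";" runs the commands one after another; each prints its own result.
wc -l first.csv; wc -l second.csv

# "|" feeds the output of the command on the left into the command on the right.
piped=$(head -n 2 first.csv | wc -l | tr -d ' ')
echo "Lines that reached wc through the pipe: $piped"

# "*" expands to every matching file name; cat joins the files before counting.
total=$(cat *.csv | wc -l | tr -d ' ')
echo "Lines across all .csv files: $total"
```

The `tr -d ' '` step only strips padding that some versions of wc add around the number.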

32

Our first program

• bash is a commonly used Linux shell
• Writing things in bash is called shell scripting.
• By convention, all bash scripts:
– Contain the file extension .sh
– Start with a shebang: #!/bin/bash
– Have pseudocode
– Are tested every 2–3 lines
• Are executed with:

$ bash <scriptname.sh>

33

Our first program: wordCounter.sh

#!/bin/bash

# Count words in files using "wc":
echo -e "Word count Arabidopsis TF file: "
wc TFs/Athaliana_TFs.csv

To ‘edit’ or write this script, move wordCounter.sh to and from your local computer using Secure SFTP (PC).
OR
Interactively edit the document by navigating to this file in Cyberduck, right-clicking on it, and selecting “edit with”.

34

To execute our first program

1) Navigate to the same folder that has your program in it. Use ‘ls’ to double check you can see the program wordCounter.sh

2) Type:

$ bash wordCounter.sh
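If moving the file back and forth by SFTP feels cumbersome, a script can also be written directly at the command line with a here-document. This is one possible approach, not the method used in the workshop, and the three-line data file below is a made-up stand-in for the real Athaliana_TFs.csv.

```shell
scratch=$(mktemp -d)
cd "$scratch"
mkdir TFs
printf 'TF1\nTF2\nTF3\n' > TFs/Athaliana_TFs.csv   # stand-in data, not the real file

# Everything between <<'EOF' and EOF is written to wordCounter.sh as-is.
cat > wordCounter.sh <<'EOF'
#!/bin/bash
# Count lines, words and characters in a file using "wc":
echo "Word count Arabidopsis TF file:"
wc TFs/Athaliana_TFs.csv
EOF

output=$(bash wordCounter.sh)
echo "$output"
```

Quoting the first `EOF` keeps the shell from expanding anything inside the script text while it is being written.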

35

Homework: Writing your first script

• We don’t necessarily want to count the words in every .csv file we have. Let’s say we want just our three favorite species. Let’s make a wordCounter.sh program to do just this.
• Use your SFTP client to copy wordCounter.sh onto your local computer or open and edit it interactively.
• Type in the shebang on the top line.
• Write an entry in your computational notebook that you started this code. Include today’s date and what you want this code to do.
• Type in some pseudocode indicating what you want to do. Make sure there is a “#” in front of any pseudocode so it isn’t interpreted as a command.
• Type in a quick shell script to count words in three of your specific files.
• Test your code.
• Does it work?
• Now try adding a line into your code (anywhere) that says the following:
echo -e "Wow, I love learning shell scripting."
• What happened?
• Use echo to personalize your output message.
• Now open that README file you made back in Exercise #2. Enter what your script is and what it does.

36

Extra: An example shell script #2: TF_counter.sh

#!/bin/bash

# Report the number of transcription factors in three different species as
# per the cis-BP website.
# The number of lines in each of these files = the number of transcription
# factors.

echo -e "\nThe number of Arabidopsis transcription factors is: \t"
wc TFs/Athaliana_TFs.csv | awk '{print $1}'
echo -e "\nThe number of Human transcription factors is: \t"
wc TFs/Hsapiens_TFs.csv | awk '{print $1}'
echo -e "\nThe number of Worm transcription factors is: \t"
wc TFs/Celegans_TFs.csv | awk '{print $1}'
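The `wc ... | awk '{print $1}'` idiom works because the first column wc prints is the line count. As a hedged aside, the same number can be obtained with `wc -l`; the sketch below compares the two on a hypothetical four-line file, `demo_TFs.csv`.

```shell
scratch=$(mktemp -d)
printf 'TF1\nTF2\nTF3\nTF4\n' > "$scratch/demo_TFs.csv"

# Approach from the script: wc prints lines, words, bytes and the file name;
# awk keeps only the first column (the line count).
via_awk=$(wc "$scratch/demo_TFs.csv" | awk '{print $1}')

# Alternative: ask wc for lines only and read from standard input,
# so no file name is printed alongside the number.
via_flag=$(wc -l < "$scratch/demo_TFs.csv" | tr -d ' ')

echo "awk approach: $via_awk, -l approach: $via_flag"
```

Either form is fine; the awk version generalizes more readily when you later need a different column.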

37

Extra: An example of another shell script, loopScript.sh

#!/bin/bash

# A word counting script that loops through multiple entries:

for i in "$@"
do
    wc "${i}"
done

Execute by typing this into the command line:
$ bash loopScript.sh TFs/*csv
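One detail worth seeing in action is how a loop over script arguments copes with file names that contain spaces. The sketch below wraps the loop in a function called `count_lines`, a hypothetical name used only for illustration, and relies on the quoted form `"$@"`, which keeps each argument intact.

```shell
# count_lines is a hypothetical helper, not part of the workshop scripts.
count_lines() {
    for f in "$@"; do                    # "$@" keeps each argument whole, spaces included
        n=$(wc -l < "$f" | tr -d ' ')
        echo "$f: $n"
    done
}

scratch=$(mktemp -d)
printf 'x\ny\n' > "$scratch/plain.csv"
printf 'z\n' > "$scratch/with space.csv"

report=$(cd "$scratch" && count_lines *.csv)
echo "$report"
```

With an unquoted `$*`, the name `with space.csv` would be split into two arguments and wc would complain that neither file exists.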

38

A few follow up comments on Killdevil

• If your command takes more than 5 seconds to execute, cancel it!
– <CTRL+C>
– Learn about LSF (bsub) for computationally intensive jobs.
– http://help.unc.edu/help/getting-started-on-killdevil/

• If your project is big, get space!
– The ~ (home) directory is only 12 GB
– Light users use /netscr for temporary jobs and /ms for storage.
– Files in /netscr are deleted after 21 days!
– For anything bigger, get your own dedicated space.
– http://help.unc.edu/help/getting-started-on-killdevil/

• Off campus? VPN into UNC first before logging on.

• Modules are available on Killdevil and Kure.

39

Load Sharing Facility (LSF)

40

The benefits of honing your computational prowess

• Publishing
• Fast & efficient
• Reproducible
• Collaborative
• Employable

41

PROPELLING YOURSELF THROUGH THE NEXT STEPS

How do I learn?

42

Oft encountered languages and environments

• Linux – An operating system, an environment, a lifestyle

• Shell scripting – Great for pipelines

• Python – A general-purpose, high-level programming language. Highly readable and writable.

• Perl – A general-purpose, high-level programming language. Great with text files.

• JavaScript – A general-purpose, high-level programming language. Specialized for web applications and apps; commonly used in plugins (e.g., ImageJ).

• R – A high-level programming language and software environment specialized for statistics and large data management.

• MATLAB – A computing environment and programming language. Costs money.

• MySQL – Database organization

• Others?

[Side labels grouping the list on this slide: LINUX, PROGRAMMING LANGUAGES, MATH]

43

UNC Workshops & Courses

• IT Workshops
– Linux, Killdevil & Kure, Python, R, SciPy, Tarheel Linux
– http://reg.abcsignup.com/view/view_month.aspx?as=52&wp=887&aid=UNC-ITS

• Basic Bioinformatics Tools Workshops (Hemant Kelkar)
– Linux, Killdevil, LSF, BLAST, Genomics, RNA-seq, PyMol, ENSEMBL
– http://guides.lib.unc.edu/c.php?g=8359&p=43018
– YouTube!

• RNA-seq Workshop
• Summer Workshop
• Bioinformatics and Computational Biology Series
– http://www.bcb.unc.edu/training.htm#coursework

• Learn R videos at the Odum Institute
– http://www.odum.unc.edu/odum/contentSubpage.jsp?nodeid=670

44

MOOCs

• Coursera
– Data Science (9 x 4 week modules, starts May 4): https://www.coursera.org/specialization/jhudatascience/1?utm_medium=courseDescripTop
– Python (10 weeks, starts June 1): https://www.coursera.org/course/pythonlearn
– Statistics
• EdX
• Codecademy
• Udacity
• Khan Academy
• Software Carpentry

46

Travel and Learn

• Cold Spring Harbor Labs Training courses
– http://meetings.cshl.edu/courses.html

• Michigan State University (MSU) Summer Workshop
– http://ged.msu.edu/angus/tutorials-2013/
– http://bioinformatics.msu.edu/ngs-summer-course-2015

• Software Carpentry
– http://software-carpentry.org/workshops/index.html

47

A public service announcement on backups

“If your data does not exist in triplicate, spanning at least two tectonic plates, it does not exist”

-- Greg Wilson

48

What do you want to get out of your graduate experience?

49

Extra Slides are after here

50

Common reasons why interested students do not continue to use computational tools

• “I took a few linux workshops, but I couldn’t get anything to work on my computer.”

• “I see the utility of using computational tools but it is impossible to keep track of what I’m doing and I feel disorganized.”

• “Everything takes so long to learn. It seems really inefficient.”

• “I took a few linux workshops but then I didn’t use it and I forgot everything I learned.”

51

There are many programming languages to learn

http://carlcheo.com/startcoding
