Powering up your graduate experience A survey of computational tools and approaches for biologists https://onish.web.unc.edu/firstyeargrads/ Erin Osborne Nishimura 1

Upload: ira-walker

Post on 25-Dec-2015

Page 1: Powering up your graduate experience A survey of computational tools and approaches for biologists  Erin Osborne

1

Powering up your graduate experience

A survey of computational tools and approaches for biologists

https://onish.web.unc.edu/firstyeargrads/

Erin Osborne Nishimura

2

Summer Workshop Series

3

Powering up your graduate experience

• What types of computational tools and approaches are available?

• How can I use some of the resources available at UNC?
  – Hands-on introduction to Killdevil

• Benefits of honing your computational prowess

• How can I learn the next steps?

4

WHAT TYPES OF COMPUTATIONAL TOOLS AND APPROACHES ARE AVAILABLE?

5

What are we talking about when we talk about computational biology?

6

Computational thinking is generalizable outside of the field of genomics

• Genomics
• Microscopy image processing
• Structural biology processing

• Melding and merging datasets
• Re-naming files in batch
• Grabbing specific information out of files

• Tailor-made specialized scripts
• Automation

7

What hardware do we use to get this done?

• Brain
• Physical notes & notebooks
• Computers
  – Local computers
  – Virtual computers
  – Clusters
  – Cloud
  – Software

8

Computer Resources
Get a physical computer

• Computers
  – Local computers
  – Virtual computers
  – Clusters
  – Cloud
  – Software

UNC Student Store Sells Computers
http://store.unc.edu/
Buy a computer at a discount. Get lifetime support from UNC.

9

Computer Resources
Use a virtual computer

• Computers
  – Local computers
  – Virtual computers
  – Clusters
  – Cloud
  – Software

Research Computing hosts Virtual Computing Lab:
https://vcl.unc.edu/index.php?mode=selectauth
You can virtually use Microsoft and Linux computers and install tailored software for individual or group use.

10

Computer Resources
Use a high throughput Linux cluster

• Computers
  – Local computers
  – Virtual computers
  – Clusters
  – Cloud
  – Software

UNC ITS (Information Technology Services) manages two main computational clusters: Killdevil (new) and Kure (old)
http://its.unc.edu/service/compute-servers-clusters/

11

Computer Resources
Get on the cloud

• Computers
  – Local computers
  – Virtual computers
  – Clusters
  – Cloud
  – Software

Google
https://cloud.google.com/genomics/what-is-google-genomics
Amazon
http://aws.amazon.com/health/life-sciences/

12

Software is available to buy, borrow or obtain for free

Bioinformatics has site licenses available for checkout. Share spendy software.
http://bioinformatics.unc.edu/software/

ITS has a lot of software for free or for purchase at a discounted price. (EndNote, discounted; MATLAB, free; SecureShell, free)
http://software.sites.unc.edu/software/

ITS Virtual Lab
Virtually use spendy software for free (latest Adobe, Mathematica, SPSS, etc.)
https://virtuallab.unc.edu/Citrix/ITSLabsSFWeb/

Kure and Killdevil come with loadable modules

13

Becoming proficient with new computing environments and languages

• Linux – An operating system, an environment, a lifestyle

• Shell scripting – Great for pipelines

• Python – A general purpose, high-level programming language. Highly readable and writable.

• Perl – A general purpose, high-level programming language. Great with text files.

• JavaScript – A general purpose, high-level programming language. Specialized for web applications and apps, and commonly used in plugins (ImageJ).

• R – A high-level programming language and software environment specialized for statistics and large data management.

• MATLAB – A computing environment and programming language. Costs money.

• MySQL – Database organization

• Others?

(Side labels on slide: LINUX, PROGRAMMING LANGUAGES, MATH)

14

A HANDS-ON INTRO TO KILLDEVIL

How can I use some of these resources?

15

Computer Resources

• Computers
  – Local computers
  – Clusters
  – Cloud
  – Software

UNC ITS (Information Technology Services) manages two main computational clusters: Killdevil (new) and Kure (old)
http://its.unc.edu/service/compute-servers-clusters/

16

What is Killdevil?

17

No seriously, what is Killdevil?

• A high performance computer cluster

• Linux operating system
• 1 login node
• 774 compute nodes
  – 48 – 96 GB memory per node
  – 12 – 16 CPU cores per node
• 2 large memory nodes (1 TB)
• 12 graphics processor (GPU) nodes
• File systems for storage

18

No seriously, what is Killdevil?

19

Getting onto Killdevil

• Mac OS & Linux machines:
  – Link to Killdevil through “Terminal”
  – Open “Terminal” (in Applications -> Utilities)
  – Type this:
    ssh <yourOnyen>@killdevil.unc.edu
  – Add password when prompted

• PC
  – Open SSH Secure Shell Client
  – Click on “Quick Connect”
  – Hostname = killdevil.unc.edu
  – Username = <yourOnyen>
  – Port Number = 22
  – Add password when prompted

$ ssh <yourOnyen>@killdevil.unc.edu

20

Getting onto Killdevil -- demo

21

Keeping a computational notebook

22

Navigating UNIX: getting oriented

• Commands
• Manuals

• Your first two commands:
  – whoami
  – date

• Getting help with manuals:
  – man <command>
  – “spacebar” to scroll
  – Type “q” to exit

$ whoami
erinosb

$ date
Thu Apr 9 13:24:09 EDT 2015

$ man whoami

q

23

Navigating UNIX – paths and directories

• pwd – Print Working Directory

• cd – Change Directory
  cd <directoryname>

• ls – List Contents

$ pwd

$ cd /nas02/home/

$ ls
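
To see how pwd, cd and ls work together, here is a minimal sketch that runs in a throwaway directory instead of on Killdevil. The names scratch, projects and notes.txt are made up for illustration and do not come from the course.

```shell
# Build a throwaway directory so the sketch does not depend on Killdevil paths.
scratch=$(mktemp -d)
mkdir "$scratch/projects"
touch "$scratch/projects/notes.txt"

cd "$scratch/projects"    # cd: change into the new directory
here=$(pwd)               # pwd: print the working directory (saved in a variable)
contents=$(ls)            # ls: list what is inside it
echo "I am in: $here"
echo "I can see: $contents"

cd ..                     # ".." is the parent directory, one level up
parent_contents=$(ls)
echo "One level up I can see: $parent_contents"
```

Running the three commands after every move is a cheap way to stay oriented while the file structure is still unfamiliar.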

24

The file structure

• Directories and sub-directories are “folders”
• Some important directories on Killdevil
  http://help.unc.edu/help/getting-started-on-killdevil/#P63_6342
  – ms/
  – netscr/
  – ~

• Making a new directory
  mkdir <directoryname>

• Removing a directory
  rm -ri <directoryname>

$ mkdir 1_courses
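
The sketch below shows directory creation and removal in a scratch area. Two things in it are swapped in and are not on the slide: mkdir -p, which also creates any missing parent directories, and rmdir, used here in place of rm -ri because rm -ri stops to ask questions and rmdir refuses to delete anything that still has files in it. The directory names demo and nested are hypothetical.

```shell
scratch=$(mktemp -d)
cd "$scratch"

mkdir 1_courses                     # one new directory
mkdir -p 1_courses/demo/nested      # -p creates demo and nested in one step

rmdir 1_courses/demo/nested         # works: nested is empty
touch 1_courses/demo/keep_me.txt    # now demo contains a file

# rmdir refuses to remove a directory that still holds something.
if rmdir 1_courses/demo 2>/dev/null; then
    result="removed"
else
    result="refused: directory not empty"
fi
echo "$result"
```

On a shared cluster, a command that refuses by default is a reasonable safety net while you are learning.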

25

A few key tips and tricks

• Naming conventions
• Auto complete with TAB
• What if I get stuck?
  – CTRL+C

• Get me out of here
  – Q
  – CTRL+C
  – CTRL+D
  – quit
  – logout
  – logoff
  – logout()
  – bye
  – quit()
  – q()
  – exit

• What if I need help?
  – man <command>
  – <command> -h
  – <command> --help
  – <command>

• GOOGLE it!
  – Use language name in search

26

Exercise 1: move up and down paths

A) Type the following command:
$ cd
Where are you right now? Write down this exact location in your notebook.

B) Enter the following command:
$ cd /
Now where are you?

C) List the contents of this directory. Do you see the directory nas02? Change into that directory.

D) Use cd and ls to navigate down the file structure back to your original location in step A.

E) Type this command:
$ cd ..
Now where are you? What did cd .. do?

F) Use cd .. to go up multiple folders. Type ‘cd ..’ followed by “ls” and “pwd”. Keep repeating these steps until you go back up to /

G) Now type this command. Where are you now?
$ cd <the-exact-location-path-you-wrote-in-step-A>

H) Now type:
$ cd -
Where are you now?

I) Navigate back down to your home directory through each directory in your path. This time, try typing <TAB> after typing the first three letters of each directory name to initiate autocomplete. What happens?

27

Making and Removing files

• Making a file
  touch <filename>

• Removing a file
  rm -i <filename>

-i is an option

$ command [-OPTIONS] <requiredfiles>

$ touch testfile1.txt

$ rm -i testfile1.txt
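
A small sketch of the same create-and-remove cycle, run in a scratch directory. It leaves out the -i option so that nothing waits for a y/n answer; at an interactive prompt, -i is the safer habit. The variables size and state are made up for this illustration.

```shell
scratch=$(mktemp -d)
cd "$scratch"

touch testfile1.txt                          # create an empty file
size=$(wc -c < testfile1.txt | tr -d ' ')    # count its bytes; tr strips padding spaces
echo "testfile1.txt holds $size bytes"

rm testfile1.txt                             # remove it (no -i, so no confirmation prompt)
if [ -e testfile1.txt ]; then
    state="still there"
else
    state="gone"
fi
echo "testfile1.txt is $state"
```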

28

Exercise 2: Creating a directory tree

1) Make a directory structure in your home directory that you can use. If you already have a home directory structure you like, you can skip this and just create the course directory (150413_FirstYearGradCourse) and the subdirectories.

$ cd #This will move you into your home directory.

Try this command:

$ tree
What do you see?

2) Use mkdir to create directories and subdirectories within your home directory so that tree will generate a “map” of your files that looks like this:

.
|-- 150413_FirstYearGradCourse
    |-- exercise03
    |-- exercise05

3) Put a file named 0_exercise03_README.txt in the exercise03 directory.

4) Navigate into the directory exercise03. Inside this directory make a new directory called ‘exercise10’. You realize that you don’t actually want the exercise10 directory as a subdirectory within exercise03. You would prefer it to be listed as a directory within 150413_FirstYearGradCourse, just like exercise03. Figure out how to move the directory “exercise10” out of the directory “exercise03”. You will use the command ‘mv’ to do this.

Hint: Try using google to search for information about ‘mv’.
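
If the hint leaves you stuck, here is a minimal sketch of how mv relocates a directory. It uses made-up names (outer, inner, misplaced) in a scratch directory and not the exercise folders, so you still get to work out the exercise paths yourself.

```shell
scratch=$(mktemp -d)
cd "$scratch"
mkdir -p outer/inner/misplaced    # misplaced starts out nested inside inner

cd outer/inner
mv misplaced ..                   # move it into the parent directory ("..")
cd ..

layout=$(ls)                      # outer should now hold inner and misplaced side by side
echo "$layout"
```

The general form is mv <what-to-move> <where-to-put-it>, and the destination can be a relative path such as .. or a full path.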

29

Getting files onto and off of Killdevil

• sftp clients
  – Cyberduck, Mozilla – SSH/SFTP

• Set it up, then drag and drop

• scp

scp <filename> <onyen>@killdevil.unc.edu:/path/

$ pwd
mylaptop/erin/

$ scp TFs.tar.gz erinosb@killdevil.unc.edu:/nas02/home/e/r/erinosb/1_courses/150413_firstYearGrads/exercise03/

30

Decompressing directories with tar and gzip

• tar
• To “extract” a directory:
  tar -zxvf <dir.tar.gz>
• To “create” a compressed archive of a directory:
  tar -zcvf <dir.tar.gz> <dir>
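
A round trip makes the option letters easier to remember. The sketch below compresses a made-up directory (demo_dir, holding a one-line example.csv), deletes the original, and extracts it again. The v (verbose) option from the slide is left out only to keep the output short.

```shell
scratch=$(mktemp -d)
cd "$scratch"
mkdir demo_dir
echo "gene1,TF" > demo_dir/example.csv

tar -zcf demo_dir.tar.gz demo_dir    # c = create, z = gzip, f = name of the archive file
rm -r demo_dir                       # delete the original so the extraction is visible
tar -zxf demo_dir.tar.gz             # x = extract

restored=$(cat demo_dir/example.csv)
echo "Restored contents: $restored"
```

Swapping x for t (tar -ztf demo_dir.tar.gz) lists what is inside an archive without unpacking it.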

31

Exercise 3: Transcription Factors

A) Download the zipped directory of transcription factors called “TFs.tar.gz” from https://onish.web.unc.edu/firstyeargrads/ and save it somewhere on your local computer.

B) Upload the file to Killdevil, into the ~/1_courses/150413_FirstYearGrads/exercise03 directory.

MAC users – Use scp or an SFTP client.

PC users – Use Secure Shell or an SFTP client.

C) Did you put your zipped directory in the wrong place? See if you can figure out how to use the mv (move) command to put it in the right place.

D) Expand your zipped TF directory using tar (see the slide “Decompressing directories with tar and gzip”).

E) See what just happened using ls. Now navigate into your directory using cd and see what is in there. What’s inside?

F) more, less and head allow you to look inside files. Use the man pages of these commands to figure out what they do and how they work. Peek into one or two of the enclosed csv files. What do they look like?

G) Now try this command:

$ wc Athaliana_TFs.csv
What do you see? What does wc do?

H) Chain commands together. There are many ways to string multiple commands together and execute them from a single line. What do the following commands do:

$ wc Athaliana_TFs.csv; wc Celegans_TFs.csv; wc Dmelanogaster_TFs.csv
$ head Athaliana_TFs.csv | wc

I) Count all the .csv files using:

$ wc *.csv

J) Start your first program. Make an empty file called “wordCounter.sh” using touch.
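
To check your answer to the chaining question, here is a sketch on a made-up 12-line file (demo_TFs.csv stands in for the real TF tables, which are not reproduced here). It compares wc on the whole file with wc fed through a pipe from head.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Write 12 made-up rows, one per line.
i=1
while [ "$i" -le 12 ]; do
    echo "TF$i,family$i" >> demo_TFs.csv
    i=$((i + 1))
done

all_lines=$(wc -l < demo_TFs.csv | tr -d ' ')          # lines in the whole file
head_lines=$(head demo_TFs.csv | wc -l | tr -d ' ')    # head hands wc only the first 10 lines
echo "whole file: $all_lines lines; through head: $head_lines lines"
```

The semicolon simply runs one command after another, while the pipe (|) hands the output of the first command to the second as its input.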

32

Our first program

• bash is the most common Linux shell
• Writing things in bash is called shell scripting.
• All bash scripts:
  – Contain the file extension .sh
  – Start with shebang:
    • #!/bin/bash
  – Have pseudocode
  – Are tested every 2 – 3 lines
  – Are executed with:

$ bash <scriptname.sh>

33

Our first program: wordCounter.sh

#!/bin/bash

# Count words in files using "wc":
echo -e "Word count Arabidopsis TF file: "
wc TFs/Athaliana_TFs.csv

To ‘edit’ or write this script, move wordCounter.sh to and from your local computer using Secure SFTP (PC).
OR
Interactively edit the document by navigating to this file in Cyberduck, right-clicking on it, and selecting “edit with”

34

To execute our first program

1) Navigate to the same folder that has your program in it. Use ‘ls’ to double check you can see the program wordCounter.sh

2) Type:

$ bash wordCounter.sh
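
For a dry run away from Killdevil, the sketch below writes a tiny script to disk and executes it with bash. It writes the file with a here-document (cat > file <<'EOF'), which is a swapped-in shortcut and not the SFTP editing route described on the slides. The names demoCounter.sh and TFs/demo.csv are hypothetical stand-ins.

```shell
scratch=$(mktemp -d)
cd "$scratch"
mkdir TFs
printf 'a b\nc d\n' > TFs/demo.csv    # 2 lines, 4 words

# Everything between the two EOF markers is written into demoCounter.sh.
cat > demoCounter.sh <<'EOF'
#!/bin/bash
# Count lines, words and characters in the stand-in file
wc TFs/demo.csv
EOF

output=$(bash demoCounter.sh)         # run the script from the folder that contains it
echo "$output"
```

Because the script refers to TFs/demo.csv by a relative path, it only works when run from the folder that holds the TFs directory, which is why the slide asks you to navigate there first.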

35

Homework: Writing your first script

• We don’t necessarily want to count the words in every .csv file we have. Let’s say we want just our three favorite species. Let’s make a wordCounter.sh program to do just this.

• Use your SFTP client to copy wordCounter.sh onto your local computer or open and edit it interactively.

• Type in the shebang on the top line.
• Write an entry in your computational notebook that you started this code. Include today’s date and what you want this code to do.
• Type in some pseudocode indicating what you want to do. Make sure there is a “#” in front of any pseudocode so it isn’t interpreted as a command.
• Type in a quick shell script to count the words in three of your specific files.
• Test your code.
• Does it work?
• Now try adding a line into your code (anywhere) that says the following:
  echo -e "Wow, I love learning shell scripting."
• What happened?
• Use echo to personalize your output message.
• Now open that readme file you made back in Exercise #2. Enter in what your script is and what it does.

36

Extra: An example shell script #2
TF_counter.sh

#!/bin/bash

# Report the number of transcription factors in three different species as
# per cis-BP website.
# The number of lines in each of these files = the number of transcription
# factors.

echo -e "\nThe number of Arabidopsis Transcription factors is: \t"
wc TFs/Athaliana_TFs.csv | awk '{print $1}'
echo -e "\nThe number of Human Transcription factors is: \t"
wc TFs/Hsapiens_TFs.csv | awk '{print $1}'
echo -e "\nThe number of Worm Transcription factors is: \t"
wc TFs/Celegans_TFs.csv | awk '{print $1}'
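
The wc | awk pattern works because the first column of wc output is the line count. As a hedged side-by-side, the sketch below gets the same number two ways on a made-up three-line file (demo_TFs.csv is hypothetical): by picking the first column with awk, and by asking wc for lines only with -l.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf 'TF1,bZIP\nTF2,Homeodomain\nTF3,C2H2 ZF\n' > demo_TFs.csv

via_awk=$(wc demo_TFs.csv | awk '{print $1}')    # first column of wc output = line count
via_flag=$(wc -l < demo_TFs.csv | tr -d ' ')     # -l asks wc for the line count only
echo "awk route: $via_awk, -l route: $via_flag"
```

Either route is fine; the awk version is worth knowing because the same trick pulls any column out of any command's output.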

37

Extra: An example of another shell script, loopScript.sh

#!/bin/bash

# A word counting script that loops through multiple entries:

for i in "$@"
do
    wc "${i}"
done

Execute by typing this into the command line:
$ bash loopScript.sh TFs/*csv
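
A loop over command-line arguments can be written with $* or with "$@", and the two differ as soon as a file name contains a space. The sketch below counts how many items each form sees when handed a single made-up file name, "my file.csv". The function names count_unquoted and count_quoted are hypothetical helpers for this illustration.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf 'one\ntwo\n' > "my file.csv"    # a file name containing a space

count_unquoted() {
    n=0
    for i in $*; do n=$((n + 1)); done      # unquoted $* is split at every space
    echo "$n"
}

count_quoted() {
    n=0
    for i in "$@"; do n=$((n + 1)); done    # "$@" keeps each argument in one piece
    echo "$n"
}

unquoted=$(count_unquoted "my file.csv")
quoted=$(count_quoted "my file.csv")
echo "\$* saw $unquoted items; \"\$@\" saw $quoted"
```

With file names that contain no spaces, both forms behave the same, which is one more reason to follow a no-spaces naming convention.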

38

A few follow up comments on Killdevil

• If your command takes more than 5 seconds to execute, cancel it!!!!
  – <CTRL+C>
  – Learn about LSF (bsub) for computationally intensive jobs.
  – http://help.unc.edu/help/getting-started-on-killdevil/

• If your project is big, get space!
  – ~ directory is only 12 GB
  – Light users use /netscr for temporary jobs and /ms for storage.
  – /netscr will be deleted after 21 days!!!
  – For anything bigger, get your own dedicated space.
  – http://help.unc.edu/help/getting-started-on-killdevil/

• Off campus? VPN into UNC first before logging on.

• Modules are available on Killdevil and Kure.

39

Load Sharing Facility (LSF)

40

The benefits of honing your computational prowess

• Publishing
• Fast & efficient
• Reproducible
• Collaborative
• Employable

41

PROPELLING YOURSELF THROUGH THE NEXT STEPS

How do I learn?

42

Oft encountered languages and environments

• Linux – An operating system, an environment, a lifestyle

• Shell scripting – Great for pipelines

• Python – A general purpose, high-level programming language. Highly readable and writable.

• Perl – A general purpose, high-level programming language. Great with text files.

• JavaScript – A general purpose, high-level programming language. Specialized for web applications and apps, and commonly used in plugins (ImageJ).

• R – A high-level programming language and software environment specialized for statistics and large data management.

• MATLAB – A computing environment and programming language. Costs money.

• MySQL – Database organization

• Others?

(Side labels on slide: LINUX, PROGRAMMING LANGUAGES, MATH)

43

UNC Workshops & Courses

• IT Workshops
  – Linux, Killdevil & Kure, Python, R, SciPy, Tarheel Linux
  – http://reg.abcsignup.com/view/view_month.aspx?as=52&wp=887&aid=UNC-ITS

• Basic Bioinformatics Tools Workshops (Hemant Kelkar)
  – Linux, Killdevil, LSF, BLAST, Genomics, RNA-seq, PyMol, ENSEMBL
  – http://guides.lib.unc.edu/c.php?g=8359&p=43018
  – YOUTUBE!

• RNA-seq Workshop
• Summer Workshop
• Bioinformatics and Computational Biology Series
  – http://www.bcb.unc.edu/training.htm#coursework

• Learn R videos at the Odum Institute
  – http://www.odum.unc.edu/odum/contentSubpage.jsp?nodeid=670

44

MOOCs

• Coursera
  – Data Science (9 x 4 week modules, starts May 4)
    • https://www.coursera.org/specialization/jhudatascience/1?utm_medium=courseDescripTop
  – Python (10 weeks, starts June 1)
    • https://www.coursera.org/course/pythonlearn
  – Statistics

• EdX
• Codecademy
• Udacity
• Khan Academy
• Software Carpentry

46

Travel and Learn

• Cold Spring Harbor Labs Training courses
  – http://meetings.cshl.edu/courses.html

• MSU Michigan State University Summer Workshop
  – http://ged.msu.edu/angus/tutorials-2013/
  – http://bioinformatics.msu.edu/ngs-summer-course-2015

• Software Carpentry
  – http://software-carpentry.org/workshops/index.html

47

A public service announcement on backups

“If your data does not exist in triplicate, spanning at least two tectonic plates, it does not exist”

-- Greg Wilson

48

What do you want to get out of your graduate experience?

49

Extra Slides are after here

50

Common reasons why interested students do not continue to use computational tools

• “I took a few linux workshops, but I couldn’t get anything to work on my computer.”

• “I see the utility of using computational tools but it is impossible to keep track of what I’m doing and I feel disorganized.”

• “Everything takes so long to learn. It seems really inefficient.”

• “I took a few linux workshops but then I didn’t use it and I forgot everything I learned.”

51

There are many programming languages to learn

http://carlcheo.com/startcoding