whole genome report
NGUYEN HOANG BACH, MSc.
Sassari, 2011
WHOLE GENOME ASSEMBLY AND ANALYSIS SHORT REPORT
Supervisor
MASSIMO DELIGIOS, PhD. Prof. PIERO CAPPUCINELLI
DIVISION OF CLINICAL AND EXPERIMENTAL MICROBIOLOGY
DEPARTMENT OF BIOMEDICAL SCIENCES
UNIVERSITY OF SASSARI
Part 01 Install Cygwin for Windows 7 OS
Install Velvet 1.0.19
Create contig with Velvet
A. Install Cygwin with Perl, a C++ compiler, a debugger, and make for Windows 7 OS
Cygwin is:
- a collection of tools which provide a Linux look-and-feel environment for Windows;
- a DLL (cygwin1.dll) which acts as a Linux API layer providing substantial Linux API functionality. The Cygwin DLL currently works with all recent, commercially released x86 32-bit and 64-bit versions of Windows, with the exception of Windows CE (see note 1).
Note 1: Windows CE (now officially known as Windows Embedded Compact, previously also known as Windows Embedded CE, and sometimes abbreviated WinCE) is an operating system developed by Microsoft for embedded systems. Windows CE is a distinct operating system and kernel, rather than a trimmed-down version of desktop Windows. It is not to be confused with Windows XP Embedded, which is NT-based.
The full list of Cygwin packages is available at http://www.cygwin.com/packages/. Relevant entries include:
gcc-g++ GCC-3 Series legacy compiler: C++ compiler
gdb The GNU Debugger
make The GNU version of the 'make' utility
perl Larry Wall's Practical Extracting and Report Language
perl-Error Perl module for OO error/exception handling
perl-ExtUtils-Depends Build Perl XS that depend on other XS
perl-ExtUtils-PkgConfig Perl module for using pkg-config
perl-Graphics-Magick GraphicsMagick Perl bind (PerlMagick)
perl-Image-Magick Image manipulation software suite (Perl bindings)
perl-Locale-gettext Perl module for using gettext and libintl
perl-SGMLSpm Perl SGMLS parser module
perl-Tk Perl interface for Tk (X11)
perl-Win32-GUI Perl Win32-GUI module
perl-XML-Simple Perl module for simple XML access
perl-libwin32 Perl extensions for using the Win32 API
perl-ming A SWF output library - (Perl bindings)
perl_manpages Perl manpages
The make utility automatically determines which pieces of a large program need to be recompiled, and issues the commands to recompile them. GNU make was implemented by Richard Stallman and Roland McGrath; development since version 3.76 has been handled by Paul D. Smith.
GNU make conforms to section 6.2 of IEEE Standard 1003.2-1992 (POSIX.2). It is most commonly used with C programs, but it works with any programming language whose compiler can be run with a shell command. Indeed, make is not limited to programs: it can describe any task where some files must be updated automatically from others whenever the others change.
B. Install Velvet 1.0.19 running on Windows 7 OS with Cygwin (including C++ compiler, debugger, and make)
What is Velvet?
Velvet is a de novo genomic assembler specially designed for short-read sequencing technologies, such as Solexa or 454. It was developed by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute (EMBL-EBI), United Kingdom.
Velvet currently takes in short read sequences, removes errors then produces high quality unique contigs. It then uses paired-end read and long read information, when available, to retrieve the repeated areas between contigs.
Memory requirements and run time of velvetg
Both depend on the number and size of the reads we have to assemble. The speed at which velvetg runs depends on many variables, including CPU type and speed, memory bus speed, size and number of reads, and the value of k, so it is difficult to estimate. As a reference point, for 30 million 36-mers with a k of 29, the initial velvetg run finishes in 15 - 20 minutes on a desktop with 16 GB RAM, and subsequent runs are faster. Therefore, 160 hours of run time should be plenty, and the biggest concern is the memory requirement. It can be estimated with the following relationship:
Ram required for velvetg (Kb) = -109635 + 18977*ReadSize + 86326*GenomeSize + 233353*NumReads - 51092*K
Ram required for velvetg (Gb) = Ram required for velvetg (Kb) / 1048576
where:
- ReadSize is the read length in bases
- GenomeSize is the genome size in millions of bases (Mb)
- NumReads is the number of reads in millions
- K is the k-mer hash value used in velveth
The results are accurate to within +/- 0.5 - 0.8 GB on the test system (64-bit Fedora 10, quad core, 16 GB RAM). For example, for:
K = 31, number of reads = 50 million, read size = 36 bases, genome size = 5 Mb
the estimator returns ~10.5 GB of RAM required. The regression equation should be fairly valid for the following ranges:
K = 15 - 31
Number of reads = 5 - 70 million
Genome size = 2 - 10 Mb
Read length (size) = 20 - 75 bases
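The estimator can be checked quickly with a few lines of Python. This is only a sketch that transcribes the regression quoted in this section; the function name and argument names are our own and are not part of Velvet.

```python
def velvetg_ram_gb(read_size, genome_size_mb, num_reads_m, k):
    """Estimate velvetg RAM in GB with the regression quoted in the text.

    read_size      : read length in bases
    genome_size_mb : genome size in millions of bases
    num_reads_m    : number of reads in millions
    k              : k-mer hash length used in velveth
    """
    ram_kb = (-109635
              + 18977 * read_size
              + 86326 * genome_size_mb
              + 233353 * num_reads_m
              - 51092 * k)
    # The text converts Kb to Gb by dividing by 1048576
    return ram_kb / 1048576


if __name__ == "__main__":
    # Worked example from the text: K = 31, 50 million 36-base reads, 5 Mb genome
    print(f"{velvetg_ram_gb(36, 5, 50, 31):.1f} GB")
```

For the worked example the result is a little above 10.5 GB, in line with the figure quoted in the text. Outside the validity ranges listed there, the estimate should be treated with caution.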
C. Create contigs from the sequence data with Velvet 1.0.19
Step 1: Compile Velvet
make
or, to allow hash lengths up to 57:
make 'MAXKMERLENGTH=57'
Step 2: Merge the two paired-end read files of MT_HUE_20 into a single interleaved file
./shuffleSequences_fastq.pl ./data/s_5_1_FC70G0HAAXX_501827_44251_MTB_20_HUE.fastq ./data/s_5_2_FC70G0HAAXX_501827_44251_MTB_20_HUE.fastq fullseq.fastq
Syntax:
./shuffleSequences_filetype.pl ./[include_path/file1_name] ./[include_path/file2_name] ./[include_path/newfile_name]
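To make clear what the shuffling step produces, here is a minimal Python sketch of the same idea: records from the two mate files are written alternately into one file. It is an illustration only, not the Velvet script itself, and it assumes plain 4-line FASTQ records with the mates in the same order in both files. The function names are our own.

```python
import itertools


def read_fastq(path):
    """Yield FASTQ records as lists of four newline-terminated lines."""
    with open(path) as handle:
        while True:
            record = list(itertools.islice(handle, 4))
            if not record:
                break
            # Make sure every line ends with a newline, even the last one
            yield [line.rstrip("\n") + "\n" for line in record]


def interleave_fastq(path1, path2, out_path):
    """Write mate 1, mate 2, mate 1, mate 2, ... and return the pair count."""
    pairs = 0
    with open(out_path, "w") as out:
        for rec1, rec2 in zip(read_fastq(path1), read_fastq(path2)):
            out.writelines(rec1)
            out.writelines(rec2)
            pairs += 1
    return pairs
```

Calling interleave_fastq with the two MT_HUE_20 read files and an output name would give a file laid out like fullseq.fastq. If the two files have different numbers of records, this sketch silently stops at the shorter one.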
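Step 3: Run velveth without arguments to display its help message
./velveth
Step 4: Run velvetg without arguments to display its help message
./velvetg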
Step 5: Run velveth to build the k-mer hash table
./velveth output_directory hash_length [-file_format] [-read_type] [filename]
output_directory is the directory where the output is written (Velvet_dir/output_dir).
hash_length is the length of the k-mers entered in the hash table:
• it must be an odd number, to avoid palindromes. If we put in an even number, Velvet will just decrement it and proceed.
• it must be below or equal to MAXKMERLENGTH (default 31 bp), because it is stored on 64 bits.
• it must be strictly less than the read length, otherwise we will not observe any overlaps between reads.
As is often the case, it is a trade-off between specificity and sensitivity. Longer k-mers bring more specificity (i.e. fewer spurious overlaps) but lower coverage (see below), so there is a sweet spot to be found with time and experience.
We like to think in terms of "k-mer coverage", i.e. how many times a k-mer has been seen among the reads. The relation between k-mer coverage Ck and standard (nucleotide-wise) coverage C is
Ck = C ∗ (L−k+1)/L
where k is the hash length and L is the read length. Experience shows that this k-mer coverage should be above 10 to start getting decent results. If Ck is above 20, we might be "wasting" coverage. Experience also shows that empirical tests with different values of k are not that costly to run.
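The k-mer coverage relation is easy to tabulate before running velveth. The sketch below applies Ck = C ∗ (L−k+1)/L and lists the odd hash lengths whose k-mer coverage falls between 10 and 20, the rule of thumb quoted above. The function names, the default search range of k and the example coverage of 50 are our own illustrative choices.

```python
def kmer_coverage(c, read_length, k):
    """Ck = C * (L - k + 1) / L."""
    return c * (read_length - k + 1) / read_length


def candidate_k(c, read_length, low=10, high=20, min_k=15, max_k=31):
    """Odd hash lengths whose k-mer coverage lies between low and high."""
    first_odd = min_k if min_k % 2 == 1 else min_k + 1
    return [k for k in range(first_odd, max_k + 1, 2)
            if k < read_length and low <= kmer_coverage(c, read_length, k) <= high]


if __name__ == "__main__":
    # Hypothetical example: 50x nucleotide coverage with 36-base reads
    for k in candidate_k(50, 36):
        print(k, round(kmer_coverage(50, 36, k), 2))
```

This only narrows the range of k worth testing; the final choice still comes from comparing actual assemblies.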
Supported file formats are: fasta (default), fastq, fasta.gz, fastq.gz, eland, gerald
Read categories are:
short (default)
shortPaired
short2 (same as short, but for a separate insert-size library)
shortPaired2 (see above)
long (for Sanger, 454 or even reference sequences)
longPaired
filename: the name of the read file, including its path
For example:
./velveth contig 31,45,2 -fastq -shortPaired seq/sequences-data1.fastq seq/sequences-data2.fastq
Here the hash length is specified as 31,45,2, which runs velveth with hash lengths from 31 to 43 in steps of 2 (note: k-mers have to be odd). This creates seven directories named contig_31 .. contig_43. To save disk space, the Sequences file is symbolically linked by Velvet to the one in the first directory (in this case contig_31).
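The expansion of a "start,stop,step" hash length into output directories can be sketched as follows. This mirrors the behaviour described above (the stop value itself is not used); the function names are our own and the code does not call Velvet.

```python
def hash_lengths(start, stop, step):
    """Hash lengths for 'start,stop,step': from start up to, but not including, stop."""
    return list(range(start, stop, step))


def output_directories(prefix, start, stop, step):
    """Directory names of the form prefix_k, one per hash length."""
    return [f"{prefix}_{k}" for k in hash_lengths(start, stop, step)]


if __name__ == "__main__":
    for name in output_directories("contig", 31, 45, 2):
        print(name)
```

For "contig 31,45,2" this lists the seven directories contig_31 to contig_43 mentioned in the text.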
Step 6: Run velvetg and determine the optimal k
./velvetg contig_33 -exp_cov 396.0 -ins_length1 300 -ins_length2 3000
The expected coverage parameter was estimated by first counting the number of reads in each library with grep piped to wc (word count):
grep "@HWI-EAS210R_0001" 3kb_mp_shuffled.fastq | wc
8362680 8362680 342363112
grep "@HWI-EAS210R_0001" 300bp_pe_shuffled.fastq | wc
6069248 6069248 248522420
The first number in this output is the number of lines that match the grep pattern. We can arrive at the expected coverage by multiplying those counts by the
length of reads in each library and dividing by the total length of the genome (or our best estimate of it). So to calculate the expected coverage we could
perform the following calculation: ((8362680 * 38) + (6069248 * 54)) / 1,630,000 = 396.
It is important to note here that we can increase the value of the -exp_cov parameter and we may see an improvement in the n50 of the assembly, but it may
also produce mis-assemblies.
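The expected coverage calculation can be reproduced in Python as a check. The sketch below uses the read counts, read lengths and genome size quoted above; the function name is our own.

```python
def expected_coverage(libraries, genome_size):
    """Total sequenced bases divided by the genome size.

    libraries   : iterable of (read_count, read_length) pairs
    genome_size : genome length in bases (or our best estimate of it)
    """
    total_bases = sum(count * length for count, length in libraries)
    return total_bases / genome_size


if __name__ == "__main__":
    libraries = [(8362680, 38), (6069248, 54)]  # counts from grep | wc, read lengths
    print(round(expected_coverage(libraries, 1630000)))
```

The rounded result matches the value of 396 passed to -exp_cov.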
When velvetg finishes it will output the number of nodes, n50, and max and total size of the assembly created.
If we look in the contig_* directory, we will also see a few files:
contigs.fa Graph LastGraph Log PreGraph Roadmaps Sequences stats.txt
These files are explained in detail in the Velvet manual, but the most useful files for post-analysis are the contigs.fa, Log, and stats.txt files. These results should be entered into
the spreadsheet at the front of the lab.
Running the following custom script will output the n50 as well as n90 values for this assembly. For Ubuntu Linux users, we will run:
perl /usr/local/bin/calculateN50.pl auto_*/contigs.fa
Where * is the value of k.
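For readers without the calculateN50.pl script, the same statistics can be sketched in Python. This is not the script used above and its exact conventions may differ; here Nxx is taken as the length of the contig at which the running total of lengths, sorted from longest to shortest, first reaches xx% of the assembly size. The function names are our own.

```python
def contig_lengths(fasta_path):
    """Return the sequence length of every record in a FASTA file."""
    lengths, current = [], None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def nxx(lengths, fraction):
    """Contig length at which the cumulative sum reaches fraction of the total."""
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return 0
```

With lengths = contig_lengths("contigs.fa"), the values nxx(lengths, 0.5) and nxx(lengths, 0.9) give the N50 and N90 in base pairs.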
We may notice that this n50 value is slightly different than what was reported by velvet. This is due to the fact that velvet reports its n50 (as well as
everything else) in kmer space. For example, the relationship between coverage and kmer coverage is defined by the following:
Ck = C ∗ (L−k+1)/L
Where C=coverage,
L=read length
k=kmer length. For other things such as a contig length it is as simple as adding k-1 to the reported length.
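Both conversions out of k-mer space are one-liners. The sketch below adds k-1 to a length reported in k-mers and inverts Ck = C ∗ (L−k+1)/L to recover nucleotide coverage; the function names and the example numbers are our own.

```python
def length_in_bp(kmer_length, k):
    """Convert a length reported in k-mers to base pairs by adding k - 1."""
    return kmer_length + k - 1


def nucleotide_coverage(ck, read_length, k):
    """Invert Ck = C * (L - k + 1) / L to obtain C."""
    return ck * read_length / (read_length - k + 1)


if __name__ == "__main__":
    # Hypothetical node of 100 k-mers with k = 31
    print(length_in_bp(100, 31))
    # Hypothetical k-mer coverage of 6 with 36-base reads and k = 31
    print(nucleotide_coverage(6.0, 36, 31))
```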
Result:
- Nodes: 2232
- Max length: 94 408 bp
- Min length: 89 bp
Short nodes (<400 bp) can be deleted with software such as Geneious or CLC Genomics Workbench.
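The short nodes can also be removed without a graphical tool. The following Python sketch copies only the contigs of at least 400 bp to a new FASTA file. It is an alternative to the Geneious or CLC route rather than a description of it, and the function names and file names are our own.

```python
def parse_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(in_path, out_path, min_length=400):
    """Write contigs with at least min_length bases; return how many were kept."""
    kept = 0
    with open(out_path, "w") as out:
        for header, sequence in parse_fasta(in_path):
            if len(sequence) >= min_length:
                out.write(f">{header}\n{sequence}\n")
                kept += 1
    return kept
```

For example, filter_contigs("contigs.fa", "contigs_min400.fa") would write the filtered set. Each kept sequence is written on a single line.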
Part 02 Assembly - Blast - Mapping - Annotation
Step 7: Ligate all contig nodes obtained from Velvet and create the circular genome with Geneious Pro 4.8.5 (Build 2010-03-04 10:01)
Geneious Pro is a commercial bioinformatics software platform that is both ultra-powerful and easy to use. We are able to search, organize and analyze genomic and protein information via a single desktop program that provides publication ready images to enhance the impact of our research.
- Create a folder and import the contigs.fa file into this folder.
- Sort all nodes in order and select all of them.
- Ligate the nodes with Cloning tools -> Ligate Sequences….
- Select Circularize sequences to make a circular genome.
- Export the circular sequence into a new folder and save it as a FASTA file.
Step 8: Predict all open reading frames (ORFs) with GeneMarkS (http://exon.gatech.edu/GeneMark/genemarks.cgi)
The new gene prediction method, called GeneMarkS, utilizes a non-supervised training procedure and can be used for a newly sequenced prokaryotic genome with no prior knowledge of any protein or rRNA genes. The GeneMarkS implementation uses an improved version of the gene finding program GeneMark.hmm, heuristic Markov models of coding and non-coding regions and the Gibbs sampling multiple alignment program. GeneMarkS predicted precisely 83.2% of the translation starts of GenBank annotated Bacillus subtilis genes and 94.4% of translation starts in an experimentally validated set of Escherichia coli genes. GeneMarkS can detect prokaryotic genes, in terms of identifying open reading frames containing real genes, with an accuracy matching the level of the best currently used gene detection methods. Accurate translation start prediction, in addition to the refinement of protein sequence N-terminal data, provides the benefit of precise positioning of the sequence region situated upstream to a gene start.
Step-by-step diagram of the GeneMarkS procedure
Figure 2. (A) In the process of GeneMarkS training there is no division of the coding sequence into two clusters. (B) The state ‘gene’ represents a sequence composed of an RBS plus a spacer plus the protein-coding sequence (CDS). Gene overlaps encompass all possible types of superpositions: overlap of genes on the same strand (as observed in operons), overlap of genes on opposite strands, overlap of coding region with RBS, and so on.
Sequence: File upload (upload the circular genome)
Running options: Use Prokaryotic Version
Output options:
- Email address (to receive the result via email)
- Translate GeneMarkS predicted genes into proteins (get a list of protein translations of predicted genes in FASTA format; ideal for a smooth transition to using protein data)
Run: Start GeneMarkS
Result:
1. Protein translation: copy all of the ORFs and save them into a FASTA file
>Translation: 385..582 (direct), 66 amino acids
MLDLVELLTHWHAGRSQVRLSESLGIDRKTVRKYTAPAIAAGIEPGGEPLSAEQWAELIG
GWFPE*
….
2. Gene List
GeneMark.hmm PROKARYOTIC (Version 2.8)
Date: Wed Apr 20 09:25:23 2011
Sequence file name: sequence
Model file name: GeneMarkS_plus_Heuristic_AT_and_NONC.mod
RBS: Y
Model information: Pseudonative.model
FASTA definition line: empty-FASTA-def-line
Predicted genes
Save the content into a new FASTA file
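The coordinates needed later for the GFF file are already present in the translation headers. The Python sketch below extracts them from a header of the form shown in the protein translation result. The regular expression is based only on that single example; treating any orientation other than "direct" as the reverse strand is our assumption, as is the function name.

```python
import re

# Pattern derived from the one header shown in the result above
HEADER = re.compile(
    r"^>Translation:\s*(\d+)\.\.(\d+)\s*\((\w+)\),\s*(\d+) amino acids"
)


def parse_translation_header(line):
    """Return (start, end, strand, n_codons) from a GeneMarkS translation header."""
    match = HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised header: {line!r}")
    start, end, orientation, n_codons = match.groups()
    # Assumption: anything other than 'direct' is on the reverse strand
    strand = "+" if orientation == "direct" else "-"
    return int(start), int(end), strand, int(n_codons)


if __name__ == "__main__":
    print(parse_translation_header(">Translation: 385..582 (direct), 66 amino acids"))
```

In the example, positions 385..582 span 198 bases, i.e. 66 codons, which agrees with the count in the header (the final codon is the stop, shown as * in the translation).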
Step 9: Convert the full ORF FASTA file (obtained from GeneMarkS) to tabular format with Galaxy and edit it with MS Excel
Galaxy: http://main.g2.bx.psu.edu/
Converting the file to tabular format lets us open it with MS Excel and manipulate it easily.
- Upload full_orf_mt_hue_20_sorted.faa and convert it to tabular format.
- Save the tabular file and open it with MS Excel.
- Insert a new column (column A, i.e. C1) and fill it with ORF labels (orf_0001 … orf_####).
- Save this tabular file and convert it back to FASTA format.
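The Galaxy and Excel steps above can also be scripted. The following Python sketch converts FASTA text to rows, numbers the ORFs as orf_0001, orf_0002, … and writes them back as tab-separated text or as FASTA. It is an alternative to the workflow described, not a reproduction of the Galaxy tool; the function names and the choice to use the ORF label alone as the new FASTA header are our own.

```python
def fasta_to_rows(fasta_text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    rows, header, chunks = [], None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                rows.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        rows.append((header, "".join(chunks)))
    return rows


def label_orfs(rows, prefix="orf_", width=4):
    """Prepend a zero-padded ORF label to every (header, sequence) row."""
    return [(f"{prefix}{i:0{width}d}", header, seq)
            for i, (header, seq) in enumerate(rows, start=1)]


def to_tabular(labelled):
    """One tab-separated line per ORF: label, original header, sequence."""
    return "\n".join("\t".join(row) for row in labelled)


def to_fasta(labelled):
    """FASTA text whose headers are the ORF labels."""
    return "\n".join(f">{orf_id}\n{seq}" for orf_id, _, seq in labelled)
```

Reading full_orf_mt_hue_20_sorted.faa, passing its text through fasta_to_rows and label_orfs, and saving the output of to_fasta would give the relabelled file used in the next step.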
Step 10: BLAST the ORFs against the NCBI server via Blast2GO
Blast2GO is an ALL in ONE tool for functional annotation of (novel) sequences and the analysis of annotation data. Blast2GO can annotate thousands of sequences in one session. We can follow and modify the annotation process at any stage.
Pipeline
Start Blast2GO by Java Web Start
Requirements:
- A working Java installation (version > 1.5; the latest version is 1.6)
- At least 512 MB of free system memory (recommended: 2000-3000 MB)
- A high-speed internet connection
A. BLAST all ORFs against the NCBI server
- Create a new project, then import full_orf_mt_hue_20_sorted.faa (with the ORF order added).
- Run the BLAST step with the configuration below.
- The BLAST process can be paused, the data saved, and the process resumed later. With the 4757 ORFs of the MT_HUE_20 sample and Blast Hits = 20, it takes about 24 hours over a high-speed internet connection. In this case, however, we use Blast Hits = 5.
- When the BLAST process has finished, export the BLAST result as a FASTA file: File > Export > Exports as FASTA
Step 11: Create a GFF file to annotate the circular genome MT_HUE_20
GFF (General Feature Format) lines are based on the GFF standard file format. GFF lines have nine required fields that must be tab-separated. If the fields are separated by spaces instead of tabs, the track will not display correctly. Here is a brief description of the GFF fields:
1. seqname - The name of the sequence. Must be a chromosome or scaffold.
2. source - The program that generated this feature.
3. feature - The name of this type of feature. Some examples of standard feature types are "CDS", "start_codon", "stop_codon", and "exon".
4. start - The starting position of the feature in the sequence. The first base is numbered 1.
5. end - The ending position of the feature (inclusive).
6. score - A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray). If there is no score value, enter ".".
7. strand - Valid entries include '+', '-', or '.' (for don't know/don't care).
8. frame - If the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be '.'.
9. group - All lines with the same group are linked together into a single item.
Example:
MT_HUE_20_circular GeneMarkS source 1 4559459 . + . Name source
MT_HUE_20_circular GeneMarkS CDS 385 582 . + . Name orf_0001 ; locus_tag integrase catalytic region
MT_HUE_20_circular GeneMarkS CDS 665 2749 . + . Name orf_0002 ; locus_tag transposase
MT_HUE_20_circular GeneMarkS CDS 2991 5033 . + . Name orf_0003 ; locus_tag conserved hypothetical protein
MT_HUE_20_circular GeneMarkS CDS 5046 5168 . + . Name orf_0004 ; locus_tag ---NA---
MT_HUE_20_circular GeneMarkS CDS 5333 6586 . + . Name orf_0005 ; locus_tag cytochrome p450 125 cyp125
MT_HUE_20_circular GeneMarkS CDS 6586 7605 . + . Name orf_0006 ; locus_tag acyl- dehydrogenase fade28
MT_HUE_20_circular GeneMarkS CDS 7683 8753 . + . Name orf_0007 ; locus_tag acyl- dehydrogenase fade29
- Convert the ORFs' BLAST result to tabular format with Galaxy.
- Open the tabular file with MS Excel and split the content of the first column into 2 columns:
orf_0001|integrase catalytic region => orf_0001 integrase catalytic region
Data > Text to Columns > Delimited with | > Finish
- Delete the values of the amino acid sequence column.
- Create the GFF file from the tabular file and the gene list with MS Excel, where:
C1: name of the MT circular genome (MT_HUE_20_circular)
C9: =CONCATENATE("Name ",#column orf_number," ; ","locus_tag ", #column Sequence desc.)
- Copy the content of the Excel file and paste it into a .txt file.
- Rename this file: mt_hue_20_circular.gff
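The Excel assembly of the GFF lines can be sketched in Python as well, which also guarantees that the nine fields are separated by tabs. The group field is built exactly like the CONCATENATE formula above; the function names and the fixed "." for score and frame are our own choices, following the example lines.

```python
def gff_line(seqname, source, feature, start, end, strand, orf_id, description):
    """Build one tab-separated, nine-field GFF line for an ORF."""
    # Same layout as =CONCATENATE("Name ", orf, " ; ", "locus_tag ", description)
    group = f"Name {orf_id} ; locus_tag {description}"
    fields = [seqname, source, feature, str(start), str(end), ".", strand, ".", group]
    return "\t".join(fields)


def write_gff(path, records, seqname="MT_HUE_20_circular", source="GeneMarkS"):
    """Write records given as (start, end, strand, orf_id, description) tuples."""
    with open(path, "w") as out:
        for start, end, strand, orf_id, description in records:
            out.write(gff_line(seqname, source, "CDS", start, end,
                               strand, orf_id, description) + "\n")


if __name__ == "__main__":
    # First CDS of the example above
    print(gff_line("MT_HUE_20_circular", "GeneMarkS", "CDS",
                   385, 582, "+", "orf_0001", "integrase catalytic region"))
```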
Step 12: Open the GFF file with Geneious
To obtain the full annotated genome of the MT_HUE_20 strain, we combine in Geneious the circular sequence obtained from the contigs, the sequence descriptions obtained from the BLAST of all ORFs, and the GFF file.
- Open Geneious and create a new folder named GFF.
- Import the mt_hue_20_circular.gff file.
- Select the sequence for this GFF file (mt_hue_20_circular.fasta).
- Visualize the genome in circular form: Tool -> Circular Sequences
- Zoom in or out to find a specific ORF.
A long fragment of the MT_HUE_20 genome containing many ORFs
Step 13: Work with specific genes in the annotated genome of MT_HUE_20
To find a specific gene, for example the RNA polymerase beta subunit (rpoB) gene, we look it up in the top-BLAST data to identify the name of its ORF. In this case: >orf_1934|dna-directed rna polymerase subunit beta rpob
We then use Geneious to analyze this sequence, for example to export it, BLAST it against the NCBI server, or look for mutations.
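Looking up an ORF by gene name can be automated with a short Python sketch that scans the headers of the exported BLAST FASTA file. It assumes headers of the form >orf_####|description, as in the example above; the function name and the file name in the usage note are our own.

```python
def find_orfs(fasta_path, keyword):
    """Return (orf_id, description) for headers whose description contains keyword."""
    hits = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                # Headers are assumed to look like: >orf_1934|description
                orf_id, _, description = line[1:].strip().partition("|")
                if keyword.lower() in description.lower():
                    hits.append((orf_id, description))
    return hits
```

For example, find_orfs("topblast.fasta", "rpob") would return the matching ORF identifiers and their descriptions. The match is a case-insensitive substring search, so a short keyword may return several ORFs.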
Part 03 Bowtie 0.12.7, MagicViewer
1. Bowtie is an ultrafast, memory-efficient short read aligner geared toward quickly aligning large sets of short DNA sequences (reads) to large genomes. It aligns 35-base-pair reads to the human genome at a rate of 25 million reads per hour on a typical workstation. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: for the human genome, the index is typically about 2.2 GB (for unpaired alignment) or 2.9 GB (for paired-end or colorspace alignment). Multiple processors can be used simultaneously to achieve greater alignment speed. Bowtie can also output alignments in the standard SAM format, allowing Bowtie to interoperate with other tools supporting SAM, including the SAMtools consensus, SNP, and indel callers. Bowtie runs on the command line under Windows, Mac OS X, Linux, and Solaris.
Bowtie also forms the basis for other tools, including TopHat, a fast splice junction mapper for RNA-seq reads; Cufflinks, a tool for transcriptome assembly and isoform quantitation from RNA-seq reads; Crossbow, a cloud-computing software tool for large-scale resequencing data; and Myrna, a cloud-computing tool for calculating differential gene expression in large RNA-seq datasets.
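Windows shell: align the full set of sequence reads (FASTQ) and write the alignments to a .SAM file
D:\Softwares\Biotool\bowtie-0.12.7>bowtie.exe -S ./indexes/Test1/fullseq.fastq align_mt.sam
Syntax: bowtie_folder>bowtie.exe -S ./[path_file_fullseq.fastq] ./[path_file_fullseq.sam]
Note that Bowtie's general usage is bowtie [options] <index_basename> <reads> [<output>]: the basename of the reference index (built with bowtie-build) comes before the reads file.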
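Before loading the .SAM file into a viewer, it can be useful to check how many reads were aligned. The Python sketch below counts alignment records and those without the "unmapped" flag (bit 0x4 of the FLAG field, the second column of the SAM format). It is a generic SAM check, not a Bowtie feature, and the function name is our own.

```python
def sam_summary(path):
    """Return (total_records, mapped_records) for a SAM file."""
    total = mapped = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith("@"):
                continue  # header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 11:
                continue  # not a complete alignment record
            total += 1
            flag = int(fields[1])
            if not flag & 0x4:  # bit 0x4 set means the read is unmapped
                mapped += 1
    return total, mapped
```

For example, total, mapped = sam_summary("align_mt.sam") gives the two counts for the file produced above. The counts are per record, so a read reported with several alignments is counted more than once.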
2. MagicViewer helps us analyze data from a variety of genome sequencing applications, such as de novo sequencing, transcriptome sequencing and targeted re-sequencing (especially exon capture and high-throughput sequencing), for purposes such as mapping, SNP detection or association studies.
Analyze .SAM file with MagicViewer_1.2.1_i386_win32 program
Step 1: Run MagicViewer.bat file with Windows Shell
D:\Softwares\Biotools\MagicViewer_1.2.1_i386_win32>MagicViewer.bat
Step 2: Convert the .SAM file to a sorted, indexed .BAM file
Create a new project, then input the reference genome FASTA file (H37Rv genome from NCBI) and the alignment file (the full-sequence SAM file). MagicViewer will convert it to a sorted, indexed BAM file.
…… (still being written)