MCB 432 Final Table PP 01.06.16

Keegan McAuliffe, MCB 432: Computing in Molecular Biology. The following is my final presentation for MCB 432, detailing the process our group undertook to determine the identity of an unknown bacterium. We were provided with raw sequence reads from the bacterium, which we converted into contigs and scaffolds. We assembled the data into a genome, then annotated potential genes to determine the identity of the bacterium as Bacteroides vulgatus str. 3975.

Upload: keegan-mcauliffe

Post on 22-Mar-2017




Keegan McAuliffe, Henry Chen, Andrew Storm, Dominic Gentile
Team 10 Results and Discussion

Introduction: The advent of high-throughput sequencing has increased our ability to analyze genetic information. In this project, we demonstrate how to use raw sequence data from sampled organisms for genetic and genomic analysis. From the raw sequenced reads provided by the PI, we assembled a genome for our unknown microorganism using the A5ud assembler program (Table 1). With the data generated, we were able to determine the total number of contigs and scaffolds and use these assemblies to predict and annotate genes (Table 2).

With the assembled genome in hand, we could search and analyze predicted genes in order to characterize our unknown organism. We predicted genes with the Prodigal algorithm. Prodigal generates gene and protein predictions, but it does not indicate what those predicted genes and proteins represent. We therefore needed other programs to annotate our predictions, and because genes are so complex, we had to be specific in choosing programs for gene analysis. For instance, EMBOSS programs search an assembly for alignments and patterns against databases of well-known genes, HMM and BLAST searches compare protein homology, and many other programs search for features such as tRNAs and signal peptides. Here we describe how we carried out these analyses and present our results.


Results: (Optional tasks) The objective of Optional Task 1 was to determine the GC content of each gene. To obtain this information, we first had to assemble our reads into contigs and scaffolds, which was the objective of Mandatory Task 1. We decompressed the read data with the "gunzip" command and then ran the A5ud assembler on it. The assembler generated a quality trimming report, an assembly report, an initial scaffolding report, and a final scaffold quality check, along with files of error-corrected reads, contigs, crude scaffolds, broken scaffolds, and final scaffolds. The assembly report contained the GC content for each contig, which we added to Table 3. The average GC content for all contigs is 0.407. Because G-C base pairs are more stable than A-T base pairs, our genome should be less thermally stable than a genome with a GC content greater than 0.500.

The objective of Optional Task 3 was to determine the best BlastP match for each of our predicted proteins against the NR database. The first step was to determine the proper command to report a single best match from the NR database for each predicted protein (CDS), with an E-value less than 1e-10, along with the accession number and percent identity. The command we used was:

    blastp -db nr -query TeamProject.faa -out TeamProject.br -evalue 1E-10 -outfmt 6 -max_target_seqs 1

This command gave us the E-value, accession number, and percent identity of the best blastp match for each CDS. However, we still needed the organism name and the description of the gene. For this, we used the program efetch.pl. Given a list of accession numbers as input, efetch.pl returned the organism name and gene annotation for each gene of interest. These data were recorded in Table 5. This task was also instrumental in determining the genus, species, and strain most closely related to our scaffolds.
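
As a minimal sketch of how the tabular blastp report could be filtered afterwards, the Python below reads rows in the default 12-column "-outfmt 6" layout and keeps one hit per query under the E-value cutoff. The function name best_hits and the example rows are hypothetical and are not taken from TeamProject.br; choosing the highest bit score as "best" is also an assumption.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, max_evalue=1e-10):
    """Return {query: row}, keeping the highest-bitscore hit per query.

    Hits with an E-value above max_evalue are skipped.
    """
    best = {}
    for row in csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t"):
        if float(row["evalue"]) > max_evalue:
            continue
        current = best.get(row["qseqid"])
        if current is None or float(row["bitscore"]) > float(current["bitscore"]):
            best[row["qseqid"]] = row
    return best


# Made-up rows for illustration only (hypothetical accessions).
example = (
    "1_1\tgi|111|ref|WP_000001.1|\t100.00\t202\t0\t0\t1\t202\t1\t202\t8e-88\t270\n"
    "1_1\tgi|222|ref|WP_000002.1|\t95.00\t202\t10\t0\t1\t202\t1\t202\t1e-80\t250\n"
    "1_2\tgi|333|ref|WP_000003.1|\t40.00\t50\t30\t0\t1\t50\t1\t50\t1e-05\t40\n"
)
hits = best_hits(io.StringIO(example))
for query, row in hits.items():
    print(query, row["sseqid"], row["pident"], row["evalue"])
```

With "-max_target_seqs 1" the report already lists few hits per query, so a filter like this mainly guards against a query that reports several alignments.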


The best blastp match for every CDS was of the genus Bacteroides, and the overwhelming majority of the matches specific to species were Bacteroides vulgatus. More specifically, the strain Bacteroides vulgatus str. 3975 RP4 was the best match for 9 of the 104 CDSs, which represents 60% of the 15 blast results specific enough to indicate a strain. These data led us to conclude that Bacteroides vulgatus str. 3975 is the most closely related strain.

The objective of Optional Tasks 4 and 5 was to analyze the CDSs for possible proteins and genes. The predicted proteins from the scaffolds were analyzed using PFAM to determine possible protein matches and TIGRFAM to determine possible gene matches. The hmmscan for the PFAM matches used the Pfam-A database and TeamProject.faa. The hmmscan for the TIGRFAM matches used the TIGRFAMs_14.0.HMM database and TeamProject.faa. The results, taken from TeamProject_pfam.txt and TeamProject_tigrfam.txt, were compiled into Table 6 and Table 7. Only the best match for each CDS was added to Table 3.

The PFAM hmmscan revealed that many of the CDSs had at least one related protein. For CDSs with multiple matches, the predicted proteins were all closely related. For example, all the predicted proteins for the 1_83 CDS are from glycosyl transferase family 2. The TIGRFAM search produced fewer matches: only 33, compared with 191 for the PFAM search. Most of the CDSs with TIGRFAM matches have only one match. Only CDSs 1_15, 1_39, 1_82, and 1_85 have multiple matches, and each of these has only two, whereas several CDSs have four or five PFAM matches. For the CDSs that had both TIGRFAM and PFAM matches, the two searches predicted similar functions.
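
Selecting a single best domain match per CDS can be sketched as below, using a few rows copied from Table 6. This assumes that "best" means the lowest E-value, which the report does not state explicitly, and the function name best_match_per_cds is hypothetical.

```python
# Rows copied from Table 6: (CDS name, Pfam ID, description, E-value).
rows = [
    ("scaffold1.1_7", "PF00578.16", "AhpC/TSA family", 1.30e-11),
    ("scaffold1.1_7", "PF00255.14", "Glutathione peroxidase", 4.20e-08),
    ("scaffold1.1_7", "PF14289.1", "Domain of unknown function (DUF4369)", 1.70e-06),
    ("scaffold1.1_12", "PF01715.12", "IPP transferase", 7.70e-64),
    ("scaffold1.1_12", "PF01745.11", "Isopentenyl transferase", 3.00e-12),
]


def best_match_per_cds(rows):
    """Keep the match with the lowest E-value for each CDS (assumed criterion)."""
    best = {}
    for name, accession, description, evalue in rows:
        if name not in best or evalue < best[name][2]:
            best[name] = (accession, description, evalue)
    return best


best = best_match_per_cds(rows)
for name, (accession, description, evalue) in best.items():
    print(name, accession, description, evalue)
```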


Optional Task 6 used PHYRE2 to analyze CDSs 1.1_1, 1.1_4, 1.1_14, 1.1_19, 1.1_32, 1.1_54, 1.1_57, 1.1_60, 1.1_68, and 2.1_8. All CDSs had a confidence of 100.0 except 1.1_1 and 1.1_32, which had confidences of 61.1 and 49.4, respectively. The PHYRE2 predicted proteins agree with the PFAM predictions for all except 1.1_1, 1.1_32, 1.1_57, and 1.1_60. The other possible PHYRE2 matches for these CDSs were also not the same as the PFAM results. This may be because the structures of the PFAM matches are not in the PHYRE2 database.

For Optional Task 7, we looked for more specific features such as signal peptides. By comparing our assembled scaffolds (team.fasta) to a reference database of Gram-negative prokaryotes, we were able to identify potential signal peptides and determine their lengths. We compared our data to Gram-negative prokaryotes because our previous blast analysis matched our genes and proteins to those found in the Gram-negative genus Bacteroides. The output data (located in the file TeamProj_SigP_Summary.txt) denoted the presence or absence of a signal peptide and the cutoff point of each peptide (C-value). This allowed us to determine the predicted lengths of the peptides. The results can be found in Table 3.

The objective of Optional Task 8 was to test for the presence of rho-independent transcriptional terminators. This is a particularly useful analysis, as intrinsic terminators typically mark the ends of genes that are actively transcribed. To accomplish this task, we ran our genome assembly (team.fasta) through a rho-independent terminator search while supplying the search with predicted gene coordinates. These coordinates were determined through our EMBOSS infoseq analysis of the predicted proteins on our assembly and restructured into the TeamProj.coords file for use with the terminator analysis program. The report generated can be found in the file TeamProj_tt + TeamProj_tt.txt, and the predicted genes with identifiable rho-independent terminators are listed in Table 3.


Optional Task 9 determined whether we could find any homologous RNA secondary structures in our assembled genome. As with other genes, the structure of a tRNA can provide valuable information on the function and origin of the gene, which is useful when characterizing an unknown genome. With our assembled genome in hand (team.fasta), we searched for matches to conserved RNA structures using a handful of RFAM families: RF00005, RF00010, RF00023, RF00029, RF00059, RF00174, RF00177, RF01693, RF01694, RF01726, RF01998, and RF02001. The data can be found in TeamProj_RF*.txt. Our search found only 1 tRNA match, and information on the matched gene is included in Table 3.

For Optional Task 14, we constructed an alignment of our scaffolds with the genome of the bacterial strain with the most sequence matches, which we determined to be Bacteroides vulgatus str. 3975 RP4. On NCBI, we found 184 contigs from a whole-genome sequencing project for this strain. We concatenated these contigs into a single sequence representing the whole genome, to which we compared our scaffolds using blastn. With that blast report as a reference, we aligned the genomes using "act" and saved a screenshot of part of the alignment as Figure 3.
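
The concatenation step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the exact procedure we used: the function names, the record name "concatenated_reference", and the toy sequences are hypothetical, and contigs are joined in file order with no spacer, so the result is not a true genome ordering.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def concatenate(records, name="concatenated_reference", width=60):
    """Join all sequences, in input order, into one FASTA record."""
    sequence = "".join(seq for _, seq in records)
    body = "\n".join(sequence[i:i + width] for i in range(0, len(sequence), width))
    return f">{name}\n{body}\n"


# Toy contigs for illustration only.
contig_lines = [">contig_1", "ACGT", "AC", ">contig_2", "GGTT"]
merged = concatenate(read_fasta(contig_lines))
print(merged)
```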


Discussion: As we previously alluded to in discussing the results of Optional Task 3, we used blastp to determine the best match of each CDS within the NR database. These data, located in Table 5, clearly indicate that the genus of the closest relative is Bacteroides: according to our blastp results, the best match of every CDS belongs to the genus Bacteroides. We can further assert that the species is Bacteroides vulgatus. Of the 104 CDSs, 43 list Bacteroides vulgatus as their best match, and of the blast matches that were specific to species, 43 of 49 (87.76%) list Bacteroides vulgatus. We can delve even deeper into the identity of the closest relative: among the 104 CDSs we searched, the strain Bacteroides vulgatus str. 3975 RP4 was the best match 9 times. Thus, 9 of the 15 blast results specific enough to indicate a strain list Bacteroides vulgatus str. 3975 RP4. These data led us to conclude that Bacteroides vulgatus str. 3975 is the most closely related strain.
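
The proportions quoted above can be checked with a few lines of Python. The counts are those reported in this Discussion; the helper name percent and the variable names are hypothetical.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Counts reported in the Discussion.
total_cds = 104      # CDSs with a best blastp hit
species_level = 49   # hits specific enough to name a species
vulgatus = 43        # hits naming Bacteroides vulgatus
strain_level = 15    # hits specific enough to name a strain
str_3975_rp4 = 9     # hits naming B. vulgatus str. 3975 RP4

print(percent(vulgatus, total_cds))         # share of all CDSs
print(percent(vulgatus, species_level))     # share of species-level hits
print(percent(str_3975_rp4, strain_level))  # share of strain-level hits
```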


Appendix: Contains 7 tables with the raw data used to create our Results and Discussion sections, along with 1 figure showing our genome alignment.


Table 1. Genome assembly statistics for Team 10

No. of read pairs              47893
No. of low quality reads       1763
No. of assembled reads         102640
No. of unassembled reads       2382
No. of contigs                 2
No. of scaffolds               2
Total nt length of scaffolds   126196

Contig         Length    %G+C     No. of reads mapped   Coverage
Contig 100.0   119,977   40.61%   4851245               6065.0
Contig 100.1   6,219     37.58%   240956                5811.0


Table 2. Gene annotation summary for scaffolds

Scaffold      CDS/ORFs   tRNAs   other RNAs
scaffold1.1   95         0       0
scaffold2.1   9          1       0


Table 3. Predicted Gene Coordinates

Scaffold      Name  Type  Start  Stop   Strand  NT Length  AA Length  GC fraction  Signal Peptide?  SP Length (AA)  Best Blast Hit                    Blast description
scaffold 1.1  1_1   CDS   3      611    -       609        202        0.406        N                n/a             gi|496057719|ref|WP_008782226.1|  transposase, partial
scaffold 1.1  1_2   CDS   845    3022   -       2178       725        0.405        Y                21              gi|649547948|gb|KDS54658.1|       hypothetical protein M099_1756
scaffold 1.1  1_3   CDS   3539   3766   -       228        75         0.403        N                n/a             gi|649547946|gb|KDS54656.1|       glycoside hydrolase family 88 domain protein
scaffold 1.1  1_4   CDS   3949   4905   -       957        318        0.383        N                n/a             gi|492435030|ref|WP_005843062.1|  MULTISPECIES: transcriptional regulator
scaffold 1.1  1_5   CDS   5062   6291   +       1230       409        0.408        N                n/a             gi|492435027|ref|WP_005843060.1|  TonB-dependent receptor
scaffold 1.1  1_6   CDS   6311   7198   +       888        295        0.429        Y                18              gi|492435023|ref|WP_005843058.1|  hypothetical protein
scaffold 1.1  1_7   CDS   7536   8942   +       1407       468        0.396        Y                21              gi|649547942|gb|KDS54652.1|       ahpC/TSA family protein
scaffold 1.1  1_8   CDS   9027   9767   -       741        246        0.396        N                n/a             gi|649547941|gb|KDS54651.1|       ahpC/TSA family protein
scaffold 1.1  1_9   CDS   10111  12657  +       2547       848        0.421        N                n/a             gi|495945682|ref|WP_008670261.1|  MULTISPECIES: hypothetical protein
scaffold 1.1  1_10  CDS   12750  15755  -       3006       1001       0.36         N                n/a             gi|495945680|ref|WP_008670259.1|  MULTISPECIES: hypothetical protein
scaffold 1.1  1_11  CDS   15884  16252  +       369        122        0.477        Y                19              gi|492458337|ref|WP_005851052.1|  alpha-L-fucosidase
scaffold 1.1  1_12  CDS   16394  17275  -       882        293        0.468        N                n/a             gi|492434987|ref|WP_005843035.1|  tRNA dimethylallyltransferase 1
scaffold 1.1  1_13  CDS   17363  18388  -       1026       341        0.429        N                n/a             gi|492434984|ref|WP_005843033.1|  MULTISPECIES: hypothetical protein
scaffold 1.1  1_14  CDS   18424  19740  -       1317       438        0.432        N                n/a             gi|492434981|ref|WP_005843031.1|  MULTISPECIES: UDP-N-acetylglucosamine acyltransferase
scaffold 1.1  1_15  CDS   19846  21519  +       1674       557        0.476        N                n/a             gi|492458346|ref|WP_005851058.1|  MULTISPECIES: hydroxymyristoyl-ACP dehydratase
scaffold 1.1  1_16  CDS   21680  21880  +       201        66         0.454        N                n/a             gi|492458349|ref|WP_005851060.1|  MULTISPECIES: UDP-3-O-acylglucosamine N-acyltransferase
scaffold 1.1  1_17  CDS   22035  22727  +       693        230        0.43         N                n/a             gi|500644323|ref|WP_011964621.1|  phosphohydrolase
scaffold 1.1  1_18  CDS   22796  23239  -       444        147        0.453        N                n/a             gi|492434969|ref|WP_005843024.1|  MULTISPECIES: orotidine 5'-phosphate decarboxylase
scaffold 1.1  1_19  CDS   23255  23524  -       270        89         0.47         N                n/a             gi|492434967|ref|WP_005843023.1|  MULTISPECIES: peptide chain release factor 1
scaffold 1.1  1_20  CDS   23527  23871  -       345        114        0.471        N                n/a             gi|492458355|ref|WP_005851064.1|  MULTISPECIES: phosphoribosylformylglycinamidine cyclo-ligase
scaffold 1.1  1_21  CDS   24081  24527  +       447        148        0.31         N                n/a             gi|492434963|ref|WP_005843021.1|  hypothetical protein
scaffold 1.1  1_22  CDS   24636  24818  +       183        60         0.409        N                n/a             gi|492434961|ref|WP_005843020.1|  MULTISPECIES: toxin Fic
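
The derived columns of Table 3 follow from the coordinates and the sequence, as the short Python sketch below shows. It assumes 1-based inclusive coordinates and that each CDS ends with a stop codon that is not counted in the protein length. The function names cds_lengths and gc_fraction are hypothetical, and the test sequence is a toy example rather than one of our genes.

```python
def cds_lengths(start, stop):
    """Return (nucleotide length, amino-acid length) for a CDS.

    Coordinates are treated as 1-based and inclusive; the final codon
    is assumed to be a stop codon and is not translated.
    """
    nt_length = stop - start + 1
    aa_length = nt_length // 3 - 1
    return nt_length, aa_length


def gc_fraction(sequence):
    """Return the fraction of G and C bases, rounded to three decimals."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    gc = sequence.count("G") + sequence.count("C")
    return round(gc / len(sequence), 3)


# Coordinates of CDS 1_1 and 1_2 from Table 3.
print(cds_lengths(3, 611))     # (609, 202)
print(cds_lengths(845, 3022))  # (2178, 725)
print(gc_fraction("ATGCATTA"))  # toy sequence: 2 of 8 bases are G or C
```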


Table 5. Single best blast hit of annotated ORFs from Team 10

Name  Gene Identifier                   Description                                                   Organism                            % identity  E-value
1_1   gi|496057719|ref|WP_008782226.1|  transposase, partial                                          Bacteroides sp. 3_1_40A             100         8.00E-88
1_2   gi|649547948|gb|KDS54658.1|       hypothetical protein M099_1756                                Bacteroides vulgatus str. 3975 RP4  100         4.00E-62
1_3   gi|649547946|gb|KDS54656.1|       glycoside hydrolase family 88 domain protein                  Bacteroides vulgatus str. 3975 RP4  100         6.00E-62
1_4   gi|492435030|ref|WP_005843062.1|  MULTISPECIES: transcriptional regulator                       Bacteroides                         100         5.00E-82
1_5   gi|492435027|ref|WP_005843060.1|  TonB-dependent receptor                                       Bacteroides vulgatus                100         0
1_6   gi|492435023|ref|WP_005843058.1|  hypothetical protein                                          Bacteroides vulgatus                100         0
1_7   gi|649547942|gb|KDS54652.1|       ahpC/TSA family protein                                       Bacteroides vulgatus str. 3975 RP4  100         0
1_8   gi|649547941|gb|KDS54651.1|       ahpC/TSA family protein                                       Bacteroides vulgatus str. 3975 RP4  100         0
1_9   gi|495945682|ref|WP_008670261.1|  MULTISPECIES: hypothetical protein                            Bacteroides                         99.61       0
1_10  gi|495945680|ref|WP_008670259.1|  MULTISPECIES: hypothetical protein                            Bacteroides                         97.22       2.00E-16
1_11  gi|492458337|ref|WP_005851052.1|  alpha-L-fucosidase                                            Bacteroides vulgatus                100         0
1_12  gi|492434987|ref|WP_005843035.1|  tRNA dimethylallyltransferase 1                               Bacteroides vulgatus                100         0
1_13  gi|492434984|ref|WP_005843033.1|  MULTISPECIES: hypothetical protein                            Bacteroides                         100         9.00E-131
1_14  gi|492434981|ref|WP_005843031.1|  MULTISPECIES: UDP-N-acetylglucosamine acyltransferase         Bacteroides                         100         3.00E-180
1_15  gi|492458346|ref|WP_005851058.1|  MULTISPECIES: hydroxymyristoyl-ACP dehydratase                Bacteroides                         100         0
1_16  gi|492458349|ref|WP_005851060.1|  MULTISPECIES: UDP-3-O-acylglucosamine N-acyltransferase       Bacteroides                         100         0
1_17  gi|500644323|ref|WP_011964621.1|  phosphohydrolase                                              Bacteroides vulgatus                100         0
1_18  gi|492434969|ref|WP_005843024.1|  MULTISPECIES: orotidine 5'-phosphate decarboxylase            Bacteroides                         100         0
1_19  gi|492434967|ref|WP_005843023.1|  MULTISPECIES: peptide chain release factor 1                  Bacteroides                         100         0
1_20  gi|492458355|ref|WP_005851064.1|  MULTISPECIES: phosphoribosylformylglycinamidine cyclo-ligase  Bacteroides                         100         0
1_21  gi|492434963|ref|WP_005843021.1|  hypothetical protein                                          Bacteroides vulgatus                100         6.00E-138
1_22  gi|492434961|ref|WP_005843020.1|  MULTISPECIES: toxin Fic                                       Bacteroides                         100         0
1_23  gi|492458359|ref|WP_005851066.1|  MULTISPECIES: hypothetical protein                            Bacteroides                         100         6.00E-43
1_24  gi|492434958|ref|WP_005843019.1|  hypothetical protein                                          Bacteroides vulgatus                99.64       0
1_25  gi|492458364|ref|WP_005851068.1|  MULTISPECIES: hypothetical protein                            Bacteroides                         100         0
1_26  gi|492458366|ref|WP_005851069.1|  MULTISPECIES: membrane protein                                Bacteroides                         100         2.00E-43
1_27  gi|492458368|ref|WP_005851070.1|  MULTISPECIES: hypothetical protein                            Bacteroides                         100         9.00E-114
1_28  gi|492458370|ref|WP_005851071.1|  MULTISPECIES: beta-N-acetylhexosaminidase                     Bacteroides                         100         0
1_29  gi|492434942|ref|WP_005843009.1|  MULTISPECIES: endonuclease                                    Bacteroides                         99.71       0
1_30  gi|511016443|ref|WP_016270813.1|  excinuclease ABC subunit A                                    Bacteroides vulgatus                100         0
1_31  gi|492434935|ref|WP_005843004.1|  MULTISPECIES: hypothetical protein                            Bacteroides                         100         0
1_32  gi|492434933|ref|WP_005843003.1|  MULTISPECIES: chromate transporter                            Bacteroides                         100         1.00E-131
1_33  gi|492434930|ref|WP_005843001.1|  MULTISPECIES: chromate transporter                            Bacteroides                         100         1.00E-105
1_34  gi|511016442|ref|WP_016270812.1|  hypothetical protein                                          Bacteroides vulgatus                100         0
1_35  gi|511016441|ref|WP_016270811.1|  phosphoribosylformylglycinamidine synthase                    Bacteroides vulgatus                100         0
1_36  gi|492434921|ref|WP_005842995.1|  MULTISPECIES: translocator protein, LysE family               Bacteroides                         100         4.00E-150
1_37  gi|492434917|ref|WP_005842993.1|  MULTISPECIES: hypothetical protein                            Bacteroides                         100         5.00E-127
1_38  gi|492458387|ref|WP_005851079.1|  MULTISPECIES: dTDP-4-dehydrorhamnose reductase                Bacteroides                         100         0
1_39  gi|492434911|ref|WP_005842989.1|  MULTISPECIES: peptide chain release factor 3                  Bacteroides                         100         0
1_40  gi|492434907|ref|WP_005842987.1|  MULTISPECIES: molecular chaperone DnaJ                        Bacteroides                         100         0
1_41  gi|492434904|ref|WP_005842985.1|  dihydrofolate reductase                                       Bacteroides vulgatus                100         0
1_42  gi|548318542|ref|WP_022508241.1|  hypothetical protein                                          Bacteroides vulgatus CAG:6          100         1.00E-174
1_43  gi|492434896|ref|WP_005842980.1|  hypothetical protein                                          Bacteroides vulgatus                100         0
1_44  gi|492458409|ref|WP_005851092.1|  transcriptional regulator                                     Bacteroides vulgatus                99.7        0
1_45  gi|492434890|ref|WP_005842976.1|  MULTISPECIES: hypothetical protein                            Bacteroides                         100         1.00E-44
1_46  gi|492434887|ref|WP_005842974.1|  hypothetical protein                                          Bacteroides vulgatus                100         0
1_47  gi|500644291|ref|WP_011964611.1|  hypothetical protein                                          Bacteroides vulgatus                100         0


Table 6. PFAM domain matches for annotated genes from Team 10

Name            PFAM ID     Description                                   E value
scaffold1.1_1   PF01610.12  Transposase                                   2.90E-25
scaffold1.1_2   PF11396.3   Protein of unknown function (DUF2874)         7.80E-15
scaffold1.1_4   PF03965.11  Penicillinase repressor                       2.40E-25
scaffold1.1_5   PF03544.9   Gram-negative bacterial TonB protein C-termi  2.50E-23
scaffold1.1_5   PF13715.1   Domain of unknown function (DUF4480)          1.50E-16
scaffold1.1_5   PF05569.6   BlaR1 peptidase M56                           1.00E-11
scaffold1.1_5   PF13620.1   Carboxypeptidase regulatory-like domain       2.90E-10
scaffold1.1_5   PF07715.10  TonB-dependent Receptor Plug Domain           2.10E-06
scaffold1.1_6   PF14559.1   Tetratricopeptide repeat                      6.20E-13
scaffold1.1_6   PF13414.1   TPR repeat                                    6.70E-12
scaffold1.1_6   PF07719.12  Tetratricopeptide repeat                      2.90E-11
scaffold1.1_6   PF13428.1   Tetratricopeptide repeat                      2.00E-10
scaffold1.1_6   PF13432.1   Tetratricopeptide repeat                      9.60E-10
scaffold1.1_6   PF13429.1   Tetratricopeptide repeat                      5.30E-08
scaffold1.1_6   PF12895.2   Anaphase-promoting complex, cyclosome, subun  1.30E-07
scaffold1.1_6   PF13431.1   Tetratricopeptide repeat                      6.80E-06
scaffold1.1_7   PF00578.16  AhpC/TSA family                               1.30E-11
scaffold1.1_7   PF00255.14  Glutathione peroxidase                        4.20E-08
scaffold1.1_7   PF14289.1   Domain of unknown function (DUF4369)          1.70E-06
scaffold1.1_8   PF13905.1   Thioredoxin-like                              1.40E-14
scaffold1.1_8   PF13098.1   Thioredoxin-like domain                       1.90E-14
scaffold1.1_8   PF00085.15  Thioredoxin                                   2.70E-11
scaffold1.1_8   PF08534.5   Redoxin                                       4.30E-11
scaffold1.1_8   PF00578.16  AhpC/TSA family                               1.00E-07
scaffold1.1_11  PF01120.12  Alpha-L-fucosidase                            2.60E-87
scaffold1.1_12  PF01715.12  IPP transferase                               7.70E-64
scaffold1.1_12  PF01745.11  Isopentenyl transferase                       3.00E-12
scaffold1.1_12  PF04851.10  Type III restriction enzyme, res subunit      0.00022
scaffold1.1_13  PF07929.6   Plasmid pRiA4b ORF-3-like protein             4.00E-11
scaffold1.1_14  PF13720.1   Udp N-acetylglucosamine O-acyltransferase; D  1.20E-28
scaffold1.1_14  PF00132.19  Bacterial transferase hexapeptide (six repea  1.10E-25
scaffold1.1_15  PF03331.8   UDP-3-O-acyl N-acetylglycosamine deacetylase  6.00E-74
scaffold1.1_15  PF07977.8   FabA-like domain                              1.10E-35
scaffold1.1_16  PF00132.19  Bacterial transferase hexapeptide (six repea  1.10E-29
scaffold1.1_16  PF04613.9   UDP-3-O-[3-hydroxymyristoyl] glucosamine N-a  7.00E-17
scaffold1.1_16  PF14602.1   Hexapeptide repeat of succinyl-transferase    1.20E-10
scaffold1.1_17  PF01966.17  HD domain                                     2.90E-08
scaffold1.1_18  PF00215.19  Orotidine 5'-phosphate decarboxylase / HUMPS  9.20E-30
scaffold1.1_19  PF03462.13  PCRF domain                                   3.40E-39
scaffold1.1_19  PF00472.15  RF-1 domain                                   2.60E-33
scaffold1.1_20  PF02769.17  AIR synthase related protein, C-terminal dom  1.70E-12
scaffold1.1_22  PF13310.1   Virulence protein RhuM family                 5.70E-110
scaffold1.1_24  PF02638.10  Glycosyl hydrolase like GH101                 1.80E-53
scaffold1.1_24  PF13200.1   Putative glycosyl hydrolase domain            3.40E-07
scaffold1.1_25  PF02554.9   Carbon starvation protein CstA                8.90E-79
scaffold1.1_25  PF13722.1   C-terminal domain on CstA (DUF4161)           2.30E-24


Table 7. TIGRFAM domain matches for annotated genes from Team 10

Name            TIGRFAM ID  Description                                                                                     E value
scaffold1.1_5   TIGR04057   SusC_RagA_signa: TonB-dependent outer membrane receptor, SusC/RagA subfamily, signature region  2.70E-16
scaffold1.1_5   TIGR01352   tonB_Cterm: TonB family C-terminal domain                                                       2.70E-12
scaffold1.1_12  TIGR00174   miaA: tRNA dimethylallyltransferase                                                             5.90E-75
scaffold1.1_14  TIGR01852   lipid_A_lpxA: acyl-[acyl-carrier-protein]-UDP-N-acetylglucosamine O-acyltransferase             1.70E-92
scaffold1.1_15  TIGR00325   lpxC: UDP-3-O-[3-hydroxymyristoyl] N-acetylglucosamine deacetylase                              2.50E-56
scaffold1.1_15  TIGR01750   fabZ: beta-hydroxyacyl-(acyl-carrier-protein) dehydratase FabZ                                  3.90E-49
scaffold1.1_16  TIGR01853   lipid_A_lpxD: UDP-3-O-[3-hydroxymyristoyl] glucosamine N-acyltransferase LpxD                   3.60E-105
scaffold1.1_18  TIGR02127   pyrF_sub2: orotidine 5'-phosphate decarboxylase                                                 3.60E-72
scaffold1.1_19  TIGR00019   prfA: peptide chain release factor 1                                                            1.10E-137
scaffold1.1_30  TIGR00630   uvra: excinuclease ABC subunit A                                                                0
scaffold1.1_38  TIGR01214   rmlD: dTDP-4-dehydrorhamnose reductase                                                          1.90E-89
scaffold1.1_39  TIGR00503   prfC: peptide chain release factor 3                                                            6.10E-207
scaffold1.1_39  TIGR00231   small_GTP: small GTP-binding protein domain                                                     2.20E-25
scaffold1.1_49  TIGR02227   sigpep_I_bact: signal peptidase I                                                               1.30E-19
scaffold1.1_52  TIGR01730   RND_mfp: efflux transporter, RND family, MFP subunit                                            8.80E-48
scaffold1.1_56  TIGR00221   nagA: N-acetylglucosamine-6-phosphate deacetylase                                               1.30E-81
scaffold1.1_57  TIGR00057   TIGR00057: tRNA threonylcarbamoyl adenosine modification protein, Sua5/YciO/YrdC/YwlC family    1.20E-44
scaffold1.1_59  TIGR00460   fmt: methionyl-tRNA formyltransferase                                                           8.00E-81
scaffold1.1_61  TIGR02937   sigma70-ECF: RNA polymerase sigma factor, sigma-70 family                                       4.40E-29
scaffold1.1_63  TIGR01163   rpe: ribulose-phosphate 3-epimerase                                                             1.00E-83
scaffold1.1_64  TIGR00360   ComEC_N-term: ComEC/Rec2-related protein                                                        8.50E-27
scaffold1.1_67  TIGR03990   Arch_GlmM: phosphoglucosamine mutase                                                            1.80E-160
scaffold1.1_69  TIGR00539   hemN_rel: putative oxygen-independent coproporphyrinogen III oxidase                            4.50E-87
scaffold1.1_71  TIGR00231   small_GTP: small GTP-binding protein domain                                                     1.10E-18
scaffold1.1_76  TIGR00166   S6: ribosomal protein S6                                                                        2.00E-25
scaffold1.1_77  TIGR00165   S18: ribosomal protein S18                                                                      1.90E-33
scaffold1.1_78  TIGR00158   L9: ribosomal protein L9                                                                        1.00E-35
scaffold1.1_82  TIGR01579   MiaB-like-C: MiaB-like tRNA modifying enzyme                                                    3.00E-122
scaffold1.1_82  TIGR00089   TIGR00089: radical SAM methylthiotransferase, MiaB/RimO family                                  1.10E-113
scaffold1.1_85  TIGR00525   folB: dihydroneopterin aldolase                                                                 5.10E-30


Table 8. Phyre2 predicted best crystal structure matches for annotated genes from Team 10

Name    PDB best match  Pct_identity  Confidence  Aligned region  Description
1.1_1   c3f9kV          22            61.1        89-115          two domain fragment of hiv-2 integrase in complex with ledgf ibd
1.1_4   d1sd4a          19            100         3-120           Penicillinase repressor
1.1_14  c3i3aC          39            100         2-255           transferase, structural basis for the sugar nucleotide and acyl chain selectivity of leptospira interrogans lpxa
1.1_19  c3d5cX          43            100         8-369           peptide chain release factor 1, structural basis for translation termination on the 70s ribosome
1.1_32  c3dboA          29            49.4        36-67           toxin/antitoxin, crystal structure of a member of the vapbc family of toxin-antitoxin systems, vapbc-5, from mycobacterium tuberculosis
1.1_54  c4mt4C          12            100         27-478          transport protein, crystal structure of the campylobacter jejuni cmec outer membrane channel
1.1_57  c2eqaA          23            100         6-191           rna binding protein, crystal structure of the hypothetical sua5 protein from sulfolobus tokodaii
1.1_60  c3k6oA          24            100         29-237          structural genomics, unknown function, crystal structure of protein of unknown function duf1344 (yp_001299214.1) from bacteroides vulgatus atcc 8482
1.1_68  c1upsB          16            100         21-262          glycosyl hydrolase, glcnac[alpha]1-4gal releasing endo-[beta]-galactosidase from clostridium perfringens


Figure 3 is a screenshot of the whole-genome alignment of our scaffolds against the genome of Bacteroides vulgatus str. 3975 RP4, which we determined to be the strain with the most best blastp matches to our predicted proteins.