Running head
Visualized Comparative Genomics Analytical Software
Corresponding author
Name: Xiangfeng Wang
Address: School of Plant Sciences, University of Arizona, Tucson, AZ, 85721, USA.
Tel: 520-626-4184.
E-mail: [email protected]
Research category
Bioinformatics
Keywords
Bioinformatics, Comparative Genomics, Java, Brassicaceae, Karyotype Visualization,
Synteny Visualization
Plant Physiology Preview. Published on July 29, 2013, as DOI:10.1104/pp.113.219444
Copyright 2013 by the American Society of Plant Biologists
https://plantphysiol.orgDownloaded on December 17, 2020. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.
CrusView: a Java-based visualization platform for comparative genomics analyses
in Brassicaceae
Hao Chen1 and Xiangfeng Wang1*
1School of Plant Sciences, University of Arizona, Tucson, Arizona, 85721, United States
* To whom correspondence should be addressed. Tel: (001) 520-626-4184
Email: [email protected]
FOOTNOTES
*Corresponding author: Xiangfeng Wang, e-mail: [email protected].
ABSTRACT
In plants and animals, chromosomal breakage and fusion events based on conserved
syntenic genomic blocks lead to conserved patterns of karyotype evolution among
species of the same family. However, karyotype information has not been well utilized in
genomic comparison studies. We present CrusView, a Java-based bioinformatic
application utilizing SWT/SWING graphics libraries and a SQLite database for
performing visualized analyses of comparative genomics data in Brassicaceae (Crucifer)
plants. Compared to similar software and databases, one of the unique features of
CrusView is its integration of karyotype information when comparing two genomes. This
feature allows users to perform karyotype-based genome assembly and karyotype-
assisted genome synteny analyses with preset karyotype patterns of the Brassicaceae
genomes. Additionally, CrusView is a local program, which gives its users high flexibility
when analyzing unpublished genomes and allows the users to upload self-defined
genomic information so that they can visually study the associations between genome
structural variations and genetic elements, including chromosomal rearrangements,
genomic macrosynteny, gene families, high-frequency recombination sites, and tandem
and segmental duplications between related species. This tool will greatly facilitate
karyotype, chromosome and genome evolution studies using visualized comparative
genomics approaches in Brassicaceae. CrusView is freely available at
http://www.cmbb.arizona.edu/CrusView/.
INTRODUCTION
The Brassicaceae (crucifer) plant family contains more than 3,700 species, including the
model plant organism Arabidopsis thaliana; economically important crop species, such
as Brassica rapa and Brassica napus; and close relatives of A. thaliana used in abiotic
stress research, such as Eutrema salsugineum and Schrenkiella parvula. Because
Brassicaceae plants have high scientific and economic importance, several whole-
genome sequencing projects of the species in this family have been recently launched
(http://www.brassica.info). Moreover, Brassicaceae is also a good system for population
genomics. The 1001 Arabidopsis Genomes Project (http://www.1001genomes.org/) plans
to generate complete genome sequences for 1001 A. thaliana strains to study the
associations between genetic variation and phenotypic diversity. The VEGI (Value-
directed Evolutionary Genomics Initiative) project aims to understand the genome
evolution of Brassicaceae by sequencing several close relatives of A. thaliana, such as
Arabidopsis lyrata and Capsella rubella. Recent advances in high-throughput sequencing
(HTS) technology have greatly expedited these whole-genome sequencing projects of
versatile non-model organisms. Although increasingly longer reads can now be produced
from HTS experiments, de novo assembler tools can only generate contig and/or scaffold
sequences from HTS reads. These tools cannot generate complete chromosome sequences
without genetic and/or physical maps that typically require years to create. This limitation
makes chromosome-scale structural variation (i.e., translocation, inversion, deletion and
insertion, and segmental and tandem duplication) and genomic macro-synteny analyses
difficult to perform.
In both plants and animals, genomes of species within the same family have
evolved with conserved karyotype patterns due to the rearrangements of large
chromosomal segments. Chromosomal karyotypes can be obtained from comparative
chromosomal painting (CCP) experiments by performing in situ hybridization
experiments on BAC sequences between related species. The genome of each
Brassicaceae member is composed of 24 conserved genomic blocks that have been
considered as the basic units of chromosomal rearrangement during genome evolution
(Lysak et al., 2006). The sizes of these conserved blocks range from several to dozens of
megabases. Karyotypes profiled by CCP experiments are currently available for
approximately twenty Brassicaceae species, including Arabidopsis thaliana (n=5),
Hornungia alpina (n=6), Eutremeae (n=7), Arabidopsis lyrata (n=8), Brassica rapa
(n=10), and Polyctenium fremontii (n=14). By utilizing the karyotype information in
Brassicaceae, we have developed a tool, KGBassembler (karyotype-based genome
assembler for Brassicaceae), to finalize the assembly of
chromosomes from scaffolds/contigs without relying on a genetic/physical map (Ma et al.,
2012).
Over the past 2 years, complete whole-genome sequences of several Brassicaceae
species have been released, including the aforementioned A. lyrata, S. parvula, B. rapa,
and E. salsugineum (Dassanayake et al., 2011; Hu et al., 2011; Wang et al., 2011; Wright and
Agren, 2011; Wu et al., 2012; Yang R, 2013). These genomic resources have opened a new
era of comparative genomics in Brassicaceae to better understand the genomic evolution
(Cheng F, 2012). Numerous tools and databases are available for performing comparative
genomics analysis in plants. CoGe is a comparative genomics analysis platform that is
now a part of the iPlant Collaborative Project (Jorgensen et al., 2008). The CoGe database
currently includes nearly 2,000 genome sequences of approximately 1,500 organisms,
allowing users to perform online visual analyses of genome synteny and duplication
events (Tang and Lyons, 2012). PLAZA and Vista are also web-based databases that
provide comparative analysis services on the genomic data deposited in the databases
(Frazer et al., 2004; Van Bel et al., 2012). Other stand-alone bioinformatic applications for
comparative genomic analysis, such as Easyfig and genoPlotR, are commonly used to
generate synteny plots of given genome segments at a scale ranging from one single gene
to one chromosome (Guy et al., 2010; Sullivan et al., 2011).
In this work, we present a Java-based bioinformatic application – CrusView – for
performing visualized analyses of genome synteny and karyotype evolution in
Brassicaceae. CrusView features a user-friendly graphical user interface (GUI)
implemented with SWT/SWING graphics libraries and a SQLite database used to
manage local genomic data. Compared to the most commonly used tools in comparative
genomics, one of the unique features of CrusView is that available karyotype data of a
Brassicaceae species is incorporated to facilitate karyotype-based chromosome assembly
and analyses of chromosomal structural evolution. Compared to web-based tools, the
stand-alone CrusView tool was also designed to give users higher flexibility in analyzing
currently unpublished genome data and integrating self-defined genomic information
based on the users’ interests, such as gene families, gene duplications, chromosomal
breakpoints, gene ontology (GO) terms, and groups of orthologs/paralogs, with the
genomic synteny maps. In addition, CrusView can generate images representing genomic
synteny between two compared genomes in PNG/SVG/PDF high-resolution formats that
are suitable for publication.
RESULTS
To demonstrate the basic functionality of CrusView, we prepared two example genomes
and related datasets from Arabidopsis thaliana (n=5) and Eutrema salsugineum (n=7) to
perform visualized comparative genomics analyses. E. salsugineum (also known as salt
cress and Thellungiella halophila) is a halophytic relative of A. thaliana; it inhabits the
seashore saline soils of eastern China. Because E. salsugineum and A. thaliana share
similar life cycles, morphological characters and genetic composition, E. salsugineum has
been widely used in plant salt-tolerance studies using the genetic systems and molecular
tools previously established in A. thaliana. The E. salsugineum genome (243 Mb)
contains seven chromosomes and approximately 24,000 protein-coding genes (Yang R,
2013). The karyotype maps derived from comparative chromosomal painting (CCP)
experiments of both E. salsugineum and A. thaliana are currently available (Lysak et al.,
2006). We used these two genomes to demonstrate the karyotype-based genome assembly
of the E. salsugineum chromosomes and the comparative analyses of E. salsugineum and
A. thaliana with integrated karyotype information.
Overview of the functional panels in CrusView
CrusView can be launched via web-start at http://www.cmbb.arizona.edu/crusview. The
navigation panel includes quick buttons that perform basic operations in CrusView. The
published karyotypes of 20 Brassicaceae species have been integrated into CrusView,
and they are shown in the left “karyotype” panel. We will continue to collect published
karyotypes generated from CCP experiments. Each time CrusView is launched, the
program will automatically query the CrusView server to update the local karyotype
database. Genomic data files from E. salsugineum and A. thaliana can be imported into
the SQLite database to run a demonstration for users who run CrusView for the first time.
The primary visualization window shows the seven chromosomes of the primary E.
salsugineum genome (Figure 1). The protein-coding genes of E. salsugineum are
designated with the corresponding colors based on the conserved genomic blocks in
which they are located. The upper-right panel shows the color schemes and the letter
labels for the 24 genomic blocks (A to X), while the lower-right panel shows the five
chromosomes of the secondary A. thaliana genome (Figure 1). The information window
displays the genomic annotations of the genes in the primary genome recorded in the
BED file, including the gene IDs, chromosomal locations, genomic block IDs,
orthologous group IDs, sequence similarities with the homologs in the secondary genome,
gene functional descriptions and other user-defined information (Figure 1). Users can
switch the primary and secondary genomes, zoom in/out of the chromosome images,
query genes of interest, and invoke a chromosome-level comparison window
using the quick buttons in the navigation panel.
Visualized karyotype comparison between E. salsugineum and A. thaliana
One of the unique functions of CrusView is that it can generate the digital karyotype of a
genome, allowing users to visually compare the chromosomal karyotypes of the primary
and secondary genomes. The Arabidopsis lyrata (n=8) genome represents an ancestral
karyotype in the Brassicaceae family in which each member’s genome is composed of 24
conserved genomic blocks according to the karyotype analyses of several representative
species in the family using CCP experiments (Lysak et al., 2006). Each conserved genomic
block is a large chromosomal segment that can be represented by a group of A. thaliana
genes in synteny with their orthologs in the genomes of other Brassicaceae species. Thus,
the A. thaliana genes can be used as markers to infer the assignment of the 24 conserved
genomic blocks to another species’ genome in Brassicaceae (Lysak et al., 2006; Yang R,
2013). Our previously developed software program KGBassembler includes a pipeline to
assign the genes in a Brassicaceae genome to the 24 conserved genome blocks with a
color scheme and a letter label (A to X) based on the homology with A. thaliana genes
(Ma et al., 2012). Here, we elucidate this procedure using E. salsugineum as a newly
sequenced genome based on three basic steps: first, the A. thaliana amino acid sequences
were mapped to the E. salsugineum scaffold sequences using BLAST, followed by the
selection of the best aligned locations; second, the A. thaliana genes mapped onto the E.
salsugineum scaffolds were used to infer the conserved genomic blocks, followed by the
assignment of the color schemes and letter labels of the 24 blocks to the E. salsugineum
genes; and third, pseudo-chromosome sequences were generated based on the CCP-
derived (n=7) karyotype of E. salsugineum. This pipeline was integrated into CrusView
and can be applied to any newly sequenced Brassicaceae genome to perform karyotype-
based genome assembly and generate digital karyotypes for comparison purposes.
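The first two steps of this procedure can be illustrated with a minimal Python sketch. It is not the KGBassembler implementation: the gene names, scaffold names, the gene-to-block lookup table, and the majority-vote rule used to label a scaffold are all hypothetical simplifications chosen for illustration.

```python
from collections import Counter, defaultdict

def best_hits(hits):
    """Keep the highest-scoring alignment for each A. thaliana query gene.

    hits: iterable of (query_gene, scaffold, start, end, bitscore).
    """
    best = {}
    for query, scaffold, start, end, bitscore in hits:
        if query not in best or bitscore > best[query][3]:
            best[query] = (scaffold, start, end, bitscore)
    return best

def assign_blocks(best, gene_to_block):
    """Label each scaffold with the block supported by most mapped genes.

    The majority vote is an assumption of this sketch, not a documented rule.
    """
    votes = defaultdict(Counter)
    for gene, (scaffold, _start, _end, _score) in best.items():
        block = gene_to_block.get(gene)
        if block is not None:
            votes[scaffold][block] += 1
    return {scaffold: counts.most_common(1)[0][0]
            for scaffold, counts in votes.items()}

# Toy alignments (hypothetical identifiers and scores).
hits = [
    ("geneA1", "scaffold_1", 100, 900, 250.0),
    ("geneA1", "scaffold_2", 50, 700, 120.0),
    ("geneA2", "scaffold_1", 1500, 2400, 310.0),
    ("geneB1", "scaffold_2", 300, 1100, 280.0),
]
# Hypothetical lookup from A. thaliana gene to conserved block letter (A to X).
gene_to_block = {"geneA1": "A", "geneA2": "A", "geneB1": "B"}

best = best_hits(hits)
scaffold_blocks = assign_blocks(best, gene_to_block)
print(scaffold_blocks)
```

In this toy case the weaker alignment of geneA1 on scaffold_2 is discarded, so scaffold_1 is labeled with block A and scaffold_2 with block B.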
In CrusView, the digital karyotypes of the primary and secondary genomes will
greatly facilitate visualized genomic comparison and the identification of major
chromosomal rearrangement events causing the genomic evolution of the chromosomal
karyotype in the studied Brassicaceae genome. For example, A. thaliana chromosome 2
(AtChr2) resulted from the merging of E. salsugineum chromosome 4 (EsChr4) and the
long arm (14 Mb to 37 Mb) of EsChr3 (Figure 1). Moreover, when compared with the
ancestral karyotype of the eight A. lyrata chromosomes, users may study the different
evolutionary paths of the karyotype in another species. For example, although AtChr1
resulted from the merging of A. lyrata AlChr1 and AlChr2, the structure of EsChr1
remains unchanged compared with AlChr1 (Figure 1). Furthermore, users can search for
gene IDs or ortholog group IDs of interest from the navigation panel and map their
positions on the compared primary and secondary genomic karyotypes.
Visualized fine-adjustment of pseudo-chromosome assembly in CrusView
The automatic generation of pseudo-chromosome sequences based on the KGBassembler
algorithm may miss or misplace scaffolds that are relatively short or contain too many
repetitive sequences, because such scaffolds do not carry sufficient gene synteny
information for inferring the assignment of conserved genomic blocks. Additionally, de novo
scaffold assembly is usually interrupted at the edges of highly repeated centromere
sequences. Thus, manual adjustment of the pseudo-chromosomes may be necessary.
Unlike KGBassembler, in which users need to edit a text file for manual
adjustment, CrusView allows users to perform visualized fine-adjustment of pseudo-
chromosome assembly in the GUI and to consider additional genomic information, such as
positions of genetic markers, centromere-specific CentO tandem repeats, and the density
of protein-coding genes during the adjustment. Users can directly load the project result
produced in KGBassembler for visualized fine adjustment or use the “assembling”
function in CrusView to assemble pseudo-chromosomes from the scaffold sequences.
When the assembling function in CrusView is run for the first time, users must indicate
the working folder containing the required input files described in the Methods section
and an output folder to save the generated chromosome sequences. Users may set up
necessary parameters in the “parameter panel” and save the parameters into an INI
configuration file that can be directly loaded to run the assembling function (Figure 2).
The details of the parameters are explained in the KGBassembler manual, and users
may wish to apply different parameter settings to produce the best assembly,
which is largely dependent on the quality of the scaffold sequences themselves as
generated by de novo assembler tools.
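As an illustration of saving and reloading parameters in INI format, the following Python sketch round-trips a small configuration with the standard configparser module. The section name and the three parameter names are hypothetical placeholders; the actual parameters are those documented in the KGBassembler manual.

```python
import configparser
import io

# Hypothetical section and parameter names, for illustration only.
config = configparser.ConfigParser()
config["assembly"] = {
    "min_scaffold_length": "10000",
    "min_genes_per_scaffold": "3",
    "gap_size": "100",
}

# Serialize to INI text (a file on disk would work the same way).
buffer = io.StringIO()
config.write(buffer)
ini_text = buffer.getvalue()

# Reload the saved text and read one typed value back.
loaded = configparser.ConfigParser()
loaded.read_string(ini_text)
min_len = loaded.getint("assembly", "min_scaffold_length")
print(ini_text)
```

Keeping each parameter set in its own INI file makes it straightforward to rerun the assembly with different settings and compare the results.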
To fine-tune the draft pseudo-chromosome sequences, CrusView allows users to
add files containing genetic markers and CentO tandem repeats. In plants, CentO
sequences are ~170 bp motifs that are tandemly arrayed and specifically located in the
core centromeric regions (Benson, 1999). CentO repeats located at one end of a long
scaffold generally indicate the centromeric end of that scaffold (Figure 2).
Moreover, the density of protein-coding genes is typically higher in the euchromatic
regions of short and long arms than in the pericentromeric heterochromatic regions
(Figure 2). Thus, these types of information are very useful in assisting users to further
inspect and adjust the scaffold layouts and orientations on the chromosomes, as well as
the genomic positions of the genetic markers. Users can simply perform drag-and-drop
actions with a mouse to correct potentially misplaced scaffolds or to adjust the orientation
of scaffolds. When a manual adjustment is performed, users can save the pseudo-
chromosome sequences to a FASTA file and simultaneously generate the gene annotation
file. Finally, users can use the “push to main screen” function to directly add the
assembled pseudo-chromosome and perform further visualized comparative analyses.
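The reasoning that CentO repeats near one end of a scaffold mark its centromeric end can be sketched in Python as follows. The function name, the 10% edge fraction, and the tie rule are assumptions of this sketch and are not taken from CrusView.

```python
def centromeric_end(scaffold_length, repeat_intervals, edge_fraction=0.1):
    """Guess which scaffold end faces the centromere from CentO repeat positions.

    repeat_intervals: list of (start, end) repeat coordinates on the scaffold.
    Returns "start", "end", or None when the repeats give no clear signal.
    The edge_fraction threshold is a hypothetical choice.
    """
    edge = scaffold_length * edge_fraction
    # Total repeat length lying entirely within each terminal window.
    near_start = sum(e - s for s, e in repeat_intervals if e <= edge)
    near_end = sum(e - s for s, e in repeat_intervals
                   if s >= scaffold_length - edge)
    if near_start == near_end:
        return None
    return "start" if near_start > near_end else "end"

# Toy examples on a 1 Mb scaffold.
print(centromeric_end(1_000_000, [(990_000, 999_000)]))
print(centromeric_end(1_000_000, [(0, 5_000)]))
print(centromeric_end(1_000_000, [(500_000, 510_000)]))
```

A result of None signals that the scaffold orientation should be decided from other evidence, such as gene density or genetic markers.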
Visualization of genomic synteny between two genomes
The “compare two genomes” function in CrusView can provide a visualization of
genomic synteny for each pair of homologous chromosomes for the primary and
secondary genomes. Chromosome-scale genomic synteny can be visualized in two
ways: as a chromosomal karyotype with homologous genes linked between the two
chromosomes, or as a dot-plot indicating chromosomal macrosynteny with duplication
events (Figure 3A). For example, a comparison of the karyotypes of EsChr4 and AtChr2
indicated that A. thaliana chromosome 2 resulted from an event in which the entire
chromosome 4 (genomic blocks I and J) merged with the long arm of chromosome 3
(genomic blocks K, G and H) in E. salsugineum (Figure 3A). In addition, the visualized
chromosomal synteny with karyotype information can also allow users to examine the
differences in the chromosome structures between the two genomes. For instance, the 8
Mb-long region from 27 to 35 Mb of the J block on EsChr4 remains highly similar to the
7 Mb-long region from 13 to 20 Mb on AtChr2, whereas the 25 Mb-long I block of
EsChr4 appears to have expanded dramatically, with highly enriched repetitive sequences
and transposable elements, compared to the corresponding ~17 Mb I block region on
AtChr2. More interestingly, a small region of EsChr4 between positions 10 and 11 Mb
was found to have resulted from the inverted translocation of a region from AtChr2. The selection
of a genomic region with the mouse can invoke the information window, which contains
the genes located in the regions of interest. By clicking on a gene homologous to the
corresponding A. thaliana gene, users will be redirected to the TAIR database, which
contains detailed gene function information.
Chromosome-scale genomic synteny can also be visualized as a dot-plot in
CrusView to facilitate the identification of segmental duplication and tandem duplication
events between the two compared species. From the dot-plot screen, users can select the
regions containing duplication events of interest with the mouse to obtain information
regarding the genes located in the selected regions (Figure 3A). Right-clicking the mouse
will invoke a pull-down list of advanced actions, such as querying selected genes in the
external TAIR database to view detailed functional descriptions, retrieving gene
sequences to a FASTA file, performing exon-level sequence alignment for a single gene,
and aligning multiple genes in a user-defined synteny region using AJaligner. Figure 3B
demonstrates a genomic region between 23.8 and 24.1 Mb on AtChr4 encompassing two
tandem duplication events of the gene members in the calcium-dependent protein kinase
(CDPK) family that may be involved in stress responsive pathways in A. thaliana. While
AtCDPK27 and AtCDPK31 represent a pair of tandemly duplicated genes that
correspond to the single-copy E. salsugineum gene Thhalv10028618m.g, AtCDPK21 and
AtCDPK23 correspond to the single-copy gene Thhalv10028567m.g (Figure 3B). An
exon-level sequence alignment of a pair of orthologous genes of interest will reveal
exon-level structural variations, amino acid variations, insertions and deletions (INDELs),
and single nucleotide polymorphisms (SNPs), as illustrated by the comparison of
SALT OVERLY SENSITIVE 1 (AtSOS1) in A. thaliana and its E. salsugineum ortholog
(Figure 3C).
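To make the dot-plot representation concrete, the Python sketch below converts homologous gene pairs into dot coordinates and flags primary-genome positions that pair with more than one gene in the secondary genome, which is how a tandem duplication appears on such a plot. Gene names and positions are hypothetical, and this is an illustration of the idea rather than the CrusView code.

```python
from collections import Counter

def dotplot_points(pos_primary, pos_secondary, homolog_pairs):
    """Return sorted (x, y) dots, one per homologous pair with known positions."""
    points = []
    for gene_p, gene_s in homolog_pairs:
        if gene_p in pos_primary and gene_s in pos_secondary:
            points.append((pos_primary[gene_p], pos_secondary[gene_s]))
    return sorted(points)

def multi_copy_positions(points):
    """Primary-genome positions that pair with more than one secondary gene."""
    counts = Counter(x for x, _y in points)
    return sorted(x for x, n in counts.items() if n > 1)

# Hypothetical gene midpoints in Mb on one chromosome of each genome.
pos_primary = {"es1": 1.0, "es2": 2.0}
pos_secondary = {"at1": 0.5, "at2a": 1.2, "at2b": 1.21}
pairs = [("es1", "at1"), ("es2", "at2a"), ("es2", "at2b")]

points = dotplot_points(pos_primary, pos_secondary, pairs)
duplicated = multi_copy_positions(points)
print(points, duplicated)
```

Here the single-copy gene es2 pairs with two neighboring genes in the secondary genome, so two dots share the same x coordinate.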
Visualization of a user-defined list of genes, duplication events and copy number
variations (CNV) in a genomic synteny plot
Using CrusView, users may visualize a group of genes of interest in the two compared
genomes to determine their associations with genomic synteny and possible duplication
events. We demonstrate this utility by analyzing the tandemly duplicated F-box
superfamily that has been found to display great copy number variations between A.
thaliana (505 genes) and E. salsugineum (613 genes). First, the genes in E. salsugineum
were assigned to the orthologous groups annotated in the OrthoMCL database (Li et al.,
2003). Each ortholog group indicated by a unique ID contains the putative orthologous
genes in A. thaliana and E. salsugineum. We found that one of the ortholog groups
(OG5_127192) that showed high variation in copy number contained 148 and 130 F-box
genes in A. thaliana and E. salsugineum, respectively. In plants, F-box genes constitute a
large superfamily encoding subunits of E3 ubiquitin ligase complexes that are involved in
substrate-specific protein degradation. Next, using the “predict tandem duplication”
function in CrusView, highly homologous genes, defined with a cutoff of 40% protein
identity and located adjacent to each other within a 5 kb window, were highlighted in green in the dot-plot of
EsChr3 and AtChr3 (Figure 4). The protein-identity cutoff and window size can both be
adjusted by the user when predicting tandem duplications. Then, using the “keyword
search” function, a group of genes of interest is displayed in the current dot-plot. For
instance, when searching ID “OG5_127192”, F-box genes classified in this ortholog
group by OrthoMCL were highlighted in red in the same dot-plot image (Figure 4). From
the overlapping green dots (tandemly duplicated genes) and red dots (F-box genes in
group OG5_127192), we observed a macro-syntenic block covering a ~5 Mb region on
AtChr3 and a ~15 Mb region on EsChr3 encompassing 59 and 78 tandemly arrayed F-
box genes in A. thaliana and in E. salsugineum, respectively (Figure 4).
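The tandem duplication criterion described above (a protein identity cutoff and a distance window, both adjustable) can be sketched in Python as follows. This is not the CrusView implementation: measuring distance as the gap between the end of one gene and the start of the next is an assumption, and the gene records and identity values are invented toy data.

```python
def predict_tandem_duplicates(genes, identity, min_identity=40.0, window=5000):
    """Return IDs of genes that have a nearby, sufficiently similar neighbor.

    genes: list of (gene_id, chrom, start, end).
    identity: dict mapping frozenset({id_a, id_b}) to percent protein identity.
    """
    tandem = set()
    ordered = sorted(genes, key=lambda g: (g[1], g[2]))
    for i, (id_a, chrom_a, _start_a, end_a) in enumerate(ordered):
        for id_b, chrom_b, start_b, _end_b in ordered[i + 1:]:
            # Genes are sorted, so later genes are only farther away.
            if chrom_b != chrom_a or start_b - end_a > window:
                break
            if identity.get(frozenset((id_a, id_b)), 0.0) >= min_identity:
                tandem.update((id_a, id_b))
    return tandem

# Toy gene models and pairwise identities (hypothetical values).
genes = [
    ("g1", "chr3", 1000, 3000),
    ("g2", "chr3", 4500, 6500),
    ("g3", "chr3", 20000, 22000),
    ("g4", "chr3", 23000, 25000),
]
identity = {
    frozenset(("g1", "g2")): 72.5,   # close and similar: tandem pair
    frozenset(("g3", "g4")): 31.0,   # close but below the identity cutoff
    frozenset(("g1", "g3")): 80.0,   # similar but too far apart
}

tandem = predict_tandem_duplicates(genes, identity)
print(sorted(tandem))
```

Raising the window or lowering the identity cutoff makes the prediction more permissive, which mirrors the two user-adjustable settings mentioned in the text.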
Similarly, users can also add additional genomic information to the BED file to
allow searching for self-defined keywords, such as gene ontology (GO) terms, gene
functional descriptions or gene families. CrusView also allows users to filter a list of
genes or genomic positions of interest from the user-defined genomic information file,
which can be displayed on the dot-plot synteny map. Users can define the color schemes
for different gene groups on the plots using the setting function of CrusView. Finally, the
digital karyotype maps, macro-synteny plots based on the 24 color-coded genomic blocks,
and dot-plot synteny map showing duplication events and mapped genes of interest can
be saved as publication-quality PNG/SVG/PDF images.
CONCLUSION
In this work, we developed a Java-based bioinformatic application, CrusView, using
the SWT/SWING graphics libraries in Java and a SQLite database; this
application was designed to facilitate research in comparative genomics. We
demonstrated the basic functionality of CrusView by performing a visual comparison of
the A. thaliana and E. salsugineum genomes in the plant Brassicaceae (Crucifer) family.
Compared to other bioinformatic tools that have been developed for similar purposes, one
of CrusView’s unique features is its incorporation of genomic karyotype information
derived from comparative chromosomal painting (CCP) experiments. The karyotype of a
species associated with the genome structure visualized in CrusView can greatly assist
users in identifying chromosomal rearrangements, genomic synteny and major
duplication events among the related species. Thus, this unique CrusView feature may
facilitate the understanding of karyotype, chromosome and genome evolution based on a
comparative genomics approach. Furthermore, by considering the advantage of a species’
karyotype, CrusView provides a unique function to infer pseudo-chromosome sequences
from scaffold sequences generated by de novo assemblers based on conserved genomic
blocks. This feature is especially convenient for non-model species that lack a genetic
and/or physical map. However, users should be aware that CrusView does not replace de
novo assembler tools, and its performance in finalizing the assembly of a pseudo-
chromosome sequence depends largely on the quality of the scaffolds and contigs
produced from whole-genome shotgun sequencing projects.
CrusView also includes an array of utilities that can be used to visualize genome
synteny and duplication events and to map a list of genes of interest associated with
syntenic regions between the two analyzed genomes. Compared to database-based
comparative genomics tools, CrusView is much more flexible for analyzing
unpublished genomes; it allows users to integrate self-defined genomic information, such
as gene ontology (GO) classifications, gene families of interest, hotspots of
chromosomal breakage/fusion, high-frequency recombination sites, and tandem
duplications, to study their correlations with genomic variations and duplication events.
User-defined information and genome synteny plots can be exported as high-resolution,
publication-quality PNG/SVG/PDF images.
Karyotype mapping based on in situ hybridization experiments is a common
genomic technique that is widely used in animals and plants. Conserved patterns of
chromosomal rearrangements based on syntenic genomic blocks as basic units of
chromosomal breakage and fusion events are commonly observed in the animal and plant
kingdoms (Lysak et al., 2006; Ferguson-Smith and Trifonov, 2007). Therefore, although
CrusView was primarily developed and preset based on the karyotype evolution patterns
in the Brassicaceae family (primarily for the convenience of the Brassicaceae community),
this software program may also be used to perform karyotype-based genome assembly or
karyotype-assisted genome synteny analysis in other plant families or in other organisms
for which karyotype data exist. If users wish to use the current version of CrusView for
non-Brassicaceae species, they can access the “setting” function to define the color
schemes and letter labels of the conserved genomic blocks based on the karyotype
evolution patterns of the species of interest. Additionally, to promote the broad use of
CrusView in other organisms, the source code of CrusView has been released through
Sourceforge.net to allow academic users to freely download and modify the programs.
MATERIALS AND METHODS
Basic input files for CrusView
CrusView utilizes the Java web-start function so that it can be launched through the
CrusView homepage. When it is run for the first time, CrusView creates a “CrusView”
folder on the user’s local computer and automatically installs the programs and basic
dataset in the folder. CrusView simultaneously creates a local Java SQLite database to
manage the genomic data that the user wishes to analyze. The data files include a FASTA
file containing chromosome or scaffold/contig sequences and a GFF file containing gene
model annotation that will be imported into the SQLite database. The user must also
prepare a BED file in the “bed” folder to provide additional information, such as ortholog
group IDs, genome block IDs, and protein sequence identities between the primary and
secondary genomes. To enable the advanced search function, the BED file may also
include the user’s self-defined genomic information and functional descriptions added in
the last column, such as gene ontology (GO) terms, gene families, recombination
hotspots, and so on. To analyze a specific group of genes of interest, the user can load a
TXT file containing the gene IDs or genomic positions and their further descriptions into
CrusView through the provided functions.
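A keyword search over the free-text last column of such a BED file can be sketched in Python as follows. The exact column layout is described in the CrusView manual and is not reproduced here; the tab-separated order assumed below (chromosome, start, end, gene ID, ortholog group, block ID, identity, description) and all identifiers in the toy records are hypothetical.

```python
def keyword_search(bed_lines, keyword):
    """Return gene IDs whose last (free-text) column contains the keyword.

    Assumes tab-separated fields with the gene ID in the fourth column;
    this layout is an assumption of the sketch.
    """
    needle = keyword.lower()
    matches = []
    for line in bed_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if needle in fields[-1].lower():
            matches.append(fields[3])
    return matches

# Toy records with made-up identifiers and annotations.
bed_lines = [
    "chr3\t1000\t3000\tgeneX1\tgroup_1\tF\t72.5\tF-box family protein\n",
    "chr3\t4500\t6500\tgeneX2\tgroup_2\tF\t88.0\tprotein kinase\n",
    "chr3\t9000\t9900\tgeneX3\tgroup_1\tG\t65.1\tF-box/kelch-repeat protein\n",
]

fbox_genes = keyword_search(bed_lines, "f-box")
print(fbox_genes)
```

Because the match is a case-insensitive substring test on the last column only, any annotation the user appends there (GO terms, gene families, hotspot labels) becomes searchable without changing the other columns.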
Input files for karyotype-based genome assembly
For species that have only scaffold sequences but an available CCP-derived
karyotype map, a karyotype-based genome assembly of pseudo-chromosomes from
scaffold sequences is recommended. KGBassembler is invoked by the
“assembling” function in CrusView. The assembly function requires the following input
files: a KARYOTYPE file containing CCP-based karyotype information, obtained from
the CrusView website or prepared by the user following the instructions; a PSL file
containing A. thaliana genes aligned on the scaffolds; and a FASTA file containing scaffold
sequences. The user can either provide a configuration file in INI format or edit the
“Parameter” tab in the CrusView interface to set up necessary parameters for assembly. If
a genetic map with gene marker information is prepared by the user as a GMM file in
the designated format described in the CrusView manual, CrusView can also incorporate
this information during the manual adjustment of the pseudo-chromosomes. To facilitate
the prediction of scaffold orientations on the pseudo-chromosomes, the user may run the
Tandem Repeats Finder (TRF) software program (Benson, 1999) to identify the scaffolds
containing centromere-specific tandem repeat (CentO) sequences. CentO repeat locations
formatted as a BED file can be loaded into CrusView as an additional track.
After KGBassembler has generated the pseudo-chromosome sequences, the user may
use CrusView to fine-tune the orientations and order of the scaffolds on the
pseudo-chromosomes based on additional information, such as the density of
protein-coding genes, user-defined genetic markers, and the locations of CentO
centromeric tandem repeats on the scaffolds. CrusView provides an enhanced GUI in
which the pseudo-chromosome assembly can be adjusted further using drag-and-drop
mouse actions. Clicking the “save assembly” button saves the pseudo-chromosome
sequences and gene annotation information in a FASTA file and a GFF file, respectively.
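To make the effect of changing a scaffold's orientation concrete, the sketch below joins scaffolds into one pseudo-chromosome sequence and reverse-complements those placed on the minus strand. It is a minimal illustration and not the KGBassembler or CrusView code: the function names, the layout structure of (scaffold ID, strand) pairs, and the fixed run of N characters used as a gap are all assumptions.

```python
# Complement table for unambiguous bases plus N, in both cases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_pseudo_chromosome(scaffolds, layout, gap=100):
    """Concatenate scaffolds in the given order and orientation.

    scaffolds: dict mapping scaffold ID -> sequence
    layout:    list of (scaffold_id, strand) with strand "+" or "-"
    gap:       number of N bases inserted between scaffolds (assumed value)
    """
    parts = []
    for scaffold_id, strand in layout:
        seq = scaffolds[scaffold_id]
        parts.append(seq if strand == "+" else reverse_complement(seq))
    return ("N" * gap).join(parts)


if __name__ == "__main__":
    toy = {"s1": "AAC", "s2": "GGT"}  # made-up sequences
    print(build_pseudo_chromosome(toy, [("s1", "+"), ("s2", "-")], gap=3))
```

Flipping a scaffold in the layout changes only its strand flag; the sequence is recomputed when the pseudo-chromosome is rebuilt, so the original scaffold sequences are never modified.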
Conversion of the user’s unpublished genome sequence and self-defined gene
annotation to input files compatible with CrusView
To help the user analyze a genome sequence that has not yet been published, CrusView
includes a function for preparing the necessary input files. The user must provide the
genome or scaffold sequences in FASTA format, the gene annotation file in GFF or GTF
format, and an additional karyotype file if the user wants to perform a karyotype-based
assembly of pseudo-chromosome sequences. The user is also prompted to submit the
protein sequences to the OrthoMCL online database (Li et al., 2003) so that the genes
are assigned to their corresponding ortholog groups, which facilitates genome
comparison, gene duplication analyses and copy number variation analyses. To assign
the 24 conserved genomic block IDs to the genes, the user must provide a BLAST result
of the protein sequences of the analyzed genome against A. thaliana proteins. Any
additional genomic information that the user wishes to include is integrated into the
last column of the BED file to enable the keyword search function in CrusView.
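One way to picture the block assignment step is sketched below in Python: each query protein inherits the block ID of its best-scoring A. thaliana hit. This is an illustration under stated assumptions, not the procedure CrusView necessarily follows. It assumes the BLAST result is in the standard 12-column tabular format (bit score in the last column), that a lookup table from A. thaliana gene ID to block ID is available, and it uses hypothetical function names and made-up IDs. Queries whose best hit has no block receive "0", the label CrusView uses for undetermined regions.

```python
def best_hits(blast_tab):
    """Return {query_id: subject_id} keeping the hit with the highest bit score."""
    best = {}
    for line in blast_tab.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def assign_blocks(blast_tab, at_gene_to_block):
    """Map each query gene to the block of its best A. thaliana hit, or "0"."""
    return {
        query: at_gene_to_block.get(subject, "0")
        for query, subject in best_hits(blast_tab).items()
    }


# Made-up hits and a made-up lookup table, for illustration only.
EXAMPLE_BLAST = (
    "q1\tAT_x1\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t300\n"
    "q1\tAT_x2\t60.0\t180\t72\t0\t1\t180\t1\t180\t1e-20\t120\n"
    "q2\tAT_x3\t70.0\t150\t45\t0\t1\t150\t1\t150\t1e-30\t180\n"
)
EXAMPLE_LOOKUP = {"AT_x1": "A", "AT_x2": "F"}

if __name__ == "__main__":
    print(assign_blocks(EXAMPLE_BLAST, EXAMPLE_LOOKUP))
```

In the sample data, q1 takes block A from its stronger hit, and q2 falls back to "0" because its only hit is absent from the lookup table.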
Inference of genomic macro-synteny based on conserved genomic blocks
The genomes of the Brassicaceae species share 24 conserved genomic blocks (large
chromosomal segments) designated A to X. An additional ID “0” is used by CrusView to
label undetermined regions that are not assigned to any genomic blocks. The
chromosomal locations of the 24 genomic blocks can be inferred from the CCP-derived
karyotype. All genes located within the same conserved genomic block are assigned the
same designated color code to illustrate the digital karyotype of the studied species.
Genes that share a genomic block ID are considered to lie in the same macro-syntenic
region. To analyze a genome that lacks a CCP-derived karyotype, or a genome from
another plant or animal family with different conserved genomic blocks, the user can
define custom block IDs with HEX color codes in the BED file.
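The block labelling described above can be sketched in a few lines of Python: 25 IDs in total ("0" plus A to X), each paired with a HEX color code that is converted to RGB for drawing. The helper names and the example color are hypothetical, and the accepted syntax (six hexadecimal digits with an optional leading "#") is an assumption; the CrusView manual remains the reference for the actual BED conventions.

```python
import re
import string

# "0" for undetermined regions, followed by the 24 blocks A..X.
BLOCK_IDS = ["0"] + list(string.ascii_uppercase[:24])

_HEX_COLOR = re.compile(r"^#?([0-9A-Fa-f]{6})$")


def hex_to_rgb(code):
    """Convert a HEX color code such as "#FF8000" to an (R, G, B) tuple."""
    match = _HEX_COLOR.match(code.strip())
    if match is None:
        raise ValueError(f"not a six-digit HEX color code: {code!r}")
    digits = match.group(1)
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


def validate_block_colors(block_colors):
    """Check a user-defined {block_id: hex_code} table and return it as RGB."""
    return {block: hex_to_rgb(code) for block, code in block_colors.items()}


if __name__ == "__main__":
    # Made-up color choices, for illustration only.
    print(validate_block_colors({"A": "#FF8000", "0": "808080"}))
```

Validating the color codes up front means a malformed entry is reported before any plotting starts, rather than surfacing as a missing block on the karyotype.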
Visualization of chromosomal karyotype, genomic synteny and gene alignment
The GUI and visualization functions of CrusView were implemented with the Java
SWT/SWING libraries. The genomic data of an analyzed species can be visualized at
three levels: the genome level, the chromosome level and the gene level. If the
karyotype information has been associated with the studied genome,
all of the chromosomes are displayed with the 24 genomic block IDs in their
corresponding colors. The user can select any two chromosomes of interest in the two
compared species to visualize chromosomal synteny. When the karyotypes of two
chromosomes are compared, pairs of orthologous genes between the two species are
linked to indicate major chromosomal rearrangement events. CrusView also generates a
dot-plot for each pair of selected chromosomes to visualize tandem and segmental
duplication events. The user may select a group of genes from the dot-plot by dragging
a frame with the mouse to trigger gene-level visualization. Two gene-level views are
available: a multi-gene alignment within a designated genomic region (less than 1 Mb)
between the two genomes, and an exon-to-exon alignment of one pair of orthologous
genes with single nucleotide polymorphism (SNP) information.
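To illustrate how tandem duplication candidates might be flagged before they are drawn on a dot-plot, the Python sketch below pairs neighboring genes on the same chromosome that lie within a distance window and exceed a protein identity cutoff. The default thresholds (40% identity, 5 kb) are those quoted in the Figure 4 legend; interpreting the window as the distance from the end of one gene to the start of the next is an assumption, as are the function name, the data structures, and the sample records. This is not CrusView's algorithm.

```python
def tandem_candidates(genes, identity, min_identity=40.0, window=5000):
    """Return pairs of nearby same-chromosome genes with sufficient identity.

    genes:    list of (gene_id, chrom, start, end)
    identity: dict mapping frozenset({gene_a, gene_b}) -> percent protein identity
    """
    pairs = []
    ordered = sorted(genes, key=lambda g: (g[1], g[2]))
    for i, (gene_a, chrom_a, _start_a, end_a) in enumerate(ordered):
        for gene_b, chrom_b, start_b, _end_b in ordered[i + 1:]:
            # Genes are sorted by chromosome and start, so once a gene is on
            # another chromosome or too far away, all later genes are as well.
            if chrom_b != chrom_a or start_b - end_a > window:
                break
            if identity.get(frozenset((gene_a, gene_b)), 0.0) >= min_identity:
                pairs.append((gene_a, gene_b))
    return pairs


# Made-up genes and identities, for illustration only.
EXAMPLE_GENES = [
    ("g1", "chr1", 1000, 2000),
    ("g2", "chr1", 4000, 5000),
    ("g3", "chr1", 20000, 21000),
    ("g4", "chr2", 100, 900),
]
EXAMPLE_IDENTITY = {
    frozenset(("g1", "g2")): 55.0,
    frozenset(("g2", "g3")): 90.0,
    frozenset(("g1", "g4")): 80.0,
}

if __name__ == "__main__":
    print(tandem_candidates(EXAMPLE_GENES, EXAMPLE_IDENTITY))
```

In the sample data only g1 and g2 qualify: g2 and g3 are highly similar but 15 kb apart, and g1 and g4 lie on different chromosomes.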
Output image files generated from CrusView
CrusView can generate high-resolution images and save them in PNG, SVG or PDF
format for use in publications. Such images include digital karyotypes, genome
synteny plots, dot-plots of two chromosomes, multi-gene alignments within a genomic
region, exon-to-exon alignment plots, plots of genomic duplication events, and
mappings of a list of genes of interest onto the genomic synteny plots.
Software availability
CrusView is publicly available online (http://www.cmbb.arizona.edu/crusview) and has
been implemented as a Java Web Start application for 32- and 64-bit Windows and Linux
systems, with options for different memory sizes. Sample datasets from Arabidopsis
thaliana and Eutrema salsugineum are provided to demonstrate the basic functions of
CrusView. The software manual and a series of video tutorials are also provided
online (http://www.cmbb.arizona.edu/crusview/video_tutorial).
COMPETING INTERESTS
The author(s) declare that they have no competing interests.
AUTHORS’ CONTRIBUTIONS
W.X. and C.H. conceived the project. W.X. and C.H. developed the software. W.X. and
C.H. prepared the manuscript.
FIGURE LEGENDS
Figure 1. Functional panels in the CrusView main screen. a. Navigation panel; b. List of
available karyotypes in Brassicaceae; c. Main window showing the primary genome (E.
salsugineum); d. Color scheme and letter labels of the 24 conserved genomic blocks; e.
Window showing the secondary genome (A. thaliana); f. Gene annotation panel; g.
Digital ancestral karyotype of A. lyrata; h. Digital karyotype of A. thaliana; and i. Digital
karyotype of E. salsugineum.
Figure 2. Genome assembling function. a. Digital karyotype of E. salsugineum; b.
Unplaced short-scaffold sequences; c. Parameter panel; d. Menu bar; e. Main working
panel for the manual curation of the genome assembly of E. salsugineum; f. Density of
protein-coding genes on scaffolds; g. Centromere-specific tandem repeat; and h. Genetic
marker track.
Figure 3. Visualization of genome synteny and gene alignment. A. Panels for genome
synteny visualization: a. Navigation bar; b. Primary genome; c. Secondary genome; d.
Chromosome synteny; e. Dot-plot; f. Genes in the selected area; g. Action list; h.
Selection of segmental duplication; and i. Genes in the ortholog groups. B. Alignment of
multiple gene members in the CDPK family showing tandem duplication events. C.
Exon-level alignment of the SOS1 genes between A. thaliana and E. salsugineum.
Figure 4. Mapping duplication events and genes of interest onto the dot-plot synteny map.
A dot-plot synteny map of EsChr3 and AtChr3. The blue dots represent homologous gene
pairs in the A. thaliana and E. salsugineum genomes. The blue dots arranged along the
diagonal line indicate a macro-synteny region. The aligned blue dots deviating from the
diagonal line indicate segmental duplications. The green dots represent potential
tandemly duplicated genes selected using a protein identity cutoff of 40% and a
window size of 5 kb. The red dots represent F-box genes selected by a keyword search. The
overlapping red dots and green dots indicate the tandemly duplicated F-box genes on
EsChr3 and AtChr3.
REFERENCES
Benson, G. (1999). Tandem repeats finder: a program to analyze DNA sequences.
Nucleic Acids Res 27, 573-580.
Cheng, F., Wu, J., Fang, L., and Wang, X. (2012). Syntenic gene analysis between
Brassica rapa and other Brassicaceae species. Front Plant Sci 3, 198.
Dassanayake, M., Oh, D.H., Haas, J.S., Hernandez, A., Hong, H., Ali, S., Yun, D.J.,
Bressan, R.A., Zhu, J.K., Bohnert, H.J., and Cheeseman, J.M. (2011). The
genome of the extremophile crucifer Thellungiella parvula. Nat Genet 43, 913-
U137.
Ferguson-Smith, M.A., and Trifonov, V. (2007). Mammalian karyotype evolution.
Nature reviews. Genetics 8, 950-962.
Frazer, K.A., Pachter, L., Poliakov, A., Rubin, E.M., and Dubchak, I. (2004). VISTA:
computational tools for comparative genomics. Nucleic Acids Res 32, W273-279.
Guy, L., Kultima, J.R., and Andersson, S.G. (2010). genoPlotR: comparative gene and
genome visualization in R. Bioinformatics 26, 2334-2335.
Hu, T.T., Pattyn, P., Bakker, E.G., Cao, J., Cheng, J.F., Clark, R.M., Fahlgren, N.,
Fawcett, J.A., Grimwood, J., Gundlach, H., Haberer, G., Hollister, J.D.,
Ossowski, S., Ottilar, R.P., Salamov, A.A., Schneeberger, K., Spannagl, M.,
Wang, X., Yang, L., Nasrallah, M.E., Bergelson, J., Carrington, J.C., Gaut,
B.S., Schmutz, J., Mayer, K.F.X., de Peer, Y.V., Grigoriev, I.V., Nordborg, M.,
Weigel, D., and Guo, Y.L. (2011). The Arabidopsis lyrata genome sequence and
the basis of rapid genome size change. Nat Genet 43, 476-+.
Jorgensen, R.A., Stein, L., Rain, S., Andrews, G., and Chandler, V. (2008). The iPlant
collaborative: A cyberinfrastructure-centered community for a new plant biology.
In Vitro Cell Dev-An 44, S26-S26.
Li, L., Stoeckert, C.J., and Roos, D.S. (2003). OrthoMCL: Identification of ortholog
groups for eukaryotic genomes. Genome Research 13, 2178-2189.
Lysak, M.A., Berr, A., Pecinka, A., Schmidt, R., McBreen, K., and Schubert, I.
(2006). Mechanisms of chromosome number reduction in Arabidopsis thaliana
and related Brassicaceae species. Proc Natl Acad Sci U S A 103, 5224-5229.
Ma, C., Chen, H., Xin, M., Yang, R., and Wang, X. (2012). KGBassembler: a
karyotype-based genome assembler for Brassicaceae species. Bioinformatics 28,
3141-3143.
Sullivan, M.J., Petty, N.K., and Beatson, S.A. (2011). Easyfig: a genome comparison
visualizer. Bioinformatics 27, 1009-1010.
Tang, H., and Lyons, E. (2012). Unleashing the genome of Brassica rapa. Front Plant Sci
3, 172.
Van Bel, M., Proost, S., Wischnitzki, E., Movahedi, S., Scheerlinck, C., Van de Peer,
Y., and Vandepoele, K. (2012). Dissecting plant genomes with the PLAZA
comparative genomics platform. Plant Physiol 158, 590-600.
Wang, X., Wang, H., Wang, J., Sun, R., Wu, J., Liu, S., Bai, Y., Mun, J.H., Bancroft,
I., Cheng, F., Huang, S., Li, X., Hua, W., Freeling, M., Pires, J.C., Paterson,
A.H., Chalhoub, B., Wang, B., Hayward, A., Sharpe, A.G., Park, B.S.,
Weisshaar, B., Liu, B., Li, B., Tong, C., Song, C., Duran, C., Peng, C., Geng,
C., Koh, C., Lin, C., Edwards, D., Mu, D., Shen, D., Soumpourou, E., Li, F.,
Fraser, F., Conant, G., Lassalle, G., King, G.J., Bonnema, G., Tang, H.,
Belcram, H., Zhou, H., Hirakawa, H., Abe, H., Guo, H., Jin, H., Parkin, I.A.,
Batley, J., Kim, J.S., Just, J., Li, J., Xu, J., Deng, J., Kim, J.A., Yu, J., Meng,
J., Min, J., Poulain, J., Hatakeyama, K., Wu, K., Wang, L., Fang, L., Trick,
M., Links, M.G., Zhao, M., Jin, M., Ramchiary, N., Drou, N., Berkman, P.J.,
Cai, Q., Huang, Q., Li, R., Tabata, S., Cheng, S., Zhang, S., Sato, S., Sun, S.,
Kwon, S.J., Choi, S.R., Lee, T.H., Fan, W., Zhao, X., Tan, X., Xu, X., Wang,
Y., Qiu, Y., Yin, Y., Li, Y., Du, Y., Liao, Y., Lim, Y., Narusaka, Y., Wang, Z., Li,
Z., Xiong, Z., and Zhang, Z. (2011). The genome of the mesopolyploid crop
species Brassica rapa. Nat Genet 43, 1035-1039.
Wright, S.I., and Agren, J.A. (2011). The Arabidopsis lyrata genome sequence Sizing
up Arabidopsis genome evolution. Heredity 107, 509-510.
Wu, H.J., Zhang, Z.H., Wang, J.Y., Oh, D.H., Dassanayake, M., Liu, B.H., Huang,
Q.F., Sun, H.X., Xia, R., Wu, Y.R., Wang, Y.N., Yang, Z., Liu, Y., Zhang, W.K.,
Zhang, H.W., Chu, J.F., Yan, C.Y., Fang, S., Zhang, J.S., Wang, Y.Q., Zhang,
F.X., Wang, G.D., Lee, S.Y., Cheeseman, J.M., Yang, B.C., Li, B., Min, J.M.,
Yang, L.F., Wang, J., Chu, C.C., Chen, S.Y., Bohnert, H.J., Zhu, J.K., Wang,
X.J., and Xie, Q. (2012). Insights into salt tolerance from the genome of
Thellungiella salsuginea. Proc Natl Acad Sci U S A 109, 12219-12224.
Yang, R., Jarvis, D.E., Chen, H., Beilstein, M., Grimwood, J., Jenkins, J., Shu, S.,
Prochnik, S., Xin, M., Ma, C., Schmutz, J., Wing, R.A., Mitchell-Olds, T.,
Schumaker, K., and Wang, X. (2013). The reference genome of the halophytic plant
Eutrema salsugineum. Front Plant Sci 4.