IBD-Labnet Training Workshop, Würzburg, 2 July 2010
TRANSCRIPT
CONTENTS
THE NEISSERIA SEQUENCE TYPING WEBSITE
SEQUENCE DEFINITION DATABASES
  QUERYING SEQUENCES TO DETERMINE ALLELE IDENTITY
  BROWSING SCHEME PROFILES
  QUERYING SCHEME PROFILES
  INVESTIGATING ALLELE DIFFERENCES
    Sequence similarity
    Sequence comparison
ISOLATE DATABASES
  SEARCHING
  USER-CONFIGURABLE OPTIONS
    General
      Interface
      Main results table
      Isolate full record
    Display
    Query
  QUERYING THE ISOLATE DATABASE BY ALLELIC PROFILES
  RETRIEVING A LIST OF ISOLATES
  RETRIEVING ISOLATES BY LINKED PUBLICATION
  DATA ANALYSIS
    Field breakdown
    Two field breakdown
    Unique combinations
    Scheme and allele breakdown
  EXPORTING DATA
    Isolate records
    Concatenated sequences (FASTA format)
    Sequences in extended multi-FASTA (XMFA) format
THE NEISSERIA SEQUENCE TYPING WEBSITE
The Neisseria sequence typing website is now using a new genomics platform to power the databases - this
allows much greater flexibility in defining loci and can also handle genome scale data. MLST, PorA, FetA, fHbp,
penA and rpoB sequences can all be queried from the same interface.
There are two linked, but independent, databases on the site: one for allele sequence and profile definitions
and the other for isolate data:
SEQUENCE DEFINITION DATABASES
Sequence definition databases contain unique sequences for any number of loci defined by allele numbers.
They can also hold scheme profile definitions, i.e. unique combinations of alleles at specific loci defined by a
primary key field.
QUERYING SEQUENCES TO DETERMINE ALLELE IDENTITY
Click 'Sequence query' on the main contents page to determine the identity, or the nearest match, of a single
sequence.
Paste your sequence into the box - there is no need to trim it. Normally, you can leave the loci setting on 'All
loci' - the software should identify the correct locus based on your sequence. Press ‘submit’.
If an exact match is found, this will be indicated along with the start position of the locus within your
sequence.
If only a partial match is found, the most similar allele is identified along with any nucleotide differences. The
varying nucleotide positions are numbered both relative to the pasted in sequence and to the reference
sequence. The start position of the locus within your sequence is also indicated.
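As a rough illustration of what an exact-match lookup involves, the sketch below searches an untrimmed sequence for any known allele and reports the 1-based start position. It is not the website's own implementation: the function name `find_exact_allele` and the allele sequences are hypothetical, and real queries also handle partial matches, which this sketch does not.

```python
from __future__ import annotations

from typing import Optional


def find_exact_allele(query: str, alleles: dict[str, str]) -> Optional[tuple[str, int]]:
    """Return (allele_id, 1-based start) of the longest allele found inside query."""
    query = query.upper()
    best = None
    for allele_id, seq in alleles.items():
        pos = query.find(seq.upper())
        # Prefer the longest allele that occurs exactly within the pasted sequence.
        if pos != -1 and (best is None or len(seq) > len(alleles[best[0]])):
            best = (allele_id, pos + 1)
    return best


# Hypothetical allele definitions (not real PubMLST sequences).
alleles = {"abcZ-1": "ATGCCGTA", "abcZ-2": "ATGCCGTT"}

# Untrimmed input: flanking bases on both sides of the locus.
untrimmed = "ggg" + "ATGCCGTT" + "cca"
print(find_exact_allele(untrimmed, alleles))  # ('abcZ-2', 4)
```

Because the search is a plain substring match, flanking sequence does not need to be trimmed, which mirrors the behaviour described above.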
BROWSING SCHEME PROFILES
You can peruse all MLST profiles by clicking the ‘Browse MLST profiles’ link on the contents page:
Please note that some databases may have more than one scheme defined; these may be listed in a drop-down
list box instead:
Choose the field to order the results by and click 'Browse all records'.
Clicking the hyperlink for any profile will display full information about the profile.
QUERYING SCHEME PROFILES
Click the link to 'Search profiles' for the appropriate scheme on the main contents page.
Please note that some databases may have more than one scheme defined; these may be listed in a drop-down
list box instead:
Enter your search criteria. You may also see some drop-down list boxes that allow further filtering of
results.
Each field can be queried using the following modifiers:
Modifier      Description
=             Case-insensitive exact match
Contains      Case-insensitive match to a partial string, e.g. searching for clonal complex 'contains' st-11
              would return all STs belonging to the ST-11 complex
>             Greater than
<             Less than
NOT           Match to values that do not equal the search term (case-insensitive)
NOT contain   Match to values that do not contain the search term (case-insensitive), e.g. searching for
              clonal complex 'NOT contain' lactamica would return all STs that do not belong to a clonal
              complex that has 'lactamica' in the name
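The behaviour of the modifiers can be sketched in a few lines of Python. This is only an illustration of the semantics described in the table, not the database's query code: the function `matches`, the numeric fallback for '>' and '<', and the example profiles are all assumptions made for the example.

```python
def matches(value, modifier, term):
    """Apply one query modifier to a field value, ignoring case."""
    m = modifier.strip().lower()
    v, t = str(value).lower(), str(term).lower()
    if m == "=":
        return v == t
    if m == "contains":
        return t in v
    if m == "not":
        return v != t
    if m == "not contain":
        return t not in v
    if m in (">", "<"):
        # Assumption: compare numerically when both sides are numbers,
        # otherwise fall back to comparing the lower-cased text.
        try:
            a, b = float(value), float(term)
        except (TypeError, ValueError):
            a, b = v, t
        return a > b if m == ">" else a < b
    raise ValueError(f"Unknown modifier: {modifier!r}")


# Illustrative profile records (values are made up for the example).
profiles = [
    {"ST": 11, "clonal_complex": "ST-11 complex"},
    {"ST": 41, "clonal_complex": "ST-41/44 complex"},
    {"ST": 9999, "clonal_complex": "example lactamica complex"},
]

hits = [p["ST"] for p in profiles if matches(p["clonal_complex"], "contains", "st-11")]
print(hits)  # [11]

non_lactamica = [p["ST"] for p in profiles if matches(p["clonal_complex"], "NOT contain", "lactamica")]
print(non_lactamica)  # [11, 41]
```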
Clicking the hyperlink for any profile will display full information about the profile.
INVESTIGATING ALLELE DIFFERENCES
SEQUENCE SIMILARITY
To find sequences most similar to a selected allele, click 'Sequence similarity' on the contents page.
Enter the identifier of the allele sequence to investigate and the number of nearest matches you'd like to see,
then press submit. A list of nearest alleles will be displayed, along with the percentage identity and number of
gaps between the sequences.
Click the appropriate 'Compare' button to display a list of nucleotide differences and/or a sequence alignment.
SEQUENCE COMPARISON
To compare two sequences directly, click 'Sequence comparison' on the contents page.
Enter two allele identifiers belonging to the same locus and press ‘submit’. A list of nucleotide differences
and/or an alignment will be displayed.
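The list of nucleotide differences can be pictured as a position-by-position comparison. The sketch below assumes the two alleles are already the same length (or have been aligned beforehand); the function names and sequences are hypothetical, and the site's own comparison also copes with gaps, which this example does not.

```python
def nucleotide_differences(seq_a, seq_b):
    """Return (1-based position, base in A, base in B) for every mismatch."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Sequences differ in length and need aligning first")
    pairs = zip(seq_a.upper(), seq_b.upper())
    return [(pos, a, b) for pos, (a, b) in enumerate(pairs, start=1) if a != b]


def percent_identity(seq_a, seq_b):
    """Percentage of positions at which the two sequences agree."""
    mismatches = len(nucleotide_differences(seq_a, seq_b))
    return 100.0 * (len(seq_a) - mismatches) / len(seq_a)


# Made-up sequences for illustration only.
allele_1 = "ATGCCGTA"
allele_2 = "ATGACGTT"

print(nucleotide_differences(allele_1, allele_2))  # [(4, 'C', 'A'), (8, 'A', 'T')]
print(percent_identity(allele_1, allele_2))        # 75.0
```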
ISOLATE DATABASES
Isolate databases hold isolate provenance information linked to sequence data. These sequences can be
tagged with locus details, and allele designations can be defined for each isolate.
SEARCHING
The 'Search database' page allows you to search by combinations of provenance criteria. These can be linked
together by 'and' or 'or'. You can also filter or search on allele designations or scheme fields. The number of
criteria that can be combined in a single query (up to 20) can be changed on the options page.
After the search has been submitted, the results will be displayed in a table.
You may also see some drop-down list boxes that allow further filtering of results. Default list boxes will be set
by the administrator for each database, but you can remove these or add new ones for any provenance or
scheme field by going to the user options page.
[Screenshot labels: Combine provenance fields; Combine with allele designations or scheme fields; Hyperlink to
isolate record]
Each field can be queried using the following modifiers:
Modifier      Description
=             Case-insensitive exact match
Contains      Case-insensitive match to a partial string, e.g. searching for clonal complex 'contains' st-11
              would return all STs belonging to the ST-11 complex
>             Greater than
<             Less than
NOT           Match to values that do not equal the search term (case-insensitive)
NOT contain   Match to values that do not contain the search term (case-insensitive), e.g. searching for
              clonal complex 'NOT contain' lactamica would return all STs that do not belong to a clonal
              complex that has 'lactamica' in the name
To view further information about any of the returned isolates, click on the hyperlinked id number in the table.
An information page will be returned containing all known details about the isolate (see below).
USER-CONFIGURABLE OPTIONS
The user interface is configurable in a number of ways. Choices made are remembered between sessions if you
connect from the same computer (a browser cookie is used to set the appropriate options).
General options can be set by clicking the 'Set general options' link on the main index page. Three tabs are
selectable, each related to a different part of the interface:
GENERAL
The general tab allows the following options to be modified:
INTERFACE
Field combinations - sets the number of fields that can be combined into a query in the isolate
interface. Similar options apply for sample queries and for setting options for loci and scheme fields.
Records per page
Page bar position
Locus aliases - Loci can have multiple names (aliases). Setting this option will display all alternative
names in results tables.
Nucleotides per line - Some analyses display sequence alignments. This option allows you to set the
width of these alignments to suit your display.
MAIN RESULTS TABLE
Hyperlink allele designations - hyperlinks point to an information page about the particular allele
sequence. Depending on the locus, this may exist on a different website.
Differentiate provisional allele designations - Allele designations can be set as confirmed or
provisional, usually depending on the method of assignment. Selecting this option will display
provisional designations in a different colour to confirmed designations.
Display pending allele designations - Loci can have competing allele designations. Selecting this option
highlights the existence of an alternative designation.
ISOLATE FULL RECORD
Differentiate provisional allele designations - Allele designations can be set as confirmed or
provisional, usually depending on the method of assignment. Selecting this option will display
provisional designations in a different colour to confirmed designations.
Display pending allele designations - Loci can have competing allele designations. Selecting this option
highlights the existence of an alternative designation.
Display sender, curator and last updated records - displays a tooltip containing sender information
next to each allele designation.
Display sequence bin information - displays a tooltip with information about the position of the
sequence if tagged within the sequence bin.
Display full information about sample records - used when the database is used as part of a laboratory
information management system (LIMS). This option will display records of samples available for the
displayed isolate.
Display all loci - creates an entry in the display for each locus whether or not an allele has been
defined or a sequence tagged for it.
DISPLAY
Selecting checkboxes in this section sets which isolate provenance fields will be displayed in the main results
table following a query. Certain fields will be selected by default - these are set in the database configuration,
but can be overridden.
QUERY
Selecting checkboxes in this section adds drop-down filters in the query form allowing rapid selection of
particular attributes. A filter for publications based on linked PubMed records can also be added as well as a
filter for the completion status of allelic profiles for defined schemes, such as MLST.
QUERYING THE ISOLATE DATABASE BY ALLELIC PROFILES
If a scheme, such as MLST, has been defined for an isolate database it is possible to query the database against
complete or partial allelic profiles. Even if no scheme is defined, queries can be made against all loci.
On the index page, click 'Search by combinations of loci (profiles) - including partial matching' for any defined
scheme.
Enter either a partial (any combination of loci) or complete profile. Alternatively, for scheme profiles, you can
enter a primary key value (e.g. ST) and select 'Autofill' to automatically fill in the associated profile.
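One way to picture a profile query is as a check that an isolate carries the requested allele at every locus that was filled in, with empty loci ignored. The sketch below follows that reading; it is an assumption about the matching rule rather than a description of the site's code, and the names `autofill` and `matches_profile`, the ST number and all allele numbers are hypothetical.

```python
MLST_LOCI = ["abcZ", "adk", "aroE", "fumC", "gdh", "pdhC", "pgm"]

# Hypothetical scheme definition: primary key (ST) -> allelic profile.
scheme_profiles = {
    1001: {"abcZ": 1, "adk": 2, "aroE": 3, "fumC": 4, "gdh": 5, "pdhC": 6, "pgm": 7},
}

# Hypothetical isolates; iso3 has an incomplete profile.
isolates = {
    "iso1": {"abcZ": 1, "adk": 2, "aroE": 3, "fumC": 4, "gdh": 5, "pdhC": 6, "pgm": 7},
    "iso2": {"abcZ": 1, "adk": 2, "aroE": 9, "fumC": 4, "gdh": 5, "pdhC": 6, "pgm": 7},
    "iso3": {"abcZ": 1, "adk": 2},
}


def autofill(st):
    """Look up the full profile for a primary key value, as 'Autofill' does."""
    return dict(scheme_profiles[st])


def matches_profile(isolate_alleles, query):
    """True if the isolate has the queried allele at every locus that was entered."""
    return all(
        isolate_alleles.get(locus) == allele
        for locus, allele in query.items()
        if allele is not None  # loci left blank in the form are ignored
    )


partial_query = {"abcZ": 1, "adk": 2}
partial_hits = [name for name, alleles in isolates.items() if matches_profile(alleles, partial_query)]
print(partial_hits)  # ['iso1', 'iso2', 'iso3']

complete_hits = [name for name, alleles in isolates.items() if matches_profile(alleles, autofill(1001))]
print(complete_hits)  # ['iso1']
```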
RETRIEVING A LIST OF ISOLATES
The isolate database can be queried against a list of values matching any criteria (isolate provenance fields,
alleles, or scheme fields).
Click 'List query' on the index page.
Select the attribute you wish to search against in the drop-down list box.
Enter the list of attributes in the box (one per line) and press ‘submit’.
RETRIEVING ISOLATES BY LINKED PUBLICATION
The list of publications that have been linked to isolates within the whole database can be retrieved by clicking
the 'Publication breakdown' link on the index page.
Alternatively, a list of publications filtered to only those linked to isolates from the latest search can be
reached by clicking the 'Publications' button in the Breakdown list at the bottom of the results table. Please
note that the list of functions may vary.
A list of publications will then be displayed. The list may be filtered further by selecting an author from the
drop-down list at the top. The table can be ordered by any field by clicking the table headers - toggling
between ascending or descending order.
Finally, to retrieve the collection of isolates linked to a paper, click the 'Display' button for the appropriate
paper.
A query can also be filtered to only those isolates linked to a publication by selecting the appropriate paper in
the drop-down list on the query page.
DATA ANALYSIS
FIELD BREAKDOWN
The field breakdown function displays the frequency of each value for fields stored in the isolates table. Allele
and scheme field breakdowns are handled by a different function.
The breakdown function can be selected for the whole database by clicking the 'Field breakdown' link on the
main contents page:
Alternatively, a breakdown can be displayed of the dataset returned from a query by clicking the 'Fields'
button in the Breakdown list at the bottom of the results table. Please note that the list of functions here may
vary.
A series of charts will be displayed. Pick the field to display from the list at the top.
The values used to generate the chart can be displayed or extracted by clicking the 'Display table' link at the
bottom of the page. This displays a table that can be ordered by clicking the appropriate header. The 'Tab-
delimited text' link displays the same information in a format that can be copied to a spreadsheet.
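Once a dataset has been exported, the same kind of breakdown is easy to reproduce offline. The sketch below counts the values of one field and prints them as tab-delimited text; the field names, the records and the 'No value' label for missing data are assumptions made for the example, not the website's output format.

```python
from collections import Counter


def field_breakdown(isolates, field):
    """Count how often each value of `field` occurs; missing values are grouped."""
    counts = Counter()
    for record in isolates:
        value = record.get(field)
        counts["No value" if value is None else str(value)] += 1
    return counts


def as_tab_delimited(counts):
    """Render the counts, most frequent first, in a spreadsheet-friendly form."""
    total = sum(counts.values())
    lines = ["value\tfrequency\tpercentage"]
    for value, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        lines.append(f"{value}\t{n}\t{100 * n / total:.1f}")
    return "\n".join(lines)


# Made-up isolate records for illustration.
isolates = [
    {"id": 1, "country": "UK", "serogroup": "B"},
    {"id": 2, "country": "UK", "serogroup": "C"},
    {"id": 3, "country": "Norway", "serogroup": "B"},
    {"id": 4, "country": None, "serogroup": "B"},
]

country_counts = field_breakdown(isolates, "country")
print(as_tab_delimited(country_counts))
```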
TWO FIELD BREAKDOWN
The two field breakdown function displays a table breaking down one field against another, e.g. breakdown of
serogroup by year.
The function can be selected for the whole database by clicking the 'Two field breakdown' link on the main
contents page:
Alternatively, a two field breakdown can be displayed of the dataset returned from a query by clicking the
'Two field' button in the Breakdown list at the bottom of the results table. Please note that the list of functions
here may vary.
Select the two fields you wish to break down and how you would like the values displayed
(percentage/absolute values and totalling options).
Press ‘submit’. The breakdown will be displayed as a table. Bar charts will also be displayed provided the
number of returned values for each field is less than 30.
The table values can be exported in a format suitable for copying into a spreadsheet by clicking 'Download as
tab-delimited text' underneath the table.
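A two field breakdown is essentially a cross-tabulation. The sketch below builds one from exported records, shows absolute values with row totals, and applies the 'fewer than 30 values per field' rule for charts mentioned above. The function names and the records are hypothetical, and percentage display is left out to keep the example short.

```python
from collections import Counter, defaultdict


def two_field_breakdown(isolates, row_field, col_field):
    """Count isolates for every (row value, column value) pair."""
    table = defaultdict(Counter)
    for record in isolates:
        table[record[row_field]][record[col_field]] += 1
    return table


def column_values(table):
    return sorted({col for counts in table.values() for col in counts})


def render(table):
    """Tab-delimited table of absolute values with a total for each row."""
    cols = column_values(table)
    lines = ["\t".join([""] + [str(c) for c in cols] + ["Total"])]
    for row in sorted(table):
        cells = [table[row][c] for c in cols]  # a Counter returns 0 for absent pairs
        lines.append("\t".join([str(row)] + [str(n) for n in cells] + [str(sum(cells))]))
    return "\n".join(lines)


def charts_allowed(table, limit=30):
    """Charts are drawn only when each field has fewer than `limit` values."""
    return len(table) < limit and len(column_values(table)) < limit


# Made-up records: serogroup broken down by year.
isolates = [
    {"serogroup": "B", "year": 2008},
    {"serogroup": "B", "year": 2009},
    {"serogroup": "C", "year": 2008},
    {"serogroup": "B", "year": 2008},
]

table = two_field_breakdown(isolates, "serogroup", "year")
print(render(table))
```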
UNIQUE COMBINATIONS
The unique combinations function allows you to select any number of fields or alleles and displays a table
showing the frequency of each unique combination.
The function can be selected for the whole database by clicking the 'Unique combinations' link on the main
contents page:
Alternatively, unique combinations contained in the dataset returned from a query can be analyzed by clicking
the 'Combinations' button in the Breakdown list at the bottom of the results table. Please note that the list of
functions here may vary.
Select the fields that you wish to include – an example may be the outer membrane surface protein variable
regions:
Click ‘submit’ and the table will be calculated.
The table can be sorted by clicking the table headers.
The table values can be exported in a format suitable for copying into a spreadsheet by clicking 'Download as
tab-delimited text' underneath the table.
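Counting unique combinations amounts to treating the selected fields as one composite key. A minimal sketch, assuming exported records held as dictionaries (the field names and values below are illustrative, not taken from the database):

```python
from collections import Counter


def unique_combinations(isolates, fields):
    """Frequency of each distinct combination of values across `fields`."""
    return Counter(tuple(record.get(f) for f in fields) for record in isolates)


# Made-up records; field names are hypothetical.
isolates = [
    {"id": 1, "PorA_VR1": "7-2", "PorA_VR2": "4", "FetA_VR": "F1-5"},
    {"id": 2, "PorA_VR1": "7-2", "PorA_VR2": "4", "FetA_VR": "F1-5"},
    {"id": 3, "PorA_VR1": "5", "PorA_VR2": "2", "FetA_VR": "F3-6"},
]

combos = unique_combinations(isolates, ["PorA_VR1", "PorA_VR2", "FetA_VR"])
for combination, frequency in combos.most_common():
    print("\t".join(list(combination) + [str(frequency)]))
```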
SCHEME AND ALLELE BREAKDOWN
The scheme and allele breakdown function displays the frequency of each allele and scheme field (e.g. ST or
clonal complex).
The function can be selected for the whole database by clicking the 'Scheme and allele breakdown' link on the
main contents page:
Alternatively, a breakdown can be displayed of the dataset returned from a query by clicking the
'Schemes/alleles' button in the Breakdown list at the bottom of the results table. Please note that the list of
functions here may vary.
A table showing the number of unique values for each locus and scheme field will be displayed.
A detailed display of allele or field frequencies can be displayed by clicking the appropriate 'Breakdown'
button.
The sorting of the table can be changed by clicking the appropriate header - this toggles between ascending
and descending order.
EXPORTING DATA
ISOLATE RECORDS
The export dataset function outputs multiple isolate records containing both provenance and allele or scheme
data. Output is in tab-delimited text format, suitable for loading into a spreadsheet, and individual fields can
be selected for inclusion in the output.
The function can be selected for the whole database by clicking the 'Export Dataset' link on the main contents
page:
Alternatively, the records returned from a database query, i.e. a subset of the whole database, can be
exported by clicking the 'Dataset' button in the Export list at the bottom of the results table. Please note that
the list of functions here may vary.
You will be presented with a list of fields, loci and scheme fields. All fields (except composite fields) are
selected by default. Unselect any fields not required in the export. Fields can be selected/deselected in bulk
using the 'Select all' and 'Select none' buttons or the 'All' and 'None' buttons for individual sections.
Click ‘submit’. It may take a few seconds to generate the output (depending on the dataset size). Progress is
indicated by a series of dots with each dot representing 50 records processed.
A hyperlink to the export file will appear when it has been generated. Right-click the link with the mouse to
save this to your computer.
CONCATENATED SEQUENCES (FASTA FORMAT)
The export concatenated sequences function outputs sequences in FASTA format. You can select the loci you
wish to include and these will be ordered by the value of the genome_position field set in the loci table then
by name. If a sequence for a particular locus is missing for an isolate, it will be replaced by a series of gap
characters (-) of either the standard length of the locus (if set) or of the most common length of known alleles.
Please note that concatenated sequences are not guaranteed to be aligned.
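The ordering and gap-filling rules described above can be sketched as follows. This is an illustration under stated assumptions, not the export function itself: the data structures, the function names, and the choice to place loci without a genome_position last are all hypothetical.

```python
from collections import Counter


def ordered_loci(loci):
    """Order by genome_position, then by name (loci with no position go last: an assumption)."""
    return sorted(
        loci,
        key=lambda l: (l.get("genome_position") is None, l.get("genome_position") or 0, l["name"]),
    )


def gap_length(locus, known_alleles):
    """Standard locus length if set, otherwise the most common known allele length."""
    if locus.get("length"):
        return locus["length"]
    lengths = Counter(len(seq) for seq in known_alleles[locus["name"]])
    return lengths.most_common(1)[0][0]


def concatenated_fasta(isolates, loci, known_alleles):
    """Build one FASTA record per isolate, padding missing loci with '-'."""
    records = []
    for isolate_id, seqs in isolates.items():
        parts = []
        for locus in ordered_loci(loci):
            seq = seqs.get(locus["name"])
            parts.append(seq if seq else "-" * gap_length(locus, known_alleles))
        records.append(f">{isolate_id}\n{''.join(parts)}")
    return "\n".join(records)


# Made-up loci and sequences, far shorter than real alleles.
loci = [
    {"name": "adk", "genome_position": 200, "length": None},
    {"name": "abcZ", "genome_position": 100, "length": 4},
]
known_alleles = {"abcZ": ["AAAA", "AAAT"], "adk": ["ACG", "ACGT", "TCGT"]}
isolates = {
    1: {"abcZ": "AAAA", "adk": "ACGT"},
    2: {"adk": "TCGT"},   # abcZ missing: padded to the standard length
    3: {"abcZ": "AAAT"},  # adk missing: padded to the most common allele length
}

fasta = concatenated_fasta(isolates, loci, known_alleles)
print(fasta)
```

Because padding uses a fixed length per locus while real alleles may vary in length, records built this way can differ in total length, which is why the output is not guaranteed to be aligned.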
The function can be selected for the whole database by clicking the 'Concatenate alleles' link on the main
contents page:
Alternatively, the records returned from a database query, i.e. a subset of the whole database, can be
exported by clicking the 'Concatenate' button in the Export list at the bottom of the results table. Please note
that the list of functions here may vary.
You will be presented with a page displaying a list of isolate ids and checkboxes for each locus that can be
selected. If the list of isolates is empty all isolates in the database will be included, otherwise just those isolates
listed will be used. All loci are selected by default. Unselect any loci not required in the export. Fields can be
selected/deselected in bulk using the 'Select all' and 'Select none' buttons, or the 'All' and 'None' buttons for
individual sections. In some cases, both allele designations and sequences tagged from the sequence bin will
be available - you can choose which to use if these conflict. You can also select whether to include the isolate
name in the sequence identifier line in the FASTA file - leaving this unchecked will just use the database id
number as an identifier.
Click ‘submit’. It may take a few seconds to generate the output (depending on the dataset size). Progress is
indicated by a series of dots with each dot representing 50 records processed.
A hyperlink to the export file will appear when it has been generated. Right-click to save this to your computer.
SEQUENCES IN EXTENDED MULTI-FASTA (XMFA) FORMAT
The export XMFA function outputs sequences in XMFA format. You can select the loci you wish to include and
these will be ordered by the value of the genome_position field set in the loci table then by name. If a
sequence for a particular locus is missing for an isolate, it will be replaced by a series of unknown characters
(N). The sequences in the locus blocks will be aligned. Alignment is performed by passing the sequences
through the ‘muscle’ program.
The function can be selected for the whole database by clicking the 'XMFA export' link on the main contents
page:
Alternatively, the records returned from a database query, i.e. a subset of the whole database, can be
exported by clicking the 'XMFA' button in the Export list at the bottom of the results table. Please note that the
list of functions here may vary.
You will be presented with a page displaying a list of isolate ids and checkboxes for each locus that can be
selected. If the list of isolates is empty all isolates in the database will be included, otherwise just those isolates
listed will be used. All loci are selected by default. Unselect any loci not required in the export. Fields can be
selected/deselected in bulk using the 'Select all' and 'Select none' buttons or the 'All' and 'None' buttons for
individual sections. In some cases, both allele designations and sequences tagged from the sequence bin will
be available - you can choose which to use if these conflict. You can also select whether to include the isolate
name in the sequence identifier line in the XMFA file - leaving this unchecked will just use the database id
number as an identifier.
Click ‘submit’. It may take some time to generate the output due to the alignment process (depending on the
dataset size). Progress is indicated by a series of dots with each dot representing 10 loci processed.
A hyperlink to the export file will appear when it has been generated. Right-click the link with the mouse to
save this to your computer.
IDEAL DATA
Not too strong! Peaks at approx the same height at end of run as the beginning.
BLEACHING!
This is a reaction containing too much of everything.
The sequencing machines are very sensitive, and when a reaction is performed with an excess of everything the
light peaks recorded from the fluorescent dyes exceed the maximum the CCD camera can cope with.
This was corrected by changing the injection time from 15 to 5 seconds.
RAW DATA WITH DYE BLOBS
These data would read correctly apart from the dye blobs, which co-migrate at approx 70bp and obscure
10-12bp of read.
ELECTROPHEROGRAM OF PREVIOUS DATA
These traces show the loss of 10-12bp of data, unreadable due to the excess fluorescent dyes not cleaned away
at the end of the reaction.
POOR DATA MADE MUCH WORSE BY DYE BLOBS
This screenshot shows the raw data from three samples, all prepared by the same person on the same
sequencing run. Two, however, have very low amounts of template and so have enormous dye blobs, the
excess fluorescent dye left at the end of the reaction.
There is readable weak sequence data available from these samples but because the dye blobs are off-scale
the software is scaling the read from these peaks.
It is necessary to ask the software to analyze the data from a point after the dye blobs to obtain usable data.
SKI-SLOPES
These two samples both show a ski-slope when viewed as raw data, i.e. many short sequencing products but
little or no long sequencing products.
This can be due to too much template versus BigDye or due to an excess of primer versus template.
To differentiate, look at the intensity [Reflective light units].
CHANGING PRIMER CONCENTRATION
These are the same template DNA sequenced with slightly different conditions.
The original run ski-sloped badly, but as the intensity of the short products was not at the maximum it was
concluded that this was due to an excess of primer.
A 2-fold dilution of primer produced a more even read to the end 400bp.
A 1°C increase in annealing temp also improved the read but had greater background.
Original
Primer diluted 2-fold
Annealing temp raised 1°C
PRIMER DEGRADATION
A common problem when sequences fail or appear to have a lot of background is the degradation of working
primer stock. Primers often lose the last base at the 3’ end, and this means extension of sequencing products
starts one base ahead of the normal expected start.
This is often caused by acid hydrolysis.
It is good practice to buy HPLC purified primers. Your stock should then be aliquoted into smaller working
stocks to avoid excessive freeze-thaw cycles.
ANOTHER EXAMPLE OF PRIMER DEGRADATION
The primer in this example has lost two bases from the 3’ end.
The G-g at 310bp producing an N.
This happened when the researcher used a primer from a commercial cloning kit for their sequencing reaction.
LATE STARTING SAMPLES
Sequencing samples sometimes fail to inject correctly and therefore start very late in the run. This can be due
to too much template [particularly cosmids] or salt left in the sample.
Re-running the sample will improve those with too much template as there is less available for the second
injection.
However, if the sample does not improve on the second run it must be due to salt left in the sample.
IBD-LABNET WORKSHOP – WÜRZBURG, FRIDAY 2 JULY 2010
PUBMLST NEISSERIA TYPING DATABASES
Reference sequence data for genetic targets involved in typing, antibiotic resistance and vaccine development
have been incorporated into the PubMLST database, allowing interrogation from a unified interface.
EXERCISES
1. Rifampicin resistance is determined by mutations in the rpoB gene. Search the database for isolates
resistant to rifampicin with the rpoB allele determined (HINT: you can customize the interface from
the options page to add a drop-down box for the rifampicin_range field. You can search for a missing
value using the term null. The screenshots below demonstrate this process).
Sometimes you may need to refresh your browser (press F5) for changes in the interface to be displayed.
a. How many rpoB alleles are associated with resistance (MIC: >1 mg/l)? (HINT: Use the
schemes/alleles breakdown)
b. What are the four most common rpoB alleles among resistant isolates in the database?
c. What mutations are present in these alleles? (HINT: Mutations are displayed in the allele
information page reached by clicking the hyperlinked allele identifier)
d. What is the most common rpoB allele among rifampicin susceptible isolates?
2. The two field breakdown function of the database allows you to investigate how one field relates to
another in a set of data retrieved from a search (or within the whole database). Search for all isolates
from 2008 onwards:
a. Are there any clonal complexes that appear to be predominantly associated with carriage?
(HINT: the disease field contains information about whether an isolate was from carriage or
disease)
b. Which clonal complex is predominantly associated with serogroup C?
c. Which clonal complexes are only associated with serogroup A?
3. The EU-MenNet study published in Yazdankhah et al. 2004 J Clin Microbiol 42:5146-53 contains
random samples of carriage isolates from three countries (Czech Republic, Greece and Norway), along
with disease isolates from a similar time period. From the isolate query page, filter the database to
contain only isolates from this study:
a. What was the most common clonal complex seen in carriage in each of the countries?
b. What was the most common carriage clonal complex in the Czech Republic in
1994?
1996?
c. …and in disease (you can search for disease NOT ‘carrier’)
1994?
1996?
4. A random sample of disease isolates from the UK over 20 years prior to the introduction of the MCC
vaccine is described in Russell et al. 2008 Microbiology 154:1170-7. MLST and finetyping were
performed on approximately 100 isolates from each of the years 1975, 1985 and 1995.
a. What was the most common finetype (serogroup, PorA VR1/VR2, FetA VR, ST (cc)) in each of
the three years? (HINT: Use field combination breakdown)
1975?
1985?
1995?
b. In the whole dataset of 323 isolates, how many clonal complexes are the following
serogroups associated with?
A?
B?
C?
c. What are the most common combinations of PorA variable regions in each of the years?
1975?
1985?
1995?
5. The PubMLST isolate database cannot generally be considered a population dataset since submission
is biased with often only new variants submitted. Some laboratories, however, do submit all their
samples or at least an unbiased sample. One of these currently is the Czech Republic. Since the
beginning of 2009:
a. How many clonal complexes have caused disease in the Czech Republic?
b. Which was the most frequently isolated complex from disease?
c. How many STs were found in disease isolates belonging to this complex?
d. Which was the most frequently isolated ST from this complex:
i. In disease?
ii. In carriage?
e. Which serogroup was mainly associated with this complex?