IBD-Labnet Training Workshop, Würzburg, 2 July 2010

CONTENTS

THE NEISSERIA SEQUENCE TYPING WEBSITE

SEQUENCE DEFINITION DATABASES
    QUERYING SEQUENCES TO DETERMINE ALLELE IDENTITY
    BROWSING SCHEME PROFILES
    QUERYING SCHEME PROFILES
    INVESTIGATING ALLELE DIFFERENCES
        Sequence similarity
        Sequence comparison

ISOLATE DATABASES
    SEARCHING
    USER-CONFIGURABLE OPTIONS
        General
            Interface
            Main results table
            Isolate full record
        Display
        Query
    QUERYING THE ISOLATE DATABASE BY ALLELIC PROFILES
    RETRIEVING A LIST OF ISOLATES
    RETRIEVING ISOLATES BY LINKED PUBLICATION
    DATA ANALYSIS
        Field breakdown
        Two field breakdown
        Unique combinations
        Scheme and allele breakdown
    EXPORTING DATA
        Isolate records
        Concatenated sequences (FASTA format)
        Sequences in extended multi-FASTA (XMFA) format

THE NEISSERIA SEQUENCE TYPING WEBSITE

The Neisseria sequence typing website now uses a new genomics platform to power its databases. This

allows much greater flexibility in defining loci and can also handle genome-scale data. MLST, PorA, FetA, fHbp,

penA and rpoB sequences can all be queried from the same interface.

There are two linked, but independent, databases on the site: one for allele sequence and profile definitions

and the other for isolate data.

SEQUENCE DEFINITION DATABASES

Sequence definition databases contain unique sequences for any number of loci defined by allele numbers.

They can also hold scheme profile definitions, i.e. unique combinations of alleles at specific loci defined by a

primary key field.

QUERYING SEQUENCES TO DETERMINE ALLELE IDENTITY

Click 'Sequence query' on the main contents page to determine the identity, or the nearest match, of a single

sequence.

Paste your sequence into the box; there is no need to trim it. Normally, you can leave the loci setting on 'All

loci', as the software should identify the correct locus based on your sequence. Press ‘submit’.

If an exact match is found, this will be indicated along with the start position of the locus within your

sequence.

If only a partial match is found, the most similar allele is identified along with any nucleotide differences. The

varying nucleotide positions are numbered both relative to the pasted in sequence and to the reference

sequence. The start position of the locus within your sequence is also indicated.
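The partial-match report described above can be approximated as follows. This is a minimal sketch, not the website's own implementation: it assumes the matching region of the pasted sequence has the same length as the reference allele (substitutions only, no gaps), and the function name `list_differences` and its `start` parameter are hypothetical.

```python
def list_differences(query, reference, start):
    """Return (reference_pos, query_pos, ref_base, query_base) for each mismatch.

    query     -- the pasted sequence, which may extend beyond the locus
    reference -- the most similar allele sequence
    start     -- 1-based position in query where the locus begins
    Assumes the locus region of query has the same length as reference.
    """
    region = query[start - 1:start - 1 + len(reference)].upper()
    if len(region) != len(reference):
        raise ValueError("query does not cover the whole locus")
    differences = []
    for i, (ref_base, query_base) in enumerate(zip(reference.upper(), region)):
        if ref_base != query_base:
            # Positions are 1-based: first relative to the reference allele,
            # then relative to the pasted sequence.
            differences.append((i + 1, start + i, ref_base, query_base))
    return differences
```

For example, `list_differences("GGACGTAA", "ACCT", 3)` reports a single difference at position 3 of the reference, which is position 5 of the pasted sequence.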

BROWSING SCHEME PROFILES

You can peruse all MLST profiles by clicking the ‘Browse MLST profiles’ link on the contents page:

Please note that some databases may have more than one scheme defined; these may be listed in a drop-down

list box instead.

Choose the field to order the results by and click 'Browse all records'.

Clicking the hyperlink for any profile will display full information about the profile.

QUERYING SCHEME PROFILES

Click the link to 'Search profiles' for the appropriate scheme on the main contents page.

Please note that some databases may have more than one scheme defined; these may be listed in a drop-down

list box instead.

Enter the search criteria you wish to search on. You may also see some drop-down list boxes that allow further

filtering of results.

Each field can be queried using the following modifiers:

Modifier       Description
=              Case-insensitive exact match
Contains       Case-insensitive match to a partial string, e.g. searching for clonal complex 'contains' st-11 would return all STs belonging to the ST-11 complex
>              Greater than
<              Less than
NOT            Match to values that do not equal the search term (case-insensitive)
NOT contain    Match to values that do not contain the search term (case-insensitive), e.g. searching for clonal complex 'NOT contain' lactamica would return all STs that do not belong to a clonal complex that has 'lactamica' in the name
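The behaviour of the modifiers can be illustrated with a short sketch. The helper `matches` and the modifier strings it accepts are hypothetical labels, not part of the website. How the site compares text with '>' and '<' is not stated here, so the numeric-first, text-fallback comparison below is an assumption.

```python
def matches(value, modifier, term):
    """Apply one query modifier to a single field value (hypothetical helper)."""
    if modifier in (">", "<"):
        # Assumption: compare as numbers when possible, otherwise as lower-case text.
        try:
            left, right = float(value), float(term)
        except ValueError:
            left, right = str(value).lower(), str(term).lower()
        return left > right if modifier == ">" else left < right
    # All remaining modifiers are case-insensitive text comparisons.
    left, right = str(value).lower(), str(term).lower()
    if modifier == "=":
        return left == right
    if modifier == "contains":
        return right in left
    if modifier == "NOT":
        return left != right
    if modifier == "NOT contain":
        return right not in left
    raise ValueError(f"unknown modifier: {modifier}")
```

With this sketch, `matches("ST-11 complex/ET-37 complex", "contains", "st-11")` is true, mirroring the clonal complex example in the table.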

Clicking the hyperlink for any profile will display full information about the profile.

INVESTIGATING ALLELE DIFFERENCES

SEQUENCE SIMILARITY

To find sequences most similar to a selected allele, click 'Sequence similarity' on the contents page.

Enter the identifier of the allele sequence to investigate and the number of nearest matches you'd like to see,

then press submit. A list of nearest alleles will be displayed, along with the percentage identity and number of

gaps between the sequences.

Click the appropriate 'Compare' button to display a list of nucleotide differences and/or a sequence alignment.

SEQUENCE COMPARISON

To directly compare two sequences click 'Sequence comparison' from the contents page.

Enter two allele identifiers belonging to the same locus and press ‘submit’. A list of nucleotide differences

and/or an alignment will be displayed.

ISOLATE DATABASES

Isolate databases hold isolate provenance information linked to sequence data. These sequences can be

tagged with locus details and allele designations can be defined for each isolate.

SEARCHING

The 'Search database' page allows you to search by combinations of provenance criteria. These can be linked

together by 'and' or 'or'. You can also filter or search on allele designations or scheme fields. The number of

selection criteria combinations that can be combined can be changed by going to the options page (up to 20

can be used).

After the search has been submitted, the results will be displayed in a table.

You may also see some drop-down list boxes that allow further filtering of results. Default list boxes will be set

by the administrator for each database, but you can remove these or add new ones for any provenance or

scheme field by going to the user options page.

Each field can be queried using the following modifiers:

Modifier       Description
=              Case-insensitive exact match
Contains       Case-insensitive match to a partial string, e.g. searching for clonal complex 'contains' st-11 would return all STs belonging to the ST-11 complex
>              Greater than
<              Less than
NOT            Match to values that do not equal the search term (case-insensitive)
NOT contain    Match to values that do not contain the search term (case-insensitive), e.g. searching for clonal complex 'NOT contain' lactamica would return all STs that do not belong to a clonal complex that has 'lactamica' in the name

To view further information about any of the returned isolates, click on the hyperlinked id number in the table.

An information page will be returned containing all known details about the isolate (see below).

USER-CONFIGURABLE OPTIONS

The user interface is configurable in a number of ways. Choices made are remembered between sessions if you

connect from the same computer (a browser cookie is used to set the appropriate options).

General options can be set by clicking the 'Set general options' link on the main index page. Three tabs are

selectable, each related to a different part of the interface:

GENERAL

The general tab allows the following options to be modified:

INTERFACE

Field combinations - sets the number of fields that can be combined into a query in the isolate

interface. Similar options apply for sample queries and for setting options for loci and scheme fields.

Records per page

Page bar position

Locus aliases - Loci can have multiple names (aliases). Setting this option will display all alternative

names in results tables.

Nucleotides per line - Some analyses display sequence alignments. This option allows you to set the

width of these alignments to suit your display.

MAIN RESULTS TABLE

Hyperlink allele designations - hyperlinks point to an information page about the particular allele

sequence. Depending on the locus, this may exist on a different website.

Differentiate provisional allele designations - Allele designations can be set as confirmed or

provisional, usually depending on the method of assignment. Selecting this option will display

provisional designations in a different colour to confirmed designations.

Display pending allele designations - Loci can have competing allele designations. Selecting this option

highlights the existence of an alternative designation.

ISOLATE FULL RECORD

Differentiate provisional allele designations - Allele designations can be set as confirmed or

provisional, usually depending on the method of assignment. Selecting this option will display

provisional designations in a different colour to confirmed designations.

Display pending allele designations - Loci can have competing allele designations. Selecting this option

highlights the existence of an alternative designation.

Display sender, curator and last updated records - displays a tooltip containing sender information

next to each allele designation.

Display sequence bin information - displays a tooltip with information about the position of the

sequence if tagged within the sequence bin.

Display full information about sample records - used when the database is used as part of a laboratory

information management system (LIMS). This option will display records of samples available for the

displayed isolate.

Display all loci - creates an entry in the display for each locus whether or not an allele has been

defined or a sequence tagged for it.

DISPLAY

Selecting checkboxes in this section sets which isolate provenance fields will be displayed in the main results

table following a query. Certain fields will be selected by default - these are set in the database configuration,

but can be overridden.

QUERY

Selecting checkboxes in this section adds drop-down filters in the query form allowing rapid selection of

particular attributes. A filter for publications based on linked PubMed records can also be added as well as a

filter for the completion status of allelic profiles for defined schemes, such as MLST.

QUERYING THE ISOLATE DATABASE BY ALLELIC PROFILES

If a scheme, such as MLST, has been defined for an isolate database it is possible to query the database against

complete or partial allelic profiles. Even if no scheme is defined, queries can be made against all loci.

On the index page, click 'Search by combinations of loci (profiles) - including partial matching' for any defined

scheme.

Enter either a partial (any combination of loci) or complete profile. Alternatively, for scheme profiles, you can

enter a primary key value (e.g. ST) and select 'Autofill' to automatically fill in the associated profile.
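Partial profile matching can be pictured as follows. This is a minimal sketch under stated assumptions: isolates are held in a plain dictionary, blank loci in the query are ignored, and the function names are hypothetical. The allele numbers in the usage note are invented for illustration.

```python
def matches_profile(isolate_alleles, query_profile):
    """True if the isolate carries every allele given in the query.

    Loci left blank in the query (None or "") are ignored, which is what
    makes a partial profile match possible.
    """
    for locus, allele in query_profile.items():
        if allele in (None, ""):
            continue
        # Compare as text so that 2 and "2" are treated as the same allele.
        if str(isolate_alleles.get(locus)) != str(allele):
            return False
    return True


def search_by_profile(isolates, query_profile):
    """Return the ids of isolates matching a complete or partial profile."""
    return [isolate_id for isolate_id, alleles in isolates.items()
            if matches_profile(alleles, query_profile)]
```

For example, with `isolates = {1: {"abcZ": 2, "adk": 3}, 2: {"abcZ": 2, "adk": 5}}`, the partial query `{"abcZ": "2", "adk": ""}` returns both isolates, while `{"abcZ": "2", "adk": "3"}` returns only isolate 1.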

RETRIEVING A LIST OF ISOLATES

The isolate database can be queried against a list of values matching any criteria (isolate provenance fields,

alleles, or scheme fields).

Click 'List query' on the index page.

Select the attribute you wish to search against in the drop-down list box.

Enter the list of attributes in the box (one per line) and press ‘submit’.

RETRIEVING ISOLATES BY LINKED PUBLICATION

The list of publications that have been linked to isolates within the whole database can be retrieved by clicking

the 'Publication breakdown' link on the index page.

Alternatively, a list of publications filtered to only those linked to isolates from the latest search can be

reached by clicking the 'Publications' button in the Breakdown list at the bottom of the results table. Please

note that the list of functions may vary.

A list of publications will then be displayed. The list may be filtered further by selecting an author from the

drop-down list at the top. The table can be ordered by any field by clicking the table headers - toggling

between ascending or descending order.

Finally, to retrieve the collection of isolates linked to a paper, click the 'Display' button for the appropriate

paper.

A query can also be filtered to only those isolates linked to a publication by selecting the appropriate paper in

the drop-down list on the query page.

DATA ANALYSIS

FIELD BREAKDOWN

The field breakdown function displays the frequency of each value for fields stored in the isolates table. Allele

and scheme field breakdowns are handled by a different function.

The breakdown function can be selected for the whole database by clicking the 'Field breakdown' link on the

main contents page:

Alternatively, a breakdown can be displayed of the dataset returned from a query by clicking the 'Fields'

button in the Breakdown list at the bottom of the results table. Please note that the list of functions here may

vary.

A series of charts will be displayed. Pick the field to display from the list at the top.

The values used to generate the chart can be displayed or extracted by clicking the 'Display table' link at the

bottom of the page. This displays a table that can be ordered by clicking the appropriate header. The 'Tab-

delimited text' link displays the same information in a format that can be copied to a spreadsheet.
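The calculation behind a field breakdown, and the tab-delimited form of its table, can be sketched as below. The record layout (a list of dictionaries), the column header `frequency`, and both function names are assumptions made for illustration. The site's actual output columns may differ.

```python
from collections import Counter


def field_breakdown(records, field):
    """Count how often each value of `field` occurs in a list of record dicts."""
    return Counter(record.get(field, "") for record in records)


def to_tab_delimited(counts, field):
    """Render the counts as tab-delimited text, most frequent value first."""
    lines = [f"{field}\tfrequency"]
    ordered = sorted(counts.items(), key=lambda item: (-item[1], str(item[0])))
    for value, count in ordered:
        lines.append(f"{value}\t{count}")
    return "\n".join(lines)
```

Text produced this way pastes into a spreadsheet with one value per cell, which is the purpose of the 'Tab-delimited text' link.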

TWO FIELD BREAKDOWN

The two field breakdown function displays a table breaking down one field against another, e.g. breakdown of

serogroup by year.
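As an illustration of what such a table contains, the sketch below cross-tabulates two fields from a list of record dictionaries. It is not the website's code: the function names and record layout are hypothetical, and percentages are taken of the overall total, which is only one of the display options mentioned in this section.

```python
from collections import Counter


def two_field_breakdown(records, row_field, column_field):
    """Return (row_values, column_values, counts) for a two-way table.

    counts maps (row_value, column_value) to the number of records
    that have that pair of values.
    """
    counts = Counter((r[row_field], r[column_field]) for r in records)
    row_values = sorted({row for row, _ in counts})
    column_values = sorted({col for _, col in counts})
    return row_values, column_values, counts


def as_percentages(row_values, column_values, counts):
    """Express every cell as a percentage of the overall total."""
    total = sum(counts.values())
    return {(row, col): 100.0 * counts[(row, col)] / total
            for row in row_values for col in column_values}
```

Cells with no matching records simply count as zero, because a `Counter` returns 0 for keys it has not seen.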

The function can be selected for the whole database by clicking the 'Two field breakdown' link on the main

contents page:

Alternatively, a two field breakdown can be displayed of the dataset returned from a query by clicking the

'Two field' button in the Breakdown list at the bottom of the results table. Please note that the list of functions

here may vary.

Select the two fields you wish to break down and how you would like the values displayed

(percentage/absolute values and totaling options).

Press ‘submit’. The breakdown will be displayed as a table. Bar charts will also be displayed provided the

numbers of returned values for both fields are less than 30.

The table values can be exported in a format suitable for copying in to a spreadsheet by clicking 'Download as

tab-delimited text' underneath the table.

UNIQUE COMBINATIONS

The unique combinations function allows you to select any number of fields or alleles and displays a table

showing the frequency of each unique combination.
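The counting itself is simple, as the following sketch shows. The function name, the record layout, and the field names used in the usage note are assumptions for illustration only.

```python
from collections import Counter


def unique_combinations(records, fields):
    """Count each distinct combination of values across the chosen fields."""
    return Counter(tuple(record.get(f) for f in fields) for record in records)
```

For example, given three invented records with `PorA_VR1` and `PorA_VR2` values of ("5", "2"), ("5", "2") and ("7", "16"), `unique_combinations(records, ["PorA_VR1", "PorA_VR2"]).most_common(1)` gives `[(("5", "2"), 2)]`.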

The function can be selected for the whole database by clicking the 'Unique combinations' link on the main

contents page:

Alternatively, unique combinations contained in the dataset returned from a query can be analyzed by clicking

the 'Combinations' button in the Breakdown list at the bottom of the results table. Please note that the list of

functions here may vary.

Select the fields that you wish to include – an example may be the outer membrane surface protein variable

regions:

Click ‘submit’ and the table will be calculated.

The table can be sorted by clicking the table headers.

The table values can be exported in a format suitable for copying in to a spreadsheet by clicking 'Download as

tab-delimited text' underneath the table.

SCHEME AND ALLELE BREAKDOWN

The scheme and allele breakdown function displays the frequency of each allele and scheme field (e.g. ST or

clonal complex).

The function can be selected for the whole database by clicking the 'Scheme and allele breakdown' link on the

main contents page:

Alternatively, a breakdown can be displayed of the dataset returned from a query by clicking the

'Schemes/alleles' button in the Breakdown list at the bottom of the results table. Please note that the list of

functions here may vary.

A table showing the number of unique values for each locus and scheme field will be displayed.

A detailed display of allele or field frequencies can be displayed by clicking the appropriate 'Breakdown'

button.

The sorting of the table can be changed by clicking the appropriate header - this toggles between ascending

and descending order.

EXPORTING DATA

ISOLATE RECORDS

The export dataset function outputs multiple isolate records containing both provenance and allele or scheme

data. Output is in tab-delimited text format, suitable for loading in to a spreadsheet and individual fields can

be selected for inclusion in the output.

The function can be selected for the whole database by clicking the 'Export Dataset' link on the main contents

page:

Alternatively, the records returned from a database query, i.e. a subset of the whole database, can be

exported by clicking the 'Dataset' button in the Export list at the bottom of the results table. Please note that

the list of functions here may vary.

You will be presented with a list of fields, loci and scheme fields. All fields (except composite fields) are

selected by default. Unselect any fields not required in the export. Fields can be selected/deselected in bulk

using the 'Select all' and 'Select none' buttons or the 'All' and 'None' buttons for individual sections.

Click ‘submit’. It may take a few seconds to generate the output (depending on the dataset size). Progress is

indicated by a series of dots with each dot representing 50 records processed.

A hyperlink to the export file will appear when it has been generated. Right-click the link with the mouse to

save this to your computer.

CONCATENATED SEQUENCES (FASTA FORMAT)

The export concatenated sequences function outputs sequences in FASTA format. You can select the loci you

wish to include and these will be ordered by the value of the genome_position field set in the loci table then

by name. If a sequence for a particular locus is missing for an isolate, it will be replaced by a series of gap

characters (-) of either the standard length of the locus (if set) or of the most common length of known alleles.

Please note that concatenated sequences are not guaranteed to be aligned.
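The padding rule for missing loci can be sketched as follows. This is an illustration under stated assumptions rather than the export function itself: `loci` is taken to be already in the desired order, the "most common length" is estimated from the sequences supplied instead of from the allele definitions, and all names are hypothetical.

```python
from collections import Counter


def gap_length(locus_sequences, standard_length=None):
    """Use the standard locus length if set, else the most common known length."""
    if standard_length:
        return standard_length
    lengths = Counter(len(seq) for seq in locus_sequences if seq)
    return lengths.most_common(1)[0][0] if lengths else 0


def concatenate(isolates, loci, standard_lengths=None):
    """Build FASTA text; `isolates` maps isolate id -> {locus: sequence}."""
    standard_lengths = standard_lengths or {}
    pads = {}
    for locus in loci:
        known = [seqs.get(locus) for seqs in isolates.values()]
        pads[locus] = "-" * gap_length(known, standard_lengths.get(locus))
    lines = []
    for isolate_id, seqs in isolates.items():
        lines.append(f">{isolate_id}")
        # A missing locus is replaced by a run of gap characters.
        lines.append("".join(seqs.get(locus) or pads[locus] for locus in loci))
    return "\n".join(lines) + "\n"
```

Because alleles of one locus can differ in length, records built this way can have different total lengths, which is why the concatenated output is not guaranteed to be aligned.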

The function can be selected for the whole database by clicking the 'Concatenate alleles' link on the main

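contents page:

Alternatively, the records returned from a database query, i.e. a subset of the whole database, can be

exported by clicking the 'Concatenate' button in the Export list at the bottom of the results table. Please note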

that the list of functions here may vary.

You will be presented with a page displaying a list of isolate ids and checkboxes for each locus that can be

selected. If the list of isolates is empty all isolates in the database will be included, otherwise just those isolates

listed will be used. All loci are selected by default. Unselect any loci not required in the export. Fields can be

selected/deselected in bulk using the 'Select all' and 'Select none' buttons, or the 'All' and 'None' buttons for

individual sections. In some cases, both allele designations and sequences tagged from the sequence bin will

be available - you can choose which to use if these conflict. You can also select whether to include the isolate

name in the sequence identifier line in the FASTA file - leaving this unchecked will just use the database id

number as an identifier.

Click ‘submit’. It may take a few seconds to generate the output (depending on the dataset size). Progress is

indicated by a series of dots with each dot representing 50 records processed.

A hyperlink to the export file will appear when it has been generated. Right-click to save this to your computer.

SEQUENCES IN EXTENDED MULTI-FASTA (XMFA) FORMAT

The export XMFA function outputs sequences in XMFA format. You can select the loci you wish to include and

these will be ordered by the value of the genome_position field set in the loci table then by name. If a

sequence for a particular locus is missing for an isolate, it will be replaced by a series of unknown characters

(N). The sequences in the locus blocks will be aligned. Alignment is performed by passing the sequences

through the ‘muscle’ program.
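The block layout of an XMFA file can be sketched as below. The sketch assumes the sequences for each locus are already aligned (the alignment step is not reproduced), and the header layout `>id:start-end + locus` with running coordinates is an assumption based on the general XMFA convention. The website's exact headers may differ. The function name is hypothetical.

```python
def write_xmfa(isolates, loci):
    """Return XMFA text with one alignment block per locus.

    `isolates` maps isolate id -> {locus: aligned sequence}. Sequences within
    a locus are assumed to have been aligned already (same length).
    """
    lines = []
    offset = 1  # running start coordinate across blocks
    for locus in loci:
        present = [seqs[locus] for seqs in isolates.values() if seqs.get(locus)]
        if not present:
            continue  # no isolate has this locus, so there is nothing to write
        length = len(present[0])
        for isolate_id, seqs in isolates.items():
            # A missing sequence is replaced by unknown characters (N).
            sequence = seqs.get(locus) or "N" * length
            lines.append(f">{isolate_id}:{offset}-{offset + length - 1} + {locus}")
            lines.append(sequence)
        lines.append("=")  # each alignment block ends with '='
        offset += length
    return "\n".join(lines) + "\n"
```

Each locus becomes its own block terminated by a line containing `=`, so downstream tools can treat the loci as separate alignments.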

The function can be selected for the whole database by clicking the 'XMFA export' link on the main contents

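page:

Alternatively, the records returned from a database query, i.e. a subset of the whole database, can be

exported by clicking the 'XMFA' button in the Export list at the bottom of the results table. Please note that the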

list of functions here may vary.

You will be presented with a page displaying a list of isolate ids and checkboxes for each locus that can be

selected. If the list of isolates is empty all isolates in the database will be included, otherwise just those isolates

listed will be used. All loci are selected by default. Unselect any loci not required in the export. Fields can be

selected/deselected in bulk using the 'Select all' and 'Select none' buttons or the 'All' and 'None' buttons for

individual sections. In some cases, both allele designations and sequences tagged from the sequence bin will

be available - you can choose which to use if these conflict. You can also select whether to include the isolate

name in the sequence identifier line in the XMFA file - leaving this unchecked will just use the database id

number as an identifier.

Click ‘submit’. It may take some time to generate the output due to the alignment process (depending on the

dataset size). Progress is indicated by a series of dots with each dot representing 10 loci processed.

A hyperlink to the export file will appear when it has been generated. Right-click the link with the mouse to

save this to your computer.


IDEAL DATA

Not too strong! Peaks are at approximately the same height at the end of the run as at the beginning.

BLEACHING!

This is a reaction containing too much of everything.

The sequencing machines are very sensitive and when a reaction is performed with an excess of everything the

light peaks recorded from the fluorescent dyes exceed the maximum the CCD camera can cope with.

This was corrected by changing the injection time from 15 to 5 seconds!!

RAW DATA WITH DYE BLOBS

These data would read correctly apart from the dye blobs which co-migrate at approx 70bp and obscure 10-

12bp of read.

ELECTROPHEROGRAM OF PREVIOUS DATA

These traces show the loss of 10-12bp of data, unreadable due to the excess fluorescent dyes not cleaned away

at the end of the reaction.

POOR DATA MADE MUCH WORSE BY DYE BLOBS

This screen shot shows the raw data from three samples all prepared by the same person on the same

sequencing run. Two, however, have very poor amounts of template and so have enormous dye blobs, the

excess fluorescent dye left at the end of the reaction.

There is readable weak sequence data available from these samples but because the dye blobs are off-scale

the software is scaling the read from these peaks.

It is necessary to ask the software to analyze the data from a point after the dye blobs to obtain usable data.

SKI-SLOPES

These two samples both show a ski-slope when viewed as raw data, i.e. many short sequencing products but

little or no long sequencing products.

This can be due to too much template versus BigDye or due to an excess of primer versus template.

To differentiate between the two causes, look at the signal intensity [relative fluorescence units].

CHANGING PRIMER CONCENTRATION

These are the same template DNA sequenced with slightly different conditions.

The original run ski-sloped badly but as the intensity of the short products was not at the maximum it was

concluded that this was due to an excess of primer!

A 2-fold dilution of primer produced a more even read to the end 400bp.

A 1°C increase in annealing temp also improved the read but had greater background.

Original

Primer diluted 2-fold

Annealing temp raised 1°C

PRIMER DEGRADATION

A common problem when sequences fail or appear to have a lot of background is the degradation of working

primer stock. Primers often lose the last base at the 3’ end and this means extension of sequencing products

starts one base ahead of the normal expected start.

This is often caused by acid hydrolysis.

It is good practice to buy HPLC purified primers. Your stock should then be aliquoted into smaller working

stocks to avoid excessive freeze-thaw cycles.

ANOTHER EXAMPLE OF PRIMER DEGRADATION

The primer in this example has lost two bases from the 3’ end.

The G-g at 310bp producing an N.

This happened when the researcher used a primer from a commercial cloning kit for their sequencing reaction.

LATE STARTING SAMPLES

Sequencing samples sometimes fail to inject correctly and therefore start very late in the run. This can be due

to too much template [particularly cosmids] or salt left in the sample.

Re-running the sample will improve those with too much template as there is less available for the second

injection.

However, if the sample does not improve on the second run it must be due to salt left in the sample.

IBD-LABNET WORKSHOP – WÜRZBURG, FRIDAY 2 JULY 2010

PUBMLST NEISSERIA TYPING DATABASES

Reference sequence data for genetic targets involved in typing, antibiotic resistance and vaccine development

have been incorporated in to the PubMLST database, allowing interrogation from a unified interface.

EXERCISES

1. Rifampicin resistance is determined by mutations in the rpoB gene. Search the database for isolates

resistant to rifampicin with the rpoB allele determined (HINT: you can customize the interface from

the options page to add a drop-down box for the rifampicin_range field. You can search for a missing

value using the term null. The screenshots below demonstrate this process).

Sometimes you may need to refresh your browser (press F5) for changes in the interface to be displayed.

a. How many rpoB alleles are associated with resistance (MIC: >1 mg/l)? (HINT: Use the

schemes/alleles breakdown)

b. What are the four most common rpoB alleles among resistant isolates in the database?

c. What mutations are present in these alleles? (HINT: Mutations are displayed in the allele

information page reached by clicking the hyperlinked allele identifier)

d. What is the most common rpoB allele among rifampicin susceptible isolates?

2. The two field breakdown function of the database allows you to investigate how one field relates to

another in a set of data retrieved from a search (or within the whole database). Search for all isolates

from 2008 onwards:

a. Are there any clonal complexes that appear to be predominantly associated with carriage?

(HINT: the disease field contains information about whether an isolate was from carriage or

disease)

b. Which clonal complex is predominantly associated with serogroup C?

c. Which clonal complexes are only associated with serogroup A?

3. The EU-MenNet study published in Yazdankhah et al. 2004 J Clin Microbiol 42:5146-53 contains

random samples of carriage isolates from three countries (Czech Republic, Greece and Norway), along

with disease isolates from a similar time period. From the isolate query page, filter the database to

contain only isolates from this study:

a. What was the most common clonal complex seen in carriage in each of the countries?

b. What was the most common carriage clonal complex in the Czech Republic in

1994?

1996?

c. …and in disease (you can search for disease NOT ‘carrier’)

1994?

1996?

4. A random sample of disease isolates from the UK over 20 years prior to the introduction of the MCC

vaccine is described in Russell et al. 2008 Microbiology 154:1170-7. MLST and finetyping were

performed on approximately 100 isolates from each of the years 1975, 1985 and 1995.

a. What was the most common finetype (serogroup, PorA VR1/VR2, FetA VR, ST (cc)) in each of

the three years? (HINT: Use field combination breakdown)

1975?

1985?

1995?

b. In the whole dataset of 323 isolates, how many clonal complexes are the following

serogroups associated with?

A?

B?

C?

c. What are the most common combinations of PorA variable regions in each of the years?

1975?

1985?

1995?

5. The PubMLST isolate database cannot generally be considered a population dataset since submission

is biased with often only new variants submitted. Some laboratories, however, do submit all their

samples or at least an unbiased sample. One of these currently is the Czech Republic. Since the

beginning of 2009:

a. How many clonal complexes have caused disease in the Czech Republic?

b. Which was the most frequently isolated complex from disease?

c. How many STs were found in disease isolates belonging to this complex?

d. Which was the most frequently isolated ST from this complex:

i. In disease?

ii. In carriage?

e. Which serogroup was mainly associated with this complex?
