


MOLECULAR BIOTECHNOLOGY Volume 35, 2007

REVIEW


Molecular Biotechnology 2006 Humana Press Inc. All rights of any nature whatsoever reserved. ISSN: 1073–6085/Online ISSN: 1559–0305/2006/35:1/065–098/$30.00

*Author to whom all correspondence and reprint requests should be addressed. The Spanish National Genotyping Centre CeGen, Santiago node, Genomic Medicine Group, University of Santiago de Compostela, Galicia, Spain. E-mail: [email protected].

Online Resources for SNP Analysis

A Review and Route Map

Christopher Phillips

Abstract

The major online single nucleotide polymorphism (SNP) databases freely available as research tools for genetic analysis are explained, reviewed, and compared. An outline is given of the search strategies that can be used with the most extensive current SNP databases, National Center for Biotechnology Information (NCBI) dbSNP and HapMap, to help the user secure the most appropriate data for the research needs of clinical genetics and population genetics. A range of online tools that can be useful in designing SNP genotyping assays is also detailed.

Index Entries: Single nucleotide polymorphism, SNP; genotyping; variation; polymorphism; online databases; linkage disequilibrium; haplotype; haplotype block; genome; HapMap; clinical genetics; population genetics.

1. Introduction

The validated set of human single nucleotide polymorphisms (SNPs) is one of the most valuable resources to have come from the human genome mapping project (HGP). This dataset was largely unforeseen in the initial phases of sequencing, but as several genome equivalents were generated from different individuals during the project, the identification of SNPs became a by-product of growing importance. The discovery of new SNP sites by resequencing and the collation of detailed information about each locus have continued to develop both in scope and in depth since the simultaneous publication of the draft sequence and first full SNP map 5 yr ago (1,2). Added to this, our knowledge of SNPs has expanded in parallel with an improved understanding of the structure and function of the genome and its gene content. SNPs provide the most complete and densely spaced system of genome landmarks available. Above all, this enables researchers to improve the resolution of linkage maps, in particular in the precise mapping and study of genes contributing to complex disorders, where the contributory effect of individual genes is small or where genetics is compounded by multiple locus interaction and environment. As well as enabling greatly enhanced mapping, SNPs provide a new and fascinating layer of detail to our understanding of human variability and what this tells us about disease susceptibility, populations, and evolution. Because SNPs have low recurrent mutation rates compared to other polymorphic markers, they provide the most stable and reliable indicators of the evolutionary history of populations. Last, but not least, SNPs are the very stuff of genetic variation, as a substitution in or near a transcribed region can change an amino acid sequence, the control of transcription events, or the splicing pattern for the resulting RNA products. Such changes arise from loci commonly termed coding SNPs, promoter-site SNPs, and splice-site SNPs, respectively. Therefore, it can be argued that SNPs form the principal part of the variability affecting each stage of gene expression and can justifiably be said to influence the transcriptome and proteome, as well as the genome itself. As such, SNPs are more than just ubiquitous marker points; they are a core genome feature that will ultimately help to explain one of the enduring mysteries of human genetics: why comparable numbers of genes in species at opposite ends of the evolutionary scale can create such profoundly different levels of complexity.

In the same spirit as that prevailing in the database management of HGP draft and final nucleotide sequence, SNP data arising from public genome analysis have been open source from the beginning, freely available on websites accessible to any researcher. Furthermore, the largest database, dbSNP, accepts submissions directly from the scientific community, allowing the rapid dissemination of newly discovered SNPs. In fact, all the SNP databases described in this review are based on open access and in this sense are a shared scientific resource. Added to this, during 2006 the bulk of privately held SNP data will become available to the whole research community. This means that one of the original precepts of HGP, that free access to the best possible information acts to accelerate scientific progress, applies equally to the overwhelming majority of human SNP data. Unhindered access to information is now considered to be of such primary importance that a system for data release has been formalized in the recommended framework for international community resource projects (http://www.welcome.ac.uk/doc_WTD003208.HTML). This framework secures, for all scientists, rapid dissemination, the highest standards of database management, and the end use of data without restriction.

At the same time that genomic data started to become available, the fields of bioinformatics and database management were rapidly developing to keep pace with the scale of the information generated and the complexity of the searches required. These developments have meant that searching for specific SNPs among a total of 9 million or more human loci can now be achieved fairly easily, provided the researcher begins with clearly determined criteria for the markers required for a study. To this end, I can firmly recommend, as starting points, two excellent textbooks, Human Molecular Genetics and Human Evolutionary Genetics (3,4), which thoroughly cover the fields of genetic analysis and population genetics, respectively, these forming the two principal applications for SNP analysis. Both books provide, in nonscientific parlance, “one-stop shops” for understanding the extent and specific characteristics of SNPs needed for planning studies of inherited human disease (referred to as clinical genetics from this point on) or population genetics. I also consider it important for scientists in one area of speciality to be better acquainted with the other, so the use of both books is recommended. Careful project planning is an essential step, in addition to the scrutiny of appropriate reporting publications, before any database search starts in earnest. Therefore, the major steps of project design involving the access of online resources can be visits to PubMed (for the best current reports in the field), HapMap (for the review of SNP candidates in, or close to, genes or genome regions of interest), dbSNP (for the characterization of the chosen SNPs), and finally use of various web tools for optimizing the genotyping assay design. This is not intended to be a prescribed route: many of the sites outlined in this review have viable alternatives that can act, above all, to supplement the data obtained from the primary sources described here. In addition, particular areas of clinical genetics such as cancer studies have specialized databases that are not covered in this review but clearly serve their field with a more direct focus of information.

This review was originally written for a third potential application of SNP analysis, that of forensic identification, commonly termed DNA profiling. SNPs offer the potential to greatly reduce amplified product size compared to existing profiling loci and could therefore provide much improved results with the highly degraded DNA commonly encountered in forensic analysis or disaster victim identification.

Forensic profiling requires a specific set of search criteria to find SNPs with the necessary properties for this field, and readers interested in the forensic application of SNP analysis can find more specific guidelines in the original article along with several specialized reviews (5–7). It is interesting to note that forensic SNP sets may find applications in genetic studies where measurement of stratification and admixture ratios could be achieved using a discrimination and geographic origin marker panel. These equally specialized areas of SNP use in genetic analysis are discussed in the overview of SNP biology.

Therefore, the purpose of this review is twofold: to compare SNP databases and to detail the methods that can be used to check the characteristics of SNP markers of most relevance to a proposed study. Emphasis has been placed on ways to narrow the selection of SNPs (or any other supporting data) to a manageable size, an increasingly important strategy given the huge scale of information available. In the final section several online resources such as Basic Local Alignment Search Tool (BLAST) and RepeatMasker, which provide invaluable tools for the design of robust and reliable SNP genotyping assays, are discussed as a supplementary resource to the purely genetic databases.

Finally, during the preparation of this article wholesale changes to the availability of a substantial proportion of private SNP data were taking place. Although the review was not intended to cover searching private SNP databases (in particular, the 4 million SNPs comprising the Celera database known as Celera Discovery System, CDS), all the information previously held by Celera on these SNPs is in the process of transfer to dbSNP. This means the extent of freely accessible SNP data is certain to expand still more in the coming year. Because Celera SNPs were discovered using different approaches to the public SNP mapping initiative, the expansion should provide a considerable number of completely new markers (potentially as many as 1 million), despite the majority of commercially held Celera loci being common to both public and private databases. Some idea of how much supplementary data may eventually be released for public access can be obtained from the time taken by the National Center for Biotechnology Information (NCBI) to scrutinize, validate, and reorganize the Celera SNPs. This has already occupied 6 months of 2005 and is likely to go beyond a full year of data processing on a large scale. This underlines the continuing growth in importance of SNPs, both to provide sufficient coverage for the analysis of all parts of the genome and to help fully understand the complex mechanisms of gene expression.

2. The Biology of SNPs: A Brief Overview

SNPs, as base modifications, represent changes to the ancestral DNA sequence comprising a genome. Because the cellular mechanisms for correcting a base mismatch are extremely effective, it is necessary to understand how these substitution events progress from a single sequence change that is promptly edited back to the correct base, to become allelic (i.e., the variant base is polymorphic with a minor allele frequency greater than 1%). Substitutions, once reaching this frequency level, become true single nucleotide polymorphisms, a self-explanatory term, but note that the term SNP also encompasses single base insertions and deletions (commonly termed indels). Two processes create substitution-based SNPs: incorrect base incorporation during DNA replication and in situ chemical modification of a base. The first of these two processes, generating new SNPs from nucleotide misincorporation at DNA synthesis, is an extremely rare event given the fidelity of DNA polymerase enzymes and the elaborate proofreading mechanisms in place to check physical alignments in fresh sequence copies. The frequency of misincorporation has been estimated to be approx $10^{-9}$ to $10^{-10}$ per nucleotide (8) and this event alone is not sufficiently common to account for the huge number of SNPs with detectable minor allele frequencies observed in all organisms studied so far. This is an important point because a feature of SNPs often cited in favor of their use in both population genetics and association studies is the long-term stability of substitutions that have become fixed in a sequence, stemming directly from a very low recurrent mutation rate. Consequently, the alternative process of in situ modification of bases must account for the majority of SNPs generated. Although a range of factors external to the cell can effect a chemical transformation of a base, most notably high-energy radiation, the usual repair processes are, again, far too effective at correcting changes to account for observed SNP frequencies. The real clue to the most widespread SNP generation event comes from the fact that C-T and A-G SNPs far outnumber all other types of substitution. Furthermore, the rate of substitution observed within CG dinucleotides (usually termed CpG) is an order of magnitude higher than for all other dinucleotide motifs. CpG dinucleotides have two critical characteristics. First, they are the target for methylation: a universal process of base modification affecting 75% of CpG dinucleotides that plays a central role in the control of gene expression. Second, they exhibit dyad symmetry, that is, the complementary base sequence is also CG, and therefore methylation affects both strands. Once cytosine is methylated to become 5-methylcytosine it can undergo deamination to form a stable thymine base surviving on the uncorrected strand. In short, CpG can become TpG or CpA with equal likelihood and in half of these cases the original strand is corrected by the repair machinery rather than the deaminated strand, leading to a stabilized substitution event. Indeed, the mutability of CpG is such that these motifs only occur at 20% of the frequency predicted by normal base composition. When CpG motifs do occur at high frequency, in particular in the transcription control regions at the 5′ ends of genes, the tendency is for CpG to escape the destabilizing process of methylation, forming the characteristic CpG islands that can often act as telltale signatures for spotting putative genes. All this adequately accounts for the bulk of SNPs generated throughout the human genome and explains the overall high frequency of SNPs compared to other sequence changes such as indels (found at only ~10% of the frequency of substitution SNPs). It also predicts that the great majority of SNPs comprise C-T or A-G substitutions, which is indeed the case. Last, there is no reason why a substitution event cannot occur more than once at a given base position, however rare such an event is predicted to be, and I have successfully located many nonbinary SNPs in online databases with the aim of developing sets of these loci for mixture detection in forensic analysis (9).
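The two CpG properties described above can be made concrete with a short Python sketch. It is an illustration only and not part of any database tool discussed in this review: the function names are hypothetical, and the observed/expected ratio uses the common convention of dividing the CpG count by the count expected from the C and G content of the sequence, which is assumed here to correspond to the "frequency predicted by normal base composition".

```python
# Illustrative sketch only: function names are hypothetical, not from the review.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def cpg_deamination_products() -> tuple:
    """Top-strand readout after deamination of 5-methylcytosine on either strand."""
    top = "CG"
    bottom = reverse_complement(top)  # also "CG": the dyad symmetry of CpG
    top_hit = "T" + top[1]  # C -> T on the top strand gives TpG
    # C -> T on the bottom strand, re-expressed on the top strand, reads as CpA
    bottom_hit = reverse_complement("T" + bottom[1])
    return top_hit, bottom_hit


def cpg_observed_expected(seq: str) -> float:
    """Observed CpG count divided by the count expected from C and G content."""
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    observed = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    expected = c_count * g_count / len(seq)
    return observed / expected
```

Under this convention, a ratio near 1 indicates CpG at the frequency expected by chance, whereas bulk genomic sequence depleted by deamination would be expected to give a much lower value, in line with the 20% figure quoted above.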

Although the mechanism behind the generation of SNP variation had been well characterized, it was not until the completion of the HGP that the distribution, density, and diversity of SNPs could be viewed on a scale large enough to assess how adequate, or otherwise, the marker coverage was going to be. If any extensive gaps in SNP distribution occurred in the genome, then SNPs could not be universally applied for the fine mapping of every gene candidate, or used to track a complete set of genes when large numbers of these acted with small additive effect in multigenic traits. In 2001 the International SNP Map Working Group described (2) the first systematic study of the whole set of SNP variation as it stood at that time, following comprehensive sequence comparisons between the multiple donors used for the HGP sequence compilation. The analysis of the 1.42 million markers in this first map provided a detailed picture of the characteristics of human SNPs and allowed the appropriate planning of the future development of SNP maps, ultimately giving rise to the work of the SNP Consortium Allele Frequency Project and the HapMap initiative. The most important finding with a bearing on the usefulness of SNPs as linkage markers was that the distribution of SNPs is constant throughout the autosomal chromosome set. In addition, the vast majority of the genome was seen to contain SNPs at suitably high density, with only 4% of sequence showing frequencies less than one SNP per 80 kb, much of this reflecting incomplete SNP mapping at the time. This was an important finding, not just for the future prospects of SNP-based linkage analysis but as a contrast to the studies of gene density from HGP data that were revealing considerable disparities between chromosomes. This is shown by the contrasting numbers of genes observed on chromosomes 18 and 19, a pair with almost identical lengths but at opposite ends of the gene density scale. Figure 1 illustrates the gene distributions alongside the SNP density plots for both chromosomes. In line with distribution and density, patterns of SNP variability, measured as nucleotide diversity, were also remarkably consistent between the autosomes. All showed nucleotide diversities within 10% of the mean, except for the high and low value outliers of chromosomes 15 and 21, respectively, and the major histocompatibility complex (MHC) region of chromosome 6. The X and Y chromosomes differ from autosomes in both effective population size and mutation rate, and the lower observed densities and levels of heterozygosity for nonautosomal SNPs matched predictions based on the tendency of these chromosomes to show reduced diversity. Although SNPs are binary loci with a much lower average heterozygosity than any other polymorphisms, it was evident from the HGP map that their balanced distribution and levels of variability made the use of genome-wide high-density SNP mapping a viable prospect.

Fig. 1. SNP and gene density plots for two autosomes of comparable size: chromosomes 18 and 19.
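The limit on heterozygosity for a binary locus follows from simple arithmetic, sketched below in Python. The sketch assumes random mating (Hardy-Weinberg proportions), an assumption introduced here for illustration, and the function name and the example allele frequencies are hypothetical rather than taken from any dataset.

```python
def expected_heterozygosity(allele_freqs):
    """Expected heterozygosity under random mating: 1 minus the sum of squared allele frequencies."""
    if any(p < 0 for p in allele_freqs) or abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must be non-negative and sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


# Hypothetical allele frequencies, chosen only to show the contrast.
best_snp = expected_heterozygosity([0.5, 0.5])  # most informative binary SNP
skewed_snp = expected_heterozygosity([0.9, 0.1])  # SNP with a 10% minor allele
ten_allele_marker = expected_heterozygosity([0.1] * 10)  # multi-allelic marker
```

A binary SNP can never exceed 0.5 however its two alleles are distributed, whereas a marker with many alleles of similar frequency approaches 1, which is why SNP panels rely on marker density rather than on the variability of individual loci.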

Although a vast number of well-characterized SNPs can provide a wealth of marker sets for association studies, the actual level of polymorphism shown by individual SNPs is limited because many closely neighboring loci are bound together in near-complete linkage as chromosome segments termed haplotype blocks (alternatively linkage disequilibrium or LD blocks). Haplotype blocks occur because the chromosome segments have shared ancestry, so that the particular combinations of SNP alleles in close proximity comprising the haplotype are changed only slowly by recombination or accumulated mutation. Human evolutionary history has the unusual characteristics of involving very small population sizes and a relatively short total period of development of the species. As a result the human genome is particularly amenable to haplotype block analysis because the math dictates that the rate of erosion of haplotypes is slow (~$10^{-8}$ per base, per generation) and the number of generations after the disease variant mutation is relatively small (~$10^{4}$ to $10^{5}$). Using haplotype blocks as the basis for SNP analysis can have significant advantages in association studies, as genomic regions can be tested without the need to first pinpoint the location of the functional variants of the trait. However, haplotype block structure adds an additional layer of complexity because there are no guarantees that blocks are consistent in size or distribution, coincident in position with the genes of interest, or even very close (10). In fact, initial analysis using patterns of association between SNP pairs revealed a nonrandom distribution of linkage disequilibrium in the genome, although the blocks themselves consistently showed much reduced SNP variability with only a limited number of allele combinations per block. This latter characteristic had the potential to erode the power of SNPs as association markers, as the normal and variant gene might both share the same associated haplotype in a great many cases. For this reason, there is much interest in allele frequency differences between the major population groups, as it provides a chance to choose the best study group for a particular set of blocks. Haplotype blocks are clearly inconsistent with the established model of genetic distance measurement based on a predictable recombination rate and it is generally agreed that they represent areas of extremely low recombination bounded by smaller segments where recombination is much more frequent, so-called hotspots (11). For this reason an important phase of study after the SNP map publication was the analysis of haplotype block diversity and structure. Several studies examining chromosome-wide LD distribution found that blocks can span relatively large distances (12–14). Two studies obtained comparable block length values from different chromosomes: Daly et al. (14) reported a size range between 3 and 92 kb on 5q, whereas De La Vega et al. (15) found a wider range of block size for chromosomes 6, 21, and 22 (5 to 300 kb), but with an average size of just 26 kb and 18 kb in Europeans and Africans, respectively. The longest block discovered at this time was 800 kb (11), but blocks longer than 100 kb are normally expected to be rare (representing only 2% to 3% of the total found on chromosomes 6, 21, and 22) (15). Many of the principles applied to haplotype block mapping and underlying the HapMap approach still do not meet with universal support and the main arguments are more fully discussed by Wall and Pritchard (16). However, the outcome that would suit most researchers would be to be sure that a SNP at a given distance from a gene variant could adequately mark the variants underlying the phenotype and therefore act as a tag for comparing case subjects to control subjects. This is the principle of positional cloning, where progressively finer mapping can focus on the positions of contributing loci in the absence of clear information about gene action. Such an approach can potentially reduce the genotyping efforts needed to examine complex disorders to the point where a very small number of tagging SNPs might enable the tracking of the multiple variant genes creating complex disorders. This can be further simplified by attempting analysis on populations with a history of reduced variability due to small founder numbers, bottlenecks, reproductive isolation, or endogamy; hence the widespread interest in isolates such as island populations or small, culturally coherent groups like the Anabaptist sects of North America. At present, population genetics and association studies are starting to share a considerable amount of common ground. Furthermore, much can be learned from differences in the frequency of, and levels of susceptibility to, common diseases among the five major population groups. These are broadly based on the continental boundaries of Africa, Europe, Asia, the Americas, and Oceania (Pacific island groups). The HapMap initiative (of which more later) was set up to apply the principles outlined to delineate haplotype blocks in the genome. To develop the exploration of population differences, phase I of HapMap SNP analysis has genotyped all loci in 90 subjects each from Africa and North America (of European descent) plus 45 subjects each from China and Japan.
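The slow erosion of haplotypes mentioned earlier in this section can be illustrated with the standard textbook decay of linkage disequilibrium, in which the association between two loci falls by a factor of (1 - θ) per generation, θ being the recombination fraction between them. The Python sketch below is a simplification under stated assumptions: θ is taken to be the per-base rate multiplied by the distance (reasonable only over short distances), the rate is uniform along the chromosome (which the existence of hotspots contradicts), and the function name, default rate, and generation counts are illustrative choices rather than values from any database.

```python
def ld_fraction_remaining(distance_bp, generations, rate_per_bp=1e-8):
    """Fraction of the original association (D_t / D_0) left after the given number of generations."""
    # Recombination fraction between the two loci, capped at free recombination (0.5).
    theta = min(0.5, rate_per_bp * distance_bp)
    return (1.0 - theta) ** generations


# Illustrative distances, roughly spanning the block sizes quoted in the text.
for distance in (3_000, 26_000, 100_000):
    remaining = ld_fraction_remaining(distance, generations=10_000)
    print(f"{distance:>7} bp: {remaining:.3f} of the original association remains")
```

Under these assumptions, markers a few kilobases apart retain most of their original association over many thousands of generations, while markers hundreds of kilobases apart retain very little, which is consistent with the rarity of very long blocks noted above.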

To conclude this section, mention should be made of two important aspects of association studies using SNPs where a prior understanding of the characteristics of the study population is particularly important: admixture mapping and stratification effects. Admixture mapping (often termed MALD, for mapping by admixture linkage disequilibrium) exploits the differences in haplotype block boundaries between admixing populations, previously isolated, to gain higher levels of association (17). This is because recombination over the limited number of generations since admixture has had less opportunity to disrupt the associations between SNPs and genes shown to different degrees by each contributing population. In particular, African Americans are an informative population for association studies with an estimated level of admixture with Europeans of about 20%, although the level has a broad range from 4% to 30% depending on the US region (18). Stratification effects occur when the trait studied and the genetic variation as a whole are both clustered into strata within a population that is presumed to be homogeneous for both. As a result, the overall within-population variation approaches levels seen between groups of study and control subjects because the groups do not share identical ancestries. Stratification can potentially create spurious associations between the traits studied and a set of SNPs chosen to map them (19). An example of this effect is shown by the SNP rs182549, which shows a strong allele frequency gradient (cline) from northern to southern Europe, as does the trait studied, adult stature. The association between trait and variation in this case was entirely the result of an identical distribution of variation and was lost when individuals were rematched on the basis of geographic latitude within Europe (20). Interestingly, forensic discrimination SNPs can make effective measures of stratification, as such loci are required to be neutral, freely assorting, and highly polymorphic, characteristics not found in the association study SNPs. Once again these aspects highlight the importance of a thorough understanding of human population structure and history in the design and interpretation of clinical genetics studies.
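The way stratification manufactures an association can be shown with a small worked example in Python. The counts below are invented purely for illustration and are not data from the stature study cited above (20); the two strata, labeled north and south, and the function names are likewise hypothetical. Within each stratum the allele and the trait are independent, yet pooling the strata produces an apparent association.

```python
def odds_ratio(table):
    """Odds ratio for a 2x2 table ((a, b), (c, d)): rows carriers/non-carriers, columns trait/no trait."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)


def pool(*tables):
    """Add 2x2 tables cell by cell, as happens when strata are analyzed as one population."""
    return tuple(
        tuple(sum(t[i][j] for t in tables) for j in range(2))
        for i in range(2)
    )


# Invented counts: allele and trait are both common in the north and both rare in the south.
NORTH = ((480, 320), (120, 80))
SOUTH = ((40, 160), (160, 640))
POOLED = pool(NORTH, SOUTH)

for name, table in (("north", NORTH), ("south", SOUTH), ("pooled", POOLED)):
    print(f"{name:>6}: odds ratio = {odds_ratio(table):.2f}")
```

The odds ratio is exactly 1 within each stratum but well above 1 in the pooled table, so the association disappears once subjects are compared within strata, which mirrors the effect of rematching individuals by latitude described above.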

3. NCBI: PubMed, Entrez, Boolean Principles, and Databases of Relevance to SNP Analysis (http://www.ncbi.nlm.nih.gov/)

Any search for information relating to genetics and medicine should begin at the homepage of the National Center for Biotechnology Information (NCBI) website. NCBI is part of the US National Institutes of Health (NIH) and has been the principal worldwide repository of genomic data for the past 18 yr. The nucleotide sequence variation database housed at NCBI is known as dbSNP and comprises the largest collection of SNPs available with the most comprehensive set of supporting data for each locus. The following sections detail the structure and use of dbSNP specifically, but the importance of NCBI outside of dbSNP is that genetics research involving SNPs should always be set in the context of supporting information. In particular, it is important to gather, at the same time as SNP data, information about genes, phenotype, proteins (both chemical and structural variation), expression dynamics, and the contextual sequence surrounding the SNP. Given the extent of the biological databases at NCBI it is no surprise that it has built the most extensive collections of gene (Gene), inherited disorder (OMIM), protein (Protein), gene expression (GEO), and nucleotide (GenBank) databases in the past 8 yr. Together with the Santa Cruz genome database (http://genome.ucsc.edu/), NCBI has managed the content of each of the draft versions of the human sequence since 1990 and now keeps the reference sequence, completed in April 2003. This comprises 2.9 billion bases (99% coverage of gene-containing DNA) with an error rate of 1 in 10,000 bases. To most biologists NCBI will already be familiar in the guise of Medline and PubMed: two bibliographical databases that collate all the principal citations from biomedical journals (~5000 journals in total). Medline was the main source of data for PubMed but has largely been supplanted; the two are still distinguished by the fact that PubMed has a broader scope, including articles predating the Medline selection and certain “out of scope” content (i.e., not biomedical). The NCBI bibliographic databases are by default predominantly text oriented. This means they work by matching text recognized in the query submission to text in the data records, and to work efficiently the system needs to regulate vocabulary. The database of words used to index PubMed is MeSH (Medical Subject Headings) and this can itself be searched using the search menu in the top left of each NCBI homepage. For association study research, checking MeSH is a useful step to help clarify terminology before a search begins and to review in brief possible related areas of study to the disease of interest. For example, the query Repetitive Strain Injury gives the three specific medical terms used to describe varieties of this condition with the year of introduction of each. For clarification, at the top of the page are the alternative terms, named suggestions, that are also used in PubMed for text matching with the query term (e.g., repetition strain injury).

The importance of prior experience in the use of PubMed is that both PubMed and dbSNP are components of a unified NCBI database retrieval system termed Entrez. Therefore, familiarity with bibliographic search strategies that can produce a manageable list of published articles from the 12 million available at NCBI can help the user to develop the same principles for SNP searching. NCBI uses Entrez as a standardized query interface for all the major databases it manages, and this has two clear benefits for the user. First, the query system is the same for each Entrez database and second, the searches can be made global to return data that are then seen to be interlinked between many databases. Data returned from multiple sources are linked by cross-referenced hyperlinks termed linkouts (i.e., two-way connections between each database). To work efficiently and to delineate a search correctly Entrez relies on an understanding of Boolean terms and combinations of parameters, termed fields, that define the required characteristic from a piece of data. For instance, searching NCBI using just the search term diabetes gives a quarter of a million literature citations alone and equally daunting numbers from the other databases. When diabetes is used with the operator “AND” in combination with the term chromosome 2, the returned PubMed citations drop to a much more realistic 140 articles, summarizing several studies, among others, analyzing the interleukin 1 gene cluster on chromosome 2 implicated in diabetes susceptibility. Not surprisingly the number of genes listed in EntrezGene drops in similar fashion from 1114 to 3, again acting to define a specific genome feature before any follow-up searches have begun. This simple predefinition of search terms can be particularly useful for the process of designing a new genetic study when it is important to check that the work to be undertaken is both manageable and has enough leads to instigate database research in earnest, or equally important, has not already been achieved elsewhere. Therefore in the initial stages the PubMed and OMIM text-based databases (termed literature databases in NCBI) are as important as the genetic content databases (termed molecular).

Boolean terms govern the rules used in all database searching by applying the principles of logic that define the relationship between a set of inclusive or exclusive terms, ruled by the three operators: AND, OR, and NOT. The logic is summarized in Venn diagram form in Fig. 2 and each operator can be described as follows:

• OR (often termed union) is inclusive, therefore it returns all database entries that contain at least one of the provided search terms.

• AND (often termed intersection) is exclusive, therefore it only returns database entries that contain all the provided search terms.

• NOT (often termed difference) excludes from the returns all database entries with the provided search terms.
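
The three operators correspond directly to set operations, which can make the logic easier to check by hand. The following minimal Python sketch uses two made-up sets of record identifiers; the IDs and variable names are illustrative assumptions, not real database entries.

```python
# Hypothetical record IDs returned for two separate search terms.
term_a = {101, 102, 103, 104}   # records matching term A
term_b = {103, 104, 105}        # records matching term B

either = term_a | term_b   # A OR B: union of both result sets
both = term_a & term_b     # A AND B: intersection of the result sets
a_only = term_a - term_b   # A NOT B: difference (in A but not in B)

print(sorted(either))  # [101, 102, 103, 104, 105]
print(sorted(both))    # [103, 104]
print(sorted(a_only))  # [101, 102]
```

AND can only shrink a result set and OR can only grow it, which is why AND is the usual operator for narrowing a search.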

Fig. 2. Boolean operators. OR applies to all items in A or B, AND to items found in both A and B, and NOT to items in A not found in B.

Nearly all Entrez database queries use the operator AND to narrow down a search using multiple combinations of an extensive array of fields, each set being tailored to the content of the database. This approach is sometimes known as a relational search, as it looks for relations or links between the search terms found in each data record. If search terms are to be confined to a specific field, Entrez rules require that these are described using predefined tags, termed field tags, set in square brackets. For example, entering “short [au]” returns publications in PubMed with Short as author and the list will not include studies of short tandem repeat loci (unless Short is an author!). Descriptions and details of the whole PubMed field tag list are outlined at (http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=help pubmed.box.pubmedhelp.Box_1_Search_Field_D). Entrez searches default to searching all fields in the absence of field tags and to the operator AND, not OR, if there are spaces between fields, as is common practice elsewhere. For most of the molecular databases the text of a tagged field must be in the correct format, termed syntax, to be automatically read by the NCBI search engine and this can initially be a source of frustration until Entrez formatting becomes second nature.

To begin using Entrez it is better to use a menu of fields for a single database search from the limits tab that appears on each database homepage. These take two forms depending on the type of information held in the database: additional query description boxes or tick-boxes and query description boxes. An example of the first might be a PubMed search that can work with one of 29 limits such as author or text word. These still require an entry in the query box but some assistance is given for combining these with extra intersections by limiting certain fields at the same time using a small number of choices in seven additional areas such as text language or human/animal subject matter. The date range for publication or adoption to PubMed (Entrez Date) gives the most useful field limit in combination with text word as it ensures the returns concentrate on the most recent publications. Overall, the underlying theme is to allow simplified queries that still enable manageable numbers of returns from the largest databases. The second form of limits using tick-boxes is used for many of the molecular databases, notably SNP (termed EntrezSNP in this guise) where a clear level of categorization is possible with the data. Therefore, for example, ticking one of the 15 IUPAC substitution codes (e.g., Y to denote C/T substitutions [listed in Table 1] and sometimes termed IUPack) starts the process of concentrating search terms down to a specific set of data to provide focus. The SNP field list is one of the most extensive and the use of terms is outlined in greater detail in the dbSNP section.
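
As a rough illustration of the field-tag convention, the sketch below assembles a query string in Python. The helper names `tagged` and `all_of` are hypothetical, only the `[au]` tag is taken from the text, and the functions simply build text that could be pasted into the Entrez query box; they do not contact NCBI.

```python
def tagged(term: str, field: str) -> str:
    """Append an Entrez field tag in square brackets, e.g. 'short [au]'."""
    return f"{term} [{field}]"


def all_of(*terms: str) -> str:
    """Join terms with an explicit AND (a space alone is also read as AND)."""
    return " AND ".join(terms)


author_only = tagged("short", "au")
combined = all_of(author_only, "tandem repeat")

print(author_only)  # short [au]
print(combined)     # short [au] AND tandem repeat
```

Writing the AND explicitly is optional given the default behavior described above, but it makes a long query easier to read back.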

Three modifiers of operator function can be used in Entrez:

• Ranging: setting a range for a value in a field (e.g., SNP heterozygosity) using a colon (:) between the lower and upper limits for the value.

• Parentheses: combining related terms together as logical groups and forcing the order of operation for the search process or performing a common operation on a group of terms. In the first case this sets the order of searching to the bracketed terms first so that the next operation is performed on the results of the first operation. In the latter case brackets are commonly used to group together NOT items combined using OR. Entrez syntax uses round parentheses for grouping and square brackets for field tags. Brackets are helpful in ensuring descriptions lacking clear definition such as “learning disorders” are replaced by a broadly based set of more specific terms that in combination keep the focus but prevent false exclusions from returns. In place of the above query term, using (dyslexia OR attention deficit hyperactivity disorder) together ensures the combined returns with either term are available for the next operation in the query. The example used in PubMed help is apt because it illustrates that use of parentheses can mirror the logic of a sentence: so “find articles on the effects of heat and humidity on multiple sclerosis” takes the form: (heat OR humidity) AND multiple sclerosis.

• Wild card: using an asterisk in place of missing text allows a partial entry to be used as a query term (e.g., using BRC* will find both BRCA1 and BRCA2). NCBI does not generally use adjacency searching in the molecular databases. This is based on the proximal operator NEAR, routinely used by web search engines like Google. A notable exception is the alternative text terms, named suggestions, that are handled by MeSH in PubMed. Using adjacency searches tends to lack focus for the majority of data in NCBI, as information is usually clearly and unequivocally categorized.
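
The three modifiers can be sketched in a few lines of Python. This is an illustration under stated assumptions: the helper names are hypothetical, the tag `HET` is a placeholder for whichever heterozygosity field the dbSNP field list actually defines, and the wild card is emulated locally with the standard `fnmatch` module rather than by querying NCBI.

```python
from fnmatch import fnmatchcase


def any_of(*terms: str) -> str:
    """Group alternatives in round parentheses so they are evaluated first."""
    return "(" + " OR ".join(terms) + ")"


def value_range(low, high, field: str) -> str:
    """Write a range as low:high followed by a field tag in square brackets."""
    return f"{low}:{high}[{field}]"


# Parentheses: the PubMed help example quoted in the text.
query = any_of("heat", "humidity") + " AND multiple sclerosis"
print(query)  # (heat OR humidity) AND multiple sclerosis

# Ranging: "HET" is an assumed placeholder tag, not a confirmed field name.
print(value_range(30, 50, "HET"))  # 30:50[HET]

# Wild card: show locally which symbols a BRC* pattern would match.
symbols = ["BRCA1", "BRCA2", "BRAF"]
print([s for s in symbols if fnmatchcase(s, "BRC*")])  # ['BRCA1', 'BRCA2']
```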

It is possible to use Boolean terms to combine individual Entrez searches performed at different times as an alternative to using parentheses, enabling more opportunity to monitor the number of returns with different search term combinations. This uses the clipboard and history tabs. The clipboard is a workspace for holding up to 500 items manually selected from search returns. Note (before losing work) that the contents are cleared after 8 hours of inactivity. History lists the database search activity as numbers prefixed by a hash (#). Because these records are, again, cleared after 8 hours of inactivity it is worth getting into the habit of transferring long multistep search records into the personal folders available as “My NCBI” (http://www.ncbi.nlm.nih.gov/entrez/login.fcgi?call=so.SignOn..Login&callpath=QueryExt.CubbyQuery..ShowAll&db=pubmed). Previous searches can be combined as hash fields using Boolean operators (e.g., #1 AND #2 gives an intersection of the first two searches from the current active session). It is also possible to use hash fields together with normal fields, thus helping to build a stepwise record of the search process as it is modified to reduce return numbers in small stages. Finally, it is logical to fix the values of certain fields in Entrez to filter down the number of returns. For instance, choosing human as the organism is wise as much SNP data are now held for the mouse genome. These options for EntrezSNP searching are discussed in more detail in Subheading 7.

Table 1
IUPAC Codes Used With the [ALLELE] Tag Denoting SNP Base Substitutions

Code  Substitution    Code  Substitution
M     A or C          V     A or C or G
R     A or G          H     A or C or T
W     A or T          D     A or G or T
S     C or G          B     C or G or T
Y     C or T          N     A or C or G or T (or indeterminate base)
K     G or T

A, C, G, and T can be used individually to select all SNPs exhibiting that base as an allele.
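
For readers who handle allele codes in scripts, the 15 IUPAC codes can be held as a simple lookup. The sketch below is a minimal Python illustration; the dictionary contents follow the standard IUPAC assignments, while the function name `code_for` is hypothetical.

```python
# The four single bases plus the eleven ambiguity codes (15 in total).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def code_for(alleles) -> str:
    """Return the IUPAC code that denotes exactly this set of bases."""
    wanted = "".join(sorted({a.upper() for a in alleles}))
    for code, bases in IUPAC.items():
        if bases == wanted:
            return code
    raise ValueError(f"no IUPAC code for alleles {wanted!r}")


print(len(IUPAC))           # 15
print(code_for("CT"))       # Y
print(code_for(["g", "a"]))  # R
```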

Several key NCBI databases have an important place in clinical genetics research design or in support of dbSNP searches. These include GenBank, Gene, Online Mendelian Inheritance in Man (widely referred to as OMIM), and UniSTS. This is just a fraction of the total collection and represents the most useful databases for adding to data already obtained from dbSNP, PubMed, Map Viewer, and MeSH. What follows is a brief outline of the structure and use of the four databases that can be accessed in combination with dbSNP searches.

3.1. GenBank

GenBank comprises the nucleotide sequence database of NCBI. This simple description belies the scale of the information held: a collection of sequences comprising 59 gigabases of data from more than 130,000 species (spanned by 17 different genetic codes), which is updated daily. The database organization involved is equally complex but the front end is simple enough if the user needs just a sequence segment, the coordinates, and the amino acid translation sequence. The example sequence file on the GenBank homepage illustrates a standard sequence report with the fields that can be used in searches. GenBank is part of Entrez, has its own specialized fields, and is a subset database under the “umbrella” group of EntrezNucleotide. This comprises sequence subsets for expressed sequence tag data (dbEST), genome survey sequence data (dbGSS), and Core Nucleotide, the subset of interest to most users, containing genomic sequence data. This allows joint searches in Entrez and individual searches in BLAST, avoiding cross-referencing when it is not needed. To add more complexity and another name to keep in mind, there is also the RefSeq database. This can be thought of as the reference sequence set for the key study organisms (3244 in early 2006) with integration, meaning the included sequences can be optimally compared, as can the annotation, the process of characterizing a gene from the base sequence. RefSeq is not part of Entrez but the protein sequence portion can be searched in Entrez as EntrezProtein. For most users interested in SNP searches the main contact with nucleotide databases is in the process of designing genotyping assays. Therefore it is normally necessary to check flanking sequence for quality or presence of clustering SNPs and to check potential primer designs using BLAST. There is not a great need in most cases to go deeper than this.

3.2. Gene

Gene comprises the NCBI gene directory previously known as LocusLink. Maintaining this database is particularly challenging as the gene landscape is constantly changing in so many aspects. Definitions of function, interactions, association with a trait, activity, and many other characteristics are regularly revised and this must be collated with equal regularity. Gene works with unique identifiers assigned to three types of gene entities: genes with defining sequences, genes with known map positions, and genes inferred from phenotypic information. These gene identifiers are tracked, and new information is added when available. The scope of Gene, which encompasses all organisms, supersedes that of LocusLink, which was centered solely on human gene data. Searches generally begin with an identifier in the form of a letter/number combination standardized by HUGO (http://www.gene.ucl.ac.uk/nomenclature/). The term does not have case sensitivity but care is needed to avoid spaces (treated as an AND operator and usually failing to find the target). The list includes all the species with genes matching the query combinations but here case becomes an important point of distinction. Query “ABCC1” returns ABCC1 in humans and Abcc1 or ABCC1 in other mammal species (all essentially the same gene), but also abcC1 in Dictyostelium sp., which is a different gene. Each linkout in the list gives a single report page starting with a two-line header including the full name and a unique geneID number that can be used in Entrez by itself. This is followed by sections: summary information; graphic summary of transcription structure; graphic summary of genomic context with a linkout to Map Viewer; bibliography; general gene information; general protein information; RefSeq sequences; related sequences and additional links. This is comprehensive enough to provide most of the search directions needed to explore the context of the gene as well as the likely critical characteristics that can help the researcher assess its status as a candidate for study. Invariably, the most useful section is the bibliography, a comprehensive catalog of relevant studies of the gene that allows an easy check of current interest in the gene in the context of a particular disease. Reference to the gene symbol denotes the name written as upper case letter/number combinations; gene ID denotes the 5-digit number.

3.3. OMIM

Online Mendelian Inheritance in Man or OMIM is the NCBI phenotype database. In this role it catalogs traits, diseases, and disorders, although not always with reference to a gene if no association has currently been described. The list of entries in early 2006 reveals the paucity of understanding of the genetic basis for disease. Of a total of 16,612 entries only 384 list a gene with a known sequence and phenotype (unique OMIM number prefixed with +). Another 2229 describe a phenotype with a suspected Mendelian inheritance pattern (no prefix), 1502 a well-described phenotype lacking a described molecular basis (%), and 1862 a phenotype with a known molecular basis (#). This leaves the remaining 10,635 entries listed as merely genes with known sequence (*), therefore the majority of OMIM data consists of genes lacking known phenotypes. Each entry has a unique number often used in the literature to denote a condition and usable as a search term throughout NCBI. OMIM compensates for the lack of concrete data by being very readable and informative about gene function. The noticeable difference in the character of the database arises because OMIM is hosted by NCBI but developed independently at Johns Hopkins University. Using OMIM as the basis for a literature search is usually a fruitful approach: the text acts to review the quality of the associations suggested by studies in the area of interest under the headings cloning, gene function and structure, mapping, molecular genetics, and animal models. The references at the end of the OMIM report are a selected list of the most relevant studies to provide the clearest direction for further investigation. As an example that highlights the difficulties of collating very extensive phenotype data with genetic data, there is no mention in an exhaustive report for gene ABCC1 of the effect of coding SNP rs17822931 on human earwax viscosity (21). The OMIM mapping section is particularly helpful as a starting point when planning an association study with SNPs or seeing whether further linkage analysis is needed as a preliminary stage of study.
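
The entry counts quoted above can be checked with a short tally. The Python sketch below simply restates the early 2006 figures from the text as a lookup keyed by OMIM number prefix; the variable names and the shortened category labels are illustrative.

```python
# Entry counts by OMIM number prefix, as quoted in the text for early 2006.
omim_entries = {
    "+": ("gene with known sequence and phenotype", 384),
    "(none)": ("phenotype with suspected Mendelian inheritance", 2229),
    "%": ("described phenotype, molecular basis not described", 1502),
    "#": ("phenotype with known molecular basis", 1862),
    "*": ("gene with known sequence", 10635),
}

total = sum(count for _, count in omim_entries.values())
print(total)  # 16612

# Share of entries that are genes without a described phenotype.
gene_only_share = omim_entries["*"][1] / total
print(f"{gene_only_share:.0%}")  # 64%
```

The five categories account for all 16,612 entries, and roughly two thirds of them are sequence-only gene records.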

3.4. UniSTS

UniSTS comprises the NCBI database of linkage markers termed Sequence Tagged Sites (STS) and leads on from the above point about linkage analysis. UniSTS content encompasses polymorphic loci other than SNPs available for linkage analysis (predominantly short tandem repeat sites) and can be searched using gene identifiers or chromosome position. The return page gives related information and helpfully recommends polymerase chain reaction (PCR) primers to simplify the development of linkage marker typing if this is required. The primer pair information is used to match alternative names for linkage markers, therefore, for example, “D2S2300” will retrieve the marker named in the database as “AFM261YB1.” The easiest way to combine STS markers and SNPs to fully cover an area of interest with suitable linkage markers is to use the Between Markers search option in the dbSNP homepage (http://www.ncbi.nlm.nih.gov/SNP/index.html). This provides two query windows for inserting the STS IDs and a list of SNPs is returned that spans the distance in between.

4. The dbSNP Database (http://www.ncbi.nlm.nih.gov/SNP/index.html)

NCBI dbSNP is the principal database of SNP information generated from the HGP and the simultaneously published first SNP map. dbSNP has continued to collate all the data from various SNP validation initiatives that have followed since, including output from The SNP Consortium (principally the Allele Frequency Project), the Perlegen SNP genotyping initiative, and HapMap. The SNP data are regularly updated in synchrony with genome rebuilds, ensuring the highest quality of SNP locus mapping and scrutiny. In early 2006, the total dataset amounted to 40.6 million different SNP loci, with just under half of this number validated by genotyping a sample set to confirm polymorphism. The human content comprised 10,430,750 SNP clusters (i.e., rs-numbers) of which 4,236,590 were sited in genes. A total of 35 organisms have individual SNP databases, 12 of these from completed genomes. The Chimpanzee dataset is likely to be of growing importance and currently comprises 1.54 million SNP clusters with just more than a third of these in genes. With this pace of data building it is a good idea to subscribe to the dbSNP-announce automatic e-mail update system to keep updated on developments (http://www.ncbi.nlm.nih.gov/mailman/listinfo/dbsnp-announce). As well as reporting the release of each new build, announcing newly added features, and outlining corrections or discovered problems with past or present builds, there is an archive for referencing possible problems with, or qualifications to, previously obtained search data (http://www.ncbi.nlm.nih.gov/mailman/pipermail/dbsnp-announce/). The rapid growth of human SNP data in dbSNP during the past 5 years is shown in Fig. 3.

Any reference to a SNP locus within NCBI (and elsewhere such as the scientific literature and alternative SNP databases) uses a unique identifier comprising a number prefixed with rs. All rs-numbers are listed in NCBI as linkouts, which return a standard format summary page for the SNP termed the Cluster Report, listing a full set of key parameters for the locus. This page can be thought of as the SNP locus homepage and from this standard point of reference different routes can be followed to database entries elsewhere in NCBI with content related to the SNP, such as Gene, OMIM, or GenBank. Of particular use is the link to Map Viewer (see Subheading 7.), which plots the SNP's chromosome position in relation to a variety of other genome features acting as landmarks for the locus. Interlinking in both directions is now standard practice, therefore clicking an rs-number in any other major SNP database outside of NCBI will connect to the dbSNP cluster report. This means that it is important to be familiar with the page layout and to know the limitations that exist for the user in the way data are presented. The cluster report layout was revamped at the start of 2006 and, after the summary header of four lines each for locus and allele information, the detail sections currently comprise submission, fasta, geneview, map, diversity, and validation.

4.1. Submission

Submissions for the SNP are listed, with the reports used to validate the locus marked by an icon. Each ss number linkout leads to a detailed breakdown of the genotyping performed and these in turn linkout to lists of genotypes with sample IDs and detailed population descriptions if these require checking. The sample details allow use of consensus controls as genotyping standards, for example, submission ss2316529 for SNP rs1490413 lists sample CEPH1331.01 as an A-G heterozygote.

4.2. Fasta

Fasta lists the flanking sequence around the substitution site. Minor problems can occur with recovery of sequence from this section, requiring care. The amount of sequence can vary from tens to hundreds of bases, whereas the SNP can sometimes be found very close to the start or the end of the listed sequence (Fig. 4). In these cases recovery of sufficient sequence involves visits elsewhere, either to Entrez Nucleotide or the more user-friendly Santa Cruz genome assembly. Inconveniently the NCBI linkout to Santa Cruz has recently been removed, requiring a “manual trip” to the gateway page http://genome.ucsc.edu/cgi-bin/hgGateway and entry of the rs-number. In the Santa Cruz map browser, click on the blue rs-number linkout under the various genome elements in the map view to get the summary page and then click on “View DNA for this feature” and choose the width of flank. It may be wise to choose 0 and 100 bases (i.e., upstream/downstream), for example, followed by 100 and 0, to keep track of the substitution site, as this is shown as a normal base and therefore its position can be difficult to locate. Note that Santa Cruz uses the term simple nucleotide polymorphisms. Back in fasta, sequence is arranged in 10-base blocks using different type styles: upper case/lower case and black/green. Upper case denotes normal, unique genomic sequence, whereas lower case is used for sequence identified by RepeatMasker (detailed in Subheading 11.) as low complexity or repetitive element sequence. Green is used to denote sequence used by the submitter lab during SNP identification. A common problem is a lack of consistency in the direction of the displayed sequence in fasta and Entrez Nucleotide Sequence Viewer. Therefore, users wishing to check sequence more carefully will need to be prepared to use sequence inversion macros in Excel or in stand-alone programs to “flip” between different strands in different locations. Helpfully Sequence Viewer allows a “view on minus strand” option. One final sequence tracking problem is that clustering SNPs are occasionally given the IUPAC code (e.g., CTAYGGA) within the sequence in fasta and this is easily overlooked, although it appears to be uncommon and largely ad hoc in occurrence.
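
Strand flipping of the kind mentioned above amounts to taking the reverse complement. A minimal Python sketch follows; it assumes the standard IUPAC complement pairs, so that an embedded code such as the Y in CTAYGGA survives the flip, and it preserves the upper/lower case used to mark repeat-masked sequence. The function name is hypothetical.

```python
# Complement pairs for the four bases plus the IUPAC ambiguity codes.
_BASES = "ACGTMRWSYKVHDBN"
_COMPS = "TGCAKYWSRMBDHVN"

# One translation table covering both upper- and lower-case letters.
_COMPLEMENT = str.maketrans(_BASES + _BASES.lower(), _COMPS + _COMPS.lower())


def reverse_complement(seq: str) -> str:
    """Return the opposite-strand sequence, read 5' to 3', keeping case."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("CTAYGGA"))   # TCCRTAG
print(reverse_complement("acgtTTGC"))  # GCAAacgt
```

Applying the function twice returns the original sequence, which gives a quick sanity check when comparing listings from two sources.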

4.3. GeneView

For SNPs in genes the GeneView section outlines the gene context of the substitution with color codes for synonymous, nonsynonymous, and intronic SNPs (pale green, red, and yellow, respectively), and includes the different amino acid residues and their position in the protein sequence plus linkouts to the nucleotide reference assembly for the coding regions of the gene.

Fig. 3. Growth in SNP numbers since dbSNP began cataloging loci in 2000. Plot shows the cumulative number of unique SNP loci (black line) and the proportion of these SNPs validated by genotyping (gray line). Assimilation of loci into dbSNP is fast but proper assessment of the polymorphism much slower.

4.4. Map

The Map section lists the NCBI and CDS positions where both exist (i.e., reference and Celera), plus linkouts to the contigs used in the assembly for each genome. This section also has linkouts to Map Viewer and to neighbor SNP details. The position and density of neighbor SNPs may be different between NCBI and CDS, therefore in assay design it is important to track all of these to be sure of clean primer-binding site sequence free from interfering substitutions nearby. This is especially true of primer extension chemistries that need ~20 bases of SNP-free sequence immediately adjacent to the target substitution. If SNPs close together are in association it follows that a particular allele may carry a neighbor at identical frequency and always drop out from the assay, producing an apparently monomorphic locus. It is important to note that this problem occurred in ~6% of cases during the HapMap phase I SNP genotyping (22). For a high density of neighbor SNPs it may be too problematic to find a clean sequence but the SNPs can be more easily tracked in Santa Cruz, which shows by far the clearest graphic arrangement of SNP grouping close to the study SNP. Unfortunately it requires a longhand process of clicking each linkout and obtaining the positions to construct a fully annotated sequence around the assayed substitution site.
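
Once neighbor SNP positions have been collected by hand, the check for a clean primer site is easy to automate. The Python sketch below flags neighbors within 20 bases of the target on either side; the positions are invented for illustration, the function name is hypothetical, and a real primer extension design may only need the side on which the primer binds.

```python
def interfering_neighbors(target_pos, neighbor_positions, clear_bases=20):
    """Return neighbor SNP positions lying within clear_bases of the target."""
    return sorted(
        pos for pos in neighbor_positions
        if pos != target_pos and abs(pos - target_pos) <= clear_bases
    )


# Hypothetical chromosome coordinates, not taken from any database.
target = 1_000_500
neighbors = [1_000_200, 1_000_488, 1_000_519, 1_000_521, 1_000_900]

print(interfering_neighbors(target, neighbors))  # [1000488, 1000519]
```

Running the check separately against the positions from each assembly covers the case where the two assemblies list different neighbors.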

4.5. Diversity (replaces Variation)

The Diversity section, originally termed Variation, has been the source of problems for several years and is now being completely revised to incorporate the detailed population data coming from HapMap. Until recently this section had been potentially misleading in the way it summarized allele validation information, because it used average allele frequencies and heterozygosity based on merged data from all submitting laboratories. For example, a submitter's estimate of 0.2 minor allele frequency based on 100 individuals combined with another of 0.5 based on 20 individuals was summarized as 0.25 because dbSNP used total chromosome counts from all submissions to obtain the average values. Not only can large allele frequency differences arise from combining samples from different population groups but dbSNP previously made no distinction between random population samples and samples of individuals with particular conditions. This is illustrated by SNP rs2075745 where a C allele only occurs in subjects with type II diabetes. Despite this, a frequency estimate for C of 0.476 was given for many years in the cluster report, although all other populations tested to date exhibit an A/T substitution at this SNP. In this case the misleading frequency resulted from a comparatively large sample of 200 diabetic subjects tested by one submitting laboratory and this skewed the estimates. Since late 2005, dbSNP has begun the process of incorporating the detailed, population-based breakdown of frequency estimates as it now appears. This gives a vast improvement in clarity for a critical SNP characteristic and this is evidently being extended to all submitter estimates, not just HapMap, as the previous example of rs2075745 is now unambiguous in presentation. Although the averaged figures are retained, it is now straightforward to interpret these with reference to the population studies made by the submitting laboratories. Note, however, that alleles are listed in base-alphabetical order in dbSNP and do not use the HapMap convention of reference allele frequency first. This may seem to put undue emphasis on detail but a look at SNP rs176000, an example of a base inversion between two submitting laboratories compounded by three-allele variation, indicates there is still scope for confusion in this section.
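
The averaging described above is a chromosome-weighted mean, and the 0.25 figure can be reproduced directly. The following Python sketch assumes diploid samples, so each individual contributes two chromosomes; the function name is hypothetical.

```python
def pooled_frequency(submissions):
    """Chromosome-weighted mean allele frequency across submissions.

    submissions: (allele_frequency, n_individuals) pairs; each diploid
    individual contributes two chromosomes to the total.
    """
    submissions = list(submissions)
    total_chromosomes = sum(2 * n for _, n in submissions)
    allele_copies = sum(freq * 2 * n for freq, n in submissions)
    return allele_copies / total_chromosomes


# 0.2 from 100 individuals and 0.5 from 20 individuals, as in the text:
# (0.2 * 200 + 0.5 * 40) / 240 = 60 / 240
print(round(pooled_frequency([(0.2, 100), (0.5, 20)]), 3))  # 0.25
```

The larger sample dominates the result, which is why a single large submission from an unrepresentative group can skew the summary value.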

4.6. ValidationValidation briefly summarizes the Mendelian

status, PCR performance, and allele frequencydistribution quality indicators of the SNP. In addi-tion to SNPs, data are held for other polymor-phisms loosely defined as simple. These includesmall-scale multibase insertions or deletions (alter-natively termed deletion/insertion polymorphisms,indels, or DIPs), microsatellite repeat variation(also termed short tandem repeats or STRs), andretroposable element insertions. Because dbSNP

65_98_Phillips_MB06_0040 1/3/07, 4:40 PM79

Page 16: Online Resources for SNP Analysis - USC

80 Phillips

MOLECULAR BIOTECHNOLOGY Volume 35, 2007

is an open database, there is a straightforward framework for receiving and checking SNP data sent in from submitting laboratories. The SNP locus information is either reported as a new discovery (rare for human SNPs now) or collated into an existing reference SNP set. Clearly the latter case, where different laboratories routinely report identical SNPs, requires careful scrutiny of the flanking sequence to check for previous submission to NCBI and unique location in the genome. Submission criteria are very effective at detecting nonunique SNPs: the checking process requires a minimum 25-bp context sequence each side of the substitution for the detection assay and uses a minimum total context sequence of 100 bp to establish whether or not the SNP can be positioned uniquely in the genome. The proportion of nonunique SNPs is, however, small (about 5%) and they are more common in pericentromeric areas; therefore the use of loci from these chromosome regions generally needs more care to ensure that the SNP is unique and is flanked by a relatively small proportion of low complexity sequence (sequence containing portions of intra- or interchromosomal repeats, polybase, or short tandem repeats). When the context sequence can be uniquely positioned and the SNP is identified as previously observed, dbSNP will place the submission into the reference SNP group, hence the use of the familiar rs-number denoting the refSNP. The full group of submissions for the same locus is termed a cluster in NCBI, hence the single summary page for each SNP is termed the reference SNP cluster report. Submitted SNP and reference SNP details are distinguished by using multiple ID numbers prefixed with ss and a single rs-number, respectively, at each cluster report. The current ratio between submissions and clusters is approximately 2.5:1 (27.2 million to 10.4 million), therefore multiple institutions have independently confirmed the majority of NCBI SNPs. Clicking on an ss number (also termed the accession number) allows the scrutiny of the quality of the submission including statistical analysis of the individual genotypes to ensure they are in predicted ratio from the assumption of Hardy-Weinberg equilibrium. As outlined previously, these detailed genotype listings can then be used to check the reliability of study assays by retyping the same controls.
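The Hardy-Weinberg check mentioned above can be illustrated with a short calculation. The following is a minimal sketch and not dbSNP's own procedure: it assumes a biallelic SNP and a chi-square goodness-of-fit test with one degree of freedom, and the function name `hwe_chi_square` is hypothetical.

```python
from math import erfc, sqrt


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions for one biallelic SNP.

    n_aa, n_ab, n_bb are observed genotype counts (hypothetical inputs).
    Returns (chi_square, p_value) with 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic SNPs).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of the chi-square distribution with 1 d.f.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value
```

With invented counts of 25, 50, and 25 the observed genotypes match the expected ratio exactly, so the statistic is 0; counts of 50, 0, and 50 (no heterozygotes at equal allele frequencies) give a statistic of 100 and a very small p-value. For small samples or rare alleles an exact test is usually preferred to this approximation.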

5. The HapMap Project (http://www.hapmap.org/downloads/nature02168.pdf)

The international HapMap project was launched in late October 2002 with the stated aim of determining the haplotype structure of the human genome. This has broadened in scope slightly to encompass, in their own words, “all common human sequence variation, providing information needed as a guide to genetic studies of clinical phenotypes.” What this means is that the mapping of haplotype blocks using set SNP positions has been extended to include any SNP landmark that could be equally useful. The full program of the HapMap project is ambitious in scale, in some senses approaching that of HGP, but it remains simple in concept: to begin by genotyping at least one SNP per 5 kb of sequence (just more than 1 million markers) in 269 individuals taken from four populations located on three continents and to conclude by consolidating SNP number and annotation so that a limited set of SNPs can be confidently assigned as markers that tag each haplotype block. Haplotype blocks create a large amount of redundancy in the use of SNPs to measure association, as the average haplotype can contain a considerable number of SNP markers that share exactly the same frequency and the same recent ancestry with nearby gene variants; therefore using more than one SNP per block to track the genes of interest by association does not necessarily add any more value to the study. This is, of course, a simplistic argument that assumes that blocks are easy to define and haplotype diversity is limited, but it emphasizes one of the tenets behind HapMap planning: to reduce the genotyping effort required for a clinical genetics study without losing any quality in the association values obtained from using a small subset of the SNP variation available. Because the total genotype analysis needed far exceeds anything accomplished before, it was appropriate to start with manageable aims and build on these. Therefore, the phase I goals were to characterize and map haplotype blocks; to collate haplotype diversity in each population; and to define every coding SNP (this last aim spans phase I and phase II). The four populations sampled were Yoruba in Ibadan, Nigeria (referenced in the HapMap data as YRI), Japanese in Tokyo (JPT), Han Chinese in Beijing (CHB in HapMap but HCB in dbSNP), and CEPH Utah residents with ancestry in Northern and Western Europe (CEU). The YRI and CEU samples comprised trios that ultimately allowed very precise analysis of haplotype phase, that is, whether the alleles of heterozygous SNPs reside on one chromosome or the other. Ten ENCODE (Encyclopaedia of DNA Elements) regions were analyzed with a 1—fold increase in SNP density to compare data quality between the HapMap genome coverage and a more complete SNP catalog. Sequencing of 48 subjects from each population for these regions has spanned the two phases and also acts as a test bed for low frequency SNPs. In contrast to the initial focus on 1 million SNPs in phase I, HapMap is now producing genotypes for the phase II goals of expanding the number of SNPs in dbSNP with adequate multiple population validation from 2.6 million to 9.2 million. Phase II also includes extended sample numbers in the four study populations, a broadening of study populations, and a focus on more detailed analysis of the ENCODE regions.
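Why parent-child trios help to resolve haplotype phase can be shown for a single SNP. The sketch below is an illustration only and is not the method used by the HapMap project: the function name `phase_child` and the two-letter genotype strings are assumptions made for this example. It reports which allele the child received from each parent whenever the trio genotypes allow only one answer.

```python
def phase_child(child, mother, father):
    """Resolve the parental origin of a child's two alleles at one SNP.

    Genotypes are two-character strings such as "AG" (allele order is
    irrelevant). Returns (maternal_allele, paternal_allele), or None when
    the trio is ambiguous (for example, all three heterozygous) or shows
    a Mendelian inconsistency.
    """
    options = set()
    for m in set(mother):          # allele the mother could transmit
        for f in set(father):      # allele the father could transmit
            if sorted(m + f) == sorted(child):
                options.add((m, f))
    return options.pop() if len(options) == 1 else None
```

For a child typed "AG" with an "AA" mother and an "AG" father, the only consistent transmission is A from the mother and G from the father. When all three individuals are "AG" the function returns None, which is the case where phase must instead be inferred from neighboring loci.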

Three years after initiating the project, the group reported the phase I findings in October 2005 (22). The investigators highlight the fact that HapMap is intended to concentrate internationally coordinated resources on the characterization and understanding of the variant part of the genome sequence as a natural extension of the work of HGP in establishing the invariant sequence shared by all individuals. To summarize an extensive report:

1. SNP loci have proved to be highly correlated with their immediate neighbors.

2. Analyses to date show the generality of haplotype block structure and recombination hotspots in the human genome.

3. The redundancy of proximal SNP sets should yield efficiencies in association studies from the use of a catalog of tagging SNPs and coding SNPs.

4. The SNP data generated so far offer a means to study genomic variation without recourse to wholesale resequencing.

5. The findings have increased understanding and characterization of the human genome (notably the mapping of deletion mutations), natural selection events in the recent past, and fine-scale recombination organization.

6. dbSNP has collected the vast majority of common SNP variability in the human genome; when SNPs have not been listed they show tight correlation to loci that are in dbSNP. SNP discovery using PCR-based sequencing is biased against low frequency SNPs (i.e., those with minor allele frequencies <0.05).
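The redundancy described in points 1 and 3 is what makes tag SNP selection possible. As a rough illustration, the sketch below picks tags greedily from a table of pairwise r² values. It is a simplified heuristic, not the procedure used by HapMap or any named tool: the function name `greedy_tag_snps`, the threshold of 0.8, and the SNP labels in the usage note are all assumptions made for this example.

```python
def greedy_tag_snps(snps, r2, threshold=0.8):
    """Pick tag SNPs so that every SNP is a tag or is linked to one.

    snps: list of SNP identifiers.
    r2: dict mapping (snp_a, snp_b) pairs to pairwise r-squared values;
        pairs missing from the dict are treated as unlinked.
    threshold: minimum r-squared for one SNP to stand in for another
        (0.8 is an assumed, adjustable cut-off).
    """
    def linked(a, b):
        if a == b:
            return True
        return r2.get((a, b), r2.get((b, a), 0.0)) >= threshold

    uncovered = list(snps)
    tags = []
    while uncovered:
        # Choose the SNP that captures the most still-uncovered SNPs;
        # ties go to the SNP listed first.
        best = max(uncovered, key=lambda s: sum(linked(s, t) for t in uncovered))
        tags.append(best)
        uncovered = [t for t in uncovered if not linked(best, t)]
    return tags
```

With four made-up SNPs where snp1, snp2, and snp3 are mutually correlated above the threshold and snp4 is not, the function returns snp1 and snp4, so two assays stand in for four.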

Two further points emerge from the report. First, it is increasingly clear that HapMap data, founded as it is on the analysis of four populations, represents a valuable resource for population genetics analysis (23). Clear signals of natural selection have been found from the HapMap data in a number of genes that are not obvious candidates for adaptive responses in the immediate evolutionary past. Furthermore, a comparison of haplotype-based selection detection tests with classic methods that use individual loci indicates that the former approach can be more sensitive in detecting recent positive selection (notably in the analysis of G6PD and TNFSF5 genes) (24). Extension of HapMap genotyping to new population samples will bolster the dataset further and help to pinpoint which SNP loci are appropriate choices for more extensive sampling of human populations. Second, the report’s conclusions include a demand for rigor in association studies through multiple replications and enlarged sample sizes. Furthermore, the investigators emphasize the need for an unbiased approach to the reporting and interpretation of association study results, regardless of outcome. Because the common diseases are almost all complex diseases it is worth taking heed of this advice, as the investigators outline that such diseases require very careful control of environmental influences including lifestyle and behavior, adequate clinical characterization of phenotype, and sufficient replication of studies. All of these factors are important to control correctly if the precision at the genetic level is to be fully exploited to understand complex disease.

Since phase I was completed, HapMap has become an essential complement to dbSNP for checking the haplotype positions of a chromosome region as defined by values measuring linkage disequilibrium between SNP pairs. In addition, access to high-quality allele frequency estimates for 1 million SNP markers represents a significant move forward for the validation status of a large proportion of human SNPs. This becomes particularly useful as a means to check one’s own allele frequency estimates obtained from the subjects of a study, allowing greatly improved quality control of the user’s own genotyping assays. This can be taken one step further by including CEPH trio samples as positive controls in an assay and referencing the genotypes obtained to the individually listed results in HapMap. The HapMap database contains structured access to all the genotypes generated in the form of SNP report pages, together with detailed maps of haplotype structure in the form of annotated LD plots using pairwise comparisons of SNPs in the chromosome interval using a stand-alone graphic browser (25). Finally, an equivalent approach to Entrez exists in HapMap, termed SNPmart, for filtering down the datasets to downloads of manageable size based on similar principles.
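The quality-control step of comparing control genotypes against listed reference results can be sketched as a simple concordance calculation. The example below is hypothetical throughout: the function name `genotype_concordance`, the dictionary layout, and the sample and SNP labels are assumptions, and no HapMap file format is implied. Genotypes are compared without regard to allele order; strand differences between assays are not handled and would need to be resolved first.

```python
def genotype_concordance(observed, reference):
    """Compare a laboratory's control genotypes with reference genotypes.

    observed, reference: dicts mapping (sample_id, snp_id) to a genotype
    string such as "AG". Only keys present in both are compared.
    Returns (concordance_rate, list_of_discordant_keys).
    """
    shared = [key for key in observed if key in reference]
    if not shared:
        raise ValueError("no overlapping sample/SNP pairs")
    discordant = [
        key for key in shared
        if sorted(observed[key]) != sorted(reference[key])
    ]
    rate = 1 - len(discordant) / len(shared)
    return rate, discordant
```

Listing the discordant sample and SNP pairs, rather than only the overall rate, makes it easier to see whether errors cluster in one assay or one control sample.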

The relationship between the two principal SNP databases of HapMap and dbSNP is in the process of change and consolidation. They continue to be very closely interlinked but a number of differences should be emphasized. First, the process of dbSNP to provide haplotype map details independently of HapMap is progressing slowly and does not match the quality and depth of the other components of dbSNP. For example, in Map Viewer a haplotype map option exists (termed dbSNP haplotype), but this is only available for chromosome 21 and takes the form of block positions and reference numbers listed as hyperlinks to reports from Perlegen. These proceed to detail the haplotype SNP allele composition, for example, SNPs rs2822549 and rs2822550 link to a block named B002180 with three haplotypes outlined in base color-coded plots. However, this gives the impression of work on hold now that HapMap has begun data release. It seems that dbSNP is unlikely to provide an adequate basis for detailed haplotype mapping in its own right until sufficient data have been obtained. This process has started with the collection of haplotype data from publications (http://www.ncbi.nlm.nih.gov/SNP/hap/dbSNP_haplotype_intro.html). Second, dbSNP is still in the process of updating the Variation section of each SNP cluster report page to encompass the vast quantity of allele frequency data generated by HapMap and other contributors. HapMap provides a full list of dbSNP loci across the whole genome as linkouts to dbSNP cluster reports for each locus, with HapMap validated loci uppermost and all others below. The genome map also makes reference to the NCBI Entrez Gene report page for each gene but a third point of difference is that occasional differences in gene characterization are evident. For example, SNP rs11779952 is placed in gene SLC39A4, a solute carrier, by HapMap and in NFKBIL2, a nuclear factor kappa B inhibitor, by dbSNP. The disparity in the affiliation of SNP and gene in this case seems to have been caused by a difference in gene position between NCBI and Celera genome builds. Although it is difficult to know how this ambiguity has occurred, this example serves to illustrate the effect of different approaches that may be taken between HapMap and NCBI to the curation of data rather than the management of data. These two roles are quite distinct, curation in this context describing the interpretive activity required by large collections of data that need characterization to be properly cataloged (much like museum contents). Although the majority of NCBI database information can be described objectively, certain data must be characterized with the available knowledge and this can introduce disparities between databases run by different organizations. As this example suggests, a principal source of curation ambiguities is gene description, more specifically termed gene annotation, that is, the characterization of a gene and its function based on the interpretation of descriptive data. This is one step beyond management of information and requires interpretive judgments to be made from the data. For instance, is a new putative gene model recently placed in the database a real gene that has not yet been fully described or a pseudogene? Can the gene function be defined in terms of existing pathways? Is the ontology (defined as a set of terms that describe equivalence of function, role in a process, or sequence between a set of genes or proteins) a close match to a well-defined gene family or different enough to suggest a lack of relatedness? Ambiguities in data curation will continue to complicate matters when researchers routinely collect and compare information from more than one database. Last, any user familiar with dbSNP will have noticed the differences in public and Celera sequence position for SNP locations in the Integrated Maps section, when both databases hold the same locus. Because the genome sequences were assembled in different ways and problems like sequence inversions can occasionally arise, it makes sense for dbSNP to list both until there is a full consensus sequence assembly. This is not principally a problem of curation but of differences in the chromosome coordinates obtained by each sequencing project.

6. Finding SNPs for Clinical Genetics: Using HapMap and SNPbrowser®

Faced with the task of locating the genetic component of a disease, researchers will either begin whole genome linkage analysis and focus on the chromosome regions showing the clearest linkage signals or will have an idea of candidates from studies performed already. The strategy that best starts the SNP analysis proper is to concentrate the initial efforts of locus selection on coding and tagging SNPs as the information content from these loci can give the strongest pointers toward the genes underlying the disease of interest. Consulting the HapMap and SNPbrowser databases can form the principal part of selecting SNPs for this part of the study and any further fine mapping analysis of candidate regions.

Applied Biosystems (AB) SNPbrowser is a stand-alone database of 5 million loci comprising a compilation of public and private (i.e., CDS based) SNPs. The marker set is essentially a data dump direct to the user’s PC for use offline, with an easy and intuitive front-end in the form of an annotated map. The advantage of this database is that it shows haplotype block information in a much easier to read format than HapMap and in a window independent of a web browser. The block maps are based on Celera’s own pairwise analysis of 160,000 SNPs (termed the backbone validated SNPs) so it complements, as well as clarifies, HapMap haplotype block map annotations. SNPbrowser can be downloaded from the AB website (http://marketing.appliedbiosystems.com/mk/get/snpb_landing?isource=fr_E_RD_www_allsnps_com_snpbrowser) and once launched, can be configured to suit the user’s needs in terms of haplotype block annotation displayed, SNP type, population studied, and extent of region shown. The first choice to make is the haplotype block map display. The options are to use HapMap or Celera and to display all study populations or just a single one. The Celera maps are constructed from the analysis of 45 individuals each from white (American European) and African American populations, plus a smaller number of validated SNPs (hence less reliable map definitions) in Chinese and Japanese. These clearly mirror the HapMap study populations, but offer an opportunity to compare HapMap Africans (YRI) with African Americans and explore the effect of admixture and the potential for MALD analysis, outlined under Subheading 2. The match to a HapMap style of presentation continues with the SNP information window, an example of which is given in Fig. 5. This needs to be unlocked: “View menu,” “SNP details,” “Show (ctrl + D),” then click the padlock icon upper right of the pop-up window. This additional window shows the Celera allele frequencies in identical pie charts making comparisons easy and works with a mouse-over allowing the map to remain uncluttered when SNP density is high. The NCBI button is a linkout if the SNP carries an rs-number, and it soon becomes evident in any gene that a large proportion of the SNPs displayed are Celera only. The idea of SNPbrowser is to provide a shopping list for loci that can be genotyped with AB’s proprietary Taqman® and SNPlex® technologies, but there is no commitment in downloading and using the browser other than e-mail registration. This same approach applies to the Taqman genotyping assay database described under Subheading 9. What is available to any user is a commercially orientated database with a significant number of private SNPs, but each of these allows important information to be obtained about the position and variability of the Celera markers ahead of their incorporation into dbSNP. The color-coded haplotype block positions from the four study populations are displayed above the gene bar as a combination or individually. Certain caveats apply here: the block edges, although graphically sharp, are tentative and any block definition used can only produce fuzzy boundaries at best. Subsets of SNPs (Taqman assay SNPs, SNPlex assay SNPs, All SNPs, and Coding SNPs only) can be selected with buttons on the lower left of the map. Finally, of most interest, over and above the private SNP data obtained, is the chance to review the power of the gene shown with a green–black scale (introns in purple–pink scale to match) and to see a measure of linkage disequilibrium in LDU (linkage disequilibrium units), an arbitrary scale based on r² and D’ calculations between sets of SNP pairs. Connecting lines link the SNP’s true position on the normal kilobase scale to the LDU scale, therefore these merge in cases of SNPs sharing the same haplotype block. The two values of power and LDU, novel to the SNPbrowser map, are related because the power scale is a summary value based on the block distributions across the gene and is intended to provide a means of comparing different parts of a gene or different genes. The scale ranges from less than 0.5 (black) to five values between 0.5 and 1.0 (dark to pale green) and is intended to help predict the power of a block-based approach using different SNP combinations along with the effect of sample sizes and the minor disease/trait allele frequency. SNPbrowser amounts to a succinct and visually clean map browser that is an informative complement to HapMap browsing. The linkouts to dbSNP and to the AB Taqman genotyping summary pages allow quick follow-up of SNPs of interest and the SNP lists (shopping lists) can be saved or exported. Many researchers use SNPbrowser as the first snapshot of a gene and this does not necessarily mean avoiding the commercial pipeline attached, as Taqman has been a mainstay of SNP genotyping for medical genetics studies for many years. A set of poster presentations explaining the genetics in more detail can be downloaded from the AB website (http://docs.appliedbiosystems.com/pebiodocs/00112824.pdf, http://docs.appliedbiosystems.com/pebiodocs/00114486.pdf, http://docs.appliedbiosystems.com/pebiodocs/00112823.pdf).

Fig. 4. Typical fasta sequence report. The SNP detection assay only confirmed the substitution site because in this example, computational contig comparison techniques were used.

Fig. 5. SNPbrowser map view for the gene CAPG (see Fig. 6). Top four color bars show haplotype block distribution in African American, European, Chinese, and Japanese (top to bottom). The gene bar is color-coded black to light gray to denote the power based on the gene variant frequency, haplotype frequencies, and block positions (upper right scale). The pop-up box lower right gives SNP details linked to a mouse-over system for the dark and light bars positioning each SNP locus on the haplotype block bars above.

HapMap browsing is initiated in a similar way to SNPbrowser, usually starting with a single gene name to locate a manageable segment of chromosome and view the genome landscape (Fig. 6). Begin by placing a landmark (SNP or gene) or chromosome region limits (the range in full bp numbers) in the genome browser page (Data linkout top right of the homepage, then the Generic Genome Browser linkout). Often HapMap will offer a choice of locations if the description is not a complete match to the data and the list gives full details to allow review of the user’s own information. An excellent set of PowerPoint guides is listed in the tutorial page (http://www.hapmap.org/tutorials.html.en). The basics of genome browsing HapMap data are outlined by Lincoln Stein, whose guide explains the steps involved in finding SNPs in or near the gene (termed region of interest or ROI), viewing patterns of LD, selecting tag SNPs, and if required, downloading the SNP dataset generated by a search. The layout of the main HapMap view has already been detailed under Subheading 5, and the best strategy is for users to explore a region and become familiar with the acquisition of data with a level of depth that suits the particular purposes of the research stage. The other tutorial presentations from Michael Boehnke, Mark Daly, Augustine Kong, and Toshihiro Tanaka cover the detailed aspects of association study design in much more depth than is possible in this review. One word of caution: using the scrolling arrows to browse a large gene at 5-kb scale or less can take a very long time and it will soon be noticed that even over long stretches of sequence the pie chart distributions start to look very familiar. At this stage it is worth cross-referencing to the SNPbrowser haplotype block definitions to determine whether the block is very long itself or whether the review of SNP data can fruitfully be focused on recommended tag SNPs for the region.


7. Undirected SNP Searches: EntrezSNP and Map Browsing

There are two ways to examine a group of SNPs in the absence of clear directions to a specific chromosome region: direct queries to EntrezSNP and genome map browsing. At present, there is little need to find SNPs without associated landmarks such as the position of a linkage signal or a candidate gene, but there can be particular reasons that a group of SNPs needs to be collected and studied with information other than position. An example of one such situation was the collection of SNPs for forensic analysis when it was important to find SNPs with appropriate levels of variability or substitution type (5,26). An EntrezSNP query is the best approach for obtaining a list of candidate SNPs with combinations of specific characteristics. The major drawback of EntrezSNP is that two criteria, linkage and flanking sequence quality, cannot be used as part of a query and the latter characteristic has a direct bearing on assay success in nearly all SNP genotyping methodologies. However, any SNP in dbSNP or HapMap suitable for study will have been validated by a submitting laboratory using one of the genotyping techniques. By default this tends to prevent SNPs that would be impossible to genotype from being returned from a query if validation status is used as a fixed field. The appropriate fixed fields would be “organism” [ORGN] or [TAX_ID] prefixed with human and “map weight” (the number of times a SNP maps to the genome) using the term 1[MPWT] (ensures all SNPs are unique). The “validation” term by frequency [VALIDATION] is used to ensure SNPs have been validated by repeat genotyping rather than by contig comparison and sequencing of pooled donor DNAs. After this the list of search terms combined by operators can be taken from a tickbox list of limits (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=Limits&DB=snp) or applied by the user who is familiar with the syntax and specific values required. The most important search field tags are listed in Table 2. As an example “all nonsynonymous, AT SNPs in segment 1 to 1.5 Mb of chromosome 1, true SNPs only” would comprise coding nonsynon[FUNC] AND W[ALLELE] AND 1[CHR] AND 1000000:1500000[CHRPOS] AND snp[SNP_CLASS] AND human[ORGN] AND by frequency[VALIDATION] AND 1[MPWT]. EntrezSNP returns a list of SNPs that qualify with a graphic summary for each, shown in Figure 7, giving all the information necessary for a rapid scan of a large number of loci in one go. It is possible to sort the list in an alternative way to the default sort order of descending rs-number by selecting from the drop down menu (sort) to resort the list, for example, heterozygosity or map position. Using map position will list from q-arm telomere up to p-arm telomere but note that the first loci are unplaced SNPs. Ticking each SNP allows a list to be exported to a text-holding webpage, a file, or the clipboard (Send to drop down menu).
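For users who assemble such queries repeatedly, the term-plus-tag syntax can be generated programmatically. The sketch below only builds the query string and sends nothing to NCBI. The helper name `entrez_snp_query` is hypothetical, the field tags are those quoted in the text, and whether a given NCBI release accepts each tag is not verified here.

```python
def entrez_snp_query(terms):
    """Join (value, tag) pairs into an Entrez-style query using AND.

    terms: list of (value, tag) tuples, e.g. ("1", "CHR").
    Returns a string such as "1[CHR] AND 1[MPWT]".
    """
    return " AND ".join(f"{value}[{tag}]" for value, tag in terms)


# The worked example from the text: nonsynonymous A/T SNPs in the
# 1 to 1.5 Mb segment of chromosome 1, validated and uniquely mapped.
query = entrez_snp_query([
    ("coding nonsynon", "FUNC"),
    ("W", "ALLELE"),
    ("1", "CHR"),
    ("1000000:1500000", "CHRPOS"),
    ("snp", "SNP_CLASS"),
    ("human", "ORGN"),
    ("by frequency", "VALIDATION"),
    ("1", "MPWT"),
])
print(query)
```

Keeping the terms as a list of pairs makes it simple to hold the fixed fields (organism, validation, map weight) constant while varying the chromosome, position range, or function class between searches.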

Table 2. Important EntrezSNP Search Field Tags

| Description | Tag | Search field used | Example |
|---|---|---|---|
| Observed alleles | [ALLELE] | IUPAC allele code (Table 3) | R[ALLELE] (find SNPs with A/G substitutions) |
| Chromosome | [CHR] | number / X, Y | 21[CHR] OR 22[CHR] (find SNPs on chromosomes 21 and 22) |
| Base position | [BPOS] | ranged number (used with AND & [CHR]) | 18000:28000[BPOS] AND Y[CHR] (find SNPs in 10-kb section of Y-chromosome) |
| Heterozygosity | [HET] | ranged number | 30:50[HET] (find SNPs with heterozygosity value in range 30%–50%) |
| Function Class | [FUNC] | locus region, intron etc. (8 in total) | coding nonsynon[FUNC] |
| Build | [CBID] | number | 125[CBID] (search build 125) |
| Gene location | [GENE] | gene symbol | CAPG[GENE] (search for SNPs in actin capping protein, gelsolin-like gene) |
| Genotyping method | [METHOD] | description as listed in page below (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=snp - METHOD) | hybridize[METHOD] (search for SNPs found by chip hybridization) |
| Map weight | [HIT], [MPWT] | number: 1 = once, 2 = twice, 3 = 3–9 times | NOT (2[HIT] OR 3[HIT]) (exclude SNPs mapping twice or more in genome; NB better to avoid using NOT 2[HIT] to include CDS SNPs) |
| Population | [POP] | description as listed in page below (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=snp - POPULATION) | pacific[POP] (search for SNPs genotyped in Australasian & Oceanian samples) |

Fig. 6. Typical HapMap genome browser window for the gene CAPG (see also Fig. 5), with 15 SNPs genotyped by HapMap, shown in approximate position as pie charts and below these, 25 SNPs shown as triangles (exact position) denoting loci not genotyped, but present in dbSNP. Each set of rs-numbers linkout to HapMap SNP reports and dbSNP cluster reports, respectively. The gene structure of CAPG is outlined as a line diagram at the base.

As an alternative to EntrezSNP, map browsing offers an intuitive way to review large numbers of SNPs in one session. Exploring a chromosome segment as a map gives the best way to scrutinize the position and characteristics of nearby genome features of importance: transcripts, genes, or clustering SNPs (neighbor SNPs in NCBI). Furthermore, the features around each SNP can be scrutinized easily through a series of linkouts embedded into the map view to the dbSNP cluster report page and Gene reports or other supporting databases. Both dbSNP and HapMap have a map-based system at the core of their SNP databases. HapMap Genome Browser (click Browse Project Data on left hand column of homepage) offers, above all, comprehensive SNP allele frequency data for all 1 million SNPs given in a succinct but clearly arranged pie chart graphic together with the position of any coincidental gene locus (these linkout to NCBI Gene) plus a chromosome scale. The reference allele is blue in each pie chart and the rs-number linkouts to the HapMap version of the cluster report with all the validation data for the four populations plus details of the assay used. At the top of the main map is the summary chromosome view aligned with SNP and gene density plots. In combination with a %GC plot these are a useful supplement to the map view given by the NCBI browser. The map elements, termed tracks, can be configured in a variety of ways with an emphasis, as would be expected, on SNP distribution and alignment with linkage disequilibrium measurements. Of most interest to users of HapMap comparing data to NCBI is the process of annotating the default map view with LD and haplotype block information. Three LD measures can be plotted (D’, r², and LOD), and haplotype block structures can be viewed as phased haplotypes (i.e., alleles are assigned to the most likely shared chromosome strand to denote the haplotype). These statistics require a more detailed outline than is possible in this review but the extensive help pages contain the descriptions and relevant publications. To obtain this additional map annotation download the plug-in (a java applet) and go to the reports and analysis drop down menu, right uppermost. Choose “Annotate LD Plot,” click “configure” (with a variety of arrangements possible), then “configure” again, then “go”. The same applies to “Annotate Phased Haplotype Display” from the same menu, where configure just gives options to choose populations. The LD plots comprise red and gray scale pairwise block patterns. The plot gives marker-to-marker LD values, where the genotyped SNPs are denoted as ticks and the marker pairwise information is plotted as boxes between these ticks. Phased haplotypes are given as blue and yellow blocks and unless the diversity is high it is the clearest way to view haplotype block boundaries in any current database.
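Two of the LD measures named above, D’ and r², can be computed for a single SNP pair from phased haplotype counts. The following is a minimal sketch using the standard textbook definitions. It is not HapMap's implementation, the function name `ld_measures` and the count layout are assumptions, and the LOD score is not covered.

```python
def ld_measures(n11, n12, n21, n22):
    """Pairwise LD from phased haplotype counts for two biallelic SNPs.

    Counts: n11 = A-B, n12 = A-b, n21 = a-B, n22 = a-b haplotypes, where
    A/a are the alleles of the first SNP and B/b those of the second.
    Returns (D, D_prime, r_squared); D_prime is reported as |D'|.
    """
    n = n11 + n12 + n21 + n22
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_a = (n11 + n12) / n            # frequency of allele A
    p_b = (n11 + n21) / n            # frequency of allele B
    d = n11 / n - p_a * p_b          # raw disequilibrium coefficient
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    # D is scaled by the largest value it could take given the allele
    # frequencies; the bound depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    r_squared = d * d / denom
    return d, d_prime, r_squared
```

With invented counts of 50, 25, 0, and 25, one of the four possible haplotypes is absent, so D’ is 1, yet r² is only about 0.33 because the allele frequencies at the two SNPs differ. This is why r² rather than D’ is generally the more useful guide when asking whether one SNP can stand in for another.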

Map Viewer is the NCBI map browsing tool allowing the simultaneous search and display of all the NCBI genomic information by chromosomal position. The interface provides a graphical overview of several databases in combination with user-controlled map arrangements. The map combinations and elements are arranged by clicking on the maps and options button mid-left, which allows any of 48 different maps to be aligned in the same segment view. The master map contains the linkouts to the matching database with SNPs placed in a map termed Variation in the sequence maps list. Densely packed SNPs are merged together and listed as “6 variations,” etc., but only if the map scale is not fine enough to allow adequate spacing at the scale used, and for these grouped SNPs the linkouts are lost. Often it is better to start with a whole view before concentrating on one area and this is achieved by using the genome view (http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?taxid=9606&query=). A landmark will be positioned on the relevant chromosome with a small red line; in most cases SNP queries will result in two marks at different positions because the CDS coordinates differ from the public ones. Multiple landmarks can be displayed together in a single view by using the OR operator in the search window. The individual whole chromosome view is obtained by clicking on the red number under the chromosome in the genome map. It soon becomes clear that dbSNP offers a much broader range of genome landmarks in the displayed region compared to HapMap and, usefully, all SNPs in the database have summary details shown as icons placed against the map position and rs linkout, as shown in Figure 8. The smallest scale definable using the zoom scale box is one 100,000th of total chromosome length (equivalent to 10 kb of chromosome 1). Table 3 details the color-coding system used by Map Viewer to signify the annotation status of gene loci placed on the Genes map. NCBI labels all likely genes as “predicted gene models” until confirmed to have a role or be involved in a specific pathway.

Fig. 7. Example locus return from EntrezSNP for rs2075745 showing annotation. The rs-number linkouts to the RefSNP cluster report; this human SNP occurs on chromosome 11, maps once (underline), is sited in a locus (L), exhibits 46% heterozygosity (scale), and has been validated by genotyping (V).

Fig. 8. Symbols used to annotate the NCBI Map Viewer variation map.

8. Ensembl (http://www.ensembl.org/index.html)

Readers who routinely use Ensembl as their database of choice for genome browsing may be wondering why it has not been covered in this review up until now. Ensembl is one of the principal genome data repositories with extensive databases and search tools stemming from the integral role of The Sanger Centre and EMBL in HGP and the position of Ensembl in providing open-source software frameworks for data access and storage. Ensembl continues to provide the most up to date releases of annotated genome data and the widest range of species with genome analysis available. In the area of human SNP data and supporting information the content very closely mirrors that found in dbSNP and therefore there is a choice of approaches, NCBI or Ensembl, and each can be used to access largely the same data, using comparable frameworks for searching and cross-referencing information for a study. There is little to choose between the two except a broader range of data in NCBI and closer integration to bibliographic data. For this reason the researcher interested in just human SNP data can easily use the most familiar search environment and obtain data of comparable depth and scope from each database collection with visits to PubMed and OMIM when required. The features in Ensembl that can offer more detail and broader coverage are outlined here.

Because Ensembl is both a collection of genome data and a software system that can be adapted for its organization, it is important to highlight aspects of Ensembl content that complement the NCBI databases. First, Ensembl has had a pivotal role in the complex task of gene curation and has pioneered automated gene annotation techniques. Most genome content is now generated in such quantity that automated curation has become a necessity. In contrast, the primary importance of the human and mouse gene annotation process has meant that this has been completed manually using expert scrutiny, thereby enhancing gene recognition algorithms in the process. The VEGA (Vertebrate Genome Annotation) database homepage lists linkouts to human, mouse, dog, and zebrafish genome browsers (26). VEGA’s mission statement is to provide high-quality, frequently updated, manual annotation of vertebrate finished genome sequence (27). Without doubt this process will extend to chimpanzee, pufferfish, and agricultural species as the annotation process becomes more streamlined and benefits from established knowledge of gene character in vertebrates. Users interested in obtaining the best data relating to human gene architecture are encouraged to visit the linkouts from the VEGA human homepage (http://vega.sanger.ac.uk/Homo_sapiens/index.html) explaining the involvement of VEGA in the CCDS and MHC Haplotype Projects and the Wellcome HAVANA group’s involvement in the analysis of ENCODE regions and the CORF projects (respectively, http://vega.sanger.ac.uk/info/data/ccds.html, http://vega.sanger.ac.uk/info/data/Homo_sapiens.html, http://vega.sanger.ac.uk/info/data/encode.html, and http://vega.sanger.ac.uk/info/data/corf.html).

Second, Ensembl has close ties, through EMBL-EBI, to the high-quality protein sequence database Swiss-Prot/UniProt (http://www.ebi.ac.uk/swissprot/). This comprises manually annotated protein sequences with content that has been closely integrated with the gene annotation pipeline, so the two data sets, gene and protein, have considerable synergy within Ensembl. This makes Swiss-Prot/UniProt a better choice for detailed protein analysis than NCBI Protein, although, as with SNP and gene data, the content is shared and cross-referenced to a large degree. Third, Ensembl provides a gene-oriented search system in MartView (http://www.ensembl.org/Multi/martview), a system that, like HapMap, uses the BioMart engine. Initiating a search of the human assembly with VEGA genes as the start point allows the interrogation of more than 22,100 genes with a full range of filters. This is potentially the best way to collate a set of candidate genes on the basis of function and role in processes, for example in a pathways approach to studying the relationship of gene families. OMIM may suggest candidates that can be extended to include all genes that share similar properties in terms of function. This can help to ensure that a search in the initial phases is not too restricted in focus. Fourth, Ensembl allows access to an HGP sequencing trace repository at Trace Server (http://trace.ensembl.org/). Although the repository lists single-pass data, which makes it unrealistic to use for SNP analysis, the breadth of data (1 million traces from 735 species) makes it a valuable resource for the analysis of sequence outside the mainstream study species.

Table 3
Color-Coding System Used for Annotation of the Genes Map

Color     Gene model evidence used
Blue      Confirmed gene model based on alignment of mRNA or mRNA/ESTs
Green     EST evidence only
Brown     Predicted gene model (using Gnomon program) plus EST alignment evidence
Tan       Predicted gene model only
Orange    Conflicting evidence: discrepancy between mRNA sequence and model

EST, expressed sequence tag.

9. Online Resources for Population Genetic Studies Using SNPs

Although the focus of clinical genetics database searching centers on the location of SNPs with reference to genes and haplotype structure, population genetic studies place more emphasis on the distribution of allele and haplotype frequencies. This can still involve the scrutiny of SNP position in the gene landscape, as the effects of strong positive selection can leave a characteristic signature in the distribution of SNP variability around the selected gene. The discovery of haplotype variability reduction from selective sweeps, as this effect is described, has led to some exciting recent studies of genes hitherto not suspected to be the subject of selective pressure (28,29). The need for detailed and reliable frequency data in this field means HapMap and dbSNP hold centre stage as the two largest allele frequency datasets available. However, searching the data with the aim of exploring frequency distribution differences between populations is not easy in either database. Furthermore, frequency searches are limited in both by the 0.5 minor allele frequency ceiling: frequency-filtered searches set a range for the minor allele of between 0 and 0.5 for each class (in this case, population). Although this is, of course, logical, as the minor allele frequency cannot be more than 0.5, it prevents searches for SNPs where one population shows a minor allele frequency of, say, 0.2 for allele C while the same allele is present at frequency 0.6 in another population, which properly shows a minor allele frequency of 0.4 but for the other allele. This is not a major drawback as workarounds exist, but it prevents an easy search for loci showing the biggest frequency contrasts, and these have proved to be among the most interesting SNPs. The same 0.5 ceiling applies to the frequency search systems of all the other databases.
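
The effect of the 0.5 ceiling can be illustrated with a short calculation. The sketch below is only one way a user might post-process downloaded frequencies and is not a feature of either database; the function names are hypothetical and the values are the illustrative ones used above (allele C at 0.2 and 0.6).

```python
def minor_allele(freqs):
    """Return (allele, frequency) for the rarer allele in one population.

    freqs maps allele -> frequency, e.g. {"C": 0.2, "T": 0.8}.
    """
    allele = min(freqs, key=freqs.get)
    return allele, freqs[allele]


def frequency_contrast(pop_a, pop_b, allele):
    """Absolute difference in the frequency of the SAME allele in two populations."""
    return abs(pop_a[allele] - pop_b[allele])


# Allele C is the minor allele in population 1 but the major allele in population 2.
pop1 = {"C": 0.2, "T": 0.8}
pop2 = {"C": 0.6, "T": 0.4}

print(minor_allele(pop1))                     # the minor allele here is C
print(minor_allele(pop2))                     # the minor allele here is T
print(round(frequency_contrast(pop1, pop2, "C"), 2))
```

Comparing the two minor allele frequencies (0.2 and 0.4) would suggest a contrast of 0.2, whereas tracking allele C in both populations gives the true contrast of 0.4. This is why frequency-filtered searches restricted to the minor allele can miss the most divergent loci.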

A popular program for processing multiple databases with frequency-filtered searches is Frequency Finder (https://mapgenetics.nimh.nih.gov/frequencyfinder/index.jsp), described as a frequency data acquisition tool for mining multiple public databases (30). This acts as a web portal or a stand-alone program, although both require a data upload in the form of a SNP list as a line-delimited text file, or alternatively SNP identifiers or chromosome coordinates placed in a query box. Frequency Finder returns a table of rs-numbers, major and minor allele frequencies, and the data source. Although uploading a file of rs-numbers is a slightly clumsy way of initiating a search these days, the system is comprehensive in scope as it will locate and list frequency data that do not overlap between the database sources used (TSC, dbSNP, Celera, ALFRED, and HGVbase; discussed later). As dbSNP broadens the extent of its data collection from other sources, such as Celera and HapMap, this has become less important than it was previously. For example, the original report of the system in 2004 detected from a whole genome query (246,097 SNPs with data) that 5% of SNPs were unique to TSC and 16% unique to AOD. The HapMap frequency data accessed were confined in early 2006 (v2.1) to European data only.

The SNPs described previously as unique to Celera are held in a publicly accessible subset of the CDS database known as Assays-on-Demand (AOD) or Taqman SNP Genotyping Assays. When Celera and Applied Biosystems merged their interests to become Applera, a focus of much development was the combination of Taqman real-time PCR assay technology and the extensive SNP data in CDS, leading to a system described by the company as knowledge-based genotyping. With this system the frequency data for a SNP were as important as the availability of off-the-shelf Taqman assays to genotype the locus. The map browser used to access these data, SNPbrowser, has been described; AOD is the equivalent of Entrez, a search system front end that permits searches based on frequency (among other criteria). The importance of AOD compared to the vast array of SNPs that were available privately as CDS was the quality of validation. Celera used a polymorphism discovery resource (PDR) of six individuals to indicate whether a SNP detected by the private genome assembly was a valid SNP or not. This is an insufficient sample to provide reliable allele frequency estimates and means the CDS SNP details listed in AOD lack precision. In contrast, AOD data are based on pilot Taqman analysis of 45 individuals each from four populations, as detailed under Subheading 6. for SNPbrowser. The AOD dataset has since been used for many important population genetics SNP studies that have extended the analysis to a wider range of populations by using the available Taqman assays on a broader base of well-defined populations. These studies provide valuable insights into population variability, with the aim of improving the selection of subjects for MALD analysis and gaining a better understanding of human population dynamics. Similarly, I used AOD to begin the process of collecting loci for forensic SNP assays that can suggest a geographic origin for a sample of unknown donor (31).

The AOD search page (https://products.appliedbiosystems.com/ab/en/US/adirect/ab?cmd=ABGTKeywordSearch&catID=600769) allows access to HapMap, JSNP, and DME (drug metabolizing enzyme) data, in addition to AOD, using keyword searches with gene symbol, gene name, public accession number, biological process, or molecular function. With or without keywords, limits can then be set for intergenic, intragenic, or genic location plus SNP type and effect if genic, followed by the frequency filters for the four AOD or HapMap populations. The query returns a page listing the SNPs with linkouts to dbSNP and Gene plus two-decimal-place frequency estimates and chromosome location. If the user has an “hCV number,” used to catalog Celera SNP data, it is possible to gain information about the locus if it has been validated for AOD and, usefully, to obtain the equivalent rs-number if one exists. One advantage that will become immediately obvious to the researcher is that this provides the easiest allele frequency search system for the 1 million HapMap SNPs, despite the previously stated caveat that 0.5 is the frequency limit per allele. The simplicity of this approach for searching HapMap allele frequency data is that it gives a list that can be cross-checked with AOD estimates if applicable (both are listed even if one is searched), with ready access then to dbSNP and Gene. One additional point of reference if the SNP has been validated by HapMap and AOD: the African allele frequency estimates in AOD are based on an African-American sample, so by comparison with HapMap Africans it is possible to gain an insight into admixture levels in the AOD study population. Before HapMap these were the most reliable SNP allele frequency estimates available anywhere and were accurate enough to allow a detailed study of allele frequency-derived haplotype block definitions based solely on AOD SNPs showing tandem arrays of identical minor allele frequencies (32). The ability to search on frequency alone continues to make this database a powerful population genetics tool despite the recent incorporation of linkouts to AOD in the revamped dbSNP cluster reports.
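
The comparison of AOD African-American frequencies with HapMap African frequencies can be turned into a rough admixture estimate. The sketch below assumes a simple two-parent model in which the admixed frequency is a weighted average of the two parental frequencies; the function name and the three frequencies are hypothetical and are not taken from AOD or HapMap.

```python
def admixture_proportion(p_admixed, p_parent_a, p_parent_b):
    """Estimate the fraction of ancestry contributed by parent population B.

    Solves p_admixed = (1 - m) * p_parent_a + m * p_parent_b for m.
    """
    delta = p_parent_b - p_parent_a
    if abs(delta) < 1e-12:
        raise ValueError("parental frequencies are identical; locus is uninformative")
    return (p_admixed - p_parent_a) / delta


# Hypothetical frequencies of one allele (illustrative values only).
p_african = 0.05            # e.g. a HapMap African sample
p_european = 0.85           # e.g. a HapMap European sample
p_african_american = 0.21   # e.g. the AOD African-American sample

m = admixture_proportion(p_african_american, p_african, p_european)
print(round(m, 2))
```

A single locus gives a noisy estimate, and the calculation is only informative where the parental frequencies differ strongly, so in practice estimates from many such loci would be combined.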

Finally, as mentioned, interest in selection in the very recent past and its role in reducing haplotype diversity in the vicinity of the selected gene has led to a useful web tool, Haplotter (http://hg-wen.uchicago.edu/selection/haplotter.htm), to detect and study this effect. Based on an in-depth analysis of the HapMap phase I data release (29), it is likely to promote potentially interesting further study of genes that previously may not have been considered obvious candidates for positive selection. The tool scans HapMap data for signatures of low haplotype diversity and unusually long haplotype blocks that result from the rapid increase in frequency of the selected gene variation and bordering chromosome regions. The signatures are defined by a signed measure of contrast between the haplotypes and the surrounding genome landscape, because the ancestral allele (positive contrasts as the allele increases in frequency) can be the subject of selection as well as the variant allele (negative contrasts). Haplotter can work from gene identifiers or a single SNP landmark (slower and more varied in coverage). The program returns plots of iHS, the measure of contrast to surroundings, plus the selection signature or population diversity measures H, D, and FST, followed by a table of adjacent genes that are colored light blue when they show significant evidence of selection effects. The major advantage of this tool is that it allows an unbiased approach to finding regions with indications of recent selection pressure, so in use it is likely to reveal interesting and surprising candidates for more detailed study. In addition, it could focus studies on the phenotypes such loci exhibit, with consequences for our understanding of the differences in susceptibility to disease between population groups. The disadvantage is that it appears to lack sensitivity to gene variation and associated SNPs that have reached, or are very close to, fixation (i.e., where a different allele is fixed, at a frequency close to 1, in different populations). Examples of genes where coding SNPs are close to fixation and that fail to yield a detectable signal with Haplotter are FY (conferring resistance to malarial infection in African populations) and MATP (part of a depigmentation pathway in European populations). In contrast, LCT (associated with lactase persistence in Europeans), showing balanced heterozygosity levels, reveals one of the strongest selection signals of all, although this may also relate to how recently the selective sweeps have occurred at these loci.
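
Of the measures listed, FST is the simplest to reproduce by hand for a single SNP. The sketch below uses the textbook definition FST = (HT - HS) / HT for a biallelic locus in two equally weighted populations; it is an assumption for illustration and not necessarily the estimator that Haplotter itself applies.

```python
def fst_two_populations(p1, p2):
    """FST for one biallelic locus from the allele frequencies in two populations.

    HS is the mean expected heterozygosity within populations and HT is the
    expected heterozygosity of the pooled population (equal weights assumed).
    """
    h1 = 2.0 * p1 * (1.0 - p1)
    h2 = 2.0 * p2 * (1.0 - p2)
    hs = (h1 + h2) / 2.0
    p_bar = (p1 + p2) / 2.0
    ht = 2.0 * p_bar * (1.0 - p_bar)
    if ht == 0.0:
        return 0.0  # locus is monomorphic in both populations
    return (ht - hs) / ht


# A locus close to fixation for different alleles in the two populations.
print(round(fst_two_populations(0.99, 0.01), 4))
# A locus with the same frequency in both populations shows no differentiation.
print(round(fst_two_populations(0.30, 0.30), 4))
```

Loci near fixation for different alleles give FST values close to 1, so a frequency-based measure of this kind can flag the near-fixed cases that the haplotype-based signal may miss.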

10. Other SNP Databases and Resources

10.1. The SNP Consortium (http://snp.cshl.org/)

The SNP Consortium (TSC) is run by the Cold Spring Harbor Laboratory on behalf of a private/public partnership of 17 organizations. This database comprises 1.8 million loci, all of which are listed in dbSNP. Both databases are fully cross-referenced with linkouts, but TSC uses a different SNP locus identification system. TSC SNPs have been chosen for study specifically because of their proximity to genes, as a principal goal of the consortium was to construct the first high-density SNP linkage map: the Allele Frequency Project. This project created the most significant feature of the TSC database for SNP research: detailed genotype frequency data for 55,000 loci from European (termed Caucasian), African, and Chinese population samples. This resource formed the core SNP validation data available to the public, along with AOD data, before the initiation of the HapMap project. The findings of the Allele Frequency Project provide a detailed analysis of the nature of SNP variability in the genome and differences between the study populations (33). One word of warning concerns the interpretation of allele frequencies generated by pooled DNA techniques, which account for a proportion of the loci detailed in TSC. This technique is generally inaccurate and almost wholly so when the minor allele frequency is below 10%. An example is rs994174, with allele frequency estimates of 0, 1, and 0 for European, Asian, and African samples, respectively, using pooled DNA, whereas the same populations give estimates from repeat genotyping of 0.67, 0.58, and 0.24 (CEU, CHB, and YRI in HapMap).

10.2. HGVbase (formerly HGBase) (http://hgvbase.cgb.ki.se/)

The Human Genome Variation Database comprises nearly 9 million entries concentrated on human genome variants including SNPs, indels, and STRs. HGVbase uses its own system of locus identification: a nine-digit number prefixed with SNP (if applicable to the locus). The database is in the process of adaptation to give much greater emphasis to phenotype/genotype collation, so it will lose a large part of its focus on the cataloging of SNPs but gain increased importance as a means to link SNP variability to phenotype. In the website’s own description: “sequence variations are presented with details of how they are physically and functionally related to the closest neighboring gene.” This will make HGVbase an essential complement to NCBI OMIM and Gene for the analysis of SNP variation and its effect on the expression of traits. The citation list returned from a query can provide a streamlined approach to starting a literature search of studies of gene variation resulting from coding SNPs. In addition, the strength of HGVbase has been its emphasis on listing low-frequency variants and new mutations that are on the periphery of mainstream SNP content elsewhere. Last, I thoroughly recommend the indispensable list of linkouts maintained here to no fewer than 46 other online SNP-related databases and resource sites (http://hgvbase.cgb.ki.se/cgi-bin/main.pl?page=databases_.htm). In large part the list covers much of this section of the review as a source list of specialized databases, each worth exploring depending on the particular aspect of SNP variability of interest to the user.

10.3. PolyPhen (http://tux.embl-heidelberg.de/ramensky/)

PolyPhen is a tool that provides Polymorphism Phenotyping: predicting the possible impact of an amino acid sequence change on the properties of a protein. It will not work with SNP data input directly but holds a nonsynonymous SNP database comprising 50,919 SNPs taken from dbSNP build 121 (34). The predictions on the effect of these SNPs make interesting reading: 9502 unknown, 27,991 benign, 7905 possibly damaging, and 5525 probably damaging; therefore 32.4% of SNPs with a known effect appeared to be detrimental to the protein. This data subset is the easiest way to use this tool, and rs-numbers can be input as queries directly for comparison against the nonsynonymous SNP collection (http://genetics.bwh.harvard.edu/pph/data/index.html).
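
The 32.4% figure follows directly from the category counts quoted above, as the short check below shows. The dictionary keys are simply labels chosen here for the four PolyPhen prediction classes.

```python
# Prediction counts as quoted in the text for the PolyPhen nonsynonymous SNP set.
counts = {
    "unknown": 9502,
    "benign": 27991,
    "possibly damaging": 7905,
    "probably damaging": 5525,
}

# SNPs with a known (predicted) effect exclude the "unknown" class.
known = sum(n for label, n in counts.items() if label != "unknown")
damaging = counts["possibly damaging"] + counts["probably damaging"]
percent_damaging = 100.0 * damaging / known

# Note: the four classes as quoted sum to 50,923, slightly above the stated
# total of 50,919, so the percentage is computed from the class counts alone.
print(known, damaging, round(percent_damaging, 1))  # 41421 13430 32.4
```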

10.4. ALFRED (http://alfred.med.yale.edu/alfred/index.asp)

The ALlele FREquency Database is an extensive collection of frequency reports for polymorphic markers comprising 1501 loci, 475 populations, and 41,980 frequency tables. This forms a useful population analysis tool, in particular the map function, which provides an intuitive search interface based on geographic region. Unfortunately the SNP data held are currently patchy (only 841 rs-numbers), but this situation is certain to change. The rs-numbers that have been collated in ALFRED are matched to loci (either other polymorphic markers or genes) in “Summaries” then “Sites with dbSNP rs #.”

11. Web-Based SNP Assay Design Tools

11.1. NCBI BLAST (http://www.ncbi.nlm.nih.gov/blast/)

BLAST is a tool for calculating sequence similarity that accesses the NCBI GenBank databases (35). Typically the SNP assay design process will query Nucleotide BLAST in two ways.

1. Finding a location for a submitted sequence, effectively the query being: does the submitted sequence exist in a GenBank database?

2. Checking for coincidental similarity in a sequence, normally a PCR primer, the query being: what is the degree of specificity of the submitted sequence?

There are three BLAST programs available for sequence comparison; the BLAST guide (http://www.ncbi.nlm.nih.gov/BLAST/producttable.shtml) can be consulted for the correct program choice. However, the alignment comparisons required for the two queries above are provided by MegaBLAST and standard BLAST (blastn), respectively. MegaBLAST is designed for long sequences and for a certain degree of mismatch, whereas blastn is designed to give a list of sequences in order of similarity. A third option, “Search for short and near exact matches,” is recommended for sequence specificity checks with fewer than 20 bases. A BLAST query returns a three-part report: (1) a header with query sequence information plus a summarizing graphic overview, (2) a set of single-line matching sequence descriptions, and (3) the matching alignments themselves. The graphic overview shows the query sequence as a numbered red bar and below this the database hits as colored bars aligned to the query. The colors and proximity to the query represent the alignment scores, from red (highest) through to black (lowest) and from uppermost to lowest bars. Two statistics annotate the returns from blastn: the bit score and the expect value (E-value), both given in the single-line descriptions. The bit score, indicating the goodness of fit of each matched sequence, is calculated from a formula that takes into account all matching nucleotides and gaps; the higher the score, the better the alignment (http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html). The E-value summarizes the statistical significance of the alignment, reflecting both the size of the database searched and the scoring system used; the lower the E-value, the more significant the hit. Strictly, the E-value is the number of alignments with an equal or better score expected by chance in a search of that size, and for small values it approximates a probability: a value of 0.05 equates to about 5 in 100, or 1 in 20, that a match this good arises by chance alone. Overall, the routine use of BLAST to check primer sequence specificity should be a procedure familiar to all genetics researchers. Ensembl has a BLAST site (http://www.sanger.ac.uk/cgi-bin/blast/submitblast/hgp) and sequence alignment tool, SSAHA (http://www.sanger.ac.uk/Software/analysis/SSAHA/), that provide an alternative to NCBI.
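
The relationship between bit score, search size, and E-value can be sketched with the standard formula E = m × n × 2^(−S′), where S′ is the bit score, m the query length, and n the database length. The example below is a simplified illustration: it ignores the effective-length corrections that BLAST applies, and the query length, database size, and bit score are assumed values.

```python
import math


def expect_value(bit_score, query_length, database_length):
    """Expected number of chance alignments scoring at least bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)


def chance_probability(e_value):
    """Probability of at least one chance alignment, P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Assumed example: a 20-base primer searched against about 3 billion bases,
# with a hit scoring 40 bits.
e = expect_value(bit_score=40.0, query_length=20, database_length=3_000_000_000)
p = chance_probability(e)
print(f"E-value = {e:.3f}, P = {p:.3f}")
```

For small E-values the two quantities are almost equal, which is why an E-value of 0.05 is commonly read as a 1 in 20 probability. Each additional bit of score halves the E-value, and doubling the database size doubles it.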

11.2. RepeatMasker (http://repeatmasker.genome.washington.edu/cgi-bin/RepeatMasker)

RepeatMasker is a tool for screening submitted DNA sequences against a broadly based library of repetitive elements (36,37). A masked query sequence is returned that can be used for database searches, plus a table annotating the regions of repetitive, low-complexity DNA. Users can expect about 50% of human sequence to be masked with this program. This system is used by NCBI for classifying the flanking sequence of dbSNP entries, and the result is shown in the fasta section of a cluster report. RepeatMasker tends to be quite aggressive in its annotation of sequence; therefore when designing primers for SNP genotyping assays it can be safer to include masked sequence and then check the resulting designs for specificity in BLAST.
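
A quick way to see how much of a flanking sequence has been masked is to count the masked bases. The sketch below assumes the usual conventions, in which masked bases appear as N or X (hard masking) or in lower case (soft masking) depending on the options chosen; the function name and example sequence are hypothetical.

```python
def masked_fraction(sequence, mask_chars="NX"):
    """Return the fraction of bases that are hard-masked (N/X) or soft-masked (lower case)."""
    seq = "".join(sequence.split())  # drop whitespace and line breaks
    if not seq:
        return 0.0
    masked = sum(1 for base in seq if base in mask_chars or base.islower())
    return masked / len(seq)


# Hypothetical SNP flank: 10 unmasked bases, 20 masked bases, 10 unmasked bases.
flank = "ACGTACGTAC" + "N" * 20 + "GGCATTAGCA"
print(masked_fraction(flank))  # 0.5
```

A flank in which most of the sequence is masked leaves little room for primer placement, which is the situation where including masked sequence and checking specificity afterwards becomes worthwhile.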

11.3. Primer3 (http://www-genome.wi.mit.edu/cgi-bin/primer/primer3_www.cgi)

Primer3 is a well-established and popular primer design program and sequence analysis tool (38). The flanking sequence for a SNP is submitted and various PCR parameters, such as amplicon size range and optimum Tm, can be prescribed by the user before a list of suggested primer sequences is returned in order of optimum predicted performance in PCR. Behind this simple interface there is an array of presets for the PCR conditions and the possibility to annotate the submitted sequence to direct the primer design process in useful ways. Primer3 provides particularly versatile secondary structure detection subroutines that can screen primer designs for such structures. This is an essential step, as these structures can reduce the efficiency of PCR or even prevent SNP genotypes being obtained from an assay, particularly in large multiplex designs.
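
As a rough sanity check on returned primers, melting temperature and GC content can be approximated by hand. The sketch below uses the Wallace rule of thumb, Tm = 2(A+T) + 4(G+C), which is suitable only for short oligonucleotides; it is not the calculation Primer3 performs, and the primer sequence is a made-up example.

```python
def gc_content(primer):
    """Fraction of G and C bases in the primer."""
    p = primer.upper()
    return (p.count("G") + p.count("C")) / len(p)


def wallace_tm(primer):
    """Approximate Tm in degrees Celsius by the Wallace rule: 2(A+T) + 4(G+C)."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc


# Hypothetical 20-base primer (9 A/T bases and 11 G/C bases).
primer = "AGCTGACCTGAAGCTCATCC"
print(wallace_tm(primer), round(gc_content(primer), 2))  # 62 0.55
```

Comparing the approximate Tm of the forward and reverse primers in this way gives a quick indication of whether a pair is reasonably matched before it is combined into a multiplex.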

11.4. Santa Cruz In Silico PCR (http://genome.ucsc.edu/cgi-bin/hgPcr)

In-silico PCR is a beautifully simple idea run by the Santa Cruz genome site for checking the specificity of the primer designs developed for the capture PCR in a SNP genotyping assay. When the forward and reverse primer sequences are inserted into the query boxes, these are compared to the human sequence assembly (or 27 other species as options) and the sequence interval between the primer pair is returned to confirm that the correct segment is targeted. Each primer sequence is listed in fasta format as capitalized bases and the interval in lower case. The simplicity and ease of use of this web tool make it a worthwhile alternative to waiting in the BLAST queue.
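
The idea can be illustrated with a toy version that uses exact string matching on a single strand. This is only a sketch under those assumptions: the Santa Cruz tool searches a whole genome assembly with its own indexed algorithm, and the function names, template, and primers below are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def in_silico_pcr(template, forward, reverse, max_size=4000):
    """Return predicted amplicons: primers in upper case, interval in lower case."""
    template = template.upper()
    forward = forward.upper()
    # The reverse primer anneals to the opposite strand, so on the template
    # strand we look for its reverse complement.
    reverse_site = reverse_complement(reverse.upper())
    amplicons = []
    start = template.find(forward)
    while start != -1:
        interval_start = start + len(forward)
        end = template.find(reverse_site, interval_start)
        while end != -1:
            if end + len(reverse_site) - start <= max_size:
                interval = template[interval_start:end].lower()
                amplicons.append(forward + interval + reverse_site)
            end = template.find(reverse_site, end + 1)
        start = template.find(forward, start + 1)
    return amplicons


# Hypothetical template and primer pair.
template = "TTTTACGTACGGGAAACCCGATTCCAAAA"
products = in_silico_pcr(template, forward="ACGTAC", reverse="GGAATC")
print(products)  # ['ACGTACgggaaacccGATTCC']
```

More than one product in the returned list would indicate that the primer pair is not specific to a single segment of the template.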

12. Concluding Remarks

This review is intended to provide some initial directions in which to point the mouse to ensure that a research project using SNP analysis is properly designed and framed. It is important to review as much of the relevant genetic data as possible and to consolidate the research aims on the basis of the information gathered. Luckily, this has never been easier and the data never so extensive and detailed in content. However, it is appropriate to conclude with a last cautionary note. At any one time in the laboratory where I work, as many as four or five scientists of a team of 20 will be reviewing informational web pages from PubMed, dbSNP, HapMap, or Ensembl. Computer-based research can often seem like the major part of the work, but users should resist the temptation to replace benchwork with a disproportionate amount of time following up the work of others or collating information about SNPs without investigating these loci for themselves. An often-used dictum these days is “dry work should not be a substitute for wet work.” Although usually describing the tendency to place undue emphasis on sequence analysis compared to investigations of cellular processes in live material, the phrase could equally well apply to excessive time spent searching online databases at the expense of generating original data for oneself in pursuit of the research aims. Only in the field of population genetics is there now the real possibility of performing primary research based solely on online data. The SNP genotype information generated by the HapMap phase I data release has opened up an interesting phase of SNP research, as the exciting work of Voight et al. (29) has shown by finding unforeseen signatures of recent selection in human populations. This is an uncommon instance: online investigation is still just the start of the work in all other cases of genetic research.

References

1. International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409, 860–921.
2. Sachidanandam, R., et al. (2001) A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409, 928–933.
3. Read, A. and Strachan, T. (2003) Human Molecular Genetics 3. Garland Science.
4. Jobling, M. A., Hurles, M. E., and Tyler-Smith, C. (2003) Human Evolutionary Genetics. Garland Science.
5. Phillips, C., Lareu, M., et al. (2004) Selecting SNPs for forensic applications, in Progress in Forensic Genetics 10 (Doutremepuich, C. and Morling, N., eds.). Elsevier, Amsterdam.
6. Phillips, C. (2005) Using online databases for developing SNP markers of forensic interest. Methods Mol. Biol. 297, 83–105.
7. Sobrino, B., Brion, M., and Carracedo, A. (2005) SNPs in forensic genetics: a review of SNP typing methodologies. Forensic Sci. Int. 154, 181–194.
8. Nachman, M. W. and Crowell, S. L. (2000) Estimate of the mutation rate per nucleotide in humans. Genetics 156, 297–304.
9. Phillips, C., Lareu, M., et al. (2004) Non binary single nucleotide polymorphism markers, in Progress in Forensic Genetics 10 (Doutremepuich, C. and Morling, N., eds.). Elsevier, Amsterdam.
10. Dawson, E., et al. (2002) A first generation linkage disequilibrium map of chromosome 22. Nature 418, 544–548.
11. Patil, N., et al. (2001) Blocks of limited haplotype diversity revealed by high resolution scanning of human chromosome 21. Science 294, 1669–1670.
12. Gabriel, S. B., et al. (2002) The structure of haplotype blocks in the human genome. Science 296, 2225–2229.
13. Phillips, M. S., et al. (2003) Chromosome-wide distribution of haplotype blocks and the role of recombination hot spots. Nat. Genet. 33, 382–387.
14. Daly, M., Rioux, J. D., Schaffer, D. F., Hudson, T. J., and Lander, E. S. (2001) High resolution haplotype structure in the human genome. Nat. Genet. 29, 229–232.
15. De La Vega, F. M., et al. (2003) Selection of single nucleotide polymorphisms for a whole-genome linkage disequilibrium mapping set. CSH Genome Sequencing & Biology Meeting, Cold Spring Harbor, NY.
16. Wall, J. D. and Pritchard, J. K. (2003) Haplotype blocks and linkage disequilibrium in the human genome. Nat. Rev. Genet. 4, 587–597.
17. Patterson, N., Hattangadi, N., Lane, B., et al. (2004) Methods for high-density admixture mapping of disease genes. Am. J. Hum. Genet. 74, 979–1000.
18. Reed, T. E. (1969) Caucasian genes in American Negroes. Science 165, 762–768.
19. Hinds, D. A., Stokowski, R. P., Patil, N., et al. (2004) Matching strategies for genetic association studies in structured populations. Am. J. Hum. Genet. 74, 317–325.
20. Campbell, C. D., Ogburn, E. L., Lunetta, K. L., et al. (2005) Demonstrating stratification in a European American population. Nat. Genet. 37, 868–872.
21. Yoshiura, K., et al. (2006) A SNP in the ABCC11 gene is the determinant of human earwax type. Nat. Genet. 38, 324–330.
22. Altshuler, D., Brooks, L. D., Chakravarti, A., Collins, F. S., Daly, M. J., Donnelly, P.; International HapMap Consortium. (2005) A haplotype map of the human genome. Nature 437, 1299–1320.
23. McVean, G., Spencer, C. C., and Chaix, R. (2005) Perspectives on human genetic variation from the HapMap Project. PLoS Genet. 1, e54.
24. Sabeti, P. C., et al. (2002) Detecting recent positive selection in the human genome from haplotype structure. Nature 419, 832–837.


25. Thorisson, G. A., Smith, A. V., Krishnan, L., and Stein, L. D. (2005) The International HapMap Project Web site. Genome Res. 15, 1592–1593.
26. Loveland, J. (2005) VEGA, the genome browser with a difference. Brief Bioinform. 6, 189–193.
27. Ashurst, J. L., Chen, C. K., Gilbert, J. G., et al. (2005) The Vertebrate Genome Annotation (Vega) database. Nucleic Acids Res. 33, D459–465.
28. Lao, O., Duijn, K., Kersbergen, P., Knijff, P., and Kayser, M. (2006) Proportioning whole-genome single-nucleotide-polymorphism diversity for the identification of geographic population structure and genetic ancestry. Am. J. Hum. Genet. 78, 680–690.
29. Voight, B. F., Kudaravalli, S., Wen, X., and Pritchard, J. K. (2006) A map of recent positive selection in the human genome. PLoS Biol. 4, e72.
30. Nguyen, T. H., Liu, C., Gershon, E. S., and McMahon, F. J. (2004) Frequency Finder: a multi-source web application for collection of public allele frequencies of SNP markers. Bioinformatics 20, 439–444.
31. Phillips, C., Lareu, M., et al. (2004) Population-specific single nucleotide polymorphism, in Progress in Forensic Genetics 10 (Doutremepuich, C. and Morling, N., eds.). Elsevier, Amsterdam.
32. Costas, J., Salas, A., Phillips, C., and Carracedo, A. (2005) Human genome-wide screen of haplotype-like blocks of reduced diversity. Gene 11, 219–225.
33. Miller, R. D., et al. (2005) High-density single-nucleotide polymorphism maps of the human genome. Genomics 86, 117–126.
34. Ramensky, V., Bork, P., and Sunyaev, S. (2002) Human non-synonymous SNPs: server and survey. Nucleic Acids Res. 30, 3894–3900.
35. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) Basic local alignment search tool. J. Mol. Biol. 215, 403–410.
36. Jurka, J., Klonowski, P., Dagman, V., and Pelton, P. (1996) CENSOR: a program for identification and elimination of repetitive elements from DNA sequences. Comput. Chem. 20, 119–121.
37. Bedell, J. A., Korf, I., and Gish, W. (2000) MaskerAid: a performance enhancement to RepeatMasker. Bioinformatics 16, 1040–1041.
38. Rozen, S. and Skaletsky, H. J. (2000) Primer3 on the WWW for general users and for biologist programmers. Methods Mol. Biol. 132, 365–386.
