The GeneCards™ Project at the Weizmann Institute of Science
TRANSCRIPT
• For each gene - a card with displayed data and links to entries in major databases
• Genes with HUGO nomenclature symbols, and others
• Automatic data mining and integration
• Advanced human-computer interaction
http://bioinformatics.weizmann.ac.il/cards/
gene, chromosome, chromosomal location, genetic map, mutation, medical applications, protein, research article, similar mouse gene, marker, RNA, gene alias, disease, DNA sequence
Databases Containing Human Genome Information: Swiss-Prot, GenBank, EMBL, DDBJ, Sanger Centre, Whitehead/MIT, WashU, GESTEC, UniGene, TIGR, GAC, Stanford, GeneMap'98, CEPH, Genethon, CHLC, Marshfield, Utah, GDB, LDB, UDB, NCBI, GENATLAS, PIR, BLOCKS, PRODOM, PRINTS, Pfam, PDB, OMIM, GeneCards, TGD, IMGT, PKR, COPE, HGMD, dbSNP, BRCA1, CFTR, TP53, HOVERGEN
GeneCards: From Chaos to Order
Data is retrieved and integrated automatically
A card for each gene
o Aliases
o DNA, RNA
o Protein
o Chromosomal location
o Disorders
o Medical applications
o Related mouse gene
o Research articles
o Links to more data
Data Related to Genes
Nucleotide SEQUENCE
- Genomic/cDNA
- coding/regulatory
VARIATION (polymorphism, mutation)
Chromosomal LOCATION
EXPRESSION (tissues, developmental, disease)
PROTEIN
- sequence, domains, 3D
- subcellular location
- 2D electrophoresis
Biological PATHWAYS
GENE
DISEASE
PHARMA (diagnostics, vaccines, drugs)
ORTHOLOGS (model organisms, knockout)
Commercial DNA ARRAYS
PATENTS
GeneCard: Integrated Data and Starting Point
Mining and Integration of Data
Diagram labels: GeneCard; entries in data sources of GeneCards; other data sources; arrows labeled "link to"
A Starting point for More Data
HUGO nomenclature gene symbol
Accession ID to other databases
If chromosome 21
LocusLink or HUGO location
A typical GeneCard: RUNX1
For chromosome 21 only
Information on proteins
Sequence accessions
Disorders and mutations
Medical news from Doctor’s Guide
Published literature
Single nucleotide polymorphisms
Homologues
Additional information
Start new search
Snapshot of additional GeneCard fields
Improved Single Nucleotide Polymorphisms Summaries
Current GeneCards Data Sources and Links
HUGO GDB OMIM SWISS-PROT
LocusLink UDB UniGene MGD DOTS UCSC
GenBank PubMed CroW 21 Doctor’s Guide
HUGE euGenes Genatlas ATLAS HGMD TGDB
BCGD MTDB RZPD MIPS PDB BLOCKS
HORDE dbSNP ENSEMBL SBCELEGANS
GeneLynx IMGT SOURCE
Gene sources
HUGO
LocusLink
CroW 21
MGD
13,046
360
63
8,951
Simple search box
results
no results
spell corrections
query modification
outside resources
gene 1: name ... -keyword ... ... ... -keyword .
gene 2: name
-keyword ...
search keywords
How to search and find?
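The flow on this slide (a simple search box that returns results or, when nothing matches, offers spell corrections, query modification and outside resources) might be sketched roughly as follows. This is only an illustration: the tiny `CARDS` index, the `search` function and the use of `difflib` for spelling suggestions are assumptions, not the actual GeneCards search code.

```python
import difflib

# Hypothetical miniature index: gene symbol -> keywords found on its card.
CARDS = {
    "RUNX1": ["runt-related transcription factor", "leukemia", "chromosome 21"],
    "TP53": ["tumor protein p53", "Li-Fraumeni syndrome"],
    "CFTR": ["cystic fibrosis", "chloride channel"],
}


def search(query):
    """Return matching gene symbols, or fallback suggestions when none match."""
    q = query.strip().lower()
    hits = [
        symbol
        for symbol, keywords in CARDS.items()
        if q == symbol.lower() or any(q in kw.lower() for kw in keywords)
    ]
    if hits:
        return {"results": hits}

    # No results: offer spelling corrections against the known gene symbols.
    vocabulary = {s.lower(): s for s in CARDS}
    close = difflib.get_close_matches(q, list(vocabulary), n=3, cutoff=0.6)
    return {
        "results": [],
        "spell_corrections": [vocabulary[c] for c in close],
        # Query modification (assumed form): retry with each word separately.
        "query_modification": q.split(),
        # Outside resources are only named here, not queried.
        "outside_resources": ["PubMed", "OMIM"],
    }
```

Calling `search("leukemia")` returns the matching card, while a near-miss such as `search("RUNX")` falls through to the suggestion branch.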
Some GeneCards Statistics
27,612 GeneCards (November, 2001)
13,548 HUGO approved genes
2,646,185 accesses to GeneCards (at WIS since January 1, 1998)
25 mirror sites around the world
The Affymetrix System
Genechip Procedure
Sample preparation, Hybridization, Signal detection, Data analysis
Fluidic station Scanner Software
ChipCards - A Functional Integration Tool for DNA Array Data
Tsviya Olender, Shirley Horn-Saban, Marilyn Safran, Vered Chalifa-Caspi, Michal Ronen and Doron Lancet
The Crown Human Genome Center, The Weizmann Institute of Science, Rehovot 76100
ChipCards correlates DNA array data with comprehensive information from gene-specific databases. It is currently implemented for the Affymetrix GeneChip.
ChipCards’s output is an HTML table with essential additional information for each gene including: gene symbol, functional definition, accession number, protein information, chromosomal location and EST data.
Human data is integrated with GeneCards, UDB and Unigene.
Mouse data is integrated with information about the human orthologue via GeneCards, HomoloGene and MGD.
About ChipCards
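As a rough illustration of the idea described above, the sketch below merges array rows with per-gene annotation and renders an HTML table. Everything named here is hypothetical: the `ANNOTATION` lookup, the placeholder accession `ACC0001`, the column names and both functions are assumptions for illustration, not the ChipCards implementation.

```python
from html import escape

# Hypothetical annotation lookup keyed by accession number; in ChipCards this
# information would come from sources such as GeneCards, UDB and UniGene.
ANNOTATION = {
    "ACC0001": {"symbol": "RUNX1", "location": "chromosome 21"},
}

COLUMNS = ["probe_set", "accession", "signal", "symbol", "location"]


def annotate(rows):
    """Merge each array row with its gene annotation (blank if unknown)."""
    merged = []
    for row in rows:
        info = ANNOTATION.get(row["accession"], {})
        merged.append({
            **row,
            "symbol": info.get("symbol", ""),
            "location": info.get("location", ""),
        })
    return merged


def to_html(rows):
    """Render the annotated rows as a simple HTML table."""
    header = "".join(f"<th>{escape(c)}</th>" for c in COLUMNS)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(r[c]))}</td>" for c in COLUMNS) + "</tr>"
        for r in rows
    )
    return f"<table><tr>{header}</tr>{body}</table>"
```

Rows whose accession is not in the lookup keep empty annotation cells rather than being dropped, so the output table always has one row per probe set.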
Example of GeneChip output before ChipCards processing
An Extract of Human Expression Data After ChipCards Processing
A snapshot of ChipCards’s result, with human Affymetrix expression data as input. Each probe set has a link to NCBI, GeneCards and UDB. Information about the cDNA sources of the gene is extracted from Unigene and is given as a separate column in the table, as are the UDB coordinates.
NCBI link GeneCards link UDB link
Murine Expression Data After ChipCards Processing
GeneCards link
A snapshot of ChipCards output for Mouse Affymetrix expression data. Each probe set is linked to NCBI and Unigene. Information about the human orthologue is integrated into the table and includes links to NCBI, GeneCards and Unigene.
NCBI link, human Unigene link
Human orthologue data
NCBI link, murine Unigene link
Diagram labels (steps 1 to 5): GeneCard for novel gene; Unigene cluster; assembly-based resources; gene sequence tag; unique persistent gene identifier
Current Research - Adding Cards for Genes that Don’t Yet Have a Name
Improving flexibility, allowing automated parameterized generation from partial sets of sources and/or genes, and appending to an existing database
Providing an Application Programming Interface for users of the generation software to incorporate their own data
Standardizing the format of the database to use XML
Version 3.0 Project Goals
Providing a foundation for supplying a stable identifier for each GeneCard, even when no known gene symbol exists
Improving the maintainability, testability, and quality of the software
Providing a seamless migration path from Version 2.xx while maintaining the current look and feel and functionality
Project Goals (cont’d)
Pros and Cons of Using OOP
Cons:
• Perl not originally designed as an OOP language
• Type safety, proper encapsulation and aggregation aren’t enforced
• Can be between 20 and 50% slower
Pros:
• Allows for more robust implementations
• Greater modularity
• More comprehensible interface to modules
• Better abstraction of software components
• Less namespace pollution
• Greater code reusability
• Software scalability
• Cleaner and more compact code
Combines an object-oriented skeleton with some non object-oriented internals
• The large data structure of gene-based data is implemented as a hash of hashes, avoiding numerous costly instantiations
• All other major components, including the extractors and administration classes, are implemented as objects
The 3.0 Hybrid Solution
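A minimal sketch of this hybrid layout, written in Python rather than Perl and under stated assumptions: the `Extractor` class, its `extract_into` method and the sample records are hypothetical. The point it illustrates is that the components are objects while the gene data stays a plain nested mapping (the Python analogue of a hash of hashes).

```python
from collections import defaultdict


class Extractor:
    """An object-oriented component: one instance per data source."""

    def __init__(self, source, records):
        self.source = source
        # Hypothetical pre-parsed (symbol, field, value) tuples.
        self.records = records

    def extract_into(self, genes):
        """Write this source's fields into the shared gene table."""
        for symbol, field, value in self.records:
            genes[symbol][field] = value


# The gene data itself is a plain mapping of mappings, so no per-gene
# object has to be instantiated.
genes = defaultdict(dict)

extractors = [
    Extractor("HUGO", [("RUNX1", "symbol_source", "HUGO")]),
    Extractor("LocusLink", [("RUNX1", "location", "chromosome 21")]),
]
for extractor in extractors:
    extractor.extract_into(genes)
```

With tens of thousands of genes, only a handful of extractor objects exist, while each gene costs just one inner dictionary.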
GeneCards Architecture
GeneCards Database
Generation Software
SwissProt Extractor
Customized Extractor
UniGene Extractor
Support Functions
API
Display Software
An underlying layer of support tools that manage extracting data from locally mirrored files and the internet, proxy connections, verification, security, file management, caching, conflict detection, error handling, statistics, and XML output formatting
A set of extractor classes, one for each source of information, using source-specific algorithms and heuristics (adapted from previous versions of GeneCards). Methods include new, prepare and search
A template for building extractor classes. All such classes can create new entries or append to old ones, and can generate data for all entries (genes) at once or one at a time
A main class that handles building sets of cards according to parameterized partial ordering rules
Generation Software Classes
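One possible reading of this class structure, sketched in Python rather than Perl. The class names, the placeholder return values and the `must_follow` rule format are assumptions; the slide only states that extractors share a template with new, prepare and search methods and that a main class builds cards according to partial ordering rules. Here the ordering is resolved with a topological sort, which is an illustrative choice and not necessarily what GeneCards does.

```python
from abc import ABC, abstractmethod
from graphlib import TopologicalSorter  # Python 3.9+


class ExtractorTemplate(ABC):
    """Template for source-specific extractors."""

    name = "base"

    def __init__(self):  # plays the role of Perl's `new`
        self.ready = False

    def prepare(self):
        """Load or mirror the source data before any search."""
        self.ready = True

    @abstractmethod
    def search(self, symbol):
        """Return this source's fields for one gene symbol."""


class SwissProtExtractor(ExtractorTemplate):
    name = "SwissProt"

    def search(self, symbol):
        return {"protein": f"protein entry for {symbol}"}  # placeholder data


class UniGeneExtractor(ExtractorTemplate):
    name = "UniGene"

    def search(self, symbol):
        return {"cluster": f"cluster for {symbol}"}  # placeholder data


def build_card(symbol, extractors, must_follow):
    """Run extractors in an order consistent with the ordering rules.

    `must_follow` maps a source name to the sources that must run before it.
    """
    by_name = {e.name: e for e in extractors}
    graph = {
        name: set(must_follow.get(name, ())).intersection(by_name)
        for name in by_name
    }
    card, order = {}, []
    for name in TopologicalSorter(graph).static_order():
        extractor = by_name[name]
        extractor.prepare()
        card.update(extractor.search(symbol))
        order.append(name)
    return card, order
```

Passing only a subset of extractors to `build_card` corresponds to generating cards from a partial set of sources; rules that mention absent sources are simply ignored.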
XML is a meta-language that supports customized tags for describing and providing semantic meaning to structured data
Typed elements are arranged within other elements to form a nested hierarchy
The data is grouped by source in the XML files, but can be retrieved by function:
<GCresource>SWISSPROT
  <protein> </protein>
</GCresource>
<GCresource>OMIM
  <disorder>Colorectal Cancer</disorder>
  <disorder>Germline Cancer</disorder>
</GCresource>
<GCresource>GENECLINICS
  <disorder>Li-Fraumeni Syndrome</disorder>
</GCresource>
Each extractor module is responsible for its own Document Type Definition (DTD) specification to ensure that the XML is well formed and valid
Files are stored in a hierarchical directory structure, one file per gene
The XML-Based Database
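To illustrate "grouped by source, retrieved by function", the sketch below parses a small card file and collects every element of one type across all sources. The `<GeneCard>` root element, the placeholder protein text and the `by_function` helper are assumptions made for this example; only the `GCresource`, `protein` and `disorder` element names come from the slide. Python's standard `xml.etree.ElementTree` is used here in place of the Perl XML modules the project mentions.

```python
import xml.etree.ElementTree as ET

# Hypothetical card file; the <GeneCard> root element is an assumption.
CARD_XML = """
<GeneCard>
  <GCresource>SWISSPROT
    <protein>protein data placeholder</protein>
  </GCresource>
  <GCresource>OMIM
    <disorder>Colorectal Cancer</disorder>
    <disorder>Germline Cancer</disorder>
  </GCresource>
  <GCresource>GENECLINICS
    <disorder>Li-Fraumeni Syndrome</disorder>
  </GCresource>
</GeneCard>
"""


def by_function(xml_text, tag):
    """Collect (source, value) pairs for one element type across all sources."""
    root = ET.fromstring(xml_text)
    pairs = []
    for resource in root.iter("GCresource"):
        # The source name is the text that precedes the first child element.
        source = (resource.text or "").strip()
        for element in resource.iter(tag):
            pairs.append((source, (element.text or "").strip()))
    return pairs
```

Asking for `"disorder"` gathers entries from both OMIM and GENECLINICS in document order, even though the file stores them under separate sources.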
Currently in the design phase
Want to maintain the current look and feel while providing the flexibility of easy customization
Will use Perl XML parser modules in CGI scripts
Search will be expanded beyond current text-based capabilities to include context-specific searches
The Display Software
Procedural programs/ad-hoc flat file format
Object-oriented methodology/standardized XML
Easy to add new extractors
Flexible and extensible
Performance, searching strategies
3.0 Project Status and Open Issues
Integrated chromosomal maps
Source-specific information
Thesaurus
Original public databases
Data mining
Semantic Integration
Megabase Integration
Data mining and integration
Unified Database (UDB)
UDB
Sequence-Based Repositioning (SBR)
Placing finished genomic sequences on UDB map.
Map fine tuning in sequenced regions.
Elimination of overlaps between contigs
Object repositioning
UDB original map SBR map
SBR (Sequence Based Repositioning)
Search Results - a Map Slice
to MarkerCard
to Unigene
to GeneCard
A MarkerCard
GeneCards Success Stories
• GeneCards as a bookmark for linkage analysis
• Mutations that were polymorphisms and not disease-causing
• Adult-onset diabetes without obesity in India
• Work on Chromosome 21 at the Weizmann Institute
• PVT – a heart disease found in Israeli Bedouins
• Parkinson’s disease paper
Frequently Asked Questions
• What’s special about GeneCards?
• Can I interface my own data?
• Can I access my own in-house database mirrors instead of public internet sites?
Alumni: Michael Rebhan, Shai Shen-Orr, Inga Peter, Jaime Prilusky, Michal Ronen, Hershel Safer, Julie Stampnitzky, Liora Yaar
Current: Avital Adato, Vered Chalifa-Caspi, Michal Lapidot, Zvia Olender, Naomi Rosen, Marilyn Safran (head), Orit Shmueli, Irina Solomon, Doron Lancet (PI)
GeneCards/UDB Team