The GeneCards™ Project at the Weizmann Institute of Science

• For each gene - a card with displayed data and links to entries in major databases

• Genes with HUGO nomenclature symbols and others

• Automatic data mining and integration

• Advanced human-computer interaction

http://bioinformatics.weizmann.ac.il/cards/

[Figure: concepts surrounding a gene: chromosome, chromosomal location, genetic map, marker, mutation, medical applications, protein, research article, similar mouse gene, RNA, alias, disease, DNA, sequence]

Databases Containing Human Genome Information

Swiss-Prot, GenBank, EMBL, DDBJ, Sanger Centre, Whitehead/MIT, WashU, GESTEC, UniGene, TIGR, GAC, Stanford, GeneMap'98, CEPH, Genethon, CHLC, Marshfield, Utah, GDB, LDB, UDB, NCBI, GENATLAS, PIR, BLOCKS, PRODOM, PRINTS, Pfam, PDB, OMIM, GeneCards, TGD, IMGT, PKR, COPE, HGMD, dbSNP, BRCA1, CFTR, TP53, HOVERGEN

GeneCards: From Chaos to Order

Data is retrieved and integrated automatically

A card for each gene

• Aliases
• DNA, RNA
• Protein
• Chromosomal location
• Disorders
• Medical applications
• Related mouse gene
• Research articles
• Links to more data

Data Related to Genes

Nucleotide SEQUENCE
- Genomic/cDNA
- coding/regulatory
VARIATION (polymorphism, mutation)
Chromosomal LOCATION
EXPRESSION (tissues, developmental, disease)
PROTEIN
- sequence, domains, 3D
- subcellular location
- 2D electrophoresis
Biological PATHWAYS

GENE

DISEASE

PHARMA (diagnostics, vaccines, drugs)

ORTHOLOGS (model organisms, knockout)

Commercial DNA ARRAYS

PATENTS

GeneCard: Integrated Data and Starting Point

Mining and Integration of Data

[Diagram: a GeneCard links to entries in the data sources of GeneCards and in other data sources]

A Starting point for More Data

HUGO nomenclature gene symbol

Accession ID to other databases

If chromosome 21

LocusLink or HUGO location

A typical GeneCard: RUNX1

For chromosome 21 only

Information on proteins

Sequence accessions

Disorders and mutations

Medical news from Doctor’s guide

Published literature

Single nucleotide polymorphisms

Homologues

Additional information

Start new search

Snapshot of additional GeneCard fields

Improved Single Nucleotide Polymorphisms Summaries

Current GeneCards Data Sources and Links

HUGO GDB OMIM SWISS-PROT

LocusLink UDB UniGene MGD DOTS UCSC

GenBank PubMed CroW 21 Doctor’s Guide

HUGE euGenes Genatlas ATLAS HGMD TGDB

BCGD MTDB RZPD MIPS PDB BLOCKS

HORDE dbSNP ENSEMBL SBCELEGANS

GeneLynx IMGT SOURCE

Gene sources

HUGO

LocusLink

CroW 21

MGD

13,046

360

63

8,951

Simple search box

results / no results

spell corrections

query modification

outside resources

gene 1: name ... -keyword ... ... ... -keyword .

gene 2: name

-keyword ...

search keywords

How to search and find?

Some GeneCards Statistics

27,612 GeneCards (November 2001)

13,548 HUGO approved genes

2,646,185 accesses to GeneCards (at WIS since January 1, 1998)

25 mirror sites around the world

The Affymetrix System

Genechip Procedure

Sample preparation, Hybridization, Signal detection, Data analysis

Fluidic station Scanner Software

ChipCards - A Functional Integration Tool for DNA Array Data

Tsviya Olender, Shirley Horn-Saban, Marilyn Safran, Vered Chalifa-Caspi, Michal Ronen and Doron Lancet

The Crown Human Genome Center, The Weizmann Institute of Science, Rehovot 76100

ChipCards correlates DNA array data with comprehensive information from gene-specific databases. It is currently implemented for the Affymetrix GeneChip.

ChipCards’s output is an HTML table with essential additional information for each gene including: gene symbol, functional definition, accession number, protein information, chromosomal location and EST data.

Human data is integrated with GeneCards, UDB and UniGene.

Mouse data is integrated with information about the human orthologue via GeneCards, HomoloGene and MGD.
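As a rough illustration of the kind of join described above, the Python sketch below attaches gene annotation to expression rows and renders the result as an HTML table. The probe-set IDs, the `ANNOTATIONS` lookup, the column names, and the functions `annotate` and `to_html_table` are all hypothetical; this is not the ChipCards implementation.

```python
from html import escape

# Hypothetical lookup from probe-set ID to gene annotation.
# The probe-set IDs are invented for this example.
ANNOTATIONS = {
    "probe_001": {"symbol": "RUNX1", "chromosome": "21"},
    "probe_002": {"symbol": "TP53", "chromosome": "17"},
}

COLUMNS = ["probe_set", "signal", "symbol", "chromosome"]


def annotate(rows, annotations):
    """Attach gene annotation to each (probe_set, signal) expression row."""
    annotated = []
    for probe_set, signal in rows:
        info = annotations.get(probe_set, {})
        annotated.append({
            "probe_set": probe_set,
            "signal": signal,
            "symbol": info.get("symbol", ""),       # empty when unannotated
            "chromosome": info.get("chromosome", ""),
        })
    return annotated


def to_html_table(records, columns=COLUMNS):
    """Render annotated records as a minimal HTML table."""
    header = "".join(f"<th>{escape(c)}</th>" for c in columns)
    lines = ["<table>", f"<tr>{header}</tr>"]
    for rec in records:
        cells = "".join(f"<td>{escape(str(rec[c]))}</td>" for c in columns)
        lines.append(f"<tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


expression_rows = [("probe_001", 1520.5), ("probe_003", 87.0)]
print(to_html_table(annotate(expression_rows, ANNOTATIONS)))
```

Probe sets without a match keep their expression value and get empty annotation cells, so no input row is silently dropped.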

About ChipCards

Example of GeneChip output before ChipCards processing

An Extract of Human Expression Data After ChipCards Processing

A snapshot of ChipCards’s result, with human Affymetrix expression data as input. Each probe set has a link to NCBI, GeneCards and UDB. Information about the cDNA sources of the gene is extracted from UniGene and is given as a separate column in the table. UDB coordinates are likewise given as a separate column.

NCBI link GeneCards link UDB link

Murine Expression Data After ChipCards Processing

GeneCards link

A snapshot of ChipCards output for mouse Affymetrix expression data. Each probe set is linked to NCBI and UniGene. Information about the human orthologue is integrated into the table and includes links to NCBI, GeneCards and UniGene.

NCBI link, human UniGene link

Human orthologue data

NCBI link, murine UniGene link

[Diagram, numbered steps 1 to 5: GeneCard for novel gene; UniGene cluster; assembly-based resources; gene sequence tag; unique persistent gene identifier]

Current Research - Adding Cards for Genes that Don’t Yet Have a Name

Improving flexibility, allowing automated parameterized generation from partial sets of sources and/or genes, and appending to an existing database

Providing an Application Programming Interface for users of the generation software to incorporate their own data

Standardizing the format of the database to use XML

Version 3.0 Project Goals

Providing a foundation for supplying a stable identifier for each GeneCard, even when no known gene symbol exists

Improving the maintainability, testability, and quality of the software

Providing a seamless migration path from Version 2.xx while maintaining the current look and feel and functionality

Project Goals (cont’d)

Pros and Cons of Using OOP

Cons:

• Perl not originally designed as an OOP language

• Type safety, proper encapsulation and aggregation aren’t enforced

• Can be between 20 and 50% slower

Pros:

• Allows for more robust implementations

• Greater modularity

• More comprehensible interface to modules

• Better abstraction of software components

• Less namespace pollution

• Greater code reusability

• Software scalability

• Cleaner and more compact code

Combines an object-oriented skeleton with some non object-oriented internals

• The large data structure of gene-based data is implemented as a hash of hashes, avoiding numerous costly instantiations

• All other major components, including the extractors and administration classes, are implemented as objects
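The hybrid layout can be sketched in Python, where a dict of dicts plays the role of a Perl hash of hashes. The `Extractor` class, its `extract_into` method, and all field values below are hypothetical and only illustrate the split between plain nested data and object-based components.

```python
class Extractor:
    """Hypothetical extractor object: one instance per data source."""

    def __init__(self, source_name, records):
        self.source_name = source_name
        self._records = records  # stand-in for a locally mirrored source file

    def extract_into(self, genes):
        """Write this source's fields into the shared dict of dicts."""
        for symbol, fields in self._records.items():
            genes.setdefault(symbol, {})[self.source_name] = fields


# Gene-based data: a plain dict of dicts (the Python analogue of a Perl
# hash of hashes), so no object is instantiated per gene or per field.
genes = {}

# Only the components (here, the extractors) are objects.
extractors = [
    Extractor("SWISSPROT", {"TP53": {"protein": "illustrative value"}}),
    Extractor("OMIM", {"TP53": {"disorder": "illustrative value"}}),
]
for extractor in extractors:
    extractor.extract_into(genes)

print(sorted(genes["TP53"]))
```

The number of objects stays proportional to the number of sources, not to the number of genes, which is the cost the slide says the design avoids.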

The 3.0 Hybrid Solution

GeneCards Architecture

GeneCards Database

Generation Software

SwissProt Extractor

Customized Extractor

UniGene Extractor

Support Functions

API

Display Software

An underlying layer of support tools that manage extracting data from locally mirrored files and the internet, proxy connections, verification, security, file management, caching, conflict detection, error handling, statistics, and XML output formatting

A set of extractor classes, one for each source of information, using source-specific algorithms and heuristics (adapted from previous versions of GeneCards). Methods include new, prepare and search

A template for building extractor classes. All such classes can create new or append to old entries, as well as generate data for all entries (genes) at once, or one at a time
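A minimal Python sketch of such a template follows, assuming `prepare` loads a source and `search` looks up one gene; Python's `__init__` stands in for Perl's `new`. The names `ExtractorTemplate`, `DemoExtractor`, `update_one`, and `update_all` are hypothetical, as is the in-memory data.

```python
from abc import ABC, abstractmethod


class ExtractorTemplate(ABC):
    """Hypothetical base class; __init__ plays the role of `new`."""

    source_name = "UNKNOWN"

    def __init__(self):
        self.ready = False

    @abstractmethod
    def prepare(self):
        """Load or index the source data before any search."""

    @abstractmethod
    def search(self, symbol):
        """Return this source's fields for one gene, or None if absent."""

    def update_one(self, database, symbol):
        """Create a new entry or append to an existing one, for one gene."""
        if not self.ready:
            self.prepare()
            self.ready = True
        fields = self.search(symbol)
        if fields is not None:
            database.setdefault(symbol, {})[self.source_name] = fields

    def update_all(self, database, symbols):
        """Generate data for all given genes at once."""
        for symbol in symbols:
            self.update_one(database, symbol)


class DemoExtractor(ExtractorTemplate):
    """Toy source-specific extractor backed by an in-memory dict."""

    source_name = "DEMO"

    def prepare(self):
        self._index = {"RUNX1": {"note": "illustrative value"}}

    def search(self, symbol):
        return self._index.get(symbol)


database = {"RUNX1": {"OTHER": {}}}  # an existing entry to append to
DemoExtractor().update_all(database, ["RUNX1", "CFTR"])
print(sorted(database["RUNX1"]))
```

A concrete extractor only supplies `prepare` and `search`; the create-or-append and one-at-a-time versus all-at-once behavior lives in the shared template.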

A main class that handles building sets of cards according to parameterized partial ordering rules
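One way to read "partial ordering rules" is as constraints saying which sources must be processed before others; under that assumption, a topological sort yields a valid build order. The sketch below uses Python's standard `graphlib` module. The `CardBuilder` class, the step functions, and the rule format are hypothetical and are not taken from the GeneCards software.

```python
from graphlib import TopologicalSorter


class CardBuilder:
    """Hypothetical main class: runs extraction steps in a rule-respecting order."""

    def __init__(self, steps, must_follow):
        # steps: source name -> callable(database, symbol)
        # must_follow: source name -> set of sources that must run first
        self.steps = steps
        self.must_follow = must_follow

    def order(self, sources):
        """Return the requested sources in an order satisfying the rules."""
        wanted = set(sources)
        graph = {s: self.must_follow.get(s, set()) & wanted for s in sources}
        return list(TopologicalSorter(graph).static_order())

    def build(self, symbols, sources):
        """Build cards for the given genes from a (possibly partial) source set."""
        database = {}
        for source in self.order(sources):
            for symbol in symbols:
                self.steps[source](database, symbol)
        return database


def hugo_step(database, symbol):
    database.setdefault(symbol, {})["HUGO"] = {"symbol": symbol}


def swissprot_step(database, symbol):
    # Relies on the HUGO step having created the entry already.
    database[symbol]["SWISSPROT"] = {"protein": "illustrative value"}


builder = CardBuilder(
    steps={"HUGO": hugo_step, "SWISSPROT": swissprot_step},
    must_follow={"SWISSPROT": {"HUGO"}},
)
cards = builder.build(["RUNX1"], sources=["SWISSPROT", "HUGO"])
print(builder.order(["SWISSPROT", "HUGO"]))
```

Passing the sources and genes as parameters mirrors the stated goal of generating cards from partial sets of sources and/or genes.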

Generation Software Classes

XML is a meta-language that supports customized tags for describing and providing semantic meaning to structured data

Typed elements are arranged within other elements to form a nested hierarchy

The data is grouped by source in the XML files, but can be retrieved by function:

    <GCresource>SWISSPROT
      <protein>
        <disorder>Colorectal Cancer</disorder>
      </protein>
    </GCresource>
    <GCresource>OMIM
      <disorder>Germline Cancer</disorder>
    </GCresource>
    <GCresource>GENECLINICS
      <disorder>Li-Fraumeni Syndrome</disorder>
    </GCresource>
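To illustrate retrieval by function over data grouped by source, the Python sketch below collects every `<disorder>` element regardless of which `<GCresource>` contains it. The tags follow the slide's example; the enclosing `<GeneCard>` root element and the function `find_by_function` are assumptions added so the fragment parses as a single document.

```python
import xml.etree.ElementTree as ET

# Wrapped in a hypothetical <GeneCard> root so the document has a single
# top-level element; the inner tags follow the slide's example.
XML_TEXT = """
<GeneCard>
  <GCresource>SWISSPROT
    <protein>
      <disorder>Colorectal Cancer</disorder>
    </protein>
  </GCresource>
  <GCresource>OMIM
    <disorder>Germline Cancer</disorder>
  </GCresource>
  <GCresource>GENECLINICS
    <disorder>Li-Fraumeni Syndrome</disorder>
  </GCresource>
</GeneCard>
"""


def find_by_function(xml_text, tag):
    """Return (source, value) pairs for every `tag` element under any source."""
    root = ET.fromstring(xml_text.strip())
    results = []
    for resource in root.iter("GCresource"):
        # The source name is the text that precedes the first child element.
        source = (resource.text or "").strip()
        for element in resource.iter(tag):
            results.append((source, (element.text or "").strip()))
    return results


for source, disorder in find_by_function(XML_TEXT, "disorder"):
    print(f"{source}: {disorder}")
```

The same call with `"protein"` would retrieve protein data instead, without the caller needing to know which source files hold it.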

Each extractor module is responsible for its own Document Type Definition (DTD) specification to ensure that the XML is well formed and valid

Files are stored in a hierarchical directory structure, one file per gene
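The slides do not say how the directory hierarchy is organized. Purely as an assumed example, the Python sketch below shards files by the first character of the gene identifier; the function `card_path`, the root directory name, and the sharding rule are all hypothetical.

```python
from pathlib import Path


def card_path(root, gene_id):
    """Map a gene identifier to its XML file, sharded by first character."""
    if not gene_id:
        raise ValueError("gene_id must be non-empty")
    # Assumed layout: <root>/<first character>/<gene_id>.xml
    return Path(root) / gene_id[0].upper() / f"{gene_id}.xml"


print(card_path("genecards_db", "RUNX1"))
```

Any scheme that keeps one file per gene and bounds the number of files per directory would serve the same purpose.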

The XML-Based Database

Currently in the design phase

Want to maintain the current look and feel while providing the flexibility of easy customization

Will use Perl XML parser modules in CGI scripts

Search will be expanded beyond current text-based capabilities to include context-specific searches

The Display Software

Procedural programs/ad-hoc flat file format

Object-oriented methodology/standardized XML

Easy to add new extractors

Flexible and extensible

Performance, searching strategies

3.0 Project Status and Open Issues

Integrated chromosomal maps

Source-specific information

Thesaurus

Original public databases

Data mining

Semantic Integration

Megabase Integration

Data mining and integration

Unified Database (UDB)

UDB

Sequence-Based Repositioning (SBR)

Placing finished genomic sequences on UDB map.

Map fine tuning in sequenced regions.

Elimination of overlaps between contigs

Object repositioning

UDB original map SBR map

SBR (Sequence Based Repositioning)

Search Results - a Map Slice

to MarkerCard

to Unigene

to GeneCard

A MarkerCard

GeneCards Success Stories

• GeneCards as a bookmark for linkage analysis

• Mutations that were polymorphisms and not disease-causing

• Adult-onset diabetes without obesity in India

• Work on Chromosome 21 at the Weizmann Institute

• PVT – a heart disease found in Israeli Bedouins

• Parkinson’s disease paper

Frequently Asked Questions

• What’s special about GeneCards?

• Can I interface my own data?

• Can I access my own in-house database mirrors instead of public internet sites?

Alumni: Michael Rebhan, Shai Shen-Orr, Inga Peter, Jaime Prilusky, Michal Ronen, Hershel Safer, Julie Stampnitzky, Liora Yaar

Current: Avital Adato, Vered Chalifa-Caspi, Michal Lapidot, Zvia Olender, Naomi Rosen, Marilyn Safran (head), Orit Shmueli, Irina Solomon, Doron Lancet (PI)

GeneCards/UDB Team
