
Discovering Gene-Disease Association using On-line Scientific Text Abstracts.

Raj Adhikari

Advisor: Javed Mostafa

April 20, 2023 Bioinformatics capstone project 2

Motivation

A central problem in bioinformatics is how to capture information from the vast scientific literature and create an automated system for “knowledge discovery” that can be used in various areas.

I address the special case of gene-disease interactions and show that the frequencies and relevance of words in PubMed abstracts can be used to find genes related to a disease.


Goal

Use the combination of statistical methods and a database to:

Retrieve research abstracts from PubMed.

Extract relevant information from the free texts using statistical methods.

Measure the accuracy of the results and display the results using a Web-based system.

Complement and support existing knowledge base systems like GeneCards.


Resources used in creating database

PubMed

The US National Library of Medicine's database that contains more than 11 million references to journal articles in the health sciences.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi

GeneCards

A database of human genes, their products and their involvement in diseases.

http://bioinfo.weizmann.ac.il/cards/index.shtml

HGNC

HUGO Gene Nomenclature Committee (approved over 19000 human gene symbols). Consistent with OMIM and LocusLink.

http://www.gene.ucl.ac.uk/nomenclature

Tools used: Perl, CGI, Java, MySQL


Creating the database

Data I used:

A relatively small list of genes and diseases in humans.

An article set (around 8000). For each PubMed article: PMID, article title, abstract (filtered with a list of stop words).

The HUGO dataset.

A list of around 3500 related gene-disease pairs from GeneCards.


Populating the database tables

Use the book Genes and Disease at OMIM to generate a list of around 60 diseases and 90 genes.

Search PubMed for each gene-disease pair on the Title/Abstract field.

Use ESearch (a tool that provides access to the PubMed database outside of the web interface) to retrieve data in XML file format.

Use the XML::Simple Perl package to parse the XML file.

Filter the text using stop words and store each title and abstract along with the related PMID in a database table.

Add more genes using HUGO.

OMIM: Database of genetic diseases with references to molecular medicine, cell biology, biochemistry and clinical details of the diseases.
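The retrieval and filtering steps above can be sketched as follows. This is a minimal illustration in Python rather than the Perl and XML::Simple setup the project used. The stop-word list, the sample XML and the function names are assumptions made for the example. An ESearch response lists matching PMIDs; fetching the title and abstract for each PMID is a separate step that is not shown.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Hypothetical, tiny stop-word list; the project's real list is not given.
STOP_WORDS = {"the", "of", "in", "and", "a", "is", "with", "to", "for"}


def build_esearch_url(gene, disease, retmax=100):
    """Build an ESearch URL that looks for a gene-disease pair in Title/Abstract."""
    term = f"{gene}[Title/Abstract] AND {disease}[Title/Abstract]"
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_pmids(xml_text):
    """Extract the PMIDs from an ESearch XML response."""
    root = ET.fromstring(xml_text)
    return [id_node.text for id_node in root.findall("./IdList/Id")]


def filter_stop_words(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


# Made-up response in the shape ESearch returns (no network call is made here).
sample_xml = """<eSearchResult>
  <Count>2</Count>
  <IdList><Id>11111111</Id><Id>22222222</Id></IdList>
</eSearchResult>"""

url = build_esearch_url("BRCA2", "pancreatic cancer")
pmids = parse_pmids(sample_xml)
tokens = filter_stop_words("Mutations of the BRCA2 gene in pancreatic cancer")
```

The filtered tokens, together with the PMID, are what would be written to the database table.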


Populating the database tables

Parse the retrieved text files and create the following tables:

HUGO table structure: HGNC, genesymbol, alias

GeneCards table structure: Genesymbol, disease

Derivative table structure: Term, PMID, Tfreq, Dfreq, Tfidf, LSI


Generating term weights

Basic idea: compare co-occurrence of terms in a document and across a set of documents by generating term weights.

Within a document: Term Frequency. tf measures term density within a document.

Across the document set: Inverse Document Frequency. idf measures the “informativeness” of a term across a dataset.

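Thus:

$$\mathrm{tfidf}_i = \mathrm{tf}_i \cdot \log\frac{n}{\mathrm{df}_i}$$

$$\mathrm{idf}_i = \log\frac{n}{\mathrm{df}_i}$$

where n is the number of documents in the set and df_i is the number of documents that contain term i.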
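A minimal sketch of the weighting, assuming tokenized abstracts keyed by PMID and a natural logarithm (the log base is not stated in the slides). The function name and the toy documents are made up for the example.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return {pmid: {term: tf * log(n / df)}} for tokenized documents."""
    n = len(docs)
    # df[term] = number of documents that contain the term at least once.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    weights = {}
    for pmid, tokens in docs.items():
        tf = Counter(tokens)  # raw term counts within this document
        weights[pmid] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return weights


# Toy, already stop-word-filtered documents (hypothetical content).
docs = {
    "pmid1": ["brca2", "pancreatic", "cancer", "brca2"],
    "pmid2": ["psen1", "alzheimer", "disease"],
    "pmid3": ["brca2", "breast", "cancer"],
}
w = tfidf_weights(docs)
```

A term that occurs in every document gets weight 0, since log(n/n) = 0, which is how idf discounts uninformative terms.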


Latent Semantic Indexing

Calculating co-occurrence of terms might not suffice because of possible “noise” in the dataset.

Use LSI, a statistical technique, to estimate a latent structure.

Assume some underlying semantic structure in the dataset which could be partially obscured.

Implementation:

Build a term-by-document matrix (tends to be sparse).

Convert matrix entries to weights, e.g. tfidf.

Analyze the matrix by singular value decomposition (SVD) to derive a latent semantic structure model.


SVD

A mathematical decomposition of a matrix into the product of three matrices: two with orthonormal columns and one with singular values on the diagonal. The singular values are uniquely determined.

Finds the optimal projection into a low-dimensional space.

A tool for dimension reduction.


SVD: Singular Value Decomposition

$$A = U E V^T$$

Where:

U has orthonormal, unit-length columns: $U^T U = I$

E is the diagonal matrix of positive real numbers (the singular values)

V has orthonormal, unit-length columns: $V^T V = I$


SVD

Approximate A by A_k, keeping only the first k singular values and the corresponding columns from the U and V matrices.

The new matrix A_k does not exactly match the original term-by-document matrix A. (It gets closer and closer as more singular values are kept.)

This is what we want: we don’t want a perfect fit, since we think some of the 0’s in A should not be 0 and vice versa.

Limitations of SVD: very memory intensive, cannot handle large datasets.
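The truncation can be written out explicitly. The notation below is an assumption added for illustration: the singular values on the diagonal of E are written as sigma_i in decreasing order, u_i and v_i are the corresponding columns of U and V, and r is the number of singular values.

```latex
A = U E V^T = \sum_{i=1}^{r} \sigma_i \, u_i v_i^T,
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

A_k = U_k E_k V_k^T = \sum_{i=1}^{k} \sigma_i \, u_i v_i^T,
\qquad k \le r

\lVert A - A_k \rVert_F^2 = \sum_{i=k+1}^{r} \sigma_i^2
```

The last line shows why A_k gets closer to A as more singular values are kept: each additional term removes one sigma_i^2 from the squared error.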


Scoring Matrix Generation

A scoring matrix is generated for each term weighting method using the data stored in the database.

This matrix is used to find the relationships between genes and diseases.

This is a relatively fast process since the weights are pre-computed and stored in a database.


Finding relationships

Document-term matrix:

      T1  T2  T3  …  Tn
D1     1   1
D2     1   1
…      1   0
Dn     1   0

Term-term matrix:

      T1  T2  T3  …  Tn
T1         2
T2
…
Tn

Use the doc-term matrix to establish relationships between genes and diseases.
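One way to go from a document-term matrix to a term-term matrix is to count, for each pair of terms, the documents in which both occur. The sketch below assumes binary entries (1 if the term occurs in the document); the matrix values and the function name are made up for illustration and are not the project's data.

```python
def cooccurrence(doc_term):
    """Count, for every pair of distinct terms, the documents containing both."""
    size = len(doc_term[0])
    counts = [[0] * size for _ in range(size)]
    for row in doc_term:
        # Indices of the terms present in this document.
        present = [i for i, flag in enumerate(row) if flag]
        for i in present:
            for j in present:
                if i != j:
                    counts[i][j] += 1
    return counts


# Rows are documents, columns are terms T1, T2, T3 (hypothetical values).
doc_term = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 1],
    [1, 0, 0],
]
term_term = cooccurrence(doc_term)
```

With binary entries this equals the off-diagonal part of the matrix product of the transposed document-term matrix with itself. If one term is a gene symbol and the other a disease name, the count is the number of abstracts that mention both.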


Results


Verification of the relationship

Data from GeneCards and HUGO has been stored in a database.

For each gene, if the symbol is an official genesymbol (according to HUGO), then search for the genesymbol in GeneCards and display the disease associated with it.

Else (if the symbol is an alias), use HUGO to find the official genesymbol, search GeneCards using this genesymbol, and display the disease associated with the gene.
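The lookup described above can be sketched with two in-memory dictionaries standing in for the HUGO and GeneCards database tables. The table contents and function names here are illustrative assumptions, not rows from the project's database.

```python
# Stand-in for the HUGO table: official symbol -> aliases (illustrative rows).
HUGO = {
    "BRCA2": ["FAD1"],
    "PSEN1": ["S182", "PS1"],
}

# Stand-in for the GeneCards table: official symbol -> diseases (illustrative rows).
GENECARDS = {
    "BRCA2": ["breast cancer", "pancreatic cancer"],
    "PSEN1": ["Alzheimer disease"],
}


def official_symbol(symbol, hugo):
    """Return the official symbol for an official symbol or an alias, else None."""
    if symbol in hugo:
        return symbol
    for official, aliases in hugo.items():
        if symbol in aliases:
            return official
    return None


def verify(symbol, hugo, genecards):
    """Resolve the symbol through HUGO, then look up its diseases in GeneCards."""
    official = official_symbol(symbol, hugo)
    if official is None:
        return None, []
    return official, genecards.get(official, [])


by_official = verify("BRCA2", HUGO, GENECARDS)
by_alias = verify("PS1", HUGO, GENECARDS)
unknown = verify("XYZ", HUGO, GENECARDS)
```

In the real system both lookups would be SQL queries against the stored tables; the control flow is the same.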


Verification results


Using gene alias

Make use of gene aliases from HUGO to increase the chances of detecting the correct genes for a given disease.

Method:

Increment the weight of an official gene by adding the weight of its alias.

Group the alias together with the official gene.


Results for Pancreatic Cancer

Top five genes – without considering alias

Top five genes – considering alias


Using gene alias - problems

Problem: HUGO might list the same alias under multiple official gene symbols. Such an alias could increase the weight of a gene that is not related to the disease.

Example (HGNC, genesymbol, alias):

3585 FANCD2 FAD, FA-D2

1101 BRCA2 FAD, FAD1

9508 PSEN1 FAD, S182, PS1
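The alias grouping and the ambiguity problem can be sketched together, using the three HUGO rows from the example. The weights are made-up numbers. Setting ambiguous aliases aside instead of adding them to any gene is one possible policy assumed here, not necessarily what the system does.

```python
from collections import defaultdict

# The HUGO rows from the example: official symbol -> aliases.
HUGO = {
    "FANCD2": ["FAD", "FA-D2"],
    "BRCA2": ["FAD", "FAD1"],
    "PSEN1": ["FAD", "S182", "PS1"],
}


def alias_index(hugo):
    """Map each alias to the set of official symbols that list it."""
    index = defaultdict(set)
    for official, aliases in hugo.items():
        for alias in aliases:
            index[alias].add(official)
    return index


def merge_alias_weights(weights, hugo):
    """Add each alias weight to its official gene; set ambiguous aliases aside."""
    index = alias_index(hugo)
    merged = defaultdict(float)
    ambiguous = {}
    for symbol, weight in weights.items():
        if symbol in hugo:
            merged[symbol] += weight
        elif len(index.get(symbol, ())) == 1:
            (official,) = index[symbol]  # the alias has exactly one owner
            merged[official] += weight
        elif symbol in index:
            ambiguous[symbol] = sorted(index[symbol])
    return dict(merged), ambiguous


# Hypothetical term weights for one disease.
weights = {"BRCA2": 3.0, "FAD1": 1.5, "FAD": 2.0, "PS1": 0.5}
merged, ambiguous = merge_alias_weights(weights, HUGO)
```

Here FAD1 is folded into BRCA2 and PS1 into PSEN1, while FAD is reported as ambiguous because three official symbols list it.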


Problem using alias


Verification

In addition, the number of PubMed articles containing both a disease and a gene symbol can be an indication of how strong the association between the disease and the gene is.

The same reasoning applies to gene-gene relationships.


Gene-Gene Relationships

In addition, we can use the doc-term matrix to find gene(s) that are related to any given gene.

Document-gene matrix:

      g1  g2  g3  …  gn
D1     1   1   1
D2     1   1   1
…      1   0   1
Dn     1   0   0

Gene-gene matrix:

      g1  g2  g3  …  gn
g1
g2             2
…
gn

Using the matrices above, we see that g2 is related to g3 and the weight is 2.


Discovering additional gene-gene relationships

We can make use of the possibility that two genes might be related to each other via a disease, as in:

gene1 -> disease1 -> gene2

gene1 -> disease2 -> gene2

to establish relationships between gene1 and gene2.

In our case, the fact that gene1 and gene2 are related to each other via two different diseases makes the relationship between them even stronger.
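A minimal sketch of this idea, assuming the gene-disease pairs are already available as a list. The function name and the sample pairs are hypothetical. The number of shared diseases serves as a simple strength indicator.

```python
from collections import defaultdict
from itertools import combinations


def gene_links_via_diseases(gene_disease_pairs):
    """Map each gene pair to the sorted list of diseases both genes are linked to."""
    genes_by_disease = defaultdict(set)
    for gene, disease in gene_disease_pairs:
        genes_by_disease[disease].add(gene)
    links = defaultdict(set)
    for disease, genes in genes_by_disease.items():
        # Every pair of genes sharing this disease gets one more link.
        for g1, g2 in combinations(sorted(genes), 2):
            links[(g1, g2)].add(disease)
    return {pair: sorted(diseases) for pair, diseases in links.items()}


# Hypothetical gene-disease pairs mirroring the gene1/gene2 example.
pairs = [
    ("gene1", "disease1"),
    ("gene2", "disease1"),
    ("gene1", "disease2"),
    ("gene2", "disease2"),
    ("gene3", "disease2"),
]
links = gene_links_via_diseases(pairs)
```

gene1 and gene2 share two diseases, so their link counts as stronger than the links to gene3, which rest on a single shared disease.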


Architecture


System Demonstration

http://biokdd.informatics.indiana.edu/radhikar/search.html

Related URLs:

GeneCards: http://bioinfo.weizmann.ac.il/cards/index.shtml

HGNC: http://www.gene.ucl.ac.uk/nomenclature/


Summary

Using the combination of statistical methods and a database, the process of establishing gene-disease relationships from literature data is fast and efficient, because the term weights are pre-computed and stored.

With minimal changes, our system can be extended to discover other relationships, such as protein-protein interactions.


Future Work

Extend our system to incorporate the entire Medline dataset.

Incorporate full gene names.

Find a better way to verify the gene-gene relationships.

Incorporate other on-line scientific literature databases.


Acknowledgments

Professor Javed Mostafa

Professor Sun Kim

Professor Memo Dalkilic

Professor Haixu Tang
