Discovering Gene-Disease Association using On-line Scientific Text Abstracts.

Raj Adhikari
Advisor: Javed Mostafa

April 20, 2023 Bioinformatics capstone project 2

Motivation

A central problem in bioinformatics is how to capture information from the vast scientific literature and create an automated system for “knowledge discovery” that can be used in various areas.

I address the special case of gene-disease interactions and show that the frequency and relevance of words in PubMed abstracts can be used to find genes related to a disease.

Goal

Use a combination of statistical methods and a database to:

retrieve research abstracts from PubMed;

extract relevant information from the free text using statistical methods;

measure the accuracy of the results and display the results using a Web-based system.

Complement and support existing knowledge base systems like GeneCards.

Resources used in creating the database

PubMed
The US National Library of Medicine's database, which contains more than 11 million references to journal articles in the health sciences.
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi

GeneCards
A database of human genes, their products and their involvement in diseases.
http://bioinfo.weizmann.ac.il/cards/index.shtml

HGNC
HUGO Gene Nomenclature Committee (has approved over 19000 human gene symbols); consistent with OMIM and LocusLink.
http://www.gene.ucl.ac.uk/nomenclature

Tools used: Perl, CGI, Java, MySQL

Creating the database

Data I used:

A relatively small list of genes and diseases in humans.

An article set (around 8000 articles). For each PubMed article: PMID, article title, and abstract (filtered with a list of stop words).

The HUGO dataset.

A list of around 3500 related gene-disease pairs from GeneCards.
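The stop-word filtering applied to each abstract could look roughly like the Python sketch below. The stop-word list, the tokenizing pattern, and the function name `filter_stop_words` are assumptions made for illustration; the slides do not say which list or tokenizer the project used.

```python
import re

# Hypothetical stop-word list; the list used in the project is not given.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "with"}


def filter_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into tokens, and drop stop words."""
    # Tokens are runs of letters, digits, and hyphens (an assumed rule),
    # so hyphenated symbols such as "FA-D2" stay in one piece.
    tokens = re.findall(r"[a-z0-9][a-z0-9\-]*", text.lower())
    return [token for token in tokens if token not in stop_words]
```

Lowercasing keeps the matching simple but also lowercases gene symbols, so any later lookup against HUGO symbols would need to be case-insensitive.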

Populating the database tables

Use the book Genes and Disease at OMIM to generate a list of around 60 diseases and 90 genes.

Search PubMed for each gene-disease pair on the Title/Abstract field.

Use ESearch (a tool that provides access to the PubMed database outside of the web interface) to retrieve data in XML file format.

Use the XML::Simple Perl package to parse the XML file.

Filter the text using stop words and store each title and abstract along with the related PMID in a database table.

Add more genes using HUGO.

OMIM: database of genetic diseases with references to molecular medicine, cell biology, biochemistry and clinical details of the diseases.
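As an illustration of the search step, here is a minimal Python sketch (the project itself used Perl and XML::Simple) that builds an ESearch query for one gene-disease pair and extracts PMIDs from the XML reply. The function names, the `retmax` default, and the sample XML are assumptions. ESearch returns matching PMIDs only, so fetching titles and abstracts would need a further request (for example EFetch), which is not shown.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Base URL of the NCBI E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(gene, disease, retmax=100):
    """Build an ESearch URL restricted to the Title/Abstract field."""
    term = f"{gene}[Title/Abstract] AND {disease}[Title/Abstract]"
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


def parse_pmids(xml_text):
    """Return the PMIDs listed in an ESearch XML response."""
    root = ET.fromstring(xml_text)
    return [id_node.text for id_node in root.findall("./IdList/Id")]


# Hypothetical response body, used only to show the parsing step.
SAMPLE_XML = (
    "<eSearchResult><Count>2</Count>"
    "<IdList><Id>111</Id><Id>222</Id></IdList></eSearchResult>"
)
print(build_esearch_url("BRCA2", "pancreatic cancer"))
print(parse_pmids(SAMPLE_XML))
```

The sketch only builds the URL and parses a canned reply; a real run would send the request and should respect NCBI's usage limits.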

Populating the database tables

Table structures:

Parse the retrieved text files and create the following tables.

HUGO table structure:
HGNC | genesymbol | alias

GeneCards table structure:
Genesymbol | disease

Derivative table structure:
Term | PMID | Tfreq | Dfreq | Tfidf | LSI
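The three tables could be declared roughly as in the Python sketch below. It uses an in-memory SQLite database in place of the project's MySQL server so that it runs on its own. The column names follow the slide; the table names, the column types, and the reading of the HGNC column as a numeric identifier are assumptions.

```python
import sqlite3


def create_tables(conn):
    """Create the HUGO, GeneCards, and derivative tables (assumed types)."""
    conn.executescript(
        """
        CREATE TABLE hugo (hgnc INTEGER, genesymbol TEXT, alias TEXT);
        CREATE TABLE genecards (genesymbol TEXT, disease TEXT);
        CREATE TABLE derivative (
            term TEXT,      -- term kept after stop-word filtering
            pmid INTEGER,   -- PubMed ID of the article
            tfreq INTEGER,  -- term frequency within the article
            dfreq INTEGER,  -- number of articles containing the term
            tfidf REAL,     -- precomputed tf-idf weight
            lsi REAL        -- precomputed LSI weight
        );
        """
    )


conn = sqlite3.connect(":memory:")
create_tables(conn)
```

Keeping both weights in the derivative table is what allows the later scoring step to read precomputed values instead of recomputing them.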

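Generating term weights

Basic idea: compare co-occurrence of terms in a document and across a set of documents by generating term weights.

Within a document: term frequency (tf) measures term density within a document.

Across the document set: inverse document frequency (idf) measures the “informativeness” of a term across a dataset.

Thus, for term i, with n the number of documents in the set and df_i the number of documents that contain the term:

$$\mathrm{tfidf}_i = \mathrm{tf}_i \cdot \log\frac{n}{\mathrm{df}_i}$$

$$\mathrm{idf}_i = \log\frac{n}{\mathrm{df}_i}$$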
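A minimal Python sketch of this weighting, assuming raw counts for tf and the natural logarithm (the slides fix neither choice). The function name `tfidf_weights` and the input layout are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Compute tf * log(n / df) for every (term, document) pair.

    docs maps a document id (e.g. a PMID) to its list of filtered tokens.
    """
    n = len(docs)
    # df[term] = number of documents that contain the term at least once.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)  # raw term counts within this document
        for term, freq in tf.items():
            weights[(term, doc_id)] = freq * math.log(n / df[term])
    return weights
```

A term that appears in every document gets weight 0, which is the sense in which idf measures how informative a term is.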

Latent Semantic Indexing

Calculating co-occurrence of terms might not suffice because of possible “noise” in the dataset.

Use LSI, a statistical technique, to estimate a latent structure.

Assume some underlying semantic structure in the dataset which could be partially obscured.

Implementation:

Build a term-by-document matrix (tends to be sparse).

Convert matrix entries to weights, e.g. tfidf.

Analyze the matrix by singular value decomposition (SVD) to derive a latent semantic structure model.

SVD

Mathematical decomposition of a matrix into the product of three matrices (the singular values are uniquely determined):

two with orthonormal columns;

one with singular values on the diagonal.

Finds the optimal projection into a low-dimensional space.

Tool for dimension reduction.

SVD: Singular Value Decomposition

$$A = U E V^{T}$$

where:

U has orthonormal, unit-length columns: $U^{T} U = I$

E is the diagonal matrix of singular values (non-negative real numbers)

V has orthonormal, unit-length columns: $V^{T} V = I$

SVD

Approximate A by a matrix Ak that keeps only the first k singular values and the corresponding columns of the U and V matrices.

The new matrix Ak does not exactly match the original term-by-document matrix A. (It gets closer and closer as more singular values are kept.)

This is what we want: we don’t want a perfect fit, since we think some of the 0’s in A should not be 0 and vice versa.

Limitation of SVD: it is very memory intensive and cannot handle large datasets.
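The truncation can be written out as below. The subscripted names $U_k$, $E_k$, $V_k$ (the first k columns of U and V and the leading k-by-k block of E) and $\sigma_i$ (the i-th singular value, in decreasing order) are notation introduced here for illustration. The error identity is the standard Frobenius-norm result for truncated SVD, not something stated on the slides.

```latex
A_k = U_k E_k V_k^{T},
\qquad
\lVert A - A_k \rVert_F^2 = \sum_{i > k} \sigma_i^2
```

The right-hand sum shrinks as k grows, which is why Ak gets closer to A as more singular values are kept.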

Scoring Matrix Generation

A scoring matrix is generated for each term weighting method using the data stored in the database.

This matrix is used to find the relationships between genes and diseases.

This is a relatively fast process, since the weights are pre-computed and stored in a database.

Finding relationships

Document-by-term matrix:

      T1  T2  T3  …  Tn
D1     1   1
D2     1   1
…      1   0
Dn     1   0

Term-by-term matrix:

      T1  T2  T3  …  Tn
T1         2
T2
Tn

Use the doc-term matrix to establish relationships between genes and diseases.
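A small Python sketch of how a term-by-term matrix could be derived from a binary doc-term matrix by counting the documents in which two terms appear together. The function name is hypothetical, and the project's scoring matrix may use tf-idf or LSI weights rather than 0/1 flags. The input below reproduces the T1 and T2 columns of the slide's example.

```python
from itertools import combinations


def cooccurrence(doc_term):
    """Count, for each pair of terms, the documents containing both.

    doc_term is a list of rows (one per document), each a list of 0/1
    flags (one per term). The diagonal is left at 0.
    """
    n_terms = len(doc_term[0])
    matrix = [[0] * n_terms for _ in range(n_terms)]
    for row in doc_term:
        present = [j for j, flag in enumerate(row) if flag]
        for i, j in combinations(present, 2):
            matrix[i][j] += 1
            matrix[j][i] += 1
    return matrix


# Rows D1, D2, ..., Dn restricted to columns T1 and T2.
doc_term = [[1, 1], [1, 1], [1, 0], [1, 0]]
print(cooccurrence(doc_term))  # [[0, 2], [2, 0]]
```

T1 and T2 co-occur in two documents, which matches the 2 shown in the term-by-term matrix.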

Results

Verification of the relationship

Data from GeneCards and HUGO has been stored in a database.

For each gene, if the symbol is an official genesymbol (according to HUGO), then search for the genesymbol in GeneCards and display the disease associated with it.

Else (if the symbol is an alias), use HUGO to find the official genesymbol, search GeneCards using this genesymbol, and display the disease associated with the gene.
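The lookup described above might be sketched in Python as follows. In-memory dictionaries stand in for the HUGO and GeneCards database tables; the function names and the placeholder data (GENE1, G1, "disease X") are hypothetical.

```python
def resolve_symbol(symbol, official_symbols, alias_to_official):
    """Return the official symbols a term may refer to (empty if unknown)."""
    if symbol in official_symbols:
        return [symbol]
    return sorted(alias_to_official.get(symbol, []))


def known_diseases(symbol, official_symbols, alias_to_official, gene_to_diseases):
    """Return the diseases listed for the symbol's official gene(s)."""
    diseases = set()
    for official in resolve_symbol(symbol, official_symbols, alias_to_official):
        diseases.update(gene_to_diseases.get(official, []))
    return sorted(diseases)


# Placeholder data standing in for the HUGO and GeneCards tables.
OFFICIAL = {"GENE1"}
ALIASES = {"G1": {"GENE1"}}
GENE_TO_DISEASES = {"GENE1": ["disease X"]}

print(known_diseases("G1", OFFICIAL, ALIASES, GENE_TO_DISEASES))
```

Returning a list of official symbols rather than a single one leaves room for aliases that HUGO lists under more than one gene.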

Verification results

Using gene alias

Make use of gene aliases from HUGO to increase the chances of detecting the correct genes for a given disease.

Method:

Increment the weight of an official gene by adding the weight of the alias.

Group the alias together with the official gene.
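A minimal Python sketch of the grouping step, assuming each alias maps to exactly one official symbol. The function name, the symbols, and the weights are placeholders rather than project data.

```python
from collections import defaultdict


def group_alias_weights(term_weights, alias_to_official):
    """Add each alias's weight to the weight of its official gene symbol."""
    grouped = defaultdict(float)
    for term, weight in term_weights.items():
        # Terms that are not aliases are treated as official symbols.
        official = alias_to_official.get(term, term)
        grouped[official] += weight
    return dict(grouped)


# Placeholder weights for one disease.
weights = {"GENE1": 1.5, "G1": 0.5, "GENE2": 1.0}
print(group_alias_weights(weights, {"G1": "GENE1"}))
```

With these placeholder numbers, GENE1 ends up with the combined weight 2.0 and GENE2 keeps 1.0.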

Results for Pancreatic Cancer

Top five genes – without considering alias

Top five genes – considering alias

Using gene alias - problems

Problem: HUGO might list the same alias under multiple official gene symbols. Such an alias could actually increase the weight of a gene that is not related to the disease.

Example (HGNC ID, official symbol, aliases):

3585   FANCD2   FAD, FA-D2
1101   BRCA2    FAD, FAD1
9508   PSEN1    FAD, S182, PS1
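One way to spot such aliases before weights are merged is sketched below in Python, using the three rows from the example. The function name and the row layout are assumptions; what to do with a flagged alias (skip it, split its weight, and so on) is left open.

```python
from collections import defaultdict

# Rows from the example: (HGNC ID, official symbol, aliases).
HUGO_ROWS = [
    (3585, "FANCD2", ["FAD", "FA-D2"]),
    (1101, "BRCA2", ["FAD", "FAD1"]),
    (9508, "PSEN1", ["FAD", "S182", "PS1"]),
]


def ambiguous_aliases(rows):
    """Return the aliases that belong to more than one official symbol."""
    alias_to_symbols = defaultdict(set)
    for _hgnc_id, symbol, aliases in rows:
        for alias in aliases:
            alias_to_symbols[alias].add(symbol)
    return {
        alias: sorted(symbols)
        for alias, symbols in alias_to_symbols.items()
        if len(symbols) > 1
    }


print(ambiguous_aliases(HUGO_ROWS))
# {'FAD': ['BRCA2', 'FANCD2', 'PSEN1']}
```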

Problem using alias

Verification

In addition, the number of PubMed articles containing a disease and a gene symbol can be an indication of how strong the association between a disease and a gene is.

The same theory applies to a gene-gene relationship.

Gene-Gene Relationships

In addition, we can use the doc-term matrix to find gene(s) that are related to any given gene.

Document-by-gene matrix:

      g1  g2  g3  …  gn
D1     1   1   1
D2     1   1   1
…      1   0   1
Dn     1   0   0

Gene-by-gene matrix:

      g1  g2  g3  …  gn
g1
g2             2
gn

Using the matrices above, we see that g2 is related to g3 and the weight is 2.

Discovering additional gene-gene relationships

We can make use of the possibility that two genes might be related to each other via a disease, as in:

gene1 -> disease1 -> gene2
gene1 -> disease2 -> gene2

to establish relationships between gene1 and gene2.

In our case, the fact that gene1 and gene2 are related to each other via two different diseases makes the relationship between them even stronger.
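The idea can be sketched in Python by counting the diseases two genes have in common. Using that count as the strength of the indirect relationship is an assumption for illustration, as are the function name and the placeholder gene and disease names.

```python
def shared_diseases(gene_a, gene_b, gene_to_diseases):
    """Return the diseases associated with both genes, sorted by name."""
    diseases_a = set(gene_to_diseases.get(gene_a, []))
    diseases_b = set(gene_to_diseases.get(gene_b, []))
    return sorted(diseases_a & diseases_b)


# Placeholder associations mirroring the gene1/gene2 example above.
GENE_TO_DISEASES = {
    "gene1": ["disease1", "disease2"],
    "gene2": ["disease1", "disease2", "disease3"],
}

links = shared_diseases("gene1", "gene2", GENE_TO_DISEASES)
print(links, len(links))  # ['disease1', 'disease2'] 2
```

Two shared diseases give a count of 2, reflecting the stronger relationship described for gene1 and gene2.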

Architecture

System Demonstration

http://biokdd.informatics.indiana.edu/radhikar/search.html

Related URLs:

GeneCards: http://bioinfo.weizmann.ac.il/cards/index.shtml

HGNC: http://www.gene.ucl.ac.uk/nomenclature/

Summary

Using a combination of statistical methods and a database, the process of establishing gene-disease relationships from literature data is fast and efficient.

With minimal changes, our system can be extended to discover other relationships, such as protein-protein interactions.

Future Work

Extend our system to incorporate the entire Medline dataset.

Incorporate full gene names.

Find a better way to verify the gene-gene relationships.

Incorporate other on-line scientific literature databases.

Acknowledgments

Professor Javed Mostafa
Professor Sun Kim
Professor Memo Dalkilic
Professor Haixu Tang