TRANSCRIPT
From Bioinformatics to Health Information Technology
©Edited by Mingrui Zhang, CS Department, Winona State University, 2009
Outline
• What can we contribute to cancer research and treatment from Computer Science or Mathematics?
• How do we adapt our expertise for them?
– Introduction to lung cancer problems
– Brief review on microarray technology
– An existing computer algorithm, FCM
– Adaptation of FCM for biological problems
– Case and control study
– Software integration
Genomic Factors Associated with Lung Cancer
Lung Cancer
• Lung cancer is the leading cause of death among cancer victims in the United States.
– It claims more lives than colon, prostate, and breast cancer combined.
– Smoking is the most significant risk factor for lung cancer, but only 10% of ever-smokers develop lung cancer.
Observed and projected lung cancer death rates, United States, 1930–2003
The observed death rates are based on data published by the National Center for Health Statistics, Centers for Disease Control. The dotted lines represent straight line projections of the observed slope from 1950–1975 in men and from 1975–1990 in women. (http://tobaccocontrol.bmj.com)
National Expenditures for Medical Treatment for the Most Common Cancers
Based on Cancer Prevalence in 1998 and Cancer-Specific Costs for 1997-1999, projected to 2004 using the medical care component of the Consumer Price Index. (http://progressreport.cancer.gov)
Low Survival Rate
Type       5-Year survival for all stages   Early Detection   Late Detection
Lung       14.9%                            48.7%             21%
Breast     86.6%                            97.0%             23.2%
Prostate   97.5%                            100%              34.0%
Colon      62.3%                            90.1%             9.2%
Lung Cancer
• Interesting questions:
1. Are there factors other than smoking attributed to lung cancers?
2. How does second-hand smoking contribute to cancer?
• Goals:
– To predict the effectiveness of lung cancer treatments
Approach
• Case control studies
– Cases are a group of lung cancer patients
– Controls are a group of normal people
• Identify causal factors for lung cancer other than smoking
– Environmental factors
– Genetic factors: metabolic pathway genes
– Interaction between environmental factors and the metabolic pathway genes
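One common way to quantify the association between a candidate factor and disease status in a case control design is the odds ratio. The sketch below is only an illustration of that idea and is not taken from the slides: the function name, the Woolf log-based confidence interval, and the counts are all assumptions made for the example.

```python
import math


def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds ratio with an approximate 95% CI (Woolf's log method) for a 2x2 table."""
    a, b, c, d = exposed_cases, unexposed_cases, exposed_controls, unexposed_controls
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    or_value = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(or_value) - 1.96 * se_log)
    high = math.exp(math.log(or_value) + 1.96 * se_log)
    return or_value, low, high


# Made-up counts, for illustration only (not data from the study).
or_value, low, high = odds_ratio(40, 60, 20, 80)
print(f"OR = {or_value:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

An interval that excludes 1 would suggest an association worth following up; it does not by itself establish a causal factor.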
Biomedical Informatics Research
[Diagram labels: Data Collection; Clinical Information; Genetic Information (SNPs from Blood, Microarray from Lung Tissue); Modeling; Data Mining Software Tool]
Microarray Technology: Genes Attributed to Cancer
Microarray Experiment
• To understand the roles certain genes play in the progression of cancer, cancer tissue is taken and used in a microarray experiment.
Gene Expression
• There are over 10,000 different probes used.
• Each dot represents the location of a gene probe.
Probe GSM10966 GSM10966 GSM10966 GSM10966 GSM10966 GSM10966 GSM10966 GSM10966
Hs.544577 39513.15 37409.76 29715.27 20536.73 12636.55 30290.4 5380.344 8963.756
Hs.561260 9229.464 21894.5 16412.41 12636.55 17446.9 13930.68 5254.592 24600.7
Hs.567356 19056.91 32565.55 26372.36 19455.14 23552.87 22335.71 17281.12 21230.21
Hs.436873 35737.01 35313.64 38926 21348.13 23681.18 42374.79 28674.05 12128.16
Hs.517792 9468.588 14035.96 5919.428 5919.428 14174.17 11231.28 17788.3 24389.6
Hs.487027 29305.41 36906.58 21473.12 22962.5 41911.48 34204.18 28017.95 33684.04
Hs.1422 15283.12 25165 23287.42 15057.2 23223.62 15438.6 19509.78 8151.832
Hs.202453 15641.64 20993.95 18437.06 18131.26 20888.96 20207.75 18781.45 19350.3
Hs.592158 22163.46 37996.65 26291.84 30290.4 36422.28 30758.42 24600.7 23980.17
Hs.525622 15737.5 24801.41 17638.29 19056.91 23429.47 20444.12 20089.34 19633.77
Hs.226755 28674.05 33514.62 19799.32 24314.99 28297.22 41428.3 28781.67 20327.08
Hs.250687 13646.48 28297.22 19860.2 16085.3 16269.2 14468.99 34204.18 12725.62
Hs.425633 8644.256 5516.18 11838.31 6331.144 2745.636 4145.204 10042.56 5919.428
Hs.386168 9999.856 30404.73 16830.96 25076.12 23820.99 12172.11 9736.256 6141.688
Hs.584908 24600.7 27737.02 24923.88 24243.52 16573.38 27044.52 7832.452 10143.37
Hs.133379 31731.44 19350.3 12991.75 19118 31893.15 29930.01 15781.63 31497.45
Hs.150423 5061.944 12172.11 9780.32 10709.47 8246.392 5868.392 1589.784 3577.928
Hs.175473 33684.04 33356.45 24243.52 28569.07 43439.98 37409.76 22699.38 21473.12
Hs.83169 39184.98 31232.35 33026.59 6282.64 3898.764 31132.38 30883.74 10624.71
Hs.532325 2344.252 1622.224 4019.816 2998.332 997.068 2784.528 12269.02 1310.164
Hs.518267 13831.62 15188.12 8554.644 15587.79 18781.45 17593.42 7651.908 12588.82
Fuzzy Clustering
• The algorithm assigns a gene to a given number of clusters
• Each gene may belong to more than one cluster with different degrees of membership
Fuzzy Clustering
• The method produces a set of cluster centroids and a membership table
A. Gasch and M. Eisen, "Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering," Genome Biology, vol. 3, pp. 1-22, 2002.
Fuzzy C-Means Clustering
• A set of N samples with their features as $X = \{x_1, x_2, \ldots, x_N\}^T$
• $x_i = [x_{i1}, x_{i2}, \ldots, x_{ip}]$ is sample i with its p features
• A cluster $c_j = [c_{j1}, c_{j2}, \ldots, c_{jp}]$
• The fuzzy membership $u_{ij}$ of sample i to a cluster $c_j$
Fuzzy C-Means (FCM) Clustering
Randomly initialize membership matrix $u_{ij}$
Repeat until $\| u^{(t)} - u^{(t-1)} \| < \varepsilon$ for t = 1, 2, ...
Compute cluster centroids
$$c_j^{(t)} = \frac{\sum_{i=1}^{N} \left( u_{ij}^{(t-1)} \right)^m x_i}{\sum_{i=1}^{N} \left( u_{ij}^{(t-1)} \right)^m}; \quad j = 1, 2, \ldots, C.$$
Find sets $I_i = \{ j \mid 1 \le j \le C;\ d_{ij}^2 = 0 \}$ and $\tilde{I}_i = \{1, 2, \ldots, C\} - I_i$
Compute membership as
$$u_{ij}^{(t)} = \begin{cases} \dfrac{1}{\sum_{k=1}^{C} \left( d_{ij}^{(t)} / d_{ik}^{(t)} \right)^{2/(m-1)}}, & I_i = \emptyset \\[2ex] 0 \ \text{for all } j \in \tilde{I}_i, \ \text{with } \sum_{j \in I_i} u_{ij}^{(t)} = 1, & I_i \neq \emptyset \end{cases}$$
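The FCM loop above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: it assumes squared Euclidean distance, uses the largest absolute change in any membership as the stopping norm (the slide does not say which norm is meant), and splits membership equally when a sample coincides with one or more centroids. The function names, default parameters, and toy data are hypothetical.

```python
import random


def sq_euclidean(x, c):
    """Squared Euclidean distance between a sample and a centroid."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def fcm(samples, n_clusters, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Fuzzy C-means; returns (centroids, membership matrix u)."""
    if m <= 1.0:
        raise ValueError("fuzziness m must be greater than 1")
    rng = random.Random(seed)
    n, p = len(samples), len(samples[0])

    # Randomly initialize u so that every row sums to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])

    centroids = []
    for _ in range(max_iter):
        # Centroids: weighted means of the samples, with weights u_ij ** m.
        centroids = []
        for j in range(n_clusters):
            w = [u[i][j] ** m for i in range(n)]
            w_sum = sum(w)
            centroids.append(
                [sum(w[i] * samples[i][k] for i in range(n)) / w_sum for k in range(p)]
            )

        # Memberships from the distances to the new centroids.
        new_u = []
        for x in samples:
            d2 = [sq_euclidean(x, c) for c in centroids]
            zero = [j for j, v in enumerate(d2) if v == 0.0]
            if zero:
                # Sample sits on a centroid: share membership among those clusters.
                row = [1.0 / len(zero) if j in zero else 0.0 for j in range(n_clusters)]
            else:
                # (d_ij / d_ik) ** (2 / (m - 1)) == (d2_ij / d2_ik) ** (1 / (m - 1))
                row = [
                    1.0 / sum((d2[j] / d2[k]) ** (1.0 / (m - 1.0)) for k in range(n_clusters))
                    for j in range(n_clusters)
                ]
            new_u.append(row)

        # Stop when no membership changed by more than eps (max norm, an assumption).
        change = max(abs(a - b) for ra, rb in zip(new_u, u) for a, b in zip(ra, rb))
        u = new_u
        if change < eps:
            break
    return centroids, u


# Toy data: two well-separated groups of three points each.
samples = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0], [5.0, 5.1], [5.1, 5.0], [5.0, 5.0]]
centroids, u = fcm(samples, n_clusters=2)
labels = [max(range(2), key=lambda j: row[j]) for row in u]
print(labels)
```

Each row of `u` is one sample's membership across the clusters, which corresponds to the membership table the method produces.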
Adaptation of Fuzzy Clustering for Bioinformatics Problems
Kernels and Validity Indexes
• Different kernels/distance metrics
– Distance metrics: Euclidean distance based; Pearson correlation based
– Choice of fuzziness, m
• Different validity indexes
– Crisp: WCSS, FOM, etc.
– Fuzzy: Xie’s, partition coefficient, etc.
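As a rough illustration of the fuzzy validity indexes named above, the sketch below computes the partition coefficient and an index in the Xie-Beni form. It assumes that "Xie's" refers to the Xie-Beni index and that distances are squared Euclidean; the function names and the toy memberships are hypothetical.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def partition_coefficient(u):
    """Mean of squared memberships; 1/C for a uniform partition, 1 for a crisp one."""
    return sum(v * v for row in u for v in row) / len(u)


def xie_beni(samples, centroids, u, m=2.0):
    """Compactness divided by separation; smaller values suggest a better partition."""
    n, c = len(samples), len(centroids)
    compactness = sum(
        u[i][j] ** m * sq_dist(samples[i], centroids[j])
        for i in range(n)
        for j in range(c)
    )
    separation = min(
        sq_dist(centroids[j], centroids[k]) for j in range(c) for k in range(j + 1, c)
    )
    return compactness / (n * separation)


# Toy partition of four points into two clusters (illustrative values only).
samples = [[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]]
centroids = [[0.0, 0.5], [4.0, 0.5]]
u = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]

pc = partition_coefficient(u)
xb = xie_beni(samples, centroids, u)
print(pc, xb)
```

In practice such indexes are compared across candidate numbers of clusters or values of m rather than read in isolation.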
Different Versions of Fuzzy Clustering
• Methods are categorized according to the objective function and the metrics used in the method
Objective Function J   Metrics       m             Data Sets
K-means [3]            Correlation   2             Yeast
J-means [1]            Euclidean     1.15 ~ 1.75   Cancer, Blood
C-means [2, 5, 6]      Euclidean     1.1 ~ 2.54    Serum, Sporulation, Yeast, Cancer, Cell line
Adapting the Kernel
Initialization:
1. Classify genes into biological processes based on Gene Ontology terms;
2. Use pre-classified genes to initialize $c_j^{(0)}$, and the membership $u_{ij}$;
3. Normalize membership $u_{ij}$ by $\sum_{j=1}^{C} u_{ij} = 1$ for each gene i.
Apply FCM with a squared Pearson correlation distance, $d_{ij}^2 = 1 - \rho_{X_i, C_j}$, where $\rho_{X_i, C_j}$ is the Pearson correlation between a gene $x_i$ and a cluster $c_j$.
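The initialization and distance described above might be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the squared distance is taken to be one minus the Pearson correlation, each gene's Gene Ontology classes are given as a list of cluster indices, unannotated genes receive a small uniform background membership, and initial centroids are the mean profiles of the pre-classified genes. All function names, the `background` parameter, and the toy profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a flat profile is treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def correlation_sq_distance(x, c):
    """Squared distance taken as 1 - rho (0 = perfectly correlated, 2 = anti-correlated)."""
    return 1.0 - pearson(x, c)


def init_membership(annotations, n_clusters, background=0.01):
    """Membership rows from pre-classified genes, normalized to sum to 1 per gene."""
    u = []
    for labels in annotations:
        row = [background] * n_clusters
        for j in labels:
            row[j] = 1.0
        total = sum(row)
        u.append([v / total for v in row])
    return u


def init_centroids(profiles, annotations, n_clusters):
    """Initial centroid of each cluster: mean profile of the genes annotated to it."""
    p = len(profiles[0])
    centroids = []
    for j in range(n_clusters):
        members = [g for g, labels in zip(profiles, annotations) if j in labels]
        if not members:
            raise ValueError(f"no pre-classified gene for cluster {j}")
        centroids.append([sum(g[k] for g in members) / len(members) for k in range(p)])
    return centroids


# Toy profiles; annotations[i] lists the (hypothetical) process clusters of gene i.
profiles = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.5], [3.0, 2.0, 1.0], [2.0, 2.1, 1.9]]
annotations = [[0], [0], [1], []]
u0 = init_membership(annotations, 2)
c0 = init_centroids(profiles, annotations, 2)
print(u0, c0, correlation_sq_distance(profiles[0], c0[1]))
```

With a correlation-based distance, two genes whose profiles rise and fall together are close even when their absolute expression levels differ, which a Euclidean distance would not capture.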
Fuzzy WCSS Index
Gene Expression
References
1. Zhang, M., et al. A Fuzzy C-Means Algorithm Using a Correlation Metrics and Gene Ontology. In The 19th International Conference on Pattern Recognition. 2008. Tampa, Florida, USA.
2. Zhang, M., W. Zhang, H. Sicotte and P. Yang, 2009, “Validating a Correlation-Based Fuzzy C-means Clustering Algorithm”, IEEE EMBC, submitted.
GEO Databases
• http://www.ncbi.nlm.nih.gov/geo/
Single Nucleotide Polymorphisms: Genomic Variations in Disease
Single Nucleotide Polymorphisms
• SNPs are single bases at a particular locus where individual people have differences in their sequences.
– SNPs are another form of genomic variation in a population
Population Based
• Each ethnic group has its own collection of SNPs.
• Human SNPs are classified by major or minor alleles.
– major alleles are common for all humans
– minor alleles are useful within an ethnic group
• You should know the average frequency of alleles of the population you are studying!!
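A small sketch of how allele frequencies at one SNP could be tallied from genotype calls. The genotype strings and function name are made up for illustration, and each genotype is assumed to be a two-letter string for a diploid sample.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Frequency of each allele at one SNP, counting both alleles of every sample."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}


# Hypothetical genotype calls for eight people at a single A/G SNP.
genotypes = ["AA", "AG", "AG", "GG", "AA", "AA", "AG", "AA"]
freqs = allele_frequencies(genotypes)

# The more frequent allele is the major allele, the less frequent one the minor allele.
major = max(freqs, key=freqs.get)
minor = min(freqs, key=freqs.get)
print(freqs, major, minor)
```

Because frequencies differ between populations, the same tally run on samples from another population may label a different allele as the minor one.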
HapMap
HapMap Project
• The international HapMap consortium has identified >1 million SNPs
– Samples from four populations
– 1 SNP every 2 kb of genomic sequence
Use SNPs as Markers
• SNPs are reliable markers
– Most genes contain at least one SNP
– Combinations of alleles are associated with particular diseases.
• Study of evolution
– Understand how a subpopulation adapted to the environment by comparing the differences in their SNPs
• DNA fingerprinting for criminal or parental verification.
• Genotype-specific medication
HapMap Data and Haploview
Haploview
The Goal is to Determine the Best Treatments or to Improve Patient’s Quality of Life
Prototype Software Architecture
[Architecture diagram labels:
– Data: EHR database (Mayo Clinic); EHR database (Clinic X); ...; Current Patient’s Data
– View: Input Form; Prediction output Presentation; CSS Styles
– Model: Prediction Model (R); Variable Definition (XML)
– Controller: Model Manager (Model Interface, JRI); View Manager (Web Form Generator, Presentation Generator)]
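To make the role of the XML variable definition and the web form generator more concrete, here is a minimal sketch of the idea in Python. It is not the prototype's code (which the slide indicates is built around R and JRI). The XML schema, variable names, form action, and function name are all invented for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical variable-definition document; the real schema is not shown in the slides.
VARIABLE_XML = """
<variables>
  <variable name="age" label="Age (years)" type="number"/>
  <variable name="smoking_status" label="Smoking status" type="choice">
    <option>never</option>
    <option>former</option>
    <option>current</option>
  </variable>
</variables>
"""


def build_form(xml_text, action="/predict"):
    """Generate an HTML input form from the model's variable definitions."""
    root = ET.fromstring(xml_text.strip())
    lines = [f'<form method="post" action="{escape(action)}">']
    for var in root.findall("variable"):
        name = escape(var.get("name", ""))
        label = escape(var.get("label", name))
        lines.append(f'  <label for="{name}">{label}</label>')
        if var.get("type") == "choice":
            # Choice variables become a drop-down with one entry per option.
            lines.append(f'  <select id="{name}" name="{name}">')
            for opt in var.findall("option"):
                lines.append(f"    <option>{escape(opt.text or '')}</option>")
            lines.append("  </select>")
        else:
            lines.append(f'  <input id="{name}" name="{name}" type="number">')
    lines.append('  <button type="submit">Predict</button>')
    lines.append("</form>")
    return "\n".join(lines)


form = build_form(VARIABLE_XML)
print(form)
```

Keeping the variable list in a separate definition file means a new prediction model can be exposed through the same view code by supplying a different definition.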
Prototype Web-based Tool
References
1. Zhang, M., Olson, S., Francioni, J., Gegg-Harrison, T., Meng, N., Sun, Z., and Yang, P., 2009. Integrating R Models with Web Technologies. HEALTHINF 2009, Porto, Portugal, January 2009.
2. Gegg-Harrison, T., Zhang, M., Meng, N., Sun, Z., and Yang, P., 2009, Porting a Cancer Treatment Prediction to a Mobile Device, IEEE EMBC, submitted.