
Biometrics 67, 353–362

June 2011
DOI: 10.1111/j.1541-0420.2010.01457.x

Identification of Differential Aberrations in Multiple-Sample Array CGH Studies

Huixia Judy Wang

Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695, U.S.A.
email: [email protected]

and

Jianhua Hu

Department of Biostatistics, University of Texas M. D. Anderson Cancer Center, Houston, Texas 77030, U.S.A.
email: [email protected]

Summary. Most existing methods for identifying aberrant regions with array CGH data are confined to a single target sample. Focusing on the comparison of multiple samples from two different groups, we develop a new penalized regression approach with a fused adaptive lasso penalty to accommodate the spatial dependence of the clones. The nonrandom aberrant genomic segments are determined by assessing the significance of the differences between neighboring clones and neighboring segments. The algorithm proposed in this article is a first attempt to simultaneously detect the common aberrant regions within each group, and the regions where the two groups differ in copy number changes. The simulation study suggests that the proposed procedure outperforms the commonly used single-sample aberration detection methods for segmentation in terms of both false positives and false negatives. To further assess the value of the proposed method, we analyze a data set from a study that identified the aberrant genomic regions associated with grade subgroups of breast cancer tumors.

Key words: Array CGH; Change point; Common aberration; Copy number aberration; Fused lasso; Median regression; Segmentation.

1. Introduction
Array comparative genomic hybridization (CGH) is a powerful technique for measuring genomic aberrations involving DNA copy number gains and losses (Pinkel et al., 1998; Snijders et al., 2001). Such copy number aberrations (CNAs) can lead to abnormal mRNA transcript levels and result in cellular malfunctions. The detection of CNA regions in a set of tumor samples can help biologists identify genes involved in the genesis and progression of cancer. In a typical CGH experiment, DNA strands from a test and a reference sample are hybridized on the same array consisting of thousands of clones with known genomic locations. After hybridization, the log ratio between the test and the reference intensities is computed on each clone. Because the reference sample is often assumed to have no CNAs, significantly positive log ratios indicate copy number gains, and significantly negative log ratios indicate copy number losses in the test sample.

Existing methods for identifying genomic regions with CNAs have focused on the analysis of a single target sample. One popular algorithm is the circular binary segmentation (CBS) method proposed by Olshen et al. (2004). The CBS method uses a permutation test to detect change points by recursively splitting each contiguous segment until no significant splits can be found. Other segmentation methods include hidden Markov modeling (Fridlyand et al., 2004; Guha, Li, and Neuberg, 2008), a clustering-based method (Wang et al., 2005), a Gaussian-likelihood-based approach (Hupe et al., 2004), and a Bayesian approach (Lai, Xing and Zhang, 2008). Taking into account the spatial dependence of CGH data, Eilers (2003), Huang et al. (2005), and Tibshirani and Wang (2007) employed the penalized least squares method by shrinking the distances between signals at adjacent clone locations. Using similar strategies, Eilers and Menezes (2005), and Li and Zhu (2007) considered smoothing the quantiles of the log ratio data. These approaches focused on visualization rather than change point detection. Willenbrock and Fridlyand (2005) and Lai et al. (2005) considered both sensitivity and specificity when comparing various approaches and found that the CBS method had superior performance for change point detection.

Little attention has been given to the comparison of samples from multiple groups. Due to population heterogeneity, individual samples may have subject-specific CNAs that would represent “passenger” alterations, that is, random somatic events without pathological relevance (Shah, 2008). Therefore, it is of biological interest to identify genomic regions that are common or recurrent among individuals with the same disease, as they are more likely to contain disease-critical genes. Adopting the definition in Rouveirol et al. (2006), we use “recurrent region” to refer to a sequence of adjacent clones that have common aberrations for samples from the same group. Focusing on one group with multiple samples, some authors proposed applying existing aberration-calling algorithms independently to the individual samples, and then combining the discretized profiles to identify recurrent aberrant regions; see Diskin et al. (2006), Rouveirol et al. (2006), Lipson et al. (2006), BenDor et al. (2007), Ylipaa et al. (2008), and Klijn et al. (2008), among others. Rueda and Diaz-Uriarte (2010) provided a comprehensive review of methods for detecting recurrent CNA regions under different scenarios. For the comparison of two groups, Willenbrock and Fridlyand (2005), and Huang et al. (2007) suggested another two-step procedure whereby multiple samples are first summarized into a single index at each clone location (e.g., by the clone-wise group mean differences), and then segmentation is performed on the summarized indices. Though straightforward and easy to implement, these two-step procedures do not take into account the variations within and across clones at the same time, and may lose accuracy and sensitivity in detecting aberrant regions of interest.

© 2010, The International Biometric Society

In this article, we develop a unified approach to identify regions with differential CNAs between groups as well as regions with common CNAs within each group. The proposed approach models the CGH data from multiple samples directly, and thus is able to account for both within-clone and between-clone variations. We assume that the CNAs are consistent within groups but potentially different between groups. In addition, as the underlying biological process is discrete, we model the CGH data from the same group as a series of discrete segments with unknown boundaries. We propose to smooth the multiple arrays by fitting a median regression model with a fused adaptive lasso penalty, which shrinks the baseline and group effects toward piecewise constants. After smoothing, we employ a block bootstrap procedure to detect the change points that divide the genome into segments so that clones within neighboring segments will have distinct effects.

In Section 2, we describe the median smoothing and segmentation procedure, and provide the theoretical properties of the proposed penalized estimator. In Section 3, we conduct a simulation study to compare the performance of the proposed method and some existing methods. In Section 4, we apply the proposed method to data from a breast cancer study conducted at the National Cancer Institute, the goal of which was to determine copy number aberrations associated with grade 1 and grade 3 breast cancer tumors. We provide some final remarks in Section 5.

2. Proposed Method
2.1 Smoothing Based on Penalized Median Regression
Let $y_{ij}$ denote the log ratio of the $i$th array (subject) and the $j$th clone, where $i = 1, \ldots, n$ and $j = 1, \ldots, G$. There are a total of $N = nG$ observations. Suppose the first $n_1$ arrays belong to group 1, the last $n_2 = n - n_1$ arrays belong to group 2, and the clones $(1, \ldots, G)$ are ordered by their physical locations on a chromosome. We assume the following model:

$$y_{ij} = \mu_j + \beta_j z_i + e_{ij}, \tag{1}$$

where $z_i$ is the group indicator with $z_i = 0$ for $i = 1, \ldots, n_1$ and $z_i = 1$ for $i = n_1 + 1, \ldots, n$, $e_{ij}$ are the random errors with median zero, $\mu_j$ is the baseline effect (i.e., the median of group 1), and $\beta_j$ is the group effect on clone $j$.

For CGH data, because of the physical dependence of neighboring clones, the errors $e_{ij}$ tend to be locally correlated among the clones (Huang et al., 2005). The spatial dependence of CGH data is also exhibited in the effects $\mu_j$ and $\beta_j$ in the sense that they tend to be piecewise constants except in the regions where the signals change abruptly. Employing a fused lasso approach similar to that of Tibshirani and Wang (2007), we propose to estimate $\mu_j$ and $\beta_j$ by minimizing

$$\sum_{i=1}^{n} \sum_{j=1}^{G} |y_{ij} - \mu_j - \beta_j z_i| + \lambda \sum_{j=2}^{G} w_{j1} |\mu_j - \mu_{j-1}| + \lambda \sum_{j=2}^{G} w_{j2} |\beta_j - \beta_{j-1}|, \tag{2}$$

where $\lambda$ is a nonnegative regularization parameter, and $w_{j1}$ and $w_{j2}$ are the adaptive weights. In (2), we discourage changes in the neighboring clones by penalizing $|\mu_j - \mu_{j-1}|$ and $|\beta_j - \beta_{j-1}|$. The influence of the penalty is controlled by the parameter $\lambda$. For $\lambda = 0$, we obtain the ordinary median estimates of $\mu_j$ and $\beta_j$; and as $\lambda \to \infty$, $\hat{\mu}_j \to \hat{\mu}_1$ and $\hat{\beta}_j \to \hat{\beta}_1$ for all $j = 2, \ldots, G$. Define $\tilde{\mu}_j$ and $\tilde{\beta}_j$ as some initial consistent estimators of $\mu_j$ and $\beta_j$. Throughout our numerical studies, we take $\tilde{\mu}_j$ and $\tilde{\beta}_j$ as the ordinary median estimates obtained with $\lambda = 0$. We set $w_{j1} = \max\{|\tilde{\mu}_j - \tilde{\mu}_{j-1}|, 10^{-5}\}^{-1}$ and $w_{j2} = \max\{|\tilde{\beta}_j - \tilde{\beta}_{j-1}|, 10^{-5}\}^{-1}$ for $j = 2, \ldots, G$. Instead of applying the same penalty to all clones, the adaptive lasso assigns larger penalties to the adjacent clones that have similar expressions; hence the differences of the effects at these clones are shrunk more toward zero. Numerical studies in Section 3 suggest that the fused adaptive lasso leads to more accurate segmentation and fewer false discoveries of CNA regions, when compared to the counterpart without adaptive weights.
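As a rough illustration of the quantities entering (2), the following Python sketch computes the unpenalized clone-wise median estimates, the adaptive weights, and the value of the penalized criterion at given coefficients. The function names and the list-of-lists data layout (one row per array, group 1 first) are assumptions made for this illustration, and the minimization itself is not shown.

```python
from statistics import median


def initial_estimates(y, n1):
    """Clone-wise median estimates obtained with lambda = 0.

    y is a list of n rows (arrays), each with G log ratios; the first n1
    rows belong to group 1. With a binary group indicator, the baseline is
    the group-1 median and the group effect is the difference of medians.
    """
    G = len(y[0])
    mu = [median(row[j] for row in y[:n1]) for j in range(G)]
    beta = [median(row[j] for row in y[n1:]) - mu[j] for j in range(G)]
    return mu, beta


def adaptive_weights(est, floor=1e-5):
    """Weights 1 / max(|est_j - est_{j-1}|, floor) for j = 2, ..., G."""
    return [1.0 / max(abs(est[j] - est[j - 1]), floor)
            for j in range(1, len(est))]


def fal_objective(y, n1, mu, beta, lam, w1, w2):
    """Value of criterion (2) at the coefficients (mu, beta)."""
    G = len(mu)
    loss = 0.0
    for i, row in enumerate(y):
        z = 0 if i < n1 else 1  # group indicator z_i
        loss += sum(abs(row[j] - mu[j] - beta[j] * z) for j in range(G))
    pen_mu = sum(w1[j - 1] * abs(mu[j] - mu[j - 1]) for j in range(1, G))
    pen_beta = sum(w2[j - 1] * abs(beta[j] - beta[j - 1]) for j in range(1, G))
    return loss + lam * (pen_mu + pen_beta)
```

Under this weighting, two neighboring clones whose initial estimates coincide receive the largest possible weight, $10^{5}$, so any jump between them is penalized heavily.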

The penalization parameter $\lambda$ can be selected by minimizing the modified Schwarz information criterion (SIC; Schwarz, 1978; Koenker, Ng, and Portnoy, 1994):

$$\mathrm{SIC}(\lambda) = \log\Big(\sum_{ij} |y_{ij} - \hat{\mu}_j - \hat{\beta}_j z_i|\Big) + d \log N/(2N), \tag{3}$$

where $(\hat{\mu}_j, \hat{\beta}_j)$ are the estimated coefficients, and $d$ is a measure of the complexity of the fitted model with the penalization parameter $\lambda$. In our implementation, we take $d = 2 + \sum_{j=2}^{G} \{I(\hat{\mu}_j \neq \hat{\mu}_{j-1}) + I(\hat{\beta}_j \neq \hat{\beta}_{j-1})\}$.
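To make criterion (3) concrete, here is a small Python sketch that evaluates the complexity measure $d$ and the SIC value for one fitted pair of coefficient vectors. The function names and the tolerance argument `tol` (used to decide whether two neighboring estimates differ) are assumptions of this sketch, not part of the described method; fitting the model for each candidate $\lambda$ is not shown.

```python
import math


def model_complexity(mu_hat, beta_hat, tol=0.0):
    """d = 2 + number of jumps in mu_hat + number of jumps in beta_hat."""
    d = 2
    for j in range(1, len(mu_hat)):
        d += abs(mu_hat[j] - mu_hat[j - 1]) > tol
        d += abs(beta_hat[j] - beta_hat[j - 1]) > tol
    return d


def sic(y, n1, mu_hat, beta_hat, tol=0.0):
    """SIC value from (3); assumes the total absolute residual is positive."""
    n, G = len(y), len(mu_hat)
    N = n * G
    loss = sum(abs(row[j] - mu_hat[j] - beta_hat[j] * (0 if i < n1 else 1))
               for i, row in enumerate(y) for j in range(G))
    d = model_complexity(mu_hat, beta_hat, tol)
    return math.log(loss) + d * math.log(N) / (2 * N)
```

Given fits at several candidate values of $\lambda$, one would keep the value with the smallest `sic(...)`. A small positive `tol` may be preferable in floating-point arithmetic, since numerically fitted neighbors are rarely exactly equal.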

2.2 Asymptotic Property
In this section, we establish the oracle property (Fan and Li, 2001) of the fused adaptive lasso estimators $\hat{\mu}_j$ and $\hat{\beta}_j$. We first reparameterize and let $\delta_j = \mu_j - \mu_{j-1}$ and $\Delta_j = \beta_j - \beta_{j-1}$, $j = 1, \ldots, G$, where $\mu_0 = \beta_0 = 0$. Let $\theta = (\delta_1, \ldots, \delta_G, \Delta_1, \ldots, \Delta_G)^T$ denote the $(2G)$-dimensional unknown parameter vector. Define the design matrix as

$$X = \begin{pmatrix} 1_{n_1} & 0_{n_1} \\ 1_{n_2} & 1_{n_2} \end{pmatrix} \otimes B,$$

where $B$ is a $G \times G$ lower triangular matrix with 1 on and below the diagonal, and $1_n$ denotes an $n$-vector of ones. Let $x_{ij}$ be the $\{(i - 1)G + j\}$-th row of $X$. Therefore, (2) is equivalent to

$$\sum_{i=1}^{n} \sum_{j=1}^{G} |y_{ij} - x_{ij}^T \theta| + \lambda \sum_{k \notin \{1, G+1\}} w_k |\theta_k|, \tag{4}$$

where $w_k = |\tilde{\theta}_k|^{-1}$ with $\tilde{\theta}_k$ as some initial consistent estimator of $\theta_k$. Denote $\theta_{k,0}$ as the true coefficients, and $A = \{k : \theta_{k,0} \neq 0, k = 1, \ldots, 2G\}$. We write $d_A = (d_k, k \in A)^T$ for any vector $d = (d_1, \ldots, d_{2G})^T$, and $C_{AA}$ as the submatrix of $C$ with both row and column indices in $A$ for any $(2G) \times (2G)$ matrix $C$. The fused adaptive lasso estimator $\hat{\theta}$ enjoys the following oracle property.

Theorem 1: Suppose assumptions A1–A3 as spelled out in the Web Appendix hold. Then, as $n \to \infty$, we have

(i) consistency in selection: $\Pr(\{k : \hat{\theta}_k \neq 0, k = 1, \ldots, 2G\} = A) \to 1$;

(ii) asymptotic normality: $\sqrt{n}(\hat{\theta}_A - \theta_{A,0}) \to N(0, \Sigma)$,

where $\Sigma = \lim_{n \to \infty} H_{AA}^{-1} V_{AA} H_{AA}^{-1}$,

$$H_{AA} = n^{-1} \sum_{ij} x_{ijA} x_{ijA}^T f_{ij}(0),$$

$$V_{AA} = (4n)^{-1} \sum_{ij} x_{ijA} x_{ijA}^T + n^{-1} \sum_{i, j \neq j'} x_{ijA} x_{ij'A}^T \{\Pr(e_{ij} < 0, e_{ij'} < 0) - 1/4\},$$

and $f_{ij}$ is the density function of $e_{ij}$.
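The reparameterization behind (4) can be checked numerically. The Python sketch below converts $(\mu, \beta)$ to the increment vector $\theta$ and builds a single row $x_{ij}$ of the Kronecker-structured design matrix, so that $x_{ij}^T \theta = \mu_j + \beta_j z_i$. The helper names are hypothetical and the indices $i$, $j$ are 1-based to match the text.

```python
from itertools import accumulate


def to_increments(v):
    """Map (v_1, ..., v_G) to (v_1 - v_0, ..., v_G - v_{G-1}) with v_0 = 0."""
    return [v[0]] + [v[j] - v[j - 1] for j in range(1, len(v))]


def from_increments(d):
    """Inverse map: cumulative sums, i.e. multiplication by the matrix B."""
    return list(accumulate(d))


def design_row(i, j, n1, G):
    """Row x_ij of X for array i and clone j (both 1-based).

    The first G entries are row j of B (ones in positions 1..j); the last G
    entries repeat that row for group-2 arrays and are zero for group 1.
    """
    z = 0 if i <= n1 else 1
    b_row = [1 if k < j else 0 for k in range(G)]
    return b_row + [z * b for b in b_row]


def linear_predictor(i, j, n1, theta):
    """x_ij^T theta, which should equal mu_j + beta_j * z_i."""
    G = len(theta) // 2
    return sum(x * t for x, t in zip(design_row(i, j, n1, G), theta))
```

In this parameterization a piecewise-constant profile corresponds to a sparse $\theta$, which is why the fused penalty in (2) becomes an ordinary adaptive lasso penalty on $\theta_k$, $k \notin \{1, G+1\}$.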

2.3 Determination of Change Points and Genomic Segments
We first describe the procedure for identifying the genomic segments associated with the group effect. Note that in (4), a nontrivial $\Delta_j$ corresponds to a location at which the group effect is changed. To assess the significance of each $\hat{\Delta}_j$ for $j = 2, \ldots, G$, we employ the following block bootstrap procedure:

(i) Given the observed data $\{y_{ij}\}$, compute the fused adaptive lasso estimates $\hat{\mu}_j$ and $\hat{\beta}_j$ by minimizing (2), and obtain the centered data $\tilde{y}_{ij} = y_{ij} - \hat{\mu}_j - \hat{\beta}_j z_i$.
(ii) Resample the centered data $\tilde{y}_{ij}$ with replacement by treating each sample as a unit to obtain the bootstrap data $y^*_{ij}$.
(iii) Fit the penalized median regression with the bootstrap data to obtain the fused adaptive lasso estimates $\hat{\beta}^*_j$ and thus $\hat{\Delta}^*_j = \hat{\beta}^*_j - \hat{\beta}^*_{j-1}$, $j = 2, \ldots, G$. To save the computational cost, we fix $\lambda$ at the optimal value chosen by SIC based on the original data.
(iv) Repeat (ii) and (iii) $K$ times and obtain a set of bootstrap estimates $\{\hat{\Delta}^*_{jk}, k = 1, \ldots, K\}$.

We calculate the p-values for the observed $\hat{\Delta}_j$ by $K^{-1} \sum_{k=1}^{K} I(|\hat{\Delta}^*_{jk}| \geq |\hat{\Delta}_j|)$, $j = 2, \ldots, G$. We declare that a clone $j$ is a change point associated with the group effect if its corresponding p-value is smaller than $p^*$. For a fixed p-value cutoff $p^*$, we follow the approach of Fan et al. (2004) and Huang et al. (2005) to estimate the false discovery rate (FDR) by

$$\widehat{\mathrm{FDR}} = \frac{p^* \times \text{total number of clones}}{\text{number of clones whose p-values are less than } p^*}.$$

The p-value cutoff $p^*$ is chosen so that the FDR is controlled at a desirable level $\alpha$, say, 5%. Suppose that the p-values are smaller than $p^*$ at clones $c_1, \ldots, c_{m-1}$; then $m$ segments will be formed by clones $[1, c_1 - 1], [c_1, c_2 - 1], \ldots, [c_{m-1}, G]$, so that the clones within each segment will have the same group effect.
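The bookkeeping that follows the bootstrap fits can be sketched in Python as below: p-values from the bootstrap jump estimates, the FDR estimate for a given cutoff, a cutoff search, and the conversion of change points into segments. All function names are hypothetical. The candidate grid of cutoffs, the value returned when no clone is selected, and the choice to end the last segment at clone $G$ are assumptions of this sketch; the bootstrap refitting itself is not shown.

```python
def bootstrap_pvalues(delta_hat, delta_boot):
    """p_j = K^{-1} * #{k : |delta*_jk| >= |delta_hat_j|}.

    delta_hat: observed jumps for j = 2..G (list of length G - 1);
    delta_boot: K bootstrap replicates, each a list of the same length.
    """
    K = len(delta_boot)
    return [sum(abs(rep[j]) >= abs(delta_hat[j]) for rep in delta_boot) / K
            for j in range(len(delta_hat))]


def estimated_fdr(pvals, cutoff, total_clones):
    """cutoff * total_clones / #{p-values below cutoff}.

    Returns 0.0 when nothing is selected (an assumption of this sketch).
    """
    hits = sum(p < cutoff for p in pvals)
    return cutoff * total_clones / hits if hits else 0.0


def choose_cutoff(pvals, total_clones, grid, alpha=0.05):
    """Largest cutoff on a user-supplied grid with estimated FDR <= alpha."""
    ok = [c for c in grid if estimated_fdr(pvals, c, total_clones) <= alpha]
    return max(ok) if ok else None


def segments_from_change_points(change_points, G):
    """Change points c_1 < ... < c_{m-1} (1-based) -> m closed segments."""
    cps = sorted(change_points)
    starts = [1] + cps
    ends = [c - 1 for c in cps] + [G]
    return list(zip(starts, ends))
```

For example, change points at clones 21 and 26 with $G = 50$ give the segments $[1, 20]$, $[21, 25]$, and $[26, 50]$.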

The next step is to test the significance of the group effects within each segment in order to identify the regions where two groups differ in CNAs. For each identified segment, we apply the rank score test of Wang and He (2007), which is robust and able to account for the correlation among clones within the same subject. Herein we assume a common intrasubject correlation between any two clones from one segment. In our context, the rank score test is based on the signs of residuals obtained by fitting the median regression model under the null hypothesis of no group effects.

Following a similar procedure, we can also identify the segments associated with the baseline effects $\mu_j$, that is, the segments that are common to samples from group 1. The common segments for group 2 can then be formed by the change points that are associated with either group 1 or the group effect. For instance, suppose group 1 has two segments formed by clones [1, 20] and [21, 50], and the group effect is associated with two segments formed by clones [1, 25] and [26, 50]. Group 2 will then have three segments formed by clones [1, 20], [21, 25], and [26, 50]. We can then apply the rank score test on each segment to assess the significance of the intercept effects, and to identify the group-specific regions that have significant copy number aberrations.
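The way the group-2 segments are formed from the two sets of change points can be illustrated with a short Python sketch. The function names are hypothetical, and segments are represented as 1-based closed intervals `(start, end)` that are assumed to cover the same clones $1, \ldots, G$.

```python
def change_points(segments):
    """Start positions of every segment except the first one."""
    return sorted(start for start, _ in segments)[1:]


def merge_segmentations(seg_a, seg_b):
    """Segments formed by the union of the change points of seg_a and seg_b."""
    G = max(end for _, end in seg_a)  # last clone; same in both inputs
    cps = sorted(set(change_points(seg_a)) | set(change_points(seg_b)))
    starts = [1] + cps
    ends = [c - 1 for c in cps] + [G]
    return list(zip(starts, ends))


# The example from the text: baseline segments and group-effect segments.
group1 = [(1, 20), (21, 50)]
group_effect = [(1, 25), (26, 50)]
group2 = merge_segmentations(group1, group_effect)
```

With the inputs from the text, `group2` holds the three segments $[1, 20]$, $[21, 25]$, and $[26, 50]$.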

2.4 Computation
Hereafter, we will refer to the proposed algorithm based on the fused adaptive lasso penalization as the FAL method. The procedure is very easy to implement with commonly used software. Given a fixed tuning parameter $\lambda$, the minimization of (2) can be formulated as a linear programming problem and solved efficiently. Denote $Y = (y_{ij})$ as the response vector, and $\tilde{Y} = (Y^T, 0^T_{2G-2})^T$ as the augmented response vector. Appending the penalty term in (2) to the design matrix, we obtain the augmented design matrix

$$\tilde{X} = \begin{pmatrix} 1_{n_1} \otimes I_G & 0 \\ 1_{n_2} \otimes I_G & 1_{n_2} \otimes I_G \\ \lambda W_1 D & 0 \\ 0 & \lambda W_2 D \end{pmatrix},$$

where

$$D = \mathrm{diff}(I_G) = \begin{pmatrix} 1 & -1 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ & & \ddots & & \\ 0 & 0 & \cdots & 1 & -1 \end{pmatrix},$$

$W_1 = \mathrm{diag}(w_{21}, \ldots, w_{G1})$, $W_2 = \mathrm{diag}(w_{22}, \ldots, w_{G2})$, and $I_G$ is the $G \times G$ identity matrix. For a given $\lambda$, the penalized estimator of $(\mu_1, \ldots, \mu_G, \beta_1, \ldots, \beta_G)^T$ can be obtained by regressing $\tilde{Y}$ on $\tilde{X}$ at the median. The dimension of the design matrix may appear daunting, but $\tilde{X}$ is very sparse, consisting of only $\{(n_1 + 2n_2 + 4)G - 4\}$ nonzero elements. We fit the median regression model using the sparse Frisch-Newton interior-point algorithm, implemented by the “rq.fit.sfn” function in the R package quantreg. The algorithm uses the sparse matrices adopted from Koenker and Ng (2003). The computation time is roughly proportional to the number of nonzero elements in the design matrix (Koenker and Ng, 2005). The computation of our implemented method is quite fast. For example, using R (version 2.8.1) on a PC with 4.0 GB of RAM, the algorithm took 88 seconds on average to complete the computation for one simulated data set from Case 2 in Section 3 with $n_1 = n_2 = 10$ and $G = 1000$, including tuning parameter selection and segmentation based on 500 bootstrap samples.
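To see where the nonzero count $(n_1 + 2n_2 + 4)G - 4$ comes from, the Python sketch below assembles the augmented design matrix in coordinate (row, column, value) form. The function name and the triplet representation are assumptions made for illustration; the actual fit would be delegated to a sparse median-regression solver such as the R function named in the text, which is not reproduced here.

```python
def augmented_design_triplets(n1, n2, G, lam, w1, w2):
    """Return (n_rows, n_cols, triplets) for the augmented design matrix.

    w1 and w2 hold the adaptive weights for j = 2..G (length G - 1 each).
    Columns 0..G-1 correspond to mu, columns G..2G-1 to beta.
    """
    triplets = []
    row = 0
    # Data rows: one per (array, clone) pair, arrays of group 1 first.
    for i in range(n1 + n2):
        in_group2 = i >= n1
        for j in range(G):
            triplets.append((row, j, 1.0))          # coefficient of mu_j
            if in_group2:
                triplets.append((row, G + j, 1.0))  # coefficient of beta_j
            row += 1
    # Penalty rows: lam * w_j * (coef_{j-1} - coef_j), first mu, then beta.
    for block, w in ((0, w1), (1, w2)):
        for j in range(G - 1):
            triplets.append((row, block * G + j, lam * w[j]))
            triplets.append((row, block * G + j + 1, -lam * w[j]))
            row += 1
    return row, 2 * G, triplets
```

The data rows contribute $(n_1 + 2n_2)G$ nonzero entries and the two penalty blocks contribute $4(G - 1)$, which adds up to the count stated above whenever $\lambda$ and all weights are positive. The augmented response is the stacked log ratios followed by $2G - 2$ zeros.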

The proposed median-based procedure can also be extended to a general quantile level $\tau$ to identify regions with CNA in the tails of the intensity distribution. Such CNA regions, for instance those shared by 25% of a group of patients, are of biological interest in practice. For FAL at a general quantile level $0 < \tau < 1$, we need to replace the $L_1$ loss function $|u|$ in the first term of (2)–(4) by the quantile check function $\rho_\tau(u) = u\{\tau - I(u < 0)\}$. The optimization can still be formulated as a linear programming problem, and thus be solved by using the sparse interior point algorithm. The real data example in Section 4 indicates that the study of tails can help identify interesting CNA regions that may be smoothed away at the median.
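The check function can be written in a few lines of Python. The sketch below also includes a brute-force helper (a hypothetical name, searching only over the observed values) that finds the constant minimizing the total check loss, to illustrate that different values of $\tau$ target different parts of the distribution.

```python
def check_loss(u, tau):
    """Quantile check function rho_tau(u) = u * (tau - I(u < 0))."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def best_constant(values, tau):
    """Observed value c minimizing sum_i rho_tau(values_i - c).

    A sample quantile is always among the minimizers, so searching the
    observed values is enough for this illustration.
    """
    return min(values, key=lambda c: sum(check_loss(v - c, tau) for v in values))
```

At $\tau = 0.5$ the check function equals $|u|/2$, so minimizing it is equivalent to minimizing the $L_1$ loss used in the median analysis.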

3. Simulation Study
Our simulation mimics the squamous cell carcinoma (SCC) data set in Snijders et al. (2001), which contains the log-ratio intensities of 14 TP53-mutant and 61 wild-type samples of primary SCCs for 1979 clones. The scientific goal is to identify the genomic regions with CNAs that are associated with the TP53 mutation. In this simulation study, we assume there are two groups, each consisting of 10 arrays. The log ratio data of $G$ equally spaced clones are generated from the model (1). To generate $e_{ij}$, we first obtain the estimated residuals $\hat{e}_{ij}$ by subtracting the clone-wise median effects of each group from the log-ratio intensities $y_{ij}$ of the SCC data. Then we randomly select $G$ clone locations, and take 10 samples from each of the two groups. Finally, the residuals from the 20 samples are shuffled at each clone location to make sure that they carry no group-specific information.
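The last two steps of this data-generating scheme can be sketched in Python as follows. The function names and the list-of-lists layout (rows are samples, columns are clones, group 1 first) are assumptions of this sketch; obtaining the residuals from the SCC data and selecting clone locations are not shown.

```python
import random


def shuffle_within_clone(resid, rng):
    """Permute residuals across samples separately at each clone location."""
    n, G = len(resid), len(resid[0])
    shuffled = [row[:] for row in resid]  # leave the input untouched
    for j in range(G):
        column = [resid[i][j] for i in range(n)]
        rng.shuffle(column)
        for i in range(n):
            shuffled[i][j] = column[i]
    return shuffled


def simulate_log_ratios(mu, beta, resid, n1):
    """Generate y_ij = mu_j + beta_j * z_i + e_ij as in model (1)."""
    return [[mu[j] + beta[j] * (0 if i < n1 else 1) + e
             for j, e in enumerate(row)]
            for i, row in enumerate(resid)]
```

A call such as `simulate_log_ratios(mu, beta, shuffle_within_clone(resid, random.Random(1)), n1=10)` would produce one simulated data set. Shuffling column by column keeps the marginal error distribution at each clone while breaking any link between a residual and its original group.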

To assess the performance of FAL in both small and large data sets, we consider two cases with $G = 200$ in Case 1 and $G = 1000$ in Case 2. Case 1 contains 10 contiguous segments including five with nonzero group effects. Case 2 contains 11 segments, six of which have nonzero group effects. The specific choices of $\mu_j$ and $\beta_j$ in these two cases are given in Table 1. In real tumor samples, the CNA segments usually have different starting/ending points in different tumor samples. Therefore, we randomly perturb the data in Case 2 such that the starting/ending points of segment 5 are shifted by up to three clones in all the samples from group 2. For each case, we conduct 100 simulations.

Table 1
The top panel shows the simulation designs in Cases 1 and 2, where Loc.start and Loc.end are the locations of the beginning and the end of a segment. The bottom panel summarizes the frequencies of different methods for identifying contiguous segments among the 100 simulations.

Simulation design, Case 1

| Segment j | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Loc.start | 1 | 11 | 26 | 46 | 76 | 101 | 151 | 166 | 181 | 191 |
| Loc.end | 10 | 25 | 45 | 75 | 100 | 150 | 165 | 180 | 190 | 200 |
| μj | 0 | 0.25 | 0.45 | 0.5 | 0.5 | 0 | 0.43 | 0.5 | 1 | −1 |
| βj | 0 | −0.15 | −0.25 | 0 | 0.1 | 0 | −0.1 | 0 | 0.1 | 0 |

Simulation design, Case 2

| Segment j | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Loc.start | 1 | 101 | 111 | 121 | 431 | 451 | 746 | 751 | 761 | 771 | 801 |
| Loc.end | 100 | 110 | 120 | 430 | 450 | 745 | 750 | 760 | 770 | 800 | 1000 |
| μj | 0 | 0.25 | 0.45 | 0.5 | 0.5 | 0 | 0.425 | 0.5 | 1 | 0.5 | −1 |
| βj | 0 | −0.2 | −0.35 | 0 | 0.15 | 0 | −0.15 | 0 | 0.15 | 0.245 | 0 |

Frequency of correct identification of each segment, Case 1

| Segment j | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| FAL | 94 | 75 | 82 | 84 | 68 | 49 | 83 | 81 | 61 | 65 |
| FL | 88 | 15 | 81 | 31 | 74 | 13 | 90 | 40 | 95 | 89 |
| CBS | 93 | 73 | 76 | 74 | 58 | 25 | 40 | 26 | 32 | 34 |
| LZ | 84 | 30 | 73 | 28 | 66 | 12 | 75 | 35 | 56 | 46 |
| SMS | 10 | 0 | 0 | 33 | 76 | 73 | 74 | 62 | 58 | 78 |

Frequency of correct identification of each segment, Case 2

| Segment j | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FAL | 87 | 56 | 68 | 50 | 77 | 19 | 58 | 70 | 54 | 55 | 95 |
| FL | 24 | 4 | 20 | 12 | 86 | 0 | 0 | 17 | 5 | 10 | 42 |
| CBS | 79 | 20 | 25 | 78 | 83 | 14 | 12 | 11 | 39 | 41 | 100 |
| LZ | 18 | 0 | 3 | 0 | 83 | 0 | 0 | 15 | 29 | 56 | 7 |
| SMS | 1 | 0 | 0 | 41 | 70 | 3 | 3 | 4 | 0 | 0 | 59 |

[Figure 1: two panels, Case 1 and Case 2; x-axis: FP; y-axis: TP; one curve each for FAL, FL, CBS, LZ, and SMS.]

Figure 1. The ROC curves of different methods for detecting the change points associated with the group effect in Cases 1 and 2. The TP and FP are the total number of true and false change points identified, respectively. The ideal TP is 9 for Case 1 and 10 for Case 2. This figure appears in color in the electronic version of this article.

We compare the following methods: the proposed fused adaptive lasso method (FAL); the fused lasso method without adaptive weights (FL); the circular binary segmentation (CBS) method by Olshen et al. (2004); the smooth segmentation method (SMS) by Huang et al. (2007); and the fused quantile regression approach (LZ) of Li and Zhu (2007). To implement SMS, we use the function “FDRcgh” in the R package smoothseg with default options based on Wilcoxon test. For CBS and LZ that are designed for single arrays, segmentation is performed on the summarized indices, the clone-wise median differences between the two groups. For both FAL and FL, we focus on the median analysis.

We first look at the receiver operating characteristic (ROC) curves to compare the sensitivity and specificity of different methods for detecting the change points. There are a total of nine change points for Case 1, and 10 for Case 2. Figure 1 shows the number of true positives (TPs) against the number of false positives (FPs), averaged over 100 simulations. The ROC curves of FAL, FL, LZ, and CBS are constructed by varying the p-value cutoffs, where the p-values are obtained from bootstrap for the first three methods. Since SMS cannot be directly used for change point detection, we calculate the differences of the smoothed clone-wise Wilcoxon test statistic values at adjacent clones, and identify the clones with differences exceeding a given threshold value as change points. The ROC curve of SMS is obtained by varying the test statistic threshold value. Figure 1 suggests that the SMS method performs worse than all the other methods at the beginning, but then catches up when the threshold is reduced (FP ≥ 50). In general, the FL method yields smaller bootstrap p-values and thus detects more change points than FAL with the same p-value cutoff. For instance, with a small p-value cutoff of $10^{-7}$, on average FL yields 7.7 TPs and 10 FPs, while FAL yields 6.3 TPs and 3.6 FPs in Case 1. Therefore, for Case 1 with a smaller number of clones, FL is more sensitive and it can detect more change points without losing much on the false positives. However, in Case 2 with $G = 1000$, the loss of FL on the false positives is more pronounced and it gives much worse ROC curves than FAL. Considering both cases, FAL is the most favorable method for detecting change points.

Next we assess the performance of different methods on segmentation. The bottom panel of Table 1 summarizes the frequencies of different methods for successfully identifying each segment out of the 100 simulations. We count it as a successful segmentation if the starting and end locations of an estimated segment differ from the true segment by less than two clone locations. Since SMS smoothes out the segments, we consider two adjacent segments distinct if SMS identifies that one segment has significant group effects but the other does not. To perform segmentation with the LZ method, we use bootstrap to assess the significance of jumps at contiguous clones, and estimate FDR following the same procedure as described in Section 2.3. The number of bootstrap samples is set as 500 for FAL, FL, and LZ. As the CBS method does not account for the multiple test problems, we choose a p-value threshold of $0.05/G$ for the individual test. The FDR threshold is set as 0.05 for the other four methods. Segmentation is more challenging in Case 2, as it involves relatively shorter segments (for instance, segment 7 with only five clones). For visual presentation, Figures 2 and 3 depict the segmentation results of two typical data sets from Cases 1 and 2, respectively. In each plot, the vertical lines separate the true 10 segments. In plots (a)–(c), the horizontal bars are the estimated segment-wise group effects of the segments (marked by brackets []) identified by FAL, FL, and CBS, respectively, and the thicker horizontal bars in (a)–(b) correspond to the detected segments with significant group effects. We point out that the CBS method does not provide any results on the significance of the group effects. Plot (d) shows the estimated clone-wise group effects from the LZ method. Plot (e) shows the segmentation results from the SMS method with the bigger solid dots corresponding to the detected clones that have significant group effects.
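The success criterion used for Table 1 can be expressed as a short Python sketch. The function names are hypothetical; segments are 1-based closed intervals `(start, end)`, and the default tolerance encodes "differ by less than two clone locations".

```python
def segment_recovered(true_seg, estimated_segs, tol=2):
    """True if some estimated segment matches true_seg within tol clones."""
    t_start, t_end = true_seg
    return any(abs(s - t_start) < tol and abs(e - t_end) < tol
               for s, e in estimated_segs)


def recovery_frequencies(true_segs, estimated_runs, tol=2):
    """Number of simulation runs in which each true segment is recovered."""
    return [sum(segment_recovered(t, est, tol) for est in estimated_runs)
            for t in true_segs]
```

Applied to the estimated segmentations from the 100 simulated data sets of one case, `recovery_frequencies` would return one count per true segment, in the same layout as a row of the bottom panel of Table 1.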

[Figure 2: five panels, (a) FAL, (b) FL, (c) CBS, (d) LZ, and (e) SMS; x-axis: clone order; y-axis: group difference in (a)–(d) and test statistic in (e); true segments are numbered 1 to 10.]

Figure 2. The segmentation results of different methods for a data set in Case 1. The true segments are separated by the vertical lines, and the segments identified by FAL, FL, and CBS are marked by brackets [] in (a)–(c). This figure appears in color in the electronic version of this article.

[Figure 3: five panels, (a) FAL, (b) FL, (c) CBS, (d) LZ, and (e) SMS; x-axis: clone order; y-axis: group difference in (a)–(d) and test statistic in (e); true segments are numbered 1 to 11.]

Figure 3. The segmentation results of different methods for a data set in Case 2. The true segments are separated by the vertical lines, and the segments identified by FAL, FL, and CBS are marked by brackets [] in (a)–(c). This figure appears in color in the electronic version of this article.

Table 1 suggests that CBS has difficulty identifying short segments such as segments 7–10 in Cases 1–2, and segments 2–3 in Case 2. Because of its continuous segment assumption, SMS cannot distinguish neighboring segments with group effects in the same direction but with different degrees, for instance, segments 2 and 3 in Cases 1–2, and segments 9 and 10 in Case 2; see Figure 3(e). By using summarized indices instead of data from multiple samples, LZ loses sensitivity for detecting aberrant segments in both cases. In general, FL performs worse than FAL, and it generates more false short segments. For the example data set shown in Figure 2(b), FL mistakenly divides segment 2 into three distinct segments. Compared with the other four methods, FAL is more successful in identifying the true segments and the segments with nonzero group effects, with only slight shifts of the change point locations. In addition, FAL is more powerful in separating neighboring segments that have different, especially low-level, chromosomal aberrations (for instance, segments 7–9). Recall that in Case 2, the underlying region of segment 5 is shifted by a few clones in some samples. Results from Table 1 indicate that all methods are quite robust against such slight location shifts.

4. Empirical Data Analysis

We apply the FAL method to breast cancer data to assess its practical performance. The experiment was conducted by the National Cancer Institute, enrolling invasive breast cancer patients with ductal carcinoma in situ. The raw data sets can be downloaded from the National Center for Biotechnology Information (accession no. GSE7882). For array CGH, amplified Promega Normal Male DNA (Promega, WI, USA) was used as the reference sample. Each CGH array contains measurements for 36,288 clones on 24 chromosomes. More information on the data set was provided by Balleine et al. (2008). The data set consists of 13 cases with grade 1, and 10 cases with grade 3 invasive cancer. The histopathologic grade has been widely accepted as a powerful indicator of prognosis in breast cancer, and it was shown to be associated with survival times (Elston and Ellis, 1991). In this section, we identify the genomic regions with differential CNAs between these two grade subgroups.

By controlling FDR at level 0.05, FAL at the median identifies segments with differential CNAs on four chromosomes: 8, 11, 16, and 17. More specifically, grade 1 tumors have more gains at chromosome arms 11p15 and 17q21-q23, while grade 3 tumors involve more frequent losses at 8p23-p21, 11q22-q25, and 17p13-p12. Hereafter, we focus on the study of chromosomes 8 and 17, which were shown to contain particularly influential regions that distinguish between grade 1 and grade 3 associated lesions in Balleine et al. (2008). As a comparison, we also apply the CBS method, and the FAL method at the lower and upper quartiles τ = 0.25 and τ = 0.75. The CBS method is based on the differences of the clone-wise medians of the two subgroups, and the p-value cutoff is chosen as 0.05/G for CBS, where G = 1157 for chromosome 8 and G = 1890 for chromosome 17. We did not include the LZ and SMS methods as both were designed for smoothing rather than for segmentation.
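To make the CBS input and cutoff concrete, the following minimal Python sketch computes the 0.05/G cutoffs for the two chromosomes and the clone-wise median differences between two groups. The toy arrays, the function name `clonewise_median_difference`, and the sign convention (grade 3 minus grade 1) are illustrative assumptions, not values or code from the study.

```python
import numpy as np

# Number of clones G per chromosome, as reported in the text.
clones_per_chromosome = {"chr8": 1157, "chr17": 1890}

# Cutoff 0.05 / G used for CBS on each chromosome.
alpha = 0.05
cutoffs = {name: alpha / g for name, g in clones_per_chromosome.items()}


def clonewise_median_difference(group_a, group_b):
    """Return the clone-wise median of group_b minus that of group_a.

    Each input is a (samples x clones) array of log-ratio intensities.
    The sign convention is an assumption made for this sketch.
    """
    return np.median(group_b, axis=0) - np.median(group_a, axis=0)


# Hypothetical toy data: 3 samples x 4 clones per group (not from the study).
grade1 = np.array([[0.0, 0.1, 0.2, 0.0],
                   [0.1, 0.0, 0.3, 0.1],
                   [0.2, 0.2, 0.1, 0.0]])
grade3 = np.array([[0.0, -0.4, -0.5, 0.1],
                   [0.1, -0.6, -0.3, 0.0],
                   [0.3, -0.5, -0.4, 0.2]])

diff = clonewise_median_difference(grade1, grade3)
print(cutoffs)
print(diff)
```

The resulting cutoffs are roughly 4.3e-5 for chromosome 8 and 2.6e-5 for chromosome 17; the vector `diff` is the kind of one-dimensional profile that a single-sample segmentation method such as CBS would then take as input.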

Figure 4 shows the genomic profiles of chromosomes 8 and 17. The top row shows the grade-associated segments identified by CBS, and the bottom three rows show the segments identified by FAL at three quartiles, respectively, where the thicker horizontal bars are for the segments with significant grade-associated effects. The circle points correspond to the clone-wise median differences between the two subgroups in the top two rows, and clone-wise first and third quartile differences in the bottom two rows.

[Figure 4 image omitted. Left column, chromosome 8: panels "CBS, ch8", "FAL, Q2, ch8", "FAL, Q1, ch8", and "FAL, Q3, ch8". Right column, chromosome 17: panels "CBS, ch17", "FAL, Q2, ch17", "FAL, Q1, ch17", and "FAL, Q3, ch17". The y-axis is group difference and the x-axis is clone order.]

Figure 4. The segmentation results from CBS and FAL at the three quartiles on chromosomes 8 and 17. The second row is for the median, and the bottom two are for the first and the third quartiles, respectively. The horizontal bars are the estimated segment-wise group effects. The thicker horizontal bars indicate segments with significant grade effects. This figure appears in color in the electronic version of this article.

On chromosome 8, FAL at the median detects five segments with significant grade effects, which are formed by clone locations [1, 91], [126, 155], [156, 165], [240, 255], and [256, 280]. However, the CBS method fails to separate segments 2 and 3, and segments 4 and 5, when they have differential CNAs in the same direction but with different degrees, which agrees with the observations from the simulation study. More specifically, CBS classifies clones 240–345 to be on the same segment. For further examination, we perform the rank score test on regions [240, 255], [256, 280], and [281, 345] separately, and obtain the p-values of 0.002, 0.001, and 0.008, respectively. This suggests that only regions [240, 255] and [256, 280] are significantly associated with grade after multiple test adjustment across the 11 segments identified by FAL, and thus are deserving of more attention. By comparing the results from FAL at three quartiles, we notice that region [126, 165] has significantly negative grade effects only at the median. An examination of data from two groups suggests that the distribution of grade 3 tumors from clones in this region is heavily skewed to the right, and the large variation in the right tail makes FAL fail to detect the group effect at the third quartile. If we define a clone with log-ratio intensity greater than 1 as an addition, and less than −1 as a deletion, there are a total of 20 additions and 2 deletions in the grade 1 group, while there are four additions and five deletions in the grade 3 group. The FAL at the first quartile fails to detect the group effect in this region mainly because of the small difference of cases with deletions between the two groups.
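The multiple test adjustment on the three rank score p-values can be checked with a short calculation. The text does not state which adjustment was applied to these regions, so the Bonferroni threshold 0.05/11 below is an assumption; it is shown only because it reproduces the stated conclusion.

```python
# p-values from the rank score tests reported in the text.
p_values = {"[240, 255]": 0.002, "[256, 280]": 0.001, "[281, 345]": 0.008}

n_segments = 11                  # segments identified by FAL on chromosome 8
alpha = 0.05
threshold = alpha / n_segments   # Bonferroni threshold (assumed adjustment)

# A region is flagged when its p-value falls below the adjusted threshold.
significant = {region: p < threshold for region, p in p_values.items()}

for region, flag in significant.items():
    print(region, "significant" if flag else "not significant")
```

Under this assumption the threshold is about 0.0045, so regions [240, 255] and [256, 280] pass while [281, 345] does not.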

For chromosome 17, FAL at the median suggests that the two groups have differential CNAs in segments [44, 80] and [332, 386]. The rank score test gives p-values of 0.002 and 0.001 for these two segments, respectively. Similar to the results on chromosome 8, CBS fails to separate the two segments from their neighbors. Another noteworthy observation is that FAL at the third quartile identifies the segment [1160, 1387] as being associated with a positive group effect. Using arbitrary values 1 and −1 as cutoffs for addition and deletion, we find 44 cases with additions and 226 cases with deletions in the grade 1 group in this segment, while there are 408 additions and 118 deletions in the grade 3 group. This indicates that grade 3 tumors have more frequent gains than grade 1 tumors, and this difference is reflected in the upper quantiles and detected by FAL at the third quartile. However, the two grade groups tend to have no significant differences at either the median or the lower quartile. This example suggests that using FAL at tail quantiles can help discover interesting CNA regions that may be overlooked by comparing only the centers of the intensity distributions.
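The descriptive check used here, counting additions and deletions with cutoffs 1 and −1 and comparing the groups at different quantiles, can be sketched as follows. The function names and toy values are hypothetical, and the pooled quantile difference is only a descriptive summary within one region, not the FAL estimator.

```python
import numpy as np


def count_gains_losses(log_ratios, gain_cutoff=1.0, loss_cutoff=-1.0):
    """Count values above the gain cutoff and below the loss cutoff."""
    values = np.asarray(log_ratios, dtype=float)
    additions = int(np.sum(values > gain_cutoff))
    deletions = int(np.sum(values < loss_cutoff))
    return additions, deletions


def quantile_difference(group_a, group_b, tau):
    """Return the tau-th quantile of group_b minus that of group_a."""
    return float(np.quantile(group_b, tau) - np.quantile(group_a, tau))


# Hypothetical log-ratios pooled over one region (not from the study).
grade1 = np.array([-0.2, -0.1, 0.0, 0.1, 0.2])
grade3 = np.array([-0.2, -0.1, 0.0, 1.2, 1.5])

print(count_gains_losses(grade1), count_gains_losses(grade3))
print(quantile_difference(grade1, grade3, 0.50))  # groups agree at the median
print(quantile_difference(grade1, grade3, 0.75))  # groups differ in the upper tail
```

In this toy region the two groups share the same median but differ at the third quartile, which mirrors the pattern described for segment [1160, 1387].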

5. Discussion

We have developed a new unified approach for analyzing CGH data with multiple samples from two groups. In contrast to the existing methods based on summarized indices, our approach models the original data, and thus is able to account for variations from both within and across clones. As discussed in Rueda and Diaz-Uriarte (2010), there are different scenarios of recurrent regions. Our proposed method operates under the paradigm that the recurrent regions are shared by most subjects from the same group. In practice, the change points may vary for different individuals. However, our simulation study suggests that the proposed procedure is robust to random shifts of breakpoints for some subjects. Our empirical studies suggest that the proposed method has superior performance for detecting short segments and those with low-level copy number aberrations, and for distinguishing neighboring segments with differential degrees of copy number aberrations. In addition, analysis at the tail quantiles can help identify interesting regions that might be smoothed away at the median due to population heterogeneity.

In this article, we focused on the comparison of two groups, but the proposed method can be extended to multiple groups by using multiple predictors to index the groups with one of the groups as the reference, in a way similar to analysis of variance (ANOVA) models. In addition, the fused lasso idea can be extended to analyze CGH data with continuous phenotype to identify genomic segments with dissimilar phenotype effects. The proposed fused estimation was obtained by assuming a working independence structure for residuals from neighboring clones. We showed that the resulting estimator still enjoys the oracle property. Our empirical experience suggested that the within-clone variations may differ at different clone locations in some CGH studies. More efficient estimation might be achieved by incorporating the appropriate weights to account for the intrasubject correlation structure and the heterogeneity in clones (Wang, 2009). Such extensions are beyond the scope of this article and deserve further study.
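As an illustration of the multiple-group extension, the sketch below builds reference-coded indicator predictors in the ANOVA style, with one column per non-reference group. The helper `reference_coded_design` and the three-group labels (including a "grade2" level, which does not occur in the data analyzed here) are hypothetical.

```python
import numpy as np


def reference_coded_design(group_labels, reference):
    """Build indicator predictors with `reference` as the baseline group.

    Returns the ordered non-reference levels and an
    (n_samples x (n_groups - 1)) 0/1 matrix.
    """
    labels = list(group_labels)
    levels = sorted(set(labels) - {reference})
    design = np.zeros((len(labels), len(levels)), dtype=int)
    for col, level in enumerate(levels):
        design[:, col] = [int(label == level) for label in labels]
    return levels, design


# Hypothetical sample labels; "grade1" serves as the reference group.
labels = ["grade1", "grade2", "grade3", "grade1", "grade3"]
levels, design = reference_coded_design(labels, reference="grade1")
print(levels)
print(design)
```

Each column would then carry its own clone-wise group effect relative to the reference group, so the fused penalty could be applied to each set of effects along the genome.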

6. Supplementary Materials

The proof of Theorem 1 and assumptions A1-A3 referenced in Section 2.2 are given in the Web Appendix, and are available under the Paper Information link at the Biometrics website http://www.biometrics.tibs.org.

Acknowledgements

This research work is partially supported by the National Science Foundation, grants DMS-07-06963, DMS-0706818, and DMS-1007420; the National Institutes of Health, grants R01RGM080503A, R21CA129671; and the National Cancer Institute CA97007. The authors would like to thank two anonymous reviewers, the associate editor, and the editor for constructive comments and suggestions that helped improve the paper significantly.

References

Balleine, R. L., Webster, L. R., Davis, S., Salisbury, E. L., Palazzo, J. P., Schwartz, G. F., Cornfield, D. B., Walker, R. L., Byth, K., Clarke, C. L., and Meltzer, P. S. (2008). Molecular grading of ductal carcinoma in situ of the breast. Clinical Cancer Research 14, 8244–8252.

Ben-Dor, A., Lipson, D., Tsalenko, A., Reimers, M., Baumbusch, L., Barrett, M., Weinstein, J., Borresen-Dale, A., and Yakhini, Z. (2007). Framework for identifying common aberrations in DNA copy number data. Proceedings of RECOMB ’07 4453, 122–136.


Diskin, S., Eck, T., Greshock, J., Mosse, Y., Naylor, T., Stoeckert, C., Weber, B., Maris, J., and Grant, G. (2006). STAC: A method for testing the significance of DNA copy number aberrations across multiple array-CGH experiments. Genome Research 16, 1149–1158.

Eilers, P. H. C. (2003). A perfect smoother. Analytical Chemistry 75, 3631–3636.

Eilers, P. H. C. and Menezes, R. X. (2005). Quantile smoothing of array CGH data. Bioinformatics 21, 1146–1153.

Elston, C. W. and Ellis, I. O. (1991). Pathological prognostic factors in breast cancer. I. The value of histological grade in breast cancer: Experience from a large study with long-term follow-up. Histopathology 19, 403–410.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348–1360.

Fan, J., Tam, P., Woude, G. V., and Ren, Y. (2004). Normalization and analysis of cDNA microarrays using within array replications applied to neuroblastoma cell response to a cytokine. Proceedings of the National Academy of Sciences of the United States of America 101, 1135–1140.

Fridlyand, J., Snijders, A. M., Pinkel, D., Albertson, D. G., and Jain, A. N. (2004). Application of hidden Markov models to the analysis of the array CGH data. Journal of Multivariate Analysis 90, 132–153.

Guha, S., Li, Y., and Neuberg, D. (2008). Bayesian hidden Markov modeling of array CGH data. Journal of the American Statistical Association 103, 485–497.

Huang, J., Gusnanto, A., O’Sullivan, K., Staaf, J., Borg, A., and Pawitan, Y. (2007). Robust smooth segmentation approach for array CGH data analysis. Bioinformatics 23, 2463–2469.

Huang, T., Wu, B., Lizardi, P., and Zhao, H. (2005). Detection of DNA copy number alterations using penalized least squares regression. Bioinformatics 21, 3811–3817.

Hupe, P., Stransky, N., Thiery, J.-P., Radvanyi, F., and Barillot, E. (2004). Analysis of array CGH data: From signal ratio to gain and loss of DNA regions. Bioinformatics 20, 3413–3422.

Klijn, C., Holstege, H., de Ridder, J., Liu, X., Reinders, M., Jonkers, J., and Wessels, L. (2008). Identification of cancer genes using a statistical framework for multiexperiment analysis of nondiscretized array CGH data. Nucleic Acids Research 36, e13.

Koenker, R. and Ng, P. (2003). SparseM: A sparse matrix package for R. Journal of Statistical Software 8(6).

Koenker, R. and Ng, P. (2005). A Frisch-Newton algorithm for sparse quantile regression. Journal Acta Mathematicae Applicatae Sinica (English Series) 21, 225–236.

Koenker, R., Ng, P., and Portnoy, S. (1994). Quantile smoothing splines. Biometrika 81, 673–680.

Lai, T. L., Xing, H., and Zhang, N. (2008). Stochastic segmentation models for array-based comparative genomic hybridization data analysis. Biostatistics 9, 290–307.

Lai, W. R., Johnson, M. D., Kucherlapati, R., and Park, P. J. (2005). Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 21, 3763–3770.

Li, Y. and Zhu, J. (2007). Analysis of array CGH data for cancer studies using fused quantile regression. Bioinformatics 23, 2470–2476.

Lipson, D., Aumann, Y., Ben-Dor, A., Linial, N., and Yakhini, Z. (2006). Efficient calculation of interval scores for DNA copy number data analysis. Journal of Computational Biology 13, 215–228.

Olshen, A. B., Venkatraman, E. S., Lucito, R., and Wigler, M. (2004). Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5, 557–572.

Pinkel, D., Segraves, R., Sudar, D., Clark, S., Poole, I., Kowbel, D., Collins, C., Kuo, W. L., Chen, C., Zhai, Y., Dairkee, S. H., Ljung, B. M., Gray, J. W., and Albertson, D. G. (1998). High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nature Genetics 20, 207–211.

Rouveirol, C., Stransky, N., Hupe, P., Rosa, P. L., Viara, E., Barillot, E., and Radvanyi, F. (2006). Computation of recurrent minimal genomic alterations from array-CGH data. Bioinformatics 22, 849–856.

Rueda, O. M. and Diaz-Uriarte, R. (2010). Finding recurrent copy number alteration regions: A review of methods. Current Bioinformatics 5, 1–17.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6, 461–464.

Shah, S. P. (2008). Computational methods for identification of recurrent copy number alteration patterns by array CGH. Cytogenetic and Genome Research 123, 343–351.

Snijders, A. M., Nowak, N., Segraves, R., Blackwood, S., Brown, N., Conroy, J., Hamilton, G., Hindle, A. K., Huey, B., Kimura, K., Law, S., Myambo, K., Palmer, J., Ylstra, B., Yue, J. P., Gray, J. W., Jain, A. N., Pinkel, D., and Albertson, D. G. (2001). Assembly of microarrays for genome-wide measurement of DNA copy number. Nature Genetics 29, 263–264.

Tibshirani, R. and Wang, P. (2007). Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics 9, 18–29.

Wang, H. (2009). Inference on quantile regression for heteroscedastic mixed models. Statistica Sinica 19, 1247–1261.

Wang, H. and He, X. (2007). Detecting differential expressions in GeneChip microarray studies: A quantile approach. Journal of the American Statistical Association 102, 104–112.

Wang, P., Kim, Y., Pollack, J., Balasubramanian, N., and Tibshirani, R. (2005). A method for calling gains and losses in array CGH data. Biostatistics 6, 45–58.

Willenbrock, H. and Fridlyand, J. (2005). A comparison study: Applying segmentation to array CGH data for downstream analyses. Bioinformatics 21, 4084–4091.

Ylipaa, A., Nykter, M., Kivinen, V., Hu, L., Cogdell, D., Hun, K., Zhang, W., and Yli-Harja, O. (2008). Finding common aberrations in array CGH data. In Proceedings of the 3rd International Symposium on Communications, Control and Signal Processing (ISCCSP 2008), 1199–1204.

Received August 2009. Revised April 2010. Accepted April 2010.