Analytica Chimica Acta 794 (2013) 20–28
Journal homepage: www.elsevier.com/locate/aca

Extension and application of multivariate curve resolution-alternating least squares to four-way quadrilinear data-obtained in the investigation of pollution patterns on Yamuna River, India - A case study

Amrita Malik*, Roma Tauler
Institute of Environmental Assessment and Water Research (IDAEA), Spanish Council for Scientific Research (CSIC), Jordi Girona 18-26, 08034 Barcelona, Catalunya, Spain

* Corresponding author. Tel.: +34 93 400 61 40. E-mail addresses: [email protected], [email protected] (A. Malik), [email protected], [email protected] (R. Tauler).

0003-2670/$ see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.aca.2013.07.047

Highlights

- This study presents a new development of the MCR-ALS method introducing a quadrilinear constraint.
- A long-term four-way environmental dataset is presented as a case study.
- MCR-ALS resolved dominant pollution patterns for the Yamuna River (India) during the years 1999–2005.
- MCR-ALS proves to be a powerful tool to summarize and resolve large multi-dimensional datasets.

Article info

Article history: Received 22 April 2013; received in revised form 10 July 2013; accepted 18 July 2013; available online 29 July 2013.

Keywords: Chemometrics; Multivariate Curve Resolution-Alternating Least Squares; PARAFAC; River pollution; Environmental monitoring; Factor analysis

Abstract

This study focuses on the development and extension of Multivariate Curve Resolution-Alternating Least Squares (MCR-ALS) to the analysis of four-way datasets. The proposed extension of the MCR-ALS method, with non-negativity and the newly developed quadrilinear constraints, can be exploited to summarize and manage huge multidimensional datasets and to resolve their four-way component profiles. In this study, its application is demonstrated by analyzing a four-way data set obtained in a long-term environmental monitoring study (15 sampling sites × 9 variables × 12 months × 7 years) of the Yamuna River, one of the most polluted rivers of India and the largest tributary of the Ganges River. The pollution profiles resolved by MCR-ALS appropriately described the major observed changes in pH, organic pollution, bacteriological pollution and temperature, along with their spatial and temporal distribution patterns for the studied stretch of the Yamuna River. Results obtained by MCR-ALS have also been compared with those obtained by another multi-way method, PARAFAC. The methodology used in this study is completely general and can be applied to other multi-way datasets.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

With the advent of industrialization and increasing population, the demand for a higher-quality environment has increased, stressing the need for advanced research in all areas concerning human health and its environment, and for continuous monitoring of available natural and man-made resources for their efficient management and control. Modern sophisticated and sensitive analytical technologies and instrumentation are providing huge amounts of experimental observations in all research areas related to life sciences (e.g. chemical kinetic studies, pharmaceuticals, medicinal and clinical studies, proteomics, metabolomics, genomics, etc.) which, by nature, are multivariate, multi-set and multi-way (three
or higher number of ways) data structures. In natural environmental studies, datasets of this kind are generated regularly under the environmental programs run by government authorities, research institutes and non-governmental organizations to monitor the condition and quality of natural resources over time and space all over the world. Often, the outcome of these studies is concentration information on multiple chemical compounds collected at different sampling periods from different sampling sites, arranged in large tables, data matrices, or in more complex multi-way data arrays (three or more directions or modes) [1]. These data sets are frequently rather cumbersome to interpret by simple direct observation, which emphasizes the need for appropriate data analysis tools to extract relevant and reliable environmental information about the investigated systems from these huge monitoring data sets. Multi-way data analysis methods can better summarize these huge environmental monitoring and other research data sets, providing a more in-depth interpretation of the relevant information contained in them. Three-way or three-mode data (data arranged in three directions) can be analyzed either using eigenvalue/eigenvector data decompositions, or by trilinear and non-trilinear alternating least-squares (ALS) [2]. In practice, ALS is considered to be an efficient method for multi-way data decompositions, and Multivariate Curve Resolution-Alternating Least Squares (MCR-ALS) has emerged as a powerful tool to analyze multiple data arrays through matrix augmentation, fulfilling different model complexities [3–6]. The main advantage of MCR-ALS [7], in this context, is that it is easily adapted to data sets of different complexity and structure (bilinear, trilinear or multilinear), providing optimal least-squares solutions. MCR-ALS has been used to analyze various datasets that can be described by a bilinear model, related to many kinds of processes and mixtures such as chemical reactions, industrial processes, chromatographic analysis, spectroscopic measurements and environmental data, monitored by diverse multivariate responses [8]. The success and generalized use of MCR-ALS is related to the possibility of working with multi-way and multi-set data structures, i.e., analyzing several data tables simultaneously [4–6,9].

For the analysis of four-way quadrilinear data, the available methodologies are usually based on quadrilinear alternating least-squares and non-quadrilinear latent structure models. The multi-way PARAFAC method [10] and some of its complementary variants, such as alternating penalty quadrilinear decomposition [11] and alternating weighted residual constraint quadrilinear decomposition [12], are available for processing data complying with the quadrilinear model condition [13]. This work is focused on the extension of MCR-ALS to handle four-way data under non-negativity and the newly developed quadrilinearity constraint. In its simplest configuration, MCR-ALS is based only on the fulfillment of the bilinear model and on having only one data mode in common. However, if the data have two or more modes in common, multilinear models, like the trilinear or the quadrilinear models, can optionally be implemented inside the ALS algorithm as constraints. Using this approach, MCR-ALS has already been extended and applied to analyze three-way trilinear data [14]. To demonstrate the efficiency of MCR-ALS with the newly developed quadrilinear constraint, a four-way dataset originating from the long-term regular monitoring campaign (1999–2005) of the Yamuna River, in India, was used as a case study to resolve the dominant pollution patterns and their distribution along the time and geographical axes. Application of the new quadrilinear MCR-ALS algorithm is shown to be able to resolve specific pollution patterns in the temporal and geographical modes. For comparison purposes, the dataset was also analyzed with the PCA and PARAFAC methods, which are traditionally used for this kind of study. This study is intended to help in understanding the use of multi-way methods to analyze the huge datasets collected and recorded in large projects or environmental monitoring reports, thus helping decision-making authorities to identify the main contamination patterns over a particular geographical area and time frame, and to arrive at alternative solutions for the health and management of environmental resources.

2. Methods

2.1. Dataset

The dataset used in this study was obtained from the regular monitoring of the Yamuna River by the Central Pollution Control Board (CPCB), India [15]. The Yamuna River is one of the most polluted rivers of India and the largest tributary of the Ganges River. The total length of the Yamuna River from its origin (31°2′12″ N and 78°26′10″ E) to its confluence with the Ganges is 1376 km. The Yamuna River covers the water demands of rural and urban settlements like Delhi, Mathura, Agra and Allahabad. In turn, the river receives the outfall of a number of drains carrying domestic and industrial wastes rich in organic matter. To check the Yamuna River for water quality control and mitigation purposes, CPCB regularly monitors the river at selected sampling sites. For this study, monthly river quality data from seven years (1999 to 2005) have been used. Sampling sites and parameters were selected based on their availability and continuity during the study period. The final dataset consisted of 15 sampling sites (S1–S15) and 9 measured variables or parameters (pH, COD, BOD, NH4, TKN, DO, WT (Water Temperature), TC (Total Coliforms), FC (Fecal Coliforms)) monitored every month (12 measurements) for 7 years (1999–2005). The locations of the sampling points on the Yamuna River are presented in Fig. 1. The total number of data values simultaneously analyzed is therefore 15 × 9 × 12 × 7 = 11,340 individual values.

2.2. Data organization

Environmental datasets can be arranged according to the number of ways or modes they have. For example, a data set obtained in a single monitoring campaign measuring multiple variables or parameters over a set of samples is a two-way data set consisting of two ways or modes: samples (first mode) and variables or parameters (second mode), or vice versa, which can be arranged in a two-way data table or data matrix. If, instead, the dataset consists of the same variables or parameters measured under different conditions (e.g. at different times) for every sample, then one data matrix is collected per sample and a three-way data set forming a cube is obtained, i.e. the three ways or modes are: samples (first mode), variables (second mode) and conditions (third mode). The terms 'ways' and 'modes' are analogous and are used interchangeably in this work. A dataset having a three-way or three-mode data cube per sample will give a four-way data structure. This is the case, for instance, when, as in this work, two types of time conditions (months and years) are measured per sample and variable. As described earlier, the dataset used in this case study was obtained by the regular monitoring of 9 water quality parameters at 15 different (but fixed) sites of the Yamuna River in every month of the year and for seven consecutive years. Thus, this dataset has four modes: first mode (samples), second mode (variables), third mode (months) and fourth mode (years).
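As a minimal sketch of this organization, the four modes can be mapped onto the axes of a four-dimensional NumPy array. The array below is filled with random placeholder numbers, not the CPCB measurements, and all variable names are illustrative assumptions.

```python
import numpy as np

# Mode sizes taken from the text: sites, variables, months, years.
I, J, K, L = 15, 9, 12, 7

# Placeholder data only (random, non-negative); NOT the real measurements.
rng = np.random.default_rng(0)
D4 = rng.random((I, J, K, L))

# Total number of individual values held in the four-way array.
n_values = D4.size

# One sites-by-variables matrix for a given (month, year) pair.
slice_first_month_first_year = D4[:, :, 0, 0]

# A three-way cube (sites x variables x months) for a single year.
cube_first_year = D4[:, :, :, 0]

print(n_values, slice_first_month_first_year.shape, cube_first_year.shape)
```

Indexing the last two axes gives the individual two-way slices, while fixing only the year axis gives the three-way cube described above.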

The whole data set was initially arranged into 7 matrices, one per year, each with 180 rows (15 sampling sites × 12 months) and 9 columns (variables). These 7 matrices can be arranged in different ways for their simultaneous analysis. In a first data arrangement, the 7 matrices were concatenated one on top of the other, with their columns (9 variables) sharing the same column vector space, resulting in a large column-wise augmented data matrix ($\mathbf{D}_{aug}$) with dimensions of 1260 rows (7 years, 12 months, 15 sampling sites) and 9 columns (variables or parameters measured). In a second possible arrangement, the whole dataset was arranged in a single four-way data structure ($\mathbf{D}$) of dimensions 15 sampling sites × 9 variables × 12 months × 7 years. The whole data set contained less than five percent of missing values, and these values were replaced using a moving average method [16] before any data pre-treatment. As the dataset follows a time series pattern (the measurements were taken every month for seven consecutive years, giving 84 continuous observations for every site), the moving averaging was performed in the date (month/year) mode for every site.

Fig. 1. Map of the study area (Yamuna River) showing sampling locations.

2.3. Data pre-treatment

Different data pre-treatments were tested to see which of them provided an easier and more straightforward interpretation of the data. Mean-centering is not appropriate for source apportionment studies, since it produces negative values and consequently non-negativity constraints, which are the basic natural constraints in Multivariate Curve Resolution studies of environmental monitoring data, cannot be applied [9,17]. On the other hand, scaling the variables to unit variance (dividing each of them by its standard deviation over all measurements) is useful because of their different units and scales, and it allowed for a better environmental interpretation of the data. To compare the results of PCA and PARAFAC with those of MCR-ALS, scaling was done in the variable mode. For PCA and MCR-ALS, individual data matrices were augmented column-wise, keeping the variable space common, and all columns were scaled to unit standard deviation (scaling within the second data mode). In the case of PARAFAC, scaling was also performed within the second mode [18].
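The column-wise augmentation and the scaling step can be sketched in NumPy as follows. This is an illustration on random placeholder data; the row ordering (year, then month, then site), the use of the sample standard deviation (`ddof=1`) and all names are assumptions, not details confirmed by the paper.

```python
import numpy as np

I, J, K, L = 15, 9, 12, 7              # sites, variables, months, years
rng = np.random.default_rng(1)
D4 = rng.random((I, J, K, L)) + 0.1    # placeholder, strictly positive

# Column-wise augmentation: stack the K*L (sites x variables) slices on top
# of each other, with months running fastest within each year.
# Axis order (year, month, site, variable) -> rows indexed by (year, month, site).
D_aug = D4.transpose(3, 2, 0, 1).reshape(L * K * I, J)

# Scale each variable (column) to unit standard deviation WITHOUT
# mean-centering, so that non-negative data stay non-negative.
col_std = D_aug.std(axis=0, ddof=1)    # ddof=1 is an assumption
D_scaled = D_aug / col_std

# Autoscaling (centering + scaling), shown only for contrast:
# centering introduces negative values.
D_auto = (D_aug - D_aug.mean(axis=0)) / col_std
```

Keeping the data non-negative after scaling is what allows the non-negativity constraint to be used later in the resolution.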

2.4. Chemometric methods

2.4.1. Principal component analysis (PCA)

PCA performs a bilinear orthogonal decomposition of the experimental data matrix into the product of two new factor matrices (scores and loadings) of reduced dimensions (principal components) explaining most of the information (variance) of the original dataset. Further details about the well-known PCA method can be found elsewhere [19,20]. The bilinear orthogonal decomposition of PCA in matrix form can be written as:

$$\mathbf{D} = \mathbf{X}\mathbf{Y}^{T} + \mathbf{E} \tag{1}$$

where $\mathbf{D}$ is the original data matrix, $\mathbf{Y}^{T}$ is the loadings matrix, $\mathbf{X}$ is the scores matrix and $\mathbf{E}$ is the residual matrix.
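One common way to obtain the decomposition of Eq. (1) is through a truncated singular value decomposition. The sketch below uses NumPy on a random placeholder matrix; it is not the software used by the authors, and the choice of four components and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
D = rng.random((1260, 9))            # placeholder for a scaled data matrix

n_comp = 4                           # number of principal components kept
U, s, Vt = np.linalg.svd(D, full_matrices=False)

X = U[:, :n_comp] * s[:n_comp]       # scores (rows x components)
Yt = Vt[:n_comp, :]                  # loadings, transposed (components x variables)
E = D - X @ Yt                       # residual matrix of Eq. (1)

# Explained variance relative to the total sum of squares of D
# (no mean-centering is applied here).
explained = 100.0 * (1.0 - np.sum(E**2) / np.sum(D**2))
```

The loadings obtained this way are orthonormal and the score vectors are mutually orthogonal, which is the "orthogonal" part of the bilinear decomposition.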


2.4.2. PARAFAC

The PARAFAC method can be considered the extension of the bilinear PCA model to multi-way data arrays (data arrays having more than two ways or modes). The elementwise PARAFAC model equation for the decomposition of a three-way dataset $\mathbf{D}_{IJK}$ can be represented mathematically as a sum of products of component contributions in the different modes ($I$ for the number of rows, $J$ for the number of columns and $K$ for the number of matrices or slices in the three-way data or data cube) as follows [10,21]:

$$d_{ijk} = \sum_{n=1}^{N} x_{in}\, y_{jn}\, z_{kn} + e_{ijk} \tag{2}$$

where $d_{ijk}$ represents the $ijk$th element in the three-way dataset, $N$ is the number of components or rank of the data common to all three modes, $x_{in}$, $y_{jn}$ and $z_{kn}$ are the elements of the three factor matrices $\mathbf{X}$, $\mathbf{Y}$ and $\mathbf{Z}$ (loadings in the three modes) used to obtain $d_{ijk}$, and $e_{ijk}$ is the residual term (variance not explained by the model). The PARAFAC decomposition of each individual data matrix (cube slice) can be represented as follows:

$$\mathbf{D}_{k} = \mathbf{X}\mathbf{Z}_{k}\mathbf{Y}^{T} + \mathbf{E}_{k} = \mathbf{X}_{k}\mathbf{Y}^{T} + \mathbf{E}_{k} \tag{3}$$

where $k$ refers to the matrix considered and $\mathbf{Z}_{k}$ is a diagonal matrix whose elements are in the $k$th column of the loadings matrix $\mathbf{Z}$. The $\mathbf{X}_{k}$ matrices share the same first-mode loadings matrix $\mathbf{X}$ and differ only in the diagonal scaling matrix $\mathbf{Z}_{k}$, fulfilling the requirement of the trilinear model assumed by PARAFAC.

The extension of the matrix notation of the PARAFAC model in Eq. (3) to a quadrilinear model can be written as:

$$\mathbf{D}_{kl} = \mathbf{X}\mathbf{Z}_{k}\mathbf{W}_{l}\mathbf{Y}^{T} + \mathbf{E}_{kl} = \mathbf{X}_{kl}\mathbf{Y}^{T} + \mathbf{E}_{kl} \tag{4}$$

where now $\mathbf{D}_{kl}$ refers to the slice or matrix identified by $k$ in the third mode (months) and $l$ in the fourth mode (years), and $\mathbf{X}$, $\mathbf{Z}$, $\mathbf{W}$ and $\mathbf{Y}$ are the loading matrices corresponding to the four modes. The PARAFAC model assumes and requires extraction of the same number of components in each mode of the multi-way dataset. The core consistency diagnostic is the main criterion used for the selection of the number of components in the PARAFAC model [10]. For a PARAFAC model with one component in each mode, the core consistency is by default 100%. In this study, the non-negativity constraint was also applied in all modes to allow a better comparison with the results obtained by MCR-ALS.
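To make Eqs. (2)–(4) concrete, the following NumPy sketch builds noise-free trilinear and quadrilinear arrays from hypothetical loading matrices and checks that the slice-wise matrix form agrees with the elementwise form. The loadings are random placeholders stored with one column per component (a layout chosen here for convenience, transposed with respect to the (N, K) layout mentioned elsewhere in the text), and the residual terms are omitted.

```python
import numpy as np

I, J, K, L, N = 15, 9, 12, 7, 2
rng = np.random.default_rng(3)

# Hypothetical non-negative loadings, one column per component.
X = rng.random((I, N))   # first mode  (sites)
Y = rng.random((J, N))   # second mode (variables)
Z = rng.random((K, N))   # third mode  (months)
W = rng.random((L, N))   # fourth mode (years)

# Eq. (2): trilinear model, elementwise sum over the N components.
D3 = np.einsum("in,jn,kn->ijk", X, Y, Z)

# Eq. (3): the same model written for a single slice k.
k = 4
Zk = np.diag(Z[k])                 # diagonal scaling matrix for slice k
D_k = X @ Zk @ Y.T

# Eq. (4): quadrilinear model and one of its (k, l) slices.
D4 = np.einsum("in,jn,kn,ln->ijkl", X, Y, Z, W)
l = 2
Wl = np.diag(W[l])                 # diagonal scaling matrix for slice l
D_kl = X @ Zk @ Wl @ Y.T

trilinear_ok = np.allclose(D3[:, :, k], D_k)
quadrilinear_ok = np.allclose(D4[:, :, k, l], D_kl)
```

Every slice shares the same site and variable profiles; only the diagonal scaling factors change from one month or year to the next.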

2.4.3. MCR-ALS

In this work we propose the extension of MCR-ALS to four-way data with non-negativity and quadrilinear constraints. The extension of the MCR-ALS method is based on the analysis of augmented data matrices. Details about the extension of the MCR-ALS method to three-way data, considering a trilinear model and the possible interaction between components, have already been described [14].

If the dataset under study has individual data matrices with the same number of rows and columns, then these matrices can be augmented row- or column-wise. For example, if a three-way dataset $\mathbf{D}_{IJK}$ consists of $K$ individual data matrices with the same number of rows $I$ and columns $J$ (i.e. with three different modes), then a column-wise augmented data matrix, $\mathbf{D}_{aug}$, can be obtained as $[\mathbf{D}_{1}; \mathbf{D}_{2}; \mathbf{D}_{3}; \ldots; \mathbf{D}_{K}]$, where the semicolon ';' indicates that the different data matrices $\mathbf{D}_{k}$ ($k = 1, \ldots, K$) are column-wise appended, one on top of the other, keeping the columns in common.

A four-way dataset $\mathbf{D}_{IJKL}$ will consist of a number $K \times L$ of individual data slices or data matrices $\mathbf{D}_{kl}$ (the sub-indices identify the third and fourth modes, $k = 1, 2, 3, \ldots, K$ and $l = 1, 2, 3, \ldots, L$, respectively). All of these individual matrices have the same number of rows $I$ and columns $J$. The column-wise augmentation of all these $\mathbf{D}_{kl}$ matrices will produce the super-augmented data matrix $\mathbf{D}_{aug}$ (Fig. 2), which in MATLAB notation can be presented as:

$$\mathbf{D}_{aug} = [\mathbf{D}_{11}; \mathbf{D}_{21}; \ldots; \mathbf{D}_{K1}; \mathbf{D}_{12}; \mathbf{D}_{22}; \ldots; \mathbf{D}_{K2}; \ldots; \mathbf{D}_{1L}; \mathbf{D}_{2L}; \ldots; \mathbf{D}_{KL}] \tag{5}$$

The bilinear model equation for the corresponding four-way augmented data matrix ($\mathbf{D}_{aug}$), in MATLAB notation, can then be written as:

$$\begin{aligned}
\mathbf{D}_{aug} &= [\mathbf{D}_{11}; \mathbf{D}_{21}; \ldots; \mathbf{D}_{K1}; \mathbf{D}_{12}; \mathbf{D}_{22}; \ldots; \mathbf{D}_{K2}; \ldots; \mathbf{D}_{1L}; \mathbf{D}_{2L}; \ldots; \mathbf{D}_{KL}] \\
&= [\mathbf{X}_{11}; \mathbf{X}_{21}; \ldots; \mathbf{X}_{K1}; \mathbf{X}_{12}; \mathbf{X}_{22}; \ldots; \mathbf{X}_{K2}; \ldots; \mathbf{X}_{1L}; \mathbf{X}_{2L}; \ldots; \mathbf{X}_{KL}]\,\mathbf{Y}^{T} \\
&\quad + [\mathbf{E}_{11}; \mathbf{E}_{21}; \ldots; \mathbf{E}_{K1}; \mathbf{E}_{12}; \mathbf{E}_{22}; \ldots; \mathbf{E}_{K2}; \ldots; \mathbf{E}_{1L}; \mathbf{E}_{2L}; \ldots; \mathbf{E}_{KL}] \\
&= \mathbf{X}_{aug}\mathbf{Y}^{T} + \mathbf{E}_{aug}
\end{aligned} \tag{6}$$

where $\mathbf{Y}^{T}$ is the second-mode loadings matrix, related to the description of the variable profiles (in this work, measured parameters and concentrations). The augmented scores matrix $\mathbf{X}_{aug} = [\mathbf{X}_{11}; \mathbf{X}_{21}; \ldots; \mathbf{X}_{K1}; \mathbf{X}_{12}; \mathbf{X}_{22}; \ldots; \mathbf{X}_{K2}; \ldots; \mathbf{X}_{1L}; \mathbf{X}_{2L}; \ldots; \mathbf{X}_{KL}]$ compiles all sample scores of the resolved components in each related data matrix ($\mathbf{D}_{kl}$). It is important to note that the augmented scores matrix $\mathbf{X}_{aug}$ will have the scores/loadings of three data modes intermixed: first mode (sites), third mode (months) and fourth mode (years) loadings, corresponding to the loading matrices $\mathbf{X}$, $\mathbf{Z}$ and $\mathbf{W}$, respectively.
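The way the three modes are intermixed in the augmented scores can be illustrated with a short NumPy sketch. It assumes exactly quadrilinear, noise-free data built from random placeholder loadings (one column per component, a layout chosen for this illustration), and it compares the bilinear product of Eq. (6) with the explicit stacking of Eq. (5).

```python
import numpy as np

I, J, K, L, N = 15, 9, 12, 7, 2
rng = np.random.default_rng(4)

# Hypothetical non-negative loadings, one column per component.
X = rng.random((I, N))   # sites
Y = rng.random((J, N))   # variables
Z = rng.random((K, N))   # months
W = rng.random((L, N))   # years

# Augmented scores: the row for (year l, month k, site i) and component n
# equals w_ln * z_kn * x_in, so the three modes are intermixed in one column.
X_aug = np.einsum("ln,kn,in->lkin", W, Z, X).reshape(L * K * I, N)

# Bilinear model of Eq. (6), without the residual term.
D_aug = X_aug @ Y.T

# Explicit stacking [D11; D21; ...; DK1; D12; ...; DKL] as in Eq. (5).
blocks = [X @ np.diag(Z[k] * W[l]) @ Y.T for l in range(L) for k in range(K)]
D_stacked = np.vstack(blocks)

same = np.allclose(D_aug, D_stacked)
```

For quadrilinear data, each column of the augmented scores matrix is therefore fully determined by one site profile, one month profile and one year profile.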

The extension of MCR-ALS to analyze four-way quadrilinear data implies constraining the individual profiles of the augmented scores matrix $\mathbf{X}_{aug}$ to fulfill the quadrilinearity condition during the ALS optimization steps. The incorporation of the quadrilinear constraint during the ALS optimization of the $\mathbf{X}_{aug}$ matrix is presented graphically in Fig. 2. As in any other ALS procedure, the first step is to estimate the number of components and to obtain an initial estimate of either the $\mathbf{X}_{aug}$ or the $\mathbf{Y}^{T}$ matrix [3,14]. These initial estimates are then optimized iteratively by ALS, and at each iteration a new estimate of the augmented scores and loadings matrices is obtained. At each iteration, different constraints such as non-negativity, normalization (of the second-mode loadings $\mathbf{Y}^{T}$) and quadrilinearity (optional) are introduced. The constrained iterative optimization is carried out until convergence is achieved or until a preselected number of cycles is reached. This process can be repeated for different numbers of components and initial estimates, until the most satisfactory answer is achieved, both from a mathematical point of view (lack of fit, explained variance) and from a physico-chemical point of view (the shape of the resolved profiles and their interpretation should be reasonable) [14].
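The iterative scheme described above can be sketched as a small NumPy function. This is a deliberately simplified illustration and not the authors' MATLAB toolbox: non-negativity is imposed by clipping negative values to zero rather than by a dedicated non-negative least-squares solver, the initial estimate is assumed to be supplied by the caller, and the function name `mcr_als`, its parameters and the lack-of-fit stopping rule are assumptions made for this example.

```python
import numpy as np

def mcr_als(D, Yt0, max_iter=200, tol=1e-8):
    """Bilinear ALS with non-negativity imposed by clipping (a simplification)."""
    Yt = Yt0.copy()
    prev_lof = np.inf
    for _ in range(max_iter):
        # Scores given loadings: least squares, then clip negatives to zero.
        X = np.clip(D @ np.linalg.pinv(Yt), 0.0, None)
        # (An optional multilinearity constraint on X would be applied here.)
        # Loadings given scores: least squares, then clip negatives to zero.
        Yt = np.clip(np.linalg.pinv(X) @ D, 0.0, None)
        # Normalize each loading profile (row of Yt) to unit length and
        # move the scale into the scores so that X @ Yt is unchanged.
        norms = np.linalg.norm(Yt, axis=1, keepdims=True)
        norms[norms == 0] = 1.0
        Yt = Yt / norms
        X = X * norms.T
        # Lack of fit (%) relative to the total sum of squares of D.
        E = D - X @ Yt
        lof = 100.0 * np.sqrt(np.sum(E**2) / np.sum(D**2))
        if abs(prev_lof - lof) < tol:
            break
        prev_lof = lof
    return X, Yt, lof
```

In practice the function would be called with the scaled augmented data matrix and an initial estimate of the variable profiles, and rerun with different numbers of components or starting points, as the text describes.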

Fig. 2. Implementation of the quadrilinear model constraint in MCR-ALS.

In Fig. 2, the whole procedure used for the implementation of the quadrilinear constraint can be explained as follows:

(1) First, the bilinear model decomposition of the augmented data matrix $\mathbf{D}_{aug}$ is performed according to Eq. (6), for $N$ components ($n = 1$ to $N$).

(2) To apply the quadrilinear constraint, the profiles of the same component $n$, $\mathbf{x}^{n}_{IKL}$, in each of the different $\mathbf{X}_{kl}$ matrices are folded (in a similar way as is done for the implementation of the trilinearity constraint in MCR-ALS [14]) to give the single-component profile matrix $\mathbf{X}^{n}_{IK,L}$, with $I \times K$ rows and $L$ columns (where $K \times L$ is the number of matrices simultaneously analyzed, see Fig. 2):

$$\mathbf{x}^{n}_{aug}\,(\mathbf{x}^{n}_{IKL}) \xrightarrow{\text{refolding}} \mathbf{X}^{n}_{IK,L} \tag{7}$$

(3) Matrix $\mathbf{X}^{n}_{IK,L}$ is then approximated by its first-component bilinear decomposition (obtained by SVD [22]) as follows:

$$\mathbf{X}^{n}_{IK,L} \approx \mathbf{x}^{n}_{IK}\,\mathbf{w}^{n} \tag{8}$$

where the new vector $\mathbf{x}^{n}_{IK}$ of dimensions $(I \times K, 1)$ has the loadings of the $n$th component in the first and third modes (intermixed), and the vector $\mathbf{w}^{n}$ of dimensions $(1, L)$ has the loadings of the $n$th component in the fourth mode (years). This process is repeated for every component, giving the loadings matrix in the fourth mode, $\mathbf{W}$ of dimensions $(N, L)$, where $N$ is the number of desired or selected components ($n = 1, \ldots, N$) and $L$ ($l = 1, \ldots, L$) refers to the total number of conditions in the fourth mode (in this work, the number of years).

(4) Now, to recover the loading profiles in the first and third modes, the previously obtained one-component profiles $\mathbf{x}^{n}_{IK}$ of dimensions $(IK, 1)$ are folded again to give the single-component profile matrix $\mathbf{X}^{n}_{I,K}$ with $I$ rows and $K$ columns:

$$\mathbf{x}^{n}_{IK} \xrightarrow{\text{refolding}} \mathbf{X}^{n}_{I,K} \tag{9}$$

(5) Matrix $\mathbf{X}^{n}_{I,K}$ is then approximated by its first-component bilinear decomposition, by SVD [22]:

$$\mathbf{X}^{n}_{I,K} \approx \mathbf{x}^{n}\,\mathbf{z}^{n} \tag{10}$$

where the vector $\mathbf{x}^{n}$ of dimensions $(I, 1)$ has the loadings of the $n$th component in the first mode (sampling sites), and $\mathbf{z}^{n}$ of dimensions $(1, K)$ has the loadings of the $n$th component in the third mode (months). This process is repeated for every component, giving the loadings matrices in the first and third modes, $\mathbf{X}$ of dimensions $(N, I)$ and $\mathbf{Z}$ of dimensions $(N, K)$, where $N$ is the number of selected components ($n = 1, \ldots, N$). Steps (2)–(5) give the loading matrices $\mathbf{W}$, $\mathbf{X}$ and $\mathbf{Z}$ corresponding to the fourth, first and third modes.

Incorporation of this quadrilinear constraint in every step of the ALS iterative optimization procedure forces the shape of the loading vector profiles in the different modes of the same component to be the same for the whole data set. The main advantage of the quadrilinear constraint as implemented in MCR-ALS is that it is applied independently and optionally to each component of the data set, giving more flexibility to the whole data analysis and making it possible to test both full and partial quadrilinear models.
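Steps (2)–(5) can be sketched in NumPy for a single component as follows. This is an illustrative reading of the procedure, not the authors' MATLAB routine: the row ordering of the augmented profile (year, then month, then site) is assumed to follow Eq. (5), the sign convention applied after the SVD is an added assumption, and the function names are hypothetical.

```python
import numpy as np

def rank_one(M):
    """First SVD component of M as outer(u, v), with the arbitrary SVD sign
    chosen so that u sums to a non-negative value."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    u, v = U[:, 0] * s[0], Vt[0]
    if u.sum() < 0:            # flip both vectors together; the product is unchanged
        u, v = -u, -v
    return u, v

def quadrilinear_constraint(x_aug, I, K, L):
    """Force one component's augmented profile (length L*K*I) to be the
    product of a site profile x, a month profile z and a year profile w."""
    # Steps (2)-(3): fold into an (I*K) x L matrix, keep the first SVD component.
    X_IK_L = x_aug.reshape(L, K * I).T
    x_IK, w = rank_one(X_IK_L)
    # Steps (4)-(5): fold into an I x K matrix, keep the first SVD component.
    X_I_K = x_IK.reshape(K, I).T
    x, z = rank_one(X_I_K)
    # Rebuild the constrained augmented profile in the original row order.
    x_new = np.einsum("l,k,i->lki", w, z, x).reshape(-1)
    return x_new, x, z, w
```

Inside an ALS loop, such a function would be applied column by column to the augmented scores matrix, and only to those components for which the quadrilinear model is to be imposed, which is how full and partial quadrilinear models can both be tested.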

The MCR-ALS program has been written in MATLAB and its current version for bilinear and trilinear data is freely available as a toolbox at the web address http://www.mcrals.info/. A preliminary implementation of the MCR-ALS MATLAB routine including the quadrilinear constraint is available upon request from the authors. The PARAFAC routines used in this work belong to the N-way toolbox of R. Bro and C. Andersson and have been downloaded from their website (http://www.models.kvl.dk/sources/nwaytoolbox/).

3. Results and discussion

Although the main focus of this paper is on the application of MCR-ALS to four-way environmental data, the results obtained by PCA and PARAFAC are also given in brief.
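Table 1 of this section reports, for each model, the variance explained by the individual components, the variance captured by the full model (ALL), the sum of the individual contributions (SUM) and an overlap percentage. As a small worked check, the SUM and overlap columns can be recomputed from the other entries; the sketch below copies the values from Table 1 and assumes that the overlap is simply SUM minus ALL, which matches the tabulated figures to within rounding.

```python
# Values copied from Table 1 (scaled data): per-component explained
# variances (%) and the variance captured by the full model, ALL (%).
table1 = {
    "PCA": ([97.35, 1.22, 0.42, 0.33], 99.32),
    "PARAFAC (2 components)": ([96.38, 4.90], 98.01),
    "MCR-ALS (2 components)": ([98.38, 5.04], 98.00),
    "MCR-ALS (4 components)": ([94.92, 7.22, 5.53, 0.26], 98.41),
}

def sum_and_overlap(components, all_var):
    """Return SUM and the overlap, taken here as SUM - ALL (an assumption)."""
    total = round(sum(components), 2)
    return total, round(total - all_var, 2)

results = {name: sum_and_overlap(c, a) for name, (c, a) in table1.items()}
for name, (total, overlap) in results.items():
    print(f"{name}: SUM = {total:.2f} %, overlap = {overlap:.2f} %")
```

For the orthogonal PCA components the overlap vanishes, whereas for the non-orthogonal PARAFAC and MCR-ALS components the individual contributions add up to more than the variance captured by the full model.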

3.1. PCA results

To see the effect of data pre-treatment on variances explainedby PCA, the analysis was performed on raw, scaled and auto-scaled(mean zero and standard deviation one) data. PCA explained nearlyall raw data variance (close to 100%) when two components were

considered. In the case of scaled and autoscaled data, 99.32%, and80.50% of the total variance, were explained respectively, when4 components were considered. Measured pH values in the riverwater dataset showed little variation, so when all variables were
Page 6: Extension and application of multivariate curve resolution-alternating least squares to four-way quadrilinear data-obtained in the investigation of pollution patterns on Yamuna River,

A. Malik, R. Tauler / Analytica Chimica Acta 794 (2013) 20– 28 25

Table 1Explained variances by the full models and by the individual components.

Model Explained variance (for scaled data) (%) Overlapping (%)

Comp. 1a Comp. 2a Comp. 3a Comp. 4a ALLb SUMc

PCA 97.35 1.22 0.42 0.33 99.32 99.32 0PARAFAC (2 components) 96.38 4.90 98.01 101.28 3.27MCR-ALS (2 components)d 98.38 5.04 98.00 103.42 5.42MCR-ALS (4 components)d 94.92 7.22 5.53 0.26 98.41 107.92 9.52

a The explained variances by individual components.

svhtifidsgrPPPppiqep(llt

3

wwcwdoii

nwm

TEP

mode in two new data modes, years and months. Therefore, theuse of the quadrilinear constraint can be adequate in this case to

b Total variances captured by full model (ALL).c Sum of explained variance by individual components (SUM).d Using quadrilinear model and non-negativity constraints.

caled with their respective standard deviation, new scaled pHalues were explaining most of the data variance. On the otherand, when other strategies (like logarithmic scaling or Max–Minransformation) were applied to scale variables, significant featuresn the dataset were suppressed. The autoscaled data recovered pro-les similar to those obtained with scaled data (with unit standardeviation), but with negative loadings and scores. As a conclu-ion, data scaled with their standard deviation (see Section 2.3)ave reliable underlying profiles with physical meaning and envi-onmental interpretation. To keep the results comparable withARAFAC and MCR-ALS using non-negativity constraints, finally,CA results obtained with scaled data were considered (Table 1).C1 (97.35% explained variance) showed high positive loadings onH with DO and WT, and PC2 (1.22% explained variance) had highositive loadings on COD, BOD, NH4-N, and TKN with negative load-

ng on DO. PC1 suggests pH as a source of variation in river wateruality and PC2 represents organic pollution sources. PC3 (0.42%xplained variance) and PC4 (0.33% explained variance) had highositive loadings on coliforms (TC and FC) and water temperatureWT), respectively. In the scores plot obtained by PCA, due to thearge number of samples included in the analysis (at different samp-ing locations, months and years) it was difficult to distinguish howhe patterns defined by PCs are distributed among samples.

.2. PARAFAC results

Three-way (15 × 9 × 84 (7 years × 12 months)) PARAFAC modelsith 2, 3 and 4 components were tested and their data fitting alongith explained variances are provided in Table 2. Based on core

onsistency [23], the PARAFAC model with two components (2,2,2)as more reliable. This model explained 98.16% of total (scaled)ata variance with 99.00% core consistency, inferring the adequacyf this trilinear model with two components. The obtained resultsn the form of figures (Fig. SI-1) are provided in the supplementarynformation (SI) for brevity.

Table 2
Explained variances and core consistency of various three-way and four-way PARAFAC models.

Model               Components   Exp. var (%)   Core consistency (%)
Three-way PARAFAC   1            97.19          100
                    2            98.16          99.02
                    3            98.45          48.11
                    4            98.73          45.24
Four-way PARAFAC    1            97.11          100
                    2            98.01          92.12
                    3            98.27          −0.05
                    4            98.43          −0.00

Four-way (15 × 9 × 12 × 7) PARAFAC models with different number of components are presented in Table 2. Like for three-way PARAFAC, core consistency (92.12%) of the two-component model also suggested the appropriateness of the model explaining 98.01% of the total (scaled) data variance. A core consistency value above 90% can be interpreted as fulfilling the multilinear model [23], and in this case it suggests the possible quadrilinear structure of the data. In Table 1, explained variances by individual components, their sum of explained variance (SUM) and the total variance captured by full model (ALL) are given. Profiles resolved by the two-component PARAFAC quadrilinear model are provided in SI (Fig. SI-2).
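
For reference, the trilinear and quadrilinear models compared here can be written element-wise as below. The symbols (a, b, c, d for the site, variable, month and year profiles, g for the merged time profile, N for the number of components, e for the residuals) are notation assumed for this illustration and may differ from the notation used elsewhere in the paper.

```latex
% Three-way (trilinear) model, merged time mode t = 1, ..., 84
x_{ijt} = \sum_{n=1}^{N} a_{in}\, b_{jn}\, g_{tn} + e_{ijt}

% Four-way (quadrilinear) model: i = 1..15 sites, j = 1..9 variables,
% k = 1..12 months, l = 1..7 years
x_{ijkl} = \sum_{n=1}^{N} a_{in}\, b_{jn}\, c_{kn}\, d_{ln} + e_{ijkl}
```

The quadrilinear model is the trilinear one with the additional requirement that each merged time profile factorises as g_{tn} = c_{kn} d_{ln}, that is, the same monthly pattern recurs every year up to a yearly scaling factor.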

3.3. MCR-ALS results

MCR-ALS with non-negativity and either bilinear, trilinear or quadrilinear modeling and with different number of components was applied to the dataset under study in this work. However, as said before, the focal point of this work is the extension and application of MCR-ALS to handle four-way data, and therefore, only results obtained with the non-negativity and quadrilinear constraints will be given here in more detail.
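
One way to picture a quadrilinear constraint inside an alternating least squares loop is sketched below. This is an illustrative reading only, not the authors' implementation: it assumes that the long profile of one component (running over sites, months and years in that order) is folded into a three-way array, replaced by a rank-one approximation, and unfolded again. The function names and the alternating rank-one routine are hypothetical choices.

```python
import numpy as np

def rank_one_3way(T, n_iter=100):
    """Rank-one approximation of a 3-way array by alternating (power-type) updates."""
    _, K, L = T.shape
    b = np.full(K, 1.0 / np.sqrt(K))
    c = np.full(L, 1.0 / np.sqrt(L))
    for _ in range(n_iter):
        a = np.einsum('ikl,k,l->i', T, b, c)
        a /= np.linalg.norm(a)
        b = np.einsum('ikl,i,l->k', T, a, c)
        b /= np.linalg.norm(b)
        c = np.einsum('ikl,i,k->l', T, a, b)
        c /= np.linalg.norm(c)
    weight = np.einsum('ikl,i,k,l->', T, a, b, c)  # scale of the rank-one term
    return weight, a, b, c

def apply_quadrilinear_constraint(profile, n_sites, n_months, n_years):
    """Fold one long profile, force it to be rank one, and unfold it again."""
    T = profile.reshape(n_sites, n_months, n_years)
    w, a, b, c = rank_one_3way(T)
    return (w * np.einsum('i,k,l->ikl', a, b, c)).reshape(-1)

# Small demonstration with a noisy, nearly rank-one profile (placeholder values).
rng = np.random.default_rng(0)
exact = np.einsum('i,k,l->ikl', [1.0, 2.0, 3.0], [0.5, 1.5], [2.0, 1.0, 4.0, 3.0]).reshape(-1)
noisy = exact + 0.1 * rng.random(exact.size)
constrained = apply_quadrilinear_constraint(noisy, 3, 2, 4)
```

Applying such a step per component is what would allow the constraint to be switched on for some components and left off for others.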

Fitting results obtained using non-negativity, with and without multilinearity constraints, are given in Table 3. Bilinear MCR-ALS can overfit the data and it may have ambiguity in the solutions. These ambiguities in MCR-ALS solutions can be eliminated with the use of multilinearity constraints [14], as in the case of the multiway dataset investigated in this work. The results obtained by the two-component MCR-ALS model with trilinearity and quadrilinearity constraints along with the non-negativity constraints were similar to the results obtained with the three- and four-way PARAFAC models with two components and non-negativity constraints, explaining almost the same amount of variance (Table 2). However, the repetitive patterns in the time mode (due to mixing of months and years) in the results obtained by MCR-ALS with the trilinearity constraint and by the three-way PARAFAC model (Fig. SI-1) suggest the possible fulfillment of the quadrilinear model, separating the time data into months and years to resolve the profiles in these two new modes. In addition, inclusion of the quadrilinear constraint during MCR-ALS optimization can tolerate possible small deviations from the strict quadrilinear structure.

Table 3
Lack of fit (lof) and explained variances (R2) of different MCR-ALS models.

Components   Measure   Bilinear   Trilinear   Quadrilinear
2            lof (%)   12.05      13.94       14.14
             R2 (%)    98.55      98.06       98.00
3            lof (%)   10.17      13.06       13.52
             R2 (%)    98.97      98.30       98.17
4            lof (%)   8.35       12.02       12.62
             R2 (%)    99.30      98.56       98.41
5            lof (%)   6.60       11.15       12.10
             R2 (%)    99.56      98.76       98.53

Fig. 3. Profiles resolved by 2 component MCR-ALS with quadrilinearity and non-negativity constraint: (a) sites mode; (b) variable mode; (c) month mode, and (d) year mode.

Profiles resolved by the two-component MCR-ALS model using quadrilinearity and non-negativity constraints (Fig. 3) are comparable to those obtained by the two-component four-way PARAFAC model (Fig. SI-2). To better assess the similarity between profiles resolved by the two-component MCR-ALS and PARAFAC models, the angle between them, whose cosine is equal to their correlation coefficient, was determined. A small angle between two vectors indicates that they are very similar [24–26]. The angle between resolved profiles ranged from 0.51° to 7.81° and the correlation coefficient between the resolved profiles ranged from 0.991 to 1.000, indicating that the profiles resolved by the two methods are practically identical.
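
The similarity measure described above can be sketched as follows. The function name and the example vectors are hypothetical, and the angle is reported in degrees on the assumption that this is the unit of the values quoted in the text.

```python
import numpy as np

def profile_angle(u, v):
    """Return (correlation coefficient, angle in degrees) between two profiles."""
    r = np.corrcoef(u, v)[0, 1]
    # Clip guards against values marginally outside [-1, 1] from rounding.
    angle = np.degrees(np.arccos(np.clip(r, -1.0, 1.0)))
    return r, angle

# Placeholder profiles standing in for an MCR-ALS and a PARAFAC profile.
mcr = np.array([0.10, 0.40, 0.90, 0.70, 0.30])
parafac = np.array([0.12, 0.38, 0.93, 0.66, 0.31])
r, angle = profile_angle(mcr, parafac)
print(r, angle)
```

Under this definition a correlation coefficient of 0.991 corresponds to an angle a little under 8 degrees, which is in line with the ranges reported.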

Increasing the number of components in MCR-ALS from two to five did not change significantly the explained variances (see Table 3). However, looking at the resolved profile shapes and considering a better and deeper environmental interpretability of them, the model with four components was finally preferred. The profiles resolved by the MCR-ALS model with five components were overlapping and did not provide any new information. Implementation of the multilinear constraint in MCR-ALS is more flexible than in PARAFAC and it works with each mode independently. In PARAFAC, all resolved components should fulfill the sought multilinear condition. MCR-ALS is able to handle possible slight departures from trilinear or quadrilinear models. When four components were considered (see Table 3), lof values for MCR-ALS bilinear, trilinear and quadrilinear modeling were 8.3%, 12.0% and 12.6%, respectively. There was an increase in lof value for trilinear and quadrilinear modeling compared to bilinear modeling. In contrast, there is very little difference between trilinear and quadrilinear modeling, in MCR-ALS as well as in PARAFAC (lof in PARAFAC quadrilinear modeling was 12.2%). Therefore, we may conclude that possible small departures from the multilinear model, as well as possible model overfitting in the case of bilinear modeling, explain the observed differences in lof and R2 values. From a practical point of view, trilinear and quadrilinear modeling explain rather well a large amount of data variance (99.3% and 98.4%, see Table 3) and, on the other hand, resolved profiles (see below) are more easily interpretable.
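
The two fit measures compared above can be sketched as below, assuming the definitions commonly used with MCR-ALS (residual sum of squares relative to the sum of squares of the data), which this section does not restate. The function names are hypothetical.

```python
import numpy as np

def lack_of_fit(X, X_hat):
    """lof (%) = 100 * sqrt(sum of squared residuals / sum of squared data)."""
    return 100.0 * np.sqrt(np.sum((X - X_hat) ** 2) / np.sum(X ** 2))

def explained_variance(X, X_hat):
    """R2 (%) = 100 * (1 - sum of squared residuals / sum of squared data)."""
    return 100.0 * (1.0 - np.sum((X - X_hat) ** 2) / np.sum(X ** 2))

def r2_from_lof(lof):
    """Relation implied by the two definitions above."""
    return 100.0 * (1.0 - (lof / 100.0) ** 2)

# lof values for the four-component models in Table 3
for name, lof in [("bilinear", 8.35), ("trilinear", 12.02), ("quadrilinear", 12.62)]:
    print(name, round(r2_from_lof(lof), 2))
```

Under these definitions the lof and R2 entries of Table 3 agree with each other to within rounding, and a lof near 12% still corresponds to more than 98% explained variance because R2 depends on the square of lof.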

MCR-ALS explained variance using four components was 98.41%. Using this model, resolved profiles had an easy and improved environmental interpretation considering both spatial and temporal modes (Fig. 4), as compared to an MCR-ALS model with a lower number of components (and better also than with PARAFAC using 4 components). The four profiles resolved by MCR-ALS are associated with physico-chemical changes of river water, like temperature and pH, and with contamination patterns, like bacterial and organic pollution of the river water (Fig. 4). These profiles indicate patterns similar to those described in the preliminary PCA study. Variances explained by individual MCR-ALS components are given in Table 1. In MCR, variances associated with different components do overlap because the resolved components are not orthogonal, as they are in PCA. However, MCR-ALS non-orthogonal components facilitate the physical interpretation of the sources responsible for the water quality variation of the Yamuna River in the sites or geographical mode. In the last two columns of Table 1, the difference between the explained variance for the full model (ALL) and the sum of the individual explained variances (SUM) gives a measure of the amount of overlap (the extent of non-orthogonality) among the components of a particular model.
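
The SUM versus ALL comparison can be illustrated with a small sketch. It assumes that the variance of a single component is measured by the sum of squares of its own contribution relative to the sum of squares of the data. The paper may use a different per-component definition, so only the idea that SUM and ALL coincide for orthogonal components and differ for overlapping ones should be read from it. All names are hypothetical.

```python
import numpy as np

def component_variances(D, C, S):
    """Return (per-component %, SUM, ALL) for a bilinear model D ~ C @ S.T."""
    total = np.sum(D ** 2)
    per_comp = np.array([
        100.0 * np.sum(np.outer(C[:, n], S[:, n]) ** 2) / total
        for n in range(C.shape[1])
    ])
    all_ = 100.0 * (1.0 - np.sum((D - C @ S.T) ** 2) / total)
    return per_comp, per_comp.sum(), all_

# Orthogonal components: the individual variances add up to the full-model value.
C_orth = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
S_orth = np.array([[1.0, 0.0], [0.0, 2.0]])
_, sum_orth, all_orth = component_variances(C_orth @ S_orth.T, C_orth, S_orth)

# Overlapping non-negative components: SUM and ALL no longer coincide.
C_ovl = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
S_ovl = np.array([[1.0, 1.0], [1.0, 2.0]])
_, sum_ovl, all_ovl = component_variances(C_ovl @ S_ovl.T, C_ovl, S_ovl)
print(sum_orth, all_orth, sum_ovl, all_ovl)
```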

The first MCR-ALS component resolved a profile reflecting mostly the changes in water quality associated with the pH of river water, showing that the middle stretch of the river (S5–S11) behaves differently, especially in the months of Apr–Jun, when compared to other sites. The Yamuna River receives large amounts of organic pollution from identified and unidentified sources which, when dissolved in high amounts, consume large amounts of oxygen and then undergo anaerobic fermentation processes leading to the formation of ammonia and organic acids. Hydrolysis of these organic acids will affect the natural pH of river water [27]. The second component of the model resolved a profile related mostly to water temperature (WT), which changes along the studied river course and over time, and reflects the variation of temperature at the various sampling sites and its gradual changing pattern over the different months. The temperature in the study area experiences large changes, from about 2 °C in winter (Dec–Jan) to more than 45 °C in summer (Mar–Jun, sometimes extended up to August when the monsoon is absent or delayed). The third MCR-ALS resolved profile describes mostly an organic matter pattern (COD, BOD, TKN, NH4-N), with changes mainly associated with domestic sewage and wastewater, at the middle stretch of the river, specifically at sites S5, S6, S7 and S11. The middle stretch of the river receives a lot of treated, partially treated and untreated wastewater, surface runoff and solid wastes from various point and non-point contamination sources. In this stretch, the load of organic matter is so high that it consumes the entire dissolved oxygen available in river water [15]. Water quality at location S5 reflects the impact of the discharge of wastewater from Delhi through approximately twenty drains joining the river. Sampling location S6 receives treated and partially treated effluents from a sewage treatment plant and several other drains, and location S7 also receives, at its upstream, wastewater from the Shahdara drain, which carries mainly domestic wastewater rich in organic matter along with wastewater from several unidentified small-scale industries. Sampling point S11 reflects the impact of wastewater from Agra City, a major tourist resort, on the water quality of the Yamuna River. Furthermore, the temporal variation associated with this resolved component shows decreased inputs during monsoon months (Jul–Sep). The Yamuna River carries almost 80% of its total annual flow during the monsoon period (Jul–Sep). This increased water flow dilutes the concentration of organic pollutants from domestic and industrial wastewater sources. On the other hand, river water flow is reduced significantly during the non-monsoon period (Oct–Jun), when water is diverted from the river and extensively used for irrigation and drinking purposes. During this time period, very little or no water flows in the river. This change in water flow dramatically changes the concentrations and pollution loads of the river over the year. Another (fourth) MCR-ALS resolved profile indicated mostly the bacterial contamination (FC and TC) pattern of river water over time. Although the amount of total variance explained by this component is really low compared to the other components, we consider that it is reliable and that it gives a good description of bacterial pollution, which is extremely important from an environmental and health point of view. TC and FC are indicators of the presence of pathogenic organisms. They are usually present in surface waters, soils, and feces of humans and animals. In this river basin, due to the non-existence of sanitary facilities in rural areas and urban areas, especially in slum clusters, the river catchment area is used for open defecation [15]. Sampling sites S5, S6 and S11 are severely affected by this type of bacterial pollution. A gradual increase in bacterial population with changing temperature is observed, with extremely high amounts in June, which is usually the hottest month of the year in the investigated region.

Fig. 4. Profiles resolved by 4 component MCR-ALS with quadrilinearity and non-negativity constraint: (a) sites mode; (b) variable mode; (c) month mode, and (d) year mode.

4. Conclusions

MCR-ALS with the newly developed quadrilinear constraint proved to be a powerful tool to summarize and resolve the main contamination profiles of four-way environmental datasets. Although the two-component PARAFAC and MCR-ALS models were able to extract the main pollution patterns over the investigated space and time, MCR-ALS provided an easier, more interpretable and more realistic solution when four components were considered. The implemented method and strategy are completely general and can be used for the analysis of other multi-way data sets obtained in extensive environmental monitoring studies of different types and compartments (air, water, solid, etc.), over large geographical areas and during different time periods (daily, weekly, monthly, yearly), as well as in other similar mixture analysis problems. Further, this study provides a platform to explore new possibilities in the development and application of the MCR-ALS algorithm to trilinear and quadrilinear multiway data arrays with different conditions and complexity, and under different constraints depending on the particular case in hand.

Acknowledgements

AM gratefully acknowledges the Juan de la Cierva Post-Doctoral research grant (JCI-2011-10895), and support from the Ministerio de Economía y Competitividad, Spain (CTQ2012-38616-C02-01 grant).

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.aca.2013.07.047.

References

[1] R. Tauler, S. Lacorte, M. Guillamon, R. Cespesdes, P. Viana, D. Barcelo, Environ. Toxicol. Chem. 25 (2004) 563.
[2] A. Smilde, R. Bro, P. Geladi, Multi-way Analysis: Applications in the Chemical Sciences, Wiley, NJ, 2004, ISBN: 978-0-471-98691-1.
[3] R. Tauler, Chemom. Intell. Lab. Syst. 30 (1995) 133–146.
[4] R. Tauler, I. Marques, E. Casassas, J. Chemometr. 12 (1998) 55–75.
[5] A. De Juan, R. Tauler, J. Chemometr. 15 (2001) 749–772.
[6] M. Alier, M. Felipe, I. Hernadez, R. Tauler, Anal. Bioanal. Chem. 399 (2011) 2015–2029.
[7] J. Jaumot, R. Gargallo, A. De Juan, R. Tauler, Chemom. Intell. Lab. Syst. 76 (2005) 101–110.
[8] A. De Juan, R. Tauler, Crit. Rev. Anal. Chem. 36 (2006) 163–176.
[9] E. Pere-Trepat, A. Ginebreda, R. Tauler, Chemom. Intell. Lab. Syst. 88 (2007) 69–83.
[10] R. Bro, Chemom. Intell. Lab. Syst. 38 (1997) 149–171.
[11] A-L. Xia, H.-L. Wu, S.-F. Li, S.-H. Zhu, L.-Q. Hu, R.-Q. Yu, J. Chemometr. 21 (2007) 133–144.
[12] H.-Y. Fu, H.-L. Wu, Y.-J. Yu, Li-Li Yu, S.-R. Zhang, J.-F. Nie, S.-F. Li, R.-Q. Yu, J. Chemometr. 25 (2011) 408–429.
[13] C.A. Olivieri, Anal. Methods 4 (2012) 1876–1886.
[14] R. Tauler, M. Maeder, A. de Juan, in: S.D. Brown, R. Tauler, B. Walczak (Eds.), Comprehensive Chemometrics: Chemical and Biochemical Data Analysis, 2, Elsevier, Amsterdam, 2009, pp. 473–505, Chapter 2.24.
[15] CPCB, Assessment and Development of River Basin Series: ADSORBS/41/2006-07, Central Pollution Control Board (Ministry of Environment & Forest), India, November, 2006, www.cpcb.nic.in
[16] G.E.P. Box, G. Jenkins, Time Series Analysis, Forecasting and Control, Holden-Day, San Francisco, 1976, ISBN: 0816211043.
[17] M. Terrado, D. Barceló, R. Tauler, Anal. Chim. Acta 657 (2010) 19–27.
[18] R. Bro, A.K. Smilde, J. Chemometr. 17 (2003) 16–33.
[19] S. Wold, K. Esbensen, P. Geladi, Chemom. Intell. Lab. Syst. 2 (1987) 37–52.
[20] I.T. Jolliffe, Principal Component Analysis, Springer Series in Statistics, Springer-Verlag, NY, USA, 2002.
[21] R.A. Harshman, M.E. Lundy, Comput. Stat. Data Anal. 18 (1994) 39–72.
[22] G.H. Golub, C.F. Van Loan, Matrix Computation, John Hopkins University Press, Baltimore, USA, 1996.
[23] R. Bro, H.A.L. Kiers, J. Chemometr. 17 (2003) 274–286.
[24] A. Bjorck, J. Golub, Math. Comput. 27 (1973) 579–594.
[25] P.A. Weden, Matrix pencils, in: B. Kagstrom, A. Ruhe (Eds.), Lecture Notes in Mathematics, 973, 1983, pp. 263–285.
[26] M. Dadashi, H. Abdollahi, R. Tauler, Chemom. Intell. Lab. Syst. 118 (2012) 33–40.
[27] M. Vega, R. Pardo, E. Barrado, L. Deban, Water Res. 32 (1998) 3581–3592.