Climate Dynamics: Observational, Theoretical and Computational Research on the Climate System
ISSN 0930-7575
Clim Dyn, DOI 10.1007/s00382-016-3112-9

Separation of the atmospheric variability into non-Gaussian multidimensional sources by projection pursuit techniques

Carlos A. L. Pires & Andreia F. S. Ribeiro



Your article is protected by copyright and all rights are held exclusively by Springer-Verlag Berlin Heidelberg. This e-offprint is for personal use only and shall not be self-archived in electronic repositories. If you wish to self-archive your article, please use the accepted manuscript version for posting on your own website. You may further deposit the accepted manuscript version in any repository, provided it is only made publicly available 12 months after official publication or later and provided acknowledgement is given to the original source of publication and a link is inserted to the published article on Springer's website. The link must be accompanied by the following text: "The final publication is available at link.springer.com".


Separation of the atmospheric variability into non‑Gaussian multidimensional sources by projection pursuit techniques

Carlos A. L. Pires1 · Andreia F. S. Ribeiro1

Received: 22 May 2015 / Accepted: 30 March 2016 © Springer-Verlag Berlin Heidelberg 2016

Abstract We develop an expansion of space-distributed time series into statistically independent, uncorrelated subspaces (statistical sources) of low dimension, exhibiting enhanced non-Gaussian probability distributions with geometrically simple chosen shapes (the projection pursuit rationale). The method relies upon a generalization of principal component analysis, which is optimal for Gaussian mixed signals, and of independent component analysis (ICA), optimized to split non-Gaussian scalar sources. The proposed method, supported by information-theory concepts and methods, is independent subspace analysis (ISA), which looks for multidimensional, intrinsically synergetic subspaces such as dyads (2D) and triads (3D) that are not separable by ICA. Basically, we optimize rotated variables maximizing certain nonlinear correlations (contrast functions) coming from the non-Gaussianity of the joint distribution. As a by-product, the method provides nonlinear variable changes 'unfolding' the subspaces into nearly Gaussian scalars that are easier to post-process. Moreover, the new variables still work as nonlinear exploratory indices of the non-Gaussian variability of the analysed climatic and geophysical fields. The method (ISA, followed by nonlinear unfolding) is tested on three datasets. The first comes from the Lorenz'63 three-dimensional chaotic model, showing a clear separation into a non-Gaussian dyad plus an independent scalar. The second is a mixture of propagating waves with random correlated phases, in which the emergence of triadic wave resonances imprints a statistical signature in terms of a non-Gaussian, non-separable triad. Finally, the method is applied to the monthly variability of a high-dimensional quasi-geostrophic (QG) atmospheric model of the Northern Hemisphere winter. We find that strongly non-Gaussian dyads of parabolic shape perform much better than the unrotated variables as concerns the separation of the model's four centroid regimes (the positive and negative phases of the Arctic Oscillation and of the North Atlantic Oscillation). Triads are also likely in the QG model, but of weaker expression than dyads due to the imposed shape and dimension. The study emphasizes the existence of nonlinear dyadic and triadic teleconnections.

Keywords Non-Gaussianity · Independent component analysis · Low-frequency variability · Nonlinear teleconnections · Source separation

* Carlos A. L. Pires [email protected]

1 Instituto Dom Luiz (IDL), Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisbon, Portugal

1 Introduction

Climate variability may be described by a highly dimensional stochastic process (Hasselmann 1976) spanning a wide range of spatio-temporal scales on different geophysical fields. For its understanding, it is useful to characterize the climate-system interactions by a mixing of a reduced number of low-dimensional features associated with different and independent physical processes (Ross et al. 2008; Ross 2009). A necessary, though not sufficient, condition for the separation and identifiability of those processes is the statistical independence between the sets of random variables which characterize them; the condition is not sufficient because statistical independence of variables does not imply that they are dynamically uncoupled. In fact, physical process identifiability ultimately requires the separation of the system's equations into a set of uncoupled systems of smaller


dimension, each one governing a (scalar or multidimensional) process.

The application of blind source separation (BSS) methods, machine learning and data mining algorithms (Hastie et al. 2008; Yu et al. 2014) aims to compact the system's variability into a few statistically independent variables (scalars or vectors) without any concern about dynamics, thus reducing the statistical dimension of climate variability through the sorting of the explained variance and/or Shannon entropy (Shannon 1948) of the climate-system state vector (hereafter x). Beyond the advantage of a more compact description of the system, BSS may help to constrain the identification of physical processes, though a closer approach to that goal would merge BSS techniques with the properties of stochastic models.

For the BSS purpose, a simple and popular technique has been used in the atmospheric and oceanic sciences: principal component (PC) analysis (PCA), or empirical orthogonal function (EOF) analysis (Jolliffe 2002; Hannachi et al. 2007). PCA writes the x anomaly as a linear combination (mixing) of spatially orthogonal patterns, the EOFs, multiplied by time-varying uncorrelated factors, the PCs, which work as candidate sources and are orthogonal in the sense of the temporal-covariance inner product. EOFs may seem somewhat physically artificial, and patterns of easier geophysical interpretation may therefore be obtained by applying rotational techniques (Horel 1981; Richman 1981, 1986, 1987; Hannachi et al. 2007), where the maximization of specific contrast functions is imposed on the rotated orthogonal patterns (the orthomax family, such as Varimax and Quartimax) (Browne 2001), but at the expense of correlated factors.
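As an illustration of the PCA/EOF decomposition just described, the following minimal NumPy sketch (not the authors' code; the field sizes and random mixing are purely illustrative) extracts EOFs and PCs from an SVD of the centred data matrix and checks the stated orthogonality properties:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "field": Nt = 500 samples of an Np = 40 grid-point anomaly field
# (sizes and random mixing are placeholders, not the paper's data).
X = rng.standard_normal((500, 40)) @ rng.standard_normal((40, 40))
Xc = X - X.mean(axis=0)                   # remove the sampling average

# EOFs and PCs from the SVD of the centred data matrix:
# the columns of W are the normalized EOF patterns, and the PCs are
# the projections of the anomalies onto them.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt.T                                   # EOFs (Np x Nr)
pcs = Xc @ W                               # PC time series (Nt x Nr)

# EOFs are orthonormal; PCs are mutually uncorrelated (diagonal covariance).
C = (pcs.T @ pcs) / len(pcs)
assert np.allclose(W.T @ W, np.eye(W.shape[1]), atol=1e-10)
assert np.allclose(C, np.diag(np.diag(C)), atol=1e-6)
```

The PC variances are the diagonal of C, sorted in decreasing order, matching the eigenvalues of the sampling covariance matrix.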

The PCs, filling the vector xPC, become statistically independent scalars, i.e. true statistical sources, when their joint probability density function (PDF) is Gaussian; this has been shown to be quite unlikely, owing to signatures of non-Gaussianity both in observed climatic time series (Pires and Perdigão 2007; Sura and Sardeshmukh 2008; Perron and Sura 2013) and in model runs (Franzke et al. 2007). Among other origins, the non-Gaussianity of a set of time-evolving variables can be caused either by a nonlinear drift or by multiplicative Gaussian and/or additive non-Gaussian noises mimicking the mean forcing of non-resolved scales onto the models' variables (Sura et al. 2005). An even harder and more debatable question is under what conditions the statistical sources connect to physical processes. In fact, the disconnection holds even for purely Gaussian dynamics (e.g. a linear drift with white noise) if the forcing matrix is non-normal (Farrell and Ioannou 1996).

For non-Gaussian mixed signals, PCA becomes a sub-optimal source separation method, and the resulting PCs become non-Gaussian, non-independent and exhibit mutual nonlinear correlations. In other words, sets of PCs acquire non-vanishing values of a much more general measure of statistical dependency, the multi-information (MII) (Schneidman et al. 2003), i.e. the multivariate version of mutual information (MI) (Cover and Thomas 2006). Consequently, more general BSS algorithms become necessary for estimating candidate sources (scalars or possibly vectors) from data by minimizing their inter-dependency. One of the BSS algorithms, originally developed for neural computation and signal processing, is linear ICA (independent component analysis) (Hyvärinen and Oja 2000; Hastie et al. 2001; Novey and Adali 2008), with some applications to the processing of atmospheric and oceanic datasets (Aires et al. 2000, 2002; Hannachi et al. 2009; Westra et al. 2010). ICA seeks orthogonal rotations of the spherized or standardized (unit-variance) PCs, the so-called independent components (ICs), minimizing their MII. The MII decreases as the ICs become more non-Gaussian, which can be measured either by (a) the negentropy (NE), i.e. the positive deficit of Shannon entropy (Cover and Thomas 2006) relative to the Gaussian PDF with the same average and variance, or (b) some contrast function measuring non-Gaussian PDF features such as asymmetries, kurtosis and multimodality (Comon 1994). Linear ICA is optimal when the data depend linearly on the sources, as xPC = Mz, where M is a full-rank mixing matrix and z is a set of scalar sources. However, the source dependency may be nonlinear, through a general functional dependency xPC = xPC(z), often leading to data concentration (high PDF values) around sets of cluster centroids or certain curvilinear manifolds such as curves and hyper-surfaces (Hastie and Stuetzle 1989). In that case, sources may be identified through the implicit variables spanning those manifolds (e.g. the angular variable along a closed curve).
BSS methods generally restrict the sources to a parametric family of functions, which may be implemented either by nonlinear ICA (Hyvärinen and Pajunen 1999; Almeida 2003) or by nonlinear PCA (NL-PCA) (Scholz 2012), also used in statistical climatology (Monahan 2001; Hsieh 2001; Hsieh and Wu 2002; Wu et al. 2006; Teng et al. 2007). Both techniques use auto-associative feed-forward neural networks, in which the encoding of scalar sources from data may be quite complex and difficult to interpret physically. Alternatively, the technique of principal nonlinear dynamical modes (NDMs) (Mukhin et al. 2015) fits data to a parameterized set of nonlinear manifolds (curves).
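To make the linear-ICA idea concrete, here is a self-contained sketch (not the authors' implementation; the sources and mixing matrix are invented for illustration) of the classical FastICA fixed-point iteration with the tanh contrast, recovering one independent component from a two-channel mixture after whitening:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two independent, non-Gaussian (uniform, unit-variance) sources, linearly mixed.
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 20000))
A = np.array([[1.0, 0.6], [0.4, 1.0]])      # hypothetical mixing matrix
X = A @ S

# Whitening (spherization): rotate/scale to uncorrelated, unit-variance signals.
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(X))
Z = E @ np.diag(d**-0.5) @ E.T @ X

# One-unit FastICA fixed-point iteration, tanh contrast:
# w+ = E[Z g(w'Z)] - E[g'(w'Z)] w, then renormalize.
w = rng.standard_normal(2)
for _ in range(200):
    g = np.tanh(Z.T @ w)
    w_new = Z @ g / Z.shape[1] - (1 - g**2).mean() * w
    w_new /= np.linalg.norm(w_new)
    if abs(abs(w_new @ w) - 1) < 1e-10:     # converged (up to sign)
        w = w_new
        break
    w = w_new

y = w @ Z                                    # one estimated independent component
# It should align (up to sign) with one of the true sources.
corr = np.abs(np.corrcoef(np.vstack([y, S]))[0, 1:])
assert corr.max() > 0.95
```

Maximizing the non-Gaussianity of the projection w'Z is what makes the rotation lock onto a source rather than an arbitrary uncorrelated direction.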

The rationale of this paper is to consider the multivariate variability as coming not necessarily from scalar sources, as discussed above, but from a number r of statistically independent vectorial sources y1, …, yr, concatenated in a full vector y, which span orthogonal subspaces in the sense that components of different subspaces are uncorrelated. All the y components, including those within sources, are assumed to be linearly uncorrelated,


thus eliminating any linear redundancy, though they may be nonlinearly dependent thanks to their non-Gaussian structure. The vectorial source separation is perfect when the PC vector expands as xPC = My. The dimensionalities of the sources, |yk|, k = 1, …, r, shall be small enough to keep geometrical interpretation and parsimony (Ockham's razor). This source dependency is the rationale of multidimensional ICA (MICA) (Cardoso 1998) or independent subspace analysis (ISA) (Theis 2005, 2006; Póczos 2007), which reduces to ICA for scalar sources. An intuitive application of ISA is provided by the so-called cocktail-party problem, in which different guest groups at a party (the vectorial sources) talk in totally different languages and about different subjects, whereas microphones placed throughout the room record the apparent noise coming from the conversations (the data). In that context, ISA aims at identifying the different groups or sources. In order to get a visual insight into vectorial sources, we show in Fig. 1 a sample scatter plot of a three-dimensional PDF that is split into a two-dimensional (dyadic) source y1 = (Y1,1, Y2,1)′, with a support set shaped like the capital letter 'U', and a scalar source y2 = Y1,2 ranging perpendicularly over a finite interval, where the prime stands for matrix transpose. Another dyadic source arises when a field is given by the sum of a natural variability component e1Y1 plus a multiplicative noise e2Y2 = e2 fn(Y1)W, where W and Y1 are independent random scalars, e1, e2 are constant fields and fn is a function of Y1. The pair of uncorrelated scalars (Y1, Y2) thus constitutes a dyadic source which cannot be replaced by two independent components obtained by linear ICA.
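A quick numerical check of this multiplicative-noise construction, with an illustrative choice fn(Y1) = |Y1|: the pair (Y1, Y2) is linearly uncorrelated, yet the variance of Y2 conditional on |Y1| betrays the nonlinear dependence that linear ICA cannot remove.

```python
import numpy as np

rng = np.random.default_rng(2)
# Dyadic source of the multiplicative-noise type: Y2 = fn(Y1) * W,
# with W independent of Y1 (fn(y) = |y| is an illustrative choice).
Y1 = rng.standard_normal(100000)
W = rng.standard_normal(100000)
Y2 = np.abs(Y1) * W

# Linearly uncorrelated ...
r = np.corrcoef(Y1, Y2)[0, 1]
assert abs(r) < 0.05

# ... yet nonlinearly dependent: the spread of Y2 grows with |Y1|.
v_hi = Y2[np.abs(Y1) > 1].var()
v_lo = Y2[np.abs(Y1) < 1].var()
assert v_hi > 2 * v_lo
```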

Another case of dyadic sources of geophysical interest holds when a system exhibits intermittent spells of spatio-temporal oscillations (Plaut and Vautard 1994). There, the sine and cosine coordinates of the oscillation span the plane of a two-dimensional source. The source's dimension grows in the case of phase-locking or resonance between different oscillation frequencies (see a heuristic proof in Sect. 3.2).

The hidden vectorial sources correspond to the data projections exhibiting the highest levels of 'interestingness' or 'featureness', which also correspond to the highest values of non-Gaussianity. This result comes from the 'Lemma of Negentropy', presented in Sect. 2, showing how the global dependency among sources decreases as their intrinsic non-Gaussianity and structure increase.

The full minimization of the multi-information functional over the space of orthogonal rotations, in terms of the system's PDF, is a quite intractable task even for moderate dimensions. That problem comes from the unreliability of PDF estimation from short datasets, essentially due to effects of the 'curse of dimensionality' (Bellman 1957; Bocquet et al. 2010). Here, this problem is partly solved by using projection pursuit (PP) techniques (Friedman and Tukey 1974; Huber 1985), where the non-Gaussianity of sources is assessed by statistical projection indices or positive-definite contrast functions computed along low-dimensional subspaces (non-Gaussian scalars, dyads and triads, respectively of dimension one, two and three, to be tested throughout the paper). After applying the subspace source decomposition by ISA, we propose an additional, optional analysis, consisting of a kind of parametric nonlinear ICA, in which contrast functions come out with nonlinear variable changes (nonlinear 'unfolding') within the vectorial sources, producing sets of quasi-Gaussian and quasi-independent scalar sources. That technique shares methodological aspects both with projection pursuit regression (Friedman and Stuetzle 1981) and with generalized PCA (Gnanadesikan and Wilk 1969; Mizuta 1984).

In order to ease the method's interpretation, we apply ISA and ICA to the trivariate chaotic Lorenz (1963) low-dimensional model, which exhibits a clearly non-Gaussian joint PDF of its attractor. Then we pass to a high-dimensional meteorological model with a much larger rotational freedom for the source optimization. ISA is performed for the monthly variability of a 3-level baroclinic quasi-geostrophic (QG3) spectral model (Marshall and Molteni 1993), with cross-validation of the results. The QG3 model has been used for simulating extra-tropical Northern Hemisphere (NH) regimes and their transitions (Kondrashov et al. 2004; Deloncle et al. 2007) as well as the low-frequency variability and blocking (Corti et al. 1997), for evaluating the non-Gaussianity signature and mean tendencies of the low-frequency flow components (Franzke et al. 2007; Peters and Kravtsov 2012), and also for studying medium- and long-range flow predictability (lead times of 10–30 days) (Selten 1997; Vannitsem 2001).

Fig. 1 Sketch of a scatter plot corresponding to a three-dimensional PDF separated into a dyadic source (the plane of the letter 'U') and an independent component (the direction orthogonal to that plane)


In our study, the method is applied to sets of leading PCs of the QG stream-function monthly-average fields, which are then standardized and optimally rotated, looking for non-Gaussian ICs, dyads and triads. Nonlinear correlations between those rotated PCs are associated with quite common bivariate PDF ridges (Stephenson et al. 2004) and regime centroids.

In the QG3 model we also find enhanced non-Gaussian triads, maximizing triadic correlations (Pires and Perdigão 2015), related to the occurrence of high mean values of the product of three uncorrelated standardized variables. Pires and Perdigão (2015) show the occurrence of non-Gaussian triads in the low-order Lorenz (1995) model under conditions of triadic wave resonance (Hammack 1993), which hypothetically also occur in the QG3 model. That is an expression of the triadic interaction information (IT) (Tsujishita 1995), a measure from information theory, in chaotic fluid dynamics, resulting from statistical synergies not explained by dyadic interactions alone. The plots of the loading maps of the dyads' and triads' components provide a synoptic interpretation of the patterns interacting nonlinearly [e.g. the case of the SST field, as shown by Monahan (2001)].

The paper is structured as follows. Section 2 presents the source decomposition method. Section 3 shows applications to the Lorenz (1963) low-order model as well as to simple stochastic models, namely resonant waves. Then Sect. 4 applies the method to the QG3 model, followed by the Discussion (Sect. 5). Finally, the list of symbols appears in "Appendix 1", followed by the negentropy and mutual information estimators ("Appendix 2") and the method's technicalities ("Appendices 3, 4, 5").

2 Source separation methodology

2.1 The Lemma of Negentropy—LN

2.1.1 Spherization and rotation

Let us start with the dataset matrix X ∈ R^(Np×Nt), of rank Nr ≤ min(Np, Nt), collecting Nt realizations (e.g. data indexed by time t) of the random vector x(t) ∈ R^Np, for instance the sea-level pressure field at Np grid points. Then, PCA allows for the expansion x(t) = x̄ + W xPC(t), where x̄ is the sampling average and the vector of PCs writes as xPC(t) = (XPC,1, …, XPC,Nr)′ = W′[x(t) − x̄], where W ∈ R^(Np×Nr) is the orthogonal matrix filled by the normalized EOFs (on columns), satisfying W′W = I_Nr, the identity matrix of order Nr. W comes out from the singular value decomposition (SVD) (Golub and Van Loan 1996) of the sampling covariance matrix Cxx ≡ (x − x̄)(x − x̄)′ = WΛW′ ∈ R^(Np×Np), where Λ is the diagonal matrix of PC variances, sorted in decreasing order: λ1 ≥ ⋯ ≥ λNr.

The source decomposition method is applied to the N ≤ Nr leading (but not necessarily leading) standardized, spherized PCs (e.g. those explaining 95 % of the total variance), representing the main features of the field. The components are collected in the hereafter denoted vector

(1)  a = (A1, …, AN)′ ≡ (λ1^(−1/2) XPC,1, …, λN^(−1/2) XPC,N)′ ∈ R^N

The vector a is centred and has an identity covariance matrix Caa = IN, though its components may be statistically dependent when the joint PDF is non-Gaussian. Now, let us look for a number r ≤ N of hidden vectorial candidate sources, i.e. sources as mutually independent as possible, in the form

(2)  y = (y1′, …, yr′)′ = (Y1, …, YN)′ = Ra ∈ R^N

whose sum of dimensions is Σ_{k=1}^{r} |yk| = N and where R ∈ R^(N×N) is an orthogonal matrix, i.e. RR′ = IN. That leads to an array of rotated, standardized and uncorrelated scalars verifying Cyy = IN. A generic source of dimension Nsrc = |yk| writes as yk = (Y1,k, Y2,k, …, YNsrc,k)′. Its components are given by inner products between a and the row vectors of R, which work as orthonormal source-loading vectors in the space of the standardized PCs, i.e.

(3)  Yi,k = vi,k′ a = Σ_{j=1}^{N} Vj,i,k Aj ;  i = 1, …, |yk| ;  k = 1, …, r,

where Vj,i,k = Rj,k*, with k* ≡ Σ_{l=1}^{k−1} |yl| + i, is the jth component (within the interval [−1, 1]) of the unit-norm loading vector vi,k, or the (jth row, k*th column) entry of R. Throughout the text, we use the indexed letters k for the source, i for the source component and j for the loading component.

2.1.2 Unfolding step

This step is optional, since ISA alone does not require it. Each source is transformed into an equal-dimension source by one-to-one variable changes or homeomorphisms (the unfolding step): yk → zk = (Z1,k, Z2,k, …, Z|yk|,k)′ ∈ R^|yk|, k = 1, …, r, filled with centred and standardized components, whose concatenation forms the vector

(4)  z = z(a) = (z1′, …, zr′)′ = (Z1, …, ZN)′ ∈ R^N

Source optimization applied to the vector a is performed on the group of orthogonal rotations, whereas the unfolding step comes from the rotation outcome within a prescribed
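The standardization in Eq. (1) can be sketched as follows (the PC variances are synthetic placeholders): keep the N leading PCs and divide each by the square root of its variance, so that the resulting vector a is centred with unit variances.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical PC matrix (Nt x Nr) with decreasing variances (placeholders).
pcs = rng.standard_normal((2000, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.2])

# Eq. (1): keep the N leading PCs and standardize each one,
# A_i = lambda_i^(-1/2) * X_PC,i.
N = 4
lam = pcs[:, :N].var(axis=0)
a = (pcs[:, :N] - pcs[:, :N].mean(axis=0)) / np.sqrt(lam)

# a is centred with unit variances (C_aa ~ I_N up to sampling error).
C = (a.T @ a) / len(a)
assert np.allclose(np.diag(C), 1.0, atol=1e-9)
assert np.allclose(a.mean(axis=0), 0.0, atol=1e-12)
```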


family of variable changes. Omitting the unfolding step corresponds simply to yk = zk, ∀k, and z = y, whereas omitting the rotation means R = I, without loss of generality for the forthcoming results. The BSS method is supported by the Lemma of Negentropy (LN) (Yu et al. 2014), generalized here for multivariate sources. For the sake of readability, we precede the LN with some key information-theory concepts (Cover and Thomas 2006).

Definition 1 The general non-negative measure of statistical dependency between the z components is given by the multi-information (MII), total correlation or generalized mutual information (Schneidman et al. 2003)

(5)  Ii(z) = I(Z1, …, ZN) ≡ ∫ ρz log( ρz / Π_{k=1}^{N} ρZk ) dz = Iia(z) + Iie(z)

where ρ· is a generic single or joint PDF. For a single scalar argument (N = 1), we set Ii(z) = 0 by definition. Throughout the paper, Ii (5) is used for the MII between the scalar components of the argument vector, whereas the usual symbol I (5) for mutual information is used when the arguments (scalars or even vectors) are written explicitly. The MII vanishes under global independence and decomposes into two non-negative terms: the intra-MII, Iia(z) ≡ Σ_{k=1}^{r} Ii(zk) ≥ 0, i.e. the sum of the sources' MIIs, accounting for the linear and nonlinear correlations within sources, and the inter-MII, Iie(z) ≡ I(z1, …, zr) = E[ log( ρz / (ρz1 ⋯ ρzr) ) ] ≥ 0, assessing the statistical dependency between vectorial sources, where E is the expectation operator. Since zk is a homeomorphism of yk, it follows that I(z1, …, zr) = I(y1, …, yr), i.e. the inter-MII depends only on the subspaces themselves, independently of the variables which span them. ICA is the particular case of ISA with r = N scalar sources, for which Iia(z) = 0, whereas for a single global source (r = 1), Iie(z) = 0. The optimal, or least dependent, subspaces are those (not necessarily unique) which minimize the inter-MII, which vanishes when the source separation is perfect.

2.1.3 Source separability

Since the inter-MII is invariant to any particular ordering of the subspaces, the optimal source decomposition depends only on the hereafter called source configuration, i.e. the number r of subspaces and their dimensions |yk|, not necessarily equal. The number of source configurations is the number of different ways of summing to the embedding whole-space dimension |y| = N. For instance, for N = 4 there are 5 configurations: 4 = 4 (one quartet); 4 = 3 + 1 (one triad and one scalar); 4 = 2 + 2 (two dyads); 4 = 2 + 1 + 1 (one dyad and two scalars); 4 = 1 + 1 + 1 + 1 (four scalars). Thanks to the above inter-MII expression, the splitting (merging) of sources increases (decreases) the inter-MII by the sum of the partial inter-MIIs of the split (merged) sources. For parsimony reasons, the preferable configuration is the most fragmented or maximal configuration (largest number r of sources) that produces no inter-MII change, except if larger-dimensional sources exhibit worthwhile simplicity. That configuration is not known a priori in general (source configuration indeterminacy) except, to our knowledge, in ad hoc built cases or when the global N-dimensional joint PDF is Gaussian, with PCA providing the perfect separation into N Gaussian scalars (null inter-MII). We conjecture that a general separation into N sources might asymptotically be reached through nonlinear ICA by adjusting homeomorphisms of increasing complexity, z = z(a), with arbitrarily small MII, which is equivalent to the PDF factorization ρz ≈ Π_{k=1}^{N} ρZk. That is heuristically illustrated, at least in a coarse-grained description. To justify it, consider the binning of each of the N scalars filling a into a number B of bins, forming the discrete random vector aB and leading to B^N discretized probabilities of compound events. Thus, the number of multivariate homeomorphisms zB(aB) equals the number of event permutations, i.e. the factorial (B^N)!, which may become quite large even for a moderate number of bins, due to the combinatorial explosion. The separation of variables stands for the vanishing of the MII Ii(zB), or the satisfaction of B^N independence constraints of the type pzB(zB) = Π_{i=1}^{N} pZiB(ZiB), writable in terms of the marginal discrete probabilities. Since the number of trial homeomorphisms per constraint grows drastically with B, it is clear that the minimal MII must approach zero for a large enough B, and the continuous description is approached as B increases. However, the homeomorphism minimizing the MII may become quite complex and fractured or discontinuous (i.e. with close bins not mapping to close bins) when B is too high. By restricting the continuous variable changes to orthogonal rotations (the case of linear ICA), the set of corresponding discrete fitting homeomorphisms is greatly reduced, thus preventing any perfect scalar separation, though it may be accomplished by ISA on a given configuration.
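Enumerating source configurations amounts to listing the integer partitions of N, as in this short sketch (the function name is ours):

```python
def configurations(n, max_part=None):
    """Source configurations of an N-dimensional embedding space:
    the integer partitions of N (unordered subspace dimensions)."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for part in range(min(n, max_part), 0, -1):
        for rest in configurations(n - part, part):
            yield (part,) + rest

# The text lists the N = 4 case: 4; 3+1; 2+2; 2+1+1; 1+1+1+1.
confs = list(configurations(4))
assert len(confs) == 5
assert (2, 1, 1) in confs
```

The count grows quickly with N (the partition function), which is one reason the paper restricts attention to scalars, dyads and triads.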

Here we raise the question of necessary and sufficient conditions for a PDF source yk to be non-separable by linear transformations. Apparently, it must admit a homeomorphism zk such that ρzk = Π_{j=1}^{|yk|} ρZj,k, with at least one component depending on all the yk components and not writable as a single homeomorphism of a linear function of the yk components. For instance, for a dyadic PDF of yk concentrated near the nonlinear curve Y2,k − f(Y1,k) = 0, we use the variable change Z1,k = Y1,k ; Z2,k = k2[Y2,k − f(Y1,k) − k1], where k1, k2 are standardization constants.
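This dyadic example can be checked numerically; in the following sketch the curve f(Y1) = Y1² − 1 and the scatter amplitude are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(4)
# Dyad concentrated near the curve Y2 = f(Y1), with f(y) = y^2 - 1
# (an illustrative parabola), plus small scatter.
Y1 = rng.standard_normal(50000)
Y2 = (Y1**2 - 1) + 0.1 * rng.standard_normal(50000)
assert abs(np.corrcoef(Y1, Y2)[0, 1]) < 0.05   # uncorrelated by construction

# Unfolding change of variables from the text:
# Z1 = Y1 ; Z2 = k2 * (Y2 - f(Y1) - k1), with k1, k2 standardizing Z2.
resid = Y2 - (Y1**2 - 1)
k1, k2 = resid.mean(), 1.0 / resid.std()
Z1, Z2 = Y1, k2 * (resid - k1)

# The unfolded pair is standardized and, in this construction, independent.
assert abs(Z2.mean()) < 1e-9 and abs(Z2.std() - 1.0) < 1e-9
assert abs(np.corrcoef(Z1, Z2**2)[0, 1]) < 0.05  # no nonlinear correlation left
```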


Definition 2 The multivariate negentropy (NE) (Cover and Thomas 2006) gives a non-negative measure of the joint (or single) non-Gaussianity and writes for a generic vector z as:

where H(z) ≡ −E(log ρZ) and Hg(z) ≡ −E(log ρg,z) are Shannon (1948) entropies, respectively of ρz and of the maximum-entropy Gaussian PDF ρg,z, constrained by the average E(z) and covariance matrix Czz. The NE grows with the enhancement of non-Gaussian features (e.g. PDF asymmetries, kurtosis, multimodality, nonlinear correla-tions). When the z components are statistically independ-ent, then the NE is given by the sum of single NEs, hereby denoted as

Definition 3 The positive or negative change of Shannon entropy, associated to the homeomorphism z(a) is:

The above development comes from application of the chain rule of determinants to the Jacobian matrix ∂z

∂a and

from the fact that | det ∂y∂a| = | det R| = 1 for orthogo-

nal matrices. In the absence of unfolding step, y = z and Hnl(y, z) = 0

2.1.4 Lemma of Negentropy (LN)

The joint negentropy of the spherized vector a appearing in Eq. (1) admits the decomposition in terms of NEs and MIIs of the changed variables y (Eq. 2) and z (Eq. 4) as:

The LN proof comes directly from Definitions 1–3 and from the equality Hg(Zi) = Hg(Aj) = 1

2log(2πe),∀i, j. The

negentropy Jrot is invariant over the group of orthogonal rotations applied to a, coinciding to its PDF compactness (Monahan and DelSole 2009) which measures the level of ‘featureness’ or data concentration around a lower-dimen-sional manifold, different from a unit-radius N-dimensional sphere. The z components may still exhibit linear correla-tions. Therefore, taking their spherization by the variable

(6)J(z) ≡ Hg(z)− H(z) ≥ 0

(7)Jind(z) ≡N∑

k=1

J(Zk) ≥ 0

(8)

Hnl(a, z) ≡ H(z)− H(a) = E[log

(| det ∂z

∂a|)]

= Hnl(y, z) =r∑

k=1

Hnl(yk , zk)

(9)

Jrot ≡ J(a) = N

2log(2πe)− H(a) = J(y) = Jind(z)

+r∑

k=1

Hnl(yk , zk)+ Iia(z)+ Iie(z)

change z = C−1/2zz z where Czz = IN, gives the straight-full

LN rewriting as

where Ic(z) ≡ − 12log(det Czz) ≥ 0, coinciding to the MII

of a multivariate Gaussian z vector with correlation matrix Czz.

From the LN, the best subspace decomposition, into a configuration of r sources of prescribed dimensions, in the sense of the smallest statistical interdependence (minimal inter-MII), holds for the rotation matrix, eventually degen-erated, maximizing Jsrc ≡

∑rk=1 J(yk) = Jrot − Ii.e.(y) i.e.

the sum of the sources’ NEs. From the LN reading, the BSS is favoured when the rotated scalars Yi, i = 1, . . . ,N are maximally non-Gaussian or when they are nonlinearly correlated within the vectorial sources. Thanks to (9d), the nonlinearly changed scalars Zi, i = 1, . . . ,N become more Gaussian and statistically independent as far as the Shan-non entropy changes Hnl(yk , zk) become more positive and their sum is closer to Jrot. Improved scalar sources on z may be obtained by eliminating the linear residual depend-ency of z through the term Ic(z) in (10). Below, we give a method to reach those goals.

2.2 ICA and ISA by projection pursuit techniques

As regards the source separation, the optimization of R by direct minimization of Iie(y) ≡ I(y1, . . . , yr) is a quite difficult task since it needs to write the global PDF ρa as a function of R. Moreover, under the availability of short datasets, the PDF estimation risks to be quite unreliable even for relatively short vector dimensions. That is solved by some ISA algorithms, though with their own limita-tions (Theis 2005, 2006, 2007; Póczos 2007; Kirshner and Póczos 2008), which use a contrast function Fie(y) ≥ 0 simulating Iie(y) , given by a weighted sum of squares of nonlinear covariances connecting variables from different sources, thus assessing their statistical inter-connections. The popular JADE (Joint Approximate Diagonaliza-tion of Eigen-matrices) algorithm (Cardoso and Sou-loumiac 1993) belongs to that class using fourth-order cross cumulants (see definition in “Appendix 4”) (Comon 1994; Bradley and Morris 2013; Withers and Nadara-jah 2014). In alternative, we follow the dual strategy by maximizing a sum of positively defined contrast functions Fop,k(yk), k = 1, . . . , r simulating the individual source NEs depending uniquely on sample nonlinear expecta-tions (Póczos and Lorincz 2004) like in the Maximum Dependency analysis (Peng et al. 2005). One example of that comes from the Fast-ICA algorithm (Hyvärinen and Oja 2000; Hastie et al. 2008; Novey and Adali 2008)

(10) Jrot ≡ Jind(z) + Σ_{k=1}^{r} Hnl(yk, zk) + Ic(z) + Ii(z)


which uses a contrast function approximating the scalar's negentropy by a robust measure of the asymmetry and kurtosis of its single PDF (see an example in "Appendix 2").

The contrast functions depend implicitly on R, taken as a product of Nang(N) ≡ (1/2) N(N − 1) elementary Givens matrix rotations, each one corresponding to the rotation by a certain angle of a particular plane. The contrast-function gradient with respect to the Euler angles (filling the angle vector α) is used within a gradient-descent-based nonlinear optimization routine, by which we reach contrast-function maxima and the ISA solution (see "Appendix 3" for details).
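As a minimal sketch (our own numpy construction, not the authors' code), the parameterization of R as a product of the N(N − 1)/2 elementary Givens rotations may be written as:

```python
import numpy as np
from itertools import combinations

def givens_rotation(angles, N):
    """Compose an N x N orthogonal matrix from the N(N-1)/2 elementary
    Givens rotations, one angle per coordinate plane (p, q)."""
    planes = list(combinations(range(N), 2))
    assert len(angles) == len(planes)
    R = np.eye(N)
    for alpha, (p, q) in zip(angles, planes):
        G = np.eye(N)
        c, s = np.cos(alpha), np.sin(alpha)
        G[p, p], G[q, q] = c, c
        G[p, q], G[q, p] = -s, s
        R = G @ R                      # accumulate the plane rotations
    return R

# Any angle vector yields an orthogonal matrix, so a gradient-based
# optimizer can move freely in the N(N-1)/2-dimensional angle space.
rng = np.random.default_rng(0)
N = 4
alpha = rng.uniform(0.0, 2.0 * np.pi, N * (N - 1) // 2)
R = givens_rotation(alpha, N)
```

Since each plane rotation has unit determinant, the composed R stays on the rotation group for any angle vector, which is what makes the unconstrained gradient descent over α possible.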

An optimized rotation may not be unique, being degenerate throughout a set of trivially equivalent sources coming from symmetries of Fop,k(yk) under variable permutations and inversions of axes. Moreover, if the optimization is done globally, i.e. for the whole sum of contrast functions, then any permutation of sources of equal dimension provides degenerate maxima, or possibly a degeneracy over a manifold in the angle space. Therefore, in order to overcome that situation, each source is optimized at a time (e.g. from the higher to the lower source dimensions and from the more to the less non-Gaussian sources) (Gruber et al. 2009). In practice, we start by maximizing Fop,1(y1), holding at the source y1 = R1a. Then the second one maximizes Fop,2(y2) at y2 = R2a2, where a2 is optimized in the complementary subspace of y1, and so on. The matrices R1 ∈ R^{|y1|×N}, R2 ∈ R^{|y2|×(N−|y1|)} are composed of orthogonal and standardized rows depending on a number of independent parameters. In fact, a generic Nsrc-dimensional source projected onto an N-dimensional optimization space is described by (1/2) Nsrc (2N − Nsrc − 1) independent parameters or rotation angles, i.e. Nsrc N minus the number of orthonormalizing relationships.

Setting a source configuration is a quite difficult task, stemming from the source configuration indeterminacy, and it must result from a trade-off between several favouring source properties: low dimension, non-Gaussianity (strength), synergy, simplicity and statistical significance. The optimal choice must result from a range of trials, starting by testing large-dimensional sources.

2.2.1 Cumulant‑based contrast functions

In order to efficiently perform the subspace separation, the contrast functions must comprehend a large variety of types of single and joint non-Gaussianity, which may be measured by cumulants of different orders (Comon 1994), described in detail in "Appendix 4". The use of cumulants as non-Gaussianity measures has been quite successful in the analysis of climatic time-series (Bernacchia and Naveau 2008; Bernacchia et al. 2008). Therefore, we have used contrast functions given by truncated Edgeworth expansions, both of the negentropy and of the mutual information, which write as sums of squares of single and joint higher-than-second-order cumulants (Bradley and Morris 2013; Withers and Nadarajah 2014), described in detail in "Appendix 4". The larger that cumulant truncation order, the more general and complex are the issuing PDFs. Under those conditions, PDFs are more sensitive to the chosen contrast functions, thus broadening the class of candidate source PDFs contributing to the smallest possible inter-MII. Total freedom is given to the sources by the alternative explicit inter-MII minimization referred to above.

2.2.2 Projection pursuit

Highly generalized contrast functions can make the sources too complex. In order to circumvent that, we adopt here a projection pursuit strategy (Friedman and Tukey 1974; Huber 1985) in which the contrast functions are chosen to reach their highest values when the source's PDF clusters around sets of centroids, curves, surfaces or, in general, manifolds within a certain class of easy geometrical interpretation and visualization, performing a certain nonlinear dimensionality reduction of the dataset (Hastie et al. 2008). That may limit the condition of statistical source independency, although its effect must be attenuated at higher embedding dimensions because of the larger optimization room to find stylized sources. Projection pursuit calls for some requirements, described below.

2.2.3 Requirement of low dimensionality or parsimony

Throughout the paper, we assume low-dimensional sources with dimensions one for scalars: yk = (Y1), two for dyads: yk = (Y1, Y2)′ and three for triads: yk = (Y1, Y2, Y3)′, where, for the sake of simplicity, the source index k on components will be dropped until the end of this section. That, of course, does not conform to non-separable quartets, quintets, etc.

2.2.4 Requirement of synergy

The contrast functions must evaluate the internal source complexity as a whole, i.e. they must reflect the nonlinear joint relationships in which all variables play together, and not only a proper subset of them. This issue is solved by the concept of Interaction Information (IT) (Jakulin and Bratko 2004; Timme et al. 2013; Pires and Perdigão 2015) from information theory, which is defined for a generic vector y of any dimension as the MII part that is accounted for by statistical synergies or emerging phenomena which cannot be explained by proper subsets of the y components.


Therefore, contrast functions must simulate the IT. For dyadic sources, the IT coincides with the mutual information I(Y1, Y2), equalling J(Y1, Y2) − J(Y1) − J(Y2) in the case of uncorrelated components (Pires and Perdigão 2015). In the case of triads, the so-called triadic IT (McGill 1954; Tsujishita 1995; Pires and Perdigão 2015), It(Y1, Y2, Y3), is such that it satisfies the consistency equalities

(11) It(Y1, Y2, Y3) = I(Y1, Y2|Y3) − I(Y1, Y2) = I(Y1, Y3|Y2) − I(Y1, Y3) = I(Y2, Y3|Y1) − I(Y2, Y3)
= I(Y1, Y2, Y3) − [I(Y1, Y2) + I(Y1, Y3) + I(Y2, Y3)]

where the conditional dyadic multi-informations in (11) depend on conditional PDFs (e.g. ρY1,Y2|Y3 for I(Y1, Y2|Y3)). For uncorrelated variables, as is the case here, the IT comes exclusively from single and dyadic NEs in the form:

(12) It(Y1, Y2, Y3) = J(Y1, Y2, Y3) − [J(Y1, Y2) + J(Y1, Y3) + J(Y2, Y3)] + J(Y1) + J(Y2) + J(Y3)

2.2.5 Requirement of nonlinear unfolding

The source separation might stop with the maximization of contrast functions simulating the negentropies J(yk). However, the LN shows that we can (and we will) work on the maximization and largest possible positiveness of the Hnl terms in (9), taking them or equivalents as contrast functions contributing to decrease the inter-MII and to make the unfolded sources more Gaussian. Since the zk components are standardized, they are given by Zj = [gj(yk) − E(gj)] var(gj)^{−1/2}, j = 1, …, Nsrc = |zk|, for a given differentiable homeomorphism g = (g1, …, gNsrc)′ of yk, where the index k has been dropped and at least one component depends nonlinearly on all yk components. That leads to Hnl = Hnl,1 + Hnl,2, where Hnl,1 ≡ E[log(|det(∂g/∂y)|)] and Hnl,2 ≡ −(1/2) Σ_{j=1}^{Nsrc} log[var(gj)] are differentiable functionals with respect to R and therefore usable by a gradient-descent algorithm of nonlinear optimization. Admitting a bound Hnl,1 ≤ Hnl,1max, we get a strictly positive Hnl, as wished, when the spread around the nonlinear hyper-surfaces gj(yk) − E(gj) = 0 is sufficiently small, such that Hnl,2 > −Hnl,1max. A large class of homeomorphisms takes the form gj = Yj, j = 1, …, Nsrc − 1 and gNsrc = YNsrc − g∗(Y1, …, YNsrc−1), where g∗ is a differentiable nonlinear function. That leads to Hnl = Hnl,2 = −(1/2) log[var(gNsrc)]. In order to ensure a small residual spread var(gNsrc) and to optimize the projection pursuit purpose, the function g∗ must be flexible enough, being the minimum mean-square-error linear estimator of YNsrc over a wide enough set Sr of differentiable regression functions (e.g. multivariate monomials of Y1, …, YNsrc−1), i.e. g∗ = Σ_{l∈Sr} dl [gl − E(gl)], where the vector of coefficients is d = (Cgg)^{−1} cov(g, YNsrc). In the expression of the gradient of Hnl with respect to R, we must take into account the R-dependence both of the regression coefficients and of the regression functions.
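A hedged numerical sketch of this regression step (our own helper names, not the authors' code; centred monomials up to a chosen degree play the role of the set Sr, and the least-squares solve is equivalent to d = (Cgg)^{−1} cov(g, YNsrc)):

```python
import numpy as np
from itertools import combinations_with_replacement

def regression_unfolding(Y, degree=2):
    """Replace the last component of Y (standardized columns) by the
    standardized residual of its regression on centred monomials of the
    remaining components (cf. Sect. 2.2.5)."""
    target = Y[:, -1]
    cols = []
    for deg in range(1, degree + 1):
        for idx in combinations_with_replacement(range(Y.shape[1] - 1), deg):
            g = np.prod(Y[:, idx], axis=1)
            cols.append(g - g.mean())          # centred regression functions g_l
    G = np.column_stack(cols)
    d, *_ = np.linalg.lstsq(G, target, rcond=None)
    residual = target - G @ d                  # g_Nsrc = Y_Nsrc - g*
    h_nl = -0.5 * np.log(residual.var())       # entropy-change term H_nl,2
    return residual / residual.std(), h_nl

# Usage: a dyad where Y2 is partly a quadratic function of Y1
rng = np.random.default_rng(1)
n, c = 100_000, 0.8
Y1 = rng.standard_normal(n)
W = rng.standard_normal(n)
Y2 = c * (Y1 ** 2 - 1) / np.sqrt(2.0) + np.sqrt(1.0 - c ** 2) * W
Z2, h_nl = regression_unfolding(np.column_stack([Y1, Y2]))
# h_nl should approach -0.5 * log(1 - c**2) > 0, and Z2 recovers the hidden W
```

The positive h_nl is precisely the entropy gain the text asks for: a small residual spread after the nonlinear regression yields a strictly positive Hnl.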

2.2.6 Simplest contrast functions

The simplest contrast functions relying on cumulants, used throughout the paper, depend on third-order cumulants. They also solve a nonlinear regression, in agreement with the nonlinear unfolding requirement. This is a promising approach thanks to the regular occurrence of significant non-null third-order cumulants in atmospheric-oceanic fields, both in the form of local skewnesses (Perron and Sura 2013) and through dyadic and triadic moments related to nonlinear tele-connections, susceptible of some physical interpretation (Hlinka et al. 2014).

Therefore, the contrast functions used for scalars (independent components, ICs), non-Gaussian dyads and non-Gaussian triads are written respectively as:

(13) FopIC(Y1) ≡ (κ111)² = E(Y1³)² = [skewness(Y1)]²

(14) FopD(Y1, Y2) ≡ (κ112)² = E(Y1²Y2)² = (c2)² σ²(Y1²)

(15) FopT(Y1, Y2, Y3) ≡ (κ123)² = E(Y1Y2Y3)² = (c3)² σ²(Y1Y2),

which are proportional to the squares of, respectively, skewnesses in (13), quadratic correlations c2 ≡ cor(Y1², Y2) in (14) and triadic correlations (Pires and Perdigão 2015) c3 ≡ cor(Y1Y2, Y3) in (15), as they are called hereafter. The definition of the generic third-order cumulants κijk in (13–15) appears in (37). The σ² terms are variances, dependent on fourth-order cumulants, contributing in (14) and (15) to the negentropies of Y1 and (Y1, Y2), respectively.

2.2.7 Nonlinear variable changes

The nonlinear unfolding steps, used here for dyads: (Y1, Y2) → (Z1, Z2) and triads: (Y1, Y2, Y3) → (Z1, Z2, Z3), are given respectively by:

(16) Z1 = Y1; Z2 = (1 − c2²)^{−1/2} [Y2 − c2 (Y1² − 1) var(Y1²)^{−1/2}]

(17) Z1 = Y1; Z2 = Y2; Z3 = (1 − c3²)^{−1/2} [Y3 − c3 Y1Y2 var(Y1Y2)^{−1/2}]

depending on the values of c2 (14) and c3 (15). The variables Z2 in (16) and Z3 in (17) come out as standardized
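For illustration, sample estimates of the contrast functions (13)–(15) and of the correlations c2 and c3 can be computed directly from moments (a hypothetical helper, assuming standardized inputs; not the paper's code):

```python
import numpy as np

def third_order_contrasts(Y1, Y2, Y3):
    """Sample versions of the contrast functions (13)-(15)."""
    F_ic = np.mean(Y1 ** 3) ** 2             # (13): squared skewness of Y1
    F_dyad = np.mean(Y1 ** 2 * Y2) ** 2      # (14): squared cumulant kappa_112
    F_triad = np.mean(Y1 * Y2 * Y3) ** 2     # (15): squared cumulant kappa_123
    c2 = np.corrcoef(Y1 ** 2, Y2)[0, 1]      # quadratic correlation cor(Y1^2, Y2)
    c3 = np.corrcoef(Y1 * Y2, Y3)[0, 1]      # triadic correlation cor(Y1*Y2, Y3)
    return F_ic, F_dyad, F_triad, c2, c3

# Usage: a synthetic dyad with quadratic correlation c2 ~ 0.6
rng = np.random.default_rng(2)
n = 200_000
Y1 = rng.standard_normal(n)
Y2 = 0.6 * (Y1 ** 2 - 1) / np.sqrt(2.0) + 0.8 * rng.standard_normal(n)
Y3 = rng.standard_normal(n)
F_ic, F_dyad, F_triad, c2, c3 = third_order_contrasts(Y1, Y2, Y3)
# F_dyad ~ (c2 * sigma(Y1^2))^2 ~ 0.72, while F_ic and F_triad stay near zero
```

This mirrors the identity E(Y1²Y2) = c2 σ(Y1²) used in (14): the cross-cumulant is nonzero exactly when the quadratic correlation is.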


residuals of nonlinear regressions. The above variable changes lead to the sought positive entropy changes Hnl = −(1/2) log(1 − c2²) and Hnl = −(1/2) log(1 − c3²), respectively for dyads and triads, increasing as c2² and c3² increase under rotations. This performs well for skewed PDFs, hence excluding the symmetric ones, for which only the cumulants of even order (2, 4, 6, …) take part. Such is the case of isotropic PDFs, which depend exclusively on the Cartesian norm of yk. In order to make the contrast function sensitive to them, it should include fourth-order cumulants, at the price of more algorithmic complexity.

Maximizing (14) contributes to reaching larger values of c2² and of the kurtosis of Y1, thus increasing Hnl and J(Y1), and then decreasing the inter-MII by the LN. A similar argument holds for the maximization of (15). The rigorous maximization of Hnl by rotations would imply using the more complicated contrast functions FopD/σ²(Y1²) and FopT/σ²(Y1Y2).

3 Simple analytical cases of source separation

3.1 Vectorial source separation in a low‑dimensional model

Now, in order to illustrate the difference between ICA and ISA, we compare their source-separation performances in the three-dimensional Lorenz (1963) chaotic low-order model. The model's state vector x ≡ (X1, X2, X3)′ is governed by the dynamical equations:

(18) dX1/dt = −σ(X1 − X2); dX2/dt = X1(ρ − X3) − X2; dX3/dt = X1X2 − bX3,

where the parameters are set to ρ = 28, σ = 10, b = 8/3. A long time-series of 10⁶ time-steps of Δt = 0.01 is obtained by integrating (18) with a Predictor-Corrector scheme, after enough convergence to the model's attractor. The time average is x̄ = (0, 0, 25.6)′. Application of PCA provides the EOF-matrix W = [(−0.65, −0.76, 0.)′ (0, 0, 1.)′ (−0.76, 0.65, 0)′] and the diagonal matrix of PC variances Λ = diag(146.9, 78.0, 9.5) (Selten 1995). Then the PCs are computed, to which we add independent observational white Gaussian noises with a high fraction fnoise = 0.5 of their standard deviations. This makes the joint PDF of the PCs fuzzier and more Gaussian, with an intentionally less performing BSS when compared to the situation of clean data. The vector of standardized noisy PCs is given by (19) below.

where u in (19) is a vector of standard Gaussian white noises. A sampling interval of 1.5 model time-units (enough for significant decorrelation of the variables) is applied, leading to a reduced working time-series of Ndof = 8000 iid realizations. The rotated vector is y = (Y1, Y2, Y3)′ = Ra = (Rα1 Rα2 Rα3) a, where Rαi, i = 1, 2, 3 is the rotation matrix for an angle αi around the axis Ai. Next, we estimate the NEs and MIIs in nats (information measured in natural logarithms) of the rotated variables for an exhaustive set of angle combinations α1, α2, α3 ∈ [0°, 360°[ with angle-steps of 10°. "Appendix 2" presents the NE and MII estimators used.

Then we look for the best source separation into r = N = 3 scalars, i.e. y1 = Y1, y2 = Y2, y3 = Y3 (ICA case), and into r = 2 sources with y1 = (Y1, Y2)′, y2 = Y3 (ISA case). We estimate the NE invariant Jrot ∼ 0.292, which is decomposed by the LN (9) for the 2-source case as in (20) below. The rhs of (20) decomposes into the sum of the marginal NEs, the intra-MII of the dyadic source and the inter-MII I[(Y1, Y2), Y3] = I(y1, y2), with the components between round brackets filling the vectorial source arguments of the intra multi-information. Throughout the tested rotations we get the numerically estimated global MII in the interval I(Y1, Y2, Y3) ∈ [0.160, 0.283]. As regards the linear-ICA solution, it holds at the minimum MII I(Y1, Y2, Y3) = 0.160 between the three scalar sources or ICs. The joint scatter plots are shown in Fig. 2a, b, c, corresponding to the dyadic MIs I(Y1, Y2) = 0.041, I(Y2, Y3) = 0.049 and I(Y1, Y3) = 0.024, coming from reminiscent nonlinear dependences.

Now, playing with two sources made of a scalar and a non-Gaussian dyad, we get the inter-MII in the interval I[(Y1, Y2), Y3] ∈ [0.024, 0.271], leading to the ISA solution at inter-MII = 0.024, which is smaller than the inter-MII value obtained by any dyadic merging of ICs. That comes from the fact that the Lorenz attractor can be described by a planar 2D-shaped PDF (near a parabola), 'fattened' by an approximately uniform depth. Of course, this is not a generic property of three-dimensional chaotic attractors (e.g. the visibly 3D Rössler attractor).
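The data pipeline of this experiment can be sketched as follows (a simplified reproduction with shortened run lengths; RK4 is used here as a stand-in for the paper's predictor-corrector scheme):

```python
import numpy as np

def lorenz_rhs(x, rho=28.0, sigma=10.0, b=8.0 / 3.0):
    """Right-hand side of the Lorenz (1963) equations (18)."""
    return np.array([
        -sigma * (x[0] - x[1]),
        x[0] * (rho - x[2]) - x[1],
        x[0] * x[1] - b * x[2],
    ])

def integrate(n_steps, dt=0.01, x0=(1.0, 1.0, 20.0)):
    """Fixed-step RK4 integration of the Lorenz system."""
    x = np.array(x0, dtype=float)
    out = np.empty((n_steps, 3))
    for i in range(n_steps):
        k1 = lorenz_rhs(x)
        k2 = lorenz_rhs(x + 0.5 * dt * k1)
        k3 = lorenz_rhs(x + 0.5 * dt * k2)
        k4 = lorenz_rhs(x + dt * k3)
        x = x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        out[i] = x
    return out

# Spin-up, then subsample every 150 steps (1.5 model time units)
traj = integrate(200_000)[50_000::150]
xm = traj.mean(axis=0)
cov = np.cov((traj - xm).T)
lam, W = np.linalg.eigh(cov)             # PC variances and EOFs (columns of W)
order = np.argsort(lam)[::-1]
lam, W = lam[order], W[:, order]
# Standardized noisy PCs, Eq. (19), with f_noise = 0.5
rng = np.random.default_rng(3)
f = 0.5
pcs = (traj - xm) @ W
a = (pcs + f * np.sqrt(lam) * rng.standard_normal(pcs.shape)) / np.sqrt(lam * (1.0 + f ** 2))
```

Scanning the three rotation angles of y = Ra over a 10° grid and applying NE/MII estimators to the result then reproduces the ICA/ISA comparison (the estimators themselves are those of "Appendix 2").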

The intra-MII of the ISA solution is I(Y1, Y2) = 0.247, coming mostly from the quadratic nonlinear correlation cor(Y1, Y2²) = 0.62 ≡ c122, which is visible in the bivariate scatter-plot of Fig. 3a. That suggests taking the new variables Z2 = Y2, Z3 = Y3 and

(19) a = (A1, A2, A3)′ ≡ [Λ(1 + fnoise²)]^{−1/2} [W′(x − x̄) + fnoise Λ^{1/2} u]

(20) Jrot = J(y) = Σ_{i=1}^{3} J(Yi) + I(Y1, Y2, Y3) = [J(Y1) + J(Y2) + J(Y3)] + Ii(Y1, Y2) + I[(Y1, Y2), Y3],


Z1 = Y1 − c122 (Y2² − 1)/var(Y2²)^{1/2} (cf. 17), i.e. the nonlinear regression residual of Y1. The corresponding MII is I(Z1, Z2, Z3) = 0.024, thus clearly improving the linear-ICA solution (MII = 0.160). The remaining dyadic MIs are quite small: I(Y2, Y3) = 0.034 and I(Y1, Y3) = 0.005, as is apparent from the scatter plots of Fig. 3b, c. For a different planar shape, the unfolding should be more general (see the discussion in Sect. 2).

3.2 Analysis on some stochastic fields

When a stochastic field reduces to a spatial white noise, each pixel must be a scalar source whose PDF depends on the kind of noise (Gaussian or non-Gaussian). For a stochastic field, if higher-than-second-order cross-cumulants on a spatial basis are different from zero (e.g. skewnesses, quadratic and triadic correlations), then non-Gaussian dyads and triads are likely.

3.2.1 Superposition of stochastic waves

Below, we give an instructive example of the formation of non-Gaussian dyads and triads in a stochastic field composed of a superposition of three (stationary or propagating) waves in a periodic domain: Ψ(x, t) = Σ_{i=1}^{3} Ci cos(2π ki x − ωi t − θi), with amplitudes Ci, integer wave-numbers

Fig. 2 Bivariate scatter plots of the rotated uncorrelated normalized variables minimizing the multi-information I(Y1, Y2, Y3), corresponding to the ICA solution of the Lorenz model. Projections on: a (Y1, Y2), b (Y3, Y2) and c (Y1, Y3)

Fig. 3 As in Fig. 2 for the ISA solution of the Lorenz model. Projections on: a (Y1, Y2), b (Y3, Y2) and c (Y3, Y1)


ki ∈ Z, i = 1, 2, 3; x ∈ [−1, 1], frequencies satisfying a certain dispersion relationship ωi = f(ki) and random phases θi distributed uniformly over [−π, π]. A very long time-series, much longer than the three periods involved, may be considered as an ensemble of spells with different realizations of the three phases, not necessarily independent of each other. Therefore the expectation operator E is set to a joint averaging: the time average and an averaging over the phases. A PCA of Ψ(x, t) leads to six discrete normalized EOFs: 2^{1/2} cos(2π ki x); 2^{1/2} cos(2π ki x − π/2); i = 1, 2, 3, to which correspond six standardized PCs: 2^{1/2} cos(ωi t + θi); 2^{1/2} cos(ωi t + θi − π/2), respectively.

Now, let us take standardized components Y1, Y2, Y3, with the possibility of repetition, obtained through the orthogonal rotation matrix R. By playing with trigonometry, we conclude that the third-order moment E(Y1Y2Y3) vanishes if every combined sum or difference of the three frequencies is nonzero, i.e. ω1 ± ω2 ± ω3 ≠ 0. Therefore, without loss of generality, let us consider triads in the particular case ω1 + ω2 + ω3 = 0. It may include the case of stationary waves in which the sum of two frequencies cancels (e.g. ω1 + ω2 = 0). Consequently, E(Y1Y2Y3) = 2^{−1/2} E[cos(θ1 + θ2 + θ3 + α)] ≡ F(α) for a particular constant angle α ∈ [0, 2π], depending on the rotation coefficients. The above third moment is zero if: (a) the phases θ1, θ2, θ3 are all independent; (b) a sum of two phases is independent of the third one; (c) the phases are all equal, as in the case of the skewness of any rotated PC. Under any of those conditions, F(α) becomes simply a product of sine functions with vanishing expectations.

The conclusion is that, when the frequencies satisfy ω1 ± ω2 ± ω3 = 0 [e.g. conditions of three-wave resonance in fluid dynamics (Hammack 1993)], there is the possibility of non-Gaussian dyads (two equal components and frequencies) and triads (three different components) if the phases are not independent of each other, i.e. if there is some random phase synchronization between the different oscillations such that the sum θ1 + θ2 + θ3 is nearly constant, i.e. it has a small variance. In that condition, the contrast function of the most non-Gaussian dyad or triad becomes maxR [F(α)]². This mechanism is likely to exist in geofluid dynamics, as suggested by some rotating-fluid experiments that make the triadic wave resonance clearer (Lagrange et al. 2008; Bordes et al. 2012; Giesecke et al. 2015). That situation may be useful, since it leads to spells of nonlinear predictability of certain components when both the phase synchronization and the frequency resonance conditions hold in the geophysical flow.
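This mechanism can be reproduced numerically (a self-contained sketch; the resonant frequency triplet ω = (1, 2, −3) and the unit amplitudes are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
omega = np.array([1.0, 2.0, -3.0])          # resonant: omega1 + omega2 + omega3 = 0
n_real, n_t = 400, 256
t = np.linspace(0.0, 2.0 * np.pi, n_t, endpoint=False)

def triple_moment(phase_locked):
    """Time-and-phase average of Y1*Y2*Y3 for standardized wave components."""
    m = 0.0
    for _ in range(n_real):
        th = rng.uniform(-np.pi, np.pi, 3)
        if phase_locked:
            th[2] = -(th[0] + th[1])        # random phase synchronization
        Y = np.sqrt(2.0) * np.cos(np.outer(omega, t) + th[:, None])
        m += np.mean(Y[0] * Y[1] * Y[2])
    return m / n_real

m_sync = triple_moment(True)    # phase locking leaves a finite third moment
m_indep = triple_moment(False)  # independent phases destroy it
```

With the phases locked, only the resonant combination survives the time average, so every realization contributes 2^{−1/2} cos(θ1 + θ2 + θ3) with a constant phase sum, exactly the F(α) argument above.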

3.3 Changes of third-order moments due to rotations

The choice of the configuration of sources is crucial for the ISA problem. There are no general rules for determining the number and dimensionality of the vectorial sources. Below we give some examples of badly chosen configurations, for instance when fully independent scalar variables are merged. In fact, the orthogonal rotation of independent non-Gaussian standardized scalars may produce variable pairs which are nonlinearly correlated, thus giving the false appearance of a non-Gaussian dyad. This is illustrated by two simple cases where A1, A2, A3 are independent, centred, standardized and uncorrelated, A2, A3 are Gaussian and A1 is non-Gaussian with skewness s ≡ E(A1³) and kurtosis k ≡ E(A1⁴) − 3. Then, orthogonal rotations around axes 3 and 2 (in this order) are applied, leading to the uncorrelated variables Y1, Y2, Y3 of (21) below,

where (in this context) ci = cos(θi), si = sin(θi), i = 1, 2. By taking c2 = 1, s2 = 0, we get the quadratic correlation cor(Y1², Y2) = [c1² (1 − c1²)^{1/2} s]/(2 + k c1⁴)^{1/2}, reaching all values in ]−1, 1[ within the possible domain s² ≤ k + 2. After some algebra and fixing s, k, we get the upper bound cor(Y1², Y2)² = c1⁶ s²/6 ≤ min{1, s²/6, 2s²/(3|k|)}, where k c1⁶ + 6 c1² − 4 = 0. That shows how skewnesses may be changed into quadratic correlations.
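A quick numerical check of this transfer (our own sketch; a centred, standardized exponential variable has s = 2 and k = 6):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500_000
A1 = rng.exponential(1.0, n) - 1.0      # centred, unit variance; s = 2, k = 6
A2 = rng.standard_normal(n)
s, k = 2.0, 6.0

theta = 0.9                              # rotation in the (1, 2) plane (c2 = 1, s2 = 0)
c1, s1 = np.cos(theta), np.sin(theta)
Y1 = c1 * A1 + s1 * A2
Y2 = -s1 * A1 + c1 * A2

empirical = abs(np.corrcoef(Y1 ** 2, Y2)[0, 1])
predicted = c1 ** 2 * np.sqrt(1.0 - c1 ** 2) * s / np.sqrt(2.0 + k * c1 ** 4)
# The inputs are independent, yet |cor(Y1^2, Y2)| is clearly nonzero
```

The rotated pair remains linearly uncorrelated while acquiring a sizeable quadratic correlation, which is exactly the false-dyad signature discussed above.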

Moreover, skewnesses of A1, A2, A3 may also be transferred to non-Gaussian triads. This is illustrated through the symmetric triadic correlation (Pires and Perdigão 2015) of Y1, Y2, Y3, belonging to the interval [−1, 1]. In this case, its square is given by (22) below.

Intuitively, cor3 is positive when the PDF gets local maxima in four of the eight octants such that the product of the signs of the three variables is positive. The maximum values are reached when the data concentrate near a certain three-dimensional non-planar surface [see Fig. 2 of Pires and Perdigão (2015) and Fig. 15]. Positive values of (cor3)² in (22) call for 3D rotations, with an upper bound (cor3)² ≤ (2/27) s² [1 + min(0, k)]^{−1/3}.

On the other hand, the transfer of non-Gaussianity (by orthogonal rotations) from multivariate sources to scalars is also possible. However, the nonlinear 'unfolding' of nonlinearly correlated (though linearly uncorrelated) variables may separate them better than an orthogonal rotation would do. Under those conditions, ICA performs sub-optimally. Let us illustrate that by using standard

(21)  [ c2   0   s2 ] [ c1   s1   0 ] [ A1 ]   [ Y1 ]
      [ 0    1   0  ] [ −s1  c1   0 ] [ A2 ] = [ Y2 ]
      [ −s2  0   c2 ] [ 0    0    1 ] [ A3 ]   [ Y3 ]

(22)  cor3(Y1, Y2, Y3)² ≡ {E(Y1Y2Y3) / [var(Y1Y2) var(Y1Y3) var(Y2Y3)]^{1/6}}²
      = c1⁴(1 − c1²) c2²(1 − c2²) s² / {[1 + c1²(1 − c1²) c2² k] [1 + c1²(1 − c1²) s2² k] [1 + c1⁴ c2²(1 − c2²) k]}


Gaussian and independent random variables Y1 and W, and a new one defined as Y2 = 2^{−1/2} c (Y1² − 1) + W (1 − c²)^{1/2}, where c = cor(Y2, Y1²) and Y1, Y2 are centred, standardized and uncorrelated, forming a non-Gaussian dyad. The nonlinearly unfolded independent variables are Z1 = Y1; Z2 = [Y2 − 2^{−1/2} c (Y1² − 1)] (1 − c²)^{−1/2} = W. After computing the determinant of the corresponding Jacobian matrix, we get the entropy-change term appearing in the Lemma of Negentropy: Hnl[(Y1, Y2), (Z1, Z2)] = −(1/2) log(1 − c²) ≥ 0. Since Z1 and Z2 are independent and Gaussian, one obtains J(Z1) + J(Z2) = Jind(Z1, Z2) = 0, providing the optimal source separation; hence the LN (9) writes as I(Z1, Z2) = 0 = Jrot + (1/2) log(1 − c²), and thus

(23) J(Y1, Y2) = −(1/2) log(1 − c²) = Jrot = I(Y1, Y2) + J(Y1) + J(Y2).

Any orthogonal rotation of Y1, Y2 as performed by ICA, with Z1 = c1 Y1 + s1 Y2 and Z2 = −s1 Y1 + c1 Y2, leads to dependent variables, where I(Z1, Z2) = Jrot − [J(Z1) + J(Z2)] > 0. For a small c, the single NEs are well approximated by the truncated Edgeworth expansion (1/12)(skewness)², and therefore

(24) J(Z1) + J(Z2) ≈ (1/12) [skewness(Z1)² + skewness(Z2)²] = (3/2) c² [c1² (1 − c1²)] ≤ (3/8) c²

whereas Jrot ≈ (1/2) c². Therefore ICA, holding for c1² = 1/2, leads to dependent rotated variables verifying I(Z1, Z2) ≈ (1/8) c², which is sub-optimal with respect to the nonlinear unfolding, which is able to recover the true sources. The conclusion is that linear ICA may be sub-optimal, whereas the nonlinear unfolding step may effectively separate the sources.

4 Source separation of the variability of a quasi-geostrophic model

In the previous sections, we applied the source separation to low-dimensional vectors. Below, we jump to a high-dimensional scenario in which the effects of the Central Limit Theorem are quite present, leading to quite small values of non-Gaussianity, and where there is much more rotational freedom for the search of geometrically simple sources.

4.1 Model physics

In this section we test the efficiency of the described Projection Pursuit method of source separation in a moderately

complex dynamical model designed to simulate the large-scale and low-frequency atmospheric variability in the mid-latitudes of the Northern Hemisphere (NH) during winter. For that we use the quasi-geostrophic, 3-level model (QG3 for short) of Marshall and Molteni (1993), giving prognostics of the QG potential vorticity (PV) at 200, 500 and 800 hPa (pressure levels 1, 2 and 3, respectively) through the integration of the dynamical equations:

(25) ∂qi/∂t = −Jh(Ψi, qi + f + δi,3 f h/H0) − Di + Si; (i = 1, 2, 3),

where Ψi and qi are respectively the QG stream function and PV fields at level i, Jh is the horizontal Jacobian operator, f is the Coriolis parameter and h is the topographic height, scaled by H0 = 9 km. The PVs are given by

(26) q1 = ∇²Ψ1 − R1⁻²(Ψ1 − Ψ2)
     q2 = ∇²Ψ2 + R1⁻²(Ψ1 − Ψ2) − R2⁻²(Ψ2 − Ψ3)
     q3 = ∇²Ψ3 + R2⁻²(Ψ2 − Ψ3)

where R1 = 700 km and R2 = 900 km in (26) are the Rossby deformation radii for the 200–500 and 500–800 hPa layers, respectively. The PV tendencies (25) are balanced by horizontal advection and by the linear diffusive terms Di (i = 1, 2, 3), composed of Newtonian relaxation of temperature, linear drag at 800 hPa and spectral diffusion of temperature and vorticity. Finally, the time-independent forcing fields Si (i = 1, 2, 3) simulate the non-resolved scales and physics, being estimated by time-averaging residuals of the QG-PV tendencies during the December–March season for a period of 10 years of European Centre for Medium-Range Weather Forecasts (ECMWF) reanalyses, as described by Michelangeli (1996). For further physical details see Marshall and Molteni (1993). Generic horizontal fields U (e.g. Ψi and qi) are expanded in terms of spherical harmonics Ymn(λ, φ) with triangular truncation 21:

(27) U(λ, φ, t) = Σ_{n=0}^{21} Σ_{m=−n}^{n} Um,n(t) Ymn(λ, φ)

where t, λ, φ, m and n are the time, longitude, latitude, zonal wave-number and total wave-number, respectively. The model is described by a state vector of Ntot = 1485 components, comprehending the real and imaginary parts of the non-trivial expansion coefficients of the fields Ψi (i = 1, 2, 3). The QG3 model has been used for different purposes, namely the stochastic modelling of the atmospheric variability by Empirical Model Reduction (EMR) techniques (Kondrashov et al. 2004, 2006, 2011; Strounine et al. 2009). Beyond that, the QG3 model has also been used for the assessment of weather regimes' predictability (Vannitsen 2001) and of their non-Gaussianity signature (Kondrashov et al. 2004; Peters et al. 2012). This fact makes the model a good candidate for testing the proposed source separation method.

4.2 Spatio‑temporal variability and datasets

The used dataset comes from the whole or part of an extended daily-sampled model run of 10⁶ days, assumed to be large enough to densely cover the model's attractor. Then, in order to extract the low-frequency signal, we compute daily-sampled moving averages of the state vector with a monthly (30-day) window. Then a PCA of the monthly variability is performed, with the EOFs normalized using an L2 norm over the sphere, summed for the 3 model levels. About 64, 78 and 95 % of the total variance of the monthly averages is explained by 10, 20 and 100 PCs, respectively (see Fig. 4, showing the cumulated explained-variance fraction in % by PCs with index sorted by decreasing variance), taken from the total of Ntot PCs. The dynamical behaviour of the leading PCs is essentially captured by a low-dimensional stochastic dynamical model of 10 variables showing a strong non-Gaussian signature (D'Andrea and Vautard 2001; D'Andrea 2002). The leading EOFs have a nearly barotropic-equivalent structure, exhibiting quite realistic patterns, as shown in Fig. 5 by the correlation maps between the 500-hPa stream-function field and the three leading PCs. The leading EOF1 (Fig. 5a) is a three-zonal-wave dominant mode that projects strongly onto the negative phases of both the Arctic Oscillation (AO) and the North Atlantic Oscillation (NAO). EOF2 and EOF3 resemble hybrids of the main North-Atlantic, Euro-Asian and North-Pacific observed patterns (Vautard 1990; Kimoto and Ghil 1993a, b; Michelangeli et al. 1995). Moreover, joint PDFs of the leading PCs tend to cluster around the typical NH regime centroids, like the positive and negative phases of both the NAO and AO patterns (Kondrashov et al. 2004).

Below, we compute the spatial and temporal scales of the model modes, aiming to relate them with levels of non-Gaussianity. The typical scale of the kth EOF (EOFk) is provided through the averaged total wave-number n(k) (Kalnay 2003), with weights given by the squared EOF components in the spherical-harmonics domain (Selten 1997). Figure 4 shows the log–log graph of n(k), k = 1–1000. There, we get n(k) ≤ 8, corresponding to planetary-scale wave-lengths of 6000–8000 km, for the 25 leading EOFs. That is followed by roughly growing values of n(k), saturating near the QG3 model truncation, equal to 21. The typical temporal scale of the kth PC (PCk) is given by

Fig. 4 Cumulated fraction (in %) of the explained variance by PCs of the QG3 model: dark grey triangles; average total wave-number n(k) of the k-PC: black squares; decorrelation period (in days) of PCs: light grey circles. Quantities given for PCs in the range 1–1000


Fig. 5 Correlation maps between the 500-hPa stream function and the three leading PCs of the QG3 model, from a to c


its decorrelation time τ(k), k = 1–1000 (Fig. 4), estimated by the earliest time-lag at which the PC's auto-correlation function (ACF) crosses the small threshold 0.05. The largest value holds for the leading PC: τ(1) = 120 days, while τ(k) > 30 days for many of the leading 20 PCs (Franzke and Majda 2006). Beyond PC20, the ACF decreases roughly linearly with time-lag, saturating near 30 days (not shown), supporting the fact that noisy PCs in the tail of the variance spectrum tend to behave as 30-day moving-average processes. In order to remove temporal redundancy, we subsample the total dataset at intervals of 80 days, providing the full dataset for the statistical assessments, composed of Ndof = 12,000 (degrees of freedom) nearly iid realizations of the random vector x(t) filled by the N leading PCs. Then, the source decomposition into ICs, non-Gaussian dyads and triads will be tested for the random vector a (Eq. 1) filled by the standardized N = 10 leading PCs. This choice for the reduced dimension of the source-optimization subspace relies upon sensitivity tests on the most non-Gaussian dyad (dominant or leading dyad) with respect to the embedding dimension N in the range 2–20 (below in Sect. 4.5). The source patterns, namely those of the ICs, may be quite dependent on N (Aires et al. 2002; Westra et al. 2010). In this regard, an objective criterion for the choice of N, valid for ICs, is given by Koch and Naito (2007), depending on bias-adjusted values (in terms of the sample size) of the skewness and kurtosis of the PCs and also on an Akaike Information Criterion (AIC) that penalizes higher dimensions.
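The τ(k) diagnostic (earliest ACF crossing below 0.05) can be sketched as follows (our own minimal estimator, not the paper's code; an AR(1) series with known ACF serves as a sanity check):

```python
import numpy as np

def decorrelation_time(x, threshold=0.05, max_lag=1000):
    """Earliest lag at which the sample autocorrelation drops below threshold."""
    x = x - x.mean()
    denom = np.dot(x, x)
    for lag in range(1, max_lag):
        if np.dot(x[:-lag], x[lag:]) / denom < threshold:
            return lag
    return max_lag

# Check on an AR(1) process: its ACF is phi**lag, so the 0.05 crossing
# should sit near log(0.05)/log(phi)
rng = np.random.default_rng(6)
phi, n = 0.9, 200_000
x = np.empty(n)
x[0] = 0.0
for i in range(1, n):
    x[i] = phi * x[i - 1] + rng.standard_normal()
tau = decorrelation_time(x)
```

For phi = 0.9 the theoretical crossing is near lag 28, and the sample estimate fluctuates around it with the usual ACF sampling error.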

Despite the above criterion, let us establish a rule of thumb for the effective number of degrees of freedom Ndof needed to robustly optimize a source of dimension Nsrc embedded in an N-dimensional optimization space. That source is determined by (1/2)Nsrc(2N − Nsrc − 1) independent parameters (see Sect. 2.2). Now, for a wide range of conditions and by invoking the Central Limit Theorem, the variance of a parameter estimator is roughly proportional to (Ndof)⁻¹. Therefore a 5 % parameter square error calls for a sample with at least Ndof−p ∼ 20 data per independent parameter. Consequently the dataset must be large enough to estimate all parameters, and hence Ndof ≥ (1/2) Nsrc Ndof−p (2N − Nsrc − 1) = 10 Nsrc(2N − Nsrc − 1).

On the other hand, for a fixed available Ndof, the optimization space dimension cannot be too large, being bounded as N ≤ Ndof/(20Nsrc) + (Nsrc + 1)/2. Applying that bound to a typical reanalysis sample of 60 years, and considering 3–4 degrees of freedom of the winter's monthly variability per year, we get Ndof ∼ 180–240 and so N ≤ 6–8 for dyads (Nsrc = 2) and N ≤ 5–6 for triads (Nsrc = 3). For quite short datasets, a pre-selection of the PCs projecting most strongly on non-Gaussian sources shall be done before optimization in order to prevent high dimensions N (from tests with observed SST fields; not shown).
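The arithmetic of this rule of thumb can be encoded directly; the function names are ours, the formulas are those given above.

```python
def n_params(N, Nsrc):
    """Independent parameters of an Nsrc-dimensional source in an
    N-dimensional optimization space: (1/2)*Nsrc*(2N - Nsrc - 1)."""
    return Nsrc * (2 * N - Nsrc - 1) // 2

def min_ndof(N, Nsrc, per_param=20):
    """Minimum effective sample size for a ~5 % parameter square error,
    assuming ~20 data per independent parameter."""
    return per_param * n_params(N, Nsrc)

def max_dim(Ndof, Nsrc, per_param=20):
    """Largest optimization dimension N supported by Ndof realizations."""
    return Ndof / (per_param * Nsrc) + (Nsrc + 1) / 2

# Reanalysis example from the text: 60 years x 3-4 dof per winter.
bounds = {(Ndof, Nsrc): max_dim(Ndof, Nsrc)
          for Ndof in (180, 240) for Nsrc in (2, 3)}
```

For Ndof = 180–240 this reproduces the bounds quoted in the text: N ≤ 6–7.5 for dyads and N ≤ 5–6 for triads.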

In order to ensure the robustness of sources, we optimize them in the first half of the dataset (training period), followed by a validation step in the second half (validation period), both composed of 6000 iid realizations. Robust sources (i.e. with small sensitivity of the source loading vectors) may even be obtained with shorter datasets (of the order of some tens of years, comparable to the size of available reanalyses). However, such a large sample size Ndof was chosen in order to complement the results of the contrast-function maximization (see Sect. 2.2) with diagnostics of the negentropy and multi-information in two and three dimensions (for dyads and triads respectively), whose reliable estimation (see the estimator in "Appendix 2") calls for sufficiently long datasets. Moreover, a long training period ensures non-overfitted values of the projection pursuit contrast function for moderately high N.

4.3 Single negentropy of PCs

According to Sect. 2, the joint negentropy (NE) of PCs, or in general of a set of uncorrelated variables, splits into the sum of single (marginal) NEs plus the global multi-information among those scalar variables. Some insight into that issue is given by the log–log graph of the single estimated NEs in the full dataset (open circles of Fig. 6). The highest NE values occur for PC1 and PC2, mostly projected on larger and longer scales (Franzke and Majda 2006; Berner and Branstator 2007). That comes essentially from PDF asymmetries, accounted for by both the PC1 skewness (0.83), due to more frequent negative-phase NAO events (Woollings et al. 2010), and the PC2 skewness (−0.36).

The statistically significant NE value at the 5 % significance level is marked by a straight line in Fig. 6. It is computed as the 95 % quantile of the set of NEs issued from 1000 synthetic samples, each filled by Ndof iid standard Gaussian realizations. From Fig. 6, we see that most (about 95 %) of the PCs in the range 10–1000 get non-significant NEs, supporting the hypothesis that short-scale PCs behave as Gaussian stochastic processes spanning the Gaussian manifold of the variability (Blanchard et al. 2006), i.e. the Gaussian-distributed statistically independent subspace. This is because the corresponding EOFs in the spatial domain have short correlation lengths, thus leading to PCs which behave as a sum of a large number of quasi-iid variables, justifying their Gaussianity according to the Central Limit Theorem.
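The Monte Carlo significance threshold can be sketched as follows; for brevity, a classical moment-based negentropy approximation (skewness²/12 + excess-kurtosis²/48) stands in for the paper's estimator of "Appendix 2".

```python
import numpy as np

def negentropy_mm(x):
    """Moment-based negentropy approximation J ~ skew^2/12 + exkurt^2/48,
    a simple stand-in for the estimator of the paper's Appendix 2."""
    x = (x - x.mean()) / x.std()
    skew = np.mean(x**3)
    exkurt = np.mean(x**4) - 3.0
    return skew**2 / 12.0 + exkurt**2 / 48.0

def ne_null_quantile(ndof, n_trials=500, q=0.95, seed=1):
    """q-quantile of the NE estimator under the iid standard Gaussian null,
    from Monte Carlo synthetic samples."""
    rng = np.random.default_rng(seed)
    vals = [negentropy_mm(rng.standard_normal(ndof)) for _ in range(n_trials)]
    return float(np.quantile(vals, q))

thr = ne_null_quantile(12_000)   # 95 % null threshold for Ndof = 12,000
```

For Ndof = 12,000 this crude estimator yields a threshold of order 2.5 × 10⁻⁴, consistent in magnitude with the 0.00025 level quoted in the Fig. 6 caption.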

The single NE values are much smaller than those obtained with low-order chaotic models (see Sect. 3.1), which is consistent with a scenario of intrinsically low non-Gaussianity. This makes the sources less well separated and less non-Gaussian, calling for an optimization on subspaces of moderate dimension N in order to find substantially more non-Gaussian-distributed projections.


Separation of the atmospheric variability into non-Gaussian multidimensional sources by…


4.4 Bivariate mutual information among PCs

The joint dyadic negentropies of pairs of PCs are accounted for by both the single NEs and the bivariate nonlinear correlations assessed by their mutual information (MI). In fact, for any uncorrelated scalars A, B we easily get J(A,B) = J(A) + J(B) + I(A,B). To put that in evidence, we plot the bar graph (Fig. 7) of the estimated mutual information between every pair of PCs in the range 1–80, using the full dataset. Only cases satisfying MI > 0.0065 are plotted (above the grey plane in Fig. 7), surpassing the 95 %-confidence threshold for rejecting the null hypothesis of a bivariate isotropic Gaussian PDF [read from Fig. B1 in Appendix B of Pires and Perdigão (2015) for Ndof = 12,000]. From Fig. 7, the strongest MI values occur when PC1 is one of the MI arguments, as for I(PC1, PC5) = 0.038 and I(PC1, PC6) = 0.027, mostly coming from the nonlinear correlations |cor(PC1², PC5)| = 0.21 and |cor(PC1², PC6)| = 0.17. It also comes from the goodness of the quadratic best fit of PC5 and PC6 from PC1. Beyond PC20 and excluding PC1, we get quite small MI values, in agreement with the expected small non-Gaussianity of low-variance PCs.
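The quadratic correlation used here, |cor(PC1², PCj)|, is straightforward to compute; the synthetic pair below, with a built-in parabolic link, is purely illustrative.

```python
import numpy as np

def quad_cor(a, b):
    """Absolute quadratic correlation |cor(a^2, b)| between two scalars."""
    a2 = (a - a.mean())**2
    return abs(np.corrcoef(a2, b)[0, 1])

# Illustrative pair with a parabolic link b ~ a^2 plus noise,
# mimicking the PC1-vs-PC5/PC6 relationship described in the text.
rng = np.random.default_rng(2)
a = rng.standard_normal(12_000)
b = 0.3 * (a**2 - 1.0) + rng.standard_normal(12_000)
c = rng.standard_normal(12_000)          # independent control variable
qc_linked, qc_control = quad_cor(a, b), quad_cor(a, c)
```

Since cor(a, b) = 0 by construction, the dependence is invisible to the linear correlation and shows up only in the quadratic one.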

4.5 Leading non‑Gaussian dyad and dependence on the optimization space

The dominant non-Gaussian dyad of generic components (Y1, Y2), maximizing the contrast function (14), is computed on the subspace of the N leading standardized PCs for N in the range 2–20. The goal is to study the sensitivity of some of the dyad's parameters [e.g. quadratic correlation, negentropy, explained variance, vector of loadings (3)] with respect to the number N of PCs of the optimization space.

For that purpose, we show in Fig. 8 the absolute quadratic correlation |cor(Y1², Y2)| for N = 2–20, both in the training and validation periods, denoted respectively as COR-T and COR-V. Both values are quite similar for each N, indicating no significant presence of positive bias or

Fig. 6 Estimated values of negentropy of PCs (open circles) and ICs (black squares) of the QG3 model. The straight line indicates the 5 % significance level (0.00025) of the null Gaussian hypothesis on the NE estimation

Fig. 7 Mutual information between PCs (in bars). 95 %-statistically significant values are marked in dark, above the significance baseline at 0.0065

Fig. 8 Values of the contrast function |E(Y1²Y2)|/√2 (F-T) and of the absolute correlation |cor(Y1², Y2)| (Cor-T), both in the calibration period; values of the absolute correlation (Cor-V) and of the mutual information (MII) in the validation period; explained variance fraction (VarExp) by dyads and inner product (InProd) between the subspaces of two consecutive dominant dyads


overfitting of the contrast function in the training period. In fact the effective number of optimizing angles in a dyad, i.e. 2N − 3, is much less than the 6000 iid realizations in the optimization sample (see discussion above). The above correlations also compare with the contrast function |E(Y1²Y2)|/√2 (F-T in Fig. 8) whenever Y1 is nearly Gaussian, i.e. σ(Y1²) ≈ √2, which seems valid for N ≤ 14. The quadratic correlation grows consistently with N, due to the increasing 'optimization room', starting near 0.37 for N = 2 with an orthogonal rotation of the standardized PC1 and PC2. That value is quite similar to the upper bound due to the PC1 skewness, (1/√6)|skewness(PC1)| = 0.34 (see Sect. 3.3). The largest incremental increases of the correlation hold when PCs of order i = 5, 6 and 7 come into play in the optimization space, which is consistent with the high values of I(PC1, PCi) (i = 5, 6, 7) shown in Fig. 7. The quadratic correlation saturates near 0.52 at N = 10, despite the slight increase of the contrast function as a consequence of Y1 becoming further leptokurtic (super-Gaussian), i.e. with increasing var(Y1²) for N > 10.

The estimated MI values I(Y1, Y2) in the validation period for N = 2–20 (Fig. 8) change quite similarly to the nonlinear correlation, thanks to the approximation I(Y1, Y2) ≈ −(1/2) log[1 − cor(Y1², Y2)²] for enhanced dyads, which is formally identical to the MI I(A, B) = −(1/2) log[1 − cor(A, B)²] between two jointly Gaussian correlated variables A, B.
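The Gaussian MI formula invoked above is a one-liner; applied to the saturation value cor(Y1², Y2) ≈ 0.52 reported for N = 10, it gives the approximate dyadic MI.

```python
import numpy as np

def mi_gauss(rho):
    """MI between two jointly Gaussian variables with correlation rho:
    I = -0.5*log(1 - rho^2). Used here as the approximate MI of an
    enhanced dyad, with rho taken as the quadratic correlation."""
    return -0.5 * np.log(1.0 - float(rho) ** 2)

# Saturation value reported in the text for N = 10.
mi_at_saturation = mi_gauss(0.52)
```

For rho = 0.52 this yields about 0.158 nats, and the formula reduces to zero for uncorrelated variables, as expected.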

4.5.1 Sensitivity of loadings’ vectors

Now, let us consider the loading vectors (3) of the leading dyad, denoted in this section as v1(N), v2(N) for a given embedding dimension N, and the corresponding spanned subspace VN ≡ Span(v1(N), v2(N)). The leading dyad may be sensitive to the choice of N. In order to assess that, we compute a similarity measure, a kind of inner product between the consecutive two-dimensional subspaces VN−1, VN. The similarity VN−1 · VN ∈ [0, 1] (N = 3, …, 20) is based on the cosine of the angle between two subspaces (Gunawan et al. 2005) [see (40) in "Appendix 5" for its definition]. The closer that inner product is to one (zero), the more parallel (orthogonal) the subspaces are. In our case, the inner product is always larger than 0.9, which means that successive optimized planes are quite parallel, hence the leading dyad is quite similar to the first one, optimized for N = 2 (the subspace of the two leading PCs), for which the loading vectors are v′1(2) = (.79, .61) and v′2(2) = (−.61, .79).
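A sketch of such a subspace inner product, assuming it is the product of the cosines of the principal angles between the two planes (one plausible reading of the Gunawan et al. 2005 measure; the exact definition is in "Appendix 5"):

```python
import numpy as np

def subspace_similarity(V1, V2):
    """Product of the cosines of the principal angles between the column
    spans of V1 and V2: 1 for identical subspaces, 0 for orthogonal ones."""
    Q1, _ = np.linalg.qr(np.asarray(V1, dtype=float))
    Q2, _ = np.linalg.qr(np.asarray(V2, dtype=float))
    s = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    return float(np.prod(np.clip(s, 0.0, 1.0)))

# Example in R^4: a plane compared with itself (in a rotated basis)
# and with its orthogonal complement.
e = np.eye(4)
plane = e[:, :2]
rotated_basis = np.column_stack([plane @ [0.6, 0.8], plane @ [-0.8, 0.6]])
orthogonal = e[:, 2:]
```

Because the measure depends only on the spans, it is invariant to the basis chosen within each dyad's plane, which is what makes it suitable for comparing VN−1 with VN.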

Finally, the explained variance fraction fvar of the field [see (42) of "Appendix 5"] accounted for by the successive dominant dyads (Fig. 8) slightly decreases from 0.37 at N = 2 to 0.28 at N = 20. That is because dyads progressively include PCs of smaller explained variance. This behaviour is quite generic in source optimization. Therefore, as regards the source separation, there must be a trade-off between the dimension N of the optimization subspace, the explained variance of sources, the negentropy of sources and the size of the available optimization sample, preventing biases due to overfitting of the rotation angles, especially for short datasets (a theme beyond the paper's scope). Furthermore, the iterative optimization algorithm (see "Appendix 3") depends on first-guess rotation angles, with the possibility of the final solution being trapped in some local maximum of the contrast function, lower than the absolute one. The number of non-trivially different local maxima (i.e. those coming from the nonlinear nature of the cost function, and not from symmetries like exchanges and reversals of axes) grows in general with N, ranging in our case from 1 at N = 1 to 8 at N = 20. Therefore we have taken a reasonable number, set to 40, of random first-guess angle trials to make sure that the absolute maximum of the contrast function is reached.

4.6 Source separation into independent components

Given the previous trade-off criteria and the relative stabilization of the dyad, we set the optimization space dimension to N = 10 for the purpose of comparing the space separation into independent components (ICs) and into dyads, although mixing dyads and ICs is also quite possible (see the discussion of the source configuration indeterminacy). The chosen ICs maximize the squared skewness (13), being optimized and sorted in a sequential way, i.e. one source after the other. Denoting the vector filled by ICs as yIC, the LN invariant (9) decomposes as:

(28) Jrot = Jind(yIC) + Ii(yIC) = Jind(xPC) + Ii(xPC)

The numerically estimated negentropies of ICs in the validation period are presented in Fig. 6 (black squares), summing to Jind(yIC) = 0.255, clearly surpassing the corresponding value Jind(xPC) = 0.135 for PCs, thus corroborating the fact that ICs are much less statistically dependent than PCs. The largest NE contribution comes from the leading IC, whose vector of loadings (3) has dominant components 0.86, 0.31 and 0.21, respectively for the standardized PC1, PC5 and PC6. Therefore, most of the single NEs and nonlinear correlations (Sects. 4.3, 4.4) were converted by orthogonal rotations into the IC1 skewness, equal to 1.31. The correlation map between IC1 and the stream-function field reveals a quite strong signature on the EOF1 pattern (not shown), thanks to its high loading value.
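A toy version of the sequential IC search (a brute-force angle scan in two dimensions, rather than the paper's iterative rotation algorithm of "Appendix 3") illustrates the maximization of squared skewness:

```python
import numpy as np

def leading_ic_2d(X, n_angles=360):
    """Scan rotation angles in a 2-D standardized space and return the unit
    direction whose projection maximizes the squared skewness."""
    best_s2, best_v = -1.0, None
    for theta in np.linspace(0.0, np.pi, n_angles, endpoint=False):
        v = np.array([np.cos(theta), np.sin(theta)])
        y = X @ v
        y = (y - y.mean()) / y.std()
        s2 = np.mean(y**3) ** 2
        if s2 > best_s2:
            best_s2, best_v = s2, v
    return best_v, best_s2

# Two independent sources (one skewed, one Gaussian) mixed by a rotation.
rng = np.random.default_rng(3)
s_skew = rng.exponential(size=30_000) - 1.0     # skewness ~ 2
s_gauss = rng.standard_normal(30_000)
theta0 = np.deg2rad(30.0)
R = np.array([[np.cos(theta0), -np.sin(theta0)],
              [np.sin(theta0),  np.cos(theta0)]])
X = np.column_stack([s_skew, s_gauss]) @ R.T    # mixed data
v, s2 = leading_ic_2d(X)                        # recovers the skewed axis
```

The recovered direction aligns (up to sign) with the rotation column that carries the skewed source, which is the essence of skewness-based IC extraction.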


4.7 Source separation into non‑Gaussian dyads

In this section, we test aggregated source configurations including dyads. In particular, the 10-dimensional PC embedding subspace is split into r = 5 source dyads, hereafter denoted by ID1, ID2, ID3, ID4 and ID5 (or y1, …, y5), optimized one at a time and sorted by decreasing values of the contrast function used for dyads (14). The LN invariant (9) decomposes as:

(29) Jrot = Σ_{k=1}^{5} [J(Y1,k) + J(Y2,k)] + Σ_{k=1}^{5} I(Y1,k, Y2,k) + I(y1, …, y5).

Table 1 lists the terms of (29) in the validation period, discriminated for each dyad, denoting by JY1 and JY2 the negentropies of the components of the kth dyad (Y1,k, Y2,k) and by IAY its mutual information (intra-MII). For dyads only, the absolute quadratic correlation and the corresponding term Hnl in (9) appear as Cor and Hnl in Table 1. Totals for the five sources appear in the table's bottom row. All estimated values surpassing the 95 % confidence level of a non-Gaussian scenario for a sample of 6000 iid realizations are marked (in italics) in Table 1, i.e. rejecting the null hypothesis of a Gaussian bivariate PDF of null correlation. They correspond to the thresholds 0.008 for the MIs, 0.009 for the NEs, 0.025 for the absolute correlation and 0.0003 for Hnl.

Now, some relevant conclusions can be drawn from inspection of Table 1. The two leading dyads are the most non-Gaussian ones, with statistically significant quadratic correlations, respectively 0.52 and 0.19, accounting for 76 and 9 % of the total sum of dyadic NEs (see Sect. 2.1). The highest correlation is visually clear in Fig. 9a, showing contours of the corresponding two-dimensional PDFs and thus putting in evidence the parabolic shape of the PDF ridges emphasized by the quadratic correlation. This is equivalent to localized maxima of the azimuthal angular PDF cumulating probabilities across angular sectors of the source plane, which holds also for real data (Stephenson et al. 2004). PDFs were estimated by the kernel method devised in "Appendix B" of Pires and Perdigão (2015), with marginal variables subjected to a Gaussian anamorphosis (Wackernagel 1998).

The last remaining three dyads are marginally non-Gaussian and so they roughly span the Gaussian manifold (Blanchard et al. 2006) of the optimization space, as far as third-order moments are concerned, which shows the exist-ence of strongly Gaussian projections, even on a planetary-scale-dominant variability space. This corroborates that a configuration including a lower number of dyads could be considered (see the discussion of the configuration indeter-minacy) due to their weak expression, taking scalars for the orthogonal complement of the leading two ones.

Now, let us look closely to the two dominant and sta-tistically significant dyads. Their components are given by a linear combination of the vector of 10 standardized PCs, denoted as A1,…,A10 (1) where weights come from a unit-norm vector of loadings (3). By sorting those weights up to account ~90 % of the squared norm, we get shorter expansions of the dyads’ components. Then, as regards the first dyad, Y1,1 ~ 0.79A1 + 0.46A2 + 0.17A8 − 0.18A9 which explains 21 % of the total field variance and it is quite parallel to PC1 whereas the second component Y2,1 ~ 0.39A1 − 0.62A2 − 0.28A3 + 0.23A4 + 0.33A5 + 0.31A6 explains 9 % of the variance and projects mostly on PCs whose mutual information with PC1 is high as shown in Fig. 7 by the graph of MIs between PCs. The dominant dyad optimized for N = 10 is quite coplanar with that optimized for N = 2, in which Y1,1 ~ 0.79A1 + 0.61A2 and Y2,1 ~ −0.61A1 + 0.79A2, in agreement with Fig. 8 showing inner products between successive dyad’s subspaces.

As regards the second, much weaker dyad, optimized for N = 10, its components are approximated by Y1,2 ~ −0.77A3− 0.36A5 + 0.20A8 − 0.37A10 and Y2,2 ~ 0.37A2 + 0.48A4 + 0.46A5 + 0.48A8, mixing different PCs and explaining respectively 5 and 6 % of the total variance. The leading dyad captures essentially the same negentropy (~0.196) as the two leading PCs (~0.205).

Table 1 Statistical results of the separation of the QG3 model variability into 5 dyads

Results for each successive independent dyad (ID) in rows: explained variance fractions V1, V2; quadratic correlation Cor; intra-MII IAY; negentropies JY1, JY2; Hnl; and inter-MII. All values are in 10⁻³ units. Values in italics are significant at the 95 % confidence level. Appropriate totals appear in the bottom row

ID V1 V2 Cor IAY JY1 JY2 Hnl IE-ID2 IE-ID3 IE-ID4 IE-ID5

1 208 90 526 168 12 16 162 16 8 9 7

2 47 55 191 19 0 5 19 7 7 7

3 61 38 154 18 0 0 12 8 5

4 39 32 84 9 0 1 4 7

5 33 34 65 8 0 1 2

Totals V1 + V2 = 637 IAY = 222 JY1 + JY2 = 35 Hnl = 199
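The bottom-row totals can be checked directly against the per-dyad entries listed in Table 1 (all values in 10⁻³ units):

```python
# Table 1 budget rows, one entry per dyad ID1..ID5.
V1  = [208, 47, 61, 39, 33]   # explained variance fraction, 1st components
V2  = [90, 55, 38, 32, 34]    # explained variance fraction, 2nd components
IAY = [168, 19, 18, 9, 8]     # intra-dyad mutual information
JY1 = [12, 0, 0, 0, 0]        # negentropy of 1st components
JY2 = [16, 5, 0, 1, 1]        # negentropy of 2nd components
Hnl = [162, 19, 12, 4, 2]     # nonlinear-correlation term of (9)

totals = (sum(V1) + sum(V2), sum(IAY), sum(JY1) + sum(JY2), sum(Hnl))
```

The sums reproduce the stated totals V1 + V2 = 637, IAY = 222, JY1 + JY2 = 35 and Hnl = 199, confirming the internal consistency of the budget.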


Now, we analyse the performance of the source separation. The total NE of sources is the sum of the intra-MIs (0.222) and the sum of the single NEs (0.035), from which the inter-MII is I(y1, …, y5) = Jrot − 0.222 − 0.035 = Jrot − 0.257, roughly similar, in this particular case, to the inter-MII between ICs, Ii(yIC) = Jrot − 0.255. However, dyads win over ICs in terms of parsimony (optimized by 5 covariances versus 10 skewnesses). That is because the contrast function FopD (14) is only sensitive to third-order cross cumulants, a limitation which might be overcome by adding to FopD squared single skewnesses and squared cumulants of higher order, or by taking more general nonlinear regressions (see the Requirement of nonlinear unfolding), which should provide more non-Gaussian, yet more complex, dyads. However, that would also lead to further local maxima of the contrast function and possibly increase bad conditioning over certain manifolds (crests) embedded in the space of rotation angles.

The level of dependence between dyads would call for the estimation of a PDF functional in dimension N = 10. However, we partially assess the dependence among pairs of dyads through the MI between single explanatory variables, the nonlinear hybrid indices FindD,k, each representative of the variability within the kth dyad:

(30) FindD,k ≡ (Y1,k² − 1) var(Y1,k²)^(−1/2) + sgn[cor(Y1,k², Y2,k)] Y2,k,

which is simply the sum of the two standardized terms entering a positive quadratic correlation. This recipe of adding correlated quantities is also followed in the construction of optimal atmospheric–oceanic indices, for instance the NAO index, given by the difference between standardized pressures at the Azores and Iceland, and the Bivariate ENSO Timeseries (BEST) index, combining the atmospheric and oceanic responses of El Niño (Smith and Sardeshmukh 2000). The MI between all pairs of indices is listed in Table 1 as IE-ID2 (MI between a generic index and the index of ID2), IE-ID3, IE-ID4 and IE-ID5. According to this dependence measure, the two leading dyads are not independent at the 95 % confidence level, since I(FindD,1, FindD,2) = 0.016 > 0.008, due to the quadratic correlation cor(FindD,1, (FindD,2)²) = 0.17, which must again be explained by joint cumulants of order higher than three.
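A sketch of the hybrid index of Eq. (30), applied to a synthetic dyad with a built-in quadratic link (illustrative only):

```python
import numpy as np

def dyad_index(y1, y2):
    """Hybrid dyadic index of Eq. (30): the standardized (y1^2 - 1) term
    plus the second component, sign-adjusted so both add constructively."""
    q = y1**2 - 1.0
    q = q / q.std()
    sgn = np.sign(np.corrcoef(y1**2, y2)[0, 1])
    return q + sgn * y2

# Synthetic standardized dyad with a quadratic link y2 ~ y1^2 + noise.
rng = np.random.default_rng(4)
y1 = rng.standard_normal(50_000)
y2 = 0.3 * (y1**2 - 1.0) + rng.standard_normal(50_000)
idx = dyad_index(y1, y2)
r = np.corrcoef(idx, y1**2)[0, 1]   # the index tracks the dyad variability
```

Because both terms of the index are positively correlated with each other, their sum tracks the within-dyad variability better than either component alone, which is the same design rationale as the NAO and BEST indices cited above.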

4.7.1 Nonlinear unfolding

Now, let us analyse the effect on the negentropy budget of the nonlinear homeomorphisms (Y1,k, Y2,k) ↔ (Z1,k, Z2,k), k = 1, …, 5 (16), obtained as residuals of the nonlinear regressions, forming the vector z described in Sect. 2.2, and of its spherization z̃ = Czz^(−1/2) z used in (10). The LN invariant decomposes as

(31) Jrot = Jind(z) + Hnl(y, z) + Ii(z) = Jind(z̃) + Ic(z) + Hnl(y, z) + Ii(z̃)

The terms in (31), for the validation period, are given below. We find Jind(z) = 0.028 < Jind(y) = 0.035, suggesting that residuals of the nonlinear regression tend to be more Gaussian than the dyads' components. Then Hnl(y, z) = 0.198, hence the MII is Ii(z) = Jrot − 0.226, which is higher than the MII among ICs, i.e. Jrot − 0.255. However, some of that MII comes from linear correlations of z,

Fig. 9 Contours of the PDF of the first (a) and second (b) dyads for the optimization subspace dimension N = 10. Marginal variables are subjected to Gaussian anamorphosis. The quadratic correlation is 0.54 and 0.19 respectively for (a) and (b). Weather regimes correspondent to each PDF quadrant are marked in (a)


leading to Ic(z) = 0.022, mostly due to the leading dyad: cor(Z1,1, Z2,1) = −0.20. The NE of the spherized vector is small, Jind(z̃) = 0.028, and consequently its MII is Ii(z̃) = Jrot − 0.248, which is quite similar to the ICs' multi-information. Therefore, the new scalars are more advantageous than ICs, since they are uncorrelated, quasi-independent and quasi-Gaussian, and thus more appropriate for methods designed for Gaussian-distributed variables (e.g. statistical eigenvalue techniques, Linear Inverse Models).
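The spherization z̃ = Czz^(−1/2) z can be sketched with an eigendecomposition-based whitening; the mixing matrix below is an illustrative way of generating correlated residuals.

```python
import numpy as np

def spherize(Z):
    """Whitening z_tilde = C_zz^(-1/2) (z - mean): the output has identity
    sample covariance, removing the linear-correlation contribution Ic."""
    Zc = Z - Z.mean(axis=0)
    C = np.cov(Zc, rowvar=False)
    w, E = np.linalg.eigh(C)                  # C = E diag(w) E^T
    return Zc @ (E @ np.diag(w**-0.5) @ E.T)  # symmetric C^(-1/2)

# Correlated toy residuals (illustrative mixing matrix).
rng = np.random.default_rng(5)
raw = rng.standard_normal((2_000, 3))
Z = raw @ np.array([[1.0, 0.5, 0.0],
                    [0.0, 1.0, 0.3],
                    [0.0, 0.0, 1.0]])
Zt = spherize(Z)
```

Using the symmetric square root C^(−1/2) (rather than, say, a Cholesky factor) makes the whitening unique and orientation-preserving, which is convenient when the whitened scalars are to be interpreted individually.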

4.8 Physical interpretation of non‑Gaussian dyads

Statistical sources issued from nonlinear chaotic dynamics are quite difficult to interpret. In fact, BSS methods try to find projections of the non-Gaussian-distributed attractor which best fit certain features or manifolds, here the quadratic correlations. We try to shed some light on the link between sources and the data clusters corresponding to weather regimes.

4.8.1 Sources and weather regimes

Let us suppose the existence of a number of weather regimes corresponding to local PDF maxima, occurring around certain centroids within the variability space (e.g. that of the N leading PCs). Intuitively, if the 'constellation' of centroids is located near or along a certain manifold (e.g. a curve or surface) on which a certain relationship holds (e.g. a quadratic curve), then contrast functions (e.g. certain nonlinear correlations) emphasizing that relationship tend to reach high values. Therefore each centroid will correspond to a connected piece of that manifold or to a composite in a connected PDF region around that manifold.

That situation seems to occur in the manifolds obtained by the NL-PCA of the sea level pressure and geopotential fields (Teng et al. 2007), as well as in the dominant dyad of the QG3 model. In order to illustrate that, we show in Fig. 10 the composite of the 500 hPa stream-function anomaly in each PDF quadrant of the leading dyad, corresponding to different sectors of the parabola: (a) Y1,1 > 0, Y2,1 > 0; (b) Y1,1 < 0, Y2,1 > 0; (c) Y1,1 < 0, Y2,1 < 0 and (d) Y1,1 > 0, Y2,1 < 0, which are clearly associated, respectively, with the AO−, NAO+, AO+ and NAO− weather regimes of the QG3 model, as obtained by comparison with Fig. 2 of Kondrashov et al. (2004); they are also indicated in the PDF quadrants of Fig. 9a. The above regimes, assigned to PDF quadrants, occur respectively 21, 26, 27 and 26 % of the time. The centroid components (Y1,1, Y2,1) are respectively (0.90, 0.82), (−0.87, 0.83), (−0.71, −0.78) and (0.70, −0.78) and lie well beyond the quadrant composite centroids (±(2π)^(−1/2), ±(2π)^(−1/2)) ≈ (±0.4, ±0.4) of an isotropic Gaussian PDF. Even without performing any objective cluster analysis, the centroids fit quite well the curve Y2,1 = 0.13(Y1,1² − 1) + 0.84, agreeing with the dyad's quadratic correlation, though not excluding other nonlinear fittings. This suggests that non-Gaussian sources may provide a better cluster discrimination than that obtained with unrotated PCs by methods such as k-means (Michelangeli et al. 1995) and Gaussian mixture modelling (Smyth et al. 1999).
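Quadrant composites of the kind shown in Fig. 10 can be formed as follows; the isotropic Gaussian toy case, with the 'field' taken as y1 itself, is illustrative only.

```python
import numpy as np

def quadrant_composites(y1, y2, field):
    """Composite (mean) of `field` rows and occupancy frequency in each
    sign quadrant of the dyad components (y1, y2)."""
    out = {}
    for s1, s2 in [(1, 1), (-1, 1), (-1, -1), (1, -1)]:
        mask = (s1 * y1 > 0) & (s2 * y2 > 0)
        out[(s1, s2)] = (field[mask].mean(axis=0), mask.mean())
    return out

# Isotropic Gaussian toy case: quadrants are equally occupied and the
# composite of y1 is positive (negative) where y1 > 0 (y1 < 0).
rng = np.random.default_rng(6)
y1 = rng.standard_normal(100_000)
y2 = rng.standard_normal(100_000)
comp = quadrant_composites(y1, y2, y1.reshape(-1, 1))
```

For non-Gaussian dyads, the departure of the composite centroids from their isotropic Gaussian counterparts is what signals the regime structure discussed in the text.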

4.8.2 Spatial description of sources

Next, we provide some spatial interpretation of the source components. In general, the ith component Yi,k of the kth source (here, the dyads) is given by the spatial inner product between the working anomaly field ψ(P) − E[ψ(P)] at generic points P and the loading maps li,k(P), given by

(32) li,k(P) = Σ_{j=1}^{N} Vj,i,k Es(P, j) λj^(−1/2);  i = 1, …, |yk|; k = 1, …, r

where Vj,i,k are the source loadings (3), λj is the jth PC variance and Es(P, j) is the weight of the jth spatially normalized EOF at P. Interpretability comes from the fact that higher instantaneous values of Yi,k occur when the loading map and the anomaly field exhibit a high pattern correlation. The EOFs contributing most to the loading map are those in the range 1–N with the higher values of |Vj,i,k λj^(−1/2)|, which may be contaminated by small-scale, high-indexed EOFs of low variance due to the −1/2 negative exponent. This raises the issue that relevant non-Gaussian features may be hidden by small low-variance scales. Finally, let us remark that loading maps are not spatially orthogonal unless the variances of the source's intervening PCs are degenerate.

4.8.3 Correlation maps

The correlation between source components and generic random physical variables (e.g. large-scale circulation indices) or maps (SST, surface land temperature, precipitation) may suggest physical links and teleconnections, providing tools for statistical downscaling and empirical prediction models. In particular, the correlation map between Yi,k and the working field ψ at point P is

(33) cor[ψ(P), Yi,k] = Σ_{j=1}^{N} σ[ψ(P)]^(−1) Vj,i,k Es(P, j) λj^(1/2);  i = 1, …, |yk|; k = 1, …, r,

where again σ means standard deviation. Correlation maps tend to be more large-scale dominated than loading maps, due to the λj^(1/2) scaling in (33). Moreover, maps
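Equations (32) and (33) can be sketched for a single source component; the array shapes and names below are ours, and the loop over components and sources is omitted.

```python
import numpy as np

def source_maps(V, E_s, lam, sigma_field):
    """Loading map (32) and correlation map (33) for one source component.
    V: (N,) source loadings; E_s: (P, N) spatially normalized EOF weights;
    lam: (N,) PC variances; sigma_field: (P,) field std at each point."""
    loading = E_s @ (V * lam**-0.5)                # Eq. (32)
    corr = (E_s @ (V * lam**0.5)) / sigma_field    # Eq. (33)
    return loading, corr

# Sanity check: with a single nonzero loading, both maps reduce to the
# corresponding EOF scaled by lam^(-1/2) and lam^(+1/2) respectively.
rng = np.random.default_rng(7)
P, N = 50, 4
E_s = rng.standard_normal((P, N))
lam = np.array([4.0, 2.0, 1.0, 0.5])
sigma = np.ones(P)
V = np.array([0.0, 1.0, 0.0, 0.0])
load_map, cor_map = source_maps(V, E_s, lam, sigma)
```

The opposite λ exponents in the two maps make explicit why loading maps are prone to contamination by low-variance EOFs while correlation maps are large-scale dominated.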


(32) and (33) may be quite different, except when the loadings Vj,i,k are restricted to a few (eventually one) dominant PCs of roughly equal variance (nearly the case of the leading dyads).

Further conclusions may be drawn from the analysis of correlation map gradients. Since the working field is the quasi-geostrophic stream function, we can retrieve the correlation maps with the quasi-geostrophic wind. In fact, by neglecting gradients of σ[ψ(P)], the correlation maps with the quasi-geostrophic zonal wind (ug) and meridional wind (vg) are obtained through the horizontal gradients of (33), i.e. cor[(ug, vg)′, Yi,k] ∼ [−∂/∂y cor(ψ, Yi,k), ∂/∂x cor(ψ, Yi,k)]′.
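This gradient recipe can be sketched with finite differences on a regular grid; the grid spacings and the toy correlation map below are illustrative.

```python
import numpy as np

def wind_correlation_maps(cor_psi, dy, dx):
    """Approximate correlation maps with (ug, vg) from the horizontal
    gradients of the streamfunction correlation map (neglecting gradients
    of sigma, as in the text). cor_psi is indexed [iy, ix]."""
    dcor_dy, dcor_dx = np.gradient(cor_psi, dy, dx)
    return -dcor_dy, dcor_dx   # ~ correlation with ug and vg respectively

# Toy map linear in y: cor_psi = y, so cor_ug = -1 and cor_vg = 0 everywhere.
y = np.arange(6) * 2.0
cor_psi = np.broadcast_to(y[:, None], (6, 8)).copy()
cor_ug, cor_vg = wind_correlation_maps(cor_psi, 2.0, 3.0)
```

The sign convention mirrors the geostrophic relation ug ∝ −∂ψ/∂y, vg ∝ ∂ψ/∂x applied to the correlation map instead of the field itself.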

Finally, it is of intrinsic interest to build proxy variables of the source components relying upon local representers of the working field. For that, one takes regions lying at extremes both of the correlation maps and of the subsequent partial correlation maps.

4.8.4 Spatial analysis of main dyads

The normalized loading maps l1,1 and l2,1 of the dominant dyad (k = 1) at the 500 hPa level appear in Fig. 11a, b. They have similarities and dissimilarities with the correlation maps (Fig. 11c, d). The l1,1 map (Fig. 11a) and its correlation map (Fig. 11c) are mostly projected on the map difference between the negative (AO, NAO) regimes and the positive (AO, NAO) regimes, in agreement with the PDF regime distribution (Fig. 9a). The l2,1 map (Fig. 11b) and the corresponding correlation map (Fig. 11d) reveal a 3-wave sequence passing through the Atlantic Ocean,


Fig. 10 Composites of the 500 hPa stream function anomaly (in units of 10⁶ m² s⁻¹) for the four quadrants of the leading dyad's PDF, for a subspace optimization dimension N = 10, and the associated QG3 weather regimes (a–d) (see text for details)


the North Pole and the Pacific Ocean, essentially accounting for the difference between the NAO and AO regimes. Roughly speaking, when the signs of the first dyad's components are equal, sgn(Y1,1) = sgn(Y2,1), there is constructive interference leading to a hemispheric AO regime; when sgn(Y1,1) = −sgn(Y2,1), there is destructive interference in the Pacific, restricting the anomaly field mostly to a sectorial Euro-Atlantic NAO regime. The transition between positive and negative NAO regimes may possibly be explained by constructive and destructive interference between zonal wave numbers 1–2 (Luo et al. 2012a, b).

In terms of geostrophic wind anomalies, a positive value of Y2,1 leads to jet-stream meandering with poleward flow over western Europe (read from Fig. 11c). That is correlated with the square of Y2,1, i.e. with squared anomalies of the extratropical jet stream (read from Fig. 11d).

As regards the second dyad, the quadratic correlation is much weaker (~0.19), although the non-Gaussianity is significant at the 95 % confidence level. Its PDF (with Gaussian marginals) is shown in Fig. 9b. The corresponding loading maps l1,2 and l2,2 at the 500 hPa level appear in Fig. 12a, b (analogous to Fig. 11a, b), and the correlation maps in Fig. 12c, d (analogous to Fig. 11c, d). From the loading pattern (Fig. 12b), the second component Y2,2 is clearly proportional to the East Atlantic/Western Russia pattern index (Barnston and Livezey 1987). This component has a positive ~0.19 correlation with the square of the first component Y1,2 ~ −A3, which is roughly anti-proportional (from inspection of Fig. 12c) to the jet-stream intensity at the eastern Asian boundary.


Fig. 11 Loading maps (a, b) and correlation maps with stream function anomaly (c, d) at the 500 hPa level of components (Y1,1,Y2,1) of the leading dyad for a subspace optimization dimension N = 10


4.9 Leading non‑Gaussian triad and dependence on the optimization space

We proceed here to a sensitivity study of the contrast function (15) and loading vectors (3) of the leading triad (Y1, Y2, Y3) with respect to the dimension N of the optimization space, in the range N = 3–20. We show values of |E(Y1Y2Y3)| (F-T in Fig. 13), growing as expected with N in a similar manner to the absolute triadic correlation |cor3(Y1, Y2, Y3)|, both in the training and in the validation periods (Cor-T and Cor-V in Fig. 13). Those absolute correlations grow from ~0.30, then stabilize near 0.40 for N = 8–14, to finally reach ~0.55 at N = 20; all of them are statistically significant at the 95 % confidence level, i.e. larger than 0.025. The absence of over-fitting of the rotation angles is justified because the number of fitting angles of a triad, 3N − 1, is much lower than the 6000 iid realizations of the training sample. For N = 3, the value of cor3 is of the same order as the upper bound (2/27)[skewness(PC1)]² = 0.23 (Sect. 3.3), coming from conversions of skewnesses and quadratic correlations due to orthogonal rotations.

The interaction information (IT) introduced in Sect. 2.2 (noted as I3 in Fig. 13) is closely approximated as It(Y1, Y2, Y3) ≈ −(1/2) log[1 − cor3(Y1, Y2, Y3)²] (Pires and Perdigão 2015), following the variations of cor3 and F-T. The full triadic MI I(Y1, Y2, Y3) (11), denoted as MII in Fig. 13, also grows with N, being decomposed as the sum of the dyadic MIs (I2 in Fig. 13) with IT. However, that sum remains practically unchanged with N, i.e. it does not respond much to the optimization, which justifies the appropriateness of the chosen contrast function (15) as a simulator of IT.
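As an illustration (a minimal NumPy sketch, not the authors' code; the array names are hypothetical), the triadic correlation and the above approximation of IT can be estimated from standardized samples as:

```python
import numpy as np

def triadic_correlation(y1, y2, y3):
    """Sample estimate of cor3(Y1, Y2, Y3) = E(Y1 Y2 Y3) for standardized variables."""
    return float(np.mean(y1 * y2 * y3))

def it_approx(cor3):
    """Gaussian-based approximation It ~ -(1/2) log(1 - cor3^2)."""
    return -0.5 * np.log(1.0 - cor3 ** 2)

# toy sample with a planted triadic dependence: Y3 follows the sign of Y1*Y2
rng = np.random.default_rng(0)
y1, y2 = rng.standard_normal((2, 6000))
y3 = np.sign(y1 * y2) * np.abs(rng.standard_normal(6000))
y3 = (y3 - y3.mean()) / y3.std()

c3 = triadic_correlation(y1, y2, y3)
it = it_approx(c3)   # grows monotonically with |cor3|
```

The approximation is exact for a trivariate Gaussian with the given cor3, which is why it tracks the variations of cor3 and F-T so closely.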


Fig. 12 The same as Fig. 11 for components (Y1,2,Y2,2) of the second dyad



Separation of the atmospheric variability into non-Gaussian multidimensional sources by…


Now we evaluate the successive leading triads as a function of N, comparing their explained variance with that of other prescribed spaces and using the diagnostics described in “Appendix 5”. The explained variance fraction of the leading triad (the sum, over the triad's components, of the explained variance fraction fvar (42)) ranges in the interval 32–42 % for N = 3–20 (Fig. 14), but decreases as PCs of lower variance are included in the search space (N growing). However, the variance fraction onto the three leading PCs, f3var (43), surpasses 50 % for all triads (Fig. 14).

Despite that fact, the dominant triad may change drastically when the optimization space dimension increases by one unit, i.e. when passing from subspace VN−1 to VN. This is diagnosed by the variance fraction fproj(VN−1, VN) (41) between the leading triads (a less conservative similarity measure than VN−1 · VN) (see Fig. 14). Some abrupt falls of that diagnostic (e.g. at N = 7) occur due to the substitution of previous contrast-function maxima by other, remote local maxima. That comes from the ‘competition’, observed in practice, between different local maxima of the contrast function keeping some track in the rotation space, and the birth of new local maxima as N increases. This shows the effect of the rise of non-Gaussianities and of interesting data structures such as low-variance, small-scale features. Moreover, we have diagnosed, for certain subspace dimensions, the appearance of a multiplicity of local maxima, some of them only slightly lower than the absolute one.

Therefore, the difference between local maxima must normally be subjected to a statistical significance test. A small difference between local maxima corresponds to a quasi-degeneracy of the triads. This has also appeared in Pires and Perdigão (2015), where triads correspond to triadic

Fig. 13 Values relative to the leading triad with subspace optimization dimension N = 8: values of the optimization function FopT^(1/2) (noted as F-T) and absolute values of the triadic correlation in the training period (Cor-T) and in the validation period (Cor-V); estimated values of IT, MII and the sum of dyadic mutual informations (I2) in the validation period

Fig. 14 Fraction of explained variance by the triad with respect to: a the whole set of PCs (solid black squares); b the set of 3 normalized leading PCs (open white circles); and the triad optimized in a space with one dimension less (open white squares)

Fig. 15 Iso-surface 0.001 of the PDF of the dominant triad optimized for subspace optimization dimension N = 8. Marginal PDFs are standard Gaussian. The triadic correlation is −0.38, consistent with the most populated octants





wave resonant conditions, which are satisfied when the sum of three wave numbers equals zero; this can obviously hold in many ways.

Finally, from the above results we may conjecture that the sensitivity and level of degeneracy of contrast-function maxima grow with the source dimension Nsrc. Moreover, the imposition of more stringent contrast functions makes the rotation optimization less skilful, due to the rareness of more exotic features coming from multi-dimensional synergies. This is shown, for example, for N = 10 by comparing consistent measures for the leading sources: the NE of the IC (0.174), the MI of the dyad (0.168) and the IT of the triad (0.044).

Despite the above-referred pitfalls, triads are not useless and are interpretable under some conditions (Pires and Perdigão 2015 and Sect. 3). Next, we suggest research tracks.

Firstly, the relevance of a non-Gaussian source triad (quartet, quintet, etc.) must rely on a hybrid criterion combining non-Gaussianity and explained variance (statistical size), which pulls the preference towards spaces spanned by low-indexed PCs. Secondly, in order to get robust structures that are not too sensitive to the embedding dimension, we must consider the whole set of sufficiently non-redundant triads (using the diagnostic inner product between subspaces from “Appendix 5”) whose contrast function is above a minimum threshold. Finally, it is possible to proceed to the eigen-decomposition of the third-order cumulant tensor (or others of higher order) by application of the HOSVD (Higher-Order Singular Value Decomposition) (Kolda and Bader 2009) and principal cumulant component analysis (Morton and Lim 2009), providing the main variability directions contributing to the Frobenius norm (the sum of the squared cumulants) and therefore shrinking the optimization space.
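As a sketch of the last suggestion (assuming standardized data in a NumPy array; this is an illustration, not the authors' implementation), the leading HOSVD directions of the third-order cumulant tensor can be read off the SVD of its mode unfolding:

```python
import numpy as np

def third_cumulant_tensor(a):
    """Third-order cumulant tensor kappa_ijk = E(A_i A_j A_k) of centred data a (n_samples, N).
    For centred variables the third cumulant equals the third moment."""
    a = a - a.mean(axis=0)
    return np.einsum('ti,tj,tk->ijk', a, a, a) / a.shape[0]

def hosvd_leading(kappa, r):
    """Leading r HOSVD factors: left singular vectors of the mode-1 unfolding.
    For a symmetric tensor the three mode unfoldings coincide."""
    n = kappa.shape[0]
    u, s, _ = np.linalg.svd(kappa.reshape(n, n * n), full_matrices=False)
    return u[:, :r], s[:r]

# toy data: one strongly skewed direction hidden among Gaussian ones
rng = np.random.default_rng(1)
a = rng.standard_normal((20000, 6))
a[:, 0] = (a[:, 0] ** 2 - 1.0) / np.sqrt(2.0)   # standardized chi-square-like variable
u, s = hosvd_leading(third_cumulant_tensor(a), 2)
# the leading direction u[:, 0] is dominated by the skewed component
```

The leading singular directions concentrate the Frobenius norm of the cumulant tensor, so restricting the rotation search to their span is one way of shrinking the optimization space.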

4.10 Physical interpretation of non‑Gaussian triads

Due to the high sensitivity of the triads to N, we have decided not to single out any one in particular. However, independently of the chosen triads, there are some key points worth noting.

The non-null value of the triadic correlation is consistent with a certain three-dimensional PDF structure. For instance, for N = 8 one obtains cor3(Y1, Y2, Y3) = −0.38, which is consistent with probabilities p−−− = p−++ = p++− = 16 % and p+−+ = 17 % on the octants favouring the condition sgn(Y1Y2Y3) = sgn(cor3) (e.g. p+−+ is the probability of Y1, Y3 > 0, Y2 < 0), all larger than the value 1/8 = 12.5 % verified under the hypothesis of statistical independence of the triad's components.

Therefore, the maximization of |cor3| leads to four twisted PDF clusters (which apparently are not the QG3 regimes). The corresponding sum of probabilities on the most populated octants is 64 %, well above 50 %. The remaining minor octants occur with probabilities of ~8–9 % each. The clustering is evident in the trivariate joint PDF ρY1gY2gY3g of the Gaussianized components, shown in Fig. 15 through the PDF iso-surface 0.001, which emphasizes four asymmetrically-displaced local PDF maxima [see similar PDFs in Figs. 2, 12 of Pires and Perdigão (2015)], disposed similarly to the vertices of a tetrahedron. The three-dimensional PDFs are estimated by the kernel-based method explained in “Appendix B” of Pires and Perdigão (2015).
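The octant probabilities quoted above can be checked by simple counting; a minimal sketch (array names are hypothetical):

```python
import numpy as np

def octant_probabilities(y1, y2, y3):
    """Empirical probabilities of the 8 sign octants of (Y1, Y2, Y3),
    ordered by the binary code (Y1>0, Y2>0, Y3>0)."""
    idx = (y1 > 0).astype(int) * 4 + (y2 > 0).astype(int) * 2 + (y3 > 0).astype(int)
    return np.bincount(idx, minlength=8) / len(y1)

# under independence every octant has probability 1/8 = 12.5 %
rng = np.random.default_rng(2)
p = octant_probabilities(*rng.standard_normal((3, 100000)))
```

Applied to the triad's Gaussianized components, the four octants satisfying sgn(Y1Y2Y3) = sgn(cor3) would collect the excess probability described in the text.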

Finally, one may build exploratory nonlinear indices, as for dyads (30), describing the variability along triads:

(34)  FindT ≡ YiYj / σ(YiYj) + sgn[E(Y1Y2Y3)] Yk,

where (i, j, k) represents any particular permutation of (1, 2, 3); these indices may be potentially useful for downscaling purposes and statistical modelling.
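A minimal sketch of index (34) (the input arrays are hypothetical standardized components):

```python
import numpy as np

def f_ind_t(yi, yj, yk, sign_cor3):
    """Nonlinear triadic index (34): Yi*Yj / sigma(Yi*Yj) + sgn[E(Y1 Y2 Y3)] * Yk."""
    prod = yi * yj
    return prod / prod.std() + sign_cor3 * yk

rng = np.random.default_rng(3)
y1, y2, y3 = rng.standard_normal((3, 1000))
sign = np.sign(np.mean(y1 * y2 * y3))
index = f_ind_t(y1, y2, y3, sign)   # one of the three permutations (i, j, k)
```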

5 Discussion and conclusions

A decomposition method of the multivariate variability, the so-called ISA, using a projection-pursuit-based technique, is presented and tested in several scenarios, namely a low-dimensional chaotic model (Lorenz 1963) and the superposition of waves. Then, ISA is applied to a high-dimensional random vector comprising the leading principal components of the monthly stream-function anomalies generated by an atmospheric model of intermediate complexity, the quasi-geostrophic, 3-level model of Marshall and Molteni (1993) (QG3 model), simulating the low-frequency variability and weather regimes in the Northern Hemisphere during winter. ISA is one of the Blind Source Separation (BSS) methods, quite common in signal processing, and consists of a generalization of PCA and independent component analysis (ICA). ISA separates the multivariate variability into orthogonal subspaces, spanned by uncorrelated components, which are viewed as statistical vectorial sources with non-Gaussian internal structure and susceptible of a possible physical meaning. Candidate vectorial sources are sought within a certain optimization space, being as statistically independent as possible. This is possible thanks to the Lemma of Negentropy, presented in the paper, showing that the dependence level among sources, measured by mutual information, decreases as the sum of the joint sources' negentropies increases. That is performed by optimizing projections of the original variability space, maximizing prescribed nonlinear contrast functions of easy geometrical interpretation. Contrast functions are chosen to satisfy a set of criteria usually adopted by projection pursuit techniques (Huber 1985): the low dimensionality of the sources' subspaces, the synergy






among sources’ components in the contrast functions and the outcome of a nonlinear disentangling among sources’ components through the solution of a nonlinear regression problem. Therefore, on those projections, data tends to con-centrate mostly around certain curves or manifolds contrib-uting for the reduction of the statistical dimensionality.

The high model’s dimension makes the PC’s non-Gauss-ianity quite weak and ‘hidden’ across dimensions. There-fore a method of rotation and projection is necessary to reach highly non-Gaussian features. Therefore, one looks for non-Gaussian dyads and triads, spanned respectively by standardized components (Y1, Y2) and (Y1, Y2, Y3) which maximize squares, respectively of the quadratic covariance E(Y2

1Y2) and of the triadic covariance E(Y1Y2Y3), reach-ing values as high as ~54 and ~40 %, both well above the threshold of statistical significance. For the leading opti-mized dyad of the QG3 model, quite robust with respect to the dimension of optimization space, data concentrates near a parabolic-shaped manifold linking the centroids of the four typical model’s weather regimes labelled as AO−, AO+ , NAO−, NAO + , corresponding to the positive and negative phases of the Arctic Oscillation (AO) and North Atlantic Oscillation (NAO). Nonlinear changes of the sources’ components lead to a set of scalar sources that is competitive with Independent Components (obtained by ICA), in terms of statistical independence but have the advantage of being much closer to Gaussian. On the other hand, there is a much larger variety of triads calling for an objective criterion for their choice.

The correlation and loading maps of the sources' components provide certain synoptic interpretations, as well as their relationship with the field of geostrophic wind anomalies.

The introduced method is a statistical tool for detecting enhanced non-Gaussian (NG) triads, holding in chaotic fluid dynamics when the wave resonance condition is verified, or when certain interferences in both the wave-number and frequency domains hold. This is an emerging triadic phenomenon leading to high values of an information-theoretic measure, the Interaction Information (IT).

The detection of NG sources has important applications in the case of ergodic stochastic multivariate systems (the case of QG3), with applications in predictability (Pires and Perdigão 2015). In fact, nonlinear contemporaneous correlations may be generalized to time-lagged ones (e.g. through the cross bi-covariance function), to which spectral methods may be applied. The maximization of triadic nonlinear correlations is indirectly extended to future time lags through the memory of resonance triplets (e.g. temporal wavelets) with some nonlinear predictability. However, this is left as a suggestion for future studies.

Acknowledgments This research was developed at IDL with the support of the Portuguese Foundation for Science and Technology (FCT—Fundação para a Ciência e Tecnologia) through the Project RECI/GEO-MET/0380/2012—SHARE—Seamless High-resolution Atmosphere–Ocean Research, and also through the project FCT UID/GEO/50019/2013—Instituto Dom Luiz. Thanks are due to Rui Perdigão, Pedro Miranda and Ricardo Trigo for discussions, to our families for their omnipresent support, and to two anonymous reviewers who have undoubtedly contributed to the improvement of the paper.

Appendix 1: List of used symbols

a: Vector of spherized PCs indexed by decreasing variance
cor, cor3: Pearson correlation and triadic correlation
c2, c3: Quadratic (triadic) correlation used in the nonlinear unfolding of dyads (triads)
Cxx: Covariance matrix of the generic random vector x
E(): Expectation operator
Fop,k(yk): Contrast function of the vectorial source yk
FindD, FindT: Nonlinear hybrid index derived for dyads (triads)
FopIC, FopD, FopT: Contrast functions for ICs, dyads and triads
fproj(V1, V2): Fraction of the squared norm of subspace V1 that is projected onto subspace V2
fvar: Fraction of explained variance by a source
f3var: Fraction of a source's loading vector squared norm projected onto the 3 main standardized PCs
H(x): Shannon entropy of the generic random vector x
Hg(x): Shannon entropy of the Gaussian PDF with the same mean and covariance matrix as x
Hnl(a, z): Shannon entropy variation coming from the variable change a → z
Hnl,1, Hnl,2: Decomposing terms of Hnl
I: MII between a set of scalar or vectorial random variables taken as I arguments
Ic: MII coming from the correlation matrix
Ii: Multi-information (MII) between a set of scalar random variables
Iia: Sum of the MIIs of sources, or intra-MII
Iie: MII between the set of vectorial sources, or inter-MII
It: Interaction information (IT) between a set of scalar random variables taken as It arguments
J(x): Negentropy of the generic random vector x
Jind(x): Sum of the negentropies of the scalar components of x
li,k(P): Value at point P of the loading map for the component Yi,k of source yk
N: Dimension of the optimization space for the search of sources, or embedding dimension
Nang: Number of rotation angles of the optimization space
Ndof: Number of temporal degrees of freedom of the dataset
Ndof−p: Number of independent samples for estimating each parameter
Np: Number of grid points in the analysed spatial domain
Nr: Rank of the dataset matrix
Nsrc: Generic dimension of a vectorial source
Nt: Number of available instants in the dataset
Ntot: Full dimension of the working field (total number of PCs)
r: Number of vectorial sources spanning the analysis space
R, R1, R2, Rαi: Orthogonal rotation matrices used to get rotated spherized sources
Vk, VN: Linear subspace spanned by the kth source or by the leading source for an embedding dimension N
vi,k, Vj,i,k: Loading vector of the ith component of the kth source, and its component onto the jth standardized PC
vi(N): Loading vector of the ith component of the leading source for an embedding dimension N
Vi · Vj: Measure of similarity between subspaces Vi and Vj
W: Matrix of EOFs in columns
X: Dataset matrix
x: Random vector of the raw multivariate data
xPC: Vector filled by PCs XPC,1, XPC,2, … of decreasing variance
y: Vector of orthogonally rotated spherized PCs
yICs: Vector of orthogonally rotated spherized PCs for independent components (ICs)
yk: kth vectorial source filled by spherized components
|yk|: Number of components (dimension) of the kth vectorial source yk
Yi,k: ith component of the kth vectorial source yk
z, z: Array of standardized variables issued from the nonlinear unfolding, and that of spherized ones
zk, Zi,k: kth source obtained after the nonlinear unfolding, and its ith component
zk, Zi,k: kth source obtained after the nonlinear unfolding and spherization, and its ith component
α: Vector filled by rotation angles α1, α2, …
κijk: Third-order cumulant between the variables of indices i, j and k
Λ: Diagonal matrix filled by PC variances λ1 ≥ λ2 ≥ … in decreasing order
σ: Standard deviation
( ): Sampling average

Appendix 2: Estimators of the negentropy and mutual information

The estimators of the two- and three-dimensional multi-informations are explained in “Appendix B” of Pires and Perdigão (2015). They rely on kernel-based PDFs, with single variables being subjected to Gaussian anamorphoses (Wackernagel 1998), preventing the effect of outliers in the estimation. MII integrals are then computed using Gauss quadrature integration formulas.

For a standardized random variable U, the negentropy is approximated by a linear combination of appropriate contrast functions (Hyvärinen 1998):

(35)  J(U) ≈ [36/(8√3 − 9)] E[U exp(−U²/2)]² + [24/(16√3 − 27)] {E[exp(−U²/2)] − √(1/2)}²
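A sketch of estimator (35) for a standardized sample (constants as in Hyvärinen 1998; illustrative code, not the authors'):

```python
import numpy as np

K1 = 36.0 / (8.0 * np.sqrt(3.0) - 9.0)     # ~7.41
K2 = 24.0 / (16.0 * np.sqrt(3.0) - 27.0)   # ~33.67

def negentropy_approx(u):
    """Approximation (35) of the negentropy of a standardized variable U."""
    g = np.exp(-u ** 2 / 2.0)
    return K1 * np.mean(u * g) ** 2 + K2 * (np.mean(g) - np.sqrt(0.5)) ** 2

rng = np.random.default_rng(4)
gauss = rng.standard_normal(100000)                  # negentropy ~ 0
skewed = rng.exponential(size=100000)
skewed = (skewed - skewed.mean()) / skewed.std()     # clearly non-Gaussian
```

Both expectations vanish against their Gaussian reference values when U is standard normal, so the estimator is near zero for Gaussian samples and positive otherwise.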

Appendix 3: Maximization of the contrast functions

We get the absolute maximum of a generic contrast function Fop using the gradient-descent technique used in






Pires and Perdigão (2015). That is written as an implicit function Fop(R), depending on expectations of the type E[FY(ysrc)], where ysrc ≡ (Y1, …, YNsrc)′ is a generic Nsrc-dimensional source. Moreover, any N-order orthogonal matrix R can be written as a matrix product of all possible elementary rotation matrices (Jacobi or Givens rotations) with generalized Euler angles αl,m of the plane spanned by components (Al, Am), l ≠ m (Raffenetti and Ruedenberg 1969). The number of combined planes and angles (filling the angle vector α) is Nang(N) ≡ max{0, N(N − 1)/2}. Therefore, since R = R(α), we get E[FY(ysrc)] = E{FY(R(α)a)} ≡ Fα(α), and the gradient of Fop with respect to αl,m is

(36)  ∂Fop/∂αl,m = ∂Fα/∂αl,m = Σk=1..Nsrc Σp,q=1..N E[(∂FY/∂Yk)(∂Yk/∂Rp,q)] (∂Rp,q/∂αl,m),

where Rp,q is the entry of R in the pth column and qth row. This method is an alternative to others, like the fixed-point iteration scheme applied in FastICA (Hyvärinen and Oja 2000) and Lagrange-multiplier-based schemes imposing orthonormality (Jennrich 2001).

Since FY(ysrc) is a multivariate polynomial, the components of the above gradient (36) are proportional to a linear combination of expectations of multivariate monomials of the components of a, which can be estimated a priori from data and saved. Then, the gradient enters into an optimization routine using the quasi-Newton method to get local minima of −Fop(R), starting from a string of randomly chosen first-guess angles filling α. There are multiple local maxima of Fop, in a number that generally grows with N and Nsrc. Some of them are trivial, related to symmetries of FY(ysrc), and others come from the possible projections and rotations of the multi-dimensional PDF of a exhibiting high values of non-Gaussianity [see Pires and Perdigão (2015) for a discussion]. In order to reach the absolute maximum, we try a large number Nfg of first-guess angle vectors, then choosing the maximum of the local maxima.
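A toy sketch of this multi-start strategy (pure NumPy, with a finite-difference gradient ascent standing in for the quasi-Newton routine; the contrast below is the squared triadic covariance and everything here is illustrative, not the authors' code):

```python
import numpy as np

def rotation_from_angles(alpha, n):
    """Orthogonal matrix R(alpha) as a product of Givens rotations over all planes (l, m)."""
    r = np.eye(n)
    k = 0
    for l in range(n):
        for m in range(l + 1, n):
            g = np.eye(n)
            c, s = np.cos(alpha[k]), np.sin(alpha[k])
            g[l, l] = g[m, m] = c
            g[l, m], g[m, l] = -s, s
            r = g @ r
            k += 1
    return r

def contrast(alpha, a):
    """Toy contrast: squared triadic covariance E(Y1 Y2 Y3) of the rotated components."""
    y = a @ rotation_from_angles(alpha, a.shape[1]).T
    return np.mean(y[:, 0] * y[:, 1] * y[:, 2]) ** 2

def multistart_maximize(a, n_fg=8, steps=250, lr=0.2, eps=1e-5, seed=0):
    """Refine n_fg random first-guess angle vectors; keep the best local maximum."""
    rng = np.random.default_rng(seed)
    n_ang = a.shape[1] * (a.shape[1] - 1) // 2
    best_val, best_alpha = -np.inf, None
    for _ in range(n_fg):
        alpha = rng.uniform(0.0, 2.0 * np.pi, n_ang)
        for _ in range(steps):   # finite-difference gradient ascent
            grad = np.array([(contrast(alpha + eps * e, a) - contrast(alpha - eps * e, a))
                             / (2.0 * eps) for e in np.eye(n_ang)])
            alpha = alpha + lr * grad
        val = contrast(alpha, a)
        if val > best_val:
            best_val, best_alpha = val, alpha
    return best_val, best_alpha

# plant a triadic source, mix it by a rotation, then try to recover it
rng = np.random.default_rng(5)
y1, y2 = rng.standard_normal((2, 2000))
y3 = np.sign(y1 * y2) * np.abs(rng.standard_normal(2000))
src = np.stack([y1, y2, (y3 - y3.mean()) / y3.std()], axis=1)
a = src @ rotation_from_angles(np.array([0.3, 1.1, 2.0]), 3).T
val, _ = multistart_maximize(a)   # approaches the planted cor3**2 ~ 0.25
```

The multiple restarts are what cope with the trivial and non-trivial local maxima discussed above.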

Appendix 4: Cumulants and contrast functions

Let us introduce the cumulants associated with a generic N-dimensional random vector u = (U1, …, UN)′. A joint cumulant κ of order Nord, between generic, eventually repeated scalar components Uind(1), …, Uind(Nord), where ind(k) ∈ {1, …, N}, k = 1, …, Nord, is

(37)  κ^ind(1),…,ind(Nord) = ΣP (|P| − 1)! (−1)^(|P|−1) ΠB∈P E(Πi∈B Uind(i)),

where P runs through the list of all partitions of {1, …, Nord} and B runs through all blocks of P with



cardinality |P|. Cumulants of order Nord ≥ 3 vanish for a multivariate Gaussian u, working as measures of non-Gaussianity. For instance, the cumulants of orders 3 and 4 between centred variables are expressed respectively as

(38)  κ^ijk ≡ E(UiUjUk),
      κ^ijkl ≡ E(UiUjUkUl) − E(UiUj)E(UkUl) − E(UiUk)E(UjUl) − E(UiUl)E(UjUk),

giving multivariate generalizations of the skewness and kurtosis in the case of standardized variables. The third-order approximations of the negentropy of a scalar (Comon 1994), the MI of a non-Gaussian dyad (Pires and Perdigão 2007) and the IT of a non-Gaussian triad (Pires and Perdigão 2015) are:

(39)  J(Y1) = (1/12)(κ^111)² + O(Neq^−1)
      I(Y1, Y2) = (1/4)[(κ^112)² + (κ^221)²] + O(Neq^−1)
      It(Y1, Y2, Y3) = (1/2)(κ^123)² + O(Neq^−1),

where all of Y1, Y2, Y3 are supposed to behave as a sum of Neq iid variables. Contrast functions are then truncated versions of the above approximations.
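A sketch of the third-order approximations (39) from sample cumulants (standardized input arrays assumed; not the authors' code):

```python
import numpy as np

def k3(u, v, w):
    """Sample third-order cumulant E(UVW) of centred, standardized variables (Eq. 38)."""
    return float(np.mean(u * v * w))

def negentropy3(y1):
    return k3(y1, y1, y1) ** 2 / 12.0                            # J(Y1), Eq. (39)

def mi3_dyad(y1, y2):
    return (k3(y1, y1, y2) ** 2 + k3(y2, y2, y1) ** 2) / 4.0     # I(Y1, Y2)

def it3_triad(y1, y2, y3):
    return k3(y1, y2, y3) ** 2 / 2.0                             # It(Y1, Y2, Y3)

# standardized exponential variable: skewness 2, hence J ~ 2**2 / 12 ~ 0.33
rng = np.random.default_rng(6)
y = rng.exponential(size=200000)
y = (y - y.mean()) / y.std()
```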

Appendix 5: Criteria for comparing subspaces of variability

Source components are random variables (e.g. stochastic processes) belonging to a real vectorial metric space (the variability space), provided with an inner product given by covariance and a norm given by variance. Correlation is the cosine of the angle between random vectors. Therefore, orthogonality (collinearity) stands for uncorrelatedness (correlation equal to ±1). The null vector is the zero-valued random variable, constituting a space (the zero space) of dimension zero. The N leading standardized PCs, filling the vector a (Eq. 1), form a basis of orthonormalized vectors spanning the full or embedding N-dimensional space.

Now, let us consider two arbitrary statistical sources y1, y2 of respective dimensions Nsrc1 ≤ Nsrc2. The sources belong to subspaces, hereby denoted V1 ≡ Span(v1,1, …, vNsrc1,1) and V2 ≡ Span(v1,2, …, vNsrc2,2), spanned by the bases of loading vectors vi,k (k = 1, 2; i = 1, …, Nsrck). Below, we give the correlation (measure of collinearity) between the two subspaces, denoted V1 · V2, which corresponds geometrically to the cosine of the angle between the two subspaces (e.g. the angle between two planes). For that, we introduce the Gram cross matrix G with entries Gi,i′ ≡ v′i,1 vi′,2 = cor(Yi,1, Yi′,2), i = 1, …, Nsrc1; i′ = 1, …, Nsrc2. The Nsrc1-dimensional squared matrix GG′ has Nsrc1 eigenvalues






sorted as λG,1 ≥ ⋯ ≥ λG,Nsrc1, all in the interval [0, 1] and being the squared cosines of the principal angles between the above subspaces (Gunawan et al. 2005). If all eigenvalues vanish (equal one), then the sources are fully separated (totally redundant). A conservative measure of the similarity (correlation) between the subspaces, used throughout the text, is defined as

(40)  V1 · V2 ≡ (λG,Nsrc1)^(1/2) = min U∈V1 |cor(U, projV2(U))| ∈ [0, 1],

where proj stands for the projection onto the subscripted subspace. When V1 · V2 = 0 (e.g. two orthogonal planes or dyads in R³), the space V1 ∩ V2⊥ is larger than the zero space, where ( )⊥ denotes the orthogonal complement.

The above eigenvalues are also interpreted as explained variances. Therefore, the fraction of the V1 variance that is explained by V2 comes as

(41)  fproj(V1, V2) = (1/Nsrc1) Σi=1..Nsrc1 λG,i ≥ (V1 · V2)²,

being one (zero) for totally redundant (fully separated) sources.
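A sketch of diagnostics (40) and (41) (the loading vectors are assumed to be orthonormal columns; illustrative code, not the authors'):

```python
import numpy as np

def subspace_diagnostics(v1, v2):
    """Similarity V1 . V2 (40) and explained-variance fraction fproj (41).
    v1: (N, Nsrc1), v2: (N, Nsrc2), orthonormal columns, Nsrc1 <= Nsrc2."""
    g = v1.T @ v2                                   # Gram cross matrix, cor(Yi,1, Yi',2)
    eig = np.sort(np.linalg.eigvalsh(g @ g.T))      # squared cosines of principal angles
    similarity = float(np.sqrt(max(eig[0], 0.0)))   # conservative measure (smallest cosine)
    f_proj = float(eig.mean())                      # mean squared cosine
    return similarity, f_proj

# two planes in R^3 sharing only the direction e1
e = np.eye(3)
v1 = e[:, [0, 1]]      # span(e1, e2)
v2 = e[:, [0, 2]]      # span(e1, e3)
sim, fp = subspace_diagnostics(v1, v2)   # sim = 0.0, fp = 0.5
```

The example reproduces the V1 · V2 = 0 case discussed above: the two planes share one direction (fproj = 1/2) but V1 ∩ V2⊥ is non-trivial, so the conservative measure vanishes.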

Now, let us express the fraction of variance of the working field (taken from the totality of the Ntot PCs) which is explained by the kth source, embedded in the N-dimensional space. In terms of the loading vector components (3) and the PC variances, it is given by

(42)  fvar ≡ (Σj=1..Ntot λj)^−1 Σi=1..|yk| Σj=1..N (Vj,i,k)² λj,

where the index k has been dropped for simplicity. Finally, a particular form of (42) gives the variance fraction explained by the kth source in the space spanned by the 3 leading standardized PCs:

(43)  f3var ≡ (1/3) Σj=1..3 Σi=1..N (Vj,i,k)²

References

Aires F, Chédin A, Nadal JP (2000) Independent component analysis of multivariate time series: application to the tropical SST variability. J Geophys Res 105(D13):17437–17455. doi:10.1029/2000JD900152

Aires F, Rossow WB, Chédin A (2002) Rotation of EOFs by the independent component analysis: toward a solution of the mixing problem in the decomposition of geophysical time series. J Atmos Sci 59:111–123. doi:10.1175/1520-0469(2002)059<0111:ROEBTI>2.0.CO;2


Almeida L (2003) MISEP—linear and nonlinear ICA based on mutual information. J Mach Learn Res 4:1297–1318. http://www.jmlr.org/papers/volume4/almeida03a/almeida03a.pdf

Barnston A, Livezey RE (1987) Classification, seasonality and persistence of low-frequency circulation patterns. Mon Weather Rev 115:1083–1126

Bellman RE (1957) Dynamic programming. Princeton University Press, Princeton. ISBN 978-0-691-07951-6

Bernacchia A, Naveau P (2008) Detecting spatial patterns with the cumulant function—part 1: the theory. Nonlinear Process Geophys 15:159–167. doi:10.5194/npg-15-159-2008

Bernacchia A, Naveau P, Vrac M, Yiou P (2008) Detecting spatial patterns with the cumulant function—part 2: an application to El Niño. Nonlinear Process Geophys 15:169–177. doi:10.5194/npg-15-169-2008

Berner J, Branstator GW (2007) Linear and nonlinear signatures in the planetary wave dynamics of an AGCM: probability density functions. J Atmos Sci 64:117–136. doi:10.1175/JAS3822.1

Blanchard G, Kawanabe M, Sugiyama M, Spokoiny V, Müller KR (2006) In search of non-Gaussian components of a high-dimensional distribution. J Mach Learn Res 7:247–282. http://www.jmlr.org/papers/volume7/blanchard06a/blanchard06a.pdf

Bocquet M, Pires CA, Lin W (2010) Beyond Gaussian statistical modeling in geophysical data assimilation. Mon Weather Rev 138:2997–3023. doi:10.1175/2010MWR3164

Bordes G, Moisy F, Dauxois T, Cortet PP (2012) Experimental evidence of a triadic resonance of plane inertial waves in a rotating fluid. Phys Fluids 24(1):014105

Bradley D, Morris JM (2013) On the performance of negentropy approximations as test statistics for detecting sinusoidal RFI in microwave radiometers. IEEE Trans Geosci Remote Sens 51:4945–4951. doi:10.1109/TGRS.2013.2266358

Browne MW (2001) An overview of analytic rotation in exploratory factor analysis. Multivar Behav Res 36:111–150

Cardoso J (1998) Multidimensional independent component analysis. In: Proceedings of the 1998 IEEE international conference on acoustics. Speech and signal processing, vol 4, pp 1941–1944. doi:10.1109/ICASSP.1998.681443

Cardoso JF, Souloumiac A (1993) Blind beamforming for non-Gaussian signals. IEE Proc F 140(6):362–370

Comon P (1994) Independent component analysis, a new concept? Signal Process 36:287–314

Corti S, Giannini A, Tibaldi S, Molteni F (1997) Patterns of low-frequency variability in a three-level quasi-geostrophic model. Clim Dyn 13(12):883–904. doi:10.1007/s003820050203

Cover TM, Thomas JA (2006) Elements of information theory, 2nd edn. Wiley, USA, p 748

D’Andrea F (2002) Extratropical low-frequency variability as a low-dimensional problem. Part II: stationarity and stability of large-scale equilibria. Q J R Meteorol Soc 128:1059–1073

D’Andrea F, Vautard R (2001) Extratropical low-frequency variability as a low-dimensional problem. Part I: a simplified model. Q J R Meteorol Soc 127:1357–1374

Deloncle A, Berk R, D’Andrea F, Ghil M (2007) Weather regime prediction using statistical learning. J Atmos Sci 64:1619–1635. doi:10.1175/JAS3918.1

Farrell BF, Ioannou PJ (1996) Generalized stability theory. Part I: autonomous operators. J Atmos Sci 53:2025–2040

Franzke C, Majda AJ (2006) Low-order stochastic mode reduction for a prototype atmospheric GCM. J Atmos Sci 63:457–479. doi:10.1175/JAS3633.1

Franzke C, Majda AJ, Branstator G (2007) The origin of nonlinear signatures of planetary wave dynamics: mean phase space tendencies and contributions from non-Gaussianity. J Atmos Sci 64:3987–4003. doi:10.1175/2006JAS2221.1

Author's personal copy

Page 31: Author's personal copyidl.campus.ciencias.ulisboa.pt/wp-content/uploads/...associative feed-forward neural networks in which the encoding of scalar sources from data may be quite com-plex

Separation of the atmospheric variability into non-Gaussian multidimensional sources by…

1 3

Friedman JH, Stuetzle W (1981) Projection pursuit regression. J Am Stat Assoc 76:817–823

Friedman JH, Tukey JW (1974) A projection pursuit algorithm for exploratory data analysis. IEEE Trans Comput 23(9):881–890

Giesecke A, Albrecht T, Gundrum T, Herault J, Stefani F (2015) Triadic resonances in non-linear simulations of a fluid flow in a precessing cylinder. New J Phys 17:113044. doi:10.1088/1367-2630/17/11/113044

Gnanadesikan R, Wilk M (1969) Data analytic methods. In: Krishnaiah P (ed) Multivariate analysis II. Academic Press, New York, pp 593–638

Golub GH, van Loan CF (1996) Matrix computations. The Johns Hopkins University Press, Baltimore, p 694

Gruber P, Gutch HW, Theis FJ (2009) Hierarchical extraction of independent subspaces of unknown dimensions. In: Proceedings of the 8th international conference, ICA 2009, Paraty, Brazil, March 15–18. Lecture notes in computer science, vol 5441. Springer, Berlin, pp 259–266. doi:10.1007/978-3-642-00599-2_33

Gunawan H, Neswan O, Setya-Budhi W (2005) A formula for angles between subspaces of inner product spaces. Contrib Algebra Geom 46(2):311–320

Hammack JL (1993) Resonant interactions among surface water waves. Annu Rev Fluid Mech 25:55–97

Hannachi A, Jolliffe IT, Stephenson DB, Trendafilov NT (2007) Empirical orthogonal functions and related techniques in atmospheric science: a review. Int J Climatol 27:1119–1152

Hannachi A, Unkel S, Trendafilov NT, Jolliffe IT (2009) Independent component analysis of climate data: a new look at EOF rotation. J Clim 22:2797–2812. doi:10.1175/2008JCLI2571.1

Hasselmann K (1976) Stochastic climate models part I theory. Tellus 28(6):473–485. doi:10.1111/j.2153-3490.1976.tb00696.x

Hastie T, Stuetzle W (1989) Principal curves. J Am Stat Assoc 84:502–516

Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning: data mining, inference, and prediction. Springer Series in Statistics, Springer, New York

Hastie T, Tibshirani R, Friedman J (2008) The elements of statistical learning: data mining, inference and prediction, 2nd edn. Springer, New York, p 778

Hlinka J, Hartman D, Vejmelka M, Novotna D, Palus M (2014) Non-linear dependence and teleconnections in climate data: sources, relevance, nonstationarity. Clim Dyn 42:1873–1886. doi:10.1007/s00382-013-1780-2

Horel JD (1981) A rotated principal component analysis of the interannual variability of the Northern Hemisphere 500 mb height field. Mon Weather Rev 109:2080–2092

Hsieh WW (2001) Nonlinear canonical correlation analysis of the tropical Pacific climate variability using a neural network approach. J Clim 14:2528–2539

Hsieh WW, Wu A (2002) Nonlinear multichannel singular spectrum analysis of the tropical Pacific climate variability using a neural network approach. J Geophys Res 107(C7):3076. doi:10.1029/2001JC000957

Huber PJ (1985) Projection pursuit. Ann Stat 13(2):435–475

Hyvärinen A (1998) New approximations of differential entropy for independent component analysis and projection pursuit. In: Jordan MI, Kearns MJ, Solla SA (eds) Advances in neural information processing systems, vol 10. MIT Press, Cambridge, MA, pp 273–279

Hyvärinen A, Oja E (2000) Independent component analysis: algorithms and applications. Neural Netw 13(4–5):411–430

Hyvärinen A, Pajunen P (1999) Nonlinear independent component analysis: existence and uniqueness results. Neural Netw 12(3):429–439

Jakulin A, Bratko I (2004) Quantifying and visualizing attribute interactions: an approach based on entropy. arXiv:cs/0308002v3 [cs.AI]

Jennrich RI (2001) A simple general procedure for orthogonal rotation. Psychometrika 66:289–306

Jolliffe IT (2002) Principal component analysis, 2nd edn. Springer, New York, xxix + 487 pp. ISBN 0-387-95442-2

Kalnay E (2003) Atmospheric modelling, data assimilation and predictability. Cambridge University Press, Cambridge, pp xxii + 341. ISBN 0-521-79629-6

Kimoto M, Ghil M (1993a) Multiple flow regimes in the Northern Hemisphere winter. Part I: methodology and hemispheric regimes. J Atmos Sci 50:2625–2644

Kimoto M, Ghil M (1993b) Multiple flow regimes in the Northern Hemisphere winter. Part II: sectorial regimes and preferred transitions. J Atmos Sci 50:2645–2673

Kirshner S, Póczos B (2008) ICA and ISA using Schweizer–Wolff measure of dependence. In: Proceedings of the 25th international conference on machine learning (ICML 2008), 5–9 July, Helsinki, Finland. ACM Press, pp 464–471

Koch I, Naito K (2007) Dimension selection for feature selection and dimension reduction with principal and independent component analysis. Neural Comput 19(2):513–545

Kolda TG, Bader BW (2009) Tensor decompositions and applications. SIAM Rev 51:455–500. doi:10.1137/07070111X

Kondrashov D, Ide K, Ghil M (2004) Weather regimes and preferred transition paths in a three-level quasi-geostrophic model. J Atmos Sci 61:568–587

Kondrashov D, Kravtsov S, Ghil M (2006) Empirical mode reduction in a model of extratropical low-frequency variability. J Atmos Sci 63(7):1859–1877

Kondrashov D, Kravtsov S, Ghil M (2011) Signatures of nonlinear dynamics in an idealized atmospheric model. J Atmos Sci 68(1):1–3

Lagrange R, Eloy C, Nadal F, Meunier P (2008) Instability of a fluid inside a precessing cylinder. Phys Fluids 20(8):081701

Lorenz EN (1963) Deterministic nonperiodic flow. J Atmos Sci 20:130–141. doi:10.1175/1520-0469(1963)020<0130:DNF>2.0.CO;2

Lorenz EN (1995) Predictability: a problem partly solved. In: Seminar on predictability, vol I. ECMWF, Reading, pp 1–18. http://www.ecmwf.int/sites/default/files/elibrary/1995/10829-predictability-problem-partly-solved.pdf. Last access 15 Nov 2015

Luo D, Jing C, Feldstein SB (2012a) Weather regime transitions and the interannual variability of the North Atlantic Oscillation. Part I: a likely connection. J Atmos Sci 69:2329–2346. doi:10.1175/JAS-D-11-0289.1

Luo D, Jing C, Feldstein SB (2012b) Weather regime transitions and the interannual variability of the North Atlantic Oscillation. Part II: dynamical processes. J Atmos Sci 69:2347–2363. doi:10.1175/JAS-D-11-0290.1

Marshall J, Molteni F (1993) Toward a dynamical understanding of atmospheric weather regimes. J Atmos Sci 50:1792–1818

McGill WJ (1954) Multivariate information transmission. Psycho-metrika 19:97–116

Michelangeli PA (1996) Variabilité atmosphérique basse-fréquence observée et simulée aux latitudes moyennes [Observed and simulated low-frequency atmospheric variability at mid-latitudes]. PhD Thesis, Université Paris VI, France

Michelangeli PA, Vautard R, Legras B (1995) Weather regimes: recurrence and quasi stationarity. J Atmos Sci 52:1237–1256

Mizuta M (1984) Generalized principal components analysis invariant under rotations of a coordinate system. J Jpn Stat Soc 14:1–9. https://www.jstage.jst.go.jp/article/jjss1970/14/1/14_1_1/_pdf

Monahan AH (2001) Nonlinear principal component analysis: tropical Indo-Pacific sea surface temperature and sea level pressure. J Clim 14:219–233. doi:10.1175/1520-0442(2001)013<0219:NPCATI>2.0.CO;2

Monahan AH, DelSole T (2009) Information theoretic measures of dependence, compactness, and non-Gaussianity for multivariate probability distributions. Nonlinear Proc Geophys 16:57–64. doi:10.5194/npg-16-57-2009

Morton J, Lim LH (2009) Principal cumulant component analysis. Unpublished manuscript. http://galton.uchicago.edu/~lekheng/work/pcca.pdf

Mukhin D, Gavrilov A, Feigin A, Loskutov E, Kurths J (2015) Principal nonlinear dynamical modes of climate variability. Sci Rep 5:15510. doi:10.1038/srep15510

Novey M, Adali T (2008) Complex ICA by negentropy maximization. IEEE Trans Neural Netw Learn Syst 19(4):596–609

Peng HC, Long FH, Ding C (2005) Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal 27(8):1226–1238. doi:10.1109/TPAMI.2005.15

Perron M, Sura P (2013) Climatology of non-Gaussian atmospheric statistics. J Clim 26:1063–1083

Peters JM, Kravtsov S (2012) Origin of non-Gaussian regimes and predictability in an atmospheric model. J Atmos Sci 69(8):2587–2599. doi:10.1175/JAS-D-11-0316.1

Peters JM, Kravtsov S, Schwartz T (2012) Predictability associated with nonlinear regimes in an atmospheric model. J Atmos Sci 69:1137–1154. doi:10.1175/JAS-D-11-0168.1

Pires CA, Perdigão RAP (2007) Non-Gaussianity and asymmetry of the winter monthly precipitation estimation from the NAO. Mon Weather Rev 135:430–448. doi:10.1175/MWR3407.1

Pires CA, Perdigão RAP (2015) Non-Gaussian interaction information: estimation, optimization and diagnostic application of triadic wave resonance. Nonlinear Process Geophys 22:87–108. doi:10.5194/npg-22-87-2015

Plaut G, Vautard R (1994) Spells of low-frequency oscillations and weather regimes in the Northern Hemisphere. J Atmos Sci 51:210–236

Póczos B (2007) Independent subspace analysis. Ph.D. thesis. Eötvös Loránd University, Budapest, Hungary. Supervisor: Dr. András Lőrincz

Póczos B, Lőrincz A (2004) Fast multidimensional independent component analysis. Technical report, Eötvös Loránd University, Budapest, Hungary

Raffenetti C, Ruedenberg K (1969) Parametrization of an orthogonal matrix in terms of generalized eulerian angles. In: Proceedings of the international symposium on quantum biology and quantum pharmacology, vol 4, issue supplement S3b:625–634. doi:10.1002/qua.560040725

Richman MB (1981) Obliquely rotated principal components: an improved meteorological map typing technique. J Appl Meteorol 20:1145–1159

Richman MB (1986) Rotation of principal components. Int J Climatol 6:293–335

Richman MB (1987) Rotation of principal components: a reply. Int J Climatol 7:511–520

Ross I (2009) Nonlinear dimensionality reduction methods in climate data analysis. arXiv:0901.0537v1 [physics.ao-ph]

Ross I, Valdes PJ, Wiggins S (2008) ENSO dynamics in current climate models: an investigation using nonlinear dimensionality reduction. Nonlinear Proc Geophys 15(2):339–363. doi:10.5194/npg-15-339-2008

Schneidman E, Still S, Berry MJ, Bialek W (2003) Network information and connected correlations. Phys Rev Lett 91:238701-1–238701-4

Scholz M (2012) Validation on nonlinear PCA. Neural Process Lett 36(1):21–30. doi:10.1007/s11063-012-9220-6

Selten FM (1995) An efficient empirical description of large-scale atmospheric dynamics. PhD Thesis, Vrije Universiteit, p 169

Selten FM (1997) Baroclinic empirical orthogonal functions as basis functions in an atmospheric model. J Atmos Sci 54:2100–2114

Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27:379–423, 623–656

Smith CA, Sardeshmukh P (2000) The effect of ENSO on the intraseasonal variance of surface temperature in winter. Int J Climatol 20:1543–1557

Smyth P, Ide K, Ghil M (1999) Multiple regimes in Northern Hemisphere height fields via mixture model clustering. J Atmos Sci 56:3704–3723

Stephenson DB, Hannachi A, O’Neill A (2004) On the existence of multiple climate regimes. Q J R Meteorol Soc 130:583–605

Strounine K, Kravtsov S, Kondrashov D, Ghil M (2009) Reduced models of atmospheric low-frequency variability: parameter estimation and comparative performance. Phys D Nonlinear Phenom 239(3–4):145–166. doi:10.1016/j.physd.2009.10.013

Sura P, Sardeshmukh PD (2008) A global view of non-Gaussian SST variability. J Phys Oceanogr 38:639–647

Sura P, Newman M, Penland C, Sardeshmukh PD (2005) Multiplicative noise and non-Gaussianity: a paradigm for atmospheric regimes? J Atmos Sci 62:1391–1409

Teng Q, Fyfe JC, Monahan AH (2007) Northern Hemisphere circulation regimes: observed, simulated and predicted. Clim Dyn 28:867–879. doi:10.1007/s00382-006-0220-y

Theis FJ (2005) Multidimensional independent component analysis using characteristic functions. In: Proceedings of European signal processing conference (EUSIPCO 2005)

Theis FJ (2006) Towards a general independent subspace analysis. In: Proceedings of neural information processing systems (NIPS 2006)

Theis FJ (2007) Uniqueness of non-Gaussian subspace analysis. In: Rosca J et al (eds) ICA 2006. LNCS, vol 3889, pp 917–925

Timme N, Alford W, Flecker B, Beggs JM (2013) Synergy, redundancy, and multivariate information measures: an experimentalist's perspective. J Comput Neurosci 36:119–140. doi:10.1007/s10827-013-0458-4

Tsujishita T (1995) On triple mutual information. Adv Appl Math 16:269–274

Vannitsem S (2001) Toward a phase-space cartography of the short- and medium-range predictability of weather regimes. Tellus 53(1):56–73

Vautard R (1990) Multiple weather regimes over the North Atlantic: analysis of precursors and successors. Mon Weather Rev 118:2056–2081. doi:10.1175/1520-0493(1990)118<2056:MWROTN>2.0.CO;2

Wackernagel H (1998) Multivariate geostatistics—an introduction with applications, 2nd edn. Springer, Berlin

Westra S, Brown C, Lall U, Koch I, Sharma A (2010) Interpreting variability in global SST data using independent component analysis and principal component analysis. Int J Climatol 30:333–364. doi:10.1002/joc.1888

Withers CS, Nadarajah S (2014) Negentropy as a function of cumulants. Inf Sci 271:31–44. doi:10.1016/j.ins.2014.02.097

Woollings TJ, Hannachi A, Hoskins BJ, Turner A (2010) A regime view of the North Atlantic Oscillation and its response to anthropogenic forcing. J Clim 23:1291–1307

Wu A, Hsieh WW, Shabbar A, Boer GJ, Zwiers FW (2006) The nonlinear association between the Arctic Oscillation and North American winter climate. Clim Dyn 26:865–879. doi:10.1007/s00382-006-0118-8

Yu X, Hu D, Xu J (2014) Blind source separation: theory and applications. Wiley, New York, p 416. ISBN 978-1-118-67984-5