

Latent Variable Modelling with Hyperbolic Normalizing Flows

Avishek Joey Bose 1,2   Ariella Smofsky 1,2   Renjie Liao 3,4   Prakash Panangaden 1,2   William L. Hamilton 1,2

Abstract

The choice of approximate posterior distributions plays a central role in stochastic variational inference (SVI). One effective solution is the use of normalizing flows to construct flexible posterior distributions. However, a key limitation of existing normalizing flows is that they are restricted to Euclidean space and are ill-equipped to model data with an underlying hierarchical structure. To address this fundamental limitation, we present the first extension of normalizing flows to hyperbolic spaces. We first elevate normalizing flows to hyperbolic spaces using coupling transforms defined on the tangent bundle, termed Tangent Coupling (TC). We further introduce Wrapped Hyperboloid Coupling (WHC), a fully invertible and learnable transformation that explicitly utilizes the geometric structure of hyperbolic spaces, allowing for expressive posteriors while being efficient to sample from. We demonstrate the efficacy of our novel normalizing flow over hyperbolic VAEs and Euclidean normalizing flows. Our approach achieves improved performance on density estimation, as well as reconstruction of real-world graph data, which exhibit a hierarchical structure. Finally, we show that our approach can be used to power a generative model over hierarchical data using hyperbolic latent variables.

1. Introduction

Stochastic variational inference (SVI) methods provide an appealing way of scaling probabilistic modeling to large-scale data. These methods transform the problem of computing an intractable posterior distribution to finding the best approximation within a class of tractable probability distributions (Hoffman et al., 2013). Using tractable classes of approximate distributions, e.g., mean-field and Bethe

1 McGill University  2 Mila  3 University of Toronto  4 Vector Institute. Correspondence to: Joey Bose <[email protected]>, Ariella Smofsky <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Figure 1. The shortest path between a given pair of node embeddings in R^2 and hyperbolic space as modelled by the Lorentz model H^2_K and Poincaré disk P^2_K. Unlike Euclidean space, distances between points grow exponentially as you move away from the origin in hyperbolic space, and thus the shortest paths between points in hyperbolic space go through a common parent node (i.e., the origin), giving rise to hierarchical and tree-like structure.

approximations, facilitates efficient inference, at the cost of limiting the expressiveness of the learned posterior.

In recent years, the power of these SVI methods has been further improved by employing normalizing flows, which greatly increase the flexibility of the approximate posterior distribution. Normalizing flows involve learning a series of invertible transformations, which are used to transform a sample from a simple base distribution to a sample from a richer distribution (Rezende & Mohamed, 2015). Indeed, flow-based posteriors enjoy many advantages such as efficient sampling, exact likelihood estimation, and low-variance gradient estimates when the base distribution is reparametrizable, making them ideal for modern machine learning problems. There have been numerous advances in normalizing flow construction in Euclidean spaces, such as RealNVP (Dinh et al., 2017), B-NAF (Huang et al., 2018; De Cao et al., 2019), and FFJORD (Grathwohl et al., 2018), to name a few.

However, current normalizing flows are restricted to Euclidean space, and as a result, these approaches are ill-equipped to model data with an underlying hierarchical structure.



Many real-world datasets, such as ontologies, social networks, sentences in natural language, and evolutionary relationships between biological entities in phylogenetics, exhibit rich hierarchical or tree-like structure. Hierarchical data of this kind can be naturally represented in hyperbolic spaces, i.e., non-Euclidean spaces with constant negative curvature (Figure 1). But Euclidean normalizing flows fail to incorporate these structural inductive biases, since Euclidean space cannot embed deep hierarchies without suffering from high distortion (Sarkar, 2011). Furthermore, sampling from densities defined on Euclidean space will inevitably generate points that do not lie on the underlying hyperbolic space.

Present work. To address this fundamental limitation, we present the first extension of normalizing flows to hyperbolic spaces. Prior works have considered learning models with hyperbolic parameters (Liu et al., 2019b; Nickel & Kiela, 2018) as well as variational inference with hyperbolic latent variables (Nagano et al., 2019; Mathieu et al., 2019), but our work represents the first approach to allow flexible density estimation in hyperbolic space.

To define our normalizing flows we leverage the Lorentz model of hyperbolic geometry and introduce two new forms of coupling, Tangent Coupling (TC) and Wrapped Hyperboloid Coupling (WHC). These define flexible and invertible transformations capable of transforming sampled points in hyperbolic space. We derive the change of volume associated with these transformations and show that it can be computed efficiently with O(n) cost, where n is the dimension of the hyperbolic space. We empirically validate our proposed normalizing flows on structured density estimation, reconstruction, and generation tasks on hierarchical data, highlighting the utility of our proposed approach.

2. Background on Hyperbolic Geometry

Within the Riemannian geometry framework, hyperbolic spaces are manifolds with constant negative curvature K and are of particular interest for embedding hierarchical structures. There are multiple models of n-dimensional hyperbolic space, such as the hyperboloid H^n_K, also known as the Lorentz model, or the Poincaré ball P^n_K. Figure 1 illustrates some key properties of H^2_K and P^2_K, highlighting how distances grow exponentially as you move away from the origin and how the shortest paths between distant points tend to go through a common parent (i.e., the origin), giving rise to a hierarchical or tree-like structure. In the next section, we briefly review the Lorentz model of hyperbolic geometry. We are not assuming a background in Riemannian geometry, though Appendix A and Ratcliffe (1994) are of use to the interested reader. Henceforth, for notational clarity, we use boldface font to denote points on the hyperboloid manifold.

2.1. Lorentz Model of Hyperbolic Geometry

An n-dimensional hyperbolic space, H^n_K, is the unique, complete, simply-connected n-dimensional Riemannian manifold of constant negative curvature K. For our purposes, the Lorentz model is the most convenient representation of hyperbolic space, since it is equipped with relatively simple explicit formulas and useful numerical stability properties (Nickel & Kiela, 2018). We choose the 2D Poincaré disk P^2_1 to visualize hyperbolic space because of its conformal mapping to the unit disk. The Lorentz model embeds hyperbolic space H^n_K within the (n+1)-dimensional Minkowski space, defined as the manifold R^{n+1} equipped with the following inner product:

$\langle \mathbf{x}, \mathbf{y} \rangle_{\mathcal{L}} := -x_0 y_0 + x_1 y_1 + \cdots + x_n y_n,$   (1)

which has the type ⟨·, ·⟩_L : R^{n+1} × R^{n+1} → R. It is common to denote this space as R^{1,n} to emphasize the distinct role of the zeroth coordinate. In the Lorentz model, we model hyperbolic space as the upper sheet of the hyperboloid embedded in Minkowski space. It is a remarkable fact that, though the Lorentzian metric (Eq. 1) is indefinite, the induced Riemannian metric g_x on the unit hyperboloid is positive definite (Ratcliffe, 1994). The n-dimensional hyperbolic space with constant negative curvature K and origin o = (1/√−K, 0, . . . , 0) is a Riemannian manifold (H^n_K, g_x) where

$\mathbb{H}^n_K := \{\mathbf{x} \in \mathbb{R}^{n+1} : \langle\mathbf{x}, \mathbf{x}\rangle_{\mathcal{L}} = 1/K,\ x_0 > 0,\ K < 0\}.$

Equipped with this, the induced distance between two points x, y in H^n_K is given by

$d(\mathbf{x}, \mathbf{y})_{\mathcal{L}} := \frac{1}{\sqrt{-K}}\operatorname{arccosh}(-K\langle\mathbf{x},\mathbf{y}\rangle_{\mathcal{L}}).$   (2)

The tangent space to the hyperboloid at a point p ∈ H^n_K can also be described as an embedded subspace of R^{1,n}. It is given by the set of points satisfying the orthogonality relation with respect to the Minkowski inner product,¹

$\mathcal{T}_{\mathbf{p}}\mathbb{H}^n_K := \{\mathbf{u} : \langle\mathbf{u},\mathbf{p}\rangle_{\mathcal{L}} = 0\}.$   (3)

Of special interest are vectors in the tangent space at the origin of H^n_K, whose norm under the Minkowski inner product is equivalent to the conventional Euclidean norm. That is, v ∈ T_o H^n_K is a vector such that v_0 = 0 and $\lVert\mathbf{v}\rVert_{\mathcal{L}} := \sqrt{\langle\mathbf{v},\mathbf{v}\rangle_{\mathcal{L}}} = \lVert\mathbf{v}\rVert_2$. Thus at the origin the partial derivatives with respect to the ambient coordinates, R^{n+1}, define the covariant derivative.

Projections. Starting from the extrinsic view by which we consider R^{n+1} ⊃ H^n_K, we may project any vector x ∈ R^{n+1}

¹ It is also equivalently known as the Lorentz inner product.


onto the hyperboloid using the shortest Euclidean distance:

$\text{proj}_{\mathbb{H}^n_K}(\mathbf{x}) = \frac{\mathbf{x}}{\sqrt{-K}\,\lVert\mathbf{x}\rVert_{\mathcal{L}}}.$   (4)

Furthermore, by definition a point on the hyperboloid satisfies ⟨x, x⟩_L = 1/K, and thus when provided with the n coordinates x̃ = (x_1, . . . , x_n) we can always determine the missing coordinate to get a point on H^n_K:

$x_0 = \sqrt{\lVert\tilde{\mathbf{x}}\rVert_2^2 - 1/K}.$   (5)
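To make these operations concrete, here is a minimal NumPy sketch of the Lorentz inner product (Eq. 1), the projection of Eq. 4, and the coordinate completion of Eq. 5. It is an illustrative reimplementation rather than the authors' released code; the function names and the default curvature K = −1 are our own assumptions.

```python
import numpy as np

def lorentz_inner(x, y):
    # Minkowski inner product <x, y>_L = -x_0 y_0 + sum_{i>=1} x_i y_i (Eq. 1).
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def proj_hyperboloid(x, K=-1.0):
    # Project an ambient vector onto H^n_K along the shortest Euclidean distance (Eq. 4).
    norm_L = np.sqrt(np.abs(lorentz_inner(x, x)))  # |<x, x>_L|^(1/2), abs for numerical safety
    return x / (np.sqrt(-K) * norm_L)

def lift_to_hyperboloid(x_tilde, K=-1.0):
    # Given the last n coordinates, recover x_0 so that <x, x>_L = 1/K holds (Eq. 5).
    x0 = np.sqrt(np.dot(x_tilde, x_tilde) - 1.0 / K)
    return np.concatenate([[x0], x_tilde])
```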

Exponential Map. The exponential map takes a vector v in the tangent space of a point x ∈ H^n_K to a point on the manifold, i.e., y = exp^K_x(v) : T_x H^n_K → H^n_K, by moving a unit length along the geodesic γ (the straightest parametric curve), uniquely defined by γ(0) = x with direction γ′(0) = v. The closed-form expression for the exponential map is then given by

$\exp^K_{\mathbf{x}}(\mathbf{v}) = \cosh\left(\frac{\lVert\mathbf{v}\rVert_{\mathcal{L}}}{R}\right)\mathbf{x} + \sinh\left(\frac{\lVert\mathbf{v}\rVert_{\mathcal{L}}}{R}\right)\frac{R\,\mathbf{v}}{\lVert\mathbf{v}\rVert_{\mathcal{L}}},$   (6)

where we use the generalized radius R = 1/√−K in place of the curvature.

Logarithmic Map. As the inverse of the exponential map, the logarithmic map takes a point y on the manifold back to the tangent space of another point x, also on the manifold. In the Lorentz model this is defined as

$\log^K_{\mathbf{x}}(\mathbf{y}) = \frac{\operatorname{arccosh}(\alpha)}{\sqrt{\alpha^2 - 1}}(\mathbf{y} - \alpha\mathbf{x}),$   (7)

where α = K⟨x, y⟩_L.

Parallel Transport. Parallel transport between two points x, y ∈ H^n_K is a map that carries a vector v ∈ T_x H^n_K to the corresponding vector v′ ∈ T_y H^n_K along the geodesic. That is, vectors are connected between the two tangent spaces such that the covariant derivative is unchanged. Parallel transport is a map that preserves the metric, i.e., ⟨PT^K_{x→y}(v), PT^K_{x→y}(v′)⟩_L = ⟨v, v′⟩_L, and in the Lorentz model it is given by

$\text{PT}^K_{\mathbf{x}\to\mathbf{y}}(\mathbf{v}) = \mathbf{v} - \frac{\langle \log^K_{\mathbf{x}}(\mathbf{y}), \mathbf{v}\rangle_{\mathcal{L}}}{d(\mathbf{x},\mathbf{y})_{\mathcal{L}}^2}\left(\log^K_{\mathbf{x}}(\mathbf{y}) + \log^K_{\mathbf{y}}(\mathbf{x})\right) = \mathbf{v} + \frac{\langle\mathbf{y},\mathbf{v}\rangle_{\mathcal{L}}}{R^2 - \langle\mathbf{x},\mathbf{y}\rangle_{\mathcal{L}}}(\mathbf{x}+\mathbf{y}).$   (8)

Another useful property is that the inverse parallel transport simply carries vectors back along the geodesic and is defined as $(\text{PT}^K_{\mathbf{x}\to\mathbf{y}}(\mathbf{v}))^{-1} = \text{PT}^K_{\mathbf{y}\to\mathbf{x}}(\mathbf{v})$.
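The closed forms in Eqs. 6-8 translate directly into code. The sketch below builds on the helpers above and again assumes K < 0 with R = 1/√−K; it omits the clamping and numerical safeguards a practical implementation would need.

```python
def lorentz_norm(v):
    # ||v||_L = sqrt(<v, v>_L); non-negative for tangent vectors.
    return np.sqrt(np.clip(lorentz_inner(v, v), 0.0, None))

def expmap(x, v, K=-1.0):
    # Exponential map exp_x^K(v): move from x along the geodesic with initial velocity v (Eq. 6).
    R = 1.0 / np.sqrt(-K)
    nv = lorentz_norm(v)
    if nv == 0.0:
        return x
    return np.cosh(nv / R) * x + np.sinh(nv / R) * (R * v / nv)

def logmap(x, y, K=-1.0):
    # Logarithmic map log_x^K(y), the inverse of expmap (Eq. 7).
    alpha = K * lorentz_inner(x, y)
    return np.arccosh(alpha) / np.sqrt(alpha ** 2 - 1.0) * (y - alpha * x)

def parallel_transport(x, y, v, K=-1.0):
    # Carry v from T_x H^n_K to T_y H^n_K along the geodesic (second form of Eq. 8).
    R = 1.0 / np.sqrt(-K)
    return v + lorentz_inner(y, v) / (R ** 2 - lorentz_inner(x, y)) * (x + y)
```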

2.2. Probability Distributions on Hyperbolic Spaces

Probability distributions can be defined on Riemannian manifolds, which include H^n_K as a special case. One transforms the infinitesimal volume element on the manifold to the corresponding volume element in R^n as defined by the coordinate charts. In particular, given the Riemannian manifold M(z) and its metric g_z, we have $\int p(\mathbf{z})\,d\mathcal{M}(\mathbf{z}) = \int p(\mathbf{z})\sqrt{|g_{\mathbf{z}}|}\,d\mathbf{z}$, where dz is the Lebesgue measure. We now briefly survey three distinct generalizations of the normal distribution to Riemannian manifolds.

Riemannian Normal. The first is the Riemannian normal (Pennec, 2006; Said et al., 2014), which is derived from maximizing the entropy given a Fréchet mean μ and a dispersion parameter σ. Specifically, we have $\mathcal{N}_{\mathcal{M}}(\mathbf{z}|\boldsymbol{\mu},\sigma^2) = \frac{1}{Z}\exp\left(-d_{\mathcal{M}}(\boldsymbol{\mu},\mathbf{z})^2 / 2\sigma^2\right)$, where d_M is the induced distance and Z is the normalization constant (Said et al., 2014; Mathieu et al., 2019).

Restricted Normal. One can also restrict points sampled from the normal distribution in the ambient space to the manifold. One example is the von Mises distribution on the unit circle and its generalized version, i.e., the von Mises-Fisher distribution on the hypersphere (Davidson et al., 2018).

Wrapped Normal. Finally, we can define a wrapped normal distribution (Falorsi et al., 2019; Nagano et al., 2019), which is obtained by (1) sampling from N(0, I) and then transforming the sample to a point v ∈ T_o H^n_K by concatenating 0 as the zeroth coordinate; (2) parallel transporting the sample v from the tangent space at o to the tangent space of another point μ on the manifold to obtain u; and (3) mapping u from the tangent space to the manifold using the exponential map at μ. Sampling from such a distribution is straightforward, and the probability density can be obtained via the change of variables formula,

$\log p(\mathbf{z}) = \log p(\mathbf{v}) - (n-1)\log\left(\frac{\sinh(\lVert\mathbf{u}\rVert_{\mathcal{L}})}{\lVert\mathbf{u}\rVert_{\mathcal{L}}}\right),$   (9)

where p(z) is the wrapped normal distribution and p(v) is the normal distribution in the tangent space of o.
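As an illustration, steps (1)-(3) and the density correction of Eq. 9 can be sketched as follows. This is a toy version with K = −1 assumed (so R = 1); `mu` is a point on the hyperboloid, `sigma` a scalar scale, and `rng` a NumPy random generator, all hypothetical stand-ins for the model's actual parameters.

```python
def origin(n):
    # Origin o = (1, 0, ..., 0) of the unit hyperboloid (K = -1).
    o = np.zeros(n + 1)
    o[0] = 1.0
    return o

def sample_wrapped_normal(mu, sigma, rng):
    # Draw z ~ WrappedNormal(mu, sigma^2 I) on the unit hyperboloid and return (z, log p(z)).
    n = mu.shape[0] - 1
    o = origin(n)
    v_tilde = rng.normal(scale=sigma, size=n)          # (1) sample in R^n ...
    v = np.concatenate([[0.0], v_tilde])               #     ... and lift to T_o H^n
    u = parallel_transport(o, mu, v)                   # (2) transport to T_mu H^n
    z = expmap(mu, u)                                  # (3) map onto the manifold
    # Density via Eq. 9: Gaussian log density in T_o H^n corrected by the log-det of the
    # exponential map; parallel transport is volume preserving.
    log_pv = -0.5 * np.sum((v_tilde / sigma) ** 2) - n * np.log(sigma * np.sqrt(2.0 * np.pi))
    r = lorentz_norm(u)
    return z, log_pv - (n - 1) * np.log(np.sinh(r) / r)
```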

3. Normalizing Flows on Hyperbolic Spaces

We seek to define flexible and learnable distributions on H^n_K, which will allow us to learn rich approximate posterior distributions for hierarchical data. To do so, we design a class of invertible parametric hyperbolic functions, f_i : H^n_K → H^n_K. A sample from the approximate posterior can then be obtained by first sampling from a simple base distribution z_0 ∼ p(z) defined on H^n_K and then applying a composition of functions f_{i ∈ [j]} from this class: z_j = f_j ∘ f_{j−1} ∘ · · · ∘ f_1(z_0).

In order to ensure effective and tractable learning, the class of functions f_i must satisfy three key desiderata:


1. Each function f_i must be invertible.
2. We must be able to efficiently sample from the final distribution, z_j = f_j ∘ f_{j−1} ∘ · · · ∘ f_1(z_0).
3. We must be able to efficiently compute the associated change in volume (i.e., the Jacobian determinant) of the overall transformation.

Given these requirements, the final transformed distribution is given by the change of variables formula:

$\log p(\mathbf{z}_j) = \log p(\mathbf{z}_0) - \sum_{i=1}^{j}\log\det\left|\frac{\partial f_i}{\partial \mathbf{z}_{i-1}}\right|.$   (10)

Functions satisfying desiderata 1-3 in Euclidean space are often termed normalizing flows (Appendix B), and our work extends this idea to hyperbolic spaces. In the following sections, we describe two flows of increasing complexity: Tangent Coupling (TC) and Wrapped Hyperboloid Coupling (WHC). The first approach lifts a standard Euclidean flow to the tangent space at the origin of the hyperboloid. The second approach modifies the flow to explicitly utilize hyperbolic geometry. Figure 2 illustrates synthetic densities as learned by our approach on P^2_1.

3.1. Tangent Coupling

Similar to the Wrapped Normal distribution (Section 2.2), one strategy to define a normalizing flow on the hyperboloid is to use the tangent space at the origin. That is, we first sample a point from our base distribution, which we define to be a Wrapped Normal, and use the logarithmic map at the origin to transport it to the corresponding tangent space. Once we arrive at the tangent space we are free to apply any Euclidean flow before finally projecting back to the manifold using the exponential map. This approach leverages the fact that the tangent bundle of a hyperbolic manifold has a well-defined vector space structure, allowing affine transformations and other operations that are ill-defined on the manifold itself.

Following this idea, we build upon one of the earliest and most well-studied flows: the RealNVP flow (Dinh et al., 2017). At its core, the RealNVP flow uses a computationally symmetric transformation (the affine coupling layer), which has the benefit of being fast to evaluate and invert due to its lower-triangular Jacobian, whose determinant is cheap to compute. Operationally, the coupling layer is implemented using a binary mask and partitions an input x into two sets, where the first set, x_1 := x_{1:d}, is transformed elementwise independently of the other dimensions. The second set, x_2 := x_{d+1:n}, is also transformed elementwise, but in a way that depends on the first set (see Appendix B.2 for more details). Since all coupling layer operations occur at T_o H^n_K, we term this form of coupling Tangent Coupling (TC).

Thus, the overall transformation due to one layer of our TC

Figure 2. Comparison of density estimation in hyperbolic space for a 2D wrapped Gaussian (WG) and a mixture of wrapped Gaussians (MWG) on P^2_1, showing the target density alongside our TC and WHC flows. Densities are visualized in the Poincaré disk. Additional qualitative results can be found in Appendix F.

flow is a composition of a logarithmic map, an affine coupling defined on T_o H^n_K, and an exponential map:

$\tilde{f}^{\mathcal{TC}}(\tilde{\mathbf{x}}) = \begin{cases}\tilde{\mathbf{z}}_1 = \tilde{\mathbf{x}}_1 \\ \tilde{\mathbf{z}}_2 = \tilde{\mathbf{x}}_2 \odot \sigma(s(\tilde{\mathbf{x}}_1)) + t(\tilde{\mathbf{x}}_1)\end{cases}$
$f^{\mathcal{TC}}(\mathbf{x}) = \exp^K_{\mathbf{o}}\big(\tilde{f}^{\mathcal{TC}}(\log^K_{\mathbf{o}}(\mathbf{x}))\big),$   (11)

where x̃ = log^K_o(x) is a point on T_o H^n_K and σ is a pointwise non-linearity such as the exponential function. The functions s and t are parameterized scale and translation functions implemented as neural nets from T_o H^d_K → T_o H^{n−d}_K. One important detail is that arbitrary operations on a tangent vector v ∈ T_o H^n_K may transport the resultant vector outside the tangent space, hampering subsequent operations. To avoid this we can keep the first dimension fixed at v_0 = 0 to ensure we remain in T_o H^n_K.
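A minimal sketch of one TC layer (Eq. 11), reusing the maps defined earlier. The scale and translation networks `s` and `t` are stand-ins (plain callables returning arrays of length n − d), σ is instantiated as exp as in standard affine coupling, and batching is ignored; treat all of these choices as assumptions.

```python
def tangent_coupling(x, s, t, d, K=-1.0):
    # One TC layer (Eq. 11): log map to T_o H^n_K, affine coupling, exp map back.
    n = x.shape[0] - 1
    o = origin(n)
    xt = logmap(o, x, K)                    # tangent vector at the origin, xt[0] == 0
    x1, x2 = xt[1:d + 1], xt[d + 1:]        # partition the n non-zero coordinates
    z2 = x2 * np.exp(s(x1)) + t(x1)         # affine coupling with sigma = exp
    zt = np.concatenate([[0.0], x1, z2])    # keep the zeroth coordinate fixed at 0
    return expmap(o, zt, K)
```

For example, passing `s = t = lambda h: np.zeros(n - d)` reduces the layer to the identity, a convenient sanity check that the exp and log maps invert each other.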

Similar to the Euclidean RealNVP, we need an efficient expression for the Jacobian determinant of f^TC.

Proposition 1. The Jacobian determinant of a single TC layer in Eq. 11 is:

$\left|\det\left(\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\right)\right| = \left(\frac{R\sinh\big(\frac{\lVert\tilde{\mathbf{z}}\rVert_{\mathcal{L}}}{R}\big)}{\lVert\tilde{\mathbf{z}}\rVert_{\mathcal{L}}}\right)^{n-1} \times \prod_{i=d+1}^{n}\sigma(s(\tilde{\mathbf{x}}_1))_i \times \left(\frac{R\sinh\big(\frac{\lVert\log^K_{\mathbf{o}}(\mathbf{x})\rVert_{\mathcal{L}}}{R}\big)}{\lVert\log^K_{\mathbf{o}}(\mathbf{x})\rVert_{\mathcal{L}}}\right)^{1-n},$   (12)

where z̃ = f̃^TC(x̃) and f̃^TC is as defined above.

Proof Sketch. Here we only provide a sketch of the proof; details can be found in Appendix C. First, observe that the overall transformation is a valid composition of functions: $\mathbf{y} := \exp^K_{\mathbf{o}} \circ \tilde{f}^{\mathcal{TC}} \circ \log^K_{\mathbf{o}}(\mathbf{x})$. Thus, the overall determinant can be computed by the chain rule and the identity

$\det\left(\frac{\partial\mathbf{y}}{\partial\mathbf{x}}\right) = \det\left(\frac{\partial\exp^K_{\mathbf{o}}(\tilde{\mathbf{z}})}{\partial\tilde{\mathbf{z}}}\right)\cdot\det\left(\frac{\partial\tilde{f}(\tilde{\mathbf{x}})}{\partial\tilde{\mathbf{x}}}\right)\cdot\det\left(\frac{\partial\log^K_{\mathbf{o}}(\mathbf{x})}{\partial\mathbf{x}}\right).$

Tackling each function in the composition individually,


$\det\left(\frac{\partial\exp^K_{\mathbf{o}}(\tilde{\mathbf{z}})}{\partial\tilde{\mathbf{z}}}\right) = \left(\frac{R\sinh(\frac{\lVert\tilde{\mathbf{z}}\rVert_{\mathcal{L}}}{R})}{\lVert\tilde{\mathbf{z}}\rVert_{\mathcal{L}}}\right)^{n-1}$ as derived in Skopek et al. (2019). As the logarithmic map is the inverse of the exponential map, its Jacobian determinant is simply the inverse of the determinant of the exponential map, which gives the $\det\left(\frac{\partial\log^K_{\mathbf{o}}(\mathbf{x})}{\partial\mathbf{x}}\right)$ term. For the middle term, we must calculate the directional derivative of f̃^TC in an orthonormal basis, w.r.t. the Lorentz inner product, of T_o H^n_K. Since the standard Euclidean basis vectors e_1, ..., e_n are also a basis for T_o H^n_K, the Jacobian determinant $\det\left(\frac{\partial\tilde{f}(\tilde{\mathbf{x}})}{\partial\tilde{\mathbf{x}}}\right)$ simplifies to that of the RealNVP flow, which is lower triangular and thus efficiently computable in O(n) time.

It is remarkable that the middle term in Proposition 1 is precisely the same change in volume associated with affine coupling in RealNVP. The change in volume due to the hyperbolic space only manifests itself through the exponential and logarithmic maps, each of which can be computed in O(n) cost. Thus, the overall cost is only slightly larger than the regular Euclidean RealNVP, but still O(n).
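For completeness, the log of the change in volume in Eq. 12 can be accumulated alongside the forward pass. The sketch below uses the same assumptions as the TC sketch above (σ = exp, single vectors rather than batches).

```python
def tc_log_det(x, s, t, d, K=-1.0):
    # log |det(dy/dx)| of one TC layer (Eq. 12): exp-map term at z, coupling term,
    # and the inverse log-map term at x.
    n = x.shape[0] - 1
    R = 1.0 / np.sqrt(-K)
    o = origin(n)
    xt = logmap(o, x, K)
    x1, x2 = xt[1:d + 1], xt[d + 1:]
    zt = np.concatenate([[0.0], x1, x2 * np.exp(s(x1)) + t(x1)])
    sinh_ratio = lambda u: R * np.sinh(lorentz_norm(u) / R) / lorentz_norm(u)
    return ((n - 1) * np.log(sinh_ratio(zt))      # det of the exp map at o
            + np.sum(s(x1))                       # log prod_i sigma(s(x1))_i with sigma = exp
            + (1 - n) * np.log(sinh_ratio(xt)))   # det of the log map at o
```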

3.2. Wrapped Hyperboloid Coupling

Figure 3. Wrapped Hyperboloid Coupling. The left panel depicts a partitioned input point x̃_1 := x̃_{1:d} and x̃_2 := x̃_{d+1:n} prior to parallel transport. The right panel depicts the x̃_2 vector after it is transformed, parallel transported, and projected to H^n_K.

The hyperbolic normalizing flow with TC layers discussed above operates purely in the tangent space at the origin. This simplifies the computation of the Jacobian determinant, but anchoring the flow at the origin may hinder its expressive power and its ability to leverage disparate regions of the manifold. In this section, we remedy this shortcoming with a new hyperbolic flow that performs translations between tangent spaces via parallel transport.

We term this transformation Wrapped Hyperboloid Coupling (WHC). As with the TC layer, it is a fully invertible transformation f^WHC : H^n_K → H^n_K with a tractable analytic form for the Jacobian determinant. To define a WHC layer we first use the logarithmic map at the origin to transport a point to the tangent space. We employ the coupling strategy previously discussed and partition our input vector into two components: x̃_1 := x̃_{1:d} and x̃_2 := x̃_{d+1:n}. Let x̃ = log^K_o(x) be the point on T_o H^n_K after the logarithmic map. The remainder of the WHC layer can be defined as follows:

$\tilde{f}^{\mathcal{WHC}}(\tilde{\mathbf{x}}) = \begin{cases}\tilde{\mathbf{z}}_1 = \tilde{\mathbf{x}}_1 \\ \tilde{\mathbf{z}}_2 = \log^K_{\mathbf{o}}\Big(\exp^K_{t(\tilde{\mathbf{x}}_1)}\big(\text{PT}_{\mathbf{o}\to t(\tilde{\mathbf{x}}_1)}(\mathbf{v})\big)\Big), \qquad \mathbf{v} = \tilde{\mathbf{x}}_2 \odot \sigma(s(\tilde{\mathbf{x}}_1))\end{cases}$
$f^{\mathcal{WHC}}(\mathbf{x}) = \exp^K_{\mathbf{o}}\big(\tilde{f}^{\mathcal{WHC}}(\log^K_{\mathbf{o}}(\mathbf{x}))\big).$   (13)

The functions s : T_o H^d_K → T_o H^{n−d}_K and t : T_o H^d_K → H^n_K are taken to be arbitrary neural nets, but the role of t compared to TC is vastly different. In particular, the generalization of translation on Riemannian manifolds can be viewed as parallel transport to a different tangent space. Consequently, in Eq. 13 the function t predicts a point on the manifold that we wish to parallel transport to. This greatly increases the flexibility, as we are no longer confined to the tangent space at the origin. The logarithmic map is then used to ensure that both z̃_1 and z̃_2 are in the same tangent space before the final exponential map that projects the point to the manifold.

One important consideration in the construction of t is that it should only parallel transport functions of x̃_2. However, the output of t is a point on H^n_K, and without care this can involve elements of x̃_1. To prevent such a scenario we construct the output as t = [t_0, 0, . . . , 0, t_{d+1}, . . . , t_n], where the elements t_{d+1:n} are used to determine the value of t_0 using Eq. 5, such that t is a point on the manifold, and every remaining index is set to zero. Such a construction ensures that only components of functions of x̃_2 are parallel transported, as desired. Figure 3 illustrates the transformation performed by the WHC layer.

Inverse of WHC. To invert the flow it is sufficient to show that the argument to the final exponential map at the origin is itself invertible. Furthermore, note that x̃_1 undergoes an identity mapping and is trivially invertible. Thus, we need to show that the second partition is invertible, i.e., that the following transformation is invertible:

$\tilde{\mathbf{z}}_2 = \log^K_{\mathbf{o}}\Big(\exp^K_{t(\tilde{\mathbf{x}}_1)}\big(\text{PT}_{\mathbf{o}\to t(\tilde{\mathbf{x}}_1)}(\mathbf{v})\big)\Big).$   (14)

As discussed in Section 2, the parallel transport, exponential map, and logarithmic map all have well-defined inverses with closed forms. Thus, the overall transformation is invertible in closed form:

$\begin{cases}\tilde{\mathbf{x}}_1 = \tilde{\mathbf{z}}_1 \\ \tilde{\mathbf{x}}_2 = \Big(\text{PT}_{t(\tilde{\mathbf{z}}_1)\to\mathbf{o}}\big(\log^K_{t(\tilde{\mathbf{z}}_1)}(\exp^K_{\mathbf{o}}(\tilde{\mathbf{z}}_2))\big)\Big) \odot \sigma(s(\tilde{\mathbf{z}}_1))^{-1}\end{cases}$
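Putting the pieces together, the WHC forward map (Eq. 13) and its closed-form inverse can be sketched as follows, reusing the earlier helpers. Here `s` and `t_net` are hypothetical callables, with `t_net` returning the n − d free coordinates of the transport target that are lifted to the manifold via Eq. 5, and σ is again taken to be exp; this is a sketch of the construction, not the authors' implementation.

```python
def whc_forward(x, s, t_net, d, K=-1.0):
    # One WHC layer (Eq. 13): coupling in T_o H^n_K with a parallel transport to the
    # tangent space of a predicted point t(x1), then back to the origin and the manifold.
    n = x.shape[0] - 1
    o = origin(n)
    xt = logmap(o, x, K)
    x1, x2 = xt[1:d + 1], xt[d + 1:]
    # Transport target: zeros in the x1 block, free coordinates from t_net, x_0 from Eq. 5.
    t_point = lift_to_hyperboloid(np.concatenate([np.zeros(d), t_net(x1)]), K)
    v = np.concatenate([np.zeros(d + 1), x2 * np.exp(s(x1))])   # scaled x2 block in T_o H^n_K
    z2 = logmap(o, expmap(t_point, parallel_transport(o, t_point, v, K), K), K)[d + 1:]
    return expmap(o, np.concatenate([[0.0], x1, z2]), K)

def whc_inverse(y, s, t_net, d, K=-1.0):
    # Closed-form inverse, mirroring the expression above term by term.
    n = y.shape[0] - 1
    o = origin(n)
    zt = logmap(o, y, K)
    z1, z2 = zt[1:d + 1], zt[d + 1:]
    t_point = lift_to_hyperboloid(np.concatenate([np.zeros(d), t_net(z1)]), K)
    u = expmap(o, np.concatenate([np.zeros(d + 1), z2]), K)
    x2 = parallel_transport(t_point, o, logmap(t_point, u, K), K)[d + 1:] / np.exp(s(z1))
    return expmap(o, np.concatenate([[0.0], z1, x2]), K)
```

Composing `whc_inverse(whc_forward(x, s, t_net, d), s, t_net, d)` should recover x up to floating-point error, mirroring the argument that each constituent map has a closed-form inverse.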


Properties of WHC. To compute the Jacobian determinant of the full transformation in Eq. 13 we proceed by analyzing the effect of WHC on valid orthonormal bases, w.r.t. the Lorentz inner product, for the tangent space at the origin. We state our main result here and provide a sketch of the proof, while the entire proof can be found in Appendix D.

Proposition 2. The Jacobian determinant of the function f^WHC in Eq. 13 is:

$\left|\det\left(\frac{\partial\mathbf{y}}{\partial\mathbf{x}}\right)\right| = \prod_{i=d+1}^{n}\sigma(s(\tilde{\mathbf{x}}_1))_i \times \left(\frac{R\sinh\big(\frac{\lVert\mathbf{q}\rVert_{\mathcal{L}}}{R}\big)}{\lVert\mathbf{q}\rVert_{\mathcal{L}}}\right)^{l} \times \left(\frac{R\sinh\big(\frac{\lVert\log^K_{\mathbf{o}}(\hat{\mathbf{q}})\rVert_{\mathcal{L}}}{R}\big)}{\lVert\log^K_{\mathbf{o}}(\hat{\mathbf{q}})\rVert_{\mathcal{L}}}\right)^{-l} \times \left(\frac{R\sinh\big(\frac{\lVert\tilde{\mathbf{z}}\rVert_{\mathcal{L}}}{R}\big)}{\lVert\tilde{\mathbf{z}}\rVert_{\mathcal{L}}}\right)^{n-1} \times \left(\frac{R\sinh\big(\frac{\lVert\log^K_{\mathbf{o}}(\mathbf{x})\rVert_{\mathcal{L}}}{R}\big)}{\lVert\log^K_{\mathbf{o}}(\mathbf{x})\rVert_{\mathcal{L}}}\right)^{1-n},$   (15)

where z̃ = concat(z̃_1, z̃_2), the constant l = n − d, σ is a non-linearity, q = PT_{o→t(x̃_1)}(v), and q̂ = exp^K_{t(x̃_1)}(q).

Proof Sketch. We first note that the exponential and logarithmic maps applied at the beginning and end of WHC can be dealt with by appealing to the chain rule and the known Jacobian determinants for these functions, as used in Proposition 1. Thus, what remains is the term $\left|\det\left(\frac{\partial\tilde{\mathbf{z}}}{\partial\tilde{\mathbf{x}}}\right)\right|$. To evaluate this term we rely on the following lemma.

Lemma 1. Let h : T_o H^n_K → T_o H^n_K be a function defined as

$h(\tilde{\mathbf{x}}) = \tilde{\mathbf{z}} = \text{concat}(\tilde{\mathbf{z}}_1, \tilde{\mathbf{z}}_2).$   (16)

Now, define a function h* : T_o H^{n−d} → T_o H^{n−d} which acts on the subspace corresponding to the standard basis elements e_{d+1}, ..., e_n as

$h^*(\tilde{\mathbf{x}}_2) = \log^K_{\mathbf{o}_2}\Big(\exp^K_{\mathbf{t}_2}\big(\text{PT}_{\mathbf{o}_2\to\mathbf{t}_2}(\mathbf{v})\big)\Big),$   (17)

where x̃_2 denotes the portion of the vector x̃ corresponding to the standard basis elements e_{d+1}, ..., e_n, and s and t are constants (which depend on x̃_1). In Eq. 17 we use o_2 ∈ H^{n−d} to denote the vector corresponding to only the dimensions d + 1, ..., n, and similarly for t_2. Then we have that

$\left|\det\left(\frac{\partial\tilde{\mathbf{z}}}{\partial\tilde{\mathbf{x}}}\right)\right| = \left|\det\left(\frac{\partial h^*(\tilde{\mathbf{x}}_{d+1:n})}{\partial\tilde{\mathbf{x}}_{d+1:n}}\right)\right|.$   (18)

The proof of Lemma 1 is provided in Appendix D. Using Lemma 1 and the fact that |det(PT_{u→t}(v))| = 1 (Nagano et al., 2019), we are left with another composition of functions, but on the subspace T_o H^{n−d}. The Jacobian determinants of these functions are simply those of the logarithmic map, the exponential map, and the argument to the parallel transport, which can be easily computed as $\prod_{i=d+1}^{n}\sigma(s(\tilde{\mathbf{x}}_1))_i$.

The cost of computing the change in volume for one WHC layer is O(n), which is the same as a TC layer plus the added cost of the two new maps that operate on the lower subspace of basis elements.

4. Experiments

We evaluate our TC and WHC flows on three tasks: structured density estimation, graph reconstruction, and graph generation.² Throughout our experiments, we rely on three main baselines. In Euclidean space, we use Gaussian latent variables and affine coupling flows (Dinh et al., 2017), denoted N and NC, respectively. In the Lorentz model, we use Wrapped Normal latent variables, H-VAE, as an analogous baseline (Nagano et al., 2019). Since all model parameters are defined on Euclidean tangent spaces, models can be trained with conventional optimizers like Adam (Kingma & Ba, 2014). Following previous work, we also consider the curvature K as a learnable parameter with a warmup of 10 epochs, and we clamp the max norm of vectors to 40 before any logarithmic or exponential map (Skopek et al., 2019). Appendix E contains model architectures and implementation details.
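For instance, the norm clamping mentioned above can be implemented as a small wrapper applied to tangent vectors before each exponential or logarithmic map; the threshold of 40 is the value stated in the text, while everything else in this sketch is an assumption.

```python
def clamp_norm(v, max_norm=40.0):
    # Rescale a tangent vector so its Euclidean norm does not exceed max_norm,
    # guarding the cosh/sinh terms of the exp and log maps against overflow.
    norm = np.linalg.norm(v)
    return v if norm <= max_norm else v * (max_norm / norm)
```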

4.1. Structured Density Estimation

We first consider structured density estimation in a canonical VAE setting (Kingma & Welling, 2013), where we seek to learn rich approximate posteriors using normalizing flows and evaluate the marginal log-likelihood of test data. Following work on hyperbolic VAEs, we test the approaches on a branching diffusion process (BDP) and dynamically binarized MNIST (Mathieu et al., 2019; Skopek et al., 2019).

We estimate the test log-likelihood via importance sampling with 500 samples (Burda et al., 2015). Our results are shown in Tables 1 and 2. On both datasets we observe that our hyperbolic flows provide improvements when using latent spaces of low dimension. This result matches theoretical expectations (e.g., that trees can be perfectly embedded in H^2_K) and dovetails with previous work on graph embedding (Nickel & Kiela, 2017), highlighting that the benefit of leveraging hyperbolic space is most prominent in small dimensions. However, as we increase the latent dimension, the Euclidean approaches can compensate for this intrinsic geometric limitation. In the case of BDP we note that the data is indeed a noisy binary tree, which can in theory be represented in a 2D hyperbolic space, and thus moving to a higher-dimensional latent space is not beneficial.
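The importance-sampled estimate is the usual IWAE-style bound of Burda et al. (2015). A schematic version, with hypothetical `encoder_sample`, `log_p_x_given_z`, `log_prior`, and `log_q` callables standing in for the actual model components:

```python
def iwae_log_likelihood(x, encoder_sample, log_p_x_given_z, log_prior, log_q, S=500):
    # log p(x) ~ logsumexp_s[ log p(x|z_s) + log p(z_s) - log q(z_s|x) ] - log S,
    # with z_s drawn from the (flow-transformed) approximate posterior q(z|x).
    log_w = np.empty(S)
    for s in range(S):
        z = encoder_sample(x)
        log_w[s] = log_p_x_given_z(x, z) + log_prior(z) - log_q(z, x)
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))   # numerically stable log-mean-exp
```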

² https://github.com/joeybose/HyperbolicNF


Model    BDP-2          BDP-4          BDP-6
N-VAE    −55.4 ± 0.2    −55.2 ± 0.3    −56.1 ± 0.2
H-VAE    −54.9 ± 0.3    −55.4 ± 0.2    −58.0 ± 0.2
NC       −55.4 ± 0.4    −54.7 ± 0.1    −55.2 ± 0.3
TC       −54.9 ± 0.1    −55.4 ± 0.1    −57.5 ± 0.2
WHC      −55.1 ± 0.4    −55.2 ± 0.2    −56.9 ± 0.4

Table 1. Test log-likelihood on the branching diffusion process (BDP) versus latent dimension. All normalizing flows use 2 coupling layers.

Model    MNIST-2         MNIST-4         MNIST-6
N-VAE    −139.5 ± 1.0    −115.6 ± 0.2    −100.0 ± 0.02
H-VAE    *               −113.7 ± 0.9    −99.8 ± 0.2
NC       −139.2 ± 0.4    −115.2 ± 0.6    −98.7 ± 0.3
TC       *               −112.5 ± 0.2    −99.3 ± 0.2
WHC      −136.5 ± 2.1    −112.8 ± 0.5    −99.4 ± 0.2

Table 2. Test log-likelihood on MNIST versus latent dimension, averaged over 5 runs. * indicates numerically unstable settings.

4.2. Graph Reconstruction

We evaluate the utility of our hyperbolic flows by conducting experiments on the task of link prediction using graph neural networks (GNNs) (Scarselli et al., 2008) as an inference model. Given a simple graph G = (V, A, X), defined by a set of nodes V, an adjacency matrix A ∈ Z^{|V|×|V|}, and a node feature matrix X ∈ R^{|V|×n}, we learn a VGAE (Kipf & Welling, 2016) model whose inference network, q_φ, defines a distribution over node embeddings q_φ(Z|A, X). To score the likelihood of an edge existing between pairs of nodes we use an inner product decoder: p(A_{u,v} = 1 | z_u, z_v) = σ(z_u^T z_v), with dot products computed in T_o H^n_K when necessary. Given these components, the inference GNNs are trained to maximize the variational lower bound on a training set of edges.
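A sketch of this edge decoder is below. Mapping both embeddings to the tangent space at the origin before taking the dot product is our reading of "dot products computed in T_o H^n_K"; treat that choice, and the function names, as assumptions.

```python
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def edge_prob_inner_product(z_u, z_v, hyperbolic=True, K=-1.0):
    # p(A_uv = 1 | z_u, z_v) = sigmoid(z_u^T z_v); hyperbolic embeddings are first
    # mapped to the tangent space at the origin, where the dot product is well defined.
    if hyperbolic:
        o = origin(z_u.shape[0] - 1)
        z_u, z_v = logmap(o, z_u, K), logmap(o, z_v, K)
    return sigmoid(np.dot(z_u, z_v))
```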

We use two disease datasets taken from Chami et al. (2019) and Mathieu et al. (2019)³ for evaluation purposes. Our chosen datasets reflect important real-world use cases where the data is known to contain hierarchies. One measure of how tree-like a given graph is, is Gromov's δ-hyperbolicity; traditional link prediction datasets such as Cora and Pubmed (Yang et al., 2016) were found to lack this property and are thus not suitable candidates to evaluate our proposed approach (Chami et al., 2019). The first dataset, Diseases-I, is composed of a network of disorders and disease genes linked by known disorder-gene associations (Goh et al., 2007). For the second dataset, Diseases-II, we build tree networks of an SIR disease spreading model (Anderson et al., 1992), where node features determine the susceptibility to the disease. In Table 3 we report the AUC and average precision (AP) on the test set. We observe consistent improvements when using

³ We uncovered issues with the two remaining datasets in Mathieu et al. (2019) and thus omit them (Appendix G).

the hyperbolic WHC flow. Similar to the structured density estimation setting, the performance gains of WHC are best observed in low-dimensional latent spaces.

Model    Dis-I AUC      Dis-I AP       Dis-II AUC     Dis-II AP
N-VAE    0.90 ± 0.01    0.92 ± 0.01    0.92 ± 0.01    0.91 ± 0.01
H-VAE    0.91 ± 5e-3    0.92 ± 5e-3    0.92 ± 4e-3    0.91 ± 0.01
NC       0.92 ± 0.01    0.93 ± 0.01    0.95 ± 4e-3    0.93 ± 0.01
TC       0.93 ± 0.01    0.93 ± 0.01    0.96 ± 0.01    0.95 ± 0.01
WHC      0.93 ± 0.01    0.94 ± 0.01    0.96 ± 0.01    0.96 ± 0.01

Table 3. Test AUC and test AP on graph embeddings, where Dis-I has latent dimension 6 and Dis-II has latent dimension 2.

4.3. Graph Generation

Finally, we explore the utility of our hyperbolic flows for generating hierarchical structures. As a synthetic testbed, we construct datasets containing uniformly random trees as well as uniformly random lobster graphs (Golomb, 1996), where each graph contains between 20 and 100 nodes. Unlike prior work on graph generation, i.e., Liu et al. (2019a), our datasets are designed to have explicit hierarchies, thus enabling us to test the utility of hyperbolic generative models. We then train a generative model to learn the distribution of these graphs. We expect the hyperbolic flows to provide a significant benefit for generating valid random trees, as well as for learning the distribution of lobster graphs, which are a special subset of trees.

We follow the two-stage training procedure outlined in Graph Normalizing Flows (Liu et al., 2019a): we first train an autoencoder to produce node-level latents, on which we then train a normalizing flow for density estimation. Empirically, we find that using GRevNets (Liu et al., 2019a) and defining edge probabilities using a distance-based decoder consistently leads to better generation performance. Thus, we define edge probabilities as p(A_{u,v} = 1 | z_u, z_v) = σ((−d_G(u, v) − b)/τ), where b and τ are learned edge-specific bias and temperature parameters. At inference time, we first sample the number of nodes to generate from the empirical distribution of the dataset. We then independently sample node latents from our prior, beginning with a fully connected graph, and push these samples through our learned flow to give refined edge probabilities.
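A possible reading of the distance-based decoder, using the hyperbolic distance of Eq. 2 when the latents live on the hyperboloid and reusing `sigmoid` and the Lorentz helpers from the earlier sketches; the function names and the per-edge handling of b and τ are assumptions.

```python
def hyperbolic_distance(z_u, z_v, K=-1.0):
    # Induced distance on H^n_K (Eq. 2).
    return np.arccosh(-K * lorentz_inner(z_u, z_v)) / np.sqrt(-K)

def edge_prob_distance(z_u, z_v, b, tau, K=-1.0):
    # p(A_uv = 1 | z_u, z_v) = sigmoid((-d(z_u, z_v) - b) / tau) with learned b and tau.
    return sigmoid((-hyperbolic_distance(z_u, z_v, K) - b) / tau)
```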

To evaluate the various approaches, we construct 100 training graphs for each dataset to train our model. Figure 4 shows representative samples generated by the various approaches. We see that hyperbolic normalizing flows learn to generate tree-like graphs and also match the specific properties of the lobster graph distribution, whereas the Euclidean flow model tends to generate densely connected graphs with many cycles (or else disconnected graphs). To quantify these intuitions, Table 4 contains statistics on how often


Figure 4. Selected qualitative results on graph generation for lobster and random tree graphs: training samples (Train) shown alongside samples generated by NC, TC, and WHC.

Model    Accuracy        Avg. Clust.     Avg. GC.
NC       56.6 ± 5.5      40.9 ± 42.7     0.34 ± 0.10
TC       32.1 ± 1.9      98.3 ± 89.5     0.25 ± 0.12
WHC      62.1 ± 10.9     21.1 ± 13.4     0.13 ± 0.07

Table 4. Generation statistics on random trees over 5 runs.

Figure 5. MMD scores for graph generation on lobster graphs. Note that NC achieves 0% accuracy.

the different models generate valid trees (denoted "accuracy"), as well as the average number of triangles and the average global clustering coefficients for the generated graphs. Since the target data is random trees, a perfect model would achieve 100% accuracy, with no triangles and a global clustering coefficient of 0 for all graphs. As a representative Euclidean baseline we employ Graph Normalizing Flows (GNFs), denoted NC in Table 4 and Figure 5. We see that the hyperbolic models generate valid trees more often, and they generate graphs with fewer triangles and lower clustering on average. Finally, to evaluate how well the models match the specific properties of the lobster graphs, we follow Liao et al. (2019) and report the MMD distance between the generated graphs and a test set for various graph statistics (Figure 5). Again, we see that the hyperbolic approaches significantly outperform the Euclidean normalizing flow.

5. Related Work

Hyperbolic Geometry in Machine Learning. The intersection of hyperbolic geometry and machine learning has recently risen to prominence (Dhingra et al., 2018; Tay et al., 2018; Law et al., 2019; Khrulkov et al., 2019; Ovinnikov, 2019). Early prior work proposed to embed data into the Poincaré ball model (Nickel & Kiela, 2017; Chamberlain et al., 2017). The equivalent Lorentz model was later shown to have better numerical stability properties (Nickel & Kiela, 2018), and recent work has leveraged even more stable tiling approaches (Yu & De Sa, 2019). In addition, there is a burgeoning literature of hyperbolic counterparts to conventional deep learning modules on Euclidean spaces (e.g., matrix multiplication), enabling the construction of hyperbolic neural networks (HNNs) (Gulcehre et al., 2018; Ganea et al., 2018), with further extensions to graph data using hyperbolic GNN architectures (Liu et al., 2019b; Chami et al., 2019). Latent variable models on hyperbolic space have also been investigated in the context of VAEs, using generalizations of the normal distribution (Nagano et al., 2019; Mathieu et al., 2019). In contrast, our work learns a flexible approximate posterior using a novel normalizing flow designed to use the geometric structure of hyperbolic spaces. In addition to work on hyperbolic VAEs, there are also several works that explore other non-Euclidean spaces (e.g., spherical VAEs) (Davidson et al., 2018; Falorsi et al., 2019; Grattarola et al., 2019).

Learning Implicit Distributions. In contrast with exact likelihood methods, there is growing interest in learning implicit distributions for generative modelling. Popular approaches include density ratio estimation methods using parametric classifiers, such as GANs (Goodfellow et al., 2014), and kernel-based estimators (Shi et al., 2017). In the context of autoencoders, learning an implicit latent distribution can be seen as an adversarial game minimizing a specific divergence (Makhzani et al., 2015) or distance (Tolstikhin et al., 2017). Instead of adversarial formulations, implicit distributions may also be learned directly by estimating the


gradients of the log density function using the Stein gradient estimator (Li & Turner, 2017). Finally, such gradient estimators can also be used to power variational inference with implicit posteriors, enabling the use of posterior families with intractable densities (Shi et al., 2018).

Normalizing Flows. Normalizing flows (NFs) (Rezende & Mohamed, 2015; Dinh et al., 2017) are a class of probabilistic models which use invertible transformations to map samples from a simple base distribution to samples from a more complex learned distribution. While there are many classes of normalizing flows (Papamakarios et al., 2019; Kobyzev et al., 2019), our work largely follows normalizing flows designed with partially-ordered dependencies, as found in affine coupling transformations (Dinh et al., 2017). Recently, normalizing flows have also been extended to Riemannian manifolds, such as spherical spaces in Gemici et al. (2016). In parallel to this work, normalizing flows have been extended to toroidal spaces (Rezende et al., 2020) and the data manifold (Brehmer & Cranmer, 2020). Finally, relying on affine coupling and GNNs, Liu et al. (2019a) develop graph normalizing flows (GNFs) for generating graphs. However, unlike our approach, GNFs do not benefit from the rich geometry of hyperbolic spaces.

6. Conclusion

In this paper, we introduce two novel normalizing flows on hyperbolic spaces. We show that our flows are efficient to sample from, easy to invert, and require only O(n) cost to compute the change in volume. We demonstrate the effectiveness of constructing hyperbolic normalizing flows for latent variable modeling of hierarchical data. We empirically observe improvements in structured density estimation, graph reconstruction, and generative modeling of tree-structured data, with large qualitative improvements in generated sample quality compared to Euclidean methods. One important limitation is the numerical error introduced by clamping operations, which prevents the creation of deep flow architectures. We hypothesize that this is an inherent limitation of the Lorentz model, which may be alleviated with newer models of hyperbolic geometry that use integer-based tiling (Yu & De Sa, 2019). In addition, while we considered hyperbolic generalizations of coupling transforms to define our normalizing flows, designing new classes of invertible transformations, such as autoregressive and residual flows, on non-Euclidean spaces is an interesting direction for future work.

Acknowledgements

Funding: AJB is supported by an IVADO Excellence Fellowship. RL was supported by a Connaught International Scholarship and an RBC Fellowship. WLH is supported by a Canada CIFAR AI Chair. This work was also supported by NSERC Discovery Grants held by WLH and PP. In addition, the authors would like to thank Chinwei Huang, Maxime Wabartha, Andre Cianflone, and Andrea Madotto for helpful feedback on earlier drafts of this work, and Kevin Luk, Laurent Dinh, and Niky Kamran for helpful technical discussions. The authors would also like to thank the anonymous reviewers for their comments and feedback, and Aaron Lou, Derek Lim, and Leo Huang for catching a bug in the code.

References

Anderson, R. M., Anderson, B., and May, R. M. Infectious Diseases of Humans: Dynamics and Control. Oxford University Press, 1992.

Brehmer, J. and Cranmer, K. Flows for simultaneous manifold learning and density estimation. arXiv preprint arXiv:2003.13913, 2020.

Burda, Y., Grosse, R., and Salakhutdinov, R. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.

Chamberlain, B. P., Clough, J., and Deisenroth, M. P. Neural embeddings of graphs in hyperbolic space. arXiv preprint arXiv:1705.10359, 2017.

Chami, I., Ying, Z., Re, C., and Leskovec, J. Hyperbolic graph convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 4869–4880, 2019.

Davidson, T. R., Falorsi, L., De Cao, N., Kipf, T., and Tomczak, J. M. Hyperspherical variational auto-encoders. arXiv preprint arXiv:1804.00891, 2018.

De Cao, N., Titov, I., and Aziz, W. Block neural autoregressive flow. arXiv preprint arXiv:1904.04676, 2019.

Dhingra, B., Shallue, C. J., Norouzi, M., Dai, A. M., and Dahl, G. E. Embedding text in hyperbolic spaces. arXiv preprint arXiv:1806.04313, 2018.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. In The 5th International Conference on Learning Representations (ICLR), 2017.

Falorsi, L., de Haan, P., Davidson, T. R., and Forre, P. Reparameterizing distributions on Lie groups. arXiv preprint arXiv:1903.02958, 2019.

Ganea, O., Becigneul, G., and Hofmann, T. Hyperbolic neural networks. In Advances in Neural Information Processing Systems, pp. 5345–5355, 2018.

Gemici, M. C., Rezende, D., and Mohamed, S. Normalizing flows on Riemannian manifolds. arXiv preprint arXiv:1611.02304, 2016.


Goh, K.-I., Cusick, M. E., Valle, D., Childs, B., Vidal, M., and Barabasi, A.-L. The human disease network. Proceedings of the National Academy of Sciences, 104(21):8685–8690, 2007.

Golomb, S. W. Polyominoes: Puzzles, Patterns, Problems, and Packings, volume 16. Princeton University Press, 1996.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Grathwohl, W., Chen, R. T., Bettencourt, J., Sutskever, I., and Duvenaud, D. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.

Grattarola, D., Livi, L., and Alippi, C. Adversarial autoencoders with constant-curvature latent manifolds. Applied Soft Computing, 81:105511, 2019.

Gulcehre, C., Denil, M., Malinowski, M., Razavi, A., Pascanu, R., Hermann, K. M., Battaglia, P., Bapst, V., Raposo, D., Santoro, A., et al. Hyperbolic attention networks. arXiv preprint arXiv:1805.09786, 2018.

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.

Huang, C.-W., Krueger, D., Lacoste, A., and Courville, A. Neural autoregressive flows. In Proceedings of the 35th International Conference on Machine Learning, 2018.

Khrulkov, V., Mirvakhabova, L., Ustinova, E., Oseledets, I., and Lempitsky, V. Hyperbolic image embeddings. arXiv preprint arXiv:1904.02239, 2019.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Kipf, T. N. and Welling, M. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.

Kobyzev, I., Prince, S., and Brubaker, M. A. Normalizing flows: Introduction and ideas. arXiv preprint arXiv:1908.09257, 2019.

Law, M., Liao, R., Snell, J., and Zemel, R. Lorentzian distance learning for hyperbolic representations. In International Conference on Machine Learning, pp. 3672–3681, 2019.

Li, Y. and Turner, R. E. Gradient estimators for implicit models. arXiv preprint arXiv:1705.07107, 2017.

Liao, R., Li, Y., Song, Y., Wang, S., Hamilton, W., Duvenaud, D. K., Urtasun, R., and Zemel, R. Efficient graph generation with graph recurrent attention networks. In Advances in Neural Information Processing Systems, pp. 4257–4267, 2019.

Liu, J., Kumar, A., Ba, J., Kiros, J., and Swersky, K. Graph normalizing flows. In Advances in Neural Information Processing Systems, pp. 13556–13566, 2019a.

Liu, Q., Nickel, M., and Kiela, D. Hyperbolic graph neural networks. In Advances in Neural Information Processing Systems, pp. 8228–8239, 2019b.

Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

Mathieu, E., Le Lan, C., Maddison, C. J., Tomioka, R., and Teh, Y. W. Continuous hierarchical representations with Poincaré variational auto-encoders. In Advances in Neural Information Processing Systems, pp. 12544–12555, 2019.

Nagano, Y., Yamaguchi, S., Fujita, Y., and Koyama, M. A wrapped normal distribution on hyperbolic space for gradient-based learning. In International Conference on Machine Learning, pp. 4693–4702, 2019.

Nickel, M. and Kiela, D. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pp. 6338–6347, 2017.

Nickel, M. and Kiela, D. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. arXiv preprint arXiv:1806.03417, 2018.

Ovinnikov, I. Poincaré Wasserstein autoencoder. arXiv preprint arXiv:1901.01427, 2019.

Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762, 2019.

Pennec, X. Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements. Journal of Mathematical Imaging and Vision, 25(1):127, 2006.

Ratcliffe, J. G. Foundations of Hyperbolic Manifolds. Number 149 in Graduate Texts in Mathematics. Springer-Verlag, 1994.

Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, 2015.


Rezende, D. J., Papamakarios, G., Racaniere, S., Albergo, M. S., Kanwar, G., Shanahan, P. E., and Cranmer, K. Normalizing flows on tori and spheres. arXiv preprint arXiv:2002.02428, 2020.

Said, S., Bombrun, L., and Berthoumieu, Y. New Riemannian priors on the univariate normal model. Entropy, 16(7):4015–4031, 2014.

Sarkar, R. Low distortion Delaunay embedding of trees in hyperbolic plane. In International Symposium on Graph Drawing, pp. 355–366. Springer, 2011.

Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.

Shi, J., Sun, S., and Zhu, J. Kernel implicit variational inference. arXiv preprint arXiv:1705.10119, 2017.

Shi, J., Sun, S., and Zhu, J. A spectral approach to gradient estimation for implicit distributions. arXiv preprint arXiv:1806.02925, 2018.

Skopek, O., Ganea, O.-E., and Becigneul, G. Mixed-curvature variational autoencoders. arXiv preprint arXiv:1911.08411, 2019.

Tay, Y., Tuan, L. A., and Hui, S. C. Hyperbolic representation learning for fast and efficient neural question answering. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 583–591, 2018.

Tolstikhin, I., Bousquet, O., Gelly, S., and Schoelkopf, B. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.

Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

Xu, B., Wang, N., Chen, T., and Li, M. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.

Yang, Z., Cohen, W. W., and Salakhutdinov, R. Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861, 2016.

Yu, T. and De Sa, C. M. Numerically accurate hyperbolic embeddings using tiling-based models. In Advances in Neural Information Processing Systems, pp. 2021–2031, 2019.


A. Background on Riemannian Geometry

An n-dimensional manifold is a topological space that is equipped with a family of open sets U_i which cover the space and a family of functions ψ_i that are homeomorphisms between the U_i and open subsets of R^n. The pairs (U_i, ψ_i) are called charts. A crucial requirement is that if two open sets U_i and U_j intersect in a region, call it U_{ij}, then the composite map ψ_i ∘ ψ_j^{-1} restricted to U_{ij} is infinitely differentiable. If M is an n-dimensional manifold then a chart, ψ : U → V, on M maps an open subset U to an open subset V ⊂ R^n. Furthermore, the image of a point p ∈ U, denoted ψ(p) ∈ R^n, is termed the local coordinates of p on the chart ψ. Examples of manifolds include R^n, the hypersphere S^n, the hyperboloid H^n, and a torus. In this paper we take an extrinsic view of the geometry, that is to say a manifold can be thought of as being embedded in a higher-dimensional Euclidean space, i.e., M^n ⊂ R^{n+1}, and inherits the coordinate system of the ambient space. This is not how the subject is usually developed, but for spaces of constant curvature one gets convenient formulas.

Tangent Spaces. Let p ∈ M be a point on an n-dimensional smooth manifold and let γ(t) → M be a differentiable parametric curve with parameter t ∈ [−ε, ε] passing through the point such that γ(0) = p. Since M is a smooth manifold we can trace the curve in local coordinates via a chart ψ, and the entire curve is given in local coordinates by x = ψ ∘ γ(t). The tangent vector to this curve at p is then simply v = (ψ ∘ γ)′(0). Another interpretation of the tangent vector of γ is to view the point p as a position vector; the tangent vector is then interpreted as the velocity vector at that point. Using this definition, the set of all tangent vectors at p is denoted T_p M and is called the tangent space at p.

Riemannian Manifold. A Riemannian metric tensor g on a smooth manifold M is defined as a family of inner products such that at each point p ∈ M the inner product takes vectors from the tangent space at p, g_p = ⟨·, ·⟩_p : T_p M × T_p M → R. This means g is defined for every point on M and varies smoothly. Locally, g can be defined using the basis vectors of the tangent space, $g_{ij}(p) = g\big(\frac{\partial}{\partial p_i}, \frac{\partial}{\partial p_j}\big)$. In matrix form the Riemannian metric, G(p), can be expressed as: ∀u, v ∈ T_p M × T_p M, ⟨u, v⟩_p = g(p)(u, v) = u^T G(p) v. A smooth manifold M which is equipped with a Riemannian metric at every point p ∈ M is called a Riemannian manifold. Thus every Riemannian manifold is specified as a tuple (M, g), which defines the smooth manifold and its associated Riemannian metric tensor.

Armed with a Riemannian manifold we can now recover some conventional geometric notions such as the length of a parametric curve γ, the distance between two points on the manifold, local notions of angle, surface area, and volume. We define the length of a curve as L[γ] = ∫_a^b ||γ′(t)||_{γ(t)} dt, which reduces to the familiar Euclidean arc length when the Riemannian metric is I_n. Turning to the distance between points p and q, we can reason that it must be the length of the distance-minimizing parametric curve between the points; such curves are known in the literature as geodesics4. Stated another way, d(p, q) = inf{ L[γ] | γ : [a, b] → M, γ(a) = p, γ(b) = q }. A norm is induced on every tangent space by g_p and is defined as ||·||_p = √⟨·, ·⟩_p. Finally, we can also define an infinitesimal volume element on each tangent space, and as a result the measure dM(p) = √|G(p)| dp, with dp being the Lebesgue measure.
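To make the curve-length formula concrete, the following is a small illustrative sketch (ours, not part of the paper's released code) that approximates L[γ] by quadrature for a user-supplied metric G(p); with G(p) = I_n it recovers the Euclidean arc length, matching the remark above.

    import numpy as np

    def curve_length(gamma, metric, a=0.0, b=1.0, steps=1000):
        # Approximate L[gamma] = integral_a^b sqrt( gamma'(t)^T G(gamma(t)) gamma'(t) ) dt
        # by a simple Riemann sum with finite-difference velocities.
        ts = np.linspace(a, b, steps + 1)
        dt = (b - a) / steps
        length = 0.0
        for t in ts[:-1]:
            p = gamma(t)
            v = (gamma(t + dt) - p) / dt          # finite-difference velocity
            length += np.sqrt(v @ metric(p) @ v) * dt
        return length

    # Sanity check with the Euclidean metric I_2: a quarter circle of radius 1 has length pi/2.
    quarter_circle = lambda t: np.array([np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)])
    euclidean = lambda p: np.eye(2)
    print(curve_length(quarter_circle, euclidean))   # ~ 1.5708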

B. Background on Normalizing Flows

Given a parametrized density on R^n, a normalizing flow defines a sequence of invertible transformations to a more complex density over the same space via the change of variable formula for probability distributions (Rezende & Mohamed, 2015). Starting from a sample from a base distribution, z_0 ∼ p(z), and a mapping f : R^d → R^d with parameters θ that is both invertible and smooth, the log density of z′ = f(z_0) is defined as log p_θ(z′) = log p(z_0) − log det |∂f/∂z_0|, where p_θ(z′) is the probability of the transformed sample and ∂f/∂z_0 is the Jacobian of f. To construct arbitrarily complex densities, a chain of functions of the same form as f can be defined, applying the change of density for each invertible transformation in the flow. The final sample from a flow is then given by z_j = f_j ◦ f_{j−1} ◦ ... ◦ f_1(z_0), and its corresponding density can be determined simply by log p_θ(z_j) = log p(z_0) − Σ_{i=1}^{j} log det |∂f_i/∂z_{i−1}|. Of practical importance when designing normalizing flows is the cost associated with computing the log determinant of the Jacobian, which is computationally expensive and can range anywhere from O(n!) to O(n^3) for an arbitrary matrix, depending on the chosen algorithm. However, through an appropriate choice of f this computational cost can be brought down significantly. While there are many different choices for the transformation function f, in this work we consider only RealNVP-based flows as presented in Dinh et al. (2017) and Rezende & Mohamed (2015), due to their simplicity and expressive power in capturing complex data distributions.

4Actually a geodesic is usually defined as a curve such that the tangent vector is parallel transported along it. It is then a theorem that it gives the shortest path.


B.1. Variational Inference with Normalizing Flows

One natural use case for normalizing flows is in learning the more expressive, often multi-modal, posterior distributions needed in variational inference. Recall that the variational approximation is a lower bound on the data log-likelihood. Take for example amortized variational inference in a VAE-like setting, where the posterior q_θ is parameterized and amenable to gradient-based optimization. The overall objective, with both encoder and decoder networks, is:

log p(x) = log ∫ p(x|z) p(z) dz                                                 (19)
         ≥ E_{qθ(z|x)}[ log( p(x, z) / qθ(z|x) ) ]       (Jensen's inequality)   (20)
         = E_{qθ(z|x)}[ log p(x|z) ] + E_{qθ(z|x)}[ log( p(z) / qθ(z|x) ) ]      (21)
         = E_{qθ(z|x)}[ log p(x|z) ] − D_KL( qθ(z|x) || p(z) )                   (22)

The tightness of the Evidence Lower Bound (ELBO), also known as the negative free energy of the system, −F(x), is determined by the quality of the posterior approximation to the true posterior. Thus, one way to enrich the posterior approximation is to let q_θ itself be a normalizing flow, with the resultant latent code being the output of the transformation. If we denote by q_0(z_0) the probability of the latent code z_0 under the base distribution and by z_K the latent code after K flow layers, we may rewrite the free energy as follows:

F(x) = E_{q0(z0)}[ log q_K(z_K) − log p(x, z_K) ]                                                      (23)
     = E_{q0(z0)}[ log q_0(z_0) − Σ_{i=1}^{K} log det |∂f_i/∂z_{i−1}| − log p(x, z_K) ]                 (24)
     = D_KL( q_0(z_0) || p(z_K) ) − E_{q0(z0)}[ Σ_{i=1}^{K} log det |∂f_i/∂z_{i−1}| + log p(x|z_K) ]    (25)

For convenience we may take q_0 = N(µ, σ²), a reparametrized Gaussian density, and p(z) = N(0, I), a standard normal.
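As an illustration of Eqs. 23–25, the following is a minimal PyTorch sketch (ours, not the paper's released code) of a Monte Carlo estimate of the flow-based free energy; the encoder, decoder, and flow layers are assumed, hypothetical modules.

    import torch
    from torch.distributions import Normal

    def flow_free_energy(x, encoder, decoder, flows):
        # Assumed interfaces: encoder(x) -> (mu, log_var) of the base Gaussian q_0;
        # each flow layer f(z) -> (z_next, log_det_jacobian); decoder(x, z) -> log p(x | z).
        mu, log_var = encoder(x)
        std = torch.exp(0.5 * log_var)
        z = mu + std * torch.randn_like(std)             # reparametrized sample z_0 ~ q_0
        log_q0 = Normal(mu, std).log_prob(z).sum(-1)     # log q_0(z_0)

        sum_log_det = torch.zeros_like(log_q0)
        for f in flows:                                  # z_K = f_K(...f_1(z_0))
            z, log_det = f(z)
            sum_log_det = sum_log_det + log_det

        log_prior = Normal(0.0, 1.0).log_prob(z).sum(-1) # log p(z_K) under the standard normal prior
        log_like = decoder(x, z)                         # log p(x | z_K)

        # Eq. 24: F(x) ~ log q_0(z_0) - sum log|det| - log p(z_K) - log p(x | z_K)
        return (log_q0 - sum_log_det - log_prior - log_like).mean()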

B.2. Euclidean RealNVP

Computing the Jacobian of functions with high-dimensional domain and codomain, and computing the determinants of large matrices, is in general computationally very expensive. Further complications arise because the restriction to bijective functions makes modelling arbitrary distributions difficult. A simple way to significantly reduce the computational burden is to design transformations whose Jacobian matrix is triangular, so that the determinant is simply the product of the diagonal elements. In Dinh et al. (2017), real-valued non-volume preserving (RealNVP) transformations are introduced as simple bijections that can be stacked while each retains a triangular Jacobian, keeping the determinant of the composition cheap to compute. To achieve this, each bijection updates a part of the input vector using a function that is simple to invert, but which depends on the remainder of the input vector in a complex way. Such transformations are denoted affine coupling layers. Formally, given a D-dimensional input x and d < D, the output y of an affine coupling layer follows the equations:

y_{1:d} = x_{1:d}                                          (26)
y_{d+1:D} = x_{d+1:D} ⊙ exp(s(x_{1:d})) + t(x_{1:d}),      (27)

where s and t are parameterized scale and translation functions. Since the second part of the output depends on the first part of the input, while the first part is copied unchanged, it is easy to see that the Jacobian of this transformation is lower triangular. Similarly, the inverse of this transformation is given by:

x_{1:d} = y_{1:d}                                          (28)
x_{d+1:D} = (y_{d+1:D} − t(y_{1:d})) ⊙ exp(−s(y_{1:d})).   (29)


Note that the form of the inverse does not require calculating the inverses of either s or t, allowing them to be complex functions themselves. Further note that with this simple bijection part of the input vector is never touched, which can limit the expressiveness of the model. A simple remedy is to reverse which elements undergo the scale and translation transformations prior to the next coupling layer. Such an alternating pattern ensures that, after a stack of couplings, each dimension of the input vector depends in a complex way on the others, allowing for more expressive models. Finally, the Jacobian of this transformation is a lower triangular matrix,

∂y/∂x = [ I_d                        0
          ∂y_{d+1:D}/∂x_{1:d}^T      diag(exp(s(x_{1:d}))) ].    (30)
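The following is a minimal PyTorch sketch (ours, for illustration) of an affine coupling layer implementing Eqs. 26–29, with the log-determinant read off the diagonal of Eq. 30; the small MLPs used for s and t are an assumed design choice, not the paper's.

    import torch
    import torch.nn as nn

    class AffineCoupling(nn.Module):
        # Euclidean RealNVP affine coupling layer (Eqs. 26-29).
        def __init__(self, dim, d, hidden=128):
            super().__init__()
            self.d = d
            self.s = nn.Sequential(nn.Linear(d, hidden), nn.Tanh(), nn.Linear(hidden, dim - d))
            self.t = nn.Sequential(nn.Linear(d, hidden), nn.Tanh(), nn.Linear(hidden, dim - d))

        def forward(self, x):
            x1, x2 = x[:, :self.d], x[:, self.d:]
            s, t = self.s(x1), self.t(x1)
            y2 = x2 * torch.exp(s) + t            # Eq. 27
            log_det = s.sum(dim=-1)               # log|det(dy/dx)| = sum of the log-diagonal in Eq. 30
            return torch.cat([x1, y2], dim=-1), log_det

        def inverse(self, y):
            y1, y2 = y[:, :self.d], y[:, self.d:]
            x2 = (y2 - self.t(y1)) * torch.exp(-self.s(y1))   # Eq. 29
            return torch.cat([y1, x2], dim=-1)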

C. Change of Variable for Tangent Coupling

We now derive the change in volume formula associated with one TC layer. Without loss of generality, we first define a binary mask which we use to partition the elements of a vector in T_oH^n_K into two sets. Thus b is defined as

b_j = 1 if j ≤ d, and b_j = 0 otherwise.

Note that all TC layer operations exclude the first dimension, which is always copied over by setting b_0 = 1; this ensures that the resulting sample always remains on T_oH^n_K. Utilizing b we may rewrite Equation 11 as

y = exp_o^K( b ⊙ x̃ + (1 − b) ⊙ ( x̃ ⊙ σ(s(b ⊙ x̃)) + t(b ⊙ x̃) ) ),    (31)

where x̃ = log_o^K(x) is the representation of x in the tangent space at o. Similar to the Euclidean RealNVP, we wish to calculate the Jacobian determinant of this overall transformation. We do so by first observing that the overall transformation is a valid composition of functions: y := exp_o^K ◦ f ◦ log_o^K(x), where z = f(x̃) is the flow in tangent space. Utilizing the chain rule and the identity that the determinant of a product is the product of the determinants of its constituents, we may decompose the Jacobian determinant as

det( ∂y/∂x ) = det( ∂ exp_o^K(z)/∂z ) · det( ∂f(x̃)/∂x̃ ) · det( ∂ log_o^K(x)/∂x ).    (32)

Tackling each term on the RHS of Eq. 32 individually, det( ∂ exp_o^K(z)/∂z ) = ( R sinh(||z||_L / R) / ||z||_L )^{n−1}, as derived in Nagano et al. (2019). As the logarithmic map is the inverse of the exponential map, its Jacobian determinant is the reciprocal, i.e. det( ∂ log_o^K(x)/∂x ) = ( R sinh(||log_o^K(x)||_L / R) / ||log_o^K(x)||_L )^{1−n}. For the middle term in Eq. 32 we proceed by selecting the standard basis {e_1, e_2, ..., e_n}, which is an orthonormal basis with respect to the Lorentz inner product. The directional derivative with respect to a basis element e_j is computed as follows:

df(x̃) = ∂/∂ε |_{ε=0} f(x̃ + ε e_j)
       = ∂/∂ε |_{ε=0} { b ⊙ (x̃ + ε e_j) + (1 − b) ⊙ ( (x̃ + ε e_j) ⊙ σ(s(b ⊙ (x̃ + ε e_j))) + t(b ⊙ (x̃ + ε e_j)) ) }
       = b ⊙ e_j + ∂/∂ε |_{ε=0} { (1 − b) ⊙ ( (x̃ + ε e_j) ⊙ σ(s(b ⊙ (x̃ + ε e_j))) + t(b ⊙ (x̃ + ε e_j)) ) }

As b ∈ {0, 1}^n is a binary mask, it is easy to see that if b_j = 1 then only the first term on the RHS remains, and the directional derivative with respect to e_j is simply the basis vector itself. Conversely, if b_j = 0 then the first term vanishes and we are left with the second term,


df(x̃) = ∂/∂ε |_{ε=0} { (1 − b) ⊙ ( (x̃ + ε e_j) ⊙ σ(s(b ⊙ (x̃ + ε e_j))) + t(b ⊙ (x̃ + ε e_j)) ) }
       = ∂/∂ε |_{ε=0} { (1 − b) ⊙ ( (x̃ + ε e_j) ⊙ σ(s(b ⊙ x̃)) + t(b ⊙ x̃) ) }
       = e_j ⊙ σ(s(b ⊙ x̃)).

In the second line we have used the fact that b ⊙ ε e_j = 0. Altogether, the directional derivatives computed using our chosen basis elements are

df(x̃) = ( e_1, e_2, ..., e_d, e_{d+1} ⊙ σ(s(b ⊙ x̃)), ..., e_n ⊙ σ(s(b ⊙ x̃)) ).

The volume factor given by this linear map is det(df(x̃)) = √(det(G^T G)), where G is the matrix of all directional derivatives. As the basis elements are orthogonal, all off-diagonal entries of G^T G vanish, and the determinant is the product of the Lorentz norms of each component. Since ||e_j||_L = 1 and ||e_j ⊙ σ(s(b ⊙ x̃))||_L = ||e_j ⊙ σ(s(b ⊙ x̃))||_2 on T_oH^n_K, the overall determinant is det(df(x̃)) = Π_{i=d+1}^{n} σ(s(b ⊙ x̃))_i. Finally, the full log Jacobian determinant of a TC layer is given by

log det( ∂y/∂x ) = (n − 1) log( R sinh(||z||_L / R) / ||z||_L ) + Σ_{i=d+1}^{n} log σ(s(b ⊙ x̃))_i + (1 − n) log( R sinh(||log_o^K(x)||_L / R) / ||log_o^K(x)||_L ).    (33)

Thus the overall computational cost is only slightly larger than the regular Euclidean RealNVP, O(n).
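As a rough illustration, here is a sketch (ours) of one TC layer computing Eq. 31 together with the log-determinant of Eq. 33. The Lorentz-model helpers expmap0, logmap0, and lorentz_norm are assumed to be supplied; they are hypothetical names, not functions from the paper's released code.

    import torch

    def tangent_coupling(x, s_net, t_net, b, R, n, expmap0, logmap0, lorentz_norm):
        # x: point on H^n_K in ambient Lorentz coordinates; b: binary mask with b_0 = 1;
        # expmap0 / logmap0: exponential and logarithmic maps at the origin o;
        # lorentz_norm: Lorentz norm ||.||_L of tangent vectors at o; n: manifold dimension.
        x_t = logmap0(x, R)                                    # x_tilde = log_o^K(x)
        masked = b * x_t
        sigma = torch.sigmoid(s_net(masked))
        z = masked + (1 - b) * (x_t * sigma + t_net(masked))   # flow in the tangent space (Eq. 31)
        y = expmap0(z, R)                                      # map the result back onto the hyperboloid

        zn, xn = lorentz_norm(z), lorentz_norm(x_t)
        log_det = ((n - 1) * torch.log(R * torch.sinh(zn / R) / zn)
                   + ((1 - b) * torch.log(sigma)).sum(-1)      # sum of log sigma(s(b * x_tilde))_i over the free dims
                   + (1 - n) * torch.log(R * torch.sinh(xn / R) / xn))
        return y, log_det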

D. Change of Variable for Wrapped Hyperbolic Coupling

We consider the following function f : H^n_K → H^n_K, which we use to define a normalizing flow in n-dimensional hyperbolic space (represented via the Lorentz model):

f(x) = exp_o^K( b ⊙ x̃ + (1 − b) ⊙ log_o^K( exp_{t(b ⊙ x̃)}^K( PT_{o→t(b ⊙ x̃)}^K( (1 − b) ⊙ x̃ ⊙ σ(s(b ⊙ x̃)) ) ) ) ),    (34)

where x̃ = log_o^K(x) ∈ T_oH^n_K is the projection of x ∈ H^n_K to the tangent space at the origin, T_oH^n_K. As in TC, we again utilize a binary mask b so that

b_j = 1 if j ≤ d, and b_j = 0 otherwise,

where 0 < d < n. In Equation 34 the function s : T_oH^d_K → T_oH^{n−d}_K is an arbitrary function on the tangent space at the origin and σ denotes the logistic function. The function t : T_oH^d_K → H^*_K ⊂ H^n_K is a map from the tangent space at the origin to a subset of hyperbolic space defined by the set of points v ∈ H^n_K satisfying v_i = 0, ∀i = 2, ..., d (under their representation in the Lorentz model).

Our goal is to derive the Jacobian determinant of f, i.e.,

| det( ∂f(x)/∂x ) |.    (35)

To do so, we will use the following facts without proof or justification:

• Fact 1: The chain rule for determinants, i.e., the fact that

| det( ∂f(x)/∂x ) | = | det( ∂f(x)/∂v ) | · | det( ∂v/∂x ) |,    (36)

where v is introduced via a valid change of variables.


• Fact 2: The Jacobian determinant of the exponential map exp_u^K : T_uH^n_K → H^n_K is given by

| det( ∂ exp_u^K(z)/∂z ) | = ( R sinh(||z||_L / R) / ||z||_L )^{n−1}.    (37)

• Fact 3: The Jacobian determinant of the logarithmic map log_u^K : H^n_K → T_uH^n_K is given by

| det( ∂ log_u^K(v)/∂v ) | = ( R sinh(||log_u^K(v)||_L / R) / ||log_u^K(v)||_L )^{1−n}.    (38)

• Fact 4: The Jacobian determinant of parallel transport PT_{u→t}^K : T_uH^n_K → T_tH^n_K is given by

| det( ∂ PT_{u→t}^K(v)/∂v ) | = 1.    (39)

Fact 2 and Fact 4 are proven in Nagano et al. (2019), "A Wrapped Normal Distribution on Hyperbolic Space for Gradient-Based Learning", for K = −1 and rederived for general K in Skopek et al. (2019). Fact 3 follows from the fact that the Jacobian determinant of the inverse of a function is the reciprocal of that function's Jacobian determinant. We will use similar arguments to obtain our determinant as were used in Nagano et al. (2019), and we refer the reader to Appendix A.3 of their work for background.
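For reference, the log-determinant contributions of Facts 2–4 can be computed as in the following sketch (ours); the helper lorentz_norm implements the Minkowski norm ||·||_L of a tangent vector.

    import torch

    def lorentz_norm(z, eps=1e-9):
        # Minkowski norm ||z||_L = sqrt(-z_0^2 + sum_i z_i^2), nonnegative for tangent vectors.
        sq = -z[..., 0] ** 2 + (z[..., 1:] ** 2).sum(-1)
        return torch.sqrt(torch.clamp(sq, min=eps))

    def logdet_expmap(z, R, n):
        # Fact 2: log|det| of exp_u^K evaluated at the tangent vector z.
        r = lorentz_norm(z)
        return (n - 1) * torch.log(R * torch.sinh(r / R) / r)

    def logdet_logmap(v_tangent, R, n):
        # Fact 3: log|det| of log_u^K; v_tangent = log_u^K(v), so this is minus Fact 2.
        return -logdet_expmap(v_tangent, R, n)

    # Fact 4: parallel transport preserves volume, so its log|det| contribution is 0.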

Our main claim is as follows.

Proposition 3. The Jacobian determinant of the function f_{WHC} in Equation 13 is:

| det( ∂y/∂x ) | = Π_{i=d+1}^{n} σ(s(b ⊙ x̃))_i × ( R sinh(||q̃||_L / R) / ||q̃||_L )^{ℓ} × ( R sinh(||log_o^K(q)||_L / R) / ||log_o^K(q)||_L )^{−ℓ} × ( R sinh(||z||_L / R) / ||z||_L )^{n−1} × ( R sinh(||log_o^K(x)||_L / R) / ||log_o^K(x)||_L )^{1−n},    (40)

where
z = b ⊙ x̃ + log_o^K( exp_{t(b ⊙ x̃)}^K( PT_{o→t(b ⊙ x̃)}^K( (1 − b) ⊙ x̃ ⊙ σ(s(b ⊙ x̃)) ) ) ),
the transported vector q̃, i.e. the argument to the exponential map at t(b ⊙ x̃), is
q̃ = PT_{o→t(b ⊙ x̃)}^K( (1 − b) ⊙ x̃ ⊙ σ(s(b ⊙ x̃)) ),
and
q = exp_{t(b ⊙ x̃)}^K(q̃).

Proof. We first note that

| det( ∂f(x)/∂x ) | = | det( ∂f(x)/∂z ) | × | det( ∂z/∂x̃ ) | × | det( ∂x̃/∂x ) |    (41)

by the chain rule (recalling that x̃ = log_o^K(x)). Now, we have that

| det( ∂f(x)/∂z ) | = ( R sinh(||z||_L / R) / ||z||_L )^{n−1}    (42)

by Fact 2, and

| det( ∂x̃/∂x ) | = ( R sinh(||log_o^K(x)||_L / R) / ||log_o^K(x)||_L )^{1−n}    (43)

by Fact 3. Thus, we are left with the term | det( ∂z/∂x̃ ) |. To evaluate this term, we rely on the following Lemma:


Lemma 2. Let h : T_oH^n_K → T_oH^n_K be a function from the tangent space at the origin to the tangent space at the origin defined as:

h(x̃) = z = b ⊙ x̃ + log_o^K( exp_{t(b ⊙ x̃)}^K( PT_{o→t(b ⊙ x̃)}^K( (1 − b) ⊙ x̃ ⊙ σ(s(b ⊙ x̃)) ) ) ).    (44)

Now, define a function h* : T_oH^{n−d}_K → T_oH^{n−d}_K which acts on the subspace of T_oH^n_K corresponding to the standard basis elements e_{d+1}, ..., e_n as

h*(x̃_{d+1:n}) = log_{o_{d+1:n}}^K( exp_{t_{d+1:n}}^K( PT_{o_{d+1:n}→t_{d+1:n}}^K( x̃_{d+1:n} ⊙ σ(s) ) ) ),    (45)

where x̃_{d+1:n} denotes the portion of the vector x̃ corresponding to the standard basis elements e_{d+1}, ..., e_n, and s and t are constants (which depend on x̃_{2:d}). In Equation 45, we use o_{d+1:n} ∈ H^{n−d}_K to denote the vector corresponding to only the dimensions d + 1, ..., n, and similarly for t_{d+1:n}. Then we have that

| det( ∂z/∂x̃ ) | = | det( ∂h*(x̃_{d+1:n}) / ∂x̃_{d+1:n} ) |.    (46)

Proof. First note that by design we have

[0, 0, ..., 0] ⊕ h*(x̃_{d+1:n}) = log_o^K( exp_{t(b ⊙ x̃)}^K( PT_{o→t(b ⊙ x̃)}^K( (1 − b) ⊙ x̃ ⊙ σ(s(b ⊙ x̃)) ) ) ),    (47)

i.e., the output of h* is equal to the second term on the right-hand side of Equation 44 after prepending/concatenating 0s to the output of h*.

Now, we can evaluate

| det( ∂z/∂x̃ ) |

by examining the directional derivatives with respect to a set of basis elements of T_oH^n_K. Given that this is the tangent space at the origin, the standard (i.e., Euclidean) basis elements e_2, ..., e_n form a valid basis for this subspace, since they are orthogonal under the Lorentz inner product and orthogonal to the origin itself. We can first note that

D_{e_i} h(x̃) = e_i, ∀i = 2, ..., d.    (48)

In other words, the directional derivative for the first d basis elements is simply the basis element itself. This can be verified from the definition of the directional derivative:

D_{e_i} h(x̃) = ∂/∂ε |_{ε=0} h(x̃ + ε e_i)    (49)

and noting that the term

log_o^K( exp_{t(b ⊙ x̃)}^K( PT_{o→t(b ⊙ x̃)}^K( (1 − b) ⊙ x̃ ⊙ σ(s(b ⊙ x̃)) ) ) )

must contribute zero to this derivative, since (1 − b) ⊙ e_i = 0, ∀i = 2, ..., d by design. Now, for the basis elements e_i with i > d we have that

D_{e_i} h(x̃) ⊥ e_j, ∀i = d + 1, ..., n,  j = 2, ..., d.    (50)

This holds because

D_{e_i} h(x̃) = ∂/∂ε |_{ε=0} h(x̃ + ε e_i)    (51)
             = ∂/∂ε |_{ε=0} log_o^K( exp_{t(b ⊙ x̃)}^K( PT_{o→t(b ⊙ x̃)}^K( (1 − b) ⊙ (x̃ + ε e_i) ⊙ σ(s(b ⊙ x̃)) ) ) )    (52)

since b ⊙ e_i = 0, ∀i = d + 1, ..., n by design, and because

log_o^K( exp_{t(b ⊙ x̃)}^K( PT_{o→t(b ⊙ x̃)}^K( (1 − b) ⊙ x̃ ⊙ σ(s(b ⊙ x̃)) ) ) ) ⊥ e_j, ∀x̃ ∈ T_oH^n_K, ∀j = 2, ..., d,    (53)


due to the (1 − b) term inside the parallel transport and by our design of the function t. Together, these facts give that the Jacobian matrix of h (under the basis e_2, ..., e_n) has the following block form:

( ∂z/∂x̃ ) = [ I        0
              A        ∂h*(x̃_{d+1:n})/∂x̃_{d+1:n} ],    (54)

and by the properties of determinants of block-triangular matrices we have that

| det( ∂z/∂x̃ ) | = | det( ∂h*(x̃_{d+1:n})/∂x̃_{d+1:n} ) |.    (55)
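The block-triangular determinant identity used in Eq. 55 can be checked numerically; the snippet below (ours, purely illustrative) verifies that det([[I, 0], [A, B]]) = det(B) for random A and B.

    import torch

    d, m = 3, 4
    A, B = torch.randn(m, d), torch.randn(m, m)
    J = torch.zeros(d + m, d + m)
    J[:d, :d] = torch.eye(d)     # identity block from the copied dimensions
    J[d:, :d] = A                # off-diagonal block A does not affect the determinant
    J[d:, d:] = B                # plays the role of the Jacobian of h*
    assert torch.allclose(torch.det(J), torch.det(B), atol=1e-4)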

Given Lemma 2, all that remains is to evaluate

| det( ∂h*(x̃_{d+1:n})/∂x̃_{d+1:n} ) |.    (56)

This can again be done by the chain rule, where we use Facts 2–4 to compute the determinants of the exponential map, logarithmic map, and parallel transport. Finally, the Jacobian determinant of the term

x̃ ⊙ σ(s(b ⊙ x̃))    (57)

can easily be computed as Π_{j=d+1}^{n} σ(s(b ⊙ x̃))_j, since the standard Euclidean basis is a basis for the tangent space at the origin, as shown in Appendix B.2.
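Putting the five factors of Eq. 40 together, the log-determinant of a WHC layer can be accumulated as in the sketch below (ours). The Lorentz norms of z, x̃, q̃, and log_o^K(q) are assumed to be computed elsewhere, and l denotes the exponent ℓ appearing in Eq. 40.

    import torch

    def whc_log_det(log_sigma_free, z_norm, x_norm, q_tilde_norm, log_q_norm, R, n, l):
        # log_sigma_free: log sigma(s(b * x_tilde))_i for i = d+1, ..., n;
        # *_norm: Lorentz norms ||z||_L, ||log_o^K(x)||_L, ||q_tilde||_L, ||log_o^K(q)||_L.
        def log_factor(norm):
            return torch.log(R * torch.sinh(norm / R) / norm)
        return (log_sigma_free.sum(-1)
                + l * log_factor(q_tilde_norm)
                - l * log_factor(log_q_norm)
                + (n - 1) * log_factor(z_norm)
                - (n - 1) * log_factor(x_norm))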

E. Model Architectures and Hyperparameters

In this section we provide more details regarding model architectures and hyperparameters for each experiment in Section 4. For all hyperbolic models we use a curvature warmup for the first 10 epochs, which aids numerical stability (Skopek et al., 2019). Specifically, we set R = 11 and linearly decrease it to R = 2 over these epochs, after which R is treated as a learnable parameter.
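One possible reading of this schedule is sketched below (ours, illustrative); the function returns the radius used at a given epoch, with the post-warmup value subsequently handed over to the optimizer as a learnable parameter.

    def curvature_radius(epoch, warmup_epochs=10, r_start=11.0, r_end=2.0):
        # Linearly anneal the radius R from r_start to r_end over the warmup epochs;
        # afterwards R is treated as a learnable parameter (not handled by this schedule).
        if epoch >= warmup_epochs:
            return r_end
        return r_start + (r_end - r_start) * epoch / warmup_epochs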

Structured Density Estimation. For all VAE models our encoder consists of three linear layers: the first maps the input to a hidden space, and the other two parameterize the mean and variance of the approximate posterior, mapping samples to the latent space. The decoder for these models is simply a small MLP consisting of two linear layers that map the latent space to the hidden space and then back to the observation space. One important distinction between Euclidean models and hyperbolic models is that we use a … For BDP the hidden dimension size is 200, while for MNIST we use 600, and the latent space size is varied as shown in Tables 1 and 2. All flow models used in this setting consist of 2 linear layers, each of size 128. Between each layer in either the encoder or decoder we use the LeakyReLU (Xu et al., 2015) activation function, while tanh is used between flow layers. Lastly, we train all models for 80 epochs with the Adam optimizer with default settings (Kingma & Ba, 2014).
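For concreteness, an illustrative version of this encoder (ours; layer and attribute names are our own, not from the paper's code) might look as follows.

    import torch.nn as nn

    class Encoder(nn.Module):
        # Three linear layers: input -> hidden (200 for BDP, 600 for MNIST), then
        # two heads that parameterize the mean and (log-)variance of the posterior.
        def __init__(self, in_dim, hidden_dim, latent_dim):
            super().__init__()
            self.hidden = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.LeakyReLU())
            self.mu = nn.Linear(hidden_dim, latent_dim)
            self.log_var = nn.Linear(hidden_dim, latent_dim)

        def forward(self, x):
            h = self.hidden(x)
            return self.mu(h), self.log_var(h)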

Graph Reconstruction. For the graph reconstruction task we use the VGAE model as a base (Kipf & Welling, 2016), which likewise uses three linear layers of size 16 as the encoder. The decoder, however, is parameter-free and is simply an inner product, either in Euclidean space or in T_oH^n_K for hyperbolic models. As the reconstruction objective contains N² terms, we rescale the D_KL penalty by a factor of 1/N so that the two losses are on the same scale. This can be understood as a β-VAE-like model with β = 1/N. As in the structured density estimation setting, all our flow models consist of two linear layers of size 128 with a tanh nonlinearity. Finally, we train each model for 3000 epochs using the Adam optimizer (Kingma & Ba, 2014).
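The rescaled objective amounts to the following one-liner (ours, illustrative), where recon_loss sums over the N² reconstruction terms and kl_div is the KL penalty.

    def vgae_loss(recon_loss, kl_div, num_nodes):
        # beta-VAE-like objective with beta = 1/N to balance the N^2 reconstruction terms.
        return recon_loss + kl_div / num_nodes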

Graph Generation. For the graph generation task we adapt the training setup from Liu et al. (2019a), in that we pretrain a graph autoencoder for 100 epochs to generate node latents. Empirically, we found that using a VGAE model for hyperbolic space worked better than a vanilla GAE model. Furthermore, instead of using simple linear layers for the encoder, we use a GAT (Velickovic et al., 2017) layer of size 32, which has access to the adjacency matrix. We use LeakyReLU for the encoder non-linearity, while tanh is used for all flow models. Unlike GRevNets, which use node features sampled from N(0, I), we find it necessary to provide the actual adjacency matrix; otherwise training did not succeed. Our decoder defines edge probabilities as p(A_{u,v} = 1 | z_u, z_v) = σ((−d_G(u, v) − b)/τ), where b and τ are learned edge-specific bias and temperature parameters implemented as one GAT layer followed by a linear layer, both of size 32. Thus both the encoder and decoder are parameterized, and both are optimized using the Adam optimizer (Kingma & Ba, 2014).
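The edge decoder reduces to the following formula, sketched in PyTorch below (ours); dist_uv is the distance d_G(u, v) between the node latents, and bias_uv, temp_uv are the learned edge-specific bias and temperature.

    import torch

    def edge_probability(dist_uv, bias_uv, temp_uv):
        # p(A_uv = 1 | z_u, z_v) = sigmoid((-d_G(u, v) - b) / tau)
        return torch.sigmoid((-dist_uv - bias_uv) / temp_uv)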


F. Additional Density Estimation Results

We now provide additional qualitative results for density estimation in hyperbolic space, visualized on the Poincaré disk. For these visualizations we take a density initially defined on Euclidean space and project it to the hyperboloid using the exponential map at the origin. We then sample 500 points from this new density and fit both TC- and WHC-based flows. The results for the learned densities are shown below in Figure 6.

Figure 6. Learned densities (columns: Target, TC (ours), WHC (ours)). Top: wrapped Gaussian with µ = [−1.0, 1.0]^T and σ = [1.0, 0.25]^T. Middle: checkerboard pattern projected to hyperbolic space. Bottom: 2D spiral projected to hyperbolic space.

G. Dataset Issues

Upon inspecting the code and data kindly provided by Mathieu et al. (2019), we uncovered some issues that led us to omit their CS-PhD and Phylogenetics datasets from our comparisons. In particular, Mathieu et al. (2019) use a decoder in their cross-entropy loss that does not define a proper probability. This appears to have caused optimization issues that artificially deflated the reported performance of all the models investigated in that work. When substituting in the dot-product decoder employed in this work, the accuracy of all models increases dramatically. After this change, there is no longer any benefit from employing hyperbolic spaces on these datasets. In particular, after applying this fix, the performance of the hyperbolic VAE used by Mathieu et al. (2019) falls substantially below a Euclidean VAE. Since we expect our hyperbolic flows to give gains only in cases where hyperbolic spaces provide a benefit over Euclidean spaces, these datasets do not provide a meaningful testbed for our proposed approach. Lastly, upon inspecting the code and data in Mathieu et al. (2019), we also found that columns 1 and 2 in Table 4 of their paper appear to be swapped relative to the results generated by their code.