03 data representation


Upload: valerii-klymchuk

Post on 21-Mar-2017


Reading Report 3. Data Representation

Valerii Klymchuk

April 3, 2015

0. EXERCISE 0

Chapter 3. Summary

0.1 Continuous Data

Many phenomena are modeled in terms of physical quantities. In data representation these quantities fall into two fundamentally different categories: intrinsically continuous and intrinsically discrete ones. Continuous data are usually manipulated by computers in some finite, approximate form. Sampled continuous data are also discrete, since they consist of a finite set of data elements; however, in contrast to intrinsically discrete data, sampled data always originate from, and are intended to approximate, a continuous quantity. Intrinsically discrete data have no counterpart in the continuous world, as is the case for a page of text, for example. This is the fundamental difference between continuous (sampled) and discrete data.

Mathematically, continuous data can be modeled as a function

$$f : D \subset \mathbb{R}^d \to C \subset \mathbb{R}^c$$

between a domain $D$ and a codomain $C$. $f$ is called a $d$-dimensional, or $d$-variate, $c$-value function. In visualization, $f$ is sometimes called a field.

Intuitively, $f$ is continuous if its graph is a connected surface without “holes” or “jumps.” The Cauchy $\varepsilon$-$\delta$ criterion states that $f$ is continuous if for every point $p \in D$ the following holds:

$$\forall \varepsilon > 0,\ \exists \delta > 0 : \text{if } \|x - p\| < \delta,\ x \in D \ \Rightarrow\ \|f(x) - f(p)\| < \varepsilon.$$

Also, $f$ is continuous of order $k$ if the function itself and all its derivatives up to and including order $k$ are continuous in this sense. This is denoted $f \in C^k$.

Functions that are continuous, together with their derivatives, only on each of a set of compact intervals (and not necessarily across the interval boundaries) are called piecewise continuous.

The triplet $\mathcal{D} = (D, C, f)$ defines a continuous dataset. The dimension $d$ of the space $\mathbb{R}^d$ in which the function's domain is embedded is called the geometrical dimension. The topological dimension of the dataset is the dimension $s \le d$ of the function domain $D$ itself, i.e., the number of independent variables needed to represent $D$. For a line or curve in Euclidean space $\mathbb{R}^3$ we have $s = 1$ and $d = 3$; if $D$ is a plane or curved surface, then $s = 2$. We assume that the geometrical dimension is always fixed to $d = 3$. Hence, the only dimension that varies between datasets is the topological dimension $s$, which is therefore often simply called the dataset dimension.

The co-dimension of an object of topological dimension $s$ and geometrical dimension $d$ is the difference $d - s$.

The function values are usually called dataset attributes. The dimensionality $c$ of the function codomain $C$ is also called the attribute dimension (it usually ranges from 1 to 4).

0.2 Sampled Data

Two operations relate sampled data and continuous data:

• sampling: given a continuous dataset, we have to be able to produce sampled data from it;

• reconstruction: given a sampled dataset, we have to be able to recover an (approximate) version of the original continuous data.

Reconstruction involves specifying the value of the function between its sample points from the sample values, using a technique called interpolation. The reconstruction quality depends on the number and distribution of the sample points used.

To be used in practice, a sampled dataset should comply with several requirements: it should be accurate, minimal, generic, efficient, and simple. By accurate, we mean that one should be able to control the production of a sampled dataset $D_s$ from a continuous one $D_c$ such that $D_c$ can be reconstructed from $D_s$ with a small, user-specified error. By minimal, we mean that $D_s$ contains the least number of sample points needed to ensure a reconstruction with the desired error. By generic, we mean that we can easily replace the various data processing operations we had for the continuous $D_c$ with equivalent counterparts for the sampled $D_s$. By efficient, we mean that both the reconstruction operation and the data processing operations we wish to perform on $D_s$ can be done efficiently from an algorithmic point of view. By simple, we mean that we can design a reasonably simple software implementation of both $D_s$ and the operations we want to perform on it.

We define reconstruction as follows: given a sampled dataset $\{p_i, f_i\}$ consisting of a set of $N$ sample points $p_i \in D$ and sample values $f_i \in C$, we want to produce a continuous function $\tilde{f} : D \to C$ that approximates the original $f$. The reconstructed function should equal the original one at all sample points, i.e., $\tilde{f}(p_i) = f(p_i) = f_i$. One way to define a reconstructed function that satisfies this property is to set

$$\tilde{f} = \sum_{i=1}^{N} f_i \phi_i,$$

where the $\phi_i : D \to \mathbb{R}$ are called basis functions or interpolation functions. In other words, we define the reconstruction as a weighted sum of a given set of basis functions $\phi_i$, where the weights are exactly our sample values $f_i$. Since we want $\tilde{f}(p_j) = f_j$ at all sample points $p_j$, we get

$$\sum_{i=1}^{N} f_i \phi_i(p_j) = f_j, \quad \forall j.$$

This equation must hold for any function $f$. Choosing a function that is 1 at one sample point and 0 at all the others yields

$$\phi_i(p_j) = \begin{cases} 1, & i = j, \\ 0, & i \neq j. \end{cases}$$

The equation above is sometimes referred to as the orthogonality of basis functions. Let us now consider the constant function $g(x) = 1$ for any $x \in D$. Requiring that $g$ be reconstructed exactly, we obtain $\sum_{i=1}^{N} \phi_i(p_j) = 1$ at every sample point $p_j$ and, more generally,

$$\sum_{i=1}^{N} \phi_i(x) = 1, \quad \forall x \in D.$$

The property described in the equation above is called the normality of basis functions. Basis functions that are both orthogonal and normal are called orthonormal. To reconstruct a sampled function, we can use different orthonormal basis functions.
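The two properties can be checked numerically. The following is a minimal sketch, assuming 1D sample points and piecewise linear "hat" basis functions; the names `hat` and `reconstruct` are illustrative and do not come from the report.

```python
def hat(i, x, points):
    """Linear "hat" basis function of sample i on a sorted list of 1D points."""
    p = points[i]
    # Rising part, between the previous sample point and p.
    if i > 0 and points[i - 1] <= x <= p:
        return (x - points[i - 1]) / (p - points[i - 1])
    # Falling part, between p and the next sample point.
    if i < len(points) - 1 and p <= x <= points[i + 1]:
        return (points[i + 1] - x) / (points[i + 1] - p)
    return 1.0 if x == p else 0.0


def reconstruct(x, points, values):
    """Weighted sum of basis functions, with the sample values as weights."""
    return sum(f * hat(i, x, points) for i, f in enumerate(values))
```

At a sample point exactly one basis function is 1 and all others are 0 (orthogonality), and between sample points the two nonzero basis functions sum to 1 (normality), so `reconstruct` returns the sample values at the sample points and a linear blend in between.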

A grid, sometimes also called a mesh, is a subdivision of a given domain $D \subset \mathbb{R}^d$ into a collection of cells, sometimes also called elements, denoted $c_i$. The union of the cells completely covers the sampled domain, i.e., $\bigcup_i c_i = D$, and the cells are non-overlapping, i.e., $c_i \cap c_j = \emptyset,\ \forall i \neq j$.

We can now define the simplest set of basis functions, the constant basis functions. These approximate a given function by the piecewise, per-cell constant sample value $f_i$: every point $x \in D$ is assigned the sample value of the nearest cell center. For this reason, piecewise constant interpolation is also called nearest-neighbor interpolation. Constant basis functions are simple to implement and have virtually no computational cost, and they work for any cell shape and in any dimension. However, they provide a poor, staircase-like approximation $\tilde{f}$ of the original $f$: the visualization shows a visible discontinuity at every cell boundary.
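As a small illustration of piecewise constant reconstruction, the sketch below assumes a 1D grid described only by its cell centers; the function name `nearest_neighbor` is hypothetical.

```python
def nearest_neighbor(x, centers, values):
    """Piecewise constant reconstruction: return the value of the closest cell center."""
    nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
    return values[nearest]
```

The result jumps from one sample value to the next halfway between two centers, which is the staircase behavior described above.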

By using higher-order basis functions we can provide a better, more continuous reconstruction. The next-simplest basis functions beyond the constant ones are the linear basis functions. To use these, however, we need to make some assumptions about the cell types used in the grid. Let us consider a single quadrilateral cell $c$ with vertices $(v_1, v_2, v_3, v_4)$, where $v_1 = (0, 0)$, $v_2 = (1, 0)$, $v_3 = (1, 1)$, and $v_4 = (0, 1)$: an axis-aligned square of edge size 1 with the origin as its first vertex. We call this the reference cell in $\mathbb{R}^2$. Coordinates in the reference cell $[0, 1]^d$ are called reference coordinates: $r_1, \ldots, r_d$ (or $r, s, t$ for $d = 3$). We now define four local basis functions $\Phi^1_1$, $\Phi^1_2$, $\Phi^1_3$, and $\Phi^1_4$, with $\Phi^1_i : [0, 1]^2 \to \mathbb{R}$, as follows:

$$\begin{aligned}
\Phi^1_1(r, s) &= (1 - r)(1 - s), \\
\Phi^1_2(r, s) &= r(1 - s), \\
\Phi^1_3(r, s) &= rs, \\
\Phi^1_4(r, s) &= (1 - r)s.
\end{aligned}$$

These basis functions are indeed orthonormal. For any point $(r, s)$ in the reference cell, we can now use them to define the function

$$\tilde{f}(r, s) = \sum_{i=1}^{4} f_i \Phi^1_i(r, s)$$

as a sum of linear basis functions, which makes it a first-order continuous reconstruction of the four sample values $f_1, f_2, f_3, f_4$ defined at the cell vertices. For an arbitrary quadrilateral cell $c$ in $\mathbb{R}^3$, we can define a coordinate transformation $T : [0, 1]^2 \to \mathbb{R}^3$ that maps our reference cell to $c$. We want to map the reference cell vertices $v_i$ to the corresponding world cell vertices $p_i$, so $T(v_i) = p_i$. We define $T$ using our reference basis functions: for a cell with $n$ vertices, a point $(r, s, t)$ in the reference cell coordinate system is mapped to the point $(x, y, z)$ in the actual cell given by

$$(x, y, z) = T(r, s, t) = \sum_{i=1}^{n} p_i \Phi^1_i(r, s, t).$$
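The reference quad basis functions and the transformation $T$ can be sketched in a few lines. This is an illustrative example only; the function names `quad_basis`, `interpolate`, and `to_world` are assumptions, not names used in the report.

```python
REFERENCE_VERTICES = [(0, 0), (1, 0), (1, 1), (0, 1)]


def quad_basis(r, s):
    """Values of the four bilinear basis functions at reference coordinates (r, s)."""
    return [(1 - r) * (1 - s), r * (1 - s), r * s, (1 - r) * s]


def interpolate(values, r, s):
    """Reconstruct a value inside the reference cell from the four vertex samples."""
    return sum(f * phi for f, phi in zip(values, quad_basis(r, s)))


def to_world(vertices, r, s):
    """T(r, s): map reference coordinates to a point of the world cell."""
    weights = quad_basis(r, s)
    dim = len(vertices[0])
    return tuple(sum(w * v[k] for w, v in zip(weights, vertices)) for k in range(dim))
```

Evaluating `quad_basis` at the four reference vertices gives the four unit vectors, and `to_world` maps each reference vertex to the corresponding world vertex, as required by $T(v_i) = p_i$.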

If $T$ maps the reference cell to the world cell, then its inverse $T^{-1}$ maps points $(x, y, z)$ in the world cell to points $(r, s, t)$ in the reference cell, where our basis functions $\Phi^1_i$ are defined. Using $T^{-1}$, we can rewrite Equation (3.2) for our quad cell $c$:

$$\tilde{f}(x, y) = \sum_{i=1}^{4} f_i \Phi^1_i(T^{-1}(x, y)).$$

In order to compute the inverse transformation $T^{-1}$, we must invert the expression given by Equation (3.8). This inversion depends on the actual cell type.

We now have a way to reconstruct a piecewise linear function $\tilde{f}$ from samples on any quad grid: for every cell $c$ in the grid, we simply apply Equation (3.9). We can now finally define our piecewise linear reconstruction in terms of a set of global basis functions $\phi$, just as we did for piecewise constant reconstruction (Equation (3.6)). Given a grid with sample points $p_i$ and quad cells $c_i$, we can define our grid-wise linear basis functions $\phi^1_i$ as follows:

$$\phi^1_i(x, y) = \begin{cases} 0, & \text{if } (x, y) \notin \text{cells}(p_i), \\ \Phi^1_j(T^{-1}(x, y)), & \text{if } (x, y) \in c = (v_1, v_2, v_3, v_4), \text{ where } v_j = p_i, \end{cases}$$

where $\text{cells}(p_i)$ denotes the cells that have $p_i$ as a vertex. Sampling the continuous signal $f$ produces a set of samples $f_i$. Multiplying the samples by the global basis functions $\phi_i$, obtained from the reference basis functions $\Phi_j$ via the transform $T$, and summing the results, we obtain the reconstructed signal $\tilde{f}$.

The same basis-function, sampling, and reconstruction machinery can be applied to more data attributes than surface geometry alone, e.g., to shading. Gouraud shading produces a smooth illumination over the polygons by reconstructing the original continuous surface using piecewise linear interpolation for both the geometry and the illumination.

0.3 Discrete Datasets

We can say that, given

• a grid in terms of a set of cells defined by a set of sample points,

• some sampled values at the cell centers or cell vertices, and

• a set of basis functions,

we can define a piecewise continuous reconstruction of the sampled signal on this grid and work with it.

We defined a continuous dataset for a function $f : D \to C$ as the triplet $\mathcal{D} = (D, C, f)$. In the discrete case, we replace the function domain $D$ by the sampling grid $(\{p_i\}, \{c_i\})$, and the continuous function $f$ by its piecewise $k$-order continuous reconstruction $\tilde{f}$, computed using the grid, the sample values $f_i$, and a set of basis functions $\{\Phi^k_i\}$. Hence, the discrete (sampled) dataset counterpart of $(D, C, f)$ is the tuple $D_s = (\{p_i\}, \{c_i\}, \{f_i\}, \{\Phi^k_i\})$: grid points, grid cells, sample values, and reference basis functions.

Replacing a continuous dataset $D_c$ with its discrete counterpart $D_s$ means working with a piecewise $k$-order continuous function $\tilde{f}$ instead of a potentially higher-order continuous function $f$. The dataset requirements (accurate, minimal, generic, efficient, and simple) translate, for a discrete dataset, into constraints on the number and position of the sample points $p_i$, the shape of the cells $c_i$, the type of the reference basis functions $\Phi_i$, and the number and type of the sample values $f_i$. These constraints determine specific implementation solutions as follows. The cell shapes, together with the basis functions, determine different cell types. The number and type of sample values $f_i$ determine the attribute types.

0.4 Cell Types

A grid is a collection of cells $c_i$, whose vertices are the grid sample points $p_i$. Given some data sampled at the points $p_i$, the cells are used to define supports for the basis functions $\phi_i$ used to interpolate the data between the sample points.

The dimensionality $d$ of the cells $c_i$ has to be the same as the topological dimension of the sampled domain $D$ if we want to approximate $D$ by the union of all cells $\bigcup_i c_i$. For example, if $D$ is a plane ($d = 2$), we must use planar cells, such as polygons. If $D$ is a volume ($d = 3$), we must use volumetric cells, such as tetrahedra. For each cell type we shall present the linear basis functions it supports, as well as the coordinate transformation $T^{-1}$ that maps from locations $(x, y, z)$ in the actual world cell to locations $(r, s, t)$ in the reference cell.

0.4.1 Vertex

The simplest cell type, of dimension $d = 0$, is identical to its single vertex, $c = (v_1)$. The vertex has a single, constant basis function $\Phi^0_1(r) = 1$. In practice there is no distinction between sample points and vertex cells.

0.4.2 Line

Line cells have dimension $d = 1$ and two vertices, $c = (v_1, v_2)$. Line cells are used to interpolate along any kind of curve embedded in any dimension. Given the reference line cell defined by the points $v_1 = 0$, $v_2 = 1$, the two linear basis functions are

$$\begin{aligned}
\Phi^1_1(r) &= 1 - r, \\
\Phi^1_2(r) &= r.
\end{aligned}$$

The transformation $T^{-1}$ for line cells is simply the dot product between the position vector $p - p_1$ of the desired point $p = (x, y, z)$ in the cell with respect to the cell's first vertex $p_1$ and the cell vector $p_2 - p_1$, normalized by the squared cell length:

$$T^{-1}_{line}(x, y, z) = r = \frac{(p - p_1) \cdot (p_2 - p_1)}{\|p_2 - p_1\|^2}.$$
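A minimal sketch of interpolation along a line cell follows. It assumes the parametrization $p = p_1 + r(p_2 - p_1)$, so that the reference coordinate $r$ is the projection of $p - p_1$ onto the cell vector divided by the squared cell length; the helper names are hypothetical.

```python
def sub(a, b):
    """Component-wise difference of two points."""
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    """Dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


def line_inverse(p, p1, p2):
    """Reference coordinate r of point p on the line cell (p1, p2)."""
    edge = sub(p2, p1)
    return dot(sub(p, p1), edge) / dot(edge, edge)


def line_interpolate(p, p1, p2, f1, f2):
    """Linear reconstruction at p from the vertex samples f1 and f2."""
    r = line_inverse(p, p1, p2)
    return (1 - r) * f1 + r * f2
```

For the cell from $(1, 0, 0)$ to $(3, 4, 0)$, the midpoint $(2, 2, 0)$ has $r = 0.5$ and receives the average of the two vertex values.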

0.4.3 Triangle

The simplest cell type in dimension $d = 2$ is the triangle, i.e., $c = (v_1, v_2, v_3)$. Triangles can be used to interpolate along any kind of surface (planar or curved) embedded in any dimension. Given the reference triangle cell defined by the points $v_1 = (0, 0)$, $v_2 = (1, 0)$, $v_3 = (0, 1)$, the three linear basis functions are

$$\begin{aligned}
\Phi^1_1(r, s) &= 1 - r - s, \\
\Phi^1_2(r, s) &= r, \\
\Phi^1_3(r, s) &= s.
\end{aligned}$$

The transformation $T^{-1}$ for triangular cells is

$$T^{-1}_{tri}(x, y, z) = (r, s) = \left( \frac{\|(p - p_1) \times (p_3 - p_1)\|}{\|(p_2 - p_1) \times (p_3 - p_1)\|},\ \frac{\|(p - p_1) \times (p_2 - p_1)\|}{\|(p_3 - p_1) \times (p_2 - p_1)\|} \right).$$

It is computed from cross products between the position vector $p - p_1$ of the point $p$ in the world cell with respect to the world cell's first vertex $p_1$ and the world cell edges $p_2 - p_1$ and $p_3 - p_1$.
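The cross-product ratios for triangles can be sketched as follows. This example assumes that the point $p$ lies inside the triangle (so that both ratios are non-negative) and uses hypothetical helper names.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def norm(a):
    return math.sqrt(sum(x * x for x in a))


def triangle_inverse(p, p1, p2, p3):
    """Reference coordinates (r, s) of a point p inside the triangle (p1, p2, p3)."""
    d, e2, e3 = sub(p, p1), sub(p2, p1), sub(p3, p1)
    r = norm(cross(d, e3)) / norm(cross(e2, e3))
    s = norm(cross(d, e2)) / norm(cross(e3, e2))
    return r, s


def triangle_interpolate(p, vertices, values):
    """Linear reconstruction at p from the three vertex samples."""
    r, s = triangle_inverse(p, *vertices)
    f1, f2, f3 = values
    return (1 - r - s) * f1 + r * f2 + s * f3
```

Each ratio compares the area spanned by $p - p_1$ and one edge with the area spanned by both edges, which is why the result is independent of how the triangle is embedded in 3D.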


0.4.4 Quad

Another possibility for interpolating over two-dimensional surfaces is to use quadrilateral cells, or quads. The reference quad is defined by the points $v_1 = (0, 0)$, $v_2 = (1, 0)$, $v_3 = (1, 1)$, and $v_4 = (0, 1)$ and is an axis-aligned square of edge size 1. On this reference quad the basis functions are

$$\begin{aligned}
\Phi^1_1(r, s) &= (1 - r)(1 - s), \\
\Phi^1_2(r, s) &= r(1 - s), \\
\Phi^1_3(r, s) &= rs, \\
\Phi^1_4(r, s) &= (1 - r)s.
\end{aligned}$$

A good trade-off between flexibility and simplicity is to support quad cells as input data, but to transform them internally into triangle cells by dividing every quad into two triangles along one of its two diagonals.

The transformation $T^{-1}_{quad}$ for a general quad cell involves bilinear basis functions and cannot be easily inverted analytically; we can only solve for $r, s$ as functions of $x, y, z$ numerically. If our actual cells are rectangular instead of arbitrary quads, as in uniform or rectilinear grids, we can do better. In this case the transformation is

$$T^{-1}_{rect}(x, y, z) = (r, s) = \left( \frac{(p - p_1) \cdot (p_2 - p_1)}{\|p_2 - p_1\|^2},\ \frac{(p - p_1) \cdot (p_4 - p_1)}{\|p_4 - p_1\|^2} \right).$$
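For rectangular cells, the inverse transformation and the bilinear reconstruction can be combined as in the sketch below. It assumes the vertex order $p_1, p_2, p_3, p_4$ of the reference quad, with orthogonal edges $p_2 - p_1$ and $p_4 - p_1$; the function names are illustrative.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rect_inverse(p, p1, p2, p4):
    """Reference coordinates (r, s) of p in a rectangular cell."""
    d, e_r, e_s = sub(p, p1), sub(p2, p1), sub(p4, p1)
    return dot(d, e_r) / dot(e_r, e_r), dot(d, e_s) / dot(e_s, e_s)


def rect_interpolate(p, corners, values):
    """Bilinear reconstruction at p from the four vertex samples."""
    p1, p2, _p3, p4 = corners
    r, s = rect_inverse(p, p1, p2, p4)
    basis = [(1 - r) * (1 - s), r * (1 - s), r * s, (1 - r) * s]
    return sum(f * phi for f, phi in zip(values, basis))
```

Because the two edges are orthogonal, each reference coordinate is obtained by an independent projection; for a general quad this shortcut does not apply and a numerical solve is needed, as noted above.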

0.4.5 Tetrahedron

The simplest cell type in dimension $d = 3$ is the tetrahedron, defined by its four vertices $c = (v_1, v_2, v_3, v_4)$. On the reference tetrahedron defined by the points $v_1 = (0, 0, 0)$, $v_2 = (1, 0, 0)$, $v_3 = (0, 1, 0)$, and $v_4 = (0, 0, 1)$, the four linear basis functions are

$$\begin{aligned}
\Phi^1_1(r, s, t) &= 1 - r - s - t, \\
\Phi^1_2(r, s, t) &= r, \\
\Phi^1_3(r, s, t) &= s, \\
\Phi^1_4(r, s, t) &= t.
\end{aligned}$$

Given a tetrahedral cell with vertices $p_1, p_2, p_3, p_4$, the transformation $T^{-1}_{tet}(x, y, z) = (r, s, t)$ follows the same pattern:

$$\begin{aligned}
r &= \frac{|(p - p_4) \cdot ((p_1 - p_4) \times (p_3 - p_4))|}{|(p_1 - p_4) \cdot ((p_2 - p_4) \times (p_3 - p_4))|}, \\
s &= \frac{|(p - p_4) \cdot ((p_1 - p_4) \times (p_2 - p_4))|}{|(p_1 - p_4) \cdot ((p_2 - p_4) \times (p_3 - p_4))|}, \\
t &= \frac{|(p - p_3) \cdot ((p_1 - p_3) \times (p_2 - p_3))|}{|(p_1 - p_4) \cdot ((p_2 - p_4) \times (p_3 - p_4))|}.
\end{aligned}$$

Some applications also use pyramid cells and prism cells to discretize volumetric domains. Pyramid and prism cells can be split into tetrahedral cells.
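The tetrahedron formulas are ratios of scalar triple products, i.e., of sub-tetrahedron volumes to the volume of the whole cell. A minimal sketch, assuming the point $p$ lies inside the cell and using hypothetical helper names:

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def triple(a, b, c):
    """Scalar triple product a . (b x c), six times a signed tetrahedron volume."""
    return dot(a, cross(b, c))


def tet_inverse(p, p1, p2, p3, p4):
    """Reference coordinates (r, s, t) of a point p inside the tetrahedron."""
    volume = abs(triple(sub(p1, p4), sub(p2, p4), sub(p3, p4)))
    r = abs(triple(sub(p, p4), sub(p1, p4), sub(p3, p4))) / volume
    s = abs(triple(sub(p, p4), sub(p1, p4), sub(p2, p4))) / volume
    t = abs(triple(sub(p, p3), sub(p1, p3), sub(p2, p3))) / volume
    return r, s, t
```

With the basis functions listed above, the point $p = (1 - r - s - t)\,p_1 + r\,p_2 + s\,p_3 + t\,p_4$ is recovered from the returned coordinates.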

0.4.6 Hexahedron

The next $d = 3$ dimensional cell type is the hexahedron, or hex, defined by its eight vertices $c = (v_1, \ldots, v_8)$. The reference hexahedron is the axis-aligned cube of unit edge length, with $v_1$ at the origin. On this cell the eight linear basis functions are

$$\begin{aligned}
\Phi^1_1(r, s, t) &= (1 - r)(1 - s)(1 - t), \\
\Phi^1_2(r, s, t) &= r(1 - s)(1 - t), \\
\Phi^1_3(r, s, t) &= rs(1 - t), \\
\Phi^1_4(r, s, t) &= (1 - r)s(1 - t), \\
\Phi^1_5(r, s, t) &= (1 - r)(1 - s)t, \\
\Phi^1_6(r, s, t) &= r(1 - s)t, \\
\Phi^1_7(r, s, t) &= rst, \\
\Phi^1_8(r, s, t) &= (1 - r)st.
\end{aligned}$$

We can split hexahedral cells into six tetrahedra each and then use only tetrahedra as 3D cell types, simplifying software implementation and maintenance. $T^{-1}_{hex}$ for general hexahedral cells cannot be computed analytically and must be determined using numerical methods. However, if our actual hex cells are parallelepipeds with orthogonal edges, called box cells, $T^{-1}$ can be computed by taking dot products of the position vector $p - p_1$ with the cell edges. For a box cell with vertices $p_1, \ldots, p_8$, we obtain:

$$T^{-1}_{box}(x, y, z) = (r, s, t) = \left( \frac{(p - p_1) \cdot (p_2 - p_1)}{\|p_2 - p_1\|^2},\ \frac{(p - p_1) \cdot (p_4 - p_1)}{\|p_4 - p_1\|^2},\ \frac{(p - p_1) \cdot (p_5 - p_1)}{\|p_5 - p_1\|^2} \right).$$

Software packages sometimes offer more cell types, such as squares and pixels (identical to the cells of a rectangular grid), triangle strips (a memory-efficient way to store sequences of triangle cells that share edges), polygons in 2D, and cubes and voxels in 3D (which play the same role as squares and pixels in 2D). Some applications use quadratic cells, which contain, besides their vertices, a midpoint for every edge and, for 3D cells, the centers of the cell faces. Quadratic cells support quadratic basis functions and provide a piecewise quadratic (smoother) reconstruction of the data, which is $C^2$ continuous; they are often used in numerical simulation applications such as finite element methods.

In general, you should add new cell types to your application's data representation only if they allow you to implement some particular visualization or data processing algorithm much more easily and/or efficiently than the cell types your software already supports.

0.5 Grid Types

0.5.1 Uniform Grids

In a uniform grid, the domain $D$ is an axis-aligned box, e.g., a line segment for $d = 1$, a rectangle for $d = 2$, or a parallelepiped for $d = 3$. On a uniform grid, the sample points $p_i \in D \subset \mathbb{R}^d$ are equally spaced along the $d$ axes of the domain $D$. Hence, a sample point is described by its $d$ integer coordinates $n_1, \ldots, n_d$. These integer coordinates are sometimes called structured coordinates. A simple example of a uniform grid is a 2D pixel image, where every pixel $p_i$ is located by two integer coordinates. This regular point ordering allows us to define the grid cells implicitly by using the point indices.

The major advantages of uniform grids are their simple implementation and practically zero storage requirements. Regardless of its size, storing a $d$-dimensional grid itself takes $3d$ floating-point values, i.e., only $12d$ bytes of memory with 4-byte values. Storing the actual sample values at the grid points takes storage proportional to the number of sample points.
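The implicit nature of a uniform grid can be illustrated with a short sketch. The class below is hypothetical: it assumes the $3d$ stored values are an origin, a spacing, and a sample count per axis, and that points are numbered with the first axis varying fastest; neither choice is prescribed by the report.

```python
from dataclasses import dataclass


@dataclass
class UniformGrid:
    origin: tuple   # assumed: position of the first sample point on every axis
    spacing: tuple  # assumed: constant distance between samples on every axis
    counts: tuple   # number of sample points on every axis

    def point(self, index):
        """World position of the sample point with the given structured coordinates."""
        return tuple(m + n * h for m, n, h in zip(self.origin, index, self.spacing))

    def point_id(self, index):
        """Flatten structured coordinates into one index (first axis fastest)."""
        flat, stride = 0, 1
        for n, count in zip(index, self.counts):
            flat += n * stride
            stride *= count
        return flat

    def cell_vertices(self, i, j):
        """2D only: structured coordinates of the four vertices of quad cell (i, j)."""
        return [(i, j), (i + 1, j), (i + 1, j + 1), (i, j + 1)]
```

No point coordinates or cell definitions are stored; both are computed on demand from the few values that describe the grid, which is why the storage cost does not depend on the grid size.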

0.5.2 Rectilinear Grids

Uniform grids are simple and efficient, but have limited modeling power. To accurately represent a function with a non-uniform variation rate, we need either to use a high sampling density on a uniform grid, or to use a grid with non-uniform sample density. Rectilinear grids relax the constraint of equal sampling distances along a given axis, but keep the axis-aligned, matrix-like point ordering and implicit cell definition. These grids are similar to uniform ones, except that the distances $\delta_{i,j}$ between the sample points are no longer equal along the grid axes. Implementing a rectilinear grid implies storing the grid origin and the sample count $(m_i, N_i)$ for every dimension $i = 1, \ldots, d$, as for the uniform grid. Additionally, we must store the sample steps. In total, the storage requirements are $2d + \sum_{i=1}^{d} N_i$ values.

0.5.3 Structured Grids

In rectilinear grids the sampled domain is still a rectangular box, and the sample point density can be changed only one axis at a time. Rectilinear grids, for example, do not allow us to place more sample points only in the central peak region of an exponential function.

Structured grids allow explicit placement of every sample point $p_i = (x_{i1}, \ldots, x_{id})$. The user can freely specify the coordinates $x_{ij}$ of all points. At the same time, structured grids preserve the matrix-like ordering of the sample points. Implementing a structured grid implies storing the coordinates of all grid sample points $p_i$ and the number of points $N_1, \ldots, N_d$ per dimension. Structured grids can represent a large number of shapes.


0.5.4 Unstructured Grids

Structured grids can be seen as a deformation of uniform grids, where the topological ordering of the points (cells) stays the same, but their geometrical positions are allowed to vary freely.

There are, however, shapes that cannot be effectively modeled by structured grids. Unstructured grids handle these by allowing both the sample points and the cells to be defined explicitly. An unstructured grid can be modeled as a collection of sample points $p_i$, $i \in [0, N]$, and cells $c_i = (v_{i1}, \ldots, v_{iC_i})$. The values $v_{ij} \in [0, N]$ are called cell vertices and refer to the sample points $p_{v_{ij}}$ used by the cell. A cell is thus an ordered list of sample point indices. This model allows us to define every cell separately and independently of the other cells. Also, cells of different types and even dimensionalities can be freely mixed in the same grid, if desired. If cells share the same sample points as their vertices, this can be directly expressed, which is useful in several contexts:

• Storing an index, represented by an integer, is usually cheaper than storing a $d$-dimensional coordinate ($d$ floating-point numbers).

• We can process the grid geometry (the positions of the sample points $p_i$) independently of the grid topology, i.e., the cell definitions.

In practice, it is preferable to use unstructured grids containing a single cell type, as these are simpler to implement and can also lead to faster application code. The cost of storing an unstructured grid depends on the types of cells used and on the actual grid. For example, a grid of $C$ $d$-dimensional cells with $V$ vertices per cell and $N$ sample points would require $dN + CV$ values.
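The point-plus-index model can be sketched with plain Python lists. The example grid (two triangles sharing an edge) and the helper names are made up for illustration.

```python
# Geometry: sample point coordinates. Topology: cells as tuples of point indices.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
cells = [(0, 1, 2), (1, 3, 2)]  # two triangles sharing the edge between points 1 and 2


def cell_coordinates(points, cell):
    """Resolve the vertex indices of a cell to actual point coordinates."""
    return [points[v] for v in cell]


def storage_values(points, cells):
    """Number of stored values: d per point plus one index per cell vertex."""
    return sum(len(p) for p in points) + sum(len(c) for c in cells)
```

For this grid, $d = 2$, $N = 4$, $C = 2$, and $V = 3$, so `storage_values` returns $dN + CV = 14$. Moving a shared point changes both triangles at once, since they refer to it by index rather than storing their own copy.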

0.6 Attributes

In visualization, the set of sample values of a sampled dataset is usually called attribute data. Attribute data can be characterized by their dimension $c$, as well as by the semantics of the data they represent. This gives rise to several attribute types.

0.6.1 Scalar Attributes

Scalar attributes are $c = 1$ dimensional. They are represented by plain real numbers. They encode various physical quantities, such as temperature, concentration, pressure, or density, or geometrical measures, such as length or height (e.g., an elevation plot of a function $f : \mathbb{R}^2 \to \mathbb{R}$).

0.6.2 Vector Attributes

Vector attributes are usually $c = 2$ or $c = 3$ dimensional. They can encode position, direction, force, or gradients of scalar functions. Usually vectors have an orientation and a magnitude, also called length or norm.

0.6.3 Color Attributes

Color attributes are usually $c = 3$ dimensional and represent the displayable colors on a computer screen. The three components of a color attribute can have different meanings, depending on the color system in use, such as the RGB system. RGB is an additive system, since every color is represented as a mix of pure red, green, and blue in different amounts. Equal amounts of these colors determine gray shades, whereas other combinations determine various hues.

Another popular color representation system is the HSV system, where the three color components specify the hue, saturation, and value of a given color. The advantage of the HSV system is that it is more intuitive for the human user. Hue distinguishes between colors of different wavelengths, such as red, yellow, and blue. Saturation represents the color purity: a saturation of 1 corresponds to a pure, undiluted color, whereas a saturation of 0 corresponds to white (or, at lower values, to a gray shade). Value represents the brightness, or luminance, of a given color. A value of 0 is always black, whereas a value of 1 is the brightest color of a given hue and saturation that can be represented on a given system. The value component of an HSV color is equal to the maximum of its R, G, and B components.
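The relation between the two color systems can be tried out with Python's standard `colorsys` module, which works on components in the range 0 to 1. The wrapper function `describe` is only an illustrative helper.

```python
import colorsys


def describe(rgb):
    """Convert an (r, g, b) triple with components in [0, 1] to named HSV components."""
    hue, saturation, value = colorsys.rgb_to_hsv(*rgb)
    return {"hue": hue, "saturation": saturation, "value": value}
```

For any input color, the returned value component equals the largest of the three RGB components, and pure red maps to hue 0 with saturation 1 and value 1.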

0.6.4 Tensor Attributes

Tensor attributes are high-dimensional generalizations of vectors and matrices. We can compute the curvature of a planar curve using its second derivative $\frac{d^2 f}{dx^2}$, and the curvature of a 3D surface in a given direction using its Hessian matrix $H$ of second-order partial derivatives. The Hessian matrix is also called the curvature tensor of the given surface.

Besides curvature, tensors can describe other physical quantities that depend on direction, such as water diffusivity or stress and strain in materials. Tensors are characterized by their rank. Scalars are tensors of rank 0. Vectors are tensors of rank 1. The Hessian curvature tensor is a symmetric tensor of rank 2, since it is expressed by a symmetric matrix.

0.6.5 Non-Numerical Attributes

Examples of possible non-numerical attribute types are text, images, file names, or even sound samples. The main property of $D_s$ is that it permits us to reconstruct some piecewise, $k$-order continuous function $\tilde{f} : D \to C$, given the sample values $f_i \in C$. For non-numerical attributes this raises a question: what should the multiplication between the sample values $f_i$ and the real-valued basis functions $\Phi_i$, and the addition of the sample values $f_i$ in Equation (3.9), mean? Unless these operations can be given a meaning, such attributes cannot be interpolated in this way.

0.6.6 Properties of Attribute Data

The main purpose of attribute data is to allow a reconstruction $\tilde{f}$ of the sampled information $f_i$. Attribute data has several general properties:

• Attribute data, i.e., the sample values $f_i$, must be defined for all sample points $p_i$ of a dataset $D_s$. If samples at some points $p_i$ are missing, there are several solutions: (1) remove these points completely from the grid; (2) replace the missing values $f_i$ with some special value (such as 0); (3) define the missing values from the existing ones, using some more complex interpolation scheme.

• Cells can contain any number of attributes, of any type, as long as these are defined for all data points. We can choose whether to model our data as a single $c$-value dataset or as $c$ one-value datasets. The answer is to treat all attributes that have a related meaning as a single higher-dimensional attribute, and to keep attributes with different meanings separate.

Operations on color attributes must consider all color components simultaneously, as the color components R, G, B have a related meaning.

Some data visualization applications classify attribute data into:

• node or vertex attributes, defined at the vertices of the grid cells, which correspond to a sampled dataset that uses higher-order basis functions, such as linear ones, and

• cell attributes, defined at the center points of the grid cells, which correspond to a sampled dataset that uses constant basis functions.

Vertex attributes can be converted to cell attributes and conversely by resampling.

The attribute components are sometimes related by constraints. This happens for normal attributes $n \in \mathbb{R}^3$, where the three components are constrained to yield unit-length normals, i.e., $|n| = \sqrt{n_x^2 + n_y^2 + n_z^2} = 1$. Depending on the choice of basis functions, interpolating these components separately as scalar values may not preserve the unit length of the interpolated normal $\tilde{n}$. A first solution is to interpolate the components separately and then enforce the constraint on the result by normalizing it, i.e., replacing $\tilde{n}$ with $\tilde{n}/|\tilde{n}|$ (this works when the sample values do not vary too strongly across a grid cell). A second solution is to represent the constraint directly in the data attributes, rather than enforcing it after interpolation. For normal attributes, this means representing 3D normals as two independent orientations, e.g., polar coordinates $\alpha, \beta$, instead of the three $x, y, z$ components, which are dependent via the unit-length constraint. We can then interpolate the orientations $\alpha, \beta$ using the desired basis functions and always obtain a unit-length normal.
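The first solution (interpolate the components, then normalize) can be sketched as follows for linear interpolation between two normals. The function names are illustrative, and the sketch assumes the two normals are not opposite, since the interpolated vector would then have zero length.

```python
import math


def lerp(a, b, t):
    """Component-wise linear interpolation between vectors a and b."""
    return tuple((1 - t) * x + t * y for x, y in zip(a, b))


def length(n):
    return math.sqrt(sum(c * c for c in n))


def normalize(n):
    """Rescale n to unit length (assumes n is not the zero vector)."""
    norm = length(n)
    return tuple(c / norm for c in n)


def interpolate_normal(n1, n2, t):
    """Interpolate the components separately, then re-enforce the unit length."""
    return normalize(lerp(n1, n2, t))
```

Halfway between the unit normals $(1, 0, 0)$ and $(0, 1, 0)$, plain component-wise interpolation gives a vector of length $\sqrt{0.5} \approx 0.71$; after normalization the length is 1 again.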

0.7 Computing Derivatives of Sampled Data

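One of the requirements for a sampled dataset $D_s = (\{p_i\}, \{c_i\}, \{f_i\}, \{\Phi_i\})$ is that it should be generic: we can easily replace the various data processing operations available for its continuous counterpart with equivalent operations on $D_s$. Computing derivatives is one such operation.

If $\tilde{f} = \sum_{i=1}^{N} f_i \phi_i$, then

$$\frac{\partial \tilde{f}}{\partial x_i} = \sum_{j=1}^{N} f_j \frac{\partial \phi_j}{\partial x_i}.$$

Using the expressions of the reference basis functions, evaluated at the reference coordinates $r = T^{-1}(x)$:

$$\frac{\partial \tilde{f}}{\partial x_i} = \sum_{j=1}^{N} f_j \frac{\partial \Phi_j}{\partial x_i}(r).$$

We now use the chain rule,

$$\frac{\partial \Phi_j}{\partial x_i} = \sum_{k=1}^{d} \frac{\partial \Phi_j}{\partial r_k} \frac{\partial r_k}{\partial x_i},$$

to obtain

$$\frac{\partial \tilde{f}}{\partial x_i} = \sum_{j=1}^{N} f_j \sum_{k=1}^{d} \frac{\partial \Phi_j}{\partial r_k} \frac{\partial r_k}{\partial x_i}.$$

Finally, we can rewrite the last equation in a convenient matrix form, as follows:

$$\begin{pmatrix} \dfrac{\partial \tilde{f}}{\partial x_1} \\ \dfrac{\partial \tilde{f}}{\partial x_2} \\ \vdots \\ \dfrac{\partial \tilde{f}}{\partial x_d} \end{pmatrix}
= \sum_{j=1}^{N} f_j
\underbrace{\begin{pmatrix}
\dfrac{\partial r_1}{\partial x_1} & \dfrac{\partial r_2}{\partial x_1} & \cdots & \dfrac{\partial r_d}{\partial x_1} \\
\dfrac{\partial r_1}{\partial x_2} & \dfrac{\partial r_2}{\partial x_2} & \cdots & \dfrac{\partial r_d}{\partial x_2} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial r_1}{\partial x_d} & \dfrac{\partial r_2}{\partial x_d} & \cdots & \dfrac{\partial r_d}{\partial x_d}
\end{pmatrix}}_{\text{inverse Jacobian matrix } J^{-1}}
\begin{pmatrix} \dfrac{\partial \Phi_j}{\partial r_1} \\ \dfrac{\partial \Phi_j}{\partial r_2} \\ \vdots \\ \dfrac{\partial \Phi_j}{\partial r_d} \end{pmatrix}.$$

The matrix above, with entry $\partial r_k / \partial x_i$ in row $i$ and column $k$, is called the inverse Jacobian matrix $J^{-1}$. It is the inverse of the Jacobian matrix $J$, which has entry $\partial x_i / \partial r_k$ in row $k$ and column $i$. Using $T^{-1}$, we can rewrite the entries of the inverse Jacobian as

$$\frac{\partial r_k}{\partial x_i} = \frac{\partial T^{-1}_k(x_1, \ldots, x_d)}{\partial x_i},$$

where $T^{-1}_k$ denotes the $k$-th component of the function $T^{-1}$. Putting it all together, we get the formula for computing the partial derivatives of a sampled dataset $\tilde{f}$ with respect to all coordinates $x_i$:

$$\frac{\partial \tilde{f}}{\partial x_i} = \sum_{j=1}^{N} f_j \sum_{k=1}^{d} \frac{\partial T^{-1}_k}{\partial x_i} \frac{\partial \Phi_j}{\partial r_k}.$$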

To use this equation in practice, we need to evaluate the derivatives of both the reference basis functions Φk and T−1 for every cell type. Alternatively, we can evaluate the Jacobian matrix instead of its inverse, using the reference-cell to world-cell coordinate transform T instead of T−1, then numerically invert J, and finally apply Equation (3.33). For all cells described in Section 3.4, the coordinate transformations T−1 are linear functions of the arguments xi, so their derivatives are constant. Hence, the derivatives of f are of the same order as those of the basis functions Φk we choose to use.
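The alternative route (evaluate J from T, then invert it numerically) can be sketched for a 2D quad cell with bilinear basis functions. This is only an illustration under stated assumptions: the function name and the test cell are hypothetical, and the code uses the layout J[i][k] = ∂x_i/∂r_k, in which the gradient is obtained by solving with the transpose of J.

```python
def bilinear_gradient(verts, values, r, s):
    """Gradient (df/dx, df/dy) of a bilinear interpolant on a quad cell.

    verts: four (x, y) vertices in counterclockwise order.
    values: scalar samples f_1..f_4 at those vertices.
    (r, s): reference coordinates in [0, 1]^2 of the evaluation point.
    """
    # Derivatives of the reference basis functions Phi_1..Phi_4.
    dphi_dr = [-(1 - s), (1 - s), s, -s]
    dphi_ds = [-(1 - r), -r, r, (1 - r)]

    # Jacobian of T(r, s) = sum_j p_j Phi_j(r, s): J[i][k] = d x_i / d r_k.
    J = [[sum(p[i] * d[j] for j, p in enumerate(verts))
          for d in (dphi_dr, dphi_ds)] for i in range(2)]

    # Gradient of f in reference coordinates.
    f_r = sum(f * d for f, d in zip(values, dphi_dr))
    f_s = sum(f * d for f, d in zip(values, dphi_ds))

    # Solve J^T g = (f_r, f_s) for the world-space gradient g (Cramer's rule).
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    gx = (f_r * J[1][1] - J[1][0] * f_s) / det
    gy = (J[0][0] * f_s - J[0][1] * f_r) / det
    return gx, gy

# Sheared cell carrying samples of f(x, y) = 2x + 3y, whose gradient is (2, 3).
verts = [(0.0, 0.0), (2.0, 0.0), (3.0, 1.0), (1.0, 1.0)]
values = [2 * x + 3 * y for x, y in verts]
gx, gy = bilinear_gradient(verts, values, 0.25, 0.5)
```

The sheared cell is chosen on purpose: on an axis-aligned rectangle J is diagonal, so a transposition mistake would go unnoticed.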

For quad cells with linear basis functions, for example, the partial derivatives of f inside a given cell are computed by linearly interpolating the 1D derivatives of f along opposite cell edges. A similar result can be obtained for rectilinear grids as well as for hexahedral cells. If a dataset is noisy, the computed derivatives tend to exhibit even stronger noise than the original data. A simple method to limit these problems is to pre-filter the input dataset in order to eliminate high-frequency noise, using methods such as the Laplacian smoothing described in Section 8.4. However, smoothing can also eliminate important information from the dataset together with the noise.

0.8 Implementation

0.8.1 Grid Implementation

0.9 Advanced Data Representation

Sometimes more advanced forms of data manipulation and representation are needed. We will describe the task of data resampling, which is used in the process of converting information between different types of datasets that have different sample points, cells or basis functions.

0.9.1 Data Resampling

Let us consider piecewise constant normals, i.e., the polygon normals themselves. These are discontinuous at the polygon vertices and, in fact, along the complete polygon edges, so we cannot use them as approximations for the vertex normals. How can we compute vertex normal values from the known polygon normals? The answer is provided by an operation called resampling.

Resampling computes the values f′i of the target dataset as a function of the values fi of the source dataset. For simplicity, we assume that both datasets use the same set of basis functions Φi.

Let us now consider a common resampling operation in data visualization: converting cell attributes (fi) to vertex attributes (f′i). Cell attributes imply the use of constant basis functions Φi; vertex attributes, in contrast, imply the use of higher-order basis functions, such as linear ones. At the same time, we want the sample points of the target grid (the target grid vertices) to be identical to the source vertices, so that the two grids match.

Vertex data is the area-weighted average of the cell data in the cells that use a given vertex. Conversely, when resampling from vertices to cells, cell attributes are the average of the cell's vertex attributes.
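The two resampling rules can be sketched for a small triangle mesh. The data layout (a point list plus index triples) and the function names are assumptions of this example, not a prescribed implementation.

```python
def triangle_area(a, b, c):
    """Area of a 2D triangle from its three vertices."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])) / 2.0

def cells_to_vertices(points, cells, cell_values):
    """Area-weighted average of the values of all cells sharing each vertex."""
    num = [0.0] * len(points)
    den = [0.0] * len(points)
    for cell, value in zip(cells, cell_values):
        area = triangle_area(*(points[i] for i in cell))
        for i in cell:
            num[i] += area * value
            den[i] += area
    # Vertices used by no cell get 0.0 (an arbitrary choice for this sketch).
    return [n / d if d > 0.0 else 0.0 for n, d in zip(num, den)]

def vertices_to_cells(cells, vertex_values):
    """Plain average of the vertex values of each cell."""
    return [sum(vertex_values[i] for i in cell) / len(cell) for cell in cells]

# Unit square split into two triangles with cell values 10 and 20.
points = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
cells = [(0, 1, 2), (0, 2, 3)]
vertex_values = cells_to_vertices(points, cells, [10.0, 20.0])
round_trip = vertices_to_cells(cells, vertex_values)
```

Here `vertex_values` is `[15.0, 10.0, 15.0, 20.0]`, and converting back to cells does not return the original 10 and 20: the jump between the two cells has been smoothed.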


Resampling data from cells to vertices increases the assumed continuity. If our original sampled data were indeed continuous of that order, no problem appears. However, if the original data contained, e.g., zero-order discontinuities, such as jumps or holes, resampling it to a higher-continuity grid also throws away discontinuities which might have been a feature of the data and not a sampling artifact. In contrast, resampling from a higher continuity (vertex data) to a lower continuity (cell data) has fewer side effects: overall, the smoothness of the data decreases globally.

Two other frequently used resampling operations are subsampling and supersampling. Subsampling reduces the number of sample points, keeping a subset of the original dataset points; working with a smaller dataset improves processing speed and lowers memory demands. After eliminating some of the points, subsampling operations can choose or redistribute the remaining points in order to obtain a better approximation of the original data. Subsampling implementations can take advantage of the dataset topology. A desirable property of subsampling is to keep most samples in the regions of rapid data variation and cull most samples from the regions of slow data variation. A technique called uniform subsampling is simple and effective when the original dataset is densely sampled. It is used on uniform, rectilinear, and structured grids to keep every k-th point along every dimension and discard the remaining ones.
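Uniform subsampling of a 2D structured grid reduces to strided indexing. A minimal sketch, assuming the samples are stored as a list of rows (the function name is hypothetical):

```python
def uniform_subsample(grid, k):
    """Keep every k-th sample along both axes of a 2D structured grid."""
    return [row[::k] for row in grid[::k]]

# A 5 x 7 grid whose sample at row j, column i stores 10 * j + i.
grid = [[10 * j + i for i in range(7)] for j in range(5)]
coarse = uniform_subsample(grid, 2)
# coarse == [[0, 2, 4, 6], [20, 22, 24, 26], [40, 42, 44, 46]]
```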

Supersampling, or refinement, is the inverse of subsampling: more data points are created from an existing dataset. It is useful in situations when we try to create or manipulate information on a dataset at a level of detail, or scale, that is below the one captured by the sampling frequency. Uniform supersampling introduces k points into every cell of the original dataset. An efficient supersampling implementation usually inserts extra points only in those regions where we need to add extra information.
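For a 1D dataset, uniform supersampling can be sketched as follows. The choice of linear basis functions for computing the new values is an assumption of this example, as is the function name.

```python
def uniform_supersample(values, k):
    """Insert k linearly interpolated samples into every 1D cell."""
    refined = []
    for a, b in zip(values, values[1:]):
        refined.append(a)
        for j in range(1, k + 1):
            t = j / (k + 1)  # equally spaced positions inside the cell
            refined.append((1 - t) * a + t * b)
    refined.append(values[-1])
    return refined

# Three samples (two cells) become seven samples for k = 2.
fine = uniform_supersample([0.0, 3.0, 9.0], 2)
```

The inserted samples lie on the original piecewise linear reconstruction, so they add resolution for later processing but no new information about the signal.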

0.9.2 Scattered Point Interpolation

There are situations when we would like to avoid constructing and storing a grid of cells to represent the data domain. A 3D scanner delivers a scattered 3D point set, also called a point cloud: points and their corresponding data values (pi, fi). For a scanner, the data values fi are the surface normals and/or colors measured by the device.

How do we reconstruct a continuous surface if we are given such a set of points and normals? One way is to construct a grid from the scattered points (triangulation): an unstructured grid with 2D cells, e.g., triangles, which have the points pi as vertices and approximate the surface as well as possible.

A second way is gridless interpolation. Storing the cell information can double the amount of memory required in the worst case. To reconstruct a continuous function from a scattered point set without cells, we need a set of gridless basis functions. There are several ways to construct such functions; a frequently used choice for gridless basis functions is radial basis functions, or RBFs. These functions depend only on the distance between the current point and the origin,

$$r = |x| = \sqrt{\sum_{i=1}^{d} x_i^2}.$$

RBFs smoothly drop from 1 at their origin (r = 0) to a vanishing value for large values of the distance r. To limit the effect of a basis function to its immediate neighborhood, we specify a radius of influence R, or support radius, beyond which Φ is equal to zero. In this setup, a common RBF is the Gaussian function

$$\Phi(x) = \begin{cases} e^{-k r^2}, & r < R, \\ 0, & r \ge R, \end{cases} \qquad \text{where } r = |x|.$$

The parameter k ≥ 0 controls the decay speed, or the shape, of the radial basis functions. Setting k = 0 yields constant, cylinder-shaped radial functions, which are equivalent to the constant basis functions for grid-based datasets. Another popular choice are inverse distance functions, defined as

$$\Phi(x) = \begin{cases} \dfrac{1}{1 + r^2}, & r < R, \\ 0, & r \ge R, \end{cases} \qquad \text{where } r = |x|.$$

The radius values Ri control the influence of the sample data value of a point pi. Higher values of Ri yield smoother reconstructions at higher computational cost; lower values of Ri yield less smooth reconstructions but higher performance. In practice, setting Ri to the average inter-point distance in the neighborhood of point pi gives a good balance between smoothness and efficiency.

Given a point p, we need to sum only those basis functions φk that are nonzero at p. In the case of radial basis functions, we must find the k nearest sample points p1, . . . , pk to p such that |p − pi| < Ri. One way to accomplish this is to store all sample points pi in a spatial search structure such as a kd-tree. Spatial search structures provide efficient retrieval of the k nearest neighbors at any given location. A good, scalable implementation of such a search structure is provided by the Approximate Nearest Neighbor (ANN) library.

Scattered point datasets are sometimes called unstructured point datasets. However, if the function of a dataset is to provide a piecewise continuous reconstruction of its data samples, we also need to specify a choice for the basis functions Φi to have a complete dataset (pi, fi, Φi). To perform the reconstruction effectively, search methods are needed that return the sample points pi located in the neighborhood of a given point p.
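A gridless reconstruction with the two radial basis functions given earlier can be sketched as follows. Two things here are assumptions of the sketch rather than part of the summary: the weights are normalized so that they sum to one, and the neighbor search is a brute-force loop standing in for a kd-tree query. All function names are hypothetical.

```python
import math

def gaussian_rbf(r, radius, k=1.0):
    """Truncated Gaussian RBF: exp(-k r^2) inside the support radius, else 0."""
    return math.exp(-k * r * r) if r < radius else 0.0

def inverse_distance_rbf(r, radius):
    """Truncated inverse distance RBF: 1 / (1 + r^2) inside the support radius."""
    return 1.0 / (1.0 + r * r) if r < radius else 0.0

def reconstruct(p, points, values, radii, rbf=gaussian_rbf):
    """Normalized RBF reconstruction of the sampled values at point p.

    Loops over all samples; a spatial search structure would instead return
    only the samples whose support contains p. Returns None when p lies
    outside every support radius.
    """
    num = den = 0.0
    for q, f, radius in zip(points, values, radii):
        w = rbf(math.dist(p, q), radius)
        num += w * f
        den += w
    return num / den if den > 0.0 else None

points = [(-1.0, 0.0), (1.0, 0.0)]
values = [0.0, 2.0]
radii = [1.5, 1.5]
middle = reconstruct((0.0, 0.0), points, values, radii)    # equal weights: 1.0
outside = reconstruct((10.0, 0.0), points, values, radii)  # no support: None
```

Note that this normalized sum approximates the samples rather than interpolating them exactly: at a sample point, neighboring samples within range still contribute to the result.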

What have you learned in this chapter? This chapter lays out a discussion on discrete data representation, continuous data sampling and reconstruction. Fundamental differences between continuous (sampled) and discrete data are outlined. It introduces basis functions, discrete meshes and cells as means of constructing piecewise continuous approximations from sampled data. I learned about various types of datasets commonly used in visualization practice: their advantages, limitations and constraints. This chapter gives an understanding of the various trade-offs involved in the choice of a dataset for a given visualization application, while focusing on the efficiency of implementing the most commonly used datasets, presented with cell types in d ∈ [0, 3] dimensions.

What surprised you the most? I was surprised to find out that there are a few different representations of colors, and mappings between them, such as between the RGB and HSV spaces.

I was surprised to find out how gridless interpolation works, and that it exists. Also, that the reconstruction of scattered/unstructured point datasets requires search methods to locate the nearest sample points in the neighborhood of a given point.

I was surprised that datasets with attributes such as text, images, or relations form the target of information visualization applications, since they are purely discrete, and often not defined on a spatial domain.

What applications not mentioned in the book could you imagine for the techniques explained in this chapter? I can imagine a dataset that stores high-dimensional attributes in order to allow just enough continuity to perform various types of resampling between target and source grids of a certain type. Selecting a set of useful grids and proper resampling might improve the original visualization model so that it focuses more on the nature of the signal and depends less on the structure or representation of its sampled data.

1. EXERCISE 1

Consider the following datasets:

• The evolution in time of the prices of N different stock-exchange shares, recorded at one-second intervals over the period of one hour.

• The paths covered by all cars driving through a given city, recorded at one-minute intervals over the period of one hour. For each record, we store the car ID, the car's position, and the car's speed.

• The amount of rainfall and the air temperature, recorded at a given time instant at N given weather stations over some geographical area.

Describe the kind of grid, grid cells, and data attributes that you would use to store such a dataset. Argue your proposal by considering the kind of data to store, and the locations at which data is recorded (sampled).

• grid: uniform 1D grid with one-second intervals; grid cells: line segments spanning 1 second; data attributes: the price of each of the N shares (3600 samples per hour times N shares = 3600N values to store)

• grid: a data structure with spatial search that uses the average inter-point distance between points; cells: gridless radial basis functions with compact support; data attributes: car ID, car's position, car's speed, basis functions.

• grid: rectilinear structured grid with specified sampling locations; cells: quads; data attributes: amount of rainfall, temperature, location

2. EXERCISE 2

Sampling and reconstruction are closely related operations which reduce a function y = f(x) to a finite set of sample points (xi, yi) and, respectively, reconstruct an approximation ỹ = f̃(x) of f(x) from the sample points. Consider an application where you have to perform the above reconstruction f̃(x), but you are only allowed to use a fixed finite number N of sample points xi. How would you place these sample points over the domain of definition of x so that the reconstruction error ‖f − f̃‖ is equally well minimized over the entire range of x?

Hints: first, consider the kinds of basis functions you want to use (e.g., constant or linear). Next, consider how you can minimize the reconstruction error by shifting the points xi around the x axis.

• In the case of constant basis functions, we can use a uniform sampling density, with N points placed at equal distances from each other.

• In the case of linear basis functions, we can use a non-uniform sampling density, in order to assign more sample points to those areas of the domain where the function's higher-order derivatives change fast.
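One way to make the non-uniform placement concrete is to distribute the points so that every interval receives an equal share of the integral of sqrt(|f''|). This equidistribution heuristic, the test function f(x) = x^4, and all names below are assumptions of this sketch; the exercise itself does not prescribe a method.

```python
import math

def equidistribute(f2, a, b, n, fine=2000):
    """Place n points on [a, b], equalizing the integral of sqrt(|f''|) per interval."""
    xs = [a + (b - a) * i / fine for i in range(fine + 1)]
    # A small floor keeps the density positive where f'' vanishes.
    dens = [math.sqrt(abs(f2(x))) + 1e-3 for x in xs]
    cum = [0.0]
    for i in range(fine):
        cum.append(cum[-1] + 0.5 * (dens[i] + dens[i + 1]) * (xs[i + 1] - xs[i]))
    pts, j = [a], 0
    for m in range(1, n - 1):
        target = cum[-1] * m / (n - 1)
        while cum[j + 1] < target:
            j += 1
        t = (target - cum[j]) / (cum[j + 1] - cum[j])
        pts.append(xs[j] + t * (xs[j + 1] - xs[j]))
    pts.append(b)
    return pts

def max_linear_error(f, pts, probes=50):
    """Largest deviation between f and its piecewise linear reconstruction."""
    err = 0.0
    for x0, x1 in zip(pts, pts[1:]):
        for i in range(probes + 1):
            t = i / probes
            approx = (1 - t) * f(x0) + t * f(x1)
            err = max(err, abs(f(x0 + t * (x1 - x0)) - approx))
    return err

f = lambda x: x ** 4
f2 = lambda x: 12 * x ** 2          # second derivative of f
uniform = [i / 10 for i in range(11)]
adaptive = equidistribute(f2, 0.0, 1.0, 11)
uniform_err = max_linear_error(f, uniform)
adaptive_err = max_linear_error(f, adaptive)
```

With the same 11 points, the adaptive placement crowds samples toward x = 1, where the curvature is largest, and its maximum error is smaller than that of the uniform placement.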

3. EXERCISE 3

In Figure 3.10 in Chapter 3 (also displayed below), it is shown that we can use structured grids to cover a 2D disk shape. Now, consider an arbitrary convex 2D shape of genus 0 (that is, without holes). The 2D shape is specified by means of its contour, which is given as a closed 2D polyline of N points.

• Can we always construct a structured grid so that all points of this polyline will be also points on the grid's boundary? If not, sketch a simple counter-example.

• Can we always construct a structured grid with the conditions listed in the point above and the additional condition that no grid-boundary point exists which is not a polyline point? If not, sketch a simple counter-example.

Hints: Think about the number of points on the boundary of a structured grid.

• Yes, we can always construct a structured grid of N points and N − 2 triangular cells. Since all internal angles of the shape are less than 180 degrees, it is always possible to take one vertex and connect it to the N − 3 vertices that are not its neighbors, which forms a triangular structured grid of N − 2 triangles.

• Yes, for a convex 2D polyline it is always possible to use triangles as described above, so that all polyline points are also grid-boundary points.

4. EXERCISE 4

As shown in Figure 3.11 in Chapter 3 (also shown below), not all 2D shapes can be covered by structured grids. Consider now a 3D (curved) surface of a half sphere. Can we cover this surface with a structured grid? Argue your answer.

Yes, we can cover such a half sphere with a structured grid consisting of tetrahedral cells. Such a shape consists of only one component, and I assume that the genus of the domain here equals 0.

5. EXERCISE 5

Consider the 2D cells in the figure below. For each cell, scalar data values vi are indicated at its sample points (vertices). Additionally, a separate point p inside the cell is indicated. If bilinear interpolation is used, compute the interpolated value v(p) of the vertex data values vi at the point p. Detail your answer by explaining how you computed the interpolated value.

• For rectangular quad:

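$$T^{-1}_{rect}(p) = (r, s) = \left( \frac{(p - p_1) \cdot (p_2 - p_1)}{\|p_2 - p_1\|^2}, \; \frac{(p - p_1) \cdot (p_4 - p_1)}{\|p_4 - p_1\|^2} \right),$$

where:

$$\begin{aligned}
p - p_1 &= (4 - x_1, 3 - y_1) = (3, 1), \\
p_2 - p_1 &= (x_2 - x_1, y_2 - y_1) = (4, 0), \\
p_4 - p_1 &= (x_4 - x_1, y_4 - y_1) = (0, 3), \\
\|p_2 - p_1\|^2 &= (x_2 - x_1)^2 = 4^2 = 16, \\
\|p_4 - p_1\|^2 &= (y_4 - y_1)^2 = 3^2 = 9.
\end{aligned}$$

$$T^{-1}_{rect}(p) = (r, s) = \left( \frac{(3, 1) \cdot (4, 0)}{16}, \; \frac{(3, 1) \cdot (0, 3)}{9} \right) = \left( \frac{12}{16}, \frac{3}{9} \right) = \left( \frac{3}{4}, \frac{1}{3} \right).$$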

Calculating the 4 basis functions as follows:

$$\begin{aligned}
\Phi^1_1(r, s) &= (1 - r)(1 - s) = (1 - 3/4)(1 - 1/3) = 1/6, \\
\Phi^1_2(r, s) &= r(1 - s) = (3/4)(1 - 1/3) = 3/4 \times 2/3 = 1/2, \\
\Phi^1_3(r, s) &= rs = 3/4 \times 1/3 = 3/12 = 1/4, \\
\Phi^1_4(r, s) &= (1 - r)s = (1 - 3/4)(1/3) = 1/4 \times 1/3 = 1/12.
\end{aligned}$$

Finally, we calculate the value

$$v(p) = \sum_{i=1}^{4} v_i \Phi^1_i = 3 \cdot \frac{1}{6} + 1 \cdot \frac{1}{2} + 4 \cdot \frac{1}{4} + 0 \cdot \frac{1}{12} = \frac{1}{2} + \frac{1}{2} + 1 + 0 = 2.$$

Answer: v(p) = 2.
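The computation for the rectangular quad can be checked with exact rational arithmetic. The vertex coordinates below are inferred from the difference vectors listed above (p = (4, 3), p1 = (1, 2), p2 = (5, 2), p4 = (1, 5)); the helper names are hypothetical.

```python
from fractions import Fraction

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def rect_reference_coords(p, p1, p2, p4):
    """(r, s) of p in a rectangular cell, by projection onto the two edges at p1."""
    d, e1, e2 = sub(p, p1), sub(p2, p1), sub(p4, p1)
    return Fraction(dot(d, e1), dot(e1, e1)), Fraction(dot(d, e2), dot(e2, e2))

def bilinear_weights(r, s):
    """Bilinear basis functions Phi_1..Phi_4 evaluated at (r, s)."""
    return [(1 - r) * (1 - s), r * (1 - s), r * s, (1 - r) * s]

r, s = rect_reference_coords((4, 3), (1, 2), (5, 2), (1, 5))
weights = bilinear_weights(r, s)
v_p = sum(w * v for w, v in zip(weights, [3, 1, 4, 0]))
```

The weights come out as 1/6, 1/2, 1/4 and 1/12, they sum to one, and `v_p` equals 2.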

• For triangular cell:

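$$T^{-1}_{tri}(p) = (r, s) = \left( \frac{\|(p - p_1) \times (p_3 - p_1)\|}{\|(p_2 - p_1) \times (p_3 - p_1)\|}, \; \frac{\|(p - p_1) \times (p_2 - p_1)\|}{\|(p_3 - p_1) \times (p_2 - p_1)\|} \right),$$

where:

$$\begin{aligned}
p - p_1 &= (3 - x_1, 3 - y_1) = (2, 1), \\
p_2 - p_1 &= (x_2 - x_1, y_2 - y_1) = (4, 1), \\
p_3 - p_1 &= (x_3 - x_1, y_3 - y_1) = (0, 3).
\end{aligned}$$

In 2D the cross product of (a, b) and (c, d) is the scalar ad − bc, so

$$\begin{aligned}
\|(2, 1) \times (0, 3)\| &= |2 \cdot 3 - 1 \cdot 0| = 6, & \|(4, 1) \times (0, 3)\| &= |4 \cdot 3 - 1 \cdot 0| = 12, \\
\|(2, 1) \times (4, 1)\| &= |2 \cdot 1 - 1 \cdot 4| = 2, & \|(0, 3) \times (4, 1)\| &= |0 \cdot 1 - 3 \cdot 4| = 12,
\end{aligned}$$

$$T^{-1}_{tri}(p) = (r, s) = \left( \frac{6}{12}, \frac{2}{12} \right) = \left( \frac{1}{2}, \frac{1}{6} \right).$$

Calculating the 3 basis functions as follows:

$$\begin{aligned}
\Phi^1_1(r, s) &= 1 - r - s = 1 - 1/2 - 1/6 = 1/3, \\
\Phi^1_2(r, s) &= r = 1/2, \\
\Phi^1_3(r, s) &= s = 1/6.
\end{aligned}$$

Finally, we calculate the value

$$v(p) = \sum_{i=1}^{3} v_i \Phi^1_i = 3 \cdot \frac{1}{3} + 1 \cdot \frac{1}{2} + 4 \cdot \frac{1}{6} = 1 + \frac{1}{2} + \frac{2}{3} = \frac{13}{6}.$$

Answer: v(p) = 13/6 ≈ 2.17.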
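The reference coordinates of the triangular cell can be recomputed directly from the edge vectors listed above, using exact fractions. The vertex coordinates are inferred from those vectors (p = (3, 3), p1 = (1, 2), p2 = (5, 3), p3 = (1, 5)), and the helper names are hypothetical. The sketch also checks that (r, s) maps back to p, which is a useful sanity test for any such computation.

```python
from fractions import Fraction

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1])

def cross(a, b):
    """Scalar 2D cross product a_x * b_y - a_y * b_x."""
    return a[0] * b[1] - a[1] * b[0]

def tri_reference_coords(p, p1, p2, p3):
    """(r, s) of a point p inside the triangle, as ratios of cross products."""
    d, e2, e3 = sub(p, p1), sub(p2, p1), sub(p3, p1)
    # Absolute values drop the sign, so this holds only for p inside the cell.
    r = Fraction(abs(cross(d, e3)), abs(cross(e2, e3)))
    s = Fraction(abs(cross(d, e2)), abs(cross(e3, e2)))
    return r, s

p, p1, p2, p3 = (3, 3), (1, 2), (5, 3), (1, 5)
r, s = tri_reference_coords(p, p1, p2, p3)
weights = [1 - r - s, r, s]

# Sanity check: the weights must reproduce p from the vertices.
back = tuple(sum(w * q[c] for w, q in zip(weights, (p1, p2, p3))) for c in range(2))

v_p = sum(w * v for w, v in zip(weights, [3, 1, 4]))
```

All three weights lie between 0 and 1, as they must for a point inside the cell, so the interpolated value stays between the smallest and largest vertex values.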

6. EXERCISE 6

Consider the 2D cells in the figures below. For each cell, vector data values vi are indicated at its sample points (vertices). Additionally, a separate point p inside the cell is indicated. If bilinear interpolation is used, compute the interpolated value v(p) of the vertex data values vi at the point p. Detail your answer by explaining how you computed the interpolated value.

• For rectangular quad:

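$$T^{-1}_{rect}(p) = (r, s) = \left( \frac{(p - p_1) \cdot (p_2 - p_1)}{\|p_2 - p_1\|^2}, \; \frac{(p - p_1) \cdot (p_4 - p_1)}{\|p_4 - p_1\|^2} \right),$$

where:

$$\begin{aligned}
p - p_1 &= (4 - x_1, 3 - y_1) = (3, 1), \\
p_2 - p_1 &= (x_2 - x_1, y_2 - y_1) = (4, 0), \\
p_4 - p_1 &= (x_4 - x_1, y_4 - y_1) = (0, 3), \\
\|p_2 - p_1\|^2 &= (x_2 - x_1)^2 = 4^2 = 16, \\
\|p_4 - p_1\|^2 &= (y_4 - y_1)^2 = 3^2 = 9.
\end{aligned}$$

$$T^{-1}_{rect}(p) = (r, s) = \left( \frac{(3, 1) \cdot (4, 0)}{16}, \; \frac{(3, 1) \cdot (0, 3)}{9} \right) = \left( \frac{12}{16}, \frac{3}{9} \right) = \left( \frac{3}{4}, \frac{1}{3} \right).$$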


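Calculating the 4 basis functions as follows:

$$\begin{aligned}
\Phi^1_1(r, s) &= (1 - r)(1 - s) = (1 - 3/4)(1 - 1/3) = 1/6, \\
\Phi^1_2(r, s) &= r(1 - s) = (3/4)(1 - 1/3) = 3/4 \times 2/3 = 1/2, \\
\Phi^1_3(r, s) &= rs = 3/4 \times 1/3 = 3/12 = 1/4, \\
\Phi^1_4(r, s) &= (1 - r)s = (1 - 3/4)(1/3) = 1/4 \times 1/3 = 1/12.
\end{aligned}$$

Finally, we calculate the value

$$\begin{aligned}
v(p) = \sum_{i=1}^{4} v_i \Phi^1_i &= (1, 0) \cdot \frac{1}{6} + (0, 1) \cdot \frac{1}{2} + (1, 1) \cdot \frac{1}{4} + (2, 1) \cdot \frac{1}{12} \\
&= \left( \frac{1}{6}, 0 \right) + \left( 0, \frac{1}{2} \right) + \left( \frac{1}{4}, \frac{1}{4} \right) + \left( \frac{1}{6}, \frac{1}{12} \right) = \left( \frac{7}{12}, \frac{5}{6} \right).
\end{aligned}$$

Answer: v(p) ≈ (0.58, 0.83).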

• For triangular cell:

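$$T^{-1}_{tri}(p) = (r, s) = \left( \frac{\|(p - p_1) \times (p_3 - p_1)\|}{\|(p_2 - p_1) \times (p_3 - p_1)\|}, \; \frac{\|(p - p_1) \times (p_2 - p_1)\|}{\|(p_3 - p_1) \times (p_2 - p_1)\|} \right),$$

where:

$$\begin{aligned}
p - p_1 &= (3 - x_1, 3 - y_1) = (2, 1), \\
p_2 - p_1 &= (x_2 - x_1, y_2 - y_1) = (4, 1), \\
p_3 - p_1 &= (x_3 - x_1, y_3 - y_1) = (0, 3).
\end{aligned}$$

In 2D the cross product of (a, b) and (c, d) is the scalar ad − bc, so

$$T^{-1}_{tri}(p) = (r, s) = \left( \frac{\|(2, 1) \times (0, 3)\|}{\|(4, 1) \times (0, 3)\|}, \; \frac{\|(2, 1) \times (4, 1)\|}{\|(0, 3) \times (4, 1)\|} \right) = \left( \frac{6}{12}, \frac{2}{12} \right) = \left( \frac{1}{2}, \frac{1}{6} \right).$$

Calculating the 3 basis functions as follows:

$$\begin{aligned}
\Phi^1_1(r, s) &= 1 - r - s = 1 - 1/2 - 1/6 = 1/3, \\
\Phi^1_2(r, s) &= r = 1/2, \\
\Phi^1_3(r, s) &= s = 1/6.
\end{aligned}$$

Finally, we calculate the value

$$\begin{aligned}
v(p) = \sum_{i=1}^{3} v_i \Phi^1_i &= (0, -1) \cdot \frac{1}{3} + (1, 0) \cdot \frac{1}{2} + (1, 1) \cdot \frac{1}{6} \\
&= \left( 0, -\frac{1}{3} \right) + \left( \frac{1}{2}, 0 \right) + \left( \frac{1}{6}, \frac{1}{6} \right) = \left( \frac{2}{3}, -\frac{1}{6} \right).
\end{aligned}$$

Answer: v(p) = (2/3, −1/6) ≈ (0.67, −0.17).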
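Vector attributes are interpolated one component at a time, with the same weights for every component. The sketch below takes the quad weights computed in the text and recomputes the triangle weights from the listed edge vectors; the helper names are hypothetical.

```python
from fractions import Fraction

def cross(a, b):
    """Scalar 2D cross product a_x * b_y - a_y * b_x."""
    return a[0] * b[1] - a[1] * b[0]

def combine(weights, vectors):
    """Weighted sum of 2D vectors, evaluated per component."""
    return tuple(sum(w * v[c] for w, v in zip(weights, vectors)) for c in range(2))

# Quad: basis-function values at (r, s) = (3/4, 1/3), as computed in the text.
quad_weights = [Fraction(1, 6), Fraction(1, 2), Fraction(1, 4), Fraction(1, 12)]
quad_value = combine(quad_weights, [(1, 0), (0, 1), (1, 1), (2, 1)])

# Triangle: reference coordinates from the edge vectors listed in the text.
d, e2, e3 = (2, 1), (4, 1), (0, 3)
r = Fraction(abs(cross(d, e3)), abs(cross(e2, e3)))
s = Fraction(abs(cross(d, e2)), abs(cross(e3, e2)))
tri_value = combine([1 - r - s, r, s], [(0, -1), (1, 0), (1, 1)])
```

Because the weights are non-negative and sum to one, each component of the result lies between the smallest and largest vertex value of that component.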

7. EXERCISE 7

Color selection, by end users, is typically done by various widgets which represent the space of available colors, such as the color wheel, color hexagon, or three separate color sliders for the R, G, and B (or alternatively H, S, and V) color components. Assume, now, that we want to select only colors present in a given subset of the entire color space. Concretely, we have a large set of color photographs, and we next want to select only colors predominantly present in these photographs, rather than any possible color. Sketch and argue for a color-selection widget that would optimally help users to select only these specific colors. Hints: Think how to modify any of the existing color-selection widgets to 'focus' on a specific color range where many samples exist.

We can specify the subset of colors we are interested in by giving the R, G, B values of each color in our sample. We then have a scattered field of dots inside the color cube. After doing so, we can perform supersampling by adding even more dots in the neighbourhood of each specified color. Finally, we can interpolate a 3D surface along each axis and project the result onto the color cube facets, or onto the RGB hexagon.

Also, we can modify the HSV color wheel by cutting out the segments that correspond to the lowest density of our color samples, so that the remaining colors represent the majority of our color samples.
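The color-wheel variant can be sketched by building a hue histogram of the sample colors and keeping only the well-populated wheel segments. The number of segments, the threshold, and the function name are assumptions of this sketch; the sample colors are made up for illustration.

```python
import colorsys

def dominant_hue_bins(colors, bins=12, min_share=0.25):
    """Indices of the hue-wheel segments holding at least min_share of the samples."""
    counts = [0] * bins
    for r, g, b in colors:
        h, _, _ = colorsys.rgb_to_hsv(r, g, b)
        counts[min(int(h * bins), bins - 1)] += 1
    total = len(colors)
    return [i for i, c in enumerate(counts) if c / total >= min_share]

# Hypothetical samples: three reddish colors, two bluish ones, one green outlier.
sample_hues = [0.02, 0.03, 0.04, 0.62, 0.63, 0.30]
colors = [colorsys.hsv_to_rgb(h, 0.8, 0.9) for h in sample_hues]
kept_segments = dominant_hue_bins(colors)
# kept_segments == [0, 7]: only the red and blue segments stay on the wheel.
```

A widget could then draw only the kept segments, or shrink the remaining ones, so that the user's selection is steered toward colors that actually occur in the photographs.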

8. EXERCISE 8

Consider a grid where we have color data values recorded at its cell vertices. We would like to use linear interpolation to compute colors at all points inside the grid cells. We can do this by interpolating colors represented as RGB triplets or, alternatively, colors represented as HSV triplets. Discuss the advantages and disadvantages of both schemes. Can you imagine a situation where the RGB interpolation would be arguably preferable to HSV interpolation? Can you imagine a situation when the converse (HSV interpolation is preferable to RGB interpolation) is true? Describe such situations or alternatively argue for the fact that they do not exist.

HSV interpolation gives better results, since hue is separated from luminance (value) and saturation. In the RGB color scheme we need to interpolate along all 3 components.
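The difference between the two schemes is easiest to see on a concrete pair of colors. The sketch below interpolates halfway between pure red and pure cyan in both spaces, using the standard `colorsys` module; the colors and function names are chosen only for illustration.

```python
import colorsys

def lerp(a, b, t):
    """Componentwise linear interpolation between two triplets."""
    return tuple((1 - t) * x + t * y for x, y in zip(a, b))

def lerp_rgb(c0, c1, t):
    return lerp(c0, c1, t)

def lerp_hsv(c0, c1, t):
    """Interpolate in HSV space and convert the result back to RGB."""
    hsv0 = colorsys.rgb_to_hsv(*c0)
    hsv1 = colorsys.rgb_to_hsv(*c1)
    return colorsys.hsv_to_rgb(*lerp(hsv0, hsv1, t))

red, cyan = (1.0, 0.0, 0.0), (0.0, 1.0, 1.0)
mid_rgb = lerp_rgb(red, cyan, 0.5)   # (0.5, 0.5, 0.5): a neutral gray
mid_hsv = lerp_hsv(red, cyan, 0.5)   # (0.5, 1.0, 0.0): a saturated yellow-green
```

RGB interpolation loses all saturation at the midpoint, while HSV interpolation keeps saturation and value but passes through a hue present in neither endpoint. Which behavior is preferable depends on the application, and naive hue interpolation additionally has to deal with the wrap-around of the hue angle.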

9. EXERCISE 9

Consider a grid cell, such as a 1D line, 2D triangle or quad, or 3D parallelepiped or cube, and some scalar values vi recorded at the cell vertices. Consider that we are using linear interpolation to reconstruct the sampled scalar signal v(x) at any point x inside the cell. Does a cell shape exist, and a point x in that cell, so that v(x) is larger than the maximum of vi over all cell vertices? Does a cell shape exist, and a point x in that cell, so that v(x) is smaller than the minimum of vi over all cell vertices? Argue your answers.

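No, in both cases. Inside a cell, the linear (or bilinear and trilinear) basis functions are non-negative and sum to one, so v(x) is a weighted average of the vertex values vi and always lies between their minimum and maximum.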

10. EXERCISE 10

Consider a grid cell like in Exercise 9, and some color values vi recorded at the cell vertices. Consider that we are using linear interpolation to compute a color v(x) at any point x inside the cell. Does a point x exist so that v(x) is brighter than any of the colors vi? Does a point x exist so that v(x) is darker than any of the colors vi? Do the answers to the above two sub-questions depend on the choice of the system, or space, used to represent colors (RGB or HSV)? Explain your answer.

No.
