
REAL-TIME VIRTUAL DRESSING ROOM

Tudor-Andrei Trisca, IVA Masters, year I

1. Abstract

This document presents a virtual dressing room application using the Microsoft Kinect sensor. The proposed approach extracts the user from the video stream as a 3D object, onto which clothing models are applied in real time, showing the user whether the selected outfit fits.

2. Introduction

Trying on clothes in clothing stores is usually a time-consuming activity, and it may not even be possible, as in the case of online shopping. My motivation is to increase the time efficiency and accessibility of trying on clothes by creating a virtual dressing room environment.

The problem is essentially the alignment of the user and the clothing models with accurate position, scale, rotation and ordering. First of all, detecting the user and the body parts is one of the main steps.

For many applications an accurate 3D model of the human body is needed. The standard approach involves scanning the body with a commercial system such as a laser range scanner or a special-purpose structured-light system. Several such body scanners exist, costing anywhere from $35,000 to $500,000; their size and cost limit the applications of 3D body models. Many computer vision solutions suffer from the same problems and require calibrated multi-camera capture systems. The solution proposed by A. Weiss et al. [1] produces accurate body scans using consumer hardware that can work in a person's living room (Fig 1).

Fig. 1 (1a) Microsoft Kinect [2]. (1b) 3D point cloud of a human in a cluttered home environment.

(1c) Recovered shape transformed into a new pose.

Since its launch in 2010, the Microsoft Kinect has become the state-of-the-art depth image sensor on the market. There is currently quite intensive work on application programming interfaces for developers, including a skeletal body tracking method.

The second step of the application is positioning the 3D clothing models on the user's extracted 3D object.

3. Application architecture and user experience

Fig. 2 Proposed system architecture.

Fig. 2 presents the proposed application architecture for the virtual dressing room.

The Kinect SDK provides the skeleton of the user, enabling natural interaction with the application. With this technology, controlling the application through normal gestures is straightforward and intuitive.

The XNA Game Studio programming environment allows us to create the 3D environment and manipulate the 3D objects. Since the XNA framework includes an extensive set of class libraries specific to game development, integrating a fun factor into the application comes naturally.

Emgu CV is an open-source C# wrapper for the OpenCV image processing library, which is written in C. This module is an important part of the proposed solution, providing optimized, ready-made image processing methods needed at several steps of development.

The “Brain” is a control module that keeps everything connected and acts as the business layer of the application.

By instrumenting the body and turning it into the input device, we create a connection between human and machine. The most important feature we are using is motion tracking: it allows the system to record the user's actions, interpret them, and respond accordingly.

For 3D scene manipulation, the user just has to use normal gestures. Since the application simulates a life-like scene, its usage is as intuitive and natural as possible, and the Kinect sensor keeps the interaction with the interface instinctive.
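As an illustration of how skeleton joints can drive such gestures, the sketch below checks a simple "hand raised above head" condition. Python is used for all sketches in this document purely for illustration (the described stack is C#/XNA), and the joint names and data layout here are hypothetical stand-ins for the Kinect SDK skeleton stream.

```python
# Illustrative sketch: a simple "raise hand" gesture check over
# skeleton joint positions (hypothetical data layout).

def is_hand_raised(joints, margin=0.10):
    """joints maps joint names to (x, y, z) in meters, y pointing up."""
    head_y = joints["head"][1]
    hand_y = joints["hand_right"][1]
    return hand_y > head_y + margin

# Example frame: right hand 25 cm above the head.
frame = {"head": (0.0, 1.60, 2.0), "hand_right": (0.3, 1.85, 2.0)}
print(is_hand_raised(frame))  # True
```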

4. User extraction

The Microsoft Kinect sensor consists of an IR camera, an RGB camera, and an IR projector that casts a fixed speckle pattern. Conversion of the pattern, as seen by the IR camera, to a depth map happens on the device. It has a USB interface, and images can be captured using a library developed by the OpenKinect project [3] or with the new SDK released this spring by Microsoft. The library [3] provides access to both the depth map and the raw IR video, as well as to the RGB video and data from a built-in accelerometer. The video streams are VGA resolution, and both the RGB and IR (either raw or the depth map) can be captured synchronized at 30 fps.
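A minimal capture sketch using the OpenKinect project's Python bindings (package `freenect`); the exact function names and return types depend on the installed wrapper version, so treat this as an assumption-laden example rather than the project's code.

```python
# Grab one synchronized depth and RGB frame via the OpenKinect
# Python wrapper (assumed installed as `freenect`).
import freenect

depth, _ = freenect.sync_get_depth()  # 640x480 array of raw 11-bit depth
rgb, _ = freenect.sync_get_video()    # 640x480x3 RGB array
print(depth.shape, depth.max(), rgb.shape)
```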

Intrinsic calibration of the RGB camera is carried out with a checkerboard and standard calibration techniques [4]. To calibrate the IR camera the projector is covered so that the calibration grid is not corrupted by the projected pattern; otherwise calibration is identical to that of the RGB camera.

Stereo calibration between the depth and RGB cameras can be achieved with standard stereo calibration methods [4].

The Kinect reports depth discretized into 2047 levels, with a final value reserved to mark pixels for which no depth can be calculated. These discrete levels are not uniformly distributed, but are much denser close to the device. The depth is calibrated by lining up a planar target parallel to the Kinect such that the depth values are as uniform as possible across its surface; the distance is then measured and the process repeated with depths ranging from 0.5m to 3m in 0.1m increments [4].
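The paper's depth calibration is empirical; as a rough stand-in, a widely circulated empirical fit from similar Kinect calibrations converts the raw disparity levels to meters. The formula below is not taken from the paper and is an assumption for illustration.

```python
import numpy as np

def raw_depth_to_meters(raw):
    """Approximate metric depth from raw Kinect disparity values.

    Uses a commonly cited empirical fit (not the calibration from the
    paper); the reserved value 2047 marks pixels with no depth estimate.
    """
    raw = np.asarray(raw, dtype=np.float64)
    meters = 0.1236 * np.tan(raw / 2842.5 + 1.1863)
    meters[raw >= 2047] = np.nan  # no-measurement marker
    return meters

print(raw_depth_to_meters(np.array([600, 800, 2047])))
```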

An estimate of the ground plane is obtained by robustly fitting a plane to the bottom of the point cloud, using the Kinect's on-board accelerometer to initialize the fit so that we locate the floor and not one of the walls.
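A sketch of such a robust fit: RANSAC over the lowest portion of the cloud, with the accelerometer's gravity direction used to reject wall-like candidate planes. The thresholds and the "lowest 20%" heuristic are assumptions, not the paper's values.

```python
import numpy as np

def fit_ground_plane(points, gravity, n_iter=200, inlier_tol=0.02, seed=0):
    """Robustly fit the floor plane to the lowest part of a point cloud.

    `gravity` is a unit vector from the accelerometer pointing toward
    the floor; candidate planes roughly perpendicular to it are kept,
    wall-like ones rejected. A sketch, not the paper's exact estimator.
    """
    rng = np.random.default_rng(seed)
    # Keep the 20% of points with the largest projection onto gravity,
    # i.e. the lowest points in the scene.
    heights = points @ gravity
    low = points[heights > np.quantile(heights, 0.8)]
    best_inliers, best_plane = 0, None
    for _ in range(n_iter):
        sample = low[rng.choice(len(low), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:
            continue
        normal /= norm
        if abs(normal @ gravity) < 0.9:   # reject wall-like planes
            continue
        d = -normal @ sample[0]
        inliers = np.sum(np.abs(low @ normal + d) < inlier_tol)
        if inliers > best_inliers:
            best_inliers, best_plane = inliers, (normal, d)
    return best_plane
```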

The segmentation of the body from the surrounding environment is made using background subtraction on the depth map.
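A minimal background-subtraction sketch on the depth map, assuming a background depth frame captured without the user in the scene.

```python
import numpy as np

def segment_user(depth, background, tol=0.05):
    """Foreground mask via background subtraction on the depth map.

    `background` is a depth frame (meters) captured without the user;
    pixels that moved closer by more than `tol` meters are labeled as
    the user. Invalid pixels (NaN) count as background. A simplified
    sketch of the step described above.
    """
    valid = ~np.isnan(depth) & ~np.isnan(background)
    mask = np.zeros(depth.shape, dtype=bool)
    mask[valid] = depth[valid] < background[valid] - tol
    return mask
```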

In order to estimate a body shape that is invariant to pose, a model is needed so that it accurately represents non-rigid shape deformations while factoring deformations caused by changes in intrinsic shape (height, weight, body type, etc.) from deformations caused by changes in pose. Non-rigid deformations due to pose variation are modeled using linear predictors learned from examples. Body shape deformation is modeled using principal component analysis (PCA) on an aligned database of several thousand bodies.
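A minimal PCA sketch over an aligned mesh database; `meshes` is a hypothetical (n_bodies, n_vertices * 3) matrix of flattened meshes already in vertex correspondence.

```python
import numpy as np

def pca_shape_model(meshes, n_components=10):
    """PCA shape space from aligned body meshes.

    Each row of `meshes` is one body mesh flattened into a vector.
    Returns the mean shape and the top principal components.
    """
    mean = meshes.mean(axis=0)
    centered = meshes - mean
    # Rows of Vt are the principal directions in shape space.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return mean, Vt[:n_components]

def synthesize(mean, components, coeffs):
    """New body = mean shape plus a linear blend of components."""
    return mean + coeffs @ components
```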

Pose initialization assumes a gross initial pose estimate; a complete, end-to-end system would be obtained by combining the method described here with an existing coarse pose tracking algorithm. The subject provides their height, and the initial body shape is taken to be the average shape for the subject's height and gender. The body model is initialized in the scene using the ground plane and the centroid of the point cloud.

For a body model represented as a triangulated 3D mesh with pose and shape parameters θ, a triangle tx(θ) is associated with every pixel x in the overlap between the model silhouette S(θ) and the observed silhouette T, by finding the front-most triangle that projects into x. Let U(θ) = {(x1, tx1(θ)), . . .} for all x in S(θ) ∩ T. For each pixel we have the observed depth D̆x, and for the corresponding triangle t we find the depth Dx,t(θ) along a ray through the pixel center to the plane of the triangle. Taking ρ to be a robust error function (here, Geman-McClure [5]), the depth objective is

E_depth(θ) = Σ_{(x,t) ∈ U(θ)} ρ(D̆x − Dx,t(θ)).
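The Geman-McClure function and this depth term translate directly into code; the scale parameter σ below is an assumed value, not one from the paper.

```python
import numpy as np

def geman_mcclure(r, sigma=1.0):
    """Geman-McClure robust error: r^2 / (r^2 + sigma^2)."""
    r2 = np.square(r)
    return r2 / (r2 + sigma**2)

def depth_objective(observed, predicted, sigma=0.05):
    """Sum of robust errors between observed and model depths.

    `observed` and `predicted` hold depths (meters) for the pixels in
    the overlap of the model and observed silhouettes. A sketch of the
    depth term above.
    """
    return geman_mcclure(observed - predicted, sigma).sum()
```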

5. Silhouette objective

Methods for fitting 3D models to silhouettes usually approximate one of these two integrals:

∫_S ρ(dist(x, T)) dA        (1)

∫_∂S ρ(min_{y ∈ ∂T} ‖x − y‖) ds        (2)

Here S and T are silhouettes, ∂S and ∂T are their boundaries, and ρ is a non-decreasing function (e.g. Geman-McClure [5]). Frequently, approximations to (1) use a discrete distance map [5] and approximations to (2) use a discrete distance map or a correspondence-based scheme like ICP [6]. Integrals like these are often used to define shape distances, but are not widely used with parametric 3D models under projection.

Accurately fitting a body to the image evidence benefits from bi-directional shape distance functions that compute the distance from the model to the image contour and vice versa. Minimizing the distance from the image to the model ensures that all image measurements are explained, while minimizing the distance from the model to the image ensures that visible body parts are entirely explained by image evidence. Modeling the distance from the model to the image is straightforward using the Euclidean distance transform to approximate the distance function to the image silhouette, as this does not change during optimization. Modeling the distance from the image to the model is more difficult because the distance function to the model's silhouette changes with the parameters being optimized; this makes an explicit computation of the derivatives difficult.
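For a fixed pose, both directions can be approximated with Euclidean distance transforms, e.g. via SciPy; the sketch below illustrates the bidirectional distance, keeping in mind that the model-side map must be recomputed whenever the pose parameters change.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def silhouette_distance(model_mask, image_mask):
    """Bidirectional silhouette distance via distance transforms.

    Each direction averages, over one silhouette's pixels, the distance
    to the nearest pixel of the other silhouette. The image-side map is
    static; the model-side map changes with the pose parameters, which
    is what makes gradient-based optimization of this term awkward.
    """
    # distance_transform_edt gives, per pixel, the distance to the
    # nearest zero; invert masks so "zero" means "inside silhouette".
    dist_to_image = distance_transform_edt(~image_mask)
    dist_to_model = distance_transform_edt(~model_mask)
    model_to_image = dist_to_image[model_mask].mean()
    image_to_model = dist_to_model[image_mask].mean()
    return model_to_image + image_to_model
```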

Consequently, many methods that use distance maps either use a unidirectional distance, from the model silhouette to the static observed silhouette, or use a derivative-free optimizer. Problems with the unidirectional application of (1) have been discussed and addressed [12]; similar problems arise with the use of (2) but are less often mentioned. The use of derivative-free methods for a high-dimensional problem is impractical, so a method admitting explicit computation of the derivative is needed. ICP methods are frequently used to minimize (2) for 2D-to-2D and 3D-to-3D shape registration problems. They can be used bi-directionally, and optimization is straightforward because the average point-to-shape distance is bounded by the average distance between corresponding points, which is a smooth function of the vertices of both shapes. Under projection we lose this bound, because points on the silhouette boundary no longer have a stable relationship to the 3D geometry. Without it, the use of ICP is problematic, especially with complex articulated and non-rigid objects.

If we have a set of correspondences between 3D model vertices on the silhouette boundary and points on the observed silhouette, then as we minimize the average distance of the projected vertices to their corresponding 2D points, some vertices will disappear from the silhouette boundary and new vertices will appear. Since these newly visible vertices do not influence the objective function until correspondences are re-computed, the optimizer may move them anywhere without penalty. When this happens, the parameters being optimized may jump away from low-error fixed points to a solution from which ICP cannot recover.

This problem is addressed with a well-behaved new formulation that uses implicit rather than explicit correspondences. The line integral in (2) is computed directly, replacing the explicit correspondences of ICP with the continuously changing ones implied by the min function. Symmetrizing this yields an objective function that is correspondence-free and bidirectional.

To compute this integral, we must know, for each point on the integration silhouette, the distance to the nearest point on the other (reference) silhouette. Each segment of the integration silhouette is broken into pieces that are nearest to the same geometric primitive (vertex or line segment interior) in the reference silhouette. These breaks occur in two circumstances: 1) along lines emanating from a segment's vertices and perpendicular to the segment, which bound the region where perpendicular distance to the segment is defined; 2) on linear or quadratic arcs where two points (quadratic), two segment interiors (linear), or a segment interior and a point (quadratic) are equidistant.

The derivative of this integral is easily computed in terms of the derivative of the path of integration and the derivative of the integrand. There is, however, a small problem. At the breaks, the integrand is not differentiable with respect to the reference silhouette, as the distance functions to the two equidistant primitives vary independently. Nor is it differentiable with respect to the point of evaluation x, as variation in one direction is dictated by one primitive's distance function and variation in another is dictated by the other's. If these breaks occur only at isolated points, as they do for almost every pair of silhouettes, they do not matter: there are finitely many such breaks, and the value of the integrand at finitely many points, so long as it is bounded, does not affect the value of the integral. But if a segment on the integration silhouette lies along one of the arcs where two primitives are equidistant, the non-differentiability of the integrand is inherited by the integral. Because this happens only when two constraints are met (the integration path and the arc of equidistance must be parallel and touching), the manifolds where the objective function is non-smooth have dimension two less than the parameter space. There is nothing about these constraints that would push the optimization trajectory toward these manifolds, so an optimization using a method intended for smooth functions does not, in practice, encounter problems.

De la Gorce et al. [13] use a similar integration-based approach in the context of articulated hand tracking with a generative model and formulate a differentiable objective function. Their objective focuses on a generative model of image appearance across the interior of the object. They compute a 2D integral, which gives them differentiability despite a 1D discontinuity along the occluding contour of the body. A differentiable version of the area integral in (1) can be similarly computed, but it would require computation inside a 2D region, which amounts to computing the Voronoi diagram of a set of line segments. The silhouette objective function is a symmetrized and scaled version of (2), averaging distance over each silhouette boundary to the other:

E_S(θ) = (1/|∂S(θ)|) ∫_{∂S(θ)} ρ(min_{y ∈ ∂T} ‖x − y‖) ds + (1/|∂T|) ∫_{∂T} ρ(min_{y ∈ ∂S(θ)} ‖x − y‖) ds

where S(θ) is the silhouette of the model with parameters θ and T is the image silhouette.

From bodies to measurements. One of the reasons to fit a 3D body model is to extract standard measurements of the body (arm length, chest circumference, etc.). To calculate measurements from shape parameters, the method follows Allen et al. [7] in spirit, but solves the inverse of the problem they describe: Allen et al. learn a linear function from a set of measurements to shape parameters, allowing them to synthesize new people with specified measurements.
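A sketch of learning such a linear map in the inverse direction (shape coefficients to measurements) by least squares; the training arrays are hypothetical.

```python
import numpy as np

def fit_measurement_model(shape_params, measurements):
    """Least-squares linear map from PCA shape coefficients to body
    measurements (the inverse direction of Allen et al. [7]).

    shape_params: (n_bodies, n_coeffs); measurements: (n_bodies, n_meas).
    """
    # Append a constant column so the map includes an offset term.
    X = np.hstack([shape_params, np.ones((len(shape_params), 1))])
    W, *_ = np.linalg.lstsq(X, measurements, rcond=None)
    return W

def predict_measurements(W, coeffs):
    """Measurements for a new body given its shape coefficients."""
    return np.append(coeffs, 1.0) @ W
```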

6. 3D Clothing Fitting

Zhong Li et al. [8] present in their paper a 3D garment fitting method. They first search for feature points on the clothing and body models and match them by constructing a matching function; then several key feature points are interactively selected from the limited feature point sets to compute the rigid transformation matrix for the clothing model. Finally, they perform a second match to adjust the garment fit on the body model.

The main idea of their clothing fitting algorithm is as follows. First, they construct the local fitting patch to estimate the curvature and torsion of each vertex on the body and clothing models, shown in Fig 3. Second, they use a feature extraction method to obtain the feature points. Third, they build a matching function based on curvature and torsion to match similar feature points and interactively acquire several key feature points from both models. Finally, they calculate the transformation matrix from the pairs of key feature points.

Fig. 3. A woman mesh model and a clothing mesh model.

Before extracting feature points from the garment and body models, the estimated curvature of each vertex on both models needs to be calculated; this may influence the final matching precision.

Razdan and Bae [14] presented a curvature estimation method based on a weighted bi-quadratic Bézier patch. Curvature is related to second-order derivatives, and a third-order surface fits the shape of a local area better; here a new weighted bicubic Bézier patch is constructed to estimate the geometric properties of each vertex. A bicubic Bézier surface is written as

S(u, v) = Σ_{i=0}^{3} Σ_{j=0}^{3} bij Bi,3(u) Bj,3(v),  (u, v) ∈ [0, 1] × [0, 1],

where Bi,3(u), Bj,3(v) are the Bernstein basis functions and bij are the Bézier control points, which form the Bézier control net.
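A direct evaluation sketch of this surface:

```python
import numpy as np
from math import comb

def bernstein(i, n, t):
    """Bernstein basis B_{i,n}(t)."""
    return comb(n, i) * t**i * (1 - t)**(n - i)

def bezier_patch(control, u, v):
    """Evaluate a bicubic Bézier surface at (u, v).

    `control` is a 4x4x3 array of control points b_ij.
    """
    point = np.zeros(3)
    for i in range(4):
        for j in range(4):
            point += bernstein(i, 3, u) * bernstein(j, 3, v) * control[i, j]
    return point
```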

The parameters (ui, vi) of vertex pi and the other fitting points are computed first. The vertex pi and the vertices in its 2-ring neighborhood are taken as the fitting points. For pi, an approximate normal vector N is computed as the arithmetic average of the normal vectors of its neighboring triangles. A tangent plane perpendicular to N is created, with pi as the origin of its coordinate system, and a local Cartesian frame is constructed in this plane: the direction from one projected point to the vertex pi is taken as the x-axis and a perpendicular direction as the y-axis. All fitting points are projected onto this plane, and the coordinates of the projected points are enclosed by a min-max box, which is finally scaled to [0, 1] × [0, 1]. The coordinates of the projected points in this range are taken as the corresponding parameters of the fitting points. If some projected points coincide in the tangent plane, or some of the lines connecting two adjacent projected points self-intersect, another fitting point is chosen as the origin of the coordinate system to compute a new tangent plane, or the fitting vertices in the 2-ring neighborhood are replaced by those in the 1-ring neighborhood, to avoid these situations.

After obtaining the corresponding parameters (ui, vi) of vertex pi and the other fitting points, a linear equation system Ax = B is built, where each row of A holds the Bernstein basis products Bi,3(uk)Bj,3(vk) evaluated at one fitting point's parameters, x stacks the sixteen unknown control points bij, and B stacks the coordinates of the fitting points.

From this system of equations, the vector x is solved by the least squares method to determine the control points bij. However, the fitting surface sometimes does not reflect the local shape of each vertex. In order to describe the local shape more accurately, an adjusting matrix S and a factor a are added, balancing the data equations Ax = B against a smoothing constraint Sx = 0.

The matrix S is chosen to make the control point distribution of the bicubic Bézier surface as uniform as possible; a good choice minimizes the second differences of the boundary rows and columns of the control net.

The factor a lies in [0, 1] and is adjusted according to the density of the mesh noise: when the noise is dense, a is set lower, giving the smoothing constraint more weight; otherwise it is set higher. Normally it is set to 0.8.
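A sketch of the weighted fit follows; since the paper's exact adjusting matrix was not recoverable here, a simple second-difference matrix over the rows of the control net stands in for S, and the weighting scheme is an assumption consistent with the description above.

```python
import numpy as np
from math import comb

def bernstein_row(u, v):
    """Row of the 16 basis products B_{i,3}(u) * B_{j,3}(v)."""
    bu = [comb(3, i) * u**i * (1 - u)**(3 - i) for i in range(4)]
    bv = [comb(3, j) * v**j * (1 - v)**(3 - j) for j in range(4)]
    return np.array([bu[i] * bv[j] for i in range(4) for j in range(4)])

def fit_control_points(params, points, a=0.8):
    """Weighted least-squares fit of the 16 control points.

    params: (n, 2) patch parameters of the fitting points;
    points: (n, 3) their 3D positions. S is a second-difference
    smoothing matrix over each row of the control net, standing in
    for the (unrecovered) adjusting matrix used in the paper.
    """
    A = np.vstack([bernstein_row(u, v) for u, v in params])
    rows = []
    for r in range(4):            # second differences along each row
        for c in range(2):
            row = np.zeros(16)
            row[4 * r + c], row[4 * r + c + 1], row[4 * r + c + 2] = 1, -2, 1
            rows.append(row)
    S = np.array(rows)
    lhs = np.vstack([a * A, (1 - a) * S])
    rhs = np.vstack([a * points, np.zeros((len(S), 3))])
    x, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    return x.reshape(4, 4, 3)
```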

Fig. 4. Curvature models. (Red: high values; Blue: medium values; Green: low values)

For the modified system of equations, the least squares method is again used to solve for the control points bij. For each pi, its geometric property estimate is obtained at the corresponding point B(ui, vi) on the local fitting surface S(u, v). The mean curvature and Gaussian curvature at that point can be computed from the differential geometry formulas. Fig 4 shows the mean curvature and Gaussian curvature models of the woman and clothing models.

The torsion property can also be estimated from this local fitting surface. Torsion, like curvature, is an important geometric property under rigid transformation. For a point on a continuous surface, torsion is normally measured by the geodesic torsion. Because the curve passing through a given point on the surface is not unique, the geodesic torsion is generally computed along specific curves through that point. One such curve is the curvature line, but from differential geometry the geodesic torsion along a curvature line is zero. Another is the torsion line, namely the direction along the angle bisector of the two principal directions; the geodesic torsion attains its maximum along the torsion line at the given point, and is here defined as the principal geodesic torsion. A similar torsion property holds for discrete mesh models. This method introduces the principal geodesic torsion and uses it for feature point matching. For its computation, the following theorem holds:

The principal geodesic torsion at a point of the surface can be calculated from the mean curvature H and the Gaussian curvature K, i.e.,

τp = √(H² − K).
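Given any parametric patch, such as the fitted Bézier surface above, H, K and the principal geodesic torsion follow from the fundamental forms; the sketch below takes the derivatives numerically for brevity.

```python
import numpy as np

def surface_curvatures(surf, u, v, h=1e-3):
    """Mean curvature H, Gaussian curvature K and principal geodesic
    torsion sqrt(H^2 - K) of a parametric surface at (u, v).

    `surf(u, v)` returns a 3D point, e.g.
    surf = lambda u, v: bezier_patch(control, u, v).
    Derivatives are taken by central finite differences.
    """
    Su = (surf(u + h, v) - surf(u - h, v)) / (2 * h)
    Sv = (surf(u, v + h) - surf(u, v - h)) / (2 * h)
    Suu = (surf(u + h, v) - 2 * surf(u, v) + surf(u - h, v)) / h**2
    Svv = (surf(u, v + h) - 2 * surf(u, v) + surf(u, v - h)) / h**2
    Suv = (surf(u + h, v + h) - surf(u + h, v - h)
           - surf(u - h, v + h) + surf(u - h, v - h)) / (4 * h**2)
    n = np.cross(Su, Sv)
    n /= np.linalg.norm(n)
    E, F, G = Su @ Su, Su @ Sv, Sv @ Sv          # first fundamental form
    L, M, N = Suu @ n, Suv @ n, Svv @ n          # second fundamental form
    K = (L * N - M**2) / (E * G - F**2)
    H = (E * N - 2 * F * M + G * L) / (2 * (E * G - F**2))
    tau_p = np.sqrt(max(H**2 - K, 0.0))          # principal geodesic torsion
    return H, K, tau_p
```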

The feature points are extracted from the clothing and body models according to the curvature property; here they include ridge and valley points. Judging whether a vertex is a feature point involves curvature values and their derivatives [9]. For discrete triangular mesh models, these derivatives cannot be explicitly calculated at each vertex, so several estimation methods have been proposed to obtain the ridge and valley points; one of them is Stylianou's method [10].

For 3D clothing fitting on a body model, the important operation is to match the two models appropriately. Because the clothing and body models have different geometric shapes and topological structures, ICP matching algorithms cannot produce satisfactory results.

Gal and Cohen-Or [11] presented an algorithm for matching two similar objects, or local parts of objects. They constructed a matching function based on the curvature property that considers not only the curvature of each vertex and its neighborhood, but also uses the curvature variance in the neighborhood as a reference; with it they obtained better matches for local parts of an object and for similar objects.

Torsion is another intrinsic geometric property of the surface, but it has received little attention in mesh processing. A matching function is therefore constructed based not only on curvature but also on torsion,

where Area(pi) is the sum of the triangle areas in the 1-ring neighborhood of vertex pi; Curv(pi) and Tors(pi) denote the curvature and the principal geodesic torsion at vertex pi; NC(pi) and NT(pi) are the numbers of curvature and principal geodesic torsion minima or maxima in the 1-ring neighborhood; VarC(pi) and VarT(pi) are the curvature variance and the principal geodesic torsion variance in the 1-ring neighborhood; and W1, W2 and W3 are threshold values, each set here to 0.33.
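Since the exact matching function could not be recovered here, the sketch below shows only the generic pattern: per-vertex descriptors built from curvature and torsion statistics, matched by nearest neighbor. It is a stand-in for illustration, not the paper's function.

```python
import numpy as np

def match_features(desc_a, desc_b, max_cost=0.5):
    """Greedy nearest-neighbor matching of per-vertex descriptors.

    desc_a, desc_b: (n, d) arrays, one row per feature point, e.g.
    [Curv, Tors, VarC, VarT] normalized to comparable scales. A generic
    stand-in for the weighted matching function described above.
    """
    matches = []
    for i, d in enumerate(desc_a):
        costs = np.linalg.norm(desc_b - d, axis=1)
        j = int(np.argmin(costs))
        if costs[j] < max_cost:
            matches.append((i, j, float(costs[j])))
    return matches
```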

In the new matching function, the torsion property is added to obtain a reliable matching of feature points between the clothing and body models, as shown in Fig 5.

Fig. 5. The feature point matching using the curvature and torsion property

The whole matching process requires some preliminary operations. For example, in order to reduce wrong matches, feature points on the head, hand and foot parts of the body model are discarded, and feature points on the edge of the clothing model are excluded. For the body model, the centroid's position is computed, and whether a feature point should be matched is judged by its distance to the centroid. For the clothing model, edges are identified by the number of mesh triangles each edge belongs to.

Using the new matching function eliminates redundant feature points between the clothing and body models. However, because of the shape complexity of both models, the match relations of the feature points are not always correct, and too many of them also slow the computation of the subsequent transformation matrix. A practical compromise is to interactively select several key feature points before constructing the rigid transformation matrix, as shown in Fig 6. Because the relations are chosen from the limited feature point sets instead of the huge cloud of vertices, the process is convenient.

Fig. 6. Key feature point acquisition from each model

The transformation matrix construction for the clothing model. Let Bi and Gi be the coordinate vectors of a feature point on the body model and of the corresponding feature point on the clothing model. After obtaining the match relations of the key feature points between the two models, all coordinate vectors from both models should satisfy the following coordinate transformation equation

Bi = R·Gi + T,  i = 1, …, n,

where n is the number of key feature points, R is a rotation matrix and T is a translation vector. The key feature points of the two models are substituted into the above equation and the following function is minimized:

Σ_{i=1}^{n} ‖R·Gi + T − Bi‖².

For this equation, the singular value decomposition (SVD) method is used to calculate R and T, which realize the rough match between the clothing and body models. If there is a size difference between the clothing and body models, a scaling operation is performed before the rough match; the scaling value is calculated by comparing the distances from corresponding feature points to each model's centroid. After the rough match, the clothing model is adjusted by scaling until it covers the corresponding part of the body model appropriately.
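The standard SVD (Kabsch) solution for R and T, assuming matched key feature point pairs, is sketched below.

```python
import numpy as np

def rigid_align(G, B):
    """Least-squares rotation R and translation T with B ~ R @ G + T.

    G: (n, 3) key feature points on the clothing model;
    B: (n, 3) corresponding points on the body model.
    Standard SVD (Kabsch) solution for the rough match step.
    """
    cg, cb = G.mean(axis=0), B.mean(axis=0)
    H = (G - cg).T @ (B - cb)           # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                  # guard against reflections
    T = cb - R @ cg
    return R, T

# The clothing model is then moved onto the body:
# fitted = (R @ G.T).T + T
```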

In order to obtain a precise clothing match on the body model, a second matching pass can be performed: after obtaining R and T from the rough match, the new positions of the key feature points on the clothing model are calculated, and the matching method above is used to compute new R and T.

7. Conclusions

This document introduced a real-time virtual dressing room application that only requires the user to have a Microsoft Kinect device and a personal computer.

Methodologies are presented as solutions to the main problems in developing this project: extracting the user's 3D body model, fitting the silhouette to the image, and fitting the 3D clothing models onto the body.

Many other solutions exist, but most of them work only in a 2D environment and show many inaccuracies in fitting the clothing models.

By using the Kinect sensor, the aim of the proposed application is to give the user a realistic experience, with an intuitive, fun and natural interface that can be used in the comfort of their own home.

Bibliography

[1] A. Weiss, D. Hirshberg, M. J. Black. Home 3D body scans from noisy image and range data. 2011.
[2] Microsoft Corp. http://www.xbox.com/kinect.
[3] OpenKinect project. http://openkinect.org.
[4] J.-Y. Bouguet. Camera Calibration Toolbox for Matlab. http://www.vision.caltech.edu/bouguetj/calib_doc/.
[5] S. Geman and D. McClure. Statistical methods for tomographic image reconstruction. Bulletin Int. Statistical Institute, LII(4):5–21, 1987.
[6] Q. Delamarre and O. Faugeras. 3D articulated models and multi-view tracking with silhouettes. ICCV, pp. 716–721, 1999.
[7] B. Allen, B. Curless, and Z. Popovic. The space of human body shapes: Reconstruction and parameterization from range scans. ACM Trans. Graph., 22(3):587–594, 2003.
[8] Zhong Li, Xiaogang Jin, Brian Barsky, Jun Liu. 3D clothing fitting based on the geometric feature matching. 2009.
[9] Ohtake Y, Belyaev A, Seidel H. Ridge-valley lines on meshes via implicit surface fitting. ACM Trans. on Graphics, 23(3):609–612, 2004.
[10] Stylianou G, Farin G. Crest lines for surface segmentation and flattening. IEEE Trans. Vis. Comput. Graphics, 10(5):536–544, 2004.
[11] Gal R, Cohen-Or D. Salient geometric features for partial shape matching and similarity. ACM Trans. on Graphics, 25(1):130–150, 2006.
[12] C. Sminchisescu and A. Telea. Human pose estimation from silhouettes: a consistent approach using distance level sets. WSCG Int. Conf. Computer Graphics, Visualization and Computer Vision, 2002.
[13] M. de La Gorce, N. Paragios, and D. Fleet. Model-based hand tracking with texture, shading and self-occlusions. CVPR, pp. 1–8, 2008.
[14] Razdan A, Bae M. Curvature estimation scheme for triangle meshes using biquadratic Bézier patches. Computer-Aided Design, 37(14):1481–1489, 2005.