
Camera Based Hand and Body Driven Interaction with Very Large Displays

    Tom Cassey, Tommy Chheng, Jeffery Lien

    {tcassey,tcchheng, jwlien}@ucsd.edu

    June 15, 2007


    Abstract

We introduce a system that tracks a user's hand and head in 3D and in real time for use with a large tiled display. The system uses overhead stereo cameras and does not require the user to wear any equipment. Hands are detected using Haar-like features with an Adaboost classifier, and heads are detected using Hough transforms generalized for circles. Three-dimensional coordinates for the hand and head are obtained by correspondence matching between the paired cameras. Finally, a 3D vector is extrapolated from the centroid of the head to the hand; the projection of this vector onto the large tiled display gives the pointing location.


Contents

1 Introduction

2 Hardware Setup
  2.1 Stereo Cameras
  2.2 IR Filters

3 Camera Calibration
  3.1 Why is Calibration Needed?
  3.2 Extrinsic Parameters
    3.2.1 Homogeneous Coordinates
    3.2.2 Homogeneous Extrinsic Parameters
  3.3 Intrinsic Parameters
  3.4 Rectification
  3.5 Calibration Implementation

4 Hand Detection
  4.1 Cascaded Adaboost Classifiers
    4.1.1 Adaboost
    4.1.2 Cascaded Classifiers
  4.2 Training Data and Results

5 Head Detection
  5.1 Edge Detection
    5.1.1 Image Gradient
    5.1.2 Canny Edge Detector
    5.1.3 Hough Transform
  5.2 Head Detection Results

6 Tracking
  6.1 Tokens
  6.2 Token Updates
  6.3 Object Selection

7 Stereopsis
  7.1 Overview
  7.2 Epipolar Geometry
  7.3 Correspondence Problem
  7.4 Range Calculations
  7.5 Range Resolution

8 Vector Extrapolation

9 Conclusion and Future Work


    Chapter 1

    Introduction

Today, scientific computing generates large data sets and high resolution imagery. A biological sample under an electron microscope can easily produce gigabytes of high resolution 3D images, and, due to advances in remote sensing technologies, high resolution geospatial imagery has become widely available. These factors have created the need to visualize such imagery data, and many research groups have built large tiled displays to do so. However, human-computer interaction with large tiled displays remains an open problem: using a traditional mouse to navigate the data is impractical. The NCMIR group at Calit2, University of California, San Diego, has built an alternative human-computer interaction system using a hand controller and head gear, but requiring such equipment places a restrictive burden on the user.

Our objective is to create a hand tracking system, driven by overhead stereo cameras, to control a large display. Our system is vision-based and does not require any external sensors on the user. A few constraints ease our objective: because our cameras are fixed overhead, they see little background noise, which aids hand detection. We developed a hand and arm tracking system called HATS. An overview of our system is depicted in Figure 1.1.


    Figure 1.1: System Overview


    Chapter 2

    Hardware Setup

We place our camera system over the user. Other possible camera configurations would have been placing both cameras in front of the user, or one camera above and one in front of the user. We selected the overhead configuration for its better performance and ease of use: there is very little background noise since the cameras point at the ground. If we had placed the cameras in front of the user, we would have to deal with a noisy background, e.g., people walking by.

    2.1 Stereo Cameras

Two Point Grey Dragonfly cameras, shown in Figure 2.1, were used to form our stereo pair. The cameras can run at a frame rate of 120 fps with a resolution of 640x480 pixels; each has a 6 mm focal length and a pixel width of 0.0074 mm.

The cameras are mounted 364.57 mm apart and are slightly verged in so as to maximize the shared field of view. Each camera is verged in by 0.0392 radians, so the two optical axes intersect at a point roughly 4.6 m below the cameras. This configuration was chosen because it provides good range resolution while maintaining a large shared field of view. The choice of these parameters is discussed further in the Stereopsis chapter.
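As a quick sanity check on these numbers (a worked calculation of our own, assuming the two vergence angles are equal and using the triangulation geometry developed in Chapter 7), the depth of the point where the two optical axes meet is

\[
Z \;=\; \frac{b}{2\tan(\text{vergence})} \;=\; \frac{364.57\ \text{mm}}{2\tan(0.0392)} \;\approx\; 4.65\ \text{m},
\]

which agrees with the 4.6 m figure quoted above.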

    2.2 IR Filters

Infrared filters and illuminators were used in our setup to provide enhanced contrast between the user and the background; skin in particular shows up with high intensity. Additionally, using infrared illumination nullifies variation in the ambient lighting in which the setup is used and provides proper illumination for the cameras without changing the lighting perceived by the user. The setup can therefore function properly even in a totally dark room.

The IR filters block out visible light up to 650 nm and transmit 90% of light in the 730 nm to 2000 nm range (near IR). Note that incandescent light bulbs (and perhaps some other lamp types) and the sun also emit light in this range, and can overexpose the cameras if not carefully controlled.


Figure 2.1: Two Point Grey Research Dragonfly cameras are used to make a stereo pair. The IR filters help bring out skin tones under infrared illumination.


    Chapter 3

    Camera Calibration

    3.1 Why is Calibration Needed?

The goal of camera calibration is to determine the transformation between the two independent pixel coordinate systems of the two cameras and the real world coordinate system. Once the real world coordinate of each pixel can be determined, it is possible to perform stereopsis on the images.

Stereopsis allows the depth of each pixel in the image, and thus the 3D coordinate of each point, to be calculated. The stereopsis process is described in more detail in Chapter 7 of this report.

    3.2 Extrinsic Parameters

The extrinsic camera parameters describe the position and orientation of the camera with respect to the world coordinate system. In the stereo system we are working with, the world coordinate system is taken to be the coordinate system of the left camera. The extrinsic parameters therefore describe the transformation between the right camera's coordinate system and the left camera's coordinate system.

    Figure 3.1: Two Coordinate Systems

Figure 3.1 above shows two different coordinate systems. The transformation between the two coordinate systems in the diagram is an affine transformation, which consists of a rotation followed by a translation.

    y = Rx + t (3.1)

    The rotation component of the transformation is given by the equation below

\[
R = \begin{pmatrix}
i_A \cdot i_B & i_A \cdot j_B & i_A \cdot k_B \\
j_A \cdot i_B & j_A \cdot j_B & j_A \cdot k_B \\
k_A \cdot i_B & k_A \cdot j_B & k_A \cdot k_B
\end{pmatrix} \tag{3.2}
\]


    The translation component of the transformation is given by the equation below

\[
T = \begin{pmatrix} t_x \\ t_y \\ t_z \end{pmatrix} \tag{3.3}
\]

Under the Euclidean coordinate system an affine transformation has to be expressed as two separate transformations, as shown in Equation 3.1. However, if the image coordinates are first converted to the homogeneous coordinate system, then the transformation can be expressed as a single matrix multiplication.

    3.2.1 Homogeneous Coordinates

The homogeneous coordinate system adds an extra dimension to the Euclidean coordinate system, where the extra coordinate is a normalising factor.

Equation 3.4 below shows the transformation from the Euclidean coordinate system to the homogeneous coordinate system.

\[
(x, y, z) \rightarrow (x, y, z, w) = (x, y, z, 1) \tag{3.4}
\]

Equation 3.5 below shows the transformation from the homogeneous coordinate system to the Euclidean coordinate system.

\[
(x, y, z, w) \rightarrow (x/w,\; y/w,\; z/w) \tag{3.5}
\]

    3.2.2 Homogeneous Extrinsic Parameters

\[
M_{\text{ext}} = \begin{pmatrix}
r_{11} & r_{12} & r_{13} & t_x \\
r_{21} & r_{22} & r_{23} & t_y \\
r_{31} & r_{32} & r_{33} & t_z \\
0 & 0 & 0 & 1
\end{pmatrix} \tag{3.6}
\]
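As an illustration of how the single homogeneous matrix replaces the separate rotation and translation of Equation 3.1, the sketch below (NumPy, with a hypothetical rotation and translation rather than our calibrated values) converts a Euclidean point to homogeneous coordinates, applies M_ext, and converts back:

    import numpy as np

    # Hypothetical extrinsic parameters: a rotation about the z-axis and a translation.
    theta = np.deg2rad(5.0)
    R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0,            0.0,           1.0]])
    t = np.array([364.57, 0.0, 0.0])   # e.g. a baseline-sized shift in mm

    # Build the 4x4 homogeneous extrinsic matrix of Equation 3.6.
    M_ext = np.eye(4)
    M_ext[:3, :3] = R
    M_ext[:3, 3] = t

    # A point in the right camera's coordinate system (mm).
    x = np.array([100.0, -50.0, 4600.0])

    # Equation 3.4: append w = 1, apply the transform, then Equation 3.5: divide by w.
    x_h = np.append(x, 1.0)
    y_h = M_ext @ x_h
    y = y_h[:3] / y_h[3]

    # The same result via the two-step form y = Rx + t of Equation 3.1.
    assert np.allclose(y, R @ x + t)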

    3.3 Intrinsic Parameters

The intrinsic parameters describe the transformation from the camera coordinates to the pixel coordinates in the image. This transformation depends on the optical, geometric and digital parameters of the camera that is capturing a given image.

The following equations describe the transformation from the camera coordinate system to the pixels in the image.

\[
x_{im} = \frac{f\,x}{s_x} + o_x, \tag{3.7}
\]

\[
y_{im} = \frac{f\,y}{s_y} + o_y \tag{3.8}
\]

where f is the focal length of the camera, (o_x, o_y) is the coordinate in pixels of the image centre (focal point), s_x is the width of a pixel in millimetres, and s_y is the height of a pixel in millimetres.

When written in matrix form, Equations 3.7 and 3.8 yield the following matrix, which is known as the intrinsic matrix.

\[
M_{\text{int}} = \begin{pmatrix}
f/s_x & 0 & o_x \\
0 & f/s_y & o_y \\
0 & 0 & 1
\end{pmatrix} \tag{3.9}
\]
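Using the camera parameters from Chapter 2 (f = 6 mm, square pixels of 0.0074 mm), and assuming the image centre sits at the middle of the 640x480 frame and reading x and y in Equations 3.7-3.8 as normalised camera coordinates, a minimal sketch of the intrinsic projection looks like this (the sample point is made up):

    import numpy as np

    f = 6.0                  # focal length in mm (Chapter 2)
    s_x = s_y = 0.0074       # pixel width/height in mm (Chapter 2)
    o_x, o_y = 320.0, 240.0  # assumed image centre for a 640x480 frame

    # Intrinsic matrix of Equation 3.9.
    M_int = np.array([[f / s_x, 0.0,     o_x],
                      [0.0,     f / s_y, o_y],
                      [0.0,     0.0,     1.0]])

    # Project a camera-frame point to pixel coordinates (Equations 3.7 and 3.8).
    x, y = 0.002, -0.001               # hypothetical normalised camera coordinates
    pix = M_int @ np.array([x, y, 1.0])
    x_im, y_im = pix[0] / pix[2], pix[1] / pix[2]
    print(x_im, y_im)                  # approximately 321.6, 239.2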

    3.4 Rectification

Image rectification is a process that transforms a pair of images so that they share a common image plane. Once this process has been performed on the stereo image pair captured by the overhead cameras, the stereopsis process reduces to a triangulation problem, which is discussed in more detail in Chapter 7.


The image rectification process makes use of the extrinsic and intrinsic parameters of the two cameras to define the transformation that maps the right camera image onto the same image plane as the left camera image. Whilst this process is required to simplify stereopsis, it does result in the rectified objects being warped and distorted.

Rectification is performed by combining the extrinsic and intrinsic parameters of the system into one matrix, known as the fundamental matrix. This matrix provides the linear transformation between the original images and the rectified images that can then be used to perform stereopsis. The next section of this chapter discusses the process used to determine the parameters of this matrix.

    3.5 Calibration Implementation

Calibration was performed using the SVS (Videre Design) software. Several images of a checkerboard pattern, held at different angles within the shared field of view, were captured with the stereo pair; from these images SVS estimates the intrinsic and extrinsic parameters and produces the rectified image pair used for stereopsis, correcting for both the verged cameras and the lens distortion (see Section 7.2).


    Chapter 4

    Hand Detection

Using Haar-like features for object detection has several advantages over using other features or direct image correlation. Simple image features such as edges, color, or contours work well for basic object detection, but as the object gets more complex they tend to break down. Papageorgiou [?] used Haar-like features as the representation of an image for object detection. Since then, Viola and Jones, and later Lienhart, added an extended set of Haar-like features. They are shown in Figure 4.1.

    Figure 4.1: Haar-like features and an extended set.

In each of the Haar-like features, the value is determined by the difference of the two differently colored areas. For example, in Fig. 4.2(a), the value is the difference between the sum of all the pixels in the black rectangle and the sum of all the pixels in the white rectangle.

Viola and Jones introduced the integral image to allow fast computation of these Haar-like features. The integral image at any (x, y) is the summation of all the pixels to the upper left, as described by Equation 4.1:

\[
ii(x, y) = \sum_{x' \le x,\; y' \le y} i(x', y') \tag{4.1}
\]

where ii(x, y) is the integral image and i(x, y) is the original image. The integral image can be computed in one pass with the recurrence relations:

\[
s(x, y) = s(x, y - 1) + i(x, y) \tag{4.2}
\]


\[
ii(x, y) = ii(x - 1, y) + s(x, y) \tag{4.3}
\]

(s(x, y) is the cumulative row sum, with s(x, -1) = 0 and ii(-1, y) = 0.) With the integral image, any of the Haar-like features can be computed quickly. As shown in Figure 4.2, the vertical rectangle edge feature in 4.2(a) is simply 4 - (1 + 3).

    Figure 4.2: Haar-like features can be computed quickly using an Integral Image.
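A minimal NumPy sketch of the integral image and a rectangle-sum lookup (our own illustration, not the project's code; the rectangle coordinates are arbitrary):

    import numpy as np

    def integral_image(img):
        """Cumulative sum over rows and columns (Equation 4.1, one pass per axis)."""
        return img.cumsum(axis=0).cumsum(axis=1)

    def rect_sum(ii, x0, y0, x1, y1):
        """Sum of pixels in the rectangle [x0, x1] x [y0, y1] using four corner lookups."""
        total = ii[y1, x1]
        if x0 > 0:
            total -= ii[y1, x0 - 1]
        if y0 > 0:
            total -= ii[y0 - 1, x1]
        if x0 > 0 and y0 > 0:
            total += ii[y0 - 1, x0 - 1]
        return total

    # A two-rectangle (vertical edge) Haar-like feature: white half minus black half.
    img = np.random.randint(0, 256, (480, 640)).astype(np.int64)
    ii = integral_image(img)
    white = rect_sum(ii, 100, 100, 109, 119)
    black = rect_sum(ii, 110, 100, 119, 119)
    feature_value = white - black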

    4.1 Cascaded Adaboost Classifiers

    4.1.1 Adaboost

The Adaboost algorithm [?] combines simple weak classifiers to create a strong one. Each simple classifier is weak because it may classify only slightly more than half (perhaps 51%) of the training data successfully. The final classifier becomes strong because the training process weights the weak classifiers accordingly.

    4.1.2 Cascaded Classifiers

Viola and Jones [?] introduced an algorithm to construct a cascade of Adaboosted classifiers, as depicted in Figure 4.3. The end result gives increased detection rates and reduced computation time.

    Figure 4.3: A cascaded classifier allows early rejection to increase speed.

Each Adaboost classifier can reject a search window early on; the succeeding classifiers face the more difficult task of distinguishing the remaining features. Because the early classifiers reject the majority of search windows, a larger portion of the computational power can be focused on the later classifiers.

The algorithm can set a target for how much each classifier is trained to find and eliminate. The trade-off is that a higher expected hit rate gives more false positives, while a lower expected hit rate can give fewer false positives. An empirical study can be found in the work of Lienhart et al. [?].


    4.2 Training Data and Results

Below is a list of the different training data sets, along with a qualitative note on each resulting classifier's performance in the real-time video capture.

Produced XML cascade output file   Image dim   Training size   nstages   nsplits   minhitrate   maxfalsealarm
hand_classifier_6.xml              20x20       357             15        2         .99          .03
hand_pointing_2_10.xml             20x20       400             15        2         .99          .03
wrists_40.xml                      40x40       400             15        2         .99          .03
hand_classifier_8.xml              20x20       400             15        2         .99          .03

Qualitative performance:

hand_classifier_6.xml    High hit rate but also high false positives.
hand_pointing_2_10.xml   Low positives but low false positives.
wrists_40.xml            Can detect arms but bounding box too big.
hand_classifier_8.xml    Low positives but low false positives.

    [show good results and false positives]
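The training parameters in the table (nstages, nsplits, minhitrate, maxfalsealarm) are those exposed by the OpenCV haartraining tool; assuming the cascades are in that XML format, running one at detection time looks roughly like the sketch below (modern OpenCV Python API, not the 2007 code; the file path is illustrative):

    import cv2

    # Load one of the trained cascade files from the table above (path is illustrative).
    cascade = cv2.CascadeClassifier("hand_classifier_8.xml")

    cap = cv2.VideoCapture(0)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Scan the image pyramid with the cascaded classifier; each stage can reject early.
        hands = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=3,
                                         minSize=(20, 20))
        for (x, y, w, h) in hands:
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.imshow("hands", frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    cap.release()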


    Chapter 5

    Head Detection

The head detection module relies on the shape of the head being similar to that of a circle. Generally speaking, the head is the most circular object found in the scene when a user is interacting with the tiled displays. In order to utilise this geometric property, a circle detector is applied to the set of edge points for the input image. These edge points map out the skeleton or outline of objects in the image, and thus can be used to match geometric shapes to the objects.

    Figure 5.1: Head Detection Flowchart

The flowchart above shows the stages involved in detecting regions that are likely to contain a head; these stages are discussed in more detail in the following sections.

5.1 Edge Detection[?]

Boundaries between objects in images are generally marked by a sharp change in the image intensity at the location of the boundary; the process of detecting these changes in image intensity is known as edge detection. However, boundaries are not the only cause of sharp changes of intensity within images. The causes of sharp intensity changes (edges) are as follows:

    discontinuities in depth

    discontinuities in surface orientation

    changes in material properties

    variations in scene illumination

The first three of these causes occur (but not exclusively) when there is a boundary between two objects. The fourth cause, variations in the scene illumination, is generally caused by objects occluding the light source, which results in shadows being cast over regions of the image.

As mentioned previously, boundaries or edges within images are marked by a sharp change in the image's intensity in the area where an edge lies. These sharp changes in image intensity mean that the gradient has a large magnitude at the points in the image where edges occur.

    5.1.1 Image Gradient

Since the images are 2-dimensional, the gradients with respect to the x and y directions must be computed separately. To speed up the computation of the gradient at each point, these two operations are implemented as kernels which are convolved with the image to extract the x and y components of the gradient. The Sobel kernel representations of these two discrete gradient operators are shown below.

\[
G_x = \begin{pmatrix}
1 & 2 & 1 \\
0 & 0 & 0 \\
-1 & -2 & -1
\end{pmatrix},
\qquad
G_y = \begin{pmatrix}
1 & 0 & -1 \\
2 & 0 & -2 \\
1 & 0 & -1
\end{pmatrix} \tag{5.1}
\]


Given the x and y components of the gradient at each point, the gradient magnitude at that point can be calculated using the following equation.

\[
|G| = \sqrt{G_x^2 + G_y^2} \tag{5.2}
\]

The gradient direction at each point is also required for certain edge detection algorithms such as the Canny edge detector. The equation for determining the direction of the gradient at each point is shown below.

\[
\theta = \arctan\!\left(\frac{G_y}{G_x}\right) \tag{5.3}
\]
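A minimal sketch of this gradient computation with the 3x3 Sobel kernels (NumPy/SciPy convolution rather than whatever routine the original system used):

    import numpy as np
    from scipy.ndimage import convolve

    def sobel_gradient(img):
        """Return gradient magnitude and direction (Equations 5.1-5.3)."""
        gx_kernel = np.array([[ 1,  2,  1],
                              [ 0,  0,  0],
                              [-1, -2, -1]], dtype=float)
        gy_kernel = np.array([[ 1,  0, -1],
                              [ 2,  0, -2],
                              [ 1,  0, -1]], dtype=float)
        gx = convolve(img.astype(float), gx_kernel)
        gy = convolve(img.astype(float), gy_kernel)
        magnitude = np.hypot(gx, gy)       # Equation 5.2
        direction = np.arctan2(gy, gx)     # Equation 5.3 (atan2 keeps the quadrant)
        return magnitude, direction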

    5.1.2 Canny Edge Detector[?]

The Canny edge detector provides a robust method for locating edge points within an image, based upon the image gradient. The Canny detector also applies additional processing to the calculated gradient image in order to improve the locality of the detected edges. The Canny edge detector is broken down into multiple stages, which are outlined in the flowchart below.

    Figure 5.2: Canny Edge Detector Flowchart

The previous section outlined the method for calculating the gradient at each point, and how to use this information to calculate the magnitude and direction of the gradient at each point in the image. However, applying this operation to the raw image generally results in many spurious edges being detected, and hence an inaccurate or unusable edge image.

These spurious edges are caused by noise that is introduced when the image is captured. To reduce the effect of this noise the image must first be blurred, which involves replacing the value of each pixel with a weighted average calculated over a small window around the pixel. However, this procedure has the unwanted side effect of smearing edges within the image, which makes them harder to detect.

    Oriented non-maximal suppression

The additional processing provided by the Canny edge detector is known as oriented non-maximal suppression. This process makes use of the gradient magnitude and direction calculated previously to ensure that the edge detector responds only once for each edge in the image, and that the locality of the detected edge is accurate.

This stage of the algorithm loops over all the detected edge pixels in the image and looks at the pixels that lie on either side of each pixel in the direction of the gradient at that point. If the current point is a local maximum then it is classified as an edge point; otherwise the point is discarded.


    (a) Edge Image with no Blurring (b) Edge Image with Prior Blur

Once the pixels in the image that lie on edges have been found, they need to be processed in order to find the circles that best fit the edge information. Since the set of features (edge points) computed by the edge detector comes from the whole image and is not localised to the edges that correspond to the head, the circle matching process is required to be resilient to outliers (features that do not conform to the circle being matched) within the input data.

The method employed for detecting circles within the camera frames is a generalised Hough transform. The Hough transform is a technique for matching a particular geometric shape to a set of features that have been found within an image; the equation to which the input set of points is matched is shown below.

\[
r^2 = (x - u_x)^2 + (y - u_y)^2 \tag{5.4}
\]

    5.1.3 Hough Transform

The Hough transform is a method that was originally developed for detecting lines within images; the method can, however, be extended to detect any shape, and in this case has been used to detect circles.

Looking at the equation of a circle and the diagram below, it can be seen that three parameters are required to describe a circle, namely the radius and the two components of the centre point. The Hough transform method uses an accumulator array (or search space) which has a dimension for each parameter; thus for detecting circles a 3-dimensional accumulator array is required.

Figure 5.3: Circle Diagram

The efficiency of the Hough transform method for matching a particular shape to a set of points is directly linked to the dimensionality of the search space, and thus to the number of parameters required to describe the shape. In light of the requirement to achieve real-time processing, the images are down-sampled by a factor of 4 before the circle detector is applied. Because the algorithm is third order in this case, the down-sampling reduces the number of computations by a factor of 4^3 = 64.

The transform assumes that each point in the input set (the set of edge points) lies on the circumference of a circle. The parameters of all possible circles whose circumference passes through the current point are then calculated, and for each potential circle the corresponding value in the accumulator array is incremented.

Once all the points in the input set have been processed, the accumulator array is checked to find elements that have enough votes to be considered a circle in the image; the parameters corresponding to these elements are then returned as regions within the image that could potentially contain a head.
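A compact sketch of the voting scheme just described (pure NumPy; the radius range, angular sampling, and the choice to report only the strongest peak are illustrative, not the project's settings):

    import numpy as np

    def hough_circle_accumulator(edge_points, shape, radii):
        """Vote in a 3-D accumulator indexed by (centre_y, centre_x, radius index)."""
        h, w = shape
        acc = np.zeros((h, w, len(radii)), dtype=np.int32)
        thetas = np.linspace(0.0, 2.0 * np.pi, 90, endpoint=False)
        for (x, y) in edge_points:
            for ri, r in enumerate(radii):
                # Every centre at distance r from (x, y) could explain this edge point.
                ux = np.rint(x - r * np.cos(thetas)).astype(int)
                uy = np.rint(y - r * np.sin(thetas)).astype(int)
                ok = (ux >= 0) & (ux < w) & (uy >= 0) & (uy < h)
                acc[uy[ok], ux[ok], ri] += 1
        return acc

    # Example: edge points on a circle of radius 20 centred at (80, 60) in a 120x160 image.
    angles = np.linspace(0, 2 * np.pi, 200, endpoint=False)
    pts = [(int(round(80 + 20 * np.cos(a))), int(round(60 + 20 * np.sin(a)))) for a in angles]
    acc = hough_circle_accumulator(pts, (120, 160), radii=[18, 20, 22])
    cy, cx, ri = np.unravel_index(np.argmax(acc), acc.shape)
    print("best circle: centre", (cx, cy), "radius", [18, 20, 22][ri])
    # In the real detector, every accumulator cell above a vote threshold is returned
    # as a candidate head region, so multiple regions may come back per frame.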


    5.2 Head Detection Results

The head detection method described above provides a way of identifying potential head regions within the input scene. Although the detector performs well in most situations, noise and other circular-shaped objects present in the scene occasionally cause the detector to skip a frame or to return multiple head regions. To overcome these limitations, post-processing is applied to the results to interpolate when a frame is missed and to select the most probable head location when multiple head regions are returned by the detector. These post-processing techniques are implemented in the object tracking module, which is discussed next.



    Chapter 6

    Tracking

Ideally the object detectors employed to detect the head and hand regions within the image would have a 100% accuracy rate, always locating the correct objects within the scene and never misclassifying a region. Realistically this will not be the case, and the detection methods employed are inevitably prone to errors.

The object tracker module aims to supplement the detection modules by providing additional processing which helps to limit the effect of erroneous detections and skipped frames.

The object tracker does this by maintaining, in tokens, the state information of all objects that have been detected by a particular object detector within a given time interval, and then using this information to determine which detected region is most likely to correspond to the object being tracked.

    6.1 Tokens

Within the token tracker, tokens are data structures that store the state information of a particular object that has been detected. The following information is encapsulated within a token:

    Location of the object

    Size of the object (area in pixels)

    Time since the object was last observed (in frames)

    Number of times the object has been observed during its lifetime

When an object is detected it is first compared to the list of tokens; if the properties of the object are a close enough match to any of the objects described by the tokens, then that token is updated with the new state information. If no match is found, then a new token is initialised with the properties of the newly detected object. This process is applied to every object that is detected by the object detector.

6.2 Token Updates

When a match is made between a detected object and a pre-existing token, the token must be updated so that it reflects the new state of the object. During this update the count of the number of times the object has been seen is incremented, and the number of frames since the last detection is reset to zero.

Once all the objects have been processed and the corresponding tokens updated or initialised, the list of tokens is filtered to remove any dead tokens. A dead token is any token that corresponds to an object that has not recently been seen. This can occur for three reasons. First, the token could have been initialised for an object that was erroneously detected; in this case the object is likely to be detected only once and then never again. Second, the object that the token was tracking may no longer be present in the scene. Third, a token dies if the object's properties changed too rapidly between detections, so that no match is made between the existing token and the object, and a new token is initialised to represent it.

Once all the dead tokens have been filtered out of the list, the age of every token (the time since it was last seen) is incremented, and the process is then repeated for the next set of detected objects.
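A minimal sketch of this token bookkeeping, including the vote-based selection described in Section 6.3 (our own paraphrase in Python; the field names, the matching test, and the max_age cut-off are illustrative, not taken from the actual implementation):

    from dataclasses import dataclass

    @dataclass
    class Token:
        x: float            # location of the object
        y: float
        size: float         # area in pixels
        age: int = 0        # frames since the object was last observed
        votes: int = 1      # number of times the object has been observed

    def update_tokens(tokens, detections, match_dist=30.0, max_age=10):
        """Match detections to tokens, spawn new tokens, then drop dead ones."""
        for (dx, dy, dsize) in detections:
            match = next((t for t in tokens
                          if abs(t.x - dx) < match_dist and abs(t.y - dy) < match_dist), None)
            if match:
                # Update the matched token with the new state information.
                match.x, match.y, match.size = dx, dy, dsize
                match.votes += 1
                match.age = 0
            else:
                tokens.append(Token(dx, dy, dsize))
        # Remove dead tokens, then age the survivors.
        tokens[:] = [t for t in tokens if t.age <= max_age]
        for t in tokens:
            t.age += 1
        return tokens

    def select_object(tokens):
        """Object selection (Section 6.3): the token with the highest vote count wins."""
        return max(tokens, key=lambda t: t.votes, default=None)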


    6.3 Object Selection

At any given moment the object tracker can be keeping track of multiple objects, some of which may have been correctly classified and others erroneously detected. For the system to allow the user to interact with the display, there must be a mechanism to discern which of these objects is the correct one.

As mentioned before, the ideal case would be for the object detectors to identify the object regions with 100% accuracy. However, with the object tracker supplementing the detection systems, this requirement can be relaxed: for a detector to be admissible, it must meet the following requirements.

    The object must be correctly classified in more frames than erroneous regions are classified

    The object must be classified in more frames than it is missed

If these two criteria are met, then the token tracking the object will receive more updates than any other token; this results in the correct token having a higher vote (observed count) than all the other tokens. The location of the object can therefore be read from the token that has the highest vote count.


    Chapter 7

    Stereopsis

Now that the head and hands have been detected and tracked, obtaining their actual location in space is the final step before the output vector can be calculated. Calculating the depth of the user's hand or head is accomplished by using two cameras mounted on the same horizontal axis a certain distance apart, called the baseline. Similar to how humans perceive depth with two eyes, the computer can calculate depth from two properly aligned cameras. The method through which this is accomplished is explained in this chapter.

    7.1 Overview

Using a stereo pair of cameras gives us two perspectives of the same scene. By comparing these two perspectives, we gain additional information about the scene. Assuming there is no major occlusion, we can find a point in the right image and find that same point in the left image; with these two points we can triangulate the depth of the point in real space. One can imagine that if two cameras are mounted side by side, the left and right images of a far-away object will look almost identical. In fact, as the distance of the object from the cameras increases past a certain threshold, there is no discernible difference between the two images. On the other hand, if the object is very close, the two perspectives of the object will be very different. From this difference between the images, the depth can be extracted. Thus, the goal is to find and quantify this difference.

    7.2 Epipolar Geometry

An easy way to quantify this difference is by comparing the locations where a point on the object is projected onto the two images. Taking the difference of these two locations gives a measure of how far away the object is. This distance is called the disparity, and is measured as the difference between the two projections of the same point onto the CCDs of the cameras. Thus, given a point in one image, it is necessary to find the corresponding point in the other image. It is desirable for these two points, which represent the same point in real space, to lie on the same horizontal line in the images; this way the search for the corresponding point can be limited to one dimension.

To use stereo cameras to extract depth, the left and right images must be calibrated and aligned so that they share the same horizontal scan lines. That is, any one point in real space corresponds to a point in the right image and a point in the left image on the same horizontal line. By mounting the two cameras parallel on the same horizontal axis, this is easily achieved. However, since our cameras are slightly verged in, the points projected into the two images are not on the same horizontal line. To rectify this problem we employ the SVS Videre tool, which performs a transformation on the images that effectively makes the two cameras parallel. This calibration process requires several images of a checkerboard pattern displayed at different angles, and outputs two rectified images that correct not only for the verged cameras but also for the distortion in the lenses.

    7.3 Correspondence Problem

Given a point in one image (a hand or a head, for example), how do we locate that same point in the other image? Since the images have been rectified, the point must lie on the same horizontal line in both images, so the search for the corresponding point is narrowed down to one dimension. As the measure of correspondence, the Normalized Cross Correlation (NCC) is used. A window around the object of interest from one image is taken and slid across the entire horizontal line, computing the NCC at each horizontal position along the line. The horizontal position with the largest NCC value is then taken to be the corresponding point.

    Figure 7.1: Correspondence Problem
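A sketch of this one-dimensional NCC search (NumPy; the window size and the exhaustive scan over the row are illustrative choices, not necessarily those of the original system):

    import numpy as np

    def ncc(a, b):
        """Normalized cross correlation between two equally sized patches."""
        a = a - a.mean()
        b = b - b.mean()
        denom = np.sqrt((a * a).sum() * (b * b).sum())
        return (a * b).sum() / denom if denom > 0 else 0.0

    def find_correspondence(left, right, x_l, y, half=10):
        """Slide the left-image window along the same row of the right image
        and return the column with the highest NCC score."""
        template = left[y - half:y + half + 1, x_l - half:x_l + half + 1].astype(float)
        best_x, best_score = None, -np.inf
        for x in range(half, right.shape[1] - half):
            candidate = right[y - half:y + half + 1, x - half:x + half + 1].astype(float)
            score = ncc(template, candidate)
            if score > best_score:
                best_x, best_score = x, score
        return best_x, best_score

    # Disparity is then the difference between the matched columns, e.g.
    # x_r, _ = find_correspondence(left_img, right_img, x_l=320, y=240)
    # disparity = 320 - x_r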

    7.4 Range Calculations

Figure 7.2 shows the setup of the cameras. All coordinates will now be with respect to the labeled axes.

    Figure 7.2: Camera Setup

The right-hand coordinate system is used, where the z axis points down and out of the cameras, the x axis points outward, and the y axis points into the page.

vergenceL and vergenceR are the angles by which the cameras are rotated inward.

b is the baseline, or distance between the two cameras.


    f is the focal length of the cameras.

X_L and X_R are the known distances between the projected points on the CCD and the center of the CCD, given by the correspondence process.

θ_L and θ_R are two angles adjacent to the x-axis. These angles characterize the distance Z.

θ_L and θ_R can be calculated given X_L, X_R, vergenceL, vergenceR, and the focal length. They are given by:

\[
\theta_L = \arctan\!\left(\frac{X_L}{f} + \text{vergence}_L\right), \qquad
\theta_R = \arctan\!\left(\frac{X_R}{f} + \text{vergence}_R\right)
\]

Using similar triangles, Z can be found as:

\[
Z = \frac{b}{\tan\theta_R + \tan\theta_L}
\]

Notice that as vergenceL and vergenceR approach zero, Z becomes inversely proportional to the disparity. After Z is found, X and Y are easily computed:

\[
X = Z\,\tan\theta_L, \qquad Y = \frac{Y_L\, Z}{f}
\]

where Y_L is similar to X_L, but in the y-direction. Given these three values (X, Y, Z), a 3D point of the object of interest is formed.
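Putting the chapter together, a small sketch of the range calculation with the camera parameters from Chapter 2 (the formulas follow the reconstruction above; the image measurements are converted from pixels to millimetres with the 0.0074 mm pixel width, and the sample pixel offsets are made up):

    import numpy as np

    B = 364.57          # baseline in mm (Chapter 2)
    F = 6.0             # focal length in mm
    PIXEL = 0.0074      # pixel width in mm
    VERGENCE = 0.0392   # inward rotation of each camera in radians

    def triangulate(xl_px, xr_px, yl_px):
        """Return (X, Y, Z) in mm from pixel offsets measured from each image centre."""
        xl, xr, yl = xl_px * PIXEL, xr_px * PIXEL, yl_px * PIXEL
        theta_l = np.arctan(xl / F + VERGENCE)
        theta_r = np.arctan(xr / F + VERGENCE)
        z = B / (np.tan(theta_l) + np.tan(theta_r))
        x = z * np.tan(theta_l)
        y = yl * z / F
        return x, y, z

    # A point at the image centre of both cameras sits at the convergence point, ~4.65 m down,
    # midway between the two cameras along the baseline.
    print(triangulate(0.0, 0.0, 0.0))   # -> (~182.3, 0.0, ~4650) mm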

    7.5 Range Resolution

The range resolution depends on several factors, including the distance Z of the object being resolved from the cameras, the baseline of the cameras, the focal length, and the width of a pixel on the CCD. Figure 7.3 shows the approximately Z^2 relationship of the resolution.

    Figure 7.3: Range Resolution
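For reference, the standard first-order expression for the depth error of a rectified stereo pair involves exactly these quantities (this formula is our addition, not taken from the report; Δd is the disparity uncertainty, here taken to be one pixel width s_x):

\[
\Delta Z \;\approx\; \frac{Z^2}{b\,f}\,\Delta d, \qquad \Delta d \approx s_x ,
\]

which shows the approximately quadratic growth of the range error with distance plotted in Figure 7.3.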


    Chapter 8

    Vector Extrapolation

Given a 3D point for the head location and a 3D point for the hand location, a vector indicating where the user is trying to point can be extrapolated. By taking the difference between the hand's 3D point and the head's 3D point, and normalizing its magnitude to 1 meter, a vector of unit (1 m) length is formed in the direction of the display. This vector, together with the 3D location of the user's head, is then output to the interface, where the user's intended pointing direction can be projected onto the tiled display.
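The whole output stage reduces to a few lines (a sketch; the head and hand points would come from the triangulation of Chapter 7, and the example coordinates are made up):

    import numpy as np

    def pointing_vector(head_xyz, hand_xyz):
        """Return the head position and a 1 m long direction vector from head to hand."""
        head = np.asarray(head_xyz, dtype=float)
        hand = np.asarray(hand_xyz, dtype=float)
        direction = hand - head
        direction = direction / np.linalg.norm(direction) * 1000.0   # normalise to 1 m (in mm)
        return head, direction

    # Example with made-up coordinates (mm): hand held out in front of and below the head.
    head, direction = pointing_vector((180.0, 0.0, 3800.0), (450.0, 300.0, 4100.0))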


    Chapter 9

    Conclusion and Future Work

We have presented an approach to marker-free human-computer interaction using overhead stereo cameras, and have developed a promising demonstration that can be enhanced to drive large tiled displays. More work is required for this to become a truly robust system: speed needs to be improved and gesture recognition needs to be added. For gesture recognition, we did not get good results with an overhead camera system; it would be a good experiment to add a third camera in front of the user solely for gesture recognition. Porting the code to take advantage of multi-core processors such as the Cell Broadband Engine would likely allow parallel computation of classifiers for different gestures.