A Computer Vision
Tangible User Interface for
Mixed Reality Billiards
Brian Hammond
Seidenberg School of Computer Science
Pace University
A thesis submitted for the degree of
Master of Computer Science
December 2007
Abstract
Conventional input devices are unnatural for many human-computer
interactions (HCI). In this thesis we describe a system for creating nat-
ural interaction patterns using tangible objects as input devices. A
user’s physical interactions with these tangible objects are monitored
by analyzing real-time video from a single inexpensive digital video
camera using variations on common computer vision algorithms. Lim-
iting tangible object detection to computer vision techniques provides
a non-invasive and inexpensive means to creating user interfaces that
are more compelling and natural than those relying on a keyboard
and mouse. The mixed reality nature of the system stems from the
use of tangible objects as actors in a virtual or synthetic reality. As
a proof of concept, we have developed a pocket billiards simulation
that allows a user to manipulate a physical cue stick in the natural
manner as a tangible input device for practice exercises and games.
Contents

1 Introduction
  1.1 System classification
  1.2 Billiards
  1.3 Scene description
  1.4 Computer vision
  1.5 Goals and challenges
  1.6 Related work
      1.6.1 Existing TUI-based systems and toolkits
      1.6.2 Systems that use computer vision for user input
      1.6.3 Systems that implement mixed reality billiards
  1.7 System architecture

2 Server process
  2.1 Image formation
  2.2 Issues with acquired digital images
      2.2.1 Geometric distortion
      2.2.2 Blooming
      2.2.3 Noise
      2.2.4 Low rate of acquisition
      2.2.5 Low resolution
      2.2.6 Chromatic aberration
      2.2.7 Temporal averaging
      2.2.8 Weighted moving averages
      2.2.9 Manual camera control
  2.3 On stereo vision
  2.4 On tracking algorithms
  2.5 Feature detection methods
      2.5.1 Color representation
      2.5.2 Thresholding
      2.5.3 Contour finding and analysis
      2.5.4 Binary image morphology
      2.5.5 Flexible color matching
      2.5.6 Convex hulls and convexity defects
  2.6 Pose driven feature detection
      2.6.1 Planar object detection
      2.6.2 Cloth/felt detection
      2.6.3 Cue ball detection
      2.6.4 Cue stick detection
  2.7 Cue stick pose estimation
      2.7.1 Estimation of cue tip vertical offset
      2.7.2 Estimation of the distance between cue stick and cue ball
      2.7.3 Estimation of cue stick pitch
      2.7.4 Estimation of cue stick yaw
      2.7.5 Estimation of spin-inducing parameters
  2.8 Shot detection and analysis
      2.8.1 Shot detection
      2.8.2 Shot analysis

3 Client process
  3.1 Simulation states
  3.2 Orienting the virtual cue stick
  3.3 Shot dynamics
  3.4 Event handling and game logic
  3.5 A focus on user training

4 Conclusions

A Camera calibration

References

List of Tables

1.1 Scene elements of our prototype TUI
2.1 4-way or 8-way neighbors of a pixel P
2.2 Elements of cue stick pose
2.3 Feature summary
2.4 Flexible color model for the cloth/felt
2.5 Flexible color model for the cue stick's shadow
2.6 Scene lines to be found for estimation of h

List of Figures

1.1 Typical pocket billiards equipment
1.2 Scene layout
1.3 Top view
1.4 Client-server architecture
2.1 Low quality digital video camera image
2.2 High quality still photography image
2.3 Example of sensor noise
2.4 Errors in color reproduction of digital video camera
2.5 Server process logic flow
2.6 Color mixtures in RGB
2.7 HSV color space
2.8 Example of thresholding
2.9 Example of contour finding
2.10 Example of erosion morphological operator
2.11 Example of flexible color matching
2.12 Example of convexity defect
2.13 Example of cue ball child contour
2.14 Example of cue stick shadow detection
2.15 Screenshot of demo of feature detection
2.16 Scene points used in cue stick tip height estimation
2.17 Setup for estimation of h; part 1
2.18 Setup for estimation of h; part 2
2.19 Screen capture of cue stick vertical offset estimation demo
2.20 Distance between cue stick tip and cue ball
2.21 Example of cue stick pitch
2.22 Image space
2.23 Current side of virtual billiards table
2.24 Examples of cue stick yaw
2.25 Ball space
2.26 Example distance history
3.1 Screenshot of Google Sketchup modeling session
3.2 Client process logic flow
3.3 Screenshot of client simulation showcasing the "ghost ball" feature
A.1 Camera calibration chessboard object
A.2 Example of camera lens distortion
Chapter 1
Introduction
The most common interaction techniques in HCI today are based on the famil-
iar mouse and keyboard. While extremely successful, these input devices don’t
provide natural interfaces to various categories of software. In fact, one may ar-
gue that most software developed today cannot live up to its full potential due
to the ubiquity of the mouse and keyboard as expected user input devices. In
other words, we as software developers and interaction designers are constrained
to design simple, often unnatural human-computer interactions; or are we?
This thesis describes a prototype tangible user interface (TUI) that uses computer
vision to create a natural, non-invasive user input technique for
cue sports such as 8-ball or snooker. In this system, we utilize a single off-the-shelf
digital video camera (or more commonly, a webcam) to passively view a user’s
interactions with unmarked tangible objects. These interactions are detected
and analyzed in near real-time (≈ 10 Hz), and represent application use cases.
One of our goals is to learn-by-doing; by designing, implementing, and testing a
prototype of a single application using computer vision as an input mechanism,
we wish to learn techniques and lessons for creating a more general class of such
applications in the future.
... computers are still limited in their multimedia understanding. This
means that we have increased the effective bandwidth of information
from computers to humans, by sending audio, images, audio, graphics,
haptic data, but the same rate of improvement has not happened in
computer understanding. Most computers still receive input from low
bandwidth devices like keyboards or mouse [sic]. Only few interfaces
are able to understand application related domains of audio, visual or
haptic information. (37)
1.1 System classification
A mixed reality (MR) system spans a continuum of realities which includes physi-
cal reality, augmented reality (AR), augmented virtuality (AV), and virtual reality
(VR) (31). Physical reality is what we sense all around us every day as the real
world. Augmented reality is a relatively new field in which virtual objects are
introduced into the real world, usually by way of 3D graphics overlaid atop a
user’s view of some real world scene. An augmented virtuality system is one in
which objects in our physical reality are incorporated into a virtual environment.
Virtual reality systems create a completely synthetic environment as seen in many
modern video games and research prototypes. Our system can be considered a
MR system in that a user manipulates an object in physical reality which in turn
manipulates an object in a virtual reality. This system can be further described
as an “interreality” system since we are specifically correlating objects of two
distinct realities (14). Our system is also closely related to a recent focus on
pervasive games:
Computer games focus the user’s attention mainly on the computer
screen or 2D/3D virtual environments, and players are bound to using
keyboards, mice, and gamepads while gaming, thereby constraining
interaction. To address this problem, there is a growing trend in to-
day’s games to bring more physical movement and social interaction
into games while still utilizing the benefits of computing and graphical
systems. Thus, the real-world is coming back to computer entertain-
ment with a new gaming genre, referred to as pervasive games, stress-
ing the pervasive and ubiquitous nature of these games: Pervasive
games are no longer confined to the virtual domain of the computer,
but integrate the physical and social aspects of the real world. (30)
1.2 Billiards
Our prototype TUI allows a user to manipulate a physical cue stick in a manner
natural to billiards players in order to control a virtual cue stick which is a part
of a physics-based billiards simulation. Following the terminology of (41), we
have created a TUI for the domain of modeling and simulation applications. The
tokens or tangible user interface elements of our system are based on the actors
that participate in a billiards game in physical reality and can be considered to
be spatial tokens as we are concerned with their physical proximity to each other.
These are a human player, a cue ball, and a cue stick. Note that the remaining
billiards balls exist only in our billiards simulation. Also, a physical billiards table
is not required. The user’s desktop acts as a mock billiards table.
(a) Billiards table with pockets. (b) Cue stick. (c) Billiard balls.
Figure 1.1: Typical pocket billiards equipment (not to scale). Please see Fig-
ure 2.2 for an example of a cue ball.
When playing billiards in the physical reality, the user views the table from
the same general vantage point as the cue stick. The cue stick has a tapered
cylindrical shape and is most often comprised of wood with a ferrule or cuff
at the tip used to receive most of the impact force so that the wood does not
split. The tips of cue sticks are usually 10 − 13mm in diameter and are covered
in chalk to increase the amount of friction between the stick and the cue ball
during impact. This prevents slippage or miscues. The mass of a cue stick varies;
common masses are ≈ 20 ounces. The cue stick is held by the user and is oriented
in order to strike the cue ball. The cue ball is generally an all-white ball. The
size and mass of billiard balls depends on the billiards game being played. For
instance, in American 8-ball, the cue ball has a mass of 6 ounces and is 2.25 inches
in diameter. The user strikes the cue ball by moving the cue stick into the cue
ball. The cue ball in turn collides with a target billiard ball. The goal in pocket
billiards games is to cause such a target ball to fall into one of the pockets located
on the corners and sides of a billiards table. This is referred to as pocketing a ball.
Our prototype focuses on pocket billiards games, leaving other types of games
(e.g. carom billiards) for future work.
1.3 Scene description
Our TUI expects a particular scene structure and defines constraints on the
appearance of tokens and how they are expected to be interacted with. User-
token interactions occur on or above a flat surface (e.g. a desk surface); thus, this
desk can be considered the reference frame of the TUI (41). A digital video (DV)
camera capable of capturing color video at a decent rate (e.g. 15 or 30 frames
per second (fps)) is mounted atop a tripod and is oriented to view the user’s desk
from above. The exact height does not matter as long as the tokens are visible to
the camera. The camera is attached to a computer which performs online video
capture (also referred to as image acquisition) and image analysis to detect the
actions of the user. The same computer is used for rendering a 3D, physically
based billiards simulation. Since the user is indirectly controlling the simulation
by moving the cue stick, the computer’s display should be visible to the user.
Our token detection algorithms utilize the features of the tangible object as well
as their shadows. Thus, we require a nearby light source of sufficient intensity to
illuminate the scene. In our prototype TUI, we use a common desk lamp with a
60 watt light bulb (see Figure 1.2).

Scene Element          Role
cue stick              token; controls virtual counterpart; shots
cue ball               token; provides realistic tactile feedback; shot aiming
piece of cloth/felt    container for cue stick, cue ball, and cue stick shadow
light source           illuminates tokens; shadows
planar object          aid for determining vanishing lines

Table 1.1: Scene elements of our prototype TUI
The green cloth/felt material mentioned in Table 1.1 is used as a container
of the cue stick, cue ball, and the cue stick’s shadow. By ’container’ we mean
that the view from the camera will show the cue stick, cue ball, and cue stick’s
shadow within the area bounded by this piece of material (see Figure 1.3). This
focuses the attention of our computer vision-based algorithms and thus helps
the algorithms detect the tokens. In our prototype TUI, we experimented with a
number of different materials, ultimately settling on a thin piece of material often
found in “golf putting green” products. We chose this type of material for a few
reasons. The rectangular shape and color of this material mimic those of a real
billiards table, which keeps us in line with our goal of a natural interface. Also, the
material exhibits diffuse or Lambertian reflection. That is, the surface appears
nearly equally bright from all viewing angles (11), reducing specular highlighting
(simply, bright spots) to a minimum. In practice, the effectiveness of computer
vision object recognition techniques may be hampered by specular reflections
on object surfaces such as polished desk surfaces by breaking the continuity of
otherwise uniform patches of color or resembling shapes being matched which
may lead to false-positives during object detection (33).
The planar object mentioned in Table 1.1 is used to determine a set of vanishing
lines of the plane of the desk. Such lines are used in cue stick pose estimation
(see Section 2.7.1). In our prototype TUI, we used a 3x5 inch index card for
this planar object. The dimensions of this object are not important. What is
important however is that the object is very thin, rectangular in shape, and has
a strong contrast with the surface on which it rests.
Figure 1.2: Scene layout.
1.4 Computer vision
We see computer vision and image processing technology – although
still relatively brittle and slow – play an increasing role in acquiring
appropriate sensor and scene models. Rather than using the video
signal merely as a backdrop on which virtual objects are shown, we ex-
plore the use of image understanding techniques to calibrate, register
and track cameras and objects and to extract the three-dimensional
structure of the scene. (24)
Figure 1.3: Top view.
Computer vision is a branch of artificial intelligence whose goal is “to make useful
decisions about real physical objects and scenes based on sensed images” or to
construct “scene descriptions from images” (40). Computer vision is a means to
understand the world using visual data and prior information. This visual data
is specified in the form of digital images. By “prior information” we mean that
recognizing objects in images requires us to know what we are looking for. That
is, we must have prior knowledge of the model of the objects we are trying to find
in images (3).
Computer vision systems process digital images acquired or captured from elec-
tronic cameras (33). A digital image is a 2D array of numbers, each of which is
referred to as a picture element or pixel. Each pixel generally has a small range of
potential values (e.g. an 8-bit pixel has the range [0,255]) and encodes the inten-
sity of light that falls on an image sensor at a certain location. Most systems use
255 as maximum intensity and 0 as minimum intensity. A binary digital image
is a 1-bit image in which 0 represents black and 1 represents white. A grayscale
image is comprised of 8-bit pixels and encodes 256 shades of gray (the value 255
is usually equivalent to full intensity or white). A color image generally uses 3
numbers per pixel to encode color components or channels. For instance, a true
color RGB digital image uses 24-bits to encode color intensities (8 bits for each
of red, green, and blue). There are many color models such as RGB, HSV, HLS,
etc. each with particular uses. We shall discuss the role of RGB and HSV color
models in our system in Section 2.5.1.
Algorithms that somehow search, manipulate, or filter the pixels of digital images
are collectively referred to as image processing algorithms. Image processing can
be seen as a mechanical way of manipulating the pixels of a digital image to some
end (e.g. perhaps making the image appear brighter to our eyes). Image process-
ing provides part of the foundation of image understanding in which meaningful
information is extracted from an image (40). We shall use image processing and
computer vision algorithms in order to reduce complexity. When complexity is
low, information is more easily extracted and decisions are easier to make. This
theme of reducing complexity is common in many computer vision systems. In
our system, we shall show how we reduce the complexity of seemingly random val-
ues of pixels in order to find interesting image features. In turn we shall attempt
to extract information from these image features in order to infer properties of
objects based on the models of the tokens of our TUI. From these properties we
will determine the state of each token of our TUI. Changes in a token’s state are
what we are interested in as these can be considered the use cases of our input
system.
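
As a concrete sketch of this complexity-reduction idea (not the exact pipeline of Section 2.5, and written against OpenCV's current C++ interface rather than the C-style API available in 2007), a single color conversion followed by a threshold collapses a full-color frame into a binary mask from which contours and other features can then be extracted. The HSV bounds shown are illustrative placeholders only, not the flexible color models used by our system.

    // A minimal sketch: collapse a full-color frame into a binary feature mask.
    // Assumes OpenCV's C++ API; the HSV bounds are illustrative placeholders,
    // not the flexible color models described later in Section 2.5.
    #include <opencv2/opencv.hpp>

    cv::Mat featureMask(const cv::Mat& frameBGR)
    {
        cv::Mat hsv, mask;
        cv::cvtColor(frameBGR, hsv, cv::COLOR_BGR2HSV);   // move to a hue-based color space
        cv::inRange(hsv,
                    cv::Scalar(35, 40, 40),               // placeholder lower H, S, V bound
                    cv::Scalar(85, 255, 255),             // placeholder upper bound
                    mask);                                // 255 where the color matches, else 0
        cv::erode(mask, mask, cv::Mat());                 // suppress isolated noise pixels
        return mask;                                      // contours/features are found in this mask
    }
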
Computer vision as an input mechanism is compelling for a few reasons. It is
passive in that the user does not need to physically interact with some
device (e.g. mouse, keyboard) but can interact in a normal manner with objects
appropriate for a particular context (e.g. using a cue stick to play billiards instead
of clicking and moving a mouse). Using computer vision as an input mechanism
is also inexpensive. Consumer-grade color DV cameras capable of capturing 15-
30 frames per second cost ≈ $30 − $60USD. There are significant challenges
with using computer vision as an input mechanism however. These challenges
are applicable to all computer vision systems. Our physical reality is three
dimensional (3D); any point in this reality can be specified by three coordinates
(X,Y ,Z). However, images captured by cameras are two dimensional (2D); points
in images can be specified by two coordinates (x, y). Thus, one dimension is lost
in the process of perspective projection that occurs as light is detected by image
sensors. Many computer vision algorithms try (and succeed) to recover this lost
dimension yet the problem is very difficult to solve in the general case. We shall
see one small set of such algorithms as used to recover the 3D pose of a billiards
cue stick. The algorithms described are not broadly applicable but are nonetheless
useful for study and of course have great utility for our specific prototype TUI.
The accuracy of observations made by a computer vision system is limited to
what the system can visually sense; accuracy is usually limited by the number
of pixels of the images being processed (although algorithms that use sub-pixel
measurements are becoming more common). Changes in lighting conditions can
cause a computer vision-based recognition algorithm to easily become confused.
Occlusions, in which objects of interest are hidden either partially or fully by an-
other object closer to the camera, can also confuse such algorithms. Objects that
undergo fast motion may be difficult to robustly detect and track. Cameras have
a host of possible issues which are described in Section 2.2. Digital images tend
to contain a fairly large amount of visual data. Processing this data efficiently is
important for real-time response rates. CPU utilization is also generally high for
computer vision system implementations due to the large number of pixels that
need to be processed, draining the overall system performance. Most of these
issues can be dealt with by careful system design or by imposing constraints;
e.g. there can be no fast-moving objects as in (28). However, finding a working
balance across all of the mentioned issues is a monumental task for any computer
vision system.
1.5 Goals and challenges
We have a number of goals we wish our system to meet.
• We wish to remove the intermediate layer between applications and users;
what user interface expert Jef Raskin calls the “operating system” problem
(35).
• Our main goal from the computer vision portion of this thesis is simultane-
ous object detection and pose estimation from natural object features. We
do not use markers to make objects easier to detect. Such a system is the
least invasive to users and hence the most natural.
• We wish to create a realistic, real-time VR billiards simulation that orients
a virtual billiards cue stick based on the detected 3D pose of the user-
controlled physical cue stick.
• We want to detect a user’s shots on a real cue ball, derive the parameters
that describe the shot, and physically model a similar shot in our billiards
simulation.
These goals are difficult to accomplish given our self-imposed constraints of
• using low-cost hardware such as webcams;
• using a single DV camera for acquiring digital images;
• requiring minimal user involvement in calibration or setup;
• using marker-less tokens in their natural context (non-invasiveness);
• robustly detecting potentially fast moving objects;
• allowing objects to undergo partial or full occlusions for indefinite amounts
of time.
1.6 Related work
Our system spans several areas of research and thus has a fairly broad scope. We
are interested in understanding the work of fellow researchers that have created
various TUI-based systems via computer vision or other means (e.g. RFID),
systems that use computer vision as an input mechanism, systems that provide
alternative interfaces to virtual billiards games, and systems that augment the
physical billiards experience for the advancement of training methods.
1.6.1 Existing TUI-based systems and toolkits
We found the metaphor of light, shadow, and optics in general to be
particularly compelling for interfaces spanning virtual and physical
space. (19)
An interface may be considered tangible when its interface elements are phys-
ical objects that somehow represent digital information (19). These interface
elements, hereafter referred to as tokens (41), are manipulated (e.g. moved, ro-
tated) by a user in order to manipulate digital information. In TUIs, physical
representations embody mechanisms for interactive control (13; 19). TUIs have
seen a large amount of research in the past ten years, starting with Fitzmaurice’s
Ph.D thesis on graspable user interfaces (10), and at the MIT Media Group where
the Tangible Bits initiative of (19) led to the creation of infrared, magnetic, and
electronic sensors for use in several TUI prototypes. The goal of Tangible Bits
is the same as that of all TUIs: “to bridge the gaps between both cyberspace and
the physical environment” (19). Indeed, we share the desire to “push computers
into the background”, and make user interfaces more human-centered and less
focused on traditional input devices.
Myron Krueger’s VideoPlace (25) created a broad interest in using cameras as a
passive input mechanism, inferring user intent by watching users instead of forcing
them to physically create inputs by way of a physical device. VideoPlace is an
inspiring work of both technical as well as philosophical merit.
In contrast to other work on TUIs, the only means of detecting user interactions
with tokens in our system is via online (or real-time) analysis of captured video
using computer vision-based algorithms. We do not rely on magnetic sensors (cf.
(45)) or electronic tags. Furthermore, our goal is to require a minimal number
of proprietary tokens (cf. (1; 7; 8; 13; 23)), focusing on natural interactions
with common tangible objects for a given context. For instance, while Klinker et
al. (24) use interactive user assistance and magnetic trackers to help make their
computer vision techniques more robust, we wish to keep user involvement with
the implementation of our TUI to a minimum.
The metaDESK project (42) attempted to bring conventional GUI widgets such
as windows, icons, and menus into the physical realm. The physical instantiations
of these user interface elements are sensed using optical, electromagnetic, and me-
chanical sensors. In contrast to metaDESK, our work is focused on creating TUIs
with tokens used in their natural context instead of focusing on creating TUIs
based on contrived tokens such as a “phicon”. We believe that using computer
vision instead of other types of sensors is less constricting as the sensor or sensor
rigs in our system are easy to set up and inexpensive compared with the sensors of
metaDESK which are embedded in the desk itself. We feel that metaDESK, while
a wonderful exercise in developing a prototype TUI, provides little advantage or
convenience over using the traditional input devices to control traditional GUI
elements.
Paper Mache is a toolkit for creating TUIs that utilize one or more of computer
vision, bar codes, and radio frequency ID (RFID) tags (23). The goal of Paper
Mache is to abstract various input mechanisms in order to make TUI development
easier for programmers and systems designers. Paper Mache wishes to shield de-
velopers from having to understand “a field very different from user interface
development” such as computer vision. While this is a laudable goal, we be-
lieve that it is difficult to provide toolkit support for a broad range of TUIs with
only a single API. For instance, abstracting computer vision and RFID-based
input leads to a simplified least-common denominator set of API features. Paper
Mache’s API is based around events wherein an application is notified of physical
objects (or Phob) entering the scene, leaving the scene, and otherwise changing
position or orientation. These events are the same across input mechanisms but
the computer vision based input mechanism allows access to the acquired image.
For advanced applications, accessing this image is almost always going to be re-
quired since Paper Mache’s computer vision analysis is fairly limited to finding
object properties based on contour analysis alone (i.e. with no direct support
for occlusion handling, fast moving objects, nor tracking implementations). Ac-
cessing the underlying image breaks the encapsulation of the system which is in
direct opposition to the overall goal of Paper Mache. Also, Paper Mache is imple-
mented in Java which may cause some concerns regarding system performance,
especially for a system like ours which has additional computational burdens (e.g.
our physics-based billiards simulation). For this reason, we have not attempted
to implement our TUI using this system.
Touch-Space is a mixed reality system that strives to create TUIs for human-
human and human-physical interaction (6). Touch-Space focuses on physical
reality mixed with AR and VR techniques to focus on the social aspects of TUIs.
Users engage in an ad hoc game that takes place in a large room. Users wear
head-mounted displays (HMD) with video cameras attached. Markers attached
to physical objects in the scene are sensed using computer vision by way of the
ARToolkit. By comparison, our system does not require markers on physical
objects. Also, our system does not require a HMD but renders 3D graphics on a
standard computer display screen. It would be interesting to incorporate a HMD
into our work and render the virtual billiards table of our simulation aligned with
the plane of the user’s desktop (see Figure 1.2). User feedback of the Touch-Space
system shows that users enjoyed the benefits of immersion provided by the use of
HMDs but did not enjoy the limited field of view and weight of the HMD itself.
Crayons (9) is a system similar in goal to Paper Mache in which the creation of
camera-based interactions is abstracted to ease the burden of the programmer.
Unlike Paper Mache however, Crayons focuses on pixel-level classification instead
of object-level analysis. Moreover, Crayons uses machine learning techniques for
classification instead of simple computer vision techniques. Crayons is interest-
ing to us in its attempt to reduce the burden of classification tasks using color
models as our TUI relies in part on color-based analysis for feature detection.
However, we believe that the complexity of computer vision related classification
and analysis is very difficult to abstract while simultaneously providing sufficient
power for demanding applications. Thus, systems like Paper Mache and Crayons
are ultimately not well suited to TUIs like ours in which accuracy is paramount
to ease of implementation.
1.6.2 Systems that use computer vision for user input
Many other technologies besides vision have been tried to achieve 3D
tracking, but they all have their weaknesses: Mechanical trackers are
accurate enough, although they tether the user to a limited working
volume. Magnetic trackers are vulnerable to distortions by metal in
the environment, which are a common occurrence, and also limit the
range of displacements. Ultrasonic trackers suffer from noise and tend
to be inaccurate at long ranges because of variations in the ambient
temperature. Inertial trackers drift with time. By contrast, vision has
the potential to yield non-invasive, accurate and low-cost solutions to
this problem, provided that one is willing to invest the effort required
to develop sufficiently robust algorithms. (29)
While this quote refers to using computer vision for tracking (simply, following
a moving object; see Section 2.4), it equally applies to systems that do not use
tracking but computer vision in general for object detection since “every tracking
method requires an object detection mechanism either in every frame or when
the object first appears in the video” (46). We agree with the main assessment of
this quote – that computer vision for use as an input mechanism is a fertile area
for research and has potential to offer great advantages. However, developing
robust algorithms to this end is anything but trivial as we shall see.
There are a number of systems that use computer vision as a user input mecha-
nism that rely on the detection of markers or fiducials that must be attached to
the objects of interest (8; 21). The orientation and scale of detected markers is
used for various application purposes. We have avoided marking tokens in our TUI
in order to afford the user the most natural experience possible. For instance, it
would not be overly natural to place a sizable fiducial on a cue stick, especially
since the marker must be oriented towards the camera at all times. We will not
consider such systems (although successful in their own uses) any further in this
thesis.
Visual Touchpad is a system that uses stereo computer vision techniques to track
a user's hand above a planar object. The movements of the user's hand act as
use cases to manipulate 2D graphical entities. By contrast our system only uses a
single camera. Also, Visual Touchpad determines the height of a finger above the
planar object by requiring an initialization step in which the finger is detected at a
height of ≈ 1cm above the planar object. At runtime, the finger is considered to
be touching the planar object when the disparity information revealed in stereo
vision analysis finds the finger to be within this 1cm distance. Thus, Visual
Touchpad ’s goal is to determine when the finger is touching the planar object
while our goal is to estimate the height of the cue stick above a planar object.
Our system also requires an initialization step to estimate the vertical offset of the
cue tip above the user’s desktop; we however use a completely different method
based on a single view of the scene to estimate this offset (see Section 2.7.1).
mulTetris (1) is a Tetris-style game in which the game pieces are translated
and rotated in accordance with the manipulations of colored blocks by a user
directly atop a computer display. The system uses computer vision as an input
mechanism by viewing the motions of the color blocks. The system uses color-
based tracking which we also experimented with (see Section 2.4). A major issue
with using such a tracker is that “if ... the environment happens to change
(lighting conditions, non-constant camera settings), the setup has to be redone”
(1). We have attempted to combat such issues by using flexible color modeling for
instance. mulTetris follows the basic principles of TUI design as set forth in (10).
The system requires two cameras to perform its analyses; we require only one.
One aspect of the development of mulTetris that we wish to copy in the future
is a usability study which often provides valuable feedback. However, we did not
consider this to be of pressing need for our prototype. While such a usability study
is very useful for the successful development of new user interaction patterns, the
accuracy of measurements is more important for a system like ours in which
interaction patterns are predefined by the context (e.g. billiards).
The PlayAnywhere project (43) is interesting to us for several reasons: it is very
expensive to implement relative to our project's cost (≈ $10,000 USD
vs. ≈ $50 USD); uses shadows as a scene element to derive part of an object's
pose; performs contour analysis to detect a sense of height of a scene object
above a plane of a desktop; and can be considered a TUI that allows users to
manipulate (translate, rotate) 2D graphical objects by direct manipulation of
projected images of these objects. PlayAnywhere also uses infrared light instead
of visible light to make feature detection more robust. We wish to investigate the
use of infrared light in future work.
In (2) we find yet another system that uses computer vision for detection of
moving objects by detecting colors in images; the system tracks balls as they fly
through the air during a juggling session. However the system requires the use
of stereo vision algorithms and thus requires two video cameras. The system is
noteworthy for its lack of markers, use of color-based features, and use of the
Kalman filter for motion prediction. See Section 2.4 for the reasons our system
does not ultimately use tracking algorithms.
1.6.3 Systems that implement mixed reality billiards
Several other systems have been developed that augment the billiards experience
in one form or another. Several projects provide a control interface to a VR
simulation. Others augment a physical billiards table with projected video for
the purposes of training. Others use HMDs to provide an augmented view of a
physical billiards table.
The HapStick (16) project is a mixed reality billiards project in which haptics
and force feedback are used to simulate the sensation of striking a real cue ball.
The user manipulates a cue stick that is physically tethered to a sensor. When
contact with the sensor is detected, the system effectively “pushes back” on the
stick to offer the feeling of striking a cue ball. The force of the shot is detected by
the system, and a similar force is imparted to a virtual cue ball in a physics-based
billiards simulation. While the system provides an accurate account of the force
of a cue stick that we have ourselves not been able to match, there are several
limitations that we hope to overcome in our project. The system costs (as of late
2007) hundreds of dollars to build and is nontrivial to construct. Such a system
is thus not easy to deploy to a wide audience. The interface, while allowing a
user to manipulate a physical cue stick, is not overly natural since the stick is
tethered to a rather large contraption. In other words, the sensing in this system
is active rather than passive. Also, no actual cue ball is struck which is clearly
also unnatural. Finally, while the system can detect and model varying degrees of
pitch and yaw of the cue stick, it can not model spin effects such as draw, follow,
and english that a sophisticated user would demand. While haptic technology
improves the immersive experience for users, our goal is to obviate the need for
haptics by allowing users to interact with real objects instead of haptic models
thereof.
Stochasticks (20) is a system which helps a user visualize potential billiards shots
on a physical billiards table. The user wears a HMD and sees lines and arrows
overlaid atop an otherwise conventional view of a billiards table. This system
is a mixed reality billiards system in that it uses AR technology to augment
the physical reality of playing billiards. As in our prototype, Stochasticks finds
a cloth area in order to focus the attention of its feature detection algorithms.
However, unlike our system, Stochasticks computes a model of billiard table cloth
colors offline in a precomputation step. We use a simple flexible color model for
matching. The aim of Stochasticks is similar to ours: provide a natural means
for a user to self-train in billiards. However, the means are quite different. Our
system allows a user to view the scene with their own eyes while Stochasticks requires
a tethered HMD to be worn which is not very natural. Also, Stochasticks requires
the user to be located near a physical billiards table. In our system, the user can
practice billiards using the normal equipment in the comfort of their own home.
As our physics-based simulation improves, we hope that the difference between
the motion of balls on a physical billiards table and our simulation becomes small
enough for training without a physical billiards table to become a viable reality.
It will be interesting to see how Stochasticks evolves as wearable computers and
HMDs reduce in price and size over time.
An excellent example of an augmented billiards training system is the Automatic
Pool Trainer (26). This system is similar to Stochasticks in that it augments
the physical billiards playing experience. However, Automatic Pool Trainer uses
a projector mounted on the ceiling above a billiards table. A colocated camera
views the surface of the table from a top-down vantage point. Computer vision
is used to detect the location of balls on the table. Armed with this knowledge,
the system projects lines, circles, and other visual aids to help the billiards player
visualize shots and position balls post-shot. While the system is a wonderful
example of how one can use AR to train users, it is also not deployable to a wide
audience since its hardware is generally expensive, requires a physical billiards
table, and also requires the ability to mount a projector and camera on the ceiling
above the billiards table.
1.7 System architecture
Our prototype TUI implementation is based on the client-server architecture
wherein a server process performs the vision-based token detection and pose
estimation and a client process performs a physics-based billiards simulation driven
by information received from the server. The server process sends information to
the client over a standard TCP/IP socket connection.
Figure 1.4: Client-server architecture.
The described client-server architecture was chosen mostly for performance im-
provements. Compared to a system that uses a single process to perform both
the image processing and physics simulation, our system can offload the image
processing tasks to one CPU and the physics and 3D rendering tasks to another
CPU in a multi-CPU (or multi-core) computer system. Such multi-core systems
are becoming much more common. For instance, the prototype TUI was devel-
oped on an Apple MacBook Pro which has two cores. Indeed, it is even possible
to use one computer to run the server process and another to run the client pro-
cess since the client and server communicate over a TCP/IP socket. We shall
refer to the sending of data between the server process and client process as IPC
for inter-process communication.
Since TCP/IP was not designed for extremely low-latency communications, one
may be concerned with an architecture that uses TCP/IP for an input mechanism,
as the latency of measurements of any input device should be as minimal as
possible. In our tests, we have used only a single computer with multiple cores.
Client-server communications in this scenario using a single TCP/IP socket were
fast enough to send on the order of 25,000 packets per second. However, since DV
cameras deliver only ≈ 30 images per second, our system is clearly not network-
limited.
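
To make the IPC concrete, the following sketch shows the kind of per-frame message the server might push to the client over its TCP/IP socket. The field names and the raw-struct wire format are illustrative assumptions; the actual packet layout is not specified here.

    // Sketch of a per-frame pose update sent from server to client over TCP/IP.
    // The fields and the raw-struct encoding are illustrative; a production
    // implementation would serialize fields explicitly to avoid padding and
    // byte-order issues between machines.
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <cstdint>

    struct PoseUpdate {
        float yawDegrees;     // cue stick yaw (Section 2.7.4)
        float pitchDegrees;   // cue stick pitch (Section 2.7.3)
        float tipHeight;      // vertical offset of the cue tip (Section 2.7.1)
        float tipToBallDist;  // distance from cue tip to cue ball (Section 2.7.2)
        std::uint8_t shot;    // nonzero when a shot has been detected (Section 2.8)
    };

    // Sends one update on an already-connected socket descriptor.
    bool sendPose(int sock, const PoseUpdate& p)
    {
        return send(sock, &p, sizeof p, 0) == static_cast<ssize_t>(sizeof p);
    }
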
Chapter 2
Server process
The server process is responsible for
• capturing images from a digital video camera;
• performing feature detection and extraction;
• extracting the state or pose of tokens from image features;
• detecting shots taken and the properties thereof;
• notifying connected clients of token states and shots taken.
Before we discuss the design and implementation of the server process itself, let
us first describe how the images acquired by an imaging device are formed before
delivery to an application like our prototype TUI. Such a discussion is necessary
in order to understand the limits of imaging devices and the subsequent issues
with captured images that our system must fight in order to accurately measure
elements of our TUI’s scene.
Figure 2.1: Image acquired from a 0.3 megapixel Apple iSight digital video
camera.
Figure 2.2: Image acquired from a 6 megapixel Nikon D50 digital SLR camera.
2.1 Image formation
The function of a digital video camera (for our purposes) is to sample the visible
spectrum of light at discrete instances of time and form a digital two dimensional
(2D) image representing the intensity of the photons of light that have fallen upon
the camera’s image sensors or cells. The average webcam has roughly 300,000
such cells. Compared with that of a single human eye which has over 100,000,000
rods and cones (crudely the equivalent of an image sensor), we can begin to
understand that modern cameras are not capable of capturing exquisitely fine
detail like the human eye can, at least in part because a digital video camera has
low acuity (sharpness).
2.2 Issues with acquired digital images
Every digital image acquired from a camera has some amount of error due to numerous
factors including manufacturing imprecision, the limits of camera circuitry, and
various tradeoffs that camera manufacturers balance in order to produce a consumer
product, such as cost vs. quality in lens design. In (40), an overview of such
issues is given. As we have encountered many of these issues in our work, we
provide a summary of the major issues and the countermeasures we have taken
to fight these issues in order to extract the most accurate measurements possible
from fairly low-quality imaging devices.
2.2.1 Geometric distortion
A camera lens that does not focus light as expected generates images that contain
some amount of geometric distortion. This type of distortion causes straight lines
in the world to be curved in acquired images. All cameras have some amount
of geometric distortion due to imperfections in the manufacturing process. How-
ever, there are techniques to quantify the amount of geometric distortion and
even correct it. Camera calibration is a technique widely used to discover the in-
trinsic and extrinsic properties of a camera. The intrinsic parameters of a camera
include a set of distortion coefficients that describe how a particular camera dis-
torts images. These parameters are used in the correction process as well. See
appendix A for details on how one can calibrate a camera using Intel’s OpenCV
library (18) and remove the geometric distortion in acquired images.
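
A minimal sketch of the correction step, assuming OpenCV's C++ interface: once the camera matrix and distortion coefficients have been recovered (e.g. via cv::calibrateCamera run against chessboard views, as outlined in Appendix A), each captured frame can be undistorted before any further analysis.

    // Sketch: undo lens distortion using previously computed calibration data.
    // cameraMatrix (3x3 intrinsics) and distCoeffs (distortion coefficients)
    // are assumed to come from an offline calibration pass (Appendix A).
    #include <opencv2/opencv.hpp>

    cv::Mat undistortFrame(const cv::Mat& frame,
                           const cv::Mat& cameraMatrix,
                           const cv::Mat& distCoeffs)
    {
        cv::Mat corrected;
        cv::undistort(frame, corrected, cameraMatrix, distCoeffs);
        return corrected;   // straight world lines now map to straight image lines
    }
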
. . . digital sensors consist of an array of “pixels” collecting photons,
the minute energy packets of which light consists. The number of
photons collected in each pixel is converted into an electrical charge
by the light sensitive photodiode. This charge is then converted into
a voltage, amplified, and converted to a digital value via the analog
to digital converter, so that the camera can process the values into
the final digital image. (33)
The vast majority of webcams and digital video cameras are based on either
Charge Coupled Devices (CCD) or Complementary Metal Oxide Semiconductor (CMOS)
technology. Such cameras use an array of sensors where each sensor collects a
charge proportional to the amount of light incident on the sensor. The charge
is sampled or discretized and made available as a digital image. The number of
sensors in the array defines the resolution of images captured by a CCD or CMOS
camera. See (33) for a more detailed discussion of CCD design. There are several
issues that we have experienced from using inexpensive webcams that use CCD
image sensors.
2.2.2 Blooming
The cells or sensors in a CCD array are not perfectly insulated from one another.
Blooming is an effect where charge collected at one cell leaks into neighboring
cells. We can think of a CCD image sensor cell as a bucket that collects charge;
once a bucket is full, the additional charge has no effect on the pixel value,
overexposing the pixel. When the charge overflows to neighboring pixels, those
pixels become overexposed as well.
2.2.3 Noise
Noise can be defined as the departure from a true signal. In the case of an acquired
digital image, the effects of noise can be seen as departures of pixel colors from their
true colors. Noise stems from a range of sources including differences in sensitivity
between otherwise identical image sensors, environmental influences, quantization
(rounding) errors, and so on.
Noise causes our measurements to jitter since image regions vary slightly with
noise from image to image. In the client simulation, the end result of these fluctua-
tions in measurements is a slight wobble in the cue stick orientation and position.
This causes imprecision at best and a general sense that the system isn’t very ro-
bust. We combat this with the use of temporal averaging and/or weighted moving
averages (discussed later).
Figure 2.3: Example of sensor noise. Here we see the hue channel of a single
captured HSV video image. The dark region marked by the letter ’A’ is of uniform
hue yet is speckled with various other hues, mostly attributable to sensor noise.
2.2.4 Low rate of acquisition
The rate at which a video capture device can deliver images to a software applica-
tion is called the device’s acquisition rate. A higher acquisition rate means that
less time passes between the delivery of each captured frame of video. Hence,
given a high frame rate, a software application would have a higher chance of see-
ing an event of interest occur in one of the captured frames. In turn, a low frame
rate causes a software application to in effect close its eyes for a (subjectively)
long period of time, only to reopen its eyes and try to make sense of what has
happened in the scene in the intervening time period. In other words, it is easier
to understand the world through vision with more visual data rather than less
(in general; optical illusions notwithstanding). See Section 2.8 for a discussion
on how the low frame rate of digital video cameras is dealt with in the context of
sensing billiard shots.
The definition of an acceptable acquisition rate is application dependent. For
instance, an application that attempts to sense or track moving objects (such as
ours) performs better given a higher acquisition rate while an application that
tracks the sun’s position in the sky over the course of a day does not require a
high acquisition rate in order to fulfill its goal.
We should note that a video capture device is not always completely to blame for
a low acquisition rate. An application that takes too long (i.e. longer than the
time to acquire the next image) to process each frame of video will also decrease
the effective acquisition rate. Such an application will cause most video capture
device drivers to drop frames to avoid maintaining a backlog of frames to be
delivered to a requesting application.
Since our application is a user input mechanism, our goal is to find a balance
between performance and accuracy. We wish to provide accurate updates of user
input derived from visual data on a regular, low-latency schedule. Our prototype
implementation causes a drop in the effective acquisition rate due to the amount
of image processing (and thus time) our algorithms require.
2.2.5 Low resolution
A digital image’s resolution defines the number of pixels that comprise the image.
We prefer to capture images of a sufficient resolution in order to be able to detect
small details, “sufficient” being application dependent. If an image’s resolution is
too small, some details will not be detected because the image is too coarse. If the
image resolution is large however, additional processing time and memory would
be required. Thus, there is a tradeoff between image size and resources required
to process the image. As a practical matter, many consumer-grade cameras offer
a small range (sometimes even only a single choice) of resolutions for captured
images. In this case we have little to no control over the image resolution; we must
use whatever resolution the camera offers. For instance, in this project, we used
an Apple iSight which offers a single image resolution of 640 x 480 pixels (columns
x rows); this is 640 x 480 ≈ 0.3 megapixels (MP; 1 MP is 1 million pixels).
Modern digital still-photo cameras offer a range of possible resolutions, even 10
megapixels or more. However, such cameras cannot sustain a high capture rate
as required by our system. Thus, we must use a digital video camera in order to
capture images at a higher rate, albeit at a lower resolution. For instance, see
Figures 2.1 and 2.2 which showcase two images, one taken by a webcam and the
other taken by a high quality camera used for still photography. One can clearly
see the difference in image quality between these two images.
Since a pixel represents a sample of the scene, that pixel projected back into the
scene represents the nominal resolution of the image sensor (40). For instance, if
a physical cue ball with a diameter of 2.25 inches is imaged to a circle of diameter
100 pixels, then the nominal resolution is 2.25/100 = 0.0225 inches. In other words,
we cannot detect a world feature that is smaller than 0.0225 inches in this case. We thus
recommend orienting the camera such that it tightly frames the area in which
tokens can be seen and not much more.
2.2.6 Chromatic aberration
Chromatic aberration is the inability of a lens to focus different colors in the same
focal plane. This occurs because different wavelengths of light are refracted by
different amounts when passing through a lens. While lens designers deal with
chromatic aberration, there are tradeoffs and thus some amount of chromatic
aberration will be found in almost all images.
Figure 2.4: Errors in color reproduction of digital video camera. Color repro-
duction error effects are made more obvious by increasing the saturation and
decreasing the brightness of a portion of Figure 2.1. Note the large amount of
purple hue caused by chromatic aberration.
There are image processing techniques that can reduce the effect of chromatic
aberration. However, we have not followed this route since the complexity of
understanding the relationship between aberration and the captured image is
high, and is different for different cameras.
2.2.7 Temporal averaging
To combat some of the above issues with acquired digital images, we simply
average the previous N acquired images. This is temporal averaging in that we
are averaging images taken over different instances of time. Temporal averaging
has the effect of reducing the variance in color at each pixel. Temporal averaging
assumes image noise is truly random.
The output pixel of a temporally averaged image I at column x and row y may
be defined by
I[x, y] = \frac{1}{N} \sum_{i=0}^{N-1} I_i[x, y]    (2.1)

where I_i is the i-th previously acquired image (i.e. i = 0 refers to the current
acquired image). Note that this averaging is performed on each color channel
(e.g. R, G, and B) separately.
While temporal averaging has an advantage of reducing the impact of color repro-
duction issues and sensor noise, it does effectively lower the capture rate N -fold.
Also, temporal averaging introduces motion blur for fast moving objects. In our
system, we have used N = 4 as a good tradeoff between increasing image quality
vs. decreasing capture rate.
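The listing below is a minimal sketch of this temporal averaging using OpenCV's accumulation
functions; the function name averageFrames and the use of a std::deque frame history are
illustrative and not part of the server code itself.

#include <deque>
#include <opencv/cv.h>

// Sketch: average the N most recent 8-bit BGR frames (N = history.size()).
// A 32-bit float accumulator prevents channel overflow; the result is
// converted back to 8 bits for further processing.
IplImage* averageFrames(const std::deque<IplImage*>& history)
{
    CvSize size = cvGetSize(history.front());
    IplImage* acc = cvCreateImage(size, IPL_DEPTH_32F, 3);
    cvZero(acc);

    for (size_t i = 0; i < history.size(); ++i)
        cvAcc(history[i], acc);                      // per-channel sum over the history

    IplImage* avg = cvCreateImage(size, IPL_DEPTH_8U, 3);
    cvConvertScale(acc, avg, 1.0 / history.size());  // divide each channel by N
    cvReleaseImage(&acc);
    return avg;
}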
2.2.8 Weighted moving averages
To reduce the effect of short-term changes in measurements due to image issues,
we use a weighted moving average (WMA) to focus on measurement trends. With
a WMA N measurements are stored in a history. The WMA can be queried for
an overall average at any given time. The most influence or weight is given
to the most recent measurement, with a linear falloff for the weights of older
measurements. We empirically determined that the best quality vs. latency tradeoff
exists with a WMA history length of N = 2 when used in addition to temporal averaging with N = 4.
Each WMA can, however, be tailored to fit specific scenarios.
\[ \mathrm{WMA}_M = \frac{n\,p_M + (n-1)\,p_{M-1} + \cdots + 2\,p_{M-n+2} + p_{M-n+1}}{n + (n-1) + \cdots + 2 + 1} \tag{2.2} \]
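A minimal C++ sketch of such a weighted moving average follows; the class name and interface
are illustrative, but the weighting matches Eq. 2.2.

#include <deque>
#include <cstddef>

// Linearly weighted moving average over the last n measurements (Eq. 2.2):
// the newest measurement receives the largest weight, the oldest weight 1.
class WeightedMovingAverage {
public:
    explicit WeightedMovingAverage(std::size_t n) : n_(n) {}

    void add(double p) {
        history_.push_back(p);
        if (history_.size() > n_) history_.pop_front();
    }

    double value() const {
        double num = 0.0, den = 0.0;
        for (std::size_t i = 0; i < history_.size(); ++i) {
            double w = static_cast<double>(i + 1);   // oldest -> 1, newest -> size
            num += w * history_[i];
            den += w;
        }
        return den > 0.0 ? num / den : 0.0;
    }

private:
    std::size_t n_;
    std::deque<double> history_;
};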
2.2.9 Manual camera control
Modern webcams and consumer grade digital video cameras offer a good amount
of control of the parameters of image formation. We recommend disabling au-
tomatic focus, automatic exposure, and similar camera-controlled automatic cal-
ibrations, opting instead for manual calibration per scene prior to image acqui-
sition. Automatic exposure is a common feature in many modern cameras. It
automatically controls the iris, shutter and gain to adjust the brightness of ac-
quired images depending on the brightness of sensed objects.
Finally, we should note that we are effectively repurposing webcams for a different
application than the one they were engineered for. Webcams are typically used for low bitrate video
conferencing, and not for endeavors requiring high precision and color reproduc-
tion (e.g. photogrammetry in general).
2.3 On stereo vision
Binocular or stereo vision, as in the human vision system, offers two slightly different
views of the same scene. Our brains work out the disparity between the views
of our left and right eyes and infer a sense of depth to points visible in both
views. There has been much research in duplicating the general stereo vision
configuration in computer vision by using two cameras and processing the images
simultaneously captured by both.
Since the major goal of our system is to reconstruct the 3D pose of the billiards
cue stick, it follows that stereo vision might be an applicable technique. However,
we have focused on using a single camera. A single camera is less expensive and
does not require the user to calibrate the stereo rig. A stereo system also requires
much more computational power in order to process two times the number of
pixels compared to a single-camera solution. It is also not clear that processing
a depth image (where each pixel contains the distance from the camera’s image
plane to the scene point) is a better tradeoff than using the techniques discussed
in this thesis since even the best performing stereo vision algorithms do not find
perfect depth calculations at every pixel. As we shall see, the techniques herein
discussed are fairly simple to implement and do not require multiple cameras.
2.4 On tracking algorithms
The function of computer vision-based tracking algorithms is to reliably detect a
moving object through a sequence of images. The main advantage of using tracking
is that tracking algorithms often provide a search window that focuses the atten-
tion of object detectors. Instead of searching the entire image, an object detector
need only search a smaller section of each image, thus increasing performance if
the object is detected within the window as less data needs to be processed. If
an object detector does not detect the target object in the search window, the
search window is generally translated or enlarged and the detection is retried.
This process is repeated until either the object is detected, the search window
grows larger than some threshold size, or a number of iterations passes without
a successful object detection.
Since our system has a billiards cue that moves when the user takes a shot,
one could easily see how tracking algorithms could be applicable to our system.
However, we have found through a number of experiments that several tracking
algorithms known to perform well fail to reliably track our cue stick. Others have
experienced similar issues:
In practice, even the best methods suffer such failures all too often, for
example because the motion is too fast, a complete occlusion occurs,
or simply because the target object moves momentarily out of the
field of view. (29)
While others (1) have had success with the CAMSHIFT algorithm (5), the poor
color reproduction quality of inexpensive digital video cameras as
well as the relatively low rate of capture coupled with a fast moving cue stick
causes several insurmountable issues.
• Search window explosion. The search window tends to enlarge out of control
when a decent histogram match is not found. Then, the window does not
tend to shrink if a good candidate area is actually in the image.
• The tracked histogram is updated over time. This caused the tracker to latch onto new
objects periodically. For instance, since the user’s hand can closely match
in color the cue stick, CAMSHIFT would periodically stop tracking the
cue stick and instead track the user’s hand. Granted, wearing a glove or
performing hand-segmentation could help this issue but forcing the user to
wear a glove is not very natural.
• A fast-moving cue stick tended to confuse the search. This was possibly
due to motion blur (the effect of not capturing images of a moving object
fast enough) temporarily altering the histogram of the search window.
• The need to choose an initial search window isn’t an overly natural user
interaction.
While there are a number of other tracking algorithms one could attempt to adapt
to our system (29), we felt that the need of tracking algorithms to be reinitialized
when detection fails makes the advantages of their use rather low. This is because
our fast moving cue stick tended to confuse trackers, thus requiring reinitialization
quite often. Furthermore, our scene layout is such that we can use methods other
than tracking to detect our TUI’s tokens.
Total, or even partial occlusion of the tracked objects typically results
in tracking failure. The camera [or objects] can easily move too fast
so that the images are motion blurred; the lighting during a shot
can change significantly; reflections and specularities may confuse the
tracker. Even more importantly, an object may drastically change its
aspect very quickly due to displacement. (29)
2.5 Feature detection methods
We define a feature as the image of some object or shape. While feature detection
refers to simply determining if the feature is present in an image, feature extrac-
tion is concerned with determining information from the feature (e.g. orientation,
position, size).
Figure 2.5: Server process logic flow.
In (33), a set of guidelines for feature detectors is specified based
on invariance properties. These guidelines state that a successful feature detector
should be able to detect features under varying position, rotation, or scale (size).
We shall discuss how we detect image features of our defined tokens and extract
information about their position and orientation in the following sections.
There exists a plethora of research (33; 40) on detecting shapes in images that we
could benefit from in our system if we didn’t have to deal with occlusions and fast
moving objects. Occlusions in our system can be caused by the player’s hands
holding the cue stick at various locations along the shaft of the cue stick. These
two issues have caused us to focus on a flexible technique of finding regions of
expected color and extracting information about these regions in order to find the
attributes of our tokens. Using color matching as our primary feature necessitates
the use of a color video camera as the image source as well as the use of visible (as opposed to
infrared) light.
It is worth noting that we do not require the user to use markers or special
equipment. Instead, the tokens of our TUI are the natural objects a user would
normally interact with in the context of a billiards game. In many other systems,
markers or fiducials are used to aid in the feature detection and extraction stage.
(29) notes that “even after more than twenty years of research, practical vision-
based 3D tracking systems still rely on fiducials because this remains the only
approach that is sufficiently fast, robust, and accurate.” While this may be true
in the general case, our system is a single example of how marker-less tokens are
easily detected and moreover that such a system allows for a more natural user
interface by simply letting user interface elements be. Any physical tagging or
marking of a cue stick for example would cause players to compensate for the
change and would likely be looked at as an annoyance.
The issues of acquired digital images such as error-prone color reproduction as
described in Section 2.2 may cause some concern since we have decided to use
color as our primary image feature of interest. However, we shall describe how
we use a flexible color model in order to deal with the issue of color reproduction
quality, changes in illumination, and shadows that may overlay tokens.
2.5.1 Color representation
Since our system uses the visible spectrum of light for feature detection, it makes
sense to discuss how we represent color in a digital form. A common representa-
tion is red-green-blue (RGB) in which a color is represented as a mixture of some
amount of red, green, and blue. For instance, the color yellow can be represented
in RGB as full red, full green, and no blue. There are many other color repre-
sentations or color spaces useful in various contexts. Since we are interested in
the color of objects under varying illumination, we would like a color space that
helps us define colors in which color or hue is constant even for darker or brighter
versions of the color.
Figure 2.6: Color mixtures in RGB.
One such color space is the hue-saturation-value (HSV) color space (39). Hue
represents what we normally consider the color. Saturation refers to how vibrant
the color is. Value represents how bright the color is; a low value is found in
darker colors. Note that hue is independent of saturation and value. That is, an
object with a green hue has a green hue even when dark or bright.
Let us define the flexible color model of a token as
\[ F = \begin{pmatrix} H \pm \Delta H \\ S \pm \Delta S \\ V \pm \Delta V \end{pmatrix} \tag{2.3} \]
where H is the mean hue of the FCM, S the mean saturation, V the mean value, ∆H
the allowed variance in hue from H, ∆S the allowed variance from S, and ∆V the
allowed variance from V . In practice, we properly handle the fact that the hue
wraps around from 359◦ to 0◦.
Since Intel’s OpenCV library (18) acquires color images whose pixels are specified
in BGR format (BGR is logically equivalent to RGB), we need to convert a BGR
pixel to a HSV pixel. Converting an entire BGR image to a HSV image entails
simply converting each BGR pixel to a HSV pixel. Let min be the minimum
value of the BGR pixel components b, g, and r. Let max be the maximum value
of the BGR pixel components.
\[
h = \begin{cases}
0 & \text{if } \max = \min \\[2pt]
60^\circ \times \dfrac{g - b}{\max - \min} + 0^\circ & \text{if } \max = r \text{ and } g \ge b \\[2pt]
60^\circ \times \dfrac{g - b}{\max - \min} + 360^\circ & \text{if } \max = r \text{ and } g < b \\[2pt]
60^\circ \times \dfrac{b - r}{\max - \min} + 120^\circ & \text{if } \max = g \\[2pt]
60^\circ \times \dfrac{r - g}{\max - \min} + 240^\circ & \text{if } \max = b
\end{cases}
\]
\[
s = \begin{cases}
0 & \text{if } \max = 0 \\[2pt]
\dfrac{\max - \min}{\max} = 1 - \dfrac{\min}{\max} & \text{otherwise}
\end{cases}
\qquad
v = \max
\tag{2.4}
\]
h is then normalized to the range [0, 360).
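In practice we do not convert pixels by hand; the sketch below performs the conversion with
OpenCV's cvCvtColor. Note that for 8-bit images OpenCV stores hue as H/2 (0 to 179), a detail
any color model built on these values must account for. The wrapper name is illustrative.

#include <opencv/cv.h>

// Sketch: convert an acquired BGR frame to HSV so each pixel's (h, s, v)
// components can be tested against a flexible color model.
IplImage* bgrToHsv(const IplImage* bgr)
{
    IplImage* hsv = cvCreateImage(cvGetSize(bgr), IPL_DEPTH_8U, 3);
    cvCvtColor(bgr, hsv, CV_BGR2HSV);
    return hsv;
}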
Figure 2.7: HSV color space.
We only require a color model for the cloth/felt token as well as the shadow of
the cue stick. We use other techniques based on thresholding, binary morphology,
and connected component analysis to detect other tokens such as the cue stick
and planar object.
2.5.2 Thresholding
(a) Color image. (b) Grayscale image. (c) Binary image. Shown is a result of thresholding Figure 2.8b with γ = 150 (out of 255 max.).
Figure 2.8: Example of thresholding.
We may find various feature detection tasks easier if we first reduce the grayscale
image into a binary image and then process the pixels of the binary image (see
Section 1.4 for an explanation of binary and grayscale images). This is often
performed using a technique known as thresholding. A grayscale image is thresh-
olded by rounding all pixel intensities greater than a given threshold γ up to
white (binary image pixel value of 1) and less than the threshold to black (binary
image pixel value of 0). Let G be a grayscale image and B be a binary image.
Then,
\[ B[x, y] = \begin{cases} 1 & \text{if } G[x, y] \ge \gamma \\ 0 & \text{if } G[x, y] < \gamma \end{cases} \tag{2.5} \]
where I[x, y] denotes the value of pixel at column x and row y in image I. There
are several methods to determine a desired value for γ including adaptive tech-
niques that first analyze the intensities of all pixels of G prior to thresholding. For
our prototype however, we have simply found a desired value for γ empirically.
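A sketch of this operation using OpenCV's cvThreshold is shown below; the value γ = 150 matches
the example of Figure 2.8 and the wrapper function name is illustrative.

#include <opencv/cv.h>

// Sketch: threshold a grayscale image into a binary image. Pixels brighter
// than gamma become white (255); all others become black (0).
void thresholdFrame(const IplImage* gray, IplImage* binary, double gamma)
{
    cvThreshold(gray, binary, gamma, 255, CV_THRESH_BINARY);
}

// Example use with the empirically chosen threshold from Figure 2.8:
//   thresholdFrame(grayImage, binaryImage, 150);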
NW N NE
W P E
SW S SE
Table 2.1: 4-way or 8-way neighbors of a pixel P .
2.5.3 Contour finding and analysis
We may wish to find regions, blobs, or areas of pixels in a binary image that
are spatially connected to one another in order to process these connected pixels
as units or groups. This is another tactic to reduce complexity in a computer
vision system. There are several names of a general set of methods to help us find
such regions in binary images such as connected components labeling (40), blob
detection, and contour finding (18). We shall adopt the term contours as used in
Intel’s OpenCV library as our implementation uses this library for detection of
such connected areas.
A pixel can be considered to be connected to another in a binary image if it shares
the same pixel value (both are white or both are black). The neighbors of a pixel
determine where a contour-finding algorithm searches for such connections. If we
envision a pixel centered in a compass, we can define the neighbors of the pixel
using either 4-way connections or 8-way connections. In a 4-way connection,
the pixel’s neighbors are found at the directions N , S, E, and W . In an 8-way
connection, the pixel’s neighbors are found at N , NE, E, SE, S, SW , W , and
NW (see Table 2.1).
The goal of a contour-finding algorithm is to find a set of contours given a binary
image and a choice of either 4-way or 8-way connections between pixels. There
are numerous algorithms to perform such contour finding, each with their own
performance tradeoffs and behaviors. We will thus refer the interested reader to
(18; 40) for details. In our project, we have relied upon the OpenCV function
cvFindContours to find contours.
If we allow contours to be nested, we can recursively find contours within other
contours. We call these interior contours child contours of a parent contour. Note
that parent and child will have opposite pixel values (white vs. black), alternating
with each nesting. In other words, a contour need not only be comprised of
connected white pixels.
(a) Binary image before contour finding. (b) Binary image after contour finding. Note the red border around the parent contour and the purple border around the child contour.
Figure 2.9: Example of contour finding.
Once we have a set of contours, we can derive several properties of the contour
as a whole. This includes such properties as the contour’s area, perimeter, center
of mass, bounding rectangle, oriented bounding box, minimum area rectangle,
circularity, and so on. We have created a small C++ library to help filter and
sort a list of contours based on predicates of a contour’s properties. For instance,
the following is a small code listing that finds a hierarchical set of contours, filters
the list by removing contours whose area is below a hard-coded threshold, and
then finds the largest child contour of the first contour remaining.
Listing 2.1: Example code for contour processing.
CvSeq* root;
CvMemStorage* memStorage = cvCreateMemStorage(0);

// Find a full tree of contours in the binary image.
cvFindContours(binaryImage, memStorage, &root,
               sizeof(CvContour), CV_RETR_TREE,
               CV_CHAIN_APPROX_SIMPLE);

// Collect the top-level contours and drop those with area below a
// hard-coded threshold of 100 pixels.
ContourVec contours;
siblingsOf(root, contours);
removeContours(contours, areaLT(100));
if (contours.size() == 0) throw "not found";

// Find the largest child contour of the first remaining contour.
ContourVec childContours;
childrenOf(contours[0], childContours);
if (childContours.size() == 0) throw "no children";
ContourInfo& largestChild = largestByArea(childContours);
// ...
cvReleaseMemStorage(&memStorage);
2.5.4 Binary image morphology
It is typical that thresholding will not produce binary images as cleanly as
expected. For instance, small bright highlights on an object might become white
in the resultant binary image even if this is undesired. We can alter the shape of
contours in the binary image using morphological operators such as erosion and
dilation. An erosion of a binary image has the effect of shrinking the contours
from their borders inwards while a dilation has the effect of expanding such
contours from their borders outwards. Morphological operators are useful in
removing contours that may be considered noise as an erosion could erode a
small contour to nothing. A dilation may be useful for filling in “holes” in a
contour (e.g. removing child contours by filling them in with the same pixel
value as the parent). We have only scratched the surface of a broader subject
called mathematical morphology. See (40) for a more in-depth discussion.
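For instance, the following sketch performs one erosion followed by one dilation (a morphological
"opening") with OpenCV's default 3 x 3 rectangular structuring element, removing contours small
enough to erode away completely. The wrapper name is illustrative.

#include <opencv/cv.h>

// Sketch: suppress small noise contours in a binary image, in place.
void removeSmallNoise(IplImage* binary)
{
    cvErode(binary, binary, NULL, 1);   // small contours erode to nothing
    cvDilate(binary, binary, NULL, 1);  // surviving contours regain roughly their size
}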
Figure 2.10: Example of erosion morphological operator. The left image shows an
original binary image. The right image shows the effects of the erosion operator
with rectangular structuring element applied to the image on the left. Notice
how small contours have completely eroded.
2.5.5 Flexible color matching
A useful image filter searches for pixels of a color image that match a color model
such as our flexible color model (see Section 2.5.1). Given an input image I, an
initially clear (all black or value 0) binary image B with equal dimensions as I,
and a flexible color model F , we wish to set each matching pixel of I to white
(value 1) at the corresponding pixel location in B. Thus,
\[ B[x, y] = \begin{cases} 1 & \text{if } I[x, y] \in F \\ 0 & \text{if } I[x, y] \notin F \end{cases} \tag{2.6} \]
where I[x, y] ∈ F is true when one of the following is true
\[
\begin{aligned}
\text{if } F_h + F_{\Delta h} \ge 360 &: \quad h_{xy} \ge 360 - (360 - F_h + F_{\Delta h}) \;\lor\; h_{xy} < F_{\Delta h} - (360 - F_h) \\
\text{if } F_h - F_{\Delta h} < 0 &: \quad h_{xy} \ge 360 - F_{\Delta h} \;\lor\; h_{xy} < F_h + F_{\Delta h} \\
\text{else} &: \quad F_h - F_{\Delta h} \le h_{xy} < F_h + F_{\Delta h}
\end{aligned}
\tag{2.7}
\]
and both of the following are true
\[
F_s - F_{\Delta s} \le s_{xy} < F_s + F_{\Delta s}, \qquad
F_v - F_{\Delta v} \le v_{xy} < F_v + F_{\Delta v}
\tag{2.8}
\]
where hxy, sxy, and vxy are the components of HSV pixel I[x, y].
The complex structure for handling matching on hue is based on the fact that
hue wraps around between 359◦ and 0◦ and the flexible part of an FCM (i.e. F∆h)
might wrap as well.
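The sketch below shows how such a match can be realized with OpenCV's cvInRangeS, ORing two
range tests when the hue interval wraps around 0◦. The FCM struct, the function name, and the
mapping of saturation and value to OpenCV's 8-bit range are assumptions of this sketch, not the
server's actual data structures.

#include <cmath>
#include <opencv/cv.h>

// Hypothetical FCM struct: hue in degrees (0-360); saturation and value
// already scaled to OpenCV's 8-bit range (0-255).
struct FCM { double h, s, v, dh, ds, dv; };

// Sketch: build a binary mask (8-bit, single channel) of pixels in an HSV
// image that match the FCM (Eqs. 2.6-2.8). OpenCV stores 8-bit hue as H/2,
// hence the divisions by 2.
void matchFCM(const IplImage* hsv, const FCM& f, IplImage* mask)
{
    if (f.dh >= 180.0) {
        // Delta-h covers the whole hue circle (e.g. the shadow FCM): ignore hue.
        cvInRangeS(hsv, cvScalar(0,   f.s - f.ds, f.v - f.dv),
                        cvScalar(180, f.s + f.ds, f.v + f.dv), mask);
        return;
    }

    double lo = std::fmod(f.h - f.dh + 360.0, 360.0);
    double hi = std::fmod(f.h + f.dh, 360.0);

    CvScalar sLo = cvScalar(0, f.s - f.ds, f.v - f.dv);
    CvScalar sHi = cvScalar(0, f.s + f.ds, f.v + f.dv);

    if (lo <= hi) {
        // Hue interval does not wrap around 0 degrees: a single range test.
        sLo.val[0] = lo / 2.0;  sHi.val[0] = hi / 2.0;
        cvInRangeS(hsv, sLo, sHi, mask);
    } else {
        // Hue interval wraps: match [lo, 360) or [0, hi] and OR the results.
        IplImage* tmp = cvCreateImage(cvGetSize(hsv), IPL_DEPTH_8U, 1);
        sLo.val[0] = lo / 2.0;  sHi.val[0] = 180.0;
        cvInRangeS(hsv, sLo, sHi, mask);
        sLo.val[0] = 0.0;       sHi.val[0] = hi / 2.0;
        cvInRangeS(hsv, sLo, sHi, tmp);
        cvOr(mask, tmp, mask);
        cvReleaseImage(&tmp);
    }
}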
2.5.6 Convex hulls and convexity defects
One of the properties of a contour (that is directly supported by OpenCV) is the
contour’s convex hull. Intuitively speaking, one can think of a convex hull as a
rubber band wrapped around the border of a shape.
(a) Cue stick and cloth/felt. (b) Resultant B image.
Figure 2.11: The left image (a) shows an image acquired from digital video
camera containing the cloth/felt token with cue stick token hovering above it.
The right image (b) shows the resultant image B in which white pixels match a
FCM designed for the cloth/felt token. Note the lack of a cue stick shadow due
to ambient room lighting.
A convex polygon is one
in which a line connecting any pair of vertices does not intersect an edge or side
of the polygon. If such a polygon fails to meet this requirement, the polygon is
not convex and the convexity defects can be found. These defects are the regions
along the perimeter of the polygon that cause the polygon to fail to be convex.
If we consider a contour to be a potentially many-sided polygon, we can find the
convex hull and any convexity defects for contours in our binary images.
We have already seen an example of a convexity defect. The cloth/felt contour in
Figure 2.11b is convex except for the single convexity defect caused by the image
of the cue stick jutting in from the top. We exploit this scenario in order to find
the cue stick and its shadow. Note that OpenCV provides an implementation
of both convex hull calculation and convexity defect discovery. A result of the
defect discovery is a point within the defect that is furthest distance-wise to the
convex hull. This can be seen in the right-most image of Figure 2.12.
Figure 2.12: Example of convexity defect. The left image shows a convex shape
of a black square. The middle image shows a convexity “defect” in the white
area; the white area makes the otherwise convex square a non-convex shape. The
right image shows the convex hull of the original square in red and the deepest
point of the convexity defect highlighted in cyan.
2.6 Pose driven feature detection
We have described the feature detection methods we use in our prototype TUI.
Feature detection lays the foundation for higher level information extraction. In
our system, the information we wish to extract from detected features is the 3D
pose of the cue stick. Let us then work backwards by first defining the information
we wish to extract (elements of the cue stick pose) followed by the features we
will need to detect in order to estimate the pose of the cue stick. See Table 2.2 for
a summary of cue stick pose elements. See Table 2.3 for a summary of features
detected using the aforementioned techniques.
Element | Description | See also
h | vertical offset (or height) of the tip of the cue stick above the user's desk | fig. 2.16
θ | pitch of the cue stick shaft | fig. 2.21
ψ | yaw of the cue stick shaft | fig. 2.24
distw | distance between the cue stick's tip and the cue ball in world units | fig. 2.20
a, b | expected 2D offset in ball space from the cue ball center point | fig. 2.25
Table 2.2: Elements of cue stick pose.
2.6.1 Planar object detection
The planar object of interest is a thin rectangular shape with high contrast edges.
To detect this planar object, we first we threshold the image with γ = 192
(we are assuming a bright object for simplicity). We then find contours in the
thresholded image. We apply a contour simplification algorithm called polygonal
approximation which essentially removes small fluctuations along the contour. For
instance, given a contour that roughly resembles a rectangle, applying contour
polygon approximation would result in a contour with 4 sides at near right angles
to one another. See OpenCV’s cvApproxPoly function for implementation details.
Feature | Helps find | Detection | Description
felt/cloth | cue stick, cue ball, cue stick shadow | auto | container
Tr | h | user | top point of reference object
Br | h | user | bottom point of reference object
St | h | user | shadow of Tr
cue tip | distw | auto | tip of the cue stick
ˆshaft | θ, ψ | auto | cue stick shaft
ˆshadow | θ | auto | shadow of cue stick shaft
Sc | ˆshadow | auto | shadow of the tip of the cue stick
planar object | h | auto | edges are used to find parallel lines on plane of desk
parallel lines on plane of desk | h | auto | used to find vanishing points of plane of desk and thus its vanishing line too
Table 2.3: Feature summary. Auto features are detected every frame. User
features are specified by the user, one time, at server initialization time.
FCM component | Value
h | 80◦
s | 0%
v | 0%
∆h | 60◦
∆s | 100%
∆v | 100%
Table 2.4: Flexible color model for the cloth/felt.
We filter the resultant set of contours, keeping those whose polygonal approxima-
tion results in a polygon with 4 sides that are at ±5◦ from a right angle to each
other. The largest such rectangular polygon is considered to be the planar object.
We are only interested in the planar object for its two sets of parallel edges or
sides. These two pairs of sides are used directly as the parallel lines features that
we require.
2.6.2 Cloth/felt detection
To detect the cloth/felt tangible we use the technique put forth in Section 2.5.5.
The result is a binary image containing contours that match the green FCM
devised for the cloth/felt material used in our prototype TUI. We filter small
contours by erosion and assume the largest matching contour is the cloth/felt.
We should note that we find a full tree of contours. That is, we are interested in
the children of the cloth/felt contour as well.
2.6.3 Cue ball detection
We have detected the cloth/felt contour and its child contours. Since the cue ball
is white, it does not match the color model of the cloth/felt. Thus, the white
pixels of the cloth/felt contour will have a circular black hole where the cue ball
is located. We find the cue ball simply by assuming it is the most circular child
Figure 2.13: Example of cue ball child contour. Here we see that a cue ball has
been placed on the cloth/felt and does not match the FCM defined for matching
the cloth/felt. The pixels of the cue ball's image are thus black. These pixels
comprise a child contour of the cloth/felt contour. Also note that the lack of a shadow
of the cue stick in this example is due to a large amount of ambient light in the
scene.
contour of all such contours. We remove child contours whose contour area is less
than some empirically determined threshold. We use the metric area/radius for
a measure of a contour’s circularity where area is the area of the contour (i.e.
number of white or black pixels) and radius is the radius of a circle that most
tightly encloses or fits the points that comprise the contour. OpenCV provides a
function cvMinEnclosingCircle to calculate such a circle. Circular contours will
maximize this metric. Once we have found the cue ball contour, we flood fill its
contour in the binary image (the result resembles Figure 2.11b) so that further
processing of the cloth/felt does not get confused by the cue ball contour.
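A sketch of this circularity measure, using OpenCV's contour area and minimum enclosing circle
routines, is shown below; the wrapper name is illustrative.

#include <cmath>
#include <opencv/cv.h>

// Sketch: circularity metric used to select the cue ball contour
// (contour area divided by the radius of its minimum enclosing circle).
double circularity(CvSeq* contour)
{
    CvPoint2D32f center;
    float radius = 0.0f;
    if (!cvMinEnclosingCircle(contour, &center, &radius) || radius <= 0.0f)
        return 0.0;
    return std::fabs(cvContourArea(contour)) / radius;
}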
2.6.4 Cue stick detection
We use the same technique as in the cloth/felt detection to detect shadows.
However, we first restrict the region of interest (ROI) to the axis-aligned bounding
rectangle of the cloth/felt. The rectangular area shown in Figure 2.11a bears a
close resemblance to the bounding rectangle of the cloth/felt. The color model
for the cue stick’s shadow simply looks for areas of low value (i.e. dark) without
regard for hue or saturation. The ∆v used was found empirically and can be
adjusted per scene to account for varying amounts of ambient lighting.
FCM component | Value
h | 0◦
s | 0%
v | 0%
∆h | 360◦
∆s | 100%
∆v | 35%
Table 2.5: Flexible color model for the cue stick's shadow.
The
result is an image with dimensions equal to the ROI that contains white pixels
where shadows were discovered inside the bounding rectangle of the cloth/felt
contour. We simply take the largest contour as the shadow contour. We then
draw this shadow contour filled with white into the cloth/felt image. We also fix
the convexity defects of the cloth/felt contour by drawing a thick line along the
perimeter of the convex hull of the cloth/felt. This has the effect of making the
cue stick a child contour of the cloth/felt instead of a convexity defect. We then
refind contours in the ROI and take the largest child of the cloth/felt contour as
the cue stick contour. These steps are illustrated in Figure 2.14.
Once we have the contours that represent the cue stick and its shadow (at least the
portions that are physically above the cloth/felt), we derive a vector for each that
describes the image space direction of each contour. To do this, we consider the
contour’s perimeter as a set of 2D points and extract these points for the shadow
and cue stick contours respectively. We then fit a line to each of these points using
the least-squares error criteria (40). Least-squares offers good performance but
can be inaccurate in general. However, we smooth all feature detection values by
a weighted moving average (WMA) which tends to all but eliminate the influence
of outliers at the expense of a slight latency in measurements. See Section 2.2.8
for more information about WMAs. The line found from the cue stick contour
gives us ˆshaft while the line found from the shadow of the cue stick gives us
ˆshadow (see Table 2.3).
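The line fitting itself can be performed with OpenCV's cvFitLine; the sketch below returns only
the unit direction component, which is what ˆshaft and ˆshadow require. The wrapper name is
illustrative.

#include <opencv/cv.h>

// Sketch: fit a 2D line to a contour's points by least squares (CV_DIST_L2)
// and return its unit direction vector.
CvPoint2D32f contourDirection(CvSeq* contour)
{
    float line[4];  // (vx, vy, x0, y0): unit direction followed by a point on the line
    cvFitLine(contour, CV_DIST_L2, 0, 0.01, 0.01, line);
    return cvPoint2D32f(line[0], line[1]);
}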
Note that from the convexity defect detection we have been given the deepest
point within the defect. Given our scene layout, the deepest such defects will be
equal to the tip of the cue stick and the shadow thereof. We have thus found the
cue tip and Sc from Table 2.3.
(a) Enclosing defects to make child contours.
(b) Shadow detected. (c) After shadow removal.
Figure 2.14: Example of cue stick shadow detection. The left image (a) shows
the cloth/felt contour with cue stick shadow and cue stick as child contours. Note
how the convex hull perimeter has been stroked to remove the convexity defects.
The middle image (b) shows the ROI of the cloth/felt contour with shadows
detected (white) using a FCM (see Table 2.5). The right image (c) shows the
cloth/felt image after shadow removal which leaves the cue stick child contour.
2.7 Cue stick pose estimation
A common task in a computer vision system is to find the 3D pose of objects
of interest. This involves extracting image features and matching these image
features to object features. In the feature detection stage, we discovered features
Figure 2.15: Screenshot of demo of feature detection. The overlaid graphics are
rendered atop acquired video images: the red arrow represents ˆshaft, the green
arrow represents ˆshadow, and the magenta circle (C) outlines the cue ball. Note
the slight errors: the red arrow does not perfectly match the heading of the cue
stick and C does not perfectly match the cue ball boundary. See Section 2.2 for
reasons behind such errors (other than those implicit with our methods).
of interest in image space. In this pose estimation stage, we analyze these features
to derive the full 3D pose of the cue stick token using projective geometry and
basic algebra.
A single dimension is lost under perspective projection, namely the distance from
the projection plane to each point being projected. In our system, the images
captured by our digital video camera are perspective projections of the scene onto
the camera’s image plane. The depth at each scene point is thus lost; all we are
given is the 2D location of the point in the captured image.
It is nontrivial to reconstruct the full 3D pose of imaged objects given only 2D
information. There are several techniques used for inferring 3D information of
image points such as stereo vision or range finding (40). However, such techniques
are quite complex and generally expensive to implement. We restrict our 3D pose
estimation to analysis of digital images captured by a single camera.
There are several canonical attributes of a cue stick that we wish to estimate from
extracted image features. These attributes comprise the state of the cue stick at
a given moment in time.
2.7.1 Estimation of cue tip vertical offset
Since we are utilizing a top-down view of the scene and lose the sense of depth from
the camera to any sensed scene point, we effectively lose the height of scene points
positioned between some reference plane and the camera. What we are attempting to
find is the vertical offset of the cue stick's tip from the plane of the user's desk.
Given that the cue ball will rest on this plane and that the diameter of the cue
ball in world units is known a priori, we can relate the vertical offset of the cue
stick’s tip above the plane of the desk to find the vertical offset of the cue stick’s
tip from the center of the cue ball. This is in turn used by the simulation to
position the virtual cue stick.
We have successfully adapted the techniques described in (36) to estimate the
vertical offset of the cue stick’s tip over the plane of the desk. The algorithm
requires the locations in the image of several scene points as well as parallel
lines on the reference plane. See figure 2.16 for a depiction of the required scene
points. Most of these scene points are automatically discovered during the feature
detection stage.
Figure 2.16: Scene points used in cue stick tip height estimation. Here the
reference object is a beverage can. The vanishing line of the plane of the desk
is the line that connects the two vanishing points at the horizon. The vanishing
points are derived from the parallel edges of the planar object.
In (36), the height of a soccer ball above the plane of the playing field is de-
termined by analyzing frames of video of a soccer game. The horizon line of
the plane of the playing field is found by fitting two sets of parallel lines to the
boundary line markers on the playing field. The lines of each set of parallel lines
intersect at a point on the horizon. We thus have two such points on the horizon.
The line that connects these two points is the horizon line of the plane of the
playing field. Several subsequent lines are derived by connecting all mentioned
points (see Table 2.6). The reference object in (36) is chosen as one of the soccer
players on the field.
In our system, we find parallel lines as the two pairs of parallel lines of the planar
object. Our reference object is a soft drink can. We have adapted the geometry
of (36) to our particular setup. We illustrate the geometry in Figures 2.17 and
2.18. We next compute the labeled lines in the figures using line-line intersections
(see Table 2.6).
Figure 2.17: Setup for estimation of h (part 1). This is an adaptation of Figure
3 in (36).
We continue to derive the points as referenced in Figure 2.18.
• v is the intersection point of lines l8 and l9.
• ri is the intersection point of lines l7 and l8.
• Pi is the intersection point of lines l6 and l9.
Figure 2.18: Setup for estimation of h (part 2). This is an adaptation of Figure
4 in (36).
Line | Description
l1 | through reference shadow (hence l1 = Br − St)
l2 | through Sc and parallel to l1; hence l2 intersects l1 on the vanishing line
l3 | through Tr and Sc
l4 | through the cue tip and Tr
l5 | through Br and the intersection point of lines l3 and l4; the intersection point of l5 and l1 is the projection of the cue tip on the plane of the desk, which we refer to as Pb
l6 | through Tr and the intersection point of l5 and the vanishing line
l7 | through the cue tip and the intersection point of l5 and the vanishing line
l8 | through Tr and Br
l9 | through Pb and the cue tip
Table 2.6: Scene lines required for the estimation of h from (36). See Figures 2.17
and 2.18 for a visual understanding.
The cross ratio of the points we have found on l8 is equal to the cross ratio of
the points on l9. Thus,
\[ \langle B_r, r_i, T_r, v \rangle = \langle P_b, P_t, P_i, v \rangle = \frac{h_r}{h_r - h} \tag{2.9} \]
where \( h_r = |T_r - B_r| \), and \( \langle a, b, c, d \rangle \) is the scalar cross ratio
\[ \langle a, b, c, d \rangle = \frac{|a - c|\,|b - d|}{|a - d|\,|b - c|} \tag{2.10} \]
We thus can determine the vertical offset h of the cue stick’s tip above the plane
of the user’s desk.
\[ h = h_r - \frac{h_r}{\langle B_r, r_i, T_r, v \rangle} \tag{2.11} \]
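A direct transcription of Eqs. 2.9 to 2.11 follows; the point type and function names are
illustrative, and hr is supplied by the caller as the height of the reference object.

#include <cmath>

struct Pt { double x, y; };   // illustrative 2D image point

static double dist(const Pt& a, const Pt& b)
{
    return std::sqrt((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y));
}

// Scalar cross ratio <a, b, c, d> = (|a - c| |b - d|) / (|a - d| |b - c|), Eq. 2.10.
double crossRatio(const Pt& a, const Pt& b, const Pt& c, const Pt& d)
{
    return (dist(a, c) * dist(b, d)) / (dist(a, d) * dist(b, c));
}

// h = hr - hr / <Br, ri, Tr, v>, Eq. 2.11, where hr is the reference object height.
double cueTipHeight(const Pt& Br, const Pt& ri, const Pt& Tr, const Pt& v, double hr)
{
    return hr - hr / crossRatio(Br, ri, Tr, v);
}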
2.7.2 Estimation of the distance between cue stick and
cue ball
The (Euclidean) distance between the cue stick's tip and the cue ball is the
indicator of when a shot occurs: when this distance is 0, the cue stick tip
is touching the cue ball. That is, there is a physical collision between the cue
stick and the cue ball. We describe how we derive the properties of a shot in
Section 2.8. Since we are trying to estimate the full 3D pose of the cue stick
using real world units, we must find the distance between the cue stick’s tip and
the cue ball in such units as well. The client simulation will use this distance in
world units to accurately position the virtual cue stick.
To determine the distance between the cue stick tip and the cue ball, we use a
very simple technique. The cue ball radius in world units, Rw, is known a priori.
Once we’ve detected the cue ball sphere in image space (a circle we shall label C
with radius Cr pixels), we can find the number of world units (e.g. meters) per
pixel with units/pixel = Rw/Cr.
Figure 2.19: Demo of cue stick tip height estimation. h was determined to be 2.09
in. The actual height measured with a tape measure was ≈ 1.99 in. Note that
we use a top-down camera pose when viewing the scene. The camera was pitched
upwards by ≈ 15◦ in this demo to better showcase that which is being estimated.
In the online system, the cue tip projection on the desk’s plane will generally not
be visible in image space since in a top-down view it will be directly underneath
the cue tip. The green and purple lines are the parallel lines as discovered by
detecting the edges of the planar object; see Figure 2.16 for a depiction.
Figure 2.20: Distance between cue stick tip and cue ball. Here, distw = 0.22m.
Since the camera is pitched downwards by −90◦ (due to a top-down pose) and the
normal of the plane of the user’s desk can be assumed to be 90◦ or “up”, points on
the plane of the user’s desk in the camera’s field of view are close to equidistant
to the camera’s image plane. This means that the ratio of units/pixel is not
affected by perspective foreshortening. In other words, every pixel in a captured
frame of video shares the same ratio of units/pixel because we use a top-down
camera pose. If the camera’s pitch ≠ −90◦, then different points on the plane
of the user’s desk would have a different value of units/pixel. Note that the
units/pixel ratio is based on the plane at height Rw above the plane of the user’s
desk. However, since the cue tip height above the plane of the user’s desk doesn’t
often deviate from Rw (in other words, most shots have only little to no draw or
follow), we ignore this small discrepancy in our current implementation.
Since we have detected the cue stick tip point T in image space, we can compute
the (Euclidean) distance in pixels disti between T and the intersection point P
of the cue ball’s projected circle C and a 2D line ˆshaft created from the location
and heading of the cue stick’s shaft. We explained how we found ˆshaft in section
2.6.4. We can thus convert disti to world units with the following, given that the
line from T along ˆshaft intersects C. If no such intersection exists, the system
notifies the user that the cue stick is not pointing at the cue ball.
\[
P = \mathrm{first\_intersection}(\widehat{\mathit{shaft}}, C), \qquad
dist_w = \frac{R_w}{C_r}\,|P - T|
\tag{2.12}
\]
where first_intersection is a function that finds the closest intersection of a
line and a circle (if any).
2.7.3 Estimation of cue stick pitch
We define the angle between the shaft of the cue stick and the surface of the
user’s desk as the cue stick’s pitch. For instance, when θ = 90◦, the cue stick
points down into the surface of the desk.
Pitch has a dramatic role in billiard shot dynamics. A shot taken with a cue
stick that has a small pitch (θ ≈ 5◦) imparts a force to the cue ball whose largest
component is in the direction parallel to the billiards table; i.e. along the
surface of the table. A shot taken with a cue stick that has a large pitch (θ ≈ 90◦)
allows for “jump shots” by driving the cue ball down into the table, causing the
ball to reflect off the surface of the table and into the air. Masse shots cause the
cue ball to undergo extreme spin effects and are made possible by using a large
pitch coupled with an off-center point of collision with the cue ball. In general,
varying degrees of pitch are employed to refine the major directional component
of the force imparted to the cue ball during a shot. Clearly, incorporating pitch
is critical for a realistic billiards simulation as well as more advanced techniques.
(a) Pitched cue stick with θ ≈ 45◦. (b) Top view of same scene. Note that the angle between the cue stick and its shadow is ≈ 45◦ as in (a).
Figure 2.21: Example of cue stick pitch.
Our strategy for estimating θ is based on the observation that the angle between
the shaft of the cue stick and the shaft’s shadow is roughly equal to the angle
between the shaft and the surface of the user’s desk (i.e. θ). Refer to Figure 2.21
for an example of this observation. Note that this observation is most accurate
when the light source is situated next to the cue ball. This has a useful side effect
of orienting the shadow of the cue ball to an area where it will generally not be
combined with the shadow of the cue stick.
We have defined ˆshaft to represent the direction or heading of the shaft of the
cue stick in image space. That is, given the image space location of the tip of the
cue stick T and a point on the shaft of the cue stick M ,
\[ \widehat{\mathit{shaft}} = \frac{T - M}{\|T - M\|} \tag{2.13} \]
We also define a vector ˆshaft′ that is simply ˆshaft but in the opposite direction.
Thus,
\[ \widehat{\mathit{shaft}}' = \frac{M - T}{\|M - T\|} \tag{2.14} \]
Let Ts be the image space point of the shadow of T and Ms the image space
point of the shadow of M . We have already defined ˆshadow as
\[ \widehat{\mathit{shadow}} = \frac{M_s - T_s}{\|M_s - T_s\|} \tag{2.15} \]
Thus,
\[ \theta = \cos^{-1}\!\left( \widehat{\mathit{shaft}}' \cdot \widehat{\mathit{shadow}} \right) \tag{2.16} \]
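A sketch of Eq. 2.16 in code follows; the vector type is illustrative and both inputs are assumed
to be unit vectors.

#include <cmath>

struct Vec2 { double x, y; };   // illustrative 2D unit vector

// theta = acos(shaft' . shadow), Eq. 2.16, with the dot product clamped to
// the valid acos domain to guard against floating point rounding.
double cueStickPitch(const Vec2& shaftPrime, const Vec2& shadow)
{
    double d = shaftPrime.x * shadow.x + shaftPrime.y * shadow.y;
    if (d > 1.0) d = 1.0;
    if (d < -1.0) d = -1.0;
    return std::acos(d);   // radians
}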
2.7.4 Estimation of cue stick yaw
In a real billiards game, the player can orbit 360◦ about the cue ball by moving
herself to various sides of the billiards table to gain a new vantage point. Each
position defines a different ˆshaft, the direction of the cue stick. In aeronautics,
yaw is defined as the angle in radians about the “up” vector in some defined
coordinate system. Since our camera affords us a top-down view of the scene,
we are essentially viewing the scene from a point on such an “up” vector. Thus,
ψ can be defined as the angle its shaft makes with one of the image coordinate
system’s axes. We have chosen the vertical axis as in the default scene layout,
the user takes a shot by moving the cue stick vertically in image space. Thus, we
define ψ to be a relative measure of the difference of the image of the shaft of the
cue stick with the image space vertical axis.
Figure 2.22: Image space (axes u and v).
To find the angle between two unit vectors, we take the inverse of the cosine
of the inner product of the two vectors. Thus, ψ = cos−1( ˆshaft · ˆdown) where ˆdown is the unit image space “down” vector. For instance, given an image space
coordinate system origin in the top left of the image with x extending to the right
and y extending down, ˆdown = 〈0, 1〉.
While we could allow the player to fully orbit the cue ball, we instead limit
ψ to ±45◦ and introduce the notion of the current side of the billiards table. The
user remains in the same general physical location and shoots the cue ball in the
same general direction for all shots taken (i.e. into the ball-catch; see fig. 1.2).
The current side of the table is the side of the virtual billiards table the user is
situated on in the simulation. To change the current side, the user simply presses
a key on the keyboard; the virtual cue stick is rotated by 90◦ from its current
yaw.
Since we wish ψ to be in the range [−45◦, 45◦] we must define when ψ is negative
and when it is positive, since the inner product only tells us the unsigned angle between two vectors.
We have arbitrarily chosen ψ > 0 when the cue stick points down and
to the right in image space and ψ < 0 when down and to the left. We discuss
how ψ is used to orient the virtual cue stick in a later chapter.
Given the above discussion, we can only determine the yaw when we have de-
termined the vector ˆshaft, the direction of the cue stick’s shaft in image space.
To find ˆshaft, we simply subtract the shaft point from the tip point as found in
section 2.7.1.
Figure 2.23: Current side of virtual billiards table.
(a) ψ > 0. (b) ψ < 0.
Figure 2.24: Examples of cue stick yaw. The left image (a) shows a positive yaw
while the right image (b) shows a negative yaw.
2.7.5 Estimation of spin-inducing parameters
Whenever a sphere is struck at an off-center point on its surface, the sphere
will spin due to the physical effect of torque. Controlling the effects of cue ball
spin is generally only a part of an advanced billiard player’s set of skills. A cue
ball struck somewhere on the lower half of the ball will undergo some amount of
backspin. A large amount of backspin could cause the cue ball to travel backwards
after colliding with another billiard ball. This effect, known as draw, is very useful
for strategically positioning the cue ball for the next shot. The same holds true
for follow, the forward motion of the cue ball after colliding with another ball as
caused by excessive topspin. Sidespin (left or right) is commonly referred to as
english and can be introduced to the cue ball by striking it at a point horizontally
off-center. English, draw, and follow can be incorporated to varying degrees in
a single shot as well. Our physical simulation incorporates these spin effects,
possibly satisfying a more advanced user of the system.
While we could let the physical simulation solely determine the effects of spin
given the heading (θ, ψ) and height (h) of the cue stick, we have decided to
estimate the coefficients of spin as part of pose estimation. The reasons for this
decision will become clearer during the discussion of shot detection and analysis
in Section 2.8.
Based on (27), we define a “ball-centric” coordinate frame (i, j, k) that we shall
refer to as “ball space” as depicted in Figure 2.25. Our goal is to estimate a and
b.
Note that a and b are encoded in the state of the cue stick’s pose as a percentage
of the radius of the cue ball. This allows a client simulation to use arbitrary
scaled cue ball models. Thus, a ∈ [−1, 1] and b ∈ [−1, 1]. When the cue
stick is aiming at the left half of the cue ball, a < 0. Also, when the cue tip is
aiming at the lower half of the cue ball, b < 0.
To estimate a we can use the point of intersection P of ˆshaft with the circle C
which represents the cue ball in image space. To estimate b we relate the
vertical offset of the cue tip above the plane of the user’s desk to the vertical
Figure 2.25: Ball space. {i, j, k} defines a coordinate system with origin at the
cue ball center. Shown at Pw is the point of future intersection of the cue stick
with the cue ball in world space (the black line represents the expected stick
trajectory as it collides with the cue ball). a and b define offsets from j and i
respectively.
offset of the center of the cue ball above the plane of the user’s desk. Since we
know the diameter of the cue ball a priori, we can find b.
\[ a = \frac{P_x - C_x}{C_r}, \qquad b = \frac{h - R_w}{R_w} \]
2.8 Shot detection and analysis
A shot occurs in a billiards game world when the user thrusts a cue stick into a
stationary cue ball causing a collision. To sense such a collision using computer
vision, we would ideally like to see the cue stick to cue ball collision at the exact
time of the event in order to derive the properties of the collision accurately.
However, a camera senses the visual spectrum of light at discrete instances of
time, not as a continuous stream of light. It is thus very difficult to directly be a
witness to an event whose duration is very small. High framerate cameras have
been invented to help combat this issue. With such a camera one can see on
the order of thousands of frames per second. However, such cameras are cost-
prohibitive to purchase (or even rent) for an exploratory student project such as
ours. Since one of our goals is a low cost solution, the use of such cameras is
beyond the scope of this project.
Given that most commodity video cameras can capture at a rate of at most 30
frames per second, we will likely not see a collision between the cue stick and cue
ball. This problem is referred to as undersampling a signal. Therefore, we must
be able to detect that a collision has occurred between two consecutive frames of
video. We thus have to estimate the time of collision and cue stick speed given
only the “before” and “after” visual states of the cue stick and cue ball. Indeed, it
is possible that the cue ball could be struck hard enough that it leaves the viewing
area before the camera captures the next frame of video! We assume that a shot
has occurred in this scenario if the cue stick is detected near to where the cue ball
was most recently detected. Moreover, we assume a shot has occurred in general
if the cue ball has moved from its previous location more than ε pixels.
2.8.1 Shot detection
To help us detect shots then, we introduce the distance history. The distance
history is a window or history of cue stick / cue ball distance observations over
time. We are interested in the distance between the cue stick and cue ball simply
because when this distance is 0, there is a collision between the cue stick and
ball; a shot has occurred. Refer to figure 2.20 for a visual depiction of cue stick
to cue ball distance. Note that the distance history is cleared after each shot is
taken.
Figure 2.26: Example of distance history. The distance between the cue stick
tip and cue ball (vertical axis) over time (horizontal axis).
In figure 2.26 we see the measured distance of the cue stick over a 10 second time
period. Let d(τ) denote the distance of the cue stick from the cue ball at time τ .
Note that d(τ) is a signed value; if the cue stick is located at or past the point at
which the cue ball was most recently detected, d(τ) ≤ 0.
In the example distance history seen in 2.26, the player moved the cue stick
toward and away from the cue ball several times before ultimately taking a shot
as seen in the last observation. The various back and forth motions correspond
to a pre-shot routine that many billiards players implement to increase muscle
memory and perhaps settle their nerves prior to taking a shot.
Since we are not in general able to detect the cue stick at a distance of exactly
0, we monitor the distance history observations for a zero crossing. This occurs
when d(τi) > 0 and d(τi+1) ≤ 0. When this happens, we have detected when the
cue stick tip has collided with the cue ball.
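The zero-crossing test itself is simple; a sketch follows, with the observation type illustrative.

#include <vector>

struct Observation { double time; double distance; };  // signed d(tau), world units

// Sketch: a shot is detected when the signed cue-stick-to-cue-ball distance
// crosses zero between the two most recent observations.
bool shotDetected(const std::vector<Observation>& history)
{
    if (history.size() < 2) return false;
    const Observation& prev = history[history.size() - 2];
    const Observation& curr = history[history.size() - 1];
    return prev.distance > 0.0 && curr.distance <= 0.0;
}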
2.8.2 Shot analysis
We could calculate the cue stick’s speed s as the change in distance over the time
interval spanning the zero crossing. However, this is not a good measure of the
cue stick’s speed as the time interval τi+1 − τi is on average short but with high
variance due to the irregular rate of observations and the unpredictability of the
time when a shot might begin. Instead, we search for the local maximum of d(t)
that occurs at the latest time. This corresponds to the right-most peak in the
graph of figure 2.26. Let us call the distance between the cue stick tip and the
cue ball at this local maximum d(τa) and the distance at the time of the last
observation d(τb). We thus consider the shot to take place between time τa and
τb. During this time frame, the cue stick is driven from a recoiled position into
the cue ball. Note that our algorithms do not reliably detect the cue stick as it
undergoes fast motion since this causes motion blur which is difficult to account
for. In a shot situation, this means we detect the cue stick before a collision is
made as well as very soon thereafter. We thus must make the assumption that
the speed of the cue stick is constant between τa and τb. Also, we assume that the
cue stick maintains its current orientation, moving exactly along its most recently
derived orientation. Given this scenario, we can derive the speed of the cue stick
at roughly the time of the collision to be
\[ s = \frac{d(\tau_b) - d(\tau_a)}{\tau_b - \tau_a} \tag{2.17} \]
Coupling the speed s with the derived orientation (yaw and pitch) of the cue
stick allows us to notify clients of a shot directed along a velocity vector. The
discussion of the client simulation in the following chapter describes how the
derived cue stick pose, shot detection, and shot analysis provide the backbone to
a realistic simulation of billiards.
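A sketch of this estimate follows: it walks backwards through the distance history to the most
recent local maximum (time τa) and applies Eq. 2.17 against the final observation (time τb). The
observation type and function name are illustrative.

#include <cstddef>
#include <vector>

struct Observation { double time; double distance; };  // signed d(tau)

// Sketch: estimate cue stick speed per Eq. 2.17. The value is signed; its
// magnitude is the speed of the cue stick during the final thrust.
double shotSpeed(const std::vector<Observation>& history)
{
    if (history.size() < 2) return 0.0;

    // Find the most recent local maximum of d(tau) by walking back while
    // the distance keeps increasing toward the past.
    std::size_t a = history.size() - 1;
    while (a > 0 && history[a - 1].distance > history[a].distance)
        --a;

    const Observation& A = history[a];
    const Observation& B = history.back();
    if (B.time == A.time) return 0.0;
    return (B.distance - A.distance) / (B.time - A.time);
}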
Chapter 3
Client process
While our system can support multiple TUI clients, we have focused on one such
client, a physically accurate 3D billiards simulation. The simulation is comprised
of the necessary elements to play billiards: a billiards table, a cue stick, a set
of billiard balls, and a cue ball. A virtual camera is automatically positioned to
view the scene from the vantage point of a virtual player; namely behind and
above the cue stick, however the cue stick may be positioned and oriented. The
major responsibilities of the client process are to
• physically model rigid objects such as billiard balls, cue sticks, and billiards
tables using realistic units of measure;
• detect collisions between physical objects and realistically model the after-
math of such collisions;
• receive TUI state updates using IPC from the TUI server;
• position and orient the cue stick based on the physical cue stick pose;
• derive the forces of each shot on the cue ball;
• physically model english and side spin;
• provide a physics event to game event translation layer;
• provide a basic game framework on which future games can easily be built;
• position the virtual camera for each shot;
• allow for training exercises by loading predefined ball layouts and shot goals;
• support visualizations like the “ghost ball” to aid in training;
• render 3D graphic depiction with a good amount of realism.
To our knowledge, there are few if any virtual billiards applications that are
open enough or documented in detail enough to make “plugging in” new input
mechanisms feasible. For this reason, we have created a physically based billiards
simulation from scratch, leveraging the Newton Game Dynamics (NGD) SDK
(32) for an implementation of rigid body dynamics.
A rigid body is a solid body in which deformations are neglected. The distance
between points on a rigid body does not change, even when undergoing trans-
lations and rotations. A rigid body has a local reference frame which is rigidly
connected to the body. The position of the body in the simulation’s global or
world reference frame is represented by the origin of the body’s local reference
frame. The orientation of the body is represented by the axes of the body’s lo-
cal reference frame. The position of each body is therefore defined by a linear
component as well as an angular component (i.e. orientation). The linear and
angular components can change with time, as a result of a force applied from a
collision from another rigid body perhaps. NGD works roughly by numerically
integrating the properties of each rigid body over time. These properties include
the linear and angular velocity of each body as well as the linear and angular
acceleration of each body.
Physical simulation on a computer is a difficult task to robustly perform as it is
computationally intense and demands a regular rate of updating or stepping the
state of each body over time. Without a regular stepping rate, numeric integrators
tend to explode, characterized by physically unrealistic behavior of bodies in the
simulation. The general guideline for maintaining stability in an integration based
physics simulation is to use a fixed amount of time when updating the state of
each body. We have followed this advice in our simulation, updating NGD at a
rate of 300 Hz, each update with a small, fixed timestep.
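A sketch of such a fixed-timestep loop follows; stepPhysics is a hypothetical wrapper around the
engine update call (NewtonUpdate in NGD), and the accumulator pattern shown is one common way
to decouple the render rate from the 300 Hz physics rate.

// Hypothetical wrapper around the physics engine's update call.
void stepPhysics(double dt) { /* e.g. NewtonUpdate(world, (dFloat)dt); */ }

const double kTimestep = 1.0 / 300.0;   // 300 Hz physics update, as in our simulation

// Sketch: consume the elapsed frame time in fixed-size physics steps so the
// numeric integrator always advances by the same small amount.
void advanceSimulation(double frameSeconds)
{
    static double accumulator = 0.0;
    accumulator += frameSeconds;

    while (accumulator >= kTimestep) {
        stepPhysics(kTimestep);
        accumulator -= kTimestep;
    }
}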
While numeric integration has been a popular implementation strategy for physics-
based simulation of rigid bodies (e.g. in many modern computer games), there
exist closed-form solutions for simulating the physics of billiards (e.g. (27)). We
have not chosen this route due to time constraints and ease of implementing
a simulation with NGD. However, a physics-based simulation is never simply a
“fire-and-forget” scenario in which the properties of rigid bodies are specified
and a generic rigid body dynamics simulator handles the rest. Instead, forces,
torques, velocities, etc. must often be derived and manually specified for a spe-
cific context. We have leveraged some of the work of (27) to this end for deriving
and applying the physics of cue stick to cue ball collisions (or shot dynamics ; see
Section 3.3).
A complete overview of physical simulation is beyond the scope of this thesis.
Instead, we refer the reader to background material found in (4; 17). Our discussion
focuses on how we used NGD in our simulation to accurately depict
the physics of billiards. To this end we have used realistic units of measure for
the properties of each rigid body in the simulation. This includes the dimensions
of the billiards table, the mass of the cue stick, the mass of the billiard balls, the
force of gravity, and the coefficients of restitution and friction between rigid bodies,
so as to accurately model collision and friction effects. The dimensions of our billiards table model are
based on the specifications of the World Pool-Billiard Association (44).
The elements of the 3D simulation were modeled using Google SketchUp (15),
which is a 3D modeling package with a simple user interface. Our models were
exported from SketchUp using a Ruby (38) language script into a format that
is directly supported by the 3D rendering engine OGRE (34). OGRE is used
to render a 3D view of the simulation from the viewpoint of a virtual camera.
The virtual camera is positioned and oriented to view the scene from the vantage
point of where a right-handed billiards player would be positioned to grasp the
virtual cue stick. We have not focused much on realism in our 3D depictions of
a billiards scene. However, our renderings do support shadows for depth cues, as
well as texture-mapped billiard balls. Environment mapping is also used to give
the billiard balls a glossy appearance.
Figure 3.1: Screenshot of a Google SketchUp modeling session. The image shows
the virtual billiards table used in the client simulation.
3.1 Simulation states
The client simulation can be in one of three states at any given time, namely
1. shot setup
2. shooting
3. physical simulation
The shot setup state is active when there are no billiard balls in motion. While
this state is active, the client process receives updates from the server process
containing the current state of the physical cue stick token (see Table 2.2 for
details). With each update, the client updates the position and orientation of the
Figure 3.2: Client process logic flow.
virtual cue stick to match that of the physical cue stick token. The steps taken
to position and orient the virtual cue stick are discussed in Section 3.2.
The shooting state is active when the client receives an update from the server
process that a shot has been detected. While this state is active, the cue ball’s
instantaneous linear and angular velocity are determined based on the properties of the
detected shot. The derivation of these values is discussed in Section 3.3.
The physical simulation state is active after the cue ball has been set in motion
during the shooting state. While this state is active, NGD is in control, perform-
ing numerical integration for each rigid body in the simulation and delivering
collision events to the application. Details of event handling are discussed in
Section 3.4.
Once all billiard balls have stopped moving (i.e. their linear velocities fall below
some predefined threshold), the shot setup state becomes active again and the user
is free to line up the next shot. The cycle of shot setup, shooting, and physical
simulation repeats until the game is over or the user chooses to abort it.
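These three states and their transitions can be summarized as a small state machine, sketched below; every identifier in the sketch is hypothetical and merely mirrors the flow of Figure 3.2.

// Illustrative sketch of the client state machine (all names are hypothetical).
enum class SimState { ShotSetup, Shooting, PhysicalSimulation };

struct CueStickPose { /* yaw, pitch, a, b, distw from the TUI server */ };

class ClientSim {
public:
    void tick(float dt);
private:
    // Hypothetical helpers standing in for server IPC, shot handling, and NGD stepping.
    CueStickPose latestServerPose();
    bool shotDetectedByServer();
    void applyCueStickPose(const CueStickPose& pose);   // Section 3.2
    void applyShotToCueBall();                          // Section 3.3
    void stepPhysics(float dt);                         // NGD fixed-step updates
    bool allBallsAtRest();                              // linear velocities below threshold

    SimState state_ = SimState::ShotSetup;
};

void ClientSim::tick(float dt)
{
    switch (state_) {
    case SimState::ShotSetup:
        applyCueStickPose(latestServerPose());          // mirror the physical cue stick
        if (shotDetectedByServer()) state_ = SimState::Shooting;
        break;
    case SimState::Shooting:
        applyShotToCueBall();                           // set cue ball velocities
        state_ = SimState::PhysicalSimulation;
        break;
    case SimState::PhysicalSimulation:
        stepPhysics(dt);                                // NGD is in control here
        if (allBallsAtRest()) state_ = SimState::ShotSetup;
        break;
    }
}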
3.2 Orienting the virtual cue stick
The 3D model we use for a cue stick has a local frame origin positioned halfway
along the central axis (−X) of the cue stick. The cue tip is thus located at x < 0.
We use OGRE’s scene graph node manipulation routines to first translate the
cue stick to the center of the virtual cue ball. Next we set the orientation of the
cue stick to the product of two quaternions y and p derived from rotations of
ψ and θ about the “up” and “right” vectors respectively. An axis-angle (ω, θ)
representation of a rotation can be converted to a quaternion representation Q
via Q = (cos θ/2, ω sin θ/2). Finally, we translate the cue stick in its local body
frame based on a, b, and distw (see Table 2.2) which are selected from the most
recent cue stick pose update received from the server process.
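A minimal sketch of these three steps using OGRE's scene-node API follows. The function signature and, in particular, the axis mapping used for the final local-frame translation are assumptions made for illustration; the actual mapping of a, b, and distw onto local axes follows the conventions of Table 2.2.

// Illustrative sketch: positioning and orienting the virtual cue stick with OGRE.
#include <Ogre.h>

void orientCueStick(Ogre::SceneNode* cueNode,
                    const Ogre::Vector3& cueBallCenter,
                    Ogre::Radian psi,      // yaw about the "up" vector
                    Ogre::Radian theta,    // pitch about the "right" vector
                    Ogre::Real a, Ogre::Real b, Ogre::Real distw)
{
    // 1. Translate the cue stick's local origin to the center of the virtual cue ball.
    cueNode->setPosition(cueBallCenter);

    // 2. Compose the yaw and pitch quaternions (the axis-angle to quaternion
    //    conversion is handled by the Ogre::Quaternion constructor).
    Ogre::Quaternion y(psi,   Ogre::Vector3::UNIT_Y);
    Ogre::Quaternion p(theta, Ogre::Vector3::UNIT_X);
    cueNode->setOrientation(y * p);

    // 3. Offset the stick in its own body frame so the tip (at x < 0) ends up at the
    //    contact point given by a, b, and the tip-to-ball distance distw
    //    (axis mapping assumed here for illustration).
    cueNode->translate(Ogre::Vector3(distw, b, a), Ogre::Node::TS_LOCAL);
}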
3.3 Shot dynamics
We leverage the work of (27) for our model of how a cue stick to cue ball impact
sets the instantaneous linear velocity and angular velocity of the cue ball. We
follow the general setup as in Figure 2.25. Again, let sw denote the central axis
of the cue stick token; ~sw is parallel to the k− j plane and is pitched at angle θ to
the i− k plane. Let the contact point of the cue stick on the cue ball surface be
point Pw, Rw be the known world radius of the cue ball, a be the horizontal offset
of Pw from the center of the cue ball, and b the vertical offset (see Table 2.2 for
details). Let c = √(Rw^2 − a^2 − b^2). Thus, Pw = (a, c, b). From this, (27) derives
the force F imparted to the cue ball due to the impact as

F = 2 m V0 / (1 + m/M + (5/(2 Rw^2)) (a^2 + b^2 cos^2 θ + c^2 sin^2 θ − 2 b c cos θ sin θ))   (3.1)
where m is the mass of the cue ball, M is the mass of the cue stick, and V0 is the
initial velocity of the cue stick at the time of the collision. We assume that the
duration of the collision is negligible. Thus, the instantaneous linear velocity of the
cue ball at the time of collision is ~v = ~F/m = (0, −(F/m) cos θ, −(F/m) sin θ).
We also use (27) to set the instantaneous angular velocity ~ω:

~ω = (1/I) (−c F sin θ + b F cos θ, a F sin θ, −a F cos θ)   (3.2)

where I is the moment of inertia (simply, the equivalent of mass in the context of
a rotating body) of a solid sphere, I = (2/5) m Rw^2.
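The following self-contained sketch evaluates Equations 3.1 and 3.2 directly. The variable names follow the text; the struct and function names are illustrative rather than the prototype's actual code.

// Sketch of the shot-dynamics computation from Equations 3.1 and 3.2.
#include <cmath>

struct Vec3 { double x, y, z; };

struct ShotResult { Vec3 v; Vec3 w; };   // linear and angular velocity of the cue ball

ShotResult computeShot(double a, double b, double Rw,
                       double theta,     // cue stick pitch angle (radians)
                       double V0,        // cue stick speed at impact
                       double m,         // cue ball mass
                       double M)         // cue stick mass
{
    const double c  = std::sqrt(Rw * Rw - a * a - b * b);
    const double s  = std::sin(theta);
    const double co = std::cos(theta);

    // Equation 3.1: force imparted to the cue ball.
    const double F = (2.0 * m * V0) /
        (1.0 + m / M + (5.0 / (2.0 * Rw * Rw)) *
         (a * a + b * b * co * co + c * c * s * s - 2.0 * b * c * co * s));

    // Instantaneous linear velocity (collision duration assumed negligible).
    ShotResult r;
    r.v = { 0.0, -(F / m) * co, -(F / m) * s };

    // Equation 3.2: angular velocity, with I = (2/5) m Rw^2 for a solid sphere.
    const double I = 0.4 * m * Rw * Rw;
    r.w = { (-c * F * s + b * F * co) / I, (a * F * s) / I, (-a * F * co) / I };
    return r;
}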
3.4 Event handling and game logic
NGD provides a callback system as part of its API in order to notify an application
controller of physical events between two rigid bodies in the simulation. An
application controller can handle these events however it deems fit, including
playing audible collision sounds, or passing the event on to a game logic handler
as we do in our prototype.
The game logic handler is a bridge between physical events involving billiard
balls, the rails (borders) of the billiards table, the cue stick, etc. and the logical
realm of the rules manager for a particular billiards game. The game
logic handler translates the properties of a physical collision between two rigid
bodies into logical game events. For instance, one such event might be “the cue
ball collided with the 8 ball”, or “the 12 ball was pocketed in the northwest corner
pocket.” Each game rules manager handles these events in different ways. This is
a classic example of object-oriented design using inheritance to allow subclasses
to override the behavior of a parent class.
We have only implemented an ad hoc “game” in our prototype due to time
constraints. This game randomly positions billiard balls across the table’s surface
at the start of each game. The user is prompted to try to pocket all of the balls
in as few shots as possible. We plan on implementing 8-ball, 9-ball, and snooker
in the future.
In our simulation, it is trivial to detect when a ball is pocketed. We treat each
pocket as a separate rigid body, albeit one that never changes position or
orientation (i.e. pockets are static bodies). NGD notifies the simulation which
bodies are involved in a collision event. In addition, we can associate a numeric
identifier with each body, which is used to determine the logical identity of each
body involved in a collision event.
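The sketch below illustrates both ideas: per-body identifiers that give each rigid body a logical identity, and a rules-manager base class whose subclasses override only the game events they care about. All names are hypothetical.

// Illustrative sketch of the physics-event to game-event bridge.
enum class BodyKind { Ball, Rail, Pocket, CueStick };

struct BodyId { BodyKind kind; int number; };   // e.g. { Ball, 8 } or { Pocket, 2 }

// Base rules manager; concrete games (8-ball, 9-ball, ...) override the handlers
// they care about.
class GameRules {
public:
    virtual ~GameRules() = default;
    virtual void onBallHitBall(int a, int b)          { (void)a; (void)b; }
    virtual void onBallPocketed(int ball, int pocket) { (void)ball; (void)pocket; }
};

// Translates an NGD collision callback (a pair of body identities) into a
// logical game event on the rules manager.
void translateCollision(const BodyId& a, const BodyId& b, GameRules& rules)
{
    if (a.kind == BodyKind::Ball && b.kind == BodyKind::Ball)
        rules.onBallHitBall(a.number, b.number);
    else if (a.kind == BodyKind::Ball && b.kind == BodyKind::Pocket)
        rules.onBallPocketed(a.number, b.number);
    // rail and cue stick contacts handled similarly
}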
3.5 A focus on user training
Since our billiards simulation is not constrained to the physical world, we have
more control to implement training exercises that would otherwise be cumbersome
or time-consuming for users to do on a real billiards table. For example, if a user
is not very skilled at a particular type of billiards shot (e.g. a bank shot), we
could provide a mechanism to help the user improve on these types of shots
by arranging balls in a predetermined configuration. As the user shoots, events
would be fired and matched against an expected event sequence. If the user-
generated events match the expected sequence, the attempted shot is deemed
successful. The system could also maintain statistics of a user’s performance on
such shots over time. The chief advantage of this simulation-based training over
practice on a real billiards table is the time saved.
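A minimal sketch of this sequence matching follows; the ShotGoal class and the event strings are hypothetical and only show how fired game events could be checked, in order, against a drill's expected sequence.

// Hypothetical helper that tracks a drill's expected sequence of game events.
#include <deque>
#include <string>
#include <utility>

class ShotGoal {
public:
    explicit ShotGoal(std::deque<std::string> expected) : expected_(std::move(expected)) {}

    // Feed each game event as it fires; returns true once the whole expected
    // sequence has been observed in order.
    bool onEvent(const std::string& event)
    {
        if (!expected_.empty() && expected_.front() == event)
            expected_.pop_front();
        return expected_.empty();
    }

private:
    std::deque<std::string> expected_;
};

// Example goal for a bank-shot drill (event strings are hypothetical):
//   ShotGoal goal({"cue ball hit 3 ball", "3 ball hit rail", "3 ball pocketed NW"});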
Figure 3.3: Screenshot of client simulation showcasing the “ghost ball” feature.
The semi-transparent sphere near the 8-ball is the ghost ball. Ghost ball visu-
alization is a pedagogical technique used to help players visualize where the cue
ball will collide with a target ball.
We have also implemented a “ghost ball” feature (see Figure 3.3). A ghost ball
is a ball that is not a proper part of the simulation but only exists during the
setup phase of a shot in which the user orients the cue stick. A semi-transparent
cue ball is rendered at the location where the cue ball will collide with a target
billiards ball if a shot is taken with the cue stick in its current orientation. This
visualization technique is commonly recommended by billiards instructors to their students.
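Placing the ghost ball amounts to sweeping a sphere of the cue ball's radius along the current aim direction and finding where it first touches a target ball. The 2D sketch below (in the table plane) is illustrative; the actual client could equally query the physics engine for this intersection.

// Sketch: ghost-ball placement as a 2D circle sweep along the aim direction.
#include <cmath>
#include <optional>

struct Vec2 { double x, y; };

// Returns the ghost-ball center, i.e. the cue-ball center at the moment of contact
// with a target ball of equal radius R, or nothing if the aim line misses it.
std::optional<Vec2> ghostBallCenter(Vec2 cue, Vec2 dir /* unit aim direction */,
                                    Vec2 target, double R)
{
    const double dx = target.x - cue.x, dy = target.y - cue.y;
    const double proj = dx * dir.x + dy * dir.y;          // distance along the aim line
    if (proj <= 0.0) return std::nullopt;                 // target is behind the cue ball
    const double perp2 = dx * dx + dy * dy - proj * proj; // squared lateral miss distance
    const double contact = 2.0 * R;                       // centers are 2R apart at contact
    if (perp2 > contact * contact) return std::nullopt;   // aim line passes too far away
    const double t = proj - std::sqrt(contact * contact - perp2);
    return Vec2{ cue.x + t * dir.x, cue.y + t * dir.y };
}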
Chapter 4
Conclusions
We have described a system that provides for natural, non-invasive human-
computer interactions using only computer vision techniques and a single DV
camera. We believe that even though the low capture rate and generally poor
image quality of consumer-grade DV cameras ultimately limit the accuracy of
image-based measurements, future DV cameras and image processing techniques
will improve enough to make CV-based systems a viable alternative to the mouse and
keyboard. Indeed, higher-resolution images captured at faster rates, coupled
with ever-increasing processor speeds, will also help make this style of input more
widespread.
However, CV-based algorithms are generally compute-intensive due to the sheer
amount of data that needs to be processed. We do not foresee this changing
unless some of the feature detection performed in software today is moved into
specialized hardware in the future. This has already been explored somewhat (e.g. in
(12)). However, the main issue with hardware-based feature extraction is
creating hardware that is generic enough to serve multiple classes of features.
A common criticism of CV-based TUIs is that they “are typically developed
and tuned for one specific configuration” (22). Our system is indeed configured
for one specific set of tangible objects. However, we hope that by starting this
style of work with concrete prototypes we can learn the lessons and principles of
successful computer-vision-based TUI design so that future projects are easier to
implement.
In the future we wish to expand our studies by possibly including other forms
of vision-based sensing such as stereo vision or omnidirectional cameras. We
also would like to expand our study of tracking fast-moving objects in real time.
Stroboscopic lighting might provide a benefit to this end. Also, to deal with
webcams’ poor reproduction of colors in the visible light spectrum, we would like
to experiment with near-infrared light. In such a system, changes in visible
illumination would not adversely affect the detection of infrared light.
We would also like to vet CV as a viable input mechanism by attempting to create
TUIs for several other equipment-based sports such as tennis, golf, shuffleboard,
and baseball, in order to uncover issues we have not yet encountered. It would
also behoove us to conduct a usability study to gain valuable feedback from real
users who are not as intimately familiar with the implementation details of our
TUI as we are.
Recently there has been a growing focus on “simulation-based learning” with
respect to education and training. In simulation-based learning, a user learns
some skill by interacting with a simulation of some environment. Simulations
are less expensive than their real counterparts and offer much more control over
configuration. Since the TUI described in this thesis is inexpensive and easy to
deploy, one can start to see a role for TUIs in training for various other sports or
physically based endeavours. Recently, the Center for Immersive and Simulation-
Based Learning was created at Stanford University. Using CV for TUIs could
help such efforts by providing an inexpensive means to create such simulations.
Appendix A
Camera calibration
Camera calibration provides us with information about the intrinsic and extrinsic
parameters of a camera. The intrinsic parameters describe the camera’s internals,
including the focal length, the distortion coefficients of the
camera’s lens, and the principal point (which is usually the center of the image).
The extrinsic parameters define the position and orientation of the camera relative
to some world coordinate frame. While the intrinsic parameters of a camera vary
from camera to camera, they need only be found once per camera. The extrinsic
parameters of a camera are view-dependent.
Figure A.1: Chessboard calibration object detected with point correspondences
highlighted.
A common technique used to calibrate a camera is to find a number of image-
world point correspondences. An object of known geometry such as a chessboard
pattern is detected in image space. In fact, the chessboard pattern as in Fig-
ure A.1 is the most common calibration object used today.
Every camera lens contains some amount of distortion. The distortion present can
be described by radial and tangential distortion coefficients. These coefficients are
used to rectify (undistort) acquired images. These undistorted images form the
input to our CV system. Undistorting a point (x, y)
results in a point (x′, y′). This undistortion procedure is applied to each point of
an acquired image.
x′ = x(1 + k1 r^2 + k2 r^4) + 2 p1 x y + p2 (r^2 + 2 x^2)
y′ = y(1 + k1 r^2 + k2 r^4) + p1 (r^2 + 2 y^2) + 2 p2 x y
(A.1)

where r^2 = x^2 + y^2, k1 and k2 are radial distortion coefficients, and p1 and p2
are tangential distortion coefficients. k1, k2, p1, and p2 are determined during
camera calibration.
(a) Acquired image (distorted). (b) Acquired image (undistorted).
Figure A.2: Example of camera lens distortion. Note how the straight lines in
the right image (b) are curved in the left image (a).
We used Intel’s OpenCV library (18) for camera calibration. OpenCV’s calibra-
tion routines are based on the methods of Zhang (47).
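A minimal sketch of this calibration and undistortion workflow is shown below using OpenCV's modern C++ API (the thesis-era code used OpenCV's C interface); the chessboard dimensions, square size, and image filenames are assumptions made for illustration.

// Sketch: chessboard calibration and per-frame undistortion with OpenCV.
#include <opencv2/opencv.hpp>
#include <string>
#include <vector>

int main()
{
    const cv::Size patternSize(9, 6);           // inner chessboard corners (assumed)
    const float squareSize = 0.025f;            // 25 mm squares (assumed)

    // One set of 3D chessboard corners (on the z = 0 plane), reused for every view.
    std::vector<cv::Point3f> corners3d;
    for (int y = 0; y < patternSize.height; ++y)
        for (int x = 0; x < patternSize.width; ++x)
            corners3d.emplace_back(x * squareSize, y * squareSize, 0.0f);

    std::vector<std::vector<cv::Point2f>> imagePoints;
    std::vector<std::vector<cv::Point3f>> objectPoints;
    cv::Size imageSize;
    for (int i = 0; i < 10; ++i) {              // assumed calibration image filenames
        cv::Mat img = cv::imread("calib_" + std::to_string(i) + ".png", cv::IMREAD_GRAYSCALE);
        if (img.empty()) continue;
        imageSize = img.size();
        std::vector<cv::Point2f> corners;
        if (cv::findChessboardCorners(img, patternSize, corners)) {
            imagePoints.push_back(corners);     // image-world point correspondences
            objectPoints.push_back(corners3d);
        }
    }

    cv::Mat cameraMatrix, distCoeffs;           // intrinsics; distCoeffs holds k1, k2, p1, p2 (+ k3)
    std::vector<cv::Mat> rvecs, tvecs;          // per-view extrinsics
    cv::calibrateCamera(objectPoints, imagePoints, imageSize,
                        cameraMatrix, distCoeffs, rvecs, tvecs);

    // Undistort an acquired frame before feature detection (Equation A.1 applied per pixel).
    cv::Mat frame = cv::imread("frame.png"), undistorted;
    cv::undistort(frame, undistorted, cameraMatrix, distCoeffs);
    return 0;
}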
References
[1] Samuel Aude, Matios Bedrosian, Cyril Clement, and Monica
Dinculescu. mulTetris: A Test of Graspable User Interfaces in Col-
laborative Games, 2006. http://www.cs.mcgill.ca/~mdincu/multetris_
final.pdf [Online; accessed 30-June-2007]. 11, 15, 30
[2] B. Bailey. Real Time 3D motion tracking for interactive computer simu-
lations, 2007. Student project, Imperial College London. 16
[3] Paul J. Besl and Ramesh C. Jain. Three-dimensional object recogni-
tion. ACM Comput. Surv., 17(1):75–145, March 1985. 7
[4] David M. Bourg. Physics for Game Developers. O’Reilly & Associates,
Sebastopol, CA, 2002. 70
[5] Gary R. Bradski. Computer Vision Face Tracking for Use in a Perceptual
User Interface. Intel Technology Journal, 2(Q2):15, 1998. 30
[6] Adrian David Cheok, Xubo Yang, Zhou Zhi Ying, Mark
Billinghurst, and Hirokazu Kato. Touch-Space: Mixed Reality Game
Space Based on Ubiquitous, Tangible, and Social Computing. Personal Ubiq-
uitous Comput., 6(5-6):430–442, 2002. 13
[7] E. Costanza and J. Robinson. A Region Adjacency Tree Approach to
the Detection and Design of Fiducials. In Proceedings of Vision, Video and
Graphics, 2003. 11
[8] E. Costanza, S.B. Shelley, and J. Robinson. Introducing Audio
d-touch: a Novel Tangible User Interface for Music Composition and Perfor-
mance. In Proceedings of the 6th International Conference on Digital Audio
Effects, 2003. 11, 14
[9] Jerry Fails and Dan Olsen. A design tool for camera-based interaction.
In CHI ’03: Proceedings of the SIGCHI conference on Human factors in
computing systems, pages 449–456, New York, NY, USA, 2003. ACM. 13
[10] George W. Fitzmaurice. Graspable user interfaces. PhD thesis, Toronto,
Ont., Canada, Canada, 1996. Adviser-William Buxton. 11, 15
[11] James D. Foley, Andries van Dam, Steven K. Feiner, and John F.
Hughes. Computer Graphics — Principles and Practice. The Systems
Programming Series. Addison-Wesley, second edition, 1996. 5
[12] W. T. Freeman, K. Tanaka, J. Ohta, and K. Kyuma. Computer
vision for computer games. In FG ’96: Proceedings of the 2nd International
Conference on Automatic Face and Gesture Recognition (FG ’96), page 100,
Washington, DC, USA, 1996. IEEE Computer Society. 77
[13] E. Gamez, N. Leland, O. Shaer, and R. Jacob. The TAC Paradigm:
Unified Conceptual Framework to Represent Tangible User Interfaces, 2003.
http://citeseer.ist.psu.edu/calvillo-gamez03tac.html [Online; ac-
cessed 30-June-2007]. 11
[14] Vadas Gintautas and Alfred W. Hubler. Experimental evidence for
mixed reality states in an interreality system. Physical Review E (Statistical,
Nonlinear, and Soft Matter Physics), 75(5):057201, 2007. 2
[15] Google. Google SketchUp, 2007. http://sketchup.google.com/ [Online;
accessed 01-November-2007]. 70
[16] Venkatraghavan Gourishankar, Govindarajan Srimathveer-
avalli, and Thenkurussi Kesavadas. HapStick: A High Fidelity Haptic
Simulation for Billiards. In WHC ’07: Proceedings of the Second Joint Euro-
Haptics Conference and Symposium on Haptic Interfaces for Virtual Envi-
ronment and Teleoperator Systems, pages 494–500, Washington, DC, USA,
2007. IEEE Computer Society. 16
[17] Mikako Harada, Andrew Witkin, and David Baraff. Interactive
Physically-Based Manipulation of Discrete/Continuous Models. Computer
Graphics, 29(Annual Conference Series):199–208, 1995. 70
[18] Intel Corporation. Open Source Computer Vision Library, 2000.
http://www.intel.com/technology/computing/opencv [Online; accessed
30-June-2007]. 23, 35, 37, 80
[19] Hiroshi Ishii and Brygg Ullmer. Tangible Bits: Towards Seamless
Interfaces between People, Bits and Atoms. In CHI, pages 234–241, 1997.
11
[20] Tony Jebara, Cyrus Eyster, Josh Weaver, Thad Starner, and
Alex Pentland. Stochasticks: Augmenting the Billiards Experience with
Probabilistic Vision and Wearable Computers. In ISWC ’97: Proceedings of
the 1st IEEE International Symposium on Wearable Computers, page 138,
Washington, DC, USA, 1997. IEEE Computer Society. 17
[21] Martin Kaltenbrunner and Ross Bencina. reacTIVision: a
computer-vision framework for table-based tangible interaction. In TEI ’07:
Proceedings of the 1st international conference on Tangible and embedded
interaction, pages 69–74, New York, NY, USA, 2007. ACM. 14
[22] Rick Kjeldsen, Anthony Levas, and Claudio Pinhanez. Dy-
namically reconfigurable vision-based user interfaces. Machine Vision and
Applications, 16(1):6–12, December 2004. 77
[23] Scott R. Klemmer, Jack Li, James Lin, and James A. Lan-
day. Papier-Mache: Toolkit Support for Tangible Input. Technical Report
UCB/CSD-03-1278, EECS Department, University of California, Berkeley,
2003. 11, 12
[24] Gudrun J. Klinker, Klaus H. Ahlers, David E. Breen, Pierre-
Yves Chevalier, Chris Crampton, Douglas S. Greer, Dieter
Koller, Andre Kramer, Eric Rose, Mihran Tuceryan, and
Ross T. Whitaker. Confluence of Computer Vision and Interactive
Graphics for Augmented Reality. PRESENCE: Teleoperators and Virtual
Environments, (Special Issue on Augmented Reality), 1997. 6, 11
[25] Myron W. Krueger, Thomas Gionfriddo, and Katrin Hinrich-
sen. VIDEOPLACE - an artificial reality. In CHI ’85: Proceedings of the
SIGCHI conference on Human factors in computing systems, pages 35–40,
New York, NY, USA, 1985. ACM. 11
[26] Lars Bo Larsen, Rene B. Jensen, Kasper L. Jensen, and Soren
Larsen. Development of an automatic pool trainer. In ACE ’05: Pro-
ceedings of the 2005 ACM SIGCHI International Conference on Advances
in computer entertainment technology, pages 83–87, New York, NY, USA,
2005. ACM. 17
[27] Will Leckie and Michael A. Greenspan. An Event-Based Pool
Physics Simulator. In ACG, pages 247–262, 2006. 62, 70, 74
[28] Bastian Leibe, Thad Starner, William Ribarsky, Zachary
Wartell, David Krum, Brad Singletary, and Larry Hodges. The
perceptive workbench: Towards spontaneous and natural interaction in semi-
immersive virtual environments. In IEEE Virtual Reality 2000 Conference
(VR’2000), pages 13–20, Los Alamitos, Calif., March 18-23 2000. IEEE CS
Press. (Won the VR’2000 Best Paper Award!). 9
[29] V. Lepetit and P. Fua. Monocular Model-Based 3D Tracking of Rigid
Objects: A Survey. Foundations and Trends in Computer Graphics and
Vision, 1(1). 14, 30, 31, 33
[30] Carsten Magerkurth, Adrian David Cheok, Regan L. Mandryk,
and Trond Nilsen. Pervasive games: bringing computer entertainment
back to the real world. Comput. Entertain., 3(3):4–4, 2005. 3
[31] P. Milgram and H. Colquhoun. A Taxonomy of Real and Virtual
World Display Integration, 1999. 2
[32] Newton Game Dynamics. Newton Game Dynamics, 2007. http://
newtondynamics.com [Online; accessed 30-June-2007]. 69
[33] M. Nixon and A. Aguado. Feature Extraction and Image Processing.
Newnes-Oxford, 2002. 5, 7, 23, 33
[34] OGRE contributors. Object-Oriented Game Rendering Engine, 2007.
http://ogre3d.org [Online; accessed 30-June-2007]. 70
[35] Jef Raskin. The Humane Interface: New Directions for Designing Inter-
active Systems. Addison-Wesley Professional, March 2000. 9
[36] Ian D. Reid and A. North. 3D Trajectories from a Single Viewpoint
using Shadows. In BMVC, 1998. 50, 51, 52, 53, 54
[37] Homero V. Rios. Human-computer interaction through computer vision.
In CHI ’01: CHI ’01 extended abstracts on Human factors in computing
systems, pages 59–60, New York, NY, USA, 2001. ACM. 2
[38] Ruby. Ruby programming language official website, 2007. http://www.
ruby-lang.org [Online; accessed 25-August-2007]. 70
[39] Alvy Ray Smith. Color Gamut Transform Pairs. Computer Graphics,
12(3):12–19, August 1978. 34
[40] George Stockman and Linda G. Shapiro. Computer Vision. Prentice
Hall PTR, Upper Saddle River, NJ, USA, 2001. 7, 8, 22, 26, 33, 37, 39, 47,
50
[41] B. Ullmer and H. Ishii. Emerging frameworks for tangible user interfaces.
IBM Syst. J., 39(3-4):915–931, 2000. 3, 4, 11
[42] Brygg Ullmer and Hiroshi Ishii. The metaDESK: models and pro-
totypes for tangible user interfaces. In UIST ’97: Proceedings of the 10th
annual ACM symposium on User interface software and technology, pages
223–232, New York, NY, USA, 1997. ACM. 12
[43] Andrew D. Wilson. PlayAnywhere: a compact interactive tabletop
projection-vision system. In UIST ’05: Proceedings of the 18th annual ACM
symposium on User interface software and technology, pages 83–92, New
York, NY, USA, 2005. ACM. 15
[44] World Pool-Billiard Association. WPA Tournament Table &
Equipment Specifications, 2001. http://www.wpa-pool.com/index.asp?
content=rules_spec [Online; accessed 22-August-2007]. 70
[45] B. Yersin. Virtual Billiards, 2004. Student project, VRLAB, Swiss Federal
Institute of Technology. 11
[46] Alper Yilmaz, Omar Javed, and Mubarak Shah. Object tracking:
A survey. ACM Comput. Surv., 38(4), 2006. 14
[47] Zhengyou Zhang. Flexible Camera Calibration by Viewing a Plane from
Unknown Orientations. ICCV, 01:666, 1999. 80