A Computer Vision
Tangible User Interface for
Mixed Reality Billiards
Brian Hammond
Seidenberg School of Computer Science
Pace University
A thesis submitted for the degree of
Master of Computer Science
December 2007
Abstract
Conventional input devices are unnatural for many human-computer
interactions (HCI). In this thesis we describe a system for creating nat-
ural interaction patterns using tangible objects as input devices. A
user’s physical interactions with these tangible objects are monitored
by analyzing real-time video from a single inexpensive digital video
camera using variations on common computer vision algorithms. Lim-
iting tangible object detection to computer vision techniques provides
a non-invasive and inexpensive means to creating user interfaces that
are more compelling and natural than those relying on a keyboard
and mouse. The mixed reality nature of the system stems from the
use of tangible objects as actors in a virtual or synthetic reality. As
a proof of concept, we have developed a pocket billiards simulation
that allows a user to manipulate a physical cue stick in the natural
manner as a tangible input device for practice exercises and games.
Contents

1 Introduction
  1.1 System classification
  1.2 Billiards
  1.3 Scene description
  1.4 Computer vision
  1.5 Goals and challenges
  1.6 Related work
      1.6.1 Existing TUI-based systems and toolkits
      1.6.2 Systems that use computer vision for user input
      1.6.3 Systems that implement mixed reality billiards
  1.7 System architecture

2 Server process
  2.1 Image formation
  2.2 Issues with acquired digital images
      2.2.1 Geometric distortion
      2.2.2 Blooming
      2.2.3 Noise
      2.2.4 Low rate of acquisition
      2.2.5 Low resolution
      2.2.6 Chromatic aberration
      2.2.7 Temporal averaging
      2.2.8 Weighted moving averages
      2.2.9 Manual camera control
  2.3 On stereo vision
  2.4 On tracking algorithms
  2.5 Feature detection methods
      2.5.1 Color representation
      2.5.2 Thresholding
      2.5.3 Contour finding and analysis
      2.5.4 Binary image morphology
      2.5.5 Flexible color matching
      2.5.6 Convex hulls and convexity defects
  2.6 Pose driven feature detection
      2.6.1 Planar object detection
      2.6.2 Cloth/felt detection
      2.6.3 Cue ball detection
      2.6.4 Cue stick detection
  2.7 Cue stick pose estimation
      2.7.1 Estimation of cue tip vertical offset
      2.7.2 Estimation of the distance between cue stick and cue ball
      2.7.3 Estimation of cue stick pitch
      2.7.4 Estimation of cue stick yaw
      2.7.5 Estimation of spin-inducing parameters
  2.8 Shot detection and analysis
      2.8.1 Shot detection
      2.8.2 Shot analysis

3 Client process
  3.1 Simulation states
  3.2 Orienting the virtual cue stick
  3.3 Shot dynamics
  3.4 Event handling and game logic
  3.5 A focus on user training

4 Conclusions

A Camera calibration

References

List of Tables

1.1 Scene elements of our prototype TUI
2.1 4-way or 8-way neighbors of a pixel P
2.2 Elements of cue stick pose
2.3 Feature summary
2.4 Flexible color model for the cloth/felt
2.5 Flexible color model for the cue stick's shadow
2.6 Scene lines to be found for estimation of h

List of Figures

1.1 Typical pocket billiards equipment
1.2 Scene layout
1.3 Top view
1.4 Client-server architecture
2.1 Low quality digital video camera image
2.2 High quality still photography image
2.3 Example of sensor noise
2.4 Errors in color reproduction of digital video camera
2.5 Server process logic flow
2.6 Color mixtures in RGB
2.7 HSV color space
2.8 Example of thresholding
2.9 Example of contour finding
2.10 Example of erosion morphological operator
2.11 Example of flexible color matching
2.12 Example of convexity defect
2.13 Example of cue ball child contour
2.14 Example of cue stick shadow detection
2.15 Screenshot of demo of feature detection
2.16 Scene points used in cue stick tip height estimation
2.17 Setup for estimation of h; part 1
2.18 Setup for estimation of h; part 2
2.19 Screen capture of cue stick vertical offset estimation demo
2.20 Distance between cue stick tip and cue ball
2.21 Example of cue stick pitch
2.22 Image space
2.23 Current side of virtual billiards table
2.24 Examples of cue stick yaw
2.25 Ball space
2.26 Example distance history
3.1 Screenshot of Google Sketchup modeling session
3.2 Client process logic flow
3.3 Screenshot of client simulation showcasing the "ghost ball" feature
A.1 Camera calibration chessboard object
A.2 Example of camera lens distortion
Chapter 1
Introduction
The most common interaction techniques in HCI today are based on the famil-
iar mouse and keyboard. While extremely successful, these input devices don’t
provide natural interfaces to various categories of software. In fact, one may ar-
gue that most software developed today cannot live up to its full potential due
to the ubiquity of the mouse and keyboard as expected user input devices. In
other words, we as software developers and interaction designers are constrained
to design simple, often unnatural human-computer interactions; or are we?
This thesis describes a prototype tangible user interface (TUI) that uses computer
vision to create a natural, non-invasive user input technique for
cue sports such as 8-ball or snooker. In this system, we utilize a single off-the-shelf
digital video camera (or more commonly, a webcam) to passively view a user’s
interactions with unmarked tangible objects. These interactions are detected
and analyzed in near real-time (≈ 10 Hz), and represent application use cases.
One of our goals is to learn-by-doing; by designing, implementing, and testing a
prototype of a single application using computer vision as an input mechanism,
we wish to learn techniques and lessons for creating a more general class of such
applications in the future.
... computers are still limited in their multimedia understanding. This
means that we have increased the effective bandwidth of information
from computers to humans, by sending audio, images, audio, graphics,
haptic data, but the same rate of improvement has not happened in
computer understanding. Most computers still receive input from low
bandwidth devices like keyboards or mouse [sic]. Only few interfaces
are able to understand application related domains of audio, visual or
haptic information. (37)
1.1 System classification
A mixed reality (MR) system spans a continuum of realities which includes physi-
cal reality, augmented reality (AR), augmented virtuality (AV), and virtual reality
(VR) (31). Physical reality is what we sense all around us every day as the real
world. Augmented reality is a relatively new field in which virtual objects are
introduced into the real world, usually by way of 3D graphics overlaid atop a
user’s view of some real world scene. An augmented virtuality system is one in
which objects in our physical reality are incorporated into a virtual environment.
Virtual reality systems create a completely synthetic environment as seen in many
modern video games and research prototypes. Our system can be considered a
MR system in that a user manipulates an object in physical reality which in turn
manipulates an object in a virtual reality. This system can be further described
as an “interreality” system since we are specifically correlating objects of two
distinct realities (14). Our system is also closely related to a recent focus on
pervasive games:
Computer games focus the user’s attention mainly on the computer
screen or 2D/3D virtual environments, and players are bound to using
keyboards, mice, and gamepads while gaming, thereby constraining
interaction. To address this problem, there is a growing trend in to-
day’s games to bring more physical movement and social interaction
into games while still utilizing the benefits of computing and graphical
systems. Thus, the real-world is coming back to computer entertain-
ment with a new gaming genre, referred to as pervasive games, stress-
ing the pervasive and ubiquitous nature of these games: Pervasive
games are no longer confined to the virtual domain of the computer,
but integrate the physical and social aspects of the real world. (30)
1.2 Billiards
Our prototype TUI allows a user to manipulate a physical cue stick in a manner
natural to billiards players in order to control a virtual cue stick which is a part
of a physics-based billiards simulation. Following the terminology of (41), we
have created a TUI for the domain of modeling and simulation applications. The
tokens or tangible user interface elements of our system are based on the actors
that participate in a billiards game in physical reality and can be considered to
be spatial tokens as we are concerned with their physical proximity to each other.
These are a human player, a cue ball, and a cue stick. Note that the remaining
billiards balls exist only in our billiards simulation. Also, a physical billiards table
is not required. The user’s desktop acts as a mock billiards table.
(a) Billiards table with pockets. (b) Cue stick. (c) Billiard balls.
Figure 1.1: Typical pocket billiards equipment (not to scale). Please see Fig-
ure 2.2 for an example of a cue ball.
When playing billiards in the physical reality, the user views the table from
the same general vantage point as the cue stick. The cue stick has a tapered
cylindrical shape and is most often comprised of wood with a ferrule or cuff
at the tip used to receive most of the impact force so that the wood does not
split. The tips of cue sticks are usually 10 − 13mm in diameter and are covered
in chalk to increase the amount of friction between the stick and the cue ball
during impact. This prevents slippage or miscues. The mass of a cue stick varies;
common masses are ≈ 20 ounces. The cue stick is held by the user and is oriented
in order to strike the cue ball. The cue ball is generally an all-white ball. The
size and mass of billiard balls depends on the billiards game being played. For
instance, in American 8-ball, the cue ball has a mass of 6 ounces and is 2.25 inches
in diameter. The user strikes the cue ball by moving the cue stick into the cue
ball. The cue ball in turn collides with a target billiard ball. The goal in pocket
billiards games is to cause such a target ball to fall into one of the pockets located
on the corners and sides of a billiards table. This is referred to as pocketing a ball.
Our prototype focuses on pocket billiards games, leaving other types of games
(e.g. carom billiards) for future work.
1.3 Scene description
Our TUI expects a particular scene structure and defines constraints on the
appearance of tokens and how they are expected to be interacted with. User-
token interactions occur on or above a flat surface (e.g. a desk surface); thus, this
desk can be considered the reference frame of the TUI (41). A digital video (DV)
camera capable of capturing color video at a decent rate (e.g. 15 or 30 frames
per second (fps)) is mounted atop a tripod and is oriented to view the user’s desk
from above. The exact height does not matter as long as the tokens are visible to
the camera. The camera is attached to a computer which performs online video
capture (also referred to as image acquisition) and image analysis to detect the
actions of the user. The same computer is used for rendering a 3D, physically
based billiards simulation. Since the user is indirectly controlling the simulation
by moving the cue stick, the computer’s display should be visible to the user.
Our token detection algorithms utilize the features of the tangible object as well
as their shadows. Thus, we require a nearby light source of sufficient intensity to
illuminate the scene. In our prototype TUI, we use a common desk lamp with a
60 watt light bulb (see Figure 1.2).

Scene Element          Role
cue stick              token; controls virtual counterpart; shots
cue ball               token; provides realistic tactile feedback; shot aiming
piece of cloth/felt    container for cue stick, cue ball, and cue stick shadow
light source           illuminates tokens; shadows
planar object          aid for determining vanishing lines

Table 1.1: Scene elements of our prototype TUI
The green cloth/felt material mentioned in Table 1.1 is used as a container
of the cue stick, cue ball, and the cue stick’s shadow. By ’container’ we mean
that the view from the camera will show the cue stick, cue ball, and cue stick’s
shadow within the area bounded by this piece of material (see Figure 1.3). This
focuses the attention of our computer vision-based algorithms and thus helps
the algorithms detect the tokens. In our prototype TUI, we experimented with a
number of different materials, ultimately settling on a thin piece of material often
found in “golf putting green” products. We chose this type of material for a few
reasons. The rectangular shape and color of this material mimic those of a real
billiards table, which keeps us in line with our goal of a natural interface. Also, the
material exhibits diffuse or Lambertian reflection. That is, the surface appears
nearly equally bright from all viewing angles (11), reducing specular highlighting
(simply, bright spots) to a minimum. In practice, the effectiveness of computer
vision object recognition techniques may be hampered by specular reflections
on object surfaces such as polished desk surfaces by breaking the continuity of
otherwise uniform patches of color or resembling shapes being matched which
may lead to false-positives during object detection (33).
The planar object mentioned in Table 1.1 is used to determine a set of vanishing
lines of the plane of the desk. Such lines are used in cue stick pose estimation
(see Section 2.7.1). In our prototype TUI, we used a 3x5 inch index card for
this planar object. The dimensions of this object are not important. What is
important however is that the object is very thin, rectangular in shape, and has
a strong contrast with the surface on which it rests.
Figure 1.2: Scene layout.
1.4 Computer vision
We see computer vision and image processing technology – although
still relatively brittle and slow – play an increasing role in acquiring
appropriate sensor and scene models. Rather than using the video
signal merely as a backdrop on which virtual objects are shown, we ex-
plore the use of image understanding techniques to calibrate, register
and track cameras and objects and to extract the three-dimensional
structure of the scene. (24)
Figure 1.3: Top view.
Computer vision is a branch of artificial intelligence whose goal is “to make useful
decisions about real physical objects and scenes based on sensed images” or to
construct “scene descriptions from images” (40). Computer vision is a means to
understand the world using visual data and prior information. This visual data
is specified in the form of digital images. By “prior information” we mean that
recognizing objects in images requires us to know what we are looking for. That
is, we must have prior knowledge of the model of the objects we are trying to find
in images (3).
Computer vision systems process digital images acquired or captured from elec-
tronic cameras (33). A digital image is a 2D array of numbers, each of which is
referred to as a picture element or pixel. Each pixel generally has a small range of
potential values (e.g. an 8-bit pixel has the range [0,255]) and encodes the inten-
sity of light that falls on an image sensor at a certain location. Most systems use
255 as maximum intensity and 0 as minimum intensity. A binary digital image
is a 1-bit image in which 0 represents black and 1 represents white. A grayscale
image is comprised of 8-bit pixels and encodes 256 shades of gray (the value 255
is usually equivalent to full intensity or white). A color image generally uses 3
numbers per pixel to encode color components or channels. For instance, a true
color RGB digital image uses 24-bits to encode color intensities (8 bits for each
of red, green, and blue). There are many color models such as RGB, HSV, HLS,
etc. each with particular uses. We shall discuss the role of RGB and HSV color
models in our system in Section 2.5.1.
Algorithms that somehow search, manipulate, or filter the pixels of digital images
are collectively referred to as image processing algorithms. Image processing can
be seen as a mechanical way of manipulating the pixels of a digital image to some
end (e.g. perhaps making the image appear brighter to our eyes). Image process-
ing provides part of the foundation of image understanding in which meaningful
information is extracted from an image (40). We shall use image processing and
computer vision algorithms in order to reduce complexity. When complexity is
low, information is more easily extracted and decisions are easier to make. This
theme of reducing complexity is common in many computer vision systems. In
our system, we shall show how we reduce the complexity of seemingly random val-
ues of pixels in order to find interesting image features. In turn we shall attempt
to extract information from these image features in order to infer properties of
objects based on the models of the tokens of our TUI. From these properties we
will determine the state of each token of our TUI. Changes in a token’s state are
what we are interested in as these can be considered the use cases of our input
system.
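
As a concrete sketch of this complexity-reduction idea (not the exact pipeline of Section 2.5, and written against OpenCV's current C++ interface rather than the C-style API available in 2007), a single color conversion followed by a threshold collapses a full-color frame into a binary mask from which contours and other features can then be extracted. The HSV bounds shown are illustrative placeholders only, not the flexible color models used by our system.

    // A minimal sketch: collapse a full-color frame into a binary feature mask.
    // Assumes OpenCV's C++ API; the HSV bounds are illustrative placeholders,
    // not the flexible color models described later in Section 2.5.
    #include <opencv2/opencv.hpp>

    cv::Mat featureMask(const cv::Mat& frameBGR)
    {
        cv::Mat hsv, mask;
        cv::cvtColor(frameBGR, hsv, cv::COLOR_BGR2HSV);   // move to a hue-based color space
        cv::inRange(hsv,
                    cv::Scalar(35, 40, 40),               // placeholder lower H, S, V bound
                    cv::Scalar(85, 255, 255),             // placeholder upper bound
                    mask);                                // 255 where the color matches, else 0
        cv::erode(mask, mask, cv::Mat());                 // suppress isolated noise pixels
        return mask;                                      // contours/features are found in this mask
    }
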
Computer vision as an input mechanism is compelling for a few reasons. It is
passive in that the user does not need to physically interact with some
device (e.g. mouse, keyboard) but can interact in a normal manner with objects
appropriate for a particular context (e.g. using a cue stick to play billiards instead
of clicking and moving a mouse). Using computer vision as an input mechanism
is also inexpensive. Consumer-grade color DV cameras capable of capturing 15-
30 frames per second cost ≈ $30 − $60USD. There are significant challenges
with using computer vision as an input mechanism however. These challenges
are applicable to all computer vision systems. Our physical reality is three
dimensional (3D); any point in this reality can be specified by three coordinates
(X,Y ,Z). However, images captured by cameras are two dimensional (2D); points
in images can be specified by two coordinates (x, y). Thus, one dimension is lost
in the process of perspective projection that occurs as light is detected by image
sensors. Many computer vision algorithms try (and succeed) to recover this lost
dimension yet the problem is very difficult to solve in the general case. We shall
see one small set of such algorithms as used to recover the 3D pose of a billiards
cue stick. The algorithms described are not broadly applicable but are nonetheless
useful for study and of course have great utility for our specific prototype TUI.
The accuracy of observations made by a computer vision system is limited to
what the system can visually sense; accuracy is usually limited by the number
of pixels of the images being processed (although algorithms that use sub-pixel
measurements are becoming more common). Changes in lighting conditions can
cause a computer vision-based recognition algorithm to easily become confused.
Occlusions, in which objects of interest are hidden either partially or fully by an-
other object closer to the camera, can also confuse such algorithms. Objects that
undergo fast motion may be difficult to robustly detect and track. Cameras have
a host of possible issues which are described in Section 2.2. Digital images tend
to contain a fairly large amount of visual data. Processing this data efficiently is
important for real-time response rates. CPU utilization is also generally high for
computer vision system implementations due to the large number of pixels that
need to be processed, draining the overall system performance. Most of these
issues can be dealt with by careful system design or by imposing constraints;
e.g. there can be no fast-moving objects as in (28). However, finding a working
balance across all of the mentioned issues is a monumental task for any computer
vision system.
1.5 Goals and challenges
We have a number of goals we wish our system to meet.
• We wish to remove the intermediate layer between applications and users;
what user interface expert Jef Raskin calls the “operating system” problem
(35).
• Our main goal from the computer vision portion of this thesis is simultane-
ous object detection and pose estimation from natural object features. We
do not use markers to make objects easier to detect. Such a system is the
least invasive to users and hence the most natural.
• We wish to create a realistic, real-time VR billiards simulation that orients
a virtual billiards cue stick based on the detected 3D pose of the user-
controlled physical cue stick.
• We want to detect a user’s shots on a real cue ball, derive the parameters
that describe the shot, and physically model a similar shot in our billiards
simulation.
These goals are difficult to accomplish given our self-imposed constraints of
• using low-cost hardware such as webcams;
• using a single DV camera for acquiring digital images;
• requiring minimal user involvement in calibration or setup;
• using marker-less tokens in their natural context (non-invasiveness);
• robustly detecting potentially fast moving objects;
• allowing objects to undergo partial or full occlusions for indefinite amounts
of time.
1.6 Related work
Our system spans several areas of research and thus has a fairly broad scope. We
are interested in understanding the work of fellow researchers that have created
various TUI-based systems via computer vision or other means (e.g. RFID),
systems that use computer vision as an input mechanism, systems that provide
alternative interfaces to virtual billiards games, and systems that augment the
physical billiards experience for the advancement of training methods.
1.6.1 Existing TUI-based systems and toolkits
We found the metaphor of light, shadow, and optics in general to be
particularly compelling for interfaces spanning virtual and physical
space. (19)
An interface may be considered tangible when its interface elements are phys-
ical objects that somehow represent digital information (19). These interface
elements, hereafter referred to as tokens (41), are manipulated (e.g. moved, ro-
tated) by a user in order to manipulate digital information. In TUIs, physical
representations embody mechanisms for interactive control (13; 19). TUIs have
seen a large amount of research in the past ten years, starting with Fitzmaurice’s
Ph.D thesis on graspable user interfaces (10), and at the MIT Media Group where
the Tangible Bits initiative of (19) led to the creation of infrared, magnetic, and
electronic sensors for use in several TUI prototypes. The goal of Tangible Bits
is the same as that of all TUIs: “to bridge the gaps between both cyberspace and
the physical environment” (19). Indeed, we share the desire to “push computers
into the background”, and make user interfaces more human-centered and less
focused on traditional input devices.
Myron Krueger’s VideoPlace (25) created a broad interest in using cameras as a
passive input mechanism, inferring user intent by watching users instead of forcing
them to physically create inputs by way of a physical device. VideoPlace is an
inspiring work of both technical as well as philosophical merit.
In contrast to other work on TUIs, the only means of detecting user interactions
with tokens in our system is via online (or real-time) analysis of captured video
using computer vision-based algorithms. We do not rely on magnetic sensors (cf.
(45)) or electronic tags. Furthermore, our goal is to require a minimal number
of proprietary tokens (cf. (1; 7; 8; 13; 23)), focusing on natural interactions
with common tangible objects for a given context. For instance, while Klinker et
al. (24) use interactive user assistance and magnetic trackers to help make their
computer vision techniques more robust, we wish to keep user involvement with
the implementation of our TUI to a minimum.
The metaDESK project (42) attempted to bring conventional GUI widgets such
as windows, icons, and menus into the physical realm. The physical instantiations
of these user interface elements are sensed using optical, electromagnetic, and me-
chanical sensors. In contrast to metaDESK, our work is focused on creating TUIs
with tokens used in their natural context instead of focusing on creating TUIs
based on contrived tokens such as a “phicon”. We believe that using computer
vision instead of other types of sensors is less constricting as the sensor or sensor
rigs in our system are easy to set up and inexpensive compared with the sensors of
metaDESK which are embedded in the desk itself. We feel that metaDESK, while
a wonderful exercise in developing a prototype TUI, provides little advantage or
convenience over using the traditional input devices to control traditional GUI
elements.
Paper Mache is a toolkit for creating TUIs that utilize one or more of computer
vision, bar codes, and radio frequency ID (RFID) tags (23). The goal of Paper
Mache is to abstract various input mechanisms in order to make TUI development
easier for programmers and systems designers. Paper Mache wishes to shield de-
velopers from having to understand “a field very different from user interface
development” such as computer vision. While this is a laudable goal, we be-
lieve that it is difficult to provide toolkit support for a broad range of TUIs with
only a single API. For instance, abstracting computer vision and RFID-based
input leads to a simplified least-common denominator set of API features. Paper
Mache’s API is based around events wherein an application is notified of physical
objects (or Phob) entering the scene, leaving the scene, and otherwise changing
position or orientation. These events are the same across input mechanisms but
the computer vision based input mechanism allows access to the acquired image.
For advanced applications, accessing this image is almost always going to be re-
quired since Paper Mache’s computer vision analysis is fairly limited to finding
object properties based on contour analysis alone (i.e. with no direct support
for occlusion handling, fast moving objects, nor tracking implementations). Ac-
cessing the underlying image breaks the encapsulation of the system which is in
direct opposition to the overall goal of Paper Mache. Also, Paper Mache is imple-
mented in Java which may cause some concerns regarding system performance,
especially for a system like ours which has additional computational burdens (e.g.
our physics-based billiards simulation). For this reason, we have not attempted
to implement our TUI using this system.
Touch-Space is a mixed reality system that strives to create TUIs for human-
human and human-physical interaction (6). Touch-Space focuses on physical
reality mixed with AR and VR techniques to focus on the social aspects of TUIs.
Users engage in an ad hoc game that takes place in a large room. Users wear
head-mounted displays (HMD) with video cameras attached. Markers attached
to physical objects in the scene are sensed using computer vision by way of the
ARToolkit. By comparison, our system does not require markers on physical
objects. Also, our system does not require a HMD but renders 3D graphics on a
standard computer display screen. It would be interesting to incorporate a HMD
into our work and render the virtual billiards table of our simulation aligned with
the plane of the user’s desktop (see Figure 1.2). User feedback of the Touch-Space
system shows that users enjoyed the benefits of immersion provided by the use of
HMDs but did not enjoy the limited field of view and weight of the HMD itself.
Crayons (9) is a system similar in goal to Paper Mache in which the creation of
camera-based interactions is abstracted to ease the burden of the programmer.
Unlike Paper Mache however, Crayons focuses on pixel-level classification instead
of object-level analysis. Moreover, Crayons uses machine learning techniques for
classification instead of simple computer vision techniques. Crayons is interest-
ing to us in its attempt to reduce the burden of classification tasks using color
models as our TUI relies in part on color-based analysis for feature detection.
However, we believe that the complexity of computer vision related classification
and analysis is very difficult to abstract while simultaneously providing sufficient
power for demanding applications. Thus, systems like Paper Mache and Crayons
are ultimately not well suited to TUIs like ours in which accuracy is paramount
to ease of implementation.
1.6.2 Systems that use computer vision for user input
Many other technologies besides vision have been tried to achieve 3D
tracking, but they all have their weaknesses: Mechanical trackers are
accurate enough, although they tether the user to a limited working
volume. Magnetic trackers are vulnerable to distortions by metal in
the environment, which are a common occurrence, and also limit the
range of displacements. Ultrasonic trackers suffer from noise and tend
to be inaccurate at long ranges because of variations in the ambient
temperature. Inertial trackers drift with time. By contrast, vision has
the potential to yield non-invasive, accurate and low-cost solutions to
this problem, provided that one is willing to invest the effort required
to develop sufficiently robust algorithms. (29)
While this quote refers to using computer vision for tracking (simply, following
a moving object; see Section 2.4), it equally applies to systems that do not use
tracking but computer vision in general for object detection since “every tracking
method requires an object detection mechanism either in every frame or when
the object first appears in the video” (46). We agree with the main assessment of
this quote – that computer vision for use as an input mechanism is a fertile area
for research and has potential to offer great advantages. However, developing
robust algorithms to this end is anything but trivial as we shall see.
There are a number of systems that use computer vision as a user input mecha-
nism that rely on the detection of markers or fiducials that must be attached to
the objects of interest (8; 21). The orientation and scale of detected markers is
used for various application purposes. We have avoided marking tokens in our TUI
in order to afford the user the most natural experience possible. For instance, it
would not be overly natural to place a sizable fiducial on a cue stick, especially
since the marker must be oriented towards the camera at all times. We will not
consider such systems (although successful in their own uses) any further in this
thesis.
Visual Touchpad is a system that uses stereo computer vision techniques to track
a user's hand above a planar object. The movements of the user's hand act as
use cases to manipulate 2D graphical entities. By contrast our system only uses a
single camera. Also, Visual Touchpad determines the height of a finger above the
planar object by requiring an initialization step in which the finger is detected at a
height of ≈ 1cm above the planar object. At runtime, the finger is considered to
be touching the planar object when the disparity information revealed in stereo
vision analysis finds the finger to be within this 1cm distance. Thus, Visual
Touchpad ’s goal is to determine when the finger is touching the planar object
while our goal is to estimate the height of the cue stick above a planar object.
Our system also requires an initialization step to estimate the vertical offset of the
cue tip above the user’s desktop; we however use a completely different method
based on a single view of the scene to estimate this offset (see Section 2.7.1).
mulTetris (1) is a Tetris-style game in which the game pieces are translated
and rotated in accordance with the manipulations of colored blocks by a user
directly atop a computer display. The system uses computer vision as an input
mechanism by viewing the motions of the color blocks. The system uses color-
based tracking which we also experimented with (see Section 2.4). A major issue
with using such a tracker is that “if ... the environment happens to change
(lighting conditions, non-constant camera settings), the setup has to be redone”
(1). We have attempted to combat such issues by using flexible color modeling for
instance. mulTetris follows the basic principles of TUI design as set forth in (10).
The system requires two cameras to perform its analyses; we require only one.
One aspect of the development of mulTetris that we wish to copy in the future
is a usability study which often provides valuable feedback. However, we did not
consider this to be of pressing need for our prototype. While such a usability study
is very useful for the successful development of new user interaction patterns, the
accuracy of measurements is more important for a system like ours in which
interaction patterns are predefined by the context (e.g. billiards).
The PlayAnywhere project (43) is interesting to us for several reasons: it is very
expensive to implement relative to our project's cost (≈ $10,000 USD
vs. ≈ $50 USD); uses shadows as a scene element to derive part of an object's
pose; performs contour analysis to detect a sense of height of a scene object
above a plane of a desktop; and can be considered a TUI that allows users to
manipulate (translate, rotate) 2D graphical objects by direct manipulation of
projected images of these objects. PlayAnywhere also uses infrared light instead
of visible light to make feature detection more robust. We wish to investigate the
use of infrared light in future work.
In (2) we find yet another system that uses computer vision for detection of
moving objects by detecting colors in images; the system tracks balls as they fly
through the air during a juggling session. However the system requires the use
of stereo vision algorithms and thus requires two video cameras. The system is
noteworthy for its lack of markers, use of color-based features, and use of the
Kalman filter for motion prediction. See Section 2.4 for the reasons our system
does not ultimately use tracking algorithms.
1.6.3 Systems that implement mixed reality billiards
Several other systems have been developed that augment the billiards experience
in one form or another. Several projects provide a control interface to a VR
simulation. Others augment a physical billiards table with projected video for
the purposes of training. Others use HMDs to provide an augmented view of a
physical billiards table.
The HapStick (16) project is a mixed reality billiards project in which haptics
and force feedback are used to simulate the sensation of striking a real cue ball.
The user manipulates a cue stick that is physically tethered to a sensor. When
contact with the sensor is detected, the system effectively “pushes back” on the
stick to offer the feeling of striking a cue ball. The force of the shot is detected by
the system, and a similar force is imparted to a virtual cue ball in a physics-based
billiards simulation. While the system provides an accurate account of the force
of a cue stick that we have ourselves not been able to match, there are several
limitations that we hope to overcome in our project. The system costs (as of late
2007) hundreds of dollars to build and is nontrivial to construct. Such a system
is thus not easy to deploy to a wide audience. The interface, while allowing a
user to manipulate a physical cue stick, is not overly natural since the stick is
tethered to a rather large contraption. In other words, the sensing in this system
is active rather than passive. Also, no actual cue ball is struck which is clearly
also unnatural. Finally, while the system can detect and model varying degrees of
pitch and yaw of the cue stick, it can not model spin effects such as draw, follow,
and english that a sophisticated user would demand. While haptic technology
improves the immersive experience for users, our goal is to obviate the need for
haptics by allowing users to interact with real objects instead of haptic models
thereof.
Stochasticks (20) is a system which helps a user visualize potential billiards shots
on a physical billiards table. The user wears a HMD and sees lines and arrows
overlaid atop an otherwise conventional view of a billiards table. This system
is a mixed reality billiards system in that it uses AR technology to augment
the physical reality of playing billiards. As in our prototype, Stochasticks finds
a cloth area in order to focus the attention of its feature detection algorithms.
However, unlike our system, Stochasticks computes a model of billiard table cloth
colors offline in a precomputation step. We use a simple flexible color model for
matching. The aim of Stochasticks is similar to ours: provide a natural means
for a user to self-train in billiards. However, the means are quite different. Our
system allows a user to view the scene with their own eyes while Stochasticks requires
a tethered HMD to be worn which is not very natural. Also, Stochasticks requires
the user to be located near a physical billiards table. In our system, the user can
practice billiards using the normal equipment in the comfort of their own home.
As our physics-based simulation improves, we hope that the difference between
the motion of balls on a physical billiards table and our simulation becomes small
enough for training without a physical billiards table to become a viable reality.
It will be interesting to see how Stochasticks evolves as wearable computers and
HMDs reduce in price and size over time.
An excellent example of an augmented billiards training system is the Automatic
Pool Trainer (26). This system is similar to Stochasticks in that it augments
the physical billiards playing experience. However, Automatic Pool Trainer uses
a projector mounted on the ceiling above a billiards table. A colocated camera
views the surface of the table from a top-down vantage point. Computer vision
is used to detect the location of balls on the table. Armed with this knowledge,
the system projects lines, circles, and other visual aids to help the billiards player
visualize shots and position balls post-shot. While the system is a wonderful
example of how one can use AR to train users, it is also not deployable to a wide
audience since its hardware is generally expensive, requires a physical billiards
table, and also requires the ability to mount a projector and camera on the ceiling
above the billiards table.
1.7 System architecture
Our prototype TUI implementation is based on the client-server architecture
wherein a server process performs the vision-based token detection and pose
estimation and a client process performs a physics-based billiards simulation driven
by information received from the server. The server process sends information to
the client over a standard TCP/IP socket connection.
Figure 1.4: Client-server architecture.
The described client-server architecture was chosen mostly for performance im-
provements. Compared to a system that uses a single process to perform both
the image processing and physics simulation, our system can offload the image
processing tasks to one CPU and the physics and 3D rendering tasks to another
CPU in a multi-CPU (or multi-core) computer system. Such multi-core systems
are becoming much more common. For instance, the prototype TUI was devel-
oped on an Apple MacBook Pro which has two cores. Indeed, it is even possible
to use one computer to run the server process and another to run the client pro-
cess since the client and server communicate over a TCP/IP socket. We shall
refer to the sending of data between the server process and client process as IPC
for inter-process communication.
Since TCP/IP was not designed for extremely low-latency communications, one
may be concerned with an architecture that uses TCP/IP for an input mechanism,
as the latency of measurements of any input device should be as minimal as
possible. In our tests, we have used only a single computer with multiple cores.
Client-server communications in this scenario using a single TCP/IP socket were
fast enough to send on the order of 25,000 packets per second. However, since DV
cameras deliver only ≈ 30 images per second, our system is clearly not network-
limited.
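
To make the IPC concrete, the following sketch shows the kind of per-frame message the server might push to the client over its TCP/IP socket. The field names and the raw-struct wire format are illustrative assumptions; the actual packet layout is not specified here.

    // Sketch of a per-frame pose update sent from server to client over TCP/IP.
    // The fields and the raw-struct encoding are illustrative; a production
    // implementation would serialize fields explicitly to avoid padding and
    // byte-order issues between machines.
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <cstdint>

    struct PoseUpdate {
        float yawDegrees;     // cue stick yaw (Section 2.7.4)
        float pitchDegrees;   // cue stick pitch (Section 2.7.3)
        float tipHeight;      // vertical offset of the cue tip (Section 2.7.1)
        float tipToBallDist;  // distance from cue tip to cue ball (Section 2.7.2)
        std::uint8_t shot;    // nonzero when a shot has been detected (Section 2.8)
    };

    // Sends one update on an already-connected socket descriptor.
    bool sendPose(int sock, const PoseUpdate& p)
    {
        return send(sock, &p, sizeof p, 0) == static_cast<ssize_t>(sizeof p);
    }
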
Chapter 2
Server process
The server process is responsible for
• capturing images from a digital video camera;
• performing feature detection and extraction;
• extracting the state or pose of tokens from image features;
• detecting shots taken and the properties thereof;
• notifying connected clients of token states and shots taken.
Before we discuss the design and implementation of the server process itself, let
us first describe how the images acquired by an imaging device are formed before
delivery to an application like our prototype TUI. Such a discussion is necessary
in order to understand the limits of imaging devices and the subsequent issues
with captured images that our system must fight in order to accurately measure
elements of our TUI’s scene.
Figure 2.1: Image acquired from a 0.3 megapixel Apple iSight digital video
camera.
Figure 2.2: Image acquired from a 6 megapixel Nikon D50 digital SLR camera.
2.1 Image formation
The function of a digital video camera (for our purposes) is to sample the visible
spectrum of light at discrete instances of time and form a digital two dimensional
(2D) image representing the intensity of the photons of light that have fallen upon
the camera’s image sensors or cells. The average webcam has roughly 300,000
such cells. Compared with that of a single human eye which has over 100,000,000
rods and cones (crudely the equivalent of an image sensor), we can begin to
understand that modern cameras are not capable of capturing exquisitely fine
detail like the human eye can, at least in part because a digital video camera has
low acuity (sharpness).
2.2 Issues with acquired digital images
Every digital image acquired from a camera has some amount of error due to numerous
factors including manufacturing imprecision, the limits of camera circuitry, and
various tradeoffs that camera manufacturers balance in order to produce a consumer
product, such as cost vs. quality in lens design. In (40), an overview of such
issues is given. As we have encountered many of these issues in our work, we
provide a summary of the major issues and the countermeasures we have taken
to fight these issues in order to extract the most accurate measurements possible
from fairly low-quality imaging devices.
2.2.1 Geometric distortion
A camera lens that does not focus light as expected generates images that contain
some amount of geometric distortion. This type of distortion causes straight lines
in the world to be curved in acquired images. All cameras have some amount
of geometric distortion due to imperfections in the manufacturing process. How-
ever, there are techniques to quantify the amount of geometric distortion and
even correct it. Camera calibration is a technique widely used to discover the in-
trinsic and extrinsic properties of a camera. The intrinsic parameters of a camera
include a set of distortion coefficients that describe how a particular camera dis-
torts images. These parameters are used in the correction process as well. See
appendix A for details on how one can calibrate a camera using Intel’s OpenCV
library (18) and remove the geometric distortion in acquired images.
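
A minimal sketch of the correction step, assuming OpenCV's C++ interface: once the camera matrix and distortion coefficients have been recovered (e.g. via cv::calibrateCamera run against chessboard views, as outlined in Appendix A), each captured frame can be undistorted before any further analysis.

    // Sketch: undo lens distortion using previously computed calibration data.
    // cameraMatrix (3x3 intrinsics) and distCoeffs (distortion coefficients)
    // are assumed to come from an offline calibration pass (Appendix A).
    #include <opencv2/opencv.hpp>

    cv::Mat undistortFrame(const cv::Mat& frame,
                           const cv::Mat& cameraMatrix,
                           const cv::Mat& distCoeffs)
    {
        cv::Mat corrected;
        cv::undistort(frame, corrected, cameraMatrix, distCoeffs);
        return corrected;   // straight world lines now map to straight image lines
    }
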
. . . digital sensors consist of an array of “pixels” collecting photons,
the minute energy packets of which light consists. The number of
photons collected in each pixel is converted into an electrical charge
by the light sensitive photodiode. This charge is then converted into
a voltage, amplified, and converted to a digital value via the analog
to digital converter, so that the camera can process the values into
the final digital image. (33)
The vast majority of webcams and digital video cameras are based on either
Charge Coupled Devices (CCD) or Complementary Metal Oxide Semiconductor (CMOS)
technology. Such cameras use an array of sensors where each sensor collects a
charge proportional to the amount of light incident on the sensor. The charge
is sampled or discretized and made available as a digital image. The number of
sensors in the array defines the resolution of images captured by a CCD or CMOS
camera. See (33) for a more detailed discussion of CCD design. There are several
issues that we have experienced from using inexpensive webcams that use CCD
image sensors.
2.2.2 Blooming
The cells or sensors in a CCD array are not perfectly insulated from one another.
Blooming is an effect where charge collected at one cell leaks into neighboring
cells. We can think of a CCD image sensor cell as a bucket that collects charge;
once a bucket is full, the additional charge has no effect on the pixel value,
overexposing the pixel. When the charge overflows to neighboring pixels, those
pixels become overexposed as well.
2.2.3 Noise
Noise can be defined as the departure from a true signal. In the case of an acquired
digital image, the effects of noise can be seen as departures of pixel colors from their
true colors. Noise stems from a range of sources including differences in sensitivity
between otherwise identical image sensors, environmental influences, quantization
(rounding) errors, and so on.
Noise causes our measurements to jitter since image regions vary slightly with
noise from image to image. In the client simulation, the end result of these fluctua-
tions in measurements is a slight wobble in the cue stick orientation and position.
This causes imprecision at best and a general sense that the system isn’t very ro-
bust. We combat this with the use of temporal averaging and/or weighted moving
averages (discussed later).
Figure 2.3: Example of sensor noise. Here we see the hue channel of a single
captured HSV video image. The dark region marked by the letter ’A’ is of uniform
hue yet is speckled with various other hues, mostly attributable to sensor noise.
2.2.4 Low rate of acquisition
The rate at which a video capture device can deliver images to a software applica-
tion is called the device’s acquisition rate. A higher acquisition rate means that
less time passes between the delivery of each captured frame of video. Hence,
given a high frame rate, a software application would have a higher chance of see-
ing an event of interest occur in one of the captured frames. In turn, a low frame
rate causes a software application to in effect close its eyes for a (subjectively)
long period of time, only to reopen its eyes and try to make sense of what has
happened in the scene in the intervening time period. In other words, it is easier
to understand the world through vision with more visual data rather than less
(in general; optical illusions notwithstanding). See Section 2.8 for a discussion
on how the low frame rate of digital video cameras is dealt with in the context of
sensing billiard shots.
The definition of an acceptable acquisition rate is application dependent. For
instance, an application that attempts to sense or track moving objects (such as
ours) performs better given a higher acquisition rate while an application that
tracks the sun’s position in the sky over the course of a day does not require a
high acquisition rate in order to fulfill its goal.
We should note that a video capture device is not always completely to blame for
a low acquisition rate. An application that takes too long (i.e. longer than the
time to acquire the next image) to process each frame of video will also decrease
the effective acquisition rate. Such an application will cause most video capture
device drivers to drop frames to avoid maintaining a backlog of frames to be
delivered to a requesting application.
Since our application is a user input mechanism, our goal is to find a balance
between performance and accuracy. We wish to provide accurate updates of user
input derived from visual data on a regular, low-latency schedule. Our prototype
implementation causes a drop in the effective acquisition rate due to the amount
of image processing (and thus time) our algorithms require.
2.2.5 Low resolution
A digital image’s resolution defines the number of pixels that comprise the image.
We prefer to capture images of a sufficient resolution in order to be able to detect
small details, “sufficient” being application dependent. If an image’s resolution is
too small, some details will not be detected because the image is too coarse. If the
image resolution is large however, additional processing time and memory would
be required. Thus, there is a tradeoff between image size and resources required
to process the image. As a practical matter, many consumer-grade cameras offer
a small range (sometimes even only a single choice) of resolutions for captured
images. In this case we have little to no control over the image resolution; we must
use whatever resolution the camera offers. For instance, in this project, we used
an Apple iSight which offers a single image resolution of 640 x 480 pixels (columns
x rows); this is 640 x 480 ≈ 0.3 megapixels (MP; 1 MP is 1 million pixels).
Modern digital still-photo cameras offer a range of possible resolutions, even 10
megapixels or more. However, such cameras cannot sustain a high capture rate
as required by our system. Thus, we must use a digital video camera in order to
capture images at a higher rate, albeit at a lower resolution. For instance, see
Figures 2.1 and 2.2 which showcase two images, one taken by a webcam and the
other taken by a high quality camera used for still photography. One can clearly
see the difference in image quality between these two images.
Since a pixel represents a sample of the scene, that pixel projected back into the
scene represents the nominal resolution of the image sensor (40). For instance, if
a physical cue ball with a diameter of 2.25 inches is imaged to a circle of diameter
100 pixels, then the nominal resolution is 2.25/100 = 0.0225 inches. In other words,
we cannot detect a world feature that is smaller than 0.0225 inches in this case. We thus
recommend orienting the camera such that it tightly frames the area in which
tokens can be seen and not much more.
2.2.6 Chromatic aberration
Chromatic aberration is the inability of a lens to focus different colors in the same
focal plane. This occurs because different wavelengths of light are refracted by
different amounts when passing through a lens. While lens designers deal with
chromatic aberration, there are tradeoffs and thus some amount of chromatic
aberration will be found in almost all images.
Figure 2.4: Errors in color reproduction of digital video camera. Color repro-
duction error effects are made more obvious by increasing the saturation and
decreasing the brightness of a portion of Figure 2.1. Note the large amount of
purple hue caused by chromatic aberration.
There are image processing techniques that can reduce the effect of chromatic
aberration. However, we have not followed this route since the complexity of
understanding the relationship between aberration and the captured image is
high, and is different for different cameras.
2.2.7 Temporal averaging
To combat some of the above issues with acquired digital images, we simply
average the previous N acquired images. This is temporal averaging in that we
are averaging images taken over different instances of time. Temporal averaging
has the effect of reducing the variance in color at each pixel. Temporal averaging
assumes image noise is truly random.
The output pixel of a temporally averaged image I at column x and row y may
be defined by
I[x, y] = \frac{1}{N} \sum_{i=0}^{N-1} I_i[x, y]    (2.1)

where I_i is the i-th previously acquired image (i.e. i = 0 refers to the current
acquired image). Note that this averaging is performed on each color channel
(e.g. R, G, and B) separately.
While temporal averaging has an advantage of reducing the impact of color repro-
duction issues and sensor noise, it does effectively lower the capture rate N -fold.
Also, temporal averaging introduces motion blur for fast moving objects. In our
system, we have used N = 4 as a good tradeoff between increasing image quality
vs. decreasing capture rate.
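The listing below is a minimal sketch of this temporal averaging using OpenCV's accumulation
functions; the function name averageFrames and the use of a std::deque frame history are
illustrative and not part of the server code itself.

#include <deque>
#include <opencv/cv.h>

// Sketch: average the N most recent 8-bit BGR frames (N = history.size()).
// A 32-bit float accumulator prevents channel overflow; the result is
// converted back to 8 bits for further processing.
IplImage* averageFrames(const std::deque<IplImage*>& history)
{
    CvSize size = cvGetSize(history.front());
    IplImage* acc = cvCreateImage(size, IPL_DEPTH_32F, 3);
    cvZero(acc);

    for (size_t i = 0; i < history.size(); ++i)
        cvAcc(history[i], acc);                      // per-channel sum over the history

    IplImage* avg = cvCreateImage(size, IPL_DEPTH_8U, 3);
    cvConvertScale(acc, avg, 1.0 / history.size());  // divide each channel by N
    cvReleaseImage(&acc);
    return avg;
}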
2.2.8 Weighted moving averages
To reduce the effect of short-term changes in measurements due to image issues,
we use a weighted moving average (WMA) to focus on measurement trends. With
a WMA N measurements are stored in a history. The WMA can be queried for
an overall average at any given time. The most influence or weight is given
to the most recent measurement, with a linear falloff for the weights of older
measurements. We empirically determined that the best quality vs. latency tradeoff
exists with a WMA history length of N = 2 when used in addition to temporal averaging with N = 4.
Each WMA can, however, be tailored to fit specific scenarios.
\[ \mathrm{WMA}_M = \frac{n\,p_M + (n-1)\,p_{M-1} + \cdots + 2\,p_{M-n+2} + p_{M-n+1}}{n + (n-1) + \cdots + 2 + 1} \tag{2.2} \]
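A minimal C++ sketch of such a weighted moving average follows; the class name and interface
are illustrative, but the weighting matches Eq. 2.2.

#include <deque>
#include <cstddef>

// Linearly weighted moving average over the last n measurements (Eq. 2.2):
// the newest measurement receives the largest weight, the oldest weight 1.
class WeightedMovingAverage {
public:
    explicit WeightedMovingAverage(std::size_t n) : n_(n) {}

    void add(double p) {
        history_.push_back(p);
        if (history_.size() > n_) history_.pop_front();
    }

    double value() const {
        double num = 0.0, den = 0.0;
        for (std::size_t i = 0; i < history_.size(); ++i) {
            double w = static_cast<double>(i + 1);   // oldest -> 1, newest -> size
            num += w * history_[i];
            den += w;
        }
        return den > 0.0 ? num / den : 0.0;
    }

private:
    std::size_t n_;
    std::deque<double> history_;
};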
2.2.9 Manual camera control
Modern webcams and consumer grade digital video cameras offer a good amount
of control of the parameters of image formation. We recommend disabling au-
tomatic focus, automatic exposure, and similar camera-controlled automatic cal-
ibrations, opting instead for manual calibration per scene prior to image acqui-
sition. Automatic exposure is a common feature in many modern cameras. It
automatically controls the iris, shutter and gain to adjust the brightness of ac-
quired images depending on the brightness of sensed objects.
Finally, we should note that we are effectively repurposing webcams for a different
application than the one they were engineered for. Webcams are typically used for low bitrate video
conferencing, and not for endeavors requiring high precision and color reproduc-
tion (e.g. photogrammetry in general).
2.3 On stereo vision
Binocular or stereo vision, as in the human vision system, offers two slightly different
views of the same scene. Our brains work out the disparity between the views
of our left and right eyes and infer a sense of depth to points visible in both
views. There has been much research in duplicating the general stereo vision
configuration in computer vision by using two cameras and processing the images
simultaneously captured by both.
Since the major goal of our system is to reconstruct the 3D pose of the billiards
cue stick, it follows that stereo vision might be an applicable technique. However,
we have focused on using a single camera. A single camera is less expensive and
does not require the user to calibrate the stereo rig. A stereo system also requires
much more computational power in order to process two times the number of
pixels compared to a single-camera solution. It is also not clear that processing
a depth image (where each pixel contains the distance from the camera’s image
plane to the scene point) is a better tradeoff than using the techniques discussed
in this thesis since even the best performing stereo vision algorithms do not find
perfect depth calculations at every pixel. As we shall see, the techniques herein
discussed are fairly simple to implement and do not require multiple cameras.
2.4 On tracking algorithms
The function of computer vision-based tracking algorithms is to reliably detect a
moving object through a sequence of images. The main advantage of using tracking
is that tracking algorithms often provide a search window that focuses the atten-
tion of object detectors. Instead of searching the entire image, an object detector
need only search a smaller section of each image, thus increasing performance if
the object is detected within the window as less data needs to be processed. If
an object detector does not detect the target object in the search window, the
search window is generally translated or enlarged and the detection is retried.
This process is repeated until either the object is detected, the search window
grows larger than some threshold size, or a number of iterations passes without
a successful object detection.
Since our system has a billiards cue that moves when the user takes a shot,
one could easily see how tracking algorithms could be applicable to our system.
However, we have found through a number of experiments that several tracking
algorithms known to perform well fail to reliably track our cue stick. Others have
experienced similar issues:
In practice, even the best methods suffer such failures all too often, for
example because the motion is too fast, a complete occlusion occurs,
or simply because the target object moves momentarily out of the
field of view. (29)
While others (1) have had success with the CAMSHIFT algorithm (5), the poor
color reproduction quality of inexpensive digital video cameras as
well as the relatively low rate of capture coupled with a fast moving cue stick
causes several insurmountable issues.
• Search window explosion. The search window tends to enlarge out of control
when a decent histogram match is not found. Then, the window does not
tend to shrink if a good candidate area is actually in the image.
• The tracked histogram is updated over time. This caused the tracker to latch onto new
objects periodically. For instance, since the user’s hand can closely match
in color the cue stick, CAMSHIFT would periodically stop tracking the
cue stick and instead track the user’s hand. Granted, wearing a glove or
performing hand-segmentation could help this issue but forcing the user to
wear a glove is not very natural.
• A fast-moving cue stick tended to confuse the search. This was possibly
due to motion blur (the effect of not capturing images of a moving object
fast enough) temporarily altering the histogram of the search window.
• The need to choose an initial search window isn’t an overly natural user
interaction.
While there are a number of other tracking algorithms one could attempt to adapt
to our system (29), we felt that the need of tracking algorithms to be reinitialized
when detection fails makes the advantages of their use rather low. This is because
our fast moving cue stick tended to confuse trackers, thus requiring reinitialization
quite often. Furthermore, our scene layout is such that we can use methods other
than tracking to detect our TUI’s tokens.
Total, or even partial occlusion of the tracked objects typically results
in tracking failure. The camera [or objects] can easily move too fast
so that the images are motion blurred; the lighting during a shot
can change significantly; reflections and specularities may confuse the
tracker. Even more importantly, an object may drastically change its
aspect very quickly due to displacement. (29)
2.5 Feature detection methods
We define a feature as the image of some object or shape. While feature detection
refers to simply determining if the feature is present in an image, feature extrac-
tion is concerned with determining information from the feature (e.g. orientation,
position, size).
Figure 2.5: Server process logic flow.
In (33), a set of guidelines for feature detectors is specified based
on invariance properties. These guidelines state that a successful feature detector
should be able to detect features under varying position, rotation, or scale (size).
We shall discuss how we detect image features of our defined tokens and extract
information about their position and orientation in the following sections.
There exists a plethora of research (33; 40) on detecting shapes in images that we
could benefit from in our system if we didn’t have to deal with occlusions and fast
moving objects. Occlusions in our system can be caused by the player’s hands
holding the cue stick at various locations along the shaft of the cue stick. These
two issues have caused us to focus on a flexible technique of finding regions of
expected color and extracting information about these regions in order to find the
attributes of our tokens. Using color matching as our primary feature necessitates
the use of a color video camera as the image source as well as the use of visible (as opposed to
infrared) light.
It is worth noting that we do not require the user to use markers or special
equipment. Instead, the tokens of our TUI are the natural objects a user would
normally interact with in the context of a billiards game. In many other systems,
markers or fiducials are used to aid in the feature detection and extraction stage.
(29) notes that “even after more than twenty years of research, practical vision-
based 3D tracking systems still rely on fiducials because this remains the only
approach that is sufficiently fast, robust, and accurate.” While this may be true
in the general case, our system is a single example of how marker-less tokens are
easily detected and moreover that such a system allows for a more natural user
interface by simply letting user interface elements be. Any physical tagging or
marking of a cue stick for example would cause players to compensate for the
change and would likely be looked at as an annoyance.
The issues of acquired digital images such as error-prone color reproduction as
described in Section 2.2 may cause some concern since we have decided to use
color as our primary image feature of interest. However, we shall describe how
we use a flexible color model in order to deal with the issue of color reproduction
quality, changes in illumination, and shadows that may overlay tokens.
2.5.1 Color representation
Since our system uses the visible spectrum of light for feature detection, it makes
sense to discuss how we represent color in a digital form. A common representa-
tion is red-green-blue (RGB) in which a color is represented as a mixture of some
amount of red, green, and blue. For instance, the color yellow can be represented
in RGB as full red, full green, and no blue. There are many other color repre-
sentations or color spaces useful in various contexts. Since we are interested in
the color of objects under varying illumination, we would like a color space that
helps us define colors in which color or hue is constant even for darker or brighter
versions of the color.
Figure 2.6: Color mixtures in RGB.
One such color space is the hue-saturation-value (HSV) color space (39). Hue
represents what we normally consider the color. Saturation refers to how vibrant
the color is. Value represents how bright the color is; a low value is found in
darker colors. Note that hue is independent of saturation and value. That is, an
object with a green hue has a green hue even when dark or bright.
Let us define the flexible color model of a token as
\[ F = \begin{pmatrix} H \pm \Delta H \\ S \pm \Delta S \\ V \pm \Delta V \end{pmatrix} \tag{2.3} \]
where H is the mean hue of the FCM, S the mean saturation, V the mean value, ∆H
the allowed variance in hue from H, ∆S the allowed variance from S, and ∆V the
allowed variance from V . In practice, we properly handle the fact that the hue
wraps around from 359◦ to 0◦.
Since Intel’s OpenCV library (18) acquires color images whose pixels are specified
in BGR format (BGR is logically equivalent to RGB), we need to convert a BGR
pixel to a HSV pixel. Converting an entire BGR image to a HSV image entails
simply converting each BGR pixel to a HSV pixel. Let min be the minimum
value of the BGR pixel components b, g, and r. Let max be the maximum value
of the BGR pixel components.
\[
h = \begin{cases}
0 & \text{if } \max = \min \\[2pt]
60^\circ \times \dfrac{g - b}{\max - \min} + 0^\circ & \text{if } \max = r \text{ and } g \ge b \\[2pt]
60^\circ \times \dfrac{g - b}{\max - \min} + 360^\circ & \text{if } \max = r \text{ and } g < b \\[2pt]
60^\circ \times \dfrac{b - r}{\max - \min} + 120^\circ & \text{if } \max = g \\[2pt]
60^\circ \times \dfrac{r - g}{\max - \min} + 240^\circ & \text{if } \max = b
\end{cases}
\]
\[
s = \begin{cases}
0 & \text{if } \max = 0 \\[2pt]
\dfrac{\max - \min}{\max} = 1 - \dfrac{\min}{\max} & \text{otherwise}
\end{cases}
\qquad
v = \max
\tag{2.4}
\]
h is then normalized to the range [0, 360).
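In practice we do not convert pixels by hand; the sketch below performs the conversion with
OpenCV's cvCvtColor. Note that for 8-bit images OpenCV stores hue as H/2 (0 to 179), a detail
any color model built on these values must account for. The wrapper name is illustrative.

#include <opencv/cv.h>

// Sketch: convert an acquired BGR frame to HSV so each pixel's (h, s, v)
// components can be tested against a flexible color model.
IplImage* bgrToHsv(const IplImage* bgr)
{
    IplImage* hsv = cvCreateImage(cvGetSize(bgr), IPL_DEPTH_8U, 3);
    cvCvtColor(bgr, hsv, CV_BGR2HSV);
    return hsv;
}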
Figure 2.7: HSV color space.
We only require a color model for the cloth/felt token as well as the shadow of
the cue stick. We use other techniques based on thresholding, binary morphology,
and connected component analysis to detect other tokens such as the cue stick
and planar object.
2.5.2 Thresholding
(a) Color image. (b) Grayscale image. (c) Binary image. Shown is a result of thresholding Figure 2.8b with γ = 150 (out of 255 max.).
Figure 2.8: Example of thresholding.
We may find various feature detection tasks easier if we first reduce the grayscale
image into a binary image and then process the pixels of the binary image (see
Section 1.4 for an explanation of binary and grayscale images). This is often
performed using a technique known as thresholding. A grayscale image is thresh-
olded by rounding all pixel intensities greater than a given threshold γ up to
white (binary image pixel value of 1) and less than the threshold to black (binary
image pixel value of 0). Let G be a grayscale image and B be a binary image.
Then,
\[ B[x, y] = \begin{cases} 1 & \text{if } G[x, y] \ge \gamma \\ 0 & \text{if } G[x, y] < \gamma \end{cases} \tag{2.5} \]
where I[x, y] denotes the value of pixel at column x and row y in image I. There
are several methods to determine a desired value for γ including adaptive tech-
niques that first analyze the intensities of all pixels of G prior to thresholding. For
our prototype however, we have simply found a desired value for γ empirically.
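A sketch of this operation using OpenCV's cvThreshold is shown below; the value γ = 150 matches
the example of Figure 2.8 and the wrapper function name is illustrative.

#include <opencv/cv.h>

// Sketch: threshold a grayscale image into a binary image. Pixels brighter
// than gamma become white (255); all others become black (0).
void thresholdFrame(const IplImage* gray, IplImage* binary, double gamma)
{
    cvThreshold(gray, binary, gamma, 255, CV_THRESH_BINARY);
}

// Example use with the empirically chosen threshold from Figure 2.8:
//   thresholdFrame(grayImage, binaryImage, 150);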
NW N NE
W P E
SW S SE
Table 2.1: 4-way or 8-way neighbors of a pixel P .
2.5.3 Contour finding and analysis
We may wish to find regions, blobs, or areas of pixels in a binary image that
are spatially connected to one another in order to process these connected pixels
as units or groups. This is another tactic to reduce complexity in a computer
vision system. There are several names of a general set of methods to help us find
such regions in binary images such as connected components labeling (40), blob
detection, and contour finding (18). We shall adopt the term contours as used in
Intel’s OpenCV library as our implementation uses this library for detection of
such connected areas.
A pixel can be considered to be connected to another in a binary image if it shares
the same pixel value (both are white or both are black). The neighbors of a pixel
determine where a contour-finding algorithm searches for such connections. If we
envision a pixel centered in a compass, we can define the neighbors of the pixel
using either 4-way connections or 8-way connections. In a 4-way connection,
the pixel’s neighbors are found at the directions N , S, E, and W . In an 8-way
connection, the pixel’s neighbors are found at N , NE, E, SE, S, SW , W , and
NW (see Table 2.1).
The goal of a contour-finding algorithm is to find a set of contours given a binary
image and a choice of either 4-way or 8-way connections between pixels. There
are numerous algorithms to perform such contour finding, each with their own
performance tradeoffs and behaviors. We will thus refer the interested reader to
(18; 40) for details. In our project, we have relied upon the OpenCV function
cvFindContours to find contours.
If we allow contours to be nested, we can recursively find contours within other
contours. We call these interior contours child contours of a parent contour. Note
that parent and child will have opposite pixel values (white vs. black), alternating
with each nesting. In other words, a contour need not only be comprised of
connected white pixels.
(a) Binary image before contour finding. (b) Binary image after contour finding. Note the red border around the parent contour and the purple border around the child contour.
Figure 2.9: Example of contour finding.
Once we have a set of contours, we can derive several properties of the contour
as a whole. This includes such properties as the contour’s area, perimeter, center
of mass, bounding rectangle, oriented bounding box, minimum area rectangle,
circularity, and so on. We have created a small C++ library to help filter and
sort a list of contours based on predicates of a contour’s properties. For instance,
the following is a small code listing that finds a hierarchical set of contours, filters
the list by removing contours whose area is below a hard-coded threshold, and
then finds the largest child contour of the first contour remaining.
Listing 2.1: Example code for contour processing.
CvSeq* root;
CvMemStorage* memStorage = cvCreateMemStorage(0);

// Find a full tree of contours in the binary image.
cvFindContours(binaryImage, memStorage, &root,
               sizeof(CvContour), CV_RETR_TREE,
               CV_CHAIN_APPROX_SIMPLE);

// Collect the top-level contours and drop those with area below a
// hard-coded threshold of 100 pixels.
ContourVec contours;
siblingsOf(root, contours);
removeContours(contours, areaLT(100));
if (contours.size() == 0) throw "not found";

// Find the largest child contour of the first remaining contour.
ContourVec childContours;
childrenOf(contours[0], childContours);
if (childContours.size() == 0) throw "no children";
ContourInfo& largestChild = largestByArea(childContours);
// ...
cvReleaseMemStorage(&memStorage);
2.5.4 Binary image morphology
It is typical that thresholding will not produce binary images as cleanly as
expected. For instance, small bright highlights on an object might become white
in the resultant binary image even if this is undesired. We can alter the shape of
contours in the binary image using morphological operators such as erosion and
dilation. An erosion of a binary image has the effect of shrinking the contours
from their borders inwards while a dilation has the effect of expanding such
contours from their borders outwards. Morphological operators are useful in
removing contours that may be considered noise as an erosion could erode a
small contour to nothing. A dilation may be useful for filling in “holes” in a
contour (e.g. removing child contours by filling them in with the same pixel
value as the parent). We have only scratched the surface of a broader subject
called mathematical morphology. See (40) for a more in-depth discussion.
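For instance, the following sketch performs one erosion followed by one dilation (a morphological
"opening") with OpenCV's default 3 x 3 rectangular structuring element, removing contours small
enough to erode away completely. The wrapper name is illustrative.

#include <opencv/cv.h>

// Sketch: suppress small noise contours in a binary image, in place.
void removeSmallNoise(IplImage* binary)
{
    cvErode(binary, binary, NULL, 1);   // small contours erode to nothing
    cvDilate(binary, binary, NULL, 1);  // surviving contours regain roughly their size
}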
Figure 2.10: Example of erosion morphological operator. The left image shows an
original binary image. The right image shows the effects of the erosion operator
with rectangular structuring element applied to the image on the left. Notice
how small contours have completely eroded.
2.5.5 Flexible color matching
A useful image filter searches for pixels of a color image that match a color model
such as our flexible color model (see Section 2.5.1). Given an input image I, an
initially clear (all black or value 0) binary image B with equal dimensions as I,
and a flexible color model F , we wish to set each matching pixel of I to white
(value 1) at the corresponding pixel location in B. Thus,
\[ B[x, y] = \begin{cases} 1 & \text{if } I[x, y] \in F \\ 0 & \text{if } I[x, y] \notin F \end{cases} \tag{2.6} \]
where I[x, y] ∈ F is true when one of the following is true
\[
\begin{aligned}
\text{if } F_h + F_{\Delta h} \ge 360 &: \quad h_{xy} \ge 360 - (360 - F_h + F_{\Delta h}) \;\lor\; h_{xy} < F_{\Delta h} - (360 - F_h) \\
\text{if } F_h - F_{\Delta h} < 0 &: \quad h_{xy} \ge 360 - F_{\Delta h} \;\lor\; h_{xy} < F_h + F_{\Delta h} \\
\text{else} &: \quad F_h - F_{\Delta h} \le h_{xy} < F_h + F_{\Delta h}
\end{aligned}
\tag{2.7}
\]
and both of the following are true
\[
F_s - F_{\Delta s} \le s_{xy} < F_s + F_{\Delta s}, \qquad
F_v - F_{\Delta v} \le v_{xy} < F_v + F_{\Delta v}
\tag{2.8}
\]
where hxy, sxy, and vxy are the components of HSV pixel I[x, y].
The complex structure for handling matching on hue is based on the fact that
hue wraps around between 359◦ and 0◦ and the flexible part of an FCM (i.e. F∆h)
might wrap as well.
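The sketch below shows how such a match can be realized with OpenCV's cvInRangeS, ORing two
range tests when the hue interval wraps around 0◦. The FCM struct, the function name, and the
mapping of saturation and value to OpenCV's 8-bit range are assumptions of this sketch, not the
server's actual data structures.

#include <cmath>
#include <opencv/cv.h>

// Hypothetical FCM struct: hue in degrees (0-360); saturation and value
// already scaled to OpenCV's 8-bit range (0-255).
struct FCM { double h, s, v, dh, ds, dv; };

// Sketch: build a binary mask (8-bit, single channel) of pixels in an HSV
// image that match the FCM (Eqs. 2.6-2.8). OpenCV stores 8-bit hue as H/2,
// hence the divisions by 2.
void matchFCM(const IplImage* hsv, const FCM& f, IplImage* mask)
{
    if (f.dh >= 180.0) {
        // Delta-h covers the whole hue circle (e.g. the shadow FCM): ignore hue.
        cvInRangeS(hsv, cvScalar(0,   f.s - f.ds, f.v - f.dv),
                        cvScalar(180, f.s + f.ds, f.v + f.dv), mask);
        return;
    }

    double lo = std::fmod(f.h - f.dh + 360.0, 360.0);
    double hi = std::fmod(f.h + f.dh, 360.0);

    CvScalar sLo = cvScalar(0, f.s - f.ds, f.v - f.dv);
    CvScalar sHi = cvScalar(0, f.s + f.ds, f.v + f.dv);

    if (lo <= hi) {
        // Hue interval does not wrap around 0 degrees: a single range test.
        sLo.val[0] = lo / 2.0;  sHi.val[0] = hi / 2.0;
        cvInRangeS(hsv, sLo, sHi, mask);
    } else {
        // Hue interval wraps: match [lo, 360) or [0, hi] and OR the results.
        IplImage* tmp = cvCreateImage(cvGetSize(hsv), IPL_DEPTH_8U, 1);
        sLo.val[0] = lo / 2.0;  sHi.val[0] = 180.0;
        cvInRangeS(hsv, sLo, sHi, mask);
        sLo.val[0] = 0.0;       sHi.val[0] = hi / 2.0;
        cvInRangeS(hsv, sLo, sHi, tmp);
        cvOr(mask, tmp, mask);
        cvReleaseImage(&tmp);
    }
}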
2.5.6 Convex hulls and convexity defects
One of the properties of a contour (that is directly supported by OpenCV) is the
contour’s convex hull. Intuitively speaking, one can think of a convex hull as a
rubber band wrapped around the border of a shape.
(a) Cue stick and cloth/felt. (b) Resultant B image.
Figure 2.11: The left image (a) shows an image acquired from digital video
camera containing the cloth/felt token with cue stick token hovering above it.
The right image (b) shows the resultant image B in which white pixels match a
FCM designed for the cloth/felt token. Note the lack of a cue stick shadow due
to ambient room lighting.
A convex polygon is one
in which a line connecting any pair of vertices does not intersect an edge or side
of the polygon. If such a polygon fails to meet this requirement, the polygon is
not convex and the convexity defects can be found. These defects are the regions
along the perimeter of the polygon that cause the polygon to fail to be convex.
If we consider a contour to be a potentially many-sided polygon, we can find the
convex hull and any convexity defects for contours in our binary images.
We have already seen an example of a convexity defect. The cloth/felt contour in
Figure 2.11b is convex except for the single convexity defect caused by the image
of the cue stick jutting in from the top. We exploit this scenario in order to find
the cue stick and its shadow. Note that OpenCV provides an implementation
of both convex hull calculation and convexity defect discovery. A result of the
defect discovery is a point within the defect that is furthest distance-wise to the
convex hull. This can be seen in the right-most image of Figure 2.12.
Figure 2.12: Example of convexity defect. The left image shows a convex shape
of a black square. The middle image shows a convexity “defect” in the white
area; the white area makes the otherwise convex square a non-convex shape. The
right image shows the convex hull of the original square in red and the deepest
point of the convexity defect highlighted in cyan.
2.6 Pose driven feature detection
We have described the feature detection methods we use in our prototype TUI.
Feature detection lays the foundation for higher level information extraction. In
our system, the information we wish to extract from detected features is the 3D
pose of the cue stick. Let us then work backwards by first defining the information
we wish to extract (elements of the cue stick pose) followed by the features we
will need to detect in order to estimate the pose of the cue stick. See Table 2.2 for
a summary of cue stick pose elements. See Table 2.3 for a summary of features
detected using the aforementioned techniques.
Element | Description | See also
h | vertical offset (or height) of the tip of the cue stick above the user's desk | fig. 2.16
θ | pitch of the cue stick shaft | fig. 2.21
ψ | yaw of the cue stick shaft | fig. 2.24
distw | distance between the cue stick's tip and the cue ball in world units | fig. 2.20
a, b | expected 2D offset in ball space from the cue ball center point | fig. 2.25
Table 2.2: Elements of cue stick pose.
2.6.1 Planar object detection
The planar object of interest is a thin rectangular shape with high contrast edges.
To detect this planar object, we first we threshold the image with γ = 192
(we are assuming a bright object for simplicity). We then find contours in the
thresholded image. We apply a contour simplification algorithm called polygonal
approximation which essentially removes small fluctuations along the contour. For
instance, given a contour that roughly resembles a rectangle, applying contour
polygon approximation would result in a contour with 4 sides at near right angles
to one another. See OpenCV’s cvApproxPoly function for implementation details.
Feature | Helps find | Detection | Description
felt/cloth | cue stick, cue ball, cue stick shadow | auto | container
Tr | h | user | top point of reference object
Br | h | user | bottom point of reference object
St | h | user | shadow of Tr
cue tip | distw | auto | tip of the cue stick
ˆshaft | θ, ψ | auto | cue stick shaft
ˆshadow | θ | auto | shadow of cue stick shaft
Sc | ˆshadow | auto | shadow of the tip of the cue stick
planar object | h | auto | edges are used to find parallel lines on plane of desk
parallel lines on plane of desk | h | auto | used to find vanishing points of plane of desk and thus its vanishing line too
Table 2.3: Feature summary. Auto features are detected every frame. User
features are specified by the user, one time, at server initialization time.
FCM component | Value
h | 80◦
s | 0%
v | 0%
∆h | 60◦
∆s | 100%
∆v | 100%
Table 2.4: Flexible color model for the cloth/felt.
We filter the resultant set of contours, keeping those whose polygonal approxima-
tion results in a polygon with 4 sides that are at ±5◦ from a right angle to each
other. The largest such rectangular polygon is considered to be the planar object.
We are only interested in the planar object for its two sets of parallel edges or
sides. These two pairs of sides are used directly as the parallel lines features that
we require.
2.6.2 Cloth/felt detection
To detect the cloth/felt tangible we use the technique put forth in Section 2.5.5.
The result is a binary image containing contours that match the green FCM
devised for the cloth/felt material used in our prototype TUI. We filter small
contours by erosion and assume the largest matching contour is the cloth/felt.
We should note that we find a full tree of contours. That is, we are interested in
the children of the cloth/felt contour as well.
2.6.3 Cue ball detection
We have detected the cloth/felt contour and its child contours. Since the cue ball
is white, it does not match the color model of the cloth/felt. Thus, the white
pixels of the cloth/felt contour will have a circular black hole where the cue ball
is located. We find the cue ball simply by assuming it is the most circular child
Figure 2.13: Example of cue ball child contour. Here we see that a cue ball has
been placed on the cloth/felt and does not match the FCM defined for matching
the cloth/felt. The pixels of the cue ball's image are thus black. These pixels
comprise a child contour of the cloth/felt contour. Also note that the lack of a shadow
of the cue stick in this example is due to a large amount of ambient light in the
scene.
contour of all such contours. We remove child contours whose contour area is less
than some empirically determined threshold. We use the metric area/radius for
a measure of a contour’s circularity where area is the area of the contour (i.e.
number of white or black pixels) and radius is the radius of a circle that most
tightly encloses or fits the points that comprise the contour. OpenCV provides a
function cvMinEnclosingCircle to calculate such a circle. Circular contours will
maximize this metric. Once we have found the cue ball contour, we flood fill its
contour in the binary image (the result resembles Figure 2.11b) so that further
processing of the cloth/felt does not get confused by the cue ball contour.
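A sketch of this circularity measure, using OpenCV's contour area and minimum enclosing circle
routines, is shown below; the wrapper name is illustrative.

#include <cmath>
#include <opencv/cv.h>

// Sketch: circularity metric used to select the cue ball contour
// (contour area divided by the radius of its minimum enclosing circle).
double circularity(CvSeq* contour)
{
    CvPoint2D32f center;
    float radius = 0.0f;
    if (!cvMinEnclosingCircle(contour, &center, &radius) || radius <= 0.0f)
        return 0.0;
    return std::fabs(cvContourArea(contour)) / radius;
}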
2.6.4 Cue stick detection
We use the same technique as in the cloth/felt detection to detect shadows.
However, we first restrict the region of interest (ROI) to the axis-aligned bounding
rectangle of the cloth/felt. The rectangular area shown in Figure 2.11a bears a
close resemblance to the bounding rectangle of the cloth/felt. The color model
for the cue stick’s shadow simply looks for areas of low value (i.e. dark) without
regard for hue or saturation. The ∆v used was found empirically and can be
adjusted per scene to account for varying amounts of ambient lighting.
FCM component | Value
h | 0◦
s | 0%
v | 0%
∆h | 360◦
∆s | 100%
∆v | 35%
Table 2.5: Flexible color model for the cue stick's shadow.
The
result is an image with dimensions equal to the ROI that contains white pixels
where shadows were discovered inside the bounding rectangle of the cloth/felt
contour. We simply take the largest contour as the shadow contour. We then
draw this shadow contour filled with white into the cloth/felt image. We also fix
the convexity defects of the cloth/felt contour by drawing a thick line along the
perimeter of the convex hull of the cloth/felt. This has the effect of making the
cue stick a child contour of the cloth/felt instead of a convexity defect. We then
refind contours in the ROI and take the largest child of the cloth/felt contour as
the cue stick contour. These steps are illustrated in Figure 2.14.
Once we have the contours that represent the cue stick and its shadow (at least the
portions that are physically above the cloth/felt), we derive a vector for each that
describes the image space direction of each contour. To do this, we consider the
contour’s perimeter as a set of 2D points and extract these points for the shadow
and cue stick contours respectively. We then fit a line to each of these points using
the least-squares error criteria (40). Least-squares offers good performance but
can be inaccurate in general. However, we smooth all feature detection values by
a weighted moving average (WMA) which tends to all but eliminate the influence
of outliers at the expense of a slight latency in measurements. See Section 2.2.8
for more information about WMAs. The line found from the cue stick contour
gives us ˆshaft while the line found from the shadow of the cue stick gives us
ˆshadow (see Table 2.3).
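The line fitting itself can be performed with OpenCV's cvFitLine; the sketch below returns only
the unit direction component, which is what ˆshaft and ˆshadow require. The wrapper name is
illustrative.

#include <opencv/cv.h>

// Sketch: fit a 2D line to a contour's points by least squares (CV_DIST_L2)
// and return its unit direction vector.
CvPoint2D32f contourDirection(CvSeq* contour)
{
    float line[4];  // (vx, vy, x0, y0): unit direction followed by a point on the line
    cvFitLine(contour, CV_DIST_L2, 0, 0.01, 0.01, line);
    return cvPoint2D32f(line[0], line[1]);
}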
Note that from the convexity defect detection we have been given the deepest
point within the defect. Given our scene layout, the deepest such defects will be
equal to the tip of the cue stick and the shadow thereof. We have thus found the
cue tip and Sc from Table 2.3.
(a) Enclosing defects to make child contours.
(b) Shadow detected. (c) After shadow removal.
Figure 2.14: Example of cue stick shadow detection. The left image (a) shows
the cloth/felt contour with cue stick shadow and cue stick as child contours. Note
how the convex hull perimeter has been stroked to remove the convexity defects.
The middle image (b) shows the ROI of the cloth/felt contour with shadows
detected (white) using a FCM (see Table 2.5). The right image (c) shows the
cloth/felt image after shadow removal which leaves the cue stick child contour.
2.7 Cue stick pose estimation
A common task in a computer vision system is to find the 3D pose of objects
of interest. This involves extracting image features and matching these image
features to object features. In the feature detection stage, we discovered features
Figure 2.15: Screenshot of demo of feature detection. The overlaid graphics are
rendered atop acquired video images: the red arrow represents ˆshaft, the green
arrow represents ˆshadow, and the magenta circle (C) outlines the cue ball. Note
the slight errors: the red arrow does not perfectly match the heading of the cue
stick and C does not perfectly match the cue ball boundary. See Section 2.2 for
reasons behind such errors (other than those implicit with our methods).
of interest in image space. In this pose estimation stage, we analyze these features
to derive the full 3D pose of the cue stick token using projective geometry and
basic algebra.
A single dimension is lost under perspective projection, namely the distance from
the projection plane to each point being projected. In our system, the images
captured by our digital video camera are perspective projections of the scene onto
the camera’s image plane. The depth at each scene point is thus lost; all we are
given is the 2D location of the point in the captured image.
It is nontrivial to reconstruct the full 3D pose of imaged objects given only 2D
information. There are several techniques used for inferring 3D information of
image points such as stereo vision or range finding (40). However, such techniques
are quite complex and generally expensive to implement. We restrict our 3D pose
estimation to analysis of digital images captured by a single camera.
There are several canonical attributes of a cue stick that we wish to estimate from
extracted image features. These attributes comprise the state of the cue stick at
a given moment in time.
2.7.1 Estimation of cue tip vertical offset
Since we are utilizing a top-down view of the scene and lose the sense of depth from
the camera to any sensed scene point, we effectively lose the height of scene points
positioned between some reference plane and the camera. What we are attempting to
find is the vertical offset of the cue stick's tip from the plane of the user's desk.
Given that the cue ball will rest on this plane and that the diameter of the cue
ball in world units is known a priori, we can relate the vertical offset of the cue
stick’s tip above the plane of the desk to find the vertical offset of the cue stick’s
tip from the center of the cue ball. This is in turn used by the simulation to
position the virtual cue stick.
We have successfully adapted the techniques described in (36) to estimate the
vertical offset of the cue stick’s tip over the plane of the desk. The algorithm
requires the locations in the image of several scene points as well as parallel
lines on the reference plane. See figure 2.16 for a depiction of the required scene
points. Most of these scene points are automatically discovered during the feature
detection stage.
Figure 2.16: Scene points used in cue stick tip height estimation. Here the
reference object is a beverage can. The vanishing line of the plane of the desk
is the line that connects the two vanishing points at the horizon. The vanishing
points are derived from the parallel edges of the planar object.
In (36), the height of a soccer ball above the plane of the playing field is de-
termined by analyzing frames of video of a soccer game. The horizon line of
the plane of the playing field is found by fitting two sets of parallel lines to the
boundary line markers on the playing field. The lines of each set of parallel lines
intersect at a point on the horizon. We thus have two such points on the horizon.
The line that connects these two points is the horizon line of the plane of the
playing field. Several subsequent lines are derived by connecting all mentioned
points (see Table 2.6). The reference object in (36) is chosen as one of the soccer
players on the field.
In our system, we find parallel lines as the two pairs of parallel lines of the planar
object. Our reference object is a soft drink can. We have adapted the geometry
of (36) to our particular setup. We illustrate the geometry in Figures 2.17 and
2.18. We next compute the labeled lines in the figures using line-line intersections
(see Table 2.6).
Figure 2.17: Setup for estimation of h (part 1). This is an adaptation of Figure
3 in (36).
We continue to derive the points as referenced in Figure 2.18.
• v is the intersection point of lines l8 and l9.
• ri is the intersection point of lines l7 and l8.
• Pi is the intersection point of lines l6 and l9.
Figure 2.18: Setup for estimation of h (part 2). This is an adaptation of Figure
4 in (36).
Line | Description
l1 | through reference shadow (hence l1 = Br − St)
l2 | through Sc and parallel to l1; hence l2 intersects l1 on the vanishing line
l3 | through Tr and Sc
l4 | through the cue tip and Tr
l5 | through Br and the intersection point of lines l3 and l4; the intersection point of l5 and l1 is the projection of the cue tip on the plane of the desk, which we refer to as Pb
l6 | through Tr and the intersection point of l5 and the vanishing line
l7 | through the cue tip and the intersection point of l5 and the vanishing line
l8 | through Tr and Br
l9 | through Pb and the cue tip
Table 2.6: Scene lines required for the estimation of h from (36). See Figures 2.17
and 2.18 for a visual understanding.
The cross ratio of the points we have found on l8 is equal to the cross ratio of
the points on l9. Thus,
\[ \langle B_r, r_i, T_r, v \rangle = \langle P_b, P_t, P_i, v \rangle = \frac{h_r}{h_r - h} \tag{2.9} \]
where \( h_r = |T_r - B_r| \), and \( \langle a, b, c, d \rangle \) is the scalar cross ratio
\[ \langle a, b, c, d \rangle = \frac{|a - c|\,|b - d|}{|a - d|\,|b - c|} \tag{2.10} \]
We thus can determine the vertical offset h of the cue stick’s tip above the plane
of the user’s desk.
\[ h = h_r - \frac{h_r}{\langle B_r, r_i, T_r, v \rangle} \tag{2.11} \]
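A direct transcription of Eqs. 2.9 to 2.11 follows; the point type and function names are
illustrative, and hr is supplied by the caller as the height of the reference object.

#include <cmath>

struct Pt { double x, y; };   // illustrative 2D image point

static double dist(const Pt& a, const Pt& b)
{
    return std::sqrt((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y));
}

// Scalar cross ratio <a, b, c, d> = (|a - c| |b - d|) / (|a - d| |b - c|), Eq. 2.10.
double crossRatio(const Pt& a, const Pt& b, const Pt& c, const Pt& d)
{
    return (dist(a, c) * dist(b, d)) / (dist(a, d) * dist(b, c));
}

// h = hr - hr / <Br, ri, Tr, v>, Eq. 2.11, where hr is the reference object height.
double cueTipHeight(const Pt& Br, const Pt& ri, const Pt& Tr, const Pt& v, double hr)
{
    return hr - hr / crossRatio(Br, ri, Tr, v);
}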
2.7.2 Estimation of the distance between cue stick and
cue ball
The (Euclidean) distance between the cue stick's tip and the cue ball is the
indicator of when a shot occurs: when this distance is 0, the cue stick tip
is touching the cue ball. That is, there is a physical collision between the cue
stick and the cue ball. We describe how we derive the properties of a shot in
Section 2.8. Since we are trying to estimate the full 3D pose of the cue stick
using real world units, we must find the distance between the cue stick’s tip and
the cue ball in such units as well. The client simulation will use this distance in
world units to accurately position the virtual cue stick.
To determine the distance between the cue stick tip and the cue ball, we use a
very simple technique. The cue ball radius in world units, Rw, is known a priori.
Once we’ve detected the cue ball sphere in image space (a circle we shall label C
with radius Cr pixels), we can find the number of world units (e.g. meters) per
pixel with units/pixel = Rw/Cr.
Figure 2.19: Demo of cue stick tip height estimation. h was determined to be 2.09
in. The actual height measured with a tape measure was ≈ 1.99 in. Note that
we use a top-down camera pose when viewing the scene. The camera was pitched
upwards by ≈ 15◦ in this demo to better showcase that which is being estimated.
In the online system, the cue tip projection on the desk’s plane will generally not
be visible in image space since in a top-down view it will be directly underneath
the cue tip. The green and purple lines are the parallel lines as discovered by
detecting the edges of the planar object; see Figure 2.16 for a depiction.
Figure 2.20: Distance between cue stick tip and cue ball. Here, distw = 0.22m.
Since the camera is pitched downwards by −90◦ (due to a top-down pose) and the
normal of the plane of the user’s desk can be assumed to be 90◦ or “up”, points on
the plane of the user’s desk in the camera’s field of view are close to equidistant
to the camera’s image plane. This means that the ratio of units/pixel is not
affected by perspective foreshortening. In other words, every pixel in a captured
frame of video shares the same ratio of units/pixel because we use a top-down
camera pose. If the camera’s pitch ≠ −90◦, then different points on the plane
of the user’s desk would have a different value of units/pixel. Note that the
units/pixel ratio is based on the plane at height Rw above the plane of the user’s
desk. However, since the cue tip height above the plane of the user’s desk doesn’t
often deviate from Rw (in other words, most shots have only little to no draw or
follow), we ignore this small discrepancy in our current implementation.
Since we have detected the cue stick tip point T in image space, we can compute
the (Euclidean) distance in pixels disti between T and the intersection point P
of the cue ball’s projected circle C and a 2D line ˆshaft created from the location
and heading of the cue stick’s shaft. We explained how we found ˆshaft in section
2.6.4. We can thus convert disti to world units with the following, given that the
line from T along ˆshaft intersects C. If no such intersection exists, the system
notifies the user that the cue stick is not pointing at the cue ball.
\[
P = \mathrm{first\_intersection}(\widehat{\mathit{shaft}}, C), \qquad
dist_w = \frac{R_w}{C_r}\,|P - T|
\tag{2.12}
\]
where first_intersection is a function that finds the closest intersection of a
line and a circle (if any).
2.7.3 Estimation of cue stick pitch
We define the angle between the shaft of the cue stick and the surface of the
user’s desk as the cue stick’s pitch. For instance, when θ = 90◦, the cue stick
points down into the surface of the desk.
Pitch has a dramatic role in billiard shot dynamics. A shot taken with a cue
stick that has a small pitch (θ ≈ 5◦) imparts a force to the cue ball whose largest
component is in the direction parallel to the billiards table; i.e. along the
surface of the table. A shot taken with a cue stick that has a large pitch (θ ≈ 90◦)
allows for “jump shots” by driving the cue ball down into the table, causing the
ball to reflect off the surface of the table and into the air. Masse shots cause the
cue ball to undergo extreme spin effects and are made possible by using a large
pitch coupled with an off-center point of collision with the cue ball. In general,
varying degrees of pitch are employed to refine the major directional component
of the force imparted to the cue ball during a shot. Clearly, incorporating pitch
is critical for a realistic billiards simulation as well as more advanced techniques.
(a) Pitched cue stick with θ ≈ 45◦. (b) Top view of same scene. Note that the angle between the cue stick and its shadow is ≈ 45◦ as in (a).
Figure 2.21: Example of cue stick pitch.
Our strategy for estimating θ is based on the observation that the angle between
the shaft of the cue stick and the shaft’s shadow is roughly equal to the angle
between the shaft and the surface of the user’s desk (i.e. θ). Refer to Figure 2.21
for an example of this observation. Note that this observation is most accurate
when the light source is situated next to the cue ball. This has a useful side effect
of orienting the shadow of the cue ball to an area where it will generally not be
combined with the shadow of the cue stick.
We have defined ˆshaft to represent the direction or heading of the shaft of the
cue stick in image space. That is, given the image space location of the tip of the
cue stick T and a point on the shaft of the cue stick M ,
\[ \widehat{\mathit{shaft}} = \frac{T - M}{\|T - M\|} \tag{2.13} \]
We also define a vector ˆshaft′ that is simply ˆshaft but in the opposite direction.
Thus,
\[ \widehat{\mathit{shaft}}' = \frac{M - T}{\|M - T\|} \tag{2.14} \]
Let Ts be the image space point of the shadow of T and Ms the image space
point of the shadow of M . We have already defined ˆshadow as
\[ \widehat{\mathit{shadow}} = \frac{M_s - T_s}{\|M_s - T_s\|} \tag{2.15} \]
Thus,
\[ \theta = \cos^{-1}\!\left( \widehat{\mathit{shaft}}' \cdot \widehat{\mathit{shadow}} \right) \tag{2.16} \]
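A sketch of Eq. 2.16 in code follows; the vector type is illustrative and both inputs are assumed
to be unit vectors.

#include <cmath>

struct Vec2 { double x, y; };   // illustrative 2D unit vector

// theta = acos(shaft' . shadow), Eq. 2.16, with the dot product clamped to
// the valid acos domain to guard against floating point rounding.
double cueStickPitch(const Vec2& shaftPrime, const Vec2& shadow)
{
    double d = shaftPrime.x * shadow.x + shaftPrime.y * shadow.y;
    if (d > 1.0) d = 1.0;
    if (d < -1.0) d = -1.0;
    return std::acos(d);   // radians
}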
2.7.4 Estimation of cue stick yaw
In a real billiards game, the player can orbit 360◦ about the cue ball by moving
herself to various sides of the billiards table to gain a new vantage point. Each
position defines a different ˆshaft, the direction of the cue stick. In aeronautics,
yaw is defined as the angle in radians about the “up” vector in some defined
coordinate system. Since our camera affords us a top-down view of the scene,
we are essentially viewing the scene from a point on such an “up” vector. Thus,
ψ can be defined as the angle its shaft makes with one of the image coordinate
system’s axes. We have chosen the vertical axis as in the default scene layout,
the user takes a shot by moving the cue stick vertically in image space. Thus, we
define ψ to be a relative measure of the difference of the image of the shaft of the
cue stick with the image space vertical axis.
Figure 2.22: Image space (axes u and v).
To find the angle between two unit vectors, we take the inverse of the cosine
of the inner product of the two vectors. Thus, ψ = cos−1( ˆshaft · ˆdown) where ˆdown is the unit image space “down” vector. For instance, given an image space
coordinate system origin in the top left of the image with x extending to the right
and y extending down, ˆdown = 〈0, 1〉.
While we could allow the player to fully orbit the cue ball, we instead limit
ψ to ±45◦ and introduce the notion of the current side of the billiards table. The
user remains in the same general physical location and shoots the cue ball in the
same general direction for all shots taken (i.e. into the ball-catch; see fig. 1.2).
The current side of the table is the side of the virtual billiards table the user is
situated on in the simulation. To change the current side, the user simply presses
a key on the keyboard; the virtual cue stick is rotated by 90◦ from its current
yaw.
Since we wish ψ to be in the range [−45◦, 45◦] we must define when ψ is negative
and when it is positive, since the inner product only tells us the unsigned angle between two vectors.
We have arbitrarily chosen ψ > 0 when the cue stick points down and
to the right in image space and ψ < 0 when down and to the left. We discuss
how ψ is used to orient the virtual cue stick in a later chapter.
Given the above discussion, we can only determine the yaw when we have de-
termined the vector ˆshaft, the direction of the cue stick’s shaft in image space.
To find ˆshaft, we simply subtract the shaft point from the tip point as found in
section 2.7.1.
Figure 2.23: Current side of virtual billiards table.
(a) ψ > 0. (b) ψ < 0.
Figure 2.24: Examples of cue stick yaw. The left image (a) shows a positive yaw
while the right image (b) shows a negative yaw.
2.7.5 Estimation of spin-inducing parameters
Whenever a sphere is struck at an off-center point on its surface, the sphere
will spin due to the physical effect of torque. Controlling the effects of cue ball
spin is generally only a part of an advanced billiard player’s set of skills. A cue
ball struck somewhere on the lower half of the ball will undergo some amount of
backspin. A large amount of backspin could cause the cue ball to travel backwards
after colliding with another billiard ball. This effect, known as draw, is very useful
for strategically positioning the cue ball for the next shot. The same holds true
for follow, the forward motion of the cue ball after colliding with another ball as
caused by excessive topspin. Sidespin (left or right) is commonly referred to as
english and can be introduced to the cue ball by striking it at a point horizontally
off-center. English, draw, and follow can be incorporated to varying degrees in
a single shot as well. Our physical simulation incorporates these spin effects,
possibly satisfying a more advanced user of the system.
While we could let the physical simulation solely determine the effects of spin
given the heading (θ, ψ) and height (h) of the cue stick, we have decided to
estimate the coefficients of spin as part of pose estimation. The reasons for this
decision will become clearer during the discussion of shot detection and analysis
in Section 2.8.
Based on (27), we define a “ball-centric” coordinate frame (i, j, k) that we shall
refer to as “ball space” as depicted in Figure 2.25. Our goal is to estimate a and
b.
Note that a and b are encoded in the state of the cue stick’s pose as a percentage
of the radius of the cue ball. This allows a client simulation to use arbitrary
scaled cue ball models. Thus, a ∈ [−1, 1] and b ∈ [−1, 1]. When the cue
stick is aiming at the left half of the cue ball, a < 0. Also, when the cue tip is
aiming at the lower half of the cue ball, b < 0.
To estimate a we can use the point of intersection P of ˆshaft with the circle C
which represents the cue ball in image space. To estimate b we relate the
vertical offset of the cue tip above the plane of the user’s desk to the vertical
Figure 2.25: Ball space. {i, j, k} defines a coordinate system with origin at the
cue ball center. Shown at Pw is the point of future intersection of the cue stick
with the cue ball in world space (the black line represents the expected stick
trajectory as it collides with the cue ball). a and b define offsets from j and i
respectively.
offset of the center of the cue ball above the plane of the user’s desk. Since we
know the diameter of the cue ball a priori, we can find b.
\[ a = \frac{P_x - C_x}{C_r}, \qquad b = \frac{h - R_w}{R_w} \]
2.8 Shot detection and analysis
A shot occurs in a billiards game world when the user thrusts a cue stick into a
stationary cue ball causing a collision. To sense such a collision using computer
vision, we would ideally like to see the cue stick to cue ball collision at the exact
time of the event in order to derive the properties of the collision accurately.
However, a camera senses the visual spectrum of light at discrete instances of
time, not as a continuous stream of light. It is thus very difficult to directly be a
witness to an event whose duration is very small. High framerate cameras have
been invented to help combat this issue. With such a camera one can see on
the order of thousands of frames per second. However, such cameras are cost-
prohibitive to purchase (or even rent) for an exploratory student project such as
ours. Since one of our goals is a low cost solution, the use of such cameras is
beyond the scope of this project.
Given that most commodity video cameras can capture at a rate of at most 30
frames per second, we will likely not see a collision between the cue stick and cue
ball. This problem is referred to as undersampling a signal. Therefore, we must
be able to detect that a collision has occurred between two consecutive frames of
video. We thus have to estimate the time of collision and cue stick speed given
only the “before” and “after” visual states of the cue stick and cue ball. Indeed, it
is possible that the cue ball could be struck hard enough that it leaves the viewing
area before the camera captures the next frame of video! We assume that a shot
has occurred in this scenario if the cue stick is detected near to where the cue ball
was most recently detected. Moreover, we assume a shot has occurred in general
if the cue ball has moved from its previous location more than ε pixels.
2.8.1 Shot detection
To help us detect shots then, we introduce the distance history. The distance
history is a window or history of cue stick / cue ball distance observations over
time. We are interested in the distance between the cue stick and cue ball simply
because when this distance is 0, there is a collision between the cue stick and
ball; a shot has occurred. Refer to figure 2.20 for a visual depiction of cue stick
to cue ball distance. Note that the distance history is cleared after each shot is
taken.
Figure 2.26: Example of distance history. The distance between the cue stick
tip and cue ball (vertical axis) over time (horizontal axis).
In figure 2.26 we see the measured distance of the cue stick over a 10 second time
period. Let d(τ) denote the distance of the cue stick from the cue ball at time τ .
Note that d(τ) is a signed value; if the cue stick is located at or past the point at
which the cue ball was most recently detected, d(τ) ≤ 0.
In the example distance history seen in 2.26, the player moved the cue stick
toward and away from the cue ball several times before ultimately taking a shot
as seen in the last observation. The various back and forth motions correspond
to a pre-shot routine that many billiards players implement to increase muscle
memory and perhaps settle their nerves prior to taking a shot.
Since we are not in general able to detect the cue stick at a distance of exactly
0, we monitor the distance history observations for a zero crossing. This occurs
when d(τi) > 0 and d(τi+1) ≤ 0. When this happens, we have detected when the
cue stick tip has collided with the cue ball.
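The zero-crossing test itself is simple; a sketch follows, with the observation type illustrative.

#include <vector>

struct Observation { double time; double distance; };  // signed d(tau), world units

// Sketch: a shot is detected when the signed cue-stick-to-cue-ball distance
// crosses zero between the two most recent observations.
bool shotDetected(const std::vector<Observation>& history)
{
    if (history.size() < 2) return false;
    const Observation& prev = history[history.size() - 2];
    const Observation& curr = history[history.size() - 1];
    return prev.distance > 0.0 && curr.distance <= 0.0;
}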
2.8.2 Shot analysis
We could calculate the cue stick’s speed s as the change in distance over the time
interval spanning the zero crossing. However, this is not a good measure of the
cue stick’s speed as the time interval τi+1 − τi is on average short but with high
variance due to the irregular rate of observations and the unpredictability of the
time when a shot might begin. Instead, we search for the local maximum of d(t)
that occurs at the latest time. This corresponds to the right-most peak in the
graph of figure 2.26. Let us call the distance between the cue stick tip and the
cue ball at this local maximum d(τa) and the distance at the time of the last
observation d(τb). We thus consider the shot to take place between time τa and
τb. During this time frame, the cue stick is driven from a recoiled position into
the cue ball. Note that our algorithms do not reliably detect the cue stick as it
undergoes fast motion since this causes motion blur which is difficult to account
for. In a shot situation, this means we detect the cue stick before a collision is
made as well as very soon thereafter. We thus must make the assumption that
the speed of the cue stick is constant between τa and τb. Also, we assume that the
cue stick maintains its current orientation, moving exactly along its most recently
derived orientation. Given this scenario, we can derive the speed of the cue stick
at roughly the time of the collision to be
\[ s = \frac{d(\tau_b) - d(\tau_a)}{\tau_b - \tau_a} \tag{2.17} \]
Coupling the speed s with the derived orientation (yaw and pitch) of the cue
stick allows us to notify clients of a shot directed along a velocity vector. The
discussion of the client simulation in the following chapter describes how the
derived cue stick pose, shot detection, and shot analysis provide the backbone to
a realistic simulation of billiards.
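A sketch of this estimate follows: it walks backwards through the distance history to the most
recent local maximum (time τa) and applies Eq. 2.17 against the final observation (time τb). The
observation type and function name are illustrative.

#include <cstddef>
#include <vector>

struct Observation { double time; double distance; };  // signed d(tau)

// Sketch: estimate cue stick speed per Eq. 2.17. The value is signed; its
// magnitude is the speed of the cue stick during the final thrust.
double shotSpeed(const std::vector<Observation>& history)
{
    if (history.size() < 2) return 0.0;

    // Find the most recent local maximum of d(tau) by walking back while
    // the distance keeps increasing toward the past.
    std::size_t a = history.size() - 1;
    while (a > 0 && history[a - 1].distance > history[a].distance)
        --a;

    const Observation& A = history[a];
    const Observation& B = history.back();
    if (B.time == A.time) return 0.0;
    return (B.distance - A.distance) / (B.time - A.time);
}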
Chapter 3
Client process
While our system can support multiple TUI clients, we have focused on one such
client, a physically accurate 3D billiards simulation. The simulation is comprised
of the necessary elements to play billiards: a billiards table, a cue stick, a set
of billiard balls, and a cue ball. A virtual camera is automatically positioned to
view the scene from the vantage point of a virtual player; namely behind and
above the cue stick, however the cue stick may be positioned and oriented. The
major responsibilities of the client process are to
• physically model rigid objects such as billiard balls, cue sticks, and billiards
tables using realistic units of measure;
• detect collisions between physical objects and realistically model the after-
math of such collisions;
• receive TUI state updates using IPC from the TUI server;
• position and orient the cue stick based on the physical cue stick pose;
• derive the forces of each shot on the cue ball;
• physically model english and side spin;
• provide a physics event to game event translation layer;
• provide a basic game framework on which future games can easily be built;
• position the virtual camera for each shot;
• allow for training exercises by loading predefined ball layouts and shot goals;
• support visualizations like the “ghost ball” to aid in training;
• render 3D graphic depiction with a good amount of realism.
To our knowledge, there are few if any virtual billiards applications that are
open enough or documented in detail enough to make “plugging in” new input
mechanisms feasible. For this reason, we have created a physically based billiards
simulation from scratch, leveraging the Newton Game Dynamics (NGD) SDK
(32) for an implementation of rigid body dynamics.
A rigid body is a solid body in which deformations are neglected. The distance
between points on a rigid body does not change, even when undergoing trans-
lations and rotations. A rigid body has a local reference frame which is rigidly
connected to the body. The position of the body in the simulation’s global or
world reference frame is represented by the origin of the body’s local reference
frame. The orientation of the body is represented by the axes of the body’s lo-
cal reference frame. The position of each body is therefore defined by a linear
component as well as an angular component (i.e. orientation). The linear and
angular components can change with time, as a result of a force applied from a
collision from another rigid body perhaps. NGD works roughly by numerically
integrating the properties of each rigid body over time. These properties include
the linear and angular velocity of each body as well as the linear and angular
acceleration of each body.
Physical simulation on a computer is a difficult task to robustly perform as it is
computationally intense and demands a regular rate of updating or stepping the
state of each body over time. Without a regular stepping rate, numeric integrators
tend to explode, characterized by physically unrealistic behavior of bodies in the
simulation. The general guideline for maintaining stability in an integration based
physics simulation is to use a fixed amount of time when updating the state of
each body. We have followed this advice in our simulation, updating NGD at a
rate of 300 Hz, each update with a small, fixed timestep.
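A sketch of such a fixed-timestep loop follows; stepPhysics is a hypothetical wrapper around the
engine update call (NewtonUpdate in NGD), and the accumulator pattern shown is one common way
to decouple the render rate from the 300 Hz physics rate.

// Hypothetical wrapper around the physics engine's update call.
void stepPhysics(double dt) { /* e.g. NewtonUpdate(world, (dFloat)dt); */ }

const double kTimestep = 1.0 / 300.0;   // 300 Hz physics update, as in our simulation

// Sketch: consume the elapsed frame time in fixed-size physics steps so the
// numeric integrator always advances by the same small amount.
void advanceSimulation(double frameSeconds)
{
    static double accumulator = 0.0;
    accumulator += frameSeconds;

    while (accumulator >= kTimestep) {
        stepPhysics(kTimestep);
        accumulator -= kTimestep;
    }
}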
While numeric integration has been a popular implementation strategy for physics-
based simulation of rigid bodies (e.g. in many modern computer games), there
exist closed-form solutions for simulating the physics of billiards (e.g. (27)). We
have not chosen this route due to time constraints and ease of implementing
a simulation with NGD. However, a physics-based simulation is never simply a
“fire-and-forget” scenario in which the properties of rigid bodies are specified
and a generic rigid body dynamics simulator handles the rest. Instead, forces,
torques, velocities, etc. must often be derived and manually specified for a spe-
cific context. We have leveraged some of the work of (27) to this end for deriving
and applying the physics of cue stick to cue ball collisions (or shot dynamics ; see
Section 3.3).
A complete overview of physical simulation is beyond the scope of this thesis.
Instead, we refer the reader to background material found in (4; 17). Our discussion
focuses on how we used NGD in our simulation to accurately depict
the physics of billiards. To this end we have used realistic units of measure for
the properties of each rigid body in the simulation. This includes the dimensions
of the billiards table, the mass of the cue stick, the mass of the billiard balls, the
force of gravity, and the coefficients of restitution and friction between rigid bodies,
so as to accurately model collision and friction effects. The dimensions of our billiards table model are
based on the specifications of the World Pool-Billiard Association (44).
The elements of the 3D simulation were modeled using Google SketchUp (15),
which is a 3D modeling package with a simple user interface. Our models were
exported from SketchUp using a Ruby (38) language script into a format that
is directly supported by the 3D rendering engine OGRE (34). OGRE is used
to render a 3D view of the simulation from the viewpoint of a virtual camera.
The virtual camera is positioned and oriented to view the scene from the vantage
point of where a right-handed billiards player would be positioned to grasp the
virtual cue stick. We have not focused much on realism in our 3D depictions of
a billiards scene. However, our renderings do support shadows for depth cues, as
well as texture-mapped billiard balls. Environment mapping is also used to give
the billiard balls a glossy appearance.
Figure 3.1: Screenshot of a Google SketchUp modeling session. The image shows
the virtual billiards table used in the client simulation.
3.1 Simulation states
The client simulation can be in one of three states at any given time, namely
1. shot setup
2. shooting
3. physical simulation
The shot setup state is active when there are no billiard balls in motion. While
this state is active, the client process receives updates from the server process
containing the current state of the physical cue stick token (see Table 2.2 for
details). With each update, the client updates the position and orientation of the
Figure 3.2: Client process logic flow.
virtual cue stick to match that of the physical cue stick token. The steps taken
to position and orient the virtual cue stick are discussed in Section 3.2.
The shooting state is active when the client receives an update from the server
process that a shot has been detected. While this state is active, the cue ball’s
instantaneous linear and angular velocity are determined based on the properties of the
detected shot. The derivation of these values is discussed in Section 3.3.
The physical simulation state is active after the cue ball has been set in motion
during the shooting state. While this state is active, NGD is in control, perform-
ing numerical integration for each rigid body in the simulation and delivering
collision events to the application. Details of event handling are discussed in
Section 3.4.
Once all billiard balls have stopped moving (i.e. their linear velocities fall below
some predefined threshold), the shot setup state becomes active again and the user
is free to line up the next shot. The cycle of shot setup, shooting, and physical
simulation repeats until the game is over or the user chooses to abort it.
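These three states and their transitions can be summarized as a small state machine, sketched below; every identifier in the sketch is hypothetical and merely mirrors the flow of Figure 3.2.

// Illustrative sketch of the client state machine (all names are hypothetical).
enum class SimState { ShotSetup, Shooting, PhysicalSimulation };

struct CueStickPose { /* yaw, pitch, a, b, distw from the TUI server */ };

class ClientSim {
public:
    void tick(float dt);
private:
    // Hypothetical helpers standing in for server IPC, shot handling, and NGD stepping.
    CueStickPose latestServerPose();
    bool shotDetectedByServer();
    void applyCueStickPose(const CueStickPose& pose);   // Section 3.2
    void applyShotToCueBall();                          // Section 3.3
    void stepPhysics(float dt);                         // NGD fixed-step updates
    bool allBallsAtRest();                              // linear velocities below threshold

    SimState state_ = SimState::ShotSetup;
};

void ClientSim::tick(float dt)
{
    switch (state_) {
    case SimState::ShotSetup:
        applyCueStickPose(latestServerPose());          // mirror the physical cue stick
        if (shotDetectedByServer()) state_ = SimState::Shooting;
        break;
    case SimState::Shooting:
        applyShotToCueBall();                           // set cue ball velocities
        state_ = SimState::PhysicalSimulation;
        break;
    case SimState::PhysicalSimulation:
        stepPhysics(dt);                                // NGD is in control here
        if (allBallsAtRest()) state_ = SimState::ShotSetup;
        break;
    }
}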
3.2 Orienting the virtual cue stick
The 3D model we use for a cue stick has a local frame origin positioned halfway
along the central axis (−X) of the cue stick. The cue tip is thus located at x < 0.
We use OGRE’s scene graph node manipulation routines to first translate the
cue stick to the center of the virtual cue ball. Next we set the orientation of the
cue stick to the product of two quaternions y and p derived from rotations of
ψ and θ about the “up” and “right” vectors respectively. An axis-angle (ω, θ)
representation of a rotation can be converted to a quaternion representation Q
via Q = (cos θ/2, ω sin θ/2). Finally, we translate the cue stick in its local body
frame based on a, b, and distw (see Table 2.2) which are selected from the most
recent cue stick pose update received from the server process.
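A minimal sketch of these three steps using OGRE's scene-node API follows. The function signature and, in particular, the axis mapping used for the final local-frame translation are assumptions made for illustration; the actual mapping of a, b, and distw onto local axes follows the conventions of Table 2.2.

// Illustrative sketch: positioning and orienting the virtual cue stick with OGRE.
#include <Ogre.h>

void orientCueStick(Ogre::SceneNode* cueNode,
                    const Ogre::Vector3& cueBallCenter,
                    Ogre::Radian psi,      // yaw about the "up" vector
                    Ogre::Radian theta,    // pitch about the "right" vector
                    Ogre::Real a, Ogre::Real b, Ogre::Real distw)
{
    // 1. Translate the cue stick's local origin to the center of the virtual cue ball.
    cueNode->setPosition(cueBallCenter);

    // 2. Compose the yaw and pitch quaternions (the axis-angle to quaternion
    //    conversion is handled by the Ogre::Quaternion constructor).
    Ogre::Quaternion y(psi,   Ogre::Vector3::UNIT_Y);
    Ogre::Quaternion p(theta, Ogre::Vector3::UNIT_X);
    cueNode->setOrientation(y * p);

    // 3. Offset the stick in its own body frame so the tip (at x < 0) ends up at the
    //    contact point given by a, b, and the tip-to-ball distance distw
    //    (axis mapping assumed here for illustration).
    cueNode->translate(Ogre::Vector3(distw, b, a), Ogre::Node::TS_LOCAL);
}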
3.3 Shot dynamics
We leverage the work of (27) for our model of how a cue stick to cue ball impact
sets the instantaneous linear velocity and angular velocity of the cue ball. We
follow the general setup as in Figure 2.25. Again, let sw denote the central axis
of the cue stick token; ~sw is parallel to the k− j plane and is pitched at angle θ to
the i− k plane. Let the contact point of the cue stick on the cue ball surface be
point Pw, Rw be the known world radius of the cue ball, a be the horizontal offset
of Pw from the center of the cue ball, and b the vertical offset (see Table 2.2 for
details). Let c = √(Rw^2 − a^2 − b^2). Thus, Pw = (a, c, b). From this, (27) derives
the force F imparted to the cue ball due to the impact as

F = 2 m V0 / (1 + m/M + (5/(2 Rw^2)) (a^2 + b^2 cos^2 θ + c^2 sin^2 θ − 2 b c cos θ sin θ))   (3.1)
where m is the mass of the cue ball, M is the mass of the cue stick, and V0 is the
initial velocity of the cue stick at the time of the collision. We assume that the
duration of the collision is negligible. Thus, the instantaneous linear velocity of the
cue ball at the time of collision is ~v = ~F/m = (0, −(F/m) cos θ, −(F/m) sin θ).
We also use (27) to set the instantaneous angular velocity ~ω:

~ω = (1/I) (−c F sin θ + b F cos θ, a F sin θ, −a F cos θ)   (3.2)

where I is the moment of inertia (simply, the equivalent of mass in the context of
a rotating body) of a solid sphere, I = (2/5) m Rw^2.
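The following self-contained sketch evaluates Equations 3.1 and 3.2 directly. The variable names follow the text; the struct and function names are illustrative rather than the prototype's actual code.

// Sketch of the shot-dynamics computation from Equations 3.1 and 3.2.
#include <cmath>

struct Vec3 { double x, y, z; };

struct ShotResult { Vec3 v; Vec3 w; };   // linear and angular velocity of the cue ball

ShotResult computeShot(double a, double b, double Rw,
                       double theta,     // cue stick pitch angle (radians)
                       double V0,        // cue stick speed at impact
                       double m,         // cue ball mass
                       double M)         // cue stick mass
{
    const double c  = std::sqrt(Rw * Rw - a * a - b * b);
    const double s  = std::sin(theta);
    const double co = std::cos(theta);

    // Equation 3.1: force imparted to the cue ball.
    const double F = (2.0 * m * V0) /
        (1.0 + m / M + (5.0 / (2.0 * Rw * Rw)) *
         (a * a + b * b * co * co + c * c * s * s - 2.0 * b * c * co * s));

    // Instantaneous linear velocity (collision duration assumed negligible).
    ShotResult r;
    r.v = { 0.0, -(F / m) * co, -(F / m) * s };

    // Equation 3.2: angular velocity, with I = (2/5) m Rw^2 for a solid sphere.
    const double I = 0.4 * m * Rw * Rw;
    r.w = { (-c * F * s + b * F * co) / I, (a * F * s) / I, (-a * F * co) / I };
    return r;
}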
3.4 Event handling and game logic
NGD provides a callback system as part of its API in order to notify an application
controller of physical events between two rigid bodies in the simulation. An
application controller can handle these events however it deems fit, including
playing audible collision sounds, or passing the event on to a game logic handler
as we do in our prototype.
The game logic handler is a bridge between physical events involving billiard
balls, the rails (borders) of the billiards table, the cue stick, etc. and the logical
realm of the rules manager for a particular billiards game. The game
logic handler translates the properties of a physical collision between two rigid
bodies into logical game events. For instance, one such event might be “the cue
ball collided with the 8 ball”, or “the 12 ball was pocketed in the northwest corner
pocket.” Each game rules manager handles these events in different ways. This is
a classic example of object-oriented design using inheritance to allow subclasses
to override the behavior of a parent class.
We have only implemented an ad hoc “game” in our prototype due to time
constraints. This game randomly positions billiard balls across the table’s surface
at the start of each game. The user is prompted to try to pocket all of the balls
in as few shots as possible. We plan on implementing 8-ball, 9-ball, and snooker
in the future.
In our simulation, it is trivial to detect when a ball is pocketed. We treat each
pocket as a separate rigid body, albeit one that never changes position or
orientation (i.e. pockets are static bodies). NGD notifies the simulation which
bodies are involved in a collision event. In addition, we can associate a numeric
identifier with each body, which is used to determine the logical identity of each
body involved in a collision event.
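The sketch below illustrates both ideas: per-body identifiers that give each rigid body a logical identity, and a rules-manager base class whose subclasses override only the game events they care about. All names are hypothetical.

// Illustrative sketch of the physics-event to game-event bridge.
enum class BodyKind { Ball, Rail, Pocket, CueStick };

struct BodyId { BodyKind kind; int number; };   // e.g. { Ball, 8 } or { Pocket, 2 }

// Base rules manager; concrete games (8-ball, 9-ball, ...) override the handlers
// they care about.
class GameRules {
public:
    virtual ~GameRules() = default;
    virtual void onBallHitBall(int a, int b)          { (void)a; (void)b; }
    virtual void onBallPocketed(int ball, int pocket) { (void)ball; (void)pocket; }
};

// Translates an NGD collision callback (a pair of body identities) into a
// logical game event on the rules manager.
void translateCollision(const BodyId& a, const BodyId& b, GameRules& rules)
{
    if (a.kind == BodyKind::Ball && b.kind == BodyKind::Ball)
        rules.onBallHitBall(a.number, b.number);
    else if (a.kind == BodyKind::Ball && b.kind == BodyKind::Pocket)
        rules.onBallPocketed(a.number, b.number);
    // rail and cue stick contacts handled similarly
}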
3.5 A focus on user training
Since our billiards simulation is not constrained to the physical world, we have
more control to implement training exercises that would otherwise be cumbersome
or time-consuming for users to do on a real billiards table. For example, if a user
is not very skilled at a particular type of billiards shot (e.g. a bank shot), we
could provide a mechanism to help the user improve on these types of shots
by arranging balls in a predetermined configuration. As the user shoots, events
would be fired and matched against an expected event sequence. If the user-
generated events match the expected sequence, the attempted shot is deemed
successful. The system could also maintain statistics of a user’s performance on
such shots over time. The chief advantage of this simulation-based training over
practice on a real billiards table is the time saved.
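A minimal sketch of this sequence matching follows; the ShotGoal class and the event strings are hypothetical and only show how fired game events could be checked, in order, against a drill's expected sequence.

// Hypothetical helper that tracks a drill's expected sequence of game events.
#include <deque>
#include <string>
#include <utility>

class ShotGoal {
public:
    explicit ShotGoal(std::deque<std::string> expected) : expected_(std::move(expected)) {}

    // Feed each game event as it fires; returns true once the whole expected
    // sequence has been observed in order.
    bool onEvent(const std::string& event)
    {
        if (!expected_.empty() && expected_.front() == event)
            expected_.pop_front();
        return expected_.empty();
    }

private:
    std::deque<std::string> expected_;
};

// Example goal for a bank-shot drill (event strings are hypothetical):
//   ShotGoal goal({"cue ball hit 3 ball", "3 ball hit rail", "3 ball pocketed NW"});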
Figure 3.3: Screenshot of client simulation showcasing the “ghost ball” feature.
The semi-transparent sphere near the 8-ball is the ghost ball. Ghost ball visu-
alization is a pedagogical technique used to help players visualize where the cue
ball will collide with a target ball.
We have also implemented a “ghost ball” feature (see Figure 3.3). A ghost ball
is a ball that is not a proper part of the simulation but only exists during the
setup phase of a shot in which the user orients the cue stick. A semi-transparent
cue ball is rendered at the location where the cue ball will collide with a target
billiards ball if a shot is taken with the cue stick in its current orientation. This
visualization technique is commonly recommended by billiards instructors to their students.
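Placing the ghost ball amounts to sweeping a sphere of the cue ball's radius along the current aim direction and finding where it first touches a target ball. The 2D sketch below (in the table plane) is illustrative; the actual client could equally query the physics engine for this intersection.

// Sketch: ghost-ball placement as a 2D circle sweep along the aim direction.
#include <cmath>
#include <optional>

struct Vec2 { double x, y; };

// Returns the ghost-ball center, i.e. the cue-ball center at the moment of contact
// with a target ball of equal radius R, or nothing if the aim line misses it.
std::optional<Vec2> ghostBallCenter(Vec2 cue, Vec2 dir /* unit aim direction */,
                                    Vec2 target, double R)
{
    const double dx = target.x - cue.x, dy = target.y - cue.y;
    const double proj = dx * dir.x + dy * dir.y;          // distance along the aim line
    if (proj <= 0.0) return std::nullopt;                 // target is behind the cue ball
    const double perp2 = dx * dx + dy * dy - proj * proj; // squared lateral miss distance
    const double contact = 2.0 * R;                       // centers are 2R apart at contact
    if (perp2 > contact * contact) return std::nullopt;   // aim line passes too far away
    const double t = proj - std::sqrt(contact * contact - perp2);
    return Vec2{ cue.x + t * dir.x, cue.y + t * dir.y };
}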
Chapter 4
Conclusions
We have described a system that provides for natural, non-invasive human-
computer interactions using only computer vision techniques and a single DV
camera. We believe that even though the low capture rate and generally poor
image quality of consumer-grade DV cameras ultimately limit the accuracy of
image-based measurements, future DV cameras and image processing techniques
will improve enough to make CV-based systems a viable alternative to the mouse and
keyboard. Indeed, higher-resolution images captured at faster rates, coupled
with ever-increasing processor speeds, will also help make this style of input more
widespread.
However, CV-based algorithms are generally compute-intensive due to the sheer
amount of data that needs to be processed. We do not foresee this changing
unless some of the feature detection performed in software today is moved into
specialized hardware in the future. This has already been explored somewhat (e.g. in
(12)). However, the main issue with hardware-based feature extraction is
creating hardware that is generic enough to serve multiple classes of features.
A common criticism of CV-based TUIs is that they “are typically developed
and tuned for one specific configuration” (22). Our system is indeed configured
for one specific set of tangible objects. However, we hope that by starting this
style of work with concrete prototypes we can learn the lessons and principles of
successful computer-vision-based TUI design so that future projects are easier to
implement.
In the future we wish to expand our studies by possibly including other forms
of vision-based sensing such as stereo vision or omnidirectional cameras. We
also would like to expand our study of tracking fast-moving objects in real time.
Stroboscopic lighting might provide a benefit to this end. Also, to deal with
webcams’ poor reproduction of colors in the visible light spectrum, we would like
to experiment with near-infrared light. In such a system, changes in visible
illumination would not adversely affect the detection of infrared light.
We would also like to vet CV as a viable input mechanism by attempting to create
TUIs for several other equipment-based sports such as tennis, golf, shuffleboard,
and baseball, in order to uncover issues we have not yet encountered. It would
also behoove us to conduct a usability study to gain valuable feedback from real
users who are not as intimately familiar with the implementation details of our
TUI as we are.
Recently there has been a growing focus on “simulation-based learning” with
respect to education and training. In simulation-based learning, a user learns
some skill by interacting with a simulation of some environment. Simulations
are less expensive than their real counterparts and offer much more control over
configuration. Since the TUI described in this thesis is inexpensive and easy to
deploy, one can start to see a role for TUIs in training for various other sports or
physically based endeavours. Recently, the Center for Immersive and Simulation-
Based Learning was created at Stanford University. Using CV for TUIs could
help such efforts by providing an inexpensive means to create such simulations.
Appendix A
Camera calibration
Camera calibration provides us with information about the intrinsic and extrinsic
parameters of a camera. The intrinsic parameters describe the camera’s internals,
including the focal length, the distortion coefficients of the
camera’s lens, and the principal point (which is usually the center of the image).
The extrinsic parameters define the position and orientation of the camera relative
to some world coordinate frame. While the intrinsic parameters of a camera vary
from camera to camera, they need only be found once per camera. The extrinsic
parameters of a camera are view-dependent.
Figure A.1: Chessboard calibration object detected with point correspondences
highlighted.
A common technique used to calibrate a camera is to find a number of image-
world point correspondences. An object of known geometry such as a chessboard
pattern is detected in image space. In fact, the chessboard pattern as in Fig-
ure A.1 is the most common calibration object used today.
Every camera lens contains some amount of distortion. The distortion present can
be described by radial and tangential distortion coefficients. These coefficients are
used to rectify (undistort) acquired images. These undistorted images form the
input to our CV system. Undistorting a point (x, y)
results in a point (x′, y′). This undistortion procedure is applied to each point of
an acquired image.
x′ = x(1 + k1 r^2 + k2 r^4) + 2 p1 x y + p2 (r^2 + 2 x^2)
y′ = y(1 + k1 r^2 + k2 r^4) + p1 (r^2 + 2 y^2) + 2 p2 x y
(A.1)

where r^2 = x^2 + y^2, k1 and k2 are radial distortion coefficients, and p1 and p2
are tangential distortion coefficients. k1, k2, p1, and p2 are determined during
camera calibration.
(a) Acquired image (distorted). (b) Acquired image (undistorted).
Figure A.2: Example of camera lens distortion. Note how the straight lines in
the right image (b) are curved in the left image (a).
We used Intel’s OpenCV library (18) for camera calibration. OpenCV’s calibra-
tion routines are based on the methods of Zhang (47).
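A minimal sketch of this calibration and undistortion workflow is shown below using OpenCV's modern C++ API (the thesis-era code used OpenCV's C interface); the chessboard dimensions, square size, and image filenames are assumptions made for illustration.

// Sketch: chessboard calibration and per-frame undistortion with OpenCV.
#include <opencv2/opencv.hpp>
#include <string>
#include <vector>

int main()
{
    const cv::Size patternSize(9, 6);           // inner chessboard corners (assumed)
    const float squareSize = 0.025f;            // 25 mm squares (assumed)

    // One set of 3D chessboard corners (on the z = 0 plane), reused for every view.
    std::vector<cv::Point3f> corners3d;
    for (int y = 0; y < patternSize.height; ++y)
        for (int x = 0; x < patternSize.width; ++x)
            corners3d.emplace_back(x * squareSize, y * squareSize, 0.0f);

    std::vector<std::vector<cv::Point2f>> imagePoints;
    std::vector<std::vector<cv::Point3f>> objectPoints;
    cv::Size imageSize;
    for (int i = 0; i < 10; ++i) {              // assumed calibration image filenames
        cv::Mat img = cv::imread("calib_" + std::to_string(i) + ".png", cv::IMREAD_GRAYSCALE);
        if (img.empty()) continue;
        imageSize = img.size();
        std::vector<cv::Point2f> corners;
        if (cv::findChessboardCorners(img, patternSize, corners)) {
            imagePoints.push_back(corners);     // image-world point correspondences
            objectPoints.push_back(corners3d);
        }
    }

    cv::Mat cameraMatrix, distCoeffs;           // intrinsics; distCoeffs holds k1, k2, p1, p2 (+ k3)
    std::vector<cv::Mat> rvecs, tvecs;          // per-view extrinsics
    cv::calibrateCamera(objectPoints, imagePoints, imageSize,
                        cameraMatrix, distCoeffs, rvecs, tvecs);

    // Undistort an acquired frame before feature detection (Equation A.1 applied per pixel).
    cv::Mat frame = cv::imread("frame.png"), undistorted;
    cv::undistort(frame, undistorted, cameraMatrix, distCoeffs);
    return 0;
}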
References
[1] Samuel Aude, Matios Bedrosian, Cyril Clement, and Monica
Dinculescu. mulTetris: A Test of Graspable User Interfaces in Col-
laborative Games, 2006. http://www.cs.mcgill.ca/~mdincu/multetris_
final.pdf [Online; accessed 30-June-2007]. 11, 15, 30
[2] B. Bailey. Real Time 3D motion tracking for interactive computer simu-
lations, 2007. Student project, Imperial College London. 16
[3] Paul J. Besl and Ramesh C. Jain. Three-dimensional object recogni-
tion. ACM Comput. Surv., 17(1):75–145, March 1985. 7
[4] David M. Bourg. Physics for Game Developers. O’Reilly & Associates,
Sebastopol, CA, 2002. 70
[5] Gary R. Bradski. Computer Vision Face Tracking for Use in a Perceptual
User Interface. Intel Technology Journal, 2(Q2):15, 1998. 30
[6] Adrian David Cheok, Xubo Yang, Zhou Zhi Ying, Mark
Billinghurst, and Hirokazu Kato. Touch-Space: Mixed Reality Game
Space Based on Ubiquitous, Tangible, and Social Computing. Personal Ubiq-
uitous Comput., 6(5-6):430–442, 2002. 13
[7] E. Costanza and J. Robinson. A Region Adjacency Tree Approach to
the Detection and Design of Fiducials. In Proceedings of Vision, Video and
Graphics, 2003. 11
[8] E. Costanza, S.B. Shelley, and J. Robinson. Introducing Audio
d-touch: a Novel Tangible User Interface for Music Composition and Perfor-
mance. In Proceedings of the 6th International Conference on Digital Audio
Effects, 2003. 11, 14
[9] Jerry Fails and Dan Olsen. A design tool for camera-based interaction.
In CHI ’03: Proceedings of the SIGCHI conference on Human factors in
computing systems, pages 449–456, New York, NY, USA, 2003. ACM. 13
[10] George W. Fitzmaurice. Graspable user interfaces. PhD thesis, Toronto,
Ont., Canada, Canada, 1996. Adviser-William Buxton. 11, 15
[11] James D. Foley, Andries van Dam, Steven K. Feiner, and John F.
Hughes. Computer Graphics — Principles and Practice. The Systems
Programming Series. Addison-Wesley, second edition, 1996. 5
[12] W. T. Freeman, K. Tanaka, J. Ohta, and K. Kyuma. Computer
vision for computer games. In FG ’96: Proceedings of the 2nd International
Conference on Automatic Face and Gesture Recognition (FG ’96), page 100,
Washington, DC, USA, 1996. IEEE Computer Society. 77
[13] E. Gamez, N. Leland, O. Shaer, and R. Jacob. The TAC Paradigm:
Unified Conceptual Framework to Represent Tangible User Interfaces, 2003.
http://citeseer.ist.psu.edu/calvillo-gamez03tac.html [Online; ac-
cessed 30-June-2007]. 11
[14] Vadas Gintautas and Alfred W. Hubler. Experimental evidence for
mixed reality states in an interreality system. Physical Review E (Statistical,
Nonlinear, and Soft Matter Physics), 75(5):057201, 2007. 2
[15] Google. Google SketchUp, 2007. http://sketchup.google.com/ [Online;
accessed 01-November-2007]. 70
[16] Venkatraghavan Gourishankar, Govindarajan Srimathveer-
avalli, and Thenkurussi Kesavadas. HapStick: A High Fidelity Haptic
Simulation for Billiards. In WHC ’07: Proceedings of the Second Joint Euro-
Haptics Conference and Symposium on Haptic Interfaces for Virtual Envi-
ronment and Teleoperator Systems, pages 494–500, Washington, DC, USA,
2007. IEEE Computer Society. 16
[17] Mikako Harada, Andrew Witkin, and David Baraff. Interactive
Physically-Based Manipulation of Discrete/Continuous Models. Computer
Graphics, 29(Annual Conference Series):199–208, 1995. 70
[18] Intel Corporation. Open Source Computer Vision Library, 2000.
http://www.intel.com/technology/computing/opencv [Online; accessed
30-June-2007]. 23, 35, 37, 80
[19] Hiroshi Ishii and Brygg Ullmer. Tangible Bits: Towards Seamless
Interfaces between People, Bits and Atoms. In CHI, pages 234–241, 1997.
11
[20] Tony Jebara, Cyrus Eyster, Josh Weaver, Thad Starner, and
Alex Pentland. Stochasticks: Augmenting the Billiards Experience with
Probabilistic Vision and Wearable Computers. In ISWC ’97: Proceedings of
the 1st IEEE International Symposium on Wearable Computers, page 138,
Washington, DC, USA, 1997. IEEE Computer Society. 17
[21] Martin Kaltenbrunner and Ross Bencina. reacTIVision: a
computer-vision framework for table-based tangible interaction. In TEI ’07:
Proceedings of the 1st international conference on Tangible and embedded
interaction, pages 69–74, New York, NY, USA, 2007. ACM. 14
[22] Rick Kjeldsen, Anthony Levas, and Claudio Pinhanez. Dy-
namically reconfigurable vision-based user interfaces. Machine Vision and
Applications, 16(1):6–12, December 2004. 77
[23] Scott R. Klemmer, Jack Li, James Lin, and James A. Lan-
day. Papier-Mache: Toolkit Support for Tangible Input. Technical Report
UCB/CSD-03-1278, EECS Department, University of California, Berkeley,
2003. 11, 12
[24] Gudrun J. Klinker, Klaus H. Ahlers, David E. Breen, Pierre-
Yves Chevalier, Chris Crampton, Douglas S. Greer, Dieter
Koller, Andre Kramer, Eric Rose, Mihran Tuceryan, and
Ross T. Whitaker. Confluence of Computer Vision and Interactive
Graphics for Augmented Reality. PRESENCE: Teleoperators and Virtual
Environments, (Special Issue on Augmented Reality), 1997. 6, 11
[25] Myron W. Krueger, Thomas Gionfriddo, and Katrin Hinrich-
sen. VIDEOPLACE - an artificial reality. In CHI ’85: Proceedings of the
SIGCHI conference on Human factors in computing systems, pages 35–40,
New York, NY, USA, 1985. ACM. 11
[26] Lars Bo Larsen, Rene B. Jensen, Kasper L. Jensen, and Soren
Larsen. Development of an automatic pool trainer. In ACE ’05: Pro-
ceedings of the 2005 ACM SIGCHI International Conference on Advances
in computer entertainment technology, pages 83–87, New York, NY, USA,
2005. ACM. 17
[27] Will Leckie and Michael A. Greenspan. An Event-Based Pool
Physics Simulator. In ACG, pages 247–262, 2006. 62, 70, 74
[28] Bastian Leibe, Thad Starner, William Ribarsky, Zachary
Wartell, David Krum, Brad Singletary, and Larry Hodges. The
perceptive workbench: Towards spontaneous and natural interaction in semi-
immersive virtual environments. In IEEE Virtual Reality 2000 Conference
(VR’2000), pages 13–20, Los Alamitos, Calif., March 18-23 2000. IEEE CS
Press. (Won the VR’2000 Best Paper Award!). 9
[29] V. Lepetit and P. Fua. Monocular Model-Based 3D Tracking of Rigid
Objects: A Survey. Foundations and Trends in Computer Graphics and
Vision, 1(1). 14, 30, 31, 33
[30] Carsten Magerkurth, Adrian David Cheok, Regan L. Mandryk,
and Trond Nilsen. Pervasive games: bringing computer entertainment
back to the real world. Comput. Entertain., 3(3):4–4, 2005. 3
[31] P. Milgram and H. Colquhoun. A Taxonomy of Real and Virtual
World Display Integration, 1999. 2
[32] Newton Game Dynamics. Newton Game Dynamics, 2007. http://
newtondynamics.com [Online; accessed 30-June-2007]. 69
[33] M. Nixon and A. Aguado. Feature Extraction and Image Processing.
Newnes-Oxford, 2002. 5, 7, 23, 33
[34] OGRE contributors. Object-Oriented Game Rendering Engine, 2007.
http://ogre3d.org [Online; accessed 30-June-2007]. 70
[35] Jef Raskin. The Humane Interface: New Directions for Designing Inter-
active Systems. Addison-Wesley Professional, March 2000. 9
[36] Ian D. Reid and A. North. 3D Trajectories from a Single Viewpoint
using Shadows. In BMVC, 1998. 50, 51, 52, 53, 54
[37] Homero V. Rios. Human-computer interaction through computer vision.
In CHI ’01: CHI ’01 extended abstracts on Human factors in computing
systems, pages 59–60, New York, NY, USA, 2001. ACM. 2
[38] Ruby. Ruby programming language official website, 2007. http://www.
ruby-lang.org [Online; accessed 25-August-2007]. 70
[39] Alvy Ray Smith. Color Gamut Transform Pairs. Computer Graphics,
12(3):12–19, August 1978. 34
[40] George Stockman and Linda G. Shapiro. Computer Vision. Prentice
Hall PTR, Upper Saddle River, NJ, USA, 2001. 7, 8, 22, 26, 33, 37, 39, 47,
50
[41] B. Ullmer and H. Ishii. Emerging frameworks for tangible user interfaces.
IBM Syst. J., 39(3-4):915–931, 2000. 3, 4, 11
[42] Brygg Ullmer and Hiroshi Ishii. The metaDESK: models and pro-
totypes for tangible user interfaces. In UIST ’97: Proceedings of the 10th
annual ACM symposium on User interface software and technology, pages
223–232, New York, NY, USA, 1997. ACM. 12
[43] Andrew D. Wilson. PlayAnywhere: a compact interactive tabletop
projection-vision system. In UIST ’05: Proceedings of the 18th annual ACM
symposium on User interface software and technology, pages 83–92, New
York, NY, USA, 2005. ACM. 15
[44] World Pool-Billiard Association. WPA Tournament Table &
Equipment Specifications, 2001. http://www.wpa-pool.com/index.asp?
content=rules_spec [Online; accessed 22-August-2007]. 70
[45] B. Yersin. Virtual Billiards, 2004. Student project, VRLAB, Swiss Federal
Institute of Technology. 11
[46] Alper Yilmaz, Omar Javed, and Mubarak Shah. Object tracking:
A survey. ACM Comput. Surv., 38(4), 2006. 14
[47] Zhengyou Zhang. Flexible Camera Calibration by Viewing a Plane from
Unknown Orientations. ICCV, 01:666, 1999. 80