Visual SLAM and Human Recognition System of a Legged
Robot
by
Jose Carlos Perez Ipiales
A thesis submitted in partial fulfillment of the requirements for the
degree of Master of Engineering in
Mechatronics
Examination Committee: Prof. Manukid Parnichkun (Chairperson)
Dr. Mongkol Ekpanyapong
Dr. Pisut Koomsap
Nationality: Ecuadorian
Previous Degree: Bachelor of Science in Mechanical Engineering
The University of the Armed Forces-ESPE, Ecuador
Scholarship Donor: Secretariat of Higher Education, Science, Technology and
Innovation (SENESCYT), Ecuador
Asian Institute of Technology
School of Engineering and Technology
Thailand
December 2019
Acknowledgements
I want to express my deepest gratefulness to the hearts that gave me purpose in this world,
to my mother, father and brother.
Besides this, I would like to express my sincere gratitude to the government of Ecuador and
to the Secretariat of Higher Education, Science, Technology and Innovation for entirely
funding my studies at AIT.
Abstract
The purpose of this project was to design a 3D-printed four-legged indoor robot that incorporates a Visual SLAM system and can cope with the intrinsic motion of the depth camera caused by the locomotion and its associated perturbations. At the same time, the robot runs a tiny YOLOv2 deep neural network that can recognize persons in real time with an acceptable level of accuracy.
This robot uses ROS (Robot Operating System) and is remotely connected to a PC that performs the computational processes required by the Visual SLAM using RTAB-Map, while YOLO and the high-level control are managed by a Jetson Nano board. The Jetson Nano has a serial connection with an Arduino Due, whose function is to perform the low-level control necessary for the motion.
Table of Contents
Chapter Title Page
Title Page i
Acknowledgements ii
Abstract iii
Table of Contents iv
List of Figures v
List of Tables vi
1 Introduction 1
1.1 Background 1
1.2 Statement of Problem 2
1.3 Objectives 2
1.4 Scope 2
2 Literature Review 3
2.1 Perception 4
2.1.1 Cameras 4
2.2 Simultaneous Localization and Mapping 5
2.2.1 Visual SLAM 5
2.3 Robot Locomotion 6
2.3.1 Stability 7
2.3.2 Legged Robot 8
2.4 Human Detection System 9
2.4.1 Tiny Yolo V2 11
3 Methodology 12
3.1 Development of the Physical Parts of the Robot 14
3.2 Components 15
3.3 Circuit Required 19
3.4 Program Structure 20
4 Results and Discussion 26
4.1 Visual SLAM Experiment 26
4.2 Human Recognition Experiment 32
5 Conclusion and Recommendations 35
5.1 Conclusion 35
5.2 Recommendations 35
References 36
Appendices 38
Appendix A 38
Appendix B 39
List of Figures
Figure Title Page
2.1 Levels of Influence of SR 1
2.2 Service Robots (SR) 1
2.3 See – Think – Act Cycle of an Autonomous Mobile Robot 3
2.4 Microsoft Azure RGB – D Camera 4
2.5 Feature-Based ORB-SLAM Reconstruction of a Bedroom 5
2.6 Example of a Semi-Dense 3D Semantic 6
2.7 Locomotion Mechanisms Found in Nature 7
2.8 Gabrielli-von Karman Diagram 9
2.9 YOLO. Detecting a Variety of Object Classes in Real-Time 10
2.10 Tiny Yolo v2 Architecture 11
3.1 4-Legged Robot Developed 12
3.2 Walking Pattern Gait 13
3.3 Perspective of the Movement of Leg 13
3.4 Leg and frame Models of the Developed Legged Robot 14
3.5 PCB and Controller Supports 14
3.6 Example of Connections Made with Threaded Heat Inserts 15
3.7 Components Overview 16
3.8 Servo RDS3135 MG Assembly 17
3.9 Nvidia Jetson Nano 18
3.10 Circuit Diagram 19
3.11 ROS Architecture of the System 21
3.12 Block Diagram of the Structure of Rtab-Map Node 22
3.13 Graph Representation of Locations in the Memory 22
3.14 Block Diagram of the Structure of RGBD Odometry Node 23
3.15 ROS Arduino Communication Node 25
4.1 Initialization of the Robot 26
4.2 Path and Map of the Visual SLAM Experiment 27
4.3 Real Image of the Place of Experiment 28
4.4 X-Axis Comparison of Results and Ground Truth 29
4.5 Y-Axis Comparison of Results and Ground Truth 29
4.6 Yaw Angle Comparison of Results and Ground Truth 30
4.7 Error Associated to the Measurement in the X and Y Axis 30
4.8 Error Associated to the Measurement of the Heading Angle 30
4.9 Example of False Positive with the Object Detection Algorithm 33
4.10 Graphical Representation of Results of Tiny-Yolo-v2 33
Example of RANSAC fitting model estimation using a line
Pinhole Camera Model Graphical Interpretation
Reprojection Error
List of Tables
Table Title Page
3.1 Properties of 3135 MG and 3115 MG Servos 17
4.1 Results Obtained in the Visual Slam Experiment 27
4.2 Errors Associated with the Experiment for X, Y and Yaw Angle 28
4.3 Results Obtained with Tiny-Yolo-v2 32
Chapter 1
Introduction
1.1. Background
The service sector appears to be at an inflection point regarding productivity gains and service industrialization, comparable to the industrial revolution that started in the eighteenth century. Rapidly improving technology that becomes better, smarter, smaller, and cheaper will transform almost all service sectors (Wirtz, 2018).
Robots are one of the technologies that will drive this transformation in our daily lives and in industry, fulfilling a growing number of roles that range from factory automation and medical care to entertainment, cleaning, and personal assistance. The development of service robots has been divided into two sectors: (a) non-manufacturing productive sectors such as agriculture, mining, or medicine; and (b) the personal service sector, including personal assistance, cleaning, education, entertainment, etc. (García-Solera, 2018)
Service robots are technical devices able to execute an operation or specific task with semi- or fully autonomous behavior while, at the same time, dealing in real time with different types of dynamic and static objects, such as humans, in their environment.
Service robots will have an important role in the days to come at various levels: the micro
(i.e. individual customer experience), meso (e.g. the market for a particular service) and
macro level (with societal implications).
Figure 2.1: Levels of Influence of SR (Wirtz, 2018)
Figure 2.2: Service Robots (SR)
1.2. Statement of the Problem
In order to understand the world and to navigate it, service robots must have the ability to create a map of their environment and to locate themselves in it. While doing so, using different SLAM techniques, the perceived world can be 2D or 3D depending on the algorithms and devices used for that purpose.
For many tasks a 2D map is enough for the robot to perform the navigation; a 3D counterpart, on the other hand, provides a more accurate representation of the environment to interact with.
The service robot developed in this work has legged locomotion, which gives it the capability of adapting its movement to non-ideal conditions such as objects obstructing its path or uneven floors. It also performs Visual SLAM, building a 2D and/or 3D map while dealing with the intrinsically complex movement that the body of a legged robot undergoes during locomotion. This is especially demanding in the four-legged case, where fewer contact points with the floor support the structure and the center of gravity compared with robots with 6, 8, or more legs.
At the same time, the proposed robot incorporates an object recognition algorithm that can localize persons in environments with different lighting conditions.
1.3. Objectives
The objective of this work is to develop a 4-legged robot able to perform Visual SLAM and recognize humans in an indoor environment under different lighting conditions.
1.4. Scope
• Development of a 4-legged robot, about 320x320x280 in size, and of its control algorithm
• Capability of walking on terrain with a height variation of 2 cm
• Implementation of a Visual SLAM system
• Human and object recognition using a suitable algorithm
Chapter 2
Literature Review
An autonomous robot must have the ability to map its surroundings and to locate itself in it
because, after performing this “understanding”, any task will imply a displacement in the
world. The previous statement needs to be complemented with 3 questions that the robot
implicitly has to ask itself before deciding how to act in -or interact with- the environment
in order to navigate in it successfully and perform the desired task: where am I? Where am
I going? How do I get there?
Figure 2.3: See – Think – Act Cycle of an Autonomous Mobile Robot (Tang Dan, 2018)
Figure 2.3 describes the sense-think-act paradigm, or cycle, that has to be repeated by autonomous robots in order to navigate the world; it consists of:
1) Perceiving and analyzing the data obtained from sensors, thus, perception.
2) Map building and self-localization in it: SLAM; and,
3) The path planning where kinematics and the motion control take the responsibility of
sending commands to the actuators in order to move the robot and to perform displacements.
2.1 Perception
Perception is the interface between the robot and the environment that surrounds it. It is therefore a task of major importance: it determines the subsequent algorithm used for Simultaneous Localization and Mapping and, for that reason, it also shapes the way the system captures, "understands", and interprets the 2D or 3D map of the world.
The perception begins when the sensors incorporated in the robot capture or acquire the raw
data which will have a particular nature according to the devices used. For instance, in the
case of a camera, the raw data will be the images or video obtained and in the case of a sonar,
it will be a sound signal.
The second level of the perception is the clustering of the raw data into meaningful features
and then into objects (Siegwart R., 2011).
2.1.1 Cameras
A camera is nothing but a device conceived for recording visual images in the form of
photographs, film, or video signals.
Nowadays, there are different digital technologies (besides common monocular cameras) that help us obtain visual-spatial information from the world and that allow the robot itself to work on different tasks such as object tracking and recognition, image segmentation, Visual SLAM, etc.
Among the alternatives, particularly those capable of depth sensing, we have: stereo cameras, Time-of-Flight (ToF) cameras, and RGB-D cameras.
Figure 2.4: Microsoft Azure RGB – D Camera
2.2 Simultaneous Localization and Mapping
Simultaneous Localization and Mapping, also known by its acronym SLAM, is a prerequisite for many robotic applications; its techniques jointly estimate a map of an unknown environment and the robot pose within that map, using only the data streams of the robot's on-board sensors (Bescos, 2018). This process allows the system to localize itself in the map and in the environment without accumulating drift.
The SLAM problem has been addressed from different perspectives, but most of them rely on approaches and datasets that assume a static environment. This means that, unfortunately, these systems only work successfully with small fractions of dynamic content; consequently their applicability in real-world environments can be compromised, although for many tasks this is still good enough.
2.2.1 Visual SLAM
Recently, considerable effort and attention have been devoted to what is called visual SLAM, in which the main sensor is a monocular, stereo, or depth camera; the advantages of this technique are maximized with the last two.
This type of SLAM can be classified into 3 categories:
1) Feature-based methods, where the algorithm, as its name suggests, maps the features that can normally be used for recognition. Figure 2.5 shows an example of a bedroom reconstruction obtained with this kind of approach.
Figure 2.5: Feature-Based ORB-SLAM Reconstruction of a Bedroom. Left: map consists of 180
keyframes and 8258 points. Right: map contains 102 keyframes and 8269 points (Mur-Artal, 2014)
2) Direct methods, which try to estimate complete dense reconstructions by directly minimizing the reprojection error (the geometric error corresponding to the image distance between a projected point and a measured one) together with a total variation regularization (an image denoising method), or, alternatively, estimate semi-dense maps focusing their attention on semantic areas.
Some direct methods use deep learning to recognize static and also dynamic objects, thus refining the perception of the space during the process. An example of this is presented in figure 2.6, where deep learning performs the semantic segmentation of the space before the 3D reconstruction is made.
Figure 2.6: Example of a Semi-Dense 3D Semantic (Li, 2016)
3) Appearance-based methods, where the algorithms use information obtained from vision sensors to localize the robot and map the environment using a process called Loop Closure, which determines whether the robot has seen a location before or not. As the robot travels to new areas in its surroundings, the map is extended and the number of previously analyzed images against which each new image must be compared grows. The loop closure process therefore takes longer, because its complexity increases linearly with the number of comparisons the robot has to make, unless a convenient memory management scheme is used.
2.3 Robot Locomotion
Another key aspect we have to take into account when we are to build a robot is the
locomotion strategy because it defines the way the system makes displacements in the world
and, in consequence, it will also define the actuators, the control algorithms for them and
the sensors used for the perception.
Figure 2.7 shows some examples of locomotion mechanisms found in nature, but the focus of this work is, particularly, legged and wheeled locomotion.
Figure 2.7: Locomotion Mechanisms Found in Nature (Böttcher, 2006)
Legged robot locomotion mechanisms are often inspired by biological systems, which are very successful in moving through a wide range of harsh, uneven and complex environments.
In this context, the human bipedal walking can be approximated as a rolling polygon
“similar” to a wheel (Böttcher, 2006) –of course only as an approximation because in this
case there is no actively powered wheel-.
2.3.1 Stability
This is a key issue when we are talking about locomotion, because a failure here could mean a catastrophic accident for the robot or its definitive failure to reach the goal position.
We can differentiate 2 types of stability, static and dynamic, where the first one in mention
refers to the situation when there is no motion and when the center of mass is completely
within the support polygon (a non-zero area conformed in-between the legs or the wheels
that have contact with the floor).
Most robots able to walk in a statically stable manner have six legs because, for instance, statically stable walking with 4 legs means that the robot can lift only 1 leg at a time while the other 3 provide the support (3 is the minimum number of contact points needed to create a support polygon), so walking becomes problematic and slow.
2.3.2 Legged Robot
A legged robot is well suited to work in a variety of complex environments due to its inherent capability of climbing steps and crossing gaps -which can be as large as its stride- and of overcoming ground irregularities or obstacles that a common wheeled robot would definitely find impossible.
Each leg must have a minimum of 2 DOF (degrees of freedom): 1 for swinging and 1 for lifting, and each DOF requires a joint and a motor; this means that, in consequence, a 6-legged robot needs a minimum of 12 motors, which directly affects the power consumption of the system.
The addition of DOF, in general, increases the maneuverability of the robot, range of terrain
and the ability to travel in variety of gaits; on the other hand, the disadvantages it gives are
the increase of power consumption, weight and presumably costs due to the number of
motors used (Böttcher, 2006).
If the robot has more than 2 legs there is the problem of coordination in order to perform the
locomotion. Traditionally, to proceed with the movement of the motors that perform the
motion of the legs, kinematics and inverse kinematics transformations helped to solve the
equations related with the position and movement of each part involved in this task.
The number of possible gaits that a robot can use depends on the number of legs it has. The gait is a periodic sequence of lift and release events for each leg. If a robot has k legs, the number of possible events N is:
N = (2k − 1)!
In the case of a bipedal walking machine the number of possible events is:
N = (2k − 1)! = (2·2 − 1)! = 3! = 6
In this case, the possible events are:
1. Lift the left leg
2. Release left leg
3. Lift the right leg
4. Release right leg
5. Lift both legs together
6. Release both legs together
It is important to mention that for 6-legged robots there are (2·6 − 1)! = 11! = 39,916,800 possible events. For this reason the control is more complex for 6 legs than for 2 but, at the same time, we have to take into account other aspects of the problem such as, for instance, stability.
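As a quick check of the combinatorics above, the following short Python snippet (a minimal sketch, not part of the robot's software) computes the number of lift/release event orderings N = (2k − 1)! for robots with 2, 4, and 6 legs.

```python
from math import factorial

def gait_events(k: int) -> int:
    """Number of possible lift/release event orderings for a robot with k legs."""
    return factorial(2 * k - 1)

for legs in (2, 4, 6):
    print(f"{legs} legs -> {gait_events(legs)} possible events")
# 2 legs -> 6, 4 legs -> 5040, 6 legs -> 39916800
```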
There are some performance indices that can be used to evaluate legged robots of different
configurations (Kajita, 2018). Here we are going to talk about 2, but in the literature, we can
find more of them, such as Froude number.
• Specific Resistance, defined by the equation
ε = E / (M g d)
where E is the total energy consumption for a travel of distance d, M is the mass of the vehicle, and g is gravity.
This equation shows, in principle, how much energy is being converted into actual displacement, which makes it, in a sense, a measure of how "smooth" the locomotion of the robot is.
In figure 2.8 we can see the specific resistance calculated for different kinds of robots.
• Duty Factor β, which can be used to make a distinction between walks (β ≥ 0.5) and runs (β < 0.5). It is defined as the fraction of the stride period during which a given leg is in contact with the ground:
β = (support time of a leg) / (stride period)
A short numerical sketch of both indices is given after figure 2.8.
Figure 2.8: Gabrielli-von Karman Diagram (Kajita, 2018)
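To illustrate how these two indices are evaluated in practice, the snippet below computes the specific resistance and the duty factor from hypothetical measurements; the numerical values are placeholders, not data from this robot.

```python
G = 9.81  # gravitational acceleration, m/s^2

def specific_resistance(energy_j: float, mass_kg: float, distance_m: float) -> float:
    """Specific resistance: total energy spent divided by M * g * d (dimensionless)."""
    return energy_j / (mass_kg * G * distance_m)

def duty_factor(support_time_s: float, stride_period_s: float) -> float:
    """Fraction of the stride period during which a leg touches the ground."""
    return support_time_s / stride_period_s

# Hypothetical example values, not measurements of the robot built in this work.
print(specific_resistance(energy_j=120.0, mass_kg=2.5, distance_m=10.0))  # ~0.49
print(duty_factor(support_time_s=0.6, stride_period_s=1.0))               # 0.6 -> walking
```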
2.4 Human Detection System
Human detection is a branch of the more general object detection problem which tackles the
issue of identifying the presence of predefined classes of objects in an image; in this case,
human beings. In order to perform this task, we have to identify the presence of the objects and then locate the rectangular boundaries that surround them.
Early approaches to human detection include techniques like the Haar Cascade, widely used for face detection, proposed by Viola and Jones, which uses an AdaBoost machine learning classifier (Viola & Jones, 2001); or others like the Histogram of Oriented Gradients (HOG), a pedestrian detection algorithm developed by Dalal and Triggs that uses a Support Vector Machine (SVM) for the classification of the objects.
These two approaches are not very good at detecting humans in various poses unless multiple models are used to detect humans in each pose (Vidanapathirana, 2018). Another problem observed with these approaches is that quite often a person is detected in one frame and not detected in the following one, or vice versa; in other words, detections are susceptible to flickering. Modern approaches to human detection fall into what are called Deep Convolutional Neural Networks, which provide more accurate results at the cost of more computation.
Figure 2.9: YOLO. Detecting a Variety of Object Classes in Real-Time (Redmon, 2016)
Some of the current state-of-the-art object detection algorithms –and consequently able to
detect humans too- that are being widely implemented are:
• R-CNN approaches that use Region Proposal methods combined with CNN to first
generate potential bounding boxes in an image and then run a classifier on these
proposed boxes. After classification, post-processing is used to refine the bounding
boxes, eliminate duplicate detections, and rescore the boxes based on other objects in
the scene (Liu, 2016). Examples of this type of algorithms are Fast R-CNN, Faster R-
CNN and Mask R-CNN which are improvements to original R-CNN model initially
proposed.
• Single Shot Multibox Detector (SSD) that discretizes the output space of bounding
boxes into a set of default boxes over different aspect ratios and scales per feature map
location (a feature map is a representation of the dominant features of the image at
different scales) (Liu, 2016).
• You Only Look Once (YOLO) v3, which can be 3 times faster than SSD on a 320x320 image and can process images at 30 FPS on a Pascal Titan X (Redmon, 2016). The most salient feature of YOLO v3 is that it makes detections at three different scales, which can be very useful in real-time situations. YOLO can be 1000x faster than R-CNN and as much as 100x faster than Fast R-CNN.
• There are other lighter and smaller versions of the YOLO algorithm whose architecture makes them more suitable for real-time applications on boards with limited computational power; one example is the so-called tiny YOLOv2, which is briefly described here. By comparison, the full YOLOv2 model uses as many as three times the number of layers of this optimized version.
2.4.1 Tiny Yolo V2
The architecture of this feedforward neural network, presented in figure 2.10, consists of 9 convolutional layers, each with a leaky rectified linear unit as activation function and a batch normalization operation, interspersed with 6 max-pooling layers and followed by a region layer. Tiny-Yolo-v2 takes a 416 x 416 input image and predicts 20 output classes when trained on the VOC dataset or 80 output classes on the COCO dataset. The network uses convolutions with a 3×3 kernel (only the last convolution uses a 1×1 kernel), and the max-pooling is done with a 2×2 kernel and a stride of 2.
Figure 2.10: Tiny Yolo v2 Architecture (Yap, 2018)
It has been shown (Cong & Xiao, 2014) that the convolutional layers occupy over 90% of the feed-forward computation time, which is why this reduced architecture runs faster than other deep learning-based algorithms, although this also results in lower detection accuracy.
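To make the architecture of figure 2.10 concrete, the small Python structure below lists the layer stack following the publicly available Darknet tiny-yolo-voc configuration; the filter counts come from that reference configuration, not from this thesis, and may differ slightly from the figure.

```python
# Sketch of the tiny-YOLOv2 (VOC) layer stack: 9 convolutions, 6 max-poolings, 1 region layer.
# Each entry is (layer type, kernel size, number of filters); counts follow the public
# Darknet tiny-yolo-voc.cfg and are given here only for illustration.
TINY_YOLO_V2_VOC = [
    ("conv", 3, 16), ("maxpool", 2, None),
    ("conv", 3, 32), ("maxpool", 2, None),
    ("conv", 3, 64), ("maxpool", 2, None),
    ("conv", 3, 128), ("maxpool", 2, None),
    ("conv", 3, 256), ("maxpool", 2, None),
    ("conv", 3, 512), ("maxpool", 2, None),   # this last pooling uses stride 1 in the reference cfg
    ("conv", 3, 1024),
    ("conv", 3, 1024),
    ("conv", 1, 125),  # 5 anchors * (20 classes + 4 box coordinates + 1 objectness) = 125 outputs
    ("region", None, None),
]

n_conv = sum(1 for layer, _, _ in TINY_YOLO_V2_VOC if layer == "conv")
n_pool = sum(1 for layer, _, _ in TINY_YOLO_V2_VOC if layer == "maxpool")
print(n_conv, n_pool)  # 9 convolutional layers, 6 max-pooling layers
```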
Chapter 3
Methodology
The 8-degree-of-freedom (DOF) robot developed can be observed in figure 3.1. The mechanical design gives each leg 2 DOF, which means that 2 servos perform its movements: one RDS3115 MG and one RDS3135 MG (both described briefly later in this chapter), rotating in perpendicular planes, horizontal and vertical respectively.
Figure 3.1: 4-Legged Robot Developed
A dynamic walking gait pattern was used. It consists of rotating backwards, in the horizontal plane, the 2 legs that are in contact with the ground (opposed with respect to the origin if we consider the center of the body as such) in order to gain an impulse that moves the robot in the desired direction, while the other 2 legs are in the air moving forward; the same pattern is then repeated with the alternate pair of legs, as can be seen in figure 3.2:
Figure 3.2: Walking Pattern Gait
Each leg changes its position as shown in figure 3.3, describing a circular arc that can be analyzed to calculate the movement of the robot. For a better understanding of figure 3.3, the orange dashed line represents the position of the robot's leg when it is standing and the red lines are the limits of its movement.
Figure 3.3: Perspective of the Movement of Legs
The distance traveled by each step at ideal conditions can be defined by the following
equation:
Displacement = 2 × (Horizontal Projection Diameter of the legs) × sin(Angle)
Displacement = 2 × 15.2 × sin(15°)
Displacement ≈ 7.8 cm
The diameter used in this equation is the one described by the projection of the length of the
leg on the horizontal plane.
3.1. Development of the Physical Parts of the Robot
This project starts with the development of the 3D models corresponding to the structure of the robot: the 4 legs, the frame, and the supports for the circuits and controllers. The design was made using the software SolidWorks and the parts were created using a 3D printer.
Figure 3.4: a) Leg, and b) Frame Models of the Developed Legged Robot
The legs and the frame, both printed using PLA, can be seen in figure 3.4, and the 2 supports required for the rest of the structure, printed using ABS, can be seen in figure 3.5.
Figure 3.5: a) PCB and b) Controller Supports
Threaded heat inserts were used for the connections between the main 3D-printed parts and the components attached directly to them (see Figure 3.6).
The threaded inserts proved suitable for the task since they provide good resistance to both torque and pull-out once the plastic solidifies around them, and they also make frequent assembly and disassembly of the unit possible for service or repair without affecting the integrity of the material.
Figure 3.6: Example of Connections Made with Threaded Heat Inserts
3.2 Components
For simplicity, in order to describe the components used to build the robot, they will be
divided into 2 sections: first the ones used for low level control and secondly, the
components used for the high-level command management including also the visual system.
In the figure 3.7 we can see an overview of the system, its main components (sensors,
actuators, controllers and power supply) and the communication method used between them.
Figure 3.7: Components Overview
- Low level Control Components
• Actuators:
For the movement of the 4 legs with 2 degrees of freedom (each one), a total of 8 servos was
used:
• 4 RDS 3115MG Servos
• 4 RDS 3135MG Servos
The properties of these 2 different types of motors are listed in table 3.1. It has to be mentioned that, for each leg, the servo with more capacity (the RDS 3135 model) is the one holding the robot's weight, since it performs the vertical rotation, whilst the one with less capacity is in charge of the horizontal rotation, as explained in the gait pattern description in this chapter.
An image of the servos used can be seen in figure 3.8, which includes the assembly of the metallic parts that connect the servo to the legs and the frame of the robot.
Figure 3.8: Servo RDS3135 MG Assembly
Table 3.1: Properties of 3135 MG and 3115 MG Servos
Model:               RDS 3135 MG             | RDS 3115 MG
Operational Voltage: 6 – 8.4 V               | 4.8 – 7.2 V
(values given at):   7.4 V       8.4 V       | 6 V         7.2 V
Operating Speed:     0.13 s/60º  0.11 s/60º  | 0.16 s/60º  0.14 s/60º
Stall Current:       3.2 A       3.5 A       | 1.3 A       1.5 A
Stall Torque:        31 Kg.cm    35 Kg.cm    | 13.5 Kg.cm  15 Kg.cm
Weight:              64 g                    | 60 g
• Sensors:
MPU 9255
MPU-9255 is a multi-chip module which houses a 3-Axis gyroscope, a 3-Axis accelerometer
and a 3-Axis magnetometer.
• Gyroscope Features
The triple-axis MEMS gyroscope in the MPU-9255 has the following features:
✓ Digital-output X, Y and Z angular rate sensors with a user-programmable scale range
of ±250, ±500, ±1000, and ±2000 °/s and an integrated 16-bit ADC.
✓ Current demanded: 3.2 mA
• Accelerometer Features
The triple-axis MEMS accelerometer in MPU-9255 includes a wide range of features:
✓ Digital-output triple-axis accelerometer with a scale range of ±2g, ±4g, ±8g and
±16g and integrated 16-bit ADCs
✓ Current demanded: 450μA
✓ User-programmable interrupts
• Microcontroller:
Arduino Due
The Arduino Due is a microcontroller board based on the Atmel SAM3X8E ARM Cortex-M3 CPU. This model is built around a 32-bit ARM core and provides 54 digital input/output pins (of which 12 can be used as PWM outputs for the servo motors used in this project), 12 analog inputs, 4 UARTs (hardware serial ports), an 84 MHz clock and 2 DACs (digital-to-analog converters).
This board is connected directly to the Jetson Nano, from which it receives both its power supply and the commands that it executes.
- Visual System and High-Level Control Components
The components used for the visual SLAM, the human recognition system and for the high-
level control are the following ones:
1 Vision Sensor
A Microsoft Kinect depth/RGB camera, whose depth sensor consists of an infrared laser projector combined with a monochrome CMOS sensor. This inexpensive vision sensor is also equipped with an RGB camera and an integrated microphone array, which could give the system further capabilities in future research.
2 Nvidia Jetson Nano (figure 3.9)
Nvidia Jetson Nano is a low-cost embedded system-on-module (SoM) option created mainly
for AI visual systems which possesses an integrated 128-core Maxwell GPU, a quad-core
ARM A57 64-bit CPU and a 4GB LPDDR4 memory.
Figure 3.9: Nvidia Jetson Nano
This device runs the L4T (Linux for Tegra) operating system, on which the CUDA parallel computing platform and programming model was installed in order to accelerate, on the GPU, the graphics processing needed by the visual systems.
It has to be mentioned that the Jetson Nano required a 32 GB SD card and an external Wi-Fi module in order to connect it to another PC through remote communication.
An 8 GB swap file was created on the SD card, allowing the system to use this space as additional virtual memory and, in consequence, improving performance.
3.3. Circuit Required
For the control of the locomotion and stability of the robot, the circuit shown in figure 3.10 was implemented. It includes a LiPo battery as power supply for the servo motors, each of which had a 470 µF capacitor connected in parallel in order to smooth the current drawn, thus providing a more reliable response.
It has to be pointed out that the system uses a "star" connection, which means that the ground of the motors is also connected to the Arduino ground pin. A voltage regulator is used to maintain a constant voltage of 6.8 V, chosen so as not to push the RDS3115 servo to its limit; although this gives the RDS3135 model less torque, it is still enough to hold the robot and to perform the locomotion.
Another aspect that has to be mentioned is that 2 different PCBs were prepared in order to
divide the current that each of them will have to handle, consequently protecting all the
components.
Figure 3.10: Circuit Diagram
3.4. Program Structure
The low-level control implemented in the Arduino Due board is in charge of four different activities:
• Stabilization of the robot using the roll and pitch angles obtained by processing the information provided by the MPU 9255 (a sketch of this kind of attitude estimation is shown after this list)
• Movement towards the front
• Rotation of a certain angle
• Reading the commands sent by ROS and acting accordingly
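The stabilization item above depends on estimating roll and pitch from the MPU 9255. The actual filter runs in the Arduino Due firmware; the fragment below is only a Python sketch of a classic complementary filter of the kind commonly used for this purpose, with an assumed 0.98 gyro weight, and is not a transcription of the firmware.

```python
import math

def complementary_filter(roll, pitch, gyro, accel, dt, alpha=0.98):
    """One update step: fuse gyro integration with accelerometer tilt angles.

    roll, pitch: previous estimates in degrees
    gyro:  (gx, gy) angular rates in deg/s from the MPU 9255 gyroscope
    accel: (ax, ay, az) accelerations in g from the MPU 9255 accelerometer
    dt:    time step in seconds
    """
    gx, gy = gyro
    ax, ay, az = accel
    # Tilt angles as seen by the accelerometer (valid when external accelerations are small).
    acc_roll = math.degrees(math.atan2(ay, az))
    acc_pitch = math.degrees(math.atan2(-ax, math.sqrt(ay * ay + az * az)))
    # Blend the integrated gyro rate (short term) with the accelerometer angles (long term).
    roll = alpha * (roll + gx * dt) + (1.0 - alpha) * acc_roll
    pitch = alpha * (pitch + gy * dt) + (1.0 - alpha) * acc_pitch
    return roll, pitch

# Example: one 10 ms step with the robot nearly level (values are illustrative only).
print(complementary_filter(0.0, 0.0, gyro=(1.0, -0.5), accel=(0.02, 0.01, 0.99), dt=0.01))
```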
On the other hand, the high-level control uses the ROS Melodic platform, available for the L4T (Linux for Tegra) operating system running on the Jetson Nano and for Ubuntu 18.04 running on a computer connected remotely.
For the implementation in ROS, 5 different nodes were created in order to implement the
tasks that the robot will perform:
• RTab-Map and RTabMap Viz nodes
• RGBD_Odometry node
• Object recognition YOLO darknet node
• ROS_Arduino communication node
Besides this, the OpenNI nodelet provides ROS access to the camera, and another nodelet called data_throttle is created to synchronize and manage the relevant topics published by OpenNI: the RGB images and the depth information, which are subsequently sent over TCP/IP to the remote client PC, which runs the other nodes related to the Visual SLAM system.
The figure 3.11 shows the architecture and communication process of the groups, nodes,
nodelets and topics that are running in the system.
Figure 3.11: ROS Architecture of the System
• Rtab-Map Node
The online output of the node, shown on the remote PC screen, is the local graph with the latest data added to the map, which is ultimately composed of nodes and links.
This node has 2 different types of memory: working memory (WM) and long-term memory (LTM). This division of the memory permits long-term operation at a large scale thanks to the dynamic management approach used to detect loop closures: the most recent and frequently observed places (whose weight is evaluated as high) are kept in WM, and the oldest ones are transferred to the LTM, where they are stored and can be retrieved when convenient. The structure of this node can be visualized in figure 3.12.
An important aspect of the options chosen in RTabMap was to constrain the loop closure
detection and graph optimization to 3 degrees of freedom which is enough for a robot whose
movement is constrained to the floor.
Figure 3.12: Block Diagram of the Structure of Rtab-Map Node (Labbé, 2019)
Rtab-Map receives the data from the odometry node and the throttled depth and RGB images on the remote computer, where the Short-Term Memory (STM) creates a node (as part of the map) containing the odometry and the sensor data. The STM size was increased to its maximum because this helps to find loop closures despite the continuous change of the images received while the legged robot is moving.
Figure 3.13: Graph Representation of Locations in the Memory (Labbé, 2014)
To determine which nodes to transfer to LTM, a weighting mechanism identifies locations
that are more important than others and to do so, when creating a new node, STM initializes
the node’s weight to 0 and compares it visually with the last node in the graph. If they are
similar (with the percentage of corresponding quantized SURF features over the similarity
threshold “Mem/RehearsalSimilarity” which was slightly decreased), the weight of the new
node is increased by one plus the weight of the last node.
The loop closure detection relies on a Bayesian filter (Labbé, 2014) (described in appendix A), which is used to keep track of the hypotheses by estimating the probability that the current location matches one of the already visited locations stored in the working memory, or a location estimated to be close enough to it (Angeli, 2008).
• RGBD Odometry Node
For the purpose of the visual odometry, this node is able to work with 2 different approaches
which are called Frame-to-Map (F2M) and Frame-To-Frame (F2F).
The difference between these 2 options is that the first one (F2M) compares the features extracted from the scene with a local map of features created from past frames, whereas F2F, the approach that gave better results in this work, makes the comparison against only the last keyframe.
The architecture of this node can be visualized in the figure 3.14.
Figure 3.14: Block Diagram of the Structure of RGBD Odometry Node (Labbé, 2019)
A 3 DOF odometry was selected for the same reasons previously mentioned for the restriction of the Rtabmap node to 3 DOF mapping optimization.
The process performed by this node is described by the following steps:
• A maximum number of GFTT (Good Features to Track) features was manually set. GFTT was selected for this node because it helps to obtain features in a uniform way for images with different sizes and light intensities.
• Feature Matching, which uses different approaches depending on whether the algorithm is set to F2M or F2F. For F2M, a nearest neighbor distance ratio (NNDR) evaluation using BRIEF descriptors is done; for F2F, which is the approach used in the visual SLAM implemented here, optical flow is computed directly on the GFTT features, providing faster feature correspondences against the last key frame.
• Motion Prediction, which is used to predict where the features of the key frame (F2F) or of the feature map (F2M) should appear in the current frame, based on previous motion and assuming a constant velocity.
• Motion Estimation made with the Perspective-n-point PnP RANSAC algorithm used by
OpenCV.
• Local Bundle Adjustment, which refines the scene (the 3D point cloud) and the camera poses by minimizing the reprojection errors.
• Pose Update
• Key Frame and Feature Map update
The previously mentioned Perspective-n-Point RANSAC algorithm used by OpenCV in RTabMap relies on the combination of 2 algorithms:
1. The RANSAC algorithm, used to filter the noise
2. Levenberg-Marquardt optimization, used to solve the Pinhole Camera Model that in turn solves the Perspective-n-Point problem.
This is further explained in the Appendix B.
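As an illustration of the F2F pipeline just described (GFTT features, optical-flow matching and PnP RANSAC), the following OpenCV sketch strings those calls together for a single pair of frames. It is a conceptual outline, not the internal code of the rtabmap_ros odometry node, and the parameter values (maximum corners, quality level, reprojection threshold) are assumptions.

```python
import cv2
import numpy as np

def detect_keyframe_features(gray, max_corners=1000):
    """GFTT detection for a new keyframe (quality and distance thresholds are assumptions)."""
    return cv2.goodFeaturesToTrack(gray, maxCorners=max_corners,
                                   qualityLevel=0.01, minDistance=7)

def f2f_step(prev_gray, gray, prev_pts_3d, prev_pts_2d, K, dist):
    """One Frame-to-Frame odometry step: track 2D features and estimate the new camera pose.

    prev_pts_3d: Nx3 array of 3D points of the last keyframe (from the depth image)
    prev_pts_2d: Nx1x2 float32 array of their pixel positions in the last keyframe
    K, dist:     camera matrix and distortion coefficients of the RGB-D camera
    """
    # Track the keyframe features into the current image with pyramidal Lucas-Kanade optical flow.
    next_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, prev_pts_2d, None)
    good = status.ravel() == 1
    obj = prev_pts_3d[good].astype(np.float32)
    img = next_pts[good].reshape(-1, 2).astype(np.float32)
    # PnP + RANSAC rejects outlier correspondences and gives the pose of the current frame.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(obj, img, K, dist,
                                                 reprojectionError=3.0, iterationsCount=100)
    return ok, rvec, tvec, inliers
```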
• Object Recognition YOLO_ROS Node
This node was implemented as a package and launched in ROS Melodic running on the Jetson Nano. It has to be mentioned that the computations required by the different models tested used the integrated GPU of the device thanks to the CUDA software.
YOLOv3 was tested first but, unfortunately, it was too computationally expensive for the Jetson Nano, which is already processing the images required by the RtabMap and Visual Odometry nodes. YOLOv3 had delays and was only able to run at around 1.2 FPS.
For this reason, the YOLO version used in this implementation is the so-called Tiny-Yolo-v2, which is able to process the incoming images at a higher speed (approximately 10.2 frames per second). The weights used in this project were pretrained on the VOC dataset and the algorithm was configured to recognize only 1 class: persons.
The incoming images used by this algorithm come directly from Openni node as the RGB
raw data.
• ROS_ARDUINO Connection Node
A client-service node was created in order to send commands to the robot from the remote PC connected using the TCP/IP protocol.
A ROS node called ROS_Arduino, containing the service, was launched on the robot and is constantly available to the user. On the other hand, a client is launched on the remote PC whenever we want to send information about the movements that the robot should perform. Both the client and the service scripts were written in the Python programming language.
After the service receives the commands from the client, it encodes this information into an array and sends it to the Arduino board, which is connected to the Jetson Nano via serial communication. Once this is done, the Arduino reads the information received and executes the commands, moving the robot in the desired way.
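A minimal sketch of this client-service pattern is shown below. The service name, message fields, serial port and baud rate are assumptions made for illustration (a hypothetical MoveCommand service with a direction and a number of steps), not the exact definitions used in the thesis code.

```python
#!/usr/bin/env python
# Server side (runs on the Jetson Nano): receives a command and forwards it to the Arduino Due.
# Assumes a hypothetical custom service my_robot/MoveCommand {string direction, int32 steps -> bool ok}.
import rospy
import serial
from my_robot.srv import MoveCommand, MoveCommandResponse  # hypothetical srv definition

arduino = serial.Serial("/dev/ttyACM0", 115200, timeout=1)  # assumed port and baud rate

def handle_move(req):
    # Encode the command as a small byte array: one direction code plus the step count.
    codes = {"forward": 1, "left": 2, "right": 3, "stop": 0}
    packet = bytearray([codes.get(req.direction, 0), req.steps & 0xFF])
    arduino.write(packet)
    return MoveCommandResponse(ok=True)

if __name__ == "__main__":
    rospy.init_node("ros_arduino_service")
    rospy.Service("move_command", MoveCommand, handle_move)
    rospy.spin()
```

On the remote PC, the client side would essentially call rospy.wait_for_service("move_command") and then rospy.ServiceProxy("move_command", MoveCommand)("forward", 5) with the desired direction and number of steps.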
Figure 3.15 shows a diagram of how the communication is carried out to process this service-client interaction when it is called by the user.
Figure 3.15: ROS Arduino Communication Node
Chapter 4
Results and Discussion
In order to test the system, 2 different experiments were performed with the purpose of measuring the error of the visual SLAM -when it is estimating the relative position of the robot- and of the object recognition algorithm -while it is processing the images and indicating the presence of humans in them-.
4.1. Visual SLAM Experiment
For the Visual SLAM, the relative position was measured and then compared with the real position, which had previously been measured and marked on the floor.
The parameters analyzed are x, y, and the yaw angle, which are enough to define the pose of a robot constrained to move in the XY plane. The average frame rate of the depth and RGB images was 2.9 Hz.
It has to be pointed out that, to carry out the experiment, the robot was initialized in a place for which no map had previously been obtained; the system then started scanning the world with the available RGB and depth information -as can be seen in figure 4.1, where features are being detected- and the visual odometry was set to the position (0, 0) with a yaw angle of 0, described by position 0 in table 4.1.
Figure 4.1: Initialization of the Robot
Table 4.1: Results Obtained in the Visual Slam Experiment
Position | x_real (m) | y_real (m) | yaw_real (°) | x (m) | y (m) | yaw (°)
0  |  0    |  0    |    0  | -0.0014 |  0.00113 |   0.032
1  | -0.1  |  0    |    0  | -0.12   | -0.00296 |  -0.743
2  |  0.5  |  0    |    0  |  0.56   |  0.0236  |  -2.47
3  |  1.1  |  0    |    0  |  1.09   |  0.0708  |  -2.3
4  |  1.7  |  0    |    0  |  1.67   |  0.0809  |  -6.78
5  |  2.3  |  0    |    0  |  2.25   |  0.0625  |  -7.78
6  |  2.9  | -0.6  |    0  |  2.83   | -0.51    |  -0.0142
7  |  2.3  | -1.2  |  -90  |  2.55   | -1.29    | -93
8  |  2.3  | -1.2  | -120  |  2.4    | -1.12    | -112
9  |  2.3  | -2.4  |  -90  |  2.28   | -2.18    | -99
10 |  2.6  | -3.6  | -120  |  2.4    | -3.39    | -114
Table 4.1 describes the positioning results obtained while the robot was following a continuous path, which can also be observed as a colored line in figure 4.2.
Figure 4.2: Path and Map of the Visual SLAM Experiment
Since the previous figure shows the map created when the robot is in this environment, the actual photo of the place (see figure 4.3) can also be used to further analyze some particularities found in them.
Figure 4.3: Real Image of the place used for the experiment of the VISUAL SLAM
Now the table 4.2 will show the errors associated with the measurements taken in the
experiment.
Table 4.2: Errors Associated with the Experiment for X, Y and Yaw angle
Position | X error (cm) | Y error (cm) | Angle error (degrees)
0  |   0.14 |  -0.113 | -0.032
1  |   2    |   0.296 |  0.743
2  |  -6    |  -2.36  |  2.47
3  |   1    |  -7.08  |  2.3
4  |   3    |  -8.09  |  6.78
5  |   5    |  -6.25  |  7.78
6  |   7    |  -9     |  0.0142
7  | -25    |   9     |  3
8  | -10    |  -8     | -8
9  |   2    | -22     |  9
10 |  20    | -21     | -6
The following figures (4.4, 4.5 and 4.6) present the comparison of the actual trajectory points with the estimated ones for the two axes and for the heading angle.
Figure 4.4: X-Axis Comparison of Results and Ground Truth
Figure 4.5: Y-Axis Comparison of Results and Ground Truth
Figure 4.6: Yaw Angle Comparison of Results and Ground Truth
Figure 4.7: Error Associated to the Measurement in the X and Y Axis
Figures 4.7 and 4.8 show the amount of error that the visual system presents for the X and Y axis measurements and for the yaw angle, respectively.
Figure 4.8: Error Associated to the Measurement of the Heading angle
Now the Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE) will be calculated for each of the 3 values studied, using the following equations:
RMSE = sqrt( (1/N) · Σ_{t=1..N} (ŷ_t − y_t)² )
Where:
N is the number of observations,
ŷ_t is the real value of the variable, and
y_t is the value measured for the given variable.
MAE = (1/N) · Σ_{t=1..N} |ŷ_t − y_t|
Where all the variables are the same as previously mentioned for RMSE.
The calculated RMS errors are:
RMSE (x axis) = 8.135 cm
RMSE (y axis) = 9.493 cm
RMSE (yaw angle) = 5.295 degrees
And the calculated MAE are:
MAE (x axis) = 6.012 cm
MAE (y axis) = 7.562 cm
MAE (yaw angle) = 4.192 degrees
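These two metrics can be computed directly from the ground-truth and estimated values of table 4.1, for example with a short NumPy helper such as the one sketched below (only the first four x positions of the table are used here as a usage example; the thesis values were obtained from the full experiment data).

```python
import numpy as np

def rmse(ground_truth, estimated):
    """Root Mean Square Error between ground-truth and estimated values."""
    e = np.asarray(ground_truth) - np.asarray(estimated)
    return float(np.sqrt(np.mean(e ** 2)))

def mae(ground_truth, estimated):
    """Mean Absolute Error between ground-truth and estimated values."""
    e = np.asarray(ground_truth) - np.asarray(estimated)
    return float(np.mean(np.abs(e)))

# Usage example with the first four x positions from table 4.1 (values in meters).
x_real = [0.0, -0.1, 0.5, 1.1]
x_slam = [-0.0014, -0.12, 0.56, 1.09]
print(rmse(x_real, x_slam), mae(x_real, x_slam))  # roughly 0.032 m and 0.023 m
```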
• Discussion
The system is able to localize itself with an acceptable amount of error, although the errors increase when several consecutive rotations towards different orientations are made; this happened in a place where the depth was mostly beyond 4 meters, which is outside the range of the Kinect sensor.
The errors in the two axes seem to behave similarly: they act almost as mirror images of each other with respect to the horizontal zero-error line (see figure 4.7), which shows a close relationship between these 2 values and suggests that an error in one axis will likely lead to an error in the other.
The positioning error increased when the 90-degree turn was made, indicating that when the system is creating a completely new part of the map, with no previously seen features (data), it may require more observations of the environment.
The mean absolute error (MAE) shows that the X and Y axes have around 6 and 7.5 cm of localization error respectively, although these values were influenced by a few measurements with a much higher error than the others, especially in the X axis; this is corroborated by the RMSE, which is consequently higher than the MAE due to these outliers (the sensitivity to large errors is implicit in its quadratic nature). For the heading or yaw angle, the root mean square error and the mean absolute error are closer, indicating less variation between the estimations and the ground truth.
In the center of figure 4.3 there is an object covered with plastic that was not scanned (see figure 4.2), probably due to the reflective properties of the material.
The blue chair on the right of the created map shows that when the algorithm sees an object it determines its depth based on the perspective it has, which creates additional noise or errors in some regions of the newly created map.
Zones that are not directly observed and which have varied lighting conditions are more susceptible to noise in the map.
4.2. Object Recognition Experiment
This experiment, which used the tiny YOLO v2 algorithm, was carried out while the robot was walking, in order to increase the noise of the images and to evaluate how well the object recognition performs when it is subject to the intrinsic movements involved in dynamic walking.
The data collected consisted of 605 frames recorded in 59.23 seconds, which gives an average rate of 10.21 Hz; this should be enough for real-time operation. It has to be pointed out that the system did not show any noticeable lag while running with these settings.
The data collected is divided into 3 different types:
1. True Positives: frames in which humans were successfully detected when they were present.
2. False Positives: frames in which the algorithm detected humans that were not present in the images.
3. False Negatives: frames in which YOLO failed to detect the humans present in the images.
The table 4.3 shows us the results of the test.
Table 4.3: Results Obtained with Tiny-Yolo-v2
Number of Frames True Positives False Positives False Negative
605 218 25 44
Figure 4.9: Example of False Positive of the Object Detection Algorithm
Figure 4.10: Graphical Representation of Results of Tiny-Yolo-v2
In order to interpret the results obtained, the precision and the recall are calculated.
Precision = TruePositive / (TruePositive + FalsePositive)
Precision = 218 / (218 + 25)
Precision = 0.897

Recall = TruePositive / (TruePositive + FalseNegative)
Recall = 218 / (218 + 44)
Recall = 0.83
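The same computation can be expressed in a couple of lines of Python using the counts of table 4.3:

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Counts taken from table 4.3 (tiny-YOLO-v2 experiment).
p, r = precision_recall(tp=218, fp=25, fn=44)
print(round(p, 3), round(r, 3))  # 0.897 0.832
```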
• Discussion
The calculated precision tells us that 89.7% of the predictions made are correct, which implies that only 10.3% are false positives; the recall, on the other hand, indicates that 83.2% of the humans present within the range of the camera were recognized.
Figure 4.9 shows one example of a false positive detected during the operation of the robot, but this metric is less relevant than the false negatives because, according to the results, the latter occur almost twice as often. The false negatives mostly occur when the human figure is not completely visible to the camera -so the robot can only see a part of it- or when it is partially occluded by an object. Sudden movements of the camera sometimes cause undesired false positives.
This algorithm has proved to be robust against flickering and varying lighting conditions.
Chapter 5
Conclusion and Recommendations
5.1. Conclusion
This thesis project covers the implementation of a visual SLAM system for a 4-legged robot, able to walk on terrain with obstacles and height variations of up to 25 mm, whose mechanical and electronic systems were designed and built from scratch. It also integrates a real-time object recognition algorithm used to identify humans by analyzing, with a deep learning neural network known as tiny-YOLOv2, the RGB information obtained through a depth camera -a Microsoft Kinect 360-.
The main structure of the legged robot was 3D printed and has shown enough strength to be functional. On the actuation side, a major challenge was to generate and tune a dynamic gait pattern fast enough to keep the camera sufficiently stable to perform the analysis of the visual data required by the demanded tasks. The division of tasks and ROS nodes that this robot required was successfully implemented: tiny-yolo-v2 runs fully on the NVIDIA board, which is also in charge of the connection with the Arduino Due and of the throttling and transport of the RGB and depth information to the remote client PC, which handles this data and transforms it into the 3D map and pose estimation using the RTabMap and RGBD Odometry nodes.
The visual SLAM system has a mean absolute error (MAE) of 6 and 7.5 cm in the x and y axes respectively, and a MAE of 4.2 degrees for the heading angle in the localization task; nonetheless, it could be substantially improved (obtaining better localization results and a higher-resolution map) if the processing power of the associated device were increased.
The object recognition deep neural network used has shown a precision of 89.7% and a recall of 83% in the task of human recognition.
5.2. Recommendations
Since the system is currently relying solely on the camera information, the addition of other
sensors would help to obtain a more accurate localization or map especially when we take
into account long displacements and/or dynamic environments. However, this combination
would be part of a longer project that relies on the accuracy of a sensor fusion estimation
which will have to be further analyzed.
References
Wirtz, J. (2018). Brave new world: service robots in the frontline. Journal of Service
Management.
García-Solera, Álvaro et. al. (2018). Inclusion of service robots in the daily lives of frail
older users: A step-by-step definition procedure on users' requirements. Archives of
Gerontology and Geriatrics, Elsevier, pp. 191-196.
Tang, Dan, et al. (2018). Development Process of a Smart UAV for Autonomous Target Detection. The 16th LACCEI International Multi-Conference for Engineering, Education, and Technology: "Innovation in Education and Inclusion". New York.
Siegwart, R., et al. (2011). Introduction to Autonomous Mobile Robots. Cambridge, MA: MIT Press.
Bescos B., et. al. (2018). DynaSLAM: Tracking, Mapping and Inpainting in Dynamic
Scenes. IEEE Robotics and Automation Letters (Volume: 3, Issue: 4).
Mur-Artal, Raul. (2014). ORB-SLAM: Tracking and Mapping Recognizable Features. Workshop on Multi-view Geometry in Robotics (MVIGRO). Zaragoza.
Li, X. et. al. (2016). Semi-Dense 3D Semantic Mapping from Monocular SLAM.
Böttcher, S. (2006). Seminar in Principles of Robot Locomotion. Retrieved from Southern Illinois University: http://www2.cs.siu.edu/~hexmoor/classes/CS404-S09/RobotLocomotion.pdf
Kajita, Shuuji & Espiau, Bernard. (2018). Legged Robots Tutorial. Retrieved from Politecnico di Milano: http://home.deib.polimi.it/gini/robot/docs/legged.pdf
Vidanapathirana, M. (2018). Real-time Human Detection in Computer Vision. Retrieved from Medium: https://medium.com/@madhawavidanapathirana/https-medium-com-madhawavidanapathirana-real-time-human-detection-in-computer-vision-part-1-2acb851f4e55
Viola, P. & Jones, M. (2001). Rapid Object Detection using a Boosted Cascade of Simple Features. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Kauai, USA: IEEE.
Liu, Wei, et al. (2016). SSD: Single Shot MultiBox Detector. European Conference on Computer Vision, 21-37.
Redmon, Joseph, et al. (2016). You Only Look Once: Unified, Real-Time Object Detection. Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas: IEEE.
Yap, J., et al. (2018). Fixed Point Implementation of Tiny-Yolo-v2 using OpenCL on FPGA. International Journal of Advanced Computer Science and Applications (IJACSA) (Volume: 9, Issue: 10).
Cong, J. & Xiao, B. (2014). Minimizing computation in convolutional neural networks. International Conference on Artificial Neural Networks, pp. 281-290.
Labbé, Mathieu et. al. (2019) RTAB‐Map as an open‐source lidar and visual simultaneous
localization and mapping library for large‐scale and long‐term online operation.,
Journal of Field Robotics (Volume: 36, Issue: 2).
Labbé, Mathieu et. al. (2014) Online Global Loop Closure Detection for Large-Scale Multi-
Session Graph-Based SLAM, 2014 IEEE/RSJ International Conference on Intelligent
Robots and Systems. Chicago, USA: IEEE.
Angeli, Adrien & Doncieux, Stéphane & Meyer, J.-A & Filliat, David. (2008). Real-Time
Visual Loop-Closure Detection. Proceedings - IEEE International Conference on
Robotics and Automation. 1842 - 1847.
Bradski, G. (2008). Camera Calibration and 3D Reconstruction. The OpenCV Library Software Tools. Retrieved Nov 11, 2019, from: https://docs.opencv.org/2.4/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html
Appendices
Appendix A
Bayesian Filter Used for the Loop Closure Detection by Rtabmap Algorithm
The filter estimates the full posterior probability p(S_t | L^t) for all i = −1, ..., t_n, where t_n is the time index associated with the newest location in the Working Memory (WM):

p(S_t | L^t) = η · p(L_t | S_t) · Σ_{i=−1..t_n} p(S_t | S_{t−1} = i) · p(S_{t−1} = i | L^{t−1})

where:
S_t is a random variable that represents the states of all loop closure hypotheses at time t,
η is a normalization term, and
L^t = L_{−1}, ..., L_t is the sequence of locations contained in WM, which changes over time due to retrievals from the Long-Term Memory.
For the previous calculation, the observation model p(L_t | S_t) is evaluated using a likelihood function, which is also used to determine the weight update of each node; in it, N_pair is the number of matched features between the compared locations z_t and z_c.
The difference between the score s and the standard deviation is then normalized by the mean µ of all non-null scores.
The transition model is used to predict the distribution of 𝑆𝑡, given each state of the
distribution 𝑆𝑡−1 in accordance with the robot’s motion between t and t−1. Combined with
𝑝(𝑆𝑡−1 = 𝑖|𝐿𝑡−1) (the recursive part of the filter), this constitutes the belief of the next loop
closure.
After this normalization, the loop closure hypothesis selection accepts the hypothesis with the highest score if that score is higher than a previously defined threshold.
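The following short sketch illustrates the kind of discrete Bayes update described above: a prior over loop-closure hypotheses is propagated with a transition model and weighted by a likelihood computed from feature matches. The transition values and likelihoods are illustrative placeholders, not RTAB-Map's actual parameters.

```python
import numpy as np

def bayes_update(prior, transition, likelihood):
    """One discrete Bayes filter step over loop-closure hypotheses.

    prior:      p(S_{t-1} | L^{t-1}) over the hypotheses (index 0 = 'new place')
    transition: square matrix, transition[i, j] = p(S_t = j | S_{t-1} = i)
    likelihood: p(L_t | S_t = j) for every hypothesis j
    """
    predicted = prior @ transition      # prediction with the motion/transition model
    posterior = likelihood * predicted  # weighting by the observation model
    return posterior / posterior.sum()  # eta: normalization term

# Toy example with 'new place' plus three stored locations (all values are placeholders).
prior = np.array([0.7, 0.1, 0.1, 0.1])
transition = np.array([[0.9, 0.034, 0.033, 0.033],
                       [0.1, 0.6,   0.2,   0.1  ],
                       [0.1, 0.2,   0.6,   0.1  ],
                       [0.1, 0.1,   0.2,   0.6  ]])
likelihood = np.array([1.0, 0.5, 3.0, 0.8])  # location 2 matches many features
print(bayes_update(prior, transition, likelihood).round(3))
```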
Appendix B
Ransac Algorithm, PNP Problem and Levenberg – Marquardt Algorithm
RANSAC Algorithm
This algorithm is an iterative estimation technique whose purpose is to estimate the parameters of a model by randomly sampling the observed data. It fundamentally consists of 2 steps:
1. A sample subset of the data is randomly chosen and then a fitting model and its
parameters are estimated using them.
2. RANSAC checks all the elements of the data in order to determine whether they are consistent with the model of the previous step, considering a threshold that gives a tolerance for the amount of error attributable to the effects of noise. The elements which do not fit the model are called outliers, whereas the others are inliers.
These 2 steps are repeated until an estimated model has a sufficient number of inliers.
Example of RANSAC fitting model estimation using a line
A typical example of this algorithm can be shown using a line as the fitting model to determine the inliers and outliers present in the data.
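A compact version of this line-fitting example, under the assumptions of a fixed inlier threshold and iteration count, could look like the following NumPy sketch:

```python
import numpy as np

def ransac_line(points, n_iters=200, threshold=0.1, seed=0):
    """Fit a line y = a*x + b with RANSAC: sample 2 points, count inliers, keep the best model."""
    rng = np.random.default_rng(seed)
    best_model, best_inliers = None, np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        (x1, y1), (x2, y2) = points[rng.choice(len(points), size=2, replace=False)]
        if np.isclose(x1, x2):
            continue                      # degenerate sample, slope undefined
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        residuals = np.abs(points[:, 1] - (a * points[:, 0] + b))
        inliers = residuals < threshold   # consensus set for this candidate model
        if inliers.sum() > best_inliers.sum():
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers

# Noisy line y = 2x + 1 with a few gross outliers injected.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
pts = np.column_stack([x, 2 * x + 1 + rng.normal(0, 0.02, 50)])
pts[::10, 1] += 3.0
print(ransac_line(pts)[0])  # slope and intercept close to (2, 1)
```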
Perspective-n-point Problem
In the 1980s, Fischler proposed the Perspective-n-Point (PnP) problem whose goal is to
estimate the position and orientation of a calibrated camera from known 3D-2D point
correspondences.
The Pinhole Camera Model is the most popular model used to solve this problem, and it is also the one used by OpenCV to do so. The figure below and the following equation describe the mathematical formulation of this model:

s · m′ = A · [R | t] · M′

or, written out:

s · [u, v, 1]ᵀ = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]] · [R | t] · [X, Y, Z, 1]ᵀ

where:
• M′ = (X, Y, Z) are the coordinates of a 3D point in the world coordinate space
• m′ = (u, v) are the coordinates of the projection point in pixels
• A is the camera matrix, or matrix of intrinsic parameters
• (cx, cy) is the principal point, usually at the image center
• (fx, fy) are the focal lengths expressed in pixel units.
Pinhole Camera Model Graphical Interpretation
In order to solve this mathematical formulation, this node, through OpenCV, uses the Levenberg-Marquardt algorithm (LMA), which finds the approximate solutions for the angles and position of the camera (the parameters ß used by LMA) that minimize the reprojection error.
The reprojection error is the Euclidean distance in pixels (on the image plane) between the point q perceived by the camera and the point q̃ computed, for the same feature, using the estimated pose of the camera obtained by LMA.
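In OpenCV terms, the pose estimate and the resulting reprojection error can be obtained roughly as follows; the camera intrinsics shown are placeholders rather than the actual Kinect calibration used in this work, and the point arrays would come from the matched features.

```python
import cv2
import numpy as np

def pose_and_reprojection_error(object_pts, image_pts, K, dist):
    """Estimate the camera pose with PnP RANSAC and return the mean reprojection error in pixels.

    object_pts: Nx3 float array of 3D feature positions
    image_pts:  Nx2 float array of their measured pixel positions
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(object_pts, image_pts, K, dist)
    if not ok:
        return None
    projected, _ = cv2.projectPoints(object_pts, rvec, tvec, K, dist)
    # Euclidean pixel distance between the measured points q and the reprojected points q~.
    errors = np.linalg.norm(projected.reshape(-1, 2) - image_pts.reshape(-1, 2), axis=1)
    return rvec, tvec, float(errors.mean())

# Placeholder intrinsics (assumed values, not the thesis calibration).
K = np.array([[525.0, 0.0, 319.5],
              [0.0, 525.0, 239.5],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)
```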
Levenberg – Marquardt Algorithm (LMA)
This is an iterative procedure whose purpose is to find the parameters ß of the model curve f(x, ß) so that the sum of squared errors S(ß) is minimized:

S(ß) = Σ_i [y_i − f(x_i, ß)]²

This minimization proceeds by replacing the parameter vector ß with a new estimate ß + δ, so that the function f(x_i, ß + δ) can be approximated by:

f(x_i, ß + δ) ≈ f(x_i, ß) + J_i δ

where:

J_i = ∂f(x_i, ß) / ∂ß

is the gradient of f with respect to ß. In consequence, the approximation of S(ß + δ) is:

S(ß + δ) ≈ Σ_i [y_i − f(x_i, ß) − J_i δ]²

or in vector notation:

S(ß + δ) ≈ ‖y − f(ß) − Jδ‖²

Taking the derivative of S(ß + δ) with respect to δ and setting the result to zero gives:

(JᵀJ) δ = Jᵀ [y − f(ß)]
Levenberg's contribution is to add a damping factor λ to this equation, giving (JᵀJ + λI) δ = Jᵀ [y − f(ß)]; the factor λ is adjusted at each iteration and determines the amount of change δ given to the parameter ß.
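A bare-bones NumPy sketch of one damped (Levenberg-Marquardt style) iteration, with a numerically approximated Jacobian and a simple λ adjustment rule, is shown below; real solvers such as the one inside OpenCV add several refinements, so this is only an illustration of the update described above.

```python
import numpy as np

def lm_step(f, beta, y, lam, eps=1e-6):
    """One Levenberg-Marquardt step for minimizing sum((y - f(beta))^2).

    f:    model function returning the predicted vector for parameters beta
    beta: current parameter vector
    y:    observed vector
    lam:  current damping factor lambda
    """
    r = y - f(beta)                                   # residuals
    # Numerical Jacobian of f with respect to beta (forward differences).
    J = np.empty((len(r), len(beta)))
    for j in range(len(beta)):
        d = np.zeros_like(beta)
        d[j] = eps
        J[:, j] = (f(beta + d) - f(beta)) / eps
    # Damped normal equations: (J^T J + lambda * I) delta = J^T r
    A = J.T @ J + lam * np.eye(len(beta))
    delta = np.linalg.solve(A, J.T @ r)
    new_beta = beta + delta
    # Accept the step and relax the damping if the cost decreased; otherwise increase lambda.
    if np.sum((y - f(new_beta)) ** 2) < np.sum(r ** 2):
        return new_beta, lam / 2.0
    return beta, lam * 2.0

# Toy usage: fit y = b0 * exp(b1 * x) to synthetic data.
x = np.linspace(0, 1, 20)
y = 2.0 * np.exp(1.5 * x)
model = lambda b: b[0] * np.exp(b[1] * x)
beta, lam = np.array([1.0, 1.0]), 1e-2
for _ in range(50):
    beta, lam = lm_step(model, beta, y, lam)
print(beta)  # should approach approximately [2.0, 1.5]
```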