
LICENSE PLATE SURVEY FOR TRAFFIC ANALYSIS:

IMPROVING ACCURACY WITH CORRECTION ALGORITHMS

A THESIS SUBMITTED TO THE GRADUATE DIVISION OF THE UNIVERSITY OF

HAWAI‘I AT MĀNOA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE

DEGREE OF

MASTER OF SCIENCE

IN

CIVIL ENGINEERING

MAY 2012

By

Alireza Abrishamkar

Thesis Committee:

Panos D. Prevedouros, Chairperson

Peter G. Flachsbart

Michelle H. Teng


I would love to dedicate this thesis to my lovely parents;

and to three spiritual treasures I was so lucky to get to know in my life:

Dr. Chamran, Edoardo, and Mr. Ahmadian


ACKNOWLEDGEMENTS

First off, I would like to express my sincere gratitude and appreciation to my advisor Dr. Panos Prevedouros for his support, encouragement and advice during the course of this thesis and my entire Master’s program. I believe the lessons I have learnt from his rectitude and demeanor during these three years are no less important than what I have learnt from him in Traffic Engineering.

I would like to sincerely thank Dr. Flachsbart and Dr. Teng for serving on my thesis committee, providing guidance and support for finalizing this thesis, and for all of their valuable help during my Master’s studies.

I would also like to thank my dear colleagues and friends in the Traffic and Transportation Lab (TTL): Alyx (Xin) Yu, Lambros Mitropoulos, Laxman KC, Kevin Jenkins, Maja Caroee, James Tokishi, Natasha Soriano, and Myles Gota, for all their help and all the nice times we had together.

I am indeed completely indebted to my parents for everything in my life, including this thesis, and I would like to express my sincerest appreciation to them. Finally, I thank my brothers, Afshin and Amin, for all their help and support.


ABSTRACT

Vehicle tracking methods are widely used for a variety of purposes including collection of

travel time and duration of stay data. The collected data are used for planning and management

purposes. The type of data depends on the method of data collection. Tracking methods are

usually classified into active and passive. In this research they are classified into two categories,

discrete and continuous. Among all methods, the discrete method of license plate matching is the

most prevalent for data collection.

The purpose of this research is to discuss the accuracy of the manual license plate matching method for vehicle tracking and travel time data collection, and to provide correction algorithms to improve the results. The impacts of recordation style and of visual similarities between characters (letters and numbers) on the matching errors are investigated. The correction algorithms are compared and evaluated.

The application of correction algorithms, specifically those that are more constrained to filter out false matches, can considerably increase the percentage of matched license plates. To a lesser degree, this processing can improve the statistics of the license plate datasets, such as the average, standard deviation and median of travel time and duration of stay at a location.

This study also found evidence that a significant portion of the letters mis-recorded during license plate recordation are confused with visually similar letters, which underlines the human factor in the accuracy of the method. Digits are not significantly likely to be mistaken for one another, because of their visual dissimilarity.

The recordation workload is also shown to be significant: more letters to be recorded result in more errors.


TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
List of Tables
List of Figures
List of Equations

CHAPTER 1 INTRODUCTION
    1.1 Background
    1.2 Purpose and Objectives
    1.3 Definitions
    1.4 Thesis Description

CHAPTER 2 VEHICLE TRACKING AND TRAVEL TIME DATA COLLECTION
    2.1 Active and Passive Vehicle Tracking Methods
    2.2 Static and Mobile Vehicle Tracking Methods
    2.3 Discrete and Continuous Vehicle Tracking Methods
        2.3.1 GPS
        2.3.2 Bluetooth and Radio Frequency Identification (RFID)
        2.3.3 License Plate

CHAPTER 3 LICENSE PLATE MATCHING ERRORS
    3.1 License Plate Correction - Edit Distance
    3.2 Human Memory Factor

CHAPTER 4 METHODOLOGY
    4.1 License Plate Format
    4.2 License Plate Matching
    4.3 Processing of Unmatched Data
        4.3.1 Full Correction Algorithm
        4.3.2 Algorithm for Correction of Similar Characters (“Similar Algorithm”)
    4.4 Discussion on Processing Results

CHAPTER 5 DATA COLLECTION AND ANALYSIS
    5.1 Data Collection
        5.1.1 Dataset 1 (ABC1): ITE
        5.1.2 Dataset 2 (C123): HAVO 2009
        5.1.3 Dataset 3 (ABC123): HAVO 2007 - 1
        5.1.4 Dataset 4 (ABC123): HAVO 2007 - 2
    5.2 Individual Analyses
        5.2.1 Algorithm Comparison
        5.2.2 Evaluation of Impact of Similarity on Errors
        5.2.3 Analyses of each Dataset
            5.2.3.1 Dataset 1 (ABC1): ITE
            5.2.3.2 Dataset 2 (C123): HAVO 2009
            5.2.3.3 Dataset 3 (ABC123): HAVO 2007 - 1
            5.2.3.4 Dataset 4 (ABC123): HAVO 2007 - 2
    5.3 Aggregate Analyses
        5.3.1 Processing Time of Algorithms
        5.3.2 Influence on Percentage of Matched Vehicles
        5.3.3 Influence on Statistical Indices
    5.4 Evaluation of Impact of Similarity after One Iteration and Redefinition of Similar Characters

CHAPTER 6 CONCLUSION

REFERENCES

Appendix A Algorithms
Appendix B Mistakes Matrices by Algorithms A to D


List of Tables

Table 1. Qualitative Comparison of Travel Time Data Collection of Different Techniques for License Plate Method
Table 2. Travel Time Data Collection of Different Techniques for License Plate Method
Table 3. Similar Letters
Table 4. Similar Digits
Table 5. Sample of Substituted Letters Using the Full Correction Algorithm C
Table 6. Data Collection Specifications
Table 7. Uniform Mistakes Matrix for 900 Mis-recorded Numbers
Table 8. Hypothesized Mistakes Matrix for Letters. Yellow Cells are the Intersection of Similar Letters
Table 9. Hypothesized Mistakes Matrix for Digits. Yellow Cells are the Intersection of Similar Digits
Table 10. Letters Mistakes Matrix for Dataset 1, by Algorithm E
Table 11. Ratio of Similar Character Misreadings to Total Count of Misreadings
Table 12. Deviation of the Mistake Matrices by Algorithms A to D based on E
Table 13. Numbers Mistakes Matrix for Dataset 1, by Algorithm E
Table 25. Numbers Mistakes Matrix for Dataset 3, by Algorithm E
Table 32. [Ratio] of Similar Character Misreadings to Total Count of Misreadings
Table 34. Processing Time of Different Correction Algorithms
Table 35. Ratio of Processing Time for ‘Similar Algorithm’ to other Full Algorithms
Table 36. Contribution of each Algorithm to Percentage of Matched Vehicles - Letters and Digits Separately
Table 37. Ratio of Contribution to Number of Initial Unmatched License Plates
Table 38. Average for the Duration of Stay
Table 39. Standard Deviation for the Duration of Stay
Table 40. Median for the Duration of Stay
Table 41. Updated Blank Mistakes Matrices for Letters
Table 42. Updated Blank Mistakes Matrices for Digits
Table 43. Updated Letters Mistakes Matrix for Dataset 1, by Algorithm E (Second Iteration)
Table 44. Updated Numbers Mistakes Matrix for Dataset 1, by Algorithm E (Second Iteration)


List of Figures

Figure 1: A Depiction of GPS-based Active Tracking System.
Figure 2: Static and Mobile Vehicle Tracking Technologies.
Figure 3: Discrete-Continuous Spectrum for Classification of Vehicle Tracking Technologies.
Figure 4: Correction Algorithms that are used after Initial Matching. Gray Boxes Show the Main Algorithms.
Figure 5: Flowchart of Initial License Plate Matching Procedure.
Figure 6: Unconstrained Full Correction Algorithms.
Figure 7: Constrained Full Correction Algorithms.
Figure 8: Correction Algorithm for Similar Letters and Digits.
Figure 9: Data Collection at Waipio Peninsula Soccer Complex
Figure 10: Data Collection at the Entrance of Hawaii Volcanoes National Park (2009)


List of Equations

Equation 1
Equation 2
Equation 3
Equation 4
Equation 5


CHAPTER 1

INTRODUCTION

1.1 Background

Recording the license plate of a vehicle together with the time of observation is a simple technique to track the vehicle and obtain travel time data. Currently the majority of vehicle tracking systems are GPS-based. A GPS system can provide data about vehicle whereabouts instantly and with high accuracy; however, there are certain limitations. One limitation is that such a system cannot work underground or in tunnels. Another is that it can be expensive if the number of tracked vehicles is high and there is no need to know their whereabouts in real time. In such cases, besides the expensive data collection, data reduction and processing are also heavier and more costly, since a huge volume of data is usually collected while only a small portion of it is really needed; therefore, tracking technologies and methods that do not collect data continuously are preferred for local applications.

One of the most widely used methods for travel time data collection is license plate matching. License plate matching is also used for origin-destination studies and transportation planning. By collecting the locations, and consequently the paths, of a sufficiently large number of vehicles, travel patterns can be recognized; the average speeds on different segments of the path can also be obtained and used for traffic management and planning purposes. License plate recordation at the entrances and exits of parking lots is a common way to conduct parking studies. In addition, license plate recognition is widely used at toll collection stations and, to some extent, for law enforcement.


License plate collection and matching can be performed using several methods. The choice depends on the volume of data to be collected, weather conditions, available budget, required accuracy, vehicle speed, type of route, etc. The methods range from completely manual to completely automatic. More details about these methods, together with the advantages, disadvantages and related issues of each, are given in Chapter 3.

One of the major issues with license plate matching techniques is their accuracy. The way accuracy is evaluated depends on the method of data collection and matching. When the license plate is captured by cameras as input to image-processing algorithms and software, the whole license plate number is recorded. When it is collected manually, however, usually only a subset of characters, typically the last four digits [1], is recorded in order to expedite the recordation procedure and to allay fears about the monitoring of private property. The latter method increases the chance that non-identical license plates are matched. Previous studies of the accuracy of license plate matching normally focus on the probability of spurious matches caused by the few characters that are not recorded, and they assume that the recorded characters are correct [2] (more in Chapter 3). In automatic recordation, the accuracy of the recorded characters is the subject of study [3].
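The effect of recording only a subset of characters can be illustrated with a minimal Python sketch. It assumes that observers keep the last four characters of each plate; the plate values and the function names `partial_key` and `spurious_groups` are hypothetical and serve only to show how distinct vehicles can collapse onto the same recorded string.

```python
from collections import defaultdict


def partial_key(plate: str, n: int = 4) -> str:
    """Return the last n characters of a plate (assumed manual recording style)."""
    return plate[-n:]


def spurious_groups(plates, n=4):
    """Group distinct full plates that become identical after truncation."""
    groups = defaultdict(set)
    for plate in plates:
        groups[partial_key(plate, n)].add(plate)
    # Keep only recorded strings shared by more than one distinct vehicle.
    return {key: sorted(full) for key, full in groups.items() if len(full) > 1}


# Hypothetical plates: two different vehicles share the last four characters.
full_plates = ["ABC123", "XYC123", "DEF456", "GHI789"]
print(spurious_groups(full_plates))  # {'C123': ['ABC123', 'XYC123']}
```

Any group returned by this sketch is a potential spurious match: the two vehicles cannot be told apart from the recorded characters alone.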

1.2 Purpose and Objectives

The purpose of this research is to investigate the accuracy of license plate matching methods for vehicle tracking and travel time data collection, and to provide correction algorithms to improve the results. The main focus is on the manual license plate method, where data are collected by human observers with pen, paper and watches, and are matched and processed by computer applications. The impacts of recordation style and of visual similarities between characters (letters and numbers) on the matching errors are investigated. After the unmatched data are processed, the influence of this enhancement on the main statistics of the data is measured. Finally, the correction algorithms are compared and evaluated.

1.3 Definitions

Vehicle tracking is closely intertwined with electronic devices and computer software. A vehicle tracking system combines the installation of an electronic device in a vehicle, or fleet of vehicles, with purpose-designed computer software at one or more operational bases to enable the owner or a third party to track vehicle location and other operational, passenger or freight data.

There are two essential parameters: the “location” of the tracked vehicle at certain points of “time”.

Travel time is broadly defined as “the time necessary to traverse a route between any two points of interest” [4]. Travel time is one of the most common types of data collected by tracking vehicles; in fact, a simple method of vehicle tracking is sometimes no more than travel time data collection. By locating a vehicle at various locations, the travel time between different points (segments of the whole route) is determined; combined, these data constitute the tracking of the vehicle along the route.

In this research, vehicle tracking is the process of acquiring the duration of stay of vehicles at specific locations, usually parking lots.

1.4 Thesis Description

This thesis addresses the following topics:


• Overview of the methods of vehicle tracking, and categorizations of relevant

technologies and methods.

• Applications and accuracy of license plate matching as a common discrete vehicle

tracking method.

• Definition of possible errors (human vs. random) involved in manual license plate

recordation.

• Discussion on the similarities and differences between errors in manual license plate

recordation and automatic recordation/processing.

• Development of algorithms to improve license plate matching for both types of possible

errors in manual license plate recordation.

• Evaluation of the algorithms for different license plate recordation styles (e.g. the whole

license plate vs. four characters of it).

• Evaluation of the algorithms for letters versus digits.

• Evaluation of the algorithms in terms of performance, accuracy and processing speed.

Following this introductory chapter, the methods of vehicle tracking are presented and

classified in Chapter 2. Common technologies for vehicle tracking and their applications in

travel time data collection are discussed.

Chapter 3 discusses the accuracy, errors and improvements of license plate matching and reviews previous work.

Chapter 4 describes the methodology used to develop correction algorithms that reduce the errors and improve license plate matching. It also explains the structure of each algorithm and how it operates. These algorithms increase the matching percentage by processing those license plates that have one mis-recorded character. This chapter also describes how the results of the different algorithms can be interpreted.
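The idea of recovering plates with one mis-recorded character can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's correction algorithms (those are defined in Chapter 4 and Appendix A): it pairs an unmatched plate from one station with an unmatched plate from the other when the two strings have equal length and differ in exactly one position. The function names and the rule of keeping only unambiguous pairs are assumptions made for this example.

```python
def differs_by_one(a: str, b: str) -> bool:
    """True if a and b have equal length and differ in exactly one position."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) == 1


def candidate_corrections(unmatched_in, unmatched_out):
    """Pair unmatched plates that differ by a single substituted character.

    Assumption for this sketch: a pair is kept only when the plate has
    exactly one such candidate, to limit the risk of false matches.
    """
    pairs = []
    for a in unmatched_in:
        hits = [b for b in unmatched_out if differs_by_one(a, b)]
        if len(hits) == 1:
            pairs.append((a, hits[0]))
    return pairs


# Hypothetical records: "B" written as "8" at the second station.
print(candidate_corrections(["ABC1", "XYZ9"], ["A8C1", "QRS5"]))
# [('ABC1', 'A8C1')]
```

Whether such a pair should be accepted as the same vehicle is a separate question; additional constraints, for example on observation times, would be needed to filter out false matches.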

Chapter 5 describes the four datasets that were used, and presents the results of the data

analyses performed on them. The performance of proposed correction algorithms is evaluated

based on these results.

Chapter 6 presents the general conclusions made based on the data analyses.


CHAPTER 2

VEHICLE TRACKING AND TRAVEL TIME DATA COLLECTION

In this chapter the methods of vehicle tracking are presented. The categorization that clarifies the connection between vehicle tracking methods and travel time data collection, namely the discrete-continuous spectrum, is described, and the position and role of the license plate survey in these categories are discussed.

There are several categorizations of vehicle tracking methods and technologies. The most

common one defines two categories with overlapping applications, namely, Active Tracking and

Passive Tracking.

2.1 Active and Passive Vehicle Tracking Methods

Active Tracking, also known as online tracking or real-time tracking, comprises a system that locates the vehicle by means of electronic location sensors, generally at predefined regular points of time, and another system that transmits the data directly to a Data Management Center (DMC). Depending on the scale of the tracking system, the DMC can vary from a PC or a smart phone to a large data center. The received data can either be recorded on long-term storage for future use or merely be monitored. Although the tracker unit installed in the vehicle may also store the collected data, the main point of active tracking is immediate access to the data by the DMC. An active tracking system is used when vehicles must be monitored in transit.


Commercial vehicle tracking systems usually use a cellular data service (e.g., GPRS or SMS) or satellite communication to send the collected data to the computers at the DMC (Figure 1). Some active tracking systems allow for two-way communication.

Figure 1: A Depiction of GPS‐based Active Tracking System.

A Passive Tracking system does not transmit the location data to a DMC immediately after they are collected; instead, it records the data for future reference. If data are not needed right away, a passive system is usually adopted since it reduces costs.

Passive tracking systems provide a more cost-effective approach to vehicle tracking. In this approach, users who want to review the recorded data need to access the GPS or other tracking device installed in the vehicle and manually download the data via a proper interface. Such systems are more common in transportation planning studies.


2.2 Static and Mobile Vehicle Tracking Methods

Another categorization divides vehicle tracking technologies into two classes, static and mobile, as shown in Figure 2 [5]: “The static includes technologies like camera systems,

transponders and dual‐loop detectors. The mobile includes technologies like GPS and cell

phones. Transponders may figure in both classifications because it has characteristics of both,

yet the need for readers on the road makes it static. All the static technologies are tied to the

road … but they cannot be on any road. Budget limitation would not allow static technologies

on all roads.”

Figure 2: Static and Mobile Vehicle Tracking Technologies.

2.3 Discrete and Continuous Vehicle Tracking Methods

Tracking methods are also categorized into discrete and continuous.

In the continuous method, the location of the tracked vehicle is either stored or reported (as in passive and active methods, respectively) at predefined points of time, usually at regular intervals; the tracking procedure is therefore time-based. For example, in common GPS-based tracking systems the location coordinates are collected at predefined intervals (e.g., one second), and the vehicle can be anywhere on-route or off-route.

In the discrete method, tracking is performed based on the location of the vehicle: the time is collected at various preselected locations if the vehicle appears there, and thus the procedure is location-based. For instance, license plate tracking is a discrete method, since data are not recorded at predefined regular points of time for any specific vehicle; instead, if the vehicle is observed at any preselected location, the time of observation is recorded.

Typically, the volume of data is smaller in the discrete method because the time between two consecutive data records is longer.

Both continuous and discrete methods can be either Active or Passive. However, in the discrete method, the manner of data collection is more diverse. Data can be collected via GPS-based systems, Bluetooth technology, license plate surveys, Radio Frequency Identification (RFID) technology, etc. Data collected from a license plate survey for a group of vehicles are not usually used raw and online; instead, they are recorded for further processing and reduction; thus it is a Passive method.

The categorization into discrete and continuous is not completely binary. The data collected by continuous methods and technologies can be filtered so that only coordinates and time values for specific predefined locations are kept for discrete applications. Moreover, some technologies capable of continuous tracking, such as Bluetooth and RFID, can be set to record data only at specific locations, i.e., when the Bluetooth transmitter device inside a vehicle passes through a gate where the reader is mounted. In these cases there is no need for subsequent data reduction from continuous to discrete.
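The reduction of a continuous record to discrete observations can be sketched in a few lines of Python. The sketch assumes planar coordinates in meters, a track given as (time, x, y) samples, and a fixed detection radius around each preselected location; these assumptions and the function name `to_discrete` are illustrative only.

```python
from math import hypot


def to_discrete(track, checkpoints, radius):
    """Reduce a time-based track to location-based observations.

    track:       iterable of (t, x, y) samples (assumed planar coordinates, m)
    checkpoints: dict mapping a location name to its (x, y) position
    radius:      distance within which the vehicle counts as "at" a location
    Returns the first sample time at which each location was reached.
    """
    first_seen = {}
    for t, x, y in track:
        for name, (cx, cy) in checkpoints.items():
            if name not in first_seen and hypot(x - cx, y - cy) <= radius:
                first_seen[name] = t
    return first_seen


# Hypothetical one-second samples along a straight road.
track = [(0, 0.0, 0.0), (1, 50.0, 0.0), (2, 100.0, 0.0), (3, 150.0, 0.0)]
checkpoints = {"A": (0.0, 0.0), "B": (150.0, 0.0)}
print(to_discrete(track, checkpoints, radius=10.0))  # {'A': 0, 'B': 3}
```

Four time-based samples are reduced to two location-based records, which is the same kind of data a license plate survey at locations A and B would produce.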

Figure 3 depicts this classification as a spectrum. At one end of the spectrum is license plate matching, a completely discrete method; at the other end are GPS technologies.

[Figure 3 content: technologies arranged from more discrete to more continuous: LP matching, voice dispatch, SMS, passive RFID, active RFID, mobile phone, Bluetooth, GPS. Sample discrete applications listed: travel time data collection, toll collection, taxi dispatch management, police, fleet management, bus schedule control. Sample continuous applications listed: logistics, urban logistics (e.g., construction machine management), fleet management, bus schedule control, asset and cargo tracking.]

Figure 3: Discrete‐Continuous Spectrum for Classification of Vehicle Tracking Technologies.


2.3.1 GPS

GPS-based systems are the most widely used systems for vehicle tracking. In such a system the location of the tracked vehicle is calculated by trilateration from the signals received from several satellites at any given point of time. Contact with at least four satellites is required for normal operation. These devices can locate vehicles anywhere that GPS signal coverage exists. Since the signals are received from satellites, the in-vehicle GPS device requires a clear line-of-sight path to the satellites; the wider the sky view, and therefore the higher the number of satellites in contact, the better the system functions. The signal coverage may be too low in some spots to allow proper functionality. For example, some basic GPS receivers cannot operate properly in deep valleys or near tall buildings, where the sky view angle is limited, let alone inside tunnels and underground.

2.3.2 Bluetooth and Radio Frequency Identification (RFID)

RFID is a technology that uses radio waves to transfer data between a reader and an electronic tag attached to an object for the purpose of identification and tracking. There are two types of tags, passive and active. The former does not broadcast a signal by itself, whereas the latter does. This results in different read ranges for the two types of tags. The read range of an active tag is 300 ft. or more. These tags can be used for continuous tracking if the tracking field is limited. The read range of passive tags varies from three to over 20 ft., depending on the wave frequency used, which is practically inadequate for continuous vehicle tracking. RFID tags have been used on tolled highways worldwide. Toll operators use RFID tags to derive volume, speed, travel time and origin-destination data [17, 18].


Bluetooth technology was originally designed as a short-range wireless connectivity solution for personal, portable, and hand-held electronic devices. The Bluetooth radio operates on a license-free, globally available Industrial, Scientific and Medical (ISM) band [13]. The typical working distance of Bluetooth ranges from 10 m to 100 m [14]. The Bluetooth tracking method is relatively new and, judging by the limited references to it in the literature, not widely used. It has been used for tracking construction vehicles in dense urban areas where GPS is limited [15].

For vehicle tracking and travel time data collection purposes, attaching Bluetooth transmitters or RFID tags to targeted vehicles and placing RFID/Bluetooth reader equipment at the required locations provides the same kind of data as the license plate method. However, since the signature is digital, the matching process is much easier, faster, cheaper, and less labor and processing intensive. If the data collection phase is done well, the matching process is nearly perfectly accurate. On the other hand, this method requires the collaboration of the tracked vehicles’ owners, at least to receive the transmitters or tags and keep them in their vehicles. Bluetooth and active RFID systems have more capabilities in vehicle tracking, mostly due to their longer range. Passive RFID has more limited applications and is good for providing entry and exit information. [18]

2.3.3 License Plate

This method is widely used for collection of travel time and duration of stay data for local

applications such as parking studies, corridor studies, etc. Its first phase is observation of

vehicles at specific locations and collecting their license plates. Then the pool of license plates

needs to be matched to identify each vehicle’s travel pattern. Knowing the distance between the locations and the observation times, the travel time and average speed between two points can be calculated. Combining the travel time data for all segments for each vehicle results in vehicle tracking over the monitored network.

There are four basic techniques for collecting and processing license plates:

1) Manual: collecting license plates via pen and paper or audio tape recorders and manually

entering license plates and recorded times into a computer.

2) Portable Computer: collecting license plates in the field using portable computers that

automatically provide an arrival time stamp. There is software that facilitates this process.

3) Video with Manual Transcription: collecting license plates in the field using video

cameras or camcorders and manually transcribing license plates using human observers. This

minimizes field crew size and is required in harsh climate locations.

4) Video with Character Recognition: collecting license plates in the field using video, and

then automatically transcribing license plates and arrival times into a computer using

computerized license plate character recognition. This is the typical type of processing by tolling

authorities for exacting the toll charge or for recording toll paying violators.

The license plate matching method in general, regardless of the applied technique, has the

following advantages:

• Ability to obtain travel times from a large sample of vehicles, which is useful in understanding variability of travel times and destinations among vehicles within the traffic stream.

• Data collection equipment is relatively portable.

The license plate matching method, regardless of the applied technique, has the following

disadvantages:

• Travel time data limited to locations where observation occurs.

• Sampling surveys can achieve only limited geographic coverage on a single day.

• Manual and portable computer‐based methods are less practical for high‐speed freeways or long sections of roadway with a low percentage of through‐traffic.

• Accuracy of license plate reading is an issue for manual and portable computer‐based methods.

• Skilled data collection personnel required for collecting license plates and/or operating electronic equipment. [4]

Table 1 and Table 2 present the FHWA comparison of the different license plate matching techniques. [4]

Table 1. Qualitative Comparison of Travel Time Data Collection of Different Techniques for License Plate Method


Table 2. Travel Time Data Collection of Different Techniques for License Plate Method


CHAPTER 3

LICENSE PLATE MATCHING ERRORS

One of the major issues with license plate matching is its accuracy. The way accuracy is

evaluated depends on the method of data collection and license plate matching.

Automatic License Plate Recognition (ALPR) systems take a snapshot of the whole license plate and extract its set of characters and numbers by using Optical Character Recognition (OCR) algorithms. The performance of the OCR algorithms is critical and is usually the major cause of errors, so the accuracy of the derived characters is usually the focus for this method. Another possible cause of error in ALPR systems is the detection of the license plates before their characters are recognized. ALPR systems need to detect a vehicle first and then take a picture of it together with its license plate; the image‐processing software then needs to locate the license plate in the image, and only after that can the OCR software extract the number.

This process is more challenging for trucks. A survey on I‐40 indicated that only 82% of trucks on that route had their license plates installed in the normal place in the middle of the bumper. Most LPR cameras are aimed at the bumper area [6]. This is a disadvantage of the automatic method because some license plates are missed.

The advantage of license plate recording by a crew is that even if a plate is placed behind the windshield, it can be detected and recorded. Vehicles may be missed by ALPR systems, but this does not influence the ratio of captured license plates to detected vehicles. Therefore, high values for this ratio do not necessarily indicate a good ALPR system. If the recorded license plates are retained in long‐term memory, manual verification can be done later with almost 100% accuracy. However, this is normally possible only for fractions of the data, and excessive manual verification is not in line with the purpose of these systems.

Manual license plate recordation is used for different purposes than ALPR. Its major application is in surveys to collect origin‐destination and travel time data. When license plates are collected manually, usually only a subset of the characters is recorded to expedite the recordation procedure. This increases the chance that non‐identical license plates are matched, because two different license plates can share an identical subset of characters.

The major focus of previous studies conducted on the accuracy of manual license plate

matching is normally on the probability of false matches because of the few characters that are

not recorded. These studies show that statistically reliable estimates of travel parameters can

be obtained without the recording of entire license plate numbers. Makowski found that

although only the last three digits of the license plate numbers were recorded, statistically

reliable values were obtained [7]. The characters that are recorded or in some cases verified

manually are usually considered to be completely correct.

3.1 License Plate Correction ‐ Edit Distance

In both automatic and manual methods of license plate matching, there are typically some

erroneous and unmatched license plates together with some mistakenly matched ones. The

techniques that are used to figure out the possible wrong matches include adding constraints

and using other available information such as calculated speed between the two points of

observation. For example, a license plate observed at two points ten miles apart within two minutes indicates a wrong match.
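The speed constraint described above can be sketched in Python as follows. This is only an illustration: the function names and the 100 mph ceiling are assumptions made for the example, not values taken from this study.

```python
def implied_speed_mph(distance_miles, t_first_s, t_second_s):
    """Average speed implied by two sightings (times in seconds).

    Returns None when the second sighting is not later than the first.
    """
    elapsed_s = t_second_s - t_first_s
    if elapsed_s <= 0:
        return None
    return distance_miles * 3600 / elapsed_s


def is_plausible_match(distance_miles, t_first_s, t_second_s, max_speed_mph=100.0):
    """Flag a match as plausible only if the implied speed is attainable.

    max_speed_mph is a hypothetical threshold chosen for illustration.
    """
    speed = implied_speed_mph(distance_miles, t_first_s, t_second_s)
    return speed is not None and speed <= max_speed_mph


# Ten miles apart within two minutes (120 s) implies 300 mph: a wrong match.
print(implied_speed_mph(10, 0, 120), is_plausible_match(10, 0, 120))
```

A threshold of this kind would need to be tuned to the road type being surveyed.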


When dealing with batches of vehicles moving on the same road, statistical outliers are

sometimes used to filter out wrong matches. There are many ways to define outliers and find

them. For parking lot and duration of stay surveys, using average speed is not appropriate and

the identification of outliers is less feasible. [8, 9]

When wrong matches are found or unmatched license plates need to be retried, Edit Distance algorithms are used to find the nearest possible alternatives. [10] Edit distance, or more specifically the Levenshtein distance, is a metric for measuring the amount of difference between two strings. It is defined as the minimum number of edits needed to transform one string into the other, where the allowable edit operations are insertion, deletion, or substitution of a single character.

When two license plates match in the initial matching procedure, their edit distance is zero. If a character of the license plate is missed during recordation and all other recorded characters are correct, then only one insertion is needed to match the recorded string, and thus the Levenshtein distance equals one.

There is a more specific edit distance for equal length strings called Hamming distance. [16]

It measures the minimum number of substitutions required to change one string into the other,

or the number of errors that transformed one string into the other. For example the Hamming

distance between ‘ABC123’ and ‘DEC143’ is three. The Hamming distance between ‘ABC123’

and ‘CAB123’ is also three.
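Both distances can be computed with a few lines of Python. The sketch below is a generic textbook implementation (not the code used in this study) and reproduces the examples given above.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


print(hamming("ABC123", "DEC143"))     # three substitutions
print(hamming("ABC123", "CAB123"))     # also three
print(levenshtein("AB123", "ABC123"))  # one missed character, one insertion
```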

The literature includes several methods and recommendations for finding the best match, i.e., the one with the smallest Edit Distance, but when only one character is assumed invalid, these methods do not differentiate between the possible matches. For example, the Hamming distances between ‘FNG’ and both ‘EMC’ and ‘OJI’ are three; neither is considered closer to ‘FNG’, but common sense suggests that the ‘OJI’ option is less likely to be a misrecordation of ‘FNG’ than the ‘EMC’ option. [3]

This study assumes that only one of the characters is mistakenly recorded; therefore, the Hamming distance is always one. Higher distances were deemed unlikely; moreover, if license plates with a Hamming distance of two are matched, the probability of false matches increases. Oliveira‐Neto et al. (2009) suggest creating a probability matrix to provide additional help in choosing the better match when the edit distances are equal. [3] This study is a step in this direction and tries to find similar characters that are more likely to be mistaken for each other by a human recorder.

3.2 Human Memory Factor

Human short‐term memory temporarily stores and manages information. Short‐term memory has a span of seven chunks of information, plus or minus two. A chunk is an integrated piece of information [11]; it can be a digit, a letter, a simple shape, etc.

Studies show that the format of these chunks of information has influence on the ability of

the brain to remember them. For the case of license plate recordation it is directly related to

the format of recordation, e.g., last three digits, one letter and two digits in the middle, etc.

Research conducted on the memorability of license plates showed that the more digits and

letters in a license plate are mixed, the more difficult it is to memorize it [12].

Moreover, while recording the license plates if traffic is heavy and the number of vehicles is

large, the license plates of several vehicles need to be memorized, and they can easily surpass

memory capacity resulting in missed vehicles or wrong license plates.


In this study four datasets with three different recordation formats were used to investigate

this issue.


CHAPTER 4

METHODOLOGY

This chapter (i) describes the method for initial license plate matching; (ii) describes the

methodology for creating algorithms that reduce flawed and unmatched data; and, (iii) explains

the approach taken to compare the results from different algorithms. License plates consist of

characters which include letters and numbers. Special characters such as %,$,! are not found in

license plates. The subsequent discussion focuses on characters, with separate discussion on

letter and number matching.

Several correction algorithms were developed. They were used to improve the percentage

of matched license plates by finding those that were mistakenly recorded. Two hypothetical

reasons for the mistakes were considered, and correction algorithms were created to evaluate

them vis‐à‐vis each other. Figure 4 summarizes the correction algorithms that are described in this chapter. The structure of the correction algorithms for license plate letters and digits is very similar; only their targeted characters are different.


Figure 4: Correction Algorithms that are used after Initial Matching. Gray Boxes Show the Main Algorithms.

4.1 License Plate Format

Since all of the data were collected in Hawaii and the majority of the license plates are

composed of three letters followed by three numbers, the focus was on this type of license

plate. The four license plate datasets available for analysis were recorded in different formats, as shown below.

• All three letters and three digits were recorded (ABC123): two sets

• Last letter and all three digits were recorded (C123): one set

• All three letters and the first digit were recorded (ABC1): one set


The latter two schemes were adopted in order to avoid the monitoring of private property.

For vehicles with license plate format other than ABC123, the whole plate was recorded. These

four datasets were used in analyses described in Chapter 5.

4.2 License Plate Matching

In order to match the license plates an algorithm was created with Visual Basic for

Applications (VBA) which utilizes the license plate datasets created in Microsoft Excel.

First, for each data collection station, each dataset was sorted by time of entry (primary sorting) and time of exit (secondary sorting). Although recorded license plates are

automatically sorted by time of entry/exit as they are written down, this sorting is usually not

perfect, particularly when the traffic volume or vehicle speed is high. In these cases data

collectors usually aid each other to minimize missed vehicles. Normally one person reads aloud

the license plate numbers to be written down by the other person, or they both write down

every other license plate number, creating two separate lists. In the latter case the data must

be merged and sorted.

Figure 5 depicts the initial matching process of license plates, where the license plates that

are already identical are matched. Ideally, if all license plates are matched, then no correction

would be necessary, but this is rarely the case when several hundred observations are taken in

the field. The algorithm starts from the first license plate of the entering vehicles (“Ins” list) and

searches for the first license plate among the exited vehicles (“Outs” list) that exactly matches

it. If the exit time is greater than the entry time, then the exited vehicle – whose license plate is

labeled OUT‐LP – is considered to be the same as the entered one – labeled as IN‐LP. After that

the matched OUT‐LP is removed from the original list and is added to the matched list. The process continues until, for every IN‐LP, either a matching OUT‐LP is found or the whole OUT‐LP list has been searched, whichever comes first.
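The initial matching procedure described above can be sketched in Python as follows (the study itself used VBA with Excel). The function name and the record layout, a list of (plate, time) pairs, are assumptions made for this illustration, and so is the choice to skip an identical OUT‐LP whose exit time is not later than the entry time and keep searching.

```python
def initial_match(ins, outs):
    """Match each IN-LP to the first identical OUT-LP with a later time.

    ins, outs: lists of (plate, time) tuples, already sorted by time.
    Returns (matched, unmatched_ins, remaining_outs), where matched holds
    (plate, entry_time, exit_time) tuples.
    """
    remaining = list(outs)  # copy so the caller's list is left untouched
    matched, unmatched = [], []
    for in_plate, in_time in ins:
        for k, (out_plate, out_time) in enumerate(remaining):
            if out_plate == in_plate and out_time > in_time:
                matched.append((in_plate, in_time, out_time))
                del remaining[k]  # a used OUT-LP cannot be matched again
                break
        else:
            # the whole OUT-LP list was searched without success
            unmatched.append((in_plate, in_time))
    return matched, unmatched, remaining


ins = [("ABC123", 10), ("DEF456", 20), ("GHJ789", 30)]
outs = [("DEF456", 15), ("ABC123", 50), ("DEF456", 90), ("XYZ000", 95)]
matched, unmatched, leftover = initial_match(ins, outs)
print(matched)
print(unmatched)
```

In this toy example GHJ789 stays unmatched and would be passed on to the correction algorithms.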

Next, the percentage of matched license plates is calculated and IN‐LPs for which no match

is found are saved in the “Unmatched IN‐LPs” list to be processed by the correction algorithms.

For each matched license plate the duration of stay is calculated, and then the average,

standard deviation, and the median of duration of stay for all IN‐LPs is computed.
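As a small illustration of these summary statistics, the following Python sketch uses the standard library. The record layout and the use of the sample standard deviation are assumptions made here; the study does not state which form of the standard deviation was used.

```python
from statistics import mean, median, stdev


def stay_statistics(matched):
    """Summarize durations of stay for matched plates.

    matched: list of (plate, entry_time, exit_time) tuples, with times in
    the same unit (e.g. minutes). Needs at least two records for stdev.
    """
    durations = [exit_t - entry_t for _, entry_t, exit_t in matched]
    return {
        "mean": mean(durations),
        "stdev": stdev(durations),  # sample standard deviation (assumed)
        "median": median(durations),
    }


def matched_percentage(n_matched, n_ins):
    """Share of entering vehicles for which a match was found."""
    return 100.0 * n_matched / n_ins


sample = [("ABC123", 0, 40), ("DEF456", 10, 70), ("GHJ789", 20, 100)]
print(stay_statistics(sample))
print(matched_percentage(len(sample), 4))
```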


Figure 5: Flowchart of Initial License Plate Matching Procedure.


4.3 Processing of Unmatched Data

Two types of error are considered as the possible cause for unmatched license plates.

Hypothesized reason for errors | Suitable algorithm for correction

Random error in character recognition or recordation | Full correction algorithm

Misreading character due to its similarity to another one | Algorithm for correction of similar characters

For random misspelling of the characters, it is assumed that when the data collector was recognizing or recording the license plate, one letter or digit was randomly wrong. It is also assumed that a letter could only be wrongly substituted for another letter, and a number for another number; errors that mix numbers and letters are assumed not to occur. In other words, B may be noted instead of T, and 2 instead of 4, but not 6 instead of E. This assumption is “safe” in Hawaii, where the typical license plate format is ABC 123, but less so in states whose license plates use a mixed string of letters and numbers with no spaces, e.g., 1ABC234 in California.
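Under this assumption, each letter has 25 possible replacements and each digit has 9. A minimal Python sketch of generating every single‐character candidate for a recorded plate is given below; the function name is hypothetical and the code is not taken from the study's VBA implementation.

```python
import string


def single_substitutions(plate):
    """Yield (position, replacement, candidate) for every plate that differs
    from the input in exactly one character, replacing letters only with
    other letters and digits only with other digits."""
    for pos, ch in enumerate(plate):
        if ch in string.ascii_uppercase:
            pool = string.ascii_uppercase
        elif ch in string.digits:
            pool = string.digits
        else:
            continue  # leave spaces or other symbols untouched
        for repl in pool:
            if repl != ch:
                yield pos, repl, plate[:pos] + repl + plate[pos + 1:]


candidates = [cand for _, _, cand in single_substitutions("ABC123")]
print(len(candidates))  # 3 letters x 25 + 3 digits x 9
```

For an ABC123 plate this gives 3 × 25 + 3 × 9 = 102 candidates, each of which has to be compared against the list of exiting plates.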

After the unmatched license plates were separated from those matched by the initial license plate matching algorithm, the “Full Correction Algorithms” we developed were applied to find matches for mistakenly recorded license plates. These algorithms found the characters that were mistakenly recorded and the correct characters that resulted in matched license plates. The characters were retained for all of the new matches and were analyzed to see whether a considerable number of mistaken records were caused by visual similarity among characters. If so, a much faster “similar algorithm”, developed for this purpose, can be used instead of the computationally intensive (slow) “full correction algorithm” for matching unmatched license plates. The two algorithms require markedly different volumes of computation. However, the full algorithms technically give more complete results since they search both similar and dissimilar characters.


4.3.1 Full Correction Algorithm

Figure 6 and Figure 7 depict the matching process of unmatched license plates by “Full

Correction Algorithms” that search through all characters – whether similar or dissimilar.

Unconstrained algorithms do not use additional information with the license plate numbers.

Constrained algorithms use information such as vehicle classification to filter out incorrect

matches. Figure 6 shows the unconstrained algorithms and Figure 7 shows the constrained

ones. The full correction algorithms comprise five algorithms, A through E, as explained below.


Figure 6: Unconstrained Full Correction Algorithms.


Figure 7: Constrained Full Correction Algorithms.


While performing the character substitutions it is never known which character in the

license plate is misrecorded, so the algorithm needs to check all possibilities. One algorithm

does this for the letters and another does this for the digits as shown in Figure 4.

The five algorithms displayed in Figure 6 and Figure 7 were formed as follows:

A: Repeated matches are excluded

B: Repeated matches are included

C: Used (matched) OUT‐LPs are retained

D: The OUT‐LP that yields the closest duration of stay to the median is used

E: The OUT‐LP that yields the closest duration of stay to the median AND has the same vehicle class as the IN‐LP is used

Algorithm A does not count repeated matches for a given IN‐LP and the first match found in

the OUT‐LP list is used. After that, searching stops and the matched OUT‐LP is removed from

the list to avoid it being matched again with another IN‐LP. This is a fast algorithm.

Algorithm B continues the search among all of the OUT‐LPs. It can find more than one match for each IN‐LP, but all of the matched OUT‐LPs are removed from the list so that they are not matched again with other IN‐LPs later. This is a slower algorithm.

Algorithm C operates similarly to algorithm B, except that the matched OUT‐LPs are retained in the list so that they can be matched again. When a match is found for the IN‐LP, it is not considered the only correct one. Instead, the search continues until all of the characters (letters or digits) have been replaced with all of the possible replacement characters, and all of the OUT‐LPs have been checked for each of the replacements in each IN‐LP. There are 9 possible replacements if the character is a digit and 25 if it is a letter. After each replacement is performed, the OUT‐LP list is searched to see if any of its entries match the modified IN‐LP. If no match is found, the current character in the IN‐LP is replaced with the next possible replacement. For instance, if the character X in XBC123 is currently being replaced and no match is found for it, in the next step it is replaced by Y and the OUT‐LP list is searched again. If no match is found after replacing the second letter with all 25 alternatives, then the same process is repeated for the next (third) letter, C in this example. This is the slowest algorithm and may result in one OUT‐LP matching several IN‐LPs.

Algorithm D operates similarly to algorithm C, but if an IN‐LP has more than one possible

match, then the one that yields a duration of stay closer to the median duration of stay

(calculated based on the whole matched data in the initial matching phase) is selected as the

correct match. This is a more accurate algorithm.

Algorithm E operates similarly to algorithm D, but in addition to checking the duration of stay, it verifies that the vehicle classes of the pair match as well. This algorithm protects against gross errors in license plate matching.
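The selection rule of algorithms D and E can be sketched in Python as follows. The function name, the record layout (plate, time, vehicle class), and the class labels are hypothetical; the candidate list is assumed to hold the OUT‐LPs that already match the IN‐LP after a single‐character substitution.

```python
def pick_best(in_record, candidates, median_stay, require_same_class=False):
    """Choose among several candidate OUT-LPs for one IN-LP.

    in_record:  (plate, entry_time, vehicle_class)
    candidates: list of (plate, exit_time, vehicle_class)
    With require_same_class=False this follows the rule of algorithm D;
    with True it adds the vehicle-class constraint of algorithm E.
    Returns the chosen candidate, or None if nothing qualifies.
    """
    _, in_time, in_class = in_record
    best_gap, best = None, None
    for cand in candidates:
        _, out_time, out_class = cand
        if out_time <= in_time:
            continue  # a vehicle cannot exit before it enters
        if require_same_class and out_class != in_class:
            continue
        gap = abs((out_time - in_time) - median_stay)
        if best_gap is None or gap < best_gap:
            best_gap, best = gap, cand
    return best


in_lp = ("ABC123", 100, "car")
options = [("ABD123", 400, "truck"), ("ABC128", 700, "car")]
print(pick_best(in_lp, options, median_stay=250))
print(pick_best(in_lp, options, median_stay=250, require_same_class=True))
```

In this toy case the two rules disagree: the truck record is closer to the median duration of stay, but only the car record passes the class check.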

If the final results from these algorithms are in agreement with each other, then the

random error for the whole correction procedure is low. If each of the algorithms finds

different matches for a considerable portion of the IN‐LPs and the derived statistics differ

substantially, then the random error is substantial.

4.3.2 Algorithm for Correction of Similar Characters (“Similar Algorithm”)

Figure 8 depicts the matching process of license plates if only similar characters are

searched. Again it is assumed that a letter could only be mistaken with another letter, and a


number with another number. This algorithm is suitable if misreading of the characters is

assumed to be mostly due to similarity among characters.

Table 3 and Table 4 show the likely similarities. The similarity between letters and digits was initially decided upon based on common sense and a list of higher frequency mistakes in Automatic License Plate Recognition (ALPR) systems [6]. In ALPR systems, since OCR software “translates” the picture of the license plate into data, visual similarity between characters is important, and visually similar characters are usually confused at higher frequencies. Therefore, careful use of the higher frequency mistakes of those systems was considered applicable to this study. No applicable reference from the field of psychology could be found in the literature.

After processing the unmatched data for each dataset by the full algorithms, a table is

created that shows which letters or digits have been possibly misrecorded, and the substituting

character that yields a matched license plate.

Table 5 shows a sample for letters. For instance, number “4” at the intersection of row E

and column F in the table indicates that in four cases by converting letter “E” in the unmatched

license plates list to “F” a match was found. It indicates that possibly letter F was wrongly read

as E.
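A table of this kind can be tallied from the corrected pairs. The Python sketch below is illustrative only: the function name and the sample plates are made up, and each pair is assumed to hold the recorded (unmatched) plate and the plate it was matched to after correction.

```python
from collections import Counter


def tally_substitutions(corrections):
    """Count (misrecorded character, substituting character) pairs.

    corrections: iterable of (recorded_plate, matched_plate) pairs.
    Pairs that do not differ in exactly one position are ignored.
    """
    counts = Counter()
    for recorded, matched in corrections:
        if len(recorded) != len(matched):
            continue
        diffs = [(r, m) for r, m in zip(recorded, matched) if r != m]
        if len(diffs) == 1:
            counts[diffs[0]] += 1
    return counts


# Made-up example pairs; row E / column F would read 2 here.
pairs = [("ABE123", "ABF123"), ("EBC123", "FBC123"),
         ("ABC123", "ABC128"), ("ABC123", "XYZ123")]
matrix = tally_substitutions(pairs)
print(matrix[("E", "F")], matrix[("3", "8")])
```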

Table 3. Similar Letters.

Reference Letter

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

R G P F E C N J I I N M Q R O B I V U M

R T W W D F N

L H P

Similar Letters


Table 4. Similar Digits.

Table 5. Sample of Substituted Letters Using the Full Correction Algorithm C.

Again, the place of the misrecorded letter or digit is never known in the license plate and

the algorithms check all possibilities. However, the number of possibilities is lower, as the

substitution is done only for similar characters. This significantly improves computing

performance for large databases.
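The reduction in the number of candidates can be illustrated with a short Python sketch. The similarity map below is a small placeholder invented for this example; it is not the content of Table 3 or Table 4, and the function name is hypothetical.

```python
# Placeholder similarity map (illustrative only, NOT Table 3 / Table 4).
SIMILAR = {
    "E": "F", "F": "E", "O": "Q", "Q": "O",
    "1": "7", "7": "1", "3": "8", "8": "3",
}


def similar_substitutions(plate, similar=SIMILAR):
    """Yield plates obtained by replacing one character with a character
    listed as similar to it; characters without an entry are skipped."""
    for pos, ch in enumerate(plate):
        for repl in similar.get(ch, ""):
            yield plate[:pos] + repl + plate[pos + 1:]


reduced = list(similar_substitutions("AFE138"))
print(reduced)
print(len(reduced))  # far fewer than the 102 candidates of a full search
```

With this placeholder map a six‐character plate yields only five candidates, compared with 102 when every letter and digit is tried.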

Reference Digit

1 2 3 4 5 6 7 8 9 0

7 3 2 6 5 1 3 5 6

8 8 8 5 6 9

9 9 6 8

0 9 0

Similar Digits

A B C D E F G H I J K L M N O P Q R S T U V W X Y ZA 1 1 1 1B 2 1 1 1 2C 3 1 1 1 1 1 1 1D 1 1 2 2 1 9 1 1 1E 1 4 1 2F 1 2 1 1 1 1G 1 1 1 1H 4IJ 1 1 1 1 2 2 1 2 1 1KLM 2N 2 2 1 4 3 1 1 1 1O 2 2P 6 2 1 3 1QR 2 1 1 9 1 1S 1 1T 1 1 1 1 1 1U 1 2 1V 1 1 2 1 1W 1 1X 1 1 3 1Y 1 3 1 4Z 1 2

Rows: Substituted Letter (Misrecorded Letter)

Columns: Substituting Letter Resulted in Matching


Figure 8: Correction Algorithm for Similar Letters and Digits.


4.4 Discussion on Processing Results

The results are compared based on the following criteria:

1) Volume of computations and processing speed: The duration of the correction process is measured in minutes and seconds, under the same processing conditions, to see how much time a similar character correction algorithm can save. The same comparison is also made for the different data recordation styles (ABC123, C123, and ABC1). Finally, the correction time is compared to the initial matching time, to see whether the extra time spent on correction is substantial within the whole license plate matching process.

2) Impact on match percentage: For each dataset the numbers of unmatched license plates are compared before and after the correction algorithms are applied. These are also compared among the different algorithms. Based on these results, the contribution of each algorithm is calculated as the percentage reduction in unmatched license plates.

3) Impact on statistics: For each dataset the average duration between two successive

observations of vehicles and the standard deviation of this are compared, before and after the

correction algorithms are applied. Higher percentages indicate greater importance of the

algorithms for traffic surveys.

The results for the full algorithms are also compared with each other. Here, the assumption is that the results should be close to each other, meaning that the matches found depend neither on the direction in which the algorithms read and match the data, nor on the way each algorithm deals with previously matched license plates. Theoretically, if the correct matches are found by the correction algorithms, the statistical results from all three algorithms should be the same.


CHAPTER 5

DATA COLLECTION AND ANALYSIS

The first section of this chapter describes the collection procedures and the specifications of

the four sets of data, which were analyzed using the algorithms developed as part of this

research.

The results from the analyses are interpreted in the next sections of this chapter. Some of

the analyses are done for each dataset individually in order to evaluate the influence of data

collection format (e.g., all letters and digits recorded); some are done for every algorithm in

each dataset, and some analyses are done with the aggregate data from all datasets.

5.1 Data Collection

5.1.1 Dataset 1 (ABC1): ITE

This dataset was collected in the format of three letters and one digit (ABC1) at the entrance of Waipio Peninsula Soccer Complex. The data were needed for a parking analysis study for the Institute of Transportation Engineers (ITE).

Data collection started at 7:00 AM and continued until 7:00 PM on Saturday, January 29,

2011. Because of the fairly long duration of data collection, it could not be done continuously

by one person. Therefore, both entering and exiting vehicles datasets were collected by two

people; one person collecting data from 7:00 AM to 12:00 PM, and one from 12:00 PM to 7:00

PM. Four people were involved: three with good to perfect vision, and one wearing glasses who collected the exiting vehicles from 12:00 PM to 7:00 PM. The sunset time at the location was 6:20 PM.

Therefore, the final 40 minutes of data collection was done in comparatively dimmer light.

However, the volume of the collected data during this period was very small compared to the

whole dataset size. One vehicle entered the park and 30 vehicles exited, out of which 20 were

correctly recorded and matched in the initial matching process. Therefore, no significant impact

may be attributed to this issue.

The distance of the data collectors from the edge of the road (as shown in Figure 9) was approximately 5 to 10 ft, close enough to enable them to read the license plates conveniently while remaining reasonably safe.

Figure 9: Data Collection at Waipio Peninsula Soccer Complex

2182 vehicles were recorded entering the Park and 2206 vehicles exiting. 435 (19.9%) of the

entered vehicles could not be matched initially.


However, the maximum number of matched vehicles cannot exceed the minimum of all

entered vehicles and all exited vehicles:

Maximum that can possibly be matched = Min (Entered, Exited)

For this dataset, since the number of entered vehicles is smaller than the number of exited vehicles, this adjustment does not change the result.
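The calculation can be written out in a few lines of Python using the Dataset 1 counts reported above. The function name is hypothetical, and the number of matched vehicles is derived here as entered minus initially unmatched.

```python
def match_percentage(entered, exited, matched):
    """Matched share relative to the maximum that can possibly be matched,
    i.e. Min(Entered, Exited)."""
    max_possible = min(entered, exited)
    return 100.0 * matched / max_possible


entered, exited, unmatched_in = 2182, 2206, 435  # Dataset 1 counts
matched = entered - unmatched_in
rate = match_percentage(entered, exited, matched)
print(round(rate, 1), round(100.0 - rate, 1))  # matched %, unmatched %
```

Because Min(2182, 2206) is the number of entered vehicles, the unmatched share stays at about 19.9%.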

Table 6 shows the summary specifications of this dataset.

5.1.2 Dataset 2 (C123): HAVO 2009

This dataset was collected in the format of one letter and three digits (C123) at the entrance

of Hawaii Volcanoes National Park (HAVO) in 2009. The data were needed for parking analysis.

Data collection started at 10:00 AM and continued until 4:30 PM on Monday, August 17,

2009. Entering and exiting vehicles were each recorded by one data collector, one with good vision and one wearing glasses. For short periods of time (around 10‐15 minutes) a third person substituted for one of the data collectors.

The distance of the data collectors from the edge of the road (as shown in Figure 10) was approximately 5 to 10 ft, close enough to enable them to read the license plates conveniently while remaining reasonably safe.

797 vehicles were recorded entering the Park and 751 vehicles exiting. 423 of the entered vehicles (56.3% of the maximum possible, 423÷Min(797, 751)) could be matched initially and 43.7% could not.


Figure 10: Data Collection at the Entrance of Hawaii Volcanoes National Park (2009)


5.1.3 Dataset 3 (ABC123): HAVO 2007–1

This dataset was collected in the format of three letters and three digits (ABC123) at the

entrance of Hawaii Volcanoes National Park (HAVO) in 2007. The data were needed for parking

analysis.

Data collection started at 10:00 AM and continued until 3:00 PM on Saturday, August 11,

2007. Entering vehicles were recorded by two data collectors together and exiting vehicles

were recorded by one collector during the five hours of data collection. The person who

collected the exiting vehicles wore glasses.


The exact distance of the data collectors from the road is not known for this dataset, but it

is estimated to be close to that of Dataset 2: between 5 and 10 ft.



couldn’t.


5.1.4 Dataset 4 (ABC123): HAVO 2007–2

This dataset was collected in the format of three letters and three digits (ABC123) at the

entrance of Hawaii Volcanoes National Park (HAVO) in 2007. The data were needed for parking

analysis.

Data collection started at 10:00 AM and continued until 3:00 PM on Sunday, August 12,

2007. Entering vehicles were recorded by two data collectors together and exiting vehicles

were recorded by one collector during the five hours of data collection. The person who

collected the exiting vehicles wore glasses.

The exact distance of the data collectors from the road is not known for this dataset, but it

is estimated to be close to that of Dataset 2: between 5 and 10 ft.



couldn’t.



Table 6. Data Collection Specifications

5.2 Individual Analyses

5.2.1 Algorithm Comparison

This analysis was done to investigate the variability of the results when our algorithms are

applied to the same set. The indices of comparison are Difference Percentage, Root‐Mean‐

Square Deviation (RMSD), Coefficient of Variation of Root‐Mean‐Square Deviation (CV(RMSD)),

and Normalized Root‐Mean‐Square Deviation (NRMSD).

The percentage of difference among the algorithms is calculated based on the following

formula:

\[
\text{Difference Percentage} = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J}\left|M^{X}_{ij}-M^{E}_{ij}\right|}{\sum_{i=1}^{I}\sum_{j=1}^{J}M^{E}_{ij}}\times 100 \qquad \text{(Equation 1)}
\]

Where,

$M^{X}_{ij}$ is the count of misreadings of the ith character as the jth character in the Mistakes Matrix generated by Algorithm X. I and J are the dimensions of the Mistakes Matrix: 26 for letters and 10 for numbers. Algorithm E is the Reference Algorithm, considered to be the most comprehensive and accurate because it is the most constrained.

The other indices are calculated as follows:

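\[
\text{RMSD} = \sqrt{\frac{\sum_{i=1}^{I}\sum_{j=1}^{J}\left(M^{X}_{ij}-M^{E}_{ij}\right)^{2}}{\sum_{i=1}^{I}\sum_{j=1}^{J}M^{E}_{ij}}}\times 100 \qquad \text{(Equation 2)}
\]

\[
\text{CV(RMSD)} = \frac{\text{RMSD}}{\left(\sum_{i=1}^{I}\sum_{j=1}^{J}M^{E}_{ij}\right)/\left(I\times J\right)} \qquad \text{(Equation 3)}
\]

\[
\text{NRMSD} = \frac{\text{RMSD}}{M^{E}_{\max}-M^{E}_{\min}} \qquad \text{(Equation 4)}
\]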
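As a rough illustration of how these four indices can be computed, the sketch below compares two small, made-up matrices. It assumes that the Difference Percentage is the sum of absolute cell differences divided by the sum of the reference matrix, that RMSD uses the same denominator under a square root, that CV(RMSD) divides RMSD by the mean cell value of the reference matrix over all I x J cells, and that NRMSD divides RMSD by the range of the reference matrix. The function name `deviation_indices` and the toy matrices are hypothetical and are not taken from the thesis.

```python
import math


def deviation_indices(m_x, m_e):
    """Compare the Mistakes Matrix of Algorithm X with that of reference Algorithm E.

    Both arguments are lists of rows holding misreading counts.
    All results are fractions; multiply by 100 for percentages.
    """
    cells_x = [value for row in m_x for value in row]
    cells_e = [value for row in m_e for value in row]
    total_e = sum(cells_e)

    abs_diff = sum(abs(x - e) for x, e in zip(cells_x, cells_e))
    sq_diff = sum((x - e) ** 2 for x, e in zip(cells_x, cells_e))

    difference = abs_diff / total_e
    rmsd = math.sqrt(sq_diff / total_e)
    mean_e = total_e / len(cells_e)  # mean over all I x J cells (assumed)
    cv_rmsd = rmsd / mean_e
    nrmsd = rmsd / (max(cells_e) - min(cells_e))
    return {"difference": difference, "rmsd": rmsd,
            "cv_rmsd": cv_rmsd, "nrmsd": nrmsd}


# Toy 2 x 2 matrices, for illustration only.
reference = [[0, 2], [1, 0]]
candidate = [[0, 1], [1, 1]]
for name, value in deviation_indices(candidate, reference).items():
    print(f"{name}: {value:.1%}")
```

Identical matrices give zero for all four indices, which is the pattern reported wherever Algorithms D and E produce the same Mistakes Matrix.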

5.2.2 Evaluation of Impact of Similarity on Errors

It is assumed that, if there is no specific cause for bias, the mistakenly recorded characters should be uniformly distributed in the Mistakes Matrices. For example, consider the 10x10 Mistakes Matrix for the digits, which holds 90 possible cases of digit misreading (the diagonal is blank because it corresponds to correct readings). With a dataset of 900 mistakenly recorded characters, we would expect a frequency of 10 for each possible case in the matrix, and the matrix should therefore be similar to Table 7. However, if the resultant matrix is considerably far from uniformity, this indicates a bias and therefore a cause. In this research, having noticed that the Mistakes Matrices are not uniform, it was assumed that similarity between characters is probably the cause and that similar characters are more likely to be mistakenly recorded.

Table 7. Uniform Mistakes Matrix for 900 Mis‐recorded Numbers

     1   2   3   4   5   6   7   8   9   0
1    -  10  10  10  10  10  10  10  10  10
2   10   -  10  10  10  10  10  10  10  10
3   10  10   -  10  10  10  10  10  10  10
4   10  10  10   -  10  10  10  10  10  10
5   10  10  10  10   -  10  10  10  10  10
6   10  10  10  10  10   -  10  10  10  10
7   10  10  10  10  10  10   -  10  10  10
8   10  10  10  10  10  10  10   -  10  10
9   10  10  10  10  10  10  10  10   -  10
0   10  10  10  10  10  10  10  10  10   -

To test this hypothesis, the ratio of similar cases (shown as yellow cells in Table 8 and Table 9) to all cases of mistakes was calculated. For the letters, 30 cases (4.6%) among the 26x26‐26=650 cases of mistake in the Mistakes Matrix, and for the digits, 22 cases (24.4%) among the 10x10‐10=90 cases of mistake, were considered to be mistaken recordings of similar characters. If the effect of similarity were insignificant, around 4.6% of the mistake counts would occur in yellow cells for letters; however, the analyses show that in practice this was not the case, and yellow cells typically contained several times that percentage. For the digits the difference was smaller and less significant.

In order to check the significance of this difference between the practical results and the assumed case of uniformity (similarity insignificance), a Chi‐Square (χ2) test was performed as follows. The null and alternative hypotheses were defined as:

H0: Similarity has an insignificant effect and observations are distributed by chance.

Ha: Similarity has a significant effect and observations are skewed because of it.

\[
\chi^{2} = \sum \frac{(O-E)^{2}}{E} = \frac{(O_{S}-E_{S})^{2}}{E_{S}} + \frac{(O_{D}-E_{D})^{2}}{E_{D}} \qquad \text{(Equation 5)}
\]

Where,

ES = Expected value for similar cases (yellow cells in Mistakes Matrices) = percentage of similar cases¹ x data count²

OS = Observed value for similar cases = ∑ (similar cases) = ∑ (yellow cells)

ED = Expected value for dissimilar cases (white cells in Mistakes Matrices) = data count ‐ ES

OD = Observed value for dissimilar cases = ∑ (dissimilar cases) = ∑ (white cells) = data count ‐ OS

Based on the calculated χ2 value, and for one degree of freedom, the probability value (P‐value) for acceptance of the null hypothesis was calculated.
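The test described above can be sketched in a few lines of Python. This is a minimal sketch using only the standard library; the function name `similarity_chi_square` is hypothetical, and the P‐value is obtained from the closed form for a χ2 distribution with one degree of freedom, P = erfc(sqrt(χ2/2)), which is an implementation choice and not something stated in the text. The example uses the digit counts reported for Dataset 1 (26 similar mistakes out of 72 in total, with 22 of the 90 cells defined as similar).

```python
import math


def similarity_chi_square(observed_similar, data_count, similar_share):
    """Chi-square test of similar vs. dissimilar mistake counts (1 degree of freedom).

    observed_similar: sum of the yellow (similar) cells
    data_count:       sum of all cells in the Mistakes Matrix
    similar_share:    fraction of cells defined as similar, e.g. 30/650 or 22/90
    """
    expected_similar = similar_share * data_count
    expected_dissimilar = data_count - expected_similar
    observed_dissimilar = data_count - observed_similar

    chi2 = ((observed_similar - expected_similar) ** 2 / expected_similar
            + (observed_dissimilar - expected_dissimilar) ** 2 / expected_dissimilar)
    # Upper-tail probability of a chi-square variable with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p_value = similarity_chi_square(26, 72, 22 / 90)
print(f"chi-square = {chi2:.3f}, P-value = {p_value:.3f}")
```

With these inputs the sketch should give values close to the χ2 of 5.306 and the P‐value of 0.021 reported for the digits of Dataset 1.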

1 Percentage of similar cases is 4.6% for letters, and 24.4% for digits.

2 Data count in these formulae is the total number of mistakes in recordation, which is the sum of all cells in the Mistakes Matrix.

Table 8. Hypothesized Mistakes Matrix for Letters. Yellow Cells are the Intersection of Similar Letters.

[Blank 26 x 26 matrix with rows and columns A to Z; the cell shading is not reproduced here.]

Table 9. Hypothesized Mistakes Matrix for Digits. Yellow Cells are the Intersection of Similar Digits.

[Blank 10 x 10 matrix with rows and columns 1 to 9 and 0; the cell shading is not reproduced here.]

In the following section the results of the main correction algorithms for each dataset are analyzed. In the interest of conciseness, only the Mistakes Matrices created by the main algorithm (Algorithm E) are shown in this section; the rest of the matrices can be found in the Appendix.

Under each Mistakes Matrix, its summary is shown in a table on the left which shows the

count of the non‐empty cells and the summation of the cells for both similar data (yellow cells)

and the whole table. Their ratio is also shown in terms of percentage. On another table on the

right, the results of the χ2 test including the P‐value for acceptance of null hypothesis are

shown.


5.2.3 Analyses of each Dataset

5.2.3.1 Dataset 1 (ABC1): ITE

Table 10. Letters Mistakes Matrix for Dataset 1, by Algorithm E

The null hypothesis is clearly rejected: similar letters are mistaken significantly more often than expected under uniformity (67÷9=7.4 times the expected count).


A 1 1 1 1 1

B 1 1 1 3

C 1 1 1 2 1 2

D 2 3 1 1 12 2

E 1 5 1 1 3

F 1 3 1 1 1 1

G 1 2 1 1 1

H 4

I

J 2 2 2 2 1 1 5 1

K

L

M 1

N 2 1 2 3 1 1 1

O 2 1

P 1 8 2 2 7 1 1

Q

R 2 1 1 1 1 1 8 1 1 2

S 3 1 1 1

T 2 2 1 1 1 1

U 1 1 1 2 1

V 2 1 1 2 1 2 1

W 1 1 1 1

X 1 1 1 2

Y 3 1 4 1 5

Z 1 2

                      Count    Sum
All Data                110    197
Similar Data             21     67
Similar to All Ratio  19.1%  34.0%

             Similar  Dissimilar
Observed          67         130
Expected           9         188
Chi-Square   386.652
P-Value        0.000

Comparison of the algorithms

Table 11. Ratio of Similar Character Misreadings to Total Count of Misreadings

Algorithm          A      B      C      D      E
Count (All)      164    214    339    197    197
Count (Sim)       56     53     79     67     67
Sim/All Ratio  34.1%  24.8%  23.3%  34.0%  34.0%

Table 11 indicates that the counts of mistakes found by the five algorithms are not close for this dataset, except for Algorithms D and E, the most comprehensive and accurate algorithms, whose counts are exactly the same. The large difference between the results of the algorithms is a negative sign: it indicates that the number of false matches is probably not negligible for some algorithms, in particular for Algorithms A, B and C, which are not constrained. The larger difference between the results of the algorithms, compared to the case of the correction for the digits, indicates a greater number of false matches and a somewhat poorer performance of the algorithms. However, the Sim/All Ratio is still fairly close for three of the algorithms: A, D and E.

Table 12. Deviation of the Mistake Matrices by Algorithms A to D based on E

Algorithm          A       B       C      D
% Difference   55.3%   68.5%   74.1%   0.0%
RMSD           88.7%  101.0%  127.8%   0.0%
CV(RMSD)      304.4%  346.6%  438.7%   0.0%
NRMSD           7.4%    8.4%   10.7%   0.0%

Table 12 shows that the Mistakes Matrices created by Algorithms D and E are exactly the same. This shows that checking the classes of the vehicles being matched did not add anything to the process: the vehicles matched by Algorithm D already agreed in class. The percentage of difference and the other indices for Algorithms A to C are higher than those for the digits (discussed in the next part), which indicates more false matches for letters.

Table 13. Numbers Mistakes Matrix for Dataset 1, by Algorithm E

The null hypothesis is rejected at the 5% significance level. Similar digits have a higher

frequency of being mistaken, but they are only about 1.5 times as frequent as the rest of the

cases.

1 2 3 4 5 6 7 8 9 0

1 3 3 1

2 1 4 2 1

3 2 1 2 1 1 1 1

4 2 1 1 1

5 1 1

6 2 1 1 2 4 2

7 3 1 2 2 1 2

8 1 2 1 2 2 2 1

9 2 1 2 1 1

0

                      Count    Sum
All Data                 44     72
Similar Data             12     26
Similar to All Ratio  27.3%  36.1%

             Similar  Dissimilar
Observed          26          46
Expected          18          54
Chi-Square     5.306
P-Value        0.021



Table 14 outcomes indicate that the counts of mistakes found by the five algorithms are

closer compared to the case of the correction for the letters. Again, the counts of found

mistakes by algorithms D and E are exactly the same. The smaller difference between the

results of the algorithms compared to the case of the correction for the letters and specifically

the fairly close Sim/All ratios can indicate a smaller number of false matches and a fairly better

performance for algorithms.


Similar to the case for letters, Table 15 shows that the Mistakes Matrices created by algorithms D and E are exactly the same, indicating that checking for the classes of vehicles being matched did not add anything to the process. No general judgment about the accuracy of the algorithms can be made based on this dataset. For the letters, algorithm C performed least accurately among the unconstrained algorithms, but for the digits it performed better. Percentages of difference indicate that the majority of the cells in Mistake Matrices created by these four algorithms are identical to those of E.

Algorithm          A      B      C      D      E
Count (All)       61     67     84     72     72
Count (Sim)       21     25     31     26     26
Sim/All Ratio  34.4%  37.3%  36.9%  36.1%  36.1%

Algorithm          A      B      C      D
% Difference   34.7%  31.9%  16.7%   0.0%
RMSD           69.7%  65.6%  47.1%   0.0%
CV(RMSD)       96.8%  91.1%  65.5%   0.0%
NRMSD          17.4%  16.4%  11.8%   0.0%

5.2.3.2 Dataset 2 (C123): HAVO 2009


The null hypothesis is not rejected. Similar letters do not have a significantly higher

frequency of being mistaken.


A 1 1 1 1

B 1 1 1

C 1 1 1

D 1 1 1

E 1 1

F 1 1 1

G 1

H 1

I 1

J 1 1

K 1 1

L

M 1

N 1 1

O 1 1

P 1

Q

R 1

S 1

T 1 1 1 1

U 1 1 1

V 1

W 1

X 3 1

Y 2 1

Z 1 1 1 1 1 1 1

                      Count    Sum
All Data                 53     56
Similar Data              3      3
Similar to All Ratio   5.7%   5.4%

             Similar  Dissimilar
Observed           3          53
Expected           3          53
Chi-Square     0.070
P-Value        0.791




fairly close and probably the false matches are not so numerous. The sim/all ratio is very low, suggesting that similarity has an insignificant impact on the mistakes count. However, the difference between the results from algorithms D and E is considerable. Although a four‐class scheme was used to classify vehicles for this dataset and a great majority of vehicles were personal vehicles, considering the class of vehicles while matching them nonetheless seems to be beneficial to some extent. If there is a good spread among classes, this constraint can filter out a greater portion of wrong matches.


The difference percentages in Table 18 show that the Mistakes Matrices produced by the algorithms are fairly close. They are lower than those for the digits, which indicates fewer false matches for letters. CV(RMSD) is the second highest among all datasets while RMSD is not so big; the reason is a small denominator $\sum_{i}\sum_{j} M^{E}_{ij}/(I \times J)$, which is caused by the sparseness of the Mistakes Matrix created by Algorithm E. This becomes obvious by observing Table 16, which is mostly filled with “ones” and therefore results in a large CV(RMSD) value. It indicates that the data are not enough for judgment between the algorithms.

Algorithm          A      B      C      D      E
Count (All)       58     62     76     71     56
Count (Sim)        3      3      3      4      3
Sim/All Ratio   5.2%   4.8%   3.9%   5.6%   5.4%

Algorithm          A       B       C       D
% Difference   53.6%   50.0%   42.9%   26.8%
RMSD           73.2%   70.7%   65.5%   51.8%
CV(RMSD)      883.5%  853.6%  790.3%  624.8%
NRMSD          24.4%   23.6%   21.8%   17.3%


Null hypothesis is not rejected and similar digits do not have a significantly higher frequency

of being mistaken.

1 2 3 4 5 6 7 8 9 0

1 1 1 1

2 1 1 1 1

3 3

4 1 2 1 1 2 1

5 1 2 1 1 2

6 2 1 1 3

7 1 1 1 2

8 1 1 1

9 1 1 1

0

                      Count    Sum
All Data                 33     43
Similar Data              7     12
Similar to All Ratio  21.2%  27.9%

             Similar  Dissimilar
Observed          12          31
Expected          11          32
Chi-Square     0.279
P-Value        0.597



Similar to the case for the letters, Table 20 shows that the difference between the counts resulting from algorithms D and E is not negligible, again suggesting that considering the class of vehicles while matching them is beneficial.


Table 21 shows the differences between Mistakes Matrices of numbers resulting from

algorithms are not large suggesting that the false matches are not many. However, the high

value of CV(RMSD) may indicate a degree of sparseness in Table 19 although it is smaller than

that of the letters.

Algorithm          A      B      C      D      E
Count (All)       44     56     69     52     43
Count (Sim)        9     18     20     14     12
Sim/All Ratio  20.5%  32.1%  29.0%  26.9%  27.9%

Algorithm          A       B       C       D
% Difference   67.4%   72.1%   79.1%   20.9%
RMSD           87.6%   95.2%  107.8%   45.7%
CV(RMSD)      203.7%  221.5%  250.8%  106.4%
NRMSD          29.2%   31.7%   35.9%   15.2%

5.2.3.3 Dataset 3 (ABC123): HAVO 2007 – 1



A

B 1

C

D 1

E 1 1 1

F 1 1

G 1

H 1 2

I

J 1 1

K

L

M

N 1 1

O 1

P

Q

R

S 1

T 1

U

V 2

W 1

X

Y 1

Z

                      Count    Sum
All Data                 20     22
Similar Data              4      5
Similar to All Ratio  20.0%  22.7%

             Similar  Dissimilar
Observed           5          17
Expected           1          21
Chi-Square    16.393
P-Value        0.000


The null hypothesis is rejected and similar letters have a significantly higher frequency (4.9 times more) of being mistaken.




close, suggesting that the false matches are probably negligible. However, the difference of the

results between algorithms D and E suggests that considering the class of vehicles during the

matching process is beneficial.


The difference percentages in Table 24 show that the Mistakes Matrices produced by the algorithms are close, indicating a small number of false matches. However, this dataset is comparatively small and Table 22 is extremely sparse: almost all of the non‐empty cells are filled with “ones”. The data are probably insufficient for a comparison between the Mistakes Matrices created by the algorithms.

Algorithm          A      B      C      D      E
Count (All)       24     24     24     26     22
Count (Sim)        6      6      6      7      5
Sim/All Ratio  25.0%  25.0%  25.0%  26.9%  22.7%

Algorithm           A        B        C        D
% Difference    27.3%    27.3%    27.3%    18.2%
RMSD            52.2%    52.2%    52.2%    42.6%
CV(RMSD)      1604.7%  1604.7%  1604.7%  1310.2%
NRMSD           26.1%    26.1%    26.1%    21.3%


The null hypothesis is not rejected and similar digits do not have a significantly higher frequency of being mistaken.




1 2 3 4 5 6 7 8 9 0

1 1

2 1 1 2 1

3 1 2 1 1 1

4 1 1 3 2 1

5 2 1

6 1 1 1

7 2 1

8 2 1

9 1 1

0

                      Count    Sum
All Data                 26     34
Similar Data              5      6
Similar to All Ratio  19.2%  17.6%

             Similar  Dissimilar
Observed           6          28
Expected           8          26
Chi-Square     0.851
P-Value        0.356

Algorithm          A      B      C      D      E
Count (All)       29     30     40     35     34
Count (Sim)        5      5      8      7      6
Sim/All Ratio  17.2%  16.7%  20.0%  20.0%  17.6%


Table 26 outcomes indicate that the number of mistakes found by the five algorithms are

not greatly different, suggesting that the false matches are probably not many. The slight

difference between the results from algorithms D and E in both Table 26 and Table 27 suggests

that considering the class of vehicles during the matching process does not make a great

contribution to correction for this dataset.

Algorithm          A       B       C      D
% Difference   26.5%   29.4%   29.4%   2.9%
RMSD           56.9%   59.4%   64.2%  17.1%
CV(RMSD)      167.3%  174.7%  188.7%  50.4%
NRMSD          19.0%   19.8%   21.4%   5.7%

5.2.3.4 Dataset 4 (ABC123): HAVO 2007 – 2



The null hypothesis is rejected and similar letters have a significantly higher frequency (7.8 times more) of being mistaken.


A

B 2 1

C 2

D 1

E 2

F 1 2

G

H 1

I

J 1

K

L

M 1 1

N 1 1

O 1

P 1

Q

R

S

T 1 2

U 1

V 1 2

W 1

X 1

Y

Z

                      Count    Sum
All Data                 22     28
Similar Data              8     10
Similar to All Ratio  36.4%  35.7%

             Similar  Dissimilar
Observed          10          18
Expected           1          27
Chi-Square    61.512
P-Value        0.000




Table 29 outcomes indicate that the number of mistakes found by the five algorithms are

close, suggesting that the false matches are few. The small difference between the results from

algorithms D and E in both Table 29 and Table 30 suggests that considering the class of vehicles

during the matching process does not make a great contribution to correction for this dataset.

The small RMSDs and Large CV(RMSD)s indicate insufficiency of the data for a comparison

among the algorithms.

Algorithm          A      B      C      D      E
Count (All)       28     29     31     30     28
Count (Sim)        9      9     10     10     10
Sim/All Ratio  32.1%  31.0%  32.3%  33.3%  35.7%

Algorithm          A        B       C       D
% Difference   14.3%    17.9%   10.7%    7.1%
RMSD           37.8%    42.3%   32.7%   26.7%
CV(RMSD)      912.5%  1020.2%  790.3%  645.2%
NRMSD          18.9%    21.1%   16.4%   13.4%


The null hypothesis is not rejected and similar digits do not have a significantly higher frequency of being mistaken.


1 2 3 4 5 6 7 8 9 0

1 1 1

2 1 1

3

4 1 1 1

5 1 1 2 1

6 1 2 1 2

7 2 1 1 1

8 1

9 1 1 1

0

                      Count    Sum
All Data                 23     27
Similar Data              6      9
Similar to All Ratio  26.1%  33.3%

             Similar  Dissimilar
Observed           9          18
Expected           7          20
Chi-Square     1.155
P-Value        0.282


Table 32. Ratio of Similar Character Misreadings to Total Count of Misreadings

Algorithm          A      B      C      D      E
Count (All)       26     30     41     31     27
Count (Sim)        8      8     15     10      9
Sim/All Ratio  30.8%  26.7%  36.6%  32.3%  33.3%

Table 32 outcomes indicate that the counts of mistakes found by the five algorithms are not much different and probably the false matches are not many. The results from Algorithms D and E are fairly close, both in Table 32 and Table 33. It suggests that considering the class of vehicles during the matching process does not make a great contribution to correction for this dataset.

Algorithm          A       B       C       D
% Difference   48.1%   63.0%   66.7%   14.8%
RMSD           74.5%   83.9%  108.9%   47.1%
CV(RMSD)      276.1%  310.7%  403.2%  174.6%
NRMSD          37.3%   41.9%   54.4%   23.6%

5.3 Aggregate Analyses

5.3.1 Processing Time of Algorithms

The processing time is measured in minutes and seconds for two different cases: 1) if the search is made only among the similar characters, and 2) if the search is made among all characters. All of the algorithms A to E search among all characters, whether similar or dissimilar, and therefore belong to the second category. In fact, similar and dissimilar characters in the Mistakes Matrices are simply separated based on manual color coding (Table 8 and Table 9) and not by the algorithms. Therefore, for this comparison a lateral algorithm was created. This algorithm searches only among the defined similar characters and neglects dissimilar ones. The Mistakes Matrix resulting from this algorithm would have only its yellow cells filled. However, its Mistakes Matrix is not used or shown in this text, because this algorithm was created only to measure its processing speed; this speed (represented by processing time) is compared to that of each of the main algorithms (A to E). This algorithm is named “Similar Algorithm” in Table 34 and Table 35.

Table 34. Processing Time of Different Correction Algorithms

Processing Time (Min:Sec)

Dataset        Entry  Unmatched   Original  Checked    Similar  A       B      C        D       E
               Count              Matching  Character           No‐Rep  Rep    Rep‐All  Median  Class
ITE            2182   435         89:45     Letter     52 sec   16:25   20:19  21:12    25:23   41:35
                                            Digit      32 sec    2:26    2:42   2:42     3:23    5:32
HAVO 2009       797   374         27:16     Letter      3 sec    1:46    1:54   1:52     2:43    3:21
                                            Digit      27 sec    2:01    2:11   2:10     3:11    3:34
HAVO 2007 ‐ 1   771   (see note)   8:42     Letter     12 sec    4:55    5:00   5:00     7:30   10:30
                                            Digit      25 sec    1:52    1:55   1:55     2:43    4:01
HAVO 2007 ‐ 2   863   (see note)  14:11     Letter     16 sec    6:28    6:40   6:35    10:47   12:46
                                            Digit      34 sec    2:29    2:33   2:33     3:52    5:01

Note: the Unmatched counts for the two HAVO 2007 datasets are 503 and 468; which value belongs to which dataset cannot be determined from the table as it survives here.

Table 35. Ratio of Processing Time for ‘Similar Algorithm’ to other Full Algorithms

Dataset        Checked     A       B      C        D       E
               Characters  No‐Rep  Rep    Rep‐All  Median  Class
ITE            Letter       5.3%    4.3%   4.1%     3.4%    2.1%
               Digit       21.9%   19.8%  19.8%    15.8%    9.6%
HAVO 2009      Letter       2.8%    2.6%   2.7%     1.8%    1.5%
               Digit       22.3%   20.6%  20.8%    14.1%   12.6%
HAVO 2007 ‐ 1  Letter       4.1%    4.0%   4.0%     2.7%    1.9%
               Digit       22.3%   21.7%  21.7%    15.3%   10.4%
HAVO 2007 ‐ 2  Letter       4.1%    4.0%   4.1%     2.5%    2.1%
               Digit       22.8%   22.2%  22.2%    14.7%   11.3%

Table 34 shows that the processing time for the ‘Similar Algorithm’ is a small fraction of that

of the full algorithms. It can do the corrections at least four times and up to 60 times faster,

depending on collected data format and whether letters or digits are being corrected. Although

this speed is welcome when doing corrections for very large datasets, it is more of a trade‐off

because not all of the mistakes are due to similarity and therefore this algorithm will miss some

of the other mistakes and lose opportunities for more matches.
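The ratios in Table 35 follow directly from the times in Table 34. The short sketch below shows the arithmetic for one cell, assuming times are written either as "N sec" or as "M:SS"; the helper names `to_seconds` and `time_ratio` are hypothetical.

```python
def to_seconds(text):
    """Convert a time written as 'N sec' or 'M:SS' into seconds."""
    text = text.strip()
    if text.endswith("sec"):
        return int(text.split()[0])
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


def time_ratio(similar_time, full_time):
    """Processing time of the 'Similar Algorithm' as a fraction of a full algorithm."""
    return to_seconds(similar_time) / to_seconds(full_time)


# ITE dataset, letters: 'Similar Algorithm' (52 sec) against Algorithm A (16:25).
ratio = time_ratio("52 sec", "16:25")
print(f"ratio = {ratio:.1%}, speed-up = {1 / ratio:.0f}x")
```

The reciprocal of the ratio gives the speed‐up factor that the paragraph above refers to.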

5.3.2 Influence on Percentage of Matched Vehicles

The number of unmatched license plates that were matched after correction by one of the

algorithms is called Algorithm Contribution in this text. Percentage of Algorithm Contribution is

the ratio of Algorithm Contribution to number of unmatched license plates.

Matched/unmatched license plates are calculated based on maximum number of possible

matches:

Maximum that can possibly be matched = Min (Entered, Exited)
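These two definitions amount to very little arithmetic, sketched below. The function names are hypothetical; the numbers in the example are the ITE values reported in this section (2182 entered, 2206 exited, 435 initially unmatched license plates, and a total contribution of 244 for Algorithm E).

```python
def max_possible_matches(entered, exited):
    """Upper bound on the number of matches: the smaller of the two counts."""
    return min(entered, exited)


def contribution_percentage(contribution, initial_unmatched):
    """Algorithm Contribution as a percentage of the initially unmatched plates."""
    return 100.0 * contribution / initial_unmatched


print(max_possible_matches(2182, 2206))
print(round(contribution_percentage(244, 435), 1))
```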

Table 36. Contribution of each Algorithm to Percentage of Matched Vehicles – Letters and Digits Separately

Algorithm Contribution

Dataset        Initial    Entered   Exited    Corrected   Only     A       B    C        D        E
               Unmatched  Vehicles  Vehicles  Characters  Similar  No‐Rep  Rep  Rep‐All  Median   Class
               LPs Count                                                                 Checked  Checked
ITE            435        2182      2206      Letter       73      165     138  199      199      199
                                              Digit        35       77      76   93       93       93
                                              Overlap       ‐       39      40   52       48       48
                                              Total         ‐      203     174  240      244      244
HAVO 2009      348         797       751      Letter        3       58      56   68       71       56
                                              Digit        17       54      50   60       64       53
                                              Overlap       ‐       17      10   21       19       11
                                              Total         ‐       95      96  107      116       98
HAVO 2007 ‐ 1  324         771       523      Letter        6       24      24   24       26       22
                                              Digit         8       33      33   43       45       37
                                              Overlap       ‐        5       5    1        1        1
                                              Total         ‐       52      52   66       70       58
HAVO 2007 ‐ 2  388         863       654      Letter       10       28      28   30       30       28
                                              Digit        13       30      29   36       35       31
                                              Overlap       ‐        5       4    2        3        3
                                              Total         ‐       53      53   64       62       56

Table 37. Ratio of Contribution to Number of Initial Unmatched License Plates

Percentage of Algorithm Contribution

Dataset        Corrected   Only     A       B      C        D        E
               Characters  Similar  No‐Rep  Rep    Rep‐All  Median   Class
                                                            Checked  Checked
ITE            Letter      16.8%    37.9%   31.7%  45.7%    45.7%    45.7%
               Digit        8.0%    17.7%   17.5%  21.4%    21.4%    21.4%
               Total          ‐     46.7%   40.0%  55.2%    56.1%    56.1%
HAVO 2009      Letter       0.9%    16.7%   16.1%  19.5%    20.4%    16.1%
               Digit        4.9%    15.5%   14.4%  17.2%    18.4%    15.2%
               Total          ‐     27.3%   27.6%  30.7%    33.3%    28.2%
HAVO 2007 ‐ 1  Letter       1.9%     7.4%    7.4%   7.4%     8.0%     6.8%
               Digit        2.5%    10.2%   10.2%  13.3%    13.9%    11.4%
               Total          ‐     16.0%   16.0%  20.4%    21.6%    17.9%
HAVO 2007 ‐ 2  Letter       2.6%     7.2%    7.2%   7.7%     7.7%     7.2%
               Digit        3.4%     7.7%    7.5%   9.3%     9.0%     8.0%
               Total          ‐     13.7%   13.7%  16.5%    16.0%    14.4%

Depending on the format of license plate data collection, the algorithms can reduce the number of unmatched license plates by anywhere from about 14% to about 56% (Total rows of Table 37).

5.3.3 Influence on Statistical Indices

Although the number of matched license plates may increase by more detailed algorithms, the

statistical indices derived from the revised set of data may not change noticeably. The index of

“duration of stay” was selected for comparisons based on Average, Standard Deviation, and

Median.

Table 38. Average for the Duration of Stay

Average Duration of Stay (hr:min)

Dataset        Initial Value       Checked     Only     A       B     C        D       E
               (before correction  Characters  Similar  No‐Rep  Rep   Rep‐All  Median  Class
               by Algorithms)
ITE            2:04                Letter      2:04     2:08    2:09  2:09     2:06    2:06
                                   Digit       2:05     2:07    2:07  2:07     2:07    2:07
HAVO 2009      2:09                Letter      2:09     2:10    2:09  2:08     2:08    2:08
                                   Digit       2:09     2:10    2:09  2:10     2:08    2:07
HAVO 2007 ‐ 1  1:41                Letter      1:41     1:38    1:38  1:38     1:38    1:37
                                   Digit       1:40     1:40    1:40  1:39     1:40    1:40
HAVO 2007 ‐ 2  1:49                Letter      1:49     1:49    1:50  1:49     1:49    1:49
                                   Digit       1:47     1:46    1:46  1:46     1:46    1:46

Table 39. Standard Deviation for the Duration of Stay

Standard Deviation of Duration of Stay (hr:min)

Dataset        Initial Value       Checked     Only     A       B     C        D       E
               (before correction  Characters  Similar  No‐Rep  Rep   Rep‐All  Median  Class
               by Algorithms)
ITE            1:26                Letter      1:27     1:31    1:31  1:30     1:27    1:27
                                   Digit       1:29     1:30    1:30  1:30     1:30    1:30
HAVO 2009      1:30                Letter      1:29     1:31    1:30  1:30     1:30    1:29
                                   Digit       1:30     1:31    1:31  1:31     1:30    1:30
HAVO 2007 ‐ 1  1:11                Letter      1:11     1:11    1:11  1:11     1:11    1:11
                                   Digit       1:11     1:11    1:10  1:10     1:10    1:10
HAVO 2007 ‐ 2  1:02                Letter      1:03     1:03    1:03  1:03     1:03    1:03
                                   Digit       1:03     1:02    1:02  1:02     1:02    1:02

Table 40. Median for the Duration of Stay

Median Duration of Stay (hr:min)

Dataset        Initial Value       Checked     Only     A       B     C        D       E
               (before correction  Characters  Similar  No‐Rep  Rep   Rep‐All  Median  Class
               by Algorithms)
ITE            1:57                Letter      1:57     2:00    2:01  2:00     1:59    1:59
                                   Digit       1:57     1:58    1:58  1:58     1:58    1:58
HAVO 2009      1:55                Letter      1:56     1:58    1:55  1:55     1:55    1:55
                                   Digit       1:55     1:57    1:55  1:57     1:55    1:52
HAVO 2007 ‐ 1  1:38                Letter      1:38     1:35    1:35  1:35     1:35    1:34
                                   Digit       1:38     1:38    1:38  1:37     1:37    1:36
HAVO 2007 ‐ 2  1:48                Letter      1:48     1:47    1:48  1:48     1:47    1:47
                                   Digit       1:46     1:43    1:43  1:43     1:43    1:43

The range of alteration of average values for the five algorithms is fairly low. It varies from 2.3% (in the second dataset) to about 4% (in the third dataset).

The range of alteration of standard deviation values for the five algorithms is fairly low. It varies from 1.4% (in the third dataset) to about 4.5% (in the first dataset).

The range of alteration of median values for the five algorithms is fairly low. It varies from 3.4% (in the first dataset) to about 5.2% (in the second dataset).
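The text does not spell out how the range of alteration is computed. The sketch below shows one reading that appears consistent with the reported percentages: the spread (maximum minus minimum) of all tabulated values for a dataset, divided by the initial value before correction. This formula is an assumption, and the helper names `to_minutes` and `range_of_alteration` are hypothetical. The example uses the median durations of stay listed for the second dataset (HAVO 2009) in Table 40.

```python
def to_minutes(hhmm):
    """Convert a duration written as 'h:mm' into minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def range_of_alteration(values, initial):
    """Spread of the tabulated values relative to the initial value (assumed definition)."""
    minutes = [to_minutes(value) for value in values]
    return (max(minutes) - min(minutes)) / to_minutes(initial)


# Median duration of stay, HAVO 2009: letter row followed by digit row.
havo_2009_median = ["1:56", "1:58", "1:55", "1:55", "1:55", "1:55",
                    "1:55", "1:57", "1:55", "1:57", "1:55", "1:52"]
print(f"{range_of_alteration(havo_2009_median, '1:55'):.1%}")
```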

Algorithms D and E use the Match that yields closest duration of stay to the median if more

than one match is found for a license plate, and therefore inherently reduce the range of

change particularly for the median and average. The data prove this. For algorithm (E) the

corrections of average, standard deviation and median are 1.8%, 1.2% and 2.2%, respectively

(average of all datasets).
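The selection rule used by Algorithms D and E can be illustrated with a short sketch. It assumes times are expressed in minutes since midnight and that candidate exits recorded before the entry are ignored; both are assumptions made for the example, and the function name `pick_closest_to_median` is hypothetical and is not the thesis implementation.

```python
def pick_closest_to_median(entry_time, candidate_exit_times, median_stay):
    """Among several candidate matches, keep the exit whose duration of stay
    is closest to the dataset median. Returns None if no candidate is usable."""
    durations = [(exit_time - entry_time, exit_time)
                 for exit_time in candidate_exit_times
                 if exit_time > entry_time]
    if not durations:
        return None
    best = min(durations, key=lambda pair: abs(pair[0] - median_stay))
    return best[1]


# Entry at 10:00 (600 min); three candidate exits; median stay of 125 min.
print(pick_closest_to_median(600, [650, 720, 900], 125))
```

Because every ambiguous match is resolved toward the median, a rule of this kind tends to pull the corrected statistics toward their initial values, which is the effect described in the paragraph above.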

The largest corrections by Algorithm E are 4.0%, 1.6% and 4.1% for average, standard

deviation and median, respectively which are not negligible for such large datasets. Even if

these values are not high enough to necessitate the use of the algorithm, they are more reliable

since the population size has increased after a considerable portion (11% to 56% for the four

datasets) of the unmatched license plates were matched by this algorithm and the statistics are

based on larger population.

Whether the amount of changes is worth the correction depends on the context of usage.

Such statistics may be accepted and used without any correction, and with 0% to 5%

uncertainty. However, for origin‐destination studies, vehicle tracking and travel pattern

recognition, higher matched percentages are usually needed. For such purposes, one missed

license plate may make the tracking profile of a vehicle almost unusable. At toll collection

stations, and more so, for law enforcement purposes where Automatic License Plate

Recognition (ALPR) is used, near perfect accuracy is required and theoretically all license plates

must be matched.


5.4 Evaluation of Impact of Similarity after One Iteration and Redefinition of Similar Characters

When all of the datasets were analyzed, their Mistakes Matrices were summed and based

on the total summation the high frequency Mistakes among both similar characters and

dissimilar characters were found. Also some updates were performed on the Mistakes matrices

since seven new similar cases were found. Table 41 and Table 42 show the updated blank

Mistakes Matrices for letters and digits. Some of the similar mistake cases had high frequencies

as assumed (remained yellow in the tables), some did not (changed to red from yellow in the

original table). On the other hand some of the dissimilar mistake cases had high frequencies

compared to other cells (changed to blue from white). Also seven similar mistake cases were

found to be missed (changed to green from white) in the first iteration of the analyses.

The missed similar mistake cases that were found after the first iteration were added to the

list of similar cases. Low frequency similar cases that existed in the list were retained. After

adding the new cases (green cells), all of the analyses were repeated for all datasets with

algorithm E. The results in the following pages show that χ2 values increased. The only dataset

that had insignificant effect for similarity in the first iteration (Dataset 2 – HAVO 2009)

produced a P‐value smaller than 5% and therefore the general result after this analysis is that

all of the datasets reject the null hypothesis and indicate a significant effect of similarity

between letters for mistaken recordation of license plates. For digits the null hypothesis could

not be rejected. Only one dataset produced a significant result.


Table 41. Updated Blank Mistakes Matrices for Letters

[Blank 26 x 26 matrix with rows and columns A to Z; the cell shading is not reproduced here.]

Legend:
Yellow: Similar letters with high frequencies in 1st iteration
Red: Similar letters with low frequencies in 1st iteration
Green: Similar letters which were missed in 1st iteration
Blue: Dissimilar letters with high frequencies in 1st iteration
White: Other possibilities

Table 42. Updated Blank Mistakes Matrices for Digits

[Blank 10 x 10 matrix with rows and columns 1 to 9 and 0; the cell shading is not reproduced here.]

Legend:
Yellow: Similar digits with high frequencies in 1st iteration
Red: Similar digits with low frequencies in 1st iteration
Green: Similar digits which were missed in 1st iteration
Blue: Dissimilar digits with high frequencies in 1st iteration
White: Other mistakes

Dataset 1 (ABC1):


For this dataset the data were collected in two shifts, and only in one of them (the second one) was a data collector with weaker eyes (wearing glasses) involved. Therefore, in order to investigate the probable dependency of the frequencies of human mistakes (among similar letters) on vision, the two shifts were compared after the second correction iteration. The results showed the impact of good vision to be small, since it only reduced the percentage of human mistakes by 1.8%. On the other hand, in the second shift the percentage of human mistakes increased only by 1.2%.

A 1 1 1 1 1
B 1 1 1 3
C 1 1 1 2 1 2
D 2 3 1 1 12 2
E 1 5 1 1 3
F 1 3 1 1 1 1
G 1 2 1 1 1
H 4
I
J 2 2 2 2 1 1 5 1
K
L
M 1
N 2 1 2 3 1 1 1
O 2 1
P 1 8 2 2 7 1 1
Q
R 2 1 1 1 1 1 8 1 1 2
S 3 1 1 1
T 2 2 1 1 1 1
U 1 1 1 2 1
V 2 1 1 2 1 2 1
W 1 1 1 1
X 1 1 1 2
Y 3 1 4 1 5
Z 1 2

                      Count    Sum
All Data                110    197
Similar Data             33     95
Similar to All Ratio  30.0%  48.2%

             Similar  Dissimilar
Observed          95         102
Expected          13         184
Chi-Square   536.418
P-Value        0.000

Table 44. Updated Numbers Mistakes Matrix for Dataset 1, by Algorithm E (Second Iteration)

1 2 3 4 5 6 7 8 9 0

1 3 3 1

2 1 4 2 1

3 2 1 2 1 1 1 1

4 2 1 1 1

5 1 1

6 2 1 1 2 4 2

7 3 1 2 2 1 2

8 1 2 1 2 2 2 1

9 2 1 2 1 1

0

                      Count    Sum
All Data                 44     72
Similar Data             15     29
Similar to All Ratio  34.1%  40.3%

             Similar  Dissimilar
Observed          29          43
Expected          21          51
Chi-Square     4.546
P-Value        0.033

Dataset 2 (C123):



A 1 1 1 1

B 1 1 1

C 1 1 1

D 1 1 1

E 1 1

F 1 1 1

G 1

H 1

I 1

J 1 1

K 1 1

L

M 1

N 1 1

O 1 1

P 1

Q

R 1

S 1

T 1 1 1 1

U 1 1 1

V 1

W 1

X 3 1

Y 2 1

Z 1 1 1 1 1 1 1

                      Count    Sum
All Data                 53     56
Similar Data              7      9
Similar to All Ratio  13.2%  16.1%

             Similar  Dissimilar
Observed           9          47
Expected           4          52
Chi-Square     7.678
P-Value        0.006


1 2 3 4 5 6 7 8 9 0

1 1 1 1

2 1 1 1 1

3 3

4 1 2 1 1 2 1

5 1 2 1 1 2

6 2 1 1 3

7 1 1 1 2

8 1 1 1

9 1 1 1

0

                      Count    Sum
All Data                 33     43
Similar Data             10     15
Similar to All Ratio  30.3%  34.9%

             Similar  Dissimilar
Observed          15          28
Expected          12          31
Chi-Square     0.752
P-Value        0.386

Dataset 3 (ABC123):



A

B 1

C

D 1

E 1 1 1

F 1 1

G 1

H 1 2

I

J 1 1

K

L

M

N 1 1

O 1

P

Q

R

S 1

T 1

U

V 2

W 1

X

Y 1

Z

                      Count    Sum
All Data                 20     22
Similar Data              5      6
Similar to All Ratio  25.0%  27.3%

             Similar  Dissimilar
Observed           6          16
Expected           1          21
Chi-Square    14.655
P-Value        0.000


1 2 3 4 5 6 7 8 9 0

1 1

2 1 1 2 1

3 1 2 1 1 1

4 1 1 3 2 1

5 2 1

6 1 1 1

7 2 1

8 2 1

9 1 1

0

                      Count    Sum
All Data                 26     34
Similar Data              7      9
Similar to All Ratio  26.9%  26.5%

             Similar  Dissimilar
Observed           9          25
Expected          10          24
Chi-Square     0.097
P-Value        0.756

Dataset 4 (ABC123):



A

B 2 1

C 2

D 1

E 2

F 1 2

G

H 1

I

J 1

K

L

M 1 1

N 1 1

O 1

P 1

Q

R

S

T 1 2

U 1

V 1 2

W 1

X 1

Y

Z

                      Count    Sum
All Data                 22     28
Similar Data             11     14
Similar to All Ratio  50.0%  50.0%

             Similar  Dissimilar
Observed          14          14
Expected           2          26
Chi-Square    82.917
P-Value        0.000


1 2 3 4 5 6 7 8 9 0

1 1 1

2 1 1

3

4 1 1 1

5 1 1 2 1

6 1 2 1 2

7 2 1 1 1

8 1

9 1 1 1

0

                      Count    Sum
All Data                 23     27
Similar Data              8     11
Similar to All Ratio  34.8%  40.7%

             Similar  Dissimilar
Observed          11          16
Expected           8          19
Chi-Square     1.846
P-Value        0.174

CHAPTER 6

CONCLUSION

The purpose of this research was to investigate the accuracy of license plate matching methods for vehicle tracking and travel time data collection, to provide correction algorithms that improve the results, and to investigate the role of human mistakes caused by similarity between recorded characters. Four datasets were used, covering three recordation styles: ABC1, C123 and ABC123. Five algorithms were developed to process the unmatched license plates in the datasets by substituting the correct letters and digits for the mistakenly recorded ones. The most comprehensive and accurate algorithm is the constrained Algorithm E. It checks that the vehicle classification agrees between the pair being matched, and if more than one match is found it uses the one that yields the duration of stay closest to the dataset median.
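The selection rule described above can be illustrated with a minimal Python sketch. It is not the thesis's VBA implementation: the function names, the tuple layout of the records, and the `median_stay` parameter are hypothetical, and the sketch assumes that a correction replaces exactly one character, letters with letters and digits with digits.

```python
import string


def one_char_variants(plate):
    """Yield every plate that differs from `plate` in exactly one position.

    Letters are only replaced by letters and digits by digits (an assumption).
    """
    for pos, original in enumerate(plate):
        pool = string.ascii_uppercase if original.isalpha() else string.digits
        for ch in pool:
            if ch != original:
                yield plate[:pos] + ch + plate[pos + 1:]


def rematch(entry, exits, median_stay):
    """Return the index of the preferred exit record for an unmatched entry.

    entry: (plate, vehicle_class, time_in)
    exits: list of still-unmatched (plate, vehicle_class, time_out) records
    Returns None when no candidate satisfies the constraints.
    """
    plate, vclass, t_in = entry
    variants = set(one_char_variants(plate))
    # Keep only candidates of the same vehicle class that leave after entry.
    candidates = [
        idx for idx, (p, c, t_out) in enumerate(exits)
        if p in variants and c == vclass and t_out >= t_in
    ]
    if not candidates:
        return None
    # Several candidates: prefer the duration of stay closest to the median.
    return min(candidates,
               key=lambda idx: abs((exits[idx][2] - t_in) - median_stay))


exits = [("ABP1", "car", 130), ("ABR1", "car", 400), ("ABP1", "bus", 125)]
best = rematch(("ABD1", "car", 100), exits, median_stay=35)
```

In this made-up example two exit records of the right class differ from ABD1 by one letter; the one whose duration of stay (30) lies nearer the assumed median (35) is chosen, and the bus record is ignored because its class does not agree.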

This research showed that after the initial matching phase is done, a considerable increase in the percentage of matched license plates can be attained. For Algorithm E, the most comprehensive and accurate algorithm introduced, this gain ranges from 11% to 56%, depending on the recordation style of the license plates (e.g. ABC1 or C123) and on whether letters or digits are being corrected.

To a smaller degree, the algorithms can improve the statistics of the license plate recordation datasets, such as the average, standard deviation and median of travel time and/or duration of stay. Based on the four case studies in this research, the travel time values can change by between 0% and 5% after the unmatched license plates are processed. The highest corrections by Algorithm E are 4.0%, 1.6% and 4.1% for the average, standard deviation and median, respectively.

Using the classification of vehicles did improve the matching process. The Mistakes Matrices produced by Algorithms D and E, which are identical except that D is not constrained by vehicle class, differed by 0% to 25%. The classification scheme had four or five vehicle classes, depending on the dataset. The more evenly the collected vehicles are spread among the classes, the more helpful the classification data can be in increasing the accuracy of re-matching. Since this spread varied among the datasets used in this study, knowledge of the vehicle classes was more beneficial for some datasets than for others. If the FHWA scheme with 13 classes is used, better results may be obtained, since it breaks up the personal vehicle class. Classification is more useful in an application like HAVO, where buses and minibuses are a substantial share of traffic. The same may not be true on, e.g., the H-1 freeway, where about 98% of the traffic consists of light duty vehicles.

This study also shows that a significant portion of the letters recorded by mistake are visually similar to the correct letters, which by itself demonstrates the human factor in the accuracy of the method. Digits, however, are not significantly likely to be mistaken because of visual similarity.
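The observed and expected counts in the Mistakes Matrix summaries can be compared with a chi-square goodness-of-fit test on two groups (similar and dissimilar), which has one degree of freedom. The sketch below is an illustration only: the function name and the `expected_share` parameter are hypothetical, and the share 44/650 used in the example is an assumption chosen because it roughly reproduces the statistic tabulated for the Dataset 2 letters (9 similar, 47 dissimilar); this chapter does not state the share.

```python
import math


def chi_square_two_groups(observed_similar, observed_dissimilar, expected_share):
    """Chi-square goodness-of-fit statistic and p-value for two groups.

    expected_share is the hypothesized fraction of mistakes that would fall
    on similar pairs if similarity played no role (an input, not derived here).
    """
    total = observed_similar + observed_dissimilar
    expected_similar = total * expected_share
    expected_dissimilar = total - expected_similar
    chi2 = ((observed_similar - expected_similar) ** 2 / expected_similar
            + (observed_dissimilar - expected_dissimilar) ** 2 / expected_dissimilar)
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Dataset 2 letters: 9 similar and 47 dissimilar mistakes; share is assumed.
chi2, p_value = chi_square_two_groups(9, 47, 44 / 650)
```

Because the expected counts printed in the tables are rounded to whole numbers, feeding those rounded counts back into the formula gives a somewhat different statistic than the tabulated one.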

The five most frequent mistakes in the aggregate data are listed below.

For the letters (average repetition for all mistakes = 5.2):

D-P: 100
D-R: 72
F-E: 59
X-Y: 50
U-V: 48

For the digits (average repetition for all mistakes = 21.0):

6-8: 55
4-6: 52
1-7: 48
2-8: 42
2-3: 38

For example, the letters D and P were recorded in place of each other 100 times, while the average frequency of mistakes over all possible cases (the 650 off-diagonal cells of the 26 × 26 letter Mistakes Matrices) was only 5.2.

All of the top five cases for the letters involve similar letters, as do three of the top five cases for the digits.

The style of recordation also proved to be significant: the more letters that have to be recorded, the more human errors occur. For this reason, it is suggested that if the recorders are not close enough to the road during license plate recordation, only the last four digits of the license plates be recorded, since this style showed lower recordation errors.

Finally, the “similar algorithm”, which searches only among similar characters, is definitely recommended, especially for letters. Most of the time this algorithm found around 50% of the matches, with a maximum of 100%, while its processing time was only 1.5% to 11% of that of the most comprehensive algorithm (E). It is therefore highly recommended for large databases.
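The speed advantage of searching only similar characters can be seen in a small Python sketch. The `SIMILAR` table below is an assumption built from the five most frequent letter confusions listed in this chapter; the full similarity set used in the thesis is not given here, and the function names are hypothetical.

```python
# Assumed similarity table, taken from the top five letter pairs above.
SIMILAR = {
    "D": "PR", "P": "D", "R": "D",
    "F": "E", "E": "F",
    "X": "Y", "Y": "X",
    "U": "V", "V": "U",
}


def similar_variants(plate, similar=SIMILAR):
    """List plates obtained by swapping one character for a similar one."""
    variants = []
    for pos, ch in enumerate(plate):
        for alt in similar.get(ch, ""):
            variants.append(plate[:pos] + alt + plate[pos + 1:])
    return variants


def similar_match(plate, unmatched_exit_plates):
    """Return the first similar variant found among the unmatched plates."""
    for candidate in similar_variants(plate):
        if candidate in unmatched_exit_plates:
            return candidate
    return None


found = similar_match("DX1", {"RX1", "ABC"})
```

With this table a letter has at most two alternatives to try, whereas a full substitution tries all 25 other letters in each position, which is consistent with the much shorter processing time reported for the similar algorithm.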


REFERENCES

1. Pline, J.L., Traffic Engineering Handbook. 4th ed. 1992: Prentice‐Hall Inc.

2. Hauer, E., Correction of license plate surveys for spurious matches. Transportation

Research Part A: General, 1979. 13(2): p. 71–78.

3. Oliveira‐Neto, F.M., L.D. Han, and M.K. Jeong, Tracking Large Trucks in Real Time with

License Plate Recognition and Text‐Mining Techniques. Transportation Research Record:

Journal of the Transportation Research Board, 2009. 2121: p. 121–127.

4. Turner, S.M., et al., Travel Time Data Collection Handbook, 1998, Office of Highway

Information Management, Federal Highway Administration.

5. Gómez‐Torres, N.R. and D.M. Valdés‐Díaz, Detection Technologies for Dynamic Origin‐

Destination Matrices and Heavy Vehicles’ Road Selection Studies, in Seventh LACCEI

Latin American and Caribbean Conference for Engineering and Technology (LACCEI’2009)

“Energy and Technology for the Americas: Education, Innovation, Technology and

Practice”, 2009: San Cristóbal, Venezuela.

6. Han, L.D., M.‐K. Jeong, and F.M. Oliveira‐Neto, License Plate Recognition, 2009,

National Transportation Research Center Incorporated (NTRCI) ‐ University

Transportation Center.

7. Makowski, G.G. and K.C. Sinha, A statistical procedure to analyze partial license plate

numbers. Transportation Research Part A: General, 1976. 10(2): p. 131‐132.


8. Neto, F.M.O., Matching Vehicle License Plate Numbers Using License Plate Recognition

and Text Mining Techniques, 2010, University of Tennessee, Knoxville: Tennessee

Research and Creative Exchange.

9. Clark, S.D., S. Grant‐Muller, and H. Chen, Cleaning of Matched License Plate Data.

Transportation Research Record: Journal of the Transportation Research Board,

2002(1804): p. 1‐7.

10. Wagner, R.A. and M.J. Fischer, The String‐to‐String Correction Problem. Journal of the

ACM (JACM), 1974. 21(1).

11. Miller, G., The Magical Number Seven, Plus or Minus Two. Psychological Review, 1956.

63: p. 81‐97.

12. Schraagen, J.M. and K. van Dongen, Designing a licence plate for memorability.

Ergonomics, 2005. 48(7): p. 796‐806.

13. C. Bisdikian, An overview of the Bluetooth wireless technology, IEEE Communications

Magazine 2001. 39: p. 86 ‐ 94.

14. J. Hallberg, M. Nilsson, K. Synnes, Positioning with Bluetooth, ICT 2003: 10th

International Conference on Telecommunications, Feb 2003, Papeete, French 2, 2003. p.

954 ‐ 958.

15. M. Lu, W. Chen, X Shen, H. Lam, J. Liu, Positioning and tracking construction vehicles in

highly dense urban areas and building construction site. Automation in Construction,

2007. 16(5): p. 647–656


16. A. M. Steane, Error Correcting Codes in Quantum Theory. Physical Review Letters, 1996

77(5): p. 793‐797

17. J. Landt, The history of RFID. IEEE Potentials, 2005. 24(4): p. 8‐11

18. Erick C. Jones, Christopher A. Chung, RFID In Logistics: A Practical Introduction, 2008,

CRC Press

Appendix A

Algorithms

Initial Matching Algorithm

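Option Explicit

' Initial matching: for each plate in column 2, find the first unused
' identical plate in column 6 whose time (column 7) is not earlier than
' the time in column 3, and write that time to column 5.
Sub find_plates_new()
    ' Each variable needs its own type; "Dim i, j As Integer" would
    ' leave i as a Variant.
    Dim i As Integer, j As Integer, k As Integer
    Dim sngStartTime As Single
    Dim sngTotalTime As Single

    sngStartTime = Timer

    ' Keep a copy of column 6 in column 15, because matched entries
    ' in column 6 are overwritten below.
    For k = 5 To 2187
        Cells(k, 15) = Cells(k, 6)
    Next k

    For j = 5 To 2187
        ' Assume no match until one is found.
        Cells(j, 5) = "No match found!"
        For i = 5 To 2211
            If Cells(j, 2) = Cells(i, 6) And Cells(j, 3) <= Cells(i, 7) Then
                Cells(j, 5) = Cells(i, 7)
                Cells(i, 6) = "Used and excluded"
                Exit For
            End If
        Next i
    Next j

    sngTotalTime = Timer - sngStartTime
    MsgBox "Time taken: " & Round(sngTotalTime, 2) & " seconds"
    Cells(4, 4) = Round(sngTotalTime, 2)
End Sub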
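For readers without Excel, the same initial matching idea can be sketched in Python. This is an illustrative translation under assumptions, not part of the thesis: the function name and the list-of-tuples layout are hypothetical, and the reading of the spreadsheet columns as entry records and exit records is inferred from the macro.

```python
def initial_match(entries, exits):
    """Pair each entry with the first unused exit that has the same plate.

    entries: list of (plate, time_in)
    exits:   list of (plate, time_out)
    Returns a list with the matched time_out per entry, or None if unmatched.
    """
    used = [False] * len(exits)
    matched_times = []
    for plate, t_in in entries:
        found = None
        for i, (exit_plate, t_out) in enumerate(exits):
            # An exit record can be used once and must not precede the entry.
            if not used[i] and exit_plate == plate and t_in <= t_out:
                used[i] = True
                found = t_out
                break
        matched_times.append(found)
    return matched_times


entries = [("ABC1", 10), ("ABC1", 20), ("XYZ9", 5)]
exits = [("ABC1", 15), ("XYZ9", 3), ("ABC1", 40)]
result = initial_match(entries, exits)
```

Entries left as None here correspond to the rows that the macro marks "No match found!", which are the rows the correction algorithms then process.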

Algorithm A

Option Explicit

Sub letter_table_no_repeat()

Dim i, j, k, l, n, w, a, b, hplace, vplace As Integer

Dim beforechange, afterchange As Variant

Dim break As String

Dim sngStartTime As Single

Dim sngTotalTime As Single

sngStartTime = Timer

For k = 5 To 2186

break = "no"

If Cells(k, 5) = "No match found!" Then

For j = 1 To 3

For l = 1 To 26

beforechange = Mid(Cells(k, 2), j, 1)

Cells(k, 14) = Replace(Cells(k, 2), Mid(Cells(k, 2), j, 1), Cells(l, 16))

afterchange = Cells(l, 16)

For i = 5 To 2210

If Cells(k, 14) = Cells(i, 6) And Cells(k, 3) <= Cells(i, 7) Then

Cells(k, 5) = Cells(i, 7)

Cells(i, 6) = "used and excluded"

hplace = charactervalue(beforechange)

vplace = charactervalue(afterchange)

If hplace > 0 And vplace > 0 Then

Cells(vplace + 13, hplace + 17) = Cells(vplace + 13, hplace + 17) + 1

break = "yes"


End If

Exit For

End If

Next i








break = "yes"

End If

Exit For

End If

If break = "yes" Then

Exit For

End If

Next l








break = "yes"

End If

Exit For


End If

If break = "yes" Then

Exit For

End If

Next j

End If

Next k

sngTotalTime = Timer - sngStartTime

MsgBox "Time taken: " & (sngTotalTime \ 60) & " minutes, " & (sngTotalTime Mod 60) & " seconds"

MsgBox "Time taken: " & Round(sngTotalTime, 2) & " seconds"

Cells(4, 4) = Round(sngTotalTime, 2)

End Sub

Appendix B

Mistakes Matrices by Algorithms A to D

Dataset 1 (ABC1): ITE

Algorithm A: Removed matched LPs

Algorithm B: Retained matched LPs


A 1 1 1 1

B 2 1 1 1 2

C 3 1 1 1 1 1 1 1

D 1 1 2 2 1 9 1 1 1

E 1 4 1 2

F 1 2 1 1 1 1

G 1 1 1 1

H 4

I

J 1 1 1 1 2 2 1 2 1 1

K

L

M 2

N 2 2 1 4 3 1 1 1 1

O 2 2

P 6 2 1 3 1

Q

R 2 1 1 9 1 1

S 1 1

T 1 1 1 1 1 1

U 1 2 1

V 1 1 2 1 1

W 1 1

X 1 1 3 1

Y 1 3 1 4

Z 1 2

                        Count   Sum
All Data                102     164
Similar Data            18      56
Similar to All Ratio    17.6%   34.1%

              Similar   Dissimilar
Observed      56        108
Expected      8         156

Chi-Square    324.872
P-Value       0.000


A 1 1 1 1

B 1 2 1 1 2

C 1 3 1 1 1 3 1 1

D 1 1 1 2 2 1 8 1 3 1 1

E 5 2 1 1 1

F 1 2 1 1 1 1 1

G 2 1 2 1

H 4

I

J 1 4 2 2 2 2 3 1 1

K

L

M 2

N 2 2 1 1 5 2 1 1 1 1 1

O 3 2

P 6 3 1 3 1 1 1

Q

R 2 2 2 2 6 1 1 1 2

S 1 1 1 1 1 1 1

T 3 1 1 1 1 1 1

U 2 1 1 1 1

V 1 1 1 2 2

W 1 1 1

X 3 1 1 3 1

Y 2 1 1 5 6

Z 2 1 1 1 1 1

                        Count   Sum
All Data                127     214
Similar Data            19      53
Similar to All Ratio    15.0%   24.8%

              Similar   Dissimilar
Observed      53        161
Expected      10        204

Chi-Square    197.387
P-Value       0.000

Algorithm C: Full correction


A 1 1 1 1 1

B 1 2 1 1 1 2 3

C 3 3 2 1 5 1 4 1 1 1

D 2 1 1 3 2 1 12 2 1 3 1 1

E 1 7 2 1 1 3

F 4 1 3 1 2 2 1 2 1 1 2

G 1 2 3 2 1 1

H 4

I

J 1 5 2 2 3 3 1 3 1 1 5 2 2

K

L

M 2

N 2 2 2 2 6 3 1 1 2 1 1 1

O 3 2

P 1 9 3 2 7 1 1 1 1

Q

R 2 2 2 1 1 2 10 1 1 1 2

S 1 3 1 1 1 1 1 1 1 1

T 1 3 1 2 3 1 1 1 2

U 2 1 1 1 3 1

V 3 1 1 2 2 2 1

W 2 1 1 1 1 1

X 9 1 2 1 1 3 1

Y 3 1 1 8 1 7

Z 3 4 1 1 1 2 2 1

                        Count   Sum
All Data                165     339
Similar Data            22      79
Similar to All Ratio    13.3%   23.3%

              Similar   Dissimilar
Observed      79        260
Expected      16        323

Chi-Square    268.943
P-Value       0.000

Algorithm D: Closest to mean


A 1 1 1 1 1

B 1 1 1 3

C 1 1 1 2 1 2

D 2 3 1 1 12 2

E 1 5 1 1 3

F 1 3 1 1 1 1

G 1 2 1 1 1

H 4

I

J 2 2 2 2 1 1 5 1

K

L

M 1

N 2 1 2 3 1 1 1

O 2 1

P 1 8 2 2 7 1 1

Q

R 2 1 1 1 1 1 8 1 1 2

S 3 1 1 1

T 2 2 1 1 1 1

U 1 1 1 2 1

V 2 1 1 2 1 2 1

W 1 1 1 1

X 1 1 1 2

Y 3 1 4 1 5

Z 1 2

                        Count   Sum
All Data                110     197
Similar Data            21      67
Similar to All Ratio    19.1%   34.0%

              Similar   Dissimilar
Observed      67        130
Expected      9         188

Chi-Square    386.652
P-Value       0.000


1 2 3 4 5 6 7 8 9 0

1 1 1 2 1

2 1 5 2 1

3 1 2 1 2 1 1

4 2 1 1 1

5 1 1

6 2 1 1 2 4 1

7 3 1 1

8 1 2 1 2 1 3

9 2 1 1 1 1

0

                        Count   Sum
All Data                40      61
Similar Data            11      21
Similar to All Ratio    27.5%   34.4%

              Similar   Dissimilar
Observed      21        40
Expected      15        46

Chi-Square    3.291
P-Value       0.070


1 2 3 4 5 6 7 8 9 0

1 1 1 2 1

2 1 6 2 1

3 1 2 1 2 1 1

4 2 1 1 1

5 1 1

6 2 1 1 2 4 1

7 3 1 1 1

8 1 2 1 1 2 2 2

9 3 1 1 1 2 1

0

                        Count   Sum
All Data                43      67
Similar Data            12      25
Similar to All Ratio    27.9%   37.3%

              Similar   Dissimilar
Observed      25        42
Expected      16        51

Chi-Square    6.008
P-Value       0.014


1 2 3 4 5 6 7 8 9 0

1 1 3 3 1

2 1 6 2 1 1

3 2 1 2 1 2 1 1

4 2 1 1 1

5 1 1

6 2 1 1 2 4 2

7 3 2 2 2 1 2

8 1 2 1 2 2 3 3

9 3 1 2 1 2 1

0

                        Count   Sum
All Data                47      84
Similar Data            13      31
Similar to All Ratio    27.7%   36.9%

              Similar   Dissimilar
Observed      31        53
Expected      21        63

Chi-Square    7.061
P-Value       0.008


1 2 3 4 5 6 7 8 9 0

1 3 3 1

2 1 4 2 1

3 2 1 2 1 1 1 1

4 2 1 1 1

5 1 1

6 2 1 1 2 4 2

7 3 1 2 2 1 2

8 1 2 1 2 2 2 1

9 2 1 2 1 1

0

                        Count   Sum
All Data                44      72
Similar Data            12      26
Similar to All Ratio    27.3%   36.1%

              Similar   Dissimilar
Observed      26        46
Expected      18        54

Chi-Square    5.306
P-Value       0.021

Dataset 2 (C123): HAVO 2009



A 1 1 1

B 1 1 1 1

C 1 1 1

D 1

E 1 1

F 1 1 1

G 1 1

H 1

I 1

J 1

K 1 1 1

L

M 1 1 1

N 1 1

O 1

P 1 1

Q

R 1 1

S 1

T 1 1 1

U 1 1 1

V 1 1 1

W 1 1

X 3 1

Y 2 1

Z 1 1 1 1 1

                        Count   Sum
All Data                55      58
Similar Data            3       3
Similar to All Ratio    5.5%    5.2%

              Similar   Dissimilar
Observed      3         55
Expected      3         55

Chi-Square    0.041
P-Value       0.840



A 1 1 1

B 1 1 1 1

C 1 1 1

D 1

E 1 1

F 1 1 1

G 1 1

H 1

I 1

J 1

K 1 1 1

L

M 1 1 1

N 1 1

O 1

P 1 1

Q

R 1 1

S 1

T 1 1 1

U 1 1 1 1

V 1 1 1

W 1 1

X 4 1

Y 2 1

Z 1 1 1 1 1 1 1

                        Count   Sum
All Data                58      62
Similar Data            3       3
Similar to All Ratio    5.2%    4.8%

              Similar   Dissimilar
Observed      3         59
Expected      3         59

Chi-Square    0.007
P-Value       0.933



A 1 1 1 1 1 1

B 1 1 1 1 1

C 1 1 1

D 1 1 1

E 1 1

F 1 1 1

G 1 1 1

H 1

I 1

J 1 1

K 1 1 2 1

L

M 1 1 1 1

N 1 1

O 1 1

P 1 1

Q

R 1 1

S 1

T 1 1 1

U 1 1 2 1

V 1 1 1

W 1 1 1

X 4 1

Y 2 1

Z 1 1 1 1 1 1 1

                        Count   Sum
All Data                70      76
Similar Data            3       3
Similar to All Ratio    4.3%    3.9%

              Similar   Dissimilar
Observed      3         73
Expected      4         72

Chi-Square    0.077
P-Value       0.781



A 1 1 1 1 1

B 1 1 1 1 1

C 1 1 1

D 1 1 1

E 1 1

F 1 1 1

G 1 1

H 1

I 1

J 1 1

K 1 1 2

L

M 1 1

N 1 1

O 1 1

P 1 1

Q

R 1 1

S 1

T 1 1 1 1

U 1 1 2 1

V 1 1

W 1 1 1 1

X 3 1

Y 2 1

Z 1 1 1 1 1 1 1

                        Count   Sum
All Data                66      71
Similar Data            4       4
Similar to All Ratio    6.1%    5.6%

              Similar   Dissimilar
Observed      4         67
Expected      3         68

Chi-Square    0.167
P-Value       0.683


1 2 3 4 5 6 7 8 9 0

1 1 2 1

2 1 1 2

3 1 4

4 1 2 1 2 1 1 1

5 2 1

6 3 1 1 2

7 1 1 1 2

8 1 1

9 1 1 1 2

0

                        Count   Sum
All Data                31      44
Similar Data            5       9
Similar to All Ratio    16.1%   20.5%

              Similar   Dissimilar
Observed      9         35
Expected      11        33

Chi-Square    0.379
P-Value       0.538


1 2 3 4 5 6 7 8 9 0

1 1 1 1 2 1

2 1 1 2

3 1 4

4 1 2 1 3 1 1 1

5 2 1 1 1 1

6 4 1 1 3

7 1 1 1 2

8 1 1 2 1 1

9 1 2 1 1

0

                        Count   Sum
All Data                39      56
Similar Data            11      18
Similar to All Ratio    28.2%   32.1%

              Similar   Dissimilar
Observed      18        38
Expected      14        42

Chi-Square    1.797
P-Value       0.180


1 2 3 4 5 6 7 8 9 0

1 1 1 1 2 1

2 2 1 1 2 1

3 2 4

4 1 2 1 4 1 1 1

5 2 1 1 1 2

6 1 5 2 2 3

7 1 1 2 2

8 1 1 2 1 1

9 1 1 2 1 2

0

                        Count   Sum
All Data                43      69
Similar Data            11      20
Similar to All Ratio    25.6%   29.0%

              Similar   Dissimilar
Observed      20        49
Expected      17        52

Chi-Square    0.770
P-Value       0.380


1 2 3 4 5 6 7 8 9 0

1 1 1 2 1

2 1 1 1 1 1 1

3 3

4 1 2 1 1 2 1 1 1

5 1 2 1 1 2

6 2 1 1 3

7 1 1 1 1 2 1

8 1 1 1 1

9 1 1 1

0

                        Count   Sum
All Data                41      52
Similar Data            9       14
Similar to All Ratio    22.0%   26.9%

              Similar   Dissimilar
Observed      14        38
Expected      13        39

Chi-Square    0.173
P-Value       0.677

Dataset 3 (ABC123): HAVO 2007 ‐ 1



A

B 1

C

D 1

E 1 1 2

F 1 1

G 1

H 1 2 1

I

J 1

K

L

M

N 1 1 1

O 1

P

Q

R

S 1

T 1

U

V 3

W

X

Y 1

Z

                        Count   Sum
All Data                20      24
Similar Data            4       6
Similar to All Ratio    20.0%   25.0%

              Similar   Dissimilar
Observed      6         18
Expected      1         23

Chi-Square    22.653
P-Value       0.000



A

B 1

C

D 1

E 1 1 2

F 1 1

G 1

H 1 2 1

I

J 1

K

L

M

N 1 1 1

O 1

P

Q

R

S 1

T 1

U

V 3

W

X

Y 1

Z

                        Count   Sum
All Data                20      24
Similar Data            4       6
Similar to All Ratio    20.0%   25.0%

              Similar   Dissimilar
Observed      6         18
Expected      1         23

Chi-Square    22.653
P-Value       0.000



A

B 1

C

D 1

E 1 1 2

F 1 1

G 1

H 1 2 1

I

J 1

K

L

M

N 1 1 1

O 1

P

Q

R

S 1

T 1

U

V 3

W

X

Y 1

Z

                        Count   Sum
All Data                20      24
Similar Data            4       6
Similar to All Ratio    20.0%   25.0%

              Similar   Dissimilar
Observed      6         18
Expected      1         23

Chi-Square    22.653
P-Value       0.000



A

B 1

C

D 1

E 1 1 2

F 1 1

G 1

H 1 2 1

I

J 1 1

K

L

M

N 1 1 1

O 1

P

Q

R

S 1

T 1

U

V 3

W 1

X

Y 1

Z

                        Count   Sum
All Data                22      26
Similar Data            5       7
Similar to All Ratio    22.7%   26.9%

              Similar   Dissimilar
Observed      7         19
Expected      1         25

Chi-Square    29.390
P-Value       0.000


1 2 3 4 5 6 7 8 9 0

1

2 1 1 2 1

3 1 2 1 1

4 1 1 1 1 1

5 1 1

6 1 1 1 1

7 2 1

8 1 1

9 1 1 1

0

                        Count   Sum
All Data                26      29
Similar Data            4       5
Similar to All Ratio    15.4%   17.2%

              Similar   Dissimilar
Observed      5         24
Expected      7         22

Chi-Square    0.815
P-Value       0.367


1 2 3 4 5 6 7 8 9 0

1

2 1 1 2 1

3 1 2 1 1

4 1 1 1 1 2

5 1 1

6 1 1 1 1

7 2 1

8 1 1

9 1 1 1

0

                        Count   Sum
All Data                26      30
Similar Data            4       5
Similar to All Ratio    15.4%   16.7%

              Similar   Dissimilar
Observed      5         25
Expected      7         23

Chi-Square    0.983
P-Value       0.322


1 2 3 4 5 6 7 8 9 0

1 1

2 1 1 2 1

3 1 2 1 1

4 1 1 2 3 3

5 2 1

6 1 1 1 1

7 2 1

8 2 2 1 1

9 1 1 1

0

                        Count   Sum
All Data                29      40
Similar Data            6       8
Similar to All Ratio    20.7%   20.0%

              Similar   Dissimilar
Observed      8         32
Expected      10        30

Chi-Square    0.428
P-Value       0.513


1 2 3 4 5 6 7 8 9 0

1 1

2 1 1 2 1

3 2 2 1 1 1

4 1 1 3 2 1

5 2 1

6 1 1 1

7 2 1

8 2 1

9 1 1

0

                        Count   Sum
All Data                26      35
Similar Data            5       7
Similar to All Ratio    19.2%   20.0%

              Similar   Dissimilar
Observed      7         28
Expected      9         26

Chi-Square    0.374
P-Value       0.541

Dataset 4 (ABC123): HAVO 2007 ‐ 2



A

B 2 1

C 1 1

D 1

E 2

F 1 2

G 1

H 1

I

J 1

K

L

M 1 1

N 1

O 1

P 1

Q

R

S

T 1 2

U 1

V 1 2

W 1

X 1

Y

Z

                        Count   Sum
All Data                23      28
Similar Data            7       9
Similar to All Ratio    30.4%   32.1%

              Similar   Dissimilar
Observed      9         19
Expected      1         27

Chi-Square    48.195
P-Value       0.000



A

B 2 1

C 1 1

D 1

E 2

F 1 2

G 1

H 1

I

J 1

K

L

M 1 1

N 1

O 1

P 1

Q

R

S

T 1 2

U 1

V 1 2

W 1

X 1

Y 1

Z

                        Count   Sum
All Data                24      29
Similar Data            7       9
Similar to All Ratio    29.2%   31.0%

              Similar   Dissimilar
Observed      9         20
Expected      1         28

Chi-Square    45.978
P-Value       0.000



A

B 2 1

C 1 2

D 1

E 2

F 1 2

G 1

H 1

I

J 1

K

L

M 1 1

N 1 1

O 1

P 1

Q

R

S

T 1 2

U 1

V 1 2

W 1

X 1

Y 1

Z

                        Count   Sum
All Data                25      31
Similar Data            8       10
Similar to All Ratio    32.0%   32.3%

              Similar   Dissimilar
Observed      10        21
Expected      1         30

Chi-Square    53.807
P-Value       0.000



A

B 2 1

C 1 2

D 1

E 2

F 1 2

G 1

H 1

I

J 1

K

L

M 1 1

N 1 1

O 1

P 1

Q

R

S

T 1 2

U 1

V 1 2

W 1

X 1

Y

Z

                        Count   Sum
All Data                24      30
Similar Data            8       10
Similar to All Ratio    33.3%   33.3%

              Similar   Dissimilar
Observed      10        20
Expected      1         29

Chi-Square    56.201
P-Value       0.000


1 2 3 4 5 6 7 8 9 0

1 1 1 1

2 1 1

3

4 1 1 1 1

5 1 2 1 1

6 1 1 1

7 1 1

8 1 1

9 1 1 1 2

0

                        Count   Sum
All Data                24      26
Similar Data            7       8
Similar to All Ratio    29.2%   30.8%

              Similar   Dissimilar
Observed      8         18
Expected      6         20

Chi-Square    0.563
P-Value       0.453


1 2 3 4 5 6 7 8 9 0

1 2 1 1

2 1 2

3

4 1 1 1 1

5 1 1 1 1 1

6 1 1 1 1 1

7 1 1

8 1 1

9 1 1 1 2

0

                        Count   Sum
All Data                27      30
Similar Data            7       8
Similar to All Ratio    25.9%   26.7%

              Similar   Dissimilar
Observed      8         22
Expected      7         23

Chi-Square    0.080
P-Value       0.777


1 2 3 4 5 6 7 8 9 0

1 2 1 1

2 1 2

3

4 1 1 2 1

5 1 1 2 1 4

6 1 1 3 1 2

7 1 1 1 1

8 2 1

9 1 1 1 2

0

                        Count   Sum
All Data                29      41
Similar Data            8       15
Similar to All Ratio    27.6%   36.6%

              Similar   Dissimilar
Observed      15        26
Expected      10        31

Chi-Square    3.272
P-Value       0.070


1 2 3 4 5 6 7 8 9 0

1 1 1

2 1 1

3

4 1 1 1

5 1 1 2 1

6 1 2 1 2

7 2 1 1 1

8 2 1

9 1 1 1 2

0

                        Count   Sum
All Data                25      31
Similar Data            6       10
Similar to All Ratio    24.0%   32.3%

              Similar   Dissimilar
Observed      10        21
Expected      8         23

Chi-Square    1.025
P-Value       0.311
