
University of Surrey
School of Electronics, Computing & Mathematics
Department of Computing
Centre for Knowledge Management

Technical Report

Image Retrieval using Object-Relational Technology: A Case Study

Mariam Tariq, AI Group

February 2002


Image Retrieval Using Object-Relational Technology

“Images must be understood as a kind of language”

W. J. T. Mitchell, Iconology

CONTENTS

1 Introduction
2 Characteristics of Images
  2.1 Image Categorization and Indexing
  2.2 Specialist Images: Scene of Crime Images
3 Methods of Image Retrieval
  3.1 Text-Based Image Retrieval
  3.2 Content-Based Image Retrieval
  3.3 Hybrid/Integrated Approaches to Image Retrieval
4 Image Retrieval using Object-Relational Technology: A Case Study
  4.1 Storage and Manipulation of Multimedia Data
  4.2 Example of Retrieval by Visual Properties
  4.3 Example of Text-Based Retrieval
  4.4 A Comparison of Image- and Text-based Retrieval
5 Discussion
References

1 Introduction

Increasingly, digital image collections are becoming an intrinsic part of many professional and specialist groups in domains as disparate as art, medicine and scene of crime documentation, not to mention the plethora of image galleries now available on the World Wide Web. In response to this increase in image repositories, recent research is focussing on the most effective ways to organize, index and retrieve images. This report aims to review the current state of the art in image retrieval techniques and provide a case study of how images can be stored and retrieved using object-relational technology. Until recently, most information retrieval systems dealt with unstructured text repositories, with searches limited to matching keywords or full text, while most database systems provided for the storage and querying of highly structured but simple alphanumeric data. Now, with the need and ability for systems to store, utilize and distribute more complex data such as images, audio and spatial data, a whole new paradigm for a more intelligent search and retrieval solution has become necessary.


The existing methods and systems available for image retrieval purposes will be discussed with the aim of identifying their strengths and limitations, as well as raising some issues associated with multi-modal information extraction and retrieval. Firstly, the various types of images that might be used by different specialist groups are briefly discussed, with a more detailed overview of scene of crime images. An attempt is made to identify the various characteristics of the images that might be relevant for indexing purposes. The various methods currently being used for image retrieval are also discussed, from text-based retrieval to content-based retrieval by visual properties, and finally the more recent trend by some researchers to combine these two methods. Next, a case study is provided of Informix, an object-relational database system that provides support for the storage and retrieval of complex data based on DataBlade technology. Finally, a detailed discussion is provided to highlight the major issues and define the main criteria that need to be considered for the most effective image retrieval solution based on current technology.

2 Characteristics of Images

This section aims to identify the various characteristics an image may possess and how these characteristics can be utilized for indexing and querying purposes. Issues related to classification are briefly discussed, followed by a look at the different levels of information that could be related to an image. Next we talk about "specialist images", which have peculiar patterns and structures that can be exploited. Different groups of professionals deal with different sets of images, and the types of queries performed may vary greatly. It is important to study the way specialists may understand and describe images as well as to define what particular features they look for when searching for a particular image. Srihari (1995b, p409) suggests that "Understanding a picture can be informally defined as the process of identifying relevant people and objects". Some users might be looking for a specific object, a certain scene, a particular event or an aberration from a reference image. For example, a doctor might be searching for a slight variation in colour or texture of an image of a tissue sample to detect a tissue abnormality, while an architect might be looking for the north elevation of a specific building. An art critic might search for a more abstract theme in an image, such as the depiction of pain or love.

It may be argued that individual images have characteristic properties: an image is essentially a contrast, a contrast of colours, textures and shapes. These basic contrasts help humans to identify objects within an image. Hence it is important to identify the various characteristics that an image might have, so that certain classes of images with similar features may be grouped together and placed under a certain category. Depending on this categorization, images can be organized and stored in a fashion that makes retrieval and browsing more effective. There are numerous criteria on which images can be categorized, e.g. an image may be coloured or black & white, indoor or outdoor, represent a seascape or landscape, and so on. Some images may depict complex scenes, such as a group of players in a football match, while others just focus on one object, e.g. a ball. Unfortunately, due to the variations in different types of images, a standard scheme has not yet been developed for classifying them. Fine art cataloguers were among the first to attempt to organize and classify pictures of works of art. Two classification schemes that have been used frequently are ICONCLASS (http://www.iconclass.nl/) and the AAT thesaurus produced by the Getty (http://www.getty.edu/).

2.1 Image Categorization and Indexing

There can be various kinds of information associated with an image: visual information that is related directly to the properties of the image, such as the colours and shapes of objects present (e.g. green, brown, round), as well as non-visual information such as the identification of objects (e.g. football field, players, ball) and events (e.g. football match, world cup) or the format of the image (e.g. gif, jpeg). Objects may have meanings at different levels. For example, an image of an eagle may be described in terms of its colour and shape, which is difficult to do in words: brown (or black and white) in colour, with a streamlined shape? At a higher level it can be said to be the image of a bird, which may be identified as an eagle. In a more abstract sense it may, for example, depict strength to certain people, such as the Nigerians, who use it on their national emblem.

According to Del Bimbo (Del Bimbo, 1999) there may be three different types of information associated with images: content-independent metadata, such as the image format, author name and date; content-dependent metadata, which refers to low-level or perceptual features such as colour, shape, texture and spatial relationships; and content-descriptive metadata, which deals with semantic primitives, such as the identity and role of image objects or their relationship to real-world objects, and with impressions, which are more abstract concepts such as the significance of the depicted scenes. One can argue that an image can be indexed and retrieved at these three levels, namely perceptual, semantic and impressions, with an increasing level of abstraction when moving from perceptual to impressions. The semantic and impressions levels will most probably use text descriptors for indexing and retrieval. These various attributes (e.g. semantic relationships) can also be used to classify images into distinct categories, which could be arranged hierarchically to enable navigation through image collections.

Jaimes & Chang (2000) have discussed the need to index visual information at different levels, depending on the different ways users might want to access the information. There may be a variety of information associated with an image, establishing the need to determine what information is important for indexing and at which level of abstraction. The authors appear to distinguish between a percept, which refers to what our senses perceive, and a concept, an abstract or generic idea that is an interpretation of what is perceived, usually based on some background knowledge. The authors also make a distinction between syntax, which refers to the way visual elements are arranged, and semantics, which deals with the meaning of these elements; finally, a difference between general and visual concepts is made. Based on the above distinctions, they have developed a 10-level conceptual framework for indexing visual information at different levels (see Figure 1).

The 10-level structure can be divided into two sections: those features based on syntax/percepts, such as the type/technique of the image, global distribution (e.g. colour, shape and texture), local structure (colour, shape and texture of local components) and global composition (the particular arrangement of different items in the image); and those features based on semantics/visual concepts, such as generic, specific or abstract objects and scenes. There does not seem to be a provision in the model, though, for the different objects that make up an image to be recursively defined in terms of the objects' own visual properties. It should be noted that at the perceptual level indexing is completely automatic; at the semantic level there might be some automation, but some manual annotation may have to be done as well; at the abstract level it has to be completely manual. More knowledge is required as you move down these levels, i.e. from perceptual to abstract.


Figure 1 10-level conceptual framework proposed by Jaimes and Chang (2000, p5)

Jaimes and Chang's 10-level conceptual framework shares many features with Del Bimbo's broader classification. The following table shows how their classifications relate to each other.

GENERAL CLASSIFICATION OF IMAGES

Del Bimbo                        Jaimes & Chang
Content-Independent Metadata     Syntax/Percept             1. Type/Technique
Content-Dependent Metadata                                  2. Global Distribution
                                                            3. Local Structure
                                                            4. Global Composition
Content-Descriptive Metadata
  - Semantic primitives          Semantics/Visual Concept   5. Generic Objects
                                                            6. Generic Scene
                                                            7. Specific Objects
                                                            8. Specific Scene
  - Impressions                                             9. Abstract Objects
                                                            10. Abstract Scene

Table 1 Comparison of Del Bimbo and Jaimes & Chang's Indexing Levels for Images.

Jaimes & Chang's syntax/percept category seems to relate to Del Bimbo's content-independent metadata and content-dependent metadata categories. According to Jaimes & Chang, the Type/Technique category could be a description of the type of image, such as a painting or black and white photograph, in which case it would correspond with Del Bimbo's content-dependent metadata level. The content-independent metadata, for example the image format, could also be said to fall under the type and technique level. The semantics/visual concept category can be seen to correspond with the content-descriptive metadata, which is divided into semantic primitives and impressions. Since impressions deal with abstract concepts, they could map to levels 9 (abstract objects) and 10 (abstract scene) of Jaimes & Chang's framework. It is interesting to note that if we try to draw an analogy with linguistics, the syntax/percept level (content-independent and content-dependent metadata) is similar to the syntax and grammar of a language, the semantics/visual concept level (content-descriptive metadata) is akin to the semantics of a language, and impressions could be likened to pragmatics, where the context is important. The above table gives us a scheme for indexing all types of images in general. The next section goes on to discuss the characteristics of specialist images, with the aim of studying whether they have distinct features that could be used for indexing purposes in addition to the general features discussed above.

2.2 Specialist Images: Scene of Crime Images

Specialist images may be classified as a set of images that are used by professional or specialist groups such as architects, engineers, radiologists and crime scene investigators. It usually takes an expert in the specific field to fully comprehend the complex information that might be depicted by the images. Specialist images tend to be relatively constrained, which makes it easier to categorize them as compared to any random set of images. Consider, for example, a radiologist looking at a set of x-ray images. Each image varies from the others, but there are certain features that are common to all, such as depicting bone material and being on a grey scale. A specialist image generally focuses on objects that are specific to the domain, and a number of these objects will reoccur frequently in different images. One image may contain a number of different objects, which may be related to each other through a variety of meaning relations. The presence of one object may help to elaborate the meaning or role of another, or it might preclude the existence of the other.

Key meaning relationships include spatial relationships and part-whole relationships. Hence images can be said to have a number of 'hidden' features that are discernible only to a trained person. For instance, most people will appreciate an aesthetically pleasing building, but only a few of us will be able to discern the aesthetics from the architectural design of the building. Similarly, those of us who have studied physics or electronics at school may be able to identify the components of an electronic circuit, but it is only the well trained who can tell at a glance whether one or more components of the circuit are missing or misplaced. When experts look at a building, a circuit or an x-ray image, they see a unity which conveys a certain meaning that escapes the novice. A specialist image can thus be distinguished at the level of meaning and at the level of communicative intent or pragmatics; for example, in Del Bimbo's classification a specialist image may require rich content-descriptive metadata.

In this section we attempt to discuss in some detail the different types of images a scene of crime officer may take, with respect to the content of the image, in order to elicit some characteristic features that could be used for categorization. Crime scene photographs play a key role in most serious crime scene investigations. The main purpose of crime scene photographs is to document the scene of crime exactly as the crime site was found. These photographs are then used for detailed analysis by the investigation officers and later as evidence in court. Since every crime scene is unique, there is no standard rule for the number or types of photos that are taken, but a basic pattern is usually followed. The crime scene officer usually follows along the supposed 'path' of the crime, including the point of entry, the location of the crime, and the point of exit. There are generally three types of photographs that are taken (Staggs, 1997):

1) Overview photographs show the entire scene. For example, in the case of a burglary, photos might be taken of the four sides of the building, especially the elevation(s) that might contain the point of entry/exit, as well as the surroundings. In some cases bird's eye view or aerial photos might be taken as well to show a wider extent of the surroundings. If an incident took place in a room, photos of each face of the walls might be taken, as well as from each corner of the room. Sketches are usually made to indicate the location and angle of the camera;

2) Midrange photographs are taken of interesting objects to show the spatial relationships amongst them;

3) Lastly, close-up photographs are taken of relevant objects. These might be taken using a special one-to-one lens, or horizontal and vertical rulers might be placed to show the scale of the object; in this case the object is first photographed without the rulers. Sometimes overlapping photographs might be taken to provide a panoramic view of the crime scene.


Table 2 Examples of overview, midrange and close-up photographs of some ridge detail

The crime scene photographer uses special techniques, such as "painting with light", to photograph images at night or indoors when it is dark. The types of photographs taken will generally differ depending on the specific crime scene, e.g. car accidents, homicide, arson and so on. Photographs are often taken of fingerprint marks, footwear marks, tool marks and tyre marks, which can later be compared for identification. Bloodstain patterns and trajectories as well as ricochet marks might be photographed for further analysis by a forensic expert to determine the angle and height, etc., of a shot or stab wound (Staggs, 1997). Currently, scene of crime officers provide brief textual captions for the images, which may or may not have more detailed descriptions provided in the report. It should be noted that the officers have to be careful not to provide any interpretation related to the scene; for example, they tend to refer to a blood-like substance or red-coloured liquid instead of saying blood, even though that is what it obviously appears to be. The language used to describe the images is discussed in some detail in the next section.

Figure 2 Overview of possible image categories for scene of crime: general scenes, cars, body, bloodstain patterns, tyre marks, shoe marks, fingerprints and weapons

Figure 2 intends to provide an overview of some categories of images that are used in the scene of crime domain. It is important to know in what manner the SOCOs will be searching for the images. For fingerprints, shoe marks, tyre marks and bloodstain patterns it might be impossible to articulate a description in words, so a content-based search will definitely be needed in these cases. Fingerprint databases are already one of the most successful examples of CBIR and are being used extensively for identification purposes, such as AFIS (Automated Fingerprint Identification System) used in the USA and NAFIS (the National Automated Fingerprint Identification System, http://www.pito.org.uk/news/press/26apr01.asp) recently launched in the UK. All these categories have very distinct patterns, so CBIR works best. However, when it comes to general scene of crime pictures and images of bodies and weapons, a CBIR search has a lot of limitations, so there is a definite need for indexing by concepts in the form of text.

The table below shows an image taken at a scene of crime. An attempt has been made, purely on the intuition of this author, to classify the image based on Jaimes & Chang's stratified model. The content-independent information, such as it being a colour image of jpeg format, can go in the Type/Technique level. This image looks like it has been taken at midrange with the intention of showing the relationship between the body and the table. Since midrange is a type of image, that information can go to level 1 as well. Levels 2 to 4 involve low-level image processing techniques, which will be done automatically with no human intervention. Generic objects could be a 'table', 'body', 'gun' and 'floor', while a generic scene could be 'indoor'. One can then go on to define the objects more specifically in level 7, such as 'a wooden table', 'male body', 'Browning pistol'. The specific scene in question could be a crime scene. It is much harder to define abstract objects and scenes: one could say that the pool cue could have been used as a 'weapon' and the crime scene could be a 'murder scene'. Scene of crime officers will most probably not be searching for images at the abstract level.


Image: Body on floor showing adjacent table

Syntax/Percept
  Jpeg, colour, midrange                          1. Type/Technique
  (Automatic low-level processing                 2. Global Distribution
  techniques are used at these levels)            3. Local Structure
                                                  4. Global Composition

Semantics/Visual Concept
  Body, floor, table, can, gun                    5. Generic Objects
  Indoor                                          6. Generic Scene
  Wooden table, male body, Budweiser can,         7. Specific Objects
  Browning pistol, brown pool cue
  Crime scene                                     8. Specific Scene
  Weapon, deceased man                            9. Abstract Objects
  Murder scene                                    10. Abstract Scene

Table 3 Example of how Jaimes & Chang's classification can be used for a crime scene image

3 Methods of Image Retrieval

Initially, keyword-based searching was one of the most popular ways of retrieving images. Images were sometimes placed under specific categories, which were then organized into a hierarchy that users could navigate and browse to search for relevant images. Images were generally annotated and/or categorized manually, which may be very time consuming if the image repository is large, as well as having a degree of subjectivity. Content-based image retrieval techniques were developed in order to address these limitations. Here retrieval is based on the visual properties of an image, such as colour and texture. Though successful to a certain extent, this method has its own limitations. Recently there have been trends to combine these two approaches, on the premise that the two methods will complement each other. The next three sections discuss the text-based, content-based and integrated image retrieval methods in some detail.


3.1 Text-Based Image Retrieval

Keyword-based search has been amongst the oldest techniques used for indexing and retrieving information, initially just for text repositories and later for images as well. Experts or people familiar with a certain domain would manually annotate images with keywords they thought appropriate. These keywords could then be used for indexing the images and performing searches. Systems then started supporting more complex searches, such as Boolean searches, where various keywords can be combined using connectives to make a query more precise. This method is used by most search engines on the Web.
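As a simple illustration, a Boolean combination of keywords could be expressed in SQL over a hypothetical annotation schema (an images table and a keywords table holding one manually assigned term per row); this is only a sketch of the idea, not a query taken from any of the systems discussed here:

-- Find images annotated with both 'knife' and 'kitchen' but not with 'toy'.
SELECT i.img_id
FROM   images i
WHERE  EXISTS (SELECT 1 FROM keywords k
               WHERE k.img_id = i.img_id AND k.term = 'knife')
  AND  EXISTS (SELECT 1 FROM keywords k
               WHERE k.img_id = i.img_id AND k.term = 'kitchen')
  AND  NOT EXISTS (SELECT 1 FROM keywords k
                   WHERE k.img_id = i.img_id AND k.term = 'toy');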

Web-based search engines such as Yahoo and WebSeek (Chang 1997) place images into categories, which are hierarchically arranged so that people can navigate and browse through an image collection. Recently, automatic indexing has been attempted by using text in proximity to an image on a page in the WWW to generate keywords. Frequency of occurrence is used as a measure to determine relevancy, with, for example, more weight being given to words in title blocks and words in proximity to the image in multimedia documents. Different methods are used to automatically produce a set of relevant index terms while limiting the number of false terms and repetitions, such as the elimination of stopwords, the use of stemming, and the identification of noun groups to eliminate verbs and adjectives, as well as exploiting the structure of a document if present (Yates & Neto, 1999). WordNet is being used quite extensively for query expansion purposes; for example, Flank (1998) uses weights for the different lexical semantic relationships present in WordNet, such as hypernymy and meronymy, for semantic expansion. Recently, some researchers have started using NLP techniques to process captions associated with images to aid in the retrieval process. Rose et al (1999) extract dependency structures from image captions as well as from queries, and a matching algorithm is used to compare the two (they call this phrase matching). The match is weighted and combined with keyword matching to provide an overall score. Their system is limited to working with short captions. Captions averaging 9 words were used for the evaluation, which was conducted by 2 judges and gave an average precision of 85.5% for keyword matching and 93.5% for phrase matching at 10% recall.


3.2 Content-Based Image Retrieval

One can argue that content-based image retrieval (CBIR) systems work at the perceptual level. Primitive features such as colour, shape and texture are automatically extracted using certain algorithms for a stored set of images. The user supplies a sample visual query whose features are extracted and compared with those of the stored images, and the images most similar to the query image are returned. This process is known as query by visual example, where the example may be an image, a sketch produced by the user, or, in the case of retrieval by colour, patches of relevant colour strategically located by the user. This type of retrieval is known as similarity-based retrieval and differs from matching in that images in the database are reordered according to their measured similarity to a query example. Similarity-based retrieval is concerned with ranking rather than classification. The user can interact with the system to search for relevant images by recursively defining and refining a query using a mechanism known as relevance feedback (Del Bimbo 1999).
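In other words, if F(x) denotes the feature vector extracted from an image x and S is some similarity measure over feature vectors, a similarity-based query with example q does not test for an exact match but returns the stored images ordered by their score. A generic formulation (a sketch, not a formula taken from any particular system) is:

\text{return } x_{1}, x_{2}, \ldots \text{ such that } S\bigl(F(q), F(x_{1})\bigr) \ge S\bigl(F(q), F(x_{2})\bigr) \ge \cdots \ge \theta,

where \theta is an optional similarity threshold below which images are not returned.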

This section will very briefly discuss the different methods used in image processing to compute the various visual properties. Colour is one of the most popular visual features used for indexing. The most common technique used is the colour histogram, which identifies different colour channels in an image and is constructed by counting the number of pixels belonging to each channel. The colour histogram has a number of variations, such as the cumulative colour histogram, colour moments and colour sets. When a large set of images is used, using a global colour scheme can result in a large number of false positives. Due to this limitation, a colour layout method can be used, which divides the image up into different regions and then computes the colour features of each region. The problem here can be that the segmentation may not work very reliably.
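To make this concrete, one common formulation (a sketch of the general idea rather than the exact measure used by any particular system) defines the normalised histogram of an image I over n colour bins and compares two images by histogram intersection:

h_{I}(c) = \frac{1}{N}\bigl|\{\, p \in I : \mathrm{bin}(p) = c \,\}\bigr|, \qquad S(I, Q) = \sum_{c=1}^{n} \min\bigl(h_{I}(c),\, h_{Q}(c)\bigr),

where N is the number of pixels in I; S(I, Q) equals 1 for identical histograms and approaches 0 as the colour distributions diverge.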

Shape is one of the key features used to identify an object. In image processing there are mainly two different methods of representing shapes: boundary-based methods, such as the Fourier descriptor, which define just the outer contour of a shape, and region-based methods, such as moment invariants, which compute values over the entire shape region. Invariance is an important property that deals with the issue of shape representations remaining unaltered under object transformations such as translation, rotation and scaling. Due to this transformation issue, automatic segmentation based on shape features is difficult if objects need to be identified reliably in images.


Texture is a property of most surfaces and is characterized by differences in brightness and intensity. It is an important feature used to distinguish between image patches of the same colour, such as the sky and the sea. Examples of visual texture properties used are coarseness, contrast, regularity and directionality. The most popular method used is the wavelet transform, which has been combined with other methods, such as Kohonen maps, for improved results (Del Bimbo 1999, Chang 1999).

Spatial relationships between objects in images can be useful in identifying objects as well as providing information on how different objects relate to each other. These relationships can be directional or topological (Del Bimbo 1999). Directional relationships consider the relative directions of different objects, such as right of, below and above, as well as the distance between objects, which may be defined using the Euclidean metric. Topological relationships, which are invariant under various transformations, use set-theoretical concepts such as adjacency, overlapping, disjunction and containment to determine relationships between objects. If the different visual objects in an image can be identified, the spatial relationships between them can be determined using a spatial parser. Similarly, natural language can describe spatial relationships between a reference object and an unknown object using relational propositions, which can then be used to locate the unidentified object. Srihari (1995b) has used this information, present in photographic captions, to aid in identifying the people present.
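For instance, the distance component of a directional relationship between two objects, as mentioned above, is typically just the Euclidean metric between their centroids (x_{1}, y_{1}) and (x_{2}, y_{2}):

d = \sqrt{(x_{1} - x_{2})^{2} + (y_{1} - y_{2})^{2}}.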

A number of CBIR systems are currently available, which may be used either independently or incorporated into database modules (see section 4). At the time of writing this report, three commercial systems are available: Visual RetrievalWare by Excalibur Technologies, Virage and QBIC. The QBIC system (Maybury 1999), developed at IBM, is one of the earliest commercial CBIR systems and has had a large influence on systems developed later using the same techniques and ideas. QBIC supports queries on image and video databases based on sample images, user-defined sketches, and user-selected colour and texture patterns. The QBIC system has two main components: a database and a visual query language. When populating the database, the images are processed to extract the relevant features, which are then stored in the database. The query language is used to generate a graphical query whose features can be used to search for similar images in the database. Most of the other systems are experimental systems that are available on the Web for demonstration purposes. Chang et al (1999) provide an insightful discussion of the current techniques and issues involved in the CBIR approach, while Veltkamp & Tanase (2000) provide an extensive survey of 39 systems based on a detailed framework of the relevant features.


3.3 Hybrid/Integrated Approaches to Image Retrieval

A large amount of research has been done on developing models and systems in the fields of vision processing and natural language processing, but until recently there has been little interest in integrating these two areas. Srihari was one of the first researchers to contemplate the notion that pictures could be understood and retrieved more effectively if the two modalities of vision and language are combined. This section uses her work as a basis to discuss similar research trends as well as to identify some of the main issues that need to be considered when building a multi-modal system (Srihari 1995a).

The central issue addressed by Srihari's research is the correspondence problem (Srihari, 1995a), i.e. how visual information can be correlated with words (it is important to note that words could refer to sentences, events, etc., and not just nouns). She proposes developing models for vision and language which, when combined with domain-specific knowledge, would enable a mapping from one modality to another. An interesting issue to investigate here is how the different semantics of vision and language relate to each other. Srihari has developed a system called PICTION which extends the notion of using text associated with images, known as collateral text, in scene understanding by defining visual semantics: a theory of how to systematically extract and interpret the visual information present in language. This is carried out at the lexical, syntactic and semantic levels, as well as by interpreting spatial prepositions. Image interpretation is carried out to derive the meaning of a scene in terms of the objects present and their interrelationships.

According to Srihari, visual descriptions can be organised in a hierarchy similar to textual descriptions, and she has extended WordNet by superimposing a visual hierarchy with links such as visual is_a to represent hyponymy and visual part_of for meronymy. Srihari's research is, however, limited to photographs of people and certain types of structured information available in the respective photographic captions. The idea needs to be further extended for use with any random text or language. Another point she brings up is that of co-referencing. This involves the ability to determine the pictorial object being referred to by an entity described in a text. It should be noted here that the word being used is entity: there will always be a set of words that might not have a corresponding image (e.g. certain adjectives). As Srihari has pointed out, the correspondence between words and images is generally many-to-many. It should be considered that words and images could be inter-related amongst themselves as well as with each other in various ways.


Paek et al. (1999) use a different methodology for the integration of linguistic and visual-based approaches for the labelling and classification of indoor/outdoor photographs in the domain of terrorist news. They use a term frequency inverse document frequency (TF*IDF) vector-based approach for the text, using different types of words extracted from different amounts of text, such as the caption and the full article. On the image side they use an object frequency inverse image frequency (OF*IIF) vector-based approach to classifying objects that are defined by the clustering of automatically segmented regions of training images. By combining these two vectors they achieve a classification accuracy around 12% higher than that of other existing methods.
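For reference, the standard TF*IDF weight they build on is commonly defined as

w_{t,d} = \mathrm{tf}_{t,d} \cdot \log\frac{N}{n_{t}},

where \mathrm{tf}_{t,d} is the frequency of term t in document d, N is the number of documents and n_{t} is the number of documents containing t; the OF*IIF weight is the image-side analogue, with region-cluster ("object") frequencies in an image taking the place of term frequencies in a document. This is the common textbook formulation, not necessarily the exact variant used by Paek et al.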

Similarly, Sclaroff et al (1999) have proposed a system that combines visual and textual statistics in the form of a single vector that can be used to search for images on the Web. The novel idea here was to use Latent Semantic Indexing (LSI) on the text side and colour and orientation histograms for the images. LSI provides some advantages over the classical keyword method used by most search engines in that it implicitly covers the issues of synonyms, word senses, term omissions, etc. The text used is either taken from the URL and/or heuristically determined from its proximity to the image in a web document. Their experiments showed that better performance was achieved by a combination of the text and visual vectors.

The research discussed above mainly focuses on integrated indexes and not on knowledge structures that model both visual and linguistic features. Benitez et al have developed a knowledge representation framework called MediaNet that represents both semantic and perceptual information related to multimedia data. A semantic network is used where the concepts (nodes) refer to the semantic notion of what an object is. Each concept may have a number of text representations, such as "man", "human", "homo", as well as audio and visual representations. Concepts can be linked together by various relationships such as hyponymy, meronymy, entailment, etc. Concepts with their textual representations were created using WordNet, which also generated all the senses and synonyms for each word. WordNet was also used to automatically generate all the required relationships. Visual representations were automatically generated using colour and texture feature extraction tools.


4 Image Retrieval using Object-Relational Technology: A Case Study

This section aims to provide an overview of the current state of the art in database technology within the context of image retrieval. Databases have become an essential and integral part of most modern enterprises. A database system consists of a database management system (DBMS) and one or more databases. A database is a collection of interrelated data, which represents information concerning some real-world enterprise, such as a scene of crime information system. The DBMS is a software system that handles the access, storage and maintenance of the data, as well as functioning as the interface between the users and the database. There has been a steady evolution of database systems in response to changes in the type and amount of data as well as advances in hardware support. For many years relational database management systems (1980s to present) have dominated the market, being optimised for storing and querying simple alphanumeric data, which was sufficient for traditional applications. However, modern database applications increasingly need to store and manipulate more complex data types, such as images, video, spatial and time series data. This need led to the advent of object-oriented database technology (early 1990s) based on the object-oriented programming paradigm, which only acquired a niche market; the main reason is that OODBMSs are optimised for handling complex objects but are not query-oriented. Recently these two technologies have been merged, resulting in object-relational database technology, with the WWW being the major driving force behind it.

ORDBMSs tend to be query-oriented on complex data, hence taking the best of both worlds, as well as being downward compatible with relational systems. Most of the mainstream relational vendors, such as Oracle, Informix, IBM and Sybase, have adopted object-relational technology. As yet, the object-relational model and its functionality are based on the SQL-3 standard. Some ORDBMS features include support for inheritance and encapsulation, the definition of domain-specific data types with related routines and functions, and support for smart large objects.
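For example, the inheritance of user-defined types and typed tables can be sketched in Informix-style SQL roughly as follows; the type and table names here are hypothetical and purely illustrative:

-- A row type for generic evidence records and a subtype for photographic evidence.
CREATE ROW TYPE evidence_t (ref_no VARCHAR(20), description LVARCHAR);
CREATE ROW TYPE photo_evidence_t (taken_by VARCHAR(40)) UNDER evidence_t;

-- Typed tables arranged in the corresponding table hierarchy.
CREATE TABLE evidence OF TYPE evidence_t;
CREATE TABLE photo_evidence OF TYPE photo_evidence_t UNDER evidence;

-- A query against the supertable also returns rows stored in the subtable.
SELECT ref_no, description FROM evidence;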

4.1 Storage and Manipulation of Multimedia Data

As discussed in section 3.2, certain software companies such as Excalibur Technologies and Virage have developed retrieval solutions for image, audio and video data. An interesting trend recently has been to integrate such technologies with database systems in the form of plug-in modules (for example, Informix's extensible DataBlade module technology enables the movement of business logic from the client to the server). A DataBlade module is a software package that consists of a collection of domain-specific data types with their related functions, which can be plugged into the Informix dynamic server, enabling it to provide the same level of support for the new data types as it provides for the built-in ones. A DataBlade module may consist of a number of the following components: user-defined data types, which may be created as row, distinct or opaque types and whose values can be stored, manipulated using queries or routines, and passed as arguments to routines; a collection of routines which operate on the data types, providing new domain-specific operations that extend the processing and aggregation functions provided by the server; interfaces, which are collections of routines providing a standard for DataBlade development and use; tables and indexes for storing and accessing data directly from a database; access methods defined by the user to operate on tables and indexes; and client code, which may provide a user interface so that users can query, modify and display the new data types.
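As a small illustration of the routine component: DataBlade routines are typically implemented in an external language such as C and registered with the server, but the idea can be sketched with a simple SPL (Stored Procedure Language) function over the hypothetical evidence table introduced earlier; the names here are illustrative only.

-- A user-defined routine providing a simple domain-specific operation.
CREATE FUNCTION evidence_label(ref_no VARCHAR(20), description LVARCHAR)
    RETURNING LVARCHAR;
    RETURN ref_no || ': ' || description;
END FUNCTION;

-- Once registered, the routine can be used directly inside queries.
SELECT evidence_label(ref_no, description) FROM evidence;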

Informix provides a tool known as the DataBlade Developers Kit (DBDK) to help in creating a DataBlade. Examples of some of the existing DataBlades are those that provide support for multimedia data, such as the Excalibur Image DataBlade (discussed in the next section), the Video Foundation DataBlade, the Spatial DataBlade for geographical information, and the Excalibur and Verity Text DataBlades for document management. As mentioned in the previous section, the Informix server provides support for smart large objects such as images, audio and Microsoft Word files, which are used by these DataBlades. Various DataBlade packages can be used together if a combination of data types is needed for an application, e.g. using the Image and Text DataBlades together to store images with captions, so that searches can be done on visual properties as well as keywords. Other database vendors provide similar technologies, such as DB2 Extenders (e.g. the Image and Video Extenders) and Oracle's Data Cartridges.

Figure 3 Various components of a DataBlade module (source: Developing DataBlade Modules for Informix Internet Foundation, Informix white paper, 2000, http://www.informix.com)


4.2 Example of Retrieval by Visual Properties

Until recently, most photographic data such as scene of crime photos were stored as hard copies on film, with an ID or name-tag that might be stored on a computer, similar to an art cataloguing system. With advances in hardware and communications support, images could then be stored as files on computer systems. This is still common practice in most image databases and has the major disadvantage that the files can be moved or tampered with while the database table still holds a reference to them, or that a wrong reference is stored in the table. Now most major DBMSs provide support for Binary Large Objects (BLOBs), so that images can be stored in binary form directly in the database, which provides the advantage of integrity and security. The Excalibur Image DataBlade provides data types and functions that allow the storage and retrieval of images. Images of all common types and formats are supported, from BMPs to JPEGs and TIFFs. Images can be stored as external files or BLOBs, and a feature extractor function can be used to extract features based on colour, shape, texture, brightness structure, colour structure and aspect ratio (see Table 4), which are then stored as a combined feature vector. A trigger can be used to update the feature vectors if an image is changed.

FEATURE                 DESCRIPTION
Colour content          Measure of the colours in an image
Shape content           Measure of the relative orientation, curvature, and contrast of lines in an image
Texture content         Measure, on a small scale, of the flow and roughness of an image
Brightness structure    Measure of the brightness at each point in the image
Colour structure        Measure of the hue, saturation, and brightness at each point in the image
Aspect ratio            Measure of the ratio of the width to the height of the image

Table 4 Properties of an image used for the feature vector (Excalibur Image DataBlade Users Guide, http://www.informix.com/answers/english/docs/datablade/5356.pdf)
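To make the storage side concrete, the following is a minimal sketch of how such a table might be declared and populated. IfdImgDescFromFile and GetFeatureVector are the DataBlade functions used in the query of Figure 4 below; the column type names and the file path, however, are assumptions made purely for illustration.

-- Sketch only: column type names and path are assumed, not taken from the DataBlade documentation.
CREATE TABLE gallery (
    img_id  SERIAL,            -- automatically generated numeric identifier
    img     IfdImgDesc,        -- image descriptor (assumed type name)
    fv      IfdFeatureVector   -- combined feature vector (assumed type name)
);

-- Load an image from a file and store its extracted feature vector alongside it.
INSERT INTO gallery (img, fv)
VALUES (IfdImgDescFromFile('\images\pistol-1.gif'),
        GetFeatureVector(IfdImgDescFromFile('\images\pistol-1.gif')));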

A content-based search (see section 3.2) can be carried out by providing a sample image and searching for similar images in the database based on their feature vectors. Searches can be made on a combination of all the image features, or one or more features can be combined. For example, you can search for similarity by shape alone, or combine colour and shape, by using flags which can be turned on (1 or above) or off (0) using binary values. Different relative weights can also be given to the properties, such as colour: 2, shape: 6, texture: 4, brightness: 3, and so on. A 'Resembles' function can be used to provide a similarity threshold for the number of returned images; for example, if a resemblance of 0.80 is used then images with less than 80% similarity will not be returned. The following example shows three features (colour, shape and texture, from left to right) weighted equally, with a 75% resemblance threshold, and the resulting images being given a ranking.

SELECT img_id, rank FROM gallery
WHERE Resembles(fv, GetFeatureVector(IfdImgDescFromFile('\images\deadBody.gif')),
                0.75, 1, 1, 1, 0, 0, 0, rank #REAL)
ORDER BY rank;

Figure 4 The different components of an image feature-based search query: the stored feature vector, the feature vector of the query image, the similarity threshold, and the weights for colour content, shape content, texture content, brightness structure, colour structure and aspect ratio
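As an illustration of the relative weighting described above, the equal weights in the query of Figure 4 could be replaced with, for example, colour 2, shape 6, texture 4 and brightness 3, with colour structure and aspect ratio switched off. This is a sketch based on the argument order shown in Figure 4, not a query taken from the DataBlade documentation:

SELECT img_id, rank FROM gallery
WHERE Resembles(fv, GetFeatureVector(IfdImgDescFromFile('\images\deadBody.gif')),
                0.75, 2, 6, 4, 3, 0, 0, rank #REAL)
ORDER BY rank;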

A selection of different types of images from the scene of crime domain was used to test the image retrieval capabilities of the DataBlade. Different categories of relevant images were used, based on the types of photographs taken at a crime scene, such as close-up shots of single objects and indoor and outdoor photographs. The images were divided into the following four categories based on their complexity:

1. One object/image – same object, different rotations. Close-up photographs are often taken of the same object from various sides or angles. The background would usually be the same in this case.

2. One object/image – different objects. In this case close-up photographs are taken of different objects, which may or may not belong to the same set (for example all guns or knives). The background may vary.

3. Many objects/image – indoor. These are photographs taken inside a room in a building where a crime may have been committed, and may be mid-range or overview photographs.

4. Many objects/image – outdoor. These are overview or mid-range photographs taken outside, if a crime has been committed outdoors, or of the approach path to a building or car in which the crime may have been committed.

The next few sections discuss an example scenario for each of the four cases presented above. For every scenario, each image is used as a query image and the similarity ranking of the retrieved images is noted.

SEARCH SCENARIO ONE
(One Object/Image – Same Object)

The purpose of this test was to see what range of rankings would be returned in the search if an object is rotated in different directions. A pistol was chosen as the required object and 6 images of the object at different angles were used, as shown in Table 5a. The images have been given the IDs Pistol-1 to Pistol-6.

Table 5a Images of a Browning pistol at different angles and rotations (Pistol-1 to Pistol-6)


Table 5b shows the percentage similarity of each of the images to Pistol-1, which was chosen as the query image. It is interesting to note that even though images Pistol-1 and Pistol-6 are of the same pistol, they are shown to have a 20% difference in similarity because they show different sides of the pistol.

Image   1      2      3      4      5      6
Rank    100%   97.5%  96.9%  94.7%  83.3%  79.4%

Table 5b Relative similarity and order of retrieval of the Pistol images when Pistol-1 is given as the sample query image

The matrix below was created to see the variations in ranks when each image of the set of 6 above was used as the search image. It can be seen from the matrix that even though the images are of the same object, they are returned with different rankings due to the different rotations. In certain cases, if the angle of rotation is important, then the difference in ranking is justified; otherwise it gives the indication that the images are of different objects.

1 2 3 4 5 6

Pistol-1 1.0000

Pistol-2 0.9747 1.0000

Pistol-3 0.9688 0.9335 1.0000

Pistol-4 0.9467 0.9823 0.9424 1.0000

Pistol-5 0.8327 0.8324 0.8242 0.8088 1.0000

Pistol-6 0.7940 0.7915 0.8016 0.7859 0.7367 1.0000

Table 5c Matrix showing similarity ranking of Single Object images

SEARCH SCENARIO TWO
(One Object/Image – Different Objects)

The images shown below are of 1-, 2- and 3-bladed knives with a black background and one knife per image. The images have been given the IDs Knife-1 to Knife-9. The images are similar in that they are all of pocketknives, but vary in size, in the number of blades, in the angle at which the blades are open, as well as in the colour of the handle. The aim was to see the efficacy of the ranking based on the equally weighted features of colour, texture and shape.

Table 6a Images of 1-, 2- and 3-bladed knives with the same background (Knife-1 to Knife-9)

The result set returned when Knife-1 was given as the query image is shown in the table below. It is interesting to note that the difference in similarity between Knife-1 and Knife-9 is 21.5%, which is very close to the difference in the previous example with the pistol rotated to a different side, even though these two images of the knives are much more dissimilar.

Image   1      2      3      4      5      6      7      8      9
Rank    100%   85.6%  85.5%  85.2%  81.8%  79.7%  79.3%  78.5%  78.5%

Table 6b Relative similarity and order of retrieval of the Knife images when Knife-1 is given as the sample query image


The matrix below was constructed, similar to the one for the pistol, to see how the rank varied when each of the 9 knife images was used as the search image. It is interesting to note that for each of the 2-bladed knives (Knife-2, 3 & 4) the other two were retrieved as the closest match. The 1-bladed knives (Knife-6, 7 & 8) had the closest similarity to Knife-9, but when 1-bladed Knife-5 was given as the query image, 2-bladed Knife-6 was returned as the closest match. The similarity in this case was based on the type of handle and the angle of the blade and not on the number of blades.

1 2 3 4 5 6 7 8 9

Knife-1 1.0000

Knife-2 0.8563 1.0000

Knife-3 0.8546 0.8949 1.0000

Knife-4 0.8517 0.8847 0.9209 1.0000

Knife-5 0.8180 0.8548 0.8671 0.8708 1.0000

Knife-6 0.7965 0.8091 0.7903 0.7968 0.8021 1.0000

Knife-7 0.7925 0.8044 0.8006 0.7924 0.7824 0.8403 1.0000

Knife-8 0.7853 0.8323 0.8139 0.8143 0.8471 0.8059 0.8112 1.0000

Knife-9 0.7852 0.7825 0.7687 0.7687 0.7854 0.8414 0.8204 0.8060 1.0000

Table 6c Matrix showing similarity ranking of Knife images

SEARCH SCENARIO THREE
(Many Objects/Image – Indoor Scene)

In this case we have taken 9 images from an indoor mock crime scene. The images show a body on the floor near a table with various items on it, as well as some images of ridge detail. These images are much more complex than those in the previous two scenarios in that they have many objects in one image, as well as differences in background due to the different colours and textures of the floor, the walls and the furniture. This complexity makes it difficult for the user to search for an item present in an image with many objects. The images have been labelled Indoor-1 to Indoor-9 and are shown in Table 7a.


Table 7a Images of an indoor crime scene (Indoor-1 to Indoor-9)

Table 7b shows the percentage similarity of the images above to the image Indoor-1. From the result set it can be seen that Indoor-9 is the most dissimilar, which is to be expected. It is not clear, though, on the basis of the objects present, why Indoor-6 should be a closer match than Indoor-2 or Indoor-4.

Image   1      6      4      7      5      2    8      3      9
Rank    100%   74.2%  70.5%  69.3%  69.3%  67%  61.5%  58.6%  53.2%


Table 7b Relative similarity and order of retrieval of the Indoor images when Indoor-1 is given as the sample query image

It can be observed in the matrix below that Indoor-8 is the closest match to Indoor-9, which is good, but it does not hold the other way round. The most similar image to Indoor-8 is Indoor-7, which would be expected since Indoor-8 is a close-up shot of Indoor-7. Images Indoor-1 and Indoor-4 were retrieved as the closest matches to Indoor-5, indicating that the similarity was mainly based on the expanse of the floor.

            1        2        3        4        5        6        7        8        9
Indoor-1    1.0000   0.6695   0.5861   0.7047   0.6926   0.7418   0.6931   0.6148   0.5321
Indoor-2    0.6695   1.0000   0.6135   0.6843   0.6148   0.6575   0.6451   0.6271   0.5936
Indoor-3    0.5861   0.6135   1.0000   0.6047   0.6196   0.5846   0.5955   0.5992   0.6039
Indoor-4    0.7047   0.6843   0.6047   1.0000   0.6748   0.7366   0.7183   0.6395   0.5978
Indoor-5    0.6926   0.6148   0.6196   0.6748   1.0000   0.6641   0.6434   0.5328   0.5877
Indoor-6    0.7418   0.6575   0.5846   0.7366   0.6641   1.0000   0.7397   0.6313   0.5982
Indoor-7    0.6931   0.6451   0.5955   0.7183   0.6434   0.7397   1.0000   0.6454   0.5955
Indoor-8    0.6148   0.6271   0.5992   0.6395   0.5328   0.6313   0.6454   1.0000   0.6057
Indoor-9    0.5321   0.5936   0.6039   0.5978   0.5877   0.5982   0.5955   0.6057   1.0000

Table 7c Matrix showing similarity ranking of the Indoor images

SEARCH SCENARIO FOUR

(Many Objects/Image - Outdoor Scene)

Outdoor images might be even more complex than indoor images. The images in Table 8a, which are labelled Outdoor-1 to Outdoor-9, can be seen to have "noise" in the background, such as leaves, grass and bushes, which might make it even more difficult to identify objects of interest. The images are of a mock crime scene set up at a church where a body was found outdoors. The overall view of the church is shown first, and the photographer then zooms in to the exact location where the body was found; close-up shots were taken of the body and other objects of interest. As in the examples above, the percentage similarity of all the images to image Outdoor-1 is shown in Table 8b, while the matrix of all the images is shown in Table 8c.


Outdoor-1    Outdoor-2    Outdoor-3
Outdoor-4    Outdoor-5    Outdoor-6
Outdoor-7    Outdoor-8    Outdoor-9

Table 8a Images of an outdoor crime scene

The table below shows that images 2, 3 and 4 are the most similar to Outdoor-1, which is to be expected since they all show part of the building and some of the nearby landscape.

Image       1      2      3      4      8      9      7      6      5
Similarity  100%   76.7%  71.7%  71.6%  71.3%  70.8%  70%    67.9%  67.1%

Table 8b Relative similarity and order of retrieval of the Outdoor images when Outdoor-1 is given as the sample query image

In the matrix below it can be observed that the closest match to Outdoor-9, which is a close-up shot of the body, was Outdoor-4, whereas one would expect Outdoor-5 or Outdoor-8 to be more similar since they show a closer view of the body.

             1        2        3        4        5        6        7        8        9
Outdoor-1    1.0000   0.7669   0.7171   0.7163   0.6705   0.6785   0.6992   0.7132   0.7075
Outdoor-2    0.7669   1.0000   0.8672   0.8031   0.7105   0.7026   0.7947   0.7873   0.7785
Outdoor-3    0.7171   0.8672   1.0000   0.8328   0.7475   0.7178   0.7996   0.8105   0.7833
Outdoor-4    0.7163   0.8031   0.8328   1.0000   0.8285   0.8132   0.8507   0.8450   0.8447
Outdoor-5    0.6705   0.7105   0.7475   0.8285   1.0000   0.8181   0.8051   0.8349   0.8163
Outdoor-6    0.6785   0.7026   0.7178   0.8132   0.8181   1.0000   0.8052   0.7950   0.8104
Outdoor-7    0.6992   0.7947   0.7996   0.8507   0.8051   0.8052   1.0000   0.8598   0.8326
Outdoor-8    0.7132   0.7873   0.8105   0.8450   0.8349   0.7950   0.8598   1.0000   0.8396
Outdoor-9    0.7075   0.7785   0.7833   0.8447   0.8163   0.8104   0.8326   0.8396   1.0000

Table 8c Matrix showing similarity ranking of the Outdoor images

4.3 Example of Text-Based Retrieval

Simple keyword searches have long been supported both by databases and by information retrieval systems. Traditional relational databases support searching for keywords or phrases in columns containing text through the LIKE or MATCHES clause. Such a search matches the phrase or keyword exactly, without accounting for misspellings, alternative spellings or similar phrases. The Text Search DataBlade module extends the Informix Dynamic Server by providing a set of data types and routines that enable much more sophisticated searches, such as fuzzy searching, proximity searching, the use of thesauri for query expansion, and stop-word lists to improve efficiency.
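As a point of comparison, the sketch below shows a plain LIKE query; it is a minimal illustration that assumes the mock_scene2 table and caption column used in the examples later in this section.

SELECT img_id, caption
FROM mock_scene2
WHERE caption LIKE '%fingerprint%';
-- This matches only the literal substring: a caption reading "ridge detail",
-- or a misspelling such as "fingreprint", would not be returned.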


Text can be stored as various types, such as CHAR, CLOB or BLOB, depending on the size and type of the text. An operator, etx_contains(), is defined by the DataBlade and instructs the server which type of search to perform. A special index, which the search engine then uses, is created for each table on which searching is required; various parameters can be specified for the index, such as the inclusion of a synonym list and the exclusion of stop words. The simplest search that can be performed is the keyword search, which can be used together with a synonym list if required. Boolean search enables keywords to be combined to form more complex queries. Exact or approximate phrase searches can also be performed (a phrase being a query string that contains more than one word and is treated by the search engine as a single unit). The DataBlade also supports proximity searches, where a number can be specified as the maximum number of words allowed between the specified words in the search phrase. Finally, fuzzy or pattern searches can be performed, where words that closely resemble the keywords given in the search query are also considered.
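A minimal sketch of such a table and index declaration is given below. The column layout mirrors the examples that follow; the IfdImgDesc and IfdFeatVect type names, the etx_char_ops operator class and the index parameters are assumptions based on the Excalibur Image and Text DataBlade conventions, so the exact names and options may differ between versions.

CREATE TABLE mock_scene2 (
    img_id   CHAR(10),       -- e.g. 'CS-6'
    caption  CHAR(80),       -- free-text caption supplied by the SoCO
    image    IfdImgDesc,     -- image descriptor (assumed type name)
    fv       IfdFeatVect     -- pre-computed feature vector used by Resembles() (assumed type name)
);

CREATE INDEX caption_etx_idx
    ON mock_scene2 (caption etx_char_ops)
    USING etx (WORD_SUPPORT = 'PATTERN',    -- enables fuzzy/pattern searches
               PHRASE_SUPPORT = 'MAXIMUM',  -- enables exact and approximate phrase searches
               STOPWORD_LIST = 'stopwords.txt');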

A synonym list can be extremely helpful, especially in the Scene of Crime domain, where the use of differing terminologies and acronyms amongst police forces is very common. For example, 'lift' and 'ridge detail' are common alternatives to 'fingerprint'; an 'index' is the same as a 'number plate' or 'car registration number'; 'in situ' can be used to mean at the 'crime site' or 'crime scene'; and, when describing a (dead) person, 'Caucasian woman' is the same as 'white female' or the code 'IC1'. Similarly, many acronyms are used, such as MO for modus operandi and DOA for dead on arrival. The Text DataBlade allows a domain-specific synonym list to be created, but the problem here is that it is not possible to have compound terms in the list, and the scene of crime terminology is full of compound terms, as discussed in the next chapter. The table below shows a few examples from the synonym list created for the forensic science corpus.

FINGERPRINT   LIFT
LIFT          FINGERPRINT
GUN           PISTOL
PISTOL        GUN
DRUGS         NARCOTICS CONTRABAND
NARCOTICS     DRUGS CONTRABAND
CONTRABAND    NARCOTICS DRUGS
MARIJUANA     CONTRABAND NARCOTICS
ROCK          STONE
VICTIM        BODY
BODY          VICTIM

Table 9 Example of a synonym list


Table 10 shows a list of ten image IDs with their captions, which were used as the data set to test the use of the Text DataBlade. The images themselves are shown in Appendix B.

IMAGE ID   CAPTION OF THE IMAGE
CS-1       Overall view of crime scene
CS-2       Beer cans
CS-3       Rock stained with blood
CS-4       Close-up of clothes on victim
CS-5       Drugs found near body of victim
CS-6       Gas container found near body of victim
CS-7       Gun found near body of victim
CS-8       Fingerprint from body of victim
CS-9       Close-up of rock stained with blood
CS-10      Close-up of gas-container

Table 10 Image IDs with their captions
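The sketch below shows how one row of this data set might have been loaded, reusing the IfdImgDescFromFile() and GetFeatureVector() routines that appear in the queries later in this section; the column layout and the file path are assumptions made for illustration.

INSERT INTO mock_scene2 (img_id, caption, image, fv)
VALUES ('CS-3',
        'Rock stained with blood',
        IfdImgDescFromFile('C:\images\Album4\Crime-Scenes\Mock-Scene\rock.jpg'),                    -- hypothetical path
        GetFeatureVector(IfdImgDescFromFile('C:\images\Album4\Crime-Scenes\Mock-Scene\rock.jpg')));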

The table below shows some example queries together with their output. The first query demonstrates the use of the synonym list: even though the search keyword was 'lift', the row with 'fingerprint' in its caption was returned. The other two queries illustrate the use of a proximity search as compared with an exact search.

1. Example of query showing the use of a synonym list

SELECT img_id, caption FROM mock_scene2
WHERE etx_contains(caption, Row('lift', 'MATCH_SYNONYM = FS-SynonymList-1'));

img_id    caption
CS-8      Fingerprint from body of victim


2. Example of text-based query for a proximity match

SELECT img_id, caption FROM mock_scene2
WHERE etx_contains(caption, Row('Gas container found near body', 'SEARCH_TYPE = PHRASE_APPROX'));

img_id    caption
CS-6      Gas container found near body of victim
CS-5      Drugs found near body of victim
CS-7      Gun found near body of victim
CS-10     Close-up of gas-container
CS-8      Fingerprint from body of victim

3. Example of text-based query for an exact match

SELECT img_id, caption FROM mock_scene2
WHERE etx_contains(caption, Row('Gas container found near body', 'SEARCH_TYPE = PHRASE_EXACT'));

img_id    caption
CS-6      Gas container found near body of victim

Table 11 Some example text-based SQL queries

Image and text data types can be stored in one table, and combined queries can be made using the AND or OR connectives, as shown in the example below. AND acts as a switch, in that only those images whose captions contain the keywords are returned, regardless of how closely the image resembles the search image.

SELECT img_id, caption, rank FROM mock_scene2
WHERE Resembles(fv, GetFeatureVector(IfdImgDescFromFile('C:\images\Album4\Crime-Scenes\Mock-Scene\piece5-gas-container.jpg')),
                0.5, 1, 1, 1, 0, 0, 0, rank #REAL)
AND etx_contains(caption, Row('container', 'MATCH_SYNONYM = FS-SynonymList-1 & PATTERN_ALL'))
ORDER BY rank;

img_id    caption                                      rank
CS-6      Gas container found near body of victim      0.67007446
CS-10     Close-up of gas-container                    0.97129822

When just the query image was used, the result returned was as shown below.


SELECT img_id, caption, rank FROM mock_scene2
WHERE Resembles(fv, GetFeatureVector(IfdImgDescFromFile('C:\images\Album4\Crime-Scenes\Mock-Scene\piece5-gas-container.jpg')),
                0.5, 1, 1, 1, 0, 0, 0, rank #REAL)
ORDER BY rank;

img_id    caption                                      rank
CS-8      Fingerprint from body of victim              0.58262634
CS-5      Drugs found near body of victim              0.59703064
CS-4      Close-up of clothes on victim                0.60975647
CS-2      Beer cans                                    0.61682129
CS-7      Gun found near body of victim                0.63639832
CS-9      Close-up of rock stained with blood          0.64620972
CS-3      Rock stained with blood                      0.64979553
CS-6      Gas container found near body of victim      0.67007446
CS-1      Overall view of crime scene                  0.67984009

4.4 A Comparison of Image- and Text-Based Retrieval

In this section we discuss the comparative effectiveness of image-based retrieval versus text-based retrieval, using a set of 65 captioned images taken at an artificial crime scene set up for training purposes. The scene is set at a pub where there has been a break-in, theft and murder. The captions of the images were taken from the spoken commentary provided by the scene of crime officer. The images of the pistol used in Scenario 1 were taken from this set of images, as were the indoor images in Scenario 3 above.

The following tests include the different types of photographs taken at the scene, such as photographs of the gun, the body and ridge detail. The purpose was to test different scenarios in which a scene of crime officer would search for relevant images. Initially, a close-up shot of a fingerprint (ridge detail) was used to test whether all the other images showing ridge detail would be retrieved. Then a close-up shot of a gun found at the scene was used to find all the other images containing a gun, and finally two different images of a body lying on the floor (overview and mid-range shots) were used to test whether other images containing the body would be found. The search was based on colour, shape, texture, brightness structure and colour structure. The tables shown below display the ten closest matching images sorted in increasing order of rank, with 1.00000 being the highest. All the images used, together with their captions, are available in Appendix A.


1. Testing for retrieval of image with fingerprint

DSCN1494    DSCN1495    DSCN1496

Table 12a Close-up photographs of ridge detail

Image DSCN1496 was given as the query image and the ten closest-ranked images are shown in the table below. The query performed reasonably well: it retrieved the two other images of fingerprints and, interestingly, also picked up three images of footwear impressions.

img_id      caption                                                                  rank
DSCN1473    Photograph of writing in dust on games machine saying clean me           0.67176819
DSCN1445    Shot of male dressed                                                     0.67431641
DSCN1459    Showing top of bar                                                       0.68014526
DSCN1450    Photograph of footwear impression in blood fully labelled                0.68031311
DSCN1440    Close-up of hand, of left hand showing footwear impression, zig-zag pattern   0.68505859
DSCN1475    Wooden chair with vinyl seat partially broken found on floor adjacent to the feet of the body   0.69798279
DSCN1465    Photograph of footwear impression in dust inside machine on the second shelf up eighteen inches from floor   0.70825195
DSCN1495    Fingerprints ridge detail                                                0.72203064
DSCN1494    Fingerprints ridge detail                                                0.79878235
DSCN1496    Fingerprints ridge detail                                                1.0000000

Table 12b Results of giving DSCN1496 (close-up of fingerprint) as the query image.


There are some other images showing ridge detail, such as DSCN1488 ("Showing window with visible ridge detail"), DSCN1489 ("Same shot close-up"), DSCN1491 ("Photograph of ridge detail from behind the bar"), DSCN1492 ("Photograph of ridge detail from behind the bar showing salon in background") and DSCN1493 ("Photograph close-up of ridge detail"), but none of these were retrieved, because there are other objects in the background and the ridge detail is shown at a much smaller scale (see Table 12c below). If the SoCO wants to retrieve all the images with fingerprints, then using a sample image for the query will not give a satisfactory result. However, if the SoCO typed in the keyword 'fingerprint' and a synonym list was used (FINGERPRINT = RIDGE DETAIL), then all the expected images would be retrieved.

DSCN1489    DSCN1491    DSCN1493

Table 12c Overview and mid-range photographs of ridge detail.

2. Testing for retrieval of image with gun

Here we want to determine whether using a close-up photograph of a gun will also retrieve all the other images containing a gun. Surprisingly, the system did not pick up DSCN1458, the image of the same pistol taken from a different side, as the next closest match. When only the shape and colour parameters were used, the other image was located as the closest match. As discussed later in this section, varying the combination and weights of the features used in the search can produce a different result set.
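The sketch below illustrates how the weights in the Resembles() call used throughout this section could be varied; the file path is hypothetical, and the mapping of weight positions to individual visual features is an assumption (the exact ordering is defined by the DataBlade's Resembles() signature).

SELECT img_id, caption, rank FROM mock_scene2
WHERE Resembles(fv,
                GetFeatureVector(IfdImgDescFromFile('C:\images\pistol-close-up.jpg')),   -- hypothetical path
                0.5,                -- similarity threshold, as in the earlier queries
                1, 1, 0, 0, 0, 0,   -- six feature weights; here only the first two are switched on
                rank #REAL)
ORDER BY rank;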


DSCN1457    DSCN1458    DSCN1436

Table 13a Close-up and mid-range photographs of a pistol found at the crime scene

The other two images (DSCN1436 and DSCN1455), which are distant shots of the pistol lying on the table with some other items, were retrieved with similarities of 67% and 65% respectively. There were also four and five other irrelevant images, respectively, appearing with a closer similarity than these two (see Table 13b).

img_id      caption                                                                  rank
DSCN1455    Just a distance shot of the table showing knife, firearm and pool cue for information   0.65344238
DSCN1437    Table showing bottles and pool cue, knife                                0.65429688
DSCN1436    Table showing browning high power, bottles and pool cue, knife          0.67178345
DSCN1438    Body on floor showing adjacent table                                     0.67379761
DSCN1452    Close-up photograph of bottle showing apparent blood-like substance      0.67881775
DSCN1492    Photograph of ridge detail from behind the bar showing salon in the background   0.68412781
DSCN1491    Photograph of ridge detail from behind the bar                           0.68791199
DSCN1458    Browning high power self-loading pistol                                  0.69396973
DSCN1454    Part of pool cue with apparent blood-like substance thereon              0.71173096
DSCN1457    Nine millimeter browning high power self-loading pistol                  1.0000000

Table 13b Results of giving DSCN1457 (close-up of pistol) as the query image.

If the keyword 'browning high power' was used instead of the query image, then DSCN1436 would also have been retrieved.


3. Testing for retrieval of image with body

DSCN1434    DSCN1435    DSCN1437
DSCN1438    DSCN1439    DSCN1440
DSCN1441    DSCN1442    DSCN1443
DSCN1444    DSCN1445    DSCN1446

Table 14a Photographs taken at the crime scene showing the body.


img_id      caption                                                                  rank
DSCN1480    Shot of room from behind bar                                             0.70158386
DSCN1436    Table showing browning high power, bottles and pool cue, knife          0.70312500
DSCN1453    Photograph of green wine bottle on floor in fragments                    0.70510864
DSCN1445    Shot of male dressed                                                     0.71545410
DSCN1477    Wooden table with bottles and cigar packet thereon with broken glass and ashtray   0.71885681
DSCN1429    Sign of the baskerville arms                                             0.72137451
DSCN1479    Shot of counter behind bar                                               0.72909546
DSCN1440    Close-up of hand, of left hand showing footwear impression, zig-zag pattern   0.72961426
DSCN1434    Body on floor surrounded by blood                                        0.76255798
DSCN1446    Full length shot of body                                                 1.0000000

Table 14b Results of giving DSCN1446 (full length shot of body) as the query image.

In this query the attempt is to retrieve all the photographs that show the body. An image of the full length of the body was given as the search image. The expected result should have displayed images DSCN1434, DSCN1435, DSCN1437, DSCN1438 and DSCN1439 to DSCN1446, but only two other images of the body were retrieved among the closest matches, with one further image appearing as the 7th closest match.

4. Testing for retrieval of image with close-up of body (DSCN1443)

The query below also aimed to retrieve images of the body, but this time a close-up shot was given as the search image. The result shows the 20 closest matches: three correct images were retrieved, but there was then a gap of one image before the next relevant one and a gap of six images before three other relevant ones.

img_id      caption                                                                  rank
DSCN1463    Photograph of footwear impression with zig-zag and target on piece of wood   0.65731812
DSCN1440    Close-up of hand, of left hand showing footwear impression, zig-zag pattern   0.65827942
DSCN1462    Photograph of rear of games machine showing footwear impression on piece of wood in foreground and footwear impression on chair adjacent to the machine   0.66088867
DSCN1459    Showing top of bar                                                       0.66104126
DSCN1490    Photograph of bar area from extreme corner of room. Extreme left-hand side of picture green arrow   0.66152954
DSCN1438    Body on floor showing adjacent table                                     0.67022705
DSCN1434    Body on floor surrounded by blood                                        0.67086792
DSCN1446    Full length shot of body                                                 0.67202759
DSCN1450    Photograph of footwear impression in blood fully labelled                0.67501831
DSCN1467    Close-up of same footwear mark                                           0.67640686
DSCN1436    Table showing browning high power, bottles and pool cue, knife          0.68112183
DSCN1480    Shot of room from behind bar                                             0.68278503
DSCN1461    Front of games machine                                                   0.68499756
DSCN1479    Shot of counter behind bar                                               0.68737793
DSCN1445    Shot of male dressed                                                     0.68955994
DSCN1477    Wooden table with bottles and cigar packet thereon with broken glass and ashtray   0.70429993
DSCN1432    Showing common approach path to body                                     0.70771790
DSCN1441    Close-up of head and blood on floor                                      0.70849609
DSCN1442    Close-up of body, head and blood on floor                                0.73866272
DSCN1443    Male laying on his back showing open shirt blood on nose and chest with tie around neck   1.0000000

Table 14c Results of giving DSCN1443 (close-up shot of the body) as the query image.

These tests illustrate that when there are many objects in an image it becomes very difficult to retrieve other images containing similar objects. In example (1), where the user was looking for all images showing fingerprints, the search using a query image was not successful, since a large number of relevant images were not retrieved. Similarly, in example (2) with the gun, images showing the gun on a table with other objects were retrieved with much lower rankings. The next two examples, with query images of the body, show that not all of the relevant images were retrieved as the closest matches. In all of these examples it can be seen that if the relevant keywords were used to search the respective captions of the images then the retrieval would be much more effective, for example using 'Browning high power' in example (2) and 'body' in examples (3) and (4).

Another important observation is that if some query expansion could be performed, such as the SoCO providing the keyword 'fingerprint' (example 1) and all the images with 'ridge detail' in the caption being returned, the effectiveness would increase even further. Similarly, the keyword 'firearm' could be used to retrieve all the images of the browning pistol (example 2). Another example would be to provide the keyword 'body' and retrieve all images showing any part of the body, i.e. those whose captions contain keywords such as 'head', 'neck' and 'male' (examples 3 and 4). This could be done through the use of a domain-specific ontology, which is a conceptualisation of all the entities present in the domain as well as their interrelationships. If a text and an image query are combined, then even higher precision might be achieved.
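A minimal sketch of such ontology-assisted expansion is given below; the crime_ontology table, its contents and the use of etx_contains() with a single search term are assumptions made purely for illustration.

-- Hypothetical ontology table: the concept 'body' maps to the narrower
-- terms 'body', 'head', 'neck' and 'male'.
SELECT DISTINCT m.img_id, m.caption
FROM mock_scene2 m, crime_ontology o
WHERE o.concept = 'body'
AND etx_contains(m.caption, o.related_term);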

It was observed that if the data set is constrained or limited the search has a higher precision, while if a large database of random images is used the search becomes less effective. A search is generally more effective if objects have very distinct colours or shapes. If an image comprised a large combination of indistinct objects and colours, the retrieval was extremely ineffective. It was also observed that varying the combination of visual features used for the search resulted in different result sets.

5 Discussion

One main issue to consider here is the choice between text-based, content-based and hybrid approaches to indexing, and consequently retrieving, images. It is interesting to consider in the first place whether communication is more effective when several modes work together. This question may have a different answer when considered from the perspective of a machine as opposed to a human. For example, consider an image of Bush shaking hands with Tony Blair with the British flag in the background, accompanied by the caption "Bush (left) shaking hands with Blair." For most people the caption will be redundant, since they are familiar with the physical appearance of Bush and Blair, and the flag in the background will help place the image in context. For those who are familiar with the names but not the appearances, the spatial clue 'left' in the caption will help identify the people. Those who have no idea who Bush and Blair are will need some supporting information, apart from the caption, to identify them as the president of the USA and the prime minister of the UK respectively.

The current state of the art in computer vision technology is very limited when it comes to identifying an arbitrary object based on just perceptual features such as colour, shape, texture or position in an image. It becomes even more difficult to capture higher-level semantic or abstract meanings based on context, roles, events and impressions. Similarly, text has its limitations when it comes to describing certain visual features such as shape (e.g. of a knife, fingerprint or blood pattern) or texture (e.g. sky and sea), defining exact colour shades, or making spatial relationships explicit. Hence, if these two modes work together they can complement and reinforce each other to address their respective limitations; for example, if a machine detects a round brown object in an image and the supporting text is related to soccer, then the object is more likely to be a ball than, say, a coconut. Another disadvantage of the CBIR approach is that a query sample is necessary, whether an image or a sketch provided by the user, otherwise the system cannot be used. As seen in Section 3.3, all the researchers who used an integrated approach considered it an improvement on using a single mode, linguistic or visual, and achieved higher precision and recall.

Another interesting question to consider is the relative merits of IR systems versus database systems versus knowledge-based systems: what role can each of them play in the image retrieval scenario, individually as well as collectively? Traditional IR systems are optimised for dealing with unstructured textual data and do not provide for metadata such as a database schema. However, they provide for query iteration and expansion, which is important when retrieving images, due to the lack of precision in the querying. IR systems also provide a similarity-based approach to image retrieval, with results usually displayed in ranked form. Databases have been optimised for structured textual data. As discussed in the previous section, object-relational (OR) technology now allows the storage and retrieval of more complex data while maintaining the benefits of the relational model, such as the provision of metadata, data security and integrity, transaction management and update, as well as a simple structured query language, none of which are supported by IR systems. An obvious advantage is, of course, having all of this different data stored in one place with a shared data model. Text, image and video data are not structured, and specific methods for feature extraction and representation are required, which have been implemented by IR systems. According to Baeza-Yates & Ribeiro-Neto (1999), the technologies provided by information retrieval systems and database systems should be combined for multimedia information retrieval; this concept has been implemented in Informix using DataBlade technology. When it comes to representing visual and textual data it has to be decided what purpose the representation will serve: retrieval alone, or reasoning over the data as well. As Srihari (1995a) suggested, it may be more effective to use a database system to store the data using a suitable data model, since there may be a large amount of overhead associated with a powerful representation scheme. As discussed previously, the new OR paradigm provides a data model based on the object-oriented paradigm, which makes it more powerful and versatile than the relational model, so in a situation where vast quantities of data need to be stored without the need for reasoning it might be suitable to use the OR model.


If a text-based or hybrid approach is to be used for image indexing and retrieval, a main issue is how to analyse the various texts related to an image. There could be closely collateral texts, such as a description or caption of the image, or there could be broadly collateral texts that might contain generic or informal descriptions of objects in the image. For example, consider the images in Table 5a, relating to a pistol. The caption of the image, "browning 9mm pistol found on table," could be classified as a closely collateral text. The next closest collateral text could be the case notes of the SoCOs, which describe part or all of the scene but focus on the pistol. Amongst the class of broadly collateral texts there could be the report for that particular crime, which refers to the pistol but to other objects and points of interest as well; the next more distant collateral text could be an encyclopaedic description of the pistol. There could also be even more broadly collateral texts, which may be written in the less formal language of everyday use, including newspaper reports about the particular crime whose images are being discussed, or reports about other related crimes. The above examples of texts are essentially those in the formal register, written in an official language with a clear view of the readership, such as the crime scene investigators associated with the case. This formal register may include occasional informal words, shorthand or ellipses; for instance, instead of stating that a fingerprint was taken at a scene of crime, they may say a 'lift' was taken.

Figure 5 Closely and broadly collateral texts.

This research was carried out for the benefit of the EPSRC-sponsored SoCIS (Scene of Crime Information System) project. Since one of the aims of the project is to build a visual information system to store and retrieve digital photographs taken at the scene of crime, it was important to review the technology currently available in order to make an informed decision about the methods and systems to use for the storage and effective retrieval of scene of crime images.

(Figure 5, above, depicts a caption and a crime scene report as examples of closely collateral texts, nested within broadly collateral texts such as a newspaper article and a dictionary or encyclopaedic definition.)


References

Ahmad et al (2002). Khurshid Ahmad, Bogdan Vrusias & Mariam Tariq, "Co-operative neural networks and 'integrated' classification," to appear in the International Joint Conference on Neural Networks (IJCNN 2002), Honolulu, Hawaii, May 12-17.

Al-Khatib (1999). Wasfi Al-Khatib, Y. Francis Day, Arif Ghafoor & P. Bruce Berra, "Semantic Modelling and Knowledge Representation in Multimedia Databases," IEEE Transactions on Knowledge and Data Engineering, Vol. 11, No. 1, pp. 64-80, IEEE.

Baeza-Yates & Ribeiro-Neto (1999). Ricardo Baeza-Yates & Berthier Ribeiro-Neto, Modern Information Retrieval, Essex, England: ACM Press.

Benitez et al (2000). Ana B. Benitez, John R. Smith & Shih-Fu Chang, "MediaNet: A Multimedia Information Network for Knowledge Representation," Proceedings of the SPIE 2000 Conference on Internet Multimedia Management Systems (IS&T/SPIE-2000, Nov 6-8), Vol. 4210, Boston, MA.

Bertino et al (2001). Elisa Bertino, Barbara Catania & Gian Piero Zarri, Intelligent Database Systems, ACM Press & Addison Wesley, Oxford, England.

Brown (2001). Paul Brown, Object-Relational Database Development: A Plumber's Guide, Menlo Park, CA: Informix Press/Prentice Hall.

Chang et al (1999). Shih-Fu Chang, Yong Rui & Thomas S. Huang, "Image Retrieval: Current Techniques, Promising Directions, and Open Issues," Journal of Visual Communication and Image Representation, Vol. 10, pp. 39-62, Academic Press.

Del Bimbo (1999). Alberto Del Bimbo, Visual Information Retrieval, Morgan Kaufmann Publishers.

Jaimes & Chang (2000). Alejandro Jaimes & Shih-Fu Chang, "A Conceptual Framework for Indexing Visual Information at Multiple Levels," IS&T/SPIE Internet Imaging, Vol. 3964, San Jose, CA.

Maybury (1997). Mark T. Maybury (ed.), Intelligent Multimedia Information Retrieval, Menlo Park, CA: AAAI Press/The MIT Press.

Mckeowen (1998). Kathleen R. McKeown, Steven K. Feiner, Mukesh Dalal & Shih-Fu Chang, "Generating Multimedia Briefings: Coordinating Language and Illustration," Artificial Intelligence, Vol. 103, pp. 95-116, Elsevier Science.

Mitchell (1987). W. J. T. Mitchell, Iconology: Image, Text, Ideology, Chicago: University of Chicago Press.

Ogle & Stonebraker (1995). Virginia E. Ogle & Michael Stonebraker, "Chabot: Retrieval from a Relational Database of Images," IEEE Computer Magazine, Vol. 28(9), pp. 40-48.


Paek et al (1999). S. Paek, C. L. Sable, V. Hatzivassiloglou, A. Jaimes, B. H. Schiffman, S.-F. Chang & K. R. McKeown, "Integration of visual and text based approaches for the content labeling and classification of photographs," ACM SIGIR'99 Workshop on Multimedia Indexing and Retrieval, Berkeley, CA, Aug. 19, 1999.

Sclaroff (1999). Stan Sclaroff, Marco La Cascia & Saratendu Sethi, "Unifying Textual and Visual Cues for Content-Based Image Retrieval on the World Wide Web," Computer Vision and Image Understanding, Vol. 75, Nos. 1-2, pp. 86-98, Academic Press.

Sonka (1999). Milan Sonka, Vaclav Hlavac & Roger Boyle, Image Processing, Analysis, and Machine Vision, Pacific Grove, CA: Brooks/Cole Publishing Company.

Srihari (1995a). Rohini K. Srihari, "Computational Models for Integrating Linguistic and Visual Information: A Survey," Artificial Intelligence Review, special issue on Integrating Language and Vision, Vol. 8 (5-6), pp. 349-369.

Srihari (1995b). Rohini K. Srihari, "Use of Collateral Text in Understanding Photos," Artificial Intelligence Review, special issue on Integrating Language and Vision, Vol. 8, pp. 409-430.

Srihari & Zhang (2000). Rohini K. Srihari & Zhongfai Zhang, "Show&Tell: A Semi-Automated Image Annotation System," IEEE Multimedia, Vol. 7, No. 3, July-Sept. 2000, pp. 61-71, IEEE, USA.

Srihari et al (2000). Rohini K. Srihari, Zhongfai Zhang & Aibing Rao, "Intelligent Indexing and Semantic Retrieval of Multimodal Documents," Information Retrieval, Vol. 2 (2-3), pp. 245-275, Kluwer, USA.

Staggs (1997). Steven Staggs, Crime Scene & Evidence Photographer's Guide, Staggs Publishers, Temecula, CA.

Stonebraker (1999). Michael Stonebraker & Paul Brown, with Dorothy Moore, Object-Relational DBMSs: Tracking the Next Great Wave, San Francisco, CA: Morgan Kaufmann Publishers, Inc.

Subramanian (1999). V. S. Subramanian, Principles of Multimedia Database Systems, Morgan Kaufmann.

Veltkamp & Tanase (2000). Remco C. Veltkamp & Mirela Tanase, "Content-Based Image Retrieval Systems: A Survey," Technical Report, Dept. of Computing Science, Utrecht University.