Cross Modality: Interaction between Image, Video and Language - A Trinity and Personal Perspective
1
Cross Modality: Interaction between Image, Video and Language - A Trinity and Personal Perspective
1
Khurshid Ahmad, School of Computer Science and Statistics,
Trinity College, Dublin
A Seminar presentation
2
Preamble

One key message in modern neuroscience is cross-modality and multi-sensory integration:
- Uni-modal areas in the brain, such as vision, process complex data received in a single mode, e.g. images;
- These areas interact with each other so that animals can deal with the world of multi-modal data;
- Uni-modal areas interact with hetero-modal areas (areas activated by two or more input modalities) to converge the outputs of the uni-modal systems and produce 'higher cognitive' behaviour: quantifying (enumeration and counting), retrieving images given linguistic cues, and vice versa.
3
Neural Correlates of Behaviour: Modality and Neuronal Correlation
M. Alex Meredith (2002). On the neuronal basis for multisensory convergence: a brief overview. Cognitive Brain Research, Vol. 14, pp 31-40.
Neural underpinnings of Multisensory Integration:
4
Preamble

One key message in modern neuroscience is cross-modality: 'Sensory information undergoes extensive associative elaboration and attentional modulation as it becomes incorporated in the texture of cognition.'
Cognitive processes are supposed to arise 'from analogous associative transformations of similar sets of sensory inputs'; differences in the resultant cognitive operations are determined by 'the anatomical and physiological properties of the transmodal node that acts as the critical gateway for the dominant transformation'.
Mesulam, M.-Marsel (1998) ‘From sensation to cognition’ Brain Vol. 121, pp 1013-1052
Figure legend: thin arrows, monosynaptic connections; thick arrows, 'massive connections'; broken arrows, motor output pathways.
Core synaptic hierarchy: primary sensory, upstream and downstream unimodal, and transmodal (heteromodal, paralimbic and limbic) zones of the cerebral cortex.
5
Neural Correlates of Behaviour: Modality and Neuronal Correlation
‘In addition to […] modality-specific motion-processing areas, there are a number of brain areas that appear to be responsive to motion signals in more than one sensory modality [….] the IPS, [..] precentral gyrus can be activated by auditory, visual or tactile motion signals’
Soto-Faraco, S. et al (2004). ‘Moving Multisensory Research Along: Motion Perception Across Sensory Modalities’. Current Directions in Psy. Sci. Vol 13(1), pp 29-32
Neural underpinnings of Multisensory Motion Integration:
6
Sensation and Cognition
The highest synaptic levels of sensory-fugal processing are occupied by heteromodal, paralimbic and limbic cortices, collectively known as transmodal areas.
Key anatomically distinct brain networks with communicating epicentres
Mesulam, M.-Marsel (1998) ‘From sensation to cognition’ Brain Vol. 121, pp 1013-1052
Network | Epicentre 1 | Epicentre 2
Spatial awareness | Posterior parietal cortex | Frontal eye fields
Language | Wernicke's area | Broca's area
Explicit memory/emotion | Hippocampal-entorhinal complex | The amygdala
Face-object recognition | Mid-temporal cortex | Temporo-polar cortex
Working memory/executive function | Lateral prefrontal cortex | Posterior parietal cortex (?)
7
Uni- and Cross Modality @ Trinity
Indexing (Rapporteur: Declan O'Sullivan)
Anil Kokaram
Retrieval in Context: A holistic view of intelligent multimedia access
Frank Boland
Audio information acquisition and source localisation
Niall Rea and Rozenn Dahyot
Detection of Illicit Content in Video Streams
Retrieval (Rapporteur: John Dingliana)
Simon Wilson
Bayesian content-based image retrieval
Niall Rooney
Search strategies for cluster-based document indexing and retrieval
Anton Zamolotskikh
A machine learning approach for ontology construction within Collaborative Media Tagging Environments
Simulation & Visualisation (Rapporteur: Carl Vogel)
Carol O'Sullivan
Perception of dynamic events and implications for real-time Computer Graphics
Gerard Lacey
Efficient rigid body motion tracking with applications to human psychomotor performance assessment
8
Uni- and Cross Modality @ Trinity
Other friends and colleagues:
Trinity Centre for Neurosciences (Fiona Newell, Shane O'Mara, Hugh Garavan & Ian Robertson: cross modality and fMRI imaging);
Linguistics and Phonetics (Ailbhe Ní Chasaide)
Centre for Health Informatics (Jane Grimson)
9
Uni/Cross Modality & Ontology @ Trinity
The key problem for the evolving semantic web and the creation of large data repositories (memories for life in health care, infotainment) is the indexation and efficient retrieval of images, both still and moving, and the identification of key objects and events in the images.
The visual features under-constrain an image, so supplemental, collateral, contextual knowledge is required to index it: linguistic descriptions and motion features are among the candidates.
Above all, any indexing scheme must have a conceptual basis if it is to be robust against changes in the subject domain and changes in the user perspective.
10
Uni/Cross Modality & Ontology @ Trinity
The key term in distributed and soft computing for a conceptual basis is ontology: A consensus amongst a group of people (system developers, domain experts and end-users) about what there is.
We have had a seminar where we discussed the philosophical, formal, linguistic, computational and inter-operability issues related to ontology systems.
There is a work programme that is evolving under the co-ordination of Declan O’Sullivan.
The intention is to see the fit between the work of the ontology consortium and that of the folks working in video annotation.
11
Uni/Cross Modality & Ontology @ Trinity
The intention is to see the fit between the work of the ontology consortium and that of the folks working in video annotation: a system that works in a distributed environment, interacts with users on a variety of devices, and allows access to and update of large repositories of life- and mission-critical data.
We have tremendous opportunities: (a) major government initiatives in health care, with an integrated system for text and images related to patients, accessible to authorised users on a range of mobile devices; (b) major opportunities in animation and surveillance; (c) key applications in mini-robotics systems; (d) the opening up of TCIN in clinical care; (e) ageing initiatives.
12
Uni/Cross Modality & Ontology @ Trinity
There are key groups in the College that can contribute to the knowledge in computing and contribute to the advancement of key disciplines – health care, neurosciences. This is a win-win opportunity for all.
1. Communications and Value Chain Centre
2. Intelligent Systems Cluster in CS (Ontology, Linguistics, Graphics, Vision)
3. Theory and Architecture Cluster (Formal Methods)
4. Distributed Systems Cluster (Ubiquitous Systems)
5. Vision and Speech Groups in EE
6. Statistics Cluster (Bayesian Reasoning)
13
Uni/Cross Modality & Ontology @ Trinity
The key message here is this:
Trinity is good at good science;
Trinity has substantial expertise and potential in building novel computing systems;
Trinity has demonstrable ability to deal with real-world audio/video systems;
All the key players involved have a peer-reviewed track record.
We have the critical mass, or have the desire to create one!
14
Preamble

Neural computing systems are trained on the principle that if a network can compute, then it will learn to compute.
Most neural computing systems are single-net, cellular systems.
Lesson from biology: no network is an island; the bell tolls across networks.
15
Preamble

Neural computing systems are trained on the principle that if a network can compute, then it will learn to compute.
Multi-net neural computing systems are trained on the principle that if two or more networks learn to compute simultaneously or sequentially, then the multi-net will learn to compute.
16
Preamble

Multi-net neural computing systems can be traced back to the hierarchical mixture-of-experts systems originally reported by Jordan, Jacobs and Barto. In turn, these systems relate to a broader family of systems: the mixtures of 'X'.
Jacobs, R.A., Jordan, M.I. & Barto, A.G. (1991). Task Decomposition through Competition in a Modular Connectionist Architecture: The What and Where Vision Tasks. Cognitive Science, vol. 15, pp. 219-250.
17
Preamble

One key message in modern neuroscience is multi-modality:
My work has been in the multi-net simulation of
language development;
aphasia;
numerosity;
cross-modal retrieval;
attention and automatic video annotation.
18
Learning to Compute: Cross-Modal Interaction and Spatial Attention
The key to spatial attention is that different stimuli, visual and auditory, help to identify the spatial location of the object generating the stimuli.
One argument is that there may be a neuronal correlate of such crossmodal interaction between two stimuli.
Information related to the location of the stimulus (where) and identifying the stimulus (what) appears to have correlates at the neuronal level in the so-called dorsal and ventral streams in the brain.
19
Learning to Compute: Numerosity, Number Sense and 'Numerons'

A number of other animal species appear to have the 'faculty' of visual enumeration, or subitisation. The areas identified have 'homologs' in the human brain.

Author | % numerical neurons in macaque parietal cortex | % numerical neurons in macaque prefrontal cortex
Sawamura et al. 2002 | c. 30 | c. 15
Nieder et al. 2002 | c. 15 | c. 30

Measurements are a tad problematic in neurobiology.
20
Learning to Compute: Numerosity, Number Sense and 'Numerons'

'Monkeys watched two displays (first sample, then test) separated by a 1-s delay. [The displays varied in shape, size, texture and so on.] They were trained to release a lever if the displays contained the same number of items. Average performance of both monkeys was significantly better than chance for all tested quantities, with a decline when tested for higher quantities similar to that seen in humans performing comparable tasks.'
Andreas Nieder, David J. Freedman, Earl K. Miller (2002). 'Representation of the Quantity of Visual Items in the Primate Prefrontal Cortex'. Science Vol. 297, pp 1709-11.
The ‘Edge’ Effect
21
Computing to Learn
Neural computing systems are trained on the principle that if a network can compute then it will learn to compute.
Multi-net neural computing systems are trained on the principle that if two or more networks learn to compute simultaneously, then the multi-net will learn to compute.
22
Combining multiple modes of information using unsupervised neural classifiers:
- Two SOMs linked by Hebbian connections;
- One SOM learns to classify a primary modality of information;
- One SOM learns to classify a collateral modality of information;
- Hebbian connections associate patterns of activity in each SOM.
Computing to Learn: Unsupervised Self-Organisation
23
Computing to Learn: Unsupervised Self-Organisation
Sequential multi-net neural computing systems: SOMs and Hebbian connections trained synchronously.
[Diagram: a Primary Vector feeds a Primary SOM, a Collateral Vector feeds a Collateral SOM, and the two maps are coupled by a bidirectional Hebbian network.]
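The two-map architecture in the diagram above can be summarised in code. The following is a minimal, illustrative numpy sketch: class names, grid sizes, learning rates and the synthetic training pairs are my own assumptions, not the implementation reported in this talk. Two SOMs are trained synchronously on primary and collateral vectors while a Hebbian matrix accumulates associations between their patterns of activity.

```python
import numpy as np

rng = np.random.default_rng(0)

class SOM:
    """Minimal self-organising map with a Gaussian neighbourhood."""
    def __init__(self, grid, dim):
        self.n = grid[0] * grid[1]
        self.w = rng.random((self.n, dim))               # codebook vectors
        r, c = np.unravel_index(np.arange(self.n), grid)
        self.pos = np.stack([r, c], axis=1).astype(float)

    def winner(self, x):
        return int(np.argmin(((self.w - x) ** 2).sum(axis=1)))

    def activity(self, x, sigma=1.0):
        """Pattern of activity centred on the winning node."""
        d2 = ((self.pos - self.pos[self.winner(x)]) ** 2).sum(axis=1)
        return np.exp(-d2 / (2 * sigma ** 2))

    def update(self, x, lr=0.3):
        h = self.activity(x)
        self.w += lr * h[:, None] * (x - self.w)

# One SOM per modality, trained synchronously on paired inputs ...
primary = SOM((5, 5), dim=8)      # e.g. visual features
collateral = SOM((5, 5), dim=6)   # e.g. keyword features
hebb = np.zeros((25, 25))         # bidirectional Hebbian links

pairs = [(rng.random(8), rng.random(6)) for _ in range(40)]
for _ in range(50):
    for xp, xc in pairs:
        primary.update(xp)
        collateral.update(xc)
        # ... while Hebbian links associate the two activity patterns.
        hebb += np.outer(primary.activity(xp), collateral.activity(xc))

# Cross-modal lookup: from a primary winner, follow the strongest link.
xp, xc = pairs[0]
linked = int(np.argmax(hebb[primary.winner(xp)]))
print(linked, collateral.winner(xc))
```

Retrieval then amounts to following the strongest Hebbian link out of a winning node, which is the walk the retrieval slides later in the talk describe.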
24
Computing to Learn: Unsupervised Self-Organisation
Work under my supervision at Surrey includes the development of multi-net neural computing architectures for:
- language development;
- language degradation;
- collateral images and texts;
- numerosity development.
In the case of the latter two, the connections between modules are learnt too: cross-modal interaction via Hebbian connections.
25
- Hebbian connections associate neighbourhoods of activity, not just a one-to-one linear association;
- Each SOM's output is formed by a pattern of activity centred on the winning neuron for the primary and collateral input;
- Training is deemed complete when both SOM classifiers have learned to classify their respective inputs.
Computing to Learn: Unsupervised Self-Organisation
26
Computing to Learn: The Development of Numerosity
Hebbian connections run from the winning node of the magnitude-representation SOFM to all nodes of the verbal SOFM (a), and vice versa (b). During training those connections are strengthened based on the activations of the node pairs.
(a) (b)
Vrusias B. L. (2004). Combining Unsupervised Classifiers: A Multimodal Case Study. Unpublished PhD Dissertation, University of Surrey.
An unsupervised multinet alternative
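The strengthening rule described above (winner-to-all links, updated from the paired activations) can be sketched as follows. The learning rate, map sizes and activation values are illustrative assumptions, not the trained system.

```python
import numpy as np

rng = np.random.default_rng(1)

# Activations of the two maps for one training pair (illustrative values).
act_mag = rng.random(16)    # magnitude SOFM, a 4 x 4 grid flattened
act_verb = rng.random(16)   # verbal SOFM

w_mv = np.zeros((16, 16))   # magnitude -> verbal links
w_vm = np.zeros((16, 16))   # verbal -> magnitude links
lr = 0.1                    # assumed learning rate

win_m = int(np.argmax(act_mag))
win_v = int(np.argmax(act_verb))

# (a) links from the magnitude winner to all verbal nodes ...
w_mv[win_m] += lr * act_mag[win_m] * act_verb
# (b) ... and vice versa, from the verbal winner to all magnitude nodes.
w_vm[win_v] += lr * act_verb[win_v] * act_mag

# The strongest link out of each winner now points at the other winner.
print(int(np.argmax(w_mv[win_m])) == win_v)   # True
```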
27
Computing to Learn: The Development of Numerosity
[Diagram: a Magnitude SOFM and a Verbal SOFM (number words) linked by Hebbian connections.]
Vrusias B. L. (2004). Combining Unsupervised Classifiers: A Multimodal Case Study. Unpublished PhD Dissertation, University of Surrey.
An unsupervised multinet alternative
28
Computing to Learn: The Development of Numerosity
[Figure: average activation across Kohonen layer nodes 1-18 for the number words 'one' to 'six'; activations range from 0 to 1.]
An unsupervised multinet alternative: Simulating Fechner's Law
Ahmad K., Casey, M. & Bale, T. (2002). Connectionist Simulation of Quantification Skills. Connection Science, vol. 14(3), pp. 165-201.
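The simulation above reproduces behaviour consistent with Fechner's law, under which sensation grows logarithmically with stimulus magnitude. As a reminder of the law itself, a tiny sketch (the constants k and I0 are arbitrary illustrations):

```python
import math

# Fechner's law: perceived magnitude S = k * ln(I / I0).
k, I0 = 1.0, 1.0

def perceived(intensity):
    return k * math.log(intensity / I0)

# Equal *ratios* of intensity give equal steps in sensation:
steps = [perceived(i) for i in (2, 4, 8, 16)]
diffs = [b - a for a, b in zip(steps, steps[1:])]
print([round(d, 6) for d in diffs])   # each doubling adds ln 2 ~ 0.693147
```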
29
Computing to Learn: The Development of Numerosity
A 'Hebbian-like learning rule' that 'resembles [..] Kohonen learning rule': a confirmation of the results of Nieder & Miller.
Verguts, Tom, & Fias, Wim (2004). 'Representation of Numbers in Animals and Humans: A Neural Model'. Journal of Cognitive Neuroscience, Vol. 16(9), pp 1493-1504.
30
Computing to Learn: Image and Collateral Texts

Images have traditionally been indexed with short texts describing the objects within the image. The accompanying text is sometimes described as collateral to the image.
The ability to use the collateral texts for building computer-based image retrieval systems will help in dealing with image collections that can now be stored digitally.
Theoretically, the manner in which we grasp the relationship between the ‘features’ of the image and the ‘features’ of the collateral text relates back to cross-modality.
31
Computing to Learn: Image and Collateral Texts
Alex Martin and Linda L Chao (2001). Semantic memory and the brain: structure and processes. Current Opinion in Neurobiology. Vol. 11, pp 194-201
• The approximate locations of [lateral] regions where information about object form, motion and object-use-associated motor patterns may be stored.
• Information from an increasing number of sources may be integrated in the temporal lobes, with specificity increasing along the posterior-to-anterior axis.
• Specific regions of the left inferior parietal cortex and the polar region of the temporal lobes may be involved differentially in retrieving, monitoring, selecting and maintaining semantic information.
32
Computing to Learn: Image and Collateral Texts
Alex Martin and Linda L Chao (2001). Semantic memory and the brain: structure and processes. Current Opinion in Neurobiology. Vol. 11, pp 194-201
• Activation of the fusiform gyrus when subjects retrieve color word associates has recently been replicated in two additional studies.
• Activation in a similar region has been reported during the spontaneous generation of color imagery in auditory color-word synaesthetes.
33
Computing to Learn: Image and Collateral Texts
In principle, image collections can be indexed by the visual features of the content alone (colour, texture, shapes, edges). Content-based image retrieval has not, however, been a resounding success:
K. Ahmad, B. Vrusias, and M. Zhu. ‘Visualising an Image Collection?’ In (Eds.) Ebad Banisi et al. Proceedings of the 9th International Conference Information Visualisation (London 6-8 July 2005). Los Alamitos: IEEE Computer Society Press. pp 268-274.
Visual Similarity (Similar Colours) Conceptual Similarity (Balls / Fruits)
34
Computing to Learn: Image and Collateral Texts
We have developed a multi-net system that learns to classify images within an image collection, where each image has a collateral text, based on the common visual features and the verbal features of the collateral text.
The multi-net can also learn to correlate images and their collateral texts using Hebbian links – this means that one image may be associated with more than one collateral text and vice versa
Ahmad, K., Casey, M., Vrusias, B., & Saragiotis, P. Combining Multiple Modes of Information using Unsupervised Neural Classifiers. In (Eds.) Terry Windeatt and Fabio Roli, Proc. 4th Int. Workshop, MCS 2003, LNCS 2709. Heidelberg: Springer-Verlag. pp 236-245.
35
Computing to Learn: Image and Collateral Texts
Hebbian connections run from the winning node of the text SOFM to all nodes of the image SOFM (a), and vice versa (b). During training those connections are strengthened based on the activations of the node pairs.
(a) (b)
Vrusias B. L. (2004). Combining Unsupervised Classifiers: A Multimodal Case Study. Unpublished PhD Dissertation, University of Surrey.
Automatic Image Annotation and Illustration
36
Computing to Learn: Image and Collateral Texts
[Diagram: an Image SOFM and a Text SOFM (keywords) linked by Hebbian connections.]
Vrusias B. L. (2004). Combining Unsupervised Classifiers: A Multimodal Case Study. Unpublished PhD Dissertation, University of Surrey.
Automatic Image Annotation and Illustration
37
Computing to Learn: Image and Collateral Texts
Different SOFM configurations used in simulations:
Input layer (text): 30, 50, 195
Input layer (image): 67
Output layer: 10 x 10, 15 x 15, 50 x 50
Hebbian links: 10,000; 50,625; 6,250,000
Training cycles: 1,000; 10,000

Optimum SOFM configuration:
Output layer: 15 x 15
Input text vector length: 50
Hebbian links: 50,625
Training cycles: 1,000
Automatic Image Annotation and Illustration
38
Computing to Learn: Image and Collateral Texts
Modality | Method | Components
Text | Vector construction: texts represented through their keywords | Frequency and patterns of usage: most-used and least-used terms
Image | Vector selection: standard physical features | Colour; texture; shape; brightness
Automatic Image Annotation and Illustration
39
Computing to Learn: Image and Collateral Texts
The performance of the multinet system was compared with a 'monolithic' single-net SOFM. The monolith was trained with a conjoined text-image vector on a single SOFM.
The performance of the two networks was compared using a weighted combination of the precision (p) and recall (r) statistics, the effectiveness measure F:

F = 1 / (α/p + (1 - α)/r)

We use α = 0.5.
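For concreteness, the effectiveness measure can be computed as below. The helper name is mine, and the check values are taken from the text-based categorisation table shown later in the talk.

```python
def effectiveness(p, r, alpha=0.5):
    """van Rijsbergen's effectiveness measure F = 1 / (alpha/p + (1-alpha)/r).
    With alpha = 0.5 this is the harmonic mean of precision and recall."""
    return 1.0 / (alpha / p + (1.0 - alpha) / r)

# Precision 0.60 and recall 0.83 give F of about 0.70.
print(round(effectiveness(0.60, 0.83), 2))   # 0.7
```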
40
Computing to Learn: Image and Collateral Texts
• The Hemera 'PhotoObjects' collection was used as the primary dataset for our experiments.
• The collection contains about 50,000 photo objects (single-object images with no background), and has been used extensively for image analysis.
• Each image (object) in the collection has associated keywords attached, and is characterised by a general category type.
41
Computing to Learn: Image and Collateral Texts
Hemera collection: training subset used - 1,151 images randomly selected from 50,000 objects.

Category | Average no. of terms | Total no. of terms | No. of objects
BALLS | 9 | 915 | 97
BUTTERFLIES & MOTHS | 8 | 993 | 129
CARS | 10 | 1217 | 118
DRINKS | 10 | 664 | 65
FLOWERS | 9 | 1099 | 117
FRUIT | 4 | 561 | 131
MONEY | 8 | 909 | 120
SEATING | 8 | 862 | 107
TRAINS & PLANES | 11 | 1542 | 139
WEAPONS | 10 | 1256 | 128
AVERAGE | 8.7 | 1,002 | 115
42
Computing to Learn: Image and Collateral Texts
[Figure: 15 x 15 grid of winning-category labels, visualising the clusters formed by the image-based SOFM.]
[Figure: 15 x 15 grid of winning-category labels, visualising the clusters formed by the text-based SOFM.]
43
Computing to Learn: Image and Collateral Texts
Automatic Image Annotation and Illustration
• It is not possible for an SOFM to directly output the categories inherent in the training data. Recently, a sequential clustering scheme has been suggested: produce the initial categorisation using a SOFM and then cluster the output using conventional clustering algorithms such as k-means, hierarchical clustering, fuzzy c-means and so on.
• We have obtained the best results with SOFM + k-means clustering.
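A minimal sketch of that two-stage scheme, with synthetic prototype vectors standing in for a trained SOFM codebook; the data, centres and the plain k-means loop are illustrative, not the experiments reported here.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for a trained 10 x 10 SOFM codebook: 100 prototype vectors
# scattered around three hypothetical category centres.
centres = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
codebook = np.concatenate(
    [c + 0.3 * rng.standard_normal((34, 2)) for c in centres])[:100]

def kmeans(data, k, iters=20):
    """Plain k-means over the SOFM prototypes (stage two of the scheme)."""
    means = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((data[:, None, :] - means) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():        # guard against empty clusters
                means[j] = data[labels == j].mean(axis=0)
    return labels, means

labels, means = kmeans(codebook, k=3)
print(sorted(np.bincount(labels, minlength=3).tolist()))   # cluster sizes
```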
44
Computing to Learn: Image and Collateral Texts
• The visual features proved too generic to be useful for classification.
• Precision and recall figures were persistently below 0.5 for both metrics.
• The results, however, were good for visually well-defined objects like coins.
• This perhaps explains the poor performance of some computer vision systems.
45
Computing to Learn: Image and Collateral Texts
• Textual descriptors are much better for categorisation, with precision and recall both quite high.

Text-based categorisation:
Topology | F | Precision | Recall
10 x 10 | 0.70 | 0.60 | 0.83
15 x 15 | 0.72 | 0.63 | 0.84
50 x 50 | 0.80 | 0.70 | 0.92

Image-based categorisation:
Topology | F | Precision | Recall
10 x 10 | 0.25 | 0.24 | 0.27
15 x 15 | 0.26 | 0.25 | 0.27
50 x 50 | 0.29 | 0.28 | 0.29
46
Computing to Learn: Image and Collateral Texts
Neural network architecture | F(α = 0.5)
Multinet system: simple collateral mapping | 0.76
Monolithic single-net system | 0.50

The performance of the multinet system was compared with a 'monolithic' single-net SOFM. The monolith was trained with a conjoined text-image vector on a single SOFM.
47
Computing to Learn: Image and Collateral Texts
The performance of the multinet system was compared with a 'monolithic' single-net SOFM. The monolith was trained with a conjoined text-image vector on a single SOFM.

Hemera data set (single objects, no background):
System | Input vector | Output vector | F-measure
SingleNet | Monolithic | Monolithic | 0.36
MultiNet: auto annotation | Visual features | Keyword features | 0.38
MultiNet: auto illustration | Keyword features | Visual features | 0.48
48
Computing to Learn: Image and Collateral Texts
The performance of the multinet system was compared with a 'monolithic' single-net SOFM. The monolith was trained with a conjoined text-image vector on a single SOFM.

Corel data set (multiple objects, with background):
System | Input vector | Output vector | F-measure
SingleNet | Monolithic | Monolithic | 0.37
MultiNet: auto annotation | Visual features | Keyword features | 0.25
MultiNet: auto illustration | Keyword features | Visual features | 0.43
49
Computing to Learn: Image and Collateral Texts
Automatic Image Illustration through Hebbian cross-modal linkage
Text query → Matched text → Retrieved image
50
Computing to Learn: Image and Collateral Texts
Query image → Matched image → Retrieved text
Automatic Image Annotation through Hebbian cross-modal linkage
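Both retrieval directions reduce to the same walk over the trained structures. A schematic sketch follows; the Hebbian matrix and the node-to-item records here are random placeholders, not trained values, and all names are my own.

```python
import numpy as np

rng = np.random.default_rng(3)

# Placeholders for a trained system: Hebbian links between a 9-node text
# map and a 9-node image map, plus the items that landed on each node.
hebb = rng.random((9, 9))                      # hebb[text_node, image_node]
images_at = {n: [f"img_{n}_{i}" for i in range(2)] for n in range(9)}
keywords_at = {n: [f"kw_{n}"] for n in range(9)}

def illustrate(text_node):
    """Text -> image: follow the strongest Hebbian link out of the matched
    text node and return the images stored at the linked image node."""
    return images_at[int(np.argmax(hebb[text_node]))]

def annotate(image_node):
    """Image -> text: the same walk in the opposite direction."""
    return keywords_at[int(np.argmax(hebb[:, image_node]))]

print(illustrate(4), annotate(4))
```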
51
Computing to Learn: Image and Collateral Texts
Performance | TEXT → IMAGE | IMAGE → TEXT
Best | 75 | 76
Worst | 70 | 71
Average | 73 | 74

Results show the accuracy (%) with which the Hebbian network learns to identify the link between an image-text pair.
Automatic Image Annotation and Illustration
52
How to retrieve the stored video (sequences)?
Unusual events? Unusual behaviour? Unusual objects?
TODAY’S VISION TECHNOLOGY
53
How to retrieve the stored video (sequences)? By keywords:
- Experts annotate video sequences by hand!
- Between 5 and 40 minutes per still image;
- Inter-indexer variability.
TODAY’S VISION TECHNOLOGY
54
VISUAL THESAURUS & VIDEO SUMMARISATION - The Written Word in Closely and Broadly Collateral Texts
CLOSELY COLLATERAL TEXTS: caption; crime scene report.
BROADLY COLLATERAL TEXTS: newspaper article; dictionary definition.
CRIME SCENE IMAGE
55
New concepts generate new keywords and kill off old ones;
New concepts are inevitably written up and published in research papers, magazines, newspapers.
Automatic extraction from text??
TOMORROW'S VISION TECHNOLOGY?
Earprints?
Telephone Chatter?
Suicide Bomber?
Grassprints?
Crowd Dynamics?
56
New concepts are inevitably written up and published in research papers, magazines, newspapers.
Automatic extraction from text??
TOMORROW'S VISION TECHNOLOGY?
Earprint Thesaurus created at Surrey automatically!
57
VISUAL THESAURUS & VIDEO SUMMARISATION
Expert observations, research reports, novel devices, new methods: the source of new concepts and terms.
Information Extraction System
New Terminology: Earprint; Pyrolysis; Bitemarks
58
Development of a Visual Evidence Thesaurus
A visual thesaurus is an arrangement of the words and phrases of a language not in alphabetical order but according to the images associated with what those words and phrases express.
A visual thesaurus is not a pictorial dictionary:
- A pictorial dictionary explains the meanings of words and phrases associated with an image;
- A visual thesaurus suggests a range of words and phrases associated with an image.
VISUAL THESAURUS & VIDEO SUMMARISATION
59
Development of a Visual Evidence Thesaurus
The challenge of the REVEAL Project is to create a visual evidence thesaurus for arbitrary domains by using a set of systematically collected moving images and associated texts.
A systematic collection of texts is called a CORPUS. A corpus comprises the evidence of how a language is being used at various levels of description:
- at the level of word usage (lexical),
- at the level of phrases and sentences (grammatical),
- at the level of meaning (semantic), and
- at the level of intentions (pragmatic).
VISUAL THESAURUS & VIDEO SUMMARISATION
60
An expert's description is rich in vocabulary and can subsequently be used to annotate a picture gallery. Nick Mitchell (SOCO, Surrey Police) describing a mock murder scene.
VISUAL THESAURUS & VIDEO SUMMARISATION – The Spoken Word
61
Scene of Crime Information System (SOCIS) (An earlier EPSRC funded project)
This EPSRC-sponsored project, involving the Universities of Surrey and Sheffield, developed methods and techniques for automatically indexing images with the descriptions provided by Scene of Crime Officers.
9 mm Browning high power pistol
Footwear impression in blood
Body on floor showing adjacent table
Fingerprints showing ridges
Typical Scene of Crime Images
BUILDING A VISUAL THESAURUS
62
Surrey Forensic Science Corpus (0.58 million words)
USE OF WORDS IN THE FORENSIC SCIENCE CORPUS COMPARED AND CONTRASTED WITH A GOOD SAMPLE OF TEXTS OF EVERYDAY USAGE
British National Corpus (100 million words): contains major works of fiction, science and technology texts, newspapers and magazines.
BUILDING A VISUAL THESAURUS
63
Word | SFSC relative frequency | BNC relative frequency | Weirdness (SFSC/BNC)
the | 6.8% | 6.2% | 1.1
of | 3.7% | 2.9% | 1.2
and | 2.7% | 2.7% | 1.0
to | 2.5% | 2.6% | 1.0
a | 2.4% | 2.1% | 1.1
British National Corpus (BNC) = 100 Million words;
Surrey Forensic Science Corpus (SFSC) = 0.58 Million words;
The five words have about the same distribution in the two corpora: these are the so-called closed-class, or grammatical, words, and one may expect them to occur with the same frequency since both corpora contain English-language texts. There is no weirdness in the use of these words in the forensic science corpus.
BUILDING A VISUAL THESAURUS
64
Word | SFSC relative frequency | BNC relative frequency | Weirdness (SFSC/BNC)
evidence | 0.47% | 0.021% | 22
crime | 0.40% | 0.007% | 57
scene | 0.27% | 0.007% | 40
forensic | 0.25% | 0.001% | 473
police | 0.25% | 0.028% | 9
police 0.25% 0.028% 9
British National Corpus (BNC) = 100 Million words;
Surrey Forensic Science Corpus (SFSC) = 0.58 Million words;
The five words do not have the same distribution in the two corpora: these are the so-called open-class, or lexical, words. For every 22 occurrences of evidence in the Surrey corpus there is, proportionally, only one in the BNC. And forensic is the weirdest: 473 occurrences in the Surrey corpus for every one in the BNC.
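The weirdness ratio itself is simple to compute. A small sketch follows; the helper name is mine, and the counts below are back-computed from the percentages in the table, purely for illustration.

```python
def weirdness(sl_count, sl_total, gl_count, gl_total):
    """Relative frequency in the special-language corpus divided by
    relative frequency in the general-language corpus."""
    if gl_count == 0:
        return float("inf")   # a neologism: absent from the general corpus
    return (sl_count / sl_total) * (gl_total / gl_count)

sfsc, bnc = 580_000, 100_000_000
# 'evidence': 0.47% of the SFSC vs 0.021% of the BNC -> weirdness ~ 22.
print(round(weirdness(0.0047 * sfsc, sfsc, 0.00021 * bnc, bnc)))   # 22
```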
BUILDING A VISUAL THESAURUS
65
Word | SFSC relative frequency | BNC relative frequency | Weirdness (SFSC/BNC)
bitemark | 0.0187% | 0% | -
earprint | 0.0137% | 0% | -
accelerant | 0.0115% | 0% | -
pyrolysis | 0.0139% | 0.00001% | 634
ballistics | 0.0146% | 0.00002% | 1263
British National Corpus (BNC) = 100 Million words;
Surrey Forensic Science Corpus (SFSC) = 0.58 Million words;
The first three words DO NOT EXIST in the BNC: these are the so-called neologisms, or new words. Pyrolysis and ballistics are also rarely used words in the BNC.
BUILDING A VISUAL THESAURUS
66
BUILDING A VISUAL THESAURUS: Scene of Crime Information System (SOCIS)
The SOCIS system, available freely, automatically indexes images taken at a crime scene with the descriptions provided by scene of crime officers. These descriptions were supplemented by a visual thesaurus constructed from a forensic science corpus.
67
IDENTIFICATION: [1] Close up view of exhibit ABC/3 [.] [2] Red and silver knife handle.
LOCATION: On alleyway floor.
ELABORATION: Adjacent to building and metal gate.

[SOCO 1 - spontaneous free text:] Close up view of exhibit ABC/3 red and silver knife handle on alleyway floor adjacent to building and metal gate.
Indexer Variability: Given the image descriptions are in free text, perhaps each SOCO gives a different description of the image?
BUILDING A VISUAL THESAURUS: Scene of Crime Information System (SOCIS)
68
Indexer Variability: Given the image descriptions are in free text, perhaps each SOCO gives a different description of the image?
BUILDING A VISUAL THESAURUS: Scene of Crime Information System (SOCIS)
•Novices attempted to describe everything in the image
“I now see the bedroom or one of the bedrooms it looks like a child's bedroom single bed and a cot to the right of the bed there's a a bedside cabinet four drawers the bottom two drawers are completely open with some items hanging out the third drawer up is slightly open with an item hanging out and on top of that there's some different toys and ornaments the bed doesn't look like it's been disturbed it's just cuddly toys over the pillow cot the cot is open and the doors the side door is down with a cuddly toy in it and a bike propped up against the wall or a scooter looks like an old scooter propped up against the wall” (124 words)
69
Indexer Variability: Given the image descriptions are in free text, perhaps each SOCO gives a different description of the image?
BUILDING A VISUAL THESAURUS: Scene of Crime Information System (SOCIS)
•Novices were quite succinct in their description after training (c. 6 weeks).
“now a view of the picture from child's bedroom, single bed, with a cot, to the right of the bed a bedside cabinet with two drawers open a four drawer cabinet a bedside cabinet, the bottom two are open, the third one up is slightly open with some clothing hanging out.”(51 Words)
70
Indexer Variability: Given the image descriptions are in free text, perhaps each SOCO gives a different description of the image?
BUILDING A VISUAL THESAURUS: Scene of Crime Information System (SOCIS)
•Novices were quite succinct in their description after training (c. 6 weeks) but were nothing like an expert:
“View of ground floor bedroom towards bed and cot (as viewed from door).” (13 Words)
71
BUILDING A VISUAL THESAURUS: Key to building a visual thesaurus
1. Terminology;
2. Conceptual Structures or Ontology;
3. Methods for updating terminology & ontology;
4. Access to experts, to exemplar images, and to collateral descriptions of images
72
Learning to Compute: Visual Attention: An Early Processing Model
L. Itti & C. Koch, (2001). Computational Modeling of Visual Attention, Nature Reviews Neuroscience, Vol. 2, No. 3, pp. 194-203, Mar 2001.
Cortical area | Tasks | Function
'dorsal stream' (PPC) | spatial localization; directing attention and gaze towards objects of interest in the scene | Deploy attention
'ventral stream' (infero-temporal cortex; IT) | recognition and identification of visual stimuli | Receive attentional feedback modulation; represent attended locations and objects
73
Learning to Compute: Visual Attention: Itti and Koch Model
L. Itti, (2004). Automatic Foveation for Video Compression Using a Neurobiological Model of Visual Attention, IEEE Transactions on Image Processing, Vol. 13, No. 10, pp. 1304-1318, Oct 2004.
Inputs are decomposed into multiscale analysis channels sensitive to low-level visual features (two color contrasts, temporal flicker, intensity contrast, four orientations, and four directional motion energies). After strong non-linear competition for saliency, all channels are combined into a unique saliency map. This map either directly modulates encoding priority (higher priority for more salient pixels), or guides several virtual foveas towards the most salient locations (highest priority given to fovea centers).
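A toy sketch of the 'normalise and combine' idea described above; this is not Itti's implementation. It uses single-scale feature maps, a crude stand-in for the non-linear competition step, and a summed saliency map; all maps and weights are fabricated for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def normalise(feature_map):
    """Crude stand-in for the model's non-linear competition: scale the map
    to [0, 1], then boost maps with one dominant peak over noisy ones."""
    m = feature_map - feature_map.min()
    if m.max() > 0:
        m = m / m.max()
    mean_rest = m[m < m.max()].mean()        # mean of the non-peak activity
    return m * (m.max() - mean_rest) ** 2

# Three illustrative single-scale channels over a 32 x 32 frame.
intensity = rng.random((32, 32))             # noisy: many similar peaks
colour = rng.random((32, 32))                # noisy: many similar peaks
motion = np.zeros((32, 32))
motion[10, 20] = 1.0                         # one strong motion event

saliency = sum(normalise(ch) for ch in (intensity, colour, motion))
y, x = (int(v) for v in np.unravel_index(np.argmax(saliency), saliency.shape))
print((y, x))   # -> (10, 20): the isolated motion event dominates
```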
75
Summary

Preliminary results show that:
- Modular co-operative multi-net systems using unsupervised learning techniques can improve classification with multiple modalities.

Future work:
- Evaluate against larger sets of data;
- Further understanding of clustering and classification in SOMs;
- Further explore linkage of neighbourhoods (more than just a one-to-one mapping) and the theory underlying the model.
76
Afterword
It is important that research and development of neural computing systems continues to be informed by, and inspired by, the latest results from neuroscience - e.g. insights into multimodal abilities suggest research into modular multi-net architectures.
One may claim that the multi-net systems reported in this talk have a heteromodal region, in which the connection between uni-modal networks is learnt.
77
Afterword
Here is a movie of my latest project