
Computer Networks 51 (2007) 921–960
doi:10.1016/j.comnet.2006.10.002
www.elsevier.com/locate/comnet

    A survey on wireless multimedia sensor networks

    Ian F. Akyildiz *, Tommaso Melodia, Kaushik R. Chowdhury

Broadband and Wireless Networking Laboratory, School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, United States

Received 11 March 2006; received in revised form 6 August 2006; accepted 5 October 2006; available online 2 November 2006

* Corresponding author. Tel.: +1 404 894 5141; fax: +1 404 894 7883.

    Abstract

The availability of low-cost hardware such as CMOS cameras and microphones has fostered the development of Wireless Multimedia Sensor Networks (WMSNs), i.e., networks of wirelessly interconnected devices that are able to ubiquitously retrieve multimedia content such as video and audio streams, still images, and scalar sensor data from the environment. In this paper, the state of the art in algorithms, protocols, and hardware for wireless multimedia sensor networks is surveyed, and open research issues are discussed in detail. Architectures for WMSNs are explored, along with their advantages and drawbacks. Current off-the-shelf hardware as well as available research prototypes for WMSNs are listed and classified. Existing solutions and open research issues at the application, transport, network, link, and physical layers of the communication protocol stack are investigated, along with possible cross-layer synergies and optimizations.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Wireless sensor networks; Multimedia communications; Distributed smart cameras; Video sensor networks; Energy-aware protocol design; Cross-layer protocol design; Quality of service

    1. Introduction

Wireless sensor networks (WSN) [22] have drawn the attention of the research community in the last few years, driven by a wealth of theoretical and practical challenges. This growing interest can be largely attributed to new applications enabled by large-scale networks of small devices capable of harvesting information from the physical environment, performing simple processing on the extracted data and transmitting it to remote locations. Significant results in this area over the last few years have ushered in a surge of civil and military applications. As of today, most deployed wireless sensor networks measure scalar physical phenomena like temperature, pressure, humidity, or location of objects. In general, most of the applications have low bandwidth demands, and are usually delay tolerant.

More recently, the availability of inexpensive hardware such as CMOS cameras and microphones that are able to ubiquitously capture multimedia content from the environment has fostered the development of Wireless Multimedia Sensor Networks (WMSNs) [54,90], i.e., networks of wirelessly interconnected devices that allow retrieving video and audio streams, still images, and scalar sensor data. With rapid improvements and miniaturization in hardware, a single sensor device can be equipped with audio and visual information collection modules. As an example, the Cyclops image capturing and inference module [103] is designed for extremely light-weight imaging and can be interfaced with a host mote such as Crossbow's MICA2 [4] or MICAz [5]. In addition to the ability to retrieve multimedia data, WMSNs will also be able to store, process in real-time, correlate and fuse multimedia data originated from heterogeneous sources.

Wireless multimedia sensor networks will not only enhance existing sensor network applications such as tracking, home automation, and environmental monitoring, but they will also enable several new applications such as:

• Multimedia surveillance sensor networks. Wireless video sensor networks will be composed of interconnected, battery-powered miniature video cameras, each packaged with a low-power wireless transceiver that is capable of processing, sending, and receiving data. Video and audio sensors will be used to enhance and complement existing surveillance systems against crime and terrorist attacks. Large-scale networks of video sensors can extend the ability of law enforcement agencies to monitor areas, public events, private properties and borders.

• Storage of potentially relevant activities. Multimedia sensors could infer and record potentially relevant activities (thefts, car accidents, traffic violations), and make video/audio streams or reports available for future query.

• Traffic avoidance, enforcement and control systems. It will be possible to monitor car traffic in big cities or highways and deploy services that offer traffic routing advice to avoid congestion. In addition, smart parking advice systems based on WMSNs [29] will allow monitoring available parking spaces and provide drivers with automated parking advice, thus improving mobility in urban areas. Moreover, multimedia sensors may monitor the flow of vehicular traffic on highways and retrieve aggregate information such as average speed and number of cars. Sensors could also detect violations and transmit video streams to law enforcement agencies to identify the violator, or buffer images and streams in case of accidents for subsequent accident scene analysis.

• Advanced health care delivery. Telemedicine sensor networks [59] can be integrated with 3G multimedia networks to provide ubiquitous health care services. Patients will carry medical sensors to monitor parameters such as body temperature, blood pressure, pulse oximetry, ECG, and breathing activity. Furthermore, remote medical centers will perform advanced remote monitoring of their patients via video and audio sensors, location sensors, and motion or activity sensors, which can also be embedded in wrist devices [59].

• Automated assistance for the elderly and family monitors. Multimedia sensor networks can be used to monitor and study the behavior of elderly people as a means to identify the causes of illnesses that affect them such as dementia [106]. Networks of wearable or video and audio sensors can infer emergency situations and immediately connect elderly patients with remote assistance services or with relatives.

• Environmental monitoring. Several projects on habitat monitoring that use acoustic and video feeds are being envisaged, in which information has to be conveyed in a time-critical fashion. For example, arrays of video sensors are already used by oceanographers to determine the evolution of sandbars via image processing techniques [58].

• Person locator services. Multimedia content such as video streams and still images, along with advanced signal processing techniques, can be used to locate missing persons, or identify criminals or terrorists.

• Industrial process control. Multimedia content such as imaging, temperature, or pressure, amongst others, may be used for time-critical industrial process control. Machine vision is the application of computer vision techniques to industry and manufacturing, where information can be extracted and analyzed by WMSNs to support a manufacturing process such as those used in semiconductor chips, automobiles, food or pharmaceutical products. For example, in quality control of manufacturing processes, details or final products are automatically inspected to find defects. In addition, machine vision systems can detect the position and orientation of parts of the product to be picked up by a robotic arm. The integration of machine vision systems with WMSNs can simplify and add flexibility to systems for visual inspections and automated actions that require high-speed, high-magnification, and continuous operation.

As observed in [37], WMSNs will stretch the horizon of traditional monitoring and surveillance systems by:

• Enlarging the view. The Field of View (FoV) of a single fixed camera, or the Field of Regard (FoR) of a single moving pan-tilt-zoom (PTZ) camera, is limited. Instead, a distributed system of multiple cameras and sensors enables perception of the environment from multiple disparate viewpoints, and helps overcome occlusion effects.

• Enhancing the view. The redundancy introduced by multiple, possibly heterogeneous, overlapped sensors can provide enhanced understanding and monitoring of the environment. Overlapped cameras can provide different views of the same area or target, while the joint operation of cameras and audio or infrared sensors can help disambiguate cluttered situations.

• Enabling multi-resolution views. Heterogeneous media streams with different granularity can be acquired from the same point of view to provide a multi-resolution description of the scene and multiple levels of abstraction. For example, static medium-resolution camera views can be enriched by views from a zoom camera that provides a high-resolution view of a region of interest. Such a feature could be used, for instance, to recognize people based on their facial characteristics.

Many of the above applications require the sensor network paradigm to be rethought in view of the need for mechanisms to deliver multimedia content with a certain level of quality of service (QoS). Since the need to minimize the energy consumption has driven most of the research in sensor networks so far, mechanisms to efficiently deliver application-level QoS, and to map these requirements to network layer metrics such as latency and jitter, have not been primary concerns in mainstream research on classical sensor networks.

Conversely, algorithms, protocols and techniques to deliver multimedia content over large-scale networks have been the focus of intensive research in the last 20 years, especially in ATM wired and wireless networks. Later, many of the results derived for ATM networks have been readapted, and architectures such as Diffserv and Intserv for Internet QoS delivery have been developed. However, there are several main peculiarities that make QoS delivery of multimedia content in sensor networks an even more challenging, and largely unexplored, task:

• Resource constraints. Sensor devices are constrained in terms of battery, memory, processing capability, and achievable data rate [22]. Hence, efficient use of these scarce resources is mandatory.

• Variable channel capacity. While in wired networks the capacity of each link is assumed to be fixed and pre-determined, in multi-hop wireless networks the attainable capacity of each wireless link depends on the interference level perceived at the receiver. This, in turn, depends on the interaction of several functionalities that are distributively handled by all network devices, such as power control, routing, and rate policies. Hence, the capacity and delay attainable at each link are location dependent, vary continuously, and may be bursty in nature, thus making QoS provisioning a challenging task (a capacity sketch follows this list).

• Cross-layer coupling of functionalities. In multi-hop wireless networks, there is a strict interdependence among functions handled at all layers of the communication stack. Functionalities handled at different layers are inherently and strictly coupled due to the shared nature of the wireless communication channel. Hence, the various functionalities aimed at QoS provisioning should not be treated separately when efficient solutions are sought.

• Multimedia in-network processing. Processing of multimedia content has mostly been approached as a problem isolated from the network-design problem, with a few exceptions such as joint source-channel coding [44] and channel-adaptive streaming [51]. Hence, research that addressed the content delivery aspects has typically not considered the characteristics of the source content and has primarily studied cross-layer interactions among lower layers of the protocol stack. However, the processing and delivery of multimedia content are not independent, and their interaction has a major impact on the levels of QoS that can be delivered. WMSNs will allow performing multimedia in-network processing algorithms on the raw data. Hence, the QoS required at the application level will be delivered by means of a combination of both cross-layer optimization of the communication process, and in-network processing of raw data streams that describe the phenomenon of interest from multiple views, with different media, and on multiple resolutions. Hence, it is necessary to develop application-independent and self-organizing architectures to flexibly perform in-network processing of multimedia contents.
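To make the channel-capacity dependence flagged in the list above concrete (this formalization is our illustration, not the survey's), the rate attainable on a link from node i to node j can be written in Shannon-capacity terms as

C_{ij} = B \log_2 \Big( 1 + \frac{P_i G_{ij}}{N_0 B + \sum_{k \neq i} P_k G_{kj}} \Big),

where B is the channel bandwidth, P_i are the transmit powers, G_{ij} the channel gains, and N_0 the noise power spectral density. Power control and routing decisions made by every other concurrently active node k enter through the interference sum in the denominator, which is why attainable capacity is location dependent, time varying, and inherently coupled across layers.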

Efforts from several research areas will need to converge to develop efficient and flexible WMSNs, and this, in turn, will significantly enhance our ability to interact with the physical environment. These include advances in the understanding of energy-constrained wireless communications, and the integration of advanced multimedia processing techniques in the communication process. Another crucial issue is the development of flexible system architectures and software to allow querying the network to specify the required service (thus providing abstraction from implementation details). At the same time, it is necessary to provide the service in the most efficient way, which may be in contrast with the need for abstraction.

In this paper, we survey the state of the art in algorithms, protocols, and hardware for the development of wireless multimedia sensor networks, and discuss open research issues in detail. In particular, in Section 2 we point out the characteristics of wireless multimedia sensor networks, i.e., the major factors influencing their design. In Section 3, we suggest possible architectures for WMSNs and describe their characterizing features. In Section 4, we discuss and classify existing hardware and prototype implementations for WMSNs, while in Section 5 we discuss possible advantages and challenges of multimedia in-network processing. In Sections 6–10 we discuss existing solutions and open research issues at the application, transport, network, link, and physical layers of the communication stack, respectively. In Section 11, we discuss cross-layer synergies and possible optimizations, while in Section 12 we discuss additional complementary research areas such as actuation, synchronization and security. Finally, in Section 13 we conclude the paper.

2. Factors influencing the design of multimedia sensor networks

Wireless Multimedia Sensor Networks (WMSNs) will be enabled by the convergence of communication and computation with signal processing and several branches of control theory and embedded computing. This cross-disciplinary research will enable distributed systems of heterogeneous embedded devices that sense, interact, and control the physical environment. There are several factors that mainly influence the design of a WMSN, which are outlined in this section.

• Application-specific QoS requirements. The wide variety of applications envisaged on WMSNs will have different requirements. In addition to data delivery modes typical of scalar sensor networks, multimedia data include snapshot and streaming multimedia content. Snapshot-type multimedia data contain event-triggered observations obtained in a short time period. Streaming multimedia content is generated over longer time periods and requires sustained information delivery. Hence, a strong foundation is needed in terms of hardware and supporting high-level algorithms to deliver QoS and consider application-specific requirements. These requirements may pertain to multiple domains and can be expressed, amongst others, in terms of a combination of bounds on energy consumption, delay, reliability, distortion, or network lifetime.

• High bandwidth demand. Multimedia content, especially video streams, requires transmission bandwidth that is orders of magnitude higher than that supported by currently available sensors. For example, the nominal transmission rate of state-of-the-art IEEE 802.15.4 compliant components such as Crossbow's [3] MICAz or TelosB [6] motes is 250 kbit/s. Data rates at least one order of magnitude higher may be required for high-end multimedia sensors, with comparable power consumption. Hence, high data rate and low-power consumption transmission techniques need to be leveraged. In this respect, the ultra wide band (UWB) transmission technique seems particularly promising for WMSNs, and its applicability is discussed in Section 10.

• Multimedia source coding techniques. Uncompressed raw video streams require excessive bandwidth for a multi-hop wireless environment. For example, a single monochrome frame in the NTSC-based Quarter Common Intermediate Format (QCIF, 176 × 120) requires around 21 Kbyte, and at 30 frames per second (fps), a video stream requires over 5 Mbit/s (the arithmetic is sketched after this list). Hence, it is apparent that efficient processing techniques for lossy compression are necessary for multimedia sensor networks. Traditional video coding techniques used for wireline and wireless communications are based on the idea of reducing the bit rate generated by the source encoder by exploiting source statistics. To this aim, encoders rely on intra-frame compression techniques to reduce redundancy within one frame, while they leverage inter-frame compression (also known as predictive encoding or motion estimation) to exploit redundancy among subsequent frames, thus reducing the amount of data to be transmitted and stored and achieving good rate-distortion performance. Since predictive encoding requires complex encoders and powerful processing algorithms, and entails high energy consumption, it may not be suited for low-cost multimedia sensors. However, it has recently been shown [50] that the traditional balance of complex encoder and simple decoder can be reversed within the framework of the so-called distributed source coding, which exploits the source statistics at the decoder and, by shifting the complexity to this end, allows the use of simple encoders. Clearly, such algorithms are very promising for WMSNs and especially for networks of video sensors, where it may not be feasible to use existing video encoders at the source node due to processing and energy constraints.

• Multimedia in-network processing. WMSNs allow performing multimedia in-network processing algorithms on the raw data extracted from the environment. This requires new architectures for collaborative, distributed, and resource-constrained processing that allow for filtering and extraction of semantically relevant information at the edge of the sensor network. This may increase the system scalability by reducing the transmission of redundant information, merging data originated from multiple views, on different media, and with multiple resolutions. For example, in video security applications, information from uninteresting scenes can be compressed to a simple scalar value or not be transmitted altogether, while in environmental applications, distributed filtering techniques can create a time-elapsed image [120]. Hence, it is necessary to develop application-independent architectures to flexibly perform in-network processing of the multimedia content gathered from the environment. For example, IrisNet [93] uses application-specific filtering of sensor feeds at the source, i.e., each application processes its desired sensor feeds on the CPU of the sensor nodes where data are gathered. This dramatically reduces the bandwidth consumed, since instead of transferring raw data, IrisNet sends only a potentially small amount of processed data. However, the cost of multimedia processing algorithms may be prohibitive for low-end multimedia sensors. Hence, it is necessary to develop scalable and energy-efficient distributed filtering architectures to enable processing of redundant data as close as possible to the periphery of the network.

• Power consumption. Power consumption is a fundamental concern in WMSNs, even more than in traditional wireless sensor networks. In fact, sensors are battery-constrained devices, while multimedia applications produce high volumes of data, which require high transmission rates and extensive processing. While the energy consumption of traditional sensor nodes is known to be dominated by the communication functionalities, this may not necessarily be true in WMSNs. Therefore, protocols, algorithms and architectures to maximize the network lifetime while providing the QoS required by the application are a critical issue.

• Flexible architecture to support heterogeneous applications. WMSN architectures will support several heterogeneous and independent applications with different requirements. It is necessary to develop flexible, hierarchical architectures that can accommodate the requirements of all these applications in the same infrastructure.

• Multimedia coverage. Some multimedia sensors, in particular video sensors, have larger sensing radii and are sensitive to the direction of acquisition (directivity). Furthermore, video sensors can capture images only when there is an unobstructed line of sight between the event and the sensor. Hence, coverage models developed for traditional wireless sensor networks are not sufficient for pre-deployment planning of a multimedia sensor network.

• Integration with Internet (IP) architecture. It is of fundamental importance for the commercial development of sensor networks to provide services that allow querying the network to retrieve useful information from anywhere and at any time. For this reason, future WMSNs will be remotely accessible from the Internet, and will therefore need to be integrated with the IP architecture. The characteristics of WSNs rule out the possibility of all-IP sensor networks and recommend the use of application-level gateways or overlay IP networks as the best approach for integration between WSNs and the Internet [138].

• Integration with other wireless technologies. Large-scale sensor networks may be created by interconnecting local "islands" of sensors through other wireless technologies. This needs to be achieved without sacrificing the efficiency of operation within each individual technology.
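As promised in the source coding item above, the QCIF figures can be checked with a few lines of Python (a minimal sketch; the 8 bits per pixel assumed for a monochrome frame is our assumption):

# Raw bit-rate arithmetic behind the QCIF example in Section 2.
WIDTH, HEIGHT = 176, 120      # NTSC-based QCIF resolution
BITS_PER_PIXEL = 8            # assumed: monochrome, 8-bit samples
FPS = 30                      # frames per second

frame_bytes = WIDTH * HEIGHT * BITS_PER_PIXEL // 8
stream_bps = frame_bytes * 8 * FPS

print(f"One frame:  {frame_bytes / 1024:.1f} Kbyte")       # ~20.6, i.e., around 21 Kbyte
print(f"Raw stream: {stream_bps / 1e6:.2f} Mbit/s")        # ~5.07, i.e., over 5 Mbit/s
print(f"802.15.4 nominal rate: {250e3 / 1e6:.2f} Mbit/s")  # 0.25

The twenty-fold gap between a single raw QCIF stream and the 250 kbit/s nominal rate of an 802.15.4 mote is what makes lossy compression and in-network filtering unavoidable rather than optional.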

    3. Network architecture

The problem of designing a scalable network architecture is of primary importance. Most proposals for wireless sensor networks are based on a flat, homogeneous architecture in which every sensor has the same physical capabilities and can only interact with neighboring sensors. Traditionally, the research on algorithms and protocols for sensor networks has focused on scalability, i.e., how to design solutions whose applicability would not be limited by the growing size of the network. Flat topologies may not always be suited to handle the amount of traffic generated by multimedia applications including audio and video. Likewise, the processing power required for data processing and communications, and the power required to operate it, may not be available on each node.

Fig. 1. Reference architecture of a wireless multimedia sensor network.

    3.1. Reference architecture

In Fig. 1, we introduce a reference architecture for WMSNs, where three sensor networks with different characteristics are shown, possibly deployed in different physical locations. The first cloud on the left shows a single-tier network of homogeneous video sensors. A subset of the deployed sensors have higher processing capabilities, and are thus referred to as processing hubs. The union of the processing hubs constitutes a distributed processing architecture. The multimedia content gathered is relayed to a wireless gateway through a multi-hop path. The gateway is interconnected to a storage hub, that is in charge of storing multimedia content locally for subsequent retrieval. Clearly, more complex architectures for distributed storage can be implemented when allowed by the environment and the application needs, which may result in energy savings since, by storing it locally, the multimedia content does not need to be wirelessly relayed to remote locations. The wireless gateway is also connected to a central sink, which implements the software front-end for network querying and tasking. The second cloud represents a single-tiered clustered architecture of heterogeneous sensors (only one cluster is depicted). Video, audio, and scalar sensors relay data to a central clusterhead, which is also in charge of performing intensive multimedia processing on the data (processing hub). The clusterhead relays the gathered content to the wireless gateway and to the storage hub. The last cloud on the right represents a multi-tiered network, with heterogeneous sensors. Each tier is in charge of a subset of the functionalities. Resource-constrained, low-power scalar sensors are in charge of performing simpler tasks, such as detecting scalar physical measurements, while resource-rich, high-power devices are responsible for more complex tasks. Data processing and storage can be performed in a distributed fashion at each different tier.

    3.2. Single-tier vs. multi-tier sensor deployment

One possible approach for designing a multimedia sensor application is to deploy homogeneous sensors and program each sensor to perform all possible application tasks. Such an approach yields a flat, single-tier network of homogeneous sensor nodes. An alternative, multi-tier approach is to use heterogeneous elements [69]. In this approach, resource-constrained, low-power elements are in charge of performing simpler tasks, such as detecting scalar physical measurements, while resource-rich, high-power devices take on more complex tasks. For instance, a surveillance application can rely on low-fidelity cameras or scalar acoustic sensors to perform motion or intrusion detection, while high-fidelity cameras can be woken up on-demand for object recognition and tracking. In [68], a multi-tier architecture is advocated for video sensor networks for surveillance applications. The architecture is based on multiple tiers of cameras with different functionalities, with the lower tier constituted of low-resolution imaging sensors, and the higher tier composed of high-end pan-tilt-zoom cameras. It is argued, and shown by means of experiments, that such an architecture offers considerable advantages with respect to a single-tier architecture in terms of scalability, lower cost, better coverage, higher functionality, and better reliability.

    3.3. Coverage

In traditional WSNs, sensor nodes collect information from the environment within a pre-defined sensing range, i.e., a roughly circular area defined by the type of sensor being used.

Multimedia sensors generally have larger sensing radii and are also sensitive to the direction of data acquisition. In particular, cameras can capture images of objects or parts of regions that are not necessarily close to the camera itself. However, the image can obviously be captured only when there is an unobstructed line-of-sight between the event and the sensor. Furthermore, each multimedia sensor/camera perceives the environment or the observed object from a different and unique viewpoint, given the different orientations and positions of the cameras relative to the observed event or region. In [118], a preliminary investigation of the coverage problem for video sensor networks is conducted. The concept of sensing range is replaced with the camera's field of view, i.e., the maximum volume visible from the camera. It is also shown how an algorithm designed for traditional sensor networks does not perform well with video sensors in terms of coverage preservation of the monitored area.
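To illustrate how a directional model differs from the classical sensing disk, the following Python sketch (ours, not the algorithm of [118]) tests whether a target falls inside a camera's 2D field-of-view cone; line of sight is assumed unobstructed and all parameter values are illustrative:

import math

def covered(cam_xy, heading_deg, fov_deg, depth, target_xy):
    # A target is covered if it is within sensing depth and within
    # half the angular field of view of the camera heading.
    dx = target_xy[0] - cam_xy[0]
    dy = target_xy[1] - cam_xy[1]
    if math.hypot(dx, dy) > depth:
        return False
    bearing = math.degrees(math.atan2(dy, dx))
    # Smallest angular offset between camera heading and target bearing
    off = abs((bearing - heading_deg + 180.0) % 360.0 - 180.0)
    return off <= fov_deg / 2.0

# Camera at the origin facing east, 60-degree FoV, 50 m depth:
print(covered((0, 0), 0.0, 60.0, 50.0, (30, 10)))  # True  (~18 degrees off-axis)
print(covered((0, 0), 0.0, 60.0, 50.0, (10, 30)))  # False (~72 degrees off-axis)

Unlike the disk model, rotating the camera changes which of two equally distant targets is covered, which is exactly what disk-based deployment-planning algorithms fail to capture.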

    4. Multimedia sensor hardware

In this section, we review and classify existing imaging, multimedia, and processing wireless devices that will find application in next-generation wireless multimedia sensor networks. In particular, we discuss existing hardware, with a particular emphasis on video capturing devices, review existing implementations of multimedia sensor networks, and discuss current possibilities for energy harvesting for multimedia sensor devices.

    4.1. Enabling hardware platforms

High-end pan-tilt-zoom cameras and high-resolution digital cameras are widely available on the market. However, while such sophisticated devices can find application as high-quality tiers of multimedia sensor networks, we concentrate on low-cost, low-energy-consumption imaging and processing devices that will be densely deployed, provide detailed visual information from multiple disparate viewpoints, help overcome occlusion effects, and thus enable enhanced interaction with the environment.

    4.1.1. Low-resolution imaging motes

The recent availability of CMOS imaging sensors [61] that capture and process an optical image within a single integrated chip, thus eliminating the need for the many separate chips required by the traditional charge-coupled device (CCD) technology, has enabled the massive deployment of low-cost visual sensors. CMOS image sensors are already used in many industrial and consumer sectors, such as cell phones, personal digital assistants (PDAs), and consumer and industrial digital cameras. CMOS image quality is now matching CCD quality in the low- and mid-range, while CCD is still the technology of choice for high-end image sensors. The CMOS technology allows integrating a lens, an image sensor, and image processing algorithms, including image stabilization and image compression, on the same chip. With respect to CCD, CMOS cameras are smaller, lighter, and consume less power. Hence, they constitute a suitable technology to realize imaging sensors to be interfaced with wireless motes.

However, existing CMOS imagers are still designed to be interfaced with computationally rich host devices, such as cell phones or PDAs. For this reason, the objective of the Cyclops module [103] is to fill the gap between CMOS cameras and computationally constrained devices. Cyclops is an electronic interface between a CMOS camera module and a wireless mote such as MICA2 or MICAz, and contains programmable logic and memory for high-speed data communication. Cyclops consists of an imager (CMOS Agilent ADCM-1700 CIF camera), an 8-bit ATMEL ATmega128L microcontroller (MCU), a complex programmable logic device (CPLD), an external SRAM, and an external Flash. The MCU controls the imager, configures its parameters, and performs local processing on the image to produce an inference. Since image capture requires faster data transfer and address generation than the 4 MHz MCU can provide, a CPLD is used to provide access to the high-speed clock. Cyclops firmware is written in the nesC language [48], based on the TinyOS libraries. The module is connected to a host mote, to which it provides a high-level interface that hides the complexity of the imaging device from the host mote. Moreover, it can perform simple inference on the image data and present it to the host.

Researchers at Carnegie Mellon University are developing the CMUcam 3, an embedded camera endowed with a CIF-resolution (352 × 288) RGB color sensor that can load images into memory at 26 frames per second. CMUcam 3 provides software JPEG compression and a basic image manipulation library, and can be interfaced with an 802.15.4 compliant TelosB mote [6].

In [41], the design of an integrated mote for wireless image sensor networks is described. The design is driven by the need to endow motes with adequate processing power and memory size for image sensing applications. It is argued that 32-bit processors are better suited for image processing than their 8-bit counterparts, which are used in most existing motes. It is shown that the time needed to perform operations such as 2-D convolution on an 8-bit processor such as the ATMEL ATmega128 clocked at 4 MHz is 16 times higher than with a 32-bit ARM7 device clocked at 48 MHz, while the power consumption of the 32-bit processor is only six times higher. Hence, an 8-bit processor turns out to be slower and more energy-consuming. Based on these premises, a new image mote is developed based on an ARM7 32-bit CPU clocked at 48 MHz, with external FRAM or Flash memory and an 802.15.4 compliant Chipcon CC2420 radio, that is interfaced with mid-resolution ADCM-1670 CIF CMOS sensors and low-resolution 30 × 30 pixel optical sensors.

The same conclusion is drawn in [81], where the energy consumption of the 8-bit Atmel AVR processor clocked at 8 MHz is compared to that of the 32-bit Intel PXA255 processor, embedded on a Stargate platform [10] and clocked at 400 MHz. Three representative algorithms are selected as benchmarks, i.e., the cyclic redundancy check, a finite impulse response filter, and a fast Fourier transform. Surprisingly, it is shown that even for such relatively simple algorithms the energy consumption of the 8-bit processor is between one and two orders of magnitude higher.
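The energy argument in [41] follows directly from energy = power × time; a one-line check of the reported ratios (only the 16× and 6× figures come from [41]; everything else is arithmetic):

# Energy per task, relative: E = P * t, so only the ratios matter.
time_ratio = 16    # ATmega128 @ 4 MHz takes 16x longer for 2-D convolution
power_ratio = 6    # ARM7 @ 48 MHz draws 6x more power

energy_ratio = time_ratio / power_ratio
print(f"8-bit MCU uses {energy_ratio:.1f}x the energy of the 32-bit CPU")
# -> ~2.7x: the slower 8-bit part also loses on energy per operation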

4.1.2. Medium-resolution imaging motes based on the Stargate platform

Intel has developed several prototypes that constitute important building platforms for WMSN applications. The Stargate board [10] is a high-performance processing platform designed for sensor, signal processing, control, robotics, and sensor network applications. It is designed by Intel and produced by Crossbow. Stargate is based on Intel's PXA-255 XScale 400 MHz RISC processor, which is the same processor found in many handheld computers including the Compaq IPAQ and the Dell Axim. Stargate has 32 Mbyte of Flash memory, 64 Mbyte of SDRAM, and an on-board connector for Crossbow's MICA2 or MICAz motes as well as PCMCIA Bluetooth or IEEE 802.11 cards. Hence, it can work as a wireless gateway and as a computational hub for in-network processing algorithms. When connected with a webcam or other capturing device, it can function as a medium-resolution multimedia sensor, although its energy consumption is still high, as documented in [80]. Moreover, although efficient software implementations exist, XScale processors do not have hardware support for floating point operations, which may be needed to efficiently perform multimedia processing algorithms.

Intel has also developed two prototype generations of wireless sensors, known as Imote and Imote2. Imote is built around an integrated wireless microcontroller consisting of a 12 MHz 32-bit ARM7 processor, a Bluetooth radio, 64 Kbyte RAM and 32 Kbyte FLASH memory, as well as several I/O options. The software architecture is based on an ARM port of TinyOS. The second generation of Intel motes shares a common core with the next-generation Stargate 2 platform, and is built around a new low-power 32-bit PXA271 XScale processor at 320/416/520 MHz, which enables performing DSP operations for storage or compression, and an IEEE 802.15.4 ChipCon CC2420 radio. It has large on-board RAM and Flash memories (32 Mbyte), additional support for alternate radios, and a variety of high-speed I/O to connect digital sensors or cameras. Its size is also very limited, 48 × 33 mm, and it can run the Linux operating system and Java applications.

    4.2. Energy harvesting

As mentioned before, techniques for prolonging the lifetime of battery-powered sensors have been the focus of a vast amount of literature in sensor networks. These techniques include hardware optimizations such as dynamic optimization of voltage and clock rate, wake-up procedures to keep electronics inactive most of the time, and energy-aware protocol development for sensor communications. In addition, energy-harvesting techniques, which extract energy from the environment where the sensor itself lies, offer another important means to prolong the lifetime of sensor devices.

Systems able to perpetually power sensors based on simple COTS photovoltaic cells coupled with supercapacitors and rechargeable batteries have already been demonstrated [64]. In [96], the state of the art in more unconventional techniques for energy harvesting (also referred to as energy scavenging) is surveyed. Technologies to generate energy from background radio signals, thermoelectric conversion, vibrational excitation, and the human body are overviewed.

As far as collecting energy from background radio signals is concerned, unfortunately, an electric field of 1 V/m yields only 0.26 µW/cm², as opposed to the 100 µW/cm² produced by a crystalline silicon solar cell exposed to bright sunlight. Electric fields with intensities of a few volts per meter are only encountered close to strong transmitters. Another practice, which consists in deliberately broadcasting RF energy to power electronic devices, is severely limited by legal limits motivated by health and safety concerns.

While thermoelectric conversion may not be suitable for wireless devices, harvesting energy from vibrations in the surrounding environment may provide another useful source of energy. Vibrational magnetic power generators based on moving magnets or coils may yield powers that range from tens of microwatts, when based on microelectromechanical system (MEMS) technologies, to over a milliwatt for larger devices. Other vibrational microgenerators are based on charged capacitors with moving plates and, depending on their excitation and power conditioning, yield power on the order of 10 µW. In [96], it is also reported that a recent analysis [91] suggested that 1 cm³ vibrational microgenerators can be expected to yield up to 800 µW/cm³ from machine-induced stimuli, which is orders of magnitude higher than what is provided by currently available microgenerators. Hence, this is a promising area of research for small battery-powered devices.

While these techniques may provide an additional source of energy and help prolong the lifetime of sensor devices, they yield power that is several orders of magnitude lower than the power consumption of state-of-the-art multimedia devices. Hence, they may currently be suitable only for very-low-duty-cycle devices.
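A quick feasibility calculation shows what "very low duty cycle" means in practice. The harvested figure below comes from the vibrational estimate above; the node's active and sleep power draws are our assumptions for a webcam-class multimedia sensor:

# Sustainable duty cycle d solves: harvested = d*P_active + (1-d)*P_sleep
harvest_uW = 800.0    # optimistic 1 cm^3 vibrational microgenerator [91,96]
active_uW = 250e3     # assumed: ~250 mW while capturing and transmitting video
sleep_uW = 10.0       # assumed: ~10 uW in deep sleep

d = (harvest_uW - sleep_uW) / (active_uW - sleep_uW)
print(f"Sustainable duty cycle: {100 * d:.2f}%")   # ~0.32%

Even under an optimistic harvesting estimate, the node could be active only a fraction of a percent of the time, consistent with the conclusion above.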

4.3. Examples of deployed multimedia sensor networks

There have been several recent experimental studies, mostly limited to video sensor networks. Panoptes [46] is a system developed for environmental observation and surveillance applications, based on Intel StrongARM PDA platforms with a Logitech webcam as a video capture device. Here, video sensors are high-end devices with the Linux operating system and 64 Mbyte of memory, and are networked through 802.11 networking cards. The system includes spatial compression (but not temporal), distributed filtering, buffering, and adaptive priorities for the video stream.

In [35], a system whose objective is to limit the computation, bandwidth, and human attention burdens imposed by large-scale video surveillance systems is described. In-network processing is used on each camera to filter out uninteresting events locally, avoiding disambiguation and tracking of irrelevant environmental distractors. A resource allocation algorithm is also proposed to steer pan-tilt cameras to follow interesting targets while maintaining awareness of possibly emerging new targets.

In [69], the design and implementation of SensEye, a multi-tier network of heterogeneous wireless nodes and cameras, is described. The surveillance application consists of three tasks: object detection, recognition and tracking. The objective of the design is to demonstrate that a camera sensor network containing heterogeneous elements provides numerous benefits over traditional homogeneous sensor networks. For this reason, SensEye follows a three-tier architecture, as shown in Fig. 2. The lowest tier consists of low-end devices, i.e., MICA2 Motes equipped with 900 MHz radios interfaced with scalar sensors, e.g., vibration sensors. The second tier is made up of motes equipped with low-fidelity Cyclops [103] or CMUcam [107] camera sensors. The third tier consists of Stargate [10] nodes equipped with webcams. Each Stargate is equipped with an embedded 400 MHz XScale processor that runs Linux and a webcam that can capture higher fidelity images than tier 2 cameras. Tier 3 nodes also perform gateway functions, as they are endowed with a low data rate radio to communicate with motes in tiers 1–2 at 900 MHz, and an 802.11 radio to communicate with tier 3 Stargate nodes. An additional fourth tier may consist of a sparse deployment of high-resolution, high-end pan-tilt-zoom cameras connected to embedded PCs. The camera sensors at this tier can be used to track moving objects, and can be utilized to fill coverage gaps and provide additional redundancy. The underlying design principle is to map each task requested by the application to the lowest tier with sufficient resources to perform the task. Devices from higher tiers are woken up on-demand only when necessary. For example, a high-resolution camera can be woken up to retrieve high-resolution images of an object that has been previously detected by a lower tier. It is shown that the system can achieve an order of magnitude reduction in energy consumption while providing comparable surveillance accuracy with respect to single-tier surveillance systems.

Fig. 2. The multi-tier architecture of SensEye [69]: scalar sensors on motes (tier 1), low-resolution cameras on motes (tier 2), and webcams on Stargates (tier 3), with wakeup signals flowing up the tiers, handoff between tier 3 nodes, and video streamed from tier 3.
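SensEye's design principle, stated above, lends itself to a very small scheduling rule. The sketch below illustrates it in Python; the capability sets are illustrative and not taken from [69]:

# Map each task to the lowest tier whose capabilities cover it; higher
# (more power-hungry) tiers are woken on demand only.
TIERS = [
    (1, {"motion_detect"}),                                          # scalar sensor + mote
    (2, {"motion_detect", "object_detect"}),                         # low-res camera + mote
    (3, {"motion_detect", "object_detect", "recognize", "track"}),   # webcam + Stargate
]

def assign(task):
    for tier, capabilities in TIERS:
        if task in capabilities:
            return tier          # lowest tier with sufficient resources
    raise ValueError(f"no tier can run {task!r}")

print(assign("motion_detect"))   # 1: no camera needs to wake up
print(assign("recognize"))       # 3: woken only after lower tiers detect something

The energy win reported in [69] comes precisely from keeping tiers 2 and 3 asleep until tier 1 produces a trigger.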

In [80], experimental results on the energy consumption of a video sensor network testbed are presented. Each sensing node in the testbed consists of a Stargate board equipped with an 802.11 wireless network card and a Logitech QuickCam Pro 4000 webcam. The energy consumption is assessed using a benchmark that runs basic tasks such as processing, flash memory access, image acquisition, and communication over the network. Both steady-state and transient energy consumption behaviors, obtained by direct measurements of current with a digital multimeter, are reported. In the steady state, it is shown that communication-related tasks are less energy-consuming than intensive processing and flash access when the radio modules are loaded. Interestingly, and unlike in traditional wireless sensor networks [99], the processing-intensive benchmark results in the highest current requirement, and transmission is shown to be only about 5% more energy-consuming than reception. Experimental results also show that the delay and additional amount of energy consumed due to transitions (e.g., to go to sleep mode) are not negligible and must be accounted for in network and protocol design.

Fig. 3. Stargate board interfaced with a medium-resolution camera. Stargate hosts an 802.11 card and a MICAz mote that functions as a gateway to the sensor network.

Fig. 4. Acroname GARCIA, a mobile robot with a mounted pan-tilt camera and endowed with 802.11 as well as Zigbee interfaces.

IrisNet (Internet-scale Resource-Intensive Sensor Network Services) [93] is an example software platform to deploy heterogeneous services on WMSNs. IrisNet allows harnessing a global, wide-area sensor network by performing Internet-like queries on this infrastructure. Video sensors and scalar sensors are spread throughout the environment and collect potentially useful data. IrisNet allows users to perform Internet-like queries on video sensors and other data. The user views the sensor network as a single unit that can be queried through a high-level language. Each query operates over data collected from the global sensor network, and allows simple Google-like queries as well as more complex queries involving arithmetic and database operators.

The architecture of IrisNet is two-tiered: heterogeneous sensors implement a common shared interface and are called sensing agents (SA), while the data produced by sensors is stored in a distributed database that is implemented on organizing agents (OA). Different sensing services run simultaneously on the architecture. Hence, the same hardware infrastructure can provide different sensing services. For example, a set of video sensors can provide a parking space finder service as well as a surveillance service. Sensor data is represented in the Extensible Markup Language (XML), which allows easy organization of hierarchical data. A group of OAs is responsible for a sensing service, collects data produced by that service, and organizes the information in a distributed database to answer the class of relevant queries. IrisNet also allows programming sensors with filtering code that processes sensor readings in a service-specific way. A single SA can execute several such software filters (called senselets) that process the raw sensor data based on the requirements of the service that needs to access the data. After senselet processing, the distilled information is sent to a nearby OA.
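The senselet idea can be pictured with a short sketch. This is a hypothetical illustration of the concept, not IrisNet's actual senselet API; the vision routine is assumed to be supplied by the service:

# A senselet-style filter running on a sensing agent (SA): turn a raw
# camera frame into the few bytes the parking-space-finder service needs.
def parking_senselet(frame, spaces, is_occupied):
    # frame:       raw image captured at the SA
    # spaces:      list of (space_id, region) pairs the service cares about
    # is_occupied: service-supplied vision routine, (frame, region) -> bool
    return {space_id: is_occupied(frame, region) for space_id, region in spaces}

# The SA forwards only the distilled, XML-encodable result to its OA,
# e.g. {"A1": True, "A2": False}, instead of a multi-kilobyte frame.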

We have recently built an experimental testbed at the Broadband and Wireless Networking (BWN) Laboratory at Georgia Tech, based on current off-the-shelf advanced devices, to demonstrate the efficiency of algorithms and protocols for multimedia communications through wireless sensor networks.

The testbed is integrated with our scalar sensor network testbed, which is composed of a heterogeneous collection of Imotes from Intel and MICAz motes from Crossbow. Although our testbed already includes 60 scalar sensors, we plan to increase its size to deploy a larger-scale testbed that allows testing more complex algorithms and assessing the scalability of the communication protocols under examination.

The WMSN testbed includes three different types of multimedia sensors: low-end imaging sensors, medium-quality webcam-based multimedia sensors, and pan-tilt cameras mounted on mobile robots.

Low-end imaging sensors such as CMOS cameras can be interfaced with Crossbow MICAz motes. Medium-end video sensors are based on Logitech webcams interfaced with Stargate platforms (see Fig. 3).

The high-end video sensors consist of pan-tilt cameras installed on an Acroname GARCIA robotic platform [1], which we refer to as an actor, as shown in Fig. 4. Actors constitute a mobile platform that can perform adaptive sampling based on event features detected by low-end motes. The mobile actor can redirect high-resolution cameras to a region of interest when events are detected by the lower-tier, low-resolution video sensors that are densely deployed, as seen in Fig. 5.

Fig. 5. GARCIA deployed on the sensor testbed. It acts as a mobile sink, and can move to the area of interest for closer visual inspection. It can also coordinate with other actors and has built-in collision avoidance capability.

The testbed also includes storage and computational hubs, which are needed to store large multimedia content and perform computationally intensive multimedia processing algorithms.

    5. Collaborative in-network processing

As discussed previously, collaborative in-network multimedia processing techniques are of great interest in the context of a WMSN. It is necessary to develop architectures and algorithms to flexibly perform these functionalities in-network with minimum energy consumption and limited execution time. The objective is usually to avoid transmitting large amounts of raw streams to the sink by processing the data in the network to reduce the communication volume.

Given a source of data (e.g., a video stream), different applications may require diverse information (e.g., a raw video stream vs. simple scalar or binary information inferred by processing the video stream). This is referred to as application-specific querying and processing. Hence, it is necessary to develop expressive and efficient querying languages, and to develop distributed filtering and in-network processing architectures, to allow real-time retrieval of useful information.

Similarly, it is necessary to develop architectures that efficiently allow performing data fusion or other complex processing operations in-network. Algorithms for both inter-media and intra-media data aggregation and fusion need to be developed, as the simple distributed processing schemes developed for existing scalar sensors are not suitable for the computation-intensive processing required by multimedia content. Multimedia sensor networks may require computation-intensive processing algorithms (e.g., to detect the presence of suspicious activity from a video stream). This may require considerable processing to extract meaningful information and/or to perform compression. A fundamental question to be answered is whether this processing can be done on sensor nodes (i.e., a flat architecture of multi-functional sensors that can perform any task), or if the need for specialized devices, e.g., computation hubs, arises.

In what follows, we discuss a non-exhaustive set of significant examples of processing techniques that could be performed distributively in a WMSN, and that will likely drive research on architectures and algorithms for the distributed processing of raw sensor data.

    5.1. Data alignment and image registration

Data alignment consists of merging information from multiple sources. One of the most widespread data alignment concepts, image registration [137], is a family of techniques, widely used in areas such as remote sensing, medical imaging, and computer vision, to geometrically align different images (reference and sensed images) of the same scene taken at different times, from different viewpoints, and/or by different sensors:

• Different viewpoints (multi-view analysis). Images of the same scene are acquired from different viewpoints, to gain a larger 2D view or a 3D representation of the scene of interest. The main applications are in remote sensing, computer vision and 3D shape recovery.

• Different times (multi-temporal analysis). Images of the same scene are acquired at different times. The aim is to find and evaluate changes in the scene of interest over time. The main applications are in computer vision, security monitoring, and motion tracking.


• Different sensors (multi-modal analysis). Images of the same scene are acquired by different sensors. The objective is to integrate the information obtained from different source streams to gain a more complex and detailed scene representation.

Registration methods usually consist of four steps, i.e., feature detection, feature matching, transform model estimation, and image resampling and transformation. In feature detection, distinctive objects such as closed-boundary regions, edges, contours, line intersections, corners, etc. are detected. In feature matching, the correspondence between the features detected in the sensed image and those detected in the reference image is established. In transform model estimation, the type and parameters of the so-called mapping functions, which align the sensed image with the reference image, are estimated. The parameters of the mapping functions are computed by means of the established feature correspondences. In the last step, image resampling and transformation, the sensed image is transformed by means of the mapping functions.

These functionalities can clearly be prohibitive for a single sensor. Hence, research is needed on how to perform these functionalities on parallel architectures of sensors to produce single data sets.
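For reference, the four steps map naturally onto a conventional centralized implementation; the sketch below uses OpenCV (assuming the opencv-python package) on a single machine, whereas a WMSN would need to distribute these stages across nodes:

# The four registration steps on one node, using ORB features.
import cv2
import numpy as np

def register(sensed, reference):
    # 1. Feature detection: distinctive keypoints in both images
    orb = cv2.ORB_create()
    k1, d1 = orb.detectAndCompute(sensed, None)
    k2, d2 = orb.detectAndCompute(reference, None)
    # 2. Feature matching: correspondences between the two feature sets
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(d1, d2)
    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    # 3. Transform model estimation: fit a mapping function robustly
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    # 4. Resampling and transformation: warp the sensed image onto the reference
    h, w = reference.shape[:2]
    return cv2.warpPerspective(sensed, H, (w, h))

Steps 1 and 2 are natural candidates for execution on the camera nodes themselves, since keypoints and descriptors are far smaller than raw frames; steps 3 and 4 could then run on a processing hub.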

    5.2. WMSNs as distributed computer vision systems

Computer vision is a subfield of artificial intelligence whose purpose is to allow a computer to extract features from a scene, an image or multi-dimensional data in general. The objective is to present this information to a human operator or to control some process (e.g., a mobile robot or an autonomous vehicle). The image data that is fed into a computer vision system is often a digital image, a video sequence, a 3D volume from a tomography device or other multimedia content. Traditional computer vision algorithms require extensive computation, which in turn entails high power consumption.

WMSNs enable a new approach to computer vision, where visual observations across the network can be performed by means of distributed computations on multiple, possibly low-end, vision nodes. This requires tools to interface with the user, such as new querying languages and abstractions to express complex tasks that are then distributively accomplished through low-level operations on multiple vision nodes. To this aim, it is necessary to coordinate computations across the vision nodes and return the integrated results, which will consist of metadata information, to the final user.

In [102], the proposed Deep Vision network performs operations including object detection or classification, image segmentation, and motion analysis through a network of low-end MICA motes equipped with Cyclops cameras [103]. Information such as the presence of an intruder, the number of visitors in a scene or the probability of presence of a human in the monitored area is obtained by collecting the results of these operations. Deep Vision provides a querying interface to the user in the form of declarative queries. Each operation is represented as an attribute that can be executed through an appropriate query. In this way, low-level operations and processing are encapsulated in a high-level querying interface that enables simple interaction with the video network. As an example, the vision network can be deployed in areas with public and restricted access spaces. The task of detecting objects in the restricted-access area can be expressed as a query that requests the result of object detection computations such as

SELECT Object, Location
REPORT = 30
FROM Network
WHERE Access = Restricted
PERIOD = 30.

The above query triggers the execution of the object detection process on the vision nodes that are located in the restricted-access areas in 30 s intervals.

    6. Application layer

The functionalities handled at the application layer of a WMSN are characterized by high heterogeneity, and encompass traditional communication problems as well as more general system challenges. The services offered by the application layer include: (i) providing traffic management and admission control functionalities, i.e., prevent applications from establishing data flows when the network resources needed are not available; (ii) performing source coding according to application requirements and hardware constraints, by leveraging advanced multimedia encoding techniques; (iii) providing flexible and efficient system software, i.e., operating systems and middleware, to export services for higher-layer applications to build upon; (iv) providing primitives for applications to leverage collaborative, advanced in-network multimedia processing techniques. In this section, we provide an overview of these challenges.

    6.1. Traffic classes

Admission control has to be based on the QoS requirements of the overlying application. We envision that WMSNs will need to provide support and differentiated service for several different classes of applications. In particular, they will need to provide differentiated service between real-time and delay-tolerant applications, and between loss-tolerant and loss-intolerant applications. Moreover, some applications may require a continuous stream of multimedia data for a prolonged period of time (multimedia streaming), while some other applications may require event-triggered observations obtained in a short time period (snapshot multimedia content). The main traffic classes that need to be supported are:

• Real-time, Loss-tolerant, Multimedia Streams. This class includes video and audio streams, or multi-level streams composed of video/audio and other scalar data (e.g., temperature readings), as well as metadata associated with the stream, that need to reach a human or automated operator in real-time, i.e., within strict delay bounds, and that are however relatively loss tolerant (e.g., video streams can be within a certain level of distortion). Traffic in this class usually has high bandwidth demand.

• Delay-tolerant, Loss-tolerant, Multimedia Streams. This class includes multimedia streams that, being intended for storage or subsequent offline processing, do not need to be delivered within strict delay bounds. However, due to the typically high bandwidth demand of multimedia streams and to the limited buffers of multimedia sensors, data in this traffic class needs to be transmitted almost in real-time to avoid excessive losses.

• Real-time, Loss-tolerant, Data. This class may include monitoring data from densely deployed scalar sensors such as light sensors whose monitored phenomenon is characterized by spatial correlation, or loss-tolerant snapshot multimedia data (e.g., images of a phenomenon taken from several viewpoints at the same time). Hence, sensor data has to be received timely, but the application is moderately loss-tolerant. The bandwidth demand is usually between low and moderate.

• Real-time, Loss-intolerant, Data. This may include data from time-critical monitoring processes such as distributed control applications. The bandwidth demand varies between low and moderate.

• Delay-tolerant, Loss-intolerant, Data. This may include data from critical monitoring processes, with low or moderate bandwidth demand, that require some form of offline post-processing.

• Delay-tolerant, Loss-tolerant, Data. This may include environmental data from scalar sensor networks, or non-time-critical snapshot multimedia content, with low or moderate bandwidth demand.

QoS requirements have recently been considered as application admission criteria for sensor networks. In [97], an application admission control algorithm is proposed whose objective is to maximize the network lifetime subject to bandwidth and reliability constraints of the application. An application admission control method is proposed in [28], which determines admissions based on the added energy load and application rewards. While these approaches address application-level QoS considerations, they fail to consider multiple QoS requirements (e.g., delay, reliability, and energy consumption) simultaneously, as required in WMSNs. Furthermore, these solutions do not consider the peculiarities of WMSNs, i.e., they do not try to base admission control on a tight balancing between communication optimizations and in-network computation. There is a clear need for new criteria and mechanisms to manage the admission of multimedia flows according to the desired application-layer QoS.
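To make the preceding classification and the multi-constraint admission requirement concrete, the Python sketch below encodes the six traffic classes and admits a flow only if delay, loss, and bandwidth constraints all hold at once. All names, thresholds, and indicative rates (loosely based on the 40-500 kbit/s figures quoted later in Section 7.2.2) are our own; this is a toy illustration, not a proposed admission control algorithm.

from dataclasses import dataclass

@dataclass
class TrafficClass:
    name: str
    real_time: bool         # strict delay bounds required?
    loss_tolerant: bool     # can the application absorb packet loss?
    bandwidth_kbps: float   # indicative demand

CLASSES = [
    TrafficClass("real-time loss-tolerant multimedia stream", True, True, 500.0),
    TrafficClass("delay-tolerant loss-tolerant multimedia stream", False, True, 500.0),
    TrafficClass("real-time loss-tolerant data", True, True, 64.0),
    TrafficClass("real-time loss-intolerant data", True, False, 64.0),
    TrafficClass("delay-tolerant loss-intolerant data", False, False, 40.0),
    TrafficClass("delay-tolerant loss-tolerant data", False, True, 40.0),
]

def admit(tc, path_delay_ms, path_loss_rate, residual_bw_kbps,
          delay_bound_ms=200.0, loss_bound=0.05):
    # Unlike single-metric admission, all constraints must hold simultaneously.
    if tc.real_time and path_delay_ms > delay_bound_ms:
        return False
    if not tc.loss_tolerant and path_loss_rate > 0.001:  # near-lossless path needed
        return False
    if tc.loss_tolerant and path_loss_rate > loss_bound:
        return False
    return tc.bandwidth_kbps <= residual_bw_kbps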

    6.2. Multimedia encoding techniques

There exists a vast literature on multimedia encoding techniques. The captured multimedia content should ideally be represented in such a way as to allow reliable transmission over lossy channels (error-resilient coding), using algorithms that minimize processing power and the amount of information to be transmitted. The main design objectives of a coder for multimedia sensor networks are thus:

• High compression efficiency. Uncompressed raw video streams require high data rates and thus consume excessive bandwidth and energy. It is necessary to achieve a high ratio of compression to effectively limit bandwidth and energy consumption.

• Low complexity. Multimedia encoders are embedded in sensor devices. Hence, they need to be of low complexity to reduce cost and form factors, and low-power to prolong the lifetime of sensor nodes.

• Error resiliency. The source coder should provide robust and error-resilient coding of source data.

To achieve a high compression efficiency, the traditional broadcasting paradigm for wireline and wireless communications, where video is compressed once at the encoder and decoded several times, has been dominated by predictive encoding techniques. These, used in the widely spread ISO MPEG schemes, or the ITU-T recommendations H.263 [11] and H.264 [2] (also known as AVC or MPEG-4 part 10), are based on the idea of reducing the bit rate generated by the source encoder by exploiting source statistics. Hence, intra-frame compression techniques are used to reduce redundancy within one frame, while inter-frame compression (also known as predictive encoding or motion estimation) exploits correlation among subsequent frames to reduce the amount of data to be transmitted and stored, thus achieving good rate-distortion performance. Since the computational complexity is dominated by the motion estimation functionality, these techniques require complex encoders, powerful processing algorithms, and entail high energy consumption, while decoders are simpler and loaded with lower processing burden. For typical implementations of state-of-the-art video compression standards, such as MPEG or H.263 and H.264, the encoder is 5-10 times more complex than the decoder [50]. It is easy to see that to realize low-cost, low-energy-consumption multimedia sensors it is necessary to develop simpler encoders, and still retain the advantages of high compression efficiency.

However, it is known from information-theoretic bounds established by Slepian and Wolf for lossless coding [117], and by Wyner and Ziv [130] for lossy coding with decoder side information, that efficient compression can be achieved by leveraging knowledge of the source statistics at the decoder only. This way, the traditional balance of complex encoder and simple decoder can be reversed [50]. Techniques that build upon these results are usually referred to as distributed source coding. Distributed source coding refers to the compression of multiple correlated sensor outputs that do not communicate with each other [131]. Joint decoding is performed by a central entity that receives data independently compressed by different sensors. However, practical solutions have not been developed until recently. Clearly, such techniques are very promising for WMSNs and especially for networks of video sensors. The encoder can be simple and low-power, while the decoder at the sink will be complex and loaded with most of the processing and energy burden. The reader is referred to [131,50] for excellent surveys on the state of the art of distributed source coding in sensor networks and in distributed video coding, respectively. Other encoding and compression schemes that may be considered for source coding of multimedia streams, including JPEG with differential encoding, distributed coding of images taken by cameras having overlapping fields of view, or multi-layer coding with wavelet compression, are discussed in [90]. Here, we focus on recent advances in low-complexity encoders based on Wyner-Ziv coding [130], which are promising solutions for distributed networks of video sensors that are likely to have a major impact on the future design of protocols for WMSNs.

The objective of a Wyner-Ziv video coder is to achieve lossy compression of video streams with performance comparable to that of inter-frame encoding (e.g., MPEG), but with complexity at the encoder comparable to that of intra-frame coders (e.g., Motion-JPEG).

    6.2.1. Pixel-domain Wyner–Ziv encoder

In [14,15], a practical Wyner-Ziv encoder is proposed as a combination of a pixel-domain intra-frame encoder and inter-frame decoder system for video compression. A block diagram of the system is reported in Fig. 6. A regularly spaced subset of frames is coded using a conventional intra-frame coding technique, such as JPEG, as shown at the bottom of the figure. These are referred to as key frames. All frames between the key frames are referred to as Wyner-Ziv frames and are intra-frame encoded but inter-frame decoded. The intra-frame encoder for Wyner-Ziv frames (shown on top) is composed of a quantizer followed by a Slepian-Wolf coder. Each Wyner-Ziv frame is quantized and blocks of symbols are sent to the Slepian-Wolf coder, which is implemented through rate-compatible punctured turbo codes (RCPT). The parity bits generated by the RCPT coder are stored in a buffer.

[Fig. 6. Block diagram of a pixel-domain Wyner-Ziv encoder [14]. In the intra-frame encoder, Wyner-Ziv frames pass through a quantizer and a Slepian-Wolf coder (RCPT encoder and buffer); key frames are coded by a conventional intra-frame encoder (e.g., JPEG). The inter-frame decoder combines the RCPT decoder, reconstruction, and side information obtained by interpolation and extrapolation of decoded key frames and Wyner-Ziv frames, with decoder feedback used to request additional bits.]


A subset of these bits is then transmitted upon request from the decoder. This allows adapting the rate based on the temporally varying statistics between the Wyner-Ziv frame and the side information. The parity bits generated by the RCPT coder are in fact used to "correct" the frame interpolated at the decoder. For each Wyner-Ziv frame, the decoder generates the side information frame by interpolation or extrapolation of previously decoded key frames and Wyner-Ziv frames. The side information is leveraged by assuming a Laplacian distribution of the difference between the individual pixels of the original frame and the side information. The parameter defining the Laplacian distribution is estimated online. The turbo decoder combines the side information and the parity bits to reconstruct the original sequence of symbols. If reliable decoding of the original symbols is impossible, the turbo decoder requests additional parity bits from the encoder buffer.
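As a small concrete aside, for a zero-mean Laplacian model of the pixel residual, the maximum-likelihood estimate of the scale parameter is simply the mean absolute residual, so the online estimation mentioned above can in principle be as simple as the following sketch (the function name is ours; [14,15] do not prescribe this exact estimator):

import numpy as np

# MLE of the scale b of a zero-mean Laplacian model of the difference
# between the original frame and its side information.
def laplacian_scale(frame, side_info):
    residual = frame.astype(float) - side_info.astype(float)
    return float(np.mean(np.abs(residual)))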

Compared to predictive coding such as MPEG or H.26X, pixel-domain Wyner-Ziv encoding is much simpler. The Slepian-Wolf encoder only requires two feedback shift registers and an interleaver. Its performance, in terms of peak signal-to-noise ratio (PSNR), is 2-5 dB better than conventional Motion-JPEG intra-frame coding. The main drawback of this scheme is that it relies on online feedback from the receiver. Hence, it may not be suitable for applications where video is encoded and stored for subsequent use. Moreover, the feedback may introduce excessive latency for video decoding in a multi-hop network.

6.2.2. Transform-domain Wyner-Ziv encoder

In conventional source coding, a source vector is typically decomposed into spectral coefficients by using orthonormal transforms such as the Discrete Cosine Transform (DCT). These coefficients are then individually coded with scalar quantizers and entropy coders. In [13], a transform-domain Wyner-Ziv encoder is proposed. A block-wise DCT of each Wyner-Ziv frame is performed. The transform coefficients are independently quantized, grouped into coefficient bands, and then compressed by a Slepian-Wolf turbo coder. As in the pixel-domain encoder described in the previous section, the decoder generates a side information frame based on previously reconstructed frames. Based on the side information, a bank of turbo decoders reconstructs the quantized coefficient bands independently. The rate-distortion performance lies between conventional intra-frame transform coding and conventional motion-compensated transform coding.
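The encoder front end described above (block-wise DCT plus independent scalar quantization) is easy to sketch; the fragment below is a minimal single-node illustration in Python, with block size and quantization step chosen arbitrarily, and with the coefficient-band grouping and Slepian-Wolf turbo coding of [13] deliberately omitted:

import numpy as np
from scipy.fft import dctn

# Block-wise DCT followed by uniform scalar quantization.
def transform_and_quantize(frame, block=8, step=16.0):
    h, w = frame.shape
    rows, cols = h // block, w // block
    coeffs = np.empty((rows, cols, block, block))
    for i in range(rows):
        for j in range(cols):
            tile = frame[i*block:(i+1)*block, j*block:(j+1)*block].astype(float)
            coeffs[i, j] = np.round(dctn(tile, norm='ortho') / step)
    return coeffs  # quantized DCT coefficients, one block per (i, j) entry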

A different approach consists of allowing some simple temporal dependence estimation at the encoder to perform rate control without the need for feedback from the receiver. In the PRISM scheme [100], the encoder selects the coding mode based on the frame difference energy between the current frame and a previous frame. If the energy of the difference is very small, the block is not encoded. If the block difference is large, the block is intra-coded. Between these two situations, one of several encoding modes with different rates is selected. The rate estimation does not involve motion compensation and hence is necessarily inaccurate if motion compensation is used at the decoder. Further, the flexibility of the decoder is restricted.
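This mode-selection rule lends itself to a compact illustration. The sketch below is our hypothetical rendering of the logic; the thresholds and the rate ladder are invented, and PRISM's actual block classifier and syntax differ:

import numpy as np

# Classify a block by the energy of its difference from the co-located
# block in the previous frame (thresholds illustrative only).
def select_mode(block, prev_block, low=50.0, high=5000.0,
                ladder=(200.0, 800.0, 2500.0)):
    diff = block.astype(float) - prev_block.astype(float)
    energy = float(np.mean(diff ** 2))
    if energy < low:
        return ("skip", None)       # negligible change: block not encoded
    if energy > high:
        return ("intra", None)      # large change: conventional intra-coding
    rate_idx = sum(energy > t for t in ladder)
    return ("wyner-ziv", rate_idx)  # in between: pick a rate-graded mode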

    6.3. System software and middleware

The development of efficient and flexible system software to make functional abstractions and information gathered by scalar and multimedia sensors available to higher-layer applications is one of the most important challenges faced by researchers to manage the complexity and heterogeneity of sensor systems. As in [66], the term system software is used here to refer to operating systems, virtual machines, and middleware, which export services to higher-layer applications. Different multimedia sensor network applications are extremely diverse in their requirements and in the way they interact with the components of a sensor system. Hence, the main desired characteristics of system software for WMSNs can be identified as follows:

• Provides a high-level interface to specify the behavior of the sensor system. This includes semantically rich querying languages that allow specifying what kind of data is requested from the sensor network, the quality of the required data, and how it should be presented to the user;

• Allows the user to specify application-specific algorithms to perform in-network processing on the multimedia content [47]. For example, the user should be able to specify particular image processing algorithms or multimedia coding formats;

• Long-lived, i.e., needs to smoothly support evolutions of the underlying hardware and software;

• Shared among multiple heterogeneous applications;

• Shared among heterogeneous sensors and platforms. Scalar and multimedia sensor networks should coexist in the same architecture, without compromising on performance;

• Scalable.

There is an inherent trade-off between the degree of flexibility and network performance. Platform-independence is usually achieved through layers of abstraction, which usually introduce redundancy and prevent the developer from accessing low-level details and functionalities. However, WMSNs are characterized by the contrasting objectives of optimizing the use of the scarce network resources and not compromising on performance. The principal design objective of existing operating systems for sensor networks such as TinyOS is high performance. However, their flexibility, inter-operability and reprogrammability are very limited. There is a need for research on systems that allow for this integration.

We believe that it is of paramount importance to develop efficient, high-level abstractions that will enable easy and fast development of sensor network applications. An abstraction similar to the famous Berkeley TCP sockets, which fostered the development of Internet applications, is needed for sensor systems. However, differently from the Berkeley sockets, it is necessary to retain control on the efficiency of the low-level operations performed on battery-limited and resource-constrained sensor nodes.

As a first step in this direction, Chu et al. [34] recently proposed Sdlib, a sensor network data and communications library built upon the nesC language [48] for applications that require best-effort collection of large-size data, such as video monitoring applications. The objective of the effort is to identify common functionalities shared by several sensor network applications and to develop a library of thoroughly-tested, reusable and efficient nesC components that abstract high-level operations common to most applications, while leaving differences among them to adjustable parameters. The library is called Sdlib, Sensor Data Library, as an analogy to the traditional C++ Standard Template Library. Sdlib provides an abstraction for common operations in sensor networks, while the developer is still able to access low-level operations, which are implemented as a collection of nesC components, when desired. Moreover, to retain the efficiency of operations that are so critical to sensor network battery lifetime and resource constraints, Sdlib exposes policy decisions, such as resource allocation and rate of operation, to the developer, while hiding the mechanisms of policy enforcement.

    6.4. Open research issues

• While theoretical results on Slepian-Wolf and Wyner-Ziv coding have existed for over 30 years, there is still a lack of practical solutions. The net benefits and the practicality of these techniques still need to be demonstrated.

• It is necessary to fully explore the trade-offs between the achieved fidelity in the description of the phenomenon observed and the resulting energy consumption. As an example, the video distortion perceived by the final user depends on source coding (frame rate, quantization) and on channel coding strength. Similarly, in a surveillance application, the objective of maximizing the event detection probability is in contrast with the objective of minimizing the power consumption.

• As discussed above, there is a need for high-level abstractions that will allow fast development of sensor applications. However, due to the resource-constrained nature of sensor systems, it is necessary to control the efficiency of the low-level operations performed on battery-limited and resource-constrained sensor nodes.

• There is a need for simple yet expressive high-level primitives for applications to leverage collaborative, advanced in-network multimedia processing techniques.

    7. Transport layer

In applications involving high-rate data, the transport layer assumes special importance by providing end-to-end reliability and congestion control mechanisms. Particularly, in WMSNs, the following additional considerations are in order to accommodate both the unique characteristics of the WSN paradigm and multimedia transport requirements.

• Effects of congestion. In WMSNs, the effect of congestion may be even more pronounced as compared to traditional networks. When a bottleneck sensor is swamped with packets coming from several high-rate multimedia streams, apart from temporary disruption of the application, it may cause rapid depletion of the node's energy. While applications running on traditional wireless networks may only experience performance degradation, the energy loss (due to collisions and retransmissions) can result in network partition. Thus, congestion control algorithms may need to be tuned for immediate response and yet avoid oscillations of data rate along the affected path.

• Packet re-ordering due to multi-path. Multiple paths may exist between a given source-sink pair, and the order of packet delivery is strongly influenced by the characteristics of the route chosen. As an additional challenge, in real-time video/audio feeds or streaming media, information that cannot be used in the proper sequence becomes redundant, thus stressing the need for transport-layer packet reordering.

We next explore the functionalities and support provided by the transport layer to address these and other challenges of WMSNs. The following discussion is classified into (1) TCP/UDP and TCP-friendly schemes for WMSNs, and (2) application-specific and non-standardized protocols. Fig. 7 summarizes the discussion in this section.

[Fig. 7. Classification of existing transport layer protocols. TCP/UDP and TCP-friendly schemes: TCP may be preferred over UDP, unlike in traditional wireless networks; schemes compatible with the TCP rate control mechanism; e.g., STCP [60], MPEG-TFRCP [92]. Application-specific and non-standard protocols: reliability (per-packet delivery guarantee for select packet types; redundancy by caching at intermediate nodes), e.g., RMST [119], PSFQ [127], (RT)2 [52]; congestion control (spatio-temporal reporting; adjusting of reporting frequency based on current congestion levels), e.g., ESRT [17]; use of multipath (better load balancing and robustness to channel state variability; need to regulate multiple sources monitoring the same event), e.g., CODA [128], MRTP [79].]

7.1. TCP/UDP and TCP-friendly schemes for WMSNs

For real-time applications like streaming media, the User Datagram Protocol (UDP) is preferred over TCP, as timeliness is of greater concern than reliability. However, in WMSNs, it is expected that packets are significantly compressed at the source and redundancy is reduced as far as possible, owing to the high transmission overhead in the energy-constrained nodes. Under these conditions, we note the following important characteristics that may necessitate an approach very different from classical wireless networks.

• Effect of dropping packets in UDP. Simply dropping packets during congestion conditions, as undertaken in UDP, may introduce discernible disruptions in the order of a fraction of a second. This effect is even more pronounced if the dropped packet contains important original content not captured by inter-frame interpolation, like the Region of Interest (ROI) feature used in JPEG2000 [12] or the I-frame used in the MPEG family.

• Support for traffic heterogeneity. Multimedia traffic comprising video, audio, and still images exhibits a high level of heterogeneity and may be further classified into periodic or event-driven. The UDP header has no provision to allow any description of these traffic classes that may influence congestion control policies. By contrast, the options field in the TCP header can be modified to carry data-specific information. As an example, the Sensor Transmission Control Protocol (STCP) [60] accommodates a differentiated approach by including relevant fields in the TCP header. Several other major changes to the traditional TCP model are proposed in the round-trip time estimation, congestion notification, and packet drop policy, and by introducing a reliability-driven intimation of lost packets.

We thus believe that TCP, with appropriate modifications, is preferable over UDP for WMSNs if standardized protocols are to be used. With respect to sensor networks, several problems and their likely solutions, concerning among others the large TCP header size, data- vs. address-centric routing, and energy efficiency, are identified and addressed in [42]. We next indicate the recent work in this direction that evaluates the case for using TCP in WMSNs.

• Effect of jitter induced by TCP. A key factor that limits multimedia transport based on TCP, and TCP-like rate control schemes, is the jitter introduced by the congestion control mechanism. This can, however, be mitigated to a large extent by playout buffers at the sink, which is typically assumed to be rich in resources (a minimal playout-buffer sketch is given after this list). As an example, the MPEG-TFRCP (TCP Friendly Rate Control Protocol for MPEG-2 Video Transfer) [92] is an equation-based rate control scheme designed for transporting MPEG video in a TCP-friendly manner.

• Overhead of the reliability mechanism in TCP. As discussed earlier, blind dropping of packets in UDP containing highly compressed video/audio data may adversely affect the quality of transmission. Yet, at the same time, the reliability mechanism provided by TCP introduces an end-to-end message passing overhead, and energy efficiency must also be considered. Distributed TCP Caching (DTC) [43] overcomes these problems by caching TCP segments inside the sensor network and by local retransmission of TCP segments. The nodes closest to the sink are the last-hop forwarders on most of the high-rate data paths and thus run out of energy first. DTC shifts the burden of the energy consumption from nodes close to the sink into the network, apart from reducing network-wide retransmissions.

• Regulating streaming through multiple TCP connections. The availability of multiple paths between source and sink can be exploited by opening multiple TCP connections for multimedia traffic [94]. Here, the desired streaming rate, and the allowed throughput reduction in the presence of bursty traffic such as video data, are communicated to the receiver by the sender. The receiver uses this information to measure the actual throughput and to control the rate within the allowed bounds by using multiple TCP connections and dynamically changing its TCP window size for each connection.
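To make the playout-buffer remedy for TCP-induced jitter (first item in the list above) concrete, a minimal sketch follows; the function name and the fixed-schedule policy are ours, not taken from any of the cited schemes:

# Toy playout buffer: frame seq is played at start + buffer_delay + seq*period,
# so arrival jitter is absorbed as long as it stays below the buffering delay.
# Frames arriving after their scheduled playout time are dropped (None).
def playout_schedule(arrivals, start, buffer_delay, period):
    # arrivals: dict mapping sequence number -> arrival time (seconds)
    schedule = {}
    for seq, t_arrival in sorted(arrivals.items()):
        t_play = start + buffer_delay + seq * period
        schedule[seq] = t_play if t_arrival <= t_play else None
    return schedule

# Example: 100 ms of buffering absorbs up to 100 ms of jitter.
print(playout_schedule({0: 0.02, 1: 0.16, 2: 0.35}, 0.0, 0.1, 0.1))
# -> {0: 0.1, 1: 0.2, 2: None}  (frame 2 arrived too late)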

TCP protocols tailor-made for wireless sensor networks are an active research area, with recent implementations such as the light-weight Sensor Internet Protocol (SIP) [78] and the open-source uIP [42], which has a code size of a few kilobytes. However, a major problem with the TCP-based approach in wireless networks is its inability to distinguish between bad channel conditions and network congestion. This has motivated a new family of specialized transport layers whose design practices are either entirely opposite to those of TCP [122], or stress a particular functionality of the transport layer, such as reliability or congestion control.


    7.2. Application specific and non-standard protocols

Depending on the application, both reliability and congestion control may be equally important functionalities, or one may be preferred over the other. As an example, in the CYCLOPS image capturing and inference module [103], designed for extremely light-weight imaging, congestion control would be the primary functionality, with multiple sensor flows arriving at the sink, each being moderately loss-tolerant. We next list the important characteristics of such TCP-incompatible protocols in the context of WMSNs.

    7.2.1. Reliability

Multimedia streams may consist of images, video and audio data, each of which merits a different metric for reliability. As discussed in Section 7.1, when an image or video is sent with differentially coded packets, the arrival of the packets with the ROI field or the I-frame, respectively, should be guaranteed. The application can, however, withstand moderate loss for the other packets containing differential information. Thus, we believe that reliability needs to be enforced on a per-packet basis to best utilize the existing networking resources. If a previously recorded video is being sent to the sink, all the I-frames could be separated and the transport protocol should ensure that each of these reaches the sink. Reliable Multi-Segment Transport (RMST) [119] or the Pump Slowly Fetch Quickly (PSFQ) protocol [127] can be used for this purpose, as they buffer packets at intermediate nodes, allowing for faster retransmission in case of packet loss. However, there is an overhead in using the limited buffer space at a given sensor node for caching packets destined for other nodes, as well as in performing timely storage and flushing operations on the buffer. In a heterogeneous network, where real-time data is used by actors as discussed in Section 4.3, the Real-time and Reliable Transport (RT)2 protocol [52] can be used, which defines different reliability constraints for sensor-actor and actor-actor communication.
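A minimal sketch of such a per-packet reliability policy follows; the packet kinds and the predicate name are ours, not part of RMST or PSFQ:

from dataclasses import dataclass

@dataclass
class Packet:
    seq: int
    kind: str       # "I", "roi", "diff", ... (hypothetical labels)
    payload: bytes

# Only packets whose loss would corrupt dependent frames get guaranteed,
# cache-and-retransmit delivery; differential packets tolerate loss.
def needs_guaranteed_delivery(pkt):
    return pkt.kind in ("I", "roi")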

    7.2.2. Congestion control

The high rate of injection of multimedia packets into the network causes resources to be used up quickly. While typical transmission rates for sensor nodes may be about 40 kbit/s, indicative data rates of constant-bit-rate voice traffic may be 64 kbit/s. Video traffic, on the other hand, may be bursty and on the order of 500 kbit/s [136], thus making it clear that congestion must be addressed in WMSNs. While these data generation rates are high for a single node, multiple sensors in overlapped regions may inject similar traffic on sensing the same phenomenon. The Event-to-Sink Reliable Transport (ESRT) protocol [17] leverages the fact that spatial and temporal correlation exists among the individual sensor readings [125]. The ESRT protocol regulates the frequency of event reporting in a remote neighborhood to avoid congestion in the network. However, this approach may not be viable for all sensor applications, as nodes transmit data only when they detect an event, which may be a short-duration burst, as in the case of a video monitoring application. The feedback from the base station may hence not arrive in time to prevent a sudden congestion due to this burst.
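The flavor of this regulation can be conveyed with a small sketch; it is ours, and ESRT's actual decision regions and update rules differ. The sink lowers the reporting frequency multiplicatively under congestion and otherwise scales it toward the desired event reliability:

# ESRT-flavored reporting-frequency regulation (illustrative only).
def next_reporting_freq(freq, observed_reliability, desired_reliability,
                        congested, backoff=0.5):
    if congested:
        return freq * backoff  # relieve the network before anything else
    # No congestion: move the frequency toward the desired reliability.
    return freq * (desired_reliability / max(observed_reliability, 1e-9))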

7.2.3. Use of multi-path

We advocate the use of multiple paths for data transfer in WMSNs owing to the following two reasons (a simple splitting sketch follows the list):

• A large burst of data (say, resulting from an I-frame) can be split into several smaller bursts, thus not overwhelming the limited buffers at the intermediate sensor nodes.

• The channel conditions may not permit a high data rate for the entire duration of the event being monitored. By allowing multiple flows, the effective data rate on each path is reduced and the application can be supported.
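As an illustration of the first point, the trivial splitter below distributes a burst round-robin over the available paths; a real protocol would also weight paths by residual energy and channel state, and the names are ours:

# Distribute a large burst (e.g., an I-frame) over several paths so that
# no single path's intermediate buffers are overwhelmed.
def split_burst(packets, path_ids):
    assignment = {p: [] for p in path_ids}
    for i, pkt in enumerate(packets):
        assignment[path_ids[i % len(path_ids)]].append(pkt)
    return assignment

# Example: 6 packets over 3 paths -> 2 packets per path.
print(split_burst(list(range(6)), ["p1", "p2", "p3"]))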

The design of a multiple source-sink transport protocol is challenging, and is addressed by the COngestion Detection and Avoidance (CODA) protocol [128]. It allows a sink to regulate multiple sources associated with a single event in case of persistent network congestion. However, as the congestion inference in CODA is based on queue length at intermediate nodes, any action taken by the source occurs only after a considerable time delay. Other solutions include the Multi-flow Real-time Transport Protocol (MRTP) [79], which does not specifically address energy efficiency considerations in WMSNs but is suited for real-time streaming of multimedia content by splitting packets over different flows. MRTP has no mechanism for packet retransmission and is mainly intended for real-time data transmission; hence, reliability can be an issue for scalar data traffic.


    7.3. Open research issues

In summary, transport layer mechanisms that can simultaneously address the unique challenges posed by the WMSN paradigm and multimedia communication requirements must be incorporated. While several approaches were discussed, some open issues remain and are outlined below:

• Trade-off between reliability and congestion control. In WMSN applications, the data gathered from the field may contain multimedia information such as target images, acoustic signals, and even video captures of a moving target, all of which enjoy a permissible level of loss tolerance. The presence or absence of an intruder, however, may require a single data field but needs to be communicated without any loss of fidelity. Thus, when a single network contains multimedia as well as scalar data, the transport protocol must decide whether to focus on one or more functionalities so that the application needs are met without an unwarranted energy expense. The design of such a layer may well be modular, with the functional blocks of reliability and/or congestion control being invoked as per network demands.

• Real-time communication support. Despite the existence of reliable transport solutions for WSNs, as discussed above, none of these protocols provide real-time communication support for applications with strict delay bounds. Therefore, new transport solutions which can also meet certain application deadlines must be researched.

• Relation between multimedia coding rate and reliability. The success of energy-efficient and reliable delivery of multimedia information extracted from the phenomenon directly depends on selecting an appropriate coding rate, number of sensor nodes, and data rate for a given event [125]. However, to this end, the event reliability should be accurately measured in order to efficiently adapt the multimedia coding and transmission rates. For this purpose, new reliability metrics, coupled with application-layer coding techniques, should be investigated.

    8. Network layer

The network layer addresses the challenging task of providing variable QoS guarantees depending on whether the stream carries time-independent data like configuration or initialization parameters, time-critical low-rate data like the presence or absence of the sensed phenomenon, high-bandwidth video/audio data, etc. Each of the traffic classes described in Section 6.1 has its own QoS requirements, which must be accommodated at the network layer.