Sensor Network Databases
Chapter 6
Feng Zhao and Leonidas J. Guibas
Wireless Sensor Networks
Outline
Sensor Database Challenges
Querying the Physical Environment
Query Interfaces
High-Level Database Organization
In-Network Aggregation
Data-Centric Storage
Data Indices and Range Queries
Distributed Hierarchical Aggregation
Temporal Data
Summary
Sensor Network Abstraction
Characteristics: distributed, resource-constrained, failure-prone
From a data storage point of view: think of a sensor net as a distributed database
Sensor Network Database Challenges
The sensor network is highly volatile.
Nodes may be depleted, and links may go down.
Relational tables are not static.
New data is continuously being sensed.
High energy cost of communication.
This motivates in-network processing during query execution.
The rates at which input data arrive at a database operator can be highly variable.
Sensor Network Database Challenges
Limited storage on sensor nodes.
Older data has to be discarded.
Sensor tasking interacts in numerous ways with the sensor database system.
Classical metrics of database system performance may have to be adjusted.
Differences in Sensor Network Databases
Sensor network data inherently include errors (interference from other signals, device noise).
Range and probabilistic or approximate queries are more appropriate than exact queries.
Additional operators are needed in the query language:
to specify durations and sampling rates for the data
Continuous, long-running queries
Ex: monitoring the average temperature in a room
Correlating and comparing operators
Querying the Physical Environment
An aggregate query
The query result is computed by integrating data from a set of sensors.
Delivery of data from distributed sensor nodes to a central node for computation.
Ex: average, join of sensor readings from different groups.
Correlation Queries
“Sound an alarm whenever two sensors within 10 meters of each other simultaneously detect an abnormal temperature.”
Querying the Physical Environment
Snapshot queries
“Retrieve the current rainfall level for all sensors in Southern California.”
Historical queries
“Display the average rainfall level at all sensors for the last three months of the previous year.”
TinyDB Query interfaces
SQL-style querying
long-running monitoring query
“For the next three hours, retrieve every 10 minutes the maximum rainfall level in each county in Southern California, if it is greater than 3.0 inches.”
SELECT MAX(Rainfall_Level), county
FROM sensors
WHERE state = 'California'
GROUP BY county
HAVING MAX(Rainfall_Level) > 3.0 in
DURATION [now, now + 180 min]
SAMPLING PERIOD 10 min
Cougar Sensor Database
Object-relational database
SQL-type query interface
Each type of sensor is associated with an abstract data type (ADT).
Device ADT methods represent device functions, e.g., getTemperature(); detectTempGreaterThan(90)
Examples of Long-running queries
CREATE LR_QUERY q1 AS
SELECT R.dev, R.dev.getTemperature()
FROM TempSensors R, NamedPlaces N
WHERE $every(30)
AND R.dev.location().inside(N.bbox)
AND N.name = “California”;

CREATE LR_QUERY q2 AS
SELECT R1.dev.location()
FROM TempSensors R1, TempSensors R2
WHERE $every(10)
AND R1.dev.detectAbnormalTemperature()
AND R2.dev.detectAbnormalTemperature()
AND R1.dev > R2.dev;
Probabilistic Queries
Sensor data is subject to random errors.
Sensor data is modeled as normally distributed and characterized by a Gaussian p.d.f.
GADT (Gaussian ADT)
An instance of the ADT corresponds to a Gaussian p.d.f., represented by its mean μ and standard deviation σ.
Prob is used to pose queries.
Probabilistic Queries
“Retrieve from sensors all tuples whose temperature is within 0.5 degrees of 68 degrees, with at least 60 percent probability.”
Ex: SELECT *
FROM sensors
WHERE Sensor.Temp.Prob([67.5, 68.5]) >= 0.6
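As a rough illustration of how such a Prob predicate can be evaluated, here is a minimal Python sketch of a Gaussian ADT; the class and method names are hypothetical stand-ins, not the actual GADT interface:

import math

class GaussianADT:
    # A reading is stored as a Gaussian p.d.f. (mu, sigma), not a point value.
    def __init__(self, mu, sigma):
        self.mu = mu
        self.sigma = sigma

    def prob(self, lo, hi):
        # Probability mass of the Gaussian inside [lo, hi], computed
        # from the Gaussian cumulative distribution function.
        def cdf(x):
            return 0.5 * (1.0 + math.erf((x - self.mu) / (self.sigma * math.sqrt(2.0))))
        return cdf(hi) - cdf(lo)

# A tuple qualifies for the query above if this predicate holds:
temp = GaussianADT(mu=68.1, sigma=0.3)
print(temp.prob(67.5, 68.5) >= 0.6)   # True for this reading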
Centralized approach
Each sensor forwards its data to a central server.
Disadvantages
The nodes near the access point become traffic hot spots.
The sampling rate has to be set to the highest needed rate, burdening the network with unnecessary traffic.
In-network storage approach
Choose rendezvous points to store data in the network.
Advantages
The overhead to store and access the data is minimized.
The overall load is balanced across the network.
Server-based approach
Requires a total of 16 message transmissions
In-Network Aggregation
Each sensor may compute a partial state record based on its data and that of its children.
Requires a total of 6 message transmissions
Aggregation Framework
• As in extensible databases, TinyDB supports any aggregation function conforming to: Agg_n = {f_init, f_merge, f_evaluate}
f_init: {a0} → <a0>
f_merge: {<a1>, <a2>} → <a12>   (a partial state record)
f_evaluate: {<a1>} → aggregate value
Example: Average
AVG_init: {v} → <v, 1>
AVG_merge: {<S1, C1>, <S2, C2>} → <S1 + S2, C1 + C2>
AVG_evaluate: {<S, C>} → S/C
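A minimal Python sketch of this triple for AVG (illustrative code, not TinyDB's implementation); the partial state record is the pair <sum, count>:

def avg_init(v):
    # f_init: turn one sensor reading into a partial state record.
    return (v, 1)

def avg_merge(a, b):
    # f_merge: combine two partial state records from different subtrees.
    return (a[0] + b[0], a[1] + b[1])

def avg_evaluate(a):
    # f_evaluate: turn the final partial state record into the aggregate.
    return a[0] / a[1]

# A parent merges its own reading with its children's records:
left = avg_merge(avg_init(20.0), avg_init(22.0))   # (42.0, 2)
root = avg_merge(left, avg_init(24.0))             # (66.0, 3)
print(avg_evaluate(root))                          # 22.0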
Aggregates and their efficiency in TAG
[Figure: bytes transmitted per epoch, all sensors, for the COUNT, MIN, HISTOGRAM, AVERAGE, and MEDIAN aggregates.]
Performance Metrics
Network usage
Total usage and Hot spot usage
Preprocessing time
time taken to construct an index
Storage space requirement
Query time
time to process a query, assemble an answer, and return this answer.
Throughput
Update and maintenance cost
Properties of Sensor Database
Persistence
Data stored in the system must remain available to queries.
Consistency
A query must be routed correctly to a node where the data are currently stored.
Controlled access to data
Scalability in network size
As the number of nodes increases, the communication cost should not grow unduly.
Load balancing
Topological generality
The database architecture should work well on a broad range of network topologies.
Query Processing Scheduling
TinyDB uses an epoch-based mechanism.
The epoch should be sufficiently long for data to travel from the leaves to the root.
Each epoch is divided into time intervals.
The number of intervals reflects the depth of the routing tree.
Each node only needs to power up during its scheduled interval.
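A sketch of the wake-up rule this implies (illustrative Python, not TinyDB code); the interval numbering matches the COUNT(*) example in the figure below:

def transmit_interval(depth):
    # Intervals within an epoch are numbered from the tree depth down to 1.
    # A node at depth d transmits its partial state record in interval d;
    # its parent listens during interval d and transmits the merged record
    # in interval d - 1. At all other times the node can power down.
    return depth

# In a routing tree of depth 4: leaves transmit in interval 4, their
# parents in interval 3, and the root assembles the answer in interval 1.
for depth in (4, 3, 2, 1):
    print("depth", depth, "-> interval", transmit_interval(depth))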
Schedule of In-Network Aggregation
SELECT COUNT(*) FROM sensors
[Figure: five snapshots of the epoch schedule on a small routing tree, taken at intervals 4, 3, 2, 1, and then interval 4 of the next epoch; in each interval the nodes at the corresponding depth transmit their partial COUNT records one level up the tree, until the root produces the total, and the schedule repeats every epoch.]
Data-Centric Storage (DCS)
DCS is a method proposed to support queries from any node in the network by providing a rendezvous mechanism for data and queries.
Avoids flooding the entire network.
At the center of a DCS system are rendezvous points.
DCS distributes the storage load across the entire network.
Data-Centric Storage (DCS)
For example:
Geographic hash table (GHT) attempts to distribute data evenly across the network.
GHT assumes each node knows its geographic location (by GPS or …).
A data object is associated with a key.
Each node is responsible for storing a certain range of keys.
Geographic Hash Table (GHT)
Rendezvous
Events are named with keys.
Storage and retrieval are performed using these keys.
A key is hashed to a geographic position.
Geographic routing (GPSR) is used to locate the node closest to this geographic position.
This node serves as a rendezvous for storage and search.
Costs
No flooding of queries.
Aggregate storage cost is the same as external storage.
Structured Replication
Rendezvous points are replicated.
Decreases storage communication cost.
Increases query dissemination cost.
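A minimal Python sketch of the basic GHT rendezvous idea; hashing a key to a coordinate and picking the closest node stands in for GPSR routing, and all names here are illustrative:

import hashlib, math
from dataclasses import dataclass, field

@dataclass
class Node:
    pos: tuple
    store: dict = field(default_factory=dict)

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def hash_to_location(key, region=(100.0, 100.0)):
    # Hash an event key to an (x, y) rendezvous point inside the region.
    h = hashlib.sha1(key.encode()).digest()
    x = int.from_bytes(h[:4], "big") / 2**32 * region[0]
    y = int.from_bytes(h[4:8], "big") / 2**32 * region[1]
    return (x, y)

def rendezvous_node(nodes, key):
    # GPSR would route to the node closest to the hashed location; here
    # we simply select it. This node stores and serves the key.
    return min(nodes, key=lambda n: dist(n.pos, hash_to_location(key)))

# Any node can store or look up "elephant-sighting" without flooding:
nodes = [Node((10.0, 20.0)), Node((70.0, 80.0)), Node((40.0, 55.0))]
rendezvous_node(nodes, "elephant-sighting").store["elephant-sighting"] = [(68, 61)]
print(rendezvous_node(nodes, "elephant-sighting").store)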
Structured replication in GHT
[Figure: the region (0,0)-(100,100), showing the root point together with its level-1 and level-2 mirror points.]
Data-Centric Storage (DCS)
Reduce unnecessary network traffic:
Hashing to locations should respect geographic proximity.
Hash to regions rather than to locations to avoid hot spots and increase robustness.
Trade-off
If the frequency of event generation is high, then pushing data to arbitrary rendezvous points may be too expensive.
Data indices and range queries
It is difficult to serve a range query well.
The TinyDB aggregation tree requires flooding the entire network for each query.
Indices
Auxiliary data structures that facilitate and speed up query execution.
Useful when the query rate is higher than the update rate.
Indices
Key idea
Pre-store the answers to certain special queries, then assemble from them the answer to an arbitrary range query.
Index structure
Hash table, k-d tree, quad-tree, R-tree, …
Trade-off
between the number of pre-stored answers and the speed of query execution.
One-Dimensional Indices
[Figure: canonical subsets of sensors s0 through s7 along a road, organized as a balanced binary tree of logical nodes u1 through u7, with u4 at the root.]
One-Dimensional Indices
We map logical node ui to physical node si−1.
Canonical subsets
The nodes with the pre-stored data (s0 through s6).
Complexity: store O(n); query O(log n)
u1 = s0⊕s1
u2 = s0⊕s1⊕s2⊕s3
u3 = s2⊕s3
u4 = s0⊕s1⊕s2⊕s3⊕s4⊕s5⊕s6⊕s7
u5 = s4⊕s5
u6 = s4⊕s5⊕s6⊕s7
u7 = s6⊕s7
(⊕ denotes the aggregation operator)
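The canonical subsets above form a balanced binary tree, so any contiguous range of sensors decomposes into O(log n) of them. A Python sketch of that decomposition (standard segment-tree style; the code is illustrative):

def canonical_cover(lo, hi, node_lo=0, node_hi=7):
    # Return the disjoint canonical subsets (as sensor index ranges)
    # whose union is exactly sensors s_lo .. s_hi.
    if hi < node_lo or node_hi < lo:
        return []                                  # no overlap: skip
    if lo <= node_lo and node_hi <= hi:
        return [(node_lo, node_hi)]                # fully covered: use it
    mid = (node_lo + node_hi) // 2
    return (canonical_cover(lo, hi, node_lo, mid) +
            canonical_cover(lo, hi, mid + 1, node_hi))

# Aggregating over s1..s6 touches only four pre-stored answers
# (singleton ranges are read directly from the sensor itself):
print(canonical_cover(1, 6))   # [(1, 1), (2, 3), (4, 5), (6, 6)]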
Multidimensional Indices for Orthogonal Range Searching
Orthogonal range query:
SELECT * FROM Nestion_Events
WHERE Temperature >= 50 AND Temperature <= 60
AND Light >= 5 AND Light <= 10
[Figure: events plotted in the Temperature-Light plane, with the query rectangle 50 ≤ Temperature ≤ 60, 5 ≤ Light ≤ 10 highlighted.]
A k-d tree partitions a plane into rectangles
Drill down the k-d tree with query rectangle Q:
When reaching a node whose corresponding rectangle is disjoint from Q, stop propagation.
When reaching a node whose corresponding rectangle is fully contained in Q, incorporate its count into the events of interest.
Otherwise, expand the node and continue drilling into its children, as sketched below.
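A Python sketch of this drill-down over a count-annotated k-d tree (the node layout here is hypothetical):

class KDNode:
    def __init__(self, rect, count, children=(), points=()):
        self.rect = rect            # (tmin, tmax, lmin, lmax)
        self.count = count          # number of events inside rect
        self.children = children    # subrectangles; () at a leaf
        self.points = points        # raw (temperature, light) events at a leaf

def disjoint(a, b):
    return a[1] < b[0] or b[1] < a[0] or a[3] < b[2] or b[3] < a[2]

def contains(outer, inner):
    return (outer[0] <= inner[0] and inner[1] <= outer[1] and
            outer[2] <= inner[2] and inner[3] <= outer[3])

def range_count(node, q):
    if disjoint(node.rect, q):
        return 0                    # prune: stop propagation here
    if contains(q, node.rect):
        return node.count           # use the pre-stored count
    if node.children:
        return sum(range_count(c, q) for c in node.children)
    # Leaf whose rectangle straddles the query: inspect raw events.
    return sum(1 for (t, l) in node.points
               if q[0] <= t <= q[1] and q[2] <= l <= q[3])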
A k-d tree partitions a plane into rectangles
[Figure: the Temperature-Light plane recursively partitioned into rectangles by a k-d tree.]
Non-orthogonal Range Searching
[Figure: a non-orthogonal query range overlaid on the rectangle partition; the query propagates only into the rectangles that the range intersects.]
Distributed Hierarchical Aggregation
Designing a distributed index
Load-balancing the communication, processing, and storage across the nodes
Robustness consideration
Frequent failures of nodes and links
This is important to WSN databases and should receive the attention it deserves.
Multiresolution Summarization
Wavelet transforms
One way to compress and summarize information for both temporal and spatial signals
Data structure: quad-tree
Routing: GPSR + GHT
Avoiding hot spots: replication
Partitioning the Summaries
Queries start at the root of the summarization tree.
Partition the aggregation data in a meaningful way to lessen the load on nodes near the hierarchy root.
Use a multi-rooted quad-tree to partition the spatial domain.
System: DIFS
Quad Tree Approach
Quaternary tree: each node has 4 children.
Each node has 4 histograms summarizing the data distribution in each child subtree.
Queries only propagate into relevant parts of the tree (pruning), as sketched below.
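A Python sketch of that pruning rule (the structure is illustrative: each internal node keeps one value histogram per child, stored as (bucket_lo, bucket_hi, count) triples):

class QuadNode:
    def __init__(self, children=(), hists=(), values=()):
        self.children = children    # 4 child subtrees; () at a leaf
        self.hists = hists          # one histogram per child subtree
        self.values = values        # raw readings at a leaf

def relevant(hist, lo, hi):
    # A child subtree is relevant only if some non-empty histogram
    # bucket overlaps the queried value range [lo, hi].
    return any(b_lo <= hi and lo <= b_hi and n > 0
               for (b_lo, b_hi, n) in hist)

def quad_query(node, lo, hi):
    if not node.children:           # leaf: scan the raw readings
        return [v for v in node.values if lo <= v <= hi]
    out = []
    for child, hist in zip(node.children, node.hists):
        if relevant(hist, lo, hi):  # propagate only into relevant parts
            out.extend(quad_query(child, lo, hi))
    return out                      # non-relevant subtrees were pruned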
Quad Tree: Issues
Explicit child pointers required
On storage of new data, updates must be propagated up the tree.
Every query must originate at the tree root.
The root bears a greater burden!
DIFS
DIFS stands for distributed index for features in sensor networks.
Goals
Provide an efficient query mechanism for range searches of event attributes.
Extend network lifetime by amortizing the costs of communication and storage over as many nodes as possible, even at the expense of modest overall increases.
GHT-based Quad Tree
We add an index structure to Structured Replication.
Hierarchy of histograms summarizes the range of data within children
Problem: Root is the bottleneck
Every query goes through it.
Information from every event that’s generated propagates to it.
[Figure: a GHT-based quad tree over a field of 16 numbered cells, showing the root point and its level-1 and level-2 children.]
The DIFS Tree
Every node (except the root) has parents.
The wider the spatial extent an index node knows about, the more constrained the value range it covers.
[Figure: the DIFS hierarchy over a 100 × 100 field; index nodes with the widest spatial extent cover narrow value ranges (1-4, 5-8, 9-12, 13-16), while nodes with smaller spatial extent cover the full value range 1-16.]
Storage
Example: an event with “temperature” equal to 9 is generated at location (68,61).
Compute geographically bounded hashes:
“temperature:1:16” in (50,50)->(75,75)
“temperature:9:12” in (50,50)->(100,100)
“temperature:9:9” in (0,0)->(100,100)
Periodically propagate up the tree
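A Python sketch of such a geographically bounded hash (the hash function itself is a stand-in; DIFS defines its own bounded hash):

import hashlib

def bounded_hash(key, bbox):
    # Map an "attribute:lo:hi" index key to a point inside bbox =
    # ((x0, y0), (x1, y1)), so the node indexing that value range is
    # guaranteed to lie within the stated spatial extent.
    (x0, y0), (x1, y1) = bbox
    h = hashlib.sha1(key.encode()).digest()
    x = x0 + int.from_bytes(h[:4], "big") / 2**32 * (x1 - x0)
    y = y0 + int.from_bytes(h[4:8], "big") / 2**32 * (y1 - y0)
    return (x, y)

# The three index nodes for the temperature-9 event above:
print(bounded_hash("temperature:1:16", ((50, 50), (75, 75))))
print(bounded_hash("temperature:9:12", ((50, 50), (100, 100))))
print(bounded_hash("temperature:9:9", ((0, 0), (100, 100))))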
DIFS Hierarchy
Fractional Cascading
[Figure: a sensor p’s view of the world, shown over the leaves of the quad-tree.]
Locality-Preserving Hashing
Goal:
Map the attribute space to the plane so that nearby locations in attribute space correspond to nearby locations in the plane.
DIM (distributed index for multidimensional data)
Data with values close to one another are hashed to nearby locations.
Zone code: a unique identifier for each zone, as sketched below.
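A Python sketch of deriving a zone code by recursive halving (assumed conventions: even-numbered splits halve the x-range, odd-numbered splits the y-range, with 0 meaning the lower half):

def zone_code(x, y, depth, region=(0.0, 1.0, 0.0, 1.0)):
    x0, x1, y0, y1 = region
    code = ""
    for level in range(depth):
        if level % 2 == 0:                 # split the x-range
            mid = (x0 + x1) / 2
            if x < mid: code += "0"; x1 = mid
            else:       code += "1"; x0 = mid
        else:                              # split the y-range
            mid = (y0 + y1) / 2
            if y < mid: code += "0"; y1 = mid
            else:       code += "1"; y0 = mid
    return code

# Nearby points share long zone-code prefixes (locality preservation):
print(zone_code(0.30, 0.80, 4))   # "0111"
print(zone_code(0.32, 0.82, 4))   # "0111"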
DIM - zone tree & zone code
[Figure: a field divided into zones a through g, together with the corresponding zone tree; left/right splits contribute the bits 0 and 1, yielding zone codes 00, 010, 011, 100, 101, 110, 1110, and 1111.]
Temporal Data
Overall node storage is very limited.
We might query about the past, the present, or the future.
Data Aging
Application-dependent
Schedule for discarding data and data summaries
Indexing Motion Data
A fixed index structure will soon become obsolete because of heavy update and communication costs.
Both index construction and updates can be quite expensive.
Modify the index only when new objects are inserted or deleted, or when the trajectory of an object changes.
KDS (Kinetic Data Structure)
Update only when certain critical events occur (see the sketch below).
Drawback
It may waste processing during periods of inactivity, when no queries are present in the network, because the index still requires updating as time goes on.
These updates need not be so frequent if the motion predictions are accurate.
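As a minimal illustration of a kinetic certificate, consider a point in linear motion x(t) = x0 + v·t, indexed by the cell that contains it; the index needs attention only at the certificate's failure time (illustrative Python):

def certificate_failure_time(x0, v, cell):
    # Certificate: "the point lies inside cell [a, b]". Under linear
    # motion x(t) = x0 + v * t it fails, and the index must be updated,
    # exactly when the point crosses a cell boundary.
    a, b = cell
    if v > 0:
        return (b - x0) / v
    if v < 0:
        return (a - x0) / v
    return float("inf")            # stationary point: never fails

# A point at x0 = 2.0 with v = 0.5 leaves cell [0, 5] at t = 6.0:
print(certificate_failure_time(2.0, 0.5, (0.0, 5.0)))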
Summary
This area is still in its infancy; much more needs to be done.
As we remarked, integration of query processing with the networking layer, the mapping of index structures to the spatial topology of the network, and distributed index construction for motion data all remain important topics for further investigation.