Beyond Bloom Filters: From Approximate Membership
Checks to Approximate State Machines
By F. Bonomi et al.
Presented by Kenny Cheng, Tonny Mak Yui Kuen
Introduction
A) Motivation
B) Objectives
C) Problem statements
A) Motivation
• Increasing trend to keep flow state in routers
• Storing a large number of flow states requires a lot of memory (~100 bits per flow)
• If the memory per flow can be reduced, the state can be kept in fast on-chip memory, which improves performance
B) Objectives
• Introduce the idea of an Approximate Concurrent State Machine (ACSM), which sacrifices some accuracy to reduce memory size
• Introduce and compare several solutions to the ACSM problem
• Find the approach with the best accuracy-to-memory ratio
C) Problem statements
• Describe 3 techniques based on Bloom filters and hashing, and evaluate them using both theoretical analysis and simulation
Bloom Filter
• A data structure proposed by Bloom in 1970
• Designed for membership tests, i.e. to test whether an element exists in a set
• Fast and compact
• Chance of false positives, i.e. an element not in the set may be wrongly reported as a member
• No false negatives, i.e. an element in the set is always reported as a member
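To put a number on the false positive chance, the sketch below evaluates the standard textbook approximation (1 - e^(-kn/m))^k for a filter with m bits, n inserted elements and k hash functions. The function and parameter names are illustrative and do not come from the slides.

```python
import math

def bloom_false_positive_rate(m_bits: int, n_items: int, k_hashes: int) -> float:
    """Standard approximation (1 - e^(-k*n/m))^k of a Bloom filter's false positive rate."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

# Example: 10 bits per element and 7 hash functions give a rate of roughly 0.8%.
rate = bloom_false_positive_rate(m_bits=10_000, n_items=1_000, k_hashes=7)
```

Spending more bits per element lowers the rate, which is the same size/accuracy trade-off the ACSM work explores.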
How a Bloom Filter Works
• A bit array with all zeros initially
• k hash functions
[Figure: hash functions 1, 2, 3, ..., k above a bit array of all zeros]
How a Bloom Filter Works
• Hash the element using the hash functions, get k indices in the bit array
• Set those bits to 1
[Figure: Insertion. Element x is hashed by functions 1, 2, 3, ..., k; the bit array changes from 0 0 0 0 0 0 0 0 0 0 0 0 0 0 to 0 0 1 0 0 0 0 0 1 1 0 0 0 1]
How a Bloom Filter Works
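• Hash the element using the hash functions
• If all corresponding bits are 1, the element is reported as in the set; if any bit is 0, it is definitely not in the set
[Figure: Lookup. Element x is hashed by functions 1, 2, 3, ..., k and the selected bits of the array 0 0 1 0 0 1 0 0 1 1 1 0 0 1 are checked]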
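The insertion and lookup steps above can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' code: deriving the k hash functions by salting SHA-256 with the function number, and the names `BloomFilter`, `m_bits` and `k_hashes`, are assumptions made for the example.

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits: int, k_hashes: int):
        self.m = m_bits
        self.k = k_hashes
        self.bits = [0] * m_bits  # bit array, all zeros initially

    def _indices(self, item: str):
        # Assumption: k hash functions are simulated by salting one hash with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        for idx in self._indices(item):
            self.bits[idx] = 1

    def lookup(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[idx] for idx in self._indices(item))

bf = BloomFilter(m_bits=1024, k_hashes=3)
bf.insert("flow-a")
found = bf.lookup("flow-a")  # an inserted element is always found
```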
How a Bloom Filter Works
• Sorry, no deletion
• You don’t know whether the bits are also used by other elements, so you cannot simply clear them
[Figure: Deletion. The bits selected for element x are shown as “?” because other elements may share them: 0 0 ? 0 0 1 0 0 ? ? 1 0 0 ?]
Counting Bloom Filter
• Use a counter to replace each bit
• For insertion, increment the counters
• For deletion, decrement the counters
• Problems: more space, counter overflow
[Figure: inserting element x via hash functions 1, 2, 3, ..., k changes the counter array from 0 0 0 0 1 0 0 0 0 0 3 0 0 2 to 0 0 1 0 1 0 0 0 1 1 3 0 0 3]
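A counting Bloom filter along these lines might look like the sketch below. The salted SHA-256 hashing and the overflow policy (counters saturate at `counter_max` and a saturated counter is never decremented) are assumptions chosen for the example; the slides only name overflow as a problem.

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, m_cells: int, k_hashes: int, counter_max: int = 15):
        self.m, self.k, self.max = m_cells, k_hashes, counter_max
        self.counters = [0] * m_cells

    def _indices(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        for idx in self._indices(item):
            if self.counters[idx] < self.max:  # saturate instead of overflowing
                self.counters[idx] += 1

    def lookup(self, item: str) -> bool:
        return all(self.counters[idx] > 0 for idx in self._indices(item))

    def delete(self, item: str) -> bool:
        if not self.lookup(item):
            return False  # never decrement for an element that is not present
        for idx in self._indices(item):
            if 0 < self.counters[idx] < self.max:  # saturated counters are left alone
                self.counters[idx] -= 1
        return True

cbf = CountingBloomFilter(m_cells=1024, k_hashes=3)
cbf.insert("flow-a")
cbf.insert("flow-a")
cbf.delete("flow-a")
still_there = cbf.lookup("flow-a")  # one copy remains
cbf.delete("flow-a")
gone = not cbf.lookup("flow-a")     # all counters are back to zero
```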
3 Approaches to ACSM
• Approaches:
1. Direct Bloom Filter
2. Stateful Bloom Filter
3. Fingerprint-compressed Filter
• Operations to implement:
1. Insert(flow, state)
2. Lookup(flow) returns (state)
3. Delete(flow)
4. Update(flow, new_state)
Direct Bloom Filter Approach
• Use a counting Bloom filter
• 4 operations:
Insert – insert the (flow_id, state) pair
Lookup – if the state is not provided, look up every possible state; return “don’t know” if more than one state is found
Delete – lookup + decrement counters
Update – delete old + insert new
• Improvement: use timing-based deletion to handle non-terminated flows
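The four operations can be sketched on top of a counting Bloom filter keyed by the (flow_id, state) pair. This is an illustration under assumptions: the class name `DirectBloomACSM`, the `"DK"` marker, the salted SHA-256 hashing and the rule that delete takes the state as an argument are all choices made here, not details from the slides. Counter overflow and timing-based deletion are left out to keep it short.

```python
import hashlib

DONT_KNOW = "DK"  # returned when more than one state appears to be present

class DirectBloomACSM:
    def __init__(self, m_cells: int, k_hashes: int, states):
        self.m, self.k = m_cells, k_hashes
        self.states = list(states)
        self.counters = [0] * m_cells

    def _indices(self, flow_id: str, state: int):
        key = f"{flow_id}|{state}"  # the pair is hashed, not the flow alone
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def _present(self, flow_id: str, state: int) -> bool:
        return all(self.counters[i] > 0 for i in self._indices(flow_id, state))

    def insert(self, flow_id: str, state: int) -> None:
        for i in self._indices(flow_id, state):
            self.counters[i] += 1

    def lookup(self, flow_id: str):
        # The state is not stored, so every possible state has to be probed.
        found = [s for s in self.states if self._present(flow_id, s)]
        if not found:
            return None
        return found[0] if len(found) == 1 else DONT_KNOW

    def delete(self, flow_id: str, state: int) -> None:
        if self._present(flow_id, state):  # lookup + decrement counters
            for i in self._indices(flow_id, state):
                self.counters[i] -= 1

    def update(self, flow_id: str, old_state: int, new_state: int) -> None:
        self.delete(flow_id, old_state)    # delete old + insert new
        self.insert(flow_id, new_state)

acsm = DirectBloomACSM(m_cells=4096, k_hashes=3, states=range(1, 11))
acsm.insert("flow-1", 1)
acsm.update("flow-1", old_state=1, new_state=2)
state_after_update = acsm.lookup("flow-1")
acsm.delete("flow-1", 2)
state_after_delete = acsm.lookup("flow-1")
```

The cost of a lookup grows with the number of states, which is the weakness the Stateful Bloom Filter approach targets.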
Timing-based Deletion
• Add a timing bit to each cell
• Set the bit if the cell is touched
• Clear untouched cells periodically, and reset timing bits
• Alternative to DBF: use a standard Bloom filter instead of a counting one, and delete elements only by timing-based deletion
[Figure: cell array and timing bits before and after a periodic sweep of element x's cells via hash functions 1, 2, 3, ..., k]
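A minimal sketch of timing-based deletion on a standard (non-counting) Bloom filter follows, matching the alternative mentioned above. It assumes that both insert and a successful lookup count as "touching" a cell, and that a `sweep()` method is called once per period; those details and all names are illustrative.

```python
import hashlib

class TimedBloomFilter:
    def __init__(self, m_bits: int, k_hashes: int):
        self.m, self.k = m_bits, k_hashes
        self.bits = [0] * m_bits
        self.touched = [0] * m_bits  # one timing bit per cell

    def _indices(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        for idx in self._indices(item):
            self.bits[idx] = 1
            self.touched[idx] = 1

    def lookup(self, item: str) -> bool:
        idxs = list(self._indices(item))
        if not all(self.bits[i] for i in idxs):
            return False
        for i in idxs:               # assumption: a hit also touches the cells
            self.touched[i] = 1
        return True

    def sweep(self) -> None:
        """Run once per period: clear untouched cells, then reset timing bits."""
        for i in range(self.m):
            if not self.touched[i]:
                self.bits[i] = 0
            self.touched[i] = 0

tbf = TimedBloomFilter(m_bits=1024, k_hashes=3)
tbf.insert("flow-a")
tbf.sweep()                          # touched during this period, so it survives
alive = tbf.lookup("flow-a")
tbf.sweep()                          # the lookup above touched it again
tbf.sweep()                          # a full period with no activity: cleared
expired = not tbf.lookup("flow-a")
```

A flow that stops sending packets therefore disappears after at most two sweeps, without any explicit delete.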
Stateful Bloom Filter Approach
• The Direct Bloom Filter doesn’t store the state of a flow, so a lookup has to try every state
• Improvement: add a state value to each cell for faster lookup
• Hash the flow_id only, instead of the (flow_id, state) pair
• Introduce a “don’t know” (DK) state for cells where a collision occurs
• Keep timing-based deletion
Stateful Bloom Filter Approach
• Insert, modify, delete – similar to Direct Bloom Filter; set the cell value to DK on a collision (counter > 1)
• Lookup:
If all cells are DK, return DK
If all cells are either state i or DK, return state i
If more than one state other than DK is found, return “not found”
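The lookup rule is easiest to see as a small function over the k cell values, with a filter built around it. The sketch below is an illustration under stated assumptions: an empty cell (represented as `None`) means "not found", each hash function owns its own slice of the array so one flow never hits the same cell twice, and the names `resolve` and `StatefulBloomFilter` are invented for the example.

```python
import hashlib

DK = "DK"  # "don't know" marker

def resolve(values):
    """Apply the lookup rule to the k cell values (None means an empty cell)."""
    if any(v is None for v in values):
        return None                  # assumption: an empty cell means not present
    states = {v for v in values if v != DK}
    if not states:
        return DK                    # all cells are DK
    if len(states) == 1:
        return states.pop()          # all cells are state i or DK
    return None                      # conflicting states: "not found"

class StatefulBloomFilter:
    def __init__(self, cells_per_hash: int, k_hashes: int):
        self.n, self.k = cells_per_hash, k_hashes
        self.count = [0] * (cells_per_hash * k_hashes)
        self.value = [None] * (cells_per_hash * k_hashes)

    def _indices(self, flow_id: str):
        for i in range(self.k):      # only the flow_id is hashed
            digest = hashlib.sha256(f"{i}:{flow_id}".encode()).digest()
            yield i * self.n + int.from_bytes(digest[:8], "big") % self.n

    def insert(self, flow_id: str, state: int) -> None:
        for idx in self._indices(flow_id):
            self.count[idx] += 1
            self.value[idx] = state if self.count[idx] == 1 else DK

    def lookup(self, flow_id: str):
        return resolve([self.value[i] for i in self._indices(flow_id)])

    def update(self, flow_id: str, new_state: int) -> None:
        for idx in self._indices(flow_id):
            if self.count[idx] > 0 and self.value[idx] != DK:
                self.value[idx] = new_state

    def delete(self, flow_id: str) -> None:
        for idx in self._indices(flow_id):
            if self.count[idx] > 0:
                self.count[idx] -= 1
                if self.count[idx] == 0:
                    self.value[idx] = None

sbf = StatefulBloomFilter(cells_per_hash=1024, k_hashes=3)
sbf.insert("flow-1", 3)
first = sbf.lookup("flow-1")
sbf.update("flow-1", 4)
second = sbf.lookup("flow-1")
sbf.delete("flow-1")
third = sbf.lookup("flow-1")
```

Unlike the Direct Bloom Filter sketch, a lookup here costs k cell reads regardless of how many states the machine has.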
Fingerprint-compressed Filter Approach
• Store a fingerprint of flow + state in a d-left hash table
[Figure: element x is hashed into d hash tables (1, 2, ..., d); each bucket holds (Fingerprint, State) entries, e.g. fingerprint 1110001000 with state 1]
Fingerprint-compressed Filter Approach
• Insert – hash the element to find the corresponding bucket in each hash table, then insert the fingerprint + state into the bucket with the fewest elements (choose the left-most one to break ties)
• Lookup – retrieve the state of the fingerprint
• Delete – remove the fingerprint
• Update – direct update or remove old + add new
• Make use of DK when a fingerprint is found in multiple buckets
• Timing-based deletion can still be applied
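The operations above can be sketched with d tables of small buckets. This is a simplified illustration, not the scheme from the paper: it uses the same fingerprint in every table with an independent bucket hash per table, treats a full bucket as a failed insert, and omits timing-based deletion. The names `FingerprintFilter`, `fp_bits` and `bucket_cap` are assumptions.

```python
import hashlib

DK = "DK"  # returned when the fingerprint matches in more than one place

class FingerprintFilter:
    def __init__(self, d_tables=4, buckets=256, fp_bits=12, bucket_cap=8):
        self.d, self.b, self.fp_bits, self.cap = d_tables, buckets, fp_bits, bucket_cap
        self.tables = [[[] for _ in range(buckets)] for _ in range(d_tables)]

    def _hash(self, salt, flow_id: str) -> int:
        digest = hashlib.sha256(f"{salt}:{flow_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def _slots(self, flow_id: str):
        fp = self._hash("fp", flow_id) % (1 << self.fp_bits)
        buckets = [self.tables[t][self._hash(t, flow_id) % self.b] for t in range(self.d)]
        return fp, buckets

    def insert(self, flow_id: str, state: int) -> bool:
        fp, buckets = self._slots(flow_id)
        target = min(buckets, key=len)  # min() keeps the first minimum: left-most wins ties
        if len(target) >= self.cap:
            return False                # bucket overflow is not handled in this sketch
        target.append([fp, state])
        return True

    def _matches(self, flow_id: str):
        fp, buckets = self._slots(flow_id)
        return [cell for bucket in buckets for cell in bucket if cell[0] == fp]

    def lookup(self, flow_id: str):
        matches = self._matches(flow_id)
        if not matches:
            return None
        return matches[0][1] if len(matches) == 1 else DK

    def update(self, flow_id: str, new_state: int) -> None:
        matches = self._matches(flow_id)
        if len(matches) == 1:           # direct update of the stored state
            matches[0][1] = new_state

    def delete(self, flow_id: str) -> bool:
        fp, buckets = self._slots(flow_id)
        for bucket in buckets:
            for cell in bucket:
                if cell[0] == fp:
                    bucket.remove(cell)
                    return True
        return False

fcf = FingerprintFilter()
fcf.insert("flow-1", 1)
first = fcf.lookup("flow-1")
fcf.update("flow-1", 2)
second = fcf.lookup("flow-1")
fcf.delete("flow-1")
third = fcf.lookup("flow-1")
```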
Simulation
• To investigate the size/accuracy trade-off for the 3 approaches
• State machine: 10 states
• Legal state changes: 1 → 2 → 3 → … → 10
• Run for 1 million flows
• About 60000 simultaneous flows
• 100 ± 40 packets for each flow
• Some packets trigger a state change
Simulation
• 3 kinds of simulation flows
• Interesting flows (30%) – flows with legal state changes only, always complete
• Noise flows (30%) – flows with random (can be legal or illegal) state changes, never complete
• Random flows (40%) – flows without state change
Simulation
False positive rate: % of completed flows that are not interesting
False negative rate: % of interesting flows that do not complete
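As a worked illustration of the two definitions, the sketch below computes both rates from a list of per-flow outcomes. The input format, a pair of booleans per flow, and the function name are assumptions made for the example; the sample data is made up and is not a simulation result.

```python
def error_rates(flows):
    """flows: iterable of (is_interesting, reported_complete) pairs."""
    flows = list(flows)
    completed = [f for f in flows if f[1]]
    interesting = [f for f in flows if f[0]]
    # False positives: completed flows that are not interesting.
    fp = sum(1 for i, _ in completed if not i) / len(completed) if completed else 0.0
    # False negatives: interesting flows that were not reported complete.
    fn = sum(1 for _, c in interesting if not c) / len(interesting) if interesting else 0.0
    return fp, fn

# Toy data: 3 flows reported complete (1 not interesting), 3 interesting flows (1 missed).
sample = [(True, True), (True, True), (True, False), (False, True), (False, False)]
fp_rate, fn_rate = error_rates(sample)  # both equal 1/3 here
```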
Applications
Use in application-level QoS:
• Video congestion control
• Peer-to-Peer (P2P) traffic identification
Video congestion control
• Apply to MPEG video streaming
• 3 kinds of frames for MPEG video:
I frame – scene information
P frame – differential information
B frame – least important information
• Up to 30% of B frames can be dropped with acceptable quality
• Need to keep track of the current frame
Video congestion control
• Use an FCF ACSM to keep track of the state
• Experimentally, the highest acceptable false positive rate is 0.37%
• This requires a memory size of 27 bits per flow (about ¼ of the original 100 bits)
P2P Traffic Identification
• Goal: limit P2P flows to improve quality for other applications
• One possible way to identify a P2P flow: concurrent TCP and UDP flows
• Use ACSM for real-time P2P identification
Conclusion
• ACSMs are feasible
• The FCF approach performs best of the three
• Two potential applications are introduced for ACSM
• ACSM may be beneficial to QoS applications, which are fault-tolerant
Comments
• The authors focus on accuracy and memory size, not on real-world performance
• The FCF approach may not perform well in hardware