Querying The Internet With PIER
Nitin Khandelwal
Motivation
Inject a degree of distribution into databases
Internet-scale systems vs. hundred-node systems
Large-scale applications requiring database functionality
Applications
P2P Databases
- Highly distributed and available data
Network Monitoring
- Intrusion detection
- Fingerprint queries
Design Principles
Relaxed Consistency
- Sacrifice consistency to preserve availability and partition tolerance
Organic Scaling
- Growth with deployment
Natural Habitats for Data
- Data remains in original format with a DB interface
Standard Schemas
- Achieved through common software
DHTs
Implemented with CAN (Content Addressable Network).
Node identified by hyper-rectangle in d-dimensional space
Key hashed to a point, stored in the corresponding node.
Routing table of neighbours is maintained: O(d) entries per node.
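The key-to-node mapping can be sketched as follows. This is a minimal illustration, not CAN's actual implementation: the `Zone` class, the SHA-256 based `key_to_point`, the 2-dimensional space and the four fixed zones are all assumptions made for the example. Ownership is found here by scanning every zone, whereas a real CAN node forwards the request greedily through its neighbours.

```python
import hashlib
from dataclasses import dataclass

D = 2  # dimensions of the toy coordinate space (assumption for this sketch)


@dataclass(frozen=True)
class Zone:
    """A node's hyper-rectangle: [lo[i], hi[i]) along each dimension."""
    node: str
    lo: tuple
    hi: tuple

    def contains(self, point):
        return all(l <= p < h for l, p, h in zip(self.lo, point, self.hi))


def key_to_point(key, d=D):
    """Hash a key to a point in the unit d-cube (4 digest bytes per coordinate)."""
    digest = hashlib.sha256(key.encode()).digest()
    return tuple(
        int.from_bytes(digest[4 * i:4 * i + 4], "big") / 2**32 for i in range(d)
    )


def owner(zones, key):
    """Return the node whose zone contains the key's point (global scan, no routing)."""
    point = key_to_point(key, len(zones[0].lo))
    return next(z.node for z in zones if z.contains(point))


# Four hypothetical nodes that partition the unit square.
zones = [
    Zone("A", (0.0, 0.0), (0.5, 0.5)),
    Zone("B", (0.5, 0.0), (1.0, 0.5)),
    Zone("C", (0.0, 0.5), (0.5, 1.0)),
    Zone("D", (0.5, 0.5), (1.0, 1.0)),
]

print(owner(zones, "R/42"))
```

Because the zones partition the space, every key lands in exactly one zone, which is what lets any node compute where an item lives without a central directory.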
DHT Design
Routing Layer
- Mapping for keys (dynamic as nodes leave and join)
Storage Manager
- DHT-based data
Provider
- Storage access interface for higher levels
Provider
Couples the routing and storage layers
namespace – relation
resourceId – primary key
namespace + resourceId >> key
instanceId – distinguishes objects with same namespace and resourceId
lifetime – item storage duration
LScan, Multicast, Newdata
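The naming scheme above can be made concrete with a small single-node sketch. The class `ToyProvider`, its method names and signatures, and the SHA-1 key derivation are assumptions for illustration and are not PIER's actual API; time is passed in explicitly so that lifetime expiry is easy to follow.

```python
import hashlib


class ToyProvider:
    """Single-node stand-in for a provider that stores soft-state items."""

    def __init__(self):
        # (namespace, resource_id, instance_id) -> (item, expiry time)
        self._items = {}

    @staticmethod
    def key(namespace, resource_id):
        """namespace + resourceId together determine the DHT key."""
        return hashlib.sha1(f"{namespace}/{resource_id}".encode()).hexdigest()

    def put(self, namespace, resource_id, instance_id, item, lifetime, now):
        """Store an item that lives until now + lifetime."""
        self._items[(namespace, resource_id, instance_id)] = (item, now + lifetime)

    def get(self, namespace, resource_id, now):
        """All live items sharing a namespace and resourceId (any instanceId)."""
        return [
            item
            for (ns, rid, _), (item, expiry) in self._items.items()
            if ns == namespace and rid == resource_id and now < expiry
        ]

    def lscan(self, namespace, now):
        """Local scan: every live item of a namespace held by this node."""
        return [
            item
            for (ns, _, _), (item, expiry) in self._items.items()
            if ns == namespace and now < expiry
        ]


provider = ToyProvider()
provider.put("R", 7, "n1", {"a": 1}, lifetime=10, now=0)
provider.put("R", 7, "n2", {"a": 2}, lifetime=20, now=0)
print(provider.get("R", 7, now=15))
```

Two items with the same namespace and resourceId coexist because instanceId is part of the storage key, and an item simply disappears once its lifetime has passed unless it is stored again.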
PIER Query Processor
Operators: selection, projection, joins, grouping, aggregation
Operators push and pull data
Relaxed consistency and reachable snapshot:
- Ideally, work with the nodes reachable at query issue.
- Instead, each node uses the arrival of the query multicast message.
Join Algorithm
R, S – relations
Nr, Ns – relation namespaces
Nq – DHT-based temporary table
Symmetric Hash Join:
- Rehashes the relations
- Scan and copy in new namespace Nq
Fetch Matches:
- One relation (S) already hashed on join attribute
- Selections on non-join attributes of S cannot be pushed into the DHT
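A single-process sketch of the symmetric hash join may help here. It assumes tuples are Python dicts and models the temporary namespace Nq as an in-memory dictionary keyed by join value; in PIER the buckets would be spread over DHT nodes. The function name and its parameters are made up for this example.

```python
from collections import defaultdict
from itertools import zip_longest


def symmetric_hash_join(r_tuples, s_tuples, r_key, s_key):
    """Rehash both inputs on the join attribute; probe the other side on arrival."""
    # Nq: join value -> tuples of R and S seen so far with that value.
    nq = defaultdict(lambda: {"R": [], "S": []})
    results = []
    # Interleave the two inputs to mimic tuples arriving from both relations.
    for r, s in zip_longest(r_tuples, s_tuples):
        if r is not None:
            bucket = nq[r[r_key]]
            results.extend((r, match) for match in bucket["S"])
            bucket["R"].append(r)
        if s is not None:
            bucket = nq[s[s_key]]
            results.extend((match, s) for match in bucket["R"])
            bucket["S"].append(s)
    return results


R = [{"id": 1, "k": "a"}, {"id": 2, "k": "b"}]
S = [{"k": "a", "v": 10}, {"k": "c", "v": 20}, {"k": "a", "v": 30}]
for pair in symmetric_hash_join(R, S, "k", "k"):
    print(pair)
```

Each tuple is inserted once and probes once, so a matching pair is emitted exactly when its second member arrives, regardless of arrival order.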
Join Rewriting
Aimed at lowering the bandwidth utilization
Symmetric semi-join
- Local projections to Resource ID + join keys
- Symmetric Hash Join on two projections
- Global fetch matches join using Resource IDs of R and S
Bloom joins (hashed semi-join)
- Bloom filter is a hashing-based bit vector
- Local Bloom filters are published into temporary namespaces
- Filters are OR-ed and multicast to opposite relation's nodes
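The Bloom join steps can be sketched in a few lines. The filter size `M`, the number of hash positions `K`, and the SHA-256 based hashing are arbitrary choices for this illustration, and the two "nodes" are just Python lists; nothing here reflects PIER's actual parameters.

```python
import hashlib

M = 256  # bits per filter (assumption)
K = 3    # hash positions per value (assumption)


def _positions(value):
    """Derive K bit positions for a value from one SHA-256 digest."""
    digest = hashlib.sha256(str(value).encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % M for i in range(K)]


def make_filter(values):
    """Build a local Bloom filter, stored as an integer bit vector."""
    bits = 0
    for value in values:
        for pos in _positions(value):
            bits |= 1 << pos
    return bits


def combine(filters):
    """OR the local filters into one filter for the whole relation."""
    combined = 0
    for f in filters:
        combined |= f
    return combined


def might_contain(bits, value):
    """False means definitely absent; True means possibly present."""
    return all((bits >> pos) & 1 for pos in _positions(value))


# Join keys of R held on two hypothetical nodes.
r_node_keys = [[1, 2, 3], [3, 4]]
r_filter = combine(make_filter(keys) for keys in r_node_keys)

# An S node only ships tuples whose join key may match some R tuple.
s_tuples = [(2, "x"), (9, "y"), (4, "z")]
to_ship = [t for t in s_tuples if might_contain(r_filter, t[0])]
print(to_ship)
```

A Bloom filter never produces false negatives, so no matching tuple is dropped; false positives only cost some extra bandwidth, since the join itself still checks the keys.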
Workload Parameters
CAN configuration: d = 4
R 10 times larger than S
Constants provide 50% selectivity
f(x,y) evaluated after the join
90% of R tuples match a tuple in S
Result tuples are 1KB each
Symmetric hash join used
Simulation Setup
Up to 10,000 nodes
Network cross-traffic, CPU and memory utilizations ignored
Data shipped from source to computation node for every query operation
1. 100ms and 10Mbps fully connected links
2. GT-ITM transit-stub topology (similar results)
Join Algorithms
Infinite Bandwidth (observe impact of just propagation delay)
1024 data and computation nodes
Core Join Algorithms:
- Perform faster than the rewrites
Rewrites:
- Bloom Filter: two multicasts
- Semi-join: two CAN lookups
Join Algorithms -- 2
Limited Bandwidth
Symmetric Hash Join:
- Rehashes both tables
Semi Joins:
- Transfer only matching tuples
At 40% selectivity, bottleneck switches from computation nodes to query sites
Conclusions
Scalability of PIER derives from relaxed design principles
- adoption of soft states
- dilated snapshot semantics
Limitation: Just equality predicates
Directions:
- Pushdown of selections into DHT
- Caching and replication of DHT data
- Catalog Manager – Stringent consistency and availability requirements.
Sophia: An Information Plane
Nitin Khandelwal
Shared Information Plane
Distributed System running throughout the network.
- Collects information about network elements: local state (load, memory usage) and local perspective (reachability of other nodes)
- Evaluates statements (questions) about the state
- Reacts according to conclusions, e.g. killing a misbehaving service
Challenges
Information is widely distributed and dynamic
Statements formulated at run-time, not a priori
Centralized analysis not practical
- Push analysis to the nodes (push into the network)
Approach
Use logic programming model
- In a dynamic and distributed system, therefore temporal and positional logic
Why?
- Expressivity: intuitive to make statements about the state of the system
- Performance: logic expression transformation for efficient evaluation; partial results caching
Time and Position in the Language
Every term in the system has an environment containing time and location
Eval( bandwidth( env (at(node(Node),
time(Time),
Time > 1032445465,
BwVar),
BwVar > 40000))
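For readers unfamiliar with logic programming, the query above can be read roughly as: find nodes and times after 1032445465 at which the measured bandwidth exceeded 40000. A loose Python analogue is sketched below. The `Env` and `Fact` classes, the `query` helper, and the node names are assumptions for illustration only and say nothing about how Sophia represents terms internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Env:
    """Environment attached to every term: where and when it holds."""
    node: str
    time: int


@dataclass(frozen=True)
class Fact:
    name: str
    env: Env
    value: float


def query(facts, name, after, min_value):
    """Return (node, time, value) bindings that satisfy both constraints."""
    return [
        (f.env.node, f.env.time, f.value)
        for f in facts
        if f.name == name and f.env.time > after and f.value > min_value
    ]


# Hypothetical measurements from two nodes.
facts = [
    Fact("bandwidth", Env("node1", 1032445400), 50000),  # too old
    Fact("bandwidth", Env("node1", 1032445500), 45000),  # matches
    Fact("bandwidth", Env("node2", 1032445600), 30000),  # too slow
]
print(query(facts, "bandwidth", after=1032445465, min_value=40000))
```

The point of the environment is that time and location are ordinary variables: the same statement can pin them to constants or leave them free and let evaluation bind them.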
Performance
Aggressive Caching:
- Evaluation results are cached
- Sometimes latency is more important than freshness
- Time environment used to control freshness
Scheduling
- Pre-scheduling results to be available when and where they may be needed.
- Cache can be refreshed with fresh values
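The trade-off between latency and freshness can be illustrated with a small cache in which the caller states how stale a result may be. The class `FreshnessCache` and its methods are hypothetical names for this sketch, and time is passed in explicitly rather than read from a clock.

```python
class FreshnessCache:
    """Caches evaluation results together with the time they were computed."""

    def __init__(self, evaluate):
        self._evaluate = evaluate  # callable(term, now) -> value
        self._cache = {}           # term -> (value, time of evaluation)
        self.evaluations = 0       # counts how often real evaluation ran

    def get(self, term, now, max_age):
        """Return a cached value if it is at most max_age old, else re-evaluate."""
        hit = self._cache.get(term)
        if hit is not None and now - hit[1] <= max_age:
            return hit[0]
        return self.refresh(term, now)

    def refresh(self, term, now):
        """Evaluate now and store the result; also usable for pre-scheduling."""
        value = self._evaluate(term, now)
        self.evaluations += 1
        self._cache[term] = (value, now)
        return value


# Toy evaluator: the "measurement" is just a function of time.
cache = FreshnessCache(lambda term, now: now * 10)
print(cache.get("bandwidth", now=0, max_age=5))  # evaluated
print(cache.get("bandwidth", now=3, max_age=5))  # served from cache
print(cache.get("bandwidth", now=3, max_age=1))  # too stale, re-evaluated
```

A caller that values latency passes a large `max_age`; one that needs fresh data passes a small one. Calling `refresh` ahead of time plays the role of pre-scheduling, so that a later `get` is a cache hit.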
Evaluation Planning
Given an expression, plan
- where (close to data)
- when (time when dependencies resolved)
- what to evaluate
Logic expressions can be transformed at runtime
Extensibility
Users can add new functionality at run-time
Capabilities: to protect modules, grant and revoke privileges.
cap569354(Val) :- read sensor.
cap435456(Val) :- cap569354(Val).
bandwidth(Val) :- cap435456(Val).
Module Protection: all predicates transformed into capabilities, shared through master key capability
Danger in caching: different interfaces
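One possible reading of the capability chain above is that a predicate is reachable only through an unguessable name, and that access is withdrawn by removing an intermediate name. The sketch below illustrates that reading; the `Registry` class and all of its methods are hypothetical and do not describe Sophia's actual mechanism.

```python
import secrets


class Registry:
    """Maps predicate names to callables; capabilities are unguessable names."""

    def __init__(self):
        self._preds = {}

    def define(self, fn):
        """Register fn under a fresh random capability name and return that name."""
        cap = "cap" + secrets.token_hex(8)
        self._preds[cap] = fn
        return cap

    def publish(self, name, cap):
        """Expose a public predicate that delegates to a capability."""
        self._preds[name] = lambda: self.call(cap)

    def revoke(self, cap):
        """Remove a capability; anything that delegates to it stops working."""
        self._preds.pop(cap, None)

    def call(self, name):
        if name not in self._preds:
            raise PermissionError(f"no such predicate or capability: {name}")
        return self._preds[name]()


registry = Registry()
sensor_cap = registry.define(lambda: 42000)                      # like cap569354
wrapper_cap = registry.define(lambda: registry.call(sensor_cap))  # like cap435456
registry.publish("bandwidth", wrapper_cap)

print(registry.call("bandwidth"))
registry.revoke(wrapper_cap)  # bandwidth can no longer reach the sensor
```

The intermediate capability is what makes revocation cheap: the sensor predicate stays in place, and only the link handed out to others is cut.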
PIER and Sophia
Sophia: the location of code execution is explicit in the language and can itself be computed in the course of evaluation.
PIER: details of query execution left to underlying implementation to optimize.
Consequence: Sophia queries are more sophisticated: both user and system participate in evaluation planning.