DataCutter
Joel Saltz
Alan Sussman
Tahsin Kurc
University of Maryland, College Park
and
Johns Hopkins Medical Institutions
http://www.cs.umd.edu/projects/adr
DataCutter
• A suite of middleware for subsetting and filtering multi-dimensional datasets stored on archival storage systems
• Subsetting through Range Queries
  • a hyperbox defined in the multi-dimensional space underlying the dataset
  • items whose multi-dimensional coordinates fall into the box are retrieved
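A range query of this kind can be sketched in a few lines of C++. This is only an illustration: `HyperBox`, `contains`, and `rangeQuery` are hypothetical names, the boundaries are assumed to be inclusive, and a linear scan stands in for whatever index the real system consults.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration; not DataCutter's own definitions.
struct HyperBox {
    std::vector<double> min;  // lower corner, one value per dimension
    std::vector<double> max;  // upper corner, one value per dimension
};

// True when the point lies inside the box (inclusive bounds are an assumption).
bool contains(const HyperBox& box, const std::vector<double>& point) {
    if (point.size() != box.min.size() || point.size() != box.max.size()) {
        return false;  // dimension mismatch: treat as "not in the box"
    }
    for (std::size_t d = 0; d < point.size(); ++d) {
        if (point[d] < box.min[d] || point[d] > box.max[d]) return false;
    }
    return true;
}

// Return the positions of all items whose coordinates fall into the query box.
std::vector<std::size_t> rangeQuery(const HyperBox& query,
                                    const std::vector<std::vector<double>>& items) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (contains(query, items[i])) hits.push_back(i);
    }
    return hits;
}
```

For a 2-D box from (0, 0) to (10, 5), the points (1, 1) and (10, 5) are retrieved, while (11, 1) and (3, -1) are not.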
DataCutter
• Restricted processing (filtering/aggregations) through Filters
  • to reduce the amount of data transferred to the client
  • filters can run anywhere, but are intended to run near (i.e., over a local area network) the storage system
  • based on filter-stream programming model -- to optimize use of limited resources, such as memory and disk space
DataCutter
[Figure: DataCutter system overview. Labels: two clients exchanging Range Query, Segment Info., and Segment Data with DataCutter; Client Interface Service; Indexing Service; Filtering Service (with filters); Data Access Service; archival storage systems; segments addressed as (File, Offset, Size).]
DataCutter Architecture
• Client Interface Service
  • Manages client connections and client requests
  • Manages data and information flow between different services
• Indexing Service
  • Two-level hierarchical indexing -- summary and detailed index files
  • Customizable --
    • Default R-tree index
    • User can add new indexing methods
DataCutter Architecture
• Filtering Service
  • Manages filters (registered in the system)
  • Users can add/run new filters
• Data Access Service
  • Manages storage/retrieval of data from the tertiary storage
  • Low level system dependent I/O operations
DataCutter -- Subsetting
• Datasets are partitioned into segments
  • used to index the dataset, unit of retrieval
• Indexing very large datasets
  • Multi-level hierarchical indexing scheme
  • Summary index files -- to index a group of segments or detailed index files
  • Detailed index files -- to index the segments
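The benefit of the summary level is that whole detailed index files can be skipped when their bounding box misses the query. A minimal sketch of that idea follows, assuming a flat list of summary entries rather than the default R-tree; `Box`, `Segment`, `SummaryEntry`, and `search` are hypothetical names, and the in-memory vector merely stands in for a detailed index file.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Axis-aligned bounding box; boxes being compared must share one dimension count.
struct Box {
    std::vector<double> min;
    std::vector<double> max;
};

// True when the two boxes overlap in every dimension.
bool intersects(const Box& a, const Box& b) {
    for (std::size_t d = 0; d < a.min.size(); ++d) {
        if (a.max[d] < b.min[d] || b.max[d] < a.min[d]) return false;
    }
    return true;
}

// A segment is the unit of retrieval, located by (file, offset, size).
struct Segment {
    Box mbr;
    std::string file;
    unsigned offset;
    unsigned size;
};

// One summary entry covers a group of segments (one detailed index file).
struct SummaryEntry {
    Box mbr;                        // bounds every segment in `detailed`
    std::vector<Segment> detailed;  // stand-in for the detailed index file
};

// Two-level search: consult the summary first, open a detailed index only
// when its bounding box intersects the query.
std::vector<Segment> search(const std::vector<SummaryEntry>& summary,
                            const Box& query, int& detailedFilesOpened) {
    std::vector<Segment> hits;
    detailedFilesOpened = 0;
    for (const SummaryEntry& entry : summary) {
        if (!intersects(entry.mbr, query)) continue;  // prune the whole group
        ++detailedFilesOpened;
        for (const Segment& seg : entry.detailed) {
            if (intersects(seg.mbr, query)) hits.push_back(seg);
        }
    }
    return hits;
}
```

The `detailedFilesOpened` counter is there only to make the pruning visible; a query that touches one group leaves the other detailed index unopened.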
DataCutter -- Filters
• Filters
  • Specialized user program to process data (segments) before returning them to the client
• Filter-stream programming model
  • Originally developed for Active Disks environment (Acharya, Uysal, and Saltz)
  • Based on stream abstraction
    • A stream denotes a supply of data
    • Streams deliver data in fixed size buffers
    • Communication of a filter with its environment is restricted to its input and output streams
  • init, process, finalize interface
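To make the model concrete, here is a small single-threaded sketch of a filter that talks to its environment only through an input and an output stream. Everything in it is a stand-in invented for illustration (`Stream`, `Filter`, `SumFilter`, `kBufferSize`); the method signatures are simplified and do not match the prototype's C++ binding, and the final buffer is allowed to be partially filled.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Values per buffer; an arbitrary size chosen for this sketch.
constexpr std::size_t kBufferSize = 4;

using Buffer = std::vector<int>;  // holds at most kBufferSize values

// A stream denotes a supply of data, delivered one buffer at a time.
class Stream {
public:
    void write(const Buffer& b) { queue_.push_back(b); }
    bool read(Buffer& out) {
        if (queue_.empty()) return false;
        out = queue_.front();
        queue_.pop_front();
        return true;
    }
private:
    std::deque<Buffer> queue_;
};

// Hypothetical filter interface with the three phases named on the slide.
class Filter {
public:
    virtual ~Filter() = default;
    virtual void init() = 0;                              // set up scratch space
    virtual void process(Stream& in, Stream& out) = 0;    // consume and produce buffers
    virtual void finalize(Stream& out) = 0;               // flush remaining state
};

// Aggregation filter: reduces each input buffer to its sum, so less data
// travels downstream toward the client.
class SumFilter : public Filter {
public:
    void init() override { pending_.clear(); }
    void process(Stream& in, Stream& out) override {
        Buffer b;
        while (in.read(b)) {
            int sum = 0;
            for (int v : b) sum += v;
            pending_.push_back(sum);
            if (pending_.size() == kBufferSize) {  // emit only full buffers here
                out.write(pending_);
                pending_.clear();
            }
        }
    }
    void finalize(Stream& out) override {
        if (!pending_.empty()) {  // last buffer may be partially filled
            out.write(pending_);
            pending_.clear();
        }
    }
private:
    Buffer pending_;
};
```

Feeding five buffers of four values each through `SumFilter` yields one full output buffer of four sums followed by a final buffer holding the fifth sum, a fourfold reduction in values sent downstream.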
A Motivating Scenario
Sample Application:
• generate 3D reconstructed view from new set of sensor readings
• compare features with reference db
Grid Configuration:
• remote data server - reference db
• sensor host - large raw readings
• parallel computation farm available
• 3D reconstruction computationally intensive
[Figure: a sensor host (raw dataset: sensor readings), a computation farm, a client PC, and a data server (reference DB: feature list) connected over a WAN; each host is marked with a "?".]
A Motivating Scenario (2)
Application:
// process relevant raw readings
// generate 3D view
// compute features of 3D view
// find similar features in reference db
// display new view and similar cases
[Figure: the application split into four filters placed across the WAN: Extract raw on the sensor host (raw dataset: sensor readings), 3D reconstruction on the computation farm, Extract ref on the data server (reference DB: feature list), and View result on the client PC.]
Filters
• Filters
  • communicate with other filters only using streams
  • cannot change stream endpoints
  • are allowed to pre-disclose dynamic allocation of memory/scratch space in init phase, before processing phase
• Advantages
  • location independence
  • easier scheduling of resources
  • filter stop and restart is defined explicitly in model
Placement
• The dynamic assignment of filters to particular hosts for execution is placement (mapping)
• Optimization criteria:
  • Communication
    • leverage filter affinity to dataset
    • minimize communication volume on slower connections
    • co-locate filters with large communication volume
  • Computation
    • expensive computation on faster, less loaded hosts
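The communication criteria can be illustrated with a toy cost function that compares candidate placements. This is a sketch under simple assumptions, not the system's placement algorithm: the per-megabyte LAN and WAN costs are made-up parameters, co-located filters communicate for free, host load and computation cost are ignored, and all names (`StreamEdge`, `Placement`, `communicationCost`) are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One stream between two filters, with its expected data volume.
struct StreamEdge {
    std::string producer;
    std::string consumer;
    double megabytes;
};

using Placement = std::map<std::string, std::string>;  // filter name -> host name

// Sum the transfer cost of every stream under a given placement.
// Streams between co-located filters cost nothing; streams within a site
// use the LAN rate; streams between sites use the (slower) WAN rate.
// Throws std::out_of_range if a filter or host is missing from the maps.
double communicationCost(const std::vector<StreamEdge>& edges,
                         const Placement& placement,
                         const std::map<std::string, std::string>& siteOfHost,
                         double lanCostPerMB, double wanCostPerMB) {
    double total = 0.0;
    for (const StreamEdge& e : edges) {
        const std::string& from = placement.at(e.producer);
        const std::string& to = placement.at(e.consumer);
        if (from == to) continue;  // co-located: no network transfer
        const bool sameSite = siteOfHost.at(from) == siteOfHost.at(to);
        total += e.megabytes * (sameSite ? lanCostPerMB : wanCostPerMB);
    }
    return total;
}
```

With a 1000 MB stream from an extraction filter to a reconstruction filter and a 10 MB stream on to a viewer, keeping reconstruction near the data and sending only the small result over the WAN costs far less than shipping the raw readings to the client, which is the intuition behind minimizing volume on slower connections.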
Restructuring Process
[Figure: restructuring flow. Inputs: Application, Target Configuration. Steps: Decompose (yielding some set of filters, f1 through f5), Placement / Schedule, Execute Application.]
Software Infrastructure
• Prototype implementation of filter framework
  • C++ language binding
  • manual placement
  • wide-area execution service
  • one thread for each instantiated filter
Filter Framework
class MyFilter : public AS_Filter_Base {
public:
    int init(int argc, char *argv[]) { … };
    int process(stream_t st) { … };
    int finalize(void) { … };
};
Filter Connectivity / Placement
[filter.A]
outs = stream1 stream3
[filter.B]
ins = stream1
outs = stream2
[filter.C]
ins = stream2 stream3
[placement]
A = host1.cs.umd.edu
B = host2.cs.umd.edu
C = host3.cs.umd.edu
[Figure: filter A feeds B over stream1 and C over stream3; B feeds C over stream2.]
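A specification in this style is easy to read mechanically. The sketch below parses the section/key layout shown on the slide and reports which streams cross host boundaries. It is an illustration only: the grammar is inferred from this one example, and `Layout`, `parseLayout`, and `remoteStreams` are hypothetical names rather than part of the prototype.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

struct FilterSpec {
    std::vector<std::string> ins;
    std::vector<std::string> outs;
};

struct Layout {
    std::map<std::string, FilterSpec> filters;     // filter name -> its streams
    std::map<std::string, std::string> placement;  // filter name -> host
};

// Strip leading and trailing spaces, tabs, and carriage returns.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r";
    const std::string::size_type b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    const std::string::size_type e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Parse "[filter.X]" sections with "ins"/"outs" keys and a "[placement]" section.
Layout parseLayout(const std::string& text) {
    Layout layout;
    std::string section, line;
    std::istringstream input(text);
    while (std::getline(input, line)) {
        line = trim(line);
        if (line.empty()) continue;
        if (line.front() == '[' && line.back() == ']') {
            section = line.substr(1, line.size() - 2);
            continue;
        }
        const std::string::size_type eq = line.find('=');
        if (eq == std::string::npos) continue;  // skip lines without "key = value"
        const std::string key = trim(line.substr(0, eq));
        const std::string value = trim(line.substr(eq + 1));
        if (section == "placement") {
            layout.placement[key] = value;
        } else if (section.compare(0, 7, "filter.") == 0) {
            FilterSpec& spec = layout.filters[section.substr(7)];
            std::istringstream names(value);
            std::string name;
            while (names >> name) {
                if (key == "ins") spec.ins.push_back(name);
                else if (key == "outs") spec.outs.push_back(name);
            }
        }
    }
    return layout;
}

// Streams whose writer and reader sit on different hosts.
// Throws std::out_of_range if a filter has no placement entry.
std::vector<std::string> remoteStreams(const Layout& layout) {
    std::map<std::string, std::string> writerHost;
    for (const auto& f : layout.filters) {
        for (const std::string& s : f.second.outs) {
            writerHost[s] = layout.placement.at(f.first);
        }
    }
    std::vector<std::string> remote;
    for (const auto& f : layout.filters) {
        for (const std::string& s : f.second.ins) {
            const auto it = writerHost.find(s);
            if (it != writerHost.end() && it->second != layout.placement.at(f.first)) {
                remote.push_back(s);
            }
        }
    }
    return remote;
}
```

Under the placement on the slide, all three streams connect filters on different hosts; moving B onto host1 alongside A would leave only stream2 and stream3 crossing the network.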
Execution Service
[Figure: execution service. An Application Console (linked with the filter lib, running on an unspecified host) performs three numbered steps: 1. Read the filter/stream specs and placement; 2. Query the Directory Daemon at dir.cs.umd.edu:6000, which keeps a directory of (name, host, port) entries; 3. Exec. On each of host1.cs.umd.edu, host2.cs.umd.edu, and host3.cs.umd.edu, an AppExec Daemon execs an application process (linked with the filter lib) that runs filter A, filter B, and filter C respectively.]
Related Work
[Figure: related systems arranged by level. Level labels: Application Level, Programming Models, Infrastructure Services, Resource Level. Resource labels: Grid available resources, user specified resources, idle resources. Systems shown: Globus, Legion, Condor Pool, client/server sockets, Java RMI, DCOM, CORBA, NetSolve, Ninf, AppLeS, HPC++, NWS, DataCutter, Harmony, DSM, MPI, RPC, DPSS, SRB.]
Integrating DataCutter with the Storage Resource Broker
Storage Resource Broker (SRB)
• Middleware between clients and storage resources
• Remote Access to storage resources.
  • Various types:
    • File Systems - UNIX, HPSS, UniTree, DPSS (LBL).
    • DB large objects - Oracle, DB2, Illustra.
• Uniform client interface (API).
Storage Resource Broker (SRB)
• MCAT - MetaData Catalog
  • Datasets (files) and Collections (directories) - inodes and more.
  • Storage resources
  • User information - authentication, access privileges, etc.
• Software package
  • Server, client library, UNIX-like utilities, Java GUI
  • Platforms - Solaris, Sun OS, Digital Unix, SGI Irix, Cray T90.
SRB/DataCutter - Prototype Implementation
• Support for Range Queries
  • Creation of indices over data sets (composed set of data files)
  • Subsetting of data sets
    • Search for files or portions of files that intersect a given range query
• Restricted filter operations on portions of files (data segments) before returning them to the client (to perform filtering or aggregation to reduce data volume)
SRB/DataCutter System
[Figure: SRB/DataCutter system layers. An application (SRB client) uses the SRB I/O and MCAT API and issues range queries; the Storage Resource Broker (SRB) is shown with the MCAT (application, user, and resource meta-data) and with DataCutter (Indexing Service, Filtering Service with filters); below are the distributed storage resources: DB2, Oracle, Illustra, ObjectStore; HPSS, UniTree; UNIX, ftp. Other labels: File SID, DBLobjID, ObjSID.]
SRB/DataCutter Client Interface
• Creating and Deleting Index
int sfoCreateIndex(srbConn *conn, sfoClass class, int catType,
                   char *inIndexName, char *outIndexName, char *resourceName)
int sfoDeleteIndex(srbConn *conn, sfoClass class, int catType, char *indexName)
SRB/DataCutter Client Interface
• Searching Index -- R-tree index
typedef struct {
    int dim;                 /* bounding box dimensions */
    double *min;             /* minimum in each dimension */
    double *max;             /* maximum in each dimension */
} sfoMBR;                    /* Bounding box structure */
typedef struct {
    sfoMBR segmentMBR;       /* bounding box of the segment */
    char *objID;             /* object in SRB that contains the segment */
    char *collectionName;    /* collection where object is stored */
    unsigned int offset;     /* offset of the segment in the object */
    unsigned int size;       /* size of segment */
} segmentInfo;               /* segment meta-data information */
typedef struct {
    int segmentCount;        /* number of segments returned */
    segmentInfo *segments;   /* segment meta-data information */
    int continueIndex;       /* continuation flag */
} indexSearchResult;         /* search result structure */
SRB/DataCutter Client Interface
• Searching Index -- R-tree index
int sfoSearchIndex(srbConn *conn, sfoClass class, char *indexName, void *query,
                   indexSearchResult *myresult, int maxSegCount)
typedef struct {
    int dim;
    double *min, *max;
} rangeQuery;
int sfoGetMoreSearchResult(srbConn *conn, int continueIndex,
                           indexSearchResult *myresult, int maxSegCount)
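The pairing of a search call with a "get more" call, both bounded by maxSegCount, suggests a paged retrieval loop on the client. The sketch below shows that pattern against a mock, since the SRB library is not available here. `MockIndex`, `Page`, and `fetchAll` are invented for illustration, and the convention that a negative continueIndex means "no more results" is an assumption; the slides do not say how the continuation flag is encoded.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One page of results plus a continuation flag (hypothetical stand-in
// for indexSearchResult; segment ids replace the segmentInfo records).
struct Page {
    std::vector<int> segmentIds;
    int continueIndex;  // assumed: negative means the result set is exhausted
};

// Mock index that hands out its matches at most maxSegCount at a time.
class MockIndex {
public:
    explicit MockIndex(std::vector<int> matches) : matches_(std::move(matches)) {}

    Page search(int maxSegCount) const { return page(0, maxSegCount); }
    Page getMore(int continueIndex, int maxSegCount) const {
        return page(continueIndex, maxSegCount);
    }

private:
    Page page(int start, int maxSegCount) const {
        const int total = static_cast<int>(matches_.size());
        const int end = std::min(total, start + maxSegCount);
        Page p;
        p.segmentIds.assign(matches_.begin() + start, matches_.begin() + end);
        p.continueIndex = end < total ? end : -1;
        return p;
    }
    std::vector<int> matches_;
};

// Drain the whole result set, one bounded page at a time.
std::vector<int> fetchAll(const MockIndex& index, int maxSegCount) {
    std::vector<int> all;
    if (maxSegCount <= 0) return all;  // a zero page size could never finish
    Page p = index.search(maxSegCount);
    all.insert(all.end(), p.segmentIds.begin(), p.segmentIds.end());
    while (p.continueIndex >= 0) {
        p = index.getMore(p.continueIndex, maxSegCount);
        all.insert(all.end(), p.segmentIds.begin(), p.segmentIds.end());
    }
    return all;
}
```

Bounding each reply this way lets a client with limited memory process segment meta-data incrementally instead of receiving an arbitrarily large result in one message.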
Applying Filters
typedef struct {
    segmentInfo segInfo;     /* info on segment data buffer after filter oper. */
    char *segment;           /* segment data buffer after filter is applied */
} segmentData;
typedef struct {
    int segmentDataCount;    /* #segments in segmentData array */
    segmentData *segments;   /* segmentData array */
    int continueIndex;       /* continuation flag */
} filterDataResult;
Applying Filters
int sfoApplyFilter(srbConn *conn, sfoClass class, char *hostName, int filterID,
                   char *filterArg, int numOfInputSegments, segmentInfo *inputSegments,
                   filterDataResult *myresult, int maxSegCount)
int sfoGetMoreFilterResult(srbConn *conn, int continueIndex,
                           filterDataResult *myresult, int maxSegCount)
Application: Virtual Microscope
• Interactive software emulation of high power light microscope for processing/visualizing image datasets
• 3-D Image Dataset (100MB to 5GB per focal plane)
• Client-server system organization
• Rectangular region queries, multiple data chunk reply
• pipeline style processing
[Figure: filter pipeline with stages read_data, decompress, clip, zoom, view.]
Virtual Microscope Client
VM Application using SRB/DataCutter
[Figure: Virtual Microscope filters (read, decompress, clip, zoom, view) mapped onto clients, a wide area network, a distributed collection of workstations on a local area network, and SRB/DataCutter (with indexing) over distributed storage resources.]
Filters:
• read: read image chunks
• decompress: convert jpeg image chunks into RGB pixels
• clip: clip image to query boundaries
• zoom: sub-sample to the required magnification
• view: stitch image pieces together and display image
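The clip and zoom stages are simple enough to sketch directly. The version below is an illustration under simplifying assumptions: it works on a single-channel image rather than RGB pixels, the query rectangle is half-open, zoom keeps every factor-th pixel with no averaging, and `Image`, `Rect`, `clip`, and `zoom` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major, one value per pixel (grayscale keeps the sketch short).
struct Image {
    int width;
    int height;
    std::vector<unsigned char> pixels;
};

// Half-open rectangle: columns [x0, x1), rows [y0, y1).
struct Rect {
    int x0, y0, x1, y1;
};

// clip: keep only the pixels that fall inside the query boundaries.
Image clip(const Image& src, const Rect& r) {
    const int x0 = std::max(r.x0, 0);
    const int y0 = std::max(r.y0, 0);
    const int x1 = std::min(r.x1, src.width);
    const int y1 = std::min(r.y1, src.height);
    Image out{std::max(x1 - x0, 0), std::max(y1 - y0, 0), {}};
    for (int y = y0; y < y1; ++y) {
        for (int x = x0; x < x1; ++x) {
            out.pixels.push_back(src.pixels[static_cast<std::size_t>(y) * src.width + x]);
        }
    }
    return out;
}

// zoom: sub-sample by keeping every `factor`-th pixel in both directions.
Image zoom(const Image& src, int factor) {
    if (factor < 1) factor = 1;  // treat invalid factors as "no reduction"
    Image out{(src.width + factor - 1) / factor,
              (src.height + factor - 1) / factor, {}};
    for (int y = 0; y < src.height; y += factor) {
        for (int x = 0; x < src.width; x += factor) {
            out.pixels.push_back(src.pixels[static_cast<std::size_t>(y) * src.width + x]);
        }
    }
    return out;
}
```

Running clip before zoom, as in the pipeline order on the slides, means the sub-sampling step only touches pixels the query actually asked for, and both stages shrink the data before it reaches the view filter.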