software connector classification and selection for data-intensive systems
Post on 19-Jan-2016
Software Connector Classification and Selection for Data-Intensive Systems
Chris A. Mattmann, David Woollard, Nenad Medvidovic, Reza Mahjourian
2nd Intl. Workshop on Incorporating COTS Software into Software Systems (IWICSS 2007)
Agenda
• Research Problem and Importance
• Our Approach
  – Classification
  – Selection
  – Analysis
• Evaluation
  – Precision, Recall, Accuracy Measurements
• Related Work
• Conclusion & Future Work
Research Problem and Importance
• Content repositories are growing rapidly in size
• At the same time, we expect more immediate dissemination of this data
• How do we distribute it…
  – In a performant manner?
  – Fulfilling system requirements?
[Figure: NASA Planetary Data System Archive Volume Growth. x-axis: Year (1990 to 2008); y-axis: accumulated archive volume in TBytes (0 to 90)]
Software Architecture
• The definition of a system in the form of its canonical building blocks
  – Software Components: the computational units in the system
  – Software Connectors: the communications and interactions between software components
  – Software Configurations: arrangements of components and connectors and the rules that guide their composition
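The three building blocks can be sketched as plain data types. This is a minimal illustration only: the class names, the `attach` method, and the composition rule it enforces are assumptions made for this example, not part of the presented work.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Component:
    """A computational unit in the system."""
    name: str


@dataclass(frozen=True)
class Connector:
    """A communication or interaction path between two components."""
    name: str
    source: Component
    target: Component


@dataclass
class Configuration:
    """An arrangement of components and connectors plus one composition rule."""
    components: set = field(default_factory=set)
    connectors: list = field(default_factory=list)

    def attach(self, connector: Connector) -> None:
        # Hypothetical composition rule: both endpoints must already belong
        # to the configuration before a connector may join them.
        known = self.components
        if connector.source not in known or connector.target not in known:
            raise ValueError(f"{connector.name}: endpoint not in configuration")
        self.connectors.append(connector)


producer = Component("DataProducer")
consumer = Component("DataConsumer")
config = Configuration(components={producer, consumer})
config.attach(Connector("FTP", producer, consumer))
```

Attaching a connector whose endpoint is not in the configuration raises `ValueError`, which is one simple way to express "rules that guide their composition".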
Data Distribution Systems
[Diagram: a Data Producer sends data through an unknown mechanism (???) to several Data Consumers]
Insight: Use Software Connectors to model data distribution technologies
[Diagram: Component, Connector, Component]
Data Movement Technologies
• Wide array of available OTS “large-scale” connector technologies
  – GridFTP, Aspera software, HTTP/REST, RMI, CORBA, SOAP, XML-RPC, Bittorrent, JXTA, UFTP, FTP, SFTP, SCP, Siena, GLIDE/PRISM-MW, and more
• Which one is the best one?
• How do we compare them
  – Given our current architecture?
  – Given our distribution scenarios & requirements?
Research Question
• What types of software connectors are best suited for delivering vast amounts of data to users in a way that satisfies their particular scenarios and is performant and scalable in these hugely distributed data systems?
Data Distribution Problem Space
Broad variety of distribution connector families
• P2P, Grid, Client/Server, and Event-based
• Though each connector family varies slightly in some form or fashion
  – They all share 3 common atomic connector constituents
    • Data Access, Stream, Distributor
    • Adapted from Mehta et al.’s Connector Taxonomy
Connector Tradeoff Space
• Surveyed properties of 13 representative distribution connectors, across all 4 distribution connector families, and classified them
  – Client/Server
    • SOAP, RMI, CORBA, HTTP/REST, FTP, UFTP, SCP, Commercial UDP Technology
  – Peer to Peer
    • Bittorrent
  – Grid
    • GridFTP, bbFTP
  – Event-based
    • GLIDE, Siena
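The classification above can be captured as a small lookup table, which also makes the counts (13 connectors, 4 families) easy to check. The dictionary name and the `family_of` helper are hypothetical and exist only for this sketch.

```python
# Connector families and members exactly as listed in the survey above.
CONNECTOR_FAMILIES = {
    "Client/Server": [
        "SOAP", "RMI", "CORBA", "HTTP/REST", "FTP", "UFTP", "SCP",
        "Commercial UDP Technology",
    ],
    "Peer to Peer": ["Bittorrent"],
    "Grid": ["GridFTP", "bbFTP"],
    "Event-based": ["GLIDE", "Siena"],
}


def family_of(connector: str) -> str:
    """Return the family a connector was classified under."""
    for family, members in CONNECTOR_FAMILIES.items():
        if connector in members:
            return family
    raise KeyError(connector)


total_connectors = sum(len(members) for members in CONNECTOR_FAMILIES.values())
```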
Large Heterogeneity in Connector Properties
[Figure: Procedure Call Connector Breakdown (5 connectors, 2 families). Bar chart; y-axis: Num Connectors (0 to 6).
Dimensions: proc_call_params_return_value, proc_call_cardinality_senders, proc_call_invocation_explicit, proc_call_params_invocation_record, proc_call_params_datatransfer, proc_call_accessibility, proc_call_semantics.
Values shown: HTTP Response, RMI message, GridFTP message, SOAP message, CORBA message, one sender, Method Call, Globus Log Layer, HTTP Server log, RMI Registry, CORBA Name Registry, Web Server, value, reference, public, protected, private, one receiver, keyword]
[Figure: Data Access Connector Breakdown (8 connectors, 4 families). Bar chart; y-axis: Num Connectors (0 to 9).
Dimensions: data_access_locality, data_access_persistence, data_access_avail_transient, data_access_cardinality_receivers, data_access_accesses, data_access_cardinality_senders.
Values shown: Process, Global, Dynamic Data Exchange, Database Access, Repository Access, File I/O, Session-Based, Cache, Peer-Based, Many Receivers, One Receiver, Accessor, Mutator, Many Senders, One Sender]
[Figure: Distributor Connector Breakdown (8 connectors, 4 families). Bar chart; y-axis: Num Connectors (0 to 9).
Dimensions: distributor_routing_membership, distributor_delivery_type, distributor_naming_type, distributor_naming_structures, distributor_routing_type, distributor_delivery_semantics, distributor_routing_path, distributor_delivery_mechanisms.
Values shown: ad-hoc, bounded, RMI Message, GridFTP Message, SOAP Message, Event, HTTP Message, Peer Pieces, registry-based, attribute-based, Hierarchical, Flat, content-based, tcp/ip, architecture configuration, tracker, Exactly Once, At least once, Best Effort, dynamic, cached, static, Unicast, Multicast, Broadcast]
[Figure: Stream Connector Breakdown (8 connectors, 4 families). Bar chart; y-axis: Num Connectors (0 to 9).
Dimensions: stream_formats, stream_cardinality_senders, stream_localities, stream_deliveries, stream_throughput, stream_cardinality_receivers, stream_state, stream_identity, stream_bounds, stream_synchronicity, stream_buffering.
Values shown: Raw, Structured, Many Senders, One Sender, Remote, Local, Exactly Once, At least once, Best Effort, bps, Many Receivers, One Receiver, Stateful, Stateless, Named, Bounded, Asynchronous, Time Out Synchronous, Buffered]
How do experts make these decisions?
• Performed survey of 33 “experts”
• Experts defined to be
  – Practitioners in industry, building data-intensive systems
  – Researchers in data distribution
  – Admitted architects of data distribution technologies
• General consensus?
  – They don’t know the how and the why about which connector(s) are appropriate
  – They rely on anecdotal evidence and “intuition”
Percentage Breakdown of Expert Responses
[Pie chart. Categories: No Response, Not Comfortable, No Time, Full Response. Slice values: 67%, 15%, 15%, 3%]
Expert Survey Demographic
[Pie chart. Categories: Cancer Research, Planetary Science, Earth Science, Industry, Grid Computing, Professors, Web Technologies, Open Source, Students. Slice values: 6%, 18%, 12%, 12%, 6%, 22%, 6%, 12%, 6%]
45% of respondents claimed to be uncomfortable being addressed as a data distribution expert.
Our Approach: DISCO
• Develop a software framework for:
  – Connector Classification
    • Build metadata profiles of connector technologies, describing their intrinsic properties (DCPs)
  – Connector Selection
    • Adaptable, extensible algorithm development framework for selecting the “right” connectors (and identifying wrong ones)
  – Connector Selection Analysis
    • Measurement of accuracy of results
  – Connector Performance Analysis
DISCO in a Nutshell
Building DCPs of all 13 connectors (Classification)
• Rely on Mehta et al. metadata to describe data distribution connectors
• Carefully select metadata to include/exclude
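A connector profile of this kind can be pictured as a mapping from metadata dimensions to the values a connector supports. The sketch below is an assumption about shape only: the `ConnectorProfile` class, its `supports` method, and the example connector are hypothetical, and the property values are illustrative rather than taken from the authors' actual profiles. The dimension names reuse those shown in the breakdown charts.

```python
from dataclasses import dataclass, field


@dataclass
class ConnectorProfile:
    """Hypothetical container for a distribution connector profile (DCP)."""
    name: str
    family: str
    # Maps a metadata dimension to the set of values the connector supports.
    properties: dict = field(default_factory=dict)

    def supports(self, dimension: str, value: str) -> bool:
        """True if the profile lists `value` under `dimension`."""
        return value in self.properties.get(dimension, set())


# Illustrative values only; not one of the 13 surveyed connectors.
example = ConnectorProfile(
    name="ExampleConnector",
    family="Client/Server",
    properties={
        "data_access_locality": {"Process"},
        "stream_deliveries": {"Exactly Once"},
        "distributor_delivery_mechanisms": {"Unicast"},
    },
)
```

Keeping each dimension as a set lets a profile record more than one supported value, and leaving a dimension out is one way to express that a piece of metadata was deliberately excluded.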
Develop complementary selection algorithms
Preliminary Evaluation
• We developed 13 connector profiles
  – Based on literature, expert reviews, and our own development experience
• 30 distribution scenarios
• 24 score functions (white box) and Bayesian domain profiles with 100 conditional probabilities (black box)
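The slides do not spell out how the score functions or the Bayesian profiles are evaluated, so the following is only a rough sketch of the two styles under stated assumptions. In the white-box style, each score function maps one scenario attribute to per-connector scores, and the scores are summed. In the black-box style, a conditional probability updates the belief that a connector is appropriate via Bayes' rule. All function names, connector names, scenario attributes, thresholds, and numbers here are hypothetical.

```python
def score_based_rank(score_functions, scenario):
    """White box: sum the per-connector scores assigned by each function."""
    totals = {}
    for dimension, function in score_functions.items():
        for connector, score in function(scenario[dimension]).items():
            totals[connector] = totals.get(connector, 0.0) + score
    # Highest total first.
    return sorted(totals, key=totals.get, reverse=True)


def bayesian_posterior(prior, likelihood_if_appropriate, likelihood_if_not):
    """Black box: Bayes' rule for P(appropriate | observed scenario value)."""
    evidence = (prior * likelihood_if_appropriate
                + (1.0 - prior) * likelihood_if_not)
    return prior * likelihood_if_appropriate / evidence


# Hypothetical score functions over two made-up scenario attributes.
def volume_score(total_volume_gb):
    if total_volume_gb >= 100:
        return {"ConnectorA": 1.0, "ConnectorB": 0.2}
    return {"ConnectorA": 0.4, "ConnectorB": 0.9}


def users_score(num_users):
    if num_users > 1000:
        return {"ConnectorA": 0.3, "ConnectorB": 0.8}
    return {"ConnectorA": 0.6, "ConnectorB": 0.5}


scenario = {"total_volume_gb": 500, "num_users": 50}
ranking = score_based_rank(
    {"total_volume_gb": volume_score, "num_users": users_score}, scenario
)
posterior = bayesian_posterior(
    prior=0.5, likelihood_if_appropriate=0.8, likelihood_if_not=0.2
)
```

The white-box functions are readable and editable by hand, while the black-box approach only needs probabilities, which is one way to understand why the two are described as complementary.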
[Diagram: evaluation workflow with the elements Connector Profiles, Distribution Scenarios, DISCO, Score, Bayesian, Clustering, Answer Key, and Precision-Recall Analysis]
Precision-Recall Results
• Error Rate
  – Probability of incorrectly labeling a connector as appropriate for a scenario
• Precision
  – The fraction of selected connectors appropriate for a scenario
• Recall
  – Probability of detecting a connector as appropriate for a scenario
                      Bayesian   Score-based
True Positive (TP)    101        63
False Positive (FP)   25         200
True Negative (TN)    245        67
False Negative (FN)   19         60

                      Bayesian   Score-based
Error Rate            11.28%     32.56%
Precision             80.16%     48.46%
Recall                25.90%     16.15%
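The Bayesian column can be checked against its confusion-matrix counts. The slides do not state the formulas, so the ones below are assumptions: error rate as (FP + FN) / N and precision as TP / (TP + FP) reproduce the reported 11.28% and 80.16%, where N = 390 is the 13 connectors times 30 scenarios. The reported recall of 25.90% matches TP / N; the conventional TP / (TP + FN) would instead give about 84.17%. Both are computed here so the difference is visible.

```python
def metrics(tp, fp, tn, fn):
    """Derive summary metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "error_rate": (fp + fn) / total,   # share of wrong decisions
        "precision": tp / (tp + fp),       # selected connectors that fit
        "recall": tp / (tp + fn),          # conventional definition
        "tp_share": tp / total,            # TP over all decisions
    }


# Counts from the Bayesian column of the table above.
bayesian = metrics(tp=101, fp=25, tn=245, fn=19)
for name, value in bayesian.items():
    print(f"{name}: {value:.2%}")
```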
Related Work
Conclusions & Future Work
• Conclusions
  – Domain experts (gurus) rely on tacit knowledge and often cannot explain design rationale
  – DISCO provides a quantification of & framework for understanding an ad hoc process
  – Bayesian algorithm has a higher precision rate
• Future Work
  – Explore the tradeoffs between white-box and black-box approaches
  – Investigate the role of architectural mismatch in connectors for data system architectures
Thank You!
Questions?