Modeling and Adaptive Scheduling of Large-Scale Wide-Area Data Transfers
Raj Kettimuthu
Advisors: Gagan Agrawal, P. Sadayappan
Exploding data volumes
[Figure: data-volume growth across science domains. Astronomy sky surveys – MACHO et al.: 1 TB; Palomar: 3 TB; 2MASS: 10 TB; GALEX: 30 TB; Sloan: 40 TB; Pan-STARRS: 40,000 TB; up to 100,000 TB – a 10^5 increase in data volumes in 6 years. Climate: 36 TB in 2004 to 3,300 TB in 2014. Similar growth in genomics.]
Data movement
[Diagram: two sites, each with storage behind a Data Transfer Node; data moves between the Data Transfer Nodes over the wide-area network]
Current work
Understand characteristics, control and optimize transfers; efficient scheduling of wide-area transfers
Model – predict and control throughput
– Characterize, identify key features
– Data-driven modeling using experimental data
Adaptive scheduling
– Algorithm to minimize slowdown
– Experimental evaluation using real transfer logs
GridFTP
High-performance, secure data transfer protocol optimized for high-bandwidth wide-area networks
– Parallel TCP streams; PKI security for authentication, integrity, and encryption; checkpointing for transfer restarts
Based on the FTP protocol – defines extensions for high-performance operation and security
Globus implementation of GridFTP is widely used; Globus GridFTP servers support usage statistics collection
– Transfer type, size in bytes, start time of the transfer, transfer duration, etc. are collected for each transfer
GridFTP usage log
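A usage-log record of the kind described above can be parsed to recover per-transfer throughput. A minimal sketch, assuming a hypothetical comma-separated layout – the field names and format here are illustrative, not the actual Globus log schema:

```python
from datetime import datetime

def parse_record(line):
    """Parse one hypothetical usage-log record:
    type,bytes,start_time,duration_seconds"""
    ttype, nbytes, start, dur = line.strip().split(",")
    return {
        "type": ttype,
        "bytes": int(nbytes),
        "start": datetime.fromisoformat(start),
        "duration_s": float(dur),
        # Throughput in Gbps: bytes -> bits, divided by seconds
        "throughput_gbps": int(nbytes) * 8 / float(dur) / 1e9,
    }

rec = parse_record("STOR,125000000000,2014-06-01T12:00:00,100.0")
print(rec["throughput_gbps"])  # 125 GB in 100 s -> 10.0 Gbps
```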
Parallelism vs concurrency in GridFTP
[Diagram: Data Transfer Nodes at Site A and Site B, each backed by a parallel file system. A GridFTP client opens control channels (port 2811) to the GridFTP daemons; each daemon forks one GridFTP server process per concurrent transfer. Concurrency = 2: two pairs of GridFTP server processes, one pair per file in flight. Parallelism = 3: three TCP connections between each server pair.]
Parallelism vs concurrency
Objective – control bandwidth allocation for transfer(s) from a source to the destination(s)
Most large transfers are between supercomputers
– Ability to both store and process large amounts of data
When a site is heavily loaded, most bandwidth is consumed by a small number of sites
Goal – develop a simple model for GridFTP
– Source concurrency (SC) – total number of ongoing transfers between endpoint A and all its major transfer endpoints
– Destination concurrency (DC) – total number of ongoing transfers between endpoint A and endpoint B
– External load – all other activities on the endpoints, including transfers to other sites
Model throughput and control bandwidth allocation
Modeling throughput
Model destination throughput (DT) using source and destination concurrency (SC, DC)
Linear models:
– Y' = a1*X1 + a2*X2 + … + ak*Xk + b
– DT = a1*DC + a2*SC + b1
– DT = a3*(DC/SC) + b2
Data to train and validate models – load variation experiments
– Errors > 15% for most cases
Log model:
– log(DT) = a4*log(SC) + a5*log(DC) + b3
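The log model above is linear in log space, so its coefficients can be fit with ordinary least squares. A minimal sketch on synthetic, noiseless data – the coefficient values and data-generation scheme here are made up for illustration:

```python
import math, random

# Synthetic training data generated from a known log model
# log(DT) = a4*log(SC) + a5*log(DC) + b3, with a4=-0.3, a5=0.5, b3=2.0
random.seed(0)
rows = []
for _ in range(200):
    sc, dc = random.randint(1, 20), random.randint(1, 10)
    log_dt = -0.3 * math.log(sc) + 0.5 * math.log(dc) + 2.0
    rows.append((math.log(sc), math.log(dc), log_dt))

def ols(rows):
    """Ordinary least squares via normal equations; the design
    matrix rows are [log SC, log DC, 1] (3 unknowns: a4, a5, b3)."""
    xtx = [[0.0] * 3 for _ in range(3)]
    xty = [0.0] * 3
    for lsc, ldc, y in rows:
        x = (lsc, ldc, 1.0)
        for i in range(3):
            xty[i] += x[i] * y
            for j in range(3):
                xtx[i][j] += x[i] * x[j]
    # Solve the 3x3 system by Gauss-Jordan elimination with pivoting
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(3):
            if r != col:
                f = xtx[r][col] / xtx[col][col]
                xty[r] -= f * xty[col]
                for c in range(3):
                    xtx[r][c] -= f * xtx[col][c]
    return [xty[i] / xtx[i][i] for i in range(3)]

a4, a5, b3 = ols(rows)
print(round(a4, 2), round(a5, 2), round(b3, 2))  # recovers -0.3 0.5 2.0
```

With real transfer logs the fit would of course not be exact; the slide's >15% errors motivate the external-load term introduced next.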
Modeling throughput
Log model better than linear models, but errors still high
A model based on just SC and DC is too simplistic – incorporate external load
– External load – network, disk, and CPU activities outside transfers
– How to measure the external load?
– How to include external load in the model(s)?
External load
Multiple training data points – same SC, DC at different days & times
EL captures throughput differences for the same SC, DC
Three different functions for external load (EL):
– EL1 = T − AT, where T is the throughput of transfer t and AT is the average throughput of all transfers with the same SC, DC as t
– EL2 = T − MT, where MT is the max throughput with the same SC, DC as t
– EL3 = T / MT
AEL{a11} = EL^a11 if EL > 0, |EL|^(−a11) otherwise
Models with external load:
– Linear: DT = a6*DC + a7*SC + a8*EL + b4
– Log: DT = SC^a9 * DC^a10 * AEL{a11} * 2^b5
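The three external-load functions (EL1, EL2, EL3) and the AEL transform defined above can be sketched directly. The throughput history and the exponent a11 below are illustrative values, not fitted ones:

```python
def external_load(t, history):
    """EL1, EL2, EL3 for a transfer with throughput t, given the
    throughputs of past transfers with the same SC, DC."""
    at = sum(history) / len(history)   # AT: average throughput
    mt = max(history)                  # MT: max throughput
    return t - at, t - mt, t / mt      # EL1, EL2, EL3

def ael(el, a11):
    """AEL{a11} = EL^a11 if EL > 0, else |EL|^(-a11)."""
    return el ** a11 if el > 0 else abs(el) ** (-a11)

hist = [4.0, 5.0, 6.0]                 # Gbps, same SC/DC as transfer t
el1, el2, el3 = external_load(5.5, hist)
print(el1, el2)                        # 0.5 -0.5
print(ael(el1, 2.0), ael(el2, 2.0))    # 0.25 4.0
```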
Models with external load
DT = a6*DC + a7*SC + a8*EL + b4
– DT: predict; DC, SC: controllable; EL: uncontrollable
Unlike SC and DC, external load is uncontrollable
Train models – multiple data points with the same SC, DC
In practice, some recent transfers are available, but all combinations of SC, DC are unlikely
Calculating external load in practice
DT = a6*DC + a7*SC + a8*EL + b4
– DC, SC known; EL must be computed
Three methods to compute EL from observed transfers:
– Previous Transfer Method (PT)
– Recent Transfers Method (RT) – transfers in the past 30 minutes
– Recent Transfers with Error Correction (RTEC) – adds an error term estimated from historic transfers: DT = a6*DC + a7*SC + a8*EL + b4 + e
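A sketch of how EL might be estimated from recent observations and fed into the linear model. The 30-minute window follows the slide; the averaging, the coefficient values, and the error term's handling are assumptions, not the paper's exact procedure:

```python
def estimate_el(recent, now, window_s=1800):
    """Recent Transfers Method: average the EL values observed for
    transfers within the last window_s seconds (30 minutes).
    `recent` is a list of (finish_time, el) pairs."""
    vals = [el for t, el in recent if now - t <= window_s]
    return sum(vals) / len(vals) if vals else 0.0

def predict_dt(dc, sc, el, coeffs, err=0.0):
    """Linear model DT = a6*DC + a7*SC + a8*EL + b4 (+ e for RTEC)."""
    a6, a7, a8, b4 = coeffs
    return a6 * dc + a7 * sc + a8 * el + b4 + err

coeffs = (0.8, -0.2, 0.5, 1.0)          # illustrative coefficients
recent = [(100, 0.4), (900, 0.6)]       # (finish_time_s, observed EL)
el = estimate_el(recent, now=1500)      # both fall inside the window
print(el)                               # 0.5
print(predict_dt(3, 4, el, coeffs))     # ~2.85
```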
Applying models to control bandwidth
Find DC, SC to achieve a target throughput
Limit DC to 20 to narrow the search space
– Even then, a large number of possible DC combinations (20^n for n destinations)
– SCmax (max source concurrency allowed) is the number of possible values for SC
– Heuristics limit the search space to SCmax * #destinations
Forward use: predict DT given DC, SC; EL known (computed with PT, RT, or RTEC): DT = a6*DC + a7*SC + a8*EL + b4
Inverse use: given a target DT, compute DC, SC; EL known (computed with PT, RT, or RTEC)
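The inverse use amounts to a search over concurrency values. A brute-force sketch for two destinations, where the source concurrency SC is the sum of the per-destination concurrencies; the model coefficients, targets, and DC cap here are illustrative, and a real implementation would use the heuristics mentioned above rather than exhaustive enumeration:

```python
from itertools import product

def pick_concurrencies(targets, el, coeffs, dc_max=5):
    """Brute-force over per-destination concurrencies (DC_i); the
    source concurrency SC is their sum. Return the assignment whose
    predicted per-destination throughputs are closest (L1 error)
    to the targets, using DT = a6*DC + a7*SC + a8*EL + b4."""
    a6, a7, a8, b4 = coeffs
    best, best_err = None, float("inf")
    for dcs in product(range(1, dc_max + 1), repeat=len(targets)):
        sc = sum(dcs)  # total ongoing transfers from the source
        err = sum(abs(a6 * dc + a7 * sc + a8 * el + b4 - t)
                  for dc, t in zip(dcs, targets))
        if err < best_err:
            best, best_err = dcs, err
    return best

coeffs = (0.5, -0.05, 0.2, 1.0)  # illustrative coefficients
dcs = pick_concurrencies([2.0, 3.0], el=0.0, coeffs=coeffs)
print(dcs)  # (3, 5)
```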
Experimental setup
[Map: testbed sites – TACC, NCAR, SDSC, Indiana, NICS, PSC]
Experiments
Ratio experiments – allocate available bandwidth at the source to destinations using a predefined ratio
– Available bandwidth at Stampede is 9 Gbps; ratio 2:1:2:3:3 for Kraken, Mason, Blacklight, Gordon, Yellowstone
– Kraken = 2*9 Gbps/(2+1+2+3+3) = 2 Gbps; Mason = 1 Gbps, Blacklight = 2 Gbps, Gordon = 3 Gbps, Yellowstone = 3 Gbps
– Variant: Kraken = 3 Gbps; Mason = X1 Gbps, Blacklight = X2 Gbps, Gordon = X3 Gbps, Yellowstone = X4 Gbps
Factoring experiments – increase a destination's throughput by a factor when the source is saturated
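Allocating a source's available bandwidth across destinations by a predefined ratio is a proportional split. A minimal sketch using the destination names and ratio from the slides; note the shares are computed strictly proportionally to the weight sum, so they may differ from the rounded per-destination figures quoted above:

```python
def allocate(total_gbps, ratios):
    """Split total bandwidth proportionally to the given weights."""
    s = sum(ratios.values())
    return {dst: total_gbps * w / s for dst, w in ratios.items()}

shares = allocate(9.0, {"Kraken": 2, "Mason": 1, "Blacklight": 2,
                        "Gordon": 3, "Yellowstone": 3})
for dst, gbps in shares.items():
    print(dst, round(gbps, 2))
```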
Results – Ratio experiments
Ratios are 4:5:6:8:9 for Kraken, Mason, Blacklight, Gordon, and Yellowstone. Concurrencies picked by Algorithm were {1,3,3,1,1}. Model: log with EL1. Method: RTEC
Ratios are 4:5:6:8:9 for Kraken, Mason, Blacklight, Gordon, and Yellowstone. Concurrencies picked by Algorithm were {1,4,3,1,1}. Model: log with EL3. Method: RT
Results – Factoring experiments
Increasing Gordon's baseline throughput by 2x. Concurrency picked by the algorithm for Gordon was 5
Increasing Yellowstone's baseline throughput by 1.5x. Concurrency picked by the algorithm for Yellowstone was 3
Adaptive scheduling of data transfers
[Diagram: storage and Data Transfer Nodes at each site, with wide-area transfers between them]
Bursty transfers present an opportunity for adaptive scheduling
Goals – optimize throughput, improve response times
Challenge – adaptive concurrency
– Low load – increase CC (unsaturated destinations) to maximize utilization
– New requests queue, or adjust the concurrency of ongoing transfers
Is data transfer scheduling analogous to parallel job scheduling?
– Data transfers ≅ compute jobs, wide-area bandwidth ≅ compute resources, transfer concurrency ≅ job parallelism
– But CPU, storage, and network differ at source and destination, and the wide-area network is shared
Scheduling wide-area data transfers is challenging
– Heterogeneous resources, shared network, dynamic nature of load
– Scheduling decisions not based on resource availability at one site
Metrics
Turnaround time – time a job spends in the system: completion time − arrival time
Job slowdown – factor by which a job is slowed relative to an unloaded system: turnaround time / processing time
Bounded slowdown in parallel job scheduling
Bounded slowdown for wide-area transfers
Job priority for wide-area transfers
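The standard bounded-slowdown metric from parallel job scheduling caps the denominator at a threshold so very short jobs do not dominate the average. A sketch of the three metrics; the threshold value is illustrative, and the slide's exact wide-area variants are not reproduced here:

```python
def turnaround(arrival, completion):
    """Time a job spends in the system."""
    return completion - arrival

def slowdown(arrival, completion, processing):
    """Factor by which the job was slowed vs. an unloaded system."""
    return turnaround(arrival, completion) / processing

def bounded_slowdown(arrival, completion, processing, threshold=10.0):
    """Like slowdown, but short jobs are measured against a
    threshold processing time so tiny transfers do not dominate."""
    return max(1.0,
               turnaround(arrival, completion) / max(processing, threshold))

print(slowdown(0, 30, 10))            # 3.0
print(bounded_slowdown(0, 30, 2))     # 30 / max(2, 10) = 3.0
print(bounded_slowdown(0, 30, 2, 1))  # 30 / 2 = 15.0
```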
Scheduling algorithm
Maximize resource utilization and reduce slowdown
– Adaptively queue and adjust concurrency based on load
Preemption/restart
– Only state required is missing-block information; no migration
– Still overhead (authentication, checkpoint restart); p-factor limits preemption
Four key decision points:
– Upon task arrival – schedule or queue?
– If scheduled, what concurrency value?
– When to preempt (and schedule a waiting job)?
– When to change the concurrency of a running job?
Use both models and recent observed behavior
– Models to predict throughput and determine concurrency value
– 5-second averages of observed throughput to determine saturation
Illustrative example
Average turnaround time for the adaptive algorithm is 10.92; for the baseline it is 12.04
Workload traces
Traces from actual executions
– Anonymized GridFTP usage statistics
Busiest day from a 1-month period; busiest server log on that day
Limited log length due to production environment
Three 15-minute logs – 25%, 45%, and 60% load traces
– "load" = total bytes transferred / max that could be transferred
Destinations anonymized in logs
– Weighted random split based on capacities
Experimental results – turnaround 60% load
Experimental results – worst case 60% load
Experimental results – 60% load improved baseline
Related work
Several models for predicting behavior and finding the optimal number of parallel TCP streams
– Uncongested networks, simulations
Many studies on bandwidth allocation at routers
– Our focus is application-level control
Adaptive replica selection; algorithms to utilize multiple paths
– Assume the ability to control the network path; overlay networks
Workflow schedulers – dependencies between computation and data movement
Adaptive file transfer scheduling with preemption in production environments has not been studied
Summary of current work
Models for wide-area data transfer throughput in terms of a few key parameters
– Log models that combine total source CC, destination CC, and a measure of external load are effective
– Methods that utilize both recent and historical experimental data are better at estimating external load
Adaptive scheduling algorithm to improve the overall user experience
– Evaluated using real traces on a production system
– Significant improvements over the current state of the art
Proposed work
File transfers have different time constraints
– Near real time to highly flexible
Objective – account for time requirements to improve overall user experience
Consider two job types – batch and interactive
– First, exploit relaxed deadlines of batch jobs
– Next, exploit knowledge about future arrival times
– Finally, maximize utility value for jobs (each job has a utility function)
Batch jobs
If the deadline is close, batch jobs get highest priority
– Scheduled with a concurrency of 2, no preemption
Otherwise, batch jobs get lowest priority
Interactive jobs measured by turnaround and slowdown; batch jobs measured by deadline satisfaction rate
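The two-level batch policy above can be sketched as a priority rule. The closeness test and its safety margin are assumptions – the slides do not define how "close" a deadline must be:

```python
HIGHEST, LOWEST = 0, 100  # smaller number = higher priority

def batch_priority(now, deadline, est_transfer_s, margin=1.5):
    """If the deadline is near (estimated transfer time, padded by an
    assumed safety margin, would risk missing it), promote the batch
    job to highest priority with concurrency 2 and no preemption;
    otherwise it runs at lowest priority and may be preempted."""
    if deadline - now <= est_transfer_s * margin:
        return HIGHEST, {"concurrency": 2, "preemptable": False}
    return LOWEST, {"preemptable": True}

print(batch_priority(now=0, deadline=100, est_transfer_s=80))
print(batch_priority(now=0, deadline=1000, est_transfer_s=80))
```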
Knowledge about future jobs
[Gantt charts comparing Schedule A – no knowledge of future jobs – with Schedule B – with knowledge of future jobs: transfers T1 (to d2), T2 (to d1), and T3 (to d2) placed on a 0–5 second timeline, with a wait queue]
Example parameters: T1 – 1 GB, T2 – 1 GB, T3 – 0.5 GB; source – 1 GB/s; destination d1 – 1 GB/s; destination d2 – 0.5 GB/s
[Charts: throughput in GB/s vs. time in seconds for each schedule]
Schedule A: average slowdown is (1.5+1+2)/3 = 1.5
Schedule B: average slowdown is (1+2+1)/3 ≈ 1.33
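The slowdown arithmetic above can be checked directly; the per-transfer slowdowns are taken from the example, and only the averaging is computed here:

```python
# Per-transfer slowdowns read off the example's two schedules
schedule_a = [1.5, 1.0, 2.0]  # no knowledge of future jobs
schedule_b = [1.0, 2.0, 1.0]  # with knowledge of future jobs

avg_a = sum(schedule_a) / len(schedule_a)
avg_b = sum(schedule_b) / len(schedule_b)
print(round(avg_a, 2), round(avg_b, 2))  # 1.5 1.33
```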
Utility based scheduling
Both interactive and batch jobs have deadlines, each with an associated utility function
– Captures the impact of missing the deadline
– Decay – linear, exponential, step, or a combination
Each transfer request R is defined by a tuple R = (d, A, S, D, U)
– d = destination
– A = arrival time of R
– S = size of the file to be transferred
– D = deadline of R
– U = utility function of R
Objective – maximize aggregate utility value of jobs
Utility based scheduling
Inverse of instantaneous utility value used as priority
Instantaneous utility value calculated as follows
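The decay shapes named above can be sketched as utility functions of how late a job is past its deadline. The specific decay rates, the zero-floor for linear decay, and the inverse-utility priority convention below are illustrative assumptions:

```python
import math

def linear_utility(u0, t, deadline, rate=0.1):
    """Full utility before the deadline, then linear decay to zero."""
    late = max(0.0, t - deadline)
    return max(0.0, u0 - rate * late)

def exponential_utility(u0, t, deadline, rate=0.1):
    """Full utility before the deadline, then exponential decay."""
    late = max(0.0, t - deadline)
    return u0 * math.exp(-rate * late)

def step_utility(u0, t, deadline):
    """All-or-nothing: utility drops to zero once the deadline passes."""
    return u0 if t <= deadline else 0.0

def priority(utility):
    """Inverse of instantaneous utility: the lower a job's remaining
    utility, the larger this value (infinite once utility hits zero)."""
    return 1.0 / utility if utility > 0 else float("inf")

print(linear_utility(10.0, t=120, deadline=100))  # 8.0
print(step_utility(10.0, t=120, deadline=100))    # 0.0
print(priority(8.0))                              # 0.125
```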
Questions