celebrating diversity in volunteer computing david p. anderson space sciences lab u.c. berkeley...
![Page 1: Celebrating Diversity in Volunteer Computing David P. Anderson Space Sciences Lab U.C. Berkeley Sept. 1, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062317/5a4d1b8d7f8b9ab0599bfdd4/html5/thumbnails/1.jpg)
Celebrating Diversity in Volunteer Computing
David P. Anderson
Space Sciences Lab
U.C. Berkeley
Sept. 1, 2008
Background
Volunteer computing: distributed scientific computing using volunteered resources (desktops, laptops, game consoles, cell phones, etc.)
BOINC: middleware for volunteer (and desktop grid) computing
Diversity of resources
CPU type, number, speed
RAM, disk
Coprocessors
OS type and version
Network: performance, availability, proxies
System: availability, reliability (crashes, invalid results, cheating)
Diversity of applications
Resource requirements: CPU, coprocessors, RAM, storage, network
Completion time constraints
Numerical properties: same result on all CPUs, a little different, or unboundedly different
IBM World Community Grid
“Umbrella” project sponsored by IBM
Rice genome study: Univ. of Washington
Protein X-ray crystallography: Ontario Cancer Inst.
African climate study: Univ. of Cape Town
Dengue fever drug discovery: Univ. of Texas
Human protein folding: NYU, Univ. of Washington
HIV drug discovery: Scripps Institute
Started Nov. 2004
390,000 volunteers total
167,000 years of CPU time
Currently ~170 TeraFLOPS
CPU type
[Chart: Number of Computers (log scale, 1 to 100,000) by Processor Type: AMD Athlon, AMD Athlon 64, AMD Athlon FX, AMD Athlon MP, AMD Athlon X2, AMD Athlon XP, AMD Duron, AMD Geode, AMD K6, AMD K7, AMD Opteron, AMD Other, AMD Phenom, Intel Celeron, Intel Core 2, Intel Core 2 Duo, Intel Core 2 Quad, Intel Core Duo, Intel Other, Intel Pentium, Intel Pentium 4, Intel Pentium D, Intel Pentium II, Intel Pentium III, Intel Pentium M, Intel Xeon, Transmeta, CentaurHauls, IBM PowerPC]
# cores
[Chart: Number of Computers (log scale, 1 to 100,000) by number of Cores: 1, 2, 3, 4, 6, 8, 14, 16, 24, 32, 64]
OS type
[Chart: Number of Computers (log scale, 1 to 1,000,000) by Type of Operating System]
RAM
[Chart: Number of Computers (log scale) by RAM (in MB), from 512 MB to 16384 MB]
Free disk space
[Chart: Number of Computers (log scale) by Available Disk Space (GB), from 8 GB to 248 GB in 8 GB steps]
Availability
[Chart: Number of Computers (0 to 18,000) by Percent Available, 0% to 100% in 5% steps]
Job error rate
[Chart: Number of Computers (log scale, 1 to 100,000) by Percent Error, 0% to 100% in 5% steps]
Average turnaround time
[Chart: Number of Computers (0 to 5,000) by average turnaround time in Hours, from 0 to 519 hours]
Current WCG applications
Job dispatching
[Diagram: 1M jobs → scheduler → client]
Goals: maximize system throughput; minimize time to batch completion; minimize time to grant credit; scale to >100 requests/sec
BOINC scheduler architecture
[Diagram: Job queue (DB) → Feeder → Job cache (shared memory) → Scheduler ↔ client]
Issues: What if the cache fills up with unsendable jobs? What if a client needs a job that is not in the cache?
Homogeneous replication
Different platforms do FP math differently, which makes result validation difficult
Divide platforms into equivalence classes; send all instances of a job to a single class
“Census” program computes the distribution of hosts among classes
Scheduler: send committed jobs if possible
[Diagram: job cache divided into jobs committed to Win/Intel, Win/AMD, etc., and uncommitted jobs]
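The dispatch rule for homogeneous replication can be sketched as follows. This is a minimal illustration, not BOINC's code: it assumes an HR class is simply the OS name joined with the CPU vendor (matching the Win/Intel and Win/AMD examples), and the names `hr_class`, `Job`, and `try_send` are hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical HR class: an equivalence class of platforms assumed to
// produce identical floating-point results.
std::string hr_class(const std::string& os, const std::string& cpu_vendor) {
    assert(!os.empty() && !cpu_vendor.empty());
    return os + "/" + cpu_vendor;  // e.g. "Win/Intel"
}

struct Job {
    std::string committed_class;  // empty means uncommitted
};

// A job may go to a host only if it is uncommitted or already committed
// to the host's class; on first dispatch it becomes committed to it.
bool try_send(Job& job, const std::string& host_class) {
    if (job.committed_class.empty()) {
        job.committed_class = host_class;
        return true;
    }
    return job.committed_class == host_class;
}
```

Committing a job on its first dispatch is what forces every later replica into the same class, so the returned results can be compared directly.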
Retry acceleration
Retries are needed when: a job times out, an error (crash) occurs, or returned results fail to validate
Send retries to hosts that are fast (low turnaround) and reliable
Shorten the latency bound of retries
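A minimal sketch of the two retry rules, assuming a host counts as fast and reliable when its average turnaround and its error rate are both below fixed thresholds. The struct, the threshold values, and the 0.5 deadline factor are hypothetical, chosen only for illustration.

```cpp
#include <cassert>

struct HostStats {
    double avg_turnaround_hours;  // from the host's history
    double error_rate;            // fraction of jobs that erred or failed validation
};

// Hypothetical policy parameters; a project would tune these.
constexpr double kMaxTurnaroundHours = 24.0;
constexpr double kMaxErrorRate = 0.05;
constexpr double kRetryDeadlineFactor = 0.5;

// A host qualifies for retries only if it is both fast and reliable.
bool eligible_for_retry(const HostStats& h) {
    return h.avg_turnaround_hours <= kMaxTurnaroundHours
        && h.error_rate <= kMaxErrorRate;
}

// Retries get a shorter latency bound than the original job.
double retry_latency_bound(double original_bound_seconds) {
    assert(original_bound_seconds > 0);
    return original_bound_seconds * kRetryDeadlineFactor;
}
```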
Volunteer app selection
Volunteers can select apps, and can opt to accept jobs from non-selected apps
Fast feasibility checks (no DB)
Client sends: hardware spec, availability info, list of jobs queued and in progress
Resource checks
Completion time check: EDF simulation; are deadlines missed?
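The completion-time check can be illustrated with a simplified EDF simulation. This sketch assumes every job uses exactly one core, all jobs are available now, jobs run without preemption, and the host computes only a fraction `avail_frac` of wall-clock time; `SimJob` and `deadlines_met` are hypothetical names. The scheduler would add the candidate job to the client's queued and in-progress jobs and send it only if no deadline is missed.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

struct SimJob {
    double runtime;   // estimated CPU seconds on this host
    double deadline;  // seconds from now
};

// Returns true if every job meets its deadline when the jobs are run in
// earliest-deadline-first order on `ncores` cores.
bool deadlines_met(std::vector<SimJob> jobs, int ncores, double avail_frac) {
    assert(ncores > 0 && avail_frac > 0 && avail_frac <= 1);
    std::sort(jobs.begin(), jobs.end(),
              [](const SimJob& a, const SimJob& b) { return a.deadline < b.deadline; });
    // Min-heap of the times at which each core becomes free.
    std::priority_queue<double, std::vector<double>, std::greater<double>> free_at;
    for (int i = 0; i < ncores; ++i) free_at.push(0.0);
    for (const SimJob& j : jobs) {
        double start = free_at.top();
        free_at.pop();
        double finish = start + j.runtime / avail_frac;  // stretched by availability
        if (finish > j.deadline) return false;
        free_at.push(finish);
    }
    return true;
}
```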
Slow feasibility checks (DB)
Is the job still needed?
Has another replica of the job been sent to this volunteer?
Platform mechanism
Jobs are associated with apps, not versions
[Diagram: jobs belong to an Application, which has App versions for Win/x86, Win/x64, and Linux/x86]
Request message: platform 0: Win64, platform 1: Win32
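A minimal sketch of the version lookup, assuming the client lists its platforms in order of preference (platform 0 first) and that the platform names in the request match those on the app versions (the request above says Win64/Win32 while the versions say Win/x64/Win/x86; this sketch uses the latter throughout). `AppVersion` and `pick_version` are hypothetical names; `std::optional` requires C++17.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

struct AppVersion {
    std::string platform;  // e.g. "Win/x64"
    int version_num;
};

// Jobs belong to the application; the version is chosen per host by
// trying the client's platforms in the order listed in its request.
std::optional<AppVersion> pick_version(const std::vector<std::string>& client_platforms,
                                       const std::vector<AppVersion>& versions) {
    assert(!client_platforms.empty());
    for (const std::string& p : client_platforms) {
        for (const AppVersion& v : versions) {
            if (v.platform == p) return v;
        }
    }
    return std::nullopt;  // no version for any of the client's platforms
}
```

A 64-bit Windows client that also lists Win/x86 gets the x64 version when one exists and can still fall back to the 32-bit version otherwise.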
Host punishment
The problem: hosts that error out all jobs
Maintain M(h): max jobs per day for host h
On each error, decrement M(h)
On each valid job, double M(h)
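The quota update rule can be sketched directly. The floor of 1 job/day and the cap of 100 are hypothetical parameters added so that a host can always recover and the quota cannot grow without bound; the rule itself specifies only the decrement and the doubling.

```cpp
#include <algorithm>
#include <cassert>

// Per-host daily job quota M(h).
struct HostQuota {
    int max_jobs_per_day;  // M(h)
};

// Hypothetical bounds, not part of the stated rule.
constexpr int kMinQuota = 1;
constexpr int kMaxQuota = 100;

// On each error, decrement M(h), but never below the floor.
void on_error(HostQuota& q) {
    assert(q.max_jobs_per_day >= kMinQuota);
    q.max_jobs_per_day = std::max(kMinQuota, q.max_jobs_per_day - 1);
}

// On each valid result, double M(h), up to the cap.
void on_valid_result(HostQuota& q) {
    q.max_jobs_per_day = std::min(kMaxQuota, q.max_jobs_per_day * 2);
}
```

Decrementing slowly but doubling on success means a host that errors out every job is throttled to a trickle, while a host whose problem has been fixed returns to a full quota after a few valid results.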
Anonymous platform mechanism
Rather than downloading apps from the server, the client has preexisting local apps.
Scheduler: if the client has its own apps, only send it jobs for those apps.
Usage scenarios: computers with unsupported platforms; people who optimize apps; security-conscious people who want to inspect the source code
Old scheduling policy
Job cache scan: start from a random point; do fast feasibility checks; lock the job and do slow feasibility checks
Multiple scans: send jobs committed to an HR class; if the host is fast, send retries; send work for selected apps; if allowed, send work for non-selected apps
Problems: rigid policy; assumes each app uses 1 CPU
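A sketch of the fixed multi-pass scan, to show why the policy is rigid: the order of the passes is hard-coded. It assumes jobs committed to a different HR class have already been filtered out, omits the fast and slow feasibility checks, and takes the random starting point as a parameter; `CacheJob` and `old_policy_scan` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

struct CacheJob {
    bool hr_committed;  // committed to this host's HR class
    bool is_retry;
    bool app_selected;  // app chosen by the volunteer
    bool sent = false;
};

// Each pass walks the cache from `start` (random in practice) and sends
// jobs matching that pass's predicate, until `max_jobs` have been sent.
std::vector<std::size_t> old_policy_scan(std::vector<CacheJob>& cache, std::size_t start,
                                         bool fast_host, bool allow_other_apps,
                                         std::size_t max_jobs) {
    assert(!cache.empty() && start < cache.size());
    std::vector<std::function<bool(const CacheJob&)>> passes = {
        [](const CacheJob& j) { return j.hr_committed; },
        [fast_host](const CacheJob& j) { return fast_host && j.is_retry; },
        [](const CacheJob& j) { return j.app_selected; },
        [allow_other_apps](const CacheJob& j) { return allow_other_apps && !j.app_selected; },
    };
    std::vector<std::size_t> sent;
    for (const auto& pass : passes) {
        for (std::size_t k = 0; k < cache.size() && sent.size() < max_jobs; ++k) {
            std::size_t i = (start + k) % cache.size();
            CacheJob& j = cache[i];
            // Feasibility checks would go here (omitted in this sketch).
            if (!j.sent && pass(j)) {
                j.sent = true;
                sent.push_back(i);
            }
        }
    }
    return sent;
}
```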
Coprocessor and multi-thread apps
How to select the best version for a given host?
How to estimate performance on the host?
[Diagram: Win/x86 app versions: single-threaded, multi-threaded, CUDA]
Multithread/coprocessor (cont.)
How to decide which app version to use?
App versions have a “plan class” string
The scheduler has a project-supplied function:
bool app_plan(SCHEDULER_REQUEST &sreq, char* plan_class, HOST_USAGE&);
Returns: whether the host can run the app, coprocessor usage, CPU usage (possibly fractional), expected FLOPS, and the cmdline to pass to the app
Embodies knowledge about sublinear speedup, etc.
Scheduler: call app_plan() for each version, and use the one with the highest expected FLOPS
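The selection loop can be sketched as below. `SchedulerRequest` and `HostUsage` are simplified stand-ins whose fields are assumptions, not BOINC's actual `SCHEDULER_REQUEST` and `HOST_USAGE` definitions, and this `app_plan` takes the plan class as `const std::string&` rather than `char*`. The plan classes `mt` and `cuda`, the 80% parallel efficiency, the `--nthreads` command line, and the 0.1 CPU share of the CUDA version are all hypothetical values standing in for a project's own knowledge.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified, hypothetical stand-ins for the scheduler's request and usage types.
struct SchedulerRequest {
    int ncpus;
    double cpu_flops;   // per-core speed
    int ncudas;
    double cuda_flops;
};

struct HostUsage {
    double ncudas = 0;
    double avg_ncpus = 0;  // possibly fractional
    double flops = 0;      // expected FLOPS
    std::string cmdline;
};

// Hypothetical project-supplied app_plan(): can this host run a version of
// the given plan class, and with what resources and expected speed?
bool app_plan(const SchedulerRequest& req, const std::string& plan_class, HostUsage& hu) {
    assert(req.ncpus > 0);
    if (plan_class.empty()) {               // plain single-threaded version
        hu = {0, 1, req.cpu_flops, ""};
        return true;
    }
    if (plan_class == "mt") {               // multithreaded, sublinear speedup
        double speedup = 0.8 * req.ncpus;   // assumed 80% parallel efficiency
        hu = {0, double(req.ncpus), speedup * req.cpu_flops,
              "--nthreads " + std::to_string(req.ncpus)};
        return true;
    }
    if (plan_class == "cuda") {
        if (req.ncudas < 1) return false;
        hu = {1, 0.1, req.cuda_flops, ""};  // one GPU plus a small CPU share
        return true;
    }
    return false;
}

// Call app_plan() for each version and keep the highest expected FLOPS.
// Returns the index of the chosen plan class, or -1 if none is usable.
int best_version(const SchedulerRequest& req, const std::vector<std::string>& plan_classes) {
    int best = -1;
    double best_flops = 0;
    for (int i = 0; i < static_cast<int>(plan_classes.size()); ++i) {
        HostUsage hu;
        if (app_plan(req, plan_classes[i], hu) && hu.flops > best_flops) {
            best = i;
            best_flops = hu.flops;
        }
    }
    return best;
}
```

With these assumed numbers, a 4-core host without a GPU gets the multithreaded version, a single-core host gets the plain version (the 80% efficiency makes `mt` slower there), and a host with a fast GPU gets the CUDA version.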
Multithread/coprocessor (cont.)
Client coprocessor handling (currently just CUDA): hardware check/report; scheduling (coprocessors are not timesliced)
CPU scheduling: run enough apps to use at least N cores
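A minimal sketch of the CPU-scheduling rule, assuming runnable jobs are already in priority order and each reports its (possibly fractional) CPU usage; the function name is hypothetical, and the exclusive, non-timesliced allocation of coprocessors is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Run jobs in priority order until their combined CPU usage reaches at
// least ncores (a CUDA job might use only 0.1 CPU). Returns how many to run.
std::size_t jobs_to_run(const std::vector<double>& cpu_usage, int ncores) {
    assert(ncores > 0);
    double total = 0;
    std::size_t n = 0;
    while (n < cpu_usage.size() && total < ncores) {
        total += cpu_usage[n];
        ++n;
    }
    return n;
}
```

Because a CUDA job contributes only a fraction of a CPU, running it alongside N ordinary jobs keeps all N cores busy, which is why the rule says "at least N".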
Score-based scheduling
[Diagram: N jobs chosen at random → feasible jobs → rank by score → send the M highest-scoring jobs]
Terms in the score function
Bonus if: the host is fast and the job is a retry; the job is committed to an HR class; the app was selected by the volunteer
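Score-based selection can be sketched as below, assuming the candidates passed in are the feasible jobs from the random sample and that "committed to an HR class" means committed to this host's class. The bonus weights are hypothetical (the terms are named, not their values), as are `Candidate`, `score`, and `pick_top_m`; ties keep their original order because the sort is stable.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Candidate {
    int id;
    bool is_retry;
    bool hr_committed;  // committed to this host's HR class
    bool app_selected;  // volunteer selected this app
};

// Hypothetical bonus weights.
constexpr double kRetryBonus = 2.0, kHrBonus = 1.0, kSelectedBonus = 1.0;

double score(const Candidate& c, bool fast_host) {
    double s = 0;
    if (fast_host && c.is_retry) s += kRetryBonus;
    if (c.hr_committed) s += kHrBonus;
    if (c.app_selected) s += kSelectedBonus;
    return s;
}

// Rank the feasible jobs by score and return the ids of the M best.
std::vector<int> pick_top_m(std::vector<Candidate> feasible, bool fast_host, std::size_t m) {
    assert(m > 0);
    std::stable_sort(feasible.begin(), feasible.end(),
                     [fast_host](const Candidate& a, const Candidate& b) {
                         return score(a, fast_host) > score(b, fast_host);
                     });
    std::vector<int> ids;
    for (std::size_t i = 0; i < feasible.size() && i < m; ++i) ids.push_back(feasible[i].id);
    return ids;
}
```

Unlike the fixed pass order of the old policy, new preferences become additional weighted terms in `score`, which is what makes this approach easier to extend.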
Job size matching
Goal: send large jobs to fast hosts and small jobs to slow hosts, to reduce credit-granting delay and server occupancy time
Census program maintains host statistics
Feeder maintains job size statistics
Score penalty: |job - host|²
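One way to make |job - host|² meaningful is to put job size and host speed on a common scale. This sketch assumes each is converted to a z-score using a mean and standard deviation (host speed from the census program's statistics, job size from the feeder's); the normalization and the names `Stats`, `normalize`, and `size_penalty` are assumptions, not a stated part of the method.

```cpp
#include <cassert>

// Mean and standard deviation of a quantity (host speed or job size).
struct Stats {
    double mean;
    double stddev;
};

double normalize(double x, const Stats& s) {
    assert(s.stddev > 0);
    return (x - s.mean) / s.stddev;
}

// Score penalty |job - host|^2 on normalized values: small when a large
// job meets a fast host or a small job meets a slow one.
double size_penalty(double job_flops, const Stats& job_stats,
                    double host_flops, const Stats& host_stats) {
    double d = normalize(job_flops, job_stats) - normalize(host_flops, host_stats);
    return d * d;
}
```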
Adaptive replication
Goal: achieve a target level of reliability while reducing replication to 1+ε
Idea: replicate less (but always some) as a host becomes more trusted
Policy: maintain an “invalid rate” E(h) per host
If E(h) > X, replicate (e.g., 2-fold)
Otherwise, replicate with probability E(h)/X
Is there a counterstrategy?
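The replication decision can be sketched as below. It follows the stated policy (2-fold when E(h) > X, otherwise replicate with probability E(h)/X) and adds a hypothetical `min_prob` floor so that replication never drops to zero, in line with "replicate less (but always some)"; how E(h) itself is maintained is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <random>

// Number of instances to create for a job sent to a host with invalid
// rate E(h) = invalid_rate, given threshold X = threshold_x.
int replicas_for(double invalid_rate, double threshold_x, double min_prob,
                 std::mt19937& rng) {
    assert(invalid_rate >= 0 && threshold_x > 0 && min_prob >= 0 && min_prob <= 1);
    if (invalid_rate > threshold_x) return 2;  // untrusted host: always replicate
    // Trusted host: replicate with probability E(h)/X, floored at min_prob.
    std::bernoulli_distribution replicate(std::max(min_prob, invalid_rate / threshold_x));
    return replicate(rng) ? 2 : 1;
}
```

For a trusted host the expected number of instances is 1 + max(min_prob, E(h)/X), which falls toward 1 + min_prob as E(h) shrinks; this is the 1+ε of the stated goal.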
Server simulation
How do we know these policies are any good? How can we study alternatives?
In situ study is difficult
SIMBA emulator (U. of Delaware):
[Diagram: SIMBA (emulates N clients) ↔ BOINC server (not emulated)]
Upcoming scheduler changes
Problems: the scheduler uses only 1 app version; the completion-time simulation is antiquated (it doesn’t reflect multithread, coprocessor, or RAM limitations)
New concept: resource signature (#CPUs, #coprocessors, RAM)
Do simulation based on “greedy EDF scheduling” using the resource signature
Select an app version that can use the available resources
Conclusion
Volunteer computing has diverse resources and workloads
BOINC has mechanisms that deal effectively and efficiently with this diversity
Lots of fun research problems here!