Knowledge is Power
Remzi Arpaci-Dusseau
University of Wisconsin, Madison
Systems Without Knowledge
System designers often have limited knowledge
• About the applications they run
• About the other systems they interact with

Result: The "curse of generality"
• Missed performance optimizations
• Limited functionality
• Costly, too
Didacticism and Systems
How to gain knowledge?
• Depends on environment

Sometimes it's easy
• A scientific application w/ cooperative developers

Sometimes it's not
• Internals of a Microsoft file system
What We Do
Build systems that acquire and exploit knowledge
• "Gray box" techniques
• Make assumptions, probe + measure, learn something about how something works
• Use knowledge to control systems in unexpected ways

Result
• Increase functionality, improve performance, increase robustness and manageability too
Outline
Overview

Knowledge and its applications
• Gray-box file placement
• Semantically-smart disks
• Scientific apps, the Grid, and I/O
Conclusions
The People
Gray-box file placement
• With James Nugent, Andrea Arpaci-Dusseau

Semantically-smart disks
• With Muthian Sivathanu, Vijayan Prabhakaran, Florentina Popovici, Tim Denehy, Andrea Arpaci-Dusseau

Scientific apps, the Grid, and I/O
• With John Bent, Doug Thain, Andrea Arpaci-Dusseau, Miron Livny
Gray-box Control over File Placement
Controlled File Placement
Typical "Unix" file system: Little control over layout
• Just a simple API of open(), read(), write(), close()

Some applications want more control
• e.g., a web server that knows which files are often accessed together

Usual default: Use the raw disk
• Harder to manage, doesn't integrate w/ other apps
What Might Be Better
Use the normal file system
• Convenience

Expose control over layout to applications
• Control

Do the above without changing the file system
• Can't always change the system you're using
PLACE
A gray-box "Information and Control Layer" (ICL)
• It's just a library

Simple API for file placement
• Exposes "FFS-like" groups
• Place_Creat(file, mode, groupNumber);

No changes to underlying file system
[Figure: processes (P) link against the PLACE library, which sits above the unmodified file system]
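A minimal usage sketch of the API follows. Place_Creat(file, mode, groupNumber) is the call named above; the header name "place.h", the return convention (an open file descriptor), and the group number are assumptions for illustration.

    /* Sketch: a web server co-locating files it knows are accessed
     * together, by creating them in the same FFS-like group.
     * "place.h" and the return convention are assumed, not specified
     * in the talk. */
    #include <unistd.h>
    #include "place.h"            /* hypothetical PLACE library header */

    int main(void)
    {
        int group = 7;            /* illustrative group number */

        /* Both files land in one group, so their inodes and data
         * end up in nearby on-disk locations. */
        int fd1 = Place_Creat("/www/index.html", 0644, group);
        int fd2 = Place_Creat("/www/logo.png",   0644, group);
        if (fd1 < 0 || fd2 < 0)
            return 1;
        close(fd1);
        close(fd2);
        return 0;
    }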
PLACE Outline
Basic operation
• Gray-box knowledge
• Key techniques

Assessment
• Accuracy
• Performance
Conclusions
Allocation Knowledge
Gray-box assumption: "FFS"-like allocation
• Splits disk into numerous consecutive "groups"
• Spreads directories across groups
• Puts files (inodes/data) that are within the same directory into the same "group"

Many variants
• Our focus: ext2 (but with other variants in mind)
Exploiting Knowledge for Control
Key structure: Shadow Directory Tree (SDT)

To create a file /foo/bar in group 1:
• Create file /.H/1/bar
• Rename /.H/1/bar to /foo/bar
[Figure: shadow directory tree: the root contains .h/ with per-group subdirectories 1/, 2/, …, n/; bar is created under the group's shadow directory and then renamed into foo/]
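A minimal sketch of this create-then-rename mechanism, assuming it is built on the standard creat() and rename() calls; buffer sizes and error handling are illustrative:

    /* Sketch of the create-then-rename trick: create the file inside
     * the shadow directory of the target group, so the FS allocates
     * its inode and data there, then rename() it to its real path.
     * rename() moves only the name; the on-disk placement stays. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int place_creat_sketch(const char *dir, const char *name,
                           mode_t mode, int group)
    {
        char shadow[4096], target[4096];

        snprintf(shadow, sizeof shadow, "/.H/%d/%s", group, name);
        snprintf(target, sizeof target, "%s/%s", dir, name);

        int fd = creat(shadow, mode);     /* allocated in group's dir */
        if (fd < 0)
            return -1;
        if (rename(shadow, target) < 0) { /* move name, not data */
            close(fd);
            unlink(shadow);
            return -1;
        }
        return fd;
    }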
Challenge: Building the SDT
How to ensure that the shadow directory for each group K is in the right on-disk location?

Basic approach to creating a directory in a group (sketched below):
• Repeat:
• Mkdir(tmp);
• If (tmp is in the desired group): Break;
• Bias();

Point of portability: the Bias() routine
• Must account for different allocation algorithms
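A sketch of that loop, with in_group() and bias() standing in for the FS-specific probing and biasing the slide leaves abstract (both are assumed helpers, not part of the talk's API):

    /* Sketch of building the shadow directory for one group: make a
     * temporary directory, probe whether the FS placed it in the
     * desired group, and if not, bias the allocator and retry. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int in_group(const char *path, int group);  /* probe + measure   */
    void bias(int group);                       /* FS-specific nudge */

    int make_shadow_dir(int group)
    {
        char tmp[64], dst[64];

        for (;;) {
            snprintf(tmp, sizeof tmp, "/.H/tmp%d", rand());
            if (mkdir(tmp, 0755) < 0)
                return -1;
            if (in_group(tmp, group)) {
                snprintf(dst, sizeof dst, "/.H/%d", group);
                return rename(tmp, dst);  /* shadow dir in place */
            }
            rmdir(tmp);     /* wrong group: discard and retry     */
            bias(group);    /* shift the allocator's next choice  */
        }
    }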
Some Complications
Controlled directory placement
• Similar to system initialization (hence, slow)
• To speed up, use a shadow cache of directories

Crash recovery
• A crash may leave junk in the SDT
• A periodic sweep by an SDT cleaner fixes this

Level of control depends on the underlying FS
• e.g., FFS vs. ext2 behavior for large files
Assessment
Does it work?
• Non-place: 250 files in 1 directory
• Non-place: 250 files in 10 directories
• Non-place: 250 files in 100 directories
• PLACE: 250 files in 100 directories into 1 group
Performance (Small Files)
Performance of 250 200-KB file reads (random)
Performance (Big Files)
Each point: Bandwidth attained reading 100-MB file
PLACE Conclusions
PLACE: Gray-box approach to file layout
• Simple and effective control over placement

Main technique: Shadow Directory Tree
• Used to control placement
• Construction and maintenance are key

Controlled layout can improve performance
• Micro-benchmarks
• Web server and I/O parameterization (see USENIX '03)
Outline
Overview

Knowledge and its applications
• Gray-box file placement
• Semantically-smart disks
• Scientific apps, the Grid, and I/O
Conclusions
Semantically-smart Disk Systems
Semantically-Smart Disk System (SDS)
Disk system that understands the file system
• Data structures
• Operations

Operates underneath an unmodified FS
• Must discover layout + on-disk structures
• Must "reverse engineer" the block stream

Exploits knowledge and "smarts" to implement a new class of services
[Figure: file system above; the SDS below the block interface, with its own CPU and cache ($)]
SDS Outline
Semantic Knowledge: Acquisition
• Off-line
• On-line

Semantic Knowledge: Exploitation
• Case studies
Conclusions
Static Knowledge: File System Layout
Challenge: How to discover layout information?
• White-box approach: Embed knowledge in the SDS
• Trend: FS layout does not change frequently
[Figure: on-disk layout, per group: Superblock, I-Bitmap, D-Bitmap, Inodes, Data (Group 1); I-Bitmap, D-Bitmap, Inodes, Data (Group 2)]
Layout Discovery with EOF

EOF: Extraction Of File-systems
• Tool to automatically determine layout
• Uses gray-box techniques

Basic operation
• Start with a "soft" model of the file system
• Probe process (P): Initiates traffic
• SDS: Monitors activity from the FS

Two distinct tasks:
• Classifying blocks by type
• Identifying fields within an inode

Result: "Hardened" model of file system structures + fields
[Figure: probe process P drives the file system while the SDS observes the block traffic below]
EOF: More Details
Multi-phase procedure:
• Bootstrap: Summary blocks
• Data/data bitmaps
• Inodes/inode bitmaps
• Inode fields, directory entries

Key techniques
• Known patterns: Data blocks
• Isolation: Know all but one block; that one block must be…
• Assertions: Check assumptions at each step
EOF: Simplified Example
Create a file: Touches many data structures
• Directory data, directory inode, file data (known pattern), file inode, data bitmap, inode bitmap

Reset to the beginning of the file, write the block again
• File data (known pattern), file inode
• Now, can classify the inode block (isolation)
• Assertion: only two blocks observed
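A sketch of the probe side of this example; the pattern byte, block size, and path are illustrative assumptions:

    /* Sketch of an EOF probe: write file data filled with a known
     * pattern so the SDS can classify blocks carrying the pattern as
     * data blocks; then rewrite the same block so that only file
     * data and the file's inode are touched, isolating the inode. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK 4096

    int probe(const char *path)
    {
        char buf[BLOCK];
        memset(buf, 0xAB, sizeof buf);  /* known pattern */

        int fd = open(path, O_CREAT | O_WRONLY, 0644);
        if (fd < 0)
            return -1;

        write(fd, buf, sizeof buf);   /* touches dir data, dir inode,
                                         file data, file inode, and
                                         both bitmaps */
        fsync(fd);

        lseek(fd, 0, SEEK_SET);       /* rewrite the same block:      */
        write(fd, buf, sizeof buf);   /* only file data + file inode, */
        fsync(fd);                    /* so the inode block is        */
                                      /* isolated                     */
        close(fd);
        return 0;
    }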
EOF: Overhead and Summary
Performance: A few minutes per GB
• Probably OK; only done "once" per new file system
• Scales well with faster disks (sequential bandwidth)
Limitations: “FFS”-like file systems (ext2/3, BSD FFS)
Have Knowledge, Will Innovate
Knowing structures is not enough (sometimes)
• Data block overloading (data, pointer, directory)
• High-level operations not known (create, delete)

Requires new on-line techniques
• Direct classification
• Indirect classification
• Block association
• Operation inferencing
A Simple Example: Smarter Caching
Modern RAID may have a significant cache
• Volatile (DRAM)
• Non-volatile (NVRAM)

How to exploit semantic information to cache more intelligently?
[Figure: file system above an SDS with a cache ($)]
Storing Meta-Data in NVRAM
Start with simple meta-data: inodes, bitmaps, etc.
• Good for meta-data-intensive workloads
[Figure: per-group on-disk layout (Superblock, I-Bitmap, D-Bitmap, Inodes, Data); the meta-data blocks are held in the NVRAM cache]
Direct Classification
Given an address, determine its type directly

Direct classification via bounds check
• Given a disk address, can check bounds to determine type (superblock, bitmaps, inodes, general data block)
[Figure: the same per-group layout; a block address is classified by which region's bounds it falls within]
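A sketch of the bounds check, assuming a simplified ext2-like group layout. The per-group offsets here stand in for values EOF would have extracted; real ext2 also has group descriptors, which this sketch omits.

    /* Sketch of direct classification: within an FFS-like layout,
     * a block's offset inside its group determines its type.
     * The offsets below are simplified assumptions. */
    enum btype { SUPER, IBITMAP, DBITMAP, INODE, DATA };

    enum btype classify(long block, long blocks_per_group,
                        long inode_blocks)
    {
        long off = block % blocks_per_group;  /* offset within group */

        if (off == 0)               return SUPER;    /* superblock   */
        if (off == 1)               return IBITMAP;  /* inode bitmap */
        if (off == 2)               return DBITMAP;  /* data bitmap  */
        if (off < 3 + inode_blocks) return INODE;    /* inode blocks */
        return DATA;                                 /* the rest     */
    }

A caching policy could then, for example, pin every non-DATA block in NVRAM.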
Getting Rid Of The Dead
If file blocks are deleted, remove them from the cache
• No need to keep dead blocks around

Problem: How to determine if a file is deleted?
• Need to look for signs of deletion

Three different places to look:
• Inode bitmaps
• Directory that contains the file
• The inode itself

Operation inferencing via block differencing
Operation Inferencing: Detecting Deletes (Inode Bitmap)

[Figure: when the FS writes an inode bitmap, the SDS reads the old version from disk and diffs it against the new one; the cleared bits identify deleted files]
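A sketch of the differencing step; bitmap sizes and the reporting are illustrative:

    /* Sketch of operation inferencing by block differencing: when
     * the FS writes an inode bitmap, diff it against the cached old
     * version. Bits that flip 1 -> 0 are freed inodes, i.e., deleted
     * files whose blocks can be evicted from the cache. */
    #include <stdio.h>

    void diff_inode_bitmap(const unsigned char *old_bm,
                           const unsigned char *new_bm, int nbytes)
    {
        for (int i = 0; i < nbytes; i++) {
            unsigned char freed = old_bm[i] & ~new_bm[i]; /* 1 -> 0 */
            for (int b = 0; b < 8; b++)
                if (freed & (1u << b))
                    printf("inode %d freed (file deleted)\n",
                           i * 8 + b);
        }
    }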
Operation Inferencing: Overheads
Space overhead
• Block cache of inodes, indirect pointers, bitmaps, etc. (could be substantial)

Time overhead
• CPU: The difference operation is like an extra copy
• Disk: May require a block read (if small/no cache)

[In paper: Quantified time and space overheads]
• Main point: There is a CPU and memory cost
Case Studies
Experimental Set-up
Problem: Don’t have SDS hardware to use (yet!)
"Cost-effective" alternative:
• Software prototype
• Insert a driver underneath the FS, much like software RAID

Good because…
• Traffic stream is similar

Bad because…
• CPU, memory not isolated from the host
[Figure: file system above; the SDS prototype runs as a driver inside the OS]
Fast RAID Reconstruction
Observe: When reconstructing data onto a hot spare, no need to reconstruct data that isn't live

Trend: Less live data in performance-sensitive I/O systems

Question: How can we perform reconstruction quickly?
[Figure: mirrored disks reconstructing onto a hot spare]
Traditional Approaches
Why not in the file system?
• The file system doesn't know what RAID is

Why not in the storage system?
• RAID doesn't know what blocks are live (minimally it does, if a block has never been written)
The Semantic Way
Easy: Scan the disk, only copy live blocks
• Key piece of knowledge: Bitmaps
• Plus, need to watch for "unmapped" writes

Optionally, can copy "dead" blocks later
• Useful if the SDS doesn't feel "sure" about its knowledge
• Guaranteed correct with prioritized recovery
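A sketch of the scan, assuming the SDS has already classified the data bitmaps; copy_block() is a stand-in for the actual reconstruction write to the hot spare:

    /* Sketch of semantic reconstruction: walk the data bitmap and
     * rebuild only live blocks onto the hot spare; dead blocks can
     * optionally be copied later. copy_block() is assumed. */
    void reconstruct_live(const unsigned char *data_bitmap,
                          long nblocks, void (*copy_block)(long))
    {
        for (long blk = 0; blk < nblocks; blk++)
            if (data_bitmap[blk / 8] & (1u << (blk % 8)))  /* live? */
                copy_block(blk);    /* rebuild onto hot spare */
    }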
Fast Reconstruction: A Graph
Fast reconstruction: Less live data -> less time
• How data is spread across the disk affects recovery time

[Graph: RAID-5, IBM disks; reconstruction time vs. amount of live data]
Semantic Conclusions
Innovation in the traditional storage stack is limited
• File system: high- but not low-level info
• Storage system: low- but not high-level info

Semantically-smart disks: Best of both worlds?
• Take advantage of "smart" disk systems
• Exploit low-level information…
• …with high-level knowledge of the file system

A remaining challenge
• Overcoming the "file system obfuscation" problem
Outline
Overview

Knowledge and its applications
• Gray-box file placement
• Semantically-smart disks
• Scientific apps, the Grid, and I/O
Conclusions
Trends in Scientific Computing
What constitutes a job is increasingly complex
• Not your simple process anymore

Data demands are increasing
• Not just cycles anymore

Wide-area collaboration
• "Grids" facilitate sharing
The Question
How to run scientific workloads on the WAN?
[Figure: a home site and a remote site connected across the WAN]
Scientific Outline
Typical "scientific" jobs
• Structure
• Properties

Migratory file services
• Components
• Performance
Conclusions
First Things First
Study of modern scientific applications
• A "measure then build" approach

Suite of six applications
• BLAST: Searches genomic databases for matching proteins
• IBIS: Global-scale simulation of earth systems
• CMS: High-energy physics testing software
• Nautilus: Simulation of molecular dynamics
• Messkit Hartree-Fock: Simulation of atomic interactions
• AMANDA: Astrophysics simulation of cosmic events
An Example: AMANDA
A single "job" is a multi-process pipeline -> batch-pipelined
• Each process is a blue circle

There are many types of I/O
• Endpoint (red): unique input/output of the pipeline
• Pipeline-private (green): shared between pipe processes
• Batch-shared (yellow): shared across all pipes in the batch
[Figure: AMANDA batch-pipelined job, with per-stage data sizes of 4K, 1M, 3M, 5M, 23M, 26M, 126M, and 505M, and stage runtimes of 2188s, 42s, 955s, and 3601s]
Some Things We Learned
Demands of a single pipeline are modest
• A modern PC with a disk can handle the demand
• Aggregation of I/O could be harder (WAN)

Lots of sharing of data within and across pipelines
• Systems should (have to?) take advantage of this
Towards Systems Support
Systems Support
Need to build systems support for global execution
• Should support "batch-pipelined" jobs effectively

Goals
• Performance: Throughput is what matters (NOT simple metrics like "availability")
• Failures: Must be handled effectively (again, with the goal of improving performance)
Migratory File Services
Migratory file service
• I/O environment for "batch-pipelined" workloads
• Integrates performance and failure management
• Key: Understanding of workloads

Three pieces of implementation
• Virtual batch overlay
• Migratory proxies
• Workflow manager
The Virtual Batch Overlay

Want a familiar and controllable remote environment
• But often are stuck w/ a particular queueing system
• Further, cannot assume all relevant s/w is installed

Glide in our own "virtual batch system"
• On each node, run a master, a virtual machine, and a migratory proxy (described next)
[Figure: a remote node running a master (M), a virtual machine (VM), and a migratory proxy (MP)]
Migratory Proxies

Migratory proxies: Run on each remote node
• Fetch and cache data from the home node
• Cooperative cache for batch inputs
• Localize I/O that is pipeline-local
[Figure: at the remote site, a job (J) runs above the master (M), VM, and migratory proxy (MP); the proxy fetches and caches data from the home node across the WAN]
Workflow Manager
Where workload knowledge is encapsulated

Takes a workflow description (a hypothetical example follows)
• Job dependencies
• File indicators

Runs each job while taking failures into account
• Transactional management
• Proxy failure and job failure are not catastrophic (just rerun the job!)
• Proactive data replication
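As a purely hypothetical sketch (the talk does not specify a syntax), a workflow description might declare job dependencies and classify files so the manager knows what to replicate, cache cooperatively, or keep local:

    # Hypothetical workflow description; syntax invented for illustration.
    job  sim                                   # first stage of the pipeline
    job  analyze  after=sim                    # dependency: rerun on failure

    file calib.db     type=batch-shared        # cooperative cache across pipes
    file sim.tmp      type=pipeline-private    # localize at the proxy
    file results.out  type=endpoint            # replicate back to the home node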
Performance
By exploiting knowledge, an order-of-magnitude improvement over the naive approach
Outline
Overview

Knowledge and its applications
• Gray-box file placement
• Semantically-smart disks
• Scientific apps, the Grid, and I/O
Conclusions
Conclusions
The theme: Knowledge is power
• If you know how the FS decides on file layout, you can control it (PLACE)
• If you know the details of the FS's on-disk structures, you can gain FS-level knowledge behind a block-based interface (Semantic disks)
• If you know something about workloads and their I/O behaviors, you can optimize performance and handle failure gracefully
"Beware of false knowledge; it is more dangerous than ignorance."
Bernard Shaw
http://www.cs.wisc.edu/wind