storage: alternate futures
1
[Figure: scale of prefixes: Yotta, Zetta, Exa, Peta, Tera, Giga, Mega, Kilo]
Storage: Alternate Futures
Jim Gray
Microsoft Research
Research.Microsoft.com/~Gray/talks
NetStore ’99
Seattle WA, 14 Oct 1999
2
Acknowledgments: Thank You!!
• Dave Patterson
  – Convinced me that processors are moving to the devices.
• Kim Keeton and Erik Riedel
  – Showed that many useful subtasks can be done by disk-processors, and quantified execution interval
• Remzi Dusseau
  – Re-validated Amdahl’s laws
3
Outline
• The Surprise-Free Future (5 years)
  – 500 mips cpus for 10$
  – 1 Gb RAM chips
  – MAD at 50 Gbpsi
  – 10 GBps SANs are ubiquitous
  – 1 GBps WANs are ubiquitous
• Some consequences
  – Absurd (?) consequences.
  – Auto-manage storage
  – Raid10 replaces Raid5
  – Disc-packs
  – Disk is the archive media of choice
• A surprising future?
  – Disks (and other useful things) become supercomputers.
  – Apps run “in the disk”
4
The Surprise-free Storage Future
• 1 Gb RAM chips
• MAD at 50 Gbpsi
• Drives shrink one quantum
• Standard IO
• 10 GBps SANs are ubiquitous
• 1 Gbps WANs are ubiquitous
• 5 bips cpus for 1K$ and 500 mips cpus for 10$
5
1 Gb RAM Chips
• Moving to 256 Mb chips now
• 1 Gb will be “standard” in 5 years, 4 Gb will be premium product.
• Note:
  – 256 Mb = 32 MB: the smallest memory
  – 1 Gb = 128 MB: the smallest memory
6
MAD at 50 Gbpsi
• MAD: Magnetic Areal Density:
  – 3-10 Gbpsi in products
  – 20 Gbpsi in lab
  – 50 Gbpsi = paramagnetic limit, but… people have ideas.
• Capacity: rise 10x in 5 years (conservative)
• Bandwidth: rise 4x in 5 years (density + rpm)
• Disk: 50 GB to 500 GB
• 60-80 MBps
• 1k$/TB
• 15 minute to 3 hour scan time.
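The 5-year projections above can be restated as compound annual rates. This is a minimal sketch of that arithmetic; the helper name `annual_growth` is illustrative and not from the talk.

```python
def annual_growth(total_factor: float, years: float) -> float:
    """Compound annual growth factor that yields total_factor after `years`."""
    return total_factor ** (1.0 / years)

capacity_per_year = annual_growth(10, 5)   # capacity: 10x in 5 years
bandwidth_per_year = annual_growth(4, 5)   # bandwidth: 4x in 5 years

print(f"capacity  grows ~{(capacity_per_year - 1) * 100:.0f}% per year")
print(f"bandwidth grows ~{(bandwidth_per_year - 1) * 100:.0f}% per year")
```

Capacity compounds faster than bandwidth (roughly 58% versus 32% per year), so the time to scan a full disk keeps growing.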
7
Disk vs Tape
• Disk
  – 47 GB
  – 15 MBps
  – 10 ms seek time
  – 5 ms rotate time
  – 9$/GB for drive, 3$/GB for ctlrs/cabinet
  – 4 TB/rack
• Tape
  – 40 GB
  – 5 MBps
  – 30 sec pick time
  – Many minute seek time
  – 5$/GB for media, 10$/GB for drive+library
  – 10 TB/rack
The price advantage of tape is narrowing, and the performance advantage of disk is growing.
Guesstimates
CERN: 200 TB
3480 tapes
2 col = 50 GB
Rack = 1 TB = 20 drives
8
System On A Chip
• Integrate Processing with memory on one chip
  – chip is 75% memory now
  – 1MB cache >> 1960 supercomputers
  – 256 Mb memory chip is 32 MB!
  – IRAM, CRAM, PIM,… projects abound
• Integrate Networking with processing on one chip
  – system bus is a kind of network
  – ATM, FiberChannel, Ethernet,.. Logic on chip.
  – Direct IO (no intermediate bus)
• Functionally specialized cards shrink to a chip.
9
500 mips System On A Chip for 10$
• 486 now 7$
• 233 MHz ARM for 10$ system on a chip http://www.cirrus.com/news/products99/news-product14.html
• AMD/Celeron 266 ~ 30$
• In 5 years, today’s leading edge will be
  – System on chip (cpu, cache, mem ctlr, multiple IO)
  – Low cost
  – Low-power
  – Have integrated IO
• High end is 5 BIPS cpus
10
Standard IO in 5 Years
• Probably
• Replace PCI with something better; will still need a mezzanine bus standard
• Multiple serial links directly from processor
• Fast (10 GBps/link) for a few meters
• System Area Networks (SANS) ubiquitous (VIA morphs to SIO?)
11
Ubiquitous 10 GBps SANs in 5 years
• 1 Gbps Ethernet is a reality now.
  – Also FiberChannel, MyriNet, GigaNet, ServerNet, ATM,…
• 10 Gbps x4 WDM deployed now (OC192)
  – 3 Tbps WDM working in lab
• In 5 years, expect 10x, progress is astonishing
• Gilder’s law: Bandwidth grows 3x/year http://www.forbes.com/asap/97/0407/090.htm
[Figure: link bandwidths: 5 MBps, 20 MBps, 40 MBps, 80 MBps, 120 MBps (1 Gbps), 1 GBps]
12
Thin Clients Mean HUGE Servers
• AOL hosting customer pictures
• Hotmail allows 5 MB/user, 50 M users
• Web sites offer electronic vaulting for SOHO.
• IntelliMirror: replicate client state on server
• Terminal server: timesharing returns
• …. Many more.
13
Standard Storage Metrics
• Capacity:
  – RAM: MB and $/MB: today at 512MB and 3$/MB
  – Disk: GB and $/GB: today at 50GB and 10$/GB
  – Tape: TB and $/TB: today at 50GB and 12k$/TB (nearline)
• Access time (latency)
  – RAM: 100 ns
  – Disk: 10 ms
  – Tape: 30 second pick, 30 second position
• Transfer rate
  – RAM: 1 GB/s
  – Disk: 15 MB/s (arrays can go to 1GB/s)
  – Tape: 5 MB/s (striping is problematic, but “works”)
14
New Storage Metrics: Kaps, Maps, SCAN?
• Kaps: How many kilobyte objects served per second
  – The file server, transaction processing metric
  – This is the OLD metric.
• Maps: How many megabyte objects served per second
  – The Multi-Media metric
• SCAN: How long to scan all the data
  – the data mining and utility metric
• And
  – Kaps/$, Maps/$, TBscan/$
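The three metrics can be estimated from a device's latency, transfer rate, and capacity. The sketch below assumes a simple model that the talk does not spell out: one request at a time, with service time equal to latency plus object size divided by transfer rate. The function names and the decimal units are illustrative; the device figures are the "Standard Storage Metrics" values.

```python
KB, MB, GB = 1_000, 1_000_000, 1_000_000_000  # decimal units (an assumption)

def served_per_second(latency_s, bytes_per_s, object_bytes):
    """Objects per second from one device handling one request at a time."""
    return 1.0 / (latency_s + object_bytes / bytes_per_s)

def kaps(latency_s, bytes_per_s):
    """Kilobyte objects served per second."""
    return served_per_second(latency_s, bytes_per_s, KB)

def maps(latency_s, bytes_per_s):
    """Megabyte objects served per second."""
    return served_per_second(latency_s, bytes_per_s, MB)

def scan_seconds(capacity_bytes, bytes_per_s):
    """Time to read the whole device sequentially."""
    return capacity_bytes / bytes_per_s

# (latency in seconds, transfer rate in bytes/s, capacity in bytes)
devices = {
    "RAM":  (100e-9, 1 * GB, 512 * MB),
    "Disk": (10e-3, 15 * MB, 50 * GB),
    "Tape": (30 + 30, 5 * MB, 50 * GB),  # pick time plus position time
}

for name, (latency, rate, capacity) in devices.items():
    print(f"{name:4s}  Kaps={kaps(latency, rate):10.3g}  "
          f"Maps={maps(latency, rate):8.3g}  "
          f"scan={scan_seconds(capacity, rate):8.3g} s")
```

Under this model the disk serves about 99 one-kilobyte objects or about 13 one-megabyte objects per second: latency dominates Kaps, while transfer time dominates Maps.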
15
For the Record (good 1999 devices packaged in system)
http://www.tpc.org/results/individual_results/Compaq/compaq.5500.99050701.es.pdf

                     DRAM     DISK     TAPE robot
Unit capacity (GB)   1        9        40
Unit price $         5000     900      20000
$/GB                 3300     12       12
Latency (s)          1.E-7    2.E-3    3.E+1
Bandwidth (MBps)     1000     15       20
Kaps                 9.E+5    6.E+2    3.E-2
Maps                 1.E+3    14.67    3.E-2
Scan time (s/TB)     1        600      24500
$/Kaps               6.E-11   1.E-8    6.E-3
$/Maps               5.E-8    6.E-7    6.E-3
$/TBscan             $0.05    $1       $129

X 100
Tape is 1 TB with 4 DLT readers at 5 MBps each.
16
For the Record (chart)
[Figure: log-scale chart, 1.E-11 to 1.E+7, of Kaps, Maps, Scan time (s/TB), $/Kaps, $/Maps, and $/TBscan for DRAM, DISK, and TAPE]
17
The Access Time Myth
• The Myth: seek or pick time dominates
• The reality:
  (1) Queuing dominates
  (2) Transfer dominates BLOBs
  (3) Disk seeks often short
• Implication: many cheap servers better than one fast expensive server
  – shorter queues
  – parallel transfer
  – lower cost/access and cost/byte
• This is obvious for disk arrays
• This is even more obvious for tape arrays
[Figure: access time components: Wait, Seek, Rotate, Transfer]
18
Storage Ratios Changed
• 10x better access time
• 10x more bandwidth
• 4,000x lower media price
[Figure: Disk Performance vs Time, 1980-2000: seeks per second, bandwidth (MB/s), capacity (GB)]
[Figure: Disk accesses/second vs Time, 1980-2000]
[Figure: Storage Price vs Time, 1980-2000: megabytes per kilo-dollar (MB/k$)]
• DRAM/disk media price ratio changed
  – 1970-1990 100:1
  – 1990-1995 10:1
  – 1995-1997 50:1
  – today ~ 30:1 (0.1$pMB disk, 3$pMB dram)
19
Data on Disk Can Move to RAM in 8 years
[Figure: Storage Price vs Time, 1980-2000: megabytes per kilo-dollar (MB/k$), annotated with a 30:1 gap and 6 years]
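For data on disk to move to RAM, DRAM has to fall to today's disk price per megabyte, a factor of about 30 (3 $/MB versus 0.1 $/MB). The sketch below estimates how long that takes. The 18-month halving period for DRAM price is an assumption chosen for illustration, not a figure from the talk, and `years_to_close` is a hypothetical helper.

```python
import math

def years_to_close(price_ratio: float, halving_years: float) -> float:
    """Years until a price falls by price_ratio if it halves every halving_years."""
    return math.log2(price_ratio) * halving_years

dram_per_mb = 3.0    # $/MB, from the price-ratio bullets
disk_per_mb = 0.1    # $/MB, from the price-ratio bullets
ratio = dram_per_mb / disk_per_mb

# ASSUMPTION: DRAM $/MB halves every 1.5 years.
years = years_to_close(ratio, halving_years=1.5)
print(f"ratio {ratio:.0f}:1 closes in about {years:.1f} years")
```

Under that assumed halving period the gap closes in a little over 7 years, between the chart's 6 years and the title's 8 years. A faster or slower price decline moves the answer proportionally.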
20
Outline
• The Surprise-Free Future (5 years)
  – 500 mips cpus for 10$
  – 1 Gb RAM chips
  – MAD at 50 Gbpsi
  – 10 GBps SANs are ubiquitous
  – 1 GBps WANs are ubiquitous
• Some consequences
  – Absurd (?) consequences.
  – Auto-manage storage
  – Raid10 replaces Raid5
  – Disc-packs
  – Disk is the archive media of choice
• A surprising future?
  – Disks (and other useful things) become supercomputers.
  – Apps run “in the disk”.
21
The (absurd?) consequences
• 256 way NUMA?
• Huge main memories:
  now: 500MB - 64GB memories
  then: 10GB - 1TB memories
• Huge disks
  now: 5-50 GB 3.5” disks
  then: 50-500 GB disks
• Petabyte storage farms
  – (that you can’t back up or restore).
• Disks >> tapes
  – “Small” disks: One platter one inch 10GB
• SAN convergence: 1 GBps point to point is easy
• 1 Gb RAM chips
• MAD at 50 Gbpsi
• Drives shrink one quantum
• 10 GBps SANs are ubiquitous
• 500 mips cpus for 10$
• 5 bips cpus at high end
22
The Absurd? Consequences
• Further segregate processing from storage
• Poor locality
• Much useless data movement
• Amdahl’s laws: bus: 10 B/ips, io: 1 b/ips
[Figure: Processors (~ 1 Tips), RAM Memory (~ 1 TB), Disks (~ 100 TB); 10 TBps between processors and memory, 100 GBps to disks]
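The bandwidth figures on this slide follow from applying the two Amdahl ratios to a 1 Tips system. A minimal sketch, reading "B" as bytes and "b" as bits (the usual convention, assumed here); `amdahl_balance` is an illustrative name.

```python
def amdahl_balance(ips: float) -> dict:
    """Bandwidths implied by the slide's ratios for a processor complex of `ips`."""
    return {
        "bus_bytes_per_s": 10.0 * ips,  # bus: 10 bytes per instruction per second
        "io_bytes_per_s": ips / 8.0,    # io: 1 bit per instruction per second
    }

balance = amdahl_balance(1e12)  # ~1 Tips
print(f"bus: {balance['bus_bytes_per_s'] / 1e12:.0f} TBps")
print(f"io:  {balance['io_bytes_per_s'] / 1e9:.0f} GBps")
```

The bus figure comes out at 10 TBps and the IO figure at 125 GBps, which the slide rounds to 100 GBps.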
23
Storage Latency: How Far Away is the Data?
Level                  Relative latency   Where           How long
Registers              1                  My Head         1 min
On Chip Cache          2                  This Room
On Board Cache         10                 This Hotel      10 min
Memory                 100                Olympia         1.5 hr
Disk                   10^6               Pluto           2 Years
Tape /Optical Robot    10^9               Andromeda       2,000 Years
24
Consequences
• AutoManage Storage
• Sixpacks (for arm-limited apps)
• Raid5-> Raid10
• Disk-to-disk backup
• Smart disks
25
Auto Manage Storage
• 1980 rule of thumb:
  – A DataAdmin per 10GB, SysAdmin per mips
• 2000 rule of thumb
  – A DataAdmin per 5TB
  – SysAdmin per 100 clones (varies with app).
• Problem:
  – 5TB is 60k$ today, 10k$ in a few years.
  – Admin cost >> storage cost???
• Challenge:
  – Automate ALL storage admin tasks
26
The “Absurd” Disk
• 2.5 hr scan time (poor sequential access)
• 1 aps / 5 GB (VERY cold data)
• It’s a tape!
[Figure: one disk: 1 TB, 100 MB/s, 200 Kaps]
27
Extreme case: 1TB disk: Alternatives
• Use all the heads in parallel
  – Scan in 30 minutes
  – Still one Kaps/5GB
• Use one platter per arm
  – Share power/sheetmetal
  – Scan in 30 minutes
  – One Kaps per GB
[Figure: all heads in parallel: 1 TB, 500 MB/s, 200 Kaps]
[Figure: one platter per arm: 200 GB each, 500 MB/s, 1,000 Kaps]
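The scan times and access densities quoted for the 1 TB disk and its two alternatives can be checked with a few divisions. A minimal sketch, assuming decimal units; the helper names are illustrative.

```python
GB = 1_000_000_000
MB = 1_000_000

def scan_minutes(capacity_bytes, bytes_per_s):
    """Minutes to read the whole device sequentially."""
    return capacity_bytes / bytes_per_s / 60

def gb_per_access(capacity_bytes, accesses_per_s):
    """Gigabytes of stored data behind each access per second."""
    return capacity_bytes / GB / accesses_per_s

capacity = 1000 * GB

single_arm = scan_minutes(capacity, 100 * MB)      # one head streaming
parallel_heads = scan_minutes(capacity, 500 * MB)  # all heads in parallel

print(f"single arm:     {single_arm:6.1f} min, "
      f"{gb_per_access(capacity, 200):.0f} GB per access/s")
print(f"parallel heads: {parallel_heads:6.1f} min, "
      f"{gb_per_access(capacity, 200):.0f} GB per access/s")
print(f"platter per arm:{parallel_heads:6.1f} min, "
      f"{gb_per_access(capacity, 1000):.0f} GB per access/s")
```

The single-arm scan comes to about 167 minutes (close to the slide's rounded 2.5 hours), and 500 MB/s brings it to about 33 minutes. Only the one-platter-per-arm design improves access density, from one access per second per 5 GB to one per GB.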
28
Drives shrink (1.8”, 1”)
• 150 kaps for 500 GB is VERY cold data
• 3 GB/platter today, 30 GB/platter in 5 years.
• Most disks are ½ full
• TPC benchmarks use 9GB drives (need arms or bandwidth).
• One solution: smaller form factor
  – More arms per GB
  – More arms per rack
  – More arms per Watt
29
Prediction: 6-packs
• One way or another, when disks get huge
  – Will be packaged as multiple arms
  – Parallel heads give bandwidth
  – Independent arms give bandwidth & aps
• Package shares power, package, interfaces…
30
Stripes, Mirrors, Parity (RAID 0,1, 5)
• RAID 0: Stripes
  – bandwidth
  – Layout: 0,3,6,.. | 1,4,7,.. | 2,5,8,..
• RAID 1: Mirrors, Shadows,…
  – Fault tolerance
  – Reads faster, writes 2x slower
  – Layout: 0,1,2,.. | 0,1,2,..
• RAID 5: Parity
  – Fault tolerance
  – Reads faster
  – Writes 4x or 6x slower.
  – Layout: 0,2,P2,.. | 1,P1,4,.. | P0,3,5,..
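The three block layouts shown on this slide can be generated programmatically. This is a sketch of one placement rule that reproduces the slide's numbering; the function names are illustrative, and real arrays may rotate parity differently.

```python
def raid0_layout(num_disks, num_blocks):
    """Stripe blocks round-robin across the disks."""
    disks = [[] for _ in range(num_disks)]
    for block in range(num_blocks):
        disks[block % num_disks].append(block)
    return disks

def raid1_layout(num_disks, num_blocks):
    """Every disk holds a full copy of every block."""
    return [list(range(num_blocks)) for _ in range(num_disks)]

def raid5_layout(num_disks, num_stripes):
    """One parity block per stripe, rotating from the last disk to the first."""
    disks = [[] for _ in range(num_disks)]
    block = 0
    for stripe in range(num_stripes):
        parity_disk = (num_disks - 1 - stripe) % num_disks
        for disk in range(num_disks):
            if disk == parity_disk:
                disks[disk].append(f"P{stripe}")
            else:
                disks[disk].append(block)
                block += 1
    return disks

print(raid0_layout(3, 9))
print(raid1_layout(2, 3))
print(raid5_layout(3, 3))
```

With three disks, `raid5_layout` places `P0` on the last disk, `P1` on the middle disk, and `P2` on the first, matching the slide. Spreading parity this way keeps any single disk from becoming the write bottleneck.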
31
RAID 10 (stripes of mirrors) Wins: “wastes space, saves arms”
RAID 5:
• Performance
  – 225 reads/sec
  – 70 writes/sec
  – Write
    • 4 logical IO,
    • 2 seek + 1.7 rotate
• SAVES SPACE
• Performance degrades on failure
RAID 1:
• Performance
  – 250 reads/sec
  – 100 writes/sec
  – Write
    • 2 logical IO
    • 2 seek + 0.7 rotate
• SAVES ARMS
• Performance improves on failure
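The "4 logical IO" write cost for RAID 5 comes from the read-modify-write parity update: read old data, read old parity, write new data, write new parity. The sketch below shows the standard XOR update and counts the IOs against a mirrored write. The function names are illustrative, and in-memory byte strings stand in for disk blocks.

```python
def xor_bytes(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def raid5_small_write(old_data: bytes, old_parity: bytes, new_data: bytes):
    """Return (new_parity, logical_ios) for updating one data block."""
    # IO 1: read old data; IO 2: read old parity (both passed in here).
    new_parity = xor_bytes(old_parity, old_data, new_data)
    # IO 3: write new data; IO 4: write new parity.
    return new_parity, 4

def mirrored_write(new_data: bytes):
    """Return (copies, logical_ios): one write per mirror, no reads needed."""
    return [new_data, new_data], 2

d0, d1 = b"\x0f\xf0", b"\x33\x55"
parity = xor_bytes(d0, d1)
new_d0 = b"\xaa\xbb"
new_parity, raid5_ios = raid5_small_write(d0, parity, new_d0)
_, mirror_ios = mirrored_write(new_d0)
print(raid5_ios, mirror_ios, new_parity == xor_bytes(new_d0, d1))
```

The updated parity equals the XOR of the current data blocks, so the stripe stays recoverable without reading the other data disks. That saving is why the update costs four IOs rather than a full-stripe read.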
32
The Storage Rack Today
• 140 arms
• 4TB
• 24 racks
  24 storage processors
  6+1 in rack
• Disks = 2.5 GBps IO
• Controllers = 1.2 GBps IO
• Ports 500 MBps IO
33
Storage Rack in 5 years?
• 140 arms
• 50TB
• 24 racks
  24 storage processors
  6+1 in rack
• Disks = 2.5 GBps IO
• Controllers = 1.2 GBps IO
• Ports 500 MBps IO
• My suggestion: move the processors into the storage racks.
34
It’s hard to archive a PetaByte
It takes a LONG time to restore it.
• Store it in two (or more) places online (on disk?).
• Scrub it continuously (look for errors)
• On failure, refresh lost copy from safe copy.
• Can organize the two copies differently (e.g.: one by time, one by space)
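The scrub-and-refresh steps above can be sketched as a loop over stored checksums. This is a toy illustration under stated assumptions: the two copies are in-memory dicts keyed by block id, CRC-32 stands in for whatever error check a real system would use, and `scrub` is a hypothetical helper.

```python
import zlib

def scrub(primary: dict, replica: dict, checksums: dict) -> list:
    """Verify each block in both copies; refresh a bad block from the good copy.

    Returns a list of (block_id, repaired_copy_name) in scan order.
    """
    repaired = []
    for block_id, expected in checksums.items():
        primary_ok = zlib.crc32(primary[block_id]) == expected
        replica_ok = zlib.crc32(replica[block_id]) == expected
        if primary_ok and not replica_ok:
            replica[block_id] = primary[block_id]
            repaired.append((block_id, "replica"))
        elif replica_ok and not primary_ok:
            primary[block_id] = replica[block_id]
            repaired.append((block_id, "primary"))
        # If both copies are bad, nothing here can repair the block.
    return repaired

primary = {0: b"alpha", 1: b"bravo"}
replica = dict(primary)
checksums = {block_id: zlib.crc32(data) for block_id, data in primary.items()}

replica[1] = b"brXvo"  # simulate silent corruption in one copy
result = scrub(primary, replica, checksums)
print(result)
```

Running the scrub continuously keeps the window in which both copies of a block are bad as short as possible. The sketch makes no attempt to handle the case where both copies fail.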
35
Crazy Disk Ideas
• Disk Farm on a card: surface mount disks
• Disk (magnetic store) on a chip: (micro machines in Silicon)
• Full Apps (e.g. SAP, Exchange/Notes,..) in the disk controller (a processor with 128 MB dram)
[Figure: ASIC]
The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail, Clayton M. Christensen. ISBN: 0875845851
36
The Disk Farm On a Card
• The 500GB disc card
• An array of discs
• Can be used as
  – 100 discs
  – 1 striped disc
  – 50 Fault Tolerant discs
  – ....etc
• LOTS of accesses/second and bandwidth
[Figure: 14" card]
37
Functionally Specialized Cards
• Storage
• Network
• Display
[Figure: each card is an ASIC plus a P mips processor and M MB DRAM]
Today:
  P = 50 mips
  M = 2 MB
In a few years:
  P = 200 mips
  M = 64 MB
38
It’s Already True of Printers: Peripheral = CyberBrick
• You buy a printer
• You get
  – several network interfaces
  – a Postscript engine
    • cpu,
    • memory,
    • software,
    • a spooler (soon)
  – and… a print engine.
39
Tera Byte Backplane
• TODAY
  – Disk controller is 10 mips risc engine with 2MB DRAM
  – NIC is similar power
• SOON
  – Will become 100 mips systems with 100 MB DRAM.
• They are nodes in a federation (can run Oracle on NT in disk controller).
• Advantages
  – Uniform programming model
  – Great tools
  – Security
  – Economics (cyberbricks)
  – Move computation to data (minimize traffic)
All Device Controllers will be Cray 1’s
[Figure: Central Processor & Memory]
40
With Tera Byte Interconnect and Super Computer Adapters
• Processing is incidental to
  – Networking
  – Storage
  – UI
• Disk Controller/NIC is
  – faster than device
  – close to device
  – Can borrow device package & power
• So use idle capacity for computation.
• Run app in device.
• Both Kim Keeton’s (UCB) and Erik Riedel’s (CMU) theses investigate this and show benefits of this approach.
[Figure: Tera Byte Backplane]
41
Implications
Conventional
• Offload device handling to NIC/HBA
• higher level protocols: I2O, NASD, VIA, IP, TCP…
• SMP and Cluster parallelism is important.
Radical
• Move app to NIC/device controller
• higher-higher level protocols: CORBA / COM+.
• Cluster parallelism is VERY important.
[Figure: Central Processor & Memory attached to a Tera Byte Backplane]
42
How Do They Talk to Each Other?
• Each node has an OS
• Each node has local resources: A federation.
• Each node does not completely trust the others.
• Nodes use RPC to talk to each other
  – CORBA? COM+? RMI?
  – One or all of the above.
• Huge leverage in high-level interfaces.
• Same old distributed system story.
[Figure: two nodes connected by a SAN; each stacks Applications over RPC?, streams, and datagrams over SIO]
43
Outline
• The Surprise-Free Future (5 years)
  – Astonishing hardware progress.
• Some consequences
  – Absurd (?) consequences.
  – Auto-manage storage
  – Raid10 replaces Raid5
  – Disc-packs
  – Disk is the archive media of choice
• A surprising future?
  – Disks (and other useful things) become supercomputers.
  – Apps run “in the disk”