Data Reservoir: Utilization of Multi-Gigabit Backbone Network for Data-Intensive Research
Mary Inaba, Makoto Nakamura, Kei Hiraki
University of Tokyo
AWOCA 2003
Today’s Topic
• New infrastructure for data-intensive scientific research
• Problems of using the Internet
One day, I was surprised
One professor (Dept. of Astronomy) said: "The network is for e-mail and paper exchange. FedEx is for REAL data exchange." (They use DLT tapes, and airplanes.)
Huge Data Producers
AKEBONO satellite
Radio telescope in NOBEYAMA
SUBARU telescope
KAMIOKANDE (Nobel Prize)
High-energy accelerator
A lot of data suggests a lot of scientific truth, by computation. Now we can compute: Data-Intensive Research.
Huge Data Transfer (inquiry to Profs.)
Current state: data transfer by DLT, EVERY WEEK.
Expected data size in a few years:
• 10 GB/day of satellite data
• 50 GB/day from the high-energy accelerator
• 50 PB tape archive for Earth simulation
Observatories are shared by many researchers; hence the data somehow NEEDs to be brought to the lab.
Does Network help?
Super-SINET backbone
[Map: optical cross-connect backbone linking Hokkaido Univ., Tohoku Univ., KEK/Tsukuba Univ., Univ. of Tokyo, NAO, NII, Titech, Waseda, ISAS, Nagoya Univ., Okazaki Labs, Kyoto Univ., Doshisha Univ., Osaka Univ., and Kyushu Univ.]
Started January 2002
A network for universities and institutes
A combination of a 10 Gbps ordinary line and several 1 Gbps project lines (physics, genome, Grid, etc.)
Currently
It is not so easy to transfer HUGE data over a long distance while fully utilizing the bandwidth, because:
• TCP/IP is popularly used, and for TCP/IP, latency is the problem
• Disk I/O speed (50 MB/sec)
…
Recall HISTORY: Infrastructure for Scientific Research Projects
• Utilization of the computing systems of the time, from the birth of the electronic computer:
  – Numerical computation ⇒ tables, equations (① EDSAC)
  – Supercomputing (vector) ⇒ simulation (② CDC-6600, ③ CRAY-1)
  – Servers ⇒ databases, data mining, genome (④ SUN Fire 15000)
  – Internet ⇒ information exchange, documentation (⑤ 10G switch)
Scientific researchers always utilize the top-end systems of the day.
Frontier of Information Processing
New transition period: balance of computing systems
• Very high-speed networks
• Large-scale disk storage
• Cluster computers
⇒ New infrastructure for Data-Intensive Research
[Diagram: CPU (GFLOPS), memory (GB), network interface (Gbps), local disks and remote disks]
Research Projects with Data Reservoir
Name | Project | Domestic connection | Overseas connection | Current amount of traffic
Hideyuki Sakai | High-energy polarimeter SMART | RIKEN, RCNP | CERN, Brookhaven | CERN LEP 70 DAT/month; CERN LHC 100 MB/sec; Brookhaven 100 Mbps; RCNP accelerator 50 GB/day
Yoshiaki Sobue | Radio telescope (VLBI), Nobeyama Radio Observatory | | Max Planck Observatory | VLBI data 200 GB
Sadanori Okamura | Sloan Digital Sky Survey | National Astronomical Observatory | Fermi Lab. | Survey data: 10 TB; data exchange with Fermi Lab.
Kazuo Makishima | Satellite observation of the early universe | ISAS, Hiroshima Univ., Saitama Univ. | NASA, European Space Agency | Current satellite 1 GB/day
Toshio Yamagata | Simulation of global change, Frontier Research System for Global Change | | N/A | 1 simulation 10 TB; currently a data archive system with 50 PBytes
Tomio Kobayashi | ATLAS experiment | KEK, Kyoto Univ., Univ. of Tsukuba | CERN | CERN LHC 100 MB/sec
Takashi Onaka | Infrared observation satellite IRIS | Nagoya Univ. | ESA receiving site (Sweden) | Downlink … 200 MB; data exchange within minutes
Jun'ichiro Makino | Astronomical simulation by GRAPE-6 | National Astronomical Observatory | Advanced Study, Princeton Univ.; Museum of Natural History | Maximum throughput: 100 MB/s; 1 simulation: 10 TB
Hiroaki Aihara | KEK B-factory | KEK, Nagoya Univ. | Princeton Univ. | Raw data: 600 GB/day; data exchange: 10 GB/day
Sadanori Okamura | SUBARU telescope | National Astronomical Observatory | Hawaii Observatory | 100 GB/day; peak bandwidth 0.5 GB/sec (4 Gbps)
Basic Architecture
[Diagram: two Data Reservoirs connected by a network with high latency and very high bandwidth. Distributed shared files (a DSM-like architecture); cache disks at each site; local file accesses at either end; physically addressed, parallel, multi-stream transfer between the reservoirs (a multi-stream sketch follows).]
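As a sketch of what "parallel and multi-stream transfer" means here (illustrative only: hypothetical ports and chunk size, plain TCP sockets instead of the system's physically addressed iSCSI transfer):

```python
# Sketch: stripe one bulk transfer across several TCP streams.
# Illustrative only -- the real Data Reservoir moves raw disk blocks via iSCSI.

import socket
import threading

CHUNK = 1 << 20  # 1 MiB per chunk (hypothetical)

def send_stripe(host: str, port: int, data: bytes,
                stream_id: int, n_streams: int) -> None:
    """Send every n_streams-th chunk of `data` over one TCP connection."""
    with socket.create_connection((host, port)) as sock:
        for off in range(stream_id * CHUNK, len(data), n_streams * CHUNK):
            sock.sendall(data[off:off + CHUNK])

def parallel_send(host: str, base_port: int, data: bytes,
                  n_streams: int = 4) -> None:
    """Open n_streams connections and send interleaved chunks concurrently."""
    threads = [
        threading.Thread(target=send_stripe,
                         args=(host, base_port + i, data, i, n_streams))
        for i in range(n_streams)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```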
Data-intensive scientific computation through SUPER-SINET
[Diagram: a very high-speed network connects Data Reservoirs at the data sources (Belle experiments, CERN, X-ray astronomy satellite ASUKA, SUBARU telescope, Nobeyama Radio Observatory (VLBI), nuclear experiments, Digital Sky Survey) with data analysis at the University of Tokyo; local accesses at each site, distributed shared files in between.]
Design Policy
• Modification of the disk handler under the VFS layer
• Direct access to the raw device for efficient data transfer
• Multi-level striping for scalability
• Use of the iSCSI protocol
• Local file accesses through the LAN
• Global disk transfer through the WAN
• Single file image
• File system transparency
[Diagram: file-system software stack: application → file system → md (RAID) driver → sd / sg / st drivers → SCSI driver (mid) → SCSI driver (low); an iSCSI driver and the iSCSI daemon on the data server connect the stack to the disks]
File accesses on Data Reservoir
[Diagram: scientific detectors and user programs attach to file servers through an IP switch; file servers stripe files across disk servers (1st-level striping), and each disk server stripes across its own disks (2nd-level striping); disks are accessed by iSCSI. A sketch of the two-level striping follows.]
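A minimal sketch of the two-level striping address mapping (all names and stripe widths are hypothetical; the real system does this in the driver stack, not in Python):

```python
# Minimal sketch of two-level striping address mapping.
# All parameters are illustrative, not the actual Data Reservoir values.

NUM_DISK_SERVERS = 4      # 1st-level stripe width (across disk servers)
DISKS_PER_SERVER = 8      # 2nd-level stripe width (within a disk server)

def map_block(logical_block: int) -> tuple[int, int, int]:
    """Map a logical block to (disk_server, disk, local_block)."""
    # 1st level: round-robin across disk servers
    server = logical_block % NUM_DISK_SERVERS
    stripe = logical_block // NUM_DISK_SERVERS
    # 2nd level: round-robin across the disks of that server
    disk = stripe % DISKS_PER_SERVER
    local_block = stripe // DISKS_PER_SERVER
    return server, disk, local_block

if __name__ == "__main__":
    for lb in range(10):
        print(lb, map_block(lb))
```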
User's View
[Same diagram as the previous slide: detectors and user programs, file servers, IP switches, disk servers with two-level striping and iSCSI disk access; to the user all of this appears as ordinary file accesses on the Data Reservoir]
Global Data Transfer
[Diagram: two Data Reservoir sites, each with scientific detectors and user programs, file servers, an IP switch, and disk servers; the disk servers exchange data by iSCSI bulk transfer across the global network]
Implementation (File Server)
[Software stack, top to bottom: application → system call → NFS / EXT2 → Linux RAID → iSCSI driver, sd driver, sg driver → TCP/UDP → IP → network]
Implementation (Disk Server)
[Software stack, top to bottom: application layer (iSCSI daemon) → system call → iSCSI driver, sg driver, dr driver, SCSI driver → TCP → IP → network; data is striped across the local disks]
Performance evaluation of Data Reservoir
1. Local experiment, 1 Gbps model (basic performance)
2. 40 km experiment, 1 Gbps model, Univ. of Tokyo ⇔ ISAS
3. 1600 km experiment, 1 Gbps model
   • 26 ms latency (Tokyo ⇔ Kyoto ⇔ Osaka ⇔ Sendai ⇔ Tokyo)
   • High-quality network (SUPER-SINET Grid project lines)
4. US-Japan experiments
   • 1 Gbps model
   • Univ. of Tokyo ⇔ Fujitsu Lab America (Maryland, USA)
   • Univ. of Tokyo ⇔ SCinet (Maryland, USA)
5. 10 Gbps experiments: comparison of different switch configurations
   • Extreme Summit 7i, trunked 8 Gigabit Ethernets
   • RiverStone RS16000, trunked 8 and 12 1000BASE-SX
   • Foundry BigIron, 10GBASE-LR modules
   • Extreme BlackDiamond, trunked 8 1000BASE-SX
   • Foundry BigIron, trunked 2 10GBASE-LR
   • The bottleneck (8 Gbps): trunking 8 Gigabit Ethernets
Performance Comparison to ftp (40 km)
• ftp: optimal performance (minimum disk-head movements)
• iSCSI: queued operation
• iSCSI transfer is 55% faster than ftp on a single TCP stream (a rough model of why queuing helps follows the charts)
[Charts: FTP 1 GB file transfer (disk to disk), rate in MB/s on a 0–35 scale, for the default and tuned 1 GB file transfers; iSCSI transfer (disk to disk), average/max/min rate in MB/s on a 0–45 scale, for queue depths 1, 2, 4, 8, and 16]
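A rough model of why queued operation wins (assumed request size and RTT, not the measured values): with one outstanding request each block costs a full round trip, while a deeper queue keeps the pipe full.

```python
# Rough model: throughput of a request/response disk protocol vs. queue depth.
# throughput ~ min(link capacity, queue_depth * request_size / RTT)

def throughput_mbps(queue_depth: int, request_bytes: int,
                    rtt_s: float, link_mbps: float) -> float:
    in_flight_mbps = queue_depth * request_bytes * 8 / rtt_s / 1e6
    return min(link_mbps, in_flight_mbps)

# Hypothetical: 64 KB requests, 1 ms RTT (roughly a 40 km path), 1 Gbps link.
for q in (1, 2, 4, 8, 16):
    print(q, round(throughput_mbps(q, 64 * 1024, 1e-3, 1000)), "Mbps")
# Depth 1 is latency-bound (~524 Mbps here); deeper queues hide the round trip.
```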
1600 km experiment system
• 870 Mbps file transfer BW
Univ. of Tokyo (Cisco 6509)
↓ 1G Ether (Super-SINET)
Kyoto Univ. (Extreme BlackDiamond)
↓ 1G Ether (Super-SINET)
Osaka Univ. (Cisco 3508)
↓ 1G Ether (Super-SINET)
Tohoku Univ. (jumper fiber)
↓ 1G Ether (Super-SINET)
Univ. of Tokyo (Extreme Summit 7i)
Network for the 1600 km experiments
[Map: a GbE loop over roughly 1000 miles of line connecting Univ. of Tokyo, Kyoto Univ., Osaka Univ., and Tohoku Univ. (Sendai), with legs of about 550, 300, and 250 miles; IBM servers at the sites]
• Grid project networks of SUPER-SINET
• One-way latency 26 ms
Transfer speed on the 1600 km experiment
System configuration (file servers × disk servers × disks per disk server) and transfer rate (Mbps):
  1×4×8: 870    1×4×(2+2): 828    1×4×4: 812
  1×2×8: 737    1×2×(2+2): 700    1×2×4: 707
  1×1×8: 499    1×1×(2+2): 478    1×1×4: 493
Maximum bandwidth by SmartBits = 970 Mbps; overheads of headers ~5% (a rough check of that figure follows).
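A sanity check of the ~5% header overhead figure, assuming standard Ethernet/IP/TCP header sizes (the slide's figure may be measured differently and also includes iSCSI overhead):

```python
# Rough estimate of per-packet header overhead on Gigabit Ethernet.
# Standard sizes; the slide's ~5% also covers iSCSI overhead.

MTU = 1500            # IP packet size (bytes)
IP_HDR = 20           # IPv4 header, no options
TCP_HDR = 20          # TCP header, no options
ETH_HDR = 14          # Ethernet header
FCS = 4               # Ethernet frame check sequence
PREAMBLE = 8          # preamble + start-of-frame delimiter
IPG = 12              # minimum inter-packet gap (IEEE 802.3)

payload = MTU - IP_HDR - TCP_HDR                 # 1460 bytes of user data
on_wire = ETH_HDR + MTU + FCS + PREAMBLE + IPG   # 1538 bytes on the wire

efficiency = payload / on_wire
print(f"efficiency = {efficiency:.3f}")                   # ~0.949, i.e. ~5% overhead
print(f"usable of 1 Gbps: {1000 * efficiency:.0f} Mbps")  # ~949 Mbps
```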
10 Gbps experiment
• Local connection of two 10 Gbps models
• 10GBASE-LR, or 8 to 12 trunked 1000BASE-SX
• 24 disk servers + 6 file servers
  – Dell 1650: 1.26 GHz Pentium III × 2, 1 GB memory, ServerSet III HE-SL
  – NetGear GE NIC
  – Extreme Summit 7i (trunking)
  – Extreme BlackDiamond 6808
  – Foundry BigIron (10GBASE-LR)
  – RiverStone RS-16000
• 11.7 Gbps transfer BW
Performance on the 10 Gbps model
• 300 GB file transfer (iSCSI streams)
• 5% header loss due to TCP/IP and iSCSI
• 7% performance loss due to trunking
• Uneven use of disk servers
[Graph: throughput (0–8 Gbps) vs. number of disk servers (4, 8, 16, 24)]
100 GB file transferred in 2 minutes (800 Gbit / 120 s ≈ 6.7 Gbps on average)
US-Japan Experiments at SC2002 Bandwidth Challenge
92% Usage of Bandwidth using TCP/IP
Brief Explanation of TCP/IP
User's View
[Diagram: two TCP endpoints connected through the Internet; the byte stream "abcde" enters on one side and "abcde" comes out on the other]
TCP is a PIPE: it outputs the same data, in the same order as the input data.
TCP's View
[Diagram: the same two endpoints and byte stream]
• Checks whether all the data has arrived
• Re-orders data when the arrival order is wrong
• Asks for a re-send when data is missing
• Controls the speed
TCP's features
• Keeps data until an acknowledgement (ACK) arrives: it uses a buffer (the window), and when an ACK arrives from the receiver, new data is moved into the buffer.
• Speed control (congestion control) without knowing the state of the routers: it makes the buffer (window) small when congestion is guessed to have occurred.
Window Size and Throughput
Roughly speaking (RTT: round-trip time):

    Throughput = Window Size / RTT

Hence a longer RTT needs a larger window size for the same throughput.
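A worked example of the formula (the 200 ms RTT appears later in this talk for the US-Japan path; 52 ms is twice the 26 ms one-way latency of the 1600 km loop):

```python
# Window size needed for a target throughput, from Throughput = Window / RTT.

def window_for(throughput_bps: float, rtt_s: float) -> float:
    """Return the window size in bytes needed to sustain the throughput."""
    return throughput_bps * rtt_s / 8  # bits -> bytes

# 1 Gbps over a 200 ms RTT (e.g. the US-Japan path):
print(window_for(1e9, 0.200) / 1e6, "MB")   # 25.0 MB
# 1 Gbps over the 52 ms round trip of the 1600 km loop:
print(window_for(1e9, 0.052) / 1e6, "MB")   # 6.5 MB
```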
Congestion Control: AIMD (Additive Increase, Multiplicative Decrease)
[Graph: window size vs. time; the window is doubled for every ACK in the start phase, then enters the AIMD phase]
After congestion occurs, accelerate gradually; slow down rapidly when congestion is expected.
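A toy simulation of the window behavior sketched above (textbook slow start plus AIMD; the threshold and loss time are illustrative, not values from the talk):

```python
# Toy model of TCP window evolution: slow start, then AIMD.
# Units: window in segments, time in RTTs. Constants are illustrative only.

def simulate(rtts: int, loss_at: int, ssthresh: float = 64.0) -> list[float]:
    window = 1.0
    trace = []
    for t in range(rtts):
        if t == loss_at:                 # congestion detected
            ssthresh = window / 2        # multiplicative decrease
            window = max(1.0, ssthresh)
        elif window < ssthresh:
            window *= 2                  # slow start: doubles each RTT
        else:
            window += 1                  # additive increase: +1 segment per RTT
        trace.append(window)
    return trace

print(simulate(rtts=20, loss_at=10))
```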
Another Problem
Call a "network with long latency and wide bandwidth" an LFN (Long Fat Pipe Network).
An LFN needs a large window size; but since window growth is triggered by ACKs, the speed of growth is also SLOW.
(LFNs suffer under AIMD.)
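To see how badly an LFN suffers, plain arithmetic with the figures used elsewhere in this talk (1 Gbps target, 200 ms RTT, ~1460-byte segments):

```python
# Time for additive increase to grow the window back to the bandwidth-delay product.

RTT = 0.200            # seconds (US-Japan path)
TARGET_BPS = 1e9       # 1 Gbps
MSS = 1460             # bytes of payload per segment

bdp_segments = TARGET_BPS * RTT / 8 / MSS   # window needed, in segments
seconds = bdp_segments * RTT                # +1 segment per RTT
print(f"window needed: {bdp_segments:.0f} segments")                     # ~17,000
print(f"time to grow it one RTT at a time: {seconds / 60:.0f} minutes")  # ~57
```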
Network Environment
The bottleneck (about 600 Mbps). Note that 600 Mbps < 1 Gbps.
92% using TCP/IP is good, but we still have a PROBLEM:
several streams only make progress after the other streams finish.
Fastest and slowest stream in the worst case
[Graph: sequence number vs. time]
The slowest stream is 3 times slower than the fastest; even after the other streams finished, its throughput did not recover.
Hand-made Tools
• DR Gigabit Network Analyzer: needs accurate timestamps (100 ns accuracy); dumps full packets.
• Comet Delay and Drop: a pseudo Long Fat Pipe Network (LFN).
On Gigabit Ethernet, a packet is sent every 12 μsec.
Programmable NIC (Network Interface Card)
DR Giga Analyzer
Comet Delay and Drop
Unstable Throughput
• In our long-distance data transfer experiments, throughput ranged from 8 Mbps to 120 Mbps
(when we used a Gigabit Ethernet interface).
Fast Ethernet is very stable
Analysis of a single stream: number of packets, with 200 msec RTT
Packet Distribution
[Graph: number of packets per msec vs. time (sec)]
Packet Distribution of Fast Ethernet
[Graph: number of packets per msec vs. time (sec)]
Gigabit Ethernet interface vs. Fast Ethernet interface
Even at the same "20 Mbps", the behavior of 20 Mbps on a Gigabit Ethernet interface and 20 Mbps on a Fast Ethernet interface is completely different.
Gigabit Ethernet is very bursty; routers might not like this.
2 problems
• Once packets are sent in a burst, routers sometimes cannot bear it (unlucky streams are slow, lucky streams are fast), especially when the bottleneck is below a gigabit.
• More than 80% of the time, the sender does not send anything.
Problem of implementation
At 1 Gbps, supposing 1500-byte Ether packets, one packet should be sent every 12 μsec.
On the other hand, the UNIX kernel timer is 10 msec.
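The arithmetic behind those two numbers:

```python
# Why software pacing with the kernel timer cannot space out packets at 1 Gbps.

LINK_BPS = 1e9          # Gigabit Ethernet
FRAME_BYTES = 1500      # Ether packet size assumed on the slide
TIMER_S = 0.010         # classic UNIX kernel timer tick (10 ms)

interval = FRAME_BYTES * 8 / LINK_BPS
print(f"one packet every {interval * 1e6:.0f} us")          # 12 us
print(f"packets per timer tick: {TIMER_S / interval:.0f}")  # ~833
# A 10 ms timer can only release packets in bursts of ~833, not one per 12 us.
```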
IPG (Inter-Packet Gap)
• The transmitter is always on; when no packet is being sent, it is in the idle state.
• Each frame is followed by an IPG of at least 12 bytes (IEEE 802.3).
• Tunable via the e1000 driver (8 bytes – 1023 bytes).
IPG tuning for short distance

                     IPG 8 bytes    IPG 1023 bytes
  Fast Ethernet      94.1 Mbps      56.7 Mbps
  Gigabit Ethernet   941 Mbps       567 Mbps

Supposing the Ether frame is 1500 bytes, 1508 : 2523 is approximately 567 : 941.
These match theory (Gigabit Ethernet has already been perfectly tuned for short-distance data transfer).
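A quick check of that ratio, using only the slide's numbers:

```python
# Check the slide's ratio: throughput scales as (frame + IPG_a) / (frame + IPG_b).

FRAME = 1500                                   # bytes, as assumed on the slide

predicted_ge = 941 * (FRAME + 8) / (FRAME + 1023)
print(f"Gigabit Ethernet at IPG 1023: {predicted_ge:.0f} Mbps")   # ~562
# Close to the measured 567 Mbps, i.e. 1508 : 2523 ~ 567 : 941.

predicted_fe = 94.1 * (FRAME + 8) / (FRAME + 1023)
print(f"Fast Ethernet at IPG 1023: {predicted_fe:.1f} Mbps")      # ~56.2
```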
IPG tuning for Long Distance
Max, min, average, and standard deviation of throughput
[Graph: Fast Ethernet]
Some patterns of throughput change
Detail (Slow Start Phase)
Packet Distribution
But
• These are like ad-hoc patches.
What is the essential problem?
One big problem
• No good MODEL exists. The old type of MODEL does not work well, such as:
  – queueing theory: M/M/1
  – packet distribution: Poisson distribution
Experiment says they are not good.
Currently, simulation and using the real network are the only ways to check.
(No theoretical background.)
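For reference, the old model being criticized: in an M/M/1 queue (Poisson arrivals, exponential service) the mean sojourn time is 1/(μ − λ). A toy calculation with illustrative rates; measured Gigabit Ethernet traffic is far burstier than Poisson, which is why such formulas mispredict:

```python
# Textbook M/M/1 queue: mean number in system and mean sojourn time.
# Poisson arrivals (rate lam), exponential service (rate mu), utilization rho.

def mm1(lam: float, mu: float) -> tuple[float, float]:
    assert lam < mu, "queue is unstable when lam >= mu"
    rho = lam / mu
    mean_in_system = rho / (1 - rho)   # L = rho / (1 - rho)
    mean_sojourn = 1 / (mu - lam)      # W = L / lam (Little's law)
    return mean_in_system, mean_sojourn

# Illustrative: packets arriving at 70% of a router port's service rate.
L, W = mm1(lam=70_000, mu=100_000)     # packets/sec
print(f"mean occupancy: {L:.2f} packets, mean delay {W * 1e6:.0f} us")
```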
What is the difference from the telephone network?
AUTONOMY
For the telephone network:
• The telephone company knows, manages, and controls the whole network.
• End nodes don't have to do heavy jobs, such as congestion control.
Current Trend(?)
• Analyze the NETWORK using game theory
• Nash equilibrium