
Architectural approach to the role of optics in monoprocessor and multiprocessor machines

Jacques Henri Collet, Daniel Litaize, Jan Van Campenhout, Chris Jesshope, Marc Desmulliez, Hugo Thienpont, James Goodman, and Ahmed Louri

The relevance of introducing optical interconnects (OI’s) in monoprocessors and multiprocessors is studied from an architectural point of view. We show that perhaps the major explanation for why optical technologies have nearly been unable to penetrate into computers is that OI’s generally do not shorten the memory-access time, which is the most critical issue for today’s stored-program machines. In monoprocessors the memory-access time is dominated by the electronic latency of the memory itself. Thus implementing OI’s inside the memory hierarchy without changing the memory architecture cannot dramatically improve the global performance. In strongly coupled multiprocessors the node-bypass latency dominates. Therefore the higher the connectivity (possibly with optics), the shorter the path to another node, but the more expensive the network and the more complex the structure of electronic nodes. This relation leaves the choice of the best network open in terms of simplicity and latency reduction. The bottlenecks resulting from and the benefits of implementing OI’s are discussed with respect to symmetric multiprocessors, rings, and distributed shared-memory supercomputers. © 2000 Optical Society of America

OCIS codes: 200.4650, 200.2610.


1. Introduction

Although numerous studies are in progress worldwide for developing short-distance optical interconnects (OI’s), it clearly emerges from the literature that most of them are based on technological arguments and that global operation of the targeted architecture has not been fully analyzed. The proceedings of the conferences that constitute Refs. 1 and 2 provide a nonexhaustive presentation of the current state of the art in this field. Many studies attempt to improve some part of the machines but offer no assurance that this progress improves the global operation of the whole system.

In this study we follow a different approach, which consists of analyzing the role of short-range OI’s in monoprocessor and multiprocessor machines from an architectural point of view. Most of the discussion throughout the paper focuses on the relevance of OI’s in reducing memory-access latency (MAL), which is the most critical and permanent issue in stored-program computer architectures. The consideration of optics leads to the following paradox: On the one hand, OI’s extend the communication bandwidth but generally do not directly change (or address) the access latency to the memory, which is dominated by electronic processing times in both monoprocessors and tightly bound multiprocessors. On the other hand, OI’s may increase the interprocessor network connectivity (thus reducing the path that separates remote nodes) at the expense of shifting most of the latency problem to the electronic domain and, in particular, to the design of efficient node circuits. The choice of the best network is therefore an open issue in terms of implementation simplicity and latency reduction. These difficulties may explain in part why OI’s today are not deployed in general-purpose machines and are considered for use potentially in dedicated processors (that do not execute instructions or fetch data stored in a memory).

J. H. Collet is with the Laboratoire d’Analyse et d’Architecture des Systèmes du Centre National de la Recherche Scientifique, 7 avenue du Colonel Roche, F-31077 Toulouse, France. D. Litaize is with the Institut de Recherche en Informatique, Université Paul Sabatier, 118 route de Narbonne, F-31062 Toulouse, France. J. Van Campenhout is with the Vakgroep Universiteit Gent, St. Pietersnieuwstraat 41, B-9000 Gent, Belgium. C. Jesshope is with the Institute of Information Sciences and Technology, Massey University, New Zealand. M. Desmulliez is with the Department of Computing and Electrical Engineering, Heriot-Watt University, Riccarton EH14 4AS, UK. H. Thienpont is with the Laboratory for Photonics, Vrije Universiteit Brussel, Pleinlaan 2, B-1050 Brussels, Belgium. J. Goodman is with the Department of Computer Sciences, University of Wisconsin, Madison, Wisconsin 53706. A. Louri is with the Department of Electrical and Computer Engineering, University of Arizona, Tucson, Arizona 85721.

Received 26 May 1999; revised manuscript received 8 September 1999.
0003-6935/00/00671-12$15.00/0
© 2000 Optical Society of America

10 February 2000 / Vol. 39, No. 5 / APPLIED OPTICS 671

The contents of this paper are as follows: The current state of the application of optical technologies to communication and interconnection networks is reviewed in Section 2. The memory issue and its influence on the architecture of today’s machines are reviewed in Section 3. In Section 4, we discuss the relevance of implementing OI’s in monoprocessors. No dramatic increase in global performance is expected in such systems, as the intrinsic memory latency is the dominant latency. The potentially stronger impact of OI’s in multiprocessor machines is discussed in Section 5. Simplicity of implementation is often preferred for small multiprocessor machines, making the introduction of OI’s particularly attractive in symmetric multiprocessors (SMP’s) and ring architectures for connecting approximately 100 nodes. In these architectures OI’s can provide a huge bandwidth that can minimize contention latency (related to the traffic saturation), as is also explained in Section 4, while they maintain the simplicity of the node structure. In supercomputers, providing a global shared view on a physically distributed memory places a heavy burden on the interconnection network and, in particular, on the development of low-latency high-connectivity electronic nodes. The introduction of OI’s in new reconfigurable architectures and in dedicated processors is discussed briefly in Sections 6 and 7, respectively.

2. Brief Review of the Role of Optics in Communications Networks

The current state of the competition between optics and electronics for the processing and the transmission of information is reviewed here as a function of the communication distance.

A. Telecommunications Networks

Optical communications have won the battle for long-distance transmission in wide-area networks (WAN’s) and metropolitan-area networks (MAN’s). There are at least three reasons for this: (1) The bandwidth limitation of OI’s is much less pronounced than that of electrical transmission,3 losses are much lower, and in future systems the effects of nonlinear dispersion can be countered by use of solitons. (2) Parallel transmissions are not usable over long distances because skew makes the synchronization of the different reception channels particularly complex. (3) Multiwavelength (optical) transmissions make possible the extension of the transmission bandwidth at almost no cost to the network infrastructure. Thus one permanent objective of long-distance communications consists of increasing the transmission bandwidth through a single monomode fiber.


B. Local-Area Networks

Local-area networks were first designed for data transmission between computers. We can distinguish between company networks and industrial networks that operate in a hostile environment with real-time constraints. Each computer (PC, workstation, etc.) in a company network is connected to a hub through a few tens of meters of links and operates with an ethernet protocol. Hubs themselves are interconnected by means of high-throughput links that operate under various protocols such as the ethernet, the fiber-distributed data interface, and ATM. Links from computers to hubs generally are implemented with the preinstalled metallic cables of the building network, whereas serial optical (i.e., fiber-distributed data interface) links are used mostly for connections between hubs. Industrial networks also exhibit a hierarchical structure. Each level may use a specific field bus, such as the Profibus,4 the Fieldbus,5 the controller area network (developed by Bosch for cars6), aviation industry standards like the Aeronautical Radio, Inc., Model 429 or Model 629 (developed by the Airline Electronic Engineering Committee), the manufacturing automation protocol (developed by General Motors), or the interbus S.7

C. Short-Distance Communications and Interconnects

The transition from serial telecommunications to the computer world (dominated by parallel interconnects) occurs at short distances that extend over a few meters. A large bandwidth is needed for computer clusters and multiprocessor interconnects, as, for instance, in the CRAY Model T3E,8 the IBM Model SP2,9 the Intel Model Paragon,10 and the Silicon Graphics Model Origin systems.11 Electronic parallel interconnects dominate because they allow, across a few meters, the extension of the global bandwidth without increasing the operation frequency. These interconnects are generally a cheap solution. However, several networks that initially were implemented with parallel electrical links now offer serial (or parallel) optical alternatives for increasing the bandwidth and the transmission distances. Some of these networks are HIPPI (high-performance parallel interface) at 6.4 Gbits/s,12 SCI (scalable coherent interface) at 1.6 Gbits/s,13 and Myrinet at 1.28 Gbits/s.14 These networks make possible the building of multicomputers for supporting cluster computing, which is an area in which there is much experimentation at the moment. Cluster computing is motivated partly by the concern to develop more modular, low-cost hardware that would simplify maintenance and compatibility issues for manufacturers. However, it must be stressed that cluster computing is suitable for some but not all applications because the latency of internode communications becomes extremely long (with respect to the processor cycle, which is currently 1 ns) when the intercomputer distance attains a few meters (1 m translates to 5 ns). Therefore a distributed system will execute numerous applications much more slowly (especially applications with a distributed memory and those that require numerous internode exchanges) than does a tightly bound multiprocessor enclosed in a single cabinet.
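The propagation penalty quoted above can be checked with a short calculation. The following Python sketch is only an illustration: it uses the two figures given in the text (about 5 ns of propagation per meter and a 1-ns processor cycle), whereas the function name and the sample link lengths are assumptions introduced for the example.

```python
# Illustrative sketch; the two constants are the figures quoted in the text.
PROPAGATION_NS_PER_M = 5.0  # "1 m translates to 5 ns"
CYCLE_NS = 1.0              # processor cycle, "currently 1 ns"


def cycles_lost(distance_m, round_trip=False):
    """Processor cycles spent on pure signal propagation over a link."""
    one_way_ns = distance_m * PROPAGATION_NS_PER_M
    total_ns = 2 * one_way_ns if round_trip else one_way_ns
    return total_ns / CYCLE_NS


if __name__ == "__main__":
    # Hypothetical link lengths in meters (not taken from the paper).
    for d in (0.3, 1.0, 3.0):
        print(f"{d} m: {cycles_lost(d):.1f} cycles one way, "
              f"{cycles_lost(d, round_trip=True):.1f} cycles round trip")
```

Under these figures a remote access across a few meters costs tens of processor cycles in propagation alone, before any routing or memory latency is counted.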

D. Ultrashort-Distance Interconnects

At present ultrashort-distance interconnects ranging from a few centimeters to a few tens of centimeters are in the electronic domain. The machines under consideration here are monoprocessors (PC’s and workstations) or SMP’s such as the Silicon Graphics Model Power Challenge11 and the Sun Model Enterprise.15 In these machines the communication latency is never controlled by the propagation (internode or interunit) but by electronic terms (memory latency, routing time, etc.). This control constitutes an essential difference from the distributed systems described in Subsection 2.C. The bandwidth extension has been achieved to date by the increasing of transmission parallelism and by the replacement of shared buses (which are multipoint electrical lines) with dedicated point-to-point parallel interconnections. For instance, in the Pentium architecture data are transmitted between the memory controller and the chip set through a 64-bit-wide bus at 100 MHz and possibly will be transmitted in the future with 128 bits to attain 12.8 Gbits/s.16
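The bus figures above follow from multiplying the bus width by the clock rate. The sketch below reproduces that arithmetic; it assumes one transfer per clock cycle (an assumption, since the text does not state the signaling scheme), and the function name is hypothetical.

```python
def bus_bandwidth_gbits(width_bits, clock_mhz):
    """Peak bandwidth in Gbits/s, assuming one transfer per clock cycle."""
    return width_bits * clock_mhz * 1e6 / 1e9


if __name__ == "__main__":
    # 64-bit bus at 100 MHz (the Pentium example in the text).
    print(bus_bandwidth_gbits(64, 100))
    # 128-bit bus at the same clock, the future case quoted at 12.8 Gbits/s.
    print(bus_bandwidth_gbits(128, 100))
```

Doubling the width at a fixed clock doubles the peak bandwidth, which is how parallelism extends bandwidth without raising the operating frequency.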

E. Intrachip and Multichip-Module Interconnects

By far most of the communications at this level are in the electronic domain. However, an optical clock distribution exists at the interboard level in the CRAY Model T-3D (Ref. 17), and research is being carried out to extend this technology to the intraboard level for the CRAY Model T-90.18 On the research side studies are in progress for the construction of optical backplanes19,20 and on micro-optical components for interchip and intrachip communications, for example, at Vrije Universiteit Brussel.21,22

F. Optics in Logic-Level Processing

Most of the all-optical computing studies launched in the middle of the 1980’s have been reoriented toward special-purpose systems because (1) the dramatic increase in electronic processing power has progressively eliminated many arguments that favored optical binary processors (switching time, switching energy) and (2) the performance of optical numerical demonstrators comparatively has stagnated. Today, interest in optical processing seems to be limited to dedicated processors.

Parallel optical I/O can be exploited for information transport in smart pixels, early vision processing, artificial retina applications,23 and database and symbolic applications.24 The low-level processing in the fields of pattern and image recognition might take advantage of the potential of optics for the extremely fast implementation of simple Boolean functions, as has been demonstrated by recent experiments that were carried out in the field of ultrafast optics.25

3. Optical Technologies and the Memory Issue

A. General Remarks

In Section 2, we showed that OI’s have ruled the telecommunications world by the replacement of electrical links with optical serial links without affecting the communication protocols or the network architectures. They will likely invade local-area networks similarly. The insertion of OI’s into computers is much more problematic despite numerous advantages that favor optical technologies.26 Some of these arguments are listed below:

• Optical technology can provide extremely high bandwidth almost independently of the interconnect length at the length scales considered.

• OI’s have advantages in terms of weight and volume. The interconnect density is potentially higher because of (1) the much lower interference of optical signals, either in free space (electron beams versus light beams) or when guided (mutually interfering conductors versus optical fibers), and (2) the fact that each optical communication channel requires a cross section of the order of the wavelength of the light used. Hence per given total cross section and interconnect length a larger number of interconnects is possible optically. Connection of high-density OI’s is much less bulky than that of electrical interconnects.

• Broadcasting is feasible because of the capability of high fan-out. However, high fan-out requires high power, which introduces some limitations, such as an increase in the latency.

• Wavelength-division multiplexing can be used to increase bandwidth and achieve special advantages. Although it requires tunable light sources, there is no interference between wavelengths; in terms of networking, add–drop capabilities and wavelength routing are possible.

• Optics is at least competitive in terms of power- and control-supply requirements: The speed–power product of advanced centimeter-range OI’s is becoming better than that of electrical ones.27–29

• Optics may be capable of the rapid reconfiguration of static interconnection patterns. Light polarization also makes possible switching and fast configuration routing. Apart from power losses, the means needed for optical reconfiguration do not impair signal quality or even latency, as is often the case with electrical switching or reconfiguration.

• OI’s provide galvanic isolation between interconnected subsystems. This isolation leads to improved noise immunity and security for applications that monitor high-voltage systems.

With all the above arguments, how does one explain the paradox that optical technologies have so far hardly gained application in computers? We feel that the underlying reason is that optics generally does not directly shorten the access time to the memory, which is the most crucial issue for stored-program machines. Therefore it is extremely difficult to translate the numerous technological arguments that claim optical advantages into an optical architecture (or a demonstrator) that might effectively overcome electronic computers (except for some dedicated applications that are related to vision or image processing). We briefly review the memory problem in Subsection 3.B.

B. Memory Issues and the Architectural Evolution of Machines

The evolution of stored-program computers has repeatedly been influenced by the fact that the performance of microprocessors has increased much more rapidly than has that of memory systems. In early microprocessor systems (i.e., in the 1970’s), processors operated at approximately the same rate that memory could be cycled, and the processor was connected directly to the memory system [dynamic RAM30 (DRAM)]. This is no longer the case. Whereas processor speeds have been doubling every few years, dynamic memory has increased in speed only marginally over the past two decades, although its size has also doubled every few years. This lack of improvement in speed means that the time needed by the processor to fetch instructions or data from the main memory has steadily increased compared with the processor-cycle time, making direct exchanges with the main memory more and more penalizing for the global performance of the computer.

Latency is the key parameter of memory–processor interactions (much more so than the bandwidth, as in telecommunications) because the processor exchanges very short bursts of information (usually at least one word, i.e., 4 bytes, and more often a cache block at a time, i.e., 32 or 64 bytes). The processor never establishes a steady communication stream with the memory. The major limitation to the speed of DRAM is in the circuits used for detecting the stored charge on a memory cell. There is a trade-off between the size of the memory and the rate at which this tiny stored charge can be sensed. Static RAM (SRAM), on the other hand, is optimized to be significantly faster than DRAM, although SRAM speed is achieved at the cost of a larger memory cell and therefore a significantly reduced memory size (or an increased memory cost). For this reason most desktop and server systems use DRAM memories to maintain large memory size at modest cost. The architecture of chips and machines has evolved continually to maximize global performance despite the degradation of the MAL. A typical example is shown in Fig. 1 for the PC architecture. The evolution of PC’s and multiprocessor machines has consisted primarily of hiding the MAL with hardware or software solutions because it has been impossible to change the memory technology. Therefore:

• Modern computer architectures use a complex hierarchy of memories. Two on-chip level 1 (L1) caches are provided in pipelined architectures (one for data and one for instructions) because data and instructions must be read concurrently. These caches typically are 32–64 kbytes in size (128 kbytes are expected in the next processor, Model K7, from AMD). L1 caches can then be connected to a level 2 (L2) cache, also on chip, or to a much larger off-chip cache. The latter typically is approximately 1 Mbyte in size. If there is an L2 on-chip cache, there may also be a level 3 (L3) cache off chip. The off-chip cache will then be connected to the main memory. Caches work by exploitation of the locality of references in access to data, either spatially, when data adjacent to those recently used are likely to be used, or temporally, when data recently used are likely to be used again. The aim of a cache is to make the memory system appear to be as large as the largest component and to appear as fast as the fastest component. Unfortunately, when the cache system does not work well through a lack of locality the slowdown is severe because the DRAM memory-access time is at least an order of magnitude larger than the processor’s cycle time.

• Current microprocessors are designed to hide the latency associated with a memory fetch. Techniques used to tolerate high-latency memory include speculative execution in which the results of branches and even data values are predicted. When a misprediction occurs, data generated along wrong branch paths or based on mispredicted values must be cleaned up and the original state restored. Another technique used is out-of-order execution in which instructions start and even terminate before previous instructions in the instruction stream. Operands of instructions that have been completed out of order must be held in renaming buffers prior to being retired. Thus operands to dependent instructions must then be retrieved from these registers and not the registers indicated by the instruction. This process requires tables for register-renaming results that have not been retired as well as memory for data-flow matching to determine which instructions can be executed. The prediction and the clean-up logic and the additional registers and tables used in out-of-order execution mean that modern microprocessors are very complex. Less complex mechanisms exist for tolerating high latency into main memory, such as microthreading or multithreading.31,32

Fig. 1. Current PC architecture: Note the two levels of caches (L1 and L2) and the absence of the shared-memory bus, which is replaced with dedicated point-to-point connections between the logic core, the graphic processor, the DRAM controller, and the L2 cache. Note also that the number of pins in the core logic is greater than 500! AGP, accelerated graphics port; PCI, peripheral component interconnect; MB, megabytes.
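The sensitivity of a cache hierarchy to locality, described in the first item above, can be illustrated with the standard average-memory-access-time relation (hit time plus miss rate times miss penalty), applied once per cache level. This relation is a textbook model rather than a formula taken from this paper, and every number in the sketch is a hypothetical value chosen only to respect the order-of-magnitude gap between DRAM access time and processor cycle time mentioned in the text.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Textbook model: hit time plus miss rate times miss penalty."""
    return hit_time + miss_rate * miss_penalty


def two_level_access_time(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, dram):
    """Average access time (in cycles) seen by the processor with L1 and L2."""
    l2_time = average_access_time(l2_hit, l2_miss_rate, dram)
    return average_access_time(l1_hit, l1_miss_rate, l2_time)


if __name__ == "__main__":
    # All values are hypothetical, in processor cycles.
    good = two_level_access_time(1, 0.05, 10, 0.20, 100)  # good locality
    poor = two_level_access_time(1, 0.50, 10, 0.80, 100)  # poor locality
    print(f"good locality: {good:.1f} cycles per access")
    print(f"poor locality: {poor:.1f} cycles per access")
```

With these assumed values the average access time stays within a few cycles when locality is good and grows by more than an order of magnitude when it is poor, which is the severe slowdown the text refers to.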

In addition to these latency-tolerant techniques most processors attempt to issue more than one instruction in a single cycle by use of multiple-execution units. This process is meant to increase throughput for a given clock cycle, again at the expense of complexity. Most recent microprocessors have at least four-way issue but seldom achieve an effective instructions-per-cycle count of greater than two. These general considerations hold for any computer, with specific constraints that depend on the number of processors and on the communication network of the machine.

C. Evolution or Revolution of the Architecture?

Two approaches prevail with respect to the evolution or revolution of architectures (with some possible intermediate points of view):

1. The first approach, which we call the evolutionary approach, consists of trying to integrate optical communication systems in forthcoming machines. This approach requires the analysis of communication bottlenecks in existing or future computers and the capability of optical communications to solve these problems (i) with much more effectiveness than electronic solutions and (ii) with a good chance to reach a cost-effective mass production. Thus this approach tries to predict the role of optical communications in the next 5–10 years, starting with the present state.

2. The second approach, which we tentatively call the mutational approach, considers that optics might induce new computer architectures with outperforming specifications that will justify abandoning (or at least dramatically modifying) existing electronic solutions. This is a more speculative approach about the possible long-term evolution of computer architectures. Note that it does not release designers from having to know quite well the state of the art of existing electronic architectures that cannot be reduced to the pure communication aspects if they are to propose and demonstrate the advantages of the new optical solutions.

In the rest of the paper, we follow the evolutionary approach, as it is much less risky than the mutational approach and can be used as a reference for appreciating the relevance of more advanced proposals. Architectures are analyzed in ascending order of the number N of connectable processors. Although hundreds of network topologies have been proposed, there are only a handful of commercial implementations, which reduce to mostly buses, rings, meshes, tori, and central switches. Thus we begin with the monoprocessor. Then we consider SMP’s (say, typically 2 < N < 32–64), rings (say, potentially 10 < N < 100), and supercomputers that could connect up to several thousands of processors.

4. Optical Interconnects in Monoprocessors

What could be the role of optical interconnects in monoprocessor machines? The registers are linked to L1 caches, the L1 to the L2 cache, the L2 to the memory or in some machines to the L3 cache, and the L3 cache to the memory. The L1 and the L2 caches, which are often integrated in the processor chip, are built with SRAM’s and can operate at the processor frequency. Inserting OI’s at this level (i.e., between the register and L1 or between L1 and L2) increases the transfer latency (because of the optoelectronic conversion time) and degrades processor performance.33

Perhaps the integration of optical communications ought to be considered for the longest distances in the memory hierarchy, namely, between the last cache level (considered to be L2 in the following) and the main memory. Two terms contribute to the MAL, namely,

• The intrinsic MAL, which is the leading term to the MAL and depends on the internal architecture and on the technology of the DRAM’s.

• The communication latency between L2 and the memory.

The need for memory bandwidth in future machines will grow dramatically owing to the increase of processor-operation frequency and to foreseeable architectural evolutions such as the extension of instruction-level parallelism of processors and the use of higher-order nonblocking caches. Today, processors run at nearly 700 MHz and issue as many as 4 instructions/cycle. In the years 2005–2010 operation at nearly 4 GHz is expected with the execution of 32 instructions per cycle, corresponding to a sole instruction bandwidth B0 (between the processor and L1) of the order of 400 Gbytes/s. The two levels of cache will substantially reduce the main-memory access to a few percent of B0, but, all in all, a bandwidth in the range of 10–20 Gbytes/s is expected between L2 and the memory! Introducing OI’s (operating at approximately 1–2 GHz) might exhibit some advantages here (see the arguments listed in Subsection 3.A), but ultimately the role of OI’s will depend on the evolution of chip packaging and motherboard technology. The point is that no major problems are envisioned by electronics engineers in moving out to GHz-class processors with electronic signaling in monoprocessor and tightly coupled architectures, as explained below.

Neither bus frequency of the order of 500 MHz nor bus width in the 500-pin range is seen as an insurmountable obstacle. The memory bus at this end of the market is not a significant issue, as its cycle time is limited by the DRAM speed. The effects of high latency can be mitigated by the provision of very wide electrical buses that carry as many as 512 bits (plus error-correction bits) of data simultaneously. Possibly this broad bus might be split into several independent narrower buses (say, 64 bits wide) to make access parallel to different memory banks. These solutions might require new chips with several thousands of pins, but current mass-produced devices including from 2500 to 10,000 pins are already available in research34 with possibly a role for optical interconnects.

In summary, the possible introduction of OI’s (between the registers and L1, L1 and L2, or L2 and the main memory) seems hypothetical because (1) the memory-access time is dominated by the intrinsic memory latency and optical communications (whatever their bandwidth is) do not change this issue, and (2) the bandwidth challenge between L2 and the memory, which is in the range of 10 Gbytes/s, seems accessible to the forthcoming electronic packaging and to the motherboard technology. Electronic solutions will likely suffice for building cheap and efficient monoprocessor machines within the next 10 years.
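The bandwidth estimates in this section can be reproduced with a rough calculation. The sketch below is an illustration under stated assumptions, not the authors’ own derivation: it assumes 4 bytes per instruction (the word size quoted in Subsection 3.B), interprets "a few percent" as 2–4%, and combines the 512-bit bus width with the 500-MHz bus frequency, a pairing that the text does not make explicitly. All function names are hypothetical.

```python
def instruction_bandwidth(clock_hz, instr_per_cycle, bytes_per_instr=4):
    """Processor-to-L1 instruction bandwidth in bytes/s.

    bytes_per_instr=4 is an assumption (one 4-byte word per instruction).
    """
    return clock_hz * instr_per_cycle * bytes_per_instr


def memory_traffic(b0, fraction_past_caches):
    """Bandwidth in bytes/s that still reaches main memory after L1 and L2."""
    return b0 * fraction_past_caches


def bus_bandwidth(width_bits, clock_hz):
    """Peak bus bandwidth in bytes/s, assuming one transfer per cycle."""
    return width_bits / 8 * clock_hz


if __name__ == "__main__":
    b0 = instruction_bandwidth(4e9, 32)  # 4 GHz, 32 instructions per cycle
    print(f"B0 = {b0 / 1e9:.0f} Gbytes/s")
    for fraction in (0.02, 0.04):        # assumed reading of "a few percent"
        traffic = memory_traffic(b0, fraction)
        print(f"{fraction:.0%} of B0 = {traffic / 1e9:.1f} Gbytes/s")
    bus = bus_bandwidth(512, 500e6)      # assumed pairing of width and clock
    print(f"512-bit bus at 500 MHz = {bus / 1e9:.0f} Gbytes/s")
```

Under these assumptions B0 comes out at a few hundred Gbytes/s, the traffic that reaches the memory falls in the 10–20 Gbytes/s range, and a wide electrical bus covers that range, which is consistent with the conclusion drawn in the text.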

5. Optical Interconnects in Multiprocessor Architectures

A. General Remarks

The logical way to reach a performance level not accessible with a monoprocessor consists of connecting several processors through an interconnection network, thus building a multiprocessor system. As was stressed in Section 4, the MAL is a critical parameter for the performance of the machine. In multiprocessors it can increase drastically and can be as much as 3 orders of magnitude larger than the processor cycle time in the worst case, in particular, when the memory is distributed in a number of processors (or clusters of processors) and the execution of the application necessitates numerous internode transfers. It is possible to distinguish (at least) five contributions to the latency, namely,

(1) The intrinsic memory latency already mentioned in Section 4.

(2) The software latency (communication overhead) associated with formatting, sending, and receiving messages. It is not clear at this time how optics might reduce software latency. A comprehensive study of the effect of the high communication bandwidth capability on the overall communication latency has not yet been done. Are there possible architectural innovations in interprocessor communications with OI’s that will eliminate or largely diminish the effect of software latency?

(3) The (network) propagation latency, which depends on the network topology and the processing overhead for routing and solving contention problems. This latency is discussed extensively below.

(4) The (network) contention latency, which critically depends on network saturation. The network latency is the sum of the propagation and the contention latencies.

(5) The coherence latency in which maintaining the coherence of the caches in tightly bound machines requires broadcasting (or multicasting) coherence messages through the communication network. This coherence maintenance also contributes to slowing down the memory access. This factor strongly depends on the network topology. Snooping protocols, usually implemented with buses, are much simpler and faster than directory-based protocols that have to be implemented in distributed networks.30

Which types of OI’s (topologies, technologies, packaging schemes, etc.) are the most suitable in the short term as well as in the long term? First, it is clear that network latency is the dominant term in existing multiprocessors (in a monoprocessor this is the intrinsic memory latency). Second, the network latency in strongly coupled multiprocessors is dominated by the bypass time of electronic nodes and not by the internode-propagation time (IPT). This is a key difference from the telecommunications networks because of the short internode distance in the systems under consideration [i.e., a few tens of centimeters (see Subsection 2.D)] with an IPT in the range of a few nanoseconds. Optics can provide high connectivity, but increasing the connectivity generates new issues and must be used parsimoniously, as discussed below:

• When the connectivity is low [1 for the unidirectional ring shown in Fig. 2(a)] node processing is simple (only the add–drop of information in the network, possibly error detection and correction, and priority treatment) but is still longer than the IPT. The drawback of unidirectional rings is that the number of nodes a message must pass through in its round trip to remote memory (RTRM) equals the total number of nodes.

• Increasing the connectivity is therefore particularly attractive {by use of meshes, tori, hypercubes, etc. [see Fig. 2(b)]} because the average internode distance decreases. Unfortunately, two new issues arise:

(1) The average internode distance generally decreases sublinearly versus the connectivity, whereas the network complexity increases linearly. Thus increasing the connectivity sooner or later becomes prohibitively expensive. Let us consider an example for clarity, namely, the connection of $N = 256$ nodes with $k$-dimensional bidirectional tori. The average internode distance is $D(k) \approx (k/4) N^{1/k}$, and the number of connections is $N_C(k) = Nk$. For $k = 2$ [a two-dimensional (2-D) mesh], $k = 3$ [a three-dimensional (3-D) torus], and $k = 4$, we get $D(2) \approx 8$, $D(3) \approx 4.8$, and $D(4) \approx 4$, respectively. Therefore, from $k = 2$ to $k = 3$, $D$ is divided by 1.66, and from $k = 3$ to $k = 4$ by 1.2. But simultaneously the number of connections is multiplied by 1.5 and 1.33, respectively, showing that increasing the connectivity becomes more and more costly.

(2) The node-bypass latency $T(k)$ increases with the connectivity because of the increase in node complexity. The increase of $T(k)$ depends critically on how sophisticated the implementation of the node switch can be, depending on the operation frequency, the parallelism of transmissions, the communication protocol, the routing algorithm, the admissible cost, etc. A simple solution consists of decomposing a $k$-dimensional switch into $k$ successive one-dimensional (1-D) switches optimized for straight traffic. In that case the average internode latency $L(k)$ of bidirectional tori scales as $L(k) \approx D(k) + (k - 1) = (k/4) N^{1/k} + (k - 1)$, where $D(k)$ is the average number of bypassed nodes. The second term accounts for the increase of the node latency. Again, with $N = 256$ and $k = 2, 3, 4$, we deduce that $L(2) \approx 9$, $L(3) \approx 6.8$, and $L(4) \approx 7$, respectively, showing that increasing the connectivity from three to four does not shorten the average internode latency. The conclusion here is that the topological arguments on the benefits of increasing the connectivity in strongly coupled multiprocessor networks cannot be separated from analyzing the complexity of the node-routing implementation.

Fig. 2. Three networks for connecting $N = 27$ nodes: (a) A unidirectional ring that requires that all the nodes (27) be bypassed in a RTRM. The node is a 2 × 2 switch. (b) A 3-D torus with an average RTRM of approximately 4. This second network is approximately 7 times faster in terms of the hop number than that shown in (a), but the node becomes a 7 × 7 switch. The increase in the node-bypass time depends critically on the switch design. (c) A fully interconnected network. For clarity, the connections of only two nodes are drawn. The RTRM reduces to 1 hop, but the input structure of each node becomes an N-to-1 multiplexer that must operate in the asynchronous mode to reduce the MAL.

• If the network is fully interconnected [Fig. 2(c)] the internode distances are minimal. However, this network topology transfers most of the latency-reduction challenge to the design of the node-input circuits that must multiplex N asynchronous inputs and serialize access to node memory. Preserving the memory-ordering constraints also seems to be a particularly complicated issue for a fully connected shared-memory machine. Perhaps the input block of each node of a fully connected network might be seen as a common bus shared by the N inputs with a snooping protocol to maintain the coherence.

Therefore finding the connectivity that provides the lowest network latency (i.e., the propagation latency in our classification at the beginning of this subsection) for a given number N of nodes is an open and complicated problem.35 When N is very large (say, several thousand, as in a supercomputer) the hypercube topology may be attractive because the connectivity scales as $\log_2 N$ (see Refs. 36 and 37 for comparisons with other topologies). But this solution is obviously expensive. For instance, with $N = 1024$, twelve links per node chip are necessary, i.e., $\log_2 P + 2$ links/chip, requiring approximately 800 pins for a transmission parallelism of 64. For smaller machines, say, N less than 64 or 128, simplicity may be favored by the use of shared buses or rings. The latencies of the memory or the network are purely electronic. OI’s can contribute to eliminating the contention latency by the provision of a huge bandwidth. This can be decisive for machines based on 1-D interconnection networks such as SMP’s or rings (see Subsections 5.B and 5.C).
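The torus and hypercube figures quoted in this subsection can be reproduced with a short numerical sketch. The Python below only illustrates the scaling relations given above for bidirectional tori (average distance, number of connections, and average latency); the function names are hypothetical, latency is expressed in units of one node bypass, and the pin estimate assumes one pin per transmitted bit per link, which the text does not state explicitly.

```python
import math


def avg_distance(n_nodes: int, k: int) -> float:
    """Average internode distance D(k) ~ (k/4) * N**(1/k) of a k-dimensional bidirectional torus."""
    return (k / 4) * n_nodes ** (1 / k)


def num_connections(n_nodes: int, k: int) -> int:
    """Number of connections N_C(k) = N * k."""
    return n_nodes * k


def avg_latency(n_nodes: int, k: int) -> float:
    """Average internode latency L(k) ~ D(k) + (k - 1), in node-bypass units."""
    return avg_distance(n_nodes, k) + (k - 1)


def hypercube_pins(n_nodes: int, parallelism: int, extra_links: int = 2) -> int:
    """Pins per node chip for (log2(N) + extra_links) links, assuming one pin per bit per link."""
    links = int(math.log2(n_nodes)) + extra_links
    return links * parallelism


if __name__ == "__main__":
    N = 256
    for k in (2, 3, 4):
        print(k, round(avg_distance(N, k), 2), num_connections(N, k), round(avg_latency(N, k), 2))
    print(hypercube_pins(1024, 64))
```

Under these assumptions the sketch gives D of about 8, 4.8, and 4 and L of about 9, 6.8, and 7 for k = 2, 3, 4 with N = 256, and 12 × 64 = 768 pins for the 1024-node hypercube, consistent with the approximately 800 pins quoted above.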

The scalability of network parameters is hardly predictable in the long term. It is clear that the operation frequency of electronic nodes will increase, whereas the IPT is incompressible. Thus the IPT might become the leading contributor to remote-access latency, with a corresponding increase in memory-access time (in terms of the electronics cycle) with strictly no hope of reduction. This long-term evolution would transform any strongly coupled multiprocessor into a weakly coupled machine. However, this situation is not inexorable, as it is based on the sole scalability of the operation frequencies of processors and nodes. It is likely that it will be accompanied by a reduction of dimensions (a possible metric being, for instance, the processor power divided by the processor volume) so that the processor cycle and the internode distance would diminish simultaneously, maintaining the preeminence of the node-bypass latency over the IPT.
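To make this argument concrete, the IPT can be expressed in node clock cycles. The sketch below is illustrative only: it borrows the guided-light velocity of 20 cm/ns quoted in Subsection 5.B, while the distances and frequencies are hypothetical values chosen for the example, not data from this study.

```python
def ipt_in_cycles(distance_cm: float, node_freq_ghz: float,
                  velocity_cm_per_ns: float = 20.0) -> float:
    """Internode-propagation time expressed in node clock cycles."""
    ipt_ns = distance_cm / velocity_cm_per_ns  # propagation time in nanoseconds
    return ipt_ns * node_freq_ghz              # cycles = time (ns) * frequency (GHz)


if __name__ == "__main__":
    # (distance in cm, node frequency in GHz): hypothetical operating points
    for distance_cm, freq_ghz in [(30.0, 1.0), (30.0, 10.0), (3.0, 10.0)]:
        cycles = ipt_in_cycles(distance_cm, freq_ghz)
        print(f"{distance_cm} cm at {freq_ghz} GHz: {cycles:.2f} cycles")
```

With a fixed 30-cm link the IPT grows from 1.5 to 15 cycles when the clock goes from 1 to 10 GHz; shrinking the link to 3 cm at 10 GHz brings it back to about 1.5 cycles, which is the simultaneous scaling discussed above.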

B. Symmetric Multiprocessors

An SMP is a physically shared-memory machine with a uniform memory-access time. Early machines comprised a small number of processors, e.g., not exceeding 32, connected to a memory system through a single multiplexed shared bus.38 The architecture has evolved, and the number of nodes is larger today than was expected just a few years ago, with as many as 64 processors for the Sun Microsystems Model Enterprise 10,000.39

Interconnect links for data and addresses have been separated in modern machines (see Fig. 3). Today the solution for increasing the data bandwidth consists of connecting each processor by a private link to a crossbar and then connecting the crossbar to the memory. But the necessity of preserving the coherence of the cache makes this method unsuitable for addresses. Therefore the shared-address bus has become a critical communication bottleneck of the SMP architecture because it serializes accesses to the memory and adds an important contention latency to the MAL. The greater the number of processors, the longer the contention latency. The solution, which would consist of increasing the bus-operation frequency, is particularly complicated because the bus is a multipoint electronic line.

Fig. 3. Modern SMP architecture: Two address buses (Addr bus1 and Addr bus2) access the memory while preserving the coherence of the caches.

The only palliative solution has so far consisted of duplicating the number of address buses to reach the needed bandwidth, each bus being attached to an address-memory range. In addition, bandwidth can be increased by use of the notion that a logical bus can be implemented with a tree structure. This approach seems to be impractical for large SMP’s (i.e., for SMP’s with 64, 128, or more processors) and makes optical solutions attractive. The simplest technological change might consist of replacing each electrical address bus with an optical bus that connects the processors to several interleaved memory banks. This approach is attractive for three major reasons:

• Bus operation to as high as 1 GHz (or higher) would become possible (by replacement of the electrical bus operating at nearly a few tens of megahertz) because the transmission of optical pulses in guides is not penalized by capacity effects and critical-load adaptations that are encountered for electrical transmissions in a multipoint line. As a result the SMP architecture (i.e., the processor, the bus, and the memory) would become more scalable.

• Parallel transmission through optical lines is almost skew free in the gigahertz domain for transmission over a few tens of centimeters. This simplifies data recovery in the case of parallel transmission.

• The introduction of such an optical connection basically would not change the bus operation, which would always rely on the access serialization and a snooping protocol to maintain the coherence of caches.

However, several severe limitations cannot be ignored, namely,

• The top transaction efficiency of a shared bus is close to 1 transaction/bus cycle with pipelined arbitration. This limit becomes a bottleneck for large SMP’s with 32 or more processors.40 Speeding up the bus will surely improve the machines’ operation but will not solve all the contention problems of large multiprocessors, as it seems unrealistic to assume that the bus might operate faster than the processor. Large SMP’s with several optical buses seem inevitable.

• The bus-operation frequency cannot be increased without ensuring that each cache controller is able to check and update its directory at the bus-operation frequency.

• Speeding up the bus will sooner or later generate an integration issue because the bus length ought to be limited to make sure that the optical signals can be stationary within a single bus period. The light-propagation velocity ($c = 20$ cm/ns) requires the bus to be shorter than 10 cm at 2 GHz, shorter than 5 cm at 4 GHz, etc. This size constraint disappears if more than one transaction can be in progress simultaneously in the bus. In that case the bus architecture is akin to that of rings, as described in Subsection 5.C.
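The bus-length constraint in the last item follows from requiring light to cross the bus within one bus period. A minimal sketch, assuming the propagation velocity of 20 cm/ns quoted above; the helper name is hypothetical.

```python
def max_bus_length_cm(bus_freq_ghz: float, velocity_cm_per_ns: float = 20.0) -> float:
    """Longest bus that light can traverse within a single bus period."""
    period_ns = 1.0 / bus_freq_ghz          # one bus cycle in nanoseconds
    return velocity_cm_per_ns * period_ns   # distance covered in one period


if __name__ == "__main__":
    for freq_ghz in (1.0, 2.0, 4.0):
        print(f"{freq_ghz} GHz -> {max_bus_length_cm(freq_ghz)} cm")
```

The function returns 10 cm at 2 GHz and 5 cm at 4 GHz, matching the limits stated in the text.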

C. Rings and Hierarchical Rings

Ring multiprocessors are distributed-memory machines with nonuniform memory-access time.41 The ring is a multiaccess interconnection topology that is attractive for the following reasons:

(1) It permits the use of simple interfaces because the ring connects to a given node by means of only one input and one output port. The node–ring interface is basically a 2 × 2 switch. This simplicity reflects itself in a relatively low requirement for the number of connecting wires, which often corresponds directly to the number of pins on physical connectors. The number of connections is considerably smaller than in 2-D or 3-D network topologies (torus, mesh, hypercube) in which more than one direction exists for incoming and outgoing signals, inducing a larger overhead for processing routing.

(2) It provides a natural broadcasting and multicasting mechanism. This feature can be exploited in the implementation of producer-driven prefetching of data, which can improve performance significantly. Unlike with the bus, it is not possible to order parallel messages between different pairs of nodes, and for most implementations flow control can violate the ordering constraints. Thus the ring structure preserves a partial ordering of transmitted packets that can be exploited for implementation of a cache coherence scheme.42

(3) There are point-to-point connections between successive nodes that do not suffer from the undesirable effects such as loading and signal reflections from multiple connectors that plague electrical bus-based schemes and effectively reduce their feasibility to small sizes (see Subsection 5.B). Therefore signals can be transmitted on such links at high clock rates. Operation at 3–4 GHz with a parallelism of 246 will provide a huge bandwidth close to the terabit-per-second range.
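The bandwidth figure in item (3) is a simple product of clock rate and link width. The sketch below assumes one bit per channel per clock cycle, which is an assumption of this example rather than a statement of the text, and it takes the parallelism quoted in item (3) as 246 channels; the function name is hypothetical.

```python
def ring_link_bandwidth_gbit_s(clock_ghz: float, channels: int) -> float:
    """Aggregate link bandwidth in Gbit/s, assuming one bit per channel per cycle."""
    return clock_ghz * channels


if __name__ == "__main__":
    for clock_ghz in (3.0, 4.0):
        print(f"{clock_ghz} GHz x 246 channels = "
              f"{ring_link_bandwidth_gbit_s(clock_ghz, 246)} Gbit/s")
```

Under this assumption the link carries 738 Gbit/s at 3 GHz and 984 Gbit/s at 4 GHz, i.e., close to the terabit-per-second range.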

The effective bandwidth is determined essentially by the transfer rates attainable at individual nodes, and it can be improved by an increase in the clock frequency or the width of the transfer path. The bandwidth of a multiprocessor interconnection network can also be increased by means of a hierarchical structure whereby a number of localized transfers can take place concurrently on several rings. For example, if several local rings are connected by means of a central ring the number of concurrent transfers that can be supported is much higher if the transfers are between only sources and destinations on the same local ring. Transfers that pass through the central ring take more time than local ring transfers, but they are, in general, shorter than they would be if all nodes were connected to a single long ring.
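The trade-off described above can be illustrated with a toy hop-count model. Everything in the sketch is an assumption made for illustration: rings are unidirectional, every node and every inter-ring interface counts as one hop, a round trip traverses each ring it touches completely, and the sizes (64 nodes, arranged as 8 local rings of 8 nodes) are hypothetical.

```python
def single_ring_round_trip(total_nodes: int) -> int:
    """Hops for a round trip on one unidirectional ring: every node is bypassed once."""
    return total_nodes


def local_round_trip(nodes_per_ring: int) -> int:
    """Round trip confined to one local ring: its nodes plus the inter-ring interface."""
    return nodes_per_ring + 1


def remote_round_trip(nodes_per_ring: int, num_local_rings: int) -> int:
    """Round trip through the central ring: source ring + central ring + destination ring."""
    return 2 * (nodes_per_ring + 1) + num_local_rings


if __name__ == "__main__":
    nodes_per_ring, num_local_rings = 8, 8
    total_nodes = nodes_per_ring * num_local_rings
    print("single long ring:", single_ring_round_trip(total_nodes))
    print("local transfer:  ", local_round_trip(nodes_per_ring))
    print("remote transfer: ", remote_round_trip(nodes_per_ring, num_local_rings))
```

In this toy model a remote transfer costs 26 hops, more than the 9 hops of a local transfer but well below the 64 hops of a single long ring.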

D. Supercomputers and Distributed Shared-Memory Systems

Top-of-the-range supercomputers use more than 1000 processors. Although the memory may be distributed, each processor can access all the memory in the system. The distributed-shared memory (DSM) architecture attempts to provide a single addressing space for the distributed memory to enable the user to gain transparent access to computational resources in scalable systems. One achieves this by hiding the remote-communication mechanisms from the application writer, thus preserving the programming ease and the portability of shared-memory systems. Additionally, the scalability and the cost effectiveness of underlying distributed-memory systems are also inherited.

However, local memory is accessed much faster than is remote memory in which messages have to be exchanged across the network to fetch data. The nonuniformity is due not only to the network topology but also to the packaging technologies and will be degraded substantially by heavy traffic loads and congestion. Additionally, the reliance on locality and memory allocation requires heavy caching of memory to reduce remote references. Unfortunately, caching shared data introduces the problem of cache coherence, the solution of which relies significantly on the efficiency of the interconnection network. As was stressed in Section 5, designing low-latency nodes is also a critical issue. The problem will get worse with advances in the speed of current microprocessors in which the price–performance advantage of microprocessors is increasing. To build a scalable DSM multiprocessor that utilizes state-of-the-art off-the-shelf microprocessors with gigabyte-per-second connections to local memory, one must utilize technology that supports interprocessor connections at least in the gigabyte-per-second range and average access times to shared data in the nanosecond range. OI’s could be the only cost-effective technology for internode communications.

6. Optical Communications in Reconfigurable Architectures

In nonconventional architectures (such as custom computing) in the reconfigurable-computing domain increasing use is being made of arrays of field-programmable gate arrays (FPGA’s). High interconnect density is critical because of the difficulty in finding (or the nonexistence of) natural, weakly interconnected partitioning of the functions. Furthermore, the on-chip interconnect facilities are relatively slow owing to their configurability. This slowness is not a passing phase that depends on integration level but will persist as the density of chips grows continuously. OI’s may have a role in FPGA arrays, essentially to speed them up, by the introduction of a new routing layer. The density of optical chip-to-chip I/O may be applicable here even for very short distances such as adjacent chips. If OI’s can be justified for this purpose, they may even be feasible for replacing the relatively slow electrical communications within a single chip. Implementing wide buses from the FPGA chip to the reconfiguration memory to achieve rapid reconfiguration could be of benefit in nonconventional architectures.

7. Dedicated Optoelectronic Processors

Dedicated processors are very different from general-purpose monoprocessors or multiprocessors because they generally do not execute stored programs. They are designed for a specific task, which is often related to the processing of optical information. The MAL extensively discussed in Sections 3–6 is no longer a problem (as there is no memory or almost no exchange with the memory). Dedicated optoelectronic processors traditionally have an optoelectronic front-end and back-end interface. The optical data streams impinge on photodetectors that convert the light intensity of beams into electronic signals that are amplified and processed electronically. The resulting processed data can be converted back into optical signals for further processing if necessary. The number of optical channels can range from 10 to nearly 10,000 over a 1 cm × 1 cm chip area.

The communication bandwidth becomes a critical issue. For example, in a vision machine a 1024 × 1024 correlation on a 1024 × 1024 image requires approximately $170 \times 10^6$ multiply-and-accumulate operations, which corresponds to $10^{10}$ operations/s at a video frame rate of 30 frames/s. In the same way, a matrix multiplication of 1024 × 1024 requires approximately $10^9$ multiply-and-accumulate operations, which corresponds to $6 \times 10^{10}$ operations/s for the same frame rate. General-purpose electronic machines cannot cope with the input–output needs of such computationally intensive applications. Optically interconnected electronic chips have been shown to be the only technology to date that is capable of providing a match between computationally intense applications.43 There is a case therefore for dedicated optically connected electronic information-processing systems for applications that require a high data bandwidth capability. Such applications range from image-processing-primitive operations (Fourier transformation, 2-D convolution and correlation, and dot-product and dot-matrix multiplications) to switching fabrics in telecommunications.
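The operation rates quoted above can be checked with a few lines of arithmetic. The sketch assumes that each multiply-and-accumulate counts as two operations; this factor is inferred from the quoted numbers and is not stated in the text. The function names are hypothetical.

```python
def matmul_macs(n: int) -> int:
    """Multiply-and-accumulate count for the product of two n x n matrices."""
    return n ** 3


def ops_per_second(macs_per_frame: int, frames_per_s: int = 30, ops_per_mac: int = 2) -> int:
    """Sustained operation rate, assuming ops_per_mac operations per multiply-and-accumulate."""
    return macs_per_frame * frames_per_s * ops_per_mac


if __name__ == "__main__":
    correlation_macs = 170 * 10**6  # figure quoted in the text
    print(f"correlation: {ops_per_second(correlation_macs):.2e} operations/s")
    print(f"matrix product: {ops_per_second(matmul_macs(1024)):.2e} operations/s")
```

With these assumptions the correlation requires about $1.0 \times 10^{10}$ operations/s, and the 1024 × 1024 matrix product (about $1.07 \times 10^9$ multiply-and-accumulate operations) requires about $6.4 \times 10^{10}$ operations/s, in line with the rounded values in the text.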

Demonstrators based on optically interconnected electronics deal with such functions. See, for example, the fast-Fourier-transform machine built at the University of North Carolina, Charlotte, North Carolina, which has an I/O bandwidth of 29 Gbytes/s and calculates a 1024-point fast Fourier transform in a few microseconds,44 and the bitonic sorter at Heriot-Watt University, Riccarton, UK, which sorts 1024 15-bit-deep words within 10 ms.45

8. Conclusion

The most critical issue in computer architectures (from the monoprocessor to large multiprocessor systems) is the access time to the main memory. This is the key problem that architectures must live with, regardless of whether they are optoelectronic. Although memory chips have become much denser (and therefore much larger), they have not become significantly faster. Furthermore, the techniques that currently are used to realize large memory systems suitable for multiprocessors create coherency problems for which no simple, well-scaling electrical solutions are known. This situation further aggravates the latency properties of complex memory systems. It also makes the processor architecture complex, as many techniques in the processor are specifically targeted at the problems of memory latency and bandwidth limitations as well as the unpredictability of hierarchical memory systems. Thus designing new low-latency memory chips is a critical challenge. The changes are open, possibly at the physical level (with the introduction of new materials, for instance, superconductors), at the architectural level regarding the organization of memory (for instance, with the design of multiported memories), or alternatively by the cooling down of current memory chips to approximately the liquid-nitrogen temperature.

With respect to future monoprocessor machines, the possible introduction of OI’s in the memory hierarchy seems hypothetical, as it is likely that the evolution of the electronic packaging and the motherboard technology will suffice to build efficient machines within the next 10 years.

With respect to multiprocessor machines, the situation is more favorable, but a major problem is the economic risk factor of introducing a new technology. This is a strategic rather than a technological issue, but unless optics can solve a major problem and provide a significantly better solution at a cheaper cost no one is going to take the risk of an untried technology.

Introducing optics is conceivable in several types of multiprocessor machines. For instance, the supercomputer line has traditionally been the first to experiment with novel techniques because the economic risk may be acceptable in the manufacture of a supercomputer for which performance is the primary issue. But one might also consider developing systems that exploit affordable technology with the basic idea that the commercial success of multiprocessors will depend not only on their computational capability but also on their cost–performance ratio. Successful products might be those that would allow configuration of a feasible entry-level machine at a correspondingly low cost that could then be expanded into a larger system merely by the acquisition of additional hardware modules of essentially the same kind. The multiprocessors related to the different strategies are reviewed below.

A. Symmetric Multiprocessor Machines

Perhaps the most significant area in which optical technology might be applicable is in the upper limits of SMP’s. Rings and buses are 1-D networks whose performance depends critically on the traffic bandwidth of the communication system. The huge bandwidth of OI’s aids in reducing the traffic latency caused by contention access. In SMP’s, the number of nodes is much larger today than was expected only a few years ago, and there would be great benefit in increasing it further, although the techniques required for bus-based implementation are already heroic. If optical technology could help extend SMP’s, even by a factor of 2 beyond the current maximum (64 processors for the Sun Microsystems Model Enterprise 10,000), it would readily find application.

B. Rings

Optical implementation of a ring-structured backplane makes it possible to achieve highly parallel links with very large bandwidth (of the order of terabits per second). Complete parallel transmissions (requiring the implementation of a parallelism as high as 600 channels to insert simultaneously several transactions into the ring) would enable the realization of the huge bandwidth needed in forthcoming processors.40 A massively parallel optical ring (i.e., several thousands of channels) could be divided into several concentric subrings to increase the bandwidth further. However, it must be stressed that the keys to success will be (1) the efficiency of the optical–electronic interface, (2) the capacity to carry out node operations (add, drop, bypass, and possibly on-the-fly error correction) in one or two ring cycles to compress as much as possible the different node latencies, and (3) the capacity to maintain the coherence of the local cache at the ring-operation frequency. To be successful, it will be necessary for the OI to show considerably enhanced performance in comparison with its electrical counterpart.

C. Supercomputers

One might consider a demonstrator of a highly parallel computer connected by a hypercube interconnection network that uses wide, fast buses. The one area in which optics can have a major advantage over electronics is in the interconnection buses in a network-based parallel computer. Here a high pin-out and a high data rate are both required to minimize the latency of network-based memory access. One could imagine a router chip with 10 buses of 500 bits, which would be impossible for electronic communication. This chip would require the collaboration of a parallel-computer manufacturer who is experienced in the use of commodity microprocessor parts. The major issue would be to obtain a microprocessor die in which the pin-out could be taken to flip-chip–bonding sites for emitters and for which optical inputs were also provided.

D. Reconfigurable-Network Machines

Architectures capable of exploiting the ability of optics to support static networks that can be reconfigured very quickly (in one cycle?) may find optics attractive. This reconfiguration rate is largely incompatible with modern shared-memory systems, which require much asynchronous, variable-sized packet switching. One application suggested is the use of cache-consistency protocols that use write–update, with special multicasting configurations to exploit knowledge about sharing patterns. However, write–update protocols aggravate the memory-ordering problems, and this application may depend on future directions in this as-yet unsolved problem.

E. Dedicated Optoelectronic Processors

There are still issues concerning the design, fabrication, and testing of such systems as those described above. On the design front the smartness of the pixel (i.e., the degree of complexity in relation to the number of electronic gates) that each optically interconnected element should possess to maximize the overall aggregate data-throughput rate is still under study. Useful figures of merit can include the power–weight product of the overall system, the power-consumption density per gigabit per second, and the throughput rate itself. The performance of such demonstrators is limited by the available laser-source power, the electronic chip area, or the heat-removal capability of the system. The interfacing technology between the optoelectronic components and the VLSI circuitry is still immature with several options still under consideration: flip-chip bonding, epoxy glue, anisotropic bonding, monolithic integration, etc. The optical hardware needs to be further miniaturized and to be compliant industrially, especially in terms of alignment accuracy and connections to the outside world. The yield and the testing of such systems have not been studied in great detail and deserve more fundamental study.

This study partly summarizes the conclusions of the Workshop on Optical Communications and Computer Sciences (WOCCS) that was held in Toulouse, France, in March 1999. This workshop was sponsored by the European Commission in the context of the MicroElectronic Advanced Research Initiative (Mel ARI). It is our pleasure to thank Pierpaolo Malinverni for his permanent support. We also wish to thank the participants, namely, Howard L. Davidson, Alfred Forchel, Gilles Jacquemond, Graham Jenkin, John D. Lambkin, Dominique Lavernier, Henk Neefs, Rami Melhem, Matthias Pez, Roger Vounckx, and Zvonko Vranesic, of the WOCCS for numerous fruitful and informal discussions. The position we have taken in this paper does not necessarily reflect the opinions of all the members who attended the workshop.

References and Notes

1. P. Chavel, D. A. B. Miller, and H. Thienpont, eds., Optics in Computing ’98, Proc. SPIE 3490 (1998).

2. IEEE Computer Society, Proceedings of the Fourth International Conference on Massively Parallel Processing Using Optical Interconnects, Montreal, Quebec, Canada, 22–24 June (IEEE Computer Society, Los Alamitos, Calif., 1997).

3. D. A. B. Miller and H. M. Ozaktas, “Limit to the bit-rate capacity of electrical interconnects from the aspect ratio of system architecture,” in the Special Issue on Parallel Computing with Optical Interconnects, J. Parallel Distribut. Comput. 41, 42–52 (1997).

4. See the URL http://www.profibus.com.

5. See the URL http://www.fieldbus.org.

6. See the URL http://www.can-cia.de.

7. See the URL http://www.interbus.com.

8. S. Scott, “Synchronization and communication in the T3E multiprocessor,” in Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (Association for Computing Machinery, New York, 1996), pp. 26–36.

9. C. B. Stunkel, D. G. Shea, D. G. Grice, P. H. Hochschild, and M. Tsao, “The SP-1 high performance switch,” in Proceedings of the Conference on Scalable High Performance Computing (IEEE Computer Society, Los Alamitos, Calif., 1995), pp. 150–157.

10. See the URL http://www.ssd.intel.com.

11. See the URL http://www.sgi.com/origin.

12. For a complete document on HIPPI 6400, see the URL’s http://www.noc.lanl.gov/~jamesh/hippi64; http://www.scizzl.com; and http://www1.cem.ch/HSI/sci/sci.html.


13. S. Scott, M. Vernon, and J. R. Goodman, “Performance of the SCI ring,” in Proceedings of the Nineteenth International Symposium on Computer Architecture (Association for Computing Machinery, New York, 1992), pp. 403–414.

14. N. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W.-K. Su, “Myrinet: a gigabit-per-second local area network,” IEEE Micro. 15, 29–38 (1995).

15. See the URL http://www.sun.com/servers/midrange/e6500/e6500.spec.html.

16. See the URL http://www.rambus.com/html/documentation.html.

17. S. Tang, T. Li, F. Li, L. Wu, M. Dubinovski, R. Wickman, and R. T. Chen, “A 1-GHz clock signal distribution for multiprocessor supercomputers,” in Proceedings of the International Conference on Massively Parallel Processing Using Optical Interconnects (MMPOI96) (IEEE Computer Society, Los Alamitos, Calif., 1996), pp. 186–191.

18. R. T. Chen, L. Wu, F. Li, S. Tang, M. Dubinovski, J. Qi, C. L. Schow, J. C. Campbell, R. Wickman, B. Picor, M. Hibbs-Brenner, J. Bristow, Y.-L. Liu, S. Rattan, and C. Nodding, “Si CMOS process compatible guided-wave multi-Gbit/s optical clock signal distribution system for the Cray T-90 supercomputer,” in Proceedings of the International Conference on Massively Parallel Processing Using Optical Interconnects (MMPOI97) (IEEE Computer Society, Los Alamitos, Calif., 1997), pp. 10–24.

19. T. Szymanski and H. Scott, “Design of a terabit free-space photonic backplane for parallel computing,” in Proceedings of the Conference on Massively Parallel Processing Using Optical Interconnects (MMPOI95) (IEEE Computer Society, Los Alamitos, Calif., 1995), pp. 16–27.

20. Y.-S. Liu, B. Robertson, G. C. Boisset, M. H. Ayliffe, R. Iyer, and D. V. Plant, “Design, implementation, and characterization of a hybrid optical interconnect for a four-stage free-space optical backplane demonstrator,” Appl. Opt. 37, 2895–2914 (1998).

21. G. Verschaffelt, R. Buczynski, P. Tuteleers, P. Vynck, V. Baukens, H. Ottevaere, C. Debaes, S. Kufner, M. Kufner, A. Hermanne, J. Genoe, D. Coppee, R. Vounckx, S. Borghs, I. Veretennicoff, and H. Thienpont, “Demonstration of a monolithic multichannel module for multi-Gb/s intra-MCM optical interconnects,” Photon. Technol. Lett. 10, 1629–1631 (1998).

22. V. Baukens, G. Verschaffelt, P. Tuteleers, P. Vynck, H. Ottevaere, M. Kufner, S. Kufner, I. Veretennicoff, R. Bockstaele, A. Van Hove, B. Dhoedt, R. Baets, and H. Thienpont, “Performance of optical multi-chip-module interconnects: comparing guided-wave and free-space pathways,” J. Eur. Opt. Soc. A 1, 255–261 (1999).

23. J. C. Rodier, P. Chavel, A. Dupret, E. Belhaire, P. Garda, D. Prevost, and P. Lalanne, “Video-rate simulated annealing for stochastic artificial retinas,” Opt. Commun. 132, 427–431 (1996).

24. J. Tanida and Y. Ichioka, “Programming of optical array logic: image data processing,” Appl. Opt. 27, 2926–2939 (1988).

25. A. M. Weiner, “Femtosecond optical pulse shaping and processing,” Prog. Quantum Electron. 19, 161–237 (1995).

26. D. A. B. Miller, “Physical reasons for optical interconnection,” Int. J. Optoelectron. 11, 155–168 (1997).

27. G. Yayla, P. Marchand, and S. Esener, “Energy and speed analysis of digital electrical and free-space optical interconnections,” in Optical Interconnections and Parallel Processing: The Interface, A. Ferreira and P. Berthome, eds. (Kluwer Academic, Dordrecht, The Netherlands, 1997), Chap. 3.

28. H. S. Hinton, An Introduction to Photonic Switching Fabrics (Plenum, New York, 1993).

29. O. Kibar, D. A. Van Blerkom, C. Fan, and S. Esener, “Power minimization and technology comparisons for digital free-space optoelectronic interconnections,” J. Lightwave Technol. 17, 546–555 (1999).

30. J. L. Hennesy and D. A. Patterson, “Buses connecting IyOdevices to the CPUymemory,” in Computer Architecture, aQuantitative Approach, 2nd ed. ~Morgan Kauffmann, Los Al-tos, Calif., 1996!, Sec. 6.3.

31. A. Bolychevsky, C. R. Jesshope, and V. B. Muchnick, “Dynamicscheduling in RISC architectures,” IEEE Proc. Comput. Digi-tal Technol. 143, 309–317 ~1996!.

32. A. Iannucci, Multithreaded Computer Architecture—A Sum-mary of the State of the Art ~Kluwer Academic, Dordrecht, TheNetherlands, 1994!.

33. H. Neefs, P. Van Heuven, and J. Van Campenhout, “Latencyrequirements of optical interconnects at different memory hi-erarchy levels of a computer system,” in Optics in Computing’98, P. Chavel, D. A. B. Miller, and H. Thienpont, eds., Proc.SPIE 3490, 552–555 ~1998!.

34. H. Davidson, Sun Microsystems, 901 San Antonio Road, PaloAlto, Calif. 94303 ~private communication, 11 March 1999!.

35. J. H. Collet and L. Fesquet, “Comparison of the latency foran optical bus and several 2-D electronic topologies,” inCD-ROM of the Proceedings of the Eleventh InternationalParallel Processing Symposium ~IPPS! ~IEEE ComputerSociety, Los Alamitos, Calif., 1997!, CD addressesX:\workshps\wocs\collet.pdf; X:\workshps\wocs\collet.ps.

36. K. H. Wang, Advanced Computer Architecture: Parallelism,Scalability, Programmability ~McGraw-Hill, New York, 1993!.

37. A. Louri, B. Weech, and C. Neocleous, “A spanning multichannel linked hypercube: a gradually scalable optical interconnection network for massively parallel processing,” IEEE Trans. Parallel Distribut. Sys. 9, 497–512 (1998).

38. P. Sindhu, J. M. Frailong, J. Gastinel, M. Cekleov, L. Yuan, B. Gunning, and D. Curry, “XDBus: a high-performance, consistent, packet-switched VLSI bus,” in Technical Digest of the Spring ’93 Computer Conferences (CompCon) (IEEE Computer Society, Los Alamitos, Calif., 1993), pp. 338–344.

39. A. Charlesworth, “Starfire: extending the SMP envelope,” IEEE Micro. 1, 39–49 (1998).

40. W. Hlayel, D. Litaize, L. Fesquet, and J. H. Collet, “Optical versus electronic bus for address-transactions in future SMP architectures,” in Proceedings of the Conference on Parallel Architecture and Compilation Techniques (PACT) (IEEE Computer Society, Los Alamitos, Calif., 1998), pp. 22–29.

41. Z. G. Vranesic, M. Stumm, D. M. Lewis, and R. White, “Hector: a hierarchically structured shared-memory multiprocessor,” Computer 24, 72–79 (1991).

42. L. A. Barroso and M. Dubois, “The performance of cache coherent ring-based multiprocessors,” Tech. Rep. CENG-92-19 (Department of Electrical Engineering-Systems, University of Southern California, Los Angeles, Calif., 1992).

43. A. V. Krishnamoorthy and D. A. B. Miller, “Scaling optoelectronic-VLSI circuits into the 21st century: a technology roadmap,” IEEE J. Select. Top. Quantum Electron. 2, 55–76 (1996).

44. R. G. Rozier, F. E. Kiamilev, and A. V. Krishnamoorthy, “Design and evaluation of a photonic FFT processor,” J. Parallel Distribut. Comput. 41, 131–136 (1997).

45. M. P. Y. Desmulliez, F. A. P. Tooley, J. A. B. Dines, N. L. Grant, D. J. Goodwill, D. Baillie, B. S. Wherrett, P. M. Foulk, S. Ashcroft, and P. Black, “Perfect-shuffle interconnected bitonic sorter: optoelectronic design,” Appl. Opt. 34, 5077–5090 (1996).
