cdn.preterhuman.net/texts/computing/clusters/2007-March.txt · 2012-10-01

From greg.lindahl at qlogic.com Fri Mar 2 12:09:06 2007
From: greg.lindahl at qlogic.com


LWN recently did an article entitled "Who wrote 2.6.20?" It was a response to a Time magazine article which claimed that Linux was written by volunteers, when most of us know that most Linux kernel development is done by paid developers.

http://lwn.net/Articles/222773/

In one of the charts he looked at all the changes to the kernel in the last year, and summed them up by company. The top companies were (drumroll please):

(Unknown)  740990  29.5%
Red Hat    361539  14.4%
(None)     239888   9.6%
IBM        200473   8.0%
QLogic      91834   3.7%
Novell      91594   3.6%
Intel       78041   3.1%

... and we didn't even do our own distro! Hee hee.

-- greg

Greg, I just want to point out that there is a difference between "Linux" and "[recent] Linux kernel development".

That said, thanks so much for your substantial contributions to ongoing kernel development; that's important :-) and gratz on beating out Novell and Intel.

Peter

On 3/2/07, Greg Lindahl wrote:
>
> LWN recently did an article entitled "Who wrote 2.6.20?" It was a
> response to a Time magazine article which claimed that Linux was
> written by volunteers, when most of us know that most Linux kernel
> development is done by paid developers.
>
> http://lwn.net/Articles/222773/
>
> In one of the charts he looked at all the changes to the kernel in the
> last year, and summed them up by company. The top companies were (drumroll
> please):
>
> (Unknown)  740990  29.5%
> Red Hat    361539  14.4%
> (None)     239888   9.6%
> IBM        200473   8.0%
> QLogic      91834   3.7%
> Novell      91594   3.6%
> Intel       78041   3.1%
>
> ... and we didn't even do our own distro! Hee hee.
>
> -- greg
>
> _______________________________________________
> Beowulf mailing list, [email protected]
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

Long ago I started keeping my notes and a bit of editorial content on idle power consumption for various computers and related gear here:

http://saf.bio.caltech.edu/saving_power.html

A few days ago I realized that there was no Intel Core information in there. Since I don't have any myself, I hunted down an iMac and a PC and found, much to my surprise, that while the Core processors were quite efficient when idling, there was apparently no way to adjust the power consumption downward any further via Enhanced SpeedStep (or whatever Intel calls it today). I assume that the CPUs in these two boxes supported this capability, but the BIOS (or its equivalent on a Mac) apparently didn't enable this feature. Sure, it's a small sample, but in this day and age I really expected it to be enabled by default pretty much everywhere.
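
For anyone who wants to check the same thing on their own nodes, the quickest test under Linux is the kernel's cpufreq sysfs interface. The following is a minimal sketch in C (assuming a 2.6-era kernel with cpufreq enabled; the /sys/devices/system/cpu/cpu0/cpufreq attribute names are the standard ones, but which governors and frequency steps show up depends entirely on what the BIOS and driver expose):

    /* cpufreq_check.c - sketch: report what frequency scaling cpu0 exposes.
       Assumes a Linux kernel with cpufreq support; compile: cc -o cpufreq_check cpufreq_check.c */
    #include <stdio.h>

    static void show(const char *attr)
    {
        char path[256], buf[256];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cpufreq/%s", attr);
        f = fopen(path, "r");
        if (!f) {                      /* missing attribute usually means no cpufreq support */
            printf("%-32s (not available)\n", attr);
            return;
        }
        if (fgets(buf, sizeof(buf), f))
            printf("%-32s %s", attr, buf);
        fclose(f);
    }

    int main(void)
    {
        show("scaling_driver");                 /* e.g. acpi-cpufreq, speedstep-centrino */
        show("scaling_governor");               /* e.g. ondemand, performance, powersave */
        show("scaling_min_freq");               /* kHz */
        show("scaling_max_freq");               /* kHz */
        show("scaling_available_frequencies");  /* a single entry => no lower P-states exposed */
        return 0;
    }

If scaling_available_frequencies shows only one entry, or the cpufreq directory is missing entirely, the firmware has not exposed any lower P-states to the OS, which matches the behaviour described above.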

Anyway, that got me thinking about idle power consumption on clusters. Many of you have machines that run at 100% CPU 24/7, and for those systems the following discussion is irrelevant. But there are other clusters around that tend to sit for long periods of time between jobs, and whatever power they are using while waiting for a job is pretty close to a total waste. This is even more common on regular PCs, where CPU usage is extremely "bursty". The thing is, on pretty much every machine I've seen (exception: some laptops) there is a gaping hole between the lowest power level on a running machine and the power level when it goes to sleep. Putting idle nodes all the way to sleep would save the most power, but it is a nightmare in terms of waking them back up again. Besides the issue of disks that might not spin back up, there is the problem of the (many) network protocols which are going to time out and break connections. Also, returning from sleep tends to be relatively slow, taking many seconds to many minutes, depending on a whole lot of variables.

So it would be nice if the range of underclocking / undervolting adjustments provided on compute nodes extended quite a bit further towards the lower end than it currently does. Typically idle is something like 70-80W at the lowest clock speed and sleep is 2-4W. There's a lot of room in there to work with. Why is there not a system that can slow down far enough to use only 15W and still run, albeit very slowly? On a diskless node 7-10W might even be possible. Machines running in these modes would be alive enough to keep network connections open, and would be a whole lot easier to get back up to full speed than the equivalent machine in a sleep state. Assuming the transition speed is similar to Cool'n'Quiet we're talking much less than a second to speed back up again.
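
As a rough way to see how far down the existing range actually goes on a particular node, one can hand cpu0 to the userspace cpufreq governor and pin it at the lowest advertised frequency, then put a power meter on the wall socket. This is only a sketch under assumptions (root access, a kernel with the userspace governor built, the standard sysfs attribute names); on most current hardware the floor it reaches is still tens of watts, nowhere near the 7-15W regime wished for above:

    /* lowclock.c - sketch: pin cpu0 at its lowest advertised frequency using the
       userspace cpufreq governor. Run as root; assumes standard sysfs paths. */
    #include <stdio.h>
    #include <string.h>

    static int write_attr(const char *attr, const char *val)
    {
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cpufreq/%s", attr);
        f = fopen(path, "w");
        if (!f) { perror(path); return -1; }
        fprintf(f, "%s\n", val);
        return fclose(f);
    }

    int main(void)
    {
        char minfreq[64];
        FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq", "r");

        if (!f || !fgets(minfreq, sizeof(minfreq), f)) {
            fprintf(stderr, "no cpufreq support exposed on this node\n");
            return 1;
        }
        fclose(f);
        minfreq[strcspn(minfreq, "\n")] = '\0';

        /* switch to the userspace governor, then request the floor frequency (kHz) */
        if (write_attr("scaling_governor", "userspace") != 0 ||
            write_attr("scaling_setspeed", minfreq) != 0)
            return 1;

        printf("cpu0 pinned at %s kHz; measure idle draw at the wall now\n", minfreq);
        return 0;
    }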

There are a lot of articles around about statically underclocked machines, which proves that running modern hardware slowly is possible, but the statically underclocked machines cannot be sped up again - they start slow, and stay slow. Via sells some processors like the C7 which will operate over a very wide power range, but unfortunately the fastest those will crunch isn't anywhere near the speed of an Opteron or Core.

Big iron SMP machines often have the ability to shut off CPUs while the machine is running, well, except for the last one obviously. With quad cores pretty much here, and octo cores on the horizon, one might imagine large power savings at idle could be achieved the same way on these chips. Can any of the high core count Opterons or Core CPUs power down unused cores now?

In closing, does anybody currently make a rack mountable compute node with a really, really, really low idle power mode, and also competitive performance when running at 100%?

Regards,

David [email protected], Sequence Analysis Facility, Biology Division, Caltech

*** Call for Papers and Announcement ***

Wavelet Applications in Industrial Processing V (SA109)
Part of SPIE's International Symposium on Optics East 2007
9-12 September 2007, Seaport World Trade Center, Boston, MA, USA

--- Abstract Due Date Deadline prolongation: 4 March 2007 ---
--- Manuscript Due Date: 13 August 2007 ---

Web site: http://spie.org/Conferences/Calls/07/oe/submitAbstract/index.cfm?fuseaction=SA109
ABSTRACT TEXT: Approximately 500 words.

Conference Chairs: Frédéric Truchetet, Univ. de Bourgogne (France); Olivier Laligant, Univ. de Bourgogne (France)

Program Committee: Patrice Abry, École Normale Supérieure de Lyon (France); Radu V. Balan, Siemens Corporate Research; Atilla M. Baskurt, Univ. Claude Bernard Lyon 1 (France); Amel Benazza-Benyahia, École Supérieure des Communications de Tunis (Tunisia); Albert Bijaoui, Observatoire de la Côte d'Azur (France); Seiji Hata, Kagawa Univ. (Japan); Henk J. A. M. Heijmans, Ctr. for Mathematics and Computer Science (Netherlands); William S. Hortos, Associates in Communication Engineering Research and Technology; Jacques Lewalle, Syracuse Univ.; Wilfried R. Philips, Univ. Gent (Belgium); Alexandra Pizurica, Univ. Gent (Belgium); Guoping Qiu, The Univ. of Nottingham (United Kingdom); Hamed Sari-Sarraf, Texas Tech Univ.; Peter Schelkens, Vrije Univ. Brussel (Belgium); Paul Scheunders, Univ. Antwerpen (Belgium); Kenneth W. Tobin, Jr., Oak Ridge National Lab.; Günther K. G. Wernicke, Humboldt-Univ. zu Berlin (Germany); Gerald Zauner, Fachhochschule Wels (Austria)

The wavelet transform, multiresolution analysis, and other space-frequency or space-scale approaches are now considered standard tools by researchers in image and signal processing. Promising practical results in machine vision and sensors for industrial applications and non destructive testing have been obtained, and a lot of ideas can be applied to industrial imaging projects. This conference is intended to bring together practitioners, researchers, and technologists in machine vision, sensors, non destructive testing, signal and image processing to share recent developments in wavelet and multiresolution approaches. Papers emphasizing fundamental methods that are widely applicable to industrial inspection and other industrial applications are especially welcome.

Papers are solicited in, but not limited to, the following areas:

o New trends in wavelet and multiresolution approaches, frame and overcomplete representations, Gabor transform, space-scale and space-frequency analysis, multiwavelets, directional wavelets, lifting scheme for:
  - sensors
  - signal and image denoising, enhancement, segmentation, image deblurring
  - texture analysis
  - pattern recognition
  - shape recognition
  - 3D surface analysis, characterization, compression
  - acoustical signal processing
  - stochastic signal analysis
  - seismic data analysis
  - real-time implementation
  - image compression
  - hardware, wavelet chips.

o Applications:
  - machine vision
  - aspect inspection
  - character recognition
  - speech enhancement
  - robot vision
  - image databases
  - image indexing or retrieval
  - data hiding
  - image watermarking
  - non destructive evaluation
  - metrology
  - real-time inspection.

o Applications in microelectronics manufacturing, web and paper products, glass, plastic, steel, inspection, power production, chemical process, food and agriculture, pharmaceuticals, petroleum industry.

All submissions will be peer reviewed. Please note that abstracts must be at least 500 words in length in order to receive full consideration.

---------------------------------------------------------------------------------
! Abstract Due Date Deadline prolongation: 4 March 2007 !
! Manuscript Due Date: 13 August 2007                   !
---------------------------------------------------------------------------------

------------- Submission of Abstracts for Optics East 2007 Symposium ------------
Abstract Due Date Deadline prolongation: 4 March 2007 - Manuscript Due Date: 13 August 2007
Abstracts, if accepted, will be distributed at the meeting.

* IMPORTANT!
- Submissions imply the intent of at least one author to register, attend the symposium, present the paper (either orally or in poster format), and submit a full-length manuscript for publication in the conference Proceedings.
- By submitting your abstract, you warrant that all clearances and permissions have been obtained, and authorize SPIE to circulate your abstract to conference committee members for review and selection purposes and, if it is accepted, to publish your abstract in conference announcements and publicity.
- All authors (including invited or solicited speakers), program committee members, and session chairs are responsible for registering and paying the reduced author, session chair, program committee registration fee. (Current SPIE Members receive a discount on the registration fee.)

* Instructions for Submitting Abstracts via Web

- You are STRONGLY ENCOURAGED to submit abstracts using the "submit an abstract" link at: http://spie.org/events/oe
- Submitting directly on the Web ensures that your abstract will be immediately accessible by the conference chair for review through MySPIE, SPIE's author/chair web site.
- Please note! When submitting your abstract you must provide contact information for all authors, summarize your paper, and identify the contact author who will receive correspondence about the submission and who must submit the manuscript and all revisions. Please have this information available before you begin the submission process.

- First-time users of MySPIE can create a new account by clicking on the create new account link. You can simplify account creation by using your SPIE ID# which is found on SPIE membership cards or the label of any SPIE mailing.

- If you do not have web access, you may E-MAIL each abstract separately to: [email protected] in ASCII text (not encoded) format. There will be a time delay for abstracts submitted via e-mail as they will not be immediately processed for chair review.

IMPORTANT! To ensure proper processing of your abstract, the SUBJECT line must include only:
SUBJECT: SA109, TRUCHETET, LALIGANT

- Your abstract submission must include all of the following:
1. PAPER TITLE
2. AUTHORS (principal author first). For each author:
   o First (given) Name (initials not acceptable)
   o Last (family) Name
   o Affiliation
   o Mailing Address
   o Telephone Number
   o Fax Number
   o Email Address
3. PRESENTATION PREFERENCE "Oral Presentation" or "Poster Presentation."
4. PRINCIPAL AUTHOR'S BIOGRAPHY Approximately 50 words.
5. ABSTRACT TEXT Approximately 500 words. Accepted abstracts for this conference will be included in the abstract CD-ROM which will be available at the meeting. Please submit only 500-word abstracts that are suitable for publication.
6. KEYWORDS Maximum of five keywords.

If you do not have web access, you may E-MAIL each abstract separately to: [email protected] in ASCII text (not encoded) format. There will be a time delay for abstracts submitted via e-mail as they will not be immediately processed for chair review.

* Conditions of Acceptance
- Authors are expected to secure funding for registration fees, travel, and accommodations, independent of SPIE, through their sponsoring organizations before submitting abstracts.

- Only original material should be submitted.

- Commercial papers, papers with no new research/development content, and papers where supporting data or a technical description cannot be given for proprietary reasons will not be accepted for presentation in this symposium.

- Abstracts should contain enough detail to clearly convey the approach and the results of the research.

- Government and company clearance to present and publish should be final at the time of submittal. If you are a DoD contractor, allow at least 60 days for clearance. Authors are required to warrant to SPIE in advance of publication of the Proceedings that all necessary permissions and clearances have been obtained, and that submitting authors are authorized to transfer copyright of the paper to SPIE.

* Review, Notification, Program Placement

- To ensure a high-quality conference, all abstracts and Proceedings manuscripts will be reviewed by the Conference Chair/Editor for technical merit and suitability of content. Conference Chair/Editors may require manuscript revision before approving publication, and reserve the right to reject for presentation or publication any paper that does not meet content or presentation expectations. SPIE's decision on whether to accept a presentation or publish a manuscript is final.

- Applicants will be notified of abstract acceptance and sent manuscript instructions by e-mail no later than 7 May 2007. Notification of acceptance will be placed on SPIE Web the week of 4 June 2007 at http://spie.org/events/oe

- Final placement in an oral or poster session is subject to the Chairs' discretion. Instructions for oral and poster presentations will be sent to you by e-mail. All oral and poster presentations require presentation at the meeting and submission of a manuscript to be included in the Proceedings of SPIE.

* Proceedings of SPIE

- These conferences will result in full-manuscript Chairs/Editor-reviewed volumes published in the Proceedings of SPIE and in the SPIE Digital Library.

- Correctly formatted, ready-to-print manuscripts submitted in English are required for all accepted oral and poster presentations. Electronic submissions are recommended, and result in higher quality reproduction. Submission must be provided in PostScript created with a printer driver compatible with SPIE's online Electronic Manuscript Submission system. Instructions are included in the author kit and from the "Author Info" link at the conference website.

- Authors are required to transfer copyright of the manuscript to SPIE or to provide a suitable publication license.

- Papers published are indexed in leading scientific databases including INSPEC, Ei Compendex, Chemical Abstracts, International Aerospace Abstracts, Index to Scientific and Technical Proceedings and NASA Astrophysical Data System, and are searchable in the SPIE Digital Library. Full manuscripts are available to Digital Library subscribers.

- Late manuscripts may not be published in the conference Proceedings and SPIE Digital Library, whether the conference volume will be published before or after the meeting. The objective of this policy is to better serve the conference participants as well as the technical community at large, by enabling timely publication of the Proceedings.

- Papers not presented at the meeting will not be published in the conference Proceedings, except in the case of exceptional circumstances at the discretion of SPIE and the Conference Chairs/Editors.

wa2


Hi,

I have a small (16 dual xeon machines) cluster. We are going to add an additional machine which is only going to serve a big filesystem via a gigabit interface.

Does anybody know what is better for a cluster of this size: exporting the filesystem via NFS, or using another alternative such as a cluster filesystem like GFS or OCFS?

Thanks in advance

--

Jaime D. Perea Duarte. Linux registered user #10472

Dep. Astrofisica Extragalactica. Instituto de Astrofisica de Andalucia (CSIC) Apdo. 3004, 18080 Granada, Spain.

CALL FOR PARTICIPATION
(see advance program below)

WORKSHOP ON LARGE-SCALE, VOLATILE DESKTOP GRIDS (PCGRID 2007)
held in conjunction with the
IEEE International Parallel & Distributed Processing Symposium (IPDPS)
March 30, 2007
Long Beach, California U.S.A.
http://pcgrid07.lri.fr

Desktop grids utilize the free resources available in Intranet or Internet environments for supporting large-scale computation and storage. For over a decade, desktop grids have been one of the largest and most powerful distributed computing systems in the world, offering a high return on investment for applications from a wide range of scientific domains (including computational biology, climate prediction, and high-energy physics). While desktop grids sustain up to Teraflops/second of computing power from hundreds of thousands to millions of resources, fully leveraging the platform's computational power is still a major challenge because of the immense scale, high volatility, and extreme heterogeneity of such systems.

The purpose of the workshop is to provide a forum for discussing recent advances and identifying open issues for the development of scalable, fault-tolerant, and secure desktop grid systems. The workshop seeks to bring desktop grid researchers together from theoretical, system, and application areas to identify plausible approaches for supporting applications with a range of complexity and requirements on desktop environments.

#####################################################################
ADVANCE PROGRAM

(In each session below, the following list of papers will be presented. For the detailed schedule, see http://pcgrid07.lri.fr/program.html)

---------------------------------------------------------------------
KEYNOTE SPEAKER:
David P. Anderson, Director of BOINC and SETI@home, University of California at Berkeley

---------------------------------------------------------------------

SESSION I: SYSTEMS

Invited Paper: Open Internet-based Sharing for Desktop Grids in iShare
Xiaojuan Ren, Purdue University, U.S.A.
Ayon Basumallik, Purdue University, U.S.A.
Zhelong Pan, VMWare, Inc., U.S.A.
Rudolf Eigenmann, Purdue University, U.S.A.

Invited Paper: Decentralized Dynamic Host Configuration in Wide-area Overlay Networks of Virtual Workstations
Arijit Ganguly, University of Florida, U.S.A.
David I. Wolinsky, University of Florida, U.S.A.
P. Oscar Boykin, University of Florida, U.S.A.
Renato J. Figueiredo, University of Florida, U.S.A.

SZTAKI Desktop Grid: a Modular and Scalable Way of Building Large Computing Grids
Zoltan Balaton, MTA SZTAKI Research Institute, Hungary
Gabor Gombas, MTA SZTAKI Research Institute, Hungary
Peter Kacsuk, MTA SZTAKI Research Institute, Hungary
Adam Kornafeld, MTA SZTAKI Research Institute, Hungary
Jozsef Kovacs, MTA SZTAKI Research Institute, Hungary
Attila Csaba Marosi, MTA SZTAKI Research Institute, Hungary
Gabor Vida, MTA SZTAKI Research Institute, Hungary
Norbert Podhorszki, UC Davis, U.S.A.
Tamas Kiss, University of Westminster, U.K.

Direct Execution of Linux Binary on Windows for Grid RPC Workers
Yoshifumi Uemura, University of Tsukuba, Japan
Yoshihiro Nakajima, University of Tsukuba, Japan
Mitsuhisa Sato, University of Tsukuba, Japan

---------------------------------------------------------------------
SESSION II: SCHEDULING AND RESOURCE MANAGEMENT

Local Scheduling for Volunteer Computing
David Anderson, UC Berkeley, U.S.A.
John McLeod VII, Sybase, Inc., U.S.A.

Moving Volunteer Computing towards Knowledge-Constructed, Dynamically-Adaptive Modeling and Scheduling
Michela Taufer, University of Texas at El Paso, U.S.A.
Andre Kerstens, University of Texas at El Paso, U.S.A.
Trilce Estrada, University of Texas at El Paso, U.S.A.
David Flores, University of Texas at El Paso, U.S.A.
Richard Zamudio, University of Texas at El Paso, U.S.A.
Patricia Teller, University of Texas at El Paso, U.S.A.
Roger Armen, The Scripps Research Institute, U.S.A.
Charles L. Brooks III, The Scripps Research Institute, U.S.A.

Proxy-based Grid Information Dissemination
Deger Erdil, State University of New York at Binghamton, U.S.A.
Michael Lewis, State University of New York at Binghamton, U.S.A.
Nael Abu-Ghazaleh, State University of New York at Binghamton, U.S.A.

---------------------------------------------------------------------
SESSION III: DATA-INTENSIVE APPLICATIONS AND DISTRIBUTED STORAGE

Challenges in Executing Data Intensive Biometric Workloads on a Desktop Grid
Christopher Moretti, University of Notre Dame, U.S.A.
Timothy Faltemier, University of Notre Dame, U.S.A.
Douglas Thain, University of Notre Dame, U.S.A.
Patrick Flynn, University of Notre Dame, U.S.A.

Invited Paper: Storage@home: Petascale Distributed Storage
Adam L. Beberg, Stanford University, U.S.A.
Vijay Pande, Stanford University, U.S.A.

---------------------------------------------------------------------
SESSION IV: THEORY

Applying IC-Scheduling Theory to Familiar Classes of Computations
Gennaro Cordasco, University of Salerno, Italy
Grzegorz Malewicz, Google, Inc., U.S.A.
Arnold Rosenberg, University of Massachusetts at Amherst, U.S.A.

Invited Paper: A Combinatorial Model for Self-Organizing Networks
Yuri Dimitrov, Ohio State University, U.S.A.
Gennaro Mango, Ohio State University, U.S.A.
Carlo Giovine, Ohio State University, U.S.A.
Mario Lauria, Ohio State University, U.S.A.

Invited Paper: Towards Contracts & SLA in Large Scale Clusters & Desktop Grids
Denis Caromel, INRIA, France
Francoise Baude, INRIA, France
Alexandre di Costanzo, INRIA, France
Christian Delbe, INRIA, France
Mario Leyton, INRIA, France

#####################################################################
ORGANIZATION

General Chairs
Derrick Kondo, INRIA Futurs, France
Franck Cappello, INRIA Futurs, France

Program Chair
Gilles Fedak, INRIA Futurs, France

We have a parallel problem that shifts its load balance while executing even though we are certain that it shouldn't. The following will describe our experience level, our clusters, our application, and the problem.

Our Experience

We are the developers of an MPI parallel application -- a 2-d time-dependent multiphysics code -- with all the intimate knowledge of its architecture and implementation that implies. We are presently using the Portland Group Fortran and C compilers and MPICH-1 version 1.2.7. We have had success building and using other parallel applications on HPC systems and clusters of workstations, though in those cases the physics was 3-d. We have plenty of Linux workstation sysadmin experience.

Our House-Built Clusters

We have built a few, small, generally heterogeneous clusters of workstations around AMD processors, Netgear GA311 NICs, and different switches. We used Redhat 8 and 9 for our 32-bit processors, and have shifted to Fedora for our recent systems including our few ventures into 64-bit land. Some of our nodes have dual processors. We have not tuned the OSs at all, other than to be sure that our NICs have appropriate drivers. Some of our switches give us 80-90% of Gb speed as measured by NetPipe, both TCP-IP and MPI, and others give us 30%. In the case described here, the switch is a slower one, but the application's performance is determined by the latency since the messages are relatively small. Our only performance tools are the LINUX utility top and a stopwatch.

Our Application Architecture and Performance Expectation

During execution, the application takes thousands of steps that each advance simulation time. The processors advance through the different physics packages and parts thereof in lock step from one MPI_Waitall to the next, with limited amounts of work being done between the barriers. We use MPI_Allreduce to do maximums, minimums, and sums of various quantities.
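
As a concrete illustration of this pattern (a hedged sketch only; the buffer names, neighbour counts, and tags below are placeholders, not the application's actual code), each timestep boils down to posting nonblocking halo exchanges, blocking in MPI_Waitall, and then joining the collective reductions:

    /* Sketch of the per-timestep lockstep pattern described above (illustrative
       only): post nonblocking halo receives and sends, block in MPI_Waitall,
       then join the collective reductions. */
    #include <mpi.h>

    #define MAX_NEIGH 16    /* placeholder upper bound on neighbouring domains */

    void advance_one_step(double *halo_in, double *halo_out, int n,
                          int nneigh, const int *neigh, double *local_extrema)
    {
        MPI_Request req[2 * MAX_NEIGH];
        MPI_Status  stat[2 * MAX_NEIGH];
        double global_extrema[2];
        int i, nreq = 0;

        /* exchange boundary data with each neighbouring domain */
        for (i = 0; i < nneigh; i++)
            MPI_Irecv(&halo_in[i * n], n, MPI_DOUBLE, neigh[i], 0,
                      MPI_COMM_WORLD, &req[nreq++]);
        for (i = 0; i < nneigh; i++)
            MPI_Isend(&halo_out[i * n], n, MPI_DOUBLE, neigh[i], 0,
                      MPI_COMM_WORLD, &req[nreq++]);

        /* every rank blocks here until all of its halo traffic has completed */
        MPI_Waitall(nreq, req, stat);

        /* ... limited local physics work between the barriers ... */

        /* global reductions for maxima/minima/sums of various quantities */
        MPI_Allreduce(local_extrema, global_extrema, 2, MPI_DOUBLE, MPI_MAX,
                      MPI_COMM_WORLD);
    }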

The application uses a domain decomposition that does not change during each run. Each time step is roughly the same amount of work as previous ones, though the number of iterations in the implicit solution methods changes. However, all processors are taking the same number of iterations in each time step. Thus we expect that the relative load on a processor will remain roughly the same as the relative size of the domain it is assigned in the decomposition. The problem is that it doesn't.

There is one exception to our expectation, in that intermittently after some number of time steps or interval of simulation time, the application does output. Each processor writes some dump files identified with its node number to a problem directory, and a single processor combines those files into one while all the other processors wait. By controlling the frequency of the output, we keep the total time lost in this wait relatively small. In addition, every ten cycles, the output processor writes a brief summary of the problem state to the terminal output.

One more thing before we get to the problem. We don't use mpirun; our application reads a processor group file and starts the remote processes itself. Thus, there is one processor that is distinguished from the others: it was directly invoked from the command line of a shell -- usually tcsh, but never mind that religious war.

The Problem

We have observed unexpected and extreme load-balance shifts during both two- and four-processor runs. In the following, our focus will be on the four processor run. We observe the load balance by monitoring CPU usage on each of the processors with separate xterm-invoked tops from a non-cluster machine. Our primary observable is %CPU; as a secondary observable, we monitor the wall time interval between the 10-cycle terminal edit.

The load balance starts out looking like the relative sizes of the domains we assigned to the various processors, just as we expect. The processor on which the run was started has the smallest domain to handle, and its %CPU is initially around 50%, while the others are around 90%. After a few hundred time steps or so the CPU usage of the processor on which the job was started begins to increase and the others begin to fall. After a thousand time steps or so, the CPU usage is nearly 90% for the originating process, and less than 20% for the remote processes. Not surprisingly, the wall time between 10-cycle terminal edits goes up by a factor of 4 over the same period. By observation, no other task ever consumes more than a few tenths of a percent of the CPU.

The originating processor is the output processor, but only the terminal output is happening during this period, and we observe no significant change in the CPU usage during the cycles when that output is produced. Top is updating its output every 5 seconds and in this run our application is taking one time step every 2 seconds. The message count and size of the messages imply that two processors are spending about 30% of their time in system time for message startup and about a tenth that much actually transmitting data. There are about 6,000 messages sent and received in each time step on those processors, though it varies slightly from time step to time step. The other two processors -- one of which is the originating processor -- have about half that many messages to send and receive, and spend correspondingly less time doing it.

Though we have shuffled the originating processor and the processors in the group the results are always similar. In one case we ran with four identical nodes except that one had Redhat 8 while the others were Redhat 9. In another case we ran four Redhat 9 machines with slightly different AMD processor speeds (2.08 vs 2.16 GHz). The 9.0 kernels are 2.4.20, while the 8.0 has been upgraded to 2.4.18.

Here is a final bit of data. To prove that the shift was not determined by the state of the problem being simulated, we restarted the simulation from a restart dump made by our application when the load had shifted to the originating processor. The load balance immediately after the restart again reflected the domain size as it had in the beginning of the unrestarted simulation. After a thousand cycles in the restarted problem, the load had shifted back to the originating processor.

Conclusion/Hypothesis

Our tentative conclusion is that either MPICH or the operating system is eating an increasing amount of time on the originating processor as the number of time steps accumulates. It is probable that the accumulated number of messages transmitted is the problem. It acts like a leak, but of processor CPU time rather than memory. Top does not show any increase in resident set size (RSS) during the run.

Does anyone have any ideas what this behavior might be, how we can test for it, and what we can do to fix it? Thanks for any help in advance.

Mike

Jaime,

In my humble opinion, I think you can start with NFS (but using at least NFSv4) and see what happens; for instance, what's the real disk access pattern of your cluster, in terms of IOPS, average bandwidth usage, and read/write pattern.

It could be interesting to know not only what's going to be the file server, but also what the underlying storage subsystem and the network requirements are. I mean, the server could be big and the network something like IB or Myrinet, and then you could be serving files that reside on a couple of internal disks, so performance would sink...

Anyway, I've seen some NFS file servers at that cluster size; I'd suggest trying it and checking whether it is enough for your applications.

Hope this helps,
Daniel.

2007/3/1, [email protected] :>> Hi,>> I have a small (16 dual xeon machines) cluster. We are going to add> an additional machine which is only going to serve a big filesystem via> a gigabit interface.>> Does anybody knows what is better for a cluster of this size, exporting> the> filesystem via NFS or use another alternative such as a cluster filesystem> like GFS or OCFS?>> Thanks in advance>> -->> Jaime D. Perea Duarte. > Linux registered user #10472>> Dep. Astrofisica Extragalactica.> Instituto de Astrofisica de Andalucia (CSIC)> Apdo. 3004, 18080 Granada, Spain.> _______________________________________________> Beowulf mailing list, [email protected]> To change your subscription (digest mode or unsubscribe) visit> http://www.beowulf.org/mailman/listinfo/beowulf>-------------- next part --------------An HTML attachment was scrubbed...URL: http://www.scyld.com/pipermail/beowulf/attachments/20070302/36d86e0c/attachment.html

> I have a small (16 dual xeon machines) cluster. We are going to add
> an additional machine which is only going to serve a big filesystem via
> a gigabit interface.

are you comfortable with the expected performance of this design? that is, without any tuning/tweaking, you should achieve ~70 MB/s assuming the server's local disk(s) can manage, etc.

> Does anybody know what is better for a cluster of this size, exporting the
> filesystem via NFS or use another alternative such as a cluster filesystem
> like GFS or OCFS?

NFS is really easy. for a small cluster and only Gb, I wouldn't even consider anything else. once you get into hundreds of nodes (or perhaps fewer very IO-intensive ones), alternatives are probably necessary. mostly, I'd decide on aggregate bandwidth requirements, though I'm sure NFS overhead eventually becomes a problem (thousands of nodes). I think your cluster would be very happy with 1-2 Gb links from the server, fed with a nice fast md-based raid array.

regards, mark hahn.

On Sat, 3 Mar 2007, Daniel Navas-Parejo Alonso wrote:

> In my humble opinion, I think you can start with NFS (but using at least
> NFSv4)

How stable/usable is the Linux NFSv4 implementation these days ?

It's been a while since I followed the Linux v4 mailing lists and I'm way out of touch with how they're getting on..

cheers,
Chris
--
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia

On Saturday 03 March 2007 23:39, Chris Samuel wrote:
> How stable/usable is the Linux NFSv4 implementation these days ?

It sucks if you are using CentOS 4.x. With Debian Etch, it seems to work pretty well.

wt
--
Warren Turkal

>> In my humble opinion, I think you can start with NFS (but using at least
>> NFSv4)
>
> How stable/usable is the Linux NFSv4 implementation these days ?

why V4?
- security. within a cluster, I don't see the point to, say, kerberos.
- compound rpcs. probably provides somewhat better efficiency.
- open/close, byte-range locking. I don't see much demand.
- client caching/delegation/leases - could be valuable for efficiency.

I find that nfsv3 works fine for moderate IO on O(100) clients. I would be very interested to know whether others have observed performance benefits for v4, and whether it's an easy upgrade, such as no new/onerous security framework ('framework' is always a danger sign for me ;)

On Mon, 5 Mar 2007, Mark Hahn wrote:

> why V4?
> - security. within a cluster, I don't see the point to, say, kerberos.

Agreed, not to mention all the pain of trying to get Kerberos tickets passed through the queueing system and the fact that if you're running a 3 month job it's going to be quite hard to persuade your Kerberos admin to let you be able to create a ticket that lasts that long..

> - compound rpcs. probably provides somewhat better efficiency.

Yup, should ease the burden of a lot of stat()'s.

> - open/close, byte-range locking. I don't see much demand.

Pass. :-)

> - client caching/delegation/leases - could be valuable for efficiency.

Indeed, this to me is the most useful part of it, especially for those people who are running code that should use local scratch but doesn't (either due to lack of coding experience or not having access to the source)..

> I find that nfsv3 works fine for moderate IO on O(100) clients.

Likewise, though we do get the occasional user who is able to generate a pathological case..

> I would be very interested to know whether others have observed
> performance benefits for v4, and whether it's an easy upgrade,
> such as no new/onerous security framework ('framework' is always
> a danger sign for me ;)

When I last played with it you could still use AUTH_SYS (as in v3) rather than having to use Kerberos.

cheers!
Chris
--
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia

Chris Samuel wrote:
> On Mon, 5 Mar 2007, Mark Hahn wrote:
>
>> why V4?
>> - security. within a cluster, I don't see the point to, say, kerberos.
>
> Agreed, not to mention all the pain of trying to get Kerberos tickets passed
> through the queueing system and the fact that if you're running a 3 month job
> it's going to be quite hard to persuade your Kerberos admin to let you be
> able to create a ticket that lasts that long..

Purely as a point of interest, since high energy physics labs use AFS (and hence kerberos) they have already faced this one. The ticket is extended when a batch job is submitted:

http://services.web.cern.ch/services/afs/arc.html#SECTION00040000000000000000

On Sat, 3 Mar 2007, David Mathog wrote:

> So it would be nice if the range of underclocking / undervolting
> adjustments provided on compute nodes extended quite a bit further
> towards the lower end than it currently does.

FWIW 2.6.21 looks like it will include i386 support for the clockevents and dyntick patches that have been developed out in the real time Linux world. Apparently they have AMD64 and ARM patches too, but these haven't been merged as of yet.

There's a nice LWN article that describes this work:

http://lwn.net/Articles/223185/

All of this is an improvement, but there is still one thing which could be better: there is no real need for a periodic tick in the system. That is especially true when the processor is idle. An idle CPU can save quite a bit of power, but waking that CPU up 100 times (or more) per second will hurt those power savings considerably. With a flexible timer infrastructure, there is no point in turning the CPU back on until it has something to do. So, when the (i386) kernel goes into its idle loop, it checks the next pending timer event. If that event is further away than the next tick, the periodic tick is turned off altogether; instead, the timer is programmed to fire when the next event comes due. The CPU can then rest unharassed until that time - unless an interrupt comes in first. Once the processor goes out of the idle state, the periodic tick is restored. [...]

It quotes the developers saying:

The implementation leaves room for further development like full tickless systems, where the time slice is controlled by the scheduler, variable frequency profiling, and a complete removal of jiffies in the future.

--
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia

I'd strongly recommend Lustre. It will work perfectly well from a single server node and give much higher bandwidths than NFS. If you have two NICs you can also serve up the file system over both and see around 150MB/s total bandwidth.

Also, if you need more storage in the future, you can just add more servers... and get linear scaling of bandwidth.

Stu.

On 3/1/07, [email protected] wrote:> Hi,>> I have a small (16 dual xeon machines) cluster. We are going to add> an additional machine which is only going to serve a big filesystem via> a gigabit interface.>> Does anybody knows what is better for a cluster of this size, exporting the> filesystem via NFS or use another alternative such as a cluster filesystem> like GFS or OCFS?>> Thanks in advance>> -->> Jaime D. Perea Duarte. > Linux registered user #10472>> Dep. Astrofisica Extragalactica.> Instituto de Astrofisica de Andalucia (CSIC)> Apdo. 3004, 18080 Granada, Spain.> _______________________________________________> Beowulf mailing list, [email protected]> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf>

--
Dr Stuart Midgley
[email protected]

Hi,

I am building a small (~16 node) cluster with an IB interconnect. I need to decide whether I will buy a cheaper, dumb switch and run OpenSM, or get a more expensive switch with a built-in subnet manager. The largest this system would ever grow is 32 nodes (two 24 port switches).

Various vendors (integrators, not switch OEMs) have stated to me that managed switches are the way to go, and that OpenSM is (a) buggy, and (b) very time-consuming to set up. But a managed name brand switch seems to cost a lot more than a non-managed one using the Mellanox reference design kit (rebadged, but I suspect made by Flextronics...).

My other query is about diagnostic software. With an ethernet switch it is pretty easy to fire up Ethereal (sorry Wireshark, but it is such a silly name) or Etherape and get a look at what is going on. If I buy a Cisco or Voltaire etc. do they come with tools that let me get accurate representations of what is going on? Or are their tools really for large IB networks?

Regards,
Andrew


On 3/5/07, Chris Samuel wrote:
>
> > - compound rpcs. probably provides somewhat better efficiency.
>
> Yup, should ease the burden of a lot of stat()'s.

>
> > - client caching/delegation/leases - could be valuable for efficiency.
>
> Indeed, this to me is the most useful part of it, especially for those people
> who are running code that should use local scratch but doesn't (either due to
> lack of coding experience or not having access to the source)..

Our developers had that issue of inconsistent file system views on RHEL-based systems. Some of it is solved by disabling dir list caching, some by using noac. What another developer was doing was writing simultaneously to the same file, partitioned over several nodes; I told him this is probably not the right way to do file writing. Apparently he used to do it on Sun Solaris and it worked flawlessly. NFSv4 brings a standard client implementation to the table; unfortunately, Red Hat recommends RHEL5, which should be out soon now, for NFSv4.

Walid.

> Our developers had that issue of inconsistent file system view in RHEL
> based systems, some of it is solved by disabling dir list caching, another
> by using noac,

well, developers should be smart enough to know what FS they're using, and how it's intended to behave. turning off AC is a nice option, but smarter is to leave it on and not try to cause race conditions. (I expect that such race-friendly behavior will fail on some other non-NFS filesystems, though probably harder to trigger.)

> what the other was doing was writing simultaneously to the
> same file partitioned over several nodes, I told this is probably not the
> right way to do file writing. apparently he used to do it in Sun Solaris and
> it worked flawlessly.

I would spank any developer who said "but it works on platform X"! developers must be aware of the spec, not merely what they can get away with somewhere, sometime. of course, this is the thinking behind apps having "supported" platforms - just a fancy way of saying "no, we don't know what standards-conformance we need, or how we violate the standard, but here's a few places we haven't yet noticed any bad-enough bugs".

writing to different sections of a file is probably wrong on any networked FS, since there will inherently be obscure interactions with the size and alignment of the writes vs client pagecache, network transport, actual network FS, server pagecache and underlying server/disk FS. in my experience, people who expect it to "just work" have an incredibly naive model of how a network FS works (ie, write() produces an RPC direct to the server)

Walid wrote:
> Our developers had that issue of inconsistent file system view in RHEL
> based systems, some of it is solved by disabling dir list caching, another
> by using noac, what the other was doing was writing simultaneously to the
> same file partitioned over several nodes, I told this is probably not the
> right way to do file writing. apparently he used to do it in Sun Solaris
> and it worked flawlessly.

That leads me to a damn stupid question - how do NFSv4 and ROMIO interoperate then? Anyone got experience of that, or is it signed "There be Dragons"?

Thanks to those who took the time to consider my original description of our problem. It has now been resolved and the simulation load balance is remaining fixed over thousands of time steps.

The problem, not surprisingly, was in our application code, specifically in our use of MPI in one particular place. We had posted some receives on the originating processor -- which was also the output processor -- for messages that were never sent. We failed to detect the error because -- in another error -- we had failed to do a WaitAll on the receive message queue for those messages. The result was that the originating/output processor had an ever increasing receive queue to hunt through while pairing up receives and arriving messages, and so took increasingly longer with each successive timestep.
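
For what it's worth, this failure mode is easy to reproduce in a toy program. The hedged sketch below (illustrative only, not the application's code) posts receives on rank 0 that are never matched by a send and never waited on; the posted-receive queue on rank 0 grows every step, so matching each genuine incoming message takes progressively longer, which is exactly the kind of creeping slowdown described above:

    /* leaky_recv.c - illustrative only: receives posted with a tag nobody sends,
       and never completed, accumulate in rank 0's posted-receive queue. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nproc, step, src;
        double sendbuf[64] = {0}, recvbuf[64], junk[64];
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        for (step = 0; step < 10000; step++) {
            if (rank == 0) {
                /* BUG: one receive per remote rank, on a tag the senders never
                   use, with no matching MPI_Wait -- the requests pile up. */
                for (src = 1; src < nproc; src++)
                    MPI_Irecv(junk, 64, MPI_DOUBLE, src, 999,
                              MPI_COMM_WORLD, &req);   /* handle overwritten/leaked */
            }
            /* normal per-step traffic that does get matched and completed */
            MPI_Allreduce(sendbuf, recvbuf, 64, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);
        }

        if (rank == 0)
            printf("done\n");
        MPI_Finalize();
        return 0;
    }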

We also sent some messages to processors that did not exist, though I think this was less of a problem.

We found the problem by looking for one of a related kind. We built and ran a test code, and found accidentally that failing to post receives caused processors to have to hunt through an increasing queue of received but unprocessed messages.

Thanks again.

Mike

Stu Midgley wrote:
> I'd strongly recommend Lustre. It will work perfectly well from a
> single server node and give much higher bandwidths than NFS. If you
> have two NICs you can also serve up the file system over both and see
> around 150MB/s total bandwidth.
>
> Also, if you need more storage in the future, you can just add more
> servers... and get linear scaling of bandwidth.

How much do you use Lustre? Yes, you can get that bandwidth, but if your code doesn't do large streaming I/O, your performance will be worse than NFS. Also, I would like to hear someone speak up who uses Lustre in a PRODUCTION environment and doesn't have a kernel hacker on staff.

Also, Lustre metadata doesn't scale (yet). You can add another server, but that won't improve the metadata.

Using Lustre also requires you to re-patch your kernel every security update, then get the bugs out again.

Lustre is the right answer for some, but not if you aren't going to have that many compute nodes, and it doesn't sound like you will here.

Craig

>
> Stu.
>
> On 3/1/07, [email protected] wrote:
>> Hi,
>>
>> I have a small (16 dual xeon machines) cluster. We are going to add
>> an additional machine which is only going to serve a big filesystem via
>> a gigabit interface.
>>
>> Does anybody know what is better for a cluster of this size, exporting the
>> filesystem via NFS or use another alternative such as a cluster filesystem
>> like GFS or OCFS?
>>
>> Thanks in advance
>>
>> --
>>
>> Jaime D. Perea Duarte.
>> Linux registered user #10472
>>
>> Dep. Astrofisica Extragalactica.
>> Instituto de Astrofisica de Andalucia (CSIC)
>> Apdo. 3004, 18080 Granada, Spain.
>> _______________________________________________
>> Beowulf mailing list, [email protected]
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>

On Mon, Mar 05, 2007 at 02:47:50PM -0700, Craig Tierney wrote:

> Also, I would like to hear someone
> speak up who uses Lustre in a PRODUCTION environment and
> doesn't have a kernel hacker on staff.

One of our Oil & Gas customers does this, no kernel hacker, but they are paying CFS for support. Which is almost the same thing.

-- greg

Actually I run it in production and I'm not a kernel hacker. We currently have 6 OSS's with software RAID5 across 6 internal SATA disks. We see about 190MB/s per OSS out of the disks and around 150MB/s via the dual network interfaces.

I can't think of any benchmark you care to mention on which a single Lustre OSS/MDS won't outperform NFS. Especially if you configure your systems to use both NICs (most motherboards now come with dual interfaces), and I don't mean trunking the ports. Just configure portals to know that it can speak to the OSS's via both NICs and it will handle the rest for you.

Lustre's metadata performance is WAY better than NFS. I'd almost say it's WAY better than any global FS I've played with.

Certainly, you have to use Lustre kernels... all we do is run CentOS as our clients/servers and then we just grab the pre-built/supported kernels from CFS.

It's all pretty easy. The current Lustre v1.4 is very, very nice and we have found it to be very robust. Nearly all the problems we experience turn out to be flakey hardware or kernel issues. Not Lustre at all.

You can also check out the FUSE implementation of a Lustre client I posted to CFS's website a few weeks back:

https://mail.clusterfs.com/wikis/lustre/fuse

while it needs a LOT of work to give decent performance, it does work. Oh, and if someone ports liblustre to Mac OS X, I could also run it on my mac :)

Stu.

>
> How much do you use Lustre? Yes, you can get that bandwidth,
> but if your code doesn't do large streaming I/O, your performance
> will be worse than NFS. Also, I would like to hear someone
> speak up who uses Lustre in a PRODUCTION environment and
> doesn't have a kernel hacker on staff.
>
> Also, Lustre metadata doesn't scale (yet). You can add
> another server, but that won't improve the metadata.
>
> Using Lustre also requires you to re-patch your kernel every security
> update, then get the bugs out again.
>
> Lustre is the right answer for some, but not if you aren't going
> to have that many compute nodes, and it doesn't sound like
> you will here.
>
> Craig

--
Dr Stuart Midgley
[email protected]

On Fri, 2 Mar 2007, [email protected] wrote:

> I have a small (16 dual xeon machines) cluster. [...]
> Does anybody know what is better for a cluster of this size, exporting the
> filesystem via NFS

FWIW we run two NFS servers (dual 2.0GHz Opteron 240's) with users split across the two and they cope with 3 clusters, two with ~180 CPUs and one with ~30 CPUs, all run at an average of 83% utilisation over the last 12 months (one at around 92% utilisation for the last 3 months).

So yes, NFS should be fine. Just don't try and run Gaussian on it. :-)

cheers,
Chris
--
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia


On Mon, 5 Mar 2007, John Hearns wrote:

> Purely as a point of interest, since high energy physics labs use AFS
> (and hence kerberos) they have already faced this one.

Interesting, though it's not clear from that whether it can cope with, say, automatically renewing expiring tickets for running jobs where the job lifetime is longer than the maximum allowed lifetime of a Kerberos ticket.

NB: I've never used Kerberos in anger, so be gentle. :-)

cheers,
Chris
--
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia


Mark Hahn wrote:
>> Our developers had that issue of inconsistent file system view in RHEL
>> based systems, some of it is solved by disabling dir list caching, another
>> by using noac,
>
> well, developers should be smart enough to know what FS they're using,
> and how it's intended to behave. turning off AC is a nice option, but
> smarter is to leave it on and not try to cause race conditions.

... or to catch them and fix them ...

> (I expect that such race-friendly behavior will fail on some other
> non-NFS filesystems, though probably harder to trigger.)
>
>> what the other was doing was writing simultaneously to the
>> same file partitioned over several nodes, I told this is probably not the
>> right way to do file writing. apparently he used to do it in Sun Solaris
>> and it worked flawlessly.
>
> I would spank any developer who said "but it works on platform X"!
> developers must be aware of the spec, not merely what they can get away
> with somewhere, sometime. of course, this is the thinking behind apps
> having "supported" platforms - just a fancy way of saying
> "no, we don't know what standards-conformance we need, or how we violate
> the standard, but here's a few places we haven't yet noticed any
> bad-enough bugs".

*sigh* If only we could "spank" them. In ISV circles, there is a meme running about that Linux == RHEL*. So they code everything to that, and not to the LSB.

Note: this is one thing that the Windows folks sorta kinda do right. There is a "spec" to some degree, and everyone can kinda sorta write to it.

Then again, when you completely dominate something, you can dictate to users. We have a long standing IE problem with rendering forms in tables, everyone else can do it right, and the code checks out as w3c compliant ... oh, never mind, not worth trying to get IE to talk standards.

Standards only work when all players follow them. They also need to be simple enough to follow. Making a standard impossible to follow helps no end user.

--

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC
email: [email protected]
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452 or +1 866 888 3112
cell : +1 734 612 4615

> > Our developers had that issue of inconsistent file system view in RHEL
> > based systems, some of it is solved by disabling dir list caching, another
> > by using noac,
>
> well, developers should be smart enough to know what FS they're using,
> and how it's intended to behave. turning off AC is a nice option,
> but smarter is to leave it on and not try to cause race conditions.
> (I expect that such race-friendly behavior will fail on some other
> non-NFS filesystems, though probably harder to trigger.)

I recall doing a port of a former employer's seismic processing code to Linux, which was used to having GPFS or PIOFS around. The only distributed(!) filesystem of any sort that I could afford was NFS, which wasn't too bad, except that various programs insisted on having multiple nodes (say about 200+) appending to the same file simultaneously. After much trial & error, I discovered noac and also that opening the file on the clients with O_SYNC would send each write off to the NFS server immediately. Not an elegant solution. All this was happening over 100base-T, and the NFS server, if we were lucky, had a GB connection to the switch. We discovered an interesting race condition in one of the ethernet drivers along the way. They tell me that they're using GPFS under Linux now.
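
(For readers who haven't run into it: O_SYNC is just a flag to open(2). Combined with the noac mount option it forces each write out to the NFS server instead of letting the client cache it. A minimal sketch, with a hypothetical path, follows; note that even then, concurrent appends from many clients are not guaranteed to be atomic over NFS, which is part of why the approach above was fragile.)

    /* Minimal sketch: open an NFS-backed file so every write is pushed to the
       server synchronously. The path is a placeholder; pair with the noac
       mount option if other clients need to see the data promptly. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/nfs/scratch/trace.dat";      /* hypothetical path */
        const char line[] = "node 17: step 1024 done\n";

        int fd = open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        /* with O_SYNC, write() does not return until the data has been
           written through to the server rather than just cached locally */
        if (write(fd, line, strlen(line)) < 0)
            perror("write");
        close(fd);
        return 0;
    }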

Stephen

Hi Stu,

> Actually I run it (Lustre) in production and I'm not a kernel hacker.

Thank you for this snapshot of 'real world' Lustre use. At the risk of hijacking this thread (or borrowing it...) could I ask you a question about Lustre? I've always been interested in Lustre but never used it.

Like everyone on this mailing list I am interested in a distributed filesystem whose bandwidth and speed are commensurate with the total raw hardware IO performance of the disks and the network speed and bisection bandwidth. But there are two additional features that I also think would be very desirable:

(1) RAID-across-nodes. For example every ten nodes form a redundant RAID set. The disappearance of any one of these nodes causes no data loss, service loss, or corruption at the user level. The total redundant storage available from the ten nodes is 90% of the available raw storage.

(2) Symmetry: all nodes have identical behavior and features. There are no specialized IO or metadata nodes, which act as filesystem bottlenecks and which are single points of failure.

Am I correct that Lustre does not offer either of these features?

Do you (or does someone else) know if there is an open-source or commercial distributed (posix) filesystem with these features?

Cheers, Bruce

Bruce Allen wrote:

> (1) RAID-across-nodes. For example every ten nodes form a redundant
> RAID set. The disappearance of any one of these nodes causes no data
> loss, service loss, or corruption at the user level. The total
> redundant storage available from the ten nodes is 90% of the available
> raw storage.

Using iSCSI targets and iSCSI initiators, you could build RAID5 or RAID6 across boxes using the linux MD device. We have proposed this to some financial customers using our JackRabbit unit.

> (2) Symmetry: all nodes have identical behavior and features. There are
> no specialized IO or metadata nodes, which act as filesystem bottlenecks
> and which are single points of failure.

For this, we use a set/pair/triple of thin HA servers with STONITH running. You can run them in active-passive or active-active (the latter requires some sort of CFS). The nice part about this is that the metadata resides within the FS, and you look at each machine as a big block o' disks.

> Am I correct that Lustre does not offer either of these features?

Lustre is an object-based data storage system; it separates metadata from data.

> Do you (or does someone else) know if there is an open-source or
> commercial distributed (posix) filesystem with these features?

If you use our idea above (iSCSI targets/initiators), you could run active-passive/STONITH mode using xfs/jfs. We have proposed this at a number of places where they need very fast cutover and downtime of any sort means significant loss.

> Cheers,
> Bruce

Joe

--

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC
email: [email protected]
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452 or +1 866 888 3112
cell : +1 734 612 4615

On Mar 5, 2007, at 10:43 PM, Chris Samuel wrote:

> So yes, NFS should be fine. Just don't try and run Gaussian on it. :-)

Ok, I'll bite. We're just starting to support Gaussian on a couple of small clusters (32 and 64 cores respectively) and we don't have a lot of experience with it. It looks like there are 3 primary directories: the software root, the tmp dir, and the molecular system/output files. Which subset of these shouldn't be accessed via NFS?

thanks,
charlie

Charlie Peck
Computer Science, Earlham College
http://cs.earlham.edu
http://cluster.earlham.edu

Charlie Peck wrote:
> On Mar 5, 2007, at 10:43 PM, Chris Samuel wrote:
>
>> So yes, NFS should be fine. Just don't try and run Gaussian on it. :-)
>
> Ok, I'll bite. We're just starting to support Gaussian on a couple of
> small clusters (32 and 64 cores respectively) and we don't have a lot of
> experience with it. It looks like there are 3 primary directories, the
> software root, the tmp dir, and the molecular system/output files.
> Which subset of these shouldn't be accessed via NFS?

Charlie:

Depending upon which links are run, Gaussian can do a fairly good job of consuming all your I/O bandwidth, and then some. Not all links are like this; the DFT links seem to be non-I/O-bound. As soon as you start spilling integrals to disk, you will see what we mean.

Joe


--

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC
email: [email protected]
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452 or +1 866 888 3112
cell : +1 734 612 4615

Hi,

On 06.03.2007, at 14:00, Charlie Peck wrote:

> On Mar 5, 2007, at 10:43 PM, Chris Samuel wrote:
>
>> So yes, NFS should be fine. Just don't try and run Gaussian on it. :-)
>
> Ok, I'll bite. We're just starting to support Gaussian on a couple
> of small clusters (32 and 64 cores respectively) and we don't have
> a lot of experience with it. It looks like there are 3 primary
> directories, the software root, the tmp dir, and the molecular
> system/output files. Which subset of these shouldn't be accessed
> via NFS?

For Gaussian, all scratch files can live in the local $TMPDIR on the nodes - though the program itself is distributed via NFS for convenience. This holds even for parallel runs with Linda. After the job we copy any necessary files back to the user's directory, if s/he wishes to access them; only the default output is written directly to its final location during execution. Just don't set GAUSS_SCRDIR; instead cd to the batch-system-supplied $TMPDIR before the program call.

Some hints you can find on the SGE list:

http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=14600

-- Reuti


Evening

> Thank you for this snapshot of 'real world' Lustre use. At the risk of
> hijacking this thread (or borrowing it...) could I ask you a question
> about Lustre? I've always been interested in Lustre but never used it.

I strongly suggest you grab a few old boxes and play with it; it really is very good. The 1.6 betas are a little unstable, but 1.4 is very, very solid. A tad fiddly to set up, but not really that hard. 1.6 is definitely nicer to set up and use.

> Like everyone on this mailing list I am interested in a distributed
> filesystem whose bandwidth and speed are commensurate with the total raw
> hardware IO performance of the disks and the network speed and
> bisection bandwidth. But there are two additional features that I also
> think would be very desirable:
>
> (1) RAID-across-nodes. For example every ten nodes form a redundant RAID
> set. The disappearance of any one of these nodes causes no data loss,
> service loss, or corruption at the user level. The total redundant
> storage available from the ten nodes is 90% of the available raw storage.

No, Lustre does not currently support this. There are lots of ways you could achieve this (as mentioned in other emails), but they will all reduce bandwidth :) It is definitely on the Lustre road map to deliver RAID across servers, but it isn't there yet. Having said that, there is nothing stopping you RAIDing the disks within a node. But, as I keep saying, your NFS servers don't do this either ;)

> (2) Symmetry: all nodes have identical behavior and features. There are
> no specialized IO or metadata nodes, which act as filesystem bottlenecks
> and which are single points of failure.

No, this is not really what Lustre is trying to achieve. But it does allow you to have failover in the servers and metadata servers, so if one crashes, another will take over. Clustered metadata servers are on the roadmap... but again, it's not there yet. Um... did I mention your NFS servers?

> Am I correct that Lustre does not offer either of these features?
>
> Do you (or does someone else) know if there is an open-source or
> commercial distributed (posix) filesystem with these features?
>
> Cheers,
> Bruce

I think there are some open-source projects (GlusterFS?) that claim to do this, but I suspect their bandwidth is nothing approaching Lustre's... and for all they claim, their metadata performance probably won't match Lustre either.

With Lustre 1.6 I was seeing 170MB/s sustained from single clients to the Lustre storage. That's pretty impressive given two NICs in the client... and I didn't even play with the tuning parameters or jumbo frames etc. That was straight out of the box. With 6 OSSs the aggregate bandwidth with 1.6 was ~1GB/s... and it happily ran with 16 instances of Bonnie++ hammering away on it for a week.

1.4 is slightly down on bandwidth... but stable :) I've since been told that with tuning you can get 1.4 to perform as well as 1.6.

I really think people should try Lustre. A lot of people were put off in the early days because there was little documentation... there were few tools to help configure/mount etc. BUT, with 1.4, it is a very nice product. If you can afford Elan, then you will be in for a very nice experience.

Stu.

--
Dr Stuart Midgley
[email protected]

[snip]
> *sigh* If only we could "spank" them. In ISV circles,
> there is a meme running about that Linux == RHEL*. So
> they code everything to that, and not to the LSB.

RHEL makes it simple for 3rd party vendors to port their product to Linux because of the much longer support windows. Vendors love it. And large companies like the one I work for love it. And it works very well - outside the cluster.

We have the RHEL conversation on a regular basis with our bosses. RH will come in and talk to them about the wonders of RHEL, and they want us to use it in the cluster, especially for infrastructure.

We then have to patiently explain that we have tested standard RHEL kernels and they do not perform well under our workload - and we point them to our documentation. That usually halts their forward momentum, and workflow does not suffer.


On Mon, Mar 05, 2007 at 11:08:28AM -0500, Mark Hahn wrote:

> writing to different sections of a file is probably wrong on any
> networked FS, since there will inherently be obscure interactions
> with the size and alignment of the writes vs client pagecache,

I'm rather surprised to see that sentiment on a mailing list for high performance clusters :>

I would contend that writing to different sections of a file *must* be supported by any file system deployed on a cluster. How else would you get good performance from MPI-IO?

PVFS, GPFS, and Lustre all support simultaneous writes to different sections of a file.
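
For concreteness, here is a minimal sketch of that kind of access (the filename and block size are arbitrary placeholders): each rank writes its own disjoint block of one shared file at an explicit offset.

/* Minimal MPI-IO sketch: every rank writes a disjoint region of a shared
 * file, so writes never overlap. Filename and block size are placeholders. */
#include <mpi.h>

#define BLOCK 1024  /* doubles per rank, for illustration */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[BLOCK];
    for (int i = 0; i < BLOCK; i++)
        buf[i] = rank;  /* dummy data */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Offsets are computed from the rank, so regions are disjoint. */
    MPI_Offset offset = (MPI_Offset)rank * BLOCK * sizeof(double);
    MPI_File_write_at(fh, offset, buf, BLOCK, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}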

> in my experience, people who expect it to "just work" have an
> incredibly naive model of how a network FS works (ie, write()
> produces an RPC direct to the server)

I agree that the POSIX API and consistency semantics make it difficult to achieve high I/O rates for common scientific workloads, and that NFS is probably not the best solution for those truly parallel workloads.

Fortunately, there are good alternatives out there.

==rob

--
Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Lab, IL USA                 B29D F333 664A 4280 315B

On Mon, Mar 05, 2007 at 04:26:28PM +0000, John Hearns wrote:
> That leads me to a damn stupid question - how do NFSv4 and ROMIO
> interoperate then? Anyone got experience of that,
> or is it signed "There be Dragons"?

It's not a stupid question at all. It's very important to understand the impact the choice of file system has on the higher levels of the I/O software stack.

ROMIO does not have a special "NFSv4" ADIO driver. ROMIO treats it like regular NFSv3. In short, you can use it, but you'll have to disable all caching to make it behave correctly. You'll get rather bad performance for most workloads, but what good is fast I/O if you get garbled data in your file?

I know I beat this drum a lot, but do consider a true parallel file system like PVFS for your MPI-IO applications. GPFS and Lustre would be good options too.

==rob

--
Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Lab, IL USA                 B29D F333 664A 4280 315B

On Tue, Mar 06, 2007 at 08:17:32AM +0900, Stu Midgley wrote:
> I can't think of any benchmark you care to mention that a single
> lustre OSS/MDS won't outperform NFS.

Consider an MPI-IO benchmark where all processes write to different regions of a file. This workload is common in scientific applications, say when all processes need to write an HDF5 element to a datafile.

Run that benchmark with one processor and you will get great performance out of Lustre. Lustre does an excellent job of caching data and making single-processor I/O go really, really fast.

Run that benchmark with two processors, and the clients will spend a great deal of time revoking each other's extent-based locks and expiring entries from their caches. Performance will take a significant hit, but will increase as you add more processes.

I don't mean to come across as a Lustre hater. I'm just trying to keep the discussion honest: choosing the "right" file system for an application is hard, and lots of factors come into play.

==rob

--
Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Lab, IL USA                 B29D F333 664A 4280 315B

>> writing to different sections of a file is probably wrong on any
>> networked FS, since there will inherently be obscure interactions
>> with the size and alignment of the writes vs client pagecache,
>
> I'm rather surprised to see that sentiment on a mailing list for high
> performance clusters :>

smiley noted, but I would suggest that HPC is not about convenience first - simply having each node write to a separate file eliminates any such issue, and is hardly an egregious complication to the code.

> I would contend that writing to different sections of a file *must* be
> supported by any file system deployed on a cluster. How else would
> you get good performance from MPI-IO?

who uses MPI-IO? straight question - I don't believe any of our 1500 users do.

> PVFS, GPFS, and Lustre all support simultaneous writes to different
> sections of a file.

NFS certainly does as well. you just have to know the constraints. are you saying you can never get pathological or incorrect results from parallel operations on the same file on any of those FS's?

>> in my experience, people who expect it to "just work" have an
>> incredibly naive model of how a network FS works (ie, write()
>> produces an RPC direct to the server)
>
> I agree that the POSIX API and consistency semantics make it difficult
> to achieve high I/O rates for common scientific workloads, and that
> NFS is probably not the best solution for those truly parallel workloads.
>
> Fortunately, there are good alternatives out there.

starting with the question: "do you have a good reason to be writing in parallel to the same file?". I'm not saying the answer is never yes.

I guess I tend to value portability by obscurity-avoidance. not if it makes life utter hell, of course, but...

> smiley noted, but I would suggest that HPC is not
> about convenience first - simply having each node
> write to a separate file eliminates any such issue,
> and is hardly an egregious complication to the code.

In my environment, it is not always up to the system admins to make those decisions. Convenience for the clients is paramount since their ability to process most efficiently directly adds to the bottom line. The new way of processing (as I mentioned a while back) will make the workflows more streamlined and efficient.

It reminds me of an experience I had. We have so many nodes that I wrote a spider to gather all the node info and then created a DB so I can query the information I needed. There were some who thought a flat file would be best because... Suppose a piece of the ISS broke off, survived reentry and landed right on my DB server, what then???


Mark Hahn wrote:
>>> writing to different sections of a file is probably wrong on any
>>> networked FS, since there will inherently be obscure interactions
>>> with the size and alignment of the writes vs client pagecache,
>>
>> I'm rather surprised to see that sentiment on a mailing list for high
>> performance clusters :>
>
> smiley noted, but I would suggest that HPC is not about convenience
> first - simply having each node write to a separate file eliminates
> any such issue, and is hardly an egregious complication to the code.

Actually this can greatly complicate code. If I run a CFD run on n processes and they each write the solution to a separate file, then if I run 1.5*n processes, how do I read the n files? I can write some code to take the n files and then write out a single file or 1.5*n files, for instance. To me this is a wasteful use of cycles when something like MPI-IO is so much better and I can stick with a single file.

While I don't want to speak for the entire CFD community, I haven't seen anyone write out n files. That concept was proven to be a huge pain many years ago.

Other disciplines may have other opinions of course.

>> I would contend that writing to different sections of a file *must* be
>> supported by any file system deployed on a cluster. How else would
>> you get good performance from MPI-IO?
>
> who uses MPI-IO? straight question - I don't believe any of our 1500
> users do.

I do. I also know that some ISVs are moving rapidly to use MPI-IO.

>>> in my experience, people who expect it to "just work" have an
>>> incredibly naive model of how a network FS works (ie, write()
>>> produces an RPC direct to the server)
>>
>> I agree that the POSIX API and consistency semantics make it difficult
>> to achieve high I/O rates for common scientific workloads, and that
>> NFS is probably not the best solution for those truly parallel
>> workloads.
>>
>> Fortunately, there are good alternatives out there.
>
> starting with the question: "do you have a good reason to be writing
> in parallel to the same file?". I'm not saying the answer is never yes.

As Rob mentioned, writing in parallel to the same file gets you good performance. I think this is a fundamental underpinning of parallel IO. You can do this with or without MPI-IO; MPI-IO just makes it easier, standard, and portable.

Of course you would not have different processes writing to the same region of a file. But if you can have each process write to a distinct region or section of the file without worrying about having another process stepping on that one, then why not write in parallel? It's easy to do using MPI-IO. Take a look at the tutorials on MPI-IO around the web and give them a try.

Jeff

On Tue, Mar 06, 2007 at 11:09:18AM -0500, Mark Hahn wrote:
>> I would contend that writing to different sections of a file *must* be
>> supported by any file system deployed on a cluster. How else would
>> you get good performance from MPI-IO?
>
> who uses MPI-IO? straight question - I don't believe any of our 1500 users
> do.

Excellent question. Direct users? Probably not very many.

We do find that straight-up MPI-IO isn't a good fit for a lot of scientific applications. The convenience factor you mentioned is indeed important. MPI-IO thinks of data as a "stream of bytes", while applications think in terms of "multidimensional typed data" (a slice of upper atmosphere).

Libraries like Parallel-HDF5 and Parallel-NetCDF bridge the gap and provide a convenient, familiar API. The app is still using MPI-IO, just not directly.
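
As a rough illustration of what that looks like from the application side, here is a minimal Parallel-NetCDF sketch (the file, dimension, and variable names are made up, and exact calls may differ between pnetcdf versions): each rank writes its own slab of one shared variable, and the library maps this onto MPI-IO underneath.

/* Minimal PnetCDF sketch: each rank writes its own chunk of one variable.
 * Names and sizes are placeholders. Error checking omitted for brevity. */
#include <mpi.h>
#include <pnetcdf.h>

#define LOCAL 100  /* values per rank, for illustration */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int ncid, dimid, varid;
    ncmpi_create(MPI_COMM_WORLD, "slice.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "x", (MPI_Offset)nprocs * LOCAL, &dimid);
    ncmpi_def_var(ncid, "temperature", NC_DOUBLE, 1, &dimid, &varid);
    ncmpi_enddef(ncid);

    double data[LOCAL];
    for (int i = 0; i < LOCAL; i++)
        data[i] = rank;  /* dummy values */

    /* Collective write: every rank supplies its own start and count. */
    MPI_Offset start = (MPI_Offset)rank * LOCAL;
    MPI_Offset count = LOCAL;
    ncmpi_put_vara_double_all(ncid, varid, &start, &count, data);

    ncmpi_close(ncid);
    MPI_Finalize();
    return 0;
}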

> NFS certainly does as well. you just have to know the constraints.
> are you saying you can never get pathological or incorrect results from
> parallel operations on the same file on any of those FS's?

You observe correctly that file systems offer a set of rules on what to expect from I/O patterns. These consistency semantics are not set in stone: MPI-IO consistency semantics are more relaxed than POSIX, yet generally sufficient for parallel scientific applications.

We would consider it a serious bug in PVFS if simultaneous non-overlapping writes corrupted data.

If the only file system I had access to was NFS, I'd do one file per process as well.

> starting with the question: "do you have a good reason to be writing in
> parallel to the same file?". I'm not saying the answer is never yes.
>
> I guess I tend to value portability by obscurity-avoidance. not if it makes
> life utter hell, of course, but...

one file per processor falls down on systems like BGL (where even a small run is 1024 processes, and 128k is not unheard of).

One file per process also robs the higher layers of the I/O software stack of an opportunity to optimize access patterns. All processes reading a column out of a row-major array is noncontiguous (and generally slow) in file-per-processor, but can be contiguous in single-file after applying data shipping or two-phase collective buffering optimizations.
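
To make the column case concrete, here is a minimal sketch (the array sizes and filename are arbitrary assumptions): each rank describes its column with a subarray datatype, sets that as its file view, and issues a collective read so the MPI-IO layer can apply two-phase optimization instead of many tiny noncontiguous reads.

/* Minimal sketch: every rank reads one column of a ROWS x nprocs row-major
 * array of doubles from a shared file. Sizes and filename are placeholders. */
#include <mpi.h>

#define ROWS 1024

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Describe this rank's column of the global ROWS x nprocs array. */
    int sizes[2]    = { ROWS, nprocs };
    int subsizes[2] = { ROWS, 1 };
    int starts[2]   = { 0, rank };

    MPI_Datatype column;
    MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C,
                             MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "array.dat", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, column, "native", MPI_INFO_NULL);

    double col[ROWS];
    /* Collective read: gives the library a chance to do collective buffering. */
    MPI_File_read_all(fh, col, ROWS, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}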

Jeff touched on the data management issues of file-per-processor.

If file-per-processor really is the most portable and convenient way to work on data, well, I can't argue with that. On NFS, that's probably the only way to get correct results. The single-file approach, however, has significant benefits on the modern parallel file systems available today.

As I hope you could tell, this kind of discussion is a lot of fun for me. Thanks!

==rob

--
Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Lab, IL USA                 B29D F333 664A 4280 315B

On Tuesday 06 March 2007 08:58, Robert Latham wrote:
> I know I beat this drum a lot, but do consider a true parallel file
> system like PVFS for your MPI-IO applications. GPFS and Lustre would
> be good options too.

What about OCFS2? Do you know anything about it?

wt
--
Warren Turkal, Research Associate III/Systems Administrator
Colorado State University, Dept. of Atmospheric Science

On Tuesday 06 March 2007 05:42, Joe Landman wrote:
> Using iSCSI targets and iSCSI initiators, you could build RAID5 or RAID6
> across boxes using the linux MD device. We have proposed this to some
> financial customers using our JackRabbit unit.

Do you have the md device mounted on many systems? I didn't think the md device was cluster aware.

wt
--
Warren Turkal, Research Associate III/Systems Administrator
Colorado State University, Dept. of Atmospheric Science

Warren Turkal wrote:
> On Tuesday 06 March 2007 05:42, Joe Landman wrote:
>> Using iSCSI targets and iSCSI initiators, you could build RAID5 or RAID6
>> across boxes using the linux MD device. We have proposed this to some
>> financial customers using our JackRabbit unit.
>
> Do you have the md device mounted on many systems? I didn't think the md
> device was cluster aware.

The md device is mounted on one system, with STONITH and an HA server pair (triple, ...) on the front end doing the md. With GFS/OCFS2/... you can have it active-active.

> > wt

--

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC
email: [email protected]
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452 or +1 866 888 3112
cell : +1 734 612 4615

On Tue, Mar 06, 2007 at 12:06:07AM +1100, Andrew Robbie (GMail) wrote:

> But, a managed name brand switch seems to cost a lot
> more than a non-managed one using the Mellanox reference design kit
> (rebadged, but I suspect made by Flextronics...).

Andrew,

I know of at least 2 "name brand" unmanaged IB switches, one from QLogic (24 ports) and one from Microway (36 ports). I think Cisco resells the QLogic switch, and perhaps Voltaire has something similar.

For larger switches the management board is a tiny fraction of the price.

-- greg

> well, developers should be smart enough to know what FS they're using,
> and how it's intended to behave. turning off AC is a nice option, but
> smarter is to leave it on and not try to cause race conditions.

Just yesterday I sat quietly at a customer site while an engineer wasted 30 minutes not understanding that the bizarre behavior he was seeing was due to attribute caching on NFS. The guy was a CFD expert, not a Unix expert.

Even experts can have troubles with AC, sometimes. Just Say No to AC.

-- greg

On 3/5/07, Andrew Robbie (GMail) wrote:
>
> My other query is about diagnostic software. With an ethernet switch it is
> pretty easy to fire up Ethereal (sorry Wireshark, but it is such a silly
> name) or Etherape and get a look at what is going on. If I buy a Cisco or
> Voltaire etc do they come with tools that let me get accurate
> representations of what is going on? Or are their tools really for large IB
> networks?

Doesn't the fabric management software allow you to do some diagnostics and get an overview of the fabric? SilverStorm (now bought by QLogic) also has some scripts that help with configuration of the fabric and cluster.

regards

Walid

Andrew Robbie (GMail) wrote:
> Various vendors (integrators, not switch OEMs) have stated to me that
> managed switches are the go, and that OpenSM is (a) buggy, and (b)
> very time consuming to set up. But, a managed name brand switch seems
> to cost a lot more than a non-managed one using the Mellanox reference
> design kit (rebadged, but I suspect made by Flextronics...).

We have a small (3-node) cluster with an IB interconnect here. The switch is unmanaged. I cannot confirm either (a) or (b): OpenSM runs without problems and the set-up is not too complicated.

On the other hand, we also have a much more expensive shared-memory system from a well-known vendor that also features an IB interconnect. We never had to use any of the extra features of the managed switch we have there. And in contrast to the open-source and unsupported drivers that we use in the small cluster, the commercial driver stack is buggy and causes our machines to crash every now and then (usually under high load).

My advice is to take the unmanaged switch.

> My other query is about diagnostic software. With an ethernet switch
> it is pretty easy to fire up Ethereal (sorry Wireshark, but it is such
> a silly name) or Etherape and get a look at what is going on. If I buy
> a Cisco or Voltaire etc do they come with tools that let me get
> accurate representations of what is going on? Or are their tools
> really for large IB networks?

If you run "IPoIB" you can use Ethernet monitoring tools to get diagnostics of the emulated ethernet devices. Our managed switch did not come with extra diagnostics software. The switch was shipped with the whole system, though (OEM). I do not know what software you would get if you buy a retail IB switch.

Regards,
Markus

Hi,

Andrew Robbie (GMail) wrote:
> I am building a small (~16) node cluster with an IB interconnect. I need to
> decide whether I will buy a cheaper, dumb switch and run OpenSM, or get a
> more expensive switch with a built-in subnet manager. The largest this
> system would ever grow is 32 nodes (two 24 port switches).
>
> Various vendors (integrators, not switch OEMs) have stated to me that
> managed switches are the go, and that OpenSM is (a) buggy, and (b) very
> time consuming to set up.

It's not _that_ buggy, and set up is pretty straightforward. But it lacks several features you'd really like in big systems. For 24 nodes or fewer you can go with a simple switch and OpenSM. For 32 nodes you can use 16 nodes per switch and 8 cables for the switch interconnect, so you should have 1/2 bisection bandwidth in theory. But OpenSM configures IB forwarding rather statically at startup, never adjusts it to the actual usage of links, and is rather poor at handling "hotplug" changes in topology. So it is possible that some links are overused while others are not. Nevertheless you can still find 24 nodes in your 32-node cluster communicating nonblocking (if the remaining 8 stay silent), but I don't know a simple way to get this information from OpenSM or the switch. You can write a simple MPI program to benchmark it.
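
One way to benchmark it, as a minimal sketch (the pairing and message size are arbitrary assumptions; a real test would try the specific node combinations you care about): run all even/odd rank pairs at the same time and report per-pair bandwidth, so oversubscribed switch-to-switch links show up as slow pairs.

/* Minimal sketch: concurrent pairwise bandwidth test. Even ranks send to
 * rank+1; all pairs run at once, so shared links show reduced bandwidth.
 * The message size is an arbitrary choice. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NBYTES (32 * 1024 * 1024)  /* 32 MB per message, for illustration */
#define REPS   10

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *buf = malloc(NBYTES);
    int peer = (rank % 2 == 0) ? rank + 1 : rank - 1;
    int have_peer = (peer >= 0 && peer < size);

    MPI_Barrier(MPI_COMM_WORLD);  /* start all pairs at the same time */
    double t0 = MPI_Wtime();
    if (have_peer) {
        for (int i = 0; i < REPS; i++) {
            if (rank % 2 == 0)
                MPI_Send(buf, NBYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
            else
                MPI_Recv(buf, NBYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }
    }
    double t1 = MPI_Wtime();

    if (have_peer && rank % 2 == 0)
        printf("pair %d-%d: %.1f MB/s\n", rank, peer,
               REPS * (double)NBYTES / (t1 - t0) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}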

In addition, the versions of OpenSM I know crash silently sometimes (which does not affect anything), so you should monitor it in some way (you can restart it whenever you want). Finally, I have to admit that these are all real-life experiences without any deep insider knowledge of OpenSM or even InfiniBand.

So, in conclusion I would suggest going with a simple 24-port switch and OpenSM for now. If you upgrade to more than 24 nodes you should add a more advanced switch. From my experience you can easily mix Mellanox switches with those formerly known as TopSpin; I don't know about other vendors.

As one more hint, you should reconsider whether you need that many nodes for a job. If you limit one job to at most 24 nodes, you can easily go with two dumb 24-port switches for up to 48 nodes, and each subcluster can communicate nonblocking internally. But of course this way no node of one subcluster can communicate with a node of the other one, and you need a resource management system able to assign nodes of a single subcluster to one job.

Kind regards,
--
Mapsolute GmbH
Frank Gruellich
Map24 Systems and Networks

Duesseldorfer Strasse 40a
65760 Eschborn
Germany

Phone: +49 6196 77756-414
Fax:   +49 6196 77756-100

http://www.mapsolute.com

Hello..

I would like to know which server has the best performance for HPC systems: the Dell PowerEdge 1950 (Xeon) or the 1435SC (Opteron). Please send me suggestions...

Here are the complete specifications for both servers:

PowerEdge 1435SC
Dual Core AMD Opteron 2216 2.4GHz
3GB RAM 667MHz, 2x512MB and 2x1GB Single Ranked DIMMs

PowerEdge 1950
Dual Core Intel Xeon 5130 2.0GHz
2GB 533MHz (4x512MB), Single Ranked DIMMs

--
Juan Camilo Hernandez
Ingenieria Sanitaria
Universidad de Antioquia
GIGAX - http://www.gigax.org

On Tue, Mar 06, 2007 at 12:06:07AM +1100, Andrew Robbie (GMail) wrote:
> Date: Tue, 6 Mar 2007 00:06:07 +1100
> From: "Andrew Robbie (GMail)"
> To: beowulf@beo