Information Technology Division
Overview
The LHC Computing Grid (LCG) project made a first production release on schedule in the summer and was
deployed at CERN and numerous outside sites, building on middleware from VDT and EDG. A follow-on
project to the EDG – Enabling Grids for European e-science (EGEE) – was given the green light, although
formal approval will be given only in 2004. IBM and Oracle became sponsors of the CERN openlab for DataGrid
applications. A new GSM contract was negotiated with Sunrise. The structure of the new IT department,
including groups from the former AS, ETT, and IT divisions, was agreed, effective 1 January 2004.
Architecture and Data Challenges
Data Challenges
There was a major re-organization of the tape system in the computer centre during April and May. Several tape silos were moved to a different building, and 50 new-generation tape drives were introduced, replacing 28 older drives. This period offered an opportunity to use the new drives for a data challenge aiming at a record data rate to tape of 1 GB/s. After overcoming quite a few problems, the challenge succeeded, sustaining an average of 920 MB/s to tape for three days with peak values of 1.2 GB/s. The configuration used 40 CPU servers to emulate a DAQ system which moved data to
60 disk servers and then into an average of 45 tape servers. The software to do this was essentially a simple
Central Data Recording system without the CASTOR Mass Storage system. The short time-scale did not
allow extensive tests with the CASTOR system, but about 600 MB/s were reached for 12 hours on the last day
of the test.
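The figures above imply modest per-node rates at each layer; a quick back-of-the-envelope check in Python (server counts are from the text, the per-node averages are purely illustrative):

```python
# Back-of-the-envelope rates implied by the 2003 tape data challenge.
# Server counts come from the report; per-node figures assume an even
# spread of the 1 GB/s aggregate target, which is an idealization.
TARGET = 1000  # MB/s aggregate goal to tape
cpu_servers, disk_servers, tape_servers = 40, 60, 45

per_cpu = TARGET / cpu_servers    # MB/s each DAQ emulator must produce
per_disk = TARGET / disk_servers  # MB/s flowing through each disk server
per_tape = TARGET / tape_servers  # MB/s each tape server must sustain

print(f"per CPU server:  {per_cpu:.1f} MB/s")
print(f"per disk server: {per_disk:.1f} MB/s")
print(f"per tape server: {per_tape:.1f} MB/s")
```

Each layer thus needs only a few tens of MB/s per node, which is what makes commodity hardware viable for such an aggregate rate.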
In January an ATLAS Online Computing Data Challenge to test event building and run control issues was
completed using a 230-node cluster configured from the LCG prototype hardware.
CASTOR and AFS
A proposal for new CASTOR stager development was presented at the CASTOR users meeting at the end
of June. The proposal was approved by the users and the CASTOR operation team, and the developments
started in July. The new architecture was also presented to the LCG PEB in August and at the Vancouver
HEPiX meeting in October.
In the new architecture design it is recognised that efficient management of petabyte disk caches made up of clusters of hundreds of commodity file servers (e.g. Linux PCs) resembles in many respects CPU cluster management, for which sophisticated batch scheduling systems have been available for more than a decade.
Rather than reinventing scheduling and resource sharing algorithms and applying them to disk storage
resources, the new CASTOR stager design aims to leverage some of the resource management concepts from
existing CPU batch scheduling systems. This led to a pluggable framework design, where the scheduling task
itself has been externalized allowing the reuse of commercial or open source schedulers. The plan is to support
at least the LSF and Maui schedulers.
In late October a first prototype of the new system was used to prove some of the fundamental design
concepts. In particular the prototype showed that it is possible to delegate the scheduling of disk cache access
requests to an LSF instance. The successful proof of concept was reported to the LSF and Maui development
teams and fruitful collaborations were established with both teams in order to improve the interfaces used by
CASTOR.
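The pluggable design described above can be sketched as a minimal interface behind which any scheduler backend sits. The class and method names below are illustrative inventions, not CASTOR's actual API:

```python
# Hypothetical sketch of the "externalized scheduler" idea: the stager
# talks to any scheduler through one small interface, so a commercial
# (LSF) or open-source (Maui) backend can be swapped in unchanged.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class DiskAccessRequest:
    file_name: str
    size_mb: int


class Scheduler(ABC):
    """Backend-neutral scheduling interface (names are illustrative)."""

    @abstractmethod
    def submit(self, req: DiskAccessRequest) -> str:
        """Queue a disk-cache access request; return a job id."""


class LSFScheduler(Scheduler):
    def submit(self, req):
        # Real code would call the LSF API; here we just fabricate an id.
        return f"lsf-{req.file_name}"


class MauiScheduler(Scheduler):
    def submit(self, req):
        return f"maui-{req.file_name}"


def stage_in(scheduler: Scheduler, req: DiskAccessRequest) -> str:
    # The stager never knows which backend it is using.
    return scheduler.submit(req)
```

The design choice is the same as in batch systems: the scheduling policy lives entirely behind the interface, so improving or replacing it requires no change to the stager itself.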
Linux Activities
At the beginning of 2003, Red Hat changed strategy, cutting product lifetimes to drive customers to the company's flagship 'Enterprise' version. The current CERN Linux distribution is based on the freely distributed Red Hat 7 desktop line, for which Red Hat provided updates only until the end of 2003; it was updated a number of times during the year.
As a result of discussions with the CERN Linux users and outside institutes (e.g. at HEPiX) about
alternative distributions, it emerged that a Red Hat-based distribution would still be best suited. Such a
distribution could be bought directly from Red Hat (lacking the CERN customizations, and with implications
for smaller institutes that took the CERN distribution in the past), or could be built from sources. Both options
were investigated, and negotiations continue into 2004. As a result, CERN will support its own 7.3.x
distribution with security patches until the end of 2004, i.e. one year longer than Red Hat. The next version is expected to be certified and deployed by summer 2004.
In response to frequent requests for Linux on desktops, new PCs are now delivered with NICE and Linux preinstalled, following a pilot project in 2002.
Controls
The group participated actively in many divisional and inter-divisional working groups and committees.
These included the Controls Board, as well as working groups on Field Buses, LHC Data Interchange
(LDIWG), and the SCADA Applications Support Group (SASG). The group supplied the chairmen for the
Controls Board, LDIWG, and SASG. In addition, members of the group participated in three conferences:
ICALEPCS (5), CHEP (3), and RT (1). For ICALEPCS a group member jointly organized one of the
conference sessions as a member of the Programme Committee. The primary services and projects supported
by the group are as follows.
LHC Joint Controls Project (JCOP)
The group provides the JCOP project leader and most of the services and projects discussed below were
tailored to clients’ needs through this channel. An External Review of the project was performed in March on
the request of the experiments and the outcome of this was highly supportive of the project and its
achievements. All recommendations made in the report have been followed up. The project continues to
address issues of common interest including the interfaces between the experiments and the systems for
control of magnets, cooling and ventilation, LHC access control, and electrical distribution. In particular,
significant support was given to the rack monitoring and control activities of EP-ESS.
Front-End Systems
This service combines support for both industrial and custom controls solutions. On the industrial side,
OPC (OLE for Process Control) support has been offered CERN-wide to customers requiring advice and test facilities, and help has been provided in the development of custom OPC servers for experiment-specific
devices. Several COTS OPC servers (CAEN, Wiener and ISEG) have been tested in order for them to be
integrated in experiments’ control systems and to validate the EP-ESS/Wiener contract. The group provided
ISEG and Wiener with consultancy. In the domain of Programmable Logic Controllers (PLCs), advice and
hands-on facilities have been made available and first-line support was given to experiments facing automation
problems. Support for the CAN field bus has been mostly for the ATLAS ELMB and Gas ELMB, but
assistance has been provided to use Profibus and Modbus TCP/IP in the control systems of experiments. On
the non-industrial side, the SLiC object-oriented software system continues to be maintained but will not be
further developed.
Experiment Support
The group has provided very specific support to the LHC experiments, and particularly ALICE, ATLAS
and CMS, via JCOP to the test-beam activities of 2003. The nomination of group contact persons to the
experiments has proved to be very successful. In addition to supporting the LHC experiments via JCOP and its
sub-projects, the group was again involved in supporting members of COMPASS and NA60 in the use of the
CERN-standard SCADA system, PVSS, as well as the DCS Framework and Front-End tools. These
experiments continued to require help in putting their control systems into production owing to a severe lack,
or turnover, of manpower. The group also provided support to GIF, and retains responsibility for the NA48
control software. Several requests for new functionality to be added to the system were refused because of a
lack of resources.
National Instruments Support
National Instruments software usage at CERN continued to grow in 2003, and there is now a considerable
user community. The purchase of hardware components for control systems from National Instruments has
also increased. In 2003, the National Instruments LabVIEW 7 software suite made its appearance, and this has
been fully integrated onto the CERN-supported hardware platforms and deployed to the user community.
Improvements have also been made to the National Instruments software directory on NICE to allow easier
access to the current recommended products, whilst still conserving access to older versions of the software.
Consultancy, programming support, and assistance with project design using LabVIEW has also been
provided to many areas of the user community at CERN.
In December 2003 National Instruments announced changes to the licensing mechanisms for much of their
software and this may result in extra administrative work for monitoring product usage. Discussions are
currently under way with National Instruments to resolve these issues.
DCS Framework Project
Immediately following the JCOP External Review there was a request from the experiments to redesign
the Framework with the aim of simplifying both its use and also the inclusion of new devices. The redesign
effort was co-ordinated by the Framework Working Group and saw a good collaboration with all the
experiments. The redesigned Framework will be released in mid January 2004 together with the candidate
release of PVSS version 3.0. The current release of the Framework was in production during 2003 in
COMPASS, GIF, NA60 and LHC-Gas controls, and is also in use within the LHC experiments’ test beams and
sub-detectors. In addition to its use in the research sector, some components of the Framework are being used
as part of the cryogenics control system framework (UNICOS) for which there has been a good collaboration
with the AB-CO group. To aid the introduction of the redesigned Framework to users, a training course is
being prepared which will be given in conjunction with the PVSS course.
Gas Control Systems (GCS) Project
This project aims to provide control systems for all the gas distribution chains being built for the LHC
experiments by EP-TA1-Gas. Analysis continued following the evolution of the EP-TA1-Gas activities. A first
implementation was released for the ALICE TPC gas system which was built using all the components of the
GCS project. The design of the process control layer was based on the UNICOS PLC library, a product being
developed for AB-CO for the supervision of their cryogenics systems, and the team participated in the
development of the UNICOS supervision package with PVSS. The first components of the LHC GCS
framework were developed and used for the ALICE TPC GCS. This framework aims to automate the
production of the 21 LHC Gas Control Systems by means of a model-based approach.
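A model-based approach of this kind might look, in very rough outline, like the following sketch: each gas system is described once in a small data model, and per-system artefacts are generated from a template. The detector names and fields are invented examples, not the real GCS model:

```python
# Illustrative sketch of model-based production of control systems:
# describe each system as data, generate its configuration mechanically.
# Entries and the template text are hypothetical examples.
from string import Template

MODEL = [
    {"experiment": "ALICE", "detector": "TPC", "channels": 18},
    {"experiment": "CMS", "detector": "RPC", "channels": 12},
]

TEMPLATE = Template("GCS $experiment/$detector: $channels mixer channels")


def generate(model):
    """Produce one configuration line per gas system in the model."""
    return [TEMPLATE.substitute(entry) for entry in model]


for line in generate(MODEL):
    print(line)
```

The appeal for 21 similar systems is that adding a system means adding a model entry, not hand-writing another control application.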
Detector Safety Systems (DSS) Project
During the first half of 2003 a DSS prototype system was produced based on the operational specification
agreed with the four LHC experiments. This prototype was reviewed and approved by the JCOP Steering
Group in June and the series production of the DSS was started soon after, allowing all the necessary
equipment for the four LHC experiments to be delivered by the supplier before the end of 2003. The first
operational DSS was installed in the CMS experimental area on schedule and has been successfully brought
into operation. Full environmental testing of this equipment has started. The DSS for LHCb was also delivered
to the experiment on schedule in December and connection of the external cabling is now under way.
Based on the experience gained during the prototype implementation, a second generation of software for
the DSS was started in late 2003. Special effort has been made to optimize the OPC communication between
the PLC and the supervisor PC, as well as to reduce the PLC cycle time. This is especially important as some
of the LHC experiments may require more channels than expected for online monitoring of rack safety, an
application not originally considered for the DSS.
Communications
Telephony and GSM
There were two main events in 2003 in the areas of telephony and mobile communications.
The CERN telephone exchange was upgraded to move towards IP telephony. A first significant step was
reached by June when the main telephony backbone was moved to IP. In 2003 a large part of the CERN Intercom system was also digitized, to facilitate its future migration to IP technologies.
Furthermore the audio/video systems interconnecting the amphitheatres were also migrated from the old
analog systems to digital and to the IP campus network.
The migration of the CERN mobile network from the Swisscom operator to Sunrise was a major event in
2003. As a result of the call for tender of March, a contract was signed with Sunrise in July. From August till
December, the entire digital transmission infrastructure was removed, optical fibres were pulled, and 44 new
GSM radio stations were deployed on the CERN site and in the underground tunnels. 2600 CERN mobile
phone users’ subscriptions had to be replaced and were ready for the transition day of 5 January 2004.
General Network Infrastructure and Operation
The network in the computer centre smoothly handled the increasing load during 2003, even though its capacity is reaching its limits. A capacity upgrade, in anticipation of the new LCG infrastructure, is planned for 2004. The computer centre move to the vault continued, and new procedures to connect large quantities of
equipment were introduced.
The work to eliminate obsolete non-IP network components continued; this old equipment is increasingly difficult to maintain. Again this year, many DECNET and IPX boxes were eliminated, and it is expected that these services will be decommissioned in 2004.
The DHCP service had to be adapted to support the enforcement of portable computer registration; similarly, the central DNS service was improved to better support load sharing and domain delegation. The Network Time service was substantially improved to accommodate the LHC requirements, gaining considerably in precision thanks to GPS synchronization. New systems were developed to better understand the state of the network connections, significantly reducing the time needed to fix a user connection problem.
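For illustration, the kind of query answered by such a GPS-disciplined time service can be sketched as a minimal SNTP client in the style of RFC 4330; the server name is a placeholder:

```python
# Minimal SNTP client sketch (RFC 4330 style). The server name below is
# a placeholder, not a real CERN service; error handling is omitted.
import socket
import struct

# Seconds between the NTP epoch (1900) and the Unix epoch (1970).
NTP_DELTA = 2208988800


def ntp_to_unix(seconds: int, fraction: int) -> float:
    """Convert an NTP timestamp (32.32 fixed point) to Unix time."""
    return seconds - NTP_DELTA + fraction / 2**32


def query_sntp(server: str = "ntp.example.org", timeout: float = 2.0) -> float:
    """Send one client request and return the server's transmit time."""
    packet = b"\x1b" + 47 * b"\x00"  # LI=0, VN=3, Mode=3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(packet, (server, 123))
        data, _ = s.recvfrom(48)
    secs, frac = struct.unpack("!II", data[40:48])  # transmit timestamp
    return ntp_to_unix(secs, frac)
```

The 32-bit fractional field gives sub-microsecond resolution on the wire; the achievable precision is then limited by network asymmetry and the quality of the server's reference clock, which is where GPS discipline helps.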
The LANDB tool set was extended to integrate the portable registration and associated blocking system
that has been gradually enforced all over the CERN sites in 2003. All legacy network-related applications have
been reviewed and integrated in the new LANDB framework. Work has started on the replacement of the
structured cabling management package.
External Networking
The European Union DataTAG project (research and technological developments for a transatlantic Grid)
which is also co-funded by the US Department of Energy (DoE) and the National Science Foundation (NSF)
successfully passed its first review in March 2003. Remarkable results have been achieved in the area of high-
performance networking and Grid interoperability and public demonstrations have been organized most
notably during the Telecom World 2003 and ICT4D/WSIS (Information & Communication Technologies for
Development/World Summit on Information Society) events in Geneva.
As foreseen in 2002, the external networking infrastructure is rapidly evolving towards 10 Gb/s. Indeed,
the DataTAG circuit was upgraded to 10 Gb/s in September, allowing new Internet land speed records to be achieved by a CERN–Caltech team with single multi-Gb/s IPv4 and IPv6 streams between CERN, Los Angeles, and Phoenix. These records were unanimously hailed by the technical press and have been entered in the Science & Technology section of the Guinness Book of Records. CERN is also connected at
10 Gb/s to the Netherlight exchange point and at 2.5 Gb/s to VTHD, the French optical testbed. Preparation
for the connection of CERN to GEANT, the pan-European academic and research backbone at 10 Gb/s, is well
under way.
In order to cater for the increased external bandwidth, the interconnection between the internal and
external networks had to be re-engineered with two main components, namely a faster firewall and a multi-
Gb/s high-throughput route for bandwidth demanding Grid applications.
The CERN Internet Exchange Point (CIXP) is slowly emerging from a difficult period due to the crisis in
the telecom industry. However, CERN is still in a most favourable, probably unique, situation with over
2000 strands of fibres distributed over 15 independent cables belonging to a dozen telecom operators and/or
dark fibre providers.
During 2003, the IPv6 pilot project was rebuilt from scratch and a number of significant milestones were
reached. In particular, CERN is now connected to a number of native IPv6 backbones, such as SWITCH &
6NET in Europe and 6TAP in the USA. As the major European and American backbones have also started to
offer native IPv6 capabilities, the use of IPv6-in-IPv4 ('6in4') or GRE tunnels to connect to IPv6 'islands' should decrease.
LHC Network Activities and the Communications Infrastructure Project (ComIn)
Major resources were invested in the PS, TCR and SPS network infrastructure rejuvenation to integrate
these into the global LHC communication infrastructure for controls. Thanks to an excellent collaboration
with the AB and ST divisions, the project is progressing well, within schedule and budget. This high workload
introduced delays in some other activities and is symptomatic of the current reduced staffing levels.
Several studies on real-time networking for LHC controls were completed and confirmed that the new
technical network will be able to accommodate the currently expressed requirements. The control network infrastructure for LHC is taking shape, and the installations required for cryogenics were completed at pits 1 and 8. New technologies to provide network connectivity in the LHC tunnel have been evaluated; VDSL technology was selected and its installation has started.
During 2003, the communication infrastructure available in the LHC surface areas has been used to
connect equipment for the cooling and ventilation, for the cryogenic plants, for the electrical sub-stations, and
for the controls equipment. Early in the year, the responsibility for the underground installations was finally
taken up by the LHC integration authority. This clarified the responsibilities, reduced the number of parties to
consult, and made the whole installation procedure underground more efficient.
Following a call for tender, the industrial support contractor for small telecommunication work was
changed on 1 April 2003. In spite of the much delayed signature of the contract, the transition to the new
contractor went smoothly.
The supply of wireless voice communication services in the underground areas of the accelerators is a major component of the new GSM contract, in particular for the LHC. The arcs of sectors 8–1, 2–3 and 3–4, and part of the transfer tunnels, were completed.
As the preparation of the underground progressed, a large number of the 300 red emergency telephones
were brought back into operation. The previous installation had mostly been cut or removed during the
dismantling of LEP and the subsequent civil engineering work. By the end of the year, the installation was
operational wherever the cable duct for security signals is in a usable state.
At its June meeting, the Finance Committee adjudicated the contract for the supply and installation of the
optical fibre network. This contract is managed by ST-EL, and installation work commenced immediately
thereafter. The priority task for the second half of 2003 was to lay the fibres to support the terrestrial backbone
for the new GSM operator.
The network in the underground areas will be made available as required by the installation schedule. This
will normally occur in parallel with the signal cable installation. The LHC working group, CEIWG, has been
preparing and coordinating this activity for sector 7–8. The working group brings together equipment groups
and the service providers. It has allowed us to identify the quantity and the locations of the required
connections in sufficient detail to plan the deployment.
The network installation in SM18 has been updated to support the magnet testbed. Several studies on real-
time networking for LHC controls have been completed on the mini-backbone in SM18. The tests have
confirmed that the new technical network will be able to accommodate the expressed requirements.
Basic network infrastructure was installed at the LHC experimental pits for various acceptance tests. In-
depth discussions with the LHC experiments on their future needs for experiment control were begun.
Databases
2003 was a year of significant consolidation in terms of the services offered by the Database Group. The
EDMS service was migrated off a Digital Unix cluster to a Sun Solaris cluster, reducing the number of server
platforms on which the group operates services to two (Intel/Linux being the preferred platform for physics-
related applications). This also paved the way for a merge of the EDMS and central database clusters. In
addition, new services were established for the Physics community based on the Oracle Database and
Application Server, replacing previous solutions based on Objectivity/DB. The latter also involved a major
data migration – some 350 TB of data were moved at data rates of up to 100 MB/s sustained over 24-hour
periods. In the context of the LHC Computing Grid (LCG), data management middleware developed as part of
the European DataGrid (EDG) was deployed in production. This middleware is also used by the LCG
Persistency Framework POOL as its Grid-aware file catalogue and file-level meta-data store. Building on the
new Oracle contract, which covers not only the CERN site but CERN registered users at outside institutes for
CERN-related work, Oracle distribution kits were developed both to deploy the Grid data management
middleware and also to allow the export of Oracle-based physics data to external sites. A further major
accomplishment was the successful negotiation with Oracle regarding their sponsorship of the CERN openlab.
As part of their contribution, two Oracle-funded fellows will start early in 2004 and are expected to have a
significant impact on our ability to evaluate and subsequently adopt the latest Oracle technologies.
Furthermore, Oracle announced a new release of their database and application server products – Oracle 10g.
This is particularly important in that it includes a number of key features that were explicitly requested by
CERN on behalf of the HEP and wider scientific community. These activities and events are described in more
detail below. Finally, as part of the overall reorganization of CERN, the structure of a new Database Group in
the IT department was agreed. The mandate of this group includes also the Databases and Application Servers
behind the laboratory’s Administrative Information Services, a further important consolidation.
Database and Application Server Infrastructure
CERN continued to support the Oracle Database as the relational database management system of choice.
This system continues to be widely used across the laboratory for a large spectrum of activities, ranging from
accelerator and physics-related applications to administrative tasks.
As for previous releases of the Oracle Database, CERN was an active participant in the 10i (subsequently
10g, where g stands for Grid) release, as well as in that for the Oracle management tool (Enterprise Manager,
recently renamed Grid Control). This allowed us to verify that the requested enhancements, such as support
for IEEE floating point numbers, were correctly implemented and offered the storage and performance
enhancements that were expected. A number of additional features, such as Oracle Automated Storage
Management and Cross-Platform transportable tablespaces, were also tested and extensive feedback provided
to Oracle.
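The significance of IEEE floating-point support can be illustrated with a small sketch: IEEE 754 single and double precision values occupy a fixed 4 and 8 bytes and round-trip bit-exactly, whereas Oracle's native NUMBER type is a variable-length decimal encoding. The packing below uses only the Python standard library; the Oracle comparison is from the text, not demonstrated here:

```python
# IEEE 754 encodings are fixed-width and lossless for binary floats,
# which is why storing physics data as IEEE values saves space and
# avoids decimal conversion. Demonstrated with stdlib struct only.
import struct

value = 3.141592653589793

single = struct.pack("!f", value)  # IEEE 754 single precision
double = struct.pack("!d", value)  # IEEE 754 double precision

print(len(single), len(double))                 # 4 8
print(struct.unpack("!d", double)[0] == value)  # True: doubles round-trip
```

For petabyte-scale event data, the difference between a fixed 4-byte encoding and a multi-byte decimal representation translates directly into storage and I/O savings.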
CERN continued to have very high visibility within the Oracle community, at both American and
European conferences, as well as in numerous press interviews and reference visits. It is clear that this
visibility had a positive impact on Oracle’s willingness to implement CERN-requested features, as well as to
sponsor the CERN openlab. However, it is important to point out that there are numerous synergies between
the strategies of Oracle and CERN, such as the current focus on Grid computing and the use of commodity
hardware – typically Intel PCs running Red Hat GNU/Linux.
Direct contacts with the Oracle development and management teams have helped to solve issues faced by
CERN deployment of the Oracle products; in 2003 part of these contacts were managed jointly between the IT
and the AS divisions.
The well-received Oracle tutorial series was repeated in a slightly revised form, with the goal of
disseminating ‘best practices’ and hence reducing the support load.
Grid Data Management
The Local Replica Catalog (LRC) and Replica Metadata Catalog (RMC) components of Replica Location
Service (RLS), developed in the context of the European DataGrid (EDG) were deployed in production for the
LHC Computing Grid as described in more detail below.
LHC Persistency Framework
In the context of the LHC Applications Area Persistency Framework, the POOL project (Pool Of persistent Objects for LHC) was launched in 2002, with the task of delivering a technology-independent persistency solution for the LHC experiments. In particular, the delivered solution was required to be capable of allowing
the multiple petabytes of experiment data and associated meta-data to be stored in a distributed and Grid-
enabled fashion. This joint project (together with EP division) delivered several production releases that were
successfully integrated into the frameworks of two of the LHC experiments (CMS and LHCb) and were shown
to be capable of handling their event models. Based on middleware provided by the Grid Data Management
section and through services established by the Physics Data Management section, Grid-aware production
services were set up to allow POOL and other users to store and retrieve files in the LHC Computing Grid,
LCG. These services are described in more detail below.
A second Persistency Framework project was launched in the second half of the year, aimed at providing
an experiment-independent solution to handling detector conditions data.
Physics Data Management
The Sun-based physics cluster – established to decouple the physics and infrastructure environments,
where increasingly different requirements were seen – was put into full production, with all applications and
data migration from the central infrastructure cluster finished by the middle of the year. This cluster runs a
more recent version of Oracle than that currently used on the infrastructure side, and this is likely to continue
in the future, with an earlier move to Oracle 10g and subsequent versions.
Based on preparatory work in 2002, the event data of COMPASS and HARP was migrated from an
Objectivity/DB-based solution – support for which is being dropped by CERN – to one based on Oracle and
Alice ‘DATE’ format files. This migration also included moving the data from legacy tape media to a more
modern format and required a significant effort through much of the year. The total data volume was around
350 TB, and data rates of around 100 MB/s were sustained over 24-hour periods. The group also supported COMPASS data taking and processing for 2004, with a total of 10¹⁰ event headers being stored in 6 Oracle databases.
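The sustained rate quoted above fixes the minimum duration of such a migration; a quick check, assuming decimal units (1 TB = 10⁶ MB):

```python
# Lower bound on the COMPASS/HARP migration time implied by the figures
# in the text: 350 TB at a sustained 100 MB/s. Decimal units assumed.
TB = 1e6  # MB per TB
volume_mb = 350 * TB
rate_mb_s = 100

seconds = volume_mb / rate_mb_s
days = seconds / 86400
print(f"{days:.0f} days of continuous transfer")  # about 41 days
```

Over a month of uninterrupted streaming even in the ideal case, which is consistent with the report's remark that the migration required significant effort through much of the year.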
A significant amount of effort was devoted to setting up highly reliable production services for the LHC
Computing Grid (LCG). These services provide the Grid-aware file catalogue and file-level meta-data
repository used by POOL and other applications, such as the job scheduler, and any service interruption leads
to a significantly degraded Grid. The middleware involved was deployed using an Oracle Application Server
running on an Intel/Linux PC per virtual organization (VO), with a shared Oracle Database running on an
Intel-based disk server. A similar setup was provided in the certification test-bed as well as for testing new
software releases. Using the technique of DNS alias switching, planned interventions at the level of the
Application Server can be performed in a manner that is totally transparent to running applications. At the
level of the Database, a short interruption of some 10–15 minutes is currently required, although there are plans to introduce higher-availability yet cost-effective solutions in 2004.
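The DNS alias switching technique can be sketched as follows: clients resolve a stable alias at connect time rather than once at startup, so repointing the alias redirects new connections without any client change. Host names and the injectable resolver are illustrative:

```python
# Sketch of DNS-alias switching for transparent server interventions.
# Clients use a stable alias; operators repoint the alias to a standby
# host, and every subsequent connection follows it automatically.
# The alias name is a placeholder, not a real service.
import socket


def resolve_backend(alias: str, resolve=socket.gethostbyname) -> str:
    """Resolve the service alias afresh; 'resolve' is injectable for tests."""
    return resolve(alias)


def connect(alias: str, port: int, resolve=socket.gethostbyname):
    # Resolving at connect time, not at startup, is what makes the
    # alias switch transparent to long-running applications.
    address = resolve_backend(alias, resolve)
    return socket.create_connection((address, port))
```

The per-connection lookup trades a little resolution overhead for the ability to drain and service an Application Server with no client-visible downtime, matching the behaviour described above.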
Fabric Infrastructure and Operations
Computer Centre Operation
The Division undertook a thorough review of practices and tools in the Computer Centre operations area
during 2003 both to improve current planning and coordination in the Machine Room and as preparation for
an increased operations load during the ramp-up to LHC data taking. Reinforced by engineer level expertise
provided as part of the LCG project, the team introduced a comprehensive Hardware Management System.
HMS tracks all system movements within the computer centre as well as tracking calls to vendors for
hardware repairs. Built around an application on top of the Division’s Remedy workflow management system,
HMS is closely interfaced to the sitewide network database enabling automatic allocation of network
addresses. This new tool proved invaluable in late November when some 250 new CPU servers were installed
in just two weeks. Overall, HMS and improved planning and control have enabled the Division to manage
machine room operations with reduced manpower in 2003 (the previous Operations Manager having taken a
deserved retirement at Easter), but the operations team is now overstretched with the additional activities
caused by the machine room refurbishment and the increasing numbers of systems being installed each year.
Computer Centre Infrastructure
As planned, construction of an extension to Bldg. 513 started in August. The extension, expected to be
completed in February 2004, will house the new electrical substation for the Computer Centre that is required
to meet the power requirements of the LHC offline computing farms. Installation of the electrical equipment
will start later in 2004 with commissioning foreseen for early 2005.
To prepare the Computer Centre itself to house the offline computing farms, one half of the ground floor
machine room was emptied of computing equipment. This was a major operation involving the move of five
STK tape silos and over 500 compute servers to the machine room that was created in the basement of
Bldg. 513 during 2002. Once all of the computing equipment had been removed, work was started to clean up
the false floor void (with over 15 km of obsolete cabling being removed) and to upgrade the underfloor
electrical and network distribution. The major part of this work was complete by the end of 2003 but some
changes to the air conditioning arrangements are needed before this half of the machine room can be put back
into service. Once it is back in service, expected to be in early March 2004, equipment from the other half of
the machine room will have to be moved across to allow that side to be cleaned up and upgraded in turn.
Data Services
Responsibility for operation of the CASTOR-based managed storage services was transferred to the FIO
group early in 2003 in order to enable the introduction of common management policies across all Linux-
based services in the Division. As a first stage in this process, as the Linux kernel has now developed
sufficiently to meet our demanding needs without special modification, all servers were re-installed with a
standard Linux system replacing the many individually tailored systems in use previously.
This first stage was completed in time for the start of experiment data taking when, for the first time, all
data were recorded directly into the CASTOR system where users need to know only file names and not tape
numbers. The main users of the Division’s Central Data Recording service, NA48 and COMPASS, ran for
many months at an aggregate data rate of over 85 MB/s storing almost 400 TB of data onto 2000 data
cartridges. The success of the CDR operation, though, owed much to intensive effort to handle operational
problems with the disk server layer. Work to improve control over this element of the service resumed at the
end of accelerator operations and will intensify during 2004.
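The quoted figures are internally consistent, as a quick back-of-the-envelope check shows (the per-cartridge capacity and the run length below are derived from the text's own numbers, not from independent measurements):

```python
# Back-of-the-envelope check on the CDR figures quoted above:
# ~400 TB stored on ~2000 cartridges at an aggregate rate of over 85 MB/s.
TOTAL_TB = 400
CARTRIDGES = 2000
RATE_MB_PER_S = 85

# Average data per cartridge, in GB (decimal units throughout).
per_cartridge_gb = TOTAL_TB * 1000 / CARTRIDGES
print(per_cartridge_gb)  # 200.0

# Days of continuous recording needed at the quoted aggregate rate.
days = TOTAL_TB * 1_000_000 / RATE_MB_PER_S / 86_400
print(round(days))  # 54
```

At just over 85 MB/s, 400 TB corresponds to roughly 54 days of continuous recording, which is consistent with a run of "many months" at a duty cycle below 100%.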
Changing focus, and following on from the pilot project started in December 2002, the Division decided to
adopt TSM as the single backup system for computer centre systems, including the many servers behind the
central Windows services, and remote departmental servers. By the end of the year some two-thirds of the
client systems had been migrated to the new service which, in addition to running on better performing
hardware, also uses the newer STK 9940B tape drives, allowing a reduction in the costs for tape media.
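The media saving follows directly from the higher cartridge capacity of the newer drive generation. The sketch below is illustrative only: the capacities are the native figures commonly quoted for the STK 9940A and 9940B drives, and the backup volume is invented.

```python
import math

# Cartridge counts for a given backup volume on older 9940A media versus
# newer 9940B media. Native capacities per the vendor's published figures;
# the 30 TB volume is purely illustrative.
CAP_9940A_GB = 60
CAP_9940B_GB = 200
volume_gb = 30_000  # 30 TB of backup data (invented example)

old_cartridges = math.ceil(volume_gb / CAP_9940A_GB)
new_cartridges = math.ceil(volume_gb / CAP_9940B_GB)
print(old_cartridges, new_cartridges)  # 500 150
```

The same data fits on under a third of the media, which is where the reduction in tape media costs comes from.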
Fabric Services
The merging of all experiment-dedicated Linux compute farms into a single LXBATCH cluster was
completed early in the year with good collaboration with the experiments. Just over 700 new nodes were
installed during the year, to bring the total number of nodes in the cluster to around 1300. Older and weaker
nodes were replaced by faster new ones in order to meet the minimum experiment requirements, and to offer a
more uniform environment. The operating system version was upgraded to version 7.3 of the Red Hat Linux
distribution, and all node inhomogeneities were eliminated by reinstalling all nodes using the Quattor tool
suite (see next section). Overall utilisation of the batch farm was improved by refining and extending the fair-share scheduling scheme of the LSF batch scheduler to better handle the mix of experiment production work, physics and data challenge needs, and ongoing analysis activities.
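The effect of fair-share scheduling can be illustrated with a simplified model: each share group's dynamic priority falls as it consumes resources, so under-served groups are dispatched first. The formula below is a rough sketch of the idea, not LSF's exact dynamic-priority formula, and the group names, shares, and weights are invented for illustration.

```python
# Simplified fair-share model: priority rises with allocated shares and
# falls with accumulated usage. Weights and groups are illustrative only.
def dynamic_priority(shares, running_jobs, cpu_hours_used,
                     run_job_factor=3.0, cpu_factor=0.1):
    # More accumulated usage -> lower priority; more shares -> higher.
    return shares / (1.0 + run_job_factor * running_jobs
                     + cpu_factor * cpu_hours_used)

groups = {
    "production": dynamic_priority(shares=50, running_jobs=400, cpu_hours_used=9000),
    "datachallenge": dynamic_priority(shares=30, running_jobs=50, cpu_hours_used=1000),
    "analysis": dynamic_priority(shares=20, running_jobs=10, cpu_hours_used=200),
}

# The next free batch slot goes to the group with the highest priority.
next_group = max(groups, key=groups.get)
print(next_group)  # analysis
```

Even though the analysis group holds the fewest shares, its low recent usage gives it the highest dynamic priority, which is how ongoing analysis work stays responsive alongside heavy production activity.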
As noted in the report for 2002, the Division decided not to retender the contract for system administration
services, but to establish an insourced team using the new P+M flexibility. A run-down during the final year of
the contract was negotiated with Serco early in the year. The structured and steady removal of services started
as early as April, thus generating financial savings for the Division. Since the insourced team could only
become operational in October, some additional workload was taken on by the LXBATCH team – again,
though, the EDG/WP4 tools proved their worth as they minimized the additional systems administration
overhead.
Fabric Management Tools
The already close relationship between the different CERN components of the Fabric Management (WP4)
team within the European DataGrid project – with FIO Group members taking an active part in the WP4 work
– was strengthened in early 2003 when all WP4 activities (which include management of WP4 as a whole and
of the individual subtasks) were gathered together in the FIO Group. The aim of this reorganization was to aid
the deployment at CERN of the tools being developed by WP4 and to ease the transition from the final
development year of the EDG project to normal operations in 2004 and beyond. As part of this transition, the
different WP4 tools were given names during the year – Quattor for the system installation and configuration
package and LEMON for the system monitoring package. Together with a third component, LEAF, these are
the three elements of ELFms, the Extremely Large Fabric management system, which will allow us to manage
the large LHC offline computing farms efficiently. Developments in each of the three different ELFms
components are covered in turn below.
In a modular fashion, different elements of the Quattor component were finalized by the development team
and then rapidly prototyped on the LXBATCH farm. By the end of the year all elements of Quattor were in
place and Quattor was in full control of the Division’s production Linux services. Our experiences with
Quattor on more than 1500 Linux systems have shown this installation and configuration tool to be a robust,
reliable, and scalable system which addresses the needs of large computing clusters. Two successes of Quattor
during the year are particularly noteworthy. First, in a planned intervention, the LSF batch scheduler was
upgraded live on the LXBATCH service in just 10 minutes without impact on any of the 2000 running jobs.
Secondly, and perhaps even more impressive as there could be no preparation beforehand, a critical security
update was installed across all nodes of the farm within an hour of the patch being made available. These
successes, together with the port of Quattor to Solaris by the PS Group, are creating interest in the tool from
LHC experiments and we will build on these contacts during 2004.
For LEMON (the LHC Era Monitoring framework), the key topic for the year was the choice of the central
repository to store monitoring data. The options were the WP4-developed OraMon server and an export of
data from the PVSS SCADA system into Oracle. These two alternatives were tested extensively in the first
half of the year and led to a decision to use the OraMon server as the production repository from September.
The focus has now switched to the development of monitoring displays tailored for the different target users
(such as console operators, system managers or end users) and the development of combined metrics which
build on low-level information to provide a more service-oriented measure of system performance. Other
elements of the LEMON framework (such as an API for the monitoring repository and the alarm broker) were
developed according to the agreed WP4 plan and most were used in production at CERN during the year. By
the end of 2003 about 2000 machines in the CERN computer centre were being monitored by LEMON and, as
for Quattor, this will be extended during 2004 – initially to Solaris machines following work with PS Group,
but also to Windows-based servers, to components of the computer centre network, and to integrating
application level monitoring of the Oracle services.
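A combined metric of the kind described, building a service-oriented measure from low-level samples, might look schematically as follows. The metric names, thresholds, and health states are invented for illustration and do not reflect the actual LEMON sensor set.

```python
# Sketch of a "combined metric": derive a service-level state from
# low-level monitoring samples. Names and thresholds are invented.
def batch_node_health(samples):
    """samples: dict mapping a low-level metric name to its latest value."""
    if not samples.get("lsf_daemon_up", False):
        return "fault"          # scheduler daemon down: node unusable
    if samples["load_avg"] > 2.0 * samples["cpu_count"]:
        return "degraded"       # heavily overloaded
    if samples["tmp_free_gb"] < 1.0:
        return "degraded"       # jobs will start failing on scratch space
    return "ok"

node = {"lsf_daemon_up": True, "load_avg": 3.5, "cpu_count": 2, "tmp_free_gb": 12.0}
print(batch_node_health(node))  # ok
```

The point of such derived metrics is that a console operator sees one service-oriented state per node rather than a stream of raw sensor values.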
The final ELFms component, LEAF (for LHC Era Automated Fabric), brings together the local recovery
elements of the WP4 tools with the Hardware Management System discussed above and a new State
Management System. The intention here is to have tools to manage the desired state of the overall fabric, as
recorded in the Quattor configuration database, and to react to any deviations from this desired state that are
signalled by LEMON. In 2003, work started on collecting CERN requirements for local recovery actions. This
will continue in 2004 along with developments of the State Management System which will automate high-
level management actions such as the migration of systems between different services.
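In outline, this desired-state mechanism amounts to a reconciliation loop: compare the state recorded in the configuration database with the state reported by the monitoring, and pick a recovery action for each deviation. The sketch below is purely schematic; the state names, node names, and recovery-action table are hypothetical and do not reflect LEAF’s actual interfaces.

```python
# Schematic reconciliation loop: desired state (from the configuration
# database) versus actual state (from the monitoring). The states and
# the action table are hypothetical.
RECOVERY_ACTIONS = {
    ("production", "daemon_dead"): "restart_daemon",
    ("production", "unreachable"): "open_operator_ticket",
    ("standby", "unreachable"): "ignore",
}

def reconcile(desired, actual):
    """Return {node: action} for every node deviating from its desired state."""
    actions = {}
    for node, want in desired.items():
        got = actual.get(node, "unreachable")
        if got != "ok":
            actions[node] = RECOVERY_ACTIONS.get((want, got), "open_operator_ticket")
    return actions

desired = {"lxb001": "production", "lxb002": "production", "lxb003": "standby"}
actual = {"lxb001": "ok", "lxb002": "daemon_dead"}
print(reconcile(desired, actual))  # {'lxb002': 'restart_daemon', 'lxb003': 'ignore'}
```

Automated high-level actions, such as migrating a node between services, then reduce to changing its desired state and letting the loop converge.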
Problem Tracking and Remedy Support
Effort expended on increased automation for the Remedy service and simplification of the applications
paid off in 2003 with the Remedy service running smoothly with much reduced manpower. Despite the
reduced effort, the Service Level Agreement package for the Problem Report Management System workflow
was replaced. The new SLA package, in line with previous Remedy/IT developments for PRMS and ITCM,
delegates SLA configuration, via a graphical interface, to the Domain Managers. At present the package is
used only by the Desktop and User Support domains but will be extended to control all IT support lines early
in 2004.
System Administration
As mentioned earlier, the Division decided to establish an insourced system administration team in 2003.
There were three clear phases in this area during the year: preparation, team training, and service start-up. In
addition to the selection and later interviewing of candidates, the preparation phase included a review of the
tools and methods used by the outsourced team. This review led to a decision to maintain the ITCM Remedy
based workflow for communications between service managers and the system administration team. However,
the tool managing the repository of procedures to be used by team members was rewritten to improve ease of
use by both parties. With the arrival of the first team members in September the training phase began. This
lasted for one month with courses ranging from an introduction to CERN, through general Linux topics to
specialist areas such as AFS and LSF.
The real test for the team came in October with responsibility for all of the production Linux services in
the centre – well over 1500 machines. Very rapidly the relatively small team was successfully carrying out all
of the required administration tasks for the Linux services and so, after some additional Solaris training, the
team took on responsibility for all systems previously administered by the outsourced team by the end of the
year.
All of these systems, together with the many additional systems installed at the end of the year, led to a high
workload for the new team. Fortunately the pressure will relax somewhat at the beginning of 2004 when three
new team members arrive. Once the newcomers have been trained, though, the team will be taking on system
administration tasks for systems that were not previously managed by the outsourced contract – including, for
example, the many Windows-based servers.
Grid Deployment
The Grid Deployment (GD) group was set up at the beginning of 2003, with most of the group members coming from LCG funding sources. Several staff joined the group during the year and by the end of
2003 the group had 26 members. The group has three sections: Certification and Testing, responsible for
integrating the LCG middleware; Grid Infrastructure Support, responsible for deploying and supporting the
operation of the grid; Experiment Integration Support, working directly with the experiments to make their
software work within LCG.
The Grid Deployment group forms the core of the LCG project’s Grid Deployment Activity and collaborates on many projects and activities with other members of LCG in Europe,
Asia, and the USA. Several short-term and long-term visitors working on these projects are hosted by the
group.
The significant achievements of the group during 2003 were the preparation and successful deployment of
the LCG-1 service to some 28 sites around the world, spanning three continents, and the subsequent preparation
of the upgraded middleware forming the basis of the LCG-2 service that will run during 2004. Integral to these
achievements was the setting up of a large certification test bed extending to sites outside of CERN and the
building up of a certification process for the middleware. In November the project underwent two reviews – one internal and one by the LHCC. Both were successful, and useful in pointing out areas of potential improvement.
The GD group will also form the core of the EGEE operations activity, and members of the group were
responsible for editing and coordinating the operations part of the EGEE proposal during 2003.
Certification and Testing
This was a significant activity of the group and involved setting up the testing and certification process for
middleware. The section carried out the middleware integration of the LCG-1 and LCG-2 releases, based on VDT and
components from EDG 1.4 and 2.1. This was a major effort and resulted in many significant bug fixes in the
middleware at all levels from the basic Globus and Condor components of VDT, to the higher level EDG
components. The work has now established a baseline functionality against which all future functional
middleware upgrades will be tested. The section operates a large testbed to support this work – some
60 machines simulating a grid within the CERN network, recently extended to several remote sites to enable
real wide-area testing. Within the section there are teams dedicated to building tests and a test framework for
the certification process, and to system-level debugging and problem analysis.
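A certification process of this kind, running a suite of functional tests against each candidate middleware release and gating certification on the results, can be sketched minimally as below. The test names, the release tags, and the pass criteria are invented for illustration; they are not the actual LCG certification suite.

```python
# Minimal sketch of a certification harness: run each functional test
# against a candidate release and certify only if all of them pass.
# Test names and criteria are invented.
def certify(release, tests):
    """tests: dict of test name -> callable(release) -> bool."""
    results = {name: check(release) for name, check in tests.items()}
    failed = [name for name, ok in results.items() if not ok]
    return (len(failed) == 0, failed)

tests = {
    "job_submission": lambda rel: True,
    "data_replication": lambda rel: True,
    "info_system_query": lambda rel: rel >= "lcg-2",  # fails for older tags
}

ok, failed = certify("lcg-1", tests)
print(ok, failed)  # False ['info_system_query']
```

Establishing such a baseline means every future middleware upgrade is rejected unless it at least preserves the functionality already certified.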
Grid Deployment and Operations Support
This section is responsible for preparing the certified middleware release for deployment, coordinating the
deployment with the regional centres, and supporting the operating service once deployed. Members of the
section also run the CERN Certification Authority – issuing grid certificates to CERN users, managing the
Virtual Organization registration process, and taking responsibility for grid-related security matters.
Experiment Integration Support
The members of this section work directly with the LHC experiments to help integrate the experiment
software with the LCG grid middleware. The section operates a small testbed where the experiments can test
their software with the latest LCG middleware release before moving to production. The section is responsible
for providing an essential communication channel between the deployment activity and the experiments. This channel is vital since things change very quickly in this environment as the experiments prepare for their data challenges and as the grid middleware undergoes rapid improvement and change.
Although the delivery of LCG-1 was late, the group supported the successful test productions of CMS and
ATLAS at the end of the year.
Internet Services
Interactive Videoconferencing
A new videoconference room was installed – giving a total of eight videoconference meeting rooms
operated by the service. During 2003, 1400 videoconferences were registered in these rooms, which
represents an 11% increase compared to 2002. The use of IP conferencing (using VRVS and H.323) continued
to grow (+13%) compared to legacy ISDN conferencing.
Messaging Services
During 2003, all users of the Messaging Services were smoothly migrated to the new platform. Several
seminars were given to introduce the new system to users and a User Guide was published. The new Web site
was enriched with support information for users and with tools for the second-level supporters.
The anti-spam system was improved with several load-balancing servers and new filters. An anti-virus
system for the Mail infrastructure was bought and deployed on the servers. Monitoring tools were enhanced
by adding several checks, and the backup system was revised and thoroughly tested.
Web Services
During 2003, a substantial effort was made to prepare the next generation of Web hosting servers foreseen
to replace the present infrastructure.
Several improvements were made at the service monitoring level to further increase the stability and
reliability of the hosting servers provided by the CERN Web services. In addition, PHP was introduced as
an additional scripting language, available for both Windows hosting servers and Linux-based AFS-gateway
servers.
The number of official Web sites hosted by the CERN Web services infrastructure grew from 1200 to about
1500 during the year. The total number of Web sites increased from 5000 to about 6000. The total number of
HTTP requests served by all central Web servers increased by about 30% to reach 1.3 million requests per day.
The Web services are used not only by CERN staff members but also by many CERN users and HEP
collaborations to share their documents, physics data, and other relevant information from a central point.
Windows Server Services
A new contract for the purchase of server hardware was established allowing the replacement of all central
servers over the next three years. This will lead to a considerable increase in aggregate performance.
The old installation supporting Windows 95 and NT 4 was phased out on April 1st without major
difficulties. In the meantime, all new PC hardware has been delivered with Windows XP and Office XP pre-installed. A dual-boot (Windows/Linux) capability was also introduced. A Virtual Private Network
service was introduced in the middle of the year.
Product Support
Overview
2003 was a year of consolidation across most sectors of PS Group’s activities. New versions of various
products were delivered, tested, and made available to users. In several cases, notably that of Cadence and
related tools, production Windows versions appeared and were installed for evaluation.
The migration of the EUCLID clients from Alpha stations to standard desktop PCs equipped with a high-
quality graphics board was completed on target by the middle of the year with very few problems. Work has
now started on the migration of the server platform. With the successful conclusion of the CAD2000 project,
the group has been working closely with EST colleagues to install and set up the CATIA and SmarTeam
products for a pilot evaluation.
A serious effort was made in promoting user training, in some cases by inviting vendors to offer training
courses, in other cases via courses given by group members. These have been greatly appreciated by the user
community. There was also significant training of our own staff, both on particular tools and on Windows itself, as we increasingly need to understand how to integrate products into Windows.
The SUNDEV cluster in the Computer Centre underwent a major refresh and, as part of the purchase
agreement, we acquired a Sun Blade. Sun also funded some of the code porting effort to implement EDG WP4
tools under Solaris.
Digital and Analog Electronics CAE/CAD Support
Following user demand, the Cadence Design suite (PSD) was made available during 2003 for the Windows
NICE platform. The main issue was to provide an environment where both the design data and the libraries
could be accessible and shared from the CAE Sun workstations and PCs. Several updates for PSD release 14 were installed, tested, and made available. Release 15.1 was received in the first weeks of 2004 and work
has started on its deployment. Compatibility and performance are such that the CERN EST Printed Circuit
Board Design office has announced its intention to move towards a native PC-only environment.
Continued efforts have been made to renegotiate better maintenance deals for different products and we
have achieved some successes. The budget proposed at the November 2003 ELEC meeting was approximately
16% less than the corresponding total presented in 2001 while still being able to maintain, and in some cases
improve, the packages presented to users. Digital Signal Processing tools and training, and the formal verification of digital systems, have been identified as future areas of interest.
Mechanical Engineering and Related Fields
During 2003, PaRC, the cluster for compute-intensive engineering applications, was extended by 10 dual-Xeon Linux nodes for two specific simulation/analysis packages. Three of the nodes have been configured for
running the parallel/multiprocessor version of the computational fluid dynamics package StarCD-HPC. The
other seven nodes, boosted with 4 GB of memory each, have been configured with PVM (Parallel Virtual
Machine) in order to take full advantage of the parallel computing facilities of GdfidL, a new analysis tool for electromagnetic fields in 3D structures. With this setup a full performance study of the CLIC power
extraction and transfer structure was possible using 200 cells. The same study using MAFIA would have been
limited to 10 cells.
Mathematica 5.0 and MATLAB 6.5 were installed and made available on PaRC. A backup licence server
(Licelan3) was installed and configured for Mathematica in case of problems with the main licence server.
MATLAB was made available for both interactive and batch modes.
Structure Analysis and Field Calculations: following last year’s consolidation of the FEM tools, Ansys 7.1
was put into production along with new versions of DesignSpace, ESAComp, OPERA/TOSCA, HFSS, and
CST Microwave Studio.
In the field of the mechanical CAD systems, the CAD2000 project was concluded and the tool CATIA V5
from Dassault Systèmes was recommended as the future mechanical 3D CAD system for CERN. For the
members of a pilot project, CATIA V5R9 has been made available in the NICE 2000/XP environment for
installation and a DFS project space has been set up for sharing design data. For the CATIA CAD data
management a test installation of SmarTeam, the local PDM (Product Data Management) system provided by
Dassault, was made in the summer. There was a reference visit to ETA in Grenchen to see the use of Axalant
managing Euclid and CATIA V5 data. Finally, according to the preferences of EST/ISS, it was decided to
launch a pilot project with SmarTeam with assistance from the CATIA supplier. The Windows servers for the
CATIA CAD data management have been acquired and installed and the latest version of SmarTeam with
CATIA V5R12 was installed in the first week of 2004.
Support continued for the Autodesk products AutoCAD, Mechanical Desktop, Inventor and the principal
3D mechanical CAD engineering tool, Euclid, from Matra Datavision. AutoCAD version 2004 has been
installed and tested; however, it was decided to skip this release and deploy only the next release expected
early in 2004. New versions, including new licence servers, have been installed for the electrical packages
SEE Electrical Expert and TracElec, a third-party tool used with AutoCAD.
On the Euclid side, the main achievement this year was the move of the whole Euclid user community
from DUNIX to Windows. One hundred and thirty Windows/NICE PCs are now installed and configured for
the Euclid users. The migration of the servers to Windows is planned for 2004. A Windows server
configuration has already been acquired and installed and is being used by the EST Euclid support for porting
all EST-provided Euclid-DUNIX tools to Windows, and by the IT-PS support to test the installation mechanisms.
Concerning Engineering Data Management, the EDMS servers on DUNIX were phased out in September
and the EDMS applications and file server were transferred to the Sun Database cluster. This service has now been
taken over by the IT-DB group.
The Cadence–EDMS interface for electronics designers saw steady growth, with a new group of users and local support in AB Division joining the activity.
Software Development Tools
Starting in December 2002, responsibility for the Software Development Tools (SDT) service was insourced and assigned to a newly transferred PS staff member.
All SDT products that run on the Windows platform can now be installed using the standard Windows
Add/Remove programs method. Such products have also been inserted into the Windows SMS inventory and
accounting monitoring service.
A new tool, XMLSpy, used for designing, editing, and debugging professional applications involving XML, was introduced during the summer of 2003, and the SDT service took responsibility for the OpenInventor and Qt libraries.
CVS
The central service continued to grow, reaching 60 repositories in production, and during the year we developed a specific service for the LCG based on local disks rather than AFS. This LCG service is currently undergoing acceptance tests by LCG team members. We have added Web interfaces to the CVS service, in particular implementing CVSWeb and ViewCVS, and tested access from the LCG Savannah portal. The CVS service and its Web interfaces are described at http://cvs.web.cern.ch/cvs/.
Solaris
We prepared during the year for the Solaris 9 certification. The mailing list forum-solaris-[email protected] continues to be the channel for discussion of the Solaris certification process and application availability questions, so that consensus is reached before any important change. Official
representatives from the LHC experiments are present in the list.
As agreed by COCOTIME, we proceeded with the technology refresh of the Sundev facility. We acquired 10 state-of-the-art dual 1 GHz V210 nodes together with a Sun Blade server 1600 under very favourable
conditions. The Blade server is being used to evaluate the Sun N1 management system to compare it with
Quattor and to test the Oracle Application Server in this environment in collaboration with AS division. The
new V210 nodes were put into production in November; they deliver more than double the performance of the older nodes. One of the V210 nodes has been temporarily diverted to support the mail service
by offloading the Listbox server. We are reusing the old Sundev nodes to host the Licman and Licelan floating
licence servers.
At the request of CMS we installed the SunOne 8 compilers and provided support for the tests of this version. We also maintain the GNU compilers, Mozilla, and other open software for Solaris; some versions are built for Linux at the request of the ADC group.
We continued to manage the CMS Sun disk servers in the computing centre. In March we performed a
major disk clean-up and rationalization of both hardware and software for which we were thanked by CMS.
This clean-up has allowed us to run without major problems ever since. At the end of the year, CASTOR staging activity stopped on these machines, but they are still used for NFS and Objectivity data serving.
A tutorial on Solaris System administration was produced and taught to the new computing centre system
administration team being created in FIO Group.
CAE Cluster for Electronics
We have continued to rationalize the support of the CAE cluster by making use of the Helpdesk and the
Desktop Support Contract for the first levels of support. We have continued to help the EST PCB workshop rationalize its computing configuration. A CAE server upgrade was performed during the year. It uncovered
serious technical problems that were eventually solved in cooperation with Sun.
Print Servers
The migration of the print servers to newer hardware and the latest supported release of Red Hat Linux has
been successfully completed. The servers were put into production with the help of the Printer Support team of the US
group. This migration included improvements to the filtering system, development of a database to
synchronize the configurations of the print servers in the cluster, and Web GUIs for printer support and server
management. It also uses a new SOAP interface developed in cooperation with CS group to access and update
the printer mappings in DNS.
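The synchronization of per-server printer configurations against a central database reduces to a small set-difference step, as the sketch below illustrates. The queue names and data layout are hypothetical, and the actual SOAP call that updates the DNS mappings is deliberately omitted.

```python
# Sketch of synchronizing one print server's local queue list with a
# central configuration database: compute which queues to add and which
# to retire. The queue names are hypothetical.
def diff_queues(central, local):
    central_set, local_set = set(central), set(local)
    to_add = sorted(central_set - local_set)      # in the DB, missing locally
    to_remove = sorted(local_set - central_set)   # local leftovers to retire
    return to_add, to_remove

central_db = ["513-pr1", "31-pr2", "40-pr7"]
server_cfg = ["513-pr1", "2-pr3"]

to_add, to_remove = diff_queues(central_db, server_cfg)
print(to_add, to_remove)  # ['31-pr2', '40-pr7'] ['2-pr3']
```

Running such a diff on every server in the cluster keeps all of them converging on the single configuration held in the database.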
Euclid Service
We have been working during the year on the new Windows server platform for the Euclid and CATIA services, as well as maintaining the old infrastructure. To streamline communications, it was agreed to move the person working on these questions to the DTS section at the end of the year.
Software Licence Office
During 2003, most of CERN’s major software licences, such as those with Oracle, Microsoft, and Symantec
(suppliers of CERN’s anti-virus checker for Windows) did not come up for renewal and there was limited
contact with those suppliers in respect of software licences. One exception was Symantec, where we assisted IS
group in negotiating an extension to cover virus checking of mail traffic.
However, in the second half of the year we were confronted with a new software release strategy from Red Hat, suppliers of our favoured Linux distribution. In effect, we need to find a way to fund their new Enterprise licence, which, unlike our previous distribution, is not free. The picture is complicated by our strong desire to reach a deal that covers our collaborators in other HEP institutes. By the end of the year we had the broad outlines of such a deal, but no indication of how easy it will be to reach an agreement within a budget that we and our outside collaborators can afford. Negotiations continue.
Another software product which required a lot of negotiation was the Mathematica software library. The
authors of the product required us to move to a new local agent and to a new pricing structure, less favourable
to CERN. After much negotiation, a three-year agreement was reached that optimizes the CERN budget for the number of licences expected to be needed at present, but which unfortunately sets lower limits than before on future expansion.
As in the previous year, more outside groups and divisions requested our services, and in several cases we were able to group requests for the same product and obtain better terms for CERN, as well as reducing the internal overhead of order processing. We also received single requests for specific products, and the Desktop Forum decided that we should get involved only if the same product is requested by multiple divisions and/or we can add value to SPL in their negotiations with the supplier.
Desktop Support Contract
Several factors increased the workload of the Desktop Computing Support in 2003. The amount of spam arriving via the mail feeds of the support lines increased significantly, finally forcing us to deploy a spam filter to reduce its impact. The virus/worm incidents in the summer also had a major effect. Computers
without up-to-date virus protection were blocked from the network and users needed help to disinfect their
PCs and update the anti-virus software. Finally, the mail migration to Exchange increased the workload on the
support staff as many users had password and authentication issues and needed help with the new
environment. The contractor showed a lot of flexibility and reacted quickly to our requests for extra services
needed to cover the increased workloads. Some of the special services provided via the contract were insourced in the course of the year. In the area of local support, the merging of the PS and SL divisions made it necessary to adapt the local support arrangements to the new situation.
The preparation of the tender for the next contract, the current one having run its maximum five-year
duration, showed that the local support descriptions were far too complex. In discussion with the local support
managers, we defined a menu of services from which the local support managers can select, together with the
choice of a pre-defined set of response times. We also encountered difficulties in determining the workload in
the local support areas. These complications delayed the preparation and dispatch of the tender documents and
meant that we had to postpone the start of the new contract to July 2004 and extend the existing one by six
months.
User Support
The group has two sections: User Assistance (UA) and the Procurement Service.
User Assistance’s key activities are the Helpdesk, the IT Web pages, and the Computer Newsletter (CNL).
The major achievements were:
As an additional means of communication with users, the Service Status Board was introduced: an up-to-date Web page reporting all incidents, upcoming service changes, and scheduled interventions concerning IT services.
A new IT Web site was deployed. In contrast to the former organization-oriented site, the new site is service-oriented. With the help of the IS group, division-wide templates were created to give users the consistent look and feel that they had requested. The migration of the old pages is progressing well.
The Procurement Service is responsible for establishing CERN standard PC Desktop hardware for the
CERN computing community capable of running the CERN standard computing applications and operable
within the CERN networking infrastructure. Once established it is then responsible for checking that the
chosen hardware meets technical specifications, that a supplier is chosen, and that stock levels are maintained.
It takes responsibility for the storage, delivery, and installation of that equipment at the purchaser’s site. Sales
figures for PCs remained much the same as the previous year, as did those for monitors, though now with a notable shift to LCD flat screens.
Although the purchasing process was moved to the CERN Stores at the beginning of 2002 in order to
standardize procedures and reduce paperwork, it is worth noting that the Service created around 2500 CERN
administrative documents, and delivered 4766 articles during 2003. It is indeed time to review the entire
purchasing process from supplier to end user, since there are currently far too many steps and too many
persons involved.
The Managed Desktop Service (‘Rental PCs’) was introduced at the beginning of the year but was greeted with at best lukewarm enthusiasm, possibly owing to the cost overhead on what is now a relatively cheap piece of hardware. The pilot target of 200 packages has not been met (there are currently 160), so a revamp is foreseen for 2004, with a further review of the situation next June.
There was significant success with the development of PC laptop support where there was widespread user
appreciation of the service provided for non-standard laptops.
There was further appreciation of the Service’s involvement in the supply and installation of computing equipment for conferences both on and off the CERN site, most notably in the second half of the year, with major contributions to the success of the European School of Medical Physics in Archamps, the Telecom exhibition at Palexpo, and the RSIS conference both at CERN and Palexpo.
Mac Services provides central sales, support, and maintenance for Apple hardware, software products, and peripherals with minimal personnel. Compared with 2002 there was an increase in the number of sales of new machines, due mainly to the interest of the scientific community in the Unix-based MacOSX operating
system and the availability of powerful Apple G5 machines. With 600+ machines delivered over the last six
years, appropriate service is required and with this in mind the Mac Task Force recommended changes in the
support structure which will be implemented during 2004.
European DataGrid (EDG) Project
The European DataGrid project will come to a conclusion at the end of March 2004 with a final EU review
on February 19th and 20th at CERN. During the three years of the project, many important results were achieved in middleware development, grid deployment, and the applications domain.
One of the major milestones of 2003 was the release of the EDG 2.0 software, deployed on an application testbed composed of more than 15 sites. EDG 2.0 was also a major milestone for the HEP community, since it represented the synchronization point with a number of projects, notably LCG, Trillium (US projects via VDT), and NorduGrid, in terms of a common version of Globus and Condor and the use of the GLUE schema and compatible information providers. Throughout the year, DataGrid worked in
close collaboration with LCG, optimizing the middleware that is now the basis of the current LCG1
production infrastructure. With it, many physicists around the world are producing important simulated data
for their detector studies. Many improvements have been applied to EDG 2.0 during the second half of the
year, leading to a better version 2.1 of the software which builds the basis for the upcoming LCG2 production
infrastructure that is being deployed at the beginning of 2004.
But release 2.0 was not only for physics, and representatives of the HEP experiments have been working
together with representatives from the bio-informatics and Earth observation communities, the other scientific
fields supported by DataGrid, on the definition of common high-level grid services.
In parallel, an intensive programme of training and education continued with the DataGrid tutorials. After the tutorials had travelled to 10 different institutions in the project member states during the year, the final session reflected the truly international spirit of EDG: it was held in Islamabad, Pakistan, at the National Centre for Physics. About 40 people, including physicists and computer scientists, took part in the event, the first of its kind held in the country, which even made the national TV news.
In September 2003 the project held the last DataGrid Project Conference in Heidelberg, where one of the
main topics of discussion was the smooth transition between EDG and the proposed EGEE project. In the last
few months in fact, many of the EDG participants have contributed to the submission of the proposal for a new
EU project, in the context of the EC Sixth Framework Programme.
The goal of the proposed EGEE (Enabling Grids for E-science in Europe) project is to create a Europe-wide, ‘production quality’ infrastructure on top of the present (and future) EU research network infrastructures. Such a grid will
provide distributed European research communities with access to major computing resources, independent of
geographic location. Compared to EDG, EGEE will represent a change of emphasis from grid development to
grid deployment; many application domains will be supported with one large-scale infrastructure that will
attract new resources over time.
At the time of writing, the final negotiations with the EU are being concluded so that the project can start
on 1 April 2004.
LHC Computing Grid (LCG) Project
Changes in Structure and Scope
In order to optimize the use of the resources of the EGEE project (mentioned elsewhere in this report) and
LCG, an agreement has been made to have a very close relationship between the managements of the two
projects: the EGEE grid will be operated as an extension of the LCG service, managed by the LCG Grid
Deployment manager; the manager of the EGEE middleware activity, whose task is to acquire or develop a
solid middleware toolkit suitable for HEP and other sciences, will serve as the middleware manager of the
LCG project. EGEE will also fund a small activity for integrating EGEE middleware in the LHC experiments’
applications. The EGEE project will start in April 2004, but the middleware activity already began at the end
of 2003, bringing together experts from the EU funded European DataGrid project, the AliEn project of
ALICE, and the Virtual Data Toolkit (VDT) project which integrates software from several US grid projects.
The last major extension of the scope of Phase 1 of the project was agreed in December, when the SC2
committee accepted the recommendations of two working groups that had defined requirements for distributed
analysis, covering both generic (i.e. not HEP-specific) middleware, as well as HEP-specific distributed
analysis tools. A project structure to implement these recommendations will be defined early in 2004, in
consultation with the LHC community and grid technology projects in Europe and North America.
Taking account of the changing focus of the project, as the LCG service is opened and the first
developments from the applications area are integrated into the mainline software of the experiments, the
Oversight Board agreed in November to a proposal to restructure the principal management committees of the
project – the Project Execution Board (PEB) and the Software and Computing Committee (SC2). The
experiments’ computing management now joins the PEB to increase their participation in the operational
management of the project.
Applications
The first production version of POOL, the object persistency system, was released on schedule in June,
supported by core software from the SEAL project and development infrastructure coming from the SPI
project. The POOL system is built on the basic object storage support of ROOT’s I/O system. The release
contained all the functionality asked for by the experiments for the first production release. The success of
POOL and other LCG software will ultimately be measured by how successfully it is taken up and used by the
experiments. The Application Area of the project has moved its focus towards experiment integration,
feedback and validation activities, although an extensive development programme will continue to be pursued.
CMS successfully integrated POOL and SEAL and validated POOL for event storage in their pre-
challenge simulation production. By the end of September they had successfully stored ~1 M events with
POOL. ATLAS successfully integrated POOL and SEAL in their Release 7, an important milestone towards
production use of POOL in ATLAS DC2 in 2004. Both experiments found the integration of POOL and SEAL
to require more work than they had anticipated. Since it is as vital for the project to see prompt and successful
take-up of the software in the experiments as it is to develop the software in the first place, the project must
examine how to increase the (already significant) effort employed to assist the experiments in integration.
CERN Fabric
A major restructuring of the tape service took place in April and May, including the relocation of the tape silos to the new lower level of the Computer Centre, the installation of new tape drives, and the upgrade of the existing drives, bringing the total number of these StorageTek model 9940B drives to 50. During the upgrade it was possible for a short time to have nearly exclusive access to 50 tape drives, and the opportunity was taken to organize a data challenge aiming at an aggregate data recording rate of 1 GB/s. The result was a sustained average recording rate of 920 MB/s over three days, with peak values of 1.2 GB/s.
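To put the sustained rate into perspective, the figures quoted above imply roughly 238 TB written to tape over the three days, or about 20 MB/s per tape server. The following sketch simply redoes that arithmetic (all inputs are the report's own figures; units are decimal, 1 MB = 10^6 bytes):

```python
# Back-of-the-envelope arithmetic for the 2003 tape data challenge.
# Figures taken from the report; 45 tape servers were used on average.

SUSTAINED_MB_S = 920          # average recording rate to tape, MB/s
DURATION_S = 3 * 24 * 3600    # three days, in seconds
TAPE_SERVERS = 45             # average number of tape servers in use

total_tb = SUSTAINED_MB_S * 1e6 * DURATION_S / 1e12   # total data, TB
per_server_mb_s = SUSTAINED_MB_S / TAPE_SERVERS       # average per server

print(f"total written: ~{total_tb:.0f} TB")              # ~238 TB
print(f"per tape server: ~{per_server_mb_s:.1f} MB/s")   # ~20.4 MB/s
```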
Some 350 new CPU servers were installed in April, and 55 disk servers, each with a usable capacity of
1.3 TB were installed at the beginning of July. An additional 440 CPU servers were installed in November.
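For scale, the usable disk capacity added in July works out as follows (a trivial sketch using the figures quoted above):

```python
# Usable disk capacity added in July 2003 (figures from the report).
DISK_SERVERS = 55
TB_PER_SERVER = 1.3           # usable capacity per disk server, TB

added_tb = DISK_SERVERS * TB_PER_SERVER
print(f"added usable disk capacity: ~{added_tb:.1f} TB")  # ~71.5 TB
```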
The work of re-costing the Tier 0+1 facility at CERN for Phase 2 (2006–08) was completed in May and
presented at an LHC Seminar in June. A summary of the conclusions is available as CERN-LCG-PEB-
2003-16.
The new CASTOR architecture and design has been presented on several occasions, and a paper has been
distributed for feedback prior to the beginning of development. CERN has made a proposal to the members of
HEPCCC for the support of CASTOR at institutes other than CERN.
Initial components of the Quattor system administration toolkit, designed and implemented by the
European DataGrid and CERN, were deployed on the majority of the systems in the Computer Centre and
have already been used during a major software upgrade.
Grid Technology
The Grid Technology Area (GTA) continued development work on resolving the issues around opening and accessing files stored in a storage system from worker nodes. A proposal was agreed by the Grid Deployment Board, a test version of the software was made available at the end of June, and the first production release was delivered in early September.
The GTA participated in the CMS modelling initiatives and started to study the modelling toolkit
developed initially by the Monarc project. At present this activity is only exploratory, but a short-term plan is
being prepared, with the aim to model some of the data challenges in 2004.
The GTA also proposed and started the OGSA engineering activity during this period, intended to provide
technical input to the future EGEE project. The initial objective of installing and validating the Globus OGSA
Toolkit was achieved, and exploratory work was performed on the integration of existing grid Web service
software components.
Grid Deployment
The first LHC Grid Service was opened on 15 September, deployed to 11 sites – CERN, Academia Sinica
Taipei, CNAF Bologna, FNAL, Forschungszentrum Karlsruhe, IN2P3 Lyon, KFKI Budapest, Moscow State
University, PIC Barcelona, Rutherford Appleton Laboratory, and the University of Tokyo. This grew to almost
30 sites by the end of the year.
Significant work has been invested in security – covering user registration as well as operational
procedures such as incident reporting. There is still a full programme of work in this area, but everything
necessary for the initial service has been put in place. The process for deploying upgrades has already been
tested with the distribution of a security bug fix. Initial versions of user guides and installation guides have
been prepared. The basic operations monitoring and user support systems are also ready. The global grid call
centre will be operated by Forschungszentrum Karlsruhe, which is currently testing the service locally. The
operations centre will be run by Rutherford Appleton Laboratory, which has prepared basic tools and
procedures, and has set up a monitoring Web site at http://www.grid-support.ac.uk/GOC/
By the end of the year, the experiments were installing their software on the LCG-1 service, with a view to
using LCG as the major service for the data challenges scheduled for 2004. A second major release of the grid
software was prepared for deployment in the first months of 2004, prior to the start of the data challenges.
Now that the initial deployment of the service has been completed, there are many tasks of an operational
nature that must be done by collaboration between the regional centres and the deployment team.
CERN openlab for DataGrid Applications
The CERN openlab for DataGrid applications is a framework for evaluating and integrating cutting-edge
technologies or services in partnership with industry, focusing on potential solutions for the LCG. The openlab
invites members from industry to join and contribute systems, resources, or services, and to carry out with CERN large-scale, high-performance evaluations of their solutions in an advanced integrated environment.
In a nutshell, the major achievements in 2003 were the successful incorporation of two new partners: IBM
and Oracle; the consolidation and expansion of the opencluster (a powerful compute and storage farm); the
start of the gridification process of the opencluster; the 10 Gbps challenge where very high transfer rates were
achieved over LAN and WAN distances (the latter in collaboration with other groups); the organization of
three thematic workshops, including one on Total Cost of Ownership; the creation of a new, lighter category of sponsor called ‘contributor’; and the implementation of the openlab student programme, which brought some 11 students to CERN in the summer.
Industrial Sponsors
The year 2003 started with three sponsors: Enterasys Networks (contributing high bit-rate network equipment), Hewlett-Packard (computer servers and fellows), and Intel Corporation (64-bit processor technology
and 10 Gbps Network Interface Cards). In March 2003, IBM joined the openlab (to contribute hardware and
software disk storage solution), followed by Oracle Corporation (to contribute Grid technology and fellows).
The annual Board of Sponsors meeting was successfully held on 13 June, and the annual report issued on
this occasion. In addition, three Thematic Workshops were organized (on Storage and Data Management,
Fabric Management, and Total Cost of Ownership). On the latter topic (TCO), a position paper establishing
the facts and figures was produced.
In order to permit time-limited incorporations of sponsors to fulfil specific technical missions, a concept of
contributor was devised and proposed to existing sponsors. Contributor status (as opposed to partner status for
existing sponsors) implies lower financial commitment and correspondingly fewer benefits in terms of
influence and exposure.
Technical Progress
The openlab is constructing the opencluster, a pilot compute and storage farm based on HP’s dual-processor machines, Intel’s Itanium Processor Family (IPF) processors, Enterasys’s 10-Gbps switches, IBM’s StorageTank system, and Oracle’s 10g Grid solution.
In 2003, the opencluster was first expanded with 32 servers (RX2600) equipped with second-generation IPF processors (1 GHz) and running Red Hat Linux Enterprise Server 2.1, OpenAFS, and LSF. In October, 16 servers equipped with third-generation IPF processors (1.3 GHz) were added. These servers are complemented by seven development systems.
The concept of an openlab technical challenge, where tangible objectives are jointly targeted by some or all of the partners, was proposed to the sponsors. The first instantiation was the 10 Gbps Challenge, a common effort by Enterasys, HP, Intel, and CERN. In this context, a first experiment was carried out in which two Linux-based HP computers with 1 GHz IPF processors, directly connected back-to-back through 10 GbE Network Interface Cards, reached 5.7 Gbps for single-stream memory-to-memory transfer. The transfer took place over a 10 km fibre. To extend the tests over WAN distances, collaborations took place with the DataTag project and the ATLAS DAQ group. Using openlab IPF-based servers as end-systems, DataTag and Caltech established a new Internet2 land-speed world record. Extensive tests with Enterasys’s ER16 router demonstrated that 10 Gbps rates could only be achieved through multiple parallel streams. An upgrade strategy, including the use of Enterasys’s new N7 devices in 2004, was agreed between Enterasys and CERN.
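The single-stream result can be put in context with a little arithmetic: 5.7 Gbps is 57% of the 10 GbE line rate, which is why parallel streams were needed to saturate the link. The sketch below redoes that calculation and, as a purely illustrative example, compares transfer times for a hypothetical 1 TB dataset (the dataset size is an assumption, not a figure from the report):

```python
# Utilisation arithmetic for the 10 Gbps Challenge.
# Rates are from the report; the 1 TB dataset is illustrative only.

LINE_RATE_GBPS = 10.0         # 10 GbE line rate
SINGLE_STREAM_GBPS = 5.7      # measured single-stream rate

utilisation = SINGLE_STREAM_GBPS / LINE_RATE_GBPS   # fraction of line rate

# Time to move an illustrative 1 TB (= 8000 Gbit) dataset:
dataset_gbit = 1e12 * 8 / 1e9
t_single = dataset_gbit / SINGLE_STREAM_GBPS        # seconds, single stream
t_line = dataset_gbit / LINE_RATE_GBPS              # seconds, at line rate

print(f"single stream uses {utilisation:.0%} of the 10 GbE line rate")
print(f"1 TB: {t_single/60:.1f} min single-stream vs {t_line/60:.1f} min at line rate")
```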
On the storage front, a 30 TB disk subsystem (6 metadata servers and 8 disk servers) was installed, using IBM’s StorageTank solution. Performance tests will be conducted in 2004.
Porting of physics applications (in collaboration with EP/SFT) and CERN systems to IPF continued in 2003, including CASTOR, CLHEP, GEANT4, and ROOT. Other groups also ported their applications, including AliRoot by the ALICE Collaboration and CMSIM by CMS US. Results of scalability tests with PROOF were reported at the CHEP2003 conference. As another example of collaboration with other groups, 20 of the IPF servers were used by ALICE for their 5th Data Challenge.
The gridification effort culminated in the porting of the LCG middleware (based on VDT and EDG). After some difficulties, the porting was almost complete at the end of the year. HP Labs’ SmartFrog monitoring system was evaluated; as the first results are promising, the effort will continue in 2004.
Dissemination and Development Activities
In addition to the thematic workshops organized in the framework of the technical programme, two papers
were published in the Proceedings of the CHEP2003 conference, one article was published in the CERN
Courier, and three joint press releases were issued. The openlab also hosted at CERN two meetings of the First
Tuesday Suisse Romande series, involving active participation of openlab partners.
Building on a pilot programme run in 2002, the CERN openlab student programme was organized in the summer of 2003, involving 11 students from seven European countries. Four of these students contributed directly to the
opencluster activity; the others worked on the ATHENA experiment and on the development of the Grid Café
Web site. The latter was successfully demonstrated at the Telecom2003 exhibition and at the SIS Forum, part
of the World Summit on the Information Society event.
Resources
The openlab integrates technical and managerial efforts from several IT groups: ADC (Technical
Management; opencluster via two fellows who joined in 2003; StorageTank); CS (10 GbE networking); DB
(Oracle 10g); DI (Project management, communication).
IT Division Involvement in the RSIS Conference and SIS-Forum
The IT Division was the originator of the RSIS (Role of Science in the Information Society) conference
held at CERN in December. The IT contribution included the responsibility for organizing the projects into
work packages, the creation of working Web sites, and the provision of computing facilities (Web Café) for
attendees during the conference. IT was also in charge of the ‘Enabling Technologies’ morning, a session of the conference programme.
In addition, IT was responsible for the design and implementation of the SIS forum, an exhibition
organized at Palexpo in the framework of the World Summit on the Information Society (WSIS). The
Programme committees and Organizing committee included members from four divisions (ETT, EP, IT, HR);
within IT, members of the CS, DB, DI, IS, and US groups contributed. After a Call for Content, 42 projects
from 32 organizations world-wide were selected and invited to present their activity on the stand. The
culminating event was the inauguration in the presence of Mr Kofi Annan, Secretary-General of the United
Nations. The overall WSIS exhibition received about 38 000 visits. The SIS-Forum site (cern.ch/sis-forum)
received more than half a million visits during the month following the event.
Computer Security
Security incidents increased by a factor of 5 during 2003, with the Blaster worm and its variants
accounting for the major increase. Emergency action to prevent Blaster from spreading allowed the majority
of the site to work normally, whilst significant disruption was reported elsewhere.
Significant resources were needed to secure the systems not centrally managed by IT services. This worm was a useful wake-up call about the need to keep systems up to date with security patches. Security holes are now exploited within days of their discovery, and worms can quickly follow, some spreading within seconds.
Break-ins were discovered on several systems, where intruders used more advanced techniques than previously seen. Recent incidents have highlighted the risk to the whole site from insecure systems, including those not directly visible to the Internet. Campaigns to improve the security of CERN systems, in particular through timely patching, have shown some improvements, but keeping systems secure still needs to become normal practice across the whole site.