Taking ECMWF’s new high-performance computing facility into operation

Christine Kitchen

Christine Kitchen, ECMWF’s Deputy Director of Computing, has an important priority: taking the new high-performance computing facility (HPCF) in Bologna, Italy, into operation.

The new facility, comprising four Atos BullSequana XH2000 complexes, will replace ECMWF’s existing system, two Cray XC40 clusters based in Reading, UK. This new HPC service will deliver up to about five times the performance of the current system.

Christine has significant experience of commissioning new HPCFs and bringing them into operation. Her chemistry degree course at Sheffield University (UK) in the 1990s and her subsequent PhD in quantum chemistry introduced her to supercomputers. An interim job as a computing administrator at AstraZeneca Pharmaceuticals, which funded her PhD write-up, gave her a first taste of running research computing facilities.

She subsequently worked in the distributed computing group at Daresbury Laboratory, evaluating and benchmarking new technology. She used this information to advise UK universities on optimal HPCF solutions.

Christine went from a technical advisory role to establishing a brand-new centralised research computing team at Cardiff University. She stayed there from 2007 to 2021 and was promoted over time to Assistant Director for Research Computing at Cardiff University (ARCCA)/Supercomputing Wales.

“This change in jobs marked a transition from advising people on how to implement supercomputing services to actual responsibility for delivering to a wide community of researchers,” Christine says.

At Cardiff University, she procured and implemented three generations of supercomputing systems and put them at the disposal of researchers. “It was important to be visible and approachable, meeting and listening to the researchers to understand what they needed and ensure their codes were correctly implemented on the services.”

ECMWF supercomputer in Bologna

The Atos HPCF in ECMWF’s new data centre in Bologna, Italy.

Testing the new HPCF

Christine joined ECMWF at a time when most of the new HPCF had been installed in Bologna. She now needs to navigate the final stages of taking the supercomputers into operation.

“I’ve got similar challenges as in my previous roles in that there are a number of components to integrate to provide a service, but on a greater scale, with the additional consideration of time-critical dependencies for the weather forecast,” she says.

A number of tests were performed before Christine joined in January. These included the factory acceptance test stage, which was conducted before the system was installed in Bologna.

The four supercomputing complexes in Bologna are code-named AA, AB, AC and AD. The AA complex is already being used by ECMWF, with forecasting and research analyst teams having been given early access to migrate and validate workflows and pipelines.

“AB and AD were being used by Atos to complete the configuration and final system debugging prior to functional and operational readiness tests performed by ECMWF,” Christine explains. “The reason for doing this work is to remove any glitches in the system and to shake out any faults, hopefully resulting in a stable and performant service to our users.”

The Operational Test has two stages: a functional stage and an operational reliability stage. The functional tests comprise a benchmark suite of codes and workflows representative of the forecasts that will run on the system. They must demonstrate reproducible performance and complete within a specified time.

Once these are completed, a 30-day operational reliability test will start. It will demonstrate that the complexes meet the availability and reliability metrics required to support our commitments for the daily forecast delivery schedule.
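
As a rough illustration of what an availability metric over a 30-day window can look like, the sketch below computes the fraction of the window during which a system was up. The outage durations and the 99% target are invented for the example; the article does not state the metrics or thresholds ECMWF applies.

```python
from datetime import timedelta


def availability(window: timedelta, outages: list[timedelta]) -> float:
    """Return the fraction of the window during which the system was up."""
    downtime = sum(outages, timedelta())
    if downtime > window:
        raise ValueError("total downtime exceeds the test window")
    # Dividing one timedelta by another yields a plain float.
    return 1.0 - downtime / window


# Hypothetical figures, not taken from the article.
window = timedelta(days=30)
outages = [timedelta(hours=2), timedelta(minutes=90)]
TARGET = 0.99  # assumed threshold, for illustration only

score = availability(window, outages)
print(f"availability = {score:.4%}, meets target: {score >= TARGET}")
```

A real reliability test would also consider when an outage occurs, since downtime during a time-critical forecast run matters more than the same downtime in a quiet period.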

Diagram of four Atos HPCF complexes

The four Atos supercomputing complexes are linked to four routers and two separate storage networks. Three of the complexes are under final configuration validation stages with Atos prior to release to ECMWF.

The road to operations

The final stage will demonstrate the ‘operational readiness’ of the service. This lasts for three months, during which time ECMWF staff conduct pre-production validation checks on the new facility.

The operational readiness period ensures that our analysts and operators can acquire the necessary experience to manage the system and that the infrastructure demonstrates the necessary levels of stability to run a production service. It also enables ECMWF Member States to perform any migration activities prior to full-service transition from the current Cray HPCF.

Test data generated on the new Atos system will be disseminated to external users so that they can check compatibility with their workflows, and support will be provided to ensure a smooth service transition.

“We have to demonstrate that the system is reliable, robust and performant before allowing operational forecasts to begin on it,” Christine says.

“The ultimate goal is to deliver the quality of service our Member States expect from this investment. This will bring the consistency and reliability to produce the time-critical forecasts and provide a platform to support the increased ensemble resolutions of future release cycles. We can still do some fine-tuning over the next 12 months with Atos to continue to optimise the performance of the system, although this has to be carefully managed to ensure we do not disrupt the services.”

Operational use of the new HPCF is expected to start in autumn. The next forecasting system modelling upgrade will go ahead in 2023, with an increase in the horizontal resolution of ensemble forecasts from 18 km to 9 km. This step change in resolution will be possible due to the increased computing capabilities provided by the new HPCF.
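
To see why halving the grid spacing calls for a step change in computing power, consider the simple arithmetic sketched below. It assumes an idealised uniform grid, which is a simplification of any real model grid, and the function name is hypothetical.

```python
def horizontal_point_ratio(old_spacing_km: float, new_spacing_km: float) -> float:
    """Factor by which the number of horizontal grid points grows.

    Assumes a uniform grid, so the point count scales with the inverse
    square of the grid spacing.
    """
    return (old_spacing_km / new_spacing_km) ** 2


ratio = horizontal_point_ratio(18.0, 9.0)
print(ratio)  # 4.0
```

Going from 18 km to 9 km therefore means roughly four times as many horizontal grid points per ensemble member. A finer grid typically also requires a shorter model time step, so the overall cost tends to grow by more than this factor.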

The new facility has over 1,000,000 cores. Of its supercomputing capacity, 25% is dedicated to Member States, of which up to 10% is reserved for Special Projects. This will significantly increase the resources for these activities. In addition to the standard compute cores, one of the complexes has GPIL (general purpose and interactive login) nodes, which include a number of NVIDIA GPUs to support application development.
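
The allocation figures can be combined as in the short sketch below. It assumes that the 10% cap applies to the Member State share rather than to the facility as a whole, which is one reading of the sentence above.

```python
from fractions import Fraction

member_state_share = Fraction(25, 100)    # share of total capacity
special_projects_cap = Fraction(10, 100)  # assumed: a cap within the Member State share

# Upper bound on Special Projects as a share of total capacity.
special_of_total = member_state_share * special_projects_cap
print(f"{float(special_of_total):.1%}")  # 2.5%
```

Under that reading, Special Projects can use up to 2.5% of the total supercomputing capacity.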

Atos BullSequana XH2000 AMD compute blade

Atos BullSequana XH2000 AMD compute blade with three nodes.

Other areas of work

In addition to working on the new HPCF, Christine is involved in several other areas. These include the future development of the Regional Meteorological Data Communication Network (RMDCN) and considering the longer-term data storage strategy.

“There’s the short-term requirement for the Data Handling System (DHS) to be moved to Italy, but there are also questions about the vision for the next ten years that need addressing,” she says.

“I’m still relatively new in the job, so at this point I’m working out where I can genuinely add value and support the teams, escalate and prioritise actions within the department, and help people to solve issues they are facing.”