This is the User guide for the Atos Sequana XH2000 HPCF, installed in ECMWF's data centre in Bologna. This platform provides both the HPCF (AA, AB, AC, AD complexes) and ECGATE services (ECS), which in the past had been on separate platforms.

Introductory tutorial

If you are new to the Atos HPCF and ECS services, you may also be interested in following the Atos HPCF and ECS Introduction Tutorial to learn the basic aspects from a practical perspective.

Below you will find some basic information on the different parts of the system. Please click on the headers or links to get all the details for the given topic.

HPC2020: How to connect

From outside ECMWF, you may use Teleport through our gateway in Bologna, jump.ecmwf.int. Direct access through the ECaccess service is not available.

$> tsh login --proxy=jump.ecmwf.int
$> ssh -J user@jump.ecmwf.int user@hpc-login
# or for users with no formal access to HPC service:
$> ssh -J user@jump.ecmwf.int user@ecs-login

Atos HPCF: System overview

The Atos HPCF consists of four virtually identical complexes: AA, AB, AC and AD. In total, this HPCF features 8128 nodes:

  • 7680 compute nodes, for parallel jobs
  • 448 GPIL (General Purpose and Interactive Login) nodes, designed to take over the interactive and post-processing work previously done on older platforms such as the Cray HPCF, ECGATE and the Linux Clusters.

HPC2020: Shells

You will find a familiar environment, similar to other ECMWF platforms. Bash and Ksh are available as login shells, with Bash being the recommended option.

Note that CSH is not available. If you are still using it, please move to a supported shell.

Changing your shell

HPC2020: Filesystems

The filesystems available are HOME, PERM, HPCPERM and SCRATCH, and are completely isolated from those in other ECMWF platforms in Reading such as ECGATE or the Cray HPCF.

Filesystems from those platforms are not cross-mounted either. This means that if you need to use data from another ECMWF platform such as ECGATE or the Cray HPCF, you will need to transfer it first using scp or rsync. See HPC2020: File transfers for more information.
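
Each of these filesystems is normally reachable through an environment variable of the same name, so scripts need not hard-code paths. A quick way to check them in your session (assuming the variables are set by the login environment):

$> echo $HOME $PERM $HPCPERM $SCRATCH   # print the root of each filesystem
$> cd $SCRATCH                          # work from SCRATCH for large temporary data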

HPC2020: File transfers

For transfers to ECMWF, we recommend using rsync, which transfers the files over an SSH connection. For that, you will need to have Teleport configured with the appropriate settings in your ssh config file.

Any file transfer tool that supports SSH and the ProxyJump feature should work, such as the command line tools sftp or scp. Alternatively, you may also use the Linux Virtual Desktop and its folder sharing capabilities to copy local files to your HOME or PERM at ECMWF.
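
As an illustration, the sketch below assumes your local ~/.ssh/config defines the Teleport jump host and an hpc-login alias (user name and paths are placeholders):

# ~/.ssh/config (example entries)
Host jump.ecmwf.int
    User <your_uid>
Host hpc-login ecs-login
    User <your_uid>
    ProxyJump jump.ecmwf.int

# copy a local directory into your HOME on the Atos HPCF
$> rsync -av --progress ./mydata/ hpc-login:mydata/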

HPC2020: Software stack

See HPC2020: The Lmod Module system for a complete picture

A number of software packages, libraries, compilers and utilities are made available through HPC2020: The Lmod Module system.

If you want to use a specific software package, please check first whether it is already provided as a module. If a package or utility is not provided, you may install it yourself in your account or, alternatively, report it as a "Problem on computing" through the ECMWF Support Portal.
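
For example, a typical Lmod session might look like the following (module names are only examples; use module avail or module spider to see what is actually provided):

$> module avail            # list modules visible in the current environment
$> module spider netcdf    # search across all module trees
$> module load netcdf4     # load a package (example name)
$> module list             # show the modules currently loaded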

HPC2020: Batch system

  • QoS name: ng
  • Type: GPU
  • Suitable for: serial and small parallel jobs. It is the default
  • Shared nodes: Yes
  • Maximum jobs per user: -
  • Default / Max Wall Clock Limit: average runtime + standard deviation / 2 days
  • Default / Max CPUs: 1 / -
  • Default / Max Memory per node: 8 GB / 500 GB
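
A minimal sketch of a batch job script, hello-qos.sh, targeting the QoS listed above (the script name, resources and contents are illustrative only):

#!/bin/bash
#SBATCH --job-name=hello-qos
#SBATCH --qos=ng               # QoS name taken from the entry above; pick the one matching your workload
#SBATCH --time=00:10:00        # explicit wall clock limit
#SBATCH --ntasks=1
#SBATCH --mem=8G
#SBATCH --output=hello-qos.%j.out

echo "Running on $(hostname)"

$> sbatch hello-qos.sh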

HPC2020: Cron service

If you need to run a certain task automatically at regular intervals, you may use our cron service, available on the host hpc-cron for HPC users and ecs-cron for those with no access to the HPCF.

Use hpc-cron or ecs-cron

Do not run your crontabs on any host other than "hpc-cron" or "ecs-cron": crontabs installed elsewhere may disappear at any point after a reboot or maintenance session. The only guaranteed nodes are hpc-cron and ecs-cron.
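
A minimal sketch of installing a crontab on the dedicated node (the script path and schedule are invented for illustration):

$> ssh hpc-cron
$> crontab -e
# example entry: run a housekeeping script every day at 03:00,
# appending its output to a log file
0 3 * * * $HOME/bin/housekeeping.sh >> $HOME/housekeeping.log 2>&1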

HPC2020: Using ecFlow

If you wish to use ecFlow to run your workloads, ECMWF will provide you with a ready-to-go ecFlow server running on an independent virtual machine outside the HPCF. These servers take care of the orchestration of your workflow, while all tasks in your suites are actually submitted and run on the HPCF. With each machine dedicated to a single ecFlow server, there are no restrictions on CPU time and no possibility of interference with other users.
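
Once your server has been provisioned, you can check it from an Atos login node with the standard ecFlow client; the host name below is a placeholder and 3141 is only the usual default port, the actual values are communicated when your server is set up:

$> ecflow_client --ping --host=<your_ecflow_server> --port=3141    # check the server is reachable
$> ecflow_client --stats --host=<your_ecflow_server> --port=3141   # print basic server statistics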

HPC2020: Accounting

To ensure that computing resources are distributed equitably and to discourage irresponsible use, users' jobs on the Atos HPCF are charged for the resources they use against a project account. Each project account is allocated a number of System Billing Units (SBU) at the beginning of each accounting year (1 January to 31 December). You can monitor your usage in the HPC SBU accounting portal.
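
If you are a member of more than one project, the sketch below assumes the standard Slurm account option is used to choose which project account a job is charged to (the project name is a placeholder):

# in the job script
#SBATCH --account=<project>
# or at submission time
$> sbatch --account=<project> job.sh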

HPC2020: ECaccess

ECaccess in Bologna no longer offers interactive login access. Users should use either Teleport (Teleport SSH Access) or VDI (How to connect - Linux Virtual Desktop VDI) to access the Atos systems in Bologna.

The ECaccess web toolkit services, such as job submission (including Time-Critical Option 1 jobs, see below), file transfers and ectrans, have been set up on the Atos HPCF with the ECaccess gateway boaccess.ecmwf.int. Previously installed remote (at your site) ECaccess Toolkits should be able to interact with this new gateway. However, we recommend installing the latest version, available from Releases - Web Toolkit. To make an existing remote ECaccess Toolkit work with the ECaccess gateways in Bologna, you will need to define two environment variables so that it talks to the ECMWF ECaccess gateway in Bologna.

HPC2020: Time Critical option 1 activities

The ECaccess software includes the service of launching user jobs according to the dissemination schedule (Dissemination schedule) of ECMWF's real-time data and products. This service is also known as TC-1 service, or TC-1 jobs. For more information on TC-1, see Simple time-critical jobs.

End of computing services in Reading

HPC2020: Time Critical Option 2 setup

Under the Framework for time-critical applications, Member States can run ecFlow suites monitored by ECMWF. Known as option 2 within that framework, these suites benefit from a special technical setup to maximise robustness and high availability, similar to ECMWF's own operational production. When moving from a standard user account to a time-critical one (typically starting with a "z" followed by two or three characters), there are a number of things you must be aware of.

HPC2020: Missing features and known issues

If you find any problem or any feature missing that you think should be present, and it is not listed here, please let us know by reporting it as a "Problem on computing" through the ECMWF Support Portal, mentioning "Atos" in the summary.

The Atos HPCF is not an operational platform yet, and many features or elements may be added gradually as the complete setup is finalised. Here is a list of the known limitations, missing features and issues.

HPC2020: FAQs

Here are the most common pitfalls users face when working on our Atos HPCF.

News Feed

2023-11-22 Change of default versions of ECMWF software packages

When?

The changes will take place on Wednesday 22 November 2023 09:00 UTC

Do I need to do anything?

2023-05-31 Change of default versions of ECMWF and third-party software packages

When?

The changes will take place on Wednesday 31 May 2023 09:00 UTC

Do I need to do anything?

2023-03-27 Scratch automatic purge enabled

From 27 March 2023, the automatic purge of unused files in SCRATCH is enforced. Any files that have not been accessed in the previous 30 days will be automatically deleted. This purge will be conducted regularly, in order to keep the usage of this filesystem within optimal parameters.

SCRATCH is designed to hold large temporary files and to act as the main storage and working filesystem for the input and output files of your jobs and experiments, but not to keep data long term.

2023-01-18 Improving the time and memory limit management for batch jobs

Explicit time limit honoured

From 18 January 2023, ECMWF will enforce killing jobs that have reached their wall time, if #SBATCH --time or the command line option --time was provided with the job.

Alternatively, ECMWF accepts jobs without #SBATCH --time or the command line option --time; in that case, the average runtime of previous "similar" jobs is used instead, based on a job tag generated from the user, job name, job geometry and job output path.
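
For example, the limit can be given either in the job script or on the command line (the value shown is illustrative):

# in the job script
#SBATCH --time=02:00:00
# or at submission time
$> sbatch --time=02:00:00 myjob.sh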

2022-12-07 Important change in the new Slurm on the Atos HPC

Slurm on the Atos AD complex has been updated to version 22.05. Since then, AD has been the default cluster, with hpc-login and hpc-batch being aliases for nodes on the AD complex.

The same Slurm version 22.05 has also been installed on the AA and AB complexes, and will be installed on the AC complex.

2022-11-30 Unavailability of AA Atos cluster due to system update

AA, the default Atos cluster, will become unavailable at 08 UTC for essential Slurm and security updates. In preparation for this session:

  • The default Atos login/batch cluster has been changed to AD at 09 UTC.
  • Batch work on AA will be drained, and jobs scheduled to finish after 06 UTC will be automatically redirected to other complexes.

2022-08-10 SSH host keys fixed on all nodes and Cron service

On 2022-08-10 we set up the same SSH host key across the four complexes, using the ac-login node as the master. This change addresses the SSH errors about host key changes reported when connecting to a given host after an update, such as:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: POSSIBLE DNS SPOOFING DETECTED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
The ECDSA host key for hpc-login has changed,
and the key for the corresponding IP address 10.100.192.100
is unknown. This could either mean that
DNS SPOOFING is happening or the IP address for the host
and its host key have changed at the same time.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:QdNPyN2jAR5m7ngLbtIUjc2JgzknvFP2flMOGbd1i5k.
Please contact your system administrator.
Add correct host key in /home/user/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /home/user/.ssh/known_hosts:4
ECDSA host key for hpc-login has changed and you have requested strict checking.
Host key verification failed.
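
If the warning appears after such a legitimate host key change, removing the stale entries for the affected alias and IP address from your known_hosts is usually enough (host name and IP taken from the example above):

$> ssh-keygen -R hpc-login
$> ssh-keygen -R 10.100.192.100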

2022-07-20 Atos ecgate cluster in Bologna available to all ECMWF Member State ecgate users

We are pleased to announce the availability of the General purpose Atos computing service - named 'ecs' - in Bologna, which will replace the ecgate service in Reading.

We invite you to start testing on 'ecs' in Bologna the activities currently running on ecgate in Reading. To help you with this work, we have made the Atos HPCF and ecgate Documentation available. We strongly encourage you to read carefully through those pages before you start your tests on 'ecs'. Interactive login access to the systems in Bologna will no longer be through ECaccess, but through Teleport.
