Problem
When tasks are submitted in ecFlow, they quickly go into the aborted (red) state, reporting an error with job submission. No job output files are generated, so it is difficult to assess what has gone wrong.
Solution
There are several reasons why the job submission could fail.
Check your ECF_JOB_CMD and related variables
If you use the provided troika utility for ecFlow job management, make sure it is called with the right path, options, and configuration file. For a typical troika user, the settings could look like:
edit ECF_JOB_CMD troika submit -o %ECF_JOBOUT% %SCHOST% %ECF_JOB%
edit ECF_KILL_CMD troika kill %SCHOST% %ECF_JOB%
edit ECF_STATUS_CMD troika monitor %SCHOST% %ECF_JOB%
You may also use the provided %TROIKA% and %TROIKA_CONFIG% ecFlow variables if you are likely to customise your settings later, as described in the documentation.
Have you set up your SSH key authentication properly?
Another common cause of failure is that SSH key authentication has not been set up. If you cannot ssh between HPCF hosts without typing your password (e.g. from aa-login to ab-login), follow the instructions on the HPC2020: How to connect page to generate your key pair and add your public key to your ~/.ssh/authorized_keys file.
SSH is used for communication between the ecFlow server VM and the HPC nodes. Therefore, password-less authentication must be set up so that your ecFlow server can connect to the HPCF and submit jobs.
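A quick way to check whether password-less login works is to ask ssh to connect without ever prompting. The sketch below is only an illustration: the function name check_passwordless_ssh is hypothetical, and the 10 second timeout is an arbitrary choice. It relies on the standard OpenSSH options BatchMode and ConnectTimeout.

```shell
# Hypothetical helper (not part of ecFlow or troika): succeeds only if
# ssh can log in to the given host without any interactive prompt.
check_passwordless_ssh() {
    host="$1"
    # BatchMode=yes makes ssh fail instead of asking for a password or passphrase.
    # ConnectTimeout=10 (an assumed value) avoids hanging on an unreachable host.
    if ssh -o BatchMode=yes -o ConnectTimeout=10 "$host" true; then
        echo "OK: password-less SSH to $host works"
    else
        echo "FAILED: cannot log in to $host without a prompt"
        return 1
    fi
}
```

For example, running check_passwordless_ssh ab-login from aa-login tests the connection mentioned above. A failure may also mean that the host is unreachable or that its host key has not been accepted yet, so it is worth trying a plain ssh to the same host to see the actual message.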
Still not working?
If you have checked all of the above and submissions are still failing, add some logging to your submission to get a better understanding of what is going wrong. If you use troika, you can use the utility's log file option. In the example below, troika writes useful information into the file troika.log in the %ECF_HOME% directory:
edit ECF_JOB_CMD troika -l %ECF_HOME%/troika.log submit -o %ECF_JOBOUT% %SCHOST% %ECF_JOB%
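Once a task has been resubmitted with logging enabled, the end of the log file is usually the place to look. As a minimal sketch, the hypothetical helper show_troika_log below prints the last lines of a log file given its path; the default of 50 lines is an arbitrary choice, and you need to pass the real path that %ECF_HOME%/troika.log resolves to on your ecFlow server.

```shell
# Hypothetical helper (not part of troika): print the last lines of a log file.
# Usage: show_troika_log /path/to/troika.log [number_of_lines]
show_troika_log() {
    log_file="$1"
    lines="${2:-50}"   # assumed default: last 50 lines
    # -s is true only if the file exists and is not empty.
    if [ -s "$log_file" ]; then
        tail -n "$lines" "$log_file"
    else
        echo "No log entries found in $log_file"
        return 1
    fi
}
```

If the file is missing or empty after a failed submission, that suggests the submission command itself never ran, which points back to the ECF_JOB_CMD settings rather than to troika or SSH.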