A zombie is a running job that fails authentication when communicating with the ecflow_server. A zombie icon will be visible in ecflowview (see Figure 10-17). Zombies can be created for a wide variety of reasons. The most common causes are due to user action:
- The node tree is deleted, replaced or reloaded whilst jobs are running
- A task is rerun whilst in a submitted or active state
- A job is forced to a new state, e.g. complete
Rarer causes include:
- ecFlow script errors, where the script makes multiple calls to the init and complete child commands
- The child commands in the ecFlow script are placed in the background. In this case, the order in which the child commands contact the server may be indeterminate.
- Your queuing system might submit a job twice
- The system running your ecFlow server crashes and the recovered checkpoint file is out of date
- The server is heavily overloaded (typically when run on a virtual machine)
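To illustrate why a rerun produces a zombie, the following is a minimal Python sketch of the authentication check. It is a toy model, not the ecFlow server code or the ecFlow Python API: the `Task` class, the `is_zombie` function and the task path are hypothetical, and it assumes that each submission gives the task a fresh ECF_PASS which the job must present with every child command.

```python
from dataclasses import dataclass
import secrets


@dataclass
class Task:
    """Toy stand-in for a task node on the server (not the ecFlow API)."""
    path: str
    ecf_pass: str = ""
    state: str = "queued"

    def submit(self) -> str:
        # Assumption: every submission generates a fresh job password.
        self.ecf_pass = secrets.token_hex(4)
        self.state = "submitted"
        return self.ecf_pass


def is_zombie(task: Task, job_pass: str) -> bool:
    """A job whose password does not match the task's fails authentication."""
    return job_pass != task.ecf_pass


task = Task("/suite/family/t1")      # hypothetical node path
first_job_pass = task.submit()       # job 1 starts running
second_job_pass = task.submit()      # task is rerun while job 1 is still active

print(is_zombie(task, first_job_pass))   # True: job 1 carries a stale password
print(is_zombie(task, second_job_pass))  # False: job 2 matches the task
```

In this model both jobs keep running, but only the most recent one can authenticate, which is the situation the user-action causes above tend to produce.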
The default behaviour of the ecFlow server is to block the job. The child command keeps trying to contact the ecFlow server for a period of 24 hours. This period is configurable (see ECF_TIMEOUT on ecflow_client). Jobs can also be configured so that the child command fails immediately if the server denies the communication (see ECF_DENIED on ecflow_client). ecflowview provides a dialog that lists all the zombies and the actions that can be taken. These include:
- Terminate: The child command is asked to fail. Depending on your scripts, this may cause the abort child command to be called, which will in turn be flagged as a zombie.
- Fob: Allow the job to continue. The child command completes and hence no longer blocks the job. Great care should be taken when this action is chosen. If two jobs are running, they may cause data corruption. Issues can arise even with a single job. For example, if the associated command was an event child command, the event would not be set. If this event was used in a trigger expression, the expression would never evaluate to true.
- Delete: Remove the zombie from the server. The job will continue blocking, so the zombie will reappear when the child command next contacts the ecflow_server. This option can be used if the job has been killed manually.
- Rescue: Adopt the zombie and update the node tree. The ECF_PASS on the zombie is copied over to the task so that the next child command will continue as normal.
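The blocking behaviour and the actions listed above can be sketched from the point of view of a single child command. This is a simplified simulation under stated assumptions, not the ecflow_client implementation: the `Reply` values, the `run_child_command` function, the outcome strings and the 10 second retry interval are all hypothetical. Only the 24 hour default and the roles of ECF_TIMEOUT and ECF_DENIED come from the text.

```python
from enum import Enum
import itertools


class Reply(Enum):
    OK = "ok"        # server accepts the command (e.g. after Rescue)
    BLOCK = "block"  # default response to a zombie: keep the job waiting
    FOB = "fob"      # Fob: command returns without updating the node tree
    FAIL = "fail"    # Terminate: command is asked to fail


def run_child_command(replies, timeout_s=24 * 3600, retry_s=10, denied=False):
    """Simulate one child command; returns (outcome, seconds_waited).

    replies   -- iterable of Reply values, one per contact attempt
    timeout_s -- plays the role of ECF_TIMEOUT (default 24 hours)
    denied    -- plays the role of ECF_DENIED (fail as soon as blocked)
    retry_s   -- hypothetical interval between attempts
    """
    waited = 0
    for reply in replies:
        if reply is Reply.OK:
            return "accepted", waited
        if reply is Reply.FOB:
            return "fobbed", waited
        if reply is Reply.FAIL or denied:
            return "failed", waited
        # Blocked: advance a simulated clock instead of really sleeping.
        waited += retry_s
        if waited >= timeout_s:
            return "timed out", waited
    return "still blocking", waited


# No user action: the command blocks until the timeout expires.
print(run_child_command(itertools.repeat(Reply.BLOCK), timeout_s=60))  # ('timed out', 60)
# ECF_DENIED set: the command fails on the first blocked attempt.
print(run_child_command([Reply.BLOCK], denied=True))                   # ('failed', 0)
# Fob after two blocked attempts: the job carries on.
print(run_child_command([Reply.BLOCK, Reply.BLOCK, Reply.FOB]))        # ('fobbed', 20)
# Rescue after one blocked attempt: the next contact is accepted.
print(run_child_command([Reply.BLOCK, Reply.OK]))                      # ('accepted', 10)
# Delete alone does not unblock the job.
print(run_child_command([Reply.BLOCK, Reply.BLOCK]))                   # ('still blocking', 20)
```

The simulated clock keeps the example fast to run. A real child command would wait between attempts.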
Figure 10-17 Zombie icon
Figure 10-18 Zombie tab, available by right-clicking on the ecFlow server node