
ecFlow 5 brings benefits to Member States

Avi Bahra, Iain Russell, Sándor Kertész

 

Managing workflows for large-scale data-intensive computational processes is an ever-growing challenge. These workflows must be repeatable, highly available, monitorable and accurate, while still allowing the flexibility to support changes. At ECMWF this challenge has been met with ecFlow, a workflow package developed in‑house to meet the ever-changing requirements of the Centre and its Member and Co‑operating States.

ecFlow was designed for general use but has been shaped by the operational and research needs of weather and climate science. At ECMWF, for example, it is used for research experiment runs, operational model runs, data post-processing and archiving, and software builds, among other purposes.

ecFlow enables users to run a large number of programs, with dependencies on each other and on time, in a controlled environment. It provides good tolerance for hardware and software failures and allows for controlled restarts. The server, client and graphical user interface (GUI) are highly scalable and can handle workflows with hundreds of thousands of tasks. ecFlow is open source and is written in C++ for optimum performance. It runs on UNIX platforms, with many years of use on Linux and, more recently, on macOS.

Version 5 of ecFlow brings many modernisations and improvements in terms of features, performance, security and maintainability.

ecFlow’s architecture

ecFlow has a client/server architecture (Figure 1). An ecFlow server is responsible for several suites, each a hierarchical collection of tasks. Complex suites can be defined using a Python API that guarantees their syntactic correctness (Figure 2). Simpler suites can be defined through plain text files. The server submits tasks to the machines where they will run and receives updates as they proceed.

Tasks can be defined in any scripting language, for example shell or Python. These scripts can be parametrised, meaning that the same script can be used for many different tasks with different settings. For example, a script variable ‘FORECAST_STEP’ could be set to 6 when run in one task and 12 in another. The scripts may also have embedded ecFlow commands that communicate their status back to the server, e.g. to show progress or to trigger another task to start. Sophisticated use of these embedded commands allows tasks to dynamically modify the server’s suites, facilitating an adaptive workflow without requiring manual loading of revised suites into the server.

ecFlow is not tied to any particular queueing system that may sit in front of the worker machines; its tasks can be submitted to any such system through a general submission script. ecFlow client applications include a graphical user interface, ecFlowUI, and a command-line program, ecflow_client, both of which can be used to query and modify the server.
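The idea of a parametrised script can be illustrated with a minimal sketch. ecFlow scripts conventionally reference variables as %NAME%; the function `substitute` and the template below are hypothetical stand-ins written for this illustration, not part of ecFlow, and the real pre-processing done by the server is considerably richer.

```python
import re
from typing import Mapping


def substitute(script: str, variables: Mapping[str, object], micro: str = "%") -> str:
    """Replace every %NAME% placeholder in `script` with its value.

    Hypothetical helper for illustration only; it is not ecFlow code.
    """
    pattern = re.compile(re.escape(micro) + r"(\w+)" + re.escape(micro))

    def lookup(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"undefined variable: {name}")
        return str(variables[name])

    return pattern.sub(lookup, script)


# One script template, reused by two tasks with different settings.
TEMPLATE = 'echo "processing forecast step %FORECAST_STEP%"\n'

step6 = substitute(TEMPLATE, {"FORECAST_STEP": 6})
step12 = substitute(TEMPLATE, {"FORECAST_STEP": 12})

print(step6, end="")
print(step12, end="")
```

The same template yields two different job scripts, which is the point of the FORECAST_STEP example above.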

FIGURE 1 Various clients (GUI, Bash, Python API) can communicate bi-directionally with an ecFlow server using the standard Transmission Control Protocol/Internet Protocol (TCP/IP) or Secure Sockets Layer (SSL) protocols. The server can run tasks directly or submit them to a queueing system; either way, they can still communicate back to the server. The server keeps track of events in a log file, providing the basis for statistical analyses of past events, such as the average duration of a given task. A checkpoint file is written to disk at regular intervals, providing a backup of the server’s internal state at that moment; this mechanism can also be used to provide continuity when upgrading a server to a newer version of ecFlow.
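As a rough illustration of the kind of log-based statistic mentioned in the caption, the sketch below computes the average duration of a task from a handful of log lines. The line layout used here (a timestamp in square brackets followed by a state and a node path) is a simplified assumption made for this example and may differ from what a real server writes; the function name `average_durations` is likewise hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed, simplified line layout: LOG:[HH:MM:SS D.M.YYYY] <state>: <path>
LINE = re.compile(
    r"^LOG:\[(?P<stamp>[^\]]+)\]\s+(?P<state>active|complete):\s*(?P<path>/\S+)"
)


def average_durations(lines):
    """Return {task path: mean seconds between 'active' and 'complete'}."""
    started = {}
    durations = defaultdict(list)
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed layout
        when = datetime.strptime(match["stamp"], "%H:%M:%S %d.%m.%Y")
        path = match["path"]
        if match["state"] == "active":
            started[path] = when
        elif path in started:
            elapsed = when - started.pop(path)
            durations[path].append(elapsed.total_seconds())
    return {path: sum(d) / len(d) for path, d in durations.items()}


# Invented sample data: the same task run twice on one day.
SAMPLE_LOG = """\
LOG:[06:00:00 1.3.2021] active: /demo/f1/t1
LOG:[06:02:00 1.3.2021] complete: /demo/f1/t1
LOG:[12:00:00 1.3.2021] active: /demo/f1/t1
LOG:[12:04:00 1.3.2021] complete: /demo/f1/t1
""".splitlines()

averages = average_durations(SAMPLE_LOG)
print(averages)
```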

FIGURE 2 A simple example of a Python script that creates a new suite consisting of a family of two tasks, the second of which will be run as soon as the first has completed. The suite is then loaded onto a server using default settings.
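The structure described in the caption can also be written in ecFlow's plain-text definition format. The sketch below uses only the Python standard library to emit such a definition; it does not use the ecFlow Python API, and the function name and the node names `test`, `f1`, `t1` and `t2` are illustrative assumptions.

```python
def two_task_suite(suite="test", family="f1", first="t1", second="t2"):
    """Return a plain-text suite definition with one family of two tasks.

    The second task carries a trigger so that it runs only once the
    first task has completed.
    """
    lines = [
        f"suite {suite}",
        f"  family {family}",
        f"    task {first}",
        f"    task {second}",
        f"      trigger {first} == complete",
        "  endfamily",
        "endsuite",
    ]
    return "\n".join(lines) + "\n"


definition = two_task_suite()
print(definition, end="")
```

Saved to a file, a definition like this would normally be loaded onto a running server with the ecflow_client program or through the Python API; the exact invocation depends on the installation.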

Graphical user interface

ecFlowUI is the graphical user interface to ecFlow (Figure 3). It is written with the C++ Qt library. ecFlowUI supports real-time monitoring of the workflow, allowing jobs to be started, suspended and terminated. Many aspects can be edited on the fly, including the job scripts themselves and their associated variables. Live and historical job output can be viewed with an efficient built-in viewer that can handle output files of arbitrary size. Dependencies between nodes can be visualised in graphical form, and a built-in log analyser can aid in fine-tuning the workflow.

ecFlowUI can monitor several ecFlow servers at once, with facilities to display only those suites or tasks of interest. It can also be used to move a set of tasks from one server to another.

FIGURE 3 ecFlowUI provides a rich environment for viewing and interacting with suites, including a new Trigger Graph view showing dependencies between items in the suite.

ecFlow version 5

One limiting factor of ecFlow 4 was that its client/server communication was sensitive to the version of the Boost library that it was linked with. This meant that a single client could not necessarily communicate with all the running servers if they had been built with different versions of Boost. The technology also limited the ability to make even small changes to the communication protocol, which is sometimes necessary in order to allow new features. ecFlow 5 now uses the JSON format for communication, so clients and servers are free to use different versions of Boost. This change also allows new features to be added without breaking compatibility with older servers or clients. With further improvements to the communication, ecFlowUI can now communicate with servers using fewer network requests, meaning less network traffic. An internal improvement is that ecFlow 5 uses features from the C++14 standard, simplifying some code and providing performance benefits.
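Why a self-describing text format such as JSON makes it easier to add features without breaking older programs can be seen in a small sketch. The message fields below (`command`, `path`, `reason`) are invented for this illustration and are not ecFlow's actual wire format.

```python
import json


def encode_request(command, path, **extra):
    """Newer sender: may attach extra fields to the basic request."""
    return json.dumps({"command": command, "path": path, **extra})


def decode_request(text):
    """Older receiver: reads only the fields it knows about.

    Unknown keys are simply ignored, so a message from a newer sender
    remains readable.
    """
    message = json.loads(text)
    return message["command"], message["path"]


# A newer client adds a field that the older receiver has never heard of.
newer = encode_request("suspend", "/demo/f1/t1", reason="maintenance")
command, path = decode_request(newer)
print(command, path)
```

Because each field is named in the message itself, the reader does not depend on the binary layout or library version used by the writer.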

ecFlow 5 has a number of additional new features requested both by ECMWF users and by Member and Co-operating States. These include:

  • Improved security features, such as integrated SSL and password-based access; ecFlowUI can now view both SSL and non-SSL based servers in the same session.
  • ecFlowUI now has an interactive trigger graph view to show the interdependency of nodes and attributes.
  • Servers now support auto-archive and auto-restore, allowing parts of a suite to be dynamically written to disk when complete and restored later on. This aids the handling of extremely large suites.
  • Improved features to help users diagnose problems, for example when a worker machine goes down or a running job becomes detached from the server’s records.
  • Additional controls to limit the number of submitted or active tasks.
  • Various smaller features to help refine the suite definitions.

ecFlow’s stability has been validated by daily operational use at ECMWF. In addition, an extensive set of tests is run every night to ensure that no regressions creep into its releases. Given its maturity and proven fitness for purpose, future work on ecFlow will emphasise continued maintenance and stability rather than large new developments.

Migrating to ecFlow 5

Many operational servers at ECMWF have already been migrated to ecFlow version 5. Once ECMWF’s computing centre has moved to Bologna, only version 5 will be available. Fortunately, migration from ecFlow 4 to 5 is straightforward and mostly involves stopping the currently running server and then starting it up again using ecFlow 5. The migration page provides more details (https://confluence.ecmwf.int/display/ECFLOW/Migration+to+ecflow+5). It is important to note that only an ecFlowUI from version 5 can be used with a version 5 server due to the change in communication protocol. Also noteworthy is that although current versions of ecFlow are built with Python 2 and 3 support, once operational in Bologna only Python 3 will be supported. It is therefore advisable to ensure that any suites are migrated as soon as possible in order to avoid any last-minute problems.

Availability

ecFlow is installed on all of ECMWF’s computing platforms, including the Member and Co-operating State server ecgate. If you plan to run an operational ecFlow server at ECMWF, please contact User Services, who will be glad to guide you on the best way to set it up. Two versions of ecFlow 5 are currently installed: a default version and a newer one. To load one of them, use the corresponding command:

module load ecflow/5

module load ecflow/5new

For use outside ECMWF’s computing platforms, ecFlow is also available as a binary package for conda through the conda-forge channel. It can be installed with this command:

conda install ecflow -c conda-forge

The source is also available on GitHub (https://github.com/ecmwf/ecflow) or as a tarball from the ecFlow Confluence pages (https://confluence.ecmwf.int/display/ECFLOW/).