Awesome Pipeline


A curated list of awesome pipeline toolkits inspired by Awesome Sysadmin. A version with metadata (language, stars and last activity) can be found here. Visit GitHub Pages for the HTML version, which lives in the gh-pages branch.

Pipeline frameworks & libraries

  • ActionChain - A workflow system for simple linear success/failure workflows.
  • Adage - Small package to describe workflows that are not completely known at definition time.
  • AiiDA - Workflow manager with a strong focus on provenance, performance and extensibility.
  • Airflow - Python-based workflow system created by Airbnb.
  • Anduril - Component-based workflow framework for scientific data analysis.
  • Antha - High-level language for biology.
  • AWE - Workflow and resource management system with CWL support
  • Balsam - Python-based high throughput task and workflow engine.
  • Bds - Scripting language for data pipelines.
  • BioMake - GNU-Make-like utility for managing builds and complex workflows.
  • BioQueue - Explicit framework with web monitoring and resource estimation.
  • Bioshake - Haskell DSL built on shake with strong typing and EDAM support
  • Bistro - Library to build and execute typed scientific workflows.
  • Bpipe - Tool for running and managing bioinformatics pipelines.
  • Briefly - Python Meta-programming Library for Job Flow Control.
  • Cluster Flow - Command-line tool which uses common cluster managers to run bioinformatics pipelines.
  • Clusterjob - Automated reproducibility, and hassle-free submission of computational jobs to clusters.
  • Compi - Application framework for portable computational pipelines.
  • Compss - Programming model for distributed infrastructures.
  • Conan2 - Light-weight workflow management application.
  • Consecution - A Python pipeline abstraction inspired by Apache Storm topologies.
  • Cosmos - Python library for massively parallel workflows.
  • Couler - Unified interface for constructing and managing workflows on different workflow engines, such as Argo Workflows, Tekton Pipelines, and Apache Airflow.
  • Cromwell - Workflow Management System geared towards scientific workflows from the Broad Institute.
  • Cuneiform - Advanced functional workflow language and framework, implemented in Erlang.
  • Cylc - A workflow engine for cycling systems, originally developed for operational environmental forecasting.
  • Dagobah - Simple DAG-based job scheduler in Python.
  • Dagr - A Scala-based DSL and framework for writing and executing bioinformatics pipelines as Directed Acyclic Graphs.
  • Dagster - Python-based API for defining DAGs that interfaces with popular workflow managers for building data applications.
  • DataJoint - An open-source relational framework for scientific data pipelines.
  • Dask - Dask is a flexible parallel computing library for analytics.
  • Dbt - Framework for writing analytics workflows entirely in SQL. Handles the T (transform) part of ETL, with a focus on analytics engineering.
  • Dockerflow - Workflow runner that uses Dataflow to run a series of tasks in Docker.
  • Doit - Task management & automation tool.
  • Drake - Robust DSL akin to Make, implemented in Clojure.
  • Drake R package - Reproducibility and high-performance computing with an easy R-focused interface. Unrelated to Factual’s Drake. Succeeded by Targets.
  • Dray - An engine for managing the execution of container-based workflows.
  • eHive - System for creating and running pipelines on a distributed compute resource.
  • Fission Workflows - A fast, lightweight workflow engine for serverless/FaaS functions.
  • Flex - Language agnostic framework for building flexible data science pipelines (Python/Shell/Gnuplot).
  • Flowr - Robust and efficient workflows using a simple language agnostic approach (R package).
  • Gc3pie - Python libraries and tools for running applications on diverse Grids and clusters.
  • Guix Workflow Language - A workflow management language extension for GNU Guix
  • Gwf - Make-like utility for submitting workflows via qsub.
  • HyperLoom - Platform for defining and executing workflow pipelines in large-scale distributed environments.
  • Joblib - Set of tools to provide lightweight pipelining in Python.
  • Jug - A task-based parallelization framework for Python.
  • Kedro - Workflow development tool that helps you build data pipelines.
  • Ketrew - Embedded DSL in the OCaml language alongside a client-server management application.
  • Kronos - Workflow assembler for cancer genome analytics and informatics.
  • Loom - Tool for running bioinformatics workflows locally or in the cloud.
  • Longbow - Job proxying tool for biomolecular simulations.
  • Luigi - Python module that helps you build complex pipelines of batch jobs.
  • Maestro - YAML-based HPC workflow execution tool.
  • Makeflow - Workflow engine for executing large complex workflows on clusters.
  • Mara - A lightweight, opinionated ETL framework, halfway between plain scripts and Apache Airflow
  • Mario - Scala library for defining data pipelines.
  • Martian - A language and framework for developing and executing complex computational pipelines.
  • MD Studio - Microservice-based workflow engine.
  • MetaFlow - Open-sourced framework from Netflix, for DAG generation for data scientists. Python and R APIs.
  • Mistral - Python-based workflow engine by the OpenStack project.
  • Moa - Lightweight workflows in bioinformatics.
  • Nextflow - Flow-based computational toolkit for reproducible and scalable bioinformatics pipelines.
  • NiPype - Workflows and interfaces for neuroimaging packages.
  • OpenGE - Accelerated framework for manipulating and interpreting high-throughput sequencing data.
  • Pachyderm - Distributed and reproducible data pipelining and data management, built on the container ecosystem.
  • Parsl - Parallel Scripting Library.
  • PipEngine - Ruby-based launcher for complex biological pipelines.
  • Pinball - Python-based workflow engine by Pinterest.
  • Popper - YAML-based container-native workflow engine supporting Docker, Singularity, Vagrant VMs with Docker daemon in VM, and local host.
  • Porcupine - Haskell workflow tool to express and compose tasks (optionally cached) whose datasources and sinks are known ahead of time and rebindable, and which can expose arbitrary sets of parameters to the outside world.
  • Prefect Core - Python-based workflow engine powering Prefect.
  • Pydra - Lightweight, DAG-based Python dataflow engine for reproducible and scalable scientific pipelines.
  • PyFlow - Lightweight parallel task engine.
  • PypeFlow - Lightweight workflow engine for data analysis scripting.
  • pyperator - Simple push-based Python workflow framework using asyncio, supporting recursive networks.
  • pyppl - A lightweight Python pipeline framework.
  • pypyr - Automation task-runner for sequential steps defined in a pipeline YAML, with AWS and Slack plug-ins.
  • Pwrake - Parallel workflow extension for Rake.
  • Qdo - Lightweight high-throughput queuing system for workflows with many small tasks to perform.
  • Qsubsec - Simple tokenised template system for SGE.
  • Rabix - Python-based workflow toolkit based on the Common Workflow Language and Docker.
  • Rain - Framework for large distributed task-based pipelines, written in Rust with Python API.
  • Ray - Flexible, high-performance distributed Python execution framework.
  • Reflow - Language and runtime for distributed, incremental data processing in the cloud.
  • Remake - Make-like declarative workflows in R.
  • Rmake - Wrapper for the creation of Makefiles, enabling massive parallelization.
  • Rubra - Pipeline system for bioinformatics workflows.
  • Ruffus - Computation Pipeline library for Python.
  • Ruigi - Pipeline tool for R, inspired by Luigi.
  • Sake - Self-documenting build automation tool.
  • SciLuigi - Helper library for writing flexible scientific workflows in Luigi.
  • SciPipe - Library for writing Scientific Workflows in Go.
  • Scoop - Scalable Concurrent Operations in Python.
  • Seqtools - Python library for lazy evaluation of pipelined transformations on indexable containers.
  • Snakemake - Tool for running and managing bioinformatics pipelines.
  • Spiff - Based on the Workflow Patterns initiative and implemented in Python.
  • Stolos - Directed Acyclic Graph task dependency scheduler that simplifies distributed pipelines.
  • Steppy - Lightweight, open-source, Python 3 library for fast and reproducible experimentation
  • StreamFlow - Container native workflow management system focused on hybrid workflows.
  • Suro - Java-based distributed pipeline from Netflix.
  • Swift - Fast easy parallel scripting
  • Targets - Dynamic, function-oriented Make-like reproducible pipelines at scale in R.
  • TaskGraph - A library to help manage complicated computational software pipelines consisting of long running individual tasks.
  • Tibanna - Tool that helps you run genomic pipelines on Amazon cloud.
  • Toil - Distributed pipeline workflow manager (mostly for genomics).
  • Yap - Extensible parallel framework, written in Python using OpenMPI libraries.
  • WorldMake - Easy Collaborative Reproducible Computing.
  • Zenaton - Workflow engine for orchestrating jobs, data and events across your applications and third party services

Workflow platforms

  • ActivePapers - Computational science made reproducible and publishable.
  • Apache Airavata - Framework for executing and managing computational workflows on distributed computing resources.
  • Arteria - Event-driven automation for sequencing centers. Initiates workflows based on events.
  • Arvados - A container-based workflow platform.
  • Biokepler - Bioinformatics Scientific Workflow for Distributed Analysis of Large-Scale Biological Data.
  • Butler - Framework for running scientific workflows on public and academic clouds.
  • Chipster - Open source platform for data analysis.
  • Clubber - Cluster Load Balancer for Bioinformatics e-Resources.
  • Digdag - Workflow manager designed for simplicity, extensibility and collaboration.
  • Fireworks - Centralized workflow server for dynamic workflows of high-throughput computations.
  • Flyte - Container-native, type-safe workflow and pipelines platform for large scale processing and ML.
  • Galaxy - Web-based platform for biomedical research.
  • Kepler - Kepler scientific workflow application from University of California.
  • KNIME Analytics Platform - General-purpose platform with many specialized domain extensions.
  • omegaml DataOps Platform - Data & model pipeline deployment for humans
  • OpenMOLE - Workflow Management System for exploration of models and parameter optimization.
  • Ophidia - Data-analytics platform with declarative workflows of distributed operations.
  • Orchest - An IDE for Data Science.
  • Pegasus - Workflow Management System.
  • Pentaho Kettle - Workflow platform with a graphical design environment.
  • Piper - Distributed workflow engine designed to be dead simple.
  • Polyaxon - A platform for machine learning experimentation workflow.
  • Reana - Platform for reusable research data analyses developed by CERN.
  • Sushi - Supporting User for SHell script Integration.
  • Yabi - Online research environment for grid, HPC and cloud computing.
  • Taverna - Domain independent workflow system.
  • Temporal - Highly scalable developer oriented Workflow as Code engine.
  • VisTrails - Scientific workflow and provenance management system.
  • Wings - Semantic workflow system utilizing Pegasus as execution system.
  • Watchdog - Workflow management system for the automated and distributed analysis of large-scale experimental data.

Workflow languages

Workflow standardization initiatives

ETL & Data orchestration

  • DVC - Data version control system for ML projects with lightweight pipeline support.
  • lakeFS - Repeatable, atomic and versioned data lake on top of object storage.

Literate programming (aka interactive notebooks)

  • Beaker - Notebook-style development environment.
  • Binder - Turn a GitHub repo into a collection of interactive notebooks powered by Jupyter and Kubernetes
  • IPython - A rich architecture for interactive computing.
  • Jupyter - Language-agnostic notebook literate programming environment.
  • Pathomx - Interactive data workflows built on Python.
  • Polynote - A better notebook for Scala (and more). Built by Netflix.
  • Ploomber - Consolidate your notebooks and scripts in a reproducible pipeline using a pipeline.yaml file
  • R Notebooks - R Markdown notebook literate programming environment.
  • RedPoint Notebooks - Web-native computational notebook for programmers supporting multiple languages, APIs and webhooks.
  • SoS - Readable, interactive, cross-platform and cross-language data science workflow system.
  • Zeppelin - Web-based notebook that enables interactive data analytics.

Extract, transform, load (ETL)

  • Cadence - Distributed, scalable, durable, and highly available orchestration engine developed by Uber.
  • Dataform - Dataform is a framework for managing SQL-based operations in your data warehouse.
  • LinkedPipes ETL - Linked Data publishing and consumption ETL tool.
  • Kiba ETL - A data processing & ETL framework for Ruby.

Continuous Delivery workflows

  • Argo - Get stuff done with container-native workflows for Kubernetes.
  • CDS - A pipeline-based Continuous Delivery Service written in Golang.
  • Deis - Workflow system to create and manage applications on Kubernetes.

Build automation tools

  • Bazel - Build software just as engineers do at Google.
  • DoIt - Highly generalized task-management and automation in Python.
  • Gradle - Unified cross-platform builds.
  • Scons - Python library focused on C/C++ builds.
  • Shake - Define robust build systems akin to GNU Make using Haskell.
  • Make - The GNU Make build system.
  • Prodmodel - Build system for data science pipelines.

Other projects

  • HPC Grid Runner
  • noWorkflow - Supporting infrastructure to run scientific experiments without a scientific workflow management system, and still get things like provenance.
  • Reprozip - Simplifies the process of creating reproducible experiments from command-line executions.