The Hidden Risk of Zombie Workflows in GitHub Actions

TechOps Examples

Hey — It's Govardhana MK 👋

A zombie workflow is a GitHub Actions run that is no longer useful but is still executing or retrying. Common patterns:

  • Old commits still running CI after a newer commit is already merged

  • Rerun loops triggered by flaky steps

  • Long-running jobs waiting on resources that will never arrive

  • Parallel matrix jobs continuing even after the result is irrelevant

[Figure: Sample Workflow Pattern]

I recently went through an interesting study by SonarSource. They started with 28,384 popular GitHub repositories and found that only 15,691 actually used GitHub Actions. After removing single-branch repos, 14,130 multi-branch repositories remained for analysis.

Across these repos, they scanned 7.7 million branches and discovered 442,321 unique workflow files, many of them duplicated across branches as historical snapshots. Filtering for workflows that use the pull_request_target trigger reduced this to 18,002 potentially attackable workflows.

Ref: sonarsource

Using a strict heuristic focused on secret usage and write permissions, the list shrank to 2,191 high-risk candidates, of which 188 workflows were confirmed vulnerable after manual review.

Ref: sonarsource

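Of those 188, 121 vulnerabilities still existed on default branches, leaving 67 true Zombie Workflows that lived only in non-default branches. These were found in well-known projects, which shows this is a real risk hiding in forgotten branches. Note that the study uses the term for stale workflow files on forgotten branches, while the rest of this edition applies it to stale workflow runs.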

Ref: sonarsource
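
For context, the kind of workflow the study filtered for looks roughly like the sketch below. This is a hypothetical file written for illustration, not one taken from the SonarSource data; the workflow name, script path, and secret name are all assumptions. The pull_request_target trigger runs in the context of the base repository, so checking out and executing code from the pull request head is what makes the combination attackable.

```yaml
# Hypothetical example of a risky workflow. Do not copy.
name: pr-checks
on:
  pull_request_target:        # runs with the base repository's secrets and token

permissions:
  contents: write             # write permission: one of the heuristic's signals

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          # Checks out untrusted code from the pull request head
          ref: ${{ github.event.pull_request.head.sha }}
      # Executes that untrusted code while a secret is in the environment
      - run: ./scripts/test.sh                  # hypothetical script path
        env:
          API_TOKEN: ${{ secrets.API_TOKEN }}   # hypothetical secret name
```

As I read the study's argument, a file like this can stay exploitable on any branch where it still exists, because a pull request can target that branch. Fixing the default branch alone may not be enough.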

How Zombie Workflows are Born

1. Push-triggered workflows without cancellation

on:
  push:
    branches:
      - main

Every push creates a new run. If you push 5 times in quick succession, one commit per push:

  • 5 workflows start

  • All of them run full CI

  • Only the last one matters

The first four are zombies the moment a newer commit exists.

2. Pull Request workflows with retries and flaky tests

A test flakes. GitHub Actions does not retry failed jobs on its own, so a retry step loops or the developer clicks “Re-run all jobs”, then pushes a follow-up commit. Now you have:

  • Old run retrying

  • New run triggered by updated commit

  • Both competing for runners

No guardrails. No cancellation.

3. Matrix jobs that don’t fail fast

strategy:
  fail-fast: false   # overrides the default (true), so sibling jobs keep running
  matrix:
    region: [us-east-1, eu-west-1, ap-south-1]

One region fails early. But the other regions keep running for 20 more minutes.

From a deployment perspective, the outcome is already decided. But the runners don’t know that. Zombie behavior.

How We Can Fix It

1. Use concurrency aggressively

This single block eliminates most zombies.

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

  • Groups runs by workflow + branch

  • Cancels older runs when a new one starts

  • Ensures only the latest commit matters

This is mandatory for CI workflows, Terraform plans, preview environments, and any push-triggered automation.
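
One possible refinement, sketched under the assumption that the default branch is named main: group pull request runs by PR number, and keep cancellation off for main so every commit that lands there still gets a complete run. Both fields accept expressions; the fallback to github.ref covers events that carry no pull request.

```yaml
concurrency:
  # PR runs are grouped per pull request; other events fall back to the ref
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  # Cancel superseded runs everywhere except the main branch (assumed name)
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
```

Whether main should be exempt is a judgment call. Teams that only care about the latest commit can keep cancel-in-progress: true everywhere.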

2. Split CI and CD workflows

One workflow for validation. Another for deployment. This prevents old CI runs from blocking production deploys.
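
A minimal sketch of the deployment half, assuming the validation workflow is named "CI" and the default branch is main. The file name, concurrency group name, and deploy script are hypothetical. The workflow_run trigger starts the deploy only after the CI run completes, and the deploy gets its own concurrency group that does not cancel a run already in progress.

```yaml
# .github/workflows/deploy.yml (hypothetical file name)
name: Deploy
on:
  workflow_run:
    workflows: ["CI"]         # assumes the validation workflow is named "CI"
    types: [completed]
    branches: [main]

concurrency:
  group: deploy-production    # hypothetical group name
  cancel-in-progress: false   # let a running deploy finish; a newer one waits

jobs:
  deploy:
    # Only deploy when the CI run succeeded
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    timeout-minutes: 20
    steps:
      - uses: actions/checkout@v4
        with:
          # Deploy exactly the commit that CI validated
          ref: ${{ github.event.workflow_run.head_sha }}
      - run: ./scripts/deploy.sh   # hypothetical deploy script
```

Keeping cancel-in-progress off here is deliberate: cancelling a half-finished deploy is usually worse than waiting for it.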

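3. Fail fast on matrix jobs

strategy:
  fail-fast: true

fail-fast already defaults to true, so the real fix is making sure nobody has set it to false without a reason. When one matrix job fails:

  • Others are cancelled

  • Runners are freed

  • Signal is immediate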
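
For reference, here is how the setting sits inside a complete job, reusing the region matrix from earlier. The job name and validation script are hypothetical. Since fail-fast defaults to true, writing it out mainly documents intent and guards against someone flipping it later.

```yaml
jobs:
  deploy-check:                 # hypothetical job name
    runs-on: ubuntu-latest
    strategy:
      fail-fast: true           # the default, written out to make intent explicit
      matrix:
        region: [us-east-1, eu-west-1, ap-south-1]
    steps:
      - uses: actions/checkout@v4
      # Hypothetical script; receives the region for this matrix leg
      - run: ./scripts/validate.sh "${{ matrix.region }}"
```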

4. Time box everything

jobs:
  build:
    timeout-minutes: 20

No job should run indefinitely; without an explicit timeout, the default limit is 360 minutes. If a job can’t finish in 20 minutes, it’s broken or waiting on something external. Either way, kill it.
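
Timeouts can also be set per step, which helps when a single step is the usual culprit. A small sketch, with a hypothetical script name and illustrative limits:

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    timeout-minutes: 20           # hard cap for the whole job
    steps:
      - uses: actions/checkout@v4
      - name: Integration tests
        run: ./scripts/integration.sh   # hypothetical script
        timeout-minutes: 10       # tighter cap for the step most likely to hang
```

The right numbers depend on your pipeline. A reasonable starting point is a small multiple of the job's typical duration.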
