How Google Cloud Run Provides Built In Fault Tolerance for Highly Available Services

TechOps Examples

Hey — It's Govardhana MK 👋

Welcome to another technical edition.

Every Tuesday – You’ll receive a free edition with a byte-size use case, remote job opportunities, top news, tools, and articles.

Every Thursday and Saturday – You’ll receive a special edition with a deep dive use case, remote job opportunities and articles.

Deploying AI-generated code into Kubernetes environments is on the rise.

PerfectScale is conducting a practical, hands-on workshop using mirrord and Cursor.

As a DevOps engineer, you'll learn how to:

  • Catch integration issues earlier

  • Reduce reliance on slow CI pipelines

  • Test AI-generated changes without impacting production

IN TODAY'S EDITION

🧠 Use Case
  • How Google Cloud Run Provides Built In Fault Tolerance for Highly Available Services

👀 Remote Jobs

📚️ Resources


🛠️ TOOL OF THE DAY

Nelm - A Helm 4 alternative. It is a Kubernetes deployment tool that manages Helm Charts and deploys them to Kubernetes.

🧠 USE CASE

How Google Cloud Run Provides Built In Fault Tolerance for Highly Available Services

If you are new to fault tolerance, it simply means building systems that keep working even when something breaks. Servers can crash, networks can fail, and traffic can spike unexpectedly. A fault-tolerant system is designed so these problems do not stop users from accessing the service.

By default, a Cloud Run service is deployed to a single region, and Cloud Run already handles failures across zones inside that region. This protects you from instance-level and zone-level issues, but not from a full regional outage. To tolerate regional failures, you must design the architecture explicitly.

Architecting Cloud Run for Regional Fault Tolerance

Ref: Google Cloud

Cloud Run supports this natively through multi-regional deployments:

  • Deploy the same Cloud Run service to multiple regions using the same container image and configuration

  • Place a global external application load balancer in front

  • Configure one backend per region, each backed by a Serverless Network Endpoint Group (NEG)

  • Expose everything through one global external IP as the single entry point for users

In simple terms, users hit one global IP, and traffic is routed to Cloud Run services running in different regions.

A Serverless NEG is just the glue between the load balancer and Cloud Run. It tells the load balancer, “this backend is a Cloud Run service,” without exposing servers, instances, or pods.
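The wiring above can be sketched as a small Python helper that builds the `gcloud` commands for each region. Treat it as a rough outline, not a complete setup: the service name, image, region list, and resource names such as `hello-backend` and `hello-ip` are placeholders chosen for illustration, and the URL map, target proxy, and forwarding rule that a real load balancer also needs are left out.

```python
from typing import List


def build_multi_region_commands(
    service: str, image: str, regions: List[str]
) -> List[List[str]]:
    """Return gcloud commands for one global backend service with one
    serverless NEG per region. All resource names are hypothetical."""
    backend = f"{service}-backend"
    commands = [
        # One global external IP as the single entry point.
        ["gcloud", "compute", "addresses", "create", f"{service}-ip", "--global"],
        # One global backend service for the external Application Load Balancer.
        ["gcloud", "compute", "backend-services", "create", backend,
         "--global", "--load-balancing-scheme=EXTERNAL_MANAGED"],
    ]
    for region in regions:
        neg = f"{service}-neg-{region}"
        # Same image and service name deployed to every region.
        commands.append(["gcloud", "run", "deploy", service,
                         f"--image={image}", f"--region={region}"])
        # Serverless NEG: the glue between the load balancer and Cloud Run.
        commands.append(["gcloud", "compute", "network-endpoint-groups", "create", neg,
                         f"--region={region}",
                         "--network-endpoint-type=serverless",
                         f"--cloud-run-service={service}"])
        # Attach the regional NEG as a backend of the global backend service.
        commands.append(["gcloud", "compute", "backend-services", "add-backend", backend,
                         "--global",
                         f"--network-endpoint-group={neg}",
                         f"--network-endpoint-group-region={region}"])
    return commands


if __name__ == "__main__":
    # Placeholder inputs; swap in your own service, image, and regions.
    for cmd in build_multi_region_commands(
        "hello",
        "us-docker.pkg.dev/cloudrun/container/hello",
        ["us-central1", "europe-west1"],
    ):
        print(" ".join(cmd))
```

Running it only prints the commands, so you can review them before executing anything against a project.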

Without Cloud Run Service Health

When Cloud Run is deployed to multiple regions behind a global external Application Load Balancer, traffic is routed based on proximity, not actual service health.

Ref: Google Cloud

  • Users in Europe are sent to the Europe region

  • Users in the US are sent to the US region

  • The load balancer assumes the regional service is healthy as long as it exists

If the Cloud Run service in a region is partially failing or degraded, traffic still flows to it. The load balancer has no direct signal that the service itself is unhealthy.

Result:

  • Requests hit a failing regional service

  • Users see errors or high latency

  • A healthy region is available but unused

This setup gives regional routing, but not regional fault tolerance.
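To make the gap concrete, here is a toy Python model of proximity-only routing. It is not how the load balancer is implemented; the latency table, region names, and function names are made up for illustration. The point is only that the routing decision never looks at service health.

```python
from typing import Dict, Set

# Hypothetical latency (ms) from each user location to each region.
LATENCY_MS: Dict[str, Dict[str, int]] = {
    "europe": {"europe-west1": 20, "us-central1": 110},
    "us": {"europe-west1": 110, "us-central1": 25},
}


def route_by_proximity(user_location: str) -> str:
    """Pick the closest region. Health is never consulted."""
    latencies = LATENCY_MS[user_location]
    return min(latencies, key=latencies.get)


def handle_request(user_location: str, degraded_regions: Set[str]) -> str:
    """Simulate a request: it fails if the chosen region is degraded."""
    region = route_by_proximity(user_location)
    status = "error" if region in degraded_regions else "ok"
    return f"{region}:{status}"


if __name__ == "__main__":
    degraded = {"europe-west1"}
    print(handle_request("europe", degraded))  # europe-west1:error
    print(handle_request("us", degraded))      # us-central1:ok
```

European users keep landing on the degraded region even though `us-central1` is healthy, which is exactly the "healthy region is available but unused" outcome.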

With Cloud Run Service Health

Here the routing behavior changes.

Cloud Run actively reports regional service health to the global load balancer.

Ref: Google Cloud

  • When a regional Cloud Run service becomes unhealthy, it is marked as unavailable

  • The load balancer automatically stops sending traffic to that region

  • Requests are routed to another healthy region, even if it is farther away

From the user’s perspective, the service continues to work despite a regional failure. This turns multi-region Cloud Run from “closest-region routing” into automatic regional failover.

The system no longer depends only on location. It depends on real service health, which is what fault tolerance actually requires.
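The same idea as a toy Python model: routing still prefers the closest region, but only among regions currently reported healthy. This is a simplified sketch, not the load balancer's actual algorithm; the latency table, region names, and `route_with_health` are assumptions made for illustration.

```python
from typing import Dict, Optional, Set

# Hypothetical latency (ms) from each user location to each region.
LATENCY_MS: Dict[str, Dict[str, int]] = {
    "europe": {"europe-west1": 20, "us-central1": 110},
    "us": {"europe-west1": 110, "us-central1": 25},
}


def route_with_health(user_location: str, unhealthy: Set[str]) -> Optional[str]:
    """Pick the closest region that is not marked unhealthy.

    Returns None when every region is unhealthy.
    """
    candidates = {
        region: ms
        for region, ms in LATENCY_MS[user_location].items()
        if region not in unhealthy
    }
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    print(route_with_health("europe", set()))             # europe-west1
    print(route_with_health("europe", {"europe-west1"}))  # us-central1
```

When `europe-west1` is marked unhealthy, European traffic moves to `us-central1`: farther away, but working.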
