Reliability-at-Scale: Implementing Error Handling and Retries in Your Actions

In a perfect world, every API call returns a 200 OK, networks never drop packets, and third-party services have 100% uptime. But as developers, we build for reality, not for perfection. The real world is messy. Transient network issues, temporary service outages, and rate limits are not edge cases; they are inevitable facts of building distributed systems.

Designing for the "happy path" is easy. The real challenge, and the mark of a truly robust system, is how gracefully your workflows handle failure. Unhandled errors can lead to incomplete processes, corrupted data, and frantic midnight pages.

This is why Actions.do was built with resilience at its core. It’s not enough to simply execute a task; you must be able to execute it reliably. Let's explore how to use the built-in reliability features of Actions.do to build workflow automation that can withstand the chaos of the real world.

The Problem: The Fragility of Simple Execution

Imagine a simple workflow: a new user signs up, and you need to enrich their profile with data from a third-party CRM. The Action might look something like this:

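import { Action } from 'actions.do';

// A simple, "happy path" action
// (crmApi and database stand for clients defined elsewhere in your app)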
const enrichUserData = new Action({
  id: 'enrich-user-from-crm',
  handler: async ({ userId }) => {
    const crmData = await crmApi.fetchUser(userId);
    await database.updateUser(userId, { crmData });
    return { success: true };
  }
});

This works perfectly... until it doesn't. What happens if:

The crmApi is down for 30 seconds for a rolling restart? The Action fails.
Your network has a momentary hiccup? The Action fails.
The CRM API temporarily rate-limits your request? The Action fails.

In all these cases, the workflow halts, the user's profile remains unenriched, and you might need manual intervention to fix it. This approach doesn't scale.

The Solution: Building Resilience into Every Action

The Actions.do philosophy is that resilience shouldn't be an afterthought—it should be a configurable property of each fundamental building block. By defining reliability rules at the individual Action level, your entire workflow inherits that strength.

Here are the core features you can leverage to transform fragile tasks into resilient, self-healing operations.

1. Smart Retries with Exponential Backoff

The first line of defense against transient failures is to simply try again. But retrying immediately and repeatedly can make a bad situation worse: when many clients hammer a struggling service at the same moment, they can keep it from recovering, a pattern often called the "thundering herd" problem. Smarter retries wait between attempts.

Actions.do allows you to define a sophisticated retry strategy directly in your Action's configuration.

What it is: Automatically re-executing a failed Action a configured number of times, with an increasing delay between each attempt (exponential backoff).
Why it's crucial: It gives a struggling downstream service or network time to recover before the next attempt, dramatically increasing the chances of eventual success without overwhelming the system.

import { Action } from 'actions.do';

const enrichUserData = new Action({
  id: 'enrich-user-from-crm-v2',
  retry: {
    attempts: 5,             // Try up to 5 times
    strategy: 'exponential', // Use exponential backoff
    initialDelay: '2s',      // Start with a 2-second delay
    maxDelay: '60s'          // Cap delays at 60 seconds
  },
  handler: async ({ userId }) => {
    // ... same handler logic
  }
});

With this configuration, a temporary API glitch is no longer a workflow-killing event. It's a minor hiccup that the Action handles automatically.
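To make the retry schedule concrete, here is a minimal sketch of exponential backoff in plain JavaScript. It is not the Actions.do implementation: `backoffDelayMs` and `retryWithBackoff` are hypothetical helpers, and the sketch assumes the delay doubles after every failed attempt and is capped at the maximum. The library's actual formula may differ.

```javascript
// Hypothetical helper: delay (in ms) before retry number `retryIndex` (0-based).
// Assumes the delay doubles each time and is capped at maxDelayMs.
function backoffDelayMs(retryIndex, initialDelayMs, maxDelayMs) {
  return Math.min(initialDelayMs * 2 ** retryIndex, maxDelayMs);
}

// Hypothetical helper: run `fn` up to `attempts` times, sleeping between failures.
// `sleep` is injectable so the loop can be exercised without real waiting.
async function retryWithBackoff(fn, options) {
  const { attempts, initialDelayMs, maxDelayMs } = options;
  const sleep =
    options.sleep ?? ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  let lastError;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn(attempt);
    } catch (error) {
      lastError = error;
      // No wait after the final attempt; fall through and rethrow below.
      if (attempt < attempts - 1) {
        await sleep(backoffDelayMs(attempt, initialDelayMs, maxDelayMs));
      }
    }
  }
  throw lastError;
}

// Waits between 5 attempts with a 2s start and a 60s cap.
const schedule = [0, 1, 2, 3].map((i) => backoffDelayMs(i, 2000, 60000));
console.log(schedule); // [ 2000, 4000, 8000, 16000 ]
```

Under these assumptions, five attempts never reach the 60-second cap; it only matters with more attempts or a larger initial delay. Many real-world implementations also add random jitter to each delay so that clients do not retry in lockstep.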

2. Defensive Timeouts

Some operations don't fail loudly; they just hang, stuck waiting for a response that will never come. A single stuck Action can stall an entire workflow, consuming resources and delaying critical processes.

What it is: A hard limit on how long an individual Action is allowed to run.
Why it's crucial: It ensures that your system can "fail fast." Instead of waiting indefinitely for a non-responsive service, the Action times out, allowing your retry and error-handling logic to take over.

import { Action } from 'actions.do';

const enrichUserData = new Action({
  id: 'enrich-user-from-crm-v3',
  retry: { /* ... */ },
  timeout: '45s', // Fail the action if it runs longer than 45 seconds
  handler: async ({ userId }) => {
    // ... same handler logic
  }
});

This timeout ensures that one bad call doesn't bring everything to a grinding halt.
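For intuition, the general "fail fast" pattern can be sketched in plain JavaScript by racing the work against a timer. This is an illustration only, not how Actions.do enforces its `timeout` setting; `withTimeout` is a hypothetical helper.

```javascript
// Hypothetical helper: reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards
  // so it cannot keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// A call that never resolves, standing in for a hung network request.
const hungCall = new Promise(() => {});

withTimeout(hungCall, 50).catch((error) => {
  console.error(error.message); // Timed out after 50ms
});
```

Note that `Promise.race` only stops waiting; it does not cancel the underlying operation. Truly aborting a request needs support from the client itself, for example an abort signal.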

3. Graceful Error Handling & Fallbacks

What happens after the final retry attempt fails? You still need a plan. Instead of letting the workflow crash, you can define a graceful fallback.

What it is: A designated plan for what to do when an Action has definitively failed.
Why it's crucial: It allows for graceful degradation. You can log the failure, send an alert to an operations channel, enqueue the task in a dead-letter queue for later inspection, or even return cached/default data to allow the workflow to continue in a limited state.

Actions.do enables you to invoke another Action as a failure handler.

import { Action } from 'actions.do';

// Define the fallback action first
const handleCrmFailure = new Action({
  id: 'notify-ops-on-crm-failure',
  handler: async ({ error, payload }) => {
    const message = `Failed to enrich user ${payload.userId}. Error: ${error.message}`;
    // This could be another Action that calls Slack, PagerDuty, etc.
    console.error(message); 
  }
});

// Now, define the main action with the fallback
const enrichUserData = new Action({
  id: 'enrich-user-from-crm-final',
  retry: { attempts: 5, strategy: 'exponential' },
  timeout: '45s',
  onError: {
    // On final failure, execute this other Action
    actionId: 'notify-ops-on-crm-failure'
  },
  handler: async ({ userId }) => {
    const crmData = await crmApi.fetchUser(userId);
    await database.updateUser(userId, { crmData });
    return { success: true };
  }
});

Now, if the enrich-user-from-crm-final Action ultimately fails after all its retries, it won't fail silently. It will trigger the notify-ops-on-crm-failure Action, creating an auditable and observable failure that your team can act on.
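The control flow behind a failure handler can be sketched in a few lines of plain JavaScript. This is a rough illustration, not the Actions.do runtime: `runWithFallback`, `enrich`, and `notifyOps` are hypothetical stand-ins, and the handler is assumed to receive the same `{ error, payload }` shape used in the fallback Action above.

```javascript
// Hypothetical helper: run `primary`; if it throws, hand the error and the
// original payload to `onError` instead of letting the failure propagate.
async function runWithFallback(primary, onError, payload) {
  try {
    return await primary(payload);
  } catch (error) {
    return onError({ error, payload });
  }
}

// Stand-in for an Action whose retries are already exhausted.
async function enrich() {
  throw new Error('CRM unavailable');
}

// Stand-in for the failure handler: report the failure, return a degraded result.
async function notifyOps({ error, payload }) {
  console.error(`Failed to enrich user ${payload.userId}. Error: ${error.message}`);
  return { success: false, reason: error.message };
}

runWithFallback(enrich, notifyOps, { userId: 'user-123' }).then((result) => {
  console.log(result); // { success: false, reason: 'CRM unavailable' }
});
```

Returning a degraded result from the handler, rather than rethrowing, is one way to let the surrounding workflow continue in a limited state. Rethrowing after logging is the alternative when the workflow should stop.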

From Fragile Task to Resilient Building Block

By combining these features, you transform a simple task into a robust, stand-alone building block for any automated workflow. You've defined not just what the Action should do on success, but exactly how it should behave in the face of adversity.

This is the power of "Actions as Code." It moves reliability from an application-level concern to a fundamental property of your business logic itself. When you build your workflows with Actions.do, you're not just automating tasks—you're engineering reliability, scalability, and peace of mind into every step of the process.

Ready to build workflows that don't break? Get started with Actions.do and build your first resilient Action today.
