Exponential Backoff with Jitter — Part 2: The Implementation

In Part 1 we covered the theory: why naive retries cause a thundering herd, and how timeouts, exponential backoff, and jitter solve it. Here we build the full implementation.

Prerequisites

Node.js 18+ (native fetch support)
Basic familiarity with async/await and REST APIs
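Understanding of HTTP status codes
The dotenv package (the complete implementation imports dotenv/config so that an optional API_TOKEN can be read from the environment)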

Step 1 — Constants and configuration

const BASE_DELAY_MS    = 1_000;
const MAX_DELAY_MS     = 8_000;
const MAX_ATTEMPTS     = 5;
const REQUEST_TIMEOUT  = 10_000;

// Only these status codes are worth retrying.
// 4xx client errors (except 408, 429) won't fix themselves on retry.
const RETRIABLE_CODES  = new Set([408, 429, 500, 502, 503, 504]);

MAX_ATTEMPTS, not MAX_RETRIES: naming matters. MAX_ATTEMPTS = 5 means 5 total tries (the first attempt plus 4 retries), whereas MAX_RETRIES = 5 would mean 6 total tries. Counting attempts makes that off-by-one mistake much harder to make.
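To make the counting concrete, here is a minimal sketch. The helper countCalls and the two fake operations are hypothetical names used only for illustration; they are not part of the implementation in this article.

```javascript
const MAX_ATTEMPTS = 5;

// Hypothetical helper: runs `operation` up to MAX_ATTEMPTS times in total
// and reports how many calls were actually made.
const countCalls = (operation) => {
  let calls = 0;
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    calls++;
    if (operation(attempt)) break; // stop at the first success
  }
  return calls;
};

const alwaysFails = () => false;
const succeedsOnThirdTry = (attempt) => attempt === 2; // attempts are zero-based

console.log(countCalls(alwaysFails));        // 5 (1 first try + 4 retries)
console.log(countCalls(succeedsOnThirdTry)); // 3
```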

new Set() for RETRIABLE_CODES: Set.has() is O(1) versus O(n) for Array.includes(). The difference is negligible for 6 elements, but a Set communicates intent better: this is a membership check, not a sequence.

Step 2 — Sleep helper

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

Step 3 — Full jitter delay

const getBackoffDelay = (attempt) => {
  const exponential = BASE_DELAY_MS * Math.pow(2, attempt);
  const capped      = Math.min(MAX_DELAY_MS, exponential);
  return Math.random() * capped;
};
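It can help to see the window each attempt draws from. The sketch below repeats the same formula but makes the random source injectable, which is an assumption added here so the upper bounds can be printed deterministically. Attempts are zero-based, matching the loop in the complete implementation.

```javascript
const BASE_DELAY_MS = 1_000;
const MAX_DELAY_MS  = 8_000;

// Same formula as getBackoffDelay; the `random` parameter is an addition
// for illustration and defaults to Math.random.
const getBackoffDelay = (attempt, random = Math.random) => {
  const capped = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt);
  return random() * capped;
};

// Forcing random() to 1 exposes the upper bound of each attempt's window.
// Math.random() itself returns values in [0, 1), so real delays stay below it.
const upperBounds = [0, 1, 2, 3, 4].map((attempt) => getBackoffDelay(attempt, () => 1));
console.log(upperBounds); // [ 1000, 2000, 4000, 8000, 8000 ]
```

With these constants the cap takes effect from the fourth attempt onward, so later retries never wait longer than 8 seconds.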

Step 4 — Request timeout with AbortController

const fetchWithTimeout = (url, options = {}) => {
  const controller = new AbortController();
  const timeoutId  = setTimeout(() => controller.abort(), REQUEST_TIMEOUT);

  return fetch(url, { ...options, signal: controller.signal })
    .finally(() => clearTimeout(timeoutId));
};

Always call clearTimeout in finally — if the request succeeds quickly, you don’t want a ghost timeout firing later.
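One caveat: the promise returned by fetch settles as soon as the response headers arrive, so the timer above is cleared before the body is read with response.json(). If a slow body should count against the timeout as well, one option is to keep the timer armed until the whole task settles. The sketch below is a hypothetical variant, not the article's implementation; withTimeout and neverResponds are illustrative names, and the fake task stands in for a real request so the example runs without a network.

```javascript
// Hypothetical variant: the timer stays armed until `task` settles, so a task
// that awaits fetch(...) and then response.json() is covered end to end.
const withTimeout = async (task, timeoutMs) => {
  const controller = new AbortController();
  const timeoutId  = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await task(controller.signal);
  } finally {
    clearTimeout(timeoutId);
  }
};

// Stand-in for a request that hangs: it only settles when the signal aborts.
const neverResponds = (signal) =>
  new Promise((_, reject) => {
    signal.addEventListener('abort', () => reject(signal.reason), { once: true });
  });

withTimeout(neverResponds, 50).catch((error) => {
  console.log(error.name); // AbortError
});

// Usage with a real request (not executed here):
// const data = await withTimeout(async (signal) => {
//   const response = await fetch(url, { signal });
//   return response.json();
// }, 10_000);
```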

Step 5 — Respecting Retry-After headers

Many APIs tell you exactly how long to wait. Ignoring this header is wasteful and disrespectful to the API:

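const getDelay = (attempt, response) => {
  // Optional chaining tolerates a missing response or missing headers.
  const retryAfter = response?.headers?.get('Retry-After');

  if (retryAfter) {
    // Form 1: a delay in seconds, e.g. "120".
    const parsed = parseInt(retryAfter, 10);
    if (!isNaN(parsed)) return parsed * 1_000;
    // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    const date = new Date(retryAfter);
    if (!isNaN(date)) return Math.max(0, date - Date.now());
  }

  // No usable header: fall back to full-jitter backoff.
  return getBackoffDelay(attempt);
};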
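The getDelay helper trusts whatever the server sends, so a very large Retry-After value would stall the caller for just as long. A stricter parser might look like the sketch below, under stated assumptions: parseRetryAfter and MAX_RETRY_AFTER_MS are hypothetical names, the 60-second ceiling is arbitrary, and the global Headers class available in Node.js 18+ stands in for a real response. It also treats only all-digit values as seconds, because parseInt would accept a string that merely starts with digits.

```javascript
const MAX_RETRY_AFTER_MS = 60_000; // hypothetical ceiling, not from the article

// Returns the server-requested wait in ms, or null when the header is absent
// or unparseable. `now` is injectable so the date form can be checked.
const parseRetryAfter = (headers, now = Date.now()) => {
  const value = headers.get('Retry-After');
  if (value === null) return null;

  // Form 1: delay in whole seconds, e.g. "Retry-After: 120".
  if (/^\d+$/.test(value.trim())) {
    return Math.min(MAX_RETRY_AFTER_MS, Number(value) * 1_000);
  }

  // Form 2: HTTP date, e.g. "Retry-After: Wed, 21 Oct 2015 07:28:00 GMT".
  const timestamp = Date.parse(value);
  if (Number.isNaN(timestamp)) return null;
  return Math.min(MAX_RETRY_AFTER_MS, Math.max(0, timestamp - now));
};

const now = Date.parse('Wed, 21 Oct 2015 07:27:30 GMT'); // fixed clock for the example
console.log(parseRetryAfter(new Headers({ 'Retry-After': '5' })));    // 5000
console.log(parseRetryAfter(new Headers({ 'Retry-After': 'Wed, 21 Oct 2015 07:28:00 GMT' }), now)); // 30000
console.log(parseRetryAfter(new Headers({ 'Retry-After': '3600' }))); // 60000 (clamped)
console.log(parseRetryAfter(new Headers()));                          // null
```

A caller would fall back to getBackoffDelay(attempt) whenever this returns null.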

Step 6 — Structured logging

const log = (level, message, meta = {}) => {
  console.log(JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    ...meta,
  }));
};

JSON-structured logs can be parsed by most observability platforms. Avoid console.log('retrying...') in production: free text is hard to query.

Log retriable errors at WARN, not ERROR: a retriable error that eventually succeeds is not an error. At ERROR level it is noise that fires your alerts on transient blips.
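As a rough illustration of the two levels, the sketch below builds the same JSON shape as log but returns the line instead of printing it. The name formatLog, the injectable clock, and the sample field values are assumptions added here so the output is reproducible.

```javascript
// Builds the JSON log line. `now` is injectable so the example is deterministic.
const formatLog = (level, message, meta = {}, now = new Date()) =>
  JSON.stringify({ timestamp: now.toISOString(), level, message, ...meta });

const at = new Date('2024-01-01T00:00:00.000Z'); // fixed clock for the example

// A failure that may still recover on the next attempt: WARN.
const warnLine = formatLog('WARN', 'Retriable error, backing off',
  { status: 503, attempt: 1, retryInMs: 412 }, at);

// Every attempt is used up: this is the event that deserves ERROR.
const errorLine = formatLog('ERROR', 'All attempts failed',
  { status: 503, attempts: 5 }, at);

console.log(warnLine);
console.log(errorLine);
```

Because every field is a JSON key, a log query can filter on level and status directly rather than pattern-matching free text.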

Complete implementation

import 'dotenv/config';

const BASE_DELAY_MS   = 1_000;
const MAX_DELAY_MS    = 8_000;
const MAX_ATTEMPTS    = 5;
const REQUEST_TIMEOUT = 10_000;
const RETRIABLE_CODES = new Set([408, 429, 500, 502, 503, 504]);
const BASE_URL        = 'https://rickandmortyapi.com/api/character';

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const log = (level, message, meta = {}) =>
  console.log(JSON.stringify({ timestamp: new Date().toISOString(), level, message, ...meta }));

const getBackoffDelay = (attempt) => {
  const cap = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * Math.pow(2, attempt));
  return Math.random() * cap;
};

const getDelay = (attempt, response) => {
  const retryAfter = response?.headers?.get('Retry-After');
  if (retryAfter) {
    const seconds = parseInt(retryAfter, 10);
    if (!isNaN(seconds)) return seconds * 1_000;
    const date = new Date(retryAfter);
    if (!isNaN(date)) return Math.max(0, date - Date.now());
  }
  return getBackoffDelay(attempt);
};

const fetchWithTimeout = (url, options = {}) => {
  const controller = new AbortController();
  const id         = setTimeout(() => controller.abort(), REQUEST_TIMEOUT);
  return fetch(url, { ...options, signal: controller.signal }).finally(() => clearTimeout(id));
};

const fetchWithRetry = async (url, options = {}) => {
  const headers = {
    'Content-Type': 'application/json',
    ...(process.env.API_TOKEN && { Authorization: `Bearer ${process.env.API_TOKEN}` }),
    ...options.headers,
  };

  let attempt = 0;

  while (attempt < MAX_ATTEMPTS) {
    const start = Date.now();
    let response;

    try {
      response = await fetchWithTimeout(url, { ...options, headers });
    } catch (networkError) {
      const isTimeout = networkError.name === 'AbortError';
      if (attempt + 1 >= MAX_ATTEMPTS) throw networkError;

      const delay = getBackoffDelay(attempt);
      log('WARN', isTimeout ? 'Request timed out, retrying' : 'Network error, retrying', {
        error: networkError.message, attempt: attempt + 1,
        durationMs: Date.now() - start, retryInMs: Math.round(delay),
      });
      await sleep(delay);
      attempt++;
      continue;
    }

    if (response.ok) return response.json();
    if ([401, 403].includes(response.status)) throw new Error(`Auth error ${response.status}`);
    if (response.status === 404) throw new Error(`Not found: ${url}`);
    if (!RETRIABLE_CODES.has(response.status)) throw new Error(`Non-retriable HTTP ${response.status}`);

    if (attempt + 1 >= MAX_ATTEMPTS) {
      throw new Error(`All ${MAX_ATTEMPTS} attempts failed (last status: ${response.status})`);
    }

    const delay = getDelay(attempt, response);
    log('WARN', 'Retriable error, backing off', {
      status: response.status, attempt: attempt + 1,
      durationMs: Date.now() - start, retryInMs: Math.round(delay),
    });
    await sleep(delay);
    attempt++;
  }
};
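One way to check the loop's behaviour without touching the network is to inject the fetch and sleep functions. The sketch below is a simplified stand-in for fetchWithRetry, not a replacement for it: retryingFetch, makeFlakyFetch, the injected doFetch and wait parameters, and the example.test URL are all hypothetical, and timeouts, headers, and logging are left out.

```javascript
const MAX_ATTEMPTS    = 5;
const RETRIABLE_CODES = new Set([408, 429, 500, 502, 503, 504]);

// Simplified stand-in for fetchWithRetry: `doFetch` and `wait` are injected
// (an assumption for testability) so no network or real sleeping is needed.
const retryingFetch = async (url, { doFetch, wait }) => {
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    const response = await doFetch(url);
    if (response.ok) return response.json();
    if (!RETRIABLE_CODES.has(response.status)) {
      throw new Error(`Non-retriable HTTP ${response.status}`);
    }
    if (attempt + 1 < MAX_ATTEMPTS) await wait(attempt); // no sleep after the last try
  }
  throw new Error(`All ${MAX_ATTEMPTS} attempts failed`);
};

// Fake server: answers with `failStatus` for the first `failures` calls, then 200.
const makeFlakyFetch = (failures, failStatus = 503) => {
  let calls = 0;
  const doFetch = async () => {
    calls++;
    return calls <= failures
      ? { ok: false, status: failStatus }
      : { ok: true, status: 200, json: async () => ({ calls }) };
  };
  return { doFetch, getCalls: () => calls };
};

const flaky = makeFlakyFetch(2);
retryingFetch('https://example.test/items', { doFetch: flaky.doFetch, wait: async () => {} })
  .then((data) => console.log(data)); // { calls: 3 }
```

The same fakes can confirm that five consecutive 503 responses produce exactly five calls and that a 404 fails after one.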

Paginated fetching — retries at one layer only

const fetchAllPages = async (baseUrl) => {
  const results = [];
  let page = 1;

  while (true) {
    const data = await fetchWithRetry(`${baseUrl}?page=${page}`);
    if (!data?.results?.length) break;

    results.push(...data.results);
    log('INFO', `Page ${page} complete`, { count: data.results.length, total: results.length });

    if (!data.info?.next) break;
    page++;
  }

  return results;
};

fetchAllPages has no retry logic of its own — all retries happen inside fetchWithRetry. This is intentional. If both layers retry with 5 attempts each, a single failing page fires 25 requests. In a microservices stack with 3 layers independently retrying: 5³ = 125 requests for what the user sees as one page load.
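The arithmetic generalises to attempts raised to the number of retrying layers. A small sketch, with worstCaseRequests as a hypothetical helper name, assuming every layer exhausts all of its attempts:

```javascript
// Worst-case number of requests reaching the innermost service when every
// layer independently makes `attemptsPerLayer` total attempts.
const worstCaseRequests = (attemptsPerLayer, layers) => attemptsPerLayer ** layers;

console.log(worstCaseRequests(5, 1)); // 5
console.log(worstCaseRequests(5, 2)); // 25
console.log(worstCaseRequests(5, 3)); // 125
```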

The rule: retry at one point in the call stack.

Summary

Five things that separate production-grade retry logic from naive retries:

Time out every request: use AbortController, never let fetch hang
Cap your backoff: Math.min(MAX_DELAY_MS, ...) prevents absurd wait times
Add full jitter: Math.random() * capped prevents thundering herds
Respect Retry-After: the API knows better than you how long to wait
Retry at one layer only: don’t multiply retries across your call stack