Error Recovery Strategies

This guide covers how to configure and use error recovery strategies to improve agent reliability and success rates through automatic retry, task simplification, fallback approaches, and delegation.

Table of Contents

  1. Overview

  2. Recovery Strategies

  3. Basic Recovery Configuration

  4. Strategy Chains

  5. Custom Recovery Logic

  6. Error Classification

  7. Best Practices

Overview

Error recovery strategies provide:

  • Automatic Retry: Retry failed tasks with exponential backoff

  • Task Simplification: Break down complex tasks into simpler ones

  • Fallback Approaches: Use alternative methods when the primary method fails

  • Delegation: Delegate tasks to other capable agents

  • Error Classification: Classify errors for appropriate recovery

Available Strategies

  1. RETRY: Retry with exponential backoff (for transient errors)

  2. SIMPLIFY: Simplify task and retry (break down complex tasks)

  3. FALLBACK: Use fallback approach or alternative method

  4. DELEGATE: Delegate to another capable agent

  5. ABORT: Abort execution and return error (terminal strategy)

Recovery Strategies

Strategy 1: RETRY

Retry failed tasks with exponential backoff.

Use When:

  • Transient errors (network, timeout, rate limits)

  • Temporary failures

  • Errors likely to succeed on retry

Example:

from aiecs.domain.agent import HybridAgent, AgentConfiguration
from aiecs.domain.agent.models import RecoveryStrategy
from aiecs.llm import OpenAIClient

agent = HybridAgent(
    agent_id="agent-1",
    name="My Agent",
    llm_client=OpenAIClient(),
    tools=["search"],
    config=AgentConfiguration(),
    recovery_strategies=[RecoveryStrategy.RETRY]
)

await agent.initialize()

# Execute with retry recovery
result = await agent.execute_with_recovery(
    task={"description": "Search for Python"},
    context={},
    strategies=[RecoveryStrategy.RETRY]
)
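To make "exponential backoff" concrete, the following is a minimal standalone sketch of how a backoff schedule can be computed. The function name `backoff_delays` and the `base`, `cap`, and `jitter` parameters are assumptions chosen for illustration; they are not taken from the aiecs API, and the library's actual delays may differ.

```python
import random


def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 30.0,
                   jitter: bool = False) -> list[float]:
    """Return the delay in seconds to wait before each retry.

    Illustrative only: the names and defaults here are assumptions,
    not values used by aiecs.
    """
    delays = []
    for attempt in range(max_retries):
        # Double the delay on every attempt, but never exceed the cap.
        delay = min(cap, base * (2 ** attempt))
        if jitter:
            # "Full jitter": pick a random delay between 0 and the computed value
            # so that many clients do not retry at the same instant.
            delay = random.uniform(0, delay)
        delays.append(delay)
    return delays


print(backoff_delays(5))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

Capping the delay keeps a long outage from producing very long waits, and jitter is commonly added when many callers may hit the same rate-limited service.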

Strategy 2: SIMPLIFY

Simplify complex tasks and retry.

Use When:

  • Task is too complex

  • Breaking down helps

  • Simpler version likely to succeed

Example:

agent = HybridAgent(
    agent_id="agent-1",
    llm_client=llm_client,
    tools=["search"],
    config=config,
    recovery_strategies=[RecoveryStrategy.SIMPLIFY]
)

# Execute with simplification recovery
result = await agent.execute_with_recovery(
    task={"description": "Complex multi-step task"},
    context={},
    strategies=[RecoveryStrategy.SIMPLIFY]
)
# Task simplified and retried automatically

Strategy 3: FALLBACK

Use a fallback approach when the primary approach fails.

Use When:

  • Alternative approach available

  • Primary method failed

  • Fallback method acceptable

Example:

agent = HybridAgent(
    agent_id="agent-1",
    llm_client=llm_client,
    tools=["search", "fallback_search"],
    config=config,
    recovery_strategies=[RecoveryStrategy.FALLBACK]
)

# Execute with fallback recovery
result = await agent.execute_with_recovery(
    task={"description": "Search task"},
    context={},
    strategies=[RecoveryStrategy.FALLBACK]
)
# Falls back to alternative tool if primary fails

Strategy 4: DELEGATE

Delegate task to another capable agent.

Use When:

  • Other agents available

  • Current agent lacks capability

  • Delegation appropriate

Example:

# Create agent registry
agent_registry = {
    "specialist-agent": specialist_agent
}

agent = HybridAgent(
    agent_id="agent-1",
    llm_client=llm_client,
    tools=["search"],
    config=config,
    collaboration_enabled=True,
    agent_registry=agent_registry,
    recovery_strategies=[RecoveryStrategy.DELEGATE]
)

# Execute with delegation recovery
result = await agent.execute_with_recovery(
    task={"description": "Specialized task"},
    context={},
    strategies=[RecoveryStrategy.DELEGATE]
)
# Delegated to specialist agent if current agent fails

Strategy 5: ABORT

Abort execution and return error.

Use When:

  • All recovery attempts exhausted

  • Error is terminal

  • No further recovery possible

Example:

agent = HybridAgent(
    agent_id="agent-1",
    llm_client=llm_client,
    tools=["search"],
    config=config,
    recovery_strategies=[RecoveryStrategy.ABORT]
)

# Execute with abort recovery
try:
    result = await agent.execute_with_recovery(
        task={"description": "Task"},
        context={},
        strategies=[RecoveryStrategy.ABORT]
    )
except Exception as e:
    # Abort strategy returns error immediately
    print(f"Task aborted: {e}")

Basic Recovery Configuration

Pattern 1: Single Strategy

Use a single recovery strategy.

agent = HybridAgent(
    agent_id="agent-1",
    llm_client=llm_client,
    tools=["search"],
    config=config,
    recovery_strategies=[RecoveryStrategy.RETRY]
)

# Execute with retry
result = await agent.execute_with_recovery(
    task=task,
    context=context,
    strategies=[RecoveryStrategy.RETRY]
)

Pattern 2: Multiple Strategies

Use multiple recovery strategies.

agent = HybridAgent(
    agent_id="agent-1",
    llm_client=llm_client,
    tools=["search"],
    config=config,
    recovery_strategies=[
        RecoveryStrategy.RETRY,
        RecoveryStrategy.SIMPLIFY,
        RecoveryStrategy.FALLBACK
    ]
)

# Execute with multiple strategies (tried in order)
result = await agent.execute_with_recovery(
    task=task,
    context=context,
    strategies=[
        RecoveryStrategy.RETRY,
        RecoveryStrategy.SIMPLIFY,
        RecoveryStrategy.FALLBACK
    ]
)

Pattern 3: Default Strategies

Use default recovery strategies from agent configuration.

agent = HybridAgent(
    agent_id="agent-1",
    llm_client=llm_client,
    tools=["search"],
    config=config,
    recovery_strategies=[
        RecoveryStrategy.RETRY,
        RecoveryStrategy.SIMPLIFY,
        RecoveryStrategy.FALLBACK
    ]
)

# Uses default strategies from agent configuration
result = await agent.execute_with_recovery(task, context)

Strategy Chains

Pattern 1: Full Recovery Chain

Use the complete recovery chain, ordered from cheapest to most drastic.

strategies = [
    RecoveryStrategy.RETRY,      # Try retry first
    RecoveryStrategy.SIMPLIFY,    # Then simplify
    RecoveryStrategy.FALLBACK,    # Then fallback
    RecoveryStrategy.DELEGATE,    # Then delegate
    RecoveryStrategy.ABORT        # Finally abort
]

result = await agent.execute_with_recovery(
    task=task,
    context=context,
    strategies=strategies
)
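The "tried in order" behaviour of a chain can be sketched without the library. The example below is a simplified model, not the aiecs implementation: `Strategy` is a stand-in for `RecoveryStrategy`, and `run_chain`, `always_fails`, and `shorten` are hypothetical names introduced only for this illustration.

```python
import asyncio
from enum import Enum


class Strategy(Enum):
    """Stand-in for RecoveryStrategy (assumption, not the aiecs enum)."""
    RETRY = "retry"
    SIMPLIFY = "simplify"
    FALLBACK = "fallback"
    ABORT = "abort"


async def run_chain(task, strategies, handlers):
    """Try each strategy's handler in order and return the first success."""
    last_error = None
    for strategy in strategies:
        if strategy is Strategy.ABORT:
            break  # terminal strategy: stop trying and report the failure
        try:
            return strategy, await handlers[strategy](task)
        except Exception as exc:
            last_error = exc  # remember the error and move to the next strategy
    raise RuntimeError(f"recovery chain exhausted for task: {task!r}") from last_error


async def always_fails(task):
    # Hypothetical handler that models a retry which keeps timing out.
    raise TimeoutError("still unavailable")


async def shorten(task):
    # Hypothetical handler that models a simplified task succeeding.
    return f"done: {task[:6]}"


handlers = {Strategy.RETRY: always_fails, Strategy.SIMPLIFY: shorten}
used, result = asyncio.run(
    run_chain("search and summarize",
              [Strategy.RETRY, Strategy.SIMPLIFY, Strategy.ABORT],
              handlers)
)
print(used.name, result)  # SIMPLIFY done: search
```

In this model a later strategy runs only after every earlier one has failed, which is why cheaper strategies are placed first.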

Pattern 2: Conservative Chain

Use a conservative recovery chain that never delegates to another agent.

strategies = [
    RecoveryStrategy.RETRY,
    RecoveryStrategy.SIMPLIFY,
    RecoveryStrategy.FALLBACK
    # No delegation - keep within current agent
]

result = await agent.execute_with_recovery(
    task=task,
    context=context,
    strategies=strategies
)

Pattern 3: Quick Fail Chain

Use a quick-fail chain that aborts as soon as the retry fails.

strategies = [
    RecoveryStrategy.RETRY,
    RecoveryStrategy.ABORT  # Abort after retry
]

result = await agent.execute_with_recovery(
    task=task,
    context=context,
    strategies=strategies
)

Custom Recovery Logic

Pattern 1: Custom Retry Logic

Implement custom retry logic.

import asyncio

class CustomAgent(HybridAgent):
    async def _execute_with_retry(self, func, *args, **kwargs):
        """Custom retry logic"""
        max_retries = 5
        for attempt in range(max_retries):
            try:
                return await func(*args, **kwargs)
            except Exception:
                # Out of retries: re-raise the last error
                if attempt == max_retries - 1:
                    raise
                # Exponential backoff: wait 1, 2, 4, then 8 seconds
                await asyncio.sleep(2 ** attempt)

Pattern 2: Custom Simplification

Implement custom task simplification.

class CustomAgent(HybridAgent):
    async def _simplify_task(self, task):
        """Custom task simplification"""
        description = task.get("description", "")
        
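        # Break down a compound task joined by " and "
        # (a bare "and" check would also match words such as "sandbox")
        if " and " in description:
            # Keep only the first part; later parts are dropped
            parts = description.split(" and ")
            return {
                "description": parts[0],  # First part only
                "simplified": True
            }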
        
        return task
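The splitting step can be tried on its own, outside any agent class. The helper below is a hypothetical standalone function (`first_clause` is not part of aiecs) that uses a regular expression so the conjunction is matched as a whole word and in any letter case.

```python
import re


def first_clause(description: str) -> str:
    """Return the text before the first standalone 'and' (any letter case).

    Hypothetical helper for illustration; it is not an aiecs function.
    """
    # \s+and\s+ requires whitespace on both sides, so "sandbox" or "Expand"
    # never count as a conjunction.
    parts = re.split(r"\s+and\s+", description, maxsplit=1, flags=re.IGNORECASE)
    return parts[0].strip()


print(first_clause("Search for Python AND summarize the results"))  # Search for Python
print(first_clause("Expand the sandbox"))  # Expand the sandbox
```

Keeping only the first clause is a deliberately crude simplification; a real agent might instead queue the remaining clauses as follow-up tasks.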

Pattern 3: Custom Fallback

Implement custom fallback logic.

class CustomAgent(HybridAgent):
    async def _execute_with_fallback(self, task, context):
        """Custom fallback logic"""
        try:
            # Try primary approach
            return await self.execute_task(task, context)
        except Exception:
            # Use fallback tool
            fallback_task = {
                **task,
                "tool": "fallback_search"  # Use fallback tool
            }
            return await self.execute_task(fallback_task, context)

Error Classification

Pattern 1: Error-Based Strategy Selection

Select strategy based on error type.

try:
    result = await agent.execute_task(task, context)
except TimeoutError:
    # Use retry for timeout
    result = await agent.execute_with_recovery(
        task, context,
        strategies=[RecoveryStrategy.RETRY]
    )
except ValueError:
    # Use simplify for validation errors
    result = await agent.execute_with_recovery(
        task, context,
        strategies=[RecoveryStrategy.SIMPLIFY]
    )
except Exception as e:
    # Use full chain for unknown errors
    result = await agent.execute_with_recovery(
        task, context,
        strategies=[
            RecoveryStrategy.RETRY,
            RecoveryStrategy.SIMPLIFY,
            RecoveryStrategy.FALLBACK
        ]
    )
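When the list of error types grows, the same selection can be expressed as a lookup table instead of a chain of `except` clauses. This is a sketch under assumptions: `Strategy` stands in for `RecoveryStrategy`, and `STRATEGY_RULES`, `DEFAULT_CHAIN`, and `select_strategies` are hypothetical names, not part of aiecs.

```python
from enum import Enum


class Strategy(Enum):
    """Stand-in for RecoveryStrategy (assumption, not the aiecs enum)."""
    RETRY = "retry"
    SIMPLIFY = "simplify"
    FALLBACK = "fallback"


# Ordered rules: the first matching entry wins, so list specific types first.
STRATEGY_RULES = [
    ((TimeoutError, ConnectionError), [Strategy.RETRY]),   # transient errors
    ((ValueError,), [Strategy.SIMPLIFY]),                  # validation errors
]

# Used when no rule matches the error.
DEFAULT_CHAIN = [Strategy.RETRY, Strategy.SIMPLIFY, Strategy.FALLBACK]


def select_strategies(error: BaseException) -> list:
    """Return the recovery chain to use for the given error."""
    for exc_types, chain in STRATEGY_RULES:
        if isinstance(error, exc_types):
            return list(chain)  # copy so callers cannot mutate the rule table
    return list(DEFAULT_CHAIN)


print(select_strategies(ConnectionResetError()))  # [<Strategy.RETRY: 'retry'>]
```

Because `isinstance` respects inheritance, subclasses such as `ConnectionResetError` are classified with their parent type without needing their own rule.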

Pattern 2: Automatic Error Classification

Agent automatically classifies errors.

# Agent automatically classifies errors and selects appropriate strategy
result = await agent.execute_with_recovery(
    task=task,
    context=context,
    strategies=[
        RecoveryStrategy.RETRY,      # For transient errors
        RecoveryStrategy.SIMPLIFY,    # For complex tasks
        RecoveryStrategy.FALLBACK     # For other errors
    ]
)

Best Practices

1. Use Appropriate Strategy Order

Order strategies from least to most expensive:

strategies = [
    RecoveryStrategy.RETRY,      # Cheapest - just retry
    RecoveryStrategy.SIMPLIFY,   # Moderate - simplify task
    RecoveryStrategy.FALLBACK,   # Moderate - use alternative
    RecoveryStrategy.DELEGATE,   # Expensive - delegate to another agent
    RecoveryStrategy.ABORT       # Terminal - give up
]

2. Configure Based on Error Types

Configure strategies based on expected error types:

# For network-heavy tasks
strategies = [
    RecoveryStrategy.RETRY,  # Retry network errors
    RecoveryStrategy.FALLBACK  # Use alternative endpoint
]

# For complex tasks
strategies = [
    RecoveryStrategy.SIMPLIFY,  # Break down complex tasks
    RecoveryStrategy.DELEGATE   # Delegate to specialist
]

3. Limit Recovery Attempts

Limit total recovery attempts:

# Limit to 3 total attempts
strategies = [
    RecoveryStrategy.RETRY,      # 1 attempt
    RecoveryStrategy.SIMPLIFY,   # 1 attempt
    RecoveryStrategy.FALLBACK,   # 1 attempt
    RecoveryStrategy.ABORT      # Give up
]
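If custom recovery code makes its own attempts, one way to enforce a total limit is a shared counter that every attempt must draw from. The sketch below is illustrative only: `AttemptBudget` and `BudgetExhausted` are hypothetical names and are not provided by aiecs.

```python
class BudgetExhausted(RuntimeError):
    """Raised when no recovery attempts remain."""


class AttemptBudget:
    """Shared counter that caps the total number of recovery attempts.

    Hypothetical helper for illustration; it is not an aiecs class.
    """

    def __init__(self, limit: int) -> None:
        if limit < 0:
            raise ValueError("limit must be non-negative")
        self.limit = limit
        self.used = 0

    def spend(self) -> int:
        """Consume one attempt and return how many remain."""
        if self.used >= self.limit:
            raise BudgetExhausted(f"attempt limit of {self.limit} reached")
        self.used += 1
        return self.limit - self.used


budget = AttemptBudget(limit=3)
remaining = [budget.spend() for _ in range(3)]
print(remaining)  # [2, 1, 0]
```

A fourth call to `budget.spend()` raises `BudgetExhausted`, which custom recovery code can treat the same way as reaching the terminal strategy.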

4. Monitor Recovery Success

Monitor recovery success rates:

recovery_attempts = 0
recovery_successes = 0

try:
    result = await agent.execute_task(task, context)
except Exception:
    recovery_attempts += 1
    try:
        result = await agent.execute_with_recovery(
            task, context,
            strategies=[RecoveryStrategy.RETRY]
        )
        recovery_successes += 1
    except Exception:
        # Recovery failed too: count the attempt but not a success
        result = None

success_rate = recovery_successes / recovery_attempts if recovery_attempts > 0 else 0
print(f"Recovery success rate: {success_rate:.1%}")
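For more than a single call, the two counters can be wrapped in a small object that is updated after every recovery attempt. This is a minimal sketch; `RecoveryStats` is a hypothetical helper and not part of aiecs.

```python
from dataclasses import dataclass


@dataclass
class RecoveryStats:
    """Counts recovery attempts and successes (illustrative helper)."""
    attempts: int = 0
    successes: int = 0

    def record(self, succeeded: bool) -> None:
        """Record the outcome of one recovery attempt."""
        self.attempts += 1
        if succeeded:
            self.successes += 1

    @property
    def success_rate(self) -> float:
        # Avoid dividing by zero before any recovery has been attempted.
        return self.successes / self.attempts if self.attempts else 0.0


stats = RecoveryStats()
for outcome in (True, False, True, True):
    stats.record(outcome)
print(f"Recovery success rate: {stats.success_rate:.1%}")  # Recovery success rate: 75.0%
```

Calling `stats.record(True)` after a successful recovery and `stats.record(False)` in the `except` branch keeps failed recoveries in the denominator.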

5. Handle Recovery Failures

Handle cases where all recovery strategies fail:

import logging

logger = logging.getLogger(__name__)

try:
    result = await agent.execute_with_recovery(
        task=task,
        context=context,
        strategies=[
            RecoveryStrategy.RETRY,
            RecoveryStrategy.SIMPLIFY,
            RecoveryStrategy.FALLBACK,
            RecoveryStrategy.ABORT
        ]
    )
except Exception as e:
    # All recovery strategies failed
    logger.error(f"All recovery strategies failed: {e}")
    # handle_final_failure is an application-defined handler (not shown here)
    handle_final_failure(e)

Summary

Error recovery strategies provide:

  • ✅ Automatic retry with backoff

  • ✅ Task simplification

  • ✅ Fallback approaches

  • ✅ Task delegation

  • ✅ Error classification

Key Takeaways:

  • Order strategies from least to most expensive

  • Configure based on error types

  • Limit recovery attempts

  • Monitor recovery success

  • Handle recovery failures

For more details, see: