Error Recovery Strategies

This guide covers how to configure and use error recovery strategies to improve agent reliability and success rates through automatic retry, task simplification, fallback approaches, and delegation.

Table of Contents

Overview
Recovery Strategies
Basic Recovery Configuration
Strategy Chains
Custom Recovery Logic
Error Classification
Best Practices

Overview

Error recovery strategies provide:

Automatic Retry: Retry failed tasks with exponential backoff
Task Simplification: Break down complex tasks into simpler ones
Fallback Approaches: Use an alternative method when the primary one fails
Delegation: Delegate tasks to other capable agents
Error Classification: Classify errors for appropriate recovery

Available Strategies

RETRY: Retry with exponential backoff (for transient errors)
SIMPLIFY: Simplify task and retry (break down complex tasks)
FALLBACK: Use fallback approach or alternative method
DELEGATE: Delegate to another capable agent
ABORT: Abort execution and return error (terminal strategy)

Recovery Strategies

Strategy 1: RETRY

Retry failed tasks with exponential backoff.

Use When:

Transient errors (network, timeout, rate limits)
Temporary failures
Errors likely to succeed on retry

Example:

from aiecs.domain.agent import HybridAgent, AgentConfiguration
from aiecs.domain.agent.models import RecoveryStrategy
from aiecs.llm import OpenAIClient

agent = HybridAgent(
    agent_id="agent-1",
    name="My Agent",
    llm_client=OpenAIClient(),
    tools=["search"],
    config=AgentConfiguration(),
    recovery_strategies=[RecoveryStrategy.RETRY]
)

await agent.initialize()

# Execute with retry recovery
result = await agent.execute_with_recovery(
    task={"description": "Search for Python"},
    context={},
    strategies=[RecoveryStrategy.RETRY]
)
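The backoff behaviour that RETRY relies on can be sketched without the library. This is a minimal illustration, not the aiecs implementation: `backoff_delay`, `retry_with_backoff`, and all of their parameters are hypothetical names chosen for this example, and the set of "transient" exception types is an assumption.

```python
import asyncio
import random


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Delay before the retry that follows attempt `attempt` (0-based).

    Doubles on every attempt (base * 2**attempt) and never exceeds `cap`.
    """
    return min(cap, base * (2 ** attempt))


async def retry_with_backoff(func, *, max_attempts: int = 4, base: float = 0.5,
                             cap: float = 30.0, jitter: float = 0.1,
                             retry_on: tuple = (TimeoutError, ConnectionError)):
    """Call the async callable `func` until it succeeds or attempts run out.

    Only exceptions listed in `retry_on` (assumed transient) trigger a retry;
    any other exception propagates immediately.
    """
    for attempt in range(max_attempts):
        try:
            return await func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = backoff_delay(attempt, base, cap)
            # Add a little random jitter so many clients do not retry in lockstep.
            await asyncio.sleep(delay + random.uniform(0, jitter * delay))
```

With the defaults, the waits between attempts are roughly 0.5 s, 1 s, and 2 s. Capping the delay keeps a long outage from producing multi-minute sleeps.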

Strategy 2: SIMPLIFY

Simplify complex tasks and retry.

Use When:

The task is too complex to complete in one pass
Breaking it into smaller steps is likely to help
A simpler version is likely to succeed

Example:

agent = HybridAgent(
    agent_id="agent-1",
    llm_client=llm_client,
    tools=["search"],
    config=config,
    recovery_strategies=[RecoveryStrategy.SIMPLIFY]
)

# Execute with simplification recovery
result = await agent.execute_with_recovery(
    task={"description": "Complex multi-step task"},
    context={},
    strategies=[RecoveryStrategy.SIMPLIFY]
)
# Task simplified and retried automatically

Strategy 3: FALLBACK

Use a fallback approach when the primary approach fails.

Use When:

Alternative approach available
Primary method failed
Fallback method acceptable

Example:

agent = HybridAgent(
    agent_id="agent-1",
    llm_client=llm_client,
    tools=["search", "fallback_search"],
    config=config,
    recovery_strategies=[RecoveryStrategy.FALLBACK]
)

# Execute with fallback recovery
result = await agent.execute_with_recovery(
    task={"description": "Search task"},
    context={},
    strategies=[RecoveryStrategy.FALLBACK]
)
# Falls back to alternative tool if primary fails

Strategy 4: DELEGATE

Delegate task to another capable agent.

Use When:

Other agents are available in the registry
The current agent lacks the required capability
Handing the task to another agent is acceptable

Example:

# Create agent registry
agent_registry = {
    "specialist-agent": specialist_agent
}

agent = HybridAgent(
    agent_id="agent-1",
    llm_client=llm_client,
    tools=["search"],
    config=config,
    collaboration_enabled=True,
    agent_registry=agent_registry,
    recovery_strategies=[RecoveryStrategy.DELEGATE]
)

# Execute with delegation recovery
result = await agent.execute_with_recovery(
    task={"description": "Specialized task"},
    context={},
    strategies=[RecoveryStrategy.DELEGATE]
)
# Delegated to specialist agent if current agent fails
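How a delegate might be chosen from a registry can be sketched as follows. This is an illustration under stated assumptions: `StubAgent`, `find_delegate`, `delegate`, and the `capability` task key are all hypothetical and are not part of the aiecs API; the library may select delegates differently.

```python
import asyncio


class StubAgent:
    """Hypothetical stand-in for an agent; not the aiecs agent class."""

    def __init__(self, agent_id, capabilities):
        self.agent_id = agent_id
        self.capabilities = set(capabilities)

    async def execute_task(self, task, context):
        return {"handled_by": self.agent_id, "task": task["description"]}


def find_delegate(registry, required_capability, exclude_id):
    """Return the first registered agent, other than `exclude_id`, that
    advertises `required_capability`, or None if there is no such agent."""
    for agent_id, agent in registry.items():
        if agent_id != exclude_id and required_capability in agent.capabilities:
            return agent
    return None


async def delegate(registry, task, context, exclude_id):
    """Hand `task` to a capable peer, or raise LookupError if none exists."""
    peer = find_delegate(registry, task.get("capability"), exclude_id)
    if peer is None:
        raise LookupError(f"no agent can handle {task.get('capability')!r}")
    return await peer.execute_task(task, context)
```

Excluding the failing agent's own ID prevents a task from being delegated back to the agent that just failed on it.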

Strategy 5: ABORT

Abort execution and return error.

Use When:

All recovery attempts exhausted
Error is terminal
No further recovery possible

Example:

agent = HybridAgent(
    agent_id="agent-1",
    llm_client=llm_client,
    tools=["search"],
    config=config,
    recovery_strategies=[RecoveryStrategy.ABORT]
)

# Execute with abort recovery
try:
    result = await agent.execute_with_recovery(
        task={"description": "Task"},
        context={},
        strategies=[RecoveryStrategy.ABORT]
    )
except Exception as e:
    # ABORT performs no recovery; the error is raised to the caller
    print(f"Task aborted: {e}")

Basic Recovery Configuration

Pattern 1: Single Strategy

Use a single recovery strategy.

agent = HybridAgent(
    agent_id="agent-1",
    llm_client=llm_client,
    tools=["search"],
    config=config,
    recovery_strategies=[RecoveryStrategy.RETRY]
)

# Execute with retry
result = await agent.execute_with_recovery(
    task=task,
    context=context,
    strategies=[RecoveryStrategy.RETRY]
)

Pattern 2: Multiple Strategies

Use multiple recovery strategies, tried in the order listed.

agent = HybridAgent(
    agent_id="agent-1",
    llm_client=llm_client,
    tools=["search"],
    config=config,
    recovery_strategies=[
        RecoveryStrategy.RETRY,
        RecoveryStrategy.SIMPLIFY,
        RecoveryStrategy.FALLBACK
    ]
)

# Execute with multiple strategies (tried in order)
result = await agent.execute_with_recovery(
    task=task,
    context=context,
    strategies=[
        RecoveryStrategy.RETRY,
        RecoveryStrategy.SIMPLIFY,
        RecoveryStrategy.FALLBACK
    ]
)

Pattern 3: Default Strategies

Omit the strategies argument to use the recovery strategies the agent was constructed with.

agent = HybridAgent(
    agent_id="agent-1",
    llm_client=llm_client,
    tools=["search"],
    config=config,
    recovery_strategies=[
        RecoveryStrategy.RETRY,
        RecoveryStrategy.SIMPLIFY,
        RecoveryStrategy.FALLBACK
    ]
)

# No strategies argument: uses the recovery_strategies passed to the constructor
result = await agent.execute_with_recovery(task, context)

Strategy Chains

Pattern 1: Full Recovery Chain

Use the complete recovery chain.

strategies = [
    RecoveryStrategy.RETRY,      # Try retry first
    RecoveryStrategy.SIMPLIFY,    # Then simplify
    RecoveryStrategy.FALLBACK,    # Then fallback
    RecoveryStrategy.DELEGATE,    # Then delegate
    RecoveryStrategy.ABORT        # Finally abort
]

result = await agent.execute_with_recovery(
    task=task,
    context=context,
    strategies=strategies
)
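The "tried in order" semantics of a chain can be sketched in a few lines. This is an assumption about the general pattern, not a description of aiecs internals: `Strategy` is a local stand-in for the library's RecoveryStrategy enum, and `run_chain` and `handlers` are hypothetical names.

```python
import asyncio
from enum import Enum


class Strategy(Enum):
    """Local stand-in for the library's RecoveryStrategy enum."""
    RETRY = "retry"
    SIMPLIFY = "simplify"
    FALLBACK = "fallback"
    ABORT = "abort"


async def run_chain(task, strategies, handlers):
    """Try each strategy's handler in order; return the first success.

    `handlers` maps a Strategy to an async callable that takes the task.
    ABORT stops the chain. Failure is reported as a RuntimeError chained
    to the most recent handler error.
    """
    last_error = None
    for strategy in strategies:
        if strategy is Strategy.ABORT:
            break  # terminal strategy: nothing after it is tried
        try:
            return strategy, await handlers[strategy](task)
        except Exception as exc:  # remember the failure and move on
            last_error = exc
    raise RuntimeError("all recovery strategies failed") from last_error
```

Returning the strategy that succeeded alongside the result makes it easy to log which step of the chain actually recovered the task.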

Pattern 2: Conservative Chain

Use conservative recovery chain (no delegation).

strategies = [
    RecoveryStrategy.RETRY,
    RecoveryStrategy.SIMPLIFY,
    RecoveryStrategy.FALLBACK
    # No delegation - keep within current agent
]

result = await agent.execute_with_recovery(
    task=task,
    context=context,
    strategies=strategies
)

Pattern 3: Quick Fail Chain

Use quick fail chain (abort early).

strategies = [
    RecoveryStrategy.RETRY,
    RecoveryStrategy.ABORT  # Abort after retry
]

result = await agent.execute_with_recovery(
    task=task,
    context=context,
    strategies=strategies
)

Custom Recovery Logic

Pattern 1: Custom Retry Logic

Implement custom retry logic.

import asyncio

class CustomAgent(HybridAgent):
    async def _execute_with_retry(self, func, *args, **kwargs):
        """Custom retry logic: up to 5 attempts with exponential backoff"""
        max_retries = 5
        for attempt in range(max_retries):
            try:
                return await func(*args, **kwargs)
            except Exception:
                if attempt == max_retries - 1:
                    raise  # Out of attempts: re-raise the last error
                # Custom delay logic: wait 1s, 2s, 4s, 8s between attempts
                await asyncio.sleep(2 ** attempt)

Pattern 2: Custom Simplification

Implement custom task simplification.

import re

class CustomAgent(HybridAgent):
    async def _simplify_task(self, task):
        """Custom task simplification"""
        description = task.get("description", "")

        # Split on the standalone word "and" (case-insensitive), so that
        # words such as "sandbox" do not trigger a split
        parts = re.split(r"\s+and\s+", description, maxsplit=1, flags=re.IGNORECASE)
        if len(parts) > 1:
            return {
                "description": parts[0],  # First part only
                "simplified": True
            }

        return task
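The method above keeps only the first part of the description. Where the remaining parts should not be dropped, a variant can return every part as its own subtask. The sketch below is a standalone function, and `split_task` is a hypothetical name rather than a library hook.

```python
import re


def split_task(task: dict) -> list[dict]:
    """Split a description on the standalone word "and" into ordered subtasks.

    Returns [task] unchanged when there is nothing to split.
    """
    description = task.get("description", "")
    parts = [p.strip() for p in re.split(r"\s+and\s+", description, flags=re.IGNORECASE)]
    parts = [p for p in parts if p]  # drop empty fragments
    if len(parts) <= 1:
        return [task]
    # Copy the other task fields into each subtask and mark it as simplified.
    return [{**task, "description": part, "simplified": True} for part in parts]
```

Splitting on a conjunction is a crude heuristic; it only illustrates the shape of the transformation, and real descriptions usually need a smarter decomposition.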

Pattern 3: Custom Fallback

Implement custom fallback logic.

class CustomAgent(HybridAgent):
    async def _execute_with_fallback(self, task, context):
        """Custom fallback logic"""
        try:
            # Try primary approach
            return await self.execute_task(task, context)
        except Exception:
            # Use fallback tool
            fallback_task = {
                **task,
                "tool": "fallback_search"  # Use fallback tool
            }
            return await self.execute_task(fallback_task, context)

Error Classification

Pattern 1: Error-Based Strategy Selection

Select strategy based on error type.

try:
    result = await agent.execute_task(task, context)
except TimeoutError:
    # Use retry for timeout
    result = await agent.execute_with_recovery(
        task, context,
        strategies=[RecoveryStrategy.RETRY]
    )
except ValueError:
    # Use simplify for validation errors
    result = await agent.execute_with_recovery(
        task, context,
        strategies=[RecoveryStrategy.SIMPLIFY]
    )
except Exception as e:
    # Use full chain for unknown errors
    result = await agent.execute_with_recovery(
        task, context,
        strategies=[
            RecoveryStrategy.RETRY,
            RecoveryStrategy.SIMPLIFY,
            RecoveryStrategy.FALLBACK
        ]
    )
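The same selection can be expressed as a lookup table, which keeps the mapping in one place as more error types are added. This is a sketch under assumptions: `Strategy` is a local stand-in for RecoveryStrategy, and `RULES`, `DEFAULT`, `strategies_for`, and the specific exception-to-strategy mapping are illustrative choices, not library behaviour.

```python
from enum import Enum


class Strategy(Enum):
    """Local stand-in for the library's RecoveryStrategy enum."""
    RETRY = "retry"
    SIMPLIFY = "simplify"
    FALLBACK = "fallback"
    ABORT = "abort"


# Checked in order; the first match wins, so list specific types first.
RULES = [
    ((TimeoutError, ConnectionError), [Strategy.RETRY, Strategy.FALLBACK]),
    ((ValueError,), [Strategy.SIMPLIFY]),
    ((PermissionError,), [Strategy.ABORT]),
]

# Used for any error that no rule matches.
DEFAULT = [Strategy.RETRY, Strategy.SIMPLIFY, Strategy.FALLBACK]


def strategies_for(error: BaseException) -> list:
    """Return the recovery chain to use for `error`."""
    for error_types, strategies in RULES:
        if isinstance(error, error_types):
            return list(strategies)  # copy so callers cannot mutate the table
    return list(DEFAULT)
```

The returned list can then be passed as the strategies argument of a recovery call.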

Pattern 2: Automatic Error Classification

Agent automatically classifies errors.

# Agent automatically classifies errors and selects appropriate strategy
result = await agent.execute_with_recovery(
    task=task,
    context=context,
    strategies=[
        RecoveryStrategy.RETRY,      # For transient errors
        RecoveryStrategy.SIMPLIFY,    # For complex tasks
        RecoveryStrategy.FALLBACK     # For other errors
    ]
)

Best Practices

1. Use Appropriate Strategy Order

Order strategies from least to most expensive:

strategies = [
    RecoveryStrategy.RETRY,      # Cheapest - just retry
    RecoveryStrategy.SIMPLIFY,   # Moderate - simplify task
    RecoveryStrategy.FALLBACK,   # Moderate - use alternative
    RecoveryStrategy.DELEGATE,   # Expensive - delegate to another agent
    RecoveryStrategy.ABORT       # Terminal - give up
]

2. Configure Based on Error Types

Configure strategies based on expected error types:

# For network-heavy tasks
strategies = [
    RecoveryStrategy.RETRY,  # Retry network errors
    RecoveryStrategy.FALLBACK  # Use alternative endpoint
]

# For complex tasks
strategies = [
    RecoveryStrategy.SIMPLIFY,  # Break down complex tasks
    RecoveryStrategy.DELEGATE   # Delegate to specialist
]

3. Limit Recovery Attempts

Limit total recovery attempts:

# Limit to 3 total attempts
strategies = [
    RecoveryStrategy.RETRY,      # 1 attempt
    RecoveryStrategy.SIMPLIFY,   # 1 attempt
    RecoveryStrategy.FALLBACK,   # 1 attempt
    RecoveryStrategy.ABORT      # Give up
]
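If a strategy can itself make several attempts, the length of the list no longer bounds the total work. One way to enforce a hard ceiling is a shared counter that every attempt must draw from. The sketch below assumes this pattern; `AttemptBudget`, `BudgetExhausted`, and `attempt_all` are hypothetical names, not part of the library.

```python
import asyncio


class BudgetExhausted(Exception):
    """Raised when the shared attempt limit has been used up."""


class AttemptBudget:
    """Counts attempts across all strategies and refuses once the limit is hit."""

    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def spend(self) -> None:
        if self.used >= self.limit:
            raise BudgetExhausted(f"attempt limit of {self.limit} reached")
        self.used += 1


async def attempt_all(steps, budget: AttemptBudget):
    """Run async callables in order, charging one unit of budget per call."""
    last_error = None
    for step in steps:
        budget.spend()  # raises BudgetExhausted once the limit is hit
        try:
            return await step()
        except Exception as exc:
            last_error = exc  # remember the failure and try the next step
    raise RuntimeError("every step failed") from last_error
```

Because the budget object is passed in, the same instance can be shared by nested retries so that the limit applies to the whole recovery, not to each strategy separately.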

4. Monitor Recovery Success

Monitor recovery success rates:

recovery_attempts = 0
recovery_successes = 0

try:
    result = await agent.execute_task(task, context)
except Exception:
    recovery_attempts += 1
    try:
        result = await agent.execute_with_recovery(
            task, context,
            strategies=[RecoveryStrategy.RETRY]
        )
        recovery_successes += 1  # Counted only when recovery did not raise
    except Exception:
        result = None  # Recovery failed; the attempt still counts

success_rate = recovery_successes / recovery_attempts if recovery_attempts > 0 else 0
print(f"Recovery success rate: {success_rate:.1%}")
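The counters above cover a single call. To compare strategies over many tasks, the same bookkeeping can be kept per strategy. This is a minimal sketch; `RecoveryStats` and its methods are hypothetical helpers, and strategies are keyed by plain strings to keep the example self-contained.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class RecoveryStats:
    """Per-strategy counts of recovery attempts and successes."""
    attempts: Counter = field(default_factory=Counter)
    successes: Counter = field(default_factory=Counter)

    def record(self, strategy: str, succeeded: bool) -> None:
        """Record one recovery attempt for `strategy`."""
        self.attempts[strategy] += 1
        if succeeded:
            self.successes[strategy] += 1

    def success_rate(self, strategy: str) -> float:
        """Fraction of attempts that succeeded; 0.0 if the strategy was never tried."""
        tried = self.attempts[strategy]
        return self.successes[strategy] / tried if tried else 0.0
```

A strategy whose success rate stays near zero is a candidate for removal from the chain, since it only adds latency before the next strategy runs.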

5. Handle Recovery Failures

Handle cases where all recovery strategies fail:

import logging

logger = logging.getLogger(__name__)

try:
    result = await agent.execute_with_recovery(
        task=task,
        context=context,
        strategies=[
            RecoveryStrategy.RETRY,
            RecoveryStrategy.SIMPLIFY,
            RecoveryStrategy.FALLBACK,
            RecoveryStrategy.ABORT
        ]
    )
except Exception as e:
    # All recovery strategies failed
    logger.error("All recovery strategies failed: %s", e)
    # Handle final failure (handle_final_failure is an application-defined hook)
    handle_final_failure(e)

Summary

Error recovery strategies provide:

✅ Automatic retry with backoff
✅ Task simplification
✅ Fallback approaches
✅ Task delegation
✅ Error classification

Key Takeaways:

Order strategies from least to most expensive
Configure based on error types
Limit recovery attempts
Monitor recovery success
Handle recovery failures

For more details, see: