# Error Recovery Strategies

This guide covers how to configure and use error recovery strategies to improve agent reliability and success rates through automatic retry, task simplification, fallback approaches, and delegation.

## Table of Contents

1. [Overview](#overview)
2. [Recovery Strategies](#recovery-strategies)
3. [Basic Recovery Configuration](#basic-recovery-configuration)
4. [Strategy Chains](#strategy-chains)
5. [Custom Recovery Logic](#custom-recovery-logic)
6. [Error Classification](#error-classification)
7. [Best Practices](#best-practices)

## Overview

Error recovery strategies provide:

- **Automatic Retry**: Retry failed tasks with exponential backoff
- **Task Simplification**: Break down complex tasks into simpler ones
- **Fallback Approaches**: Use alternative methods when primary fails
- **Delegation**: Delegate tasks to other capable agents
- **Error Classification**: Classify errors for appropriate recovery

### Recovery Strategies

1. **RETRY**: Retry with exponential backoff (for transient errors)
2. **SIMPLIFY**: Simplify task and retry (break down complex tasks)
3. **FALLBACK**: Use fallback approach or alternative method
4. **DELEGATE**: Delegate to another capable agent
5. **ABORT**: Abort execution and return error (terminal strategy)

## Recovery Strategies

### Strategy 1: RETRY

Retry failed tasks with exponential backoff.

**Use When**:
- Transient errors (network, timeout, rate limits)
- Temporary failures
- Errors likely to succeed on retry

**Example**:
```python
from aiecs.domain.agent import HybridAgent, AgentConfiguration
from aiecs.domain.agent.models import RecoveryStrategy
from aiecs.llm import OpenAIClient

agent = HybridAgent(
    agent_id="agent-1",
    name="My Agent",
    llm_client=OpenAIClient(),
    tools=["search"],
    config=AgentConfiguration(),
    recovery_strategies=[RecoveryStrategy.RETRY]
)

await agent.initialize()

# Execute with retry recovery
result = await agent.execute_with_recovery(
    task={"description": "Search for Python"},
    context={},
    strategies=[RecoveryStrategy.RETRY]
)
```

### Strategy 2: SIMPLIFY

Simplify complex tasks and retry.

**Use When**:
- Task is too complex
- Breaking down helps
- Simpler version likely to succeed

**Example**:
```python
agent = HybridAgent(
    agent_id="agent-1",
    llm_client=llm_client,
    tools=["search"],
    config=config,
    recovery_strategies=[RecoveryStrategy.SIMPLIFY]
)

# Execute with simplification recovery
result = await agent.execute_with_recovery(
    task={"description": "Complex multi-step task"},
    context={},
    strategies=[RecoveryStrategy.SIMPLIFY]
)
# Task simplified and retried automatically
```

### Strategy 3: FALLBACK

Use fallback approach when primary fails.

**Use When**:
- Alternative approach available
- Primary method failed
- Fallback method acceptable

**Example**:
```python
agent = HybridAgent(
    agent_id="agent-1",
    llm_client=llm_client,
    tools=["search", "fallback_search"],
    config=config,
    recovery_strategies=[RecoveryStrategy.FALLBACK]
)

# Execute with fallback recovery
result = await agent.execute_with_recovery(
    task={"description": "Search task"},
    context={},
    strategies=[RecoveryStrategy.FALLBACK]
)
# Falls back to alternative tool if primary fails
```

### Strategy 4: DELEGATE

Delegate task to another capable agent.

**Use When**:
- Other agents available
- Current agent lacks capability
- Delegation appropriate

**Example**:
```python
# Create agent registry
agent_registry = {
    "specialist-agent": specialist_agent
}

agent = HybridAgent(
    agent_id="agent-1",
    llm_client=llm_client,
    tools=["search"],
    config=config,
    collaboration_enabled=True,
    agent_registry=agent_registry,
    recovery_strategies=[RecoveryStrategy.DELEGATE]
)

# Execute with delegation recovery
result = await agent.execute_with_recovery(
    task={"description": "Specialized task"},
    context={},
    strategies=[RecoveryStrategy.DELEGATE]
)
# Delegated to specialist agent if current agent fails
```

### Strategy 5: ABORT

Abort execution and return error.

**Use When**:
- All recovery attempts exhausted
- Error is terminal
- No further recovery possible

**Example**:
```python
agent = HybridAgent(
    agent_id="agent-1",
    llm_client=llm_client,
    tools=["search"],
    config=config,
    recovery_strategies=[RecoveryStrategy.ABORT]
)

# Execute with abort recovery
try:
    result = await agent.execute_with_recovery(
        task={"description": "Task"},
        context={},
        strategies=[RecoveryStrategy.ABORT]
    )
except Exception as e:
    # Abort strategy returns error immediately
    print(f"Task aborted: {e}")
```

## Basic Recovery Configuration

### Pattern 1: Single Strategy

Use single recovery strategy.

```python
agent = HybridAgent(
    agent_id="agent-1",
    llm_client=llm_client,
    tools=["search"],
    config=config,
    recovery_strategies=[RecoveryStrategy.RETRY]
)

# Execute with retry
result = await agent.execute_with_recovery(
    task=task,
    context=context,
    strategies=[RecoveryStrategy.RETRY]
)
```

### Pattern 2: Multiple Strategies

Use multiple recovery strategies.

```python
agent = HybridAgent(
    agent_id="agent-1",
    llm_client=llm_client,
    tools=["search"],
    config=config,
    recovery_strategies=[
        RecoveryStrategy.RETRY,
        RecoveryStrategy.SIMPLIFY,
        RecoveryStrategy.FALLBACK
    ]
)

# Execute with multiple strategies (tried in order)
result = await agent.execute_with_recovery(
    task=task,
    context=context,
    strategies=[
        RecoveryStrategy.RETRY,
        RecoveryStrategy.SIMPLIFY,
        RecoveryStrategy.FALLBACK
    ]
)
```

### Pattern 3: Default Strategies

Use default recovery strategies from agent configuration.

```python
agent = HybridAgent(
    agent_id="agent-1",
    llm_client=llm_client,
    tools=["search"],
    config=config,
    recovery_strategies=[
        RecoveryStrategy.RETRY,
        RecoveryStrategy.SIMPLIFY,
        RecoveryStrategy.FALLBACK
    ]
)

# Uses default strategies from agent configuration
result = await agent.execute_with_recovery(task, context)
```

## Strategy Chains

### Pattern 1: Full Recovery Chain

Use complete recovery chain.

```python
strategies = [
    RecoveryStrategy.RETRY,      # Try retry first
    RecoveryStrategy.SIMPLIFY,    # Then simplify
    RecoveryStrategy.FALLBACK,    # Then fallback
    RecoveryStrategy.DELEGATE,    # Then delegate
    RecoveryStrategy.ABORT        # Finally abort
]

result = await agent.execute_with_recovery(
    task=task,
    context=context,
    strategies=strategies
)
```

### Pattern 2: Conservative Chain

Use conservative recovery chain (no delegation).

```python
strategies = [
    RecoveryStrategy.RETRY,
    RecoveryStrategy.SIMPLIFY,
    RecoveryStrategy.FALLBACK
    # No delegation - keep within current agent
]

result = await agent.execute_with_recovery(
    task=task,
    context=context,
    strategies=strategies
)
```

### Pattern 3: Quick Fail Chain

Use quick fail chain (abort early).

```python
strategies = [
    RecoveryStrategy.RETRY,
    RecoveryStrategy.ABORT  # Abort after retry
]

result = await agent.execute_with_recovery(
    task=task,
    context=context,
    strategies=strategies
)
```

## Custom Recovery Logic

### Pattern 1: Custom Retry Logic

Implement custom retry logic.

```python
class CustomAgent(HybridAgent):
    async def _execute_with_retry(self, func, *args, **kwargs):
        """Custom retry logic"""
        max_retries = 5
        for attempt in range(max_retries):
            try:
                return await func(*args, **kwargs)
            except Exception as e:
                if attempt == max_retries - 1:
                    raise
                # Custom delay logic
                await asyncio.sleep(2 ** attempt)
```

### Pattern 2: Custom Simplification

Implement custom task simplification.

```python
class CustomAgent(HybridAgent):
    async def _simplify_task(self, task):
        """Custom task simplification"""
        description = task.get("description", "")
        
        # Break down complex task
        if "and" in description.lower():
            # Split into multiple tasks
            parts = description.split(" and ")
            return {
                "description": parts[0],  # First part only
                "simplified": True
            }
        
        return task
```

### Pattern 3: Custom Fallback

Implement custom fallback logic.

```python
class CustomAgent(HybridAgent):
    async def _execute_with_fallback(self, task, context):
        """Custom fallback logic"""
        try:
            # Try primary approach
            return await self.execute_task(task, context)
        except Exception:
            # Use fallback tool
            fallback_task = {
                **task,
                "tool": "fallback_search"  # Use fallback tool
            }
            return await self.execute_task(fallback_task, context)
```

## Error Classification

### Pattern 1: Error-Based Strategy Selection

Select strategy based on error type.

```python
try:
    result = await agent.execute_task(task, context)
except TimeoutError:
    # Use retry for timeout
    result = await agent.execute_with_recovery(
        task, context,
        strategies=[RecoveryStrategy.RETRY]
    )
except ValueError:
    # Use simplify for validation errors
    result = await agent.execute_with_recovery(
        task, context,
        strategies=[RecoveryStrategy.SIMPLIFY]
    )
except Exception as e:
    # Use full chain for unknown errors
    result = await agent.execute_with_recovery(
        task, context,
        strategies=[
            RecoveryStrategy.RETRY,
            RecoveryStrategy.SIMPLIFY,
            RecoveryStrategy.FALLBACK
        ]
    )
```

### Pattern 2: Automatic Error Classification

Agent automatically classifies errors.

```python
# Agent automatically classifies errors and selects appropriate strategy
result = await agent.execute_with_recovery(
    task=task,
    context=context,
    strategies=[
        RecoveryStrategy.RETRY,      # For transient errors
        RecoveryStrategy.SIMPLIFY,    # For complex tasks
        RecoveryStrategy.FALLBACK     # For other errors
    ]
)
```

## Best Practices

### 1. Use Appropriate Strategy Order

Order strategies from least to most expensive:

```python
strategies = [
    RecoveryStrategy.RETRY,      # Cheapest - just retry
    RecoveryStrategy.SIMPLIFY,   # Moderate - simplify task
    RecoveryStrategy.FALLBACK,   # Moderate - use alternative
    RecoveryStrategy.DELEGATE,   # Expensive - delegate to another agent
    RecoveryStrategy.ABORT       # Terminal - give up
]
```

### 2. Configure Based on Error Types

Configure strategies based on expected error types:

```python
# For network-heavy tasks
strategies = [
    RecoveryStrategy.RETRY,  # Retry network errors
    RecoveryStrategy.FALLBACK  # Use alternative endpoint
]

# For complex tasks
strategies = [
    RecoveryStrategy.SIMPLIFY,  # Break down complex tasks
    RecoveryStrategy.DELEGATE   # Delegate to specialist
]
```

### 3. Limit Recovery Attempts

Limit total recovery attempts:

```python
# Limit to 3 total attempts
strategies = [
    RecoveryStrategy.RETRY,      # 1 attempt
    RecoveryStrategy.SIMPLIFY,   # 1 attempt
    RecoveryStrategy.FALLBACK,   # 1 attempt
    RecoveryStrategy.ABORT      # Give up
]
```

### 4. Monitor Recovery Success

Monitor recovery success rates:

```python
recovery_attempts = 0
recovery_successes = 0

try:
    result = await agent.execute_task(task, context)
except Exception:
    recovery_attempts += 1
    result = await agent.execute_with_recovery(
        task, context,
        strategies=[RecoveryStrategy.RETRY]
    )
    recovery_successes += 1

success_rate = recovery_successes / recovery_attempts if recovery_attempts > 0 else 0
print(f"Recovery success rate: {success_rate:.1%}")
```

### 5. Handle Recovery Failures

Handle cases where all recovery strategies fail:

```python
try:
    result = await agent.execute_with_recovery(
        task=task,
        context=context,
        strategies=[
            RecoveryStrategy.RETRY,
            RecoveryStrategy.SIMPLIFY,
            RecoveryStrategy.FALLBACK,
            RecoveryStrategy.ABORT
        ]
    )
except Exception as e:
    # All recovery strategies failed
    logger.error(f"All recovery strategies failed: {e}")
    # Handle final failure
    handle_final_failure(e)
```

## Summary

Error recovery strategies provide:
- ✅ Automatic retry with backoff
- ✅ Task simplification
- ✅ Fallback approaches
- ✅ Task delegation
- ✅ Error classification

**Key Takeaways**:
- Order strategies from least to most expensive
- Configure based on error types
- Limit recovery attempts
- Monitor recovery success
- Handle recovery failures

For more details, see:
- [Agent Integration Guide](./AGENT_INTEGRATION.md)
- [Collaboration](./COLLABORATION.md)