# Reflexion Agent Pattern

The **Reflexion** pattern enables agents to learn from failures across multiple trials by maintaining a persistent reflection memory. Unlike simple reflection, Reflexion performs multiple attempts, stores insights from each trial, and uses accumulated knowledge to improve subsequent attempts.

## Overview

**Best For**: Tasks requiring learning from failures and iterative improvement

**Complexity**: ⭐⭐⭐ Advanced (Multi-trial learning with memory)

**Cost**: $$$$ Very High (Multiple trials × multiple LLM calls per trial)

## When to Use Reflexion

### Ideal Use Cases

✅ **Problem-solving with trial and error**
- Agent attempts solution
- Evaluates success/failure
- Learns from mistakes
- Tries again with improved approach

✅ **Optimization tasks**
- Multiple attempts to find best solution
- Each trial provides learning
- Memory guides future strategies
- Converges toward optimal approach

✅ **Complex puzzles and challenges**
- Initial attempts may fail
- Insights from failures inform next try
- Persistent memory tracks what doesn't work
- Gradual refinement leads to solution

✅ **Adaptive strategy development**
- Explores different approaches
- Learns which strategies succeed
- Builds knowledge base over trials
- Applies lessons to new attempts

### When NOT to Use Reflexion

❌ **One-shot tasks** → Use Reflection or direct LLM
❌ **No clear success/failure criteria** → Hard to evaluate trials
❌ **Cost-sensitive applications** → Many trials = high cost
❌ **Time-critical tasks** → Multiple trials take time

## How Reflexion Works

### The Multi-Trial Learning Cycle

```
┌─────────────────────────────────────────┐
│  TRIAL 1                                │
│                                         │
│  1. PLAN: Create initial approach       │
│     (no prior memory)                   │
│  2. EXECUTE: Try the approach           │
│  3. EVALUATE: Failed                    │
│  4. REFLECT: "Approach X didn't work    │
│              because Y. Try Z instead"  │
│  5. UPDATE MEMORY: Store insight        │
│                                         │
└─────────────────┬───────────────────────┘
                  ↓
┌─────────────────────────────────────────┐
│  TRIAL 2                                │
│                                         │
│  1. PLAN: Using memory from Trial 1     │
│     "Avoid approach X, try Z instead"   │
│  2. EXECUTE: Try improved approach      │
│  3. EVALUATE: Failed (but closer)       │
│  4. REFLECT: "Z was better than X, but  │
│              needs adjustment W"        │
│  5. UPDATE MEMORY: Add new insight      │
│                                         │
└─────────────────┬───────────────────────┘
                  ↓
┌─────────────────────────────────────────┐
│  TRIAL 3                                │
│                                         │
│  1. PLAN: Using memory from Trials 1-2  │
│     "Apply Z with adjustment W"         │
│  2. EXECUTE: Try refined approach       │
│  3. EVALUATE: Success!                  │
│  4. RETURN: Successful solution         │
│                                         │
└─────────────────────────────────────────┘
```

### Theoretical Foundation

Based on the paper ["Reflexion: Language Agents with Verbal Reinforcement Learning"](https://arxiv.org/abs/2303.11366). Key concepts:

1. **Verbal reinforcement learning**: Learn from natural language feedback
2. **Persistent memory**: Insights accumulate across trials
3. **Self-evaluation**: Agent judges its own success/failure
4. **Iterative refinement**: Each trial improves on previous attempts

### Algorithm

```python
def reflexion_loop(task, max_trials=3):
    """Simplified Reflexion algorithm"""
    reflection_memory = []

    for trial in range(max_trials):
        # 1. Plan using accumulated memory
        plan = llm_plan_with_memory(task, reflection_memory)

        # 2. Execute the plan
        outcome = llm_execute(task, plan)

        # 3. Evaluate success/failure
        evaluation = llm_evaluate(task, outcome)

        if evaluation == "success":
            return outcome

        # 4. Reflect on what went wrong
        reflection = llm_reflect(task, plan, outcome, evaluation)

        # 5. Add to memory for next trial
        reflection_memory.append(reflection)

    # Max trials reached, return best attempt
    return generate_final_answer(task, reflection_memory, outcome)
```

## API Reference

### Class: `ReflexionAgent`

```python
from agent_patterns.patterns import ReflexionAgent

agent = ReflexionAgent(
    llm_configs: Dict[str, Dict[str, Any]],
    max_trials: int = 3,
    prompt_dir: str = "prompts",
    custom_instructions: Optional[str] = None,
    prompt_overrides: Optional[Dict[str, Dict[str, str]]] = None
)
```

#### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `llm_configs` | `Dict[str, Dict[str, Any]]` | Yes | LLM configs for "thinking", "reflection", "execution", and "documentation" roles |
| `max_trials` | `int` | No | Maximum number of trial attempts (default: 3) |
| `prompt_dir` | `str` | No | Custom prompt directory (default: "prompts") |
| `custom_instructions` | `str` | No | Instructions appended to system prompts |
| `prompt_overrides` | `Dict` | No | Override specific prompts programmatically |

#### LLM Roles

- **thinking**: Used for planning with memory
- **execution**: Used for executing each trial
- **reflection**: Used for evaluating outcomes and generating insights
- **documentation**: Used for generating final answer when trials exhausted

#### Methods

**`run(input_data: str) -> str`**

Executes the Reflexion pattern on the given input.

- **Parameters**:
  - `input_data` (str): The task or problem to solve
- **Returns**: str - The final answer (successful outcome or best attempt)
- **Raises**: ValueError if graph not built

**`build_graph() -> None`**

Builds the LangGraph state graph. Called automatically during initialization.

## Complete Examples

### Basic Usage

```python
from agent_patterns.patterns import ReflexionAgent

# Configure LLMs
llm_configs = {
    "thinking": {
        "provider": "openai",
        "model": "gpt-4",
        "temperature": 0.7,
    },
    "execution": {
        "provider": "openai",
        "model": "gpt-4",
        "temperature": 0.7,
    },
    "reflection": {
        "provider": "openai",
        "model": "gpt-4",
        "temperature": 0.3,  # Lower temp for consistent evaluation
    },
    "documentation": {
        "provider": "openai",
        "model": "gpt-4",
        "temperature": 0.7,
    }
}

# Create agent
agent = ReflexionAgent(
    llm_configs=llm_configs,
    max_trials=3
)

# Solve challenging problem
result = agent.run("""
Puzzle: You have 12 coins that look identical. One is counterfeit
and weighs slightly different (either heavier or lighter).
Using a balance scale only 3 times, identify the counterfeit coin
AND determine if it's heavier or lighter.

Provide step-by-step solution.
""")

print(result)
# Agent will:
# Trial 1: Attempt a solution, likely fail or have gaps
# Trial 2: Learn from Trial 1 mistakes, try improved approach
# Trial 3: Apply accumulated insights, find correct solution
```

### With Custom Instructions

```python
# Add domain-specific learning guidance
debugging_instructions = """
You are debugging code by trial and error. Follow these principles:

PLANNING WITH MEMORY:
- Review what you've tried before
- Don't repeat failed approaches
- Build on partial successes
- Try systematic variations

EXECUTION:
- Be precise in implementing the plan
- Document what you're testing

EVALUATION:
- Check if code runs without errors
- Verify output matches expected results
- Identify specific failure points

REFLECTION:
- Analyze why the approach failed
- Identify what worked and what didn't
- Generate specific, actionable insights
- Propose concrete changes for next trial
"""

agent = ReflexionAgent(
    llm_configs=llm_configs,
    max_trials=5,  # Allow more trials for complex debugging
    custom_instructions=debugging_instructions
)

result = agent.run("""
Debug this Python function that should find the longest palindromic substring:

def longest_palindrome(s):
    result = ""
    for i in range(len(s)):
        for j in range(i, len(s)):
            if s[i:j] == s[i:j][::-1]:
                result = max(result, s[i:j])
    return result

Test cases:
- longest_palindrome("babad") should return "bab" or "aba"
- longest_palindrome("cbbd") should return "bb"

The function has bugs. Fix them.
""")
```

### With Prompt Overrides

```python
# Customize evaluation criteria
overrides = {
    "Evaluate": {
        "system_prompt": """You are a strict evaluator. Determine if the task
was completed successfully. Be rigorous - partial solutions are failures.
Respond with SUCCESS or FAILURE and explain why.""",
        "user_prompt": """Task: {task}

Attempted solution:
{outcome}

Did this FULLY complete the task with no errors or gaps?
Respond with SUCCESS or FAILURE and detailed explanation.

Your evaluation:"""
    },
    "ReflectOnTrial": {
        "system_prompt": """You are a learning agent analyzing failures.
Generate actionable insights about what went wrong and how to improve.""",
        "user_prompt": """Task: {task}

Your plan: {plan}

What happened: {outcome}

Evaluation: {evaluation}

Analyze this trial deeply:
1. What specific aspect failed?
2. Why did it fail?
3. What should be different in the next attempt?
4. What (if anything) worked well and should be kept?

Your insights:"""
    }
}

agent = ReflexionAgent(
    llm_configs=llm_configs,
    max_trials=3,
    prompt_overrides=overrides
)
```

## Customizing Prompts

### Understanding the System Prompt Structure

Version 0.2.0 introduces **enterprise-grade prompts** with a comprehensive 9-section structure providing significantly better guidance (150-300+ lines vs ~32 lines).

**The 9-Section Structure**: All prompts include Role and Identity, Core Capabilities, Process, Output Format, Decision-Making Guidelines, Quality Standards, Edge Cases, Examples, and Critical Reminders. **Benefits**: Increased reliability and robustness.

### Understanding Reflexion Prompts

Reflexion uses five prompt templates (all now with comprehensive 9-section structure):

1. **PlanWithMemory**: Creates plan using reflection memory from previous trials with systematic guidance
2. **Execute**: Executes the current plan with quality standards and edge case handling
3. **Evaluate**: Judges success or failure using explicit criteria and examples
4. **ReflectOnTrial**: Generates insights from the trial with structured process
5. **GenerateFinal**: Creates final answer when trials exhausted with comprehensive quality standards

### Method 1: Custom Instructions

```python
agent = ReflexionAgent(
    llm_configs=llm_configs,
    custom_instructions="""
    LEARNING APPROACH:
    - Each trial should explore meaningfully different approaches
    - Extract specific, actionable lessons
    - Build systematically on previous insights
    - Avoid repeating mistakes

    SUCCESS CRITERIA:
    - Solution must be complete
    - All requirements satisfied
    - No errors or edge cases missed
    """
)
```

### Method 2: Prompt Overrides

```python
# Customize planning with memory
overrides = {
    "PlanWithMemory": {
        "system_prompt": """You learn from past attempts. Use insights from
previous trials to create better plans.""",
        "user_prompt": """Task: {task}

Trial: {trial_count}/{max_trials}

Lessons learned from previous trials:
{memory}

Plan your next approach, incorporating what you've learned.
Be specific about how you're adapting based on past failures.

Your plan:"""
    }
}
```

### Method 3: Custom Prompt Directory

```bash
my_prompts/
└── ReflexionAgent/
    ├── PlanWithMemory/
    │   ├── system_prompt.md
    │   └── user_prompt.md
    ├── Execute/
    │   ├── system_prompt.md
    │   └── user_prompt.md
    ├── Evaluate/
    │   ├── system_prompt.md
    │   └── user_prompt.md
    ├── ReflectOnTrial/
    │   ├── system_prompt.md
    │   └── user_prompt.md
    └── GenerateFinal/
        ├── system_prompt.md
        └── user_prompt.md
```

## Setting Agent Goals

### Via Task Description

Provide clear success criteria:

```python
# Well-defined success criteria
agent.run("""
Task: Optimize this SQL query to run in under 2 seconds

Current query:
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.order_date > '2023-01-01'
ORDER BY o.total DESC

Database info:
- orders table: 1M rows
- customers table: 100K rows
- No current indexes except primary keys

Success criteria:
- Query returns same results
- Execution time < 2 seconds
- Explain your optimization strategy
""")
```

### Via Custom Instructions

```python
agent = ReflexionAgent(
    llm_configs=llm_configs,
    custom_instructions="""
    GOAL: Find working, optimal solutions through iterative learning

    TRIAL SUCCESS DEFINITION:
    - Solution is correct (passes all test cases)
    - Solution is efficient (meets performance requirements)
    - Solution is complete (handles edge cases)

    REFLECTION QUALITY:
    - Identify root causes of failures
    - Generate specific, testable hypotheses
    - Propose concrete alternative approaches

    LEARNING EFFICIENCY:
    - Don't repeat the same mistakes
    - Build incrementally on partial successes
    - Try meaningfully different approaches if stuck
    """
)
```

## Advanced Usage

### Adjusting Trial Budget

```python
# Quick tasks: fewer trials
quick_agent = ReflexionAgent(
    llm_configs=llm_configs,
    max_trials=2
)

# Complex tasks: more trials
thorough_agent = ReflexionAgent(
    llm_configs=llm_configs,
    max_trials=5
)

# Very challenging tasks: extended trials
research_agent = ReflexionAgent(
    llm_configs=llm_configs,
    max_trials=10
)
```

### Custom Evaluation Logic

```python
class CustomReflexionAgent(ReflexionAgent):
    def _evaluate_outcome(self, state):
        """Override with domain-specific evaluation"""
        task = state["input_task"]
        outcome = state["outcome"]

        # Custom evaluation logic
        if "code" in task.lower():
            # Code-specific checks
            evaluation = self._evaluate_code(outcome)
        elif "math" in task.lower():
            # Math-specific checks
            evaluation = self._evaluate_math(outcome)
        else:
            # Default LLM evaluation
            return super()._evaluate_outcome(state)

        state["evaluation"] = evaluation
        state["evaluation_detail"] = f"Custom evaluation: {evaluation}"
        return state

    def _evaluate_code(self, code):
        """Evaluate code outcome"""
        try:
            # Try to execute code
            exec(code)
            return "success"
        except:
            return "failure"

    def _evaluate_math(self, answer):
        """Evaluate mathematical answer"""
        # Custom math validation logic
        pass

agent = CustomReflexionAgent(llm_configs=llm_configs)
```

### Memory Analysis

```python
class AnalyzingReflexionAgent(ReflexionAgent):
    def run(self, input_data):
        """Override to analyze memory after completion"""
        result = super().run(input_data)

        # Access reflection memory for analysis
        # (Would need to store in instance variable during execution)
        print("\n=== Learning Summary ===")
        print(f"Trials completed: {self.trial_count}")
        print("\nKey insights:")
        for i, insight in enumerate(self.memory_log, 1):
            print(f"{i}. {insight}")

        return result

agent = AnalyzingReflexionAgent(llm_configs=llm_configs)
```

## Performance Considerations

### Cost Analysis

Reflexion is expensive due to multiple trials:

**Per trial cost**:
- Plan: 1 LLM call
- Execute: 1 LLM call
- Evaluate: 1 LLM call
- Reflect: 1 LLM call
- **= 4 calls per trial**

**Total cost**:
- 3 trials: ~12 LLM calls
- 5 trials: ~20 LLM calls
- 10 trials: ~40 LLM calls

**Optimization strategies**:

```python
# 1. Limit trials
agent = ReflexionAgent(llm_configs=llm_configs, max_trials=3)

# 2. Use cheaper models for some roles
llm_configs = {
    "thinking": {"provider": "openai", "model": "gpt-4"},
    "execution": {"provider": "openai", "model": "gpt-3.5-turbo"},  # Cheaper
    "reflection": {"provider": "openai", "model": "gpt-4"},
    "documentation": {"provider": "openai", "model": "gpt-3.5-turbo"}  # Cheaper
}

# 3. Early stopping with custom evaluation
# (Stop as soon as success is detected)
```

### When to Use Reflexion vs Other Patterns

| Task Type | Best Pattern | Reason |
|-----------|-------------|---------|
| One-shot quality improvement | Reflection | ✅ No need for trials |
| Learning from failures | Reflexion | ✅ Designed for this |
| Simple tasks | Direct LLM | ❌ Reflexion overkill |
| Tool-based workflows | ReAct | ❌ Reflexion doesn't use tools |
| Cost-sensitive | Reflection, Plan & Solve | ❌ Reflexion expensive |

## Comparison with Other Patterns

| Aspect | Reflexion | Reflection | ReAct |
|--------|-----------|-----------|--------|
| **Trials** | Multiple | Single task | Single task |
| **Memory** | Persistent across trials | Within-task only | No memory |
| **Learning** | Trial-and-error | Self-critique | Adaptive action |
| **Cost** | Very High | Medium-High | Medium |
| **Best For** | Learning from failures | Quality improvement | Tool interaction |

## Common Pitfalls

### 1. Insufficient Trials

❌ **Bad**: Too few trials for complex problems
```python
agent = ReflexionAgent(llm_configs=llm_configs, max_trials=1)
# This is just expensive execution, no learning benefit
```

✅ **Good**: Appropriate trial budget
```python
agent = ReflexionAgent(llm_configs=llm_configs, max_trials=3-5)
```

### 2. Vague Evaluation Criteria

❌ **Bad**: Unclear success definition
```python
agent.run("Make this better")  # What is "better"?
```

✅ **Good**: Specific, measurable criteria
```python
agent.run("""
Optimize this function to:
1. Run in O(n log n) time or better
2. Use O(n) space or less
3. Pass all test cases
4. Handle edge cases (empty input, single element, etc.)
""")
```

### 3. Weak Reflections

❌ **Bad**: Generic or non-actionable insights

✅ **Good**: Ensure reflections are specific
```python
overrides = {
    "ReflectOnTrial": {
        "user_prompt": """...
Your reflection must include:
1. SPECIFIC failure point (not "it didn't work")
2. ROOT CAUSE analysis (why it failed)
3. CONCRETE alternative approach (exactly what to try next)

Your detailed reflection:"""
    }
}
```

### 4. Repeating Mistakes

❌ **Bad**: Agent ignores previous learnings

✅ **Good**: Emphasize memory usage in planning
```python
overrides = {
    "PlanWithMemory": {
        "user_prompt": """...
Lessons from previous trials:
{memory}

Your plan MUST:
- Avoid approaches that already failed
- Build on what worked
- Try something meaningfully different if previous approaches all failed

Your plan:"""
    }
}
```

## Troubleshooting

### All Trials Failing

**Symptom**: Agent doesn't find solution within max_trials

**Solutions**:
```python
# 1. Increase trial budget
agent = ReflexionAgent(llm_configs=llm_configs, max_trials=7)

# 2. Provide more guidance
custom_instructions = """
APPROACH DIVERSITY:
If first 2 trials fail, try radically different approaches in subsequent trials.
Don't keep varying the same failed strategy.
"""

# 3. Check if task is solvable
# Verify task has clear solution and criteria
```

### Weak Learning Between Trials

**Symptom**: Later trials don't improve on earlier ones

**Solutions**:
```python
# Strengthen reflection prompts
overrides = {
    "ReflectOnTrial": {
        "user_prompt": """...
Analyze deeply:
- What EXACTLY went wrong at which step?
- What assumption was incorrect?
- What approach should be abandoned entirely?
- What aspect should be kept for next trial?
- What SPECIFIC change should trial {trial_count + 1} make?

Your analysis:"""
    }
}
```

### Premature Success Evaluation

**Symptom**: Agent declares success on partial solutions

**Solutions**:
```python
# Make evaluation more rigorous
overrides = {
    "Evaluate": {
        "system_prompt": """You are a strict evaluator. Only declare SUCCESS
if the solution is 100% complete and correct. Partial solutions are FAILURE.""",
        "user_prompt": """...
Check every requirement:
{requirements_checklist}

All must be satisfied for SUCCESS.

Your evaluation:"""
    }
}
```

## Next Steps

- Try the [complete examples](../examples/reflexion-examples.md)
- Learn about [Reflection](reflection.md) for single-task refinement
- Explore [LATS](lats.md) for tree-search based exploration
- Read the [original paper](https://arxiv.org/abs/2303.11366)

## References

- Original paper: [Reflexion: Language Agents with Verbal Reinforcement Learning](https://arxiv.org/abs/2303.11366)
- Related: [Self-Refine](https://arxiv.org/abs/2303.17651) and iterative improvement techniques
- Reinforcement learning concepts applied to language agents