Reflexion Agent Pattern

The Reflexion pattern enables agents to learn from failures across multiple trials by maintaining a persistent reflection memory. Unlike simple reflection, Reflexion performs multiple attempts, stores insights from each trial, and uses accumulated knowledge to improve subsequent attempts.

Overview

Best For: Tasks requiring learning from failures and iterative improvement

Complexity: ⭐⭐⭐ Advanced (Multi-trial learning with memory)

Cost: $$$$ Very High (Multiple trials × multiple LLM calls per trial)

When to Use Reflexion

Ideal Use Cases

✅ Problem-solving with trial and error

Agent attempts solution
Evaluates success/failure
Learns from mistakes
Tries again with improved approach

✅ Optimization tasks

Multiple attempts to find best solution
Each trial provides learning
Memory guides future strategies
Converges toward optimal approach

✅ Complex puzzles and challenges

Initial attempts may fail
Insights from failures inform next try
Persistent memory tracks what doesn’t work
Gradual refinement leads to solution

✅ Adaptive strategy development

Explores different approaches
Learns which strategies succeed
Builds knowledge base over trials
Applies lessons to new attempts

When NOT to Use Reflexion

❌ One-shot tasks → Use Reflection or direct LLM
❌ No clear success/failure criteria → Hard to evaluate trials
❌ Cost-sensitive applications → Many trials = high cost
❌ Time-critical tasks → Multiple trials take time

How Reflexion Works

The Multi-Trial Learning Cycle

┌─────────────────────────────────────────┐
│  TRIAL 1                                │
│                                         │
│  1. PLAN: Create initial approach       │
│     (no prior memory)                   │
│  2. EXECUTE: Try the approach           │
│  3. EVALUATE: Failed                    │
│  4. REFLECT: "Approach X didn't work    │
│              because Y. Try Z instead"  │
│  5. UPDATE MEMORY: Store insight        │
│                                         │
└─────────────────┬───────────────────────┘
                  ↓
┌─────────────────────────────────────────┐
│  TRIAL 2                                │
│                                         │
│  1. PLAN: Using memory from Trial 1     │
│     "Avoid approach X, try Z instead"   │
│  2. EXECUTE: Try improved approach      │
│  3. EVALUATE: Failed (but closer)       │
│  4. REFLECT: "Z was better than X, but  │
│              needs adjustment W"        │
│  5. UPDATE MEMORY: Add new insight      │
│                                         │
└─────────────────┬───────────────────────┘
                  ↓
┌─────────────────────────────────────────┐
│  TRIAL 3                                │
│                                         │
│  1. PLAN: Using memory from Trials 1-2  │
│     "Apply Z with adjustment W"         │
│  2. EXECUTE: Try refined approach       │
│  3. EVALUATE: Success!                  │
│  4. RETURN: Successful solution         │
│                                         │
└─────────────────────────────────────────┘

Theoretical Foundation

Based on the paper “Reflexion: Language Agents with Verbal Reinforcement Learning”. Key concepts:

Verbal reinforcement learning: Learn from natural language feedback
Persistent memory: Insights accumulate across trials
Self-evaluation: Agent judges its own success/failure
Iterative refinement: Each trial improves on previous attempts

Algorithm

def reflexion_loop(task, max_trials=3):
    """Simplified Reflexion algorithm"""
    reflection_memory = []
    outcome = None  # stays None only if max_trials < 1

    for trial in range(max_trials):
        # 1. Plan using accumulated memory
        plan = llm_plan_with_memory(task, reflection_memory)

        # 2. Execute the plan
        outcome = llm_execute(task, plan)

        # 3. Evaluate success/failure
        evaluation = llm_evaluate(task, outcome)

        if evaluation == "success":
            return outcome

        # 4. Reflect on what went wrong
        reflection = llm_reflect(task, plan, outcome, evaluation)

        # 5. Add to memory for next trial
        reflection_memory.append(reflection)

    # Max trials reached: build a final answer from the memory and the last outcome
    return generate_final_answer(task, reflection_memory, outcome)
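The loop above depends on LLM helpers that are not defined here. To see the control flow run end to end, here is a minimal sketch that swaps those helpers for plain callables. The names `attempt`, `evaluate`, and `reflect`, and the number-guessing task, are hypothetical stand-ins for illustration and are not part of the library.

```python
from typing import Callable, List, Tuple


def reflexion_loop(
    task: str,
    attempt: Callable[[str, List[str]], str],
    evaluate: Callable[[str, str], bool],
    reflect: Callable[[str, str], str],
    max_trials: int = 3,
) -> Tuple[str, List[str]]:
    """Run up to max_trials attempts, carrying reflections forward."""
    memory: List[str] = []
    outcome = ""
    for _ in range(max_trials):
        outcome = attempt(task, memory)        # plan + execute, informed by memory
        if evaluate(task, outcome):            # stop at the first success
            break
        memory.append(reflect(task, outcome))  # store the lesson for the next trial
    return outcome, memory


# Toy stand-ins (assumptions): guess a hidden number between 1 and 8.
SECRET = 6


def attempt(task: str, memory: List[str]) -> str:
    low, high = 1, 8
    for note in memory:                        # each note looks like "4:too low"
        value, hint = note.split(":")
        if hint == "too low":
            low = max(low, int(value) + 1)
        else:
            high = min(high, int(value) - 1)
    return str((low + high) // 2)


def evaluate(task: str, outcome: str) -> bool:
    return int(outcome) == SECRET


def reflect(task: str, outcome: str) -> str:
    hint = "too low" if int(outcome) < SECRET else "too high"
    return f"{outcome}:{hint}"


answer, lessons = reflexion_loop("guess the number", attempt, evaluate, reflect)
print(answer, lessons)  # 6 ['4:too low']
```

The first trial guesses 4 with an empty memory, the stored reflection narrows the range, and the second trial succeeds. Only failed trials add to memory.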

API Reference

Class: `ReflexionAgent`

from agent_patterns.patterns import ReflexionAgent

ReflexionAgent(
    llm_configs: Dict[str, Dict[str, Any]],
    max_trials: int = 3,
    prompt_dir: str = "prompts",
    custom_instructions: Optional[str] = None,
    prompt_overrides: Optional[Dict[str, Dict[str, str]]] = None
)

Parameters

Parameter	Type	Required	Description
`llm_configs`	`Dict[str, Dict[str, Any]]`	Yes	LLM configs for “thinking”, “reflection”, “execution”, and “documentation” roles
`max_trials`	`int`	No	Maximum number of trial attempts (default: 3)
`prompt_dir`	`str`	No	Custom prompt directory (default: “prompts”)
`custom_instructions`	`str`	No	Instructions appended to system prompts
`prompt_overrides`	`Dict`	No	Override specific prompts programmatically

LLM Roles

thinking: Used for planning with memory
execution: Used for executing each trial
reflection: Used for evaluating outcomes and generating insights
documentation: Used for generating final answer when trials exhausted
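Because each role maps to its own entry in `llm_configs`, it can help to check a config dict for missing roles before constructing the agent. The sketch below is a standalone helper. `REQUIRED_ROLES` and `missing_roles` are hypothetical names, and whether the library itself validates or falls back for a missing role is not stated here.

```python
from typing import Any, Dict, List

# Role names taken from the list above.
REQUIRED_ROLES = ("thinking", "execution", "reflection", "documentation")


def missing_roles(llm_configs: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return the required role names that are absent from llm_configs."""
    return [role for role in REQUIRED_ROLES if role not in llm_configs]


configs = {
    "thinking": {"provider": "openai", "model": "gpt-4"},
    "execution": {"provider": "openai", "model": "gpt-4"},
}
print(missing_roles(configs))  # ['reflection', 'documentation']
```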

Methods

run(input_data: str) -> str

Executes the Reflexion pattern on the given input.

Parameters:
- input_data (str): The task or problem to solve
Returns: str - The final answer (the successful outcome, or the best attempt once trials are exhausted)
Raises: ValueError if the graph has not been built

build_graph() -> None

Builds the LangGraph state graph. Called automatically during initialization.

Complete Examples

Basic Usage

from agent_patterns.patterns import ReflexionAgent

# Configure LLMs
llm_configs = {
    "thinking": {
        "provider": "openai",
        "model": "gpt-4",
        "temperature": 0.7,
    },
    "execution": {
        "provider": "openai",
        "model": "gpt-4",
        "temperature": 0.7,
    },
    "reflection": {
        "provider": "openai",
        "model": "gpt-4",
        "temperature": 0.3,  # Lower temp for consistent evaluation
    },
    "documentation": {
        "provider": "openai",
        "model": "gpt-4",
        "temperature": 0.7,
    }
}

# Create agent
agent = ReflexionAgent(
    llm_configs=llm_configs,
    max_trials=3
)

# Solve challenging problem
result = agent.run("""
Puzzle: You have 12 coins that look identical. One is counterfeit
and weighs slightly different (either heavier or lighter).
Using a balance scale only 3 times, identify the counterfeit coin
AND determine if it's heavier or lighter.

Provide step-by-step solution.
""")

print(result)
# Agent will:
# Trial 1: Attempt a solution, likely fail or have gaps
# Trial 2: Learn from Trial 1 mistakes, try improved approach
# Trial 3: Apply accumulated insights, find correct solution

With Custom Instructions

# Add domain-specific learning guidance
debugging_instructions = """
You are debugging code by trial and error. Follow these principles:

PLANNING WITH MEMORY:
- Review what you've tried before
- Don't repeat failed approaches
- Build on partial successes
- Try systematic variations

EXECUTION:
- Be precise in implementing the plan
- Document what you're testing

EVALUATION:
- Check if code runs without errors
- Verify output matches expected results
- Identify specific failure points

REFLECTION:
- Analyze why the approach failed
- Identify what worked and what didn't
- Generate specific, actionable insights
- Propose concrete changes for next trial
"""

agent = ReflexionAgent(
    llm_configs=llm_configs,
    max_trials=5,  # Allow more trials for complex debugging
    custom_instructions=debugging_instructions
)

result = agent.run("""
Debug this Python function that should find the longest palindromic substring:

def longest_palindrome(s):
    result = ""
    for i in range(len(s)):
        for j in range(i, len(s)):
            if s[i:j] == s[i:j][::-1]:
                result = max(result, s[i:j])
    return result

Test cases:
- longest_palindrome("babad") should return "bab" or "aba"
- longest_palindrome("cbbd") should return "bb"

The function has bugs. Fix them.
""")
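For checking the agent's answer: the sample function has two defects. The slice `s[i:j]` leaves out the character at index `j`, and `max(result, s[i:j])` compares strings alphabetically instead of by length. One possible corrected version is sketched below; the agent may well arrive at a different (for example, faster) fix.

```python
def longest_palindrome(s: str) -> str:
    """Brute-force search over all substrings; O(n^3) but easy to verify."""
    result = ""
    for i in range(len(s)):
        for j in range(i, len(s)):
            candidate = s[i:j + 1]  # include the character at index j
            if candidate == candidate[::-1]:
                # Compare by length; on ties the earlier result is kept
                result = max(result, candidate, key=len)
    return result


print(longest_palindrome("babad"))  # bab
print(longest_palindrome("cbbd"))   # bb
```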

With Prompt Overrides

# Customize evaluation criteria
overrides = {
    "Evaluate": {
        "system_prompt": """You are a strict evaluator. Determine if the task
was completed successfully. Be rigorous - partial solutions are failures.
Respond with SUCCESS or FAILURE and explain why.""",
        "user_prompt": """Task: {task}

Attempted solution:
{outcome}

Did this FULLY complete the task with no errors or gaps?
Respond with SUCCESS or FAILURE and detailed explanation.

Your evaluation:"""
    },
    "ReflectOnTrial": {
        "system_prompt": """You are a learning agent analyzing failures.
Generate actionable insights about what went wrong and how to improve.""",
        "user_prompt": """Task: {task}

Your plan: {plan}

What happened: {outcome}

Evaluation: {evaluation}

Analyze this trial deeply:
1. What specific aspect failed?
2. Why did it fail?
3. What should be different in the next attempt?
4. What (if anything) worked well and should be kept?

Your insights:"""
    }
}

agent = ReflexionAgent(
    llm_configs=llm_configs,
    max_trials=3,
    prompt_overrides=overrides
)
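The evaluation prompt above asks the model to answer with SUCCESS or FAILURE followed by an explanation, so something has to turn that free text into a verdict. How the library does this is not shown here. The sketch below is one way it could be done, with `parse_verdict` as a hypothetical helper that takes the first keyword it finds and treats a reply without a keyword as a failure.

```python
import re


def parse_verdict(evaluation_text: str) -> str:
    """Map an evaluator reply to 'success' or 'failure'.

    Uses the first whole-word SUCCESS/FAILURE keyword (case-insensitive).
    Defaults to 'failure' when neither keyword appears.
    """
    match = re.search(r"\b(SUCCESS|FAILURE)\b", evaluation_text.upper())
    if match is None:
        return "failure"
    return match.group(1).lower()


print(parse_verdict("SUCCESS: all three weighings are accounted for"))  # success
print(parse_verdict("Failure - step 2 ignores the lighter case"))       # failure
print(parse_verdict("The answer looks plausible."))                     # failure
```

Defaulting to failure is the conservative choice: an ambiguous evaluation costs one extra trial instead of returning an unverified answer.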

Customizing Prompts

Understanding the System Prompt Structure

Version 0.2.0 replaces the earlier short system prompts (about 32 lines) with longer ones (150-300+ lines) organized into nine sections.

The 9-Section Structure: All prompts include Role and Identity, Core Capabilities, Process, Output Format, Decision-Making Guidelines, Quality Standards, Edge Cases, Examples, and Critical Reminders. The added structure is intended to make agent behavior more reliable and robust.

Understanding Reflexion Prompts

Reflexion uses five prompt templates (all now with comprehensive 9-section structure):

PlanWithMemory: Creates plan using reflection memory from previous trials with systematic guidance
Execute: Executes the current plan with quality standards and edge case handling
Evaluate: Judges success or failure using explicit criteria and examples
ReflectOnTrial: Generates insights from the trial with structured process
GenerateFinal: Creates final answer when trials exhausted with comprehensive quality standards

Method 1: Custom Instructions

agent = ReflexionAgent(
    llm_configs=llm_configs,
    custom_instructions="""
    LEARNING APPROACH:
    - Each trial should explore meaningfully different approaches
    - Extract specific, actionable lessons
    - Build systematically on previous insights
    - Avoid repeating mistakes

    SUCCESS CRITERIA:
    - Solution must be complete
    - All requirements satisfied
    - No errors or edge cases missed
    """
)

Method 2: Prompt Overrides

# Customize planning with memory
overrides = {
    "PlanWithMemory": {
        "system_prompt": """You learn from past attempts. Use insights from
previous trials to create better plans.""",
        "user_prompt": """Task: {task}

Trial: {trial_count}/{max_trials}

Lessons learned from previous trials:
{memory}

Plan your next approach, incorporating what you've learned.
Be specific about how you're adapting based on past failures.

Your plan:"""
    }
}
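To preview what the model would see, the template can be filled in by hand. This sketch assumes the placeholders are substituted with Python's `str.format` and that reflections are joined into a numbered list. Both are assumptions for illustration, and `render_plan_prompt` is a hypothetical helper, not a library function.

```python
from typing import List

USER_PROMPT = """Task: {task}

Trial: {trial_count}/{max_trials}

Lessons learned from previous trials:
{memory}

Your plan:"""


def render_plan_prompt(
    task: str, reflections: List[str], trial_count: int, max_trials: int
) -> str:
    """Fill the template; reflections become a numbered list, one per line."""
    if reflections:
        memory = "\n".join(f"{i}. {text}" for i, text in enumerate(reflections, 1))
    else:
        memory = "(no previous trials)"
    return USER_PROMPT.format(
        task=task, trial_count=trial_count, max_trials=max_trials, memory=memory
    )


prompt = render_plan_prompt(
    "Sort one million integers",
    ["Bubble sort timed out on large input"],
    trial_count=2,
    max_trials=3,
)
print(prompt)
```

If `str.format` is indeed the substitution mechanism, literal braces in a prompt would need to be doubled (`{{` and `}}`), and placeholders may only be names, not expressions.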

Method 3: Custom Prompt Directory

my_prompts/
└── ReflexionAgent/
    ├── PlanWithMemory/
    │   ├── system_prompt.md
    │   └── user_prompt.md
    ├── Execute/
    │   ├── system_prompt.md
    │   └── user_prompt.md
    ├── Evaluate/
    │   ├── system_prompt.md
    │   └── user_prompt.md
    ├── ReflectOnTrial/
    │   ├── system_prompt.md
    │   └── user_prompt.md
    └── GenerateFinal/
        ├── system_prompt.md
        └── user_prompt.md
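The tree above can be created with a few lines of standard-library code. This is a convenience sketch only: `scaffold_prompt_dir` is a hypothetical helper, and it creates empty files that still need prompt text. The example runs in a temporary directory so that nothing is left behind.

```python
import tempfile
from pathlib import Path
from typing import List

# Step names taken from the directory layout above.
PROMPT_STEPS = ["PlanWithMemory", "Execute", "Evaluate", "ReflectOnTrial", "GenerateFinal"]


def scaffold_prompt_dir(root: str, agent_name: str = "ReflexionAgent") -> List[Path]:
    """Create system_prompt.md and user_prompt.md for every step under root."""
    created = []
    for step in PROMPT_STEPS:
        step_dir = Path(root) / agent_name / step
        step_dir.mkdir(parents=True, exist_ok=True)
        for filename in ("system_prompt.md", "user_prompt.md"):
            path = step_dir / filename
            path.touch(exist_ok=True)  # leaves any existing content untouched
            created.append(path)
    return created


with tempfile.TemporaryDirectory() as tmp:
    files = scaffold_prompt_dir(tmp)
    print(len(files))  # 10
```

In a real project, call `scaffold_prompt_dir("my_prompts")`, write the prompt files, and pass `prompt_dir="my_prompts"` when constructing the agent.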

Setting Agent Goals

Via Task Description

Provide clear success criteria:

# Well-defined success criteria
agent.run("""
Task: Optimize this SQL query to run in under 2 seconds

Current query:
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.order_date > '2023-01-01'
ORDER BY o.total DESC

Database info:
- orders table: 1M rows
- customers table: 100K rows
- No current indexes except primary keys

Success criteria:
- Query returns same results
- Execution time < 2 seconds
- Explain your optimization strategy
""")

Via Custom Instructions

agent = ReflexionAgent(
    llm_configs=llm_configs,
    custom_instructions="""
    GOAL: Find working, optimal solutions through iterative learning

    TRIAL SUCCESS DEFINITION:
    - Solution is correct (passes all test cases)
    - Solution is efficient (meets performance requirements)
    - Solution is complete (handles edge cases)

    REFLECTION QUALITY:
    - Identify root causes of failures
    - Generate specific, testable hypotheses
    - Propose concrete alternative approaches

    LEARNING EFFICIENCY:
    - Don't repeat the same mistakes
    - Build incrementally on partial successes
    - Try meaningfully different approaches if stuck
    """
)

Advanced Usage

Adjusting Trial Budget

# Quick tasks: fewer trials
quick_agent = ReflexionAgent(
    llm_configs=llm_configs,
    max_trials=2
)

# Complex tasks: more trials
thorough_agent = ReflexionAgent(
    llm_configs=llm_configs,
    max_trials=5
)

# Very challenging tasks: extended trials
research_agent = ReflexionAgent(
    llm_configs=llm_configs,
    max_trials=10
)

Custom Evaluation Logic

class CustomReflexionAgent(ReflexionAgent):
    def _evaluate_outcome(self, state):
        """Override with domain-specific evaluation"""
        task = state["input_task"]
        outcome = state["outcome"]

        # Custom evaluation logic
        if "code" in task.lower():
            # Code-specific checks
            evaluation = self._evaluate_code(outcome)
        elif "math" in task.lower():
            # Math-specific checks
            evaluation = self._evaluate_math(outcome)
        else:
            # Default LLM evaluation
            return super()._evaluate_outcome(state)

        state["evaluation"] = evaluation
        state["evaluation_detail"] = f"Custom evaluation: {evaluation}"
        return state

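    def _evaluate_code(self, code):
        """Evaluate code outcome (assumes `code` is raw Python source)"""
        try:
            # WARNING: exec runs untrusted LLM output inside this process.
            # Use a sandbox or a separate process outside of experiments.
            exec(code, {})
            return "success"
        except Exception:
            return "failure"

    def _evaluate_math(self, answer):
        """Evaluate mathematical answer"""
        # Custom math validation logic goes here;
        # it must return "success" or "failure"
        raise NotImplementedError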

agent = CustomReflexionAgent(llm_configs=llm_configs)
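Running generated code with `exec` inside the agent's own process is risky and cannot be interrupted if the code hangs. A somewhat safer variant runs the code in a separate Python process with a timeout. The sketch below is a standalone function. `evaluate_code` and `timeout_s` are hypothetical names, and a subprocess is still not a security sandbox.

```python
import subprocess
import sys


def evaluate_code(code: str, timeout_s: float = 5.0) -> str:
    """Run code in a child Python process; 'success' means exit code 0."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,  # keep the child's output off our stdout
            text=True,
            timeout=timeout_s,    # kills the child if it runs too long
        )
    except subprocess.TimeoutExpired:
        return "failure"
    return "success" if completed.returncode == 0 else "failure"


print(evaluate_code("assert sorted([3, 1, 2]) == [1, 2, 3]"))  # success
print(evaluate_code("raise ValueError('boom')"))               # failure
```

An uncaught exception or failed assertion makes the child exit with a non-zero code, so embedding test assertions in the executed code gives a stricter check than "runs without errors".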

Memory Analysis

class AnalyzingReflexionAgent(ReflexionAgent):
    def run(self, input_data):
        """Override to analyze memory after completion"""
        result = super().run(input_data)

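        # Access reflection memory for analysis.
        # NOTE: trial_count and memory_log would need to be stored on the
        # instance during execution; getattr avoids an AttributeError if not.
        trial_count = getattr(self, "trial_count", "unknown")
        memory_log = getattr(self, "memory_log", [])
        print("\n=== Learning Summary ===")
        print(f"Trials completed: {trial_count}")
        print("\nKey insights:")
        for i, insight in enumerate(memory_log, 1):
            print(f"{i}. {insight}")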

        return result

agent = AnalyzingReflexionAgent(llm_configs=llm_configs)

Performance Considerations

Cost Analysis

Reflexion is expensive due to multiple trials:

Per trial cost:

Plan: 1 LLM call
Execute: 1 LLM call
Evaluate: 1 LLM call
Reflect: 1 LLM call
= 4 calls per trial

Total cost:

3 trials: ~12 LLM calls
5 trials: ~20 LLM calls
10 trials: ~40 LLM calls
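These totals are approximate. Following the simplified algorithm shown earlier, a successful trial skips the reflection call, and a run that exhausts its trials adds one call to generate the final answer. The sketch below counts calls under those assumptions. `estimate_llm_calls` is a hypothetical helper, and the real graph may differ.

```python
from typing import Optional


def estimate_llm_calls(max_trials: int, success_trial: Optional[int] = None) -> int:
    """Count LLM calls for one run under the simplified algorithm.

    Failed trial: plan + execute + evaluate + reflect = 4 calls.
    Successful trial: plan + execute + evaluate = 3 calls.
    All trials failed: one extra call to generate the final answer.
    """
    if success_trial is not None:
        if not 1 <= success_trial <= max_trials:
            raise ValueError("success_trial must be between 1 and max_trials")
        return 4 * (success_trial - 1) + 3
    return 4 * max_trials + 1


print(estimate_llm_calls(3))                    # 13 (all three trials fail)
print(estimate_llm_calls(3, success_trial=1))   # 3  (first attempt succeeds)
print(estimate_llm_calls(10, success_trial=4))  # 15
```

Multiplying the result by an average per-call price gives a rough budget. Later trials tend to cost more per call because the accumulated memory lengthens the planning prompt.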

Optimization strategies:

# 1. Limit trials
agent = ReflexionAgent(llm_configs=llm_configs, max_trials=3)

# 2. Use cheaper models for some roles
llm_configs = {
    "thinking": {"provider": "openai", "model": "gpt-4"},
    "execution": {"provider": "openai", "model": "gpt-3.5-turbo"},  # Cheaper
    "reflection": {"provider": "openai", "model": "gpt-4"},
    "documentation": {"provider": "openai", "model": "gpt-3.5-turbo"}  # Cheaper
}

# 3. Early stopping with custom evaluation
# (Stop as soon as success is detected)

When to Use Reflexion vs Other Patterns

Task Type	Best Pattern	Reason
One-shot quality improvement	Reflection	✅ No need for trials
Learning from failures	Reflexion	✅ Designed for this
Simple tasks	Direct LLM	❌ Reflexion overkill
Tool-based workflows	ReAct	❌ Reflexion doesn’t use tools
Cost-sensitive	Reflection, Plan & Solve	❌ Reflexion expensive

Comparison with Other Patterns

Aspect	Reflexion	Reflection	ReAct
Trials	Multiple	Single task	Single task
Memory	Persistent across trials	Within-task only	No memory
Learning	Trial-and-error	Self-critique	Adaptive action
Cost	Very High	Medium-High	Medium
Best For	Learning from failures	Quality improvement	Tool interaction

Common Pitfalls

1. Insufficient Trials

❌ Bad: Too few trials for complex problems

agent = ReflexionAgent(llm_configs=llm_configs, max_trials=1)
# This is just expensive execution, no learning benefit

✅ Good: Appropriate trial budget

agent = ReflexionAgent(llm_configs=llm_configs, max_trials=3)  # 3 to 5 trials suits most tasks

2. Vague Evaluation Criteria

❌ Bad: Unclear success definition

agent.run("Make this better")  # What is "better"?

✅ Good: Specific, measurable criteria

agent.run("""
Optimize this function to:
1. Run in O(n log n) time or better
2. Use O(n) space or less
3. Pass all test cases
4. Handle edge cases (empty input, single element, etc.)
""")

3. Weak Reflections

❌ Bad: Generic or non-actionable insights

✅ Good: Ensure reflections are specific

overrides = {
    "ReflectOnTrial": {
        "user_prompt": """...
Your reflection must include:
1. SPECIFIC failure point (not "it didn't work")
2. ROOT CAUSE analysis (why it failed)
3. CONCRETE alternative approach (exactly what to try next)

Your detailed reflection:"""
    }
}

4. Repeating Mistakes

❌ Bad: Agent ignores previous learnings

✅ Good: Emphasize memory usage in planning

overrides = {
    "PlanWithMemory": {
        "user_prompt": """...
Lessons from previous trials:
{memory}

Your plan MUST:
- Avoid approaches that already failed
- Build on what worked
- Try something meaningfully different if previous approaches all failed

Your plan:"""
    }
}

Troubleshooting

All Trials Failing

Symptom: The agent doesn’t find a solution within max_trials

Solutions:

# 1. Increase trial budget
agent = ReflexionAgent(llm_configs=llm_configs, max_trials=7)

# 2. Provide more guidance
custom_instructions = """
APPROACH DIVERSITY:
If first 2 trials fail, try radically different approaches in subsequent trials.
Don't keep varying the same failed strategy.
"""

# 3. Check if task is solvable
# Verify task has clear solution and criteria

Weak Learning Between Trials

Symptom: Later trials don’t improve on earlier ones

Solutions:

# Strengthen reflection prompts
overrides = {
    "ReflectOnTrial": {
        "user_prompt": """...
Analyze deeply:
- What EXACTLY went wrong at which step?
- What assumption was incorrect?
- What approach should be abandoned entirely?
- What aspect should be kept for next trial?
- What SPECIFIC change should the next trial make?

Your analysis:"""
    }
}

Premature Success Evaluation

Symptom: Agent declares success on partial solutions

Solutions:

# Make evaluation more rigorous
overrides = {
    "Evaluate": {
        "system_prompt": """You are a strict evaluator. Only declare SUCCESS
if the solution is 100% complete and correct. Partial solutions are FAILURE.""",
        "user_prompt": """...
Check every requirement:
{requirements_checklist}

All must be satisfied for SUCCESS.

Your evaluation:"""
    }
}

Next Steps

Try the complete examples
Learn about Reflection for single-task refinement
Explore LATS for tree-search based exploration
Read the original paper

References

Original paper: Reflexion: Language Agents with Verbal Reinforcement Learning
Related: Self-Refine and iterative improvement techniques
Reinforcement learning concepts applied to language agents
