Reflexion Agent Pattern

The Reflexion pattern enables agents to learn from failures across multiple trials by maintaining a persistent reflection memory. Unlike simple reflection, Reflexion performs multiple attempts, stores insights from each trial, and uses accumulated knowledge to improve subsequent attempts.

Overview

Best For: Tasks requiring learning from failures and iterative improvement

Complexity: ⭐⭐⭐ Advanced (Multi-trial learning with memory)

Cost: $$$$ Very High (Multiple trials × multiple LLM calls per trial)

When to Use Reflexion

Ideal Use Cases

Problem-solving with trial and error

  • Agent attempts solution

  • Evaluates success/failure

  • Learns from mistakes

  • Tries again with improved approach

Optimization tasks

  • Multiple attempts to find best solution

  • Each trial provides learning

  • Memory guides future strategies

  • Converges toward optimal approach

Complex puzzles and challenges

  • Initial attempts may fail

  • Insights from failures inform next try

  • Persistent memory tracks what doesn’t work

  • Gradual refinement leads to solution

Adaptive strategy development

  • Explores different approaches

  • Learns which strategies succeed

  • Builds knowledge base over trials

  • Applies lessons to new attempts

When NOT to Use Reflexion

One-shot tasks → Use Reflection or direct LLM ❌ No clear success/failure criteria → Hard to evaluate trials ❌ Cost-sensitive applications → Many trials = high cost ❌ Time-critical tasks → Multiple trials take time

How Reflexion Works

The Multi-Trial Learning Cycle

┌─────────────────────────────────────────┐
│  TRIAL 1                                │
│                                         │
│  1. PLAN: Create initial approach       │
│     (no prior memory)                   │
│  2. EXECUTE: Try the approach           │
│  3. EVALUATE: Failed                    │
│  4. REFLECT: "Approach X didn't work    │
│              because Y. Try Z instead"  │
│  5. UPDATE MEMORY: Store insight        │
│                                         │
└─────────────────┬───────────────────────┘
                  ↓
┌─────────────────────────────────────────┐
│  TRIAL 2                                │
│                                         │
│  1. PLAN: Using memory from Trial 1     │
│     "Avoid approach X, try Z instead"   │
│  2. EXECUTE: Try improved approach      │
│  3. EVALUATE: Failed (but closer)       │
│  4. REFLECT: "Z was better than X, but  │
│              needs adjustment W"        │
│  5. UPDATE MEMORY: Add new insight      │
│                                         │
└─────────────────┬───────────────────────┘
                  ↓
┌─────────────────────────────────────────┐
│  TRIAL 3                                │
│                                         │
│  1. PLAN: Using memory from Trials 1-2  │
│     "Apply Z with adjustment W"         │
│  2. EXECUTE: Try refined approach       │
│  3. EVALUATE: Success!                  │
│  4. RETURN: Successful solution         │
│                                         │
└─────────────────────────────────────────┘

Theoretical Foundation

Based on the paper “Reflexion: Language Agents with Verbal Reinforcement Learning”. Key concepts:

  1. Verbal reinforcement learning: Learn from natural language feedback

  2. Persistent memory: Insights accumulate across trials

  3. Self-evaluation: Agent judges its own success/failure

  4. Iterative refinement: Each trial improves on previous attempts

Algorithm

def reflexion_loop(task, max_trials=3):
    """Simplified Reflexion algorithm"""
    reflection_memory = []

    for trial in range(max_trials):
        # 1. Plan using accumulated memory
        plan = llm_plan_with_memory(task, reflection_memory)

        # 2. Execute the plan
        outcome = llm_execute(task, plan)

        # 3. Evaluate success/failure
        evaluation = llm_evaluate(task, outcome)

        if evaluation == "success":
            return outcome

        # 4. Reflect on what went wrong
        reflection = llm_reflect(task, plan, outcome, evaluation)

        # 5. Add to memory for next trial
        reflection_memory.append(reflection)

    # Max trials reached, return best attempt
    return generate_final_answer(task, reflection_memory, outcome)

API Reference

Class: ReflexionAgent

from agent_patterns.patterns import ReflexionAgent

agent = ReflexionAgent(
    llm_configs: Dict[str, Dict[str, Any]],
    max_trials: int = 3,
    prompt_dir: str = "prompts",
    custom_instructions: Optional[str] = None,
    prompt_overrides: Optional[Dict[str, Dict[str, str]]] = None
)

Parameters

Parameter

Type

Required

Description

llm_configs

Dict[str, Dict[str, Any]]

Yes

LLM configs for “thinking”, “reflection”, “execution”, and “documentation” roles

max_trials

int

No

Maximum number of trial attempts (default: 3)

prompt_dir

str

No

Custom prompt directory (default: “prompts”)

custom_instructions

str

No

Instructions appended to system prompts

prompt_overrides

Dict

No

Override specific prompts programmatically

LLM Roles

  • thinking: Used for planning with memory

  • execution: Used for executing each trial

  • reflection: Used for evaluating outcomes and generating insights

  • documentation: Used for generating final answer when trials exhausted

Methods

run(input_data: str) -> str

Executes the Reflexion pattern on the given input.

  • Parameters:

    • input_data (str): The task or problem to solve

  • Returns: str - The final answer (successful outcome or best attempt)

  • Raises: ValueError if graph not built

build_graph() -> None

Builds the LangGraph state graph. Called automatically during initialization.

Complete Examples

Basic Usage

from agent_patterns.patterns import ReflexionAgent

# Configure LLMs
llm_configs = {
    "thinking": {
        "provider": "openai",
        "model": "gpt-4",
        "temperature": 0.7,
    },
    "execution": {
        "provider": "openai",
        "model": "gpt-4",
        "temperature": 0.7,
    },
    "reflection": {
        "provider": "openai",
        "model": "gpt-4",
        "temperature": 0.3,  # Lower temp for consistent evaluation
    },
    "documentation": {
        "provider": "openai",
        "model": "gpt-4",
        "temperature": 0.7,
    }
}

# Create agent
agent = ReflexionAgent(
    llm_configs=llm_configs,
    max_trials=3
)

# Solve challenging problem
result = agent.run("""
Puzzle: You have 12 coins that look identical. One is counterfeit
and weighs slightly different (either heavier or lighter).
Using a balance scale only 3 times, identify the counterfeit coin
AND determine if it's heavier or lighter.

Provide step-by-step solution.
""")

print(result)
# Agent will:
# Trial 1: Attempt a solution, likely fail or have gaps
# Trial 2: Learn from Trial 1 mistakes, try improved approach
# Trial 3: Apply accumulated insights, find correct solution

With Custom Instructions

# Add domain-specific learning guidance
debugging_instructions = """
You are debugging code by trial and error. Follow these principles:

PLANNING WITH MEMORY:
- Review what you've tried before
- Don't repeat failed approaches
- Build on partial successes
- Try systematic variations

EXECUTION:
- Be precise in implementing the plan
- Document what you're testing

EVALUATION:
- Check if code runs without errors
- Verify output matches expected results
- Identify specific failure points

REFLECTION:
- Analyze why the approach failed
- Identify what worked and what didn't
- Generate specific, actionable insights
- Propose concrete changes for next trial
"""

agent = ReflexionAgent(
    llm_configs=llm_configs,
    max_trials=5,  # Allow more trials for complex debugging
    custom_instructions=debugging_instructions
)

result = agent.run("""
Debug this Python function that should find the longest palindromic substring:

def longest_palindrome(s):
    result = ""
    for i in range(len(s)):
        for j in range(i, len(s)):
            if s[i:j] == s[i:j][::-1]:
                result = max(result, s[i:j])
    return result

Test cases:
- longest_palindrome("babad") should return "bab" or "aba"
- longest_palindrome("cbbd") should return "bb"

The function has bugs. Fix them.
""")

With Prompt Overrides

# Customize evaluation criteria
overrides = {
    "Evaluate": {
        "system_prompt": """You are a strict evaluator. Determine if the task
was completed successfully. Be rigorous - partial solutions are failures.
Respond with SUCCESS or FAILURE and explain why.""",
        "user_prompt": """Task: {task}

Attempted solution:
{outcome}

Did this FULLY complete the task with no errors or gaps?
Respond with SUCCESS or FAILURE and detailed explanation.

Your evaluation:"""
    },
    "ReflectOnTrial": {
        "system_prompt": """You are a learning agent analyzing failures.
Generate actionable insights about what went wrong and how to improve.""",
        "user_prompt": """Task: {task}

Your plan: {plan}

What happened: {outcome}

Evaluation: {evaluation}

Analyze this trial deeply:
1. What specific aspect failed?
2. Why did it fail?
3. What should be different in the next attempt?
4. What (if anything) worked well and should be kept?

Your insights:"""
    }
}

agent = ReflexionAgent(
    llm_configs=llm_configs,
    max_trials=3,
    prompt_overrides=overrides
)

Customizing Prompts

Understanding the System Prompt Structure

Version 0.2.0 introduces enterprise-grade prompts with a comprehensive 9-section structure providing significantly better guidance (150-300+ lines vs ~32 lines).

The 9-Section Structure: All prompts include Role and Identity, Core Capabilities, Process, Output Format, Decision-Making Guidelines, Quality Standards, Edge Cases, Examples, and Critical Reminders. Benefits: Increased reliability and robustness.

Understanding Reflexion Prompts

Reflexion uses five prompt templates (all now with comprehensive 9-section structure):

  1. PlanWithMemory: Creates plan using reflection memory from previous trials with systematic guidance

  2. Execute: Executes the current plan with quality standards and edge case handling

  3. Evaluate: Judges success or failure using explicit criteria and examples

  4. ReflectOnTrial: Generates insights from the trial with structured process

  5. GenerateFinal: Creates final answer when trials exhausted with comprehensive quality standards

Method 1: Custom Instructions

agent = ReflexionAgent(
    llm_configs=llm_configs,
    custom_instructions="""
    LEARNING APPROACH:
    - Each trial should explore meaningfully different approaches
    - Extract specific, actionable lessons
    - Build systematically on previous insights
    - Avoid repeating mistakes

    SUCCESS CRITERIA:
    - Solution must be complete
    - All requirements satisfied
    - No errors or edge cases missed
    """
)

Method 2: Prompt Overrides

# Customize planning with memory
overrides = {
    "PlanWithMemory": {
        "system_prompt": """You learn from past attempts. Use insights from
previous trials to create better plans.""",
        "user_prompt": """Task: {task}

Trial: {trial_count}/{max_trials}

Lessons learned from previous trials:
{memory}

Plan your next approach, incorporating what you've learned.
Be specific about how you're adapting based on past failures.

Your plan:"""
    }
}

Method 3: Custom Prompt Directory

my_prompts/
└── ReflexionAgent/
    ├── PlanWithMemory/
       ├── system_prompt.md
       └── user_prompt.md
    ├── Execute/
       ├── system_prompt.md
       └── user_prompt.md
    ├── Evaluate/
       ├── system_prompt.md
       └── user_prompt.md
    ├── ReflectOnTrial/
       ├── system_prompt.md
       └── user_prompt.md
    └── GenerateFinal/
        ├── system_prompt.md
        └── user_prompt.md

Setting Agent Goals

Via Task Description

Provide clear success criteria:

# Well-defined success criteria
agent.run("""
Task: Optimize this SQL query to run in under 2 seconds

Current query:
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.order_date > '2023-01-01'
ORDER BY o.total DESC

Database info:
- orders table: 1M rows
- customers table: 100K rows
- No current indexes except primary keys

Success criteria:
- Query returns same results
- Execution time < 2 seconds
- Explain your optimization strategy
""")

Via Custom Instructions

agent = ReflexionAgent(
    llm_configs=llm_configs,
    custom_instructions="""
    GOAL: Find working, optimal solutions through iterative learning

    TRIAL SUCCESS DEFINITION:
    - Solution is correct (passes all test cases)
    - Solution is efficient (meets performance requirements)
    - Solution is complete (handles edge cases)

    REFLECTION QUALITY:
    - Identify root causes of failures
    - Generate specific, testable hypotheses
    - Propose concrete alternative approaches

    LEARNING EFFICIENCY:
    - Don't repeat the same mistakes
    - Build incrementally on partial successes
    - Try meaningfully different approaches if stuck
    """
)

Advanced Usage

Adjusting Trial Budget

# Quick tasks: fewer trials
quick_agent = ReflexionAgent(
    llm_configs=llm_configs,
    max_trials=2
)

# Complex tasks: more trials
thorough_agent = ReflexionAgent(
    llm_configs=llm_configs,
    max_trials=5
)

# Very challenging tasks: extended trials
research_agent = ReflexionAgent(
    llm_configs=llm_configs,
    max_trials=10
)

Custom Evaluation Logic

class CustomReflexionAgent(ReflexionAgent):
    def _evaluate_outcome(self, state):
        """Override with domain-specific evaluation"""
        task = state["input_task"]
        outcome = state["outcome"]

        # Custom evaluation logic
        if "code" in task.lower():
            # Code-specific checks
            evaluation = self._evaluate_code(outcome)
        elif "math" in task.lower():
            # Math-specific checks
            evaluation = self._evaluate_math(outcome)
        else:
            # Default LLM evaluation
            return super()._evaluate_outcome(state)

        state["evaluation"] = evaluation
        state["evaluation_detail"] = f"Custom evaluation: {evaluation}"
        return state

    def _evaluate_code(self, code):
        """Evaluate code outcome"""
        try:
            # Try to execute code
            exec(code)
            return "success"
        except:
            return "failure"

    def _evaluate_math(self, answer):
        """Evaluate mathematical answer"""
        # Custom math validation logic
        pass

agent = CustomReflexionAgent(llm_configs=llm_configs)

Memory Analysis

class AnalyzingReflexionAgent(ReflexionAgent):
    def run(self, input_data):
        """Override to analyze memory after completion"""
        result = super().run(input_data)

        # Access reflection memory for analysis
        # (Would need to store in instance variable during execution)
        print("\n=== Learning Summary ===")
        print(f"Trials completed: {self.trial_count}")
        print("\nKey insights:")
        for i, insight in enumerate(self.memory_log, 1):
            print(f"{i}. {insight}")

        return result

agent = AnalyzingReflexionAgent(llm_configs=llm_configs)

Performance Considerations

Cost Analysis

Reflexion is expensive due to multiple trials:

Per trial cost:

  • Plan: 1 LLM call

  • Execute: 1 LLM call

  • Evaluate: 1 LLM call

  • Reflect: 1 LLM call

  • = 4 calls per trial

Total cost:

  • 3 trials: ~12 LLM calls

  • 5 trials: ~20 LLM calls

  • 10 trials: ~40 LLM calls

Optimization strategies:

# 1. Limit trials
agent = ReflexionAgent(llm_configs=llm_configs, max_trials=3)

# 2. Use cheaper models for some roles
llm_configs = {
    "thinking": {"provider": "openai", "model": "gpt-4"},
    "execution": {"provider": "openai", "model": "gpt-3.5-turbo"},  # Cheaper
    "reflection": {"provider": "openai", "model": "gpt-4"},
    "documentation": {"provider": "openai", "model": "gpt-3.5-turbo"}  # Cheaper
}

# 3. Early stopping with custom evaluation
# (Stop as soon as success is detected)

When to Use Reflexion vs Other Patterns

Task Type

Best Pattern

Reason

One-shot quality improvement

Reflection

✅ No need for trials

Learning from failures

Reflexion

✅ Designed for this

Simple tasks

Direct LLM

❌ Reflexion overkill

Tool-based workflows

ReAct

❌ Reflexion doesn’t use tools

Cost-sensitive

Reflection, Plan & Solve

❌ Reflexion expensive

Comparison with Other Patterns

Aspect

Reflexion

Reflection

ReAct

Trials

Multiple

Single task

Single task

Memory

Persistent across trials

Within-task only

No memory

Learning

Trial-and-error

Self-critique

Adaptive action

Cost

Very High

Medium-High

Medium

Best For

Learning from failures

Quality improvement

Tool interaction

Common Pitfalls

1. Insufficient Trials

Bad: Too few trials for complex problems

agent = ReflexionAgent(llm_configs=llm_configs, max_trials=1)
# This is just expensive execution, no learning benefit

Good: Appropriate trial budget

agent = ReflexionAgent(llm_configs=llm_configs, max_trials=3-5)

2. Vague Evaluation Criteria

Bad: Unclear success definition

agent.run("Make this better")  # What is "better"?

Good: Specific, measurable criteria

agent.run("""
Optimize this function to:
1. Run in O(n log n) time or better
2. Use O(n) space or less
3. Pass all test cases
4. Handle edge cases (empty input, single element, etc.)
""")

3. Weak Reflections

Bad: Generic or non-actionable insights

Good: Ensure reflections are specific

overrides = {
    "ReflectOnTrial": {
        "user_prompt": """...
Your reflection must include:
1. SPECIFIC failure point (not "it didn't work")
2. ROOT CAUSE analysis (why it failed)
3. CONCRETE alternative approach (exactly what to try next)

Your detailed reflection:"""
    }
}

4. Repeating Mistakes

Bad: Agent ignores previous learnings

Good: Emphasize memory usage in planning

overrides = {
    "PlanWithMemory": {
        "user_prompt": """...
Lessons from previous trials:
{memory}

Your plan MUST:
- Avoid approaches that already failed
- Build on what worked
- Try something meaningfully different if previous approaches all failed

Your plan:"""
    }
}

Troubleshooting

All Trials Failing

Symptom: Agent doesn’t find solution within max_trials

Solutions:

# 1. Increase trial budget
agent = ReflexionAgent(llm_configs=llm_configs, max_trials=7)

# 2. Provide more guidance
custom_instructions = """
APPROACH DIVERSITY:
If first 2 trials fail, try radically different approaches in subsequent trials.
Don't keep varying the same failed strategy.
"""

# 3. Check if task is solvable
# Verify task has clear solution and criteria

Weak Learning Between Trials

Symptom: Later trials don’t improve on earlier ones

Solutions:

# Strengthen reflection prompts
overrides = {
    "ReflectOnTrial": {
        "user_prompt": """...
Analyze deeply:
- What EXACTLY went wrong at which step?
- What assumption was incorrect?
- What approach should be abandoned entirely?
- What aspect should be kept for next trial?
- What SPECIFIC change should trial {trial_count + 1} make?

Your analysis:"""
    }
}

Premature Success Evaluation

Symptom: Agent declares success on partial solutions

Solutions:

# Make evaluation more rigorous
overrides = {
    "Evaluate": {
        "system_prompt": """You are a strict evaluator. Only declare SUCCESS
if the solution is 100% complete and correct. Partial solutions are FAILURE.""",
        "user_prompt": """...
Check every requirement:
{requirements_checklist}

All must be satisfied for SUCCESS.

Your evaluation:"""
    }
}

Next Steps

References