
Automated Prompt Optimization: From AutoPrompt (2020) to TextGrad (2024)


TL;DR

Automated prompt optimization has gone through four eras in five years:

  • 2020–2021 (Foundational): discrete and continuous gradient methods — AutoPrompt, Pattern-Exploiting Training, Prefix-Tuning, Prompt Tuning. Required model gradients.
  • 2022–2023 (LLM-as-Optimizer): the breakthrough — use one LLM to optimize prompts for another. APE, OPRO, RLPrompt, Instruction Induction.
  • 2023–2024 (Algorithmic): sophisticated algorithms — EvoPrompt (evolutionary), APO (textual gradient descent), TEMPERA (test-time RL), DSPy (declarative pipelines).
  • 2024–2025 (Cutting-edge): TextGrad does true automatic differentiation through text; PromptAgent does strategic planning; MoP routes between expert prompts.

If you’re choosing today: DSPy for multi-stage LLM pipelines, TextGrad when you need gradient-style optimization through arbitrary LLM workflows, OPRO as a quick baseline that often beats human prompts by 8–50%.

Source: This article distills the living survey at github.com/sarkar-dipankar/llm-prompt-optimisation. The repo includes the original papers and a longer chronological report; this post is the long-form companion.

Foundational era: discrete gradient optimization (2020–2021)

The field began before LLMs were good enough to optimize themselves. The early methods all required white-box access to the model.

AutoPrompt (Oct 2020)

AutoPrompt was the pioneering breakthrough. It used gradient-based search over a fixed vocabulary to find discrete trigger tokens that elicit a target behavior from a masked LM. It established that prompts could be optimized programmatically rather than written by hand — a now-obvious idea that was anything but obvious in 2020.
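To make the search concrete, here is a minimal sketch of the gradient-guided token swap (the HotFlip-style first-order approximation the paper builds on). `model`, `loss_fn`, `embedding`, and `batch` are placeholders I've introduced, not the paper's code:

```python
import torch

def autoprompt_step(model, embedding, trigger_ids, pos, batch, loss_fn, k=10):
    """One search step: swap the trigger token at `pos` for the candidate
    that most decreases the loss, per a first-order approximation."""
    trig_embeds = embedding(trigger_ids).detach().requires_grad_(True)
    loss = loss_fn(model, trig_embeds, batch)
    loss.backward()
    grad = trig_embeds.grad[pos]                    # d(loss)/d(embedding at pos)
    # Approximate the loss change for every vocabulary token as a replacement
    scores = embedding.weight @ grad                # (vocab_size,)
    candidates = torch.topk(-scores, k).indices     # top-k predicted loss decrease
    # Re-evaluate the shortlist exactly and keep the best swap
    best_id, best_loss = trigger_ids[pos], loss.item()
    for cand in candidates:
        trial = trigger_ids.clone()
        trial[pos] = cand
        with torch.no_grad():
            trial_loss = loss_fn(model, embedding(trial), batch).item()
        if trial_loss < best_loss:
            best_id, best_loss = cand, trial_loss
    trigger_ids[pos] = best_id
    return trigger_ids
```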

Pattern-Exploiting Training (PET, Jan 2020)

PET reformulated NLP tasks as cloze-style fill-in-the-blank patterns, then trained on the resulting templates. It is the conceptual ancestor of “instruction templating” — formatting your task to match the pre-training distribution.
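For a flavor of the reformulation, here is an illustrative sentiment pattern and verbalizer in the spirit of the paper's Yelp setup (the names here are mine, not PET's API):

```python
# PET pattern: wrap the input so the label becomes a masked token
pattern = "{text} All in all, it was [MASK]."

# Verbalizer: map the label words predicted at [MASK] back to task labels
verbalizer = {"great": "positive", "terrible": "negative"}

print(pattern.format(text="Best pizza I've ever had!"))
# -> "Best pizza I've ever had! All in all, it was [MASK]."
```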

Prefix-Tuning (Jan 2021) and Prompt Tuning (Apr 2021)

Prefix-Tuning and Prompt Tuning shifted from discrete to continuous (soft) prompts — learnable vectors prepended to the input. Prompt Tuning showed soft prompting becomes competitive with full fine-tuning at scale (~10B+ parameters), establishing the pattern of “small parameter set tunes a large frozen model.”
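A minimal PyTorch sketch of the idea: a small bank of trainable vectors prepended to the frozen model's input embeddings. Shapes and names are illustrative.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, n_tokens: int, hidden: int):
        super().__init__()
        # The ONLY trainable parameters: n_tokens learnable "virtual tokens"
        self.prompt = nn.Parameter(torch.randn(n_tokens, hidden) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # Prepend the soft prompt to every sequence in the batch
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# Usage: freeze the base model, train only the soft prompt
# for p in base_model.parameters(): p.requires_grad = False
```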

These methods all need model gradients, so they only work with open-weights models — a constraint the next era removed.

LLM-as-Optimizer era: foundation model breakthroughs (2022–2023)

The key insight: the LLM itself is now smart enough to be the optimizer. No gradients required.

Instruction Induction (May 2022)

Showed that LLMs can infer the natural-language instruction that generated a set of examples — opening the door to using an LLM as a hypothesis generator over the prompt space.
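An illustrative induction prompt in the style of the paper (the exact meta-prompt differs):

```python
induction_prompt = """I gave a friend an instruction. Based on the
instruction they produced these input-output pairs:

Input: cat    Output: chat
Input: dog    Output: chien
Input: house  Output: maison

The instruction was:"""
# A capable LLM completes: "Translate the English word into French."
```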

Automatic Prompt Engineer (APE, Nov 2022)

APE was the systematic breakthrough: use an LLM to generate candidate instructions, score them on a held-out set, iterate. APE-generated prompts matched or exceeded human-written ones across many benchmarks.
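The loop is small enough to sketch in a few lines; `llm` and `score` (held-out accuracy) are placeholders, and the resampling prompt is illustrative:

```python
def ape(demos, llm, score, n_candidates=20, n_rounds=3):
    """APE-style search: LLM proposes instructions, a held-out set scores them."""
    meta = "Here are input-output examples:\n{demos}\nThe instruction was:"
    candidates = [llm(meta.format(demos=demos)) for _ in range(n_candidates)]
    for _ in range(n_rounds):
        ranked = sorted(candidates, key=score, reverse=True)
        best = ranked[: n_candidates // 4]
        # Resample: ask the LLM for paraphrases of the current best
        candidates = best + [
            llm(f"Generate a variation of this instruction:\n{b}") for b in best
        ]
    return max(candidates, key=score)
```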

RLPrompt (May 2022)

Reinforcement-learning approach where a policy network proposes discrete prompts and is trained on the resulting downstream reward — RL applied to the prompt space directly.
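RLPrompt itself uses a soft Q-learning formulation; the sketch below shows the simpler REINFORCE version of the same idea, with `policy` (a small network emitting logits over prompt tokens) and `task_reward` (a black-box downstream score) as placeholders:

```python
import torch

def rlprompt_step(policy, optimizer, task_reward):
    # Sample a discrete prompt (token ids) from the policy network
    logits = policy()                                   # (prompt_len, vocab)
    dist = torch.distributions.Categorical(logits=logits)
    prompt_ids = dist.sample()                          # (prompt_len,)
    # Downstream reward: task performance with this prompt; the task
    # model is a black box here, only the policy needs gradients
    reward = task_reward(prompt_ids)
    # REINFORCE: increase the log-probability of high-reward prompts
    loss = -dist.log_prob(prompt_ids).sum() * reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```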

OPRO: Optimization by PROmpting (Sep 2023)

OPRO made the “LLM as optimizer” pattern fully general: the LLM is given a meta-prompt describing the task, prior candidate solutions, and their scores, and asked to propose better solutions. Iterate. Reported 8–50% improvements over human-written prompts. The honest caveat: gains are biggest on tasks whose prompts haven’t already been heavily hand-tuned.
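A hedged sketch of the meta-prompt loop, with `llm` and `evaluate` as placeholders (the paper shows prior solutions in ascending score order, as here):

```python
def opro(task_desc, llm, evaluate, n_steps=20, top_k=10):
    """OPRO-style loop: show the optimizer LLM its best attempts so far."""
    trajectory = []  # (score, prompt) pairs
    for _ in range(n_steps):
        history = "\n".join(
            f"text: {p}\nscore: {s}" for s, p in sorted(trajectory)[-top_k:]
        )
        meta_prompt = (
            f"{task_desc}\n\nPrevious instructions and their scores "
            f"(higher is better):\n{history}\n\n"
            "Write a new instruction that scores higher:"
        )
        candidate = llm(meta_prompt)
        trajectory.append((evaluate(candidate), candidate))
    return max(trajectory)[1]
```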

Advanced algorithmic sophistication (2023–2024)

This era brought richer search algorithms and the first declarative frameworks.

EvoPrompt (Sep 2023)

Evolutionary algorithms applied to prompt optimization — population of prompt candidates, mutation and crossover operators implemented as LLM calls, fitness from downstream task performance. Strong on tasks where the search space has many local optima.
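One generation in sketch form, with mutation and crossover implemented as LLM calls; `llm` and `fitness` are placeholders and the operator prompts are illustrative:

```python
import random

def evoprompt_generation(population, llm, fitness, pop_size=10):
    """One EvoPrompt-style generation: select, cross over, mutate."""
    scored = sorted(population, key=fitness, reverse=True)
    survivors = scored[: pop_size // 2]
    children = []
    while len(survivors) + len(children) < pop_size:
        a, b = random.sample(survivors, 2)
        # Crossover and mutation are themselves prompts to an LLM
        child = llm(
            "Combine the best parts of these two prompts into one:\n"
            f"1. {a}\n2. {b}"
        )
        child = llm(f"Slightly rephrase this prompt, keeping its intent:\n{child}")
        children.append(child)
    return survivors + children
```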

APO: Automatic Prompt Optimization (May 2023)

APO introduces “natural language gradient descent”: the LLM examines failed examples, generates a natural-language critique (the gradient), and edits the prompt in the direction of that critique. The intuition is gradient descent where the loss signal is the LLM’s own analysis of why it is wrong.
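One textual gradient step in sketch form (`llm` is a placeholder; the critique and edit prompts are illustrative, not the paper's exact templates):

```python
def apo_step(prompt, failures, llm):
    """One step of 'natural language gradient descent', APO-style."""
    # The "gradient": a critique of why the current prompt fails
    gradient = llm(
        f"Prompt:\n{prompt}\n\nIt failed on these examples:\n{failures}\n"
        "Explain what about the prompt causes these failures:"
    )
    # The "update": edit the prompt in the direction of the critique
    return llm(
        f"Prompt:\n{prompt}\n\nCritique:\n{gradient}\n"
        "Rewrite the prompt to fix the critique:"
    )
```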

TEMPERA (Nov 2022)

Test-time reinforcement learning — adapts prompts on the fly during inference based on partial feedback, rather than learning a single prompt offline.

DSPy (2023–2024)

DSPy (Stanford) shifted the abstraction: declarative LLM pipelines. Instead of optimizing single prompts, you define modules (Predict, ChainOfThought, ReAct) and a metric, and DSPy compiles the pipeline by optimizing all the underlying prompts and few-shot demonstrations jointly. It is the closest thing the field has to a standard for production LLM applications today.
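A minimal example of the abstraction, based on DSPy's documented API; module paths have moved between versions, so treat this as a sketch. `exact_match` and `train_examples` are placeholders you would supply:

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

class QA(dspy.Signature):
    """Answer the question in one short sentence."""
    question = dspy.InputField()
    answer = dspy.OutputField()

class Pipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        # The prompt and few-shot demos for this module get compiled for you
        self.answer = dspy.ChainOfThought(QA)

    def forward(self, question):
        return self.answer(question=question)

# "Compiling" optimizes prompts and demonstrations against the metric
optimizer = BootstrapFewShot(metric=exact_match)
compiled = optimizer.compile(Pipeline(), trainset=train_examples)
```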

Cutting-edge neural and multimodal methods (2024–2025)

TextGrad (Jun 2024)

TextGrad is conceptually the most ambitious recent work: automatic differentiation through arbitrary text-based pipelines, where “gradients” are natural-language critiques propagated backward through the computation graph. Any system you can express as a graph of LLM calls becomes optimizable. The framing is a deliberate echo of PyTorch’s autograd, but for prompts.
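A sketch following TextGrad's published quickstart API (engine names and exact signatures may differ across versions, so treat the details as assumptions):

```python
import textgrad as tg

tg.set_backward_engine("gpt-4o", override=True)  # LLM that writes the "gradients"

# The system prompt is the parameter we optimize
system_prompt = tg.Variable(
    "Answer the question concisely and accurately.",
    requires_grad=True,
    role_description="system prompt",
)
model = tg.BlackboxLLM("gpt-4o", system_prompt=system_prompt)
optimizer = tg.TGD(parameters=[system_prompt])

question = tg.Variable("What causes ocean tides?",
                       requires_grad=False, role_description="question")
answer = model(question)

# The loss is itself natural language
loss_fn = tg.TextLoss("Critique this answer for correctness and concision.")
loss = loss_fn(answer)
loss.backward()    # critiques propagate backward through the graph
optimizer.step()   # the system prompt is rewritten per its "gradient"
```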

PromptAgent (2024)

Treats prompt optimization as strategic planning — Monte Carlo tree search over the prompt space, expanding promising branches based on simulated outcomes. Effective on tasks where the optimal prompt requires multi-step reasoning about the task itself.
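A stripped-down skeleton of the search (plain UCT; PromptAgent adds error-feedback-guided expansion and other refinements). `expand` (an LLM proposing edited prompts) and `evaluate` (a minibatch score) are placeholders:

```python
import math
import random

class Node:
    def __init__(self, prompt, parent=None):
        self.prompt, self.parent = prompt, parent
        self.children, self.visits, self.value = [], 0, 0.0

def uct(n, c=1.4):
    if n.visits == 0:
        return float("inf")  # explore unvisited children first
    return n.value / n.visits + c * math.sqrt(math.log(n.parent.visits) / n.visits)

def mcts_prompt_search(seed_prompt, expand, evaluate, iterations=50):
    root = Node(seed_prompt)
    for _ in range(iterations):
        node = root
        while node.children:                              # selection
            node = max(node.children, key=uct)
        node.children = [Node(p, node) for p in expand(node.prompt)]  # expansion
        leaf = random.choice(node.children) if node.children else node
        reward = evaluate(leaf.prompt)                    # simulation
        while leaf:                                       # backpropagation
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    best = max(root.children, key=lambda n: n.value / max(n.visits, 1))
    return best.prompt
```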

MoP: Mixture-of-expert prompts (2024)

A learned router sends each input to one of several specialized expert prompts, much like MoE in model architecture but at the prompt level. Useful when one prompt cannot handle the full input distribution well.
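The simplest version of the routing idea in sketch form; the paper clusters demonstrations and assigns experts, while this just routes by embedding similarity:

```python
import numpy as np

def route(x_embedding, expert_centroids, expert_prompts):
    """Send the input to the expert prompt whose centroid is nearest."""
    sims = expert_centroids @ x_embedding   # cosine sims if rows are normalized
    return expert_prompts[int(np.argmax(sims))]

# expert_centroids: (n_experts, d) cluster centers over training inputs
# expert_prompts:   one specialized prompt optimized per cluster
```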

Hard Prompts Made Easy (2023)

Gradient-based discrete optimization that produces interpretable hard prompts rather than soft prompts. Combines the optimization power of continuous methods with the portability of natural-language prompts.
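The core trick, in a hedged PyTorch sketch: optimize continuous embeddings, but run the forward pass through their nearest-neighbor vocabulary projection so the result decodes to real tokens. `loss_fn` is a placeholder:

```python
import torch

def hard_prompt_step(soft_embeds, vocab_embeds, loss_fn, lr=0.1):
    # Project each continuous vector onto its nearest vocabulary embedding
    ids = torch.cdist(soft_embeds, vocab_embeds).argmin(dim=-1)
    hard = vocab_embeds[ids].detach().requires_grad_(True)
    # Forward/backward through the PROJECTED (discrete) prompt...
    loss = loss_fn(hard)
    loss.backward()
    # ...but apply the gradient to the continuous parameters
    with torch.no_grad():
        soft_embeds -= lr * hard.grad
    return soft_embeds, ids   # ids decode to a readable hard prompt
```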

Multimodal and domain-specific specialization

The optimization techniques have spread beyond text:

  • Evolutionary prompt optimization for vision-language models (Mar 2025)
  • Acoustic Prompt Tuning for audio-language models (Nov 2023)
  • Prochemy for automated code-generation prompt optimization (Mar 2025)
  • MathPrompter for mathematical reasoning unification (2023)

The pattern is consistent: each modality borrows the optimizer pattern that worked for text and adapts it.

How to pick an optimizer

Use this decision matrix:

| If you need… | Use |
| --- | --- |
| A quick baseline beating hand-written prompts | OPRO — minimal setup, often 8–50% gains |
| Production multi-stage LLM pipelines | DSPy — declarative, jointly optimizes all stages |
| Gradient-style optimization through arbitrary LLM graphs | TextGrad |
| Deep search over a complex task | PromptAgent (MCTS) or EvoPrompt (evolutionary) |
| Soft prompts on open-weights models | Prefix-Tuning / Prompt Tuning |
| Specialized routing by input type | MoP |
| Test-time adaptation | TEMPERA |

Comparison: method × paradigm × access requirement

| Method | Year | Paradigm | Needs model gradients? | Open-source |
| --- | --- | --- | --- | --- |
| AutoPrompt | 2020 | Discrete gradient | Yes | Yes |
| Pattern-Exploiting Training | 2020 | Templating + fine-tune | Yes | Yes |
| Prefix-Tuning | 2021 | Soft prompt | Yes | Yes |
| Prompt Tuning | 2021 | Soft prompt | Yes | Yes |
| Instruction Induction | 2022 | LLM-as-hypothesizer | No | Partial |
| APE | 2022 | LLM-as-optimizer | No | Yes |
| RLPrompt | 2022 | RL on prompts | No (uses downstream reward) | Yes |
| OPRO | 2023 | LLM-as-optimizer | No | Yes |
| EvoPrompt | 2023 | Evolutionary | No | Yes |
| APO | 2023 | Textual gradient | No | Yes |
| TEMPERA | 2022 | Test-time RL | No | Partial |
| DSPy | 2023+ | Declarative pipeline compile | No | Yes |
| TextGrad | 2024 | Text autodiff | No | Yes |
| PromptAgent | 2024 | MCTS planning | No | Yes |
| MoP | 2024 | Mixture-of-experts routing | No | Partial |

Practical takeaway: since 2022, almost everything works on closed APIs. The choice is now about which abstraction matches your problem, not whether you have gradient access.

How optimization interacts with compression and structure

Optimized prompts are often verbose because the optimizer was rewarded for accuracy alone. Co-designing optimization with prompt compression — including length in the objective, or running compression as a post-processing step — typically yields a Pareto-better result. And the optimization techniques here only work as well as the underlying prompt structure — give them a well-formed system prompt, instruction hierarchy, and clear evaluation set.
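The co-design can be as simple as wrapping the metric that any of the score-based optimizers above (OPRO, APE, EvoPrompt) already consume; `lam` here is an assumed tradeoff weight, not a value from any paper:

```python
def length_aware_score(prompt, accuracy_fn, lam=0.01):
    # Reward task accuracy, penalize verbosity, so the optimizer
    # cannot buy accuracy with unbounded prompt length
    return accuracy_fn(prompt) - lam * len(prompt.split())
```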

FAQ

DSPy vs TextGrad — which should I use?

DSPy if you have a multi-stage LLM pipeline you want to compile and optimize as a unit (the “PyTorch for prompts” framing). TextGrad if you need to back-propagate optimization through an arbitrary text-based system that doesn’t fit DSPy’s module abstractions. Many teams use DSPy for pipeline structure and TextGrad-style critique loops inside a single complex stage.

Do these methods need access to model gradients?

The 2020–2021 methods (AutoPrompt, Prefix-Tuning, Prompt Tuning) do. Everything from 2022 onward (APE, OPRO, EvoPrompt, APO, DSPy, TextGrad, PromptAgent) works against closed APIs — the “gradient” is a natural-language critique or a downstream score, not a backprop signal.

Does OPRO actually beat human prompts?

Often, yes — the original paper reports 8–50% gains. The catch: gains are largest on tasks where humans haven’t already heavily iterated. On well-known benchmarks where teams have spent months on prompt engineering, gains shrink.

What’s the difference between prompt optimization and fine-tuning?

Fine-tuning changes the model’s weights. Prompt optimization changes the input to a frozen model. Optimization is faster, cheaper, reversible, and works on closed APIs — but it has a lower ceiling than fine-tuning for tasks the base model can’t represent at all.

Are soft prompts dead now?

No, but their niche has narrowed. They remain attractive when (a) you have white-box access to the model, (b) you want maximum compression of task knowledge into a small parameter set, and (c) you’re going to deploy the same model with the soft prompt at inference. For closed APIs they’re a non-starter.

What about prompt injection — does optimization make it worse?

Optimized prompts have the same injection surface as hand-written ones. The defense lives elsewhere — see the Instruction Hierarchy discussion in Prompt Structuring Techniques and the broader stack in LLM Safety Techniques.

Sources & further reading


Related reading on this site: Prompt Structuring Techniques for the foundations the optimizers operate on; LLM Prompt Compression for co-designing length into the objective; LLM Safety Techniques for the safety stack around your optimized prompts.
