PrefPO: Optimizing Prompts through Preferences

Our team built PrefPO, a lightweight, pairwise preference prompt optimization framework, and today we're open-sourcing it!
Prompt engineering is tedious, and designing reliable prompts at scale is even harder. Automated prompt optimizers exist, but most require labeled data that's impractical to curate for every use case, and they often produce prompts that are bloated, unreadable, and not always aligned with what the author intended. Our answer is PrefPO, a lightweight prompt optimization framework inspired by RLHF that iteratively optimizes prompts through pairwise comparisons. Unlike other frameworks, PrefPO doesn't require a dataset: all you need is a starting prompt and natural language criteria for your task. PrefPO is competitive with, and on some tasks outperforms, SOTA prompt optimization methods on performance while producing prompts with far better prompt hygiene: shorter, less repetitive, and actually readable.
Prompts have become the dominant way of adapting large language models (LLMs) to new tasks, yet prompt engineering remains largely ad hoc and unprincipled. State-of-the-art automated prompt optimization methods like MIPROv2, GEPA, and TextGrad attempt to tackle this by systematically searching for effective prompts and can outperform carefully hand-crafted ones. But from our experience working with large enterprises, a few challenges persist:
- Prompt generalization matters. With large-scale deployments, evaluation data is never fully representative and edge cases are inevitable. And as needs evolve, prompts get iterated on constantly. Clean, maintainable prompts are necessary to avoid surprises in production and to iterate effectively. Current optimizers produce bloated and difficult-to-maintain prompts.
- Real tasks have many dimensions of quality. A customer service response needs more than correctness: tone and formatting are critical. In healthcare, the reasoning behind a decision matters as much as the decision itself. Most prompt optimizers optimize against a single correct answer and miss these nuances.
- Fast iteration is essential. New models release constantly, and prompts that work well on one model often don't transfer to another. Continuing to push the frontier requires fast iteration cycles, often without extensive datasets to rely on. Existing prompt optimizers aren't built for this.
Issues with existing methods are easy to see in practice. Below is a straightforward prompt that GPT-4o failed to consistently follow (80% pass rate across 20 runs), optimized with both TextGrad and PrefPO:

The difference is visible at a glance. TextGrad's prompt was 4,002 characters packed with internal self-check routines, banned character lists, and redundant formatting instructions. PrefPO produced a 350-character prompt (over 10x shorter) with no repetition and no self-checks. Human judges agreed, scoring PrefPO's prompt 5/6 for readability and clarity versus TextGrad's 3/6.
Both prompts achieve a 100% pass rate. But a 4,000-character prompt packed with unnecessary instructions is harder to read, debug, and maintain. Surface-level performance metrics like pass rate miss this entirely.
We built PrefPO with these challenges in mind. First, we formalize prompt hygiene, our way of measuring prompt quality beyond task performance.
Prompt Hygiene and Hacking
Prompts increasingly function like code in production systems: iterated on by multiple people, shared across teams, and maintained over time. So just like code, prompts should be evaluated not just on performance but also on their readability and maintainability. Prompt hygiene is a suite of programmatic metrics and evaluator-rated quality dimensions for assessing prompts for practical use. Programmatic metrics capture length, repetition, and similarity to the original prompt. Quality dimensions include readability (how easy the prompt is to understand), specification clarity (whether instructions are well-defined and consistent), and maintainability (how easy the prompt is to modify), all assessed by both LLM and human evaluators.
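To make the programmatic metrics concrete, here is a minimal Python sketch. The exact definitions PrefPO uses aren't spelled out in this post, so everything below is an assumption for illustration: the function names are hypothetical, repetition is approximated as the share of repeated word trigrams, and similarity uses the standard library's difflib.SequenceMatcher.

```python
from difflib import SequenceMatcher


def repetition_ratio(prompt: str, n: int = 3) -> float:
    """Share of word n-grams that duplicate an earlier n-gram.

    Assumed definition for illustration; the post does not specify one.
    """
    words = prompt.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


def hygiene_metrics(original: str, optimized: str) -> dict:
    """Length, repetition, and similarity to the original prompt."""
    return {
        "length_chars": len(optimized),
        "repetition_ratio": repetition_ratio(optimized),
        # 1.0 means identical to the original, 0.0 means nothing in common.
        "similarity_to_original": SequenceMatcher(None, original, optimized).ratio(),
    }
```

Metrics like these are cheap to track alongside task performance, so a prompt that balloons in length or drifts far from the original is visible immediately.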
By these metrics, existing methods struggle. On some tasks, TextGrad prompts ballooned to roughly 10,000 characters and MIPROv2 prompts contained over 34% repetitive content. But hygiene isn't the only issue. As often happens when LLMs face any sort of optimization pressure, we see hacking behavior. We call this prompt hacking: prompts that game evaluation criteria rather than genuinely fulfilling the task.

Prompt Hacking Example: TextGrad (left) inflates the original 50-sentence requirement to 65, while PrefPO-Minimal (right), a variation of PrefPO described below, achieves the same 100% pass rate without altering the constraint.
In the figure above, the original task requires the output to have at least 50 sentences. TextGrad's optimized prompt inflates this to 65 — increasing the chance of success but changing the output beyond what the task requires. PrefPO-Minimal (described below) maintains the original constraint while achieving the same 100% pass rate. We call this a numerical constraint manipulation hack. Vocabulary over-restriction is another common hack we observed: prompts that ban far more words than the task specifies to increase chances of success. Hacking takes many forms, making it hard to catch programmatically, so we tuned an LLM judge for detection. The core concern is that hacked prompts are optimized to look reasonable while changing outputs in ways you didn't ask for, making them both hard to spot and harmful.
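Numerical constraint manipulation is one of the few hacks with a cheap first-pass check. The sketch below is not our LLM judge, and it only covers this one hack type: it compares the numbers that appear in the original and optimized prompts and reports any that were dropped or introduced. The function name and the digits-only matching are assumptions for illustration, and a flagged number is a reason for a human or LLM to look closer, not proof of hacking.

```python
import re
from collections import Counter


def numeric_drift(original: str, optimized: str) -> dict:
    """Report numbers that disappeared from or were added to a prompt.

    Hypothetical helper: matches runs of digits only, so it will miss
    spelled-out numbers ("fifty") and cannot tell whether a change matters.
    """
    before = Counter(re.findall(r"\d+", original))
    after = Counter(re.findall(r"\d+", optimized))
    return {
        "dropped": sorted((before - after).elements()),
        "introduced": sorted((after - before).elements()),
    }
```

For the example above, an original prompt asking for "at least 50 sentences" and an optimized prompt asking for 65 would report 50 as dropped and 65 as introduced.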
PrefPO: Preference-Based Prompt Optimization
Our approach draws on a core insight from reinforcement learning from human feedback (RLHF): pairwise comparisons ("Is response A better than response B?") produce more reliable evaluation signal than absolute scores ("Rate this response from 1 to 10"). Instead of scoring prompts individually and optimizing, as TextGrad does, we compare two prompts head-to-head based on the quality of their outputs.

PrefPO Optimization Loop: Two prompts are sampled from the pool, and their outputs are then compared by an LLM discriminator. The losing prompt is improved by an LLM optimizer using the discriminator's feedback. The updated prompt is added back to the pool, and the process repeats.
PrefPO maintains a pool of candidate prompts. At each iteration, we sample two prompts from the pool, generate their outputs from the task model, and run them through two components:
- Discrimination. An LLM discriminator compares the two outputs against natural language criteria provided by the user (effectively a rubric describing the desired output). It then selects a preferred output and explains why the other fell short.
- Optimization. An LLM optimizer takes the non-preferred prompt along with the discriminator's feedback and produces an improved version, which is then added back to the prompt pool.
Crucially, this entire process works without labeled data because the natural language criteria replace labels. PrefPO shifts data labeling responsibility from practitioners to the LLM, which provides labels implicitly through discrimination. This lets users optimize quickly with limited data. When ground truth is available, we can also pass it to the discriminator for more accurate preferences. We evaluate both labeled and unlabeled settings in our experiments.
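The loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the released implementation: the three callables are hypothetical stand-ins for the task model, LLM discriminator, and LLM optimizer, and the pool is assumed to start with at least two candidates, keep every prompt, and be returned as-is (how PrefPO seeds the pool and picks a final prompt is not covered here).

```python
import random
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not PrefPO's actual API):
TaskModel = Callable[[str], str]                       # prompt -> output
Discriminator = Callable[[str, str, str], Tuple[int, str]]
# (criteria, output_a, output_b) -> (index of preferred output, feedback)
Optimizer = Callable[[str, str], str]                  # (prompt, feedback) -> new prompt


def prefpo_loop(
    initial_prompts: List[str],
    criteria: str,
    task_model: TaskModel,
    discriminator: Discriminator,
    optimizer: Optimizer,
    iterations: int = 5,
    seed: int = 0,
) -> List[str]:
    if len(initial_prompts) < 2:
        raise ValueError("this sketch assumes at least two starting prompts")
    rng = random.Random(seed)
    pool = list(initial_prompts)
    for _ in range(iterations):
        # Sample two distinct pool entries and generate their outputs.
        prompt_a, prompt_b = rng.sample(pool, 2)
        out_a, out_b = task_model(prompt_a), task_model(prompt_b)
        # Discrimination: pick the preferred output and explain the loss.
        winner, feedback = discriminator(criteria, out_a, out_b)
        loser_prompt = prompt_b if winner == 0 else prompt_a
        # Optimization: only the non-preferred prompt is rewritten.
        pool.append(optimizer(loser_prompt, feedback))
    return pool


# Toy stand-ins so the sketch runs without any LLM calls.
def echo_model(prompt: str) -> str:
    return prompt.upper()


def shorter_wins(criteria: str, out_a: str, out_b: str) -> Tuple[int, str]:
    return (0 if len(out_a) <= len(out_b) else 1), "Be more concise."


def trim_optimizer(prompt: str, feedback: str) -> str:
    return prompt.replace("please ", "")


pool = prefpo_loop(
    ["Summarize the text.", "Could you please summarize the text, please?"],
    criteria="The output is a concise summary.",
    task_model=echo_model,
    discriminator=shorter_wins,
    optimizer=trim_optimizer,
    iterations=3,
)
```

In a real run, each callable would wrap an LLM call, and the discriminator could also receive ground truth when it is available.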
Some key design choices:
- Optimization criteria are specified in natural language, so all you need to start is an initial prompt and a description of what you want.
- The discriminator and optimizer are LLMs that can be modified independently just by prompting. PrefPO-Minimal is a variation of PrefPO that leverages this: we added a single-line instruction to the optimizer to make minimal edits to the input prompt, which further improved prompt hygiene and reduced hacking.
- Only the non-preferred prompt gets optimized at each round. The intuition is that the loser has the winner to learn from, while the winner has no such reference. Crucially, this can still yield a candidate better than both. The diagram above shows this: each original prompt meets only one criterion, but the optimized prompt meets both.
Results
We evaluated PrefPO on two benchmarks: BIG-Bench Hard (BBH), a QA benchmark where we tested on nine subtasks, and IFEval-Hard, a set of 148 instruction-following tasks we curated by filtering the original IFEval dataset for examples that are particularly challenging for GPT-4o. BBH is a standard benchmark in the prompt optimization literature covering a range of reasoning tasks, and IFEval-Hard is an instruction-following benchmark testing more free-form generation.

Performance degradation on IFEval-Hard compared to IFEval: GPT-4o performance drops the most (~40%), but even frontier models, such as GPT-5.2 and Gemini 3, consistently drop by ~10%.
Performance. For BBH, we evaluated all other methods using labeled training data, and tested PrefPO in both labeled and unlabeled settings. PrefPO ranked first on more individual tasks (4 of 9) than any other method we tested against (MIPROv2, GEPA, TextGrad), though differences between top methods were small. On 6 of 9 tasks, PrefPO's unlabeled performance showed no meaningful difference from labeled performance, suggesting optimization often worked just as well without labels. On two tasks, unlabeled PrefPO outperformed all labeled methods.
On IFEval-Hard, each of the 148 samples has its own prompt to optimize. Outputs are checked programmatically against the task's requirements, and an optimized prompt is successful only if every output passes (20 runs per prompt). Neither PrefPO nor TextGrad has access to ground truth on prompt efficacy during optimization, reflecting real-world scenarios where labeled data is scarce; quality judgments are entirely LLM-determined. Results were close, with TextGrad slightly outperforming PrefPO (84.5% vs 82.4%).
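The all-runs-must-pass criterion is stricter than an average pass rate, and a short Python sketch makes the difference concrete. The function names are hypothetical, and the generate and check callables are placeholders for the task model and the programmatic checker.

```python
from typing import Callable


def pass_rate(prompt: str, generate: Callable[[str], str],
              check: Callable[[str], bool], runs: int = 20) -> float:
    """Fraction of runs whose output passes the programmatic check."""
    passed = sum(bool(check(generate(prompt))) for _ in range(runs))
    return passed / runs


def is_successful(prompt: str, generate: Callable[[str], str],
                  check: Callable[[str], bool], runs: int = 20) -> bool:
    """A prompt counts as successful only if every run passes."""
    return all(check(generate(prompt)) for _ in range(runs))
```

Under this criterion, a prompt that passes 16 of 20 runs has an 80% pass rate but still counts as a failure.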
PrefPO was competitive on performance, but showed substantially better prompt hygiene.

Prompt length (left) and repetition ratio (right) across BBH subtasks: Compared to the other prompt optimizers, PrefPO maintains lower length and repetition for most tasks. TextGrad prompts reached nearly 10,000 characters on some tasks; MIPROv2 exceeded 34% repetition.
Hygiene and Hacking. Across both benchmarks, PrefPO consistently produced shorter, cleaner prompts. On BBH, PrefPO produced prompts at least 3–5x shorter and less repetitive than methods like TextGrad and MIPROv2. The differences were just as dramatic on IFEval-Hard, with PrefPO performing better on all programmatic metrics and quality dimensions. Using the same evaluation rubric, both LLM and human evaluators rated PrefPO prompts higher than TextGrad's across every quality dimension (human judges: PrefPO 3.53/6 versus TextGrad's 2.60/6). PrefPO also showed substantially less hacking behavior: according to our LLM hacking judge, only 37% of PrefPO's prompts showed signs of hacking (31% for PrefPO-Minimal), compared to 86% for TextGrad.
Notably, we never explicitly designed for the hygiene or hacking improvements (except with PrefPO-Minimal). They emerged from the preference-based approach. We hypothesize that pairwise comparisons are harder to game and that only modifying the losing prompt prevents issues from compounding.
Practical Guide
When to use PrefPO:
- Your prompts are living artifacts. If your prompts will be read, maintained, and iterated on, PrefPO's hygiene advantage is significant.
- Success depends on more than correctness. Open-ended tasks have many ways to succeed or fail: tone, clarity, formatting, etc. PrefPO lets you specify all of these as optimization criteria.
- You have limited or no labeled data. PrefPO optimizes effectively without labels, especially on tasks with clear criteria. No dataset required to get started.
When PrefPO may not be the best fit:
- Prompt content doesn't matter. We acknowledge there are cases where the goal is purely performance and the prompt is just a means to an end. Other methods may perform similarly here.
- Few-shot prompts already work well enough. If adding a few example demonstrations to the prompt solves the problem, optimization may not be worth the overhead.
- Large labeled datasets are available. Some methods are designed specifically for this setting and may be more appropriate.
Conclusion
With its preference-based approach, PrefPO bets on scale: as models improve, they require less guidance and can better leverage minimal feedback signals for meaningful improvements.
Our evaluation covers QA tasks and instruction following, but we're most curious about other open-ended domains like code generation and creative writing, where prompt hygiene arguably matters even more. Larger human evaluations of both hygiene and hacking would help us understand how these properties affect practitioners in practice. We're also excited about extensions to PrefPO, including more intelligent prompt sampling strategies and support for optimizing multi-step LLM systems.
The full paper, "PrefPO: Pairwise Preference Prompt Optimization," and code are available on our GitHub. If you're interested in building the tools that define how some of the largest companies work with AI, we're hiring.


