Quality and performance

Risk tells you whether your agent handled sensitive data or took consequential actions. Quality and performance tell you whether it actually did its job well. These are different questions — a governance review asks the first; a business review asks the second — and an agent can score well on one while failing the other.

Quality for an AI agent is whether it produces outputs that are accurate, relevant, and useful given what was asked of it. This is harder to define than correctness in traditional software, where a function either returns the right result or it does not. An agent’s outputs are often natural language, judgement calls, or sequences of decisions that only make sense in context — and evaluating them requires knowing what “right” looks like for that task.

Reliability is part of quality: whether runs complete, whether tool calls succeed, whether the agent stays on task. These are the technical foundations on which usefulness depends. An agent that fails half its runs does not have a performance problem; it has a quality problem.
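As an illustration of the reliability signals described above, the following minimal sketch computes a run completion rate and a tool-call success rate from a list of run records. The `Run` and `ToolCall` shapes and their field names are hypothetical, chosen for the example; they are not Prefactor's schema or API.

```python
from dataclasses import dataclass, field


# Hypothetical record shapes for illustration only (not Prefactor's schema).
@dataclass
class ToolCall:
    name: str
    succeeded: bool


@dataclass
class Run:
    run_id: str
    completed: bool
    tool_calls: list[ToolCall] = field(default_factory=list)


def reliability_summary(runs):
    """Return completion and tool-call success rates; None when there is no data."""
    calls = [call for run in runs for call in run.tool_calls]
    completed = sum(1 for run in runs if run.completed)
    ok_calls = sum(1 for call in calls if call.succeeded)
    return {
        "completion_rate": completed / len(runs) if runs else None,
        "tool_success_rate": ok_calls / len(calls) if calls else None,
    }


runs = [
    Run("r1", True, [ToolCall("search", True), ToolCall("fetch", True)]),
    Run("r2", False, [ToolCall("search", False)]),
    Run("r3", True, [ToolCall("search", True)]),
    Run("r4", False, []),
]
summary = reliability_summary(runs)
```

With this sample data, half the runs complete and three of four tool calls succeed. These rates say nothing about whether completed runs were useful; they only show whether the foundations are in place.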

Prefactor records the inputs and outputs of each step in a run, giving you the evidence to assess quality, compare runs, and investigate when something looks wrong. That record is not a quality score — evaluating whether an output was actually good requires knowing what the task was — but it is what makes quality review possible at scale.

Performance is whether the agent, as a whole, is delivering the business value it was put in place to deliver. It is the question you would ask of a person or a team: are they performing? Not whether any individual output was correct, but whether the initiative is working — whether the organisation is getting the return it needed from deploying this agent.

An agent can produce individually high-quality runs and still not be performing, if those runs are not achieving the outcomes the business needed. Conversely, an agent that sometimes produces imperfect outputs may still be performing well if, in aggregate, it is moving the business in the right direction.

Performance in this sense cannot be read off a single run or a technical metric. It requires a view across many runs, over time, against the outcomes the agent was intended to drive.
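One way to picture a view across many runs over time is to group runs by week and track how often the intended business outcome was achieved. The sketch below assumes each run can be reduced to a date and a yes/no outcome flag; that flag and the function name are hypothetical, and deciding what counts as the outcome is a business judgement the code cannot make.

```python
from collections import defaultdict
from datetime import date


def weekly_outcome_rate(runs):
    """runs: iterable of (run_date, outcome_achieved) pairs (hypothetical shape).

    Returns {(iso_year, iso_week): share of runs that achieved the outcome}.
    """
    buckets = defaultdict(lambda: [0, 0])  # key -> [achieved, total]
    for run_date, achieved in runs:
        iso = run_date.isocalendar()
        key = (iso[0], iso[1])  # ISO year and ISO week number
        buckets[key][0] += int(achieved)
        buckets[key][1] += 1
    return {key: hit / total for key, (hit, total) in sorted(buckets.items())}


runs = [
    (date(2024, 1, 1), True),
    (date(2024, 1, 3), False),
    (date(2024, 1, 8), True),
    (date(2024, 1, 9), True),
]
rates = weekly_outcome_rate(runs)
```

Here the first ISO week of 2024 has an outcome rate of 0.5 and the second has 1.0. A trend like this only becomes meaningful over many more runs than the sample shows.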

Agents change for reasons that are not always obvious or deliberate. The underlying model may be updated by the provider. Code or prompts may be revised. The way people interact with the agent shifts as it becomes familiar. Any of these can affect quality and performance — and without a continuous record, it is hard to know whether something changed, when it changed, or what caused it.

Prefactor’s run history gives you a continuous record against which change becomes visible. When something shifts — for better or worse — you have the before and after to compare.
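A before-and-after comparison of this kind might look like the sketch below, which groups runs by version and compares the mean of a review score. It assumes each run record carries a version label and that a reviewer has assigned a score from 1 to 5; the dictionary keys and the score are hypothetical and do not reflect Prefactor's data model.

```python
from statistics import mean


def mean_score_by_version(runs):
    """Group hypothetical run records by version and average their scores."""
    scores = {}
    for run in runs:
        scores.setdefault(run["version"], []).append(run["score"])
    return {version: mean(values) for version, values in scores.items()}


def change_between(runs, before, after):
    """Difference in mean score between two versions (after minus before)."""
    means = mean_score_by_version(runs)
    return means[after] - means[before]


runs = [
    {"version": "v1", "score": 4},
    {"version": "v1", "score": 5},
    {"version": "v1", "score": 3},
    {"version": "v2", "score": 2},
    {"version": "v2", "score": 3},
]
means = mean_score_by_version(runs)
delta = change_between(runs, "v1", "v2")
```

In this sample the mean score falls from 4 to 2.5 between versions. A difference like this shows that something changed and roughly when; finding the cause still means inspecting the individual runs on either side.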

  • Risk: the separate question of what data an agent handled and how consequential its actions were.
  • Span: where output data and tool results are recorded.
  • Instance: the run-level record, including lifecycle state and version.