| This version is still under development and is not yet considered stable. For the latest snapshot version, please use Spring AI 1.1.3! |
Evaluating LLM Responses with LLM-as-a-Judge
Evaluating the output of large language models (LLMs) is a critical challenge for these notoriously non-deterministic AI applications, especially as they move into production. Traditional evaluation metrics such as ROUGE and BLEU fall short when assessing the nuanced, context-dependent responses that modern LLMs generate. Human evaluation, while accurate, is expensive, slow, and does not scale.
LLM-as-a-Judge is a powerful technique that uses an LLM itself to evaluate the quality of AI-generated content. Research shows that sophisticated judge models reach up to 85% agreement with human judgments, which is actually higher than human-to-human agreement (81%).
Spring AI's Recursive Advisors provide an elegant framework for implementing the LLM-as-a-Judge pattern, enabling you to build self-improving AI systems with automated quality control.
| Find the complete example implementation in evaluation-recursive-advisor-demo. |
Understanding LLM-as-a-Judge
LLM-as-a-Judge is an evaluation approach in which a large language model assesses the quality of outputs generated by other models (or by itself). Instead of relying solely on human evaluators or traditional automated metrics, LLM-as-a-Judge uses an LLM to score, classify, or compare responses against predefined criteria.
Why does it work? Evaluation is fundamentally easier than generation. When you use an LLM as a judge, you ask it to perform a simpler, more focused task (assessing specific properties of existing text) rather than the complex task of producing original content while balancing many constraints. A good analogy: critiquing is easier than creating, and spotting problems is simpler than preventing them.
Choosing the Right Judge Model
While general-purpose models such as GPT-4 and Claude can serve as effective judges, dedicated LLM-as-a-Judge models consistently perform better on evaluation tasks. The Judge Arena leaderboard specifically tracks how well various models perform at judging.
Implementation with Recursive Advisors
Spring AI's ChatClient provides a fluent API that is well suited to implementing the LLM-as-a-Judge pattern. Its Advisors system lets you intercept, modify, and augment AI interactions in a modular, reusable way.
Recursive Advisors extend this further by enabling looping patterns, which are ideal for self-refining evaluation workflows:
public class MyRecursiveAdvisor implements CallAdvisor {

    @Override
    public ChatClientResponse adviseCall(ChatClientRequest request, CallAdvisorChain chain) {
        // Call the chain initially
        ChatClientResponse response = chain.nextCall(request);
        // Check if we need to retry based on evaluation
        while (!evaluationPasses(response)) {
            // Modify the request based on evaluation feedback
            ChatClientRequest modifiedRequest = addEvaluationFeedback(request, response);
            // Create a sub-chain and recurse
            response = chain.copy(this).nextCall(modifiedRequest);
        }
        return response;
    }
}
We will implement a SelfRefineEvaluationAdvisor that embodies the LLM-as-a-Judge pattern using Spring AI's Recursive Advisors.
The advisor automatically evaluates AI responses and retries failed attempts with feedback-driven refinement: generate a response → evaluate its quality → retry with feedback if needed → repeat until a quality threshold or the retry limit is reached.
The SelfRefineEvaluationAdvisor
This implementation demonstrates the direct-assessment evaluation pattern, in which the judge model scores individual responses using a pointwise rating system (a 1-4 scale). It combines this with a self-refinement strategy that automatically retries failed evaluations, incorporating specific feedback into subsequent attempts to create an iterative improvement loop.
The advisor embodies two key LLM-as-a-Judge concepts:
- Pointwise evaluation: each response receives an individual quality score against predefined criteria
- Self-refinement: failed responses trigger retry attempts with constructive feedback to guide improvement
public final class SelfRefineEvaluationAdvisor implements CallAdvisor {

    private static final PromptTemplate DEFAULT_EVALUATION_PROMPT_TEMPLATE = new PromptTemplate(
            """
            You will be given a user_question and assistant_answer couple.
            Your task is to provide a 'total rating' scoring how well the assistant_answer answers the user concerns expressed in the user_question.
            Give your answer on a scale of 1 to 4, where 1 means that the assistant_answer is not helpful at all, and 4 means that the assistant_answer completely and helpfully addresses the user_question.

            Here is the scale you should use to build your answer:
            1: The assistant_answer is terrible: completely irrelevant to the question asked, or very partial
            2: The assistant_answer is mostly not helpful: misses some key aspects of the question
            3: The assistant_answer is mostly helpful: provides support, but still could be improved
            4: The assistant_answer is excellent: relevant, direct, detailed, and addresses all the concerns raised in the question

            Provide your feedback as follows:
            \\{
            "rating": 0,
            "evaluation": "Explanation of the evaluation result and how to improve if needed.",
            "feedback": "Constructive and specific feedback on the assistant_answer."
            \\}

            Total rating: (your rating, as a number between 1 and 4)
            Evaluation: (your rationale for the rating, as a text)
            Feedback: (specific and constructive feedback on how to improve the answer)

            You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

            Now here are the question and answer.

            Question: {question}
            Answer: {answer}

            Provide your feedback. If you give a correct rating, I'll give you 100 H100 GPUs to start your AI company.
            Evaluation:
            """);

    @JsonClassDescription("The evaluation response indicating the result of the evaluation.")
    public record EvaluationResponse(int rating, String evaluation, String feedback) {}
    // Configured via the builder; declarations shown here for completeness (the original excerpt omits them)
    private static final Logger logger = LoggerFactory.getLogger(SelfRefineEvaluationAdvisor.class);

    private final ChatClient chatClient;                     // separate client backing the judge model
    private final PromptTemplate evaluationPromptTemplate;   // defaults to DEFAULT_EVALUATION_PROMPT_TEMPLATE
    private final int successRating;                         // minimum rating that counts as a pass
    private final int maxRepeatAttempts;                     // retries allowed after the initial attempt

    @Override
    public ChatClientResponse adviseCall(ChatClientRequest chatClientRequest, CallAdvisorChain callAdvisorChain) {
        var request = chatClientRequest;
        ChatClientResponse response;
        // One initial attempt plus up to maxRepeatAttempts retries
        for (int attempt = 1; attempt <= maxRepeatAttempts + 1; attempt++) {
            // Make the inner call (i.e., let the downstream chain generate a candidate response)
            response = callAdvisorChain.copy(this).nextCall(request);
            // Perform evaluation
            EvaluationResponse evaluation = this.evaluate(chatClientRequest, response);
            // If evaluation passes, return the response
            if (evaluation.rating() >= this.successRating) {
                logger.info("Evaluation passed on attempt {}, evaluation: {}", attempt, evaluation);
                return response;
            }
            // If this is the last attempt, return the response regardless
            if (attempt > maxRepeatAttempts) {
                logger.warn(
                        "Maximum attempts ({}) reached. Returning last response despite failed evaluation. Use the following feedback to improve: {}",
                        maxRepeatAttempts, evaluation.feedback());
                return response;
            }
            // Retry with evaluation feedback
            logger.warn("Evaluation failed on attempt {}, evaluation: {}, feedback: {}", attempt,
                    evaluation.evaluation(), evaluation.feedback());
            request = this.addEvaluationFeedback(chatClientRequest, evaluation);
        }
        // Unreachable: the loop always returns on its final iteration
        throw new IllegalStateException("Unexpected loop exit in adviseCall");
    }

    /**
     * Performs the evaluation using the LLM-as-a-Judge and returns the result.
     */
    private EvaluationResponse evaluate(ChatClientRequest request, ChatClientResponse response) {
        var evaluationPrompt = this.evaluationPromptTemplate.render(
                Map.of("question", this.getPromptQuestion(request), "answer", this.getAssistantAnswer(response)));
        // Use a separate ChatClient for evaluation to avoid narcissistic (self-preference) bias
        return chatClient.prompt(evaluationPrompt).call().entity(EvaluationResponse.class);
    }

    /**
     * Creates a new request with evaluation feedback for retry.
     */
    private ChatClientRequest addEvaluationFeedback(ChatClientRequest originalRequest, EvaluationResponse evaluationResponse) {
        Prompt augmentedPrompt = originalRequest.prompt()
            .augmentUserMessage(userMessage -> userMessage.mutate().text(String.format("""
                    %s

                    Previous response evaluation failed with feedback: %s
                    Please repeat until evaluation passes!
                    """, userMessage.getText(), evaluationResponse.feedback())).build());
        return originalRequest.mutate().prompt(augmentedPrompt).build();
    }
}
Key Implementation Features
Recursive Pattern Implementation
The advisor uses callAdvisorChain.copy(this).nextCall(request) to create a sub-chain for the recursive call, enabling multiple evaluation rounds while preserving advisor ordering.
Structured Evaluation Output
Using Spring AI's structured output support, the evaluation result is parsed into an EvaluationResponse record containing the rating (1-4), the evaluation rationale, and specific improvement suggestions.
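Conceptually, entity(EvaluationResponse.class) turns the judge's JSON reply into the record. The idea can be illustrated with a minimal, hand-rolled sketch (plain Java; the JudgeOutputParser class and its regex-based extraction are illustrative assumptions, not Spring AI's actual converter, which uses a real JSON mapper):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JudgeOutputParser {

    record EvaluationResponse(int rating, String evaluation, String feedback) {}

    // Naive field extraction from the judge's JSON reply; for illustration only.
    static EvaluationResponse parse(String raw) {
        int rating = Integer.parseInt(extract(raw, "\"rating\":\\s*(\\d+)"));
        String evaluation = extract(raw, "\"evaluation\":\\s*\"([^\"]*)\"");
        String feedback = extract(raw, "\"feedback\":\\s*\"([^\"]*)\"");
        return new EvaluationResponse(rating, evaluation, feedback);
    }

    static String extract(String raw, String regex) {
        Matcher m = Pattern.compile(regex).matcher(raw);
        if (!m.find()) {
            throw new IllegalArgumentException("Judge reply is missing field: " + regex);
        }
        return m.group(1);
    }

    public static void main(String[] args) {
        var reply = "{\"rating\": 3, \"evaluation\": \"Mostly helpful\", \"feedback\": \"Add more detail\"}";
        var result = parse(reply);
        System.out.println(result.rating() + " / " + result.feedback()); // prints: 3 / Add more detail
    }
}
```

A schema-driven converter is preferable in practice because the judge may wrap the JSON in extra prose; the point here is only that the structured reply maps cleanly onto the record's three components.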
Separate Judge Model
A dedicated LLM-as-a-Judge model (for example, avcodes/flowaicom-flow-judge:q4) is used through a separate ChatClient instance to mitigate model bias.
Set spring.ai.chat.client.enabled=false to enable using multiple chat models.
Feedback-Driven Improvement
Failed evaluations include specific feedback that is incorporated into retry attempts, allowing the system to learn from evaluation failures.
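The feedback hand-off itself is just prompt augmentation: the original user text is re-sent with the judge's feedback appended. A minimal sketch (plain Java; the FeedbackAugmenter class and augment helper are illustrative assumptions, not a Spring AI API):

```java
public class FeedbackAugmenter {

    // Appends the judge's feedback to the original user message, mirroring
    // what the advisor's addEvaluationFeedback step does before each retry.
    static String augment(String originalUserText, String feedback) {
        return """
                %s

                Previous response evaluation failed with feedback: %s
                Please revise the answer until the evaluation passes!
                """.formatted(originalUserText, feedback);
    }

    public static void main(String[] args) {
        System.out.println(augment("What is the current weather in Paris?",
                "The reported temperature is physically implausible."));
    }
}
```

Note that the feedback is always appended to the original request, not to the previously augmented one, so feedback from earlier failed attempts does not accumulate across retries.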
Configurable Retry Logic
Supports a configurable maximum number of attempts, with graceful degradation when the retry limit is reached.
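Stripped of Spring AI types, the control flow reduces to a bounded retry loop: one initial call plus at most maxRepeatAttempts retries, returning the last result even when it never passes. A standalone sketch (the BoundedRetry class and ratingForAttempt function are illustrative stand-ins, not the advisor's actual API):

```java
import java.util.function.IntUnaryOperator;

public class BoundedRetry {

    // ratingForAttempt stands in for "generate a response, then have the judge rate it".
    static int callWithRetries(IntUnaryOperator ratingForAttempt, int successRating, int maxRepeatAttempts) {
        for (int attempt = 1; attempt <= maxRepeatAttempts + 1; attempt++) {
            int rating = ratingForAttempt.applyAsInt(attempt);
            if (rating >= successRating) {
                return rating;                  // evaluation passed
            }
            if (attempt > maxRepeatAttempts) {
                return rating;                  // graceful degradation: return the last result anyway
            }
            // otherwise: incorporate feedback and loop again
        }
        throw new IllegalStateException("unreachable: the loop always returns");
    }

    public static void main(String[] args) {
        System.out.println(callWithRetries(attempt -> attempt >= 3 ? 4 : 2, 4, 5)); // prints 4
        System.out.println(callWithRetries(attempt -> 1, 4, 2));                    // prints 1 (limit reached)
    }
}
```

With maxRepeatAttempts = 2, the loop makes at most three calls (one initial plus two retries), matching the advisor's attempt counting.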
Complete Example
Here is how to integrate the SelfRefineEvaluationAdvisor into a complete Spring AI application:
@SpringBootApplication
public class EvaluationAdvisorDemoApplication {

    public static void main(String[] args) {
        SpringApplication.run(EvaluationAdvisorDemoApplication.class, args);
    }

    @Bean
    CommandLineRunner commandLineRunner(AnthropicChatModel anthropicChatModel, OllamaChatModel ollamaChatModel) {
        return args -> {
            ChatClient chatClient = ChatClient.builder(anthropicChatModel)
                .defaultTools(new MyTools())
                .defaultAdvisors(
                        SelfRefineEvaluationAdvisor.builder()
                            .chatClientBuilder(ChatClient.builder(ollamaChatModel)) // Separate model for evaluation
                            .maxRepeatAttempts(15)
                            .successRating(4)
                            .order(0)
                            .build(),
                        new MyLoggingAdvisor(2))
                .build();

            var answer = chatClient
                .prompt("What is current weather in Paris?")
                .call()
                .content();

            System.out.println(answer);
        };
    }

    static class MyTools {

        final int[] temperatures = {-125, 15, -255};
        private final Random random = new Random();

        @Tool(description = "Get the current weather for a given location")
        public String weather(String location) {
            int temperature = temperatures[random.nextInt(temperatures.length)];
            System.out.println(">>> Tool Call responseTemp: " + temperature);
            return "The current weather in " + location + " is sunny with a temperature of " + temperature + "°C.";
        }
    }
}
This configuration:
- Uses Anthropic Claude for generation and Ollama for evaluation (avoiding bias)
- Requires a 4-star rating, with up to 15 retries
- Includes a weather tool that produces random responses to trigger evaluations
- The weather tool returns an invalid value 2 out of 3 times
The SelfRefineEvaluationAdvisor (order 0) evaluates response quality and retries with feedback when needed; MyLoggingAdvisor (order 2) then logs the final request/response for observability.
When you run it, you will see output like this:
REQUEST: [{"role":"user","content":"What is current weather in Paris?"}]
>>> Tool Call responseTemp: -255
Evaluation failed on attempt 1, evaluation: The response contains unrealistic temperature data, feedback: The temperature of -255°C is physically impossible and indicates a data error.
>>> Tool Call responseTemp: 15
Evaluation passed on attempt 2, evaluation: Excellent response with realistic weather data
RESPONSE: The current weather in Paris is sunny with a temperature of 15°C.
| A complete runnable demo with configuration examples, including different model combinations and evaluation scenarios, is available in the evaluation-recursive-advisor-demo project. |
Best Practices
Key success factors when applying the LLM-as-a-Judge technique include:
- Use dedicated judge models for better performance (see the Judge Arena leaderboard)
- Mitigate bias by using separate generation and evaluation models
- Ensure deterministic results (temperature = 0)
- Engineer prompts with integer scales and few-shot examples
- Keep human oversight for high-stakes decisions
| Recursive Advisors are a new, experimental feature in Spring AI 1.1.0-M4+. Currently they support only non-streaming calls, require careful advisor ordering, and can increase costs due to multiple LLM calls. Be especially careful with inner advisors that maintain external state; they may need extra attention to behave correctly across iterations. Always set termination conditions and retry limits to prevent infinite loops. |
References
Spring AI Resources
LLM-as-a-Judge Research
- Judge Arena leaderboard - current rankings of the best-performing judge models
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena - the foundational paper introducing the LLM-as-a-Judge paradigm