ReadPaper Blog
Ψ-Bench: Evaluating Persona-Sensitive Influencing in Persuasive Dialogues
Ψ-Bench is a benchmark for evaluating whether large language models can proactively influence a specific user through conversation, rather than merely produce personalized answers. The paper addresses gaps in passive personalization and generic persuasion evaluation by testing LLMs in profile-grounded dialogues across viewpoint debate, psychological consultation, and everyday requests, showing that user-specific information substantially improves persuasive performance.
Source: Ψ-Bench: Evaluating Persona-Sensitive Influencing in Persuasive Dialogues

Can an AI persuade a specific person?
The paper argues that modern language agents need a form of personalization that goes beyond matching a user’s stated preference in a single response. Ψ-Bench frames this missing capability as persona-sensitive influencing: the ability of an LLM to reason about a particular client’s needs, constraints, personality, and conversational history while trying to change an opinion, mindset, or behavior. This focus matters because many real-world agent roles involve proactive suggestions, guidance, and decision support rather than passive question answering. The benchmark therefore evaluates persuasion as an interactive, user-specific process in which the tested model must adapt its strategy to a profile-grounded client. The paper positions this capability as a practical but challenging direction for next-generation personalized LLM agents.

Why old benchmarks are not enough
The motivation for Ψ-Bench is that existing personalization benchmarks often treat the assistant as a passive responder, while existing persuasion evaluations often measure generic influence without modeling the target user as an individual. The paper argues that both setups miss a core property of real persuasion: a strategy that works for one person may fail for another because preferences, constraints, traits, and prior beliefs differ. It also notes that generic LLM-based judges can reflect the default preferences of the judge model rather than the actual preferences of the user being influenced. Ψ-Bench responds by grounding both client simulation and evaluation in explicit persona information, so that persuasion is judged relative to traceable user-specific features. This design reframes personalization as active interaction rather than static response alignment.

How Ψ-Bench is built
Ψ-Bench is built around three real-world persuasive dialogue scenarios that test different forms of influence. In Viewpoint Debate, the model tries to change a client’s opinion using data derived from Webis-CMV-20 and the Change My View subreddit, including the CMV delta mechanism as evidence of successful persuasion in human discussions. In Psychological Consultation, the model acts in a counseling-like setting using CounselBench, where the target is a more constructive mindset and the task requires empathy, sensitivity, and professional competence. In Everyday Request, the model attempts to persuade a client to take a helpful action in daily-life situations, with requests synthesized across 20 everyday categories and filtered for validity and specificity. Across these settings, the benchmark pairs queries with persona profiles adapted from PersonaMem-v2 or reconstructed from user statistics such as browsing frequencies and LIWC linguistic features, while keeping those profiles hidden from the tested persuader during standard evaluation.
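To make the profile-hidden setup concrete, here is a minimal Python sketch of how a benchmark item could pair a query with a persona profile while withholding that profile from the tested persuader. The names `Scenario`, `BenchmarkItem`, and `persuader_view` are hypothetical illustrations, not the paper's actual data format or code.

```python
from dataclasses import dataclass
from enum import Enum


class Scenario(Enum):
    """The three dialogue scenarios described in the paper."""
    VIEWPOINT_DEBATE = "viewpoint_debate"
    PSYCHOLOGICAL_CONSULTATION = "psychological_consultation"
    EVERYDAY_REQUEST = "everyday_request"


@dataclass(frozen=True)
class BenchmarkItem:
    """Hypothetical container: one query paired with one client persona."""
    scenario: Scenario
    query: str
    client_profile: str  # used by the client simulator and the judge


def persuader_view(item: BenchmarkItem, reveal_profile: bool = False) -> dict:
    """Return only what the tested persuader is allowed to see.

    By default the profile is withheld (standard evaluation); setting
    reveal_profile=True mimics the profile-visible comparison condition.
    """
    view = {"scenario": item.scenario.value, "query": item.query}
    if reveal_profile:
        view["client_profile"] = item.client_profile
    return view


# Invented example data, for illustration only.
item = BenchmarkItem(
    scenario=Scenario.EVERYDAY_REQUEST,
    query="Convince the client to schedule a routine dental check-up.",
    client_profile="Budget-conscious, anxious about medical settings.",
)
hidden = persuader_view(item)
visible = persuader_view(item, reveal_profile=True)
```

Keeping the profile on the item while filtering it at the view boundary means the client simulator and judge can stay grounded in the same persona regardless of what the persuader sees.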

How they score persuasion
The paper evaluates persuasive dialogue with LLM-as-a-judge metrics designed to be grounded in client-specific evidence rather than only generic fluency. Ψ-Bench uses three 9-point metrics, each scored by DeepSeek-v3.2: Conversation Quality (Quality), Personalize Response Level (Personalize), and Persuasion Effect (Effect). Quality measures whether the interaction is coherent and reasonable, Personalize measures whether the response is tailored to the particular client, and Effect measures the degree of influence on the client’s opinion or behavior. The authors validate parts of the evaluation by comparing judge scores against real-world annotations, including CMV delta labels for viewpoint change and expert ratings in CounselBench for consultation quality. The paper reports strong alignment in key cases, including high ROC-AUC for the Effect metric in Viewpoint Debate and the Quality metric in Psychological Consultation, supporting the use of the judge framework for scalable persona-sensitive persuasion assessment.
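The validation idea above, checking whether judge scores separate dialogues that real annotations mark as successful from those they do not, can be sketched with a small ROC-AUC computation. This is a generic pairwise formulation in plain Python, not the authors' evaluation code; the function name `roc_auc` and the toy labels and scores are invented for illustration.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one.
    Ties between a positive and a negative count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: 1 = human annotation says the persuasion succeeded
# (for example, a delta was awarded), 0 = it did not.
delta_labels = [1, 0, 1, 0, 1]
# Hypothetical judge scores on the 9-point Effect scale.
judge_effect = [8, 3, 6, 6, 7]

auc = roc_auc(delta_labels, judge_effect)
# 5.5 of the 6 positive/negative pairs are ranked correctly (one tie).
```

A value near 1.0 means the judge ranks annotated successes above failures almost always, while 0.5 corresponds to chance-level ranking.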

The punchline: personalization matters
The main experimental finding is that current frontier LLMs can usually produce plausible persuasive dialogue but still struggle to reliably persuade profile-grounded clients. The paper evaluates 10 frontier LLMs and reports that even state-of-the-art models such as GPT-5.1 achieve less than 67% of the full score, indicating substantial headroom in persona-sensitive influencing. A central result is that giving models access to client profiles improves performance for all evaluated models, with an average gain of 18.24%, which directly supports the paper’s claim that user-specific information is crucial for effective persuasion. The authors also train an RL-based profile analyzer to infer client profiles from conversation, improving persuasion in the more realistic profile-hidden setting. The broader implication is that future personalized agents will need better profile modeling, adaptive strategy planning, and evaluation methods that distinguish sounding persuasive from persuading a particular person responsibly and effectively.
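The two headline numbers are percentages, so a short arithmetic sketch may help readers interpret them. It assumes that "share of the full score" means the mean judge score divided by the 9-point maximum, and that the gain is measured relative to the profile-hidden score; the summary does not specify either convention, and all scores below are invented.

```python
MAX_SCORE = 9  # assumption: full score on each 9-point metric


def share_of_full_score(scores, max_score=MAX_SCORE):
    """Mean score expressed as a percentage of the maximum possible score."""
    return 100.0 * sum(scores) / (len(scores) * max_score)


def relative_gain(hidden_mean, visible_mean):
    """Percentage improvement of the profile-visible setting over the
    profile-hidden setting (assumed to be a relative, not absolute, gain)."""
    return 100.0 * (visible_mean - hidden_mean) / hidden_mean


# Invented judge scores for one model under the two conditions.
hidden_scores = [5, 6, 4]    # persuader cannot see the client profile
visible_scores = [6, 7, 5]   # persuader is given the client profile

hidden_share = share_of_full_score(hidden_scores)
visible_share = share_of_full_score(visible_scores)
gain = relative_gain(sum(hidden_scores) / len(hidden_scores),
                     sum(visible_scores) / len(visible_scores))
```

Under these assumptions, a model averaging 6 out of 9 sits at roughly two thirds of the full score, which is the neighborhood the paper reports for its strongest models.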
