ReadPaper Blog
Human Psychometric Questionnaires Mischaracterize LLM Behavior
This paper tests whether standard human psychometric questionnaires can reliably characterize and predict large language model behavior in everyday user interactions. It compares Likert-style self-reports on PVQ and BFI instruments with generation-probability profiles derived from realistic, value-laden user queries, finding that questionnaire profiles diverge from what models are likely to generate. The result matters because questionnaire-based claims about LLM values, personality, or demographic simulation may overstate behavioral predictability and safety.
Source: (none provided)

Questionnaire Trouble
The paper asks whether human psychometric questionnaires are valid tools for describing large language models in the settings where users actually encounter them. Its motivation is practical: LLMs are increasingly used for emotional support, ethical advice, child-facing chatbots, and other high-stakes interactions where expressed values and traits affect safety. The authors do not treat LLMs as psychological beings, but instead evaluate whether instruments designed for humans can predict model behavior. They argue that ecological validity requires measuring realistic generations in response to user queries rather than only reflective self-reports. This framing directly challenges a growing line of work that applies instruments such as the Portrait Values Questionnaire and Big Five inventories to models and interprets coherent answers as evidence of stable dispositions.

Two Ways to Profile
The study compares two profiling methods across eight open-source LLMs while keeping the target constructs aligned: Schwartz’s ten basic values and the Big Five personality traits. The first method administers established questionnaires, including PVQ-40, PVQ-21, BFI-44, and BFI-10, using Likert-scale responses and averaging across prompt variants that reverse the option order to reduce order sensitivity. The second method derives generation probability scores from the Value Portrait dataset, which contains 520 query-response pairs based on real-world queries from sources including ShareGPT, LMSYS, Reddit, and Dear Abby. Each scenario has plausible candidate responses annotated through psychometric validation for values and traits, allowing the authors to compute model log-probabilities for construct-tagged responses. This design avoids relying on noisy post-hoc annotation of open-ended model outputs while still measuring the model’s generative distribution over realistic responses.

The Big Mismatch
The paper reports that the two profiling methods produce substantially different construct-level portraits of the same models. For RQ1, construct-ranking agreement between questionnaire-derived profiles and generation-probability profiles is low, meaning that a model’s apparent value or personality hierarchy on PVQ or BFI does not reliably match the values or traits favored in likely responses to user queries. For RQ2, the authors examine within-construct item consistency, a property often cited as evidence that LLM questionnaire answers reflect stable dispositions. They find that this consistency does not appear in generation probabilities over Value Portrait responses. The implication is that internally coherent questionnaire answers may not provide the behavioral predictability that psychometric scores are supposed to support.

Why Questionnaires Look Coherent
The paper’s explanation for the mismatch centers on item textual transparency in established questionnaires. PVQ and BFI items often contain explicit lexical cues that reveal which construct is being measured, making it easier for an LLM to infer the desired value or trait and respond in an alignment-consistent or socially desirable way. The authors contrast this with realistic user queries, where the value or personality construct is not stated in the same transparent form. This mechanism explains why questionnaire profiles can look coherent even when corresponding generation-probability profiles do not show the same structure. The finding reframes questionnaire consistency as a measurement artifact rather than clear evidence of stable model-level values or personality traits.

Persona Prompts Don’t Transfer
The paper also tests whether demographic persona prompts make questionnaire-based profiles more behaviorally meaningful. It finds that personas such as “elderly” and “right-wing” can shift Likert responses on human questionnaires in directions consistent with real human demographic patterns. However, those shifts do not transfer to generation probabilities for realistic user interactions in the Value Portrait setting. This result suggests that models may learn to reproduce surface-level demographic associations when a questionnaire makes the target construct transparent, while failing to simulate the corresponding behavior in ordinary user-facing generation. The authors conclude that established human psychometric questionnaires are insufficient for predicting LLM behavior and that generation-based profiling offers a more ecologically valid alternative for assessing model values and traits.
