ReadPaper Blog
ARCANE: Do Role-Playing Language Agents Stay in Character at the Right Time?
ARCANE studies whether role-playing language agents can portray characters whose values and behavior change across a narrative, rather than merely preserving a fixed persona or recalling chapter-specific facts. The paper introduces Arc-Aware Narrative Evaluation, a benchmark built from Character Arcs and phase-specific probes across 17 novels and 80 principal characters, and shows that conditioning models on explicit character-arc context improves responses, especially for scenarios not present in the source text.
Source: ARCANE: Do Role-Playing Language Agents Stay in Character at the Right Time?

The Problem: A Character Isn’t Frozen
The paper argues that role-playing language agents should be evaluated as temporally situated portrayals of evolving characters, not as static persona machines. Its central problem is that a character’s beliefs, values, and behavioral tendencies can shift substantially as events accumulate across a novel, so a response that is faithful at one chapter may be wrong at another. The authors frame this as a gap between factual point-in-time grounding and psychological point-in-time grounding: an agent must not only know what the character knows, but also act from the character’s current state. Novels are chosen as the testbed because they provide long-form temporal structure, explicit events, and rich descriptions of internal change. The paper’s motivating claim is that authentic role-play requires alignment with a character’s narrative trajectory, because immersion depends on behavior that fits the right moment in the story.

The Gap: Facts Are Not the Whole Story
ARCANE positions itself against prior role-playing benchmarks that mainly test trait inventories, factual recall, surface style, or general behavioral consistency. The paper notes that TimeCHARA and related point-in-time evaluations focus on whether a character knows facts appropriate to a chapter, while ARCANE asks whether the character would behave appropriately at that chapter. This distinction matters because a model can avoid spoilers and still produce a psychologically misplaced response, such as applying a late-story moral outlook to an early-story version of the same character. The paper also emphasizes scenarios beyond the source text, where retrieval from the novel cannot simply locate an answer. By evaluating open-ended behavior in both in-narrative and out-of-narrative situations, ARCANE targets character-behavioral hallucination rather than knowledge hallucination alone.

ARCANE’s Core Idea
The paper’s method has two main components: Character Arc Construction and Probe Generation. A Character Arc represents a psychological axis along which a character changes, with phases defined by chapter ranges, state descriptions, and key events that anchor the transition from an initial state to a later state. Each probe then poses the same scenario-question pair at different phases of the arc and supplies phase-specific reference actions and thoughts. This design makes the evaluation comparative: the correct behavior is not a single timeless answer, but a response that changes with the character’s evolving psychological state. The paper connects this approach to McAdams’ account of personality: it moves beyond stable Layer 1 traits toward Layer 2, where those traits are expressed in a way that fits the right moment.
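The structure described above can be sketched as a small data model. This is only an illustration, not the paper's schema: every class and field name (ArcPhase, CharacterArc, Probe, phase_at, reference_for) is hypothetical, and the example character and phase texts are invented placeholders rather than items from the benchmark.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ArcPhase:
    """One phase of an arc: a chapter range plus the character's state in it."""
    start_chapter: int
    end_chapter: int  # inclusive
    state: str
    key_events: list[str] = field(default_factory=list)


@dataclass
class CharacterArc:
    """A psychological axis along which a character changes (hypothetical schema)."""
    character: str
    axis: str
    phases: list[ArcPhase]

    def phase_at(self, chapter: int) -> Optional[int]:
        """Return the index of the phase covering `chapter`, or None."""
        for i, phase in enumerate(self.phases):
            if phase.start_chapter <= chapter <= phase.end_chapter:
                return i
        return None


@dataclass
class PhaseReference:
    """Reference behavior for one phase: what the character does and thinks."""
    action: str
    thought: str


@dataclass
class Probe:
    """One scenario-question pair with a separate reference per phase."""
    scenario: str
    question: str
    references: dict[int, PhaseReference]  # phase index -> reference

    def reference_for(self, arc: CharacterArc, chapter: int) -> Optional[PhaseReference]:
        """Look up the reference that applies at the queried chapter."""
        idx = arc.phase_at(chapter)
        return None if idx is None else self.references.get(idx)


# Invented placeholder data, for illustration only.
arc = CharacterArc(
    character="Example Character",
    axis="trust in others",
    phases=[
        ArcPhase(1, 10, "guarded and suspicious", ["betrayed by a friend"]),
        ArcPhase(11, 20, "cautiously open", ["rescued by a stranger"]),
    ],
)
probe = Probe(
    scenario="A stranger offers to share a meal.",
    question="What do you do?",
    references={
        0: PhaseReference("declines and leaves", "They want something from me."),
        1: PhaseReference("accepts, but stays alert", "Not everyone is an enemy."),
    },
)
```

The point of the layout is that the same probe yields different reference behavior depending on the queried chapter: probe.reference_for(arc, 5) and probe.reference_for(arc, 15) return different actions and thoughts for an identical scenario and question.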

What the Dataset Contains
The ARCANE benchmark is automatically constructed over 17 novels and 80 principal characters, producing 544 Character Arcs and 4,601 probes. The dataset includes both source-grounded scenarios and out-of-world scenarios, reflecting the paper’s claim that users often want to explore how a character would respond to new situations rather than merely reproduce events from the book. The construction process creates training and evaluation splits, with the evaluation set built after the underlying Character Arcs are reviewed and updated through consensus by three independent human annotators. This design gives ARCANE both scale and temporal specificity, because each item is tied to a particular phase of a particular psychological arc. The paper presents the benchmark as a way to test whether role-playing language agents can generalize from narrative development to plausible, phase-appropriate behavior.

Why Arc-Based Context Wins
The experiments compare six context modes across six models, with the main question being which kind of narrative context best grounds a role-playing agent at the queried phase. The paper reports that conditioning on the Character Arc outperforms every other context strategy on every model, indicating that explicit arc-level psychological summaries are more useful than context that only retrieves local textual evidence. The advantage is largest for scenarios outside the source text, where retrieval-based approaches have little relevant material to find but an arc still states the character’s current psychological position. The authors also fine-tune open-weight models on ARCANE’s training data to produce ARCANE-8B and ARCANE-32B, and these models further widen the arc-grounding advantage, particularly beyond the source text. The results suggest that future role-playing systems should represent narrative change directly, because temporal psychological grounding improves character portrayal in situations where factual recall is insufficient.
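To make arc conditioning concrete, here is a minimal sketch of how a prompt might be assembled from an arc at a queried chapter. This post does not describe the paper's actual prompt format, so the function name build_arc_prompt, the dictionary keys, the wording of the prompt, and the choice to hide phases that start after the queried chapter are all assumptions made for illustration. The example phases are invented placeholders.

```python
def build_arc_prompt(character, axis, phases, chapter, scenario, question):
    """Assemble an arc-conditioned prompt (hypothetical format).

    `phases` is a list of dicts with keys "start", "end" (inclusive chapter
    numbers) and "state". Phases that begin after `chapter` are left out so
    the prompt does not leak the character's later psychological state.
    """
    ordered = sorted(phases, key=lambda p: p["start"])
    visible = [p for p in ordered if p["start"] <= chapter]
    if not visible:
        raise ValueError("chapter precedes the first phase of the arc")

    lines = [
        f"You are {character}, as of chapter {chapter}.",
        f"Psychological axis: {axis}.",
    ]
    # Earlier phases give the trajectory; the last visible phase is the current state.
    for p in visible[:-1]:
        lines.append(f"Earlier state (chapters {p['start']}-{p['end']}): {p['state']}")
    current = visible[-1]
    lines.append(
        f"Current state (chapters {current['start']}-{current['end']}): {current['state']}"
    )
    lines.append(f"Scenario: {scenario}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Invented placeholder arc, for illustration only.
PHASES = [
    {"start": 1, "end": 10, "state": "guarded and suspicious"},
    {"start": 11, "end": 20, "state": "cautiously open"},
]

early_prompt = build_arc_prompt(
    "Example Character", "trust in others", PHASES,
    chapter=5,
    scenario="A stranger offers to share a meal.",
    question="What do you do?",
)
late_prompt = build_arc_prompt(
    "Example Character", "trust in others", PHASES,
    chapter=15,
    scenario="A stranger offers to share a meal.",
    question="What do you do?",
)
```

Nothing in the prompt depends on the scenario appearing in the novel, which fits the reported result that the arc advantage is largest for scenarios outside the source text: the arc still supplies the character's current state when there is no relevant passage to retrieve.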
