ReadPaper Blog
SoCRATES: Reliable Automated Evaluation of Proactive LLM Mediation Across Domains and Socio-Cognitive Variations
SoCRATES addresses the problem of evaluating proactive LLM mediators in realistic social conflicts, where success depends on a changing multi-turn trajectory rather than a single final answer. The paper introduces a benchmark that builds real-grounded conflict scenarios, varies five socio-cognitive conditions independently, and uses topic-localized evaluation to reduce off-topic scoring noise. Its results show that even strong frontier LLM mediators close only about a third of the unmediated consensus gap, making reliable social adaptation a central challenge for automated mediation.
Source: (none provided)

Why mediation is hard
The paper argues that LLM mediation is difficult to evaluate because mediation has no single correct output and unfolds through a real-time exchange shaped by shifting emotions, intentions, and context. Unlike tasks with fixed answers, proactive mediation requires a model to decide both when to intervene and how to guide disputants toward agreement across multiple contested topics. The authors frame this as a trajectory-level evaluation problem, where the quality of an intervention depends on how it affects the subsequent dialogue rather than merely whether an endpoint looks acceptable. They also identify off-topic dialogue as a major source of evaluator error, because scoring every topic at every turn can let irrelevant content distort the measured progress of a dispute. This motivation leads the paper to focus on reliable automated evaluation rather than only on building a stronger mediator.

What existing tests miss
SoCRATES is designed to address three gaps the authors see in existing mediation testbeds: narrow scenario coverage, confounded variation, and noisy turn-level scoring. Prior benchmarks often rely on a small number of expert-authored domains, such as bargaining or legal disputes, which limits how well they represent the diversity of real conflicts. The paper also argues that earlier evaluations tend to vary mainly strategic posture, leaving factors such as emotional reactivity, cultural identity, party composition, and history length underexamined or tangled together. Because these factors affect mediator behavior in different ways, the authors propose varying them independently so failures can be localized to a particular socio-cognitive capability. The paper further criticizes protocols that score every topic at every turn, since topic-irrelevant content can compound errors over a multi-turn trajectory.

SoCRATES appears
The core contribution of the paper is SoCRATES, the Social Conflict Resolution Arena with Topic-localized Evaluation for Social Cognition, a three-stage framework for scalable mediation evaluation. Its agentic scenario curation pipeline uses LLM agents to search the web for real public disputes across eight conflict domains, including transactional, healthcare, business, legal, environmental, public policy, intra-organizational, and international contexts. Retrieved cases are recast into structured scenarios with background information, disputing parties, topic sets, and preference weights, following the Harvard conflict simulation framework. The pipeline then uses unmediated simulation and rejection sampling to retain hard scenarios that do not resolve easily without a mediator. This design lets the benchmark scale beyond hand-authored cases while keeping the scenarios grounded in real conflict evidence.

Five axes to probe
The paper’s socio-cognitive probing stage expands each curated scenario along five independent axes: strategic posture, party composition, history length, emotional reactivity, and cultural identity. These axes are linked to mediator competencies such as strategic adaptation, multi-state tracking, long-context understanding, emotional regulation, and cultural adaptation. In the task formulation, disputing parties are LLM agents with private objectives, fallback options, starting stances, personas, and topic preferences, while the mediator sees only shared context and the dialogue so far. This asymmetry forces the mediator to infer hidden states from the conversation, making the benchmark a test of social cognition rather than simple access to privileged information. By changing one scenario component at a time, SoCRATES can attribute performance shifts to a specific condition instead of blending multiple causes into a single aggregate score.

How scoring works
SoCRATES evaluates mediation with a topic-localized evaluator that scores agreement only on turns that actively move a given topic and carries the previous score forward when the topic is not being advanced. This mechanism is intended to make trajectory evaluation more noise-resilient than per-turn scoring schemes that judge every topic at every moment. The evaluator reports three complementary metrics: consensus gain, which measures how much the mediator closes the unmediated agreement gap; intervention timeliness, which measures when the mediator acts relative to escalation; and intervention effectiveness, which measures the immediate consensus shift produced by interventions. The paper validates the evaluator against two expert annotators on 1,844 dialogue snippets and reports Pearson correlations of 0.82 at the trajectory level and 0.80 at the outcome level. In benchmarking eight proprietary and open-source frontier LLM mediators, the authors find that even the strongest model closes only roughly a third of the unmediated consensus gap, with performance varying sharply across socio-cognitive axes.
