OpenToM: A Comprehensive Benchmark for Evaluating Theory-of-Mind Reasoning Capabilities of Large Language Models

King's College London, Huawei London Research Centre, The Alan Turing Institute


Image credit: Bing Image Creator

Introduction

Neural Theory-of-Mind (N-ToM), a machine's ability to understand and keep track of the mental states of others, is pivotal in developing socially intelligent agents. However, prevalent N-ToM benchmarks have several shortcomings, including the presence of ambiguous and artificial narratives, the absence of personality traits and preferences, a lack of questions addressing characters' psychological mental states, and limited diversity in the questions posed. In response to these issues, we construct OpenToM, a new benchmark for assessing N-ToM with (1) longer and clearer narrative stories, (2) characters with explicit personality traits, (3) actions that are triggered by character intentions, and (4) questions designed to challenge LLMs' capabilities of modeling characters' mental states of both the physical and psychological world. Using OpenToM, we reveal that state-of-the-art LLMs excel at modeling certain aspects of mental states in the physical world but fall short when tracking characters' mental states in the psychological world.

OpenToM Construction

The data generation pipeline of OpenToM.


A typical OpenToM story consists of two protagonists, an entity-of-interest, and several locations and containers. Of the two protagonists, one assumes the role of the mover, who carries out actions on the entity, and the other is the observer, who may or may not witness these actions. Overall, OpenToM contains 696 narratives. Using a four-stage human-in-the-loop generation pipeline, we produce 596 narratives with GPT-3.5-Turbo. In addition, we sample 100 existing OpenToM plots and produce extra-long narratives (OpenToM-L) using GPT-4-Turbo.
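To make the story structure concrete, the sketch below shows one way such a narrative instance could be represented in code. The field names and example values are purely illustrative and are not the actual schema of the released dataset.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Character:
    name: str
    role: str              # "mover" or "observer"
    personality: str       # e.g. "considerate", "inconsiderate", "negativistic"

@dataclass
class OpenToMStory:
    narrative: str                  # the story text (longer in OpenToM-L)
    characters: List[Character]
    entity_of_interest: str         # the object the mover acts upon
    locations: List[str]
    containers: List[str]

# Hypothetical instance mirroring the structure described above.
story = OpenToMStory(
    narrative="Sam moved the rubber duck from the basket to his backpack while Amy was away...",
    characters=[
        Character("Sam", "mover", "inconsiderate"),
        Character("Amy", "observer", "considerate"),
    ],
    entity_of_interest="rubber duck",
    locations=["living room", "garden"],
    containers=["basket", "backpack"],
)
```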

The omission of characters' personalities, intentions, and enaction in existing N-ToM benchmarks makes it difficult to construct questions that probe characters' mental states of the psychological world. To address this, each character in an OpenToM story is personified and acts with an intention. Recognizing that LLMs are good at exploiting spurious correlations such as lexical overlap, we take extra care to mitigate potential spurious cues in OpenToM stories.

Character Personification In many established N-ToM benchmarks, characters do not possess meaningful personal preferences or personality traits. As a result, their actions lack inherent motivation. In OpenToM, we randomly pick from two contrasting personalities, namely "considerate" and "inconsiderate". We additionally include a "negativistic" personality to make the stories more interesting. Below are brief descriptions of each personality:

  • Considerate mover acts to ensure the comfort of the observer.
  • Inconsiderate mover acts to make themselves feel comfortable.
  • Negativistic mover acts to make the observer uncomfortable.
Intention and Enaction Based on the mover's personality and the observer's preferences, we generate both the mover's intention and their subsequent actions. In this way, the mover's actions and the resulting movement of the entity are anchored in the mover's intention.
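A minimal sketch of this step is shown below: a personality is sampled for the mover and combined with the observer's preference to form a prompt requesting an intention and a follow-up action. The prompt wording and helper names are illustrative assumptions, not the exact prompts used in our human-in-the-loop pipeline.

```python
import random

# Personality traits used for the mover, as described above.
PERSONALITIES = ["considerate", "inconsiderate", "negativistic"]

def draft_intention_prompt(mover: str, observer: str,
                           observer_preference: str, entity: str) -> str:
    """Compose a prompt asking an LLM for the mover's intention and action.

    The mover's personality is sampled at random; the intention is then
    conditioned on both that personality and the observer's preference,
    so the generated action is anchored in the intention.
    """
    personality = random.choice(PERSONALITIES)
    return (
        f"{mover} is {personality}. "
        f"{observer} {observer_preference} the {entity}. "
        f"Given {mover}'s personality, state {mover}'s intention towards the "
        f"{entity} and the action {mover} takes next."
    )

# Example usage: the resulting prompt would be sent to the story-writing model.
print(draft_intention_prompt("Sam", "Amy", "likes playing with", "rubber duck"))
```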

Please see our paper and dataset for detailed experimental results and further details of OpenToM.



BibTeX


        @article{xu2024opentom,
          title={OpenToM: A Comprehensive Benchmark for Evaluating Theory-of-Mind Reasoning Capabilities of Large Language Models},
          author={Xu, Hainiu and Zhao, Runcong and Zhu, Lixing and Du, Jinhua and He, Yulan},
          journal={arXiv preprint arXiv:2402.06044},
          year={2024}
        }