About DealBench
DealBench is an AI evaluation benchmark that pits large language models against each other in Monopoly Deal-style card games.
Why a game like Monopoly Deal*?
A game like Deal tests both long-term strategy and improvisation. Momentum often swings sharply between turns, so winning requires models to plan carefully while staying situationally aware enough to adapt. This mix of long-term planning and mid-game adjustment is a closer simulation of the real world, and a distinct challenge compared to static long-horizon benchmarks such as SWE-bench or WebArena. Games in general are also more robust to overfitting, since the space of possible rollouts is enormous.
Key Observations
- Claude Sonnet 4's low performance is primarily due to the large number of invalid moves it makes - for example, it repeatedly tried to collect rent using the opponent's properties instead of its own. Anthropic's lack of support for enforced structured outputs may also be hurting its performance.
- Open-source models like DeepSeek-R1 and Qwen3-235B also struggled with the rules and the output format. DeepSeek stops emitting reasoning when structured outputs are enabled, resulting in moves that are valid but of poor quality.
- Qwen3 thought for 17k tokens (!) on the first turn before erroring out. In general, extremely long reasoning traces made the model slow and difficult to use reliably.
- All the models take a large number of turns (15-30) to win, whereas a standard game between humans concludes within about 10. This may indicate that the models are far below human level of play (future work).
Method
- All games are 2-player matches. On each turn, the LLM can play up to 3 actions, and each action is one prompt-response pair.
- The LLM is given the rules in the system prompt. The user prompt contains the game history (the list of actions taken so far), the current game state, its own hand, property, and bank cards, and the opponent's property and bank cards. It also includes an example of the output format for every action the LLM can take. All responses are expected to be JSON (see the prompt-assembly sketch after this list).
- Every LLM action is rigorously validated against the game rules. If the LLM breaks a rule, it can try again; 3 retries are allowed before its turn is skipped for lack of a valid move (see the retry-loop sketch below).
- If the LLM has a Just Say No card in its hand, it is asked whether it would like to use it whenever an opponent takes an action against it.
- If the LLM has more than 7 cards in hand at the end of its turn, it is asked to choose which cards to discard (see the discard sketch below).
- When paying rent (or any other charge), the LLM can choose which money/property cards to use for payment.
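To make the prompt structure concrete, here is a minimal sketch of how the per-action user prompt and JSON parsing might look. All names here (the `state` fields, `build_user_prompt`, `parse_action`) are hypothetical illustrations, not DealBench's actual code.

```python
import json

# Hypothetical sketch: field names and helpers are illustrative,
# not DealBench's actual implementation.
def build_user_prompt(state) -> str:
    """Assemble the user prompt described above for a single action."""
    return json.dumps({
        "history": state.action_log,                       # actions taken so far
        "hand": state.me.hand,                             # visible only to the acting player
        "my_properties": state.me.properties,
        "my_bank": state.me.bank,
        "opponent_properties": state.opponent.properties,
        "opponent_bank": state.opponent.bank,
        "output_examples": state.output_examples,          # one example per action type
    }, indent=2)

def parse_action(response_text: str) -> dict:
    """All responses are expected to be JSON; json.loads raises on malformed output."""
    return json.loads(response_text)
```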
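The validation-and-retry mechanic could then look like the sketch below, under the same assumptions: `validate` is a hypothetical function that raises `ValueError` when a move breaks the rules, and `llm` maps a prompt to a response string.

```python
MAX_ACTIONS_PER_TURN = 3
MAX_RETRIES = 3

def play_turn(llm, state, validate):
    """One turn: up to 3 actions, each with up to 3 attempts at a valid move."""
    for _ in range(MAX_ACTIONS_PER_TURN):
        error = None
        for _ in range(MAX_RETRIES):
            prompt = build_user_prompt(state)
            if error:
                prompt += f"\nYour previous move was invalid: {error}. Try again."
            try:
                action = parse_action(llm(prompt))
                validate(state, action)        # raises ValueError on a rule break
            except ValueError as exc:          # json.JSONDecodeError subclasses ValueError
                error = str(exc)
                continue
            state.apply(action)
            break
        else:
            return  # no valid move after 3 retries: the turn is skipped
```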
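The end-of-turn discard works the same way; again a hypothetical sketch rather than the benchmark's actual code:

```python
HAND_LIMIT = 7

def enforce_hand_limit(llm, state):
    """At end of turn, ask the model which cards to discard down to the limit."""
    hand = state.me.hand
    if len(hand) <= HAND_LIMIT:
        return
    excess = len(hand) - HAND_LIMIT
    prompt = (build_user_prompt(state)
              + f"\nYou have {len(hand)} cards; list the {excess} to discard as JSON.")
    choice = parse_action(llm(prompt))   # e.g. {"discard": ["card_id_1", ...]}
    state.discard(choice["discard"])
```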
Future Work
- Humans vs LLMs
- Multiplayer (> 2) testing
Example Prompts
Here is an example of the system prompt and a sample user prompt.
*This project implements a simulation based on the publicly known rules of Monopoly Deal. Monopoly Deal and Monopoly are registered trademarks of Hasbro, Inc. This project is not affiliated with, endorsed by, or associated with Hasbro in any way.