Diplomacy Review

2 minute read

My favourite introduction to Diplomacy is the This American Life episode where they send an elite American diplomat to help out an amateur at the National Championships. The most important takeaway is that the rules are very simple, such that performance is a fairly pure measure of the ability to negotiate and plan.

Diplomacy is one of the preferred games for cutting-edge RL research, alongside the co-operative card game Hanabi, poker, and a slew of video games including Pokemon and Among Us. I enjoy its vaguely threatening aura - players fight for control of Europe, and the threat of backstabbing is ever-present.

A Noam Brown talk in 2024 covers planning research across many games, including the most recent notable Diplomacy effort - Cicero, from FAIR in 2022.

One satisfying result he cites: in a paper that measured neural-net performance on various chess variants, the only variant where neural networks beat humans without the use of search was bullet chess, where top humans take an average of 0.8 seconds per move.

He also compares methods for selecting the “correct” answer in LLM RL training (a rough sketch follows the list):

  • Consensus, or taking the most commonly generated answer, works when the sampled solutions repeat often enough that a genuine majority answer exists.
  • Best-of-N works when a good enough reward model is available.
  • Using a process reward model (marking each step of the solution individually) lowers the threshold for a “good enough” reward model.
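
For concreteness, here is a minimal sketch of the three selection strategies, assuming we already have N sampled solutions and (for the best-of-N variants) a scalar reward model. The function names and the mean aggregation over steps are my own illustrative choices, not anything specified in the talk:

```python
from collections import Counter
from typing import Callable, Sequence

def consensus(answers: Sequence[str]) -> str:
    """Majority vote: return the most commonly generated answer.
    Only meaningful when samples repeat, i.e. the answers aren't all unique."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(answers: Sequence[str], reward_model: Callable[[str], float]) -> str:
    """Best-of-N: score every sampled answer with a reward model, keep the top one."""
    return max(answers, key=reward_model)

def best_of_n_process(
    solutions: Sequence[Sequence[str]],   # each solution is a list of reasoning steps
    step_reward: Callable[[str], float],  # process reward model scores one step at a time
) -> Sequence[str]:
    """Process-reward variant: score each step and aggregate (here, the mean),
    so the reward model only has to judge small steps rather than whole solutions."""
    return max(solutions, key=lambda steps: sum(map(step_reward, steps)) / len(steps))
```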

Cicero was produced in 2022, before the RL-for-LLMs wave, using an RL engine separate from the LLM that generated the negotiation messages, so there’s a gap in the market for a new SOTA to emerge. Such a system could provide a great test-bed for evaluating the emergence of social power in LLMs - as Noam said, “a window into the future.”

It also has the potential to greatly simplify the setup. Cicero used an ensemble of 16+ classifiers to monitor the LLM-generated outputs and filter out nonsense and prompt leakage, whereas RL-trained LLMs could avoid bad generations altogether. Better still, it could enable removal of the unintegrated recommendation engine (current pseudo-evals like diplobench give each LLM under test the same recommended best move for a given board state, leaving only the negotiation phase to the LLMs).
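
To make the filtering point concrete, the gating pattern looks roughly like the sketch below; the interface and names are invented for illustration and are not Cicero’s actual API:

```python
from typing import Callable, Dict, List, Optional

# Hypothetical interface: each classifier inspects a candidate message (plus the
# board/dialogue state) and returns True if the message should be discarded.
MessageFilter = Callable[[str, Dict], bool]

def gate_message(message: str, state: Dict, filters: List[MessageFilter]) -> Optional[str]:
    """Post-hoc gating: if any classifier in the ensemble rejects the candidate
    message, drop it and let the caller resample or stay silent. An RL-trained
    LLM would ideally make this whole layer unnecessary."""
    if any(reject(message, state) for reject in filters):
        return None
    return message
```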