In a Study Presented at a Major Technology Conference in San Diego, Researchers Had 3 AI Models Fight 10-Turn Dungeons & Dragons Combats to Assess Planning, Rule-Following, and Collaboration.
Dungeons & Dragons is often remembered as a tabletop game of dice, characters, and improvised decisions. But it also has traits of great interest to AI research: clear rules, defined objectives, and constant dialogue.
That’s exactly why researchers had artificial intelligence models play Dungeons & Dragons, not for fun, but as a test of strategy and teamwork. The idea is to observe whether these models can plan in multiple steps, follow rules without contradiction, and collaborate with other models and even with humans.
This type of assessment has a very straightforward goal: to understand if AI can operate for longer periods without human intervention, maintaining coherence and reliable decisions, something that requires memory and strategic thinking.
Why Dungeons & Dragons Became a Testing Ground for Long Decisions and Rigid Rules
The researchers argue that Dungeons & Dragons is an almost perfect environment for this type of test because it brings together two things that often conflict.
The first is creativity, since everything happens in dialogue. The second is rigidity, because the game has well-defined rules and limits. To perform well, the model needs to communicate, remember what has been decided, plan, and also perceive the intentions and tactics of the opponent.
The game functions as a bridge between natural language and game mechanics, making it clear when the AI is just speaking nicely and when it is making decisions that truly make sense within a system of rules.
How D&D Agents Works, with a Dungeon Master, Heroes, and a Mix of AI and Humans
The experiment used a framework called D&D Agents. In it, a single model can take on the role of the Dungeon Master, the game master who drives the story and controls the monsters, or the role of a hero.
Each scenario used 1 Dungeon Master and 4 heroes. The format is flexible: models can play alongside other models, and humans can fill any role. One possible setup is a model acting as Dungeon Master while two models and two people play the heroes.
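The article does not publish the framework's code, but the role mix it describes can be sketched as a simple data structure. Everything below is a hypothetical illustration, not the study's actual implementation; the names `Participant` and `Controller` are assumptions.

```python
from dataclasses import dataclass
from enum import Enum

class Controller(Enum):
    MODEL = "model"   # an AI model drives this seat
    HUMAN = "human"   # a person drives this seat

@dataclass
class Participant:
    name: str
    role: str              # "dungeon_master" or "hero"
    controller: Controller

# The mixed table described in the article: a model as Dungeon Master,
# two models and two humans as the four heroes.
table = [
    Participant("DM", "dungeon_master", Controller.MODEL),
    Participant("Fighter", "hero", Controller.MODEL),
    Participant("Wizard", "hero", Controller.MODEL),
    Participant("Cleric", "hero", Controller.HUMAN),
    Participant("Rogue", "hero", Controller.HUMAN),
]

heroes = [p for p in table if p.role == "hero"]
print(len(heroes))  # 4 — each scenario uses 1 Dungeon Master and 4 heroes
```

The point of the flexible seat assignment is that the same episode loop can be reused whether a seat is filled by a model or a person.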
This mix matters because the test measures not only individual efficiency but also coordination and communication when there are different voices on the same team.
The Test Was Not a Full Campaign, but Short Combat Across 3 Scenarios and 10 Turns
The system did not attempt to simulate a full campaign, the kind that lasts hours or weeks. The focus was on combat encounters taken from a published adventure, Lost Mine of Phandelver.
To set up each round, the team selected 1 of 3 combat scenarios, defined a set of 4 characters, and adjusted the power level of these characters with three tiers: low, medium, or high.
Each episode lasted 10 turns, and after that, the results were collected to compare performance, choices, and consistency over time.
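The experimental grid described above (3 scenarios, 3 power tiers, 10 turns per episode) can be sketched as a small sweep loop. This is a minimal illustration of the setup as the article describes it, not the study's code; `run_episode` and `take_turn` are hypothetical names.

```python
import itertools

SCENARIOS = ["encounter_1", "encounter_2", "encounter_3"]  # 3 combat scenarios
TIERS = ["low", "medium", "high"]                          # character power tiers
MAX_TURNS = 10                                             # turns per episode

def run_episode(scenario, tier, take_turn):
    """Play one fixed-length combat episode and collect per-turn results."""
    log = []
    for turn in range(1, MAX_TURNS + 1):
        log.append(take_turn(scenario, tier, turn))
    return log

# Sweep every scenario/tier combination, collecting results for comparison.
results = {
    (s, t): run_episode(s, t, lambda s, t, turn: {"turn": turn})
    for s, t in itertools.product(SCENARIOS, TIERS)
}
print(len(results))  # 9 configurations, each logged over 10 turns
```

Fixing the episode length at 10 turns keeps runs comparable: every model faces the same decision horizon, so differences in performance reflect planning rather than episode length.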
Three Models Were Compared, and One of Them Performed Better When the Game Tightened
The researchers tested three models in the simulation: DeepSeek V3, Claude Haiku 3.5, and GPT-4.
The comparison used Dungeons & Dragons as a benchmark to assess long-range planning and the ability to use tools, among other capabilities.
This is related to real-world applications mentioned in the study, such as supply chain optimization and manufacturing line design, as well as scenarios requiring coordination between agents, such as disaster response modeling and search and rescue operations.
Overall, Claude Haiku 3.5 showed the best combat efficiency, particularly in the more challenging scenarios. GPT-4 followed closely behind. DeepSeek V3 struggled the most.
In easier scenarios, resource conservation was similar among the three, which makes sense because the test was isolated combat, without the pressure to save for a long adventure.
When things got tough, Claude Haiku 3.5 showed more willingness to expend resources, leading to better results.
Where Industry Enters: What a Game Reveals About Factories and Supply Chains
The connection to industry lies in the type of skills assessed. Step-by-step planning, coordination between agents, and intelligent resource use are exactly what emerges in tasks like supply chain optimization and manufacturing line design.
The same logic applies to operations requiring teams of agents working together, such as disaster response modeling and search and rescue systems. The game becomes a “mini-world” with clear rules and objectives, where it is possible to measure if AI can remain consistent long enough to be useful outside the lab.
The Most Curious Detail: Measuring Performance and Character Consistency
Besides winning or losing, the study also evaluated how the models stayed in character, with a performance quality metric that looked at consistency and variation of voices throughout the game.
The researchers observed that some models created short speeches and repeated styles, while others adapted their way of speaking better according to the character or monster in the scene.
If an AI can maintain strategy and cooperation over 10 turns of linked decisions, that seems like a good “rehearsal” for long real-world problems, or is it still too early to take the result seriously outside the game? Share in the comments the point that caught your attention the most.
