A midwit's guide to the AI future through D&D ability scores

Dungeons and DraGANs

Feb 03, 2025

I have become increasingly interested in benchmarking LLMs, in where we’ve lost the plot on measuring success in AI development, and in qualitative shoggology. Today I want to write a bit about how the 50ish-year-old D&D ability score system is actually great for thinking about AI developments.

LLM ability scores

for anyone who doesn’t know the system I’m talking about, there are a million videos on it. In the context of LLMs, here’s how I think about it:

STR: raw power, model size, and “carry weight” (context window)

INT: ability in reasoning, problem solving, knowledge

WIS: ability to avoid rabbit-holes, avoid hallucinations, and dispense good advice, plus self-awareness

DEX: ability to manipulate environment, tool use, and speed.

CON: safety, resistance to adversarial attacks, “poison resistance” (toxicity)

CHA: charm, engagement, character
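As a rough sketch of the stat block above, the six scores fit in a small Python dataclass. The class name, the `weakest` helper, and the example numbers are all mine and purely illustrative; they are not measurements of any real model.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AbilityScores:
    """Hypothetical stat block for a model, using the six abilities above."""
    STR: int  # raw power, model size, context window
    INT: int  # reasoning, problem solving, knowledge
    WIS: int  # avoiding rabbit-holes and hallucinations, good advice
    DEX: int  # tool use, manipulating the environment, speed
    CON: int  # safety, robustness to adversarial attacks
    CHA: int  # charm, engagement, character

    def weakest(self) -> str:
        """Return the name of the lowest score (first one wins on ties)."""
        return min(fields(self), key=lambda f: getattr(self, f.name)).name


# Made-up numbers for illustration only.
example = AbilityScores(STR=12, INT=13, WIS=7, DEX=6, CON=8, CHA=5)
print(example.weakest())  # CHA
```

Asking for the weakest stat is the interesting question here, since the rest of this post is mostly about which score a given project is trying to pull up.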

how to see AI projects by ability score

I think this is a fun framework for AI development, because you can see that almost every new “advance” in AI is actually targeting a single ability score boost, and that there are certain incentives to optimize for each of these different stats as we build out.

benchmarks/evals by ability score

benchmarks are also pretty easy to sort here: almost all benchmarks are some way of testing a single ability score. Just like in D&D, it’s easy to mix up Wisdom and Constitution, and because AI doesn’t have a physical constitution (yet), I have assigned robustness and defense against adversarial attacks to CON.

AGI through ability scores

In D&D, 10 is considered “normal” in a stat. You’re neither a wimp nor a strongman at 10 strength. 18 is the theoretical human max, without using magic. 25 is godlike.

in D&D a “commoner” is all tens across the board. This could be seen as the rough equivalent of “artificial general intelligence”.

this is the AGI that people in 2015 thought would be amazing to have. Totally normal amount of intelligence, totally normal amount of charisma, just a normal set of attributes.

these days when we say AGI we mean all 18s at least. We don’t want something that is about normal human performance on any of these factors, we want something that is as good as any person can be at all these things.

most people think AGI looks like this: 18 intelligence, and a normal amount of everything else.

dangerous and scary AI is like a 5 in wisdom, a 5 in constitution, and straight 18s in intelligence and charisma. It can convince you of anything, come up with any kind of destructive idea that a human mind can birth, and it sure as hell isn’t wise or aligned.
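To make these profiles concrete, here is a small sketch. The `profile` helper and the profile names are my own, and wherever the text doesn't give a number (for example STR and DEX on the scary AI) I have assumed a "normal" 10. The `modifier` helper uses the rule from recent editions of D&D: subtract 10, halve, round down.

```python
ABILITIES = ("STR", "INT", "WIS", "DEX", "CON", "CHA")


def profile(default: int, **overrides: int) -> dict[str, int]:
    """Build a stat block: every ability at `default`, then apply overrides."""
    scores = {name: default for name in ABILITIES}
    scores.update(overrides)
    return scores


def modifier(score: int) -> int:
    """D&D ability modifier: (score - 10) / 2, rounded down."""
    return (score - 10) // 2


commoner_agi = profile(10)            # the 2015 idea of AGI: all tens
modern_agi = profile(18)              # what we mean today: all 18s
popular_agi = profile(10, INT=18)     # 18 INT, normal everything else
# Assumption: stats the text does not mention are left at 10.
scary_ai = profile(10, INT=18, CHA=18, WIS=5, CON=5)

for name, scores in [("commoner", commoner_agi), ("scary", scary_ai)]:
    print(name, {k: modifier(v) for k, v in scores.items()})
```

Seen as modifiers, the scary profile is +4 on persuasion and ideas and -3 on judgment and robustness, which is the whole worry in one line.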

superintelligence & recursive self-improvement

when we talk about superintelligence, we’re talking about a 25. Something so smart that there’s no real need to worry about the other stuff, because the AI will just infinitely level up its ability scores by playing infinite adventures with itself until it’s maxed out all the relevant statistics.

This is how superintelligence is defined most narrowly in AI: enough raw power to barely be functional, some other ability scores to be able to be narrowly useful, but a shit-ton of intelligence beyond any human capacity.

This is a “god”. When you hear about the singularity, this is what they’re talking about.

this is the thing that most people in the AI world worry about: if you’re godlike in intelligence, does the rest matter to you?

where I think models are: Feb 2025

the various OpenAI offerings are above human averages in strength and intelligence, but slide down to sad amounts on other ability scores.

Claude is a more balanced model despite a dismal dexterity. Giving it twice the charisma is a controversial take, but I stand by it: twice as good as ChatGPT, but also nowhere near a 10 yet.

what Ilya saw

This year at NeurIPS I really felt the burning passion coming from Ilya, a man who thinks he can build a 25 INT model and doesn’t care about any other ability score. He knows that with 25 INT all the other things are trivial, like an archmage.

When people ask what broke up the OpenAI team, I think the right answer is that Sam wants a balanced character and Ilya & the Gang want a 25 INT magic user who will bend the universe to its will.

NPC LLM “commoners” = agents

I think agents can go out in the world and do things on the web once all these stats are 10+. They will be internet “commoners”, filling up our internet world the same way that NPCs populate a D&D campaign. Maybe the really memorable ones have names, but most just have a job: innkeeper, blacksmith, guard. They do a job on the internet, and they can have a massive amount of depth if the PCs care to ask, but mostly they’re there to be helpful.
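The “all stats 10+” bar for agents is easy to express as a check. This is a minimal sketch: the function name, the threshold constant, and the sample scores are hypothetical, not ratings of any real model.

```python
COMMONER_BAR = 10  # "normal" in D&D terms


def ready_to_be_commoner(
    scores: dict[str, int], bar: int = COMMONER_BAR
) -> tuple[bool, list[str]]:
    """Return (ready, names of the stats still below the bar)."""
    lagging = sorted(name for name, value in scores.items() if value < bar)
    return (not lagging, lagging)


# Made-up numbers for illustration only.
sample = {"STR": 12, "INT": 13, "WIS": 9, "DEX": 6, "CON": 10, "CHA": 10}
ready, lagging = ready_to_be_commoner(sample)
print(ready, lagging)  # False ['DEX', 'WIS']
```

The point of returning the lagging stats rather than a bare yes/no is that a single low score is enough to keep a model out of the village.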

judging models by their ability scores

I use this as a framework for evaluation when I open up a new chat, and as a way of pattern-matching on new projects. OpenAI released Deep Research today: that’s attempting to be +1 DEX, +1 WIS, and it’s built on the new o3 model that is maybe 13 INT.

DeepSeek is +1 WIS (maybe just because you can see it being wise in the reasoning process)

Claude’s creators seem to care a lot about WIS, CHA and CON, which is why it’s taking a long time to release a new model: they’re trying to boost all 6 ability scores between releases, vs. OpenAI, which only cares about INT and DEX at the moment.

2025 is the year of DEX score increases

I think that there are a LOT of folks in 2025 who are trying to get to a 10 in DEX so they can make money. A little more speed, a little more tool use, and we’re off to the races. There will be a lot of big businesses built on DEX, just like how Character.AI built a business on CHA in 2022-23.

-C

EDIT 1:

what DeepSeek thinks OpenAI and Anthropic score, as well as a self-evaluation. So humble, DeepSeek!
