AI IQ is here: a new site scores frontier AI models on the human IQ scale. The results are already dividing tech.
Michael Nuñez
4:47 pm, PT, May 13, 2026
Credit: VentureBeat made with Midjourney
For decades, the IQ test has been one of the most familiar — and most contested — yardsticks for human intelligence. Now, a startup project called AI IQ is applying the same metaphor to artificial intelligence, assigning estimated intelligence quotients to more than 50 of the world's most powerful language models and plotting them on a standard bell curve.
The result is a set of interactive visualizations at aiiq.org that have ricocheted across social media in the past week, drawing praise from enterprise technologists who say the charts make an impossibly complex market legible — and sharp criticism from researchers and commentators who warn the entire framework is misleading.
"This is super useful," wrote
Thibaut Mélen
, a technology commentator, on X. "Much easier to understand model progress when it's mapped like this instead of another giant leaderboard table."
Brian Vellmure, a business strategist, offered a similar endorsement: "This is helpful. Anecdotally tracks with personal experience."
But the backlash arrived just as quickly. "It's nonsense. AI is far too jagged. The map is not the territory," posted AI Deeply, an artificial intelligence commentary account, crystallizing a worry shared by many researchers: that reducing a language model's sprawling, uneven capabilities to a single number creates a dangerous illusion of precision.
More than 50 AI language models, plotted on a standard IQ bell curve by the site AI IQ. The most capable models crowd the right tail of the distribution. (Credit: AI IQ)
Twelve benchmarks, four dimensions, and one controversial number: how AI IQ actually works
AI IQ was created by Ryan Shea, an engineer, entrepreneur, and angel investor best known as a co-founder of the blockchain platform Stacks. Shea also co-founded Voterbase and has invested in the early stages of several unicorns, including OpenSea, Lattice, Anchorage, and Mercury. He holds a Bachelor of Science in Mechanical Engineering from Princeton University.
The site's methodology rests on a deceptively simple formula. AI IQ groups 12 benchmarks into four reasoning dimensions: abstract, mathematical, programmatic, and academic. The composite IQ is a straight average of those four dimension scores: IQ = ¼ (IQ_Abstract + IQ_Math + IQ_Prog + IQ_Acad).
The abstract reasoning dimension draws from ARC-AGI-1 and ARC-AGI-2, the notoriously difficult pattern-recognition benchmarks designed to test general fluid intelligence. Mathematical reasoning includes FrontierMath (Tiers 1–3 and Tier 4), AIME, and ProofBench. Programmatic reasoning uses Terminal-Bench 2.0, SWE-Bench Verified, and SciCode. Academic reasoning pulls from Humanity's Last Exam, CritPt, and GPQA Diamond.
Each raw benchmark score gets mapped to an implied IQ through what the site describes as "hand-calibrated difficulty curves." Crucially, the methodology compresses ceilings for benchmarks considered easier or more susceptible to data contamination, preventing them from inflating scores above 100. Harder, less gameable benchmarks retain higher ceilings. The system also handles missing data conservatively: models need scores on at least two of the four dimensions to receive a derived IQ, and when benchmarks are absent, the pipeline deliberately pulls scores down rather than up. The site states that "every derived IQ averages all four dimensions, so missing coverage cannot make a model look better by omission."
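That description suggests a pipeline roughly like the sketch below. To be clear, AI IQ has not published its calibration curves or its penalty for missing coverage; the anchor points, the 85-point floor, and the helper names here are illustrative assumptions, not the site's actual values.

```python
import numpy as np

# Hypothetical calibration anchors: (raw benchmark score %, implied IQ).
# Only two benchmarks are shown; the real curves are hand-calibrated and unpublished.
CALIBRATION = {
    "ARC-AGI-2":    [(0, 70), (25, 100), (60, 125), (100, 145)],  # hard benchmark, high ceiling
    "GPQA Diamond": [(0, 70), (50, 95), (80, 100), (100, 105)],   # easier benchmark, compressed ceiling
}

DIMENSIONS = ("abstract", "math", "prog", "acad")

def implied_iq(benchmark: str, raw_score: float) -> float:
    """Map a raw benchmark score (0-100) to an implied IQ via piecewise-linear interpolation."""
    xs, ys = zip(*CALIBRATION[benchmark])
    return float(np.interp(raw_score, xs, ys))

def dimension_iq(benchmark_scores: dict[str, float]) -> float | None:
    """Average the implied IQs of every benchmark reported for one dimension."""
    if not benchmark_scores:
        return None
    return sum(implied_iq(b, s) for b, s in benchmark_scores.items()) / len(benchmark_scores)

def composite_iq(model_scores: dict[str, dict[str, float]]) -> float | None:
    """IQ = (IQ_Abstract + IQ_Math + IQ_Prog + IQ_Acad) / 4, per the site's stated formula.
    Requires data in at least two dimensions; a missing dimension is filled with a
    conservative floor (assumed here) so omissions pull the average down, never up."""
    dim_iqs = [dimension_iq(model_scores.get(d, {})) for d in DIMENSIONS]
    if sum(v is not None for v in dim_iqs) < 2:
        return None  # not enough coverage to derive an IQ
    FLOOR_IQ = 85.0  # assumption; the site does not disclose its exact handling
    return sum(v if v is not None else FLOOR_IQ for v in dim_iqs) / len(DIMENSIONS)
```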
OpenAI leads the bell curve, but the gap between the top AI models has never been smaller
As of mid-May 2026, the AI IQ charts tell a story of rapid convergence at the top of the frontier — and widening diversity in the tiers below.
According to the Frontier IQ Over Time chart, GPT-5.5 from OpenAI currently sits at the peak of the bell curve, with an estimated IQ near 136 — the highest of any model tracked. It is closely followed by Anthropic's Opus 4.7 (approximately 132), OpenAI's own GPT-5.4 (approximately 131), and Opus 4.6 (approximately 129). Google's Gemini 3.1 Pro lands near 131, making the top cluster extraordinarily tight.
That compression is not unique to AI IQ's framework. Visual Capitalist, drawing from a separate Mensa-based ranking by TrackingAI, recently observed the same dynamic, noting that "the biggest takeaway is how compressed the top of the leaderboard has become." On that scale, Grok-4.20 Expert Mode and GPT 5.4 Pro tied at 145, with Gemini 3.1 Pro at 141.
Below the frontier cluster, the AI IQ charts show a crowded midfield. Models from Chinese labs — Kimi K2.6, GLM-5, DeepSeek-V3.2, Qwen3.6, MiniMax-M2.7 — bunch between roughly 112 and 118, making the cost-performance tier increasingly competitive for enterprise buyers who don't need the absolute best model for every task. One X user, ovsky, noted that the data "confirms experience with sonnet 4.6 being an absolute workhorse as opposed to opus 4.5" — pointing to the way the charts can validate practitioner intuitions that headline rankings often miss.
The trajectory of frontier AI models from October 2023 to mid-2026, as tracked by AI IQ. Provider-colored step-lines connect each lab's flagship releases, showing roughly 60 points of estimated IQ improvement in 30 months. (Credit: AI IQ)
Why emotional intelligence scores are becoming the new battleground in AI model rankings
What distinguishes AI IQ from most other benchmarking efforts is its inclusion of an "EQ" — emotional intelligence — score. The site maps each model's EQ-Bench 3 Elo score and Arena Elo score to an estimated EQ using calibrated piecewise-linear scales, then takes a 50/50 weighted composite of the two.
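Based on that description, the EQ calculation can be sketched roughly as follows. The anchor points and function names are illustrative assumptions; AI IQ describes "calibrated piecewise-linear scales" but does not publish the actual values.

```python
import numpy as np

# Hypothetical anchors mapping an Elo rating to an implied EQ (not the site's real curves).
EQBENCH_ANCHORS = [(1000, 85), (1300, 100), (1550, 120), (1700, 135)]
ARENA_ANCHORS   = [(1100, 85), (1300, 100), (1420, 120), (1500, 135)]

def elo_to_eq(elo: float, anchors: list[tuple[float, float]]) -> float:
    """Interpolate an Elo rating onto the implied-EQ scale."""
    xs, ys = zip(*anchors)
    return float(np.interp(elo, xs, ys))

def implied_eq(eqbench_elo: float, arena_elo: float) -> float:
    """50/50 weighted composite of the two mapped components, as the site describes."""
    return 0.5 * elo_to_eq(eqbench_elo, EQBENCH_ANCHORS) + 0.5 * elo_to_eq(arena_elo, ARENA_ANCHORS)
```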
The EQ scores produce a meaningfully different ranking than IQ alone. On the IQ vs. EQ scatter plot, Anthropic's Opus 4.7 leads on EQ with a score near 132, pushing it into the upper-right quadrant — the most desirable position, signaling both high cognitive and high emotional intelligence. OpenAI's GPT-5.5 and GPT-5.4 cluster in the high-IQ zone but lag slightly on EQ. Google's Gemini 3.1 Pro sits in a strong middle position on both axes.
One notable methodological choice has drawn attention: EQ-Bench 3 is judged by Claude, an Anthropic model, which the site acknowledges "creates potential scoring bias in favor of Anthropic models." To correct for this, AI IQ subtracts a 200-point Elo penalty from the EQ-Bench component for all Anthropic models before mapping to implied EQ. The Arena component is unaffected, since it uses human judges. That self-correction is unusual in the benchmarking world, and it suggests Shea is aware of the methodological minefield he has entered. Still, the EQ dimension captures something IQ alone cannot: the growing importance of conversational quality, collaboration, and trust in models deployed for user-facing work.
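Applied to the sketch above, that correction amounts to one adjustment before the EQ-Bench mapping. The provider check is hypothetical wiring; only the 200-point figure comes from the site.

```python
def corrected_eq(provider: str, eqbench_elo: float, arena_elo: float) -> float:
    """Subtract the 200-point Elo penalty from the EQ-Bench component for Anthropic models
    before mapping; the human-judged Arena component is left untouched."""
    if provider.lower() == "anthropic":
        eqbench_elo -= 200
    return implied_eq(eqbench_elo, arena_elo)
```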
Plotting IQ against EQ reveals that the smartest models aren't always the most emotionally intelligent. Anthropic's Opus 4.7 dominates the upper-right quadrant. (Credit: AI IQ)
The AI cost-performance chart that enterprise buyers actually need to see
Perhaps the most practically useful chart on the site is not the bell curve but the IQ vs. Effective Cost scatter plot. It maps each model's estimated IQ against an "effective cost" metric — defined as the token cost for a task using 2 million input tokens and 1 million output tokens, multiplied by a usage efficiency factor.
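As a worked example of that definition — the per-token prices in the comment are placeholders, not AI IQ's inputs:

```python
def effective_cost(input_price_per_mtok: float, output_price_per_mtok: float,
                   usage_efficiency: float = 1.0) -> float:
    """Token cost of the reference task (2M input tokens + 1M output tokens),
    scaled by a usage efficiency factor, per the site's definition."""
    token_cost = 2 * input_price_per_mtok + 1 * output_price_per_mtok
    return token_cost * usage_efficiency

# A hypothetical model priced at $3 / $15 per million tokens:
# effective_cost(3, 15) -> 2*3 + 1*15 = $21 per reference task, before the efficiency factor.
```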
The chart reveals a familiar pattern in enterprise technology: the best models are not always the best value. GPT-5.5 and Opus 4.7 sit in the upper-left corner — high IQ, high cost, with effective per-task costs north of $30 and $50 respectively. Meanwhile, models like GPT-5.4-mini, DeepSeek-V3.2, and MiniMax-M2.7 occupy a sweet spot in the middle: respectable IQ scores between 112 and 120, at effective costs ranging from roughly $1 to $5 per task. At the cheapest extreme, GPT-oss-20b (an open-source OpenAI model) appears near $0.20 effective cost with an IQ around 107 — potentially the most economical option for bulk classification or extraction workloads.
The site also offers a 3D visualization mapping IQ, EQ, and effective cost simultaneously. A dashed line running through the cube points toward the ideal: higher IQ, higher EQ, and lower cost. Models near the "green end" of that axis are stronger all-around deals; those near the "red end" sacrifice capability, cost efficiency, or both. For CIOs staring at API invoices, the implication is clear: the intelligence gap between a $50 model and a $3 model has narrowed enough that routing — using expensive models for hard problems and cheap ones for everything else — is no longer optional. It is the dominant architecture for serious AI deployments.
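A minimal sketch of what such routing can look like in practice — the tiers, thresholds, and model names below are assumptions for illustration, not recommendations from AI IQ or any provider:

```python
# Illustrative routing policy: send hard tasks to a frontier model, routine ones to cheaper tiers.
TIERS = [
    (0.8, "frontier-model"),  # hardest tasks: highest-IQ, highest-cost model
    (0.4, "mid-tier-model"),  # routine reasoning: cheaper, "good enough" IQ
    (0.0, "budget-model"),    # bulk classification / extraction
]

def route(task: str, estimate_difficulty) -> str:
    """Pick the cheapest model whose tier covers the task's estimated difficulty."""
    difficulty = estimate_difficulty(task)  # e.g., a small classifier returning 0.0-1.0
    for threshold, model in TIERS:
        if difficulty >= threshold:
            return model
    return TIERS[-1][1]
```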
Critics say AI's "jagged" capabilities make a single IQ score dangerously misleading
The loudest objection to AI IQ is philosophical, and it cuts deep. Critics argue that collapsing a model's uneven capabilities into a single score obscures more than it reveals.
"IQ as a proxy is fading — we're seeing reasoning density spikes that don't map to g-factor," posted
Zaya
, a technology commentator, on X. "GPT-5.5 already hit saturation on MMLU-Pro, but still fails ClockBench 50% of the time."
That observation touches on what AI researchers call the "jaggedness" problem: large language models often exhibit wildly uneven capabilities, excelling at graduate-level physics while failing at tasks a child could do. A composite score can paper over those gaps.
Pressureangle, another X user, posted a more granular critique, calling out a "complete lack of transparency" and arguing the site never fully discloses how its calibration curves were created or validated. In fairness, AI IQ does list its 12 benchmarks and shows the shape of each calibration curve in its methodology modal. But the raw data and precise mathematical transformations are not published as open datasets — a gap that matters to researchers accustomed to fully reproducible evaluations.