Benchmarks

Automated lm-eval-harness scores across multilingual tasks. Higher is better.

At a glance

Languages covered: 12
Best language: Ukrainian (Ukraine), 71.6%
Hardest language: Igbo (Nigeria), 32.8%
Eval coverage: 37% (246 of 670 cells)
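The coverage figure can be reproduced from the per-model task counts shown in the "All models" table. The sketch below is illustrative only: it assumes that the two models listed without a coverage label (GPT-oss-120B and Rnj 1 Instruct) were scored on all 67 tasks, which the page does not state but which is consistent with the 246 total. The 670-cell denominator is taken from the page as given.

```python
# Scored tasks per model, read from the "All models" table (out of 67 each).
# Assumption: rows without an "N/67" label are treated as fully scored (67).
SCORED_TASKS = {
    "GPT-5 Nano": 52,
    "Functionary Swahili Large": 27,
    "Functionary Swahili Mini": 21,
    "GPT-oss-120B": 67,
    "GLM 5": 12,
    "Rnj 1 Instruct": 67,
}
TOTAL_CELLS = 670  # denominator reported on the page

scored_cells = sum(SCORED_TASKS.values())
coverage = scored_cells / TOTAL_CELLS

print(f"{scored_cells} of {TOTAL_CELLS} cells = {coverage:.0%}")
```

Under that assumption the script prints `246 of 670 cells = 37%`, matching the figure above.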

Languages by difficulty

Average score across all models

Language	Average score
Igbo (Nigeria)	33
Yoruba (Nigeria)	34
Hausa (Nigeria)	42
Spanish (Spain)	50
Arabic (Saudi Arabia)	53
Urdu (Pakistan)	53
English (US)	63
French (France)	64
Swahili (Tanzania)	67
Portuguese (Portugal)	67
Albanian (Albania)	71
Ukrainian (Ukraine)	72

Best model per language

Leader by average score (min. 3 models)

Hardest tasks

Lowest average across qualifying models

Task	Language	Average score
hausa afriqa	Hausa (Nigeria)	10.5%
urdu simpleqa	Urdu (Pakistan)	16.4%
hausa afrimmlu	Hausa (Nigeria)	30.0%
urdu emotion class	Urdu (Pakistan)	30.8%
hausa belebele	Hausa (Nigeria)	32.9%
arabic mmlu	Arabic (Saudi Arabia)	35.7%
arabic tydiqa	Arabic (Saudi Arabia)	36.9%
spanish global mmlu	Spanish (Spain)	37.1%

Easiest tasks

Highest average across qualifying models

Task	Language	Average score
french sib200	French (France)	86.3%
english mgsm	English (US)	86.2%
english gsm8k	English (US)	85.6%
arabic sib200	Arabic (Saudi Arabia)	84.8%
ukrainian sib200	Ukrainian (Ukraine)	84.5%
portuguese hatebr	Portuguese (Portugal)	84.0%
ukrainian belebele	Ukrainian (Ukraine)	81.7%
swahili sib200	Swahili (Tanzania)	80.8%

All models

#	Model	Avg	Arabic	English	Spanish	French	Hausa	Igbo	Portuguese	Albanian	Swahili	Ukrainian	Urdu	Yoruba
1	GPT-5 Nano (52/67 tasks)	68.6	65.0	63.6	61.9	64.9	54.8	--	77.5	84.8	76.3	79.8	61.4	--
2	Functionary Swahili Large (27/67 tasks)	66.8	69.4	--	62.9	76.8	46.1	--	69.8	88.2	91.1	73.9	59.1	--
3	Functionary Swahili Mini (21/67 tasks)	64.6	43.4	38.1	--	53.8	50.8	--	75.8	84.6	85.5	91.5	67.6	--
4	GPT-oss-120B	59.0	39.4	66.9	43.2	57.6	50.7	48.0	77.7	86.0	74.6	82.0	46.6	46.7
5	GLM 5 (12/67 tasks)	57.0	--	53.2	31.8	84.9	--	--	--	--	79.4	--	--	--
6	Rnj 1 Instruct	38.1	52.4	71.7	58.1	63.3	17.7	17.6	41.2	30.9	23.6	45.6	37.9	20.5
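The "Leader by average score (min. 3 models)" rule can be illustrated with a few columns from the table above. This is a minimal sketch, not the dashboard's actual logic: the function name `leaders` is hypothetical, only three languages are included for brevity, and only the six models listed here are considered, so the page's own leaderboard may differ if it draws on more models.

```python
from typing import Optional

# Scores copied from the "All models" table; None stands for a "--" cell.
SCORES: dict[str, dict[str, Optional[float]]] = {
    "GPT-5 Nano": {"English": 63.6, "Swahili": 76.3, "Igbo": None},
    "Functionary Swahili Large": {"English": None, "Swahili": 91.1, "Igbo": None},
    "Functionary Swahili Mini": {"English": 38.1, "Swahili": 85.5, "Igbo": None},
    "GPT-oss-120B": {"English": 66.9, "Swahili": 74.6, "Igbo": 48.0},
    "GLM 5": {"English": 53.2, "Swahili": 79.4, "Igbo": None},
    "Rnj 1 Instruct": {"English": 71.7, "Swahili": 23.6, "Igbo": 17.6},
}


def leaders(scores, min_models=3):
    """Return {language: (model, score)} for languages with enough scored models."""
    languages = sorted({lang for row in scores.values() for lang in row})
    result = {}
    for lang in languages:
        # Keep only models that actually have a score for this language.
        scored = {m: row[lang] for m, row in scores.items() if row.get(lang) is not None}
        if len(scored) >= min_models:
            best = max(scored, key=scored.get)
            result[lang] = (best, scored[best])
    return result


for lang, (model, score) in leaders(SCORES).items():
    print(f"{lang}: {model} ({score})")
```

With these inputs, Igbo is skipped because only two of the listed models have a score for it, which is the effect the minimum-model threshold is meant to have.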

Head-to-head

Model A: GPT-5 Nano
Model B: Functionary Swahili Large

Per language

Average score on the tasks both models scored

Language	Shared tasks	GPT-5 Nano	Functionary Swahili Large
Hausa	4	59.3	46.1
French	3	70.6	76.8
Portuguese	3	75.9	69.8
Arabic	2	63.3	69.4
Spanish	1	68.3	62.9
Swahili	1	86.8	91.1
Urdu	8	61.4	59.1
Albanian	3	86.5	88.2
Ukrainian	2	74.3	73.9

Per task gap

Score difference in percentage points: positive values mean GPT-5 Nano is ahead, negative values mean Functionary Swahili Large is ahead.

Hausa
- hausa afrimgsm: +23.6
- hausa afrixnli: +16.2
- hausa afriqa: +14.2
- hausa sib200: -1.3
Urdu
- urdu freshqa: +22.3
- urdu fake news: -15.5
- urdu uquad: +10.8
- urdu bingcheck: +8.8
- urdu facttool qa: -7.5
- urdu factcheckbench: -7.0
- urdu simpleqa: +3.5
- urdu emotion class: +2.7
Portuguese
- portuguese hatebr: +14.9
- portuguese hate speech: +4.7
- portuguese tweetsentbr: -1.3
Arabic
- arabic tydiqa: -9.8
- arabic sib200: -2.4
French
- french fquad: -9.2
- french mgsm: -6.8
- french sib200: -2.6
Spanish
- spanish xquad es: +5.3
Swahili
- swahili sib200: -4.4
Ukrainian
- ukrainian squad: +3.8
- ukrainian sib200: -3.0
Albanian
- albanian belebele: -3.5
- albanian sib200: -2.5
- albanian global mmlu: +0.7
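The per-task gaps and the per-language averages appear to be two views of the same data: averaging the task gaps within a language should give the difference between the two per-language scores. The following sketch checks this, assuming each gap is GPT-5 Nano's score minus Functionary Swahili Large's score on a shared task. The function name `language_gaps` is hypothetical.

```python
from statistics import mean

# Per-task gaps in percentage points, copied from the list above.
GAPS = {
    "Hausa": [23.6, 16.2, 14.2, -1.3],
    "Urdu": [22.3, -15.5, 10.8, 8.8, -7.5, -7.0, 3.5, 2.7],
    "Portuguese": [14.9, 4.7, -1.3],
    "Arabic": [-9.8, -2.4],
    "French": [-9.2, -6.8, -2.6],
    "Spanish": [5.3],
    "Swahili": [-4.4],
    "Ukrainian": [3.8, -3.0],
    "Albanian": [-3.5, -2.5, 0.7],
}


def language_gaps(gaps):
    """Mean per-task gap for each language, rounded to one decimal place."""
    return {lang: round(mean(values), 1) for lang, values in gaps.items()}


for lang, gap in language_gaps(GAPS).items():
    print(f"{lang}: {gap:+.1f}")
```

For Hausa this gives +13.2, which equals 59.3 minus 46.1 from the per-language figures. Some languages differ by 0.1 (Albanian comes out at -1.8 against 86.5 minus 88.2 = -1.7), most likely because the displayed scores are already rounded.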

Score zones

GPT-5 Nano vs. the mean of all other models (52 tasks total)

Zone	Range	Tasks
Leading	>+5pp	38
Parity	±5pp	12
Behind	-5 to -20pp	2
Far behind	<-20pp	0

Score zones

Functionary Swahili Large vs. the mean of all other models (27 tasks total)

Zone	Range	Tasks
Leading	>+5pp	16
Parity	±5pp	9
Behind	-5 to -20pp	2
Far behind	<-20pp	0
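The zone thresholds can be expressed as a small classifier. This is a sketch under stated assumptions: the page does not say which zone a gap of exactly +5, -5, or -20 falls into, so the boundary handling below is a guess, and the function name `zone` and the sample gaps are hypothetical. Each input is meant to be one model's score on a task minus the mean of all other models on that task, in percentage points.

```python
from collections import Counter


def zone(gap_pp: float) -> str:
    """Map a gap (model minus mean of other models, in pp) to a score zone."""
    if gap_pp > 5:
        return "Leading"
    if gap_pp >= -5:  # assumption: both +5 and -5 count as parity
        return "Parity"
    if gap_pp >= -20:  # assumption: exactly -20 counts as behind
        return "Behind"
    return "Far behind"


# Hypothetical gaps for illustration only; these are not values from the page.
sample_gaps = [12.4, 5.0, -3.1, -7.5, -20.0, -26.0]
counts = Counter(zone(g) for g in sample_gaps)

for name in ("Leading", "Parity", "Behind", "Far behind"):
    print(f"{name}: {counts[name]}")
```

Counting the zones over every task a model was scored on would produce tallies in the form shown above, for example four numbers that sum to 52 for GPT-5 Nano.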