Automated lm-eval-harness scores across multilingual tasks. Higher is better.
Average score across all models
Leader by average score (min. 3 models)
Lowest average across qualifying models
Highest average across qualifying models
| # | Model | Avg | Arabic | English | Spanish | French | Hausa | Igbo | Portuguese | Albanian | Swahili | Ukrainian | Urdu | Yoruba |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5 Nano (52/67 tasks) | 68.6 | 65.0 | 63.6 | 61.9 | 64.9 | 54.8 | -- | 77.5 | 84.8 | 76.3 | 79.8 | 61.4 | -- |
| 2 | Functionary Swahili Large (27/67 tasks) | 66.8 | 69.4 | -- | 62.9 | 76.8 | 46.1 | -- | 69.8 | 88.2 | 91.1 | 73.9 | 59.1 | -- |
| 3 | Functionary Swahili Mini (21/67 tasks) | 64.6 | 43.4 | 38.1 | -- | 53.8 | 50.8 | -- | 75.8 | 84.6 | 85.5 | 91.5 | 67.6 | -- |
| 4 | GPT-oss-120B | 59.0 | 39.4 | 66.9 | 43.2 | 57.6 | 50.7 | 48.0 | 77.7 | 86.0 | 74.6 | 82.0 | 46.6 | 46.7 |
| 5 | GLM 5 (12/67 tasks) | 57.0 | -- | 53.2 | 31.8 | 84.9 | -- | -- | -- | -- | 79.4 | -- | -- | -- |
| 6 | Rnj 1 Instruct | 38.1 | 52.4 | 71.7 | 58.1 | 63.3 | 17.7 | 17.6 | 41.2 | 30.9 | 23.6 | 45.6 | 37.9 | 20.5 |
Average score on the tasks both models scored
Score difference: bars to the right mean GPT-5 Nano is ahead; bars to the left mean Functionary Swahili Large is ahead.
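As a rough illustration of the head-to-head comparison described above, the sketch below averages two models over only the entries both have scores for, then takes the per-entry difference. It uses the per-language columns from the table as a stand-in for per-task scores, which is an assumption: the page's per-task data and its exact averaging method are not shown here. The function name `head_to_head` is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Scores = Dict[str, Optional[float]]  # None marks a missing score ("--")

# Per-language scores copied from the table (assumed stand-in for per-task data).
gpt5_nano: Scores = {
    "Arabic": 65.0, "English": 63.6, "Spanish": 61.9, "French": 64.9,
    "Hausa": 54.8, "Igbo": None, "Portuguese": 77.5, "Albanian": 84.8,
    "Swahili": 76.3, "Ukrainian": 79.8, "Urdu": 61.4, "Yoruba": None,
}
functionary_large: Scores = {
    "Arabic": 69.4, "English": None, "Spanish": 62.9, "French": 76.8,
    "Hausa": 46.1, "Igbo": None, "Portuguese": 69.8, "Albanian": 88.2,
    "Swahili": 91.1, "Ukrainian": 73.9, "Urdu": 59.1, "Yoruba": None,
}


def head_to_head(a: Scores, b: Scores) -> Tuple[float, float, Dict[str, float]]:
    """Return (mean of a, mean of b, a - b per key) over keys both models scored."""
    shared: List[str] = [
        k for k in a if a[k] is not None and b.get(k) is not None
    ]
    if not shared:
        raise ValueError("no shared entries to compare")
    mean_a = sum(a[k] for k in shared) / len(shared)
    mean_b = sum(b[k] for k in shared) / len(shared)
    diff = {k: a[k] - b[k] for k in shared}
    return mean_a, mean_b, diff


mean_a, mean_b, diff = head_to_head(gpt5_nano, functionary_large)
print(f"Shared entries: {len(diff)}")
print(f"GPT-5 Nano: {mean_a:.1f}  Functionary Swahili Large: {mean_b:.1f}")

# Positive difference = first model ahead, negative = second model ahead.
for lang, d in sorted(diff.items(), key=lambda kv: kv[1], reverse=True):
    if d > 0:
        side = "GPT-5 Nano ahead"
    elif d < 0:
        side = "Functionary Swahili Large ahead"
    else:
        side = "tie"
    print(f"{lang:<11}{d:+6.1f}  ({side})")
```

Restricting both averages to the shared entries keeps the comparison fair when the two models cover different subsets, as they do here (52 vs 27 tasks).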
GPT-5 Nano vs the mean of all other models — 52 tasks total
Functionary Swahili Large vs the mean of all other models — 27 tasks total
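One way to read the "vs the mean of all other models" comparison is sketched below: for each entry the target model scored, subtract the mean of every other model's score on that entry, skipping models with no score there. This is an assumption about how the chart is built, not a confirmed method; the function name `vs_mean_of_others` is hypothetical, and the sample uses three rows and three language columns from the table in place of per-task scores.

```python
from typing import Dict, Optional

Scores = Dict[str, Optional[float]]  # None marks a missing score ("--")

# A small slice of the table (assumed stand-in for per-task data).
board: Dict[str, Scores] = {
    "GPT-5 Nano": {"Arabic": 65.0, "English": 63.6, "Igbo": None},
    "GPT-oss-120B": {"Arabic": 39.4, "English": 66.9, "Igbo": 48.0},
    "Rnj 1 Instruct": {"Arabic": 52.4, "English": 71.7, "Igbo": 17.6},
}


def vs_mean_of_others(board: Dict[str, Scores], target: str) -> Dict[str, float]:
    """Per entry: target score minus the mean score of all other models.

    Entries the target did not score, or that no other model scored,
    are left out of the result.
    """
    result: Dict[str, float] = {}
    for key, own in board[target].items():
        if own is None:
            continue
        others = [
            scores[key]
            for model, scores in board.items()
            if model != target and scores.get(key) is not None
        ]
        if others:
            result[key] = own - sum(others) / len(others)
    return result


delta = vs_mean_of_others(board, "GPT-5 Nano")
for key, d in delta.items():
    print(f"{key:<9}{d:+6.1f}")
```

A positive value means the target model is above the field on that entry, a negative value means it is below.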