ModelChorus

Copyright 2026 MeetKai Inc.

Benchmarks

Automated lm-eval-harness scores across multilingual tasks. Higher is better.

At a glance

Languages covered: 12
Best language: Ukrainian (Ukraine), 71.6%
Hardest language: Igbo (Nigeria), 32.8%
Eval coverage: 37% (246 of 670 cells)
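
The eval coverage figure is the share of scored cells out of all cells. A minimal Python sketch of that arithmetic, using the two counts shown above; the function name `coverage_percent` is illustrative, and rounding to the nearest whole percent is an assumption about how the page displays the value:

```python
def coverage_percent(scored_cells: int, total_cells: int) -> int:
    """Share of evaluated cells, as a whole-number percentage."""
    if total_cells <= 0:
        raise ValueError("total_cells must be positive")
    # Assumption: the dashboard rounds to the nearest integer percent.
    return round(100 * scored_cells / total_cells)


# Counts taken from the "At a glance" panel: 246 / 670 = 0.367...
print(coverage_percent(246, 670))
```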

Languages by difficulty

Average score across all models

  • Igbo (Nigeria)
    33
  • Yoruba (Nigeria)
    34
  • Hausa (Nigeria)
    42
  • Spanish (Spain)
    50
  • Arabic (Saudi Arabia)
    53
  • Urdu (Pakistan)
    53
  • English (US)
    63
  • French (France)
    64
  • Swahili (Tanzania)
    67
  • Portuguese (Portugal)
    67
  • Albanian (Albania)
    71
  • Ukrainian (Ukraine)
    72

Best model per language

Leader by average score (min. 3 models)

  • Arabic (Saudi Arabia)
    Functionary Swahili Large
    69.4%
  • English (US)
    Rnj 1 Instruct
    71.7%
  • Spanish (Spain)
    Functionary Swahili Large
    62.9%
  • French (France)
    GLM 5
    84.9%
  • Hausa (Nigeria)
    GPT-5 Nano
    54.8%
  • Portuguese (Portugal)
    GPT-oss-120B
    77.7%
  • Albanian (Albania)
    Functionary Swahili Large
    88.2%
  • Swahili (Tanzania)
    Functionary Swahili Large
    91.1%
  • Ukrainian (Ukraine)
    Functionary Swahili Mini
    91.5%
  • Urdu (Pakistan)
    Functionary Swahili Mini
    67.6%
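
The selection rule in the heading (leader by average score, at least three models) can be sketched as follows. This is a hedged Python illustration, not the site's code: the `leader` helper is hypothetical, and reading "min. 3 models" as "at least three models have a score for the language" is an assumption. The inputs are the Hausa and Igbo language averages from the All models table on this page, with `None` standing in for a missing score.

```python
from typing import Optional

MIN_MODELS = 3  # threshold quoted in the heading ("min. 3 models")


def leader(scores: dict[str, Optional[float]],
           min_models: int = MIN_MODELS) -> Optional[tuple[str, float]]:
    """Return (model, score) for the top-scoring model, or None if too few models scored."""
    scored = {model: s for model, s in scores.items() if s is not None}
    if len(scored) < min_models:
        return None  # assumption: languages below the threshold are not ranked
    best = max(scored, key=lambda model: scored[model])
    return best, scored[best]


# Language averages copied from the All models table (None = no score shown).
hausa = {
    "GPT-5 Nano": 54.8,
    "Functionary Swahili Large": 46.1,
    "Functionary Swahili Mini": 50.8,
    "GPT-oss-120B": 50.7,
    "GLM 5": None,
    "Rnj 1 Instruct": 17.7,
}
igbo = {
    "GPT-5 Nano": None,
    "Functionary Swahili Large": None,
    "Functionary Swahili Mini": None,
    "GPT-oss-120B": 48.0,
    "GLM 5": None,
    "Rnj 1 Instruct": 17.6,
}

print(leader(hausa))
print(leader(igbo))
```

Under this reading, Hausa resolves to GPT-5 Nano at 54.8, matching the list above, while Igbo has only two scored models in the table and gets no leader, which would be consistent with Igbo and Yoruba being absent from the list.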

Hardest tasks

Lowest average across qualifying models

  • hausa afriqa
    Hausa (Nigeria)
    10.5%
  • urdu simpleqa
    Urdu (Pakistan)
    16.4%
  • hausa afrimmlu
    Hausa (Nigeria)
    30.0%
  • urdu emotion class
    Urdu (Pakistan)
    30.8%
  • hausa belebele
    Hausa (Nigeria)
    32.9%
  • arabic mmlu
    Arabic (Saudi Arabia)
    35.7%
  • arabic tydiqa
    Arabic (Saudi Arabia)
    36.9%
  • spanish global mmlu
    Spanish (Spain)
    37.1%

Easiest tasks

Highest average across qualifying models

  • french sib200
    French (France)
    86.3%
  • english mgsm
    English (US)
    86.2%
  • english gsm8k
    English (US)
    85.6%
  • arabic sib200
    Arabic (Saudi Arabia)
    84.8%
  • ukrainian sib200
    Ukrainian (Ukraine)
    84.5%
  • portuguese hatebr
    Portuguese (Portugal)
    84.0%
  • ukrainian belebele
    Ukrainian (Ukraine)
    81.7%
  • swahili sib200
    Swahili (Tanzania)
    80.8%

All models

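| # | Model | Tasks scored | Avg | Arabic | English | Spanish | French | Hausa | Igbo | Portuguese | Albanian | Swahili | Ukrainian | Urdu | Yoruba |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5 Nano | 52/67 | 68.6 | 65.0 | 63.6 | 61.9 | 64.9 | 54.8 | -- | 77.5 | 84.8 | 76.3 | 79.8 | 61.4 | -- |
| 2 | Functionary Swahili Large | 27/67 | 66.8 | 69.4 | -- | 62.9 | 76.8 | 46.1 | -- | 69.8 | 88.2 | 91.1 | 73.9 | 59.1 | -- |
| 3 | Functionary Swahili Mini | 21/67 | 64.6 | 43.4 | 38.1 | -- | 53.8 | 50.8 | -- | 75.8 | 84.6 | 85.5 | 91.5 | 67.6 | -- |
| 4 | GPT-oss-120B | | 59.0 | 39.4 | 66.9 | 43.2 | 57.6 | 50.7 | 48.0 | 77.7 | 86.0 | 74.6 | 82.0 | 46.6 | 46.7 |
| 5 | GLM 5 | 12/67 | 57.0 | -- | 53.2 | 31.8 | 84.9 | -- | -- | -- | -- | 79.4 | -- | -- | -- |
| 6 | Rnj 1 Instruct | | 38.1 | 52.4 | 71.7 | 58.1 | 63.3 | 17.7 | 17.6 | 41.2 | 30.9 | 23.6 | 45.6 | 37.9 | 20.5 |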

Head-to-head

Per language

Average score on the tasks both models scored

  • Hausa (4 tasks)
    GPT-5 Nano
    59.3
    Functionary Swahili Large
    46.1
  • French (3 tasks)
    GPT-5 Nano
    70.6
    Functionary Swahili Large
    76.8
  • Portuguese (3 tasks)
    GPT-5 Nano
    75.9
    Functionary Swahili Large
    69.8
  • Arabic (2 tasks)
    GPT-5 Nano
    63.3
    Functionary Swahili Large
    69.4
  • Spanish (1 task)
    GPT-5 Nano
    68.3
    Functionary Swahili Large
    62.9
  • Swahili (1 task)
    GPT-5 Nano
    86.8
    Functionary Swahili Large
    91.1
  • Urdu (8 tasks)
    GPT-5 Nano
    61.4
    Functionary Swahili Large
    59.1
  • Albanian (3 tasks)
    GPT-5 Nano
    86.5
    Functionary Swahili Large
    88.2
  • Ukrainian (2 tasks)
    GPT-5 Nano
    74.3
    Functionary Swahili Large
    73.9

Per task gap

Score difference in percentage points (GPT-5 Nano minus Functionary Swahili Large): positive values mean GPT-5 Nano is ahead, negative values mean Functionary Swahili Large is ahead.

  • Hausa
    • hausa afrimgsm
      +23.6
    • hausa afrixnli
      +16.2
    • hausa afriqa
      +14.2
    • hausa sib200
      -1.3
  • Urdu
    • urdu freshqa
      +22.3
    • urdu fake news
      -15.5
    • urdu uquad
      +10.8
    • urdu bingcheck
      +8.8
    • urdu facttool qa
      -7.5
    • urdu factcheckbench
      -7.0
    • urdu simpleqa
      +3.5
    • urdu emotion class
      +2.7
  • Portuguese
    • portuguese hatebr
      +14.9
    • portuguese hate speech
      +4.7
    • portuguese tweetsentbr
      -1.3
  • Arabic
    • arabic tydiqa
      -9.8
    • arabic sib200
      -2.4
  • French
    • french fquad
      -9.2
    • french mgsm
      -6.8
    • french sib200
      -2.6
  • Spanish
    • spanish xquad es
      +5.3
  • Swahili
    • swahili sib200
      -4.4
  • Ukrainian
    • ukrainian squad
      +3.8
    • ukrainian sib200
      -3.0
  • Albanian
    • albanian belebele
      -3.5
    • albanian sib200
      -2.5
    • albanian global mmlu
      +0.7
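
Because both views use the same shared tasks, the mean of a language's per-task gaps should equal the difference between the two shared-task averages in the per-language list, up to display rounding. The Python sketch below checks that with the numbers shown on this page. The tolerance of 0.15 is an assumption based on every displayed value being rounded to one decimal, and the names `GAPS`, `AVERAGES`, and `mean_gap_residuals` are illustrative.

```python
from statistics import mean

# Per-task gaps (GPT-5 Nano minus Functionary Swahili Large), copied from the list above.
GAPS = {
    "Hausa": [23.6, 16.2, 14.2, -1.3],
    "Urdu": [22.3, -15.5, 10.8, 8.8, -7.5, -7.0, 3.5, 2.7],
    "Portuguese": [14.9, 4.7, -1.3],
    "Arabic": [-9.8, -2.4],
    "French": [-9.2, -6.8, -2.6],
    "Spanish": [5.3],
    "Swahili": [-4.4],
    "Ukrainian": [3.8, -3.0],
    "Albanian": [-3.5, -2.5, 0.7],
}

# Shared-task averages from the per-language list:
# (GPT-5 Nano, Functionary Swahili Large).
AVERAGES = {
    "Hausa": (59.3, 46.1),
    "Urdu": (61.4, 59.1),
    "Portuguese": (75.9, 69.8),
    "Arabic": (63.3, 69.4),
    "French": (70.6, 76.8),
    "Spanish": (68.3, 62.9),
    "Swahili": (86.8, 91.1),
    "Ukrainian": (74.3, 73.9),
    "Albanian": (86.5, 88.2),
}

# Assumption: allow for rounding of each displayed value to one decimal.
TOLERANCE = 0.15


def mean_gap_residuals() -> dict[str, float]:
    """Mean per-task gap minus the difference of the two shared-task averages."""
    return {
        lang: mean(GAPS[lang]) - (AVERAGES[lang][0] - AVERAGES[lang][1])
        for lang in GAPS
    }


for lang, residual in mean_gap_residuals().items():
    status = "ok" if abs(residual) <= TOLERANCE else "mismatch"
    print(f"{lang:<11}{mean(GAPS[lang]):+7.2f}  {status}")
```

The task counts also line up: the gap lists hold 27 tasks in total, the same number reported for Functionary Swahili Large.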

Score zones

GPT-5 Nano vs the mean of all other models — 52 tasks total

  • Leading
    >+5pp
    38
  • Parity
    ±5pp
    12
  • Behind
    -5 to -20pp
    2
  • Far behind
    <-20pp
    0

Score zones

Functionary Swahili Large vs the mean of all other models — 27 tasks total

  • Leading
    >+5pp
    16
  • Parity
    ±5pp
    9
  • Behind
    -5 to -20pp
    2
  • Far behind
    <-20pp
    0
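
The four zones above amount to bucketing each task's delta (the model's score minus the mean of all other models, in percentage points) by fixed thresholds. A minimal Python sketch under stated assumptions: the page does not say which zone a delta of exactly +5, -5, or -20 falls into, so the boundary handling here is a guess, and `zone` and `zone_counts` are hypothetical helper names.

```python
from collections import Counter

ZONES = ("Leading", "Parity", "Behind", "Far behind")


def zone(delta_pp: float) -> str:
    """Classify a score delta (percentage points) into one of the four zones."""
    if delta_pp > 5:
        return "Leading"
    if delta_pp >= -5:       # assumption: exactly +5 or -5 counts as parity
        return "Parity"
    if delta_pp >= -20:      # assumption: exactly -20 counts as behind
        return "Behind"
    return "Far behind"


def zone_counts(deltas: list[float]) -> dict[str, int]:
    """Tally deltas per zone, keeping zones with zero tasks in the result."""
    counts = Counter(zone(d) for d in deltas)
    return {name: counts.get(name, 0) for name in ZONES}


# Zone counts reported on the page, with the task totals they should add up to.
REPORTED = {
    "GPT-5 Nano": ({"Leading": 38, "Parity": 12, "Behind": 2, "Far behind": 0}, 52),
    "Functionary Swahili Large": ({"Leading": 16, "Parity": 9, "Behind": 2, "Far behind": 0}, 27),
}

for model, (counts, total) in REPORTED.items():
    print(model, sum(counts.values()) == total)

# Illustrative deltas only; these values are not taken from the page.
print(zone_counts([12.0, 3.5, -4.0, -8.2, -25.0]))
```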