ModelChorus

Copyright 2026 MeetKai Inc.

GPT-oss-120B

67 tasks

Each row below is a single benchmark task this model was evaluated on. The Score column averages every metric the task reports (accuracy, F1, exact-match, etc.).
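As a rough illustration of how a Score could be derived from a row's metrics, the sketch below takes an unweighted mean of the reported metric values. The function name `task_score` is hypothetical, and the choice to skip `sample_len` is an assumption inferred from the listed numbers (it looks like a sample count, not a quality metric); the site's actual implementation is not shown on this page.

```python
from statistics import mean

def task_score(metrics: dict, ignore=("sample_len",)) -> float:
    """Unweighted mean of a task's metrics.

    Assumption: fields named in `ignore` (here sample_len) are bookkeeping
    values and are left out of the average.
    """
    values = [v for k, v in metrics.items() if k not in ignore]
    if not values:
        raise ValueError("no quality metrics to average")
    return mean(values)

# Metrics copied from the spanish_xquad_es row.
xquad = {"exact_match": 54.4, "f1": 75.4, "sample_len": 1190.0}
print(round(task_score(xquad), 1))  # 64.9, matching the Score shown for that row
```

Because the page displays metrics rounded to one decimal place, recomputing a Score from the displayed values can differ from the listed Score by 0.1 on some rows.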

Average: 59.0
| Score | Language | Task | Metrics |
|---|---|---|---|
| 98.1 | Ukrainian (Ukraine) | ukrainian_polywrite (ukrainian open generation) | open_quality_score: 98.1, sample_len: 154.0 |
| 96.5 | Albanian (Albania) | albanian_polywrite (albanian open generation) | open_quality_score: 96.5, sample_len: 155.0 |
| 91.8 | Ukrainian (Ukraine) | ukrainian_belebele (ukrainian mcq) | f1_macro: 91.8, sample_len: 900.0 |
| 91.8 | Portuguese (Portugal) | portuguese_enem (portuguese mcq) | exact_match: 91.8, sample_len: 1432.0 |
| 90.5 | French (France) | french_sib200 (french classification) | f1_macro: 90.5, sample_len: 204.0 |
| 90.1 | Portuguese (Portugal) | portuguese_bluex (portuguese mcq) | exact_match: 90.1, sample_len: 724.0 |
| 89.6 | Albanian (Albania) | albanian_belebele (albanian mcq) | f1_macro: 89.6, sample_len: 900.0 |
| 89.3 | Swahili (Tanzania) | swahili_sib200 (swahili classification) | f1_macro: 89.3, sample_len: 204.0 |
| 88.9 | Ukrainian (Ukraine) | ukrainian_sib200 (ukrainian classification) | f1_macro: 88.9, sample_len: 204.0 |
| 88.8 | English (US) | english_mgsm (english math) | exact_match: 88.8, sample_len: 250.0 |
| 87.2 | Arabic (Saudi Arabia) | arabic_sib200 (arabic classification) | f1_macro: 87.2, sample_len: 204.0 |
| 87.2 | Albanian (Albania) | albanian_sib200 (albanian classification) | f1_macro: 87.2, sample_len: 204.0 |
| 86.8 | English (US) | english_gsm8k (english math) | exact_match: 86.8, sample_len: 1319.0 |
| 86.0 | Hausa (Nigeria) | hausa_sib200 (hausa classification) | f1_macro: 86.0, sample_len: 204.0 |
| 85.4 | Portuguese (Portugal) | portuguese_hatebr (portuguese classification) | f1_macro: 85.4, sample_len: 1400.0 |
| 84.7 | Ukrainian (Ukraine) | ukrainian_global_mmlu (ukrainian mcq) | f1_macro: 84.7, sample_len: 2850.0 |
| 82.8 | Albanian (Albania) | albanian_global_mmlu (albanian mcq) | f1_macro: 82.8, sample_len: 400.0 |
| 82.7 | Urdu (Pakistan) | urdu_uquad (urdu qa) | llm_judge_score: 82.7, sample_len: 139.0 |
| 81.8 | Igbo (Nigeria) | igbo_sib200 (igbo classification) | f1_macro: 81.8, sample_len: 204.0 |
| 78.6 | Yoruba (Nigeria) | yoruba_sib200 (yoruba classification) | f1_macro: 78.6, sample_len: 204.0 |
| 74.0 | Albanian (Albania) | albanian_aya (albanian open generation) | llm_judge_score: 74.0, sample_len: 200.0 |
| 74.0 | English (US) | english_mmlu_pro (english mmlu pro) | exact_match: 74.0, sample_len: 2100.0 |
| 71.2 | French (France) | french_mgsm (french math) | exact_match: 71.2, sample_len: 250.0 |
| 70.2 | Portuguese (Portugal) | portuguese_tweetsentbr (portuguese classification) | f1_macro: 70.2, sample_len: 2010.0 |
| 70.0 | Ukrainian (Ukraine) | ukrainian_zno (ukrainian mcq) | f1_macro: 70.0, sample_len: 751.0 |
| 69.6 | Swahili (Tanzania) | swahili_afrimgsm (swahili afrimgsm) | exact_match: 69.6, sample_len: 250.0 |
| 68.9 | Yoruba (Nigeria) | yoruba_naijasenti (yoruba sentiment) | f1_macro: 68.9, sample_len: 4515.0 |
| 68.6 | Hausa (Nigeria) | hausa_afrixnli (hausa nli) | f1_macro: 68.6, sample_len: 600.0 |
| 67.8 | Hausa (Nigeria) | hausa_naijasenti (hausa sentiment) | f1_macro: 67.8, sample_len: 5303.0 |
| 66.3 | Urdu (Pakistan) | urdu_freshqa (urdu qa) | llm_judge_score: 66.3, sample_len: 323.0 |
| 65.8 | Portuguese (Portugal) | portuguese_hate_speech (portuguese classification) | f1_macro: 65.8, sample_len: 851.0 |
| 65.6 | Hausa (Nigeria) | hausa_afrimgsm (hausa afrimgsm) | exact_match: 65.6, sample_len: 250.0 |
| 65.0 | Swahili (Tanzania) | swahili_afrixnli (swahili nli) | f1_macro: 65.0, sample_len: 600.0 |
| 64.9 | Igbo (Nigeria) | igbo_afrixnli (igbo nli) | f1_macro: 64.9, sample_len: 600.0 |
| 64.9 | Spanish (Spain) | spanish_xquad_es (spanish xquad es) | exact_match: 54.4, f1: 75.4, sample_len: 1190.0 |
| 62.9 | Portuguese (Portugal) | portuguese_oab_exams (portuguese mcq) | exact_match: 62.9, sample_len: 2210.0 |
| 62.7 | Igbo (Nigeria) | igbo_naijasenti (igbo sentiment) | f1_macro: 62.7, sample_len: 3682.0 |
| 60.8 | Yoruba (Nigeria) | yoruba_afrimgsm (yoruba afrimgsm) | exact_match: 60.8, sample_len: 250.0 |
| 60.5 | Yoruba (Nigeria) | yoruba_afrixnli (yoruba nli) | f1_macro: 60.5, sample_len: 600.0 |
| 58.4 | Ukrainian (Ukraine) | ukrainian_squad (ukrainian qa) | exact_match: 46.2, f1: 70.6, sample_len: 3812.0 |
| 57.8 | French (France) | french_fquad (french qa) | exact_match: 47.3, f1: 68.3, sample_len: 400.0 |
| 53.2 | Igbo (Nigeria) | igbo_afrimgsm (igbo afrimgsm) | exact_match: 53.2, sample_len: 250.0 |
| 50.1 | Urdu (Pakistan) | urdu_facttool_qa (urdu claim) | f1_macro: 50.1, sample_len: 160.0 |
| 44.6 | Urdu (Pakistan) | urdu_factcheckbench (urdu claim) | f1_macro: 44.6, sample_len: 387.0 |
| 44.3 | English (US) | ifeval (ifeval) | inst_level_loose_acc: 56.0, inst_level_strict_acc: 45.8, prompt_level_loose_acc: 44.0, prompt_level_strict_acc: 31.6, sample_len: 541.0 |
| 44.3 | Urdu (Pakistan) | urdu_bingcheck (urdu claim) | f1_macro: 44.3, sample_len: 102.0 |
| 40.9 | Arabic (Saudi Arabia) | arabic_tydiqa (arabic qa) | exact_match: 30.0, f1: 51.7, sample_len: 921.0 |
| 40.8 | English (US) | english_belebele (english mcq) | f1_macro: 40.8, sample_len: 900.0 |
| 38.1 | French (France) | french_belebele (french mcq) | f1_macro: 38.1, sample_len: 900.0 |
| 37.0 | Arabic (Saudi Arabia) | arabic_belebele (arabic mcq) | f1_macro: 37.0, sample_len: 900.0 |
| 33.7 | Spanish (Spain) | spanish_belebele (spanish mcq) | f1_macro: 33.7, sample_len: 900.0 |
| 33.6 | Urdu (Pakistan) | urdu_fake_news (urdu classification) | f1_macro: 33.6, sample_len: 300.0 |
| 33.5 | Urdu (Pakistan) | urdu_emotion_class (urdu classification) | f1_macro: 33.5, sample_len: 200.0 |
| 31.2 | Spanish (Spain) | spanish_global_mmlu (spanish mcq) | f1_macro: 31.2, sample_len: 400.0 |
| 30.3 | French (France) | french_mmmlu (french mcq) | f1_macro: 30.3, sample_len: 14042.0 |
| 29.2 | Igbo (Nigeria) | igbo_afriqa (igbo qa) | exact_match: 24.7, f1: 33.7, sample_len: 409.0 |
| 25.2 | Hausa (Nigeria) | hausa_afrimmlu (hausa mcq) | f1_macro: 25.2, sample_len: 500.0 |
| 23.5 | Yoruba (Nigeria) | yoruba_afrimmlu (yoruba mcq) | f1_macro: 23.5, sample_len: 500.0 |
| 22.8 | Hausa (Nigeria) | hausa_belebele (hausa mcq) | f1_macro: 22.8, sample_len: 900.0 |
| 22.3 | Igbo (Nigeria) | igbo_afrimmlu (igbo mcq) | f1_macro: 22.3, sample_len: 500.0 |
| 22.1 | Igbo (Nigeria) | igbo_belebele (igbo mcq) | f1_macro: 22.1, sample_len: 900.0 |
| 19.2 | Hausa (Nigeria) | hausa_afriqa (hausa qa) | exact_match: 17.3, f1: 21.0, sample_len: 300.0 |
| 19.1 | Yoruba (Nigeria) | yoruba_belebele (yoruba mcq) | f1_macro: 19.1, sample_len: 900.0 |
| 17.5 | Urdu (Pakistan) | urdu_simpleqa (urdu qa) | llm_judge_score: 17.5, sample_len: 200.0 |
| 17.1 | Arabic (Saudi Arabia) | arabic_aratrust (arabic mcq) | f1_macro: 17.1, sample_len: 522.0 |
| 15.7 | Yoruba (Nigeria) | yoruba_afriqa (yoruba qa) | exact_match: 12.7, f1: 18.8, sample_len: 332.0 |
| 14.7 | Arabic (Saudi Arabia) | arabic_mmlu (arabic mcq) | f1_macro: 14.7, sample_len: 14316.0 |
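
The page does not say how the overall Average of 59.0 is obtained. One reading that is consistent with the listed numbers is an unweighted mean of the 67 per-task Score values, with every task counting equally regardless of its `sample_len`. The sketch below checks that reading; the names `SCORES` and `overall_average` are hypothetical, and the values are copied from the Score column in page order.

```python
# Score column, copied row by row from the results listing (67 tasks).
SCORES = [
    98.1, 96.5, 91.8, 91.8, 90.5, 90.1, 89.6, 89.3, 88.9, 88.8,
    87.2, 87.2, 86.8, 86.0, 85.4, 84.7, 82.8, 82.7, 81.8, 78.6,
    74.0, 74.0, 71.2, 70.2, 70.0, 69.6, 68.9, 68.6, 67.8, 66.3,
    65.8, 65.6, 65.0, 64.9, 64.9, 62.9, 62.7, 60.8, 60.5, 58.4,
    57.8, 53.2, 50.1, 44.6, 44.3, 44.3, 40.9, 40.8, 38.1, 37.0,
    33.7, 33.6, 33.5, 31.2, 30.3, 29.2, 25.2, 23.5, 22.8, 22.3,
    22.1, 19.2, 19.1, 17.5, 17.1, 15.7, 14.7,
]

def overall_average(scores):
    """Unweighted mean of per-task scores (assumed aggregation, not confirmed by the page)."""
    return sum(scores) / len(scores)

print(len(SCORES))                        # 67
print(round(overall_average(SCORES), 1))  # 59.0
```

Under this reading, a 102-sample task such as urdu_bingcheck moves the Average as much as a 14316-sample task such as arabic_mmlu.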