ModelChorus

Copyright 2026 MeetKai Inc.

GPT-5 Nano

52 tasks

Each row below is a single benchmark task this model was evaluated on. The Score column averages every metric the task reports (accuracy, F1, exact match, etc.).
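The averaging rule can be checked against the rows that report more than one metric. The sketch below assumes that `sample_len` is bookkeeping (the number of evaluated samples) and is left out of the average. The page does not state this, but it is consistent with the listed scores. The function name `task_score` is hypothetical.

```python
def task_score(metrics: dict) -> float:
    """Mean of a task's reported metrics.

    Assumption: sample_len is a sample count, not a quality metric,
    so it is excluded from the average.
    """
    values = [v for name, v in metrics.items() if name != "sample_len"]
    return sum(values) / len(values)


# Metric values copied from the spanish_xquad_es and ifeval rows.
xquad_es = {"exact_match": 58.0, "f1": 78.6, "sample_len": 1190.0}
ifeval = {
    "inst_level_loose_acc": 39.3,
    "inst_level_strict_acc": 38.5,
    "prompt_level_loose_acc": 26.6,
    "prompt_level_strict_acc": 25.7,
    "sample_len": 541.0,
}

print(round(task_score(xquad_es), 1))  # 68.3, matching the listed Score
print(round(task_score(ifeval), 1))    # 32.5, matching the listed Score
```

For tasks that report a single metric, the Score is simply that metric. Because the page shows values rounded to one decimal, recomputing a mean from the displayed metrics can differ from the listed Score in the last digit.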

Average score: 68.6
Score | Language | Task | Metrics
91.6 | Ukrainian (Ukraine) | ukrainian_belebele (ukrainian mcq) | f1_macro: 91.6, sample_len: 900.0
91.2 | Portuguese (Portugal) | portuguese_hatebr (portuguese classification) | f1_macro: 91.2, sample_len: 1400.0
90.4 | Portuguese (Portugal) | portuguese_enem (portuguese mcq) | exact_match: 90.4, sample_len: 1432.0
89.9 | Urdu (Pakistan) | urdu_uquad (urdu qa) | llm_judge_score: 89.9, sample_len: 139.0
88.8 | Albanian (Albania) | albanian_belebele (albanian mcq) | f1_macro: 88.8, sample_len: 900.0
88.4 | French (France) | french_sib200 (french classification) | f1_macro: 88.4, sample_len: 204.0
87.6 | English (US) | english_mgsm (english math) | exact_match: 87.6, sample_len: 250.0
87.5 | Albanian (Albania) | albanian_aya (albanian open generation) | llm_judge_score: 87.5, sample_len: 200.0
87.3 | Portuguese (Portugal) | portuguese_bluex (portuguese mcq) | exact_match: 87.3, sample_len: 724.0
87.3 | English (US) | english_gsm8k (english math) | exact_match: 87.3, sample_len: 1319.0
86.8 | Swahili (Tanzania) | swahili_sib200 (swahili classification) | f1_macro: 86.8, sample_len: 204.0
86.8 | Albanian (Albania) | albanian_sib200 (albanian classification) | f1_macro: 86.8, sample_len: 204.0
86.2 | Ukrainian (Ukraine) | ukrainian_sib200 (ukrainian classification) | f1_macro: 86.2, sample_len: 204.0
85.4 | Arabic (Saudi Arabia) | arabic_sib200 (arabic classification) | f1_macro: 85.4, sample_len: 204.0
84.9 | Ukrainian (Ukraine) | ukrainian_polywrite (ukrainian open generation) | open_quality_score: 84.9, sample_len: 154.0
84.1 | Ukrainian (Ukraine) | ukrainian_global_mmlu (ukrainian mcq) | f1_macro: 84.1, sample_len: 2850.0
83.9 | Hausa (Nigeria) | hausa_sib200 (hausa classification) | f1_macro: 83.9, sample_len: 204.0
83.9 | Albanian (Albania) | albanian_global_mmlu (albanian mcq) | f1_macro: 83.9, sample_len: 400.0
79.5 | Urdu (Pakistan) | urdu_bingcheck (urdu claim) | f1_macro: 79.5, sample_len: 102.0
78.9 | Urdu (Pakistan) | urdu_freshqa (urdu qa) | llm_judge_score: 78.9, sample_len: 323.0
77.3 | Albanian (Albania) | albanian_polywrite (albanian open generation) | open_quality_score: 77.3, sample_len: 155.0
76.8 | Swahili (Tanzania) | swahili_afrimgsm (swahili afrimgsm) | exact_match: 76.8, sample_len: 250.0
75.3 | Arabic (Saudi Arabia) | arabic_aratrust (arabic mcq) | f1_macro: 75.3, sample_len: 522.0
72.0 | Portuguese (Portugal) | portuguese_tweetsentbr (portuguese classification) | f1_macro: 72.0, sample_len: 2010.0
69.5 | Ukrainian (Ukraine) | ukrainian_zno (ukrainian mcq) | f1_macro: 69.5, sample_len: 751.0
68.3 | Spanish (Spain) | spanish_xquad_es (spanish xquad es) | exact_match: 58.0, f1: 78.6, sample_len: 1190.0
68.0 | Hausa (Nigeria) | hausa_afrixnli (hausa nli) | f1_macro: 68.0, sample_len: 600.0
65.5 | Urdu (Pakistan) | urdu_factcheckbench (urdu claim) | f1_macro: 65.5, sample_len: 387.0
65.2 | French (France) | french_mgsm (french math) | exact_match: 65.2, sample_len: 250.0
65.2 | Swahili (Tanzania) | swahili_afrixnli (swahili nli) | f1_macro: 65.2, sample_len: 600.0
64.9 | Arabic (Saudi Arabia) | arabic_belebele (arabic mcq) | f1_macro: 64.9, sample_len: 900.0
64.5 | Portuguese (Portugal) | portuguese_hate_speech (portuguese classification) | f1_macro: 64.5, sample_len: 851.0
62.4 | Ukrainian (Ukraine) | ukrainian_squad (ukrainian qa) | exact_match: 49.3, f1: 75.6, sample_len: 3812.0
61.6 | Hausa (Nigeria) | hausa_afrimgsm (hausa afrimgsm) | exact_match: 61.6, sample_len: 250.0
61.5 | Urdu (Pakistan) | urdu_facttool_qa (urdu claim) | f1_macro: 61.5, sample_len: 160.0
61.3 | English (US) | english_mmlu_pro (english mmlu pro) | exact_match: 61.3, sample_len: 2100.0
61.0 | Spanish (Spain) | spanish_belebele (spanish mcq) | f1_macro: 61.0, sample_len: 900.0
59.4 | Portuguese (Portugal) | portuguese_oab_exams (portuguese mcq) | exact_match: 59.4, sample_len: 2210.0
59.2 | French (France) | french_belebele (french mcq) | f1_macro: 59.2, sample_len: 900.0
58.3 | French (France) | french_fquad (french qa) | exact_match: 54.3, f1: 62.3, sample_len: 400.0
58.2 | Arabic (Saudi Arabia) | arabic_mmlu (arabic mcq) | f1_macro: 58.2, sample_len: 14316.0
56.5 | Spanish (Spain) | spanish_global_mmlu (spanish mcq) | f1_macro: 56.5, sample_len: 400.0
55.9 | Urdu (Pakistan) | urdu_fake_news (urdu classification) | f1_macro: 55.9, sample_len: 300.0
53.6 | French (France) | french_mmmlu (french mcq) | f1_macro: 53.6, sample_len: 14042.0
51.8 | Hausa (Nigeria) | hausa_belebele (hausa mcq) | f1_macro: 51.8, sample_len: 900.0
49.4 | English (US) | english_belebele (english mcq) | f1_macro: 49.4, sample_len: 900.0
41.3 | Arabic (Saudi Arabia) | arabic_tydiqa (arabic qa) | exact_match: 35.9, f1: 46.7, sample_len: 921.0
40.1 | Hausa (Nigeria) | hausa_afrimmlu (hausa mcq) | f1_macro: 40.1, sample_len: 500.0
34.2 | Urdu (Pakistan) | urdu_emotion_class (urdu classification) | f1_macro: 34.2, sample_len: 200.0
32.5 | English (US) | ifeval (ifeval) | inst_level_loose_acc: 39.3, inst_level_strict_acc: 38.5, prompt_level_loose_acc: 26.6, prompt_level_strict_acc: 25.7, sample_len: 541.0
25.5 | Urdu (Pakistan) | urdu_simpleqa (urdu qa) | llm_judge_score: 25.5, sample_len: 200.0
23.5 | Hausa (Nigeria) | hausa_afriqa (hausa qa) | exact_match: 22.7, f1: 24.3, sample_len: 300.0