ModelChorus

Copyright 2026 MeetKai Inc.

GPT-oss-120B

8 tasks

Each row below is a single benchmark task this model was evaluated on. The Score column averages every metric the task reports (accuracy, F1, exact-match, etc.). Click a row to browse the individual questions and the model's responses.

Average
46.6
Score  Language         Task                                      Metrics
82.7   Urdu (Pakistan)  urdu_uquad (urdu qa)                      llm_judge_score: 82.7, sample_len: 139.0
66.3   Urdu (Pakistan)  urdu_freshqa (urdu qa)                    llm_judge_score: 66.3, sample_len: 323.0
50.1   Urdu (Pakistan)  urdu_facttool_qa (urdu claim)             f1_macro: 50.1, sample_len: 160.0
44.6   Urdu (Pakistan)  urdu_factcheckbench (urdu claim)          f1_macro: 44.6, sample_len: 387.0
44.3   Urdu (Pakistan)  urdu_bingcheck (urdu claim)               f1_macro: 44.3, sample_len: 102.0
33.6   Urdu (Pakistan)  urdu_fake_news (urdu classification)      f1_macro: 33.6, sample_len: 300.0
33.5   Urdu (Pakistan)  urdu_emotion_class (urdu classification)  f1_macro: 33.5, sample_len: 200.0
17.5   Urdu (Pakistan)  urdu_simpleqa (urdu qa)                   llm_judge_score: 17.5, sample_len: 200.0
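
The arithmetic behind the Score column and the Average figure can be reproduced from the rows listed above. The sketch below is illustrative only, not the site's actual code. It assumes that `sample_len` is a descriptive field and is left out of the per-task mean, since each displayed Score equals the task's single other metric. The names `TASK_METRICS`, `NON_SCORE_METRICS`, `task_score`, and `overall_average` are hypothetical.

```python
from statistics import mean

# Metrics per task, copied from the rows above.
TASK_METRICS = {
    "urdu_uquad": {"llm_judge_score": 82.7, "sample_len": 139.0},
    "urdu_freshqa": {"llm_judge_score": 66.3, "sample_len": 323.0},
    "urdu_facttool_qa": {"f1_macro": 50.1, "sample_len": 160.0},
    "urdu_factcheckbench": {"f1_macro": 44.6, "sample_len": 387.0},
    "urdu_bingcheck": {"f1_macro": 44.3, "sample_len": 102.0},
    "urdu_fake_news": {"f1_macro": 33.6, "sample_len": 300.0},
    "urdu_emotion_class": {"f1_macro": 33.5, "sample_len": 200.0},
    "urdu_simpleqa": {"llm_judge_score": 17.5, "sample_len": 200.0},
}

# Assumption: sample_len is not a quality metric, so it is excluded
# from the per-task average.
NON_SCORE_METRICS = {"sample_len"}


def task_score(metrics):
    """Mean of the quality metrics reported for one task."""
    quality = [v for k, v in metrics.items() if k not in NON_SCORE_METRICS]
    return mean(quality)


def overall_average(tasks):
    """Unweighted mean of the per-task scores."""
    return mean(task_score(m) for m in tasks.values())


scores = {name: task_score(m) for name, m in TASK_METRICS.items()}
average = overall_average(TASK_METRICS)
print(f"{average:.1f}")
```

Under these assumptions the unweighted mean of the eight task scores is 372.6 / 8 = 46.575, which rounds to the 46.6 shown as the Average.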