GPT-oss-120B

8 tasks

Each row below is a single benchmark task this model was evaluated on. The Score column averages the quality metrics the task reports (accuracy, F1, exact match, etc.). Every task listed here reports exactly one such metric (llm_judge_score or f1_macro), so its Score equals that metric; sample_len is listed alongside it but is not part of the Score.
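The scoring rule described above can be sketched in a few lines of Python. This is only an illustration, not the site's actual implementation: the function name `task_score` and the `NON_QUALITY_KEYS` set are hypothetical, and excluding `sample_len` from the average is an assumption inferred from the rows on this page, where each Score matches the task's single quality metric.

```python
from statistics import mean

# Assumption: keys that describe the sample rather than measure quality.
NON_QUALITY_KEYS = {"sample_len"}


def task_score(metrics: dict[str, float]) -> float:
    """Average the quality metrics a task reports, rounded to one decimal."""
    quality = [value for key, value in metrics.items()
               if key not in NON_QUALITY_KEYS]
    if not quality:
        raise ValueError("task reports no quality metrics")
    return round(mean(quality), 1)


# Metrics copied from the urdu_uquad row on this page.
uquad = {"llm_judge_score": 82.7, "sample_len": 139.0}
print(task_score(uquad))  # 82.7
```

With a single quality metric the average is the metric itself, which is why every Score in the table equals its llm_judge_score or f1_macro value.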

Average score: 46.6

Score	Language	Task	Tags	Metrics
82.7	Urdu (Pakistan)	urdu_uquad	urdu, qa	llm_judge_score: 82.7, sample_len: 139.0
66.3	Urdu (Pakistan)	urdu_freshqa	urdu, qa	llm_judge_score: 66.3, sample_len: 323.0
50.1	Urdu (Pakistan)	urdu_facttool_qa	urdu, claim	f1_macro: 50.1, sample_len: 160.0
44.6	Urdu (Pakistan)	urdu_factcheckbench	urdu, claim	f1_macro: 44.6, sample_len: 387.0
44.3	Urdu (Pakistan)	urdu_bingcheck	urdu, claim	f1_macro: 44.3, sample_len: 102.0
33.6	Urdu (Pakistan)	urdu_fake_news	urdu, classification	f1_macro: 33.6, sample_len: 300.0
33.5	Urdu (Pakistan)	urdu_emotion_class	urdu, classification	f1_macro: 33.5, sample_len: 200.0
17.5	Urdu (Pakistan)	urdu_simpleqa	urdu, qa	llm_judge_score: 17.5, sample_len: 200.0