
GPT-oss-120B

5 tasks

Each row below is a single benchmark task this model was evaluated on. The Score column is the average of the metrics the task reports (accuracy, F1, exact match, etc.); sample_len, the number of evaluated samples, appears to be excluded from that average. Click a row to browse the individual questions and the model's responses.

Average

66.9

Score	Language	Task	Metrics
88.8	English (US)	english_mgsm (english math)	exact_match: 88.8; sample_len: 250.0
86.8	English (US)	english_gsm8k (english math)	exact_match: 86.8; sample_len: 1319.0
74.0	English (US)	english_mmlu_pro (english mmlu pro)	exact_match: 74.0; sample_len: 2100.0
44.3	English (US)	ifeval	inst_level_loose_acc: 56.0; inst_level_strict_acc: 45.8; prompt_level_loose_acc: 44.0; prompt_level_strict_acc: 31.6; sample_len: 541.0
40.8	English (US)	english_belebele (english mcq)	f1_macro: 40.8; sample_len: 900.0
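
The averaging behind the Score column and the overall Average can be sketched as below. This is a minimal illustration, not the leaderboard's actual code. It assumes an unweighted mean and assumes sample_len is a sample count rather than a metric, which is consistent with the numbers in the table. The names `TASKS` and `task_score` are hypothetical.

```python
from statistics import mean

# Metric values copied from the table above.
# Assumption: sample_len is a sample count, so it is left out of the average.
TASKS = {
    "english_mgsm": {"exact_match": 88.8},
    "english_gsm8k": {"exact_match": 86.8},
    "english_mmlu_pro": {"exact_match": 74.0},
    "ifeval": {
        "inst_level_loose_acc": 56.0,
        "inst_level_strict_acc": 45.8,
        "prompt_level_loose_acc": 44.0,
        "prompt_level_strict_acc": 31.6,
    },
    "english_belebele": {"f1_macro": 40.8},
}


def task_score(metrics):
    """Unweighted mean of one task's metrics (hypothetical helper)."""
    return mean(metrics.values())


# Per-task scores, then the unweighted mean across tasks.
scores = {name: task_score(metrics) for name, metrics in TASKS.items()}
overall = mean(scores.values())

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
print(f"average: {overall:.2f}")
```

Under these assumptions the ifeval score and the overall average both land within 0.1 of the displayed 44.3 and 66.9. The small gap may come from rounding or from extra precision in the underlying values, which the page does not show.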