ModelChorus

Copyright 2026 MeetKai Inc.

Functionary Swahili Large

8 tasks

Each row below is a single benchmark task this model was evaluated on. The Score column averages every metric the task reports (accuracy, F1, exact-match, etc.). Click a row to browse the individual questions and the model's responses.
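As a rough illustration of how a per-task Score could be derived from the reported metrics, the sketch below averages a task's metrics. It rests on one assumption not stated on this page: `sample_len` is treated as a descriptive field and left out of the average, which is consistent with the rows shown here (each Score equals the single quality metric listed). The names `task_score` and `NON_QUALITY_KEYS` are hypothetical and are not part of any ModelChorus API.

```python
from statistics import mean

# Assumption: sample_len is descriptive, not a quality metric,
# so it is excluded from the average.
NON_QUALITY_KEYS = {"sample_len"}


def task_score(metrics: dict[str, float]) -> float:
    """Average every quality metric a task reports, rounded to one decimal."""
    quality = [value for key, value in metrics.items()
               if key not in NON_QUALITY_KEYS]
    return round(mean(quality), 1)


# Metrics copied from the urdu_uquad row.
print(task_score({"llm_judge_score": 79.1, "sample_len": 139.0}))  # 79.1
```

With only one quality metric per task, as in every row on this page, the average reduces to that metric's value.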

Average: 59.1
Score  Language         Task                                      Metrics
79.1   Urdu (Pakistan)  urdu_uquad (urdu qa)                      llm_judge_score: 79.1, sample_len: 139.0
72.4   Urdu (Pakistan)  urdu_factcheckbench (urdu claim)          f1_macro: 72.4, sample_len: 387.0
71.4   Urdu (Pakistan)  urdu_fake_news (urdu classification)      f1_macro: 71.4, sample_len: 300.0
70.7   Urdu (Pakistan)  urdu_bingcheck (urdu claim)               f1_macro: 70.7, sample_len: 102.0
69.0   Urdu (Pakistan)  urdu_facttool_qa (urdu claim)             f1_macro: 69.0, sample_len: 160.0
56.7   Urdu (Pakistan)  urdu_freshqa (urdu qa)                    llm_judge_score: 56.7, sample_len: 323.0
31.5   Urdu (Pakistan)  urdu_emotion_class (urdu classification)  f1_macro: 31.5, sample_len: 200.0
22.0   Urdu (Pakistan)  urdu_simpleqa (urdu qa)                   llm_judge_score: 22.0, sample_len: 200.0
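
The reported Average can be checked against the eight task scores. The sketch below assumes the Average is an unweighted mean of the per-task Scores; the page does not say so explicitly, but that reading reproduces the displayed figure. The variable names are illustrative only.

```python
from statistics import mean

# Per-task scores copied from the results listed above.
scores = {
    "urdu_uquad": 79.1,
    "urdu_factcheckbench": 72.4,
    "urdu_fake_news": 71.4,
    "urdu_bingcheck": 70.7,
    "urdu_facttool_qa": 69.0,
    "urdu_freshqa": 56.7,
    "urdu_emotion_class": 31.5,
    "urdu_simpleqa": 22.0,
}

# Assumption: the page-level Average is an unweighted mean over tasks.
average = round(mean(scores.values()), 1)
print(average)  # 59.1
```

An unweighted mean gives each task equal influence regardless of its sample count, so the two weakest tasks (urdu_emotion_class and urdu_simpleqa) pull the average well below the median task score.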