evalscope 0.7.2.tar.gz → 0.8.1.tar.gz

This diff reflects the changes between publicly released versions of the package as they appear in their respective public registries, and is provided for informational purposes only.

Potentially problematic release: this version of evalscope might be problematic.

Files changed (329)
  1. {evalscope-0.7.2/evalscope.egg-info → evalscope-0.8.1}/PKG-INFO +123 -119
  2. {evalscope-0.7.2 → evalscope-0.8.1}/README.md +120 -115
  3. evalscope-0.8.1/evalscope/__init__.py +3 -0
  4. evalscope-0.8.1/evalscope/arguments.py +73 -0
  5. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/base.py +6 -2
  6. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/opencompass/api_meta_template.py +8 -14
  7. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/opencompass/backend_manager.py +24 -15
  8. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/opencompass/tasks/eval_api.py +1 -6
  9. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/opencompass/tasks/eval_datasets.py +26 -28
  10. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/__init__.py +3 -3
  11. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/backend_manager.py +21 -25
  12. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/clip_benchmark/__init__.py +1 -1
  13. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/clip_benchmark/arguments.py +6 -6
  14. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/clip_benchmark/dataset_builder.py +62 -79
  15. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/clip_benchmark/task_template.py +29 -43
  16. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/clip_benchmark/tasks/image_caption.py +20 -22
  17. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/clip_benchmark/tasks/zeroshot_classification.py +16 -23
  18. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/clip_benchmark/tasks/zeroshot_retrieval.py +14 -35
  19. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/clip_benchmark/utils/webdataset_convert.py +69 -90
  20. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/cmteb/__init__.py +3 -3
  21. evalscope-0.8.1/evalscope/backend/rag_eval/cmteb/arguments.py +59 -0
  22. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/cmteb/base.py +22 -23
  23. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/cmteb/task_template.py +15 -17
  24. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/cmteb/tasks/Classification.py +98 -79
  25. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/cmteb/tasks/Clustering.py +17 -22
  26. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/cmteb/tasks/CustomTask.py +17 -19
  27. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/cmteb/tasks/PairClassification.py +35 -29
  28. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/cmteb/tasks/Reranking.py +18 -5
  29. evalscope-0.8.1/evalscope/backend/rag_eval/cmteb/tasks/Retrieval.py +345 -0
  30. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/cmteb/tasks/STS.py +126 -104
  31. evalscope-0.8.1/evalscope/backend/rag_eval/cmteb/tasks/__init__.py +69 -0
  32. evalscope-0.8.1/evalscope/backend/rag_eval/ragas/__init__.py +2 -0
  33. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/ragas/arguments.py +3 -8
  34. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/ragas/prompts/chinese/AnswerCorrectness/correctness_prompt_chinese.json +9 -9
  35. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/ragas/prompts/chinese/AnswerCorrectness/long_form_answer_prompt_chinese.json +2 -2
  36. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/ragas/prompts/chinese/AnswerRelevancy/question_generation_chinese.json +3 -3
  37. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/ragas/prompts/chinese/ContextPrecision/context_precision_prompt_chinese.json +5 -5
  38. evalscope-0.8.1/evalscope/backend/rag_eval/ragas/prompts/chinese/CustomNodeFilter/scoring_prompt_chinese.json +7 -0
  39. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/ragas/prompts/chinese/Faithfulness/nli_statements_message_chinese.json +8 -8
  40. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/ragas/prompts/chinese/Faithfulness/statement_prompt_chinese.json +5 -5
  41. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/ragas/prompts/chinese/HeadlinesExtractor/prompt_chinese.json +7 -5
  42. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/ragas/prompts/chinese/MultiHopAbstractQuerySynthesizer/concept_combination_prompt_chinese.json +2 -2
  43. evalscope-0.8.1/evalscope/backend/rag_eval/ragas/prompts/chinese/MultiHopAbstractQuerySynthesizer/generate_query_reference_prompt_chinese.json +30 -0
  44. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/ragas/prompts/chinese/MultiHopAbstractQuerySynthesizer/theme_persona_matching_prompt_chinese.json +2 -2
  45. evalscope-0.8.1/evalscope/backend/rag_eval/ragas/prompts/chinese/MultiHopSpecificQuerySynthesizer/generate_query_reference_prompt_chinese.json +30 -0
  46. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/ragas/prompts/chinese/MultiHopSpecificQuerySynthesizer/theme_persona_matching_prompt_chinese.json +2 -2
  47. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/ragas/prompts/chinese/MultiModalFaithfulness/faithfulness_prompt_chinese.json +2 -2
  48. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/ragas/prompts/chinese/MultiModalRelevance/relevance_prompt_chinese.json +5 -5
  49. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/ragas/prompts/chinese/NERExtractor/prompt_chinese.json +3 -3
  50. evalscope-0.8.1/evalscope/backend/rag_eval/ragas/prompts/chinese/SingleHopSpecificQuerySynthesizer/generate_query_reference_prompt_chinese.json +24 -0
  51. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/ragas/prompts/chinese/SingleHopSpecificQuerySynthesizer/theme_persona_matching_prompt_chinese.json +3 -3
  52. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/ragas/prompts/chinese/SummaryExtractor/prompt_chinese.json +4 -4
  53. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/ragas/prompts/chinese/ThemesExtractor/prompt_chinese.json +2 -2
  54. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/ragas/prompts/persona_prompt.py +0 -1
  55. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/ragas/task_template.py +10 -15
  56. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/ragas/tasks/__init__.py +1 -1
  57. evalscope-0.8.1/evalscope/backend/rag_eval/ragas/tasks/build_distribution.py +45 -0
  58. evalscope-0.8.1/evalscope/backend/rag_eval/ragas/tasks/build_transform.py +135 -0
  59. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/ragas/tasks/testset_generation.py +17 -133
  60. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/ragas/tasks/translate_prompt.py +8 -18
  61. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/utils/clip.py +47 -51
  62. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/utils/embedding.py +13 -12
  63. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/utils/llm.py +8 -6
  64. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/utils/tools.py +12 -11
  65. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/vlm_eval_kit/__init__.py +1 -1
  66. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/vlm_eval_kit/custom_dataset.py +7 -8
  67. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/arc/__init__.py +3 -2
  68. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/arc/ai2_arc.py +19 -16
  69. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/arc/arc_adapter.py +32 -24
  70. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/__init__.py +1 -2
  71. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/bbh_adapter.py +28 -25
  72. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/cot_prompts/boolean_expressions.txt +1 -1
  73. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/cot_prompts/causal_judgement.txt +1 -1
  74. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/cot_prompts/date_understanding.txt +1 -1
  75. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/cot_prompts/disambiguation_qa.txt +1 -1
  76. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/cot_prompts/dyck_languages.txt +1 -1
  77. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/cot_prompts/formal_fallacies.txt +1 -1
  78. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/cot_prompts/geometric_shapes.txt +1 -1
  79. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/cot_prompts/hyperbaton.txt +1 -1
  80. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_five_objects.txt +1 -1
  81. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_seven_objects.txt +1 -1
  82. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_three_objects.txt +1 -1
  83. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/cot_prompts/movie_recommendation.txt +1 -1
  84. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/cot_prompts/multistep_arithmetic_two.txt +1 -1
  85. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/cot_prompts/navigate.txt +1 -1
  86. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/cot_prompts/object_counting.txt +1 -1
  87. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/cot_prompts/penguins_in_a_table.txt +1 -1
  88. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/cot_prompts/reasoning_about_colored_objects.txt +1 -1
  89. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/cot_prompts/ruin_names.txt +1 -1
  90. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/cot_prompts/salient_translation_error_detection.txt +1 -1
  91. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/cot_prompts/snarks.txt +1 -1
  92. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/cot_prompts/sports_understanding.txt +1 -1
  93. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/cot_prompts/temporal_sequences.txt +1 -1
  94. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_five_objects.txt +1 -1
  95. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_seven_objects.txt +1 -1
  96. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_three_objects.txt +1 -1
  97. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/cot_prompts/web_of_lies.txt +1 -1
  98. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/bbh/cot_prompts/word_sorting.txt +1 -1
  99. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/benchmark.py +16 -16
  100. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/ceval/__init__.py +3 -2
  101. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/ceval/ceval_adapter.py +80 -69
  102. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/ceval/ceval_exam.py +18 -31
  103. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/cmmlu/__init__.py +3 -2
  104. evalscope-0.8.1/evalscope/benchmarks/cmmlu/cmmlu.py +161 -0
  105. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/cmmlu/cmmlu_adapter.py +109 -155
  106. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/cmmlu/samples.jsonl +1 -1
  107. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/competition_math/__init__.py +3 -2
  108. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/competition_math/competition_math.py +7 -16
  109. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/competition_math/competition_math_adapter.py +32 -34
  110. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/data_adapter.py +24 -24
  111. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/general_qa/__init__.py +3 -2
  112. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/general_qa/general_qa_adapter.py +35 -39
  113. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/gsm8k/__init__.py +1 -1
  114. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/gsm8k/gsm8k.py +6 -12
  115. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/gsm8k/gsm8k_adapter.py +27 -24
  116. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/hellaswag/__init__.py +3 -2
  117. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/hellaswag/hellaswag.py +15 -19
  118. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/hellaswag/hellaswag_adapter.py +28 -23
  119. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/humaneval/__init__.py +1 -1
  120. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/humaneval/humaneval.py +15 -18
  121. evalscope-0.8.1/evalscope/benchmarks/humaneval/humaneval_adapter.py +206 -0
  122. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/mmlu/__init__.py +3 -2
  123. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/mmlu/mmlu.py +15 -29
  124. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/mmlu/mmlu_adapter.py +85 -77
  125. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/race/__init__.py +3 -2
  126. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/race/race.py +21 -35
  127. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/race/race_adapter.py +33 -29
  128. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/race/samples.jsonl +1 -1
  129. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/trivia_qa/__init__.py +3 -2
  130. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/trivia_qa/samples.jsonl +1 -1
  131. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/trivia_qa/trivia_qa.py +19 -34
  132. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/trivia_qa/trivia_qa_adapter.py +27 -22
  133. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/truthful_qa/__init__.py +3 -2
  134. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/truthful_qa/truthful_qa.py +25 -29
  135. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/truthful_qa/truthful_qa_adapter.py +36 -37
  136. evalscope-0.8.1/evalscope/cli/cli.py +27 -0
  137. evalscope-0.8.1/evalscope/cli/start_eval.py +31 -0
  138. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/cli/start_perf.py +0 -3
  139. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/cli/start_server.py +27 -41
  140. evalscope-0.8.1/evalscope/config.py +224 -0
  141. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/constants.py +50 -32
  142. evalscope-0.8.1/evalscope/evaluator/evaluator.py +410 -0
  143. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/evaluator/rating_eval.py +12 -33
  144. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/evaluator/reviewer/auto_reviewer.py +48 -76
  145. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/metrics/bundled_rouge_score/rouge_scorer.py +10 -20
  146. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/metrics/code_metric.py +3 -9
  147. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/metrics/math_accuracy.py +3 -6
  148. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/metrics/metrics.py +21 -21
  149. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/metrics/rouge_metric.py +11 -25
  150. evalscope-0.8.1/evalscope/models/__init__.py +3 -0
  151. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/models/api/openai_api.py +40 -29
  152. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/models/custom/__init__.py +0 -1
  153. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/models/custom/custom_model.py +3 -3
  154. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/models/dummy_chat_model.py +7 -8
  155. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/models/model_adapter.py +89 -156
  156. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/models/openai_model.py +20 -20
  157. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/perf/arguments.py +16 -3
  158. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/perf/benchmark.py +9 -11
  159. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/perf/http_client.py +3 -8
  160. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/perf/main.py +8 -1
  161. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/perf/plugin/api/custom_api.py +1 -2
  162. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/perf/plugin/api/dashscope_api.py +1 -2
  163. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/perf/plugin/api/openai_api.py +3 -4
  164. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/perf/plugin/datasets/base.py +1 -2
  165. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/perf/plugin/datasets/flickr8k.py +1 -2
  166. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/perf/plugin/datasets/longalpaca.py +1 -2
  167. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/perf/plugin/datasets/openqa.py +1 -2
  168. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/perf/plugin/registry.py +3 -3
  169. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/perf/utils/analysis_result.py +1 -2
  170. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/perf/utils/benchmark_util.py +5 -6
  171. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/perf/utils/db_util.py +77 -30
  172. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/perf/utils/local_server.py +21 -13
  173. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/registry/config/cfg_arena_zhihu.yaml +1 -1
  174. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/registry/tasks/arc.yaml +2 -3
  175. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/registry/tasks/bbh.yaml +3 -4
  176. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/registry/tasks/bbh_mini.yaml +3 -4
  177. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/registry/tasks/ceval.yaml +3 -3
  178. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/registry/tasks/ceval_mini.yaml +3 -4
  179. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/registry/tasks/cmmlu.yaml +3 -3
  180. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/registry/tasks/eval_qwen-7b-chat_v100.yaml +1 -1
  181. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/registry/tasks/general_qa.yaml +1 -1
  182. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/registry/tasks/gsm8k.yaml +2 -2
  183. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/registry/tasks/mmlu.yaml +3 -3
  184. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/registry/tasks/mmlu_mini.yaml +3 -3
  185. evalscope-0.8.1/evalscope/run.py +180 -0
  186. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/run_arena.py +21 -25
  187. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/summarizer.py +27 -40
  188. evalscope-0.8.1/evalscope/third_party/longbench_write/README.md +175 -0
  189. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/third_party/longbench_write/default_task.json +1 -1
  190. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/third_party/longbench_write/default_task.yaml +8 -7
  191. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/third_party/longbench_write/eval.py +29 -27
  192. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/third_party/longbench_write/infer.py +16 -104
  193. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/third_party/longbench_write/longbench_write.py +5 -4
  194. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/third_party/longbench_write/resources/judge.txt +1 -1
  195. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/third_party/longbench_write/tools/data_etl.py +5 -6
  196. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/third_party/longbench_write/utils.py +0 -1
  197. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/third_party/toolbench_static/eval.py +14 -15
  198. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/third_party/toolbench_static/infer.py +48 -69
  199. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/third_party/toolbench_static/llm/swift_infer.py +4 -12
  200. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/third_party/toolbench_static/requirements.txt +1 -1
  201. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/third_party/toolbench_static/toolbench_static.py +4 -3
  202. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/tools/combine_reports.py +27 -34
  203. evalscope-0.8.1/evalscope/tools/rewrite_eval_results.py +63 -0
  204. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/utils/__init__.py +1 -1
  205. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/utils/arena_utils.py +18 -48
  206. {evalscope-0.7.2/evalscope/perf → evalscope-0.8.1/evalscope}/utils/chat_service.py +4 -5
  207. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/utils/completion_parsers.py +3 -8
  208. evalscope-0.8.1/evalscope/utils/io_utils.py +162 -0
  209. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/utils/logger.py +17 -7
  210. evalscope-0.8.1/evalscope/utils/model_utils.py +11 -0
  211. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/utils/utils.py +5 -306
  212. evalscope-0.8.1/evalscope/version.py +4 -0
  213. {evalscope-0.7.2 → evalscope-0.8.1/evalscope.egg-info}/PKG-INFO +123 -119
  214. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope.egg-info/SOURCES.txt +8 -9
  215. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope.egg-info/requires.txt +2 -2
  216. {evalscope-0.7.2 → evalscope-0.8.1}/requirements/docs.txt +1 -1
  217. {evalscope-0.7.2 → evalscope-0.8.1}/requirements/inner.txt +0 -1
  218. {evalscope-0.7.2 → evalscope-0.8.1}/requirements/rag.txt +1 -1
  219. evalscope-0.8.1/requirements.txt +1 -0
  220. {evalscope-0.7.2 → evalscope-0.8.1}/setup.py +2 -1
  221. {evalscope-0.7.2 → evalscope-0.8.1}/tests/cli/test_run.py +53 -15
  222. {evalscope-0.7.2 → evalscope-0.8.1}/tests/perf/test_perf.py +6 -1
  223. evalscope-0.8.1/tests/rag/test_clip_benchmark.py +85 -0
  224. {evalscope-0.7.2 → evalscope-0.8.1}/tests/rag/test_mteb.py +3 -2
  225. {evalscope-0.7.2 → evalscope-0.8.1}/tests/rag/test_ragas.py +5 -5
  226. {evalscope-0.7.2 → evalscope-0.8.1}/tests/swift/test_run_swift_eval.py +2 -3
  227. {evalscope-0.7.2 → evalscope-0.8.1}/tests/swift/test_run_swift_vlm_eval.py +2 -3
  228. {evalscope-0.7.2 → evalscope-0.8.1}/tests/swift/test_run_swift_vlm_jugde_eval.py +2 -3
  229. {evalscope-0.7.2 → evalscope-0.8.1}/tests/vlm/test_vlmeval.py +3 -2
  230. evalscope-0.7.2/evalscope/__init__.py +0 -3
  231. evalscope-0.7.2/evalscope/backend/rag_eval/cmteb/arguments.py +0 -61
  232. evalscope-0.7.2/evalscope/backend/rag_eval/cmteb/tasks/Retrieval.py +0 -345
  233. evalscope-0.7.2/evalscope/backend/rag_eval/cmteb/tasks/__init__.py +0 -70
  234. evalscope-0.7.2/evalscope/backend/rag_eval/ragas/__init__.py +0 -2
  235. evalscope-0.7.2/evalscope/backend/rag_eval/ragas/metrics/__init__.py +0 -2
  236. evalscope-0.7.2/evalscope/backend/rag_eval/ragas/metrics/multi_modal_faithfulness.py +0 -91
  237. evalscope-0.7.2/evalscope/backend/rag_eval/ragas/metrics/multi_modal_relevance.py +0 -99
  238. evalscope-0.7.2/evalscope/backend/rag_eval/ragas/prompts/chinese/MultiHopAbstractQuerySynthesizer/generate_query_reference_prompt_chinese.json +0 -7
  239. evalscope-0.7.2/evalscope/backend/rag_eval/ragas/prompts/chinese/MultiHopSpecificQuerySynthesizer/generate_query_reference_prompt_chinese.json +0 -7
  240. evalscope-0.7.2/evalscope/backend/rag_eval/ragas/prompts/chinese/SingleHopSpecificQuerySynthesizer/generate_query_reference_prompt_chinese.json +0 -7
  241. evalscope-0.7.2/evalscope/benchmarks/cmmlu/cmmlu.py +0 -166
  242. evalscope-0.7.2/evalscope/benchmarks/humaneval/humaneval_adapter.py +0 -21
  243. evalscope-0.7.2/evalscope/cache.py +0 -98
  244. evalscope-0.7.2/evalscope/cli/cli.py +0 -26
  245. evalscope-0.7.2/evalscope/config.py +0 -166
  246. evalscope-0.7.2/evalscope/evaluator/evaluator.py +0 -690
  247. evalscope-0.7.2/evalscope/models/__init__.py +0 -4
  248. evalscope-0.7.2/evalscope/models/template.py +0 -1446
  249. evalscope-0.7.2/evalscope/run.py +0 -408
  250. evalscope-0.7.2/evalscope/run_ms.py +0 -140
  251. evalscope-0.7.2/evalscope/third_party/longbench_write/README.md +0 -118
  252. evalscope-0.7.2/evalscope/tools/rewrite_eval_results.py +0 -95
  253. evalscope-0.7.2/evalscope/utils/task_cfg_parser.py +0 -10
  254. evalscope-0.7.2/evalscope/utils/task_utils.py +0 -22
  255. evalscope-0.7.2/evalscope/version.py +0 -4
  256. evalscope-0.7.2/requirements.txt +0 -1
  257. evalscope-0.7.2/tests/rag/test_clip_benchmark.py +0 -85
  258. {evalscope-0.7.2 → evalscope-0.8.1}/LICENSE +0 -0
  259. {evalscope-0.7.2 → evalscope-0.8.1}/MANIFEST.in +0 -0
  260. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/__init__.py +0 -0
  261. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/opencompass/__init__.py +0 -0
  262. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/opencompass/tasks/__init__.py +0 -0
  263. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/clip_benchmark/tasks/__init__.py +0 -0
  264. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/clip_benchmark/utils/webdatasets.txt +0 -0
  265. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/rag_eval/utils/__init__.py +0 -0
  266. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/backend/vlm_eval_kit/backend_manager.py +0 -0
  267. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/__init__.py +0 -0
  268. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/ceval/samples.jsonl +0 -0
  269. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/benchmarks/mmlu/samples.jsonl +0 -0
  270. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/cli/__init__.py +0 -0
  271. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/cli/base.py +0 -0
  272. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/evaluator/__init__.py +0 -0
  273. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/evaluator/reviewer/__init__.py +0 -0
  274. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/metrics/__init__.py +0 -0
  275. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/metrics/bundled_rouge_score/__init__.py +0 -0
  276. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/metrics/resources/gpt2-zhcn3-v4.bpe +0 -0
  277. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/metrics/resources/gpt2-zhcn3-v4.json +0 -0
  278. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/models/api/__init__.py +0 -0
  279. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/models/model.py +0 -0
  280. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/perf/__init__.py +0 -0
  281. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/perf/plugin/__init__.py +0 -0
  282. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/perf/plugin/api/__init__.py +0 -0
  283. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/perf/plugin/api/base.py +0 -0
  284. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/perf/plugin/datasets/__init__.py +0 -0
  285. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/perf/plugin/datasets/custom.py +0 -0
  286. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/perf/plugin/datasets/line_by_line.py +0 -0
  287. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/perf/plugin/datasets/speed_benchmark.py +0 -0
  288. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/perf/utils/__init__.py +0 -0
  289. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/perf/utils/handler.py +0 -0
  290. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/registry/__init__.py +0 -0
  291. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/registry/config/cfg_arena.yaml +0 -0
  292. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/registry/config/cfg_pairwise_baseline.yaml +0 -0
  293. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/registry/config/cfg_single.yaml +0 -0
  294. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/registry/data/prompt_template/lmsys_v2.jsonl +0 -0
  295. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/registry/data/prompt_template/prompt_templates.jsonl +0 -0
  296. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/registry/data/qa_browser/battle.jsonl +0 -0
  297. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/registry/data/qa_browser/category_mapping.yaml +0 -0
  298. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/registry/data/question.jsonl +0 -0
  299. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/third_party/__init__.py +0 -0
  300. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/third_party/longbench_write/__init__.py +0 -0
  301. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/third_party/longbench_write/resources/__init__.py +0 -0
  302. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/third_party/longbench_write/resources/longbench_write.jsonl +0 -0
  303. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/third_party/longbench_write/resources/longbench_write_en.jsonl +0 -0
  304. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/third_party/longbench_write/resources/longwrite_ruler.jsonl +0 -0
  305. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/third_party/longbench_write/tools/__init__.py +0 -0
  306. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/third_party/toolbench_static/README.md +0 -0
  307. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/third_party/toolbench_static/__init__.py +0 -0
  308. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/third_party/toolbench_static/config_default.json +0 -0
  309. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/third_party/toolbench_static/config_default.yaml +0 -0
  310. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/third_party/toolbench_static/llm/__init__.py +0 -0
  311. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/tools/__init__.py +0 -0
  312. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope/tools/gen_mmlu_subject_mapping.py +0 -0
  313. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope.egg-info/dependency_links.txt +0 -0
  314. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope.egg-info/entry_points.txt +0 -0
  315. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope.egg-info/not-zip-safe +0 -0
  316. {evalscope-0.7.2 → evalscope-0.8.1}/evalscope.egg-info/top_level.txt +0 -0
  317. {evalscope-0.7.2 → evalscope-0.8.1}/requirements/framework.txt +0 -0
  318. {evalscope-0.7.2 → evalscope-0.8.1}/requirements/opencompass.txt +0 -0
  319. {evalscope-0.7.2 → evalscope-0.8.1}/requirements/perf.txt +0 -0
  320. {evalscope-0.7.2 → evalscope-0.8.1}/requirements/tests.txt +0 -0
  321. {evalscope-0.7.2 → evalscope-0.8.1}/requirements/vlmeval.txt +0 -0
  322. {evalscope-0.7.2 → evalscope-0.8.1}/setup.cfg +0 -0
  323. {evalscope-0.7.2 → evalscope-0.8.1}/tests/__init__.py +0 -0
  324. {evalscope-0.7.2 → evalscope-0.8.1}/tests/cli/__init__.py +0 -0
  325. {evalscope-0.7.2 → evalscope-0.8.1}/tests/perf/__init__.py +0 -0
  326. {evalscope-0.7.2 → evalscope-0.8.1}/tests/rag/__init__.py +0 -0
  327. {evalscope-0.7.2 → evalscope-0.8.1}/tests/swift/__init__.py +0 -0
  328. {evalscope-0.7.2 → evalscope-0.8.1}/tests/test_run_all.py +0 -0
  329. {evalscope-0.7.2 → evalscope-0.8.1}/tests/vlm/__init__.py +0 -0
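
For context, a minimal upgrade sketch based on the dependency and CLI changes shown in the diff below (the `rag` extra now pins `ragas==0.2.7`, and 0.8.1 adds an `evalscope eval` entry point via `evalscope/cli/start_eval.py`). The exact commands are assumptions using standard pip tooling, not part of the package itself:

```bash
# Illustrative only: install the new release with the RAG extra and confirm the bumped pin.
pip install -U "evalscope[rag]==0.8.1"
pip show ragas            # expect 0.2.7 per the updated requires metadata
evalscope eval --help     # new CLI entry point introduced in 0.8.1
```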
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: evalscope
- Version: 0.7.2
+ Version: 0.8.1
  Summary: EvalScope: Lightweight LLMs Evaluation Framework
  Home-page: https://github.com/modelscope/evalscope
  Author: ModelScope team
@@ -54,7 +54,7 @@ Provides-Extra: vlmeval
  Requires-Dist: ms-vlmeval>=0.0.9; extra == "vlmeval"
  Provides-Extra: rag
  Requires-Dist: mteb==1.19.4; extra == "rag"
- Requires-Dist: ragas==0.2.5; extra == "rag"
+ Requires-Dist: ragas==0.2.7; extra == "rag"
  Requires-Dist: webdataset>0.2.0; extra == "rag"
  Provides-Extra: perf
  Requires-Dist: aiohttp; extra == "perf"
@@ -70,7 +70,6 @@ Requires-Dist: alibaba_itag_sdk; extra == "inner"
  Requires-Dist: dashscope; extra == "inner"
  Requires-Dist: editdistance; extra == "inner"
  Requires-Dist: jsonlines; extra == "inner"
- Requires-Dist: jsonlines; extra == "inner"
  Requires-Dist: nltk; extra == "inner"
  Requires-Dist: openai; extra == "inner"
  Requires-Dist: pandas==1.5.3; extra == "inner"
@@ -126,7 +125,7 @@ Requires-Dist: transformers_stream_generator; extra == "all"
  Requires-Dist: ms-opencompass>=0.1.4; extra == "all"
  Requires-Dist: ms-vlmeval>=0.0.9; extra == "all"
  Requires-Dist: mteb==1.19.4; extra == "all"
- Requires-Dist: ragas==0.2.5; extra == "all"
+ Requires-Dist: ragas==0.2.7; extra == "all"
  Requires-Dist: webdataset>0.2.0; extra == "all"
  Requires-Dist: aiohttp; extra == "all"
  Requires-Dist: fastapi; extra == "all"
@@ -135,43 +134,47 @@ Requires-Dist: sse_starlette; extra == "all"
  Requires-Dist: transformers; extra == "all"
  Requires-Dist: unicorn; extra == "all"

+ <p align="center">
+ <br>
+ <img src="docs/en/_static/images/evalscope_logo.png"/>
+ <br>
+ <p>


- ![](docs/en/_static/images/evalscope_logo.png)
-
  <p align="center">
- English | <a href="README_zh.md">简体中文</a>
+ <a href="README_zh.md">中文</a> &nbsp | &nbsp English &nbsp
  </p>

  <p align="center">
- <a href="https://badge.fury.io/py/evalscope"><img src="https://badge.fury.io/py/evalscope.svg" alt="PyPI version" height="18"></a>
- <a href="https://pypi.org/project/evalscope"><img alt="PyPI - Downloads" src="https://static.pepy.tech/badge/evalscope">
- </a>
- <a href="https://github.com/modelscope/evalscope/pulls"><img src="https://img.shields.io/badge/PR-welcome-55EB99.svg"></a>
- <a href='https://evalscope.readthedocs.io/en/latest/?badge=latest'>
- <img src='https://readthedocs.org/projects/evalscope-en/badge/?version=latest' alt='Documentation Status' />
- </a>
- <br>
- <a href="https://evalscope.readthedocs.io/en/latest/">📖 Documents</a>
+ <img src="https://img.shields.io/badge/python-%E2%89%A53.8-5be.svg">
+ <a href="https://badge.fury.io/py/evalscope"><img src="https://badge.fury.io/py/evalscope.svg" alt="PyPI version" height="18"></a>
+ <a href="https://pypi.org/project/evalscope"><img alt="PyPI - Downloads" src="https://static.pepy.tech/badge/evalscope"></a>
+ <a href="https://github.com/modelscope/evalscope/pulls"><img src="https://img.shields.io/badge/PR-welcome-55EB99.svg"></a>
+ <a href='https://evalscope.readthedocs.io/en/latest/?badge=latest'><img src='https://readthedocs.org/projects/evalscope/badge/?version=latest' alt='Documentation Status' /></a>
+ <p>
+
+ <p align="center">
+ <a href="https://evalscope.readthedocs.io/zh-cn/latest/"> 📖 中文文档</a> &nbsp | &nbsp <a href="https://evalscope.readthedocs.io/en/latest/"> 📖 English Documents</a>
  <p>

  > ⭐ If you like this project, please click the "Star" button at the top right to support us. Your support is our motivation to keep going!

- ## 📋 Table of Contents
+ ## 📋 Contents
  - [Introduction](#introduction)
  - [News](#News)
  - [Installation](#installation)
  - [Quick Start](#quick-start)
  - [Evaluation Backend](#evaluation-backend)
  - [Custom Dataset Evaluation](#custom-dataset-evaluation)
- - [Offline Evaluation](#offline-evaluation)
- - [Arena Mode](#arena-mode)
  - [Model Serving Performance Evaluation](#Model-Serving-Performance-Evaluation)
+ - [Arena Mode](#arena-mode)


  ## 📝 Introduction

- EvalScope is the official model evaluation and performance benchmarking framework launched by the [ModelScope](https://modelscope.cn/) community. It comes with built-in common benchmarks and evaluation metrics, such as MMLU, CMMLU, C-Eval, GSM8K, ARC, HellaSwag, TruthfulQA, MATH, and HumanEval. EvalScope supports various types of model evaluations, including LLMs, multimodal LLMs, embedding models, and reranker models. It is also applicable to multiple evaluation scenarios, such as end-to-end RAG evaluation, arena mode, and model inference performance stress testing. Moreover, with the seamless integration of the ms-swift training framework, evaluations can be initiated with a single click, providing full end-to-end support from model training to evaluation 🚀
+ EvalScope is [ModelScope](https://modelscope.cn/)'s official framework for model evaluation and benchmarking, designed for diverse assessment needs. It supports various model types including large language models, multimodal, embedding, reranker, and CLIP models.
+
+ The framework accommodates multiple evaluation scenarios such as end-to-end RAG evaluation, arena mode, and inference performance testing. It features built-in benchmarks and metrics like MMLU, CMMLU, C-Eval, and GSM8K. Seamlessly integrated with the [ms-swift](https://github.com/modelscope/ms-swift) training framework, EvalScope enables one-click evaluations, offering comprehensive support for model training and assessment 🚀

  <p align="center">
  <img src="docs/en/_static/images/evalscope_framework.png" width="70%">
@@ -193,6 +196,7 @@ The architecture includes the following modules:


  ## 🎉 News
+ - 🔥 **[2024.12.13]** Model evaluation optimization: no need to pass the `--template-type` parameter anymore; supports starting evaluation with `evalscope eval --args`. Refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html) for more details.
  - 🔥 **[2024.11.26]** The model inference service performance evaluator has been completely refactored: it now supports local inference service startup and Speed Benchmark; asynchronous call error handling has been optimized. For more details, refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/index.html).
  - 🔥 **[2024.10.31]** The best practice for evaluating Multimodal-RAG has been updated, please check the [📖 Blog](https://evalscope.readthedocs.io/zh-cn/latest/blog/RAG/multimodal_RAG.html#multimodal-rag) for more details.
  - 🔥 **[2024.10.23]** Supports multimodal RAG evaluation, including the assessment of image-text retrieval using [CLIP_Benchmark](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/clip_benchmark.html), and extends [RAGAS](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/ragas.html) to support end-to-end multimodal metrics evaluation.
@@ -264,124 +268,129 @@ We recommend using conda to manage your environment and installing dependencies

  ## 🚀 Quick Start

- ### 1. Simple Evaluation
- To evaluate a model using default settings on specified datasets, follow the process below:
+ To evaluate a model on specified datasets using default configurations, this framework supports two ways to initiate evaluation tasks: using the command line or using Python code.

- #### Installation using pip
+ ### Method 1. Using Command Line

- You can execute this in any directory:
+ Execute the `eval` command in any directory:
  ```bash
- python -m evalscope.run \
+ evalscope eval \
  --model Qwen/Qwen2.5-0.5B-Instruct \
- --template-type qwen \
- --datasets gsm8k ceval \
- --limit 10
+ --datasets gsm8k arc \
+ --limit 5
  ```

- #### Installation from source
+ ### Method 2. Using Python Code

- You need to execute this in the `evalscope` directory:
- ```bash
- python evalscope/run.py \
- --model Qwen/Qwen2.5-0.5B-Instruct \
- --template-type qwen \
- --datasets gsm8k ceval \
- --limit 10
- ```
+ When using Python code for evaluation, you need to submit the evaluation task using the `run_task` function, passing a `TaskConfig` as a parameter. It can also be a Python dictionary, yaml file path, or json file path, for example:

- > If prompted with `Do you wish to run the custom code? [y/N]`, please type `y`.
+ **Using Python Dictionary**

- **Results (tested with only 10 samples)**
- ```text
- Report table:
- +-----------------------+--------------------+-----------------+
- | Model | ceval | gsm8k |
- +=======================+====================+=================+
- | Qwen2.5-0.5B-Instruct | (ceval/acc) 0.5577 | (gsm8k/acc) 0.5 |
- +-----------------------+--------------------+-----------------+
+ ```python
+ from evalscope.run import run_task
+
+ task_cfg = {
+ 'model': 'Qwen/Qwen2.5-0.5B-Instruct',
+ 'datasets': ['gsm8k', 'arc'],
+ 'limit': 5
+ }
+
+ run_task(task_cfg=task_cfg)
  ```

+ <details><summary>More Startup Methods</summary>

- #### Basic Parameter Descriptions
- - `--model`: Specifies the `model_id` of the model on [ModelScope](https://modelscope.cn/), allowing automatic download. For example, see the [Qwen2-0.5B-Instruct model link](https://modelscope.cn/models/qwen/Qwen2-0.5B-Instruct/summary); you can also use a local path, such as `/path/to/model`.
- - `--template-type`: Specifies the template type corresponding to the model. Refer to the `Default Template` field in the [template table](https://swift.readthedocs.io/en/latest/Instruction/Supported-models-datasets.html#llm) for filling in this field.
- - `--datasets`: The dataset name, allowing multiple datasets to be specified, separated by spaces; these datasets will be automatically downloaded. Refer to the [supported datasets list](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset.html) for available options.
- - `--limit`: Maximum number of evaluation samples per dataset; if not specified, all will be evaluated, which is useful for quick validation.
+ **Using `TaskConfig`**

+ ```python
+ from evalscope.run import run_task
+ from evalscope.config import TaskConfig

- ### 2. Parameterized Evaluation
- If you wish to conduct a more customized evaluation, such as modifying model parameters or dataset parameters, you can use the following commands:
+ task_cfg = TaskConfig(
+ model='Qwen/Qwen2.5-0.5B-Instruct',
+ datasets=['gsm8k', 'arc'],
+ limit=5
+ )

- **Example 1:**
- ```shell
- python evalscope/run.py \
- --model qwen/Qwen2-0.5B-Instruct \
- --template-type qwen \
- --model-args revision=master,precision=torch.float16,device_map=auto \
- --datasets gsm8k ceval \
- --use-cache true \
- --limit 10
+ run_task(task_cfg=task_cfg)
  ```

- **Example 2:**
- ```shell
- python evalscope/run.py \
- --model qwen/Qwen2-0.5B-Instruct \
- --template-type qwen \
- --generation-config do_sample=false,temperature=0.0 \
- --datasets ceval \
- --dataset-args '{"ceval": {"few_shot_num": 0, "few_shot_random": false}}' \
- --limit 10
+ **Using `yaml` file**
+
+ `config.yaml`:
+ ```yaml
+ model: Qwen/Qwen2.5-0.5B-Instruct
+ datasets:
+ - gsm8k
+ - arc
+ limit: 5
  ```

- #### Parameter Descriptions
- In addition to the three [basic parameters](#basic-parameter-descriptions), the other parameters are as follows:
- - `--model-args`: Model loading parameters, separated by commas, in `key=value` format.
- - `--generation-config`: Generation parameters, separated by commas, in `key=value` format.
- - `do_sample`: Whether to use sampling, default is `false`.
- - `max_new_tokens`: Maximum generation length, default is 1024.
- - `temperature`: Sampling temperature.
- - `top_p`: Sampling threshold.
- - `top_k`: Sampling threshold.
- - `--use-cache`: Whether to use local cache, default is `false`. If set to `true`, previously evaluated model and dataset combinations will not be evaluated again, and will be read directly from the local cache.
- - `--dataset-args`: Evaluation dataset configuration parameters, provided in JSON format, where the key is the dataset name and the value is the parameter; note that these must correspond one-to-one with the values in `--datasets`.
- - `--few_shot_num`: Number of few-shot examples.
- - `--few_shot_random`: Whether to randomly sample few-shot data; if not specified, defaults to `true`.
-
-
- ### 3. Use the run_task Function to Submit an Evaluation Task
- Using the `run_task` function to submit an evaluation task requires the same parameters as the command line. You need to pass a dictionary as the parameter, which includes the following fields:
-
- #### 1. Configuration Task Dictionary Parameters
  ```python
- import torch
- from evalscope.constants import DEFAULT_ROOT_CACHE_DIR
-
- # Example
- your_task_cfg = {
- 'model_args': {'revision': None, 'precision': torch.float16, 'device_map': 'auto'},
- 'generation_config': {'do_sample': False, 'repetition_penalty': 1.0, 'max_new_tokens': 512},
- 'dataset_args': {},
- 'dry_run': False,
- 'model': 'qwen/Qwen2-0.5B-Instruct',
- 'template_type': 'qwen',
- 'datasets': ['arc', 'hellaswag'],
- 'work_dir': DEFAULT_ROOT_CACHE_DIR,
- 'outputs': DEFAULT_ROOT_CACHE_DIR,
- 'mem_cache': False,
- 'dataset_hub': 'ModelScope',
- 'dataset_dir': DEFAULT_ROOT_CACHE_DIR,
- 'limit': 10,
- 'debug': False
- }
+ from evalscope.run import run_task
+
+ run_task(task_cfg="config.yaml")
+ ```
+
+ **Using `json` file**
+
+ `config.json`:
+ ```json
+ {
+ "model": "Qwen/Qwen2.5-0.5B-Instruct",
+ "datasets": ["gsm8k", "arc"],
+ "limit": 5
+ }
  ```
- Here, `DEFAULT_ROOT_CACHE_DIR` is set to `'~/.cache/evalscope'`.

- #### 2. Execute Task with run_task
  ```python
  from evalscope.run import run_task
- run_task(task_cfg=your_task_cfg)
+
+ run_task(task_cfg="config.json")
+ ```
+ </details>
+
+ ### Basic Parameter
+ - `--model`: Specifies the `model_id` of the model in [ModelScope](https://modelscope.cn/), which can be automatically downloaded, e.g., [Qwen/Qwen2.5-0.5B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-0.5B-Instruct/summary); or use the local path of the model, e.g., `/path/to/model`
+ - `--datasets`: Dataset names, supports inputting multiple datasets separated by spaces. Datasets will be automatically downloaded from modelscope. For supported datasets, refer to the [Dataset List](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset.html)
+ - `--limit`: Maximum amount of evaluation data for each dataset. If not specified, it defaults to evaluating all data. Can be used for quick validation
+
+ ### Output Results
  ```
+ +-----------------------+-------------------+-----------------+
+ | Model | ai2_arc | gsm8k |
+ +=======================+===================+=================+
+ | Qwen2.5-0.5B-Instruct | (ai2_arc/acc) 0.6 | (gsm8k/acc) 0.6 |
+ +-----------------------+-------------------+-----------------+
+ ```
+
+ ## ⚙️ Complex Evaluation
+ For more customized evaluations, such as customizing model parameters or dataset parameters, you can use the following command. The evaluation startup method is the same as simple evaluation. Below shows how to start the evaluation using the `eval` command:
+
+ ```shell
+ evalscope eval \
+ --model Qwen/Qwen2.5-0.5B-Instruct \
+ --model-args revision=master,precision=torch.float16,device_map=auto \
+ --generation-config do_sample=true,temperature=0.5 \
+ --dataset-args '{"gsm8k": {"few_shot_num": 0, "few_shot_random": false}}' \
+ --datasets gsm8k \
+ --limit 10
+ ```
+
+ ### Parameter
+ - `--model-args`: Model loading parameters, separated by commas in `key=value` format. Default parameters:
+ - `revision`: Model version, default is `master`
+ - `precision`: Model precision, default is `auto`
+ - `device_map`: Model device allocation, default is `auto`
+ - `--generation-config`: Generation parameters, separated by commas in `key=value` format. Default parameters:
+ - `do_sample`: Whether to use sampling, default is `false`
+ - `max_length`: Maximum length, default is 2048
+ - `max_new_tokens`: Maximum length of generation, default is 512
+ - `--dataset-args`: Configuration parameters for evaluation datasets, passed in `json` format. The key is the dataset name, and the value is the parameters. Note that it needs to correspond one-to-one with the values in the `--datasets` parameter:
+ - `few_shot_num`: Number of few-shot examples
+ - `few_shot_random`: Whether to randomly sample few-shot data, if not set, defaults to `true`
+
+ Reference: [Full Parameter Description](https://evalscope.readthedocs.io/en/latest/get_started/parameters.html)


  ## Evaluation Backend
@@ -419,12 +428,7 @@ Speed Benchmark Results:
  ```

  ## Custom Dataset Evaluation
- EvalScope supports custom dataset evaluation. For detailed information, please refer to the Custom Dataset Evaluation [📖User Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset.html)
-
- ## Offline Evaluation
- You can use local dataset to evaluate the model without internet connection.
-
- Refer to: Offline Evaluation [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/offline_evaluation.html)
+ EvalScope supports custom dataset evaluation. For detailed information, please refer to the Custom Dataset Evaluation [📖User Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset/index.html)


  ## Arena Mode