YESciEval 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- yescieval-0.1.0/LICENSE +21 -0
- yescieval-0.1.0/PKG-INFO +58 -0
- yescieval-0.1.0/README.md +31 -0
- yescieval-0.1.0/images/logo.png +0 -0
- yescieval-0.1.0/pyproject.toml +35 -0
- yescieval-0.1.0/yescieval/__init__.py +27 -0
- yescieval-0.1.0/yescieval/base/__init__.py +10 -0
- yescieval-0.1.0/yescieval/base/judge.py +16 -0
- yescieval-0.1.0/yescieval/base/parser.py +28 -0
- yescieval-0.1.0/yescieval/base/rubric.py +39 -0
- yescieval-0.1.0/yescieval/judge/__init__.py +7 -0
- yescieval-0.1.0/yescieval/judge/judges.py +51 -0
- yescieval-0.1.0/yescieval/parser/__init__.py +3 -0
- yescieval-0.1.0/yescieval/parser/parsers.py +61 -0
- yescieval-0.1.0/yescieval/rubric/__init__.py +7 -0
- yescieval-0.1.0/yescieval/rubric/informativeness.py +161 -0
- yescieval-0.1.0/yescieval/rubric/structural.py +160 -0
- yescieval-0.1.0/yescieval/rubric/stylistic.py +163 -0
yescieval-0.1.0/LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2025 XXX
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
yescieval-0.1.0/PKG-INFO
ADDED
@@ -0,0 +1,58 @@
+Metadata-Version: 2.3
+Name: YESciEval
+Version: 0.1.0
+Summary: YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering.
+License: MIT
+Author: Hamed Babaei Giglou
+Author-email: hamedbabaeigiglou@gmail.com
+Requires-Python: >=3.10,<4.0.0
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Requires-Dist: numpy
+Requires-Dist: openai
+Requires-Dist: pandas
+Requires-Dist: peft
+Requires-Dist: pre-commit
+Requires-Dist: pydantic
+Requires-Dist: torch
+Requires-Dist: transformers
+Project-URL: Homepage, https://yescieval.readthedocs.io/
+Project-URL: Repository, https://github.com/sciknoworg/YESciEval/
+Description-Content-Type: text/markdown
+
+<div align="center">
+<img src="images/logo.png" width="50%" height="30%"/>
+</div>
+
+<div align="center">
+
+[pre-commit](https://github.com/pre-commit/pre-commit)
+[black](https://github.com/psf/black)
+[isort](https://pycqa.github.io/isort/)
+[MIT License](https://opensource.org/licenses/MIT)
+
+</div>
+
+## 📋 What is YESciEval?
+
+Large Language Models (LLMs) drive scientific question answering on modern search engines, yet the robustness of their evaluation remains underexplored. We introduce **YESciEval**, an open-source framework that combines fine-grained rubric-based assessment with reinforcement learning to mitigate optimism bias in LLM evaluators.
+
+We release multidisciplinary science Q&A datasets, including adversarial variants, with evaluation scores from multiple LLMs. Independent of proprietary models and human feedback, our approach enables scalable, cost-free evaluation. By advancing reliable LLM-as-a-judge models, this work supports AI alignment and fosters robust, transparent evaluation essential for scientific inquiry and artificial general intelligence.
+
+## 📃 License
+
+This work is licensed under the [MIT License](https://opensource.org/licenses/MIT).
yescieval-0.1.0/README.md
ADDED
@@ -0,0 +1,31 @@
+<div align="center">
+<img src="images/logo.png" width="50%" height="30%"/>
+</div>
+
+<div align="center">
+
+[pre-commit](https://github.com/pre-commit/pre-commit)
+[black](https://github.com/psf/black)
+[isort](https://pycqa.github.io/isort/)
+[MIT License](https://opensource.org/licenses/MIT)
+
+</div>
+
+## 📋 What is YESciEval?
+
+Large Language Models (LLMs) drive scientific question answering on modern search engines, yet the robustness of their evaluation remains underexplored. We introduce **YESciEval**, an open-source framework that combines fine-grained rubric-based assessment with reinforcement learning to mitigate optimism bias in LLM evaluators.
+
+We release multidisciplinary science Q&A datasets, including adversarial variants, with evaluation scores from multiple LLMs. Independent of proprietary models and human feedback, our approach enables scalable, cost-free evaluation. By advancing reliable LLM-as-a-judge models, this work supports AI alignment and fosters robust, transparent evaluation essential for scientific inquiry and artificial general intelligence.
+
+## 📃 License
+
+This work is licensed under the [MIT License](https://opensource.org/licenses/MIT).
yescieval-0.1.0/images/logo.png
ADDED
Binary file
yescieval-0.1.0/pyproject.toml
ADDED
@@ -0,0 +1,35 @@
+[tool.poetry]
+name = "YESciEval"
+version = "0.1.0"
+description = "YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering."
+authors = ["Hamed Babaei Giglou <hamedbabaeigiglou@gmail.com>"]
+license = "MIT"
+readme = "README.md"
+homepage = "https://yescieval.readthedocs.io/"
+repository = "https://github.com/sciknoworg/YESciEval/"
+include = ["images/logo.png"]
+
+[tool.poetry.dependencies]
+python = ">=3.10,<4.0.0"
+pre-commit = "*"
+transformers = "*"
+torch = "*"
+peft = "*"
+openai = "*"
+pandas = "*"
+numpy = "*"
+pydantic = "*"
+
+[tool.poetry.dev-dependencies]
+ruff = "*"
+pre-commit = "*"
+setuptools = "*"
+wheel = "*"
+twine = "*"
+
+[build-system]
+requires = ["poetry-core>=1.0.0"]
+build-backend = "poetry.core.masonry.api"
yescieval-0.1.0/yescieval/__init__.py
ADDED
@@ -0,0 +1,27 @@
+
+__version__ = "0.1.0"
+
+from .base import Rubric, Parser
+from .rubric import (Informativeness, Correctness, Completeness, Coherence, Relevancy,
+                     Integration, Cohesion, Readability, Conciseness)
+from .judge import AutoJudge, AskAutoJudge, BioASAutoJudge
+from .parser import GPTParser
+
+__all__ = [
+    "Rubric",
+    "Informativeness",
+    "Correctness",
+    "Completeness",
+    "Coherence",
+    "Relevancy",
+    "Integration",
+    "Cohesion",
+    "Readability",
+    "Conciseness",
+    "Parser",
+    "AutoJudge",
+    "AskAutoJudge",
+    "BioASAutoJudge",
+    "GPTParser",
+]
yescieval-0.1.0/yescieval/base/judge.py
ADDED
@@ -0,0 +1,16 @@
+from abc import ABC
+from typing import Any, Dict, Tuple
+from . import Parser, Rubric
+
+
+class Judge(ABC):
+
+    def from_pretrained(self, model_id: str, device: str = "auto", token: str = ""):
+        self.model, self.tokenizer = self._from_pretrained(model_id=model_id, device=device, token=token)
+
+    def judge(self, rubric: Rubric, max_new_tokens: int = 150) -> Dict[str, Dict[str, str]]:
+        pass
+
+    def _from_pretrained(self, model_id: str, device: str = "auto", token: str = "") -> Tuple[Any, Any]:
+        pass
yescieval-0.1.0/yescieval/base/parser.py
ADDED
@@ -0,0 +1,28 @@
+from abc import ABC, abstractmethod
+from typing import Dict, Any
+from pydantic import BaseModel, Field
+
+
+class RubricLikertScale(BaseModel):
+    rating: int = Field(..., ge=1, le=5, description="Rating from 1 to 5")
+    rationale: str = Field(..., description="Textual explanation for the rating")
+
+
+class Parser(ABC):
+    """
+    Abstract base class for parsing model outputs into structured characteristic evaluations.
+
+    Each characteristic maps to a RubricLikertScale with a rating and rationale.
+    """
+    @abstractmethod
+    def parse(self, raw_output: str) -> Any:
+        """
+        Parse the raw model output into structured characteristic evaluations.
+
+        Args:
+            raw_output (str): The text generated by the model.
+
+        Returns:
+            RubricLikertScale: The extracted rating and rationale.
+        """
+        return raw_output
yescieval-0.1.0/yescieval/base/rubric.py
ADDED
@@ -0,0 +1,39 @@
+from abc import ABC
+from pydantic import BaseModel
+from typing import Dict, List
+
+
+class Rubric(BaseModel, ABC):
+    """
+    Abstract base class for evaluation rubrics.
+    Subclasses supply a characteristic-specific `system_prompt_template`.
+    """
+    system_prompt_template: str
+    papers: Dict[str, str]
+    question: str
+    answer: str
+    user_prompt_template: str = ("Evaluate and rate the quality of the following scientific synthesis "
+                                 "according to the characteristics given in the system prompt.\n"
+                                 "\n<scientific-synthesis>{answer}</scientific-synthesis>\n"
+                                 "\n<research-question>{question}</research-question>\n"
+                                 "\n<paper-titles-and-abstracts>\n{content}</paper-titles-and-abstracts>\n\n###")
+
+    def render_papers(self) -> str:
+        paper_content = ""
+        for idx, (title, abstract) in enumerate(self.papers.items()):
+            paper_content += f"{idx + 1}. " + title + "\n\n" + abstract + "\n\n"
+        return paper_content
+
+    def verbalize(self) -> str:
+        return self.user_prompt_template.format(answer=self.answer,
+                                                question=self.question,
+                                                content=self.render_papers())
+
+    def instruct(self) -> List[Dict[str, str]]:
+        message = [
+            {"role": "system", "content": self.system_prompt_template},
+            {"role": "user", "content": self.verbalize()},
+        ]
+        return message
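The prompt assembly in `Rubric` is plain string formatting. A minimal sketch of what `render_papers` and `verbalize` produce, reimplemented here without the pydantic dependency (the template is copied from the `Rubric` base class; the papers, question, and answer are made-up illustrative data):

```python
# Template copied from Rubric.user_prompt_template.
user_prompt_template = ("Evaluate and rate the quality of the following scientific synthesis "
                        "according to the characteristics given in the system prompt.\n"
                        "\n<scientific-synthesis>{answer}</scientific-synthesis>\n"
                        "\n<research-question>{question}</research-question>\n"
                        "\n<paper-titles-and-abstracts>\n{content}</paper-titles-and-abstracts>\n\n###")

def render_papers(papers):
    # Numbered "title\n\nabstract" blocks, as in Rubric.render_papers.
    return "".join(f"{idx + 1}. {title}\n\n{abstract}\n\n"
                   for idx, (title, abstract) in enumerate(papers.items()))

# Made-up paper titles/abstracts keyed as {title: abstract}.
papers = {"A Survey of X": "We survey X.", "X in Practice": "We apply X."}
prompt = user_prompt_template.format(answer="X is widely used.",
                                     question="How is X used?",
                                     content=render_papers(papers))
print(prompt)
```

`instruct()` then wraps this string as the user turn alongside the rubric's system prompt.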
yescieval-0.1.0/yescieval/judge/judges.py
ADDED
@@ -0,0 +1,51 @@
+from ..base import Judge, Rubric
+
+from transformers import AutoTokenizer, AutoModelForCausalLM
+from peft import PeftModel, PeftConfig
+import torch
+
+
+class AutoJudge(Judge):
+
+    def _from_pretrained(self, model_id: str, device: str = "auto", token: str = ""):
+        # Resolve the base model behind the PEFT adapter, then attach the adapter.
+        config = PeftConfig.from_pretrained(model_id)
+        base_model_name = config.base_model_name_or_path
+        tokenizer = AutoTokenizer.from_pretrained(base_model_name,
+                                                  padding_side="left",
+                                                  token=token)
+        tokenizer.pad_token = tokenizer.eos_token
+        base_model = AutoModelForCausalLM.from_pretrained(
+            base_model_name,
+            torch_dtype=torch.float32,
+            device_map=device,
+            token=token
+        )
+        model = PeftModel.from_pretrained(base_model, model_id)
+        return model, tokenizer
+
+    def evaluate(self, rubric: Rubric, max_new_tokens: int = 150) -> str:
+        inputs = self.tokenizer.apply_chat_template(rubric.instruct(),
+                                                    add_generation_prompt=True,
+                                                    return_dict=True,
+                                                    return_tensors="pt")
+        inputs.to(self.model.device)
+        outputs = self.model.generate(**inputs,
+                                      max_new_tokens=max_new_tokens,
+                                      pad_token_id=self.tokenizer.eos_token_id)
+        # Decode only the newly generated tokens, skipping the prompt.
+        evaluation = self.tokenizer.decode(outputs[0][len(inputs["input_ids"][0]):], skip_special_tokens=True)
+        return evaluation
+
+
+class AskAutoJudge(AutoJudge):
+    def from_pretrained(self, model_id: str = "SciKnowOrg/YESciEval-ASK-Llama-3.1-8B",
+                        device: str = "auto",
+                        token: str = ""):
+        self.model, self.tokenizer = self._from_pretrained(model_id=model_id, device=device, token=token)
+
+
+class BioASAutoJudge(AutoJudge):
+    def from_pretrained(self, model_id: str = "SciKnowOrg/YESciEval-BioASQ-Llama-3.1-8B",
+                        device: str = "auto",
+                        token: str = ""):
+        self.model, self.tokenizer = self._from_pretrained(model_id=model_id, device=device, token=token)
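`evaluate` strips the prompt from the generated sequence by slicing `outputs[0]` from `len(inputs["input_ids"][0])` onward, because `generate` returns the prompt tokens followed by the completion. The same slice on plain Python lists (with made-up token ids standing in for real tokenizer output) shows the effect:

```python
# Made-up token ids standing in for tokenizer output.
prompt_ids = [101, 2129, 2003, 1060, 2109, 102]          # the rubric prompt
generated = prompt_ids + [1060, 2003, 4235, 2109, 102]   # generate() output: prompt + completion
# Same slice as in AutoJudge.evaluate: keep only the newly generated tokens.
new_tokens = generated[len(prompt_ids):]
print(new_tokens)  # → [1060, 2003, 4235, 2109, 102]
```

Decoding only `new_tokens` is what keeps the returned evaluation free of the echoed prompt.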
@@ -0,0 +1,61 @@
|
|
|
1
|
+
|
|
2
|
+
from ..base import Parser, RubricLikertScale
|
|
3
|
+
import time
|
|
4
|
+
from openai import OpenAI
|
|
5
|
+
|
|
6
|
+
class GPTParser(Parser):
|
|
7
|
+
"""
|
|
8
|
+
Abstract base class for parsing model outputs into structured characteristic evaluations.
|
|
9
|
+
|
|
10
|
+
Each characteristic maps to a CharacteristicScore with a rating and rationale.
|
|
11
|
+
"""
|
|
12
|
+
def __init__(self, openai_key:str, parser_model:str="gpt-4o-mini"):
|
|
13
|
+
self.client = OpenAI(api_key=openai_key)
|
|
14
|
+
self.parser_model = parser_model
|
|
15
|
+
|
|
16
|
+
def parse(self, raw_output: str) -> RubricLikertScale:
|
|
17
|
+
"""
|
|
18
|
+
Parse the raw model output into structured characteristic evaluations.
|
|
19
|
+
|
|
20
|
+
Args:
|
|
21
|
+
raw_output (str): The text generated by the model.
|
|
22
|
+
|
|
23
|
+
Returns:
|
|
24
|
+
Dict[str, CharacteristicScore]: Mapping from characteristic name to its score and rationale.
|
|
25
|
+
"""
|
|
26
|
+
functions = [
|
|
27
|
+
{
|
|
28
|
+
"name": "evaluate_characteristic",
|
|
29
|
+
"description": "Extracting the exact `rating` and `rationale` from the given text.",
|
|
30
|
+
"parameters": {
|
|
31
|
+
"type": "object",
|
|
32
|
+
"properties": {
|
|
33
|
+
"rating": {
|
|
34
|
+
"type": "number",
|
|
35
|
+
"description": "A numerical rating assigned to the characteristic in the text.",
|
|
36
|
+
"minimum": 1,
|
|
37
|
+
"maximum": 5
|
|
38
|
+
},
|
|
39
|
+
"rationale": {
|
|
40
|
+
"type": "string",
|
|
41
|
+
"description": "The explanation for the assigned rating."
|
|
42
|
+
}
|
|
43
|
+
},
|
|
44
|
+
"required": ["rating", "rationale"]
|
|
45
|
+
}
|
|
46
|
+
}
|
|
47
|
+
]
|
|
48
|
+
while True:
|
|
49
|
+
try:
|
|
50
|
+
completion = self.client.chat.completions.create(
|
|
51
|
+
model=self.parser_model,
|
|
52
|
+
messages=[{"role": "user", "content": raw_output}],
|
|
53
|
+
functions=functions
|
|
54
|
+
)
|
|
55
|
+
parsed_output = eval(completion.choices[0].message.function_call.arguments)
|
|
56
|
+
break
|
|
57
|
+
except Exception as e:
|
|
58
|
+
print(f"Error {e}")
|
|
59
|
+
time.sleep(3)
|
|
60
|
+
|
|
61
|
+
return RubricLikertScale(rating=parsed_output['rating'], rationale=parsed_output['rationale'])
|
|
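OpenAI function calling returns `function_call.arguments` as a JSON string, so `json.loads` is the safe way to decode it into the dict that feeds `RubricLikertScale` (the arguments payload below is made up for illustration):

```python
import json

# A made-up example of the JSON string returned in function_call.arguments.
raw_arguments = '{"rating": 4, "rationale": "Accurate overall, with minor omissions."}'
parsed = json.loads(raw_arguments)   # safe, unlike eval() on model-produced text
print(parsed["rating"], parsed["rationale"])
```

Unlike `eval`, `json.loads` cannot execute arbitrary expressions smuggled into the model's output, and it raises a clean `json.JSONDecodeError` on malformed payloads.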
yescieval-0.1.0/yescieval/rubric/__init__.py
ADDED
@@ -0,0 +1,7 @@
+from .informativeness import Informativeness, Correctness, Completeness
+from .structural import Coherence, Relevancy, Integration
+from .stylistic import Cohesion, Readability, Conciseness
+
+__all__ = ["Informativeness", "Correctness", "Completeness",
+           "Coherence", "Relevancy", "Integration",
+           "Cohesion", "Readability", "Conciseness"]
yescieval-0.1.0/yescieval/rubric/informativeness.py
ADDED
@@ -0,0 +1,161 @@
+from ..base import Rubric
+
+correctness_prompt = """<Context>
+Scientific synthesis generation involves creating a concise, coherent, and integrated summary from a collection of scientific texts (such as research paper titles and abstracts) that addresses a specific research question. Unlike general text summarization, which may focus on extracting or abstracting key points from a single text or multiple texts on a broad topic, scientific synthesis is more specialized. It requires:
+
+- Understanding and Addressing a Specific Research Question: The synthesis must specifically answer a research question, requiring a deep understanding of the subject matter and the ability to extract and integrate relevant information from various sources.
+- Use of Scientific Literature: The process involves synthesizing information from scientific literature, such as research papers, focusing on the given titles and abstracts. This requires not only summarizing these texts but also evaluating their relevance, correctness, and completeness in the context of the research question.
+- Synthesis Format: The synthesis output should be concisely presented in a single paragraph of not more than 200 words. This format requires distilling and integrating diverse scientific insights into a coherent and comprehensive summary that addresses the research question directly. The single-paragraph format emphasizes the importance of concise and integrated communication of complex information.
+- Synthesize vs. Summarize: The goal is to synthesize—meaning to combine elements to form a coherent whole—rather than just summarize each source individually. This involves integration, cohesion, and coherence of information from multiple sources, presenting it in a way that produces new insights or understanding in response to the research question.
+- Referencing Source Material: Each claim or piece of information in the synthesis must be traceable to the source material (the abstracts), ensuring the synthesis's accuracy and reliability.
+- Adherence to Quality Characteristics: It should be possible to evaluate the synthesis quality based on the correctness characteristic, ensuring it effectively communicates the synthesized information.
+
+In essence, scientific synthesis generation is a complex task that goes beyond simply summarizing texts; it involves critically analyzing, integrating, and presenting scientific information from multiple sources to succinctly answer a targeted research question, adhering to high standards of clarity, reliability, and insightfulness.
+</Context>
+
+<Role>
+You are tasked as a scientific syntheses quality evaluator.
+</Role>
+
+<Task-Description>
+A user will provide you with a synthesis which has been generated as an answer to a research question using the titles and abstracts of relevant research works. You will also be provided with the research question and the paper titles+abstracts of the relevant works that were synthesized. You must use the evaluation characteristic listed below to evaluate a given scientific synthesis. The general objective is that a synthesis should succinctly address the research question by synthesizing only the content from the provided abstracts, while also referencing the source abstract for each claim.
+</Task-Description>
+
+<Evaluation-Characteristics>
+1. Correctness: is the information in the answer a correct representation of the content of the provided abstracts?
+</Evaluation-Characteristics>
+
+<Rating-Scale>
+For a given characteristic, rate the quality from 1 (very bad) to 5 (very good). Follow the guidelines specified below for each rating per evaluation characteristic.
+
+1. Correctness
+Rating 1. Very bad: The synthesis consistently misrepresents or inaccurately portrays the content of the provided abstracts, showing a significant deviation from the original sources.
+Rating 2. Bad: The synthesis contains several inaccuracies or misinterpretations of the source abstracts.
+Rating 3. Moderate: The synthesis accurately represents most of the content from the provided abstracts but may contain minor errors.
+Rating 4. Good: The synthesis provides an accurate representation of the content from the provided abstracts with minor exceptions.
+Rating 5. Very good: The information in the synthesis is an accurate and faithful representation of the content from the provided abstracts, without any factual errors or misinterpretations.
+</Rating-Scale>
+
+<Response-Format>
+For each characteristic rate the quality from 1 (very bad) to 5 (very good). Provide a short rationale for each rating.
+Return your response in JSON format: {characteristic : {'rating' : '', 'rationale' : ''}}
+
+<Example-Response>
+{
+"Correctness": {"rating": "4", "rationale": "The synthesis represents the content of the provided abstracts, but with minor irrelevant information."}
+}
+</Example-Response>
+</Response-Format>
+
+<Note>
+Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material.
+</Note>"""
+
+
+class Correctness(Rubric):
+    system_prompt_template: str = correctness_prompt
+
+
+completeness_prompt = """<Context>
+Scientific synthesis generation involves creating a concise, coherent, and integrated summary from a collection of scientific texts (such as research paper titles and abstracts) that addresses a specific research question. Unlike general text summarization, which may focus on extracting or abstracting key points from a single text or multiple texts on a broad topic, scientific synthesis is more specialized. It requires:
+
+- Understanding and Addressing a Specific Research Question: The synthesis must specifically answer a research question, requiring a deep understanding of the subject matter and the ability to extract and integrate relevant information from various sources.
+- Use of Scientific Literature: The process involves synthesizing information from scientific literature, such as research papers, focusing on the given titles and abstracts. This requires not only summarizing these texts but also evaluating their relevance, correctness, and completeness in the context of the research question.
+- Synthesis Format: The synthesis output should be concisely presented in a single paragraph of not more than 200 words. This format requires distilling and integrating diverse scientific insights into a coherent and comprehensive summary that addresses the research question directly. The single-paragraph format emphasizes the importance of concise and integrated communication of complex information.
+- Synthesize vs. Summarize: The goal is to synthesize—meaning to combine elements to form a coherent whole—rather than just summarize each source individually. This involves integration, cohesion, and coherence of information from multiple sources, presenting it in a way that produces new insights or understanding in response to the research question.
+- Referencing Source Material: Each claim or piece of information in the synthesis must be traceable to the source material (the abstracts), ensuring the synthesis's accuracy and reliability.
+- Adherence to Quality Characteristics: It should be possible to evaluate the synthesis quality based on the completeness characteristic, ensuring it effectively communicates the synthesized information.
+
+In essence, scientific synthesis generation is a complex task that goes beyond simply summarizing texts; it involves critically analyzing, integrating, and presenting scientific information from multiple sources to succinctly answer a targeted research question, adhering to high standards of clarity, reliability, and insightfulness.
+</Context>
+
+<Role>
+You are tasked as a scientific syntheses quality evaluator.
+</Role>
+
+<Task-Description>
+A user will provide you with a synthesis which has been generated as an answer to a research question using the titles and abstracts of relevant research works. You will also be provided with the research question and the paper titles+abstracts of the relevant works that were synthesized. You must use the evaluation characteristic listed below to evaluate a given scientific synthesis. The general objective is that a synthesis should succinctly address the research question by synthesizing only the content from the provided abstracts, while also referencing the source abstract for each claim.
+</Task-Description>
+
+<Evaluation-Characteristics>
+1. Completeness: is the answer a comprehensive encapsulation of the relevant information in the provided abstracts?
+</Evaluation-Characteristics>
+
+<Rating-Scale>
+For a given characteristic, rate the quality from 1 (very bad) to 5 (very good). Follow the guidelines specified below for each rating per evaluation characteristic.
+
+1. Completeness
+Rating 1. Very bad: The synthesis omits most of the relevant information, failing to capture the essential points or details from the provided abstracts.
+Rating 2. Bad: Significant portions of relevant information from the provided abstracts are missing.
+Rating 3. Moderate: The synthesis captures a fair amount of the relevant information, though it may overlook some details.
+Rating 4. Good: The synthesis includes almost all relevant information, missing only minor details.
+Rating 5. Very good: The synthesis comprehensively encapsulates all relevant information from the provided abstracts, leaving no pertinent details or points unaddressed.
+</Rating-Scale>
+
+<Response-Format>
+For each characteristic rate the quality from 1 (very bad) to 5 (very good). Provide a short rationale for each rating.
+Return your response in JSON format: {characteristic : {'rating' : '', 'rationale' : ''}}
+
+<Example-Response>
+{
+"Completeness": {"rating": "4", "rationale": "Only minor details are missing in the synthesis."}
+}
+</Example-Response>
+</Response-Format>
+
+<Note>
+Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material.
+</Note>"""
+
+
+class Completeness(Rubric):
+    system_prompt_template: str = completeness_prompt
+
+
+informativeness_prompt = """<Context>
+Scientific synthesis generation involves creating a concise, coherent, and integrated summary from a collection of scientific texts (such as research paper titles and abstracts) that addresses a specific research question. Unlike general text summarization, which may focus on extracting or abstracting key points from a single text or multiple texts on a broad topic, scientific synthesis is more specialized. It requires:
+
+- Understanding and Addressing a Specific Research Question: The synthesis must specifically answer a research question, requiring a deep understanding of the subject matter and the ability to extract and integrate relevant information from various sources.
+- Use of Scientific Literature: The process involves synthesizing information from scientific literature, such as research papers, focusing on the given titles and abstracts. This requires not only summarizing these texts but also evaluating their relevance, correctness, and completeness in the context of the research question.
+- Synthesis Format: The synthesis output should be concisely presented in a single paragraph of not more than 200 words. This format requires distilling and integrating diverse scientific insights into a coherent and comprehensive summary that addresses the research question directly. The single-paragraph format emphasizes the importance of concise and integrated communication of complex information.
+- Synthesize vs. Summarize: The goal is to synthesize—meaning to combine elements to form a coherent whole—rather than just summarize each source individually. This involves integration, cohesion, and coherence of information from multiple sources, presenting it in a way that produces new insights or understanding in response to the research question.
+- Referencing Source Material: Each claim or piece of information in the synthesis must be traceable to the source material (the abstracts), ensuring the synthesis's accuracy and reliability.
+- Adherence to Quality Characteristics: It should be possible to evaluate the synthesis quality based on the informativeness characteristic, ensuring it effectively communicates the synthesized information.
+
+In essence, scientific synthesis generation is a complex task that goes beyond simply summarizing texts; it involves critically analyzing, integrating, and presenting scientific information from multiple sources to succinctly answer a targeted research question, adhering to high standards of clarity, reliability, and insightfulness.
+</Context>
+
+<Role>
+You are tasked as a scientific syntheses quality evaluator.
+</Role>
+
+<Task-Description>
+A user will provide you with a synthesis which has been generated as an answer to a research question using the titles and abstracts of relevant research works. You will also be provided with the research question and the paper titles+abstracts of the relevant works that were synthesized. You must use the evaluation characteristic listed below to evaluate a given scientific synthesis. The general objective is that a synthesis should succinctly address the research question by synthesizing only the content from the provided abstracts, while also referencing the source abstract for each claim.
+</Task-Description>
+
+<Evaluation-Characteristics>
+1. Informativeness: is the answer a useful and informative reply to the question?
+</Evaluation-Characteristics>
+
+<Rating-Scale>
+For a given characteristic, rate the quality from 1 (very bad) to 5 (very good). Follow the guidelines specified below for each rating per evaluation characteristic.
+
+1. Informativeness
+Rating 1. Very bad: The synthesis offers no valuable insights or useful information in response to the research question, lacking depth and utility.
+Rating 2. Bad: The answer provides limited new insights or useful information in response to the research question.
+Rating 3. Moderate: The answer is somewhat informative, offering insights or useful information but not in a comprehensive or detailed manner.
+Rating 4. Good: The answer is informative and insightful, providing comprehensive information in response to the research question.
+Rating 5. Very good: The synthesis is highly informative, providing valuable insights and detailed information that thoroughly addresses the research question.
+</Rating-Scale>
+
+<Response-Format>
+For each characteristic rate the quality from 1 (very bad) to 5 (very good). Provide a short rationale for each rating.
+Return your response in JSON format: {characteristic : {'rating' : '', 'rationale' : ''}}
+
+<Example-Response>
+{
+"Informativeness": {"rating": "4", "rationale": "Most information is informative for the research question."}
+}
+</Example-Response>
+</Response-Format>
+
+<Note>
+Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material.
+</Note>"""
+
+
+class Informativeness(Rubric):
+    system_prompt_template: str = informativeness_prompt
|
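The `<Response-Format>` section above asks the judge model for a JSON object keyed by characteristic name. As a minimal stand-alone sketch (illustrative only; the package ships its own parsers in `yescieval.parser`), such a reply can be decoded with the standard library:

```python
import json

# Example judge reply following the <Response-Format> section above.
reply = (
    '{"Informativeness": {"rating": "4", '
    '"rationale": "Most information is informative for the research question."}}'
)

parsed = json.loads(reply)
# Ratings are emitted as strings "1".."5", so convert before comparing.
rating = int(parsed["Informativeness"]["rating"])
assert 1 <= rating <= 5
print(rating)  # 4
```
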
@@ -0,0 +1,160 @@
from ..base import Rubric

coherence_prompt = """<Context>
Scientific synthesis generation involves creating a concise, coherent, and integrated summary from a collection of scientific texts (such as research paper titles and abstracts) that addresses a specific research question. Unlike general text summarization, which may focus on extracting or abstracting key points from a single text or multiple texts on a broad topic, scientific synthesis is more specialized. It requires:

- Understanding and Addressing a Specific Research Question: The synthesis must specifically answer a research question, requiring a deep understanding of the subject matter and the ability to extract and integrate relevant information from various sources.
- Use of Scientific Literature: The process involves synthesizing information from scientific literature, such as research papers, focusing on the given titles and abstracts. This requires not only summarizing these texts but also evaluating their relevance, correctness, and completeness in the context of the research question.
- Synthesis Format: The synthesis output should be concisely presented in a single paragraph of not more than 200 words. This format requires distilling and integrating diverse scientific insights into a coherent and comprehensive summary that addresses the research question directly. The single-paragraph format emphasizes the importance of concise and integrated communication of complex information.
- Synthesize vs. Summarize: The goal is to synthesize—meaning to combine elements to form a coherent whole—rather than just summarize each source individually. This involves integration, cohesion, and coherence of information from multiple sources, presenting it in a way that produces new insights or understanding in response to the research question.
- Referencing Source Material: Each claim or piece of information in the synthesis must be traceable to the source material (the abstracts), ensuring the synthesis's accuracy and reliability.
- Adherence to Quality Characteristics: It should be possible to evaluate the synthesis quality based on the coherence characteristic, ensuring it effectively communicates the synthesized information.

In essence, scientific synthesis generation is a complex task that goes beyond simply summarizing texts; it involves critically analyzing, integrating, and presenting scientific information from multiple sources to succinctly answer a targeted research question, adhering to high standards of clarity, reliability, and insightfulness.
</Context>

<Role>
You are tasked as a scientific synthesis quality evaluator.
</Role>

<Task-Description>
A user will provide you with a synthesis that has been generated as an answer to a research question using the titles and abstracts of relevant research works. You will also be provided with the research question and the titles and abstracts of the relevant works that were synthesized. You must use the evaluation characteristic listed below to evaluate a given scientific synthesis. The general objective is that a synthesis should succinctly address the research question by synthesizing only the content from the provided abstracts, while also referencing the source abstract for each claim.
</Task-Description>

<Evaluation-Characteristics>
1. Coherence: are the ideas connected in a sound and logical manner?
</Evaluation-Characteristics>

<Rating-Scale>
For a given characteristic, rate the quality from 1 (very bad) to 5 (very good). Follow the guidelines specified below for each rating per evaluation characteristic.

1. Coherence
Rating 1. Very bad: The synthesis lacks logical connection between ideas, leading to a narrative that is confusing and difficult to follow.
Rating 2. Bad: The ideas are not always logically connected, leading to a somewhat confusing narrative.
Rating 3. Moderate: The ideas are logically connected for the most part, but the narrative could be strengthened for better clarity.
Rating 4. Good: The ideas are logically and soundly connected, offering a clear and understandable narrative.
Rating 5. Very good: The ideas within the synthesis are connected in a logical and sound manner, forming a coherent and compelling narrative that is easy to follow.
</Rating-Scale>

<Response-Format>
For each characteristic, rate the quality from 1 (very bad) to 5 (very good). Provide a short rationale for each rating.
Return your response in JSON format: {characteristic: {"rating": "", "rationale": ""}}

<Example-Response>
{
"Coherence": {"rating": "4", "rationale": "The ideas are soundly connected, forming a clear narrative."}
}
</Example-Response>
</Response-Format>

<Note>
Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material.
</Note>"""

class Coherence(Rubric):
    system_prompt_template: str = coherence_prompt
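Judge models do not always return bare JSON; a reply may wrap the object in prose or a code fence. The package's own parsers handle this in practice, so the following is only a hedged, stand-alone sketch of pulling the first balanced JSON object out of a chatty reply:

```python
import json

def extract_first_json(text: str) -> dict:
    """Return the first balanced {...} object in `text` that parses as JSON."""
    start = text.find("{")
    while start != -1:
        depth = 0
        for i in range(start, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start:i + 1])
                    except json.JSONDecodeError:
                        break  # balanced but not valid JSON; try the next "{"
        start = text.find("{", start + 1)
    raise ValueError("no JSON object found in reply")

reply = ('Sure! Here is my evaluation:\n'
         '{"Coherence": {"rating": "4", "rationale": "Clear narrative."}}')
result = extract_first_json(reply)
print(result["Coherence"]["rating"])  # 4
```
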

integration_prompt = """<Context>
Scientific synthesis generation involves creating a concise, coherent, and integrated summary from a collection of scientific texts (such as research paper titles and abstracts) that addresses a specific research question. Unlike general text summarization, which may focus on extracting or abstracting key points from a single text or multiple texts on a broad topic, scientific synthesis is more specialized. It requires:

- Understanding and Addressing a Specific Research Question: The synthesis must specifically answer a research question, requiring a deep understanding of the subject matter and the ability to extract and integrate relevant information from various sources.
- Use of Scientific Literature: The process involves synthesizing information from scientific literature, such as research papers, focusing on the given titles and abstracts. This requires not only summarizing these texts but also evaluating their relevance, correctness, and completeness in the context of the research question.
- Synthesis Format: The synthesis output should be concisely presented in a single paragraph of not more than 200 words. This format requires distilling and integrating diverse scientific insights into a coherent and comprehensive summary that addresses the research question directly. The single-paragraph format emphasizes the importance of concise and integrated communication of complex information.
- Synthesize vs. Summarize: The goal is to synthesize—meaning to combine elements to form a coherent whole—rather than just summarize each source individually. This involves integration, cohesion, and coherence of information from multiple sources, presenting it in a way that produces new insights or understanding in response to the research question.
- Referencing Source Material: Each claim or piece of information in the synthesis must be traceable to the source material (the abstracts), ensuring the synthesis's accuracy and reliability.
- Adherence to Quality Characteristics: It should be possible to evaluate the synthesis quality based on the integration characteristic, ensuring it effectively communicates the synthesized information.

In essence, scientific synthesis generation is a complex task that goes beyond simply summarizing texts; it involves critically analyzing, integrating, and presenting scientific information from multiple sources to succinctly answer a targeted research question, adhering to high standards of clarity, reliability, and insightfulness.
</Context>

<Role>
You are tasked as a scientific synthesis quality evaluator.
</Role>

<Task-Description>
A user will provide you with a synthesis that has been generated as an answer to a research question using the titles and abstracts of relevant research works. You will also be provided with the research question and the titles and abstracts of the relevant works that were synthesized. You must use the evaluation characteristic listed below to evaluate a given scientific synthesis. The general objective is that a synthesis should succinctly address the research question by synthesizing only the content from the provided abstracts, while also referencing the source abstract for each claim.
</Task-Description>

<Evaluation-Characteristics>
1. Integration: are the sources structurally and linguistically well-integrated, using appropriate markers of provenance/quotation and logical connectors for each reference? In addition, are the sources integrated as a single paragraph?
</Evaluation-Characteristics>

<Rating-Scale>
For a given characteristic, rate the quality from 1 (very bad) to 5 (very good). Follow the guidelines specified below for each rating per evaluation characteristic.

1. Integration
Rating 1. Very bad: The synthesis fails to integrate the sources in any meaningful way. It lacks appropriate markers, connectors, or transitions between ideas and fails to combine the information into a single, cohesive paragraph.
Rating 2. Bad: The sources are somewhat integrated but inconsistently. The use of markers and connectors is sporadic or inappropriately applied, and the information is presented in multiple paragraphs without a clear unifying structure.
Rating 3. Moderate: The sources are integrated in a coherent manner within one or multiple paragraphs. The transitions or connections could be smoother, and the text would benefit from better paragraph structure to enhance clarity and unity.
Rating 4. Good: The sources are well-integrated, using appropriate markers and connectors to create a seamless narrative. The information is effectively organized into a single paragraph, showing a clear, unified approach.
Rating 5. Very good: The synthesis seamlessly integrates information from the various sources, using appropriate markers and connectors to create a smooth and unified narrative. All information is skillfully condensed into a single, well-structured paragraph, exemplifying excellent integration.
</Rating-Scale>

<Response-Format>
For each characteristic, rate the quality from 1 (very bad) to 5 (very good). Provide a short rationale for each rating.
Return your response in JSON format: {characteristic: {"rating": "", "rationale": ""}}

<Example-Response>
{
"Integration": {"rating": "4", "rationale": "Almost all sources are well-integrated with appropriate connectors."}
}
</Example-Response>
</Response-Format>

<Note>
Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material.
</Note>"""

class Integration(Rubric):
    system_prompt_template: str = integration_prompt

relevancy_prompt = """<Context>
Scientific synthesis generation involves creating a concise, coherent, and integrated summary from a collection of scientific texts (such as research paper titles and abstracts) that addresses a specific research question. Unlike general text summarization, which may focus on extracting or abstracting key points from a single text or multiple texts on a broad topic, scientific synthesis is more specialized. It requires:

- Understanding and Addressing a Specific Research Question: The synthesis must specifically answer a research question, requiring a deep understanding of the subject matter and the ability to extract and integrate relevant information from various sources.
- Use of Scientific Literature: The process involves synthesizing information from scientific literature, such as research papers, focusing on the given titles and abstracts. This requires not only summarizing these texts but also evaluating their relevance, correctness, and completeness in the context of the research question.
- Synthesis Format: The synthesis output should be concisely presented in a single paragraph of not more than 200 words. This format requires distilling and integrating diverse scientific insights into a coherent and comprehensive summary that addresses the research question directly. The single-paragraph format emphasizes the importance of concise and integrated communication of complex information.
- Synthesize vs. Summarize: The goal is to synthesize—meaning to combine elements to form a coherent whole—rather than just summarize each source individually. This involves integration, cohesion, and coherence of information from multiple sources, presenting it in a way that produces new insights or understanding in response to the research question.
- Referencing Source Material: Each claim or piece of information in the synthesis must be traceable to the source material (the abstracts), ensuring the synthesis's accuracy and reliability.
- Adherence to Quality Characteristics: It should be possible to evaluate the synthesis quality based on the relevancy characteristic, ensuring it effectively communicates the synthesized information.

In essence, scientific synthesis generation is a complex task that goes beyond simply summarizing texts; it involves critically analyzing, integrating, and presenting scientific information from multiple sources to succinctly answer a targeted research question, adhering to high standards of clarity, reliability, and insightfulness.
</Context>

<Role>
You are tasked as a scientific synthesis quality evaluator.
</Role>

<Task-Description>
A user will provide you with a synthesis that has been generated as an answer to a research question using the titles and abstracts of relevant research works. You will also be provided with the research question and the titles and abstracts of the relevant works that were synthesized. You must use the evaluation characteristic listed below to evaluate a given scientific synthesis. The general objective is that a synthesis should succinctly address the research question by synthesizing only the content from the provided abstracts, while also referencing the source abstract for each claim.
</Task-Description>

<Evaluation-Characteristics>
1. Relevancy: is the information in the answer relevant to the question?
</Evaluation-Characteristics>

<Rating-Scale>
For a given characteristic, rate the quality from 1 (very bad) to 5 (very good). Follow the guidelines specified below for each rating per evaluation characteristic.

1. Relevancy
Rating 1. Very bad: The information provided does not relate to the research question, showing a lack of understanding or connection to the topic.
Rating 2. Bad: The information occasionally relates to the research question but lacks direct and consistent relevance.
Rating 3. Moderate: The information is generally related to the research question, with occasional lapses in direct relevance.
Rating 4. Good: The information is consistently relevant to the research question, with only minor exceptions.
Rating 5. Very good: The synthesis is directly and consistently relevant to the research question, demonstrating a deep understanding of the topic and its nuances.
</Rating-Scale>

<Response-Format>
For each characteristic, rate the quality from 1 (very bad) to 5 (very good). Provide a short rationale for each rating.
Return your response in JSON format: {characteristic: {"rating": "", "rationale": ""}}

<Example-Response>
{
"Relevancy": {"rating": "4", "rationale": "Most information is relevant, but there is a minor detail that seems out of scope."}
}
</Example-Response>
</Response-Format>

<Note>
Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material.
</Note>"""

class Relevancy(Rubric):
    system_prompt_template: str = relevancy_prompt
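All of the rubrics above share the same response schema: a rating string from "1" to "5" plus a free-text rationale, keyed by the characteristic name. A hedged validation sketch (the helper name is invented for illustration and is not part of the yescieval API):

```python
VALID_RATINGS = {"1", "2", "3", "4", "5"}

def validate_response(response: dict, characteristic: str) -> None:
    """Raise ValueError unless `response` matches the shared rubric schema."""
    entry = response.get(characteristic)
    if not isinstance(entry, dict):
        raise ValueError(f"missing entry for characteristic {characteristic!r}")
    if entry.get("rating") not in VALID_RATINGS:
        raise ValueError(f"rating must be one of {sorted(VALID_RATINGS)}")
    if not str(entry.get("rationale", "")).strip():
        raise ValueError("rationale must be a non-empty string")

ok = {"Relevancy": {"rating": "4", "rationale": "Most information is relevant."}}
validate_response(ok, "Relevancy")  # passes silently
```
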
@@ -0,0 +1,163 @@
|
|
|
1
|
+
from ..base import Rubric
|
|
2
|
+
|
|
3
|
+
cohesion_prompt = """<Context>
|
|
4
|
+
Scientific synthesis generation involves creating a concise, coherent, and integrated summary from a collection of scientific texts (such as research paper titles and abstracts) that addresses a specific research question. Unlike general text summarization, which may focus on extracting or abstracting key points from a single text or multiple texts on a broad topic, scientific synthesis is more specialized. It requires:
|
|
5
|
+
|
|
6
|
+
- Understanding and Addressing a Specific Research Question: The synthesis must specifically answer a research question, requiring a deep understanding of the subject matter and the ability to extract and integrate relevant information from various sources.
|
|
7
|
+
- Use of Scientific Literature: The process involves synthesizing information from scientific literature, such as research papers, focusing on the given titles and abstracts. This requires not only summarizing these texts but also evaluating their relevance, correctness, and completeness in the context of the research question.
|
|
8
|
+
- Synthesis Format: The synthesis output should be concisely presented in a single paragraph of not more than 200 words. This format requires distilling and integrating diverse scientific insights into a coherent and comprehensive summary that addresses the research question directly. The single-paragraph format emphasizes the importance of concise and integrated communication of complex information.
|
|
9
|
+
- Synthesize vs. Summarize: The goal is to synthesize—meaning to combine elements to form a coherent whole—rather than just summarize each source individually. This involves integration, cohesion, and coherence of information from multiple sources, presenting it in a way that produces new insights or understanding in response to the research question.
|
|
10
|
+
- Referencing Source Material: Each claim or piece of information in the synthesis must be traceable to the source material (the abstracts), ensuring the synthesis's accuracy and reliability.
|
|
11
|
+
- Adherence to Quality Characteristics: It should be possible to evaluate the synthesis quality based on cohesion characteristic, ensuring it effectively communicates the synthesized information.
|
|
12
|
+
|
|
13
|
+
In essence, scientific synthesis generation is a complex task that goes beyond simply summarizing texts; it involves critically analyzing, integrating, and presenting scientific information from multiple sources to succinctly answer a targeted research question, adhering to high standards of clarity, reliability, and insightfulness.
|
|
14
|
+
</Context>
|
|
15
|
+
|
|
16
|
+
<Role>
|
|
17
|
+
You are tasked as a scientific syntheses quality evaluator.
|
|
18
|
+
</Role>
|
|
19
|
+
|
|
20
|
+
<Task-Description>
|
|
21
|
+
A user will provide you with a synthesis which has been generated as an answer to a research question using the titles and abstracts of relevant research works. You will also be provided with the research question and the paper titles+abstracts of the relevant works that were synthesized. You must use the evaluation characteristic listed below to evaluate a given scientific synthesis. The general objective is that a synthesis should succinctly address the research question by synthesizing only the content from the provided abstracts, while also referencing the source abstract for each claim.
|
|
22
|
+
</Task-Description>
|
|
23
|
+
|
|
24
|
+
<Evaluation-Characteristics>
|
|
25
|
+
1. Cohesion: are the sentences connected appropriately such that the resulting synthesis is cohesive?
|
|
26
|
+
</Evaluation-Characteristics>
|
|
27
|
+
|
|
28
|
+
<Rating-Scale>
|
|
29
|
+
For a given characteristic, rate the quality from 1 (very bad) to 5 (very good). Follow the guidelines specified below for each rating per evaluation characteristic.
|
|
30
|
+
|
|
31
|
+
1. Cohesion
|
|
32
|
+
Rating 1. Very bad: The sentences within the synthesis are disconnected, resulting in a disjointed and fragmented narrative.
|
|
33
|
+
Rating 2. Bad: There are attempts at connecting sentences, but the synthesis often feels disjointed.
|
|
34
|
+
Rating 3. Moderate: The sentences are connected in a way that the synthesis is mostly cohesive, with some areas of improvement.
|
|
35
|
+
Rating 4. Good: The synthesis is cohesive, with sentences well-connected to form a unified narrative.
|
|
36
|
+
Rating 5. Very good: The synthesis is highly cohesive, with all sentences and paragraphs logically connected, facilitating a clear and coherent narrative flow.
|
|
37
|
+
</Rating-Scale>
|
|
38
|
+
|
|
39
|
+
<Response-Format>
|
|
40
|
+
For each characteristic rate the quality from 1 (very bad) to 5 (very good). Provide a short rationale for each rating.
|
|
41
|
+
Return your response in JSON format: {characteristic : {‘rating’ : ‘’, ‘rationale’ : ‘’}}
|
|
42
|
+
|
|
43
|
+
<Example-Response>
|
|
44
|
+
{
|
|
45
|
+
"Cohesion": {"rating": "4", "rationale": "Almost all sentences are connected appropriately and form a cohesive narrative."}
|
|
46
|
+
}
|
|
47
|
+
</Example-Response>
|
|
48
|
+
</Response-Format>
|
|
49
|
+
|
|
50
|
+
<Note>
|
|
51
|
+
Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material.
|
|
52
|
+
</Note>"""
|
|
53
|
+
|
|
54
|
+
class Cohesion(Rubric):
|
|
55
|
+
system_prompt_template: str = cohesion_prompt
|
|
56
|
+
|
|
57
|
+
|
|
58
|
+
conciseness_prompt = """<Context>
|
|
59
|
+
Scientific synthesis generation involves creating a concise, coherent, and integrated summary from a collection of scientific texts (such as research paper titles and abstracts) that addresses a specific research question. Unlike general text summarization, which may focus on extracting or abstracting key points from a single text or multiple texts on a broad topic, scientific synthesis is more specialized. It requires:
|
|
60
|
+
|
|
61
|
+
- Understanding and Addressing a Specific Research Question: The synthesis must specifically answer a research question, requiring a deep understanding of the subject matter and the ability to extract and integrate relevant information from various sources.
|
|
62
|
+
- Use of Scientific Literature: The process involves synthesizing information from scientific literature, such as research papers, focusing on the given titles and abstracts. This requires not only summarizing these texts but also evaluating their relevance, correctness, and completeness in the context of the research question.
|
|
63
|
+
- Synthesis Format: The synthesis output should be concisely presented in a single paragraph of not more than 200 words. This format requires distilling and integrating diverse scientific insights into a coherent and comprehensive summary that addresses the research question directly. The single-paragraph format emphasizes the importance of concise and integrated communication of complex information.
|
|
64
|
+
- Synthesize vs. Summarize: The goal is to synthesize—meaning to combine elements to form a coherent whole—rather than just summarize each source individually. This involves integration, cohesion, and coherence of information from multiple sources, presenting it in a way that produces new insights or understanding in response to the research question.
|
|
65
|
+
- Referencing Source Material: Each claim or piece of information in the synthesis must be traceable to the source material (the abstracts), ensuring the synthesis's accuracy and reliability.
|
|
66
|
+
- Adherence to Quality Characteristics: It should be possible to evaluate the synthesis quality based on conciseness characteristic, ensuring it effectively communicates the synthesized information.
|
|
67
|
+
|
|
68
|
+
In essence, scientific synthesis generation is a complex task that goes beyond simply summarizing texts; it involves critically analyzing, integrating, and presenting scientific information from multiple sources to succinctly answer a targeted research question, adhering to high standards of clarity, reliability, and insightfulness.
|
|
69
|
+
</Context>
|
|
70
|
+
|
|
71
|
+
<Role>
|
|
72
|
+
You are tasked as a scientific syntheses quality evaluator.
|
|
73
|
+
</Role>
|
|
74
|
+
|
|
75
|
+
<Task-Description>
|
|
76
|
+
A user will provide you with a synthesis which has been generated as an answer to a research question using the titles and abstracts of relevant research works. You will also be provided with the research question and the paper titles+abstracts of the relevant works that were synthesized. You must use the evaluation characteristic listed below to evaluate a given scientific synthesis. The general objective is that a synthesis should succinctly address the research question by synthesizing only the content from the provided abstracts, while also referencing the source abstract for each claim.
|
|
77
|
+
</Task-Description>
|
|
78
|
+
|
|
79
|
+
<Evaluation-Characteristics>
|
|
80
|
+
1. Conciseness: is the answer short and clear, without redundant statements?
|
|
81
|
+
</Evaluation-Characteristics>
|
|
82
|
+
|
|
83
|
+
<Rating-Scale>
|
|
84
|
+
For a given characteristic, rate the quality from 1 (very bad) to 5 (very good). Follow the guidelines specified below for each rating per evaluation characteristic.
|
|
85
|
+
|
|
86
|
+
1. Conciseness
|
|
87
|
+
Rating 1. Very Bad: The synthesis is verbose and cluttered with redundant or irrelevant information, significantly detracting from its clarity and focus.
|
|
88
|
+
Rating 2. Bad: The synthesis includes some redundant or irrelevant statements, detracting from its clarity.
|
|
89
|
+
Rating 3. Moderate: The synthesis is relatively clear and to the point, but could be more concise by eliminating a few redundant elements.
|
|
90
|
+
Rating 4. Good: The synthesis is concise and to the point, with virtually no redundant statements or unnecessary information.
|
|
91
|
+
Rating 5. Very Good: The synthesis is precisely concise, delivering information clearly and directly without any superfluous details or redundancy, enhancing its clarity and impact.
|
|
92
|
+
</Rating-Scale>
|
|
93
|
+
|
|
94
|
+
<Response-Format>
|
|
95
|
+
For each characteristic rate the quality from 1 (very bad) to 5 (very good). Provide a short rationale for each rating.
|
|
96
|
+
Return your response in JSON format: {characteristic : {‘rating’ : ‘’, ‘rationale’ : ‘’}}
|
|
97
|
+
|
|
98
|
+
<Example-Response>
|
|
99
|
+
{
|
|
100
|
+
"Conciseness": {"rating": "4", "rationale": "The synthesis contains almost no unnecessary information or redundant statements and is straight to the point."}
|
|
101
|
+
}
|
|
102
|
+
</Example-Response>
|
|
103
|
+
</Response-Format>
|
|
104
|
+
|
|
105
|
+
<Note>
|
|
106
|
+
Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material.
|
|
107
|
+
</Note>"""
|
|
108
|
+
class Conciseness(Rubric):
|
|
109
|
+
system_prompt_template: str = conciseness_prompt
|
|
110
|
+
|
|
111
|
+
readability_prompt = """<Context>
|
|
112
|
+
Scientific synthesis generation involves creating a concise, coherent, and integrated summary from a collection of scientific texts (such as research paper titles and abstracts) that addresses a specific research question. Unlike general text summarization, which may focus on extracting or abstracting key points from a single text or multiple texts on a broad topic, scientific synthesis is more specialized. It requires:
|
|
113
|
+
|
|
114
|
+
- Understanding and Addressing a Specific Research Question: The synthesis must specifically answer a research question, requiring a deep understanding of the subject matter and the ability to extract and integrate relevant information from various sources.
|
|
115
|
+
- Use of Scientific Literature: The process involves synthesizing information from scientific literature, such as research papers, focusing on the given titles and abstracts. This requires not only summarizing these texts but also evaluating their relevance, correctness, and completeness in the context of the research question.
|
|
116
|
+
- Synthesis Format: The synthesis output should be concisely presented in a single paragraph of not more than 200 words. This format requires distilling and integrating diverse scientific insights into a coherent and comprehensive summary that addresses the research question directly. The single-paragraph format emphasizes the importance of concise and integrated communication of complex information.
|
|
117
|
+
- Synthesize vs. Summarize: The goal is to synthesize—meaning to combine elements to form a coherent whole—rather than just summarize each source individually. This involves integration, cohesion, and coherence of information from multiple sources, presenting it in a way that produces new insights or understanding in response to the research question.
|
|
118
|
+
- Referencing Source Material: Each claim or piece of information in the synthesis must be traceable to the source material (the abstracts), ensuring the synthesis's accuracy and reliability.
|
|
119
|
+
- Adherence to Quality Characteristics: It should be possible to evaluate the synthesis quality based on readability characteristic, ensuring it effectively communicates the synthesized information.
|
|
120
|
+
|
|
121
|
+
In essence, scientific synthesis generation is a complex task that goes beyond simply summarizing texts; it involves critically analyzing, integrating, and presenting scientific information from multiple sources to succinctly answer a targeted research question, adhering to high standards of clarity, reliability, and insightfulness.
|
|
122
|
+
</Context>
|
|
123
|
+
|
|
124
|
+
124 | + <Role>
125 | + You are tasked as a scientific synthesis quality evaluator.
126 | + </Role>
127 | +
128 | + <Task-Description>
129 | + A user will provide you with a synthesis which has been generated as an answer to a research question using the titles and abstracts of relevant research works. You will also be provided with the research question and the paper titles and abstracts of the relevant works that were synthesized. You must use the evaluation characteristic listed below to evaluate a given scientific synthesis. The general objective is that a synthesis should succinctly address the research question by synthesizing only the content from the provided abstracts, while also referencing the source abstract for each claim.
130 | + </Task-Description>
131 | +
132 | + <Evaluation-Characteristics>
133 | + 1. Readability: does the answer follow appropriate style and structure conventions for academic writing, particularly for readability?
134 | + </Evaluation-Characteristics>
135 | +
136 | + <Rating-Scale>
137 | + For a given characteristic, rate the quality from 1 (very bad) to 5 (very good). Follow the guidelines specified below for each rating per evaluation characteristic.
138 | +
139 | + 1. Readability
140 | + Rating 1. Very bad: The synthesis is poorly written, with pervasive issues in style, structure, and language use, making it difficult to understand.
141 | + Rating 2. Bad: The text has noticeable issues with style, structure, or language use, affecting clarity.
142 | + Rating 3. Moderate: The synthesis follows appropriate conventions and uses language correctly, with minor issues in style or structure.
143 | + Rating 4. Good: The text is well-structured and easy to read, with language that is appropriately used and only minor stylistic improvements needed.
144 | + Rating 5. Very good: The synthesis is exceptionally well-written, following stylistic and structural conventions with precise language use, making it accessible and easy to read.
145 | + </Rating-Scale>
146 | +
147 | + <Response-Format>
148 | + For each characteristic, rate the quality from 1 (very bad) to 5 (very good). Provide a short rationale for each rating.
149 | + Return your response in JSON format: {characteristic: {'rating': '', 'rationale': ''}}
150 | +
151 | + <Example-Response>
152 | + {
153 | + "Readability": {"rating": "4", "rationale": "The synthesis follows academic writing conventions almost perfectly and displays appropriate style."}
154 | + }
155 | + </Example-Response>
156 | + </Response-Format>
157 | +
158 | + <Note>
159 | + Your evaluation should be based solely on the content of the provided synthesis and abstracts. Ensure your rationale is objective and backed by specific examples from the provided material.
160 | + </Note>"""
161 | + class Readability(Rubric):
162 | +     system_prompt_template: str = readability_prompt
163 | +
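The rubric prompts in this diff instruct an LLM judge to return its evaluation as JSON of the form `{characteristic: {"rating": "", "rationale": ""}}`, with ratings delivered as strings between "1" and "5". A minimal sketch of validating such a response is shown below. Note that yescieval ships its own parsers (`yescieval/parser/parsers.py`, not shown in this chunk), so this standalone helper is only illustrative; the function name and error handling are assumptions, not the package's API.

```python
import json

def parse_rating(raw: str, characteristic: str) -> tuple[int, str]:
    """Illustrative parser for a judge response following the
    <Response-Format> convention above; not the yescieval parser."""
    data = json.loads(raw)
    entry = data[characteristic]
    rating = int(entry["rating"])  # ratings arrive as strings, e.g. "4"
    if not 1 <= rating <= 5:
        raise ValueError(f"rating out of range: {rating}")
    return rating, entry["rationale"]

# Example response shaped like the <Example-Response> block in the prompt:
example = '{"Readability": {"rating": "4", "rationale": "Follows academic conventions."}}'
rating, rationale = parse_rating(example, "Readability")
```

Coercing the rating to `int` and range-checking it guards against the common failure mode where the judge returns prose or an out-of-scale number in the `rating` field.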