python-flexeval 0.1.5__tar.gz
- python_flexeval-0.1.5/.env-example +5 -0
- python_flexeval-0.1.5/.github/dependabot.yml +8 -0
- python_flexeval-0.1.5/.github/workflows/deploy-to-pypi.yml +80 -0
- python_flexeval-0.1.5/.github/workflows/github-pages.yml +68 -0
- python_flexeval-0.1.5/.github/workflows/validate.yaml +32 -0
- python_flexeval-0.1.5/.gitignore +187 -0
- python_flexeval-0.1.5/.pre-commit-config.yaml +7 -0
- python_flexeval-0.1.5/.python-version +1 -0
- python_flexeval-0.1.5/.vscode/settings.json +11 -0
- python_flexeval-0.1.5/CITATION.bib +19 -0
- python_flexeval-0.1.5/CITATION.cff +37 -0
- python_flexeval-0.1.5/CLAUDE.md +114 -0
- python_flexeval-0.1.5/DEVELOPMENT.md +151 -0
- python_flexeval-0.1.5/Dockerfile +28 -0
- python_flexeval-0.1.5/EDM_2024_FlexEval.pdf +0 -0
- python_flexeval-0.1.5/LICENSE +21 -0
- python_flexeval-0.1.5/Makefile +30 -0
- python_flexeval-0.1.5/PKG-INFO +118 -0
- python_flexeval-0.1.5/README.md +78 -0
- python_flexeval-0.1.5/data/metabase/.gitkeep +0 -0
- python_flexeval-0.1.5/docker-compose.yml +44 -0
- python_flexeval-0.1.5/docs/_static/flexeval_banner.svg +1 -0
- python_flexeval-0.1.5/docs/_static/flexeval_favicon.svg +1 -0
- python_flexeval-0.1.5/docs/_static/flexeval_logo.png +0 -0
- python_flexeval-0.1.5/docs/_static/flexeval_logo2.png +0 -0
- python_flexeval-0.1.5/docs/_templates/footer.html +7 -0
- python_flexeval-0.1.5/docs/api.rst +14 -0
- python_flexeval-0.1.5/docs/conf.py +187 -0
- python_flexeval-0.1.5/docs/getting_started.rst +63 -0
- python_flexeval-0.1.5/docs/index.rst +70 -0
- python_flexeval-0.1.5/docs/sphinxext/__init__.py +0 -0
- python_flexeval-0.1.5/docs/sphinxext/github.py +214 -0
- python_flexeval-0.1.5/docs/user_guide/abstractions.rst +39 -0
- python_flexeval-0.1.5/docs/user_guide/cli.rst +20 -0
- python_flexeval-0.1.5/docs/user_guide/index.rst +16 -0
- python_flexeval-0.1.5/docs/user_guide/logging.rst +18 -0
- python_flexeval-0.1.5/docs/user_guide/motivation.md +73 -0
- python_flexeval-0.1.5/docs/user_guide/rubric_guide.md +87 -0
- python_flexeval-0.1.5/docs/vignettes.py +161 -0
- python_flexeval-0.1.5/docs/vignettes.rst +22 -0
- python_flexeval-0.1.5/example_project/example_specific_rubrics.yaml +92 -0
- python_flexeval-0.1.5/logs/.gitkeep +0 -0
- python_flexeval-0.1.5/make.bat +35 -0
- python_flexeval-0.1.5/pyproject.toml +83 -0
- python_flexeval-0.1.5/ruff.toml +3 -0
- python_flexeval-0.1.5/src/flexeval/__init__.py +11 -0
- python_flexeval-0.1.5/src/flexeval/__main__.py +11 -0
- python_flexeval-0.1.5/src/flexeval/classes/__init__.py +15 -0
- python_flexeval-0.1.5/src/flexeval/classes/base.py +32 -0
- python_flexeval-0.1.5/src/flexeval/classes/dataset.py +82 -0
- python_flexeval-0.1.5/src/flexeval/classes/eval_runner.py +158 -0
- python_flexeval-0.1.5/src/flexeval/classes/eval_set_run.py +32 -0
- python_flexeval-0.1.5/src/flexeval/classes/message.py +183 -0
- python_flexeval-0.1.5/src/flexeval/classes/metric.py +55 -0
- python_flexeval-0.1.5/src/flexeval/classes/thread.py +79 -0
- python_flexeval-0.1.5/src/flexeval/classes/tool_call.py +51 -0
- python_flexeval-0.1.5/src/flexeval/classes/turn.py +206 -0
- python_flexeval-0.1.5/src/flexeval/cli.py +104 -0
- python_flexeval-0.1.5/src/flexeval/completions.py +147 -0
- python_flexeval-0.1.5/src/flexeval/compute_metrics.py +788 -0
- python_flexeval-0.1.5/src/flexeval/config.yaml +23 -0
- python_flexeval-0.1.5/src/flexeval/configuration/__init__.py +1 -0
- python_flexeval-0.1.5/src/flexeval/configuration/completion_functions.py +231 -0
- python_flexeval-0.1.5/src/flexeval/configuration/evals.yaml +864 -0
- python_flexeval-0.1.5/src/flexeval/configuration/function_metrics.py +650 -0
- python_flexeval-0.1.5/src/flexeval/configuration/rubric_metrics.yaml +194 -0
- python_flexeval-0.1.5/src/flexeval/data_loader.py +513 -0
- python_flexeval-0.1.5/src/flexeval/db_utils.py +38 -0
- python_flexeval-0.1.5/src/flexeval/dependency_graph.py +234 -0
- python_flexeval-0.1.5/src/flexeval/eval_schema.json +256 -0
- python_flexeval-0.1.5/src/flexeval/function_types.py +173 -0
- python_flexeval-0.1.5/src/flexeval/helpers.py +52 -0
- python_flexeval-0.1.5/src/flexeval/io/__init__.py +1 -0
- python_flexeval-0.1.5/src/flexeval/io/parsers/yaml_parser.py +69 -0
- python_flexeval-0.1.5/src/flexeval/log_utils.py +34 -0
- python_flexeval-0.1.5/src/flexeval/metrics/__init__.py +8 -0
- python_flexeval-0.1.5/src/flexeval/metrics/access.py +28 -0
- python_flexeval-0.1.5/src/flexeval/metrics/save.py +39 -0
- python_flexeval-0.1.5/src/flexeval/rubric.py +62 -0
- python_flexeval-0.1.5/src/flexeval/run_utils.py +65 -0
- python_flexeval-0.1.5/src/flexeval/runner.py +132 -0
- python_flexeval-0.1.5/src/flexeval/schema/__init__.py +11 -0
- python_flexeval-0.1.5/src/flexeval/schema/config_schema.py +46 -0
- python_flexeval-0.1.5/src/flexeval/schema/eval_schema.py +163 -0
- python_flexeval-0.1.5/src/flexeval/schema/evalrun_schema.py +97 -0
- python_flexeval-0.1.5/src/flexeval/schema/rubric_schema.py +40 -0
- python_flexeval-0.1.5/src/flexeval/schema/schema_utils.py +26 -0
- python_flexeval-0.1.5/src/metabase/Dockerfile +11 -0
- python_flexeval-0.1.5/tests/__init__.py +0 -0
- python_flexeval-0.1.5/tests/data/multiturn.jsonl +3 -0
- python_flexeval-0.1.5/tests/data/plot-convos.jsonl +3 -0
- python_flexeval-0.1.5/tests/data/simple.jsonl +2 -0
- python_flexeval-0.1.5/tests/data/simple_nosystem.jsonl +2 -0
- python_flexeval-0.1.5/tests/integration/__init__.py +0 -0
- python_flexeval-0.1.5/tests/integration/config-tests.yaml +16 -0
- python_flexeval-0.1.5/tests/integration/data/multiturn.jsonl +3 -0
- python_flexeval-0.1.5/tests/integration/data/plot-convos.jsonl +3 -0
- python_flexeval-0.1.5/tests/integration/data/simple.jsonl +2 -0
- python_flexeval-0.1.5/tests/integration/evals.yaml +412 -0
- python_flexeval-0.1.5/tests/integration/functional_tests.py +1243 -0
- python_flexeval-0.1.5/tests/integration/langgraph_data.py +81 -0
- python_flexeval-0.1.5/tests/resources/function_metric.py +6 -0
- python_flexeval-0.1.5/tests/resources/functional_config.yaml +15 -0
- python_flexeval-0.1.5/tests/resources/functional_evals.yaml +405 -0
- python_flexeval-0.1.5/tests/resources/test_config.yaml +11 -0
- python_flexeval-0.1.5/tests/resources/test_dataset.jsonl +1 -0
- python_flexeval-0.1.5/tests/resources/test_evals.yaml +100 -0
- python_flexeval-0.1.5/tests/resources/test_rubric_metrics.yaml +0 -0
- python_flexeval-0.1.5/tests/resources/unittest.env +2 -0
- python_flexeval-0.1.5/tests/unit/__init__.py +0 -0
- python_flexeval-0.1.5/tests/unit/io/test_yaml_parser.py +16 -0
- python_flexeval-0.1.5/tests/unit/mixins.py +94 -0
- python_flexeval-0.1.5/tests/unit/test_completions.py +88 -0
- python_flexeval-0.1.5/tests/unit/test_compute_metrics.py +356 -0
- python_flexeval-0.1.5/tests/unit/test_data_loader.py +100 -0
- python_flexeval-0.1.5/tests/unit/test_db_utils.py +30 -0
- python_flexeval-0.1.5/tests/unit/test_dependency_graph.py +25 -0
- python_flexeval-0.1.5/tests/unit/test_eval_runner.py +27 -0
- python_flexeval-0.1.5/tests/unit/test_function_metrics.py +36 -0
- python_flexeval-0.1.5/tests/unit/test_function_types.py +140 -0
- python_flexeval-0.1.5/tests/unit/test_functional.py +706 -0
- python_flexeval-0.1.5/tests/unit/test_rubric.py +9 -0
- python_flexeval-0.1.5/tests/unit/test_schema.py +32 -0
- python_flexeval-0.1.5/uv.lock +4836 -0
- python_flexeval-0.1.5/vignettes/.gitignore +3 -0
- python_flexeval-0.1.5/vignettes/basic.py +25 -0
- python_flexeval-0.1.5/vignettes/basic_cli.md +25 -0
- python_flexeval-0.1.5/vignettes/basic_rubric.py +56 -0
- python_flexeval-0.1.5/vignettes/conversations.jsonl +1 -0
- python_flexeval-0.1.5/vignettes/custom_functions.py +2 -0
- python_flexeval-0.1.5/vignettes/custom_rubric.md +39 -0
- python_flexeval-0.1.5/vignettes/custom_rubrics.yaml +41 -0
- python_flexeval-0.1.5/vignettes/eval_run.yaml +28 -0
+++ python_flexeval-0.1.5/.github/workflows/deploy-to-pypi.yml
@@ -0,0 +1,80 @@
+name: Publish package to PyPI and TestPyPI
+
+on:
+  push:
+    branches: ["main"]
+    tags: ["v*"]
+  workflow_dispatch:
+
+
+jobs:
+  build:
+    name: Build distribution
+    runs-on: ubuntu-latest
+
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          persist-credentials: false
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+        with:
+          version: "latest"
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version-file: ".python-version"
+      - name: Build a binary wheel and a source tarball
+        run: >-
+          uv build
+      - name: Store the distribution packages
+        uses: actions/upload-artifact@v4
+        with:
+          name: python-package-distributions
+          path: dist/
+
+  publish-to-pypi:
+    name: >-
+      Publish Python distribution to PyPI
+    if: startsWith(github.ref, 'refs/tags/') # only publish to PyPI on tag pushes
+    needs:
+      - build
+    runs-on: ubuntu-latest
+    environment:
+      name: pypi
+      url: https://pypi.org/p/python-flexeval
+    permissions:
+      id-token: write # IMPORTANT: mandatory for trusted publishing
+
+    steps:
+      - name: Download all the dists
+        uses: actions/download-artifact@v4
+        with:
+          name: python-package-distributions
+          path: dist/
+      - name: Publish distribution to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1
+
+  publish-to-testpypi:
+    name: Publish Python distribution to TestPyPI
+    needs:
+      - build
+    runs-on: ubuntu-latest
+
+    environment:
+      name: testpypi
+      url: https://test.pypi.org/p/python-flexeval
+
+    permissions:
+      id-token: write # IMPORTANT: mandatory for trusted publishing
+
+    steps:
+      - name: Download all the dists
+        uses: actions/download-artifact@v4
+        with:
+          name: python-package-distributions
+          path: dist/
+      - name: Publish distribution to TestPyPI
+        uses: pypa/gh-action-pypi-publish@release/v1
+        with:
+          repository-url: https://test.pypi.org/legacy/
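The publish gating in the workflow above can be sketched in a few lines of Python. This is only an illustration of the `if: startsWith(github.ref, 'refs/tags/')` condition, not code from the package: the function name `publish_targets` and the target labels are hypothetical. GitHub's `startsWith` ignores case, so the sketch lowercases the ref before comparing.

```python
def publish_targets(github_ref: str) -> list[str]:
    """Return the registries a run would publish to (hypothetical helper).

    Mirrors the workflow: the TestPyPI job has no `if:` guard, while the
    PyPI job runs only when the ref starts with 'refs/tags/'.
    """
    targets = ["testpypi"]  # unconditional job in the workflow
    # GitHub's startsWith() is case-insensitive, so normalize the ref first.
    if github_ref.lower().startswith("refs/tags/"):
        targets.insert(0, "pypi")
    return targets


# A push to main publishes only to TestPyPI; a tag push publishes to both.
branch_targets = publish_targets("refs/heads/main")
tag_targets = publish_targets("refs/tags/v0.1.5")
```

Under this reading, a push to `main` or a manual dispatch from a branch uploads to TestPyPI only, and a tag push uploads to both registries.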
+++ python_flexeval-0.1.5/.github/workflows/github-pages.yml
@@ -0,0 +1,68 @@
+# Workflow for building and deploying a Sphinx site to GitHub Pages
+name: Deploy Sphinx site with GitHub Pages dependencies preinstalled
+
+on:
+  # Runs on pushes targeting following list of branches
+  push:
+    branches: ["main"]
+
+  # Also allows you to run this workflow manually from the Actions tab
+  workflow_dispatch:
+
+# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
+permissions:
+  contents: read
+  pages: write
+  id-token: write
+
+# Allow only one concurrent deployment, skipping runs queued between the run in-progress and latest queued.
+# However, do NOT cancel in-progress runs as we want to allow these production deployments to complete.
+concurrency:
+  group: "pages"
+  cancel-in-progress: false
+
+jobs:
+  # Build job
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Setup Pages
+        uses: actions/configure-pages@v5
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+        with:
+          version: "latest"
+
+      - name: "Set up Python"
+        uses: actions/setup-python@v5
+        with:
+          python-version-file: ".python-version"
+
+      - name: Install docs dependencies
+        run: |
+          uv sync --group docs
+
+      - name: Build with Sphinx
+        run: |
+          make html BUILDDIR=_site
+
+      - name: Upload artifact
+        uses: actions/upload-pages-artifact@v3
+        with:
+          path: _site/html/
+
+  # Deployment job
+  deploy:
+    environment:
+      name: github-pages
+      url: ${{ steps.deployment.outputs.page_url }}
+    runs-on: ubuntu-latest
+    needs: build
+    steps:
+      - name: Deploy to GitHub Pages
+        id: deployment
+        uses: actions/deploy-pages@v4
+++ python_flexeval-0.1.5/.github/workflows/validate.yaml
@@ -0,0 +1,32 @@
+name: Run validation
+
+on: [push]
+
+jobs:
+  build:
+    name: Run validation
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Check out code
+        uses: actions/checkout@v4
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+        with:
+          version: "latest"
+
+      - name: "Set up Python"
+        uses: actions/setup-python@v5
+        with:
+          python-version-file: ".python-version"
+
+      - name: Install Python dependencies
+        run: |
+          uv sync
+
+      - name: Run automated validation checks
+        run: |
+          uv run python -m unittest discover -s tests.unit
+        env:
+          CURRENT_BRANCH_NAME: ${{ github.head_ref || github.ref_name }}
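The `CURRENT_BRANCH_NAME` expression in the validation workflow uses `||` as a fallback: the first non-empty operand wins. A minimal Python sketch of that behaviour follows. The helper names `resolve_branch_name` and `branch_from_env` are hypothetical, and the diff shown here does not reveal how, or whether, the tests read this variable.

```python
import os
from typing import Mapping


def resolve_branch_name(head_ref: str, ref_name: str) -> str:
    """Mimic `${{ github.head_ref || github.ref_name }}` (hypothetical helper).

    `github.head_ref` is only populated for pull request events, so on a
    plain push it is empty and the expression falls back to `github.ref_name`.
    """
    return head_ref or ref_name


def branch_from_env(env: Mapping[str, str] = os.environ) -> str:
    """Read the branch name a test could see, defaulting to an empty string."""
    return env.get("CURRENT_BRANCH_NAME", "")


push_branch = resolve_branch_name("", "main")                # push event
pr_branch = resolve_branch_name("feature/x", "12/merge")     # pull request event
```

Because the workflow is triggered by `on: [push]` only, the fallback to `github.ref_name` is the branch that normally applies.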
+++ python_flexeval-0.1.5/.gitignore
@@ -0,0 +1,187 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+# Usually these files are written by a python script from a template
+# before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+# For a library or package, you might want to ignore these files since the code is
+# intended to run in multiple environments; otherwise, check them in:
+# .python-version
+
+# pipenv
+# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+# However, in case of collaboration, if having platform-specific dependencies or dependencies
+# having no cross-platform support, pipenv may install dependencies that don't work, or not
+# install all needed dependencies.
+#Pipfile.lock
+
+# poetry
+# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+# This is especially recommended for binary packages to ensure reproducibility, and is more
+# commonly ignored for libraries.
+# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+#poetry.lock
+
+# pdm
+# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+#pdm.lock
+# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+# in version control.
+# https://pdm.fming.dev/#use-with-ide
+.pdm.toml
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# PyCharm
+# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+# and can be added to the global gitignore or merged into this file. For a more nuclear
+# option (not recommended) you can uncomment the following to ignore the entire idea folder.
+#.idea/
+
+#Evals-specific
+logs/*
+!logs/.gitkeep
+
+data/results/evals-outputs/*
+!data/results/evals-outputs/.gitkeep
+
+.DS_Store
+temp/
+test-cases/
+src/llm-evals/evals_sync/
+*.db
+*.db-*
+
+# notebooks
+*.ipynb
+
+# sqlite data
+*.sqlite
+
+# Claude Code
+.claude/
+
+# Docs
+docs/generated
+
+++ python_flexeval-0.1.5/.python-version
@@ -0,0 +1 @@
+3.10
+++ python_flexeval-0.1.5/CITATION.bib
@@ -0,0 +1,19 @@
+@inproceedings{christie_flexeval_2024,
+  author = {S. Thomas Christie and
+    Baptiste Moreau-Pernet and
+    Yu Tian and
+    John Whitmer},
+  title = {FlexEval: a customizable tool for chatbot
+    performance evaluation and dialogue analysis
+    },
+  booktitle = {Proceedings of the 17th International Conference
+    on Educational Data Mining
+    },
+  year = 2024,
+  pages = {903--908},
+  publisher = {International Educational Data Mining Society},
+  month = jul,
+  venue = {Atlanta, Georgia, USA},
+  doi = {10.5281/zenodo.12729993},
+  url = {https://doi.org/10.5281/zenodo.12729993},
+}
+++ python_flexeval-0.1.5/CITATION.cff
@@ -0,0 +1,37 @@
+cff-version: 1.2.0
+message: 'If FlexEval contributes to a project that leads to a scientific publication, please acknowledge this fact by citing: S. Thomas Christie, Baptiste Moreau-Pernet, Yu Tian, and John Whitmer, "FlexEval: a customizable tool for chatbot performance evaluation and dialogue analysis", Proceedings of the 17th International Conference on Educational Data Mining, pp. 903-908, 2024.'
+title: 'FlexEval: a customizable tool for chatbot performance evaluation and dialogue analysis'
+authors:
+  - family-names: Christie
+    given-names: S. Thomas
+  - family-names: Moreau-Pernet
+    given-names: Baptiste
+  - family-names: Tian
+    given-names: Yu
+  - family-names: Whitmer
+    given-names: John
+type: software
+url: 'https://github.com/DigitalHarborFoundation/FlexEval'
+repository-code: 'https://github.com/DigitalHarborFoundation/FlexEval'
+preferred-citation:
+  type: conference-paper
+  authors:
+    - family-names: Christie
+      given-names: S. Thomas
+    - family-names: Moreau-Pernet
+      given-names: Baptiste
+    - family-names: Tian
+      given-names: Yu
+    - family-names: Whitmer
+      given-names: John
+  title: "FlexEval: a customizable tool for chatbot performance evaluation and dialogue analysis"
+  year: 2024
+  date-published: 2024-07
+  conference: 17th International Conference on Educational Data Mining
+  venue: Atlanta, Georgia, USA
+  start: 903
+  end: 908
+  doi: 10.5281/zenodo.12729993
+  publisher:
+    name: International Educational Data Mining Society
+    website: 'https://educationaldatamining.org/'
+++ python_flexeval-0.1.5/CLAUDE.md
@@ -0,0 +1,114 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Development Commands
+
+### Environment Setup
+```bash
+# Install uv package manager first: https://docs.astral.sh/uv/getting-started/installation/
+curl -LsSf https://astral.sh/uv/install.sh | sh
+
+# Install dependencies
+uv sync
+```
+
+### Running FlexEval
+
+Not quite ready; our refactor is close to being done.
+
+```bash
+# CLI help
+uv run python -m flexeval --help
+```
+
+### Testing
+```bash
+# Run unit tests
+uv run python -m unittest discover -s tests.unit
+
+# Run specific test file
+uv run python -m unittest tests.unit.{module_name}
+
+# Integration tests are in tests/integration/, but aren't ready yet
+```
+
+### Build and Dependencies
+```bash
+# Build the package
+uv build
+
+# Add dependency
+uv add {package_name}
+
+# Update dependencies
+uv lock --upgrade
+```
+
+### Linting
+```bash
+# Run ruff linter (configured in ruff.toml)
+ruff check src/
+ruff format src/
+```
+
+## Architecture Overview
+
+FlexEval is a tool for evaluating LLM-powered systems using custom metrics, completion functions, and LLM-graded rubrics. The system operates on conversational data at multiple granularities.
+
+### Core Abstractions
+
+**EvalRun** (`src/flexeval/schema/evalrun_schema.py`): The top-level execution unit that combines:
+- Data sources (conversations in JSONL format as inputs, an SQLite filepath as output)
+- An Eval specification (metrics to compute)
+- Configuration (workers, database path, etc.)
+- Rubric and function sources
+
+**Eval** (`src/flexeval/schema/eval_schema.py`): Defines what to evaluate:
+- Function metrics (Python functions that compute numeric values)
+- Rubric metrics (LLM-graded evaluations using chain-of-thought)
+- Completion LLM (for generating new responses)
+- Grader LLM (for rubric evaluation)
+- Dependencies between metrics
+
+**Config** (`src/flexeval/schema/config_schema.py`): Defines how to evaluate (e.g. single- vs multi-process, etc.)
+
+### Data Hierarchy
+The evaluation operates at multiple levels of granularity:
+- **Thread**: Full conversation
+- **Turn**: User-assistant exchange pair
+- **Message**: Individual message from user or assistant
+- **ToolCall**: Function/tool invocation within a message
+
+### Key Components
+
+**Configuration System**:
+- `src/flexeval/configuration/rubric_metrics.yaml`: Default set of rubrics
+- `src/flexeval/configuration/function_metrics.py`: Default set of Python metric functions
+- `src/flexeval/configuration/completion_functions.py`: Set of functions for producing LLM completions (usually via API call)
+
+**Execution Pipeline** (`src/flexeval/runner.py`):
+1. Load configuration and eval specification
+2. Create Dataset from data sources
+3. Run EvalRunner to compute metrics
+4. Store results in SQLite database
+
+**Metric System**:
+- Function metrics: Python functions that analyze conversations/turns/messages
+- Rubric metrics: LLM-based evaluations using structured prompts
+- Dependencies: Metrics can depend on other metrics meeting certain criteria
+- Aggregation: Results aggregated by role, turn, etc.
+
+### Data Format
+
+Input data is in JSONL format with conversations as:
+```json
+{"input": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
+```
+
+Results are stored in an SQLite database (defaulting to `data/results/results.db`) for querying and analysis.
+
+### Schema System
+
+The project uses Pydantic models for validation:
+- `src/flexeval/schema/`: Contains all schema definitions
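The JSONL input format and the Thread/Turn/Message hierarchy described in CLAUDE.md can be illustrated with a short, standard-library sketch. It does not use FlexEval's own loader or classes. The functions `parse_thread` and `group_into_turns` are hypothetical, and the turn-splitting rule (a new turn starts when a user message follows a non-user message) is an assumption that may differ from the package's actual behaviour.

```python
import json


def parse_thread(line: str) -> list[dict]:
    """Parse one JSONL line and return its list of messages (the thread)."""
    record = json.loads(line)
    return record["input"]


def group_into_turns(messages: list[dict]) -> list[list[dict]]:
    """Group messages into turns (assumed rule, not FlexEval's implementation).

    A new turn begins whenever a user message follows a non-user message,
    so each turn is roughly one user-assistant exchange.
    """
    turns: list[list[dict]] = []
    current: list[dict] = []
    for message in messages:
        starts_new_turn = (
            current
            and message["role"] == "user"
            and current[-1]["role"] != "user"
        )
        if starts_new_turn:
            turns.append(current)
            current = []
        current.append(message)
    if current:
        turns.append(current)
    return turns


# One conversation per line, in the shape shown in the Data Format section.
sample_line = json.dumps({
    "input": [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello"},
        {"role": "user", "content": "Bye"},
        {"role": "assistant", "content": "Goodbye"},
    ]
})
thread = parse_thread(sample_line)
turns = group_into_turns(thread)
```

With this rule, a leading system message would form its own group; how the package treats system messages is not visible in the hunks shown here.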