assignment-codeval 0.0.8__tar.gz → 0.0.10__tar.gz
- {assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/PKG-INFO +102 -9
- {assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/README.md +97 -3
- {assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/pyproject.toml +5 -6
- assignment_codeval-0.0.10/src/assignment_codeval/ai_benchmark.py +528 -0
- {assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval/cli.py +2 -0
- {assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval/evaluate.py +65 -10
- {assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval.egg-info/PKG-INFO +102 -9
- {assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval.egg-info/SOURCES.txt +1 -0
- {assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval.egg-info/requires.txt +4 -5
- {assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/setup.cfg +0 -0
- {assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval/__init__.py +0 -0
- {assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval/canvas_utils.py +0 -0
- {assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval/commons.py +0 -0
- {assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval/convertMD2Html.py +0 -0
- {assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval/create_assignment.py +0 -0
- {assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval/file_utils.py +0 -0
- {assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval/github_connect.py +0 -0
- {assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval/submissions.py +0 -0
- {assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval.egg-info/dependency_links.txt +0 -0
- {assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval.egg-info/entry_points.txt +0 -0
- {assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval.egg-info/top_level.txt +0 -0
- {assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/tests/test_codeval.py +0 -0
{assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/PKG-INFO
RENAMED

@@ -1,20 +1,19 @@
 Metadata-Version: 2.4
 Name: assignment-codeval
-Version: 0.0.8
+Version: 0.0.10
 Summary: CodEval for evaluating programming assignments
 Requires-Python: >=3.12
 Description-Content-Type: text/markdown
 Requires-Dist: canvasapi==3.3.0
-Requires-Dist: certifi==2021.10.8
-Requires-Dist: charset-normalizer==2.0.9
 Requires-Dist: click==8.2.1
 Requires-Dist: configparser==5.2.0
-Requires-Dist: idna==3.3
 Requires-Dist: pytz==2021.3
-Requires-Dist: requests
-Requires-Dist: urllib3==1.26.7
+Requires-Dist: requests>=2.28.0
 Requires-Dist: pymongo==4.3.3
 Requires-Dist: markdown==3.4.1
+Requires-Dist: anthropic>=0.39.0
+Requires-Dist: openai>=1.0.0
+Requires-Dist: google-generativeai>=0.8.0
 Provides-Extra: test
 Requires-Dist: pytest>=7.0; extra == "test"
 
@@ -72,9 +71,9 @@ Tags used in a spec file (\<course name>.codeval)
 | CMD/TCMD | Run Command | Will be followed by a command to run. The TCMD will cause the evaluation to fail if the command exits with an error. |
 | CMP | Compare | Will be followed by two files to compare. |
 | T/HT | Test Case | Will be followed by the command to run to test the submission. |
-| I/IF | Supply Input | Specifies the input for a test case. |
-| O/OF | Check Output | Specifies the expected output for a test case. |
-| E | Check Error | Specifies the expected error output for a test case. |
+| I/IB/IF | Supply Input | Specifies the input for a test case. I adds a newline, IB does not add a newline, IF reads from a file. |
+| O/OB/OF | Check Output | Specifies the expected output for a test case. O adds a newline, OB does not add a newline, OF reads from a file. |
+| E/EB | Check Error | Specifies the expected error output for a test case. E adds a newline, EB does not. |
 | TO | Timeout | Specifies the time limit in seconds for a test case to run. Defaults to 20 seconds. |
 | X | Exit Code | Specifies the expected exit code for a test case. Defaults to zero. |
 | SS | Start Server | Command containing timeout (wait until server starts), kill timeout (wait to kill the server), and the command to start a server |
@@ -247,3 +246,97 @@ Refer to a sample spec file [here](SQL/samples/ASSIGNMENT:CREATE.codeval)
 C cc -o bigbag --std=gnu11 bigbag.c
 
 
+## 4. Test Assignments with AI Models
+
+Test programming assignments against multiple AI models (Claude, GPT, Gemini) to benchmark their performance.
+
+### Installation
+
+Install the AI provider packages you want to use:
+
+```bash
+# Install all AI providers
+pip install assignment-codeval[ai]
+
+# Or install specific providers
+pip install anthropic            # For Claude models
+pip install openai               # For GPT models
+pip install google-generativeai  # For Gemini models
+```
+
+### codeval.ini contents (optional)
+```
+[AI]
+anthropic_key=sk-ant-...
+openai_key=sk-...
+google_key=...
+```
+
+API keys can also be provided via:
+- Environment variables: `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `GOOGLE_API_KEY`
+- Command line options: `--anthropic-key`, `--openai-key`, `--google-key`
+
+### Command to run
+```bash
+assignment-codeval test-with-ai <codeval_file> [OPTIONS]
+```
+
+### Options
+| Option | Description |
+|--------|-------------|
+| `-o, --output-dir` | Directory to store solutions and results (default: `ai_test_results`) |
+| `-n, --attempts` | Number of attempts per model (default: 1) |
+| `-m, --models` | Specific models to test (can be used multiple times) |
+| `-p, --providers` | Only test models from specific providers: `anthropic`, `openai`, `google` |
+| `--anthropic-key` | Anthropic API key |
+| `--openai-key` | OpenAI API key |
+| `--google-key` | Google API key |
+
+### Examples
+```bash
+# Test with all Anthropic models
+assignment-codeval test-with-ai my_assignment.codeval -p anthropic
+
+# Test with specific model, 3 attempts each
+assignment-codeval test-with-ai my_assignment.codeval -m "Claude Sonnet 4" -n 3
+
+# Test with all providers (requires all API keys)
+assignment-codeval test-with-ai my_assignment.codeval -n 2
+
+# Pass API key directly
+assignment-codeval test-with-ai my_assignment.codeval --anthropic-key sk-ant-xxx -p anthropic
+```
+
+### Supported Models
+
+| Provider | Models |
+|----------|--------|
+| Anthropic | Claude Sonnet 4, Claude Opus 4 |
+| OpenAI | GPT-4o, GPT-4o Mini, o1, o3-mini |
+| Google | Gemini 2.0 Flash, Gemini 1.5 Pro |
+
+Note: You can add additional models using `-m "model-id"`. Check each provider's documentation for available model IDs.
+
+### Output Structure
+```
+ai_test_results/
+├── prompt.txt                 # The prompt sent to AI models
+├── results.json               # Summary of all results
+├── Claude_Sonnet_4/
+│   └── attempt_1/
+│       ├── raw_response.txt   # Raw AI response
+│       ├── solution.c         # Extracted code
+│       └── <codeval files>    # Copied for evaluation
+├── GPT-4o/
+│   └── attempt_1/
+│       └── ...
+└── ...
+```
+
+### Notes
+- The command extracts the assignment description from the codeval file (between `CRT_HW START` and `CRT_HW END` tags)
+- Support files from `support_files/` directory are automatically copied for evaluation
+- Results include pass/fail status, response time, and any errors
+- Use multiple attempts (`-n`) to account for AI response variability
+
+
{assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/README.md
RENAMED

@@ -52,9 +52,9 @@ Tags used in a spec file (\<course name>.codeval)
 | CMD/TCMD | Run Command | Will be followed by a command to run. The TCMD will cause the evaluation to fail if the command exits with an error. |
 | CMP | Compare | Will be followed by two files to compare. |
 | T/HT | Test Case | Will be followed by the command to run to test the submission. |
-| I/IF | Supply Input | Specifies the input for a test case. |
-| O/OF | Check Output | Specifies the expected output for a test case. |
-| E | Check Error | Specifies the expected error output for a test case. |
+| I/IB/IF | Supply Input | Specifies the input for a test case. I adds a newline, IB does not add a newline, IF reads from a file. |
+| O/OB/OF | Check Output | Specifies the expected output for a test case. O adds a newline, OB does not add a newline, OF reads from a file. |
+| E/EB | Check Error | Specifies the expected error output for a test case. E adds a newline, EB does not. |
 | TO | Timeout | Specifies the time limit in seconds for a test case to run. Defaults to 20 seconds. |
 | X | Exit Code | Specifies the expected exit code for a test case. Defaults to zero. |
 | SS | Start Server | Command containing timeout (wait until server starts), kill timeout (wait to kill the server), and the command to start a server |
@@ -227,3 +227,97 @@ Refer to a sample spec file [here](SQL/samples/ASSIGNMENT:CREATE.codeval)
 C cc -o bigbag --std=gnu11 bigbag.c
 
 
+## 4. Test Assignments with AI Models
+
+Test programming assignments against multiple AI models (Claude, GPT, Gemini) to benchmark their performance.
+
+### Installation
+
+Install the AI provider packages you want to use:
+
+```bash
+# Install all AI providers
+pip install assignment-codeval[ai]
+
+# Or install specific providers
+pip install anthropic            # For Claude models
+pip install openai               # For GPT models
+pip install google-generativeai  # For Gemini models
+```
+
+### codeval.ini contents (optional)
+```
+[AI]
+anthropic_key=sk-ant-...
+openai_key=sk-...
+google_key=...
+```
+
+API keys can also be provided via:
+- Environment variables: `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `GOOGLE_API_KEY`
+- Command line options: `--anthropic-key`, `--openai-key`, `--google-key`
+
+### Command to run
+```bash
+assignment-codeval test-with-ai <codeval_file> [OPTIONS]
+```
+
+### Options
+| Option | Description |
+|--------|-------------|
+| `-o, --output-dir` | Directory to store solutions and results (default: `ai_test_results`) |
+| `-n, --attempts` | Number of attempts per model (default: 1) |
+| `-m, --models` | Specific models to test (can be used multiple times) |
+| `-p, --providers` | Only test models from specific providers: `anthropic`, `openai`, `google` |
+| `--anthropic-key` | Anthropic API key |
+| `--openai-key` | OpenAI API key |
+| `--google-key` | Google API key |
+
+### Examples
+```bash
+# Test with all Anthropic models
+assignment-codeval test-with-ai my_assignment.codeval -p anthropic
+
+# Test with specific model, 3 attempts each
+assignment-codeval test-with-ai my_assignment.codeval -m "Claude Sonnet 4" -n 3
+
+# Test with all providers (requires all API keys)
+assignment-codeval test-with-ai my_assignment.codeval -n 2
+
+# Pass API key directly
+assignment-codeval test-with-ai my_assignment.codeval --anthropic-key sk-ant-xxx -p anthropic
+```
+
+### Supported Models
+
+| Provider | Models |
+|----------|--------|
+| Anthropic | Claude Sonnet 4, Claude Opus 4 |
+| OpenAI | GPT-4o, GPT-4o Mini, o1, o3-mini |
+| Google | Gemini 2.0 Flash, Gemini 1.5 Pro |
+
+Note: You can add additional models using `-m "model-id"`. Check each provider's documentation for available model IDs.
+
+### Output Structure
+```
+ai_test_results/
+├── prompt.txt                 # The prompt sent to AI models
+├── results.json               # Summary of all results
+├── Claude_Sonnet_4/
+│   └── attempt_1/
+│       ├── raw_response.txt   # Raw AI response
+│       ├── solution.c         # Extracted code
+│       └── <codeval files>    # Copied for evaluation
+├── GPT-4o/
+│   └── attempt_1/
+│       └── ...
+└── ...
+```
+
+### Notes
+- The command extracts the assignment description from the codeval file (between `CRT_HW START` and `CRT_HW END` tags)
+- Support files from `support_files/` directory are automatically copied for evaluation
+- Results include pass/fail status, response time, and any errors
+- Use multiple attempts (`-n`) to account for AI response variability
+
+
{assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/pyproject.toml
RENAMED

@@ -4,22 +4,21 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "assignment-codeval"
-version = "0.0.8"
+version = "0.0.10"
 description = "CodEval for evaluating programming assignments"
 readme = "README.md"
 requires-python = ">=3.12"
 dependencies = [
     "canvasapi==3.3.0",
-    "certifi==2021.10.8",
-    "charset-normalizer==2.0.9",
     "click==8.2.1",
     "configparser==5.2.0",
-    "idna==3.3",
     "pytz==2021.3",
-    "requests",
-    "urllib3==1.26.7",
+    "requests>=2.28.0",
     "pymongo==4.3.3",
     "markdown==3.4.1",
+    "anthropic>=0.39.0",
+    "openai>=1.0.0",
+    "google-generativeai>=0.8.0",
 ]
 
 [project.optional-dependencies]
assignment_codeval-0.0.10/src/assignment_codeval/ai_benchmark.py

@@ -0,0 +1,528 @@
+#!/usr/bin/env python3
+"""
+AI Benchmark Module for CodEval
+
+Sends programming assignments to various AI models, collects their solutions,
+and evaluates them using the existing CodEval framework.
+
+Supported providers:
+- Anthropic (Claude models)
+- OpenAI (GPT models)
+- Google (Gemini models)
+"""
+
+import os
+import re
+import json
+import time
+import subprocess
+from pathlib import Path
+from dataclasses import dataclass
+from typing import Optional
+from configparser import ConfigParser
+
+import click
+
+from .commons import info, warn, error, debug
+
+
+@dataclass
+class AIModel:
+    """Represents an AI model configuration."""
+    provider: str  # anthropic, openai, google
+    model_id: str
+    display_name: str
+
+
+# Default models to benchmark
+DEFAULT_MODELS = [
+    # Anthropic models
+    AIModel("anthropic", "claude-sonnet-4-20250514", "Claude Sonnet 4"),
+    AIModel("anthropic", "claude-opus-4-20250514", "Claude Opus 4"),
+    # OpenAI models
+    AIModel("openai", "gpt-4o", "GPT-4o"),
+    AIModel("openai", "gpt-4o-mini", "GPT-4o Mini"),
+    AIModel("openai", "o1", "o1"),
+    AIModel("openai", "o3-mini", "o3-mini"),
+    # Google models
+    AIModel("google", "gemini-2.0-flash", "Gemini 2.0 Flash"),
+    AIModel("google", "gemini-1.5-pro-latest", "Gemini 1.5 Pro"),
+]
+
+
+def load_ai_config() -> dict:
+    """Load AI API keys from config file."""
+    config_path = Path.home() / ".config" / "codeval.ini"
+    config = ConfigParser()
+
+    if config_path.exists():
+        config.read(config_path)
+
+    return {
+        "anthropic_key": config.get("AI", "anthropic_key", fallback=os.environ.get("ANTHROPIC_API_KEY")),
+        "openai_key": config.get("AI", "openai_key", fallback=os.environ.get("OPENAI_API_KEY")),
+        "google_key": config.get("AI", "google_key", fallback=os.environ.get("GOOGLE_API_KEY")),
+    }
+
+
+def extract_assignment_from_codeval(codeval_path: str) -> tuple[str, str, str]:
+    """
+    Extract assignment description, compile command, and language from a codeval file.
+
+    Returns:
+        (description, compile_command, language)
+    """
+    with open(codeval_path, "r", encoding="utf-8") as f:
+        content = f.read()
+
+    # Extract content between CRT_HW START and CRT_HW END
+    match = re.search(r"CRT_HW START \S+\n(.*?)CRT_HW END", content, re.DOTALL)
+    if match:
+        description = match.group(1).strip()
+    else:
+        # If no CRT_HW markers, use everything before first tag
+        lines = []
+        for line in content.split("\n"):
+            if re.match(r"^[A-Z]+\s", line):
+                break
+            lines.append(line)
+        description = "\n".join(lines).strip()
+
+    # Extract compile command
+    compile_match = re.search(r"^C\s+(.+)$", content, re.MULTILINE)
+    compile_cmd = compile_match.group(1) if compile_match else ""
+
+    # Detect language from compile command
+    language = "unknown"
+    if "gcc" in compile_cmd or "cc " in compile_cmd:
+        language = "c"
+    elif "g++" in compile_cmd:
+        language = "cpp"
+    elif "python" in compile_cmd:
+        language = "python"
+    elif "javac" in compile_cmd:
+        language = "java"
+    elif "rustc" in compile_cmd or "cargo" in compile_cmd:
+        language = "rust"
+    elif "go " in compile_cmd:
+        language = "go"
+
+    return description, compile_cmd, language
+
+
+def extract_source_filename(compile_cmd: str) -> str:
+    """Extract the source filename from a compile command."""
+    # Look for common source file extensions
+    match = re.search(r"(\S+\.(c|cpp|cc|py|java|rs|go))", compile_cmd)
+    if match:
+        return match.group(1)
+    return "solution.c"
+
+
+def build_prompt(description: str, language: str, filename: str) -> str:
+    """Build the prompt to send to AI models."""
+    lang_hints = {
+        "c": "Write the solution in C. Use standard C libraries only.",
+        "cpp": "Write the solution in C++. Use standard C++ libraries only.",
+        "python": "Write the solution in Python 3.",
+        "java": "Write the solution in Java.",
+        "rust": "Write the solution in Rust.",
+        "go": "Write the solution in Go.",
+    }
+
+    hint = lang_hints.get(language, "")
+
+    return f"""You are solving a programming assignment. {hint}
+
+IMPORTANT: Output ONLY the code. No explanations, no markdown code blocks, no comments about the solution. Just the raw source code that can be directly saved to a file and compiled/run.
+
+The solution should be saved as: {filename}
+
+Here is the assignment:
+
+{description}
+
+Remember: Output ONLY the code, nothing else."""
+
+
+def extract_code_from_response(response: str, language: str) -> str:
+    """Extract code from AI response, handling markdown blocks if present."""
+    # Try to extract from markdown code block
+    patterns = [
+        r"```(?:c|cpp|python|java|rust|go)?\n(.*?)```",
+        r"```\n(.*?)```",
+    ]
+
+    for pattern in patterns:
+        match = re.search(pattern, response, re.DOTALL)
+        if match:
+            return match.group(1).strip()
+
+    # If no code blocks, assume the whole response is code
+    # But strip any leading/trailing explanation
+    lines = response.strip().split("\n")
+
+    # Remove lines that look like explanations
+    code_lines = []
+    in_code = False
+    for line in lines:
+        # Detect start of code
+        if not in_code:
+            if line.startswith("#include") or line.startswith("import ") or \
+               line.startswith("def ") or line.startswith("int ") or \
+               line.startswith("void ") or line.startswith("public ") or \
+               line.startswith("package ") or line.startswith("use ") or \
+               line.startswith("fn ") or line.startswith("func "):
+                in_code = True
+
+        if in_code:
+            code_lines.append(line)
+
+    if code_lines:
+        return "\n".join(code_lines)
+
+    return response.strip()
+
+
+def call_anthropic(model_id: str, prompt: str, api_key: str) -> Optional[str]:
+    """Call Anthropic API."""
+    try:
+        import anthropic
+    except ImportError:
+        error("anthropic package not installed. Run: pip install anthropic")
+        return None
+
+    try:
+        client = anthropic.Anthropic(api_key=api_key)
+
+        # Adjust max_tokens based on model capabilities
+        max_tokens = 4096  # Safe default for older models
+        if "claude-3-5" in model_id or "claude-sonnet-4" in model_id or "claude-opus-4" in model_id:
+            max_tokens = 8192
+
+        message = client.messages.create(
+            model=model_id,
+            max_tokens=max_tokens,
+            messages=[{"role": "user", "content": prompt}]
+        )
+
+        return message.content[0].text
+    except Exception as e:
+        error(f"Anthropic API error: {e}")
+        return None
+
+
+def call_openai(model_id: str, prompt: str, api_key: str) -> Optional[str]:
+    """Call OpenAI API."""
+    try:
+        import openai
+    except ImportError:
+        error("openai package not installed. Run: pip install openai")
+        return None
+
+    try:
+        client = openai.OpenAI(api_key=api_key)
+
+        response = client.chat.completions.create(
+            model=model_id,
+            messages=[{"role": "user", "content": prompt}],
+            max_tokens=8192,
+        )
+
+        return response.choices[0].message.content
+    except Exception as e:
+        error(f"OpenAI API error: {e}")
+        return None
+
+
+def call_google(model_id: str, prompt: str, api_key: str) -> Optional[str]:
+    """Call Google Gemini API."""
+    try:
+        import google.generativeai as genai
+    except ImportError:
+        error("google-generativeai package not installed. Run: pip install google-generativeai")
+        return None
+
+    try:
+        genai.configure(api_key=api_key)
+
+        model = genai.GenerativeModel(model_id)
+        response = model.generate_content(prompt)
+
+        return response.text
+    except Exception as e:
+        error(f"Google API error: {e}")
+        return None
+
+
+def call_model(model: AIModel, prompt: str, config: dict) -> Optional[str]:
+    """Call the appropriate API based on provider."""
+    if model.provider == "anthropic":
+        if not config["anthropic_key"]:
+            warn(f"No Anthropic API key configured, skipping {model.display_name}")
+            return None
+        return call_anthropic(model.model_id, prompt, config["anthropic_key"])
+
+    elif model.provider == "openai":
+        if not config["openai_key"]:
+            warn(f"No OpenAI API key configured, skipping {model.display_name}")
+            return None
+        return call_openai(model.model_id, prompt, config["openai_key"])
+
+    elif model.provider == "google":
+        if not config["google_key"]:
+            warn(f"No Google API key configured, skipping {model.display_name}")
+            return None
+        return call_google(model.model_id, prompt, config["google_key"])
+
+    else:
+        error(f"Unknown provider: {model.provider}")
+        return None
+
+
+def run_benchmark(
+    codeval_path: str,
+    output_dir: str,
+    models: Optional[list[AIModel]] = None,
+    attempts: int = 1,
+    config: Optional[dict] = None,
+) -> dict:
+    """
+    Run benchmark on a codeval assignment with multiple AI models.
+
+    Args:
+        codeval_path: Path to the .codeval file
+        output_dir: Directory to store solutions and results
+        models: List of models to test (defaults to DEFAULT_MODELS)
+        attempts: Number of attempts per model
+        config: Optional config dict with API keys
+
+    Returns:
+        Dictionary with results for each model
+    """
+    if models is None:
+        models = DEFAULT_MODELS
+
+    if config is None:
+        config = load_ai_config()
+
+    # Convert to absolute path to avoid issues with relative paths
+    codeval_path = str(Path(codeval_path).resolve())
+
+    # Extract assignment info
+    description, compile_cmd, language = extract_assignment_from_codeval(codeval_path)
+    source_file = extract_source_filename(compile_cmd)
+
+    info(f"Assignment: {Path(codeval_path).stem}")
+    info(f"Language: {language}")
+    info(f"Source file: {source_file}")
+
+    # Build prompt
+    prompt = build_prompt(description, language, source_file)
+
+    # Create output directory
+    output_path = Path(output_dir)
+    output_path.mkdir(parents=True, exist_ok=True)
+
+    # Save prompt for reference
+    (output_path / "prompt.txt").write_text(prompt)
+
+    results = {}
+
+    for model in models:
+        model_dir = output_path / model.display_name.replace(" ", "_")
+        model_dir.mkdir(exist_ok=True)
+
+        model_results = {
+            "attempts": [],
+            "best_score": 0,
+            "passed": False,
+        }
+
+        for attempt in range(attempts):
+            attempt_dir = model_dir / f"attempt_{attempt + 1}"
+            attempt_dir.mkdir(exist_ok=True)
+
+            info(f"\n{'='*60}")
+            info(f"Model: {model.display_name} (Attempt {attempt + 1}/{attempts})")
+            info(f"{'='*60}")
+
+            # Call the model
+            start_time = time.time()
+            response = call_model(model, prompt, config)
+            elapsed = time.time() - start_time
+
+            if response is None:
+                model_results["attempts"].append({
+                    "success": False,
+                    "error": "API call failed",
+                    "time": elapsed,
+                })
+                continue
+
+            # Save raw response
+            (attempt_dir / "raw_response.txt").write_text(response)
+
+            # Extract code
+            code = extract_code_from_response(response, language)
+            source_path = attempt_dir / source_file
+            source_path.write_text(code)
+
+            info(f"Response received in {elapsed:.2f}s")
+            info(f"Code saved to {source_path}")
+
+            # Copy codeval file to attempt directory, stripping Z tags (not supported locally)
+            import shutil
+            try:
+                with open(codeval_path, "r", encoding="utf-8") as f:
+                    codeval_content = f.read()
+                # Remove Z tag lines (zip file downloads only work on Canvas)
+                codeval_lines = [line for line in codeval_content.split("\n") if not line.startswith("Z ")]
+                (attempt_dir / Path(codeval_path).name).write_text("\n".join(codeval_lines))
+
+                # Copy support files if they exist
+                codeval_dir = Path(codeval_path).parent
+                support_dir = codeval_dir / "support_files"
+                if support_dir.exists():
+                    for support_file in support_dir.iterdir():
+                        shutil.copy(support_file, attempt_dir / support_file.name)
+            except Exception as e:
+                error(f"Failed to copy codeval/support files: {e}")
+                model_results["attempts"].append({
+                    "success": False,
+                    "error": f"File copy failed: {e}",
+                    "time": elapsed,
+                })
+                continue
+
+            # Run evaluation using subprocess
+            info("Running evaluation...")
+            try:
+                result = subprocess.run(
+                    ["assignment-codeval", "run-evaluation", Path(codeval_path).name],
+                    cwd=attempt_dir,
+                    capture_output=True,
+                    text=True,
+                    timeout=120,
+                )
+
+                # Save evaluation output
+                (attempt_dir / "evaluation_output.txt").write_text(
+                    f"=== STDOUT ===\n{result.stdout}\n\n=== STDERR ===\n{result.stderr}"
+                )
+
+                eval_passed = result.returncode == 0
+
+                model_results["attempts"].append({
+                    "success": True,
+                    "passed": eval_passed,
+                    "time": elapsed,
+                })
+
+                if eval_passed:
+                    model_results["passed"] = True
+                    info(f"✓ {model.display_name} PASSED")
+                else:
+                    info(f"✗ {model.display_name} FAILED")
+                    # Show brief failure info
+                    if "FAILED" in result.stdout:
+                        for line in result.stdout.split("\n"):
+                            if "FAILED" in line or "Command ran" in line:
+                                info(f"  {line.strip()}")
+
+            except subprocess.TimeoutExpired:
+                model_results["attempts"].append({
+                    "success": False,
+                    "error": "Evaluation timed out",
+                    "time": elapsed,
+                })
+                error("Evaluation timed out")
+            except Exception as e:
+                model_results["attempts"].append({
+                    "success": False,
+                    "error": str(e),
+                    "time": elapsed,
+                })
+                error(f"Evaluation error: {e}")
+
+        results[model.display_name] = model_results
+
+    # Save results summary
+    (output_path / "results.json").write_text(json.dumps(results, indent=2))
+
+    # Print summary
+    print("\n" + "="*60)
+    print("BENCHMARK RESULTS SUMMARY")
+    print("="*60)
+
+    for model_name, result in results.items():
+        status = "✓ PASS" if result["passed"] else "✗ FAIL"
+        attempts_info = f"{sum(1 for a in result['attempts'] if a.get('passed', False))}/{len(result['attempts'])}"
+        print(f"{model_name:30} {status:10} ({attempts_info} attempts passed)")
+
+    return results
+
+
+@click.command("test-with-ai")
+@click.argument("codeval_file", type=click.Path(exists=True))
+@click.option("--output-dir", "-o", default="ai_test_results",
+              help="Directory to store solutions and results")
+@click.option("--attempts", "-n", default=1, type=int,
+              help="Number of attempts per model")
+@click.option("--models", "-m", multiple=True,
+              help="Specific models to test (can be used multiple times)")
+@click.option("--providers", "-p", multiple=True,
+              type=click.Choice(["anthropic", "openai", "google"]),
+              help="Only test models from these providers")
+@click.option("--anthropic-key", envvar="ANTHROPIC_API_KEY",
+              help="Anthropic API key (or set ANTHROPIC_API_KEY env var)")
+@click.option("--openai-key", envvar="OPENAI_API_KEY",
+              help="OpenAI API key (or set OPENAI_API_KEY env var)")
+@click.option("--google-key", envvar="GOOGLE_API_KEY",
+              help="Google API key (or set GOOGLE_API_KEY env var)")
+def benchmark_ai_command(codeval_file, output_dir, attempts, models, providers,
+                         anthropic_key, openai_key, google_key):
+    """
+    Test AI models on a programming assignment.
+
+    Sends the assignment to multiple AI models, collects their solutions,
+    and evaluates them using the codeval framework.
+
+    API keys can be provided via:
+    - Command line options (--anthropic-key, --openai-key, --google-key)
+    - Environment variables (ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY)
+    - Config file (~/.config/codeval.ini in [AI] section)
+
+    Example:
+        assignment-codeval test-with-ai my_assignment.codeval -n 3 -p anthropic
+    """
+    # Build config from provided keys
+    config = load_ai_config()
+    if anthropic_key:
+        config["anthropic_key"] = anthropic_key
+    if openai_key:
+        config["openai_key"] = openai_key
+    if google_key:
+        config["google_key"] = google_key
+
+    # Filter models if specific ones requested
+    test_models = DEFAULT_MODELS
+
+    if models:
+        test_models = [m for m in DEFAULT_MODELS if m.model_id in models or m.display_name in models]
+
+    if providers:
+        test_models = [m for m in test_models if m.provider in providers]
+
+    if not test_models:
+        error("No models selected for testing")
+        return
+
+    info(f"Testing {len(test_models)} models with {attempts} attempt(s) each")
+
+    run_benchmark(codeval_file, output_dir, test_models, attempts, config)
+
+
+def get_benchmark_command():
+    """Return the Click command for CLI registration."""
+    return benchmark_ai_command
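Since `run_benchmark` is a plain function, the benchmark can also be driven without the CLI. A minimal sketch, assuming the package and the `anthropic` SDK are installed, an Anthropic key is reachable by `load_ai_config()`, and `hw1.codeval` is a hypothetical spec file:

```python
# Sketch only: drives run_benchmark directly instead of `test-with-ai`.
# "hw1.codeval" is a hypothetical spec file, not part of the package.
from assignment_codeval.ai_benchmark import (
    DEFAULT_MODELS,
    load_ai_config,
    run_benchmark,
)

# Keep only the Anthropic entries from the default model list.
claude_models = [m for m in DEFAULT_MODELS if m.provider == "anthropic"]

# load_ai_config() reads ~/.config/codeval.ini and falls back to the
# ANTHROPIC_API_KEY / OPENAI_API_KEY / GOOGLE_API_KEY environment variables.
results = run_benchmark(
    "hw1.codeval",
    output_dir="ai_test_results",
    models=claude_models,
    attempts=2,
    config=load_ai_config(),
)

# results mirrors results.json: display name -> attempt records and pass/fail.
for name, summary in results.items():
    print(name, "PASS" if summary["passed"] else "FAIL")
```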
{assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval/cli.py
RENAMED

@@ -1,5 +1,6 @@
 import click
 
+from assignment_codeval.ai_benchmark import get_benchmark_command
 from assignment_codeval.create_assignment import create_assignment
 from assignment_codeval.evaluate import run_evaluation
 from assignment_codeval.github_connect import github_setup_repo
@@ -16,6 +17,7 @@ cli.add_command(upload_submission_comments)
 cli.add_command(github_setup_repo)
 cli.add_command(evaluate_submissions)
 cli.add_command(create_assignment)
+cli.add_command(get_benchmark_command())
 
 if __name__ == "__main__":
     cli()
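The registration follows the standard Click pattern: `get_benchmark_command()` returns the decorated command object, and `cli.add_command` mounts it under the name given in `@click.command("test-with-ai")`. A stripped-down sketch of that pattern (a hypothetical standalone group, not the project's actual CLI):

```python
# Hypothetical, self-contained illustration of the registration pattern above.
import click

@click.group()
def cli():
    """Top-level command group."""

@click.command("greet")
def greet_command():
    """Placeholder command body."""
    click.echo("hello")

cli.add_command(greet_command)  # exposed as: <program> greet

if __name__ == "__main__":
    cli()
```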
{assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval/evaluate.py
RENAMED

@@ -366,7 +366,7 @@ def test_case_hidden(test_case_command):
 
 
 def supply_input(inputs):
-    """Specifies the input for a test case.
+    """Specifies the input for a test case (adds newline at end).
 
     Arguments:
         *inputs: inputs to be used for test case
@@ -374,8 +374,21 @@ def supply_input(inputs):
     Returns:
         None
     """
-    with open("fileinput", "
-        outfile.write(inputs)
+    with open("fileinput", "ab") as outfile:
+        outfile.write((inputs + "\n").encode("utf-8"))
+
+
+def supply_input_bare(inputs):
+    """Specifies the input for a test case without adding a newline.
+
+    Arguments:
+        *inputs: inputs to be used for test case
+
+    Returns:
+        None
+    """
+    with open("fileinput", "ab") as outfile:
+        outfile.write(inputs.encode("utf-8"))
 
 
 def supply_input_file(input_file):
@@ -387,15 +400,15 @@ def supply_input_file(input_file):
     Returns:
         None
     """
-    with open(input_file, "
-
+    with open(input_file, "rb") as infile:
+        input_data = infile.read()
 
-    with open("fileinput", "
-        outfile.
+    with open("fileinput", "ab") as outfile:
+        outfile.write(input_data)
 
 
 def check_output(outputs):
-    """Specifies the expected output for a test case.
+    """Specifies the expected output for a test case (adds newline at end).
 
     Arguments:
         *outputs: outputs to be used for test case
@@ -408,6 +421,20 @@ def check_output(outputs):
         outfile.write(outputs + "\n")
 
 
+def check_output_bare(outputs):
+    """Specifies the expected output for a test case without adding a newline.
+
+    Arguments:
+        *outputs: outputs to be used for test case
+
+    Returns:
+        None
+    """
+
+    with open("expectedoutput", "a") as outfile:
+        outfile.write(outputs)
+
+
 def check_output_file(output_file):
     """Specifies the expected output for a test case read from a file.
 
@@ -425,7 +452,7 @@ def check_output_file(output_file):
 
 
 def check_error(error_output):
-    """Specifies the expected error output for a test case.
+    """Specifies the expected error output for a test case (adds newline at end).
 
     Arguments:
         error_output: expected error output for a test case
@@ -437,6 +464,19 @@ def check_error(error_output):
         outfile.write(error_output + "\n")
 
 
+def check_error_bare(error_output):
+    """Specifies the expected error output for a test case without adding a newline.
+
+    Arguments:
+        error_output: expected error output for a test case
+
+    Returns:
+        None
+    """
+    with open("expectederror", "a") as outfile:
+        outfile.write(error_output)
+
+
 def hint(hints):
     """Hint
 
@@ -536,10 +576,13 @@ tag_func_map = {
     "T": test_case,
     "HT": test_case_hidden,
     "I": supply_input,
+    "IB": supply_input_bare,
     "IF": supply_input_file,
     "O": check_output,
+    "OB": check_output_bare,
     "OF": check_output_file,
     "E": check_error,
+    "EB": check_error_bare,
     "HINT": hint,
     "TO": timeout,
     "X": exit_code,
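`tag_func_map` is the dispatch table that connects spec-file tags to these handlers, so adding `IB`/`OB`/`EB` here is what makes the new tags usable. A simplified view of the lookup (the real `parse_tags` validates tags with regexes and reports errors first):

```python
# Simplified dispatch: how a spec line reaches its handler via tag_func_map.
from assignment_codeval.evaluate import tag_func_map

line = "IB 5"            # hypothetical spec line using the new bare-input tag
tag, args = line.split(" ", 1)
tag_func_map[tag](args)  # -> supply_input_bare("5"); no newline is appended
```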
@@ -584,6 +627,8 @@ def parse_tags(tags: list[str]):
     tag_only_pattern = r"([A-Z_]+)\s*$"
 
     valid_tags = set(tag_func_map.keys())
+    # Tags to silently ignore (used by other tools but not by run-evaluation)
+    ignored_tags = {"CTO", "Z", "RUN"}
 
     # Track if we're inside a CRT_HW block (content to ignore)
     in_crt_hw_block = False
@@ -609,6 +654,9 @@ def parse_tags(tags: list[str]):
         # Check for tag without arguments
         if tag_only_match and not tag_match:
             tag = tag_only_match.group(1)
+            # Skip ignored tags
+            if tag in ignored_tags:
+                continue
             if tag in valid_tags:
                 print(f"Error on line {line_num}: Tag '{tag}' requires arguments")
                 print(f"  {line_num}: {tag_line.rstrip()}")
@@ -626,6 +674,9 @@ def parse_tags(tags: list[str]):
             potential_tag = re.match(r"([A-Z]+)", tag_line)
             if potential_tag:
                 tag = potential_tag.group(1)
+                # Skip ignored tags
+                if tag in ignored_tags:
+                    continue
                 if tag not in valid_tags and len(tag) <= 4:
                     print(f"Error on line {line_num}: Unknown tag '{tag}'")
                     print(f"  {line_num}: {tag_line.rstrip()}")
@@ -636,6 +687,10 @@ def parse_tags(tags: list[str]):
         tag = tag_match.group(1)
         args = tag_match.group(2)
 
+        # Skip ignored tags
+        if tag in ignored_tags:
+            continue
+
         # Check for unknown tag
         if tag not in valid_tags:
             print(f"Error on line {line_num}: Unknown tag '{tag}'")
@@ -715,7 +770,7 @@ def check_test():
     print(f"Test case {test_case_count} of {test_case_total}")
     passed = True
 
-    with open("fileinput", "
+    with open("fileinput", "rb") as fileinput, open(
         "youroutput", "w"
     ) as youroutput, open("yourerror", "w") as yourerror:
         test_exec = subprocess.Popen(
{assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval.egg-info/PKG-INFO
RENAMED

@@ -1,20 +1,19 @@
 Metadata-Version: 2.4
 Name: assignment-codeval
-Version: 0.0.8
+Version: 0.0.10
 Summary: CodEval for evaluating programming assignments
 Requires-Python: >=3.12
 Description-Content-Type: text/markdown
 Requires-Dist: canvasapi==3.3.0
-Requires-Dist: certifi==2021.10.8
-Requires-Dist: charset-normalizer==2.0.9
 Requires-Dist: click==8.2.1
 Requires-Dist: configparser==5.2.0
-Requires-Dist: idna==3.3
 Requires-Dist: pytz==2021.3
-Requires-Dist: requests
-Requires-Dist: urllib3==1.26.7
+Requires-Dist: requests>=2.28.0
 Requires-Dist: pymongo==4.3.3
 Requires-Dist: markdown==3.4.1
+Requires-Dist: anthropic>=0.39.0
+Requires-Dist: openai>=1.0.0
+Requires-Dist: google-generativeai>=0.8.0
 Provides-Extra: test
 Requires-Dist: pytest>=7.0; extra == "test"
 
@@ -72,9 +71,9 @@ Tags used in a spec file (\<course name>.codeval)
 | CMD/TCMD | Run Command | Will be followed by a command to run. The TCMD will cause the evaluation to fail if the command exits with an error. |
 | CMP | Compare | Will be followed by two files to compare. |
 | T/HT | Test Case | Will be followed by the command to run to test the submission. |
-| I/IF | Supply Input | Specifies the input for a test case. |
-| O/OF | Check Output | Specifies the expected output for a test case. |
-| E | Check Error | Specifies the expected error output for a test case. |
+| I/IB/IF | Supply Input | Specifies the input for a test case. I adds a newline, IB does not add a newline, IF reads from a file. |
+| O/OB/OF | Check Output | Specifies the expected output for a test case. O adds a newline, OB does not add a newline, OF reads from a file. |
+| E/EB | Check Error | Specifies the expected error output for a test case. E adds a newline, EB does not. |
 | TO | Timeout | Specifies the time limit in seconds for a test case to run. Defaults to 20 seconds. |
 | X | Exit Code | Specifies the expected exit code for a test case. Defaults to zero. |
 | SS | Start Server | Command containing timeout (wait until server starts), kill timeout (wait to kill the server), and the command to start a server |
@@ -247,3 +246,97 @@ Refer to a sample spec file [here](SQL/samples/ASSIGNMENT:CREATE.codeval)
 C cc -o bigbag --std=gnu11 bigbag.c
 
 
+## 4. Test Assignments with AI Models
+
+Test programming assignments against multiple AI models (Claude, GPT, Gemini) to benchmark their performance.
+
+### Installation
+
+Install the AI provider packages you want to use:
+
+```bash
+# Install all AI providers
+pip install assignment-codeval[ai]
+
+# Or install specific providers
+pip install anthropic            # For Claude models
+pip install openai               # For GPT models
+pip install google-generativeai  # For Gemini models
+```
+
+### codeval.ini contents (optional)
+```
+[AI]
+anthropic_key=sk-ant-...
+openai_key=sk-...
+google_key=...
+```
+
+API keys can also be provided via:
+- Environment variables: `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `GOOGLE_API_KEY`
+- Command line options: `--anthropic-key`, `--openai-key`, `--google-key`
+
+### Command to run
+```bash
+assignment-codeval test-with-ai <codeval_file> [OPTIONS]
+```
+
+### Options
+| Option | Description |
+|--------|-------------|
+| `-o, --output-dir` | Directory to store solutions and results (default: `ai_test_results`) |
+| `-n, --attempts` | Number of attempts per model (default: 1) |
+| `-m, --models` | Specific models to test (can be used multiple times) |
+| `-p, --providers` | Only test models from specific providers: `anthropic`, `openai`, `google` |
+| `--anthropic-key` | Anthropic API key |
+| `--openai-key` | OpenAI API key |
+| `--google-key` | Google API key |
+
+### Examples
+```bash
+# Test with all Anthropic models
+assignment-codeval test-with-ai my_assignment.codeval -p anthropic
+
+# Test with specific model, 3 attempts each
+assignment-codeval test-with-ai my_assignment.codeval -m "Claude Sonnet 4" -n 3
+
+# Test with all providers (requires all API keys)
+assignment-codeval test-with-ai my_assignment.codeval -n 2
+
+# Pass API key directly
+assignment-codeval test-with-ai my_assignment.codeval --anthropic-key sk-ant-xxx -p anthropic
+```
+
+### Supported Models
+
+| Provider | Models |
+|----------|--------|
+| Anthropic | Claude Sonnet 4, Claude Opus 4 |
+| OpenAI | GPT-4o, GPT-4o Mini, o1, o3-mini |
+| Google | Gemini 2.0 Flash, Gemini 1.5 Pro |
+
+Note: You can add additional models using `-m "model-id"`. Check each provider's documentation for available model IDs.
+
+### Output Structure
+```
+ai_test_results/
+├── prompt.txt                 # The prompt sent to AI models
+├── results.json               # Summary of all results
+├── Claude_Sonnet_4/
+│   └── attempt_1/
+│       ├── raw_response.txt   # Raw AI response
+│       ├── solution.c         # Extracted code
+│       └── <codeval files>    # Copied for evaluation
+├── GPT-4o/
+│   └── attempt_1/
+│       └── ...
+└── ...
+```
+
+### Notes
+- The command extracts the assignment description from the codeval file (between `CRT_HW START` and `CRT_HW END` tags)
+- Support files from `support_files/` directory are automatically copied for evaluation
+- Results include pass/fail status, response time, and any errors
+- Use multiple attempts (`-n`) to account for AI response variability
+
+
{assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval.egg-info/requires.txt
RENAMED

@@ -1,14 +1,13 @@
 canvasapi==3.3.0
-certifi==2021.10.8
-charset-normalizer==2.0.9
 click==8.2.1
 configparser==5.2.0
-idna==3.3
 pytz==2021.3
-requests
-urllib3==1.26.7
+requests>=2.28.0
 pymongo==4.3.3
 markdown==3.4.1
+anthropic>=0.39.0
+openai>=1.0.0
+google-generativeai>=0.8.0
 
 [test]
 pytest>=7.0
{assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/setup.cfg
RENAMED
File without changes

{assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval/__init__.py
RENAMED
File without changes

{assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval/canvas_utils.py
RENAMED
File without changes

{assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval/commons.py
RENAMED
File without changes

{assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval/convertMD2Html.py
RENAMED
File without changes

{assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval/create_assignment.py
RENAMED
File without changes

{assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval/file_utils.py
RENAMED
File without changes

{assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval/github_connect.py
RENAMED
File without changes

{assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval/submissions.py
RENAMED
File without changes

{assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval.egg-info/dependency_links.txt
RENAMED
File without changes

{assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval.egg-info/entry_points.txt
RENAMED
File without changes

{assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/src/assignment_codeval.egg-info/top_level.txt
RENAMED
File without changes

{assignment_codeval-0.0.8 → assignment_codeval-0.0.10}/tests/test_codeval.py
RENAMED
File without changes