cr-proc 0.1.10.tar.gz → 0.1.12.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: cr_proc
- Version: 0.1.10
+ Version: 0.1.12
  Summary: A tool for processing BYU CS code recording files.
  Author: Ethan Dye
  Author-email: mrtops03@gmail.com
@@ -28,7 +28,8 @@ poetry install

  ## Usage

- The processor can be run using the `cr_proc` command with recording file(s) and a template:
+ The processor can be run using the `cr_proc` command with recording file(s) and
+ a template:

  ```bash
  poetry run cr_proc <path-to-jsonl-file> <path-to-template-file>
@@ -36,7 +37,8 @@ poetry run cr_proc <path-to-jsonl-file> <path-to-template-file>

  ### Batch Processing

- You can process multiple recording files at once (e.g., for different students' submissions):
+ You can process multiple recording files at once (e.g., for different students'
+ submissions):

  ```bash
  # Process multiple files
@@ -47,9 +49,11 @@ poetry run cr_proc recordings/*.jsonl.gz template.py
  ```

  When processing multiple files:
+
  - Each recording is processed independently (for different students/documents)
  - Time calculations and verification are done separately for each file
- - A combined time report is shown at the end summarizing total editing time across all recordings
+ - A combined time report is shown at the end summarizing total editing time
+ across all recordings
  - Results can be output to individual files using `--output-dir`

  ### Arguments
@@ -61,24 +65,34 @@ When processing multiple files:

  ### Options

- - `-t, --time-limit MINUTES`: (Optional) Maximum allowed time in minutes between the
- first and last edit in the recording. Applied individually to each recording file and
- also to the combined total in batch mode. If the elapsed time exceeds this limit, the
- recording is flagged as suspicious.
- - `-d, --document DOCUMENT`: (Optional) Document path or filename to process from the
- recording. Defaults to the document whose extension matches the template file.
- - `-o, --output-json OUTPUT_JSON`: (Optional) Path to output JSON file with verification
- results (time info and suspicious events). In batch mode, creates a single JSON file
- containing all recordings plus the combined time report.
- - `-f, --output-file OUTPUT_FILE`: (Optional) Write reconstructed code to specified file
- instead of stdout. For single files only.
- - `--output-dir OUTPUT_DIR`: (Optional) Directory to write reconstructed code files in
- batch mode. Files are named based on input recording filenames.
- - `-s, --show-autocomplete-details`: (Optional) Show individual auto-complete events in
- addition to aggregate statistics.
- - `-p, --playback`: (Optional) Play back the recording in real-time, showing code evolution.
- - `--playback-speed SPEED`: (Optional) Playback speed multiplier (1.0 = real-time, 2.0 = 2x
- speed, 0.5 = half speed).
+ - `-t, --time-limit MINUTES`: (Optional) Maximum allowed time in minutes between
+ the first and last edit in the recording. Applied individually to each
+ recording file and also to the combined total in batch mode. If the elapsed
+ time exceeds this limit, the recording is flagged as suspicious.
+ - `-d, --document DOCUMENT`: (Optional) Document path or filename to process
+ from the recording. Defaults to the document whose extension matches the
+ template file.
+ - `-o, --output-json OUTPUT_JSON`: (Optional) Path to output JSON file with
+ verification results (time info and suspicious events). In batch mode, creates
+ a single JSON file containing all recordings plus the combined time report.
+ - `-f, --output-file OUTPUT_FILE`: (Optional) Write reconstructed code to
+ specified file instead of stdout. For single files only.
+ - `--output-dir OUTPUT_DIR`: (Optional) Directory to write reconstructed code
+ files in batch mode. Files are named based on input recording filenames.
+ - `--submitted-file SUBMITTED_FILE`: (Optional) Path to the submitted final file
+ to verify against the reconstructed output. If provided, the reconstructed code
+ will be compared to this file and differences will be reported.
+ - `--submitted-dir SUBMITTED_DIR`: (Optional) Directory containing submitted files
+ to verify against the reconstructed output. For each recording file, the
+ corresponding submitted file will be found by matching the filename
+ (e.g., `homework0-ISC.recording.jsonl.gz` will match `homework0-ISC.py`).
+ Cannot be used with `--submitted-file`.
+ - `-s, --show-autocomplete-details`: (Optional) Show individual auto-complete
+ events in addition to aggregate statistics.
+ - `-p, --playback`: (Optional) Play back the recording in real-time, showing
+ code evolution.
+ - `--playback-speed SPEED`: (Optional) Playback speed multiplier (1.0 =
+ real-time, 2.0 = 2x speed, 0.5 = half speed).

  ### Examples

@@ -106,7 +120,20 @@ Save JSON results:
  poetry run cr_proc student1.jsonl.gz student2.jsonl.gz template.py -o results/
  ```

- This will process each recording independently and flag any that exceed 30 minutes.
+ Verify against a single submitted file:
+
+ ```bash
+ poetry run cr_proc homework0.recording.jsonl.gz homework0.py --submitted-file submitted_homework0.py
+ ```
+
+ Verify against submitted files in a directory (batch mode):
+
+ ```bash
+ poetry run cr_proc recordings/*.jsonl.gz template.py --submitted-dir submissions/
+ ```
+
+ This will process each recording independently and flag any that exceed 30
+ minutes.

  The processor will:

@@ -118,8 +145,9 @@ The processor will:

  ### Output

- Reconstructed code files are written to disk using `-f/--output-file` (single file)
- or `--output-dir` (batch mode). The processor does not output reconstructed code to stdout.
+ Reconstructed code files are written to disk using `-f/--output-file` (single
+ file) or `--output-dir` (batch mode). The processor does not output
+ reconstructed code to stdout.

  Verification information, warnings, and errors are printed to stderr, including:

@@ -133,8 +161,8 @@ Verification information, warnings, and errors are printed to stderr, including:

  ### Suspicious Activity Detection

- The processor automatically detects and reports three types of suspicious activity
- patterns:
+ The processor automatically detects and reports three types of suspicious
+ activity patterns:

  #### 1. Time Limit Exceeded

@@ -142,8 +170,8 @@ When the `--time-limit` flag is specified, the processor flags recordings where
  the elapsed time between the first and last edit exceeds the specified limit.
  This can indicate unusually long work sessions or potential external assistance.

- Each recording file is checked independently against the time limit. In batch mode,
- the combined total time is also checked against the limit.
+ Each recording file is checked independently against the time limit. In batch
+ mode, the combined total time is also checked against the limit.

  **Example warning (single file):**

@@ -199,12 +227,14 @@ Events #42-#44 (rapid one-line pastes (AI indicator)): 3 lines, 89 chars

  ### JSON Output Format

- The `--output-json` flag generates JSON files with verification results using a consistent format
- for both single file and batch modes, making it easier for tooling to consume.
+ The `--output-json` flag generates JSON files with verification results using a
+ consistent format for both single file and batch modes, making it easier for
+ tooling to consume.

  #### JSON Structure

  All JSON output follows this unified format:
+
  - `batch_mode`: Boolean indicating if multiple files were processed
  - `total_files`: Number of files processed
  - `verified_count`: How many files passed verification
@@ -219,6 +249,7 @@ All JSON output follows this unified format:
  - `files`: Array of individual results for each recording

  **Single file example:**
+
  ```json
  {
  "batch_mode": false,
@@ -244,6 +275,7 @@ All JSON output follows this unified format:
  ```

  **Batch file example:**
+
  ```json
  {
  "batch_mode": true,
@@ -16,7 +16,8 @@ poetry install

  ## Usage

- The processor can be run using the `cr_proc` command with recording file(s) and a template:
+ The processor can be run using the `cr_proc` command with recording file(s) and
+ a template:

  ```bash
  poetry run cr_proc <path-to-jsonl-file> <path-to-template-file>
@@ -24,7 +25,8 @@ poetry run cr_proc <path-to-jsonl-file> <path-to-template-file>

  ### Batch Processing

- You can process multiple recording files at once (e.g., for different students' submissions):
+ You can process multiple recording files at once (e.g., for different students'
+ submissions):

  ```bash
  # Process multiple files
@@ -35,9 +37,11 @@ poetry run cr_proc recordings/*.jsonl.gz template.py
  ```

  When processing multiple files:
+
  - Each recording is processed independently (for different students/documents)
  - Time calculations and verification are done separately for each file
- - A combined time report is shown at the end summarizing total editing time across all recordings
+ - A combined time report is shown at the end summarizing total editing time
+ across all recordings
  - Results can be output to individual files using `--output-dir`

  ### Arguments
@@ -49,24 +53,34 @@ When processing multiple files:

  ### Options

- - `-t, --time-limit MINUTES`: (Optional) Maximum allowed time in minutes between the
- first and last edit in the recording. Applied individually to each recording file and
- also to the combined total in batch mode. If the elapsed time exceeds this limit, the
- recording is flagged as suspicious.
- - `-d, --document DOCUMENT`: (Optional) Document path or filename to process from the
- recording. Defaults to the document whose extension matches the template file.
- - `-o, --output-json OUTPUT_JSON`: (Optional) Path to output JSON file with verification
- results (time info and suspicious events). In batch mode, creates a single JSON file
- containing all recordings plus the combined time report.
- - `-f, --output-file OUTPUT_FILE`: (Optional) Write reconstructed code to specified file
- instead of stdout. For single files only.
- - `--output-dir OUTPUT_DIR`: (Optional) Directory to write reconstructed code files in
- batch mode. Files are named based on input recording filenames.
- - `-s, --show-autocomplete-details`: (Optional) Show individual auto-complete events in
- addition to aggregate statistics.
- - `-p, --playback`: (Optional) Play back the recording in real-time, showing code evolution.
- - `--playback-speed SPEED`: (Optional) Playback speed multiplier (1.0 = real-time, 2.0 = 2x
- speed, 0.5 = half speed).
+ - `-t, --time-limit MINUTES`: (Optional) Maximum allowed time in minutes between
+ the first and last edit in the recording. Applied individually to each
+ recording file and also to the combined total in batch mode. If the elapsed
+ time exceeds this limit, the recording is flagged as suspicious.
+ - `-d, --document DOCUMENT`: (Optional) Document path or filename to process
+ from the recording. Defaults to the document whose extension matches the
+ template file.
+ - `-o, --output-json OUTPUT_JSON`: (Optional) Path to output JSON file with
+ verification results (time info and suspicious events). In batch mode, creates
+ a single JSON file containing all recordings plus the combined time report.
+ - `-f, --output-file OUTPUT_FILE`: (Optional) Write reconstructed code to
+ specified file instead of stdout. For single files only.
+ - `--output-dir OUTPUT_DIR`: (Optional) Directory to write reconstructed code
+ files in batch mode. Files are named based on input recording filenames.
+ - `--submitted-file SUBMITTED_FILE`: (Optional) Path to the submitted final file
+ to verify against the reconstructed output. If provided, the reconstructed code
+ will be compared to this file and differences will be reported.
+ - `--submitted-dir SUBMITTED_DIR`: (Optional) Directory containing submitted files
+ to verify against the reconstructed output. For each recording file, the
+ corresponding submitted file will be found by matching the filename
+ (e.g., `homework0-ISC.recording.jsonl.gz` will match `homework0-ISC.py`).
+ Cannot be used with `--submitted-file`.
+ - `-s, --show-autocomplete-details`: (Optional) Show individual auto-complete
+ events in addition to aggregate statistics.
+ - `-p, --playback`: (Optional) Play back the recording in real-time, showing
+ code evolution.
+ - `--playback-speed SPEED`: (Optional) Playback speed multiplier (1.0 =
+ real-time, 2.0 = 2x speed, 0.5 = half speed).

  ### Examples

@@ -94,7 +108,20 @@ Save JSON results:
  poetry run cr_proc student1.jsonl.gz student2.jsonl.gz template.py -o results/
  ```

- This will process each recording independently and flag any that exceed 30 minutes.
+ Verify against a single submitted file:
+
+ ```bash
+ poetry run cr_proc homework0.recording.jsonl.gz homework0.py --submitted-file submitted_homework0.py
+ ```
+
+ Verify against submitted files in a directory (batch mode):
+
+ ```bash
+ poetry run cr_proc recordings/*.jsonl.gz template.py --submitted-dir submissions/
+ ```
+
+ This will process each recording independently and flag any that exceed 30
+ minutes.

  The processor will:

@@ -106,8 +133,9 @@ The processor will:

  ### Output

- Reconstructed code files are written to disk using `-f/--output-file` (single file)
- or `--output-dir` (batch mode). The processor does not output reconstructed code to stdout.
+ Reconstructed code files are written to disk using `-f/--output-file` (single
+ file) or `--output-dir` (batch mode). The processor does not output
+ reconstructed code to stdout.

  Verification information, warnings, and errors are printed to stderr, including:

@@ -121,8 +149,8 @@ Verification information, warnings, and errors are printed to stderr, including:

  ### Suspicious Activity Detection

- The processor automatically detects and reports three types of suspicious activity
- patterns:
+ The processor automatically detects and reports three types of suspicious
+ activity patterns:

  #### 1. Time Limit Exceeded

@@ -130,8 +158,8 @@ When the `--time-limit` flag is specified, the processor flags recordings where
  the elapsed time between the first and last edit exceeds the specified limit.
  This can indicate unusually long work sessions or potential external assistance.

- Each recording file is checked independently against the time limit. In batch mode,
- the combined total time is also checked against the limit.
+ Each recording file is checked independently against the time limit. In batch
+ mode, the combined total time is also checked against the limit.

  **Example warning (single file):**

@@ -187,12 +215,14 @@ Events #42-#44 (rapid one-line pastes (AI indicator)): 3 lines, 89 chars

  ### JSON Output Format

- The `--output-json` flag generates JSON files with verification results using a consistent format
- for both single file and batch modes, making it easier for tooling to consume.
+ The `--output-json` flag generates JSON files with verification results using a
+ consistent format for both single file and batch modes, making it easier for
+ tooling to consume.

  #### JSON Structure

  All JSON output follows this unified format:
+
  - `batch_mode`: Boolean indicating if multiple files were processed
  - `total_files`: Number of files processed
  - `verified_count`: How many files passed verification
@@ -207,6 +237,7 @@ All JSON output follows this unified format:
  - `files`: Array of individual results for each recording

  **Single file example:**
+
  ```json
  {
  "batch_mode": false,
@@ -232,6 +263,7 @@ All JSON output follows this unified format:
  ```

  **Batch file example:**
+
  ```json
  {
  "batch_mode": true,
@@ -1,6 +1,6 @@
  [project]
  name = "cr_proc"
- version = "0.1.10"
+ version = "0.1.12"
  description = "A tool for processing BYU CS code recording files."
  authors = [
      {name = "Ethan Dye",email = "mrtops03@gmail.com"}
@@ -169,6 +169,9 @@ def reconstruct_file_from_events(
      from .load import is_edit_event
      events = tuple(e for e in events if is_edit_event(e))

+     # Skip no-op events (oldFragment == newFragment, typically file-open markers)
+     events = tuple(e for e in events if not (e.get("oldFragment") == e.get("newFragment") and e.get("offset") == 0))
+
      # Read template content
      if normalize_newlines:
          template = _normalize_newlines(template)
@@ -197,6 +200,39 @@ def reconstruct_file_from_events(
          # No events for target_doc; return template unchanged
          return template

+     # Handle case where first event is a file-open/load event at offset 0
+     # (IDE captures the file content as seen when opened)
+     if evs and evs[0].get("offset") == 0:
+         first_old = evs[0].get("oldFragment", "")
+         first_new = evs[0].get("newFragment", "")
+
+         if first_old and not template.startswith(first_old):
+             # Check if this looks like a file-open event:
+             # - First event is at offset 0
+             # - oldFragment and newFragment contain significant content (file was loaded)
+             # - Template is much smaller (stub/placeholder)
+             is_likely_file_open = (
+                 first_old == first_new and  # no-op replacement (just file load)
+                 len(first_old) > 50 and  # substantial content
+                 len(template) < len(first_old)  # template is smaller stub
+             )
+
+             if is_likely_file_open:
+                 # Use first event's oldFragment as the template (actual file state when opened)
+                 template = first_old
+             else:
+                 # Template genuinely doesn't match
+                 raise ValueError(
+                     f"Template content does not match recording's initial state.\n"
+                     f"First event expects to replace {len(first_old)} chars starting at offset 0,\n"
+                     f"but template only has {len(template)} chars and starts with:\n"
+                     f"{template[:min(100, len(template))]!r}\n\n"
+                     f"Expected to start with:\n"
+                     f"{first_old[:min(100, len(first_old))]!r}\n\n"
+                     f"Recording was likely made on a different version of the file.\n"
+                     f"Document path in recording: {target_doc}"
+                 )
+
      if utf16_mode:
          # Work in UTF-16-LE byte space
          doc_bytes = template.encode("utf-16-le")
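
The two hunks above change how `reconstruct_file_from_events` begins: no-op events at offset 0 are dropped, and when the supplied template does not match, a plausible file-open event's `oldFragment` is adopted as the starting state. Below is a minimal sketch of the kind of event records this targets; only the fields named in the diff (`offset`, `oldFragment`, `newFragment`) are shown, and the values are illustrative assumptions rather than real recording data.

```python
# Hypothetical edit events, for illustration only (real records carry more fields).

# A "file-open" marker: offset 0 and oldFragment == newFragment.
# The first hunk skips events like this; the second may adopt its oldFragment
# as the template when the supplied template is only a small stub.
file_open_event = {
    "offset": 0,
    "oldFragment": "def main():\n    pass\n",
    "newFragment": "def main():\n    pass\n",
}

# A genuine edit: inserts text at offset 24 without replacing anything.
typing_event = {
    "offset": 24,
    "oldFragment": "",
    "newFragment": "print('hello')\n",
}
```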
@@ -5,9 +5,38 @@ from pathlib import Path, PureWindowsPath, PurePosixPath
  from typing import Any


+ def normalize_path_string(path_str: str) -> str:
+     """
+     Normalize a path string to use forward slashes (POSIX style).
+
+     Handles both Windows-style (backslash) and Unix-style (forward slash) paths
+     regardless of the current platform. Useful for cross-platform consistency
+     when files are created on Windows but processed on other systems.
+
+     Parameters
+     ----------
+     path_str : str
+         Path string (may use Windows or Unix separators)
+
+     Returns
+     -------
+     str
+         Normalized path string using forward slashes
+     """
+     # Try to detect if this is a Windows path (contains backslashes)
+     if "\\" in path_str:
+         # Windows-style path
+         path_obj = PureWindowsPath(path_str)
+     else:
+         # Unix-style path (or just a filename)
+         path_obj = PurePosixPath(path_str)
+
+     return path_obj.as_posix()
+
+
  def _normalize_document_path(doc_path: str) -> tuple[str, str]:
      """
-     Normalize a document path to extract filename and stem.
+     Extract filename and stem from a document path.

      Handles both Windows-style (backslash) and Unix-style (forward slash) paths
      regardless of the current platform.
@@ -22,14 +51,9 @@ def _normalize_document_path(doc_path: str) -> tuple[str, str]:
      tuple[str, str]
          (filename, stem) extracted from the path
      """
-     # Try to detect if this is a Windows path (contains backslashes)
-     if "\\" in doc_path:
-         # Windows-style path
-         path_obj = PureWindowsPath(doc_path)
-     else:
-         # Unix-style path (or just a filename)
-         path_obj = PurePosixPath(doc_path)
-
+     # Normalize to forward slashes first, then parse
+     normalized = normalize_path_string(doc_path)
+     path_obj = PurePosixPath(normalized)
      return path_obj.name, path_obj.stem


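
As a quick reference, here is a condensed sketch of the new helper's behaviour; the function body is copied from the hunk above, while the example paths are assumptions for illustration.

```python
from pathlib import PurePosixPath, PureWindowsPath


def normalize_path_string(path_str: str) -> str:
    """Condensed copy of the helper added above: normalize to forward slashes."""
    if "\\" in path_str:
        return PureWindowsPath(path_str).as_posix()
    return PurePosixPath(path_str).as_posix()


# Expected behaviour, assuming a Windows-recorded path and a POSIX path:
assert normalize_path_string(r"C:\Users\student\homework0-ISC.py") == "C:/Users/student/homework0-ISC.py"
assert normalize_path_string("submissions/homework0-ISC.py") == "submissions/homework0-ISC.py"
```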
@@ -4,6 +4,8 @@ import sys
  from pathlib import Path
  from typing import Any

+ from .document import normalize_path_string
+

  def write_batch_json_output(
      output_path: Path,
@@ -36,15 +38,21 @@ def write_batch_json_output(
      # Convert results to JSON-serializable format
      files_data = []
      for r in results:
-         files_data.append({
-             "jsonl_file": str(r["jsonl_file"]),
+         file_result = {
+             "jsonl_file": normalize_path_string(str(r["jsonl_file"])),
              "document": r["target_document"],
              "verified": r["verified"],
              "time_info": r["time_info"],
              "suspicious_events": r["suspicious_events"],
              "template_diff": r.get("template_diff", ""),
              "reconstructed_code": r["reconstructed"],
-         })
+         }
+
+         # Add submitted_comparison if present
+         if r.get("submitted_comparison") is not None:
+             file_result["submitted_comparison"] = r["submitted_comparison"]
+
+         files_data.append(file_result)

      # Use consistent format for both single and batch modes
      output_data = {
@@ -1,6 +1,7 @@
  from typing import Any
  from datetime import datetime
  import difflib
+ from .document import normalize_path_string

  # ============================================================================
  # Constants for detection thresholds
@@ -837,15 +838,19 @@ def verify(template: str, jsonData: tuple[dict[str, Any], ...]) -> tuple[str, li


  def combine_time_info(
-     time_infos: list[dict[str, Any] | None], time_limit_minutes: int | None
+     all_events: list[tuple[dict[str, Any], ...]], time_limit_minutes: int | None
  ) -> dict[str, Any] | None:
      """
-     Combine time information from multiple recording files.
+     Combine time information from multiple recording files, avoiding double-counting overlapping time.
+
+     Merges all events from multiple recordings, then calculates the actual time spent editing
+     using the same logic as check_time_limit (gap analysis with focus awareness). This ensures
+     overlapping editing sessions are not double-counted.

      Parameters
      ----------
-     time_infos : list[dict[str, Any] | None]
-         List of time information dictionaries from multiple files
+     all_events : list[tuple[dict[str, Any], ...]]
+         List of event tuples from multiple recording files
      time_limit_minutes : int | None
          Time limit to check against
@@ -854,40 +859,94 @@ def combine_time_info(
      dict[str, Any] | None
          Combined time information, or None if no valid data
      """
-     valid_infos = [info for info in time_infos if info is not None]
-     if not valid_infos:
+     # Filter out empty event sets
+     valid_event_sets = [events for events in all_events if events]
+     if not valid_event_sets:
          return None

-     # Sum elapsed times across all sessions
-     total_elapsed = sum(info["minutes_elapsed"] for info in valid_infos)
+     # Merge all events from all recordings into a single tuple
+     merged_events = tuple(
+         event
+         for event_set in valid_event_sets
+         for event in event_set
+     )

-     # Find overall first and last timestamps
-     all_timestamps = []
-     for info in valid_infos:
-         all_timestamps.append(
-             datetime.fromisoformat(info["first_timestamp"].replace("Z", "+00:00"))
-         )
-         all_timestamps.append(
-             datetime.fromisoformat(info["last_timestamp"].replace("Z", "+00:00"))
-         )
+     # Use check_time_limit on the merged events to calculate time properly
+     # This handles overlapping periods automatically since we're now analyzing
+     # all events together chronologically
+     combined_result = check_time_limit(merged_events, time_limit_minutes)

-     first_ts = min(all_timestamps)
-     last_ts = max(all_timestamps)
-     overall_span = (last_ts - first_ts).total_seconds() / 60
+     if combined_result is None:
+         return None

-     result = {
-         "time_limit_minutes": time_limit_minutes,
-         "minutes_elapsed": round(total_elapsed, 2),
-         "first_timestamp": first_ts.isoformat().replace("+00:00", "Z"),
-         "last_timestamp": last_ts.isoformat().replace("+00:00", "Z"),
-         "file_count": len(valid_infos),
-         "overall_span_minutes": round(overall_span, 2),
-     }
+     # Add file_count to the result
+     combined_result["file_count"] = len(valid_event_sets)

-     # For time limit check in combined mode, use the sum of elapsed times
-     if time_limit_minutes is not None:
-         result["exceeds_limit"] = total_elapsed > time_limit_minutes
-     else:
-         result["exceeds_limit"] = False
+     return combined_result

-     return result
+
+ def compare_submitted_file(reconstructed_code: str, submitted_file_path) -> dict[str, Any]:
+     """
+     Compare reconstructed code from recording with a submitted final file.
+
+     Parameters
+     ----------
+     reconstructed_code : str
+         The code reconstructed from the recording
+     submitted_file_path : Path
+         Path to the submitted file
+
+     Returns
+     -------
+     dict[str, Any]
+         Dictionary containing:
+         - matches: bool indicating if the files match
+         - submitted_file: path to the submitted file
+         - diff: unified diff string if files don't match
+         - whitespace_only: bool indicating if only whitespace differs
+     """
+     try:
+         submitted_content = submitted_file_path.read_text()
+     except Exception as e:
+         return {
+             "matches": False,
+             "submitted_file": normalize_path_string(str(submitted_file_path)),
+             "error": f"Failed to read submitted file: {e}",
+             "diff": "",
+             "whitespace_only": False,
+         }
+
+     # Normalize newlines for comparison
+     reconstructed_normalized = _normalize_newlines(reconstructed_code)
+     submitted_normalized = _normalize_newlines(submitted_content)
+
+     # Check exact match
+     matches = reconstructed_normalized == submitted_normalized
+
+     # Check if only whitespace differs
+     whitespace_only = False
+     if not matches:
+         whitespace_only = is_only_whitespace_differences(
+             submitted_normalized, reconstructed_normalized
+         )
+
+     # Generate diff if they don't match
+     diff_text = ""
+     if not matches:
+         reconstructed_lines = reconstructed_normalized.splitlines(keepends=True)
+         submitted_lines = submitted_normalized.splitlines(keepends=True)
+         diff = difflib.unified_diff(
+             reconstructed_lines,
+             submitted_lines,
+             fromfile="reconstructed",
+             tofile="submitted",
+             lineterm="",
+         )
+         diff_text = "".join(diff)
+
+     return {
+         "matches": matches,
+         "submitted_file": normalize_path_string(str(submitted_file_path)),
+         "diff": diff_text,
+         "whitespace_only": whitespace_only,
+     }
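
The docstring above spells out the shape of the dictionary `compare_submitted_file` returns. A short sketch of how a caller might consume it; the sample values are assumptions, not output from the tool.

```python
# Hypothetical result of compare_submitted_file(reconstructed, submitted_path).
comparison = {
    "matches": False,
    "submitted_file": "submissions/homework0-ISC.py",
    "diff": "--- reconstructed\n+++ submitted\n@@ -1 +1 @@\n-x = 1\n+x = 2",
    "whitespace_only": False,
}

if comparison["matches"]:
    print("Submission matches the recording")
elif comparison["whitespace_only"]:
    print("Only whitespace differs")
else:
    print(comparison["diff"])
```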
@@ -18,11 +18,13 @@ from .api.output import write_batch_json_output
  from .api.verify import (
      check_time_limit,
      combine_time_info,
+     compare_submitted_file,
      detect_external_copypaste,
      template_diff,
      verify,
  )
  from .display import (
+     display_submitted_file_comparison,
      display_suspicious_events,
      display_template_diff,
      display_time_info,
@@ -102,6 +104,21 @@ def create_parser() -> argparse.ArgumentParser:
          help="Directory to write reconstructed code files in batch mode (one file per recording). "
          "Files are named based on input recording filenames.",
      )
+     parser.add_argument(
+         "--submitted-file",
+         type=Path,
+         default=None,
+         help="Path to the submitted final file to verify against the reconstructed output. "
+         "If provided, the reconstructed code will be compared to this file.",
+     )
+     parser.add_argument(
+         "--submitted-dir",
+         type=Path,
+         default=None,
+         help="Directory containing submitted files to compare against. "
+         "For each recording, the corresponding submitted file will be found by matching the filename. "
+         "For example, 'homework0-ISC.recording.jsonl.gz' will match 'homework0-ISC.py' in the directory.",
+     )
      parser.add_argument(
          "-s",
          "--show-autocomplete-details",
@@ -169,12 +186,55 @@ def expand_file_patterns(patterns: list[str]) -> list[Path]:
      return existing_files


+ def find_submitted_file(
+     jsonl_file: Path,
+     submitted_dir: Path,
+     target_document: str | None,
+ ) -> Path | None:
+     """
+     Find the submitted file corresponding to a recording file.
+
+     Matches by replacing '.recording.jsonl.gz' with the extension of the
+     target document (or '.py' if not specified).
+
+     Parameters
+     ----------
+     jsonl_file : Path
+         Path to the JSONL recording file
+     submitted_dir : Path
+         Directory containing submitted files
+     target_document : str | None
+         Target document path (to extract extension)
+
+     Returns
+     -------
+     Path | None
+         Path to the submitted file if found, None otherwise
+     """
+     # Determine the file extension from target_document or default to .py
+     extension = ".py"
+     if target_document:
+         extension = Path(target_document).suffix or ".py"
+
+     # Remove '.recording.jsonl.gz' and add the appropriate extension
+     base_name = jsonl_file.name.replace(".recording.jsonl.gz", "")
+     submitted_filename = base_name + extension
+
+     submitted_file = submitted_dir / submitted_filename
+     if submitted_file.exists():
+         return submitted_file
+
+     return None
+
+
  def process_single_file(
      jsonl_path: Path,
      template_data: str,
      target_document: str | None,
      time_limit: int | None,
- ) -> tuple[bool, str, list[dict[str, Any]], dict[str, Any] | None, str]:
+     submitted_file: Path | None = None,
+     submitted_dir: Path | None = None,
+ ) -> tuple[bool, str, list[dict[str, Any]], dict[str, Any] | None, str, tuple[dict[str, Any], ...], dict[str, Any] | None]:
      """
      Process a single JSONL recording file.

@@ -188,17 +248,21 @@ def process_single_file(
          Document to process
      time_limit : int | None
          Time limit in minutes
+     submitted_file : Path | None
+         Path to the submitted file to compare against
+     submitted_dir : Path | None
+         Directory containing submitted files to compare against

      Returns
      -------
      tuple
-         (verified, reconstructed_code, suspicious_events, time_info, template_diff_text)
+         (verified, reconstructed_code, suspicious_events, time_info, template_diff_text, doc_events, submitted_comparison)
      """
      try:
          json_data = load_jsonl(jsonl_path)
      except (FileNotFoundError, ValueError, IOError) as e:
          print(f"Error loading {jsonl_path}: {e}", file=sys.stderr)
-         return False, "", [], None, ""
+         return False, "", [], None, "", (), None

      # Filter events for target document
      doc_events = filter_events_by_document(json_data, target_document)
@@ -207,7 +271,7 @@ def process_single_file(
              f"Warning: No events found for document '{target_document}' in {jsonl_path}",
              file=sys.stderr,
          )
-         return False, "", [], None, ""
+         return False, "", [], None, "", (), None

      # Check time information
      time_info = check_time_limit(doc_events, time_limit)
@@ -218,13 +282,29 @@ def process_single_file(
          reconstructed = reconstruct_file_from_events(
              doc_events, verified_template, document_path=target_document
          )
-         return True, reconstructed, suspicious_events, time_info, ""
+
+         # Compare with submitted file if provided
+         submitted_comparison = None
+         actual_submitted_file = submitted_file
+
+         # If submitted_dir is provided, find the matching file
+         if submitted_dir and not submitted_file:
+             actual_submitted_file = find_submitted_file(jsonl_path, submitted_dir, target_document)
+             if actual_submitted_file:
+                 print(f"Found submitted file: {actual_submitted_file.name}", file=sys.stderr)
+
+         if actual_submitted_file and actual_submitted_file.exists():
+             submitted_comparison = compare_submitted_file(reconstructed, actual_submitted_file)
+         elif actual_submitted_file:
+             print(f"Warning: Submitted file not found: {actual_submitted_file}", file=sys.stderr)
+
+         return True, reconstructed, suspicious_events, time_info, "", doc_events, submitted_comparison
      except ValueError as e:
          # If verification fails but we have events, still try to reconstruct
          print(f"Warning: Verification failed for {jsonl_path}: {e}", file=sys.stderr)
          try:
              if not doc_events:
-                 return False, "", [], time_info, ""
+                 return False, "", [], time_info, "", (), None

              # Compute diff against template and still detect suspicious events
              diff_text = template_diff(template_data, doc_events)
@@ -235,19 +315,35 @@ def process_single_file(
              reconstructed = reconstruct_file_from_events(
                  doc_events, initial_state, document_path=target_document
              )
-             return False, reconstructed, suspicious_events, time_info, diff_text
+
+             # Compare with submitted file if provided
+             submitted_comparison = None
+             actual_submitted_file = submitted_file
+
+             # If submitted_dir is provided, find the matching file
+             if submitted_dir and not submitted_file:
+                 actual_submitted_file = find_submitted_file(jsonl_path, submitted_dir, target_document)
+                 if actual_submitted_file:
+                     print(f"Found submitted file: {actual_submitted_file.name}", file=sys.stderr)
+
+             if actual_submitted_file and actual_submitted_file.exists():
+                 submitted_comparison = compare_submitted_file(reconstructed, actual_submitted_file)
+             elif actual_submitted_file:
+                 print(f"Warning: Submitted file not found: {actual_submitted_file}", file=sys.stderr)
+
+             return False, reconstructed, suspicious_events, time_info, diff_text, doc_events, submitted_comparison
          except Exception as reconstruction_error:
              print(
                  f"Error reconstructing {jsonl_path}: {type(reconstruction_error).__name__}: {reconstruction_error}",
                  file=sys.stderr,
              )
-             return False, "", [], time_info, ""
+             return False, "", [], time_info, "", (), None
      except Exception as e:
          print(
              f"Error processing {jsonl_path}: {type(e).__name__}: {e}",
              file=sys.stderr,
          )
-         return False, "", [], time_info, ""
+         return False, "", [], time_info, "", (), None


  def write_reconstructed_file(
@@ -274,7 +370,7 @@ def write_reconstructed_file(
      """
      try:
          output_path.parent.mkdir(parents=True, exist_ok=True)
-         output_path.write_text(content)
+         output_path.write_text(content + '\n')
          print(f"{file_description} written to: {output_path}", file=sys.stderr)
          return True
      except Exception as e:
@@ -387,8 +483,8 @@ def process_batch(
              file_template_data = template_data

          # Process the file
-         verified, reconstructed, suspicious_events, time_info, diff_text = process_single_file(
-             jsonl_file, file_template_data, target_document, args.time_limit
+         verified, reconstructed, suspicious_events, time_info, diff_text, doc_events, submitted_comparison = process_single_file(
+             jsonl_file, file_template_data, target_document, args.time_limit, args.submitted_file, args.submitted_dir
          )

          if not verified:
@@ -398,6 +494,7 @@ def process_batch(
          display_time_info(time_info)
          display_suspicious_events(suspicious_events, args.show_autocomplete_details)
          display_template_diff(diff_text)
+         display_submitted_file_comparison(submitted_comparison)

          # Store results
          results.append({
@@ -408,6 +505,8 @@ def process_batch(
              "suspicious_events": suspicious_events,
              "time_info": time_info,
              "template_diff": diff_text,
+             "doc_events": doc_events,
+             "submitted_comparison": submitted_comparison,
          })

          # Write output file if requested
@@ -470,14 +569,15 @@ def process_single(

      print(f"Processing: {target_document or template_base}", file=sys.stderr)

-     verified, reconstructed, suspicious_events, time_info, diff_text = process_single_file(
-         jsonl_file, file_template_data, target_document, args.time_limit
+     verified, reconstructed, suspicious_events, time_info, diff_text, doc_events, submitted_comparison = process_single_file(
+         jsonl_file, file_template_data, target_document, args.time_limit, args.submitted_file, args.submitted_dir
      )

      # Display results
      display_time_info(time_info)
      display_suspicious_events(suspicious_events, args.show_autocomplete_details)
      display_template_diff(diff_text)
+     display_submitted_file_comparison(submitted_comparison)

      # Write output file if requested
      if reconstructed and args.output_file:
@@ -492,6 +592,8 @@ def process_single(
          "suspicious_events": suspicious_events,
          "time_info": time_info,
          "template_diff": diff_text,
+         "doc_events": doc_events,
+         "submitted_comparison": submitted_comparison,
      }]

      return results, verified
@@ -526,6 +628,11 @@ def main() -> int:
          parser.print_help()
          return 1

+     # Validate that both --submitted-file and --submitted-dir are not provided simultaneously
+     if args.submitted_file and args.submitted_dir:
+         print("Error: Cannot specify both --submitted-file and --submitted-dir", file=sys.stderr)
+         return 1
+
      # Expand file patterns and validate
      try:
          jsonl_files = expand_file_patterns(jsonl_patterns)
@@ -600,10 +707,10 @@ def main() -> int:
          print_batch_summary(len(results), verified_count, failed_files)

          # Display combined time report
-         time_infos = [r["time_info"] for r in results]
+         all_events = [r["doc_events"] for r in results]
          combined_time = None
-         if any(time_infos):
-             combined_time = combine_time_info(time_infos, args.time_limit)
+         if any(all_events):
+             combined_time = combine_time_info(all_events, args.time_limit)
          display_time_info(combined_time, is_combined=True)

          # Write JSON output
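
The `--submitted-dir` lookup wired in above relies on the filename convention implemented in `find_submitted_file`: strip `.recording.jsonl.gz` and append the target document's extension (`.py` by default). A small sketch of that rule, with an assumed directory layout:

```python
from pathlib import Path

# Assumed layout:
#   recordings/homework0-ISC.recording.jsonl.gz
#   submissions/homework0-ISC.py
recording = Path("recordings/homework0-ISC.recording.jsonl.gz")
submitted_dir = Path("submissions")

# Mirrors the matching rule above: drop ".recording.jsonl.gz", add ".py".
expected = submitted_dir / (recording.name.replace(".recording.jsonl.gz", "") + ".py")
print(expected)  # submissions/homework0-ISC.py
```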
@@ -176,6 +176,39 @@ def display_template_diff(diff_text: str) -> None:
      print(diff_text, file=sys.stderr)


+ def display_submitted_file_comparison(comparison: dict[str, Any] | None) -> None:
+     """
+     Display comparison results between reconstructed code and submitted file.
+
+     Parameters
+     ----------
+     comparison : dict[str, Any] | None
+         Comparison results from compare_submitted_file, or None if no comparison
+     """
+     if not comparison:
+         return
+
+     print("\nSubmitted file comparison:", file=sys.stderr)
+     print(f" Submitted file: {comparison['submitted_file']}", file=sys.stderr)
+
+     if "error" in comparison:
+         print(f" Error: {comparison['error']}", file=sys.stderr)
+         return
+
+     if comparison["matches"]:
+         print(" ✓ Reconstructed code matches submitted file exactly", file=sys.stderr)
+     elif comparison.get("whitespace_only", False):
+         print(" ⚠ Reconstructed code differs only in whitespace from submitted file", file=sys.stderr)
+     else:
+         print(" ✗ Reconstructed code differs from submitted file", file=sys.stderr)
+         if comparison.get("diff"):
+             print("\n Diff (reconstructed → submitted):", file=sys.stderr)
+             # Indent each line of the diff
+             for line in comparison["diff"].split("\n"):
+                 if line:
+                     print(f" {line}", file=sys.stderr)
+
+
  def print_separator() -> None:
      """Print a separator line."""
      print(f"{'='*80}", file=sys.stderr)