cr_proc-0.2.0.tar.gz

cr_proc-0.2.0/PKG-INFO ADDED
@@ -0,0 +1,247 @@
1
+ Metadata-Version: 2.3
2
+ Name: cr_proc
3
+ Version: 0.2.0
4
+ Summary: A tool for processing BYU CS code recording files.
5
+ Author: Ethan Dye
6
+ Author-email: Ethan Dye <mrtops03@gmail.com>
7
+ Requires-Python: >=3.11
8
+ Description-Content-Type: text/markdown
9
+
10
+ # `code_recorder_processor`
11
+
12
+ `code_recorder_processor` processes `*.recording.jsonl.gz` files produced by
13
+ the current `jetbrains-recorder` and `vscode-recorder` implementations. It
14
+ reconstructs the edited document, compares that reconstruction to a template,
15
+ and reports suspicious activity such as large external pastes, rapid AI-style
16
+ paste bursts, and time-limit violations.
17
+
18
+ ## Scope
19
+
20
+ The processor is designed around the current recorder implementations, not
21
+ around the historical examples in this repository.
22
+
23
+ Current schema expectations:
24
+
25
+ - Modern edit events use `type: "edit"`.
26
+ - Status events use typed records such as `type: "focusStatus"`.
27
+ - Events include `timestamp`, `document`, `offset`, `oldFragment`, and
28
+ `newFragment`.
29
+
30
+ Compatibility behavior:
31
+
32
+ - Older recordings that omit `type` on edit events are still accepted.
33
+ - If a mixed recording contains both modern typed edits and later stale legacy
34
+ untyped edits, the processor prefers the typed stream.
35
+ - Example recordings in `recordings/` are fixtures, not the schema source of
36
+ truth.
37
+
38
+ ## Installation
39
+
40
+ For development inside this repository:
41
+
42
+ ```bash
43
+ uv sync --dev
44
+ ```
45
+
46
+ For running commands in the repo without a global install, prefer:
47
+
48
+ ```bash
49
+ uv run cr_proc --help
50
+ ```
51
+
52
+ To install the CLI globally from a local checkout:
53
+
54
+ ```bash
55
+ uv tool install .
56
+ ```
57
+
58
+ After that, the `cr_proc` command is available directly:
59
+
60
+ ```bash
61
+ cr_proc --help
62
+ ```
63
+
64
+ If you want the global command to track local source changes while developing:
65
+
66
+ ```bash
67
+ uv tool install --editable .
68
+ ```
69
+
70
+ ## Quick Start
71
+
72
+ The simplest invocation is to pass only recordings. When `--template` is
73
+ omitted, the processor looks for a matching template file next to each
74
+ recording.
75
+
76
+ Single recording:
77
+
78
+ ```bash
79
+ uv run cr_proc path/to/student.recording.jsonl.gz
80
+ ```
81
+
82
+ Multiple recordings:
83
+
84
+ ```bash
85
+ uv run cr_proc recordings/*.recording.jsonl.gz
86
+ ```
87
+
88
+ Explicit template file:
89
+
90
+ ```bash
91
+ uv run cr_proc student.recording.jsonl.gz --template template.py
92
+ ```
93
+
94
+ Template directory:
95
+
96
+ ```bash
97
+ uv run cr_proc recordings/*.recording.jsonl.gz --template templates/
98
+ ```
99
+
100
+ Write reconstructed output:
101
+
102
+ ```bash
103
+ uv run cr_proc student.recording.jsonl.gz --write reconstructed.py
104
+ uv run cr_proc recordings/*.recording.jsonl.gz --write output/
105
+ ```
106
+
107
+ Compare to submitted files:
108
+
109
+ ```bash
110
+ uv run cr_proc student.recording.jsonl.gz --submitted submitted.py
111
+ uv run cr_proc recordings/*.recording.jsonl.gz --submitted submissions/
112
+ ```
113
+
114
+ Write JSON results:
115
+
116
+ ```bash
117
+ uv run cr_proc recordings/*.recording.jsonl.gz --output-json results.json
118
+ ```
119
+
120
+ Playback mode:
121
+
122
+ ```bash
123
+ uv run cr_proc student.recording.jsonl.gz --playback
124
+ ```
125
+
126
+ This opens a windowed viewer. Use the left/right arrow keys to step through
127
+ edits, `Space` to play or pause, and `Home`/`End` to jump to the beginning or
128
+ final state. The viewer is generated as a local HTML page and opened in your
129
+ default browser.
130
+
131
+ Select a specific document from a multi-document recording:
132
+
133
+ ```bash
134
+ uv run cr_proc multi-file.recording.jsonl.gz --document src/main.py
135
+ ```
136
+
137
+ ## CLI Reference
138
+
139
+ Core inputs:
140
+
141
+ - `inputs`: One or more recording files or glob patterns.
142
+ - `--template PATH`: Optional template file or template directory.
143
+ - `--document NAME`: Optional override for which document inside the recording
144
+ should be processed. This matches the recorded document path or filename and
145
+ is not another local file input.
146
+
147
+ Outputs:
148
+
149
+ - `--write PATH`: Write reconstructed code. In single-file mode this can be a
150
+ file or a directory. In batch mode it must be a directory.
151
+ - `--output-json PATH`: Write structured JSON results.
152
+ - `--submitted PATH`: Compare reconstructed code to a submitted file or a
153
+ directory of submitted files.
154
+
155
+ Verification and filtering:
156
+
157
+ - `--time-limit MINUTES`: Flag recordings whose active editing time exceeds the
158
+ limit.
159
+ - `--filter-file FILE`: Exclude recordings matching a path, filename, or base
160
+ filename.
161
+ - `--filter-function-generation`: Suppress suspicious autocomplete findings
162
+ that are recognized as IDE-generated boilerplate function stubs.
163
+
164
+ Playback:
165
+
166
+ - `--playback`: Open a browser-based windowed playback viewer.
167
+ - `--playback-speed FLOAT`: Playback speed multiplier.
168
+ - `--playback-start-event N`: Start playback from a later applied-event index.
169
+
170
+ Compatibility aliases:
171
+
172
+ - Legacy positional-template usage still works.
173
+ - `--template-dir`, `--output-file`, `--output-dir`, `--submitted-file`, and
174
+ `--submitted-dir` are still accepted as compatibility aliases.
175
+
176
+ ## Template Resolution
177
+
178
+ When the processor needs a template, it resolves it in this order:
179
+
180
+ 1. `--template <file>` uses that exact file.
181
+ 2. `--template <directory>` searches that directory for the best filename or
182
+ stem match to the recorded document.
183
+ 3. If `--template` is omitted, the processor searches the recording's parent
184
+ directory.
185
+ 4. Legacy positional-template mode treats the last positional argument as a
186
+ template file when it does not look like a recording path.
187
+
188
+ `--document` affects this process by telling the processor which recorded
189
+ document to treat as the target before template matching happens. It only
190
+ selects data already present in the recording.
191
+
192
+ If no matching template is found, processing still continues by falling back to
193
+ the recording snapshot as the reconstruction seed.
194
+
195
+ ## Output Behavior
196
+
197
+ Normal user-facing output goes to `stderr`:
198
+
199
+ - time summaries
200
+ - suspicious-event summaries
201
+ - template mismatch diffs
202
+ - submitted-file comparison summaries
203
+ - warnings
204
+
205
+ Reconstructed code is written only when `--write` is used.
206
+
207
+ JSON output is written only when `--output-json` is used.
208
+
209
+ ## Suspicious Activity Detection
210
+
211
+ The processor currently reports:
212
+
213
+ - large multi-line external pastes
214
+ - rapid clusters of pasted lines within one second as an AI indicator
215
+ - time-limit violations for single recordings and combined batch activity
216
+
217
+ These checks are heuristic. They are intended to surface recordings for review,
218
+ not to act as a standalone disciplinary decision engine.
219
+
220
+ ## Development
221
+
222
+ Run tests:
223
+
224
+ ```bash
225
+ uv run pytest -q
226
+ ```
227
+
228
+ Run the bundled example recording:
229
+
230
+ ```bash
231
+ uv run cr_proc recordings/cs111-homework0/cs111-homework0-ISC.recording.jsonl.gz
232
+ ```
233
+
234
+ ## CI and Release
235
+
236
+ GitHub Actions uses `uv`, not Poetry.
237
+
238
+ - CI installs dependencies with `uv sync --locked --dev`.
239
+ - CI currently runs on Python `3.11` and `3.14`.
240
+ - The publish workflow builds distributions with `uv build`.
241
+
242
+ ## Repository Fixtures
243
+
244
+ The bundled recordings are documented in
245
+ [`recordings/README.md`](recordings/README.md).
246
+ Those files are useful for regression tests and examples, but some were created
247
+ with older recorder versions and intentionally exercise compatibility paths.
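The schema expectations in the README above can be illustrated with a small reader. This is a minimal sketch under stated assumptions, not the package's own `load` module: the names `read_recording` and `select_edit_events` are hypothetical, and the rule for mixed recordings (ignore untyped events whenever any typed edit is present) is a simplified reading of "prefers the typed stream".

```python
import gzip
import json
from pathlib import Path
from typing import Any


def read_recording(path: Path) -> list[dict[str, Any]]:
    """Read one gzip-compressed JSON Lines recording into a list of events."""
    events: list[dict[str, Any]] = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                events.append(json.loads(line))
    return events


def select_edit_events(events: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return edit events, preferring typed edits over legacy untyped ones."""
    typed = [event for event in events if event.get("type") == "edit"]
    if typed:
        return typed
    # Assumption: a legacy edit is an untyped event carrying both fragments.
    return [
        event
        for event in events
        if "type" not in event
        and "oldFragment" in event
        and "newFragment" in event
    ]
```

Status records such as `focusStatus` are dropped by both branches, so only events that can change the document reach the replay step.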
@@ -0,0 +1,238 @@
1
+ # `code_recorder_processor`
2
+
3
+ `code_recorder_processor` processes `*.recording.jsonl.gz` files produced by
4
+ the current `jetbrains-recorder` and `vscode-recorder` implementations. It
5
+ reconstructs the edited document, compares that reconstruction to a template,
6
+ and reports suspicious activity such as large external pastes, rapid AI-style
7
+ paste bursts, and time-limit violations.
8
+
9
+ ## Scope
10
+
11
+ The processor is designed around the current recorder implementations, not
12
+ around the historical examples in this repository.
13
+
14
+ Current schema expectations:
15
+
16
+ - Modern edit events use `type: "edit"`.
17
+ - Status events use typed records such as `type: "focusStatus"`.
18
+ - Events include `timestamp`, `document`, `offset`, `oldFragment`, and
19
+ `newFragment`.
20
+
21
+ Compatibility behavior:
22
+
23
+ - Older recordings that omit `type` on edit events are still accepted.
24
+ - If a mixed recording contains both modern typed edits and later stale legacy
25
+ untyped edits, the processor prefers the typed stream.
26
+ - Example recordings in `recordings/` are fixtures, not the schema source of
27
+ truth.
28
+
29
+ ## Installation
30
+
31
+ For development inside this repository:
32
+
33
+ ```bash
34
+ uv sync --dev
35
+ ```
36
+
37
+ For running commands in the repo without a global install, prefer:
38
+
39
+ ```bash
40
+ uv run cr_proc --help
41
+ ```
42
+
43
+ To install the CLI globally from a local checkout:
44
+
45
+ ```bash
46
+ uv tool install .
47
+ ```
48
+
49
+ After that, the `cr_proc` command is available directly:
50
+
51
+ ```bash
52
+ cr_proc --help
53
+ ```
54
+
55
+ If you want the global command to track local source changes while developing:
56
+
57
+ ```bash
58
+ uv tool install --editable .
59
+ ```
60
+
61
+ ## Quick Start
62
+
63
+ The simplest invocation is to pass only recordings. When `--template` is
64
+ omitted, the processor looks for a matching template file next to each
65
+ recording.
66
+
67
+ Single recording:
68
+
69
+ ```bash
70
+ uv run cr_proc path/to/student.recording.jsonl.gz
71
+ ```
72
+
73
+ Multiple recordings:
74
+
75
+ ```bash
76
+ uv run cr_proc recordings/*.recording.jsonl.gz
77
+ ```
78
+
79
+ Explicit template file:
80
+
81
+ ```bash
82
+ uv run cr_proc student.recording.jsonl.gz --template template.py
83
+ ```
84
+
85
+ Template directory:
86
+
87
+ ```bash
88
+ uv run cr_proc recordings/*.recording.jsonl.gz --template templates/
89
+ ```
90
+
91
+ Write reconstructed output:
92
+
93
+ ```bash
94
+ uv run cr_proc student.recording.jsonl.gz --write reconstructed.py
95
+ uv run cr_proc recordings/*.recording.jsonl.gz --write output/
96
+ ```
97
+
98
+ Compare to submitted files:
99
+
100
+ ```bash
101
+ uv run cr_proc student.recording.jsonl.gz --submitted submitted.py
102
+ uv run cr_proc recordings/*.recording.jsonl.gz --submitted submissions/
103
+ ```
104
+
105
+ Write JSON results:
106
+
107
+ ```bash
108
+ uv run cr_proc recordings/*.recording.jsonl.gz --output-json results.json
109
+ ```
110
+
111
+ Playback mode:
112
+
113
+ ```bash
114
+ uv run cr_proc student.recording.jsonl.gz --playback
115
+ ```
116
+
117
+ This opens a windowed viewer. Use the left/right arrow keys to step through
118
+ edits, `Space` to play or pause, and `Home`/`End` to jump to the beginning or
119
+ final state. The viewer is generated as a local HTML page and opened in your
120
+ default browser.
121
+
122
+ Select a specific document from a multi-document recording:
123
+
124
+ ```bash
125
+ uv run cr_proc multi-file.recording.jsonl.gz --document src/main.py
126
+ ```
127
+
128
+ ## CLI Reference
129
+
130
+ Core inputs:
131
+
132
+ - `inputs`: One or more recording files or glob patterns.
133
+ - `--template PATH`: Optional template file or template directory.
134
+ - `--document NAME`: Optional override for which document inside the recording
135
+ should be processed. This matches the recorded document path or filename and
136
+ is not another local file input.
137
+
138
+ Outputs:
139
+
140
+ - `--write PATH`: Write reconstructed code. In single-file mode this can be a
141
+ file or a directory. In batch mode it must be a directory.
142
+ - `--output-json PATH`: Write structured JSON results.
143
+ - `--submitted PATH`: Compare reconstructed code to a submitted file or a
144
+ directory of submitted files.
145
+
146
+ Verification and filtering:
147
+
148
+ - `--time-limit MINUTES`: Flag recordings whose active editing time exceeds the
149
+ limit.
150
+ - `--filter-file FILE`: Exclude recordings matching a path, filename, or base
151
+ filename.
152
+ - `--filter-function-generation`: Suppress suspicious autocomplete findings
153
+ that are recognized as IDE-generated boilerplate function stubs.
154
+
155
+ Playback:
156
+
157
+ - `--playback`: Open a browser-based windowed playback viewer.
158
+ - `--playback-speed FLOAT`: Playback speed multiplier.
159
+ - `--playback-start-event N`: Start playback from a later applied-event index.
160
+
161
+ Compatibility aliases:
162
+
163
+ - Legacy positional-template usage still works.
164
+ - `--template-dir`, `--output-file`, `--output-dir`, `--submitted-file`, and
165
+ `--submitted-dir` are still accepted as compatibility aliases.
166
+
167
+ ## Template Resolution
168
+
169
+ When the processor needs a template, it resolves it in this order:
170
+
171
+ 1. `--template <file>` uses that exact file.
172
+ 2. `--template <directory>` searches that directory for the best filename or
173
+ stem match to the recorded document.
174
+ 3. If `--template` is omitted, the processor searches the recording's parent
175
+ directory.
176
+ 4. Legacy positional-template mode treats the last positional argument as a
177
+ template file when it does not look like a recording path.
178
+
179
+ `--document` affects this process by telling the processor which recorded
180
+ document to treat as the target before template matching happens. It only
181
+ selects data already present in the recording.
182
+
183
+ If no matching template is found, processing still continues by falling back to
184
+ the recording snapshot as the reconstruction seed.
185
+
186
+ ## Output Behavior
187
+
188
+ Normal user-facing output goes to `stderr`:
189
+
190
+ - time summaries
191
+ - suspicious-event summaries
192
+ - template mismatch diffs
193
+ - submitted-file comparison summaries
194
+ - warnings
195
+
196
+ Reconstructed code is written only when `--write` is used.
197
+
198
+ JSON output is written only when `--output-json` is used.
199
+
200
+ ## Suspicious Activity Detection
201
+
202
+ The processor currently reports:
203
+
204
+ - large multi-line external pastes
205
+ - rapid clusters of pasted lines within one second as an AI indicator
206
+ - time-limit violations for single recordings and combined batch activity
207
+
208
+ These checks are heuristic. They are intended to surface recordings for review,
209
+ not to act as a standalone disciplinary decision engine.
210
+
211
+ ## Development
212
+
213
+ Run tests:
214
+
215
+ ```bash
216
+ uv run pytest -q
217
+ ```
218
+
219
+ Run the bundled example recording:
220
+
221
+ ```bash
222
+ uv run cr_proc recordings/cs111-homework0/cs111-homework0-ISC.recording.jsonl.gz
223
+ ```
224
+
225
+ ## CI and Release
226
+
227
+ GitHub Actions uses `uv`, not Poetry.
228
+
229
+ - CI installs dependencies with `uv sync --locked --dev`.
230
+ - CI currently runs on Python `3.11` and `3.14`.
231
+ - The publish workflow builds distributions with `uv build`.
232
+
233
+ ## Repository Fixtures
234
+
235
+ The bundled recordings are documented in
236
+ [`recordings/README.md`](recordings/README.md).
237
+ Those files are useful for regression tests and examples, but some were created
238
+ with older recorder versions and intentionally exercise compatibility paths.
@@ -0,0 +1,23 @@
+ [project]
+ name = "cr_proc"
+ version = "0.2.0"
+ description = "A tool for processing BYU CS code recording files."
+ readme = "README.md"
+ requires-python = ">=3.11"
+ authors = [
+     { name = "Ethan Dye", email = "mrtops03@gmail.com" },
+ ]
+ dependencies = []
+
+ [project.scripts]
+ cr_proc = "cr_proc.cli:main"
+
+ [dependency-groups]
+ dev = [
+     "mdformat>=1.0.0,<2.0.0",
+     "pytest>=9.0.2,<10.0.0",
+ ]
+
+ [build-system]
+ requires = ["uv_build>=0.11.1,<0.12.0"]
+ build-backend = "uv_build"
@@ -0,0 +1,7 @@
+ """Code Recorder Processor - A tool for processing BYU CS code recording files."""
+ from importlib.metadata import version, PackageNotFoundError
+
+ try:
+     __version__ = version("cr_proc")
+ except PackageNotFoundError:
+     __version__ = "unknown"
@@ -0,0 +1,184 @@
+ """Replay edit events to reconstruct document state."""
+
+ from __future__ import annotations
+
+ import sys
+ from typing import Any
+
+ from ..timeutil import parse_timestamp
+ from .document import filter_events_by_document_with_rename_handling
+ from .load import filter_edit_events
+
+
+ def _normalize_newlines(text: str) -> str:
+     """Normalize CRLF to LF for stable replay and diff behavior."""
+     return text.replace("\r\n", "\n")
+
+
+ def _ordered_edit_events(events: tuple[dict[str, Any], ...]) -> list[dict[str, Any]]:
+     decorated: list[tuple[int, object, dict[str, Any]]] = []
+     for index, event in enumerate(events):
+         timestamp = event.get("timestamp")
+         if timestamp:
+             try:
+                 decorated.append((0, parse_timestamp(str(timestamp)), event))
+                 continue
+             except ValueError:
+                 pass
+         decorated.append((1, index, event))
+
+     decorated.sort(key=lambda item: (item[0], item[1]))
+     return [event for _, _, event in decorated]
+
+
+ def _utf16_units_to_index(text: str, units: int) -> int:
+     if units <= 0:
+         return 0
+
+     consumed = 0
+     index = 0
+     for char in text:
+         if consumed >= units:
+             break
+         consumed += 2 if ord(char) > 0xFFFF else 1
+         index += 1
+     return index
+
+
+ def _resolve_offset(document: str, old_fragment: str, offset: int, window: int) -> int:
+     # Pure insertion: clamp the offset into the document.
+     if old_fragment == "":
+         return max(0, min(offset, len(document)))
+
+     # Fast path: the old fragment sits exactly at the recorded offset.
+     if 0 <= offset <= len(document) and document[offset : offset + len(old_fragment)] == old_fragment:
+         return offset
+
+     # Otherwise search a window around the offset for the nearest occurrence.
+     start = max(0, offset - window)
+     end = min(len(document), offset + window + len(old_fragment))
+
+     best_match: tuple[int, int] | None = None
+     search_at = start
+     while True:
+         found = document.find(old_fragment, search_at, end)
+         if found == -1:
+             break
+         distance = abs(found - offset)
+         candidate = (distance, found)
+         if best_match is None or candidate < best_match:
+             best_match = candidate
+         search_at = found + 1
+
+     if best_match is None:
+         raise ValueError(
+             f"Old fragment not found near offset {offset}.\n"
+             f"old={old_fragment!r}\nold fragment length={len(old_fragment)}"
+         )
+
+     return best_match[1]
+
+
+ def _apply_edit(
+     document: str,
+     *,
+     old_fragment: str,
+     new_fragment: str,
+     offset: int,
+     window: int,
+     utf16_mode: bool,
+ ) -> str:
+     text_offset = _utf16_units_to_index(document, offset) if utf16_mode else offset
+     resolved_offset = _resolve_offset(document, old_fragment, text_offset, window)
+     return (
+         document[:resolved_offset]
+         + new_fragment
+         + document[resolved_offset + len(old_fragment) :]
+     )
+
+
+ def reconstruct_file_from_events(
+     events: tuple[dict[str, Any], ...],
+     template: str,
+     document_path: str | None = None,
+     *,
+     utf16_mode: bool = False,
+     window: int = 200,
+     normalize_newlines: bool = True,
+     skip_unreplayable: bool = True,
+ ) -> str:
+     """Replay edit events to reconstruct the final document state."""
+     edit_events = filter_edit_events(events)
+     if not edit_events:
+         return _normalize_newlines(template) if normalize_newlines else template
+
+     target_document = document_path
+     if target_document is None:
+         recorded_docs = {
+             str(event["document"])
+             for event in edit_events
+             if event.get("document") is not None
+         }
+         if len(recorded_docs) == 1:
+             target_document = next(iter(recorded_docs))
+         else:
+             raise ValueError(
+                 "Ambiguous target document: provide document_path explicitly."
+             )
+
+     doc_events = tuple(
+         filter_events_by_document_with_rename_handling(edit_events, target_document)
+     )
+     ordered_events = _ordered_edit_events(doc_events)
+     if not ordered_events:
+         return _normalize_newlines(template) if normalize_newlines else template
+
+     document = _normalize_newlines(template) if normalize_newlines else template
+     skipped = 0
+
+     for event_index, event in enumerate(ordered_events):
+         old_fragment = str(event.get("oldFragment", ""))
+         new_fragment = str(event.get("newFragment", ""))
+         if normalize_newlines:
+             old_fragment = _normalize_newlines(old_fragment)
+             new_fragment = _normalize_newlines(new_fragment)
+
+         try:
+             offset = int(event.get("offset", 0) or 0)
+         except (TypeError, ValueError):
+             offset = 0
+
+         if old_fragment == new_fragment and offset == 0:
+             if old_fragment:
+                 document = old_fragment
+             continue
+
+         if old_fragment == new_fragment:
+             continue
+
+         try:
+             document = _apply_edit(
+                 document,
+                 old_fragment=old_fragment,
+                 new_fragment=new_fragment,
+                 offset=offset,
+                 window=window,
+                 utf16_mode=utf16_mode,
+             )
+         except ValueError as exc:
+             if not skip_unreplayable:
+                 raise
+
+             skipped += 1
+             print(
+                 "Warning: "
+                 f"Skipping event #{event_index} "
+                 f"(timestamp: {event.get('timestamp', 'unknown')}): "
+                 f"{exc} - document offset may have drifted",
+                 file=sys.stderr,
+             )
+
+     if skipped:
+         print(
+             f"Warning: Skipped {skipped} event(s) due to offset drift",
+             file=sys.stderr,
+         )
+
+     return document
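To see why the replay code carries a `utf16_mode` switch, consider a document containing a character outside the Basic Multilingual Plane. The sketch below is a standalone illustration, not part of the package: `utf16_units_to_index` follows the same counting rule as `_utf16_units_to_index` above, `apply_insert` is a hypothetical helper, and it is an assumption that a recorder would report offsets in UTF-16 code units (as Java and JavaScript strings do).

```python
def utf16_units_to_index(text: str, units: int) -> int:
    """Convert an offset counted in UTF-16 code units to a Python str index."""
    consumed = 0
    index = 0
    for char in text:
        if consumed >= units:
            break
        # Code points above U+FFFF need a surrogate pair: two UTF-16 units.
        consumed += 2 if ord(char) > 0xFFFF else 1
        index += 1
    return index


def apply_insert(document: str, new_fragment: str, utf16_offset: int) -> str:
    """Insert `new_fragment` at an offset expressed in UTF-16 code units."""
    index = utf16_units_to_index(document, utf16_offset)
    return document[:index] + new_fragment + document[index:]


document = 'print("😀")'
# The emoji occupies UTF-16 units 7 and 8, so the closing quote is at unit 9
# but at Python index 8.
converted = apply_insert(document, "!", 9)
naive = document[:9] + "!" + document[9:]
```

With the conversion the `!` lands inside the string literal (`print("😀!")`); using the raw offset as a Python index places it after the closing quote (`print("😀"!)`), which is the kind of drift the windowed search in `_resolve_offset` would otherwise have to absorb.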