@synsci/cli-darwin-x64-baseline 1.1.73 → 1.1.75

# Production Data Formatting Reference

## Overview

Production data is the most valuable training signal for model specialization. This reference covers patterns for extracting, cleaning, and formatting production data from common sources.

## Data Source Patterns

### REST API Logs

Most production systems log API requests and responses. A common structure:

```python
# Typical API log structure
log_entry = {
    "timestamp": "2026-01-15T10:30:00Z",
    "request_id": "req_abc123",
    "user_id": "user_456",
    "endpoint": "/v1/chat/completions",
    "input": {
        "model": "gpt-4o",
        "messages": [...],
        "temperature": 0.7
    },
    "output": {
        "choices": [{"message": {"content": "..."}}],
        "usage": {"prompt_tokens": 150, "completion_tokens": 200}
    },
    "latency_ms": 1200,
    "status": 200
}
```

**Extraction pattern**: Pull `input.messages` and `output.choices[0].message.content`, then format them into standard JSONL.

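This extraction step can be sketched as follows. The sketch assumes the log structure shown above; the helper name `log_to_jsonl_line` is illustrative, not part of any library:

```python
import json

def log_to_jsonl_line(log_entry):
    """Convert one API log entry (structure above) into a chat-format
    training record serialized as a single JSONL line.

    Returns None for non-200 requests, which should not be trained on.
    """
    if log_entry.get("status") != 200:
        return None
    messages = list(log_entry["input"]["messages"])
    completion = log_entry["output"]["choices"][0]["message"]["content"]
    messages.append({"role": "assistant", "content": completion})
    return json.dumps({"messages": messages})
```

Writing one returned line per successful request to a `.jsonl` file yields the standard fine-tuning format.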
### Database Records

If your product stores LLM interactions in a database:

```sql
SELECT
    system_prompt,
    user_input,
    COALESCE(corrected_response, model_response) AS target_response,
    user_feedback
FROM llm_interactions
-- IS DISTINCT FROM keeps NULL (no-feedback) rows, which != would silently drop
WHERE user_feedback IS DISTINCT FROM 'rejected'
  AND created_at > NOW() - INTERVAL '90 days'
ORDER BY created_at DESC;
```

**Key**: Always prefer `corrected_response` over the raw `model_response` when available; the `COALESCE` above does this automatically.

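A row returned by the query above maps into the messages format like this. A minimal sketch: the column names follow the query, and the helper name is hypothetical:

```python
def row_to_example(row):
    """Map one llm_interactions row (a dict of column -> value, as
    returned by the query above) to the standard messages format."""
    messages = []
    if row.get("system_prompt"):
        messages.append({"role": "system", "content": row["system_prompt"]})
    messages.append({"role": "user", "content": row["user_input"]})
    # target_response is already COALESCE(corrected_response, model_response)
    messages.append({"role": "assistant", "content": row["target_response"]})
    return {"messages": messages}
```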
### Structured Feedback

If users rate or edit model outputs:

| Signal | Quality | Use |
|--------|---------|-----|
| User edited output | Highest | Use edited version as training target |
| Thumbs up / accepted | High | Use original output as training target |
| Thumbs down / rejected | Medium | Exclude from SFT; use for DPO (rejected example) |
| No feedback | Low | Use with caution, filter by heuristics |

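The routing implied by this table can be sketched as below. Field names such as `feedback` and `edited_output` are assumptions about your schema, not a fixed API:

```python
def route_by_feedback(records):
    """Split raw records into SFT examples and DPO rejected examples,
    following the signal-quality table above.

    Each record is assumed to carry: messages (list of chat dicts),
    optionally feedback ("accepted"/"rejected"), optionally edited_output.
    """
    sft, dpo_rejected = [], []
    for rec in records:
        messages = [dict(m) for m in rec["messages"]]
        if rec.get("edited_output"):
            # Highest quality: train on the user's edit, not the raw output
            messages[-1] = {"role": "assistant", "content": rec["edited_output"]}
            sft.append({"messages": messages})
        elif rec.get("feedback") == "accepted":
            sft.append({"messages": messages})
        elif rec.get("feedback") == "rejected":
            # Keep as the rejected side of a DPO preference pair
            dpo_rejected.append({"messages": messages})
        # No feedback: handled separately with heuristic filters
    return sft, dpo_rejected
```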
## Cleaning Pipeline

```python
def clean_production_data(examples):
    """Standard cleaning pipeline for production data."""
    cleaned = []
    for ex in examples:
        messages = ex["messages"]

        # Skip empty or trivial examples
        assistant_msg = next((m for m in messages if m["role"] == "assistant"), None)
        if not assistant_msg or len(assistant_msg["content"].strip()) < 10:
            continue

        # Normalize whitespace (note: this collapses newlines, so skip it
        # for code-heavy outputs where line breaks are meaningful)
        for msg in messages:
            msg["content"] = " ".join(msg["content"].split())

        # Remove PII patterns (customize for your domain)
        for msg in messages:
            msg["content"] = redact_pii(msg["content"])

        # Skip if user input is too short (likely a test)
        user_msg = next((m for m in messages if m["role"] == "user"), None)
        if user_msg and len(user_msg["content"].strip()) < 5:
            continue

        cleaned.append({"messages": messages})

    return cleaned
```

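The pipeline leaves `redact_pii` to your domain. A minimal regex-based sketch is below; these patterns are illustrative and far from exhaustive, and real deployments typically use a dedicated PII-detection library:

```python
import re

# Illustrative patterns only; extend for your domain.
# SSN is checked before the broader phone pattern, which would otherwise match it.
_PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\+?\d[\d\s().-]{8,}\d"), "[PHONE]"),
]

def redact_pii(text):
    """Replace common PII patterns with placeholder tokens."""
    for pattern, token in _PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```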
## Multi-Turn Conversations

For products with multi-turn interactions, preserve the full conversation:

```python
def conversation_to_training(conversation):
    """Convert a multi-turn conversation to training format.

    Each assistant turn becomes a training example carrying the full
    history up to and including that turn."""
    examples = []
    messages = []

    for turn in conversation["turns"]:
        messages.append({"role": turn["role"], "content": turn["content"]})

        # Create an example at each assistant turn
        if turn["role"] == "assistant":
            examples.append({"messages": list(messages)})

    return examples
```

## Volume Guidelines

| Dataset Size | Expected Quality | Recommended Approach |
|--------------|------------------|----------------------|
| < 100 | Insufficient for SFT | Use synthetic bootstrapping first |
| 100-1,000 | Minimum viable | LoRA fine-tune, careful eval |
| 1,000-10,000 | Good | Standard LoRA or QLoRA |
| 10,000-100,000 | Strong | Full fine-tune viable |
| > 100,000 | Excellent | Multi-epoch training, curriculum learning |
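The table can be encoded as a simple lookup; a small illustrative helper:

```python
def recommended_approach(n_examples):
    """Map dataset size to the recommended approach from the table above."""
    if n_examples < 100:
        return "synthetic bootstrapping"
    if n_examples < 1_000:
        return "LoRA fine-tune with careful eval"
    if n_examples < 10_000:
        return "standard LoRA or QLoRA"
    if n_examples < 100_000:
        return "full fine-tune"
    return "multi-epoch training with curriculum learning"
```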
package/bin/synsc CHANGED
Binary file

package/package.json CHANGED

```diff
@@ -1,6 +1,6 @@
 {
   "name": "@synsci/cli-darwin-x64-baseline",
-  "version": "1.1.73",
+  "version": "1.1.75",
   "os": [
     "darwin"
   ],
```