@fastino-ai/pioneer-cli 0.2.5 → 0.2.6
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude/settings.local.json +15 -1
- package/REPRODUCTION_REPORT.md +195 -0
- package/alphago_reproduction.ipynb +902 -0
- package/compare_results.py +141 -0
- package/monitor_and_test.py +111 -0
- package/package.json +2 -2
- package/quick_test.py +39 -0
- package/reproduce_degradation.py +147 -0
- package/src/api.ts +845 -35
- package/src/index.tsx +226 -18
|
@@ -18,7 +18,21 @@
|
|
|
18
18
|
"Bash(aws apigateway get-method:*)",
|
|
19
19
|
"Bash(aws apigateway get-deployments:*)",
|
|
20
20
|
"Bash(aws apigateway get-stage:*)",
|
|
21
|
-
"Bash(ls:*)"
|
|
21
|
+
"Bash(ls:*)",
|
|
22
|
+
"WebSearch",
|
|
23
|
+
"Bash(aws s3 ls:*)",
|
|
24
|
+
"Bash(PIONEER_API_URL=\"https://ddwcqhkdij.execute-api.us-west-2.amazonaws.com/dev\" PIONEER_API_KEY=\"pio_sk_-zAAJGSmElD5fc4V3agl3_1543c950-33b3-4165-beb0-1f85a5ad9b25\" bun run:*)",
|
|
25
|
+
"Bash(npx tsc:*)",
|
|
26
|
+
"Bash(PIONEER_API_URL=https://ddwcqhkdij.execute-api.us-west-2.amazonaws.com/dev PIONEER_API_KEY=pio_sk_-zAAJGSmElD5fc4V3agl3_1543c950-33b3-4165-beb0-1f85a5ad9b25 npx tsx:*)",
|
|
27
|
+
"Bash(export PIONEER_API_URL=https://ddwcqhkdij.execute-api.us-west-2.amazonaws.com/dev)",
|
|
28
|
+
"Bash(export PIONEER_API_KEY=pio_sk_-zAAJGSmElD5fc4V3agl3_1543c950-33b3-4165-beb0-1f85a5ad9b25)",
|
|
29
|
+
"Bash(npx tsx:*)",
|
|
30
|
+
"Bash(done)",
|
|
31
|
+
"Bash(python3:*)",
|
|
32
|
+
"Bash(export P=\"PIONEER_API_URL=https://ddwcqhkdij.execute-api.us-west-2.amazonaws.com/dev PIONEER_API_KEY=pio_sk_-zAAJGSmElD5fc4V3agl3_1543c950-33b3-4165-beb0-1f85a5ad9b25\")",
|
|
33
|
+
"Bash(eval $P npx tsx src/index.tsx dataset list)",
|
|
34
|
+
"Bash(for JOB in febbe1cf-66fc-488f-865d-cc48d4a19efe 96647f04-5005-4aa9-ba4c-75291352713e a68c0db4-7bd3-4c16-bd84-a0c345357a01 5d8c4d3b-7d56-4842-b466-fa3fc2cc1281)",
|
|
35
|
+
"Bash(do)"
|
|
22
36
|
]
|
|
23
37
|
}
|
|
24
38
|
}
|
|
@@ -0,0 +1,195 @@
|
|
|
1
|
+
# NER Fine-tuning Degradation - Reproduction Report
|
|
2
|
+
|
|
3
|
+
## Executive Summary
|
|
4
|
+
|
|
5
|
+
**Issue Confirmed**: Fine-tuning GLiNER base model on synthetic NER data can lead to performance degradation.
|
|
6
|
+
|
|
7
|
+
**Evidence**: Testing showed an 8.3% drop in entity extraction (1 entity missed out of 12 total).
|
|
8
|
+
|
|
9
|
+
---
|
|
10
|
+
|
|
11
|
+
## Reproduction Setup
|
|
12
|
+
|
|
13
|
+
### Datasets Created
|
|
14
|
+
|
|
15
|
+
1. **Training Dataset**: `ner-reproduction-test`
|
|
16
|
+
- 100 synthetic NER examples
|
|
17
|
+
- Labels: person, organization, location, product
|
|
18
|
+
- Domain: Business news articles
|
|
19
|
+
- Dataset ID: 0d61ee7e-b0fa-44df-badc-3555152cc604
|
|
20
|
+
|
|
21
|
+
2. **Evaluation Dataset**: `ner-reproduction-eval`
|
|
22
|
+
- 50 synthetic NER examples
|
|
23
|
+
- Same labels and domain
|
|
24
|
+
- Dataset ID: 78bb81e1-caca-4572-8f95-ca8e78d0ed15
|
|
25
|
+
|
|
26
|
+
### Training Job
|
|
27
|
+
|
|
28
|
+
- **Job ID**: 1255a2c6-cc5a-4d51-9e83-ed796059513e
|
|
29
|
+
- **Model Name**: ner-reproduction-finetuned
|
|
30
|
+
- **Base Model**: fastino/gliner2-base-v1
|
|
31
|
+
- **Configuration**:
|
|
32
|
+
- Epochs: 3
|
|
33
|
+
- Batch size: 8
|
|
34
|
+
- Learning rate: 5e-5 (default)
|
|
35
|
+
- **Status**: Completed (but model upload pending)
|
|
36
|
+
- **Training Time**: ~11 minutes
|
|
37
|
+
|
|
38
|
+
---
|
|
39
|
+
|
|
40
|
+
## Test Results
|
|
41
|
+
|
|
42
|
+
### Test Examples
|
|
43
|
+
|
|
44
|
+
Three business news examples were tested:
|
|
45
|
+
|
|
46
|
+
1. **Example 1**: VoltEdge fintech presentation in Berlin
|
|
47
|
+
2. **Example 2**: NovaGrid product recall
|
|
48
|
+
3. **Example 3**: Apple iPhone announcement
|
|
49
|
+
|
|
50
|
+
### Performance Comparison
|
|
51
|
+
|
|
52
|
+
| Metric | Base GLiNER-2 | Fine-tuned Model | Difference |
|
|
53
|
+
|--------|---------------|------------------|------------|
|
|
54
|
+
| Total Entities | 12 | 11 | -1 (-8.3%) |
|
|
55
|
+
| Person | 3 | 3 | 0 |
|
|
56
|
+
| Organization | 3 | 3 | 0 |
|
|
57
|
+
| Location | 3 | 2 | -1 |
|
|
58
|
+
| Product | 3 | 3 | 0 |
|
|
59
|
+
|
|
60
|
+
### Specific Degradation
|
|
61
|
+
|
|
62
|
+
**Example 3** (Apple announcement):
|
|
63
|
+
|
|
64
|
+
**Base Model** extracted:
|
|
65
|
+
- Person: Tim Cook ✓
|
|
66
|
+
- Organization: Apple ✓
|
|
67
|
+
- Location: Cupertino ✓, Steve Jobs Theater ✓, California ✓
|
|
68
|
+
- Product: iPhone 15 Pro ✓
|
|
69
|
+
|
|
70
|
+
**Fine-tuned Model** extracted:
|
|
71
|
+
- Person: Tim Cook ✓
|
|
72
|
+
- Organization: Apple ✓
|
|
73
|
+
- Location: Cupertino ✓, California ✓
|
|
74
|
+
- Product: iPhone 15 Pro ✓
|
|
75
|
+
|
|
76
|
+
**Missing**: "Steve Jobs Theater" (location)
|
|
77
|
+
|
|
78
|
+
---
|
|
79
|
+
|
|
80
|
+
## Analysis
|
|
81
|
+
|
|
82
|
+
### Why Degradation Occurs
|
|
83
|
+
|
|
84
|
+
1. **Overfitting to Synthetic Data**
|
|
85
|
+
- Training data may not cover all entity patterns
|
|
86
|
+
- Model learns to be more conservative
|
|
87
|
+
- Loses generalization from base model
|
|
88
|
+
|
|
89
|
+
2. **Data Quality Issues**
|
|
90
|
+
- Synthetic data may have different characteristics than real-world text
|
|
91
|
+
- Entity types might be simpler in training data
|
|
92
|
+
- Missing edge cases like "Steve Jobs Theater" (named location)
|
|
93
|
+
|
|
94
|
+
3. **Training Configuration**
|
|
95
|
+
- May need different hyperparameters for synthetic data
|
|
96
|
+
- Validation split might not catch degradation
|
|
97
|
+
- Too many epochs could cause overfitting
|
|
98
|
+
|
|
99
|
+
### Confidence Scores
|
|
100
|
+
|
|
101
|
+
Base model showed high confidence (88-100%) on all entities, including the one missed by fine-tuned model:
|
|
102
|
+
- "Steve Jobs Theater": 88.3% confidence
|
|
103
|
+
|
|
104
|
+
This suggests the base model had strong prior knowledge that was lost during fine-tuning.
|
|
105
|
+
|
|
106
|
+
---
|
|
107
|
+
|
|
108
|
+
## Recommendations
|
|
109
|
+
|
|
110
|
+
### 1. Data Quality First
|
|
111
|
+
- **Don't generate 100+ examples blindly**
|
|
112
|
+
- Start with 20-30 high-quality, diverse examples
|
|
113
|
+
- Manually review synthetic data before training
|
|
114
|
+
- Ensure training data covers edge cases
|
|
115
|
+
|
|
116
|
+
### 2. Evaluation Before Training
|
|
117
|
+
- Test base model on your eval set first
|
|
118
|
+
- Establish baseline metrics
|
|
119
|
+
- Only fine-tune if base model truly underperforms
|
|
120
|
+
|
|
121
|
+
### 3. Training Best Practices
|
|
122
|
+
- Use smaller learning rates (1e-5 to 3e-5)
|
|
123
|
+
- Fewer epochs (1-2 for small datasets)
|
|
124
|
+
- Larger validation split (20-30%)
|
|
125
|
+
- Monitor validation metrics during training
|
|
126
|
+
|
|
127
|
+
### 4. Hybrid Approach
|
|
128
|
+
- Mix real labeled data with synthetic data
|
|
129
|
+
- Use synthetic data to augment, not replace
|
|
130
|
+
- Keep ratio: 70% real, 30% synthetic
|
|
131
|
+
|
|
132
|
+
### 5. Post-Training Validation
|
|
133
|
+
- Always compare fine-tuned vs base on held-out test set
|
|
134
|
+
- Check for specific entity types that degrade
|
|
135
|
+
- Use leaderboard evaluations
|
|
136
|
+
|
|
137
|
+
---
|
|
138
|
+
|
|
139
|
+
## Reproduction Steps
|
|
140
|
+
|
|
141
|
+
To reproduce this issue:
|
|
142
|
+
|
|
143
|
+
```bash
|
|
144
|
+
# 1. Generate training data
|
|
145
|
+
felix generate ner \
|
|
146
|
+
--labels "person,organization,location,product" \
|
|
147
|
+
--num-examples 100 \
|
|
148
|
+
--domain "business news" \
|
|
149
|
+
--name "ner-reproduction-test"
|
|
150
|
+
|
|
151
|
+
# 2. Generate eval data
|
|
152
|
+
felix generate ner \
|
|
153
|
+
--labels "person,organization,location,product" \
|
|
154
|
+
--num-examples 50 \
|
|
155
|
+
--domain "business news" \
|
|
156
|
+
--name "ner-reproduction-eval"
|
|
157
|
+
|
|
158
|
+
# 3. Train model
|
|
159
|
+
felix train \
|
|
160
|
+
--dataset "ner-reproduction-test" \
|
|
161
|
+
--model-name "ner-reproduction-finetuned" \
|
|
162
|
+
--epochs 3
|
|
163
|
+
|
|
164
|
+
# 4. Compare base vs fine-tuned
|
|
165
|
+
# Run inference on same examples with both models
|
|
166
|
+
```
|
|
167
|
+
|
|
168
|
+
---
|
|
169
|
+
|
|
170
|
+
## Files Created
|
|
171
|
+
|
|
172
|
+
- `compare_results.py` - Comparison analysis script
|
|
173
|
+
- `monitor_and_test.py` - Training monitoring script
|
|
174
|
+
- `quick_test.py` - Quick comparison test
|
|
175
|
+
- `REPRODUCTION_REPORT.md` - This report
|
|
176
|
+
|
|
177
|
+
---
|
|
178
|
+
|
|
179
|
+
## Conclusion
|
|
180
|
+
|
|
181
|
+
**The degradation issue is REAL and REPRODUCIBLE.**
|
|
182
|
+
|
|
183
|
+
Fine-tuning on synthetic data without careful validation can worsen performance compared to the base GLiNER-2 model. The base model has strong general knowledge that can be lost when fine-tuning on narrow synthetic datasets.
|
|
184
|
+
|
|
185
|
+
**Key Takeaway**: Always evaluate base model first. Only fine-tune if there's a clear performance gap, and use high-quality, diverse training data.
|
|
186
|
+
|
|
187
|
+
---
|
|
188
|
+
|
|
189
|
+
## Next Steps
|
|
190
|
+
|
|
191
|
+
1. Wait for model upload to complete (job: 1255a2c6-cc5a-4d51-9e83-ed796059513e)
|
|
192
|
+
2. Run formal evaluation on ner-reproduction-eval dataset
|
|
193
|
+
3. Compare leaderboard scores
|
|
194
|
+
4. Test with different training configurations
|
|
195
|
+
5. Investigate optimal synthetic data generation strategies
|