arkaos 2.0.0 → 2.0.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +100 -74
- package/VERSION +1 -1
- package/bin/arkaos +1 -1
- package/core/__pycache__/__init__.cpython-313.pyc +0 -0
- package/core/agents/__pycache__/__init__.cpython-313.pyc +0 -0
- package/core/agents/__pycache__/loader.cpython-313.pyc +0 -0
- package/core/agents/__pycache__/schema.cpython-313.pyc +0 -0
- package/core/agents/__pycache__/validator.cpython-313.pyc +0 -0
- package/core/conclave/__pycache__/__init__.cpython-313.pyc +0 -0
- package/core/conclave/__pycache__/advisor_db.cpython-313.pyc +0 -0
- package/core/conclave/__pycache__/display.cpython-313.pyc +0 -0
- package/core/conclave/__pycache__/matcher.cpython-313.pyc +0 -0
- package/core/conclave/__pycache__/persistence.cpython-313.pyc +0 -0
- package/core/conclave/__pycache__/profiler.cpython-313.pyc +0 -0
- package/core/conclave/__pycache__/prompts.cpython-313.pyc +0 -0
- package/core/conclave/__pycache__/schema.cpython-313.pyc +0 -0
- package/core/governance/__pycache__/__init__.cpython-313.pyc +0 -0
- package/core/governance/__pycache__/constitution.cpython-313.pyc +0 -0
- package/core/registry/__pycache__/__init__.cpython-313.pyc +0 -0
- package/core/registry/__pycache__/generator.cpython-313.pyc +0 -0
- package/core/runtime/__pycache__/__init__.cpython-313.pyc +0 -0
- package/core/runtime/__pycache__/base.cpython-313.pyc +0 -0
- package/core/runtime/__pycache__/claude_code.cpython-313.pyc +0 -0
- package/core/runtime/__pycache__/codex_cli.cpython-313.pyc +0 -0
- package/core/runtime/__pycache__/cursor.cpython-313.pyc +0 -0
- package/core/runtime/__pycache__/gemini_cli.cpython-313.pyc +0 -0
- package/core/runtime/__pycache__/registry.cpython-313.pyc +0 -0
- package/core/runtime/__pycache__/subagent.cpython-313.pyc +0 -0
- package/core/specs/__pycache__/__init__.cpython-313.pyc +0 -0
- package/core/specs/__pycache__/manager.cpython-313.pyc +0 -0
- package/core/specs/__pycache__/schema.cpython-313.pyc +0 -0
- package/core/squads/__pycache__/__init__.cpython-313.pyc +0 -0
- package/core/squads/__pycache__/loader.cpython-313.pyc +0 -0
- package/core/squads/__pycache__/registry.cpython-313.pyc +0 -0
- package/core/squads/__pycache__/schema.cpython-313.pyc +0 -0
- package/core/synapse/__pycache__/__init__.cpython-313.pyc +0 -0
- package/core/synapse/__pycache__/cache.cpython-313.pyc +0 -0
- package/core/synapse/__pycache__/engine.cpython-313.pyc +0 -0
- package/core/synapse/__pycache__/layers.cpython-313.pyc +0 -0
- package/core/tasks/__pycache__/__init__.cpython-313.pyc +0 -0
- package/core/tasks/__pycache__/manager.cpython-313.pyc +0 -0
- package/core/tasks/__pycache__/schema.cpython-313.pyc +0 -0
- package/core/workflow/__pycache__/__init__.cpython-313.pyc +0 -0
- package/core/workflow/__pycache__/engine.cpython-313.pyc +0 -0
- package/core/workflow/__pycache__/loader.cpython-313.pyc +0 -0
- package/core/workflow/__pycache__/schema.cpython-313.pyc +0 -0
- package/departments/dev/skills/agent-design/SKILL.md +4 -0
- package/departments/dev/skills/agent-design/references/architecture-patterns.md +223 -0
- package/departments/dev/skills/ai-security/SKILL.md +4 -0
- package/departments/dev/skills/ai-security/references/prompt-injection-catalog.md +230 -0
- package/departments/dev/skills/ci-cd-pipeline/SKILL.md +4 -0
- package/departments/dev/skills/ci-cd-pipeline/references/github-actions-patterns.md +202 -0
- package/departments/dev/skills/db-schema/SKILL.md +4 -0
- package/departments/dev/skills/db-schema/references/indexing-strategy.md +197 -0
- package/departments/dev/skills/dependency-audit/SKILL.md +4 -0
- package/departments/dev/skills/dependency-audit/references/license-matrix.md +191 -0
- package/departments/dev/skills/incident/SKILL.md +4 -0
- package/departments/dev/skills/incident/references/severity-playbook.md +221 -0
- package/departments/dev/skills/observability/SKILL.md +4 -0
- package/departments/dev/skills/observability/references/slo-design.md +200 -0
- package/departments/dev/skills/rag-architect/SKILL.md +5 -0
- package/departments/dev/skills/rag-architect/references/chunking-strategies.md +129 -0
- package/departments/dev/skills/rag-architect/references/evaluation-guide.md +158 -0
- package/departments/dev/skills/red-team/SKILL.md +4 -0
- package/departments/dev/skills/red-team/references/mitre-attack-web.md +165 -0
- package/departments/dev/skills/security-audit/SKILL.md +4 -0
- package/departments/dev/skills/security-audit/references/owasp-2025-deep.md +409 -0
- package/departments/dev/skills/security-compliance/SKILL.md +117 -0
- package/departments/finance/skills/ciso-advisor/SKILL.md +4 -0
- package/departments/finance/skills/ciso-advisor/references/compliance-roadmap.md +172 -0
- package/departments/marketing/skills/programmatic-seo/SKILL.md +4 -0
- package/departments/marketing/skills/programmatic-seo/references/template-playbooks.md +289 -0
- package/departments/ops/skills/gdpr-compliance/SKILL.md +104 -0
- package/departments/ops/skills/iso27001/SKILL.md +113 -0
- package/departments/ops/skills/quality-management/SKILL.md +118 -0
- package/departments/ops/skills/risk-management/SKILL.md +120 -0
- package/departments/ops/skills/soc2-compliance/SKILL.md +120 -0
- package/departments/strategy/skills/cto-advisor/SKILL.md +4 -0
- package/departments/strategy/skills/cto-advisor/references/build-vs-buy-framework.md +190 -0
- package/installer/cli.js +13 -2
- package/installer/index.js +1 -2
- package/installer/migrate.js +123 -0
- package/installer/update.js +28 -15
- package/package.json +1 -1
- package/pyproject.toml +1 -1
- package/core/agents/__pycache__/registry_gen.cpython-313.pyc +0 -0

package/departments/dev/skills/rag-architect/references/chunking-strategies.md
@@ -0,0 +1,129 @@

# Chunking Strategies — Deep Reference

> Decision tree, benchmarks, and configuration guide for RAG chunking.

## Strategy Comparison

| Strategy | Mechanism | Best For | Chunk Size Range | Complexity |
|----------|-----------|----------|------------------|------------|
| **Fixed-size** | Split every N tokens/chars | Uniform docs, logs, CSVs | 256-1024 tokens | Low |
| **Sentence-based** | NLP sentence boundary detection | Articles, blog posts, narrative | 1-5 sentences | Low |
| **Paragraph-based** | Double newline / heading splits | Technical docs, wikis | 100-500 tokens | Low |
| **Recursive** | Hierarchical separators (`\n\n` > `\n` > `. ` > ` `) | Mixed content, markdown, code | 256-1024 tokens | Medium |
| **Semantic** | Embedding similarity breakpoints | Long-form, topic-shifting content | Variable | High |
| **Document-aware** | Format-specific parsers (HTML, PDF, DOCX) | Multi-format collections | Variable | High |
| **Agentic** | LLM-driven boundary decisions | High-value, low-volume docs | Variable | Very High |

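The recursive row above maps directly onto LangChain's splitter. A minimal sketch, assuming the `langchain-text-splitters` package (an illustration only, not a dependency of this package):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Sizes are measured in characters by default; pass a token-counting
# length_function to budget in tokens instead.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,  # roughly 12% overlap
    separators=["\n\n", "\n", ". ", " "],  # the hierarchy from the table
)
document_text = "..."  # any markdown or mixed-format document
chunks = splitter.split_text(document_text)
```
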
## Decision Tree

```
START
|
+-- Is content structured (tables, code, forms)?
|   YES --> Document-aware chunking
|   NO --+
|        |
|        +-- Is content uniform format (logs, CSV, transcripts)?
|        |   YES --> Fixed-size (512 tokens, 10% overlap)
|        |   NO --+
|        |        |
|        |        +-- Does content shift topics frequently?
|        |        |   YES --> Semantic chunking
|        |        |   NO --+
|        |        |        |
|        |        |        +-- Is content markdown or mixed format?
|        |        |        |   YES --> Recursive chunking
|        |        |        |   NO --> Sentence-based chunking
```

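A minimal sketch of the fixed-size branch above (512 tokens, 10% overlap), with whitespace splitting standing in for a real tokenizer such as tiktoken:

```python
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap_pct: float = 0.10) -> list[str]:
    """Split text into fixed-size chunks with proportional overlap."""
    tokens = text.split()  # approximation: swap in the embedding model's tokenizer
    overlap = int(chunk_size * overlap_pct)
    step = chunk_size - overlap
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]
```
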
## Optimal Chunk Sizes by Document Type

| Document Type | Recommended Strategy | Chunk Size | Overlap | Rationale |
|---------------|----------------------|------------|---------|-----------|
| Legal contracts | Paragraph + heading | 300-500 tokens | 50 tokens | Preserve clause boundaries |
| API documentation | Recursive (by heading) | 256-512 tokens | 20% | Section-level retrieval |
| Chat transcripts | Fixed-size | 512 tokens | 10% | No natural structure |
| Research papers | Semantic | 400-800 tokens | 15% | Topic coherence critical |
| Source code | Document-aware (AST) | Per-function | 0 | Function-level boundaries |
| Product catalogs | Row/record-based | 1 record | 0 | Atomic items |
| Meeting notes | Paragraph-based | 200-400 tokens | 10% | Topic per paragraph |
| FAQ / Q&A pairs | Document-aware | 1 pair | 0 | Atomic question-answer units |

## Overlap Strategies

| Strategy | Overlap % | When to Use |
|----------|-----------|-------------|
| **No overlap** | 0% | Atomic units (records, Q&A pairs, functions) |
| **Minimal** | 5-10% | Uniform content, high chunk count tolerance |
| **Standard** | 10-20% | General-purpose, most use cases |
| **Aggressive** | 20-30% | Small chunks (<256 tokens), context-critical |
| **Sliding window** | 50%+ | Maximum recall, cost not a constraint |

Formula: `overlap_tokens = chunk_size * overlap_percentage`

## Benchmarks: Retrieval Quality vs Chunk Size

Tested on the NaturalQuestions dataset with text-embedding-ada-002, cosine similarity, and top-5 retrieval.

| Chunk Size (tokens) | Recall@5 | Precision@5 | MRR | Avg Latency |
|---------------------|----------|-------------|-----|-------------|
| 128 | 0.82 | 0.51 | 0.68 | 12ms |
| 256 | 0.85 | 0.62 | 0.74 | 14ms |
| 512 | 0.83 | 0.71 | 0.77 | 16ms |
| 1024 | 0.76 | 0.74 | 0.73 | 19ms |
| 2048 | 0.68 | 0.72 | 0.65 | 24ms |

Key finding: 256-512 tokens is the sweet spot for most use cases. Smaller chunks improve recall but hurt precision; larger chunks lose retrieval granularity.

## Semantic Chunking Algorithm

```
1. Split text into base units (sentences)
2. Compute embedding for each sentence
3. Calculate cosine similarity between consecutive sentences
4. Identify breakpoints where similarity drops below threshold
5. Merge sentences between breakpoints into chunks
6. If chunk exceeds max_size, apply recursive split within
```

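Steps 1-6 as a Python sketch; `embed` is a hypothetical callable returning one vector per sentence, sentence splitting is assumed done, and the step-6 recursive fallback is omitted:

```python
import numpy as np

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.5) -> list[str]:
    """Merge consecutive sentences, breaking where cosine similarity drops."""
    if not sentences:
        return []
    vecs = [np.asarray(embed(s), dtype=float) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(vecs, vecs[1:], sentences[1:]):
        sim = prev @ cur / (np.linalg.norm(prev) * np.linalg.norm(cur))
        if sim < threshold:  # breakpoint: topic likely shifted
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```
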
**Threshold tuning:**

| Threshold (cosine) | Behavior | Use When |
|--------------------|----------|----------|
| 0.3 | Aggressive splits, many small chunks | Diverse topics in single doc |
| 0.5 | Balanced | Default starting point |
| 0.7 | Conservative splits, fewer large chunks | Coherent, single-topic docs |

## Metadata to Attach per Chunk

Always attach these fields to every chunk for filtering and retrieval quality:

| Field | Purpose | Example |
|-------|---------|---------|
| `source` | Document origin | `contracts/nda-2024.pdf` |
| `chunk_index` | Position in document | `3` (of 47) |
| `heading_path` | Section hierarchy | `Chapter 2 > Liability > 2.3` |
| `doc_type` | Content classification | `legal`, `api_docs`, `faq` |
| `created_at` | Temporal filtering | `2024-11-15` |
| `token_count` | Cost estimation | `384` |

## Common Failure Modes

| Failure | Symptom | Fix |
|---------|---------|-----|
| Chunks too large | Low precision, irrelevant context in generation | Reduce to 256-512 tokens |
| Chunks too small | Low faithfulness, missing context | Increase overlap to 20-30% |
| Breaking tables/lists | Garbled retrieval results | Use document-aware chunking |
| No overlap | Answers miss context at chunk boundaries | Add 10-20% overlap |
| Ignoring document structure | Headers split from content | Use recursive with heading separators |
| Single strategy for all doc types | Inconsistent quality | Route by doc_type, use different strategies |

## Pre-Processing Checklist

- [ ] Remove boilerplate (headers, footers, page numbers, watermarks)
- [ ] Normalize whitespace and encoding (UTF-8)
- [ ] Extract and preserve tables as structured data
- [ ] Preserve heading hierarchy for metadata
- [ ] Handle images (OCR or skip with placeholder)
- [ ] Deduplicate near-identical documents before chunking
- [ ] Validate chunk count is reasonable (flag if >10K chunks per doc)

package/departments/dev/skills/rag-architect/references/evaluation-guide.md
@@ -0,0 +1,158 @@

# RAG Evaluation Guide — Deep Reference

> RAGAS framework, ground truth datasets, failure mode diagnosis, and continuous evaluation.

## RAGAS Framework Overview

RAGAS (Retrieval Augmented Generation Assessment) evaluates RAG pipelines across 4 dimensions:

| Metric | What It Measures | Range | Target | Calculation |
|--------|------------------|-------|--------|-------------|
| **Faithfulness** | Is the answer grounded in retrieved context? | 0-1 | > 0.90 | Claims in answer supported by context / total claims |
| **Answer Relevance** | Does the answer address the question? | 0-1 | > 0.85 | Semantic similarity of answer to question |
| **Context Precision** | Are relevant chunks ranked higher? | 0-1 | > 0.80 | Weighted precision of relevant chunks by rank position |
| **Context Recall** | Are all needed facts retrieved? | 0-1 | > 0.75 | Ground truth sentences covered by context / total GT sentences |

## Metric Calculation Details

### Faithfulness

```
1. Extract all factual claims from the generated answer
2. For each claim, check if it is supported by the retrieved context
3. faithfulness = supported_claims / total_claims
```

**Example:**
- Answer: "Python was created by Guido van Rossum in 1991. It is compiled."
- Context mentions Guido and 1991, but Python is interpreted.
- Claims: 3 total, 2 supported -> faithfulness = 0.67

### Answer Relevance

```
1. Generate N hypothetical questions from the answer (reverse QA)
2. Compute cosine similarity between original question and each generated question
3. answer_relevance = mean(similarities)
```

Low score = answer is off-topic or includes excessive irrelevant information.

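Steps 2-3 as a sketch; `embed` is a hypothetical embedding callable, and the reverse-QA generation of step 1 (which needs an LLM) is taken as input:

```python
import numpy as np

def answer_relevance(question: str, generated_questions: list[str], embed) -> float:
    """Mean cosine similarity between the original question and the
    questions reverse-generated from the answer."""
    if not generated_questions:
        return 0.0
    q = np.asarray(embed(question), dtype=float)
    sims = []
    for gq in generated_questions:
        v = np.asarray(embed(gq), dtype=float)
        sims.append(float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))))
    return sum(sims) / len(sims)
```
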
### Context Precision

```
1. For each retrieved chunk at rank k, determine if it is relevant
2. precision@k = relevant_chunks_in_top_k / k
3. context_precision = mean(precision@k for all k where chunk is relevant)
```

Low score = irrelevant chunks ranked higher than relevant ones. Fix with reranking.

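The same calculation as a self-contained sketch, where `relevance[k]` records whether the chunk retrieved at rank k+1 was judged relevant:

```python
def context_precision(relevance: list[bool]) -> float:
    """Mean of precision@k over the ranks that hold relevant chunks."""
    precisions, hits = [], 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Relevant at ranks 1 and 3: mean(1/1, 2/3) = 0.83
assert round(context_precision([True, False, True]), 2) == 0.83
```
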
### Context Recall

```
1. Decompose ground truth answer into individual sentences/facts
2. For each fact, check if any retrieved chunk contains supporting information
3. context_recall = facts_with_support / total_facts
```

Low score = retrieval is missing key information. Fix with chunking, embedding, or query expansion.

## Ground Truth Dataset Creation

### Minimum Dataset Size

| Purpose | Minimum Samples | Recommended |
|---------|-----------------|-------------|
| Quick sanity check | 20 | 50 |
| Development iteration | 50 | 100 |
| Production baseline | 100 | 250+ |
| Statistical significance | 250+ | 500+ |

### Dataset Schema

```json
{
  "question": "What is the refund policy for digital products?",
  "ground_truth": "Digital products can be refunded within 14 days if not downloaded.",
  "contexts": ["Section 4.2 of Terms: Digital products are eligible for..."],
  "metadata": {
    "category": "policy",
    "difficulty": "easy",
    "source_doc": "terms-of-service-v3.pdf"
  }
}
```

### Creation Methods

| Method | Quality | Cost | Speed | When to Use |
|--------|---------|------|-------|-------------|
| **Manual expert** | Highest | High | Slow | Production baselines, domain-critical |
| **LLM-generated + human review** | High | Medium | Medium | Development iteration, scaling |
| **LLM-generated (no review)** | Medium | Low | Fast | Quick sanity checks, prototyping |
| **User query logs** | High (real) | Low | Ongoing | Production monitoring, continuous eval |
| **Adversarial** | Critical | Medium | Slow | Edge case coverage, robustness |

### Quality Checklist for Ground Truth

- [ ] Questions are natural (not keyword-stuffed)
- [ ] Ground truth answers are factually verified against source docs
- [ ] Dataset covers all document types and topics proportionally
- [ ] Includes edge cases: multi-hop, no-answer, ambiguous queries
- [ ] No data leakage between eval set and training/fine-tuning data

## Failure Mode Diagnosis

| Symptom | Likely Cause | Metric Impact | Fix |
|---------|--------------|---------------|-----|
| Hallucinated facts | Generator ignores context | Low faithfulness | Lower temperature, add "based on context only" prompt |
| Correct but incomplete | Missing relevant chunks | Low context recall | Increase top-K, use HyDE/multi-query |
| Irrelevant chunks retrieved | Poor embedding match | Low context precision | Better chunking, add reranker, fine-tune embeddings |
| Answer ignores question | Prompt drift or context overload | Low answer relevance | Reduce context window, improve system prompt |
| Good metrics, bad user experience | Dataset does not reflect real queries | All metrics misleading | Rebuild eval set from production query logs |
| Inconsistent results | Non-deterministic generation | High variance | Set temperature=0, fix seed, increase eval samples |

## Evaluation Pipeline Setup

```
1. Create ground truth dataset (50+ samples minimum)
2. Run RAG pipeline on all questions
3. Compute RAGAS metrics
4. Log results with pipeline config (chunk_size, model, top_k, etc.)
5. Compare against baseline
6. Identify worst-performing samples
7. Diagnose using failure mode table
8. Fix and re-evaluate
```

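Steps 2-4 can be wired up with the `ragas` package. A sketch assuming the ragas 0.1-era API (`evaluate` plus metric objects); verify the metric and column names against the installed version:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (answer_relevancy, context_precision,
                           context_recall, faithfulness)

# One row per eval sample, mirroring the dataset schema above.
eval_ds = Dataset.from_dict({
    "question":     ["What is the refund policy for digital products?"],
    "answer":       ["Digital products can be refunded within 14 days."],  # pipeline output
    "contexts":     [["Section 4.2 of Terms: Digital products are eligible for..."]],
    "ground_truth": ["Digital products can be refunded within 14 days if not downloaded."],
})
scores = evaluate(eval_ds, metrics=[faithfulness, answer_relevancy,
                                    context_precision, context_recall])
print(scores)  # log alongside chunk_size, model, top_k (step 4)
```
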
## Beyond RAGAS: Additional Metrics

| Metric | What It Measures | When to Add |
|--------|------------------|-------------|
| **Latency (p50/p95/p99)** | End-to-end response time | Always |
| **Cost per query** | Embedding + LLM tokens + infra | Production |
| **Answer correctness** | Exact/fuzzy match against ground truth | QA-style systems |
| **Noise robustness** | Performance with irrelevant chunks injected | High-noise corpora |
| **Information density** | Useful tokens in context / total context tokens | Cost optimization |
| **Refusal accuracy** | Correctly refuses unanswerable questions | Safety-critical |

## Continuous Evaluation Checklist

- [ ] Automated eval runs on every pipeline config change
- [ ] Metrics tracked over time with dashboards (not just point-in-time)
- [ ] Production query sampling feeds back into eval dataset
- [ ] Regression alerts: any metric drop > 5% triggers investigation
- [ ] Monthly review of ground truth dataset for staleness
- [ ] A/B testing framework for comparing pipeline variants

## Common Mistakes

| Mistake | Why It Hurts | Fix |
|---------|--------------|-----|
| Evaluating only on "happy path" queries | Misses real-world edge cases | Add adversarial, no-answer, and multi-hop samples |
| Using LLM to judge LLM without ground truth | Circular evaluation, inflated scores | Always include human-verified ground truth |
| Optimizing one metric in isolation | Faithfulness vs relevance tradeoff | Track all 4 RAGAS metrics together |
| Eval dataset too small (<20) | High variance, unreliable conclusions | Minimum 50, target 250+ |
| Never updating eval dataset | Drift from real user queries | Refresh quarterly from production logs |
| Ignoring latency and cost | Great quality but unusable in production | Include SLA-relevant metrics from day one |

package/departments/dev/skills/red-team/SKILL.md
@@ -110,3 +110,7 @@ Surface these issues WITHOUT being asked:
 ### Detection Gaps: <identified gaps in blue team coverage>
 ### Recommendation: Focus detection investment on choke point T1003
 ```
+
+## References
+
+- [mitre-attack-web.md](references/mitre-attack-web.md) — MITRE ATT&CK tactics and techniques mapped to web application attack surfaces

package/departments/dev/skills/red-team/references/mitre-attack-web.md
@@ -0,0 +1,165 @@

# MITRE ATT&CK for Web Applications — Deep Reference

> Tactics, techniques, detection strategies, and tools mapped to web application attack surfaces.

## ATT&CK Tactics Mapped to Web Apps

| # | Tactic | Goal | Web App Relevance |
|---|--------|------|-------------------|
| 1 | **Reconnaissance** (TA0043) | Gather information | Technology fingerprinting, endpoint discovery |
| 2 | **Initial Access** (TA0001) | Gain entry | Exploiting web vulnerabilities, credential abuse |
| 3 | **Execution** (TA0002) | Run attacker code | Server-side injection, client-side scripts |
| 4 | **Persistence** (TA0003) | Maintain access | Backdoor accounts, webshells, modified code |
| 5 | **Privilege Escalation** (TA0004) | Gain higher access | IDOR, broken access control, role manipulation |
| 6 | **Credential Access** (TA0006) | Steal credentials | Session hijacking, credential stuffing, token theft |
| 7 | **Collection** (TA0009) | Gather target data | Data scraping, API abuse, database extraction |
| 8 | **Exfiltration** (TA0010) | Steal data out | API data export, DNS tunneling, file download |

## Tactic 1: Reconnaissance

| Technique | MITRE ID | Method | Detection |
|-----------|----------|--------|-----------|
| Technology fingerprinting | T1592 | HTTP headers, error pages, `/robots.txt`, `sitemap.xml` | Monitor unusual crawling patterns |
| Endpoint enumeration | T1595.002 | Directory brute-force (`/admin`, `/api/v1/users`) | WAF rate limiting, 404 spike alerts |
| JavaScript analysis | T1592.004 | Source map extraction, API endpoint discovery in JS | Remove source maps in production |
| DNS enumeration | T1596.001 | Subdomain brute-force, certificate transparency logs | Monitor CT logs for your domains |
| Social engineering recon | T1598 | Employee LinkedIn, GitHub commits with emails | Security awareness training |

**Detection tools:** WAF logs, Cloudflare analytics, GoAccess, custom 404 rate alerts.

## Tactic 2: Initial Access

| Technique | MITRE ID | Method | Detection |
|-----------|----------|--------|-----------|
| SQL Injection | T1190 | `' OR 1=1--` in input fields, URL params | WAF SQL patterns, parameterized query audit |
| XSS (Stored/Reflected) | T1190 | `<script>` in user input, reflected URL params | CSP violations, output encoding audit |
| SSRF | T1190 | Internal URL in user-controlled params | Allowlist outbound requests, block RFC1918 |
| Authentication bypass | T1078 | Default credentials, JWT `none` algorithm | Credential audit, JWT validation hardening |
| API abuse | T1190 | Broken object-level authorization (BOLA) | Object-level access control audit |
| File upload exploitation | T1190 | Webshell upload via image field | File type validation (magic bytes), upload isolation |
| Dependency confusion | T1195.002 | Malicious package with internal name | Package namespace reservation, SBOM |

**Detection tools:** ModSecurity/WAF rules, Burp Suite, OWASP ZAP, Semgrep.

## Tactic 3: Execution

| Technique | MITRE ID | Method | Detection |
|-----------|----------|--------|-----------|
| Server-side template injection (SSTI) | T1059 | `{{7*7}}` in template inputs | Input sanitization, sandbox templates |
| OS command injection | T1059 | `; cat /etc/passwd` in shell-invoked params | Avoid `exec()`, use parameterized APIs |
| Deserialization attacks | T1059 | Crafted serialized objects (Java, PHP, Python) | Avoid native deserialization, use JSON |
| Server-side request forgery | T1059 | Trigger server to fetch attacker URLs | URL allowlisting, network segmentation |
| Webshell execution | T1059.004 | PHP/JSP shell uploaded or injected | File integrity monitoring, webshell scanners |

**Detection tools:** RASP (Runtime Application Self-Protection), file integrity monitoring, syscall auditing.

## Tactic 4: Persistence

| Technique | MITRE ID | Method | Detection |
|-----------|----------|--------|-----------|
| Backdoor admin account | T1136.001 | Create hidden admin via SQL injection or API | User account audit, alert on admin creation |
| Webshell deployment | T1505.003 | Upload PHP/JSP file to web root | File integrity monitoring (AIDE, Tripwire) |
| Scheduled task/cron | T1053 | Add cron job via command injection | Cron audit, immutable infrastructure |
| Modified application code | T1554 | Inject backdoor into deployed code | Git integrity checks, signed deployments |
| OAuth app installation | T1098.003 | Register malicious OAuth app with broad scopes | OAuth app audit, scope restriction policies |
| Cookie/session manipulation | T1556 | Forge long-lived session tokens | Short session TTL, token rotation |

**Detection tools:** Git diff monitoring, AIDE/Tripwire, deploy artifact checksums.

## Tactic 5: Privilege Escalation

| Technique | MITRE ID | Method | Detection |
|-----------|----------|--------|-----------|
| IDOR | T1548 | Change `user_id=123` to `user_id=456` in API | Object-level authorization on every endpoint |
| Role parameter tampering | T1548 | `{"role": "admin"}` in registration payload | Server-side role assignment, ignore client role |
| JWT manipulation | T1548 | Algorithm confusion (`none`, `HS256` vs `RS256`) | Strict algorithm validation, key management |
| Path traversal | T1548 | `../../etc/passwd` in file parameters | Canonicalize paths, chroot file access |
| GraphQL introspection abuse | T1548 | Discover admin mutations via schema introspection | Disable introspection in production |

**Detection tools:** Authorization test suites, Burp Autorize plugin, custom IDOR scanners.

## Tactic 6: Credential Access

| Technique | MITRE ID | Method | Detection |
|-----------|----------|--------|-----------|
| Credential stuffing | T1110.004 | Automated login with breached credential lists | Rate limiting, CAPTCHA, breach detection |
| Session hijacking | T1539 | XSS to steal `document.cookie` | HttpOnly + Secure flags, CSP |
| Token theft from storage | T1552 | Access localStorage/sessionStorage via XSS | Use httpOnly cookies, not localStorage |
| Password spraying | T1110.003 | Common passwords across many accounts | Account lockout, anomaly detection |
| OAuth token theft | T1528 | Phishing redirect to attacker OAuth callback | Strict redirect URI validation |
| API key extraction | T1552.001 | Keys in client-side JavaScript, git history | Server-side key management, git-secrets |

**Detection tools:** Have I Been Pwned API, fail2ban, rate limiter middleware, git-secrets.

## Detection Strategy Matrix

| Layer | What to Monitor | Tools | Alert Threshold |
|-------|-----------------|-------|-----------------|
| **Edge/CDN** | Request rate, geo anomalies, bot score | Cloudflare, AWS WAF | Rate > 100 req/s per IP |
| **WAF** | SQL/XSS patterns, protocol violations | ModSecurity, AWS WAF | Any rule match on critical endpoints |
| **Application** | Auth failures, IDOR attempts, input validation | Custom logging, RASP | > 5 auth failures per account/min |
| **API Gateway** | Rate limits, schema violations, unusual patterns | Kong, AWS API GW | Schema violation on sensitive endpoints |
| **Database** | Unusual queries, bulk data access, schema changes | Query logging, DBA alerts | > 1000 rows returned in single query |
| **Infrastructure** | Outbound connections, file changes, new processes | OSSEC, Falco | Any unexpected outbound connection |

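The Application-layer threshold above (> 5 auth failures per account per minute) as an in-memory sliding-window sketch; a real deployment would back this with Redis or emit events to the SIEM instead:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS, THRESHOLD = 60, 5
_failures: dict[str, deque] = defaultdict(deque)

def record_auth_failure(account: str) -> bool:
    """Record one failed login; return True once the account crosses
    the > 5 failures/minute alert threshold."""
    now = time.monotonic()
    events = _failures[account]
    events.append(now)
    while events and now - events[0] > WINDOW_SECONDS:
        events.popleft()  # drop failures outside the window
    return len(events) > THRESHOLD
```
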
## Attack Path Examples

### Path 1: Data Exfiltration via IDOR

```
Recon (endpoint enumeration)
  --> Initial Access (BOLA on GET /api/users/{id})
  --> Collection (iterate all user IDs)
  --> Exfiltration (bulk download via API)
```

Choke point: Object-level authorization on API endpoints.

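That choke point as code: a framework-neutral sketch with hypothetical `db` and `current_user` objects, not a drop-in for any specific stack:

```python
from http import HTTPStatus

def get_user(requested_id: int, current_user, db):
    """Object-level authorization: never trust the id in the URL alone."""
    record = db.users.get(requested_id)
    if record is None:
        return HTTPStatus.NOT_FOUND
    if record.owner_id != current_user.id and not current_user.is_admin:
        # Log the denial -- repeated failures here are the IDOR signal.
        return HTTPStatus.FORBIDDEN
    return record
```
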
### Path 2: Webshell via File Upload

```
Recon (identify upload functionality)
  --> Initial Access (upload PHP file as image)
  --> Execution (access webshell URL)
  --> Persistence (webshell remains on disk)
  --> Privilege Escalation (server runs as www-data, pivot to root)
```

Choke point: File upload validation (magic bytes, extension, isolated storage).

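A sketch of the magic-byte part of that validation, covering two common image signatures; production validators should also check the extension, re-encode the image, and store uploads outside the web root:

```python
IMAGE_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
}

def sniff_image(data: bytes) -> str | None:
    """Return the detected MIME type, or None if the payload does not
    start with a known image signature (e.g. a disguised PHP webshell)."""
    for signature, mime in IMAGE_SIGNATURES.items():
        if data.startswith(signature):
            return mime
    return None
```
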
### Path 3: Account Takeover via XSS

```
Recon (find reflected XSS in search)
  --> Initial Access (craft XSS payload URL)
  --> Credential Access (steal session cookie)
  --> Privilege Escalation (target is admin user)
  --> Collection (access admin dashboard data)
```

Choke point: Content Security Policy + HttpOnly cookies.

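Both controls sketched as Flask hooks (Flask is an illustrative choice here, not something this reference prescribes):

```python
from flask import Flask

app = Flask(__name__)

@app.after_request
def harden(resp):
    # Blocks inline script execution and reports attempts (detection #2 below).
    resp.headers["Content-Security-Policy"] = "default-src 'self'; report-uri /csp-report"
    return resp

@app.route("/login", methods=["POST"])
def login():
    resp = app.make_response("ok")
    # HttpOnly keeps document.cookie out of reach of injected scripts.
    resp.set_cookie("session", "<token>", httponly=True, secure=True, samesite="Lax")
    return resp
```
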
## Tools by Phase

| Phase | Offensive Tool | Defensive Tool |
|-------|----------------|----------------|
| Reconnaissance | Amass, subfinder, httpx | Cloudflare, GoAccess |
| Initial Access | Burp Suite, SQLMap, OWASP ZAP | ModSecurity, Semgrep |
| Execution | Commix, tplmap | RASP, Falco |
| Persistence | Weevely, webshell generators | AIDE, Tripwire, OSSEC |
| Privilege Escalation | Autorize (Burp), custom scripts | Authorization test suites |
| Credential Access | Hydra, Patator | fail2ban, rate limiters |
| Exfiltration | Custom scripts, DNS tunneling | DLP, egress filtering |

## Quick Reference: Top 10 Web App Detections to Implement

1. Rate limit authentication endpoints (5 failures / minute / account)
2. CSP header with report-uri (detect XSS attempts)
3. WAF with OWASP Core Rule Set (broad coverage)
4. File integrity monitoring on web root (detect webshells)
5. Alert on admin account creation (detect backdoor accounts)
6. Log and alert on authorization failures (detect IDOR)
7. Monitor outbound connections from web servers (detect SSRF/exfil)
8. Alert on bulk data access patterns (detect scraping/exfil)
9. JWT validation with strict algorithm enforcement (detect token manipulation; see the sketch after this list)
10. Dependency vulnerability scanning in CI/CD (detect supply chain attacks)

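Detection 9 sketched with PyJWT (assuming `pip install pyjwt[crypto]`); pinning `algorithms` defeats both the `none` and the HS256/RS256 confusion attacks listed under Privilege Escalation:

```python
import jwt  # PyJWT

def verify_token(token: str, rsa_public_key: str) -> dict:
    return jwt.decode(
        token,
        rsa_public_key,
        algorithms=["RS256"],          # never derive this from the token header
        options={"require": ["exp"]},  # reject tokens without an expiry
    )
```
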
package/departments/dev/skills/security-audit/SKILL.md
@@ -66,3 +66,7 @@ Permissions-Policy: camera=(), microphone=(), geolocation=()
 ### Summary: 1 critical, 1 high, 1 medium, 1 low
 ### Recommendation: BLOCK release until C1 and H1 resolved
 ```
+
+## References
+
+- [owasp-2025-deep.md](references/owasp-2025-deep.md) — OWASP Top 10 (2025) with vulnerable and fixed code examples, testing methodology, and tools