PyPI - rota - Versions diffs - 0.0.post1__tar.gz - Mend

rota 0.0.post1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (220)

rota-0.0.post1/.env.example +15 -0
rota-0.0.post1/.github/RELEASE_NOTES_v0.1.1.md +122 -0
rota-0.0.post1/.github/workflows/create-release.yml +38 -0
rota-0.0.post1/.github/workflows/publish-to-pypi.yml +45 -0
rota-0.0.post1/.kiro/specs/cve-neo4j-integration/design.md +490 -0
rota-0.0.post1/.kiro/specs/cve-neo4j-integration/requirements.md +97 -0
rota-0.0.post1/.kiro/specs/cve-neo4j-integration/tasks.md +96 -0
rota-0.0.post1/.kiro/specs/kev-integration/design.md +419 -0
rota-0.0.post1/.kiro/specs/kev-integration/requirements.md +140 -0
rota-0.0.post1/.kiro/specs/kev-integration/tasks.md +392 -0
rota-0.0.post1/.kiro/specs/paper-evaluation-framework/design.md +574 -0
rota-0.0.post1/.kiro/specs/paper-evaluation-framework/requirements.md +130 -0
rota-0.0.post1/.kiro/specs/paper-evaluation-framework/tasks.md +348 -0
rota-0.0.post1/.kiro/specs/refactor-to-rota/design.md +587 -0
rota-0.0.post1/.kiro/specs/refactor-to-rota/requirements.md +233 -0
rota-0.0.post1/.kiro/specs/refactor-to-rota/tasks.md +566 -0
rota-0.0.post1/.kiro/specs/zero-day-prediction-system/design.md +1022 -0
rota-0.0.post1/.kiro/specs/zero-day-prediction-system/requirements.md +137 -0
rota-0.0.post1/.kiro/specs/zero-day-prediction-system/tasks.md +279 -0
rota-0.0.post1/.kiro/steering/product.md +21 -0
rota-0.0.post1/.kiro/steering/research.md +70 -0
rota-0.0.post1/.kiro/steering/structure.md +55 -0
rota-0.0.post1/.kiro/steering/tech.md +50 -0
rota-0.0.post1/.pypirc.example +16 -0
rota-0.0.post1/CHANGELOG.md +90 -0
rota-0.0.post1/LICENSE +21 -0
rota-0.0.post1/MANIFEST.in +17 -0
rota-0.0.post1/PKG-INFO +426 -0
rota-0.0.post1/QUICKSTART.md +332 -0
rota-0.0.post1/README.md +375 -0
rota-0.0.post1/config/cve_config.yaml +42 -0
rota-0.0.post1/config/cve_test_config.yaml +29 -0
rota-0.0.post1/config/epss_config.yaml +17 -0
rota-0.0.post1/config/example_config.yaml +16 -0
rota-0.0.post1/config/exploit_config.yaml +15 -0
rota-0.0.post1/config/github_advisory_config.yaml +30 -0
rota-0.0.post1/dashboard/README.md +66 -0
rota-0.0.post1/dashboard/app.py +270 -0
rota-0.0.post1/docs/README.md +53 -0
rota-0.0.post1/docs/data-roadmap.md +196 -0
rota-0.0.post1/docs/graphiti-comparison.md +137 -0
rota-0.0.post1/docs/guides/release-guide.md +192 -0
rota-0.0.post1/docs/guides/temporal-setup.md +160 -0
rota-0.0.post1/docs/performance-optimization.md +185 -0
rota-0.0.post1/docs/research-directions.md +268 -0
rota-0.0.post1/docs/system-overview.md +357 -0
rota-0.0.post1/docs/usage-guide.md +531 -0
rota-0.0.post1/graph/.env.example +21 -0
rota-0.0.post1/graph/.kiro/steering/agent.md +190 -0
rota-0.0.post1/graph/.kiro/steering/article.md +266 -0
rota-0.0.post1/graph/.kiro/steering/methodology.md +444 -0
rota-0.0.post1/graph/.kiro/steering/product.md +195 -0
rota-0.0.post1/graph/.kiro/steering/structure.md +125 -0
rota-0.0.post1/graph/.kiro/steering/tech.md +172 -0
rota-0.0.post1/graph/ARCHITECTURE.md +322 -0
rota-0.0.post1/graph/QUICKSTART.md +205 -0
rota-0.0.post1/graph/README.md +285 -0
rota-0.0.post1/graph/configs/llm_config.yaml +81 -0
rota-0.0.post1/graph/configs/signal_weights.yaml +52 -0
rota-0.0.post1/graph/docker/Dockerfile.python +41 -0
rota-0.0.post1/graph/docker-compose.yml +56 -0
rota-0.0.post1/graph/requirements.txt +48 -0
rota-0.0.post1/graph/scripts/download_neo4j_benchmark.py +130 -0
rota-0.0.post1/graph/scripts/download_synthcypher.py +179 -0
rota-0.0.post1/graph/scripts/generate_research_proposal.py +331 -0
rota-0.0.post1/graph/scripts/generate_research_proposal_simple.py +267 -0
rota-0.0.post1/graph/scripts/generate_semester_proposal.py +476 -0
rota-0.0.post1/graph/scripts/phase1_collect_data.py +112 -0
rota-0.0.post1/graph/scripts/phase2_extract_signals.py +118 -0
rota-0.0.post1/graph/scripts/phase4_llm_prediction.py +107 -0
rota-0.0.post1/graph/scripts/run_full_pipeline.py +138 -0
rota-0.0.post1/graph/scripts/simple_data_collector.py +80 -0
rota-0.0.post1/graph/scripts/validate_log4shell.py +139 -0
rota-0.0.post1/graph/setup_environment.ps1 +37 -0
rota-0.0.post1/graph/setup_wsl_environment.sh +61 -0
rota-0.0.post1/graph/src/__init__.py +6 -0
rota-0.0.post1/graph/src/api/main.py +209 -0
rota-0.0.post1/graph/src/data_collection/__init__.py +6 -0
rota-0.0.post1/graph/src/data_collection/package_collector.py +142 -0
rota-0.0.post1/graph/src/data_collection/vulnerability_collector.py +166 -0
rota-0.0.post1/graph/src/graph_analysis/__init__.py +5 -0
rota-0.0.post1/graph/src/graph_analysis/dependency_graph.py +223 -0
rota-0.0.post1/graph/src/llm_reasoning/__init__.py +5 -0
rota-0.0.post1/graph/src/llm_reasoning/risk_predictor.py +257 -0
rota-0.0.post1/graph/src/risk_scoring/__init__.py +5 -0
rota-0.0.post1/graph/src/risk_scoring/latent_risk_calculator.py +205 -0
rota-0.0.post1/graph/src/signal_extraction/__init__.py +5 -0
rota-0.0.post1/graph/src/signal_extraction/vulnerability_patterns.py +163 -0
rota-0.0.post1/graph/src/validation/__init__.py +5 -0
rota-0.0.post1/graph/src/validation/historical_validator.py +215 -0
rota-0.0.post1/graph/tests/test_graph_analysis.py +105 -0
rota-0.0.post1/graph/tests/test_signal_extraction.py +93 -0
rota-0.0.post1/graph/연구계획서.md +300 -0
rota-0.0.post1/graph/연구계획서_한학기.md +391 -0
rota-0.0.post1/pyproject.toml +90 -0
rota-0.0.post1/requirements.txt +20 -0
rota-0.0.post1/results/paper/dataset_test/statistics.json +27 -0
rota-0.0.post1/results/paper/dataset_test3/statistics.json +220 -0
rota-0.0.post1/results/paper/validation_real/metrics.json +16 -0
rota-0.0.post1/results/paper/validation_test/metrics.json +16 -0
rota-0.0.post1/scripts/GIT_COMMANDS.ps1 +33 -0
rota-0.0.post1/scripts/GIT_COMMANDS.sh +47 -0
rota-0.0.post1/scripts/README.md +77 -0
rota-0.0.post1/scripts/archive/benchmark_signal_collection.py +180 -0
rota-0.0.post1/scripts/archive/collect_critical_cves.py +140 -0
rota-0.0.post1/scripts/archive/collect_data.py +43 -0
rota-0.0.post1/scripts/archive/collect_opensource_cves.py +209 -0
rota-0.0.post1/scripts/archive/load_cve_with_graphiti.py +285 -0
rota-0.0.post1/scripts/archive/run_historical_validation_mock.py +187 -0
rota-0.0.post1/scripts/archive/test_prediction_concept.py +190 -0
rota-0.0.post1/scripts/archive/tune_threshold.py +148 -0
rota-0.0.post1/scripts/check_neo4j_data.py +67 -0
rota-0.0.post1/scripts/collection/collect_cve_data.py +128 -0
rota-0.0.post1/scripts/collection/collect_epss.py +106 -0
rota-0.0.post1/scripts/collection/collect_exploits.py +124 -0
rota-0.0.post1/scripts/collection/collect_github_advisory.py +118 -0
rota-0.0.post1/scripts/collection/collect_paper_dataset.py +188 -0
rota-0.0.post1/scripts/create_release.ps1 +51 -0
rota-0.0.post1/scripts/create_release.sh +55 -0
rota-0.0.post1/scripts/deployment/publish_to_pypi.py +125 -0
rota-0.0.post1/scripts/experiments/historical_validation.py +218 -0
rota-0.0.post1/scripts/experiments/run_historical_validation.py +139 -0
rota-0.0.post1/scripts/experiments/run_prediction_demo.py +199 -0
rota-0.0.post1/scripts/loading/load_advisory_to_neo4j.py +192 -0
rota-0.0.post1/scripts/loading/load_cve_to_neo4j.py +249 -0
rota-0.0.post1/scripts/loading/load_epss_to_neo4j.py +105 -0
rota-0.0.post1/scripts/loading/load_exploits_to_neo4j.py +147 -0
rota-0.0.post1/setup.cfg +4 -0
rota-0.0.post1/src/rota/__init__.py +17 -0
rota-0.0.post1/src/rota/__main__.py +6 -0
rota-0.0.post1/src/rota/__version__.py +12 -0
rota-0.0.post1/src/rota/_version.py +34 -0
rota-0.0.post1/src/rota/axle/__init__.py +14 -0
rota-0.0.post1/src/rota/cli/__init__.py +14 -0
rota-0.0.post1/src/rota/cli/main.py +457 -0
rota-0.0.post1/src/rota/config.py +116 -0
rota-0.0.post1/src/rota/hub/__init__.py +15 -0
rota-0.0.post1/src/rota/hub/connection.py +72 -0
rota-0.0.post1/src/rota/hub/loader.py +603 -0
rota-0.0.post1/src/rota/hub/query.py +377 -0
rota-0.0.post1/src/rota/hub/supply_chain.py +440 -0
rota-0.0.post1/src/rota/oracle/__init__.py +6 -0
rota-0.0.post1/src/rota/oracle/commit_analyzer.py +443 -0
rota-0.0.post1/src/rota/oracle/integrated_oracle.py +366 -0
rota-0.0.post1/src/rota/oracle/predictor.py +583 -0
rota-0.0.post1/src/rota/oracle/prompts/analysis.jinja2 +42 -0
rota-0.0.post1/src/rota/oracle/prompts/prediction.jinja2 +116 -0
rota-0.0.post1/src/rota/py.typed +1 -0
rota-0.0.post1/src/rota/spokes/__init__.py +30 -0
rota-0.0.post1/src/rota/spokes/base.py +218 -0
rota-0.0.post1/src/rota/spokes/cve.py +251 -0
rota-0.0.post1/src/rota/spokes/cwe.py +159 -0
rota-0.0.post1/src/rota/spokes/epss.py +120 -0
rota-0.0.post1/src/rota/spokes/github.py +323 -0
rota-0.0.post1/src/rota/spokes/kev.py +85 -0
rota-0.0.post1/src/rota/spokes/package.py +382 -0
rota-0.0.post1/src/rota/utils/__init__.py +11 -0
rota-0.0.post1/src/rota/wheel/__init__.py +14 -0
rota-0.0.post1/src/rota.egg-info/PKG-INFO +426 -0
rota-0.0.post1/src/rota.egg-info/SOURCES.txt +218 -0
rota-0.0.post1/src/rota.egg-info/dependency_links.txt +1 -0
rota-0.0.post1/src/rota.egg-info/entry_points.txt +2 -0
rota-0.0.post1/src/rota.egg-info/requires.txt +23 -0
rota-0.0.post1/src/rota.egg-info/top_level.txt +2 -0
rota-0.0.post1/src/zero_day_defense/__init__.py +43 -0
rota-0.0.post1/src/zero_day_defense/cli.py +149 -0
rota-0.0.post1/src/zero_day_defense/config.py +68 -0
rota-0.0.post1/src/zero_day_defense/data_sources/__init__.py +17 -0
rota-0.0.post1/src/zero_day_defense/data_sources/base.py +73 -0
rota-0.0.post1/src/zero_day_defense/data_sources/cve.py +186 -0
rota-0.0.post1/src/zero_day_defense/data_sources/epss.py +75 -0
rota-0.0.post1/src/zero_day_defense/data_sources/exploit_db.py +94 -0
rota-0.0.post1/src/zero_day_defense/data_sources/github.py +124 -0
rota-0.0.post1/src/zero_day_defense/data_sources/github_advisory.py +128 -0
rota-0.0.post1/src/zero_day_defense/data_sources/maven.py +58 -0
rota-0.0.post1/src/zero_day_defense/data_sources/npm.py +42 -0
rota-0.0.post1/src/zero_day_defense/data_sources/pypi.py +48 -0
rota-0.0.post1/src/zero_day_defense/evaluation/__init__.py +18 -0
rota-0.0.post1/src/zero_day_defense/evaluation/ablation/__init__.py +9 -0
rota-0.0.post1/src/zero_day_defense/evaluation/baselines/__init__.py +15 -0
rota-0.0.post1/src/zero_day_defense/evaluation/dataset/__init__.py +11 -0
rota-0.0.post1/src/zero_day_defense/evaluation/dataset/collector.py +400 -0
rota-0.0.post1/src/zero_day_defense/evaluation/dataset/statistics.py +336 -0
rota-0.0.post1/src/zero_day_defense/evaluation/dataset/validator.py +311 -0
rota-0.0.post1/src/zero_day_defense/evaluation/results/__init__.py +13 -0
rota-0.0.post1/src/zero_day_defense/evaluation/statistics/__init__.py +11 -0
rota-0.0.post1/src/zero_day_defense/evaluation/validation/__init__.py +9 -0
rota-0.0.post1/src/zero_day_defense/evaluation/validation/metrics.py +125 -0
rota-0.0.post1/src/zero_day_defense/evaluation/validation/temporal_splitter.py +198 -0
rota-0.0.post1/src/zero_day_defense/pipeline.py +86 -0
rota-0.0.post1/src/zero_day_defense/prediction/__init__.py +27 -0
rota-0.0.post1/src/zero_day_defense/prediction/agents/__init__.py +11 -0
rota-0.0.post1/src/zero_day_defense/prediction/agents/recommendation.py +123 -0
rota-0.0.post1/src/zero_day_defense/prediction/agents/signal_analyzer.py +226 -0
rota-0.0.post1/src/zero_day_defense/prediction/agents/threat_assessment.py +205 -0
rota-0.0.post1/src/zero_day_defense/prediction/engine/__init__.py +9 -0
rota-0.0.post1/src/zero_day_defense/prediction/engine/clusterer.py +272 -0
rota-0.0.post1/src/zero_day_defense/prediction/engine/scorer.py +208 -0
rota-0.0.post1/src/zero_day_defense/prediction/exceptions.py +57 -0
rota-0.0.post1/src/zero_day_defense/prediction/feature_engineering/__init__.py +11 -0
rota-0.0.post1/src/zero_day_defense/prediction/feature_engineering/builder.py +159 -0
rota-0.0.post1/src/zero_day_defense/prediction/feature_engineering/embedder.py +191 -0
rota-0.0.post1/src/zero_day_defense/prediction/feature_engineering/extractor.py +438 -0
rota-0.0.post1/src/zero_day_defense/prediction/models.py +163 -0
rota-0.0.post1/src/zero_day_defense/prediction/signal_collectors/__init__.py +11 -0
rota-0.0.post1/src/zero_day_defense/prediction/signal_collectors/github_signals.py +534 -0
rota-0.0.post1/src/zero_day_defense/prediction/signal_collectors/github_signals_fast.py +373 -0
rota-0.0.post1/src/zero_day_defense/prediction/signal_collectors/package_signals.py +56 -0
rota-0.0.post1/src/zero_day_defense/prediction/signal_collectors/storage.py +172 -0
rota-0.0.post1/src/zero_day_defense/prediction/validation/__init__.py +9 -0
rota-0.0.post1/src/zero_day_defense/prediction/validation/feedback.py +38 -0
rota-0.0.post1/src/zero_day_defense/prediction/validation/validator.py +137 -0
rota-0.0.post1/src/zero_day_defense/py.typed +0 -0
rota-0.0.post1/src/zero_day_defense.egg-info/PKG-INFO +55 -0
rota-0.0.post1/src/zero_day_defense.egg-info/SOURCES.txt +17 -0
rota-0.0.post1/src/zero_day_defense.egg-info/dependency_links.txt +1 -0
rota-0.0.post1/src/zero_day_defense.egg-info/requires.txt +3 -0
rota-0.0.post1/src/zero_day_defense.egg-info/top_level.txt +1 -0
rota-0.0.post1/tests/test_end_to_end.py +277 -0
rota-0.0.post1/tests/test_integrated_oracle.py +147 -0

rota-0.0.post1/.env.example ADDED

@@ -0,0 +1,15 @@
+# GitHub API Token
+# Get your token from: https://github.com/settings/tokens
+# Required scopes: repo (for private repos) or public_repo (for public repos only)
+GITHUB_TOKEN=your_github_personal_access_token_here
+# Gemini API Key
+# Get your key from: https://makersuite.google.com/app/apikey
+GEMINI_API_KEY=your_gemini_api_key_here
+# Or use GOOGLE_API_KEY instead
+# GOOGLE_API_KEY=your_google_api_key_here
+# Neo4j Configuration (Optional - for data persistence)
+NEO4J_URI=bolt://localhost:7687
+NEO4J_USERNAME=neo4j
+NEO4J_PASSWORD=your_neo4j_password_here

rota-0.0.post1/.github/RELEASE_NOTES_v0.1.1.md ADDED

@@ -0,0 +1,122 @@
+# ROTA v0.1.1 - Initial PyPI Release 🎉
+We're excited to announce the first official release of **ROTA** (Real-time Operational Threat Assessment) on PyPI!
+## 🚀 Installation
+```bash
+pip install rota
+```
+## ✨ What's New
+### Core Features
+- **Real-time Vulnerability Prediction**: AI-powered analysis of code changes
+- **Multi-source Data Collection**: CVE, GitHub, EPSS, Exploit-DB integration
+- **Historical Validation Framework**: Validated on 80+ real CVE cases
+- **CLI Interface**: Easy-to-use command-line tools
+- **Python API**: Programmatic access for automation
+### Command-Line Interface
+```bash
+# Analyze repository risk
+rota predict --repo django/django --commit abc123
+# Collect security data
+rota collect --source cve --output data.jsonl
+# Run historical validation
+rota validate --dataset cves.jsonl --output results/
+```
+### Python API
+```python
+from rota import analyze_code_push
+result = analyze_code_push("django/django", "abc123")
+print(f"Risk Score: {result['risk_score']}")
+```
+## 📊 Validation Results
+- **Dataset**: 80 CVEs from Django (2007-2024)
+- **Pilot Study**: 3 CVEs validated with real GitHub data
+- **Average Lead Time**: 90 days before CVE disclosure
+- **Execution Time**: ~22 minutes per CVE
+## 🏗️ Architecture
+- **Spokes**: Multi-source data collectors (CVE, GitHub, EPSS, etc.)
+- **Hub**: Neo4j-based knowledge graph integration
+- **Wheel**: Pattern analysis and clustering
+- **Oracle**: AI-powered prediction engine
+- **Axle**: Historical validation framework
+## 📦 What's Included
+### Data Sources
+- CVE (NVD)
+- GitHub Advisory
+- EPSS Scores
+- Exploit-DB
+- Package Registries (PyPI, npm, Maven)
+### Prediction Components
+- GitHub signal collectors
+- Feature engineering (20+ behavioral features)
+- Risk scoring engine
+- Temporal pattern analysis
+### Evaluation Framework
+- Dataset collection automation
+- Historical validation with temporal splitting
+- Performance metrics (Precision, Recall, F1, Lead Time)
+- Baseline comparisons
+## 🔧 Requirements
+- Python 3.10+
+- GitHub API token (for full functionality)
+- Optional: Neo4j for graph analysis
+## 📚 Documentation
+- [Quick Start Guide](https://github.com/susie-Choi/rota/blob/main/HOW_TO_PUBLISH.md)
+- [API Documentation](https://github.com/susie-Choi/rota/blob/main/docs/)
+- [Paper Evaluation Framework](https://github.com/susie-Choi/rota/blob/main/docs/PAPER_FRAMEWORK_SUMMARY.md)
+## 🐛 Known Issues
+- Historical validation can be slow (~22 min/CVE) due to GitHub API rate limits
+- Currently focused on GitHub repositories
+- Limited to English language repositories
+## 🔮 Future Plans
+- GraphQL API integration for better performance
+- Support for additional version control systems
+- Enhanced machine learning models
+- Real-time dashboard improvements
+- Enterprise features
+## 🙏 Acknowledgments
+This project is part of ongoing research on LLM-based pre-signal analysis for predicting potential vulnerabilities in software ecosystems.
+## 📝 Changelog
+See [CHANGELOG.md](https://github.com/susie-Choi/rota/blob/main/CHANGELOG.md) for detailed changes.
+## 🤝 Contributing
+Contributions are welcome! Please feel free to submit issues and pull requests.
+## 📄 License
+MIT License - see [LICENSE](https://github.com/susie-Choi/rota/blob/main/LICENSE) for details.
+---
+**Install now**: `pip install rota`
+**Star us on GitHub**: https://github.com/susie-Choi/rota ⭐

rota-0.0.post1/.github/workflows/create-release.yml ADDED

@@ -0,0 +1,38 @@
+name: Create Release
+on:
+  push:
+    tags:
+      - 'v*'  # Trigger on version tags like v0.1.3
+jobs:
+  create-release:
+    runs-on: ubuntu-latest
+    permissions:
+      contents: write
+    steps:
+    - uses: actions/checkout@v4
+      with:
+        fetch-depth: 0
+    - name: Extract version from tag
+      id: get_version
+      run: echo "VERSION=${GITHUB_REF#refs/tags/v}" >> $GITHUB_OUTPUT
+    - name: Extract changelog
+      id: changelog
+      run: |
+        # Extract changelog for this version
+        VERSION=${{ steps.get_version.outputs.VERSION }}
+        sed -n "/## \[$VERSION\]/,/## \[/p" CHANGELOG.md | sed '$d' > release_notes.md
+    - name: Create Release
+      uses: softprops/action-gh-release@v1
+      with:
+        name: v${{ steps.get_version.outputs.VERSION }}
+        body_path: release_notes.md
+        draft: false
+        prerelease: false
+      env:
+        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

rota-0.0.post1/.github/workflows/publish-to-pypi.yml ADDED

@@ -0,0 +1,45 @@
+name: Publish to PyPI
+on:
+  push:
+    branches:
+      - main
+    paths:
+      - 'pyproject.toml'
+      - 'src/**'
+      - '.github/workflows/publish-to-pypi.yml'
+  workflow_dispatch:  # Manual trigger option
+  release:
+    types: [published]
+jobs:
+  build-and-publish:
+    runs-on: ubuntu-latest
+    steps:
+    - uses: actions/checkout@v4
+    - name: Set up Python
+      uses: actions/setup-python@v4
+      with:
+        python-version: '3.10'
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install build twine
+    - name: Clean dist directory
+      run: rm -rf dist/ build/ *.egg-info
+    - name: Build package
+      run: python -m build
+    - name: Check package
+      run: python -m twine check dist/*
+    - name: Publish to PyPI
+      env:
+        TWINE_USERNAME: __token__
+        TWINE_PASSWORD: ${{ secrets.PYPI_API_TOKEN }}
+      run: python -m twine upload dist/*

rota-0.0.post1/.kiro/specs/cve-neo4j-integration/design.md ADDED

@@ -0,0 +1,490 @@
+# Design Document
+## Overview
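+The CVE-Neo4j integration collects CVE data from the NVD (National Vulnerability Database) and loads it into a Neo4j graph database so that relationships between vulnerabilities can be analyzed. It extends the existing Zero-Day Defense data pipeline architecture and provides a graph-based analysis experience similar to Palantir Ontology.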
+## Architecture
+### High-Level Architecture
+```mermaid
+graph LR
+    A[NVD API] -->|HTTP Request| B[CVEDataSource]
+    B -->|SourceResult| C[JSONL Storage]
+    C -->|Read| D[CVEGraphLoader]
+    D -->|Cypher Queries| E[Neo4j Database]
+    F[YAML Config] -->|Configuration| B
+    F -->|Configuration| D
+```
+### Component Layers
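+1. **Data Collection Layer**: Communicates with the NVD API to collect CVE data
+2. **Storage Layer**: Stores intermediate data in JSONL format
+3. **Graph Loading Layer**: Transforms the data into a graph structure and loads it into Neo4j
+4. **Configuration Layer**: Manages YAML-based configuration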
+## Components and Interfaces
+### 1. CVEDataSource
+**Purpose**: Data source that collects CVE data through the NVD API 2.0
+**Class Structure**:
+```python
+class CVEDataSource(BaseDataSource):
+    source_name: str = "nvd_cve"
+    BASE_URL: str = "https://services.nvd.nist.gov/rest/json/cves/2.0"
+    def __init__(
+        self,
+        *,
+        timeout: float = 30.0,
+        rate_limit_sleep: float = 6.0,
+        api_key: Optional[str] = None,
+        **kwargs
+    ) -> None: ...
+    def collect_by_cve_id(self, cve_id: str, *, cutoff: datetime) -> SourceResult: ...
+    def collect_by_keyword(self, keyword: str, *, cutoff: datetime, max_results: int = 100) -> SourceResult: ...
+    def collect_by_cpe(self, cpe_name: str, *, cutoff: datetime, max_results: int = 100) -> SourceResult: ...
+    def collect(self, package: str, *, cutoff: datetime) -> SourceResult: ...
+```
+**Key Features**:
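+- Inherits from BaseDataSource to integrate with the existing architecture
+- Supports three collection modes: CVE ID, keyword, and CPE
+- Adjusts the rate limit dynamically depending on whether an API key is present (6 s → 0.6 s)
+- Temporal filtering based on cutoff_date
+**Rate Limiting Strategy**:
+- Without an API key: wait 6 seconds (NVD public rate limit)
+- With an API key: wait 0.6 seconds (50 requests per 30 seconds)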
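The rate limiting strategy above can be sketched as a small helper. This is an illustration only, not code from the package: the function name `resolve_rate_limit_sleep` and both constants are hypothetical, and only the 6-second and 0.6-second delays come from the design text.

```python
from typing import Optional

# Delays quoted in the design document; the names are hypothetical.
PUBLIC_SLEEP_SECONDS = 6.0   # no API key
KEYED_SLEEP_SECONDS = 0.6    # API key present (50 requests per 30 seconds)


def resolve_rate_limit_sleep(
    api_key: Optional[str],
    configured: float = PUBLIC_SLEEP_SECONDS,
) -> float:
    """Return the delay in seconds to wait between NVD requests."""
    if api_key:
        # A key raises the quota, so the shorter delay can be used.
        return min(configured, KEYED_SLEEP_SECONDS)
    # Without a key, never drop below the public-limit delay.
    return max(configured, PUBLIC_SLEEP_SECONDS)
```

Clamping the configured value, rather than trusting it, is a design choice of this sketch: it keeps a misconfigured `rate_limit_sleep` from exceeding the public quota when no key is set.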
+### 2. CVEGraphLoader
+**Purpose**: Reads CVE data from a JSONL file and converts it into a Neo4j graph
+**Class Structure**:
+```python
+class CVEGraphLoader:
+    def __init__(self, uri: str, username: str, password: str) -> None: ...
+    def close(self) -> None: ...
+    def create_constraints(self) -> None: ...
+    def load_cve(self, cve_data: Dict[str, Any]) -> None: ...
+    def load_from_jsonl(self, jsonl_path: Path) -> None: ...
+```
+**Key Features**:
+- Uses the Neo4j Python driver
+- Uniqueness constraints prevent duplicates and improve performance
+- MERGE pattern guarantees idempotency
+- Batch processing and progress logging
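To illustrate why the MERGE pattern is idempotent, the sketch below builds a parameterised upsert for one CVE node. It is an assumption about how such a statement could look, not the loader's actual query: the helper name `build_cve_merge` is hypothetical, and only the `CVE` label and property names follow the schema in this document.

```python
from typing import Any, Dict, Tuple


def build_cve_merge(cve: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Build a parameterised upsert for a single CVE node.

    MERGE matches on the unique `id`, so running the statement twice
    updates the same node instead of creating a duplicate.
    """
    query = (
        "MERGE (c:CVE {id: $id}) "
        "SET c.published = $published, c.vulnStatus = $vulnStatus"
    )
    params = {
        "id": cve["id"],
        "published": cve.get("published"),
        "vulnStatus": cve.get("vulnStatus"),
    }
    return query, params
```

With the Neo4j Python driver, the pair would typically be passed to `session.run(query, params)`. Keeping values in parameters rather than formatting them into the query string is also what lets the driver handle escaping.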
+### 3. Command-Line Scripts
+#### collect_cve_data.py
+**Purpose**: Runs CVE data collection
+**Arguments**:
+- `config`: Path to the YAML configuration file (required)
+- `--log-level`: Logging level (default: INFO)
+- `--output`: Output file path override (optional)
+**Workflow**:
+1. Load the YAML configuration
+2. Check for an NVD API key (config or environment variable)
+3. Initialize CVEDataSource
+4. Collect data for each CVE target
+5. Save in JSONL format
+6. Print collection statistics
+#### load_cve_to_neo4j.py
+**Purpose**: Loads JSONL data into Neo4j
+**Arguments**:
+- `jsonl_file`: Path to the JSONL file (required)
+- `--uri`: Neo4j URI (default: bolt://localhost:7687)
+- `--username`: Neo4j username (default: neo4j)
+- `--password`: Neo4j password (required)
+- `--log-level`: Logging level (default: INFO)
+**Workflow**:
+1. Set up the Neo4j connection
+2. Create uniqueness constraints
+3. Read the JSONL file
+4. Create graph nodes and relationships for each CVE
+5. Print load statistics
+## Data Models
+### Neo4j Graph Schema
+#### Node Types
+**CVE Node**:
+```cypher
+(:CVE {
+    id: String,                    // CVE-YYYY-NNNNN
+    sourceIdentifier: String,      // Source (e.g., cve@mitre.org)
+    published: DateTime,           // Publication date
+    lastModified: DateTime,        // Last modified date
+    vulnStatus: String,            // Status (e.g., "Analyzed")
+    description: String,           // English description
+    cvssVersion: String,           // CVSS version (e.g., "3.1")
+    cvssScore: Float,              // CVSS score (0.0-10.0)
+    cvssSeverity: String,          // Severity (LOW, MEDIUM, HIGH, CRITICAL)
+    cvssVector: String             // CVSS vector string
+})
+```
+**CPE Node** (Common Platform Enumeration):
+```cypher
+(:CPE {
+    uri: String,                   // CPE URI (unique)
+    version: String,               // Version string
+    versionStartIncluding: String, // Start of affected versions (inclusive)
+    versionEndExcluding: String    // End of affected versions (exclusive)
+})
+```
+**CWE Node** (Common Weakness Enumeration):
+```cypher
+(:CWE {
+    id: String                     // CWE-NNN (unique)
+})
+```
+**Vendor Node**:
+```cypher
+(:Vendor {
+    name: String                   // Vendor name (unique)
+})
+```
+**Product Node**:
+```cypher
+(:Product {
+    vendor: String,                // Vendor name
+    name: String                   // Product name
+    // (vendor, name) composite unique
+})
+```
+**Reference Node**:
+```cypher
+(:Reference {
+    url: String,                   // URL (unique)
+    source: String                 // Source
+})
+```
+#### Relationship Types
+```mermaid
+graph TD
+    CVE[CVE Node]
+    CPE[CPE Node]
+    CWE[CWE Node]
+    Vendor[Vendor Node]
+    Product[Product Node]
+    Ref[Reference Node]
+    CVE -->|AFFECTS| CPE
+    CVE -->|HAS_WEAKNESS| CWE
+    CVE -->|HAS_REFERENCE| Ref
+    Vendor -->|PRODUCES| Product
+    Product -->|HAS_VERSION| CPE
+```
+**Relationship Descriptions**:
+- `(:CVE)-[:AFFECTS]->(:CPE)`: The CVE affects a specific CPE (product version)
+- `(:CVE)-[:HAS_WEAKNESS]->(:CWE)`: The CVE is associated with a specific CWE weakness type
+- `(:CVE)-[:HAS_REFERENCE]->(:Reference)`: The CVE has a reference link
+- `(:Vendor)-[:PRODUCES]->(:Product)`: The vendor produces the product
+- `(:Product)-[:HAS_VERSION]->(:CPE)`: The product has a specific version (CPE)
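The Vendor, Product, and CPE nodes in this schema have to be derived from a CPE string. A minimal sketch of that derivation follows, assuming CPE 2.3 formatted strings like the one in the configuration example. The function name `parse_cpe23` is hypothetical, and the naive split does not handle escaped colons inside a component.

```python
from typing import Dict


def parse_cpe23(uri: str) -> Dict[str, str]:
    """Split a CPE 2.3 formatted string into the fields the graph needs.

    Layout: cpe:2.3:<part>:<vendor>:<product>:<version>:...
    Note: a plain split(":") ignores escaped colons ("\\:").
    """
    parts = uri.split(":")
    if len(parts) < 6 or parts[0] != "cpe" or parts[1] != "2.3":
        raise ValueError(f"not a CPE 2.3 string: {uri!r}")
    return {
        "part": parts[2],      # a = application, o = OS, h = hardware
        "vendor": parts[3],    # maps to the Vendor node
        "product": parts[4],   # maps to the Product node
        "version": parts[5],   # stored on the CPE node
    }
```

For `cpe:2.3:a:apache:log4j:2.14.1:*:*:*:*:*:*:*` this yields vendor `apache`, product `log4j`, and version `2.14.1`.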
+### JSONL Data Format
+Each record has the following structure:
+```json
+{
+    "source": "nvd_cve",
+    "package": "CVE-2021-44228",
+    "collected_at": "2025-10-15T12:34:56.789012",
+    "payload": {
+        "vulnerabilities": [
+            {
+                "cve": {
+                    "id": "CVE-2021-44228",
+                    "sourceIdentifier": "cve@mitre.org",
+                    "published": "2021-12-10T10:15:09.000",
+                    "lastModified": "2021-12-14T01:15:00.000",
+                    "vulnStatus": "Analyzed",
+                    "descriptions": [...],
+                    "metrics": {...},
+                    "weaknesses": [...],
+                    "configurations": [...],
+                    "references": [...]
+                }
+            }
+        ],
+        "total_results": 1
+    },
+    "metadata": {
+        "description": "Log4Shell - Apache Log4j2 RCE",
+        "target_id": "CVE-2021-44228"
+    }
+}
+```
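A reader for this record layout can be sketched as follows. It is an illustrative helper, not the package's loader: the name `iter_cves` is hypothetical, and it relies only on the `payload.vulnerabilities[].cve` nesting shown in the example record.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_cves(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield the inner `cve` object of every vulnerability in a JSONL stream."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        for item in record.get("payload", {}).get("vulnerabilities", []):
            cve = item.get("cve", {})
            if cve.get("id"):  # entries without a CVE ID are skipped
                yield cve
```

Because an open file object iterates line by line, the function can be called as `iter_cves(open(path, encoding="utf-8"))` without reading the whole file into memory.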
+### Configuration Schema
+**cve_config.yaml**:
+```yaml
+cutoff_date: "2021-12-31T23:59:59"
+output_dir: "data/raw"
+request_timeout: 30
+rate_limit_sleep: 6.0
+nvd_api_key: "optional-api-key"  # or the NVD_API_KEY environment variable
+cve_targets:
+  - id: "CVE-2021-44228"
+    description: "Log4Shell"
+  - id: "keyword:log4j"
+    description: "All Log4j CVEs"
+  - id: "cpe:2.3:a:apache:log4j:2.14.1:*:*:*:*:*:*:*"
+    description: "Log4j 2.14.1 specific"
+```
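The `cve_targets` entries above use three shapes of `id`: a plain CVE ID, a `keyword:` prefix, and a CPE string. One way to route them to the three collection modes is sketched below. The prefix-based dispatch is an assumption inferred from the example configuration, and `classify_target` is a hypothetical name.

```python
from typing import Tuple


def classify_target(target_id: str) -> Tuple[str, str]:
    """Map a configured target id to (collection mode, query value)."""
    if target_id.startswith("keyword:"):
        # Strip the prefix; the remainder is the search keyword.
        return "keyword", target_id[len("keyword:"):]
    if target_id.startswith("cpe:"):
        # A CPE string is passed through unchanged.
        return "cpe", target_id
    # Anything else is treated as a CVE ID.
    return "cve_id", target_id
```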
+## Error Handling
+### Data Collection Errors
+**Strategy**: Fail-safe with logging
+1. **Network Errors**:
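+   - Handled by the _request method of BaseDataSource
+   - Automatic retry on 429 (Rate Limit)
+   - Other HTTP errors raise DataSourceError
+2. **CVE Not Found**:
+   - Raises DataSourceError
+   - Logged, then processing continues with the next target
+3. **Invalid Response**:
+   - Raises an exception when JSON parsing fails
+   - Logged, then processing continues with the next target
+4. **Timeout**:
+   - requests timeout setting (default 30 seconds)
+   - On timeout, an exception is raised and logged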
+**Error Logging**:
+```python
+try:
+    result = cve_source.collect(cve_id, cutoff=cutoff_date)
+    collected_count += 1
+except Exception as e:
+    logger.error(f"Error collecting {cve_id}: {e}")
+    error_count += 1
+    continue
+```
+### Neo4j Loading Errors
+**Strategy**: Transaction-based with constraint handling
+1. **Connection Errors**:
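+   - The Neo4j driver retries the connection automatically
+   - A clear error message is printed when the connection fails
+2. **Constraint Violations**:
+   - MERGE prevents duplicates
+   - Constraint creation failures are treated as warnings (the constraint may already exist)
+3. **Invalid Data**:
+   - Records without a CVE ID are skipped
+   - A warning is logged
+4. **Cypher Query Errors**:
+   - Each CVE load is wrapped in try-except to isolate failures
+   - A single CVE failure does not abort the whole process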
+**Error Logging**:
+```python
+try:
+    self.load_cve(cve_data)
+    count += 1
+except Exception as e:
+    logger.error(f"Error loading CVE: {e}")
+```
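Both error-logging fragments in this section follow the same isolation pattern: one failing item is logged and counted, and the loop moves on. A self-contained version of that pattern is sketched below, with a hypothetical helper name `process_isolated` and a generic `handler` standing in for `collect` or `load_cve`.

```python
import logging
from typing import Callable, Iterable, Tuple, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def process_isolated(
    items: Iterable[T],
    handler: Callable[[T], None],
) -> Tuple[int, int]:
    """Run handler on every item and return (succeeded, failed) counts."""
    ok = failed = 0
    for item in items:
        try:
            handler(item)
            ok += 1
        except Exception as exc:
            # Log and keep going so one bad record cannot stop the run.
            logger.error("Error processing %r: %s", item, exc)
            failed += 1
    return ok, failed
```

The returned counts correspond to the collection and load statistics that the scripts print at the end of a run.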
+## Testing Strategy
+### Unit Testing
+**CVEDataSource Tests**:
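+- Test each collection method using mocked NVD API responses
+- Verify the cutoff_date filtering logic
+- Verify rate limit handling
+- Verify behavior differences with and without an API key
+**CVEGraphLoader Tests**:
+- Test the load logic using a mocked Neo4j driver
+- Verify the CPE parsing logic
+- Verify Cypher query generation
+- Verify duplicate data handling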
+### Integration Testing
+**End-to-End Data Flow**:
+1. Collect test CVE data (real NVD API or mock)
+2. Verify JSONL file creation
+3. Load into a Neo4j test instance
+4. Verify data integrity with Cypher queries
+**Sample Verification Queries**:
+```cypher
+// Count CVE nodes
+MATCH (c:CVE) RETURN count(c)
+// Inspect Log4Shell relationships
+MATCH (c:CVE {id: 'CVE-2021-44228'})-[r]->(n)
+RETURN type(r), labels(n), n LIMIT 10
+// Number of vulnerabilities for a specific vendor
+MATCH (v:Vendor {name: 'apache'})-[:PRODUCES]->(p:Product)
+      -[:HAS_VERSION]->(cpe:CPE)<-[:AFFECTS]-(c:CVE)
+RETURN p.name, count(DISTINCT c) as vuln_count
+ORDER BY vuln_count DESC
+```
+### Manual Testing
+**Test Scenarios**:
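+1. Run the full pipeline with a small set of CVEs
+2. Check the graph visualization in Neo4j Browser
+3. Explore relationships with a variety of Cypher queries
+4. Rate limit test (with and without an API key)
+5. Error scenario tests (invalid CVE ID, network errors, etc.)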
+## Performance Considerations
+### Data Collection
+**Bottleneck**: NVD API rate limits
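+- Without an API key: ~10 requests/minute
+- With an API key: ~50 requests/30 seconds
+**Optimization**:
+- Parallel processing has limited benefit because of the rate limits
+- Instead, consider batch collection and a resume feature (future improvement)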
+### Neo4j Loading
+**Bottleneck**: Cypher query execution time
+**Optimization**:
+1. Uniqueness constraints create indexes automatically
+2. Using MERGE optimizes duplicate checks
+3. Batch size can be tuned (currently one transaction per CVE)
+**Future Improvements**:
+- Batch insert using UNWIND
+- Transaction batching (e.g., one transaction per 100 CVEs)
+- Parallel loading with connection pooling
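The UNWIND and transaction-batching ideas listed above could be combined roughly as follows. This is a sketch of a possible future design, not existing package code: `BATCH_QUERY` and `chunked` are hypothetical names, and the query sets only one property to stay short.

```python
from typing import Any, Dict, Iterator, List, Sequence

# One statement upserts a whole batch: UNWIND expands $rows into one row per CVE.
BATCH_QUERY = (
    "UNWIND $rows AS row "
    "MERGE (c:CVE {id: row.id}) "
    "SET c.published = row.published"
)


def chunked(
    rows: Sequence[Dict[str, Any]],
    size: int = 100,
) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive batches of at most `size` rows."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(rows), size):
        yield list(rows[start:start + size])
```

Each batch would then be sent in its own transaction, for example with `session.run(BATCH_QUERY, rows=batch)`. This replaces one round trip per CVE with one per 100 CVEs.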
+## Security Considerations
+1. **API Key Management**:
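+   - Environment variables are recommended
+   - If stored in a configuration file, add the file to .gitignore
+2. **Neo4j Credentials**:
+   - Pass only as command-line arguments (do not store in configuration files)
+   - Consider using environment variables
+3. **Input Validation**:
+   - Validate the CVE ID format (CVE-YYYY-NNNNN)
+   - Escaping of URLs and strings (handled automatically by the Neo4j driver)
+4. **Rate Limiting**:
+   - Set appropriate wait times to avoid abusing the NVD API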
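The CVE ID format check listed under Input Validation can be expressed as a regular expression. The sketch below is an assumption about how such a check might be written; `is_valid_cve_id` is a hypothetical name. It accepts a four-digit year followed by four or more sequence digits, since CVE sequence numbers are not limited to five digits.

```python
import re

# CVE-<4-digit year>-<4 or more digits>
CVE_ID_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")


def is_valid_cve_id(value: str) -> bool:
    """Return True when value is a well-formed CVE identifier."""
    # fullmatch rejects trailing newlines and surrounding text.
    return CVE_ID_PATTERN.fullmatch(value) is not None
```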
+## Deployment Considerations
+### Prerequisites
+1. **Python Environment**:
+   - Python 3.10+
+   - Dependencies: pyyaml, requests, tqdm, neo4j
+2. **Neo4j Installation**:
+   - Neo4j Desktop or Docker
+   - Minimum version: 4.x+
+   - Recommended: 5.x+ (constraint syntax support)
+3. **NVD API Key** (optional):
+   - https://nvd.nist.gov/developers/request-an-api-key
+### Installation Steps
+```bash
+# 1. Install dependencies
+pip install -r requirements.txt
+# 2. Install Neo4j (Docker example)
+docker run -d \
+  --name neo4j \
+  -p 7474:7474 -p 7687:7687 \
+  -e NEO4J_AUTH=neo4j/password \
+  neo4j:latest
+# 3. Collect CVE data
+python scripts/collect_cve_data.py config/cve_config.yaml
+# 4. Load into Neo4j
+python scripts/load_cve_to_neo4j.py data/raw/cve_data.jsonl --password password
+```
+### Monitoring
+**Logging**:
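+- All scripts use Python logging
+- Log level is adjustable (--log-level)
+- Collection/load statistics are printed automatically
+**Metrics to Track**:
+- Number of CVEs collected
+- Number of collection failures
+- Neo4j load time
+- Number of nodes/relationships created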
+## Future Enhancements
+1. **Incremental Updates**:
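+   - Skip CVEs that have already been collected
+   - Detect updates based on lastModified
+2. **Advanced Graph Queries**:
+   - Vulnerability propagation path analysis
+   - Clustering of similar CVEs
+   - Vulnerability trends over time
+3. **Integration with Existing Pipeline**:
+   - Link package data to CVEs
+   - Map GitHub issues/PRs to CVEs
+4. **Visualization**:
+   - Neo4j Bloom integration
+   - Custom dashboard development
+5. **Performance**:
+   - Asynchronous data collection
+   - Batch Neo4j loading
+   - Add a caching layer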