data2prompt 0.1.0__tar.gz

@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 arianmokhtariha
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,166 @@
1
+ Metadata-Version: 2.4
2
+ Name: data2prompt
3
+ Version: 0.1.0
4
+ Summary: A high-performance CLI tool to convert local data science workspaces into LLM-ready context.
5
+ Author-email: Arian Mokhtariha <arian1385mokhtarihaa@gmail.com>
6
+ License: MIT License
7
+
8
+ Copyright (c) 2026 arianmokhtariha
9
+
10
+ Permission is hereby granted, free of charge, to any person obtaining a copy
11
+ of this software and associated documentation files (the "Software"), to deal
12
+ in the Software without restriction, including without limitation the rights
13
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
14
+ copies of the Software, and to permit persons to whom the Software is
15
+ furnished to do so, subject to the following conditions:
16
+
17
+ The above copyright notice and this permission notice shall be included in all
18
+ copies or substantial portions of the Software.
19
+
20
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
21
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
22
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
23
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
24
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
25
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
26
+ SOFTWARE.
27
+
28
+ Project-URL: Homepage, https://github.com/arianmokhtariha/data2prompt
29
+ Keywords: llm,cli,data,prompt,ai,data-science,context
30
+ Classifier: Programming Language :: Python :: 3
31
+ Classifier: License :: OSI Approved :: MIT License
32
+ Classifier: Operating System :: OS Independent
33
+ Classifier: Environment :: Console
34
+ Requires-Python: >=3.10
35
+ Description-Content-Type: text/markdown
36
+ License-File: LICENSE
37
+ Requires-Dist: pandas>=2.0.0
38
+ Requires-Dist: openpyxl>=3.1.0
39
+ Requires-Dist: tabulate>=0.9.0
40
+ Requires-Dist: rich>=13.0.0
41
+ Requires-Dist: tiktoken>=0.7.0
42
+ Requires-Dist: regex>=2024.0.0
43
+ Requires-Dist: pathspec>=0.12.0
44
+ Provides-Extra: dev
45
+ Requires-Dist: pytest>=8.0.0; extra == "dev"
46
+ Dynamic: license-file
47
+
48
+ <p align="center">
49
+ <img src="assets/banner.png" alt="Data2Prompt Banner" width="800">
50
+ </p>
51
+
52
+ <p align="center">
53
+ <a href="https://github.com/arianmokhtariha/data2prompt/blob/main/LICENSE"><img src="https://img.shields.io/badge/license-MIT-blue.svg" alt="License"></a>
54
+ <a href="https://www.python.org/downloads/"><img src="https://img.shields.io/badge/python-3.10+-blue.svg" alt="Python 3.10+"></a>
55
+ <a href="https://github.com/arianmokhtariha/data2prompt"><img src="https://img.shields.io/badge/status-active-brightgreen.svg" alt="Status"></a>
56
+ </p>
57
+
58
+
59
+ > **High-performance codebase-to-prompt orchestration for Data Science workflows and data-heavy projects.**
60
+
61
+ data2prompt is a CLI tool designed to bridge the gap between local data-heavy projects and Large Language Model (LLM) context windows. Unlike generic code packagers, it produces a token-aware representation of a project's structure and content, formatted to work well with LLM attention mechanisms.
62
+
63
+ ## 📝 Important Note
64
+ **Data2prompt** is purpose-built for **data-heavy projects** (`.csv`, `.sql`, `.xlsx`, `.ipynb`), not large pure-code repositories. It intelligently samples and truncates data files to prevent context window explosion while preserving semantic structure.
65
+
66
+
67
+ ## 🎯 Why Data2Prompt?
68
+ Generic code-to-prompt tools choke on data files: they either skip them entirely or dump raw CSVs that waste 90% of your context window. Data2Prompt solves this with intelligent sampling, schema extraction, and LLM-optimized formatting designed specifically for data science workflows.
69
+
70
+
71
+ ## ✨ Core Features
72
+
73
+ * **Smart Jupyter Parsing**: Intelligently extracts code, markdown, and text outputs from [`.ipynb`](docs/parsers.md) files while stripping heavy Base64 images and raw HTML to preserve context.
74
+ * **Multi-Format Sampling**: Sampling strategies for [CSV, SQL, and Excel](docs/parsers.md) files that preserve schema and data context, significantly reducing data size while keeping the context an LLM needs.
75
+ * **Aggressive Truncation**: Long lines are truncated to neutralize line injections and avoid exploding the context window. Tabular data that is still too large after sampling is truncated to a fixed size, as is any raw text file of an unhandled type that exceeds the size limit.
76
+ * **Defensive Processing**: Automatic binary detection: a file is treated as binary if a null byte appears in its first 1024 bytes.
77
+ * **Optimized LLM Attention**: The default output format is Markdown with a well-structured schema. An alternative XML output uses XML-style tags to improve LLM anchoring for complex analysis and large context windows.
78
+ * **Token-Aware Output**: Real-time token estimation using `tiktoken` (`o200k_base`) to help check that prompts fit target LLMs (Claude 3.5, GPT-4o, Gemini 1.5), plus offline token counting via `regex`. Because `o200k_base` is an OpenAI encoding, counts for other model families are approximations.
79
+ * **Professional TUI**: A high-fidelity terminal interface built with `Rich`, featuring a Matrix-style startup animation and interactive, scrollable reports on Windows.
80
+ * **Dynamic Markdown Wrapping**: Uses intelligent backtick depth to ensure robust nesting of code blocks in the final output.
81
+ * **Gitignore Aware**: Respects `.gitignore` rules by default; pass `--no-gitignore` to turn this off.
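
The null-byte check described under Defensive Processing can be sketched as follows. This is a minimal illustration rather than the package's actual code: the function name `looks_binary` and the `sniff_bytes` parameter are hypothetical, and only the 1024-byte window and the null-byte test come from the feature description.

```python
from pathlib import Path


def looks_binary(path: Path, sniff_bytes: int = 1024) -> bool:
    """Return True if a null byte appears in the first `sniff_bytes` bytes.

    Hypothetical helper: the name and signature are assumptions made for
    illustration, not data2prompt's API.
    """
    with open(path, "rb") as handle:
        head = handle.read(sniff_bytes)
    return b"\x00" in head
```

One known limitation of this heuristic is that UTF-16 encoded text usually contains null bytes and would therefore be classified as binary.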
82
+
83
+ ## 🏗️ Architecture & Engineering Standards
84
+
85
+ This project is a portfolio-grade implementation of the **Modular Functional Orchestration (MFO)** pattern, reflecting senior-level engineering maturity:
86
+
87
+ * **Registry & Strategy Patterns**: Uses a `ParserRegistry` for extensible file handling and an `OutputGenerator` strategy for multiple formats (Markdown, XML).
88
+ * **Centralized Configuration**: All core defaults, magic numbers, and default ignore lists reside in [`src/data2prompt/constants.py`](src/data2prompt/constants.py).
89
+ * **Strict Type Hinting**: Fully typed function signatures (PEP 484) across all modules.
90
+ * **UI Encapsulation**: All terminal feedback is handled by a dedicated `UIHandler`, ensuring a clean separation between logic and presentation.
91
+
92
+ For a deep dive into the system design, see the [Architecture Documentation](docs/architecture.md).
93
+
94
+ ## 🚀 Quick Start
95
+
96
+ ### Installation
97
+
98
+ Ensure you have Python 3.10+ installed.
99
+
100
+ ```bash
101
+ # Clone the repository
102
+ git clone https://github.com/arianmokhtariha/data2prompt.git
103
+ cd data2prompt
104
+
105
+ # Install normally
106
+ pip install .
107
+
108
+ # Install in editable mode
109
+ pip install -e .
110
+
111
+ # It's recommended to use pipx instead of pip for easier venv handling
112
+ ```
113
+
114
+ ### Usage
115
+
116
+ Run `data2prompt` in your project root to generate a structured prompt:
117
+
118
+ ```bash
119
+ # Basic usage (defaults to markdown output)
120
+ data2prompt
121
+
122
+ # Custom output with xml format and specific sampling
123
+ data2prompt --output my_analysis --format xml --csv-sample-size 50 --ignore-folders venv .pytest_cache
124
+ ```
125
+
126
+ ### CLI Arguments
127
+
128
+ | Argument | Description | Default |
129
+ | :--- | :--- | :--- |
130
+ | `-o`, `--output` | Base name of the generated file | `PROMPT` |
131
+ | `-f`, `--format` | Output format (`xml` or `markdown`) | `markdown` |
132
+ | `-s`, `--csv-sample-size` | Number of random rows to sample from CSVs | `15` |
133
+ | `--max-lines` | Max lines of text output per notebook cell | `40` |
134
+ | `--max-file-size` | Max file size in KB to read entirely | `70` |
135
+
136
+ See the [CLI Reference](docs/cli.md) for a full list of arguments.
137
+
138
+ ## 📚 Documentation
139
+
140
+ Explore the detailed documentation for more information:
141
+
142
+ * [**Architecture**](docs/architecture.md): MFO pattern and module flow.
143
+ * [**CLI Reference**](docs/cli.md): Detailed argument descriptions and usage.
144
+ * [**Parsers**](docs/parsers.md): How different file types are handled.
145
+ * [**Output Formats**](docs/output.md): Details on Markdown and XML generation.
146
+ * [**User Interface**](docs/ui.md): Features of the high-tech TUI.
147
+ * [**Installation**](docs/installation.md): Comprehensive setup guide.
148
+
149
+ ## 🛠️ Developer Setup
150
+
151
+ To contribute or run tests:
152
+
153
+ ```bash
154
+ pip install -e ".[dev]"
155
+ pytest
156
+ ```
157
+
158
+ ## 🌟 Show Your Support
159
+
160
+ If Data2Prompt saves you token costs or speeds up your workflow, consider:
161
+ - ⭐ Starring the repo
162
+ - πŸ› Reporting issues or suggesting features
163
+ - πŸ”€ Contributing parsers for new file types
164
+
165
+ ---
166
+ *Built with precision for the modern AI-assisted development workflow.*
@@ -0,0 +1,119 @@
1
+ <p align="center">
2
+ <img src="assets/banner.png" alt="Data2Prompt Banner" width="800">
3
+ </p>
4
+
5
+ <p align="center">
6
+ <a href="https://github.com/arianmokhtariha/data2prompt/blob/main/LICENSE"><img src="https://img.shields.io/badge/license-MIT-blue.svg" alt="License"></a>
7
+ <a href="https://www.python.org/downloads/"><img src="https://img.shields.io/badge/python-3.10+-blue.svg" alt="Python 3.10+"></a>
8
+ <a href="https://github.com/arianmokhtariha/data2prompt"><img src="https://img.shields.io/badge/status-active-brightgreen.svg" alt="Status"></a>
9
+ </p>
10
+
11
+
12
+ > **High-performance codebase-to-prompt orchestration for Data Science workflows and data-heavy projects.**
13
+
14
+ data2prompt is a CLI tool designed to bridge the gap between local data-heavy projects and Large Language Model (LLM) context windows. Unlike generic code packagers, it produces a token-aware representation of a project's structure and content, formatted to work well with LLM attention mechanisms.
15
+
16
+ ## 📝 Important Note
17
+ **Data2prompt** is purpose-built for **data-heavy projects** (`.csv`, `.sql`, `.xlsx`, `.ipynb`), not large pure-code repositories. It intelligently samples and truncates data files to prevent context window explosion while preserving semantic structure.
18
+
19
+
20
+ ## 🎯 Why Data2Prompt?
21
+ Generic code-to-prompt tools choke on data files: they either skip them entirely or dump raw CSVs that waste 90% of your context window. Data2Prompt solves this with intelligent sampling, schema extraction, and LLM-optimized formatting designed specifically for data science workflows.
22
+
23
+
24
+ ## ✨ Core Features
25
+
26
+ * **Smart Jupyter Parsing**: Intelligently extracts code, markdown, and text outputs from [`.ipynb`](docs/parsers.md) files while stripping heavy Base64 images and raw HTML to preserve context.
27
+ * **Multi-Format Sampling**: Sampling strategies for [CSV, SQL, and Excel](docs/parsers.md) files that preserve schema and data context, significantly reducing data size while keeping the context an LLM needs.
28
+ * **Aggressive Truncation**: Long lines are truncated to neutralize line injections and avoid exploding the context window. Tabular data that is still too large after sampling is truncated to a fixed size, as is any raw text file of an unhandled type that exceeds the size limit.
29
+ * **Defensive Processing**: Automatic binary detection: a file is treated as binary if a null byte appears in its first 1024 bytes.
30
+ * **Optimized LLM Attention**: The default output format is Markdown with a well-structured schema. An alternative XML output uses XML-style tags to improve LLM anchoring for complex analysis and large context windows.
31
+ * **Token-Aware Output**: Real-time token estimation using `tiktoken` (`o200k_base`) to help check that prompts fit target LLMs (Claude 3.5, GPT-4o, Gemini 1.5), plus offline token counting via `regex`. Because `o200k_base` is an OpenAI encoding, counts for other model families are approximations.
32
+ * **Professional TUI**: A high-fidelity terminal interface built with `Rich`, featuring a Matrix-style startup animation and interactive, scrollable reports on Windows.
33
+ * **Dynamic Markdown Wrapping**: Uses intelligent backtick depth to ensure robust nesting of code blocks in the final output.
34
+ * **Gitignore Aware**: Respects `.gitignore` rules by default; pass `--no-gitignore` to turn this off.
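
The idea behind Dynamic Markdown Wrapping can be sketched in a few lines: pick an outer fence that is one backtick longer than the longest backtick run inside the content, with a minimum of three backticks. The helper name `wrap_in_fence` is hypothetical and the package's real implementation may differ; the rule itself follows from how Markdown matches a closing fence to an opening fence of at least the same length.

```python
import re


def wrap_in_fence(content: str, language: str = "") -> str:
    """Wrap content in a fence longer than any backtick run it contains.

    Hypothetical helper for illustration; not data2prompt's API.
    """
    runs = re.findall(r"`+", content)
    longest = max((len(run) for run in runs), default=0)
    fence = "`" * max(3, longest + 1)
    return f"{fence}{language}\n{content}\n{fence}"
```

With this rule, a README that already contains three-backtick code blocks is wrapped in a four-backtick fence, so its inner blocks cannot close the outer one early.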
35
+
36
+ ## 🏗️ Architecture & Engineering Standards
37
+
38
+ This project is a portfolio-grade implementation of the **Modular Functional Orchestration (MFO)** pattern, reflecting senior-level engineering maturity:
39
+
40
+ * **Registry & Strategy Patterns**: Uses a `ParserRegistry` for extensible file handling and an `OutputGenerator` strategy for multiple formats (Markdown, XML).
41
+ * **Centralized Configuration**: All core defaults, magic numbers, and default ignore lists reside in [`src/data2prompt/constants.py`](src/data2prompt/constants.py).
42
+ * **Strict Type Hinting**: Fully typed function signatures (PEP 484) across all modules.
43
+ * **UI Encapsulation**: All terminal feedback is handled by a dedicated `UIHandler`, ensuring a clean separation between logic and presentation.
44
+
45
+ For a deep dive into the system design, see the [Architecture Documentation](docs/architecture.md).
46
+
47
+ ## 🚀 Quick Start
48
+
49
+ ### Installation
50
+
51
+ Ensure you have Python 3.10+ installed.
52
+
53
+ ```bash
54
+ # Clone the repository
55
+ git clone https://github.com/arianmokhtariha/data2prompt.git
56
+ cd data2prompt
57
+
58
+ # Install normally
59
+ pip install .
60
+
61
+ # Install in editable mode
62
+ pip install -e .
63
+
64
+ # It's recommended to use pipx instead of pip for easier venv handling
65
+ ```
66
+
67
+ ### Usage
68
+
69
+ Run `data2prompt` in your project root to generate a structured prompt:
70
+
71
+ ```bash
72
+ # Basic usage (defaults to markdown output)
73
+ data2prompt
74
+
75
+ # Custom output with xml format and specific sampling
76
+ data2prompt --output my_analysis --format xml --csv-sample-size 50 --ignore-folders venv .pytest_cache
77
+ ```
78
+
79
+ ### CLI Arguments
80
+
81
+ | Argument | Description | Default |
82
+ | :--- | :--- | :--- |
83
+ | `-o`, `--output` | Base name of the generated file | `PROMPT` |
84
+ | `-f`, `--format` | Output format (`xml` or `markdown`) | `markdown` |
85
+ | `-s`, `--csv-sample-size` | Number of random rows to sample from CSVs | `15` |
86
+ | `--max-lines` | Max lines of text output per notebook cell | `40` |
87
+ | `--max-file-size` | Max file size in KB to read entirely | `70` |
88
+
89
+ See the [CLI Reference](docs/cli.md) for a full list of arguments.
90
+
91
+ ## 📚 Documentation
92
+
93
+ Explore the detailed documentation for more information:
94
+
95
+ * [**Architecture**](docs/architecture.md): MFO pattern and module flow.
96
+ * [**CLI Reference**](docs/cli.md): Detailed argument descriptions and usage.
97
+ * [**Parsers**](docs/parsers.md): How different file types are handled.
98
+ * [**Output Formats**](docs/output.md): Details on Markdown and XML generation.
99
+ * [**User Interface**](docs/ui.md): Features of the high-tech TUI.
100
+ * [**Installation**](docs/installation.md): Comprehensive setup guide.
101
+
102
+ ## 🛠️ Developer Setup
103
+
104
+ To contribute or run tests:
105
+
106
+ ```bash
107
+ pip install -e ".[dev]"
108
+ pytest
109
+ ```
110
+
111
+ ## 🌟 Show Your Support
112
+
113
+ If Data2Prompt saves you token costs or speeds up your workflow, consider:
114
+ - ⭐ Starring the repo
115
+ - πŸ› Reporting issues or suggesting features
116
+ - πŸ”€ Contributing parsers for new file types
117
+
118
+ ---
119
+ *Built with precision for the modern AI-assisted development workflow.*
@@ -0,0 +1,41 @@
1
+ [build-system]
2
+ requires = ["setuptools>=61.0"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "data2prompt"
7
+ version = "0.1.0"
8
+ description = "A high-performance CLI tool to convert local data science workspaces into LLM-ready context."
9
+ readme = "README.md"
10
+ requires-python = ">=3.10"
11
+ license = {file = "LICENSE"}
12
+ authors = [
13
+ {name = "Arian Mokhtariha", email = "arian1385mokhtarihaa@gmail.com"}
14
+ ]
15
+ keywords = ["llm", "cli", "data", "prompt", "ai", "data-science", "context"]
16
+ classifiers = [
17
+ "Programming Language :: Python :: 3",
18
+ "License :: OSI Approved :: MIT License",
19
+ "Operating System :: OS Independent",
20
+ "Environment :: Console",
21
+ ]
22
+ dependencies = [
23
+ "pandas>=2.0.0",
24
+ "openpyxl>=3.1.0",
25
+ "tabulate>=0.9.0",
26
+ "rich>=13.0.0",
27
+ "tiktoken>=0.7.0",
28
+ "regex>=2024.0.0",
29
+ "pathspec>=0.12.0",
30
+ ]
31
+
32
+ [project.scripts]
33
+ data2prompt = "data2prompt.main:main"
34
+
35
+ [project.optional-dependencies]
36
+ dev = [
37
+ "pytest>=8.0.0",
38
+ ]
39
+
40
+ [project.urls]
41
+ Homepage = "https://github.com/arianmokhtariha/data2prompt"
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+
File without changes
@@ -0,0 +1,146 @@
+ import argparse
+ from dataclasses import dataclass, field
+ from typing import List, Set
+ from pathlib import Path
+ from argparse import Namespace
+
+ from .constants import (
+     CORE_IGNORES,
+     CORE_IGNORE_FILES,
+     CORE_SKIP_EXTS,
+     DEFAULT_CSV_SAMPLE_SIZE,
+     DEFAULT_SQL_SAMPLE_SIZE,
+     DEFAULT_SQL_MAX_LINES,
+     DEFAULT_MAX_LINES,
+     DEFAULT_MAX_SHEETS,
+     DEFAULT_SEED,
+     DEFAULT_LINE_LENGTH_THRESHOLD,
+     DEFAULT_TRUNCATED_LINE_LENGTH,
+     DEFAULT_TABLE_CHAR_LIMIT,
+     DEFAULT_TABLE_TRUNCATED_SIZE,
+     DEFAULT_MAX_FILE_SIZE_KB,
+     DEFAULT_OUTPUT_FILE,
+     DEFAULT_FORMAT,
+     SUPPORTED_FORMATS
+ )
+
+ @dataclass
+ class Config:
+     """Data Transfer Object for application configuration."""
+     output: str
+     format: str
+     csv_sample_size: int
+     seed: int
+     sql_sample_size: int
+     sql_max_lines: int
+     max_lines: int
+     max_sheets: int
+     line_length_threshold: int
+     truncated_line_length: int
+     table_limit: int
+     table_truncate: int
+     ignore_folders: Set[str] = field(default_factory=set)
+     ignore_files: Set[str] = field(default_factory=set)
+     max_file_size: int = 0
+     skip_exts: Set[str] = field(default_factory=set)
+     use_gitignore: bool = True
+
+ def setup_cli() -> Config:
+     """Configures the Command Line Interface (CLI) for the tool.
+
+     Defines all available flags and their help descriptions.
+
+     Returns:
+         Config: A type-safe configuration object.
+     """
+     parser = argparse.ArgumentParser(
+         description="📊 Data2Prompt: High-tech prompt packaging for Data Scientists."
+     )
+
+     # Output settings
+     parser.add_argument('-o', '--output', default=DEFAULT_OUTPUT_FILE,
+                         help=f'Base name of the generated file (default: {DEFAULT_OUTPUT_FILE})')
+
+     parser.add_argument('-f', '--format', choices=list(SUPPORTED_FORMATS.keys()), default=DEFAULT_FORMAT,
+                         help=f'Output format: xml or markdown (default: {DEFAULT_FORMAT})')
+
+     # CSV sampling settings
+     parser.add_argument('-s', '--csv-sample-size', type=int, default=DEFAULT_CSV_SAMPLE_SIZE,
+                         help=f'Number of random rows to sample from CSVs (default: {DEFAULT_CSV_SAMPLE_SIZE})')
+     parser.add_argument('--seed', type=int, default=DEFAULT_SEED,
+                         help=f'Random seed for consistent CSV sampling (default: {DEFAULT_SEED})')
+
+     # SQL sampling settings
+     parser.add_argument('--sql-sample-size', type=int, default=DEFAULT_SQL_SAMPLE_SIZE,
+                         help=f'Number of INSERT statements to keep in SQL files (default: {DEFAULT_SQL_SAMPLE_SIZE})')
+
+     parser.add_argument('--sql-max-lines', type=int, default=DEFAULT_SQL_MAX_LINES,
+                         help=f'Max non-data lines to keep in SQL files (default: {DEFAULT_SQL_MAX_LINES})')
+
+     # Notebook settings
+     parser.add_argument('--max-lines', type=int, default=DEFAULT_MAX_LINES,
+                         help=f'Max lines of text output to keep per notebook cell (default: {DEFAULT_MAX_LINES})')
+
+     # Excel settings
+     parser.add_argument('--max-sheets', type=int, default=DEFAULT_MAX_SHEETS,
+                         help=f'Max number of sheets to process in Excel files (default: {DEFAULT_MAX_SHEETS})')
+
+     # Line Truncation settings
+     parser.add_argument('--line-length-threshold', type=int, default=DEFAULT_LINE_LENGTH_THRESHOLD,
+                         help=f'Max characters per line before truncation (default: {DEFAULT_LINE_LENGTH_THRESHOLD})')
+     parser.add_argument('--truncated-line-length', type=int, default=DEFAULT_TRUNCATED_LINE_LENGTH,
+                         help=f'Length to truncate long lines to (default: {DEFAULT_TRUNCATED_LINE_LENGTH})')
+
+     # Table Truncation settings
+     parser.add_argument('--table-limit', type=int, default=DEFAULT_TABLE_CHAR_LIMIT,
+                         help=f'Max characters for a single table/sheet after sampling (default: {DEFAULT_TABLE_CHAR_LIMIT})')
+     parser.add_argument('--table-truncate', type=int, default=DEFAULT_TABLE_TRUNCATED_SIZE,
+                         help=f'Length to truncate large tables to (default: {DEFAULT_TABLE_TRUNCATED_SIZE})')
+
+     # Exclusions
+     parser.add_argument('--ignore-folders', nargs='+', default=[],
+                         help='Additional folders to skip entirely')
+
+     parser.add_argument('--ignore-files', nargs='+', default=[],
+                         help='Additional files to skip entirely')
+
+     parser.add_argument('--max-file-size', type=int, default=DEFAULT_MAX_FILE_SIZE_KB,
+                         help=f'Max file size in KB to read entirely (default: {DEFAULT_MAX_FILE_SIZE_KB}KB)')
+
+     # file formats to ignore
+     parser.add_argument('--skip-exts', nargs='+', default=[],
+                         help='Additional file extensions to skip content for')
+
+     parser.add_argument('--no-gitignore', action='store_false', dest='use_gitignore',
+                         help='Disable automatic .gitignore detection and filtering')
+
+     args = parser.parse_args()
+
+     # --- Argument Merging Logic ---
+     # We combine the user's terminal input with our CORE constants.
+     # This ensures that even if a user provides custom ignores, essential items
+     # like '.git' or binary extensions are still respected.
+
+     # Combine base name with format-specific extension
+     extension = SUPPORTED_FORMATS.get(args.format, SUPPORTED_FORMATS.get(DEFAULT_FORMAT))
+     final_output_name = f"{args.output}{extension}"
+
+     return Config(
+         output=final_output_name,
+         format=args.format,
+         csv_sample_size=args.csv_sample_size,
+         seed=args.seed,
+         sql_sample_size=args.sql_sample_size,
+         sql_max_lines=args.sql_max_lines,
+         max_lines=args.max_lines,
+         max_sheets=args.max_sheets,
+         line_length_threshold=args.line_length_threshold,
+         truncated_line_length=args.truncated_line_length,
+         table_limit=args.table_limit,
+         table_truncate=args.table_truncate,
+         ignore_folders=set(args.ignore_folders) | CORE_IGNORES,
+         ignore_files=set(args.ignore_files) | CORE_IGNORE_FILES,
+         max_file_size=args.max_file_size,
+         skip_exts=set(args.skip_exts) | CORE_SKIP_EXTS,
+         use_gitignore=args.use_gitignore
+     )
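
The merging step at the end of `setup_cli` can be exercised in isolation. The sketch below is a reduced stand-in, not the module itself: `parse_and_merge` is a hypothetical name, it accepts an explicit argument list so it can be called without a terminal, and the constants are trimmed copies of the values in `constants.py`.

```python
import argparse

# Trimmed stand-ins for values defined in constants.py (illustration only).
CORE_IGNORES = {".git", "__pycache__", "venv"}
SUPPORTED_FORMATS = {"xml": ".xml", "markdown": ".md"}


def parse_and_merge(argv: list[str]) -> tuple[str, set[str]]:
    """Parse a reduced flag set and apply the same merging rules as setup_cli."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-o", "--output", default="PROMPT")
    parser.add_argument("-f", "--format", choices=list(SUPPORTED_FORMATS), default="markdown")
    parser.add_argument("--ignore-folders", nargs="+", default=[])
    args = parser.parse_args(argv)
    # The extension is always appended to the base name.
    output_name = f"{args.output}{SUPPORTED_FORMATS[args.format]}"
    # User-supplied folders extend the core set; they never replace it.
    ignore_folders = set(args.ignore_folders) | CORE_IGNORES
    return output_name, ignore_folders
```

Two consequences follow from these rules: core ignores such as `.git` cannot be removed from the command line, and because the extension is always appended, passing `--output report.md` would produce `report.md.md`.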
@@ -0,0 +1,94 @@
1
+ # --- Core Defaults & Constants ---
2
+
3
+ # Folders matching these names are excluded from both the project tree and content processing.
4
+ CORE_IGNORES = {
5
+ '.git', '__pycache__', 'venv', '.vscode', '.ipynb_checkpoints',
6
+ 'node_modules', '.idea', 'dist', 'build', '.mypy_cache',
7
+ '.pytest_cache', 'target', '.docker', '.aws', '.gcloud',
8
+ '__MACOSX'
9
+ }
10
+
11
+ # Specific filenames that should be excluded from the entire process.
12
+ CORE_IGNORE_FILES = set()
13
+
14
+ # Files with these extensions will have their names listed in the project tree,
15
+ # but their actual content will be skipped.
16
+ CORE_SKIP_EXTS = {
17
+ # Data & Databases
18
+ '.pbix', '.db', '.sqlite', '.sqlite3', '.parquet', '.pkl', '.pickle', '.feather', '.h5',
19
+ # Compressed & Binary
20
+ '.zip', '.tar', '.gz', '.7z', '.rar', '.exe', '.dll', '.so', '.bin',
21
+ # Media
22
+ '.png', '.jpg', '.jpeg', '.gif', '.svg', '.pdf', '.mp4', '.mp3', '.mov',
23
+ # Environment & Secrets
24
+ '.env', '.venv', '.pyc', '.ds_store'
25
+ }
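
One plausible way to consult a set like `CORE_SKIP_EXTS` is sketched below. The matching logic is not part of this file, so `should_skip_content` is hypothetical. The sketch compares both the suffix and the whole lowercased file name, because `Path.suffix` is empty for dotfiles such as `.env` or `.DS_Store`, and a suffix-only check would therefore miss entries like `.env` and `.ds_store`.

```python
from pathlib import Path

# Small excerpt of the set above, for illustration only.
SKIP_EXTS = frozenset({".parquet", ".png", ".env", ".ds_store"})


def should_skip_content(path: str, skip_exts: frozenset[str] = SKIP_EXTS) -> bool:
    """Return True if the file's content should be skipped (hypothetical helper)."""
    name = Path(path).name.lower()
    suffix = Path(name).suffix
    # Match regular extensions by suffix and dotfiles by their full name.
    return suffix in skip_exts or name in skip_exts
```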
26
+
27
+ # Default values for CLI arguments and processing functions
28
+ DEFAULT_CSV_SAMPLE_SIZE = 15 # Controls the number of rows per csv file.
29
+ DEFAULT_SQL_SAMPLE_SIZE = 15 # Controls the number of INSERT/data rows kept per table in SQL files.
30
+ DEFAULT_SQL_MAX_LINES = 50 # Caps the total number of non-data lines (comments, setup, etc.) in SQL files.
31
+ DEFAULT_MAX_LINES = 40 # Max lines of text output to keep per notebook cell.
32
+ DEFAULT_MAX_SHEETS = 10 # Max number of sheets to process in Excel files.
33
+ DEFAULT_SEED = 42 # Random seed for consistent sampling.
34
+ DEFAULT_LINE_LENGTH_THRESHOLD = 4000 # Max characters allowed per line before truncation is triggered.
35
+ DEFAULT_TRUNCATED_LINE_LENGTH = 1000 # Number of characters to keep when a line is truncated.
36
+ DEFAULT_TABLE_CHAR_LIMIT = 50000 # Max characters allowed for a single table/sheet representation after sampling.
37
+ DEFAULT_TABLE_TRUNCATED_SIZE = 20000 # Number of characters to keep when a table/sheet is truncated due to size.
38
+ DEFAULT_MAX_FILE_SIZE_KB = 70 # Max size in KB of an unhandled-type file to keep entirely (larger files show only the first 10 KB)
39
+ DEFAULT_OUTPUT_FILE = 'PROMPT' # default output base name (extension added via --format)
40
+ DEFAULT_FORMAT = 'markdown' # default output format
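
The two line-length defaults above work as a pair: `DEFAULT_LINE_LENGTH_THRESHOLD` decides when truncation is triggered, and `DEFAULT_TRUNCATED_LINE_LENGTH` decides how much is kept. A minimal sketch of that rule follows; the function name `truncate_line` and the marker text appended to a truncated line are assumptions, since the truncation code itself is not shown here.

```python
def truncate_line(line: str, threshold: int = 4000, keep: int = 1000) -> str:
    """Keep the first `keep` characters of any line longer than `threshold`.

    Hypothetical helper; the trailing marker format is an assumption.
    """
    if len(line) <= threshold:
        return line
    omitted = len(line) - keep
    return f"{line[:keep]}... [truncated {omitted} chars]"
```

Note the gap between the two values: a 3,500-character line is left intact, while a 4,001-character line is cut down to 1,000 characters plus the marker.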
41
+
42
+ # Mapping of format types to their respective file extensions
43
+ SUPPORTED_FORMATS = {
44
+ 'xml': '.xml',
45
+ 'markdown': '.md'
46
+ }
47
+
48
+ # A unique identifier added to the top of every generated file to prevent recursive scanning.
49
+ GENERATION_FLAG = "DATA2PROMPT_GENERATED_CONTENT"
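
A scanner could use `GENERATION_FLAG` to recognize its own earlier output roughly as follows. This is a sketch under assumptions: `is_generated_output` is a hypothetical name, and probing only the first 256 characters is an illustrative choice based on the comment that the flag sits at the top of every generated file.

```python
from pathlib import Path

GENERATION_FLAG = "DATA2PROMPT_GENERATED_CONTENT"


def is_generated_output(path: Path, probe_chars: int = 256) -> bool:
    """Return True if the flag appears near the top of the file."""
    try:
        with open(path, "r", encoding="utf-8", errors="ignore") as handle:
            head = handle.read(probe_chars)
    except OSError:
        # Unreadable or missing files are treated as not generated.
        return False
    return GENERATION_FLAG in head
```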
50
+
51
+ # --- LLM Structured Output Constants ---
52
+ # Refactored System Instructions (Repomix Style)
53
+ SYSTEM_INSTRUCTIONS_MARKDOWN = """## purpose: \nThis document is a structured representation of a codebase and data schema. It is designed to be consumed by a Large Language Model.
54
+ The output is organized into sections:
55
+ 1. Directory Structure: List of all files in this project.
56
+ 2. Files: The content of each file, clearly labeled with its path using '## File: {path}' headers.
57
+ For all standard files, content is wrapped in markdown code blocks using dynamic backtick depth to ensure robust nesting.
58
+ For notebooks, individual cells are clearly labeled with cell numbers, types, and their respective file paths.
59
+ For Excel files, individual sheets are clearly labeled with sheet names, numbers, and their respective file paths."""
60
+
61
+ SYSTEM_INSTRUCTIONS_XML = """<purpose>\nThis document is a structured representation of a codebase and data schema. It is designed to be consumed by a Large Language Model.
62
+ The output is organized into XML tags:
63
+ 1. <directory_structure>: List of all files in this project.
64
+ 2. <files>: Contains the repository's files.
65
+ 3. <file>: Represents a single file with a 'path' attribute.
66
+ 4. <cell>: Used within notebooks to encapsulate individual cells, featuring 'path', 'number', and 'type' attributes.
67
+ 5. <sheet>: Used within Excel files to encapsulate individual sheets, featuring 'name', 'number', and 'path' attributes.\n</purpose>"""
68
+
69
+ # Updated Tags
70
+ TAG_DIRECTORY_STRUCTURE = "directory_structure"
71
+ TAG_FILES = "files"
72
+ TAG_FILE = "file"
73
+ TAG_CONTENT = "content" # Used for notebook cells
74
+
75
+ # --- UI & Aesthetic Constants ---
76
+ MATRIX_DARK_GREEN = (0, 150, 0)
77
+ MATRIX_NEON_GREEN = (0, 255, 0)
78
+ STARTUP_ANIMATION_DURATION = 0.9
79
+ ANIMATION_FRAME_DELAY = 0.03
80
+
81
+ # Scroll Bar Characters
82
+ SCROLL_THUMB = "█"
83
+ SCROLL_TRACK = "│"
84
+
85
+ # ASCII Art for the application header
86
+ ASCII_ART = [
87
+ " ",
88
+ " β–ˆβ–ˆβ•— β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—",
89
+ " β•šβ–ˆβ–ˆβ•— β–ˆβ–ˆβ•”β•β•β–ˆβ–ˆβ•— β–ˆβ–ˆβ•”β•β•β–ˆβ–ˆβ•— β•šβ•β•β–ˆβ–ˆβ•”β•β•β• β–ˆβ–ˆβ•”β•β•β–ˆβ–ˆβ•— β•šβ•β•β•β•β–ˆβ–ˆβ•— β–ˆβ–ˆβ•”β•β•β–ˆβ–ˆβ•— β–ˆβ–ˆβ•”β•β•β–ˆβ–ˆβ•— β–ˆβ–ˆβ•”β•β•β•β–ˆβ–ˆβ•— β–ˆβ–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•”β•β•β–ˆβ–ˆβ•— β•šβ•β•β–ˆβ–ˆβ•”β•β•β•",
90
+ " β•šβ–ˆβ–ˆβ•— β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•‘ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•”β• β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•”β• β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•”β• β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•”β–ˆβ–ˆβ–ˆβ–ˆβ•”β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•”β• β–ˆβ–ˆβ•‘ ",
91
+ " β–ˆβ–ˆβ•”β• β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•”β•β•β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•”β•β•β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•”β•β•β•β• β–ˆβ–ˆβ•”β•β•β•β• β–ˆβ–ˆβ•”β•β•β–ˆβ–ˆβ•— β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•‘β•šβ–ˆβ–ˆβ•”β•β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•”β•β•β•β• β–ˆβ–ˆβ•‘ ",
92
+ " β–ˆβ–ˆβ•”β• β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•”β• β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•‘ β•šβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•”β• β–ˆβ–ˆβ•‘ β•šβ•β• β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•‘ ",
93
+ " β•šβ•β• β•šβ•β•β•β•β•β• β•šβ•β• β•šβ•β• β•šβ•β• β•šβ•β• β•šβ•β• β•šβ•β•β•β•β•β•β• β•šβ•β• β•šβ•β• β•šβ•β• β•šβ•β•β•β•β•β• β•šβ•β• β•šβ•β• β•šβ•β• β•šβ•β• "
94
+ ]