bookdatamaker 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (30)
  1. bookdatamaker-0.1.0/LICENSE +8 -0
  2. bookdatamaker-0.1.0/PKG-INFO +594 -0
  3. bookdatamaker-0.1.0/README.md +548 -0
  4. bookdatamaker-0.1.0/pyproject.toml +78 -0
  5. bookdatamaker-0.1.0/setup.cfg +4 -0
  6. bookdatamaker-0.1.0/src/bookdatamaker/__init__.py +3 -0
  7. bookdatamaker-0.1.0/src/bookdatamaker/cli.py +808 -0
  8. bookdatamaker-0.1.0/src/bookdatamaker/dataset/__init__.py +6 -0
  9. bookdatamaker-0.1.0/src/bookdatamaker/dataset/builder.py +119 -0
  10. bookdatamaker-0.1.0/src/bookdatamaker/dataset/dataset_manager.py +268 -0
  11. bookdatamaker-0.1.0/src/bookdatamaker/llm/__init__.py +6 -0
  12. bookdatamaker-0.1.0/src/bookdatamaker/llm/parallel_generator.py +677 -0
  13. bookdatamaker-0.1.0/src/bookdatamaker/mcp/__init__.py +5 -0
  14. bookdatamaker-0.1.0/src/bookdatamaker/mcp/server.py +700 -0
  15. bookdatamaker-0.1.0/src/bookdatamaker/ocr/__init__.py +6 -0
  16. bookdatamaker-0.1.0/src/bookdatamaker/ocr/document_parser.py +207 -0
  17. bookdatamaker-0.1.0/src/bookdatamaker/ocr/extractor.py +455 -0
  18. bookdatamaker-0.1.0/src/bookdatamaker/utils/__init__.py +6 -0
  19. bookdatamaker-0.1.0/src/bookdatamaker/utils/page_manager.py +507 -0
  20. bookdatamaker-0.1.0/src/bookdatamaker/utils/status.py +135 -0
  21. bookdatamaker-0.1.0/src/bookdatamaker.egg-info/PKG-INFO +594 -0
  22. bookdatamaker-0.1.0/src/bookdatamaker.egg-info/SOURCES.txt +28 -0
  23. bookdatamaker-0.1.0/src/bookdatamaker.egg-info/dependency_links.txt +1 -0
  24. bookdatamaker-0.1.0/src/bookdatamaker.egg-info/entry_points.txt +2 -0
  25. bookdatamaker-0.1.0/src/bookdatamaker.egg-info/requires.txt +39 -0
  26. bookdatamaker-0.1.0/src/bookdatamaker.egg-info/top_level.txt +1 -0
  27. bookdatamaker-0.1.0/tests/test_dataset.py +119 -0
  28. bookdatamaker-0.1.0/tests/test_mcp.py +96 -0
  29. bookdatamaker-0.1.0/tests/test_ocr.py +40 -0
  30. bookdatamaker-0.1.0/tests/test_paragraph_indexing.py +201 -0
@@ -0,0 +1,8 @@
+ Copyright 2025 zwh20081
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
@@ -0,0 +1,594 @@
+ Metadata-Version: 2.4
+ Name: bookdatamaker
+ Version: 0.1.0
+ Summary: CLI tool for extracting text with DeepSeek OCR and generating datasets
+ Author-email: Book Data Maker <contact@example.com>
+ License: MIT
+ Requires-Python: <3.13,>=3.10
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: click
+ Requires-Dist: httpx
+ Requires-Dist: openai
+ Requires-Dist: mcp
+ Requires-Dist: pyarrow
+ Requires-Dist: pandas
+ Requires-Dist: python-dotenv
+ Requires-Dist: rich
+ Requires-Dist: aiofiles
+ Requires-Dist: Pillow
+ Requires-Dist: tqdm
+ Requires-Dist: PyMuPDF
+ Requires-Dist: ebooklib
+ Requires-Dist: beautifulsoup4
+ Provides-Extra: local
+ Requires-Dist: transformers; extra == "local"
+ Requires-Dist: torch; extra == "local"
+ Requires-Dist: flash-attn; extra == "local"
+ Provides-Extra: document
+ Requires-Dist: PyMuPDF; extra == "document"
+ Requires-Dist: ebooklib; extra == "document"
+ Requires-Dist: beautifulsoup4; extra == "document"
+ Provides-Extra: all
+ Requires-Dist: transformers; extra == "all"
+ Requires-Dist: torch; extra == "all"
+ Requires-Dist: flash-attn; extra == "all"
+ Requires-Dist: PyMuPDF; extra == "all"
+ Requires-Dist: ebooklib; extra == "all"
+ Requires-Dist: beautifulsoup4; extra == "all"
+ Provides-Extra: dev
+ Requires-Dist: pytest; extra == "dev"
+ Requires-Dist: pytest-asyncio; extra == "dev"
+ Requires-Dist: black; extra == "dev"
+ Requires-Dist: ruff; extra == "dev"
+ Requires-Dist: mypy; extra == "dev"
+ Dynamic: license-file
+
+ # Book Data Maker
+
+ A powerful CLI tool for extracting text from documents using DeepSeek OCR and generating high-quality datasets with LLM assistance.
+
+ ## Table of Contents
+
+ ### 🚀 Getting Started
+ - [Features](#features)
+ - [Quick Start](#quick-start)
+ - [Installation](#installation)
+
+ ### 📖 User Guide
+ - [Extract Text (Stage 1)](#extract-text-stage-1)
+ - [Generate Dataset (Stage 2)](#generate-dataset-stage-2)
+ - [Export Dataset](#export-dataset)
+
+ ### 🔧 Advanced
+ - [Position Distribution](#position-distribution)
+ - [Performance Tuning](#performance-tuning)
+ - [MCP Server](#mcp-server)
+
+ ### 📚 Reference
+ - [Command Reference](#command-reference)
+ - [Troubleshooting](#troubleshooting)
+ - [Development](#development)
+
+ ---
+
+ ## Features
+
+ - 📄 **Multi-Format Support**: PDF, EPUB, and images
+ - 🏠 **Self-Hosted OCR**: Local transformers for DeepSeek-OCR (no API costs)
+ - 🤖 **Parallel Generation**: Multiple LLM threads explore documents simultaneously
+ - 🎯 **Smart Distribution**: Control thread starting positions
+ - 💾 **SQLite Storage**: Real-time dataset storage with flexible export
+ - 📊 **Multiple Formats**: JSONL, Parquet, CSV, JSON
+ - 🌐 **Flexible Modes**: API or self-hosted for both stages
+ - 📈 **Progress Tracking**: Real-time progress bars
+ - ⚡ **Resume Support**: Continue interrupted sessions
+
+ ## Quick Start
+
+ ### Prerequisites
+
+ ```bash
+ # Set API keys (choose one based on your mode)
+ export OPENAI_API_KEY=your_openai_key # For API mode
+ export DEEPSEEK_API_KEY=your_deepseek_key # For API OCR mode
+ ```
+
+ ### Option 1: API Mode (Fastest Setup)
+
+ ```bash
+ # 1. Install
+ pip install -r requirements.txt && pip install -e .
+
+ # 2. Extract → Generate → Export
+ bookdatamaker extract book.pdf -o ./extracted
+ bookdatamaker generate ./extracted/combined.txt -d dataset.db --distribution "10,10,20,30,20,10"
+ bookdatamaker export-dataset dataset.db -o output.parquet
+ ```
+
+ ### Option 2: Self-Hosted Mode (Free, Private)
+
+ ```bash
+ # 1. Install with local dependencies
+ pip install -r requirements.txt && pip install -e ".[local]"
+
+ # 2. Extract with local OCR
+ bookdatamaker extract book.pdf --mode local --batch-size 8 -o ./extracted
+
+ # 3. Generate with vLLM
+ bookdatamaker generate ./extracted/combined.txt \
+ --mode vllm \
+ --vllm-model-path meta-llama/Llama-3-8B-Instruct \
+ --distribution "25,25,25,25" \
+ -d dataset.db
+
+ # 4. Export
+ bookdatamaker export-dataset dataset.db -o output.parquet
+ ```
+
+ ## Installation
+
+ ### Basic Installation
+
+ ```bash
+ git clone https://github.com/yourusername/bookdatamaker.git
+ cd bookdatamaker
+ pip install -r requirements.txt
+ pip install -e .
+ ```
+
+ ### Optional: Local Inference Support
+
+ ```bash
+ # For self-hosted OCR and LLM generation
+ pip install -e ".[local]" # Installs transformers, torch, flash-attn
+ ```
+
+ ### System Requirements
+
+ **For API Mode:**
+ - Python 3.10+
+ - API keys (OpenAI, DeepSeek, etc.)
+
+ **For Local Mode:**
+ - Python 3.10+
+ - NVIDIA GPU with CUDA support
+ - 16GB+ VRAM recommended
+ - Linux or WSL2 (recommended)
+
+ ---
+
+ ## Extract Text (Stage 1)
+
+ Extract text from documents using DeepSeek OCR.
+
+ ### Supported Formats
+
+ - **PDF**: Text extraction or OCR from rendered pages
+ - **EPUB**: E-book text extraction
+ - **Images**: JPG, PNG, BMP, TIFF, WebP
+
+ ### API Mode
+
+ ```bash
+ # Basic usage
+ bookdatamaker extract book.pdf -o ./extracted
+
+ # Custom API endpoint
+ bookdatamaker extract book.pdf \
+ --deepseek-api-url https://custom-api.example.com/v1 \
+ -o ./extracted
+ ```
+
+ ### Local Mode
+
+ Use a local transformers model for OCR (no API calls):
+
+ ```bash
+ # Basic usage
+ bookdatamaker extract book.pdf --mode local -o ./extracted
+
+ # With a custom batch size (adjust based on GPU memory)
+ bookdatamaker extract book.pdf --mode local --batch-size 12 -o ./extracted
+
+ # Process a directory of images
+ bookdatamaker extract ./images/ --mode local -o ./extracted
+ ```
+
+ **Batch Size Guidelines:**
+ - **12-16**: GPUs with 24GB+ VRAM
+ - **8-12**: GPUs with 16GB+ VRAM (default: 8)
+ - **4-8**: GPUs with 8-12GB VRAM
+ - **1-4**: GPUs with <8GB VRAM
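
The VRAM tiers above can be collapsed into a tiny helper. A minimal sketch; the thresholds mirror the table, pick the conservative end of each range, and are only a heuristic, not part of the CLI:

```python
def suggest_batch_size(vram_gb: float) -> int:
    """Suggest a --batch-size from available VRAM, per the guideline table."""
    if vram_gb >= 24:
        return 16
    if vram_gb >= 16:
        return 8  # the tool's default
    if vram_gb >= 8:
        return 4
    return 1
```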
+
+ ### Output Structure
+
+ ```
+ ./extracted/
+ ├── page_001.txt
+ ├── page_002.txt
+ ├── ...
+ └── combined.txt # All pages with [PAGE_XXX] markers
+ ```
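
Because `combined.txt` tags each page with a `[PAGE_XXX]` marker, it is easy to split back into per-page text. A minimal sketch; the exact marker format is assumed from the tree above:

```python
import re

def split_pages(combined_text: str) -> dict[int, str]:
    """Split combined.txt content on [PAGE_XXX] markers into {page_number: text}."""
    # re.split with a capturing group yields [preamble, num1, text1, num2, text2, ...]
    parts = re.split(r"\[PAGE_(\d+)\]", combined_text)
    return {
        int(num): text.strip()
        for num, text in zip(parts[1::2], parts[2::2])
    }
```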
+
+ ---
+
+ ## Generate Dataset (Stage 2)
+
+ Generate Q&A datasets using parallel LLM threads.
+
+ ### Basic Usage
+
+ ```bash
+ # 6 threads (from distribution), 20 Q&A pairs per thread
+ bookdatamaker generate combined.txt \
+ -d dataset.db \
+ --distribution "10,10,20,30,20,10" \
+ --datasets-per-thread 20
+ ```
+
+ **Key Concept**: Thread count is determined by the number of comma-separated values in `--distribution`.
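
That relationship, together with `--datasets-per-thread`, fixes the target size of a run. A minimal sketch of the arithmetic:

```python
def run_targets(distribution: str, datasets_per_thread: int) -> tuple[int, int]:
    """Return (thread_count, total_target_pairs) for a generate run."""
    threads = len(distribution.split(","))  # one thread per comma-separated value
    return threads, threads * datasets_per_thread
```

With `--distribution "10,10,20,30,20,10"` and `--datasets-per-thread 20`, that is 6 threads targeting 120 pairs in total.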
+
+ ### API Mode Examples
+
+ ```bash
+ # OpenAI/Azure
+ bookdatamaker generate combined.txt \
+ -d dataset.db \
+ --openai-api-url https://api.openai.com/v1 \
+ --model gpt-4 \
+ --distribution "10,10,20,30,20,10"
+
+ # Custom API endpoint
+ bookdatamaker generate combined.txt \
+ --openai-api-url http://localhost:8000/v1 \
+ --model your-model-name \
+ --distribution "25,25,25,25"
+ ```
+
+ ### vLLM Direct Mode (Self-Hosted)
+
+ Use vLLM directly, without an API server:
+
+ ```bash
+ # Single GPU
+ bookdatamaker generate combined.txt \
+ --mode vllm \
+ --vllm-model-path meta-llama/Llama-3-8B-Instruct \
+ --distribution "25,25,25,25" \
+ -d dataset.db
+
+ # Multi-GPU (4 GPUs, 6 threads)
+ bookdatamaker generate combined.txt \
+ --mode vllm \
+ --vllm-model-path meta-llama/Llama-3-70B-Instruct \
+ --tensor-parallel-size 4 \
+ --distribution "10,10,20,30,20,10" \
+ -d dataset.db
+ ```
+
+ **Benefits of vLLM Mode:**
+ - No API costs
+ - Full privacy (local processing)
+ - Optimized inference
+ - Thread-safe parallel processing
+ - Automatic batching
+
+ ### Custom Prompts
+
+ Add specific instructions to guide LLM behavior:
+
+ ```bash
+ # Language specification
+ bookdatamaker generate combined.txt \
+ --custom-prompt "Generate all Q&A in Chinese with simplified characters"
+
+ # Format specification
+ bookdatamaker generate combined.txt \
+ --custom-prompt "Questions should be multiple-choice with 4 options"
+
+ # Multiple requirements
+ bookdatamaker generate combined.txt \
+ --custom-prompt "Requirements:
+ 1. Generate questions in English
+ 2. Focus on practical applications
+ 3. Include code examples
+ 4. Answer length: 50-150 words
+ 5. Difficulty: intermediate"
+ ```
+
+ ---
+
+ ## Export Dataset
+
+ Export from the SQLite database to your preferred format:
+
+ ```bash
+ # Parquet (recommended for data analysis)
+ bookdatamaker export-dataset dataset.db -o output.parquet
+
+ # JSON Lines (easy to stream)
+ bookdatamaker export-dataset dataset.db -o output.jsonl -f jsonl
+
+ # CSV (Excel-friendly)
+ bookdatamaker export-dataset dataset.db -o output.csv -f csv
+
+ # JSON with metadata
+ bookdatamaker export-dataset dataset.db -o output.json -f json --include-metadata
+ ```
+
+ **Format Comparison:**
+
+ | Format | Best For | Size | Load Speed |
+ |--------|----------|------|------------|
+ | Parquet | Data analysis, ML | Smallest | Fastest |
+ | JSONL | Streaming, processing | Medium | Fast |
+ | CSV | Excel, spreadsheets | Largest | Medium |
+ | JSON | API responses | Large | Slow |
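
Any of these exports can be loaded back with pandas (already a dependency of this package) for downstream work. A minimal sketch that dispatches on the file suffix and assumes nothing about the column names:

```python
from pathlib import Path

import pandas as pd

def load_export(path: str) -> pd.DataFrame:
    """Load an exported dataset back into a DataFrame, dispatching on suffix."""
    p = Path(path)
    if p.suffix == ".parquet":
        return pd.read_parquet(p)
    if p.suffix == ".jsonl":
        return pd.read_json(p, lines=True)
    if p.suffix == ".csv":
        return pd.read_csv(p)
    return pd.read_json(p)  # plain JSON
```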
+
+ ---
+
+ ## Position Distribution
+
+ Control where threads start in the document using distribution percentages.
+
+ ### How It Works
+
+ ```
+ Document: 500 paragraphs
+ Distribution: "10,10,20,30,20,10" (6 threads)
+
+ Thread 0: Start at 0% → Paragraph 1
+ Thread 1: Start at 10% → Paragraph 50
+ Thread 2: Start at 20% → Paragraph 100
+ Thread 3: Start at 40% → Paragraph 200
+ Thread 4: Start at 70% → Paragraph 350
+ Thread 5: Start at 90% → Paragraph 450
+ ```
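
The mapping from percentages to starting paragraphs is a running sum of the weights before each thread. A minimal sketch of that calculation, not the tool's actual implementation:

```python
def thread_starts(distribution: str, total_paragraphs: int) -> list[int]:
    """Map a --distribution string to starting paragraph offsets.

    Thread i starts at the cumulative percentage of the weights before it,
    so "10,10,20,30,20,10" starts threads at 0%, 10%, 20%, 40%, 70%, 90%.
    """
    weights = [int(w) for w in distribution.split(",")]
    starts, cumulative = [], 0
    for w in weights:
        starts.append(cumulative * total_paragraphs // 100)
        cumulative += w
    return starts
```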
+
+ ### Distribution Strategies
+
+ ```bash
+ # Even distribution (4 threads)
+ --distribution "25,25,25,25"
+ # Start at: 0%, 25%, 50%, 75%
+
+ # Front-heavy (4 threads) - focus on the beginning
+ --distribution "40,30,20,10"
+ # Start at: 0%, 40%, 70%, 90%
+
+ # Middle-heavy (5 threads) - focus on the middle
+ --distribution "10,20,40,20,10"
+ # Start at: 0%, 10%, 30%, 70%, 90%
+
+ # Dense sampling (10 threads) - fine-grained coverage
+ --distribution "10,10,10,10,10,10,10,10,10,10"
+ ```
+
+ ### Thread Count Guidelines
+
+ - **Small documents** (<100 paragraphs): 2-4 threads
+ - **Medium documents** (100-500 paragraphs): 4-8 threads
+ - **Large documents** (>500 paragraphs): 8-16 threads
+
+ ---
+
+ ## Performance Tuning
+
+ ### Extraction (Stage 1)
+
+ **Batch Size Optimization:**
+
+ ```bash
+ # Maximum speed (24GB+ VRAM)
+ bookdatamaker extract book.pdf --mode local --batch-size 16
+
+ # Balanced (16GB VRAM)
+ bookdatamaker extract book.pdf --mode local --batch-size 8
+
+ # Conservative (<8GB VRAM)
+ bookdatamaker extract book.pdf --mode local --batch-size 4
+ ```
+
+ ### Generation (Stage 2)
+
+ **Optimal Configurations:**
+
+ ```bash
+ # Maximum throughput (multi-GPU, 12 threads)
+ bookdatamaker generate text.txt --mode vllm \
+ --vllm-model-path meta-llama/Llama-3-70B \
+ --tensor-parallel-size 4 \
+ --distribution "5,5,10,10,15,15,15,10,5,5,2,3" \
+ --datasets-per-thread 50
+
+ # Balanced (single GPU, 6 threads)
+ bookdatamaker generate text.txt --mode vllm \
+ --vllm-model-path meta-llama/Llama-3-8B \
+ --distribution "10,10,20,30,20,10" \
+ --datasets-per-thread 20
+
+ # Conservative (2 threads)
+ bookdatamaker generate text.txt --mode vllm \
+ --vllm-model-path meta-llama/Llama-3-8B \
+ --distribution "50,50" \
+ --datasets-per-thread 10
+ ```
+
+ ---
+
+ ## Interactive Chat
+
+ Chat with an LLM that can access your document through MCP tools. Useful for exploring documents interactively or testing Q&A generation.
+
+ ### Start Chat Session
+
+ ```bash
+ # Basic chat with GPT-4
+ bookdatamaker chat combined.txt
+
+ # With a vLLM server
+ bookdatamaker chat combined.txt \
+ --openai-api-url http://localhost:8000/v1 \
+ --model Qwen/Qwen3-4B-Thinking-2507
+
+ # With a custom database
+ bookdatamaker chat combined.txt --db my_dataset.db
+ ```
+
+ ### Example Interaction
+
+ ```
+ 📚 Document: combined.txt
+ 📊 Paragraphs: 578
+ 🤖 Model: gpt-4
+
+ You: What's in paragraph 100?
+ ```
+
+ ## Command Reference
+
+ - `-f, --format`: Format: `jsonl`, `parquet`, `csv`, `json` (default: `parquet`)
+ - `--include-metadata`: Include timestamps
+
+ ### Parameter Tables
+
+ #### extract Parameters
+
+ | Parameter | Type | Default | Description |
+ |-----------|------|---------|-------------|
+ | `input_path` | required | - | Input file or directory |
+ | `--output-dir` | optional | `extracted_text` | Output directory |
+ | `--mode` | optional | `api` | OCR mode: `api` or `local` |
+ | `--batch-size` | optional | `8` | Batch size for local mode |
+ | `--deepseek-api-key` | optional | env var | DeepSeek API key |
+ | `--deepseek-api-url` | optional | `https://api.deepseek.com/v1` | DeepSeek API URL |
+ | `--local-model-path` | optional | `deepseek-ai/DeepSeek-OCR` | Local model path |
+
+ #### generate Parameters
+
+ | Parameter | Type | Default | Description |
+ |-----------|------|---------|-------------|
+ | `text_file` | required | - | Combined text file |
+ | `--db` | optional | `dataset.db` | Database file path |
+ | `--mode` | optional | `api` | LLM mode: `api` or `vllm` |
+ | `--distribution` | optional | `10,10,20,30,20,10` | Position distribution (determines thread count) |
+ | `--datasets-per-thread` | optional | `10` | Target Q&A pairs per thread |
+ | `--openai-api-key` | optional | env var | OpenAI API key |
+ | `--openai-api-url` | optional | `https://api.openai.com/v1` | API URL |
+ | `--model` | optional | `gpt-4` | Model name |
+ | `--vllm-model-path` | optional | - | vLLM model path |
+ | `--tensor-parallel-size` | optional | `1` | Number of GPUs |
+ | `--custom-prompt` | optional | - | Additional instructions |
+
+ ---
+
+ ## Troubleshooting
+
+ ### Common Issues
+
+ **Problem: Threads not completing**
+ - Reduce `--datasets-per-thread`
+ - Check API rate limits
+ - Verify API keys
+ - Ensure the document has enough content
+
+ **Problem: Out of memory (OCR)**
+ - Reduce `--batch-size`
+ - Use API mode instead of local
+
+ **Problem: Out of memory (Generation)**
+ - Reduce the thread count (fewer distribution values)
+ - Use a smaller model
+ - Reduce `--tensor-parallel-size`
+
+ **Problem: Low-quality Q&A pairs**
+ - Adjust the distribution to focus on content-rich sections
+ - Use a higher-quality model (e.g., GPT-4)
+ - Add specific `--custom-prompt` instructions
+ - Check OCR quality
+
+ **Problem: SQLite errors**
+ - Ensure the database path is writable
+ - Don't modify the database during generation
+ - Delete and regenerate if corrupted
+
+ ### Debug Mode
+
+ Set an environment variable for verbose logging:
+
+ ```bash
+ export LOG_LEVEL=DEBUG
+ bookdatamaker generate combined.txt -d dataset.db
+ ```
+
+ ---
+
+ ## Development
+
+ ### Project Structure
+
+ ```
+ bookdatamaker/
+ ├── src/bookdatamaker/
+ │ ├── cli.py # CLI interface
+ │ ├── ocr/
+ │ │ ├── extractor.py # OCR extraction
+ │ │ └── document_parser.py # Document parsing
+ │ ├── mcp/
+ │ │ └── server.py # MCP server
+ │ ├── llm/
+ │ │ └── parallel_generator.py # Parallel generation
+ │ ├── dataset/
+ │ │ ├── builder.py # Dataset building
+ │ │ └── dataset_manager.py # SQLite management
+ │ └── utils/
+ │ ├── page_manager.py # Page navigation
+ │ └── status.py # Progress indicators
+ └── tests/ # Test files
+ ```
+
+ ### Development Setup
+
+ ```bash
+ # Clone repository
+ git clone https://github.com/yourusername/bookdatamaker.git
+ cd bookdatamaker
+
+ # Install dev dependencies
+ pip install -e ".[dev]"
+
+ # Run tests
+ pytest tests/
+
+ # Code formatting
+ black src/
+ ruff check src/
+
+ # Type checking
+ mypy src/
+ ```
+
+ ### Contributing
+
+ Contributions welcome! Please:
+ 1. Fork the repository
+ 2. Create a feature branch
+ 3. Add tests for new features
+ 4. Ensure all tests pass
+ 5. Submit a pull request
+
+ ### Testing
+
+ ```bash
+ # Run all tests
+ pytest
+
+ # Run specific test file
+ pytest tests/test_ocr.py
+
+ # Run with coverage
+ pytest --cov=bookdatamaker tests/
+ ```
+
+ ---
+
+ ## License
+
+ MIT License - see LICENSE file for details.