html2mcq 1.3.0__tar.gz

html2mcq-1.3.0/LICENSE ADDED
@@ -0,0 +1,25 @@
1
+ MIT License
2
+
3
+ <<<<<<< HEAD
4
+ Copyright (c) 2025 html2mcq contributors
5
+ =======
6
+ Copyright (c) 2026 Manjur Alam
7
+ >>>>>>> 5915aa00a9ca68ac27d15c82a278ddca5a2857c6
8
+
9
+ Permission is hereby granted, free of charge, to any person obtaining a copy
10
+ of this software and associated documentation files (the "Software"), to deal
11
+ in the Software without restriction, including without limitation the rights
12
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
13
+ copies of the Software, and to permit persons to whom the Software is
14
+ furnished to do so, subject to the following conditions:
15
+
16
+ The above copyright notice and this permission notice shall be included in all
17
+ copies or substantial portions of the Software.
18
+
19
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
20
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
21
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
22
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
23
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
24
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
25
+ SOFTWARE.
@@ -0,0 +1,344 @@
1
+ Metadata-Version: 2.4
2
+ Name: html2mcq
3
+ Version: 1.3.0
4
+ Summary: Convert any HTML tutorial page into MCQ questions using AI
5
+ License: MIT
6
+ Project-URL: Homepage, https://github.com/manjur-ai/html2mcq
7
+ Project-URL: Issues, https://github.com/manjur-ai/html2mcq/issues
8
+ Keywords: mcq,quiz,ai,education,html,tutorial,llm
9
+ Classifier: Development Status :: 4 - Beta
10
+ Classifier: Intended Audience :: Developers
11
+ Classifier: Intended Audience :: Education
12
+ Classifier: License :: OSI Approved :: MIT License
13
+ Classifier: Programming Language :: Python :: 3
14
+ Classifier: Programming Language :: Python :: 3.9
15
+ Classifier: Programming Language :: Python :: 3.10
16
+ Classifier: Programming Language :: Python :: 3.11
17
+ Classifier: Programming Language :: Python :: 3.12
18
+ Classifier: Topic :: Education
19
+ Classifier: Topic :: Text Processing :: Markup :: HTML
20
+ Requires-Python: >=3.9
21
+ Description-Content-Type: text/markdown
22
+ License-File: LICENSE
23
+ Requires-Dist: beautifulsoup4>=4.12
24
+ Requires-Dist: lxml>=4.9
25
+ Requires-Dist: anthropic>=0.25
26
+ Requires-Dist: openai>=1.30
27
+ Requires-Dist: youtube-transcript-api>=0.6
28
+ Requires-Dist: pymupdf>=1.24
29
+ Provides-Extra: docling
30
+ Requires-Dist: docling>=2.0; extra == "docling"
31
+ Provides-Extra: dev
32
+ Requires-Dist: pytest>=7; extra == "dev"
33
+ Requires-Dist: pytest-cov; extra == "dev"
34
+ Requires-Dist: black; extra == "dev"
35
+ Requires-Dist: ruff; extra == "dev"
36
+ Requires-Dist: build; extra == "dev"
37
+ Requires-Dist: twine; extra == "dev"
38
+ Dynamic: license-file
39
+
40
+ # html2mcq
41
+
42
+ **Convert any HTML tutorial page, YouTube video, or PDF into MCQ questions using AI.**
43
+
44
+ [![PyPI version](https://badge.fury.io/py/html2mcq.svg)](https://pypi.org/project/html2mcq/)
45
+ [![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/)
46
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
47
+ [![Tests](https://img.shields.io/badge/tests-68%20passing-brightgreen.svg)]()
48
+
49
+ ---
50
+
51
+ ## Install
52
+
53
+ ```bash
54
+ pip install html2mcq
55
+ ```
56
+
57
+ That's it. Everything is included — HTML extraction, YouTube transcripts, PDF support, all AI providers.
58
+
59
+ > For scanned PDFs (OCR + complex layouts), optionally install Docling:
60
+ > ```bash
61
+ > pip install html2mcq docling
62
+ > ```
63
+
64
+ ---
65
+
66
+ ## Quick Start
67
+
68
+ ```python
69
+ from html2mcq import MCQGenerator
70
+
71
+ gen = MCQGenerator(
72
+ api_key="sk-or-v1-...",
73
+ provider="openrouter",
74
+ model="meta-llama/llama-3.3-70b-instruct:free",
75
+ )
76
+
77
+ # From an HTML tutorial page
78
+ mcq = gen.from_url("https://docs.python.org/3/tutorial/", n=15)
79
+
80
+ # From a YouTube video (transcript fetched automatically)
81
+ mcq = gen.from_video_url("https://www.youtube.com/watch?v=VXU4LSAQDSc", n=10)
82
+
83
+ # From a PDF URL
84
+ mcq = gen.from_pdf_url("https://example.com/tutorial.pdf", n=10)
85
+
86
+ # From a local PDF file
87
+ mcq = gen.from_pdf_path("/path/to/notes.pdf", n=10)
88
+
89
+ # From raw HTML
90
+ mcq = gen.from_html(html_string, n=10)
91
+
92
+ print(mcq.to_json())
93
+ print(mcq.to_pretty_str())
94
+ ```
95
+
96
+ ---
97
+
98
+ ## What it supports
99
+
100
+ | Input | How |
101
+ |---|---|
102
+ | HTML tutorial page URL | `from_url(url)` — extracts text, code, tables, images, video links, PDF links |
103
+ | YouTube video URL | `from_video_url(url)` or `from_url(url)` — transcript fetched automatically |
104
+ | PDF via URL | `from_pdf_url(url)` — downloaded and extracted |
105
+ | Local PDF file | `from_pdf_path(path)` |
106
+ | Raw HTML string | `from_html(html)` |
107
+ | Pre-built blocks | `from_blocks(blocks)` |
108
+
109
+ ---
110
+
111
+ ## Output Schema
112
+
113
+ ```json
114
+ {
115
+ "total_exam_time": 20,
116
+ "questions": [
117
+ {
118
+ "question_html": "Which of these are non-mutating array methods?",
119
+ "options": ["push()", "map()", "filter()", "pop()"],
120
+ "answers": [1, 2],
121
+ "multi": true,
122
+ "marks": 1,
123
+ "negative_marks": 0,
124
+ "difficulty": "medium",
125
+ "explaination": "map() and filter() return new arrays without modifying the original."
126
+ }
127
+ ]
128
+ }
129
+ ```
130
+
131
+ - `answers` is always an array of zero-based indices into `options`, so a question can have several correct answers
132
+ - `multi: true` → `negative_marks: 0.0`, `multi: false` → `negative_marks: 0.25`
133
+ - `marks` is always `1`; by default, difficulty is split roughly one third each across easy, medium, and hard
134
+
135
+ ---
136
+
137
+ ## AI Providers
138
+
139
+ ```python
140
+ # Anthropic Claude
141
+ gen = MCQGenerator(api_key="sk-ant-...", provider="anthropic")
142
+
143
+ # OpenAI
144
+ gen = MCQGenerator(api_key="sk-...", provider="openai", model="gpt-4o")
145
+
146
+ # OpenRouter — 100+ models, free Llama available
147
+ gen = MCQGenerator(api_key="sk-or-...", provider="openrouter",
148
+ model="meta-llama/llama-3.3-70b-instruct:free")
149
+
150
+ # API key from environment variable
151
+ # Set ANTHROPIC_API_KEY / OPENAI_API_KEY / OPENROUTER_API_KEY
152
+ gen = MCQGenerator(provider="anthropic")
153
+ ```
154
+
155
+ ---
156
+
157
+ ## Custom Instructions
158
+
159
+ Override the AI's default behaviour without breaking the fixed rules (JSON schema, marks, options count).
160
+
161
+ ```python
162
+ # Apply to every call this generator makes
163
+ gen = MCQGenerator(
164
+ provider="openrouter",
165
+ custom_instructions=(
166
+ "Make answers very close and confusing. "
167
+ "Only people with sharp attention should get 100% marks."
168
+ )
169
+ )
170
+
171
+ # Or per individual call
172
+ mcq = gen.from_url(
173
+ "https://example.com/tutorial",
174
+ n=10,
175
+ custom_instructions="All questions must be code-based. No theory questions."
176
+ )
177
+ ```
178
+
179
+ **How the prompt is structured:**
180
+ ```
181
+ SYSTEM PROMPT (fixed — always sent, never modified)
182
+ Rules: 4 options, JSON schema, marks/negative_marks, no hallucination...
183
+
184
+ USER PROMPT
185
+ [extracted content: text / code / tables / transcripts / PDFs]
186
+
187
+ Generate exactly N questions. Mix difficulties...
188
+
189
+ --- CUSTOM INSTRUCTIONS (highest priority) ---
190
+ Your custom text here
191
+ --- END CUSTOM INSTRUCTIONS ---
192
+
193
+ Return ONLY the JSON array.
194
+ ```
195
+
196
+ ---
197
+
198
+ ## PDF Backends
199
+
200
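+ By default, PyMuPDF extracts PDF text locally and quickly. If it returns fewer than 100 characters (typically a scanned PDF), html2mcq retries with Docling, provided a Docling Serve URL is set or Docling is installed locally.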
201
+
202
+ ```python
203
+ # Default — PyMuPDF, fast, works for most digital PDFs
204
+ gen = MCQGenerator(provider="anthropic")
205
+
206
+ # With Docling Serve on your own server (GPU recommended)
207
+ # Run once: docker run --gpus all -p 5001:5001 quay.io/docling/docling-serve
208
+ gen = MCQGenerator(
209
+ provider="anthropic",
210
+ docling_api_url="http://your-server:5001"
211
+ )
212
+ ```
213
+
214
+ **Auto-fallback:**
215
+ ```
216
+ PyMuPDF extracts text
217
+ ↓ fewer than 100 chars? (scanned PDF)
218
+ → retry with Docling Serve (if docling_api_url set)
219
+ → or retry with Docling Local (if pip install html2mcq docling)
220
+ ```
221
+
222
+ ---
223
+
224
+ ## Desktop GUI
225
+
226
+ ```bash
227
+ python html2mcq_gui.py
228
+ ```
229
+
230
+ 5 input tabs: **Web URL · YouTube · PDF URL · Local PDF · Raw HTML**
231
+
232
+ Options: N questions, difficulty mix, focus topics, custom instructions, PDF backend selector.
233
+
234
+ Output: syntax-highlighted pretty view or JSON, copy to clipboard, save as JSON.
235
+
236
+ ---
237
+
238
+ ## CLI
239
+
240
+ ```bash
241
+ # Basic
242
+ html2mcq https://example.com/tutorial --n 20
243
+
244
+ # All options
245
+ html2mcq https://example.com/tutorial \
246
+ --n 20 \
247
+ --provider openrouter \
248
+ --model meta-llama/llama-3.3-70b-instruct:free \
249
+ --difficulty "40% easy, 40% medium, 20% hard" \
250
+ --topics "variables" "functions" \
251
+ --instructions "Make answers very close and confusing" \
252
+ --output quiz.json \
253
+ --format json
254
+
255
+ # From a local HTML file
256
+ html2mcq --html page.html --n 5
257
+ ```
258
+
259
+ ---
260
+
261
+ ## Live Tests
262
+
263
+ ```bash
264
+ export OPENROUTER_API_KEY="sk-or-v1-..."
265
+ python test_live.py # all input types
266
+ python test_live.py --test html # HTML only
267
+ python test_live.py --test video # YouTube only
268
+ python test_live.py --test pdf_file # local PDF only
269
+ python test_live.py --n 5
270
+ ```
271
+
272
+ ---
273
+
274
+ ## API Reference
275
+
276
+ ### `MCQGenerator`
277
+
278
+ ```python
279
+ MCQGenerator(
280
+ api_key=None, # or set env var ANTHROPIC/OPENAI/OPENROUTER_API_KEY
281
+ provider="anthropic", # "anthropic" | "openai" | "openrouter"
282
+ model="", # default per provider if not set
283
+ batch_size=10, # questions per API call; large N is auto-batched
284
+ max_tokens=4096,
285
+ pdf_backend="pymupdf", # "pymupdf" | "docling_local" | "docling_serve"
286
+ docling_api_url="", # Docling Serve URL e.g. "http://your-server:5001"
287
+ docling_ocr=True,
288
+ transcript_languages=["en"],
289
+ custom_instructions="", # global custom instructions for every call
290
+ )
291
+ ```
292
+
293
+ | Method | Description |
294
+ |---|---|
295
+ | `from_url(url, n, ...)` | HTML page; auto-detects YouTube links |
296
+ | `from_html(html, n, base_url, ...)` | Raw HTML string |
297
+ | `from_video_url(url, n, video_title, ...)` | YouTube video via transcript |
298
+ | `from_pdf_url(url, n, pdf_title, ...)` | PDF via URL |
299
+ | `from_pdf_path(path, n, pdf_title, ...)` | Local PDF file |
300
+ | `from_blocks(blocks, n, ...)` | Pre-extracted `ContentBlock` list |
301
+
302
+ All methods accept `custom_instructions`, `difficulty_mix`, and `focus_topics`.
303
+
304
+ ### `MCQSet`
305
+
306
+ | Property / Method | Description |
307
+ |---|---|
308
+ | `.questions` | `List[MCQQuestion]` |
309
+ | `.total_exam_time` | Minutes — auto-calculated as n × 2 |
310
+ | `.to_json()` | Exam-ready JSON (`total_exam_time` + `questions` only) |
311
+ | `.to_pretty_str()` | Human-readable output |
312
+ | `.filter_by_difficulty(d)` | Returns filtered `MCQSet` for `"easy"/"medium"/"hard"` |
313
+
314
+ ---
315
+
316
+ ## Project Structure
317
+
318
+ ```
319
+ html2mcq/
320
+ ├── html2mcq/
321
+ │ ├── __init__.py # Public exports
322
+ │ ├── extractor.py # HTML parser
323
+ │ ├── video.py # YouTube transcript extractor
324
+ │ ├── pdf.py # PDF extractor (PyMuPDF + Docling)
325
+ │ ├── generator.py # MCQGenerator — main API
326
+ │ ├── models.py # ContentBlock, MCQQuestion, MCQSet
327
+ │ ├── prompts.py # Fixed system prompt + dynamic user prompt
328
+ │ └── cli.py # CLI entry point
329
+ ├── tests/
330
+ │ └── test_html2mcq.py # 68 unit tests (fully mocked, no API key needed)
331
+ ├── html2mcq_gui.py # Tkinter desktop GUI
332
+ ├── test_live.py # Live integration tests
333
+ ├── examples/
334
+ │ └── basic_usage.py
335
+ ├── pyproject.toml
336
+ ├── README.md
337
+ └── CHANGELOG.md
338
+ ```
339
+
340
+ ---
341
+
342
+ ## License
343
+
344
+ MIT © 2025 html2mcq contributors
@@ -0,0 +1,305 @@
1
+ # html2mcq
2
+
3
+ **Convert any HTML tutorial page, YouTube video, or PDF into MCQ questions using AI.**
4
+
5
+ [![PyPI version](https://badge.fury.io/py/html2mcq.svg)](https://pypi.org/project/html2mcq/)
6
+ [![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/)
7
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
8
+ [![Tests](https://img.shields.io/badge/tests-68%20passing-brightgreen.svg)]()
9
+
10
+ ---
11
+
12
+ ## Install
13
+
14
+ ```bash
15
+ pip install html2mcq
16
+ ```
17
+
18
+ That's it. Everything is included — HTML extraction, YouTube transcripts, PDF support, all AI providers.
19
+
20
+ > For scanned PDFs (OCR + complex layouts), optionally install Docling:
21
+ > ```bash
22
+ > pip install html2mcq docling
23
+ > ```
24
+
25
+ ---
26
+
27
+ ## Quick Start
28
+
29
+ ```python
30
+ from html2mcq import MCQGenerator
31
+
32
+ gen = MCQGenerator(
33
+ api_key="sk-or-v1-...",
34
+ provider="openrouter",
35
+ model="meta-llama/llama-3.3-70b-instruct:free",
36
+ )
37
+
38
+ # From an HTML tutorial page
39
+ mcq = gen.from_url("https://docs.python.org/3/tutorial/", n=15)
40
+
41
+ # From a YouTube video (transcript fetched automatically)
42
+ mcq = gen.from_video_url("https://www.youtube.com/watch?v=VXU4LSAQDSc", n=10)
43
+
44
+ # From a PDF URL
45
+ mcq = gen.from_pdf_url("https://example.com/tutorial.pdf", n=10)
46
+
47
+ # From a local PDF file
48
+ mcq = gen.from_pdf_path("/path/to/notes.pdf", n=10)
49
+
50
+ # From raw HTML
51
+ mcq = gen.from_html(html_string, n=10)
52
+
53
+ print(mcq.to_json())
54
+ print(mcq.to_pretty_str())
55
+ ```
56
+
57
+ ---
58
+
59
+ ## What it supports
60
+
61
+ | Input | How |
62
+ |---|---|
63
+ | HTML tutorial page URL | `from_url(url)` — extracts text, code, tables, images, video links, PDF links |
64
+ | YouTube video URL | `from_video_url(url)` or `from_url(url)` — transcript fetched automatically |
65
+ | PDF via URL | `from_pdf_url(url)` — downloaded and extracted |
66
+ | Local PDF file | `from_pdf_path(path)` |
67
+ | Raw HTML string | `from_html(html)` |
68
+ | Pre-built blocks | `from_blocks(blocks)` |
69
+
70
+ ---
71
+
72
+ ## Output Schema
73
+
74
+ ```json
75
+ {
76
+ "total_exam_time": 20,
77
+ "questions": [
78
+ {
79
+ "question_html": "Which of these are non-mutating array methods?",
80
+ "options": ["push()", "map()", "filter()", "pop()"],
81
+ "answers": [1, 2],
82
+ "multi": true,
83
+ "marks": 1,
84
+ "negative_marks": 0,
85
+ "difficulty": "medium",
86
+ "explaination": "map() and filter() return new arrays without modifying the original."
87
+ }
88
+ ]
89
+ }
90
+ ```
91
+
92
+ - `answers` is always an array of zero-based indices into `options`, so a question can have several correct answers
93
+ - `multi: true` → `negative_marks: 0.0`, `multi: false` → `negative_marks: 0.25`
94
+ - `marks` is always `1`; by default, difficulty is split roughly one third each across easy, medium, and hard
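The marking rules listed above can be sketched as a small scorer. This is a minimal sketch, assuming the JSON shape shown under Output Schema. `score_exam` and the `responses` format (question index mapped to the chosen option indices) are hypothetical names for illustration and are not part of html2mcq. It also assumes that a question earns `marks` only when the chosen set matches `answers` exactly, and that any other non-empty response costs `negative_marks`.

```python
import json


def score_exam(exam_json, responses):
    """Score an exam given in the documented JSON shape.

    responses maps a question index to the list of chosen option indices
    (hypothetical input format, not defined by html2mcq).
    """
    exam = json.loads(exam_json)
    total = 0.0
    for index, question in enumerate(exam["questions"]):
        chosen = responses.get(index)
        if not chosen:
            continue  # unanswered: nothing gained, nothing lost
        if sorted(chosen) == sorted(question["answers"]):
            total += question["marks"]
        else:
            total -= question["negative_marks"]
    return total


# Two questions following the README rules:
# multi -> negative_marks 0, single -> negative_marks 0.25.
sample = json.dumps({
    "total_exam_time": 4,
    "questions": [
        {"question_html": "Q1", "options": ["a", "b", "c", "d"],
         "answers": [1, 2], "multi": True, "marks": 1, "negative_marks": 0,
         "difficulty": "medium", "explaination": "..."},
        {"question_html": "Q2", "options": ["a", "b", "c", "d"],
         "answers": [0], "multi": False, "marks": 1, "negative_marks": 0.25,
         "difficulty": "easy", "explaination": "..."},
    ],
})

# Q1 answered correctly (+1), Q2 answered wrongly (-0.25).
print(score_exam(sample, {0: [2, 1], 1: [3]}))  # 0.75
```

Comparing sorted lists keeps the order of the chosen options irrelevant. A real exam engine may well use a different partial-credit policy for multi-answer questions.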
95
+
96
+ ---
97
+
98
+ ## AI Providers
99
+
100
+ ```python
101
+ # Anthropic Claude
102
+ gen = MCQGenerator(api_key="sk-ant-...", provider="anthropic")
103
+
104
+ # OpenAI
105
+ gen = MCQGenerator(api_key="sk-...", provider="openai", model="gpt-4o")
106
+
107
+ # OpenRouter — 100+ models, free Llama available
108
+ gen = MCQGenerator(api_key="sk-or-...", provider="openrouter",
109
+ model="meta-llama/llama-3.3-70b-instruct:free")
110
+
111
+ # API key from environment variable
112
+ # Set ANTHROPIC_API_KEY / OPENAI_API_KEY / OPENROUTER_API_KEY
113
+ gen = MCQGenerator(provider="anthropic")
114
+ ```
115
+
116
+ ---
117
+
118
+ ## Custom Instructions
119
+
120
+ Override the AI's default behaviour without breaking the fixed rules (JSON schema, marks, options count).
121
+
122
+ ```python
123
+ # Apply to every call this generator makes
124
+ gen = MCQGenerator(
125
+ provider="openrouter",
126
+ custom_instructions=(
127
+ "Make answers very close and confusing. "
128
+ "Only people with sharp attention should get 100% marks."
129
+ )
130
+ )
131
+
132
+ # Or per individual call
133
+ mcq = gen.from_url(
134
+ "https://example.com/tutorial",
135
+ n=10,
136
+ custom_instructions="All questions must be code-based. No theory questions."
137
+ )
138
+ ```
139
+
140
+ **How the prompt is structured:**
141
+ ```
142
+ SYSTEM PROMPT (fixed — always sent, never modified)
143
+ Rules: 4 options, JSON schema, marks/negative_marks, no hallucination...
144
+
145
+ USER PROMPT
146
+ [extracted content: text / code / tables / transcripts / PDFs]
147
+
148
+ Generate exactly N questions. Mix difficulties...
149
+
150
+ --- CUSTOM INSTRUCTIONS (highest priority) ---
151
+ Your custom text here
152
+ --- END CUSTOM INSTRUCTIONS ---
153
+
154
+ Return ONLY the JSON array.
155
+ ```
156
+
157
+ ---
158
+
159
+ ## PDF Backends
160
+
161
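+ By default, PyMuPDF extracts PDF text locally and quickly. If it returns fewer than 100 characters (typically a scanned PDF), html2mcq retries with Docling, provided a Docling Serve URL is set or Docling is installed locally.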
162
+
163
+ ```python
164
+ # Default — PyMuPDF, fast, works for most digital PDFs
165
+ gen = MCQGenerator(provider="anthropic")
166
+
167
+ # With Docling Serve on your own server (GPU recommended)
168
+ # Run once: docker run --gpus all -p 5001:5001 quay.io/docling/docling-serve
169
+ gen = MCQGenerator(
170
+ provider="anthropic",
171
+ docling_api_url="http://your-server:5001"
172
+ )
173
+ ```
174
+
175
+ **Auto-fallback:**
176
+ ```
177
+ PyMuPDF extracts text
178
+ ↓ fewer than 100 chars? (scanned PDF)
179
+ → retry with Docling Serve (if docling_api_url set)
180
+ → or retry with Docling Local (if pip install html2mcq docling)
181
+ ```
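The fallback order in the diagram above can be sketched as follows. This is an illustrative sketch and not the package's actual `pdf.py`. `extract_with_fallback`, `MIN_CHARS`, and the three extractor callables are hypothetical stand-ins for PyMuPDF, Docling Serve, and local Docling. Stripping whitespace before counting characters is also an assumption.

```python
from typing import Callable, Optional

MIN_CHARS = 100  # threshold taken from the fallback diagram


def extract_with_fallback(
    pdf_path: str,
    primary: Callable[[str], str],
    serve: Optional[Callable[[str], str]] = None,
    local: Optional[Callable[[str], str]] = None,
) -> str:
    """Try the fast extractor first, then the first configured fallback."""
    text = primary(pdf_path)
    if len(text.strip()) >= MIN_CHARS:
        return text  # enough text: treat it as a digital PDF
    # Too little text (likely a scanned PDF): prefer the remote service,
    # then the local install, mirroring the order in the diagram.
    for fallback in (serve, local):
        if fallback is not None:
            return fallback(pdf_path)
    return text  # no fallback configured: return what we have


# Stub extractors so the sketch runs without any PDF library installed.
def scanned_pdf(path):
    return "  "


def ocr_stub(path):
    return "text recovered by OCR"


print(extract_with_fallback("notes.pdf", scanned_pdf, local=ocr_stub))
```

Passing the extractors in as callables keeps the decision logic testable without PyMuPDF or Docling present.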
182
+
183
+ ---
184
+
185
+ ## Desktop GUI
186
+
187
+ ```bash
188
+ python html2mcq_gui.py
189
+ ```
190
+
191
+ 5 input tabs: **Web URL · YouTube · PDF URL · Local PDF · Raw HTML**
192
+
193
+ Options: N questions, difficulty mix, focus topics, custom instructions, PDF backend selector.
194
+
195
+ Output: syntax-highlighted pretty view or JSON, copy to clipboard, save as JSON.
196
+
197
+ ---
198
+
199
+ ## CLI
200
+
201
+ ```bash
202
+ # Basic
203
+ html2mcq https://example.com/tutorial --n 20
204
+
205
+ # All options
206
+ html2mcq https://example.com/tutorial \
207
+ --n 20 \
208
+ --provider openrouter \
209
+ --model meta-llama/llama-3.3-70b-instruct:free \
210
+ --difficulty "40% easy, 40% medium, 20% hard" \
211
+ --topics "variables" "functions" \
212
+ --instructions "Make answers very close and confusing" \
213
+ --output quiz.json \
214
+ --format json
215
+
216
+ # From a local HTML file
217
+ html2mcq --html page.html --n 5
218
+ ```
219
+
220
+ ---
221
+
222
+ ## Live Tests
223
+
224
+ ```bash
225
+ export OPENROUTER_API_KEY="sk-or-v1-..."
226
+ python test_live.py # all input types
227
+ python test_live.py --test html # HTML only
228
+ python test_live.py --test video # YouTube only
229
+ python test_live.py --test pdf_file # local PDF only
230
+ python test_live.py --n 5
231
+ ```
232
+
233
+ ---
234
+
235
+ ## API Reference
236
+
237
+ ### `MCQGenerator`
238
+
239
+ ```python
240
+ MCQGenerator(
241
+ api_key=None, # or set env var ANTHROPIC/OPENAI/OPENROUTER_API_KEY
242
+ provider="anthropic", # "anthropic" | "openai" | "openrouter"
243
+ model="", # default per provider if not set
244
+ batch_size=10, # questions per API call; large N is auto-batched
245
+ max_tokens=4096,
246
+ pdf_backend="pymupdf", # "pymupdf" | "docling_local" | "docling_serve"
247
+ docling_api_url="", # Docling Serve URL e.g. "http://your-server:5001"
248
+ docling_ocr=True,
249
+ transcript_languages=["en"],
250
+ custom_instructions="", # global custom instructions for every call
251
+ )
252
+ ```
253
+
254
+ | Method | Description |
255
+ |---|---|
256
+ | `from_url(url, n, ...)` | HTML page; auto-detects YouTube links |
257
+ | `from_html(html, n, base_url, ...)` | Raw HTML string |
258
+ | `from_video_url(url, n, video_title, ...)` | YouTube video via transcript |
259
+ | `from_pdf_url(url, n, pdf_title, ...)` | PDF via URL |
260
+ | `from_pdf_path(path, n, pdf_title, ...)` | Local PDF file |
261
+ | `from_blocks(blocks, n, ...)` | Pre-extracted `ContentBlock` list |
262
+
263
+ All methods accept `custom_instructions`, `difficulty_mix`, and `focus_topics`.
264
+
265
+ ### `MCQSet`
266
+
267
+ | Property / Method | Description |
268
+ |---|---|
269
+ | `.questions` | `List[MCQQuestion]` |
270
+ | `.total_exam_time` | Minutes — auto-calculated as n × 2 |
271
+ | `.to_json()` | Exam-ready JSON (`total_exam_time` + `questions` only) |
272
+ | `.to_pretty_str()` | Human-readable output |
273
+ | `.filter_by_difficulty(d)` | Returns filtered `MCQSet` for `"easy"/"medium"/"hard"` |
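Two numeric rules appear in the reference above: large `n` is split into calls of at most `batch_size` questions, and `total_exam_time` is n × 2 minutes. A minimal sketch of that arithmetic follows. `plan_batches` and `exam_minutes` are hypothetical helper names, and the exact split html2mcq uses internally is an assumption.

```python
def plan_batches(n, batch_size=10):
    """Split n questions into per-call batch sizes of at most batch_size."""
    if n < 0 or batch_size <= 0:
        raise ValueError("n must be >= 0 and batch_size must be > 0")
    full, rest = divmod(n, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


def exam_minutes(n):
    """Exam time in minutes, following the documented n x 2 rule."""
    return n * 2


print(plan_batches(25))  # [10, 10, 5]
print(exam_minutes(10))  # 20
```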
274
+
275
+ ---
276
+
277
+ ## Project Structure
278
+
279
+ ```
280
+ html2mcq/
281
+ ├── html2mcq/
282
+ │ ├── __init__.py # Public exports
283
+ │ ├── extractor.py # HTML parser
284
+ │ ├── video.py # YouTube transcript extractor
285
+ │ ├── pdf.py # PDF extractor (PyMuPDF + Docling)
286
+ │ ├── generator.py # MCQGenerator — main API
287
+ │ ├── models.py # ContentBlock, MCQQuestion, MCQSet
288
+ │ ├── prompts.py # Fixed system prompt + dynamic user prompt
289
+ │ └── cli.py # CLI entry point
290
+ ├── tests/
291
+ │ └── test_html2mcq.py # 68 unit tests (fully mocked, no API key needed)
292
+ ├── html2mcq_gui.py # Tkinter desktop GUI
293
+ ├── test_live.py # Live integration tests
294
+ ├── examples/
295
+ │ └── basic_usage.py
296
+ ├── pyproject.toml
297
+ ├── README.md
298
+ └── CHANGELOG.md
299
+ ```
300
+
301
+ ---
302
+
303
+ ## License
304
+
305
+ MIT © 2025 html2mcq contributors
@@ -0,0 +1,21 @@
1
+ """
2
+ html2mcq - Convert any HTML tutorial page, YouTube video, or PDF into MCQ questions using AI.
3
+ """
4
+
5
+ from .generator import MCQGenerator
6
+ from .extractor import ContentExtractor
7
+ from .video import VideoTranscriptExtractor
8
+ from .pdf import PDFExtractor
9
+ from .models import MCQQuestion, MCQSet, ContentBlock
10
+
11
+ __version__ = "1.3.0"
12
+ __author__ = "html2mcq"
13
+ __all__ = [
14
+ "MCQGenerator",
15
+ "ContentExtractor",
16
+ "VideoTranscriptExtractor",
17
+ "PDFExtractor",
18
+ "MCQQuestion",
19
+ "MCQSet",
20
+ "ContentBlock",
21
+ ]