knwler 0.3.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
knwler-0.3.0/PKG-INFO ADDED
@@ -0,0 +1,198 @@
Metadata-Version: 2.4
Name: knwler
Version: 0.3.0
Summary: Fast and accurate graph extraction from text using LLMs
Requires-Python: >=3.12
Description-Content-Type: text/markdown
Requires-Dist: aiohttp>=3.13.3
Requires-Dist: tiktoken>=0.12.0
Requires-Dist: typer>=0.21.1
Requires-Dist: rich>=14.3.2
Requires-Dist: networkx>=3.6.1
Requires-Dist: pymupdf4llm[layout,ocr]>=0.2.9
Requires-Dist: jinja2>=3.1.6
knwler-0.3.0/README.md ADDED
@@ -0,0 +1,184 @@
# Knwler

**Turn any document into a structured knowledge graph**

Knwler is a lightweight, single-file Python tool that extracts structured knowledge graphs from documents using AI. Feed it a PDF or text file and receive a richly connected network of entities, relationships, and topics — complete with an interactive HTML report and exports ready for your favorite graph analytics platform.

Built for compliance teams, legal departments, research analysts, and anyone who needs to rapidly understand the structure hidden inside dense documents.

![Screenshot 1](./Screenshot1.png)

![Screenshot 2](./Screenshot2.png)

---

## Why Knwler?

| Challenge | How Knwler Solves It |
|---|---|
| Manually mapping entities and relationships in 100+ page regulatory documents | Automated extraction produces a navigable knowledge graph in minutes |
| Expensive vendor lock-in for document intelligence | Runs fully local with Ollama (zero data leaves your machine) or via OpenAI for speed |
| Documents in multiple languages across jurisdictions | Auto-detects language and adapts all prompts — supports English, German, French, Spanish, and Dutch out of the box |
| Results trapped inside one tool | Exports to HTML, GML, GraphML, and raw JSON — import directly into Neo4j, Gephi, yEd, Memgraph, SurrealDB, or any graph platform |
| High per-document processing costs | ~$0.20 per 20-page PDF with OpenAI/GPT-4o; completely free when running locally; LLM response caching means re-runs cost nothing |

---

## Key Features

### Dual LLM Backend — Cloud or Fully Local
Choose **OpenAI** for maximum speed or **Ollama** for fully offline, air-gapped operation. Qwen 2.5 at 3B–14B parameters delivers strong results locally. You can even switch backends between runs and incrementally augment the same graph.

### Automatic Schema Discovery
The pipeline analyzes a sample of your document and **infers the optimal entity types and relation types** — no manual ontology engineering required. You can also supply a schema yourself if you wish. A schema is simply a set of entity types (person, concept, location, ...) and relation types (knows, has_accepted, has_signed, ...).
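
For illustration, a user-supplied schema might look like the sketch below; the key names are an assumption made for this example, not Knwler's documented input format.

```python
# Hypothetical schema to seed a run with; "entity_types" and "relation_types"
# are illustrative key names, not necessarily what Knwler expects.
schema = {
    "entity_types": ["person", "organization", "concept", "location"],
    "relation_types": ["knows", "has_accepted", "has_signed", "refers_to"],
}
```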

### Multilingual by Design
Language is **auto-detected** on every run. All prompts (summarization, extraction, community labeling) and all console/UI output are localized. Adding a new language is as simple as extending a single JSON file.

### Incremental & Augmentable
Re-run on new documents or updated schemas and **the existing graph is augmented** rather than rebuilt. Entity descriptions from multiple sources are intelligently consolidated via LLM-powered summarization.

### Community Detection & Topic Assignment
The Louvain algorithm automatically **discovers clusters of related entities** and an LLM labels each community with human-readable topics — giving you instant thematic insight into the document's structure.
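
As a rough sketch of the idea (not Knwler's internal code), NetworkX, which is already a declared dependency, exposes Louvain community detection directly:

```python
import networkx as nx

# Toy graph standing in for the extracted entity graph.
g = nx.Graph()
g.add_edges_from([
    ("EU AI Act", "provider"), ("provider", "high-risk system"),
    ("NIST", "risk framework"), ("risk framework", "governance"),
])

# Louvain clustering; each returned set of nodes is one community / topic candidate.
for i, nodes in enumerate(nx.community.louvain_communities(g, seed=42)):
    print(f"community {i}: {sorted(nodes)}")
```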

### Self-Contained HTML Report
Export a **single HTML file** with interactive Cytoscape.js network visualization, entity index, topic overview, and rephrased text chunks — shareable without any server or dependencies.

### Rich Export Ecosystem
- **JSON** — the canonical output; import into Neo4j, Memgraph, SurrealDB, or generate vector embeddings
- **GML / GraphML** — open directly in yEd, Gephi, or any standards-compliant graph tool (see the loading sketch after this list)
- **HTML** — standalone interactive report
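
For example, the GML export can be loaded straight into NetworkX for further analysis; the filename below is a placeholder.

```python
import networkx as nx

# "results/document.gml" is a placeholder path for the exported graph.
g = nx.read_gml("results/document.gml")
print(g.number_of_nodes(), "entities,", g.number_of_edges(), "relations")

# Quick look at the most connected entities.
print(sorted(g.degree, key=lambda kv: kv[1], reverse=True)[:5])
```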

### Intelligent Caching
Every LLM call is **hashed and cached** locally. Re-generating reports, tweaking export settings, or re-running with a different schema costs zero additional API calls.
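
The mechanism can be sketched as follows; this illustrates the general idea of hashing requests into cache keys, not Knwler's actual implementation.

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".llm_cache")  # illustrative location

def cache_key(model: str, prompt: str) -> str:
    """Derive a stable key from the exact request contents."""
    payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_call(model: str, prompt: str, call_llm) -> str:
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{cache_key(model, prompt)}.txt"
    if path.exists():                    # cache hit: no API call
        return path.read_text()
    response = call_llm(model, prompt)   # cache miss: one API call
    path.write_text(response)
    return response
```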

### Human-Readable Chunk Rephrasing
Each text chunk is rephrased for readability alongside the original, making the report accessible to non-expert stakeholders while preserving full traceability to source text.

### PDF & Text Ingestion
Handles **PDF-to-text extraction** (via PyMuPDF) as well as plain text and Markdown files. Extracted text is cached to avoid redundant PDF parsing on subsequent runs.
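
A minimal sketch of that step using the declared `pymupdf4llm` dependency; the side-by-side cache file is an assumption made for this example.

```python
from pathlib import Path
import pymupdf4llm

pdf = Path("samples/EUAI.pdf")
cached = pdf.with_suffix(".md")               # assumed cache location, for illustration

if cached.exists():
    text = cached.read_text()                 # reuse the earlier extraction
else:
    text = pymupdf4llm.to_markdown(str(pdf))  # PDF -> Markdown text
    cached.write_text(text)

print(text[:500])
```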

### Portable & Minimal
A **single Python file (~2,000 lines)**, managed via `uv` with minimal dependencies. No database, no backend server, no Docker required.

---

## Cost & Performance

| Scenario | Time (20-page PDF) | Cost |
|---|---|---|
| OpenAI GPT-4o / GPT-4o-mini | ~2–4 minutes | ~$0.20 |
| Ollama Qwen 2.5 (Mac M4 Pro, 64 GB) | ~20–40 minutes | Free |
| Cached re-run (any backend) | Seconds | Free |

---

## Quick Start

```bash
# Install dependencies
uv sync

# Run with OpenAI
uv run main.py --openai -f document.pdf

# Run fully local with Ollama
uv run main.py -f document.pdf

# Re-export HTML only (no LLM calls)
uv run main.py --html-only
```

> **Tip:** When running Ollama locally, launch it via CLI with parallel processing for best throughput:
> ```bash
> OLLAMA_NUM_PARALLEL=8 ollama serve
> ```
> Adjust the number based on your machine specs (8 is suitable for a Mac M4 Pro with 64 GB RAM).

## CLI Options

| Option | Description |
|---|---|
| `--file`, `-f` | Input PDF or text file |
| `--openai` | Use OpenAI API instead of Ollama |
| `--extraction-model`, `-e` | Model for chunk extraction (default: `qwen2.5:3b` / `gpt-4o-mini`) |
| `--discovery-model`, `-d` | Model for schema discovery (default: `qwen2.5:14b` / `gpt-4o`) |
| `--concurrent`, `-c` | Max concurrent LLM requests (default: 10) |
| `--max-tokens` | Max tokens per chunk (default: 400) |
| `--no-discovery` | Skip schema discovery, use built-in defaults |
| `--no-cache` | Disable LLM response caching |
| `--language`, `-l` | Force language code (e.g., `en`, `de`, `fr`) — auto-detects if omitted |
| `--url`, `-u` | Source URL for metadata |
| `--output`, `-o` | Output JSON filename (saved to `results/`) |
| `--html-report` | Generate HTML report (default: on) |
| `--gml-export` | Generate GML graph file (default: on) |
| `--html-only` | Re-export HTML from existing results without re-running extraction |

## Examples
117
+
118
+ ```bash
119
+ # EU AI Act (English)
120
+ uv run main.py --openai \
121
+ --url "https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=OJ:L_202401689" \
122
+ -f samples/EUAI.pdf
123
+
124
+ # NIST AI Risk Management Framework
125
+ uv run main.py --openai \
126
+ --url "https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf" \
127
+ -f samples/Nist.pdf
128
+
129
+ # Belgian Civil Code (Dutch — auto-detected)
130
+ uv run main.py --openai \
131
+ --url "https://www.ejustice.just.fgov.be/cgi/article_body.pl?language=nl&pub_date=2022-07-01&caller=list&numac=2022032058" \
132
+ -f samples/BurgerlijkBoek5.pdf
133
+
134
+ # Deloitte Sustainability Report (German — auto-detected)
135
+ uv run main.py --openai \
136
+ --url "https://www.deloitte.com/de/de/legal/publikationen.html" \
137
+ -f examples/Deloitte/Deloitte-Nachhaltigkeitsbericht-2024.pdf
138
+ ```
139
+
140
+ ## Integration
141
+
142
+ The raw JSON output is designed for downstream integration:
143
+
144
+ - **Import into Neo4j / Memgraph / SurrealDB** — entities and relations map directly to nodes and edges
145
+ - **Generate vector embeddings** — use entity descriptions for semantic search
146
+ - **Feed into n8n workflows** — connect document intelligence to CRM, alerting, or reporting pipelines without code
147
+ - **Visualize in yEd or Gephi** — open the GML/GraphML export for advanced layout and analysis
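
As an example of the first bullet, a minimal Neo4j import could look like this. The JSON field names (`entities`, `relations`, `source`, `target`, `type`) are assumptions about the output layout, and the `neo4j` driver package is not a Knwler dependency.

```python
import json
from neo4j import GraphDatabase  # pip install neo4j

data = json.load(open("results/document.json"))  # placeholder path

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for ent in data["entities"]:          # assumed field name
        session.run(
            "MERGE (e:Entity {name: $name}) SET e.type = $type, e.description = $desc",
            name=ent["name"], type=ent.get("type"), desc=ent.get("description"),
        )
    for rel in data["relations"]:         # assumed field name
        session.run(
            "MATCH (a:Entity {name: $src}), (b:Entity {name: $dst}) "
            "MERGE (a)-[r:RELATED {type: $type}]->(b)",
            src=rel["source"], dst=rel["target"], type=rel.get("type"),
        )
driver.close()
```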

---

## Example Outputs

You can find example reports and raw graph data in diverse languages in the `examples` directory.

## Language

Everything language-related lives in `languages.json`, which contains both the language-specific prompts and the text used for console output.
You can easily add additional languages: simply ask Copilot, Gemini, or any other AI assistant to translate the JSON.
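
As a sketch of the idea (the key names below are invented for illustration; check `languages.json` for the real structure), adding a language amounts to adding one more top-level entry:

```python
import json

langs = json.load(open("languages.json", encoding="utf-8"))

# Hypothetical Italian entry; the actual keys in languages.json may differ.
langs["it"] = {
    "extraction_prompt": "Estrai entità e relazioni dal testo seguente...",
    "summary_prompt": "Riassumi le seguenti descrizioni...",
    "console": {"processing": "Elaborazione di {file} in corso..."},
}

json.dump(langs, open("languages.json", "w", encoding="utf-8"),
          ensure_ascii=False, indent=2)
```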

## OpenAI Key

If you run the process in your terminal, the code looks for the usual `OPENAI_API_KEY` environment variable.
You can set it explicitly via a terminal export:

```bash
export OPENAI_API_KEY=...
```

or in the code (look for `os.environ.get("OPENAI_API_KEY", "")`).

## Ollama

Ollama is just a convenient local LLM service; you can use LM Studio or any other service.
The default model is Qwen 2.5, but here too you should experiment and see what works best for you.
We have run many benchmarks, and bigger models are not necessarily better; sometimes quite the opposite. Small models of 3 or 7 billion parameters are fine and a lot faster.
Reasoning ("thinking") modes, in particular, get in the way of graph extraction. Whatever you do, do not enable thinking and avoid advanced mixture-of-experts (MoE) models.

## Disclaimer

The information extracted by Knwler is generated via machine learning and natural language processing, which may result in errors, omissions, or misinterpretations of the original source material. This tool is provided "as is" for informational purposes only. Users are advised to independently verify any critical data against original source documents before making business, legal, or financial decisions.

---

*Built by [Orbifold Consulting](https://orbifold.net) and inspired by [Knwl](https://knwl.ai).*
@@ -0,0 +1,7 @@
README.md
pyproject.toml
knwler.egg-info/PKG-INFO
knwler.egg-info/SOURCES.txt
knwler.egg-info/dependency_links.txt
knwler.egg-info/requires.txt
knwler.egg-info/top_level.txt
@@ -0,0 +1,7 @@
aiohttp>=3.13.3
tiktoken>=0.12.0
typer>=0.21.1
rich>=14.3.2
networkx>=3.6.1
pymupdf4llm[layout,ocr]>=0.2.9
jinja2>=3.1.6
@@ -0,0 +1 @@
templates
@@ -0,0 +1,15 @@
[project]
name = "knwler"
version = "0.3.0"
description = "Fast and accurate graph extraction from text using LLMs"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "aiohttp>=3.13.3",
    "tiktoken>=0.12.0",
    "typer>=0.21.1",
    "rich>=14.3.2",
    "networkx>=3.6.1",
    "pymupdf4llm[layout,ocr]>=0.2.9",
    "jinja2>=3.1.6",
]
knwler-0.3.0/setup.cfg ADDED
@@ -0,0 +1,4 @@
[egg_info]
tag_build =
tag_date = 0