@mclawnet/agent 0.5.9 → 0.6.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (78)
  1. package/cli.js +168 -61
  2. package/dist/__tests__/cli.test.d.ts +2 -0
  3. package/dist/__tests__/cli.test.d.ts.map +1 -0
  4. package/dist/__tests__/service-config.test.d.ts +2 -0
  5. package/dist/__tests__/service-config.test.d.ts.map +1 -0
  6. package/dist/__tests__/service-linux.test.d.ts +2 -0
  7. package/dist/__tests__/service-linux.test.d.ts.map +1 -0
  8. package/dist/__tests__/service-macos.test.d.ts +2 -0
  9. package/dist/__tests__/service-macos.test.d.ts.map +1 -0
  10. package/dist/__tests__/service-windows.test.d.ts +2 -0
  11. package/dist/__tests__/service-windows.test.d.ts.map +1 -0
  12. package/dist/backend-adapter.d.ts +2 -0
  13. package/dist/backend-adapter.d.ts.map +1 -1
  14. package/dist/{chunk-KHPEQTWF.js → chunk-KITKMSBE.js} +166 -90
  15. package/dist/chunk-KITKMSBE.js.map +1 -0
  16. package/dist/chunk-W3LSW4XY.js +95 -0
  17. package/dist/chunk-W3LSW4XY.js.map +1 -0
  18. package/dist/hub-connection.d.ts.map +1 -1
  19. package/dist/index.js +1 -1
  20. package/dist/linux-5KQ4SCAA.js +175 -0
  21. package/dist/linux-5KQ4SCAA.js.map +1 -0
  22. package/dist/macos-FGY546NC.js +173 -0
  23. package/dist/macos-FGY546NC.js.map +1 -0
  24. package/dist/service/config.d.ts +19 -0
  25. package/dist/service/config.d.ts.map +1 -0
  26. package/dist/service/index.d.ts +6 -0
  27. package/dist/service/index.d.ts.map +1 -0
  28. package/dist/service/index.js +46 -0
  29. package/dist/service/index.js.map +1 -0
  30. package/dist/service/linux.d.ts +18 -0
  31. package/dist/service/linux.d.ts.map +1 -0
  32. package/dist/service/macos.d.ts +18 -0
  33. package/dist/service/macos.d.ts.map +1 -0
  34. package/dist/service/types.d.ts +19 -0
  35. package/dist/service/types.d.ts.map +1 -0
  36. package/dist/service/windows.d.ts +18 -0
  37. package/dist/service/windows.d.ts.map +1 -0
  38. package/dist/session-manager.d.ts +4 -7
  39. package/dist/session-manager.d.ts.map +1 -1
  40. package/dist/skill-loader.d.ts +8 -0
  41. package/dist/skill-loader.d.ts.map +1 -0
  42. package/dist/start.d.ts.map +1 -1
  43. package/dist/start.js +1 -1
  44. package/dist/windows-PIJ4CMWX.js +164 -0
  45. package/dist/windows-PIJ4CMWX.js.map +1 -0
  46. package/package.json +8 -6
  47. package/skills/academic-search/SKILL.md +147 -0
  48. package/skills/architecture/SKILL.md +294 -0
  49. package/skills/changelog-generator/SKILL.md +112 -0
  50. package/skills/chart-visualization/SKILL.md +183 -0
  51. package/skills/code-review/SKILL.md +304 -0
  52. package/skills/codebase-health/SKILL.md +281 -0
  53. package/skills/consulting-analysis/SKILL.md +584 -0
  54. package/skills/content-research-writer/SKILL.md +546 -0
  55. package/skills/data-analysis/SKILL.md +194 -0
  56. package/skills/deep-research/SKILL.md +198 -0
  57. package/skills/docx/SKILL.md +211 -0
  58. package/skills/github-deep-research/SKILL.md +207 -0
  59. package/skills/image-generation/SKILL.md +209 -0
  60. package/skills/lead-research-assistant/SKILL.md +207 -0
  61. package/skills/mcp-builder/SKILL.md +304 -0
  62. package/skills/meeting-insights-analyzer/SKILL.md +335 -0
  63. package/skills/pair-programming/SKILL.md +196 -0
  64. package/skills/pdf/SKILL.md +309 -0
  65. package/skills/performance-analysis/SKILL.md +261 -0
  66. package/skills/podcast-generation/SKILL.md +224 -0
  67. package/skills/pptx/SKILL.md +497 -0
  68. package/skills/project-learnings/SKILL.md +280 -0
  69. package/skills/security-audit/SKILL.md +211 -0
  70. package/skills/skill-creator/SKILL.md +200 -0
  71. package/skills/technical-writing/SKILL.md +286 -0
  72. package/skills/testing/SKILL.md +363 -0
  73. package/skills/video-generation/SKILL.md +247 -0
  74. package/skills/web-design-guidelines/SKILL.md +203 -0
  75. package/skills/webapp-testing/SKILL.md +162 -0
  76. package/skills/workflow-automation/SKILL.md +299 -0
  77. package/skills/xlsx/SKILL.md +305 -0
  78. package/dist/chunk-KHPEQTWF.js.map +0 -1
@@ -0,0 +1,194 @@
+ ---
+ name: data-analysis
+ description: Guide systematic data analysis workflows using Python (pandas, DuckDB) or SQL. Use when analyzing datasets, generating statistics, creating summaries, or exploring structured data from CSV/Excel/database sources.
+ ---
+
+ # Data Analysis
+
+ A systematic framework for analyzing structured data — from initial inspection through statistical analysis to visualization and reporting.
+
+ ## Overview
+
+ Data analysis follows a predictable arc:
+
+ 1. **Understand** — What question are we answering? What decision does this inform?
+ 2. **Inspect** — What does the data look like? Types, ranges, quality issues?
+ 3. **Transform** — Clean, reshape, and enrich the data for analysis.
+ 4. **Analyze** — Apply aggregation or statistical techniques to extract insights.
+ 5. **Visualize** — Create charts that communicate findings clearly.
+ 6. **Report** — Summarize findings with context, caveats, and recommendations.
+
+ Never skip the inspection step. Jumping straight to analysis on data you do not understand produces confidently wrong results.
+
+ ## When to Use
+
+ - Exploring a dataset's structure and contents
+ - Summary statistics, distributions, or trend analysis
+ - Answering business questions from CSV, Excel, or database data
+ - Aggregation reports with grouping, filtering, and ranking
+ - Comparing cohorts or time periods; detecting anomalies
+
+ ## When NOT to Use
+
+ - **ML model training** — This covers descriptive analysis, not predictive modeling.
+ - **Real-time streaming** — Use stream processing tools.
+ - **ETL pipeline design** — This is ad-hoc analysis, not production pipelines.
+
+ ## Step 1: Understand Requirements
+
+ Before touching data, clarify the question.
+
+ - What specific question needs an answer? Restate it precisely.
+ - Who is the audience? Technical team, executives, external stakeholders?
+ - What format is expected? Table, chart, single number, written summary?
+ - What time range, filters, or segments apply?
+
+ A vague request like "analyze our sales data" must be narrowed to something answerable: "What were the top 10 products by revenue in Q1 2025, broken down by region?"
+
+ ## Step 2: Data Inspection
+
+ **Pandas:**
+ ```python
+ import pandas as pd
+ df = pd.read_csv("data.csv")
+ df.shape; df.dtypes; df.head(10); df.describe()
+ df.isnull().sum(); df.nunique(); df.duplicated().sum()
+ ```
+
+ **DuckDB:**
+ ```sql
+ DESCRIBE SELECT * FROM 'data.csv';
+ SELECT COUNT(*) FROM 'data.csv';
+ SUMMARIZE SELECT * FROM 'data.csv';
+ ```
+
+ **What to note:** Do column types match expectations? Are missing values random or systematic? Is cardinality sensible? Are date formats consistent? Are numeric ranges plausible (no negative ages, no percentages over 100)?
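+
+ Continuing from the inspection block above, a few example plausibility checks (column names such as `age`, `pct`, and `date` are illustrative assumptions, not from a real dataset):
+
+ ```python
+ # Hypothetical columns: adjust names to the actual data
+ print((df["age"] < 0).sum(), "negative ages")
+ print((df["pct"] > 100).sum(), "percentages over 100")
+ print(pd.to_datetime(df["date"], errors="coerce").isna().sum(), "unparseable dates")
+ ```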
+
+ ## Step 3: Choosing Your Tool
+
+ | Scenario | Best Tool | Why |
+ |---|---|---|
+ | Quick exploration, single file | **pandas** | Fastest to write, rich API |
+ | Large file (100MB+), multi-file joins | **DuckDB** | Columnar engine, minimal memory |
+ | Existing database | **SQL** | Query where data lives |
+ | Complex reshaping (pivot, melt) | **pandas** | Most flexible transformation API |
+ | Window functions, CTEs | **DuckDB / SQL** | SQL is more expressive for these |
+
+ **Rule of thumb:** If the data fits in memory and the task is one-off exploration, use pandas. For joins, window functions, or files larger than RAM, use DuckDB.
+
+ ```python
+ import duckdb
+ result = duckdb.sql("SELECT region, SUM(revenue) FROM 'sales.csv' GROUP BY region")
+ result.df()  # convert to pandas DataFrame
+ ```
+
+ ## Step 4: Common Analysis Patterns
+
+ ### Aggregation and Grouping
+
+ ```python
+ # pandas
+ df.groupby("region")["revenue"].agg(["sum", "mean", "count"])
+ ```
+ ```sql
+ -- SQL
+ SELECT region, SUM(revenue) AS total, AVG(revenue) AS avg, COUNT(*) AS n
+ FROM sales GROUP BY region ORDER BY total DESC;
+ ```
+
+ ### Ranking and Top-N
+
+ ```python
+ df.nlargest(10, "revenue")
+ ```
+ ```sql
+ SELECT *, RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS rnk
+ FROM sales QUALIFY rnk <= 10;
+ ```
+
+ ### Time Series Aggregation
+
+ ```python
+ df["date"] = pd.to_datetime(df["date"])
+ df.set_index("date").resample("M")["revenue"].sum()
+ ```
+ ```sql
+ SELECT DATE_TRUNC('month', order_date) AS month, SUM(revenue) AS total
+ FROM orders GROUP BY month ORDER BY month;
+ ```
+
+ ### Joins
+
+ ```python
+ merged = pd.merge(orders, customers, on="customer_id", how="left")
+ ```
+ ```sql
+ SELECT o.*, c.segment, c.region
+ FROM orders o LEFT JOIN customers c ON o.customer_id = c.customer_id;
+ ```
+
+ ### Window Functions (SQL)
+
+ ```sql
+ -- Running total
+ SUM(revenue) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING)
+ -- Month-over-month change
+ revenue - LAG(revenue) OVER (ORDER BY month) AS mom_change
+ -- Percentile rank
+ PERCENT_RANK() OVER (PARTITION BY department ORDER BY salary)
+ ```
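+
+ The snippets above are expression fragments. A runnable sketch of the month-over-month pattern via DuckDB, assuming a `monthly_revenue` table with `month` and `revenue` columns:
+
+ ```python
+ import duckdb
+ # Assumes monthly_revenue exists (e.g., registered from a CSV or prior query)
+ duckdb.sql("""
+     SELECT month, revenue,
+            revenue - LAG(revenue) OVER (ORDER BY month) AS mom_change
+     FROM monthly_revenue
+     ORDER BY month
+ """).show()
+ ```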
+
+ ### Pivot / Crosstab
+
+ ```python
+ pd.pivot_table(df, values="revenue", index="region", columns="quarter", aggfunc="sum", fill_value=0)
+ ```
+ ```sql
+ -- DuckDB
+ PIVOT (SELECT region, category, revenue FROM sales)
+ ON category USING SUM(revenue) GROUP BY region;
+ ```
+
+ ## Step 5: Visualization Guidance
+
+ | Message | Chart Type |
+ |---|---|
+ | Comparison across categories | Bar chart (horizontal if many labels) |
+ | Trend over time | Line chart |
+ | Part-of-whole composition | Stacked bar or pie (2-5 slices only) |
+ | Distribution | Histogram or box plot |
+ | Correlation | Scatter plot |
+
+ **Rules:**
+ - Title states the insight, not just the metric. "Revenue grew 40% in Q2" not "Revenue by Quarter."
+ - Label axes with units. "Revenue ($M)" not "Revenue."
+ - Sort bar charts by value unless order is inherent (months, stages).
+ - Avoid 3D charts, dual axes, and pie charts with more than 5 slices.
+
+ ```python
+ import matplotlib.pyplot as plt
+ import seaborn as sns
+ sns.set_style("whitegrid")
+ fig, ax = plt.subplots(figsize=(10, 6))
+ ax.set_title("Clear, Descriptive Title")
+ ax.set_xlabel("X Label with Units"); ax.set_ylabel("Y Label with Units")
+ plt.tight_layout(); plt.savefig("output.png", dpi=150)
+ ```
+
+ ## Step 6: Output and Reporting
+
+ Present results in the format most useful to the audience:
+ - **Tables** — For precise values. Markdown for small results, CSV export for large.
+ - **Charts** — For trends, comparisons, distributions. Save as PNG.
+ - **Written narrative** — For executive audiences. Finding, evidence, caveats.
+
+ ### Reporting Structure
+
+ Every analysis report should include: **Key Findings** (headline results), **Methodology** (data source, time range, filters), **Detailed Results** (tables/charts), **Caveats** (missing data, assumptions), and **Recommendations** (next steps).
+
+ ### Quality Checklist
+ - [ ] Numbers add up — totals match, percentages sum correctly
+ - [ ] Null handling is explicit — excluded, filled, or counted separately?
+ - [ ] Date ranges are correct — no boundary date errors
+ - [ ] Units are consistent — no mixing dollars/cents, bytes/megabytes
+ - [ ] Sample size is noted — context for statistical significance
+ - [ ] Results are reproducible — steps are clear enough to replicate
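+
+ A minimal sketch of automating the first few checks in pandas (the file and column names are illustrative assumptions):
+
+ ```python
+ import pandas as pd
+ df = pd.read_csv("sales.csv")                                # assumed input
+ by_region = df.groupby("region")["revenue"].sum()
+ assert abs(by_region.sum() - df["revenue"].sum()) < 1e-6     # totals add up
+ print(df["revenue"].isna().sum(), "rows with null revenue")  # null handling explicit
+ print(df["date"].min(), "to", df["date"].max())              # verify date range
+ print(len(df), "rows analyzed")                              # sample size noted
+ ```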
@@ -0,0 +1,198 @@
+ ---
+ name: deep-research
+ description: Conduct systematic multi-angle web research across multiple sources and depths. Use when answering questions requiring current comprehensive information, researching topics in depth, or gathering data before content generation tasks.
+ ---
+
+ # Deep Research
+
+ A systematic methodology for conducting thorough web research across multiple angles, depths, and sources. Load this skill BEFORE starting any content generation task to ensure comprehensive information gathering.
+
+ ## When to Use
+
+ **Always load this skill when:**
+
+ ### Research Questions
+ - User asks "what is X", "explain X", "research X", "investigate X"
+ - User wants to understand a concept, technology, or topic in depth
+ - The question requires current, comprehensive information from multiple sources
+ - A single web search would be insufficient to answer properly
+
+ ### Content Generation (Pre-research)
+ - Creating presentations, articles, reports, or documentation
+ - Creating frontend designs or UI mockups
+ - Producing any content that requires real-world information, examples, or current data
+
+ ## When NOT to Use
+
+ - **Academic literature specifically** — use the `academic-search` skill for scholarly papers
+ - **GitHub repository analysis** — use the `github-deep-research` skill
+ - **Questions answerable from the codebase** — read the code directly
+ - **Consulting-grade reports** — use the `consulting-analysis` skill (which uses deep-research as a sub-step)
+
+ ## Core Principle
+
+ **Never generate content based solely on general knowledge.** The quality of your output directly depends on the quality and quantity of research conducted beforehand. A single search query is NEVER enough.
+
+ ## Research Methodology
+
+ ### Phase 1: Broad Exploration
+
+ Start with broad searches to understand the landscape:
+
+ 1. **Initial Survey**: Search for the main topic to understand the overall context
+ 2. **Identify Dimensions**: From initial results, identify key subtopics, themes, angles, or aspects that need deeper exploration
+ 3. **Map the Territory**: Note different perspectives, stakeholders, or viewpoints that exist
+
+ Example:
+ ```
+ Topic: "AI in healthcare"
+ Initial searches:
+ - "AI healthcare applications 2025"
+ - "artificial intelligence medical diagnosis"
+ - "healthcare AI market trends"
+
+ Identified dimensions:
+ - Diagnostic AI (radiology, pathology)
+ - Treatment recommendation systems
+ - Administrative automation
+ - Patient monitoring
+ - Regulatory landscape
+ - Ethical considerations
+ ```
+
+ ### Phase 2: Deep Dive
+
+ For each important dimension identified, conduct targeted research:
+
+ 1. **Specific Queries**: Use WebSearch with precise keywords for each subtopic
+ 2. **Multiple Phrasings**: Try different keyword combinations and phrasings
+ 3. **Fetch Full Content**: Use WebFetch to read important sources in full, not just snippets
+ 4. **Follow References**: When sources mention other important resources, search for those too
+
+ Example:
+ ```
+ Dimension: "Diagnostic AI in radiology"
+ Targeted searches:
+ - "AI radiology FDA approved systems"
+ - "chest X-ray AI detection accuracy"
+ - "radiology AI clinical trials results"
+
+ Then fetch and read:
+ - Key research papers or summaries
+ - Industry reports
+ - Real-world case studies
+ ```
+
+ ### Phase 3: Diversity & Validation
+
+ Ensure comprehensive coverage by seeking diverse information types:
+
+ | Information Type | Purpose | Example Searches |
+ |-----------------|---------|------------------|
+ | **Facts & Data** | Concrete evidence | "statistics", "data", "numbers", "market size" |
+ | **Examples & Cases** | Real-world applications | "case study", "example", "implementation" |
+ | **Expert Opinions** | Authority perspectives | "expert analysis", "interview", "commentary" |
+ | **Trends & Predictions** | Future direction | "trends 2025", "forecast", "future of" |
+ | **Comparisons** | Context and alternatives | "vs", "comparison", "alternatives" |
+ | **Challenges & Criticisms** | Balanced view | "challenges", "limitations", "criticism" |
+
+ ### Phase 4: Synthesis Check
+
+ Before proceeding to content generation, verify:
+
+ - [ ] Have I searched from at least 3-5 different angles?
+ - [ ] Have I fetched and read the most important sources in full?
+ - [ ] Do I have concrete data, examples, and expert perspectives?
+ - [ ] Have I explored both positive aspects and challenges/limitations?
+ - [ ] Is my information current and from authoritative sources?
+
+ **If any answer is NO, continue researching before generating content.**
+
+ ## Search Strategy Tips
+
+ ### Effective Query Patterns
+
+ ```
+ # Be specific with context
+ Bad: "AI trends"
+ Good: "enterprise AI adoption trends 2025"
+
+ # Include authoritative source hints
+ "[topic] research paper"
+ "[topic] McKinsey report"
+ "[topic] industry analysis"
+
+ # Search for specific content types
+ "[topic] case study"
+ "[topic] statistics"
+ "[topic] expert interview"
+
+ # Use temporal qualifiers with the actual current year
+ "[topic] 2025"
+ "[topic] latest"
+ "[topic] recent developments"
+ ```
+
+ ### Temporal Awareness
+
+ **Always use today's date when forming time-sensitive search queries.** The current date is available in your system prompt context.
+
+ Use the right level of precision depending on what the user is asking:
+
+ | User intent | Temporal precision needed | Example query |
+ |---|---|---|
+ | "today / this morning / just released" | **Month + Day** | `"tech news February 28 2025"` |
+ | "this week" | **Week range** | `"technology releases week of Feb 24 2025"` |
+ | "recently / latest / new" | **Month** | `"AI breakthroughs February 2025"` |
+ | "this year / trends" | **Year** | `"software trends 2025"` |
+
+ **Rules:**
+ - When the user asks about "today" or "just released", use **month + day + year** in your search queries to get same-day results
+ - Never drop to year-only when day-level precision is needed — `"tech news 2025"` will NOT surface today's news
+ - Try multiple phrasings: numeric form (`2025-02-28`), written form (`February 28 2025`), and relative terms (`today`, `this week`) across different queries
+
+ ### When to Use WebFetch
+
+ Use WebFetch to read full page content when:
+ - A search result looks highly relevant and authoritative
+ - You need detailed information beyond the snippet
+ - The source contains data, case studies, or expert analysis
+ - You want to understand the full context of a finding
+
+ ### Iterative Refinement
+
+ Research is iterative. After initial searches (a minimal sketch of the loop follows this list):
+ 1. Review what you have learned
+ 2. Identify gaps in your understanding
+ 3. Formulate new, more targeted queries
+ 4. Repeat until you have comprehensive coverage
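+
+ A minimal sketch of this loop in Python, where `web_search`, `web_fetch`, `looks_authoritative`, and `identify_gaps` are hypothetical stand-ins for the WebSearch/WebFetch tools and your own judgment, not real APIs:
+
+ ```python
+ # Hypothetical helpers: web_search(q) -> results with .url, web_fetch(url) -> page text
+ def research(topic: str, max_rounds: int = 4) -> list[str]:
+     notes, queries = [], [topic]
+     for _ in range(max_rounds):
+         for q in queries:
+             for hit in web_search(q):
+                 if looks_authoritative(hit):          # assumed relevance filter
+                     notes.append(web_fetch(hit.url))  # read sources in full
+         queries = identify_gaps(notes)                # new, more targeted queries
+         if not queries:                               # comprehensive coverage reached
+             break
+     return notes
+ ```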
+
+ ## Quality Bar
+
+ Your research is sufficient when you can confidently answer:
+ - What are the key facts and data points?
+ - What are 2-3 concrete real-world examples?
+ - What do experts say about this topic?
+ - What are the current trends and future directions?
+ - What are the challenges or limitations?
+ - What makes this topic relevant or important now?
+
+ ## Common Mistakes to Avoid
+
+ - Stopping after 1-2 searches
+ - Relying on search snippets without reading full sources
+ - Searching only one aspect of a multi-faceted topic
+ - Ignoring contradicting viewpoints or challenges
+ - Using outdated information when current data exists
+ - Starting content generation before research is complete
+
+ ## Output
+
+ After completing research, you should have:
+ 1. A comprehensive understanding of the topic from multiple angles
+ 2. Specific facts, data points, and statistics
+ 3. Real-world examples and case studies
+ 4. Expert perspectives and authoritative sources
+ 5. Current trends and relevant context
+
+ **Only then proceed to content generation**, using the gathered information to create high-quality, well-informed content.
@@ -0,0 +1,211 @@
+ ---
+ name: docx
+ description: Create, edit, and analyze Word documents (.docx) with support for tracked changes, comments, formatting preservation, and text extraction. Use when working with professional documents for creating, modifying, reviewing with redlines, or extracting content.
+ ---
+
+ # DOCX creation, editing, and analysis
+
+ ## Overview
+
+ A user may ask you to create, edit, or analyze the contents of a .docx file. A .docx file is essentially a ZIP archive containing XML files and other resources that you can read or edit. You have different tools and workflows available for different tasks.
+
+ ## When to Use
+
+ - Creating new Word documents from scratch
+ - Editing or reviewing existing .docx files with tracked changes
+ - Extracting text, comments, or metadata from Word documents
+ - Adding redline/tracked changes to legal, business, or academic documents
+ - Converting documents between formats
+
+ ## When NOT to Use
+
+ - **Spreadsheets** — use the `xlsx` skill
+ - **Presentations** — use the `pptx` skill
+ - **PDF documents** — use the `pdf` skill
+ - **Plain text or Markdown** — edit directly, no special tooling needed
+
+ ## Workflow Decision Tree
+
+ ### Reading/Analyzing Content
+ Use the "Text extraction" or "Raw XML access" sections below
+
+ ### Creating New Document
+ Use the "Creating a new Word document" workflow
+
+ ### Editing Existing Document
+ - **Your own document + simple changes**
+   Use the "Editing an existing Word document" workflow
+
+ - **Someone else's document**
+   Use the **"Redlining workflow"** (recommended default)
+
+ - **Legal, academic, business, or government docs**
+   Use the **"Redlining workflow"** (required)
+
+ ## Reading and analyzing content
+
+ ### Text extraction
+ If you just need to read the text contents of a document, convert it to markdown using pandoc. Pandoc preserves document structure well and can show tracked changes:
+
+ ```bash
+ # Convert document to markdown with tracked changes
+ pandoc --track-changes=all path-to-file.docx -o output.md
+ # Options: --track-changes=accept/reject/all
+ ```
+
+ ### Raw XML access
+ You need raw XML access for comments, complex formatting, document structure, embedded media, and metadata. For these features, unpack the document and read its raw XML contents.
+
+ #### Unpacking a file
+ `python ooxml/scripts/unpack.py <office_file> <output_directory>`
+
+ #### Key file structures
+ * `word/document.xml` - Main document contents
+ * `word/comments.xml` - Comments referenced in document.xml
+ * `word/media/` - Embedded images and media files
+ * Tracked changes use `<w:ins>` (insertions) and `<w:del>` (deletions) tags
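+
+ For instance, a minimal sketch of listing tracked changes from an unpacked document with defusedxml (listed under Dependencies; the `unpacked/` path assumes you ran the unpack script above):
+
+ ```python
+ from defusedxml import ElementTree as ET  # secure XML parsing
+
+ W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
+ root = ET.parse("unpacked/word/document.xml").getroot()
+ for tag, label in ((W + "ins", "INS"), (W + "del", "DEL")):
+     for el in root.iter(tag):
+         text = "".join(t.text or "" for t in el.iter()
+                        if t.tag in (W + "t", W + "delText"))
+         print(label, repr(text))
+ ```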
+
+ ## Creating a new Word document
+
+ When creating a new Word document from scratch, use **docx-js**, which allows you to create Word documents using JavaScript/TypeScript.
+
+ ### Workflow
+ 1. **MANDATORY - READ ENTIRE FILE**: Read [`docx-js.md`](docx-js.md) (~500 lines) completely from start to finish. **NEVER set any range limits when reading this file.** Read the full file content for detailed syntax, critical formatting rules, and best practices before proceeding with document creation.
+ 2. Create a JavaScript/TypeScript file using Document, Paragraph, TextRun components (you can assume all dependencies are installed; if not, see the Dependencies section below)
+ 3. Export as .docx using Packer.toBuffer()
+
+ ## Editing an existing Word document
+
+ When editing an existing Word document, use the **Document library** (a Python library for OOXML manipulation). The library automatically handles infrastructure setup and provides high-level methods for common operations; for complex scenarios, you can access the underlying DOM directly through the library.
+
+ ### Workflow
+ 1. **MANDATORY - READ ENTIRE FILE**: Read [`ooxml.md`](ooxml.md) (~600 lines) completely from start to finish. **NEVER set any range limits when reading this file.** Read the full file content for the Document library API and XML patterns for directly editing document files.
+ 2. Unpack the document: `python ooxml/scripts/unpack.py <office_file> <output_directory>`
+ 3. Create and run a Python script using the Document library (see "Document Library" section in ooxml.md)
+ 4. Pack the final document: `python ooxml/scripts/pack.py <input_directory> <office_file>`
+
+ ## Redlining workflow for document review
+
+ This workflow allows you to plan comprehensive tracked changes using markdown before implementing them in OOXML. **CRITICAL**: For complete tracked changes, you must implement ALL changes systematically.
+
+ **Batching Strategy**: Group related changes into batches of 3-10 changes. This makes debugging manageable while maintaining efficiency. Test each batch before moving to the next.
+
+ **Principle: Minimal, Precise Edits**
+ When implementing tracked changes, only mark text that actually changes. Repeating unchanged text makes edits harder to review and appears unprofessional. Break replacements into: [unchanged text] + [deletion] + [insertion] + [unchanged text]. Preserve the original run's RSID for unchanged text by extracting the `<w:r>` element from the original and reusing it.
+
+ Example - Changing "30 days" to "60 days" in a sentence:
+ ```python
+ # BAD - Replaces entire sentence
+ '<w:del><w:r><w:delText>The term is 30 days.</w:delText></w:r></w:del><w:ins><w:r><w:t>The term is 60 days.</w:t></w:r></w:ins>'
+
+ # GOOD - Only marks what changed, preserves original <w:r> for unchanged text
+ '<w:r w:rsidR="00AB12CD"><w:t>The term is </w:t></w:r><w:del><w:r><w:delText>30</w:delText></w:r></w:del><w:ins><w:r><w:t>60</w:t></w:r></w:ins><w:r w:rsidR="00AB12CD"><w:t> days.</w:t></w:r>'
+ ```
+
+ ### Tracked changes workflow
+
+ 1. **Get markdown representation**: Convert document to markdown with tracked changes preserved:
+ ```bash
+ pandoc --track-changes=all path-to-file.docx -o current.md
+ ```
+
+ 2. **Identify and group changes**: Review the document and identify ALL changes needed, organizing them into logical batches:
+
+ **Location methods** (for finding changes in XML):
+ - Section/heading numbers (e.g., "Section 3.2", "Article IV")
+ - Paragraph identifiers if numbered
+ - Grep patterns with unique surrounding text
+ - Document structure (e.g., "first paragraph", "signature block")
+ - **DO NOT use markdown line numbers** - they don't map to XML structure
+
+ **Batch organization** (group 3-10 related changes per batch):
+ - By section: "Batch 1: Section 2 amendments", "Batch 2: Section 5 updates"
+ - By type: "Batch 1: Date corrections", "Batch 2: Party name changes"
+ - By complexity: Start with simple text replacements, then tackle complex structural changes
+ - Sequential: "Batch 1: Pages 1-3", "Batch 2: Pages 4-6"
+
+ 3. **Read documentation and unpack**:
+ - **MANDATORY - READ ENTIRE FILE**: Read [`ooxml.md`](ooxml.md) (~600 lines) completely from start to finish. **NEVER set any range limits when reading this file.** Pay special attention to the "Document Library" and "Tracked Change Patterns" sections.
+ - **Unpack the document**: `python ooxml/scripts/unpack.py <file.docx> <dir>`
+ - **Note the suggested RSID**: The unpack script will suggest an RSID to use for your tracked changes. Copy this RSID for use in step 4b.
+
+ 4. **Implement changes in batches**: Work through the batches you organized in step 2, implementing each batch in a single script. Smaller batches make errors easier to isolate, allow incremental progress, and keep the work efficient (3-10 changes per batch works well).
+
+ For each batch of related changes:
+
+ **a. Map text to XML**: Grep for text in `word/document.xml` to verify how text is split across `<w:r>` elements.
+
+ **b. Create and run script**: Use `get_node` to find nodes, implement changes, then `doc.save()`. See **"Document Library"** section in ooxml.md for patterns; a rough sketch follows the note below.
+
+ **Note**: Always grep `word/document.xml` immediately before writing a script to get current line numbers and verify text content. Line numbers change after each script run.
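+
+ A rough sketch of what such a batch script might look like. This is illustrative only: `Document`, `get_node`, and `save` are the names this document mentions, while the import path and arguments are assumptions; consult ooxml.md for the actual API.
+
+ ```python
+ # Sketch only: see the "Document Library" section of ooxml.md for the real API.
+ from ooxml import Document   # assumed import path
+
+ doc = Document("unpacked")   # operate on the unpacked directory
+ node = doc.get_node(...)     # locate the target run, per ooxml.md patterns
+ # ...wrap the changed text in <w:del>/<w:ins> elements here, reusing the RSID...
+ doc.save()
+ ```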
+
+ 5. **Pack the document**: After all batches are complete, convert the unpacked directory back to .docx:
+ ```bash
+ python ooxml/scripts/pack.py unpacked reviewed-document.docx
+ ```
+
+ 6. **Final verification**: Do a comprehensive check of the complete document:
+ - Convert final document to markdown:
+ ```bash
+ pandoc --track-changes=all reviewed-document.docx -o verification.md
+ ```
+ - Verify ALL changes were applied correctly:
+ ```bash
+ grep "original phrase" verification.md # Should NOT find it
+ grep "replacement phrase" verification.md # Should find it
+ ```
+ - Check that no unintended changes were introduced
+
+ ## Converting Documents to Images
+
+ To visually analyze Word documents, convert them to images using a two-step process:
+
+ 1. **Convert DOCX to PDF**:
+ ```bash
+ soffice --headless --convert-to pdf document.docx
+ ```
+
+ 2. **Convert PDF pages to JPEG images**:
+ ```bash
+ pdftoppm -jpeg -r 150 document.pdf page
+ ```
+ This creates files like `page-1.jpg`, `page-2.jpg`, etc.
+
+ Options:
+ - `-r 150`: Sets resolution to 150 DPI (adjust for quality/size balance)
+ - `-jpeg`: Output JPEG format (use `-png` for PNG if preferred)
+ - `-f N`: First page to convert (e.g., `-f 2` starts from page 2)
+ - `-l N`: Last page to convert (e.g., `-l 5` stops at page 5)
+ - `page`: Prefix for output files
+
+ Example for specific range:
+ ```bash
+ pdftoppm -jpeg -r 150 -f 2 -l 5 document.pdf page # Converts only pages 2-5
+ ```
+
+ ## Code Style Guidelines
+ **IMPORTANT**: When generating code for DOCX operations:
+ - Write concise code
+ - Avoid verbose variable names and redundant operations
+ - Avoid unnecessary print statements
+
+ ## Dependencies
+
+ Required dependencies (install if not available):
+
+ - **pandoc**: `sudo apt-get install pandoc` (for text extraction)
+ - **docx**: `npm install -g docx` (for creating new documents)
+ - **LibreOffice**: `sudo apt-get install libreoffice` (for PDF conversion)
+ - **Poppler**: `sudo apt-get install poppler-utils` (for pdftoppm to convert PDF to images)
+ - **defusedxml**: `pip install defusedxml` (for secure XML parsing)