qmdr 1.0.0 → 1.0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (70)
  1. package/AI-SETUP.md +61 -98
  2. package/docs/setup-openclaw.md +20 -3
  3. package/package.json +1 -1
  4. package/qmd-wrapper-backup +12 -0
  5. package/src/llm.ts +28 -1
  6. package/src/qmd.ts +29 -2
  7. package/.github/workflows/release.yml +0 -77
  8. package/finetune/BALANCED_DISTRIBUTION.md +0 -157
  9. package/finetune/DATA_IMPROVEMENTS.md +0 -218
  10. package/finetune/Justfile +0 -43
  11. package/finetune/Modelfile +0 -16
  12. package/finetune/README.md +0 -299
  13. package/finetune/SCORING.md +0 -286
  14. package/finetune/configs/accelerate_multi_gpu.yaml +0 -17
  15. package/finetune/configs/grpo.yaml +0 -49
  16. package/finetune/configs/sft.yaml +0 -42
  17. package/finetune/configs/sft_local.yaml +0 -40
  18. package/finetune/convert_gguf.py +0 -221
  19. package/finetune/data/best_glm_prompt.txt +0 -17
  20. package/finetune/data/gepa_generated.prompts.json +0 -32
  21. package/finetune/data/qmd_expansion_balanced_deduped.jsonl +0 -413
  22. package/finetune/data/qmd_expansion_diverse_addon.jsonl +0 -386
  23. package/finetune/data/qmd_expansion_handcrafted.jsonl +0 -65
  24. package/finetune/data/qmd_expansion_handcrafted_only.jsonl +0 -336
  25. package/finetune/data/qmd_expansion_locations.jsonl +0 -64
  26. package/finetune/data/qmd_expansion_people.jsonl +0 -46
  27. package/finetune/data/qmd_expansion_short_nontech.jsonl +0 -200
  28. package/finetune/data/qmd_expansion_v2.jsonl +0 -1498
  29. package/finetune/data/qmd_only_sampled.jsonl +0 -399
  30. package/finetune/dataset/analyze_data.py +0 -369
  31. package/finetune/dataset/clean_data.py +0 -906
  32. package/finetune/dataset/generate_balanced.py +0 -823
  33. package/finetune/dataset/generate_data.py +0 -714
  34. package/finetune/dataset/generate_data_offline.py +0 -206
  35. package/finetune/dataset/generate_diverse.py +0 -441
  36. package/finetune/dataset/generate_ollama.py +0 -326
  37. package/finetune/dataset/prepare_data.py +0 -197
  38. package/finetune/dataset/schema.py +0 -73
  39. package/finetune/dataset/score_data.py +0 -115
  40. package/finetune/dataset/validate_schema.py +0 -104
  41. package/finetune/eval.py +0 -196
  42. package/finetune/evals/queries.txt +0 -56
  43. package/finetune/gepa/__init__.py +0 -1
  44. package/finetune/gepa/best_prompt.txt +0 -31
  45. package/finetune/gepa/best_prompt_glm.txt +0 -1
  46. package/finetune/gepa/dspy_gepa.py +0 -204
  47. package/finetune/gepa/example.py +0 -117
  48. package/finetune/gepa/generate.py +0 -129
  49. package/finetune/gepa/gepa_outputs.jsonl +0 -10
  50. package/finetune/gepa/gepa_outputs_glm.jsonl +0 -20
  51. package/finetune/gepa/model.json +0 -19
  52. package/finetune/gepa/optimizer.py +0 -70
  53. package/finetune/gepa/score.py +0 -84
  54. package/finetune/jobs/eval.py +0 -490
  55. package/finetune/jobs/eval_common.py +0 -354
  56. package/finetune/jobs/eval_verbose.py +0 -113
  57. package/finetune/jobs/grpo.py +0 -141
  58. package/finetune/jobs/quantize.py +0 -244
  59. package/finetune/jobs/sft.py +0 -121
  60. package/finetune/pyproject.toml +0 -23
  61. package/finetune/reward.py +0 -610
  62. package/finetune/train.py +0 -611
  63. package/finetune/uv.lock +0 -4070
  64. package/src/cli.test.ts +0 -963
  65. package/src/eval.test.ts +0 -412
  66. package/src/llm.test.ts +0 -559
  67. package/src/mcp.test.ts +0 -889
  68. package/src/store-paths.test.ts +0 -395
  69. package/src/store.test.ts +0 -2483
  70. package/test/eval-harness.ts +0 -223
package/AI-SETUP.md CHANGED
@@ -112,9 +112,16 @@ chmod +x /usr/local/bin/qmd
  
  ---
  
- ## Phase 3: Configure Embedding (SiliconFlow)
+ ## Phase 3: Configure API Keys
  
- SiliconFlow provides the embedding model. Ask:
+ All configuration goes into one file: `~/.config/qmd/.env`
+
+ QMDR automatically reads this file on startup (both Bun and binary installs). Create it:
+
+ ```bash
+ mkdir -p ~/.config/qmd
+ nano ~/.config/qmd/.env # or any editor
+ ```
  
  > **Do you have a SiliconFlow API key?**
  >
@@ -124,8 +131,9 @@ SiliconFlow provides the embedding model. Ask:
  
  ### Option 1: User has key
  
- ```bash
- export QMD_SILICONFLOW_API_KEY=sk-user-provided-key
+ Write to `~/.config/qmd/.env`:
+ ```
+ QMD_SILICONFLOW_API_KEY=sk-user-provided-key
  ```
  
  ### Option 2: Guide to get key
@@ -142,76 +150,45 @@ export QMD_SILICONFLOW_API_KEY=sk-user-provided-key
  > 2. Go to API Keys page → Create new key
  > 3. Free tier includes embedding models
  
- After user provides key:
- ```bash
- export QMD_SILICONFLOW_API_KEY=sk-user-provided-key
+ After user provides key, write to `~/.config/qmd/.env`:
+ ```
+ QMD_SILICONFLOW_API_KEY=sk-user-provided-key
  ```
  
- **International users — additional API endpoint config:**
- ```bash
- export QMD_SILICONFLOW_BASE_URL=https://api.siliconflow.com/v1
+ **International users — add this line too:**
+ ```
+ QMD_SILICONFLOW_BASE_URL=https://api.siliconflow.com/v1
  ```
  (China users don't need this — the default `https://api.siliconflow.cn/v1` works)
  
  ### Option 3: Custom provider
  
- Ask for:
- 1. API endpoint URL (must be OpenAI-compatible)
- 2. API key
-
- ```bash
- export QMD_OPENAI_API_KEY=user-key
- export QMD_OPENAI_BASE_URL=https://their-endpoint.com/v1
- export QMD_EMBED_PROVIDER=openai
+ Ask for endpoint URL and API key, then write to `~/.config/qmd/.env`:
  ```
-
- Test connectivity:
- ```bash
- curl -s "$QMD_OPENAI_BASE_URL/models" -H "Authorization: Bearer $QMD_OPENAI_API_KEY" | head -5
+ QMD_OPENAI_API_KEY=user-key
+ QMD_OPENAI_BASE_URL=https://their-endpoint.com/v1
+ QMD_EMBED_PROVIDER=openai
  ```
  
  ### Configure embedding model
  
  Default: `Qwen/Qwen3-Embedding-8B` (on SiliconFlow, free)
  
- Ask:
- > Keep the default embedding model (`Qwen/Qwen3-Embedding-8B`), or choose your own?
- >
- > If you don't know which to pick, the default is good.
+ > Keep the default embedding model, or choose your own?
  
- If user wants to change:
- ```bash
- export QMD_SILICONFLOW_EMBED_MODEL=their-chosen-model
+ If user wants to change, add to `.env`:
+ ```
+ QMD_SILICONFLOW_EMBED_MODEL=their-chosen-model
  # or for custom provider:
- export QMD_OPENAI_EMBED_MODEL=their-chosen-model
+ QMD_OPENAI_EMBED_MODEL=their-chosen-model
  ```
  
- ### Configure embedding dimensions
-
- > Do you know the output dimensions of your embedding model?
- >
- > 1. **Use default** (auto-detect from model — recommended)
- > 2. **I know the dimensions** — let me enter it
- > 3. **I don't know** — please look it up for me
-
- If option 3: search the web for "{model_name} embedding dimensions" and inform the user.
-
- Note: If dimensions change after initial indexing, user must run `qmd embed -f` to rebuild.
-
  ### Configure chunk size
  
- Default: `200` tokens per chunk, `40` tokens overlap.
-
- > Chunk size controls how documents are split for embedding.
- >
- > - **Default: 200 tokens** (recommended for most use cases)
- > - Larger chunks = more context per result, but less precise
- > - Smaller chunks = more precise, but may lose context
-
- ```bash
- # Only if user wants to change defaults:
- export QMD_CHUNK_SIZE_TOKENS=200 # default
- export QMD_CHUNK_OVERLAP_TOKENS=40 # default
+ Default: `200` tokens per chunk, `40` tokens overlap. Only add to `.env` if user wants to change:
+ ```
+ QMD_CHUNK_SIZE_TOKENS=200
+ QMD_CHUNK_OVERLAP_TOKENS=40
  ```
  
  ---
@@ -222,22 +199,19 @@ Query expansion rewrites the user's search query into multiple variations (keywo
  
  Default: `GLM-4.5-Air` on SiliconFlow (~¥1/month, fast, good quality).
  
- Ask:
- > Use the default query expansion model (`GLM-4.5-Air` on SiliconFlow), or use your own?
+ > Use the default query expansion model, or use your own?
  >
  > 1. **Default** (GLM-4.5-Air on SiliconFlow — recommended)
  > 2. **Use the reranking provider's model** (configured in next step)
  > 3. **Custom** — I want to specify a model
  
- If option 2: will be configured after Phase 5.
-
- If option 3:
- ```bash
- export QMD_QUERY_EXPANSION_PROVIDER=openai # or gemini
- # For OpenAI-compatible:
- export QMD_OPENAI_MODEL=their-model-name
- # For Gemini:
- export QMD_GEMINI_API_KEY=their-key
+ If option 3, add to `~/.config/qmd/.env`:
+ ```
+ QMD_QUERY_EXPANSION_PROVIDER=openai
+ QMD_OPENAI_MODEL=their-model-name
+ # Or for Gemini:
+ QMD_QUERY_EXPANSION_PROVIDER=gemini
+ QMD_GEMINI_API_KEY=their-key
  ```
  
  ---
@@ -256,9 +230,10 @@ Reranking uses a large language model to judge which search results are truly re
  
  ### Option 1: User has Gemini key
  
- ```bash
- export QMD_GEMINI_API_KEY=user-key
- export QMD_RERANK_PROVIDER=gemini
+ Add to `~/.config/qmd/.env`:
+ ```
+ QMD_GEMINI_API_KEY=user-key
+ QMD_RERANK_PROVIDER=gemini
  ```
  
  ### Option 2: Guide to get Gemini key
@@ -268,48 +243,34 @@ export QMD_RERANK_PROVIDER=gemini
  > 2. Click "Get API key" → Create key
  > 3. Free tier: 15 RPM / 1M tokens per day (more than enough for reranking)
  
- Note for China users: Gemini API may require a proxy. If the user has a proxy:
- ```bash
- export QMD_GEMINI_BASE_URL=https://their-proxy-endpoint
+ Note for China users: Gemini API may require a proxy. Add to `.env`:
+ ```
+ QMD_GEMINI_BASE_URL=https://their-proxy-endpoint
  ```
  
  If no proxy available, recommend Option 3 instead.
  
  ### Option 3: Alternative reranking
  
+ Add to `~/.config/qmd/.env`:
+
  **Using SiliconFlow LLM rerank (no extra key needed):**
- ```bash
- export QMD_RERANK_PROVIDER=siliconflow
- export QMD_RERANK_MODE=llm
+ ```
+ QMD_RERANK_PROVIDER=siliconflow
+ QMD_RERANK_MODE=llm
  ```
  
- **Using a dedicated rerank model API (e.g. BAAI/bge-reranker):**
- ```bash
- export QMD_RERANK_PROVIDER=siliconflow
- export QMD_RERANK_MODE=rerank
- export QMD_SILICONFLOW_RERANK_MODEL=BAAI/bge-reranker-v2-m3
+ **Using a dedicated rerank model API:**
+ ```
+ QMD_RERANK_PROVIDER=siliconflow
+ QMD_RERANK_MODE=rerank
+ QMD_SILICONFLOW_RERANK_MODEL=BAAI/bge-reranker-v2-m3
  ```
  
  **Using OpenAI-compatible endpoint:**
- ```bash
- export QMD_RERANK_PROVIDER=openai
- export QMD_RERANK_MODE=llm
- export QMD_OPENAI_API_KEY=their-key
- export QMD_OPENAI_BASE_URL=https://their-endpoint/v1
  ```
-
- For Claude Code / OpenCode users: you can reuse whichever model API you're already paying for.
-
- ### Model selection
-
- Default reranking model: `gemini-2.5-flash` with `thinkingBudget: 0` (no reasoning overhead, pure relevance judgment).
-
- > Keep the default reranking model (`gemini-2.5-flash`), or change it?
-
- If user wants to change:
- ```bash
- export QMD_GEMINI_MODEL=their-model # for Gemini provider
- export QMD_LLM_RERANK_MODEL=their-model # for SiliconFlow/OpenAI LLM rerank
+ QMD_RERANK_PROVIDER=openai
+ QMD_RERANK_MODE=llm
  ```
  
  ---
@@ -414,7 +375,7 @@ After setup, add this to `AGENTS.md` in the project root:
  ## Environment Variables Reference
  
  | Variable | Default | Purpose |
- |----------|---------|---------||
+ |----------|---------|---------|
  | **API Keys** | | |
  | `QMD_SILICONFLOW_API_KEY` | — | SiliconFlow |
  | `QMD_GEMINI_API_KEY` | — | Google Gemini |
@@ -443,6 +404,8 @@ After setup, add this to `AGENTS.md` in the project root:
  | `QMD_RERANK_DOC_LIMIT` | `40` | Max docs for reranking |
  | `QMD_RERANK_CHUNKS_PER_DOC` | `3` | Chunks per doc for reranking |
  | **Paths** | | |
+ | `QMD_CONFIG_DIR` | `~/.config/qmd` | Config directory (index.yml + .env location) |
+ | `XDG_CACHE_HOME` | `~/.cache` | Cache directory (database at `$XDG_CACHE_HOME/qmd/index.sqlite`) |
  | `QMD_SQLITE_VEC_PATH` | auto | sqlite-vec native extension path |
  
  **If you change the embedding model, run `qmd embed -f` to rebuild the vector index.**
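Pulling the Phase 3-5 changes together, a completed `~/.config/qmd/.env` for the default SiliconFlow-plus-Gemini path might look like the following. This is an illustration only, using placeholder keys; every variable name comes from the reference table above.

```
# SiliconFlow: embeddings + query expansion (Phase 3-4)
QMD_SILICONFLOW_API_KEY=sk-placeholder

# Gemini: reranking (Phase 5)
QMD_GEMINI_API_KEY=placeholder-key
QMD_RERANK_PROVIDER=gemini
```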
package/docs/setup-openclaw.md CHANGED
@@ -79,7 +79,11 @@ Edit `~/.openclaw/openclaw.json` — add/merge the `memory` block:
  
  API keys must be available to the OpenClaw process, not just your shell.
  
- **macOS (launchd):**
+ **Option A: Use `~/.config/qmd/.env` (recommended — works for all install types)**
+
+ QMDR auto-loads this file on startup. If you already configured it in Phase 3-5, you're done — OpenClaw will inherit these settings automatically.
+
+ **Option B: macOS (launchd) — set in the service plist:**
  Add to `~/Library/LaunchAgents/ai.openclaw.gateway.plist` under `EnvironmentVariables`:
  ```xml
  <key>QMD_SILICONFLOW_API_KEY</key>
@@ -89,7 +93,7 @@ Add to `~/Library/LaunchAgents/ai.openclaw.gateway.plist` under `EnvironmentVari
  ```
  Then reload: `launchctl unload ~/Library/LaunchAgents/ai.openclaw.gateway.plist && launchctl load ~/Library/LaunchAgents/ai.openclaw.gateway.plist`
  
- **Linux (systemd):**
+ **Option C: Linux (systemd) — set in the service unit:**
  Add to the service unit under `[Service]`:
  ```ini
  Environment=QMD_SILICONFLOW_API_KEY=sk-your-key
@@ -97,13 +101,26 @@ Environment=QMD_GEMINI_API_KEY=your-gemini-key
  ```
  Then: `systemctl --user daemon-reload && systemctl --user restart openclaw`
  
+ **Priority:** System/process env vars > `~/.config/qmd/.env` (env vars override .env if both are set).
+
  ## Step 5: Restart and verify
  
+ Before restarting, confirm:
+ 1. `qmd doctor` passes with no errors ✅
+ 2. `qmd query "test"` returns results ✅
+ 3. API keys are in launchd/systemd env (Step 4) ✅
+
  ```bash
  openclaw gateway restart
+
+ # Or if using OpenClaw CLI:
+ openclaw gateway stop && openclaw gateway start
  ```
  
- Then test by asking your bot about past conversations, or check logs for `"backend": "qmd"`.
+ After restart, verify QMDR is active:
+ - Ask your bot about something from past conversations
+ - Check that `memory_search` results show `"provider": "qmd"` (not `"sqlite"`)
+ - If results seem to fall back to basic search, check `qmd doctor` inside the OpenClaw process environment
  
  ## Step 6: Add usage tips to TOOLS.md
  
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "qmdr",
-   "version": "1.0.0",
+   "version": "1.0.2",
    "description": "Remote-first CLI search engine for markdown docs, notes, and knowledge bases",
    "type": "module",
    "bin": {
package/qmd-wrapper-backup ADDED
@@ -0,0 +1,12 @@
+ #!/bin/bash
+ # QMD wrapper - runs via bun instead of compiled binary
+ # When --json flag is present, redirect stderr separately to avoid breaking JSON output
+ # The debug console.log calls in qmd go to stdout; we need to filter them
+ if echo "$@" | grep -q '\-\-json'; then
+   # Run and extract only the JSON array from stdout
+   output=$(bun run /Users/fu/clawd/qmd/src/qmd.ts "$@" 2>/dev/null)
+   # Extract JSON: find the line starting with [ and everything after
+   echo "$output" | sed -n '/^\[/,$ p'
+ else
+   exec bun run /Users/fu/clawd/qmd/src/qmd.ts "$@"
+ fi
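The `sed -n '/^\[/,$ p'` trick in this wrapper keeps everything from the first line that starts with `[` through end of input, discarding any debug lines printed beforehand. A quick standalone check of that behavior, with hypothetical input (not actual qmd output):

```shell
# Simulate qmd stdout: a stray debug line, then the JSON array.
# sed prints only from the first line matching ^\[ to the end.
printf 'debug: loading index\n[{"id":1}]\n' | sed -n '/^\[/,$ p'
# → [{"id":1}]
```

Note this only works when the JSON array starts at column 0 and no debug line begins with `[` — both assumptions, not guarantees.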
package/src/llm.ts CHANGED
@@ -26,7 +26,16 @@ type LlamaModel = any;
  type LlamaEmbeddingContext = any;
  type LlamaToken = any;
  let LlamaChatSession: any = null;
- getLlamaCpp().then(m => { LlamaChatSession = m.LlamaChatSession; }).catch(() => {});
+ // Lazy-init LlamaChatSession only when actually needed (avoid eager import crash on Linux CI)
+ async function ensureLlamaChatSession() {
+   if (!LlamaChatSession) {
+     try {
+       const m = await getLlamaCpp();
+       LlamaChatSession = m.LlamaChatSession;
+     } catch {}
+   }
+   return LlamaChatSession;
+ }
  import { homedir } from "os";
  import { join } from "path";
  import { existsSync, mkdirSync, statSync, unlinkSync, readdirSync, readFileSync, writeFileSync } from "fs";
@@ -964,6 +973,7 @@ export class LlamaCpp implements LLM {
  
  export type RemoteLLMConfig = {
    rerankProvider: 'siliconflow' | 'gemini' | 'openai';
+   rerankMode?: 'llm' | 'rerank'; // 'llm' = chat model, 'rerank' = dedicated rerank API
    embedProvider?: 'siliconflow' | 'openai'; // remote embedding provider (optional)
    queryExpansionProvider?: 'siliconflow' | 'gemini' | 'openai'; // remote query expansion (optional)
    siliconflow?: {
@@ -1040,6 +1050,23 @@ export class RemoteLLM implements LLM {
    options: RerankOptions = {}
  ): Promise<RerankResult> {
    if (this.config.rerankProvider === 'siliconflow') {
+     // LLM mode: use SiliconFlow's OpenAI-compatible chat API for reranking
+     if (this.config.rerankMode === 'llm') {
+       // Build a temporary openai-like config from siliconflow settings
+       const sf = this.config.siliconflow;
+       if (!sf?.apiKey) throw new Error("SiliconFlow API key required for LLM rerank");
+       const savedOpenai = this.config.openai;
+       this.config.openai = {
+         apiKey: sf.apiKey,
+         baseUrl: (sf.baseUrl || "https://api.siliconflow.cn/v1").replace(/\/$/, ""),
+         model: sf.queryExpansionModel || "zai-org/GLM-4.5-Air",
+       };
+       try {
+         return await this.rerankWithOpenAI(query, documents, options);
+       } finally {
+         this.config.openai = savedOpenai;
+       }
+     }
      return this.rerankWithSiliconflow(query, documents, options);
    }
    if (this.config.rerankProvider === 'openai') {
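The new LLM-rerank branch above works by borrowing the existing OpenAI-compatible code path: it installs a synthetic `openai` config built from the SiliconFlow settings, delegates, then restores the original in a `finally` block so the swap survives thrown errors. A minimal sketch of that swap-and-restore pattern, with simplified types and a hypothetical helper name (not the package's API):

```typescript
// Simplified stand-in for RemoteLLMConfig — only the field the swap touches.
type Config = { openai?: { apiKey: string; baseUrl: string; model: string } };

// Run fn() with cfg.openai temporarily replaced; restore it even if fn throws.
async function withTempOpenAI<T>(
  cfg: Config,
  temp: NonNullable<Config["openai"]>,
  fn: () => Promise<T>,
): Promise<T> {
  const saved = cfg.openai;
  cfg.openai = temp;
  try {
    return await fn();
  } finally {
    cfg.openai = saved; // restore no matter what happened inside fn
  }
}
```

Mutating shared config like this is safe only as long as rerank calls are not interleaved; a purer variant would pass the synthetic config as an argument instead.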
package/src/qmd.ts CHANGED
@@ -90,6 +90,31 @@ import { handleCollectionCommand } from "./app/commands/collection.js";
  import { handleSearchCommand, handleVSearchCommand, handleQueryCommand } from "./app/commands/search.js";
  import { handleCleanupCommand, handlePullCommand, handleStatusCommand, handleUpdateCommand, handleEmbedCommand, handleMcpCommand, handleDoctorCommand } from "./app/commands/maintenance.js";
  import { createLLMService } from "./app/services/llm-service.js";
+ import { existsSync } from "fs";
+ import { join } from "path";
+
+ // =============================================================================
+ // Load config from ~/.config/qmd/.env (single source of truth)
+ // Environment variables set externally take priority (won't be overwritten)
+ // =============================================================================
+
+ const qmdConfigDir = process.env.QMD_CONFIG_DIR || join(homedir(), ".config", "qmd");
+ const qmdEnvPath = join(qmdConfigDir, ".env");
+ if (existsSync(qmdEnvPath)) {
+   const envContent = readFileSync(qmdEnvPath, "utf-8");
+   for (const line of envContent.split("\n")) {
+     const trimmed = line.trim();
+     if (!trimmed || trimmed.startsWith("#")) continue;
+     const eqIdx = trimmed.indexOf("=");
+     if (eqIdx === -1) continue;
+     const key = trimmed.slice(0, eqIdx).trim();
+     const val = trimmed.slice(eqIdx + 1).trim().replace(/^["']|["']$/g, "");
+     // Don't override existing env vars (system/process env takes priority)
+     if (!process.env[key]) {
+       process.env[key] = val;
+     }
+   }
+ }
  
  // Enable production mode - allows using default database path
  // Tests must set INDEX_PATH or use createStore() with explicit path
@@ -270,9 +295,10 @@ function getRemoteLLM(): RemoteLLM | null {
    if (rerankProvider === 'gemini' || rerankProvider === 'openai') {
      effectiveRerankProvider = rerankProvider;
    } else if (rerankProvider === 'siliconflow') {
-     effectiveRerankProvider = sfApiKey ? 'openai' : undefined;
+     // LLM rerank via SiliconFlow's OpenAI-compatible API
+     effectiveRerankProvider = 'siliconflow';
    } else {
-     effectiveRerankProvider = sfApiKey ? 'openai' : (gmApiKey ? 'gemini' : (oaApiKey ? 'openai' : undefined));
+     effectiveRerankProvider = sfApiKey ? 'siliconflow' : (gmApiKey ? 'gemini' : (oaApiKey ? 'openai' : undefined));
    }
  }
  const effectiveEmbedProvider = embedProvider || (sfApiKey ? 'siliconflow' : (oaApiKey ? 'openai' : undefined));
@@ -283,6 +309,7 @@ function getRemoteLLM(): RemoteLLM | null {
  
    const config: RemoteLLMConfig = {
      rerankProvider: effectiveRerankProvider || 'siliconflow',
+     rerankMode: rerankMode as 'llm' | 'rerank',
      embedProvider: effectiveEmbedProvider,
      queryExpansionProvider: effectiveQueryExpansionProvider,
    };
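The startup loader added in the first hunk above is a small hand-rolled `.env` parser: skip blanks and `#` comments, split on the first `=`, strip one surrounding layer of quotes, and never overwrite a variable that is already set — that last rule is what gives process env vars priority over the file. The same logic as a standalone sketch (hypothetical helper name, not the package's export):

```typescript
// Apply .env-style text to an env map, mirroring the loader's rules:
// blank lines and #-comments are skipped, values lose one layer of
// surrounding quotes, and keys already present in `env` stay untouched.
function applyDotenv(content: string, env: Record<string, string | undefined>): void {
  for (const line of content.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed || trimmed.startsWith("#")) continue;
    const eq = trimmed.indexOf("=");
    if (eq === -1) continue; // not KEY=VALUE, ignore
    const key = trimmed.slice(0, eq).trim();
    const val = trimmed.slice(eq + 1).trim().replace(/^["']|["']$/g, "");
    if (!env[key]) env[key] = val; // existing env wins
  }
}
```

Note the priority check uses falsiness, so an empty-string env var would also be replaced — the same behavior as the loader in the diff.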
package/.github/workflows/release.yml DELETED
@@ -1,77 +0,0 @@
- name: Release
-
- on:
- push:
- tags:
- - 'v*'
-
- permissions:
- contents: write
-
- jobs:
- build:
- strategy:
- matrix:
- include:
- - os: macos-latest
- target: darwin-arm64
- artifact: qmd-darwin-arm64
- - os: macos-13
- target: darwin-x64
- artifact: qmd-darwin-x64
- - os: ubuntu-latest
- target: linux-x64
- artifact: qmd-linux-x64
- - os: ubuntu-latest
- target: linux-arm64
- artifact: qmd-linux-arm64
-
- runs-on: ${{ matrix.os }}
-
- steps:
- - uses: actions/checkout@v4
-
- - uses: oven-sh/setup-bun@v2
- with:
- bun-version: latest
-
- - name: Install dependencies
- run: bun install
-
- - name: Build binary
- run: |
- if [ "${{ matrix.target }}" = "linux-arm64" ]; then
- bun build --compile --target=bun-linux-arm64 src/qmd.ts --outfile ${{ matrix.artifact }}
- else
- bun build --compile src/qmd.ts --outfile ${{ matrix.artifact }}
- fi
-
- - name: Test binary
- if: matrix.target != 'linux-arm64'
- run: |
- ./${{ matrix.artifact }} --help
- ./${{ matrix.artifact }} status 2>/dev/null || true
- echo "✅ Binary runs successfully"
-
- - uses: actions/upload-artifact@v4
- with:
- name: ${{ matrix.artifact }}
- path: ${{ matrix.artifact }}
-
- release:
- needs: build
- runs-on: ubuntu-latest
- steps:
- - uses: actions/download-artifact@v4
- with:
- merge-multiple: true
-
- - name: Make binaries executable
- run: chmod +x qmd-*
-
- - name: Create Release
- uses: softprops/action-gh-release@v2
- with:
- files: qmd-*
- generate_release_notes: true
- draft: false
package/finetune/BALANCED_DISTRIBUTION.md DELETED
@@ -1,157 +0,0 @@
- # QMD Training Data - Balanced Distribution Summary
-
- ## Overview
-
- The training data has been rebalanced to reduce excessive tech focus while maintaining adequate technical coverage for QMD's use case. The new distribution emphasizes diverse life topics while keeping tech at a reasonable 15%.
-
- ## Distribution Comparison
-
- ### Before (Original Data)
- ```
- Technical: ~50% ████████████████████████████████████████
- How-to: ~45% █████████████████████████████████████
- What-is: ~40% █████████████████████████████████
- Other: ~15% ████████████
- Short queries: 10% ████████
- Temporal: 1.6% █
- Named entities: 3.4% ██
- ```
-
- ### After (Balanced Approach)
- ```
- Category Percentage
- ────────────────────────────────────────
- Health & Wellness 12% █████████
- Finance & Business 12% █████████
- Technology 15% ███████████
- Home & Garden 10% ████████
- Food & Cooking 10% ████████
- Travel & Geography 10% ████████
- Hobbies & Crafts 10% ████████
- Education & Learning 8% ██████
- Arts & Culture 8% ██████
- Lifestyle & Relationships 5% ████
- ────────────────────────────────────────
- Short queries (1-2 words): 20%
- Temporal (2025/2026): 15%
- Named entities: 10%+
- ```
-
- ## Key Improvements
-
- ### 1. Category Diversity
-
- **New Non-Tech Categories Added:**
- - **Health & Wellness**: Meditation, fitness, nutrition, mental health
- - **Finance & Business**: Budgeting, investing, career, entrepreneurship
- - **Home & Garden**: DIY, repairs, cleaning, gardening, organization
- - **Food & Cooking**: Recipes, techniques, meal planning, nutrition
- - **Travel & Geography**: Travel planning, destinations, geography facts
- - **Hobbies & Crafts**: Photography, art, music, woodworking, knitting
- - **Education & Learning**: Study techniques, languages, online courses
- - **Arts & Culture**: Art history, music, film, theater, literature
- - **Lifestyle & Relationships**: Habits, relationships, parenting, minimalism
-
- ### 2. Temporal Queries (2025/2026)
-
- Updated to use current era years for recency queries:
- - "latest research 2026"
- - "Shopify updates 2025"
- - "what changed in React 2026"
- - "AI developments 2025"
-
- This ensures the model learns to handle queries from the current time period.
-
- ### 3. Short Query Coverage
-
- Expanded from 47 to 144+ short keywords across all categories:
- - Tech: auth, config, api, cache, deploy
- - Health: meditate, hydrate, stretch, exercise
- - Finance: budget, save, invest, taxes
- - Home: clean, organize, repair, garden
- - Food: cook, bake, recipe, meal
- - Travel: travel, pack, passport, hotel
- - Hobbies: photo, draw, paint, knit, guitar
- - Education: study, learn, course, exam
- - Arts: art, music, film, dance
- - Life: habit, routine, organize, parent
-
- ## Usage
-
- ### Quick Start - Use Balanced Data
-
- ```bash
- cd finetune
-
- # Add 500 balanced examples
- cat data/qmd_expansion_balanced.jsonl >> data/qmd_expansion_v2.jsonl
-
- # Prepare with enhanced short query templates
- uv run dataset/prepare_data.py --add-short 2
-
- # Train
- uv run train.py sft --config configs/sft.yaml
- ```
-
- ### Generate Fresh Data with Claude API
-
- ```bash
- # Set API key
- export ANTHROPIC_API_KEY=your_key
-
- # Generate 300 balanced examples
- uv run dataset/generate_data.py --count 300 \
- --output data/qmd_expansion_fresh.jsonl
-
- # Analyze distribution
- uv run dataset/analyze_data.py --input data/qmd_expansion_fresh.jsonl
-
- # Prepare for training
- uv run dataset/prepare_data.py --input data/qmd_expansion_fresh.jsonl
- ```
-
- ### Generate Even More Balanced Examples
-
- ```bash
- # Generate 500 life-focused examples (15% tech)
- uv run dataset/generate_balanced.py
-
- # Or generate 265 additional diverse examples
- uv run dataset/generate_diverse.py
- ```
-
- ## File Summary
-
- ### Modified Files:
- - `dataset/generate_data.py` - Added category weights (15% tech), 2025/2026 dates
- - `dataset/prepare_data.py` - Expanded SHORT_QUERIES from 47→144, templates 5→16
-
- ### New Files:
- - `dataset/generate_balanced.py` - Life-focused generator (500 examples)
- - `dataset/generate_diverse.py` - Philosophy/History/Geography/Trivia generator (265 examples)
- - `dataset/analyze_data.py` - Dataset analysis and quality reporting
- - `DATA_IMPROVEMENTS.md` - Detailed improvement documentation
-
- ### Generated Data:
- - `data/qmd_expansion_balanced.jsonl` - 500 balanced examples
- - `data/qmd_expansion_diverse_addon.jsonl` - 265 diverse examples
-
- ## Expected Benefits
-
- 1. **Better Short Query Handling**: 20% coverage vs 10% before
- 2. **Named Entity Preservation**: 10%+ coverage vs 3.4% before
- 3. **Temporal Understanding**: 15% with 2025/2026 vs 1.6% before
- 4. **Domain Diversity**: 10 categories vs tech-only before
- 5. **Life-Document Search**: Better at searching personal notes on health, finance, hobbies
-
- ## Next Steps
-
- 1. Merge balanced examples into training set
- 2. Retrain model with improved distribution
- 3. Evaluate using `evals/queries.txt`
- 4. Monitor scores on temporal/named-entity/short queries
- 5. Iterate based on results
-
- ---
-
- Generated: 2026-01-30