
@aiready/pattern-detect 0.1.3 → 0.2.0


Files changed (18)

package/README.md +149 -38
package/dist/{chunk-N5DE7IYX.mjs → chunk-AXHGYYYZ.mjs} +5 -17
package/dist/{chunk-57O7FEEM.mjs → chunk-JKVKOXYR.mjs} +43 -36
package/dist/{chunk-DNI7S33V.mjs → chunk-OFGMDX66.mjs} +34 -31
package/dist/{chunk-4CZGZIDL.mjs → chunk-QE4E3F7C.mjs} +20 -19
package/dist/{chunk-ZNZ5O435.mjs → chunk-TXWPOVYU.mjs} +37 -35
package/dist/cli.js +45 -55
package/dist/cli.mjs +17 -18
package/dist/index.d.mts +5 -10
package/dist/index.d.ts +5 -10
package/dist/index.js +29 -38
package/dist/index.mjs +1 -1
package/package.json +11 -11
package/dist/chunk-6VQTQRDW.mjs +0 -245
package/dist/chunk-JTJXOIO2.mjs +0 -378
package/dist/chunk-K5O2HVB5.mjs +0 -114
package/dist/chunk-RLWJXASG.mjs +0 -227
package/dist/chunk-YA3N6EC5.mjs +0 -351

package/README.md CHANGED

@@ -24,6 +24,19 @@ AI coding assistants (GitHub Copilot, ChatGPT, Claude) generate functionally sim
 | Refactoring Suggestions | ❌ Generic | ✅ Specific to pattern type |
 | Output Formats | Text/JSON | Console/JSON/HTML with rich formatting |
+#### How We Differ (and When to Use Each)
+- **Semantic intent vs exact clones**: jscpd flags copy-paste or near-duplicates; we detect functionally similar code even when structure differs (e.g., two API handlers with different frameworks).
+- **Pattern typing**: We classify duplicates into `api-handler`, `validator`, `utility`, `component`, etc., so teams can prioritize coherent refactors.
+- **AI context cost**: We estimate tokens wasted to quantify impact on AI tools (larger context, higher cost, more confusion).
+- **Refactoring guidance**: We propose targeted fixes per pattern type (e.g., extract middleware or create base handler).
+- **Performance profile**: We use Jaccard similarity with candidate filtering; ~2–3s for ~500 blocks on medium repos.
+Recommended workflow:
+- Run **jscpd** in CI to enforce low clone percentage (blocking).
+- Run **@aiready/pattern-detect** to surface semantic duplicates and token waste (advisory), feeding a refactoring backlog.
+- Use both for comprehensive hygiene: jscpd for exact clones; AIReady for intent-level duplication that AI tends to reintroduce.
 ## 🚀 Installation
 ```bash
@@ -57,6 +70,33 @@ aiready-patterns ./src --output json --output-file report.json
 aiready-patterns ./src --output html
 ```
+#### Presets (quick copy/paste)
+```bash
+# Speed-first (large repos)
+aiready-patterns ./src \
+  --min-shared-tokens 12 \
+  --max-candidates 60 \
+  --max-blocks 300
+# Coverage-first (more findings)
+aiready-patterns ./src \
+  --min-shared-tokens 6 \
+  --max-candidates 150
+# Short-block focus (helpers/utilities)
+aiready-patterns ./src \
+  --min-lines 5 \
+  --min-shared-tokens 6 \
+  --max-candidates 120 \
+  --exclude "**/test/**"
+# Deep dive with streaming (comprehensive detection)
+aiready-patterns ./src \
+  --no-approx \
+  --stream-results
+```
 ### Programmatic API
 ```typescript
@@ -172,7 +212,8 @@ router.get('/posts/:id', createResourceHandler('Post', database.posts.findOne));
 | Option | Description | Default |
 |--------|-------------|---------|
-| `minSimilarity` | Similarity threshold (0-1). Use 0.40 for Jaccard (default), 0.85+ for Levenshtein | `0.40` |
+| `minSimilarity` | Similarity threshold (0-1). Default `0.40` (Jaccard). Raise for only obvious duplicates; lower to catch more | `0.40` |
+| `minSimilarity` | Similarity threshold (0-1). Default `0.40` (Jaccard). Raise for only obvious duplicates; lower to catch more | `0.40` |
 | `minLines` | Minimum lines to consider a pattern | `5` |
 | `maxBlocks` | Maximum code blocks to analyze (prevents OOM) | `500` |
 | `include` | File patterns to include | `['**/*.{ts,tsx,js,jsx,py,java}']` |
@@ -237,34 +278,27 @@ Estimated tokens wasted when AI tools process duplicate code:
 ### Algorithm Complexity
-**Fast Mode (default)**: **O(B × C × T)** where:
+**Jaccard Similarity**: **O(B × C × T)** where:
 - B = number of blocks
 - C = average candidates per block (~100)
 - T = average tokens per block (~50)
-- **Jaccard similarity** is O(T) instead of O(N²) Levenshtein
+- **O(T) per comparison** instead of O(N²)
 - **Default threshold: 0.40** (comprehensive detection including tests and helpers)
-**Exact Mode** (`--no-approx --no-fast-mode`): **O(B² × N²)** where:
-- B = number of blocks
-- N = average characters per block
-- **Levenshtein similarity** - more accurate, much slower
-- **Recommended threshold: 0.85+**
-- **Not recommended for >100 files**
 ### Performance Benchmarks
-| Repo Size | Blocks | Fast Mode | Exact Mode |
-|-----------|--------|-----------|------------|
-| Small (<100 files) | ~50 | <1s | ~10s |
-| Medium (100-500 files) | ~500 | ~2s | ~8 min |
-| Large (500+ files) | ~500 (capped) | ~2s | ~76 min |
+| Repo Size | Blocks | Analysis Time |
+|-----------|--------|--------------|
+| Small (<100 files) | ~50 | <1s |
+| Medium (100-500 files) | ~500 | ~2s |
+| Large (500+ files) | ~500 (capped) | ~2s |
-**Example:** 828 code blocks → limited to 500 → **2.4s** (fast) vs **76 min** (exact)
+**Example:** 828 code blocks → limited to 500 → **2.4s** analysis time
 ### Tuning Options
 ```bash
-# Default (fast Jaccard mode, 40% threshold - comprehensive detection)
+# Default (40% threshold - comprehensive detection)
 aiready-patterns ./src
 # Higher threshold for only obvious duplicates
@@ -273,37 +307,114 @@ aiready-patterns ./src --similarity 0.65
 # Lower threshold for more potential duplicates
 aiready-patterns ./src --similarity 0.55
-# Increase quality at cost of speed
-aiready-patterns ./src --no-fast-mode --max-comparisons 100000
+# Approximate mode is default (fast, with candidate filtering)
+aiready-patterns ./src
-# Exact mode with progress tracking (slowest, shows % and ETA)
-aiready-patterns ./src --no-approx --no-fast-mode --stream-results --max-blocks 100
+# Exact mode with progress tracking (shows % and ETA)
+aiready-patterns ./src --no-approx --stream-results
 # Maximum speed (aggressive filtering)
-aiready-patterns ./src --max-blocks 200 --min-shared-tokens 12
-# Exact mode (slowest, most accurate)
-aiready-patterns ./src --no-approx --no-fast-mode --max-comparisons 500000
+aiready-patterns ./src --min-shared-tokens 12 --min-lines 10
 ```
+## 🎛️ Tuning Playbook
+Use these presets to quickly balance precision, recall, and runtime:
+- Speed-first (large repos):
+  - `aiready-patterns ./src --min-shared-tokens 12 --max-candidates 60 --max-blocks 300`
+  - Cuts weak candidates early; best for fast, iterative scans.
+- Coverage-first (more findings):
+  - `aiready-patterns ./src --min-shared-tokens 6 --max-candidates 150`
+  - Expands candidate pool; expect more results and longer runtime.
+- Short-block focus (helpers/utilities):
+  - `aiready-patterns ./src --min-lines 5 --min-shared-tokens 6 --max-candidates 120`
+  - Better recall for small functions; consider `--exclude "**/test/**"` to reduce noise.
+### Minimum Lines vs Min Shared Tokens
+- `minLines` filters which blocks are extracted; lower values include smaller functions that have fewer tokens overall.
+- Smaller blocks naturally share fewer tokens; to avoid missing true matches when `minLines` is low (≤5–6), consider lowering `minSharedTokens` by 1–2.
+- Recommended pairs:
+  - `minLines 5–6` → `minSharedTokens 6–8` (recall-friendly; watch noise)
+  - `minLines 8–10` → `minSharedTokens 8–10` (precision-first)
+- Default balance: `minLines=5`, `minSharedTokens=8` works well for most repos. Reduce `minSharedTokens` only when you specifically want to catch more short helpers.
 **CLI Options:**
-- `--stream-results` - Output duplicates as found (useful for long analysis)
-- `--no-fast-mode` - Use Levenshtein instead of Jaccard (more accurate, much slower)
-- `--no-approx` - Disable candidate filtering (enables progress % and ETA)
-- `--max-comparisons N` - Cap total comparisons (default 50K)
-- `--max-blocks N` - Limit blocks analyzed (default 500)
+- `--stream-results` - Output duplicates as found (enabled by default)
+- `--no-approx` - Disable approximate mode (slower, O(B²) complexity, use with caution)
+- `--min-lines N` - Filter blocks smaller than N lines (default 5)
+### Controlling Analysis Scope
+The tool analyzes **all extracted code blocks** by default. Control scope using:
+**1. `--min-lines` (primary filter):**
+- Filters blocks during extraction (most efficient)
+- Higher values = focus on substantial functions
+- Lower values = catch smaller utility duplicates
+**2. `--no-approx` mode (use with caution):**
+- Disables approximate mode (candidate pre-filtering)
+- O(B²) complexity - compares every block to every other block
+- **Automatic safety limit:** 500K comparisons (~1000 blocks max)
+- Shows warning when used with >500 blocks
+- Approximate mode (default) is recommended for all use cases
+**Examples:**
+```bash
+# Focus on substantial functions only
+aiready-patterns ./src --min-lines 15
+# Comprehensive scan of all functions (recommended)
+aiready-patterns ./src --min-lines 5
+# Quick scan of major duplicates
+aiready-patterns ./src --min-lines 20
+```
+**Recommendations by codebase size:**
+| Repo Size | Files | Strategy | Expected Time |
+|-----------|-------|----------|---------------|
+| **Small** | <100 | Use defaults | <1s ✅ |
+| **Medium** | 100-500 | Use defaults | 1-5s ✅ |
+| **Large** | 500-1,000 | Use defaults or `--min-lines 10` | 3-10s ✅ |
+| **Very Large** | 1,000-5,000 | `--min-lines 15` or analyze by module | 5-20s ⚠️ |
+| **Super Large** | 5,000+ | **Analyze by module** (see below) | 10-60s per module ⚠️ |
+### Analyzing Very Large Repositories
+For repos with 1,000+ files, use modular analysis:
+```bash
+# Analyze by top-level directory
+for dir in src/*/; do
+  echo "Analyzing $dir"
+  aiready-patterns "$dir" --min-lines 10
+done
+# Or focus on specific high-value areas
+aiready-patterns ./src/api --min-lines 10
+aiready-patterns ./src/core --min-lines 10
+aiready-patterns ./src/services --min-lines 10
+# For super large repos (5K+ files), increase thresholds
+aiready-patterns ./src/backend --min-lines 20 --similarity 0.50
+```
+**Why modular analysis?**
+- Ensures comprehensive coverage (100% of each module)
+- Avoids hitting comparison budget limits
+- Provides focused, actionable results per module
+- Better for CI/CD integration (parallel jobs)
 **Progress Indicators:**
 - **Approx mode**: Shows blocks processed + duplicates found
 - **Exact mode**: Shows % complete, ETA, and comparisons processed
-- **Stream mode**: Prints each duplicate immediately when found
-**Recommendations:**
-- **< 100 files**: Use defaults, or try `--no-fast-mode` for higher accuracy
-- **100-500 files**: Use defaults with fast mode (2-5s typical)
-- **500-1000 files**: Use `--max-blocks 500 --min-lines 10` (~3-10s)
-- **1000+ files**: Use `--max-blocks 300 --min-lines 15` or analyze by module
-- **Slow analysis**: Add `--stream-results` to see progress in real-time
+- **Stream mode**: Prints each duplicate immediately when found (enabled by default)
 ## 🔧 CI/CD Integration
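
The README hunks above describe the detector as Jaccard-only, with an O(T) cost per comparison and a default threshold of `0.40`. A minimal sketch of that comparison follows. The `union`/`intersection` arithmetic mirrors the context lines of `jaccardSimilarity` visible in the chunk diffs; `tokenize` is a hypothetical stand-in, since the package's own normalization and tokenization are not shown in this diff.

```javascript
// Hypothetical tokenizer for illustration only: lowercases the source and
// splits on anything that is not an identifier character.
function tokenize(code) {
  return code
    .toLowerCase()
    .split(/[^a-z0-9_$]+/)
    .filter(Boolean);
}

// Jaccard similarity: |A ∩ B| / |A ∪ B| over the two token sets.
// One pass over set1 with O(1) lookups in set2, so linear in token count.
function jaccardSimilarity(tokens1, tokens2) {
  const set1 = new Set(tokens1);
  const set2 = new Set(tokens2);
  let intersection = 0;
  for (const token of set1) {
    if (set2.has(token)) intersection++;
  }
  const union = set1.size + set2.size - intersection;
  return union === 0 ? 0 : intersection / union;
}

const a = tokenize("function getUser(req, res) { res.json(db.users.find(req.params.id)); }");
const b = tokenize("function getPost(req, res) { res.json(db.posts.find(req.params.id)); }");

// The two handlers share 8 of 12 distinct tokens.
const score = jaccardSimilarity(a, b);
console.log(score.toFixed(2), score >= 0.4 ? "reported" : "ignored"); // 0.67 reported
```

Because the score depends only on set membership, reordered statements or repeated tokens do not change it, which is one plausible reason a lower threshold such as `0.40` is workable here.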

package/dist/{chunk-N5DE7IYX.mjs → chunk-AXHGYYYZ.mjs} RENAMED

@@ -2,7 +2,7 @@
 import { scanFiles, readFileContent } from "@aiready/core";
 // src/detector.ts
-import { similarityScore, estimateTokens } from "@aiready/core";
+import { estimateTokens } from "@aiready/core";
 function categorizePattern(code) {
   const lower = code.toLowerCase();
   if (lower.includes("request") && lower.includes("response") || lower.includes("router.") || lower.includes("app.get") || lower.includes("app.post") || lower.includes("express") || lower.includes("ctx.body")) {
@@ -80,15 +80,6 @@ function jaccardSimilarity(tokens1, tokens2) {
   const union = set1.size + set2.size - intersection;
   return union === 0 ? 0 : intersection / union;
 }
-function calculateSimilarity(block1, block2) {
-  const norm1 = normalizeCode(block1);
-  const norm2 = normalizeCode(block2);
-  const baseSimilarity = similarityScore(norm1, norm2);
-  const tokens1 = norm1.split(/[\s(){}[\];,]+/).filter(Boolean);
-  const tokens2 = norm2.split(/[\s(){}[\];,]+/).filter(Boolean);
-  const tokenSimilarity = similarityScore(tokens1.join(" "), tokens2.join(" "));
-  return baseSimilarity * 0.4 + tokenSimilarity * 0.6;
-}
 async function detectDuplicatePatterns(files, options) {
   const {
     minSimilarity,
@@ -98,7 +89,6 @@ async function detectDuplicatePatterns(files, options) {
     approx = true,
     minSharedTokens = 8,
     maxCandidatesPerBlock = 100,
-    fastMode = true,
     maxComparisons = 5e4,
     // Cap at 50K comparisons by default
     streamResults = false
@@ -213,7 +203,7 @@ async function detectDuplicatePatterns(files, options) {
         if (maxComparisons && comparisonsProcessed >= maxComparisons) break;
         comparisonsProcessed++;
         const block2 = allBlocks[j];
-        const similarity = fastMode ? jaccardSimilarity(blockTokens[i], blockTokens[j]) : calculateSimilarity(block1.content, block2.content);
+        const similarity = jaccardSimilarity(blockTokens[i], blockTokens[j]);
         if (similarity >= minSimilarity) {
           const duplicate = {
             file1: block1.file,
@@ -243,7 +233,7 @@ async function detectDuplicatePatterns(files, options) {
         comparisonsProcessed++;
         const block2 = allBlocks[j];
         if (block1.file === block2.file) continue;
-        const similarity = fastMode ? jaccardSimilarity(blockTokens[i], blockTokens[j]) : calculateSimilarity(block1.content, block2.content);
+        const similarity = jaccardSimilarity(blockTokens[i], blockTokens[j]);
         if (similarity >= minSimilarity) {
           const duplicate = {
             file1: block1.file,
@@ -293,15 +283,14 @@ function getRefactoringSuggestion(patternType, similarity) {
 }
 async function analyzePatterns(options) {
   const {
-    minSimilarity = 0.65,
-    // Lower default for fast Jaccard mode (Levenshtein would be 0.85+)
+    minSimilarity = 0.4,
+    // Jaccard similarity default (40% threshold)
     minLines = 5,
     maxBlocks = 500,
     batchSize = 100,
     approx = true,
     minSharedTokens = 8,
     maxCandidatesPerBlock = 100,
-    fastMode = true,
     maxComparisons = 5e4,
     streamResults = false,
     ...scanOptions
@@ -322,7 +311,6 @@ async function analyzePatterns(options) {
     approx,
     minSharedTokens,
     maxCandidatesPerBlock,
-    fastMode,
     maxComparisons,
     streamResults
   });
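
The hunks above keep `approx = true`, `minSharedTokens = 8`, and `maxCandidatesPerBlock = 100`, and iterate `for (const { j } of candidates)`, but the code that builds `candidates` is outside the diff context. The sketch below shows one common way such pre-filtering can be done, using an inverted token index. It is an assumption about the general technique, not the package's implementation; `buildInvertedIndex` and `findCandidates` are hypothetical names.

```javascript
// Map each token to the indices of the blocks that contain it.
function buildInvertedIndex(blockTokens) {
  const index = new Map();
  blockTokens.forEach((tokens, i) => {
    for (const token of new Set(tokens)) {
      if (!index.has(token)) index.set(token, []);
      index.get(token).push(i);
    }
  });
  return index;
}

// For block i, count distinct tokens shared with every later block, keep
// blocks meeting minSharedTokens, and cap the list at maxCandidatesPerBlock
// (strongest overlap first).
function findCandidates(i, blockTokens, index, minSharedTokens, maxCandidatesPerBlock) {
  const shared = new Map(); // j -> number of distinct shared tokens
  for (const token of new Set(blockTokens[i])) {
    for (const j of index.get(token) ?? []) {
      if (j <= i) continue; // visit each unordered pair once
      shared.set(j, (shared.get(j) ?? 0) + 1);
    }
  }
  return [...shared.entries()]
    .filter(([, count]) => count >= minSharedTokens)
    .sort((x, y) => y[1] - x[1])
    .slice(0, maxCandidatesPerBlock)
    .map(([j, count]) => ({ j, shared: count }));
}

const blockTokens = [
  ["req", "res", "json", "find", "id"],
  ["req", "res", "json", "find", "slug"],
  ["sum", "total", "items", "price"],
  ["req", "res", "send"],
];
const index = buildInvertedIndex(blockTokens);

// With a threshold of 3 shared tokens, only block 1 survives for block 0.
const candidates = findCandidates(0, blockTokens, index, 3, 100);
console.log(candidates); // [ { j: 1, shared: 4 } ]
```

Under this scheme, raising the shared-token threshold shrinks the candidate list before any similarity score is computed, which matches the README's description of `--min-shared-tokens 12` as the speed-first setting.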

package/dist/{chunk-57O7FEEM.mjs → chunk-JKVKOXYR.mjs} RENAMED

@@ -2,7 +2,7 @@
 import { scanFiles, readFileContent } from "@aiready/core";
 // src/detector.ts
-import { similarityScore, estimateTokens } from "@aiready/core";
+import { estimateTokens } from "@aiready/core";
 function categorizePattern(code) {
   const lower = code.toLowerCase();
   if (lower.includes("request") && lower.includes("response") || lower.includes("router.") || lower.includes("app.get") || lower.includes("app.post") || lower.includes("express") || lower.includes("ctx.body")) {
@@ -54,6 +54,7 @@ function extractCodeBlocks(content, minLines) {
       blocks.push({
         content: blockContent,
         startLine: blockStart + 1,
+        endLine: i + 1,
         patternType: categorizePattern(blockContent),
         linesOfCode
       });
@@ -79,43 +80,34 @@ function jaccardSimilarity(tokens1, tokens2) {
   const union = set1.size + set2.size - intersection;
   return union === 0 ? 0 : intersection / union;
 }
-function calculateSimilarity(block1, block2) {
-  const norm1 = normalizeCode(block1);
-  const norm2 = normalizeCode(block2);
-  const baseSimilarity = similarityScore(norm1, norm2);
-  const tokens1 = norm1.split(/[\s(){}[\];,]+/).filter(Boolean);
-  const tokens2 = norm2.split(/[\s(){}[\];,]+/).filter(Boolean);
-  const tokenSimilarity = similarityScore(tokens1.join(" "), tokens2.join(" "));
-  return baseSimilarity * 0.4 + tokenSimilarity * 0.6;
-}
 async function detectDuplicatePatterns(files, options) {
   const {
     minSimilarity,
     minLines,
-    maxBlocks = 500,
     batchSize = 100,
     approx = true,
     minSharedTokens = 8,
     maxCandidatesPerBlock = 100,
-    fastMode = true,
-    maxComparisons = 5e4,
-    // Cap at 50K comparisons by default
     streamResults = false
   } = options;
   const duplicates = [];
-  let allBlocks = files.flatMap(
+  const maxComparisons = approx ? Infinity : 5e5;
+  const allBlocks = files.flatMap(
     (file) => extractCodeBlocks(file.content, minLines).map((block) => ({
-      ...block,
+      content: block.content,
+      startLine: block.startLine,
+      endLine: block.endLine,
       file: file.file,
       normalized: normalizeCode(block.content),
-      tokenCost: estimateTokens(block.content)
+      patternType: block.patternType,
+      tokenCost: estimateTokens(block.content),
+      linesOfCode: block.linesOfCode
     }))
   );
   console.log(`Extracted ${allBlocks.length} code blocks for analysis`);
-  if (allBlocks.length > maxBlocks) {
-    console.log(`\u26A0\uFE0F  Limiting to ${maxBlocks} blocks (sorted by size) to prevent memory issues`);
-    console.log(`   Use --max-blocks to increase limit or --min-lines to filter smaller blocks`);
-    allBlocks = allBlocks.sort((a, b) => b.linesOfCode - a.linesOfCode).slice(0, maxBlocks);
+  if (!approx && allBlocks.length > 500) {
+    console.log(`\u26A0\uFE0F  Using --no-approx mode with ${allBlocks.length} blocks may be slow (O(B\xB2) complexity).`);
+    console.log(`   Consider using approximate mode (default) for better performance.`);
   }
   const stopwords = /* @__PURE__ */ new Set([
     "return",
@@ -205,16 +197,22 @@ async function detectDuplicatePatterns(files, options) {
     }
     if (approx && candidates) {
       for (const { j } of candidates) {
-        if (maxComparisons && comparisonsProcessed >= maxComparisons) break;
+        if (!approx && maxComparisons !== Infinity && comparisonsProcessed >= maxComparisons) {
+          console.log(`\u26A0\uFE0F  Comparison safety limit reached (${maxComparisons.toLocaleString()} comparisons in --no-approx mode).`);
+          console.log(`   This prevents excessive runtime on large repos. Consider using approximate mode (default) or --min-lines to reduce blocks.`);
+          break;
+        }
         comparisonsProcessed++;
         const block2 = allBlocks[j];
-        const similarity = fastMode ? jaccardSimilarity(blockTokens[i], blockTokens[j]) : calculateSimilarity(block1.content, block2.content);
+        const similarity = jaccardSimilarity(blockTokens[i], blockTokens[j]);
         if (similarity >= minSimilarity) {
           const duplicate = {
             file1: block1.file,
             file2: block2.file,
             line1: block1.startLine,
             line2: block2.startLine,
+            endLine1: block1.endLine,
+            endLine2: block2.endLine,
             similarity,
             snippet: block1.content.split("\n").slice(0, 5).join("\n") + "\n...",
             patternType: block1.patternType,
@@ -225,7 +223,7 @@ async function detectDuplicatePatterns(files, options) {
           if (streamResults) {
             console.log(`
    \u2705 Found: ${duplicate.patternType} ${Math.round(similarity * 100)}% similar`);
-            console.log(`      ${duplicate.file1}:${duplicate.line1} \u21D4 ${duplicate.file2}:${duplicate.line2}`);
+            console.log(`      ${duplicate.file1}:${duplicate.line1}-${duplicate.endLine1} \u21D4 ${duplicate.file2}:${duplicate.line2}-${duplicate.endLine2}`);
             console.log(`      Token cost: ${duplicate.tokenCost.toLocaleString()}`);
           }
         }
@@ -236,13 +234,15 @@ async function detectDuplicatePatterns(files, options) {
         comparisonsProcessed++;
         const block2 = allBlocks[j];
         if (block1.file === block2.file) continue;
-        const similarity = fastMode ? jaccardSimilarity(blockTokens[i], blockTokens[j]) : calculateSimilarity(block1.content, block2.content);
+        const similarity = jaccardSimilarity(blockTokens[i], blockTokens[j]);
         if (similarity >= minSimilarity) {
           const duplicate = {
             file1: block1.file,
             file2: block2.file,
             line1: block1.startLine,
             line2: block2.startLine,
+            endLine1: block1.endLine,
+            endLine2: block2.endLine,
             similarity,
             snippet: block1.content.split("\n").slice(0, 5).join("\n") + "\n...",
             patternType: block1.patternType,
@@ -253,7 +253,7 @@ async function detectDuplicatePatterns(files, options) {
           if (streamResults) {
             console.log(`
    \u2705 Found: ${duplicate.patternType} ${Math.round(similarity * 100)}% similar`);
-            console.log(`      ${duplicate.file1}:${duplicate.line1} \u21D4 ${duplicate.file2}:${duplicate.line2}`);
+            console.log(`      ${duplicate.file1}:${duplicate.line1}-${duplicate.endLine1} \u21D4 ${duplicate.file2}:${duplicate.line2}-${duplicate.endLine2}`);
             console.log(`      Token cost: ${duplicate.tokenCost.toLocaleString()}`);
           }
         }
@@ -284,16 +284,13 @@ function getRefactoringSuggestion(patternType, similarity) {
 }
 async function analyzePatterns(options) {
   const {
-    minSimilarity = 0.65,
-    // Lower default for fast Jaccard mode (Levenshtein would be 0.85+)
+    minSimilarity = 0.4,
+    // Jaccard similarity default (40% threshold)
     minLines = 5,
-    maxBlocks = 500,
     batchSize = 100,
     approx = true,
     minSharedTokens = 8,
     maxCandidatesPerBlock = 100,
-    fastMode = true,
-    maxComparisons = 5e4,
     streamResults = false,
     ...scanOptions
   } = options;
@@ -308,13 +305,10 @@ async function analyzePatterns(options) {
   const duplicates = await detectDuplicatePatterns(fileContents, {
     minSimilarity,
     minLines,
-    maxBlocks,
     batchSize,
     approx,
     minSharedTokens,
     maxCandidatesPerBlock,
-    fastMode,
-    maxComparisons,
     streamResults
   });
   for (const file of files) {
@@ -378,8 +372,21 @@ function generateSummary(results) {
     const typeMatch = issue.message.match(/^(\S+(?:-\S+)*) pattern/);
     const fileMatch = issue.message.match(/similar to (.+?) \(/);
     return {
-      file1: issue.location.file,
-      file2: fileMatch?.[1] || "unknown",
+      files: [
+        {
+          path: issue.location.file,
+          startLine: issue.location.line,
+          endLine: 0
+          // Not available from Issue
+        },
+        {
+          path: fileMatch?.[1] || "unknown",
+          startLine: 0,
+          // Not available from Issue
+          endLine: 0
+          // Not available from Issue
+        }
+      ],
       similarity: similarityMatch ? parseInt(similarityMatch[1]) / 100 : 0,
       patternType: typeMatch?.[1] || "unknown",
       tokenCost: tokenMatch ? parseInt(tokenMatch[1]) : 0
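
This chunk sets `maxComparisons` to `5e5` when `approx` is false, and the README hunk describes that as "500K comparisons (~1000 blocks max)". The arithmetic behind the approximation can be checked with a short sketch, assuming the exhaustive loop visits each unordered pair of blocks once, so that B blocks cost B × (B − 1) / 2 comparisons. The helper names are hypothetical.

```javascript
// Number of unordered pairs among `blocks` blocks.
function pairCount(blocks) {
  return (blocks * (blocks - 1)) / 2;
}

// Largest block count whose full pairwise scan fits within `budget`.
function maxBlocksForBudget(budget) {
  // Positive root of B * (B - 1) / 2 = budget, rounded down.
  let b = Math.floor((1 + Math.sqrt(1 + 8 * budget)) / 2);
  while (pairCount(b) > budget) b--; // guard against floating-point rounding
  return b;
}

const SAFETY_LIMIT = 5e5; // maxComparisons in --no-approx mode, per the hunk above

console.log(maxBlocksForBudget(SAFETY_LIMIT)); // 1000
console.log(pairCount(1000)); // 499500, within the limit
console.log(pairCount(1001)); // 500500, over the limit
```

Under that assumption, a scan of more than 1,000 blocks would stop early at the safety limit, which is consistent with the README's advice to raise `--min-lines` or analyze by module on large repositories.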

package/dist/{chunk-DNI7S33V.mjs → chunk-OFGMDX66.mjs} RENAMED

@@ -2,7 +2,7 @@
 import { scanFiles, readFileContent } from "@aiready/core";
 // src/detector.ts
-import { similarityScore, estimateTokens } from "@aiready/core";
+import { estimateTokens } from "@aiready/core";
 function categorizePattern(code) {
   const lower = code.toLowerCase();
   if (lower.includes("request") && lower.includes("response") || lower.includes("router.") || lower.includes("app.get") || lower.includes("app.post") || lower.includes("express") || lower.includes("ctx.body")) {
@@ -54,6 +54,7 @@ function extractCodeBlocks(content, minLines) {
       blocks.push({
         content: blockContent,
         startLine: blockStart + 1,
+        endLine: i + 1,
         patternType: categorizePattern(blockContent),
         linesOfCode
       });
@@ -79,44 +80,32 @@ function jaccardSimilarity(tokens1, tokens2) {
   const union = set1.size + set2.size - intersection;
   return union === 0 ? 0 : intersection / union;
 }
-function calculateSimilarity(block1, block2) {
-  const norm1 = normalizeCode(block1);
-  const norm2 = normalizeCode(block2);
-  const baseSimilarity = similarityScore(norm1, norm2);
-  const tokens1 = norm1.split(/[\s(){}[\];,]+/).filter(Boolean);
-  const tokens2 = norm2.split(/[\s(){}[\];,]+/).filter(Boolean);
-  const tokenSimilarity = similarityScore(tokens1.join(" "), tokens2.join(" "));
-  return baseSimilarity * 0.4 + tokenSimilarity * 0.6;
-}
 async function detectDuplicatePatterns(files, options) {
   const {
     minSimilarity,
     minLines,
-    maxBlocks = 500,
     batchSize = 100,
     approx = true,
     minSharedTokens = 8,
     maxCandidatesPerBlock = 100,
-    fastMode = true,
     maxComparisons = 5e4,
     // Cap at 50K comparisons by default
     streamResults = false
   } = options;
   const duplicates = [];
-  let allBlocks = files.flatMap(
+  const allBlocks = files.flatMap(
     (file) => extractCodeBlocks(file.content, minLines).map((block) => ({
-      ...block,
+      content: block.content,
+      startLine: block.startLine,
+      endLine: block.endLine,
       file: file.file,
       normalized: normalizeCode(block.content),
-      tokenCost: estimateTokens(block.content)
+      patternType: block.patternType,
+      tokenCost: estimateTokens(block.content),
+      linesOfCode: block.linesOfCode
     }))
   );
   console.log(`Extracted ${allBlocks.length} code blocks for analysis`);
-  if (allBlocks.length > maxBlocks) {
-    console.log(`\u26A0\uFE0F  Limiting to ${maxBlocks} blocks (sorted by size) to prevent memory issues`);
-    console.log(`   Use --max-blocks to increase limit or --min-lines to filter smaller blocks`);
-    allBlocks = allBlocks.sort((a, b) => b.linesOfCode - a.linesOfCode).slice(0, maxBlocks);
-  }
   const stopwords = /* @__PURE__ */ new Set([
     "return",
     "const",
@@ -208,13 +197,15 @@ async function detectDuplicatePatterns(files, options) {
         if (maxComparisons && comparisonsProcessed >= maxComparisons) break;
         comparisonsProcessed++;
         const block2 = allBlocks[j];
-        const similarity = fastMode ? jaccardSimilarity(blockTokens[i], blockTokens[j]) : calculateSimilarity(block1.content, block2.content);
+        const similarity = jaccardSimilarity(blockTokens[i], blockTokens[j]);
         if (similarity >= minSimilarity) {
           const duplicate = {
             file1: block1.file,
             file2: block2.file,
             line1: block1.startLine,
             line2: block2.startLine,
+            endLine1: block1.endLine,
+            endLine2: block2.endLine,
             similarity,
             snippet: block1.content.split("\n").slice(0, 5).join("\n") + "\n...",
             patternType: block1.patternType,
@@ -225,7 +216,7 @@ async function detectDuplicatePatterns(files, options) {
           if (streamResults) {
             console.log(`
    \u2705 Found: ${duplicate.patternType} ${Math.round(similarity * 100)}% similar`);
-            console.log(`      ${duplicate.file1}:${duplicate.line1} \u21D4 ${duplicate.file2}:${duplicate.line2}`);
+            console.log(`      ${duplicate.file1}:${duplicate.line1}-${duplicate.endLine1} \u21D4 ${duplicate.file2}:${duplicate.line2}-${duplicate.endLine2}`);
             console.log(`      Token cost: ${duplicate.tokenCost.toLocaleString()}`);
           }
         }
@@ -236,13 +227,15 @@ async function detectDuplicatePatterns(files, options) {
         comparisonsProcessed++;
         const block2 = allBlocks[j];
         if (block1.file === block2.file) continue;
-        const similarity = fastMode ? jaccardSimilarity(blockTokens[i], blockTokens[j]) : calculateSimilarity(block1.content, block2.content);
+        const similarity = jaccardSimilarity(blockTokens[i], blockTokens[j]);
         if (similarity >= minSimilarity) {
           const duplicate = {
             file1: block1.file,
             file2: block2.file,
             line1: block1.startLine,
             line2: block2.startLine,
+            endLine1: block1.endLine,
+            endLine2: block2.endLine,
             similarity,
             snippet: block1.content.split("\n").slice(0, 5).join("\n") + "\n...",
             patternType: block1.patternType,
@@ -253,7 +246,7 @@ async function detectDuplicatePatterns(files, options) {
           if (streamResults) {
             console.log(`
    \u2705 Found: ${duplicate.patternType} ${Math.round(similarity * 100)}% similar`);
-            console.log(`      ${duplicate.file1}:${duplicate.line1} \u21D4 ${duplicate.file2}:${duplicate.line2}`);
+            console.log(`      ${duplicate.file1}:${duplicate.line1}-${duplicate.endLine1} \u21D4 ${duplicate.file2}:${duplicate.line2}-${duplicate.endLine2}`);
             console.log(`      Token cost: ${duplicate.tokenCost.toLocaleString()}`);
           }
         }
@@ -284,14 +277,13 @@ function getRefactoringSuggestion(patternType, similarity) {
 }
 async function analyzePatterns(options) {
   const {
-    minSimilarity = 0.85,
+    minSimilarity = 0.4,
+    // Jaccard similarity default (40% threshold)
     minLines = 5,
-    maxBlocks = 500,
     batchSize = 100,
     approx = true,
     minSharedTokens = 8,
     maxCandidatesPerBlock = 100,
-    fastMode = true,
     maxComparisons = 5e4,
     streamResults = false,
     ...scanOptions
@@ -307,12 +299,10 @@ async function analyzePatterns(options) {
   const duplicates = await detectDuplicatePatterns(fileContents, {
     minSimilarity,
     minLines,
-    maxBlocks,
     batchSize,
     approx,
     minSharedTokens,
     maxCandidatesPerBlock,
-    fastMode,
     maxComparisons,
     streamResults
   });
@@ -377,8 +367,21 @@ function generateSummary(results) {
     const typeMatch = issue.message.match(/^(\S+(?:-\S+)*) pattern/);
     const fileMatch = issue.message.match(/similar to (.+?) \(/);
     return {
-      file1: issue.location.file,
-      file2: fileMatch?.[1] || "unknown",
+      files: [
+        {
+          path: issue.location.file,
+          startLine: issue.location.line,
+          endLine: 0
+          // Not available from Issue
+        },
+        {
+          path: fileMatch?.[1] || "unknown",
+          startLine: 0,
+          // Not available from Issue
+          endLine: 0
+          // Not available from Issue
+        }
+      ],
       similarity: similarityMatch ? parseInt(similarityMatch[1]) / 100 : 0,
       patternType: typeMatch?.[1] || "unknown",
       tokenCost: tokenMatch ? parseInt(tokenMatch[1]) : 0
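
The `generateSummary` hunk above replaces the flat `file1`/`file2` fields with a `files` array of `{ path, startLine, endLine }` objects, using `0` where the value is "Not available from Issue". Code that consumes these entries across both versions may need to accept either shape. The following is a consumer-side sketch under that assumption; `normalizeDuplicateEntry` and `formatLocation` are hypothetical helpers and not part of the package.

```javascript
// Convert a 0.1.3-style entry (file1/file2) to the 0.2.0-style `files` array.
// Entries already in the new shape are returned unchanged.
function normalizeDuplicateEntry(entry) {
  if (Array.isArray(entry.files)) return entry;
  const { file1, file2, ...rest } = entry;
  return {
    ...rest,
    files: [
      // The old shape carried no line information, so 0 marks "unknown",
      // matching the placeholder used in the diffed code.
      { path: file1, startLine: 0, endLine: 0 },
      { path: file2 ?? "unknown", startLine: 0, endLine: 0 },
    ],
  };
}

// Render "path", "path:start", or "path:start-end" depending on what is known.
function formatLocation({ path, startLine, endLine }) {
  if (!startLine) return path;
  return endLine ? `${path}:${startLine}-${endLine}` : `${path}:${startLine}`;
}

const legacy = {
  file1: "src/a.ts",
  file2: "src/b.ts",
  similarity: 0.72,
  patternType: "api-handler",
  tokenCost: 310,
};
const current = {
  files: [
    { path: "src/a.ts", startLine: 12, endLine: 0 },
    { path: "src/b.ts", startLine: 0, endLine: 0 },
  ],
  similarity: 0.72,
  patternType: "api-handler",
  tokenCost: 310,
};

for (const entry of [legacy, current]) {
  const { files, similarity } = normalizeDuplicateEntry(entry);
  console.log(files.map(formatLocation).join(" <=> "), similarity);
}
```

Treating `0` as "unknown" rather than as a real line number keeps the output honest for summary entries, where the diff shows the end lines are always `0`.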