@mytechtoday/augment-extensions 0.5.0 → 1.2.0

This diff shows the changes between publicly available package versions as released to one of the supported registries. It is provided for informational purposes only.
Files changed (523)
  1. package/AGENTS.md +265 -232
  2. package/README.md +956 -771
  3. package/augment-extensions/coding-standards/bash/README.md +196 -196
  4. package/augment-extensions/coding-standards/bash/module.json +163 -163
  5. package/augment-extensions/coding-standards/bash/rules/naming-conventions.md +336 -336
  6. package/augment-extensions/coding-standards/bash/rules/universal-standards.md +289 -289
  7. package/augment-extensions/coding-standards/css/README.md +40 -40
  8. package/augment-extensions/coding-standards/css/examples/css-examples.css +550 -550
  9. package/augment-extensions/coding-standards/css/module.json +44 -44
  10. package/augment-extensions/coding-standards/css/rules/css-modern-features.md +448 -448
  11. package/augment-extensions/coding-standards/css/rules/css-standards.md +492 -492
  12. package/augment-extensions/coding-standards/html/README.md +40 -40
  13. package/augment-extensions/coding-standards/html/examples/html-examples.html +267 -267
  14. package/augment-extensions/coding-standards/html/examples/responsive-layout.html +505 -505
  15. package/augment-extensions/coding-standards/html/module.json +44 -44
  16. package/augment-extensions/coding-standards/html/rules/html-standards.md +349 -349
  17. package/augment-extensions/coding-standards/html-css-js/README.md +194 -194
  18. package/augment-extensions/coding-standards/html-css-js/examples/async-examples.js +487 -487
  19. package/augment-extensions/coding-standards/html-css-js/examples/css-examples.css +550 -550
  20. package/augment-extensions/coding-standards/html-css-js/examples/dom-examples.js +667 -667
  21. package/augment-extensions/coding-standards/html-css-js/examples/html-examples.html +267 -267
  22. package/augment-extensions/coding-standards/html-css-js/examples/javascript-examples.js +612 -612
  23. package/augment-extensions/coding-standards/html-css-js/examples/responsive-layout.html +505 -505
  24. package/augment-extensions/coding-standards/html-css-js/module.json +48 -48
  25. package/augment-extensions/coding-standards/html-css-js/rules/async-patterns.md +515 -515
  26. package/augment-extensions/coding-standards/html-css-js/rules/css-modern-features.md +448 -448
  27. package/augment-extensions/coding-standards/html-css-js/rules/css-standards.md +492 -492
  28. package/augment-extensions/coding-standards/html-css-js/rules/dom-manipulation.md +439 -439
  29. package/augment-extensions/coding-standards/html-css-js/rules/html-standards.md +349 -349
  30. package/augment-extensions/coding-standards/html-css-js/rules/javascript-standards.md +486 -486
  31. package/augment-extensions/coding-standards/html-css-js/rules/performance.md +463 -463
  32. package/augment-extensions/coding-standards/html-css-js/rules/tooling.md +543 -543
  33. package/augment-extensions/coding-standards/js/README.md +46 -46
  34. package/augment-extensions/coding-standards/js/examples/async-examples.js +487 -487
  35. package/augment-extensions/coding-standards/js/examples/dom-examples.js +667 -667
  36. package/augment-extensions/coding-standards/js/examples/javascript-examples.js +612 -612
  37. package/augment-extensions/coding-standards/js/module.json +49 -49
  38. package/augment-extensions/coding-standards/js/rules/async-patterns.md +515 -515
  39. package/augment-extensions/coding-standards/js/rules/dom-manipulation.md +439 -439
  40. package/augment-extensions/coding-standards/js/rules/javascript-standards.md +486 -486
  41. package/augment-extensions/coding-standards/js/rules/performance.md +463 -463
  42. package/augment-extensions/coding-standards/js/rules/tooling.md +543 -543
  43. package/augment-extensions/coding-standards/php/README.md +248 -248
  44. package/augment-extensions/coding-standards/php/examples/api-endpoint-example.php +204 -204
  45. package/augment-extensions/coding-standards/php/examples/cli-command-example.php +206 -206
  46. package/augment-extensions/coding-standards/php/examples/legacy-refactoring-example.php +234 -234
  47. package/augment-extensions/coding-standards/php/examples/web-application-example.php +211 -211
  48. package/augment-extensions/coding-standards/php/examples/woocommerce-extension-example.php +215 -215
  49. package/augment-extensions/coding-standards/php/examples/wordpress-plugin-example.php +189 -189
  50. package/augment-extensions/coding-standards/php/module.json +166 -166
  51. package/augment-extensions/coding-standards/php/rules/api-development.md +480 -480
  52. package/augment-extensions/coding-standards/php/rules/category-configuration.md +332 -332
  53. package/augment-extensions/coding-standards/php/rules/cli-tools.md +472 -472
  54. package/augment-extensions/coding-standards/php/rules/cms-integration.md +561 -561
  55. package/augment-extensions/coding-standards/php/rules/code-quality.md +402 -402
  56. package/augment-extensions/coding-standards/php/rules/documentation.md +425 -425
  57. package/augment-extensions/coding-standards/php/rules/ecommerce.md +627 -627
  58. package/augment-extensions/coding-standards/php/rules/error-handling.md +336 -336
  59. package/augment-extensions/coding-standards/php/rules/legacy-migration.md +677 -677
  60. package/augment-extensions/coding-standards/php/rules/naming-conventions.md +279 -279
  61. package/augment-extensions/coding-standards/php/rules/performance.md +392 -392
  62. package/augment-extensions/coding-standards/php/rules/psr-standards.md +186 -186
  63. package/augment-extensions/coding-standards/php/rules/security.md +358 -358
  64. package/augment-extensions/coding-standards/php/rules/testing.md +403 -403
  65. package/augment-extensions/coding-standards/php/rules/type-declarations.md +331 -331
  66. package/augment-extensions/coding-standards/php/rules/web-applications.md +426 -426
  67. package/augment-extensions/coding-standards/powershell/README.md +154 -154
  68. package/augment-extensions/coding-standards/powershell/examples/admin-example.ps1 +272 -272
  69. package/augment-extensions/coding-standards/powershell/examples/automation-example.ps1 +173 -173
  70. package/augment-extensions/coding-standards/powershell/examples/cloud-example.ps1 +243 -243
  71. package/augment-extensions/coding-standards/powershell/examples/cross-platform-example.ps1 +297 -297
  72. package/augment-extensions/coding-standards/powershell/examples/dsc-example.ps1 +224 -224
  73. package/augment-extensions/coding-standards/powershell/examples/legacy-migration-example.ps1 +340 -340
  74. package/augment-extensions/coding-standards/powershell/examples/module-example.psm1 +255 -255
  75. package/augment-extensions/coding-standards/powershell/module.json +165 -165
  76. package/augment-extensions/coding-standards/powershell/rules/administrative-tools.md +439 -439
  77. package/augment-extensions/coding-standards/powershell/rules/automation-scripts.md +240 -240
  78. package/augment-extensions/coding-standards/powershell/rules/cloud-orchestration.md +384 -384
  79. package/augment-extensions/coding-standards/powershell/rules/configuration-schema.md +383 -383
  80. package/augment-extensions/coding-standards/powershell/rules/cross-platform-scripts.md +482 -482
  81. package/augment-extensions/coding-standards/powershell/rules/dsc-configurations.md +296 -296
  82. package/augment-extensions/coding-standards/powershell/rules/error-handling.md +314 -314
  83. package/augment-extensions/coding-standards/powershell/rules/legacy-migrations.md +466 -466
  84. package/augment-extensions/coding-standards/powershell/rules/modules-functions.md +244 -244
  85. package/augment-extensions/coding-standards/powershell/rules/naming-conventions.md +266 -266
  86. package/augment-extensions/coding-standards/powershell/rules/performance-optimization.md +209 -209
  87. package/augment-extensions/coding-standards/powershell/rules/security-practices.md +314 -314
  88. package/augment-extensions/coding-standards/powershell/rules/testing-guidelines.md +268 -268
  89. package/augment-extensions/coding-standards/powershell/rules/universal-standards.md +197 -197
  90. package/augment-extensions/coding-standards/python/README.md +48 -48
  91. package/augment-extensions/coding-standards/python/examples/best-practices.py +373 -373
  92. package/augment-extensions/coding-standards/python/module.json +30 -30
  93. package/augment-extensions/coding-standards/python/rules/async-patterns.md +884 -884
  94. package/augment-extensions/coding-standards/python/rules/best-practices.md +232 -232
  95. package/augment-extensions/coding-standards/python/rules/code-organization.md +220 -220
  96. package/augment-extensions/coding-standards/python/rules/documentation.md +831 -831
  97. package/augment-extensions/coding-standards/python/rules/error-handling.md +1008 -1008
  98. package/augment-extensions/coding-standards/python/rules/naming-conventions.md +172 -172
  99. package/augment-extensions/coding-standards/python/rules/testing.md +409 -409
  100. package/augment-extensions/coding-standards/python/rules/tooling.md +446 -446
  101. package/augment-extensions/coding-standards/python/rules/type-hints.md +253 -253
  102. package/augment-extensions/coding-standards/react/README.md +45 -45
  103. package/augment-extensions/coding-standards/react/module.json +27 -27
  104. package/augment-extensions/coding-standards/react/rules/component-patterns.md +214 -214
  105. package/augment-extensions/coding-standards/react/rules/hooks-best-practices.md +235 -235
  106. package/augment-extensions/coding-standards/react/rules/performance.md +300 -300
  107. package/augment-extensions/coding-standards/react/rules/state-management.md +265 -265
  108. package/augment-extensions/coding-standards/react/rules/typescript-react.md +271 -271
  109. package/augment-extensions/coding-standards/typescript/README.md +45 -45
  110. package/augment-extensions/coding-standards/typescript/module.json +27 -27
  111. package/augment-extensions/coding-standards/typescript/rules/naming-conventions.md +225 -225
  112. package/augment-extensions/collections/html-css-js/README.md +82 -82
  113. package/augment-extensions/collections/html-css-js/collection.json +41 -41
  114. package/augment-extensions/domain-rules/api-design/README.md +41 -41
  115. package/augment-extensions/domain-rules/api-design/module.json +27 -27
  116. package/augment-extensions/domain-rules/api-design/rules/authentication.md +263 -263
  117. package/augment-extensions/domain-rules/api-design/rules/documentation.md +395 -395
  118. package/augment-extensions/domain-rules/api-design/rules/error-handling.md +290 -290
  119. package/augment-extensions/domain-rules/api-design/rules/graphql-api.md +313 -313
  120. package/augment-extensions/domain-rules/api-design/rules/rest-api.md +214 -214
  121. package/augment-extensions/domain-rules/api-design/rules/versioning.md +268 -268
  122. package/augment-extensions/domain-rules/database/README.md +161 -161
  123. package/augment-extensions/domain-rules/database/examples/flat-database-example.md +793 -793
  124. package/augment-extensions/domain-rules/database/examples/hybrid-database-example.md +1132 -1132
  125. package/augment-extensions/domain-rules/database/examples/nosql-document-example.md +868 -868
  126. package/augment-extensions/domain-rules/database/examples/nosql-graph-example.md +805 -805
  127. package/augment-extensions/domain-rules/database/examples/relational-schema-example.md +621 -621
  128. package/augment-extensions/domain-rules/database/examples/vector-database-example.md +965 -965
  129. package/augment-extensions/domain-rules/database/module.json +28 -28
  130. package/augment-extensions/domain-rules/database/rules/flat-databases.md +624 -624
  131. package/augment-extensions/domain-rules/database/rules/nosql-databases.md +588 -588
  132. package/augment-extensions/domain-rules/database/rules/nosql-document-stores.md +856 -856
  133. package/augment-extensions/domain-rules/database/rules/nosql-graph-databases.md +778 -778
  134. package/augment-extensions/domain-rules/database/rules/nosql-key-value-stores.md +963 -963
  135. package/augment-extensions/domain-rules/database/rules/performance-optimization.md +1076 -1076
  136. package/augment-extensions/domain-rules/database/rules/relational-databases.md +697 -697
  137. package/augment-extensions/domain-rules/database/rules/relational-indexing.md +671 -671
  138. package/augment-extensions/domain-rules/database/rules/relational-query-optimization.md +607 -607
  139. package/augment-extensions/domain-rules/database/rules/relational-schema-design.md +907 -907
  140. package/augment-extensions/domain-rules/database/rules/relational-transactions.md +783 -783
  141. package/augment-extensions/domain-rules/database/rules/security-standards.md +980 -980
  142. package/augment-extensions/domain-rules/database/rules/universal-best-practices.md +485 -485
  143. package/augment-extensions/domain-rules/database/rules/vector-databases.md +521 -521
  144. package/augment-extensions/domain-rules/database/rules/vector-embeddings.md +858 -858
  145. package/augment-extensions/domain-rules/database/rules/vector-indexing.md +934 -934
  146. package/augment-extensions/domain-rules/design/color/themes/catppuccin-latte/README.md +23 -23
  147. package/augment-extensions/domain-rules/design/color/themes/catppuccin-latte/module.json +26 -26
  148. package/augment-extensions/domain-rules/design/color/themes/catppuccin-mocha/README.md +23 -23
  149. package/augment-extensions/domain-rules/design/color/themes/catppuccin-mocha/module.json +26 -26
  150. package/augment-extensions/domain-rules/design/color/themes/dracula/README.md +23 -23
  151. package/augment-extensions/domain-rules/design/color/themes/dracula/module.json +26 -26
  152. package/augment-extensions/domain-rules/design/color/themes/gruvbox-dark/README.md +23 -23
  153. package/augment-extensions/domain-rules/design/color/themes/gruvbox-dark/module.json +26 -26
  154. package/augment-extensions/domain-rules/design/color/themes/gruvbox-light/README.md +23 -23
  155. package/augment-extensions/domain-rules/design/color/themes/gruvbox-light/module.json +26 -26
  156. package/augment-extensions/domain-rules/design/color/themes/high-contrast/README.md +27 -27
  157. package/augment-extensions/domain-rules/design/color/themes/high-contrast/module.json +26 -26
  158. package/augment-extensions/domain-rules/design/color/themes/monokai/README.md +23 -23
  159. package/augment-extensions/domain-rules/design/color/themes/monokai/module.json +26 -26
  160. package/augment-extensions/domain-rules/design/color/themes/nord/README.md +23 -23
  161. package/augment-extensions/domain-rules/design/color/themes/nord/module.json +26 -26
  162. package/augment-extensions/domain-rules/design/color/themes/one-dark/README.md +23 -23
  163. package/augment-extensions/domain-rules/design/color/themes/one-dark/module.json +26 -26
  164. package/augment-extensions/domain-rules/design/color/themes/one-light/README.md +23 -23
  165. package/augment-extensions/domain-rules/design/color/themes/one-light/module.json +26 -26
  166. package/augment-extensions/domain-rules/design/color/themes/solarized-dark/README.md +23 -23
  167. package/augment-extensions/domain-rules/design/color/themes/solarized-dark/module.json +26 -26
  168. package/augment-extensions/domain-rules/design/color/themes/solarized-light/README.md +23 -23
  169. package/augment-extensions/domain-rules/design/color/themes/solarized-light/module.json +26 -26
  170. package/augment-extensions/domain-rules/design/color/themes/tokyo-night/README.md +23 -23
  171. package/augment-extensions/domain-rules/design/color/themes/tokyo-night/module.json +26 -26
  172. package/augment-extensions/domain-rules/mcp/README.md +150 -150
  173. package/augment-extensions/domain-rules/mcp/examples/compressed-example.md +522 -522
  174. package/augment-extensions/domain-rules/mcp/examples/graph-augmented-example.md +520 -520
  175. package/augment-extensions/domain-rules/mcp/examples/hybrid-example.md +570 -570
  176. package/augment-extensions/domain-rules/mcp/examples/state-based-example.md +427 -427
  177. package/augment-extensions/domain-rules/mcp/examples/token-based-example.md +435 -435
  178. package/augment-extensions/domain-rules/mcp/examples/vector-based-example.md +502 -502
  179. package/augment-extensions/domain-rules/mcp/module.json +49 -49
  180. package/augment-extensions/domain-rules/mcp/rules/compressed-mcp.md +595 -595
  181. package/augment-extensions/domain-rules/mcp/rules/configuration.md +345 -345
  182. package/augment-extensions/domain-rules/mcp/rules/graph-augmented-mcp.md +687 -687
  183. package/augment-extensions/domain-rules/mcp/rules/hybrid-mcp.md +636 -636
  184. package/augment-extensions/domain-rules/mcp/rules/state-based-mcp.md +484 -484
  185. package/augment-extensions/domain-rules/mcp/rules/testing-validation.md +360 -360
  186. package/augment-extensions/domain-rules/mcp/rules/token-based-mcp.md +393 -393
  187. package/augment-extensions/domain-rules/mcp/rules/universal-rules.md +194 -194
  188. package/augment-extensions/domain-rules/mcp/rules/vector-based-mcp.md +625 -625
  189. package/augment-extensions/domain-rules/security/README.md +41 -41
  190. package/augment-extensions/domain-rules/security/module.json +28 -28
  191. package/augment-extensions/domain-rules/security/rules/authentication-security.md +361 -361
  192. package/augment-extensions/domain-rules/security/rules/encryption.md +208 -208
  193. package/augment-extensions/domain-rules/security/rules/input-validation.md +294 -294
  194. package/augment-extensions/domain-rules/security/rules/owasp-top-10.md +339 -339
  195. package/augment-extensions/domain-rules/security/rules/secure-coding.md +293 -293
  196. package/augment-extensions/domain-rules/security/rules/web-security.md +268 -268
  197. package/augment-extensions/domain-rules/seo-sales-marketing/ANNOUNCEMENT.md +143 -0
  198. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/README.md +140 -136
  199. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/SCHEMA-VALIDATION-REPORT.md +216 -216
  200. package/augment-extensions/domain-rules/seo-sales-marketing/TEST-VALIDATION.md +129 -0
  201. package/augment-extensions/domain-rules/seo-sales-marketing/USAGE-GUIDES.md +254 -0
  202. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/examples/brand-kit-example.yaml +292 -292
  203. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/examples/campaign-brief-example.yaml +389 -389
  204. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/examples/content-calendar-example.yaml +643 -643
  205. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/examples/email-newsletter-example.md +376 -376
  206. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/examples/landing-page-example.md +934 -934
  207. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/examples/ppc-ad-copy-example.md +301 -301
  208. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/examples/seo-blog-post-example.md +347 -347
  209. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/examples/social-media-campaign-example.md +606 -606
  210. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/module.json +50 -50
  211. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/rules/affiliate-influencer-marketing.md +593 -593
  212. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/rules/asset-management.md +418 -418
  213. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/rules/brand-consistency.md +210 -210
  214. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/rules/content-marketing.md +337 -337
  215. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/rules/conversion-optimization.md +455 -455
  216. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/rules/direct-sales.md +499 -499
  217. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/rules/email-marketing.md +439 -439
  218. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/rules/legal-compliance.md +227 -227
  219. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/rules/ppc-advertising.md +569 -569
  220. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/rules/seo-optimization.md +470 -470
  221. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/rules/social-media-marketing.md +414 -414
  222. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/rules/universal-marketing.md +177 -177
  223. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/schemas/asset-inventory.schema.json +247 -247
  224. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/schemas/brand-kit.schema.json +326 -326
  225. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/schemas/campaign-brief.schema.json +342 -342
  226. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/schemas/color-palette.schema.json +223 -223
  227. package/augment-extensions/domain-rules/{marketing-standards/seo-sales-marketing → seo-sales-marketing}/schemas/content-template.schema.json +383 -383
  228. package/augment-extensions/domain-rules/wordpress/README.md +163 -163
  229. package/augment-extensions/domain-rules/wordpress/module.json +32 -32
  230. package/augment-extensions/domain-rules/wordpress/rules/coding-standards.md +617 -617
  231. package/augment-extensions/domain-rules/wordpress/rules/directory-structure.md +270 -270
  232. package/augment-extensions/domain-rules/wordpress/rules/file-patterns.md +423 -423
  233. package/augment-extensions/domain-rules/wordpress/rules/gutenberg-blocks.md +493 -493
  234. package/augment-extensions/domain-rules/wordpress/rules/performance.md +568 -568
  235. package/augment-extensions/domain-rules/wordpress/rules/plugin-development.md +510 -510
  236. package/augment-extensions/domain-rules/wordpress/rules/project-detection.md +251 -251
  237. package/augment-extensions/domain-rules/wordpress/rules/rest-api.md +501 -501
  238. package/augment-extensions/domain-rules/wordpress/rules/security.md +564 -564
  239. package/augment-extensions/domain-rules/wordpress/rules/theme-development.md +388 -388
  240. package/augment-extensions/domain-rules/wordpress/rules/woocommerce.md +441 -441
  241. package/augment-extensions/domain-rules/wordpress-plugin/README.md +139 -139
  242. package/augment-extensions/domain-rules/wordpress-plugin/examples/ajax-plugin.md +1599 -1599
  243. package/augment-extensions/domain-rules/wordpress-plugin/examples/custom-post-type-plugin.md +1727 -1727
  244. package/augment-extensions/domain-rules/wordpress-plugin/examples/gutenberg-block-plugin.md +428 -428
  245. package/augment-extensions/domain-rules/wordpress-plugin/examples/gutenberg-block.md +422 -422
  246. package/augment-extensions/domain-rules/wordpress-plugin/examples/mvc-plugin.md +1623 -1623
  247. package/augment-extensions/domain-rules/wordpress-plugin/examples/object-oriented-plugin.md +1343 -1343
  248. package/augment-extensions/domain-rules/wordpress-plugin/examples/rest-endpoint.md +734 -734
  249. package/augment-extensions/domain-rules/wordpress-plugin/examples/settings-page-plugin.md +1350 -1350
  250. package/augment-extensions/domain-rules/wordpress-plugin/examples/simple-procedural-plugin.md +503 -503
  251. package/augment-extensions/domain-rules/wordpress-plugin/examples/singleton-plugin.md +971 -971
  252. package/augment-extensions/domain-rules/wordpress-plugin/module.json +53 -53
  253. package/augment-extensions/domain-rules/wordpress-plugin/rules/activation-hooks.md +770 -770
  254. package/augment-extensions/domain-rules/wordpress-plugin/rules/admin-interface.md +874 -874
  255. package/augment-extensions/domain-rules/wordpress-plugin/rules/ajax-handlers.md +629 -629
  256. package/augment-extensions/domain-rules/wordpress-plugin/rules/asset-management.md +559 -559
  257. package/augment-extensions/domain-rules/wordpress-plugin/rules/context-providers.md +709 -709
  258. package/augment-extensions/domain-rules/wordpress-plugin/rules/cron-jobs.md +736 -736
  259. package/augment-extensions/domain-rules/wordpress-plugin/rules/database-management.md +1057 -1057
  260. package/augment-extensions/domain-rules/wordpress-plugin/rules/documentation-standards.md +463 -463
  261. package/augment-extensions/domain-rules/wordpress-plugin/rules/frontend-functionality.md +478 -478
  262. package/augment-extensions/domain-rules/wordpress-plugin/rules/gutenberg-blocks.md +818 -818
  263. package/augment-extensions/domain-rules/wordpress-plugin/rules/internationalization.md +416 -416
  264. package/augment-extensions/domain-rules/wordpress-plugin/rules/migration.md +667 -667
  265. package/augment-extensions/domain-rules/wordpress-plugin/rules/performance-optimization.md +878 -878
  266. package/augment-extensions/domain-rules/wordpress-plugin/rules/plugin-architecture.md +693 -693
  267. package/augment-extensions/domain-rules/wordpress-plugin/rules/plugin-structure.md +352 -352
  268. package/augment-extensions/domain-rules/wordpress-plugin/rules/rest-api.md +818 -818
  269. package/augment-extensions/domain-rules/wordpress-plugin/rules/scaffolding-workflow.md +624 -624
  270. package/augment-extensions/domain-rules/wordpress-plugin/rules/security-best-practices.md +866 -866
  271. package/augment-extensions/domain-rules/wordpress-plugin/rules/testing-patterns.md +1165 -1165
  272. package/augment-extensions/domain-rules/wordpress-plugin/rules/testing.md +414 -414
  273. package/augment-extensions/domain-rules/wordpress-plugin/rules/vscode-integration.md +751 -751
  274. package/augment-extensions/domain-rules/wordpress-plugin/rules/woocommerce-integration.md +949 -949
  275. package/augment-extensions/domain-rules/wordpress-plugin/rules/wordpress-org-submission.md +458 -458
  276. package/augment-extensions/examples/design-patterns/README.md +37 -37
  277. package/augment-extensions/examples/design-patterns/examples/behavioral-patterns.md +370 -370
  278. package/augment-extensions/examples/design-patterns/examples/creational-patterns.md +250 -250
  279. package/augment-extensions/examples/design-patterns/examples/structural-patterns.md +264 -264
  280. package/augment-extensions/examples/design-patterns/module.json +27 -27
  281. package/augment-extensions/examples/gutenberg-block-plugin/README.md +101 -101
  282. package/augment-extensions/examples/gutenberg-block-plugin/examples/testimonial-block.md +428 -428
  283. package/augment-extensions/examples/gutenberg-block-plugin/module.json +40 -40
  284. package/augment-extensions/examples/rest-api-plugin/README.md +98 -98
  285. package/augment-extensions/examples/rest-api-plugin/examples/task-manager-api.md +1299 -1299
  286. package/augment-extensions/examples/rest-api-plugin/module.json +40 -40
  287. package/augment-extensions/examples/woocommerce-extension/README.md +98 -98
  288. package/augment-extensions/examples/woocommerce-extension/examples/product-customizer.md +763 -763
  289. package/augment-extensions/examples/woocommerce-extension/module.json +40 -40
  290. package/augment-extensions/workflows/beads/README.md +135 -135
  291. package/augment-extensions/workflows/beads/examples/complete-workflow-example.md +278 -278
  292. package/augment-extensions/workflows/beads/module.json +55 -55
  293. package/augment-extensions/workflows/beads/rules/best-practices.md +398 -398
  294. package/augment-extensions/workflows/beads/rules/file-format.md +327 -327
  295. package/augment-extensions/workflows/beads/rules/manual-setup.md +315 -315
  296. package/augment-extensions/workflows/beads/rules/workflow.md +326 -326
  297. package/augment-extensions/workflows/beads-integration/IMPLEMENTATION-STATUS.md +145 -145
  298. package/augment-extensions/workflows/beads-integration/README.md +143 -143
  299. package/augment-extensions/workflows/beads-integration/config/defaults.json +32 -32
  300. package/augment-extensions/workflows/beads-integration/config/schema.json +140 -140
  301. package/augment-extensions/workflows/beads-integration/examples/basic-task-generation.md +293 -293
  302. package/augment-extensions/workflows/beads-integration/module.json +75 -75
  303. package/augment-extensions/workflows/beads-integration/rules/core-rules.md +219 -219
  304. package/augment-extensions/workflows/beads-integration/rules/effectiveness-standards.md +256 -256
  305. package/augment-extensions/workflows/beads-integration/rules/task-generation.md +607 -607
  306. package/augment-extensions/workflows/database/README.md +195 -195
  307. package/augment-extensions/workflows/database/ai-prompt-testing.md +295 -295
  308. package/augment-extensions/workflows/database/examples/migration-example.md +498 -498
  309. package/augment-extensions/workflows/database/examples/optimization-example.md +496 -496
  310. package/augment-extensions/workflows/database/examples/schema-design-example.md +444 -444
  311. package/augment-extensions/workflows/database/module.json +42 -42
  312. package/augment-extensions/workflows/database/rules/data-migration.md +249 -249
  313. package/augment-extensions/workflows/database/rules/documentation-standards.md +339 -339
  314. package/augment-extensions/workflows/database/rules/migration-workflow.md +352 -352
  315. package/augment-extensions/workflows/database/rules/optimization-workflow.md +435 -435
  316. package/augment-extensions/workflows/database/rules/schema-design-workflow.md +535 -535
  317. package/augment-extensions/workflows/database/rules/testing-patterns.md +305 -305
  318. package/augment-extensions/workflows/database/rules/workflow.md +458 -458
  319. package/augment-extensions/workflows/wordpress-plugin/README.md +232 -232
  320. package/augment-extensions/workflows/wordpress-plugin/ai-prompts.md +839 -839
  321. package/augment-extensions/workflows/wordpress-plugin/bead-decomposition-patterns.md +854 -854
  322. package/augment-extensions/workflows/wordpress-plugin/examples/complete-plugin-example.md +540 -540
  323. package/augment-extensions/workflows/wordpress-plugin/examples/custom-post-type-example.md +1083 -1083
  324. package/augment-extensions/workflows/wordpress-plugin/examples/feature-addition-workflow.md +669 -669
  325. package/augment-extensions/workflows/wordpress-plugin/examples/plugin-creation-workflow.md +597 -597
  326. package/augment-extensions/workflows/wordpress-plugin/examples/secure-form-handler-example.md +925 -925
  327. package/augment-extensions/workflows/wordpress-plugin/examples/security-audit-workflow.md +752 -752
  328. package/augment-extensions/workflows/wordpress-plugin/examples/wordpress-org-submission-workflow.md +773 -773
  329. package/augment-extensions/workflows/wordpress-plugin/module.json +49 -49
  330. package/augment-extensions/workflows/wordpress-plugin/rules/best-practices.md +942 -942
  331. package/augment-extensions/workflows/wordpress-plugin/rules/development-workflow.md +702 -702
  332. package/augment-extensions/workflows/wordpress-plugin/rules/submission-workflow.md +728 -728
  333. package/augment-extensions/workflows/wordpress-plugin/rules/testing-workflow.md +775 -775
  334. package/augment-extensions/writing-standards/screenplay/README.md +339 -300
  335. package/augment-extensions/writing-standards/screenplay/_templates/README.md +121 -121
  336. package/augment-extensions/writing-standards/screenplay/_templates/genre-template.md +153 -153
  337. package/augment-extensions/writing-standards/screenplay/_templates/style-template.md +243 -243
  338. package/augment-extensions/writing-standards/screenplay/_templates/theme-template.md +213 -213
  339. package/augment-extensions/writing-standards/screenplay/examples/aaa-hollywood-scene.fountain +164 -164
  340. package/augment-extensions/writing-standards/screenplay/examples/beat-sheet-example.yaml +95 -95
  341. package/augment-extensions/writing-standards/screenplay/examples/character-profile-example.yaml +116 -116
  342. package/augment-extensions/writing-standards/screenplay/examples/commercial-30sec.fountain +151 -151
  343. package/augment-extensions/writing-standards/screenplay/examples/independent-monologue.fountain +67 -67
  344. package/augment-extensions/writing-standards/screenplay/examples/news-segment.fountain +142 -142
  345. package/augment-extensions/writing-standards/screenplay/examples/plot-outline-example.yaml +184 -184
  346. package/augment-extensions/writing-standards/screenplay/examples/tv-episode-teaser.fountain +204 -204
  347. package/augment-extensions/writing-standards/screenplay/genres/README.md +181 -181
  348. package/augment-extensions/writing-standards/screenplay/genres/examples/.gitkeep +2 -2
  349. package/augment-extensions/writing-standards/screenplay/genres/module.json +70 -70
  350. package/augment-extensions/writing-standards/screenplay/genres/rules/.gitkeep +2 -2
  351. package/augment-extensions/writing-standards/screenplay/genres/rules/action.md +399 -399
  352. package/augment-extensions/writing-standards/screenplay/genres/rules/adventure.md +407 -407
  353. package/augment-extensions/writing-standards/screenplay/genres/rules/animation.md +293 -293
  354. package/augment-extensions/writing-standards/screenplay/genres/rules/biographical.md +293 -293
  355. package/augment-extensions/writing-standards/screenplay/genres/rules/comedy.md +401 -401
  356. package/augment-extensions/writing-standards/screenplay/genres/rules/documentary.md +293 -293
  357. package/augment-extensions/writing-standards/screenplay/genres/rules/drama.md +409 -409
  358. package/augment-extensions/writing-standards/screenplay/genres/rules/fantasy.md +293 -293
  359. package/augment-extensions/writing-standards/screenplay/genres/rules/historical.md +293 -293
  360. package/augment-extensions/writing-standards/screenplay/genres/rules/horror.md +268 -268
  361. package/augment-extensions/writing-standards/screenplay/genres/rules/musical.md +294 -294
  362. package/augment-extensions/writing-standards/screenplay/genres/rules/mystery.md +293 -293
  363. package/augment-extensions/writing-standards/screenplay/genres/rules/noir.md +294 -294
  364. package/augment-extensions/writing-standards/screenplay/genres/rules/romance.md +293 -293
  365. package/augment-extensions/writing-standards/screenplay/genres/rules/sci-fi.md +289 -289
  366. package/augment-extensions/writing-standards/screenplay/genres/rules/superhero.md +293 -293
  367. package/augment-extensions/writing-standards/screenplay/genres/rules/thriller.md +294 -294
  368. package/augment-extensions/writing-standards/screenplay/genres/rules/western.md +293 -293
  369. package/augment-extensions/writing-standards/screenplay/module.json +124 -124
  370. package/augment-extensions/writing-standards/screenplay/rules/aaa-hollywood-films.md +339 -339
  371. package/augment-extensions/writing-standards/screenplay/rules/ai-integration-testing.md +329 -329
  372. package/augment-extensions/writing-standards/screenplay/rules/character-development.md +169 -169
  373. package/augment-extensions/writing-standards/screenplay/rules/commercials.md +437 -437
  374. package/augment-extensions/writing-standards/screenplay/rules/dialogue-writing.md +263 -263
  375. package/augment-extensions/writing-standards/screenplay/rules/diversity-inclusion.md +261 -261
  376. package/augment-extensions/writing-standards/screenplay/rules/examples-guide.md +315 -315
  377. package/augment-extensions/writing-standards/screenplay/rules/file-organization.md +213 -0
  378. package/augment-extensions/writing-standards/screenplay/rules/formatting-validation.md +413 -413
  379. package/augment-extensions/writing-standards/screenplay/rules/fountain-format.md +372 -372
  380. package/augment-extensions/writing-standards/screenplay/rules/independent-films.md +374 -374
  381. package/augment-extensions/writing-standards/screenplay/rules/live-tv-productions.md +443 -443
  382. package/augment-extensions/writing-standards/screenplay/rules/narrative-structures.md +207 -207
  383. package/augment-extensions/writing-standards/screenplay/rules/news-broadcasts.md +444 -444
  384. package/augment-extensions/writing-standards/screenplay/rules/pacing-timing.md +331 -331
  385. package/augment-extensions/writing-standards/screenplay/rules/quality-review-checklist.md +334 -334
  386. package/augment-extensions/writing-standards/screenplay/rules/quick-reference.md +299 -299
  387. package/augment-extensions/writing-standards/screenplay/rules/screen-continuity.md +263 -263
  388. package/augment-extensions/writing-standards/screenplay/rules/streaming-content.md +412 -412
  389. package/augment-extensions/writing-standards/screenplay/rules/trope-management.md +370 -370
  390. package/augment-extensions/writing-standards/screenplay/rules/tv-series.md +374 -374
  391. package/augment-extensions/writing-standards/screenplay/rules/universal-formatting.md +339 -339
  392. package/augment-extensions/writing-standards/screenplay/rules/vscode-integration.md +277 -277
  393. package/augment-extensions/writing-standards/screenplay/rules/web-content.md +393 -393
  394. package/augment-extensions/writing-standards/screenplay/schemas/beat-sheet.json +332 -332
  395. package/augment-extensions/writing-standards/screenplay/schemas/character-profile.json +247 -247
  396. package/augment-extensions/writing-standards/screenplay/schemas/feature-selection.json +200 -200
  397. package/augment-extensions/writing-standards/screenplay/schemas/plot-outline.json +233 -233
  398. package/augment-extensions/writing-standards/screenplay/schemas/screenplay-config.json +245 -245
  399. package/augment-extensions/writing-standards/screenplay/schemas/trope-inventory.json +221 -221
  400. package/augment-extensions/writing-standards/screenplay/styles/README.md +159 -159
  401. package/augment-extensions/writing-standards/screenplay/styles/examples/.gitkeep +2 -2
  402. package/augment-extensions/writing-standards/screenplay/styles/examples/style-applications.md +1449 -1449
  403. package/augment-extensions/writing-standards/screenplay/styles/module.json +64 -64
  404. package/augment-extensions/writing-standards/screenplay/styles/rules/.gitkeep +2 -2
  405. package/augment-extensions/writing-standards/screenplay/styles/rules/dialogue-centric.md +520 -520
  406. package/augment-extensions/writing-standards/screenplay/styles/rules/ensemble.md +499 -499
  407. package/augment-extensions/writing-standards/screenplay/styles/rules/epic.md +497 -497
  408. package/augment-extensions/writing-standards/screenplay/styles/rules/experimental.md +492 -492
  409. package/augment-extensions/writing-standards/screenplay/styles/rules/flashback.md +509 -509
  410. package/augment-extensions/writing-standards/screenplay/styles/rules/linear.md +490 -490
  411. package/augment-extensions/writing-standards/screenplay/styles/rules/minimalist.md +499 -499
  412. package/augment-extensions/writing-standards/screenplay/styles/rules/non-linear.md +501 -501
  413. package/augment-extensions/writing-standards/screenplay/styles/rules/poetic.md +499 -499
  414. package/augment-extensions/writing-standards/screenplay/styles/rules/realistic.md +498 -498
  415. package/augment-extensions/writing-standards/screenplay/styles/rules/satirical.md +499 -499
  416. package/augment-extensions/writing-standards/screenplay/styles/rules/surreal.md +508 -508
  417. package/augment-extensions/writing-standards/screenplay/styles/rules/voice-over.md +500 -500
  418. package/augment-extensions/writing-standards/screenplay/themes/README.md +158 -158
  419. package/augment-extensions/writing-standards/screenplay/themes/examples/.gitkeep +2 -2
  420. package/augment-extensions/writing-standards/screenplay/themes/examples/common-mistakes-and-fixes.md +643 -643
  421. package/augment-extensions/writing-standards/screenplay/themes/examples/complete-scene-example.md +311 -311
  422. package/augment-extensions/writing-standards/screenplay/themes/examples/individual-theme-examples.md +562 -562
  423. package/augment-extensions/writing-standards/screenplay/themes/examples/multi-theme-weaving.md +538 -538
  424. package/augment-extensions/writing-standards/screenplay/themes/examples/theme-application-guide.md +432 -432
  425. package/augment-extensions/writing-standards/screenplay/themes/examples/theme-integration-across-acts.md +637 -637
  426. package/augment-extensions/writing-standards/screenplay/themes/module.json +66 -66
  427. package/augment-extensions/writing-standards/screenplay/themes/rules/.gitkeep +2 -2
  428. package/augment-extensions/writing-standards/screenplay/themes/rules/ambition.md +458 -458
  429. package/augment-extensions/writing-standards/screenplay/themes/rules/betrayal.md +490 -490
  430. package/augment-extensions/writing-standards/screenplay/themes/rules/environment.md +458 -458
  431. package/augment-extensions/writing-standards/screenplay/themes/rules/fate.md +459 -459
  432. package/augment-extensions/writing-standards/screenplay/themes/rules/friendship.md +491 -491
  433. package/augment-extensions/writing-standards/screenplay/themes/rules/growth.md +491 -491
  434. package/augment-extensions/writing-standards/screenplay/themes/rules/identity.md +490 -490
  435. package/augment-extensions/writing-standards/screenplay/themes/rules/isolation.md +464 -464
  436. package/augment-extensions/writing-standards/screenplay/themes/rules/justice.md +461 -461
  437. package/augment-extensions/writing-standards/screenplay/themes/rules/love.md +489 -489
  438. package/augment-extensions/writing-standards/screenplay/themes/rules/power.md +494 -494
  439. package/augment-extensions/writing-standards/screenplay/themes/rules/redemption.md +483 -483
  440. package/augment-extensions/writing-standards/screenplay/themes/rules/revenge.md +489 -489
  441. package/augment-extensions/writing-standards/screenplay/themes/rules/survival.md +496 -496
  442. package/augment-extensions/writing-standards/screenplay/themes/rules/technology.md +463 -463
  443. package/augment-extensions/writing-standards/screenplay/utils/__tests__/file-organization.test.ts +169 -0
  444. package/augment-extensions/writing-standards/screenplay/utils/file-organization.ts +165 -0
  445. package/cli/MODULES.md +302 -302
  446. package/cli/dist/cli.js +113 -22
  447. package/cli/dist/cli.js.map +1 -1
  448. package/cli/dist/commands/gui.d.ts.map +1 -1
  449. package/cli/dist/commands/gui.js +54 -6
  450. package/cli/dist/commands/gui.js.map +1 -1
  451. package/cli/dist/commands/init.d.ts.map +1 -1
  452. package/cli/dist/commands/init.js +76 -23
  453. package/cli/dist/commands/init.js.map +1 -1
  454. package/cli/dist/commands/self-remove.d.ts.map +1 -1
  455. package/cli/dist/commands/self-remove.js +48 -74
  456. package/cli/dist/commands/self-remove.js.map +1 -1
  457. package/cli/dist/commands/show.d.ts +15 -0
  458. package/cli/dist/commands/show.d.ts.map +1 -1
  459. package/cli/dist/commands/show.js +576 -23
  460. package/cli/dist/commands/show.js.map +1 -1
  461. package/cli/dist/commands/showCompleted.d.ts +21 -0
  462. package/cli/dist/commands/showCompleted.d.ts.map +1 -0
  463. package/cli/dist/commands/showCompleted.js +225 -0
  464. package/cli/dist/commands/showCompleted.js.map +1 -0
  465. package/cli/dist/commands/skill.js +88 -88
  466. package/cli/dist/commands/update.d.ts +2 -0
  467. package/cli/dist/commands/update.d.ts.map +1 -1
  468. package/cli/dist/commands/update.js +67 -1
  469. package/cli/dist/commands/update.js.map +1 -1
  470. package/cli/dist/utils/beadsCompletedChecker.d.ts +72 -0
  471. package/cli/dist/utils/beadsCompletedChecker.d.ts.map +1 -0
  472. package/cli/dist/utils/beadsCompletedChecker.js +198 -0
  473. package/cli/dist/utils/beadsCompletedChecker.js.map +1 -0
  474. package/cli/dist/utils/catalog-sync.js +13 -13
  475. package/cli/dist/utils/config-system.d.ts +111 -0
  476. package/cli/dist/utils/config-system.d.ts.map +1 -0
  477. package/cli/dist/utils/config-system.js +239 -0
  478. package/cli/dist/utils/config-system.js.map +1 -0
  479. package/cli/dist/utils/extractCommandHelp.d.ts +51 -0
  480. package/cli/dist/utils/extractCommandHelp.d.ts.map +1 -0
  481. package/cli/dist/utils/extractCommandHelp.js +250 -0
  482. package/cli/dist/utils/extractCommandHelp.js.map +1 -0
  483. package/cli/dist/utils/hook-system.d.ts +84 -0
  484. package/cli/dist/utils/hook-system.d.ts.map +1 -0
  485. package/cli/dist/utils/hook-system.js +151 -0
  486. package/cli/dist/utils/hook-system.js.map +1 -0
  487. package/cli/dist/utils/inspection-cache.d.ts +56 -0
  488. package/cli/dist/utils/inspection-cache.d.ts.map +1 -0
  489. package/cli/dist/utils/inspection-cache.js +166 -0
  490. package/cli/dist/utils/inspection-cache.js.map +1 -0
  491. package/cli/dist/utils/inspection-handlers.d.ts +75 -0
  492. package/cli/dist/utils/inspection-handlers.d.ts.map +1 -0
  493. package/cli/dist/utils/inspection-handlers.js +171 -0
  494. package/cli/dist/utils/inspection-handlers.js.map +1 -0
  495. package/cli/dist/utils/install-rules.js +55 -55
  496. package/cli/dist/utils/mcp-integration.js +44 -44
  497. package/cli/dist/utils/module-system.d.ts +1 -0
  498. package/cli/dist/utils/module-system.d.ts.map +1 -1
  499. package/cli/dist/utils/module-system.js +8 -3
  500. package/cli/dist/utils/module-system.js.map +1 -1
  501. package/cli/dist/utils/plugin-system.d.ts +133 -0
  502. package/cli/dist/utils/plugin-system.d.ts.map +1 -0
  503. package/cli/dist/utils/plugin-system.js +210 -0
  504. package/cli/dist/utils/plugin-system.js.map +1 -0
  505. package/cli/dist/utils/progress.d.ts +67 -0
  506. package/cli/dist/utils/progress.d.ts.map +1 -0
  507. package/cli/dist/utils/progress.js +146 -0
  508. package/cli/dist/utils/progress.js.map +1 -0
  509. package/cli/dist/utils/rule-install-hooks.js +8 -8
  510. package/cli/dist/utils/stream-reader.d.ts +34 -0
  511. package/cli/dist/utils/stream-reader.d.ts.map +1 -0
  512. package/cli/dist/utils/stream-reader.js +147 -0
  513. package/cli/dist/utils/stream-reader.js.map +1 -0
  514. package/cli/dist/utils/vscode-editor.d.ts +45 -0
  515. package/cli/dist/utils/vscode-editor.d.ts.map +1 -0
  516. package/cli/dist/utils/vscode-editor.js +171 -0
  517. package/cli/dist/utils/vscode-editor.js.map +1 -0
  518. package/cli/dist/utils/vscode-links.d.ts +49 -0
  519. package/cli/dist/utils/vscode-links.d.ts.map +1 -0
  520. package/cli/dist/utils/vscode-links.js +167 -0
  521. package/cli/dist/utils/vscode-links.js.map +1 -0
  522. package/modules.md +667 -630
  523. package/package.json +85 -85
@@ -1,965 +1,965 @@
1
- # Vector Database Example: Semantic Search Application
2
-
3
- ## Overview
4
-
5
- This example demonstrates a complete semantic search application using vector databases. It covers:
6
- - Document ingestion and preprocessing
7
- - Embedding generation with OpenAI
8
- - Vector storage in Pinecone and Weaviate
9
- - Similarity search
10
- - Hybrid search (vector + keyword)
11
- - Metadata filtering
12
- - Sample queries with explanations
13
-
14
- **Use Case**: Knowledge base search for technical documentation
15
-
16
- **Tech Stack:**
17
- - **Vector Database**: Pinecone (managed) or Weaviate (self-hosted)
18
- - **Embedding Model**: OpenAI `text-embedding-3-small`
19
- - **Language**: Python
20
- - **Framework**: LangChain (optional, for RAG)
21
-
22
- ---
23
-
24
- ## Architecture
25
-
26
- ```
27
- Documents (PDF, Markdown, HTML)
28
- ↓
29
- Document Loader & Chunker
30
- ↓
31
- Embedding Generator (OpenAI)
32
- ↓
33
- Vector Database (Pinecone/Weaviate)
34
- ↓
35
- Search API (Similarity + Hybrid)
36
- ↓
37
- Results (Ranked by relevance)
38
- ```
39
-
40
- ---
41
-
42
- ## Setup
43
-
44
- ### Install Dependencies
45
-
46
- ```bash
47
- pip install "openai>=1.0" "pinecone-client<3" "weaviate-client<4" tiktoken langchain python-dotenv PyPDF2
48
- ```
49
-
50
- ### Environment Variables
51
-
52
- ```bash
53
- # .env file
54
- OPENAI_API_KEY=your-openai-api-key
55
- PINECONE_API_KEY=your-pinecone-api-key
56
- PINECONE_ENVIRONMENT=us-west1-gcp
57
- WEAVIATE_URL=http://localhost:8080
58
- ```
59
-
60
- ---
61
-
62
- ## Step 1: Document Ingestion
63
-
64
- ### Load Documents
65
-
66
- ```python
67
- import os
68
- from pathlib import Path
69
- from typing import List, Dict
70
-
71
- class DocumentLoader:
72
- """Load documents from various sources"""
73
-
74
- def load_markdown_files(self, directory: str) -> List[Dict]:
75
- """Load all markdown files from directory"""
76
- documents = []
77
-
78
- for file_path in Path(directory).rglob("*.md"):
79
- with open(file_path, 'r', encoding='utf-8') as f:
80
- content = f.read()
81
-
82
- documents.append({
83
- "id": str(file_path),
84
- "text": content,
85
- "metadata": {
86
- "source": str(file_path),
87
- "filename": file_path.name,
88
- "type": "markdown"
89
- }
90
- })
91
-
92
- return documents
93
-
94
- def load_pdf_files(self, directory: str) -> List[Dict]:
95
- """Load PDF files (requires PyPDF2)"""
96
- import PyPDF2
97
- documents = []
98
-
99
- for file_path in Path(directory).rglob("*.pdf"):
100
- with open(file_path, 'rb') as f:
101
- pdf_reader = PyPDF2.PdfReader(f)
102
- text = ""
103
-
104
- for page in pdf_reader.pages:
105
- text += page.extract_text()
106
-
107
- documents.append({
108
- "id": str(file_path),
109
- "text": text,
110
- "metadata": {
111
- "source": str(file_path),
112
- "filename": file_path.name,
113
- "type": "pdf",
114
- "pages": len(pdf_reader.pages)
115
- }
116
- })
117
-
118
- return documents
119
-
120
- # Usage
121
- loader = DocumentLoader()
122
- documents = loader.load_markdown_files("./docs")
123
- print(f"Loaded {len(documents)} documents")
124
- ```
125
-
126
- ### Chunk Documents
127
-
128
- ```python
129
- import tiktoken
130
-
131
- class DocumentChunker:
132
- """Chunk documents into smaller pieces"""
133
-
134
- def __init__(self, chunk_size: int = 512, overlap: int = 50):
135
- self.chunk_size = chunk_size
136
- self.overlap = overlap
137
- self.tokenizer = tiktoken.get_encoding("cl100k_base")
138
-
139
- def chunk_by_tokens(self, text: str) -> List[str]:
140
- """Split text into chunks by token count"""
141
- tokens = self.tokenizer.encode(text)
142
- chunks = []
143
-
144
- for i in range(0, len(tokens), self.chunk_size - self.overlap):
145
- chunk_tokens = tokens[i:i + self.chunk_size]
146
- chunk_text = self.tokenizer.decode(chunk_tokens)
147
- chunks.append(chunk_text)
148
-
149
- return chunks
150
-
151
- def chunk_documents(self, documents: List[Dict]) -> List[Dict]:
152
- """Chunk all documents"""
153
- chunked_docs = []
154
-
155
- for doc in documents:
156
- chunks = self.chunk_by_tokens(doc["text"])
157
-
158
- for i, chunk in enumerate(chunks):
159
- chunked_docs.append({
160
- "id": f"{doc['id']}_chunk_{i}",
161
- "text": chunk,
162
- "metadata": {
163
- **doc["metadata"],
164
- "chunk_index": i,
165
- "total_chunks": len(chunks),
166
- "parent_id": doc["id"]
167
- }
168
- })
169
-
170
- return chunked_docs
171
-
172
- # Usage
173
- chunker = DocumentChunker(chunk_size=512, overlap=50)
174
- chunked_documents = chunker.chunk_documents(documents)
175
- print(f"Created {len(chunked_documents)} chunks from {len(documents)} documents")
176
- ```
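
The overlap logic in `chunk_by_tokens` is easier to inspect on a toy input. The following is a minimal sketch, not the tutorial's own code: it assumes plain integers stand in for token IDs, and the helper name `sliding_window_chunks` is hypothetical. It also adds a guard the original lacks, because an overlap equal to or larger than the chunk size would make the step zero or negative.

```python
from typing import List


def sliding_window_chunks(tokens: List[int], chunk_size: int, overlap: int) -> List[List[int]]:
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]


# Integers stand in for token IDs (an assumption for illustration only).
toy_tokens = list(range(10))
chunks = sliding_window_chunks(toy_tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

With these inputs the final window is `[9]`, a short tail that contains only the overlapping token. Whether such tails are worth embedding is a design choice; some pipelines may prefer to merge them into the previous chunk.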
177
-
178
- ---
179
-
180
- ## Step 2: Embedding Generation
181
-
182
- ### Generate Embeddings with OpenAI
183
-
184
- ```python
185
- from openai import OpenAI
186
- from typing import List
187
- import time
188
-
189
- class EmbeddingGenerator:
190
- """Generate embeddings using OpenAI"""
191
-
192
- def __init__(self, api_key: str, model: str = "text-embedding-3-small"):
193
- self.client = OpenAI(api_key=api_key)
194
- self.model = model
195
-
196
- def generate_embedding(self, text: str) -> List[float]:
197
- """Generate embedding for a single text"""
198
- response = self.client.embeddings.create(
199
- model=self.model,
200
- input=text
201
- )
202
- return response.data[0].embedding
203
-
204
- def generate_embeddings_batch(self, texts: List[str], batch_size: int = 100) -> List[List[float]]:
205
- """Generate embeddings in batches"""
206
- embeddings = []
207
-
208
- for i in range(0, len(texts), batch_size):
209
- batch = texts[i:i + batch_size]
210
-
211
- # Rate limiting: wait if needed
212
- time.sleep(0.1)
213
-
214
- response = self.client.embeddings.create(
215
- model=self.model,
216
- input=batch
217
- )
218
-
219
- batch_embeddings = [item.embedding for item in response.data]
220
- embeddings.extend(batch_embeddings)
221
-
222
- print(f"Generated embeddings for {len(embeddings)}/{len(texts)} texts")
223
-
224
- return embeddings
225
-
226
- def embed_documents(self, documents: List[Dict]) -> List[Dict]:
227
- """Add embeddings to documents"""
228
- texts = [doc["text"] for doc in documents]
229
- embeddings = self.generate_embeddings_batch(texts)
230
-
231
- for doc, embedding in zip(documents, embeddings):
232
- doc["embedding"] = embedding
233
-
234
- return documents
235
-
236
- # Usage
237
- import os
238
- from dotenv import load_dotenv
239
-
240
- load_dotenv()
241
-
242
- generator = EmbeddingGenerator(api_key=os.getenv("OPENAI_API_KEY"))
243
- embedded_documents = generator.embed_documents(chunked_documents)
244
- print(f"Generated embeddings for {len(embedded_documents)} chunks")
245
- ```
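
The fixed `time.sleep(0.1)` above spaces out requests but does not recover from a failed call. One possible complement is a retry wrapper with exponential backoff. This is a sketch under stated assumptions: `call_with_backoff` and `flaky_embed` are hypothetical names, the wrapper retries on any exception, and the failing call is simulated rather than sent to a real API.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    raise RuntimeError("unreachable")


# Simulated embedding call that fails twice before succeeding.
attempts = {"count": 0}


def flaky_embed() -> List[float]:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return [0.1, 0.2, 0.3]


delays: List[float] = []
result = call_with_backoff(flaky_embed, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

In real use the `except` clause would likely be narrowed to the client's rate-limit and connection errors, so that programming mistakes are not retried.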
246
-
247
- ---
248
-
249
- ## Step 3: Vector Storage
250
-
251
- ### Option 1: Pinecone (Managed)
252
-
253
- ```python
254
- import pinecone
255
- from typing import List, Dict
256
-
257
- class PineconeVectorStore:
258
- """Store and search vectors in Pinecone"""
259
-
260
- def __init__(self, api_key: str, environment: str, index_name: str):
261
- pinecone.init(api_key=api_key, environment=environment)
262
- self.index_name = index_name
263
- self.index = None
264
-
265
- def create_index(self, dimension: int = 1536):
266
- """Create Pinecone index"""
267
- if self.index_name not in pinecone.list_indexes():
268
- pinecone.create_index(
269
- name=self.index_name,
270
- dimension=dimension,
271
- metric="cosine",
272
- metadata_config={"indexed": ["source", "type", "filename"]}
273
- )
274
-
275
- self.index = pinecone.Index(self.index_name)
276
- print(f"Created/connected to index: {self.index_name}")
277
-
278
- def upsert_documents(self, documents: List[Dict], batch_size: int = 100):
279
- """Upsert documents to Pinecone"""
280
- for i in range(0, len(documents), batch_size):
281
- batch = documents[i:i + batch_size]
282
-
283
- # Prepare vectors for upsert
284
- vectors = [
285
- (
286
- doc["id"],
287
- doc["embedding"],
288
- {
289
- "text": doc["text"],
290
- **doc["metadata"]
291
- }
292
- )
293
- for doc in batch
294
- ]
295
-
296
- self.index.upsert(vectors=vectors)
297
- print(f"Upserted {min(i + batch_size, len(documents))}/{len(documents)} documents")
298
-
299
- def search(self, query_embedding: List[float], top_k: int = 10, filter: Dict = None):
300
- """Search for similar vectors"""
301
- results = self.index.query(
302
- vector=query_embedding,
303
- top_k=top_k,
304
- filter=filter,
305
- include_metadata=True
306
- )
307
-
308
- return [
309
- {
310
- "id": match.id,
311
- "score": match.score,
312
- "text": match.metadata.get("text", ""),
313
- "metadata": match.metadata
314
- }
315
- for match in results.matches
316
- ]
317
-
318
- # Usage
319
- pinecone_store = PineconeVectorStore(
320
- api_key=os.getenv("PINECONE_API_KEY"),
321
- environment=os.getenv("PINECONE_ENVIRONMENT"),
322
- index_name="knowledge-base"
323
- )
324
-
325
- pinecone_store.create_index(dimension=1536)
326
- pinecone_store.upsert_documents(embedded_documents)
327
- ```
328
-
329
- ### Option 2: Weaviate (Self-Hosted)
330
-
331
- ```python
332
- import weaviate
333
- from typing import List, Dict
334
-
335
- class WeaviateVectorStore:
336
- """Store and search vectors in Weaviate"""
337
-
338
- def __init__(self, url: str):
339
- self.client = weaviate.Client(url)
340
- self.class_name = "Document"
341
-
342
- def create_schema(self):
343
- """Create Weaviate schema"""
344
- schema = {
345
- "class": self.class_name,
346
- "description": "Technical documentation chunks",
347
- "vectorizer": "none", # We provide our own vectors
348
- "properties": [
349
- {
350
- "name": "text",
351
- "dataType": ["text"],
352
- "description": "Document text content"
353
- },
354
- {
355
- "name": "source",
356
- "dataType": ["string"],
357
- "description": "Source file path"
358
- },
359
- {
360
- "name": "filename",
361
- "dataType": ["string"],
362
- "description": "File name"
363
- },
364
- {
365
- "name": "type",
366
- "dataType": ["string"],
367
- "description": "Document type (markdown, pdf, etc.)"
368
- },
369
- {
370
- "name": "chunk_index",
371
- "dataType": ["int"],
372
- "description": "Chunk index"
373
- },
374
- {
375
- "name": "parent_id",
376
- "dataType": ["string"],
377
- "description": "Parent document ID"
378
- }
379
- ]
380
- }
381
-
382
- # Delete class if exists
383
- if self.client.schema.exists(self.class_name):
384
- self.client.schema.delete_class(self.class_name)
385
-
386
- self.client.schema.create_class(schema)
387
- print(f"Created schema for class: {self.class_name}")
388
-
389
- def upsert_documents(self, documents: List[Dict], batch_size: int = 100):
390
- """Upsert documents to Weaviate"""
391
- with self.client.batch as batch:
392
- batch.batch_size = batch_size
393
-
394
- for i, doc in enumerate(documents):
395
- properties = {
396
- "text": doc["text"],
397
- "source": doc["metadata"].get("source", ""),
398
- "filename": doc["metadata"].get("filename", ""),
399
- "type": doc["metadata"].get("type", ""),
400
- "chunk_index": doc["metadata"].get("chunk_index", 0),
401
- "parent_id": doc["metadata"].get("parent_id", "")
402
- }
403
-
404
- batch.add_data_object(
405
- data_object=properties,
406
- class_name=self.class_name,
407
- vector=doc["embedding"],
408
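- uuid=weaviate.util.generate_uuid5(doc["id"])  # Weaviate requires a valid UUID; derive one deterministically from the chunk ID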
409
- )
410
-
411
- if (i + 1) % 100 == 0:
412
- print(f"Upserted {i + 1}/{len(documents)} documents")
413
-
414
- def search(self, query_embedding: List[float], top_k: int = 10, where_filter: Dict = None):
415
- """Search for similar vectors"""
416
- query = self.client.query.get(
417
- self.class_name,
418
- ["text", "source", "filename", "type", "chunk_index"]
419
- ).with_near_vector({
420
- "vector": query_embedding
421
- }).with_limit(top_k)
422
-
423
- if where_filter:
424
- query = query.with_where(where_filter)
425
-
426
- results = query.do()
427
-
428
- return [
429
- {
430
- "text": item["text"],
431
- "metadata": {
432
- "source": item.get("source", ""),
433
- "filename": item.get("filename", ""),
434
- "type": item.get("type", ""),
435
- "chunk_index": item.get("chunk_index", 0)
436
- }
437
- }
438
- for item in results["data"]["Get"][self.class_name]
439
- ]
440
-
441
- # Usage
442
- weaviate_store = WeaviateVectorStore(url=os.getenv("WEAVIATE_URL"))
443
- weaviate_store.create_schema()
444
- weaviate_store.upsert_documents(embedded_documents)
445
- ```
446
-
447
- ---
448
-
449
- ## Step 4: Similarity Search
450
-
451
- ### Basic Similarity Search
452
-
453
- ```python
454
- class SemanticSearchEngine:
455
- """Semantic search engine using vector database"""
456
-
457
- def __init__(self, vector_store, embedding_generator):
458
- self.vector_store = vector_store
459
- self.embedding_generator = embedding_generator
460
-
461
- def search(self, query: str, top_k: int = 10) -> List[Dict]:
462
- """Search for documents similar to query"""
463
- # Generate query embedding
464
- query_embedding = self.embedding_generator.generate_embedding(query)
465
-
466
- # Search vector database
467
- results = self.vector_store.search(query_embedding, top_k=top_k)
468
-
469
- return results
470
-
471
- def format_results(self, results: List[Dict]) -> str:
472
- """Format search results for display"""
473
- output = []
474
-
475
- for i, result in enumerate(results, 1):
476
- output.append(f"\n--- Result {i} (Score: {format(result['score'], '.4f') if 'score' in result else 'N/A'}) ---")
477
- output.append(f"Source: {result['metadata'].get('source', 'Unknown')}")
478
- output.append(f"Text: {result['text'][:200]}...")
479
-
480
- return "\n".join(output)
481
-
482
- # Usage
483
- search_engine = SemanticSearchEngine(pinecone_store, generator)
484
-
485
- # Example query
486
- query = "How do I configure database indexing?"
487
- results = search_engine.search(query, top_k=5)
488
- print(search_engine.format_results(results))
489
- ```
490
-
491
- **Example Output:**
492
- ```
493
- --- Result 1 (Score: 0.8923) ---
494
- Source: docs/database/indexing.md
495
- Text: Database indexing is crucial for query performance. To configure indexes,
496
- you should first identify frequently queried columns. Create indexes using the
497
- CREATE INDEX statement...
498
-
499
- --- Result 2 (Score: 0.8654) ---
500
- Source: docs/database/performance.md
501
- Text: Index configuration affects query speed significantly. Best practices include
502
- creating composite indexes for multi-column queries and monitoring index usage...
503
- ```
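
The scores in the example output come from the `cosine` metric chosen when the index was created. As a rough illustration of what that number measures, here is a minimal pure-Python sketch; the function name `cosine_similarity` and the tiny two-dimensional vectors are assumptions for illustration, whereas real embeddings have many more dimensions and are compared inside the vector database.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Because the measure depends only on direction, scaling a vector does not change its score, which is why `[1.0, 2.0]` and `[2.0, 4.0]` compare as essentially identical.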
504
-
505
- ---
506
-
507
- ## Step 5: Hybrid Search (Vector + Keyword)
508
-
509
- ### Hybrid Search with Metadata Filtering
510
-
511
- ```python
512
- class HybridSearchEngine:
513
- """Hybrid search combining vector similarity and metadata filtering"""
514
-
515
- def __init__(self, vector_store, embedding_generator):
516
- self.vector_store = vector_store
517
- self.embedding_generator = embedding_generator
518
-
519
- def search_with_filter(
520
- self,
521
- query: str,
522
- top_k: int = 10,
523
- document_type: str = None,
524
- source_pattern: str = None
525
- ) -> List[Dict]:
526
- """Search with metadata filtering"""
527
- # Generate query embedding
528
- query_embedding = self.embedding_generator.generate_embedding(query)
529
-
530
- # Build filter
531
- filter_dict = {}
532
- if document_type:
533
- filter_dict["type"] = {"$eq": document_type}
534
- if source_pattern:
535
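- filter_dict["source"] = {"$regex": source_pattern}  # Caution: "$regex" is not among Pinecone's metadata filter operators ($eq, $ne, $gt, $gte, $lt, $lte, $in, $nin); verify that your store supports it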
536
-
537
- # Search with filter
538
- results = self.vector_store.search(
539
- query_embedding,
540
- top_k=top_k,
541
- filter=filter_dict if filter_dict else None
542
- )
543
-
544
- return results
545
-
546
- def search_by_category(self, query: str, category: str, top_k: int = 10) -> List[Dict]:
547
- """Search within a specific category"""
548
- return self.search_with_filter(
549
- query=query,
550
- top_k=top_k,
551
- source_pattern=f".*{category}.*"
552
- )
553
-
554
- # Usage
555
- hybrid_engine = HybridSearchEngine(pinecone_store, generator)
556
-
557
- # Search only in database documentation
558
- results = hybrid_engine.search_by_category(
559
- query="How to optimize queries?",
560
- category="database",
561
- top_k=5
562
- )
563
-
564
- # Search only markdown files
565
- results = hybrid_engine.search_with_filter(
566
- query="API authentication",
567
- document_type="markdown",
568
- top_k=5
569
- )
570
- ```
571
-
572
- ### Hybrid Search with Keyword Boosting
573
-
574
- ```python
575
- from typing import List, Dict, Set
576
-
577
- class KeywordBoostingSearch:
578
- """Hybrid search with keyword boosting"""
579
-
580
- def __init__(self, vector_store, embedding_generator):
581
- self.vector_store = vector_store
582
- self.embedding_generator = embedding_generator
583
-
584
- def extract_keywords(self, query: str) -> Set[str]:
585
- """Extract important keywords from query"""
586
- # Simple keyword extraction (can use NLP libraries for better results)
587
- stopwords = {"the", "a", "an", "in", "on", "at", "to", "for", "of", "and", "or"}
588
- words = query.lower().split()
589
- keywords = {word for word in words if word not in stopwords and len(word) > 3}
590
- return keywords
591
-
592
- def keyword_match_score(self, text: str, keywords: Set[str]) -> float:
593
- """Calculate keyword match score"""
594
- text_lower = text.lower()
595
- matches = sum(1 for keyword in keywords if keyword in text_lower)
596
- return matches / len(keywords) if keywords else 0
597
-
598
- def hybrid_search(
599
- self,
600
- query: str,
601
- top_k: int = 10,
602
- vector_weight: float = 0.7,
603
- keyword_weight: float = 0.3
604
- ) -> List[Dict]:
605
- """Hybrid search with weighted scoring"""
606
- # Extract keywords
607
- keywords = self.extract_keywords(query)
608
-
609
- # Vector search (get more results for re-ranking)
610
- query_embedding = self.embedding_generator.generate_embedding(query)
611
- vector_results = self.vector_store.search(query_embedding, top_k=top_k * 2)
612
-
613
- # Re-rank with keyword boosting
614
- for result in vector_results:
615
- vector_score = result.get("score", 0)
616
- keyword_score = self.keyword_match_score(result["text"], keywords)
617
-
618
- # Combined score
619
- result["combined_score"] = (
620
- vector_weight * vector_score +
621
- keyword_weight * keyword_score
622
- )
623
-
624
- # Sort by combined score
625
- vector_results.sort(key=lambda x: x["combined_score"], reverse=True)
626
-
627
- return vector_results[:top_k]
628
-
629
- # Usage
630
- keyword_search = KeywordBoostingSearch(pinecone_store, generator)
631
-
632
- results = keyword_search.hybrid_search(
633
- query="database indexing performance optimization",
634
- top_k=5,
635
- vector_weight=0.7,
636
- keyword_weight=0.3
637
- )
638
-
639
- for result in results:
640
- print(f"Combined Score: {result['combined_score']:.4f}")
641
- print(f"Text: {result['text'][:100]}...\n")
642
- ```
643
-
644
- ---
645
-
646
- ## Step 6: Sample Queries with Explanations
647
-
648
- ### Query 1: Basic Semantic Search
649
-
650
- ```python
651
- # Query: "How do I set up authentication?"
652
- query = "How do I set up authentication?"
653
- results = search_engine.search(query, top_k=3)
654
-
655
- # Explanation:
656
- # - Converts query to embedding vector
657
- # - Finds documents with similar embeddings (semantic similarity)
658
- # - Returns top 3 most similar documents
659
- # - Will find documents about "authentication setup", "auth configuration", etc.
660
- # even if they don't use exact words "set up authentication"
661
- ```
662
-
663
- **Why it works:**
664
- - Semantic understanding: "set up" ≈ "configure" ≈ "initialize"
665
- - Finds conceptually similar content, not just keyword matches
666
-
667
- ### Query 2: Filtered Search
668
-
669
- ```python
670
- # Query: "API rate limiting" (only in API documentation)
671
- query = "API rate limiting"
672
- results = hybrid_engine.search_with_filter(
673
- query=query,
674
- source_pattern=".*api.*",
675
- top_k=5
676
- )
677
-
678
- # Explanation:
679
- # - Semantic search for "API rate limiting"
680
- # - Filters results to only include documents from API docs
681
- # - Combines vector similarity with metadata filtering
682
- # - More precise results than pure semantic search
683
- ```
684
-
685
- **Why it works:**
686
- - Narrows search scope to relevant documentation section
687
- - Reduces noise from unrelated documents
688
- - Can speed up search when the vector store applies the filter before comparing vectors
689
-
690
- ### Query 3: Hybrid Search with Keyword Boosting
691
-
692
- ```python
693
- # Query: "PostgreSQL index optimization"
694
- query = "PostgreSQL index optimization"
695
- results = keyword_search.hybrid_search(
696
- query=query,
697
- top_k=5,
698
- vector_weight=0.6,
699
- keyword_weight=0.4
700
- )
701
-
702
- # Explanation:
703
- # - Semantic search finds conceptually similar documents
704
- # - Keyword matching boosts documents containing "PostgreSQL", "index", "optimization"
705
- # - Weighted combination: 60% semantic similarity, 40% keyword match
706
- # - Balances semantic understanding with exact term matching
707
- ```
708
-
709
- **Why it works:**
710
- - Semantic search finds related concepts (e.g., "database tuning")
711
- - Keyword boosting prioritizes documents with specific terms (e.g., "PostgreSQL")
712
- - Best of both worlds: semantic understanding + term precision
713
-
714
- ### Query 4: Multi-Filter Search
715
-
716
- ```python
717
- # Query: "error handling" (only in Python docs, markdown files)
718
- query = "error handling"
719
- results = hybrid_engine.search_with_filter(
720
- query=query,
721
- document_type="markdown",
722
- source_pattern=".*python.*",
723
- top_k=5
724
- )
725
-
726
- # Explanation:
727
- # - Semantic search for "error handling"
728
- # - Filter 1: Only markdown files
729
- # - Filter 2: Only Python documentation
730
- # - Highly targeted results
731
- ```
732
-
733
- **Why it works:**
734
- - Multiple filters narrow search scope significantly
735
- - Reduces false positives from other languages/formats
736
- - More targeted results, and often faster when the store applies filters before comparing vectors
737
-
738
- ### Query 5: RAG (Retrieval-Augmented Generation)
739
-
740
- ```python
741
- def rag_query(question: str, search_engine, llm_client):
742
- """Answer question using RAG"""
743
- # 1. Retrieve relevant context
744
- results = search_engine.search(question, top_k=5)
745
- context = "\n\n".join([r["text"] for r in results])
746
-
747
- # 2. Augment prompt with context
748
- prompt = f"""
749
- Context from documentation:
750
- {context}
751
-
752
- Question: {question}
753
-
754
- Answer the question based on the context above. If the context doesn't
755
- contain enough information, say so.
756
- """
757
-
758
- # 3. Generate answer with LLM
759
- response = llm_client.chat.completions.create(
760
- model="gpt-4",
761
- messages=[{"role": "user", "content": prompt}]
762
- )
763
-
764
- answer = response.choices[0].message.content
765
-
766
- return {
767
- "answer": answer,
768
- "sources": [r["metadata"]["source"] for r in results]
769
- }
770
-
771
- # Usage
772
- question = "What are the best practices for database indexing?"
773
- result = rag_query(question, search_engine, OpenAI(api_key=os.getenv("OPENAI_API_KEY")))
774
-
775
- print(f"Answer: {result['answer']}")
776
- print(f"\nSources: {', '.join(result['sources'])}")
777
-
778
- # Explanation:
779
- # - Retrieves top 5 most relevant documents
780
- # - Provides context to LLM
781
- # - LLM generates answer based on retrieved context
782
- # - Returns answer with source citations
783
- ```
784
-
785
- **Why it works:**
786
- - Combines retrieval (vector search) with generation (LLM)
787
- - Grounds LLM responses in actual documentation
788
- - Provides source attribution for verification
789
- - Reduces hallucinations (LLM making up information)
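
The retrieval-to-prompt step can be sketched in isolation. The helper below is a minimal illustration under stated assumptions, not the package's implementation: `build_rag_prompt` is a hypothetical name, the numbered-source layout is one possible convention, and the character budget is a crude stand-in for a real token budget.

```python
from typing import Dict, List, Tuple


def build_rag_prompt(
    question: str,
    results: List[Dict],
    max_context_chars: int = 4000,
) -> Tuple[str, List[str]]:
    """Assemble a grounded prompt from search results (hypothetical helper)."""
    blocks: List[str] = []
    sources: List[str] = []
    used = 0
    for result in results:
        text = result["text"].strip()
        if used + len(text) > max_context_chars:
            break  # stop before the context exceeds the budget
        source = result["metadata"].get("source", "Unknown")
        sources.append(source)
        blocks.append(f"[{len(sources)}] ({source})\n{text}")
        used += len(text)

    context = "\n\n".join(blocks)
    prompt = (
        "Context from documentation:\n"
        f"{context}\n\n"
        f"Question: {question}\n\n"
        "Answer using only the context above and cite sources by number. "
        "If the context does not contain enough information, say so."
    )
    return prompt, sources


# Made-up results in the same shape the search engines in this example return.
results = [
    {"text": "Create indexes on frequently queried columns.",
     "metadata": {"source": "docs/database/indexing.md"}},
    {"text": "Composite indexes help multi-column queries.",
     "metadata": {"source": "docs/database/performance.md"}},
    {"text": "Unrelated text that does not fit the budget.",
     "metadata": {"source": "docs/misc/notes.md"}},
]
budget = len(results[0]["text"]) + len(results[1]["text"])
prompt, sources = build_rag_prompt("How should I index a table?", results, budget)
print(prompt)
print(sources)
```

Returning the source list alongside the prompt keeps the citation numbers in the prompt aligned with the sources reported to the user.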
790
-
791
- ---
792
-
793
- ## Complete Example: End-to-End Workflow
794
-
795
- ```python
796
- import os
797
- from dotenv import load_dotenv
798
-
799
- # Load environment variables
800
- load_dotenv()
801
-
802
- # 1. Load documents
803
- print("Step 1: Loading documents...")
804
- loader = DocumentLoader()
805
- documents = loader.load_markdown_files("./docs")
806
- print(f"Loaded {len(documents)} documents")
807
-
808
- # 2. Chunk documents
809
- print("\nStep 2: Chunking documents...")
810
- chunker = DocumentChunker(chunk_size=512, overlap=50)
811
- chunked_documents = chunker.chunk_documents(documents)
812
- print(f"Created {len(chunked_documents)} chunks")
813
-
814
- # 3. Generate embeddings
815
- print("\nStep 3: Generating embeddings...")
816
- generator = EmbeddingGenerator(api_key=os.getenv("OPENAI_API_KEY"))
817
- embedded_documents = generator.embed_documents(chunked_documents)
818
- print(f"Generated embeddings for {len(embedded_documents)} chunks")
819
-
820
- # 4. Store in vector database
821
- print("\nStep 4: Storing in Pinecone...")
822
- pinecone_store = PineconeVectorStore(
823
- api_key=os.getenv("PINECONE_API_KEY"),
824
- environment=os.getenv("PINECONE_ENVIRONMENT"),
825
- index_name="knowledge-base"
826
- )
827
- pinecone_store.create_index(dimension=1536)
828
- pinecone_store.upsert_documents(embedded_documents)
829
- print("Documents stored successfully")
830
-
831
- # 5. Create search engines
832
- print("\nStep 5: Creating search engines...")
833
- search_engine = SemanticSearchEngine(pinecone_store, generator)
834
- hybrid_engine = HybridSearchEngine(pinecone_store, generator)
835
- keyword_search = KeywordBoostingSearch(pinecone_store, generator)
836
-
837
- # 6. Run sample queries
838
- print("\n" + "="*80)
839
- print("SAMPLE QUERIES")
840
- print("="*80)
841
-
842
- # Query 1: Basic semantic search
843
- print("\n[Query 1] Basic Semantic Search")
844
- print("Query: 'How to configure database indexes?'")
845
- results = search_engine.search("How to configure database indexes?", top_k=3)
846
- print(search_engine.format_results(results))
847
-
848
- # Query 2: Filtered search
849
- print("\n[Query 2] Filtered Search")
850
- print("Query: 'API authentication' (only markdown files)")
851
- results = hybrid_engine.search_with_filter(
852
- query="API authentication",
853
- document_type="markdown",
854
- top_k=3
855
- )
856
- print(search_engine.format_results(results))
857
-
858
- # Query 3: Hybrid search with keyword boosting
859
- print("\n[Query 3] Hybrid Search with Keyword Boosting")
860
- print("Query: 'PostgreSQL performance optimization'")
861
- results = keyword_search.hybrid_search(
862
- query="PostgreSQL performance optimization",
863
- top_k=3,
864
- vector_weight=0.7,
865
- keyword_weight=0.3
866
- )
867
- for i, result in enumerate(results, 1):
868
- print(f"\n--- Result {i} (Combined Score: {result['combined_score']:.4f}) ---")
869
- print(f"Source: {result['metadata'].get('source', 'Unknown')}")
870
- print(f"Text: {result['text'][:200]}...")
871
-
872
- print("\n" + "="*80)
873
- print("Search system ready!")
874
- print("="*80)
875
- ```
876
-
877
- ---
878
-
879
- ## Performance Metrics
880
-
881
- ### Measuring Search Quality
882
-
883
- ```python
884
- def evaluate_search_quality(search_engine, test_queries, ground_truth):
885
- """Evaluate search quality using test queries"""
886
- metrics = {
887
- "precision_at_5": [],
888
- "recall_at_5": [],
889
- "mrr": [] # Mean Reciprocal Rank
890
- }
891
-
892
- for query, relevant_docs in zip(test_queries, ground_truth):
893
- results = search_engine.search(query, top_k=5)
894
- result_ids = [r["id"] for r in results]
895
-
896
- # Precision@5
897
- relevant_found = len(set(result_ids) & set(relevant_docs))
898
- precision = relevant_found / 5
899
- metrics["precision_at_5"].append(precision)
900
-
901
- # Recall@5
902
- recall = relevant_found / len(relevant_docs) if relevant_docs else 0
903
- metrics["recall_at_5"].append(recall)
904
-
905
- # MRR
906
- for i, result_id in enumerate(result_ids, 1):
907
- if result_id in relevant_docs:
908
- metrics["mrr"].append(1 / i)
909
- break
910
- else:
911
- metrics["mrr"].append(0)
912
-
913
- return {
914
- "precision_at_5": sum(metrics["precision_at_5"]) / len(metrics["precision_at_5"]),
915
- "recall_at_5": sum(metrics["recall_at_5"]) / len(metrics["recall_at_5"]),
916
- "mrr": sum(metrics["mrr"]) / len(metrics["mrr"])
917
- }
918
-
919
- # Example
920
- test_queries = [
921
- "How to configure database indexes?",
922
- "API authentication best practices",
923
- "PostgreSQL performance tuning"
924
- ]
925
-
926
- ground_truth = [
927
- ["docs/database/indexing.md_chunk_0", "docs/database/indexing.md_chunk_1"],
928
- ["docs/api/auth.md_chunk_0", "docs/api/security.md_chunk_2"],
929
- ["docs/database/postgres.md_chunk_5", "docs/database/performance.md_chunk_3"]
930
- ]
931
-
932
- metrics = evaluate_search_quality(search_engine, test_queries, ground_truth)
933
- print(f"Precision@5: {metrics['precision_at_5']:.2%}")
934
- print(f"Recall@5: {metrics['recall_at_5']:.2%}")
935
- print(f"MRR: {metrics['mrr']:.4f}")
936
- ```
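
The three metrics are easy to verify by hand for a single query. The following sketch uses made-up document IDs and hypothetical helper names (`precision_at_k`, `recall_at_k`, `reciprocal_rank`); it mirrors the formulas used in this section rather than any library API.

```python
from typing import List, Set


def precision_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return len(set(ranked_ids[:k]) & relevant) / k


def recall_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked_ids: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_ids, 1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# One query: five ranked results, three relevant documents (IDs are invented).
ranked = ["c7", "a0", "x3", "a1", "z9"]
relevant = {"a0", "a1", "b4"}

p = precision_at_k(ranked, relevant, k=5)   # 2 relevant hits among 5 results
r = recall_at_k(ranked, relevant, k=5)      # 2 of 3 relevant documents found
rr = reciprocal_rank(ranked, relevant)      # first relevant hit is at rank 2
print(p, r, rr)
```

Averaging `rr` over all test queries gives the MRR figure reported above.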
937
-
938
- ---
939
-
940
- ## Summary
941
-
942
- **What We Built:**
943
- 1. ✅ Document ingestion pipeline (markdown, PDF)
944
- 2. ✅ Intelligent chunking with overlap
945
- 3. ✅ Embedding generation with OpenAI
946
- 4. ✅ Vector storage in Pinecone/Weaviate
947
- 5. ✅ Semantic similarity search
948
- 6. ✅ Hybrid search (vector + metadata filtering)
949
- 7. ✅ Keyword boosting for precision
950
- 8. ✅ RAG implementation for Q&A
951
- 9. ✅ Performance evaluation metrics
952
-
953
- **Key Takeaways:**
954
- - Vector databases enable semantic search (meaning-based, not keyword-based)
955
- - Chunking strategy affects search quality (this example uses 512-token chunks with a 50-token overlap as a starting point)
956
- - Hybrid search (vector similarity plus metadata filtering or keyword boosting) often improves precision over pure vector search
957
- - RAG combines retrieval with LLM generation for accurate answers
958
- - Always measure search quality with precision, recall, and MRR metrics
959
-
960
- **Next Steps:**
961
- - See `../rules/vector-databases.md` for vector database fundamentals
962
- - See `../rules/vector-embeddings.md` for embedding strategies
963
- - See `../rules/vector-indexing.md` for index optimization
964
- - Experiment with different chunking strategies and embedding models
965
- - Tune hybrid search weights for your use case
1
+ # Vector Database Example: Semantic Search Application
2
+
3
+ ## Overview
4
+
5
+ This example demonstrates a complete semantic search application using vector databases. It covers:
6
+ - Document ingestion and preprocessing
7
+ - Embedding generation with OpenAI
8
+ - Vector storage in Pinecone and Weaviate
9
+ - Similarity search
10
+ - Hybrid search (vector + keyword)
11
+ - Metadata filtering
12
+ - Sample queries with explanations
13
+
14
+ **Use Case**: Knowledge base search for technical documentation
15
+
16
+ **Tech Stack:**
17
+ - **Vector Database**: Pinecone (managed) or Weaviate (self-hosted)
18
+ - **Embedding Model**: OpenAI `text-embedding-3-small`
19
+ - **Language**: Python
20
+ - **Framework**: LangChain (optional, for RAG)
21
+
22
+ ---
23
+
24
+ ## Architecture
25
+
26
+ ```
27
+ Documents (PDF, Markdown, HTML)
28
+
29
+ Document Loader & Chunker
30
+
31
+ Embedding Generator (OpenAI)
32
+
33
+ Vector Database (Pinecone/Weaviate)
34
+
35
+ Search API (Similarity + Hybrid)
36
+
37
+ Results (Ranked by relevance)
38
+ ```
39
+
40
+ ---
41
+
42
+ ## Setup
43
+
44
+ ### Install Dependencies
45
+
46
+ ```bash
47
+ pip install openai "pinecone-client<3" "weaviate-client<4" tiktoken langchain python-dotenv PyPDF2
48
+ ```
49
+
50
+ ### Environment Variables
51
+
52
+ ```bash
53
+ # .env file
54
+ OPENAI_API_KEY=your-openai-api-key
55
+ PINECONE_API_KEY=your-pinecone-api-key
56
+ PINECONE_ENVIRONMENT=us-west1-gcp
57
+ WEAVIATE_URL=http://localhost:8080
58
+ ```
59
+
60
+ ---
61
+
62
+ ## Step 1: Document Ingestion
63
+
64
+ ### Load Documents
65
+
66
+ ```python
67
+ import os
68
+ from pathlib import Path
69
+ from typing import List, Dict
70
+
71
+ class DocumentLoader:
72
+ """Load documents from various sources"""
73
+
74
+ def load_markdown_files(self, directory: str) -> List[Dict]:
75
+ """Load all markdown files from directory"""
76
+ documents = []
77
+
78
+ for file_path in Path(directory).rglob("*.md"):
79
+ with open(file_path, 'r', encoding='utf-8') as f:
80
+ content = f.read()
81
+
82
+ documents.append({
83
+ "id": str(file_path),
84
+ "text": content,
85
+ "metadata": {
86
+ "source": str(file_path),
87
+ "filename": file_path.name,
88
+ "type": "markdown"
89
+ }
90
+ })
91
+
92
+ return documents
93
+
94
+ def load_pdf_files(self, directory: str) -> List[Dict]:
95
+ """Load PDF files (requires PyPDF2)"""
96
+ import PyPDF2
97
+ documents = []
98
+
99
+ for file_path in Path(directory).rglob("*.pdf"):
100
+ with open(file_path, 'rb') as f:
101
+ pdf_reader = PyPDF2.PdfReader(f)
102
+ text = ""
103
+
104
+ for page in pdf_reader.pages:
105
+ text += page.extract_text()
106
+
107
+ documents.append({
108
+ "id": str(file_path),
109
+ "text": text,
110
+ "metadata": {
111
+ "source": str(file_path),
112
+ "filename": file_path.name,
113
+ "type": "pdf",
114
+ "pages": len(pdf_reader.pages)
115
+ }
116
+ })
117
+
118
+ return documents
119
+
120
+ # Usage
121
+ loader = DocumentLoader()
122
+ documents = loader.load_markdown_files("./docs")
123
+ print(f"Loaded {len(documents)} documents")
124
+ ```
125
+
126
+ ### Chunk Documents
127
+
128
+ ```python
129
+ import tiktoken
130
+
131
+ class DocumentChunker:
132
+ """Chunk documents into smaller pieces"""
133
+
134
+ def __init__(self, chunk_size: int = 512, overlap: int = 50):
135
+ self.chunk_size = chunk_size
136
+ self.overlap = overlap
137
+ self.tokenizer = tiktoken.get_encoding("cl100k_base")
138
+
139
+ def chunk_by_tokens(self, text: str) -> List[str]:
140
+ """Split text into chunks by token count"""
141
+ tokens = self.tokenizer.encode(text)
142
+ chunks = []
143
+
144
+ for i in range(0, len(tokens), self.chunk_size - self.overlap):
145
+ chunk_tokens = tokens[i:i + self.chunk_size]
146
+ chunk_text = self.tokenizer.decode(chunk_tokens)
147
+ chunks.append(chunk_text)
148
+
149
+ return chunks
150
+
151
+ def chunk_documents(self, documents: List[Dict]) -> List[Dict]:
152
+ """Chunk all documents"""
153
+ chunked_docs = []
154
+
155
+ for doc in documents:
156
+ chunks = self.chunk_by_tokens(doc["text"])
157
+
158
+ for i, chunk in enumerate(chunks):
159
+ chunked_docs.append({
160
+ "id": f"{doc['id']}_chunk_{i}",
161
+ "text": chunk,
162
+ "metadata": {
163
+ **doc["metadata"],
164
+ "chunk_index": i,
165
+ "total_chunks": len(chunks),
166
+ "parent_id": doc["id"]
167
+ }
168
+ })
169
+
170
+ return chunked_docs
171
+
172
+ # Usage
173
+ chunker = DocumentChunker(chunk_size=512, overlap=50)
174
+ chunked_documents = chunker.chunk_documents(documents)
175
+ print(f"Created {len(chunked_documents)} chunks from {len(documents)} documents")
176
+ ```
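
The sliding-window arithmetic can be checked without a tokenizer. The sketch below is a minimal illustration, not the package's implementation: `window_chunks` is a hypothetical helper that works on a plain list of token IDs, advances by `chunk_size - overlap`, rejects an overlap that would make the stride non-positive, and stops once a window reaches the end of the input so the last chunk is never a pure repeat of the previous tail.

```python
from typing import List


def window_chunks(tokens: List[int], chunk_size: int = 512, overlap: int = 50) -> List[List[int]]:
    """Split a token list into overlapping windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    stride = chunk_size - overlap
    chunks: List[List[int]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already covers the end of the input
        start += stride
    return chunks


tokens = list(range(1000))  # stand-in for tokenizer output
chunks = window_chunks(tokens, chunk_size=512, overlap=50)

# Show (first token, last token, length) for each chunk.
print([(c[0], c[-1], len(c)) for c in chunks])
```

Each chunk after the first starts 462 tokens after the previous one, so consecutive chunks share 50 tokens.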
177
+
178
+ ---
179
+
180
+ ## Step 2: Embedding Generation
181
+
182
+ ### Generate Embeddings with OpenAI
183
+
184
+ ```python
185
+ from openai import OpenAI
186
+ from typing import List
187
+ import time
188
+
189
+ class EmbeddingGenerator:
190
+ """Generate embeddings using OpenAI"""
191
+
192
+ def __init__(self, api_key: str, model: str = "text-embedding-3-small"):
193
+ self.client = OpenAI(api_key=api_key)
194
+ self.model = model
195
+
196
+ def generate_embedding(self, text: str) -> List[float]:
197
+ """Generate embedding for a single text"""
198
+ response = self.client.embeddings.create(
199
+ model=self.model,
200
+ input=text
201
+ )
202
+ return response.data[0].embedding
203
+
204
+ def generate_embeddings_batch(self, texts: List[str], batch_size: int = 100) -> List[List[float]]:
205
+ """Generate embeddings in batches"""
206
+ embeddings = []
207
+
208
+ for i in range(0, len(texts), batch_size):
209
+ batch = texts[i:i + batch_size]
210
+
211
+ # Crude rate limiting: pause briefly before every batch request
212
+ time.sleep(0.1)
213
+
214
+ response = self.client.embeddings.create(
215
+ model=self.model,
216
+ input=batch
217
+ )
218
+
219
+ batch_embeddings = [item.embedding for item in response.data]
220
+ embeddings.extend(batch_embeddings)
221
+
222
+ print(f"Generated embeddings for {len(embeddings)}/{len(texts)} texts")
223
+
224
+ return embeddings
225
+
226
+ def embed_documents(self, documents: List[Dict]) -> List[Dict]:
227
+ """Add embeddings to documents"""
228
+ texts = [doc["text"] for doc in documents]
229
+ embeddings = self.generate_embeddings_batch(texts)
230
+
231
+ for doc, embedding in zip(documents, embeddings):
232
+ doc["embedding"] = embedding
233
+
234
+ return documents
235
+
236
+ # Usage
237
+ import os
238
+ from dotenv import load_dotenv
239
+
240
+ load_dotenv()
241
+
242
+ generator = EmbeddingGenerator(api_key=os.getenv("OPENAI_API_KEY"))
243
+ embedded_documents = generator.embed_documents(chunked_documents)
244
+ print(f"Generated embeddings for {len(embedded_documents)} chunks")
245
+ ```
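
A fixed pause does not help when a request actually fails. One common alternative is to retry with exponentially growing delays. The sketch below is an illustration under assumptions: `with_backoff` is a hypothetical helper, the exception types to retry on are left to the caller because this example does not show which errors the API client raises, and the demo uses a stand-in function instead of a real embedding call.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying with exponentially growing delays (hypothetical helper)."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("max_attempts must be at least 1")


# Demo: a stand-in for an embedding request that fails twice, then succeeds.
attempts = {"count": 0}
delays = []


def flaky_embed():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated rate limit")
    return [0.1, 0.2, 0.3]


# Record the delays instead of sleeping so the demo runs instantly.
vector = with_backoff(flaky_embed, retry_on=(ConnectionError,), sleep=delays.append)
print(vector, delays)
```

In the batch loop, the wrapped call would be the per-batch embedding request.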
246
+
247
+ ---
248
+
249
+ ## Step 3: Vector Storage
250
+
251
+ ### Option 1: Pinecone (Managed)
252
+
253
+ ```python
254
+ import pinecone
255
+ from typing import List, Dict
256
+
257
+ class PineconeVectorStore:
258
+ """Store and search vectors in Pinecone"""
259
+
260
+ def __init__(self, api_key: str, environment: str, index_name: str):
261
+ pinecone.init(api_key=api_key, environment=environment)
262
+ self.index_name = index_name
263
+ self.index = None
264
+
265
+ def create_index(self, dimension: int = 1536):
266
+ """Create Pinecone index"""
267
+ if self.index_name not in pinecone.list_indexes():
268
+ pinecone.create_index(
269
+ name=self.index_name,
270
+ dimension=dimension,
271
+ metric="cosine",
272
+ metadata_config={"indexed": ["source", "type", "filename"]}
273
+ )
274
+
275
+ self.index = pinecone.Index(self.index_name)
276
+ print(f"Created/connected to index: {self.index_name}")
277
+
278
+ def upsert_documents(self, documents: List[Dict], batch_size: int = 100):
279
+ """Upsert documents to Pinecone"""
280
+ for i in range(0, len(documents), batch_size):
281
+ batch = documents[i:i + batch_size]
282
+
283
+ # Prepare vectors for upsert
284
+ vectors = [
285
+ (
286
+ doc["id"],
287
+ doc["embedding"],
288
+ {
289
+ "text": doc["text"],
290
+ **doc["metadata"]
291
+ }
292
+ )
293
+ for doc in batch
294
+ ]
295
+
296
+ self.index.upsert(vectors=vectors)
297
+ print(f"Upserted {min(i + batch_size, len(documents))}/{len(documents)} documents")
298
+
299
+ def search(self, query_embedding: List[float], top_k: int = 10, filter: Dict = None):
300
+ """Search for similar vectors"""
301
+ results = self.index.query(
302
+ vector=query_embedding,
303
+ top_k=top_k,
304
+ filter=filter,
305
+ include_metadata=True
306
+ )
307
+
308
+ return [
309
+ {
310
+ "id": match.id,
311
+ "score": match.score,
312
+ "text": match.metadata.get("text", ""),
313
+ "metadata": match.metadata
314
+ }
315
+ for match in results.matches
316
+ ]
317
+
318
+ # Usage
319
+ pinecone_store = PineconeVectorStore(
320
+ api_key=os.getenv("PINECONE_API_KEY"),
321
+ environment=os.getenv("PINECONE_ENVIRONMENT"),
322
+ index_name="knowledge-base"
323
+ )
324
+
325
+ pinecone_store.create_index(dimension=1536)
326
+ pinecone_store.upsert_documents(embedded_documents)
327
+ ```
328
+
329
+ ### Option 2: Weaviate (Self-Hosted)
330
+
331
+ ```python
332
+ import weaviate
333
+ from typing import List, Dict
334
+
335
+ class WeaviateVectorStore:
336
+ """Store and search vectors in Weaviate"""
337
+
338
+ def __init__(self, url: str):
339
+ self.client = weaviate.Client(url)
340
+ self.class_name = "Document"
341
+
342
+ def create_schema(self):
343
+ """Create Weaviate schema"""
344
+ schema = {
345
+ "class": self.class_name,
346
+ "description": "Technical documentation chunks",
347
+ "vectorizer": "none", # We provide our own vectors
348
+ "properties": [
349
+ {
350
+ "name": "text",
351
+ "dataType": ["text"],
352
+ "description": "Document text content"
353
+ },
354
+ {
355
+ "name": "source",
356
+ "dataType": ["string"],
357
+ "description": "Source file path"
358
+ },
359
+ {
360
+ "name": "filename",
361
+ "dataType": ["string"],
362
+ "description": "File name"
363
+ },
364
+ {
365
+ "name": "type",
366
+ "dataType": ["string"],
367
+ "description": "Document type (markdown, pdf, etc.)"
368
+ },
369
+ {
370
+ "name": "chunk_index",
371
+ "dataType": ["int"],
372
+ "description": "Chunk index"
373
+ },
374
+ {
375
+ "name": "parent_id",
376
+ "dataType": ["string"],
377
+ "description": "Parent document ID"
378
+ }
379
+ ]
380
+ }
381
+
382
+ # Delete class if exists
383
+ if self.client.schema.exists(self.class_name):
384
+ self.client.schema.delete_class(self.class_name)
385
+
386
+ self.client.schema.create_class(schema)
387
+ print(f"Created schema for class: {self.class_name}")
388
+
389
+ def upsert_documents(self, documents: List[Dict], batch_size: int = 100):
390
+ """Upsert documents to Weaviate"""
391
+ with self.client.batch as batch:
392
+ batch.batch_size = batch_size
393
+
394
+ for i, doc in enumerate(documents):
395
+ properties = {
396
+ "text": doc["text"],
397
+ "source": doc["metadata"].get("source", ""),
398
+ "filename": doc["metadata"].get("filename", ""),
399
+ "type": doc["metadata"].get("type", ""),
400
+ "chunk_index": doc["metadata"].get("chunk_index", 0),
401
+ "parent_id": doc["metadata"].get("parent_id", "")
402
+ }
403
+
404
+ batch.add_data_object(
405
+ data_object=properties,
406
+ class_name=self.class_name,
407
+ vector=doc["embedding"],
408
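+ uuid=weaviate.util.generate_uuid5(doc["id"])  # Weaviate object IDs must be UUIDs; derive one deterministically from the path-based ID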
409
+ )
410
+
411
+ if (i + 1) % 100 == 0:
412
+ print(f"Upserted {i + 1}/{len(documents)} documents")
413
+
414
+ def search(self, query_embedding: List[float], top_k: int = 10, where_filter: Dict = None):
415
+ """Search for similar vectors"""
416
+ query = self.client.query.get(
417
+ self.class_name,
418
+ ["text", "source", "filename", "type", "chunk_index"]
419
+ ).with_near_vector({
420
+ "vector": query_embedding
421
+ }).with_limit(top_k)
422
+
423
+ if where_filter:
424
+ query = query.with_where(where_filter)
425
+
426
+ results = query.do()
427
+
428
+ return [
429
+ {
430
+ "text": item["text"],
431
+ "metadata": {
432
+ "source": item.get("source", ""),
433
+ "filename": item.get("filename", ""),
434
+ "type": item.get("type", ""),
435
+ "chunk_index": item.get("chunk_index", 0)
436
+ }
437
+ }
438
+ for item in results["data"]["Get"][self.class_name]
439
+ ]
440
+
441
+ # Usage
442
+ weaviate_store = WeaviateVectorStore(url=os.getenv("WEAVIATE_URL"))
443
+ weaviate_store.create_schema()
444
+ weaviate_store.upsert_documents(embedded_documents)
445
+ ```
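
Weaviate identifies objects by UUID, while the chunk IDs built in Step 1 are path-based strings such as `docs/database/indexing.md_chunk_0`. One way to bridge the two, sketched here with the standard library only, is a deterministic UUIDv5 derived from the string. `chunk_uuid` and the choice of `uuid.NAMESPACE_URL` are assumptions made for illustration.

```python
import uuid


def chunk_uuid(chunk_id: str) -> str:
    """Map a path-based chunk ID to a stable UUID string (hypothetical helper)."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, chunk_id))


first = chunk_uuid("docs/database/indexing.md_chunk_0")
again = chunk_uuid("docs/database/indexing.md_chunk_0")
other = chunk_uuid("docs/database/indexing.md_chunk_1")

print(first)
print(first == again, first == other)
```

Because the mapping is deterministic, re-ingesting the same chunk produces the same object ID, which lets a repeated import overwrite objects instead of duplicating them.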
446
+
447
+ ---
448
+
449
+ ## Step 4: Similarity Search
450
+
451
+ ### Basic Similarity Search
452
+
453
+ ```python
454
+ class SemanticSearchEngine:
455
+ """Semantic search engine using vector database"""
456
+
457
+ def __init__(self, vector_store, embedding_generator):
458
+ self.vector_store = vector_store
459
+ self.embedding_generator = embedding_generator
460
+
461
+ def search(self, query: str, top_k: int = 10) -> List[Dict]:
462
+ """Search for documents similar to query"""
463
+ # Generate query embedding
464
+ query_embedding = self.embedding_generator.generate_embedding(query)
465
+
466
+ # Search vector database
467
+ results = self.vector_store.search(query_embedding, top_k=top_k)
468
+
469
+ return results
470
+
471
+ def format_results(self, results: List[Dict]) -> str:
472
+ """Format search results for display"""
473
+ output = []
474
+
475
+ for i, result in enumerate(results, 1):
476
+ output.append(f"\n--- Result {i} (Score: {format(result['score'], '.4f') if 'score' in result else 'N/A'}) ---")
477
+ output.append(f"Source: {result['metadata'].get('source', 'Unknown')}")
478
+ output.append(f"Text: {result['text'][:200]}...")
479
+
480
+ return "\n".join(output)
481
+
482
+ # Usage
483
+ search_engine = SemanticSearchEngine(pinecone_store, generator)
484
+
485
+ # Example query
486
+ query = "How do I configure database indexing?"
487
+ results = search_engine.search(query, top_k=5)
488
+ print(search_engine.format_results(results))
489
+ ```
490
+
491
+ **Example Output:**
492
+ ```
493
+ --- Result 1 (Score: 0.8923) ---
494
+ Source: docs/database/indexing.md
495
+ Text: Database indexing is crucial for query performance. To configure indexes,
496
+ you should first identify frequently queried columns. Create indexes using the
497
+ CREATE INDEX statement...
498
+
499
+ --- Result 2 (Score: 0.8654) ---
500
+ Source: docs/database/performance.md
501
+ Text: Index configuration affects query speed significantly. Best practices include
502
+ creating composite indexes for multi-column queries and monitoring index usage...
503
+ ```
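
The scores in this output come from the index's similarity metric, which this example sets to cosine when the Pinecone index is created. As a rough illustration of what that number means, the sketch below computes cosine similarity by hand. The three-dimensional vectors are invented stand-ins for real 1536-dimensional embeddings, and `cosine_similarity` is a hypothetical helper, not a vector-store API.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


query_vec = [1.0, 0.0, 1.0]
docs = {
    "indexing.md": [2.0, 0.0, 2.0],     # same direction as the query
    "performance.md": [1.0, 1.0, 0.0],  # partially aligned
    "auth.md": [0.0, 3.0, 0.0],         # orthogonal to the query
}

ranked = sorted(docs, key=lambda name: cosine_similarity(query_vec, docs[name]), reverse=True)
for name in ranked:
    print(f"{name}: {cosine_similarity(query_vec, docs[name]):.4f}")
```

Cosine similarity depends only on direction, so `indexing.md` scores highest even though its vector is twice as long as the query's.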
504
+
505
+ ---
506
+
507
+ ## Step 5: Hybrid Search (Vector + Keyword)
508
+
509
+ ### Hybrid Search with Metadata Filtering
510
+
511
+ ```python
512
+ class HybridSearchEngine:
513
+ """Hybrid search combining vector similarity and metadata filtering"""
514
+
515
+ def __init__(self, vector_store, embedding_generator):
516
+ self.vector_store = vector_store
517
+ self.embedding_generator = embedding_generator
518
+
519
+ def search_with_filter(
520
+ self,
521
+ query: str,
522
+ top_k: int = 10,
523
+ document_type: str = None,
524
+ source_pattern: str = None
525
+ ) -> List[Dict]:
526
+ """Search with metadata filtering"""
527
+ # Generate query embedding
528
+ query_embedding = self.embedding_generator.generate_embedding(query)
529
+
530
+ # Build filter
531
+ filter_dict = {}
532
+ if document_type:
533
+ filter_dict["type"] = {"$eq": document_type}
534
+ if source_pattern:
535
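+ filter_dict["source"] = {"$regex": source_pattern}  # Caution: "$regex" is not among Pinecone's documented filter operators; use a store that supports pattern matching, or filter on an exact-match field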
536
+
537
+ # Search with filter
538
+ results = self.vector_store.search(
539
+ query_embedding,
540
+ top_k=top_k,
541
+ filter=filter_dict if filter_dict else None
542
+ )
543
+
544
+ return results
545
+
546
+ def search_by_category(self, query: str, category: str, top_k: int = 10) -> List[Dict]:
547
+ """Search within a specific category"""
548
+ return self.search_with_filter(
549
+ query=query,
550
+ top_k=top_k,
551
+ source_pattern=f".*{category}.*"
552
+ )
553
+
554
+ # Usage
555
+ hybrid_engine = HybridSearchEngine(pinecone_store, generator)
556
+
557
+ # Search only in database documentation
558
+ results = hybrid_engine.search_by_category(
559
+ query="How to optimize queries?",
560
+ category="database",
561
+ top_k=5
562
+ )
563
+
564
+ # Search only markdown files
565
+ results = hybrid_engine.search_with_filter(
566
+ query="API authentication",
567
+ document_type="markdown",
568
+ top_k=5
569
+ )
570
+ ```
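
Pinecone's documented metadata filters are built from comparison and set operators such as `$eq`, `$in`, and `$and`; pattern matching on `source` is not part of that documented set. A possible alternative, sketched below under assumptions, is to derive an exact-match `category` field from the file path at ingestion time and filter on it. The names `category_from_source` and `build_filter`, the `category` field, and the rule that the category is the directory directly under `docs/` are all hypothetical. Paths are assumed to use forward slashes.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional


def category_from_source(source: str, root: str = "docs") -> str:
    """Use the directory directly under `root` as the category (assumed layout)."""
    parts = PurePosixPath(source).parts
    if root in parts:
        idx = parts.index(root)
        if idx + 2 < len(parts):  # need root/<category>/<file>
            return parts[idx + 1]
    return "uncategorized"


def build_filter(
    document_type: Optional[str] = None,
    category: Optional[str] = None,
) -> Optional[Dict]:
    """Build an equality-only metadata filter (hypothetical helper)."""
    clauses = []
    if document_type:
        clauses.append({"type": {"$eq": document_type}})
    if category:
        clauses.append({"category": {"$eq": category}})
    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}


print(category_from_source("docs/database/indexing.md"))
print(build_filter(document_type="markdown", category="database"))
```

The `category` value would be stored alongside `source` and `type` when each chunk is upserted, so the filter never needs to inspect the path at query time.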
571
+
572
+ ### Hybrid Search with Keyword Boosting
573
+
574
+ ```python
575
+ from typing import List, Dict, Set
576
+
577
+ class KeywordBoostingSearch:
578
+ """Hybrid search with keyword boosting"""
579
+
580
+ def __init__(self, vector_store, embedding_generator):
581
+ self.vector_store = vector_store
582
+ self.embedding_generator = embedding_generator
583
+
584
+ def extract_keywords(self, query: str) -> Set[str]:
585
+ """Extract important keywords from query"""
586
+ # Simple keyword extraction (can use NLP libraries for better results)
587
+ stopwords = {"the", "a", "an", "in", "on", "at", "to", "for", "of", "and", "or"}
588
+ words = query.lower().split()
589
+ keywords = {word for word in words if word not in stopwords and len(word) > 3}
590
+ return keywords
591
+
592
+ def keyword_match_score(self, text: str, keywords: Set[str]) -> float:
593
+ """Calculate keyword match score"""
594
+ text_lower = text.lower()
595
+ matches = sum(1 for keyword in keywords if keyword in text_lower)
596
+ return matches / len(keywords) if keywords else 0.0
597
+
598
+ def hybrid_search(
599
+ self,
600
+ query: str,
601
+ top_k: int = 10,
602
+ vector_weight: float = 0.7,
603
+ keyword_weight: float = 0.3
604
+ ) -> List[Dict]:
605
+ """Hybrid search with weighted scoring"""
606
+ # Extract keywords
607
+ keywords = self.extract_keywords(query)
608
+
609
+ # Vector search (get more results for re-ranking)
610
+ query_embedding = self.embedding_generator.generate_embedding(query)
611
+ vector_results = self.vector_store.search(query_embedding, top_k=top_k * 2)
612
+
613
+ # Re-rank with keyword boosting
614
+ for result in vector_results:
615
+ vector_score = result.get("score", 0)
616
+ keyword_score = self.keyword_match_score(result["text"], keywords)
617
+
618
+ # Combined score
619
+ result["combined_score"] = (
620
+ vector_weight * vector_score +
621
+ keyword_weight * keyword_score
622
+ )
623
+
624
+ # Sort by combined score
625
+ vector_results.sort(key=lambda x: x["combined_score"], reverse=True)
626
+
627
+ return vector_results[:top_k]
628
+
629
+ # Usage
630
+ keyword_search = KeywordBoostingSearch(pinecone_store, generator)
631
+
632
+ results = keyword_search.hybrid_search(
633
+ query="database indexing performance optimization",
634
+ top_k=5,
635
+ vector_weight=0.7,
636
+ keyword_weight=0.3
637
+ )
638
+
639
+ for result in results:
640
+ print(f"Combined Score: {result['combined_score']:.4f}")
641
+ print(f"Text: {result['text'][:100]}...\n")
642
+ ```
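The weighted sum in `hybrid_search` mixes a keyword score that always lies in [0, 1] with a vector score whose range depends on the similarity metric of the index. One optional way to keep the two weights comparable is to rescale the vector scores first. The sketch below is an assumption layered on top of the tutorial, not part of it: `min_max_normalize` is a hypothetical helper and the raw scores are made up.

```python
from typing import Dict, List


def min_max_normalize(results: List[Dict], key: str = "score") -> List[Dict]:
    """Return copies of results with `key` rescaled to the range [0, 1]."""
    if not results:
        return []
    scores = [r.get(key, 0.0) for r in results]
    low, high = min(scores), max(scores)
    span = high - low
    normalized = []
    for r, s in zip(results, scores):
        item = dict(r)  # shallow copy so the caller's dicts are untouched
        # If every score is identical, treat all results as equally good (1.0).
        item[key] = (s - low) / span if span > 0 else 1.0
        normalized.append(item)
    return normalized


# Hypothetical raw scores, e.g. dot-product similarities on an arbitrary scale.
raw = [
    {"id": "a", "score": 12.0},
    {"id": "b", "score": 9.0},
    {"id": "c", "score": 6.0},
]
scaled = min_max_normalize(raw)
print([(r["id"], r["score"]) for r in scaled])
```

Min-max scaling is relative to the current candidate set, so the same document can receive different normalized scores for different queries. If the index already returns cosine similarities in a known range, this step may be unnecessary.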
643
+
644
+ ---
645
+
646
+ ## Step 6: Sample Queries with Explanations
647
+
648
+ ### Query 1: Basic Semantic Search
649
+
650
+ ```python
651
+ # Query: "How do I set up authentication?"
652
+ query = "How do I set up authentication?"
653
+ results = search_engine.search(query, top_k=3)
654
+
655
+ # Explanation:
656
+ # - Converts query to embedding vector
657
+ # - Finds documents with similar embeddings (semantic similarity)
658
+ # - Returns top 3 most similar documents
659
+ # - Will find documents about "authentication setup", "auth configuration", etc.
660
+ # even if they don't use exact words "set up authentication"
661
+ ```
662
+
663
+ **Why it works:**
664
+ - Semantic understanding: "set up" ≈ "configure" ≈ "initialize"
665
+ - Finds conceptually similar content, not just keyword matches
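To make "similar embeddings" concrete, here is a minimal sketch of cosine similarity, a metric commonly used to compare embedding vectors. The three-dimensional vectors are invented for illustration. Real embedding models produce far larger vectors, and the metric actually used depends on how the index was configured.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors; the cosine is undefined here
    return dot / (norm_a * norm_b)


# Made-up toy "embeddings": the two authentication phrases point in a
# similar direction, while the database phrase points elsewhere.
toy_embeddings = {
    "set up authentication": [0.9, 0.1, 0.0],
    "configure auth": [0.8, 0.2, 0.1],
    "tune database indexes": [0.1, 0.1, 0.9],
}
query_vec = toy_embeddings["set up authentication"]
ranked = sorted(
    toy_embeddings,
    key=lambda text: cosine_similarity(query_vec, toy_embeddings[text]),
    reverse=True,
)
print(ranked)
```

With these toy vectors, "configure auth" ranks just below the query phrase itself even though the two share no words, which is the behavior the bullets above describe.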
666
+
667
+ ### Query 2: Filtered Search
668
+
669
+ ```python
670
+ # Query: "API rate limiting" (only in API documentation)
671
+ query = "API rate limiting"
672
+ results = hybrid_engine.search_with_filter(
673
+ query=query,
674
+ source_pattern=".*api.*",
675
+ top_k=5
676
+ )
677
+
678
+ # Explanation:
679
+ # - Semantic search for "API rate limiting"
680
+ # - Filters results to only include documents from API docs
681
+ # - Combines vector similarity with metadata filtering
682
+ # - More precise results than pure semantic search
683
+ ```
684
+
685
+ **Why it works:**
686
+ - Narrows search scope to relevant documentation section
687
+ - Reduces noise from unrelated documents
688
+ - Can be faster when the vector store applies the filter during the search rather than after it (fewer vectors to compare)
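One way such a filter can work is as a post-filter over the vector results. The sketch below assumes that approach. `filter_by_metadata` is a hypothetical helper, the metadata keys `source` and `type` are assumptions, and the candidate list is made up. It also shows a pitfall of a loose pattern like `.*api.*`, which matches any path containing the letters "api".

```python
import re
from typing import Dict, List, Optional


def filter_by_metadata(
    results: List[Dict],
    source_pattern: Optional[str] = None,
    document_type: Optional[str] = None,
) -> List[Dict]:
    """Keep results whose metadata matches every filter that was supplied."""
    pattern = re.compile(source_pattern, re.IGNORECASE) if source_pattern else None
    kept = []
    for r in results:
        meta = r.get("metadata", {})
        if pattern and not pattern.search(meta.get("source", "")):
            continue
        if document_type and meta.get("type") != document_type:
            continue
        kept.append(r)
    return kept


# Hypothetical vector-search candidates, already sorted by similarity.
candidates = [
    {"id": "1", "metadata": {"source": "docs/api/rate-limits.md", "type": "markdown"}},
    {"id": "2", "metadata": {"source": "docs/database/pooling.md", "type": "markdown"}},
    {"id": "3", "metadata": {"source": "docs/api/auth.pdf", "type": "pdf"}},
    {"id": "4", "metadata": {"source": "docs/rapid-start/intro.md", "type": "markdown"}},
]

loose = filter_by_metadata(candidates, source_pattern=".*api.*")
strict = filter_by_metadata(candidates, source_pattern=r"(^|/)api/")
print([r["id"] for r in loose])   # includes "rapid-start", which contains "api"
print([r["id"] for r in strict])  # only paths with an "api/" directory segment
```

Post-filtering can return fewer than `top_k` results, so it usually pays to over-fetch candidates before filtering.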
689
+
690
+ ### Query 3: Hybrid Search with Keyword Boosting
691
+
692
+ ```python
693
+ # Query: "PostgreSQL index optimization"
694
+ query = "PostgreSQL index optimization"
695
+ results = keyword_search.hybrid_search(
696
+ query=query,
697
+ top_k=5,
698
+ vector_weight=0.6,
699
+ keyword_weight=0.4
700
+ )
701
+
702
+ # Explanation:
703
+ # - Semantic search finds conceptually similar documents
704
+ # - Keyword matching boosts documents containing "PostgreSQL", "index", "optimization"
705
+ # - Weighted combination: 60% semantic similarity, 40% keyword match
706
+ # - Balances semantic understanding with exact term matching
707
+ ```
708
+
709
+ **Why it works:**
710
+ - Semantic search finds related concepts (e.g., "database tuning")
711
+ - Keyword boosting prioritizes documents with specific terms (e.g., "PostgreSQL")
712
+ - Best of both worlds: semantic understanding + term precision
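A small worked example shows how the weights change the ranking. The scores below are invented: `doc_a` is semantically closer but mentions only one of three keywords, while `doc_b` is slightly less similar but contains all three.

```python
def combined_score(
    vector_score: float,
    keyword_score: float,
    vector_weight: float,
    keyword_weight: float,
) -> float:
    """Weighted sum of the two scores, as in the hybrid search above."""
    return vector_weight * vector_score + keyword_weight * keyword_score


# Hypothetical (vector_score, keyword_score) pairs for two documents.
docs = {
    "doc_a": (0.82, 1 / 3),  # closer in meaning, 1 of 3 keywords present
    "doc_b": (0.74, 1.0),    # slightly further, all 3 keywords present
}

winners = {}
for vector_weight, keyword_weight in [(1.0, 0.0), (0.9, 0.1), (0.6, 0.4)]:
    scores = {
        name: combined_score(v, k, vector_weight, keyword_weight)
        for name, (v, k) in docs.items()
    }
    winners[(vector_weight, keyword_weight)] = max(scores, key=scores.get)
    rounded = {name: round(s, 3) for name, s in scores.items()}
    print(f"{vector_weight:.1f}/{keyword_weight:.1f}: {rounded}")
```

With these numbers `doc_a` stays ahead at 100/0 and 90/10, and `doc_b` overtakes it at 60/40. The crossover point depends entirely on the score distributions in your own data, so treat any fixed weighting as a starting point to tune.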
713
+
714
+ ### Query 4: Multi-Filter Search
715
+
716
+ ```python
717
+ # Query: "error handling" (only in Python docs, markdown files)
718
+ query = "error handling"
719
+ results = hybrid_engine.search_with_filter(
720
+ query=query,
721
+ document_type="markdown",
722
+ source_pattern=".*python.*",
723
+ top_k=5
724
+ )
725
+
726
+ # Explanation:
727
+ # - Semantic search for "error handling"
728
+ # - Filter 1: Only markdown files
729
+ # - Filter 2: Only Python documentation
730
+ # - Highly targeted results
731
+ ```
732
+
733
+ **Why it works:**
734
+ - Multiple filters narrow search scope significantly
735
+ - Reduces false positives from other languages/formats
736
+ - Usually more precise, and can be faster when the vector store applies the filters during the search
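When the vector store supports server-side metadata filtering, several conditions are typically combined into one filter expression. The sketch below builds a MongoDB-style filter dict using the `$eq` and `$and` operators that Pinecone's metadata filtering accepts. `build_metadata_filter` and the `type` and `category` metadata fields are hypothetical. Operator-based filters of this kind generally do not evaluate regular expressions, so the sketch assumes a `category` value was stored with each chunk at ingestion time instead of matching on the source path.

```python
from typing import Any, Dict, Optional


def build_metadata_filter(
    document_type: Optional[str] = None,
    category: Optional[str] = None,
) -> Optional[Dict[str, Any]]:
    """Build a MongoDB-style filter dict; return None when no filter is set."""
    clauses = []
    if document_type:
        clauses.append({"type": {"$eq": document_type}})
    if category:
        clauses.append({"category": {"$eq": category}})
    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}


metadata_filter = build_metadata_filter(document_type="markdown", category="python")
print(metadata_filter)
```

The resulting dict would be passed to the store's query call. How `search_with_filter` forwards it is not shown here and depends on the wrapper class defined earlier.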
737
+
738
+ ### Query 5: RAG (Retrieval-Augmented Generation)
739
+
740
+ ```python
741
+ import os
+
+ from openai import OpenAI
+
+ def rag_query(question: str, search_engine, llm_client):
742
+ """Answer question using RAG"""
743
+ # 1. Retrieve relevant context
744
+ results = search_engine.search(question, top_k=5)
745
+ context = "\n\n".join([r["text"] for r in results])
746
+
747
+ # 2. Augment prompt with context
748
+ prompt = f"""
749
+ Context from documentation:
750
+ {context}
751
+
752
+ Question: {question}
753
+
754
+ Answer the question based on the context above. If the context doesn't
755
+ contain enough information, say so.
756
+ """
757
+
758
+ # 3. Generate answer with LLM
759
+ response = llm_client.chat.completions.create(
760
+ model="gpt-4",
761
+ messages=[{"role": "user", "content": prompt}]
762
+ )
763
+
764
+ answer = response.choices[0].message.content
765
+
766
+ return {
767
+ "answer": answer,
768
+ "sources": [r["metadata"]["source"] for r in results]
769
+ }
770
+
771
+ # Usage
772
+ question = "What are the best practices for database indexing?"
773
+ result = rag_query(question, search_engine, OpenAI(api_key=os.getenv("OPENAI_API_KEY")))
774
+
775
+ print(f"Answer: {result['answer']}")
776
+ print(f"\nSources: {', '.join(result['sources'])}")
777
+
778
+ # Explanation:
779
+ # - Retrieves top 5 most relevant documents
780
+ # - Provides context to LLM
781
+ # - LLM generates answer based on retrieved context
782
+ # - Returns answer with source citations
783
+ ```
784
+
785
+ **Why it works:**
786
+ - Combines retrieval (vector search) with generation (LLM)
787
+ - Grounds LLM responses in actual documentation
788
+ - Provides source attribution for verification
789
+ - Reduces hallucinations (LLM making up information)
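Joining every retrieved chunk into the prompt can exceed the model's context window and makes it hard to tell which source supports which statement. The sketch below is one possible refinement, not part of the original `rag_query`: a hypothetical `build_context` helper that caps the context by character count (a rough stand-in for a token budget) and labels each chunk with a source number the answer can cite.

```python
from typing import Dict, List, Tuple


def build_context(results: List[Dict], max_chars: int = 4000) -> Tuple[str, List[str]]:
    """Concatenate chunk texts, labeled by source, until the budget is reached."""
    parts: List[str] = []
    sources: List[str] = []
    used = 0
    for r in results:
        source = r.get("metadata", {}).get("source", "unknown")
        text = r["text"].strip()
        if used + len(text) > max_chars:
            break  # results are assumed sorted by relevance, so drop the tail
        if source not in sources:
            sources.append(source)
        label = sources.index(source) + 1
        parts.append(f"[{label}] {text}")
        used += len(text)
    return "\n\n".join(parts), sources


# Made-up retrieval results; the third chunk is too large for the budget.
results = [
    {"text": "Create indexes on columns used in WHERE clauses.",
     "metadata": {"source": "docs/database/indexing.md"}},
    {"text": "Avoid indexing low-cardinality columns.",
     "metadata": {"source": "docs/database/indexing.md"}},
    {"text": "x" * 5000,
     "metadata": {"source": "docs/database/performance.md"}},
]
context, sources = build_context(results, max_chars=200)
print(context)
print(sources)
```

Returning the deduplicated `sources` list also avoids printing the same file several times when multiple chunks come from one document.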
790
+
791
+ ---
792
+
793
+ ## Complete Example: End-to-End Workflow
794
+
795
+ ```python
796
+ import os
797
+ from dotenv import load_dotenv
798
+
799
+ # Load environment variables
800
+ load_dotenv()
801
+
802
+ # 1. Load documents
803
+ print("Step 1: Loading documents...")
804
+ loader = DocumentLoader()
805
+ documents = loader.load_markdown_files("./docs")
806
+ print(f"Loaded {len(documents)} documents")
807
+
808
+ # 2. Chunk documents
809
+ print("\nStep 2: Chunking documents...")
810
+ chunker = DocumentChunker(chunk_size=512, overlap=50)
811
+ chunked_documents = chunker.chunk_documents(documents)
812
+ print(f"Created {len(chunked_documents)} chunks")
813
+
814
+ # 3. Generate embeddings
815
+ print("\nStep 3: Generating embeddings...")
816
+ generator = EmbeddingGenerator(api_key=os.getenv("OPENAI_API_KEY"))
817
+ embedded_documents = generator.embed_documents(chunked_documents)
818
+ print(f"Generated embeddings for {len(embedded_documents)} chunks")
819
+
820
+ # 4. Store in vector database
821
+ print("\nStep 4: Storing in Pinecone...")
822
+ pinecone_store = PineconeVectorStore(
823
+ api_key=os.getenv("PINECONE_API_KEY"),
824
+ environment=os.getenv("PINECONE_ENVIRONMENT"),
825
+ index_name="knowledge-base"
826
+ )
827
+ pinecone_store.create_index(dimension=1536)
828
+ pinecone_store.upsert_documents(embedded_documents)
829
+ print("Documents stored successfully")
830
+
831
+ # 5. Create search engines
832
+ print("\nStep 5: Creating search engines...")
833
+ search_engine = SemanticSearchEngine(pinecone_store, generator)
834
+ hybrid_engine = HybridSearchEngine(pinecone_store, generator)
835
+ keyword_search = KeywordBoostingSearch(pinecone_store, generator)
836
+
837
+ # 6. Run sample queries
838
+ print("\n" + "="*80)
839
+ print("SAMPLE QUERIES")
840
+ print("="*80)
841
+
842
+ # Query 1: Basic semantic search
843
+ print("\n[Query 1] Basic Semantic Search")
844
+ print("Query: 'How to configure database indexes?'")
845
+ results = search_engine.search("How to configure database indexes?", top_k=3)
846
+ print(search_engine.format_results(results))
847
+
848
+ # Query 2: Filtered search
849
+ print("\n[Query 2] Filtered Search")
850
+ print("Query: 'API authentication' (only markdown files)")
851
+ results = hybrid_engine.search_with_filter(
852
+ query="API authentication",
853
+ document_type="markdown",
854
+ top_k=3
855
+ )
856
+ print(search_engine.format_results(results))
857
+
858
+ # Query 3: Hybrid search with keyword boosting
859
+ print("\n[Query 3] Hybrid Search with Keyword Boosting")
860
+ print("Query: 'PostgreSQL performance optimization'")
861
+ results = keyword_search.hybrid_search(
862
+ query="PostgreSQL performance optimization",
863
+ top_k=3,
864
+ vector_weight=0.7,
865
+ keyword_weight=0.3
866
+ )
867
+ for i, result in enumerate(results, 1):
868
+ print(f"\n--- Result {i} (Combined Score: {result['combined_score']:.4f}) ---")
869
+ print(f"Source: {result['metadata'].get('source', 'Unknown')}")
870
+ print(f"Text: {result['text'][:200]}...")
871
+
872
+ print("\n" + "="*80)
873
+ print("Search system ready!")
874
+ print("="*80)
875
+ ```
876
+
877
+ ---
878
+
879
+ ## Performance Metrics
880
+
881
+ ### Measuring Search Quality
882
+
883
+ ```python
884
+ def evaluate_search_quality(search_engine, test_queries, ground_truth):
885
+ """Evaluate search quality using test queries"""
886
+ metrics = {
887
+ "precision_at_5": [],
888
+ "recall_at_5": [],
889
+ "mrr": [] # Mean Reciprocal Rank
890
+ }
891
+
892
+ for query, relevant_docs in zip(test_queries, ground_truth):
893
+ results = search_engine.search(query, top_k=5)
894
+ result_ids = [r["id"] for r in results]
895
+
896
+ # Precision@5
897
+ relevant_found = len(set(result_ids) & set(relevant_docs))
898
+ precision = relevant_found / 5
899
+ metrics["precision_at_5"].append(precision)
900
+
901
+ # Recall@5
902
+ recall = relevant_found / len(relevant_docs) if relevant_docs else 0
903
+ metrics["recall_at_5"].append(recall)
904
+
905
+ # MRR
906
+ for i, result_id in enumerate(result_ids, 1):
907
+ if result_id in relevant_docs:
908
+ metrics["mrr"].append(1 / i)
909
+ break
910
+ else:
911
+ metrics["mrr"].append(0)
912
+
913
+ return {
914
+ "precision_at_5": sum(metrics["precision_at_5"]) / len(metrics["precision_at_5"]),
915
+ "recall_at_5": sum(metrics["recall_at_5"]) / len(metrics["recall_at_5"]),
916
+ "mrr": sum(metrics["mrr"]) / len(metrics["mrr"])
917
+ }
918
+
919
+ # Example
920
+ test_queries = [
921
+ "How to configure database indexes?",
922
+ "API authentication best practices",
923
+ "PostgreSQL performance tuning"
924
+ ]
925
+
926
+ ground_truth = [
927
+ ["docs/database/indexing.md_chunk_0", "docs/database/indexing.md_chunk_1"],
928
+ ["docs/api/auth.md_chunk_0", "docs/api/security.md_chunk_2"],
929
+ ["docs/database/postgres.md_chunk_5", "docs/database/performance.md_chunk_3"]
930
+ ]
931
+
932
+ metrics = evaluate_search_quality(search_engine, test_queries, ground_truth)
933
+ print(f"Precision@5: {metrics['precision_at_5']:.2%}")
934
+ print(f"Recall@5: {metrics['recall_at_5']:.2%}")
935
+ print(f"MRR: {metrics['mrr']:.4f}")
936
+ ```
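To sanity-check an evaluation run, it helps to compute the three metrics by hand for one query. The sketch below uses standalone helper functions and made-up document IDs, so it does not need a search engine or a vector store.

```python
from typing import Sequence


def precision_at_k(result_ids: Sequence[str], relevant: Sequence[str], k: int) -> float:
    """Fraction of the top-k slots filled by relevant documents."""
    hits = len(set(result_ids[:k]) & set(relevant))
    return hits / k


def recall_at_k(result_ids: Sequence[str], relevant: Sequence[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = len(set(result_ids[:k]) & set(relevant))
    return hits / len(set(relevant))


def reciprocal_rank(result_ids: Sequence[str], relevant: Sequence[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was returned."""
    relevant_set = set(relevant)
    for rank, result_id in enumerate(result_ids, start=1):
        if result_id in relevant_set:
            return 1 / rank
    return 0.0


# Hypothetical ranked results and ground truth for a single query.
result_ids = ["c7", "a1", "c9", "b2", "c3"]
relevant = ["a1", "b2", "z0"]

print(f"Precision@5: {precision_at_k(result_ids, relevant, 5):.2f}")  # 2 hits / 5 slots
print(f"Recall@5: {recall_at_k(result_ids, relevant, 5):.2f}")        # 2 of 3 relevant found
print(f"Reciprocal rank: {reciprocal_rank(result_ids, relevant):.2f}")  # first hit at rank 2
```

MRR is the mean of the reciprocal rank over all test queries. Averaging per-query values like these is what `evaluate_search_quality` does.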
937
+
938
+ ---
939
+
940
+ ## Summary
941
+
942
+ **What We Built:**
943
+ 1. ✅ Document ingestion pipeline (markdown, PDF)
944
+ 2. ✅ Intelligent chunking with overlap
945
+ 3. ✅ Embedding generation with OpenAI
946
+ 4. ✅ Vector storage in Pinecone/Weaviate
947
+ 5. ✅ Semantic similarity search
948
+ 6. ✅ Hybrid search (vector + metadata filtering)
949
+ 7. ✅ Keyword boosting for precision
950
+ 8. ✅ RAG implementation for Q&A
951
+ 9. ✅ Performance evaluation metrics
952
+
953
+ **Key Takeaways:**
954
+ - Vector databases enable semantic search (meaning-based, not keyword-based)
955
+ - Chunking strategy affects search quality (512 tokens with a 50-token overlap is a reasonable starting point; tune it for your documents)
956
+ - Hybrid search (vector + metadata filtering, optionally with keyword boosting) often gives more precise results than pure vector search
957
+ - RAG combines retrieval with LLM generation to produce answers grounded in your documents
958
+ - Always measure search quality with precision, recall, and MRR metrics
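As a reminder of what "chunking with overlap" means in practice, here is a minimal sliding-window sketch. It is not the `DocumentChunker` used in the workflow. `chunk_tokens` is a hypothetical helper, and plain list items stand in for tokens, whereas a real pipeline would count tokens with the embedding model's tokenizer.

```python
from typing import List


def chunk_tokens(
    tokens: List[str],
    chunk_size: int = 512,
    overlap: int = 50,
) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# 1200 placeholder tokens standing in for a tokenized document.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens, chunk_size=512, overlap=50)
print([len(c) for c in chunks])
```

Each chunk begins with the last 50 tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.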
959
+
960
+ **Next Steps:**
961
+ - See `../rules/vector-databases.md` for vector database fundamentals
962
+ - See `../rules/vector-embeddings.md` for embedding strategies
963
+ - See `../rules/vector-indexing.md` for index optimization
964
+ - Experiment with different chunking strategies and embedding models
965
+ - Tune hybrid search weights for your use case