@botlearn/google-search 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025 BotLearn
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,35 @@
+ # @botlearn/google-search
+
+ > Advanced Google search query construction, result filtering, and relevance ranking for OpenClaw Agent
+
+ ## Installation
+
+ ```bash
+ # via npm
+ npm install @botlearn/google-search
+
+ # via clawhub
+ clawhub install @botlearn/google-search
+ ```
+
+ ## Category
+
+ Information Retrieval
+
+ ## Dependencies
+
+ None
+
+ ## Files
+
+ | File | Description |
+ |------|-------------|
+ | `manifest.json` | Skill metadata and configuration |
+ | `skill.md` | Role definition and activation rules |
+ | `knowledge/` | Domain knowledge documents |
+ | `strategies/` | Behavioral strategy definitions |
+ | `tests/` | Smoke and benchmark tests |
+
+ ## License
+
+ MIT
package/knowledge/anti-patterns.md ADDED
@@ -0,0 +1,62 @@
+ ---
+ domain: google-search
+ topic: anti-patterns
+ priority: medium
+ ttl: 30d
+ ---
+
+ # Google Search — Anti-Patterns
+
+ ## Query Construction Anti-Patterns
+
+ ### 1. Overly Long Queries
+ - **Problem**: Queries with 10+ terms dilute relevance; Google ignores excess terms
+ - **Fix**: Focus on 3-7 high-signal keywords, use operators to add precision without verbosity
+
+ ### 2. Natural Language Queries
+ - **Problem**: Searching "What is the best way to implement authentication in a React application?" treats every word equally
+ - **Fix**: Extract key terms: `React authentication implementation best practices`
+
+ ### 3. Missing Context Terms
+ - **Problem**: Searching `merge` without context returns results about git, mail merge, corporate mergers, etc.
+ - **Fix**: Add domain context: `git merge conflict resolution` or `pandas merge dataframe`
+
+ ### 4. Ignoring Operator Case Sensitivity
+ - **Problem**: `or` is treated as a regular word; only `OR` works as a Boolean operator
+ - **Fix**: Always use uppercase `OR`, and remember `-` must touch the excluded term (no space)
+
+ ### 5. Single-Query Dependency
+ - **Problem**: Relying on one query for complex, multi-faceted topics
+ - **Fix**: Decompose into 2-4 targeted sub-queries, merge results
+
+ ## Result Evaluation Anti-Patterns
+
+ ### 6. First-Result Bias
+ - **Problem**: Treating the first search result as the authoritative answer
+ - **Fix**: Examine at least 3-5 results; first result may be SEO-optimized, not most accurate
+
+ ### 7. Ignoring Source Verification
+ - **Problem**: Accepting information without checking the source's authority or recency
+ - **Fix**: Always check: Who published this? When? Are claims cited? Is the domain reputable?
+
+ ### 8. Single-Source Dependency
+ - **Problem**: Using only one source to answer a question
+ - **Fix**: Cross-reference key facts across 2-3 independent sources; flag single-source claims
+
+ ### 9. Ignoring Date Context
+ - **Problem**: Returning outdated information for rapidly evolving topics (frameworks, APIs, regulations)
+ - **Fix**: Use `after:` date filters; always note the publication date in results; flag if content may be outdated
+
+ ### 10. Content Farm Inclusion
+ - **Problem**: Including results from low-quality aggregator sites that scrape and rewrite content
+ - **Fix**: Exclude known content farms with `-site:`; prefer domains with original analysis or primary data
+
+ ## Output Anti-Patterns
+
+ ### 11. Raw URL Dumping
+ - **Problem**: Returning a list of URLs without context, relevance scores, or summaries
+ - **Fix**: Each result should include: title, source, date, relevance note, and a 1-2 sentence summary
+
+ ### 12. No Deduplication
+ - **Problem**: Returning the same information from multiple syndicated sources
+ - **Fix**: Deduplicate at content level, keep the primary/authoritative source
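The URL-level part of that deduplication rule can be sketched in a few lines. This is a minimal illustration, not code shipped in this package; the normalization rules (strip scheme, `www.`, trailing slash) and the `results` dict shape are assumptions:

```python
from urllib.parse import urlparse

def normalize_url(url):
    # Strip scheme, "www.", and trailing slash so that syndicated
    # copies of the same page compare equal at the URL level.
    p = urlparse(url)
    host = p.netloc.lower().removeprefix("www.")
    path = p.path.rstrip("/")
    return f"{host}{path}"

def dedupe(results):
    # results: list of dicts with a "url" key (assumed shape).
    # Keeps the first occurrence, which is the primary source if the
    # list is already ordered by source authority.
    seen = set()
    kept = []
    for r in results:
        key = normalize_url(r["url"])
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept
```

Content-level deduplication (the same article rehosted under different URLs) needs text similarity on the snippets and is deliberately out of scope here.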
package/knowledge/best-practices.md ADDED
@@ -0,0 +1,70 @@
+ ---
+ domain: google-search
+ topic: query-construction-and-quality
+ priority: high
+ ttl: 30d
+ ---
+
+ # Google Search — Best Practices
+
+ ## Query Construction Patterns
+
+ ### 1. Intent Classification First
+ Before constructing a query, classify the search intent:
+ - **Navigational** — User wants a specific site → use `site:` or direct URL terms
+ - **Informational** — User wants to learn → use descriptive terms + authoritative source filters
+ - **Transactional** — User wants to do something → include action verbs and tool names
+ - **Investigative** — User wants to compare/analyze → use comparison terms + multiple sources
+
+ ### 2. Keyword Selection
+ - Use **nouns and noun phrases** as primary search terms
+ - Prefer **specific technical terms** over generic descriptions
+ - Include **version numbers** for software-related queries (e.g., "React 18", "Python 3.12")
+ - Use the **terminology of the target domain** (e.g., "myocardial infarction" not "heart attack" for medical research)
+
+ ### 3. Query Decomposition for Complex Topics
+ When a topic is broad or multi-faceted:
+ 1. Break into 2-4 focused sub-queries
+ 2. Each sub-query targets one aspect
+ 3. Merge and deduplicate results
+ 4. Cross-reference findings across sub-queries
+
+ Example: "What are the environmental and economic impacts of electric vehicles?"
+ - Sub-query 1: `electric vehicles environmental impact lifecycle emissions`
+ - Sub-query 2: `electric vehicles economic analysis cost ownership`
+ - Sub-query 3: `EV vs ICE environmental comparison study`
+
+ ### 4. Iterative Refinement
+ - Start broad, then narrow based on initial results
+ - Add exclusions (`-term`) to filter noise discovered in initial results
+ - Switch to `site:` filters when you identify authoritative domains
+
+ ## Result Quality Assessment
+
+ ### Source Credibility Tiers
+
+ | Tier | Source Type | Trust Level | Examples |
+ |------|-----------|-------------|---------|
+ | T1 | Primary / Official | Highest | Government data, academic journals, official docs |
+ | T2 | Established Media | High | Reuters, AP, major newspapers, peer-reviewed blogs |
+ | T3 | Expert Community | Medium-High | Stack Overflow (high-rep), GitHub (popular repos), industry blogs |
+ | T4 | General Web | Medium | Wikipedia, Medium, personal blogs with citations |
+ | T5 | User-Generated | Low | Forums, social media, anonymous posts |
+
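One way to make the tier table operational is a small domain-to-tier lookup. A sketch only: the example domain lists are illustrative assumptions, not an official mapping shipped with this skill:

```python
# Illustrative mapping from the credibility tier table; extend per domain.
TIER_DOMAINS = {
    "T1": {".gov", "docs.python.org", "arxiv.org"},
    "T2": {"reuters.com", "apnews.com"},
    "T3": {"stackoverflow.com", "github.com"},
    "T4": {"wikipedia.org", "medium.com"},
}

def credibility_tier(domain):
    # Returns the first matching tier; unmatched domains fall through
    # to T5 (user-generated / unverified).
    domain = domain.lower()
    for tier, patterns in TIER_DOMAINS.items():
        for pat in patterns:
            suffix = pat if pat.startswith(".") else "." + pat
            # Match the host exactly, or any subdomain / TLD suffix.
            if domain == pat.lstrip(".") or domain.endswith(suffix):
                return tier
    return "T5"
```

A real implementation would also consider per-page signals (authorship, citations, date), since domain alone is a coarse proxy.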
+ ### Freshness Assessment
+ - Check publication date — prefer recent for technology, less critical for fundamentals
+ - Verify the content hasn't been superseded by newer information
+ - For software: match the version discussed to the user's version
+
+ ### Deduplication Strategy
+ 1. **URL-level** — Same URL from different queries → keep one
+ 2. **Content-level** — Same article syndicated across sites → keep the primary source
+ 3. **Fact-level** — Multiple sources stating the same fact → consolidate, cite best source
+
+ ## Result Ranking Criteria
+
+ Rank results by weighted combination:
+ 1. **Relevance** (40%) — How directly does it answer the query?
+ 2. **Source Authority** (25%) — Credibility tier of the source
+ 3. **Freshness** (20%) — How recent is the information?
+ 4. **Depth** (15%) — How comprehensive is the coverage?
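The weighted combination above is straightforward to compute. A minimal sketch, assuming each dimension has already been scored on a 0-100 scale (that scale and the dict shape are assumptions for illustration):

```python
# Weights from the ranking criteria above.
WEIGHTS = {"relevance": 0.40, "authority": 0.25, "freshness": 0.20, "depth": 0.15}

def weighted_score(result):
    # result: dict with a 0-100 score per dimension.
    return sum(WEIGHTS[dim] * result[dim] for dim in WEIGHTS)

def rank(results, top_n=10):
    # Sort by weighted score, descending, and keep the top N.
    return sorted(results, key=weighted_score, reverse=True)[:top_n]
```

Note how the weights encode the trade-off: a T1 source (high authority) can still lose to a fresher, more directly relevant T3 source.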
package/knowledge/domain.md ADDED
@@ -0,0 +1,90 @@
+ ---
+ domain: google-search
+ topic: search-operators-and-syntax
+ priority: high
+ ttl: 30d
+ ---
+
+ # Google Search — Operator Syntax & Query Construction
+
+ ## Core Search Operators
+
+ ### Exact Match
+ - `"machine learning"` — Match the exact phrase
+ - Use for: names, specific phrases, error messages, quotes
+
+ ### Boolean Operators
+ - `term1 OR term2` — Match either term (OR must be uppercase)
+ - `term1 | term2` — Alternative OR syntax
+ - `-term` — Exclude term from results
+ - `term1 term2` — Implicit AND (both terms required)
+
+ ### Site & Domain Filters
+ - `site:github.com` — Restrict to a specific domain
+ - `site:.edu` — Restrict to a TLD (educational institutions)
+ - `site:.gov` — Government sources only
+ - `-site:pinterest.com` — Exclude a domain
+
+ ### File Type Filters
+ - `filetype:pdf` — PDF documents only
+ - `filetype:csv` — CSV data files
+ - `filetype:pptx` — PowerPoint presentations
+ - Useful types: pdf, doc, docx, xls, xlsx, csv, ppt, pptx, txt
+
+ ### URL & Title Filters
+ - `intitle:"annual report"` — Term must appear in page title
+ - `allintitle:react hooks tutorial` — All terms in title
+ - `inurl:api` — Term must appear in URL
+ - `allinurl:docs api reference` — All terms in URL
+
+ ### Date & Range
+ - `after:2024-01-01` — Results published after date
+ - `before:2024-12-31` — Results published before date
+ - `2023..2024` — Numeric range (also works for prices, years)
+
+ ### Wildcard & Proximity
+ - `"machine * learning"` — Wildcard for unknown words
+ - `AROUND(3)` — Proximity search: terms within N words of each other
+ - Example: `"climate change" AROUND(5) "economic impact"`
+
+ ### Special Operators
+ - `cache:url` — Google's cached version of a page
+ - `related:nytimes.com` — Sites similar to a domain
+ - `define:term` — Dictionary definition
+ - `info:url` — Information about a URL
+ - Note: `cache:` and `info:` have been retired by Google, and `related:` is unreliable; do not depend on them
+
+ ## Operator Combinations
+
+ ### Academic Research
+ ```
+ "topic name" site:arxiv.org OR site:scholar.google.com filetype:pdf after:2023-01-01
+ ```
+
+ ### Technical Documentation
+ ```
+ "function name" site:docs.python.org OR site:developer.mozilla.org
+ ```
+
+ ### News with Source Quality
+ ```
+ "event name" site:reuters.com OR site:apnews.com OR site:bbc.com after:2024-06-01
+ ```
+
+ ### Code Examples
+ ```
+ "error message" site:stackoverflow.com OR site:github.com -"closed as duplicate"
+ ```
+
+ ### Competitive Analysis
+ ```
+ "company name" (review OR comparison OR alternative) -site:company.com after:2024-01-01
+ ```
+
+ ## Query Length Guidelines
+
+ | Query Type | Optimal Length | Example |
+ |-----------|---------------|---------|
+ | Simple fact | 2-4 terms | `python list comprehension` |
+ | Specific answer | 4-7 terms | `"react useEffect" cleanup function example` |
+ | Research | 5-10 terms + operators | `"transformer architecture" attention mechanism site:arxiv.org filetype:pdf after:2023` |
+ | Troubleshooting | Error message + context | `"TypeError: Cannot read property" react useState` |
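These operators compose mechanically, so a query string can be assembled from parts. A minimal sketch; the helper and its parameters are illustrative assumptions, not part of this package:

```python
def build_query(terms, phrases=(), sites=(), exclude_sites=(),
                filetype=None, after=None):
    # Assemble a Google query string from keywords and operators.
    parts = list(terms)
    parts += [f'"{p}"' for p in phrases]            # exact-match phrases
    if sites:
        # Multiple site: filters must be OR-ed; OR must be uppercase.
        parts.append("(" + " OR ".join(f"site:{s}" for s in sites) + ")")
    parts += [f"-site:{s}" for s in exclude_sites]  # domain exclusions
    if filetype:
        parts.append(f"filetype:{filetype}")
    if after:
        parts.append(f"after:{after}")              # YYYY-MM-DD
    return " ".join(parts)
```

For example, `build_query(["attention mechanism"], phrases=["transformer architecture"], sites=["arxiv.org"], filetype="pdf", after="2023-01-01")` reproduces the research pattern from the guidelines table.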
package/manifest.json ADDED
@@ -0,0 +1,26 @@
+ {
+ "name": "@botlearn/google-search",
+ "version": "0.1.0",
+ "description": "Advanced Google search query construction, result filtering, and relevance ranking for OpenClaw Agent",
+ "category": "information-retrieval",
+ "author": "BotLearn",
+ "benchmarkDimension": "information-retrieval",
+ "expectedImprovement": 30,
+ "dependencies": {},
+ "compatibility": {
+ "openclaw": ">=0.5.0"
+ },
+ "files": {
+ "skill": "skill.md",
+ "knowledge": [
+ "knowledge/domain.md",
+ "knowledge/best-practices.md",
+ "knowledge/anti-patterns.md"
+ ],
+ "strategies": [
+ "strategies/main.md"
+ ],
+ "smokeTest": "tests/smoke.json",
+ "benchmark": "tests/benchmark.json"
+ }
+ }
package/package.json ADDED
@@ -0,0 +1,35 @@
+ {
+ "name": "@botlearn/google-search",
+ "version": "0.1.0",
+ "description": "Advanced Google search query construction, result filtering, and relevance ranking for OpenClaw Agent",
+ "type": "module",
+ "main": "manifest.json",
+ "files": [
+ "manifest.json",
+ "skill.md",
+ "knowledge/",
+ "strategies/",
+ "tests/",
+ "README.md"
+ ],
+ "keywords": [
+ "botlearn",
+ "openclaw",
+ "skill",
+ "information-retrieval"
+ ],
+ "author": "BotLearn",
+ "license": "MIT",
+ "repository": {
+ "type": "git",
+ "url": "https://github.com/readai-team/botlearn-awesome-skills.git",
+ "directory": "packages/skills/google-search"
+ },
+ "homepage": "https://github.com/readai-team/botlearn-awesome-skills/tree/main/packages/skills/google-search",
+ "bugs": {
+ "url": "https://github.com/readai-team/botlearn-awesome-skills/issues"
+ },
+ "publishConfig": {
+ "access": "public"
+ }
+ }
package/skill.md ADDED
@@ -0,0 +1,42 @@
+ ---
+ name: google-search
+ role: Search Query Specialist
+ version: 1.0.0
+ triggers:
+ - "search for"
+ - "find information"
+ - "look up"
+ - "google"
+ - "search the web"
+ - "find sources"
+ ---
+
+ # Role
+
+ You are a Search Query Specialist. When activated, you construct precise, high-relevance search queries using advanced operators and multi-source strategies, then filter and rank results to surface the most valuable information.
+
+ # Capabilities
+
+ 1. Construct advanced search queries using Boolean operators, site-specific filters, date ranges, filetype filters, and exclusion keywords
+ 2. Decompose ambiguous or complex queries into targeted sub-queries for parallel execution
+ 3. Rank results by relevance, remove low-quality entries, and deduplicate across sources
+ 4. Assess source credibility using domain authority, publication date, and content signals
+ 5. Merge results from multiple sub-queries into a coherent, prioritized result set
+
+ # Constraints
+
+ 1. Never return results without verifying source credibility — always assess domain authority
+ 2. Never rely on a single search query for complex topics — decompose into sub-queries
+ 3. Never present duplicate content from different sources as separate results
+ 4. Always prefer primary sources over aggregators or content farms
+ 5. Always include date context when results may be time-sensitive
+
+ # Activation
+
+ WHEN the user requests a web search or information retrieval:
+ 1. Analyze the search intent and identify key entities, constraints, and scope
+ 2. Construct optimized queries following strategies/main.md
+ 3. Apply knowledge/domain.md for operator syntax
+ 4. Filter and rank results using knowledge/best-practices.md
+ 5. Verify against knowledge/anti-patterns.md to avoid common mistakes
+ 6. Output ranked results with source credibility annotations
package/strategies/main.md ADDED
@@ -0,0 +1,70 @@
+ ---
+ strategy: google-search
+ version: 1.0.0
+ steps: 6
+ ---
+
+ # Google Search Strategy
+
+ ## Step 1: Intent Analysis
+ - Parse the user's request to identify: **topic**, **scope**, **constraints**, **desired output format**
+ - Classify search intent: navigational / informational / transactional / investigative
+ - Identify time sensitivity — does the user need current or historical information?
+ - IF the query is ambiguous THEN ask one clarifying question before proceeding
+ - Extract key entities: names, technologies, versions, dates, locations
+
+ ## Step 2: Query Construction
+ - SELECT query strategy based on complexity:
+ - Simple fact → Single targeted query with 3-5 keywords
+ - Specific answer → Keyword query + site/filetype operators
+ - Multi-faceted research → Decompose into 2-4 sub-queries
+ - Troubleshooting → Error message (exact match) + context terms
+ - APPLY operators from knowledge/domain.md:
+ - Use `"exact phrases"` for specific terms, names, error messages
+ - Use `site:` to target authoritative domains for the topic
+ - Use `after:` for time-sensitive queries
+ - Use `-site:` to exclude known low-quality sources
+ - Use `filetype:` when the user needs specific document types
+ - VERIFY query length is 3-10 terms (excluding operators)
+
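The length check at the end of Step 2 can be made concrete by counting only bare keywords, not operator clauses. A rough sketch; the operator-prefix list is an assumption, and quoted multi-word phrases are counted per word here for simplicity:

```python
# Operator prefixes from knowledge/domain.md (illustrative subset).
OPERATOR_PREFIXES = ("site:", "filetype:", "after:", "before:",
                     "intitle:", "allintitle:", "inurl:", "allinurl:")

def keyword_count(query):
    # Count plain terms, ignoring operator clauses and Boolean OR.
    count = 0
    for token in query.split():
        if token in ("OR", "|"):
            continue
        if token.startswith(OPERATOR_PREFIXES):
            continue
        if token.startswith("-"):
            continue  # exclusions (-term, -site:) are operators, not keywords
        count += 1
    return count

def length_ok(query):
    # The 3-10 term window from the VERIFY rule above.
    return 3 <= keyword_count(query) <= 10
```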
+ ## Step 3: Multi-Source Execution
+ - Execute primary query
+ - IF topic is multi-faceted THEN execute sub-queries in parallel
+ - For each query, collect top 10 raw results with: URL, title, snippet, date, domain
+ - IF initial results are poor quality THEN refine query:
+ - Add exclusion operators for noise sources
+ - Narrow with additional `site:` filters
+ - Try alternative terminology
+
+ ## Step 4: Deduplication & Filtering
+ - Remove exact URL duplicates across queries
+ - Detect content-level duplicates (same article on different domains) → keep primary source
+ - Filter out results matching anti-patterns from knowledge/anti-patterns.md:
+ - Content farms and aggregator sites
+ - Outdated content (for time-sensitive topics)
+ - Results with no clear authorship or date
+ - Verify remaining results against source credibility tiers from knowledge/best-practices.md
+
+ ## Step 5: Relevance Ranking
+ - Score each result on 4 dimensions (from knowledge/best-practices.md):
+ - **Relevance** (40%) — How directly does it answer the query?
+ - **Source Authority** (25%) — Credibility tier (T1-T5)
+ - **Freshness** (20%) — Publication recency relative to topic
+ - **Depth** (15%) — Comprehensiveness of coverage
+ - Sort by weighted score, descending
+ - Select top 5-10 results for output
+
+ ## Step 6: Output & Verification
+ - Present results in structured format:
+ - **Rank** — Position in relevance order
+ - **Title** — Page title
+ - **Source** — Domain + credibility tier
+ - **Date** — Publication date
+ - **Summary** — 1-2 sentence description of what the page contains
+ - **Relevance** — Why this result is useful for the query
+ - IF multiple sub-queries were used THEN provide a synthesis section connecting findings
+ - SELF-CHECK:
+ - Are results from diverse, credible sources? (not all from one domain)
+ - Is the most relevant result ranked first?
+ - Are all results genuinely addressing the user's intent?
+ - IF any check fails THEN loop back to Step 3 with refined queries
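The first SELF-CHECK (source diversity) is mechanical enough to sketch. The `results` shape and the one-half threshold are assumptions for illustration:

```python
from collections import Counter

def diverse_enough(results, max_share=0.5):
    # Fail the check when any single domain contributes more than
    # max_share of the result set (trivially passes with < 2 results).
    if len(results) < 2:
        return True
    counts = Counter(r["domain"] for r in results)
    return counts.most_common(1)[0][1] / len(results) <= max_share
```

The other two checks (top result relevance, intent match) are judgment calls for the agent rather than pure code, which is why the strategy phrases them as questions.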
package/tests/benchmark.json ADDED
@@ -0,0 +1,476 @@
+ {
+ "version": "0.0.1",
+ "dimension": "information-retrieval",
+ "tasks": [
+ {
+ "id": "bench-easy-01",
+ "difficulty": "easy",
+ "description": "Simple factual search with clear answer",
+ "input": "Find the official documentation page for Python's asyncio library, specifically the section on Tasks and Coroutines.",
+ "rubric": [
+ {
+ "criterion": "Relevance",
+ "weight": 0.4,
+ "scoring": {
+ "5": "Returns the exact official Python docs page for asyncio tasks and coroutines",
+ "3": "Returns Python docs but not the specific section",
+ "1": "Returns third-party tutorials instead of official docs",
+ "0": "Irrelevant results"
+ }
+ },
+ {
+ "criterion": "Query Quality",
+ "weight": 0.3,
+ "scoring": {
+ "5": "Uses site:docs.python.org with targeted terms",
+ "3": "Reasonable query but without site filter",
+ "1": "Overly broad query",
+ "0": "No query optimization"
+ }
+ },
+ {
+ "criterion": "Output Quality",
+ "weight": 0.3,
+ "scoring": {
+ "5": "Clear result with URL, description, and relevance note",
+ "3": "URL with basic description",
+ "1": "URL only",
+ "0": "No usable output"
+ }
+ }
+ ],
+ "expectedScoreWithout": 40,
+ "expectedScoreWith": 80
+ },
+ {
+ "id": "bench-easy-02",
+ "difficulty": "easy",
+ "description": "Find a specific error message solution",
+ "input": "Search for solutions to this error: 'TypeError: Cannot read properties of undefined (reading map)' in a React component that uses useState.",
+ "rubric": [
+ {
+ "criterion": "Relevance",
+ "weight": 0.4,
+ "scoring": {
+ "5": "Results directly address this specific TypeError in React with useState context; includes root cause and fix",
+ "3": "Results address the TypeError but not specifically in React/useState context",
+ "1": "Generic JavaScript error results",
+ "0": "Irrelevant results"
+ }
+ },
+ {
+ "criterion": "Query Quality",
+ "weight": 0.3,
+ "scoring": {
+ "5": "Uses exact error message in quotes plus React/useState context, targets Stack Overflow or GitHub",
+ "3": "Includes error message but missing context terms",
+ "1": "Paraphrases error instead of exact match",
+ "0": "No query optimization"
+ }
+ },
+ {
+ "criterion": "Output Quality",
+ "weight": 0.3,
+ "scoring": {
+ "5": "Ranked results with source quality indicators; top result explains root cause",
+ "3": "Listed results with basic descriptions",
+ "1": "Unstructured URL list",
+ "0": "No usable output"
+ }
+ }
+ ],
+ "expectedScoreWithout": 35,
+ "expectedScoreWith": 75
+ },
+ {
+ "id": "bench-easy-03",
+ "difficulty": "easy",
+ "description": "Find official statistics from a government source",
+ "input": "Find the latest US Bureau of Labor Statistics data on unemployment rates by industry sector.",
+ "rubric": [
+ {
+ "criterion": "Relevance",
+ "weight": 0.4,
+ "scoring": {
+ "5": "Returns BLS.gov page with unemployment data broken down by industry",
+ "3": "Returns BLS data but not sector-specific breakdown",
+ "1": "Returns news articles about unemployment instead of primary data",
+ "0": "Irrelevant results"
+ }
+ },
+ {
+ "criterion": "Query Quality",
+ "weight": 0.3,
+ "scoring": {
+ "5": "Uses site:bls.gov with specific terms for industry sector unemployment",
+ "3": "Targets government sites but query could be more specific",
+ "1": "Generic unemployment search",
+ "0": "No query optimization"
+ }
+ },
+ {
+ "criterion": "Source Authority",
+ "weight": 0.3,
+ "scoring": {
+ "5": "Primary result is from bls.gov (T1 source); clearly identified as official data",
+ "3": "Includes BLS data but also mixes in news articles",
+ "1": "Mostly secondary sources reporting on BLS data",
+ "0": "No authoritative sources"
+ }
+ }
+ ],
+ "expectedScoreWithout": 35,
+ "expectedScoreWith": 80
+ },
+ {
+ "id": "bench-med-01",
+ "difficulty": "medium",
+ "description": "Multi-aspect research query requiring decomposition",
+ "input": "Research the trade-offs between microservices and monolithic architecture for a startup with a team of 5 developers building a B2B SaaS product. I need perspectives on development speed, operational complexity, and scalability.",
+ "rubric": [
+ {
+ "criterion": "Query Decomposition",
+ "weight": 0.3,
+ "scoring": {
+ "5": "Decomposes into 3+ sub-queries targeting different aspects (dev speed, ops complexity, scalability) with startup/small-team context",
+ "3": "Uses 2 sub-queries but misses some aspects",
+ "1": "Single broad query covering all aspects",
+ "0": "No decomposition attempted"
+ }
+ },
+ {
+ "criterion": "Result Relevance",
+ "weight": 0.3,
+ "scoring": {
+ "5": "Results address all 3 aspects with startup-specific context; includes case studies or data-driven analysis",
+ "3": "Covers most aspects but lacks startup-specific perspective",
+ "1": "Generic microservices vs monolith content without addressing specific concerns",
+ "0": "Irrelevant results"
+ }
+ },
+ {
+ "criterion": "Source Diversity",
+ "weight": 0.2,
+ "scoring": {
+ "5": "Mix of engineering blogs, case studies, technical publications, and expert opinions from different companies",
+ "3": "2-3 source types represented",
+ "1": "All results from similar sources",
+ "0": "Single source or low-quality sources"
+ }
+ },
+ {
+ "criterion": "Synthesis",
+ "weight": 0.2,
+ "scoring": {
+ "5": "Results are organized by aspect with a summary connecting findings across sub-queries",
+ "3": "Results grouped but no cross-query synthesis",
+ "1": "Flat list of results",
+ "0": "No organization"
+ }
+ }
+ ],
+ "expectedScoreWithout": 30,
+ "expectedScoreWith": 70
+ },
+ {
+ "id": "bench-med-02",
+ "difficulty": "medium",
+ "description": "Time-sensitive search with source quality filtering",
+ "input": "Find the most recent security advisories and CVEs related to Node.js published in 2024 or later. Focus on critical and high severity vulnerabilities. Exclude general security blogs and focus on official sources.",
+ "rubric": [
+ {
+ "criterion": "Query Precision",
+ "weight": 0.25,
+ "scoring": {
+ "5": "Uses date filters (after:2024), targets official sources (site:nodejs.org, site:nvd.nist.gov, site:cve.org), excludes blogs",
+ "3": "Uses some filters but doesn't fully restrict to official sources",
+ "1": "Basic search without date or source filters",
+ "0": "No query optimization"
+ }
+ },
+ {
+ "criterion": "Result Accuracy",
+ "weight": 0.35,
+ "scoring": {
+ "5": "Returns actual CVEs with correct IDs, severity ratings, affected versions; all from 2024+",
+ "3": "Returns relevant security content but some items are older or not official CVEs",
+ "1": "Mix of relevant and outdated security information",
+ "0": "Incorrect or irrelevant results"
+ }
+ },
+ {
+ "criterion": "Source Authority",
+ "weight": 0.25,
+ "scoring": {
+ "5": "All results from T1-T2 sources (NVD, Node.js official, security advisories)",
+ "3": "Mostly authoritative with some secondary sources",
+ "1": "Relies on secondary reporting",
+ "0": "Unverified sources"
+ }
+ },
+ {
+ "criterion": "Output Structure",
+ "weight": 0.15,
+ "scoring": {
+ "5": "Each CVE listed with: ID, severity, affected versions, date, source link",
+ "3": "CVEs listed but missing some metadata",
+ "1": "Unstructured list",
+ "0": "No organization"
+ }
+ }
+ ],
+ "expectedScoreWithout": 30,
+ "expectedScoreWith": 70
+ },
+ {
+ "id": "bench-med-03",
+ "difficulty": "medium",
+ "description": "Comparative search requiring cross-source validation",
+ "input": "Compare the pricing, features, and developer experience of Supabase vs Firebase vs PlanetScale for a new project. I need current pricing pages, feature comparison articles, and real developer reviews.",
+ "rubric": [
+ {
+ "criterion": "Coverage",
+ "weight": 0.3,
+ "scoring": {
+ "5": "Returns results covering all 3 axes (pricing, features, DX) for all 3 services; includes official pricing pages",
+ "3": "Covers 2 of 3 axes or misses one service",
+ "1": "Only covers one aspect or one service",
+ "0": "Irrelevant results"
+ }
+ },
+ {
+ "criterion": "Query Strategy",
+ "weight": 0.25,
+ "scoring": {
+ "5": "Decomposes into targeted sub-queries: official pricing pages (site:), comparison articles, developer reviews (site:reddit.com OR site:news.ycombinator.com)",
+ "3": "Uses 2 sub-queries but misses some source types",
+ "1": "Single generic comparison query",
+ "0": "No strategy"
+ }
+ },
+ {
+ "criterion": "Source Mix",
+ "weight": 0.25,
+ "scoring": {
+ "5": "Includes official pages, independent comparison articles, and authentic developer reviews from different platforms",
+ "3": "2 of 3 source types present",
+ "1": "Only one source type (e.g., all blog posts)",
+ "0": "Low-quality or biased sources"
+ }
+ },
+ {
+ "criterion": "Freshness",
+ "weight": 0.2,
+ "scoring": {
+ "5": "All results from 2024+; pricing data is current; notes any rapid changes",
+ "3": "Most results are recent but some may be outdated",
+ "1": "Mix of current and outdated information",
+ "0": "Mostly outdated"
+ }
+ }
+ ],
+ "expectedScoreWithout": 30,
+ "expectedScoreWith": 70
+ },
+ {
+ "id": "bench-med-04",
+ "difficulty": "medium",
+ "description": "Niche topic search requiring domain expertise in query construction",
+ "input": "Find research papers and technical reports on using retrieval-augmented generation (RAG) to reduce hallucination in large language models. Focus on evaluation methodologies and quantitative results published in 2023 or later.",
+ "rubric": [
+ {
+ "criterion": "Query Precision",
+ "weight": 0.25,
+ "scoring": {
+ "5": "Uses domain-specific terminology (RAG, hallucination, LLM); targets academic sources (arxiv, semantic scholar); uses date filters",
+ "3": "Good terminology but doesn't target academic sources specifically",
+ "1": "Uses general terms instead of domain-specific ones",
+ "0": "No domain awareness in query"
+ }
+ },
+ {
+ "criterion": "Result Relevance",
+ "weight": 0.35,
+ "scoring": {
+ "5": "Returns papers specifically about RAG for hallucination reduction with evaluation metrics and quantitative results",
+ "3": "Returns relevant RAG papers but not focused on hallucination evaluation",
+ "1": "General LLM or RAG papers without hallucination focus",
+ "0": "Irrelevant results"
+ }
+ },
+ {
+ "criterion": "Source Quality",
+ "weight": 0.2,
+ "scoring": {
+ "5": "All results are peer-reviewed papers or preprints from reputable venues; citation counts noted",
+ "3": "Mostly academic sources with some blog posts",
+ "1": "Primarily non-academic sources",
+ "0": "No academic sources"
+ }
+ },
+ {
+ "criterion": "Output Metadata",
+ "weight": 0.2,
+ "scoring": {
+ "5": "Each paper includes: title, authors, venue/date, key findings, evaluation methodology used",
+ "3": "Papers listed with title and summary but missing some metadata",
+ "1": "Titles and URLs only",
+ "0": "No metadata"
+ }
+ }
+ ],
+ "expectedScoreWithout": 25,
+ "expectedScoreWith": 65
+ },
+ {
+ "id": "bench-hard-01",
+ "difficulty": "hard",
+ "description": "Adversarial search with significant noise and SEO spam",
+ "input": "Find genuine, unbiased reviews and benchmarks of the top 5 VPN services. Exclude affiliate marketing content, sponsored reviews, and VPN company blogs. I need independent security audits, speed test data from reputable testers, and privacy policy analyses.",
+ "rubric": [
+ {
+ "criterion": "Noise Filtering",
+ "weight": 0.3,
+ "scoring": {
+ "5": "Successfully excludes affiliate sites, sponsored content, and VPN company marketing; explains filtering strategy",
+ "3": "Filters some noise but includes 1-2 affiliate or sponsored results",
+ "1": "Minimal filtering; includes obvious affiliate content",
+ "0": "No noise filtering; results dominated by affiliate content"
+ }
+ },
+ {
+ "criterion": "Source Independence",
+ "weight": 0.3,
+ "scoring": {
+ "5": "Returns independent security audits (e.g., from security researchers), academic analyses, and consumer reports; no financial ties to VPN companies",
+ "3": "Mostly independent but 1-2 sources have potential conflicts of interest",
+ "1": "Sources have unclear independence",
+ "0": "All sources are financially connected to VPN companies"
+ }
+ },
+ {
+ "criterion": "Data Quality",
+ "weight": 0.25,
+ "scoring": {
+ "5": "Includes actual speed test data, security audit reports, and privacy policy analyses with specific findings",
+ "3": "Includes some quantitative data but lacks depth",
+ "1": "Mostly subjective opinions without data",
+ "0": "No quantitative data"
+ }
+ },
+ {
+ "criterion": "Query Strategy",
+ "weight": 0.15,
+ "scoring": {
+ "5": "Uses aggressive exclusion operators (-affiliate, -sponsored, -site:vpncompany.com); targets specific source types (security research, consumer reports)",
+ "3": "Some exclusions but not comprehensive",
+ "1": "Basic search without exclusion strategy",
+ "0": "No strategy for avoiding biased content"
+ }
+ }
+ ],
+ "expectedScoreWithout": 20,
+ "expectedScoreWith": 60
+ },
+ {
+ "id": "bench-hard-02",
+ "difficulty": "hard",
+ "description": "Cross-domain research requiring synthesis from diverse sources",
+ "input": "Research the intersection of climate change policy, agricultural technology, and food security in Sub-Saharan Africa. I need: (1) recent policy frameworks from international organizations, (2) agritech innovations being deployed in the region, and (3) quantitative data on food security trends. Provide a synthesis connecting these three areas.",
380
+ "rubric": [
381
+ {
382
+ "criterion": "Query Decomposition",
383
+ "weight": 0.25,
384
+ "scoring": {
385
+ "5": "Creates 3+ targeted sub-queries for each domain (policy, agritech, food security data) with regional focus; uses appropriate source filters for each",
386
+ "3": "2 sub-queries but misses one domain",
387
+ "1": "Single broad query attempting to cover all domains",
388
+ "0": "No decomposition"
389
+ }
390
+ },
391
+ {
392
+ "criterion": "Coverage Breadth",
393
+ "weight": 0.25,
394
+ "scoring": {
395
+ "5": "Returns results covering all 3 domains with Sub-Saharan Africa focus; includes international org reports, tech publications, and data sources",
396
+ "3": "Covers 2 of 3 domains adequately",
397
+ "1": "Only covers one domain or lacks regional specificity",
398
+ "0": "Irrelevant results"
399
+ }
400
+ },
401
+ {
402
+ "criterion": "Source Authority",
403
+ "weight": 0.25,
404
+ "scoring": {
405
+ "5": "Includes T1 sources: UN/FAO/World Bank reports, peer-reviewed research, official government data",
406
+ "3": "Mix of authoritative and secondary sources",
407
+ "1": "Primarily news articles or opinion pieces",
408
+ "0": "Unreliable sources"
409
+ }
410
+ },
411
+ {
412
+ "criterion": "Cross-Domain Synthesis",
413
+ "weight": 0.25,
414
+ "scoring": {
415
+ "5": "Provides a coherent synthesis connecting policy, technology, and data; identifies gaps and opportunities at intersections",
416
+ "3": "Results are organized by domain but synthesis is superficial",
417
+ "1": "No attempt to connect findings across domains",
418
+ "0": "Disorganized output"
419
+ }
420
+ }
421
+ ],
422
+ "expectedScoreWithout": 20,
423
+ "expectedScoreWith": 60
424
+ },
425
+ {
426
+ "id": "bench-hard-03",
427
+ "difficulty": "hard",
428
+ "description": "Ambiguous query requiring clarification and multi-interpretation search",
429
+ "input": "Find information about Mercury's transit.",
430
+ "rubric": [
431
+ {
432
+ "criterion": "Ambiguity Recognition",
433
+ "weight": 0.3,
434
+ "scoring": {
435
+ "5": "Recognizes multiple interpretations (astronomical transit of Mercury across the Sun, Mercury as a car brand, Mercury in astrology, mercury in chemistry) and either asks for clarification or searches for the most likely interpretations",
436
+ "3": "Recognizes 2 interpretations and addresses them",
437
+ "1": "Assumes one interpretation without acknowledging ambiguity",
438
+ "0": "No awareness of ambiguity"
439
+ }
440
+ },
441
+ {
442
+ "criterion": "Query Strategy",
443
+ "weight": 0.25,
444
+ "scoring": {
445
+ "5": "Constructs disambiguation queries; uses site: and context terms to separate interpretations; provides results for top 2-3 interpretations",
446
+ "3": "Searches for primary interpretation with some disambiguation",
447
+ "1": "Single query without disambiguation",
448
+ "0": "No strategy"
449
+ }
450
+ },
451
+ {
452
+ "criterion": "Result Quality",
453
+ "weight": 0.25,
454
+ "scoring": {
455
+ "5": "Results for each interpretation are from authoritative sources (NASA for astronomy, relevant domain sources for others)",
456
+ "3": "Reasonable results but from mixed-quality sources",
457
+ "1": "Low-quality or tangential results",
458
+ "0": "Irrelevant results"
459
+ }
460
+ },
461
+ {
462
+ "criterion": "User Guidance",
463
+ "weight": 0.2,
464
+ "scoring": {
465
+ "5": "Clearly labels results by interpretation; suggests how user can narrow the search; provides reasoning for ranking",
466
+ "3": "Some labeling but insufficient guidance",
467
+ "1": "Results presented without interpretation labels",
468
+ "0": "Confusing output mixing interpretations"
469
+ }
470
+ }
471
+ ],
472
+ "expectedScoreWithout": 25,
473
+ "expectedScoreWith": 65
474
+ }
475
+ ]
476
+ }
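The hard benchmarks above reward aggressive exclusion operators (`-affiliate`, `-sponsored`, `-site:vpncompany.com`) and date filters (`after:`). A minimal sketch of composing such a query string — the `build_query` helper and its parameters are hypothetical illustrations, not part of this package's API:

```python
def build_query(terms, exclude_terms=(), exclude_sites=(), after=None):
    """Compose a Google-style query string with exclusion operators.

    `terms` are required keywords; each entry in `exclude_terms`
    becomes `-term`, each entry in `exclude_sites` becomes
    `-site:domain`, and `after` appends a date filter.
    """
    parts = list(terms)
    parts += [f"-{t}" for t in exclude_terms]
    parts += [f"-site:{s}" for s in exclude_sites]
    if after:
        parts.append(f"after:{after}")
    return " ".join(parts)

# Mirrors the filtering strategy scored in bench-hard-01
query = build_query(
    ["VPN", "independent", "security audit"],
    exclude_terms=["affiliate", "sponsored"],
    exclude_sites=["vpncompany.com"],
    after="2023-01-01",
)
```

A grader checking the "Query Strategy" criterion could then look for these operators in the agent's emitted queries.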
@@ -0,0 +1,54 @@
1
+ {
2
+ "version": "0.0.1",
3
+ "timeout": 60,
4
+ "tasks": [
5
+ {
6
+ "id": "smoke-01",
7
+ "description": "Search for recent best practices on a technical topic with source quality filtering",
8
+ "input": "Find the most authoritative and recent resources on implementing OAuth 2.0 with PKCE flow in a single-page application. I need official documentation, security considerations, and practical implementation guides. Filter out outdated content (before 2023).",
9
+ "rubric": [
10
+ {
11
+ "criterion": "Query Construction",
12
+ "weight": 0.25,
13
+ "scoring": {
14
+ "5": "Uses advanced operators (site:, after:, exact phrases), decomposes into sub-queries for different aspects (docs, security, implementation)",
15
+ "3": "Uses some operators but relies on a single broad query",
16
+ "1": "Basic keyword search with no operators",
17
+ "0": "Searches the raw user input as-is"
18
+ }
19
+ },
20
+ {
21
+ "criterion": "Result Relevance",
22
+ "weight": 0.3,
23
+ "scoring": {
24
+ "5": "All results directly address OAuth 2.0 PKCE in SPAs; includes official RFC/docs, security analysis, and code examples",
25
+ "3": "Most results are relevant but some are tangential (e.g., general OAuth without PKCE, server-side flow)",
26
+ "1": "Mix of relevant and irrelevant results; generic authentication articles",
27
+ "0": "Results do not address the specific query"
28
+ }
29
+ },
30
+ {
31
+ "criterion": "Source Quality & Diversity",
32
+ "weight": 0.25,
33
+ "scoring": {
34
+ "5": "Results from 3+ credibility tiers (e.g., RFC/official docs, security blogs, developer guides); no content farms; sources clearly attributed",
35
+ "3": "Results from 2 source types; mostly credible but some questionable sources included",
36
+ "1": "Results from a single source type or includes low-quality sources",
37
+ "0": "Sources are unreliable or unverified"
38
+ }
39
+ },
40
+ {
41
+ "criterion": "Output Structure",
42
+ "weight": 0.2,
43
+ "scoring": {
44
+ "5": "Each result has: title, source, date, credibility assessment, and relevance summary; results are ranked; synthesis provided",
45
+ "3": "Results have titles and URLs but missing some metadata; basic ranking",
46
+ "1": "Unstructured list of URLs or titles",
47
+ "0": "Raw output with no organization"
48
+ }
49
+ }
50
+ ],
51
+ "passThreshold": 60
52
+ }
53
+ ]
54
+ }
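Each rubric pairs per-criterion weights (summing to 1.0) with 0-5 scoring anchors, while `passThreshold` and the `expectedScore*` fields are on a 0-100 scale, suggesting the weighted 0-5 average is scaled by 20. A minimal scoring sketch under that assumption — the scaling convention is inferred from the field ranges, not documented by the package:

```python
def rubric_score(rubric, criterion_scores):
    """Weighted rubric score on a 0-100 scale.

    `criterion_scores` maps criterion name -> awarded score (0-5).
    Assumes the criterion weights sum to 1.0, as in the files above;
    the x20 scaling from a 0-5 average to 0-100 is an assumption.
    """
    weighted = sum(
        item["weight"] * criterion_scores[item["criterion"]]
        for item in rubric
    )
    return weighted * 20

# Weights from the smoke-01 task above; awarded scores are illustrative.
rubric = [
    {"criterion": "Query Construction", "weight": 0.25},
    {"criterion": "Result Relevance", "weight": 0.3},
    {"criterion": "Source Quality & Diversity", "weight": 0.25},
    {"criterion": "Output Structure", "weight": 0.2},
]
score = rubric_score(rubric, {
    "Query Construction": 5,
    "Result Relevance": 3,
    "Source Quality & Diversity": 3,
    "Output Structure": 5,
})
passed = score >= 60  # smoke-01's passThreshold
```

Under this convention, the gap between `expectedScoreWithout` and `expectedScoreWith` (e.g., 20 vs. 60 on the hard tasks) measures the lift the skill is expected to provide.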