@botlearn/twitter-intel 0.1.0
- package/LICENSE +21 -0
- package/README.md +35 -0
- package/knowledge/anti-patterns.md +86 -0
- package/knowledge/best-practices.md +117 -0
- package/knowledge/domain.md +140 -0
- package/manifest.json +26 -0
- package/package.json +35 -0
- package/skill.md +48 -0
- package/strategies/main.md +113 -0
- package/tests/benchmark.json +476 -0
- package/tests/smoke.json +54 -0
package/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 BotLearn

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
package/README.md
ADDED
@@ -0,0 +1,35 @@
# @botlearn/twitter-intel

> Twitter/X platform intelligence gathering — tracking KOLs, extracting trending topics, analyzing engagement signals, detecting bot activity, and synthesizing actionable insights for OpenClaw Agent

## Installation

```bash
# via npm
npm install @botlearn/twitter-intel

# via clawhub
clawhub install @botlearn/twitter-intel
```

## Category

Information Retrieval

## Dependencies

None

## Files

| File | Description |
|------|-------------|
| `manifest.json` | Skill metadata and configuration |
| `skill.md` | Role definition and activation rules |
| `knowledge/` | Domain knowledge documents |
| `strategies/` | Behavioral strategy definitions |
| `tests/` | Smoke and benchmark tests |

## License

MIT
package/knowledge/anti-patterns.md
ADDED
@@ -0,0 +1,86 @@
---
domain: twitter-intel
topic: anti-patterns
priority: medium
ttl: 30d
---

# Twitter Intelligence — Anti-Patterns

## Engagement Blindness Anti-Patterns

### 1. Equating Virality with Credibility
- **Problem**: Treating a tweet with 50K retweets as inherently more credible than one with 500 retweets
- **Why it fails**: Virality is driven by emotional resonance, controversy, and algorithmic amplification — not accuracy. Misinformation routinely outperforms corrections in engagement metrics
- **Fix**: Always evaluate the source account's credibility score independently of engagement numbers. A Nano-KOL with domain expertise and 200 likes may carry more intelligence value than a Mega-KOL hot take with 100K likes

### 2. Like Count as Sentiment Proxy
- **Problem**: Interpreting high like counts as public agreement or approval
- **Why it fails**: Users like tweets for many reasons: humor, relatability, bookmarking, ironic appreciation. Likes on sarcastic or critical tweets are easily misread as endorsement of the surface message
- **Fix**: Use explicit textual sentiment analysis on the tweet content itself. Treat likes as an attention signal, not an opinion signal. Cross-reference with reply sentiment for ground truth

### 3. Follower Count as Authority Measure
- **Problem**: Automatically ranking accounts with more followers as more authoritative
- **Why it fails**: Follower counts can be inflated through purchasing, follow-back schemes, or historical virality unrelated to current domain expertise. Many genuine domain experts have modest followings
- **Fix**: Use the composite credibility score from knowledge/best-practices.md, which weights listed count, original content ratio, and engagement quality alongside follower count

### 4. Impression Count Overreliance
- **Problem**: Treating impression counts as a reliable measure of message reach and impact
- **Why it fails**: Impressions measure timeline appearances, not actual reading. Auto-scrolling, algorithmic insertion, and muted accounts all inflate impressions without genuine attention
- **Fix**: Use engagement rate (interactions / impressions) as the meaningful metric. Low engagement rate on high impressions suggests passive exposure, not active reception

## Sarcasm & Tone Anti-Patterns

### 5. Ignoring Sarcasm and Irony Markers
- **Problem**: Taking tweet text at face value without assessing tone, leading to inverted sentiment classification
- **Why it fails**: Twitter culture is heavily sarcastic. Tweets like "Oh great, another data breach, exactly what we needed" would be classified as positive without tone analysis
- **Fix**: Check for sarcasm indicators:
  - Quotation marks around praise ("great" move)
  - Hyperbolic positive language in negative contexts
  - Eye-roll or clown emojis following a statement
  - "Surely" / "definitely" / "totally" in contexts that suggest the opposite
  - Thread context: sarcastic reply to a serious tweet
  - Account history: does this user typically use irony?

### 6. Context-Free Quote Tweet Analysis
- **Problem**: Analyzing the quoted tweet's text without considering the quoting user's commentary
- **Why it fails**: Quote tweets are frequently used to disagree, mock, or add critical commentary. The quoted content and the quoting commentary often have opposite sentiments
- **Fix**: Always analyze the quote tweet as a composite: original text + quoting commentary + any added media. The quoting user's intent is the primary signal, not the original content

### 7. Thread Fragment Extraction
- **Problem**: Extracting a single tweet from a multi-tweet thread and analyzing it in isolation
- **Why it fails**: Thread authors build nuanced arguments across tweets. A single tweet may contain a devil's advocate position, a setup for a counterpoint, or a hypothetical — all of which are misrepresented when isolated
- **Fix**: Always retrieve and analyze the complete thread (using `conversation_id`). Attribute the overall thread thesis, not individual tweet fragments

## Bot Amplification Anti-Patterns

### 8. Treating All Engagement as Organic
- **Problem**: Including bot-generated retweets, likes, and replies in engagement metrics and trend calculations without filtering
- **Why it fails**: Bot networks can manufacture artificial trends, inflate engagement by 10-100x, and create a false impression of consensus. Reporting bot-amplified metrics as organic misleads intelligence consumers
- **Fix**: Run bot detection heuristics (knowledge/domain.md) on the engagement sources before reporting metrics. Report both raw and bot-filtered engagement numbers. Flag topics where >15% of engagement comes from suspected bots

### 9. Coordinated Hashtag Campaigns as Organic Trends
- **Problem**: Reporting a trending hashtag as an organic trend when it is being driven by a coordinated campaign
- **Why it fails**: Organized groups (political campaigns, marketing agencies, state actors) routinely coordinate hashtag pushes. The hashtag may trend without genuine organic interest
- **Fix**: Check for coordination signals:
  - Multiple accounts tweeting the same hashtag within the same 5-minute window with similar/identical text
  - Accounts in the campaign share creation dates, follower patterns, or bio templates
  - Hashtag volume drops sharply after the campaign window — organic trends have longer tails
  - Compare the hashtag's geographic spread — coordinated campaigns often originate from a single region
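The first coordination signal above can be sketched as a simple bucketing check: group tweets by normalized text and 5-minute time slot, and flag buckets shared by several accounts. The tweet field names (`text`, `timestamp`, `author_id`) and the `min_accounts` cut-off are illustrative assumptions, not part of any API.

```python
from collections import defaultdict

def coordinated_groups(tweets, window_secs=300, min_accounts=3):
    """Flag groups of accounts posting near-identical text in a shared
    5-minute window. A sketch of the first coordination signal only."""
    buckets = defaultdict(set)
    for t in tweets:
        # Bucket by normalized text and coarse time slot.
        key = (t["text"].strip().lower(), t["timestamp"] // window_secs)
        buckets[key].add(t["author_id"])
    return [(text, sorted(accounts))
            for (text, _slot), accounts in buckets.items()
            if len(accounts) >= min_accounts]
```

A production check would also compare account creation dates and the post-campaign volume drop, per the remaining bullets.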

### 10. Astroturfing Misread as Grassroots Movement
- **Problem**: Presenting an astroturfing campaign (organized fake grassroots activity) as genuine public sentiment
- **Why it fails**: Sophisticated astroturfing uses aged accounts, varied content, and staggered timing to mimic organic activity. Without careful analysis, it passes initial filters
- **Fix**: Apply the KOL Cascade Analysis from knowledge/best-practices.md. Genuine grassroots movements show Micro-KOL-to-Macro-KOL cascade over days. Astroturfing shows simultaneous activation across account tiers with no prior Micro-KOL buildup

## Analysis & Reporting Anti-Patterns

### 11. Recency Bias in Trend Assessment
- **Problem**: Reporting the most recent tweets as the definitive position on a topic without historical baseline
- **Why it fails**: Twitter discourse oscillates rapidly. A negative reaction in the last 2 hours may follow days of positive sentiment, or vice versa. Snapshot analysis misrepresents the trajectory
- **Fix**: Always establish a baseline period (7-30 days) before assessing current sentiment. Report both the current state and the direction of change. Use the Sentiment Shift Detection technique from knowledge/best-practices.md

### 12. Single-Platform Echo Chamber
- **Problem**: Treating Twitter as representative of broader public opinion
- **Why it fails**: Twitter's user base skews toward specific demographics (urban, media-engaged, tech-savvy). Topics that dominate Twitter may be irrelevant to the broader population, and vice versa. Twitter's algorithmic amplification creates feedback loops
- **Fix**: Always caveat intelligence with "on Twitter/X" — never extrapolate to general public sentiment. Recommend cross-platform validation when the user needs broader opinion data. Note the platform's demographic skew in the confidence assessment
package/knowledge/best-practices.md
ADDED
@@ -0,0 +1,117 @@
---
domain: twitter-intel
topic: signal-filtering-credibility-trend-detection
priority: high
ttl: 30d
---

# Twitter Intelligence — Best Practices

## Signal Filtering Methodology

### 1. Multi-Layer Noise Reduction
Apply filters sequentially to reduce the tweet corpus to actionable signals:
1. **Language filter** — Restrict to target language(s) using `lang:` operator
2. **Bot filter** — Exclude accounts matching bot heuristics from knowledge/domain.md
3. **Retweet deduplication** — Collapse retweet chains to original tweet; count retweets as amplification metric
4. **Relevance filter** — Score tweet-to-topic semantic similarity; discard below 0.6 threshold
5. **Authority filter** — Weight remaining tweets by source KOL tier (from knowledge/domain.md)
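As a minimal sketch, the five stages can be composed into one sequential pass. The tweet dictionary fields and both helper predicates are illustrative assumptions, not part of the Twitter API or any client library; real code would use the bot heuristics from knowledge/domain.md and an embedding-based similarity for the relevance score.

```python
def is_probable_bot(author):
    # Stand-in for the bot heuristics in knowledge/domain.md.
    return author.get("tweets_per_hour", 0) > 100

def relevance(text, topic_keywords):
    # Stand-in for semantic similarity: fraction of topic keywords present.
    hits = sum(1 for k in topic_keywords if k.lower() in text.lower())
    return hits / max(len(topic_keywords), 1)

def filter_pipeline(tweets, topic_keywords, target_langs=("en",)):
    """Apply the five noise-reduction stages in order; return the
    surviving signal tweets plus per-original amplification counts."""
    signals, amplification = [], {}
    for t in tweets:
        if t["lang"] not in target_langs:               # 1. language filter
            continue
        if is_probable_bot(t["author"]):                # 2. bot filter
            continue
        if t.get("retweeted_id"):                       # 3. retweet dedup
            rid = t["retweeted_id"]
            amplification[rid] = amplification.get(rid, 0) + 1
            continue
        if relevance(t["text"], topic_keywords) < 0.6:  # 4. relevance filter
            continue
        signals.append(t)
    # 5. authority filter: weight by KOL tier, highest tier first.
    signals.sort(key=lambda t: t["author"].get("kol_tier", 0), reverse=True)
    return signals, amplification
```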

### 2. Signal-to-Noise Ratio Optimization
- Prefer `-is:retweet` for opinion extraction — retweets indicate amplification, not original thought
- Use `is:quote` to capture annotated discourse — quote tweets reveal disagreement, nuance, and counter-narratives
- Filter for `has:links` when seeking evidence-backed claims
- Apply `min_faves:` thresholds proportional to the topic's tweet volume:
  - High-volume topic (>10K tweets/day): `min_faves:50`
  - Medium-volume topic (1K-10K/day): `min_faves:10`
  - Low-volume/niche topic (<1K/day): no minimum — all signals matter

### 3. Temporal Window Selection
- **Breaking news**: Last 1-4 hours — prioritize recency over engagement
- **Trend monitoring**: Last 24-72 hours — balance recency with engagement signals
- **Sentiment baseline**: Last 7-30 days — establish norms before measuring shifts
- **Historical analysis**: Full archive — requires Academic/Pro API access

## Credibility Assessment Framework

### Account Credibility Score (0-100)

Calculate a composite credibility score for each account:

| Factor | Weight | Scoring |
|--------|--------|---------|
| Account age | 15% | <30 days: 0, 30d-1y: 40, 1-3y: 70, 3y+: 100 |
| Follower/following ratio | 15% | <0.5: 20, 0.5-2: 40, 2-10: 70, 10+: 100 |
| Listed count per 1K followers | 15% | <1: 20, 1-5: 50, 5-20: 80, 20+: 100 |
| Original content ratio | 15% | <20%: 20, 20-50%: 50, 50-80%: 80, 80%+: 100 |
| Verified status | 10% | None: 40, Blue: 60, Gold/Grey: 100 |
| Bio completeness | 10% | Empty: 0, Generic: 30, Professional with affiliations: 100 |
| Engagement quality | 10% | Bot-like patterns: 0, Normal: 60, High-quality replies: 100 |
| Posting consistency | 10% | Sporadic/burst: 30, Regular cadence: 70, Daily with variety: 100 |

### Credibility Tiers

| Tier | Score | Treatment |
|------|-------|-----------|
| Authoritative | 80-100 | Primary source — cite directly, high confidence |
| Credible | 60-79 | Reliable source — cite with standard attribution |
| Provisional | 40-59 | Use with caution — require corroboration from higher tier |
| Suspect | 20-39 | Do not cite alone — only as supporting data with corroboration |
| Unreliable | 0-19 | Exclude from analysis — flag if part of coordinated campaign |
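A minimal sketch of the composite score: the weights and breakpoints come straight from the factor table, while the account field names (and pre-scored inputs for the three qualitative factors) are illustrative assumptions.

```python
def band(value, cuts, top_score):
    # cuts: ascending (upper_bound, score) pairs; top_score if above all.
    for upper, score in cuts:
        if value < upper:
            return score
    return top_score

VERIFIED_SCORE = {"none": 40, "blue": 60, "gold": 100, "grey": 100}

def credibility_score(acct):
    """Weighted sum over the eight factors in the table above."""
    factors = [
        (0.15, band(acct["age_days"], [(30, 0), (365, 40), (1095, 70)], 100)),
        (0.15, band(acct["follower_ratio"], [(0.5, 20), (2, 40), (10, 70)], 100)),
        (0.15, band(acct["listed_per_1k"], [(1, 20), (5, 50), (20, 80)], 100)),
        (0.15, band(acct["original_ratio"], [(0.2, 20), (0.5, 50), (0.8, 80)], 100)),
        (0.10, VERIFIED_SCORE[acct["verified"]]),
        (0.10, acct["bio_score"]),            # 0 / 30 / 100 per the table
        (0.10, acct["engagement_quality"]),   # 0 / 60 / 100 per the table
        (0.10, acct["posting_consistency"]),  # 30 / 70 / 100 per the table
    ]
    return round(sum(w * s for w, s in factors))

def credibility_tier(score):
    for floor, tier in [(80, "Authoritative"), (60, "Credible"),
                        (40, "Provisional"), (20, "Suspect")]:
        if score >= floor:
            return tier
    return "Unreliable"
```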

### Cross-Referencing Protocol
- **Single-source claim**: Never report. Require 2+ independent Credible-tier sources
- **Controversial claim**: Require 3+ independent sources across different KOL tiers
- **Statistical claim**: Require primary data source or link to verifiable dataset
- **Breaking event**: Allow single Authoritative-tier source with "unconfirmed" label; upgrade after corroboration

## Trend Detection Techniques

### 1. Volume Velocity Analysis
Track tweet volume over sliding windows to detect acceleration:
- Calculate **tweets per hour** for the target topic over the last 72 hours
- Compute **velocity** = (current hour volume) / (average hourly volume over past 72h)
- Thresholds:
  - Velocity 1.5-3x: **Elevated interest** — monitor closely
  - Velocity 3-10x: **Emerging trend** — begin analysis
  - Velocity >10x: **Viral event** — prioritize for immediate briefing
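The velocity calculation and its thresholds reduce to a few lines; the input is simply the hourly tweet counts for the last 72 hours, most recent last.

```python
def velocity_class(hourly_volumes):
    """Return (velocity, label) per the thresholds above.
    hourly_volumes: tweet counts for the last 72 hours, most recent last."""
    baseline = sum(hourly_volumes) / len(hourly_volumes)
    v = hourly_volumes[-1] / baseline if baseline else float("inf")
    if v > 10:
        return v, "viral event"
    if v >= 3:
        return v, "emerging trend"
    if v >= 1.5:
        return v, "elevated interest"
    return v, "baseline"
```

Note that this sketch includes the current hour in the baseline average; excluding it makes the detector slightly more sensitive to the latest spike.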

### 2. Hashtag Co-occurrence Mapping
- Track which hashtags appear together in tweets about the target topic
- New co-occurrences signal narrative evolution (e.g., a tech topic suddenly co-occurring with #regulation)
- Build a co-occurrence graph; detect new clusters forming over 24-48h windows

### 3. KOL Cascade Analysis
- Track when a topic moves across KOL tiers:
  - Nano/Micro-KOL discussion first → Macro-KOL pickup → Mega-KOL amplification = organic trend
  - Mega-KOL first → immediate broad amplification without prior Micro-KOL discussion = top-down narrative push
- The **cascade direction** indicates whether a trend is grassroots or manufactured

### 4. Sentiment Shift Detection
- Establish a rolling 7-day sentiment baseline for the target topic
- Detect statistically significant shifts (>2 standard deviations from baseline)
- Categorize shifts:
  - **Gradual drift**: Sentiment changes over days — underlying narrative evolution
  - **Sharp reversal**: Sentiment flips within hours — triggered by a specific event or revelation
  - **Polarization spike**: Average sentiment stays similar but variance increases — growing disagreement
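The 2-standard-deviation check above can be sketched as a z-score test on the latest reading against the rolling baseline; sentiment values here are assumed to be daily averages in [-1, 1].

```python
import statistics

def detect_shift(daily_sentiment, z_threshold=2.0):
    """Return the z-score of the latest reading against the preceding
    baseline if it exceeds the threshold, else None."""
    baseline, current = daily_sentiment[:-1], daily_sentiment[-1]
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        return None  # flat baseline: no meaningful z-score
    z = (current - mean) / stdev
    return z if abs(z) > z_threshold else None
```

Distinguishing a gradual drift from a sharp reversal then comes down to how many consecutive readings trip the detector and over what span.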

### 5. Geographic & Demographic Spread
- Track when a topic crosses from one geographic region or language community to another
- Use `place_country:` and `lang:` operators to segment
- Cross-community spread is a strong indicator of a trend gaining mainstream traction

## Intelligence Report Structure

### Standard Briefing Format
1. **Executive Summary** — 2-3 sentence overview: what happened, why it matters, confidence level
2. **Key Findings** — Bulleted list of 3-5 main intelligence points, each with source attribution
3. **KOL Positions** — Table of notable KOL statements with credibility tier and reach metrics
4. **Trend Metrics** — Volume, velocity, sentiment data with time-series context
5. **Bot/Inauthenticity Assessment** — Percentage of engagement flagged as inauthentic; impact on findings
6. **Confidence Rating** — Overall confidence: High (80-100%), Medium (50-79%), Low (<50%) with justification
7. **Recommended Actions** — What the user should do with this intelligence; monitoring suggestions

### Confidence Rating Criteria
- **High (80-100%)**: 3+ Authoritative sources, consistent across KOL tiers, minimal bot contamination (<5%)
- **Medium (50-79%)**: 2+ Credible sources, mostly consistent with minor discrepancies, moderate bot presence (5-15%)
- **Low (<50%)**: Single source or conflicting accounts, high bot contamination (>15%), or rapidly evolving situation
package/knowledge/domain.md
ADDED
@@ -0,0 +1,140 @@
---
domain: twitter-intel
topic: twitter-api-engagement-metrics-kol-signals
priority: high
ttl: 30d
---

# Twitter Intelligence — API, Engagement Metrics & KOL Signals

## Twitter/X API v2 Endpoints

### Tweet Search
- **Recent Search** — `GET /2/tweets/search/recent` — Tweets from last 7 days
  - Query operators: `from:`, `to:`, `is:retweet`, `is:reply`, `is:quote`, `has:media`, `has:links`, `lang:`
  - Max results per request: 100 (paginate with `next_token`)
  - Rate limit: 450 requests / 15-min window (App-level), 180 (User-level)
- **Full-Archive Search** — `GET /2/tweets/search/all` — Complete tweet history (Academic/Pro access)
  - Same query operators as Recent Search
  - Rate limit: 300 requests / 15-min window
- **Filtered Stream** — `POST /2/tweets/search/stream/rules` + `GET /2/tweets/search/stream`
  - Real-time streaming with up to 25 concurrent rules (Basic), 1000 (Pro)
  - Supports all search operators as filter rules

### User & Account Data
- **User Lookup** — `GET /2/users/by/username/:username`
  - Fields: `id`, `name`, `username`, `created_at`, `description`, `public_metrics`, `verified`, `verified_type`
- **User Tweets Timeline** — `GET /2/users/:id/tweets` — Up to 3,200 most recent tweets
- **Followers/Following** — `GET /2/users/:id/followers`, `GET /2/users/:id/following`
  - Rate limit: 15 requests / 15-min window

### Engagement & Metrics
- **Tweet Metrics** (via `tweet.fields=public_metrics`):
  - `retweet_count`, `reply_count`, `like_count`, `quote_count`, `bookmark_count`, `impression_count`
- **User Metrics** (via `user.fields=public_metrics`):
  - `followers_count`, `following_count`, `tweet_count`, `listed_count`

### Trend Data
- **Trending Topics** — `GET /1.1/trends/place.json?id={WOEID}`
  - Returns top 50 trends for a location (WOEID-based)
  - Includes `tweet_volume` (last 24h) when available
  - Rate limit: 75 requests / 15-min window
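A minimal stdlib-only sketch of calling the trending-topics endpoint and ranking the result; the bearer token is a placeholder, WOEID 1 denotes worldwide trends, and `tweet_volume` is null for low-volume trends, so the ranking treats missing values as zero.

```python
import json
import urllib.parse
import urllib.request

def fetch_trends(bearer_token, woeid=1):
    """Fetch the trend list for a WOEID from the v1.1 endpoint above."""
    url = ("https://api.twitter.com/1.1/trends/place.json?"
           + urllib.parse.urlencode({"id": woeid}))
    req = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {bearer_token}"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        # Response is a one-element list wrapping the trends array.
        return json.load(resp)[0]["trends"]

def top_trends(trends, n=5):
    """Rank trends by tweet_volume, treating null volume as 0."""
    return sorted(trends, key=lambda t: t.get("tweet_volume") or 0,
                  reverse=True)[:n]
```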

## Search Query Operators

### Content Filters
- `"exact phrase"` — Match exact phrase in tweet text
- `keyword1 keyword2` — Both terms required (implicit AND)
- `keyword1 OR keyword2` — Match either term
- `-keyword` — Exclude tweets containing term
- `#hashtag` — Match hashtag
- `$TICKER` — Match cashtag (financial symbols)
- `url:"domain.com"` — Tweets containing links to domain

### Account Filters
- `from:username` — Tweets authored by account
- `to:username` — Tweets directed at account (replies/mentions)
- `@username` — Tweets mentioning account
- `retweets_of:username` — Retweets of account's tweets

### Tweet Type Filters
- `is:retweet` / `-is:retweet` — Include/exclude retweets
- `is:reply` / `-is:reply` — Include/exclude replies
- `is:quote` — Quote tweets only
- `is:verified` — From verified accounts only
- `has:media` — Tweets with images or video
- `has:links` — Tweets with URLs
- `has:hashtags` — Tweets with at least one hashtag

### Temporal & Engagement Filters
- `since:2024-01-01` / `until:2024-12-31` — Date range (YYYY-MM-DD)
- `min_retweets:100` — Minimum retweet threshold
- `min_faves:500` — Minimum like threshold
- `min_replies:50` — Minimum reply threshold
- `lang:en` — Language filter (ISO 639-1)
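These operators compose by space-separated AND. As a small sketch, a query builder can assemble them from structured inputs; the parameter names here are illustrative, and phrases containing spaces must be quoted per the content-filter rules above.

```python
def build_query(keywords, exclude_retweets=True, lang="en",
                min_faves=None, from_user=None):
    """Compose a space-joined (implicit AND) query from the operators above."""
    # Quote multi-word phrases for exact-phrase matching.
    parts = [f'"{k}"' if " " in k else k for k in keywords]
    if from_user:
        parts.append(f"from:{from_user}")
    if exclude_retweets:
        parts.append("-is:retweet")       # opinion extraction default
    if lang:
        parts.append(f"lang:{lang}")
    if min_faves:
        parts.append(f"min_faves:{min_faves}")
    return " ".join(parts)
```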

## KOL Identification Signals

### Authority Indicators
| Signal | Description | Weight |
|--------|------------|--------|
| Verified badge | Blue checkmark (paid) or gold/grey (org/gov) | Medium — verification is now pay-to-play; gold/grey is stronger |
| Follower-to-following ratio | High ratio (>10:1) suggests organic authority | High |
| Listed count | Number of lists the account appears on — indicates curation by others | High |
| Account age | Older accounts with consistent activity are more credible | Medium |
| Bio & affiliations | Institutional affiliation, professional credentials | High |

### Content Quality Indicators
| Signal | Description | Weight |
|--------|------------|--------|
| Original tweet ratio | Proportion of original tweets vs retweets — high ratio suggests thought leadership | High |
| Thread creation | Regularly publishes long-form threads — indicates deep analysis | Medium |
| Citation behavior | Links to primary sources, papers, data — indicates research rigor | High |
| Engagement quality | Reply-to-like ratio — high reply engagement suggests genuine discourse | Medium |
| Consistency | Posts on-topic regularly over months/years — not a flash account | High |

### KOL Classification Tiers

| Tier | Followers | Characteristics | Intelligence Value |
|------|-----------|----------------|-------------------|
| Mega-KOL | 1M+ | Broad reach, high noise, opinion-shaping | Trend confirmation, narrative direction |
| Macro-KOL | 100K-1M | Industry visibility, media crossover | Sector sentiment, emerging narratives |
| Mid-KOL | 10K-100K | Domain specialists, practitioner voices | Technical signals, insider perspective |
| Micro-KOL | 1K-10K | Niche experts, early adopters | Early signals, ground-truth validation |
| Nano-KOL | <1K | Hyper-specialized, often undervalued | Deep domain knowledge, contrarian signals |

## Engagement Metric Interpretation

### Healthy Engagement Ratios
- **Like-to-impression ratio**: 1-3% is typical; >5% indicates high resonance
- **Retweet-to-like ratio**: 0.1-0.3 is normal; >0.5 suggests strong shareability or controversy
- **Reply-to-like ratio**: 0.01-0.05 is normal; >0.1 indicates contentious content ("ratio'd")
- **Quote-to-retweet ratio**: >0.3 suggests the tweet is being challenged or annotated
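The four ratio checks can be sketched directly against the `public_metrics` field names listed in the API section; the thresholds mirror the bullets above, and the guards against zero denominators are an implementation assumption.

```python
def engagement_flags(m):
    """Return interpretation flags for a public_metrics dict,
    per the ratio thresholds above."""
    likes = max(m["like_count"], 1)  # guard against division by zero
    flags = []
    if m["impression_count"] and m["like_count"] / m["impression_count"] > 0.05:
        flags.append("high resonance")
    if m["retweet_count"] / likes > 0.5:
        flags.append("strong shareability or controversy")
    if m["reply_count"] / likes > 0.1:
        flags.append("contentious (ratio'd)")
    if m["retweet_count"] and m["quote_count"] / m["retweet_count"] > 0.3:
        flags.append("being challenged or annotated")
    return flags
```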

### Anomalous Engagement Patterns
- **Spike without context**: Sudden engagement surge with no clear catalyst — possible bot amplification
- **Follower burst**: Account gains 10K+ followers in <24h without viral content — possible purchased followers
- **Uniform engagement timing**: Likes/retweets arriving at metronomic intervals — bot signature
- **Low-quality reply flood**: High reply count but replies are generic, single-emoji, or from low-follower accounts — astroturfing

## Bot Detection Heuristics

### Account-Level Signals
- Default profile image or AI-generated avatar
- Username contains long random number strings (e.g., `user83749201`)
- Account created in bulk pattern (similar creation dates, sequential naming)
- Bio is generic, copied, or empty; displays no domain expertise
- Following/follower ratio near 1:1 with high absolute numbers (follow-back bot)

### Behavior-Level Signals
- Tweets at inhuman frequency (>100 tweets/hour)
- Posts 24/7 with no sleep pattern
- Content is entirely retweets or templated replies
- Engages across unrelated topics with no coherent interest graph
- Replies within seconds of target tweet publication

### Network-Level Signals
- Cluster of accounts retweeting the same content within minutes
- Shared followers/following lists across multiple accounts
- Coordinated hashtag usage at the same timestamps
- Reply chains that form repetitive patterns (e.g., same phrase variations)
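As a minimal sketch, a few of the account- and behavior-level signals above can be combined into an additive suspicion score; the weights, the 0.5 cut-off, and the account field names are illustrative assumptions, not calibrated values.

```python
import re

def bot_suspicion(acct):
    """Additive score over a subset of the heuristics above, capped at 1.0."""
    score = 0.0
    if acct.get("default_avatar"):
        score += 0.2
    if re.search(r"\d{6,}$", acct["username"]):       # long trailing digit run
        score += 0.2
    if not acct.get("bio"):                           # empty bio
        score += 0.2
    if acct.get("tweets_per_hour", 0) > 100:          # inhuman frequency
        score += 0.3
    followers = acct.get("followers", 0)
    following = acct.get("following", 0)
    if followers > 1000 and 0.9 < followers / max(following, 1) < 1.1:
        score += 0.1                                  # follow-back pattern
    return min(score, 1.0)

def is_suspected_bot(acct, threshold=0.5):
    return bot_suspicion(acct) >= threshold
```

Network-level signals (retweet clusters, shared follower lists) need graph analysis over many accounts and are out of scope for this per-account sketch.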
package/manifest.json
ADDED
@@ -0,0 +1,26 @@
{
  "name": "@botlearn/twitter-intel",
  "version": "0.1.0",
  "description": "Twitter/X platform intelligence gathering — tracking KOLs, extracting trending topics, analyzing engagement signals, detecting bot activity, and synthesizing actionable insights for OpenClaw Agent",
  "category": "information-retrieval",
  "author": "BotLearn",
  "benchmarkDimension": "information-retrieval",
  "expectedImprovement": 30,
  "dependencies": {},
  "compatibility": {
    "openclaw": ">=0.5.0"
  },
  "files": {
    "skill": "skill.md",
    "knowledge": [
      "knowledge/domain.md",
      "knowledge/best-practices.md",
      "knowledge/anti-patterns.md"
    ],
    "strategies": [
      "strategies/main.md"
    ],
    "smokeTest": "tests/smoke.json",
    "benchmark": "tests/benchmark.json"
  }
}
package/package.json
ADDED
@@ -0,0 +1,35 @@
{
  "name": "@botlearn/twitter-intel",
  "version": "0.1.0",
  "description": "Twitter/X platform intelligence gathering — tracking KOLs, extracting trending topics, analyzing engagement signals, detecting bot activity, and synthesizing actionable insights for OpenClaw Agent",
  "type": "module",
  "main": "manifest.json",
  "files": [
    "manifest.json",
    "skill.md",
    "knowledge/",
    "strategies/",
    "tests/",
    "README.md"
  ],
  "keywords": [
    "botlearn",
    "openclaw",
    "skill",
    "information-retrieval"
  ],
  "author": "BotLearn",
  "license": "MIT",
  "repository": {
    "type": "git",
    "url": "https://github.com/readai-team/botlearn-awesome-skills.git",
    "directory": "packages/skills/twitter-intel"
  },
  "homepage": "https://github.com/readai-team/botlearn-awesome-skills/tree/main/packages/skills/twitter-intel",
  "bugs": {
    "url": "https://github.com/readai-team/botlearn-awesome-skills/issues"
  },
  "publishConfig": {
    "access": "public"
  }
}
package/skill.md
ADDED
@@ -0,0 +1,48 @@
---
name: twitter-intel
role: Twitter Intelligence Analyst
version: 1.0.0
triggers:
  - "twitter"
  - "tweet"
  - "KOL"
  - "trending"
  - "X platform"
  - "twitter intelligence"
  - "twitter analysis"
  - "influencer tracking"
  - "twitter trends"
  - "social listening"
---

# Role

You are a Twitter Intelligence Analyst. When activated, you monitor the Twitter/X platform to track key opinion leaders (KOLs), extract trending narratives, analyze engagement signals, detect bot-driven amplification, and synthesize actionable intelligence reports from the platform's real-time discourse.

# Capabilities

1. Curate and maintain watchlists of KOLs, domain experts, and emerging voices within specified topics or industries
2. Filter high-signal tweets from noise using engagement metrics, account credibility scoring, and content relevance analysis
3. Extract and classify opinions, stances, and sentiment from tweet threads, quote tweets, and reply chains
4. Detect emerging trends, narrative shifts, and coordinated amplification campaigns before they reach mainstream awareness
5. Synthesize multi-source Twitter intelligence into structured, time-stamped briefings with confidence ratings and source attribution
6. Identify bot networks, astroturfing patterns, and inauthentic engagement to separate organic signal from manufactured consensus

# Constraints

1. Never treat high engagement (likes, retweets) as a proxy for credibility — always verify the source account's authenticity and authority
2. Never report on a trend based on a single tweet or a single account — require corroboration from 3+ independent sources
3. Never ignore sarcasm, irony, or satire markers — always assess tweet tone before extracting sentiment or opinion
4. Never present bot-amplified content as organic public opinion — always flag suspected inauthentic activity
5. Always include temporal context (timestamps, trend velocity) — Twitter intelligence is time-sensitive by nature
6. Always respect rate limits and platform terms of service when interfacing with Twitter/X API endpoints

# Activation

WHEN the user requests Twitter monitoring, KOL tracking, or trend analysis:
1. Identify the target topic, industry, or set of accounts to monitor
2. Execute source curation and signal filtering following strategies/main.md
3. Apply knowledge/domain.md for API usage, metric interpretation, and KOL identification
4. Evaluate findings using knowledge/best-practices.md for credibility and trend validation
5. Check against knowledge/anti-patterns.md to avoid engagement blindness, sarcasm misreads, and bot amplification traps
6. Output a structured intelligence briefing with confidence levels, source attribution, and temporal context
@@ -0,0 +1,113 @@
---
strategy: twitter-intel
version: 1.0.0
steps: 6
---

# Twitter Intelligence Strategy

## Step 1: Source Curation
- Parse the user's request to identify: **target topic**, **accounts of interest**, **time window**, **geographic scope**, and **desired intelligence type** (trend analysis, KOL tracking, sentiment monitoring, or event coverage)
- IF the request is vague THEN ask one clarifying question to narrow the scope (e.g., "Which industry vertical?" or "Any specific accounts to prioritize?")
- Build initial source lists:
  - **Topic keywords**: Extract 3-7 primary keywords and hashtags relevant to the topic
  - **KOL watchlist**: Identify 5-15 accounts across KOL tiers (from knowledge/domain.md) that are relevant to the topic
  - **Exclusion list**: Identify known bot accounts, spam hashtags, and noise sources to exclude
- Construct Twitter/X API search queries using operators from knowledge/domain.md:
  - Primary query: topic keywords + `-is:retweet` + `lang:` filter
  - KOL query: `from:` operators for watchlist accounts + topic keywords
  - Trend query: hashtag co-occurrences + `min_faves:` threshold
- Set the temporal window based on intelligence type:
  - Breaking event: last 1-4 hours
  - Trend monitoring: last 24-72 hours
  - Baseline establishment: last 7-30 days

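The three query patterns in Step 1 can be sketched as plain string builders. This is a minimal sketch: the helper names are hypothetical, and the operators (`-is:retweet`, `lang:`, `from:`, `min_faves:`) are assumed to behave as in the platform's search syntax documented in knowledge/domain.md.

```python
def build_primary_query(keywords, lang="en"):
    """Topic search: match any keyword, drop retweets, filter by language."""
    terms = " OR ".join(keywords)
    return f"({terms}) -is:retweet lang:{lang}"

def build_kol_query(handles, keywords):
    """Restrict the topic search to watchlist accounts."""
    froms = " OR ".join(f"from:{h}" for h in handles)
    terms = " OR ".join(keywords)
    return f"({froms}) ({terms})"

def build_trend_query(hashtags, min_faves=50):
    """Hashtag co-occurrence candidates above an engagement floor."""
    tags = " ".join(hashtags)
    return f"{tags} min_faves:{min_faves}"
```

Keeping the builders separate keeps the exclusion list easy to apply later: excluded terms can be appended as negated operators to any of the three queries.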
## Step 2: Signal Filtering
- Execute queries and collect raw tweet corpus
- Apply multi-layer noise reduction from knowledge/best-practices.md:
  1. Remove exact duplicates and retweet chains — collapse to original tweets with amplification counts
  2. Run bot detection heuristics from knowledge/domain.md on all accounts in the corpus
  3. Score each account using the Account Credibility Score framework (knowledge/best-practices.md)
  4. Filter tweets by semantic relevance to the target topic — discard below 0.6 similarity threshold
  5. Apply engagement thresholds proportional to topic volume
- Tag each remaining tweet with metadata:
  - Source credibility tier (Authoritative / Credible / Provisional / Suspect)
  - Bot probability score (0-100%)
  - Relevance score (0.6-1.0)
- IF >15% of engagement is flagged as bot-driven THEN flag the topic for inauthenticity risk
- VERIFY against anti-pattern #8 (knowledge/anti-patterns.md): ensure bot engagement is separated from organic metrics

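The filtering layers above compose into a single pass over the corpus. A minimal sketch, assuming tweets are dicts with illustrative field names (`text`, `relevance`, `bot_prob`, `engagement`), not an API schema:

```python
def filter_corpus(tweets, relevance_floor=0.6, bot_flag_share=0.15):
    """Collapse duplicates, drop low-relevance tweets, and flag bot-heavy topics.

    Returns the kept tweets plus a boolean marking the >15% bot-driven
    inauthenticity risk from Step 2.
    """
    seen, kept = set(), []
    for t in tweets:
        if t["text"] in seen:
            continue                      # exact duplicate / retweet chain
        seen.add(t["text"])
        if t["relevance"] < relevance_floor:
            continue                      # below semantic relevance threshold
        kept.append(t)
    total = sum(t["engagement"] for t in kept) or 1
    bot = sum(t["engagement"] for t in kept if t["bot_prob"] > 0.5)
    inauthentic = bot / total > bot_flag_share
    return kept, inauthentic
```

The 0.5 bot-probability cutoff used for the engagement split is an assumption for the sketch; the package's heuristics in knowledge/domain.md would supply the real score.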
## Step 3: Opinion Extraction
- For each high-signal tweet (credibility tier >= Provisional, relevance >= 0.7):
  - Extract the **stated position** — what claim or opinion does the tweet express?
  - Classify **sentiment** — positive / negative / neutral / mixed toward the target topic
  - Detect **tone markers** — check for sarcasm, irony, or satire per anti-pattern #5 (knowledge/anti-patterns.md)
  - IF the tweet is a quote tweet THEN analyze as composite (original + commentary) per anti-pattern #6
  - IF the tweet is part of a thread THEN retrieve full thread via `conversation_id` and analyze holistically per anti-pattern #7
- Group extracted opinions into **stance clusters**:
  - Supporters (positive sentiment toward topic/entity)
  - Critics (negative sentiment toward topic/entity)
  - Neutral analysts (factual commentary without clear stance)
  - Contrarians (minority positions that diverge from the dominant narrative)
- For each cluster, identify the **strongest voice** — the highest-credibility KOL representing that position
- Calculate **opinion distribution** — percentage of credible voices in each cluster

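The clustering, strongest-voice, and distribution steps above can be sketched in a few lines. The opinion schema (`stance`, `handle`, `credibility`) is illustrative, assuming a numeric credibility score where higher is better:

```python
def stance_distribution(opinions):
    """Group opinions into stance clusters; report each cluster's share of
    credible voices and its highest-credibility KOL."""
    clusters = {}
    for o in opinions:
        clusters.setdefault(o["stance"], []).append(o)
    total = sum(len(members) for members in clusters.values())
    return {
        stance: {
            "share": round(len(members) / total, 2),
            "strongest_voice": max(members, key=lambda m: m["credibility"])["handle"],
        }
        for stance, members in clusters.items()
    }
```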
## Step 4: Trend Detection
- Apply Volume Velocity Analysis from knowledge/best-practices.md:
  - Calculate tweets per hour over the analysis window
  - Compute velocity relative to 72-hour average
  - IF velocity > 3x THEN classify as emerging trend
  - IF velocity > 10x THEN classify as viral event — prioritize for immediate briefing
- Execute Hashtag Co-occurrence Mapping:
  - Build co-occurrence graph for hashtags in the filtered corpus
  - Identify new hashtag clusters forming in the last 24-48 hours
  - Flag unexpected co-occurrences (e.g., tech topic + political hashtag = narrative hijacking risk)
- Perform KOL Cascade Analysis:
  - Track the chronological spread across KOL tiers
  - IF Micro-KOL → Macro-KOL → Mega-KOL cascade THEN classify as organic trend
  - IF Mega-KOL-first or simultaneous cross-tier activation THEN flag as top-down narrative push per anti-pattern #9
- Run Sentiment Shift Detection:
  - Compare current sentiment against the 7-day rolling baseline
  - IF shift > 2 standard deviations THEN classify the shift type (gradual drift, sharp reversal, or polarization spike)
- VERIFY against anti-pattern #10: check for astroturfing signatures in any detected trends

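The velocity thresholds (3x, 10x) and the 2-sigma shift rule from Step 4 reduce to two small functions. A minimal sketch; the function names and the no-baseline fallback are assumptions:

```python
def classify_velocity(tweets_last_hour, baseline_per_hour):
    """Classify trend stage from velocity relative to the 72-hour average."""
    if baseline_per_hour <= 0:
        return "no-baseline"
    ratio = tweets_last_hour / baseline_per_hour
    if ratio > 10:
        return "viral"        # prioritize for immediate briefing
    if ratio > 3:
        return "emerging"
    return "baseline"

def sentiment_shift(current, baseline_mean, baseline_std):
    """True when sentiment deviates from the 7-day rolling baseline by
    more than 2 standard deviations."""
    if baseline_std == 0:
        return False
    return abs(current - baseline_mean) / baseline_std > 2
```

Classifying the shift type (drift, reversal, polarization spike) would additionally need the time series shape, which the sketch omits.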
## Step 5: Insight Synthesis
- Merge findings from Steps 2-4 into a coherent intelligence picture:
  - Connect opinion clusters (Step 3) with trend dynamics (Step 4)
  - Identify **causal narratives**: what events or statements triggered the observed patterns?
  - Assess **trajectory**: is the trend accelerating, plateauing, or declining?
  - Evaluate **cross-topic spillover**: is the topic affecting or being affected by adjacent conversations?
- Generate confidence ratings using criteria from knowledge/best-practices.md:
  - Count corroborating sources across credibility tiers
  - Calculate bot contamination percentage
  - Assess source diversity (single echo chamber vs. multi-community signal)
- Identify **intelligence gaps** — what questions remain unanswered? What data would increase confidence?
- VERIFY against anti-pattern #11: ensure the assessment includes baseline context, not just the latest snapshot
- VERIFY against anti-pattern #12: caveat all findings as Twitter/X-specific; do not extrapolate to general public opinion

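The three confidence criteria above can be combined into a High/Medium/Low rating. The exact cutoffs here are illustrative assumptions (the 3+ source floor echoes skill constraint #2 and the 15% bot share echoes Step 2; knowledge/best-practices.md would supply the real criteria):

```python
def confidence_rating(corroborating_sources, bot_share, communities):
    """Rate a finding from corroboration count, bot contamination (0-1),
    and the number of independent communities carrying the signal."""
    if corroborating_sources >= 3 and bot_share < 0.15 and communities >= 2:
        return "High"
    if corroborating_sources >= 2 and bot_share < 0.30:
        return "Medium"
    return "Low"
```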
## Step 6: Intelligence Briefing Output
- Structure the output following the Standard Briefing Format from knowledge/best-practices.md:
  1. **Executive Summary** — 2-3 sentences: what the intelligence reveals, why it matters, confidence level
  2. **Key Findings** — 3-5 bulleted intelligence points, each with:
     - The finding itself
     - Source attribution (KOL name, credibility tier, tweet link)
     - Corroboration count (how many independent sources support this)
  3. **KOL Positions** — Table of notable KOL statements:
     - Account handle | Credibility tier | Follower count | Position summary | Tweet link
  4. **Trend Metrics** — Quantitative data:
     - Tweet volume and velocity (with time-series if available)
     - Sentiment distribution (% positive / negative / neutral)
     - Top hashtags and co-occurrences
     - Geographic spread (if detectable)
  5. **Bot & Inauthenticity Assessment** — Percentage of flagged engagement; impact on findings
  6. **Confidence Rating** — High / Medium / Low with explicit justification
  7. **Recommended Actions** — What the user should do next:
     - Accounts to watch for follow-up developments
     - Suggested monitoring frequency
     - Cross-platform validation recommendations if confidence is Medium or Low
- SELF-CHECK before delivery:
  - Are all claims backed by attributed sources?
  - Have bot-contaminated metrics been flagged?
  - Is temporal context (timestamps, trend direction) included throughout?
  - Does the briefing avoid the 12 anti-patterns from knowledge/anti-patterns.md?
- IF any check fails THEN loop back to the relevant step and re-analyze
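The Step 6 self-check loop can be sketched as a gate that names the failed checks so the caller knows which step to revisit. The briefing field names are illustrative assumptions, not part of the package's briefing format:

```python
def self_check(briefing):
    """Pre-delivery gate: return the names of failed checks (empty = pass)."""
    checks = {
        "sources_attributed": all(f.get("source") for f in briefing["findings"]),
        "bot_metrics_flagged": "bot_assessment" in briefing,
        "temporal_context": "timestamps" in briefing,
    }
    return [name for name, ok in checks.items() if not ok]
```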
@@ -0,0 +1,476 @@
{
  "version": "0.0.1",
  "dimension": "information-retrieval",
  "tasks": [
    {
      "id": "bench-easy-01",
      "difficulty": "easy",
      "description": "Identify top KOLs for a well-known topic",
      "input": "List the top 5 most influential accounts on Twitter/X discussing artificial intelligence. For each account, provide their handle, follower count, KOL tier, and a brief description of why they are influential in the AI space.",
      "rubric": [
        {
          "criterion": "KOL Identification Accuracy",
          "weight": 0.4,
          "scoring": {
            "5": "Identifies 5 genuinely influential AI accounts across multiple tiers (researchers, industry leaders, commentators); accounts are widely recognized in the AI community",
            "3": "Identifies 3-4 relevant accounts but selection is skewed toward one category (e.g., all corporate accounts)",
            "1": "Identifies 1-2 relevant accounts mixed with irrelevant ones",
            "0": "No relevant AI KOLs identified"
          }
        },
        {
          "criterion": "Metadata Completeness",
          "weight": 0.3,
          "scoring": {
            "5": "Each account includes: handle, follower count, KOL tier classification, credibility indicators, and a specific reason for their influence",
            "3": "Handles and follower counts provided but missing tier classification or influence rationale",
            "1": "Only handles listed without supporting data",
            "0": "No metadata provided"
          }
        },
        {
          "criterion": "Source Diversity",
          "weight": 0.3,
          "scoring": {
            "5": "Accounts span multiple KOL tiers (Mega, Macro, Mid) and roles (researcher, founder, journalist, practitioner)",
            "3": "Accounts from 2 tiers or roles",
            "1": "All accounts from same tier or role",
            "0": "No diversity consideration"
          }
        }
      ],
      "expectedScoreWithout": 35,
      "expectedScoreWith": 75
    },
    {
      "id": "bench-easy-02",
      "difficulty": "easy",
      "description": "Extract trending hashtags for a specific topic",
      "input": "What are the top trending hashtags on Twitter/X related to cryptocurrency in the past 24 hours? Provide the hashtag, estimated tweet volume, and a one-sentence description of what each hashtag is about.",
      "rubric": [
        {
          "criterion": "Hashtag Relevance",
          "weight": 0.4,
          "scoring": {
            "5": "Returns 5+ genuinely trending crypto-related hashtags that are currently active; all are directly relevant to cryptocurrency topics",
            "3": "Returns 3-4 relevant hashtags but includes 1-2 generic or outdated ones",
            "1": "Returns mostly generic hashtags (#crypto, #bitcoin) without trending context",
            "0": "Irrelevant hashtags"
          }
        },
        {
          "criterion": "Volume & Context Data",
          "weight": 0.3,
          "scoring": {
            "5": "Each hashtag includes estimated tweet volume, what it refers to, and why it is trending now",
            "3": "Volume estimates provided but context is vague",
            "1": "Hashtags listed without volume or context",
            "0": "No supporting data"
          }
        },
        {
          "criterion": "Temporal Accuracy",
          "weight": 0.3,
          "scoring": {
            "5": "Hashtags are clearly from the last 24 hours with timestamps or recency indicators; distinguishes current trends from evergreen tags",
            "3": "Some recency indication but not clearly time-bounded",
            "1": "No temporal context — could be from any time period",
            "0": "Returns outdated trends"
          }
        }
      ],
      "expectedScoreWithout": 35,
      "expectedScoreWith": 80
    },
    {
      "id": "bench-easy-03",
      "difficulty": "easy",
      "description": "Basic sentiment check on a public figure",
      "input": "What is the general sentiment on Twitter/X toward Elon Musk this week? Provide a simple breakdown: percentage positive, negative, and neutral, with 2-3 example tweets for each category.",
      "rubric": [
        {
          "criterion": "Sentiment Classification",
          "weight": 0.4,
          "scoring": {
            "5": "Provides clear percentage breakdown with reasonable distribution; acknowledges mixed/polarized sentiment; methodology is explained",
            "3": "Provides a general sentiment direction (mostly positive/negative) but percentages are vague or unsupported",
            "1": "Single-word sentiment (e.g., 'negative') without nuance or data",
            "0": "No sentiment analysis"
          }
        },
        {
          "criterion": "Example Quality",
          "weight": 0.3,
          "scoring": {
            "5": "2-3 examples per category from credible accounts; examples clearly represent the stated sentiment; tone is correctly interpreted (sarcasm detected)",
            "3": "1-2 examples per category but some are ambiguous or from low-credibility accounts",
            "1": "Generic examples that do not clearly demonstrate the sentiment category",
            "0": "No examples provided"
          }
        },
        {
          "criterion": "Temporal Framing",
          "weight": 0.3,
          "scoring": {
            "5": "Analysis is clearly bounded to the current week; notes any specific events driving sentiment shifts",
            "3": "Roughly time-bounded but does not connect sentiment to specific events",
            "1": "No clear time boundary; could be about any period",
            "0": "Clearly outdated analysis"
          }
        }
      ],
      "expectedScoreWithout": 40,
      "expectedScoreWith": 80
    },
    {
      "id": "bench-med-01",
      "difficulty": "medium",
      "description": "Track narrative evolution around a policy event",
      "input": "Analyze how the Twitter/X conversation around US data privacy legislation has evolved over the past month. Identify the key narrative phases, which KOLs drove each phase, and how sentiment shifted at each stage. Flag any coordinated campaign activity.",
      "rubric": [
        {
          "criterion": "Narrative Phase Identification",
          "weight": 0.3,
          "scoring": {
            "5": "Identifies 3+ distinct narrative phases with clear temporal boundaries and triggering events; shows how the conversation evolved from one phase to the next",
            "3": "Identifies 2 phases but boundaries are unclear or triggering events are not specified",
            "1": "Treats the entire month as a single undifferentiated conversation",
            "0": "No temporal analysis"
          }
        },
        {
          "criterion": "KOL Attribution",
          "weight": 0.25,
          "scoring": {
            "5": "Attributes specific KOLs to each narrative phase; shows cascade patterns (who spoke first, who amplified); credibility tiers assigned",
            "3": "Names some KOLs but does not connect them to specific phases or show cascade dynamics",
            "1": "Generic mention of 'influencers' without specific attribution",
            "0": "No KOL analysis"
          }
        },
        {
          "criterion": "Sentiment Trajectory",
          "weight": 0.25,
          "scoring": {
            "5": "Maps sentiment across the full month with shifts clearly linked to events; distinguishes gradual drift from sharp reversals; reports baseline and deviations",
            "3": "Reports overall sentiment change but without phase-by-phase granularity",
            "1": "Single sentiment label for the whole period",
            "0": "No sentiment analysis"
          }
        },
        {
          "criterion": "Coordination Detection",
          "weight": 0.2,
          "scoring": {
            "5": "Explicitly checks for coordinated campaigns; reports findings (whether coordination was detected or not) with evidence; distinguishes organic from manufactured activity",
            "3": "Mentions coordination possibility but without systematic detection or evidence",
            "1": "Does not address coordination at all",
            "0": "Presents potentially coordinated activity as organic"
          }
        }
      ],
      "expectedScoreWithout": 25,
      "expectedScoreWith": 65
    },
    {
      "id": "bench-med-02",
      "difficulty": "medium",
      "description": "Competitive intelligence via Twitter KOL monitoring",
      "input": "I'm launching a fintech startup focused on embedded payments. Monitor Twitter/X to identify: (1) the top 10 KOLs in the embedded finance space, (2) what they're saying about current market trends, and (3) any emerging competitors or partnerships being discussed. Provide a competitive intelligence briefing.",
      "rubric": [
        {
          "criterion": "KOL Discovery",
          "weight": 0.3,
          "scoring": {
            "5": "Identifies 10 relevant KOLs across tiers (industry analysts, fintech founders, VC investors, developers); includes credibility scores and relevance justification for each",
            "3": "Identifies 5-7 relevant KOLs but limited to one or two categories",
            "1": "Identifies fewer than 5 KOLs or includes irrelevant accounts",
            "0": "No KOL discovery"
          }
        },
        {
          "criterion": "Market Trend Extraction",
          "weight": 0.3,
          "scoring": {
            "5": "Extracts 3+ specific market trends with KOL source attribution; trends are actionable and relevant to embedded payments; includes supporting tweet evidence",
            "3": "Identifies general fintech trends but not specific to embedded payments or lacking attribution",
            "1": "Vague trend statements without evidence",
            "0": "No trend analysis"
          }
        },
        {
          "criterion": "Competitive Signal Detection",
          "weight": 0.25,
          "scoring": {
            "5": "Identifies specific competitors or partnerships being discussed by KOLs; provides context on competitive positioning; flags early signals of market moves",
            "3": "Mentions some competitors but without context or KOL attribution",
            "1": "Generic competitive landscape without Twitter-specific intelligence",
            "0": "No competitive signals"
          }
        },
        {
          "criterion": "Briefing Structure",
          "weight": 0.15,
          "scoring": {
            "5": "Follows structured briefing format with executive summary, KOL table, trend analysis, competitive signals, and recommended monitoring actions",
            "3": "Partially structured with some sections missing",
            "1": "Unstructured narrative",
            "0": "No organization"
          }
        }
      ],
      "expectedScoreWithout": 25,
      "expectedScoreWith": 65
    },
    {
      "id": "bench-med-03",
      "difficulty": "medium",
      "description": "Detect and analyze bot activity around a controversial topic",
      "input": "Investigate whether bot networks are amplifying specific narratives about genetically modified organisms (GMOs) on Twitter/X. Identify suspected bot accounts, quantify their impact on the conversation's engagement metrics, and determine which narratives they are pushing. Separate organic KOL opinions from bot-amplified messaging.",
      "rubric": [
        {
          "criterion": "Bot Detection Methodology",
          "weight": 0.3,
          "scoring": {
            "5": "Applies systematic bot detection using account-level, behavior-level, and network-level signals from domain knowledge; explains the heuristics used and their confidence",
            "3": "Applies some bot detection but only at one level (e.g., account-level only) without network analysis",
            "1": "Mentions bots but does not apply systematic detection",
            "0": "No bot detection"
          }
        },
        {
          "criterion": "Impact Quantification",
          "weight": 0.25,
          "scoring": {
            "5": "Reports bot contamination percentage; provides both raw and bot-filtered engagement metrics; quantifies how much bot activity inflates specific narrative visibility",
            "3": "Reports presence of bots but does not quantify their impact on metrics",
            "1": "Vague mention of bot activity without quantification",
            "0": "No quantification"
          }
        },
        {
          "criterion": "Narrative Separation",
          "weight": 0.25,
          "scoring": {
            "5": "Clearly separates organic KOL opinions from bot-amplified narratives; shows which narratives are genuinely held by credible voices vs. artificially boosted",
            "3": "Identifies different narratives but does not clearly separate organic from amplified",
            "1": "Treats all narratives equally without bot-organic distinction",
            "0": "No narrative analysis"
          }
        },
        {
          "criterion": "Evidence Presentation",
          "weight": 0.2,
          "scoring": {
            "5": "Provides specific examples of suspected bot accounts with evidence (account age, posting patterns, network connections); shows coordination patterns",
            "3": "Mentions suspected accounts but evidence is thin or anecdotal",
            "1": "No specific examples or evidence",
            "0": "Unsupported claims"
          }
        }
      ],
      "expectedScoreWithout": 20,
      "expectedScoreWith": 60
    },
    {
      "id": "bench-med-04",
      "difficulty": "medium",
      "description": "Multi-language Twitter intelligence gathering",
      "input": "Monitor Twitter/X discourse about renewable energy policy in both English and Spanish-speaking communities. Compare the dominant narratives, key KOLs, and sentiment between the two language communities. Identify any cross-language narrative transfer.",
      "rubric": [
        {
          "criterion": "Multi-Language Coverage",
          "weight": 0.3,
          "scoring": {
            "5": "Constructs separate queries for English and Spanish using lang: filters; identifies KOLs in both communities; provides analysis for each language independently",
            "3": "Covers both languages but analysis is significantly deeper for one language",
            "1": "Only covers one language effectively",
            "0": "Single-language analysis despite multi-language request"
          }
        },
        {
          "criterion": "Cross-Community Comparison",
          "weight": 0.3,
          "scoring": {
            "5": "Provides structured comparison of narratives, KOLs, and sentiment between communities; identifies shared vs. unique concerns; notes cultural/political context differences",
            "3": "Compares sentiment but does not compare narratives or KOLs in depth",
            "1": "Presents each community separately without comparison",
            "0": "No comparison"
          }
        },
        {
          "criterion": "Cross-Language Transfer Detection",
          "weight": 0.2,
          "scoring": {
            "5": "Identifies specific narratives or talking points that transferred from one language community to the other; traces the transfer timeline and amplifiers",
            "3": "Notes some overlap between communities but does not trace transfer dynamics",
            "1": "Does not address cross-language transfer",
            "0": "Ignores the cross-language aspect entirely"
          }
        },
        {
          "criterion": "Output Quality",
          "weight": 0.2,
          "scoring": {
            "5": "Structured briefing with side-by-side comparison tables; confidence ratings for each community; recommended monitoring for both languages",
            "3": "Readable output but lacks comparative structure",
            "1": "Unstructured mixed narrative",
            "0": "Disorganized output"
          }
        }
      ],
      "expectedScoreWithout": 20,
      "expectedScoreWith": 60
    },
    {
      "id": "bench-hard-01",
      "difficulty": "hard",
      "description": "Real-time crisis monitoring and misinformation detection",
      "input": "A major cybersecurity breach has just been reported at a Fortune 500 company. Monitor Twitter/X in real-time to: (1) separate verified facts from speculation and misinformation, (2) identify which KOLs have credible insider knowledge vs. those amplifying rumors, (3) detect any coordinated disinformation campaigns attempting to manipulate the narrative, and (4) provide a real-time intelligence briefing that distinguishes confirmed facts from unverified claims with confidence levels.",
      "rubric": [
        {
          "criterion": "Fact vs. Speculation Separation",
          "weight": 0.3,
          "scoring": {
            "5": "Explicitly categorizes each claim as confirmed/unconfirmed/speculation with evidence basis; traces claims to primary sources; flags contradictions between accounts; assigns per-claim confidence levels",
            "3": "Separates some facts from speculation but categorization is inconsistent or missing evidence basis",
            "1": "Presents all information with similar authority regardless of verification status",
            "0": "No fact-checking or verification"
          }
        },
        {
          "criterion": "KOL Credibility Assessment",
          "weight": 0.25,
          "scoring": {
            "5": "Assesses each KOL's likely access to insider information based on professional background, prior accuracy, and institutional affiliation; distinguishes cybersecurity experts from general tech commentators; applies credibility scoring framework",
            "3": "Identifies some credible sources but does not systematically assess expertise relevance",
            "1": "Treats all KOLs equally regardless of domain expertise",
            "0": "No credibility assessment"
          }
        },
        {
          "criterion": "Disinformation Detection",
          "weight": 0.25,
          "scoring": {
            "5": "Systematically checks for coordinated campaigns: synchronized posting, bot amplification, narrative manipulation; reports findings with specific evidence; quantifies disinformation contamination of the overall conversation",
            "3": "Notes potential disinformation but without systematic detection or quantification",
            "1": "Does not address disinformation possibility",
            "0": "Presents disinformation as credible intelligence"
          }
        },
        {
          "criterion": "Real-Time Briefing Quality",
          "weight": 0.2,
          "scoring": {
            "5": "Structured briefing with clear timestamps, fact/speculation labels on each item, evolving confidence ratings, and recommended next monitoring actions; suitable for decision-makers",
            "3": "Useful briefing but lacks timestamps or evolving confidence tracking",
            "1": "Static summary without real-time structure",
            "0": "Unusable output"
          }
        }
      ],
      "expectedScoreWithout": 15,
      "expectedScoreWith": 55
    },
    {
      "id": "bench-hard-02",
      "difficulty": "hard",
      "description": "Geopolitical narrative tracking across multiple stakeholder groups",
      "input": "Track the Twitter/X discourse around US-China technology competition, specifically regarding semiconductor export controls. Map the conversation across four stakeholder groups: (1) US policy hawks, (2) industry/business voices, (3) Chinese state-affiliated accounts, and (4) neutral academic analysts. For each group, identify their narrative framing, key talking points, and how they engage with opposing narratives. Detect any state-sponsored information operations.",
      "rubric": [
        {
          "criterion": "Stakeholder Group Mapping",
          "weight": 0.25,
          "scoring": {
            "5": "Successfully identifies and separates accounts into all 4 stakeholder groups with clear classification criteria; accounts are correctly attributed; includes 3+ KOLs per group",
            "3": "Maps 2-3 groups but classification is imprecise or one group is missing",
            "1": "Treats the conversation as monolithic without stakeholder segmentation",
            "0": "No stakeholder mapping"
          }
        },
        {
          "criterion": "Narrative Framing Analysis",
          "weight": 0.25,
          "scoring": {
            "5": "For each group: identifies the core narrative frame (e.g., national security vs. free trade vs. sovereign rights), key talking points, and rhetorical strategies; shows how frames differ and where they conflict",
            "3": "Identifies general positions but does not analyze framing strategies or rhetorical differences",
            "1": "Surface-level position summary without framing analysis",
            "0": "No narrative analysis"
          }
        },
        {
          "criterion": "Cross-Group Engagement Analysis",
          "weight": 0.25,
          "scoring": {
            "5": "Maps how groups engage with each other: quote tweets, reply patterns, counter-narratives; identifies which groups are talking past each other vs. directly engaging; shows information flow between groups",
            "3": "Notes some cross-group interaction but analysis is shallow",
            "1": "Analyzes each group in isolation without cross-group dynamics",
            "0": "No engagement analysis"
          }
        },
        {
          "criterion": "State-Sponsored Detection",
          "weight": 0.25,
          "scoring": {
            "5": "Applies systematic detection for state-affiliated accounts (official labels, behavior patterns, coordination signals); distinguishes organic vs. state-directed messaging; provides evidence for assessments; quantifies state-sponsored content share",
            "3": "Identifies some state-affiliated accounts but detection is not systematic",
            "1": "Mentions state involvement without evidence",
            "0": "Ignores state-sponsored operations entirely"
          }
        }
      ],
      "expectedScoreWithout": 15,
      "expectedScoreWith": 55
    },
    {
      "id": "bench-hard-03",
      "difficulty": "hard",
      "description": "Predictive intelligence from early Twitter signals",
      "input": "Based on current Twitter/X signals, identify 3 emerging technology trends that have not yet reached mainstream media coverage but are gaining traction among Micro-KOLs and Mid-KOLs in the tech space. For each trend, provide: the evidence trail (earliest tweets, KOL cascade progression), current velocity metrics, predicted timeline to mainstream awareness, and confidence level. Explain your methodology for distinguishing genuine early signals from noise.",
      "rubric": [
        {
          "criterion": "Early Signal Detection",
          "weight": 0.3,
          "scoring": {
            "5": "Identifies 3 genuine emerging trends with clear evidence they originated in Micro/Mid-KOL circles; provides earliest tweet timestamps and shows the signal before mainstream pickup; trends are plausible and specific",
436
|
+
"3": "Identifies 2 trends but evidence trail is incomplete or one trend is already mainstream",
|
|
437
|
+
"1": "Identifies already-mainstream topics or vague trend categories",
|
|
438
|
+
"0": "No genuine early signal detection"
|
|
439
|
+
}
|
|
440
|
+
},
|
|
441
|
+
{
|
|
442
|
+
"criterion": "Evidence Quality & Cascade Analysis",
|
|
443
|
+
"weight": 0.25,
|
|
444
|
+
"scoring": {
|
|
445
|
+
"5": "For each trend: provides specific tweet evidence, shows KOL cascade progression with timestamps, maps the spread across KOL tiers; evidence is verifiable and attributed",
|
|
446
|
+
"3": "Some evidence provided but cascade analysis is incomplete or evidence is anecdotal",
|
|
447
|
+
"1": "Claims without supporting evidence or timeline",
|
|
448
|
+
"0": "No evidence"
|
|
449
|
+
}
|
|
450
|
+
},
|
|
451
|
+
{
|
|
452
|
+
"criterion": "Predictive Assessment",
|
|
453
|
+
"weight": 0.25,
|
|
454
|
+
"scoring": {
|
|
455
|
+
"5": "Provides specific timeline predictions for mainstream awareness with justified reasoning; includes velocity metrics, comparisons to historical trend patterns, and explicit uncertainty bounds",
|
|
456
|
+
"3": "General timeline prediction without quantitative basis or uncertainty bounds",
|
|
457
|
+
"1": "No predictive element — only describes current state",
|
|
458
|
+
"0": "No prediction attempted"
|
|
459
|
+
}
|
|
460
|
+
},
|
|
461
|
+
{
|
|
462
|
+
"criterion": "Methodology Transparency",
|
|
463
|
+
"weight": 0.2,
|
|
464
|
+
"scoring": {
|
|
465
|
+
"5": "Clearly explains the signal-vs-noise methodology: which heuristics were used, how false positives were filtered, what thresholds were applied, and what the limitations are",
|
|
466
|
+
"3": "Mentions methodology at a high level but lacks specifics",
|
|
467
|
+
"1": "No methodology explanation — trends appear to be selected arbitrarily",
|
|
468
|
+
"0": "No methodology"
|
|
469
|
+
}
|
|
470
|
+
}
|
|
471
|
+
],
|
|
472
|
+
"expectedScoreWithout": 15,
|
|
473
|
+
"expectedScoreWith": 55
|
|
474
|
+
}
|
|
475
|
+
]
|
|
476
|
+
}
|
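Each rubric entry pairs a per-criterion weight (the weights within a task sum to 1.0) with anchors on a 0-5 scoring scale, while the `expectedScoreWithout`/`expectedScoreWith` fields sit on a 0-100 scale. The package ships only this data, no scorer, so the following is a minimal sketch of how a harness might bridge the two scales; the `weighted_score` helper and its 0-100 normalization are assumptions, not part of the package.

```python
# Hypothetical scorer sketch: the package ships rubric data only, so this
# helper and its 0-5 -> 0-100 normalization are assumptions.

def weighted_score(rubric, criterion_scores):
    """Combine per-criterion 0-5 scores into a 0-100 benchmark score."""
    raw = sum(item["weight"] * criterion_scores[item["criterion"]]
              for item in rubric)   # weighted average on the 0-5 scale
    return raw / 5.0 * 100.0       # rescale to match the expectedScore* fields

# Weights as in bench-hard-02 above.
rubric = [
    {"criterion": "Stakeholder Group Mapping", "weight": 0.25},
    {"criterion": "Narrative Framing Analysis", "weight": 0.25},
    {"criterion": "Cross-Group Engagement Analysis", "weight": 0.25},
    {"criterion": "State-Sponsored Detection", "weight": 0.25},
]
scores = {c["criterion"]: 3 for c in rubric}  # a uniform "3" on every criterion
print(weighted_score(rubric, scores))         # → 60.0
```

Under this reading, `expectedScoreWith: 55` corresponds to an average criterion score just under the "3" anchor, and `expectedScoreWithout: 15` to scores hovering below "1".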
package/tests/smoke.json
ADDED
@@ -0,0 +1,54 @@
 1 + {
 2 +   "version": "0.0.1",
 3 +   "timeout": 60,
 4 +   "tasks": [
 5 +     {
 6 +       "id": "smoke-01",
 7 +       "description": "Track KOL opinions on an emerging technology topic with bot detection and trend context",
 8 +       "input": "Monitor Twitter/X for key opinion leaders discussing the impact of new EU AI regulation on the tech startup ecosystem. Identify the dominant narratives, track which KOLs are driving the conversation, and flag any bot-driven amplification. I need a structured intelligence briefing with confidence ratings.",
 9 +       "rubric": [
10 +         {
11 +           "criterion": "Source Curation & KOL Identification",
12 +           "weight": 0.25,
13 +           "scoring": {
14 +             "5": "Identifies 5+ relevant KOLs across multiple tiers (tech policy experts, startup founders, regulatory analysts); constructs targeted queries using from: operators and topic keywords; sets appropriate temporal window",
15 +             "3": "Identifies 2-3 relevant accounts but misses important KOL tiers or uses overly broad queries",
16 +             "1": "Generic keyword search with no specific KOL targeting or source curation",
17 +             "0": "No source curation attempted"
18 +           }
19 +         },
20 +         {
21 +           "criterion": "Signal Filtering & Bot Detection",
22 +           "weight": 0.25,
23 +           "scoring": {
24 +             "5": "Applies multi-layer filtering (bot removal, deduplication, relevance scoring); explicitly checks for bot amplification signals; reports both raw and filtered metrics; flags inauthenticity risks",
25 +             "3": "Some filtering applied but bot detection is incomplete or not quantified",
26 +             "1": "Minimal filtering; treats all engagement as organic",
27 +             "0": "No filtering or bot detection"
28 +           }
29 +         },
30 +         {
31 +           "criterion": "Opinion & Trend Analysis",
32 +           "weight": 0.25,
33 +           "scoring": {
34 +             "5": "Extracts distinct stance clusters (supporters, critics, neutrals) with attributed KOL positions; detects trend velocity and sentiment direction; accounts for sarcasm and tone; includes temporal context",
35 +             "3": "Identifies general sentiment but lacks stance clustering or temporal analysis; tone detection is basic",
36 +             "1": "Surface-level sentiment without stance attribution or trend context",
37 +             "0": "No opinion or trend analysis"
38 +           }
39 +         },
40 +         {
41 +           "criterion": "Output Structure & Confidence",
42 +           "weight": 0.25,
43 +           "scoring": {
44 +             "5": "Follows structured briefing format: executive summary, key findings with source attribution, KOL positions table, trend metrics, bot assessment, confidence rating with justification, and recommended actions",
45 +             "3": "Partially structured output with some elements missing (e.g., no confidence rating or no recommended actions)",
46 +             "1": "Unstructured narrative summary without clear sections or source attribution",
47 +             "0": "Raw data dump with no organization"
48 +           }
49 +         }
50 +       ],
51 +       "passThreshold": 60
52 +     }
53 +   ]
54 + }
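The smoke file adds two fields the benchmark tasks lack: a per-run `timeout` (in seconds) and a `passThreshold` on the 0-100 scale. A sketch of the implied pass/fail gate follows; the `task_passes` helper is hypothetical and assumes the same weighted 0-5 to 0-100 scoring implied by the benchmark's `expectedScore*` fields.

```python
import json

# Hypothetical pass/fail gate over a smoke task. The scoring convention
# (weighted 0-5 average rescaled to 0-100) is an assumption.

def task_passes(task, criterion_scores):
    """True when the weighted 0-100 score meets the task's passThreshold."""
    raw = sum(item["weight"] * criterion_scores[item["criterion"]]
              for item in task["rubric"])
    return raw / 5.0 * 100.0 >= task["passThreshold"]

# Trimmed-down copy of smoke-01 (scoring anchors omitted for brevity).
smoke = json.loads("""{
  "tasks": [{
    "id": "smoke-01",
    "passThreshold": 60,
    "rubric": [
      {"criterion": "Source Curation & KOL Identification", "weight": 0.25},
      {"criterion": "Signal Filtering & Bot Detection", "weight": 0.25},
      {"criterion": "Opinion & Trend Analysis", "weight": 0.25},
      {"criterion": "Output Structure & Confidence", "weight": 0.25}
    ]
  }]
}""")
task = smoke["tasks"][0]
scores = {c["criterion"]: 3 for c in task["rubric"]}  # uniform 3s -> exactly 60.0
print(task_passes(task, scores))                      # → True
```

With equal 0.25 weights, a uniform "3" across all four criteria lands exactly on the 60-point threshold, so the smoke test passes only when the agent reaches at least the mid anchor on average.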