@botlearn/sentiment-analyzer 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README.md +35 -0
- package/knowledge/anti-patterns.md +86 -0
- package/knowledge/best-practices.md +148 -0
- package/knowledge/domain.md +152 -0
- package/manifest.json +26 -0
- package/package.json +35 -0
- package/skill.md +47 -0
- package/strategies/main.md +110 -0
- package/tests/benchmark.json +476 -0
- package/tests/smoke.json +54 -0
package/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 BotLearn

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
package/README.md
ADDED
@@ -0,0 +1,35 @@
# @botlearn/sentiment-analyzer

> Fine-grained sentiment recognition and opinion mining with aspect-based analysis, sarcasm detection, and polarity classification for OpenClaw Agent

## Installation

```bash
# via npm
npm install @botlearn/sentiment-analyzer

# via clawhub
clawhub install @botlearn/sentiment-analyzer
```

## Category

content-processing

## Dependencies

None

## Files

| File | Description |
|------|-------------|
| `manifest.json` | Skill metadata and configuration |
| `skill.md` | Role definition and activation rules |
| `knowledge/` | Domain knowledge documents |
| `strategies/` | Behavioral strategy definitions |
| `tests/` | Smoke and benchmark tests |

## License

MIT
package/knowledge/anti-patterns.md
ADDED
@@ -0,0 +1,86 @@
---
domain: sentiment-analyzer
topic: anti-patterns
priority: medium
ttl: 30d
---

# Sentiment Analysis — Anti-Patterns

## Polarity Classification Anti-Patterns

### 1. Binary Sentiment Trap
- **Problem**: Reducing all sentiment to just positive or negative, losing nuance and gradation
- **Fix**: Always use a fine-grained scale (7-point: strongly negative through strongly positive). "Okay" is not positive. "Not terrible" is not the same as "good". Capture the full spectrum including slightly positive/negative and neutral.

### 2. Ignoring Negation and Valence Shifters
- **Problem**: Treating "not good" as positive because "good" is in the sentiment lexicon, or ignoring "barely", "hardly", "never" as modifiers
- **Fix**: Always parse for negation markers (not, no, never, neither, un-, in-, -less) BEFORE assigning polarity. Apply negation scope rules: negation inverts the next sentiment expression within the same clause. Process valence shifters in order: negation first, then intensifiers/diminishers.

### 3. Ignoring Intensifiers and Diminishers
- **Problem**: Treating "excellent" and "somewhat good" as the same strength of positive sentiment
- **Fix**: Apply intensity multipliers. Intensifiers (very, extremely, incredibly) amplify by 1.25-1.5x. Diminishers (somewhat, slightly, a bit) reduce by 0.5-0.75x. Stack multipliers when multiple modifiers are present ("really very good" = double intensification).

### 4. Context-Free Lexicon Lookup
- **Problem**: Assigning fixed sentiment scores from a general lexicon without considering domain context. "Sick" is negative in health context but positive in slang ("that's sick!").
- **Fix**: Detect the text domain first. Apply domain-specific sentiment adjustments. Maintain override lists for domain-dependent terms. When domain is uncertain, flag the term and provide both interpretations.

### 5. Averaging Away Mixed Sentiment
- **Problem**: A review that says "Amazing camera but horrible battery" gets averaged to "neutral", losing the actionable insight that two distinct aspects have opposite sentiments
- **Fix**: Report aspect-level sentiments separately. Document-level should be labeled "mixed" rather than averaged to neutral. Always preserve the aspect-level detail that reveals what specifically is positive or negative.
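The fix for anti-patterns 2 and 3 — negation first, then intensity multipliers — can be sketched as a single pass over a clause. The word lists, multiplier values, and the `score_phrase` helper are illustrative assumptions, not part of this package:

```python
# Illustrative valence-shifter pipeline: negation is resolved BEFORE
# intensifiers/diminishers, and modifiers stack multiplicatively.
NEGATORS = {"not", "no", "never", "neither", "hardly", "barely"}
INTENSIFIERS = {"very": 1.5, "extremely": 1.5, "really": 1.25, "incredibly": 1.5}
DIMINISHERS = {"somewhat": 0.5, "slightly": 0.5, "fairly": 0.75}
LEXICON = {"good": 0.5, "terrible": -0.8, "excellent": 0.8}

def score_phrase(tokens):
    """Score one clause: negation inverts, modifiers scale the base polarity."""
    negated = False
    multiplier = 1.0
    for tok in tokens:
        if tok in NEGATORS:
            negated = True
        elif tok in INTENSIFIERS:
            multiplier *= INTENSIFIERS[tok]
        elif tok in DIMINISHERS:
            multiplier *= DIMINISHERS[tok]
        elif tok in LEXICON:
            score = LEXICON[tok] * multiplier
            return -score if negated else score
    return 0.0  # no sentiment-bearing word found
```

With these illustrative values, `["very", "good"]` scores higher than `["good"]` alone, and `["not", "good"]` comes out negative rather than positive.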
## Sarcasm and Irony Anti-Patterns

### 6. Missing Sarcasm and Irony
- **Problem**: Taking sarcastic statements at face value. "Great, another update that broke everything" is classified as positive because of "great"
- **Fix**: Check for incongruity between sentiment words and context (complaints, low ratings, negative events). Look for hyperbole, excessive punctuation, and known sarcasm markers. When sarcasm is detected, invert the polarity. Flag uncertainty when sarcasm detection is ambiguous.

### 7. Over-Detecting Sarcasm
- **Problem**: Flagging every positive statement in a negative context as sarcastic, even when the author is genuinely acknowledging a positive aspect
- **Fix**: Require multiple sarcasm signals before inverting. A single positive word in a negative review is not necessarily sarcastic ("The packaging was nice but the product was broken" — "nice" is genuine). Only invert when there is strong incongruity evidence (e.g., star rating contradicts text, extreme hyperbole).

### 8. Ignoring Quoted or Reported Sarcasm
- **Problem**: Treating sarcasm in reported speech the same as the author's own sarcasm. "My friend said 'oh how wonderful' when the flight was cancelled" — the author is reporting, not being sarcastic themselves.
- **Fix**: Identify the opinion holder. Apply sarcasm detection only to the expressed opinion of the correct holder. The author's sentiment about the reported sarcasm may differ from the sarcasm itself.
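The "require multiple signals" rule from anti-pattern 7 reduces to a threshold check over independent indicators. The signal names below are hypothetical:

```python
# Sketch of multi-signal sarcasm gating: one positive word in a negative
# context is NOT enough; invert polarity only when >= 2 signals fire.
def should_invert(signals):
    """signals: dict mapping sarcasm-indicator names to booleans."""
    return sum(1 for present in signals.values() if present) >= 2

# "Great, another update that broke everything" posted with a 1-star rating:
review_signals = {
    "rating_text_mismatch": True,   # 1 star but glowing words
    "hyperbole": True,              # "Great, another..."
    "explicit_marker": False,       # no "/s" or #sarcasm
}
```

A lone `hyperbole` signal would leave the surface polarity untouched, which is exactly the behavior anti-pattern 7 asks for.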
## Scope and Granularity Anti-Patterns

### 9. Sentence-Level Assumption
- **Problem**: Assuming each sentence has exactly one sentiment, when a single sentence can contain multiple aspects with different sentiments: "The food was delicious but overpriced"
- **Fix**: Perform clause-level or aspect-level analysis within sentences. Split on conjunctions (but, however, although, yet, while) that signal sentiment shifts. Each clause may have an independent polarity.

### 10. Ignoring Implicit Aspects
- **Problem**: Only extracting explicit aspect nouns ("battery", "screen") and missing implicit aspects. "It's too heavy" implies the WEIGHT aspect without naming it.
- **Fix**: Maintain an implicit aspect mapping: adjective/property to aspect category. "Heavy" maps to WEIGHT, "slow" maps to PERFORMANCE, "expensive" maps to PRICE. Train on domain-specific implicit aspect patterns.

### 11. Entity Confusion
- **Problem**: Misattributing sentiment to the wrong entity. "I returned the Samsung because the Apple was better" — sentiment toward Samsung is negative, Apple is positive, but a naive approach might reverse this.
- **Fix**: Parse sentence structure to correctly bind sentiment expressions to their targets using dependency parsing or proximity rules. Track entity mentions and their co-occurring sentiment expressions separately.
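The clause-level splitting from anti-pattern 9 can be approximated with a regex over contrastive conjunctions. The conjunction list mirrors the fix above; everything else is an illustrative sketch, not a full clause parser:

```python
import re

# Break a sentence on contrastive conjunctions so each clause can be
# scored independently instead of averaging away mixed sentiment.
CONTRASTIVE = r"\b(?:but|however|although|yet|while)\b"

def split_clauses(sentence):
    """Return non-empty clause fragments, trimmed of stray punctuation."""
    parts = re.split(CONTRASTIVE, sentence, flags=re.IGNORECASE)
    return [p.strip(" ,.") for p in parts if p.strip(" ,.")]
```

For the example in the text, "The food was delicious but overpriced" yields two clauses that can then carry opposite polarities.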
## Confidence and Output Anti-Patterns

### 12. Overconfident Neutral
- **Problem**: Assigning high confidence (>0.90) to "neutral" when the text is actually ambiguous, mixed, or contains sentiment that was not detected
- **Fix**: Reserve high-confidence neutral for genuinely objective/factual text ("The meeting is at 3 PM"). For ambiguous text, assign lower confidence neutral (0.40-0.60) and flag as "potentially ambiguous". Run a second pass specifically looking for implicit sentiment.

### 13. Ignoring Opinion Holder
- **Problem**: Treating all sentiment in a text as the author's opinion, even when some is reported speech, quotes, or hypothetical
- **Fix**: Distinguish between the author's own sentiment, reported sentiment from others, and hypothetical/conditional sentiment. Tag each opinion with its holder: author, quoted source, or hypothetical.

### 14. No Uncertainty Communication
- **Problem**: Always presenting a single definitive sentiment label without indicating when the analysis is uncertain or when multiple interpretations are plausible
- **Fix**: Always include confidence scores. When confidence is below 0.60, present multiple possible interpretations ranked by likelihood. Explicitly flag when sarcasm detection is uncertain, when domain is ambiguous, or when signals conflict.

### 15. Temporal Sentiment Blindness
- **Problem**: Ignoring how sentiment evolves within a text. A review might start positive ("I was excited to receive this") and end negative ("but after a week it completely fell apart"). The final sentiment matters most.
- **Fix**: Track sentiment trajectory across the document. Weight later sentences more heavily for document-level sentiment (recency bias aligns with the author's final judgment). Flag sentiment shifts and report them explicitly.
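The recency weighting from anti-pattern 15 can be sketched as a weighted mean over per-sentence scores. The linearly increasing weight scheme is an assumption; any monotone scheme that favors later sentences would serve:

```python
# Roll sentence scores up to a document score, letting later sentences
# count more (the author's final judgment dominates).
def recency_weighted(scores):
    """scores: per-sentence polarities in document order."""
    if not scores:
        return 0.0
    weights = [i + 1 for i in range(len(scores))]  # 1, 2, ..., n
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```

A review opening at +0.8 ("I was excited...") and closing at -0.9 ("...completely fell apart") rolls up negative rather than near-neutral.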
## Data Quality Anti-Patterns

### 16. Emoji and Emoticon Blindness
- **Problem**: Ignoring emoji or emoticons that carry significant sentiment signal, especially in social media and messaging
- **Fix**: Include emoji sentiment mappings in the lexicon. Common mappings: positive (thumbs up, heart, laughing, fire), negative (thumbs down, angry, crying, broken heart), sarcastic (eye-roll, upside-down smile). Emoji can override or reinforce textual sentiment.

### 17. Treating All Text Sources Equally
- **Problem**: Applying the same analysis parameters to a tweet, a product review, a news article, and a legal document
- **Fix**: Detect text source/register and adjust analysis accordingly. Tweets: expect abbreviations, emoji, hashtags. Product reviews: expect star ratings as anchors. News: distinguish editorial from reported content. Legal: formal negation, not sentiment-bearing.
package/knowledge/best-practices.md
ADDED
@@ -0,0 +1,148 @@
---
domain: sentiment-analyzer
topic: multi-level-analysis-and-calibration
priority: high
ttl: 30d
---

# Sentiment Analysis — Best Practices

## Multi-Level Analysis

### 1. Document-Level vs. Sentence-Level vs. Aspect-Level
Always analyze at the most granular level appropriate for the task, then aggregate upward:

- **Aspect-level** — Most granular: identifies sentiment toward specific features or entities
- **Sentence-level** — Each sentence gets an independent polarity label
- **Document-level** — Overall sentiment, aggregated from lower levels

**Aggregation rule**: Document sentiment is NOT simply the average of sentence sentiments. Weight by:
1. **Position** — Concluding sentences carry more weight (recency effect)
2. **Emphasis** — Sentences with intensifiers or exclamation carry more weight
3. **Aspect importance** — Core aspects (e.g., food in a restaurant review) weigh more than peripheral ones (e.g., parking)
4. **Explicit summary signals** — Phrases like "overall", "in summary", "all in all" anchor the document-level sentiment

### 2. Sentence-Level Analysis Pipeline
For each sentence:
1. Identify opinion targets (explicit or implicit aspects)
2. Locate sentiment expressions (adjectives, adverbs, verbs, phrases)
3. Detect valence shifters (negation, intensifiers, diminishers)
4. Compute modified polarity: `base_polarity * shifter_multiplier`
5. Assign confidence based on signal strength
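The aggregation rule above can be sketched as a weighted mean over sentence polarities, with position, emphasis, and summary-signal weights. The concrete weight values here are assumptions for illustration, not calibrated constants:

```python
# Weighted roll-up of sentence polarities to a document polarity.
def aggregate(sentences):
    """sentences: list of (polarity, has_emphasis, is_summary) tuples,
    in document order."""
    total, weight_sum = 0.0, 0.0
    n = len(sentences)
    for i, (polarity, has_emphasis, is_summary) in enumerate(sentences):
        w = 1.0 + i / max(n - 1, 1)   # position: later sentences count more
        if has_emphasis:
            w *= 1.25                  # intensifiers / exclamation
        if is_summary:
            w *= 2.0                   # "overall", "in summary" anchors
        total += w * polarity
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0
```

A document that opens positive but closes with a negative summary sentence ("Overall, not worth it") rolls up negative under this scheme, not neutral.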
### 3. Cross-Sentence Sentiment Flow
Sentiment can span multiple sentences:
- "The screen is gorgeous. Colors are vibrant and the resolution is sharp." — all three clauses express positive sentiment toward the DISPLAY aspect
- Track aspect continuity across sentences using coreference: "The laptop... It... This device..."
- Contrastive conjunctions signal sentiment shifts: "The food was great. However, the service was terrible."

## Sarcasm and Irony Detection

### Indicators of Sarcasm
Sarcasm detection requires looking beyond lexical sentiment. Key signals:

1. **Hyperbole** — Excessive praise or criticism that doesn't match context:
   - "Oh sure, waiting 3 hours for a table is just DELIGHTFUL"
2. **Incongruity** — Positive words in objectively negative situations:
   - "Love how my flight got cancelled for the third time this month"
3. **Punctuation patterns** — Ellipsis, excessive exclamation, quotes:
   - "The 'premium' service was truly... something"
4. **Contextual mismatch** — Sentiment contradicts known facts:
   - Rating: 1/5 stars. Text: "Best experience ever!"
5. **Universal quantifiers with negative context** — "Everything" + "perfect" when listing complaints:
   - "Everything about this product is absolutely perfect (if you enjoy things that break immediately)"
6. **Hashtags and tags** (social media): `#sarcasm`, `#not`, `/s`

### Sarcasm Handling Strategy
- IF sarcasm is detected THEN invert the surface sentiment polarity
- Assign lower confidence (0.60-0.75) to sarcasm-inverted sentiments since sarcasm detection is inherently uncertain
- Flag the output as sarcasm-detected so downstream consumers are aware
- When uncertain whether text is sarcastic, report BOTH literal and sarcastic interpretations with respective confidence scores

### Irony vs. Sarcasm
- **Sarcasm** — Intentionally saying the opposite of what is meant, usually to mock or criticize
- **Irony** — A broader concept where reality contradicts expectations (not always mocking)
- **Situational irony** in text: "The fire station burned down" — neutral/objective sentiment despite ironic content
- Only invert sentiment for **verbal irony/sarcasm**, not situational irony
## Domain Calibration

### Why Domain Matters
The same word carries different sentiment weight depending on context:

| Word | Product Reviews | Medical Text | Financial Text | Sports |
|------|----------------|--------------|----------------|--------|
| aggressive | -0.4 | -0.2 | +0.3 | +0.4 |
| stable | +0.3 | +0.6 | +0.5 | +0.1 |
| volatile | -0.5 | -0.3 | -0.6 | +0.2 |
| critical | -0.5 | -0.7 | -0.4 | +0.1 |
| positive | +0.5 | +0.8 | +0.6 | +0.3 |
| sharp | +0.3 | -0.3 | -0.4 | +0.4 |

### Domain Detection Heuristics
Before starting analysis, identify the text domain:
1. Check for domain-specific vocabulary (medical terms, financial jargon, product categories)
2. Identify the text source if available (Amazon review, tweet, news article, clinical note)
3. Look for structural cues (star ratings, review headings, formal structure)
4. Apply the appropriate domain-specific sentiment adjustments from the lexicon

### Calibration Rules
- **Product reviews** — Star ratings (if available) serve as ground truth anchors; calibrate text analysis to align with explicit ratings
- **Social media** — Short, informal, heavy slang/emoji usage; expand lexicon to include platform-specific terms
- **News articles** — More objective tone; distinguish between reported sentiment (what sources said) and editorial sentiment
- **Academic/technical text** — Hedged language is standard, not an indicator of negativity; "may", "could", "suggests" are neutral
- **Legal/regulatory text** — Formal negation patterns; "shall not" is a directive, not sentiment
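The calibration table above amounts to per-domain overrides layered over a base lexicon. A minimal sketch, with scores taken from the table and a hypothetical `domain_score` lookup as the API:

```python
# Base lexicon with per-domain override layers; an override wins when
# both the domain and the word are known, otherwise fall back to base.
BASE = {"aggressive": -0.4, "stable": 0.3, "volatile": -0.5}
DOMAIN_OVERRIDES = {
    "financial": {"aggressive": 0.3, "stable": 0.5, "volatile": -0.6},
    "sports": {"aggressive": 0.4, "stable": 0.1, "volatile": 0.2},
}

def domain_score(word, domain=None):
    overrides = DOMAIN_OVERRIDES.get(domain, {})
    if word in overrides:
        return overrides[word]
    return BASE.get(word, 0.0)
```

The same word flips sign depending on the detected domain: "aggressive" is negative for a product review but positive for an investment strategy.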
## Confidence Scoring

### Calibrated Confidence Guidelines

| Signal Strength | Confidence Range | Example |
|----------------|------------------|---------|
| Strong explicit + consistent | 0.90 - 1.00 | "Absolutely love it, best purchase ever!" |
| Clear sentiment, single aspect | 0.80 - 0.89 | "The battery life is disappointing" |
| Sentiment with hedging | 0.65 - 0.79 | "It seems fairly good, though I'm not sure yet" |
| Mixed/conflicting signals | 0.50 - 0.64 | "Great features but terrible reliability" |
| Sarcasm-inverted | 0.55 - 0.75 | "Oh wonderful, another crash" (detected as sarcasm) |
| Ambiguous/insufficient data | 0.30 - 0.49 | "It is what it is" |

### Confidence Penalties
Apply confidence reductions for:
- **Short text** (< 10 words): -0.10 (less context for disambiguation)
- **No explicit sentiment words**: -0.15 (relying on implicit sentiment)
- **Domain mismatch** (uncertain domain): -0.10
- **Sarcasm uncertainty**: -0.15 to -0.25
- **Multiple conflicting aspects**: -0.05 per conflicting pair (for document-level confidence only)
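The penalty list above reduces to subtraction from a base confidence, clamped to [0, 1]. The flag names in this sketch are illustrative:

```python
# Apply the confidence penalties from the table to a base confidence.
def apply_penalties(base, *, short_text=False, no_explicit_words=False,
                    uncertain_domain=False, conflicting_pairs=0):
    conf = base
    if short_text:
        conf -= 0.10            # < 10 words
    if no_explicit_words:
        conf -= 0.15            # implicit sentiment only
    if uncertain_domain:
        conf -= 0.10
    conf -= 0.05 * conflicting_pairs   # document-level confidence only
    return max(0.0, min(1.0, conf))
```

For example, a clear single-aspect judgment at 0.85 drops to 0.75 when the text is under ten words.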
## Output Formatting

### Structured Sentiment Report
Always output in a structured format:

```
Document-Level Summary:
  Overall Polarity: [label] ([score])
  Confidence: [0.00 - 1.00]
  Dominant Emotion: [if applicable]

Aspect-Level Detail:
  [Aspect 1]:
    Polarity: [label] ([score])
    Opinion Expression: "[quoted text]"
    Valence Shifters: [none | negation | intensifier | diminisher]
    Confidence: [0.00 - 1.00]
  [Aspect 2]:
    ...

Flags:
  Sarcasm Detected: [yes/no]
  Mixed Sentiment: [yes/no]
  Domain: [detected domain]
  Low Confidence Aspects: [list]
```

### Handling Edge Cases
- **Purely factual text**: Label as "neutral/objective" with a note that no opinion was expressed
- **Mixed sentiment**: Report both the positive and negative aspects separately; document-level is "mixed", not an average
- **Non-English text**: Note the language limitation; apply analysis only if the language is within capability
- **Very short text** (< 5 words): Analyze but flag low confidence; avoid over-interpreting
package/knowledge/domain.md
ADDED
@@ -0,0 +1,152 @@
---
domain: sentiment-analyzer
topic: sentiment-polarity-and-opinion-mining
priority: high
ttl: 30d
---

# Sentiment Analysis — Polarity, ABSA, Opinion Mining & Valence Shifters

## Sentiment Polarity Scale

### Fine-Grained Polarity Levels
Sentiment is classified on a 7-point scale rather than a binary positive/negative split:

| Level | Label | Score Range | Example |
|-------|-------|-------------|---------|
| 7 | Strongly Positive | +0.75 to +1.00 | "Absolutely phenomenal — the best I've ever used" |
| 6 | Positive | +0.50 to +0.74 | "I really like this product, it works well" |
| 5 | Slightly Positive | +0.25 to +0.49 | "It's decent, does what it's supposed to" |
| 4 | Neutral | -0.24 to +0.24 | "The device weighs 200 grams and ships in a blue box" |
| 3 | Slightly Negative | -0.25 to -0.49 | "It's okay but nothing special" |
| 2 | Negative | -0.50 to -0.74 | "Disappointed — it broke after a week" |
| 1 | Strongly Negative | -0.75 to -1.00 | "Completely unusable, waste of money, worst purchase ever" |

### Polarity Signals
- **Lexical cues**: Sentiment-bearing words (e.g., "excellent", "terrible", "mediocre")
- **Syntactic patterns**: Comparative structures ("better than", "worse than"), superlatives ("the best", "the worst")
- **Pragmatic cues**: Exclamation marks, ALL CAPS, emoji, rhetorical questions
- **Contextual signals**: Domain-specific terms that carry sentiment only in context (e.g., "unpredictable" is negative for software, neutral for weather description)
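The 7-point table above maps directly to threshold checks on a score in [-1, +1]. A sketch; where the table leaves a boundary tie open, this version assigns it to the stronger label, which is an assumption:

```python
# Map a continuous polarity score to the 7-point label scale.
def label(score):
    if score >= 0.75:
        return "strongly positive"
    if score >= 0.50:
        return "positive"
    if score >= 0.25:
        return "slightly positive"
    if score > -0.25:
        return "neutral"
    if score > -0.50:
        return "slightly negative"
    if score > -0.75:
        return "negative"
    return "strongly negative"
```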
## Aspect-Based Sentiment Analysis (ABSA)

### Core Components
ABSA decomposes a text into structured opinion tuples:

```
(aspect_term, aspect_category, sentiment_polarity, opinion_expression, confidence)
```

**Example**:
- Input: "The camera quality is stunning but the battery life is disappointing"
- Tuple 1: ("camera quality", QUALITY, positive, "stunning", 0.92)
- Tuple 2: ("battery life", PERFORMANCE, negative, "disappointing", 0.89)

### Aspect Extraction Methods
1. **Explicit aspects** — Directly mentioned noun phrases: "screen", "battery", "customer service"
2. **Implicit aspects** — Inferred from context: "It's too heavy" implies WEIGHT aspect without naming it
3. **Composite aspects** — Multi-word aspect terms: "noise cancellation", "build quality", "user interface"

### Common Aspect Categories

| Domain | Typical Aspect Categories |
|--------|---------------------------|
| Product Reviews | Quality, Price, Design, Performance, Durability, Usability, Customer Service |
| Restaurant Reviews | Food, Service, Ambiance, Price, Cleanliness, Location, Wait Time |
| Software Reviews | Functionality, Performance, UI/UX, Reliability, Documentation, Support, Pricing |
| Hotel Reviews | Room, Cleanliness, Location, Staff, Amenities, Value, Food |

### Aspect-Sentiment Pairing Rules
- Each aspect can have exactly one polarity (if a text says contradictory things about the same aspect, split into sub-aspects)
- Aspects without a clear sentiment expression are tagged as neutral with low confidence
- An opinion expression can modify multiple aspects: "Both the food and service were excellent"
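The opinion tuple above can be represented as a typed record. A sketch using the field names from the tuple layout; the `OpinionTuple` class itself is illustrative, not part of the package:

```python
from dataclasses import dataclass

# Typed record mirroring the ABSA opinion tuple layout.
@dataclass
class OpinionTuple:
    aspect_term: str
    aspect_category: str
    sentiment_polarity: str
    opinion_expression: str
    confidence: float

# The worked example from the text, as two records:
t1 = OpinionTuple("camera quality", "QUALITY", "positive", "stunning", 0.92)
t2 = OpinionTuple("battery life", "PERFORMANCE", "negative", "disappointing", 0.89)
```

Keeping the two records separate (rather than averaging them) is what preserves the "mixed" document-level verdict.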
## Opinion Mining Components

### Opinion Holder Identification
The entity expressing the opinion:
- **First-person**: "I think this is great" — opinion holder is the author
- **Reported speech**: "My colleague says the software is buggy" — opinion holder is the colleague
- **Impersonal**: "This product is considered reliable" — opinion holder is general consensus

### Opinion Expression Types
1. **Direct opinions** — Explicitly stated: "The battery life is excellent"
2. **Comparative opinions** — Relative judgments: "Camera A is better than Camera B"
3. **Suggestive opinions** — Implied through suggestions: "They should improve the packaging"
4. **Conditional opinions** — Dependent on context: "Great if you don't mind the weight"

### Opinion Strength Indicators
- **Intensifiers** (+): very, extremely, incredibly, absolutely, utterly, remarkably
- **Diminishers** (-): somewhat, slightly, a bit, fairly, kind of, sort of, rather
- **Modal hedges** (~): might be, could be, seems, appears to, arguably
- **Superlatives** (++): best, worst, most, least, greatest, finest
## Valence Shifters

Valence shifters are linguistic constructs that alter the base sentiment of a word or phrase.

### Negation
Negation inverts or significantly reduces the polarity of a sentiment expression:

| Negation Type | Examples | Effect |
|---------------|----------|--------|
| Simple negation | not, no, never, neither, nor | Inverts polarity |
| Morphological negation | un-, in-, im-, dis-, -less | Inverts polarity |
| Implicit negation | fail to, lack, absence of, devoid of | Inverts polarity |
| Double negation | "not uncommon", "not without merit" | Weakened positive (litotes) |
| Negation scope | "not only good but great" | Scope ends at "but" — positive |

### Negation Scope Rules
- Negation typically scopes over the **next sentiment-bearing word or phrase**
- Conjunctions (`but`, `however`, `although`) reset negation scope
- Clause boundaries terminate negation scope
- Example: "I don't think the camera is bad" — negation applies to "bad", result is weakly positive

### Intensifiers and Diminishers

**Intensifiers** amplify the existing polarity:
```
"very good" → more positive than "good"
"very bad" → more negative than "bad"
"absolutely terrible" → more negative than "terrible"
```

**Diminishers** reduce the magnitude of the existing polarity:
```
"somewhat good" → less positive than "good"
"slightly bad" → less negative than "bad"
"fairly decent" → weakly positive
```

### Irrealis Markers
Words or constructions that indicate the sentiment is hypothetical, wished-for, or counterfactual rather than asserted:

| Marker Type | Examples | Effect |
|-------------|----------|--------|
| Conditional | "if it were better", "would be nice if" | Sentiment is hypothetical, not current |
| Subjunctive | "I wish it were faster" | Implies negative current state |
| Questions | "Is this any good?" | Uncertain — do not classify as asserted sentiment |
| Future/Hope | "hopefully it improves" | Implies negative current state, positive desired state |
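The scope rules above can be sketched as a single pass over tokens: negation applies to the next sentiment-bearing word and is reset by contrastive conjunctions or clause boundaries. Word lists are illustrative, and multiword constructions such as "not only" are out of scope for this sketch:

```python
# Apply negation scope while scoring tokens: a negator flips the NEXT
# sentiment word only, and a conjunction/clause boundary clears it.
NEGATORS = {"not", "no", "never", "don't"}
RESETTERS = {"but", "however", "although", ".", ","}
LEXICON = {"good": 0.5, "great": 0.8, "bad": -0.6}

def scored_tokens(tokens):
    scores, negated = [], False
    for tok in tokens:
        if tok in NEGATORS:
            negated = True
        elif tok in RESETTERS:
            negated = False          # conjunction resets negation scope
        elif tok in LEXICON:
            s = LEXICON[tok]
            scores.append(-s if negated else s)
            negated = False          # scope ends at the sentiment word
    return scores
```

For "not good but great", the negation flips "good" but the "but" reset leaves "great" positive, matching the scope-termination rule above.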
## Sentiment Lexicons

### General-Purpose Lexicons
- **VADER** — Valence Aware Dictionary and sEntiment Reasoner; tuned for social media, includes emoji and slang, scores -4 to +4
- **SentiWordNet** — Synset-level scores for WordNet entries (positivity, negativity, objectivity)
- **AFINN** — 2,477 words scored -5 to +5, good for short informal text
- **NRC Emotion Lexicon** — 14,182 words annotated for 8 emotions + positive/negative

### Domain-Specific Adjustments
Words change sentiment valence across domains:
- "unpredictable" — negative (software), neutral/positive (thriller novel), positive (sports)
- "aggressive" — negative (customer service), positive (investment strategy), neutral (medical treatment)
- "volatile" — negative (software), neutral (chemistry), negative (financial markets)
- "intense" — positive (workout), negative (headache), positive (flavor), neutral (light)

### Slang and Informal Expressions
Social media and informal text require expanded lexicons:
- "fire" / "lit" — strongly positive (slang)
- "salty" — negative (slang, meaning bitter or upset)
- "slaps" — strongly positive (slang, meaning excellent)
- "mid" — negative (slang, meaning mediocre)
- "/s" or "not" (sarcasm markers in online text)
package/manifest.json
ADDED
@@ -0,0 +1,26 @@
{
  "name": "@botlearn/sentiment-analyzer",
  "version": "0.1.0",
  "description": "Fine-grained sentiment recognition and opinion mining with aspect-based analysis, sarcasm detection, and polarity classification for OpenClaw Agent",
  "category": "content-processing",
  "author": "BotLearn",
  "benchmarkDimension": "content-understanding",
  "expectedImprovement": 30,
  "dependencies": {},
  "compatibility": {
    "openclaw": ">=0.5.0"
  },
  "files": {
    "skill": "skill.md",
    "knowledge": [
      "knowledge/domain.md",
      "knowledge/best-practices.md",
      "knowledge/anti-patterns.md"
    ],
    "strategies": [
      "strategies/main.md"
    ],
    "smokeTest": "tests/smoke.json",
    "benchmark": "tests/benchmark.json"
  }
}
package/package.json
ADDED
@@ -0,0 +1,35 @@
{
  "name": "@botlearn/sentiment-analyzer",
  "version": "0.1.0",
  "description": "Fine-grained sentiment recognition and opinion mining with aspect-based analysis, sarcasm detection, and polarity classification for OpenClaw Agent",
  "type": "module",
  "main": "manifest.json",
  "files": [
    "manifest.json",
    "skill.md",
    "knowledge/",
    "strategies/",
    "tests/",
    "README.md"
  ],
  "keywords": [
    "botlearn",
    "openclaw",
    "skill",
    "content-processing"
  ],
  "author": "BotLearn",
  "license": "MIT",
  "repository": {
    "type": "git",
    "url": "https://github.com/readai-team/botlearn-awesome-skills.git",
    "directory": "packages/skills/sentiment-analyzer"
  },
  "homepage": "https://github.com/readai-team/botlearn-awesome-skills/tree/main/packages/skills/sentiment-analyzer",
  "bugs": {
    "url": "https://github.com/readai-team/botlearn-awesome-skills/issues"
  },
  "publishConfig": {
    "access": "public"
  }
}
package/skill.md
ADDED
@@ -0,0 +1,47 @@
---
name: sentiment-analyzer
role: Sentiment Analysis Specialist
version: 1.0.0
triggers:
  - "sentiment"
  - "opinion mining"
  - "analyze tone"
  - "polarity"
  - "sentiment analysis"
  - "emotional tone"
  - "opinion detection"
  - "feeling analysis"
---

# Role

You are a Sentiment Analysis Specialist. When activated, you perform fine-grained sentiment recognition and opinion mining on text, identifying polarity at document, sentence, and aspect levels. You detect nuanced sentiment cues including sarcasm, irony, hedging, and intensification, and produce structured sentiment assessments with confidence scores achieving >85% accuracy.

# Capabilities

1. Classify sentiment polarity at multiple granularities: document-level, sentence-level, and aspect-level (ABSA)
2. Identify and extract opinion targets (aspects) and their associated sentiment expressions using opinion mining techniques
3. Detect valence shifters including negation, intensifiers, diminishers, and irrealis markers that modify base sentiment
4. Recognize sarcasm, irony, and implicit sentiment that contradicts surface-level lexical cues
5. Produce calibrated confidence scores for each sentiment judgment, reflecting genuine uncertainty when signals are mixed
6. Aggregate aspect-level sentiments into a coherent document-level summary with weighted rollup

# Constraints

1. Never assign sentiment without identifying the specific opinion target — every sentiment must be anchored to an aspect or entity
2. Never treat sentiment as purely binary (positive/negative) — always use a fine-grained scale (e.g., strongly negative, negative, slightly negative, neutral, slightly positive, positive, strongly positive)
3. Never ignore negation or valence shifters — "not good" is not positive, "not bad" is not negative
4. Never assume literal interpretation when sarcasm or irony markers are present (hyperbole, contradiction, context mismatch)
5. Never present high-confidence scores when the text contains genuinely ambiguous or conflicting sentiment signals
6. Always calibrate sentiment interpretation to the domain context — product reviews, social media, and formal reports use different sentiment conventions

# Activation

WHEN the user requests sentiment analysis, opinion mining, or tone assessment:
1. Segment the input text into analyzable units following strategies/main.md
2. Identify opinion targets (aspects) and sentiment expressions using knowledge/domain.md
3. Detect valence shifters, sarcasm markers, and contextual modifiers
4. Classify polarity on a fine-grained scale with calibrated confidence
5. Verify against knowledge/anti-patterns.md to avoid common sentiment analysis errors
6. Apply knowledge/best-practices.md for multi-level aggregation and domain calibration
7. Output structured sentiment assessment with aspect-level detail and document-level summary
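The 7-point scale named in Constraint 2 amounts to a threshold mapping over a continuous polarity score in [-1, 1]. A minimal sketch follows; the cut points are illustrative assumptions, not values the package defines.

```python
# Illustrative mapping from a polarity score in [-1.0, 1.0] to the
# 7-point scale named in the skill's constraints.
# The thresholds below are assumptions for demonstration only.
LABEL_BOUNDS = [
    (-0.6, "strongly negative"),
    (-0.3, "negative"),
    (-0.05, "slightly negative"),
    (0.05, "neutral"),
    (0.3, "slightly positive"),
    (0.6, "positive"),
]

def polarity_label(score: float) -> str:
    """Return the first label whose upper bound exceeds the score."""
    for upper, label in LABEL_BOUNDS:
        if score < upper:
            return label
    return "strongly positive"
```

Any monotone set of cut points works; the important property is that the scale is finer than binary, as the constraint requires.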
package/strategies/main.md
ADDED
@@ -0,0 +1,110 @@
---
strategy: sentiment-analyzer
version: 1.0.0
steps: 6
---

# Sentiment Analysis Strategy

## Step 1: Segmentation
- Receive the input text and determine its scope: single sentence, paragraph, multi-paragraph document, or structured data (e.g., review with star rating)
- Detect the text **domain** using vocabulary and structural cues:
  - Product review → look for product terms, star ratings, pros/cons structure
  - Social media → short text, emoji, hashtags, informal language, slang
  - News/editorial → formal structure, quotations, byline
  - Technical/academic → hedged language, citations, formal tone
- Segment the text into analyzable units:
  - Split into sentences using punctuation and structural boundaries
  - Within sentences, split on contrastive conjunctions (`but`, `however`, `although`, `yet`, `while`, `on the other hand`) into clauses
  - IF the text contains bullet points, numbered lists, or structured sections THEN treat each item as a separate unit
- Record metadata: total sentence count, detected domain, text register (formal/informal), presence of ratings or structured signals
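The segmentation rules above (sentence split on terminal punctuation, then clause split on contrastive conjunctions) can be sketched as below. The regexes are simplifications of the strategy's boundary rules, not the package's implementation.

```python
import re

# Sketch of Step 1 unit segmentation. The conjunction list mirrors the
# strategy text; the regexes themselves are illustrative assumptions.
CONTRASTIVES = r"\b(?:but|however|although|yet|while)\b|on the other hand"

def segment(text: str) -> list[str]:
    """Split text into sentences, then into clauses at contrastive pivots."""
    units = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        for clause in re.split(CONTRASTIVES, sentence, flags=re.I):
            clause = clause.strip(" ,;")
            if clause:
                units.append(clause)
    return units
```

A review such as "The screen is great, but the battery is poor." yields two clauses, which keeps the opposing sentiments from cancelling each other at the next steps.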
## Step 2: Aspect Identification
- For each sentence/clause, extract **opinion targets** (aspects):
  - **Explicit aspects**: Noun phrases that are the subject of opinion expressions ("the battery life", "customer support", "build quality")
  - **Implicit aspects**: Infer from adjectives or properties without a named target ("too heavy" → WEIGHT aspect, "overpriced" → PRICE aspect, "slow" → PERFORMANCE aspect)
  - **Composite aspects**: Multi-word targets that should not be split ("noise cancellation", "user interface", "customer service")
- Categorize each aspect into a domain-appropriate category using the taxonomy in knowledge/domain.md
- IF no aspects are identified THEN treat the entire sentence as a single-aspect unit targeting the overall subject
- IF the text mentions multiple entities (e.g., comparing products) THEN tag each aspect with its entity to prevent misattribution
- Build an **aspect inventory**: list all unique aspects found across the document with their source sentences

## Step 3: Sentiment Cue Detection
- For each clause/aspect pair, identify all sentiment-bearing expressions:
  - **Sentiment words**: Adjectives ("excellent", "terrible"), adverbs ("poorly", "beautifully"), verbs ("love", "hate", "enjoy", "struggle"), nouns ("disaster", "delight", "nightmare")
  - **Phrases and idioms**: "fell short of expectations", "blew me away", "left a lot to be desired", "hit it out of the park"
  - **Comparative expressions**: "better than", "worse than", "not as good as", "far superior to"
  - **Emoji and emoticons**: Map to sentiment scores using knowledge/domain.md lexicon entries
- Detect all **valence shifters** in scope:
  - **Negation markers**: not, no, never, neither, hardly, barely, nobody, nothing, un-, in-, im-, dis-, -less
  - **Intensifiers**: very, extremely, incredibly, absolutely, utterly, remarkably, thoroughly, deeply
  - **Diminishers**: somewhat, slightly, a bit, fairly, kind of, sort of, rather, marginally
  - **Irrealis markers**: if, would, could, might, wish, hope, should (indicates hypothetical, not asserted sentiment)
- Determine **negation scope**: negation applies to the next sentiment expression within the same clause; conjunction boundaries and clause boundaries terminate scope
- IF a sentiment word appears but has no clear aspect target THEN check previous sentences for aspect continuity via coreference ("The laptop... It... This device...")
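The negation-scope rule above (a negator flips the next sentiment expression in the same clause, and clause boundaries terminate the scope) can be sketched as a single token pass. The word lists are tiny illustrative stand-ins for the lexicon in knowledge/domain.md.

```python
import re

# Sketch of the Step 3 negation-scope rule. Word lists are assumptions,
# not the package's lexicon; real scoping would use the full lists above.
NEGATORS = {"not", "no", "never", "hardly", "barely"}
CLAUSE_BREAKS = {"but", "however", "although", "yet", ",", ";", "."}

def negated_tokens(text: str) -> set[str]:
    """Return tokens that fall inside a negation scope."""
    tokens = re.findall(r"\w+|[,.;]", text.lower())
    in_scope, negated = False, set()
    for tok in tokens:
        if tok in CLAUSE_BREAKS:
            in_scope = False      # clause boundary terminates scope
        elif tok in NEGATORS:
            in_scope = True       # open a scope for the next content word
        elif in_scope:
            negated.add(tok)
            in_scope = False      # scope ends at the next content word
    return negated
```

On "the food was not good, but the view was great" only "good" is marked as negated; the comma and "but" keep the second clause out of scope.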
## Step 4: Polarity Classification
- For each aspect-sentiment pair, compute the polarity score:
  1. Start with the **base polarity** of the sentiment expression (from lexicon or contextual analysis)
  2. Apply **domain calibration** — adjust the base score if the word has domain-specific sentiment (see knowledge/domain.md domain-specific adjustments)
  3. Apply **valence shifters** in order:
     - Negation: multiply by -0.8 to -1.0 (not full inversion for nuanced negation like "not bad")
     - Intensifiers: multiply magnitude by 1.25-1.5x
     - Diminishers: multiply magnitude by 0.5-0.75x
     - Double negation: result is weakly positive ("not uncommon" ≈ +0.2 to +0.3)
  4. Apply **irrealis discount**: IF the sentiment is hypothetical/conditional THEN reduce confidence by 0.20 and tag as "non-asserted"
- Map the final score to the 7-point polarity label using the scale in knowledge/domain.md
- IF sarcasm indicators are present (see Step 3 and knowledge/best-practices.md):
  - Check for incongruity: positive words + negative context, or hyperbole + complaint pattern
  - IF sarcasm is confirmed THEN invert polarity and apply confidence penalty (see Step 5)
  - IF sarcasm is uncertain THEN report BOTH literal and inverted interpretations
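The shifter arithmetic in Step 4 can be sketched directly from the stated ranges. The midpoint multipliers chosen below are assumptions within those ranges; the function shape itself is illustrative.

```python
# Sketch of the Step 4 scoring pipeline: base polarity, then valence
# shifters, clamped to [-1, 1]. Multiplier midpoints are assumptions
# within the ranges the strategy gives.
def shifted_score(base: float, *, negated: bool = False,
                  intensified: bool = False, diminished: bool = False) -> float:
    score = base
    if negated:
        score *= -0.9   # -0.8 to -1.0: soft inversion, so "not good" < 0 but weaker
    if intensified:
        score *= 1.4    # 1.25-1.5x magnitude boost
    if diminished:
        score *= 0.6    # 0.5-0.75x magnitude reduction
    return max(-1.0, min(1.0, score))
```

Soft inversion is what makes "not good" land as moderately negative rather than the mirror image of "good", matching Constraint 3 of the skill.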
## Step 5: Confidence Assessment
- Assign a confidence score (0.00-1.00) to each aspect-level sentiment based on signal strength:
  - **Start at 0.85** as the baseline for clear, unambiguous sentiment
  - **Adjust upward** (+0.05 to +0.15):
    - Multiple reinforcing sentiment cues for the same aspect
    - Star rating aligns with textual sentiment
    - Strong intensifiers with clear targets
  - **Adjust downward** (-0.05 to -0.25):
    - Short text (< 10 words): -0.10
    - No explicit sentiment words (relying on implicit cues): -0.15
    - Sarcasm detected but uncertain: -0.15 to -0.25
    - Domain uncertain: -0.10
    - Hedging language present: -0.10
    - Conflicting signals within the same aspect: -0.15
- Cap confidence at 1.00 and floor it at 0.30
- SELF-CHECK against knowledge/anti-patterns.md:
  - Is any neutral label assigned with confidence > 0.85? → Verify the text is genuinely factual
  - Is any sentiment assigned without an identified aspect? → Reassign it to an aspect or flag it
  - Has negation been properly accounted for in every case?
  - Has domain calibration been applied?
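The Step 5 heuristic is a baseline plus additive adjustments with clamping, which can be sketched as a small table-driven function. The flag names are illustrative; the numeric values are the ones stated above (single values chosen where the strategy gives a range).

```python
# Sketch of the Step 5 confidence heuristic: 0.85 baseline, additive
# adjustments, clamp to [0.30, 1.00]. Flag names are assumptions.
PENALTIES = {
    "short_text": -0.10,
    "implicit_only": -0.15,
    "uncertain_sarcasm": -0.20,   # midpoint of -0.15 to -0.25
    "uncertain_domain": -0.10,
    "hedging": -0.10,
    "conflicting_signals": -0.15,
}
BONUSES = {
    "reinforcing_cues": 0.10,     # within +0.05 to +0.15
    "rating_alignment": 0.10,
}

def confidence(flags: set[str]) -> float:
    """Combine baseline, penalties, and bonuses into a clamped score."""
    score = 0.85
    for flag in flags:
        score += PENALTIES.get(flag, 0.0) + BONUSES.get(flag, 0.0)
    return max(0.30, min(1.00, round(score, 2)))
```

Note the floor: even a pile-up of penalties bottoms out at 0.30 rather than collapsing to zero, so low-confidence aspects stay reportable.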
## Step 6: Aggregation & Output
- **Aspect-level output**: For each aspect, output the structured tuple:
  ```
  Aspect: [aspect_term] ([category])
  Polarity: [label] ([score])
  Opinion Expression: "[quoted text from source]"
  Valence Shifters: [list or "none"]
  Sarcasm: [detected/not detected/uncertain]
  Confidence: [0.00-1.00]
  ```
- **Sentence-level aggregation**: For sentences with multiple aspects:
  - IF all aspects agree in polarity THEN sentence polarity = the shared direction with averaged magnitude
  - IF aspects disagree THEN sentence polarity = "mixed" with the dominant direction noted
- **Document-level aggregation**: Compute overall sentiment using weighted rollup:
  1. Weight aspects by **importance** (core domain aspects weigh 1.5x; peripheral aspects weigh 0.75x)
  2. Weight by **position** (concluding sentences weigh 1.25x)
  3. Weight by **emphasis** (intensified sentiments weigh 1.15x)
  4. IF explicit summary phrases exist ("overall", "in summary", "all in all") THEN anchor document sentiment to those phrases
  5. IF sentiment is mixed across aspects THEN label the document as "mixed" with a breakdown, NOT an averaged neutral
- **Final output**: Produce the complete structured sentiment report following the format in knowledge/best-practices.md:
  - Document-level summary with polarity, confidence, and dominant emotion
  - Aspect-level detail table
  - Flags: sarcasm, mixed sentiment, domain, low-confidence aspects
- SELF-CHECK the complete output:
  - Does the document-level sentiment coherently represent the aspect-level findings?
  - Are all flagged anti-patterns from knowledge/anti-patterns.md avoided?
  - Is every sentiment anchored to a specific aspect or entity?
  - IF any check fails THEN loop back to the relevant step and reprocess
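The Step 6 weighted rollup can be sketched as a weighted mean over aspect scores, with the "mixed, never averaged neutral" rule applied on top. The multipliers come from the strategy; the `Aspect` record shape is an assumption for illustration.

```python
from dataclasses import dataclass

# Sketch of the Step 6 rollup. Weights mirror the strategy's multipliers;
# the Aspect fields are illustrative assumptions.
@dataclass
class Aspect:
    score: float               # polarity in [-1, 1]
    core: bool = False         # core domain aspect -> 1.5x, else 0.75x
    concluding: bool = False   # sits in a concluding sentence -> 1.25x
    intensified: bool = False  # carries an intensifier -> 1.15x

def document_sentiment(aspects: list[Aspect]) -> tuple[str, float]:
    def weight(a: Aspect) -> float:
        w = 1.5 if a.core else 0.75
        if a.concluding:
            w *= 1.25
        if a.intensified:
            w *= 1.15
        return w
    total = sum(weight(a) for a in aspects)
    rolled = sum(weight(a) * a.score for a in aspects) / total
    has_pos = any(a.score > 0.05 for a in aspects)
    has_neg = any(a.score < -0.05 for a in aspects)
    # Opposing aspects get an explicit "mixed" label, never an averaged neutral.
    if has_pos and has_neg:
        label = "mixed"
    elif rolled > 0.05:
        label = "positive"
    elif rolled < -0.05:
        label = "negative"
    else:
        label = "neutral"
    return label, round(rolled, 2)
```

A review praising the display (core, +0.8) but panning the keyboard (-0.6) is reported as "mixed" with the weighted score attached, rather than being flattened to "slightly positive".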
package/tests/benchmark.json
ADDED
@@ -0,0 +1,476 @@
{
  "version": "0.0.1",
  "dimension": "content-understanding",
  "tasks": [
    {
      "id": "bench-easy-01",
      "difficulty": "easy",
      "description": "Simple positive sentiment with clear opinion target",
      "input": "Analyze the sentiment of this text:\n\n\"I absolutely love this coffee maker. It brews a perfect cup every morning and the built-in grinder is incredibly convenient. Best kitchen purchase I've made in years.\"",
      "rubric": [
        {
          "criterion": "Polarity Accuracy",
          "weight": 0.4,
          "scoring": {
            "5": "Correctly identifies strongly positive sentiment; detects intensifiers ('absolutely', 'incredibly', 'best') and maps to the high end of the positive scale",
            "3": "Identifies positive sentiment but does not capture the intensity or differentiate from mildly positive",
            "1": "Identifies sentiment direction but labels it as neutral or mixed",
            "0": "Incorrect polarity"
          }
        },
        {
          "criterion": "Aspect Extraction",
          "weight": 0.3,
          "scoring": {
            "5": "Extracts at least 3 aspects: overall product, brewing quality, built-in grinder/convenience; all tagged as positive",
            "3": "Extracts 1-2 aspects or correctly identifies overall sentiment without aspect detail",
            "1": "No aspect extraction, only document-level label",
            "0": "No usable analysis"
          }
        },
        {
          "criterion": "Output Quality",
          "weight": 0.3,
          "scoring": {
            "5": "Structured output with polarity label, score, confidence (high, >0.90), and aspect breakdown",
            "3": "Provides polarity and some structure but missing confidence or aspect detail",
            "1": "Single-word or single-sentence output",
            "0": "No usable output"
          }
        }
      ],
      "expectedScoreWithout": 45,
      "expectedScoreWith": 85
    },
    {
      "id": "bench-easy-02",
      "difficulty": "easy",
      "description": "Simple negative sentiment with explicit complaints",
      "input": "Analyze the sentiment of this text:\n\n\"Terrible experience at this restaurant. The food was cold, the waiter was rude, and we waited over an hour for our order. Never coming back.\"",
      "rubric": [
        {
          "criterion": "Polarity Accuracy",
          "weight": 0.4,
          "scoring": {
            "5": "Correctly identifies strongly negative sentiment across all aspects; detects 'terrible', 'cold', 'rude', 'never' as negative signals",
            "3": "Identifies negative sentiment but misses intensity or treats some aspects as neutral",
            "1": "Gets direction right but classifies as mildly negative",
            "0": "Incorrect polarity"
          }
        },
        {
          "criterion": "Aspect Extraction",
          "weight": 0.3,
          "scoring": {
            "5": "Extracts 3+ aspects: food (cold/negative), service/waiter (rude/negative), wait time (negative); all correctly negative",
            "3": "Extracts 1-2 aspects, misses some",
            "1": "Only document-level analysis",
            "0": "No analysis"
          }
        },
        {
          "criterion": "Output Quality",
          "weight": 0.3,
          "scoring": {
            "5": "Structured output with aspect-level detail, high confidence scores, domain detected as restaurant review",
            "3": "Basic structure with polarity labels but missing domain detection or confidence",
            "1": "Minimal output",
            "0": "No usable output"
          }
        }
      ],
      "expectedScoreWithout": 45,
      "expectedScoreWith": 85
    },
    {
      "id": "bench-easy-03",
      "difficulty": "easy",
      "description": "Neutral/factual text with no opinion expressed",
      "input": "Analyze the sentiment of this text:\n\n\"The company was founded in 2015 and is headquartered in Austin, Texas. It currently employs approximately 500 people across three offices. The annual revenue report will be published next quarter.\"",
      "rubric": [
        {
          "criterion": "Polarity Accuracy",
          "weight": 0.4,
          "scoring": {
            "5": "Correctly identifies the text as neutral/objective with no opinion expressed; does not force a positive or negative label",
            "3": "Identifies as mostly neutral but incorrectly finds weak sentiment in factual statements",
            "1": "Misclassifies as positive or negative",
            "0": "Completely wrong classification"
          }
        },
        {
          "criterion": "Objectivity Detection",
          "weight": 0.3,
          "scoring": {
            "5": "Explicitly notes the text is factual/informational with no opinion holder or sentiment expressions; provides appropriate low-confidence note",
            "3": "Labels as neutral but does not explain why or note the absence of opinion",
            "1": "Does not distinguish between neutral-opinion and factual-objective",
            "0": "No objectivity assessment"
          }
        },
        {
          "criterion": "Output Quality",
          "weight": 0.3,
          "scoring": {
            "5": "Structured output noting neutral classification, no aspects with sentiment, and a note that the text is purely factual",
            "3": "Basic neutral label with some explanation",
            "1": "Minimal output",
            "0": "No usable output"
          }
        }
      ],
      "expectedScoreWithout": 40,
      "expectedScoreWith": 80
    },
    {
      "id": "bench-med-01",
      "difficulty": "medium",
      "description": "Mixed sentiment review with contrasting aspects requiring aspect-level analysis",
      "input": "Analyze the sentiment of this text:\n\n\"The laptop has an incredible display — the OLED panel is vibrant and the 120Hz refresh rate makes scrolling buttery smooth. Build quality is also top-notch with a solid aluminum chassis. However, the keyboard is mushy and imprecise, the trackpad is too small for comfortable use, and the fan noise under any load is distractingly loud. For the price they're asking, the input devices should be much better. The webcam is passable but nothing special.\"",
      "rubric": [
        {
          "criterion": "Aspect-Level Accuracy",
          "weight": 0.35,
          "scoring": {
            "5": "Correctly identifies 6+ aspects (display, refresh rate, build quality, keyboard, trackpad, fan noise, price/value, webcam) with correct individual polarities: display/build positive, keyboard/trackpad/fan negative, webcam slightly positive/neutral, price negative",
            "3": "Identifies 3-4 aspects with mostly correct polarities but misses some",
            "1": "Only 1-2 aspects identified or several misclassified",
            "0": "No aspect-level analysis"
          }
        },
        {
          "criterion": "Mixed Sentiment Handling",
          "weight": 0.3,
          "scoring": {
            "5": "Document-level sentiment labeled as 'mixed' with clear breakdown of positive aspects (display, build) vs. negative aspects (input, noise, value); does not average to neutral",
            "3": "Labels as mixed but aggregation is simplistic (e.g., averaged to slightly negative)",
            "1": "Forces a single polarity label (positive or negative) ignoring the contrast",
            "0": "Completely mischaracterizes the overall sentiment"
          }
        },
        {
          "criterion": "Valence Shifter and Comparative Handling",
          "weight": 0.2,
          "scoring": {
            "5": "Detects 'however' as sentiment pivot, 'too small' as negative intensification, 'nothing special' as diminished neutral, 'should be much better' as negative expectation gap",
            "3": "Handles 'however' pivot but misses nuanced shifters",
            "1": "Misses most valence shifters and comparatives",
            "0": "No shifter handling"
          }
        },
        {
          "criterion": "Output Structure",
          "weight": 0.15,
          "scoring": {
            "5": "Full structured report with aspect table, confidence per aspect, document summary, domain detected as product/laptop review",
            "3": "Aspect list with polarities but missing confidence or structure",
            "1": "Unstructured text output",
            "0": "No usable output"
          }
        }
      ],
      "expectedScoreWithout": 30,
      "expectedScoreWith": 75
    },
    {
      "id": "bench-med-02",
      "difficulty": "medium",
      "description": "Sentiment analysis with negation chains and double negation",
      "input": "Analyze the sentiment of this text:\n\n\"I wouldn't say the service was bad — the staff wasn't unfriendly, and the response time was not unreasonable. But I can't pretend I was satisfied either. The product itself doesn't lack features, yet none of them work without issues. It's not that I don't appreciate the effort, but the execution leaves much to be desired.\"",
      "rubric": [
        {
          "criterion": "Negation Handling",
          "weight": 0.4,
          "scoring": {
            "5": "Correctly resolves all negation chains: 'wasn't unfriendly' = weakly positive (litotes), 'not unreasonable' = weakly positive, 'can't pretend satisfied' = negative, 'doesn't lack features' = neutral-positive, 'none work without issues' = negative, 'not that I don't appreciate' = weakly positive but pivots to negative",
            "3": "Resolves simple negations correctly but mishandles double negations or litotes",
            "1": "Treats negated positives as positive or misreads most negation",
            "0": "Ignores negation entirely"
          }
        },
        {
          "criterion": "Overall Polarity",
          "weight": 0.3,
          "scoring": {
            "5": "Correctly identifies the overall sentiment as slightly negative to negative — the author uses polite hedging but the cumulative message is dissatisfaction. The politeness does not make it neutral.",
            "3": "Identifies as negative but doesn't capture the hedged, polite-negative nuance",
            "1": "Classifies as neutral or positive due to the polite framing",
            "0": "Completely wrong classification"
          }
        },
        {
          "criterion": "Confidence Calibration",
          "weight": 0.15,
          "scoring": {
            "5": "Assigns moderate confidence (0.60-0.75) reflecting the genuine ambiguity and hedged language; flags the complexity of the negation patterns",
            "3": "Provides confidence but it's either too high (>0.90) or too low (<0.40)",
            "1": "No confidence scoring",
            "0": "No analysis"
          }
        },
        {
          "criterion": "Output Quality",
          "weight": 0.15,
          "scoring": {
            "5": "Structured output with each negation chain explained, aspect breakdown (service, product features, execution), and hedging explicitly noted",
            "3": "Basic structure but negation resolution not explained",
            "1": "Minimal output",
            "0": "No usable output"
          }
        }
      ],
      "expectedScoreWithout": 25,
      "expectedScoreWith": 70
    },
    {
      "id": "bench-med-03",
      "difficulty": "medium",
      "description": "Social media text with emoji, slang, and informal language",
      "input": "Analyze the sentiment of these social media posts:\n\n1. \"this new album is straight fire bruh the beats slap so hard cant stop listening\"\n2. \"ngl the movie was mid at best the cgi looked like a ps2 cutscene\"\n3. \"just got the update and everything is broken again amazing job devs\"\n4. \"lowkey obsessed w this cafe their matcha latte hits different fr fr\"",
      "rubric": [
        {
          "criterion": "Slang and Informal Language",
          "weight": 0.35,
          "scoring": {
            "5": "Correctly interprets all slang: 'fire'/'slap' = strongly positive, 'mid' = negative/mediocre, 'amazing job devs' = sarcastic negative, 'hits different'/'obsessed' = strongly positive; handles 'ngl', 'fr fr', 'bruh', 'lowkey', 'w'",
            "3": "Gets most slang right but misinterprets 1-2 terms or misses sarcasm in post 3",
            "1": "Fails to interpret most slang terms; treats unfamiliar terms as neutral",
            "0": "Cannot process informal language"
          }
        },
        {
          "criterion": "Sarcasm Detection",
          "weight": 0.3,
          "scoring": {
            "5": "Correctly identifies post 3 as sarcastic — 'amazing job devs' with context of 'everything is broken' signals sarcasm; inverts polarity to strongly negative; distinguishes from the genuine positive sentiment in posts 1 and 4",
            "3": "Detects sarcasm in post 3 but with low confidence, or misflags another post as sarcastic",
            "1": "Misses sarcasm in post 3; takes 'amazing job' at face value",
            "0": "No sarcasm detection capability"
          }
        },
        {
          "criterion": "Per-Post Accuracy",
          "weight": 0.2,
          "scoring": {
            "5": "All 4 posts correctly classified: post 1 strongly positive, post 2 negative, post 3 strongly negative (sarcasm-inverted), post 4 strongly positive",
            "3": "3 out of 4 posts correctly classified",
            "1": "2 or fewer posts correctly classified",
            "0": "All posts misclassified"
          }
        },
        {
          "criterion": "Output Quality",
          "weight": 0.15,
          "scoring": {
            "5": "Each post analyzed separately with polarity, slang interpretation noted, sarcasm flagged where applicable, domain identified as social media",
            "3": "Posts analyzed but slang interpretation not explained",
            "1": "Single aggregate sentiment for all posts",
            "0": "No usable output"
          }
        }
      ],
      "expectedScoreWithout": 25,
      "expectedScoreWith": 70
    },
    {
      "id": "bench-med-04",
      "difficulty": "medium",
      "description": "Comparative sentiment requiring entity-specific opinion tracking",
      "input": "Analyze the sentiment of this text:\n\n\"After switching from Android to iPhone, I have to say the ecosystem integration is miles ahead — Airdrop, iMessage, and Handoff work seamlessly. The camera on the iPhone is also noticeably sharper in low-light conditions. That said, I really miss the customization freedom on Android. The notification system on iOS is still frustratingly limited compared to what Android offers, and I hate that I can't set default apps properly. The hardware build on both is equally premium. If Google could match Apple's ecosystem cohesion, it would be the perfect phone.\"",
      "rubric": [
        {
          "criterion": "Entity-Specific Sentiment",
          "weight": 0.35,
          "scoring": {
            "5": "Correctly tracks sentiment per entity: iPhone positive (ecosystem, camera), iPhone negative (notifications, default apps, customization); Android positive (customization, notifications), Android negative (ecosystem cohesion); hardware neutral/positive for both",
            "3": "Distinguishes some entity-specific sentiments but conflates opinions or misattributes 1-2",
            "1": "Treats all sentiment as applying to one entity",
            "0": "No entity-level separation"
          }
        },
        {
          "criterion": "Comparative Expression Handling",
          "weight": 0.3,
          "scoring": {
            "5": "Correctly interprets 'miles ahead' (strongly positive for iPhone ecosystem), 'noticeably sharper' (positive iPhone camera), 'frustratingly limited compared to' (negative iOS, positive Android for notifications), 'equally premium' (neutral comparison), 'if Google could match' (conditional — implies Android is worse on ecosystem)",
            "3": "Handles explicit comparisons but misses implied comparative ('if Google could match')",
            "1": "Misinterprets comparisons or ignores comparative structure",
            "0": "No comparative handling"
          }
        },
        {
          "criterion": "Conditional/Irrealis Detection",
          "weight": 0.15,
          "scoring": {
            "5": "Identifies 'If Google could match Apple's ecosystem cohesion, it would be the perfect phone' as conditional/hypothetical; notes it implies current Android ecosystem is inferior while expressing positive hypothetical",
            "3": "Notes the conditional but doesn't fully interpret its sentiment implications",
            "1": "Treats conditional as asserted fact",
            "0": "Ignores conditional entirely"
          }
        },
        {
          "criterion": "Output Structure",
          "weight": 0.2,
          "scoring": {
            "5": "Structured output with separate sentiment summaries for iPhone and Android, aspect-level breakdown per entity, comparisons explicitly noted, conditional flagged",
            "3": "Some entity separation but incomplete aspect detail",
            "1": "Single-entity output or unstructured",
            "0": "No usable output"
          }
        }
      ],
      "expectedScoreWithout": 25,
      "expectedScoreWith": 70
    },
    {
      "id": "bench-hard-01",
      "difficulty": "hard",
      "description": "Subtle sarcasm, irony, and implicit sentiment in a lengthy review",
      "input": "Analyze the sentiment of this text:\n\n\"I must commend the airline for their unwavering commitment to mediocrity. In an age where most carriers are at least trying to improve, it's refreshing to find one that has perfected the art of consistent disappointment.\n\nLet me count the ways: The 'spacious' economy seats — perfect if you happen to be a contortionist. The 'complimentary' beverage service — a thimble of water delivered with the enthusiasm of someone fulfilling a court-ordered community service obligation. And my personal favorite: the 'on-time departure' that only required us to redefine what 'on time' means by approximately 90 minutes.\n\nThe in-flight entertainment system featured a cutting-edge selection of films from 2019, which I suppose counts as 'curated' in the same way my junk drawer is 'curated.' The Wi-Fi, advertised at $12.99, delivered speeds that would make a dial-up modem feel nostalgic.\n\nI will say, in complete sincerity, that the flight crew was genuinely kind and professional throughout. They deserve to work for a better airline.\n\nFive stars. Would definitely recommend to anyone I secretly dislike.\"",
      "rubric": [
        {
          "criterion": "Sarcasm and Irony Detection",
          "weight": 0.35,
          "scoring": {
            "5": "Correctly identifies pervasive sarcasm throughout: 'commend... mediocrity', 'refreshing... disappointment', 'spacious' in scare quotes, 'complimentary' as ironic, 'on-time departure' as ironic, 'cutting-edge... 2019', 'curated' as mocking, 'five stars' as sarcastic, 'recommend to anyone I secretly dislike' as confirming sarcastic intent. Correctly identifies the ONE genuine positive statement about the flight crew.",
            "3": "Detects sarcasm in most obvious cases but misses some or incorrectly flags the genuine compliment about the crew as sarcastic",
            "1": "Detects some sarcasm but misclassifies the overall tone or misses the crew compliment as genuine",
            "0": "Takes the review at face value or misses sarcasm entirely"
          }
        },
        {
          "criterion": "Genuine vs. Sarcastic Sentiment Separation",
          "weight": 0.3,
          "scoring": {
            "5": "Correctly separates the genuine positive sentiment about the crew ('in complete sincerity', 'genuinely kind and professional') from the sarcastic negative sentiment about everything else. Notes the explicit sincerity marker as evidence of genuine opinion.",
            "3": "Separates crew sentiment but with low confidence, or partially misattributes sarcastic/genuine labels",
            "1": "Treats all sentiment as either sarcastic or genuine without distinction",
            "0": "No separation attempted"
          }
        },
        {
          "criterion": "Aspect-Level Detail",
          "weight": 0.2,
          "scoring": {
            "5": "Extracts 7+ aspects: seats/comfort (sarcastic, actually negative), beverage service (sarcastic, negative), punctuality (sarcastic, negative), entertainment (sarcastic, negative), Wi-Fi (negative), crew (genuinely positive), overall recommendation (sarcastic negative). Each with inverted polarity and confidence.",
            "3": "Extracts 4-6 aspects with mostly correct sarcasm-inverted polarities",
|
|
357
|
+
"1": "Extracts 1-3 aspects or misses sarcasm inversion",
|
|
358
|
+
"0": "No aspect extraction"
|
|
359
|
+
}
|
|
360
|
+
},
|
|
361
|
+
{
|
|
362
|
+
"criterion": "Confidence and Flags",
|
|
363
|
+
"weight": 0.15,
|
|
364
|
+
"scoring": {
|
|
365
|
+
"5": "High confidence on sarcasm detection (multiple reinforcing signals), flags entire review as 'predominantly sarcastic', notes the sincerity marker exception, document-level is strongly negative despite surface-positive language",
|
|
366
|
+
"3": "Flags sarcasm but confidence calibration is off (too low given strong signals)",
|
|
367
|
+
"1": "No sarcasm flagging or confidence",
|
|
368
|
+
"0": "No flags or confidence"
|
|
369
|
+
}
|
|
370
|
+
}
|
|
371
|
+
],
|
|
372
|
+
"expectedScoreWithout": 20,
|
|
373
|
+
"expectedScoreWith": 65
|
|
374
|
+
},
|
|
375
|
+
{
|
|
376
|
+
"id": "bench-hard-02",
|
|
377
|
+
"difficulty": "hard",
|
|
378
|
+
"description": "Multi-entity opinion mining with reported speech and opinion holder disambiguation",
|
|
379
|
+
"input": "Analyze the sentiment of this text:\n\n\"The board meeting was contentious. CEO Sarah Chen defended the Q3 results, calling them 'a strong performance given market headwinds,' but CFO Mark Rivera countered that the margins were 'dangerously thin' and warned the current trajectory was 'unsustainable.' Several board members expressed concern about the R&D budget, which they described as 'reckless spending with no clear ROI.'\n\nIndustry analyst Patricia Wong noted in her post-meeting report that 'the company's strategic pivot shows promise but execution risk remains elevated.' She praised the new product line but questioned whether the management team had the operational discipline to scale it.\n\nInternal employee sentiment, based on a recent Glassdoor survey, shows 72% satisfaction with company culture but only 34% confidence in senior leadership's direction. One anonymous review stated: 'Great colleagues, great mission, terrible management decisions.'\"",
|
|
380
|
+
"rubric": [
|
|
381
|
+
{
|
|
382
|
+
"criterion": "Opinion Holder Identification",
|
|
383
|
+
"weight": 0.35,
|
|
384
|
+
"scoring": {
|
|
385
|
+
"5": "Correctly identifies and separates 5+ opinion holders: Sarah Chen (positive on Q3), Mark Rivera (negative on margins/trajectory), board members (negative on R&D), Patricia Wong (mixed — positive on pivot, negative on execution risk), employees (positive on culture, negative on leadership). Each opinion correctly attributed to its holder.",
|
|
386
|
+
"3": "Identifies 3-4 holders with mostly correct attribution but conflates some opinions",
|
|
387
|
+
"1": "Identifies 1-2 holders or misattributes opinions significantly",
|
|
388
|
+
"0": "No opinion holder tracking; treats all as a single voice"
|
|
389
|
+
}
|
|
390
|
+
},
|
|
391
|
+
{
|
|
392
|
+
"criterion": "Reported vs. Direct Sentiment",
|
|
393
|
+
"weight": 0.25,
|
|
394
|
+
"scoring": {
|
|
395
|
+
"5": "Correctly distinguishes reported speech (quoted opinions from Chen, Rivera, board members, Wong, employee) from narrative framing. Notes that the text author is reporting, not expressing their own opinion. Identifies direct quotes vs. paraphrased opinions.",
|
|
396
|
+
"3": "Distinguishes some reported speech but misses the narrative framing or misattributes the author's stance",
|
|
397
|
+
"1": "Treats all sentiment as the text author's opinion",
|
|
398
|
+
"0": "No distinction between reported and direct sentiment"
|
|
399
|
+
}
|
|
400
|
+
},
|
|
401
|
+
{
|
|
402
|
+
"criterion": "Aspect-Entity Matrix",
|
|
403
|
+
"weight": 0.25,
|
|
404
|
+
"scoring": {
|
|
405
|
+
"5": "Builds a complete matrix: Q3 results (Chen: positive, Rivera: negative), R&D budget (board: negative), strategic pivot (Wong: positive), execution capability (Wong: negative), company culture (employees: positive), leadership (employees: negative). Quantitative data (72%, 34%) correctly interpreted.",
|
|
406
|
+
"3": "Covers most aspect-entity pairs but misses some or doesn't integrate quantitative data",
|
|
407
|
+
"1": "Partial matrix with significant gaps",
|
|
408
|
+
"0": "No aspect-entity tracking"
|
|
409
|
+
}
|
|
410
|
+
},
|
|
411
|
+
{
|
|
412
|
+
"criterion": "Output Quality",
|
|
413
|
+
"weight": 0.15,
|
|
414
|
+
"scoring": {
|
|
415
|
+
"5": "Structured output organized by opinion holder with aspect-level detail per holder; overall synthesis noting the contentious/divided nature of sentiment; confidence adjusted for reported speech",
|
|
416
|
+
"3": "Some organization by holder but synthesis is superficial",
|
|
417
|
+
"1": "Flat list or single aggregate sentiment",
|
|
418
|
+
"0": "No usable output"
|
|
419
|
+
}
|
|
420
|
+
}
|
|
421
|
+
],
|
|
422
|
+
"expectedScoreWithout": 20,
|
|
423
|
+
"expectedScoreWith": 60
|
|
424
|
+
},
|
|
425
|
+
{
|
|
426
|
+
"id": "bench-hard-03",
|
|
427
|
+
"difficulty": "hard",
|
|
428
|
+
"description": "Cross-domain sentiment requiring domain calibration and implicit sentiment detection",
|
|
429
|
+
"input": "Analyze the sentiment of these three texts from different domains:\n\nText A (Medical case note): \"Patient presents with stable vitals and unremarkable lab results. The tumor has not progressed since the last scan. While the prognosis remains guarded, the treatment protocol appears to be achieving its intended effect. No new contraindications were identified.\"\n\nText B (Financial analyst report): \"Q4 earnings were flat against consensus expectations. The company maintained its aggressive acquisition strategy despite rising interest rates. Revenue growth decelerated to 3% YoY, though management characterized this as 'disciplined growth.' Cash reserves remain robust at $2.1B, providing a defensive buffer against market volatility.\"\n\nText C (Restaurant critic review): \"Chef Martinez's new tasting menu is an exercise in restraint. The flavors are precise rather than bold, the plating understated rather than theatrical. Some may find the portions challenging, but each course demonstrates an uncommon technical mastery. The wine pairing, curated by sommelier Ana Torres, is nothing short of revelatory. A quiet triumph.\"",
|
|
430
|
+
"rubric": [
|
|
431
|
+
{
|
|
432
|
+
"criterion": "Domain Detection and Calibration",
|
|
433
|
+
"weight": 0.3,
|
|
434
|
+
"scoring": {
|
|
435
|
+
"5": "Correctly detects all three domains (medical, financial, culinary) and calibrates accordingly: 'stable' is positive in medical, 'aggressive' is contextual in finance, 'restraint' is positive in fine dining criticism. 'Unremarkable' is positive in medical (no abnormalities), 'flat' is neutral-negative in finance, 'challenging' portions is diplomatic negative in food criticism.",
|
|
436
|
+
"3": "Detects domains but miscalibrates 2-3 domain-specific terms",
|
|
437
|
+
"1": "Applies general-purpose sentiment without domain calibration",
|
|
438
|
+
"0": "No domain awareness"
|
|
439
|
+
}
|
|
440
|
+
},
|
|
441
|
+
{
|
|
442
|
+
"criterion": "Implicit Sentiment Detection",
|
|
443
|
+
"weight": 0.3,
|
|
444
|
+
"scoring": {
|
|
445
|
+
"5": "Detects implicit sentiment: Text A — cautiously positive (no progression is good in oncology, 'guarded' is standard hedging not negative); Text B — mixed (flat earnings = neutral-negative, 'disciplined growth' in quotes suggests skepticism of management's framing, 'defensive buffer' is positive); Text C — strongly positive despite restrained language ('uncommon technical mastery', 'revelatory', 'quiet triumph' are high praise in critic register)",
|
|
446
|
+
"3": "Gets overall direction right for each text but misses subtle implicit signals",
|
|
447
|
+
"1": "Misreads the register — treats medical hedging as negative, critic's restraint as lukewarm",
|
|
448
|
+
"0": "Completely misreads the implicit sentiment"
|
|
449
|
+
}
|
|
450
|
+
},
|
|
451
|
+
{
|
|
452
|
+
"criterion": "Register and Tone Awareness",
|
|
453
|
+
"weight": 0.25,
|
|
454
|
+
"scoring": {
|
|
455
|
+
"5": "Recognizes that each text uses a different register: clinical objectivity (Text A), analytical finance (Text B), literary criticism (Text C). Notes that sentiment expression norms differ: hedging is standard in medical/clinical, 'quiet triumph' is superlative praise from a food critic, scare quotes around 'disciplined growth' signal analyst skepticism.",
|
|
456
|
+
"3": "Notes different registers but doesn't fully adjust interpretation",
|
|
457
|
+
"1": "Treats all three texts with the same interpretive framework",
|
|
458
|
+
"0": "No register awareness"
|
|
459
|
+
}
|
|
460
|
+
},
|
|
461
|
+
{
|
|
462
|
+
"criterion": "Output Quality",
|
|
463
|
+
"weight": 0.15,
|
|
464
|
+
"scoring": {
|
|
465
|
+
"5": "Each text analyzed separately with domain label, register notes, aspect-level detail, and calibrated confidence. Cross-domain comparison notes how the same words carry different weight.",
|
|
466
|
+
"3": "Separate analysis per text but without cross-domain comparison or register notes",
|
|
467
|
+
"1": "Combined analysis or missing structure",
|
|
468
|
+
"0": "No usable output"
|
|
469
|
+
}
|
|
470
|
+
}
|
|
471
|
+
],
|
|
472
|
+
"expectedScoreWithout": 15,
|
|
473
|
+
"expectedScoreWith": 60
|
|
474
|
+
}
|
|
475
|
+
]
|
|
476
|
+
}
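The benchmark tasks above pair 0-5 rubric bands with weights that sum to 1.0 and expected scores on what appears to be a 0-100 scale. The package does not ship a scorer, but a minimal sketch of how a harness might aggregate criterion scores — assuming each 0-5 score is normalized by its weight to 0-100, which is consistent with the `expectedScoreWith`/`expectedScoreWithout` values — could look like this (`score_task` is a hypothetical name, not part of the package):

```python
# Hypothetical aggregation for the rubric format in tests/benchmark.json.
# Assumption (not specified by the package): each criterion's 0-5 score is
# scaled by its weight and normalized to a 0-100 task score.

def score_task(rubric, criterion_scores):
    """rubric: list of {"criterion": str, "weight": float, ...};
    criterion_scores: mapping of criterion name -> integer score in 0..5."""
    total = 0.0
    for item in rubric:
        s = criterion_scores[item["criterion"]]
        if not 0 <= s <= 5:
            raise ValueError(f"score out of range for {item['criterion']}: {s}")
        total += item["weight"] * (s / 5) * 100
    return round(total, 1)

# Weights taken from bench-hard-01 above; the all-3s scores are illustrative.
rubric = [
    {"criterion": "Sarcasm and Irony Detection", "weight": 0.35},
    {"criterion": "Genuine vs. Sarcastic Sentiment Separation", "weight": 0.3},
    {"criterion": "Aspect-Level Detail", "weight": 0.2},
    {"criterion": "Confidence and Flags", "weight": 0.15},
]
scores = {
    "Sarcasm and Irony Detection": 3,
    "Genuine vs. Sarcastic Sentiment Separation": 3,
    "Aspect-Level Detail": 3,
    "Confidence and Flags": 3,
}
print(score_task(rubric, scores))  # -> 60.0 on the 0-100 scale
```

Under this reading, straight 3s land at 60, which matches the neighborhood of the `expectedScoreWith` targets for the hard tasks.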
package/tests/smoke.json
ADDED
@@ -0,0 +1,54 @@
+{
+  "version": "0.0.1",
+  "timeout": 60,
+  "tasks": [
+    {
+      "id": "smoke-01",
+      "description": "Analyze sentiment of a product review with mixed opinions across multiple aspects and a sarcastic remark",
+      "input": "Analyze the sentiment of this product review:\n\n\"I've been using this wireless headphone for two weeks now. The sound quality is absolutely phenomenal — rich bass, crisp highs, and a wide soundstage that rivals wired headphones. The noise cancellation is decent, though not quite as good as the Sony XM5s. However, the battery life is a complete joke. They advertise 30 hours but I'm lucky to get 15. Oh, and the 'premium' companion app? It crashes every single time I try to adjust the EQ. At least the comfort is great — I can wear them for hours without any ear fatigue. Overall, good sound hampered by poor software and misleading battery claims.\"",
+      "rubric": [
+        {
+          "criterion": "Aspect Identification",
+          "weight": 0.25,
+          "scoring": {
+            "5": "Identifies all 5+ aspects: sound quality, noise cancellation, battery life, companion app, comfort; correctly categorizes each",
+            "3": "Identifies 3-4 aspects but misses some or miscategorizes",
+            "1": "Identifies only 1-2 obvious aspects",
+            "0": "No aspect-level analysis performed"
+          }
+        },
+        {
+          "criterion": "Polarity Accuracy",
+          "weight": 0.3,
+          "scoring": {
+            "5": "Correctly assigns: sound quality (strongly positive), noise cancellation (slightly positive), battery life (strongly negative), app (strongly negative), comfort (positive); detects sarcasm in 'premium' and 'joke'",
+            "3": "Gets most polarities right but misses sarcasm or misclassifies 1-2 aspects",
+            "1": "Only gets document-level polarity roughly right; aspect polarities are incorrect",
+            "0": "Polarity assignments are largely wrong"
+          }
+        },
+        {
+          "criterion": "Valence Shifter Handling",
+          "weight": 0.2,
+          "scoring": {
+            "5": "Correctly processes: 'absolutely' as intensifier on 'phenomenal', 'not quite as good' as diminished comparison, 'complete joke' as sarcastic negative, scare quotes on 'premium' as ironic",
+            "3": "Handles basic intensifiers/negation but misses sarcastic usage or scare quotes",
+            "1": "Ignores most valence shifters",
+            "0": "No valence shifter detection"
+          }
+        },
+        {
+          "criterion": "Output Structure and Confidence",
+          "weight": 0.25,
+          "scoring": {
+            "5": "Structured output with aspect-level detail, confidence scores, document-level summary labeled as 'mixed', sarcasm flagged, and the concluding sentence used to anchor overall sentiment",
+            "3": "Has aspect breakdown and document summary but missing confidence scores or sarcasm flags",
+            "1": "Only provides document-level sentiment without aspect detail",
+            "0": "Unstructured or no usable output"
+          }
+        }
+      ],
+      "passThreshold": 60
+    }
+  ]
+}
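Rubric files in this shape are easy to get subtly wrong (weights drifting off 1.0, a missing score band), so a small sanity check is worth running before a benchmark pass. This is a hypothetical validator, not shipped with the package; it assumes the layout shown above, with `tasks` containing `rubric` entries whose `scoring` keys are exactly "5", "3", "1", "0":

```python
# Hypothetical sanity check for rubric files like tests/smoke.json.
# Assumes the layout shown above; `validate` is not part of the package.
import math

def validate(doc):
    problems = []
    for task in doc["tasks"]:
        weights = [c["weight"] for c in task["rubric"]]
        # Weights should partition the task score exactly.
        if not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
            problems.append(f"{task['id']}: weights sum to {sum(weights)}")
        for c in task["rubric"]:
            # Every criterion needs the four score bands used throughout.
            if set(c["scoring"]) != {"5", "3", "1", "0"}:
                problems.append(f"{task['id']}/{c['criterion']}: missing score bands")
    return problems

# Skeleton of smoke-01 (band descriptions elided) for a quick self-test.
doc = {"tasks": [{"id": "smoke-01", "rubric": [
    {"criterion": "Aspect Identification", "weight": 0.25,
     "scoring": {"5": "", "3": "", "1": "", "0": ""}},
    {"criterion": "Polarity Accuracy", "weight": 0.3,
     "scoring": {"5": "", "3": "", "1": "", "0": ""}},
    {"criterion": "Valence Shifter Handling", "weight": 0.2,
     "scoring": {"5": "", "3": "", "1": "", "0": ""}},
    {"criterion": "Output Structure and Confidence", "weight": 0.25,
     "scoring": {"5": "", "3": "", "1": "", "0": ""}},
]}]}
print(validate(doc))  # -> [] (weights sum to 1.0, all four bands present)
```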