@wentorai/research-plugins 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README.md +204 -0
- package/curated/analysis/README.md +64 -0
- package/curated/domains/README.md +104 -0
- package/curated/literature/README.md +53 -0
- package/curated/research/README.md +62 -0
- package/curated/tools/README.md +87 -0
- package/curated/writing/README.md +61 -0
- package/index.ts +39 -0
- package/mcp-configs/academic-db/ChatSpatial.json +17 -0
- package/mcp-configs/academic-db/academia-mcp.json +17 -0
- package/mcp-configs/academic-db/academic-paper-explorer.json +17 -0
- package/mcp-configs/academic-db/academic-search-mcp-server.json +17 -0
- package/mcp-configs/academic-db/agentinterviews-mcp.json +17 -0
- package/mcp-configs/academic-db/all-in-mcp.json +17 -0
- package/mcp-configs/academic-db/apple-health-mcp.json +17 -0
- package/mcp-configs/academic-db/arxiv-latex-mcp.json +17 -0
- package/mcp-configs/academic-db/arxiv-mcp-server.json +17 -0
- package/mcp-configs/academic-db/bgpt-mcp.json +17 -0
- package/mcp-configs/academic-db/biomcp.json +17 -0
- package/mcp-configs/academic-db/biothings-mcp.json +17 -0
- package/mcp-configs/academic-db/catalysishub-mcp-server.json +17 -0
- package/mcp-configs/academic-db/clinicaltrialsgov-mcp-server.json +17 -0
- package/mcp-configs/academic-db/deep-research-mcp.json +17 -0
- package/mcp-configs/academic-db/dicom-mcp.json +17 -0
- package/mcp-configs/academic-db/enrichr-mcp-server.json +17 -0
- package/mcp-configs/academic-db/fec-mcp-server.json +17 -0
- package/mcp-configs/academic-db/fhir-mcp-server-themomentum.json +17 -0
- package/mcp-configs/academic-db/fhir-mcp.json +19 -0
- package/mcp-configs/academic-db/gget-mcp.json +17 -0
- package/mcp-configs/academic-db/google-researcher-mcp.json +17 -0
- package/mcp-configs/academic-db/idea-reality-mcp.json +17 -0
- package/mcp-configs/academic-db/legiscan-mcp.json +19 -0
- package/mcp-configs/academic-db/lex.json +17 -0
- package/mcp-configs/ai-platform/Adaptive-Graph-of-Thoughts-MCP-server.json +17 -0
- package/mcp-configs/ai-platform/ai-counsel.json +17 -0
- package/mcp-configs/ai-platform/atlas-mcp-server.json +17 -0
- package/mcp-configs/ai-platform/counsel-mcp.json +17 -0
- package/mcp-configs/ai-platform/cross-llm-mcp.json +17 -0
- package/mcp-configs/ai-platform/gptr-mcp.json +17 -0
- package/mcp-configs/browser/decipher-research-agent.json +17 -0
- package/mcp-configs/browser/deep-research.json +17 -0
- package/mcp-configs/browser/everything-claude-code.json +17 -0
- package/mcp-configs/browser/gpt-researcher.json +17 -0
- package/mcp-configs/browser/heurist-agent-framework.json +17 -0
- package/mcp-configs/data-platform/4everland-hosting-mcp.json +17 -0
- package/mcp-configs/data-platform/context-keeper.json +17 -0
- package/mcp-configs/data-platform/context7.json +19 -0
- package/mcp-configs/data-platform/contextstream-mcp.json +17 -0
- package/mcp-configs/data-platform/email-mcp.json +17 -0
- package/mcp-configs/note-knowledge/ApeRAG.json +17 -0
- package/mcp-configs/note-knowledge/In-Memoria.json +17 -0
- package/mcp-configs/note-knowledge/agent-memory.json +17 -0
- package/mcp-configs/note-knowledge/aimemo.json +17 -0
- package/mcp-configs/note-knowledge/biel-mcp.json +19 -0
- package/mcp-configs/note-knowledge/cognee.json +17 -0
- package/mcp-configs/note-knowledge/context-awesome.json +17 -0
- package/mcp-configs/note-knowledge/context-mcp.json +17 -0
- package/mcp-configs/note-knowledge/conversation-handoff-mcp.json +17 -0
- package/mcp-configs/note-knowledge/cortex.json +17 -0
- package/mcp-configs/note-knowledge/devrag.json +17 -0
- package/mcp-configs/note-knowledge/easy-obsidian-mcp.json +17 -0
- package/mcp-configs/note-knowledge/engram.json +17 -0
- package/mcp-configs/note-knowledge/gnosis-mcp.json +17 -0
- package/mcp-configs/note-knowledge/graphlit-mcp-server.json +19 -0
- package/mcp-configs/reference-mgr/arxiv-cli.json +17 -0
- package/mcp-configs/reference-mgr/arxiv-search-mcp.json +17 -0
- package/mcp-configs/reference-mgr/chiken.json +17 -0
- package/mcp-configs/reference-mgr/claude-scholar.json +17 -0
- package/mcp-configs/reference-mgr/devonthink-mcp.json +17 -0
- package/mcp-configs/registry.json +447 -0
- package/openclaw.plugin.json +21 -0
- package/package.json +61 -0
- package/skills/analysis/dataviz/color-accessibility-guide/SKILL.md +230 -0
- package/skills/analysis/dataviz/geospatial-viz-guide/SKILL.md +218 -0
- package/skills/analysis/dataviz/interactive-viz-guide/SKILL.md +287 -0
- package/skills/analysis/dataviz/network-visualization-guide/SKILL.md +195 -0
- package/skills/analysis/dataviz/publication-figures-guide/SKILL.md +238 -0
- package/skills/analysis/dataviz/python-dataviz-guide/SKILL.md +195 -0
- package/skills/analysis/econometrics/causal-inference-guide/SKILL.md +197 -0
- package/skills/analysis/econometrics/iv-regression-guide/SKILL.md +198 -0
- package/skills/analysis/econometrics/panel-data-guide/SKILL.md +274 -0
- package/skills/analysis/econometrics/robustness-checks/SKILL.md +250 -0
- package/skills/analysis/econometrics/stata-regression/SKILL.md +117 -0
- package/skills/analysis/econometrics/time-series-guide/SKILL.md +235 -0
- package/skills/analysis/statistics/bayesian-statistics-guide/SKILL.md +221 -0
- package/skills/analysis/statistics/hypothesis-testing-guide/SKILL.md +210 -0
- package/skills/analysis/statistics/meta-analysis-guide/SKILL.md +206 -0
- package/skills/analysis/statistics/nonparametric-tests-guide/SKILL.md +221 -0
- package/skills/analysis/statistics/power-analysis-guide/SKILL.md +240 -0
- package/skills/analysis/statistics/sem-guide/SKILL.md +231 -0
- package/skills/analysis/statistics/survival-analysis-guide/SKILL.md +195 -0
- package/skills/analysis/wrangling/missing-data-handling/SKILL.md +224 -0
- package/skills/analysis/wrangling/pandas-data-wrangling/SKILL.md +242 -0
- package/skills/analysis/wrangling/questionnaire-design-guide/SKILL.md +234 -0
- package/skills/analysis/wrangling/text-mining-guide/SKILL.md +225 -0
- package/skills/domains/ai-ml/computer-vision-guide/SKILL.md +213 -0
- package/skills/domains/ai-ml/deep-learning-papers-guide/SKILL.md +200 -0
- package/skills/domains/ai-ml/llm-evaluation-guide/SKILL.md +194 -0
- package/skills/domains/ai-ml/prompt-engineering-research/SKILL.md +233 -0
- package/skills/domains/ai-ml/reinforcement-learning-guide/SKILL.md +254 -0
- package/skills/domains/ai-ml/transformer-architecture-guide/SKILL.md +233 -0
- package/skills/domains/biomedical/clinical-research-guide/SKILL.md +232 -0
- package/skills/domains/biomedical/clinicaltrials-api/SKILL.md +177 -0
- package/skills/domains/biomedical/epidemiology-guide/SKILL.md +200 -0
- package/skills/domains/biomedical/genomics-analysis-guide/SKILL.md +270 -0
- package/skills/domains/business/market-analysis-guide/SKILL.md +112 -0
- package/skills/domains/business/strategic-management-guide/SKILL.md +154 -0
- package/skills/domains/chemistry/computational-chemistry-guide/SKILL.md +266 -0
- package/skills/domains/chemistry/retrosynthesis-guide/SKILL.md +215 -0
- package/skills/domains/cs/algorithms-complexity-guide/SKILL.md +194 -0
- package/skills/domains/cs/dblp-api/SKILL.md +129 -0
- package/skills/domains/cs/software-engineering-research/SKILL.md +218 -0
- package/skills/domains/ecology/biodiversity-data-guide/SKILL.md +296 -0
- package/skills/domains/ecology/conservation-biology-guide/SKILL.md +198 -0
- package/skills/domains/ecology/gbif-api/SKILL.md +158 -0
- package/skills/domains/ecology/inaturalist-api/SKILL.md +173 -0
- package/skills/domains/economics/behavioral-economics-guide/SKILL.md +239 -0
- package/skills/domains/economics/development-economics-guide/SKILL.md +181 -0
- package/skills/domains/economics/fred-api/SKILL.md +189 -0
- package/skills/domains/education/curriculum-design-guide/SKILL.md +144 -0
- package/skills/domains/education/learning-science-guide/SKILL.md +150 -0
- package/skills/domains/finance/financial-data-analysis/SKILL.md +152 -0
- package/skills/domains/finance/quantitative-finance-guide/SKILL.md +151 -0
- package/skills/domains/geoscience/climate-science-guide/SKILL.md +158 -0
- package/skills/domains/geoscience/gis-remote-sensing-guide/SKILL.md +129 -0
- package/skills/domains/humanities/digital-humanities-guide/SKILL.md +181 -0
- package/skills/domains/humanities/philosophy-research-guide/SKILL.md +148 -0
- package/skills/domains/law/courtlistener-api/SKILL.md +213 -0
- package/skills/domains/law/legal-research-guide/SKILL.md +250 -0
- package/skills/domains/math/linear-algebra-applications/SKILL.md +227 -0
- package/skills/domains/math/numerical-methods-guide/SKILL.md +236 -0
- package/skills/domains/math/oeis-api/SKILL.md +158 -0
- package/skills/domains/pharma/clinical-pharmacology-guide/SKILL.md +165 -0
- package/skills/domains/pharma/drug-development-guide/SKILL.md +177 -0
- package/skills/domains/physics/computational-physics-guide/SKILL.md +300 -0
- package/skills/domains/physics/nasa-ads-api/SKILL.md +150 -0
- package/skills/domains/physics/quantum-computing-guide/SKILL.md +234 -0
- package/skills/domains/social-science/social-research-methods/SKILL.md +194 -0
- package/skills/domains/social-science/survey-research-guide/SKILL.md +182 -0
- package/skills/literature/discovery/citation-alert-guide/SKILL.md +154 -0
- package/skills/literature/discovery/conference-proceedings-guide/SKILL.md +142 -0
- package/skills/literature/discovery/literature-mapping-guide/SKILL.md +175 -0
- package/skills/literature/discovery/paper-tracking-guide/SKILL.md +211 -0
- package/skills/literature/discovery/rss-paper-feeds/SKILL.md +214 -0
- package/skills/literature/discovery/semantic-scholar-recs-guide/SKILL.md +164 -0
- package/skills/literature/fulltext/doaj-api/SKILL.md +120 -0
- package/skills/literature/fulltext/interlibrary-loan-guide/SKILL.md +163 -0
- package/skills/literature/fulltext/open-access-guide/SKILL.md +183 -0
- package/skills/literature/fulltext/pmc-oai-api/SKILL.md +184 -0
- package/skills/literature/fulltext/preprint-servers-guide/SKILL.md +128 -0
- package/skills/literature/fulltext/repository-harvesting-guide/SKILL.md +207 -0
- package/skills/literature/fulltext/unpaywall-api/SKILL.md +113 -0
- package/skills/literature/metadata/altmetrics-guide/SKILL.md +132 -0
- package/skills/literature/metadata/citation-network-guide/SKILL.md +236 -0
- package/skills/literature/metadata/crossref-api/SKILL.md +133 -0
- package/skills/literature/metadata/datacite-api/SKILL.md +126 -0
- package/skills/literature/metadata/doi-resolution-guide/SKILL.md +168 -0
- package/skills/literature/metadata/h-index-guide/SKILL.md +183 -0
- package/skills/literature/metadata/journal-metrics-guide/SKILL.md +188 -0
- package/skills/literature/metadata/opencitations-api/SKILL.md +128 -0
- package/skills/literature/metadata/orcid-api/SKILL.md +136 -0
- package/skills/literature/metadata/orcid-integration-guide/SKILL.md +178 -0
- package/skills/literature/search/arxiv-api/SKILL.md +95 -0
- package/skills/literature/search/biorxiv-api/SKILL.md +123 -0
- package/skills/literature/search/boolean-search-guide/SKILL.md +199 -0
- package/skills/literature/search/citation-chaining-guide/SKILL.md +148 -0
- package/skills/literature/search/database-comparison-guide/SKILL.md +100 -0
- package/skills/literature/search/europe-pmc-api/SKILL.md +120 -0
- package/skills/literature/search/google-scholar-guide/SKILL.md +182 -0
- package/skills/literature/search/mesh-terms-guide/SKILL.md +164 -0
- package/skills/literature/search/openalex-api/SKILL.md +134 -0
- package/skills/literature/search/pubmed-api/SKILL.md +130 -0
- package/skills/literature/search/scientify-literature-survey/SKILL.md +203 -0
- package/skills/literature/search/semantic-scholar-api/SKILL.md +134 -0
- package/skills/literature/search/systematic-search-strategy/SKILL.md +214 -0
- package/skills/research/automation/ai-scientist-guide/SKILL.md +228 -0
- package/skills/research/automation/data-collection-automation/SKILL.md +248 -0
- package/skills/research/automation/research-workflow-automation/SKILL.md +266 -0
- package/skills/research/deep-research/meta-synthesis-guide/SKILL.md +174 -0
- package/skills/research/deep-research/research-cog/SKILL.md +153 -0
- package/skills/research/deep-research/scoping-review-guide/SKILL.md +217 -0
- package/skills/research/deep-research/systematic-review-guide/SKILL.md +250 -0
- package/skills/research/funding/figshare-api/SKILL.md +163 -0
- package/skills/research/funding/grant-writing-guide/SKILL.md +233 -0
- package/skills/research/funding/nsf-grant-guide/SKILL.md +206 -0
- package/skills/research/funding/open-science-guide/SKILL.md +255 -0
- package/skills/research/funding/zenodo-api/SKILL.md +174 -0
- package/skills/research/methodology/action-research-guide/SKILL.md +201 -0
- package/skills/research/methodology/experimental-design-guide/SKILL.md +236 -0
- package/skills/research/methodology/grad-school-guide/SKILL.md +182 -0
- package/skills/research/methodology/grounded-theory-guide/SKILL.md +171 -0
- package/skills/research/methodology/mixed-methods-guide/SKILL.md +208 -0
- package/skills/research/methodology/qualitative-research-guide/SKILL.md +234 -0
- package/skills/research/methodology/scientify-idea-generation/SKILL.md +222 -0
- package/skills/research/paper-review/paper-reading-assistant/SKILL.md +266 -0
- package/skills/research/paper-review/peer-review-guide/SKILL.md +227 -0
- package/skills/research/paper-review/rebuttal-writing-guide/SKILL.md +185 -0
- package/skills/research/paper-review/scientify-write-review-paper/SKILL.md +209 -0
- package/skills/tools/code-exec/jupyter-notebook-guide/SKILL.md +178 -0
- package/skills/tools/code-exec/python-reproducibility-guide/SKILL.md +341 -0
- package/skills/tools/code-exec/r-reproducibility-guide/SKILL.md +236 -0
- package/skills/tools/code-exec/sandbox-execution-guide/SKILL.md +221 -0
- package/skills/tools/diagram/mermaid-diagram-guide/SKILL.md +269 -0
- package/skills/tools/diagram/plantuml-guide/SKILL.md +397 -0
- package/skills/tools/diagram/scientific-illustration-guide/SKILL.md +225 -0
- package/skills/tools/document/anystyle-api/SKILL.md +199 -0
- package/skills/tools/document/grobid-pdf-parsing/SKILL.md +294 -0
- package/skills/tools/document/markdown-academic-guide/SKILL.md +217 -0
- package/skills/tools/document/pdf-extraction-guide/SKILL.md +321 -0
- package/skills/tools/knowledge-graph/knowledge-graph-construction/SKILL.md +306 -0
- package/skills/tools/knowledge-graph/ontology-design-guide/SKILL.md +214 -0
- package/skills/tools/knowledge-graph/rag-methodology-guide/SKILL.md +325 -0
- package/skills/tools/ocr-translate/formula-recognition-guide/SKILL.md +367 -0
- package/skills/tools/ocr-translate/handwriting-recognition-guide/SKILL.md +211 -0
- package/skills/tools/ocr-translate/latex-ocr-guide/SKILL.md +204 -0
- package/skills/tools/ocr-translate/multilingual-research-guide/SKILL.md +234 -0
- package/skills/tools/scraping/academic-web-scraping/SKILL.md +326 -0
- package/skills/tools/scraping/api-data-collection-guide/SKILL.md +301 -0
- package/skills/tools/scraping/web-scraping-ethics-guide/SKILL.md +250 -0
- package/skills/writing/citation/bibtex-management-guide/SKILL.md +246 -0
- package/skills/writing/citation/citation-style-guide/SKILL.md +248 -0
- package/skills/writing/citation/reference-manager-comparison/SKILL.md +208 -0
- package/skills/writing/citation/zotero-api/SKILL.md +188 -0
- package/skills/writing/composition/abstract-writing-guide/SKILL.md +188 -0
- package/skills/writing/composition/discussion-writing-guide/SKILL.md +194 -0
- package/skills/writing/composition/introduction-writing-guide/SKILL.md +194 -0
- package/skills/writing/composition/literature-review-writing/SKILL.md +196 -0
- package/skills/writing/composition/methods-section-guide/SKILL.md +185 -0
- package/skills/writing/composition/response-to-reviewers/SKILL.md +215 -0
- package/skills/writing/composition/scientific-writing-guide/SKILL.md +152 -0
- package/skills/writing/latex/bibliography-management-guide/SKILL.md +206 -0
- package/skills/writing/latex/latex-drawing-guide/SKILL.md +234 -0
- package/skills/writing/latex/latex-ecosystem-guide/SKILL.md +240 -0
- package/skills/writing/latex/math-typesetting-guide/SKILL.md +231 -0
- package/skills/writing/latex/overleaf-collaboration-guide/SKILL.md +211 -0
- package/skills/writing/latex/tikz-diagrams-guide/SKILL.md +211 -0
- package/skills/writing/polish/academic-translation-guide/SKILL.md +175 -0
- package/skills/writing/polish/academic-writing-refiner/SKILL.md +143 -0
- package/skills/writing/polish/ai-writing-humanizer/SKILL.md +178 -0
- package/skills/writing/polish/grammar-checker-guide/SKILL.md +184 -0
- package/skills/writing/polish/plagiarism-detection-guide/SKILL.md +167 -0
- package/skills/writing/templates/beamer-presentation-guide/SKILL.md +263 -0
- package/skills/writing/templates/conference-paper-template/SKILL.md +219 -0
- package/skills/writing/templates/thesis-template-guide/SKILL.md +200 -0
- package/skills/writing/templates/thesis-writing-guide/SKILL.md +220 -0
- package/src/tools/arxiv.ts +131 -0
- package/src/tools/crossref.ts +112 -0
- package/src/tools/openalex.ts +174 -0
- package/src/tools/pubmed.ts +166 -0
- package/src/tools/semantic-scholar.ts +108 -0
- package/src/tools/unpaywall.ts +58 -0

@@ -0,0 +1,194 @@
---
name: llm-evaluation-guide
description: "Evaluate and benchmark large language models for research applications"
metadata:
  openclaw:
    emoji: "brain"
    category: "domains"
    subcategory: "ai-ml"
    keywords: ["LLM evaluation", "benchmarking", "language models", "model evaluation", "NLP metrics", "BLEU", "perplexity"]
    source: "wentor-research-plugins"
---

# LLM Evaluation Guide

A skill for evaluating and benchmarking large language models (LLMs) in research settings. Covers automatic metrics, human evaluation protocols, benchmark suites, evaluation pitfalls, and best practices for reporting LLM performance.

## Evaluation Taxonomy

### Types of Evaluation

```
1. Intrinsic evaluation:
   Measures model quality on its own terms
   - Perplexity, likelihood, calibration
   - Useful for comparing architectures and training procedures

2. Extrinsic evaluation:
   Measures model quality on downstream tasks
   - Task-specific benchmarks (QA, summarization, classification)
   - Closer to real-world usefulness

3. Human evaluation:
   Human judges rate model outputs
   - Fluency, correctness, helpfulness, safety
   - Gold standard, but expensive and slow
```
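
Intrinsic metrics such as perplexity are straightforward to compute once per-token log-probabilities are available. A minimal sketch (the `token_logprobs` input is an assumption about what your inference API exposes):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity from per-token natural-log probabilities:
    exp of the negative mean log-likelihood. Lower is better."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model assigning probability 0.25 to every token is "as uncertain"
# as a uniform 4-way choice, so its perplexity is close to 4.
print(perplexity([math.log(0.25)] * 10))
```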

## Automatic Metrics

### Common Metrics by Task

| Task | Metric | Description |
|------|--------|-------------|
| Language modeling | Perplexity | Lower is better; measures prediction quality |
| Machine translation | BLEU, COMET | N-gram overlap; learned quality estimation |
| Summarization | ROUGE-1/2/L | Recall of n-grams against reference |
| Question answering | Exact Match, F1 | Token-level match against reference answer |
| Classification | Accuracy, F1 | Standard classification metrics |
| Generation quality | BERTScore | Semantic similarity via embeddings |
| Factuality | FActScore | Proportion of atomic facts supported by evidence |

### Computing Key Metrics

```python
from collections import Counter
import math


def compute_bleu(reference: list[list[str]], hypothesis: list[list[str]],
                 max_n: int = 4) -> float:
    """
    Compute corpus-level BLEU score (simplified).

    Args:
        reference: List of reference token sequences
        hypothesis: List of hypothesis token sequences
        max_n: Maximum n-gram order
    """
    precisions = []

    for n in range(1, max_n + 1):
        num = 0
        den = 0
        for ref_tokens, hyp_tokens in zip(reference, hypothesis):
            ref_ngrams = Counter(
                tuple(ref_tokens[i:i+n]) for i in range(len(ref_tokens) - n + 1)
            )
            hyp_ngrams = Counter(
                tuple(hyp_tokens[i:i+n]) for i in range(len(hyp_tokens) - n + 1)
            )
            # Clip each hypothesis n-gram count at its reference count
            clipped = {ng: min(c, ref_ngrams.get(ng, 0))
                       for ng, c in hyp_ngrams.items()}
            num += sum(clipped.values())
            den += max(sum(hyp_ngrams.values()), 1)

        precisions.append(num / max(den, 1))

    # Brevity penalty
    ref_len = sum(len(r) for r in reference)
    hyp_len = sum(len(h) for h in hypothesis)
    bp = math.exp(1 - ref_len / max(hyp_len, 1)) if hyp_len < ref_len else 1.0

    # Geometric mean of precisions
    log_avg = sum(math.log(max(p, 1e-10)) for p in precisions) / max_n
    return bp * math.exp(log_avg)
```

## Benchmark Suites

### Major LLM Benchmarks

```
General knowledge and reasoning:
- MMLU (Massive Multitask Language Understanding): 57 subjects, MCQ
- HellaSwag: Commonsense sentence completion
- ARC (AI2 Reasoning Challenge): Science questions
- WinoGrande: Coreference resolution / commonsense

Coding:
- HumanEval: Python function completion (pass@k)
- MBPP: Mostly Basic Python Problems
- SWE-bench: Real-world software engineering tasks

Math:
- GSM8K: Grade school math word problems
- MATH: Competition-level mathematics

Safety and alignment:
- TruthfulQA: Resistance to common misconceptions
- BBQ (Bias Benchmark for QA): Social bias in QA
- RealToxicityPrompts: Tendency to generate toxic text

Instruction following:
- MT-Bench: Multi-turn conversation quality (LLM-as-judge)
- AlpacaEval: Instruction-following quality
- Chatbot Arena: Elo-based human preference ranking
```
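
For the pass@k numbers used by HumanEval-style coding benchmarks, the standard unbiased estimator can be computed from n generated samples of which c pass the tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn without replacement from n generations is among
    the c correct ones."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```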

## Human Evaluation

### Designing a Human Evaluation Protocol

```python
def design_human_eval(task: str, n_annotators: int = 3,
                      n_examples: int = 200) -> dict:
    """
    Design a human evaluation protocol for LLM outputs.

    Args:
        task: The task being evaluated
        n_annotators: Number of independent annotators per example
        n_examples: Number of examples to evaluate
    """
    return {
        "task": task,
        "n_annotators": n_annotators,
        "n_examples": n_examples,
        "criteria": [
            {"name": "Fluency", "scale": "1-5",
             "description": "Is the text grammatically correct and natural?"},
            {"name": "Relevance", "scale": "1-5",
             "description": "Does the output address the input/question?"},
            {"name": "Correctness", "scale": "1-5",
             "description": "Is the factual content accurate?"},
            {"name": "Helpfulness", "scale": "1-5",
             "description": "Would a user find this response useful?"}
        ],
        "agreement_metric": "Krippendorff's alpha (ordinal)",
        "presentation": "Randomize model order; blind annotators to model identity",
        "calibration": "Have all annotators rate 20 shared examples first",
        "cost_estimate": f"~{n_examples * n_annotators * 0.50:.0f} USD at typical rates"
    }
```
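
Before computing a chance-corrected statistic like Krippendorff's alpha (best done with a dedicated library), a raw pairwise agreement rate is a quick sanity check on the collected ratings; a minimal sketch:

```python
from itertools import combinations

def pairwise_agreement(ratings: list[list[int]]) -> float:
    """Mean pairwise agreement rate across annotators.

    ratings[i] holds annotator i's ratings for the same ordered
    examples. This ignores chance agreement, so still report a
    chance-corrected statistic alongside it.
    """
    agree = total = 0
    for a, b in combinations(ratings, 2):
        for x, y in zip(a, b):
            agree += int(x == y)
            total += 1
    return agree / total if total else 0.0
```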

## Evaluation Pitfalls

### Common Mistakes

```
1. Data contamination:
   Test data may appear in the LLM's training set.
   Mitigation: Use held-out datasets, check for contamination,
   create new test sets.

2. Metric gaming:
   High BLEU does not mean high quality; ROUGE rewards verbosity.
   Mitigation: Use multiple metrics and human evaluation.

3. Cherry-picking examples:
   Showing only best-case outputs misrepresents model capabilities.
   Mitigation: Report aggregate metrics over full test sets.

4. Ignoring variance:
   LLM outputs vary with temperature and random seeds.
   Mitigation: Report mean and standard deviation over multiple runs.

5. Unfair comparisons:
   Comparing models with different prompt formats or few-shot counts.
   Mitigation: Standardize prompts and report all hyperparameters.
```
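
For pitfall 4, keeping per-example scores makes it easy to attach a confidence interval to any reported mean; a percentile-bootstrap sketch (2000 resamples and the fixed seed are illustrative choices):

```python
import random

def bootstrap_ci(scores: list[float], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of
    per-example metric scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and sort the resampled means
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```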

## Reporting Standards

When publishing LLM evaluation results, report:

- Model name and version, parameter count, and architecture
- Evaluation dataset with version number
- Exact prompts used (include in an appendix)
- Number of few-shot examples
- Decoding parameters (temperature, top-p, max tokens)
- Multiple metrics, not just one
- Confidence intervals or significance tests
- Hardware and inference cost, where relevant
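
One lightweight way to keep these items together is a record that refuses to save until every field is present; the field names here are illustrative, not a standard schema:

```python
def make_eval_report(**fields) -> dict:
    """Collect reporting items into one record; raise if any
    required field is missing."""
    required = {
        "model", "model_version", "dataset", "dataset_version",
        "prompt_template", "n_few_shot", "temperature", "top_p",
        "max_tokens", "metrics", "confidence_intervals", "n_runs",
    }
    missing = required - fields.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return fields
```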
@@ -0,0 +1,233 @@
---
name: prompt-engineering-research
description: "Systematic prompt engineering methods for AI-assisted academic research workf..."
metadata:
  openclaw:
    emoji: "robot"
    category: "domains"
    subcategory: "ai-ml"
    keywords: ["machine learning", "deep learning", "NLP", "AI coding", "prompt engineering", "LLM"]
    source: "wentor"
---

# Prompt Engineering for Research

A skill for applying systematic prompt engineering techniques in academic research contexts. Covers prompt design patterns, evaluation methodologies, and practical workflows for using large language models (LLMs) as research tools.

## Prompt Design Patterns

### Core Prompting Strategies

| Strategy | Description | Best For | Reliability |
|----------|-------------|----------|-------------|
| Zero-shot | Direct instruction, no examples | Simple, well-defined tasks | Moderate |
| Few-shot | Include 2-5 examples in prompt | Pattern matching, formatting | High |
| Chain-of-thought | "Think step by step" | Reasoning, math, analysis | High |
| Role prompting | "You are an expert in..." | Domain-specific tasks | Moderate |
| Structured output | Request JSON/YAML/table format | Data extraction | High |
| Self-consistency | Sample multiple times, majority vote | Fact-checking, reasoning | Very high |
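
The self-consistency row reduces to a majority vote over sampled answers; a minimal sketch that also returns the vote share as a rough confidence signal:

```python
from collections import Counter

def self_consistency(answers: list[str]) -> tuple[str, float]:
    """Majority vote over sampled answers, compared after light
    normalization; returns (top answer, vote share)."""
    normalized = [a.strip().lower() for a in answers]
    (top, count), = Counter(normalized).most_common(1)
    return top, count / len(normalized)
```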

### Research-Specific Prompt Templates

```python
def create_research_prompt(task_type: str, context: dict) -> str:
    """
    Generate a structured prompt for common research tasks.

    Args:
        task_type: One of 'literature_summary', 'methodology_critique',
            'data_interpretation'; unknown types fall back to
            'literature_summary'
        context: Dict with task-specific context
    """
    templates = {
        'literature_summary': """
You are an academic researcher specializing in {domain}.

Summarize the following paper excerpt, focusing on:
1. The research question and its significance
2. The methodology used
3. Key findings and their implications
4. Limitations acknowledged by the authors
5. How this work relates to {related_topic}

Paper excerpt:
{text}

Provide a structured summary in 200-300 words. Distinguish clearly
between what the authors claim and what the evidence supports.
""",
        'methodology_critique': """
You are a methods expert reviewing a research design.

Evaluate the following methodology description:
{text}

Assess the following:
1. Internal validity: Are there confounding variables not controlled?
2. External validity: How generalizable are the findings?
3. Statistical approach: Is the analysis appropriate for the data?
4. Sample: Is the sample size adequate? Any selection bias?
5. Reproducibility: Could another researcher replicate this?

For each concern, rate severity (minor/moderate/major) and suggest
a specific improvement.
""",
        'data_interpretation': """
You are a statistical consultant helping interpret results.

Given these results:
{results}

Context: {context_description}

Provide:
1. Plain-language interpretation of each result
2. Effect size interpretation (is it practically significant?)
3. Potential alternative explanations
4. Caveats the authors should mention
5. Suggested follow-up analyses

Be precise about what the data does and does not support.
Do not overstate findings.
"""
    }

    template = templates.get(task_type, templates['literature_summary'])
    return template.format(**context)
```
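
One practical pitfall with `str.format`-based templates like these: every brace is treated as a placeholder, so a template that shows the model a literal JSON example must double its braces:

```python
# Literal braces are escaped as {{ and }}; {text} remains a placeholder.
template = """
Extract the fields as JSON, e.g. {{"title": "...", "year": 2020}}.

Abstract:
{text}
"""

prompt = template.format(text="We study transformers...")
print(prompt)
```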

## Chain-of-Thought for Complex Research Tasks

### Structured Reasoning

```python
def research_cot_prompt(question: str, data: str) -> str:
    """
    Create a chain-of-thought prompt for complex research analysis.
    """
    return f"""
I need to analyze the following research question step by step.

Research Question: {question}

Available Data:
{data}

Please reason through this systematically:

Step 1: Identify the key variables and their relationships
Step 2: Consider what statistical test or analytical approach is appropriate
Step 3: Check assumptions required for this approach
Step 4: Perform the analysis or describe how to perform it
Step 5: Interpret the results in context
Step 6: State limitations and alternative interpretations

Show your reasoning at each step before moving to the next.
If you are uncertain about any step, explicitly state the uncertainty
rather than guessing.
"""
```

## Evaluation and Reliability

### Measuring Prompt Effectiveness

```python
from difflib import SequenceMatcher
from typing import Callable


def evaluate_prompt(prompt_template: str, test_cases: list[dict],
                    expected_outputs: list[str],
                    model_fn: Callable[[str], str]) -> dict:
    """
    Systematically evaluate a prompt template's reliability.

    Args:
        prompt_template: The prompt template with {placeholders}
        test_cases: List of dicts with placeholder values
        expected_outputs: Expected outputs for each test case
        model_fn: Function that takes a prompt string and returns model output
    """
    results = []
    for case, expected in zip(test_cases, expected_outputs):
        prompt = prompt_template.format(**case)

        # Run multiple times for consistency check
        # (outputs could additionally be compared against `expected`)
        outputs = [model_fn(prompt) for _ in range(3)]

        # Measure consistency (self-agreement) via pairwise similarity
        similarities = []
        for i in range(len(outputs)):
            for j in range(i + 1, len(outputs)):
                sim = SequenceMatcher(None, outputs[i], outputs[j]).ratio()
                similarities.append(sim)

        avg_similarity = sum(similarities) / len(similarities) if similarities else 0

        results.append({
            'test_case': case,
            'n_runs': 3,
            'consistency': round(avg_similarity, 3),
            'outputs': outputs
        })

    return {
        'n_test_cases': len(test_cases),
        'avg_consistency': round(
            sum(r['consistency'] for r in results) / max(len(results), 1), 3
        ),
        'results': results,
        'reliability': (
            'high' if all(r['consistency'] > 0.8 for r in results)
            else 'moderate' if all(r['consistency'] > 0.5 for r in results)
            else 'low -- prompt needs refinement'
        )
    }
```
|
|
185
|
+
|
|
186
|
+
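The pairwise `SequenceMatcher` consistency score used above can be seen in isolation. A minimal sketch; the three sample "model outputs" are invented for illustration:

```python
from difflib import SequenceMatcher

def consistency(outputs: list[str]) -> float:
    """Mean pairwise similarity ratio over all pairs of outputs."""
    sims = [
        SequenceMatcher(None, outputs[i], outputs[j]).ratio()
        for i in range(len(outputs))
        for j in range(i + 1, len(outputs))
    ]
    return sum(sims) / len(sims) if sims else 0.0

# Identical outputs agree perfectly; one divergent run lowers the score.
assert consistency(["The effect is positive."] * 3) == 1.0
print(consistency(["The effect is positive.",
                   "The effect is positive.",
                   "No effect was found."]))
```

A score near 1.0 means the prompt produces stable answers across runs; scores well below that signal a prompt (or task) that needs tightening before results are trusted.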
## Research Workflow Integration

### Automated Literature Screening

```python
def screen_paper_relevance(title: str, abstract: str,
                           inclusion_criteria: list[str],
                           exclusion_criteria: list[str]) -> str:
    """
    Generate a prompt for AI-assisted paper screening in systematic reviews.
    """
    # chr(10) is a newline; backslashes are not allowed inside
    # f-string expressions before Python 3.12.
    return f"""
You are screening papers for a systematic review.

Paper:
Title: {title}
Abstract: {abstract}

Inclusion criteria:
{chr(10).join(f'- {c}' for c in inclusion_criteria)}

Exclusion criteria:
{chr(10).join(f'- {c}' for c in exclusion_criteria)}

Evaluate the paper against each criterion and respond with:
1. INCLUDE, EXCLUDE, or UNCERTAIN
2. Which specific criteria were met or not met
3. Confidence level (high/medium/low)

Important: When uncertain, err on the side of INCLUDE (to be screened
at full-text stage). False exclusions are worse than false inclusions
in systematic review screening.
"""
```

## Ethical Considerations

- **Transparency**: Always disclose AI usage in your research methodology
- **Verification**: Never trust LLM outputs without independent verification -- check facts, citations, and calculations
- **Bias awareness**: LLMs can introduce biases; use structured prompts and diverse perspectives
- **Citation integrity**: LLMs may hallucinate citations; verify every reference exists
- **Authorship**: AI tools do not meet authorship criteria (ICMJE); they are tools, not co-authors
- **Reproducibility**: Document the model, version, temperature, and exact prompts used

## Key References

- Wei, J., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. *NeurIPS*.
- Brown, T., et al. (2020). Language models are few-shot learners. *NeurIPS*.
@@ -0,0 +1,254 @@

---
name: reinforcement-learning-guide
description: "Reinforcement learning fundamentals, algorithms, and research"
metadata:
  openclaw:
    emoji: "robot"
    category: "domains"
    subcategory: "ai-ml"
    keywords: ["reinforcement learning", "machine learning", "deep learning", "neural network"]
    source: "wentor-research-plugins"
---

# Reinforcement Learning Guide

Understand and implement reinforcement learning algorithms from tabular methods through deep RL, including policy gradients, actor-critic, and model-based approaches.

## RL Fundamentals

### The RL Framework

An agent interacts with an environment to maximize cumulative reward:

```
Agent                            Environment
  |                                   |
  |--- action a_t ------------------->|
  |                                   |
  |<-- state s_{t+1}, reward r_{t+1} -|
  |                                   |
```

| Concept | Symbol | Definition |
|---------|--------|-----------|
| State | s | Observation of the environment |
| Action | a | Decision made by the agent |
| Reward | r | Scalar feedback signal |
| Policy | pi(a\|s) | Mapping from states to actions |
| Value function | V(s) | Expected cumulative reward from state s |
| Q-function | Q(s, a) | Expected cumulative reward from (s, a) |
| Discount factor | gamma | Weight of future vs. immediate rewards (0-1) |
| Return | G_t | Sum of discounted future rewards from time t |

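The interaction loop above can be sketched with a toy, hand-written environment. This is a minimal illustration only: the chain environment, its reward, and the trivial always-right policy below are all invented for the sketch.

```python
class ToyChainEnv:
    """Toy environment: walk a number line from 0; reward 1.0 on reaching position 3."""
    def __init__(self):
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):  # action: 0 = left, 1 = right
        self.pos += 1 if action == 1 else -1
        done = self.pos == 3
        reward = 1.0 if done else 0.0
        return self.pos, reward, done  # s_{t+1}, r_{t+1}, terminal flag

env = ToyChainEnv()
state = env.reset()
total_reward, done = 0.0, False
while not done:
    action = 1                                # trivial policy pi(a|s): always step right
    state, reward, done = env.step(action)    # environment returns s_{t+1}, r_{t+1}
    total_reward += reward
# total_reward == 1.0, state == 3 after reaching the goal
```

Real environments expose the same loop shape (`reset`, `step`, a terminal flag); everything in the rest of this guide replaces the trivial policy with a learned one.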
### Key Equations

```
# Return (discounted cumulative reward)
G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...

# Bellman equation for V under a policy
V(s) = E[r + gamma * V(s') | s]

# Bellman optimality equation for Q
Q(s, a) = E[r + gamma * max_{a'} Q(s', a') | s, a]

# Policy gradient theorem
grad_theta J(theta) = E[grad_theta log pi_theta(a|s) * Q(s, a)]
```

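The return definition above can be checked numerically with a few lines of Python. A minimal sketch; the reward sequence is arbitrary:

```python
def discounted_return(rewards, gamma):
    """G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..., computed right-to-left."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: rewards [1, 1, 1] with gamma = 0.5 -> 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

The right-to-left accumulation is the same trick used later to compute advantages: each step bootstraps from the already-computed tail.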
## Algorithm Taxonomy

| Category | Algorithm | Key Idea | On/Off Policy |
|----------|-----------|----------|--------------|
| **Value-based** | Q-Learning | Learn Q(s,a), act greedily | Off-policy |
| | DQN | Q-Learning + neural net + replay buffer | Off-policy |
| | Double DQN | Two networks to reduce overestimation | Off-policy |
| | Dueling DQN | Separate value and advantage streams | Off-policy |
| **Policy gradient** | REINFORCE | Monte Carlo policy gradient | On-policy |
| | PPO | Clipped surrogate objective | On-policy |
| | TRPO | Trust region constraint | On-policy |
| **Actor-Critic** | A2C/A3C | Advantage actor-critic (parallel) | On-policy |
| | SAC | Maximum entropy + off-policy AC | Off-policy |
| | TD3 | Twin delayed DDPG | Off-policy |
| **Model-based** | Dreamer | World model + imagination | On-policy |
| | MBPO | Model-based policy optimization | Off-policy |
| | MuZero | Learned model + planning (MCTS) | Off-policy |

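The first row of the taxonomy, tabular Q-learning, fits in a few lines and makes the off-policy update concrete before the deep variants below. A minimal sketch: the 5-state chain environment is invented for illustration, and the hyperparameters are arbitrary.

```python
import random

# Tabular Q-learning on a toy 5-state chain: moving right reaches state 4 (reward 1).
n_states, n_actions = 5, 2           # actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = [[0.0] * n_actions for _ in range(n_states)]

random.seed(0)
for _ in range(500):                 # 500 short episodes
    s = 0
    while s != n_states - 1:
        # Epsilon-greedy behavior policy (off-policy: the target is the greedy policy)
        if random.random() < epsilon:
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda i: Q[s][i])
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

# The learned greedy policy is "always right" from every non-terminal state.
print([max(range(n_actions), key=lambda i: Q[s][i]) for s in range(n_states - 1)])
```

Note the update bootstraps from `max(Q[s_next])` regardless of which action the epsilon-greedy behavior policy actually takes next; that is exactly what "off-policy" means in the table above.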
## Implementation: DQN

```python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import deque
import random

class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )

    def forward(self, x):
        return self.net(x)

class DQNAgent:
    def __init__(self, state_dim, action_dim, lr=1e-3, gamma=0.99,
                 epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01,
                 buffer_size=10000, batch_size=64):
        self.action_dim = action_dim
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.batch_size = batch_size

        self.q_network = QNetwork(state_dim, action_dim)
        self.target_network = QNetwork(state_dim, action_dim)
        self.target_network.load_state_dict(self.q_network.state_dict())
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=lr)

        self.replay_buffer = deque(maxlen=buffer_size)

    def select_action(self, state):
        # Epsilon-greedy exploration
        if random.random() < self.epsilon:
            return random.randint(0, self.action_dim - 1)
        with torch.no_grad():
            q_values = self.q_network(torch.FloatTensor(state))
            return q_values.argmax().item()

    def store_transition(self, state, action, reward, next_state, done):
        self.replay_buffer.append((state, action, reward, next_state, done))

    def train_step(self):
        if len(self.replay_buffer) < self.batch_size:
            return 0.0

        batch = random.sample(self.replay_buffer, self.batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)

        states = torch.FloatTensor(np.array(states))
        actions = torch.LongTensor(actions)
        rewards = torch.FloatTensor(rewards)
        next_states = torch.FloatTensor(np.array(next_states))
        dones = torch.FloatTensor(dones)

        # Current Q values
        q_values = self.q_network(states).gather(1, actions.unsqueeze(1)).squeeze()

        # Target Q values (Double DQN variant)
        with torch.no_grad():
            best_actions = self.q_network(next_states).argmax(1)
            next_q = self.target_network(next_states).gather(1, best_actions.unsqueeze(1)).squeeze()
            targets = rewards + self.gamma * next_q * (1 - dones)

        loss = nn.MSELoss()(q_values, targets)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
        return loss.item()

    def update_target(self):
        self.target_network.load_state_dict(self.q_network.state_dict())
```

## Implementation: PPO

```python
class PPOAgent:
    def __init__(self, state_dim, action_dim, lr=3e-4, gamma=0.99,
                 lam=0.95, clip_ratio=0.2, epochs=10):
        self.gamma = gamma
        self.lam = lam
        self.clip_ratio = clip_ratio
        self.epochs = epochs

        self.actor = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, action_dim), nn.Softmax(dim=-1)
        )
        self.critic = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, 1)
        )
        self.optimizer = optim.Adam(
            list(self.actor.parameters()) + list(self.critic.parameters()), lr=lr
        )

    def compute_gae(self, rewards, values, dones):
        """Generalized Advantage Estimation."""
        advantages = []
        gae = 0
        for t in reversed(range(len(rewards))):
            next_value = values[t + 1] if t + 1 < len(values) else 0
            delta = rewards[t] + self.gamma * next_value * (1 - dones[t]) - values[t]
            gae = delta + self.gamma * self.lam * (1 - dones[t]) * gae
            advantages.insert(0, gae)
        return torch.FloatTensor(advantages)

    def update(self, states, actions, old_log_probs, rewards, dones):
        values = self.critic(states).squeeze().detach().numpy()
        advantages = self.compute_gae(rewards, values, dones)
        returns = advantages + torch.FloatTensor(values[:len(advantages)])
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        for _ in range(self.epochs):
            probs = self.actor(states)
            dist = torch.distributions.Categorical(probs)
            new_log_probs = dist.log_prob(actions)
            entropy = dist.entropy().mean()

            ratio = (new_log_probs - old_log_probs).exp()
            clipped = torch.clamp(ratio, 1 - self.clip_ratio, 1 + self.clip_ratio)
            actor_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

            critic_loss = nn.MSELoss()(self.critic(states).squeeze(), returns)

            loss = actor_loss + 0.5 * critic_loss - 0.01 * entropy
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
```

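The GAE recursion in `compute_gae` above can be checked with a dependency-free version on a tiny hand-worked trajectory. A minimal sketch; the reward and value numbers are arbitrary.

```python
def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Pure-Python mirror of compute_gae: A_t = delta_t + gamma*lam*(1-done_t)*A_{t+1}."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value * (1 - dones[t]) - values[t]
        running = delta + gamma * lam * (1 - dones[t]) * running
        advantages[t] = running
    return advantages

# Single terminal step: delta = 1 + 0 - 0.5 = 0.5, so A_0 = 0.5
print(gae([1.0], [0.5], [1]))  # [0.5]
```

Working one step by hand like this is a quick sanity check that terminal masking behaves correctly: at a `done` step both the bootstrap term and the recursion are cut off, so the advantage reduces to `r_t - V(s_t)`.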
## Research Environments

| Environment | Domain | Complexity | Key Paper |
|-------------|--------|-----------|-----------|
| Gymnasium (formerly OpenAI Gym) | Classic control, Atari | Low-High | Brockman et al., 2016 |
| MuJoCo | Continuous control, robotics | Medium-High | Todorov et al., 2012 |
| DMControl | Continuous control from pixels | High | Tassa et al., 2018 |
| ProcGen | Procedurally generated games | High (generalization) | Cobbe et al., 2020 |
| Minigrid | Grid-world navigation | Low-Medium | Chevalier-Boisvert et al. |
| Isaac Gym | GPU-accelerated physics sim | High | Makoviychuk et al., 2021 |
| NetHack | Complex roguelike game | Very High | Küttler et al., 2020 |

## Top Venues

| Venue | Type | Focus |
|-------|------|-------|
| NeurIPS | Conference | Broad ML including RL |
| ICML | Conference | Broad ML including RL |
| ICLR | Conference | Representation learning, deep RL |
| AAAI | Conference | Broad AI |
| CoRL | Conference | Robot learning |
| JMLR | Journal | Broad ML (open access) |
| L4DC | Conference | Learning for dynamics and control |

## Key Research Directions (2024-2025)

1. **RLHF / RLAIF**: RL from human or AI feedback for LLM alignment
2. **Offline RL**: Learning from pre-collected datasets without environment interaction
3. **Foundation models for control**: Using pre-trained LLMs/VLMs as world models or planners
4. **Multi-agent RL**: Cooperative and competitive settings with communication
5. **Safe RL**: Constrained optimization to ensure safety during training and deployment
6. **Sample-efficient RL**: Reducing the gap between model-free and model-based sample complexity