merleau 0.1.1__tar.gz → 0.2.0__tar.gz
- merleau-0.2.0/.github/workflows/python-publish.yml +65 -0
- merleau-0.2.0/.gitignore +7 -0
- merleau-0.2.0/CLAUDE.md +57 -0
- {merleau-0.1.1 → merleau-0.2.0}/PKG-INFO +1 -1
- merleau-0.2.0/README.md +77 -0
- merleau-0.2.0/analyze_video.py +56 -0
- merleau-0.2.0/merleau/__init__.py +3 -0
- merleau-0.2.0/merleau/cli.py +117 -0
- {merleau-0.1.1 → merleau-0.2.0}/pyproject.toml +8 -1
- merleau-0.2.0/research/positioning_merleau.md +191 -0
- merleau-0.2.0/uv.lock +694 -0
- merleau-0.1.1/README.md +0 -53
- merleau-0.1.1/merleau.egg-info/PKG-INFO +0 -7
- merleau-0.1.1/merleau.egg-info/SOURCES.txt +0 -7
- merleau-0.1.1/merleau.egg-info/dependency_links.txt +0 -1
- merleau-0.1.1/merleau.egg-info/top_level.txt +0 -1
- merleau-0.1.1/setup.cfg +0 -4
- /merleau-0.1.1/merleau.egg-info/requires.txt → /merleau-0.2.0/requirements.txt +0 -0
merleau-0.2.0/.github/workflows/python-publish.yml
ADDED
@@ -0,0 +1,65 @@
+# This workflow will upload a Python Package to PyPI when a release is created
+# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python#publishing-to-package-registries
+
+# This workflow uses actions that are not certified by GitHub.
+# They are provided by a third-party and are governed by
+# separate terms of service, privacy policy, and support
+# documentation.
+
+name: Upload Python Package
+
+on:
+  release:
+    types: [published]
+
+permissions:
+  contents: read
+
+jobs:
+  release-build:
+    runs-on: ubuntu-latest
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.x"
+
+      - name: Build release distributions
+        run: |
+          # NOTE: put your own distribution build steps here.
+          python -m pip install build
+          python -m build
+
+      - name: Upload distributions
+        uses: actions/upload-artifact@v4
+        with:
+          name: release-dists
+          path: dist/
+
+  pypi-publish:
+    runs-on: ubuntu-latest
+    needs:
+      - release-build
+    permissions:
+      # IMPORTANT: this permission is mandatory for trusted publishing
+      id-token: write
+
+    # Dedicated environments with protections for publishing are strongly recommended.
+    # For more information, see: https://docs.github.com/en/actions/deployment/targeting-different-environments/using-environments-for-deployment#deployment-protection-rules
+    environment:
+      name: pypi
+      url: https://pypi.org/p/merleau
+
+    steps:
+      - name: Retrieve release distributions
+        uses: actions/download-artifact@v4
+        with:
+          name: release-dists
+          path: dist/
+
+      - name: Publish release distributions to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1
+        with:
+          packages-dir: dist/
merleau-0.2.0/.gitignore
ADDED
merleau-0.2.0/CLAUDE.md
ADDED
@@ -0,0 +1,57 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+Merleau is a CLI tool for video understanding using Google's Gemini API. Named after Maurice Merleau-Ponty, the phenomenologist philosopher. The CLI command is `ponty`.
+
+See `research/positioning_merleau.md` for market positioning and differentiation strategy.
+
+## Commands
+
+```bash
+# Install dependencies
+uv sync
+
+# Run the CLI
+uv run ponty video.mp4
+uv run ponty video.mp4 -p "Custom prompt" -m gemini-2.0-flash
+
+# Build package
+uv build
+
+# Publish to PyPI
+uv publish --token <token>
+```
+
+## Architecture
+
+```
+merleau/
+├── merleau/
+│   ├── __init__.py  # Package version
+│   └── cli.py  # CLI entry point (ponty command)
+├── research/  # Market research and positioning
+├── pyproject.toml  # Package config with [project.scripts] entry point
+└── analyze_video.py  # Legacy standalone script
+```
+
+### CLI Flow (merleau/cli.py)
+1. Parse arguments (video path, prompt, model, cost flag)
+2. Load API key from environment or `.env`
+3. Upload video to Gemini Files API
+4. Poll for processing completion
+5. Generate content analysis
+6. Display results and optional cost breakdown
+
+## Key Differentiators
+
+- **Native Gemini video** - Only CLI with true video understanding (not frame extraction)
+- **YouTube URL support** - Direct analysis via Gemini's preview feature
+- **Cost transparency** - Token usage and pricing shown by default
+
+## Configuration
+
+- `.env` - Contains `GEMINI_API_KEY` (required)
+- `pyproject.toml` - Package metadata, dependencies, and CLI entry point
merleau-0.2.0/README.md
ADDED
@@ -0,0 +1,77 @@
+# Merleau
+
+> *"The world is not what I think, but what I live through."*
+> — Maurice Merleau-Ponty
+
+A CLI tool for video understanding using Google's Gemini API. Named after [Maurice Merleau-Ponty](https://en.wikipedia.org/wiki/Maurice_Merleau-Ponty), the phenomenologist philosopher whose work on perception inspires how this tool helps you perceive your videos.
+
+## Why Merleau?
+
+Google Gemini is the **only major AI provider** with native video understanding—Claude doesn't support video, and GPT-4o requires frame extraction workarounds. Merleau is the first CLI that actually understands video rather than analyzing frames.
+
+## Features
+
+- **Native Gemini video processing** - Upload and analyze videos directly
+- **YouTube URL support** - Analyze videos directly from YouTube (free preview)
+- **Customizable prompts** - Ask any question about your video
+- **Cost estimation** - Token usage tracking and cost breakdown
+- **Multiple models** - Support for different Gemini models
+
+## Installation
+
+Using [uv](https://docs.astral.sh/uv/) (recommended):
+```bash
+uv sync
+```
+
+Or install from PyPI:
+```bash
+pip install merleau
+```
+
+## Configuration
+
+1. Get a Gemini API key from [Google AI Studio](https://aistudio.google.com/apikey)
+2. Set the API key as an environment variable or create a `.env` file:
+```
+GEMINI_API_KEY=your_api_key_here
+```
+
+## Usage
+
+```bash
+# Basic video analysis
+ponty video.mp4
+
+# Custom prompt
+ponty video.mp4 -p "Summarize the key points in this video"
+
+# Use a different model
+ponty video.mp4 -m gemini-2.0-flash
+
+# Hide cost information
+ponty video.mp4 --no-cost
+```
+
+### Options
+
+| Option | Description |
+|--------|-------------|
+| `-p, --prompt` | Prompt for the analysis (default: "Explain what happens in this video") |
+| `-m, --model` | Gemini model to use (default: gemini-2.5-flash) |
+| `--no-cost` | Hide usage and cost information |
+
+## Output
+
+The CLI provides:
+- Video content analysis from Gemini
+- Token usage breakdown (prompt, response, total)
+- Estimated cost based on Gemini pricing
+
+## Pricing Reference
+
+Gemini 2.5 Flash (as of 2025):
+- Input: $0.15 per 1M tokens (text/image), $0.075 per 1M tokens (video)
+- Output: $0.60 per 1M tokens, $3.50 for thinking tokens
+
+A 1-hour video costs approximately **$0.11-0.32** to analyze.
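The README's pricing reference can be turned into a small worked calculation. The sketch below is an illustration, not code from the package: `estimate_cost` and its parameters are hypothetical names, and the two rates are the README's own 2025 figures for Gemini 2.5 Flash (text/image input and non-thinking output), which may since have changed.

```python
# Rates copied from the README's "Pricing Reference" (USD per 1M tokens).
INPUT_RATE_PER_MILLION = 0.15   # text/image input
OUTPUT_RATE_PER_MILLION = 0.60  # output, excluding thinking tokens


def estimate_cost(prompt_tokens: int, response_tokens: int) -> dict:
    """Hypothetical helper: estimated USD cost of one request."""
    input_cost = prompt_tokens / 1_000_000 * INPUT_RATE_PER_MILLION
    output_cost = response_tokens / 1_000_000 * OUTPUT_RATE_PER_MILLION
    return {
        "input": input_cost,
        "output": output_cost,
        "total": input_cost + output_cost,
    }


if __name__ == "__main__":
    # Token counts here are made-up inputs, not measurements.
    cost = estimate_cost(prompt_tokens=1_000_000, response_tokens=2_000)
    print(f"Input:  ${cost['input']:.6f}")
    print(f"Output: ${cost['output']:.6f}")
    print(f"Total:  ${cost['total']:.6f}")
```

Applying the lower video rate or the thinking-token rate would need a per-modality token breakdown, which this sketch does not model.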
merleau-0.2.0/analyze_video.py
ADDED
@@ -0,0 +1,56 @@
+import os
+import time
+from dotenv import load_dotenv
+from google import genai
+
+# Load environment variables from .env file
+load_dotenv()
+
+# Initialize the client with the API key
+client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))
+
+# Upload the video file
+video_path = "MATLAB_Modernizer.mp4"
+print(f"Uploading video: {video_path}")
+myfile = client.files.upload(file=video_path)
+print(f"Upload complete. File URI: {myfile.uri}")
+
+# Wait for the file to be processed (become ACTIVE)
+print("Waiting for file to be processed...")
+while myfile.state.name == "PROCESSING":
+    print(".", end="", flush=True)
+    time.sleep(2)
+    myfile = client.files.get(name=myfile.name)
+
+if myfile.state.name == "FAILED":
+    raise ValueError(f"File processing failed: {myfile.state.name}")
+
+print(f"\nFile state: {myfile.state.name}")
+
+# Generate content using Gemini 2.5 Flash
+print("\nAnalyzing video with Gemini 2.5 Flash...")
+response = client.models.generate_content(
+    model="gemini-2.5-flash",
+    contents=[myfile, "Explain what happens in this video"]
+)
+print("\n--- Video Analysis ---")
+print(response.text)
+
+# Show usage/cost information
+print("\n--- Usage Information ---")
+if hasattr(response, 'usage_metadata'):
+    usage = response.usage_metadata
+    print(f"Prompt tokens: {usage.prompt_token_count}")
+    print(f"Response tokens: {usage.candidates_token_count}")
+    print(f"Total tokens: {usage.total_token_count}")
+
+    # Gemini 2.5 Flash pricing (as of 2025):
+    # Input: $0.15 per 1M tokens (text/image), $0.075 per 1M tokens for video
+    # Output: $0.60 per 1M tokens, $3.50 for thinking tokens
+    input_cost = (usage.prompt_token_count / 1_000_000) * 0.15
+    output_cost = (usage.candidates_token_count / 1_000_000) * 0.60
+    total_cost = input_cost + output_cost
+    print(f"\nEstimated cost:")
+    print(f" Input: ${input_cost:.6f}")
+    print(f" Output: ${output_cost:.6f}")
+    print(f" Total: ${total_cost:.6f}")
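The polling loop in this script keeps waiting for as long as the file reports `PROCESSING`. A bounded variant is sketched below purely as an illustration. `poll_until_done`, `get_state`, `timeout`, `interval`, and `sleep` are hypothetical names, and the callable stands in for `client.files.get(name=...).state.name` so the sketch runs without the Gemini SDK.

```python
import time
from typing import Callable


def poll_until_done(
    get_state: Callable[[], str],
    timeout: float = 300.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state() until it stops returning "PROCESSING".

    Returns the first non-PROCESSING state. Raises TimeoutError if the
    state is still "PROCESSING" once `timeout` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    state = get_state()
    while state == "PROCESSING":
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still processing after {timeout} seconds")
        sleep(interval)
        state = get_state()
    return state


if __name__ == "__main__":
    # Fake state source: two PROCESSING results, then ACTIVE.
    states = iter(["PROCESSING", "PROCESSING", "ACTIVE"])
    print(poll_until_done(lambda: next(states), interval=0.0))
```

Passing the state lookup and the sleep function as parameters keeps the loop testable; the caller still decides what to do with a returned `FAILED` state.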
merleau-0.2.0/merleau/cli.py
ADDED
@@ -0,0 +1,117 @@
+"""Command-line interface for Merleau video analysis."""
+
+import argparse
+import os
+import sys
+import time
+
+from dotenv import load_dotenv
+from google import genai
+
+
+def wait_for_processing(client, file):
+    """Wait for file to finish processing."""
+    while file.state.name == "PROCESSING":
+        print(".", end="", flush=True)
+        time.sleep(2)
+        file = client.files.get(name=file.name)
+    print()
+    return file
+
+
+def print_usage(usage):
+    """Print token usage and cost estimation."""
+    print("\n--- Usage Information ---")
+    print(f"Prompt tokens: {usage.prompt_token_count}")
+    print(f"Response tokens: {usage.candidates_token_count}")
+    print(f"Total tokens: {usage.total_token_count}")
+
+    # Gemini 2.5 Flash pricing (as of 2025):
+    # Input: $0.15 per 1M tokens (text/image), $0.075 per 1M tokens for video
+    # Output: $0.60 per 1M tokens, $3.50 for thinking tokens
+    input_cost = (usage.prompt_token_count / 1_000_000) * 0.15
+    output_cost = (usage.candidates_token_count / 1_000_000) * 0.60
+    total_cost = input_cost + output_cost
+    print(f"\nEstimated cost:")
+    print(f" Input: ${input_cost:.6f}")
+    print(f" Output: ${output_cost:.6f}")
+    print(f" Total: ${total_cost:.6f}")
+
+
+def analyze(video_path, prompt, model, show_cost):
+    """Analyze a video file using Gemini."""
+    load_dotenv()
+
+    api_key = os.getenv("GEMINI_API_KEY")
+    if not api_key:
+        print("Error: GEMINI_API_KEY not found in environment or .env file", file=sys.stderr)
+        sys.exit(1)
+
+    if not os.path.exists(video_path):
+        print(f"Error: Video file not found: {video_path}", file=sys.stderr)
+        sys.exit(1)
+
+    client = genai.Client(api_key=api_key)
+
+    # Upload video
+    print(f"Uploading video: {video_path}")
+    myfile = client.files.upload(file=video_path)
+    print(f"Upload complete. File URI: {myfile.uri}")
+
+    # Wait for processing
+    print("Waiting for file to be processed...", end="")
+    myfile = wait_for_processing(client, myfile)
+
+    if myfile.state.name == "FAILED":
+        print(f"Error: File processing failed", file=sys.stderr)
+        sys.exit(1)
+
+    print(f"File state: {myfile.state.name}")
+
+    # Generate analysis
+    print(f"\nAnalyzing video with {model}...")
+    response = client.models.generate_content(
+        model=model,
+        contents=[myfile, prompt]
+    )
+
+    print("\n--- Video Analysis ---")
+    print(response.text)
+
+    # Show usage if requested
+    if show_cost and hasattr(response, 'usage_metadata'):
+        print_usage(response.usage_metadata)
+
+
+def main():
+    """Main entry point for the CLI."""
+    parser = argparse.ArgumentParser(
+        prog="ponty",
+        description="Analyze videos using Google's Gemini API"
+    )
+    parser.add_argument(
+        "video",
+        help="Path to the video file to analyze"
+    )
+    parser.add_argument(
+        "-p", "--prompt",
+        default="Explain what happens in this video",
+        help="Prompt for the analysis (default: 'Explain what happens in this video')"
+    )
+    parser.add_argument(
+        "-m", "--model",
+        default="gemini-2.5-flash",
+        help="Gemini model to use (default: gemini-2.5-flash)"
+    )
+    parser.add_argument(
+        "--no-cost",
+        action="store_true",
+        help="Hide usage and cost information"
+    )
+
+    args = parser.parse_args()
+    analyze(args.video, args.prompt, args.model, show_cost=not args.no_cost)
+
+
+if __name__ == "__main__":
+    main()
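In `main()`, the `--no-cost` flag is stored as a boolean and inverted into `show_cost` when `analyze` is called. The sketch below isolates that argument handling so it can be exercised without an API key or a video file. `build_parser` is a hypothetical helper name that does not exist in the package; the arguments and defaults mirror the ones declared in `main()`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical helper: the same arguments main() declares, without help text."""
    parser = argparse.ArgumentParser(
        prog="ponty",
        description="Analyze videos using Google's Gemini API",
    )
    parser.add_argument("video")
    parser.add_argument("-p", "--prompt", default="Explain what happens in this video")
    parser.add_argument("-m", "--model", default="gemini-2.5-flash")
    # store_true: the attribute is False unless the flag is present.
    parser.add_argument("--no-cost", action="store_true")
    return parser


if __name__ == "__main__":
    # Passing an explicit list avoids reading sys.argv.
    args = build_parser().parse_args(["video.mp4", "--no-cost"])
    show_cost = not args.no_cost
    print(args.video, args.model, show_cost)
```

argparse converts the dash in `--no-cost` to an underscore, which is why the value is read as `args.no_cost`.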
{merleau-0.1.1 → merleau-0.2.0}/pyproject.toml
@@ -1,9 +1,16 @@
 [project]
 name = "merleau"
-version = "0.
+version = "0.2.0"
 description = "Video analysis using Google's Gemini 2.5 Flash API"
 requires-python = ">=3.10"
 dependencies = [
     "google-genai",
     "python-dotenv",
 ]
+
+[project.scripts]
+ponty = "merleau.cli:main"
+
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
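The `[project.scripts]` entry `ponty = "merleau.cli:main"` asks the installer to generate a `ponty` executable that imports `merleau.cli` and calls `main`. The sketch below shows roughly how a `module:attribute` reference of that shape can be resolved. `resolve_entry_point` is a hypothetical helper, not part of any packaging tool, and the demo resolves a standard-library target so it runs without `merleau` installed.

```python
import importlib
from typing import Any


def resolve_entry_point(spec: str) -> Any:
    """Hypothetical helper: resolve a "module:attr.path" reference to an object."""
    module_name, _, attr_path = spec.partition(":")
    obj = importlib.import_module(module_name)
    # Walk dotted attributes; an empty attribute path returns the module itself.
    for attr in filter(None, attr_path.split(".")):
        obj = getattr(obj, attr)
    return obj


if __name__ == "__main__":
    # Stand-in target from the standard library instead of "merleau.cli:main".
    dumps = resolve_entry_point("json:dumps")
    print(dumps({"resolved": True}))
```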
merleau-0.2.0/research/positioning_merleau.md
ADDED
@@ -0,0 +1,191 @@
+# Positioning merleau: A strategic guide for Gemini-powered video CLI tools
+
+**Merleau enters a market with a significant gap**: no unified CLI tool offers native Gemini video understanding combined with multi-provider support and developer-first design. Google Gemini is the **only major AI provider** with true native video processing—Claude doesn't support video at all, and GPT-4o requires manual frame extraction workarounds. This creates a compelling differentiation opportunity for a well-designed open-source tool.
+
+The addressable market includes **500,000-2 million technical users** (developers, agencies, technical creators) who are underserved by existing GUI-focused solutions. The timing is favorable: video AI is "one of the fastest growing areas in AI," and Gemini's 2024-2025 updates have made native video understanding both powerful and cost-effective at **$0.11-0.32 per hour of video**.
+
+---
+
+## The competitive landscape reveals a clear opportunity
+
+The AI video analysis tool ecosystem in 2024-2025 divides into four distinct categories, each with notable gaps:
+
+**Scene detection tools** like PySceneDetect (~2.5k GitHub stars) handle technical video segmentation well, but lack semantic understanding. **Traditional computer vision libraries** (supervision, MMAction2, PyTorchVideo) serve research use cases but require significant expertise. **YouTube-specific tools** like youtube-transcript-api (4k+ stars) extract transcripts effectively but don't analyze actual video content.
+
+The most direct competitor is **video-analyzer** (1.2k stars, byjlw/video-analyzer), which supports frame extraction + Whisper transcription + vision model analysis with Ollama, OpenAI, and OpenRouter. However, it doesn't support **native Gemini video upload**, meaning it misses Gemini's key advantage: processing video files directly with combined audio-visual analysis rather than frame extraction.
+
+| Capability Gap | Current State | merleau Opportunity |
+|----------------|---------------|---------------------|
+| Native Gemini video | No existing CLI tool | First-mover advantage |
+| Multi-provider unified UX | Fragmented approaches | Single consistent interface |
+| YouTube URL analysis | Transcript-only tools | Full video understanding via Gemini |
+| Cost estimation | Not available | Pre-processing cost preview |
+| Structured output standards | Inconsistent schemas | Standardized JSON output |
+
+The research frameworks (MiniGPT4-Video, SmolVLM2, Open-R1-Video) remain research-quality rather than production-ready. This leaves a clear market gap for a **production-grade CLI tool with native Gemini support**.
+
+---
+
+## Developer audience wants performance, composability, and clean output
+
+Developers researching video analysis tools consistently express frustration with **video processing complexity** and **multimodal pipeline challenges**. Building RAG systems for video requires handling text, audio, and visual data simultaneously—a task described as "challenging" even by NVIDIA's documentation.
+
+The most requested workflow patterns include:
+
+- **Video-to-text transcription pipelines** with timestamped output (SRT) and plain text versions
+- **Video RAG systems** combining frame sampling, vision-language analysis, and vector search
+- **Content moderation pipelines** requiring real-time processing and explainability
+- **Semantic video search** enabling natural language queries against video libraries
+
+**CLI framework preference has shifted to Typer** for new Python projects—it's described as "the FastAPI of CLIs" with type hint-driven design and auto-generated documentation. Click remains standard for complex applications but Typer is the 2024-2025 favorite for modern tooling.
+
+For output formats, **JSON is the universal default** for programmatic consumption (jq compatibility is essential), while **Markdown serves human-readable summaries**. Developers explicitly want `--format json|yaml|markdown|text` flags with JSON Lines for streaming scenarios.
+
+Key technical requirements developers mention:
+- Pipeline composability (chain FFmpeg → analysis → output)
+- Streaming and batch support (URLs, files, watch folders)
+- Progress indication for long processing
+- Timestamp preservation in all outputs
+- Multiple output granularities (frame-level, scene-level, video-level)
+
+---
+
+## Content creators represent a large but CLI-resistant market
+
+The YouTube creator ecosystem includes **65-69 million active creators**, but the vast majority are non-technical users who rely on browser extensions. TubeBuddy and VidIQ dominate with GUI-based competitor analysis, keyword research, and optimization features priced at $3.60-$99/month.
+
+**These tools have significant gaps that merleau could address**: neither offers deep content/script analysis beyond metadata, no automatic keyword extraction from video speech, no hook quality scoring, no batch video content analysis, and no programmatic access for developers.
+
+The realistic CLI-adoptable segment is much smaller:
+
+| Segment | Estimated Size | CLI Readiness |
+|---------|----------------|---------------|
+| MCNs/agencies managing multiple channels | ~10,000+ organizations | Medium-High |
+| Technical YouTubers (dev channels) | ~500,000-1 million | High |
+| Data-savvy marketers | ~100,000-500,000 | Medium |
+| Average YouTubers | 60+ million | Very Low |
+
+The **high-potential niches** are MCNs needing batch operations across 10+ channels, tech/developer YouTube channels already comfortable with CLI tools, enterprise content teams requiring workflow integration, and researchers studying YouTube trends at scale.
+
+A CLI tool succeeds in this market by targeting **power users who've outgrown TubeBuddy/VidIQ** rather than competing for the general creator market. Batch transcript extraction, programmatic competitor monitoring, and clean JSON/CSV export for data pipelines are the most valued features.
+
+---
+
+## Gemini's technical advantages define the differentiation strategy
+
+Google Gemini is **unique among major AI providers** in offering native video understanding. The competitive position is stark:
+
+| Provider | Native Video | Audio from Video | Max Duration | YouTube URLs |
+|----------|--------------|------------------|--------------|--------------|
+| **Gemini** | ✅ Direct upload | ✅ Combined analysis | 2+ hours | ✅ Free preview |
+| GPT-4o | ❌ Frame extraction | ❌ Separate Whisper | Minutes | ❌ No |
+| Claude | ❌ **No video support** | ❌ No | N/A | ❌ No |
+| Twelve Labs | ✅ Video-native | ✅ Native | Hours | ❌ No |
+
+Claude's complete lack of video support and GPT-4o's frame extraction requirement mean Gemini is the **only viable choice for a native video CLI tool** among the major general-purpose providers. Twelve Labs offers specialized video APIs but lacks Gemini's general reasoning capabilities.
+
+**Gemini's key technical capabilities** include:
+- **Native multimodal processing** at 1 FPS default (configurable)
+- **2 million token context** enabling ~6 hours of video in a single prompt
+- **Direct YouTube URL analysis** (currently free in preview)
+- **Combined audio-visual understanding** without separate transcription steps
+- **Video clipping** with start/end offsets for segment analysis
+
+**Pricing makes high-volume processing viable**: Gemini 2.5 Flash-Lite costs $0.10 per million tokens, making a 1-hour video analysis cost approximately $0.11-0.32 depending on resolution settings. Context caching reduces costs by 75-90% for repeated queries on the same video. This is **24x cheaper than GPT-4o** for comparable capabilities.
+
+The SDK situation requires attention: use `google-genai` (current), not `google-generativeai` (deprecated). Rate limits on the free tier are restrictive (5-10 RPM, 100-250 requests/day), so documentation should guide users toward the paid tier for production use.
+
+---
+
+## Differentiation through unique capabilities
+
+Based on the competitive and technical analysis, merleau should differentiate on these underserved capabilities:
+
+**Native Gemini video as the core value proposition.** No existing CLI tool offers this. Frame extraction workarounds for GPT-4o are complex, expensive, and lose audio context. merleau can be "the only CLI that actually understands video" rather than analyzing frames.
+
+**YouTube URL support as a killer feature.** Gemini's direct YouTube analysis (free in preview) enables workflows no competitor matches: `ponty youtube https://youtube.com/watch?v=... --summarize`. This appeals to both developers building YouTube-related tools and technical creators analyzing content.
+
+**Cost transparency before processing.** Current tools don't estimate costs upfront. merleau could offer `ponty estimate video.mp4` to show token counts and expected costs before committing to an API call—addressing a clear developer pain point.
+
+**Structured, consistent JSON output.** The ecosystem lacks standardized schemas for video analysis results. Defining a clear output format (with timestamps, confidence scores, frame references) and making it consistent across operations would improve integration into data pipelines.
+
+**Intelligent processing modes.** Offer presets like `--mode lecture` (low FPS, high compression), `--mode action` (high FPS), `--mode audio-focus` (optimize for speech), letting users optimize cost/quality without understanding token mechanics.
+
+---
+
+## Launch strategy for maximum open source impact
+
+Successful open-source CLI tools (ruff, httpie, ripgrep, fzf) share common patterns: **dramatic value proposition** (10-100x faster, dramatically simpler), **beautiful README with GIF demos**, **responsive maintenance**, and **clear differentiation messaging**.
+
+**The launch sequence should follow this timeline:**
+
+*4-2 weeks before launch:* Finalize repository structure with pyproject.toml, comprehensive README, CONTRIBUTING.md, and GitHub Actions for CI/CD and PyPI publishing via Trusted Publishers. Create demo GIF/video showing the tool in action.
+
+*Launch week:* Primary channel is **Hacker News Show HN** posted Monday-Tuesday morning US time. Title format: `Show HN: merleau – A CLI for video understanding using Google Gemini`. Opening comment should tell the personal story ("I built this because..."), mention the philosophy naming briefly as a curiosity hook, and explicitly ask for feedback.
+
+*Launch day execution:* Respond to every Hacker News comment within the first 3 hours. Simultaneously post to r/Python, r/commandline, Twitter (thread format with GIF), and Dev.to with a tutorial-style post.
+
+**The philosophy naming is a marketing asset.** Maurice Merleau-Ponty's work on perception and embodiment creates natural brand storytelling: "Just as the philosopher explored how we perceive reality, ponty helps you perceive your videos." The unusual name becomes a conversation starter and differentiator. A brief README quote from Merleau-Ponty adds character without being pretentious.
+
+**Community building should start simple.** Enable GitHub Discussions with categories (Q&A, Ideas, Show & Tell) before launch. Add Discord only after reaching 50+ active users—developers find Discussions "too formal" for quick questions. Fast PR review (within 48 hours) dramatically increases contributor return rates.
+
+---
+
+## Technical implementation recommendations
+
+**Use Typer as the CLI framework.** It's the modern standard for Python CLIs, built on Click but with type hint-driven automatic documentation. The FastAPI comparison is apt—it reduces boilerplate while maintaining power.
+
+**Output format structure:**
+```
+Default: JSON (machine-readable)
+Flags: --format json|yaml|markdown|text
+Streaming: JSON Lines (newline-delimited) for --stream mode
+```
+
+**Command structure suggestion:**
+```bash
+# Core analysis
+ponty analyze video.mp4 --prompt "Summarize key points"
+ponty analyze https://youtube.com/watch?v=... --summary
+
+# Specialized operations
+ponty transcribe video.mp4 --format srt
+ponty describe video.mp4 --timestamps
+ponty search video.mp4 --query "when does the speaker mention AI"
+
+# Utilities
+ponty estimate video.mp4 # Cost preview
+ponty config set api_key YOUR_KEY
+```
+
+**Implementation priorities for v1.0:**
+1. Native Gemini video upload with File API handling
+2. YouTube URL support (Gemini's preview feature)
+3. Basic prompting with `--prompt` flag
+4. JSON output with timestamps
+5. Cost estimation command
+6. Progress indication for long videos
+
+**Later releases** can add GPT-4o frame extraction mode (for users wanting multi-provider), local model support via Ollama, batch processing for directories, and watch folder functionality.
+
+---
+
+## Success metrics and realistic expectations
+
+**Week 1 targets:** 100+ GitHub stars, successful Hacker News front page, first external users providing feedback.
+
+**Month 1 targets:** 500-1,000 GitHub stars (strong launch), measurable PyPI downloads, first GitHub issues showing real usage, early contributors submitting PRs.
+
+**Month 3 targets:** Sustained star growth, active GitHub Discussions, mentions in newsletters or "awesome" lists, users building integrations.
+
+The tool serves a **niche but valuable segment**—not mass-market but technically sophisticated users who will appreciate the craftsmanship. Success looks like becoming the default recommendation when developers ask "how do I analyze video with Gemini from the command line?"
+
+---
+
+## Conclusion
+
+Merleau/ponty enters the market at an ideal moment: Gemini's native video capabilities are mature and cost-effective, no competitor offers a unified CLI for this workflow, and developer interest in video AI tooling is surging. The differentiation strategy is clear—be the **first and best CLI for native video understanding**, leveraging Gemini's unique position as the only major provider with true video input support.
+
+The path forward involves launching with a focused v1.0 (native Gemini, YouTube URLs, clean JSON output, cost estimation), targeting developers and technical creators rather than mass-market YouTubers, and building community through responsive maintenance and compelling documentation. The philosophy naming creates a distinctive brand identity that stands out in a sea of generic tool names.
+
+The competitive landscape analysis reveals that video-analyzer is the closest existing tool but misses the native Gemini opportunity. By filling this gap with excellent developer experience, merleau can establish a defensible position as the standard CLI for AI video understanding.
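The output-format recommendation in the research notes (JSON by default, JSON Lines for streaming) can be illustrated with the standard library alone. This is a sketch under assumptions: the record fields (`start`, `end`, `text`) and the function names are hypothetical, since the notes do not define a schema, and nothing like this exists in the 0.2.0 package.

```python
import io
import json
from typing import Iterable, Iterator, TextIO


def write_json_lines(records: Iterable[dict], out: TextIO) -> int:
    """Write one JSON object per line (JSON Lines) and return the record count."""
    count = 0
    for record in records:
        # json.dumps emits no raw newlines, so each record stays on one line.
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


def read_json_lines(source: TextIO) -> Iterator[dict]:
    """Parse JSON Lines back into dicts, skipping blank lines."""
    for line in source:
        if line.strip():
            yield json.loads(line)


if __name__ == "__main__":
    # Hypothetical segment records; the field names are illustrative only.
    segments = [
        {"start": 0.0, "end": 12.5, "text": "Speaker introduces the topic."},
        {"start": 12.5, "end": 40.0, "text": "Live demo of the tool."},
    ]
    buffer = io.StringIO()
    write_json_lines(segments, buffer)
    print(buffer.getvalue(), end="")
```

Because every line is a complete JSON document, a consumer can process records as they arrive instead of waiting for a closing bracket.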