@goonnguyen/human-mcp 1.4.0 → 2.1.0
- package/README.md +284 -29
- package/dist/index.js +65606 -1697
- package/package.json +5 -1
package/README.md
CHANGED
@@ -21,12 +21,44 @@ Human MCP is a Model Context Protocol server that provides AI coding agents with
 - **Performance**: Loading states, visual performance indicators
 - **Layout**: Responsive design, positioning, visual hierarchy
 
+🎨 **Content Generation**
+- Generate high-quality images from text descriptions using Imagen API
+- Create professional videos from text prompts using Veo 3.0 API
+- Image-to-video generation combining Imagen and Veo 3.0
+- Multiple artistic styles: photorealistic, artistic, cartoon, sketch, digital art (images) and realistic, cinematic, artistic, cartoon, animation (videos)
+- Flexible aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4) and output formats
+- Video duration controls (4s, 8s, 12s) with FPS options (1-60 fps)
+- Camera movement controls: static, pan, zoom, dolly movements
+- Advanced prompt engineering and negative prompts
+
+🗣️ **Speech Generation**
+- Convert text to natural-sounding speech with 30+ voice options
+- Long-form content narration with chapter breaks
+- Technical code explanation with spoken analysis
+- Voice customization and style control
+- Multi-language support (24 languages)
+- Professional audio export in WAV format
+
 🤖 **AI-Powered**
 - Uses Google Gemini 2.5 Flash for fast, accurate analysis
+- Advanced Imagen API for high-quality image generation
+- Cutting-edge Veo 3.0 API for professional video generation
+- Gemini Speech Generation API for natural voice synthesis
 - Detailed technical insights for developers
 - Actionable recommendations for fixing issues
 - Structured output with detected elements and coordinates
 
+### Google Gemini Documentation
+- [Gemini API](https://ai.google.dev/gemini-api/docs?hl=en)
+- [Gemini Models](https://ai.google.dev/gemini-api/docs/models)
+- [Video Understanding](https://ai.google.dev/gemini-api/docs/video-understanding?hl=en)
+- [Image Understanding](https://ai.google.dev/gemini-api/docs/image-understanding)
+- [Document Understanding](https://ai.google.dev/gemini-api/docs/document-processing)
+- [Audio Understanding](https://ai.google.dev/gemini-api/docs/audio)
+- [Speech Generation](https://ai.google.dev/gemini-api/docs/speech-generation)
+- [Image Generation](https://ai.google.dev/gemini-api/docs/image-generation)
+- [Video Generation](https://ai.google.dev/gemini-api/docs/video)
+
 ## Quick Start
 
 ### Getting Your Google Gemini API Key
@@ -919,7 +951,103 @@ Compare two images to identify visual differences.
 {
   "source1": "/path/to/before.png",
   "source2": "/path/to/after.png",
-  "comparison_type": "structural"
+  "comparison_type": "structural"
+}
+```
+
+### gemini_gen_image
+
+Generate high-quality images from text descriptions using Gemini Imagen API.
+
+```json
+{
+  "prompt": "A modern minimalist login form with clean typography",
+  "style": "digital_art",
+  "aspect_ratio": "16:9",
+  "negative_prompt": "cluttered, low quality, blurry"
+}
+```
+
+### gemini_gen_video
+
+Generate professional videos from text descriptions using Gemini Veo 3.0 API.
+
+```json
+{
+  "prompt": "A serene mountain landscape at sunrise with gentle camera movement",
+  "duration": "8s",
+  "style": "cinematic",
+  "aspect_ratio": "16:9",
+  "camera_movement": "pan_right",
+  "fps": 30
+}
+```
+
+### gemini_image_to_video
+
+Generate videos from images and text descriptions using Imagen + Veo 3.0 pipeline.
+
+```json
+{
+  "prompt": "Animate this landscape with flowing water and moving clouds",
+  "image_input": "data:image/jpeg;base64,/9j/4AAQ...",
+  "duration": "12s",
+  "style": "realistic",
+  "camera_movement": "zoom_in"
+}
+```
+
+### mouth_speak
+
+Convert text to natural-sounding speech with voice customization.
+
+```json
+{
+  "text": "Welcome to our application. Let me guide you through the interface.",
+  "voice": "Zephyr",
+  "language": "en-US",
+  "style_prompt": "Speak in a friendly, welcoming tone"
+}
+```
+
+### mouth_narrate
+
+Generate narration for long-form content with chapter breaks and style control.
+
+```json
+{
+  "content": "Chapter 1: Introduction to React...",
+  "voice": "Sage",
+  "narration_style": "educational",
+  "chapter_breaks": true,
+  "max_chunk_size": 8000
+}
+```
+
+### mouth_explain
+
+Generate spoken explanations of code with technical analysis.
+
+```json
+{
+  "code": "function factorial(n) { return n <= 1 ? 1 : n * factorial(n-1); }",
+  "programming_language": "javascript",
+  "voice": "Apollo",
+  "explanation_level": "intermediate",
+  "include_examples": true
+}
+```
+
+### mouth_customize
+
+Test different voices and styles to find the best fit for your content.
+
+```json
+{
+  "text": "Hello, this is a voice test sample.",
+  "voice": "Charon",
+  "style_variations": ["professional", "casual", "energetic"],
+  "compare_voices": ["Puck", "Sage", "Apollo"]
 }
 ```
 
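The `gemini_gen_video` example in the hunk above takes a `duration`, an `aspect_ratio`, and an `fps` value, and the feature list in this README gives the accepted values for each (4s/8s/12s, five aspect ratios, 1-60 fps). A minimal sketch of a client-side pre-flight check follows. The function name `check_video_args` and the constant names are hypothetical, the value sets are copied from the README text rather than from the server's actual schema (which this diff does not show), and `style` and `camera_movement` are deliberately left unchecked.

```python
# Hypothetical client-side check. The allowed values are taken from the
# README text in this diff, not from the server's real validation schema.
ALLOWED_DURATIONS = {"4s", "8s", "12s"}
ALLOWED_ASPECT_RATIOS = {"1:1", "16:9", "9:16", "4:3", "3:4"}


def check_video_args(args: dict) -> list[str]:
    """Return a list of problems found in the arguments; empty means OK."""
    problems = []

    prompt = args.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        problems.append("prompt must be a non-empty string")

    if "duration" in args and args["duration"] not in ALLOWED_DURATIONS:
        problems.append("duration must be one of 4s, 8s, 12s")

    if "aspect_ratio" in args and args["aspect_ratio"] not in ALLOWED_ASPECT_RATIOS:
        problems.append("aspect_ratio must be one of 1:1, 16:9, 9:16, 4:3, 3:4")

    fps = args.get("fps")
    if fps is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        is_int = isinstance(fps, int) and not isinstance(fps, bool)
        if not (is_int and 1 <= fps <= 60):
            problems.append("fps must be an integer from 1 to 60")

    return problems
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem in one pass before sending the request.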
@@ -950,12 +1078,92 @@ Compare two images to identify visual differences.
 # Check accessibility compliance
 {
   "source": "page-screenshot.png",
-  "type": "image",
+  "type": "image",
   "analysis_type": "accessibility",
   "check_accessibility": true
 }
 ```
 
+### Image Generation for Design
+```bash
+# Generate UI mockups and design elements
+{
+  "prompt": "Professional dashboard interface with data visualization charts",
+  "style": "digital_art",
+  "aspect_ratio": "16:9"
+}
+```
+
+### Prototype Creation
+```bash
+# Create visual prototypes for development
+{
+  "prompt": "Mobile app login screen with modern design, dark theme",
+  "style": "photorealistic",
+  "aspect_ratio": "9:16",
+  "negative_prompt": "old-fashioned, bright colors"
+}
+```
+
+### Video Generation for Prototyping
+```bash
+# Create animated prototypes and demonstrations
+{
+  "prompt": "User interface animation showing a smooth login process with form transitions",
+  "duration": "8s",
+  "style": "digital_art",
+  "aspect_ratio": "16:9",
+  "camera_movement": "static",
+  "fps": 30
+}
+```
+
+### Marketing Video Creation
+```bash
+# Generate promotional videos for products
+{
+  "prompt": "Elegant product showcase video with professional lighting and smooth camera movement",
+  "duration": "12s",
+  "style": "cinematic",
+  "aspect_ratio": "16:9",
+  "camera_movement": "dolly_forward"
+}
+```
+
+### Code Explanation Audio
+```bash
+# Generate spoken explanations for code reviews
+{
+  "code": "const useAuth = () => { const [user, setUser] = useState(null); return { user, login: setUser }; }",
+  "programming_language": "javascript",
+  "voice": "Apollo",
+  "explanation_level": "advanced",
+  "include_examples": true
+}
+```
+
+### Documentation Narration
+```bash
+# Convert technical documentation to audio
+{
+  "content": "This API endpoint handles user authentication and returns a JWT token...",
+  "voice": "Sage",
+  "narration_style": "professional",
+  "chapter_breaks": true
+}
+```
+
+### User Interface Voice Feedback
+```bash
+# Generate voice responses for applications
+{
+  "text": "File uploaded successfully. Processing will complete in approximately 30 seconds.",
+  "voice": "Kore",
+  "language": "en-US",
+  "style_prompt": "Speak in a helpful, reassuring tone"
+}
+```
+
 ## Prompts
 
 Human MCP includes pre-built prompts for common debugging scenarios:
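The usage examples added in the hunk above show only the arguments object for each scenario. In the Model Context Protocol, a client invokes a tool by sending a JSON-RPC 2.0 request with method `tools/call`, carrying the tool name and its arguments. The sketch below wraps an arguments object in that envelope. The helper name `build_tool_call` is hypothetical, and pairing the "User Interface Voice Feedback" arguments with the `mouth_speak` tool is an assumption based on the matching field names, since the README does not name the tool in that example.

```python
import json
from itertools import count

# Simple monotonically increasing JSON-RPC request ids.
_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# Assumed pairing: the voice-feedback arguments sent to `mouth_speak`.
request = build_tool_call(
    "mouth_speak",
    {
        "text": "File uploaded successfully.",
        "voice": "Kore",
        "language": "en-US",
        "style_prompt": "Speak in a helpful, reassuring tone",
    },
)
```

In practice an MCP client library normally builds this envelope itself; the sketch only makes the wire shape visible.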
@@ -1041,9 +1249,22 @@ HTTP_ENABLE_RATE_LIMITING=false
 Human MCP Server
 ├── Eyes Tool (Vision Understanding)
 │   ├── Image Analysis
-│   ├── Video Processing
+│   ├── Video Processing
 │   ├── GIF Frame Extraction
 │   └── Visual Comparison
+├── Hands Tool (Content Generation)
+│   ├── Image Generation (Imagen API)
+│   ├── Video Generation (Veo 3.0 API)
+│   ├── Image-to-Video Pipeline
+│   ├── Style Customization
+│   ├── Aspect Ratio & Duration Control
+│   ├── Camera Movement Control
+│   └── Prompt Engineering
+├── Mouth Tool (Speech Generation)
+│   ├── Text-to-Speech Synthesis
+│   ├── Long-form Narration
+│   ├── Code Explanation
+│   └── Voice Customization
 ├── Debugging Prompts
 └── Documentation Resources
 ```
@@ -1056,7 +1277,7 @@ For detailed architecture information and future development plans, see:
 
 **Mission**: Transform AI coding agents with complete human-like sensory capabilities, bridging the gap between artificial and human intelligence through sophisticated multimodal analysis.
 
-### Current Status: Phase 1 Complete ✅
+### Current Status: Phase 1 Complete ✅ | Phase 4 Complete ✅ | Phase 5 Complete ✅
 
 **Eyes (Visual Analysis)** - Production Ready (v1.2.1)
 - Advanced image, video, and GIF analysis capabilities
@@ -1065,6 +1286,25 @@ For detailed architecture information and future development plans, see:
 - Processing 20+ visual formats with 98.5% success rate
 - Sub-30 second response times for detailed analysis
 
+**Hands (Content Generation)** - Production Ready (v1.4.0)
+- High-quality image generation using Gemini Imagen API
+- Professional video generation using Gemini Veo 3.0 API
+- Image-to-video generation pipeline combining Imagen + Veo 3.0
+- Multiple artistic styles and aspect ratios for both images and videos
+- Video duration controls (4s, 8s, 12s) with FPS options (1-60 fps)
+- Camera movement controls: static, pan, zoom, dolly movements
+- Advanced prompt engineering with negative prompts
+- Comprehensive validation and error handling with retry logic
+- Fast generation times with reliable output
+
+**Mouth (Speech Generation)** - Production Ready (v1.3.0)
+- Natural text-to-speech with 30+ voice options
+- Long-form content narration with chapter breaks
+- Technical code explanation with spoken analysis
+- Voice customization and style control
+- Multi-language support (24 languages)
+- Professional audio export in WAV format
+
 ### Upcoming Development Phases
 
 #### Phase 2: Document Understanding (Q4 2025)
@@ -1083,21 +1323,28 @@ For detailed architecture information and future development plans, see:
 - Support for 20+ audio formats (WAV, MP3, AAC, OGG, FLAC)
 - Real-time audio processing capabilities
 
-#### Phase 4: Speech Generation - Mouth
-**AI Voice Capabilities**
-- High-quality text-to-speech with
-- Code explanation and technical content narration
-- Multi-language speech generation (
-- Long-form content narration with natural pacing
-- Professional-quality audio export in
-
-
-
-
--
-- Video generation
--
--
+#### Phase 4: Speech Generation - Mouth ✅ COMPLETE
+**AI Voice Capabilities** - Production Ready (v1.3.0)
+- ✅ High-quality text-to-speech with 30+ voice options using Gemini Speech API
+- ✅ Code explanation and technical content narration
+- ✅ Multi-language speech generation (24 languages supported)
+- ✅ Long-form content narration with chapter breaks and natural pacing
+- ✅ Professional-quality audio export in WAV format
+- ✅ Voice customization with style prompts and voice comparison
+
+#### Phase 5: Content Generation - Hands ✅ COMPLETE
+**Creative Content Creation** - Production Ready (v1.4.0)
+- ✅ Image generation from text descriptions using Imagen API
+- ✅ Video generation from text prompts using Veo 3.0 API
+- ✅ Image-to-video generation pipeline combining Imagen + Veo 3.0
+- ✅ Multiple artistic styles for images and videos
+- ✅ Flexible aspect ratios: 1:1, 16:9, 9:16, 4:3, 3:4
+- ✅ Video duration controls (4s, 8s, 12s) with FPS options (1-60 fps)
+- ✅ Camera movement controls: static, pan, zoom, dolly movements
+- ✅ Advanced prompt engineering with negative prompts
+- ✅ Comprehensive error handling and validation with retry logic
+- Future: Advanced image editing (inpainting, style transfer, enhancement)
+- Future: Animation creation with motion graphics
 
 ### Target Architecture (End 2025)
 
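The Phase 4 entry in the hunk above lists long-form narration with chapter breaks, and the `mouth_narrate` example elsewhere in this README passes `chapter_breaks` together with a `max_chunk_size` of 8000. How the server actually splits content is not visible in this diff. The sketch below shows one plausible reading only: start a new section at each line beginning with `Chapter <n>`, then hard-wrap any section that still exceeds the size limit. The function name `split_for_narration`, the heading pattern, and the interpretation of `max_chunk_size` as a character count are all assumptions.

```python
import re


def split_for_narration(content: str, max_chunk_size: int = 8000,
                        chapter_breaks: bool = True) -> list[str]:
    """Split text into chunks of at most max_chunk_size characters.

    Assumed behaviour, not the package's implementation: chapters start at
    lines beginning with "Chapter <number>", and oversized sections are cut
    at fixed character offsets (which may fall mid-word).
    """
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")

    if chapter_breaks:
        # Zero-width split: each piece begins with its own chapter heading.
        sections = re.split(r"(?m)^(?=Chapter \d+)", content)
    else:
        sections = [content]

    chunks = []
    for section in sections:
        section = section.strip()
        # An empty section yields no chunks because range(0, 0, n) is empty.
        for start in range(0, len(section), max_chunk_size):
            chunks.append(section[start:start + max_chunk_size])
    return chunks
```

A production splitter would more likely cut at sentence or paragraph boundaries so that speech synthesis does not break mid-word; the fixed-offset cut keeps the sketch short.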
@@ -1121,8 +1368,8 @@ The evolution from single-capability visual analysis to comprehensive human-like
 │ • Narration │
 │ │
 │ ✋ Hands (Creation) │
-│ • Image Generation
-│ • Video Generation
+│ • Image Generation ✅│
+│ • Video Generation ✅│
 └──────────────────────┘
 ```
 
@@ -1145,15 +1392,16 @@ The evolution from single-capability visual analysis to comprehensive human-like
 ### Success Metrics & Timeline
 
 - **Phase 2 (Document Understanding)**: January - March 2025
-- **Phase 3 (Audio Processing)**: April - June 2025
-- **Phase 4 (Speech Generation)**:
-- **Phase 5 (Content Generation)**:
+- **Phase 3 (Audio Processing)**: April - June 2025
+- **Phase 4 (Speech Generation)**: ✅ Completed September 2025
+- **Phase 5 (Content Generation)**: ✅ Completed September 2025
 
 **Target Goals:**
 - Support 50+ file formats across all modalities
-- 99%+ success rate with
+- 99%+ success rate with optimized processing times (images <30s, videos <5min)
 - 1000+ MCP client integrations and 100K+ monthly API calls
 - Comprehensive documentation with real-world examples
+- Professional-grade content generation capabilities
 
 ### Getting Involved
 
@@ -1165,10 +1413,17 @@ Human MCP is built for the developer community. Whether you're integrating with
 
 ## Supported Formats
 
-**
-**
-**
-**
+**Analysis Formats**:
+- **Images**: PNG, JPEG, WebP, GIF (static)
+- **Videos**: MP4, WebM, MOV, AVI
+- **GIFs**: Animated GIF with frame extraction
+- **Sources**: File paths, URLs, base64 data URLs
+
+**Generation Formats**:
+- **Images**: PNG, JPEG (Base64 output)
+- **Videos**: MP4 (Base64 output)
+- **Durations**: 4s, 8s, 12s video lengths
+- **Quality**: Professional-grade output with customizable FPS (1-60)
 
 ## Contributing
 
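The "Supported Formats" hunk above lists base64 data URLs as an accepted input source and describes generated images and videos as "Base64 output". A minimal sketch of turning such a value back into raw bytes follows. It assumes the value is a full `data:<media-type>;base64,<payload>` URL like the `image_input` example in this README; whether the server returns a full data URL or a bare base64 string is not shown in this diff, and the helper name `decode_data_url` is hypothetical.

```python
import base64


def decode_data_url(data_url: str) -> tuple[str, bytes]:
    """Split a base64 data URL into (media_type, raw_bytes)."""
    if not data_url.startswith("data:"):
        raise ValueError("not a data URL")

    # Everything before the first comma is the header, the rest is payload.
    header, sep, payload = data_url[len("data:"):].partition(",")
    if not sep or not header.endswith(";base64"):
        raise ValueError("expected a ';base64,' data URL")

    # An omitted media type defaults to text/plain for data URLs.
    media_type = header[: -len(";base64")] or "text/plain"
    return media_type, base64.b64decode(payload, validate=True)
```

The returned bytes can then be written to disk with an extension chosen from the media type, for example `.png` for `image/png` or `.mp4` for `video/mp4`.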