@goonnguyen/human-mcp 1.4.0 → 2.1.0

Files changed (3)
  1. package/README.md +284 -29
  2. package/dist/index.js +65606 -1697
  3. package/package.json +5 -1
package/README.md CHANGED
@@ -21,12 +21,44 @@ Human MCP is a Model Context Protocol server that provides AI coding agents with
21
21
  - **Performance**: Loading states, visual performance indicators
22
22
  - **Layout**: Responsive design, positioning, visual hierarchy
23
23
 
24
+ 🎨 **Content Generation**
25
+ - Generate high-quality images from text descriptions using Imagen API
26
+ - Create professional videos from text prompts using Veo 3.0 API
27
+ - Image-to-video generation combining Imagen and Veo 3.0
28
+ - Multiple artistic styles: photorealistic, artistic, cartoon, sketch, digital art (images) and realistic, cinematic, artistic, cartoon, animation (videos)
29
+ - Flexible aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4) and output formats
30
+ - Video duration controls (4s, 8s, 12s) with FPS options (1-60 fps)
31
+ - Camera movement controls: static, pan, zoom, dolly movements
32
+ - Advanced prompt engineering and negative prompts
33
+
34
+ 🗣️ **Speech Generation**
35
+ - Convert text to natural-sounding speech with 30+ voice options
36
+ - Long-form content narration with chapter breaks
37
+ - Technical code explanation with spoken analysis
38
+ - Voice customization and style control
39
+ - Multi-language support (24 languages)
40
+ - Professional audio export in WAV format
41
+
24
42
  🤖 **AI-Powered**
25
43
  - Uses Google Gemini 2.5 Flash for fast, accurate analysis
44
+ - Advanced Imagen API for high-quality image generation
45
+ - Cutting-edge Veo 3.0 API for professional video generation
46
+ - Gemini Speech Generation API for natural voice synthesis
26
47
  - Detailed technical insights for developers
27
48
  - Actionable recommendations for fixing issues
28
49
  - Structured output with detected elements and coordinates
29
50
 
51
+ ### Google Gemini Documentation
52
+ - [Gemini API](https://ai.google.dev/gemini-api/docs?hl=en)
53
+ - [Gemini Models](https://ai.google.dev/gemini-api/docs/models)
54
+ - [Video Understanding](https://ai.google.dev/gemini-api/docs/video-understanding?hl=en)
55
+ - [Image Understanding](https://ai.google.dev/gemini-api/docs/image-understanding)
56
+ - [Document Understanding](https://ai.google.dev/gemini-api/docs/document-processing)
57
+ - [Audio Understanding](https://ai.google.dev/gemini-api/docs/audio)
58
+ - [Speech Generation](https://ai.google.dev/gemini-api/docs/speech-generation)
59
+ - [Image Generation](https://ai.google.dev/gemini-api/docs/image-generation)
60
+ - [Video Generation](https://ai.google.dev/gemini-api/docs/video)
61
+
30
62
  ## Quick Start
31
63
 
32
64
  ### Getting Your Google Gemini API Key
@@ -919,7 +951,103 @@ Compare two images to identify visual differences.
919
951
  {
920
952
  "source1": "/path/to/before.png",
921
953
  "source2": "/path/to/after.png",
922
- "comparison_type": "structural"
954
+ "comparison_type": "structural"
955
+ }
956
+ ```
957
+
958
+ ### gemini_gen_image
959
+
960
+ Generate high-quality images from text descriptions using Gemini Imagen API.
961
+
962
+ ```json
963
+ {
964
+ "prompt": "A modern minimalist login form with clean typography",
965
+ "style": "digital_art",
966
+ "aspect_ratio": "16:9",
967
+ "negative_prompt": "cluttered, low quality, blurry"
968
+ }
969
+ ```
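A client may want to check a `gemini_gen_image` payload before sending it. The sketch below is one way to do that, not the package's own validation: the field names come from the example above and the allowed values from the README's feature list. The underscore spelling `digital_art` is taken from the example (the feature list writes "digital art"), and the assumption that other multi-word values would follow the same convention is mine. `validateGenImageArgs` is a hypothetical helper.

```typescript
// Hypothetical client-side check; NOT the schema shipped in the package.
// Allowed values are copied from the README feature list.
const IMAGE_STYLES = ["photorealistic", "artistic", "cartoon", "sketch", "digital_art"] as const;
const ASPECT_RATIOS = ["1:1", "16:9", "9:16", "4:3", "3:4"] as const;

interface GenImageArgs {
  prompt: string;
  style?: string;
  aspect_ratio?: string;
  negative_prompt?: string;
}

// Returns a list of problems; an empty list means the payload looks valid.
function validateGenImageArgs(args: GenImageArgs): string[] {
  const problems: string[] = [];
  if (args.prompt.trim().length === 0) {
    problems.push("prompt must not be empty");
  }
  if (args.style !== undefined && (IMAGE_STYLES as readonly string[]).indexOf(args.style) === -1) {
    problems.push(`unknown style: ${args.style}`);
  }
  if (
    args.aspect_ratio !== undefined &&
    (ASPECT_RATIOS as readonly string[]).indexOf(args.aspect_ratio) === -1
  ) {
    problems.push(`unknown aspect_ratio: ${args.aspect_ratio}`);
  }
  return problems;
}
```

Collecting every problem instead of throwing on the first one lets a caller report all issues with a payload in a single pass.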
970
+
971
+ ### gemini_gen_video
972
+
973
+ Generate professional videos from text descriptions using Gemini Veo 3.0 API.
974
+
975
+ ```json
976
+ {
977
+ "prompt": "A serene mountain landscape at sunrise with gentle camera movement",
978
+ "duration": "8s",
979
+ "style": "cinematic",
980
+ "aspect_ratio": "16:9",
981
+ "camera_movement": "pan_right",
982
+ "fps": 30
983
+ }
984
+ ```
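The `duration` and `fps` fields are constrained by the README (4s, 8s, or 12s; 1-60 fps). As a rough illustration, the sketch below converts a duration string to seconds and estimates the frame count a request implies. The helper names are hypothetical, and treating `fps` as an integer is an assumption that the README does not state.

```typescript
// Hypothetical helpers; the durations and the fps range come from the README.
const VIDEO_DURATIONS = ["4s", "8s", "12s"] as const;

// Converts "8s" to 8, rejecting values outside the documented set.
function durationToSeconds(duration: string): number {
  if ((VIDEO_DURATIONS as readonly string[]).indexOf(duration) === -1) {
    throw new Error(`unsupported duration: ${duration}`);
  }
  return parseInt(duration, 10);
}

// Assumption: fps must be a whole number in the documented 1-60 range.
function isValidFps(fps: number): boolean {
  return Math.floor(fps) === fps && fps >= 1 && fps <= 60;
}

// Number of frames implied by a duration and frame rate.
function frameCount(duration: string, fps: number): number {
  if (!isValidFps(fps)) {
    throw new Error(`fps out of range: ${fps}`);
  }
  return durationToSeconds(duration) * fps;
}
```

For the example request above, `frameCount("8s", 30)` evaluates to 240.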
985
+
986
+ ### gemini_image_to_video
987
+
988
+ Generate videos from images and text descriptions using Imagen + Veo 3.0 pipeline.
989
+
990
+ ```json
991
+ {
992
+ "prompt": "Animate this landscape with flowing water and moving clouds",
993
+ "image_input": "data:image/jpeg;base64,/9j/4AAQ...",
994
+ "duration": "12s",
995
+ "style": "realistic",
996
+ "camera_movement": "zoom_in"
997
+ }
998
+ ```
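The `image_input` value in the example is a base64 data URL. The sketch below shows how a caller might split such a value into its MIME type and payload before sending it. `parseImageDataUrl` is a hypothetical helper, and whether the tool also accepts file paths or plain URLs for `image_input` is not stated in this diff.

```typescript
interface ParsedDataUrl {
  mimeType: string;
  base64: string;
}

// Hypothetical helper: accepts only `data:image/...;base64,...` values
// and returns null for anything else (file paths, http URLs, truncated data).
function parseImageDataUrl(input: string): ParsedDataUrl | null {
  const pattern = /^data:(image\/[a-z0-9.+-]+);base64,([A-Za-z0-9+/]+={0,2})$/;
  const match = pattern.exec(input);
  if (match === null) {
    return null;
  }
  return { mimeType: match[1], base64: match[2] };
}
```

The sample value in the README is elided with `...`, so it shows the shape only; this parser returns `null` for it because the dots are not valid base64 characters.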
999
+
1000
+ ### mouth_speak
1001
+
1002
+ Convert text to natural-sounding speech with voice customization.
1003
+
1004
+ ```json
1005
+ {
1006
+ "text": "Welcome to our application. Let me guide you through the interface.",
1007
+ "voice": "Zephyr",
1008
+ "language": "en-US",
1009
+ "style_prompt": "Speak in a friendly, welcoming tone"
1010
+ }
1011
+ ```
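The JSON bodies in this section are tool arguments. Over the Model Context Protocol, a client sends them inside a JSON-RPC 2.0 `tools/call` request, with the tool name and arguments under `params`. The sketch below builds such a request for `mouth_speak`; `buildToolCall` is a hypothetical helper, and most MCP client SDKs construct this envelope for you.

```typescript
// Shape of an MCP tool invocation (JSON-RPC 2.0, method "tools/call").
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical helper that wraps tool arguments in the request envelope.
function buildToolCall(id: number, name: string, args: Record<string, unknown>): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id: id,
    method: "tools/call",
    params: { name: name, arguments: args },
  };
}

const speakRequest = buildToolCall(1, "mouth_speak", {
  text: "Welcome to our application. Let me guide you through the interface.",
  voice: "Zephyr",
  language: "en-US",
  style_prompt: "Speak in a friendly, welcoming tone",
});
```

Serializing `speakRequest` with `JSON.stringify` yields the message a client would write to the server's transport.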
1012
+
1013
+ ### mouth_narrate
1014
+
1015
+ Generate narration for long-form content with chapter breaks and style control.
1016
+
1017
+ ```json
1018
+ {
1019
+ "content": "Chapter 1: Introduction to React...",
1020
+ "voice": "Sage",
1021
+ "narration_style": "educational",
1022
+ "chapter_breaks": true,
1023
+ "max_chunk_size": 8000
1024
+ }
1025
+ ```
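The `max_chunk_size` parameter suggests that long content is split into pieces before synthesis. How the package actually chunks text, and how `chapter_breaks` interacts with it, is not shown in this diff. The sketch below is only one plausible approach, assuming the size is measured in characters: cut at the last space before the limit, and fall back to a hard cut when no space is available. `chunkText` is a hypothetical name.

```typescript
// Splits text into pieces of at most maxChunkSize characters,
// preferring to cut at the last space at or before the limit.
// Assumption: the limit counts characters, not bytes or tokens.
function chunkText(text: string, maxChunkSize: number): string[] {
  if (maxChunkSize <= 0) {
    throw new Error("maxChunkSize must be positive");
  }
  const chunks: string[] = [];
  let rest = text.trim();
  while (rest.length > maxChunkSize) {
    let cut = rest.lastIndexOf(" ", maxChunkSize);
    if (cut <= 0) {
      cut = maxChunkSize; // no usable space: hard cut at the limit
    }
    chunks.push(rest.slice(0, cut).trim());
    rest = rest.slice(cut).trim();
  }
  if (rest.length > 0) {
    chunks.push(rest);
  }
  return chunks;
}
```

A production splitter would more likely cut at sentence or paragraph boundaries so that each audio segment ends on a natural pause.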
1026
+
1027
+ ### mouth_explain
1028
+
1029
+ Generate spoken explanations of code with technical analysis.
1030
+
1031
+ ```json
1032
+ {
1033
+ "code": "function factorial(n) { return n <= 1 ? 1 : n * factorial(n-1); }",
1034
+ "programming_language": "javascript",
1035
+ "voice": "Apollo",
1036
+ "explanation_level": "intermediate",
1037
+ "include_examples": true
1038
+ }
1039
+ ```
1040
+
1041
+ ### mouth_customize
1042
+
1043
+ Test different voices and styles to find the best fit for your content.
1044
+
1045
+ ```json
1046
+ {
1047
+ "text": "Hello, this is a voice test sample.",
1048
+ "voice": "Charon",
1049
+ "style_variations": ["professional", "casual", "energetic"],
1050
+ "compare_voices": ["Puck", "Sage", "Apollo"]
923
1051
  }
924
1052
  ```
925
1053
 
@@ -950,12 +1078,92 @@ Compare two images to identify visual differences.
950
1078
  # Check accessibility compliance
951
1079
  {
952
1080
  "source": "page-screenshot.png",
953
- "type": "image",
1081
+ "type": "image",
954
1082
  "analysis_type": "accessibility",
955
1083
  "check_accessibility": true
956
1084
  }
957
1085
  ```
958
1086
 
1087
+ ### Image Generation for Design
1088
+ ```bash
1089
+ # Generate UI mockups and design elements
1090
+ {
1091
+ "prompt": "Professional dashboard interface with data visualization charts",
1092
+ "style": "digital_art",
1093
+ "aspect_ratio": "16:9"
1094
+ }
1095
+ ```
1096
+
1097
+ ### Prototype Creation
1098
+ ```bash
1099
+ # Create visual prototypes for development
1100
+ {
1101
+ "prompt": "Mobile app login screen with modern design, dark theme",
1102
+ "style": "photorealistic",
1103
+ "aspect_ratio": "9:16",
1104
+ "negative_prompt": "old-fashioned, bright colors"
1105
+ }
1106
+ ```
1107
+
1108
+ ### Video Generation for Prototyping
1109
+ ```bash
1110
+ # Create animated prototypes and demonstrations
1111
+ {
1112
+ "prompt": "User interface animation showing a smooth login process with form transitions",
1113
+ "duration": "8s",
1114
+ "style": "digital_art",
1115
+ "aspect_ratio": "16:9",
1116
+ "camera_movement": "static",
1117
+ "fps": 30
1118
+ }
1119
+ ```
1120
+
1121
+ ### Marketing Video Creation
1122
+ ```bash
1123
+ # Generate promotional videos for products
1124
+ {
1125
+ "prompt": "Elegant product showcase video with professional lighting and smooth camera movement",
1126
+ "duration": "12s",
1127
+ "style": "cinematic",
1128
+ "aspect_ratio": "16:9",
1129
+ "camera_movement": "dolly_forward"
1130
+ }
1131
+ ```
1132
+
1133
+ ### Code Explanation Audio
1134
+ ```bash
1135
+ # Generate spoken explanations for code reviews
1136
+ {
1137
+ "code": "const useAuth = () => { const [user, setUser] = useState(null); return { user, login: setUser }; }",
1138
+ "programming_language": "javascript",
1139
+ "voice": "Apollo",
1140
+ "explanation_level": "advanced",
1141
+ "include_examples": true
1142
+ }
1143
+ ```
1144
+
1145
+ ### Documentation Narration
1146
+ ```bash
1147
+ # Convert technical documentation to audio
1148
+ {
1149
+ "content": "This API endpoint handles user authentication and returns a JWT token...",
1150
+ "voice": "Sage",
1151
+ "narration_style": "professional",
1152
+ "chapter_breaks": true
1153
+ }
1154
+ ```
1155
+
1156
+ ### User Interface Voice Feedback
1157
+ ```bash
1158
+ # Generate voice responses for applications
1159
+ {
1160
+ "text": "File uploaded successfully. Processing will complete in approximately 30 seconds.",
1161
+ "voice": "Kore",
1162
+ "language": "en-US",
1163
+ "style_prompt": "Speak in a helpful, reassuring tone"
1164
+ }
1165
+ ```
1166
+
959
1167
  ## Prompts
960
1168
 
961
1169
  Human MCP includes pre-built prompts for common debugging scenarios:
@@ -1041,9 +1249,22 @@ HTTP_ENABLE_RATE_LIMITING=false
1041
1249
  Human MCP Server
1042
1250
  ├── Eyes Tool (Vision Understanding)
1043
1251
  │ ├── Image Analysis
1044
- │ ├── Video Processing
1252
+ │ ├── Video Processing
1045
1253
  │ ├── GIF Frame Extraction
1046
1254
  │ └── Visual Comparison
1255
+ ├── Hands Tool (Content Generation)
1256
+ │ ├── Image Generation (Imagen API)
1257
+ │ ├── Video Generation (Veo 3.0 API)
1258
+ │ ├── Image-to-Video Pipeline
1259
+ │ ├── Style Customization
1260
+ │ ├── Aspect Ratio & Duration Control
1261
+ │ ├── Camera Movement Control
1262
+ │ └── Prompt Engineering
1263
+ ├── Mouth Tool (Speech Generation)
1264
+ │ ├── Text-to-Speech Synthesis
1265
+ │ ├── Long-form Narration
1266
+ │ ├── Code Explanation
1267
+ │ └── Voice Customization
1047
1268
  ├── Debugging Prompts
1048
1269
  └── Documentation Resources
1049
1270
  ```
@@ -1056,7 +1277,7 @@ For detailed architecture information and future development plans, see:
1056
1277
 
1057
1278
  **Mission**: Transform AI coding agents with complete human-like sensory capabilities, bridging the gap between artificial and human intelligence through sophisticated multimodal analysis.
1058
1279
 
1059
- ### Current Status: Phase 1 Complete ✅
1280
+ ### Current Status: Phase 1 Complete ✅ | Phase 4 Complete ✅ | Phase 5 Complete ✅
1060
1281
 
1061
1282
  **Eyes (Visual Analysis)** - Production Ready (v1.2.1)
1062
1283
  - Advanced image, video, and GIF analysis capabilities
@@ -1065,6 +1286,25 @@ For detailed architecture information and future development plans, see:
1065
1286
  - Processing 20+ visual formats with 98.5% success rate
1066
1287
  - Sub-30 second response times for detailed analysis
1067
1288
 
1289
+ **Hands (Content Generation)** - Production Ready (v1.4.0)
1290
+ - High-quality image generation using Gemini Imagen API
1291
+ - Professional video generation using Gemini Veo 3.0 API
1292
+ - Image-to-video generation pipeline combining Imagen + Veo 3.0
1293
+ - Multiple artistic styles and aspect ratios for both images and videos
1294
+ - Video duration controls (4s, 8s, 12s) with FPS options (1-60 fps)
1295
+ - Camera movement controls: static, pan, zoom, dolly movements
1296
+ - Advanced prompt engineering with negative prompts
1297
+ - Comprehensive validation and error handling with retry logic
1298
+ - Fast generation times with reliable output
1299
+
1300
+ **Mouth (Speech Generation)** - Production Ready (v1.3.0)
1301
+ - Natural text-to-speech with 30+ voice options
1302
+ - Long-form content narration with chapter breaks
1303
+ - Technical code explanation with spoken analysis
1304
+ - Voice customization and style control
1305
+ - Multi-language support (24 languages)
1306
+ - Professional audio export in WAV format
1307
+
1068
1308
  ### Upcoming Development Phases
1069
1309
 
1070
1310
  #### Phase 2: Document Understanding (Q4 2025)
@@ -1083,21 +1323,28 @@ For detailed architecture information and future development plans, see:
1083
1323
  - Support for 20+ audio formats (WAV, MP3, AAC, OGG, FLAC)
1084
1324
  - Real-time audio processing capabilities
1085
1325
 
1086
- #### Phase 4: Speech Generation - Mouth (Q4 2025)
1087
- **AI Voice Capabilities**
1088
- - High-quality text-to-speech with customizable voice parameters
1089
- - Code explanation and technical content narration
1090
- - Multi-language speech generation (10+ languages)
1091
- - Long-form content narration with natural pacing
1092
- - Professional-quality audio export in multiple formats
1093
-
1094
- #### Phase 5: Content Generation - Hands (Q4 2025)
1095
- **Creative Content Creation**
1096
- - Image generation from text descriptions using Imagen API
1097
- - Advanced image editing (inpainting, style transfer, enhancement)
1098
- - Video generation up to 30 seconds using Veo3 API
1099
- - Animation creation with motion graphics
1100
- - Batch content generation for workflow automation
1326
+ #### Phase 4: Speech Generation - Mouth ✅ COMPLETE
1327
+ **AI Voice Capabilities** - Production Ready (v1.3.0)
1328
+ - High-quality text-to-speech with 30+ voice options using Gemini Speech API
1329
+ - Code explanation and technical content narration
1330
+ - Multi-language speech generation (24 languages supported)
1331
+ - Long-form content narration with chapter breaks and natural pacing
1332
+ - Professional-quality audio export in WAV format
1333
+ - ✅ Voice customization with style prompts and voice comparison
1334
+
1335
+ #### Phase 5: Content Generation - Hands ✅ COMPLETE
1336
+ **Creative Content Creation** - Production Ready (v1.4.0)
1337
+ - Image generation from text descriptions using Imagen API
1338
+ - Video generation from text prompts using Veo 3.0 API
1339
+ - Image-to-video generation pipeline combining Imagen + Veo 3.0
1340
+ - Multiple artistic styles for images and videos
1341
+ - ✅ Flexible aspect ratios: 1:1, 16:9, 9:16, 4:3, 3:4
1342
+ - ✅ Video duration controls (4s, 8s, 12s) with FPS options (1-60 fps)
1343
+ - ✅ Camera movement controls: static, pan, zoom, dolly movements
1344
+ - ✅ Advanced prompt engineering with negative prompts
1345
+ - ✅ Comprehensive error handling and validation with retry logic
1346
+ - Future: Advanced image editing (inpainting, style transfer, enhancement)
1347
+ - Future: Animation creation with motion graphics
1101
1348
 
1102
1349
  ### Target Architecture (End 2025)
1103
1350
 
@@ -1121,8 +1368,8 @@ The evolution from single-capability visual analysis to comprehensive human-like
1121
1368
  │ • Narration │
1122
1369
  │ │
1123
1370
  │ ✋ Hands (Creation) │
1124
- │ • Image Generation
1125
- │ • Video Generation
1371
+ │ • Image Generation ✅│
1372
+ │ • Video Generation ✅│
1126
1373
  └──────────────────────┘
1127
1374
  ```
1128
1375
 
@@ -1145,15 +1392,16 @@ The evolution from single-capability visual analysis to comprehensive human-like
1145
1392
  ### Success Metrics & Timeline
1146
1393
 
1147
1394
  - **Phase 2 (Document Understanding)**: January - March 2025
1148
- - **Phase 3 (Audio Processing)**: April - June 2025
1149
- - **Phase 4 (Speech Generation)**: September - October 2025
1150
- - **Phase 5 (Content Generation)**: October - December 2025
1395
+ - **Phase 3 (Audio Processing)**: April - June 2025
1396
+ - **Phase 4 (Speech Generation)**: Completed September 2025
1397
+ - **Phase 5 (Content Generation)**: Completed September 2025
1151
1398
 
1152
1399
  **Target Goals:**
1153
1400
  - Support 50+ file formats across all modalities
1154
- - 99%+ success rate with sub-60 second processing times
1401
+ - 99%+ success rate with optimized processing times (images <30s, videos <5min)
1155
1402
  - 1000+ MCP client integrations and 100K+ monthly API calls
1156
1403
  - Comprehensive documentation with real-world examples
1404
+ - Professional-grade content generation capabilities
1157
1405
 
1158
1406
  ### Getting Involved
1159
1407
 
@@ -1165,10 +1413,17 @@ Human MCP is built for the developer community. Whether you're integrating with
1165
1413
 
1166
1414
  ## Supported Formats
1167
1415
 
1168
- **Images**: PNG, JPEG, WebP, GIF (static)
1169
- **Videos**: MP4, WebM, MOV, AVI
1170
- **GIFs**: Animated GIF with frame extraction
1171
- **Sources**: File paths, URLs, base64 data URLs
1416
+ **Analysis Formats**:
1417
+ - **Images**: PNG, JPEG, WebP, GIF (static)
1418
+ - **Videos**: MP4, WebM, MOV, AVI
1419
+ - **GIFs**: Animated GIF with frame extraction
1420
+ - **Sources**: File paths, URLs, base64 data URLs
1421
+
1422
+ **Generation Formats**:
1423
+ - **Images**: PNG, JPEG (Base64 output)
1424
+ - **Videos**: MP4 (Base64 output)
1425
+ - **Durations**: 4s, 8s, 12s video lengths
1426
+ - **Quality**: Professional-grade output with customizable FPS (1-60)
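Because generated images and videos are returned as Base64, a caller may want to know the decoded size before writing the result to disk. The sketch below computes it from the string length alone, assuming standard padded Base64 (every 4 characters encode 3 bytes, minus one byte per trailing `=`). `base64DecodedBytes` is a hypothetical helper, and whether the tool's output is a bare Base64 string or a data URL is not specified here.

```typescript
// Size in bytes of the data encoded by a padded Base64 string.
// Assumption: standard alphabet with "=" padding, no line breaks.
function base64DecodedBytes(b64: string): number {
  if (b64.length % 4 !== 0) {
    throw new Error("expected padded Base64 (length divisible by 4)");
  }
  let padding = 0;
  if (b64.slice(-2) === "==") {
    padding = 2;
  } else if (b64.slice(-1) === "=") {
    padding = 1;
  }
  return (b64.length / 4) * 3 - padding;
}
```

Base64 inflates data by roughly one third, so a 12-second MP4 returned this way can be noticeably larger on the wire than on disk.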
1172
1427
 
1173
1428
  ## Contributing
1174
1429