@goonnguyen/human-mcp 1.4.0 → 2.0.0

Files changed (3)
  1. package/README.md +197 -19
  2. package/dist/index.js +65179 -1697
  3. package/package.json +5 -1
package/README.md CHANGED
@@ -21,12 +21,39 @@ Human MCP is a Model Context Protocol server that provides AI coding agents with
  - **Performance**: Loading states, visual performance indicators
  - **Layout**: Responsive design, positioning, visual hierarchy

+ 🎨 **Content Generation**
+ - Generate high-quality images from text descriptions
+ - Multiple artistic styles: photorealistic, artistic, cartoon, sketch, digital art
+ - Flexible aspect ratios and output formats
+ - Advanced prompt engineering and negative prompts
+
+ 🗣️ **Speech Generation**
+ - Convert text to natural-sounding speech with 30+ voice options
+ - Long-form content narration with chapter breaks
+ - Technical code explanation with spoken analysis
+ - Voice customization and style control
+ - Multi-language support (24 languages)
+ - Professional audio export in WAV format
+
  🤖 **AI-Powered**
  - Uses Google Gemini 2.5 Flash for fast, accurate analysis
+ - Advanced Imagen API for high-quality image generation
+ - Gemini Speech Generation API for natural voice synthesis
  - Detailed technical insights for developers
  - Actionable recommendations for fixing issues
  - Structured output with detected elements and coordinates

+ ### Google Gemini Documentation
+ - [Gemini API](https://ai.google.dev/gemini-api/docs?hl=en)
+ - [Gemini Models](https://ai.google.dev/gemini-api/docs/models)
+ - [Video Understanding](https://ai.google.dev/gemini-api/docs/video-understanding?hl=en)
+ - [Image Understanding](https://ai.google.dev/gemini-api/docs/image-understanding)
+ - [Document Understanding](https://ai.google.dev/gemini-api/docs/document-processing)
+ - [Audio Understanding](https://ai.google.dev/gemini-api/docs/audio)
+ - [Speech Generation](https://ai.google.dev/gemini-api/docs/speech-generation)
+ - [Image Generation](https://ai.google.dev/gemini-api/docs/image-generation)
+ - [Video Generation](https://ai.google.dev/gemini-api/docs/video)
+
  ## Quick Start

  ### Getting Your Google Gemini API Key
@@ -919,7 +946,74 @@ Compare two images to identify visual differences.
  {
  "source1": "/path/to/before.png",
  "source2": "/path/to/after.png",
- "comparison_type": "structural"
+ "comparison_type": "structural"
+ }
+ ```
+
+ ### gemini_gen_image
+
+ Generate high-quality images from text descriptions using Gemini Imagen API.
+
+ ```json
+ {
+ "prompt": "A modern minimalist login form with clean typography",
+ "style": "digital_art",
+ "aspect_ratio": "16:9",
+ "negative_prompt": "cluttered, low quality, blurry"
+ }
+ ```
+
+ ### mouth_speak
+
+ Convert text to natural-sounding speech with voice customization.
+
+ ```json
+ {
+ "text": "Welcome to our application. Let me guide you through the interface.",
+ "voice": "Zephyr",
+ "language": "en-US",
+ "style_prompt": "Speak in a friendly, welcoming tone"
+ }
+ ```
+
+ ### mouth_narrate
+
+ Generate narration for long-form content with chapter breaks and style control.
+
+ ```json
+ {
+ "content": "Chapter 1: Introduction to React...",
+ "voice": "Sage",
+ "narration_style": "educational",
+ "chapter_breaks": true,
+ "max_chunk_size": 8000
+ }
+ ```
+
+ ### mouth_explain
+
+ Generate spoken explanations of code with technical analysis.
+
+ ```json
+ {
+ "code": "function factorial(n) { return n <= 1 ? 1 : n * factorial(n-1); }",
+ "programming_language": "javascript",
+ "voice": "Apollo",
+ "explanation_level": "intermediate",
+ "include_examples": true
+ }
+ ```
+
+ ### mouth_customize
+
+ Test different voices and styles to find the best fit for your content.
+
+ ```json
+ {
+ "text": "Hello, this is a voice test sample.",
+ "voice": "Charon",
+ "style_variations": ["professional", "casual", "energetic"],
+ "compare_voices": ["Puck", "Sage", "Apollo"]
  }
  ```
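
The `gemini_gen_image` arguments added in this hunk take a small set of enumerated values. The following is a minimal TypeScript sketch of client-side validation, assuming the style and aspect-ratio lists given in the README's Phase 5 roadmap entry are the full set the server accepts. `checkGenImageArgs` and the constant names are hypothetical and are not part of the package.

```typescript
// Allowed values copied from the README's Phase 5 roadmap entry (assumed to be exhaustive).
const STYLES = ["photorealistic", "artistic", "cartoon", "sketch", "digital_art"] as const;
const ASPECT_RATIOS = ["1:1", "16:9", "9:16", "4:3", "3:4"] as const;

// Hypothetical helper: returns a list of problems, empty when the input looks valid.
function checkGenImageArgs(input: Record<string, unknown>): string[] {
  const problems: string[] = [];
  const prompt = input["prompt"];
  const style = input["style"];
  const aspectRatio = input["aspect_ratio"];
  const negativePrompt = input["negative_prompt"];

  if (typeof prompt !== "string" || prompt.trim() === "") {
    problems.push("prompt must be a non-empty string");
  }
  if (style !== undefined && (STYLES as readonly unknown[]).indexOf(style) === -1) {
    problems.push("style must be one of: " + STYLES.join(", "));
  }
  if (aspectRatio !== undefined && (ASPECT_RATIOS as readonly unknown[]).indexOf(aspectRatio) === -1) {
    problems.push("aspect_ratio must be one of: " + ASPECT_RATIOS.join(", "));
  }
  if (negativePrompt !== undefined && typeof negativePrompt !== "string") {
    problems.push("negative_prompt must be a string when present");
  }
  return problems;
}
```

Returning a list of problems instead of throwing lets a caller report every invalid field in one pass before sending the request.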
 
@@ -950,12 +1044,67 @@ Compare two images to identify visual differences.
  # Check accessibility compliance
  {
  "source": "page-screenshot.png",
- "type": "image",
+ "type": "image",
  "analysis_type": "accessibility",
  "check_accessibility": true
  }
  ```

+ ### Image Generation for Design
+ ```bash
+ # Generate UI mockups and design elements
+ {
+ "prompt": "Professional dashboard interface with data visualization charts",
+ "style": "digital_art",
+ "aspect_ratio": "16:9"
+ }
+ ```
+
+ ### Prototype Creation
+ ```bash
+ # Create visual prototypes for development
+ {
+ "prompt": "Mobile app login screen with modern design, dark theme",
+ "style": "photorealistic",
+ "aspect_ratio": "9:16",
+ "negative_prompt": "old-fashioned, bright colors"
+ }
+ ```
+
+ ### Code Explanation Audio
+ ```bash
+ # Generate spoken explanations for code reviews
+ {
+ "code": "const useAuth = () => { const [user, setUser] = useState(null); return { user, login: setUser }; }",
+ "programming_language": "javascript",
+ "voice": "Apollo",
+ "explanation_level": "advanced",
+ "include_examples": true
+ }
+ ```
+
+ ### Documentation Narration
+ ```bash
+ # Convert technical documentation to audio
+ {
+ "content": "This API endpoint handles user authentication and returns a JWT token...",
+ "voice": "Sage",
+ "narration_style": "professional",
+ "chapter_breaks": true
+ }
+ ```
+
+ ### User Interface Voice Feedback
+ ```bash
+ # Generate voice responses for applications
+ {
+ "text": "File uploaded successfully. Processing will complete in approximately 30 seconds.",
+ "voice": "Kore",
+ "language": "en-US",
+ "style_prompt": "Speak in a helpful, reassuring tone"
+ }
+ ```
+
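
The usage examples added in this hunk are fenced as `bash` but contain only JSON argument objects, without the tool name or transport envelope. As a rough TypeScript sketch, an MCP client would wrap such an object in a JSON-RPC 2.0 `tools/call` request. `buildToolCall` is a hypothetical helper, and pairing the "User Interface Voice Feedback" arguments with `mouth_speak` is an inference from the matching field names, not something the README states.

```typescript
// JSON-RPC 2.0 envelope that MCP uses for tool invocations.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

let nextRequestId = 1;

// Hypothetical helper: wraps a tool name and its arguments in a tools/call request.
function buildToolCall(name: string, args: Record<string, unknown>): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id: nextRequestId++,
    method: "tools/call",
    params: { name: name, arguments: args },
  };
}

// Arguments taken from the "User Interface Voice Feedback" example;
// the tool name is assumed from the matching mouth_speak fields.
const voiceFeedbackRequest = buildToolCall("mouth_speak", {
  text: "File uploaded successfully. Processing will complete in approximately 30 seconds.",
  voice: "Kore",
  language: "en-US",
  style_prompt: "Speak in a helpful, reassuring tone",
});

// Serialized form, ready to send over whichever transport the client uses.
const voiceFeedbackPayload: string = JSON.stringify(voiceFeedbackRequest);
```

In practice an MCP client library builds this envelope itself, so the sketch is mainly useful for reading raw traffic or for testing the server by hand.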
  ## Prompts

  Human MCP includes pre-built prompts for common debugging scenarios:
@@ -1041,9 +1190,19 @@ HTTP_ENABLE_RATE_LIMITING=false
  Human MCP Server
  ├── Eyes Tool (Vision Understanding)
  │ ├── Image Analysis
- │ ├── Video Processing
+ │ ├── Video Processing
  │ ├── GIF Frame Extraction
  │ └── Visual Comparison
+ ├── Hands Tool (Content Generation)
+ │ ├── Image Generation
+ │ ├── Style Customization
+ │ ├── Aspect Ratio Control
+ │ └── Prompt Engineering
+ ├── Mouth Tool (Speech Generation)
+ │ ├── Text-to-Speech Synthesis
+ │ ├── Long-form Narration
+ │ ├── Code Explanation
+ │ └── Voice Customization
  ├── Debugging Prompts
  └── Documentation Resources
  ```
@@ -1056,7 +1215,7 @@ For detailed architecture information and future development plans, see:

  **Mission**: Transform AI coding agents with complete human-like sensory capabilities, bridging the gap between artificial and human intelligence through sophisticated multimodal analysis.

- ### Current Status: Phase 1 Complete ✅
+ ### Current Status: Phase 1 Complete ✅ | Phase 4 Complete ✅ | Phase 5 Complete ✅

  **Eyes (Visual Analysis)** - Production Ready (v1.2.1)
  - Advanced image, video, and GIF analysis capabilities
@@ -1065,6 +1224,21 @@ For detailed architecture information and future development plans, see:
  - Processing 20+ visual formats with 98.5% success rate
  - Sub-30 second response times for detailed analysis

+ **Hands (Content Generation)** - Production Ready (v1.2.2)
+ - High-quality image generation using Gemini Imagen API
+ - Multiple artistic styles and aspect ratios
+ - Advanced prompt engineering with negative prompts
+ - Comprehensive validation and error handling
+ - Fast generation times with reliable output
+
+ **Mouth (Speech Generation)** - Production Ready (v1.3.0)
+ - Natural text-to-speech with 30+ voice options
+ - Long-form content narration with chapter breaks
+ - Technical code explanation with spoken analysis
+ - Voice customization and style control
+ - Multi-language support (24 languages)
+ - Professional audio export in WAV format
+
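
The Mouth entry lists WAV as the export format. One plausible way to produce it, sketched here in TypeScript, is to prepend a 44-byte RIFF/WAVE header to raw PCM samples. The defaults (24 kHz, mono, 16-bit little-endian) are an assumption based on how the Gemini speech generation documentation describes its audio output. `pcmToWav` is a hypothetical name, and the diff does not show how the package itself writes audio files.

```typescript
// Wraps raw little-endian PCM samples in a minimal 44-byte RIFF/WAVE header.
// Default sample rate, channel count, and bit depth are assumptions, not values from the package.
function pcmToWav(
  pcm: Uint8Array,
  sampleRate: number = 24000,
  channels: number = 1,
  bitsPerSample: number = 16
): Uint8Array {
  const blockAlign = (channels * bitsPerSample) / 8; // bytes per sample frame
  const byteRate = sampleRate * blockAlign;          // bytes per second of audio
  const out = new Uint8Array(44 + pcm.length);
  const view = new DataView(out.buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + pcm.length, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);             // fmt chunk size for plain PCM
  view.setUint16(20, 1, true);              // audio format 1 = uncompressed PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, "data");
  view.setUint32(40, pcm.length, true);     // size of the sample data that follows
  out.set(pcm, 44);
  return out;
}
```

The function works on `Uint8Array` and `DataView` only, so the same sketch runs in Node, Bun, or a browser without extra dependencies.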
  ### Upcoming Development Phases

  #### Phase 2: Document Understanding (Q4 2025)
@@ -1083,21 +1257,25 @@ For detailed architecture information and future development plans, see:
  - Support for 20+ audio formats (WAV, MP3, AAC, OGG, FLAC)
  - Real-time audio processing capabilities

- #### Phase 4: Speech Generation - Mouth (Q4 2025)
- **AI Voice Capabilities**
- - High-quality text-to-speech with customizable voice parameters
- - Code explanation and technical content narration
- - Multi-language speech generation (10+ languages)
- - Long-form content narration with natural pacing
- - Professional-quality audio export in multiple formats
-
- #### Phase 5: Content Generation - Hands (Q4 2025)
- **Creative Content Creation**
- - Image generation from text descriptions using Imagen API
- - Advanced image editing (inpainting, style transfer, enhancement)
- - Video generation up to 30 seconds using Veo3 API
- - Animation creation with motion graphics
- - Batch content generation for workflow automation
+ #### Phase 4: Speech Generation - Mouth COMPLETE
+ **AI Voice Capabilities** - Production Ready (v1.3.0)
+ - High-quality text-to-speech with 30+ voice options using Gemini Speech API
+ - Code explanation and technical content narration
+ - Multi-language speech generation (24 languages supported)
+ - Long-form content narration with chapter breaks and natural pacing
+ - Professional-quality audio export in WAV format
+ - ✅ Voice customization with style prompts and voice comparison
+
+ #### Phase 5: Content Generation - Hands ✅ COMPLETE
+ **Creative Content Creation** - Production Ready (v1.2.2)
+ - Image generation from text descriptions using Imagen API
+ - Multiple artistic styles: photorealistic, artistic, cartoon, sketch, digital_art
+ - Flexible aspect ratios: 1:1, 16:9, 9:16, 4:3, 3:4
+ - Advanced prompt engineering with negative prompts
+ - Comprehensive error handling and validation
+ - Future: Advanced image editing (inpainting, style transfer, enhancement)
+ - Future: Video generation up to 30 seconds using Veo3 API
+ - Future: Animation creation with motion graphics

  ### Target Architecture (End 2025)