@goonnguyen/human-mcp 2.10.1 → 2.12.0

Files changed (4)
  1. package/README.md +408 -410
  2. package/human-mcp.png +0 -0
  3. package/package.json +8 -3
  4. package/dist/index.js +0 -200265
package/README.md CHANGED
@@ -4,67 +4,51 @@
4
4
 
5
5
  ![Human MCP](human-mcp.png)
6
6
 
7
- Human MCP v2.2.0 is a comprehensive Model Context Protocol server that provides AI coding agents with human-like capabilities including visual analysis, document processing, speech generation, content creation, and advanced reasoning for debugging, understanding, and enhancing multimodal content.
7
+ Human MCP v2.10.0 is a comprehensive Model Context Protocol server that provides AI coding agents with human-like capabilities including visual analysis, document processing, speech generation, content creation, image editing, browser automation, and advanced reasoning for debugging, understanding, and enhancing multimodal content.
8
8
 
9
9
  ## Features
10
10
 
11
- 🎯 **Visual Analysis (Eyes) - ✅ Complete**
12
- - Analyze screenshots for UI bugs and layout issues
13
- - Process screen recordings to understand error sequences
14
- - Extract insights from GIFs and animations
15
- - Compare visual changes between versions
16
-
17
- 📄 **Document Processing (Eyes Extended) - ✅ Complete v2.0.0**
18
- - Comprehensive document analysis for PDF, DOCX, XLSX, PPTX, TXT, MD, RTF, ODT, CSV, JSON, XML, HTML
19
- - Structured data extraction using custom JSON schemas
20
- - Document summarization with multiple types (brief, detailed, executive, technical)
21
- - Text extraction with formatting preservation
22
- - Table and image extraction from documents
23
- - Auto-format detection and processing
24
-
25
- 🔍 **Specialized Analysis Types**
26
- - **UI Debug**: Layout issues, rendering problems, visual bugs
27
- - **Error Detection**: Visible errors, broken functionality, system failures
28
- - **Accessibility**: Color contrast, WCAG compliance, readability
29
- - **Performance**: Loading states, visual performance indicators
30
- - **Layout**: Responsive design, positioning, visual hierarchy
31
- - **Document Analysis**: Content extraction, data mining, document intelligence
32
-
33
- 🎨 **Content Generation (Hands) - Complete v2.0.0**
34
- - Generate high-quality images from text descriptions using Imagen API
35
- - Create professional videos from text prompts using Veo 3.0 API
36
- - Image-to-video generation combining Imagen and Veo 3.0
37
- - Multiple artistic styles: photorealistic, artistic, cartoon, sketch, digital art (images) and realistic, cinematic, artistic, cartoon, animation (videos)
38
- - Flexible aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4) and output formats
39
- - Video duration controls (4s, 8s, 12s) with FPS options (1-60 fps)
40
- - Camera movement controls: static, pan, zoom, dolly movements
41
- - Advanced prompt engineering and negative prompts
42
-
43
- 🗣️ **Speech Generation (Mouth) - ✅ Complete v1.3.0**
44
- - Convert text to natural-sounding speech with 30+ voice options
45
- - Long-form content narration with chapter breaks
46
- - Technical code explanation with spoken analysis
47
- - Voice customization and style control
48
- - Multi-language support (24 languages)
49
- - Professional audio export in WAV format
50
-
51
- 🧠 **Advanced Reasoning (Brain) - Complete v2.2.0**
52
- - Sequential thinking with dynamic problem-solving and thought revision
53
- - Multi-step analysis with hypothesis generation and testing
54
- - Deep analytical reasoning with assumption tracking and alternative perspectives
55
- - Problem solving with constraint handling and iterative refinement
56
- - Meta-cognitive reflection and analysis improvement
57
- - Advanced reasoning patterns for complex technical problems
58
-
59
- 🤖 **AI-Powered**
60
- - Uses Google Gemini 2.5 Flash for fast, accurate analysis
61
- - Advanced Imagen API for high-quality image generation
62
- - Cutting-edge Veo 3.0 API for professional video generation
63
- - Gemini Speech Generation API for natural voice synthesis
64
- - Advanced reasoning with sequential thinking and meta-cognitive reflection
65
- - Detailed technical insights for developers
66
- - Actionable recommendations for fixing issues
67
- - Structured output with detected elements and coordinates
11
+ 🎯 **Visual Analysis (Eyes) - ✅ Complete (4 tools)**
12
+ - **eyes_analyze**: Analyze images, videos, and GIFs for UI bugs, errors, and accessibility
13
+ - **eyes_compare**: Compare two images to find visual differences
14
+ - **eyes_read_document**: Extract text and data from PDF, DOCX, XLSX, PPTX, and more
15
+ - **eyes_summarize_document**: Generate summaries and insights from documents
16
+
17
+ ✋ **Content Generation & Image Editing (Hands) - ✅ Complete (16 tools)**
18
+ - **Image Generation** (1 tool): gemini_gen_image - Generate images from text using Imagen API
19
+ - **Video Generation** (2 tools): gemini_gen_video, gemini_image_to_video - Create videos with Veo 3.0
20
+ - **AI Image Editing** (5 tools): Gemini-powered editing with inpainting, outpainting, style transfer, object manipulation, composition
21
+ - **Jimp Processing** (4 tools): Local image manipulation - crop, resize, rotate, mask
22
+ - **Background Removal** (1 tool): rmbg_remove_background - AI-powered background removal
23
+ - **Browser Automation** (3 tools): playwright_screenshot_fullpage, playwright_screenshot_viewport, playwright_screenshot_element - Automated web screenshots
24
+
25
+ 🗣️ **Speech Generation (Mouth) - ✅ Complete (4 tools)**
26
+ - **mouth_speak**: Convert text to speech with 30+ voices and 24 languages
27
+ - **mouth_narrate**: Long-form content narration with chapter breaks
28
+ - **mouth_explain**: Generate spoken code explanations with technical analysis
29
+ - **mouth_customize**: Test and compare different voices and styles
30
+
31
+ 🧠 **Advanced Reasoning (Brain) - Complete (3 tools)**
32
+ - **mcp__reasoning__sequentialthinking**: Native sequential thinking with thought revision
33
+ - **brain_analyze_simple**: Fast pattern-based analysis (problem solving, root cause, SWOT, etc.)
34
+ - **brain_patterns_info**: List available reasoning patterns and frameworks
35
+ - **brain_reflect_enhanced**: AI-powered meta-cognitive reflection for complex analysis
36
+
37
+ ## Total: 27 MCP Tools Across 4 Human Capabilities
38
+
39
+ **👁️ Eyes (4 tools)** - Visual analysis and document processing
40
+ **✋ Hands (16 tools)** - Content generation, image editing, and browser automation
41
+ **🗣️ Mouth (4 tools)** - Speech generation and narration
42
+ **🧠 Brain (3 tools)** - Advanced reasoning and problem solving
43
+
44
+ ### Technology Stack
45
+ - **Google Gemini 2.5 Flash** - Vision, document, and reasoning AI
46
+ - **Gemini Imagen API** - High-quality image generation
47
+ - **Gemini Veo 3.0 API** - Professional video generation
48
+ - **Gemini Speech API** - Natural voice synthesis (30+ voices, 24 languages)
49
+ - **Playwright** - Browser automation for web screenshots
50
+ - **Jimp** - Fast local image processing
51
+ - **rmbg** - AI-powered background removal (U2Net+, ModNet, BRIAI models)
68
52
 
69
53
  ### Google Gemini Documentation
70
54
  - [Gemini API](https://ai.google.dev/gemini-api/docs?hl=en)
@@ -945,291 +929,293 @@ for file in *.png; do
945
929
  done
946
930
  ```
947
931
 
948
- ## Tools
932
+ ## MCP Tools Reference
949
933
 
950
- ### eyes_analyze
951
-
952
- Comprehensive visual analysis for images, videos, and GIFs.
953
-
954
- ```json
955
- {
956
- "source": "/path/to/screenshot.png",
957
- "type": "image",
958
- "analysis_type": "ui_debug",
959
- "detail_level": "detailed",
960
- "specific_focus": "login form validation"
961
- }
962
- ```
963
-
964
- ### eyes_compare
965
-
966
- Compare two images to identify visual differences.
934
+ ### 👁️ Eyes Tools (Visual Analysis & Document Processing)
967
935
 
936
+ **eyes_analyze** - Analyze images, videos, and GIFs
968
937
  ```json
969
938
  {
970
- "source1": "/path/to/before.png",
971
- "source2": "/path/to/after.png",
972
- "comparison_type": "structural"
939
+ "source": "path/to/image.png or URL",
940
+ "focus": "What to analyze (optional)",
941
+ "detail": "quick or detailed (default: detailed)"
973
942
  }
974
943
  ```
975
944
 
976
- ### eyes_read_document
977
-
978
- Comprehensive document analysis and content extraction.
979
-
945
+ **eyes_compare** - Compare two images
980
946
  ```json
981
947
  {
982
- "source": "/path/to/document.pdf",
983
- "format": "auto",
984
- "options": {
985
- "extract_text": true,
986
- "extract_tables": true,
987
- "detail_level": "detailed"
988
- }
948
+ "image1": "path/to/first.png",
949
+ "image2": "path/to/second.png",
950
+ "focus": "differences, similarities, layout, or content"
989
951
  }
990
952
  ```
991
953
 
992
- ### eyes_extract_data
993
-
994
- Extract structured data from documents using custom schemas.
995
-
954
+ **eyes_read_document** - Extract content from documents
996
955
  ```json
997
956
  {
998
- "source": "/path/to/invoice.pdf",
999
- "format": "auto",
1000
- "schema": {
1001
- "invoice_number": "string",
1002
- "amount": "number",
1003
- "date": "string"
1004
- }
957
+ "document": "path/to/document.pdf",
958
+ "pages": "1-5 or all (default: all)",
959
+ "extract": "text, tables, or both (default: both)"
1005
960
  }
1006
961
  ```
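The `pages` field above accepts either a range such as `1-5` or the word `all`. A minimal TypeScript sketch of how a client might normalize that value before sending it; `parsePages` and `PageSelection` are hypothetical names, and the server's own parsing rules are not shown in this diff.

```typescript
// Hypothetical normalized form of the `pages` argument.
type PageSelection = "all" | { first: number; last: number };

// Accepts "all", a single page ("3"), or an inclusive range ("1-5").
// Defaults to "all", matching the default stated in the README example.
function parsePages(value: string = "all"): PageSelection {
  const trimmed = value.trim().toLowerCase();
  if (trimmed === "all") return "all";

  const match = /^(\d+)(?:-(\d+))?$/.exec(trimmed);
  if (match === null) {
    throw new Error("Unrecognized pages value: " + value);
  }

  const first = Number(match[1]);
  const last = match[2] === undefined ? first : Number(match[2]);
  if (first < 1 || last < first) {
    throw new Error("Invalid page range: " + value);
  }
  return { first, last };
}

console.log(parsePages("1-5"));
console.log(parsePages());
```

Rejecting reversed or zero-based ranges on the client side is a design choice of this sketch, not documented behavior of `eyes_read_document`.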
1007
962
 
1008
- ### eyes_summarize
1009
-
1010
- Generate summaries and key insights from documents.
1011
-
963
+ **eyes_summarize_document** - Summarize documents
1012
964
  ```json
1013
965
  {
1014
- "source": "/path/to/report.docx",
1015
- "format": "auto",
1016
- "options": {
1017
- "summary_type": "executive",
1018
- "include_key_points": true,
1019
- "max_length": 500
1020
- }
966
+ "document": "path/to/document.pdf",
967
+ "length": "brief, medium, or detailed",
968
+ "focus": "Specific topics (optional)"
1021
969
  }
1022
970
  ```
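MCP clients invoke a tool by sending a JSON-RPC 2.0 `tools/call` request whose `params` carry the tool name and its arguments. A minimal TypeScript sketch that wraps the `eyes_analyze` arguments shown above in that envelope; the helper name `buildToolCall`, the request `id`, and the sample `focus` text are illustrative assumptions, not part of the package.

```typescript
// Shape of an MCP `tools/call` request (JSON-RPC 2.0).
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical helper: wraps tool arguments in the request envelope.
function buildToolCall(
  id: number,
  toolName: string,
  args: Record<string, unknown>,
): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: toolName, arguments: args },
  };
}

// Field names follow the README's eyes_analyze example; values are placeholders.
const analyzeRequest = buildToolCall(1, "eyes_analyze", {
  source: "path/to/image.png",
  focus: "login form layout",
  detail: "detailed",
});

console.log(JSON.stringify(analyzeRequest, null, 2));
```

The same helper would apply to any other tool in this reference by changing the tool name and arguments.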
1023
971
 
1024
- ### mouth_speak
1025
-
1026
- Convert text to natural-sounding speech.
972
+ ### 🗣️ Mouth Tools (Speech Generation)
1027
973
 
974
+ **mouth_speak** - Text to speech
1028
975
  ```json
1029
976
  {
1030
- "text": "Welcome to our application. Let me guide you through the interface.",
1031
- "voice": "Zephyr",
1032
- "language": "en-US",
1033
- "style_prompt": "Speak in a friendly, welcoming tone"
977
+ "text": "Your text here (max 32k tokens)",
978
+ "voice": "Zephyr (or 30+ other voices)",
979
+ "language": "en-US (or 24 languages)",
980
+ "style_prompt": "Speaking style description (optional)"
1034
981
  }
1035
982
  ```
1036
983
 
1037
- ### mouth_narrate
1038
-
1039
- Generate narration for long-form content with chapter breaks.
1040
-
984
+ **mouth_narrate** - Long-form narration
1041
985
  ```json
1042
986
  {
1043
- "content": "Chapter 1: Introduction to React...",
987
+ "content": "Long content to narrate",
1044
988
  "voice": "Sage",
1045
- "narration_style": "educational",
989
+ "narration_style": "professional, casual, educational, or storytelling",
1046
990
  "chapter_breaks": true
1047
991
  }
1048
992
  ```
1049
993
 
1050
- ### mouth_explain
1051
-
1052
- Generate spoken explanations of code with technical analysis.
1053
-
994
+ **mouth_explain** - Code explanation
1054
995
  ```json
1055
996
  {
1056
- "code": "function factorial(n) { return n <= 1 ? 1 : n * factorial(n-1); }",
997
+ "code": "function example() {}",
1057
998
  "programming_language": "javascript",
1058
999
  "voice": "Apollo",
1059
- "explanation_level": "intermediate"
1000
+ "explanation_level": "beginner, intermediate, or advanced"
1060
1001
  }
1061
1002
  ```
1062
1003
 
1063
- ### mouth_customize
1064
-
1065
- Test different voices and styles for optimal content delivery.
1066
-
1004
+ **mouth_customize** - Voice testing
1067
1005
  ```json
1068
1006
  {
1069
- "text": "Hello, this is a voice test sample.",
1007
+ "text": "Test sample",
1070
1008
  "voice": "Charon",
1071
- "style_variations": ["professional", "casual", "energetic"],
1072
- "compare_voices": ["Puck", "Sage", "Apollo"]
1009
+ "style_variations": ["professional", "casual"],
1010
+ "compare_voices": ["Puck", "Sage"]
1073
1011
  }
1074
1012
  ```
1075
1013
 
1076
- ### gemini_gen_image
1014
+ ### ✋ Hands Tools (Content Generation & Image Editing)
1077
1015
 
1078
- Generate high-quality images from text descriptions using Gemini Imagen API.
1016
+ #### Image Generation (1 tool)
1079
1017
 
1018
+ **gemini_gen_image** - Generate images from text
1080
1019
  ```json
1081
1020
  {
1082
- "prompt": "A modern minimalist login form with clean typography",
1083
- "style": "digital_art",
1084
- "aspect_ratio": "16:9",
1085
- "negative_prompt": "cluttered, low quality, blurry"
1021
+ "prompt": "A modern minimalist login form",
1022
+ "style": "photorealistic, artistic, cartoon, sketch, or digital_art",
1023
+ "aspect_ratio": "1:1, 16:9, 9:16, 4:3, or 3:4",
1024
+ "negative_prompt": "What to avoid (optional)"
1086
1025
  }
1087
1026
  ```
1088
1027
 
1089
- ### gemini_gen_video
1090
-
1091
- Generate professional videos from text descriptions using Gemini Veo 3.0 API.
1028
+ #### Video Generation (2 tools)
1092
1029
 
1030
+ **gemini_gen_video** - Generate videos from text
1093
1031
  ```json
1094
1032
  {
1095
- "prompt": "A serene mountain landscape at sunrise with gentle camera movement",
1096
- "duration": "8s",
1097
- "style": "cinematic",
1098
- "aspect_ratio": "16:9",
1099
- "camera_movement": "pan_right",
1100
- "fps": 30
1033
+ "prompt": "Mountain landscape at sunrise",
1034
+ "duration": "4s, 8s, or 12s",
1035
+ "style": "realistic, cinematic, artistic, cartoon, or animation",
1036
+ "camera_movement": "static, pan_left, pan_right, zoom_in, zoom_out, dolly_forward, dolly_backward",
1037
+ "fps": 24
1101
1038
  }
1102
1039
  ```
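The example above lists the allowed options inside each string value rather than showing one concrete choice. A small TypeScript sketch, assuming the listed options are the complete set, that encodes them as types and picks one value per field; `isOneOf` and `VideoArgs` are hypothetical names.

```typescript
// Options copied from the gemini_gen_video example above.
const DURATIONS = ["4s", "8s", "12s"] as const;
const STYLES = ["realistic", "cinematic", "artistic", "cartoon", "animation"] as const;
const CAMERA_MOVEMENTS = [
  "static", "pan_left", "pan_right", "zoom_in",
  "zoom_out", "dolly_forward", "dolly_backward",
] as const;

interface VideoArgs {
  prompt: string;
  duration: (typeof DURATIONS)[number];
  style: (typeof STYLES)[number];
  camera_movement: (typeof CAMERA_MOVEMENTS)[number];
  fps: number;
}

// Type guard: true when `value` is one of the listed options.
function isOneOf<T extends string>(options: readonly T[], value: string): value is T {
  return (options as readonly string[]).indexOf(value) !== -1;
}

// One concrete argument object with a single value per field.
const videoArgs: VideoArgs = {
  prompt: "Mountain landscape at sunrise",
  duration: "8s",
  style: "cinematic",
  camera_movement: "pan_right",
  fps: 24,
};

console.log(isOneOf(DURATIONS, "8s"), isOneOf(DURATIONS, "10s"), videoArgs);
```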
1103
1040
 
1104
- ### gemini_image_to_video
1105
-
1106
- Generate videos from images and text descriptions using Imagen + Veo 3.0 pipeline.
1107
-
1041
+ **gemini_image_to_video** - Animate images
1108
1042
  ```json
1109
1043
  {
1110
- "prompt": "Animate this landscape with flowing water and moving clouds",
1111
- "image_input": "data:image/jpeg;base64,/9j/4AAQ...",
1112
- "duration": "12s",
1113
- "style": "realistic",
1044
+ "prompt": "Animate with flowing water",
1045
+ "image_input": "base64 or URL",
1046
+ "duration": "8s",
1114
1047
  "camera_movement": "zoom_in"
1115
1048
  }
1116
1049
  ```
1117
1050
 
1118
- ### mouth_speak
1051
+ #### AI Image Editing (5 tools)
1119
1052
 
1120
- Convert text to natural-sounding speech with voice customization.
1053
+ **gemini_edit_image** - Comprehensive AI editing (5 operations: inpaint, outpaint, style_transfer, object_manipulation, multi_image_compose)
1121
1054
 
1055
+ **gemini_inpaint_image** - Add/modify areas with text (no mask required)
1122
1056
  ```json
1123
1057
  {
1124
- "text": "Welcome to our application. Let me guide you through the interface.",
1125
- "voice": "Zephyr",
1126
- "language": "en-US",
1127
- "style_prompt": "Speak in a friendly, welcoming tone"
1058
+ "input_image": "base64 or path",
1059
+ "prompt": "What to add/change",
1060
+ "mask_prompt": "Where to edit (optional)"
1128
1061
  }
1129
1062
  ```
1130
1063
 
1131
- ### mouth_narrate
1132
-
1133
- Generate narration for long-form content with chapter breaks and style control.
1064
+ **gemini_outpaint_image** - Expand image borders
1065
+ ```json
1066
+ {
1067
+ "input_image": "base64 or path",
1068
+ "prompt": "What to add in expanded area",
1069
+ "expand_direction": "all, left, right, top, bottom, horizontal, vertical",
1070
+ "expansion_ratio": 1.5
1071
+ }
1072
+ ```
1134
1073
 
1074
+ **gemini_style_transfer_image** - Apply artistic styles
1135
1075
  ```json
1136
1076
  {
1137
- "content": "Chapter 1: Introduction to React...",
1138
- "voice": "Sage",
1139
- "narration_style": "educational",
1140
- "chapter_breaks": true,
1141
- "max_chunk_size": 8000
1077
+ "input_image": "base64 or path",
1078
+ "prompt": "Desired style",
1079
+ "style_image": "Reference image (optional)",
1080
+ "style_strength": 0.7
1142
1081
  }
1143
1082
  ```
1144
1083
 
1145
- ### mouth_explain
1084
+ **gemini_compose_images** - Combine multiple images
1085
+ ```json
1086
+ {
1087
+ "input_image": "Primary image",
1088
+ "secondary_images": ["image1", "image2"],
1089
+ "prompt": "How to compose",
1090
+ "composition_layout": "blend, collage, overlay, side_by_side"
1091
+ }
1092
+ ```
1146
1093
 
1147
- Generate spoken explanations of code with technical analysis.
1094
+ #### Jimp Processing (4 tools - Local, Fast)
1148
1095
 
1096
+ **jimp_crop_image** - Crop images (6 modes)
1149
1097
  ```json
1150
1098
  {
1151
- "code": "function factorial(n) { return n <= 1 ? 1 : n * factorial(n-1); }",
1152
- "programming_language": "javascript",
1153
- "voice": "Apollo",
1154
- "explanation_level": "intermediate",
1155
- "include_examples": true
1099
+ "input_image": "path or URL",
1100
+ "mode": "manual, center, top_left, aspect_ratio",
1101
+ "width": 800,
1102
+ "height": 600
1156
1103
  }
1157
1104
  ```
1158
1105
 
1159
- ### mouth_customize
1160
-
1161
- Test different voices and styles to find the best fit for your content.
1106
+ **jimp_resize_image** - Resize images (5 algorithms)
1107
+ ```json
1108
+ {
1109
+ "input_image": "path or URL",
1110
+ "width": 1920,
1111
+ "algorithm": "bilinear, bicubic, nearestNeighbor",
1112
+ "maintain_aspect_ratio": true
1113
+ }
1114
+ ```
1162
1115
 
1116
+ **jimp_rotate_image** - Rotate images
1163
1117
  ```json
1164
1118
  {
1165
- "text": "Hello, this is a voice test sample.",
1166
- "voice": "Charon",
1167
- "style_variations": ["professional", "casual", "energetic"],
1168
- "compare_voices": ["Puck", "Sage", "Apollo"]
1119
+ "input_image": "path or URL",
1120
+ "angle": 90,
1121
+ "background_color": "#ffffff"
1169
1122
  }
1170
1123
  ```
1171
1124
 
1172
- ### brain_think
1125
+ **jimp_mask_image** - Apply grayscale masks
1126
+ ```json
1127
+ {
1128
+ "input_image": "path or URL",
1129
+ "mask_image": "path or URL (black=transparent, white=opaque)"
1130
+ }
1131
+ ```
1173
1132
 
1174
- Advanced sequential thinking with dynamic problem-solving and thought revision.
1133
+ #### Background Removal (1 tool)
1175
1134
 
1135
+ **rmbg_remove_background** - AI background removal (3 quality levels: fast, balanced, high)
1176
1136
  ```json
1177
1137
  {
1178
- "problem": "Complex technical issue requiring multi-step analysis",
1179
- "initialThoughts": 5,
1180
- "thinkingStyle": "analytical",
1181
- "context": {
1182
- "domain": "software engineering",
1183
- "constraints": ["limited resources", "tight deadline"]
1184
- },
1185
- "options": {
1186
- "allowRevision": true,
1187
- "enableBranching": true,
1188
- "maxThoughts": 10
1189
- }
1138
+ "input_image": "path or URL",
1139
+ "quality": "fast, balanced, or high",
1140
+ "output_format": "png or jpeg"
1190
1141
  }
1191
1142
  ```
1192
1143
 
1193
- ### brain_analyze
1144
+ #### Browser Automation (3 tools)
1194
1145
 
1195
- Deep analytical reasoning with assumption tracking and alternative perspectives.
1146
+ **playwright_screenshot_fullpage** - Capture full page including scrollable content
1147
+ ```json
1148
+ {
1149
+ "url": "https://example.com",
1150
+ "format": "png or jpeg",
1151
+ "quality": 80,
1152
+ "timeout": 30000,
1153
+ "wait_until": "load, domcontentloaded, or networkidle",
1154
+ "viewport": { "width": 1920, "height": 1080 }
1155
+ }
1156
+ ```
1196
1157
 
1158
+ **playwright_screenshot_viewport** - Capture visible viewport area only
1197
1159
  ```json
1198
1160
  {
1199
- "subject": "System architecture design decisions",
1200
- "analysisDepth": "detailed",
1201
- "considerAlternatives": true,
1202
- "trackAssumptions": true,
1203
- "focusAreas": ["scalability", "security", "maintainability"],
1204
- "thinkingStyle": "systematic"
1161
+ "url": "https://example.com",
1162
+ "format": "png or jpeg",
1163
+ "quality": 80,
1164
+ "timeout": 30000,
1165
+ "wait_until": "networkidle",
1166
+ "viewport": { "width": 1920, "height": 1080 }
1205
1167
  }
1206
1168
  ```
1207
1169
 
1208
- ### brain_solve
1170
+ **playwright_screenshot_element** - Capture specific element on page
1171
+ ```json
1172
+ {
1173
+ "url": "https://example.com",
1174
+ "selector": ".main-content or 'Click me' or 'button'",
1175
+ "selector_type": "css, text, or role",
1176
+ "format": "png or jpeg",
1177
+ "timeout": 30000,
1178
+ "wait_for_selector": true
1179
+ }
1180
+ ```
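The three screenshot tools share most of their arguments. A hedged TypeScript sketch of a client-side check for those shared fields; the type and function names are hypothetical, and the 0-100 `quality` range is an assumption borrowed from Playwright's own screenshot option rather than something this README states.

```typescript
type ScreenshotFormat = "png" | "jpeg";
type WaitUntil = "load" | "domcontentloaded" | "networkidle";

// Fields taken from the fullpage and viewport examples above.
interface ScreenshotArgs {
  url: string;
  format?: ScreenshotFormat;
  quality?: number;
  timeout?: number;
  wait_until?: WaitUntil;
  viewport?: { width: number; height: number };
}

// Returns a list of problems; an empty list means the arguments look usable.
function checkScreenshotArgs(args: ScreenshotArgs): string[] {
  const problems: string[] = [];
  if (!/^https?:\/\//i.test(args.url)) {
    problems.push("url must start with http:// or https://");
  }
  // Assumed range, mirroring Playwright's `quality` option.
  if (args.quality !== undefined && (args.quality < 0 || args.quality > 100)) {
    problems.push("quality must be between 0 and 100");
  }
  if (args.timeout !== undefined && args.timeout < 0) {
    problems.push("timeout must not be negative");
  }
  if (args.viewport !== undefined &&
      (args.viewport.width <= 0 || args.viewport.height <= 0)) {
    problems.push("viewport width and height must be positive");
  }
  return problems;
}

console.log(checkScreenshotArgs({
  url: "https://example.com",
  format: "jpeg",
  quality: 80,
  timeout: 30000,
  wait_until: "networkidle",
  viewport: { width: 1920, height: 1080 },
}));
```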
1209
1181
 
1210
- Multi-step problem solving with hypothesis testing and constraint handling.
1182
+ ### 🧠 Brain Tools (Advanced Reasoning)
1211
1183
 
1184
+ **mcp__reasoning__sequentialthinking** - Native sequential thinking with thought revision
1212
1185
  ```json
1213
1186
  {
1214
- "problemStatement": "Performance bottleneck in distributed system",
1215
- "solutionApproach": "systematic",
1216
- "verifyHypotheses": true,
1217
- "maxIterations": 10,
1218
- "constraints": ["budget limitations", "existing infrastructure"],
1219
- "requirements": ["99.9% uptime", "sub-second response"]
1187
+ "problem": "Complex issue description",
1188
+ "thought": "Current thinking step",
1189
+ "thoughtNumber": 1,
1190
+ "totalThoughts": 5,
1191
+ "nextThoughtNeeded": true,
1192
+ "isRevision": false
1220
1193
  }
1221
1194
  ```
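The fields above describe one step in a numbered chain of thoughts. A minimal TypeScript sketch of generating the whole chain up front, assuming `thoughtNumber` is 1-based and `nextThoughtNeeded` is false only on the final step; `planThoughts` and the sample thoughts are hypothetical.

```typescript
// Field names taken from the mcp__reasoning__sequentialthinking example above.
interface ThoughtStep {
  problem: string;
  thought: string;
  thoughtNumber: number;
  totalThoughts: number;
  nextThoughtNeeded: boolean;
  isRevision: boolean;
}

// Builds one argument object per thought, numbering them from 1.
function planThoughts(problem: string, thoughts: string[]): ThoughtStep[] {
  return thoughts.map((thought, index) => ({
    problem,
    thought,
    thoughtNumber: index + 1,
    totalThoughts: thoughts.length,
    nextThoughtNeeded: index + 1 < thoughts.length,
    isRevision: false,
  }));
}

const steps = planThoughts("Login form rejects valid passwords", [
  "Reproduce the failure with a known-good account",
  "Compare client-side and server-side validation rules",
  "Check whether the password is trimmed before hashing",
]);

for (const step of steps) {
  console.log(step.thoughtNumber + "/" + step.totalThoughts, step.thought);
}
```

In practice the total is often revised as reasoning proceeds, so a fixed plan like this is only a starting point.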
1222
1195
 
1223
- ### brain_reflect
1196
+ **brain_analyze_simple** - Fast pattern-based analysis
1197
+ ```json
1198
+ {
1199
+ "problem": "Issue to analyze",
1200
+ "pattern": "problem_solving, root_cause, pros_cons, swot, or cause_effect",
1201
+ "context": "Additional background (optional)"
1202
+ }
1203
+ ```
1224
1204
 
1225
- Meta-cognitive reflection and analysis improvement.
1205
+ **brain_patterns_info** - List reasoning patterns
1206
+ ```json
1207
+ {
1208
+ "pattern": "Specific pattern name (optional)"
1209
+ }
1210
+ ```
1226
1211
 
1212
+ **brain_reflect_enhanced** - AI-powered meta-cognitive reflection
1227
1213
  ```json
1228
1214
  {
1229
- "originalAnalysis": "Previous analysis of system architecture decisions and their implications...",
1230
- "reflectionFocus": ["assumptions", "logic_gaps", "alternative_approaches"],
1231
- "improvementGoals": ["reduce bias", "consider edge cases"],
1232
- "newInformation": "Recent performance metrics show different bottlenecks"
1215
+ "originalAnalysis": "Previous analysis to reflect on",
1216
+ "focusAreas": ["assumptions", "logic_gaps", "alternative_approaches"],
1217
+ "improvementGoal": "What to improve (optional)",
1218
+ "detailLevel": "concise or detailed"
1233
1219
  }
1234
1220
  ```
1235
1221
 
@@ -1390,6 +1376,40 @@ Meta-cognitive reflection and analysis improvement.
1390
1376
  }
1391
1377
  ```
1392
1378
 
1379
+ ### Automated Web Screenshots
1380
+ ```bash
1381
+ # Capture full page screenshot for documentation
1382
+ {
1383
+ "url": "https://example.com/dashboard",
1384
+ "format": "png",
1385
+ "wait_until": "networkidle",
1386
+ "viewport": { "width": 1920, "height": 1080 }
1387
+ }
1388
+ ```
1389
+
1390
+ ### Element-Specific Screenshots
1391
+ ```bash
1392
+ # Capture specific UI component for bug reporting
1393
+ {
1394
+ "url": "https://example.com/app",
1395
+ "selector": ".error-message",
1396
+ "selector_type": "css",
1397
+ "wait_for_selector": true,
1398
+ "format": "png"
1399
+ }
1400
+ ```
1401
+
1402
+ ### Responsive Testing Screenshots
1403
+ ```bash
1404
+ # Capture mobile viewport for responsive design testing
1405
+ {
1406
+ "url": "https://example.com",
1407
+ "format": "png",
1408
+ "viewport": { "width": 375, "height": 812 },
1409
+ "wait_until": "networkidle"
1410
+ }
1411
+ ```
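Responsive testing usually means repeating the same capture at several viewport sizes. A short TypeScript sketch that builds one argument object per size, reusing the two sizes that appear in the examples above (375×812 and 1920×1080); the labels and the `responsiveRequests` helper are illustrative assumptions.

```typescript
interface LabeledViewport {
  label: string;
  width: number;
  height: number;
}

// Sizes taken from the README examples; the labels are illustrative.
const viewports: LabeledViewport[] = [
  { label: "mobile", width: 375, height: 812 },
  { label: "desktop", width: 1920, height: 1080 },
];

// Builds screenshot arguments for each viewport, keeping the other fields fixed.
function responsiveRequests(url: string) {
  return viewports.map((vp) => ({
    label: vp.label,
    arguments: {
      url,
      format: "png",
      wait_until: "networkidle",
      viewport: { width: vp.width, height: vp.height },
    },
  }));
}

console.log(JSON.stringify(responsiveRequests("https://example.com"), null, 2));
```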
1412
+
1393
1413
  ## Prompts
1394
1414
 
1395
1415
  Human MCP includes pre-built prompts for common debugging scenarios:
@@ -1473,197 +1493,175 @@ HTTP_ENABLE_RATE_LIMITING=false
1473
1493
  ## Architecture
1474
1494
 
1475
1495
  ```
1476
- Human MCP Server
1477
- ├── Eyes Tool (Vision Understanding)
1478
- │ ├── Image Analysis
1479
- │ ├── Video Processing
1480
- │ ├── GIF Frame Extraction
1481
- │ ├── Visual Comparison
1482
- │ └── Document Processing (PDF, DOCX, XLSX, PPTX, etc.)
1483
- ├── Hands Tool (Content Generation)
1484
- │ ├── Image Generation (Imagen API)
1485
- │ ├── Video Generation (Veo 3.0 API)
1486
- │ ├── Image-to-Video Pipeline
1487
- │ ├── Style Customization
1488
- │ ├── Aspect Ratio & Duration Control
1489
- │ ├── Camera Movement Control
1490
- │ └── Prompt Engineering
1491
- ├── Mouth Tool (Speech Generation)
1492
- │ ├── Text-to-Speech Synthesis
1493
- │ ├── Long-form Narration
1494
- │ ├── Code Explanation
1495
- │ └── Voice Customization
1496
- ├── Brain Tool (Advanced Reasoning) ✅ COMPLETE
1497
- │ ├── Sequential Thinking
1498
- │ ├── Deep Analytical Reasoning
1499
- │ ├── Problem Solving
1500
- │ ├── Meta-cognitive Reflection
1501
- │ ├── Hypothesis Testing
1502
- │ ├── Thought Revision
1503
- │ ├── Assumption Tracking
1504
- │ └── Context-aware Reasoning
1505
- ├── Debugging Prompts
1506
- └── Documentation Resources
1507
- ```
1508
-
1509
- For detailed architecture information and future development plans, see:
1510
- - **[Project Roadmap](docs/project-roadmap.md)** - Complete development roadmap and future vision
1511
- - **[Architecture Documentation](docs/codebase-structure-architecture-code-standards.md)** - Technical architecture and code standards
1496
+ Human MCP Server v2.10.0
1497
+ ├── 👁️ Eyes Tools (4) - Visual Analysis & Document Processing
1498
+ │ ├── eyes_analyze - Images, videos, GIFs analysis
1499
+ │ ├── eyes_compare - Image comparison
1500
+ │ ├── eyes_read_document - Document content extraction
1501
+ │ └── eyes_summarize_document - Document summarization
1502
+
1503
+ ├── ✋ Hands Tools (16) - Content Generation, Image Editing & Browser Automation
1504
+ │ ├── Image Generation (1)
1505
+ │ │ └── gemini_gen_image
1506
+ │ ├── Video Generation (2)
1507
+ │ │ ├── gemini_gen_video
1508
+ │ │ └── gemini_image_to_video
1509
+ │ ├── AI Image Editing (5)
1510
+ │ │ ├── gemini_edit_image
1511
+ │ │ ├── gemini_inpaint_image
1512
+ │ │ ├── gemini_outpaint_image
1513
+ │ │ ├── gemini_style_transfer_image
1514
+ │ │ └── gemini_compose_images
1515
+ │ ├── Jimp Processing (4)
1516
+ │ │ ├── jimp_crop_image
1517
+ │ │ ├── jimp_resize_image
1518
+ │ │ ├── jimp_rotate_image
1519
+ │ │ └── jimp_mask_image
1520
+ │ ├── Background Removal (1)
1521
+ │ │ └── rmbg_remove_background
1522
+ │ └── Browser Automation (3)
1523
+ │     ├── playwright_screenshot_fullpage
1524
+ │     ├── playwright_screenshot_viewport
1525
+ │     └── playwright_screenshot_element
1526
+
1527
+ ├── 🗣️ Mouth Tools (4) - Speech Generation
1528
+ │ ├── mouth_speak - Text-to-speech
1529
+ │ ├── mouth_narrate - Long-form narration
1530
+ │ ├── mouth_explain - Code explanation
1531
+ │ └── mouth_customize - Voice testing
1532
+
1533
+ └── 🧠 Brain Tools (3) - Advanced Reasoning
1534
+ ├── mcp__reasoning__sequentialthinking - Native sequential thinking
1535
+ ├── brain_analyze_simple - Pattern-based analysis
1536
+ ├── brain_patterns_info - Reasoning frameworks
1537
+ └── brain_reflect_enhanced - AI-powered reflection
1538
+
1539
+ Total: 27 MCP Tools
1540
+ ```
1541
+
1542
+ **Documentation:**
1543
+ - **[Project Roadmap](docs/project-roadmap.md)** - Development roadmap and future vision
1544
+ - **[Project Overview](docs/project-overview-pdr.md)** - Product requirements and specifications
1545
+ - **[Architecture & Code Standards](docs/codebase-structure-architecture-code-standards.md)** - Technical architecture
1546
+ - **[Codebase Summary](docs/codebase-summary.md)** - Comprehensive codebase overview
1512
1547
 
1513
1548
  ## Development Roadmap & Vision
1514
1549
 
1515
1550
  **Mission**: Transform AI coding agents with complete human-like sensory capabilities, bridging the gap between artificial and human intelligence through sophisticated multimodal analysis.
1516
1551
 
1517
- ### Current Status: Phase 1-2 Complete ✅ | Phase 4-6 Complete ✅ | v2.2.0
1518
-
1519
- **Eyes (Visual Analysis + Document Processing)** - Production Ready (v2.0.0)
1520
- - ✅ Advanced image, video, and GIF analysis capabilities
1521
- - ✅ UI debugging, error detection, accessibility auditing
1522
- - ✅ Image comparison with pixel, structural, and semantic analysis
1523
- - ✅ Document processing for PDF, DOCX, XLSX, PPTX, TXT, MD, RTF, ODT, CSV, JSON, XML, HTML
1524
- - ✅ Structured data extraction using custom JSON schemas
1525
- - Document summarization with multiple types (brief, detailed, executive, technical)
1526
- - ✅ Processing 20+ visual formats + 12+ document formats with 95%+ success rate
1527
- - ✅ Sub-30 second response times for images, sub-60 second for documents
1528
-
1529
- **Mouth (Speech Generation)** - Production Ready (v1.3.0)
1530
- - ✅ Natural text-to-speech with 30+ voice options
1531
- - ✅ Long-form content narration with chapter breaks
1532
- - ✅ Technical code explanation with spoken analysis
1533
- - Voice customization and style control
1534
- - ✅ Multi-language support (24 languages)
1535
- - ✅ Professional audio export in WAV format
1536
-
1537
- **Hands (Content Generation)** - Production Ready (v2.0.0)
1538
- - ✅ High-quality image generation using Gemini Imagen API
1539
- - Professional video generation using Gemini Veo 3.0 API
1540
- - ✅ Image-to-video generation pipeline combining Imagen + Veo 3.0
1541
- - ✅ Multiple artistic styles and aspect ratios for both images and videos
1542
- - ✅ Video duration controls (4s, 8s, 12s) with FPS options (1-60 fps)
1543
- - ✅ Camera movement controls: static, pan, zoom, dolly movements
1544
- - Advanced prompt engineering with negative prompts
1545
- - ✅ Comprehensive validation and error handling with retry logic
1546
- - Fast generation times with reliable output
1547
-
1548
- **Brain (Advanced Reasoning)** - Production Ready (v2.2.0)
1549
- - ✅ Sequential thinking with dynamic problem-solving and thought revision
1550
- - ✅ Deep analytical reasoning with assumption tracking and alternative perspectives
1551
- - ✅ Problem solving with hypothesis testing and constraint handling
1552
- - ✅ Meta-cognitive reflection and analysis improvement
1553
- - ✅ Multiple thinking styles (analytical, systematic, creative, scientific, etc.)
1554
- - ✅ Context-aware reasoning with domain-specific considerations
1555
- - ✅ Confidence scoring and evidence evaluation
1556
- - ✅ Comprehensive reasoning workflows for complex technical problems
1557
-
1558
- ### Remaining Development Phases
1559
-
1560
- #### Phase 3: Audio Processing - Ears (Q1 2025)
1561
- **Advanced Audio Intelligence**
1552
+ ### Current Status: v2.10.0 - 27 Production-Ready MCP Tools
1553
+
1554
+ **👁️ Eyes (4 tools)** - Visual Analysis & Document Processing
1555
+ - ✅ Image, video, GIF analysis with UI debugging and accessibility auditing
1556
+ - ✅ Image comparison with visual difference detection
1557
+ - ✅ Document processing for 12+ formats (PDF, DOCX, XLSX, PPTX, etc.)
1558
+ - ✅ Document summarization and content extraction
1559
+
1560
+ **✋ Hands (16 tools)** - Content Generation, Image Editing & Browser Automation
1561
+ - ✅ Image generation with Gemini Imagen API (5 styles, 5 aspect ratios)
1562
+ - ✅ Video generation with Gemini Veo 3.0 API (duration, FPS, camera controls)
1563
+ - ✅ AI-powered image editing: inpainting, outpainting, style transfer, composition
1564
+ - ✅ Fast local Jimp processing: crop, resize, rotate, mask
1565
+ - ✅ AI background removal with 3 quality models
1566
+ - ✅ Browser automation: full page, viewport, and element screenshots with Playwright
1567
+
1568
+ **🗣️ Mouth (4 tools)** - Speech Generation
1569
+ - ✅ Text-to-speech with 30+ voices and 24 languages
1570
+ - ✅ Long-form narration with chapter breaks
1571
+ - ✅ Code explanation with technical analysis
1572
+ - ✅ Voice testing and customization
1573
+
1574
+ **🧠 Brain (3 tools)** - Advanced Reasoning
1575
+ - ✅ Native sequential thinking (fast, no API calls)
1576
+ - ✅ Pattern-based analysis (problem solving, root cause, SWOT, etc.)
1577
+ - ✅ AI-powered reflection for complex analysis
1578
+
1579
+ ### Future Development
1580
+
1581
+ #### Phase 3: Audio Processing - Ears (Planned Q1 2025)
1582
+ Only remaining capability to complete the human sensory suite:
1562
1583
  - Speech-to-text transcription with speaker identification
1563
- - Audio content analysis (music, speech, noise classification)
1564
- - Audio quality assessment and debugging capabilities
1584
+ - Audio content analysis and classification
1585
+ - Audio quality assessment and debugging
1565
1586
  - Support for 20+ audio formats (WAV, MP3, AAC, OGG, FLAC)
1566
- - Real-time audio processing capabilities
1567
-
1568
- #### Phase 4: Speech Generation - Mouth ✅ COMPLETE
1569
- **AI Voice Capabilities** - Production Ready (v1.3.0)
1570
- - ✅ High-quality text-to-speech with 30+ voice options using Gemini Speech API
1571
- - ✅ Code explanation and technical content narration
1572
- - ✅ Multi-language speech generation (24 languages supported)
1573
- - ✅ Long-form content narration with chapter breaks and natural pacing
1574
- - ✅ Professional-quality audio export in WAV format
1575
- - ✅ Voice customization with style prompts and voice comparison
1576
-
1577
- #### Phase 5: Content Generation - Hands ✅ COMPLETE
1578
- **Creative Content Creation** - Production Ready (v2.0.0)
1579
- - ✅ Image generation from text descriptions using Imagen API
1580
- - ✅ Video generation from text prompts using Veo 3.0 API
1581
- - ✅ Image-to-video generation pipeline combining Imagen + Veo 3.0
1582
- - Multiple artistic styles for images and videos
1583
- - Flexible aspect ratios: 1:1, 16:9, 9:16, 4:3, 3:4
1584
- - Video duration controls (4s, 8s, 12s) with FPS options (1-60 fps)
1585
- - Camera movement controls: static, pan, zoom, dolly movements
1586
- - Advanced prompt engineering with negative prompts
1587
- - Comprehensive error handling and validation with retry logic
1588
- - Future: Advanced image editing (inpainting, style transfer, enhancement)
1589
- - Future: Animation creation with motion graphics
1590
-
1591
- #### Phase 6: Brain - Advanced Reasoning ✅ COMPLETE
1592
- **Advanced Cognitive Intelligence** - Production Ready (v2.2.0)
1593
- - ✅ Sequential thinking with dynamic problem-solving and thought revision
1594
- - ✅ Deep analytical reasoning with assumption tracking and alternative perspectives
1595
- - ✅ Problem solving with hypothesis testing and constraint handling
1596
- - ✅ Meta-cognitive reflection and analysis improvement
1597
- - ✅ Multiple thinking styles (analytical, systematic, creative, scientific, critical, strategic, intuitive, collaborative)
1598
- - ✅ Context-aware reasoning with domain-specific considerations
1599
- - ✅ Confidence scoring and evidence evaluation
1600
- - ✅ Comprehensive reasoning workflows for complex technical problems
1601
-
1602
- ### Target Architecture (Current v2.2.0 - Almost Complete)
1603
-
1604
- The evolution from single-capability visual analysis to comprehensive human-like sensory and cognitive intelligence (5 of 6 phases complete):
1605
-
1606
- ```
1607
- ┌─────────────────┐ ┌──────────────────────┐ ┌─────────────────────────┐
1608
- │ AI Agent │◄──►│ Human MCP │◄──►│ Google AI Services │
1609
- │ (MCP Client) │ │ Server │ │ • Gemini Vision API │
1610
- └─────────────────┘ │ │ │ • Gemini Audio API │
1611
- │ 👁️ Eyes (Vision) │ │ • Gemini Speech API │
1612
- │ • Images/Video │ │ • Imagen API (Images) │
1613
- │ • Documents │ │ • Veo3 API (Video) │
1614
- │ │ └─────────────────────────┘
1615
- │ 👂 Ears (Audio) │
1616
- │ • Speech-to-Text │
1617
- │ • Audio Analysis │
1618
- │ │
1619
- │ 👄 Mouth (Speech) │
1620
- │ • Text-to-Speech │
1621
- │ • Narration │
1622
- │ │
1623
- │ ✋ Hands (Creation) │
1624
- │ • Image Generation ✅│
1625
- │ • Video Generation ✅│
1626
- │ │
1627
- │ 🧠 Brain (Reasoning)│
1628
- │ • Sequential Think ✅│
1629
- │ • Hypothesis Test ✅│
1630
- │ • Reflection ✅│
1631
- └──────────────────────┘
1632
- ```
1633
-
1634
- ### Key Benefits by 2025
1587
+
1588
+ **Note:** Phases 1, 2, 4, 5, and 6 are complete, providing 27 production-ready tools.
1589
+
1590
+ ### System Architecture (v2.10.0)
1591
+
1592
+ Complete human-like capabilities through 27 MCP tools:
1593
+
1594
+ ```
+ ┌─────────────────┐    ┌──────────────────────────┐    ┌─────────────────────────┐
+ │   AI Agent      │◄──►│    Human MCP Server      │◄──►│  Google AI Services     │
+ │  (MCP Client)   │    │         v2.10.0          │    │  • Gemini 2.5 Flash     │
+ └─────────────────┘    │                          │    │  • Gemini Imagen API    │
+                        │  👁️ Eyes (4 tools) ✅    │    │  • Gemini Veo 3.0 API   │
+                        │  • Visual Analysis       │    │  • Gemini Speech API    │
+                        │  • Document Processing   │    └─────────────────────────┘
+                        │                          │
+                        │  ✋ Hands (16 tools) ✅  │    ┌─────────────────────────┐
+                        │  • Image Generation      │    │  Processing Libraries   │
+                        │  • Video Generation      │    │  • Playwright (browser) │
+                        │  • AI Image Editing      │    │  • Jimp (image proc)    │
+                        │  • Jimp Processing       │    │  • rmbg (bg removal)    │
+                        │  • Background Removal    │    │  • ffmpeg (video)       │
+                        │  • Browser Automation    │    │  • Sharp (GIF)          │
+                        │                          │    └─────────────────────────┘
+                        │  🗣️ Mouth (4 tools) ✅   │
+                        │  • Text-to-Speech        │
+                        │  • Narration             │
+                        │  • Code Explanation      │
+                        │                          │
+                        │  🧠 Brain (3 tools) ✅   │
+                        │  • Sequential Thinking   │
+                        │  • Pattern Analysis      │
+                        │  • AI Reflection         │
+                        │                          │
+                        │  👂 Ears (Planned 2025)  │
+                        └──────────────────────────┘
+ ```
1624
+
1625
+ ### Key Benefits
1635
1626
 
1636
1627
  **For Developers:**
1637
- - Complete multimodal debugging and analysis workflows
1638
- - Automated accessibility auditing and compliance checking
1639
- - Visual regression testing and quality assurance
1640
- - Document analysis for technical specifications
1641
- - Audio processing for voice interfaces and content
1642
- - Advanced reasoning and hypothesis-driven problem solving
1628
+ - Visual debugging with UI bug detection and accessibility auditing
1629
+ - Automated web screenshots for testing and documentation
1630
+ - Document processing for technical specifications and reports
1631
+ - AI-powered image and video generation for prototyping
1632
+ - Advanced image editing without complex tools
1633
+ - Speech generation for documentation and code explanations
1634
+ - Sophisticated problem-solving with sequential reasoning
1643
1635
 
1644
1636
  **For AI Agents:**
1645
- - Human-like understanding of visual, audio, and document content
1646
- - Ability to generate explanatory content in multiple formats
1647
- - Sophisticated analysis capabilities beyond text processing
1648
- - Enhanced debugging and problem-solving workflows
1649
- - Creative content generation and editing capabilities
1650
- - Advanced cognitive processing with sequential thinking and reflection
1651
-
1652
- ### Success Metrics & Timeline
1653
-
1654
- - **Phase 2 (Document Understanding)**: ✅ Completed September 2025
1655
- - **Phase 3 (Audio Processing)**: January - March 2025
1656
- - **Phase 4 (Speech Generation)**: ✅ Completed September 2025
1657
- - **Phase 5 (Content Generation)**: ✅ Completed September 2025
1658
- - **Phase 6 (Brain/Reasoning)**: ✅ Completed September 2025
1659
-
1660
- **Target Goals:**
1661
- - Support 50+ file formats across all modalities
1662
- - 99%+ success rate with optimized processing times (images <30s, videos <5min)
1663
- - ✅ Advanced reasoning with 95%+ logical consistency (ACHIEVED)
1664
- - 1000+ MCP client integrations and 100K+ monthly API calls
1665
- - ✅ Comprehensive documentation with real-world examples (ACHIEVED)
1666
- - ✅ Professional-grade content generation and reasoning capabilities (ACHIEVED)
1637
+ - Human-like multimodal understanding (vision, speech, documents)
1638
+ - Automated web interaction and screenshot capture
1639
+ - Creative content generation (images, videos, speech)
1640
+ - Advanced image editing capabilities (inpainting, style transfer, etc.)
1641
+ - Fast local image processing (crop, resize, rotate, mask)
1642
+ - Complex reasoning with thought revision and reflection
1643
+ - Pattern-based analysis for common problems
1644
+
1645
+ ### Current Achievements (v2.10.0)
1646
+
1647
+ **Completed Phases:**
1648
+ - ✅ Phase 1: Eyes - Visual Analysis (4 tools)
1649
+ - ✅ Phase 2: Document Understanding (integrated into Eyes)
1650
+ - ✅ Phase 4: Mouth - Speech Generation (4 tools)
1651
+ - ✅ Phase 5: Hands - Content Generation, Image Editing & Browser Automation (16 tools)
1652
+ - ✅ Phase 6: Brain - Advanced Reasoning (3 tools)
1653
+
1654
+ **Remaining:**
1655
+ - Phase 3: Ears - Audio Processing (planned Q1 2025)
1656
+
1657
+ **Goals Achieved:**
1658
+ - ✅ 27 production-ready MCP tools
1659
+ - ✅ Support for 30+ file formats (images, videos, documents, audio)
1660
+ - ✅ Browser automation for automated web screenshots
1661
+ - ✅ Sub-30 second response times for most operations
1662
+ - ✅ Professional-grade content generation (images, videos, speech)
1663
+ - ✅ Advanced reasoning with native + AI-powered tools
1664
+ - ✅ Comprehensive documentation and examples
1667
1665
 
1668
1666
  ### Getting Involved
1669
1667
 
@@ -1698,11 +1696,11 @@ Human MCP is built for the developer community. Whether you're integrating with
1698
1696
  - **Durations**: 4s, 8s, 12s video lengths
1699
1697
  - **Quality**: Professional-grade output with customizable FPS (1-60)
1700
1698
 
1701
- **Reasoning Capabilities (v2.2.0)**:
1702
- - **Thinking Styles**: Analytical, systematic, creative, scientific, critical, strategic, intuitive, collaborative
1703
- - **Problem Types**: Technical debugging, architecture decisions, hypothesis testing, complex analysis
1704
- - **Output Formats**: Structured reasoning chains, hypothesis validation, reflection analysis, confidence scoring
1705
- - **Complexity**: Multi-step analysis with branching logic, thought revision, and meta-cognitive reflection
1699
+ **Reasoning Capabilities (Brain Tools)**:
1700
+ - **Native Sequential Thinking**: Fast, API-free thought processes with revision support
1701
+ - **Pattern Analysis**: Quick problem-solving using proven frameworks (root cause, SWOT, pros/cons, etc.)
1702
+ - **AI Reflection**: Complex meta-cognitive analysis for improving reasoning quality
1703
+ - **Output Formats**: Structured thought chains, pattern-based solutions, improvement recommendations
1706
1704
 
1707
1705
  ## Contributing
1708
1706