@goonnguyen/human-mcp 2.1.0 → 2.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (3)
  1. package/README.md +296 -45
  2. package/dist/index.js +2383 -0
  3. package/package.json +1 -1
package/README.md CHANGED
@@ -4,24 +4,33 @@
4
4
 
5
5
  ![Human MCP](human-mcp.png)
6
6
 
7
- Human MCP is a Model Context Protocol server that provides AI coding agents with human-like visual capabilities for debugging and understanding visual content like screenshots, recordings, and UI elements.
7
+ Human MCP v2.0.0 is a comprehensive Model Context Protocol server that provides AI coding agents with human-like capabilities including visual analysis, document processing, speech generation, and content creation for debugging, understanding, and enhancing multimodal content.
8
8
 
9
9
  ## Features
10
10
 
11
- 🎯 **Visual Analysis**
11
+ 🎯 **Visual Analysis (Eyes) - ✅ Complete**
12
12
  - Analyze screenshots for UI bugs and layout issues
13
- - Process screen recordings to understand error sequences
13
+ - Process screen recordings to understand error sequences
14
14
  - Extract insights from GIFs and animations
15
15
  - Compare visual changes between versions
16
16
 
17
+ 📄 **Document Processing (Eyes Extended) - ✅ Complete v2.0.0**
18
+ - Comprehensive document analysis for PDF, DOCX, XLSX, PPTX, TXT, MD, RTF, ODT, CSV, JSON, XML, HTML
19
+ - Structured data extraction using custom JSON schemas
20
+ - Document summarization with multiple types (brief, detailed, executive, technical)
21
+ - Text extraction with formatting preservation
22
+ - Table and image extraction from documents
23
+ - Auto-format detection and processing
24
+
17
25
  🔍 **Specialized Analysis Types**
18
26
  - **UI Debug**: Layout issues, rendering problems, visual bugs
19
27
  - **Error Detection**: Visible errors, broken functionality, system failures
20
28
  - **Accessibility**: Color contrast, WCAG compliance, readability
21
29
  - **Performance**: Loading states, visual performance indicators
22
30
  - **Layout**: Responsive design, positioning, visual hierarchy
31
+ - **Document Analysis**: Content extraction, data mining, document intelligence
23
32
 
24
- 🎨 **Content Generation**
33
+ 🎨 **Content Generation (Hands) - ✅ Complete v2.0.0**
25
34
  - Generate high-quality images from text descriptions using Imagen API
26
35
  - Create professional videos from text prompts using Veo 3.0 API
27
36
  - Image-to-video generation combining Imagen and Veo 3.0
@@ -31,7 +40,7 @@ Human MCP is a Model Context Protocol server that provides AI coding agents with
31
40
  - Camera movement controls: static, pan, zoom, dolly movements
32
41
  - Advanced prompt engineering and negative prompts
33
42
 
34
- 🗣️ **Speech Generation**
43
+ 🗣️ **Speech Generation (Mouth) - ✅ Complete v1.3.0**
35
44
  - Convert text to natural-sounding speech with 30+ voice options
36
45
  - Long-form content narration with chapter breaks
37
46
  - Technical code explanation with spoken analysis
@@ -39,6 +48,15 @@ Human MCP is a Model Context Protocol server that provides AI coding agents with
39
48
  - Multi-language support (24 languages)
40
49
  - Professional audio export in WAV format
41
50
 
51
+ 🧠 **Advanced Reasoning (Brain) - 🔄 Future Phase Q2 2025**
52
+ Ref: https://github.com/modelcontextprotocol/servers/blob/main/src/sequentialthinking/index.ts
53
+ - Sequential thinking with dynamic problem-solving
54
+ - Multi-step analysis with hypothesis generation and testing
55
+ - Thought revision and reflection capabilities
56
+ - Branching logic for non-linear problem exploration
57
+ - Meta-cognitive analysis and process optimization
58
+ - Advanced reasoning patterns for complex technical problems
59
+
42
60
  🤖 **AI-Powered**
43
61
  - Uses Google Gemini 2.5 Flash for fast, accurate analysis
44
62
  - Advanced Imagen API for high-quality image generation
@@ -955,6 +973,106 @@ Compare two images to identify visual differences.
955
973
  }
956
974
  ```
957
975
 
976
+ ### eyes_read_document
977
+
978
+ Comprehensive document analysis and content extraction.
979
+
980
+ ```json
981
+ {
982
+ "source": "/path/to/document.pdf",
983
+ "format": "auto",
984
+ "options": {
985
+ "extract_text": true,
986
+ "extract_tables": true,
987
+ "detail_level": "detailed"
988
+ }
989
+ }
990
+ ```
991
+
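A minimal sketch of how a client might assemble the `eyes_read_document` arguments shown above, assuming format detection can fall back on the file extension. `detectFormat` and `buildReadDocumentArgs` are hypothetical helpers and are not part of the package. The format list is copied from the README's supported document formats, and the real server may detect formats differently.

```typescript
// Hypothetical helpers for illustration; not part of @goonnguyen/human-mcp.
// The format list mirrors the document formats named in the README.
const DOCUMENT_FORMATS = [
  "pdf", "docx", "xlsx", "pptx", "txt", "md",
  "rtf", "odt", "csv", "json", "xml", "html",
] as const;

type DocumentFormat = (typeof DOCUMENT_FORMATS)[number];

interface ReadDocumentArgs {
  source: string;
  format: DocumentFormat | "auto";
  options: {
    extract_text: boolean;
    extract_tables: boolean;
    detail_level: string;
  };
}

// Guess the format from the file extension; return undefined when it is unknown.
function detectFormat(source: string): DocumentFormat | undefined {
  // Ignore any query string or fragment on URL sources.
  const path = source.split(/[?#]/)[0];
  const match = /\.([a-z0-9]+)$/i.exec(path);
  if (!match) return undefined;
  const ext = match[1].toLowerCase();
  return (DOCUMENT_FORMATS as readonly string[]).indexOf(ext) !== -1
    ? (ext as DocumentFormat)
    : undefined;
}

// Build the argument object, leaving "auto" in place when the extension is no help.
function buildReadDocumentArgs(source: string): ReadDocumentArgs {
  return {
    source,
    format: detectFormat(source) ?? "auto",
    options: { extract_text: true, extract_tables: true, detail_level: "detailed" },
  };
}
```

Keeping `"auto"` as the fallback leaves detection to the server whenever the extension is missing or unrecognized.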
992
+ ### eyes_extract_data
993
+
994
+ Extract structured data from documents using custom schemas.
995
+
996
+ ```json
997
+ {
998
+ "source": "/path/to/invoice.pdf",
999
+ "format": "auto",
1000
+ "schema": {
1001
+ "invoice_number": "string",
1002
+ "amount": "number",
1003
+ "date": "string"
1004
+ }
1005
+ }
1006
+ ```
1007
+
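The schema in the `eyes_extract_data` example above is a flat map from field name to type name. The sketch below shows one way a caller could check an extracted record against such a schema. It assumes only the primitive type names `string`, `number`, and `boolean`. `checkAgainstSchema` is a hypothetical helper, and the shape of the tool's actual response is not shown in this diff.

```typescript
// Hypothetical types: the README only shows a flat schema of field name -> type name.
type FieldType = "string" | "number" | "boolean";
type FlatSchema = Record<string, FieldType>;

// Return a list of problems; an empty list means the record matches the schema.
function checkAgainstSchema(
  schema: FlatSchema,
  record: Record<string, unknown>,
): string[] {
  const problems: string[] = [];
  for (const field of Object.keys(schema)) {
    const expected = schema[field];
    if (!(field in record)) {
      problems.push(`missing field: ${field}`);
    } else if (typeof record[field] !== expected) {
      problems.push(
        `field ${field}: expected ${expected}, got ${typeof record[field]}`,
      );
    }
  }
  return problems;
}

// Same schema as in the README example.
const invoiceSchema: FlatSchema = {
  invoice_number: "string",
  amount: "number",
  date: "string",
};
```

A check like this catches the common case where a numeric field comes back as a string.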
1008
+ ### eyes_summarize
1009
+
1010
+ Generate summaries and key insights from documents.
1011
+
1012
+ ```json
1013
+ {
1014
+ "source": "/path/to/report.docx",
1015
+ "format": "auto",
1016
+ "options": {
1017
+ "summary_type": "executive",
1018
+ "include_key_points": true,
1019
+ "max_length": 500
1020
+ }
1021
+ }
1022
+ ```
1023
+
1024
+ ### mouth_speak
1025
+
1026
+ Convert text to natural-sounding speech.
1027
+
1028
+ ```json
1029
+ {
1030
+ "text": "Welcome to our application. Let me guide you through the interface.",
1031
+ "voice": "Zephyr",
1032
+ "language": "en-US",
1033
+ "style_prompt": "Speak in a friendly, welcoming tone"
1034
+ }
1035
+ ```
1036
+
1037
+ ### mouth_narrate
1038
+
1039
+ Generate narration for long-form content with chapter breaks.
1040
+
1041
+ ```json
1042
+ {
1043
+ "content": "Chapter 1: Introduction to React...",
1044
+ "voice": "Sage",
1045
+ "narration_style": "educational",
1046
+ "chapter_breaks": true
1047
+ }
1048
+ ```
1049
+
1050
+ ### mouth_explain
1051
+
1052
+ Generate spoken explanations of code with technical analysis.
1053
+
1054
+ ```json
1055
+ {
1056
+ "code": "function factorial(n) { return n <= 1 ? 1 : n * factorial(n-1); }",
1057
+ "programming_language": "javascript",
1058
+ "voice": "Apollo",
1059
+ "explanation_level": "intermediate"
1060
+ }
1061
+ ```
1062
+
1063
+ ### mouth_customize
1064
+
1065
+ Test different voices and styles for optimal content delivery.
1066
+
1067
+ ```json
1068
+ {
1069
+ "text": "Hello, this is a voice test sample.",
1070
+ "voice": "Charon",
1071
+ "style_variations": ["professional", "casual", "energetic"],
1072
+ "compare_voices": ["Puck", "Sage", "Apollo"]
1073
+ }
1074
+ ```
1075
+
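To show how the `mouth_customize` arguments above might expand into individual test samples, here is a rough sketch. It assumes that every voice (the primary `voice` plus each entry in `compare_voices`) is paired with every entry in `style_variations`. The README does not state this, so the pairing rule and the `listSamples` helper are assumptions.

```typescript
interface CustomizeArgs {
  text: string;
  voice: string;
  style_variations: string[];
  compare_voices: string[];
}

interface Sample {
  voice: string;
  style: string;
}

// Assumption: each voice is tried with each style (a full cross product).
function listSamples(args: CustomizeArgs): Sample[] {
  // Put the primary voice first and avoid listing it twice.
  const voices = [args.voice].concat(
    args.compare_voices.filter((v) => v !== args.voice),
  );
  const samples: Sample[] = [];
  for (const voice of voices) {
    for (const style of args.style_variations) {
      samples.push({ voice, style });
    }
  }
  return samples;
}
```

Under this assumption, the number of generated samples grows as voices × styles, which is worth keeping in mind for API usage.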
958
1076
  ### gemini_gen_image
959
1077
 
960
1078
  Generate high-quality images from text descriptions using Gemini Imagen API.
@@ -1051,6 +1169,58 @@ Test different voices and styles to find the best fit for your content.
1051
1169
  }
1052
1170
  ```
1053
1171
 
1172
+ ### brain_think
1173
+
1174
+ Advanced sequential thinking with dynamic problem-solving.
1175
+
1176
+ ```json
1177
+ {
1178
+ "problem": "Complex technical issue requiring multi-step analysis",
1179
+ "initial_thoughts": 5,
1180
+ "allow_revision": true,
1181
+ "enable_branching": true,
1182
+ "thinking_style": "analytical"
1183
+ }
1184
+ ```
1185
+
1186
+ ### brain_analyze
1187
+
1188
+ Deep analytical reasoning with branching support.
1189
+
1190
+ ```json
1191
+ {
1192
+ "subject": "System architecture design decisions",
1193
+ "analysis_depth": "detailed",
1194
+ "consider_alternatives": true,
1195
+ "track_assumptions": true
1196
+ }
1197
+ ```
1198
+
1199
+ ### brain_solve
1200
+
1201
+ Multi-step problem solving with hypothesis testing.
1202
+
1203
+ ```json
1204
+ {
1205
+ "problem_statement": "Performance bottleneck in distributed system",
1206
+ "solution_approach": "systematic",
1207
+ "verify_hypotheses": true,
1208
+ "max_iterations": 10
1209
+ }
1210
+ ```
1211
+
1212
+ ### brain_reflect
1213
+
1214
+ Thought revision and process optimization.
1215
+
1216
+ ```json
1217
+ {
1218
+ "previous_analysis": "reference_to_prior_thinking",
1219
+ "reflection_focus": ["assumptions", "logic_gaps", "alternative_approaches"],
1220
+ "optimize_process": true
1221
+ }
1222
+ ```
1223
+
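The `brain_*` tools above are marked as a future phase, so no implementation is available to quote. As a purely illustrative sketch of what "thought revision" and "branching" could mean as data, the hypothetical `ThoughtLog` below records thoughts, lets a later thought supersede an earlier one, and keeps branches separate from the main line. None of these names come from the package.

```typescript
// Hypothetical data model; not taken from the package or its planned API.
interface Thought {
  id: number;
  text: string;
  revises?: number; // id of an earlier thought that this one supersedes
  branchFrom?: number; // id of the thought where this branch starts
  branchId?: string; // label for the branch; undefined means the main line
}

type ThoughtExtras = Pick<Thought, "revises" | "branchFrom" | "branchId">;

class ThoughtLog {
  private thoughts: Thought[] = [];

  // Append a thought; ids are assigned sequentially starting at 1.
  add(text: string, extras: ThoughtExtras = {}): Thought {
    const thought: Thought = { id: this.thoughts.length + 1, text, ...extras };
    this.thoughts.push(thought);
    return thought;
  }

  // Main line of reasoning: skip branches and drop thoughts that were revised.
  current(): Thought[] {
    const main = this.thoughts.filter((t) => t.branchId === undefined);
    const revisedIds = main
      .filter((t) => t.revises !== undefined)
      .map((t) => t.revises as number);
    return main.filter((t) => revisedIds.indexOf(t.id) === -1);
  }

  // All thoughts recorded under one branch label.
  branch(branchId: string): Thought[] {
    return this.thoughts.filter((t) => t.branchId === branchId);
  }
}
```

Storing revisions as new entries instead of overwriting old ones keeps the full history available for a later reflection step.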
1054
1224
  ## Example Use Cases
1055
1225
 
1056
1226
  ### Debugging UI Issues
@@ -1164,6 +1334,50 @@ Test different voices and styles to find the best fit for your content.
1164
1334
  }
1165
1335
  ```
1166
1336
 
1337
+ ### Advanced Problem Solving
1338
+ ```bash
1339
+ # Analyze complex technical issues with multi-step reasoning
1340
+ {
1341
+ "problem": "Database performance degradation in production environment",
1342
+ "initial_thoughts": 8,
1343
+ "allow_revision": true,
1344
+ "enable_branching": true,
1345
+ "thinking_style": "systematic"
1346
+ }
1347
+ ```
1348
+
1349
+ ### Architecture Decision Analysis
1350
+ ```bash
1351
+ # Deep analysis of system design decisions
1352
+ {
1353
+ "subject": "Microservices vs monolithic architecture for e-commerce platform",
1354
+ "analysis_depth": "detailed",
1355
+ "consider_alternatives": true,
1356
+ "track_assumptions": true
1357
+ }
1358
+ ```
1359
+
1360
+ ### Hypothesis-Driven Debugging
1361
+ ```bash
1362
+ # Systematic problem solving with hypothesis testing
1363
+ {
1364
+ "problem_statement": "API response time increased by 300% after deployment",
1365
+ "solution_approach": "scientific",
1366
+ "verify_hypotheses": true,
1367
+ "max_iterations": 15
1368
+ }
1369
+ ```
1370
+
1371
+ ### Code Review Reasoning
1372
+ ```bash
1373
+ # Reflect on code analysis and optimization approaches
1374
+ {
1375
+ "previous_analysis": "Initial code review findings",
1376
+ "reflection_focus": ["performance_assumptions", "security_gaps", "maintainability"],
1377
+ "optimize_process": true
1378
+ }
1379
+ ```
1380
+
1167
1381
  ## Prompts
1168
1382
 
1169
1383
  Human MCP includes pre-built prompts for common debugging scenarios:
@@ -1265,6 +1479,13 @@ Human MCP Server
1265
1479
  │ ├── Long-form Narration
1266
1480
  │ ├── Code Explanation
1267
1481
  │ └── Voice Customization
1482
+ ├── Brain Tool (Advanced Reasoning) [Future]
1483
+ │ ├── Sequential Thinking
1484
+ │ ├── Hypothesis Testing
1485
+ │ ├── Thought Revision
1486
+ │ ├── Branching Logic
1487
+ │ ├── Meta-cognitive Analysis
1488
+ │ └── Problem-solving Workflows
1268
1489
  ├── Debugging Prompts
1269
1490
  └── Documentation Resources
1270
1491
  ```
@@ -1277,45 +1498,40 @@ For detailed architecture information and future development plans, see:
1277
1498
 
1278
1499
  **Mission**: Transform AI coding agents with complete human-like sensory capabilities, bridging the gap between artificial and human intelligence through sophisticated multimodal analysis.
1279
1500
 
1280
- ### Current Status: Phase 1 Complete ✅ | Phase 4 Complete ✅ | Phase 5 Complete ✅
1501
+ ### Current Status: Phase 1-2 Complete ✅ | Phase 4-5 Complete ✅ | v2.0.0
1281
1502
 
1282
- **Eyes (Visual Analysis)** - Production Ready (v1.2.1)
1283
- - Advanced image, video, and GIF analysis capabilities
1284
- - UI debugging, error detection, accessibility auditing
1285
- - Image comparison with pixel, structural, and semantic analysis
1286
- - Processing 20+ visual formats with 98.5% success rate
1287
- - Sub-30 second response times for detailed analysis
1288
-
1289
- **Hands (Content Generation)** - Production Ready (v1.4.0)
1290
- - High-quality image generation using Gemini Imagen API
1291
- - Professional video generation using Gemini Veo 3.0 API
1292
- - Image-to-video generation pipeline combining Imagen + Veo 3.0
1293
- - Multiple artistic styles and aspect ratios for both images and videos
1294
- - Video duration controls (4s, 8s, 12s) with FPS options (1-60 fps)
1295
- - Camera movement controls: static, pan, zoom, dolly movements
1296
- - Advanced prompt engineering with negative prompts
1297
- - Comprehensive validation and error handling with retry logic
1298
- - Fast generation times with reliable output
1503
+ **Eyes (Visual Analysis + Document Processing)** - Production Ready (v2.0.0)
1504
+ - Advanced image, video, and GIF analysis capabilities
1505
+ - UI debugging, error detection, accessibility auditing
1506
+ - Image comparison with pixel, structural, and semantic analysis
1507
+ - Document processing for PDF, DOCX, XLSX, PPTX, TXT, MD, RTF, ODT, CSV, JSON, XML, HTML
1508
+ - Structured data extraction using custom JSON schemas
1509
+ - ✅ Document summarization with multiple types (brief, detailed, executive, technical)
1510
+ - Processing 20+ visual formats + 12+ document formats with 95%+ success rate
1511
+ - ✅ Sub-30 second response times for images, sub-60 second for documents
1299
1512
 
1300
1513
  **Mouth (Speech Generation)** - Production Ready (v1.3.0)
1301
- - Natural text-to-speech with 30+ voice options
1302
- - Long-form content narration with chapter breaks
1303
- - Technical code explanation with spoken analysis
1304
- - Voice customization and style control
1305
- - Multi-language support (24 languages)
1306
- - Professional audio export in WAV format
1307
-
1308
- ### Upcoming Development Phases
1514
+ - Natural text-to-speech with 30+ voice options
1515
+ - Long-form content narration with chapter breaks
1516
+ - Technical code explanation with spoken analysis
1517
+ - Voice customization and style control
1518
+ - Multi-language support (24 languages)
1519
+ - Professional audio export in WAV format
1520
+
1521
+ **Hands (Content Generation)** - Production Ready (v2.0.0)
1522
+ - ✅ High-quality image generation using Gemini Imagen API
1523
+ - ✅ Professional video generation using Gemini Veo 3.0 API
1524
+ - ✅ Image-to-video generation pipeline combining Imagen + Veo 3.0
1525
+ - ✅ Multiple artistic styles and aspect ratios for both images and videos
1526
+ - ✅ Video duration controls (4s, 8s, 12s) with FPS options (1-60 fps)
1527
+ - ✅ Camera movement controls: static, pan, zoom, dolly movements
1528
+ - ✅ Advanced prompt engineering with negative prompts
1529
+ - ✅ Comprehensive validation and error handling with retry logic
1530
+ - ✅ Fast generation times with reliable output
1309
1531
 
1310
- #### Phase 2: Document Understanding (Q4 2025)
1311
- **Expanding Eyes Capabilities**
1312
- - PDF, Word, Excel, PowerPoint document analysis
1313
- - Text extraction with 95%+ accuracy and formatting preservation
1314
- - Structured data extraction and cross-document comparison
1315
- - Integration with Gemini's Document Understanding API
1316
- - Processing time under 60 seconds for typical documents
1532
+ ### Remaining Development Phases
1317
1533
 
1318
- #### Phase 3: Audio Processing - Ears (Q4 2025)
1534
+ #### Phase 3: Audio Processing - Ears (Q1 2025)
1319
1535
  **Advanced Audio Intelligence**
1320
1536
  - Speech-to-text transcription with speaker identification
1321
1537
  - Audio content analysis (music, speech, noise classification)
@@ -1323,6 +1539,15 @@ For detailed architecture information and future development plans, see:
1323
1539
  - Support for 20+ audio formats (WAV, MP3, AAC, OGG, FLAC)
1324
1540
  - Real-time audio processing capabilities
1325
1541
 
1542
+ #### Phase 6: Brain (Thinking/Reasoning) - Q2 2025
1543
+ **Advanced Cognitive Intelligence**
1544
+ - Sequential thinking with dynamic problem-solving
1545
+ - Multi-step analysis with hypothesis generation and testing
1546
+ - Thought revision and reflection capabilities
1547
+ - Branching logic for non-linear problem exploration
1548
+ - Meta-cognitive analysis and process optimization
1549
+ - Advanced reasoning patterns for complex technical problems
1550
+
1326
1551
  #### Phase 4: Speech Generation - Mouth ✅ COMPLETE
1327
1552
  **AI Voice Capabilities** - Production Ready (v1.3.0)
1328
1553
  - ✅ High-quality text-to-speech with 30+ voice options using Gemini Speech API
@@ -1348,7 +1573,7 @@ For detailed architecture information and future development plans, see:
1348
1573
 
1349
1574
  ### Target Architecture (End 2025)
1350
1575
 
1351
- The evolution from single-capability visual analysis to comprehensive human-like sensory intelligence:
1576
+ The evolution from single-capability visual analysis to comprehensive human-like sensory and cognitive intelligence:
1352
1577
 
1353
1578
  ```
1354
1579
  ┌─────────────────┐ ┌──────────────────────┐ ┌─────────────────────────┐
@@ -1370,6 +1595,11 @@ The evolution from single-capability visual analysis to comprehensive human-like
1370
1595
  │ ✋ Hands (Creation) │
1371
1596
  │ • Image Generation ✅│
1372
1597
  │ • Video Generation ✅│
1598
+ │ │
1599
+ │ 🧠 Brain (Reasoning)│
1600
+ │ • Sequential Think │
1601
+ │ • Hypothesis Test │
1602
+ │ • Reflection │
1373
1603
  └──────────────────────┘
1374
1604
  ```
1375
1605
 
@@ -1381,6 +1611,7 @@ The evolution from single-capability visual analysis to comprehensive human-like
1381
1611
  - Visual regression testing and quality assurance
1382
1612
  - Document analysis for technical specifications
1383
1613
  - Audio processing for voice interfaces and content
1614
+ - Advanced reasoning and hypothesis-driven problem solving
1384
1615
 
1385
1616
  **For AI Agents:**
1386
1617
  - Human-like understanding of visual, audio, and document content
@@ -1388,20 +1619,23 @@ The evolution from single-capability visual analysis to comprehensive human-like
1388
1619
  - Sophisticated analysis capabilities beyond text processing
1389
1620
  - Enhanced debugging and problem-solving workflows
1390
1621
  - Creative content generation and editing capabilities
1622
+ - Advanced cognitive processing with sequential thinking and reflection
1391
1623
 
1392
1624
  ### Success Metrics & Timeline
1393
1625
 
1394
- - **Phase 2 (Document Understanding)**: January - March 2025
1395
- - **Phase 3 (Audio Processing)**: April - June 2025
1626
+ - **Phase 2 (Document Understanding)**: Completed September 2025
1627
+ - **Phase 3 (Audio Processing)**: January - March 2025
1396
1628
  - **Phase 4 (Speech Generation)**: ✅ Completed September 2025
1397
1629
  - **Phase 5 (Content Generation)**: ✅ Completed September 2025
1630
+ - **Phase 6 (Brain/Reasoning)**: April - June 2025
1398
1631
 
1399
1632
  **Target Goals:**
1400
1633
  - Support 50+ file formats across all modalities
1401
1634
  - 99%+ success rate with optimized processing times (images <30s, videos <5min)
1635
+ - Advanced reasoning with 95%+ logical consistency
1402
1636
  - 1000+ MCP client integrations and 100K+ monthly API calls
1403
1637
  - Comprehensive documentation with real-world examples
1404
- - Professional-grade content generation capabilities
1638
+ - Professional-grade content generation and reasoning capabilities
1405
1639
 
1406
1640
  ### Getting Involved
1407
1641
 
@@ -1413,18 +1647,35 @@ Human MCP is built for the developer community. Whether you're integrating with
1413
1647
 
1414
1648
  ## Supported Formats
1415
1649
 
1416
- **Analysis Formats**:
1650
+ **Visual Analysis Formats**:
1417
1651
  - **Images**: PNG, JPEG, WebP, GIF (static)
1418
1652
  - **Videos**: MP4, WebM, MOV, AVI
1419
1653
  - **GIFs**: Animated GIF with frame extraction
1420
1654
  - **Sources**: File paths, URLs, base64 data URLs
1421
1655
 
1422
- **Generation Formats**:
1656
+ **Document Processing Formats (v2.0.0)**:
1657
+ - **Documents**: PDF, DOCX, XLSX, PPTX, TXT, MD, RTF, ODT
1658
+ - **Data**: CSV, JSON, XML, HTML
1659
+ - **Features**: Text extraction, table processing, structured data extraction
1660
+ - **Auto-detection**: Automatic format detection from content and extensions
1661
+
1662
+ **Speech Generation Formats**:
1663
+ - **Output**: WAV (Base64 encoded), 24kHz mono
1664
+ - **Languages**: 24+ languages supported
1665
+ - **Voices**: 30+ voice options with style control
1666
+
1667
+ **Content Generation Formats**:
1423
1668
  - **Images**: PNG, JPEG (Base64 output)
1424
1669
  - **Videos**: MP4 (Base64 output)
1425
1670
  - **Durations**: 4s, 8s, 12s video lengths
1426
1671
  - **Quality**: Professional-grade output with customizable FPS (1-60)
1427
1672
 
1673
+ **Reasoning Capabilities (Future)**:
1674
+ - **Thinking Styles**: Analytical, systematic, creative, scientific reasoning approaches
1675
+ - **Problem Types**: Technical debugging, architecture decisions, hypothesis testing
1676
+ - **Output Formats**: Structured reasoning chains, hypothesis validation, reflection analysis
1677
+ - **Complexity**: Multi-step analysis with branching logic and thought revision
1678
+
1428
1679
  ## Contributing
1429
1680
 
1430
1681
  1. Fork the repository