matchina 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- matchina-0.1.0/CHANGELOG.md +91 -0
- matchina-0.1.0/CONTRIBUTING.md +166 -0
- matchina-0.1.0/LICENSE +21 -0
- matchina-0.1.0/MANIFEST.in +19 -0
- matchina-0.1.0/PKG-INFO +206 -0
- matchina-0.1.0/README.md +173 -0
- matchina-0.1.0/examples/basic_usage.py +79 -0
- matchina-0.1.0/matchina/__init__.py +87 -0
- matchina-0.1.0/matchina/__version__.py +1 -0
- matchina-0.1.0/matchina/core/__init__.py +4 -0
- matchina-0.1.0/matchina/core/matcher.py +112 -0
- matchina-0.1.0/matchina/core/strategies.py +315 -0
- matchina-0.1.0/matchina/data/__init__.py +3 -0
- matchina-0.1.0/matchina/data/entities.db +0 -0
- matchina-0.1.0/matchina/data/storage.py +132 -0
- matchina-0.1.0/matchina/models/__init__.py +3 -0
- matchina-0.1.0/matchina/models/entity.py +55 -0
- matchina-0.1.0/matchina/utils/__init__.py +3 -0
- matchina-0.1.0/matchina/utils/normalizer.py +135 -0
- matchina-0.1.0/matchina.egg-info/PKG-INFO +206 -0
- matchina-0.1.0/matchina.egg-info/SOURCES.txt +25 -0
- matchina-0.1.0/matchina.egg-info/dependency_links.txt +1 -0
- matchina-0.1.0/matchina.egg-info/requires.txt +8 -0
- matchina-0.1.0/matchina.egg-info/top_level.txt +1 -0
- matchina-0.1.0/pyproject.toml +75 -0
- matchina-0.1.0/setup.cfg +4 -0
- matchina-0.1.0/tests/test_matcher.py +243 -0
matchina-0.1.0/CHANGELOG.md
ADDED
@@ -0,0 +1,91 @@
# Change Log

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Added
- Initial release preparation
- GitHub Actions CI workflow
- Comprehensive documentation (README, CONTRIBUTING, CHANGELOG)
- PyPI publish script
- Project renamed to Matchina

### Changed
- Enhanced README with API documentation and examples
- Improved test coverage
- Project name updated from cn-entity-resolver to Matchina

### Fixed
- None yet

## [0.1.0] - 2026-03-14

### Added
- Four-tier matching strategy (exact, alias, rule-based, and fuzzy matching)
- Matching support for Chinese and English company names
- Alias and short-name recognition
- Batch matching via `resolve_batch()`
- Fuzzy search via `search()`
- Full type annotations and mypy support
- Data for about 30+ companies (Chinese names, English names, aliases)
- Unit tests covering core functionality
- MIT license

### Technical Stack
- Python 3.8+
- pytest as the test framework
- mypy for type checking
- ruff for linting
- setuptools as the build system

### Data Format
- Company data stored in a SQLite database
- JSON alias mapping table
- Offline data; no network connection required
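
As a rough illustration of the data format listed above, the sketch below pairs a SQLite table with a JSON alias map. The table name `entities`, its columns, the shape of the alias map, and the helper `lookup_by_alias` are all assumptions made for illustration; the actual schema of `matchina/data/entities.db` is not shown in this changelog.

```python
import json
import sqlite3
from typing import Optional, Tuple

# Hypothetical schema: an in-memory table standing in for the bundled database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entities (entity_id TEXT PRIMARY KEY, name_cn TEXT, name_en TEXT)"
)
conn.execute(
    "INSERT INTO entities VALUES (?, ?, ?)",
    ("e1", "华为技术有限公司", "Huawei Technologies Co., Ltd."),
)
conn.commit()

# Hypothetical alias map: alias -> entity_id, as it might be stored in JSON.
alias_map = json.loads('{"华为": "e1", "HUAWEI": "e1"}')


def lookup_by_alias(alias: str) -> Optional[Tuple[str, str]]:
    """Return (name_cn, name_en) for an alias, or None if the alias is unknown."""
    entity_id = alias_map.get(alias)
    if entity_id is None:
        return None
    return conn.execute(
        "SELECT name_cn, name_en FROM entities WHERE entity_id = ?",
        (entity_id,),
    ).fetchone()
```

Because both stores are local files in this sketch, a lookup needs no network access, which matches the offline claim above.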

### API
- `resolve(name, top_k=5)` - match a single name
- `search(query, limit=10)` - fuzzy search
- `resolve_batch(names, top_k=5)` - batch matching
- `EntityMatcher` class interface
- `Entity` data model
- `MatchResult` result object

### Performance
- Optimized batch matching
- Lazily loaded global matcher instance
- Cached match results
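
The last two performance items can be pictured with a short sketch. It assumes a module-level singleton created on first use plus `functools.lru_cache` for results; `DemoMatcher`, `get_matcher`, and `resolve_cached` are hypothetical names, and the package's own caching mechanism is not shown in this changelog.

```python
from functools import lru_cache
from typing import Optional, Tuple


class DemoMatcher:
    """Hypothetical stand-in for the real matcher; counts how often it is built."""

    instances = 0

    def __init__(self) -> None:
        DemoMatcher.instances += 1
        self._names = {"华为": "华为技术有限公司"}

    def resolve(self, name: str) -> Tuple[str, ...]:
        hit = self._names.get(name)
        return (hit,) if hit else ()


_matcher: Optional[DemoMatcher] = None


def get_matcher() -> DemoMatcher:
    """Create the shared matcher on first use, then reuse it."""
    global _matcher
    if _matcher is None:
        _matcher = DemoMatcher()
    return _matcher


@lru_cache(maxsize=1024)
def resolve_cached(name: str) -> Tuple[str, ...]:
    """Cache results per input name; a tuple keeps the cached value immutable."""
    return get_matcher().resolve(name)
```

Importing a module written this way costs nothing until the first call, and repeated lookups of the same name skip the matcher entirely.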

### Documentation
- Quick start guide
- API documentation
- Example code
- Matching strategy overview

---

## Version History

- **0.1.0** (2026-03-14) - Initial release (Matchina)

---

## Upcoming

### Planned Features
- More company data (target: 100+)
- Pinyin matching support
- Industry classification filtering
- Configurable confidence threshold
- Custom data loading
- Performance optimization (caching, indexing)

### Future Considerations
- Async API support
- Incremental data updates
- Relationships between companies
- Industry knowledge graph
matchina-0.1.0/CONTRIBUTING.md
ADDED
@@ -0,0 +1,166 @@
# Contributing Guide

Thank you for your interest in Matchina! Contributions of code, documentation, tests, and data are all welcome.

## How to Contribute

### 1. Report Issues

Report bugs or request features in GitHub Issues:
- Describe the problem clearly
- Provide steps to reproduce
- Include the expected and actual behavior

### 2. Submit Code

#### Fork the Repository

```bash
git clone https://github.com/YOUR_USERNAME/matchina.git
cd matchina
```

#### Set Up a Development Environment

```bash
python3 -m venv .venv
source .venv/bin/activate  # macOS/Linux
pip install -e ".[dev]"
```

#### Create a Branch

```bash
git checkout -b feature/your-feature-name
# or
git checkout -b fix/issue-123
```

### 3. Development Workflow

1. **Write code**: follow the existing code style
2. **Add tests**: make sure new features are covered by tests
3. **Run checks**:
   ```bash
   # Run the tests
   pytest

   # Type checking
   mypy matchina

   # Lint
   ruff check matchina tests

   # Format the code
   ruff format matchina tests
   ```
4. **Commit your changes**:
   ```bash
   git add .
   git commit -m "feat: add new feature"
   ```

### 4. Open a Pull Request

1. Push your branch to GitHub
2. Open a Pull Request against the original repository
3. Describe what changed and why
4. Wait for code review

### 5. Add Company Data

To add new company data:

1. Add the data file to the `data/` directory
2. Update `matchina/data/storage.py` to load the new data
3. Add tests that verify the data is correct
4. Document the data source

## Code Standards

### Naming Conventions

- Variables/functions: `snake_case`
- Classes: `PascalCase`
- Constants: `UPPER_CASE`
- Private methods: `_prefix`

### Documentation Standards

- All public functions and classes must have a docstring
- Use Google-style docstrings
- Include an Example section

### Type Annotations

- All functions must have type annotations
- Use the `typing` module
- Run mypy to make sure the types are correct
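
A function that follows the naming, docstring, and type annotation rules above might look like the sketch below. `strip_company_suffix` and `DEFAULT_SUFFIXES` are hypothetical and written only for this guide; they are not taken from `matchina/utils/normalizer.py`.

```python
from typing import Sequence

# Hypothetical suffix list for illustration; longer suffixes come first.
DEFAULT_SUFFIXES: Sequence[str] = ("有限责任公司", "股份有限公司", "有限公司")


def strip_company_suffix(name: str, suffixes: Sequence[str] = DEFAULT_SUFFIXES) -> str:
    """Remove a trailing legal-form suffix from a company name.

    Args:
        name: Company name to clean.
        suffixes: Suffixes to try in order, longest first.

    Returns:
        The name without the first matching suffix, or the whitespace-stripped
        name if no suffix matches.

    Example:
        >>> strip_company_suffix("华为技术有限公司")
        '华为技术'
    """
    cleaned = name.strip()
    for suffix in suffixes:
        # Keep names that consist of nothing but the suffix unchanged.
        if cleaned.endswith(suffix) and len(cleaned) > len(suffix):
            return cleaned[: -len(suffix)]
    return cleaned
```

The constant uses `UPPER_CASE`, the function uses `snake_case`, and the docstring carries the `Args`, `Returns`, and `Example` sections that Google style expects.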

### Commit Message Convention

Follow [Conventional Commits](https://www.conventionalcommits.org/):

```
feat: new feature
fix: bug fix
docs: documentation update
style: code formatting (no functional change)
refactor: refactoring (neither fixes a bug nor adds a feature)
test: add tests
chore: build/tooling/configuration
```

Examples:
```bash
git commit -m "feat: add batch processing support"
git commit -m "fix: resolve fuzzy match edge case"
git commit -m "docs: update API documentation"
```
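
If you want to check a commit subject locally before pushing, a small script along these lines could help. It is only a sketch: the regular expression and the `is_conventional` helper are assumptions covering the seven types listed above plus an optional scope, not a full implementation of the Conventional Commits specification and not a tool shipped with this repository.

```python
import re

# The seven types listed in this guide.
ALLOWED_TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")

# type, optional "(scope)", then ": " and a non-empty description.
COMMIT_PATTERN = re.compile(
    r"^(?:%s)(?:\([\w\-]+\))?: \S.*$" % "|".join(ALLOWED_TYPES)
)


def is_conventional(subject: str) -> bool:
    """Return True if the commit subject line matches the convention."""
    return COMMIT_PATTERN.match(subject) is not None
```

For example, `is_conventional("feat: add batch processing support")` passes, while a subject with no type prefix does not.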

## Testing Requirements

### Unit Tests

- New features must have unit tests
- Maintain test coverage
- Test edge cases and error handling
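
The requirements above might translate into tests like the following sketch. `normalize_name` is a hypothetical helper defined inline so the example is self-contained; it is not the code in `matchina/utils/normalizer.py`. The tests use plain `assert` so they run without extra imports; under pytest, `pytest.raises(TypeError)` would be the more idiomatic way to write the last one.

```python
def normalize_name(name: str) -> str:
    """Hypothetical helper: collapse runs of whitespace and lowercase the name."""
    if not isinstance(name, str):
        raise TypeError("name must be a string")
    return " ".join(name.split()).lower()


def test_collapses_whitespace_and_case() -> None:
    # Normal case: surrounding and repeated whitespace is removed.
    assert normalize_name("  Huawei   Technologies ") == "huawei technologies"


def test_whitespace_only_input() -> None:
    # Edge case: a whitespace-only name normalizes to an empty string.
    assert normalize_name("   ") == ""


def test_non_string_input_is_rejected() -> None:
    # Error handling: anything other than str should raise TypeError.
    try:
        normalize_name(None)  # type: ignore[arg-type]
    except TypeError:
        return
    raise AssertionError("expected TypeError")
```

Each test covers one of the three bullets: a normal case for the new feature, an edge case, and an error path.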

### Running Tests

```bash
# Run all tests
pytest

# Run a specific test file
pytest tests/test_matcher.py -v

# Show coverage
pytest --cov=matchina
```

## Release Process

Maintainers release a new version as follows:

1. Update `CHANGELOG.md`
2. Update the version number in `matchina/__version__.py`
3. Create a release tag
4. Run the publish script
5. Upload to PyPI

See the release process documentation for details.

## Community Guidelines

- Be respectful and friendly
- Discuss the issue, not the person
- Contributors from all backgrounds are welcome
- Follow open-source community best practices

## Contact

- GitHub Issues: questions and discussion
- Email: contact the maintainers

Thank you for contributing! 🎉
matchina-0.1.0/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 cn-entity-resolver contributors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
matchina-0.1.0/MANIFEST.in
ADDED
@@ -0,0 +1,19 @@
include README.md
include LICENSE
include CHANGELOG.md
include CONTRIBUTING.md
include pyproject.toml
recursive-include matchina *.py
recursive-include matchina/data *.db *.json
recursive-include examples *.py
recursive-include tests *.py
global-exclude __pycache__
global-exclude *.pyc
global-exclude .gitignore
global-exclude .mypy_cache
global-exclude .pytest_cache
global-exclude .ruff_cache
global-exclude .venv
global-exclude dist
global-exclude build
global-exclude *.egg-info
matchina-0.1.0/PKG-INFO
ADDED
@@ -0,0 +1,206 @@
Metadata-Version: 2.4
Name: matchina
Version: 0.1.0
Summary: A library for aligning and matching the Chinese and English names of Chinese companies
Author: Matchina contributors
License: MIT
Project-URL: Homepage, https://github.com/xxx/matchina
Project-URL: Repository, https://github.com/xxx/matchina
Project-URL: Documentation, https://github.com/xxx/matchina#readme
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: typing-extensions>=4.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Dynamic: license-file

# Matchina

[English](README_en.md) | Chinese

A library for aligning and matching the Chinese and English names of Chinese companies. It supports exact matching, alias matching, and fuzzy search, and works offline with no network access.

## Features

- 🔍 **Four-tier matching strategy**: exact match → alias match → rule-based match → fuzzy match
- 🇨🇳 **Chinese and English support**: handles both Chinese and English names
- 📦 **Offline use**: all data ships inside the library; no network required
- ⚡ **High performance**: a single match takes < 100 ms
- 🐍 **Python 3.8+**: supports Python 3.8 and later

## Installation

```bash
pip install matchina
```

## Quick Start

```python
from matchina import resolve, search, resolve_batch

# Exact match
result = resolve("华为")
print(result[0].name_cn)  # 华为技术有限公司
print(result[0].name_en)  # Huawei Technologies Co., Ltd.
print(result[0].confidence)  # 1.0

# English name match
result = resolve("Alibaba")
print(result[0].name_cn)  # 阿里巴巴集团控股有限公司

# Alias match
result = resolve("抖音")
print(result[0].name_cn)  # 北京字节跳动科技有限公司
print(result[0].confidence)  # 0.95

# Batch match
names = ["腾讯", "百度", "小米", "比亚迪"]
results = resolve_batch(names)
for name, matches in results.items():
    if matches:
        print(f"{name} → {matches[0].name_cn}")

# Fuzzy search
results = search("科技", limit=5)
for r in results:
    print(f"{r.name_cn} (confidence: {r.confidence:.2f})")
```

## Matching Strategy

| Match type | Description | Confidence |
|----------|------|--------|
| `exact` | Exact match | 1.0 |
| `alias` | Alias match | 0.95 |
| `rule` | Rule-based match (suffix/abbreviation) | 0.85 |
| `fuzzy` | Fuzzy match (edit distance) | 0.6-0.8 |
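
To make the `fuzzy` row more concrete, here is a minimal sketch of an edit-distance score mapped into the 0.6-0.8 band. The Levenshtein routine is the standard dynamic-programming algorithm; the `min_similarity` cutoff and the linear mapping onto 0.6-0.8 are assumptions chosen for illustration, since this description does not say how Matchina derives its fuzzy confidence.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute), row by row."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def fuzzy_confidence(
    query: str, candidate: str, min_similarity: float = 0.5
) -> Optional[float]:
    """Map normalized similarity onto 0.6-0.8, or None below the cutoff.

    The cutoff and the linear mapping are illustrative assumptions.
    """
    longest = max(len(query), len(candidate))
    if longest == 0:
        return None
    similarity = 1.0 - levenshtein(query, candidate) / longest
    if similarity < min_similarity:
        return None
    return 0.6 + 0.2 * (similarity - min_similarity) / (1.0 - min_similarity)
```

Capping the score at 0.8 keeps every fuzzy result below the 0.85 assigned to rule-based matches, so the tiers in the table stay ordered.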

## Data Coverage

The current version includes:

- **1132 company entities**
- **112 company aliases**
- Coverage of A-shares, Hong Kong stocks, US-listed Chinese stocks, unicorns, and well-known brands expanding overseas

## API Documentation

### `resolve(name, top_k=5)`

Match a company name.

**Parameters**:
- `name` (str): company name (Chinese or English)
- `top_k` (int): number of results to return, default 5

**Returns**: `List[MatchResult]`

### `search(query, limit=10)`

Fuzzy-search company names.

**Parameters**:
- `query` (str): search keyword
- `limit` (int): number of results to return, default 10

**Returns**: `List[MatchResult]`

### `resolve_batch(names, top_k=5)`

Match a batch of company names.

**Parameters**:
- `names` (List[str]): list of company names
- `top_k` (int): number of results to return per name

**Returns**: `Dict[str, List[MatchResult]]`

### `MatchResult`

Match result object.

```python
@dataclass
class MatchResult:
    entity_id: str  # entity ID
    name_cn: str  # Chinese name
    name_en: str  # English name
    confidence: float  # confidence (0.0-1.0)
    aliases: List[str]  # list of aliases
    match_type: str  # match type
```

## Example Companies

| Chinese name | English name | Aliases |
|--------|--------|------|
| 华为技术有限公司 | Huawei Technologies Co., Ltd. | 华为, HUAWEI |
| 腾讯控股有限公司 | Tencent Holdings Limited | 腾讯, 微信, QQ, Tencent |
| 阿里巴巴集团控股有限公司 | Alibaba Group Holding Limited | 阿里, 淘宝, 天猫, Alibaba |
| 北京字节跳动科技有限公司 | Beijing ByteDance Technology Co., Ltd. | 字节跳动, 抖音, TikTok, 今日头条 |
| 小米科技有限责任公司 | Xiaomi Technology Co., Ltd. | 小米, MI, Xiaomi |

## Development

```bash
# Clone the repository
git clone https://github.com/xxx/matchina.git
cd matchina

# Install development dependencies
pip install -e ".[dev]"

# Run the tests
pytest

# Type checking
mypy matchina

# Lint
ruff check matchina
```

## Contributing

Contributions, bug reports, and suggestions are welcome!

1. Fork this repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## License

[MIT License](LICENSE)

## Changelog

### v0.1.0 (2026-03-14)

- Initial release
- Supports 1132 company entities
- Four-tier matching strategy
- Chinese and English support

---

**Matchina** - Match China, Match Enterprise Names.
matchina-0.1.0/README.md
ADDED
@@ -0,0 +1,173 @@
# Matchina

[English](README_en.md) | Chinese

A library for aligning and matching the Chinese and English names of Chinese companies. It supports exact matching, alias matching, and fuzzy search, and works offline with no network access.

## Features

- 🔍 **Four-tier matching strategy**: exact match → alias match → rule-based match → fuzzy match
- 🇨🇳 **Chinese and English support**: handles both Chinese and English names
- 📦 **Offline use**: all data ships inside the library; no network required
- ⚡ **High performance**: a single match takes < 100 ms
- 🐍 **Python 3.8+**: supports Python 3.8 and later

## Installation

```bash
pip install matchina
```

## Quick Start

```python
from matchina import resolve, search, resolve_batch

# Exact match
result = resolve("华为")
print(result[0].name_cn)  # 华为技术有限公司
print(result[0].name_en)  # Huawei Technologies Co., Ltd.
print(result[0].confidence)  # 1.0

# English name match
result = resolve("Alibaba")
print(result[0].name_cn)  # 阿里巴巴集团控股有限公司

# Alias match
result = resolve("抖音")
print(result[0].name_cn)  # 北京字节跳动科技有限公司
print(result[0].confidence)  # 0.95

# Batch match
names = ["腾讯", "百度", "小米", "比亚迪"]
results = resolve_batch(names)
for name, matches in results.items():
    if matches:
        print(f"{name} → {matches[0].name_cn}")

# Fuzzy search
results = search("科技", limit=5)
for r in results:
    print(f"{r.name_cn} (confidence: {r.confidence:.2f})")
```

## Matching Strategy

| Match type | Description | Confidence |
|----------|------|--------|
| `exact` | Exact match | 1.0 |
| `alias` | Alias match | 0.95 |
| `rule` | Rule-based match (suffix/abbreviation) | 0.85 |
| `fuzzy` | Fuzzy match (edit distance) | 0.6-0.8 |
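
The table above can be read as a cascade in which each tier is tried only when the previous one finds nothing. The sketch below illustrates that reading under stated assumptions: the toy `ENTITIES`, `ALIASES`, and `SUFFIXES` data, the `Hit` type, and `match_one` are hypothetical, and the fuzzy tier uses `difflib.SequenceMatcher` as a stand-in for the edit distance named in the table. It is not the implementation in `matchina/core/strategies.py`, which is not shown here.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple

# Toy data for illustration only.
ENTITIES: Dict[str, str] = {"e1": "华为技术有限公司", "e2": "腾讯控股有限公司"}
ALIASES: Dict[str, str] = {"华为": "e1", "微信": "e2"}
SUFFIXES: Tuple[str, ...] = ("有限公司",)
NAME_INDEX: Dict[str, str] = {full: eid for eid, full in ENTITIES.items()}


@dataclass
class Hit:
    entity_id: str
    match_type: str
    confidence: float


def match_one(name: str) -> Optional[Hit]:
    # Tier 1: the query is exactly a registered full name.
    if name in NAME_INDEX:
        return Hit(NAME_INDEX[name], "exact", 1.0)
    # Tier 2: the query is a known alias.
    if name in ALIASES:
        return Hit(ALIASES[name], "alias", 0.95)
    # Tier 3: a simple rule, here "append a legal-form suffix and retry".
    for suffix in SUFFIXES:
        candidate = name + suffix
        if candidate in NAME_INDEX:
            return Hit(NAME_INDEX[candidate], "rule", 0.85)
    # Tier 4: best similarity ratio, accepted from 0.6 and capped at 0.8.
    best_id, best_ratio = None, 0.0
    for eid, full in ENTITIES.items():
        ratio = SequenceMatcher(None, name, full).ratio()
        if ratio > best_ratio:
            best_id, best_ratio = eid, ratio
    if best_id is not None and best_ratio >= 0.6:
        return Hit(best_id, "fuzzy", min(best_ratio, 0.8))
    return None
```

Ordering the tiers from cheapest and most certain to most expensive and least certain means most lookups finish with a dictionary hit, and the confidence values stay aligned with the table.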

## Data Coverage

The current version includes:

- **1132 company entities**
- **112 company aliases**
- Coverage of A-shares, Hong Kong stocks, US-listed Chinese stocks, unicorns, and well-known brands expanding overseas

## API Documentation

### `resolve(name, top_k=5)`

Match a company name.

**Parameters**:
- `name` (str): company name (Chinese or English)
- `top_k` (int): number of results to return, default 5

**Returns**: `List[MatchResult]`

### `search(query, limit=10)`

Fuzzy-search company names.

**Parameters**:
- `query` (str): search keyword
- `limit` (int): number of results to return, default 10

**Returns**: `List[MatchResult]`

### `resolve_batch(names, top_k=5)`

Match a batch of company names.

**Parameters**:
- `names` (List[str]): list of company names
- `top_k` (int): number of results to return per name

**Returns**: `Dict[str, List[MatchResult]]`

### `MatchResult`

Match result object.

```python
@dataclass
class MatchResult:
    entity_id: str  # entity ID
    name_cn: str  # Chinese name
    name_en: str  # English name
    confidence: float  # confidence (0.0-1.0)
    aliases: List[str]  # list of aliases
    match_type: str  # match type
```
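
Since every result carries a `confidence` and a `match_type`, callers may want to keep only matches above a threshold. The sketch below redeclares `MatchResult` with the fields documented above so that it runs on its own; the `best_match` helper and its default threshold of 0.85 are hypothetical and not part of Matchina's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MatchResult:
    # Local copy of the documented fields, so the sketch is self-contained.
    entity_id: str
    name_cn: str
    name_en: str
    confidence: float
    aliases: List[str]
    match_type: str


def best_match(
    results: List[MatchResult], min_confidence: float = 0.85
) -> Optional[MatchResult]:
    """Return the highest-confidence result at or above the threshold, if any."""
    eligible = [r for r in results if r.confidence >= min_confidence]
    return max(eligible, key=lambda r: r.confidence, default=None)
```

With the default threshold, exact, alias, and rule-based matches pass while fuzzy matches (0.6-0.8) are dropped, which may suit pipelines that prefer no answer over an uncertain one.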

## Example Companies

| Chinese name | English name | Aliases |
|--------|--------|------|
| 华为技术有限公司 | Huawei Technologies Co., Ltd. | 华为, HUAWEI |
| 腾讯控股有限公司 | Tencent Holdings Limited | 腾讯, 微信, QQ, Tencent |
| 阿里巴巴集团控股有限公司 | Alibaba Group Holding Limited | 阿里, 淘宝, 天猫, Alibaba |
| 北京字节跳动科技有限公司 | Beijing ByteDance Technology Co., Ltd. | 字节跳动, 抖音, TikTok, 今日头条 |
| 小米科技有限责任公司 | Xiaomi Technology Co., Ltd. | 小米, MI, Xiaomi |

## Development

```bash
# Clone the repository
git clone https://github.com/xxx/matchina.git
cd matchina

# Install development dependencies
pip install -e ".[dev]"

# Run the tests
pytest

# Type checking
mypy matchina

# Lint
ruff check matchina
```

## Contributing

Contributions, bug reports, and suggestions are welcome!

1. Fork this repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## License

[MIT License](LICENSE)

## Changelog

### v0.1.0 (2026-03-14)

- Initial release
- Supports 1132 company entities
- Four-tier matching strategy
- Chinese and English support

---

**Matchina** - Match China, Match Enterprise Names.