PyPI - tdmelodic-torch - Versions diffs - 2.0.0__tar.gz - Mend

tdmelodic-torch 2.0.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (69)

tdmelodic_torch-2.0.0/LICENSE ADDED

@@ -0,0 +1,29 @@
+BSD 3-Clause License
+Copyright (c) 2019-, PKSHA Technology Inc.
+All rights reserved.
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+1. Redistributions of source code must retain the above copyright notice, this
+   list of conditions and the following disclaimer.
+2. Redistributions in binary form must reproduce the above copyright notice,
+   this list of conditions and the following disclaimer in the documentation
+   and/or other materials provided with the distribution.
+3. Neither the name of the copyright holder nor the names of its
+   contributors may be used to endorse or promote products derived from
+   this software without specific prior written permission.
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

tdmelodic_torch-2.0.0/MANIFEST.in ADDED

@@ -0,0 +1,5 @@
+include README.md
+include LICENSE
+include requirements.txt
+include tdmelodic/nn/lang/mecab/my_mecabrc
+include tdmelodic/nn/resource/net_it_2500000

tdmelodic_torch-2.0.0/PKG-INFO ADDED

@@ -0,0 +1,88 @@
+Metadata-Version: 2.4
+Name: tdmelodic-torch
+Version: 2.0.0
+Summary: tdmelodic: Tokyo Japanese Accent Estimator (PyTorch fork)
+Home-page: https://github.com/Na2CuCl4/tdmelodic
+Author: Hideyuki Tachibana, Zirui Xia
+Author-email: xiazr0422@163.com
+Classifier: Development Status :: 5 - Production/Stable
+Classifier: Environment :: Console
+Classifier: Intended Audience :: Science/Research
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: BSD License
+Classifier: Operating System :: POSIX
+Classifier: Programming Language :: Python :: 3.8
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Topic :: Text Processing :: Linguistic
+Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Natural Language :: Japanese
+Requires-Python: >=3.8
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: numpy>=1.15.4
+Requires-Dist: torch>=2.0.0
+Requires-Dist: mecab-python3>=0.996.1
+Requires-Dist: jaconv>=0.2.4
+Requires-Dist: python-Levenshtein>=0.12.0
+Requires-Dist: tqdm>=4.42.1
+Requires-Dist: regex>=2020.1.8
+Requires-Dist: romkan>=0.2.1
+Dynamic: author
+Dynamic: author-email
+Dynamic: classifier
+Dynamic: description
+Dynamic: description-content-type
+Dynamic: home-page
+Dynamic: license-file
+Dynamic: requires-dist
+Dynamic: requires-python
+Dynamic: summary
+<p align="center">
+<img src="https://github.com/PKSHATechnology-Research/tdmelodic/raw/master/docs/imgs/logo/logo_tdmelodic.svg" width="200" />
+</p>
+# Tokyo Dialect MELOdic accent DICtionary (tdmelodic) generator
+[![document](https://readthedocs.org/projects/tdmelodic/badge/?version=latest)](https://tdmelodic.readthedocs.io/en/latest)
+[![arXiv](https://img.shields.io/badge/arXiv-2009.09679-B31B1B.svg)](https://arxiv.org/abs/2009.09679)
+[![Python unittest](https://github.com/PKSHATechnology-Research/tdmelodic/actions/workflows/test.yml/badge.svg)](https://github.com/PKSHATechnology-Research/tdmelodic/actions/workflows/test.yml)
+[![Docker](https://github.com/PKSHATechnology-Research/tdmelodic/actions/workflows/docker-image.yml/badge.svg)](https://github.com/PKSHATechnology-Research/tdmelodic/actions/workflows/docker-image.yml)
+[![Lilypond](https://github.com/PKSHATechnology-Research/tdmelodic/actions/workflows/img.yml/badge.svg)](https://github.com/PKSHATechnology-Research/tdmelodic/actions/workflows/img.yml)
+[![License](https://img.shields.io/badge/License-BSD%203--Clause-blue.svg)](https://opensource.org/licenses/BSD-3-Clause)
+This module generates a large-scale accent dictionary of
+Japanese (Tokyo dialect) using a neural-network-based technique.
+> **2026-06**: Migrated the neural network backend from **Chainer** to **PyTorch**.
+> The public API (`Converter.sy2a()`, `Converter.s2ya()`) is fully backward-compatible
+> and produces identical inference results. Now supports Python 3.8+.
+For academic use, please cite the following paper.
+[[IEEE Xplore]](https://ieeexplore.ieee.org/document/9054081)
+[[arXiv]](https://arxiv.org/abs/2009.09679)
+```bibtex
+@inproceedings{tachibana2020icassp,
+    author    = "H. Tachibana and Y. Katayama",
+    title     = "Accent Estimation of {Japanese} Words from Their Surfaces and Romanizations
+                 for Building Large Vocabulary Accent Dictionaries",
+    booktitle = {2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+    pages     = "8059--8063",
+    year      = "2020",
+    doi       = "10.1109/ICASSP40776.2020.9054081"
+}
+```
+## Installation and Usage
+- English: [tdmelodic Documentation](https://tdmelodic.readthedocs.io/en/latest)
+- 日本語: [tdmelodic 利用マニュアル](https://tdmelodic.readthedocs.io/ja/latest)
+## Acknowledgement
+Part of this work is based on results obtained from a project subsidized by the New Energy and Industrial Technology Development Organization (NEDO).

tdmelodic_torch-2.0.0/README.md ADDED

@@ -0,0 +1,45 @@
+<p align="center">
+<img src="https://github.com/PKSHATechnology-Research/tdmelodic/raw/master/docs/imgs/logo/logo_tdmelodic.svg" width="200" />
+</p>
+# Tokyo Dialect MELOdic accent DICtionary (tdmelodic) generator
+[![document](https://readthedocs.org/projects/tdmelodic/badge/?version=latest)](https://tdmelodic.readthedocs.io/en/latest)
+[![arXiv](https://img.shields.io/badge/arXiv-2009.09679-B31B1B.svg)](https://arxiv.org/abs/2009.09679)
+[![Python unittest](https://github.com/PKSHATechnology-Research/tdmelodic/actions/workflows/test.yml/badge.svg)](https://github.com/PKSHATechnology-Research/tdmelodic/actions/workflows/test.yml)
+[![Docker](https://github.com/PKSHATechnology-Research/tdmelodic/actions/workflows/docker-image.yml/badge.svg)](https://github.com/PKSHATechnology-Research/tdmelodic/actions/workflows/docker-image.yml)
+[![Lilypond](https://github.com/PKSHATechnology-Research/tdmelodic/actions/workflows/img.yml/badge.svg)](https://github.com/PKSHATechnology-Research/tdmelodic/actions/workflows/img.yml)
+[![License](https://img.shields.io/badge/License-BSD%203--Clause-blue.svg)](https://opensource.org/licenses/BSD-3-Clause)
+This module generates a large-scale accent dictionary of
+Japanese (Tokyo dialect) using a neural-network-based technique.
+> **2026-06**: Migrated the neural network backend from **Chainer** to **PyTorch**.
+> The public API (`Converter.sy2a()`, `Converter.s2ya()`) is fully backward-compatible
+> and produces identical inference results. Now supports Python 3.8+.
+For academic use, please cite the following paper.
+[[IEEE Xplore]](https://ieeexplore.ieee.org/document/9054081)
+[[arXiv]](https://arxiv.org/abs/2009.09679)
+```bibtex
+@inproceedings{tachibana2020icassp,
+    author    = "H. Tachibana and Y. Katayama",
+    title     = "Accent Estimation of {Japanese} Words from Their Surfaces and Romanizations
+                 for Building Large Vocabulary Accent Dictionaries",
+    booktitle = {2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+    pages     = "8059--8063",
+    year      = "2020",
+    doi       = "10.1109/ICASSP40776.2020.9054081"
+}
+```
+## Installation and Usage
+- English: [tdmelodic Documentation](https://tdmelodic.readthedocs.io/en/latest)
+- 日本語: [tdmelodic 利用マニュアル](https://tdmelodic.readthedocs.io/ja/latest)
+## Acknowledgement
+Part of this work is based on results obtained from a project subsidized by the New Energy and Industrial Technology Development Organization (NEDO).

tdmelodic_torch-2.0.0/requirements.txt ADDED

@@ -0,0 +1,8 @@
+numpy>=1.15.4
+torch>=2.0.0
+mecab-python3>=0.996.1
+jaconv>=0.2.4
+python-Levenshtein>=0.12.0
+tqdm>=4.42.1
+regex>=2020.1.8
+romkan>=0.2.1

tdmelodic_torch-2.0.0/setup.cfg ADDED

@@ -0,0 +1,4 @@
+[egg_info]
+tag_build =
+tag_date = 0

tdmelodic_torch-2.0.0/setup.py ADDED

@@ -0,0 +1,69 @@
+#!/usr/bin/env python
+from setuptools import setup, find_packages
+from os import path
+import re, io
+def _readme():
+    with open('README.md', encoding='utf-8') as readme_file:  # README contains non-ASCII text
+        return readme_file.read().replace(":copyright:", "(c)")
+def _requirements():
+    root_dir = path.abspath(path.dirname(__file__))
+    return [name.rstrip() for name in open(path.join(root_dir, 'requirements.txt')).readlines()]
+def _get_version():
+    version = re.search(
+        r'__version__\s*=\s*[\'"]([^\'"]*)[\'"]',  # It excludes inline comment too
+        io.open('tdmelodic/__init__.py', encoding='utf_8_sig').read()
+        ).group(1)
+    return version
+setup(
+    name="tdmelodic-torch",
+    author="Hideyuki Tachibana, Zirui Xia",
+    author_email='xiazr0422@163.com',
+    python_requires='>=3.8',
+    url="https://github.com/Na2CuCl4/tdmelodic",
+    description="tdmelodic: Tokyo Japanese Accent Estimator (PyTorch fork)",
+    long_description=_readme(),
+    long_description_content_type="text/markdown",
+    install_requires=_requirements(),
+    tests_require=_requirements(),
+    setup_requires=[],
+    include_package_data=True,
+    packages=find_packages(include=['tdmelodic', 'tdmelodic.*']),
+    version=_get_version(),
+    zip_safe=False,
+    entry_points={
+        'console_scripts':[
+            'tdmelodic-convert = tdmelodic.nn.convert_dic:main',
+            'tdmelodic-sy2a = tdmelodic.nn.convert:main_sy2a',
+            'tdmelodic-s2ya = tdmelodic.nn.convert:main_s2ya',
+            'tdmelodic-neologd-preprocess = tdmelodic.filters.neologd_preprocess:main',
+            'tdmelodic-modify-unigram-cost = tdmelodic.filters.postprocess_modify_unigram_cost:main',
+        ]
+    },
+    classifiers=[
+        'Development Status :: 5 - Production/Stable',
+        'Environment :: Console',
+        'Intended Audience :: Science/Research',
+        'Intended Audience :: Developers',
+        'License :: OSI Approved :: BSD License',
+        'Operating System :: POSIX',
+        'Programming Language :: Python :: 3.8',
+        'Programming Language :: Python :: 3.9',
+        'Programming Language :: Python :: 3.10',
+        'Programming Language :: Python :: 3.11',
+        'Topic :: Text Processing :: Linguistic',
+        'Topic :: Multimedia :: Sound/Audio :: Speech',
+        'Topic :: Scientific/Engineering :: Artificial Intelligence',
+        'Natural Language :: Japanese',
+    ]
+)
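
The `_get_version()` helper in `setup.py` reads the version string out of `tdmelodic/__init__.py` with a regular expression instead of importing the package. The following is a minimal, self-contained sketch of that idea. It reuses the same pattern, but `extract_version` and the sample string are hypothetical names introduced here for illustration, and the explicit error for a missing match is an addition that the original does not have.

```python
import re

# Same pattern as in setup.py: __version__ = '...' or __version__ = "..."
VERSION_RE = re.compile(r'__version__\s*=\s*[\'"]([^\'"]*)[\'"]')


def extract_version(source: str) -> str:
    """Return the quoted value assigned to __version__ in `source`.

    Hypothetical helper: unlike setup.py, it raises a clear error when
    no assignment is found instead of failing on `None.group(1)`.
    """
    match = VERSION_RE.search(source)
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match.group(1)


# The capture group stops at the closing quote, so a trailing
# inline comment does not end up in the version string.
sample = "__version__      = '2.0.0'  # inline comment\n"
print(extract_version(sample))  # prints 2.0.0
```

Reading the file as text avoids importing `tdmelodic` at build time, which would otherwise require its dependencies to be installed already.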

tdmelodic_torch-2.0.0/tdmelodic/__init__.py ADDED

@@ -0,0 +1,11 @@
+from .nn import *
+from .filters import *
+__copyright__    = 'Copyright (C) 2019 Hideyuki Tachibana, PKSHA Technology Inc.'
+__version__      = '2.0.0'
+__license__      = 'BSD-3-Clause'
+__author__       = 'Hideyuki Tachibana, Zirui Xia'
+__author_email__ = 'xiazr0422@163.com'
+__url__          = 'https://github.com/Na2CuCl4/tdmelodic'
+__all__ = ['nn', 'filters']

tdmelodic_torch-2.0.0/tdmelodic/filters/__init__.py ADDED

File without changes

tdmelodic_torch-2.0.0/tdmelodic/filters/neologd_patch.py ADDED

@@ -0,0 +1,165 @@
+# -----------------------------------------------------------------------------
+# Copyright (c) 2019-, PKSHA Technology Inc.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+# -----------------------------------------------------------------------------
+# -*- coding: utf-8 -*-
+import sys
+import os
+import argparse
+import regex as re
+import csv
+from tqdm import tqdm
+import tempfile
+import copy
+import unicodedata
+import jaconv
+from tdmelodic.util.dic_index_map import get_dictionary_index_map
+from tdmelodic.util.util import count_lines
+from tdmelodic.util.word_type import WordType
+from .yomi.basic import modify_longvowel_errors
+from .yomi.basic import modify_yomi_of_numerals
+from .yomi.particle_yomi import ParticleYomi
+from .yomi.wrong_yomi_detection import SimpleWrongYomiDetector
+class NeologdPatch(object):
+    def __init__(self, *args, **kwargs):
+        for k, v in kwargs.items():
+            if k != "input" and k != "output":
+                self.__setattr__(k, v)
+        self.IDX_MAP = get_dictionary_index_map(self.mode) # dictionary type
+        self.wt = WordType(self.mode)
+        self.wrong_yomi_detector = SimpleWrongYomiDetector(mode=self.mode)
+        self.particle_yomi = ParticleYomi()
+    def showinfo(self):
+        print("ℹ️  [ Info ]", file=sys.stderr)
+        self.message("| {}  Hash tags will{}be removed.", self.rm_hashtag)
+        self.message("| {}  Noisy katakana words will{}be removed.", self.rm_noisy_katakana)
+        self.message("| {}  Person names will{}be removed.", self.rm_person)
+        self.message("| {}  Emojis will{}be removed.", self.rm_emoji)
+        self.message("| {}  Symbols will{}be removed.", self.rm_symbol)
+        self.message("| {}  Numerals will{}be removed.", self.rm_numeral)
+        self.message("| {}  Wrong yomi words will{}be removed.", self.rm_wrong_yomi)
+        self.message("| {}  Words with special particles \"は\" and \"へ\" will{}be removed.", self.rm_special_particle)
+        self.message("| {}  Long vowel errors will{}be corrected.", self.cor_longvow)
+        self.message("| {}  Numeral yomi errors will{}be corrected.", self.cor_yomi_num)
+        self.message("| {}  Surface forms will{}be normalized.", self.normalize)
+    @classmethod
+    def message(cls, message, flag):
+        if flag:
+            message = message.format("✅", " ")
+        else:
+            message = message.format("‼️", " *NOT* ")
+        print(message, file=sys.stderr)
+    def add_accent_column(self, line, idx_accent=None):
+        line = line + ['' for i in range(10)]
+        line[idx_accent] = '@'
+        return line
+    def normalize_surface(self, line, idx_surface=None):
+        s = line[idx_surface]
+        s = unicodedata.normalize("NFKC", s)
+        s = s.upper()
+        s = jaconv.normalize(s, "NFKC")
+        s = jaconv.h2z(s, digit=True, ascii=True, kana=True)
+        s = s.replace("\u00A5", "\uFFE5") # yen symbol
+        line[idx_surface] = s
+        return line
+    def process_single_line(self, line):
+        # ----------------------------------------------------------------------
+        # remove words by word types
+        if self.rm_hashtag:
+            if self.wt.is_hashtag(line):
+                return None
+        if self.rm_noisy_katakana:
+            if self.wt.is_noisy_katakana(line):
+                return None
+        if self.rm_person:
+            if self.wt.is_person(line):
+                return None
+        if self.rm_emoji:
+            if self.wt.is_emoji(line):
+                return None
+        if self.rm_symbol:
+            if self.wt.is_symbol(line):
+                return None
+        if self.rm_numeral:
+            if self.wt.is_numeral(line):
+                return None
+        line = copy.deepcopy(line)
+        # ----------------------------------------------------------------------
+        # correct yomi
+        if self.cor_longvow:
+            line = modify_longvowel_errors(line, idx_yomi=self.IDX_MAP["YOMI"])
+        if self.cor_yomi_num:
+            if self.wt.is_numeral(line):
+                line = modify_yomi_of_numerals(line,
+                    idx_surface=self.IDX_MAP["SURFACE"], idx_yomi=self.IDX_MAP["YOMI"])
+        # ----------------------------------------------------------------------
+        # correct the yomi of particles (TODO)
+        if self.rm_special_particle:
+            line = self.particle_yomi(line, self.IDX_MAP)
+            if line is None:
+                return None
+        # ----------------------------------------------------------------------
+        # normalize surface
+        if self.normalize:
+            line = self.normalize_surface(line, idx_surface=self.IDX_MAP["SURFACE"])
+        # ----------------------------------------------------------------------
+        # remove words whose yomi is possibly wrong
+        if self.rm_wrong_yomi:
+            line = self.wrong_yomi_detector(line)
+            if line is None:
+                return None
+        # ----------------------------------------------------------------------
+        # add additional columns for compatibility with unidic-kana-accent
+        if self.mode == "unidic":
+            line = self.add_accent_column(line, idx_accent=self.IDX_MAP["ACCENT"])
+        # ----------------------------------------------------------------------
+        return line
+    def __call__(self, fp_in, fp_out):
+        self.showinfo()
+        L = count_lines(fp_in)
+        n_removed = 0
+        n_corrected= 0
+        for line in tqdm(csv.reader(fp_in), total=L):
+            try:
+                line_processed = self.process_single_line(line)
+            except Exception as e:
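+                # report on stderr so that a CSV written to stdout is not polluted
+                print(e, file=sys.stderr)
+                print(line, file=sys.stderr)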
+                sys.exit(1)
+            if line_processed is None:
+                n_removed += 1
+                continue
+            if line_processed[:20] != line[:20]:
+                n_corrected += 1
+            fp_out.write(','.join(line_processed) + '\n')
+        print("🍺  [ Complete! ]", file=sys.stderr)
+        print("📊  Number of removed entries ", n_removed, file=sys.stderr)
+        print("📊  Number of corrected entries ", n_corrected, file=sys.stderr)
+        return
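
`NeologdPatch.normalize_surface` applies NFKC normalization, upper-casing, a hankaku-to-zenkaku conversion, and a yen-sign replacement, in that order. The sketch below walks through the same sequence using only the standard library. It is an approximation rather than the package's code: `ascii_to_fullwidth` is a hypothetical stand-in for `jaconv.h2z(..., digit=True, ascii=True)` that only shifts printable ASCII into the fullwidth block and leaves spaces untouched, which may differ from what jaconv does.

```python
import unicodedata


def ascii_to_fullwidth(text: str) -> str:
    """Hypothetical stand-in for jaconv.h2z: map printable ASCII
    (U+0021..U+007E) to the fullwidth forms (U+FF01..U+FF5E)."""
    return "".join(
        chr(ord(ch) + 0xFEE0) if 0x21 <= ord(ch) <= 0x7E else ch
        for ch in text
    )


def normalize_surface_sketch(surface: str) -> str:
    # 1. NFKC folds compatibility variants, e.g. halfwidth katakana
    #    become ordinary katakana and fullwidth digits become ASCII.
    s = unicodedata.normalize("NFKC", surface)
    # 2. Upper-case Latin letters.
    s = s.upper()
    # 3. Convert the remaining ASCII to fullwidth characters.
    s = ascii_to_fullwidth(s)
    # 4. Replace the yen sign with its fullwidth form, as the original does.
    s = s.replace("\u00A5", "\uFFE5")
    return s


print(normalize_surface_sketch("abc12"))
```

Because NFKC already converts halfwidth katakana, the sketch omits a separate kana conversion step.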

tdmelodic_torch-2.0.0/tdmelodic/filters/neologd_preprocess.py ADDED

@@ -0,0 +1,127 @@
+# -----------------------------------------------------------------------------
+# Copyright (c) 2019-, PKSHA Technology Inc.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+# -----------------------------------------------------------------------------
+# -*- coding: utf-8 -*-
+import sys
+import os
+import argparse
+import tempfile
+from .neologd_patch import NeologdPatch
+from .neologd_rmdups import rmdups
+class Preprocess(object):
+    def __init__(self, flag_rmdups, neologd_patch, dictionary_type="unidic"):
+        self.flag_rmdups = flag_rmdups
+        self.neologd_patch_module = neologd_patch
+        self.dictionary_type = dictionary_type
+    def do_rmdups(self, fp_in):
+        fp_tmp = tempfile.NamedTemporaryFile("w+")
+        print("📌  creating a temporary file", fp_tmp.name, file=sys.stderr)
+        rmdups(fp_in, fp_tmp, self.dictionary_type)
+        fp_tmp.seek(0)
+        fp_in.close() # CPython's GC would close the previous fp_in automatically even without this
+        fp_in = fp_tmp
+        return fp_in
+    def do_neologd_patch(self, fp_in):
+        fp_tmp = tempfile.NamedTemporaryFile("w+")
+        print("📌  creating a temporary file", fp_tmp.name, file=sys.stderr)
+        self.neologd_patch_module(fp_in, fp_tmp)
+        fp_tmp.seek(0)
+        fp_in.close() # CPython's GC would close the previous fp_in automatically even without this
+        fp_in = fp_tmp
+        return fp_in
+    def copy_temp_to_output(self, fp_in, fp_out):
+        # output
+        for l in fp_in:
+            fp_out.write(l)
+        fp_in.close()
+        fp_out.close()
+    def __call__(self, fp_in, fp_out):
+        print("ℹ️  [ Info ]", file=sys.stderr)
+        NeologdPatch.message("| {} Duplicate entries will{}be removed.", self.flag_rmdups)
+        if self.flag_rmdups:
+            fp_in = self.do_rmdups(fp_in)
+        fp_in = self.do_neologd_patch(fp_in)
+        print("💾  [ Saving ]", file=sys.stderr)
+        self.copy_temp_to_output(fp_in, fp_out)
+        print("🍺  [ Done ]", file=sys.stderr)
+def my_add_argument(parser, option_name, default, help_):
+    help_ = help_ + " <default={}>".format(str(default))
+    if sys.version_info >= (3, 9):
+        parser.add_argument("--" + option_name,
+            action=argparse.BooleanOptionalAction,
+            default=default,
+            help=help_)
+    else:
+        parser.add_argument("--" + option_name,
+            action="store_true",
+            default=default,
+            help=help_)
+        parser.add_argument("--no-" + option_name,
+            action="store_false",
+            dest=option_name,
+            default=default)
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        '-i', '--input',
+        nargs='?',
+        type=argparse.FileType("r"),
+        default=sys.stdin,
+        help='input CSV file (NEologd dictionary file) <default=STDIN>')
+    parser.add_argument(
+        '-o', '--output',
+        nargs='?',
+        type=argparse.FileType("w"),
+        default=sys.stdout,
+        help='output CSV file <default=STDOUT>')
+    parser.add_argument(
+        "-m", "--mode",
+        type=str,
+        choices=["unidic", "ipadic"],
+        default="unidic",
+        help="dictionary format type <default=unidic>",
+    )
+    my_add_argument(parser, "rmdups", True, "remove duplicate entries or not")
+    my_add_argument(parser, "rm_hashtag", True, "remove hash tags or not")
+    my_add_argument(parser, "rm_noisy_katakana", True, "remove noisy katakana words or not")
+    my_add_argument(parser, "rm_person", False, "remove person names or not")
+    my_add_argument(parser, "rm_emoji", False, "remove emojis or not")
+    my_add_argument(parser, "rm_symbol", False, "remove symbols or not")
+    my_add_argument(parser, "rm_numeral", False, "remove numerals or not")
+    my_add_argument(parser, "rm_wrong_yomi", True, "remove words with possibly wrong yomi or not")
+    my_add_argument(parser, "rm_special_particle", True, "remove words with special particles \"は\" or \"へ\"")
+    my_add_argument(parser, "cor_longvow", True, "correct long vowel errors or not")
+    my_add_argument(parser, "cor_yomi_num", True, "correct the yomi of numerals or not")
+    my_add_argument(parser, "normalize", False, "normalize the surface forms by applying "
+        "NFKC Unicode normalization, "
+        "capitalization of alphabets, "
+        "and "
+        "hankaku-to-zenkaku converter.")
+    args = parser.parse_args()
+    if args.input == args.output:
+        print("[ Error ] input and output files should be different.", file=sys.stderr)
+        sys.exit(1)  # non-zero exit status signals the error to the caller
+    try:
+        preprocess = Preprocess(args.rmdups, NeologdPatch(**vars(args)), dictionary_type=args.mode)
+        preprocess(args.input, args.output)
+    except Exception as e:
+        print(e, file=sys.stderr)
+if __name__ == '__main__':
+    main()
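
`my_add_argument` registers each boolean option as an `--option` / `--no-option` pair: on Python 3.9 and later it relies on `argparse.BooleanOptionalAction`, and on older versions it adds two separate arguments that share one destination. The sketch below isolates that pattern. The names `add_bool_flag` and `parse` are hypothetical, and only two of the options from the diff are registered to keep it short.

```python
import argparse
import sys


def add_bool_flag(parser, name, default):
    """Register --<name> and --no-<name>, both writing to args.<name>."""
    if sys.version_info >= (3, 9):
        # BooleanOptionalAction generates the --no-<name> variant itself.
        parser.add_argument("--" + name,
                            action=argparse.BooleanOptionalAction,
                            default=default)
    else:
        # Fallback: two arguments that share the same destination.
        parser.add_argument("--" + name, action="store_true",
                            default=default)
        parser.add_argument("--no-" + name, action="store_false",
                            dest=name, default=default)


def parse(argv):
    parser = argparse.ArgumentParser()
    add_bool_flag(parser, "rmdups", True)      # enabled unless --no-rmdups
    add_bool_flag(parser, "rm_person", False)  # disabled unless --rm_person
    return parser.parse_args(argv)


args = parse(["--no-rmdups", "--rm_person"])
print(args.rmdups, args.rm_person)
```

Option names that contain underscores keep them on the command line, so the negated form of `--rm_person` is `--no-rm_person`.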

tdmelodic_torch-2.0.0/tdmelodic/filters/neologd_rmdups.py ADDED

@@ -0,0 +1,94 @@
+# -----------------------------------------------------------------------------
+# Copyright (c) 2019-, PKSHA Technology Inc.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+# -----------------------------------------------------------------------------
+# -*- coding: utf-8 -*-
+import sys
+import os
+import argparse
+import regex as re
+import csv
+from tqdm import tqdm
+import jaconv
+import unicodedata
+from dataclasses import dataclass
+from tdmelodic.nn.lang.japanese.kansuji import numeric2kanji
+from tdmelodic.util.dic_index_map import get_dictionary_index_map
+from tdmelodic.util.util import count_lines
+from tdmelodic.util.word_type import WordType
+from .yomi.yomieval import YomiEvaluator
+# ------------------------------------------------------------------------------------
+def normalize_surface(text):
+    # hankaku
+    text = unicodedata.normalize("NFKC",text)
+    text = jaconv.h2z(text, digit=True, ascii=True, kana=False)
+    # kansuji
+    text = numeric2kanji(text)
+    # abbreviations such as (株) for kabushiki-gaisha (stock company), etc.
+    text = text.replace("（株）","・カブシキガイシャ・")
+    text = text.replace("（有）","・ユウゲンガイシャ・")
+    text = text.replace("＆","・アンド・")
+    return text
+# ------------------------------------------------------------------------------------
+@dataclass
+class LineInfo(object):
+    surf: str
+    yomi: str
+    pos: str
+def get_line_info(line, IDX_MAP):
+    s = line[IDX_MAP["SURFACE"]]
+    y = line[IDX_MAP["YOMI"]]
+    pos = "-".join([line[i] for i in [IDX_MAP["POS1"], IDX_MAP["POS2"], IDX_MAP["POS3"]]])
+    s = normalize_surface(s)
+    return LineInfo(s, y, pos)
+def rmdups(fp_in, fp_out, dictionary_type="unidic"):
+    """
+    dictionary_type: unidic or ipadic
+    """
+    IDX_MAP = get_dictionary_index_map(dictionary_type)
+    yomieval = YomiEvaluator()
+    prev_line = [""] * 100
+    c = 0
+    L = count_lines(fp_in)
+    wt = WordType(dictionary_type)
+    print("ℹ️  [ Removing duplicate entries ]", file=sys.stderr)
+    for i, curr_line in enumerate(tqdm(csv.reader(fp_in), total=L)):
+        prev = get_line_info(prev_line, IDX_MAP)
+        curr = get_line_info(curr_line, IDX_MAP)
+        if prev.surf == curr.surf and prev.pos == curr.pos and \
+            not wt.is_person(prev_line) and not wt.is_placename(prev_line):
+            # if the surface form and pos are the same
+            distance_p = yomieval.eval(prev.surf, prev.yomi)
+            distance_c = yomieval.eval(curr.surf, curr.yomi)
+        else:
+            distance_p = 0
+            distance_c = 100
+        if distance_p > distance_c:
+            c += 1
+            # if c % 100 == 0:
+            #    print(c, curr.surf, "| deleted: ", prev.yomi, distance_p, " | left: ", curr.yomi, distance_c, file=sys.stderr)
+        else:
+            if i != 0:
+                fp_out.write(",".join(prev_line) + "\n")
+        prev_line = curr_line
+        continue
+    fp_out.write(",".join(prev_line) + "\n")
+    print("📊  Number of removed duplicate entries ", c, file=sys.stderr)
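
The loop in `rmdups` compares each row with the one before it: when both share the same normalized surface and part of speech, the earlier row is dropped only if its yomi scores strictly worse than the later one. A simplified sketch of that adjacent-pair logic follows. It makes several assumptions: rows are plain `(surface, pos, yomi)` tuples, `score` is a hypothetical callable standing in for `YomiEvaluator.eval` (lower is better), the person and place-name exemptions are left out, and empty input yields an empty list.

```python
def drop_worse_predecessors(rows, score):
    """Drop a row when the next row is the same entry with a better score.

    rows  : sequence of (surface, pos, yomi), duplicates assumed adjacent
    score : hypothetical callable (surface, yomi) -> number, lower is better
    """
    kept = []
    prev = None
    for curr in rows:
        if prev is not None:
            same_entry = prev[0] == curr[0] and prev[1] == curr[1]
            prev_is_worse = same_entry and (
                score(prev[0], prev[2]) > score(curr[0], curr[2])
            )
            if not prev_is_worse:
                kept.append(prev)
        prev = curr
    if prev is not None:
        # The last row has no successor to compare against, so it is kept.
        kept.append(prev)
    return kept


# Toy scorer for illustration only: shorter yomi counts as better.
rows = [("a", "noun", "yy"), ("a", "noun", "x"), ("b", "noun", "z")]
print(drop_worse_predecessors(rows, lambda surface, yomi: len(yomi)))
```

As in the original loop, a row is removed only when the following row is strictly better, so two adjacent duplicates where the earlier one scores better are both kept.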