batchalign 0.7.3b12.tar.gz → 0.7.3b14.tar.gz (PyPI)

Files changed (109)

{batchalign-0.7.3b12/batchalign.egg-info → batchalign-0.7.3b14}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: batchalign
-Version: 0.7.3b12
+Version: 0.7.3b14
 Summary: Python Speech Language Sample Analysis
 Author: Brian MacWhinney, Houjun Liu
 Author-email: macw@cmu.edu, houjun@cmu.edu
@@ -82,7 +82,7 @@ The TalkBank Project, of which Batchalign is a part, is supported by NIH grant H
 ## Quick Start
-The following instructions is a quick start to install Batchalign. For most users aiming to process CHAT and audio with Batchalign, we recommend more detailed usage instructions: for [usage](https://talkbank.org/info/BA2-usage.pdf) and [human transcript cleanup](https://talkbank.org/info/BA2-cleanup.pdf). The following provides a quick start guide for the program.
+The following instructions provide a quick start to installing Batchalign. For most users aiming to process CHAT and audio with Batchalign, we recommend more detailed usage instructions: for [usage](https://talkbank.org/info/BA2-usage.pdf) and [human transcript cleanup](https://talkbank.org/info/BA2-cleanup.pdf). The following provides a quick start guide for the program.
 ### Get Python
 - We support Python versions 3.9, 3.10, and 3.11.
@@ -112,7 +112,7 @@ py -m pip3 install -U batchalign
 ```
 ### Rock and Roll
-There are two main ways of interacting with Batchalign. Batchalign can be used as a program to batch-process CHAT (hence the name), or a Python LSA library.
+There are two main ways of interacting with Batchalign. Batchalign can be used as a program to batch-process CHAT (hence the name), or as a Python LSA library.
 - to get started with the Batchalign program, [tap here](#quick-start-command-line)
 - to get started on the Batchalign Library (assumes familiarity with Python), [tap here](#quick-start-python)
@@ -121,7 +121,7 @@ There are two main ways of interacting with Batchalign. Batchalign can be used a
 ### Basic Usage
-Once installed, you can invoke the Batchalign CLI program via the `batchalign` command.
+Once installed, you can invoke the Batchalign program by typing `batchalign` into the Terminal (MacOS) or Command Prompt (Windows).
 It is used in the following basic way:
@@ -131,9 +131,9 @@ batchalign [verb] [input_dir] [output_dir]
 Where `verb` includes:
-1. `transcribe` - placing only an audio of video file (`.mp3/.mp4/.wav`) in the input directory, perform ASR on the audio, diarizes utterances, identifies some basic conversational features like retracing and filled pauses, and generate word-level alignments. You must supply a language code flag: `--lang=[three letter ISO language code]` for the ASR system to know what language the transcript is in. You can choose the flags `--rev` to use Rev.AI, a commercial ASR service, or `--whisper`, to use a local copy of OpenAI Whisper.
-2. `align` - placing both an audio of video file (`.mp3/.mp4/.wav`) and an *utterance-aligned* CHAT file in the input directory, generate word-level alignments
-3. `morphotag` - placing a CHAT file in the input directory, uses Stanford NLP Stanza to generate morphological and dependency analyses. You must supply a language code flag: `--lang=[three letter ISO language code]` for the alignment system to know what language the transcript is in.
+1. `transcribe` - by placing only an audio of video file (`.mp3/.mp4/.wav`) in the input directory, this function performs ASR on the audio, diarizes utterances, identifies some basic conversational features like retracing and filled pauses, and generates word-level alignments. You must supply a language code flag: `--lang=[three letter ISO language code]` for the ASR system to know what language the transcript is in. You can choose the flags `--rev` to use Rev.AI, a commercial ASR service, or `--whisper`, to use a local copy of OpenAI Whisper.
+2. `align` - by placing both an audio of video file (`.mp3/.mp4/.wav`) and an *utterance-aligned* CHAT file in the input directory, this function recovers utterance-level time alignments (if they are not already annotated) and generates word-level alignments. The @Languages header in the CHAT file tells the program which language is in the transcript.
+3. `morphotag` - by placing a CHAT file in the input directory, this function uses Stanford NLP Stanza to generate morphological and dependency analyses. The @Languages header in the CHAT file tells the program which language is in the transcript. You must supply a language code flag: `--lang=[three letter ISO language code]` for the alignment system to know what language the transcript is in.
 <!-- 4. `bulletize` - placing both an audio of video file (`.mp3/.mp4/.wav`) and an *unlinked* CHAT file in the input directory, generate utterance-level alignments through ASR -->
 You can get a CHAT transcript to experiment with [at the TalkBank website](https://talkbank.org/), under any of the "Banks" that are available. You can also generate and parse a CHAT transcript via [the Python program](https://github.com/TalkBank/batchalign2?tab=readme-ov-file#chat).

{batchalign-0.7.3b12 → batchalign-0.7.3b14}/README.md RENAMED

@@ -8,7 +8,7 @@ The TalkBank Project, of which Batchalign is a part, is supported by NIH grant H
 ## Quick Start
-The following instructions is a quick start to install Batchalign. For most users aiming to process CHAT and audio with Batchalign, we recommend more detailed usage instructions: for [usage](https://talkbank.org/info/BA2-usage.pdf) and [human transcript cleanup](https://talkbank.org/info/BA2-cleanup.pdf). The following provides a quick start guide for the program.
+The following instructions provide a quick start to installing Batchalign. For most users aiming to process CHAT and audio with Batchalign, we recommend more detailed usage instructions: for [usage](https://talkbank.org/info/BA2-usage.pdf) and [human transcript cleanup](https://talkbank.org/info/BA2-cleanup.pdf). The following provides a quick start guide for the program.
 ### Get Python
 - We support Python versions 3.9, 3.10, and 3.11.
@@ -38,7 +38,7 @@ py -m pip3 install -U batchalign
 ```
 ### Rock and Roll
-There are two main ways of interacting with Batchalign. Batchalign can be used as a program to batch-process CHAT (hence the name), or a Python LSA library.
+There are two main ways of interacting with Batchalign. Batchalign can be used as a program to batch-process CHAT (hence the name), or as a Python LSA library.
 - to get started with the Batchalign program, [tap here](#quick-start-command-line)
 - to get started on the Batchalign Library (assumes familiarity with Python), [tap here](#quick-start-python)
@@ -47,7 +47,7 @@ There are two main ways of interacting with Batchalign. Batchalign can be used a
 ### Basic Usage
-Once installed, you can invoke the Batchalign CLI program via the `batchalign` command.
+Once installed, you can invoke the Batchalign program by typing `batchalign` into the Terminal (MacOS) or Command Prompt (Windows).
 It is used in the following basic way:
@@ -57,9 +57,9 @@ batchalign [verb] [input_dir] [output_dir]
 Where `verb` includes:
-1. `transcribe` - placing only an audio of video file (`.mp3/.mp4/.wav`) in the input directory, perform ASR on the audio, diarizes utterances, identifies some basic conversational features like retracing and filled pauses, and generate word-level alignments. You must supply a language code flag: `--lang=[three letter ISO language code]` for the ASR system to know what language the transcript is in. You can choose the flags `--rev` to use Rev.AI, a commercial ASR service, or `--whisper`, to use a local copy of OpenAI Whisper.
-2. `align` - placing both an audio of video file (`.mp3/.mp4/.wav`) and an *utterance-aligned* CHAT file in the input directory, generate word-level alignments
-3. `morphotag` - placing a CHAT file in the input directory, uses Stanford NLP Stanza to generate morphological and dependency analyses. You must supply a language code flag: `--lang=[three letter ISO language code]` for the alignment system to know what language the transcript is in.
+1. `transcribe` - by placing only an audio of video file (`.mp3/.mp4/.wav`) in the input directory, this function performs ASR on the audio, diarizes utterances, identifies some basic conversational features like retracing and filled pauses, and generates word-level alignments. You must supply a language code flag: `--lang=[three letter ISO language code]` for the ASR system to know what language the transcript is in. You can choose the flags `--rev` to use Rev.AI, a commercial ASR service, or `--whisper`, to use a local copy of OpenAI Whisper.
+2. `align` - by placing both an audio of video file (`.mp3/.mp4/.wav`) and an *utterance-aligned* CHAT file in the input directory, this function recovers utterance-level time alignments (if they are not already annotated) and generates word-level alignments. The @Languages header in the CHAT file tells the program which language is in the transcript.
+3. `morphotag` - by placing a CHAT file in the input directory, this function uses Stanford NLP Stanza to generate morphological and dependency analyses. The @Languages header in the CHAT file tells the program which language is in the transcript. You must supply a language code flag: `--lang=[three letter ISO language code]` for the alignment system to know what language the transcript is in.
 <!-- 4. `bulletize` - placing both an audio of video file (`.mp3/.mp4/.wav`) and an *unlinked* CHAT file in the input directory, generate utterance-level alignments through ASR -->
 You can get a CHAT transcript to experiment with [at the TalkBank website](https://talkbank.org/), under any of the "Banks" that are available. You can also generate and parse a CHAT transcript via [the Python program](https://github.com/TalkBank/batchalign2?tab=readme-ov-file#chat).

{batchalign-0.7.3b12 → batchalign-0.7.3b14}/batchalign/document.py RENAMED

@@ -208,6 +208,7 @@ class Utterance(BaseModel):
         # t = re.sub(r"^[^\w\d\s<]+", "", t.strip()).strip()
         t = re.sub(r",", " , ", t.strip()).strip()
         t = re.sub(r" +", " ", t.strip()).strip()
+        t = t.replace("+ ,", "+,").strip()
         return t
     def __repr__(self):
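
The three substitutions visible in this hunk can be tried in isolation. The sketch below copies them into a standalone function; the name `normalize_spacing` is hypothetical (the hunk does not show the name of the enclosing `Utterance` method), and the sample strings are made up for illustration. It suggests why the new line is needed: padding every comma with spaces splits `+,` into `+ ,`, and the added `replace` call joins it back.

```python
import re


def normalize_spacing(t: str) -> str:
    """Standalone copy of the three substitutions in the hunk (hypothetical name)."""
    # Pad every comma with spaces so it becomes its own token.
    t = re.sub(r",", " , ", t.strip()).strip()
    # Collapse runs of spaces left behind by the padding.
    t = re.sub(r" +", " ", t.strip()).strip()
    # New in 0.7.3b14: rejoin "+ ," so the "+," token stays in one piece.
    t = t.replace("+ ,", "+,").strip()
    return t


if __name__ == "__main__":
    print(normalize_spacing("well +, you know"))
    print(normalize_spacing("a,b"))
```

Without the final `replace`, the first sample would come back with `+ ,` split in two.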

batchalign-0.7.3b14/batchalign/pipelines/morphosyntax/ja/verbforms.py ADDED

@@ -0,0 +1,56 @@
+"""
+verbforms.py
+Fix Japanese verb forms.
+"""
+def verbform(upos, target, text):
+    if "遣" in text and upos == "noun":
+        return "verb", "遣る"
+    if "死" in text:
+        return "verb", "死ぬ"
+    if "立" in text:
+        return "verb", "立つ"
+    if "引" in text:
+        return "verb", "引く"
+    if "出" in text:
+        return "verb", "出す"
+    if "引" in text:
+        return "verb", "引く"
+    if "飲" in text:
+        return "verb", "飲む"
+    if "呼" in text:
+        return "verb", "呼ぶ"
+    if "脱" in text:
+        return "verb", "脱ぐ"
+    if text == "な" and upos == "part":
+        return "aux", "な"
+    if text == "呼ん":
+        return "verb", "呼ぶ"
+    if text == "な" and upos == "aux":
+        return "aux", "な"
+    if text == "だり":
+        return "aux", "たり"
+    if text == "たり":
+        return "aux", "たり"
+    if text == "たら":
+        return "sconj", "たら"
+    if text == "たっ":
+        return "sconj", "たって"
+    # if text == "て" and upos == "sconj":
+    #     return "aux", "て"
+    if text == "なさい" and target == "為さる":
+        return "aux", "為さい"
+    if text == "な" and upos == "part":
+        return "aux", "な"
+    if text == "脱" and upos == "noun":
+        return "verb", "脱"
+    if text == "よう" and upos == "aux":
+        return "aux", "よう"
+    if text == "ろ" and upos == "aux" and target == "為る":
+        return "aux", "ろ"
+    # if upos == "verb" and "る" in target:
+    #     return "verb", target.replace("る","").strip()
+    return upos,target
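
A short sketch of how a rule chain like the one above behaves. It copies only a handful of the rules, in their original order, into a function with the hypothetical name `verbform_subset`. It assumes, based only on the code shown, that `upos` is a lowercase part-of-speech tag, `target` is the current lemma, and `text` is the surface form, and that the return value is a `(upos, lemma)` pair. The call sites are not part of this diff, so the sample inputs are illustrative.

```python
def verbform_subset(upos, target, text):
    """Reduced copy of a few rules from verbforms.py (hypothetical name)."""
    # Rules are tried top to bottom; the first match wins.
    if "遣" in text and upos == "noun":
        return "verb", "遣る"
    if "呼" in text:
        return "verb", "呼ぶ"
    if text == "だり":
        return "aux", "たり"
    if text == "たら":
        return "sconj", "たら"
    # No rule matched: pass the tag and lemma through unchanged.
    return upos, target


if __name__ == "__main__":
    print(verbform_subset("noun", "遣", "遣っ"))
    print(verbform_subset("verb", "呼ん", "呼ん"))
    print(verbform_subset("noun", "犬", "犬"))
```

Because the first matching rule returns immediately, a substring rule such as `"呼" in text` is reached before any later exact-match rule for the same characters.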

{batchalign-0.7.3b12 → batchalign-0.7.3b14}/batchalign/pipelines/morphosyntax/ud.py RENAMED

@@ -233,9 +233,14 @@ def handler__VERB(word, lang=None):
     tense = feats.get("Tense", "")
     polarity = feats.get("Polarity", "")
     polite = feats.get("Polite", "")
-    return handler(word, lang)+flag+stringify_feats(aspect, mood,
-                                              tense, polarity, polite,
-                                              number[:1]+person)
+    res = handler(word, lang)
+    if "sconj" in res:
+        return res
+    else:
+        return res+flag+stringify_feats(aspect, mood,
+                                        tense, polarity, polite,
+                                        number[:1]+person)
 def handler__actual_PUNCT(word, lang=None):
     # actual punctuation handler
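
The control flow added to `handler__VERB` can be sketched on its own. Everything below is a stand-in: `format_verb`, the `-` separator, and the sample strings are assumptions made for illustration, since the real `handler` and `stringify_feats` are not shown in this diff. Only the shape of the branch follows the hunk: if the handler output contains `sconj`, it is returned as-is with no verbal features appended.

```python
def format_verb(res: str, flag: str, feats: list) -> str:
    """Stand-in for the new branch in handler__VERB (hypothetical name and format)."""
    # A result retagged as sconj is returned without verbal features.
    if "sconj" in res:
        return res
    # Otherwise append the flag and the non-empty features.
    # Joining with "-" is an assumption, not the library's actual format.
    return res + flag + "-".join(f for f in feats if f)


if __name__ == "__main__":
    print(format_verb("sconj|たら", "-", ["Past"]))
    print(format_verb("verb|飲む", "-", ["", "Past"]))
```

The check is a substring test on the handler's string output, matching the `"sconj" in res` condition in the hunk.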
@@ -692,7 +697,7 @@ def morphoanalyze(doc: Document, retokenize:bool, status_hook:callable = None, *
     elif not any([i in ["hr", "zh", "zh-hans", "zh-hant", "ja", "ko",
                         "sl", "sr", "bg", "ru", "et", "hu",
-                        "eu", "el", "he", "af", "ga", "da"] for i in lang]):
+                        "eu", "el", "he", "af", "ga", "da", "ro"] for i in lang]):
         if "en" in lang:
             config["processors"]["mwt"] = "gum"
         else:
@@ -878,6 +883,7 @@ def morphoanalyze(doc: Document, retokenize:bool, status_hook:callable = None, *
                 retokenized_ut = retokenized_ut.replace(" >", ">")
                 retokenized_ut = retokenized_ut.replace("< ", "<")
                 retokenized_ut = retokenized_ut.replace(" :", ":")
+                retokenized_ut = retokenized_ut.replace("+ ,", "+,")
                 retokenized_ut = retokenized_ut.replace(": <", ": <")
                 retokenized_ut = retokenized_ut.replace(" ↑", "↑")
                 retokenized_ut = re.sub(r"@ ?w ?p", "@wp", retokenized_ut)

batchalign-0.7.3b14/batchalign/version ADDED

@@ -0,0 +1,3 @@
+0.7.3-beta.14
+July 6th, 2024
+UD Fixes

{batchalign-0.7.3b12 → batchalign-0.7.3b14/batchalign.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: batchalign
-Version: 0.7.3b12
+Version: 0.7.3b14
 Summary: Python Speech Language Sample Analysis
 Author: Brian MacWhinney, Houjun Liu
 Author-email: macw@cmu.edu, houjun@cmu.edu
@@ -82,7 +82,7 @@ The TalkBank Project, of which Batchalign is a part, is supported by NIH grant H
 ## Quick Start
-The following instructions is a quick start to install Batchalign. For most users aiming to process CHAT and audio with Batchalign, we recommend more detailed usage instructions: for [usage](https://talkbank.org/info/BA2-usage.pdf) and [human transcript cleanup](https://talkbank.org/info/BA2-cleanup.pdf). The following provides a quick start guide for the program.
+The following instructions provide a quick start to installing Batchalign. For most users aiming to process CHAT and audio with Batchalign, we recommend more detailed usage instructions: for [usage](https://talkbank.org/info/BA2-usage.pdf) and [human transcript cleanup](https://talkbank.org/info/BA2-cleanup.pdf). The following provides a quick start guide for the program.
 ### Get Python
 - We support Python versions 3.9, 3.10, and 3.11.
@@ -112,7 +112,7 @@ py -m pip3 install -U batchalign
 ```
 ### Rock and Roll
-There are two main ways of interacting with Batchalign. Batchalign can be used as a program to batch-process CHAT (hence the name), or a Python LSA library.
+There are two main ways of interacting with Batchalign. Batchalign can be used as a program to batch-process CHAT (hence the name), or as a Python LSA library.
 - to get started with the Batchalign program, [tap here](#quick-start-command-line)
 - to get started on the Batchalign Library (assumes familiarity with Python), [tap here](#quick-start-python)
@@ -121,7 +121,7 @@ There are two main ways of interacting with Batchalign. Batchalign can be used a
 ### Basic Usage
-Once installed, you can invoke the Batchalign CLI program via the `batchalign` command.
+Once installed, you can invoke the Batchalign program by typing `batchalign` into the Terminal (MacOS) or Command Prompt (Windows).
 It is used in the following basic way:
@@ -131,9 +131,9 @@ batchalign [verb] [input_dir] [output_dir]
 Where `verb` includes:
-1. `transcribe` - placing only an audio of video file (`.mp3/.mp4/.wav`) in the input directory, perform ASR on the audio, diarizes utterances, identifies some basic conversational features like retracing and filled pauses, and generate word-level alignments. You must supply a language code flag: `--lang=[three letter ISO language code]` for the ASR system to know what language the transcript is in. You can choose the flags `--rev` to use Rev.AI, a commercial ASR service, or `--whisper`, to use a local copy of OpenAI Whisper.
-2. `align` - placing both an audio of video file (`.mp3/.mp4/.wav`) and an *utterance-aligned* CHAT file in the input directory, generate word-level alignments
-3. `morphotag` - placing a CHAT file in the input directory, uses Stanford NLP Stanza to generate morphological and dependency analyses. You must supply a language code flag: `--lang=[three letter ISO language code]` for the alignment system to know what language the transcript is in.
+1. `transcribe` - by placing only an audio of video file (`.mp3/.mp4/.wav`) in the input directory, this function performs ASR on the audio, diarizes utterances, identifies some basic conversational features like retracing and filled pauses, and generates word-level alignments. You must supply a language code flag: `--lang=[three letter ISO language code]` for the ASR system to know what language the transcript is in. You can choose the flags `--rev` to use Rev.AI, a commercial ASR service, or `--whisper`, to use a local copy of OpenAI Whisper.
+2. `align` - by placing both an audio of video file (`.mp3/.mp4/.wav`) and an *utterance-aligned* CHAT file in the input directory, this function recovers utterance-level time alignments (if they are not already annotated) and generates word-level alignments. The @Languages header in the CHAT file tells the program which language is in the transcript.
+3. `morphotag` - by placing a CHAT file in the input directory, this function uses Stanford NLP Stanza to generate morphological and dependency analyses. The @Languages header in the CHAT file tells the program which language is in the transcript. You must supply a language code flag: `--lang=[three letter ISO language code]` for the alignment system to know what language the transcript is in.
 <!-- 4. `bulletize` - placing both an audio of video file (`.mp3/.mp4/.wav`) and an *unlinked* CHAT file in the input directory, generate utterance-level alignments through ASR -->
 You can get a CHAT transcript to experiment with [at the TalkBank website](https://talkbank.org/), under any of the "Banks" that are available. You can also generate and parse a CHAT transcript via [the Python program](https://github.com/TalkBank/batchalign2?tab=readme-ov-file#chat).

batchalign-0.7.3b12/batchalign/pipelines/morphosyntax/ja/verbforms.py DELETED

@@ -1,34 +0,0 @@
-"""
-verbforms.py
-Fix Japanese verb forms.
-"""
-def verbform(upos, target, text):
-    if text == "な" and upos == "part":
-        return "aux", "うな"
-    if text == "呼ん":
-        return upos, "呼ん"
-    if text == "たり":
-        return "aux", "たり"
-    if text == "たら":
-        return "sconj", "たら"
-    if text == "たっ":
-        return "sconj", "たって"
-    if text == "て" and upos == "sconj":
-        return "aux", "て"
-    if text == "なさい" and target == "為さる":
-        return "aux", "為さい"
-    if text == "な" and upos == "part":
-        return "aux", "な"
-    if text == "脱" and upos == "noun":
-        return "verb", "脱"
-    if text == "よう" and upos == "aux":
-        return "aux", "よう"
-    if text == "ろ" and upos == "aux" and target == "為る":
-        return "aux", "ろ"
-    if upos == "verb" and "る" in target:
-        return "verb", target.replace("る","").strip()
-    return upos,target