@huggingface/tasks 0.13.1-test → 0.13.1-test2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/package.json +4 -2
- package/src/dataset-libraries.ts +89 -0
- package/src/default-widget-inputs.ts +718 -0
- package/src/gguf.ts +40 -0
- package/src/hardware.ts +482 -0
- package/src/index.ts +59 -0
- package/src/library-to-tasks.ts +76 -0
- package/src/local-apps.ts +412 -0
- package/src/model-data.ts +149 -0
- package/src/model-libraries-downloads.ts +18 -0
- package/src/model-libraries-snippets.ts +1128 -0
- package/src/model-libraries.ts +820 -0
- package/src/pipelines.ts +698 -0
- package/src/snippets/common.ts +39 -0
- package/src/snippets/curl.spec.ts +94 -0
- package/src/snippets/curl.ts +120 -0
- package/src/snippets/index.ts +7 -0
- package/src/snippets/inputs.ts +167 -0
- package/src/snippets/js.spec.ts +148 -0
- package/src/snippets/js.ts +305 -0
- package/src/snippets/python.spec.ts +144 -0
- package/src/snippets/python.ts +321 -0
- package/src/snippets/types.ts +16 -0
- package/src/tasks/audio-classification/about.md +86 -0
- package/src/tasks/audio-classification/data.ts +81 -0
- package/src/tasks/audio-classification/inference.ts +52 -0
- package/src/tasks/audio-classification/spec/input.json +35 -0
- package/src/tasks/audio-classification/spec/output.json +11 -0
- package/src/tasks/audio-to-audio/about.md +56 -0
- package/src/tasks/audio-to-audio/data.ts +70 -0
- package/src/tasks/automatic-speech-recognition/about.md +90 -0
- package/src/tasks/automatic-speech-recognition/data.ts +82 -0
- package/src/tasks/automatic-speech-recognition/inference.ts +160 -0
- package/src/tasks/automatic-speech-recognition/spec/input.json +35 -0
- package/src/tasks/automatic-speech-recognition/spec/output.json +38 -0
- package/src/tasks/chat-completion/inference.ts +322 -0
- package/src/tasks/chat-completion/spec/input.json +350 -0
- package/src/tasks/chat-completion/spec/output.json +206 -0
- package/src/tasks/chat-completion/spec/stream_output.json +213 -0
- package/src/tasks/common-definitions.json +100 -0
- package/src/tasks/depth-estimation/about.md +45 -0
- package/src/tasks/depth-estimation/data.ts +70 -0
- package/src/tasks/depth-estimation/inference.ts +35 -0
- package/src/tasks/depth-estimation/spec/input.json +25 -0
- package/src/tasks/depth-estimation/spec/output.json +16 -0
- package/src/tasks/document-question-answering/about.md +53 -0
- package/src/tasks/document-question-answering/data.ts +85 -0
- package/src/tasks/document-question-answering/inference.ts +110 -0
- package/src/tasks/document-question-answering/spec/input.json +85 -0
- package/src/tasks/document-question-answering/spec/output.json +36 -0
- package/src/tasks/feature-extraction/about.md +72 -0
- package/src/tasks/feature-extraction/data.ts +57 -0
- package/src/tasks/feature-extraction/inference.ts +40 -0
- package/src/tasks/feature-extraction/spec/input.json +47 -0
- package/src/tasks/feature-extraction/spec/output.json +15 -0
- package/src/tasks/fill-mask/about.md +51 -0
- package/src/tasks/fill-mask/data.ts +79 -0
- package/src/tasks/fill-mask/inference.ts +62 -0
- package/src/tasks/fill-mask/spec/input.json +38 -0
- package/src/tasks/fill-mask/spec/output.json +29 -0
- package/src/tasks/image-classification/about.md +50 -0
- package/src/tasks/image-classification/data.ts +88 -0
- package/src/tasks/image-classification/inference.ts +52 -0
- package/src/tasks/image-classification/spec/input.json +35 -0
- package/src/tasks/image-classification/spec/output.json +11 -0
- package/src/tasks/image-feature-extraction/about.md +23 -0
- package/src/tasks/image-feature-extraction/data.ts +59 -0
- package/src/tasks/image-segmentation/about.md +63 -0
- package/src/tasks/image-segmentation/data.ts +99 -0
- package/src/tasks/image-segmentation/inference.ts +69 -0
- package/src/tasks/image-segmentation/spec/input.json +45 -0
- package/src/tasks/image-segmentation/spec/output.json +26 -0
- package/src/tasks/image-text-to-text/about.md +76 -0
- package/src/tasks/image-text-to-text/data.ts +102 -0
- package/src/tasks/image-to-3d/about.md +62 -0
- package/src/tasks/image-to-3d/data.ts +75 -0
- package/src/tasks/image-to-image/about.md +129 -0
- package/src/tasks/image-to-image/data.ts +101 -0
- package/src/tasks/image-to-image/inference.ts +68 -0
- package/src/tasks/image-to-image/spec/input.json +55 -0
- package/src/tasks/image-to-image/spec/output.json +12 -0
- package/src/tasks/image-to-text/about.md +61 -0
- package/src/tasks/image-to-text/data.ts +82 -0
- package/src/tasks/image-to-text/inference.ts +143 -0
- package/src/tasks/image-to-text/spec/input.json +34 -0
- package/src/tasks/image-to-text/spec/output.json +14 -0
- package/src/tasks/index.ts +312 -0
- package/src/tasks/keypoint-detection/about.md +57 -0
- package/src/tasks/keypoint-detection/data.ts +50 -0
- package/src/tasks/mask-generation/about.md +65 -0
- package/src/tasks/mask-generation/data.ts +55 -0
- package/src/tasks/object-detection/about.md +37 -0
- package/src/tasks/object-detection/data.ts +86 -0
- package/src/tasks/object-detection/inference.ts +75 -0
- package/src/tasks/object-detection/spec/input.json +31 -0
- package/src/tasks/object-detection/spec/output.json +50 -0
- package/src/tasks/placeholder/about.md +15 -0
- package/src/tasks/placeholder/data.ts +21 -0
- package/src/tasks/placeholder/spec/input.json +35 -0
- package/src/tasks/placeholder/spec/output.json +17 -0
- package/src/tasks/question-answering/about.md +56 -0
- package/src/tasks/question-answering/data.ts +75 -0
- package/src/tasks/question-answering/inference.ts +99 -0
- package/src/tasks/question-answering/spec/input.json +67 -0
- package/src/tasks/question-answering/spec/output.json +29 -0
- package/src/tasks/reinforcement-learning/about.md +167 -0
- package/src/tasks/reinforcement-learning/data.ts +75 -0
- package/src/tasks/sentence-similarity/about.md +97 -0
- package/src/tasks/sentence-similarity/data.ts +101 -0
- package/src/tasks/sentence-similarity/inference.ts +32 -0
- package/src/tasks/sentence-similarity/spec/input.json +40 -0
- package/src/tasks/sentence-similarity/spec/output.json +12 -0
- package/src/tasks/summarization/about.md +58 -0
- package/src/tasks/summarization/data.ts +76 -0
- package/src/tasks/summarization/inference.ts +57 -0
- package/src/tasks/summarization/spec/input.json +42 -0
- package/src/tasks/summarization/spec/output.json +14 -0
- package/src/tasks/table-question-answering/about.md +43 -0
- package/src/tasks/table-question-answering/data.ts +59 -0
- package/src/tasks/table-question-answering/inference.ts +61 -0
- package/src/tasks/table-question-answering/spec/input.json +44 -0
- package/src/tasks/table-question-answering/spec/output.json +40 -0
- package/src/tasks/tabular-classification/about.md +65 -0
- package/src/tasks/tabular-classification/data.ts +68 -0
- package/src/tasks/tabular-regression/about.md +87 -0
- package/src/tasks/tabular-regression/data.ts +57 -0
- package/src/tasks/text-classification/about.md +173 -0
- package/src/tasks/text-classification/data.ts +103 -0
- package/src/tasks/text-classification/inference.ts +51 -0
- package/src/tasks/text-classification/spec/input.json +35 -0
- package/src/tasks/text-classification/spec/output.json +11 -0
- package/src/tasks/text-generation/about.md +154 -0
- package/src/tasks/text-generation/data.ts +114 -0
- package/src/tasks/text-generation/inference.ts +200 -0
- package/src/tasks/text-generation/spec/input.json +219 -0
- package/src/tasks/text-generation/spec/output.json +179 -0
- package/src/tasks/text-generation/spec/stream_output.json +103 -0
- package/src/tasks/text-to-3d/about.md +62 -0
- package/src/tasks/text-to-3d/data.ts +56 -0
- package/src/tasks/text-to-audio/inference.ts +143 -0
- package/src/tasks/text-to-audio/spec/input.json +31 -0
- package/src/tasks/text-to-audio/spec/output.json +17 -0
- package/src/tasks/text-to-image/about.md +96 -0
- package/src/tasks/text-to-image/data.ts +100 -0
- package/src/tasks/text-to-image/inference.ts +75 -0
- package/src/tasks/text-to-image/spec/input.json +63 -0
- package/src/tasks/text-to-image/spec/output.json +13 -0
- package/src/tasks/text-to-speech/about.md +63 -0
- package/src/tasks/text-to-speech/data.ts +79 -0
- package/src/tasks/text-to-speech/inference.ts +145 -0
- package/src/tasks/text-to-speech/spec/input.json +31 -0
- package/src/tasks/text-to-speech/spec/output.json +7 -0
- package/src/tasks/text-to-video/about.md +41 -0
- package/src/tasks/text-to-video/data.ts +102 -0
- package/src/tasks/text2text-generation/inference.ts +55 -0
- package/src/tasks/text2text-generation/spec/input.json +55 -0
- package/src/tasks/text2text-generation/spec/output.json +14 -0
- package/src/tasks/token-classification/about.md +76 -0
- package/src/tasks/token-classification/data.ts +92 -0
- package/src/tasks/token-classification/inference.ts +85 -0
- package/src/tasks/token-classification/spec/input.json +65 -0
- package/src/tasks/token-classification/spec/output.json +37 -0
- package/src/tasks/translation/about.md +65 -0
- package/src/tasks/translation/data.ts +70 -0
- package/src/tasks/translation/inference.ts +67 -0
- package/src/tasks/translation/spec/input.json +50 -0
- package/src/tasks/translation/spec/output.json +14 -0
- package/src/tasks/unconditional-image-generation/about.md +50 -0
- package/src/tasks/unconditional-image-generation/data.ts +72 -0
- package/src/tasks/video-classification/about.md +37 -0
- package/src/tasks/video-classification/data.ts +84 -0
- package/src/tasks/video-classification/inference.ts +59 -0
- package/src/tasks/video-classification/spec/input.json +42 -0
- package/src/tasks/video-classification/spec/output.json +10 -0
- package/src/tasks/video-text-to-text/about.md +98 -0
- package/src/tasks/video-text-to-text/data.ts +66 -0
- package/src/tasks/visual-question-answering/about.md +48 -0
- package/src/tasks/visual-question-answering/data.ts +97 -0
- package/src/tasks/visual-question-answering/inference.ts +62 -0
- package/src/tasks/visual-question-answering/spec/input.json +41 -0
- package/src/tasks/visual-question-answering/spec/output.json +21 -0
- package/src/tasks/zero-shot-classification/about.md +40 -0
- package/src/tasks/zero-shot-classification/data.ts +70 -0
- package/src/tasks/zero-shot-classification/inference.ts +67 -0
- package/src/tasks/zero-shot-classification/spec/input.json +50 -0
- package/src/tasks/zero-shot-classification/spec/output.json +11 -0
- package/src/tasks/zero-shot-image-classification/about.md +75 -0
- package/src/tasks/zero-shot-image-classification/data.ts +84 -0
- package/src/tasks/zero-shot-image-classification/inference.ts +61 -0
- package/src/tasks/zero-shot-image-classification/spec/input.json +45 -0
- package/src/tasks/zero-shot-image-classification/spec/output.json +10 -0
- package/src/tasks/zero-shot-object-detection/about.md +45 -0
- package/src/tasks/zero-shot-object-detection/data.ts +67 -0
- package/src/tasks/zero-shot-object-detection/inference.ts +66 -0
- package/src/tasks/zero-shot-object-detection/spec/input.json +40 -0
- package/src/tasks/zero-shot-object-detection/spec/output.json +47 -0
- package/src/tokenizer-data.ts +32 -0
- package/src/widget-example.ts +125 -0
@@ -0,0 +1,63 @@
## Use Cases

Text-to-Speech (TTS) models can be used in any speech-enabled application that requires converting text to speech that imitates the human voice.

### Voice Assistants

TTS models are used to create voice assistants on smart devices. These models are a better alternative to concatenative methods, in which the assistant is built by recording and mapping sounds, since TTS outputs contain elements of natural speech such as emphasis.

### Announcement Systems

TTS models are widely used in airport and public transportation announcement systems to convert the text of an announcement into speech.

## Inference Endpoints

The Hub contains over [1500 TTS models](https://huggingface.co/models?pipeline_tag=text-to-speech&sort=downloads) that you can use right away by trying out the widgets directly in the browser or calling the models as a service using Inference Endpoints. Here is a simple code snippet to get you started:

```python
import requests

API_TOKEN = "hf_..."  # your Hugging Face access token
API_URL = "https://api-inference.huggingface.co/models/microsoft/speecht5_tts"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response

output = query({"text_inputs": "Max is the best doggo."})
```

You can also use libraries such as [espnet](https://huggingface.co/models?library=espnet&pipeline_tag=text-to-speech&sort=downloads) or [transformers](https://huggingface.co/models?pipeline_tag=text-to-speech&library=transformers&sort=trending) if you want to handle inference directly.

## Direct Inference

You can also use the text-to-speech pipeline in Transformers to synthesize high-quality speech.

```python
from transformers import pipeline

synthesizer = pipeline("text-to-speech", "suno/bark")

synthesizer("Look I am generating speech in three lines of code!")
```

You can use [huggingface.js](https://github.com/huggingface/huggingface.js) to run inference with text-to-speech models on the Hugging Face Hub.

```javascript
import { HfInference } from "@huggingface/inference";

const inference = new HfInference(HF_TOKEN);
await inference.textToSpeech({
  model: "facebook/mms-tts",
  inputs: "text to generate speech from",
});
```

## Useful Resources

- [Hugging Face Audio Course](https://huggingface.co/learn/audio-course/chapter6/introduction)
- [ML for Audio Study Group - Text to Speech Deep Dive](https://www.youtube.com/watch?v=aLBedWj-5CQ)
- [Speech Synthesis, Recognition, and More With SpeechT5](https://huggingface.co/blog/speecht5)
- [Optimizing a Text-To-Speech model using 🤗 Transformers](https://huggingface.co/blog/optimizing-bark)
- [Train your own TTS models with Parler-TTS](https://github.com/huggingface/parler-tts)
@@ -0,0 +1,79 @@
import type { TaskDataCustom } from "../index.js";

const taskData: TaskDataCustom = {
  canonicalId: "text-to-audio",
  datasets: [
    {
      description: "10K hours of multi-speaker English dataset.",
      id: "parler-tts/mls_eng_10k",
    },
    {
      description: "Multi-speaker English dataset.",
      id: "mythicinfinity/libritts_r",
    },
  ],
  demo: {
    inputs: [
      {
        label: "Input",
        content: "I love audio models on the Hub!",
        type: "text",
      },
    ],
    outputs: [
      {
        filename: "audio.wav",
        type: "audio",
      },
    ],
  },
  metrics: [
    {
      description: "The Mel Cepstral Distortion (MCD) metric is used to calculate the quality of generated speech.",
      id: "mel cepstral distortion",
    },
  ],
  models: [
    {
      description: "A powerful TTS model.",
      id: "parler-tts/parler-tts-large-v1",
    },
    {
      description: "A massively multi-lingual TTS model.",
      id: "coqui/XTTS-v2",
    },
    {
      description: "Robust TTS model.",
      id: "metavoiceio/metavoice-1B-v0.1",
    },
    {
      description: "A prompt based, powerful TTS model.",
      id: "parler-tts/parler_tts_mini_v0.1",
    },
  ],
  spaces: [
    {
      description: "An application for generating highly realistic, multilingual speech.",
      id: "suno/bark",
    },
    {
      description:
        "An application based on XTTS, a voice generation model that lets you clone voices into different languages.",
      id: "coqui/xtts",
    },
    {
      description: "An application that generates speech in different styles in English and Chinese.",
      id: "mrfakename/E2-F5-TTS",
    },
    {
      description: "An application that synthesizes speech for diverse speaker prompts.",
      id: "parler-tts/parler_tts_mini",
    },
  ],
  summary:
    "Text-to-Speech (TTS) is the task of generating natural sounding speech given text input. TTS models can be extended to have a single model that generates speech for multiple speakers and multiple languages.",
  widgetModels: ["suno/bark"],
  youtubeId: "NW62DpzJ274",
};

export default taskData;
@@ -0,0 +1,145 @@
/**
 * Inference code generated from the JSON schema spec in ./spec
 *
 * Using src/scripts/inference-codegen
 */

/**
 * Inputs for Text To Speech inference
 */
export interface TextToSpeechInput {
  /**
   * The input text data
   */
  inputs: string;
  /**
   * Additional inference parameters
   */
  parameters?: TextToSpeechParameters;
  [property: string]: unknown;
}

/**
 * Additional inference parameters
 *
 * Additional inference parameters for Text To Speech
 */
export interface TextToSpeechParameters {
  /**
   * Parametrization of the text generation process
   */
  generation_parameters?: GenerationParameters;
  [property: string]: unknown;
}

/**
 * Parametrization of the text generation process
 *
 * Ad-hoc parametrization of the text generation process
 */
export interface GenerationParameters {
  /**
   * Whether to use sampling instead of greedy decoding when generating new tokens.
   */
  do_sample?: boolean;
  /**
   * Controls the stopping condition for beam-based methods.
   */
  early_stopping?: EarlyStoppingUnion;
  /**
   * If set to float strictly between 0 and 1, only tokens with a conditional probability
   * greater than epsilon_cutoff will be sampled. In the paper, suggested values range from
   * 3e-4 to 9e-4, depending on the size of the model. See [Truncation Sampling as Language
   * Model Desmoothing](https://hf.co/papers/2210.15191) for more details.
   */
  epsilon_cutoff?: number;
  /**
   * Eta sampling is a hybrid of locally typical sampling and epsilon sampling. If set to
   * float strictly between 0 and 1, a token is only considered if it is greater than either
   * eta_cutoff or sqrt(eta_cutoff) * exp(-entropy(softmax(next_token_logits))). The latter
   * term is intuitively the expected next token probability, scaled by sqrt(eta_cutoff). In
   * the paper, suggested values range from 3e-4 to 2e-3, depending on the size of the model.
   * See [Truncation Sampling as Language Model Desmoothing](https://hf.co/papers/2210.15191)
   * for more details.
   */
  eta_cutoff?: number;
  /**
   * The maximum length (in tokens) of the generated text, including the input.
   */
  max_length?: number;
  /**
   * The maximum number of tokens to generate. Takes precedence over max_length.
   */
  max_new_tokens?: number;
  /**
   * The minimum length (in tokens) of the generated text, including the input.
   */
  min_length?: number;
  /**
   * The minimum number of tokens to generate. Takes precedence over min_length.
   */
  min_new_tokens?: number;
  /**
   * Number of groups to divide num_beams into in order to ensure diversity among different
   * groups of beams. See [this paper](https://hf.co/papers/1610.02424) for more details.
   */
  num_beam_groups?: number;
  /**
   * Number of beams to use for beam search.
   */
  num_beams?: number;
  /**
   * The value balances the model confidence and the degeneration penalty in contrastive
   * search decoding.
   */
  penalty_alpha?: number;
  /**
   * The value used to modulate the next token probabilities.
   */
  temperature?: number;
  /**
   * The number of highest probability vocabulary tokens to keep for top-k-filtering.
   */
  top_k?: number;
  /**
   * If set to float < 1, only the smallest set of most probable tokens with probabilities
   * that add up to top_p or higher are kept for generation.
   */
  top_p?: number;
  /**
   * Local typicality measures how similar the conditional probability of predicting a target
   * token next is to the expected conditional probability of predicting a random token next,
   * given the partial text already generated. If set to float < 1, the smallest set of the
   * most locally typical tokens with probabilities that add up to typical_p or higher are
   * kept for generation. See [this paper](https://hf.co/papers/2202.00666) for more details.
   */
  typical_p?: number;
  /**
   * Whether the model should use the past last key/values attentions to speed up decoding
   */
  use_cache?: boolean;
  [property: string]: unknown;
}

/**
 * Controls the stopping condition for beam-based methods.
 */
export type EarlyStoppingUnion = boolean | "never";

/**
 * Outputs for Text to Speech inference
 *
 * Outputs of inference for the Text To Audio task
 */
export interface TextToSpeechOutput {
  /**
   * The generated audio waveform.
   */
  audio: unknown;
  samplingRate: unknown;
  /**
   * The sampling rate of the generated audio waveform.
   */
  sampling_rate?: number;
  [property: string]: unknown;
}
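These generated interfaces describe the JSON payload shape sent over the wire. A minimal sketch of building such a payload in Python (the nesting mirrors `TextToSpeechInput` → `parameters` → `generation_parameters` above; no endpoint is called here):

```python
def build_tts_payload(text, temperature=None):
    """Build a TextToSpeechInput-shaped payload.

    Optional generation settings nest under
    parameters.generation_parameters, mirroring the interfaces above.
    """
    payload = {"inputs": text}
    if temperature is not None:
        payload["parameters"] = {
            "generation_parameters": {"do_sample": True, "temperature": temperature},
        }
    return payload
```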
@@ -0,0 +1,31 @@
{
  "$id": "/inference/schemas/text-to-speech/input.json",
  "$schema": "http://json-schema.org/draft-06/schema#",
  "description": "Inputs for Text To Speech inference",
  "title": "TextToSpeechInput",
  "type": "object",
  "properties": {
    "inputs": {
      "description": "The input text data",
      "type": "string"
    },
    "parameters": {
      "description": "Additional inference parameters",
      "$ref": "#/$defs/TextToSpeechParameters"
    }
  },
  "$defs": {
    "TextToSpeechParameters": {
      "title": "TextToSpeechParameters",
      "description": "Additional inference parameters for Text To Speech",
      "type": "object",
      "properties": {
        "generation_parameters": {
          "description": "Parametrization of the text generation process",
          "$ref": "/inference/schemas/common-definitions.json#/definitions/GenerationParameters"
        }
      }
    }
  },
  "required": ["inputs"]
}
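The schema above has a single hard requirement: `inputs` must be present and be a string. A hand-rolled check that mirrors it (a sketch, not a full JSON Schema validator; a library such as `jsonschema` would be used in practice):

```python
def validate_tts_input(payload):
    """Mirror the schema's constraints by hand: return a list of error
    strings, empty when the payload is valid."""
    errors = []
    if "inputs" not in payload:
        errors.append("missing required property: inputs")
    elif not isinstance(payload["inputs"], str):
        errors.append("inputs must be a string")
    if "parameters" in payload and not isinstance(payload["parameters"], dict):
        errors.append("parameters must be an object")
    return errors
```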
@@ -0,0 +1,41 @@
## Use Cases

### Script-based Video Generation

Text-to-video models can be used to create short-form video content from a provided text script. These models can be used to create engaging and informative marketing videos. For example, a company could use a text-to-video model to create a video that explains how its product works.

### Content format conversion

Text-to-video models can be used to generate videos from long-form text, including blog posts, articles, and text files. They can also be used to create educational videos that are more engaging and interactive, for example a video that explains a complex concept from an article.

### Voice-overs and Speech

Text-to-video models can be used to create an AI newscaster that delivers the daily news, or to help a filmmaker create a short film or a music video.

## Task Variants

Text-to-video models have different variants based on their inputs and outputs.

### Text-to-video Editing

One text-to-video task is text-based video style and local attribute editing. Text-to-video editing models can make it easier to perform tasks like cropping, stabilization, color correction, resizing, and audio editing consistently.

### Text-to-video Search

Text-to-video search is the task of retrieving videos that are relevant to a given text query. This is challenging because video is a complex medium that can contain a lot of information. Combining semantic analysis to extract the meaning of the text query, visual analysis to extract features from the videos (such as the objects and actions present), and temporal analysis to categorize relationships between those objects and actions makes it possible to determine which videos are most likely to be relevant to the query.
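In practice, retrieval like this usually reduces to nearest-neighbor search over joint text–video embeddings (for instance from a CLIP-style model). A toy sketch, with made-up 2-D vectors standing in for real embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_videos(query_embedding, video_embeddings):
    """Return video ids, most similar to the text query embedding first."""
    return sorted(
        video_embeddings,
        key=lambda vid: cosine_similarity(query_embedding, video_embeddings[vid]),
        reverse=True,
    )
```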

### Text-driven Video Prediction

Text-driven video prediction is the task of generating a video sequence from a text description, which can be anything from a simple sentence to a detailed story. The goal is to generate a video that is both visually realistic and semantically consistent with the text description.

### Video Translation

Text-to-video translation models can translate videos from one language to another, or allow a multilingual text-video model to be queried with non-English sentences. This can be useful for people who want to watch videos in a language they don't understand, especially when multilingual captions are available for training.

## Inference

Contribute an inference snippet for text-to-video here!

## Useful Resources

In this area, you can insert useful resources about how to train or use a model for this task.

- [Text-to-Video: The Task, Challenges and the Current State](https://huggingface.co/blog/text-to-video)
@@ -0,0 +1,102 @@
import type { TaskDataCustom } from "../index.js";

const taskData: TaskDataCustom = {
  datasets: [
    {
      description: "Microsoft Research Video to Text is a large-scale dataset for open domain video captioning.",
      id: "iejMac/CLIP-MSR-VTT",
    },
    {
      description: "UCF101 Human Actions dataset consists of 13,320 video clips from YouTube, with 101 classes.",
      id: "quchenyuan/UCF101-ZIP",
    },
    {
      description: "A high-quality dataset for human action recognition in YouTube videos.",
      id: "nateraw/kinetics",
    },
    {
      description: "A dataset of video clips of humans performing pre-defined basic actions with everyday objects.",
      id: "HuggingFaceM4/something_something_v2",
    },
    {
      description:
        "This dataset consists of text-video pairs and contains noisy samples with irrelevant video descriptions.",
      id: "HuggingFaceM4/webvid",
    },
    {
      description: "A dataset of short Flickr videos for the temporal localization of events with descriptions.",
      id: "iejMac/CLIP-DiDeMo",
    },
  ],
  demo: {
    inputs: [
      {
        label: "Input",
        content: "Darth Vader is surfing on the waves.",
        type: "text",
      },
    ],
    outputs: [
      {
        filename: "text-to-video-output.gif",
        type: "img",
      },
    ],
  },
  metrics: [
    {
      description:
        "Inception Score uses an image classification model that predicts class labels and evaluates how distinct and diverse the images are. A higher score indicates better video generation.",
      id: "is",
    },
    {
      description:
        "Frechet Inception Distance uses an image classification model to obtain image embeddings. The metric compares the mean and standard deviation of the embeddings of real and generated images. A smaller score indicates better video generation.",
      id: "fid",
    },
    {
      description:
        "Frechet Video Distance uses a model that captures coherence for changes in frames and the quality of each frame. A smaller score indicates better video generation.",
      id: "fvd",
    },
    {
      description:
        "CLIPSIM measures similarity between video frames and text using an image-text similarity model. A higher score indicates better video generation.",
      id: "clipsim",
    },
  ],
  models: [
    {
      description: "A strong model for consistent video generation.",
      id: "rain1011/pyramid-flow-sd3",
    },
    {
      description: "A robust model for text-to-video generation.",
      id: "VideoCrafter/VideoCrafter2",
    },
    {
      description: "A cutting-edge text-to-video generation model.",
      id: "TIGER-Lab/T2V-Turbo-V2",
    },
  ],
  spaces: [
    {
      description: "An application that generates video from text.",
      id: "VideoCrafter/VideoCrafter",
    },
    {
      description: "Consistent video generation application.",
      id: "TIGER-Lab/T2V-Turbo-V2",
    },
    {
      description: "A cutting-edge video generation application.",
      id: "Pyramid-Flow/pyramid-flow",
    },
  ],
  summary:
    "Text-to-video models can be used in any application that requires generating a consistent sequence of images from text.",
  widgetModels: [],
  youtubeId: undefined,
};

export default taskData;
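The FID/FVD metrics described above compare the Gaussian statistics of real and generated embedding distributions. In the one-dimensional case the squared Fréchet distance between two Gaussians reduces to (μ₁ − μ₂)² + (σ₁ − σ₂)²; a toy sketch (the real metrics use the full multivariate form over learned embeddings):

```python
def frechet_distance_1d(mu1, sigma1, mu2, sigma2):
    """Squared Fréchet distance between two 1-D Gaussians,
    a toy scalar analogue of FID/FVD (lower is better)."""
    return (mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2
```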
@@ -0,0 +1,55 @@
/**
 * Inference code generated from the JSON schema spec in ./spec
 *
 * Using src/scripts/inference-codegen
 */

/**
 * Inputs for Text2text Generation inference
 */
export interface Text2TextGenerationInput {
  /**
   * The input text data
   */
  inputs: string;
  /**
   * Additional inference parameters
   */
  parameters?: Text2TextGenerationParameters;
  [property: string]: unknown;
}

/**
 * Additional inference parameters
 *
 * Additional inference parameters for Text2text Generation
 */
export interface Text2TextGenerationParameters {
  /**
   * Whether to clean up the potential extra spaces in the text output.
   */
  clean_up_tokenization_spaces?: boolean;
  /**
   * Additional parametrization of the text generation algorithm
   */
  generate_parameters?: { [key: string]: unknown };
  /**
   * The truncation strategy to use
   */
  truncation?: Text2TextGenerationTruncationStrategy;
  [property: string]: unknown;
}

export type Text2TextGenerationTruncationStrategy = "do_not_truncate" | "longest_first" | "only_first" | "only_second";

/**
 * Outputs of inference for the Text2text Generation task
 */
export interface Text2TextGenerationOutput {
  generatedText: unknown;
  /**
   * The generated text.
   */
  generated_text?: string;
  [property: string]: unknown;
}
@@ -0,0 +1,55 @@
+{
+	"$id": "/inference/schemas/text2text-generation/input.json",
+	"$schema": "http://json-schema.org/draft-06/schema#",
+	"description": "Inputs for Text2text Generation inference",
+	"title": "Text2TextGenerationInput",
+	"type": "object",
+	"properties": {
+		"inputs": {
+			"description": "The input text data",
+			"type": "string"
+		},
+		"parameters": {
+			"description": "Additional inference parameters",
+			"$ref": "#/$defs/Text2textGenerationParameters"
+		}
+	},
+	"$defs": {
+		"Text2textGenerationParameters": {
+			"title": "Text2textGenerationParameters",
+			"description": "Additional inference parameters for Text2text Generation",
+			"type": "object",
+			"properties": {
+				"clean_up_tokenization_spaces": {
+					"type": "boolean",
+					"description": "Whether to clean up the potential extra spaces in the text output."
+				},
+				"truncation": {
+					"title": "Text2textGenerationTruncationStrategy",
+					"type": "string",
+					"description": "The truncation strategy to use",
+					"oneOf": [
+						{
+							"const": "do_not_truncate"
+						},
+						{
+							"const": "longest_first"
+						},
+						{
+							"const": "only_first"
+						},
+						{
+							"const": "only_second"
+						}
+					]
+				},
+				"generate_parameters": {
+					"title": "generateParameters",
+					"type": "object",
+					"description": "Additional parametrization of the text generation algorithm"
+				}
+			}
+		}
+	},
+	"required": ["inputs"]
+}
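The schema above only constrains the request shape: `inputs` is required, and all entries under `parameters` are optional. As a quick illustrative sketch (the translation prompt is a made-up example, not from the package), a payload validating against this schema can be assembled and serialized like so:

```python
import json

# Hypothetical request payload matching Text2TextGenerationInput
# ("inputs" is required; "parameters" and its fields are optional).
payload = {
    "inputs": "Translate to German: hello",
    "parameters": {
        "clean_up_tokenization_spaces": True,
        "truncation": "do_not_truncate",  # one of the four allowed constants
    },
}

# Serialize for an HTTP request body.
body = json.dumps(payload)
print(body)
```

Note that `truncation` must be one of the four string constants enumerated in the schema; any other value would fail validation.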
@@ -0,0 +1,14 @@
+{
+	"$id": "/inference/schemas/text2text-generation/output.json",
+	"$schema": "http://json-schema.org/draft-06/schema#",
+	"description": "Outputs of inference for the Text2text Generation task",
+	"title": "Text2TextGenerationOutput",
+	"type": "object",
+	"properties": {
+		"generated_text": {
+			"type": "string",
+			"description": "The generated text."
+		}
+	},
+	"required": ["generated_text"]
+}
@@ -0,0 +1,76 @@
+## Use Cases
+
+### Information Extraction from Invoices
+
+You can automatically extract entities of interest from invoices using Named Entity Recognition (NER) models. Invoices can be read with Optical Character Recognition (OCR) models, and the resulting text can be passed to a NER model. In this way, important information such as the date, company name, and other named entities can be extracted.
+
+## Task Variants
+
+### Named Entity Recognition (NER)
+
+NER is the task of recognizing named entities in a text. These entities can be the names of people, locations, or organizations. The task is formulated as labeling each token with a class for each named entity and a class named "O" for tokens that do not belong to any entity. The input for this task is text and the output is the annotated text with named entities.
+
+#### Inference
+
+You can use the 🤗 Transformers library `ner` pipeline to infer with NER models.
+
+```python
+from transformers import pipeline
+
+classifier = pipeline("ner")
+classifier("Hello I'm Omar and I live in Zürich.")
+```
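The pipeline returns one prediction per token. To show how the "O" class and the per-entity classes fit together, here is a minimal, library-free sketch (the helper name and the example tags are invented for illustration, not part of the Transformers API) that groups BIO-tagged tokens into entity spans:

```python
# Illustrative sketch: merge token-level BIO tags into entity spans.
# B-X starts an entity of type X, I-X continues it, and "O" marks
# tokens outside any entity.
def group_entities(predictions):
    entities = []
    for word, tag in predictions:
        if tag.startswith("B-") or (tag.startswith("I-") and not entities):
            entities.append({"type": tag[2:], "text": word})
        elif tag.startswith("I-") and entities[-1]["type"] == tag[2:]:
            entities[-1]["text"] += " " + word
        # "O" tokens are skipped
    return entities

preds = [("Omar", "B-PER"), ("lives", "O"), ("in", "O"), ("Zürich", "B-LOC")]
print(group_entities(preds))
# → [{'type': 'PER', 'text': 'Omar'}, {'type': 'LOC', 'text': 'Zürich'}]
```

In practice the pipeline can do this grouping for you via its aggregation options; the sketch above only illustrates the labeling scheme.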
+
+### Part-of-Speech (PoS) Tagging
+
+In PoS tagging, the model recognizes parts of speech, such as nouns, pronouns, adjectives, or verbs, in a given text. The task is formulated as labeling each word with a part of speech.
+
+#### Inference
+
+You can use the 🤗 Transformers library `token-classification` pipeline with a PoS tagging model of your choice. The model will return a JSON with a PoS tag for each token.
+
+```python
+from transformers import pipeline
+
+classifier = pipeline("token-classification", model="vblagoje/bert-english-uncased-finetuned-pos")
+classifier("Hello I'm Omar and I live in Zürich.")
+```
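The exact fields depend on the model, but the pipeline returns one dictionary per token. As a hypothetical sketch of the shape (these values are hand-written for illustration, not actual model output):

```python
# Hypothetical shape of a token-classification PoS result
# (values invented for illustration).
result = [
    {"word": "hello", "entity": "INTJ", "score": 0.99, "index": 1},
    {"word": "omar", "entity": "PROPN", "score": 0.99, "index": 4},
]

# Each entry carries the token text, its predicted PoS tag,
# and a confidence score.
tags = [r["entity"] for r in result]
print(tags)
```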
+
+This is not limited to transformers! You can also use other libraries such as Stanza, spaCy, and Flair to do inference. Here is an example using a canonical [spaCy](https://hf.co/blog/spacy) model.
+
+```python
+!pip install https://huggingface.co/spacy/en_core_web_sm/resolve/main/en_core_web_sm-any-py3-none-any.whl
+
+import en_core_web_sm
+
+nlp = en_core_web_sm.load()
+doc = nlp("I'm Omar and I live in Zürich.")
+for token in doc:
+    print(token.text, token.pos_, token.dep_, token.ent_type_)
+
+## I PRON nsubj
+## 'm AUX ROOT
+## Omar PROPN attr PERSON
+## ...
+```
+
+## Useful Resources
+
+Would you like to learn more about token classification? Great! Here you can find some curated resources that you may find helpful!
+
+- [Course Chapter on Token Classification](https://huggingface.co/course/chapter7/2?fw=pt)
+- [Blog post: Welcome spaCy to the Hugging Face Hub](https://huggingface.co/blog/spacy)
+
+### Notebooks
+
+- [PyTorch](https://github.com/huggingface/notebooks/blob/master/examples/token_classification.ipynb)
+- [TensorFlow](https://github.com/huggingface/notebooks/blob/master/examples/token_classification-tf.ipynb)
+
+### Scripts for training
+
+- [PyTorch](https://github.com/huggingface/transformers/tree/main/examples/pytorch/token-classification)
+- [TensorFlow](https://github.com/huggingface/transformers/tree/main/examples/tensorflow)
+- [Flax](https://github.com/huggingface/transformers/tree/main/examples/flax/token-classification)
+
+### Documentation
+
+- [Token classification task guide](https://huggingface.co/docs/transformers/tasks/token_classification)