@huggingface/tasks 0.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (103)
  1. package/assets/audio-classification/audio.wav +0 -0
  2. package/assets/audio-to-audio/input.wav +0 -0
  3. package/assets/audio-to-audio/label-0.wav +0 -0
  4. package/assets/audio-to-audio/label-1.wav +0 -0
  5. package/assets/automatic-speech-recognition/input.flac +0 -0
  6. package/assets/automatic-speech-recognition/wav2vec2.png +0 -0
  7. package/assets/contribution-guide/anatomy.png +0 -0
  8. package/assets/contribution-guide/libraries.png +0 -0
  9. package/assets/depth-estimation/depth-estimation-input.jpg +0 -0
  10. package/assets/depth-estimation/depth-estimation-output.png +0 -0
  11. package/assets/document-question-answering/document-question-answering-input.png +0 -0
  12. package/assets/image-classification/image-classification-input.jpeg +0 -0
  13. package/assets/image-segmentation/image-segmentation-input.jpeg +0 -0
  14. package/assets/image-segmentation/image-segmentation-output.png +0 -0
  15. package/assets/image-to-image/image-to-image-input.jpeg +0 -0
  16. package/assets/image-to-image/image-to-image-output.png +0 -0
  17. package/assets/image-to-image/pix2pix_examples.jpg +0 -0
  18. package/assets/image-to-text/savanna.jpg +0 -0
  19. package/assets/object-detection/object-detection-input.jpg +0 -0
  20. package/assets/object-detection/object-detection-output.jpg +0 -0
  21. package/assets/table-question-answering/tableQA.jpg +0 -0
  22. package/assets/text-to-image/image.jpeg +0 -0
  23. package/assets/text-to-speech/audio.wav +0 -0
  24. package/assets/text-to-video/text-to-video-output.gif +0 -0
  25. package/assets/unconditional-image-generation/unconditional-image-generation-output.jpeg +0 -0
  26. package/assets/video-classification/video-classification-input.gif +0 -0
  27. package/assets/visual-question-answering/elephant.jpeg +0 -0
  28. package/assets/zero-shot-image-classification/image-classification-input.jpeg +0 -0
  29. package/dist/index.cjs +3105 -0
  30. package/dist/index.d.cts +145 -0
  31. package/dist/index.d.ts +145 -0
  32. package/dist/index.js +3079 -0
  33. package/package.json +35 -0
  34. package/src/Types.ts +58 -0
  35. package/src/audio-classification/about.md +85 -0
  36. package/src/audio-classification/data.ts +77 -0
  37. package/src/audio-to-audio/about.md +55 -0
  38. package/src/audio-to-audio/data.ts +63 -0
  39. package/src/automatic-speech-recognition/about.md +86 -0
  40. package/src/automatic-speech-recognition/data.ts +77 -0
  41. package/src/const.ts +51 -0
  42. package/src/conversational/about.md +50 -0
  43. package/src/conversational/data.ts +62 -0
  44. package/src/depth-estimation/about.md +38 -0
  45. package/src/depth-estimation/data.ts +52 -0
  46. package/src/document-question-answering/about.md +54 -0
  47. package/src/document-question-answering/data.ts +67 -0
  48. package/src/feature-extraction/about.md +35 -0
  49. package/src/feature-extraction/data.ts +57 -0
  50. package/src/fill-mask/about.md +51 -0
  51. package/src/fill-mask/data.ts +77 -0
  52. package/src/image-classification/about.md +48 -0
  53. package/src/image-classification/data.ts +88 -0
  54. package/src/image-segmentation/about.md +63 -0
  55. package/src/image-segmentation/data.ts +96 -0
  56. package/src/image-to-image/about.md +81 -0
  57. package/src/image-to-image/data.ts +97 -0
  58. package/src/image-to-text/about.md +58 -0
  59. package/src/image-to-text/data.ts +87 -0
  60. package/src/index.ts +2 -0
  61. package/src/object-detection/about.md +36 -0
  62. package/src/object-detection/data.ts +73 -0
  63. package/src/placeholder/about.md +15 -0
  64. package/src/placeholder/data.ts +18 -0
  65. package/src/question-answering/about.md +56 -0
  66. package/src/question-answering/data.ts +69 -0
  67. package/src/reinforcement-learning/about.md +176 -0
  68. package/src/reinforcement-learning/data.ts +78 -0
  69. package/src/sentence-similarity/about.md +97 -0
  70. package/src/sentence-similarity/data.ts +100 -0
  71. package/src/summarization/about.md +57 -0
  72. package/src/summarization/data.ts +72 -0
  73. package/src/table-question-answering/about.md +43 -0
  74. package/src/table-question-answering/data.ts +63 -0
  75. package/src/tabular-classification/about.md +67 -0
  76. package/src/tabular-classification/data.ts +69 -0
  77. package/src/tabular-regression/about.md +91 -0
  78. package/src/tabular-regression/data.ts +58 -0
  79. package/src/tasksData.ts +104 -0
  80. package/src/text-classification/about.md +171 -0
  81. package/src/text-classification/data.ts +90 -0
  82. package/src/text-generation/about.md +128 -0
  83. package/src/text-generation/data.ts +124 -0
  84. package/src/text-to-image/about.md +65 -0
  85. package/src/text-to-image/data.ts +88 -0
  86. package/src/text-to-speech/about.md +63 -0
  87. package/src/text-to-speech/data.ts +70 -0
  88. package/src/text-to-video/about.md +36 -0
  89. package/src/text-to-video/data.ts +97 -0
  90. package/src/token-classification/about.md +78 -0
  91. package/src/token-classification/data.ts +83 -0
  92. package/src/translation/about.md +65 -0
  93. package/src/translation/data.ts +68 -0
  94. package/src/unconditional-image-generation/about.md +45 -0
  95. package/src/unconditional-image-generation/data.ts +66 -0
  96. package/src/video-classification/about.md +53 -0
  97. package/src/video-classification/data.ts +84 -0
  98. package/src/visual-question-answering/about.md +43 -0
  99. package/src/visual-question-answering/data.ts +90 -0
  100. package/src/zero-shot-classification/about.md +39 -0
  101. package/src/zero-shot-classification/data.ts +66 -0
  102. package/src/zero-shot-image-classification/about.md +68 -0
  103. package/src/zero-shot-image-classification/data.ts +79 -0
package/package.json ADDED
@@ -0,0 +1,35 @@
+ {
+   "name": "@huggingface/tasks",
+   "version": "0.0.1",
+   "type": "module",
+   "description": "A library of tasks for the Hugging Face Hub",
+   "main": "./dist/index.js",
+   "module": "./dist/index.mjs",
+   "types": "./dist/index.d.ts",
+   "exports": {
+     ".": {
+       "types": "./dist/index.d.ts",
+       "require": "./dist/index.js",
+       "import": "./dist/index.mjs"
+     }
+   },
+   "scripts": {
+     "prepublishOnly": "npm run build",
+     "build": "tsup src/index.ts --format cjs,esm --clean --dts",
+     "test": "vitest run",
+     "type-check": "tsc"
+   },
+   "source": "src/index.ts",
+   "files": [
+     "assets",
+     "src",
+     "dist"
+   ],
+   "devDependencies": {
+     "tsup": "^7.3.0",
+     "typescript": "^5.2.2"
+   },
+   "publishConfig": {
+     "access": "public"
+   }
+ }
package/src/Types.ts ADDED
@@ -0,0 +1,58 @@
+ import type { ModelLibraryKey } from "../../js/src/lib/interfaces/Libraries";
+ import type { PipelineType } from "../../js/src/lib/interfaces/Types";
+
+ export interface ExampleRepo {
+   description: string;
+   id: string;
+ }
+
+ export type TaskDemoEntry = {
+   filename: string;
+   type: "audio";
+ } | {
+   data: Array<{
+     label: string;
+     score: number;
+   }>;
+   type: "chart";
+ } | {
+   filename: string;
+   type: "img";
+ } | {
+   table: string[][];
+   type: "tabular";
+ } | {
+   content: string;
+   label: string;
+   type: "text";
+ } | {
+   text: string;
+   tokens: Array<{
+     end: number;
+     start: number;
+     type: string;
+   }>;
+   type: "text-with-tokens";
+ };
+
+ export interface TaskDemo {
+   inputs: TaskDemoEntry[];
+   outputs: TaskDemoEntry[];
+ }
+
+ export interface TaskData {
+   datasets: ExampleRepo[];
+   demo: TaskDemo;
+   id: PipelineType;
+   isPlaceholder?: boolean;
+   label: string;
+   libraries: ModelLibraryKey[];
+   metrics: ExampleRepo[];
+   models: ExampleRepo[];
+   spaces: ExampleRepo[];
+   summary: string;
+   widgetModels: string[];
+   youtubeId?: string;
+ }
+
+ export type TaskDataCustom = Omit<TaskData, "id" | "label" | "libraries">;
package/src/audio-classification/about.md ADDED
@@ -0,0 +1,85 @@
+ ## Use Cases
+
+ ### Command Recognition
+
+ Command recognition or keyword spotting classifies utterances into a predefined set of commands. This is often done on-device for fast response times.
+
+ As an example, using the Google Speech Commands dataset, given an input audio clip, a model can classify which of the following commands the user is saying:
+
+ ```
+ 'yes', 'no', 'up', 'down', 'left', 'right', 'on', 'off', 'stop', 'go', 'unknown', 'silence'
+ ```
+
+ SpeechBrain models can easily perform this task with just a couple of lines of code!
+
+ ```python
+ from speechbrain.pretrained import EncoderClassifier
+ model = EncoderClassifier.from_hparams(
+     "speechbrain/google_speech_command_xvector"
+ )
+ model.classify_file("file.wav")
+ ```
+
+ ### Language Identification
+
+ Datasets such as VoxLingua107 allow anyone to train language identification models for up to 107 languages! This can be extremely useful as a preprocessing step for other systems. Here's an example [model](https://huggingface.co/TalTechNLP/voxlingua107-epaca-tdnn) trained on VoxLingua107.
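+
+ As a rough sketch (assuming the same SpeechBrain `EncoderClassifier` interface shown above and the `TalTechNLP/voxlingua107-epaca-tdnn` checkpoint), language identification can be run the same way:
+
+ ```python
+ from speechbrain.pretrained import EncoderClassifier
+
+ # Load the VoxLingua107 language-identification model
+ lang_id = EncoderClassifier.from_hparams(
+     source="TalTechNLP/voxlingua107-epaca-tdnn"
+ )
+ # Returns class probabilities, the best score, and the predicted language label
+ prediction = lang_id.classify_file("speech.wav")
+ print(prediction)
+ ```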
+
+ ### Emotion Recognition
+
+ Emotion recognition classifies a speech recording according to the emotion expressed in it. In addition to trying the widgets, you can use the Inference API to perform audio classification. Here is a simple example that uses a [HuBERT](https://huggingface.co/superb/hubert-large-superb-er) model fine-tuned for this task.
+
+ ```python
+ import json
+ import requests
+
+ headers = {"Authorization": f"Bearer {API_TOKEN}"}
+ API_URL = "https://api-inference.huggingface.co/models/superb/hubert-large-superb-er"
+
+ def query(filename):
+     with open(filename, "rb") as f:
+         data = f.read()
+     response = requests.request("POST", API_URL, headers=headers, data=data)
+     return json.loads(response.content.decode("utf-8"))
+
+ data = query("sample1.flac")
+ # [{'label': 'neu', 'score': 0.60},
+ # {'label': 'hap', 'score': 0.20},
+ # {'label': 'ang', 'score': 0.13},
+ # {'label': 'sad', 'score': 0.07}]
+ ```
+
+ You can use [huggingface.js](https://github.com/huggingface/huggingface.js) to infer with audio classification models on the Hugging Face Hub.
+
+ ```javascript
+ import { HfInference } from "@huggingface/inference";
+
+ const inference = new HfInference(HF_ACCESS_TOKEN);
+ await inference.audioClassification({
+   data: await (await fetch("sample.flac")).blob(),
+   model: "facebook/mms-lid-126",
+ })
+ ```
+
+ ### Speaker Identification
+
+ Speaker Identification classifies an audio clip according to the identity of the person speaking, usually from a predefined set of speakers. You can try out this task with [this model](https://huggingface.co/superb/wav2vec2-base-superb-sid). A useful dataset for this task is VoxCeleb1.
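+
+ A minimal sketch (assuming the `superb/wav2vec2-base-superb-sid` checkpoint and the 🤗 Transformers `audio-classification` pipeline) could look like this:
+
+ ```python
+ from transformers import pipeline
+
+ # Speaker identification is audio classification with speaker IDs as labels
+ classifier = pipeline("audio-classification", model="superb/wav2vec2-base-superb-sid")
+ # Returns the most likely speaker labels with their scores
+ predictions = classifier("speech.wav", top_k=5)
+ print(predictions)
+ ```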
+
+ ## Solving audio classification for your own data
+
+ We have some great news! You can use fine-tuning (transfer learning) to train a well-performing model without requiring as much data. Pretrained models such as Wav2Vec2 and HuBERT exist. [Facebook's Wav2Vec2 XLS-R model](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/) is a large multilingual model trained on 128 languages with 436K hours of speech.
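+
+ A minimal fine-tuning sketch (assuming a labeled audio dataset and the 🤗 Transformers `Trainer` API; `my_dataset` is a placeholder for your own preprocessed data):
+
+ ```python
+ from transformers import (
+     AutoFeatureExtractor,
+     AutoModelForAudioClassification,
+     Trainer,
+     TrainingArguments,
+ )
+
+ checkpoint = "facebook/wav2vec2-base"
+ feature_extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
+ model = AutoModelForAudioClassification.from_pretrained(checkpoint, num_labels=12)
+
+ # my_dataset is assumed to yield {"input_values": ..., "label": ...} examples
+ trainer = Trainer(
+     model=model,
+     args=TrainingArguments(output_dir="wav2vec2-commands", num_train_epochs=3),
+     train_dataset=my_dataset["train"],
+     eval_dataset=my_dataset["validation"],
+ )
+ trainer.train()
+ ```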
+
+ ## Useful Resources
+
+ Would you like to learn more about the topic? Awesome! Here you can find some curated resources that you may find helpful!
+
+ ### Notebooks
+
+ - [PyTorch](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/audio_classification.ipynb)
+
+ ### Scripts for training
+
+ - [PyTorch](https://github.com/huggingface/transformers/tree/main/examples/pytorch/audio-classification)
+
+ ### Documentation
+
+ - [Audio classification task guide](https://huggingface.co/docs/transformers/tasks/audio_classification)
package/src/audio-classification/data.ts ADDED
@@ -0,0 +1,77 @@
+ import type { TaskDataCustom } from "../Types";
+
+ const taskData: TaskDataCustom = {
+   datasets: [
+     {
+       description: "A benchmark of 10 different audio tasks.",
+       id: "superb",
+     },
+   ],
+   demo: {
+     inputs: [
+       {
+         filename: "audio.wav",
+         type: "audio",
+       },
+     ],
+     outputs: [
+       {
+         data: [
+           {
+             label: "Up",
+             score: 0.2,
+           },
+           {
+             label: "Down",
+             score: 0.8,
+           },
+         ],
+         type: "chart",
+       },
+     ],
+   },
+   metrics: [
+     {
+       description: "",
+       id: "accuracy",
+     },
+     {
+       description: "",
+       id: "recall",
+     },
+     {
+       description: "",
+       id: "precision",
+     },
+     {
+       description: "",
+       id: "f1",
+     },
+   ],
+   models: [
+     {
+       description: "An easy-to-use model for Command Recognition.",
+       id: "speechbrain/google_speech_command_xvector",
+     },
+     {
+       description: "An Emotion Recognition model.",
+       id: "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition",
+     },
+     {
+       description: "A language identification model.",
+       id: "facebook/mms-lid-126",
+     },
+   ],
+   spaces: [
+     {
+       description: "An application that can predict the language spoken in a given audio.",
+       id: "akhaliq/Speechbrain-audio-classification",
+     },
+   ],
+   summary:
+     "Audio classification is the task of assigning a label or class to a given audio. It can be used for recognizing which command a user is giving or the emotion of a statement, as well as identifying a speaker.",
+   widgetModels: ["facebook/mms-lid-126"],
+   youtubeId: "KWwzcmG98Ds",
+ };
+
+ export default taskData;
package/src/audio-to-audio/about.md ADDED
@@ -0,0 +1,55 @@
+ ## Use Cases
+
+ ### Speech Enhancement (Noise removal)
+
+ Speech Enhancement is a bit self-explanatory: it improves (or enhances) the quality of an audio signal by removing noise. There are multiple libraries to solve this task, such as SpeechBrain, Asteroid, and ESPnet. Here is a simple example using SpeechBrain:
+
+ ```python
+ from speechbrain.pretrained import SpectralMaskEnhancement
+ model = SpectralMaskEnhancement.from_hparams(
+     "speechbrain/mtl-mimic-voicebank"
+ )
+ model.enhance_file("file.wav")
+ ```
+
+ Alternatively, you can use the [Inference API](https://huggingface.co/inference-api) to solve this task:
+
+ ```python
+ import json
+ import requests
+
+ headers = {"Authorization": f"Bearer {API_TOKEN}"}
+ API_URL = "https://api-inference.huggingface.co/models/speechbrain/mtl-mimic-voicebank"
+
+ def query(filename):
+     with open(filename, "rb") as f:
+         data = f.read()
+     response = requests.request("POST", API_URL, headers=headers, data=data)
+     return json.loads(response.content.decode("utf-8"))
+
+ data = query("sample1.flac")
+ ```
+
+ You can use [huggingface.js](https://github.com/huggingface/huggingface.js) to infer with audio-to-audio models on the Hugging Face Hub.
+
+ ```javascript
+ import { HfInference } from "@huggingface/inference";
+
+ const inference = new HfInference(HF_ACCESS_TOKEN);
+ await inference.audioToAudio({
+   data: await (await fetch("sample.flac")).blob(),
+   model: "speechbrain/sepformer-wham",
+ })
+ ```
+
+ ### Audio Source Separation
+
+ Audio Source Separation allows you to isolate different sounds from individual sources. For example, if you have an audio file with multiple people speaking, you can get an audio file for each of them. You can then use an Automatic Speech Recognition system to extract the text from each of these sources as an initial step for your system!
+
+ Audio-to-Audio can also be used to remove noise from audio files: you get one audio for the person speaking and another audio for the noise. This can also be useful when you have multi-person audio with some noise: you can get one audio for each person and then one audio for the noise.
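+
+ As a rough sketch (assuming the SpeechBrain `SepformerSeparation` interface and the `speechbrain/sepformer-wham` checkpoint shown above, which operates at 8 kHz), source separation can be run locally like this:
+
+ ```python
+ import torchaudio
+ from speechbrain.pretrained import SepformerSeparation
+
+ model = SepformerSeparation.from_hparams(source="speechbrain/sepformer-wham")
+ # est_sources has one channel per separated speaker
+ est_sources = model.separate_file(path="mixture.wav")
+ torchaudio.save("speaker1.wav", est_sources[:, :, 0].detach().cpu(), 8000)
+ torchaudio.save("speaker2.wav", est_sources[:, :, 1].detach().cpu(), 8000)
+ ```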
+
+ ## Training a model for your own data
+
+ If you want to learn how to train models for the Audio-to-Audio task, we recommend the following tutorials:
+
+ - [Speech Enhancement](https://speechbrain.github.io/tutorial_enhancement.html)
+ - [Source Separation](https://speechbrain.github.io/tutorial_separation.html)
package/src/audio-to-audio/data.ts ADDED
@@ -0,0 +1,63 @@
+ import type { TaskDataCustom } from "../Types";
+
+ const taskData: TaskDataCustom = {
+   datasets: [
+     {
+       description: "512-element X-vector embeddings of speakers from CMU ARCTIC dataset.",
+       id: "Matthijs/cmu-arctic-xvectors",
+     },
+   ],
+   demo: {
+     inputs: [
+       {
+         filename: "input.wav",
+         type: "audio",
+       },
+     ],
+     outputs: [
+       {
+         filename: "label-0.wav",
+         type: "audio",
+       },
+       {
+         filename: "label-1.wav",
+         type: "audio",
+       },
+     ],
+   },
+   metrics: [
+     {
+       description: "The Signal-to-Noise ratio is the relationship between the target signal level and the background noise level. It is calculated as the logarithm of the target signal divided by the background noise, in decibels.",
+       id: "snri",
+     },
+     {
+       description: "The Signal-to-Distortion ratio is the relationship between the target signal and the sum of noise, interference, and artifact errors",
+       id: "sdri",
+     },
+   ],
+   models: [
+     {
+       description: "A solid model of audio source separation.",
+       id: "speechbrain/sepformer-wham",
+     },
+     {
+       description: "A speech enhancement model.",
+       id: "speechbrain/metricgan-plus-voicebank",
+     },
+   ],
+   spaces: [
+     {
+       description: "An application for speech separation.",
+       id: "younver/speechbrain-speech-separation",
+     },
+     {
+       description: "An application for audio style transfer.",
+       id: "nakas/audio-diffusion_style_transfer",
+     },
+   ],
+   summary:
+     "Audio-to-Audio is a family of tasks in which the input is an audio and the output is one or multiple generated audios. Some example tasks are speech enhancement and source separation.",
+   widgetModels: ["speechbrain/sepformer-wham"],
+   youtubeId: "iohj7nCCYoM",
+ };
+
+ export default taskData;
package/src/automatic-speech-recognition/about.md ADDED
@@ -0,0 +1,86 @@
+ ## Use Cases
+
+ ### Virtual Speech Assistants
+
+ Many edge devices have an embedded virtual assistant to interact with the end users better. These assistants rely on ASR models to recognize different voice commands to perform various tasks. For instance, you can ask your phone to dial a phone number, ask a general question, or schedule a meeting.
+
+ ### Caption Generation
+
+ A caption generation model transcribes audio to automatically generate captions for live-streamed or recorded videos. This can help with content accessibility. For example, an audience watching a video that includes a non-native language can rely on captions to interpret the content. It can also help with information retention in online classes, making it easier to assimilate knowledge while reading and taking notes.
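+
+ A minimal sketch of generating timestamped captions (assuming the 🤗 Transformers `automatic-speech-recognition` pipeline with an `openai/whisper-large-v2` checkpoint; the exact output formatting is illustrative):
+
+ ```python
+ from transformers import pipeline
+
+ pipe = pipeline("automatic-speech-recognition", model="openai/whisper-large-v2")
+ # return_timestamps=True yields chunk-level timestamps that can be turned into captions
+ result = pipe("video_audio.wav", return_timestamps=True)
+ for chunk in result["chunks"]:
+     start, end = chunk["timestamp"]
+     print(f"[{start} --> {end}] {chunk['text']}")
+ ```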
+
+ ## Task Variants
+
+ ### Multilingual ASR
+
+ Multilingual ASR models can convert audio inputs in multiple languages into transcripts. Some multilingual ASR models include [language identification](https://huggingface.co/tasks/audio-classification) blocks to improve the performance.
+
+ Multilingual ASR has become popular since maintaining a single model for all languages can simplify the production pipeline. Take a look at [Whisper](https://huggingface.co/openai/whisper-large-v2) to get an idea of how 100+ languages can be processed by a single model.
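+
+ As a rough sketch (assuming the 🤗 Transformers pipeline and Whisper's support for forcing the language via `generate_kwargs`), you can pin transcription to a specific language instead of relying on automatic detection:
+
+ ```python
+ from transformers import pipeline
+
+ pipe = pipeline("automatic-speech-recognition", model="openai/whisper-large-v2")
+ # Force French transcription rather than auto-detecting the language
+ result = pipe("french_sample.wav", generate_kwargs={"language": "french", "task": "transcribe"})
+ print(result["text"])
+ ```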
+
+ ## Inference
+
+ The Hub contains [~9,000 ASR models](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=downloads) that you can use right away by trying out the widgets directly in the browser or calling the models as a service using the Inference API. Here is a simple code snippet to do exactly this:
+
+ ```python
+ import json
+ import requests
+
+ headers = {"Authorization": f"Bearer {API_TOKEN}"}
+ API_URL = "https://api-inference.huggingface.co/models/openai/whisper-large-v2"
+
+ def query(filename):
+     with open(filename, "rb") as f:
+         data = f.read()
+     response = requests.request("POST", API_URL, headers=headers, data=data)
+     return json.loads(response.content.decode("utf-8"))
+
+ data = query("sample1.flac")
+ ```
+
+ You can also use libraries such as [transformers](https://huggingface.co/models?library=transformers&pipeline_tag=automatic-speech-recognition&sort=downloads), [speechbrain](https://huggingface.co/models?library=speechbrain&pipeline_tag=automatic-speech-recognition&sort=downloads), [NeMo](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&library=nemo&sort=downloads) and [espnet](https://huggingface.co/models?library=espnet&pipeline_tag=automatic-speech-recognition&sort=downloads) if you prefer to run the models yourself:
+
+ ```python
+ from transformers import pipeline
+
+ pipe = pipeline("automatic-speech-recognition", "openai/whisper-large-v2")
+ pipe("sample.flac")
+ # {'text': "GOING ALONG SLUSHY COUNTRY ROADS AND SPEAKING TO DAMP AUDIENCES IN DRAUGHTY SCHOOL ROOMS DAY AFTER DAY FOR A FORTNIGHT HE'LL HAVE TO PUT IN AN APPEARANCE AT SOME PLACE OF WORSHIP ON SUNDAY MORNING AND HE CAN COME TO US IMMEDIATELY AFTERWARDS"}
+ ```
+
+ You can use [huggingface.js](https://github.com/huggingface/huggingface.js) to transcribe audio with JavaScript using models on the Hugging Face Hub.
+
+ ```javascript
+ import { HfInference } from "@huggingface/inference";
+
+ const inference = new HfInference(HF_ACCESS_TOKEN);
+ await inference.automaticSpeechRecognition({
+   data: await (await fetch("sample.flac")).blob(),
+   model: "openai/whisper-large-v2",
+ })
+ ```
+
+ ## Solving ASR for your own data
+
+ We have some great news! You can fine-tune (transfer learning) a foundational speech model on a specific language without large amounts of data. Pretrained models such as Whisper, Wav2Vec2-MMS and HuBERT exist. [OpenAI's Whisper model](https://huggingface.co/openai/whisper-large-v2) is a large multilingual model trained on 100+ languages with 680K hours of speech.
+
+ The following detailed [blog post](https://huggingface.co/blog/fine-tune-whisper) shows how to fine-tune a pre-trained Whisper checkpoint on labeled data for ASR. With the right data and strategy, you can fine-tune a high-performing model on a free Google Colab instance. We suggest reading the blog post for more info!
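+
+ A minimal starting point (assuming the 🤗 Transformers Whisper classes and the Hindi example from the blog post linked above; the full training loop is covered there):
+
+ ```python
+ from transformers import WhisperForConditionalGeneration, WhisperProcessor
+
+ # Load a pretrained checkpoint to fine-tune on your own labeled audio/text pairs
+ processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="Hindi", task="transcribe")
+ model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
+ # From here, prepare your dataset with the processor and train with Seq2SeqTrainer.
+ ```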
+
+ ## Hugging Face Whisper Event
+
+ In December 2022, over 450 participants collaborated to fine-tune and share 600+ ASR Whisper models in 100+ different languages. You can compare these models on the event's speech recognition [leaderboard](https://huggingface.co/spaces/whisper-event/leaderboard?dataset=mozilla-foundation%2Fcommon_voice_11_0&config=ar&split=test).
+
+ These events help democratize ASR for all languages, including low-resource languages. In addition to the trained models, the [event](https://github.com/huggingface/community-events/tree/main/whisper-fine-tuning-event) helps to build practical collaborative knowledge.
+
+ ## Useful Resources
+
+ - [Fine-tuning MetaAI's MMS Adapter Models for Multi-Lingual ASR](https://huggingface.co/blog/mms_adapters)
+ - [Making automatic speech recognition work on large files with Wav2Vec2 in 🤗 Transformers](https://huggingface.co/blog/asr-chunking)
+ - [Boosting Wav2Vec2 with n-grams in 🤗 Transformers](https://huggingface.co/blog/wav2vec2-with-ngram)
+ - [ML for Audio Study Group - Intro to Audio and ASR Deep Dive](https://www.youtube.com/watch?v=D-MH6YjuIlE)
+ - [Massively Multilingual ASR: 50 Languages, 1 Model, 1 Billion Parameters](https://arxiv.org/pdf/2007.03001.pdf)
+ - An ASR toolkit made by [NVIDIA: NeMo](https://github.com/NVIDIA/NeMo) with code and pretrained models useful for new ASR models. Watch the [introductory video](https://www.youtube.com/embed/wBgpMf_KQVw) for an overview.
+ - [An introduction to SpeechT5, a multi-purpose speech recognition and synthesis model](https://huggingface.co/blog/speecht5)
+ - [A guide on Fine-tuning Whisper For Multilingual ASR with 🤗 Transformers](https://huggingface.co/blog/fine-tune-whisper)
+ - [Automatic speech recognition task guide](https://huggingface.co/docs/transformers/tasks/asr)
+ - [Speech Synthesis, Recognition, and More With SpeechT5](https://huggingface.co/blog/speecht5)
package/src/automatic-speech-recognition/data.ts ADDED
@@ -0,0 +1,77 @@
+ import type { TaskDataCustom } from "../Types";
+
+ const taskData: TaskDataCustom = {
+   datasets: [
+     {
+       description: "18,000 hours of multilingual audio-text dataset in 108 languages.",
+       id: "mozilla-foundation/common_voice_13_0",
+     },
+     {
+       description: "An English dataset with 1,000 hours of data.",
+       id: "librispeech_asr",
+     },
+     {
+       description: "High quality, multi-speaker audio data and their transcriptions in various languages.",
+       id: "openslr",
+     },
+   ],
+   demo: {
+     inputs: [
+       {
+         filename: "input.flac",
+         type: "audio",
+       },
+     ],
+     outputs: [
+       {
+         /// GOING ALONG SLUSHY COUNTRY ROADS AND SPEAKING TO DAMP AUDIENCES I
+         label: "Transcript",
+         content: "Going along slushy country roads and speaking to damp audiences in...",
+         type: "text",
+       },
+     ],
+   },
+   metrics: [
+     {
+       description: "",
+       id: "wer",
+     },
+     {
+       description: "",
+       id: "cer",
+     },
+   ],
+   models: [
+     {
+       description: "A powerful ASR model by OpenAI.",
+       id: "openai/whisper-large-v2",
+     },
+     {
+       description: "A good generic ASR model by MetaAI.",
+       id: "facebook/wav2vec2-base-960h",
+     },
+     {
+       description: "An end-to-end model that performs ASR and Speech Translation by MetaAI.",
+       id: "facebook/s2t-small-mustc-en-fr-st",
+     },
+   ],
+   spaces: [
+     {
+       description: "A powerful general-purpose speech recognition application.",
+       id: "openai/whisper",
+     },
+     {
+       description: "Fastest speech recognition application.",
+       id: "sanchit-gandhi/whisper-jax",
+     },
+     {
+       description: "An application that transcribes speeches in YouTube videos.",
+       id: "jeffistyping/Youtube-Whisperer",
+     },
+   ],
+   summary:
+     "Automatic Speech Recognition (ASR), also known as Speech to Text (STT), is the task of transcribing a given audio to text. It has many applications, such as voice user interfaces.",
+   widgetModels: ["openai/whisper-large-v2"],
+   youtubeId: "TksaY_FDgnk",
+ };
+
+ export default taskData;
package/src/const.ts ADDED
@@ -0,0 +1,51 @@
+ import type { ModelLibraryKey } from "../../js/src/lib/interfaces/Libraries";
+ import type { PipelineType } from "../../js/src/lib/interfaces/Types";
+
+ /*
+  * Model libraries compatible with each ML task
+  */
+ export const TASKS_MODEL_LIBRARIES: Record<PipelineType, ModelLibraryKey[]> = {
+   "audio-classification": ["speechbrain", "transformers"],
+   "audio-to-audio": ["asteroid", "speechbrain"],
+   "automatic-speech-recognition": ["espnet", "nemo", "speechbrain", "transformers", "transformers.js"],
+   "conversational": ["transformers"],
+   "depth-estimation": ["transformers"],
+   "document-question-answering": ["transformers"],
+   "feature-extraction": ["sentence-transformers", "transformers", "transformers.js"],
+   "fill-mask": ["transformers", "transformers.js"],
+   "graph-ml": ["transformers"],
+   "image-classification": ["keras", "timm", "transformers", "transformers.js"],
+   "image-segmentation": ["transformers", "transformers.js"],
+   "image-to-image": [],
+   "image-to-text": ["transformers.js"],
+   "video-classification": [],
+   "multiple-choice": ["transformers"],
+   "object-detection": ["transformers", "transformers.js"],
+   "other": [],
+   "question-answering": ["adapter-transformers", "allennlp", "transformers", "transformers.js"],
+   "robotics": [],
+   "reinforcement-learning": ["transformers", "stable-baselines3", "ml-agents", "sample-factory"],
+   "sentence-similarity": ["sentence-transformers", "spacy", "transformers.js"],
+   "summarization": ["transformers", "transformers.js"],
+   "table-question-answering": ["transformers"],
+   "table-to-text": ["transformers"],
+   "tabular-classification": ["sklearn"],
+   "tabular-regression": ["sklearn"],
+   "tabular-to-text": ["transformers"],
+   "text-classification": ["adapter-transformers", "spacy", "transformers", "transformers.js"],
+   "text-generation": ["transformers", "transformers.js"],
+   "text-retrieval": [],
+   "text-to-image": [],
+   "text-to-speech": ["espnet", "tensorflowtts", "transformers"],
+   "text-to-audio": ["transformers"],
+   "text-to-video": [],
+   "text2text-generation": ["transformers", "transformers.js"],
+   "time-series-forecasting": [],
+   "token-classification": ["adapter-transformers", "flair", "spacy", "span-marker", "stanza", "transformers", "transformers.js"],
+   "translation": ["transformers", "transformers.js"],
+   "unconditional-image-generation": [],
+   "visual-question-answering": [],
+   "voice-activity-detection": [],
+   "zero-shot-classification": ["transformers", "transformers.js"],
+   "zero-shot-image-classification": ["transformers.js"],
+ };
package/src/conversational/about.md ADDED
@@ -0,0 +1,50 @@
+ ## Use Cases
+
+ ### Chatbot 💬
+
+ Chatbots are used to have conversations instead of providing direct contact with a live human. They are used to provide customer service and sales support, and can even be used to play games (see [ELIZA](https://en.wikipedia.org/wiki/ELIZA) from 1966 for one of the earliest examples).
+
+ ### Voice Assistants 🎙️
+
+ Conversational response models are used as part of voice assistants to provide appropriate responses to voice-based queries.
+
+ ## Inference
+
+ You can run inference with conversational models using the `conversational` pipeline from the 🤗 Transformers library. This pipeline takes a conversation prompt or a list of conversations and generates responses for each prompt. The models that this pipeline can use are models that have been fine-tuned on a multi-turn conversational task (see https://huggingface.co/models?filter=conversational for a list of updated conversational models).
+
+ ```python
+ from transformers import pipeline, Conversation
+ converse = pipeline("conversational")
+
+ conversation_1 = Conversation("Going to the movies tonight - any suggestions?")
+ conversation_2 = Conversation("What's the last book you have read?")
+ converse([conversation_1, conversation_2])
+
+ ## Output:
+ ## Conversation 1
+ ## user >> Going to the movies tonight - any suggestions?
+ ## bot >> The Big Lebowski ,
+ ## Conversation 2
+ ## user >> What's the last book you have read?
+ ## bot >> The Last Question
+ ```
+
+ You can use [huggingface.js](https://github.com/huggingface/huggingface.js) to infer with conversational models on the Hugging Face Hub.
+
+ ```javascript
+ import { HfInference } from "@huggingface/inference";
+
+ const inference = new HfInference(HF_ACCESS_TOKEN);
+ await inference.conversational({
+   model: "facebook/blenderbot-400M-distill",
+   inputs: "Going to the movies tonight - any suggestions?",
+ })
+ ```
+
+ ## Useful Resources
+
+ - Learn how ChatGPT and InstructGPT work in this blog: [Illustrating Reinforcement Learning from Human Feedback (RLHF)](https://huggingface.co/blog/rlhf)
+ - [Reinforcement Learning from Human Feedback From Zero to ChatGPT](https://www.youtube.com/watch?v=EAd4oQtEJOM)
+ - [A guide on Dialog Agents](https://huggingface.co/blog/dialog-agents)
+
+ This page was made possible thanks to the efforts of [Viraat Aryabumi](https://huggingface.co/viraat).