@huggingface/inference 3.13.0 → 3.13.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,11 +1,11 @@
 # 🤗 Hugging Face Inference
 
- A Typescript powered wrapper for Inference Providers (serverless) and Inference Endpoints (dedicated).
- It works with [Inference Providers (serverless)](https://huggingface.co/docs/api-inference/index) – including all supported third-party Inference Providers – and [Inference Endpoints (dedicated)](https://huggingface.co/docs/inference-endpoints/index), and even with .
+ A TypeScript-powered wrapper that provides a unified interface to run inference across multiple services for models hosted on the Hugging Face Hub:
 
- Check out the [full documentation](https://huggingface.co/docs/huggingface.js/inference/README).
+ 1. [Inference Providers](https://huggingface.co/docs/inference-providers/index): streamlined, unified access to hundreds of machine learning models, powered by our serverless inference partners. This new approach builds on our previous Serverless Inference API, offering more models, improved performance, and greater reliability thanks to world-class providers. Refer to the [documentation](https://huggingface.co/docs/inference-providers/index#partners) for a list of supported providers.
+ 2. [Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index): a product to easily deploy models to production. Inference is run by Hugging Face in dedicated, fully managed infrastructure on a cloud provider of your choice.
+ 3. Local endpoints: you can also run inference with local inference servers like [llama.cpp](https://github.com/ggerganov/llama.cpp), [Ollama](https://ollama.com/), [vLLM](https://github.com/vllm-project/vllm), [LiteLLM](https://docs.litellm.ai/docs/simple_proxy), or [Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) by connecting the client to these local endpoints.
 
- You can also try out a live [interactive notebook](https://observablehq.com/@huggingface/hello-huggingface-js-inference), see some demos on [hf.co/huggingfacejs](https://huggingface.co/huggingfacejs), or watch a [Scrimba tutorial that explains how Inference Endpoints works](https://scrimba.com/scrim/cod8248f5adfd6e129582c523).
 
 ## Getting Started
 
@@ -42,7 +42,7 @@ const hf = new InferenceClient('your access token');
 
 Your access token should be kept private. If you need to protect it in front-end applications, we suggest setting up a proxy server that stores the access token.
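
For illustration only, here is a minimal sketch of such a proxy. It assumes an Express server that keeps the token in a server-side `HF_TOKEN` environment variable and forwards chat-completion requests to the Inference Providers router's OpenAI-compatible route; the route path and port are placeholders.

```typescript
// Hypothetical proxy sketch (not part of @huggingface/inference):
// the browser calls this server, and only the server knows the HF access token.
import express from "express";

const app = express();
app.use(express.json());

app.post("/proxy/chat/completions", async (req, res) => {
  // Forward the request body as-is and attach the token server-side.
  const upstream = await fetch("https://router.huggingface.co/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.HF_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(req.body),
  });
  res.status(upstream.status).json(await upstream.json());
});

app.listen(3000);
```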
 
- ### All supported inference providers
+ ## Using Inference Providers
 
 You can send inference requests to third-party providers with the inference client.
 
@@ -50,6 +50,7 @@ Currently, we support the following providers:
 - [Fal.ai](https://fal.ai)
 - [Featherless AI](https://featherless.ai)
 - [Fireworks AI](https://fireworks.ai)
+ - [HF Inference](https://huggingface.co/docs/inference-providers/providers/hf-inference)
 - [Hyperbolic](https://hyperbolic.xyz)
 - [Nebius](https://studio.nebius.ai)
 - [Novita](https://novita.ai/?utm_source=github_huggingface&utm_medium=github_readme&utm_campaign=link)
@@ -63,7 +64,8 @@ Currently, we support the following providers:
 - [Cerebras](https://cerebras.ai/)
 - [Groq](https://groq.com)
 
- To send requests to a third-party provider, you have to pass the `provider` parameter to the inference function. Make sure your request is authenticated with an access token.
+ To send requests to a third-party provider, you have to pass the `provider` parameter to the inference function. The `provider` parameter defaults to "auto", which selects the first provider available for the model, sorted by your preferred order in https://hf.co/settings/inference-providers.
+
 ```ts
 const accessToken = "hf_..."; // Either a HF access token, or an API key from the third-party provider (Replicate in this example)
 
@@ -75,6 +77,7 @@ await client.textToImage({
 })
 ```
 
+ You also have to make sure your request is authenticated with an access token.
 When authenticated with a Hugging Face access token, the request is routed through https://huggingface.co.
 When authenticated with a third-party provider key, the request is made directly against that provider's inference API.
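
To make the two modes concrete, here is a small sketch; the model name, provider, and keys are placeholders, and any supported provider works the same way.

```typescript
import { InferenceClient } from "@huggingface/inference";

// 1. Authenticated with a Hugging Face access token:
//    the request is routed through https://huggingface.co.
const routed = new InferenceClient("hf_...");
await routed.chatCompletion({
  model: "meta-llama/Llama-3.1-8B-Instruct",
  provider: "together",
  messages: [{ role: "user", content: "Hello!" }],
});

// 2. Authenticated with the provider's own API key:
//    the request is made directly against that provider's inference API.
const direct = new InferenceClient("together-api-key-...");
await direct.chatCompletion({
  model: "meta-llama/Llama-3.1-8B-Instruct",
  provider: "together",
  messages: [{ role: "user", content: "Hello!" }],
});
```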
 
@@ -82,6 +85,7 @@ Only a subset of models are supported when requesting third-party providers. You
 - [Fal.ai supported models](https://huggingface.co/api/partners/fal-ai/models)
 - [Featherless AI supported models](https://huggingface.co/api/partners/featherless-ai/models)
 - [Fireworks AI supported models](https://huggingface.co/api/partners/fireworks-ai/models)
+ - [HF Inference supported models](https://huggingface.co/api/partners/hf-inference/models)
 - [Hyperbolic supported models](https://huggingface.co/api/partners/hyperbolic/models)
 - [Nebius supported models](https://huggingface.co/api/partners/nebius/models)
 - [Nscale supported models](https://huggingface.co/api/partners/nscale/models)
@@ -92,7 +96,6 @@ Only a subset of models are supported when requesting third-party providers. You
 - [Cohere supported models](https://huggingface.co/api/partners/cohere/models)
 - [Cerebras supported models](https://huggingface.co/api/partners/cerebras/models)
 - [Groq supported models](https://console.groq.com/docs/models)
- - [HF Inference API (serverless)](https://huggingface.co/models?inference=warm&sort=trending)
 
 ❗**Important note:** To be compatible, the third-party API must adhere to the "standard" shape API we expect on HF model pages for each pipeline task type.
 This is not an issue for LLMs as everyone converged on the OpenAI API anyways, but can be more tricky for other tasks like "text-to-image" or "automatic-speech-recognition" where there exists no standard API. Let us know if any help is needed or if we can make things easier for you!
@@ -116,22 +119,22 @@ await textGeneration({
 
 This will enable tree-shaking by your bundler.
 
- ## Natural Language Processing
+ ### Natural Language Processing
 
- ### Text Generation
+ #### Text Generation
 
 Generates text from an input prompt.
 
- [Demo](https://huggingface.co/spaces/huggingfacejs/streaming-text-generation)
-
 ```typescript
 await hf.textGeneration({
- model: 'gpt2',
+ model: 'mistralai/Mixtral-8x7B-v0.1',
+ provider: "together",
 inputs: 'The answer to the universe is'
 })
 
 for await (const output of hf.textGenerationStream({
- model: "google/flan-t5-xxl",
+ model: "mistralai/Mixtral-8x7B-v0.1",
+ provider: "together",
 inputs: 'repeat "one two three four"',
 parameters: { max_new_tokens: 250 }
 })) {
@@ -139,16 +142,15 @@ for await (const output of hf.textGenerationStream({
 }
 ```
 
- ### Text Generation (Chat Completion API Compatible)
-
- Using the `chatCompletion` method, you can generate text with models compatible with the OpenAI Chat Completion API. All models served by [TGI](https://huggingface.co/docs/text-generation-inference/) on Hugging Face support Messages API.
+ #### Chat Completion
 
- [Demo](https://huggingface.co/spaces/huggingfacejs/streaming-chat-completion)
+ Generate a model response from a list of messages comprising a conversation.
 
 ```typescript
 // Non-streaming API
 const out = await hf.chatCompletion({
- model: "meta-llama/Llama-3.1-8B-Instruct",
+ model: "Qwen/Qwen3-32B",
+ provider: "cerebras",
 messages: [{ role: "user", content: "Hello, nice to meet you!" }],
 max_tokens: 512,
 temperature: 0.1,
@@ -157,7 +159,8 @@ const out = await hf.chatCompletion({
 // Streaming API
 let out = "";
 for await (const chunk of hf.chatCompletionStream({
- model: "meta-llama/Llama-3.1-8B-Instruct",
+ model: "Qwen/Qwen3-32B",
+ provider: "cerebras",
 messages: [
 { role: "user", content: "Can you help me solve an equation?" },
 ],
@@ -169,33 +172,18 @@ for await (const chunk of hf.chatCompletionStream({
 }
 }
 ```
+ #### Feature Extraction
 
- It's also possible to call Mistral or OpenAI endpoints directly:
+ This task reads some text and outputs raw float values that are usually consumed as part of a semantic database or semantic search.
 
 ```typescript
- const openai = new InferenceClient(OPENAI_TOKEN).endpoint("https://api.openai.com");
-
- let out = "";
- for await (const chunk of openai.chatCompletionStream({
- model: "gpt-3.5-turbo",
- messages: [
- { role: "user", content: "Complete the equation 1+1= ,just the answer" },
- ],
- max_tokens: 500,
- temperature: 0.1,
- seed: 0,
- })) {
- if (chunk.choices && chunk.choices.length > 0) {
- out += chunk.choices[0].delta.content;
- }
- }
-
- // For mistral AI:
- // endpointUrl: "https://api.mistral.ai"
- // model: "mistral-tiny"
+ await hf.featureExtraction({
+ model: "sentence-transformers/distilbert-base-nli-mean-tokens",
+ inputs: "That is a happy person",
+ });
 ```
 
- ### Fill Mask
+ #### Fill Mask
 
 Tries to fill in a hole with a missing word (token to be precise).
 
@@ -206,7 +194,7 @@ await hf.fillMask({
 })
 ```
 
- ### Summarization
+ #### Summarization
 
 Summarizes longer text into shorter text. Be careful, some models have a maximum length of input.
 
@@ -221,7 +209,7 @@ await hf.summarization({
 })
 ```
 
- ### Question Answering
+ #### Question Answering
 
 Answers questions based on the context you provide.
 
@@ -235,7 +223,7 @@ await hf.questionAnswering({
 })
 ```
 
- ### Table Question Answering
+ #### Table Question Answering
 
 ```typescript
 await hf.tableQuestionAnswering({
@@ -252,7 +240,7 @@ await hf.tableQuestionAnswering({
 })
 ```
 
- ### Text Classification
+ #### Text Classification
 
 Often used for sentiment analysis, this method will assign labels to the given text along with a probability score of that label.
 
@@ -263,7 +251,7 @@ await hf.textClassification({
 })
 ```
 
- ### Token Classification
+ #### Token Classification
 
 Used for sentence parsing, either grammatical, or Named Entity Recognition (NER) to understand keywords contained within text.
 
@@ -274,7 +262,7 @@ await hf.tokenClassification({
 })
 ```
 
- ### Translation
+ #### Translation
 
 Converts text from one language to another.
 
@@ -294,7 +282,7 @@ await hf.translation({
 })
 ```
 
- ### Zero-Shot Classification
+ #### Zero-Shot Classification
 
 Checks how well an input text fits into a set of labels you provide.
 
@@ -308,22 +296,7 @@ await hf.zeroShotClassification({
 })
 ```
 
- ### Conversational
-
- This task corresponds to any chatbot-like structure. Models tend to have shorter max_length, so please check with caution when using a given model if you need long-range dependency or not.
-
- ```typescript
- await hf.conversational({
- model: 'microsoft/DialoGPT-large',
- inputs: {
- past_user_inputs: ['Which movie is the best ?'],
- generated_responses: ['It is Die Hard for sure.'],
- text: 'Can you explain why ?'
- }
- })
- ```
-
- ### Sentence Similarity
+ #### Sentence Similarity
 
 Calculate the semantic similarity between one text and a list of other sentences.
 
@@ -341,9 +314,9 @@ await hf.sentenceSimilarity({
 })
 ```
 
- ## Audio
+ ### Audio
 
- ### Automatic Speech Recognition
+ #### Automatic Speech Recognition
 
 Transcribes speech from an audio file.
 
@@ -356,7 +329,7 @@ await hf.automaticSpeechRecognition({
 })
 ```
 
- ### Audio Classification
+ #### Audio Classification
 
 Assigns labels to the given audio along with a probability score of that label.
 
@@ -369,7 +342,7 @@ await hf.audioClassification({
 })
 ```
 
- ### Text To Speech
+ #### Text To Speech
 
 Generates natural-sounding speech from text input.
 
@@ -382,7 +355,7 @@ await hf.textToSpeech({
 })
 ```
 
- ### Audio To Audio
+ #### Audio To Audio
 
 Outputs one or multiple generated audios from an input audio, commonly used for speech enhancement and source separation.
 
@@ -393,9 +366,9 @@ await hf.audioToAudio({
 })
 ```
 
- ## Computer Vision
+ ### Computer Vision
 
- ### Image Classification
+ #### Image Classification
 
 Assigns labels to a given image along with a probability score of that label.
 
@@ -408,7 +381,7 @@ await hf.imageClassification({
 })
 ```
 
- ### Object Detection
+ #### Object Detection
 
 Detects objects within an image and returns labels with corresponding bounding boxes and probability scores.
 
@@ -421,7 +394,7 @@ await hf.objectDetection({
 })
 ```
 
- ### Image Segmentation
+ #### Image Segmentation
 
 Detects segments within an image and returns labels with corresponding bounding boxes and probability scores.
 
@@ -432,7 +405,7 @@ await hf.imageSegmentation({
 })
 ```
 
- ### Image To Text
+ #### Image To Text
 
 Outputs text from a given image, commonly used for captioning or optical character recognition.
 
@@ -443,7 +416,7 @@ await hf.imageToText({
 })
 ```
 
- ### Text To Image
+ #### Text To Image
 
 Creates an image from a text prompt.
 
@@ -456,7 +429,7 @@ await hf.textToImage({
 })
 ```
 
- ### Image To Image
+ #### Image To Image
 
 Image-to-image is the task of transforming a source image to match the characteristics of a target image or a target image domain.
 
@@ -472,7 +445,7 @@ await hf.imageToImage({
 });
 ```
 
- ### Zero Shot Image Classification
+ #### Zero Shot Image Classification
 
 Checks how well an input image fits into a set of labels you provide.
 
@@ -488,20 +461,10 @@ await hf.zeroShotImageClassification({
 })
 ```
 
- ## Multimodal
-
- ### Feature Extraction
-
- This task reads some text and outputs raw float values, that are usually consumed as part of a semantic database/semantic search.
+ ### Multimodal
 
- ```typescript
- await hf.featureExtraction({
- model: "sentence-transformers/distilbert-base-nli-mean-tokens",
- inputs: "That is a happy person",
- });
- ```
 
- ### Visual Question Answering
+ #### Visual Question Answering
 
 Visual Question Answering is the task of answering open-ended questions based on an image. They output natural language responses to natural language questions.
 
@@ -517,7 +480,7 @@ await hf.visualQuestionAnswering({
 })
 ```
 
- ### Document Question Answering
+ #### Document Question Answering
 
 Document question answering models take a (document, question) pair as input and return an answer in natural language.
 
@@ -533,9 +496,9 @@ await hf.documentQuestionAnswering({
 })
 ```
 
- ## Tabular
+ ### Tabular
 
- ### Tabular Regression
+ #### Tabular Regression
 
 Tabular regression is the task of predicting a numerical value given a set of attributes.
 
@@ -555,7 +518,7 @@ await hf.tabularRegression({
 })
 ```
 
- ### Tabular Classification
+ #### Tabular Classification
 
 Tabular classification is the task of classifying a target category (a group) based on set of attributes.
 
@@ -600,48 +563,80 @@ for await (const chunk of stream) {
 }
 ```
 
- ## Custom Inference Endpoints
+ ## Using Inference Endpoints
 
- Learn more about using your own inference endpoints [here](https://hf.co/docs/inference-endpoints/)
+ The examples we saw above use Inference Providers, which are very useful for prototyping
+ and testing things quickly. Once you're ready to deploy your model to production, you'll need dedicated infrastructure. That's where [Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index) comes into play. It allows you to deploy any model and expose it as a private API. Once deployed, you'll get a URL that you can connect to:
 
 ```typescript
- const gpt2 = hf.endpoint('https://xyz.eu-west-1.aws.endpoints.huggingface.cloud/gpt2');
- const { generated_text } = await gpt2.textGeneration({inputs: 'The answer to the universe is'});
+ import { InferenceClient } from '@huggingface/inference';
 
- // Chat Completion Example
- const ep = hf.endpoint(
- "https://router.huggingface.co/hf-inference/models/meta-llama/Llama-3.1-8B-Instruct"
- );
- const stream = ep.chatCompletionStream({
- model: "tgi",
- messages: [{ role: "user", content: "Complete the equation 1+1= ,just the answer" }],
- max_tokens: 500,
- temperature: 0.1,
- seed: 0,
+ const hf = new InferenceClient("hf_xxxxxxxxxxxxxx", {
+ endpointUrl: "https://j3z5luu0ooo76jnl.us-east-1.aws.endpoints.huggingface.cloud/v1/",
 });
- let out = "";
- for await (const chunk of stream) {
- if (chunk.choices && chunk.choices.length > 0) {
- out += chunk.choices[0].delta.content;
- console.log(out);
- }
- }
+
+ const response = await hf.chatCompletion({
+ messages: [
+ {
+ role: "user",
+ content: "What is the capital of France?",
+ },
+ ],
+ });
+
+ console.log(response.choices[0].message.content);
 ```
 
- By default, all calls to the inference endpoint will wait until the model is
- loaded. When [scaling to
- 0](https://huggingface.co/docs/inference-endpoints/en/autoscaling#scaling-to-0)
- is enabled on the endpoint, this can result in non-trivial waiting time. If
- you'd rather disable this behavior and handle the endpoint's returned 500 HTTP
- errors yourself, you can do so like so:
+ By default, all calls to the inference endpoint will wait until the model is loaded. When [scaling to 0](https://huggingface.co/docs/inference-endpoints/en/autoscaling#scaling-to-0)
+ is enabled on the endpoint, this can result in non-trivial waiting time. If you'd rather disable this behavior and handle the endpoint's returned 500 HTTP errors yourself, you can do so like so:
 
 ```typescript
- const gpt2 = hf.endpoint('https://xyz.eu-west-1.aws.endpoints.huggingface.cloud/gpt2');
- const { generated_text } = await gpt2.textGeneration(
- {inputs: 'The answer to the universe is'},
- {retry_on_error: false},
+ const hf = new InferenceClient("hf_xxxxxxxxxxxxxx", {
+ endpointUrl: "https://j3z5luu0ooo76jnl.us-east-1.aws.endpoints.huggingface.cloud/v1/",
+ });
+
+ const response = await hf.chatCompletion(
+ {
+ messages: [
+ {
+ role: "user",
+ content: "What is the capital of France?",
+ },
+ ],
+ },
+ {
+ retry_on_error: false,
+ }
 );
 ```
+ ## Using local endpoints
+
+ You can use `InferenceClient` to run chat completion with local inference servers (llama.cpp, vllm, litellm server, TGI, mlx, etc.) running on your own machine, as long as the server exposes an OpenAI-compatible API.
+
+ ```typescript
+ import { InferenceClient } from '@huggingface/inference';
+
+ const hf = new InferenceClient(undefined, {
+ endpointUrl: "http://localhost:8080",
+ });
+
+ const response = await hf.chatCompletion({
+ messages: [
+ {
+ role: "user",
+ content: "What is the capital of France?",
+ },
+ ],
+ });
+
+ console.log(response.choices[0].message.content);
+ ```
+
+ <Tip>
+
+ Similarly to the OpenAI JS client, `InferenceClient` can be used to run Chat Completion inference with any OpenAI REST API-compatible endpoint.
+
+ </Tip>
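
For instance, pointing the client at a non-Hugging Face OpenAI-compatible API follows the same pattern as the local-endpoint example above. This is only a sketch: the URL, API key, and model name are placeholders for whatever your endpoint expects.

```typescript
import { InferenceClient } from "@huggingface/inference";

const client = new InferenceClient("your-api-key", {
  endpointUrl: "https://api.example.com/v1/",
});

const response = await client.chatCompletion({
  model: "some-model-name", // whichever model identifier the endpoint expects
  messages: [{ role: "user", content: "What is the capital of France?" }],
});

console.log(response.choices[0].message.content);
```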
 
 ## Running tests