@stephen-lord/other2 1.0.8 → 1.0.10

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (47)
  1. package/dist/docs/manus/CN-扒完全网最强-AI-团队的-Context-Engineering-攻略我们总结出了这-5-大方法-智源社区.md +2464 -0
  2. package/dist/docs/manus/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus.md +212 -0
  3. package/dist/docs/manus/Context-Engineering-for-AI-Agents-Part-2.md +96 -0
  4. package/dist/docs/manus/Industry.md +94 -0
  5. package/dist/docs/manus/Observability-for-Manus-15-Agents-Logs-Retries-and-Error-Budgets.md +346 -0
  6. package/dist/docs/manus/OpenManus-Technical-Analysis-Architecture-and-Implementation-of-an-Open-Source-A.md +324 -0
  7. package/dist/docs/manus/README.md +85 -0
  8. package/dist/docs/manus/Tech-Constrained-Decoding-Agent-Reliability.md +81 -0
  9. package/dist/docs/manus/Tech-How-to-build-function-calling-and-JSON-mode.md +43 -0
  10. package/dist/docs/manus/Tech-Understanding-Logit-Bias-in-LLMs-Medium.md +1354 -0
  11. package/dist/docs/manus/The-Performance-Reality-KV-Cache-as-the-North-Star.md +155 -0
  12. package/dist/docs/manus/Why-Context-Engineering.md +125 -0
  13. package/dist/docs/manus/article_1_raw.md +1 -0
  14. package/dist/docs/manus/split_articles.py +52 -0
  15. package/dist/docs/manus/来自-Manus-的一手分享如何构建-AI-Agent-的上下文工程-智源社区.md +2180 -0
  16. package/dist/ui-ux-pro-max/SKILL.md +386 -0
  17. package/dist/ui-ux-pro-max/data/charts.csv +26 -0
  18. package/dist/ui-ux-pro-max/data/colors.csv +97 -0
  19. package/dist/ui-ux-pro-max/data/icons.csv +101 -0
  20. package/dist/ui-ux-pro-max/data/landing.csv +31 -0
  21. package/dist/ui-ux-pro-max/data/products.csv +97 -0
  22. package/dist/ui-ux-pro-max/data/prompts.csv +24 -0
  23. package/dist/ui-ux-pro-max/data/react-performance.csv +45 -0
  24. package/dist/ui-ux-pro-max/data/stacks/flutter.csv +53 -0
  25. package/dist/ui-ux-pro-max/data/stacks/html-tailwind.csv +56 -0
  26. package/dist/ui-ux-pro-max/data/stacks/jetpack-compose.csv +53 -0
  27. package/dist/ui-ux-pro-max/data/stacks/nextjs.csv +53 -0
  28. package/dist/ui-ux-pro-max/data/stacks/nuxt-ui.csv +51 -0
  29. package/dist/ui-ux-pro-max/data/stacks/nuxtjs.csv +59 -0
  30. package/dist/ui-ux-pro-max/data/stacks/react-native.csv +52 -0
  31. package/dist/ui-ux-pro-max/data/stacks/react.csv +54 -0
  32. package/dist/ui-ux-pro-max/data/stacks/shadcn.csv +61 -0
  33. package/dist/ui-ux-pro-max/data/stacks/svelte.csv +54 -0
  34. package/dist/ui-ux-pro-max/data/stacks/swiftui.csv +51 -0
  35. package/dist/ui-ux-pro-max/data/stacks/vue.csv +50 -0
  36. package/dist/ui-ux-pro-max/data/styles.csv +59 -0
  37. package/dist/ui-ux-pro-max/data/typography.csv +58 -0
  38. package/dist/ui-ux-pro-max/data/ui-reasoning.csv +101 -0
  39. package/dist/ui-ux-pro-max/data/ux-guidelines.csv +100 -0
  40. package/dist/ui-ux-pro-max/data/web-interface.csv +31 -0
  41. package/dist/ui-ux-pro-max/scripts/__pycache__/core.cpython-310.pyc +0 -0
  42. package/dist/ui-ux-pro-max/scripts/__pycache__/core.cpython-312.pyc +0 -0
  43. package/dist/ui-ux-pro-max/scripts/__pycache__/design_system.cpython-312.pyc +0 -0
  44. package/dist/ui-ux-pro-max/scripts/core.py +258 -0
  45. package/dist/ui-ux-pro-max/scripts/design_system.py +1066 -0
  46. package/dist/ui-ux-pro-max/scripts/search.py +106 -0
  47. package/package.json +6 -6
@@ -0,0 +1,1354 @@
+ # Understanding Logit Bias in LLMs | Medium
+
+ **Source:** https://medium.com/@serhatcck/token-level-control-in-openai-models-a-developers-guide-to-logit-bias-6fcc04a8a41f
+
+ ---
+
+ Author: Serhat ÇİÇEK
+
+ # Token-Level Control in OpenAI Models: A Developer’s Guide to Logit Bias
+
+ Serhat ÇİÇEK
+
+ Dec 11, 2025
+
+ Logit bias is one of the least discussed yet most influential parameters in modern LLM development. By directly manipulating token-level probabilities, it allows developers to shape model outputs with a precision that traditional prompting cannot achieve. Whether you need deterministic text generation, safer content boundaries, or fine-tuned behavioral constraints without model retraining, logit bias provides a powerful mechanism for controlling how an LLM thinks and responds. Understanding this feature is essential for anyone building reliable, predictable, and production-grade AI systems.
+
+ ## What Is Bias in Large Language Models?
+
+ Bias in Large Language Models refers to systematic tendencies in how a model generates text. Instead of producing purely neutral or balanced outputs, an LLM may favor certain tokens, ideas, or associations because of the statistical patterns it learned during training. This means the model’s responses can shift — not due to correctness, but due to patterns embedded in the data or architecture.
+
+ A useful way to understand this is to compare LLM bias with human cognitive bias. For example, if a developer spends years writing code mostly in Python, they naturally develop a positive bias toward that language. When starting a new project, choosing Python becomes more likely — not necessarily because it is the best option, but because past experience shaped their preference.
+
+ LLMs behave in a similar way: if the training data heavily features certain topics, styles, or associations, the model becomes more inclined to reproduce them.
+
+ In practical terms, bias is simply a distortion in token probabilities. Some tokens become more likely, others less likely, shaping the model’s tone, content, and reasoning. For AI developers, understanding this is crucial: bias affects predictability, alignment, safety, and overall output quality.
+
+ ### Common Sources of Bias in LLMs
+
+ - Instruction-Tuning Bias — Reinforcement from human feedback shaping preferred behaviors.
+ - Decoding-Time Bias — Sampling techniques and parameters (e.g., temperature, logit bias) that shift token probabilities.
+ - Objective & Loss Function Bias — Optimization that favors certain patterns over others.
+ - Representational Bias — Embeddings forming unequal relationships between concepts.
+ - Training Data Imbalance — Overrepresented topics, sentiments, or cultural viewpoints.
+
+ ## Bias and Tokens in OpenAI Models
+
+ OpenAI’s bias controls — such as logit_bias — operate entirely at the token level. Tokens are the smallest units a model uses to understand and generate text. Instead of reading characters or whole words, LLMs break text into tokens using a tokenizer. This means that any bias applied to a model directly alters the probability distribution of individual tokens, not entire sentences.
+
+ For example:
+
+ - A simple sentence like “Hello world!” could be encoded into: [“Hello”, “ world”, “!”]
+ - The word “JavaScript” may be split into multiple tokens depending on the tokenizer.
+ - The word “Python” might be a single token.
+
+ Because bias works at this granular level, developers must understand how tokenization affects outcomes. Applying a positive logit bias to the token representing “Python” increases the chance of the model using that word in its response. Conversely, applying a strong negative bias to a token like “no” can drastically reduce its appearance, even if it makes the response grammatically awkward.
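+
+ As a minimal sketch of those mechanics (the model name and bias value below are illustrative assumptions; the exact token IDs depend on the tokenizer in use):
+
+ ```python
+ import tiktoken
+ from openai import OpenAI
+
+ # Look up the token IDs for the word we want to encourage. Note the
+ # leading-space form: "Python" and " Python" are usually different tokens.
+ enc = tiktoken.encoding_for_model("gpt-4o-mini")  # model chosen for illustration
+ bias = {}
+ for variant in ["Python", " Python"]:
+     for token_id in enc.encode(variant):
+         bias[token_id] = 25  # positive values encourage, negative values suppress
+
+ client = OpenAI()
+ response = client.chat.completions.create(
+     model="gpt-4o-mini",
+     messages=[{"role": "user", "content": "Recommend a programming language."}],
+     logit_bias=bias,  # maps token ID -> bias in [-100, 100]
+ )
+ print(response.choices[0].message.content)
+ ```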
+
+ ### What Do +100 and −100 Mean in Logit Bias?
+
+ In OpenAI models, logit bias values such as +100 and −100 represent extremely strong adjustments to a token’s probability. A +100 bias forces the model to strongly prefer a specific token — effectively pushing its probability close to 100% whenever it’s a valid next token. Conversely, a −100 bias nearly eliminates a token from the generation process, reducing its probability to almost zero. These extreme values behave like “hard constraints” for token selection. For example, applying +100 to the token for “Python” almost guarantees the model will mention it, while assigning −100 to the token “Java” will cause the model to avoid it entirely — even when it would normally be a reasonable choice. Although smaller values can create more nuanced nudges, ±100 is commonly used when developers need deterministic control over model output.
+
+ ## PoC
+
+ To experiment with token-level biasing in real time, you can download and run the project available at LLM-Bias on GitHub. Simply clone the repository from https://github.com/Serhatcck/LLM-Bias, install the dependencies, and start the application. The interface allows you to enter a prompt, choose a logit bias value, and test how different token biases affect the model’s output.
+
+ ### -100 Logit Bias
+
+ We start with a simple baseline prompt:
+
+ “Explain quantum mechanics with two sentences.”
+
+ Without any bias applied, the model generates a neutral, standard explanation based on its learned distribution of scientific terminology.
+
+ Now we repeat the exact same prompt, but this time we introduce a negative logit bias for the token(s) representing “physics” (for example “physics” or its token ID).
+
+ Assigning a strong negative bias such as −80 to −100 almost completely suppresses the appearance of that word.
+
+ After running both prompts, the difference becomes clear:
+
+ - The negatively biased output avoids using the word physics entirely, even when it logically fits the explanation. The model instead substitutes more generic or indirect descriptions to compensate.
+ - The unbiased output may reference physics naturally, since quantum mechanics is part of the field.
+
+ ## Insights From Our Logit Bias Experiments
+
+ While testing both positive and negative logit bias across different prompts, we observed several practical behaviors that developers should be aware of:
+
+ ### Negative Logit Bias Observations
+
+ - Negative logit bias values weaker than about −80 (that is, closer to zero) tend to be ineffective: the model may still generate the unwanted token, especially in contexts where it is highly probable.
+ - To reliably suppress a word, you must include all of its variations in the logit bias list — such as lowercase/uppercase versions, plural forms, and common morphological variants (see the sketch below).
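+
+ As a rough illustration of how such a suppression list can be built (the helper below is hypothetical and assumes the tiktoken tokenizer):
+
+ ```python
+ import tiktoken
+
+ enc = tiktoken.encoding_for_model("gpt-4o-mini")  # illustrative model choice
+
+ def suppression_bias(word: str, strength: int = -100) -> dict[int, int]:
+     """Hypothetical helper: bias out the common surface forms of a word."""
+     variants = {
+         word, word.lower(), word.capitalize(), word.upper(),
+         word + "s",  # naive plural; add real morphology as needed
+     }
+     # Leading-space forms tokenize differently in BPE vocabularies.
+     variants |= {" " + v for v in set(variants)}
+     bias: dict[int, int] = {}
+     for v in variants:
+         for token_id in enc.encode(v):
+             # Caveat: a multi-token variant suppresses its pieces everywhere,
+             # which can overreach into unrelated words.
+             bias[token_id] = strength
+     return bias
+
+ print(suppression_bias("physics"))
+ ```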
+
+ ### Positive Logit Bias Observations
+
+ - Positive bias values of +80 or lower typically do not create strong enough pressure. The model fails to consistently insert the target word unless the word already aligns with the prompt.
+ - When a high positive value (e.g., +100) is applied to a token unrelated to the prompt, the model often takes significantly longer to produce a response. This happens because the LLM struggles to justify the forced token within the context.
+
+ ## Summary
+
+ Negative logit bias is far more practical for production environments, especially when filtering unwanted words, profanity, or brand names. However, developers must carefully select and tokenize all relevant variations of the terms they want to block. Positive logit bias, while useful for controlled generation, often introduces latency and instability when forcing unrelated tokens into the output.
+
+ ## References
+
+ ### OpenAI Documentation
+
+ - GPT Tokenization Guide: https://platform.openai.com/docs/guides/text-generation/tokenization
+ - OpenAI API Reference — Logit Bias: https://platform.openai.com/docs/api-reference/chat/create#chat-create-logit_bias
+
+ ### LLM Tokenization & Probability
+
+ - Tokenizers by HuggingFace (general tokenizer behavior): https://huggingface.co/docs/tokenizers/index
+ - OpenAI Tiktoken Library (tokenizer implementation): https://github.com/openai/tiktoken
+
+ ### Research Papers on LLM Bias
+
+ - Assessing Social and Linguistic Bias in Language Models: https://arxiv.org/abs/2005.14050
+ - Bias in Language Models: A Comprehensive Review: https://arxiv.org/abs/1906.07337
+ - A Survey on Bias in Large Language Models (arXiv): https://arxiv.org/abs/2312.01708
+
+ # Logit Bias - LLM Parameter Guide - Vellum
+ URL: https://vellum.ai/llm-parameters/logit-bias
+
+ # What is Logit Bias
+
+ The logit bias parameter lets you control whether the model is more or less likely to generate a specific word.
+
+ # How does it work behind the scenes
+
+ The model is always deciding which word (or tokens) to pick next. All these tokens have their own IDs, and using logit bias we can forbid the model from using some of these IDs.
+
+ But how can we actually find these IDs?
+
+ The simplest way is to use OpenAI’s tokenizer tool. Just type in your words, toggle the “Text-Token ID” option at the bottom, and you’ll get the IDs for your words. In some cases you’ll get more tokens for one word.
+
+ It’s important to note here that different models may produce different tokens for the same input, so you should always check with the model provider to learn about their tokenization process.
+
+ There are a couple of important things to note here:
+
+ - You can use OpenAI’s tokenizer tool to find out the tokens for GPT-3.5 and GPT-4 models, but there is still no data for GPT-4o and GPT-4o mini.
+ - One word can have two tokens.
+ - Characters before a word (e.g. a space or underscore) can produce different tokens for the same word.
+ - Capitalized and uncapitalized versions of the same word might result in different tokens.
+
+ # How to set this parameter correctly
+
+ In the API, this parameter accepts a JSON object that maps token IDs to bias values. A bias value can range from -100 to 100. The parameter takes tokens, not text, so you’d use the tokenizer mentioned above to get the token IDs for the words you want to “bias”.
+
+ # How to experiment with Logit Bias
+
+ The closer the value is to -100, the more likely that token will be blocked from being generated. The closer it is to 100, the more the model is encouraged to use that token.
+
+ To test this parameter, try adjusting the values gradually and analyze the impact. Using small values like 1 or -1 won’t make much difference, but values like 5 or -5 can have a much stronger effect.
+
+ # When to use Logit Bias
+
+ Use logit bias when you know specifically which words you want to ban, or which words you want to encourage the model to use repeatedly.
+
+ ### Example 1: Ban offensive words
+
+ One example where you’d want to ban some words (tokens) from appearing in the results is for moderation purposes.
+
+ Suppose you’re building a guardrail that will capture offensive content in your chatbot. Now, you may want to ban words like “stupid”. The word “stupid” tokenizes to two IDs [267, 16263], and the same word with a space before it, “ stupid”, tokenizes to another ID [18754]. To ban them from appearing in the results we can add the logit bias like so:
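+
+ The snippet that followed on the original page did not survive extraction; a minimal reconstruction of the request, using the token IDs quoted above, might look like this:
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI()
+ response = client.chat.completions.create(
+     model="gpt-3.5-turbo",  # model chosen for illustration
+     messages=[{"role": "user", "content": "Describe my chatbot experience."}],
+     # Ban every tokenization of "stupid" and " stupid"
+     logit_bias={267: -100, 16263: -100, 18754: -100},
+ )
+ ```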
+
+ ### Example 2: Encourage neutral answers in a chatbot
+
+ If you’re using a customer support chatbot, you’ll likely want it to maintain a calm, neutral tone. To help with that, you can encourage the model to use more neutral words like “understand,” “assist,” and “resolve”. To make the model output “understand,” you need to map it to its two token IDs and add a bias of 5, as sketched below.
+
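+ The original snippet is likewise missing; computing the IDs with a tokenizer rather than hardcoding them (the tokenizer choice here is an assumption) keeps the shape clear:
+
+ ```python
+ import tiktoken
+
+ enc = tiktoken.get_encoding("cl100k_base")  # GPT-3.5/GPT-4 tokenizer
+ # Gently encourage both the bare and leading-space forms of "understand"
+ logit_bias = {
+     token_id: 5
+     for word in ["understand", " understand"]
+     for token_id in enc.encode(word)
+ }
+ ```
+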
+ # Constrained Decoding and Structured Output for Agent Reliability - Engineering Notes
+ URL: https://notes.muthu.co/2025/11/constrained-decoding-and-structured-output-for-agent-reliability/
+ Author: Muthukrishnan
+
+ # Constrained Decoding and Structured Output for Agent Reliability
+
+ When building production AI agents, one of the most persistent problems is unpredictable output formats. An agent needs to call a tool with precise JSON parameters, but the LLM wraps the output in markdown code blocks, adds explanatory text, or hallucinates invalid field names. This breaks the entire agent pipeline.
+
+ Constrained decoding solves this by restricting what tokens an LLM can generate, ensuring outputs always conform to specified formats like JSON schemas, regular expressions, or context-free grammars. It’s the difference between hoping your agent produces valid JSON and guaranteeing it.
+
+ ## Concept Introduction
+
+ During text generation, an LLM samples tokens from a probability distribution at each step. Constrained decoding modifies this process by masking invalid tokens, setting their probability to zero before sampling. The result: tool calls always have correct parameter names and types, database queries never contain invalid SQL syntax, API requests match OpenAPI specifications, and decision outputs are always parseable by downstream systems.
+
+ ```
+ Standard Decoding:
+ P(next_token | context) → Sample from all vocabulary
+
+ Constrained Decoding:
+ P(next_token | context, grammar) → Sample only from valid tokens
+ ```
+
+ The constraint can be:
+
+ - A JSON schema (only generate valid JSON matching the schema)
+ - A regular expression (output must match the regex)
+ - A context-free grammar (follow specific syntax rules)
+ - A finite-state machine (transition through defined states)
+
+ Modern implementations use:
+
+ - Token masking at inference time
+ - Incremental parsing to track valid next tokens
+ - Beam search with grammar-aware scoring
+ - Logit bias to steer generation probabilistically
+
+ ## Historical & Theoretical Context
+
+ The concept emerged from multiple research threads:
+
+ 1. Semantic Parsing (1990s–2000s): Early NLP systems used grammar-based parsers to convert natural language to formal representations (SQL, logic). These were rigid but guaranteed valid output.
+
+ 2. Constrained Generation in NLG (2010s): Neural text generation models began incorporating hard constraints:
+
+ - Hokamp & Liu (2017): Grid Beam Search for lexically constrained generation
+ - Forcing specific phrases to appear in translations or summaries
+
+ 3. Structured Prediction (2015–2020): Seq2seq models for code generation, semantic parsing, and structured data extraction needed format guarantees. Early solutions used post-processing and re-ranking.
+
+ 4. LLM Function Calling Era (2020–present): As LLMs became agents with tool use, reliable structured output became critical:
+
+ - OpenAI Function Calling (2023): Proprietary constrained decoding for JSON tool calls
+ - Guidance (2023): Microsoft’s grammar-based generation library
+ - Outlines (2023): Fast regex and JSON schema constraints using FSMs
+ - LM Format Enforcer (2023): Token masking for various formats
+
+ ### Theoretical Foundation
+
+ Constrained decoding connects to:
+
+ - Formal Language Theory: Using automata to define valid sequences
+ - Parsing Theory: Incremental parsing to determine next valid tokens
+ - Probabilistic Inference: Conditioning probability distributions on constraints
+ - Program Synthesis: Generating code that compiles/type-checks
+
+ ## Algorithms & Math
+
+ ### Core Algorithm: FSM-Guided Token Masking
+
+ The most efficient modern approach uses finite-state machines:
+
+ Pseudocode:
+
+ ```
+ def constrained_decode(prompt, schema, max_tokens):
+     # Convert schema to FSM
+     fsm = schema_to_fsm(schema)
+     state = fsm.initial_state
+     tokens = []
+
+     for _ in range(max_tokens):
+         # Get next token logits from LLM
+         logits = llm.forward(prompt + tokens)
+
+         # Mask invalid tokens based on current FSM state
+         valid_tokens = fsm.get_valid_tokens(state)
+         masked_logits = mask_logits(logits, valid_tokens)
+
+         # Sample next token
+         next_token = sample(masked_logits)
+         tokens.append(next_token)
+
+         # Update FSM state
+         state = fsm.transition(state, next_token)
+
+         # Check if reached accept state
+         if fsm.is_terminal(state):
+             break
+
+     return tokens
+ ```
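+
+ To make the pseudocode concrete, here is a self-contained toy instantiation (the two-state grammar, three-token vocabulary, and stub scoring function are invented for illustration; a real system would take logits from an LLM):
+
+ ```python
+ import random
+
+ VOCAB = ["a", "b", "c"]
+ TRANSITIONS = {0: {"a": 1}, 1: {"b": 1, "c": 2}}  # FSM for "a b* c"; state 2 accepts
+
+ def fake_logits():
+     # Stand-in for an LLM forward pass: a random score per vocabulary token
+     return {tok: random.random() for tok in VOCAB}
+
+ def constrained_decode(max_tokens=10):
+     state, tokens = 0, []
+     for _ in range(max_tokens):
+         logits = fake_logits()
+         valid = TRANSITIONS[state]  # mask: only tokens with a defined transition
+         tok = max(valid, key=lambda t: logits[t])  # greedy pick among valid tokens
+         tokens.append(tok)
+         state = valid[tok]
+         if state == 2:  # accept state reached
+             break
+     return tokens
+
+ print(constrained_decode())  # e.g. ['a', 'c'] or ['a', 'b', 'b', 'c']
+ ```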
+
+ Mathematical Formulation:
+
+ Let $\mathcal{G}$ be a grammar defining valid outputs, and $\mathcal{L}(\mathcal{G})$ the language it accepts.
+
+ Standard decoding samples:
+
+ $$ t_i \sim \text{softmax}(\mathbf{z}_i) $$
+
+ Constrained decoding samples:
+
+ $$ t_i \sim \text{softmax}(\mathbf{z}_i + \mathbf{m}_i) $$
+
+ where the mask $\mathbf{m}_i$ is:
+
+ $$ m_i^{(j)} = \begin{cases} 0 & \text{if } t_1 \cdots t_{i-1}\, j \text{ is a prefix of some string in } \mathcal{L}(\mathcal{G}) \\ -\infty & \text{otherwise} \end{cases} $$
+
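+ As a quick sanity check on the formula, here is the masked softmax in plain NumPy (the values are illustrative only):
+
+ ```python
+ import numpy as np
+
+ logits = np.array([2.0, 1.0, 0.5, -1.0])       # scores for a toy 4-token vocabulary
+ mask = np.array([0.0, -np.inf, 0.0, -np.inf])  # tokens 1 and 3 are grammar-invalid
+
+ z = logits + mask
+ probs = np.exp(z - z[np.isfinite(z)].max())  # exp(-inf) underflows cleanly to 0
+ probs /= probs.sum()
+ print(probs)  # invalid tokens receive exactly zero probability
+ ```
+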
+ ### JSON Schema to FSM Conversion
+
+ Converting a JSON schema to an FSM involves:
+
+ 1. Tokenize the schema structure: `{`, `"field"`, `:`, `[`, numbers, strings, etc.
+ 2. Build states for each schema element: object start, field names, value types
+ 3. Define transitions: Valid next tokens from each state
+ 4. Handle recursion: For nested objects/arrays
+
+ Example:
+
+ ```
+ {
+   "type": "object",
+   "properties": {
+     "action": {"type": "string", "enum": ["search", "calculate"]},
+     "value": {"type": "number"}
+   },
+   "required": ["action", "value"]
+ }
+ ```
+
+ FSM states:
+
+ ```
+ START → "{" → "action" → ":" → ("search"|"calculate") → "," → "value" → ":" → NUMBER → "}" → END
+ ```
+
+ At each state, only specific tokens are valid (e.g., after `"action":`, only `"search"` or `"calculate"`).
+
+ ## Design Patterns & Architectures
+
+ ### Schema-First Agent Design
+
+ Define schemas before implementing agents:
+
+ ```
+ from pydantic import BaseModel
+
+ class SearchTool(BaseModel):
+     query: str
+     max_results: int = 10
+     filters: dict[str, str] = {}
+
+ class CalculateTool(BaseModel):
+     expression: str
+     precision: int = 2
+
+ # Agent now MUST output one of these
+ ```
+
+ ### Layered Validation
+
+ Combine multiple constraint layers:
+
+ 1. Token-level: Constrained decoding ensures valid syntax
+ 2. Type-level: Schema validation checks types
+ 3. Semantic-level: Business logic validates values
+
+ ```
+ # Layer 1: Constrained decoding produces valid JSON
+ output = constrained_generate(prompt, json_schema)
+
+ # Layer 2: Validate against Pydantic model
+ tool_call = SearchTool.parse_raw(output)
+
+ # Layer 3: Business logic
+ if tool_call.max_results > 100:
+     raise ValueError("max_results too high")
+ ```
+
+ ### Progressive Refinement
+
+ For complex outputs, chain constrained generations:
+
+ ```
+ graph LR
+     A[User Query] --> B[Generate Tool Choice<br/>Constrained: tool names only]
+     B --> C[Generate Parameters<br/>Constrained: specific schema]
+     C --> D[Execute Tool]
+     D --> E[Generate Response<br/>Constrained: response format]
+ ```
+
+ This reduces error accumulation compared to generating everything at once.
+
+ ### Integration with Agent Architectures
+
+ Planner-Executor-Memory Loop:
+
+ ```
+ class ConstrainedAgent:
+     def plan(self, goal: str) -> Plan:
+         # Constrained to Plan schema
+         return constrained_generate(
+             f"Create plan for: {goal}",
+             schema=Plan.schema()
+         )
+
+     def execute(self, step: PlanStep) -> ActionResult:
+         # Constrained to ActionResult schema
+         return constrained_generate(
+             f"Execute: {step.description}",
+             schema=ActionResult.schema()
+         )
+ ```
+
+ ReAct Loop:
+
+ ```
+ def react_step(observation: str) -> ThoughtActionObservation:
+     # Force exactly: Thought: <text>\nAction: <json>\n
+     return constrained_generate(
+         prompt=observation,
+         grammar=react_grammar  # CFG for ReAct format
+     )
+ ```
+
+ ## Practical Application
+
+ ### Small Coding Example: Weather Agent with Outlines
+
+ ```
+ from outlines import models, generate
+ from pydantic import BaseModel
+
+ # Define tool schema
+ class WeatherQuery(BaseModel):
+     location: str
+     unit: str  # "celsius" or "fahrenheit"
+
+ # Load model
+ model = models.transformers("mistralai/Mistral-7B-Instruct-v0.2")
+
+ # Create constrained generator
+ generator = generate.json(model, WeatherQuery)
+
+ # Generate - GUARANTEED to be valid WeatherQuery
+ prompt = """User: What's the weather in Tokyo?
+ Assistant: I'll check the weather. Tool call:
+ """
+
+ result = generator(prompt)
+ # result = WeatherQuery(location="Tokyo", unit="celsius")
+
+ print(f"Location: {result.location}")
+ print(f"Unit: {result.unit}")
+ ```
+
+ ### Integration with LangGraph
+
+ ```
+ from langgraph.graph import StateGraph, END
+ from outlines import models, generate
+ from pydantic import BaseModel
+
+ class AgentState(BaseModel):
+     messages: list[str]
+     next_action: str | None = None
+
+ class ActionSchema(BaseModel):
+     action: str  # "search" | "calculate" | "finish"
+     parameters: dict
+
+ # Constrained action generator
+ model = models.transformers("meta-llama/Llama-3-8B-Instruct")
+ action_generator = generate.json(model, ActionSchema)
+
+ def decide_action(state: AgentState) -> AgentState:
+     prompt = format_prompt(state.messages)  # prompt-formatting helper (not shown)
+     action = action_generator(prompt)  # Always valid ActionSchema
+     state.next_action = action.action
+     return state
+
+ def execute_action(state: AgentState) -> AgentState:
+     # Execute the action (tool-dispatch helper not shown)
+     result = execute(state.next_action)
+     state.messages.append(result)
+     return state
+
+ # Build graph
+ workflow = StateGraph(AgentState)
+ workflow.add_node("decide", decide_action)
+ workflow.add_node("execute", execute_action)
+ workflow.add_conditional_edges("decide",
+     lambda s: END if s.next_action == "finish" else "execute")
+ workflow.add_edge("execute", "decide")  # loop back until the model picks "finish"
+ workflow.set_entry_point("decide")
+
+ app = workflow.compile()
+ ```
+
+ ### OpenAI Structured Outputs
+
+ OpenAI provides native support:
+
+ ```
+ from openai import OpenAI
+ from pydantic import BaseModel
+
+ class ResearchResult(BaseModel):
+     summary: str
+     key_findings: list[str]
+     confidence_score: float
+
+ client = OpenAI()
+
+ # The parse helper accepts a Pydantic model directly (structured output mode)
+ completion = client.beta.chat.completions.parse(
+     model="gpt-4o-2024-08-06",
+     messages=[
+         {"role": "user", "content": "Analyze this research paper: ..."}
+     ],
+     response_format=ResearchResult
+ )
+
+ result = completion.choices[0].message.parsed
+ # Guaranteed to be a valid ResearchResult
+ ```
+
+ ## Latest Developments & Research
+
+ ### Recent Breakthroughs (2023-2025)
+
+ 1. Efficient FSM Construction
+
+ “Faster Constrained Decoding for Open-Domain Generation” (2024)
+
+ - Converts JSON schemas to minimal FSMs in milliseconds
+ - Reduces overhead to <5% latency increase
+ - Open-sourced in Outlines 0.1+ library
+
+ 2. Grammar-Based Prompting
+
+ “Guidance: A Faster, More Efficient Programming Paradigm for Constrained Generation” (Microsoft, 2023)
+
+ - Interleaves constraints with generation
+ - Allows complex patterns like “generate code that compiles”
+ - Used in Microsoft Copilot
+
+ 3. Type-Aware Constrained Decoding
+
+ “TypeT5: Seq2seq Type Inference using Static Analysis” (2024)
+
+ - Combines static type checking with constrained decoding
+ - Ensures generated code is type-safe
+ - 99.7% type correctness on HumanEval
+
+ 4. Semantic Constraints
+
+ “NeuroLogic A*esque Decoding” (2024)
+
+ - Combines probabilistic constraints (logit bias) with hard constraints
+ - Balances fluency and correctness
+ - Used for constrained dialogue generation
+
+ ### Benchmarks
+
+ JSON Schema Compliance (SchemaBench, 2024):
+
+ - Prompt engineering: 82% valid
+ - Post-processing: 89% valid
+ - Constrained decoding: 99.8% valid
+
+ Tool Calling Reliability (FunctionHub, 2024):
+
+ - Standard generation: 76% executable calls
+ - OpenAI function calling: 94% executable
+ - Outlines JSON mode: 99.2% executable
+
+ ### Open Problems
+
+ 1. Multi-modal constraints: Extending to image/audio generation
+ 2. Soft constraints: Probabilistic preferences vs. hard rules
+ 3. Constraint learning: Inferring schemas from examples
+ 4. Distributed decoding: Constraints across multi-agent systems
+ 5. Constraint debugging: Tools to visualize why constraints fail
+
+ ### Ongoing Research
+
+ - Adaptive masking: Learning which tokens to mask based on context
+ - Constraint synthesis: Automatically generating schemas from documentation
+ - Probabilistic grammars: Weighted FSMs for soft guidance
+ - Cross-lingual constraints: Applying constraints to multilingual models
+
+ ## Cross-Disciplinary Insight
+
+ ### Connections to Compiler Theory
+
+ Constrained decoding is essentially parsing in reverse:
+
+ - Parser: Valid string → Abstract syntax tree
+ - Constrained decoder: AST (schema) → Valid string
+
+ Modern compilers use LR parsers that incrementally determine valid next tokens, which is exactly what constrained decoders do. The FSM used is analogous to a parse table in compiler design.
+
+ ### Links to Control Theory
+
+ The constraint can be seen as a controller in a feedback loop:
+
+ ```
+ graph LR
+     A[LLM<br/>Plant] -->|Token probabilities| B[Constraint<br/>Controller]
+     B -->|Masked probabilities| C[Sampler]
+     C -->|Next token| A
+     B -.->|Desired output format| B
+ ```
+
+ This mirrors model predictive control, where future states are constrained to a safe/desired region.
+
+ ### Cognitive Science Parallel
+
+ Human language production involves monitoring: we catch ourselves mid-sentence if we are about to say something incorrect. Constrained decoding is an artificial form of this executive control, filtering invalid “thoughts” before they’re expressed.
+
+ ## Daily Challenge: Build a Constrained SQL Generator
+
+ Goal: Create an AI agent that generates valid SQL queries using constrained decoding.
+
+ Requirements:
+
+ 1. Define a simple SQL grammar (SELECT, FROM, WHERE with basic conditions)
+ 2. Implement constrained decoding to ensure syntactic validity
+ 3. Test on natural language queries
+
+ Starter Code:
+
+ ```
+ from outlines import models, generate
+
+ # Define SQL grammar (simplified)
+ sql_grammar = r"""
+ start: select_stmt
+ select_stmt: "SELECT" columns "FROM" table where_clause?
+ columns: COLUMN ("," COLUMN)*
+ table: WORD
+ where_clause: "WHERE" condition
+ condition: COLUMN OPERATOR VALUE
+
+ COLUMN: /[a-z_]+/
+ OPERATOR: "=" | ">" | "<" | "!="
+ VALUE: /"[^"]*"/ | /[0-9]+/
+ WORD: /[a-z_]+/
+
+ %import common.WS
+ %ignore WS
+ """
+
+ model = models.transformers("your-model")
+ sql_generator = generate.cfg(model, sql_grammar)
+
+ # Test queries
+ queries = [
+     "Find all users with age greater than 30",
+     "Get product names where price is less than 100",
+ ]
+
+ for query in queries:
+     prompt = f"Natural language: {query}\nSQL: "
+     sql = sql_generator(prompt)
+     print(f"Query: {query}")
+     print(f"SQL: {sql}\n")
+ ```
+
+ Extension Challenges:
+
+ 1. Add support for JOIN operations
+ 2. Validate against actual table schemas
+ 3. Measure how often unconstrained models produce invalid SQL
+ 4. Compare generation time with/without constraints
+
+ Time estimate: 20-30 minutes
+
+ ## References & Further Reading
+
+ ### Key Papers
+
+ - “Guidance: A Faster Programming Paradigm for Constrained LLM Generation.” Microsoft Research, 2023. https://github.com/microsoft/guidance
+ - “Outlines: Fast and Flexible Structured Generation.” Normal Computing, 2023. https://github.com/outlines-dev/outlines (paper: https://arxiv.org/abs/2307.09702)
+ - “Grammar-Constrained Decoding for Structured NLP Tasks.” Shin et al., EMNLP 2021. https://arxiv.org/abs/2106.08462
+ - “Constrained Decoding for Neural NLG from Compositional Representations.” Balakrishnan et al., ACL 2019. https://arxiv.org/abs/1906.07220
+ - “A Guided Constrained Decoding for Faithful Text Generation.” Lu et al., NeurIPS 2022. https://arxiv.org/abs/2210.05097
+
+ ### Tools & Libraries
+
+ - Outlines (Python): https://github.com/outlines-dev/outlines
+ - Guidance (Python): https://github.com/microsoft/guidance
+ - LM Format Enforcer (Python): https://github.com/noamgat/lm-format-enforcer
+ - LMQL (Query language): https://lmql.ai/
+ - OpenAI Structured Outputs: https://platform.openai.com/docs/guides/structured-outputs
+
+ ### Blog Posts & Tutorials
+
+ - “Structured Generation with Outlines”: https://outlines-dev.github.io/outlines/welcome/
+ - “How OpenAI’s Structured Outputs Work”: https://cookbook.openai.com/examples/structured_outputs_intro
+ - “Building Reliable Agents with Constrained Decoding” (LangChain blog, 2024): https://blog.langchain.dev/constrained-decoding/
+
+ ### Frameworks Supporting Constrained Outputs
+
+ - LangGraph: Via custom parsers + retry logic
+ - CrewAI: Via Pydantic models + validation
+ - AutoGen: Via response format specifications
+ - LlamaIndex: Via output parsers + constrained generation
+
+ ### Advanced Topics
+
+ - Incremental Parsing: Earley parsers for CFGs
+ - Efficient FSM Minimization: Hopcroft’s algorithm
+ - Probabilistic Context-Free Grammars: Soft constraints
+ - Constraint Propagation: SAT solvers for complex constraints
+
+ ---
+
+ Next Steps:
+
+ 1. Complete the daily challenge to internalize the concepts
+ 2. Experiment with Outlines or Guidance on your own prompts
+ 3. Profile the latency impact of constraints on your use case
+ 4. Design schemas for your agent’s tool calls
+ 5. Read the Outlines paper for implementation details
+
+ Constrained decoding transforms AI agents from unpredictable text generators into reliable system components.
+
+ ---
+
+ # How to build function calling and JSON mode for open-source and fine-tuned LLMs
+ URL: https://baseten.co/blog/how-to-build-function-calling-and-json-mode-for-open-source-and-fine-tuned-llms
+
+ Use a state machine to generate token masks for logit biasing to enable function calling and structured output at the model server level.
+
+ Authors: Bryce Dubayah, Philip Kiely
+ Last updated: May 16, 2025
+
+ Today, we announced support for function calling and structured output for LLMs deployed with our TensorRT-LLM Engine Builder. This adds support at the model server level for two key features:
+
+ Function calling: also known as “tool use,” this feature lets you pass a set of defined tools to an LLM as part of the request body. Based on the prompt, the model selects and returns the most appropriate function/tool from the provided options.
+
+ Structured output: an evolution of “JSON mode,” this feature enforces an output schema defined as part of the LLM input. The LLM output is guaranteed to adhere to the provided schema, with full Pydantic support.
+
+ To introduce these features, we built new capabilities into our customized version of NVIDIA’s Triton inference server. This engineering deep dive explains how the implementation works under the hood: defining schemas and tools, building a state machine, and using logit biasing to force valid output.
+
+ And the best part? Thanks to pre-computed token masks, there’s minimal latency impact from using either feature after the first call with a given schema is completed. You can expect the same tokens per second when generating JSON as when generating ordinary text.
+
+ If you’re looking to get started quickly with these new features, check out our launch announcement and docs for function calling and structured output. For implementation details, keep reading!
+
+ ## How structured output is generated
+
+ To understand how it’s possible to guarantee structured output, we need to dive into the details of how a token is generated during LLM inference. If you’re familiar with LLM inference, you’ll know that a new token is generated on each forward pass through the model. During that forward pass:
+
+ 1. A vector of logits is outputted from the final layer of the LLM’s neural network.
+ 2. A normalization function like softmax is applied to turn the logits into probabilities.
+ 3. Using these probabilities, a token is selected. Depending on settings like `top_p`, `top_k`, `beam_width`, and `temperature`, this may not always be the highest-probability token.
+
+ Structured output uses logit biasing in the first step to guarantee valid tokens are generated.
+
+ ### Logit biasing ensures token validity
+
+ The length of the logit vector outputted in the first step is equal to the number of tokens in the model’s vocabulary. For example, Llama 3 LLMs have a vocabulary of ~128,000 tokens. Thus, the logit vector will have about 128K values. Each logit in the vector is a score representing how much the LLM thinks that the given token from the vocabulary could be the next token in the output sequence.
+
+ For structured output, we only want to generate valid tokens. For example, an array in JSON must have both an opening and closing bracket: `[1, 2, 3]`. If we have already generated `[1, 2, 3` then the valid options are:
+
+ - A comma, a space, and another value, such as `, 4`.
+ - A closing bracket to end the array: `]`.
+
+ From the model’s vocabulary, most of the possible tokens will not be valid at certain points when generating structured output. Logit biasing guarantees valid output structure by identifying every invalid token and setting its score to negative infinity, ensuring that the invalid tokens cannot be generated.
+
+ This discussion of logit biasing raises a natural question: how do we know where we are in the output schema and which tokens are valid?
+
+ ### State machine provides token requirements
+
+ The model server running beneath the inference process is responsible for tracking output format using a state machine. This model server is a modified version of NVIDIA Triton with extra capabilities that we call “Briton” (Baseten + Triton = Briton).
+
+ Using the industry-standard library Outlines, which also powers vLLM, the Briton model server takes the output schema passed with the request, transforms it into a regular expression, then generates a state machine from that regex. We chose Outlines for its robust feature set and reliability.
+
+ However, Outlines is written in Python, while TensorRT-LLM and Triton run in C++ for speed and efficiency. To handle this, we first generate the state machine in Python, then serialize it to Protocol Buffers and load it into the model server.
+
+ Once loaded into the model server, the state machine makes the logit biasing process incredibly efficient. The state machine is cached in memory, and an appropriate token mask – a list of 1s and 0s corresponding to valid and invalid tokens – is created for each node of the state machine for logit biasing. This means that these calculations aren’t made during inference time; rather, existing masks are applied based on which state is active.
+
+ With no token mask calculations happening during token generation, this approach to logit biasing has a negligible effect on model performance, so you’ll get the same high tokens per second that you’re used to from TensorRT-LLM while also ensuring that every token is valid for the provided output schema.
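+
+ A toy sketch of that precomputed-mask idea (not Briton’s actual code; the state IDs and vocabulary here are invented for illustration):
+
+ ```python
+ import numpy as np
+
+ VOCAB_SIZE = 8  # toy vocabulary
+
+ # Computed once per schema, then cached: FSM state -> 0/1 mask over the vocabulary.
+ state_masks = {
+     0: np.array([1, 1, 0, 0, 0, 0, 0, 0]),  # e.g. only '{' or whitespace tokens valid
+     1: np.array([0, 0, 1, 1, 1, 0, 0, 0]),  # e.g. only field-name tokens valid
+ }
+
+ def biased_logits(logits: np.ndarray, state: int) -> np.ndarray:
+     # At inference time, applying a cached mask is a single vectorized operation.
+     return np.where(state_masks[state] == 1, logits, -np.inf)
+
+ logits = np.random.randn(VOCAB_SIZE)
+ print(biased_logits(logits, 0))
+ ```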
+
+ ## How to use function calling
+
+ Function calling works by providing LLMs with a structured description of a set of tools. Based on the prompt, the model selects the most appropriate tool or tools for the task described. Functions can be anything: API calls, ORM access, SQL queries, or just a script.
+
+ A function written to be passed to an LLM — note the descriptive docstring.
+
+ It’s essential to understand that function calling does not give the LLM the capability to execute code. Instead, function calling asks the LLM to choose the most appropriate function from the list of available tools. The actual function execution needs to happen in the same environment that made the LLM call.
+
+ Our function calling implementation follows the OpenAI API spec for compatibility, but applies to any model served with TensorRT-LLM via the Engine Builder that has built-in function calling capabilities (e.g. Llama 3.1 Instruct, but not Llama 3). Using the same logit biasing process that creates structured output, Briton (the modified Triton inference server) guarantees schematically correct tool responses.
+
+ Example payload with function calling via the "tools" key
+
+ Function calling is critical for building agentic workflows and other advanced Compound AI systems. To use function calling for yourself, check out our function calling example in the documentation.
+
+ ## How to use structured output
+
+ The more general structured output feature forces LLMs to return output that adheres to a Pydantic schema. Structured output is valid JSON, but goes beyond JSON mode with support for required and optional fields, multiple data types, and additional validations like maximum length.
+
+ To start, define your output schema as a Pydantic model.
+
+ Pydantic model for a "Person" object. The schema can be passed to an LLM to structure output.
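+
+ The screenshot of that model did not survive extraction; a minimal stand-in for such a schema might be:
+
+ ```python
+ from pydantic import BaseModel, Field
+
+ class Person(BaseModel):
+     first_name: str
+     last_name: str
+     age: int = Field(ge=0)       # extra validation beyond plain JSON mode
+     email: str | None = None     # optional field
+ ```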
+
+ Then, when you add the schema to the LLM call, the model server will build the schema into a state machine and use it for token masking as described above. The LLM inference arguments match the OpenAI API spec for structured output to ensure maximum compatibility.
+
+ Example LLM request payload with a response schema.
+
+ Structured output is useful for a wide range of Compound AI applications as the guaranteed schema adherence means you can integrate LLMs into larger systems without worrying about type errors. To try structured output for your application, start with our structured output example in the documentation.
+
+ ## What to build with function calling and structured output
+
+ While the implementation behind these new features is interesting, what’s even more exciting is the use cases they enable.
+
+ Function calling unlocks a wide range of agentic use cases for open source LLMs. With function calling, you can give agents access to a set of tools to accomplish tasks. As we saw above, the LLM is only able to select the best tool, not actually execute the API call or run the function, so that’s where multi-step AI systems are needed.
+
+ These multi-step, often multi-model systems are commonly known as Compound AI. When building multi-stage Compound AI systems, structured output is critical. With structured output, each component of the system can communicate in valid JSON, preventing errors and avoiding parsing overhead.
+
+ As you build with function calling and structured output, remember that the model server changes don’t enhance quality, they only enforce format. Clear prompting and techniques like few-shot prompting still have their place for getting quality output within the enforced structure.
+
+ Get started building:
+
+ - First, deploy Llama 3.1 8B with the TensorRT-LLM Engine Builder
+ - Then try function calling with an accurate LLM math demo
+ - And get JSON-mode output with a document parsing demo
+
+ # Custom Logits Processors - vLLM
+ URL: https://docs.vllm.ai/en/latest/features/custom_logitsprocs/
+
+ # 404 - Not found
+
+ # GUEST POST - Crafting Unique AI Personas: Harnessing the Power of Logit Bias in Large Language Models | Microsoft Agent Framework
+ URL: https://devblogs.microsoft.com/semantic-kernel/guest-post-crafting-unique-ai-personas-harnessing-the-power-of-logit-bias-in-large-language-models/
+ Author: Anthony Puppo, Software Engineer
+
+ Large Language Models (LLMs) have revolutionized our interaction with software. However, there’s a catch – their responses can be monotonous and impersonal. This is where ‘personas’ come in. They add a human touch to LLMs, transforming generic outputs into customized responses that resonate with users. This is particularly handy in applications like customer service bots and virtual assistants. But how do we create these personas without hefty costs or time investments? The good news is, we can tweak a set of common parameters in most LLMs to influence their output, and that’s what we’ll explore today.
+
+ The examples in this blog post utilize C#, Semantic Kernel, SharpToken and the OpenAI API. If you’d like to follow along and experiment yourself, first create a new console project:
+
+ ```default
+ dotnet new console --framework net7.0
+ ```
+
+ Then install dependencies using NuGet:
+
+ ```default
+ dotnet add package Microsoft.SemanticKernel --prerelease
+ dotnet add package SharpToken
+ ```
+
+ Additionally, I’ve written a small demo application that utilizes some of the techniques discussed in this blog. It is open-source and available on GitHub.
+
+ ### Introduction To How LLMs Work
+
+ In simplified terms, LLMs function a bit like a predictive text engine. They interpret a sentence with each word or part of a word as a ‘token’. The tokens are then transformed into a numerical format that the model can process.
+
+ ```csharp
+ var encoding = GptEncoding.GetEncoding("gpt-3.5-turbo");
+ var rawTokens = encoding.Encode("Wonderful day we're having!");
+ // Wrap each decoded token in quotes so the token boundaries are visible
+ var textTokens = rawTokens.Select((x) => $"\"{encoding.Decode(new() { x })}\"").ToList();
+
+ Console.WriteLine($"Raw tokens: {string.Join(", ", rawTokens)}");
+ Console.WriteLine($"Tokenized text: {string.Join(", ", textTokens)}");
+
+ // Output:
+ // Raw tokens: 62372, 1285, 1938, 584, 2351, 3515, 0
+ // Tokenized text: "Wonder", "ful", " day", " we", "'re", " having", "!"
+ ```
+
+ Based on its prior training, the model predicts what should come next in the sequence. For example, after “I don’t like”, the model might suggest “apples”. This prediction can be thought of as the model’s first draft — it’s good, but can we make it better?
+
+ ### LLM Parameters
+
+ Consumers of LLMs have the ability to adjust certain parameters. This can yield creative, varied, and engaging results.
+
+ ##### Temperature
+
+ This parameter adjusts the entropy of our model’s output. A high temperature makes the model’s output diverse and creative, while a lower temperature results in more focused and predictable responses.
+
+ ##### Top P
+
+ Top P (also known as nucleus sampling) guides the selection of next tokens based on cumulated probabilities. This is a more nuanced measure of controlling randomness that can often lead to more diverse outputs. For example, if the model is predicting the next word in “The cat climbed up the ___”, and the options tree, roof, and wall add up to around 90% probability, Top P, if set at 90%, restricts the model to select among those three options.
+
+ ##### Frequency and Presence Penalties
+
+ Frequency penalty discourages the overuse of specific tokens, and presence penalty penalizes tokens previously used in the output, irrespective of frequency. These mechanisms can be instrumental in quelling repetition and promoting diversity.
+
+ ##### Logit Bias
+
+ Logit bias directly manipulates the logits (the raw, unnormalized scores predicted by the model) for specific tokens before they are passed through the softmax function for probability distribution. By adjusting the logit bias, one can promote or demote particular tokens. For instance, if we want the model to avoid using a certain token, we can assign a negative logit bias to it, making it less likely to be chosen. Likewise, assigning a higher bias will have the model favor the token.
+
+ ### Persona Generation Using Logit Bias
+
+ For a practical demonstration, let’s consider a scenario where we desire our model to generate shorter sentences. To achieve this, we can manipulate the bias for common punctuation such as “.”, “!”, and “?”.
+
+ First, set up the kernel so we can interact with the model:
+
+ ```csharp
+ var kernel = new KernelBuilder()
+     .WithOpenAIChatCompletionService("gpt-3.5-turbo", "<your-openai-api-key>")
+     .Build();
+ ```
+
+ Then call it with our custom settings:
+
+ ```csharp
+ var result = await kernel.InvokeSemanticFunctionAsync(
+     "Describe a rainbow.",
+     requestSettings: new OpenAIRequestSettings()
+     {
+         Temperature = 0,
+         TopP = 1,
+         FrequencyPenalty = 0,
+         PresencePenalty = 0,
+         TokenSelectionBiases = new[] { ".", "!", "?" }
+             .SelectMany((x) => encoding.Encode(x))
+             .ToDictionary((x) => x, (x) => 10)
+     });
+
+ Console.WriteLine(result);
+ ```
1243
+
+ And we get the following:
+
+ A rainbow is a beautiful and natural phenomenon. It appears as a circular arc of colors in the sky. It is formed when sunlight is refracted, or bent, as it passes through raindrops. The sunlight is then reflected inside the raindrop and refracted again. This process causes the light to separate into its component colors. The colors of a rainbow, from top to bottom, are red, orange, yellow, green, blue, indigo, and violet. The colors are vibrant and distinct. The rainbow usually appears after rain showers when the sun is still shining. It can also be seen near waterfalls or fountains. The sight of a rainbow is often associated with joy, hope, and wonder. It is a mesmerizing display of nature’s beauty.
+
+ Conversely, if we make the bias negative:
+
+ ```csharp
+ // Surrounding code omitted for brevity...
+ TokenSelectionBiases = new[] { ".", "!", "?" }
+     .SelectMany((x) => encoding.Encode(x))
+     .ToDictionary((x) => x, (x) => -10)
+ ```
+
+ We then get something like this:
+
+ A rainbow is a beautiful and natural phenomenon that occurs when sunlight is refracted, or bent, by water droplets in the air, creating a spectrum of colors in the sky. Typically, a rainbow appears as a semi-circular arc of vibrant colors, with red being the outermost color and violet being the innermost color, although sometimes a full circle can be seen in certain conditions. The colors of a rainbow, in order, are red, orange, yellow, green, blue, indigo, and violet, often remembered by the acronym ROYGBIV. Each color of the rainbow is distinct and blends seamlessly into the next, creating a stunning display of hues that can be seen against a backdrop of dark clouds or a clear blue sky. Rainbows are often seen after rain showers when the sun emerges from behind the clouds, casting its rays onto the raindrops in the air, causing them to act as tiny prisms that refract the sunlight and create the colorful spectrum. The sight of a rainbow is often associated with feelings of joy, wonder, and hope, as it is a symbol of beauty and harmony in nature. Rainbows are not physical objects that can be touched or approached, but rather optical illusions that appear to be located at a specific distance from the observer, making them seem elusive and magical. Overall, a rainbow is a breathtaking and ephemeral display of colors that captivates the imagination and reminds us of the wonders of the natural world around us.
+
+ The first response is more concise and straightforward, providing a clear and simple explanation of a rainbow. It uses a more casual and conversational tone, making it easier to understand for a general audience.
+
+ The second response is more detailed and comprehensive, providing a more scientific explanation of a rainbow. It uses a more formal and academic tone, making it suitable for a more knowledgeable audience or someone seeking a deeper understanding. The language is more descriptive and the sentences are longer, contributing to a more elaborate and thorough explanation.
+
+ The effects of tweaking logit_bias are evident in these examples, and the modifications show how we can mold the model’s responses to be more in line with a specific persona. By amplifying or diminishing this bias, we can guide the model to generate responses that are concise or verbose, casual or formal, simple or detailed, depending on the desired personality. However, the key lies in balance: overdoing it might result in an overbearing or inconsistent persona, while underdoing it might make the persona feel generic.
+
+ ### Next Steps
+
+ So, what are some of the options for putting the topics discussed here into practice?
+
+ 1. Experiment with Logit Bias: Get hands-on experience with this feature. Start with simple tweaks to the bias values and observe how the output changes. As you gain familiarity, attempt to create a more complex persona by adjusting the bias for a wider range of tokens.
+ 2. Dive into Stylometry: Learn more about stylometry. This field of study can provide insights into how writing styles can be quantified and analyzed, which can be helpful in creating more nuanced personas.
+ 3. Implement Part-of-Speech Tagging: Incorporate part-of-speech tagging. It can be useful in understanding the grammatical structure of the sentences generated by the model (or text from a pre-existing persona you are attempting to emulate). This understanding can help you tune the logit bias more effectively.
+ 4. Randomize Character Creation: Create a corpus of words relevant to the desired persona. Use this corpus to randomly assign attributes to the model’s persona. This can add an element of unpredictability to the model, making it more engaging.
+ 5. Explore Token Frequencies and TF-IDF: Rather than merely looking at the plain text, consider tokenizing it. This approach can be combined with Term Frequency-Inverse Document Frequency (TF-IDF) to assess the frequency of model tokens. This insight can guide the adjustment of logit bias values more appropriately, since models operate at the token level; see the sketch after this list.
+ 6. Combine Parameters: Don’t limit yourself to logit bias. Try combining it with other parameters like temperature and top_p for more nuanced control over the output. Remember, the aim is to create a persona that is consistent, engaging, and believable.
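+
+ As a starting point for item 5, here’s a hedged sketch that reuses the tokenizer from the first snippet to count token frequencies in a persona corpus. A full TF-IDF treatment would additionally weight by rarity against a background corpus; the corpus text and bias value here are invented:
+
+ ```csharp
+ // Reusing `encoding` from the first snippet. The corpus text is invented,
+ // and the bias value of 5 is an arbitrary starting point to tune.
+ var corpus = "Indeed, a most wondrous day! A most splendid day, indeed!";
+
+ var tokenCounts = encoding.Encode(corpus)
+     .GroupBy(t => t)
+     .ToDictionary(g => g.Key, g => g.Count());
+
+ // Nudge the persona's most frequent tokens with a mild positive bias.
+ // A full TF-IDF pass would also divide by each token's frequency in a
+ // neutral background corpus, so generic tokens don't dominate the ranking.
+ var personaBiases = tokenCounts
+     .OrderByDescending(kv => kv.Value)
+     .Take(10)
+     .ToDictionary(kv => kv.Key, kv => 5);
+ ```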
+
+ ### Closing Thoughts
+
+ Crafting unique personas can be a tricky pursuit. Logit bias, however, offers a promising starting point: it’s a tool to help you steer your model’s outputs towards a more personalized touch. Yet it’s important to note that it’s only one piece of the puzzle. While other parameters might not singularly make a huge impact on persona development, their combined use could unlock more possibilities. The journey to mastering persona creation in LLMs is an intriguing one, and hopefully this has given you a useful compass to navigate it.
+
+ Category: Semantic Kernel
+
+ Topics: Personas, Semantic Kernel
+
+ ## Author
+
+ Anthony Puppo
+
+ Software Engineer