xtremeflow 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 Flow Jiang
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,3 @@
1
+ include README.md
2
+ include LICENSE
3
+ recursive-include xtremeflow *.py
@@ -0,0 +1,139 @@
1
+ Metadata-Version: 2.4
2
+ Name: xtremeflow
3
+ Version: 0.1.0
4
+ Summary: XtremeFlow: A high-performance Python asynchronous task scheduler engineered to push LLM workloads to their absolute physical limits
5
+ Author-email: Flow Jiang <flowjzh@gmail.com>
6
+ License: MIT
7
+ Project-URL: Homepage, https://github.com/flowjzh/xtremeflow
8
+ Project-URL: Repository, https://github.com/flowjzh/xtremeflow.git
9
+ Project-URL: Issues, https://github.com/flowjzh/xtremeflow/issues
10
+ Keywords: async,scheduler,rate-limiting,llm,asyncio,concurrency,backpressure
11
+ Classifier: Development Status :: 4 - Beta
12
+ Classifier: Intended Audience :: Developers
13
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
14
+ Classifier: License :: OSI Approved :: MIT License
15
+ Classifier: Programming Language :: Python :: 3
16
+ Classifier: Programming Language :: Python :: 3.9
17
+ Classifier: Programming Language :: Python :: 3.10
18
+ Classifier: Programming Language :: Python :: 3.11
19
+ Classifier: Programming Language :: Python :: 3.12
20
+ Classifier: Programming Language :: Python :: 3.13
21
+ Classifier: Programming Language :: Python :: 3 :: Only
22
+ Classifier: Operating System :: OS Independent
23
+ Classifier: Typing :: Typed
24
+ Requires-Python: >=3.9
25
+ Description-Content-Type: text/markdown
26
+ License-File: LICENSE
27
+ Provides-Extra: dev
28
+ Requires-Dist: pytest>=8.4.2; extra == "dev"
29
+ Requires-Dist: pytest-asyncio>=1.2.0; extra == "dev"
30
+ Dynamic: license-file
31
+
32
+ # XtremeFlow
33
+
34
+ > **"Exhaust rate limits, not patience. Squeezing maximum throughput from every second."**
35
+
36
+ ### 🦅 About
37
+
38
+ **XtremeFlow** is a high-performance asynchronous task scheduler engineered to push **Large Language Model (LLM)** workloads to their absolute physical limits.
39
+
40
+ **The Problem:**
41
+ LLM providers throttle your velocity with a combination of **Concurrency**, **RPS**/**RPM**, and **TPS**/**TPM** limits. Most schedulers are defensive—they wait too long, leave gaps in your schedule, and waste capacity. In high-volume production, idle time is a lost resource.
42
+
43
+ **The XtremeFlow Philosophy:**
44
+ Stop being polite with your rate limits. **XtremeFlow is offensive.** It is designed to saturate your provider's capacity with surgical precision. Using a unique **Backpressure Reflex**, it maintains peak velocity until the very moment a limit is hit, executes a synchronized global cool-down, and resumes at full speed the millisecond the provider allows.
45
+
46
+ > ⚠️ **Limitation:** XtremeFlow is currently optimized for **single-process** `asyncio` applications. It manages state in-memory and does not support distributed rate limiting (e.g., Redis-based) out of the box.
47
+
48
+ ### ⚡ Key Features
49
+
50
+ * **Aggressive Saturation**: Engineered to fill every available millisecond of your allowed rate, ensuring zero wasted throughput.
51
+ * **Backpressure Reflex**: Automatically detects 429 triggers and orchestrates a global **Exponential Backoff** across all workers to stay in perfect sync with provider resets.
52
+ * **Dynamic Calibration**: Supports post-request reporting of *actual* usage to instantly "refund" over-estimated capacity back to the scheduler.
53
+ * **Async-Native**: Built on `asyncio` for low-latency scheduling where every microsecond counts.
54
+ * **KV Cache Optimization**: Provides utilities to maximize KV cache utilization across parallel LLM requests, dramatically reducing token consumption and improving throughput.
55
+ * **Async Pipeline**: Producer-consumer pipeline for streaming workloads with automatic backpressure handling.
56
+
57
+ ### 🚀 Quick Start
58
+
59
+ ```python
+ import asyncio
+ from openai import RateLimitError
+ from xtremeflow.scheduler.rate_limit import auto_backoff
+ from xtremeflow.scheduler.token import TokenRateScheduler, report_token_usage
+
+ # Initialize: 10 concurrent slots, 60 RPM, 50k TPM
+ scheduler = TokenRateScheduler(
+     max_concurrency=10,
+     max_rpm=60,
+     max_tpm=50000
+ )
+
+ @auto_backoff(retry_for=RateLimitError, base_retry_after=2.0)
+ async def call_llm_api(prompt: str):
+     """
+     Wraps an LLM call with the Backpressure Reflex.
+     Global synchronization ensures you don't keep hitting the wall during cooldown.
+     """
+     print(f"Executing task: {prompt}")
+
+     # Simulated API call
+     await asyncio.sleep(1)
+
+     # Calibration: Refund unused quota to the scheduler
+     report_token_usage(actual=450)
+
+     return "success"
+
+ async def main():
+     tasks = []
+     for i in range(10):
+         # Dispatch with an estimated cost to saturate the current limit
+         t = await scheduler.start_task(
+             call_llm_api(f"Task {i}"),
+             estimated_tokens=500
+         )
+         tasks.append(t)
+
+     results = await asyncio.gather(*tasks)
+     print(f"XtremeFlow: Successfully processed {len(results)} tasks at peak throughput.")
+
+ if __name__ == "__main__":
+     asyncio.run(main())
+ ```
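+
+ If your provider returns a `Retry-After` header on 429 responses, you can forward it by raising `RetryException` with `retry_after` set; `auto_backoff` will then honor the provider's own reset time instead of the exponential default, as long as the call runs under `scheduler.start_task` so the cool-down is shared globally. A minimal sketch (the `call_provider` helper and its response object are hypothetical stand-ins for your HTTP client):
+
+ ```python
+ from xtremeflow.scheduler.rate_limit import RetryException, auto_backoff
+
+ @auto_backoff(max_retries=3)
+ async def call_with_header_backoff(prompt: str):
+     response = await call_provider(prompt)  # hypothetical raw HTTP call
+     if response.status_code == 429:
+         # Hand the provider's cool-down to the global Backpressure Reflex
+         raise RetryException(
+             "Rate limit exceeded",
+             retry_after=float(response.headers.get("Retry-After", 2.0)),
+         )
+     return response
+ ```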
104
+
105
+ ### 🔥 Performance Tools
106
+
107
+ Beyond rate limiting, XtremeFlow provides utilities to maximize token efficiency and throughput.
108
+
109
+ **KV Cache Optimization** (`kv_batch`)
110
+ ```python
111
+ from xtremeflow.kvbatch import kv_batch
112
+
113
+ # First request establishes KV cache, rest run in parallel
114
+ task = kv_batch(
115
+ llm_score(prompt) for prompt in same_job_with_different_resumes
116
+ )
117
+ results = await task
118
+ ```
119
+ Reduces token consumption by 40-60% for batched requests with shared prefixes.
120
+
121
+ **Async Pipeline** (`async_pipeline`)
122
+ ```python
+ from xtremeflow.pipeline import async_pipeline
+
+ # Producer: scheduler-controlled, exhausts this tier's rate limit
+ async def producer(queue: asyncio.Queue):
+     async for item in source:
+         task = await scheduler.start_task(llm_api(item), estimated_tokens=estimate_tokens)
+         await queue.put(task)
+
+ # Processor: slower sequential processing, yields to next tier
+ async def process_item(item):
+     result = await item
+     return await db_write(result)  # Different rate limit tier
+
+ async for result in async_pipeline(producer, process_item):
+     yield result  # Can chain to another tier
+ ```
139
+ Decouples rate limit tiers—exhausting each tier's limit frees up quota for other tasks immediately, maximizing overall system throughput.
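+
+ For example, a second tier can consume the first pipeline's output inside its own producer, so each tier drains its own quota independently. A rough sketch, where `notify_scheduler` (e.g., a separate `RequestRateScheduler`) and `notify_api` are hypothetical:
+
+ ```python
+ # Tier 2 producer: consume tier 1 results, dispatch against a second scheduler
+ async def notify_producer(queue: asyncio.Queue):
+     async for result in async_pipeline(producer, process_item):
+         await queue.put(await notify_scheduler.start_task(notify_api(result)))
+
+ async def await_result(task):
+     return await task
+
+ # Tier 2 pipeline runs at its own rate limit, independent of the LLM tier
+ async for done in async_pipeline(notify_producer, await_result):
+     print(done)
+ ```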
@@ -0,0 +1,108 @@
1
+ # XtremeFlow
2
+
3
+ > **"Exhaust rate limits, not patience. Squeezing maximum throughput from every second."**
4
+
5
+ ### 🦅 About
6
+
7
+ **XtremeFlow** is a high-performance asynchronous task scheduler engineered to push **Large Language Model (LLM)** workloads to their absolute physical limits.
8
+
9
+ **The Problem:**
10
+ LLM providers throttle your velocity with a combination of **Concurrency**, **RPS**/**RPM**, and **TPS**/**TPM** limits. Most schedulers are defensive—they wait too long, leave gaps in your schedule, and waste capacity. In high-volume production, idle time is a lost resource.
11
+
12
+ **The XtremeFlow Philosophy:**
13
+ Stop being polite with your rate limits. **XtremeFlow is offensive.** It is designed to saturate your provider's capacity with surgical precision. Using a unique **Backpressure Reflex**, it maintains peak velocity until the very moment a limit is hit, executes a synchronized global cool-down, and resumes at full speed the millisecond the provider allows.
14
+
15
+ > ⚠️ **Limitation:** XtremeFlow is currently optimized for **single-process** `asyncio` applications. It manages state in-memory and does not support distributed rate limiting (e.g., Redis-based) out of the box.
16
+
17
+ ### ⚡ Key Features
18
+
19
+ * **Aggressive Saturation**: Engineered to fill every available millisecond of your allowed rate, ensuring zero wasted throughput.
20
+ * **Backpressure Reflex**: Automatically detects 429 triggers and orchestrates a global **Exponential Backoff** across all workers to stay in perfect sync with provider resets.
21
+ * **Dynamic Calibration**: Supports post-request reporting of *actual* usage to instantly "refund" over-estimated capacity back to the scheduler.
22
+ * **Async-Native**: Built on `asyncio` for low-latency scheduling where every microsecond counts.
23
+ * **KV Cache Optimization**: Provides utilities to maximize KV cache utilization across parallel LLM requests, dramatically reducing token consumption and improving throughput.
24
+ * **Async Pipeline**: Producer-consumer pipeline for streaming workloads with automatic backpressure handling.
25
+
26
+ ### 🚀 Quick Start
27
+
28
+ ```python
+ import asyncio
+ from openai import RateLimitError
+ from xtremeflow.scheduler.rate_limit import auto_backoff
+ from xtremeflow.scheduler.token import TokenRateScheduler, report_token_usage
+
+ # Initialize: 10 concurrent slots, 60 RPM, 50k TPM
+ scheduler = TokenRateScheduler(
+     max_concurrency=10,
+     max_rpm=60,
+     max_tpm=50000
+ )
+
+ @auto_backoff(retry_for=RateLimitError, base_retry_after=2.0)
+ async def call_llm_api(prompt: str):
+     """
+     Wraps an LLM call with the Backpressure Reflex.
+     Global synchronization ensures you don't keep hitting the wall during cooldown.
+     """
+     print(f"Executing task: {prompt}")
+
+     # Simulated API call
+     await asyncio.sleep(1)
+
+     # Calibration: Refund unused quota to the scheduler
+     report_token_usage(actual=450)
+
+     return "success"
+
+ async def main():
+     tasks = []
+     for i in range(10):
+         # Dispatch with an estimated cost to saturate the current limit
+         t = await scheduler.start_task(
+             call_llm_api(f"Task {i}"),
+             estimated_tokens=500
+         )
+         tasks.append(t)
+
+     results = await asyncio.gather(*tasks)
+     print(f"XtremeFlow: Successfully processed {len(results)} tasks at peak throughput.")
+
+ if __name__ == "__main__":
+     asyncio.run(main())
+ ```
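+
+ If your provider returns a `Retry-After` header on 429 responses, you can forward it by raising `RetryException` with `retry_after` set; `auto_backoff` will then honor the provider's own reset time instead of the exponential default, as long as the call runs under `scheduler.start_task` so the cool-down is shared globally. A minimal sketch (the `call_provider` helper and its response object are hypothetical stand-ins for your HTTP client):
+
+ ```python
+ from xtremeflow.scheduler.rate_limit import RetryException, auto_backoff
+
+ @auto_backoff(max_retries=3)
+ async def call_with_header_backoff(prompt: str):
+     response = await call_provider(prompt)  # hypothetical raw HTTP call
+     if response.status_code == 429:
+         # Hand the provider's cool-down to the global Backpressure Reflex
+         raise RetryException(
+             "Rate limit exceeded",
+             retry_after=float(response.headers.get("Retry-After", 2.0)),
+         )
+     return response
+ ```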
73
+
74
+ ### 🔥 Performance Tools
75
+
76
+ Beyond rate limiting, XtremeFlow provides utilities to maximize token efficiency and throughput.
77
+
78
+ **KV Cache Optimization** (`kv_batch`)
79
+ ```python
80
+ from xtremeflow.kvbatch import kv_batch
81
+
82
+ # First request establishes KV cache, rest run in parallel
83
+ task = kv_batch(
84
+ llm_score(prompt) for prompt in same_job_with_different_resumes
85
+ )
86
+ results = await task
87
+ ```
88
+ Reduces token consumption by 40-60% for batched requests with shared prefixes.
89
+
90
+ **Async Pipeline** (`async_pipeline`)
91
+ ```python
+ from xtremeflow.pipeline import async_pipeline
+
+ # Producer: scheduler-controlled, exhausts this tier's rate limit
+ async def producer(queue: asyncio.Queue):
+     async for item in source:
+         task = await scheduler.start_task(llm_api(item), estimated_tokens=estimate_tokens)
+         await queue.put(task)
+
+ # Processor: slower sequential processing, yields to next tier
+ async def process_item(item):
+     result = await item
+     return await db_write(result)  # Different rate limit tier
+
+ async for result in async_pipeline(producer, process_item):
+     yield result  # Can chain to another tier
+ ```
108
+ Decouples rate limit tiers—exhausting each tier's limit frees up quota for other tasks immediately, maximizing overall system throughput.
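+
+ For example, a second tier can consume the first pipeline's output inside its own producer, so each tier drains its own quota independently. A rough sketch, where `notify_scheduler` (e.g., a separate `RequestRateScheduler`) and `notify_api` are hypothetical:
+
+ ```python
+ # Tier 2 producer: consume tier 1 results, dispatch against a second scheduler
+ async def notify_producer(queue: asyncio.Queue):
+     async for result in async_pipeline(producer, process_item):
+         await queue.put(await notify_scheduler.start_task(notify_api(result)))
+
+ async def await_result(task):
+     return await task
+
+ # Tier 2 pipeline runs at its own rate limit, independent of the LLM tier
+ async for done in async_pipeline(notify_producer, await_result):
+     print(done)
+ ```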
@@ -0,0 +1,50 @@
1
+ [project]
2
+ name = 'xtremeflow'
3
+ version = '0.1.0'
4
+ description = 'XtremeFlow: A high-performance Python asynchronous task scheduler engineered to push LLM workloads to their absolute physical limits'
5
+ readme = 'README.md'
6
+ requires-python = '>=3.9'
7
+ license = {text = 'MIT'}
8
+ authors = [
9
+ {name = 'Flow Jiang', email = 'flowjzh@gmail.com'},
10
+ ]
11
+ keywords = [
12
+ 'async',
13
+ 'scheduler',
14
+ 'rate-limiting',
15
+ 'llm',
16
+ 'asyncio',
17
+ 'concurrency',
18
+ 'backpressure',
19
+ ]
20
+ classifiers = [
21
+ 'Development Status :: 4 - Beta',
22
+ 'Intended Audience :: Developers',
23
+ 'Topic :: Software Development :: Libraries :: Python Modules',
24
+ 'License :: OSI Approved :: MIT License',
25
+ 'Programming Language :: Python :: 3',
26
+ 'Programming Language :: Python :: 3.9',
27
+ 'Programming Language :: Python :: 3.10',
28
+ 'Programming Language :: Python :: 3.11',
29
+ 'Programming Language :: Python :: 3.12',
30
+ 'Programming Language :: Python :: 3.13',
31
+ 'Programming Language :: Python :: 3 :: Only',
32
+ 'Operating System :: OS Independent',
33
+ 'Typing :: Typed',
34
+ ]
35
+ dependencies = []
36
+
37
+ [project.optional-dependencies]
38
+ dev = [
39
+ 'pytest>=8.4.2',
40
+ 'pytest-asyncio>=1.2.0',
41
+ ]
42
+
43
+ [project.urls]
44
+ Homepage = 'https://github.com/flowjzh/xtremeflow'
45
+ Repository = 'https://github.com/flowjzh/xtremeflow.git'
46
+ Issues = 'https://github.com/flowjzh/xtremeflow/issues'
47
+
48
+ [build-system]
49
+ requires = ['setuptools>=61.0', 'wheel']
50
+ build-backend = 'setuptools.build_meta'
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+
@@ -0,0 +1,70 @@
1
+ import asyncio
2
+ import pytest
3
+ from xtremeflow.kvbatch import kv_batch
4
+
5
+
6
+ @pytest.mark.asyncio
7
+ async def test_single_batch_execution():
8
+ executed = []
9
+
10
+ async def mock_task(name):
11
+ await asyncio.sleep(0.01)
12
+ executed.append(name)
13
+ return f'result_{name}'
14
+
15
+ task = kv_batch(mock_task(n) for n in ['a', 'b', 'c'])
16
+ results = await task
17
+
18
+ assert len(results) == 3
19
+ assert results == ['result_a', 'result_b', 'result_c']
20
+ assert executed == ['a', 'b', 'c']
21
+
22
+
23
+ @pytest.mark.asyncio
24
+ async def test_first_wait_pattern():
25
+ execution_order = []
26
+
27
+ async def tracked_task(name):
28
+ execution_order.append(f'{name}_start')
29
+ await asyncio.sleep(0.05)
30
+ execution_order.append(f'{name}_end')
31
+ return name
32
+
33
+ task = kv_batch(tracked_task(n) for n in ['first', 'second', 'third'])
34
+ await task
35
+
36
+ assert execution_order[0] == 'first_start'
37
+ assert execution_order[1] == 'first_end'
38
+
39
+
40
+ @pytest.mark.asyncio
41
+ async def test_parallel_execution_after_first():
42
+ start_times = {}
43
+
44
+ async def timed_task(name):
45
+ start_times[name] = asyncio.get_event_loop().time()
46
+ await asyncio.sleep(0.05)
47
+ return name
48
+
49
+ task = kv_batch(timed_task(n) for n in ['first', 'second', 'third', 'fourth'])
50
+ await task
51
+
52
+ assert start_times['first'] < start_times['second']
53
+
54
+ rest_starts = [start_times[k] for k in ['second', 'third', 'fourth']]
55
+ max_diff = max(rest_starts) - min(rest_starts)
56
+ assert max_diff < 0.03
57
+
58
+
59
+ @pytest.mark.asyncio
60
+ async def test_exception_handling():
61
+ async def failing_task(name):
62
+ await asyncio.sleep(0.01)
63
+ if name == 'fail':
64
+ raise ValueError('Test error')
65
+ return name
66
+
67
+ task = kv_batch(failing_task(n) for n in ['ok', 'fail', 'ok2'])
68
+
69
+ with pytest.raises(ValueError, match='Test error'):
70
+ await task
@@ -0,0 +1,276 @@
1
+ import asyncio
2
+ import time
3
+ from unittest.mock import patch
4
+
5
+ from xtremeflow.scheduler.request import RequestRateScheduler
6
+ from xtremeflow.scheduler.token import TokenRateScheduler
7
+
8
+
9
+ async def test_request_rate_rps_limiting():
10
+ scheduler = RequestRateScheduler(max_concurrency=10, max_rps=10)
11
+
12
+ start = time.time()
13
+ tasks = []
14
+
15
+ for i in range(20):
16
+ async def mock_task(n=i):
17
+ await asyncio.sleep(0.01)
18
+ return n
19
+
20
+ task = await scheduler.start_task(mock_task())
21
+ tasks.append(task)
22
+
23
+ await asyncio.gather(*tasks)
24
+ elapsed = time.time() - start
25
+
26
+ actual_rps = 20 / elapsed
27
+ expected_rps = 10.0
28
+ error_pct = abs(actual_rps - expected_rps) / expected_rps * 100
29
+ assert error_pct < 5, f'RPS error {error_pct:.1f}% exceeds 5%, expected {expected_rps}, got {actual_rps:.2f}'
30
+
31
+
32
+ async def test_request_rate_rpm_limiting():
33
+ scheduler = RequestRateScheduler(max_concurrency=10, max_rpm=600)
34
+
35
+ start = time.time()
36
+ tasks = []
37
+
38
+ for i in range(10):
39
+ async def mock_task(n=i):
40
+ await asyncio.sleep(0.01)
41
+ return n
42
+
43
+ task = await scheduler.start_task(mock_task())
44
+ tasks.append(task)
45
+
46
+ await asyncio.gather(*tasks)
47
+ elapsed = time.time() - start
48
+
49
+ actual_rpm = (10 / elapsed) * 60
50
+ expected_rpm = 600.0
51
+ error_pct = abs(actual_rpm - expected_rpm) / expected_rpm * 100
52
+ assert error_pct < 5, f'RPM error {error_pct:.1f}% exceeds 5%, expected {expected_rpm}, got {actual_rpm:.2f}'
53
+
54
+
55
+ async def test_token_rate_tps_limiting():
56
+ scheduler = TokenRateScheduler(max_concurrency=10, max_tps=500)
57
+
58
+ start = time.time()
59
+ tasks = []
60
+
61
+ for i in range(50):
62
+ async def mock_task(n=i):
63
+ await asyncio.sleep(0.01)
64
+ return n
65
+
66
+ task = await scheduler.start_task(mock_task(), estimated_tokens=10)
67
+ tasks.append(task)
68
+
69
+ await asyncio.gather(*tasks)
70
+ elapsed = time.time() - start
71
+
72
+ actual_tps = (50 * 10) / elapsed
73
+ expected_tps = 500.0
74
+ error_pct = abs(actual_tps - expected_tps) / expected_tps * 100
75
+ assert error_pct < 5, f'TPS error {error_pct:.1f}% exceeds 5%, expected {expected_tps}, got {actual_tps:.2f}'
76
+
77
+
78
+ async def test_token_rate_tpm_limiting():
79
+ scheduler = TokenRateScheduler(max_concurrency=10, max_tpm=30000)
80
+
81
+ start = time.time()
82
+ tasks = []
83
+
84
+ for i in range(50):
85
+ async def mock_task(n=i):
86
+ await asyncio.sleep(0.01)
87
+ return n
88
+
89
+ task = await scheduler.start_task(mock_task(), estimated_tokens=10)
90
+ tasks.append(task)
91
+
92
+ await asyncio.gather(*tasks)
93
+ elapsed = time.time() - start
94
+
95
+ actual_tpm = (50 * 10 / elapsed) * 60
96
+ expected_tpm = 30000.0
97
+ error_pct = abs(actual_tpm - expected_tpm) / expected_tpm * 100
98
+ assert error_pct < 5, f'TPM error {error_pct:.1f}% exceeds 5%, expected {expected_tpm}, got {actual_tpm:.2f}'
99
+
100
+
101
+ async def test_token_rate_scheduler_with_token_correction():
102
+ scheduler = TokenRateScheduler(max_concurrency=10, max_tps=100)
103
+
104
+ async def overestimated_task():
105
+ await asyncio.sleep(0.01)
106
+ from xtremeflow.scheduler.token import report_token_usage
107
+ report_token_usage(actual=25)
108
+ return 'done'
109
+
110
+ start = time.time()
111
+ tasks = []
112
+
113
+ for _ in range(5):
114
+ task = await scheduler.start_task(overestimated_task(), estimated_tokens=50)
115
+ tasks.append(task)
116
+
117
+ await asyncio.gather(*tasks)
118
+ elapsed = time.time() - start
119
+
120
+ assert elapsed < 2.0, 'Token correction should speed up processing'
121
+
122
+
123
+ async def test_concurrency_backpressure():
124
+ scheduler = RequestRateScheduler(max_concurrency=5, max_rps=10)
125
+
126
+ active_count = 0
127
+ max_active = 0
128
+
129
+ async def track_active():
130
+ await asyncio.sleep(1)
131
+ return 'done'
132
+
133
+ start = time.time()
134
+ tasks = []
135
+
136
+ original_create_task = asyncio.create_task
137
+
138
+ def tracked_create_task(coro, *args, **kwargs):
139
+ nonlocal active_count, max_active
140
+ active_count += 1
141
+ max_active = max(max_active, active_count)
142
+
143
+ task = original_create_task(coro, *args, **kwargs)
144
+
145
+ def on_done(_):
146
+ nonlocal active_count
147
+ active_count -= 1
148
+
149
+ task.add_done_callback(on_done)
150
+ return task
151
+
152
+ with patch('asyncio.create_task', side_effect=tracked_create_task):
153
+ for _ in range(10):
154
+ task = await scheduler.start_task(track_active())
155
+ tasks.append(task)
156
+
157
+ await asyncio.gather(*tasks)
158
+ elapsed = time.time() - start
159
+
160
+ assert max_active <= 5, f'Max concurrent tasks should be <= 5, got {max_active}'
161
+ assert elapsed >= 1.8, f'Should take at least 1.8s with 10 tasks @ 5 concurrency, got {elapsed:.2f}s'
162
+
163
+
164
+ async def test_auto_backoff_retry_with_default_exponential():
165
+ from xtremeflow.scheduler.rate_limit import RetryException, auto_backoff
166
+
167
+ scheduler = RequestRateScheduler(max_concurrency=1)
168
+ attempt_count = 0
169
+
170
+ @auto_backoff(base_retry_after=0.1, max_retries=3)
171
+ async def failing_task():
172
+ nonlocal attempt_count
173
+ attempt_count += 1
174
+ if attempt_count < 3:
175
+ raise RetryException('Rate limit exceeded')
176
+ return 'success'
177
+
178
+ start = time.time()
179
+ task = await scheduler.start_task(failing_task())
180
+ result = await task
181
+ elapsed = time.time() - start
182
+
183
+ assert result == 'success'
184
+ assert attempt_count == 3, f'Expected 3 attempts, got {attempt_count}'
185
+ assert elapsed >= 0.3, f'Expected at least 0.3s for exponential backoff (0.1 + 0.2), got {elapsed:.2f}s'
186
+
187
+
188
+ async def test_auto_backoff_with_custom_retry_after():
189
+ from xtremeflow.scheduler.rate_limit import RetryException, auto_backoff
190
+
191
+ scheduler = RequestRateScheduler(max_concurrency=1)
192
+ attempt_count = 0
193
+
194
+ @auto_backoff(max_retries=2)
195
+ async def failing_task_with_custom_wait():
196
+ nonlocal attempt_count
197
+ attempt_count += 1
198
+ if attempt_count == 1:
199
+ raise RetryException('Rate limit exceeded', retry_after=0.15)
200
+ return 'success'
201
+
202
+ start = time.time()
203
+ task = await scheduler.start_task(failing_task_with_custom_wait())
204
+ result = await task
205
+ elapsed = time.time() - start
206
+
207
+ assert result == 'success'
208
+ assert attempt_count == 2
209
+ assert 0.14 <= elapsed <= 0.17, f'Expected ~0.15s wait, got {elapsed:.2f}s'
210
+
211
+
212
+ async def test_backoff_blocks_other_tasks():
213
+ from xtremeflow.scheduler.rate_limit import RetryException, auto_backoff
214
+
215
+ scheduler = RequestRateScheduler(max_concurrency=2)
216
+ attempt_count = 0
217
+
218
+ @auto_backoff(base_retry_after=0.3, max_retries=2)
219
+ async def failing_task():
220
+ nonlocal attempt_count
221
+ attempt_count += 1
222
+ if attempt_count == 1:
223
+ raise RetryException('Rate limit exceeded', retry_after=0.3)
224
+ return 'failing_success'
225
+
226
+ async def normal_task():
227
+ await asyncio.sleep(0.6)
228
+ return 'normal_success'
229
+
230
+ start = time.time()
231
+ task1 = await scheduler.start_task(failing_task())
232
+ task2 = await scheduler.start_task(normal_task())
233
+ await asyncio.gather(task1, task2)
234
+ elapsed = time.time() - start
235
+
236
+ assert attempt_count == 2
237
+ # Timeline:
238
+ # - failing_task fails immediately (t=0)
239
+ # - Sets _backoff_until = 0.3
240
+ # - normal_task waits at _wait_for_quota() for 0.3s
241
+ # - At t=0.3s, normal_task starts and takes 0.6s
242
+ # - At t=0.3+s, failing_task retries and succeeds
243
+ # - Both complete around t=0.9s
244
+ assert 0.85 <= elapsed <= 0.95, f'Expected ~0.9s total (0.3 backoff + 0.6 execution), got {elapsed:.2f}s'
245
+
246
+
247
+ async def test_reset_quota():
248
+ scheduler = RequestRateScheduler(max_concurrency=10, max_rps=10)
249
+ await asyncio.sleep(1)
250
+
251
+ start = time.time()
252
+ tasks = []
253
+ for i in range(10):
254
+ async def mock_task(n=i):
255
+ await asyncio.sleep(0)
256
+ return n
257
+ task = await scheduler.start_task(mock_task())
258
+ tasks.append(task)
259
+ await asyncio.gather(*tasks)
260
+ elapsed = time.time() - start
261
+ assert elapsed < 0.1, f'Expected burst execution <0.1s with full bucket, got {elapsed:.2f}s'
262
+
263
+ await asyncio.sleep(1)
264
+ scheduler.reset_quota()
265
+
266
+ start = time.time()
267
+ tasks = []
268
+ for i in range(10):
269
+ async def mock_task(n=i):
270
+ await asyncio.sleep(0)
271
+ return n
272
+ task = await scheduler.start_task(mock_task())
273
+ tasks.append(task)
274
+ await asyncio.gather(*tasks)
275
+ elapsed = time.time() - start
276
+ assert elapsed >= 0.9, f'Expected ~1s with empty bucket after reset, got {elapsed:.2f}s'
@@ -0,0 +1,5 @@
1
+ '''
2
+ XtremeFlow: A high-performance asynchronous task scheduler for LLM workloads.
3
+ '''
4
+
5
+ __version__ = '0.1.0'
@@ -0,0 +1,59 @@
1
+ '''Helper for KV cache-optimized async task batches.
2
+
3
+ This module provides utilities for executing async tasks with a "first-wait,
4
+ then-parallel" pattern optimized for KV cache utilization in LLM applications.
5
+
6
+ Execution Pattern:
7
+
8
+ Input: [task1, task2, task3, ...]
9
+
10
+ ┌────────────────────────────────────┐
11
+ │ Phase 1: First Task │
12
+ │ task1 runs to completion │
13
+ │ (establishes KV cache) │
14
+ └────────────────────────────────────┘
15
+
16
+ ┌────────────────────────────────────┐
17
+ │ Phase 2: Parallel Tasks │
18
+ │ task2, task3, ... run concurrently │
19
+ │ (share the established cache) │
20
+ └────────────────────────────────────┘
21
+
22
+ Output: [result1, result2, result3, ...]
23
+
24
+ Use Case Example:
25
+
26
+ When scoring multiple resumes for the same job, each request shares the
27
+ job description prefix. The first request establishes a KV cache for the
28
+ job description. Subsequent requests can then run in parallel, leveraging
29
+ the cached computation for better performance.
30
+ '''
31
+
32
+ import asyncio
33
+ from typing import Awaitable, Iterable, List, TypeVar
34
+
35
+ T = TypeVar('T')
36
+
37
+
38
+ async def _process_aws(*aws: Awaitable[T]) -> List[T]:
39
+ results = [await aws[0]] if aws else []
40
+ results += await asyncio.gather(*aws[1:])
41
+ return results
42
+
43
+
44
+ def kv_batch(aws: Iterable[Awaitable[T]]) -> asyncio.Task[List[T]]:
45
+ '''Create a batch task with KV cache optimization.
46
+
47
+ Args:
48
+ aws: An iterable of awaitables to process.
49
+
50
+ Returns:
51
+ An asyncio.Task that completes with a list of results.
52
+
53
+ Example:
54
+ >>> task = kv_batch(
55
+ ... llm_score(prompt) for prompt in same_job_with_different_resumes
56
+ ... )
57
+ >>> results = await task
58
+ '''
59
+ return asyncio.create_task(_process_aws(*aws))
@@ -0,0 +1,39 @@
1
+ import asyncio
2
+ from typing import Any, Callable, AsyncGenerator, AsyncIterable, Optional
3
+
4
+
5
+ async def async_chunks(iterable: AsyncIterable, size: int):
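+     '''Yield items from the async iterable in lists of at most size items; the final chunk may be shorter.'''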
6
+ it = aiter(iterable)
7
+ while True:
8
+ chunk = []
9
+ for _ in range(size):
10
+ try:
11
+ item = await anext(it)
12
+ chunk.append(item)
13
+ except StopAsyncIteration:
14
+ if chunk:
15
+ yield chunk
16
+ return
17
+ yield chunk
18
+
19
+
20
+ async def async_pipeline(
21
+ producer: Callable[[asyncio.Queue], Any], process_item: Optional[Callable[[Any], Any]] = None
22
+ ) -> AsyncGenerator[Any, None]:  # typing.AsyncGenerator requires (yield, send) type arguments before Python 3.13
23
+ queue = asyncio.Queue()
24
+
25
+ async def producer_wrapper():
26
+ await producer(queue)
27
+ await queue.put(None)
28
+
29
+ producer_task = asyncio.create_task(producer_wrapper())  # keep a reference so the producer task is not garbage-collected mid-run
30
+
31
+ while True:
32
+ item = await queue.get()
33
+ if item is None:
34
+ break
35
+
36
+ try:
37
+ yield await process_item(item) if process_item else item
38
+ finally:
39
+ queue.task_done()
@@ -0,0 +1,5 @@
1
+ from .base import TaskScheduler
2
+
3
+ __all__ = [
4
+ 'TaskScheduler',
5
+ ]
@@ -0,0 +1,35 @@
1
+ import asyncio
2
+ from typing import Any, Coroutine
3
+
4
+
5
+ class TaskScheduler:
6
+ def __init__(self, max_concurrency: int):
7
+ if max_concurrency <= 0:
8
+ raise ValueError(f'max_concurrency must be positive, got {max_concurrency}')
9
+
10
+ self.semaphore = asyncio.Semaphore(max_concurrency)
11
+ self.active_tasks = 0
12
+ self.total_completed = 0
13
+ self.pending_tasks = set()
14
+
15
+ async def _execute_coro(self, coro: Coroutine, **kwargs) -> Any:
16
+ return await coro
17
+
18
+ def _task_done(self, task):
19
+ self.pending_tasks.discard(task)
20
+ self.active_tasks -= 1
21
+ self.total_completed += 1
22
+ self.semaphore.release()
23
+
24
+ async def start_task(self, coro: Coroutine, **kwargs) -> asyncio.Task:
25
+ await self.semaphore.acquire()
26
+ self.active_tasks += 1
27
+ task = asyncio.create_task(self._execute_coro(coro, **kwargs))
28
+ self.pending_tasks.add(task)
29
+ task.add_done_callback(self._task_done)
30
+ return task
31
+
32
+ async def wait_pending(self):
33
+ if self.pending_tasks:
34
+ await asyncio.gather(*self.pending_tasks)
35
+ self.pending_tasks.clear()
@@ -0,0 +1,111 @@
1
+ from __future__ import annotations
2
+
3
+ import asyncio
4
+ import logging
5
+ import time
6
+ from abc import ABC, abstractmethod
7
+ from contextvars import ContextVar
8
+ from dataclasses import dataclass
9
+ from functools import wraps
10
+ from typing import Any, Coroutine, Optional, Type, Union
11
+
12
+ from .base import TaskScheduler
13
+
14
+ logger = logging.getLogger(__name__)
15
+
16
+ _current_ctx: ContextVar['Optional[ExecutionContext]'] = ContextVar('_current_ctx', default=None)
17
+
18
+
19
+ @dataclass
20
+ class ExecutionContext:
21
+ scheduler: RateLimitScheduler
22
+ extra: Optional[dict] = None
23
+
24
+
25
+ class RetryException(Exception):
26
+ def __init__(self, message: str = '', retry_after: Optional[float] = None):
27
+ super().__init__(message)
28
+ self.retry_after = retry_after
29
+
30
+
31
+ def auto_backoff(
32
+ retry_for: Union[Type[Exception], list[Type[Exception]], None] = None,
33
+ max_retries: int = 3,
34
+ base_retry_after: float = 2.0,
35
+ exponential: bool = True
36
+ ):
37
+ if retry_for is None:
38
+ retry_types = (RetryException,)
39
+ elif isinstance(retry_for, list):
40
+ retry_types = tuple(retry_for)
41
+ else:
42
+ retry_types = retry_for
43
+
44
+ def decorator(func):
45
+ @wraps(func)
46
+ async def wrapper(*args, **kwargs):
47
+ last_exc = None
48
+ for attempt in range(max_retries + 1):
49
+ try:
50
+ return await func(*args, **kwargs)
51
+ except retry_types as e:
52
+ last_exc = e
53
+ ctx = _current_ctx.get()
54
+
55
+ if attempt < max_retries and ctx:
56
+ header_wait = getattr(e, 'retry_after', None)
57
+ if header_wait is not None and isinstance(header_wait, (int, float)):
58
+ wait_sec = float(header_wait)
59
+ else:
60
+ wait_sec = base_retry_after * (2 ** attempt) if exponential else base_retry_after
61
+ logger.warning(
62
+ f'Retrying in {wait_sec:.1f}s '
63
+ f'(attempt {attempt + 1}/{max_retries}): {e}'
64
+ )
65
+ ctx.scheduler.notify_rate_limit_exceeded(wait_sec)
66
+ await asyncio.sleep(wait_sec)
67
+ continue
68
+ raise last_exc
69
+ return wrapper
70
+ return decorator
71
+
72
+
73
+ def get_context() -> Optional[ExecutionContext]:
74
+ return _current_ctx.get()
75
+
76
+
77
+ class RateLimitScheduler(TaskScheduler, ABC):
78
+ def __init__(self, max_concurrency: int, init_ratio: float = 0.0):
79
+ super().__init__(max_concurrency)
80
+ self._backoff_until = 0.0
81
+ self._initial_ratio = init_ratio
82
+
83
+ def notify_rate_limit_exceeded(self, retry_after: float):
84
+ self._backoff_until = max(self._backoff_until, time.monotonic() + retry_after)
85
+
86
+ def _get_backoff_wait(self) -> float:
87
+ return max(0.0, self._backoff_until - time.monotonic())
88
+
89
+ def _get_wait_time(self) -> float:
90
+ return self._get_backoff_wait()
91
+
92
+ @abstractmethod
93
+ def _consume_rate_quota(self):
94
+ '''Consume rate limit quota. Subclasses must implement this.'''
95
+
96
+ async def _wait_for_quota(self):
97
+ while True:
98
+ wait_time = self._get_wait_time()
99
+ if wait_time <= 0:
100
+ self._consume_rate_quota()
101
+ break
102
+ await asyncio.sleep(wait_time)
103
+
104
+ async def _execute_coro(self, coro: Coroutine, ctx_extra=None, **kwargs) -> Any:
105
+ ctx = ExecutionContext(scheduler=self, extra=ctx_extra)
106
+ token = _current_ctx.set(ctx)
107
+ try:
108
+ await self._wait_for_quota()
109
+ return await super()._execute_coro(coro, **kwargs)
110
+ finally:
111
+ _current_ctx.reset(token)
@@ -0,0 +1,49 @@
1
+ import time
2
+ from typing import Optional
3
+
4
+ from .rate_limit import RateLimitScheduler
5
+
6
+
7
+ class RequestRateScheduler(RateLimitScheduler):
8
+ def __init__(
9
+ self,
10
+ max_rps: Optional[int] = None,
11
+ max_rpm: Optional[int] = None,
12
+ *args,
13
+ **kwargs
14
+ ):
15
+ super().__init__(*args, **kwargs)
16
+ self._max_rps = max_rps
17
+ self._max_rpm = max_rpm
18
+ self._rps_bucket = (max_rps or 0) * self._initial_ratio
19
+ self._rpm_bucket = (max_rpm or 0) * self._initial_ratio
20
+ self._last_req_update = time.monotonic()
21
+
22
+ def _get_wait_time(self) -> float:
23
+ now = time.monotonic()
24
+ delta = now - self._last_req_update
25
+ self._last_req_update = now
26
+
27
+ if self._max_rps:
28
+ self._rps_bucket = min(float(self._max_rps), self._rps_bucket + delta * self._max_rps)
29
+ if self._max_rpm:
30
+ self._rpm_bucket = min(float(self._max_rpm), self._rpm_bucket + delta * (self._max_rpm / 60.0))
31
+
32
+ waits = [super()._get_wait_time()]
33
+ if self._max_rps and self._rps_bucket < 1:
34
+ waits.append((1 - self._rps_bucket) / self._max_rps)
35
+ if self._max_rpm and self._rpm_bucket < 1:
36
+ waits.append((1 - self._rpm_bucket) / (self._max_rpm / 60.0))
37
+
38
+ return max(waits)
39
+
40
+ def _consume_rate_quota(self):
41
+ if self._max_rps:
42
+ self._rps_bucket -= 1
43
+ if self._max_rpm:
44
+ self._rpm_bucket -= 1
45
+
46
+ def reset_quota(self):
47
+ self._rps_bucket = (self._max_rps or 0) * self._initial_ratio
48
+ self._rpm_bucket = (self._max_rpm or 0) * self._initial_ratio
49
+ self._last_req_update = time.monotonic()
@@ -0,0 +1,79 @@
1
+ import asyncio
2
+ import time
3
+ from typing import Optional, cast
4
+
5
+ from .request import RequestRateScheduler
6
+ from .rate_limit import get_context
7
+
8
+
9
+ class TokenRateScheduler(RequestRateScheduler):
10
+ def __init__(
11
+ self,
12
+ max_tps: Optional[int] = None,
13
+ max_tpm: Optional[int] = None,
14
+ *args,
15
+ **kwargs
16
+ ):
17
+ super().__init__(*args, **kwargs)
18
+ self._max_tps = max_tps
19
+ self._max_tpm = max_tpm
20
+ self._tps_bucket = (self._max_tps or 0) * self._initial_ratio
21
+ self._tpm_bucket = (self._max_tpm or 0) * self._initial_ratio
22
+ self._last_token_update = time.monotonic()
23
+
24
+ async def start_task(self, coro, estimated_tokens: int, **kwargs) -> asyncio.Task:
25
+ return await super().start_task(
26
+ coro, ctx_extra={'estimated_tokens': estimated_tokens},
27
+ **kwargs)
28
+
29
+ def _get_wait_time(self) -> float:
30
+ now = time.monotonic()
31
+ delta = now - self._last_token_update
32
+ self._last_token_update = now
33
+
34
+ if self._max_tps:
35
+ self._tps_bucket = min(float(self._max_tps), self._tps_bucket + delta * self._max_tps)
36
+ if self._max_tpm:
37
+ self._tpm_bucket = min(float(self._max_tpm), self._tpm_bucket + delta * (self._max_tpm / 60.0))
38
+
39
+ waits = [super()._get_wait_time()]
40
+
41
+ ctx = get_context()
42
+ tokens = ctx.extra.get('estimated_tokens', 0)
43
+ if tokens > 0:
44
+ if self._max_tps and self._tps_bucket < tokens:
45
+ waits.append((tokens - self._tps_bucket) / self._max_tps)
46
+ if self._max_tpm and self._tpm_bucket < tokens:
47
+ waits.append((tokens - self._tpm_bucket) / (self._max_tpm / 60.0))
48
+ return max(waits)
49
+
50
+ def _consume_rate_quota(self):
51
+ super()._consume_rate_quota()
52
+ ctx = get_context()
53
+ tokens = ctx.extra.get('estimated_tokens', 0)
54
+ if tokens > 0:
55
+ if self._max_tps:
56
+ self._tps_bucket -= tokens
57
+ if self._max_tpm:
58
+ self._tpm_bucket -= tokens
59
+
60
+ def _apply_correction(self, actual: int):
61
+ ctx = get_context()
62
+ estimated = ctx.extra.get('estimated_tokens', 0)
63
+ diff = estimated - actual
64
+ if diff == 0:
65
+ return
66
+ if self._max_tps:
67
+ self._tps_bucket = min(float(self._max_tps), self._tps_bucket + diff)
68
+ if self._max_tpm:
69
+ self._tpm_bucket = min(float(self._max_tpm), self._tpm_bucket + diff)
70
+
71
+ def reset_quota(self):
72
+ self._tps_bucket = (self._max_tps or 0) * self._initial_ratio
73
+ self._tpm_bucket = (self._max_tpm or 0) * self._initial_ratio
74
+ self._last_token_update = time.monotonic()
75
+
76
+
77
+ def report_token_usage(actual: int):
78
+ ctx = get_context()
79
+ cast(TokenRateScheduler, ctx.scheduler)._apply_correction(actual)
@@ -0,0 +1,139 @@
1
+ Metadata-Version: 2.4
2
+ Name: xtremeflow
3
+ Version: 0.1.0
4
+ Summary: XtremeFlow: A high-performance Python asynchronous task scheduler engineered to push LLM workloads to their absolute physical limits
5
+ Author-email: Flow Jiang <flowjzh@gmail.com>
6
+ License: MIT
7
+ Project-URL: Homepage, https://github.com/flowjzh/xtremeflow
8
+ Project-URL: Repository, https://github.com/flowjzh/xtremeflow.git
9
+ Project-URL: Issues, https://github.com/flowjzh/xtremeflow/issues
10
+ Keywords: async,scheduler,rate-limiting,llm,asyncio,concurrency,backpressure
11
+ Classifier: Development Status :: 4 - Beta
12
+ Classifier: Intended Audience :: Developers
13
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
14
+ Classifier: License :: OSI Approved :: MIT License
15
+ Classifier: Programming Language :: Python :: 3
16
+ Classifier: Programming Language :: Python :: 3.9
17
+ Classifier: Programming Language :: Python :: 3.10
18
+ Classifier: Programming Language :: Python :: 3.11
19
+ Classifier: Programming Language :: Python :: 3.12
20
+ Classifier: Programming Language :: Python :: 3.13
21
+ Classifier: Programming Language :: Python :: 3 :: Only
22
+ Classifier: Operating System :: OS Independent
23
+ Classifier: Typing :: Typed
24
+ Requires-Python: >=3.9
25
+ Description-Content-Type: text/markdown
26
+ License-File: LICENSE
27
+ Provides-Extra: dev
28
+ Requires-Dist: pytest>=8.4.2; extra == "dev"
29
+ Requires-Dist: pytest-asyncio>=1.2.0; extra == "dev"
30
+ Dynamic: license-file
31
+
32
+ # XtremeFlow
33
+
34
+ > **"Exhaust rate limits, not patience. Squeezing maximum throughput from every second."**
35
+
36
+ ### 🦅 About
37
+
38
+ **XtremeFlow** is a high-performance asynchronous task scheduler engineered to push **Large Language Model (LLM)** workloads to their absolute physical limits.
39
+
40
+ **The Problem:**
41
+ LLM providers throttle your velocity with a combination of **Concurrency**, **RPS**/**RPM**, and **TPS**/**TPM** limits. Most schedulers are defensive—they wait too long, leave gaps in your schedule, and waste capacity. In high-volume production, idle time is a lost resource.
42
+
43
+ **The XtremeFlow Philosophy:**
44
+ Stop being polite with your rate limits. **XtremeFlow is offensive.** It is designed to saturate your provider's capacity with surgical precision. Using a unique **Backpressure Reflex**, it maintains peak velocity until the very moment a limit is hit, executes a synchronized global cool-down, and resumes at full speed the millisecond the provider allows.
45
+
46
+ > ⚠️ **Limitation:** XtremeFlow is currently optimized for **single-process** `asyncio` applications. It manages state in-memory and does not support distributed rate limiting (e.g., Redis-based) out of the box.
47
+
48
+ ### ⚡ Key Features
49
+
50
+ * **Aggressive Saturation**: Engineered to fill every available millisecond of your allowed rate, ensuring zero wasted throughput.
51
+ * **Backpressure Reflex**: Automatically detects 429 triggers and orchestrates a global **Exponential Backoff** across all workers to stay in perfect sync with provider resets.
52
+ * **Dynamic Calibration**: Supports post-request reporting of *actual* usage to instantly "refund" over-estimated capacity back to the scheduler.
53
+ * **Async-Native**: Built on `asyncio` for low-latency scheduling where every microsecond counts.
54
+ * **KV Cache Optimization**: Provides utilities to maximize KV cache utilization across parallel LLM requests, dramatically reducing token consumption and improving throughput.
55
+ * **Async Pipeline**: Producer-consumer pipeline for streaming workloads with automatic backpressure handling.
56
+
57
+ ### 🚀 Quick Start
58
+
59
+ ```python
+ import asyncio
+ from openai import RateLimitError
+ from xtremeflow.scheduler.rate_limit import auto_backoff
+ from xtremeflow.scheduler.token import TokenRateScheduler, report_token_usage
+
+ # Initialize: 10 concurrent slots, 60 RPM, 50k TPM
+ scheduler = TokenRateScheduler(
+     max_concurrency=10,
+     max_rpm=60,
+     max_tpm=50000
+ )
+
+ @auto_backoff(retry_for=RateLimitError, base_retry_after=2.0)
+ async def call_llm_api(prompt: str):
+     """
+     Wraps an LLM call with the Backpressure Reflex.
+     Global synchronization ensures you don't keep hitting the wall during cooldown.
+     """
+     print(f"Executing task: {prompt}")
+
+     # Simulated API call
+     await asyncio.sleep(1)
+
+     # Calibration: Refund unused quota to the scheduler
+     report_token_usage(actual=450)
+
+     return "success"
+
+ async def main():
+     tasks = []
+     for i in range(10):
+         # Dispatch with an estimated cost to saturate the current limit
+         t = await scheduler.start_task(
+             call_llm_api(f"Task {i}"),
+             estimated_tokens=500
+         )
+         tasks.append(t)
+
+     results = await asyncio.gather(*tasks)
+     print(f"XtremeFlow: Successfully processed {len(results)} tasks at peak throughput.")
+
+ if __name__ == "__main__":
+     asyncio.run(main())
+ ```
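+
+ If your provider returns a `Retry-After` header on 429 responses, you can forward it by raising `RetryException` with `retry_after` set; `auto_backoff` will then honor the provider's own reset time instead of the exponential default, as long as the call runs under `scheduler.start_task` so the cool-down is shared globally. A minimal sketch (the `call_provider` helper and its response object are hypothetical stand-ins for your HTTP client):
+
+ ```python
+ from xtremeflow.scheduler.rate_limit import RetryException, auto_backoff
+
+ @auto_backoff(max_retries=3)
+ async def call_with_header_backoff(prompt: str):
+     response = await call_provider(prompt)  # hypothetical raw HTTP call
+     if response.status_code == 429:
+         # Hand the provider's cool-down to the global Backpressure Reflex
+         raise RetryException(
+             "Rate limit exceeded",
+             retry_after=float(response.headers.get("Retry-After", 2.0)),
+         )
+     return response
+ ```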
104
+
105
+ ### 🔥 Performance Tools
106
+
107
+ Beyond rate limiting, XtremeFlow provides utilities to maximize token efficiency and throughput.
108
+
109
+ **KV Cache Optimization** (`kv_batch`)
110
+ ```python
111
+ from xtremeflow.kvbatch import kv_batch
112
+
113
+ # First request establishes KV cache, rest run in parallel
114
+ task = kv_batch(
115
+ llm_score(prompt) for prompt in same_job_with_different_resumes
116
+ )
117
+ results = await task
118
+ ```
119
+ Reduces token consumption by 40-60% for batched requests with shared prefixes.
120
+
121
+ **Async Pipeline** (`async_pipeline`)
122
+ ```python
+ from xtremeflow.pipeline import async_pipeline
+
+ # Producer: scheduler-controlled, exhausts this tier's rate limit
+ async def producer(queue: asyncio.Queue):
+     async for item in source:
+         task = await scheduler.start_task(llm_api(item), estimated_tokens=estimate_tokens)
+         await queue.put(task)
+
+ # Processor: slower sequential processing, yields to next tier
+ async def process_item(item):
+     result = await item
+     return await db_write(result)  # Different rate limit tier
+
+ async for result in async_pipeline(producer, process_item):
+     yield result  # Can chain to another tier
+ ```
139
+ Decouples rate limit tiers—exhausting each tier's limit frees up quota for other tasks immediately, maximizing overall system throughput.
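+
+ For example, a second tier can consume the first pipeline's output inside its own producer, so each tier drains its own quota independently. A rough sketch, where `notify_scheduler` (e.g., a separate `RequestRateScheduler`) and `notify_api` are hypothetical:
+
+ ```python
+ # Tier 2 producer: consume tier 1 results, dispatch against a second scheduler
+ async def notify_producer(queue: asyncio.Queue):
+     async for result in async_pipeline(producer, process_item):
+         await queue.put(await notify_scheduler.start_task(notify_api(result)))
+
+ async def await_result(task):
+     return await task
+
+ # Tier 2 pipeline runs at its own rate limit, independent of the LLM tier
+ async for done in async_pipeline(notify_producer, await_result):
+     print(done)
+ ```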
@@ -0,0 +1,19 @@
1
+ LICENSE
2
+ MANIFEST.in
3
+ README.md
4
+ pyproject.toml
5
+ tests/test_kvbatch.py
6
+ tests/test_scheduler.py
7
+ xtremeflow/__init__.py
8
+ xtremeflow/kvbatch.py
9
+ xtremeflow/pipeline.py
10
+ xtremeflow.egg-info/PKG-INFO
11
+ xtremeflow.egg-info/SOURCES.txt
12
+ xtremeflow.egg-info/dependency_links.txt
13
+ xtremeflow.egg-info/requires.txt
14
+ xtremeflow.egg-info/top_level.txt
15
+ xtremeflow/scheduler/__init__.py
16
+ xtremeflow/scheduler/base.py
17
+ xtremeflow/scheduler/rate_limit.py
18
+ xtremeflow/scheduler/request.py
19
+ xtremeflow/scheduler/token.py
@@ -0,0 +1,4 @@
1
+
2
+ [dev]
3
+ pytest>=8.4.2
4
+ pytest-asyncio>=1.2.0
@@ -0,0 +1 @@
1
+ xtremeflow