erfi-pytorch 0.1.0__tar.gz

@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 erfi-pytorch contributors
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,4 @@
1
+ include LICENSE
2
+ include README.md
3
+ recursive-include docs *.md
4
+ recursive-include third_party/faddeeva Faddeeva.cc Faddeeva.hh LICENSE README.md
@@ -0,0 +1,128 @@
1
+ Metadata-Version: 2.4
2
+ Name: erfi-pytorch
3
+ Version: 0.1.0
4
+ Summary: GPU-accelerated imaginary error function for real PyTorch tensors
5
+ Author: erfi-pytorch contributors
6
+ License-Expression: MIT
7
+ Project-URL: Homepage, https://github.com/ZhichaoZhu/erfi_pytorch
8
+ Project-URL: Repository, https://github.com/ZhichaoZhu/erfi_pytorch.git
9
+ Project-URL: Issues, https://github.com/ZhichaoZhu/erfi_pytorch/issues
10
+ Project-URL: Documentation, https://github.com/ZhichaoZhu/erfi_pytorch/blob/main/docs/faddeeva.md
11
+ Keywords: pytorch,cuda,triton,special-functions,erfi
12
+ Classifier: Development Status :: 3 - Alpha
13
+ Classifier: Programming Language :: Python :: 3
14
+ Classifier: Programming Language :: Python :: 3.10
15
+ Classifier: Programming Language :: Python :: 3.11
16
+ Classifier: Programming Language :: Python :: 3.12
17
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
18
+ Classifier: Topic :: Scientific/Engineering :: Mathematics
19
+ Requires-Python: >=3.10
20
+ Description-Content-Type: text/markdown
21
+ License-File: LICENSE
22
+ License-File: third_party/faddeeva/LICENSE
23
+ Requires-Dist: torch>=2.7
24
+ Provides-Extra: test
25
+ Requires-Dist: mpmath>=1.3; extra == "test"
26
+ Requires-Dist: pytest>=8; extra == "test"
27
+ Requires-Dist: scipy>=1.11; extra == "test"
28
+ Provides-Extra: benchmark
29
+ Requires-Dist: scipy>=1.11; extra == "benchmark"
30
+ Provides-Extra: triton
31
+ Requires-Dist: triton>=3.3; platform_system == "Linux" and extra == "triton"
32
+ Dynamic: license-file
33
+
34
+ # erfi-pytorch
35
+
36
+ `erfi-pytorch` provides a forward-only imaginary error function for real
37
+ PyTorch tensors:
38
+
39
+ ```python
40
+ import torch
41
+ from erfi_pytorch import erfi
42
+
43
+ x = torch.linspace(-4, 4, 1_000_000, device="cuda")
44
+ y = erfi(x)
45
+ ```
46
+
47
+ The package supports `torch.float32` and `torch.float64` and preserves tensor
48
+ shape, dtype, and device. Its pure-PyTorch graph is compatible with
49
+ `torch.compile(fullgraph=True, backend="eager")`. Inductor compilation depends
50
+ on a working platform compiler or Triton installation and is validated
51
+ separately on supported Linux CUDA environments.
52
+
53
+ ## Backends
54
+
55
+ - **Pure PyTorch:** portable CPU and CUDA implementation.
56
+ - **Triton:** fused path for contiguous NVIDIA CUDA tensors with at least
57
+ 65,536 elements, when Triton is available.
58
+
59
+ Windows and systems without Triton automatically use the pure PyTorch path.
60
+ No CUDA toolkit or native compiler is required.
61
+
62
+ ## Installation
63
+
64
+ ```bash
65
+ pip install erfi-pytorch
66
+ ```
67
+
68
+ For development and reference tests:
69
+
70
+ ```bash
71
+ pip install -e ".[test]"
72
+ ```
73
+
74
+ On Linux, install the optional Triton dependency if it is not already
75
+ provided by your PyTorch installation:
76
+
77
+ ```bash
78
+ pip install -e ".[test,triton]"
79
+ ```
80
+
81
+ ## Numerical method
82
+
83
+ For real `x`, the implementation uses
84
+
85
+ ```text
86
+ erfi(x) = exp(x^2) Im(w(x)),
87
+ ```
88
+
89
+ where `w` is the Faddeeva function. `Im(w(x))` is evaluated with a Taylor
90
+ polynomial near zero and a 100-interval table of low-degree polynomial
91
+ approximations elsewhere. Near floating-point overflow, the final magnitude
92
+ is reconstructed in the log domain so representable results are not lost to
93
+ premature overflow in `exp(x^2)`.
94
+
95
+ The polynomial coefficients originate from Steven G. Johnson's
96
+ MIT-licensed Faddeeva implementation. The original license notice is retained
97
+ in
98
+ [`third_party/faddeeva`](https://github.com/ZhichaoZhu/erfi_pytorch/tree/main/third_party/faddeeva).
99
+
100
+ The detailed implementation notes are in
101
+ [`docs/faddeeva.md`](https://github.com/ZhichaoZhu/erfi_pytorch/blob/main/docs/faddeeva.md).
102
+
103
+ ## License
104
+
105
+ This project is released under the MIT License. The vendored Faddeeva sources
106
+ and material derived from them retain the original Copyright (c) 2012
107
+ Massachusetts Institute of Technology attribution and MIT license notice.
108
+
109
+ ## Limitations
110
+
111
+ - Inputs must be real `torch.float32` or `torch.float64` tensors.
112
+ - This release is forward-only. `requires_grad=True` raises an error.
113
+ - Triton acceleration currently targets NVIDIA CUDA.
114
+ - Windows uses the pure-PyTorch CUDA backend because upstream Triton support
115
+ is not generally available there.
116
+
117
+ ## Benchmark
118
+
119
+ ```bash
120
+ python benchmarks/benchmark_erfi.py --dtype float32
121
+ ```
122
+
123
+ The benchmark covers powers of two from `2^10` through `2^24` and reports
124
+ eager PyTorch, compiled PyTorch, eager dispatch, and compiled dispatch.
125
+ Before timing, it compares the operator against `scipy.special.erfi` and
126
+ reports maximum absolute error, maximum and mean relative error, and infinity
127
+ mismatches. Use `--precision-elements` to change the comparison sample count
128
+ or `--skip-precision` to run timing only.
@@ -0,0 +1,95 @@
1
+ # erfi-pytorch
2
+
3
+ `erfi-pytorch` provides a forward-only imaginary error function for real
4
+ PyTorch tensors:
5
+
6
+ ```python
7
+ import torch
8
+ from erfi_pytorch import erfi
9
+
10
+ x = torch.linspace(-4, 4, 1_000_000, device="cuda")
11
+ y = erfi(x)
12
+ ```
13
+
14
+ The package supports `torch.float32` and `torch.float64` and preserves tensor
15
+ shape, dtype, and device. Its pure-PyTorch graph is compatible with
16
+ `torch.compile(fullgraph=True, backend="eager")`. Inductor compilation depends
17
+ on a working platform compiler or Triton installation and is validated
18
+ separately on supported Linux CUDA environments.
19
+
20
+ ## Backends
21
+
22
+ - **Pure PyTorch:** portable CPU and CUDA implementation.
23
+ - **Triton:** fused path for contiguous NVIDIA CUDA tensors with at least
24
+ 65,536 elements, when Triton is available.
25
+
26
+ Windows and systems without Triton automatically use the pure PyTorch path.
27
+ No CUDA toolkit or native compiler is required.
28
+
29
+ ## Installation
30
+
31
+ ```bash
32
+ pip install erfi-pytorch
33
+ ```
34
+
35
+ For development and reference tests:
36
+
37
+ ```bash
38
+ pip install -e ".[test]"
39
+ ```
40
+
41
+ On Linux, install the optional Triton dependency if it is not already
42
+ provided by your PyTorch installation:
43
+
44
+ ```bash
45
+ pip install -e ".[test,triton]"
46
+ ```
47
+
48
+ ## Numerical method
49
+
50
+ For real `x`, the implementation uses
51
+
52
+ ```text
53
+ erfi(x) = exp(x^2) Im(w(x)),
54
+ ```
55
+
56
+ where `w` is the Faddeeva function. `Im(w(x))` is evaluated with a Taylor
57
+ polynomial near zero and a 100-interval table of low-degree polynomial
58
+ approximations elsewhere. Near floating-point overflow, the final magnitude
59
+ is reconstructed in the log domain so representable results are not lost to
60
+ premature overflow in `exp(x^2)`.
61
+
62
+ The polynomial coefficients originate from Steven G. Johnson's
63
+ MIT-licensed Faddeeva implementation. The original license notice is retained
64
+ in
65
+ [`third_party/faddeeva`](https://github.com/ZhichaoZhu/erfi_pytorch/tree/main/third_party/faddeeva).
66
+
67
+ The detailed implementation notes are in
68
+ [`docs/faddeeva.md`](https://github.com/ZhichaoZhu/erfi_pytorch/blob/main/docs/faddeeva.md).
69
+
70
+ ## License
71
+
72
+ This project is released under the MIT License. The vendored Faddeeva sources
73
+ and material derived from them retain the original Copyright (c) 2012
74
+ Massachusetts Institute of Technology attribution and MIT license notice.
75
+
76
+ ## Limitations
77
+
78
+ - Inputs must be real `torch.float32` or `torch.float64` tensors.
79
+ - This release is forward-only. `requires_grad=True` raises an error.
80
+ - Triton acceleration currently targets NVIDIA CUDA.
81
+ - Windows uses the pure-PyTorch CUDA backend because upstream Triton support
82
+ is not generally available there.
83
+
84
+ ## Benchmark
85
+
86
+ ```bash
87
+ python benchmarks/benchmark_erfi.py --dtype float32
88
+ ```
89
+
90
+ The benchmark covers powers of two from `2^10` through `2^24` and reports
91
+ eager PyTorch, compiled PyTorch, eager dispatch, and compiled dispatch.
92
+ Before timing, it compares the operator against `scipy.special.erfi` and
93
+ reports maximum absolute error, maximum and mean relative error, and infinity
94
+ mismatches. Use `--precision-elements` to change the comparison sample count
95
+ or `--skip-precision` to run timing only.
@@ -0,0 +1,451 @@
1
+ # Faddeeva.cc and Faddeeva.hh
2
+
3
+ ## Overview
4
+
5
+ This code implements a family of related special functions for real and
6
+ complex double-precision arguments:
7
+
8
+ - the Faddeeva function `w(z)`
9
+ - the scaled complementary error function `erfcx(z)`
10
+ - the error function `erf(z)`
11
+ - the imaginary error function `erfi(z)`
12
+ - the complementary error function `erfc(z)`
13
+ - Dawson's integral
14
+
15
+ The central idea is to implement the Faddeeva function accurately over the
16
+ complex plane, then derive most of the other functions from mathematical
17
+ identities. Special real-argument implementations and local Taylor expansions
18
+ are used where they are faster or avoid numerical cancellation.
19
+
20
+ The header places the C++ API in namespace `Faddeeva` and supplies overloads
21
+ for `double` and `std::complex<double>`. Complex functions accept an optional
22
+ relative-error target:
23
+
24
+ ```cpp
25
+ std::complex<double> Faddeeva::w(std::complex<double> z,
26
+ double relerr = 0);
27
+ ```
28
+
29
+ A non-positive `relerr` means approximately machine precision. Internally,
30
+ the value is clamped to the range from `DBL_EPSILON` to `0.1`.
31
+
32
+ ## Mathematical Relationships
33
+
34
+ The primary function is
35
+
36
+ ```text
37
+ w(z) = exp(-z^2) erfc(-i z).
38
+ ```
39
+
40
+ The other important identities are
41
+
42
+ ```text
43
+ erfcx(z) = exp(z^2) erfc(z) = w(i z)
44
+ erfi(z) = -i erf(i z)
45
+ D(z) = sqrt(pi)/2 exp(-z^2) erfi(z)
46
+ ```
47
+
48
+ where `D(z)` denotes Dawson's integral.
49
+
50
+ For real `x`, the Faddeeva function has the especially useful form
51
+
52
+ ```text
53
+ w(x) = exp(-x^2) + i * 2/sqrt(pi) * D(x).
54
+ ```
55
+
56
+ Therefore,
57
+
58
+ ```text
59
+ w_im(x) = Im(w(x)) = 2/sqrt(pi) * D(x)
60
+ erfi(x) = exp(x^2) * w_im(x).
61
+ ```
62
+
63
+ This last identity is the basis of the optimized real `erfi` implementation.
64
+
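The identity above can be checked numerically for modest `x`. The following is a minimal sketch, not the package's or the Faddeeva library's code: `erfi_series`, `dawson_series`, and `erfi_from_dawson` are hypothetical helper names, and both functions are evaluated from their Maclaurin series, which is only practical for small `|x|`.

```python
import math

def erfi_series(x, terms=60):
    """Maclaurin series: erfi(x) = 2/sqrt(pi) * sum x^(2n+1) / (n! (2n+1))."""
    total = 0.0
    power = x      # holds x^(2n+1)
    fact = 1.0     # holds n!
    for n in range(terms):
        total += power / (fact * (2 * n + 1))
        power *= x * x
        fact *= n + 1
    return 2.0 / math.sqrt(math.pi) * total

def dawson_series(x, terms=60):
    """Maclaurin series: D(x) = sum (-2)^n x^(2n+1) / (2n+1)!!."""
    total = 0.0
    term = x       # holds (-2)^n x^(2n+1) / (2n+1)!!
    for n in range(terms):
        total += term
        term *= -2.0 * x * x / (2 * n + 3)
    return total

def erfi_from_dawson(x):
    """erfi(x) = exp(x^2) * w_im(x), with w_im(x) = 2/sqrt(pi) * D(x)."""
    w_im = 2.0 / math.sqrt(math.pi) * dawson_series(x)
    return math.exp(x * x) * w_im

for x in (0.1, 0.5, 1.0):
    print(x, erfi_series(x), erfi_from_dawson(x))
```

The two columns should agree to near machine precision, which is the relationship the optimized real path relies on.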
65
+ ## Structure of the Source
66
+
67
+ ### Portability layer
68
+
69
+ The opening macros allow the same source to compile as either C++ using
70
+ `std::complex<double>` or C using C99 complex numbers. They also normalize
71
+ construction of complex values, infinity, NaN, `isnan`, `isinf`, and
72
+ `copysign` across older compilers.
73
+
74
+ In C++, macros such as
75
+
76
+ ```cpp
77
+ #define FADDEEVA(name) Faddeeva::name
78
+ #define C(a,b) std::complex<double>(a,b)
79
+ ```
80
+
81
+ turn the shared implementation into the namespace API declared by
82
+ `Faddeeva.hh`.
83
+
84
+ ### The `w(z)` numerical engine
85
+
86
+ `w(z)` first handles the coordinate axes:
87
+
88
+ - For purely imaginary `z = i y`, `w(i y) = erfcx(y)`.
89
+ - For real `z = x`, it returns
90
+ `exp(-x*x) + i*w_im(x)`.
91
+
92
+ For general complex arguments, it chooses between two main algorithms.
93
+
94
+ 1. **Large arguments:** a continued-fraction expansion is used because it is
95
+ fast and asymptotically accurate. For extremely large values this reduces
96
+ to one or two terms, such as
97
+ `w(z) ~= i / (sqrt(pi) z)`.
98
+ 2. **Smaller arguments:** a convergent summation based on Algorithm 916 is
99
+ used. Precomputed exponential coefficients accelerate the normal
100
+ machine-precision path.
101
+
102
+ The branch boundary is not just a simple `|z|` test. The code avoids the
103
+ continued fraction near parts of the real axis where the real component of
104
+ `w(z)` would have poor relative accuracy.
105
+
106
+ For arguments in the lower half-plane, it uses the symmetry
107
+
108
+ ```text
109
+ w(z) = 2 exp(-z^2) - w(-z).
110
+ ```
111
+
112
+ Several expressions are algebraically rearranged to avoid intermediate
113
+ overflow. For example, the real part of `-z^2` is computed as
114
+
115
+ ```cpp
116
+ (y - x) * (x + y)
117
+ ```
118
+
119
+ instead of directly evaluating `y*y - x*x`.
120
+
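A small illustration of why the factored form matters, written in Python rather than C++ and using hypothetical function names. With both components near `1e200`, each square overflows to infinity and the naive difference becomes NaN, while the factored product stays finite.

```python
import math

def re_neg_z2_naive(x, y):
    # Direct evaluation: each square can overflow before the subtraction.
    return y * y - x * x

def re_neg_z2_factored(x, y):
    # Same quantity, but the difference is taken before any multiplication.
    return (y - x) * (x + y)

x = y = 1e200
naive = re_neg_z2_naive(x, y)        # inf - inf
factored = re_neg_z2_factored(x, y)  # 0.0 * 2e200

print(naive, factored)
```

The same rearrangement also reduces cancellation error when `x` and `y` are close in magnitude, since the subtraction is performed on the inputs rather than on their rounded squares.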
121
+ ### Real helper functions
122
+
123
+ `erfcx(double)` and `w_im(double)` use similar three-region strategies:
124
+
125
+ - continued fractions for large magnitude arguments;
126
+ - piecewise Chebyshev polynomial approximations for the middle range;
127
+ - Taylor expansions near zero.
128
+
129
+ The Chebyshev lookup tables dominate the size of `Faddeeva.cc`. They are
130
+ precomputed polynomial coefficients, not separate conceptual algorithms.
131
+
132
+ ## Imaginary Error Function
133
+
134
+ The imaginary error function is defined by
135
+
136
+ ```text
137
+ erfi(z) = -i erf(i z)
138
+ = 2/sqrt(pi) integral_0^z exp(t^2) dt.
139
+ ```
140
+
141
+ Unlike `erf(x)`, which approaches `+1` or `-1` on the real axis, `erfi(x)`
142
+ grows approximately like
143
+
144
+ ```text
145
+ erfi(x) ~ exp(x^2) / (sqrt(pi) x).
146
+ ```
147
+
148
+ That rapid growth is why overflow handling matters.
149
+
150
+ ### Complex implementation
151
+
152
+ The complex implementation is:
153
+
154
+ ```cpp
155
+ cmplx FADDEEVA(erfi)(cmplx z, double relerr)
156
+ {
157
+ cmplx e = FADDEEVA(erf)(C(-cimag(z),creal(z)), relerr);
158
+ return C(cimag(e), -creal(e));
159
+ }
160
+ ```
161
+
162
+ If `z = x + i y`, then
163
+
164
+ ```text
165
+ i z = -y + i x.
166
+ ```
167
+
168
+ That explains the input rotation:
169
+
170
+ ```cpp
171
+ C(-imag(z), real(z))
172
+ ```
173
+
174
+ If the result is `e = a + i b`, multiplying it by `-i` gives
175
+
176
+ ```text
177
+ -i(a + i b) = b - i a.
178
+ ```
179
+
180
+ That explains the output rotation:
181
+
182
+ ```cpp
183
+ C(imag(e), -real(e))
184
+ ```
185
+
186
+ Thus the complex `erfi` routine contains no independent approximation.
187
+ It delegates to the complex `erf` implementation and performs two exact
188
+ coordinate rotations. This also means that `relerr` is passed directly to
189
+ the underlying Faddeeva calculation.
190
+
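The two rotations can be exercised in a few lines. This is a sketch under stated assumptions: `erf_series` and `erfi_series` are hypothetical stand-ins built from Maclaurin series (adequate only for small `|z|`), not the stabilized `erf` routine described here; `rotate_in` and `rotate_out` mirror the two `C(...)` expressions.

```python
import math

def _odd_series(z, sign, terms=40):
    """2/sqrt(pi) * sum_n sign^n * z^(2n+1) / (n! (2n+1))."""
    total = 0j
    power = complex(z)   # holds z^(2n+1)
    fact = 1.0           # holds n!
    for n in range(terms):
        total += (sign ** n) * power / (fact * (2 * n + 1))
        power *= z * z
        fact *= n + 1
    return 2.0 / math.sqrt(math.pi) * total

def erf_series(z):
    return _odd_series(z, -1)

def erfi_series(z):
    return _odd_series(z, +1)

def rotate_in(z):
    # i*z for z = x + i y is -y + i x.
    return complex(-z.imag, z.real)

def rotate_out(e):
    # -i*e for e = a + i b is b - i a.
    return complex(e.imag, -e.real)

def erfi_via_erf(z):
    return rotate_out(erf_series(rotate_in(z)))

z = 0.3 + 0.4j
print(erfi_via_erf(z), erfi_series(z))
```

Because both rotations only swap and negate components, they introduce no rounding error of their own; any error in the result comes from the delegated `erf` evaluation.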
191
+ ### Why complex `erf` is the hard part
192
+
193
+ The delegated `erf(z)` routine derives its normal path from `w(z)`, but it
194
+ does not blindly evaluate one formula everywhere. It has several stability
195
+ branches:
196
+
197
+ - real and imaginary axes are handled directly;
198
+ - positive and negative real parts use different symmetry formulas;
199
+ - a Taylor series is used near `z = 0`;
200
+ - a second expansion is used near the imaginary axis when the direct
201
+ expression would subtract nearly equal values.
202
+
203
+ The imaginary-axis identity used there is
204
+
205
+ ```text
206
+ erf(i y) = i erfi(y)
207
+ = i exp(y^2) w_im(y).
208
+ ```
209
+
210
+ This is the `taylor_erfi` branch in `erf(z)`. Despite its label, it is a local
211
+ expansion of `erf(x+i y)` around the imaginary axis, using the real `erfi(y)`
212
+ value as its base term.
213
+
214
+ ### Real implementation
215
+
216
+ The optimized real overload is:
217
+
218
+ ```cpp
219
+ double FADDEEVA_RE(erfi)(double x)
220
+ {
221
+ return x*x > 720 ? (x > 0 ? Inf : -Inf)
222
+ : exp(x*x) * FADDEEVA(w_im)(x);
223
+ }
224
+ ```
225
+
226
+ To derive this formula, begin with the definition of the Faddeeva function
227
+ for real `x`:
228
+
229
+ ```text
230
+ w(x) = exp(-x^2) erfc(-i x).
231
+ ```
232
+
233
+ Using
234
+
235
+ ```text
236
+ erfc(-i x) = 1 - erf(-i x)
237
+ = 1 + i erfi(x),
238
+ ```
239
+
240
+ we obtain
241
+
242
+ ```text
243
+ w(x) = exp(-x^2) [1 + i erfi(x)].
244
+ ```
245
+
246
+ Taking its imaginary part gives
247
+
248
+ ```text
249
+ Im(w(x)) = exp(-x^2) erfi(x),
250
+ ```
251
+
252
+ and hence
253
+
254
+ ```text
255
+ erfi(x) = exp(x^2) Im(w(x))
256
+ = exp(x^2) w_im(x).
257
+ ```
258
+
259
+ It is also useful to express this through Dawson's integral:
260
+
261
+ ```text
262
+ w_im(x) = 2 D(x) / sqrt(pi)
263
+ erfi(x) = 2 exp(x^2) D(x) / sqrt(pi).
264
+ ```
265
+
266
+ This specialized route is faster than constructing a complex value, calling
267
+ complex `erf`, and rotating the answer.
268
+
269
+ The `x*x > 720` check is a coarse explicit overflow guard. The largest finite
270
+ `exp` argument is near 709.78, and `erfi` itself also eventually exceeds the
271
+ double range. For `x*x` between 709.78 and 720, `exp(x*x)` already overflows to
272
+ `Inf` and the product stays a signed infinity. Above the guard, the code returns
273
+ the signed infinity directly. This is especially important for
274
+ infinite or enormous inputs, where evaluating the identity mechanically
275
+ could produce the indeterminate form `Inf * 0` because `w_im(x)` tends to
276
+ zero as `|x|` grows.
277
+
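The `Inf * 0` hazard is easy to reproduce. The sketch below is illustrative only: `w_im_leading` is a hypothetical stand-in that uses just the leading asymptotic term `1/(sqrt(pi) x)`, and `c_exp` is a hypothetical helper that mimics C's `exp` returning `Inf` on overflow, since Python's `math.exp` raises `OverflowError` instead.

```python
import math

def c_exp(v):
    """exp with C-like overflow behaviour: return inf instead of raising."""
    try:
        return math.exp(v)
    except OverflowError:
        return math.inf

def w_im_leading(x):
    # Stand-in for w_im, reasonable only for very large |x| (assumption).
    return 1.0 / (math.sqrt(math.pi) * x)

def erfi_unguarded(x):
    # Mechanical evaluation of exp(x^2) * w_im(x).
    return c_exp(x * x) * w_im_leading(x)

def erfi_guarded(x):
    # Same formula, preceded by the coarse x*x > 720 guard.
    if x * x > 720.0:
        return math.copysign(math.inf, x)
    return c_exp(x * x) * w_im_leading(x)

print(erfi_unguarded(math.inf))   # inf * 0.0
print(erfi_guarded(math.inf))
print(erfi_guarded(-math.inf))
```

For an infinite input the unguarded product is NaN, while the guarded version returns the signed infinity.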
278
+ The result is odd because `w_im` is odd:
279
+
280
+ ```text
281
+ erfi(-x) = -erfi(x).
282
+ ```
283
+
284
+ ### How `w_im(x)` is computed
285
+
286
+ `w_im(x)` is a scaled Dawson integral:
287
+
288
+ ```text
289
+ w_im(x) = 2 D(x) / sqrt(pi).
290
+ ```
291
+
292
+ The actual numerical work for real `erfi` therefore occurs inside `w_im`.
293
+ It divides the real axis into three numerical regions.
294
+
295
+ 1. **Small `|x|`**, approximately `|x| <= 0.0309`:
296
+
297
+ ```text
298
+ w_im(x) = 2/sqrt(pi)
299
+ * (x - 2x^3/3 + 4x^5/15 - 8x^7/105 + 16x^9/945).
300
+ ```
301
+
302
+ The source evaluates this in Horner form:
303
+
304
+ ```cpp
305
+ double x2 = x*x;
306
+ return x * (1.1283791670955125739
307
+ - x2 * (0.75225277806367504925
308
+ - x2 * (0.30090111122547001970
309
+ - x2 * (0.085971746064420005629
310
+ - x2 * 0.016931216931216931217))));
311
+ ```
312
+
313
+ Horner form uses fewer multiplications and generally accumulates less
314
+ rounding error. The explicit leading factor `x` naturally preserves odd
315
+ symmetry. Near zero,
316
+
317
+ ```text
318
+ w_im(x) ~= 2x/sqrt(pi)
319
+ erfi(x) ~= 2x/sqrt(pi),
320
+ ```
321
+
322
+ because `exp(x^2) ~= 1`.
323
+
324
+ 2. **Moderate `|x|`**, up to 45:
325
+
326
+ The code maps positive `x` using
327
+
328
+ ```text
329
+ y = 1 / (1 + x)
330
+ y100 = 100 y.
331
+ ```
332
+
333
+ The call is:
334
+
335
+ ```cpp
336
+ return w_im_y100(100/(1+x), x);
337
+ ```
338
+
339
+    The integer part of `y100` selects one of 100 intervals in a `switch`.
340
+    Each case maps its local interval approximately to `[-1,1]`; for the interval `0 <= y100 < 1`, for example:
341
+
342
+ ```cpp
343
+ double t = 2*y100 - 1;
344
+ ```
345
+
346
+ It then evaluates a low-degree, Chebyshev-derived polynomial in `t`, again
347
+ in Horner form. The large coefficient table in `Faddeeva.cc` is therefore
348
+ a lookup table of fitted polynomials. At runtime the process is simply:
349
+
350
+ ```text
351
+ transform x -> select interval -> evaluate polynomial.
352
+ ```
353
+
354
+ For a moderate negative input, the code evaluates the positive argument
355
+ and negates it:
356
+
357
+ ```cpp
358
+ return -w_im_y100(100/(1-x), -x);
359
+ ```
360
+
361
+ This explicitly applies `w_im(-x) = -w_im(x)`.
362
+
363
+ 3. **Large `|x|`**, `45 < |x| <= 5e7`:
364
+
365
+ A five-term continued-fraction approximation is simplified into a rational
366
+ expression:
367
+
368
+ ```cpp
369
+ return ispi * ((x*x) * (x*x-4.5) + 2)
370
+ / (x * ((x*x) * (x*x-5) + 3.75));
371
+ ```
372
+
373
+ Here `ispi = 1/sqrt(pi)`. This is an algebraically simplified form of
374
+
375
+ ```text
376
+ 1/sqrt(pi)
377
+ ------------------------------------------------
378
+ x - (1/2)/(x - 1/(x - (3/2)/(x - 2/x))).
379
+ ```
380
+
381
+ For `|x| > 5e7`, only the leading asymptotic term is needed:
382
+
383
+ ```text
384
+ w_im(x) ~= 1 / (sqrt(pi) x).
385
+ ```
386
+
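The small- and large-argument regions in the list above can be transcribed into Python for experimentation. This is a sketch, not the library code: the function names are hypothetical, the constants are copied from the Horner snippet, and the Chebyshev region is omitted because its coefficient table is not reproduced in these notes.

```python
import math

ISPI = 1.0 / math.sqrt(math.pi)

# Constants as quoted in the Horner snippet above.
TAYLOR = (1.1283791670955125739, 0.75225277806367504925,
          0.30090111122547001970, 0.085971746064420005629,
          0.016931216931216931217)

def w_im_taylor(x):
    """Small-|x| region, evaluated in Horner form."""
    c0, c1, c2, c3, c4 = TAYLOR
    x2 = x * x
    return x * (c0 - x2 * (c1 - x2 * (c2 - x2 * (c3 - x2 * c4))))

def w_im_rational(x):
    """Large-|x| region: the simplified rational expression."""
    x2 = x * x
    return ISPI * (x2 * (x2 - 4.5) + 2.0) / (x * (x2 * (x2 - 5.0) + 3.75))

def w_im_nested_cf(x):
    """The same approximation written as the nested continued fraction."""
    return ISPI / (x - 0.5 / (x - 1.0 / (x - 1.5 / (x - 2.0 / x))))

print(w_im_taylor(0.01), w_im_rational(50.0), w_im_nested_cf(50.0))
```

Evaluating the fractions directly shows that the first four quoted constants equal `2/sqrt(pi)` times `1`, `2/3`, `4/15`, and `8/105`, while the fifth equals `16/945` itself. With `|x| <= 0.0309` the `x^9` term contributes on the order of `1e-14` of the result, so the difference sits at the rounding level.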
387
+ Combining this asymptotic form with the real `erfi` formula gives the expected
388
+ large-argument behavior:
389
+
390
+ ```text
391
+ erfi(x) ~= exp(x^2) / (sqrt(pi) x).
392
+ ```
393
+
394
+ The complete real execution path is therefore:
395
+
396
+ ```text
397
+ erfi(x)
398
+ |
399
+ +-- x^2 > 720? --> return signed infinity
400
+ |
401
+ +-- compute w_im(x)
402
+ | +-- tiny x: Taylor polynomial
403
+ | +-- moderate x: piecewise Chebyshev polynomials
404
+ | +-- large x: continued-fraction rational approximation
405
+ | +-- enormous x: 1/(sqrt(pi) x)
406
+ |
407
+ +-- return exp(x^2) * w_im(x)
408
+ ```
409
+
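The dispatch in the diagram can be sketched as a classifier that reports which branch an input would take. The names are hypothetical and the thresholds are the ones quoted in these notes. The local coordinate `t = 2*y100 - (2*k + 1)` is an assumption that generalizes the `2*y100 - 1` example from interval 0 to interval `k`.

```python
import math

def w_im_region(x):
    """Name the w_im branch for a real input, using the quoted thresholds."""
    ax = abs(x)
    if ax <= 0.0309:
        return "taylor"
    if ax <= 45.0:
        return "chebyshev"
    if ax <= 5e7:
        return "continued-fraction"
    return "leading-term"

def chebyshev_interval(x):
    """Interval index k and local coordinate t for a moderate |x|."""
    y100 = 100.0 / (1.0 + abs(x))
    k = int(y100)                    # integer part selects the interval
    t = 2.0 * y100 - (2 * k + 1)     # assumed mapping of [k, k+1) to [-1, 1)
    return k, t

def erfi_path(x):
    """Branch taken by the real erfi: overflow guard first, then w_im."""
    if x * x > 720.0:
        return "signed-infinity"
    return w_im_region(x)

for x in (0.01, 1.0, 30.0):
    print(x, erfi_path(x), w_im_region(x))
```

Under these thresholds the guard fires for `|x|` above roughly 26.83, which is below 45, so a call that enters through the real `erfi` only reaches the Taylor and Chebyshev branches. The two large-argument branches matter for callers that use `w_im` or Dawson's integral directly.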
410
+ ## Other Exported Functions
411
+
412
+ - `erfcx(z)` is a direct rotation into `w`: `erfcx(z) = w(i z)`.
413
+ - `erf(z)` uses `w` plus symmetry and Taylor expansions to avoid
414
+ cancellation.
415
+ - `erfc(z)` uses direct formulas rather than always computing `1-erf(z)`,
416
+ because that subtraction would lose precision when `erfc(z)` is tiny.
417
+ - `Dawson(z)` uses `w`, axis-specific formulas, continued fractions, and
418
+ Taylor expansions. For real input it is simply
419
+ `sqrt(pi)/2 * w_im(x)`.
420
+
421
+ ## Numerical Design Lessons
422
+
423
+ The implementation is less about finding one universal formula and more
424
+ about choosing an equivalent formula that is well-conditioned in each
425
+ region:
426
+
427
+ - use asymptotic continued fractions when arguments are large;
428
+ - use convergent sums or fitted polynomials in the middle range;
429
+ - use Taylor expansions near zeros and cancellation points;
430
+ - exploit oddness, conjugation, rotations, and reflection identities;
431
+ - special-case axes, infinities, NaNs, signed zero, overflow, and underflow;
432
+ - avoid forming huge and tiny intermediate quantities whose final product
433
+ would be representable.
434
+
435
+ For `erfi` specifically, the important implementation chain is:
436
+
437
+ ```text
438
+ complex erfi(z)
439
+ -> rotate z to i*z
440
+ -> stabilized complex erf(i*z)
441
+ -> rotate result by -i
442
+
443
+ real erfi(x)
444
+ -> specialized w_im(x)
445
+ -> multiply by exp(x^2)
446
+ -> return signed infinity before overflow becomes indeterminate
447
+ ```
448
+
449
+ The complex path maximizes reuse of the robust `erf` implementation, while
450
+ the real path maximizes speed and preserves accuracy through the
451
+ Faddeeva/Dawson relationship.