math_precision-1.0.0.tar.gz
- math_precision-1.0.0/LICENSE.txt +3 -0
- math_precision-1.0.0/PKG-INFO +10 -0
- math_precision-1.0.0/math_precision/__init__.py +77 -0
- math_precision-1.0.0/math_precision/math_precision.py +159 -0
- math_precision-1.0.0/math_precision.egg-info/PKG-INFO +10 -0
- math_precision-1.0.0/math_precision.egg-info/SOURCES.txt +9 -0
- math_precision-1.0.0/math_precision.egg-info/dependency_links.txt +1 -0
- math_precision-1.0.0/math_precision.egg-info/top_level.txt +1 -0
- math_precision-1.0.0/setup.cfg +7 -0
- math_precision-1.0.0/setup.py +87 -0
math_precision-1.0.0/PKG-INFO
@@ -0,0 +1,10 @@
+Metadata-Version: 2.2
+Name: math_precision
+Version: 1.0.0
+Home-page: https://github.com/sapiens-technology/math_precision
+Author: SAPIENS TECHNOLOGY
+License: Proprietary Software
+License-File: LICENSE.txt
+Dynamic: author
+Dynamic: home-page
+Dynamic: license
math_precision-1.0.0/math_precision/__init__.py
@@ -0,0 +1,77 @@
+"""
+The presented algorithm implements a benchmarking method designed to evaluate the **mathematical precision** of language models,
+targeting a well-known limitation of these systems: their ability to reliably perform numerical reasoning under high-precision constraints.
+
+The `MathPrecision` class functions as a synthetic data generator, automatically producing a dataset of mathematically structured problems.
+Each sample consists of an expression built from three high-precision numbers (generated using the `decimal` module) combined with two randomly selected arithmetic operators:
+addition, subtraction, multiplication, or division. These numbers are not standard floats but high-precision values with many decimal places,
+significantly increasing the difficulty and minimizing the likelihood of correct answers through approximation.
+
+For every generated expression, the algorithm computes the correct result with high numerical precision and then constructs a multiple-choice question.
+Only one option corresponds to the correct answer, while the remaining alternatives are generated by applying small perturbations to the true result.
+These incorrect options are intentionally designed to be plausible,
+forcing the evaluated model to perform actual numerical reasoning rather than relying on superficial pattern recognition.
+
+Additionally, both operators and answer choices are shuffled, ensuring structural diversity across samples and reducing positional bias.
+The final output is a structured dataset containing input-output pairs, ready to be used in automated evaluation pipelines.
+
+Among the main advantages of this benchmarking approach:
+
+First, the **high numerical precision requirement** makes the test substantially more rigorous than traditional benchmarks based on simple arithmetic.
+It allows clear differentiation between models that merely approximate reasoning and those that can consistently execute precise calculations.
+
+Another key advantage is **scalable synthetic data generation**. Since the dataset is produced programmatically, it can be expanded arbitrarily,
+with full control over complexity, distribution, and format. This removes reliance on static datasets and mitigates the risk of overfitting to known benchmarks.
+
+The method is also **robust against memorization**, as each execution generates entirely new and unique problems.
+This prevents models from gaining artificial advantages by memorizing fixed question sets.
+
+Furthermore, the use of closely related incorrect alternatives enables **fine-grained evaluation**, capturing not only whether a model gets the answer right,
+but also how close it gets. This is particularly useful for identifying systematic errors, such as precision loss or difficulties handling operator precedence.
+
+Finally, the approach is **simple, automated, and easily integrable** into evaluation workflows,
+making it suitable for both offline benchmarking and continuous testing during model development.
+
+In summary, this algorithm provides an effective and rigorous benchmark for assessing a fundamental capability (mathematical precision) while balancing experimental control,
+scalability, and realistic difficulty.
+"""
+# --------------------------> A SAPIENS TECHNOLOGY®️ PRODUCTION <--------------------------
+from .math_precision import *
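The operand construction described in the module docstring above can be reproduced in isolation. The sketch below mirrors the internal `_generate_high_precision_number` helper defined in `math_precision.py`; the name `high_precision_operand` is illustrative, not part of the package API. Note that because the fractional part comes from `str(random())`, it carries at most about 17 significant digits, even when the Decimal context allows 100:

```python
from decimal import Decimal, getcontext
from random import SystemRandom

def high_precision_operand(start=100, end=200, precision=100):
    # Widen the Decimal context so later arithmetic keeps up to `precision` digits.
    getcontext().prec = precision
    rng = SystemRandom()
    integer_part = Decimal(rng.randint(start, end))   # random integer part in [start, end]
    fractional_part = Decimal(str(rng.random()))      # uniform fraction in [0, 1)
    return integer_part + fractional_part

value = high_precision_operand()
print(value)  # a Decimal in [100, 201)
```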
math_precision-1.0.0/math_precision/math_precision.py
@@ -0,0 +1,159 @@
+class MathPrecision:
+    def __init__(self, show_errors=True, display_error_point=False):
+        try:
+            self.__show_errors = bool(show_errors) if type(show_errors) in (bool, int, float) else True
+            self.__display_error_point = bool(display_error_point) if type(display_error_point) in (bool, int, float) else False
+            try:
+                # Silence warnings and logging so benchmark output stays clean.
+                from warnings import simplefilter, filterwarnings
+                from logging import disable, CRITICAL
+                simplefilter('ignore')
+                filterwarnings('ignore')
+                disable(CRITICAL)
+            except Exception: pass
+            from traceback import print_exc
+            self.__print_exc = print_exc
+        except Exception as error:
+            try:
+                if self.__show_errors:
+                    print('ERROR in MathPrecision.__init__: ' + str(error))
+                    if self.__display_error_point: self.__print_exc()
+            except Exception: pass
+    def generateDatabase(self, n_samples=10):
+        try:
+            return_dictionary, database = {}, []
+            n_samples = max(1, int(n_samples)) if type(n_samples) in (int, float) else 10
+            def _generate_high_precision_number(start=100, end=200, precision=100):
+                # Builds a Decimal with a random integer part in [start, end]
+                # and a random fractional part taken from a uniform float.
+                from decimal import getcontext, Decimal
+                from random import SystemRandom
+                getcontext().prec = precision
+                random_generator = SystemRandom()
+                integer_part = Decimal(random_generator.randint(start, end))
+                decimal_part = Decimal(str(random_generator.random()))
+                return integer_part + decimal_part
+            def _shuffle_list(input_list=[]):
+                # Shuffles in place and returns the same list for chaining.
+                from random import shuffle
+                shuffle(input_list)
+                return input_list
+            from random import randint
+            operators, alternatives = ['+', '-', '*', '/'], ['A)', 'B)', 'C)', 'D)']
+            for _ in range(n_samples):
+                operator_a = _shuffle_list(input_list=operators)[randint(0, 3)]
+                operator_b = _shuffle_list(input_list=operators)[randint(0, 3)]
+                number1, number2, number3 = sorted([_generate_high_precision_number() for _ in range(3)], reverse=True)
+                formatted_input = f'{number1} {operator_a} {number2} {operator_b} {number3}'
+                # Note: eval() parses the operands in this string as Python floats,
+                # so the reference answer is computed in float precision here, not
+                # at the 100-digit Decimal precision of the operands.
+                correct_answer = eval(formatted_input)
+                alternative_answers, input_output, used_values = [], {}, set()
+                for index, alternative in enumerate(_shuffle_list(input_list=alternatives)):
+                    if index == 0:
+                        # The first (post-shuffle) label receives the correct answer.
+                        formatted_correct = f'{correct_answer:.10f}'
+                        correct_alternative = f'{alternative} {formatted_correct}'
+                        alternative_answers.append(correct_alternative)
+                        input_output['output'] = correct_alternative
+                        used_values.add(formatted_correct)
+                    else:
+                        # Distractors: perturb the true result by a small random
+                        # amount, retrying until the formatted value is unique.
+                        while True:
+                            operator_x = _shuffle_list(input_list=['+', '-'])[randint(0, 1)]
+                            wrong_answer = eval(f'{correct_answer} {operator_x} {_generate_high_precision_number(start=0, end=1)}')
+                            formatted_wrong = f'{wrong_answer:.10f}'
+                            if formatted_wrong not in used_values:
+                                used_values.add(formatted_wrong)
+                                alternative_answers.append(f'{alternative} {formatted_wrong}')
+                                break
+                _input = f'{formatted_input} = ?\n\n'
+                for alternative_answer in sorted(alternative_answers): _input += f'{alternative_answer}\n'
+                input_output['input'] = _input.strip()
+                database.append(input_output)
+            return_dictionary['data'] = database
+            return return_dictionary
+        except Exception as error:
+            try:
+                if self.__show_errors:
+                    print('ERROR in MathPrecision.generateDatabase: ' + str(error))
+                    if self.__display_error_point: self.__print_exc()
+            except Exception: pass
+            return {'data': []}
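The sample-construction loop inside `generateDatabase` can be re-sketched in a self-contained form. This is not the package API: names like `operand`, `options`, and `question` are illustrative, and the sketch deliberately evaluates the expression with `Decimal` operands (by rewriting the string with explicit `Decimal("...")` calls before `eval`) so the reference answer keeps the operands' precision, whereas the released code evaluates the raw string and therefore computes in float precision:

```python
from decimal import Decimal, getcontext
from random import SystemRandom, shuffle

getcontext().prec = 100
rng = SystemRandom()

def operand(start=100, end=200):
    # Random Decimal: integer part in [start, end], fraction from a uniform float.
    return Decimal(rng.randint(start, end)) + Decimal(str(rng.random()))

# Three operands in descending order, two random operators.
a, b, c = sorted([operand() for _ in range(3)], reverse=True)
op1, op2 = rng.choice('+-*/'), rng.choice('+-*/')
expression = f'{a} {op1} {b} {op2} {c}'

# Evaluate with Decimal operands; normal operator precedence still applies.
correct = eval(f'Decimal("{a}") {op1} Decimal("{b}") {op2} Decimal("{c}")')

# One correct option plus three unique perturbed distractors.
values = {f'{correct:.10f}'}
while len(values) < 4:
    delta = operand(start=0, end=1) * rng.choice([Decimal(1), Decimal(-1)])
    values.add(f'{correct + delta:.10f}')

values = list(values)
shuffle(values)  # decorrelate the correct answer from its label
options = sorted(f'{label} {value}' for label, value in zip(['A)', 'B)', 'C)', 'D)'], values))
question = f'{expression} = ?\n\n' + '\n'.join(options)
print(question)
```

Sorting `options` keeps the labels in A–D order in the rendered question while the correct answer's position remains random, matching the shuffling described in the docstring.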
math_precision-1.0.0/math_precision.egg-info/PKG-INFO
@@ -0,0 +1,10 @@
+Metadata-Version: 2.2
+Name: math_precision
+Version: 1.0.0
+Home-page: https://github.com/sapiens-technology/math_precision
+Author: SAPIENS TECHNOLOGY
+License: Proprietary Software
+License-File: LICENSE.txt
+Dynamic: author
+Dynamic: home-page
+Dynamic: license
math_precision-1.0.0/math_precision.egg-info/dependency_links.txt
@@ -0,0 +1 @@
+
math_precision-1.0.0/math_precision.egg-info/top_level.txt
@@ -0,0 +1 @@
+math_precision
math_precision-1.0.0/setup.py
@@ -0,0 +1,87 @@
+from setuptools import setup, find_packages
+package_name = 'math_precision'
+version = '1.0.0'
+setup(
+    name=package_name,
+    version=version,
+    author='SAPIENS TECHNOLOGY',
+    packages=find_packages(),
+    url='https://github.com/sapiens-technology/math_precision',
+    license='Proprietary Software'
+)