metacountregressor 0.1.63__py3-none-any.whl → 0.1.64__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (33)
  1. metacountregressor/__init__.py +0 -3
  2. metacountregressor/data/1848.csv +1849 -0
  3. metacountregressor/data/4000.csv +4746 -0
  4. metacountregressor/data/Copy of 190613_HV Crash Data 2007-2017 Dates.xlsx +0 -0
  5. metacountregressor/data/Ex-16-3.csv +276 -0
  6. metacountregressor/data/Ex-16-3variables.csv +276 -0
  7. metacountregressor/data/Indiana_data.csv +339 -0
  8. metacountregressor/data/MichiganData.csv +33972 -0
  9. metacountregressor/data/Stage5A.csv +1849 -0
  10. metacountregressor/data/Stage5A_1848_All_Initial_Columns.csv +1849 -0
  11. metacountregressor/data/ThaiAccident.csv +20230 -0
  12. metacountregressor/data/artificial_1h_mixed_corr_2023_MOOF.csv +1001 -0
  13. metacountregressor/data/artificial_ZA.csv +20001 -0
  14. metacountregressor/data/artificial_mixed_corr_2023_MOOF.csv +2001 -0
  15. metacountregressor/data/artificial_mixed_corr_2023_MOOF_copy.csv +2001 -0
  16. metacountregressor/data/latex_summary_output.tex +2034 -0
  17. metacountregressor/data/rqc40516_MotorcycleQUT_engineer_crash.csv +8287 -0
  18. metacountregressor/data/rural_int.csv +37081 -0
  19. metacountregressor/data/sum_stats.R +83 -0
  20. metacountregressor/data/summary_output.txt +302 -0
  21. metacountregressor/main.py +0 -3
  22. metacountregressor/plt_style.txt +52 -0
  23. metacountregressor/requirements.txt +16 -0
  24. metacountregressor/requirements_new.txt +145 -0
  25. metacountregressor/set_data.csv +8440 -0
  26. metacountregressor-0.1.64.dist-info/METADATA +274 -0
  27. metacountregressor-0.1.64.dist-info/RECORD +40 -0
  28. {metacountregressor-0.1.63.dist-info → metacountregressor-0.1.64.dist-info}/WHEEL +1 -2
  29. metacountregressor/test_motor.py +0 -296
  30. metacountregressor-0.1.63.dist-info/METADATA +0 -14
  31. metacountregressor-0.1.63.dist-info/RECORD +0 -19
  32. metacountregressor-0.1.63.dist-info/top_level.txt +0 -1
  33. {metacountregressor-0.1.63.dist-info → metacountregressor-0.1.64.dist-info}/LICENSE.txt +0 -0
@@ -0,0 +1,274 @@
1
+ Metadata-Version: 2.1
2
+ Name: metacountregressor
3
+ Version: 0.1.64
4
+ Summary: A python package for count regression of rare events assisted by metaheuristics
5
+ Author: zahern
6
+ Author-email: zeke.ahern@hdr.qut.edu.au
7
+ Requires-Python: >=3.10,<3.11
8
+ Classifier: Programming Language :: Python :: 3
9
+ Classifier: Programming Language :: Python :: 3.10
10
+ Requires-Dist: latextable (>=1.0.0,<2.0.0)
11
+ Requires-Dist: matplotlib (>=3.7.1,<4.0.0)
12
+ Requires-Dist: numpy (>=1.24.3,<2.0.0)
13
+ Requires-Dist: pandas (>=2.0.2,<3.0.0)
14
+ Requires-Dist: psutil (>=5.9.5,<6.0.0)
15
+ Requires-Dist: scikit-learn (>=1.2.2,<2.0.0)
16
+ Requires-Dist: scipy (>=1.10.1,<2.0.0)
17
+ Requires-Dist: statsmodels (>=0.14.0,<0.15.0)
18
+ Requires-Dist: tabulate (>=0.9.0,<0.10.0)
19
+ Description-Content-Type: text/markdown
20
+
21
+ <div style="display: flex; align-items: center;">
22
+ <img src="https://github.com/zahern/data/raw/main/m.png" alt="My Image" style="width: 200px; margin-right: 20px;">
23
+ <p><span style="font-size: 60px;"><strong>MetaCountRegressor</strong></span></p>
24
+ </div>
25
+
26
+ ##### Quick Setup
27
+ The code below demonstrates how to set up automatic optimization assisted by the Harmony Search algorithm. Differential Evolution and Simulated Annealing are also imported; swap them in as needed.
28
+
29
+ ## Quick install: Requires Python 3.10
30
+
31
+ Install `metacountregressor` using pip as follows:
32
+
33
+ ```bash
34
+ pip install metacountregressor
+ ```
35
+
36
+
37
+ ```python
38
+ import pandas as pd
39
+ import numpy as np
40
+ from metacountregressor.solution import ObjectiveFunction
41
+ from metacountregressor.metaheuristics import (harmony_search,
42
+ differential_evolution,
43
+ simulated_annealing)
44
+ ```
45
+
46
+ #### Basic setup.
47
+ The initial setup involves reading in the data and selecting an optimization algorithm. As the runtime progresses, new solutions will be continually evaluated. Finally, at the end of the runtime, the best solution will be identified and printed out. In the case of multiple objectives, all of the best solutions belonging to the Pareto frontier will be printed out.
48
+
49
+
50
+ ```python
51
+ # Read data from CSV file
52
+ df = pd.read_csv(
53
+ "https://raw.githubusercontent.com/zahern/data/main/Ex-16-3.csv")
54
+ X = df
55
+ y = df['FREQ'] # Frequency of crashes
56
+ X['Offset'] = np.log(df['AADT']) # Explicitly define how to offset the data; no offset otherwise
57
+ # Drop Y, selected offset term and ID as there are no panels
58
+ X = df.drop(columns=['FREQ', 'ID', 'AADT'])
59
+
60
+ # Some example arguments; these are the defaults, so the following line is just for clarity. See the arguments section later for details.
61
+ arguments = {'algorithm': 'hs', 'test_percentage': 0.15, 'test_complexity': 6, 'instance_number':1,
62
+ 'val_percentage':0.15, 'obj_1': 'bic', '_obj_2': 'RMSE_TEST', "MAX_TIME": 6}
63
+ # Fit the model with metacountregressor
64
+ obj_fun = ObjectiveFunction(X, y, **arguments)
65
+ #replace with other metaheuristics if desired
66
+ results = harmony_search(obj_fun)
67
+
68
+
69
+ ```
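+
+ The `harmony_search` call above can be replaced with one of the other imported metaheuristics. Below is a minimal sketch, assuming (as the imports suggest) that `differential_evolution` and `simulated_annealing` take the objective function in the same single-argument form as `harmony_search`:
+
+ ```python
+ # Sketch only: swap in a different metaheuristic for the same objective function
+ obj_fun = ObjectiveFunction(X, y, **arguments)
+ results = differential_evolution(obj_fun)
+ # or: results = simulated_annealing(obj_fun)
+ ```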
70
+
71
+ ## Arguments to feed into the Objective Function:
72
73
+ Note: Please consider changing the main arguments below.
74
+
75
+ - `algorithm`: This parameter has multiple choices for the algorithm, such as 'hs', 'sa', and 'de'. Only one choice should be defined as a string value.
76
+ - `test_percentage`: This parameter represents the percentage of data held out for out-of-sample testing of the model. The value 0.15 corresponds to 15% of the data.
77
+ - `val_percentage`: This parameter represents the percentage of data used to validate the model. The value 0.15 corresponds to 15% of the data.
78
+ - `test_complexity`: This parameter defines the complexity level for testing. The value 6 tests all complexities. Alternatively, you can provide a list of numbers to consider different complexities. The complexities are further explained later in this document.
79
+ - `instance_number`: This parameter is used to give a name to the outputs.
80
+ - `obj_1`: This parameter has multiple choices for obj_1, such as 'bic', 'aic', and 'hqic'. Only one choice should be defined as a string value.
81
+ - `_obj_2`: This parameter has multiple choices for objective 2, such as 'RMSE_TEST', 'MSE_TEST', and 'MAE_TEST'.
82
+ - `_max_time`: This parameter specifies the maximum number of seconds for the total estimation before stopping.
83
+ - `distribution`: This parameter is a list of distributions to consider for the random parameters. Include every option you want considered. The valid options include: 'Normal', 'LnNormal', 'Triangular', and 'Uniform'.
84
+ - `transformations`: This parameter is a list of transformations to consider. Include every option you want considered. The valid options include 'no', 'sqrt', and 'archsinh' (see the full list under `transformations` later in this document).
85
+ - `method_ll`: This specifies the solver used for the lower-level maximum likelihood objective (for example, 'BFGS_2', as used in the example below).
86
+
87
+
88
+
89
+ ### An Example of changing the arguments.
90
+ Modify the arguments according to your preferences using the commented code as a guide.
91
+
92
+
93
+ ```python
94
+ #Solution Arguments
95
+ arguments = {
96
+ 'algorithm': 'hs', #alternatively input 'de', or 'sa'
97
+ 'is_multi': 1,
98
+ 'test_percentage': 0.2, # used in multi-objective optimisation only. Saves 20% of data for testing.
99
+ 'val_percentage': 0.2, # Saves 20% of data for validation.
100
+ 'test_complexity': 6, # Complexity level for testing (6 tests all) or a list to consider potential differences in complexity
101
+ 'instance_number': 'name', # used for creating a named folder, within the working directory, where your models are saved
102
+ 'distribution': ['Normal', 'LnNormal', 'Triangular', 'Uniform'],
103
+ 'Model': [0,1], # or equivalently ['POS', 'NB']
104
+ 'transformations': ['no', 'sqrt', 'archsinh'],
105
+ 'method_ll': 'BFGS_2',
106
+ '_max_time': 10
107
+ }
108
+ obj_fun = ObjectiveFunction(X, y, **arguments)
109
+ results = harmony_search(obj_fun)
110
+ ```
111
+
112
+ ## Initial Solution Configuration
113
+ Listed below is an example of how to specify an initial solution within the framework. This initial solution will be used to calculate the fitness and considered in the objective-based search. However, as the search progresses, different hypotheses may be proposed, and alternative modeling components may completely replace the initial solution.
114
+
115
+
116
+ ```python
117
+ # Model decisions, specified for the initial optimization
118
+ manual_fit_spec = {
119
+ 'fixed_terms': ['SINGLE', 'LENGTH'],
120
+ 'rdm_terms': ['AADT:normal'],
121
+ 'rdm_cor_terms': ['GRADEBR:uniform', 'CURVES:triangular'],
122
+ 'grouped_terms': [],
123
+ 'hetro_in_means': ['ACCESS:normal', 'MINRAD:normal'],
124
+ 'transformations': ['no', 'no', 'log', 'no', 'no', 'no', 'no'],
125
+ 'dispersion': 1
126
+ }
127
+ #Search Arguments
128
+ arguments = {
129
+ 'algorithm': 'hs',
130
+ 'test_percentage': 0.2,
131
+ 'test_complexity': 6,
132
+ 'instance_number': 'name',
133
+ 'Manual_Fit': manual_fit_spec
134
+ }
135
+ obj_fun = ObjectiveFunction(X, y, **arguments)
136
+ ```
137
+
138
+ Similarly, to return the results, feed the objective function into a metaheuristic solution algorithm. An example of this is provided below:
139
+
140
+
141
+ ```python
142
+ results = harmony_search(obj_fun)
143
+ print(results)
144
+ ```
145
+
146
+ ## Notes:
147
+ ### Capabilities of the software include:
148
+ * Handling of Panel Data
149
+ * Support for Data Transformations
150
+ * Implementation of Models with Correlated and Non-Correlated Random Parameters
151
+ * A variety of mixing distributions for parameter estimations, including normal, lognormal, truncated normal, Lindley, Gamma, triangular, and uniform distributions
152
+ * Capability to handle heterogeneity in the means of the random parameters
153
+ * Use of Halton draws for simulated maximum likelihood estimation
154
+ * Support for grouped random parameters with unbalanced groups
155
+ * Post-estimation tools for assessing goodness of fit, making predictions, and conducting out-of-sample validation
156
+ * Multiple parameter optimization routines, such as the BFGS method
157
+ * Comprehensive hypothesis testing using single objectives, such as in-sample BIC and log-likelihood
158
+ * Extensive hypothesis testing using multiple objectives, such as in-sample BIC and out-of-sample MAE (Mean Absolute Error), or in-sample AIC and out-of-sample MSPE (mean squared prediction error)
159
+ * Features that allow analysts to pre-specify variables, interactions, and mixing distributions, among others
160
+ * Meta-heuristic Guided Optimization, including techniques like Simulated Annealing, Harmony Search, and Differential Evolution
161
+ * Customization of Hyper-parameters to solve problems tailored to your dataset
162
+ * Out-of-the-box optimization capability using default metaheuristics
163
+
164
+ ### Interpreting the output of the model:
165
+ A regression table is produced. The following text elements are explained:
166
+ - Std. Dev.: This column appears for effects related to random parameters and displays the assumed distribution next to it
167
+ - Chol: This term refers to a Cholesky decomposition element, which shows the correlation between two random parameters. The Cholesky element of a parameter combined with itself is equivalent to a normal random parameter.
168
+ - hetro group #: This is the heterogeneity group number; all contributing factors that share heterogeneity in the means with each other appear under the same numbered value.
169
+ - $\tau$: This column displays the type of transformation that was applied to the specific contributing factor in the data.
170
+
171
+
172
+ ## Arguments:
173
+ #### The arguments fed into the solution algorithm are supplied as a dictionary with the keys described below.
174
+
175
+
176
+ The following list describes the arguments available in this function. By default, all of the capabilities described are enabled unless specified otherwise as an argument. For list arguments, include all desired elements in the list to ensure the corresponding options are considered. Example code will be provided later in this guide.
177
+
178
+ 1. **`complexity_level`**: This argument accepts an integer between 0 and 5, or a list of such integers (for example, [0, 2, 3]). Each integer represents a hierarchy level for the estimable models associated with each explanatory variable. Here is a summary of the hierarchy:
179
+ - 0: Null model
180
+ - 1: Simple fixed effects model
181
+ - 2: Random parameters model
182
+ - 3: Random correlated parameters model
183
+ - 4: Grouped random parameters model
184
+ - 5: Heterogeneity in the means random parameter model
185
+
186
+ **Note:** For the grouped random parameters model, groupings need to be defined prior to estimation. This can be achieved by including the following key-value pair in the arguments of the `ObjectiveFunction`: `'group': "Enter Column Grouping in data"`. Replace `"Enter Column Grouping in data"` with the actual column grouping in your dataset.
187
+
188
+ Similarly, for panel data, the panel column needs to be defined using the key-value pair: `'panel': "enter column string covering panels"`. Replace `"enter column string covering panels"` with the appropriate column string that represents the panel information in your dataset.
189
+
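+ A minimal sketch of supplying these keys is shown below; the column names `'GROUP'` and `'PANEL'` are hypothetical placeholders for whatever columns identify the grouping and the panels in your own data:
+
+ ```python
+ # Sketch only: 'GROUP' and 'PANEL' stand in for columns of X in your dataset
+ arguments = {
+     'algorithm': 'hs',
+     'group': 'GROUP',   # column defining the grouping for grouped random parameters
+     'panel': 'PANEL'    # column covering the panels
+ }
+ obj_fun = ObjectiveFunction(X, y, **arguments)
+ ```
+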
190
+ 2. **`distributions`**: This argument accepts a list of strings where each string corresponds to a distribution. Valid options include:
191
+ - "Normal"
192
+ - "Lindley"
193
+ - "Uniform"
194
+ - "LogNormal"
195
+ - "Triangular"
196
+ - "Gamma"
197
+ - "TruncatedNormal"
198
+ - Any of the above, concatenated with ":" (e.g., "Normal:grouped"; requires a grouping term defined in the model)
199
+
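+ For example, a grouped candidate can be mixed with ungrouped ones. The sketch below uses the `'distribution'` key from the worked examples above and assumes a `'group'` column has been supplied as described under `complexity_level`:
+
+ ```python
+ # Sketch only: 'Normal:grouped' is valid only when a grouping term is defined in the model
+ arguments = {
+     'algorithm': 'hs',
+     'group': 'GROUP',   # hypothetical grouping column
+     'distribution': ['Normal', 'Normal:grouped', 'Triangular', 'Uniform']
+ }
+ obj_fun = ObjectiveFunction(X, y, **arguments)
+ ```
+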
200
+ 3. **`Model`**: This argument specifies the model form. It can be a list of integers representing different models to test:
201
+ - 0: Poisson
202
+ - 1: Negative-Binomial
203
+ - 2: Generalized-Poisson
204
+
205
+ 4. **`transformations`**: This argument accepts a list of strings representing available transformations within the framework. Valid options include:
206
+ - "no"
207
+ - "square-root"
208
+ - "logarithmic"
209
+ - "archsinh"
210
+ - "as_factor"
211
+
212
+ 5. **`is_multi`**: This argument accepts an integer indicating whether single or multiple objectives are to be tested (0 for single, 1 for multiple).
213
+
214
+ 6. **`test_percentage`**: This argument is used for multi-objective optimization. Define it as a decimal; for example, 0.2 represents 20% of the data for testing.
215
+
216
+ 7. **`val_percentage`**: This argument saves data for validation. Define it as a decimal; for example, 0.2 represents 20% of the data for validation.
217
+
218
+ 8. **`_max_time`**: This argument adds a termination time for the algorithm, in seconds. Note that the clock only starts once the initial population of solutions has been generated.
219
+
220
+ # Example
221
+
222
+
223
+ Let's start by fitting very simple models, use those models to help define the objectives, and then perform a more extensive search over the variables that are identified most often.
224
+
225
+
226
+
227
+ ```python
228
+ df = pd.read_csv(
229
+ "https://raw.githubusercontent.com/zahern/data/main/Ex-16-3.csv")
230
+ X = df
231
+ y = df['FREQ'] # Frequency of crashes
232
+ X['Offset'] = np.log(df['AADT']) # Explicitly define how to offset the data; no offset otherwise
233
+ # Drop Y, selected offset term and ID as there are no panels
234
+ X = df.drop(columns=['FREQ', 'ID', 'AADT'])
235
+
236
+ arguments = {
237
+ 'algorithm': 'hs', #alternatively input 'de', or 'sa'
238
+ 'is_multi': 1,
239
+ 'test_percentage': 0.2, # used in multi-objective optimisation only. Saves 20% of data for testing.
240
+ 'val_percentage': 0.2, # Saves 20% of data for validation.
241
+ 'test_complexity': 3, # For Very simple Models
242
+ 'obj_1': 'BIC', '_obj_2': 'RMSE_TEST',
243
+ 'instance_number': 'name', # used for creating a named folder, within the working directory, where your models are saved
244
+ 'distribution': ['Normal'],
245
+ 'Model': [0], # or equivalently ['POS']
246
+ 'transformations': ['no', 'sqrt', 'archsinh'],
247
+ '_max_time': 10000
248
+ }
249
+ obj_fun = ObjectiveFunction(X, y, **arguments)
250
+
251
+ results = harmony_search(obj_fun)
252
+ print(results)
253
+ ```
254
+
255
+ ## Contact
256
+ If you have any questions, ideas to improve MetaCountRegressor, or want to report a bug, just open a new issue in the [GitHub repository](https://github.com/zahern/CountDataEstimation).
257
+
258
+ ## Citing MetaCountRegressor
259
+ Please cite MetaCountRegressor as follows:
260
+
261
+ Ahern, Z., Corry, P., & Paz, A. (2023). MetaCountRegressor [Computer software]. [https://pypi.org/project/metacountregressor/](https://pypi.org/project/metacountregressor/)
262
+
263
+ Or using BibTex as follows:
264
+
265
+ ```bibtex
266
+ @misc{Ahern2023,
267
+ author = {Zeke Ahern and Paul Corry and Alexander Paz},
268
+ journal = {PyPi},
269
+ title = {metacountregressor · PyPI},
270
+ url = {https://pypi.org/project/metacountregressor/0.1.47/},
271
+ year = {2023},
272
+ }
+ ```
273
+
274
+
@@ -0,0 +1,40 @@
1
+ metacountregressor/__init__.py,sha256=UM4zaqoAcZVWyx3SeL9bRS8xpQ_iLZU9fIIARWmfjis,2937
2
+ metacountregressor/_device_cust.py,sha256=759fnKmTYccJm4Lpi9_1reurh6OB9d6q9soPR0PltKc,2047
3
+ metacountregressor/data/1848.csv,sha256=p53-8nntBb6X3bMJdYAJER6rbTicl_6RsWpOV-S8n_Y,47516
4
+ metacountregressor/data/4000.csv,sha256=XDGmMrZazvcaeAMMAdVibB6EWfEJdyZox8iXtDb7d9w,131240
5
+ metacountregressor/data/Copy of 190613_HV Crash Data 2007-2017 Dates.xlsx,sha256=xzdiR6bFHZ3j22BPJXOmruX_DA08jd-ujeaNwNimsxg,3469742
6
+ metacountregressor/data/Ex-16-3.csv,sha256=7z6tW9A4rqhQGOeiRE4LzlsJh8EjRvGnIQXo00hK82Y,42302
7
+ metacountregressor/data/Ex-16-3variables.csv,sha256=rzl_pv-gy5rdpImZkhzc52cx__S98di1V2CK_krO-DA,14901
8
+ metacountregressor/data/Indiana_data.csv,sha256=9jXPjPuk7pQqtSNcra3RhKB86i2zPo_xavmShsniK7A,14113
9
+ metacountregressor/data/MichiganData.csv,sha256=FwyHAbou49It8rLilY6dF4Uaoi4q-JjrxIqCRl07KfQ,1283955
10
+ metacountregressor/data/Stage5A.csv,sha256=rVqcKFOJTNCF5Aykd8adxLp0Rw2W_3PgZhjZocP9Su8,216874
11
+ metacountregressor/data/Stage5A_1848_All_Initial_Columns.csv,sha256=rVqcKFOJTNCF5Aykd8adxLp0Rw2W_3PgZhjZocP9Su8,216874
12
+ metacountregressor/data/ThaiAccident.csv,sha256=ZQJaD66qPE0QdqgaKTAYuQ42y1OvD0xu3FP4fqPfUFw,397972
13
+ metacountregressor/data/artificial_1h_mixed_corr_2023_MOOF.csv,sha256=8krw3famdKpUsQ4J-cewVVaskMroPh0oSj4PB09yWdM,144023
14
+ metacountregressor/data/artificial_ZA.csv,sha256=dpQsL9dqJtzz7V0wLxF5XURhuMk6C_NCZHZLTEUJERA,3693175
15
+ metacountregressor/data/artificial_mixed_corr_2023_MOOF.csv,sha256=gxJbBsA5318fhhH_4FDQdMUMEhxaYR2W5-EsJ7gwBms,322085
16
+ metacountregressor/data/artificial_mixed_corr_2023_MOOF_copy.csv,sha256=IlYpjwmG6DJH0GfHM3QnqjRILBKLC5df7Y7EyNWMTZM,322612
17
+ metacountregressor/data/latex_summary_output.tex,sha256=d35GSHoNlmKC-7EedvzVELzj3EU3tVRwAYByHt8BZ1o,35287
18
+ metacountregressor/data/rqc40516_MotorcycleQUT_engineer_crash.csv,sha256=6WzhwY7lUGy0S8BovhQe0l01XlgsXCZRzBeAGGgeWJg,3898046
19
+ metacountregressor/data/rural_int.csv,sha256=l_jgI_uTTVN3vmNjwfOE8KNYOvAXsNPRrOdAwamEbQM,6922036
20
+ metacountregressor/data/sum_stats.R,sha256=Qc2k4jVdZJbuk4kMt0t7KmbmoYgAR-C1SURZm_KVN0U,3327
21
+ metacountregressor/data/summary_output.txt,sha256=SPC7ycxu2yaGjSYqY9GLuDv5UBxXH7kT_FeivBZ8gxI,33846
22
+ metacountregressor/halton.py,sha256=jhovA45UBoZYU9g-hl6Lb2sBIx_ZBTNdPrpgkzR9fng,9463
23
+ metacountregressor/helperprocess.py,sha256=nlabPUz_hf8SbFONn2wBKwcVJusTKcynPaxkEYvTWlU,9052
24
+ metacountregressor/main.py,sha256=21Nz4OSFoiFbPSu2H0jOb5k4JaT9mDjw3EbY6QbC5R4,17135
25
+ metacountregressor/main_old.py,sha256=eTS4ygq27MnU-dZ_j983Ucb-D5XfbVF8OJQK2hVVLZc,24123
26
+ metacountregressor/metaheuristics.py,sha256=4coj8gTyesnYFA8r4UkQer156OQ30BRQhFTMFEwSjtU,105490
27
+ metacountregressor/pareto_file.py,sha256=whySaoPAUWYjyI8zo0hwAOa3rFk6SIUlHSpqZiLur0k,23096
28
+ metacountregressor/pareto_logger__plot.py,sha256=mEU2QN4wmsM7t39GJ_XhJ_jjsdl09JOmG0U2jICrAkI,30037
29
+ metacountregressor/plt_style.txt,sha256=Zv4mDFxN-KP_OnwfbC2IoAgEclOVeodIC4maLnERXfw,1280
30
+ metacountregressor/requirements.txt,sha256=jfxFP4msdp3H5kGuu7Mh2ea-18Gg1RvAPgPJygEe4JA,264
31
+ metacountregressor/requirements_new.txt,sha256=ZN_Kq1GgoT2WLFhzAC2J8__ZjY0Lj7ztIUsobXy4yBY,2543
32
+ metacountregressor/set_data.csv,sha256=UUrFFSs-q78mMcMKjYRCT3pciG0yuhdxlNZ-LL2GKN0,649589
33
+ metacountregressor/setup.py,sha256=8w6IqX0tJsbYrOI1BJLIJCIvOnunKli5I9fsF5PhHv4,919
34
+ metacountregressor/single_objective_finder.py,sha256=jVG7GJBqzSP4_riYr-kMMKy_LE3SlGmKMunNhHYxgRg,8011
35
+ metacountregressor/solution.py,sha256=fQaoo71zIxfOKLC6oTE9BnAm5OJRqRZnQ7a9nPIc9cM,284973
36
+ metacountregressor/test_generated_paper2.py,sha256=pwOoRzl1jJIIOUAAvbkT6HmmTQ81mwpsshn9SLdKOg8,3927
37
+ metacountregressor-0.1.64.dist-info/LICENSE.txt,sha256=OXLcl0T2SZ8Pmy2_dmlvKuetivmyPd5m1q-Gyd-zaYY,35149
38
+ metacountregressor-0.1.64.dist-info/METADATA,sha256=m_id4JHffI21V9t7LdUjwj2LLJVwhakWBQQgVzN5ED8,14679
39
+ metacountregressor-0.1.64.dist-info/WHEEL,sha256=sP946D7jFCHeNz5Iq4fL4Lu-PrWrFsgfLXbbkciIZwg,88
40
+ metacountregressor-0.1.64.dist-info/RECORD,,
@@ -1,5 +1,4 @@
1
1
  Wheel-Version: 1.0
2
- Generator: bdist_wheel (0.43.0)
2
+ Generator: poetry-core 1.9.0
3
3
  Root-Is-Purelib: true
4
4
  Tag: py3-none-any
5
-
@@ -1,296 +0,0 @@
1
- import pandas as pd
2
- import numpy as np
3
- import shap
4
- from sklearn.feature_selection import SelectKBest
5
- from sklearn.feature_selection import f_regression
6
- from sklearn.datasets import make_regression
7
- from sklearn.model_selection import train_test_split
8
- from sklearn.preprocessing import StandardScaler
9
- import xgboost as xgb
10
- import matplotlib.pyplot as plt
11
- from sklearn.inspection import permutation_importance
12
-
13
- X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=1)
14
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
15
-
16
-
17
- def shape_ply(df, y):
18
- df_features = df
19
- model = xgb.XGBRegressor(n_estimators=500, max_depth=20, learning_rate=0.1, subsample=0.8, random_state=33)
20
- model.fit(df_features, y)
21
-
22
- clustering = shap.utils.hclust(df_features, y)
23
-
24
- # PERMUATION fEATUre importance
25
- scoring = ['r2', 'neg_mean_squared_error']
26
- perm_importance = permutation_importance(model, df_features, y, scoring=scoring, n_repeats=5, random_state=33)
27
- perm_importance_r2 = pd.DataFrame(data={'importance': perm_importance['r2']['importances_mean']},
28
- index=df_features.columns)
29
- perm_importance_r2.sort_values(by='importance', ascending=False).plot(kind='bar')
30
- # plt.tight_layout()
31
- # plt.savefig('plot.png')
32
- # plt.show()
33
-
34
- print(1)
35
-
36
- # SHAPELY PART
37
- explainer = shap.Explainer(model)
38
- shap_values = explainer(df_features)
39
- # fig = shap.summary_plot(shap_values, df_features, show = False)
40
-
41
- # plt.figure(figsize=(10, 8))
42
- fig = shap.plots.beeswarm(shap_values, order=shap_values.abs.max(0), show=False)
43
- plt.gca().set_xticklabels([])
44
- plt.savefig('shap_values.png', bbox_inches='tight')
45
- plt.show()
46
- shap.plots.bar(shap_values.abs.mean(0), show=False)
47
- plt.gca().set_xticklabels([])
48
- plt.savefig('bar_values.png', bbox_inches='tight')
49
-
50
- shap.plots.bar(shap_values, clustering=clustering, clustering_cutoff=0.8, show=False)
51
- plt.gca().set_xticklabels([])
52
- plt.savefig('bar_valuesff.png', bbox_inches='tight')
53
- print(2)
54
-
55
-
56
- def summary_stats(crash_data):
57
- # Calculate the count of crashes
58
- crash_observations = crash_data.shape[0]
59
- crash_factors = crash_data.shape[1]
60
-
61
- # Calculate the mean age of drivers involved in the crashes
62
- total_vehicle = crash_data['CASUALTY_TOTAL'].sum()
63
-
64
- # Calculate the median number of vehicles involved in the crashes
65
- median_vehicles = crash_data['CASUALTY_TOTAL'].median()
66
-
67
- # Calculate the mode of crash types
68
- mode_crash_type = crash_data['CASUALTY_TOTAL'].mode()[0]
69
-
70
- # Calculate the 25th and 75th percentiles of a numeric variable
71
- percentile_25 = crash_data['CASUALTY_TOTAL'].quantile(0.25)
72
- percentile_75 = crash_data['CASUALTY_TOTAL'].quantile(0.75)
73
- std_dev = crash_data['CASUALTY_TOTAL'].std()
74
-
75
- # Print the summary statistics
76
- print("Summary Statistics:")
77
- print('Total Crashes', total_vehicle)
78
- print("Crash Observation:", crash_observations)
79
- print("Crash Factors:", crash_factors)
80
-
81
- print("Median Vehicles Involved:", median_vehicles)
82
- print("Mode Crash Type:", mode_crash_type)
83
-
84
- print("Standard Deviation of Age:", std_dev)
85
- print("25th Percentile of Crash Totals:", percentile_25)
86
- print("75th Percentile of Crash Total:", percentile_75)
87
-
88
- latex_table = pd.DataFrame({
89
- 'Statistic': ['Total Crashes', 'Crash Observation', 'Crash Factors', 'Median Vehicles Involved',
90
- 'Mode Crash Type', 'Standard Deviation of Age', '25th Percentile of Crash Totals',
91
- '75th Percentile of Crash Total'],
92
- 'Value': [total_vehicle, crash_observations, crash_factors, median_vehicles, mode_crash_type,
93
- std_dev, percentile_25, percentile_75]
94
- }).to_latex(index=False)
95
-
96
- # Print the LaTeX table
97
- print(latex_table)
98
- summary_statsd = crash_data[
99
- ['CRASH_YEAR', 'CRASH_FIN_YEAR', 'CRASH_MONTH', 'CRASH_DAY_OF_WEEK', 'CRASH_HOUR']].describe()
100
-
101
- # Print the summary statistics
102
- print("Summary Statistics:")
103
- print(summary_statsd)
104
-
105
- if 'CRASH_SEVERITY' in crash_data:
106
- severity_counts = crash_data['CRASH_SEVERITY'].value_counts()
107
- print(severity_counts)
108
-
109
-
110
- # Load data
111
-
112
-
113
-
114
-
115
-
116
- # Convert data types
117
-
118
- def clean_data_types(df):
119
- for col in df.columns:
120
- if df[col].dtype == 'object':
121
- # Attempt to convert the column to numeric type
122
- df[col] = pd.to_numeric(df[col], errors='coerce')
123
- return df
124
-
125
-
126
-
127
-
128
- # id folumn is a string drop
129
- # feauture slection
130
- def select_features(X_train, y_train, n_f=16):
131
- feature_names = X_train.columns
132
- # configure to select all features
133
- fs = SelectKBest(score_func=f_regression, k=16)
134
-
135
- # learn relationship from training data
136
- fs.fit(X_train, y_train)
137
-
138
- mask = fs.get_support() # Boolean array of selected features
139
- selected_features = [feature for bool, feature in zip(mask, feature_names) if bool]
140
- # transform train input data
141
- X_train_fs = fs.transform(X_train)
142
- X_train_fs = pd.DataFrame(X_train_fs, columns=selected_features)
143
- X_train = X_train[selected_features]
144
- return X_train, fs
145
-
146
-
147
- def findCorrelation(corr, cutoff=0.9, exact=None):
148
- """
149
- This function is the Python implementation of the R function
150
- `findCorrelation()`.
151
-
152
- Relies on numpy and pandas, so must have them pre-installed.
153
-
154
- It searches through a correlation matrix and returns a list of column names
155
- to remove to reduce pairwise correlations.
156
-
157
- For the documentation of the R function, see
158
- https://www.rdocumentation.org/packages/caret/topics/findCorrelation
159
- and for the source code of `findCorrelation()`, see
160
- https://github.com/topepo/caret/blob/master/pkg/caret/R/findCorrelation.R
161
-
162
- -----------------------------------------------------------------------------
163
-
164
- Parameters:
165
- -----------
166
- corr: pandas dataframe.
167
- A correlation matrix as a pandas dataframe.
168
- cutoff: float, default: 0.9.
169
- A numeric value for the pairwise absolute correlation cutoff
170
- exact: bool, default: None
171
- A boolean value that determines whether the average correlations be
172
- recomputed at each step
173
- -----------------------------------------------------------------------------
174
- Returns:
175
- --------
176
- list of column names
177
- -----------------------------------------------------------------------------
178
- Example:
179
- --------
180
- R1 = pd.DataFrame({
181
- 'x1': [1.0, 0.86, 0.56, 0.32, 0.85],
182
- 'x2': [0.86, 1.0, 0.01, 0.74, 0.32],
183
- 'x3': [0.56, 0.01, 1.0, 0.65, 0.91],
184
- 'x4': [0.32, 0.74, 0.65, 1.0, 0.36],
185
- 'x5': [0.85, 0.32, 0.91, 0.36, 1.0]
186
- }, index=['x1', 'x2', 'x3', 'x4', 'x5'])
187
-
188
- findCorrelation(R1, cutoff=0.6, exact=False) # ['x4', 'x5', 'x1', 'x3']
189
- findCorrelation(R1, cutoff=0.6, exact=True) # ['x1', 'x5', 'x4']
190
- """
191
-
192
- def _findCorrelation_fast(corr, avg, cutoff):
193
-
194
- combsAboveCutoff = corr.where(lambda x: (np.tril(x) == 0) & (x > cutoff)).stack().index
195
-
196
- rowsToCheck = combsAboveCutoff.get_level_values(0)
197
- colsToCheck = combsAboveCutoff.get_level_values(1)
198
-
199
- msk = avg[colsToCheck] > avg[rowsToCheck].values
200
- deletecol = pd.unique(np.r_[colsToCheck[msk], rowsToCheck[~msk]]).tolist()
201
-
202
- return deletecol
203
-
204
- def _findCorrelation_exact(corr, avg, cutoff):
205
-
206
- x = corr.loc[(*[avg.sort_values(ascending=False).index] * 2,)]
207
-
208
- if (x.dtypes.values[:, None] == ['int64', 'int32', 'int16', 'int8']).any():
209
- x = x.astype(float)
210
-
211
- x.values[(*[np.arange(len(x))] * 2,)] = np.nan
212
-
213
- deletecol = []
214
- for ix, i in enumerate(x.columns[:-1]):
215
- for j in x.columns[ix + 1:]:
216
- if x.loc[i, j] > cutoff:
217
- if x[i].mean() > x[j].mean():
218
- deletecol.append(i)
219
- x.loc[i] = x[i] = np.nan
220
- else:
221
- deletecol.append(j)
222
- x.loc[j] = x[j] = np.nan
223
- return deletecol
224
-
225
- if not np.allclose(corr, corr.T) or any(corr.columns != corr.index):
226
- raise ValueError("correlation matrix is not symmetric.")
227
-
228
- acorr = corr.abs()
229
- avg = acorr.mean()
230
-
231
- if exact or exact is None and corr.shape[1] < 100:
232
- return _findCorrelation_exact(acorr, avg, cutoff)
233
- else:
234
- return _findCorrelation_fast(acorr, avg, cutoff)
235
-
236
-
237
- # Normalize data
238
-
239
-
240
-
241
- print('here is the main stuff')
242
- NAH = 0
243
- if NAH:
244
- df = pd.read_csv('rqc40516_MotorcycleQUT_engineer_crash.csv', skiprows=5)
245
- # Clean 'CRASH_SPEED_LIMIT' and convert to integer
246
- df['CRASH_SPEED_LIMIT'] = df['CRASH_SPEED_LIMIT'].str.replace(' km/h', '').astype(int)
247
-
248
- # Clean data types
249
- df = clean_data_types(df)
250
-
251
- # Encode categorical variables
252
- categories = ['CRASH_SEVERITY', 'CRASH_TYPE', 'CRASH_NATURE', 'CRASH_ATMOSPHERIC_CONDITION']
253
- df = pd.get_dummies(df, columns=categories)
254
-
255
- # Select only numeric columns
256
- numeric_types = ['int32', 'uint8', 'bool', 'int64', 'float64']
257
- df = df.select_dtypes(include=numeric_types)
258
-
259
- # Check for missing values and fill with column mean
260
- missing_values_count = df['CASUALTY_TOTAL'].isnull().sum()
261
- df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
262
-
263
- # Remove unnecessary columns
264
- df.drop(columns=['CRASH_REF_NUMBER'], inplace=True)
265
-
266
- # Define columns to exclude from the analysis
267
- EXCLUDE = [
268
- 'LONGITUDE', 'YEAR', 'DCA', 'ID', 'LATIT', 'NAME', 'SEVERITY',
269
- "CASUALTY", "CRASH_FIN_YEAR", "CRASH_HOUR"
270
- ]
271
-
272
- # Filter out excluded columns
273
- df = df[[col for col in df.columns if col not in EXCLUDE]]
274
-
275
- # Prepare target variable
276
- y = df['CASUALTY_TOTAL']
277
-
278
- # Check for finite values and compute correlations
279
- finite_check = df.apply(np.isfinite).all()
280
- df_clean = df.loc[:, finite_check]
281
- corr = df_clean.corr()
282
-
283
- # Identify and remove highly correlated features
284
- hc = findCorrelation(corr, cutoff=0.5)
285
- trimmed_df = df_clean.drop(columns=hc)
286
-
287
- # Feature selection
288
- df_cleaner, fs = select_features(trimmed_df, y)
289
-
290
- """
291
- # Split data
292
- from sklearn.model_selection import train_test_split
293
-
294
- X_train, X_test, y_train, y_test = train_test_split(df.drop('target_column', axis=1), df['target_column'],
295
- test_size=0.2, random_state=42)
296
- """
@@ -1,14 +0,0 @@
1
- Metadata-Version: 2.1
2
- Name: metacountregressor
3
- Version: 0.1.63
4
- Summary: Extensions for a Python package for estimation of count models.
5
- Home-page: https://github.com/zahern/MetaCount
6
- Author: Zeke Ahern
7
- Author-email: zeke.ahern@hdr.qut.edu.au
8
- License: QUT
9
- Requires-Python: >=3.10
10
- License-File: LICENSE.txt
11
- Requires-Dist: numpy >=1.13.1
12
- Requires-Dist: scipy >=1.0.0
13
-
14
- long_description
@@ -1,19 +0,0 @@
1
- metacountregressor/__init__.py,sha256=L-dpP2iXXPDaKjC34BWkL0XsmC8s8IKuChvqHEraMb4,2940
2
- metacountregressor/_device_cust.py,sha256=759fnKmTYccJm4Lpi9_1reurh6OB9d6q9soPR0PltKc,2047
3
- metacountregressor/halton.py,sha256=jhovA45UBoZYU9g-hl6Lb2sBIx_ZBTNdPrpgkzR9fng,9463
4
- metacountregressor/helperprocess.py,sha256=nlabPUz_hf8SbFONn2wBKwcVJusTKcynPaxkEYvTWlU,9052
5
- metacountregressor/main.py,sha256=vGdoVsiiiBz_kaCZN_zXSRFS59EtYYNlmz0Et1hgLr8,17138
6
- metacountregressor/main_old.py,sha256=eTS4ygq27MnU-dZ_j983Ucb-D5XfbVF8OJQK2hVVLZc,24123
7
- metacountregressor/metaheuristics.py,sha256=4coj8gTyesnYFA8r4UkQer156OQ30BRQhFTMFEwSjtU,105490
8
- metacountregressor/pareto_file.py,sha256=whySaoPAUWYjyI8zo0hwAOa3rFk6SIUlHSpqZiLur0k,23096
9
- metacountregressor/pareto_logger__plot.py,sha256=mEU2QN4wmsM7t39GJ_XhJ_jjsdl09JOmG0U2jICrAkI,30037
10
- metacountregressor/setup.py,sha256=8w6IqX0tJsbYrOI1BJLIJCIvOnunKli5I9fsF5PhHv4,919
11
- metacountregressor/single_objective_finder.py,sha256=jVG7GJBqzSP4_riYr-kMMKy_LE3SlGmKMunNhHYxgRg,8011
12
- metacountregressor/solution.py,sha256=fQaoo71zIxfOKLC6oTE9BnAm5OJRqRZnQ7a9nPIc9cM,284973
13
- metacountregressor/test_generated_paper2.py,sha256=pwOoRzl1jJIIOUAAvbkT6HmmTQ81mwpsshn9SLdKOg8,3927
14
- metacountregressor/test_motor.py,sha256=tQqot89vcdJMWBY7Y3AOYBkku4Jjj0Ua1cRNeUeK9OA,10424
15
- metacountregressor-0.1.63.dist-info/LICENSE.txt,sha256=OXLcl0T2SZ8Pmy2_dmlvKuetivmyPd5m1q-Gyd-zaYY,35149
16
- metacountregressor-0.1.63.dist-info/METADATA,sha256=Ss5jdF_L-b78jKsg_x3hvLXBSl7wxoAs50sjc7cNvtI,412
17
- metacountregressor-0.1.63.dist-info/WHEEL,sha256=GJ7t_kWBFywbagK5eo9IoUwLW6oyOeTKmQ-9iHFVNxQ,92
18
- metacountregressor-0.1.63.dist-info/top_level.txt,sha256=zGG7UC5WIpr76gsFUpwJ4En2aCcoNTONBaS3OewwjR0,19
19
- metacountregressor-0.1.63.dist-info/RECORD,,
@@ -1 +0,0 @@
1
- metacountregressor