qdesc 1.0.7__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- qdesc-1.0.7/LICENCE.txt +191 -0
- qdesc-1.0.7/PKG-INFO +170 -0
- qdesc-1.0.7/README.md +148 -0
- qdesc-1.0.7/qdesc/__init__.py +213 -0
- qdesc-1.0.7/qdesc/update_checker.py +79 -0
- qdesc-1.0.7/qdesc.egg-info/PKG-INFO +170 -0
- qdesc-1.0.7/qdesc.egg-info/SOURCES.txt +10 -0
- qdesc-1.0.7/qdesc.egg-info/dependency_links.txt +1 -0
- qdesc-1.0.7/qdesc.egg-info/requires.txt +6 -0
- qdesc-1.0.7/qdesc.egg-info/top_level.txt +1 -0
- qdesc-1.0.7/setup.cfg +4 -0
- qdesc-1.0.7/setup.py +25 -0
qdesc-1.0.7/LICENCE.txt
ADDED
@@ -0,0 +1,191 @@
+                    GNU GENERAL PUBLIC LICENSE
+                       Version 3, 29 June 2007
+
+ Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+                            Preamble
+
+  The GNU General Public License is a free, copyleft license for
+software and other kinds of works.
+
+  The licenses for most software and other practical works are designed
+to take away your freedom to share and change the works. By contrast,
+the GNU General Public License is intended to guarantee your freedom to
+share and change all versions of a program--to make sure it remains free
+software for all its users. We, the Free Software Foundation, use the
+GNU General Public License for most of our software; it applies also to
+any other work released this way by its authors. You can apply it to
+your programs, too.
+
+  When we speak of free software, we are referring to freedom, not
+price. Our General Public Licenses are designed to make sure that you
+have the freedom to distribute copies of free software (and charge for
+them if you wish), that you receive source code or can get it if you
+want it, that you can change the software or use pieces of it in new
+free programs, and that you know you can do these things.
+
+  To protect your rights, we need to prevent others from denying you
+these rights or asking you to surrender the rights. Therefore, you have
+certain responsibilities if you distribute copies of the software, or if
+you modify it: responsibilities to respect the freedom of others.
+
+  For example, if you distribute copies of such a program, whether
+gratis or for a fee, you must pass on to the recipients the same
+freedoms that you received. You must make sure that they, too, receive
+or can get the source code. And you must show them these terms so they
+know their rights.
+
+  Developers that use the GNU GPL protect your rights with two steps:
+(1) assert copyright on the software, and (2) offer you this License
+giving you legal permission to copy, distribute and/or modify it.
+
+  For the developers' and authors' protection, the GPL clearly explains
+that there is no warranty for this free software. For both users' and
+authors' sake, the GPL requires that modified versions be marked as
+changed, so that their problems will not be attributed erroneously to
+authors of previous versions.
+
+  Some devices are designed to deny users access to install or run
+modified versions of the software inside them, although the manufacturer
+can do so. This is fundamentally incompatible with the aim of protecting
+users' freedom to change the software. The systematic pattern of such
+abuse occurs in the area of products for individuals to use, which is
+precisely where it is most unacceptable. Therefore, we have designed
+this version of the GPL to prohibit the practice for those products. If
+such problems arise substantially in other domains, we stand ready to
+extend this provision to those domains in future versions of the GPL, as
+needed to protect the freedom of users.
+
+  Finally, every program is threatened constantly by software patents.
+States should not allow patents to restrict development and use of
+software on general-purpose computers, but in those that do, we wish to
+avoid the special danger that patents applied to a free program could
+make it effectively proprietary. To prevent this, the GPL assures that
+patents cannot be used to render the program non-free.
+
+  The precise terms and conditions for copying, distribution and
+modification follow.
+
+                       TERMS AND CONDITIONS
+
+  0. Definitions.
+
+  "This License" refers to version 3 of the GNU General Public License.
+
+  "Copyright" also means copyright-like laws that apply to other kinds
+of works, such as semiconductor masks.
+
+  "The Program" refers to any copyrightable work licensed under this
+License. Each licensee is addressed as "you". "Licensees" and
+"recipients" may be individuals or organizations.
+
+  To "modify" a work means to copy from or adapt all or part of the work
+in a fashion requiring copyright permission, other than the making of an
+exact copy. The resulting work is called a "modified version" of the
+earlier work or a work "based on" the earlier work.
+
+  A "covered work" means either the unmodified Program or a work based
+on the Program.
+
+  To "propagate" a work means to do anything with it that, without
+permission, would make you directly or secondarily liable for
+infringement under applicable copyright law, except executing it on a
+computer or modifying a private copy. Propagation includes copying,
+distribution (with or without modification), making available to the
+public, and in some countries other activities as well.
+
+  To "convey" a work means any kind of propagation that enables other
+parties to make or receive copies. Mere interaction with a user through
+a computer network, with no transfer of a copy, is not conveying.
+
+  An interactive user interface displays "Appropriate Legal Notices"
+to the extent that it includes a convenient and prominently visible
+feature that (1) displays an appropriate copyright notice, and (2)
+tells the user that there is no warranty for the work (except to the
+extent that warranties are provided), that licensees may convey the
+work under this License, and how to view a copy of this License. If
+the interface presents a list of user commands or options, such as a
+menu, a prominent item in the list meets this criterion.
+
+  1. Source Code.
+
+  The "source code" for a work means the preferred form of the work
+for making modifications to it. "Object code" means any non-source
+form of a work.
+
+  A "Standard Interface" means an interface that either is an official
+standard defined by a recognized standards body, or, in the case of
+interfaces specified for a particular programming language, one that
+is widely used among developers working in that language.
+
+  The "System Libraries" of an executable work include anything, other
+than the work as a whole, that (a) is included in the normal form of
+packaging a Major Component, but which is not part of that Major
+Component, and (b) serves only to enable use of the work with that
+Major Component, or to implement a Standard Interface for which an
+implementation is available to the public in source code form. A
+"Major Component", in this context, means a major essential component
+(kernel, window system, and so on) of the specific operating system
+(if any) on which the executable work runs, or a compiler used to
+produce the work, or an object code interpreter used to run it.
+
+  The "Corresponding Source" for a work in object code form means all
+the source code needed to generate, install, and (for an executable
+work) run the object code and to modify the work, including scripts to
+control those activities. However, it does not include the work's
+System Libraries, or general-purpose tools or generally available free
+programs which are used unmodified in performing those activities but
+which are not part of the work. For example, Corresponding Source
+includes interface definition files associated with source files for
+the work, and the source code for shared libraries and dynamically
+linked subprograms that the work is specifically designed to require,
+such as by intimate data communication or control flow between those
+subprograms and other parts of the work.
+
+  The Corresponding Source need not include anything that users
+can regenerate automatically from other parts of the Corresponding
+Source.
+
+  The Corresponding Source for a work in source code form is that
+same work.
+
+  2. Basic Permissions.
+
+  All rights granted under this License are granted for the term of
+copyright on the Program, and are irrevocable provided the stated
+conditions are met. This License explicitly affirms your unlimited
+permission to run the unmodified Program. The output from running a
+covered work is covered by this License only if the output, given its
+content, constitutes a covered work. This License acknowledges your
+rights of fair use or other equivalent, as provided by copyright law.
+
+  You may make, run and propagate covered works that you do not
+convey, without conditions so long as your license otherwise remains
+in force. You may convey covered works to others for the sole purpose
+of having them make modifications exclusively for you, or provide you
+with facilities for running those works, provided that you comply with
+the terms of this License in conveying all material for which you do
+not control copyright. Those thus making or running the covered works
+for you must do so exclusively on your behalf, under your direction
+and control, on terms that prohibit them from making any copies of
+your copyrighted material outside their relationship with you.
+
+  Conveying under any other circumstances is permitted solely under
+the conditions stated below. Sublicensing is not allowed; section 10
+makes it unnecessary.
+
+  3. Protecting Users' Legal Rights From Anti-Circumvention Law.
+
+  No covered work shall be deemed part of an effective technological
+measure under any applicable law fulfilling obligations under article
+11 of the WIPO copyright treaty adopted on 20 December 1996, or
+similar laws prohibiting or restricting circumvention of such
+measures.
+
+  When you convey a covered work, you waive any legal power to forbid
+circumvention of technological measures to the extent such circumvention
+is effected by exercising rights under this License with respect to
+the covered work, and you disclaim any intention to limit operation or
+modification of the work as a means of enforcing
qdesc-1.0.7/PKG-INFO
ADDED
@@ -0,0 +1,170 @@
+Metadata-Version: 2.4
+Name: qdesc
+Version: 1.0.7
+Summary: Quick and Easy way to do descriptive analysis.
+Author: Paolo Hilado
+Author-email: datasciencepgh@proton.me
+Description-Content-Type: text/markdown
+License-File: LICENCE.txt
+Requires-Dist: pandas
+Requires-Dist: numpy
+Requires-Dist: scipy
+Requires-Dist: seaborn
+Requires-Dist: matplotlib
+Requires-Dist: statsmodels
+Dynamic: author
+Dynamic: author-email
+Dynamic: description
+Dynamic: description-content-type
+Dynamic: license-file
+Dynamic: requires-dist
+Dynamic: summary
+
+# <font face = 'Impact' color = '#274472' > qdesc : Quick and Easy Descriptive Analysis </font>
+
+
+
+
+
+[](https://doi.org/10.5281/zenodo.15834554)
+
+
+## <font face = 'Calibri' color = '#274472' > Installation </font>
+```sh
+pip install qdesc
+```
+
+## <font face = 'Calibri' color = '#274472' > Overview </font>
+qdesc is a Python package designed for quick and easy descriptive analysis of quantitative data. It provides essential statistics such as the mean and standard deviation for approximately normal data, and the median and raw median absolute deviation (MAD) for skewed data. With built-in functions for frequency distributions, users can effortlessly analyze categorical variables and export the results to a spreadsheet. The package also includes a normality-check dashboard featuring the Anderson-Darling statistic and visualizations such as histograms and Q-Q plots. Whether you are handling structured datasets or exploring statistical trends, qdesc streamlines the process with efficiency and clarity.
+
+## <font face = 'Calibri' color = '#274472' > Creating a sample dataframe</font>
+```python
+import pandas as pd
+import numpy as np
+
+# Create sample data
+data = {
+    "Age": np.random.randint(18, 60, size=15),  # Continuous variable
+    "Salary": np.random.randint(30000, 120000, size=15),  # Continuous variable
+    "Department": np.random.choice(["HR", "Finance", "IT", "Marketing"], size=15),  # Categorical variable
+    "Gender": np.random.choice(["Male", "Female"], size=15),  # Categorical variable
+}
+# Create DataFrame
+df = pd.DataFrame(data)
+```
+## <font face = 'Calibri' color = '#274472' > qd.desc Function</font>
+The function qd.desc(df) generates the following statistics:
+* count - number of observations
+* mean - measure of central tendency for approximately normal distributions
+* std - measure of spread for approximately normal distributions
+* median - measure of central tendency for skewed distributions or those with outliers
+* MAD - measure of spread for skewed distributions or those with outliers; this is the raw Median Absolute Deviation (MAD), which is more robust for non-normal distributions
+* min - lowest observed value
+* max - highest observed value
+* AD_stat - Anderson-Darling statistic
+* 5% crit_value - critical value at the 5% significance level
+* 1% crit_value - critical value at the 1% significance level
+
+```python
+import qdesc as qd
+qd.desc(df)
+
+| Variable | Count | Mean  | Std Dev | Median | MAD   | Min   | Max    | AD Stat | 5% Crit Value |
+|----------|-------|-------|---------|--------|-------|-------|--------|---------|---------------|
+| Age      | 15.0  | 37.87 | 13.51   | 38.0   | 12.0  | 20.0  | 59.0   | 0.41    | 0.68          |
+| Salary   | 15.0  | 72724 | 29483   | 67660  | 26311 | 34168 | 119590 | 0.40    | 0.68          |
+```
+
+## <font face = 'Calibri' color = '#274472' > qd.grp_desc Function</font>
+The function qd.grp_desc(df, "Continuous Var", "Group Var") creates a table of descriptive
+statistics similar to qd.desc, but with the measures presented for each level of the grouping
+variable. This lets one check whether each group's distribution is approximately normal; combined
+with qd.normcheck_dashboard, it helps decide on the appropriate measure of central tendency and spread.
+
+```python
+import qdesc as qd
+qd.grp_desc(df, "Salary", "Gender")
+
+| Gender  | Count | Mean      | Std Dev   | Median   | MAD      | Min    | Max     | AD Stat | 5% Crit Value |
+|---------|-------|-----------|-----------|----------|----------|--------|---------|---------|---------------|
+| Female  | 7     | 84,871.14 | 32,350.37 | 93,971.0 | 25,619.0 | 40,476 | 119,590 | 0.36    | 0.74          |
+| Male    | 8     | 62,096.12 | 23,766.82 | 60,347.0 | 14,278.5 | 34,168 | 106,281 | 0.24    | 0.71          |
+```
+
+## <font face = 'Calibri' color = '#274472' > qd.freqdist Function</font>
+Run qd.freqdist(df, "Variable Name") to easily create a frequency distribution for your chosen categorical variable with the following:
+* Variable levels (e.g., for a Sex variable: Male and Female)
+* Counts - the number of observations
+* Percentage - percentage of observations out of the total
+
+```python
+import qdesc as qd
+qd.freqdist(df, "Department")
+
+| Department | Count | Percentage |
+|------------|-------|------------|
+| IT         | 5     | 33.33      |
+| HR         | 5     | 33.33      |
+| Marketing  | 3     | 20.00      |
+| Finance    | 2     | 13.33      |
+```
+
+## <font face = 'Calibri' color = '#274472' > qd.freqdist_a Function</font>
+Run qd.freqdist_a(df, ascending=False) to easily create frequency distribution tables for all the categorical variables in your data frame, arranged in descending order of counts (the default) or ascending order (ascending=True). The resulting table includes:
+* Variable levels (e.g., for Satisfaction: Very Low, Low, Moderate, High, Very High)
+* Counts - the number of observations
+* Percentage - percentage of observations out of the total
+
+```python
+import qdesc as qd
+qd.freqdist_a(df)
+
+| Column     | Value     | Count | Percentage |
+|------------|-----------|-------|------------|
+| Department | IT        | 5     | 33.33%     |
+| Department | HR        | 5     | 33.33%     |
+| Department | Marketing | 3     | 20.00%     |
+| Department | Finance   | 2     | 13.33%     |
+| Gender     | Male      | 8     | 53.33%     |
+| Gender     | Female    | 7     | 46.67%     |
+```
+
+## <font face = 'Calibri' color = '#274472' > qd.freqdist_to_excel Function</font>
+Run qd.freqdist_to_excel(df, "Filename.xlsx", ascending=False) to easily create frequency distribution tables for all the categorical variables in your data frame, arranged in descending order of counts (the default) or ascending order (ascending=True), and save them as separate sheets in the given .xlsx file. The resulting table includes:
+* Variable levels (e.g., for Satisfaction: Very Low, Low, Moderate, High, Very High)
+* Counts - the number of observations
+* Percentage - percentage of observations out of the total
+
+```python
+import qdesc as qd
+qd.freqdist_to_excel(df, "Results.xlsx")
+
+# Frequency distributions written to Results.xlsx
+```
+
+## <font face = 'Calibri' color = '#274472' > qd.normcheck_dashboard Function</font>
+Run qd.normcheck_dashboard(df) to efficiently check each numeric variable for normality of its distribution. It computes the Anderson-Darling statistic and creates visualizations (Q-Q plot, histogram, and boxplot) for judging whether the distribution is approximately normal.
+
+```python
+import qdesc as qd
+qd.normcheck_dashboard(df)
+```



+## <font face = 'Calibri' color = '#3D5B59' > License</font>
+This project is licensed under the GPL-3 License. See the LICENCE.txt file for more details.
+
+## <font face = 'Calibri' color = '#3D5B59' > Acknowledgements</font>
+Acknowledgement of the libraries used by this package:
+
+### <font face = 'Calibri' color = '#3D5B59' > Pandas</font>
+pandas is distributed under the BSD 3-Clause License and is developed by the pandas contributors. Copyright (c) 2008-2024, the pandas development team. All rights reserved.
+### <font face = 'Calibri' color = '#3D5B59' > Numpy</font>
+NumPy is distributed under the BSD 3-Clause License and is developed by the NumPy contributors. Copyright (c) 2005-2024, NumPy Developers. All rights reserved.
+### <font face = 'Calibri' color = '#3D5B59' > SciPy</font>
+SciPy is distributed under the BSD License and is developed by the SciPy contributors. Copyright (c) 2001-2024, SciPy Developers. All rights reserved.
+
+
+
+
+
qdesc-1.0.7/README.md
ADDED
@@ -0,0 +1,148 @@
+# <font face = 'Impact' color = '#274472' > qdesc : Quick and Easy Descriptive Analysis </font>
+
+
+
+
+
+[](https://doi.org/10.5281/zenodo.15834554)
+
+
+## <font face = 'Calibri' color = '#274472' > Installation </font>
+```sh
+pip install qdesc
+```
+
+## <font face = 'Calibri' color = '#274472' > Overview </font>
+qdesc is a Python package designed for quick and easy descriptive analysis of quantitative data. It provides essential statistics such as the mean and standard deviation for approximately normal data, and the median and raw median absolute deviation (MAD) for skewed data. With built-in functions for frequency distributions, users can effortlessly analyze categorical variables and export the results to a spreadsheet. The package also includes a normality-check dashboard featuring the Anderson-Darling statistic and visualizations such as histograms and Q-Q plots. Whether you are handling structured datasets or exploring statistical trends, qdesc streamlines the process with efficiency and clarity.
+
+## <font face = 'Calibri' color = '#274472' > Creating a sample dataframe</font>
+```python
+import pandas as pd
+import numpy as np
+
+# Create sample data
+data = {
+    "Age": np.random.randint(18, 60, size=15),  # Continuous variable
+    "Salary": np.random.randint(30000, 120000, size=15),  # Continuous variable
+    "Department": np.random.choice(["HR", "Finance", "IT", "Marketing"], size=15),  # Categorical variable
+    "Gender": np.random.choice(["Male", "Female"], size=15),  # Categorical variable
+}
+# Create DataFrame
+df = pd.DataFrame(data)
+```
+## <font face = 'Calibri' color = '#274472' > qd.desc Function</font>
+The function qd.desc(df) generates the following statistics:
+* count - number of observations
+* mean - measure of central tendency for approximately normal distributions
+* std - measure of spread for approximately normal distributions
+* median - measure of central tendency for skewed distributions or those with outliers
+* MAD - measure of spread for skewed distributions or those with outliers; this is the raw Median Absolute Deviation (MAD), which is more robust for non-normal distributions
+* min - lowest observed value
+* max - highest observed value
+* AD_stat - Anderson-Darling statistic
+* 5% crit_value - critical value at the 5% significance level
+* 1% crit_value - critical value at the 1% significance level
+
+```python
+import qdesc as qd
+qd.desc(df)
+
+| Variable | Count | Mean  | Std Dev | Median | MAD   | Min   | Max    | AD Stat | 5% Crit Value |
+|----------|-------|-------|---------|--------|-------|-------|--------|---------|---------------|
+| Age      | 15.0  | 37.87 | 13.51   | 38.0   | 12.0  | 20.0  | 59.0   | 0.41    | 0.68          |
+| Salary   | 15.0  | 72724 | 29483   | 67660  | 26311 | 34168 | 119590 | 0.40    | 0.68          |
+```
+
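The MAD column above is the median of each variable's absolute deviations from its own median. A minimal NumPy sketch of that calculation (the salary values below are illustrative, not the package's sample output; the 1.4826 factor is the usual rescaling that makes the MAD estimate the standard deviation under normality):

```python
import numpy as np

# Illustrative sample of salaries
salary = np.array([34168, 40476, 60347, 67660, 93971, 106281, 119590])

median = np.median(salary)                    # robust measure of center
raw_mad = np.median(np.abs(salary - median))  # raw MAD: median of absolute deviations
norm_mad = 1.4826 * raw_mad                   # rescaled to estimate sigma under normality

print(median, raw_mad, norm_mad)
```

For this sample, the median is 67660 and the raw MAD is 27184; note how one extreme salary would barely move either number, which is why the README recommends them for skewed data.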
+## <font face = 'Calibri' color = '#274472' > qd.grp_desc Function</font>
+The function qd.grp_desc(df, "Continuous Var", "Group Var") creates a table of descriptive
+statistics similar to qd.desc, but with the measures presented for each level of the grouping
+variable. This lets one check whether each group's distribution is approximately normal; combined
+with qd.normcheck_dashboard, it helps decide on the appropriate measure of central tendency and spread.
+
+```python
+import qdesc as qd
+qd.grp_desc(df, "Salary", "Gender")
+
+| Gender  | Count | Mean      | Std Dev   | Median   | MAD      | Min    | Max     | AD Stat | 5% Crit Value |
+|---------|-------|-----------|-----------|----------|----------|--------|---------|---------|---------------|
+| Female  | 7     | 84,871.14 | 32,350.37 | 93,971.0 | 25,619.0 | 40,476 | 119,590 | 0.36    | 0.74          |
+| Male    | 8     | 62,096.12 | 23,766.82 | 60,347.0 | 14,278.5 | 34,168 | 106,281 | 0.24    | 0.71          |
+```
+
+## <font face = 'Calibri' color = '#274472' > qd.freqdist Function</font>
+Run qd.freqdist(df, "Variable Name") to easily create a frequency distribution for your chosen categorical variable with the following:
+* Variable levels (e.g., for a Sex variable: Male and Female)
+* Counts - the number of observations
+* Percentage - percentage of observations out of the total
+
+```python
+import qdesc as qd
+qd.freqdist(df, "Department")
+
+| Department | Count | Percentage |
+|------------|-------|------------|
+| IT         | 5     | 33.33      |
+| HR         | 5     | 33.33      |
+| Marketing  | 3     | 20.00      |
+| Finance    | 2     | 13.33      |
+```
+
+## <font face = 'Calibri' color = '#274472' > qd.freqdist_a Function</font>
+Run qd.freqdist_a(df, ascending=False) to easily create frequency distribution tables for all the categorical variables in your data frame, arranged in descending order of counts (the default) or ascending order (ascending=True). The resulting table includes:
+* Variable levels (e.g., for Satisfaction: Very Low, Low, Moderate, High, Very High)
+* Counts - the number of observations
+* Percentage - percentage of observations out of the total
+
+```python
+import qdesc as qd
+qd.freqdist_a(df)
+
+| Column     | Value     | Count | Percentage |
+|------------|-----------|-------|------------|
+| Department | IT        | 5     | 33.33%     |
+| Department | HR        | 5     | 33.33%     |
+| Department | Marketing | 3     | 20.00%     |
+| Department | Finance   | 2     | 13.33%     |
+| Gender     | Male      | 8     | 53.33%     |
+| Gender     | Female    | 7     | 46.67%     |
+```
+
+## <font face = 'Calibri' color = '#274472' > qd.freqdist_to_excel Function</font>
+Run qd.freqdist_to_excel(df, "Filename.xlsx", ascending=False) to easily create frequency distribution tables for all the categorical variables in your data frame, arranged in descending order of counts (the default) or ascending order (ascending=True), and save them as separate sheets in the given .xlsx file. The resulting table includes:
+* Variable levels (e.g., for Satisfaction: Very Low, Low, Moderate, High, Very High)
+* Counts - the number of observations
+* Percentage - percentage of observations out of the total
+
+```python
+import qdesc as qd
+qd.freqdist_to_excel(df, "Results.xlsx")
+
+# Frequency distributions written to Results.xlsx
+```
+
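The kind of count/percentage tables these functions produce can be approximated with plain pandas; a hedged sketch, not the package's internals (`freq_table` is an illustrative helper name):

```python
import pandas as pd

df = pd.DataFrame({
    "Department": ["IT", "IT", "HR", "Finance", "HR"],
    "Gender": ["Male", "Female", "Male", "Male", "Female"],
})

def freq_table(s, ascending=False):
    # Counts per level, plus each level's share of the total
    counts = s.value_counts(ascending=ascending)
    return pd.DataFrame({
        "Count": counts,
        "Percentage": (100 * counts / counts.sum()).round(2),
    })

# One table per categorical (object-dtype) column
tables = {col: freq_table(df[col]) for col in df.select_dtypes(include="object").columns}
print(tables["Department"])

# Writing each table to its own sheet requires an Excel engine such as openpyxl:
# with pd.ExcelWriter("Results.xlsx") as writer:
#     for col, tab in tables.items():
#         tab.to_excel(writer, sheet_name=col[:31])  # Excel caps sheet names at 31 chars
```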
+## <font face = 'Calibri' color = '#274472' > qd.normcheck_dashboard Function</font>
+Run qd.normcheck_dashboard(df) to efficiently check each numeric variable for normality of its distribution. It computes the Anderson-Darling statistic and creates visualizations (Q-Q plot, histogram, and boxplot) for judging whether the distribution is approximately normal.
+
+```python
+import qdesc as qd
+qd.normcheck_dashboard(df)
+```



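The dashboard's normality call can also be read off the numbers alone: when the Anderson-Darling statistic exceeds the 5% critical value, normality is rejected at that level. A minimal scipy sketch on a simulated sample (illustrative, not package code):

```python
import numpy as np
from scipy.stats import anderson

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=200)  # simulated, roughly normal data

res = anderson(sample)              # Anderson-Darling test for normality
crit_5 = res.critical_values[2]     # critical values correspond to [15%, 10%, 5%, 2.5%, 1%]
rejects_normality = res.statistic > crit_5

print(round(res.statistic, 3), crit_5, rejects_normality)
```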
+## <font face = 'Calibri' color = '#3D5B59' > License</font>
+This project is licensed under the GPL-3 License. See the LICENCE.txt file for more details.
+
+## <font face = 'Calibri' color = '#3D5B59' > Acknowledgements</font>
+Acknowledgement of the libraries used by this package:
+
+### <font face = 'Calibri' color = '#3D5B59' > Pandas</font>
+pandas is distributed under the BSD 3-Clause License and is developed by the pandas contributors. Copyright (c) 2008-2024, the pandas development team. All rights reserved.
+### <font face = 'Calibri' color = '#3D5B59' > Numpy</font>
+NumPy is distributed under the BSD 3-Clause License and is developed by the NumPy contributors. Copyright (c) 2005-2024, NumPy Developers. All rights reserved.
+### <font face = 'Calibri' color = '#3D5B59' > SciPy</font>
+SciPy is distributed under the BSD License and is developed by the SciPy contributors. Copyright (c) 2001-2024, SciPy Developers. All rights reserved.
+
+
+
+
+
@@ -0,0 +1,213 @@
from .update_checker import check_for_update
import threading

# Run update check automatically on import
threading.Thread(target=lambda: check_for_update("qdesc"), daemon=True).start()

def desc(df):
    import pandas as pd
    import numpy as np
    from scipy.stats import anderson

    x = np.round(df.describe().T, 2)
    x = x.iloc[:, [0, 1, 2, 5, 3, 7]]  # count, mean, std, median (50%), min, max
    x.rename(columns={'50%': 'median'}, inplace=True)

    mad_raw = {}
    mad_norm = {}

    for column in df.select_dtypes(include=[np.number]):
        clean_col = df[column].dropna()

        if len(clean_col) == 0:
            mad_raw[column] = np.nan
            mad_norm[column] = np.nan
            continue
        median = np.median(clean_col)
        abs_dev = np.abs(clean_col - median)
        raw = np.median(abs_dev)
        mad_raw[column] = raw
        mad_norm[column] = 1.4826 * raw  # normalized MAD
    mad_df = pd.DataFrame({
        'MAD_raw': mad_raw,
        'MAD_norm': mad_norm
    })
    results = {}
    for column in df.select_dtypes(include=[np.number]):
        clean_col = df[column].dropna()

        if len(clean_col) < 5:
            results[column] = {'AD_stat': np.nan, '5% crit_value': np.nan}
            continue

        result = anderson(clean_col)
        results[column] = {
            'AD_stat': result.statistic,
            '5% crit_value': result.critical_values[2]
        }
    anderson_df = pd.DataFrame.from_dict(results, orient='index')
    xl = x.iloc[:, :4]
    xr = x.iloc[:, 4:]
    x_df = np.round(pd.concat([xl, mad_df, xr, anderson_df], axis=1), 2)
    return x_df

def grp_desc(df, numeric_col, group_col):
    import pandas as pd
    import numpy as np
    from scipy.stats import median_abs_deviation, anderson
    results = []
    for group, group_df in df.groupby(group_col):
        data = group_df[numeric_col].dropna()
        if len(data) < 2:
            # Keys must match the normal branch so the columns line up
            stats = {
                group_col: group,
                'count': len(data),
                'mean': np.nan,
                'std': np.nan,
                'median': np.nan,
                'mad_raw': np.nan,
                'mad_norm': np.nan,
                'min': np.nan,
                'max': np.nan,
                'AD_stat': np.nan,
                'crit_5%': np.nan
            }
        else:
            ad_result = anderson(data, dist='norm')
            stats = {
                group_col: group,
                'count': len(data),
                'mean': data.mean(),
                'std': data.std(),
                'median': data.median(),
                'mad_raw': median_abs_deviation(data),
                'mad_norm': median_abs_deviation(data) * 1.4826,
                'min': data.min(),
                'max': data.max(),
                'AD_stat': ad_result.statistic,
                'crit_5%': ad_result.critical_values[2],  # 5% is the third value
            }
        results.append(stats)
    return np.round(pd.DataFrame(results), 2)

def freqdist(df, column_name):
    import pandas as pd
    import numpy as np
    if column_name not in df.columns:
        raise ValueError(f"Column '{column_name}' not found in DataFrame.")

    if df[column_name].dtype not in ['object', 'category']:
        raise ValueError(f"Column '{column_name}' is not a categorical column.")

    freq_dist = df[column_name].value_counts().reset_index()
    freq_dist.columns = [column_name, 'Count']
    freq_dist['Percentage'] = np.round((freq_dist['Count'] / len(df)) * 100, 2)
    return freq_dist

def freqdist_a(df, ascending=False):
    import pandas as pd
    import numpy as np
    results = []
    for column in df.select_dtypes(include=['object', 'category']).columns:
        frequency_table = df[column].value_counts()
        percentage_table = np.round(df[column].value_counts(normalize=True) * 100, 2)

        distribution = pd.DataFrame({
            'Column': column,
            'Value': frequency_table.index,
            'Count': frequency_table.values,
            'Percentage': percentage_table.values
        })
        distribution = distribution.sort_values(by='Percentage', ascending=ascending)
        results.append(distribution)
    final_df = pd.concat(results, ignore_index=True)
    return final_df

def clean_sheet_name(name):
    import re
    # Remove characters that are invalid in Excel sheet names
    name = re.sub(r'[:\\/?*\[\]]', '', name)
    # Limit to Excel's 31-character maximum
    name = name.strip()[:31]
    return name

def freqdist_to_excel(df, output_path, sort_by='Percentage', ascending=False, top_n=None):
    import pandas as pd
    import numpy as np
    used_names = set()
    with pd.ExcelWriter(output_path, engine='xlsxwriter') as writer:
        for column in df.select_dtypes(include=['object', 'category']).columns:
            frequency_table = df[column].value_counts()
            percentage_table = df[column].value_counts(normalize=True) * 100

            distribution = pd.DataFrame({
                'Value': frequency_table.index,
                'Count': frequency_table.values,
                'Percentage': percentage_table.values
            })
            distribution = distribution.sort_values(by=sort_by, ascending=ascending)
            if top_n is not None:
                distribution = distribution.head(top_n)
            # Generate a safe, unique sheet name
            base_name = clean_sheet_name(column)
            sheet_name = base_name
            count = 1
            while sheet_name.lower() in used_names:
                sheet_name = f"{base_name[:28]}_{count}"  # stay within the 31-char limit
                count += 1
            used_names.add(sheet_name.lower())
            distribution.to_excel(writer, sheet_name=sheet_name, index=False)
    print(f"Frequency distributions written to {output_path}")

def normcheck_dashboard(df, significance_level=0.05, figsize=(18, 5)):
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    import statsmodels.api as sm
    from scipy.stats import anderson
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) == 0:
        print("No numeric columns to analyze.")
        return
    for col in numeric_cols:
        data = df[col].dropna()
        print(f"\n--- Variable: {col} ---")
        if len(data) < 8:
            print("Not enough data to perform the Anderson-Darling test or meaningful plots.")
            continue
        # Anderson-Darling test
        test_result = anderson(data, dist='norm')
        stat = test_result.statistic
        sig_levels = test_result.significance_level
        crit_values = test_result.critical_values
        # Pick the critical value whose significance level is closest to the requested one
        level_diff = [abs(sl - (significance_level * 100)) for sl in sig_levels]
        closest_index = level_diff.index(min(level_diff))
        used_sig = sig_levels[closest_index]
        crit_val = crit_values[closest_index]
        decision = "Fail to Reject Null" if stat <= crit_val else "Reject Null"
        # Print summary
        print(f"  Anderson-Darling Statistic : {stat:.4f}")
        print(f"  Critical Value (@ {used_sig}%) : {crit_val:.4f}")
        print(f"  Decision : {decision}")
        # Plots (QQ plot, histogram, boxplot)
        fig, axes = plt.subplots(1, 3, figsize=figsize)
        # QQ plot
        sm.qqplot(data, line='s', ax=axes[0])
        axes[0].set_title(f"QQ Plot - {col}")
        # Histogram (no KDE)
        sns.histplot(data, bins=30, kde=False, color='gray', alpha=0.3, ax=axes[1])
        axes[1].set_title(f"Histogram - {col}")
        # Boxplot
        sns.boxplot(x=data, ax=axes[2], color='lightblue')
        axes[2].set_title(f"Boxplot - {col}")
        axes[2].set_xlabel(col)
        plt.suptitle(f"Normality Assessment - {col}", fontsize=14, y=1.05)
        plt.tight_layout()
        plt.show()
@@ -0,0 +1,79 @@
import os
import json
import time
import threading
import requests
from packaging import version
from pathlib import Path
from pkg_resources import get_distribution

# Optional: rich printing for console and notebook
try:
    from IPython.display import display, Markdown
    IN_NOTEBOOK = True
except ImportError:
    IN_NOTEBOOK = False

CHECK_INTERVAL_DAYS = 7

def _should_check(cache_file: Path) -> bool:
    if not cache_file.exists():
        return True
    try:
        with open(cache_file, "r") as f:
            data = json.load(f)
        last_check = data.get("last_check", 0)
    except Exception:
        return True

    days_since = (time.time() - last_check) / (60 * 60 * 24)
    return days_since >= CHECK_INTERVAL_DAYS


def _write_check_timestamp(cache_file: Path):
    try:
        with open(cache_file, "w") as f:
            json.dump({"last_check": time.time()}, f)
    except Exception:
        pass

def _print_update_message(pkg_name, installed, latest):
    message = (
        f"📈 A new version of '{pkg_name}' is available!\n"
        f"Installed version: {installed}\n"
        f"Latest version: {latest}\n"
        f"Update with:\n"
        f"    pip install --upgrade {pkg_name}\n"
    )
    if IN_NOTEBOOK:
        # Display nicely in Jupyter Notebook
        display(Markdown(f"**{message}**"))
    else:
        # Regular console print
        print(message)

def _check_now(package_name: str):
    cache_file = Path.home() / f".{package_name}_update_check.json"
    if not _should_check(cache_file):
        return
    _write_check_timestamp(cache_file)
    if os.getenv("PYLIB_DISABLE_UPDATE_CHECK") == "1":
        return
    try:
        installed_version = get_distribution(package_name).version
    except Exception:
        return
    try:
        url = f"https://pypi.org/pypi/{package_name}/json"
        response = requests.get(url, timeout=3)
        latest_version = response.json()["info"]["version"]
    except Exception:
        return

    if version.parse(latest_version) > version.parse(installed_version):
        _print_update_message(package_name, installed_version, latest_version)

def check_for_update(package_name: str):
    # Run the check in a background thread so import is not blocked
    thread = threading.Thread(target=_check_now, args=(package_name,), daemon=True)
    thread.start()
@@ -0,0 +1,170 @@
Metadata-Version: 2.4
Name: qdesc
Version: 1.0.7
Summary: Quick and Easy way to do descriptive analysis.
Author: Paolo Hilado
Author-email: datasciencepgh@proton.me
Description-Content-Type: text/markdown
License-File: LICENCE.txt
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: scipy
Requires-Dist: seaborn
Requires-Dist: matplotlib
Requires-Dist: statsmodels
Dynamic: author
Dynamic: author-email
Dynamic: description
Dynamic: description-content-type
Dynamic: license-file
Dynamic: requires-dist
Dynamic: summary

# <font face = 'Impact' color = '#274472' > qdesc : Quick and Easy Descriptive Analysis </font>




[](https://doi.org/10.5281/zenodo.15834554)


## <font face = 'Calibri' color = '#274472' > Installation </font>
```sh
pip install qdesc
```

## <font face = 'Calibri' color = '#274472' > Overview </font>
qdesc is a Python package designed for quick and easy descriptive analysis of quantitative data. It provides the essential statistics: mean and standard deviation for approximately normal distributions, and median and median absolute deviation for skewed data. Built-in functions for frequency distributions let users analyze categorical variables and export the results to a spreadsheet, and a normality-check dashboard reports the Anderson-Darling statistic alongside visualizations such as histograms and Q-Q plots. Whether you're handling structured datasets or exploring statistical trends, qdesc streamlines the process with efficiency and clarity.

## <font face = 'Calibri' color = '#274472' > Creating a sample dataframe</font>
```python
import pandas as pd
import numpy as np

# Create sample data
data = {
    "Age": np.random.randint(18, 60, size=15),  # Continuous variable
    "Salary": np.random.randint(30000, 120000, size=15),  # Continuous variable
    "Department": np.random.choice(["HR", "Finance", "IT", "Marketing"], size=15),  # Categorical variable
    "Gender": np.random.choice(["Male", "Female"], size=15),  # Categorical variable
}
# Create DataFrame
df = pd.DataFrame(data)
```
## <font face = 'Calibri' color = '#274472' > qd.desc Function</font>
The function qd.desc(df) generates the following statistics:
* count - number of observations
* mean - measure of central tendency for normal distributions
* std - measure of spread for normal distributions
* median - measure of central tendency for skewed distributions or those with outliers
* MAD_raw / MAD_norm - measures of spread for skewed distributions or those with outliers; the raw Median Absolute Deviation and its normalized form (1.4826 x raw MAD), which are more robust than the standard deviation for non-normal distributions
* min - lowest observed value
* max - highest observed value
* AD_stat - Anderson-Darling statistic
* 5% crit_value - critical value at the 5% significance level
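The 1.4826 factor behind MAD_norm is approximately 1/Φ⁻¹(0.75), the reciprocal of the 0.75 quantile of the standard normal distribution; it rescales the raw MAD so that it estimates the standard deviation when the data are normal. A quick sketch of this relationship (illustrative only, not part of the package):

```python
import numpy as np
from scipy.stats import norm

# The normalization constant: 1 / (inverse normal CDF at 0.75)
c = 1 / norm.ppf(0.75)  # ≈ 1.4826

# For (approximately) normal data, the normalized MAD tracks the std dev
rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=100_000)
mad_raw = np.median(np.abs(x - np.median(x)))
mad_norm = c * mad_raw  # close to the true scale of 10
```

This is why MAD_norm is directly comparable to std for roughly normal columns, while MAD_raw is the purely rank-based spread.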

```python
import qdesc as qd
qd.desc(df)
```

| Variable | Count | Mean | Std Dev | Median | MAD | Min | Max | AD Stat | 5% Crit Value |
|----------|-------|-------|---------|--------|-------|-------|--------|---------|---------------|
| Age | 15.0 | 37.87 | 13.51 | 38.0 | 12.0 | 20.0 | 59.0 | 0.41 | 0.68 |
| Salary | 15.0 | 72724 | 29483 | 67660 | 26311 | 34168 | 119590 | 0.40 | 0.68 |

## <font face = 'Calibri' color = '#274472' > qd.grp_desc Function</font>
The function qd.grp_desc(df, "Continuous Var", "Group Var") creates a table of descriptive statistics similar to qd.desc, but with the measures presented for each level of the grouping variable. This lets you check whether each group's distribution is approximately normal; combined with qd.normcheck_dashboard, it helps you choose the appropriate measures of central tendency and spread.

```python
import qdesc as qd
qd.grp_desc(df, "Salary", "Gender")
```

| Gender | Count | Mean | Std Dev | Median | MAD | Min | Max | AD Stat | 5% Crit Value |
|---------|-------|-----------|-----------|----------|----------|--------|---------|---------|---------------|
| Female | 7 | 84,871.14 | 32,350.37 | 93,971.0 | 25,619.0 | 40,476 | 119,590 | 0.36 | 0.74 |
| Male | 8 | 62,096.12 | 23,766.82 | 60,347.0 | 14,278.5 | 34,168 | 106,281 | 0.24 | 0.71 |

## <font face = 'Calibri' color = '#274472' > qd.freqdist Function</font>
Run qd.freqdist(df, "Variable Name") to create a frequency distribution for your chosen categorical variable with the following:
* Variable levels (e.g., for a Sex variable: Male and Female)
* Count - the number of observations
* Percentage - percentage of observations out of the total

```python
import qdesc as qd
qd.freqdist(df, "Department")
```

| Department | Count | Percentage |
|------------|-------|------------|
| IT | 5 | 33.33 |
| HR | 5 | 33.33 |
| Marketing | 3 | 20.00 |
| Finance | 2 | 13.33 |

## <font face = 'Calibri' color = '#274472' > qd.freqdist_a Function</font>
Run qd.freqdist_a(df, ascending=False) to create frequency distribution tables for all the categorical variables in your data frame, arranged in descending order (the default) or ascending order (ascending=True). The resulting table includes:
* Variable levels (e.g., for Satisfaction: Very Low, Low, Moderate, High, Very High)
* Count - the number of observations
* Percentage - percentage of observations out of the total

```python
import qdesc as qd
qd.freqdist_a(df)
```

| Column | Value | Count | Percentage |
|------------|----------|-------|------------|
| Department | IT | 5 | 33.33% |
| Department | HR | 5 | 33.33% |
| Department | Marketing| 3 | 20.00% |
| Department | Finance | 2 | 13.33% |
| Gender | Male | 8 | 53.33% |
| Gender | Female | 7 | 46.67% |

## <font face = 'Calibri' color = '#274472' > qd.freqdist_to_excel Function</font>
Run qd.freqdist_to_excel(df, "Filename.xlsx", ascending=False) to create frequency distribution tables for all the categorical variables in your data frame, arranged in descending order (the default) or ascending order (ascending=True), and save them as separate sheets in the .xlsx file. Each sheet includes:
* Variable levels (e.g., for Satisfaction: Very Low, Low, Moderate, High, Very High)
* Count - the number of observations
* Percentage - percentage of observations out of the total
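Excel caps sheet names at 31 characters and forbids the characters `: \ / ? * [ ]`, so each column name is sanitized before it becomes a sheet name; the cleaning step mirrors the clean_sheet_name helper in the package source:

```python
import re

def clean_sheet_name(name: str) -> str:
    # Drop characters Excel does not allow in sheet names
    name = re.sub(r'[:\\/?*\[\]]', '', name)
    # Trim whitespace and enforce Excel's 31-character limit
    return name.strip()[:31]

clean_sheet_name("Q1: Customer/Segment [2024]")  # -> 'Q1 CustomerSegment 2024'
clean_sheet_name("x" * 40)                       # truncated to 31 characters
```

Duplicate names that survive cleaning are further deduplicated with a numeric suffix so every column still gets its own sheet.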

```python
import qdesc as qd
qd.freqdist_to_excel(df, "Results.xlsx")
# Frequency distributions written to Results.xlsx
```

## <font face = 'Calibri' color = '#274472' > qd.normcheck_dashboard Function</font>
Run qd.normcheck_dashboard(df) to efficiently check each numeric variable for normality. It computes the Anderson-Darling statistic and creates visualizations (a Q-Q plot, histogram, and boxplot) for judging whether the distribution is approximately normal.
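The dashboard's printed decision follows the standard Anderson-Darling rule: normality is rejected when the statistic exceeds the critical value at the chosen significance level. The same logic can be sketched with scipy directly (illustrative data, not package code):

```python
import numpy as np
from scipy.stats import anderson

rng = np.random.default_rng(42)
data = rng.normal(loc=0, scale=1, size=500)

result = anderson(data, dist='norm')
# scipy reports critical values at the 15%, 10%, 5%, 2.5%, and 1% levels;
# index 2 corresponds to the 5% significance level
crit_5 = result.critical_values[2]
decision = "Fail to Reject Null" if result.statistic <= crit_5 else "Reject Null"
print(f"AD stat = {result.statistic:.4f}, 5% crit = {crit_5:.4f}: {decision}")
```

Because the conclusion is only a hypothesis test, the dashboard pairs it with the plots so borderline cases can be judged visually as well.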

```python
import qdesc as qd
qd.normcheck_dashboard(df)
```



## <font face = 'Calibri' color = '#3D5B59' > License</font>
This project is licensed under the GPL-3 License. See the LICENCE.txt file for more details.

## <font face = 'Calibri' color = '#3D5B59' > Acknowledgements</font>
This package gratefully builds on the following libraries:

### <font face = 'Calibri' color = '#3D5B59' > Pandas</font>
Pandas is distributed under the BSD 3-Clause License and is developed by the pandas contributors. Copyright (c) 2008-2024, the pandas development team. All rights reserved.
### <font face = 'Calibri' color = '#3D5B59' > Numpy</font>
NumPy is distributed under the BSD 3-Clause License and is developed by the NumPy contributors. Copyright (c) 2005-2024, NumPy Developers. All rights reserved.
### <font face = 'Calibri' color = '#3D5B59' > SciPy</font>
SciPy is distributed under the BSD License and is developed by the SciPy contributors. Copyright (c) 2001-2024, SciPy Developers. All rights reserved.
@@ -0,0 +1 @@


@@ -0,0 +1 @@
qdesc
qdesc-1.0.7/setup.cfg
ADDED
qdesc-1.0.7/setup.py
ADDED

@@ -0,0 +1,25 @@
from setuptools import setup, find_packages
from pathlib import Path

# Read the contents of the README file
this_directory = Path(__file__).parent
long_description = (this_directory / "README.md").read_text()

setup(
    name='qdesc',
    version='1.0.7',
    packages=find_packages(),
    install_requires=[
        'pandas',
        'numpy',
        'scipy',
        'seaborn',
        'matplotlib',
        'statsmodels'
    ],
    author='Paolo Hilado',
    author_email='datasciencepgh@proton.me',
    description='Quick and Easy way to do descriptive analysis.',
    long_description=long_description,
    long_description_content_type='text/markdown',  # or 'text/x-rst' for reStructuredText
)