qdesc 0.1.9.1__tar.gz → 0.1.9.3__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release.
This version of qdesc might be problematic.
- qdesc-0.1.9.3/PKG-INFO +175 -0
- qdesc-0.1.9.3/README.md +153 -0
- {qdesc-0.1.9.1 → qdesc-0.1.9.3}/qdesc/__init__.py +4 -2
- qdesc-0.1.9.3/qdesc.egg-info/PKG-INFO +175 -0
- {qdesc-0.1.9.1 → qdesc-0.1.9.3}/qdesc.egg-info/SOURCES.txt +1 -0
- qdesc-0.1.9.3/qdesc.egg-info/requires.txt +6 -0
- {qdesc-0.1.9.1 → qdesc-0.1.9.3}/setup.py +7 -2
- qdesc-0.1.9.1/PKG-INFO +0 -110
- qdesc-0.1.9.1/README.md +0 -95
- qdesc-0.1.9.1/qdesc.egg-info/PKG-INFO +0 -110
- {qdesc-0.1.9.1 → qdesc-0.1.9.3}/LICENCE.txt +0 -0
- {qdesc-0.1.9.1 → qdesc-0.1.9.3}/qdesc.egg-info/dependency_links.txt +0 -0
- {qdesc-0.1.9.1 → qdesc-0.1.9.3}/qdesc.egg-info/top_level.txt +0 -0
- {qdesc-0.1.9.1 → qdesc-0.1.9.3}/setup.cfg +0 -0
qdesc-0.1.9.3/PKG-INFO
ADDED
@@ -0,0 +1,175 @@
+Metadata-Version: 2.4
+Name: qdesc
+Version: 0.1.9.3
+Summary: Quick and Easy way to do descriptive analysis.
+Author: Paolo Hilado
+Author-email: datasciencepgh@proton.me
+Description-Content-Type: text/markdown
+License-File: LICENCE.txt
+Requires-Dist: pandas
+Requires-Dist: numpy
+Requires-Dist: scipy
+Requires-Dist: seaborn
+Requires-Dist: matplotlib
+Requires-Dist: statsmodels
+Dynamic: author
+Dynamic: author-email
+Dynamic: description
+Dynamic: description-content-type
+Dynamic: license-file
+Dynamic: requires-dist
+Dynamic: summary
+
+# qdesc - Quick and Easy Descriptive Analysis
+
+
+
+
+
+## Installation
+```sh
+pip install qdesc
+```
+
+## Overview
+Qdesc is a package for quick and easy descriptive analysis. It is a powerful Python package designed for quick and easy descriptive analysis of quantitative data. It provides essential statistics like mean and standard deviation for normal distribution and median and raw median absolute deviation for skewed data. With built-in functions for frequency distributions, users can effortlessly analyze categorical variables and export results to a spreadsheet. The package also includes a normality check dashboard, featuring Anderson-Darling statistics and visualizations like histograms and Q-Q plots. Whether you're handling structured datasets or exploring statistical trends, qdesc streamlines the process with efficiency and clarity.
+
+## Creating a sample dataframe
+```python
+import pandas as pd
+import numpy as np
+
+# Create sample data
+data = {
+"Age": np.random.randint(18, 60, size=15), # Continuous variable
+"Salary": np.random.randint(30000, 120000, size=15), # Continuous variable
+"Department": np.random.choice(["HR", "Finance", "IT", "Marketing"], size=15), # Categorical variable
+"Gender": np.random.choice(["Male", "Female"], size=15), # Categorical variable
+}
+# Create DataFrame
+df = pd.DataFrame(data)
+```
+
+## qd.desc Function
+The function qd.desc(df) generates the following statistics:
+* count - number of observations
+* mean - measure of central tendency for normal distribution
+* std - measure of spread for normal distribution
+* median - measure of central tendency for skewed distributions or those with outliers
+* MAD - measure of spread for skewed distributions or those with outliers; this is manual Median Absolute Deviation (MAD) which is more robust when dealing with non-normal distributions.
+* min - lowest observed value
+* max - highest observed value
+* AD_stat - Anderson - Darling Statistic
+* 5% crit_value - critical value for a 5% Significance Level
+* 1% crit_value - critical value for a 1% Significance Level
+
+```python
+import qdesc as qd
+qd.desc(df)
+
+| Variable | Count | Mean | Std Dev | Median | MAD | Min | Max | AD Stat | 5% Crit Value |
+|----------|-------|-------|---------|--------|-------|-------|--------|---------|---------------|
+| Age | 15.0 | 37.87 | 13.51 | 38.0 | 12.0 | 20.0 | 59.0 | 0.41 | 0.68 |
+| Salary | 15.0 | 72724 | 29483 | 67660 | 26311 | 34168 | 119590 | 0.40 | 0.68 |
+```
+
+
+
+## qd.grp_desc Function
+This function, qd.grp_desc(df, "Continuous Var", "Group Var") creates a table for descriptive statistics similar to the qd.desc function but has the measures
+presented for each level of the grouping variable. It allows one to check whether these measures, for each group, are approximately normal or not. Combining it
+with qd.normcheck_dashboard allows one to decide on the appropriate measure of central tendency and spread.
+
+```python
+import qdesc as qd
+qd.grp_desc(df, "Salary", "Gender")
+
+| Gender | Count | Mean - | Std Dev | Median | MAD | Min | Max | AD Stat | 5% Crit Value |
+|---------|-------|-----------|-----------|----------|----------|--------|---------|---------|---------------|
+| Female | 7 | 84,871.14 | 32,350.37 | 93,971.0 | 25,619.0 | 40,476 | 119,590 | 0.36 | 0.74 |
+| Male | 8 | 62,096.12 | 23,766.82 | 60,347.0 | 14,278.5 | 34,168 | 106,281 | 0.24 | 0.71 |
+```
+
+
+## qd.freqdist Function
+Run the function qd.freqdist(df, "Variable Name") to easily create a frequency distribution for your chosen categorical variable with the following:
+* Variable Levels (i.e., for Sex Variable: Male and Female)
+* Counts - the number of observations
+* Percentage - percentage of observations from total.
+
+```python
+import qdesc as qd
+qd.freqdist(df, "Department")
+
+| Department | Count | Percentage |
+|------------|-------|------------|
+| IT | 5 | 33.33 |
+| HR | 5 | 33.33 |
+| Marketing | 3 | 20.00 |
+| Finance | 2 | 13.33 |
+```
+
+
+
+## qd.freqdist_a Function
+Run the function qd.freqdist_a(df, ascending = FALSE) to easily create frequency distribution tables, arranged in descending manner (default) or ascending (TRUE), for all the categorical variables in your data frame. The resulting table will include columns such as:
+* Variable levels (i.e., for Satisfaction: Very Low, Low, Moderate, High, Very High)
+* Counts - the number of observations
+* Percentage - percentage of observations from total.
+
+```python
+import qdesc as qd
+qd.freqdist_a(df)
+
+| Column | Value | Count | Percentage |
+|------------|----------|-------|------------|
+| Department | IT | 5 | 33.33% |
+| Department | HR | 5 | 33.33% |
+| Department | Marketing| 3 | 20.00% |
+| Department | Finance | 2 | 13.33% |
+| Gender | Male | 8 | 53.33% |
+| Gender | Female | 7 | 46.67% |
+```
+
+
+
+## qd.freqdist_to_excel Function
+Run the function qd.freqdist_to_excel(df, "Filename.xlsx", ascending = FALSE ) to easily create frequency distribution tables, arranged in descending manner (default) or ascending (TRUE), for all the categorical variables in your data frame and SAVED as separate sheets in the .xlsx File. The resulting table will include columns such as:
+* Variable levels (i.e., for Satisfaction: Very Low, Low, Moderate, High, Very High)
+* Counts - the number of observations
+* Percentage - percentage of observations from total.
+
+```python
+import qdesc as qd
+qd.freqdist_to_excel(df, "Results.xlsx")
+
+Frequency distributions written to Results.xlsx
+```
+
+## qd.normcheck_dashboard Function
+Run the function qd.normcheck_dashboard(df) to efficiently check each numeric variable for normality of its distribution. It will compute the Anderson-Darling statistic and create visualizations (i.e., qq-plot, histogram, and boxplots) for checking whether the distribution is approximately normal.
+
+```python
+import qdesc as qd
+qd.normcheck_dashboard(df)
+```
+
+
+
+## License
+This project is licensed under the GPL-3 License. See the LICENSE file for more details.
+
+## Acknowledgements
+Acknowledgement of the libraries used by this package...
+
+### Pandas
+Pandas is distributed under the BSD 3-Clause License, pandas is developed by Pandas contributors. Copyright (c) 2008-2024, the pandas development team All rights reserved.
+### NumPy
+NumPy is distributed under the BSD 3-Clause License, numpy is developed by NumPy contributors. Copyright (c) 2005-2024, NumPy Developers. All rights reserved.
+### SciPy
+SciPy is distributed under the BSD License, scipy is developed by SciPy contributors. Copyright (c) 2001-2024, SciPy Developers. All rights reserved.
+
+
+
+
+
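The PKG-INFO added in 0.1.9.3 carries six `Requires-Dist` fields that the 0.1.9.1 metadata did not have. As a minimal sketch of how those fields can be read back, the snippet below parses a copy of the header lines with the standard library's `email.parser`, since core metadata uses an RFC 822 style header layout. The helper name `declared_requirements` is hypothetical and is not part of qdesc.

```python
from email.parser import Parser

# Header lines copied from the PKG-INFO added in 0.1.9.3 (description body omitted).
PKG_INFO = """\
Metadata-Version: 2.4
Name: qdesc
Version: 0.1.9.3
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: scipy
Requires-Dist: seaborn
Requires-Dist: matplotlib
Requires-Dist: statsmodels
"""


def declared_requirements(pkg_info_text):
    """Return every Requires-Dist value found in core-metadata text.

    Hypothetical helper for illustration; it is not part of qdesc.
    """
    message = Parser().parsestr(pkg_info_text)
    # get_all returns the fallback list when the header is absent.
    return message.get_all("Requires-Dist", [])


requirements = declared_requirements(PKG_INFO)
print(requirements)
```

Metadata without any `Requires-Dist` line, such as the 0.1.9.1 header, would yield an empty list from the same helper.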
qdesc-0.1.9.3/README.md
ADDED
@@ -0,0 +1,153 @@
+# qdesc - Quick and Easy Descriptive Analysis
+
+
+
+
+
+## Installation
+```sh
+pip install qdesc
+```
+
+## Overview
+Qdesc is a package for quick and easy descriptive analysis. It is a powerful Python package designed for quick and easy descriptive analysis of quantitative data. It provides essential statistics like mean and standard deviation for normal distribution and median and raw median absolute deviation for skewed data. With built-in functions for frequency distributions, users can effortlessly analyze categorical variables and export results to a spreadsheet. The package also includes a normality check dashboard, featuring Anderson-Darling statistics and visualizations like histograms and Q-Q plots. Whether you're handling structured datasets or exploring statistical trends, qdesc streamlines the process with efficiency and clarity.
+
+## Creating a sample dataframe
+```python
+import pandas as pd
+import numpy as np
+
+# Create sample data
+data = {
+"Age": np.random.randint(18, 60, size=15), # Continuous variable
+"Salary": np.random.randint(30000, 120000, size=15), # Continuous variable
+"Department": np.random.choice(["HR", "Finance", "IT", "Marketing"], size=15), # Categorical variable
+"Gender": np.random.choice(["Male", "Female"], size=15), # Categorical variable
+}
+# Create DataFrame
+df = pd.DataFrame(data)
+```
+
+## qd.desc Function
+The function qd.desc(df) generates the following statistics:
+* count - number of observations
+* mean - measure of central tendency for normal distribution
+* std - measure of spread for normal distribution
+* median - measure of central tendency for skewed distributions or those with outliers
+* MAD - measure of spread for skewed distributions or those with outliers; this is manual Median Absolute Deviation (MAD) which is more robust when dealing with non-normal distributions.
+* min - lowest observed value
+* max - highest observed value
+* AD_stat - Anderson - Darling Statistic
+* 5% crit_value - critical value for a 5% Significance Level
+* 1% crit_value - critical value for a 1% Significance Level
+
+```python
+import qdesc as qd
+qd.desc(df)
+
+| Variable | Count | Mean | Std Dev | Median | MAD | Min | Max | AD Stat | 5% Crit Value |
+|----------|-------|-------|---------|--------|-------|-------|--------|---------|---------------|
+| Age | 15.0 | 37.87 | 13.51 | 38.0 | 12.0 | 20.0 | 59.0 | 0.41 | 0.68 |
+| Salary | 15.0 | 72724 | 29483 | 67660 | 26311 | 34168 | 119590 | 0.40 | 0.68 |
+```
+
+
+
+## qd.grp_desc Function
+This function, qd.grp_desc(df, "Continuous Var", "Group Var") creates a table for descriptive statistics similar to the qd.desc function but has the measures
+presented for each level of the grouping variable. It allows one to check whether these measures, for each group, are approximately normal or not. Combining it
+with qd.normcheck_dashboard allows one to decide on the appropriate measure of central tendency and spread.
+
+```python
+import qdesc as qd
+qd.grp_desc(df, "Salary", "Gender")
+
+| Gender | Count | Mean - | Std Dev | Median | MAD | Min | Max | AD Stat | 5% Crit Value |
+|---------|-------|-----------|-----------|----------|----------|--------|---------|---------|---------------|
+| Female | 7 | 84,871.14 | 32,350.37 | 93,971.0 | 25,619.0 | 40,476 | 119,590 | 0.36 | 0.74 |
+| Male | 8 | 62,096.12 | 23,766.82 | 60,347.0 | 14,278.5 | 34,168 | 106,281 | 0.24 | 0.71 |
+```
+
+
+## qd.freqdist Function
+Run the function qd.freqdist(df, "Variable Name") to easily create a frequency distribution for your chosen categorical variable with the following:
+* Variable Levels (i.e., for Sex Variable: Male and Female)
+* Counts - the number of observations
+* Percentage - percentage of observations from total.
+
+```python
+import qdesc as qd
+qd.freqdist(df, "Department")
+
+| Department | Count | Percentage |
+|------------|-------|------------|
+| IT | 5 | 33.33 |
+| HR | 5 | 33.33 |
+| Marketing | 3 | 20.00 |
+| Finance | 2 | 13.33 |
+```
+
+
+
+## qd.freqdist_a Function
+Run the function qd.freqdist_a(df, ascending = FALSE) to easily create frequency distribution tables, arranged in descending manner (default) or ascending (TRUE), for all the categorical variables in your data frame. The resulting table will include columns such as:
+* Variable levels (i.e., for Satisfaction: Very Low, Low, Moderate, High, Very High)
+* Counts - the number of observations
+* Percentage - percentage of observations from total.
+
+```python
+import qdesc as qd
+qd.freqdist_a(df)
+
+| Column | Value | Count | Percentage |
+|------------|----------|-------|------------|
+| Department | IT | 5 | 33.33% |
+| Department | HR | 5 | 33.33% |
+| Department | Marketing| 3 | 20.00% |
+| Department | Finance | 2 | 13.33% |
+| Gender | Male | 8 | 53.33% |
+| Gender | Female | 7 | 46.67% |
+```
+
+
+
+## qd.freqdist_to_excel Function
+Run the function qd.freqdist_to_excel(df, "Filename.xlsx", ascending = FALSE ) to easily create frequency distribution tables, arranged in descending manner (default) or ascending (TRUE), for all the categorical variables in your data frame and SAVED as separate sheets in the .xlsx File. The resulting table will include columns such as:
+* Variable levels (i.e., for Satisfaction: Very Low, Low, Moderate, High, Very High)
+* Counts - the number of observations
+* Percentage - percentage of observations from total.
+
+```python
+import qdesc as qd
+qd.freqdist_to_excel(df, "Results.xlsx")
+
+Frequency distributions written to Results.xlsx
+```
+
+## qd.normcheck_dashboard Function
+Run the function qd.normcheck_dashboard(df) to efficiently check each numeric variable for normality of its distribution. It will compute the Anderson-Darling statistic and create visualizations (i.e., qq-plot, histogram, and boxplots) for checking whether the distribution is approximately normal.
+
+```python
+import qdesc as qd
+qd.normcheck_dashboard(df)
+```
+
+
+
+## License
+This project is licensed under the GPL-3 License. See the LICENSE file for more details.
+
+## Acknowledgements
+Acknowledgement of the libraries used by this package...
+
+### Pandas
+Pandas is distributed under the BSD 3-Clause License, pandas is developed by Pandas contributors. Copyright (c) 2008-2024, the pandas development team All rights reserved.
+### NumPy
+NumPy is distributed under the BSD 3-Clause License, numpy is developed by NumPy contributors. Copyright (c) 2005-2024, NumPy Developers. All rights reserved.
+### SciPy
+SciPy is distributed under the BSD License, scipy is developed by SciPy contributors. Copyright (c) 2001-2024, SciPy Developers. All rights reserved.
+
+
+
+
+
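The README added above pairs mean and standard deviation with approximately normal data, and median and "raw" MAD with skewed data, using the Anderson-Darling statistic and its critical values to tell the two apart. The sketch below illustrates that choice with the standard library only. It assumes "raw" means the unscaled median of absolute deviations from the median, and that the standard deviation is the sample version; neither detail is confirmed by the diff. The helpers `raw_mad` and `summarize` are hypothetical and are not qdesc functions.

```python
import statistics


def raw_mad(values):
    """Unscaled median absolute deviation: median(|x - median(x)|).

    Assumption: this is what the README calls "raw" MAD.
    """
    center = statistics.median(values)
    return statistics.median(abs(v - center) for v in values)


def summarize(values, ad_stat, crit_5):
    """Choose a (label, center, spread) summary from an Anderson-Darling result.

    ad_stat and crit_5 come from the caller, for example from one row of
    a qd.desc table. A statistic below the 5% critical value means normality
    is not rejected at that level, so mean and sample std are reported.
    """
    if ad_stat < crit_5:
        return ("mean/std", statistics.mean(values), statistics.stdev(values))
    return ("median/MAD", statistics.median(values), raw_mad(values))


values = [1, 2, 3, 4, 100]  # one extreme value drags the mean up to 22
print(raw_mad(values))
print(summarize(values, ad_stat=1.2, crit_5=0.68))
```

With the README's own example row for Age (AD Stat 0.41, 5% critical value 0.68) the same rule would select mean and standard deviation.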
{qdesc-0.1.9.1 → qdesc-0.1.9.3}/qdesc/__init__.py
@@ -79,6 +79,7 @@ def grp_desc(df, numeric_col, group_col):
 
 def freqdist(df, column_name):
     import pandas as pd
+    import numpy as np
     if column_name not in df.columns:
         raise ValueError(f"Column '{column_name}' not found in DataFrame.")
 
@@ -87,16 +88,17 @@ def freqdist(df, column_name):
 
     freq_dist = df[column_name].value_counts().reset_index()
     freq_dist.columns = [column_name, 'Count']
-    freq_dist['Percentage'] = (freq_dist['Count'] / len(df)) * 100
+    freq_dist['Percentage'] = np.round((freq_dist['Count'] / len(df)) * 100,2)
     return freq_dist
 
 
 def freqdist_a(df, ascending=False):
     import pandas as pd
+    import numpy as np
     results = []
     for column in df.select_dtypes(include=['object', 'category']).columns:
         frequency_table = df[column].value_counts()
-        percentage_table = df[column].value_counts(normalize=True) * 100
+        percentage_table = np.round(df[column].value_counts(normalize=True) * 100,2)
 
         distribution = pd.DataFrame({
             'Column': column,
qdesc-0.1.9.3/qdesc.egg-info/PKG-INFO
ADDED
@@ -0,0 +1,175 @@
(175 added lines, identical to the qdesc-0.1.9.3/PKG-INFO content shown above)
{qdesc-0.1.9.1 → qdesc-0.1.9.3}/setup.py
@@ -7,10 +7,15 @@ long_description = (this_directory / "README.md").read_text()
 
 setup(
     name='qdesc',
-    version='0.1.9.
+    version='0.1.9.3',
     packages=find_packages(),
     install_requires=[
-
+        'pandas',
+        'numpy',
+        'scipy',
+        'seaborn',
+        'matplotlib',
+        'statsmodels'
     ],
     author='Paolo Hilado',
     author_email='datasciencepgh@proton.me',
qdesc-0.1.9.1/PKG-INFO
DELETED
|
@@ -1,110 +0,0 @@
|
|
|
1
|
-
Metadata-Version: 2.4
|
|
2
|
-
Name: qdesc
|
|
3
|
-
Version: 0.1.9.1
|
|
4
|
-
Summary: Quick and Easy way to do descriptive analysis.
|
|
5
|
-
Author: Paolo Hilado
|
|
6
|
-
Author-email: datasciencepgh@proton.me
|
|
7
|
-
Description-Content-Type: text/markdown
|
|
8
|
-
License-File: LICENCE.txt
|
|
9
|
-
Dynamic: author
|
|
10
|
-
Dynamic: author-email
|
|
11
|
-
Dynamic: description
|
|
12
|
-
Dynamic: description-content-type
|
|
13
|
-
Dynamic: license-file
|
|
14
|
-
Dynamic: summary
|
|
15
|
-
|
|
16
|
-
# qdesc - Quick and Easy Descriptive Analysis
|
|
17
|
-
|
|
18
|
-
## Overview
|
|
19
|
-
This is a package for quick and easy descriptive analysis.
|
|
20
|
-
Required packages include: pandas, numpy, and SciPy version 1.14.1
|
|
21
|
-
Be sure to run the following prior to using the "qd.desc" function:
-
-- import pandas as pd
-- import numpy as np
-- from scipy.stats import anderson
-- import qdesc as qd
-
-The qdesc package provides a quick and easy approach to do descriptive analysis for quantitative data.
-
-## qd.desc Function
-Run the function qd.desc(df) to get the following statistics:
-* count - number of observations
-* mean - measure of central tendency for normal distribution
-* std - measure of spread for normal distribution
-* median - measure of central tendency for skewed distributions or those with outliers
-* MAD - measure of spread for skewed distributions or those with outliers; this is manual Median Absolute Deviation (MAD) which is more robust when dealing with non-normal distributions.
-* min - lowest observed value
-* max - highest observed value
-* AD_stat - Anderson - Darling Statistic
-* 5% crit_value - critical value for a 5% Significance Level
-* 1% crit_value - critical value for a 1% Significance Level
-
-## qd.freqdist Function
-Run the function qd.freqdist(df, "Variable Name") to easily create a frequency distribution for your chosen categorical variable with the following:
-* Variable Levels (i.e., for Sex Variable: Male and Female)
-* Counts - the number of observations
-* Percentage - percentage of observations from total.
-
-## qd.freqdist_a Function
-Run the function qd.freqdist_a(df, ascending = FALSE) to easily create frequency distribution tables, arranged in descending manner (default) or ascending (TRUE), for all
-the categorical variables in your data frame. The resulting table will include columns such as:
-* Variable levels (i.e., for Satisfaction: Very Low, Low, Moderate, High, Very High)
-* Counts - the number of observations
-* Percentage - percentage of observations from total.
-
-## qd.freqdist_to_excel Function
-Run the function qd.freqdist_to_excel(df, "Name of file.xlsx", ascending = FALSE ) to easily create frequency distribution tables, arranged in descending manner (default) or ascending (TRUE), for all the categorical variables in your data frame and SAVED as separate sheets in the .xlsx File. The resulting table will include columns such as:
-* Variable levels (i.e., for Satisfaction: Very Low, Low, Moderate, High, Very High)
-* Counts - the number of observations
-* Percentage - percentage of observations from total.
-
-## qd.normcheck_dashboard Function
-Run the function qd.normcheck_dashboard(df) to efficiently check each numeric variable for normality of its distribution. It will compute the Anderson-Darling statistic and
-create visualizations (i.e., qq-plot, histogram, and boxplots) for checking whether the distribution is approximately normal.
-
-
-## Installation
-pip install qdesc
-
-## Sample use of qdesc functions
-
-## Creating a sample dataframe
-import pandas as pd
-import numpy as np
-
-## Set seed for reproducibility
-np.random.seed(21)
-
-## Create two continuous variables
-var1 = np.random.normal(loc=0, scale=1, size=1000)
-var2 = np.random.uniform(low=10, high=50, size=1000)
-
-## Create DataFrame
-df = pd.DataFrame({
-'Normal_Variable': var1,
-'Uniform_Variable': var2
-})
-
-## Using the qdesc function
-import qdesc as qd
-
-qd.desc(df)
-
-## License
-This project is licensed under the GPL-3 License. See the LICENSE file for more details.
-
-## Acknowledgements
-Acknowledgement of the libraries used by this package...
-
-### Pandas
-Pandas is distributed under the BSD 3-Clause License, pandas is developed by Pandas contributors. Copyright (c) 2008-2024, the pandas development team All rights reserved.
-### NumPy
-NumPy is distributed under the BSD 3-Clause License, numpy is developed by NumPy contributors. Copyright (c) 2005-2024, NumPy Developers. All rights reserved.
-### SciPy
-SciPy is distributed under the BSD License, scipy is developed by SciPy contributors. Copyright (c) 2001-2024, SciPy Developers. All rights reserved.
-
-
-
-
-
|
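Reviewer note on the removed PKG-INFO text above: the statistics it lists for `qd.desc` can be approximated with pandas and SciPy directly. The sketch below is not qdesc's implementation. `describe_sketch` is a hypothetical helper name, MAD is assumed to mean the unscaled median of absolute deviations from the median, and the Anderson-Darling values are assumed to come from `scipy.stats.anderson` with `dist="norm"`.

```python
import numpy as np
import pandas as pd
from scipy.stats import anderson


def describe_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the statistics the README attributes to qd.desc."""
    rows = {}
    for name in df.select_dtypes(include="number").columns:
        x = df[name].dropna().to_numpy(dtype=float)
        med = np.median(x)
        # Anderson-Darling test against the normal distribution.
        res = anderson(x, dist="norm")
        # Map each significance level (in percent) to its critical value.
        levels = [float(s) for s in res.significance_level]
        crit = dict(zip(levels, res.critical_values))
        rows[name] = {
            "count": int(x.size),
            "mean": x.mean(),
            "std": x.std(ddof=1),  # sample standard deviation
            "median": med,
            "MAD": np.median(np.abs(x - med)),  # unscaled MAD (assumption)
            "min": x.min(),
            "max": x.max(),
            "AD_stat": res.statistic,
            "5% crit_value": crit[5.0],
            "1% crit_value": crit[1.0],
        }
    return pd.DataFrame.from_dict(rows, orient="index")


if __name__ == "__main__":
    rng = np.random.default_rng(21)
    demo = pd.DataFrame({
        "Normal_Variable": rng.normal(loc=0, scale=1, size=1000),
        "Uniform_Variable": rng.uniform(low=10, high=50, size=1000),
    })
    print(describe_sketch(demo).round(3))
```

An AD_stat above a critical value suggests rejecting normality at that significance level, which is presumably why the README reports the statistic alongside both thresholds.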
qdesc-0.1.9.1/README.md
DELETED
@@ -1,95 +0,0 @@
-# qdesc - Quick and Easy Descriptive Analysis
-
-## Overview
-This is a package for quick and easy descriptive analysis.
-Required packages include: pandas, numpy, and SciPy version 1.14.1
-Be sure to run the following prior to using the "qd.desc" function:
-
-- import pandas as pd
-- import numpy as np
-- from scipy.stats import anderson
-- import qdesc as qd
-
-The qdesc package provides a quick and easy approach to do descriptive analysis for quantitative data.
-
-## qd.desc Function
-Run the function qd.desc(df) to get the following statistics:
-* count - number of observations
-* mean - measure of central tendency for normal distribution
-* std - measure of spread for normal distribution
-* median - measure of central tendency for skewed distributions or those with outliers
-* MAD - measure of spread for skewed distributions or those with outliers; this is manual Median Absolute Deviation (MAD) which is more robust when dealing with non-normal distributions.
-* min - lowest observed value
-* max - highest observed value
-* AD_stat - Anderson - Darling Statistic
-* 5% crit_value - critical value for a 5% Significance Level
-* 1% crit_value - critical value for a 1% Significance Level
-
-## qd.freqdist Function
-Run the function qd.freqdist(df, "Variable Name") to easily create a frequency distribution for your chosen categorical variable with the following:
-* Variable Levels (i.e., for Sex Variable: Male and Female)
-* Counts - the number of observations
-* Percentage - percentage of observations from total.
-
-## qd.freqdist_a Function
-Run the function qd.freqdist_a(df, ascending = FALSE) to easily create frequency distribution tables, arranged in descending manner (default) or ascending (TRUE), for all
-the categorical variables in your data frame. The resulting table will include columns such as:
-* Variable levels (i.e., for Satisfaction: Very Low, Low, Moderate, High, Very High)
-* Counts - the number of observations
-* Percentage - percentage of observations from total.
-
-## qd.freqdist_to_excel Function
-Run the function qd.freqdist_to_excel(df, "Name of file.xlsx", ascending = FALSE ) to easily create frequency distribution tables, arranged in descending manner (default) or ascending (TRUE), for all the categorical variables in your data frame and SAVED as separate sheets in the .xlsx File. The resulting table will include columns such as:
-* Variable levels (i.e., for Satisfaction: Very Low, Low, Moderate, High, Very High)
-* Counts - the number of observations
-* Percentage - percentage of observations from total.
-
-## qd.normcheck_dashboard Function
-Run the function qd.normcheck_dashboard(df) to efficiently check each numeric variable for normality of its distribution. It will compute the Anderson-Darling statistic and
-create visualizations (i.e., qq-plot, histogram, and boxplots) for checking whether the distribution is approximately normal.
-
-
-## Installation
-pip install qdesc
-
-## Sample use of qdesc functions
-
-## Creating a sample dataframe
-import pandas as pd
-import numpy as np
-
-## Set seed for reproducibility
-np.random.seed(21)
-
-## Create two continuous variables
-var1 = np.random.normal(loc=0, scale=1, size=1000)
-var2 = np.random.uniform(low=10, high=50, size=1000)
-
-## Create DataFrame
-df = pd.DataFrame({
-'Normal_Variable': var1,
-'Uniform_Variable': var2
-})
-
-## Using the qdesc function
-import qdesc as qd
-
-qd.desc(df)
-
-## License
-This project is licensed under the GPL-3 License. See the LICENSE file for more details.
-
-## Acknowledgements
-Acknowledgement of the libraries used by this package...
-
-### Pandas
-Pandas is distributed under the BSD 3-Clause License, pandas is developed by Pandas contributors. Copyright (c) 2008-2024, the pandas development team All rights reserved.
-### NumPy
-NumPy is distributed under the BSD 3-Clause License, numpy is developed by NumPy contributors. Copyright (c) 2005-2024, NumPy Developers. All rights reserved.
-### SciPy
-SciPy is distributed under the BSD License, scipy is developed by SciPy contributors. Copyright (c) 2001-2024, SciPy Developers. All rights reserved.
-
-
-
-
-
|
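Reviewer note on the removed README text above: the frequency tables it describes for `qd.freqdist` (levels, counts, percentage of total) can be sketched with plain pandas. This is an illustration under stated assumptions, not the package's code: `freqdist_sketch` is a hypothetical name, missing values are assumed to be excluded, and the README's `ascending = FALSE` is read as Python's `False`, since `FALSE` and `TRUE` are not Python literals.

```python
import pandas as pd


def freqdist_sketch(df: pd.DataFrame, column: str, ascending: bool = False) -> pd.DataFrame:
    """Hypothetical frequency table: one row per level with count and percentage."""
    # value_counts sorts by frequency; ascending=False puts the most common level first.
    counts = df[column].value_counts(ascending=ascending, dropna=True)
    table = counts.rename_axis(column).reset_index(name="Counts")
    # Percentage of the non-missing observations that fall in each level.
    table["Percentage"] = table["Counts"] / table["Counts"].sum() * 100
    return table


if __name__ == "__main__":
    demo = pd.DataFrame({"Sex": ["Male", "Female", "Female", "Female"]})
    print(freqdist_sketch(demo, "Sex"))
```

Looping this helper over `df.select_dtypes(exclude="number").columns` would give one table per categorical variable, which is roughly what the README describes for `qd.freqdist_a`.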
qdesc-0.1.9.1/qdesc.egg-info/PKG-INFO
DELETED
@@ -1,110 +0,0 @@
-Metadata-Version: 2.4
-Name: qdesc
-Version: 0.1.9.1
-Summary: Quick and Easy way to do descriptive analysis.
-Author: Paolo Hilado
-Author-email: datasciencepgh@proton.me
-Description-Content-Type: text/markdown
-License-File: LICENCE.txt
-Dynamic: author
-Dynamic: author-email
-Dynamic: description
-Dynamic: description-content-type
-Dynamic: license-file
-Dynamic: summary
-
-# qdesc - Quick and Easy Descriptive Analysis
-
-## Overview
-This is a package for quick and easy descriptive analysis.
-Required packages include: pandas, numpy, and SciPy version 1.14.1
-Be sure to run the following prior to using the "qd.desc" function:
-
-- import pandas as pd
-- import numpy as np
-- from scipy.stats import anderson
-- import qdesc as qd
-
-The qdesc package provides a quick and easy approach to do descriptive analysis for quantitative data.
-
-## qd.desc Function
-Run the function qd.desc(df) to get the following statistics:
-* count - number of observations
-* mean - measure of central tendency for normal distribution
-* std - measure of spread for normal distribution
-* median - measure of central tendency for skewed distributions or those with outliers
-* MAD - measure of spread for skewed distributions or those with outliers; this is manual Median Absolute Deviation (MAD) which is more robust when dealing with non-normal distributions.
-* min - lowest observed value
-* max - highest observed value
-* AD_stat - Anderson - Darling Statistic
-* 5% crit_value - critical value for a 5% Significance Level
-* 1% crit_value - critical value for a 1% Significance Level
-
-## qd.freqdist Function
-Run the function qd.freqdist(df, "Variable Name") to easily create a frequency distribution for your chosen categorical variable with the following:
-* Variable Levels (i.e., for Sex Variable: Male and Female)
-* Counts - the number of observations
-* Percentage - percentage of observations from total.
-
-## qd.freqdist_a Function
-Run the function qd.freqdist_a(df, ascending = FALSE) to easily create frequency distribution tables, arranged in descending manner (default) or ascending (TRUE), for all
-the categorical variables in your data frame. The resulting table will include columns such as:
-* Variable levels (i.e., for Satisfaction: Very Low, Low, Moderate, High, Very High)
-* Counts - the number of observations
-* Percentage - percentage of observations from total.
-
-## qd.freqdist_to_excel Function
-Run the function qd.freqdist_to_excel(df, "Name of file.xlsx", ascending = FALSE ) to easily create frequency distribution tables, arranged in descending manner (default) or ascending (TRUE), for all the categorical variables in your data frame and SAVED as separate sheets in the .xlsx File. The resulting table will include columns such as:
-* Variable levels (i.e., for Satisfaction: Very Low, Low, Moderate, High, Very High)
-* Counts - the number of observations
-* Percentage - percentage of observations from total.
-
-## qd.normcheck_dashboard Function
-Run the function qd.normcheck_dashboard(df) to efficiently check each numeric variable for normality of its distribution. It will compute the Anderson-Darling statistic and
-create visualizations (i.e., qq-plot, histogram, and boxplots) for checking whether the distribution is approximately normal.
-
-
-## Installation
-pip install qdesc
-
-## Sample use of qdesc functions
-
-## Creating a sample dataframe
-import pandas as pd
-import numpy as np
-
-## Set seed for reproducibility
-np.random.seed(21)
-
-## Create two continuous variables
-var1 = np.random.normal(loc=0, scale=1, size=1000)
-var2 = np.random.uniform(low=10, high=50, size=1000)
-
-## Create DataFrame
-df = pd.DataFrame({
-'Normal_Variable': var1,
-'Uniform_Variable': var2
-})
-
-## Using the qdesc function
-import qdesc as qd
-
-qd.desc(df)
-
-## License
-This project is licensed under the GPL-3 License. See the LICENSE file for more details.
-
-## Acknowledgements
-Acknowledgement of the libraries used by this package...
-
-### Pandas
-Pandas is distributed under the BSD 3-Clause License, pandas is developed by Pandas contributors. Copyright (c) 2008-2024, the pandas development team All rights reserved.
-### NumPy
-NumPy is distributed under the BSD 3-Clause License, numpy is developed by NumPy contributors. Copyright (c) 2005-2024, NumPy Developers. All rights reserved.
-### SciPy
-SciPy is distributed under the BSD License, scipy is developed by SciPy contributors. Copyright (c) 2001-2024, SciPy Developers. All rights reserved.
-
-
-
-
-
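{qdesc-0.1.9.1 → qdesc-0.1.9.3}/LICENCE.txt
File without changes
{qdesc-0.1.9.1 → qdesc-0.1.9.3}/qdesc.egg-info/dependency_links.txt
File without changes
{qdesc-0.1.9.1 → qdesc-0.1.9.3}/qdesc.egg-info/top_level.txt
File without changes
{qdesc-0.1.9.1 → qdesc-0.1.9.3}/setup.cfg
File without changes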