hammock-plot 0.2__tar.gz → 0.3__tar.gz
- hammock_plot-0.3/PKG-INFO +179 -0
- {hammock_plot-0.2 → hammock_plot-0.3}/hammock_plot/hammock_plot.py +44 -37
- hammock_plot-0.3/hammock_plot.egg-info/PKG-INFO +179 -0
- {hammock_plot-0.2 → hammock_plot-0.3}/hammock_plot.egg-info/SOURCES.txt +1 -0
- {hammock_plot-0.2 → hammock_plot-0.3}/setup.py +1 -1
- hammock_plot-0.2/PKG-INFO +0 -178
- hammock_plot-0.2/hammock_plot.egg-info/PKG-INFO +0 -178
- {hammock_plot-0.2 → hammock_plot-0.3}/README.md +0 -0
- {hammock_plot-0.2 → hammock_plot-0.3}/hammock_plot/__init__.py +0 -0
- {hammock_plot-0.2 → hammock_plot-0.3}/hammock_plot.egg-info/dependency_links.txt +0 -0
- {hammock_plot-0.2 → hammock_plot-0.3}/hammock_plot.egg-info/requires.txt +0 -0
- {hammock_plot-0.2 → hammock_plot-0.3}/hammock_plot.egg-info/top_level.txt +0 -0
- {hammock_plot-0.2 → hammock_plot-0.3}/setup.cfg +0 -0
@@ -0,0 +1,179 @@
+Metadata-Version: 2.1
+Name: hammock_plot
+Version: 0.3
+Summary: Hammock - visualization of categorical or mixed categorical/continuous data
+Home-page: https://github.com/TianchengY/hammock_plot
+Author: Tiancheng Yang
+Author-email: t77yang@uwaterloo.ca
+Classifier: Programming Language :: Python :: 3
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Classifier: Intended Audience :: Science/Research
+Requires-Python: >=3.6
+Description-Content-Type: text/markdown
+Requires-Dist: matplotlib
+Requires-Dist: numpy
+Requires-Dist: pandas
+
+# Hammock plot
+
+
+## Description
+
+hammock draws a graph to visualize categorical data - though it also does fine with continuous data.
+Variables are lined up parallel to the vertical axis. Categories within a variable are spread out along a
+vertical line. Categories of adjacent variables are connected by boxes. (The boxes are parallelograms; we
+use boxes for brevity). The "width" of a box is proportional to the number of observations that correspond
+to that box (i.e. have the same values/categories for the two variables). The "width" of a box refers to the
+distance between the longer set of parallel lines rather than the vertical distance.
+
+If the boxes degenerate to a single line, and no labels or missing values are used the hammock plot
+corresponds to a parallel coordinate plot. Boxes degenerate into a single line if barwidth is so small that
+the boxes for categorical variables appear to be a single line. For continuous variables boxes will usually
+appear to be a single line because each category typically only contains one observation.
+
+The order of variables in varlist determines the order of variables in the graph. All variables in varlist
+must be numerical. String variables should be converted to numerical variables first, e.g. using encode or
+destring.
+
+
+
+
+## Getting started
+
+You can install hammock from `pip`:
+
+```shell
+pip install hammock_plot
+```
+
+
+### Example: Asthma data
+
+We import the diabetes dataset:
+
+```python
+import hammock_plot
+import pandas as pd
+df = pd.read_csv('../examples/asthma/asth_all3_for_python.csv')
+```
+
+Minimal example of a hammock plot:
+```python
+var = ["hospitalizations","group","gender","comorbidities"]
+hammock = hammock_plot.Hammock(data_df = df)
+ax = hammock.plot(var=var)
+```
+
+The ordering of the child-adolescent-adult variable is not in the desired order; adult should not be in the middle. We now specify a specific order, child-adolescent-adult.
+
+```python
+var = ["hospitalizations","group","gender","comorbidities"]
+group_dict= {1: "child", 2: "adolescent",3: "adult"}
+value_order = {"group": group_dict}
+hammock = hammock_plot.Hammock(data_df = df)
+ax = hammock.plot(var=var, value_order=value_order )
+```
+
+We highlight observations with comorbidities=0 in red:
+
+```python
+ax = hammock.plot(var=var, value_order=value_order ,hi_var="comorbidities", hi_value=[0], color=["red"])
+```
+
+
+### Example Satisfaction scales for the diabetes data
+
+We import the diabetes dataset:
+
+```python
+import hammock_plot
+import pandas as pd
+df = pd.read_csv('../examples/diabetes_outlier/diabetes_for_python.csv')
+```
+
+The three variables represent different ordinal scales for satisfaction. We are checking for missing values:
+```python
+var = ["sataces","satcomm","satrate"]
+hammock = hammock_plot.Hammock(data_df = df)
+ax = hammock.plot(var=var, default_color="blue", missing=True)
+```
+
+The missing value category is shown at the bottom for each variable. We find missing values for all 3 variables, but fewest for the last one. We also see a phenomenon called "top coding", where
+satisfied respondents simply choose the highest value.
+
+
+
+## API Reference
+
+```
+hammock()
+```
+
+| Category | Parameter | Type | Description |
+| --- | :-------- | :------- | :------------------------- |
+| General | `var` | `List[str]` | List of variables to display. |
+| | `value_order` | `Dict[str, Dict[int, str]]` | If specified, the order of the values in the plot follows the order of values in the list supplied in the dictionary. A specific value order is useful, for example, for ordered variables. The integer values affect spacing: for example the values 4,5,6 imply equal spacing between 4,5 and 5,6. The values 4,5,7 implies twice as much space between 5,7 as between 4,5.
+| | `missing` | `bool` | Whether or not to add a category for missing values at the bottom of the plot. If False, observations that have a missing value for any variable in the data frame (even those not used in the hammock plot) are removed. Default is False. |
+| | `label` | `bool` | Whether or not to display labels between the plotting segments |
+| Highlighting | `hi_var` | `str` | Variable to be highlighted. Default is none. |
+| | `hi_value` | `List[str or int]` | List of values of `hi_var` to be highlighted. You can highlighted one or multiple values. |
+| | `hi_missing` | `bool` | Whether or not missing values for `hi_var` should be highlighted. |
+| | `color` | `List[str]` | List of colors corresponding to the list of values to be highlighted. Default highlight color list is ["red", "green", "yellow", "lightblue", "orange", "gray", "brown", "olive", "pink", "cyan", "magenta"] |
+| | `default_color` | `str` | Default color of plotting elements for boxes that are not highlighted. Default is "blue" |
+| Manipulating Spacing and Layout | `bar_width` | `float` | Factor by which the default width is increased or reduced. This allows reducing visual clutter. Default is 1.0. |
+| | `space` | `float` | Space left for the labels between the plotting elements. Default is 0.5 |
+| | `label_options` | `Dict[str, Dict[str, Any]]` | Manipulates the size and look of the labels. Args following the options in the website: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.text.html Example:{"ExampleVarname":{"fontsize":12,"fontstyle":"italic","fontweight":"black","color":"b"}} Default is None. |
+| | `height` | `float` | Height of the plot in inches. Default is 10. |
+| | `width` | `float` | Width of the plot in inches. Default is 15. Caution: Width too narrow may distort the plot. |
+| Other options | `shape` | `str` | Shape of the boxes. "rectangle" (default) or "parallelogram". |
+| | `same_scale` | `List[str]` | List of variables that have the same scale. Default is None. |
+| | `display_figure` | `bool` | Whether or not to display the figure. This can be useful if you just want to save the plots. Default is 'True'. |
+| | `save_path` | `str` | If it is not None, the figure will be saved to the given path with given name and format. Default is None. |
+
+
+## Historical context
+
+In 1898, Sankey diagrams were developed to visualize flows of energy and materials.
+
+In 1985, Inselberg popularized parallel coordinates to visualize continuous variables only. The central contribution is the use of parallel axes.
+
+In 2003, Schonlau proposed the hammock plot. This was the first plot to visualize categorical data (or mixed categorical continuous data) on parallel axes.
+
+In 2010, Rosvall proposed alluvial plots to visualize network variables over time. Rather than using bars to connect axes, alluvial plots use rounded curves. Alluvial plots are now also used to visualize categorical data.
+
+There are several additional variations that also visualize categorical data including Parallel Set plots (Bendix et al, 2005) and Right Angle plots (Hofmann and Vendettuoli, 2013).
+
+### References
+Bendix, F., Kosara, R., & Hauser, H. (2005). Parallel sets: visual analysis of categorical data. In IEEE Symposium on Information Visualization, 2005. INFOVIS 2005. 133-140.
+
+Hofmann, H., & Vendettuoli, M. (2013). Common angle plots as perception-true visualizations of categorical associations. IEEE transactions on visualization and computer graphics, 19(12), 2297-2305.
+
+Inselberg, A., & Dimsdale, B. (2009). Parallel coordinates. Human-Machine Interactive Systems, 199-233.
+
+Rosvall, Martin, & Bergstrom, C.T. (2010) "Mapping change in large networks." PloS one 5.1: e8694.
+
+Sankey, H. (1898). Introductory note on the thermal efficiency of steam-engines. report of
+the committee appointed on the 31st march, 1896, to consider and report to the council
+upon the subject of the definition of a standard or standards of thermal efficiency for
+steam-engines: With an introductory note. In Minutes of proceedings of the institution
+of civil engineers, Volume 134, pp. 278–283.
+
+Schonlau M.
+*[Visualizing Categorical Data Arising in the Health Sciences Using Hammock Plots.](http://www.schonlau.net/publication/03jsm_hammockplot.pdf)*
+In Proceedings of the Section on Statistical Graphics, American Statistical Association; 2003
+
+
+
+### Other implementations of the hammock plot
+There is also a Stata implementation `hammock` (available from the Stata archive SSC) and an R implementation as part of the package `ggparallel`.
+
+[](https://choosealicense.com/licenses/mit/)
+
+
+
+## Authors
+
+- Tiancheng Yang t77yang@uwaterloo.ca
+
+
@@ -2,7 +2,7 @@ import pandas as pd
 import numpy as np
 import matplotlib.pyplot as plt
 from abc import ABC, abstractmethod
-from typing import List,Dict
+from typing import List, Dict
 import warnings
 
 
@@ -121,7 +121,7 @@ class Hammock:
 
     def plot(self,
              var: List[str] = None,
-             value_order: Dict[str, Dict[int,str]] = None,
+             value_order: Dict[str, Dict[int, str]] = None,
              missing: bool = False,
              hi_missing: bool = False,
              missing_label_space: float = 1.,
@@ -133,6 +133,7 @@ class Hammock:
              default_color="blue",
              # Manipulating Spacing and Layout
              bar_width: float = 1.,
+             min_bar_width: float = .05,
              space: float = .5,
              label_options: Dict = None,
              height: float = 10.,
@@ -151,7 +152,7 @@ class Hammock:
             if self.data_df[col].dtype.name == "category":
                 self.data_df[col] = self.data_df[col].cat.add_categories(self.missing_data_placeholder)
             elif "float" in self.data_df[col].dtype.name:
-                self.data_df[col] = self.data_df[col].apply(lambda x: np.round(x,2))
+                self.data_df[col] = self.data_df[col].apply(lambda x: np.round(x, 2))
         self.data_df_columns = self.data_df.columns.tolist()
 
         if not var_lst:
@@ -160,7 +161,6 @@ class Hammock:
             )
 
         if color and type(color) != type([]):
-
             raise ValueError(
                 f'Argument "color" must be a list os str.'
             )
@@ -178,9 +178,9 @@ class Hammock:
             )
 
         if value_order:
-            for k,v_ori in value_order.items():
+            for k, v_ori in value_order.items():
                 uni_val_set = set(self.data_df[k].dropna().unique())
-                v = [value_name for order,value_name in v_ori.items()]
+                v = [value_name for order, value_name in v_ori.items()]
                 if not set(v) >= uni_val_set:
                     error_values = (set(v) ^ uni_val_set) & set(v)
                     raise ValueError(
@@ -218,23 +218,25 @@ class Hammock:
                 self.hi_value.append(self.missing_data_placeholder)
             else:
                 self.hi_value = [self.missing_data_placeholder]
-        colors = ["red", "green", "yellow", "lightblue","orange", "gray", "brown", "olive", "pink", "cyan", "magenta"]
+        colors = ["red", "green", "yellow", "lightblue", "orange", "gray", "brown", "olive", "pink", "cyan", "magenta"]
         self.color_lst = [color for color in color_lst] if color_lst else (
             colors[:len(self.hi_value)] if hi_var else None)
         if hi_var:
             if hi_value and len(self.color_lst) < len(hi_value):
-                for i in range(len(hi_value)-len(self.color_lst)):
+                for i in range(len(hi_value) - len(self.color_lst)):
                     for c in colors:
                         if c not in self.color_lst:
                             self.color_lst.append(c)
                             break
-                warnings.warn(
+                warnings.warn(
+                    f"Warning: The length of color is less than the total number of (high values and missing), color was automatically extended to {self.color_lst}")
         if hi_var and default_color in self.color_lst:
             raise ValueError(
                 f'The current highlight colors {self.color_lst} conflict with the default color {default_color}. Please choose another default color or other highlight colors'
             )
         # Manipulating Spacing and Layout
         self.bar_width = bar_width
+        self.min_bar_width = min_bar_width
         self.space = space
         self.label_options = label_options
         self.height = height
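The colour handling in the hunk above pads a too-short highlight list from the default palette, skipping colours already in use. A minimal standalone sketch of that behaviour follows; the function name `extend_colors` and the constant `DEFAULT_COLORS` are hypothetical, and placing the warning inside the "too short" branch is an assumption, since the indentation is not recoverable from the diff.

```python
import warnings

# Palette copied from the `colors` list in the hunk above.
DEFAULT_COLORS = ["red", "green", "yellow", "lightblue", "orange", "gray",
                  "brown", "olive", "pink", "cyan", "magenta"]


def extend_colors(color_lst, hi_value, palette=DEFAULT_COLORS):
    """Return a copy of color_lst padded to one colour per highlighted value."""
    colors = list(color_lst)
    if len(colors) < len(hi_value):
        for _ in range(len(hi_value) - len(colors)):
            # Append the first palette colour that is not used yet.
            for c in palette:
                if c not in colors:
                    colors.append(c)
                    break
        # Assumption: the warning fires only when the list was too short.
        warnings.warn(f"color was automatically extended to {colors}")
    return colors


extended = extend_colors(["green"], [0, 1, 2])
```

With one user colour and three highlighted values, the list is completed with the first two unused palette entries.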
@@ -257,7 +259,7 @@ class Hammock:
                 raise ValueError(
                     f'the values: {error_values} in highlight value is not in data.'
                 )
-
+
             value_color_dict = dict(zip(self.hi_value, self.color_lst))
 
             self.data_df[self.color_coloumn_placeholder] = self.data_df[hi_var].apply(
@@ -301,7 +303,7 @@ class Hammock:
         ax, coordinates_dict = self._list_labels(ax, self.height, self.width, self.label)
 
         space = self.space * 10 if label else 0
-        bar = self.bar_width*3.5/max(data_point_numbers)
+        bar = self.bar_width * 3.5 / max(data_point_numbers)
 
         if self.shape == "parallelogram":
             figure_type = Parallelogram()
@@ -314,6 +316,8 @@ class Hammock:
             left_label = k[0]
             right_label = k[1]
             width = bar * v
+            if self.min_bar_width and width <= self.min_bar_width:
+                width = self.min_bar_width
             left_coordinate = (coordinates_dict[left_label][0] + space, coordinates_dict[left_label][1])
             right_coordinate = (coordinates_dict[right_label][0] - space, coordinates_dict[right_label][1])
             widths.append(width)
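The two added lines put a floor under the scaled box width, so that a pair of categories with very few observations is still drawn as a visible box. A minimal sketch of that clamp, using the `bar = bar_width * 3.5 / max(data_point_numbers)` scaling from the earlier hunk; the helper name `clamp_bar_width` and the sample counts are hypothetical, not part of the package.

```python
def clamp_bar_width(count, bar, min_bar_width=0.05):
    """Scale a count by `bar`, raising the result to min_bar_width if it is at or below it."""
    width = bar * count
    # A falsy min_bar_width (0 or None) disables the floor, mirroring the
    # `if self.min_bar_width and ...` guard in the diff.
    if min_bar_width and width <= min_bar_width:
        width = min_bar_width
    return width


counts = [1, 40, 400]              # hypothetical observation counts per box
bar = 1.0 * 3.5 / max(counts)      # bar_width = 1.0, the default in the signature
widths = [clamp_bar_width(c, bar) for c in counts]
```

Here only the single-observation box is raised to the floor; the larger boxes keep their proportional widths. Widths at the floor are no longer proportional to the counts, which is the trade-off this parameter makes for legibility.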
@@ -335,6 +339,8 @@ class Hammock:
                 right_label = k[1]
                 if v.get(color):
                     width_temp = bar * v.get(color)
+                    if self.min_bar_width and width_temp <= self.min_bar_width:
+                        width_temp = self.min_bar_width
                 else:
                     width_temp = 0
 
@@ -359,7 +365,7 @@ class Hammock:
 
     def _get_varname(self, x):
         return x.split(self.same_var_placeholder)[:-1][0]
-
+
     def is_float(self, element: any) -> bool:
         if element is None:
             return False
@@ -390,10 +396,10 @@ class Hammock:
 
         return var_pair_lst
 
-    def _gen_coordinate(self, start, n, edge, spacing, total_range,val_type="str"):
+    def _gen_coordinate(self, start, n, edge, spacing, total_range, val_type="str"):
         coor_lst = []
-
-        if val_type=="str":
+
+        if val_type == "str":
             for i in range(n):
                 coor_lst.append(start + i * spacing)
 
@@ -405,17 +411,17 @@ class Hammock:
             coor_lst.append(total_range + (start - edge) - edge)
         return coor_lst
 
-    def _get_same_scale_minmax(self,original_unique_value):
-        min,max = 0,0
-        for i,varname in enumerate(self.same_scale):
+    def _get_same_scale_minmax(self, original_unique_value):
+        min, max = 0, 0
+        for i, varname in enumerate(self.same_scale):
             var_type = str(self.data_df_origin[varname].dtype.name)
             if "int" in var_type or "float" in var_type:
                 min_val, max_val = original_unique_value[varname][0], original_unique_value[varname][-1]
                 if i == 0:
-                    min,max = min_val, max_val
+                    min, max = min_val, max_val
                 else:
-                    min = min_val if min_val<min else min
-                    max = max_val if max_val>max else max
+                    min = min_val if min_val < min else min
+                    max = max_val if max_val > max else max
 
             else:
                 min_val, max_val = 1, len(original_unique_value[varname])
@@ -424,7 +430,7 @@ class Hammock:
                 else:
                     min = min_val if min_val < min else min
                     max = max_val if max_val > max else max
-        return (min,max)
+        return (min, max)
 
     def _list_labels(self, ax, figsize_y, figsize_x, label):
 
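`_get_same_scale_minmax`, spread over the two hunks above, computes one shared range for all variables listed in `same_scale`: numeric variables contribute their smallest and largest value, other variables contribute positions 1 to the number of categories. The following is a simplified sketch of that idea, not the package's code: it takes a plain dictionary of sorted unique values and uses an `isinstance` check where the original inspects the pandas dtype name. The function name `same_scale_minmax` is hypothetical.

```python
from typing import Dict, List, Tuple


def same_scale_minmax(unique_values: Dict[str, List]) -> Tuple[float, float]:
    """Return the (min, max) range shared by all variables in the dict."""
    lo = hi = 0
    for i, values in enumerate(unique_values.values()):
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric variable: values are assumed sorted ascending.
            min_val, max_val = values[0], values[-1]
        else:
            # Non-numeric variable: categories are placed at 1..n.
            min_val, max_val = 1, len(values)
        if i == 0:
            lo, hi = min_val, max_val
        else:
            lo = min(lo, min_val)
            hi = max(hi, max_val)
    return lo, hi


shared = same_scale_minmax({"a": [1, 2, 3], "b": [0, 5], "c": ["w", "x", "y", "z"]})
```

The sketch avoids naming its locals `min` and `max`, which in the original shadow the Python built-ins inside the method.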
@@ -440,24 +446,26 @@ class Hammock:
         unique_value = []
         original_unique_value = {}
         varname_lst = [self._get_varname(var) for var in self.var_lst]
-
+
         for var, varname in zip(self.var_lst, varname_lst):
             unique_valnames = self.data_df[varname].dropna().unique().tolist()
             sorted_unique_valnames = []
             if self.value_order and varname in self.value_order:
                 varname_value_order_dict = self.value_order[varname]
                 sorted_unique_valnames_temp = [v for k, v in
-
+                                               sorted(varname_value_order_dict.items(), key=lambda item: item[0])]
                 for v in sorted_unique_valnames_temp:
                     if v in unique_valnames:
                         sorted_unique_valnames.append(v)
             if self.missing_data_placeholder in unique_valnames:
                 unique_valnames.remove(self.missing_data_placeholder)
-                sorted_unique_valnames = sorted(
+                sorted_unique_valnames = sorted(
+                    unique_valnames) if not sorted_unique_valnames else sorted_unique_valnames
                 original_unique_value[varname] = sorted_unique_valnames.copy()
                 sorted_unique_valnames.append(self.missing_data_placeholder)
             else:
-                sorted_unique_valnames = sorted(
+                sorted_unique_valnames = sorted(
+                    unique_valnames) if not sorted_unique_valnames else sorted_unique_valnames
                 original_unique_value[varname] = sorted_unique_valnames.copy()
             unique_value.append([(var, x) for x in sorted_unique_valnames])
 
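The hunk above orders a variable's categories by the integer keys of its `value_order` entry, keeps only names that occur in the data, and falls back to plain sorting when nothing matches. A minimal sketch of that ordering rule, assuming the same `Dict[int, str]` shape the signature declares; the function name `ordered_values` is hypothetical.

```python
from typing import Dict, List


def ordered_values(value_order: Dict[int, str], present: List[str]) -> List[str]:
    """Order category names by their integer key, dropping names absent from the data."""
    by_key = [name for _, name in sorted(value_order.items(), key=lambda item: item[0])]
    kept = [name for name in by_key if name in present]
    # Fall back to alphabetical order if value_order matched nothing.
    return kept if kept else sorted(present)


group_order = ordered_values({3: "adult", 1: "child", 2: "adolescent"},
                             ["adult", "child", "adolescent"])
```

This is the mechanism behind the README's asthma example, where `{1: "child", 2: "adolescent", 3: "adult"}` replaces the default alphabetical order.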
@@ -472,11 +480,11 @@ class Hammock:
 
         # prepare for same_scale variabels
         if self.same_scale:
-            same_scale_min,same_scale_max = self._get_same_scale_minmax(original_unique_value)
-            same_scale_range = same_scale_max-same_scale_min
+            same_scale_min, same_scale_max = self._get_same_scale_minmax(original_unique_value)
+            same_scale_range = same_scale_max - same_scale_min
 
         # plot labels for each variables
-        for var_i,(x, uni_val) in enumerate(zip(label_coordinates, unique_value)):
+        for var_i, (x, uni_val) in enumerate(zip(label_coordinates, unique_value)):
             label_num = len(uni_val) - 2 if (uni_val[0][0], self.missing_data_placeholder) in uni_val else len(
                 uni_val) - 1
             varname = varname_lst[var_i]
@@ -487,21 +495,22 @@ class Hammock:
                 temp_value_range = (y_range - 2 * edge_y_range)
                 # handle the variables in same_scale
                 if self.same_scale and varname in self.same_scale:
-                    min_val, max_val = same_scale_min,same_scale_max
+                    min_val, max_val = same_scale_min, same_scale_max
                 else:
-                    min_val,max_val = original_unique_value[varname][0],original_unique_value[varname][-1]
-                value_interval = [temp_value_range*(x_val-min_val)/(max_val-min_val) for x_val in
+                    min_val, max_val = original_unique_value[varname][0], original_unique_value[varname][-1]
+                value_interval = [temp_value_range * (x_val - min_val) / (max_val - min_val) for x_val in
+                                  original_unique_value[varname]]
                 uni_val_coordinates = self._gen_coordinate(y_start, label_num, edge_y_range,
-
+                                                           value_interval, y_range, val_type="number")
             else:
                 # handle the variables in same_scale
                 if self.same_scale and varname in self.same_scale:
                     temp_value_range = (y_range - 2 * edge_y_range)
-                    quant_val = list(range(1,len(original_unique_value[varname])+1))
+                    quant_val = list(range(1, len(original_unique_value[varname]) + 1))
                     min_val, max_val = same_scale_min, same_scale_max
                     value_interval = [temp_value_range * (x_val - min_val) / (max_val - min_val) for x_val in quant_val]
                     uni_val_coordinates = self._gen_coordinate(y_start, label_num, edge_y_range,
-
+                                                               value_interval, y_range, val_type="number")
                 else:
                     value_interval = (y_range - 2 * edge_y_range) / (label_num)
                     uni_val_coordinates = self._gen_coordinate(y_start, label_num, edge_y_range,
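The `value_interval` expression in the hunk above maps each numeric value linearly onto the usable axis range, `temp_value_range * (x_val - min_val) / (max_val - min_val)`. This is why, as the README's API table says, the values 4, 5, 7 leave twice as much space between 5 and 7 as between 4 and 5. A minimal sketch of that mapping; the function name `value_positions` and the axis range of 9.0 are hypothetical, and the input is assumed to be sorted with at least two distinct values (a single value would divide by zero).

```python
from typing import List


def value_positions(values: List[float], axis_range: float) -> List[float]:
    """Map sorted numeric values linearly onto [0, axis_range]."""
    min_val, max_val = values[0], values[-1]
    return [axis_range * (x - min_val) / (max_val - min_val) for x in values]


uneven = value_positions([4, 5, 7], 9.0)   # gaps of 3.0 and 6.0
even = value_positions([4, 5, 6], 10.0)    # gaps of 5.0 and 5.0
```

When a variable is listed in `same_scale`, the diff substitutes the shared minimum and maximum for the variable's own, so the same value lands at the same height on every such axis.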
@@ -525,8 +534,6 @@ class Hammock:
                 ax.text(x, y, val[1], ha='center', va='center')
                 coordinates_dict[val] = (x, y)
 
-
         return ax, coordinates_dict
 
 
-
@@ -0,0 +1,179 @@
|
|
|
1
|
+
Metadata-Version: 2.1
|
|
2
|
+
Name: hammock-plot
|
|
3
|
+
Version: 0.3
|
|
4
|
+
Summary: Hammock - visualization of categorical or mixed categorical/continuous data
|
|
5
|
+
Home-page: https://github.com/TianchengY/hammock_plot
|
|
6
|
+
Author: Tiancheng Yang
|
|
7
|
+
Author-email: t77yang@uwaterloo.ca
|
|
8
|
+
Classifier: Programming Language :: Python :: 3
|
|
9
|
+
Classifier: License :: OSI Approved :: MIT License
|
|
10
|
+
Classifier: Operating System :: OS Independent
|
|
11
|
+
Classifier: Intended Audience :: Science/Research
|
|
12
|
+
Requires-Python: >=3.6
|
|
13
|
+
Description-Content-Type: text/markdown
|
|
14
|
+
Requires-Dist: matplotlib
|
|
15
|
+
Requires-Dist: numpy
|
|
16
|
+
Requires-Dist: pandas
|
|
17
|
+
|
|
18
|
+
# Hammock plot
|
|
19
|
+
|
|
20
|
+
|
|
21
|
+
## Description
|
|
22
|
+
|
|
23
|
+
hammock draws a graph to visualize categorical data - though it also does fine with continuous data.
|
|
24
|
+
Variables are lined up parallel to the vertical axis. Categories within a variable are spread out along a
|
|
25
|
+
vertical line. Categories of adjacent variables are connected by boxes. (The boxes are parallelograms; we
|
|
26
|
+
use boxes for brevity). The "width" of a box is proportional to the number of observations that correspond
|
|
27
|
+
to that box (i.e. have the same values/categories for the two variables). The "width" of a box refers to the
|
|
28
|
+
distance between the longer set of parallel lines rather than the vertical distance.
|
|
29
|
+
|
|
30
|
+
If the boxes degenerate to a single line, and no labels or missing values are used the hammock plot
|
|
31
|
+
corresponds to a parallel coordinate plot. Boxes degenerate into a single line if barwidth is so small that
|
|
32
|
+
the boxes for categorical variables appear to be a single line. For continuous variables boxes will usually
|
|
33
|
+
appear to be a single line because each category typically only contains one observation.
|
|
34
|
+
|
|
35
|
+
The order of variables in varlist determines the order of variables in the graph. All variables in varlist
|
|
36
|
+
must be numerical. String variables should be converted to numerical variables first, e.g. using encode or
|
|
37
|
+
destring.
|
|
38
|
+
|
|
39
|
+
|
|
40
|
+
|
|
41
|
+
|
|
42
|
+
## Getting started
|
|
43
|
+
|
|
44
|
+
You can install hammock from `pip`:
|
|
45
|
+
|
|
46
|
+
```shell
|
|
47
|
+
pip install hammock_plot
|
|
48
|
+
```
|
|
49
|
+
|
|
50
|
+
|
|
51
|
+
### Example: Asthma data
|
|
52
|
+
|
|
53
|
+
We import the diabetes dataset:
|
|
54
|
+
|
|
55
|
+
```python
|
|
56
|
+
import hammock_plot
|
|
57
|
+
import pandas as pd
|
|
58
|
+
df = pd.read_csv('../examples/asthma/asth_all3_for_python.csv')
|
|
59
|
+
```
|
|
60
|
+
|
|
61
|
+
Minimal example of a hammock plot:
|
|
62
|
+
```python
|
|
63
|
+
var = ["hospitalizations","group","gender","comorbidities"]
|
|
64
|
+
hammock = hammock_plot.Hammock(data_df = df)
|
|
65
|
+
ax = hammock.plot(var=var)
|
|
66
|
+
```
|
|
67
|
+
|
|
68
|
+
The ordering of the child-adolescent-adult variable is not in the desired order; adult should not be in the middle. We now specify a specific order, child-adolescent-adult.
|
|
69
|
+
|
|
70
|
+
```python
|
|
71
|
+
var = ["hospitalizations","group","gender","comorbidities"]
|
|
72
|
+
group_dict= {1: "child", 2: "adolescent",3: "adult"}
|
|
73
|
+
value_order = {"group": group_dict}
|
|
74
|
+
hammock = hammock_plot.Hammock(data_df = df)
|
|
75
|
+
ax = hammock.plot(var=var, value_order=value_order )
|
|
76
|
+
```
|
|
77
|
+
|
|
78
|
+
We highlight observations with comorbidities=0 in red:
|
|
79
|
+
|
|
80
|
+
```python
|
|
81
|
+
ax = hammock.plot(var=var, value_order=value_order ,hi_var="comorbidities", hi_value=[0], color=["red"])
|
|
82
|
+
```
|
|
83
|
+
|
|
84
|
+
|
|
85
|
+
### Example Satisfaction scales for the diabetes data
|
|
86
|
+
|
|
87
|
+
We import the diabetes dataset:
|
|
88
|
+
|
|
89
|
+
```python
|
|
90
|
+
import hammock_plot
|
|
91
|
+
import pandas as pd
|
|
92
|
+
df = pd.read_csv('../examples/diabetes_outlier/diabetes_for_python.csv')
|
|
93
|
+
```
|
|
94
|
+
|
|
95
|
+
The three variables represent different ordinal scales for satisfaction. We are checking for missing values:
|
|
96
|
+
```python
|
|
97
|
+
var = ["sataces","satcomm","satrate"]
|
|
98
|
+
hammock = hammock_plot.Hammock(data_df = df)
|
|
99
|
+
ax = hammock.plot(var=var, default_color="blue", missing=True)
|
|
100
|
+
```
|
|
101
|
+
|
|
102
|
+
The missing value category is shown at the bottom for each variable. We find missing values for all 3 variables, but fewest for the last one. We also see a phenomenon called "top coding", where
|
|
103
|
+
satisfied respondents simply choose the highest value.
|
|
104
|
+
|
|
105
|
+
|
|
106
|
+
|
|
107
|
+
## API Reference
|
|
108
|
+
|
|
109
|
+
```
|
|
110
|
+
hammock()
|
|
111
|
+
```
|
|
112
|
+
|
|
113
|
+
| Category | Parameter | Type | Description |
|
|
114
|
+
| --- | :-------- | :------- | :------------------------- |
|
|
115
|
+
| General | `var` | `List[str]` | List of variables to display. |
|
|
116
|
+
| | `value_order` | `Dict[str, Dict[int, str]]` | If specified, the order of the values in the plot follows the order of values in the list supplied in the dictionary. A specific value order is useful, for example, for ordered variables. The integer values affect spacing: for example the values 4,5,6 imply equal spacing between 4,5 and 5,6. The values 4,5,7 implies twice as much space between 5,7 as between 4,5.
|
|
117
|
+
| | `missing` | `bool` | Whether or not to add a category for missing values at the bottom of the plot. If False, observations that have a missing value for any variable in the data frame (even those not used in the hammock plot) are removed. Default is False. |
|
|
118
|
+
| | `label` | `bool` | Whether or not to display labels between the plotting segments |
|
|
119
|
+
| Highlighting | `hi_var` | `str` | Variable to be highlighted. Default is none. |
|
|
120
|
+
| | `hi_value` | `List[str or int]` | List of values of `hi_var` to be highlighted. You can highlighted one or multiple values. |
|
|
121
|
+
| | `hi_missing` | `bool` | Whether or not missing values for `hi_var` should be highlighted. |
|
|
122
|
+
| | `color` | `List[str]` | List of colors corresponding to the list of values to be highlighted. Default highlight color list is ["red", "green", "yellow", "lightblue", "orange", "gray", "brown", "olive", "pink", "cyan", "magenta"] |
|
|
123
|
+
| | `default_color` | `str` | Default color of plotting elements for boxes that are not highlighted. Default is "blue" |
|
|
124
|
+
| Manipulating Spacing and Layout | `bar_width` | `float` | Factor by which the default width is increased or reduced. This allows reducing visual clutter. Default is 1.0. |
|
|
125
|
+
| | `space` | `float` | Space left for the labels between the plotting elements. Default is 0.5 |
|
|
126
|
+
| | `label_options` | `Dict[str, Dict[str, Any]]` | Manipulates the size and look of the labels. Args following the options in the website: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.text.html Example:{"ExampleVarname":{"fontsize":12,"fontstyle":"italic","fontweight":"black","color":"b"}} Default is None. |
|
|
127
|
+
| | `height` | `float` | Height of the plot in inches. Default is 10. |
|
|
128
|
+
| | `width` | `float` | Width of the plot in inches. Default is 15. Caution: Width too narrow may distort the plot. |
|
|
129
|
+
| Other options | `shape` | `str` | Shape of the boxes. "rectangle" (default) or "parallelogram". |
|
|
130
|
+
| | `same_scale` | `List[str]` | List of variables that have the same scale. Default is None. |
|
|
131
|
+
| | `display_figure` | `bool` | Whether or not to display the figure. Setting it to False can be useful if you only want to save the plot. Default is True. |
|
|
132
|
+
| | `save_path` | `str` | If it is not None, the figure will be saved to the given path with given name and format. Default is None. |
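
As a rough illustration of the `value_order` and `label_options` shapes described in the table, the sketch below builds both dictionaries in plain Python. The helper `relative_gaps` and the labels "low", "mid" and "high" are hypothetical and not part of hammock_plot; the helper only mirrors the spacing rule stated for `value_order`. The variable name `group` and its labels are taken from the asthma example, and the commented-out call assumes the `Hammock(data_df=...)` / `plot(...)` pattern shown in the README.

```python
from typing import Any, Dict, List

# value_order: variable name -> {integer position: label}.
value_order: Dict[str, Dict[int, str]] = {
    "group": {1: "child", 2: "adolescent", 3: "adult"},
}

# label_options: variable name -> keyword arguments for matplotlib text.
label_options: Dict[str, Dict[str, Any]] = {
    "group": {"fontsize": 12, "fontstyle": "italic", "color": "b"},
}


def relative_gaps(order: Dict[int, str]) -> List[int]:
    """Hypothetical helper: gaps between consecutive integer keys."""
    keys = sorted(order)
    return [later - earlier for earlier, later in zip(keys, keys[1:])]


# Keys 4,5,6 are equally spaced; keys 4,5,7 leave twice the gap between 5 and 7.
print(relative_gaps({4: "low", 5: "mid", 6: "high"}))  # [1, 1]
print(relative_gaps({4: "low", 5: "mid", 7: "high"}))  # [1, 2]

# Assumed usage, following the call pattern shown in the README:
# hammock = hammock_plot.Hammock(data_df=df)
# ax = hammock.plot(var=["group"], value_order=value_order,
#                   label_options=label_options)
```

The dictionaries are kept separate from the plotting call so they can be checked or reused before any figure is drawn.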
|
|
133
|
+
|
|
134
|
+
|
|
135
|
+
## Historical context
|
|
136
|
+
|
|
137
|
+
In 1898, Sankey diagrams were developed to visualize flows of energy and materials.
|
|
138
|
+
|
|
139
|
+
In 1985, Inselberg popularized parallel coordinates to visualize continuous variables only. The central contribution is the use of parallel axes.
|
|
140
|
+
|
|
141
|
+
In 2003, Schonlau proposed the hammock plot. This was the first plot to visualize categorical data (or mixed categorical continuous data) on parallel axes.
|
|
142
|
+
|
|
143
|
+
In 2010, Rosvall proposed alluvial plots to visualize network variables over time. Rather than using bars to connect axes, alluvial plots use rounded curves. Alluvial plots are now also used to visualize categorical data.
|
|
144
|
+
|
|
145
|
+
There are several additional variations that also visualize categorical data, including Parallel Set plots (Bendix et al., 2005) and Common Angle plots (Hofmann and Vendettuoli, 2013).
|
|
146
|
+
|
|
147
|
+
### References
|
|
148
|
+
Bendix, F., Kosara, R., & Hauser, H. (2005). Parallel sets: visual analysis of categorical data. In IEEE Symposium on Information Visualization, 2005. INFOVIS 2005. 133-140.
|
|
149
|
+
|
|
150
|
+
Hofmann, H., & Vendettuoli, M. (2013). Common angle plots as perception-true visualizations of categorical associations. IEEE transactions on visualization and computer graphics, 19(12), 2297-2305.
|
|
151
|
+
|
|
152
|
+
Inselberg, A., & Dimsdale, B. (2009). Parallel coordinates. Human-Machine Interactive Systems, 199-233.
|
|
153
|
+
|
|
154
|
+
Rosvall, M., & Bergstrom, C. T. (2010). Mapping change in large networks. PLoS ONE, 5(1), e8694.
|
|
155
|
+
|
|
156
|
+
Sankey, H. (1898). Introductory note on the thermal efficiency of steam-engines. report of
|
|
157
|
+
the committee appointed on the 31st march, 1896, to consider and report to the council
|
|
158
|
+
upon the subject of the definition of a standard or standards of thermal efficiency for
|
|
159
|
+
steam-engines: With an introductory note. In Minutes of proceedings of the institution
|
|
160
|
+
of civil engineers, Volume 134, pp. 278–283.
|
|
161
|
+
|
|
162
|
+
Schonlau M.
|
|
163
|
+
*[Visualizing Categorical Data Arising in the Health Sciences Using Hammock Plots.](http://www.schonlau.net/publication/03jsm_hammockplot.pdf)*
|
|
164
|
+
In Proceedings of the Section on Statistical Graphics, American Statistical Association; 2003
|
|
165
|
+
|
|
166
|
+
|
|
167
|
+
|
|
168
|
+
### Other implementations of the hammock plot
|
|
169
|
+
There is also a Stata implementation `hammock` (available from the Stata archive SSC) and an R implementation as part of the package `ggparallel`.
|
|
170
|
+
|
|
171
|
+
[MIT License](https://choosealicense.com/licenses/mit/)
|
|
172
|
+
|
|
173
|
+
|
|
174
|
+
|
|
175
|
+
## Authors
|
|
176
|
+
|
|
177
|
+
- Tiancheng Yang t77yang@uwaterloo.ca
|
|
178
|
+
|
|
179
|
+
|
|
@@ -6,7 +6,7 @@ with open("README.md", "r", encoding="utf8") as fh:
|
|
|
6
6
|
|
|
7
7
|
setuptools.setup(
|
|
8
8
|
name="hammock_plot",
|
|
9
|
-
version='0.
|
|
9
|
+
version='0.3',
|
|
10
10
|
author="Tiancheng Yang",
|
|
11
11
|
author_email="t77yang@uwaterloo.ca",
|
|
12
12
|
description="Hammock - visualization of categorical or mixed categorical/continuous data",
|
hammock_plot-0.2/PKG-INFO
DELETED
|
@@ -1,178 +0,0 @@
|
|
|
1
|
-
Metadata-Version: 2.1
|
|
2
|
-
Name: hammock_plot
|
|
3
|
-
Version: 0.2
|
|
4
|
-
Summary: Hammock - visualization of categorical or mixed categorical/continuous data
|
|
5
|
-
Home-page: https://github.com/TianchengY/hammock_plot
|
|
6
|
-
Author: Tiancheng Yang
|
|
7
|
-
Author-email: t77yang@uwaterloo.ca
|
|
8
|
-
License: UNKNOWN
|
|
9
|
-
Description: # Hammock plot
|
|
10
|
-
|
|
11
|
-
|
|
12
|
-
## Description
|
|
13
|
-
|
|
14
|
-
hammock draws a graph to visualize categorical data - though it also does fine with continuous data.
|
|
15
|
-
Variables are lined up parallel to the vertical axis. Categories within a variable are spread out along a
|
|
16
|
-
vertical line. Categories of adjacent variables are connected by boxes. (The boxes are parallelograms; we
|
|
17
|
-
use boxes for brevity). The "width" of a box is proportional to the number of observations that correspond
|
|
18
|
-
to that box (i.e. have the same values/categories for the two variables). The "width" of a box refers to the
|
|
19
|
-
distance between the longer set of parallel lines rather than the vertical distance.
|
|
20
|
-
|
|
21
|
-
If the boxes degenerate to a single line, and no labels or missing values are used the hammock plot
|
|
22
|
-
corresponds to a parallel coordinate plot. Boxes degenerate into a single line if barwidth is so small that
|
|
23
|
-
the boxes for categorical variables appear to be a single line. For continuous variables boxes will usually
|
|
24
|
-
appear to be a single line because each category typically only contains one observation.
|
|
25
|
-
|
|
26
|
-
The order of variables in varlist determines the order of variables in the graph. All variables in varlist
|
|
27
|
-
must be numerical. String variables should be converted to numerical variables first, e.g. using encode or
|
|
28
|
-
destring.
|
|
29
|
-
|
|
30
|
-
|
|
31
|
-
|
|
32
|
-
|
|
33
|
-
## Getting started
|
|
34
|
-
|
|
35
|
-
You can install hammock from `pip`:
|
|
36
|
-
|
|
37
|
-
```shell
|
|
38
|
-
pip install hammock_plot
|
|
39
|
-
```
|
|
40
|
-
|
|
41
|
-
|
|
42
|
-
### Example: Asthma data
|
|
43
|
-
|
|
44
|
-
We import the asthma dataset:
|
|
45
|
-
|
|
46
|
-
```python
|
|
47
|
-
import hammock_plot
|
|
48
|
-
import pandas as pd
|
|
49
|
-
df = pd.read_csv('../examples/asthma/asth_all3_for_python.csv')
|
|
50
|
-
```
|
|
51
|
-
|
|
52
|
-
Minimal example of a hammock plot:
|
|
53
|
-
```python
|
|
54
|
-
var = ["hospitalizations","group","gender","comorbidities"]
|
|
55
|
-
hammock = hammock_plot.Hammock(data_df = df)
|
|
56
|
-
ax = hammock.plot(var=var)
|
|
57
|
-
```
|
|
58
|
-
|
|
59
|
-
The values of the `group` variable (child, adolescent, adult) are not in the desired order; adult should not be in the middle. We now specify the order child-adolescent-adult.
|
|
60
|
-
|
|
61
|
-
```python
|
|
62
|
-
var = ["hospitalizations","group","gender","comorbidities"]
|
|
63
|
-
group_dict= {1: "child", 2: "adolescent",3: "adult"}
|
|
64
|
-
value_order = {"group": group_dict}
|
|
65
|
-
hammock = hammock_plot.Hammock(data_df = df)
|
|
66
|
-
ax = hammock.plot(var=var, value_order=value_order )
|
|
67
|
-
```
|
|
68
|
-
|
|
69
|
-
We highlight observations with comorbidities=0 in red:
|
|
70
|
-
|
|
71
|
-
```python
|
|
72
|
-
ax = hammock.plot(var=var, value_order=value_order ,hi_var="comorbidities", hi_value=[0], color=["red"])
|
|
73
|
-
```
|
|
74
|
-
|
|
75
|
-
|
|
76
|
-
### Example: Satisfaction scales for the diabetes data
|
|
77
|
-
|
|
78
|
-
We import the diabetes dataset:
|
|
79
|
-
|
|
80
|
-
```python
|
|
81
|
-
import hammock_plot
|
|
82
|
-
import pandas as pd
|
|
83
|
-
df = pd.read_csv('../examples/diabetes_outlier/diabetes_for_python.csv')
|
|
84
|
-
```
|
|
85
|
-
|
|
86
|
-
The three variables represent different ordinal scales for satisfaction. We are checking for missing values:
|
|
87
|
-
```python
|
|
88
|
-
var = ["sataces","satcomm","satrate"]
|
|
89
|
-
hammock = hammock_plot.Hammock(data_df = df)
|
|
90
|
-
ax = hammock.plot(var=var, default_color="blue", missing=True)
|
|
91
|
-
```
|
|
92
|
-
|
|
93
|
-
The missing value category is shown at the bottom for each variable. We find missing values for all 3 variables, but fewest for the last one. We also see a phenomenon called "top coding", where
|
|
94
|
-
satisfied respondents simply choose the highest value.
|
|
95
|
-
|
|
96
|
-
|
|
97
|
-
|
|
98
|
-
## API Reference
|
|
99
|
-
|
|
100
|
-
```
|
|
101
|
-
hammock()
|
|
102
|
-
```
|
|
103
|
-
|
|
104
|
-
| Category | Parameter | Type | Description |
|
|
105
|
-
| --- | :-------- | :------- | :------------------------- |
|
|
106
|
-
| General | `var` | `List[str]` | List of variables to display. |
|
|
107
|
-
| | `value_order` | `Dict[str, Dict[int, str]]` | If specified, the order of the values in the plot follows the order of the values supplied in the dictionary. A specific value order is useful, for example, for ordered variables. The integer values affect spacing: for example, the values 4,5,6 imply equal spacing between 4,5 and 5,6, whereas the values 4,5,7 imply twice as much space between 5,7 as between 4,5. |
|
|
108
|
-
| | `missing` | `bool` | Whether or not to add a category for missing values at the bottom of the plot. If False, observations that have a missing value for any variable in the data frame (even those not used in the hammock plot) are removed. Default is False. |
|
|
109
|
-
| | `label` | `bool` | Whether or not to display labels between the plotting segments |
|
|
110
|
-
| Highlighting | `hi_var` | `str` | Variable to be highlighted. Default is none. |
|
|
111
|
-
| | `hi_value` | `List[str or int]` | List of values of `hi_var` to be highlighted. You can highlight one or multiple values. |
|
|
112
|
-
| | `hi_missing` | `bool` | Whether or not missing values for `hi_var` should be highlighted. |
|
|
113
|
-
| | `color` | `List[str]` | List of colors corresponding to the list of values to be highlighted. Default highlight color list is ["red", "green", "yellow", "lightblue", "orange", "gray", "brown", "olive", "pink", "cyan", "magenta"] |
|
|
114
|
-
| | `default_color` | `str` | Default color of plotting elements for boxes that are not highlighted. Default is "blue" |
|
|
115
|
-
| Manipulating Spacing and Layout | `bar_width` | `float` | Factor by which the default width is increased or reduced. This allows reducing visual clutter. Default is 1.0. |
|
|
116
|
-
| | `space` | `float` | Space left for the labels between the plotting elements. Default is 0.5 |
|
|
117
|
-
| | `label_options` | `Dict[str, Dict[str, Any]]` | Controls the size and look of the labels. Args follow the options documented for `matplotlib.pyplot.text` (https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.text.html). Example: `{"ExampleVarname":{"fontsize":12,"fontstyle":"italic","fontweight":"black","color":"b"}}`. Default is None. |
|
|
118
|
-
| | `height` | `float` | Height of the plot in inches. Default is 10. |
|
|
119
|
-
| | `width` | `float` | Width of the plot in inches. Default is 15. Caution: Width too narrow may distort the plot. |
|
|
120
|
-
| Other options | `shape` | `str` | Shape of the boxes. "rectangle" (default) or "parallelogram". |
|
|
121
|
-
| | `same_scale` | `List[str]` | List of variables that have the same scale. Default is None. |
|
|
122
|
-
| | `display_figure` | `bool` | Whether or not to display the figure. Setting it to False can be useful if you only want to save the plot. Default is True. |
|
|
123
|
-
| | `save_path` | `str` | If it is not None, the figure will be saved to the given path with given name and format. Default is None. |
|
|
124
|
-
|
|
125
|
-
|
|
126
|
-
## Historical context
|
|
127
|
-
|
|
128
|
-
In 1898, Sankey diagrams were developed to visualize flows of energy and materials.
|
|
129
|
-
|
|
130
|
-
In 1985, Inselberg popularized parallel coordinates to visualize continuous variables only. The central contribution is the use of parallel axes.
|
|
131
|
-
|
|
132
|
-
In 2003, Schonlau proposed the hammock plot. This was the first plot to visualize categorical data (or mixed categorical continuous data) on parallel axes.
|
|
133
|
-
|
|
134
|
-
In 2010, Rosvall proposed alluvial plots to visualize network variables over time. Rather than using bars to connect axes, alluvial plots use rounded curves. Alluvial plots are now also used to visualize categorical data.
|
|
135
|
-
|
|
136
|
-
There are several additional variations that also visualize categorical data, including Parallel Set plots (Bendix et al., 2005) and Common Angle plots (Hofmann and Vendettuoli, 2013).
|
|
137
|
-
|
|
138
|
-
### References
|
|
139
|
-
Bendix, F., Kosara, R., & Hauser, H. (2005). Parallel sets: visual analysis of categorical data. In IEEE Symposium on Information Visualization, 2005. INFOVIS 2005. 133-140.
|
|
140
|
-
|
|
141
|
-
Hofmann, H., & Vendettuoli, M. (2013). Common angle plots as perception-true visualizations of categorical associations. IEEE transactions on visualization and computer graphics, 19(12), 2297-2305.
|
|
142
|
-
|
|
143
|
-
Inselberg, A., & Dimsdale, B. (2009). Parallel coordinates. Human-Machine Interactive Systems, 199-233.
|
|
144
|
-
|
|
145
|
-
Rosvall, M., & Bergstrom, C. T. (2010). Mapping change in large networks. PLoS ONE, 5(1), e8694.
|
|
146
|
-
|
|
147
|
-
Sankey, H. (1898). Introductory note on the thermal efficiency of steam-engines. report of
|
|
148
|
-
the committee appointed on the 31st march, 1896, to consider and report to the council
|
|
149
|
-
upon the subject of the definition of a standard or standards of thermal efficiency for
|
|
150
|
-
steam-engines: With an introductory note. In Minutes of proceedings of the institution
|
|
151
|
-
of civil engineers, Volume 134, pp. 278–283.
|
|
152
|
-
|
|
153
|
-
Schonlau M.
|
|
154
|
-
*[Visualizing Categorical Data Arising in the Health Sciences Using Hammock Plots.](http://www.schonlau.net/publication/03jsm_hammockplot.pdf)*
|
|
155
|
-
In Proceedings of the Section on Statistical Graphics, American Statistical Association; 2003
|
|
156
|
-
|
|
157
|
-
|
|
158
|
-
|
|
159
|
-
### Other implementations of the hammock plot
|
|
160
|
-
There is also a Stata implementation `hammock` (available from the Stata archive SSC) and an R implementation as part of the package `ggparallel`.
|
|
161
|
-
|
|
162
|
-
[MIT License](https://choosealicense.com/licenses/mit/)
|
|
163
|
-
|
|
164
|
-
|
|
165
|
-
|
|
166
|
-
## Authors
|
|
167
|
-
|
|
168
|
-
- Tiancheng Yang t77yang@uwaterloo.ca
|
|
169
|
-
|
|
170
|
-
|
|
171
|
-
|
|
172
|
-
Platform: UNKNOWN
|
|
173
|
-
Classifier: Programming Language :: Python :: 3
|
|
174
|
-
Classifier: License :: OSI Approved :: MIT License
|
|
175
|
-
Classifier: Operating System :: OS Independent
|
|
176
|
-
Classifier: Intended Audience :: Science/Research
|
|
177
|
-
Requires-Python: >=3.6
|
|
178
|
-
Description-Content-Type: text/markdown
|
|
@@ -1,178 +0,0 @@
|
|
|
1
|
-
Metadata-Version: 2.1
|
|
2
|
-
Name: hammock-plot
|
|
3
|
-
Version: 0.2
|
|
4
|
-
Summary: Hammock - visualization of categorical or mixed categorical/continuous data
|
|
5
|
-
Home-page: https://github.com/TianchengY/hammock_plot
|
|
6
|
-
Author: Tiancheng Yang
|
|
7
|
-
Author-email: t77yang@uwaterloo.ca
|
|
8
|
-
License: UNKNOWN
|
|
9
|
-
Description: # Hammock plot
|
|
10
|
-
|
|
11
|
-
|
|
12
|
-
## Description
|
|
13
|
-
|
|
14
|
-
hammock draws a graph to visualize categorical data - though it also does fine with continuous data.
|
|
15
|
-
Variables are lined up parallel to the vertical axis. Categories within a variable are spread out along a
|
|
16
|
-
vertical line. Categories of adjacent variables are connected by boxes. (The boxes are parallelograms; we
|
|
17
|
-
use boxes for brevity). The "width" of a box is proportional to the number of observations that correspond
|
|
18
|
-
to that box (i.e. have the same values/categories for the two variables). The "width" of a box refers to the
|
|
19
|
-
distance between the longer set of parallel lines rather than the vertical distance.
|
|
20
|
-
|
|
21
|
-
If the boxes degenerate to a single line, and no labels or missing values are used the hammock plot
|
|
22
|
-
corresponds to a parallel coordinate plot. Boxes degenerate into a single line if barwidth is so small that
|
|
23
|
-
the boxes for categorical variables appear to be a single line. For continuous variables boxes will usually
|
|
24
|
-
appear to be a single line because each category typically only contains one observation.
|
|
25
|
-
|
|
26
|
-
The order of variables in varlist determines the order of variables in the graph. All variables in varlist
|
|
27
|
-
must be numerical. String variables should be converted to numerical variables first, e.g. using encode or
|
|
28
|
-
destring.
|
|
29
|
-
|
|
30
|
-
|
|
31
|
-
|
|
32
|
-
|
|
33
|
-
## Getting started
|
|
34
|
-
|
|
35
|
-
You can install hammock from `pip`:
|
|
36
|
-
|
|
37
|
-
```shell
|
|
38
|
-
pip install hammock_plot
|
|
39
|
-
```
|
|
40
|
-
|
|
41
|
-
|
|
42
|
-
### Example: Asthma data
|
|
43
|
-
|
|
44
|
-
We import the asthma dataset:
|
|
45
|
-
|
|
46
|
-
```python
|
|
47
|
-
import hammock_plot
|
|
48
|
-
import pandas as pd
|
|
49
|
-
df = pd.read_csv('../examples/asthma/asth_all3_for_python.csv')
|
|
50
|
-
```
|
|
51
|
-
|
|
52
|
-
Minimal example of a hammock plot:
|
|
53
|
-
```python
|
|
54
|
-
var = ["hospitalizations","group","gender","comorbidities"]
|
|
55
|
-
hammock = hammock_plot.Hammock(data_df = df)
|
|
56
|
-
ax = hammock.plot(var=var)
|
|
57
|
-
```
|
|
58
|
-
|
|
59
|
-
The values of the `group` variable (child, adolescent, adult) are not in the desired order; adult should not be in the middle. We now specify the order child-adolescent-adult.
|
|
60
|
-
|
|
61
|
-
```python
|
|
62
|
-
var = ["hospitalizations","group","gender","comorbidities"]
|
|
63
|
-
group_dict= {1: "child", 2: "adolescent",3: "adult"}
|
|
64
|
-
value_order = {"group": group_dict}
|
|
65
|
-
hammock = hammock_plot.Hammock(data_df = df)
|
|
66
|
-
ax = hammock.plot(var=var, value_order=value_order )
|
|
67
|
-
```
|
|
68
|
-
|
|
69
|
-
We highlight observations with comorbidities=0 in red:
|
|
70
|
-
|
|
71
|
-
```python
|
|
72
|
-
ax = hammock.plot(var=var, value_order=value_order ,hi_var="comorbidities", hi_value=[0], color=["red"])
|
|
73
|
-
```
|
|
74
|
-
|
|
75
|
-
|
|
76
|
-
### Example: Satisfaction scales for the diabetes data
|
|
77
|
-
|
|
78
|
-
We import the diabetes dataset:
|
|
79
|
-
|
|
80
|
-
```python
|
|
81
|
-
import hammock_plot
|
|
82
|
-
import pandas as pd
|
|
83
|
-
df = pd.read_csv('../examples/diabetes_outlier/diabetes_for_python.csv')
|
|
84
|
-
```
|
|
85
|
-
|
|
86
|
-
The three variables represent different ordinal scales for satisfaction. We are checking for missing values:
|
|
87
|
-
```python
|
|
88
|
-
var = ["sataces","satcomm","satrate"]
|
|
89
|
-
hammock = hammock_plot.Hammock(data_df = df)
|
|
90
|
-
ax = hammock.plot(var=var, default_color="blue", missing=True)
|
|
91
|
-
```
|
|
92
|
-
|
|
93
|
-
The missing value category is shown at the bottom for each variable. We find missing values for all 3 variables, but fewest for the last one. We also see a phenomenon called "top coding", where
|
|
94
|
-
satisfied respondents simply choose the highest value.
|
|
95
|
-
|
|
96
|
-
|
|
97
|
-
|
|
98
|
-
## API Reference
|
|
99
|
-
|
|
100
|
-
```
|
|
101
|
-
hammock()
|
|
102
|
-
```
|
|
103
|
-
|
|
104
|
-
| Category | Parameter | Type | Description |
|
|
105
|
-
| --- | :-------- | :------- | :------------------------- |
|
|
106
|
-
| General | `var` | `List[str]` | List of variables to display. |
|
|
107
|
-
| | `value_order` | `Dict[str, Dict[int, str]]` | If specified, the order of the values in the plot follows the order of the values supplied in the dictionary. A specific value order is useful, for example, for ordered variables. The integer values affect spacing: for example, the values 4,5,6 imply equal spacing between 4,5 and 5,6, whereas the values 4,5,7 imply twice as much space between 5,7 as between 4,5. |
|
|
108
|
-
| | `missing` | `bool` | Whether or not to add a category for missing values at the bottom of the plot. If False, observations that have a missing value for any variable in the data frame (even those not used in the hammock plot) are removed. Default is False. |
|
|
109
|
-
| | `label` | `bool` | Whether or not to display labels between the plotting segments |
|
|
110
|
-
| Highlighting | `hi_var` | `str` | Variable to be highlighted. Default is none. |
|
|
111
|
-
| | `hi_value` | `List[str or int]` | List of values of `hi_var` to be highlighted. You can highlight one or multiple values. |
|
|
112
|
-
| | `hi_missing` | `bool` | Whether or not missing values for `hi_var` should be highlighted. |
|
|
113
|
-
| | `color` | `List[str]` | List of colors corresponding to the list of values to be highlighted. Default highlight color list is ["red", "green", "yellow", "lightblue", "orange", "gray", "brown", "olive", "pink", "cyan", "magenta"] |
|
|
114
|
-
| | `default_color` | `str` | Default color of plotting elements for boxes that are not highlighted. Default is "blue" |
|
|
115
|
-
| Manipulating Spacing and Layout | `bar_width` | `float` | Factor by which the default width is increased or reduced. This allows reducing visual clutter. Default is 1.0. |
|
|
116
|
-
| | `space` | `float` | Space left for the labels between the plotting elements. Default is 0.5 |
|
|
117
|
-
| | `label_options` | `Dict[str, Dict[str, Any]]` | Controls the size and look of the labels. Args follow the options documented for `matplotlib.pyplot.text` (https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.text.html). Example: `{"ExampleVarname":{"fontsize":12,"fontstyle":"italic","fontweight":"black","color":"b"}}`. Default is None. |
|
|
118
|
-
| | `height` | `float` | Height of the plot in inches. Default is 10. |
|
|
119
|
-
| | `width` | `float` | Width of the plot in inches. Default is 15. Caution: Width too narrow may distort the plot. |
|
|
120
|
-
| Other options | `shape` | `str` | Shape of the boxes. "rectangle" (default) or "parallelogram". |
|
|
121
|
-
| | `same_scale` | `List[str]` | List of variables that have the same scale. Default is None. |
|
|
122
|
-
| | `display_figure` | `bool` | Whether or not to display the figure. Setting it to False can be useful if you only want to save the plot. Default is True. |
|
|
123
|
-
| | `save_path` | `str` | If it is not None, the figure will be saved to the given path with given name and format. Default is None. |
|
|
124
|
-
|
|
125
|
-
|
|
126
|
-
## Historical context
|
|
127
|
-
|
|
128
|
-
In 1898, Sankey diagrams were developed to visualize flows of energy and materials.
|
|
129
|
-
|
|
130
|
-
In 1985, Inselberg popularized parallel coordinates to visualize continuous variables only. The central contribution is the use of parallel axes.
|
|
131
|
-
|
|
132
|
-
In 2003, Schonlau proposed the hammock plot. This was the first plot to visualize categorical data (or mixed categorical continuous data) on parallel axes.
|
|
133
|
-
|
|
134
|
-
In 2010, Rosvall proposed alluvial plots to visualize network variables over time. Rather than using bars to connect axes, alluvial plots use rounded curves. Alluvial plots are now also used to visualize categorical data.
|
|
135
|
-
|
|
136
|
-
There are several additional variations that also visualize categorical data, including Parallel Set plots (Bendix et al., 2005) and Common Angle plots (Hofmann and Vendettuoli, 2013).
|
|
137
|
-
|
|
138
|
-
### References
|
|
139
|
-
Bendix, F., Kosara, R., & Hauser, H. (2005). Parallel sets: visual analysis of categorical data. In IEEE Symposium on Information Visualization, 2005. INFOVIS 2005. 133-140.
|
|
140
|
-
|
|
141
|
-
Hofmann, H., & Vendettuoli, M. (2013). Common angle plots as perception-true visualizations of categorical associations. IEEE transactions on visualization and computer graphics, 19(12), 2297-2305.
|
|
142
|
-
|
|
143
|
-
Inselberg, A., & Dimsdale, B. (2009). Parallel coordinates. Human-Machine Interactive Systems, 199-233.
|
|
144
|
-
|
|
145
|
-
Rosvall, M., & Bergstrom, C. T. (2010). Mapping change in large networks. PLoS ONE, 5(1), e8694.
|
|
146
|
-
|
|
147
|
-
Sankey, H. (1898). Introductory note on the thermal efficiency of steam-engines. report of
|
|
148
|
-
the committee appointed on the 31st march, 1896, to consider and report to the council
|
|
149
|
-
upon the subject of the definition of a standard or standards of thermal efficiency for
|
|
150
|
-
steam-engines: With an introductory note. In Minutes of proceedings of the institution
|
|
151
|
-
of civil engineers, Volume 134, pp. 278–283.
|
|
152
|
-
|
|
153
|
-
Schonlau M.
|
|
154
|
-
*[Visualizing Categorical Data Arising in the Health Sciences Using Hammock Plots.](http://www.schonlau.net/publication/03jsm_hammockplot.pdf)*
|
|
155
|
-
In Proceedings of the Section on Statistical Graphics, American Statistical Association; 2003
|
|
156
|
-
|
|
157
|
-
|
|
158
|
-
|
|
159
|
-
### Other implementations of the hammock plot
|
|
160
|
-
There is also a Stata implementation `hammock` (available from the Stata archive SSC) and an R implementation as part of the package `ggparallel`.
|
|
161
|
-
|
|
162
|
-
[MIT License](https://choosealicense.com/licenses/mit/)
|
|
163
|
-
|
|
164
|
-
|
|
165
|
-
|
|
166
|
-
## Authors
|
|
167
|
-
|
|
168
|
-
- Tiancheng Yang t77yang@uwaterloo.ca
|
|
169
|
-
|
|
170
|
-
|
|
171
|
-
|
|
172
|
-
Platform: UNKNOWN
|
|
173
|
-
Classifier: Programming Language :: Python :: 3
|
|
174
|
-
Classifier: License :: OSI Approved :: MIT License
|
|
175
|
-
Classifier: Operating System :: OS Independent
|
|
176
|
-
Classifier: Intended Audience :: Science/Research
|
|
177
|
-
Requires-Python: >=3.6
|
|
178
|
-
Description-Content-Type: text/markdown
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|