json-extract-pandas 0.1.0 (tar.gz)
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- json_extract_pandas-0.1.0/LICENSE +21 -0
- json_extract_pandas-0.1.0/PKG-INFO +128 -0
- json_extract_pandas-0.1.0/README.md +112 -0
- json_extract_pandas-0.1.0/pyproject.toml +28 -0
- json_extract_pandas-0.1.0/setup.cfg +4 -0
- json_extract_pandas-0.1.0/src/json_extract/__init__.py +4 -0
- json_extract_pandas-0.1.0/src/json_extract/json_extract.py +302 -0
- json_extract_pandas-0.1.0/src/json_extract_pandas.egg-info/PKG-INFO +128 -0
- json_extract_pandas-0.1.0/src/json_extract_pandas.egg-info/SOURCES.txt +10 -0
- json_extract_pandas-0.1.0/src/json_extract_pandas.egg-info/dependency_links.txt +1 -0
- json_extract_pandas-0.1.0/src/json_extract_pandas.egg-info/requires.txt +1 -0
- json_extract_pandas-0.1.0/src/json_extract_pandas.egg-info/top_level.txt +1 -0
@@ -0,0 +1,21 @@ json_extract_pandas-0.1.0/LICENSE

MIT License

Copyright (c) 2026

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,128 @@ json_extract_pandas-0.1.0/PKG-INFO

Metadata-Version: 2.4
Name: json-extract-pandas
Version: 0.1.0
Summary: A robust JSON-to-CSV extraction engine for deeply nested payloads.
Author-email: Vatsa <228140210+svr-s@users.noreply.github.com>
Project-URL: Homepage, https://github.com/svr-s/json_extract
Project-URL: Issues, https://github.com/svr-s/json_extract/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.0.0
Dynamic: license-file

# JSON Extract (`json_extract`)

A high-performance, reusable Python utility designed to natively flatten, parse, and filter deeply nested JSON data into clean, normalized Pandas DataFrames. It acts as a powerful JSON-to-CSV extraction engine for robust data pipelines.

## Features
- **Deep Hierarchical Flattening**: Uses recursive logic to safely flatten any nested JSON object, preserving exact path context via dot-notation (e.g., `parent.child.property`).
- **Cartesian List Explosion**: Intelligently handles embedded lists/arrays, exploding them into new rows to prevent data loss or complex column bloat.
- **Dynamic Filtering**: A robust API to slice your exact dataset out of massive payloads.
  - *Column Filtering*: Exact matching, 1-based numeric indexing (e.g., `["1-7", 10]`), prefix wildcards (`parent.*`), and suffix wildcards (`*.statusCode`).
  - *Row Filtering*: Filter rows by exact values or lists of acceptable values.
- **Smart Data Cleaning**: Natively deduplicate rows, drop `NaN`/blanks, and cleanly simplify column headers without causing naming collisions.
- **Self-Documenting Metadata**: Instantly returns schema and dimension metadata (`table_size`, `column_names` mapped by 1-based index) alongside the DataFrame.

## Requirements
- Python 3.9+
- `pandas`

---

## Installation & Usage

Install with `pip install json-extract-pandas`, then import `extract_json` directly into your data pipeline script:

```python
import json
from json_extract import extract_json

# 1. Load your raw payload
with open('data.json', 'r') as f:
    payload = json.load(f)

# 2. Extract!
meta, df = extract_json(payload)
print(df.head())
```

---

## Function Reference

### `extract_json(json_data, **kwargs)`

The primary extraction engine.

#### Parameters

* **`json_data`** `(dict | list)`: **(Required)**
    The parsed JSON payload to process. It handles plain dictionaries, lists of dictionaries, and raw primitive 2D arrays (`[[0, 1], [2, 3]]`), mapping the latter to `col1`, `col2`, etc.

* **`desired_columns`** `(list)`: *(Optional)*
    A list of specific column names or numeric indices to retain. It natively supports numeric ranges and Unix-style wildcards (`*`, `?`).
    * **Numeric Ranges/Indices (1-based)**: `["1-7", "10", 12]` (gets columns 1 through 7, 10, and 12). Duplicate overlaps are safely ignored.
    * **Exact Match**: `["accountId"]`
    * **Prefix Wildcard**: `["shippingAddress.*"]` (gets all properties starting with `shippingAddress.`)
    * **Suffix Wildcard**: `["*.statusCode"]` (gets every `statusCode` across the entire document)

* **`row_filters`** `(dict)`: *(Optional)*
    A dictionary mapping a specific column to a required value. You can supply a single exact match or a list of acceptable matches.
    * **Exact**: `{"accountId": "ACC-99823-XYZ"}`
    * **List**: `{"regionCode": ["US-EAST", "EU-WEST"]}`

* **`remove_duplicates`** `(bool)`: *(Optional, Default: False)*
    If `True`, drops any completely identical rows generated by Cartesian explosions after all filters are applied.

* **`simplify_columns`** `(bool)`: *(Optional, Default: False)*
    If `True`, trims away verbose parent hierarchies in the column names, leaving only the final child key (e.g., `parent.child.shortName` becomes `shortName`).
    *Note: If simplifying causes a name collision (e.g., `type.code` and `name.code` both mapping to `code`), the full dot-notation is retained for those specific columns to prevent data overwrite.*

* **`remove_empty`** `(bool | str)`: *(Optional, Default: False)*
    Cleans the dataset by dropping rows containing missing values (`NaN`, `None`) or blank strings (`""`).
    * `'any'` (or `True`): Strict. Drops the row if *any* of the filtered columns is missing.
    * `'all'`: Lenient. Drops the row only if *all* of the filtered columns are missing.
    * `False`: Disabled. Retains everything.

---

## Full Example Pipeline

Here is an example demonstrating all parameters working in tandem to slice a massive, nested payload down to an exact, clean specification:

```python
import json
from json_extract import extract_json

with open('users_export.json', 'r') as f:
    data = json.load(f)

# Extract only the fields we care about, for exactly our target accounts
meta, df = extract_json(
    json_data=data,

    # 1. Grab the account ID, specific columns by index, and shipping address properties
    desired_columns=[
        "accountId",
        "2-4",
        "shippingAddress.*",
        "*.statusCode"
    ],

    # 2. Filter to just these two specific regions
    row_filters={
        "regionCode": ["US-EAST", "EU-WEST"]
    },

    # 3. Clean up the resulting DataFrame
    remove_duplicates=True,
    simplify_columns=True,  # Turns "shippingAddress.zipCode" into "zipCode"
    remove_empty='all'      # Drop rows where every selected column is blank
)

print(df.to_csv(index=False))
```
@@ -0,0 +1,112 @@ json_extract_pandas-0.1.0/README.md

# JSON Extract (`json_extract`)

A high-performance, reusable Python utility designed to natively flatten, parse, and filter deeply nested JSON data into clean, normalized Pandas DataFrames. It acts as a powerful JSON-to-CSV extraction engine for robust data pipelines.

## Features
- **Deep Hierarchical Flattening**: Uses recursive logic to safely flatten any nested JSON object, preserving exact path context via dot-notation (e.g., `parent.child.property`).
- **Cartesian List Explosion**: Intelligently handles embedded lists/arrays, exploding them into new rows to prevent data loss or complex column bloat.
- **Dynamic Filtering**: A robust API to slice your exact dataset out of massive payloads.
  - *Column Filtering*: Exact matching, 1-based numeric indexing (e.g., `["1-7", 10]`), prefix wildcards (`parent.*`), and suffix wildcards (`*.statusCode`).
  - *Row Filtering*: Filter rows by exact values or lists of acceptable values.
- **Smart Data Cleaning**: Natively deduplicate rows, drop `NaN`/blanks, and cleanly simplify column headers without causing naming collisions.
- **Self-Documenting Metadata**: Instantly returns schema and dimension metadata (`table_size`, `column_names` mapped by 1-based index) alongside the DataFrame.

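The flattening and Cartesian explosion described above can be sketched with plain-Python recursion. This is an illustrative model of the idea, not the package's exact implementation; the `explode` helper and sample record are invented for the demo:

```python
import itertools

def explode(data, prefix=""):
    """Flatten dicts into dot-notation keys; fan lists out into multiple rows."""
    if isinstance(data, dict):
        # One list of candidate row-fragments per key, then the Cartesian product.
        per_key = [explode(v, f"{prefix}.{k}" if prefix else k) for k, v in data.items()]
        return [
            {k: v for part in combo for k, v in part.items()}
            for combo in itertools.product(*per_key)
        ] or [{}]
    if isinstance(data, list):
        # Each list element becomes its own row under the same key path.
        return [row for item in data for row in explode(item, prefix)]
    return [{prefix: data}]

record = {"id": 7, "tags": ["a", "b"], "address": {"city": "Oslo"}}
rows = explode(record)
# Two rows: one per tag, each carrying id and address.city
```

Each nested dict contributes columns, while each list fans the record out into additional rows, which is why a single record with a two-element list yields two output rows.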
## Requirements
- Python 3.9+
- `pandas`

---

## Installation & Usage

Install with `pip install json-extract-pandas`, then import `extract_json` directly into your data pipeline script:

```python
import json
from json_extract import extract_json

# 1. Load your raw payload
with open('data.json', 'r') as f:
    payload = json.load(f)

# 2. Extract!
meta, df = extract_json(payload)
print(df.head())
```

---

## Function Reference

### `extract_json(json_data, **kwargs)`

The primary extraction engine.

#### Parameters

* **`json_data`** `(dict | list)`: **(Required)**
    The parsed JSON payload to process. It handles plain dictionaries, lists of dictionaries, and raw primitive 2D arrays (`[[0, 1], [2, 3]]`), mapping the latter to `col1`, `col2`, etc.

* **`desired_columns`** `(list)`: *(Optional)*
    A list of specific column names or numeric indices to retain. It natively supports numeric ranges and Unix-style wildcards (`*`, `?`).
    * **Numeric Ranges/Indices (1-based)**: `["1-7", "10", 12]` (gets columns 1 through 7, 10, and 12). Duplicate overlaps are safely ignored.
    * **Exact Match**: `["accountId"]`
    * **Prefix Wildcard**: `["shippingAddress.*"]` (gets all properties starting with `shippingAddress.`)
    * **Suffix Wildcard**: `["*.statusCode"]` (gets every `statusCode` across the entire document)

* **`row_filters`** `(dict)`: *(Optional)*
    A dictionary mapping a specific column to a required value. You can supply a single exact match or a list of acceptable matches.
    * **Exact**: `{"accountId": "ACC-99823-XYZ"}`
    * **List**: `{"regionCode": ["US-EAST", "EU-WEST"]}`

* **`remove_duplicates`** `(bool)`: *(Optional, Default: False)*
    If `True`, drops any completely identical rows generated by Cartesian explosions after all filters are applied.

* **`simplify_columns`** `(bool)`: *(Optional, Default: False)*
    If `True`, trims away verbose parent hierarchies in the column names, leaving only the final child key (e.g., `parent.child.shortName` becomes `shortName`).
    *Note: If simplifying causes a name collision (e.g., `type.code` and `name.code` both mapping to `code`), the full dot-notation is retained for those specific columns to prevent data overwrite.*

* **`remove_empty`** `(bool | str)`: *(Optional, Default: False)*
    Cleans the dataset by dropping rows containing missing values (`NaN`, `None`) or blank strings (`""`).
    * `'any'` (or `True`): Strict. Drops the row if *any* of the filtered columns is missing.
    * `'all'`: Lenient. Drops the row only if *all* of the filtered columns are missing.
    * `False`: Disabled. Retains everything.

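The `desired_columns` matching rules can be approximated with `re` and `fnmatch` from the standard library. This is a simplified sketch of the selection logic (the `resolve` helper and sample column names are invented for illustration), not the library's exact code path:

```python
import fnmatch
import re

def resolve(spec, columns):
    """Resolve one desired_columns entry against an ordered list of column names."""
    s = str(spec).strip()
    if re.fullmatch(r"\d+-\d+", s):          # 1-based range, e.g. "2-4"
        lo, hi = map(int, s.split("-"))
        return columns[max(0, lo - 1):hi]
    if s.isdigit():                          # single 1-based index
        i = int(s) - 1
        return columns[i:i + 1]
    if "*" in s or "?" in s:                 # Unix-style wildcard (prefix or suffix)
        return [c for c in columns if fnmatch.fnmatch(c, s)]
    return [c for c in columns if c == s]    # exact name

cols = ["accountId", "shippingAddress.zip", "shippingAddress.city", "order.statusCode"]
assert resolve("2-3", cols) == ["shippingAddress.zip", "shippingAddress.city"]
assert resolve("shippingAddress.*", cols) == ["shippingAddress.zip", "shippingAddress.city"]
assert resolve("*.statusCode", cols) == ["order.statusCode"]
```

Because dot-notation paths are ordinary strings, a single `fnmatch` pattern covers both the `parent.*` and `*.suffix` cases without any custom parsing.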
---

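The collision-safe behavior of `simplify_columns` can be reproduced with a small rename map. A sketch of the semantics, assuming pandas is available; the column names here are made up for the demo:

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame(columns=["user.name", "type.code", "name.code"])

# Group full dot-notation names by their final path segment.
groups = defaultdict(list)
for col in df.columns:
    groups[col.split(".")[-1]].append(col)

# Only rename when the short name is unambiguous; colliding columns
# keep their full dot-notation names.
rename_map = {cols[0]: short for short, cols in groups.items() if len(cols) == 1}
df = df.rename(columns=rename_map)
# Columns are now: ["name", "type.code", "name.code"]
```

`user.name` simplifies cleanly to `name`, while `type.code` and `name.code` would both collapse to `code`, so they are left untouched.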
## Full Example Pipeline

Here is an example demonstrating all parameters working in tandem to slice a massive, nested payload down to an exact, clean specification:

```python
import json
from json_extract import extract_json

with open('users_export.json', 'r') as f:
    data = json.load(f)

# Extract only the fields we care about, for exactly our target accounts
meta, df = extract_json(
    json_data=data,

    # 1. Grab the account ID, specific columns by index, and shipping address properties
    desired_columns=[
        "accountId",
        "2-4",
        "shippingAddress.*",
        "*.statusCode"
    ],

    # 2. Filter to just these two specific regions
    row_filters={
        "regionCode": ["US-EAST", "EU-WEST"]
    },

    # 3. Clean up the resulting DataFrame
    remove_duplicates=True,
    simplify_columns=True,  # Turns "shippingAddress.zipCode" into "zipCode"
    remove_empty='all'      # Drop rows where every selected column is blank
)

print(df.to_csv(index=False))
```
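The difference between `remove_empty='any'` and `'all'` in the pipeline above comes down to pandas' `dropna(how=...)` after blanks are coerced to `NaN`. A self-contained sketch of the semantics, with invented sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["x", "", None], "b": ["y", "z", " "]})

# Treat empty or whitespace-only strings as missing values.
clean = df.replace(r"^\s*$", np.nan, regex=True)

strict = clean.dropna(how="any")   # drops a row if ANY cell is missing
lenient = clean.dropna(how="all")  # drops a row only if ALL cells are missing
# strict keeps 1 row ("x"/"y"); lenient keeps 2 (only the all-missing row goes)
```

Choosing `'all'` is the safer default for sparse payloads, since it only discards rows that carry no information at all in the selected columns.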
@@ -0,0 +1,28 @@ json_extract_pandas-0.1.0/pyproject.toml

[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "json-extract-pandas"
version = "0.1.0"
authors = [
    { name="Vatsa", email="228140210+svr-s@users.noreply.github.com" },
]
description = "A robust JSON-to-CSV extraction engine for deeply nested payloads."
readme = "README.md"
requires-python = ">=3.9"
dependencies = [
    "pandas>=1.0.0"
]
classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
    "Operating System :: OS Independent",
]

[tool.setuptools.packages.find]
where = ["src"]

[project.urls]
Homepage = "https://github.com/svr-s/json_extract"
Issues = "https://github.com/svr-s/json_extract/issues"
@@ -0,0 +1,302 @@ json_extract_pandas-0.1.0/src/json_extract/json_extract.py

import pandas as pd
import itertools
import fnmatch
import re
from typing import Union, Optional, Tuple, Dict, Any

def _flatten_and_expand(data, parent_key=''):
    """
    Recursively flattens a nested dictionary or list and explodes lists
    into multiple rows via Cartesian product.
    Nested keys are separated by a dot (.).
    For example: parentEntity.ChildEntity1.SubEntity2
    """
    if isinstance(data, dict):
        key_results = []
        for k, v in data.items():
            new_key = f"{parent_key}.{k}" if parent_key else k
            res = _flatten_and_expand(v, new_key)
            key_results.append(res)

        combined_rows = []
        for combination in itertools.product(*key_results):
            merged = {}
            for d in combination:
                merged.update(d)
            combined_rows.append(merged)

        return combined_rows if combined_rows else [{}]

    elif isinstance(data, list):
        if not data:
            return [{parent_key: None}]

        all_rows = []
        for item in data:
            res = _flatten_and_expand(item, parent_key)
            all_rows.extend(res)
        return all_rows

    else:
        # Base case: primitive value
        return [{parent_key: data}]

def extract_json(
    json_data: Union[Dict, list],
    desired_columns: Optional[list] = None,
    row_filters: Optional[Dict[str, Any]] = None,
    remove_duplicates: bool = False,
    simplify_columns: bool = False,
    remove_empty: Union[bool, str] = False
) -> Tuple[Dict[str, Any], pd.DataFrame]:
    """
    Takes parsed JSON data, extracts the primary records, flattens deeply nested
    hierarchies into a pandas DataFrame using dot-notation, and explodes lists
    via Cartesian product.

    This function acts as a robust JSON-to-CSV data extraction engine, specifically
    built to handle complex, heavily-nested JSON payloads. It gracefully handles
    raw primitive arrays, standardizes column names, and allows for powerful
    wildcard column extraction and list-based row filtering.

    Args:
        json_data (dict | list):
            The parsed JSON structure to explode.
            Can be a deeply nested dictionary, a list of dictionaries, or even a
            raw 2D array (e.g., `[[0, 1], [2, 3]]`) which will be automatically
            mapped to `col1`, `col2`, etc.

        desired_columns (list, optional):
            A list of column names, indices, or wildcards to retain in the final DataFrame.
            - Example (Exact): `["accountId", "user.profile.firstName"]`
            - Example (Index): `["5", 5]` (Extracts exactly the 5th column, 1-indexed)
            - Example (Range): `["1-7"]` (Extracts columns 1 through 7 inclusive)
            - Example (Prefix Wildcard): `["shippingAddress.*"]`
            - Example (Suffix Wildcard): `["*.statusCode"]`
            If indices are used alongside other matches that result in duplicates
            (e.g., `["5", "1-7"]`), the final columns are safely deduplicated.
            If a requested index or wildcard matches no columns, a warning is printed.

        row_filters (dict, optional):
            A dictionary mapping column names to desired values to filter the dataset.
            Values can be a single exact match or a list of acceptable matches.
            - Example (Exact Match): `{"accountId": "ACC-99823-XYZ"}`
            - Example (List Match): `{"regionCode": ["US-EAST", "EU-WEST"]}`

        remove_duplicates (bool, optional):
            If True, removes completely identical rows from the final DataFrame
            after all filtering has been applied. Defaults to False.

        simplify_columns (bool, optional):
            If True, strips parent prefixes from column names, retaining only the
            final child property (e.g., `parent.child.shortName` becomes `shortName`).
            If doing so would result in duplicate column names (e.g., both `a.code`
            and `b.code` become `code`), it preserves their full dot-notation names
            to prevent collisions. Defaults to False.

        remove_empty (bool | str, optional):
            Cleans the dataset by dropping rows containing missing values (`NaN`, `None`)
            or completely empty strings (`""`, `" "`).
            - `'any'` (or True): Drops the row if *any* of its filtered columns are missing.
            - `'all'`: Drops the row only if *all* of its filtered columns are missing.
            - False: Disables empty row removal. Defaults to False.

    Returns:
        tuple: A tuple containing:
            - meta_data_dict (dict): A dictionary describing the resulting table size
              and the exact schema/column names generated. The `column_names` key contains
              a dictionary mapping the 1-based numerical index to the respective column name
              (e.g., `{1: "accountId", 2: "user.profile.firstName"}`).
            - pandas_dataframe (pd.DataFrame): The fully flattened, filtered, and
              cleaned pandas DataFrame.

    Raises:
        ValueError: If `json_data` is not a dict or list, or if parameter types are invalid.
        KeyError: If an exact column requested in `desired_columns` or `row_filters`
            does not exist in the dataset.
    """
    if not isinstance(json_data, (dict, list)):
        raise ValueError(f"Invalid input: json_data must be a dictionary or a list, got {type(json_data).__name__}.")

    # 1. Determine the records to process
    if isinstance(json_data, list):
        records = json_data
    else:
        # Look for keys that contain lists of dictionaries
        list_keys = [k for k, v in json_data.items() if isinstance(v, list) and len(v) > 0 and isinstance(v[0], dict)]
        if list_keys:
            # Pick the key with the largest list, assuming that's the primary data payload
            largest_list_key = max(list_keys, key=lambda k: len(json_data[k]))
            records = json_data[largest_list_key]
        else:
            # Otherwise, treat the entire dictionary as a single record
            records = [json_data]

    # Format records if they are raw lists (e.g., CSV-like JSON [[0, 1], [2, 3]])
    formatted_records = []
    for r in records:
        if isinstance(r, list):
            formatted_records.append({f"col{i+1}": val for i, val in enumerate(r)})
        else:
            formatted_records.append(r)
    records = formatted_records

    # Flatten and explode each record
    flattened_records = []
    for record in records:
        flattened_records.extend(_flatten_and_expand(record))

    # Create DataFrame
    df = pd.DataFrame(flattened_records)

    # Apply Column Filters
    if desired_columns:
        if not isinstance(desired_columns, list):
            raise ValueError("Invalid input: desired_columns must be a list of strings.")

        valid_columns = []
        for req_col in desired_columns:
            req_str = str(req_col).strip()

            # Check for a range pattern (e.g., "1-7")
            if re.match(r'^\d+-\d+$', req_str):
                start, end = map(int, req_str.split('-'))
                start_idx = max(0, start - 1)
                end_idx = min(len(df.columns), end)
                matched_cols = list(df.columns[start_idx:end_idx])
                if not matched_cols:
                    print(f"Warning: Index range '{req_str}' is entirely out of bounds for the dataset.")
                valid_columns.extend(matched_cols)

            # Check for a single numeric index (e.g., "5" or 5)
            elif req_str.isdigit():
                idx = int(req_str) - 1
                if 0 <= idx < len(df.columns):
                    valid_columns.append(df.columns[idx])
                else:
                    print(f"Warning: Index '{req_str}' is out of bounds for the dataset.")

            # Check for wildcards
            elif '*' in req_str or '?' in req_str:
                # Use fnmatch to support both prefix.* and *.suffix patterns
                matched_cols = [c for c in df.columns if fnmatch.fnmatch(c, req_str)]
                if not matched_cols:
                    print(f"Warning: Wildcard filter '{req_str}' did not match any columns in the dataset.")
                valid_columns.extend(matched_cols)

            # Exact match
            else:
                if req_str in df.columns:
                    valid_columns.append(req_str)
                else:
                    raise KeyError(f"Requested column '{req_str}' does not exist in the exploded dataset.")

        # Deduplicate the list while preserving order
        seen = set()
        final_columns = [x for x in valid_columns if not (x in seen or seen.add(x))]
        df = df[final_columns]

    # Apply Row Filters
    if row_filters:
        if not isinstance(row_filters, dict):
            raise ValueError("Invalid input: row_filters must be a dictionary.")

        for col, val in row_filters.items():
            if col in df.columns:
                if isinstance(val, list):
                    df = df[df[col].isin(val)]
                else:
                    df = df[df[col] == val]
            else:
                raise KeyError(f"Row filter column '{col}' does not exist in the exploded dataset.")

    # Remove Duplicates
    if remove_duplicates:
        df = df.drop_duplicates().reset_index(drop=True)

    # Simplify Column Names
    if simplify_columns:
        from collections import defaultdict
        short_to_original = defaultdict(list)

        for col in df.columns:
            short_name = col.split('.')[-1]
            short_to_original[short_name].append(col)

        rename_map = {}
        for short_name, originals in short_to_original.items():
            if len(originals) == 1:
                # Unique short name, safe to rename
                rename_map[originals[0]] = short_name
            # If len > 1, there is a collision, so we do NOT add them to the rename map.
            # They will naturally retain their full original names.

        df = df.rename(columns=rename_map)

    # Remove Empty/Blanks/NaNs
    if remove_empty:
        import numpy as np
        # Convert empty strings or whitespace-only strings to proper NaNs
        df = df.replace(r'^\s*$', np.nan, regex=True)

        if isinstance(remove_empty, str) and remove_empty.lower() == 'all':
            df = df.dropna(how='all')
        else:
            df = df.dropna(how='any')

        df = df.reset_index(drop=True)

    # Generate Metadata
    meta_data = {
        "table_size": {
            "rows": int(df.shape[0]),
            "columns": int(df.shape[1])
        },
        "column_names": {i + 1: col for i, col in enumerate(df.columns)}
    }

    return meta_data, df

if __name__ == '__main__':
    # Localized test case to verify the function works
    import json
    import os

    current_dir = os.path.dirname(os.path.abspath(__file__))
    sample_file = os.path.join(current_dir, 'sample_json.json')

    if os.path.exists(sample_file):
        print(f"Loading test data from: {sample_file}")
        with open(sample_file, 'r', encoding='utf-8') as f:
            data = json.load(f)

        # Define some filters for testing
        col_filters = [
            "languageCode",
            "laborAllocationCodes.code",
            "sample_test_4.sample_test_4_1"
        ]
        r_filters = {
            "languageCode": "en-CA"
        }

        print(f"Applying Column Filters: {col_filters}")
        print(f"Applying Row Filters: {r_filters}")

        meta, exploded_df = extract_json(
            json_data=data,
            desired_columns=col_filters,
            row_filters=r_filters,
            simplify_columns=False,
            remove_empty='all'
        )

        print("\n--- Metadata Result ---")
        print(json.dumps(meta, indent=4))

        print("\n--- DataFrame Result ---")
        print(exploded_df)
    else:
        print("Please provide a valid JSON file to test.")
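The order-preserving deduplication that the column filter relies on is a common set-tracking idiom; it can be exercised in isolation (a stdlib-only sketch, with the `dedupe_keep_order` name invented for the demo):

```python
def dedupe_keep_order(items):
    """Drop repeats while preserving first-seen order, as the column filter does."""
    seen = set()
    # set.add() returns None (falsy), so the membership test and the
    # insertion happen in a single expression per element.
    return [x for x in items if not (x in seen or seen.add(x))]

assert dedupe_keep_order(["a", "b", "a", "c", "b"]) == ["a", "b", "c"]
```

Unlike `set(items)`, this keeps the first occurrence's position, which matters here because `desired_columns` order determines the output column order.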
@@ -0,0 +1,128 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: json-extract-pandas
|
|
3
|
+
Version: 0.1.0
|
|
4
|
+
Summary: A robust JSON-to-CSV extraction engine for deeply nested payloads.
|
|
5
|
+
Author-email: Vatsa <228140210+svr-s@users.noreply.github.com>
|
|
6
|
+
Project-URL: Homepage, https://github.com/svr-s/json_extract
|
|
7
|
+
Project-URL: Issues, https://github.com/svr-s/json_extract/issues
|
|
8
|
+
Classifier: Programming Language :: Python :: 3
|
|
9
|
+
Classifier: License :: OSI Approved :: MIT License
|
|
10
|
+
Classifier: Operating System :: OS Independent
|
|
11
|
+
Requires-Python: >=3.9
|
|
12
|
+
Description-Content-Type: text/markdown
|
|
13
|
+
License-File: LICENSE
|
|
14
|
+
Requires-Dist: pandas>=1.0.0
|
|
15
|
+
Dynamic: license-file
|
|
16
|
+
|
|
17
|
+
# JSON Extract (`json_extract`)
|
|
18
|
+
|
|
19
|
+
A high-performance, reusable Python utility designed to natively flatten, parse, and filter deeply nested JSON data into clean, normalized Pandas DataFrames. It acts as a powerful JSON-to-CSV extraction engine for robust data pipelines.
|
|
20
|
+
|
|
21
|
+
## Features
|
|
22
|
+
- **Deep Hierarchical Flattening**: Uses recursive logic to safely flatten any nested JSON object, preserving exact path context via dot-notation (e.g. `parent.child.property`).
|
|
23
|
+
- **Cartesian List Explosion**: Intelligently handles embedded lists/arrays, exploding them into new rows to prevent data loss or complex column bloat.
|
|
24
|
+
- **Dynamic Filtering**: Robust API to slice your exact dataset out of massive payloads.
|
|
25
|
+
- *Column Filtering*: Strict matching, 1-based numeric indexing (e.g., `["1-7", 10]`), Prefix wildcards (`parent.*`), and Suffix wildcards (`*.statusCode`).
|
|
26
|
+
- *Row Filtering*: Filter rows by exact values or lists of acceptable values.
|
|
27
|
+
- **Smart Data Cleaning**: Natively deduplicate rows, drop `NaN`/blanks, and cleanly simplify column headers without causing naming collisions.
|
|
28
|
+
- **Self-Documenting Metadata**: Instantly returns schema and dimension metadata (`table_size`, `column_names` mapped by 1-based index) alongside the DataFrame.
|
|
29
|
+
|
|
30
|
+
## Requirements

- Python 3.9+
- `pandas`

---

## Installation & Usage

Import `extract_json` directly into your data pipeline script:

```python
import json
from json_extract import extract_json

# 1. Load your raw payload
with open('data.json', 'r') as f:
    payload = json.load(f)

# 2. Extract!
meta, df = extract_json(payload)
print(df.head())
```

---

## Function Reference

### `extract_json(json_data, **kwargs)`

The primary extraction engine.

#### Parameters

* **`json_data`** `(dict | list)`: **(Required)**
  The parsed JSON payload to process. Accepts plain dictionaries, lists of dictionaries, or even raw primitive 2D arrays (`[[0, 1], [2, 3]]`), mapping the latter to generated column names (`col1`, `col2`, ...).

* **`desired_columns`** `(list)`: *(Optional)*
  A list of specific column names or numeric indices to retain. Supports numeric ranges and Unix-style wildcards (`*`, `?`).
  * **Numeric Ranges/Indices (1-based)**: `["1-7", "10", 12]` selects columns 1 through 7, 10, and 12. Overlapping duplicates are safely ignored.
  * **Exact Match**: `["accountId"]`
  * **Prefix Wildcard**: `["shippingAddress.*"]` selects all properties starting with `shippingAddress.`
  * **Suffix Wildcard**: `["*.statusCode"]` selects every `statusCode` across the entire document

* **`row_filters`** `(dict)`: *(Optional)*
  A dictionary mapping a specific column to a required value. You can supply a single exact match or a list of acceptable matches.
  * **Exact**: `{"accountId": "ACC-99823-XYZ"}`
  * **List**: `{"regionCode": ["US-EAST", "EU-WEST"]}`

* **`remove_duplicates`** `(bool)`: *(Optional, Default: `False`)*
  If `True`, drops completely identical rows (typically generated by Cartesian explosion) after all filters are applied.

* **`simplify_columns`** `(bool)`: *(Optional, Default: `False`)*
  If `True`, trims away verbose parent hierarchies in column names, leaving only the final child key (e.g., `parent.child.shortName` becomes `shortName`).
  *Note: If simplifying would cause a name collision (e.g., `type.code` and `name.code` both mapping to `code`), the full dot notation is retained for those specific columns to prevent data overwrite.*

* **`remove_empty`** `(bool | str)`: *(Optional, Default: `False`)*
  Cleans the dataset by dropping rows containing missing values (`NaN`, `None`) or blank strings (`""`).
  * `'any'` (or `True`): Strict. Drops the row if *any* of the selected columns is missing.
  * `'all'`: Lenient. Drops the row only if *all* of the selected columns are missing.
  * `False`: Disabled. Retains everything.

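The `desired_columns` matching rules described above can be approximated with a small stand-alone resolver. The helper below is hypothetical, written only to illustrate the documented semantics (1-based indices, `"start-end"` ranges, wildcards, duplicate-overlap suppression), and is not the package's internal code:

```python
from fnmatch import fnmatch

def resolve_columns(all_columns, desired):
    """Resolve a desired_columns-style spec against actual column names."""
    selected = []
    for spec in desired:
        if isinstance(spec, int):
            # Bare integer: 1-based column index
            matches = [all_columns[spec - 1]]
        elif isinstance(spec, str) and "-" in spec and spec.replace("-", "").isdigit():
            # "1-7"-style 1-based range
            start, end = (int(p) for p in spec.split("-"))
            matches = all_columns[start - 1:end]
        elif isinstance(spec, str) and spec.isdigit():
            # "10"-style 1-based index given as a string
            matches = [all_columns[int(spec) - 1]]
        else:
            # Exact names and Unix-style wildcards (fnmatch treats a
            # pattern with no wildcards as an exact match)
            matches = [c for c in all_columns if fnmatch(c, spec)]
        for col in matches:
            if col not in selected:  # duplicate overlaps are ignored
                selected.append(col)
    return selected

cols = ["accountId", "regionCode", "shippingAddress.zipCode", "order.statusCode"]
print(resolve_columns(cols, ["1-2", "shippingAddress.*", "*.statusCode"]))
```

Note how a range and a wildcard can overlap without producing duplicate columns, matching the "duplicate overlaps are safely ignored" guarantee.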
---

## Full Example Pipeline

Here is an example demonstrating all parameters working in tandem to slice a massive, nested payload down to an exact, clean specification:

```python
import json
from json_extract import extract_json

with open('users_export.json', 'r') as f:
    data = json.load(f)

# Extract only the fields we care about, exactly for our target accounts
meta, df = extract_json(
    json_data=data,

    # 1. Grab Account ID, specific columns by index, and shipping address properties
    desired_columns=[
        "accountId",
        "2-4",
        "shippingAddress.*",
        "*.statusCode"
    ],

    # 2. Filter to just these two specific regions
    row_filters={
        "regionCode": ["US-EAST", "EU-WEST"]
    },

    # 3. Clean up the resulting DataFrame
    remove_duplicates=True,
    simplify_columns=True,  # Turns "shippingAddress.zipCode" into "zipCode"
    remove_empty='all'      # Drop rows that are totally blank
)

print(df.to_csv(index=False))
```

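The collision-safe header simplification that `simplify_columns=True` performs can be sketched in a few lines. This is a hypothetical reimplementation of the documented behavior, not the package's own code:

```python
from collections import Counter

def simplify_headers(columns):
    """Shorten dotted column names to their last segment, but keep the full
    dotted name whenever the short form would collide with another column."""
    short = [c.rsplit(".", 1)[-1] for c in columns]
    counts = Counter(short)
    # Keep the simplified name only when it is unambiguous
    return [s if counts[s] == 1 else c for c, s in zip(columns, short)]

print(simplify_headers(["shippingAddress.zipCode", "type.code", "name.code"]))
# "zipCode" is safe to shorten; both ".code" columns keep their full names
```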
@@ -0,0 +1,10 @@
LICENSE
README.md
pyproject.toml
src/json_extract/__init__.py
src/json_extract/json_extract.py
src/json_extract_pandas.egg-info/PKG-INFO
src/json_extract_pandas.egg-info/SOURCES.txt
src/json_extract_pandas.egg-info/dependency_links.txt
src/json_extract_pandas.egg-info/requires.txt
src/json_extract_pandas.egg-info/top_level.txt

@@ -0,0 +1 @@


@@ -0,0 +1 @@
pandas>=1.0.0

@@ -0,0 +1 @@
json_extract