json-extract-pandas 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,128 @@
Metadata-Version: 2.4
Name: json-extract-pandas
Version: 0.1.0
Summary: A robust JSON-to-CSV extraction engine for deeply nested payloads.
Author-email: Vatsa <228140210+svr-s@users.noreply.github.com>
Project-URL: Homepage, https://github.com/svr-s/json_extract
Project-URL: Issues, https://github.com/svr-s/json_extract/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.0.0
Dynamic: license-file

# JSON Extract (`json_extract`)

A reusable, high-performance Python utility that flattens, parses, and filters deeply nested JSON data into clean, normalized pandas DataFrames. It acts as a JSON-to-CSV extraction engine for robust data pipelines.

## Features
- **Deep Hierarchical Flattening**: Uses recursive logic to flatten any nested JSON object, preserving exact path context via dot notation (e.g., `parent.child.property`).
- **Cartesian List Explosion**: Handles embedded lists/arrays by exploding them into new rows, preventing data loss and column bloat.
- **Dynamic Filtering**: A robust API to slice your exact dataset out of massive payloads.
  - *Column Filtering*: Exact matching, 1-based numeric indexing (e.g., `["1-7", 10]`), prefix wildcards (`parent.*`), and suffix wildcards (`*.statusCode`).
  - *Row Filtering*: Filter rows by exact values or lists of acceptable values.
- **Smart Data Cleaning**: Deduplicate rows, drop `NaN`/blank values, and simplify column headers without causing naming collisions.
- **Self-Documenting Metadata**: Returns schema and dimension metadata (`table_size`, `column_names` mapped by 1-based index) alongside the DataFrame.
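For intuition, the flatten-then-explode behaviour can be reproduced with plain pandas (`json_normalize` plus `explode`). This is a minimal sketch with made-up data, not the library's internal code:

```python
import pandas as pd

record = {"id": 1, "user": {"name": "Ada"}, "tags": ["a", "b"]}

# Nested dicts become dot-notation columns...
df = pd.json_normalize(record)
# ...and a list-valued column becomes one row per element.
df = df.explode("tags", ignore_index=True)
print(df)
```

The two-element `tags` list yields two rows, each carrying the same `id` and `user.name` values.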

## Requirements
- Python 3.9+
- `pandas`

---

## Installation & Usage

Import `extract_json` directly into your data pipeline script:

```python
import json
from json_extract import extract_json

# 1. Load your raw payload
with open('data.json', 'r') as f:
    payload = json.load(f)

# 2. Extract!
meta, df = extract_json(payload)
print(df.head())
```
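The returned `meta` object is a plain dict. A sketch of its documented shape (values here are illustrative) and a common lookup pattern:

```python
# Shape of the metadata returned alongside the DataFrame:
meta = {
    "table_size": {"rows": 2, "columns": 3},
    "column_names": {1: "accountId", 2: "user.profile.firstName", 3: "regionCode"},
}

# Look up a column by its 1-based index, or invert the mapping:
first_col = meta["column_names"][1]
index_of = {name: i for i, name in meta["column_names"].items()}
print(first_col, index_of["regionCode"])
```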

---

## Function Reference

### `extract_json(json_data, **kwargs)`

The primary extraction engine.

#### Parameters

* **`json_data`** `(dict | list)`: **(Required)**
  The parsed JSON payload to process. It handles plain dictionaries, lists of dictionaries, and even raw primitive 2D arrays (`[[0, 1], [2, 3]]`), mapping them to `col1`, `col2`, etc.

* **`desired_columns`** `(list)`: *(Optional)*
  A list of specific column names or numeric indices to retain. It supports numeric ranges and Unix-style wildcards (`*`, `?`).
  * **Numeric Ranges/Indices (1-based)**: `["1-7", "10", 12]` (gets columns 1 through 7, 10, and 12). Duplicate overlaps are safely ignored.
  * **Exact Match**: `["accountId"]`
  * **Prefix Wildcard**: `["shippingAddress.*"]` (gets all properties starting with `shippingAddress.`)
  * **Suffix Wildcard**: `["*.statusCode"]` (gets every `statusCode` across the entire document)
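The wildcards are Unix-style glob patterns (the implementation matches them with `fnmatch`). Here is how they resolve against a hypothetical flattened header row:

```python
import fnmatch

columns = ["accountId", "shippingAddress.city", "shippingAddress.zipCode",
           "order.statusCode", "payment.statusCode"]

prefix_hits = [c for c in columns if fnmatch.fnmatch(c, "shippingAddress.*")]
suffix_hits = [c for c in columns if fnmatch.fnmatch(c, "*.statusCode")]
print(prefix_hits)  # every shippingAddress.* property
print(suffix_hits)  # every statusCode, wherever it nests
```

Note that `*` also crosses dots, so `shippingAddress.*` would match deeper paths such as `shippingAddress.geo.lat` as well.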

* **`row_filters`** `(dict)`: *(Optional)*
  A dictionary mapping a specific column to a required value. You can supply a single exact match or a list of acceptable matches.
  * **Exact**: `{"accountId": "ACC-99823-XYZ"}`
  * **List**: `{"regionCode": ["US-EAST", "EU-WEST"]}`
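Internally these filters are ordinary pandas boolean masks (`isin` for lists, `==` for scalars). A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "accountId": ["A1", "A2", "A3"],
    "regionCode": ["US-EAST", "AP-SOUTH", "EU-WEST"],
})

# A list filter keeps any row whose value appears in the list...
by_list = df[df["regionCode"].isin(["US-EAST", "EU-WEST"])]
# ...while a scalar filter is a plain equality match.
by_exact = df[df["accountId"] == "A2"]
```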

* **`remove_duplicates`** `(bool)`: *(Optional, Default: False)*
  If `True`, drops completely identical rows (often generated by Cartesian explosions) after all filters are applied.

* **`simplify_columns`** `(bool)`: *(Optional, Default: False)*
  If `True`, trims away verbose parent hierarchies in column names, leaving only the final child key (e.g., `parent.child.shortName` becomes `shortName`).
  *Note: If simplifying would cause a name collision (e.g., `type.code` and `name.code` both mapping to `code`), the full dot-notation names are retained for those specific columns to prevent data overwrite.*
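The collision rule can be sketched in a few lines: group columns by their final path segment and rename only the unambiguous ones. Illustrative only; the column names are made up:

```python
from collections import defaultdict

columns = ["parent.child.shortName", "type.code", "name.code"]

# Group columns by their final dot-segment...
groups = defaultdict(list)
for col in columns:
    groups[col.split(".")[-1]].append(col)

# ...and shorten a column only when its short name is unambiguous.
renamed = [col.split(".")[-1] if len(groups[col.split(".")[-1]]) == 1 else col
           for col in columns]
print(renamed)
```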

* **`remove_empty`** `(bool | str)`: *(Optional, Default: False)*
  Cleans the dataset by dropping rows containing missing values (`NaN`, `None`) or blank strings (`""`).
  * `'any'` (or `True`): Strict. Drops the row if *any* of the retained columns is missing.
  * `'all'`: Lenient. Drops the row only if *all* of the retained columns are missing.
  * `False`: Disabled. Retains everything.
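The cleaning step first coerces blank and whitespace-only strings to `NaN`, then applies pandas `dropna` (this mirrors the `replace`/`dropna` sequence in the source). A small sketch of the `'any'` vs `'all'` difference:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["x", "", ""], "b": ["y", "z", "  "]})

# Blank or whitespace-only strings are normalized to NaN first...
df = df.replace(r"^\s*$", np.nan, regex=True)

strict = df.dropna(how="any")   # drops rows 1 and 2 (each has a missing cell)
lenient = df.dropna(how="all")  # drops only row 2 (entirely missing)
```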

---

## Full Example Pipeline

An example demonstrating all parameters working in tandem to slice a massive, nested payload down to an exact, clean specification:

```python
import json
from json_extract import extract_json

with open('users_export.json', 'r') as f:
    data = json.load(f)

# Extract only the fields we care about, for our target accounts only
meta, df = extract_json(
    json_data=data,

    # 1. Grab accountId, specific columns by index, and shipping address properties
    desired_columns=[
        "accountId",
        "2-4",
        "shippingAddress.*",
        "*.statusCode"
    ],

    # 2. Filter to just these two regions
    row_filters={
        "regionCode": ["US-EAST", "EU-WEST"]
    },

    # 3. Clean up the resulting DataFrame
    remove_duplicates=True,
    simplify_columns=True,  # turns "shippingAddress.zipCode" into "zipCode"
    remove_empty='all'      # drop rows that are entirely blank
)

print(df.to_csv(index=False))
```
@@ -0,0 +1,28 @@
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "json-extract-pandas"
version = "0.1.0"
authors = [
    { name="Vatsa", email="228140210+svr-s@users.noreply.github.com" },
]
description = "A robust JSON-to-CSV extraction engine for deeply nested payloads."
readme = "README.md"
requires-python = ">=3.9"
dependencies = [
    "pandas>=1.0.0"
]
classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
    "Operating System :: OS Independent",
]

[tool.setuptools.packages.find]
where = ["src"]

[project.urls]
Homepage = "https://github.com/svr-s/json_extract"
Issues = "https://github.com/svr-s/json_extract/issues"
@@ -0,0 +1,4 @@
[egg_info]
tag_build =
tag_date = 0

@@ -0,0 +1,4 @@
from .json_extract import extract_json, _flatten_and_expand

__version__ = "0.1.0"
__all__ = ["extract_json", "_flatten_and_expand"]
@@ -0,0 +1,302 @@
import pandas as pd
import itertools
import fnmatch
import re
from typing import Union, Optional, Tuple, Dict, Any

def _flatten_and_expand(data, parent_key=''):
    """
    Recursively flattens a nested dictionary or list and explodes lists
    into multiple rows via Cartesian product.
    Nested keys are separated by a dot (.).
    For example: parentEntity.ChildEntity1.SubEntity2
    """
    if isinstance(data, dict):
        key_results = []
        for k, v in data.items():
            new_key = f"{parent_key}.{k}" if parent_key else k
            res = _flatten_and_expand(v, new_key)
            key_results.append(res)

        combined_rows = []
        for combination in itertools.product(*key_results):
            merged = {}
            for d in combination:
                merged.update(d)
            combined_rows.append(merged)

        return combined_rows if combined_rows else [{}]

    elif isinstance(data, list):
        if not data:
            return [{parent_key: None}]

        all_rows = []
        for item in data:
            res = _flatten_and_expand(item, parent_key)
            all_rows.extend(res)
        return all_rows

    else:
        # Base case: primitive value
        return [{parent_key: data}]
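
# Example: a nested dict plus a two-element list explodes into two flat rows:
#   _flatten_and_expand({"a": {"b": 1}, "c": [2, 3]})
#   -> [{'a.b': 1, 'c': 2}, {'a.b': 1, 'c': 3}]
# The nested dict contributes dot-notation keys; the list is exploded via
# Cartesian product against the other keys' values.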

def extract_json(
    json_data: Union[Dict, list],
    desired_columns: Optional[list] = None,
    row_filters: Optional[Dict[str, Any]] = None,
    remove_duplicates: bool = False,
    simplify_columns: bool = False,
    remove_empty: Union[bool, str] = False
) -> Tuple[Dict[str, Any], pd.DataFrame]:
    """
    Takes parsed JSON data, extracts the primary records, flattens deeply nested
    hierarchies into a pandas DataFrame using dot-notation, and explodes lists
    via Cartesian product.

    This function acts as a robust JSON-to-CSV data extraction engine, specifically
    built to handle complex, heavily nested JSON payloads. It automatically
    handles raw primitive arrays, standardizes column names, and allows for
    powerful wildcard column extraction and list-based row filtering.

    Args:
        json_data (dict | list):
            The parsed JSON structure to explode.
            Can be a deeply nested dictionary, a list of dictionaries, or even a
            raw 2D array (e.g., `[[0, 1], [2, 3]]`) which will be automatically
            mapped to `col1`, `col2`, etc.

        desired_columns (list, optional):
            A list of column names, indices, or wildcards to retain in the final DataFrame.
            - Example (Exact): `["accountId", "user.profile.firstName"]`
            - Example (Index): `["5", 5]` (extracts exactly the 5th column, 1-indexed)
            - Example (Range): `["1-7"]` (extracts columns 1 through 7 inclusive)
            - Example (Prefix Wildcard): `["shippingAddress.*"]`
            - Example (Suffix Wildcard): `["*.statusCode"]`
            If indices are used alongside other matches that produce duplicates
            (e.g., `["5", "1-7"]`), the final columns are safely deduplicated.
            If a requested index or wildcard matches no columns, a warning is printed.

        row_filters (dict, optional):
            A dictionary mapping column names to desired values to filter the dataset.
            Values can be a single exact match or a list of acceptable matches.
            - Example (Exact Match): `{"accountId": "ACC-99823-XYZ"}`
            - Example (List Match): `{"regionCode": ["US-EAST", "EU-WEST"]}`

        remove_duplicates (bool, optional):
            If True, removes completely identical rows from the final DataFrame
            after all filtering has been applied. Defaults to False.

        simplify_columns (bool, optional):
            If True, strips parent prefixes from column names, retaining only the
            final child property (e.g., `parent.child.shortName` becomes `shortName`).
            If doing so would result in duplicate column names (e.g., both `a.code`
            and `b.code` become `code`), their full dot-notation names are preserved
            to prevent collisions. Defaults to False.

        remove_empty (bool | str, optional):
            Cleans the dataset by dropping rows containing missing values (`NaN`, `None`)
            or completely empty strings (`""`, `" "`).
            - `'any'` (or True): Drops the row if *any* of its columns is missing.
            - `'all'`: Drops the row only if *all* of its columns are missing.
            - False: Disables empty-row removal. Defaults to False.

    Returns:
        tuple: A tuple containing:
            - meta_data_dict (dict): A dictionary describing the resulting table size
              and the exact schema/column names generated. The `column_names` key
              contains a dictionary mapping the 1-based numerical index to the
              respective column name (e.g., `{1: "accountId", 2: "user.profile.firstName"}`).
            - pandas_dataframe (pd.DataFrame): The fully flattened, filtered, and
              cleaned pandas DataFrame.

    Raises:
        ValueError: If `json_data` is not a dict or list, or if parameter types are invalid.
        KeyError: If an exact column requested in `desired_columns` or `row_filters`
            does not exist in the dataset.
    """
    if not isinstance(json_data, (dict, list)):
        raise ValueError(f"Invalid input: json_data must be a dictionary or a list, got {type(json_data).__name__}.")

    # 1. Determine the records to process
    if isinstance(json_data, list):
        records = json_data
    elif isinstance(json_data, dict):
        # Look for keys that contain lists of dictionaries
        list_keys = [k for k, v in json_data.items() if isinstance(v, list) and len(v) > 0 and isinstance(v[0], dict)]
        if list_keys:
            # Pick the key with the largest list, assuming that's the primary data payload
            largest_list_key = max(list_keys, key=lambda k: len(json_data[k]))
            records = json_data[largest_list_key]
        else:
            # Otherwise, treat the entire dictionary as a single record
            records = [json_data]
    else:
        # Unreachable given the isinstance guard above; kept as a defensive fallback
        records = [{"value": json_data}]

    # Format records if they are raw lists (e.g., CSV-like JSON [[0, 1], [2, 3]])
    formatted_records = []
    for r in records:
        if isinstance(r, list):
            formatted_records.append({f"col{i+1}": val for i, val in enumerate(r)})
        else:
            formatted_records.append(r)
    records = formatted_records

    # Flatten and explode each record
    flattened_records = []
    for record in records:
        flattened_records.extend(_flatten_and_expand(record))

    # Create DataFrame
    df = pd.DataFrame(flattened_records)

    # Apply Column Filters
    if desired_columns:
        if not isinstance(desired_columns, list):
            raise ValueError("Invalid input: desired_columns must be a list of strings.")

        valid_columns = []
        for req_col in desired_columns:
            req_str = str(req_col).strip()

            # Check for range pattern (e.g., "1-7")
            if re.match(r'^\d+-\d+$', req_str):
                start, end = map(int, req_str.split('-'))
                start_idx = max(0, start - 1)
                end_idx = min(len(df.columns), end)
                matched_cols = list(df.columns[start_idx:end_idx])
                if not matched_cols:
                    print(f"Warning: Index range '{req_str}' is entirely out of bounds for the dataset.")
                valid_columns.extend(matched_cols)

            # Check for a single numeric index (e.g., "5" or 5)
            elif req_str.isdigit():
                idx = int(req_str) - 1
                if 0 <= idx < len(df.columns):
                    valid_columns.append(df.columns[idx])
                else:
                    print(f"Warning: Index '{req_str}' is out of bounds for the dataset.")

            # Check for wildcards
            elif '*' in req_str or '?' in req_str:
                # Use fnmatch to support both prefix.* and *.suffix patterns
                matched_cols = [c for c in df.columns if fnmatch.fnmatch(c, req_str)]
                if not matched_cols:
                    print(f"Warning: Wildcard filter '{req_str}' did not match any columns in the dataset.")
                valid_columns.extend(matched_cols)

            # Exact match
            else:
                if req_str in df.columns:
                    valid_columns.append(req_str)
                else:
                    raise KeyError(f"Requested column '{req_str}' does not exist in the exploded dataset.")

        # Deduplicate the list while preserving order
        seen = set()
        final_columns = [x for x in valid_columns if not (x in seen or seen.add(x))]
        df = df[final_columns]

    # Apply Row Filters
    if row_filters:
        if not isinstance(row_filters, dict):
            raise ValueError("Invalid input: row_filters must be a dictionary.")

        for col, val in row_filters.items():
            if col in df.columns:
                if isinstance(val, list):
                    df = df[df[col].isin(val)]
                else:
                    df = df[df[col] == val]
            else:
                raise KeyError(f"Row filter column '{col}' does not exist in the exploded dataset.")

    # Remove Duplicates
    if remove_duplicates:
        df = df.drop_duplicates().reset_index(drop=True)

    # Simplify Column Names
    if simplify_columns:
        from collections import defaultdict
        short_to_original = defaultdict(list)

        for col in df.columns:
            short_name = col.split('.')[-1]
            short_to_original[short_name].append(col)

        rename_map = {}
        for short_name, originals in short_to_original.items():
            if len(originals) == 1:
                # Unique short name, safe to rename
                rename_map[originals[0]] = short_name
            # If len > 1, there is a collision, so we do NOT add them to the rename map.
            # They will naturally retain their full original names.

        df = df.rename(columns=rename_map)

    # Remove Empty/Blanks/NaNs
    if remove_empty:
        import numpy as np
        # Convert empty strings or whitespace-only strings to proper NaNs
        df = df.replace(r'^\s*$', np.nan, regex=True)

        if isinstance(remove_empty, str) and remove_empty.lower() == 'all':
            df = df.dropna(how='all')
        else:
            df = df.dropna(how='any')

    df = df.reset_index(drop=True)

    # Generate Metadata
    meta_data = {
        "table_size": {
            "rows": int(df.shape[0]),
            "columns": int(df.shape[1])
        },
        "column_names": {i + 1: col for i, col in enumerate(df.columns)}
    }

    return meta_data, df

if __name__ == '__main__':
    # Local test case to verify the function works
    import json
    import os

    current_dir = os.path.dirname(os.path.abspath(__file__))
    sample_file = os.path.join(current_dir, 'sample_json.json')

    if os.path.exists(sample_file):
        print(f"Loading test data from: {sample_file}")
        with open(sample_file, 'r', encoding='utf-8') as f:
            data = json.load(f)

        # Define some filters for testing
        col_filters = [
            "languageCode",
            "laborAllocationCodes.code",
            "sample_test_4.sample_test_4_1"
        ]
        r_filters = {
            "languageCode": "en-CA"
        }

        print(f"Applying Column Filters: {col_filters}")
        print(f"Applying Row Filters: {r_filters}")

        meta, exploded_df = extract_json(
            json_data=data,
            desired_columns=col_filters,
            row_filters=r_filters,
            simplify_columns=False,
            remove_empty='all'
        )

        print("\n--- Metadata Result ---")
        print(json.dumps(meta, indent=4))

        print("\n--- DataFrame Result ---")
        print(exploded_df)
    else:
        print(f"Sample file not found: {sample_file}. Provide a valid JSON file to test.")
@@ -0,0 +1,10 @@
LICENSE
README.md
pyproject.toml
src/json_extract/__init__.py
src/json_extract/json_extract.py
src/json_extract_pandas.egg-info/PKG-INFO
src/json_extract_pandas.egg-info/SOURCES.txt
src/json_extract_pandas.egg-info/dependency_links.txt
src/json_extract_pandas.egg-info/requires.txt
src/json_extract_pandas.egg-info/top_level.txt