imsciences 0.9.5.5__tar.gz → 0.9.5.6__tar.gz
This diff shows the content of publicly available package versions as released to a supported registry. It is provided for informational purposes only and reflects the changes between the two versions as they appear in the public registry.
Potentially problematic release: this version of imsciences might be problematic.
- {imsciences-0.9.5.5 → imsciences-0.9.5.6}/PKG-INFO +17 -7
- {imsciences-0.9.5.5 → imsciences-0.9.5.6}/README.md +16 -7
- {imsciences-0.9.5.5 → imsciences-0.9.5.6}/imsciences/geo.py +167 -61
- {imsciences-0.9.5.5 → imsciences-0.9.5.6}/imsciences.egg-info/PKG-INFO +17 -7
- {imsciences-0.9.5.5 → imsciences-0.9.5.6}/imsciences.egg-info/requires.txt +1 -0
- {imsciences-0.9.5.5 → imsciences-0.9.5.6}/setup.py +2 -2
- {imsciences-0.9.5.5 → imsciences-0.9.5.6}/LICENSE.txt +0 -0
- {imsciences-0.9.5.5 → imsciences-0.9.5.6}/imsciences/__init__.py +0 -0
- {imsciences-0.9.5.5 → imsciences-0.9.5.6}/imsciences/mmm.py +0 -0
- {imsciences-0.9.5.5 → imsciences-0.9.5.6}/imsciences/pull.py +0 -0
- {imsciences-0.9.5.5 → imsciences-0.9.5.6}/imsciences/unittesting.py +0 -0
- {imsciences-0.9.5.5 → imsciences-0.9.5.6}/imsciences/vis.py +0 -0
- {imsciences-0.9.5.5 → imsciences-0.9.5.6}/imsciences.egg-info/PKG-INFO-IMS-24Ltp-3 +0 -0
- {imsciences-0.9.5.5 → imsciences-0.9.5.6}/imsciences.egg-info/SOURCES.txt +0 -0
- {imsciences-0.9.5.5 → imsciences-0.9.5.6}/imsciences.egg-info/dependency_links.txt +0 -0
- {imsciences-0.9.5.5 → imsciences-0.9.5.6}/imsciences.egg-info/top_level.txt +0 -0
- {imsciences-0.9.5.5 → imsciences-0.9.5.6}/setup.cfg +0 -0
{imsciences-0.9.5.5 → imsciences-0.9.5.6}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: imsciences
-Version: 0.9.5.5
+Version: 0.9.5.6
 Summary: IMS Data Processing Package
 Author: IMS
 Author-email: cam@im-sciences.com
@@ -24,6 +24,7 @@ Requires-Dist: yfinance
 Requires-Dist: holidays
 Requires-Dist: google-analytics-data
 Requires-Dist: geopandas
+Requires-Dist: geopy
 
 # IMS Package Documentation
 
@@ -49,6 +50,7 @@ Table of Contents
 5. [Data Pulling](#data-pulling)
 6. [Installation](#installation)
 7. [License](#license)
+8. [Roadmap](#roadmap)
 
 ---
 
@@ -249,14 +251,14 @@ ims_vis = datavis()
 - **Example**: `pull_ga('GeoExperiment-31c5f5db2c39.json', '111111111', '2023-10-15', 'United Kingdom', ['totalUsers', 'newUsers'])`
 
 ## 2. `process_itv_analysis`
-- **Description**:
-- **Usage**: `process_itv_analysis(self, raw_df, itv_path, cities_path, media_spend_path, output_path,
-- **Example**: `process_itv_analysis(df, 'itv regional mapping.csv', 'Geo_Mappings_with_Coordinates.xlsx', 'IMS.xlsx', 'itv_for_test_analysis_itvx.csv', ['West', 'Westcountry', 'Tyne Tees'], ['Central Scotland', 'North Scotland'])`
+- **Description**: Processes region-level data for geo experiments by mapping ITV regions, grouping selected metrics, merging with media spend data, and saving the result.
+- **Usage**: `process_itv_analysis(self, raw_df, itv_path, cities_path, media_spend_path, output_path, test_group, control_group, columns_to_aggregate, aggregator_list)`
+- **Example**: `process_itv_analysis(df, 'itv regional mapping.csv', 'Geo_Mappings_with_Coordinates.xlsx', 'IMS.xlsx', 'itv_for_test_analysis_itvx.csv', ['West', 'Westcountry', 'Tyne Tees'], ['Central Scotland', 'North Scotland'], ['newUsers', 'transactions'], ['sum', 'sum'])`
 
 ## 3. `process_city_analysis`
-- **Description**: Processes city-level data for geo experiments by grouping
-- **Usage**: `process_city_analysis(raw_df, spend_df, output_path,
-- **Example**: `process_city_analysis(df, spend, output, ['Barnsley'], ['Aberdeen'], 'newUsers')`
+- **Description**: Processes city-level data for geo experiments by grouping selected metrics, merging with media spend data, and saving the result.
+- **Usage**: `process_city_analysis(raw_df, spend_df, output_path, test_group, control_group, columns_to_aggregate, aggregator_list)`
+- **Example**: `process_city_analysis(df, spend, output, ['Barnsley'], ['Aberdeen'], ['newUsers', 'transactions'], ['sum', 'sum'])`
 
 ---
 
@@ -343,3 +345,11 @@ pip install imsciences
 This project is licensed under the MIT License.
 
 ---
+
+## Roadmap
+
+- [Fixes]: Naming conventions are inconsistent with, or have changed from, previous seasonality tools (e.g. 'seas_nyd' is now named 'seas_new_years_day', 'week_1' is now named 'seas_1').
+- [Fixes]: Naming conventions can be inconsistent within the data pull (the suffix on some variables is 'gb', on others it is 'uk', and others have no suffix); there is also a lack of consistency for global holidays/events (Christmas, Easter, Halloween, etc.) - some have a regional suffix and others don't.
+- [Additions]: Add new data pulls for more macro and seasonal variables.
+
+---
{imsciences-0.9.5.5 → imsciences-0.9.5.6}/README.md

@@ -22,6 +22,7 @@ Table of Contents
 5. [Data Pulling](#data-pulling)
 6. [Installation](#installation)
 7. [License](#license)
+8. [Roadmap](#roadmap)
 
 ---
 
@@ -222,14 +223,14 @@ ims_vis = datavis()
 - **Example**: `pull_ga('GeoExperiment-31c5f5db2c39.json', '111111111', '2023-10-15', 'United Kingdom', ['totalUsers', 'newUsers'])`
 
 ## 2. `process_itv_analysis`
-- **Description**:
-- **Usage**: `process_itv_analysis(self, raw_df, itv_path, cities_path, media_spend_path, output_path,
-- **Example**: `process_itv_analysis(df, 'itv regional mapping.csv', 'Geo_Mappings_with_Coordinates.xlsx', 'IMS.xlsx', 'itv_for_test_analysis_itvx.csv', ['West', 'Westcountry', 'Tyne Tees'], ['Central Scotland', 'North Scotland'])`
+- **Description**: Processes region-level data for geo experiments by mapping ITV regions, grouping selected metrics, merging with media spend data, and saving the result.
+- **Usage**: `process_itv_analysis(self, raw_df, itv_path, cities_path, media_spend_path, output_path, test_group, control_group, columns_to_aggregate, aggregator_list)`
+- **Example**: `process_itv_analysis(df, 'itv regional mapping.csv', 'Geo_Mappings_with_Coordinates.xlsx', 'IMS.xlsx', 'itv_for_test_analysis_itvx.csv', ['West', 'Westcountry', 'Tyne Tees'], ['Central Scotland', 'North Scotland'], ['newUsers', 'transactions'], ['sum', 'sum'])`
 
 ## 3. `process_city_analysis`
-- **Description**: Processes city-level data for geo experiments by grouping
-- **Usage**: `process_city_analysis(raw_df, spend_df, output_path,
-- **Example**: `process_city_analysis(df, spend, output, ['Barnsley'], ['Aberdeen'], 'newUsers')`
+- **Description**: Processes city-level data for geo experiments by grouping selected metrics, merging with media spend data, and saving the result.
+- **Usage**: `process_city_analysis(raw_df, spend_df, output_path, test_group, control_group, columns_to_aggregate, aggregator_list)`
+- **Example**: `process_city_analysis(df, spend, output, ['Barnsley'], ['Aberdeen'], ['newUsers', 'transactions'], ['sum', 'sum'])`
 
 ---
 
@@ -315,4 +316,12 @@ pip install imsciences
 
 This project is licensed under the MIT License.
 
----
+---
+
+## Roadmap
+
+- [Fixes]: Naming conventions are inconsistent with, or have changed from, previous seasonality tools (e.g. 'seas_nyd' is now named 'seas_new_years_day', 'week_1' is now named 'seas_1').
+- [Fixes]: Naming conventions can be inconsistent within the data pull (the suffix on some variables is 'gb', on others it is 'uk', and others have no suffix); there is also a lack of consistency for global holidays/events (Christmas, Easter, Halloween, etc.) - some have a regional suffix and others don't.
+- [Additions]: Add new data pulls for more macro and seasonal variables.
+
+---
{imsciences-0.9.5.5 → imsciences-0.9.5.6}/imsciences/geo.py

@@ -26,14 +26,14 @@ class geoprocessing:
         print(" - Example: pull_ga('GeoExperiment-31c5f5db2c39.json', '111111111', '2023-10-15', 'United Kingdom', ['totalUsers', 'newUsers'])")
 
         print("\n2. process_itv_analysis")
-        print(" - Description:
-        print(" - Usage: process_itv_analysis(
-        print(" - Example:process_itv_analysis(df,'
+        print(" - Description: Processes region-level data for geo experiments by mapping ITV regions, grouping selected metrics, merging with media spend data, and saving the result.")
+        print(" - Usage: process_itv_analysis(raw_df, itv_path, cities_path, media_spend_path, output_path, test_group, control_group, columns_to_aggregate, aggregator_list)")
+        print(" - Example: process_itv_analysis(df, 'itv_regional_mapping.csv', 'Geo_Mappings_with_Coordinates.xlsx', 'IMS.xlsx', 'itv_for_test_analysis_itvx.csv', ['West', 'Westcountry', 'Tyne Tees'], ['Central Scotland', 'North Scotland'], ['newUsers', 'transactions'], ['sum', 'sum'])")
 
         print("\n3. process_city_analysis")
-        print(" - Description: Processes city-level data for geo experiments by grouping
-        print(" - Usage: process_city_analysis(
-        print(" - Example:process_city_analysis(df, spend, output, ['Barnsley'], ['Aberdeen'], 'newUsers')")
+        print(" - Description: Processes city-level data for geo experiments by grouping selected metrics, merging with media spend data, and saving the result.")
+        print(" - Usage: process_city_analysis(raw_data, spend_data, output_path, test_group, control_group, columns_to_aggregate, aggregator_list)")
+        print(" - Example: process_city_analysis(df, spend, 'output.csv', ['Barnsley'], ['Aberdeen'], ['newUsers', 'transactions'], ['sum', 'mean'])")
 
     def pull_ga(self, credentials_file, property_id, start_date, country, metrics):
         """
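For readers who want to try the updated signatures shown in these help strings, here is a minimal usage sketch. The `geoprocessing()` instantiation mirrors the README's `ims_vis = datavis()` convention, the import path is assumed from the file layout (`imsciences/geo.py`), and the toy DataFrames are hypothetical, not package data:

```python
import pandas as pd
from imsciences.geo import geoprocessing  # assumed import path

ims_geo = geoprocessing()

# Toy inputs shaped like the documented requirements: raw data needs
# 'date' and 'city' plus the metric columns; spend data needs
# 'date', 'geo', and 'cost'.
raw = pd.DataFrame({
    'date': ['2023-10-15', '2023-10-15'],
    'city': ['Barnsley', 'Aberdeen'],
    'newUsers': [120, 95],
    'transactions': [14, 11],
})
spend = pd.DataFrame({
    'date': ['2023-10-15'],
    'geo': ['Barnsley'],
    'cost': ['£1,000.50'],  # currency formatting is stripped by the function
})

result = ims_geo.process_city_analysis(
    raw, spend, 'output.csv',
    ['Barnsley'],                  # test_group
    ['Aberdeen'],                  # control_group
    ['newUsers', 'transactions'],  # columns_to_aggregate
    ['sum', 'mean'],               # aggregator_list
)
```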
@@ -137,23 +137,28 @@ class geoprocessing:
             logging.error(f"An unexpected error occurred: {e}")
             raise
 
-    def process_itv_analysis(self, raw_df, itv_path, cities_path, media_spend_path, output_path,
+    def process_itv_analysis(self, raw_df, itv_path, cities_path, media_spend_path, output_path, test_group, control_group, columns_to_aggregate, aggregator_list):
         """
         Process ITV analysis by mapping geos, grouping data, and merging with media spend.
 
         Parameters:
-        raw_df (pd.DataFrame): Raw input data containing 'geo',
+        raw_df (pd.DataFrame): Raw input data containing columns such as 'geo', plus any metrics to be aggregated.
         itv_path (str): Path to the ITV regional mapping CSV file.
         cities_path (str): Path to the Geo Mappings Excel file.
         media_spend_path (str): Path to the media spend Excel file.
         output_path (str): Path to save the final output CSV file.
         group1 (list): List of geo regions for group 1.
         group2 (list): List of geo regions for group 2.
+        columns_to_aggregate (list): List of columns in `raw_df` that need aggregation.
+        aggregator_list (list): List of aggregation operations (e.g. ["sum", "mean", ...]) for corresponding columns.
 
         Returns:
-
+        pd.DataFrame: The final merged and aggregated DataFrame.
         """
-
+
+        # -----------------------
+        # 1. Load and preprocess data
+        # -----------------------
         itv = pd.read_csv(itv_path).dropna(subset=['Latitude', 'Longitude'])
         cities = pd.read_excel(cities_path).dropna(subset=['Latitude', 'Longitude'])
 
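The geometry construction itself falls between these hunks (old lines 159-162 are not rendered). A common way to build the `geometry` column that the `gpd.GeoDataFrame(..., geometry='geometry')` lines below consume is `gpd.points_from_xy`; this is a hedged sketch of that elided step, not necessarily the package's exact code:

```python
import geopandas as gpd
import pandas as pd

# Mirrors the context line above; the file name comes from the
# package's own example strings.
itv = pd.read_csv('itv_regional_mapping.csv').dropna(subset=['Latitude', 'Longitude'])

# Build shapely Points from the coordinate columns (x = longitude, y = latitude).
itv['geometry'] = gpd.points_from_xy(itv['Longitude'], itv['Latitude'])
itv_gdf = gpd.GeoDataFrame(itv, geometry='geometry')
```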
@@ -163,59 +168,114 @@ class geoprocessing:
         itv_gdf = gpd.GeoDataFrame(itv, geometry='geometry')
         cities_gdf = gpd.GeoDataFrame(cities, geometry='geometry')
 
-        #
-
+        # -----------------------
+        # 2. Perform spatial join to match geos
+        # -----------------------
+        joined_gdf = gpd.sjoin_nearest(
+            itv_gdf,
+            cities_gdf,
+            how='inner',
+            distance_col='distance'
+        )
         matched_result = joined_gdf[['ITV Region', 'geo']].drop_duplicates(subset=['geo'])
 
         # Handle unmatched geos
         unmatched_geos = set(cities_gdf['geo']) - set(matched_result['geo'])
         unmatched_cities_gdf = cities_gdf[cities_gdf['geo'].isin(unmatched_geos)]
-
+
+        nearest_unmatched_gdf = gpd.sjoin_nearest(
+            unmatched_cities_gdf,
+            itv_gdf,
+            how='inner',
+            distance_col='distance'
+        )
 
         unmatched_geo_mapping = nearest_unmatched_gdf[['geo', 'ITV Region', 'Latitude_right', 'Longitude_right']]
         unmatched_geo_mapping.columns = ['geo', 'ITV Region', 'Nearest_Latitude', 'Nearest_Longitude']
 
         matched_result = pd.concat([matched_result, unmatched_geo_mapping[['geo', 'ITV Region']]])
 
-        #
+        # -----------------------
+        # 3. Merge with raw data
+        # -----------------------
         merged_df = pd.merge(raw_df, matched_result, on='geo', how='left')
-        merged_df = merged_df[merged_df["geo"] != "(not set)"].drop(columns=['geo'])
-        merged_df = merged_df.rename(columns={'ITV Region': 'geo', 'newUsers': 'response'})
 
-
-
+        # Remove rows where geo is "(not set)"
+        merged_df = merged_df[merged_df["geo"] != "(not set)"]
+
+        # Replace 'geo' column with 'ITV Region'
+        # - We'll keep the "ITV Region" naming for clarity, but you can rename if you like.
+        merged_df = merged_df.drop(columns=['geo'])
+        merged_df = merged_df.rename(columns={'ITV Region': 'geo'})
+
+        # -----------------------
+        # 4. Group and aggregate
+        # -----------------------
+        # Build the dictionary for aggregation: {col1: agg1, col2: agg2, ...}
+        aggregation_dict = dict(zip(columns_to_aggregate, aggregator_list))
 
-
-
+        # Perform the groupby operation
+        grouped_df = merged_df.groupby(['date', 'geo'], as_index=False).agg(aggregation_dict)
+
+        # -----------------------
+        # 5. Filter for test & control groups
+        # -----------------------
+        filtered_df = grouped_df[grouped_df['geo'].isin(test_group + control_group)].copy()
+
+        assignment_map = {city: 1 for city in test_group}
+        assignment_map.update({city: 2 for city in control_group})
         filtered_df['assignment'] = filtered_df['geo'].map(assignment_map)
 
-        #
+        # -----------------------
+        # 6. Merge with media spend
+        # -----------------------
         media_spend_df = pd.read_excel(media_spend_path).rename(columns={'Cost': 'cost'})
-
+
+        # Merge on date and geo
+        analysis_df = pd.merge(
+            filtered_df,
+            media_spend_df,
+            on=['date', 'geo'],
+            how='left'
+        )
+
+        # Fill missing cost with 0
         analysis_df['cost'] = analysis_df['cost'].fillna(0)
 
-        #
+        # -----------------------
+        # 7. Save to CSV
+        # -----------------------
         analysis_df.to_csv(output_path, index=False)
-
-        return analysis_df
+
+        return analysis_df
 
-    def process_city_analysis(self, raw_data, spend_data, output_path,
+    def process_city_analysis(self, raw_data, spend_data, output_path, test_group, control_group, columns_to_aggregate, aggregator_list):
         """
-        Process city analysis by grouping data,
+        Process city-level analysis by grouping data, applying custom aggregations,
+        and merging with spend data.
 
         Parameters:
-        raw_data (str or pd.DataFrame):
-
-
-
-
-
+        raw_data (str or pd.DataFrame):
+            - Raw input data as a file path (CSV/XLSX) or a DataFrame.
+            - Must contain 'date' and 'city' columns, plus any columns to be aggregated.
+        spend_data (str or pd.DataFrame):
+            - Spend data as a file path (CSV/XLSX) or a DataFrame.
+            - Must contain 'date', 'geo', and 'cost' columns.
+        output_path (str):
+            - Path to save the final output file (CSV or XLSX).
+        group1 (list):
+            - List of city regions to be considered "Test Group" or "Group 1".
+        group2 (list):
+            - List of city regions to be considered "Control Group" or "Group 2".
+        columns_to_aggregate (list):
+            - List of columns to apply aggregation to, e.g. ['newUsers', 'transactions'].
+        aggregator_list (list):
+            - List of corresponding aggregation functions, e.g. ['sum', 'mean'].
+            - Must be the same length as columns_to_aggregate.
 
         Returns:
-        pd.DataFrame:
+        pd.DataFrame: The final merged, aggregated DataFrame.
         """
-        import pandas as pd
-        import os
 
         def read_file(data):
             """Helper function to handle file paths or return DataFrame directly."""
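To make the two `gpd.sjoin_nearest` calls in the hunk above concrete, here is a self-contained toy example (requires geopandas >= 0.10, which introduced `sjoin_nearest`; the coordinates are hypothetical lon/lat points with no CRS set, so the `distance` column comes out in degrees):

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical region centroids and city locations.
regions = gpd.GeoDataFrame(
    {'ITV Region': ['West', 'Tyne Tees']},
    geometry=[Point(-2.6, 51.5), Point(-1.6, 55.0)],
)
cities = gpd.GeoDataFrame(
    {'geo': ['Bristol', 'Newcastle']},
    geometry=[Point(-2.59, 51.45), Point(-1.61, 54.97)],
)

# Each left-hand row is paired with its nearest right-hand row;
# 'distance' records how far apart the matched geometries are.
joined = gpd.sjoin_nearest(cities, regions, how='inner', distance_col='distance')
print(joined[['geo', 'ITV Region', 'distance']])
```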
@@ -239,39 +299,85 @@ class geoprocessing:
             else:
                 raise ValueError("Unsupported file type. Please use a CSV or XLSX file.")
 
-        #
+        # -----------------------
+        # 1. Read and validate data
+        # -----------------------
         raw_df = read_file(raw_data)
         spend_df = read_file(spend_data)
 
-        #
-        required_columns = {'date', 'city'
-
-
-
+        # Columns we minimally need in raw_df
+        required_columns = {'date', 'city'}
+        # Ensure the columns to aggregate are there
+        required_columns = required_columns.union(set(columns_to_aggregate))
+        missing_in_raw = required_columns - set(raw_df.columns)
+        if missing_in_raw:
+            raise ValueError(
+                f"The raw data is missing the following required columns: {missing_in_raw}"
+            )
+
+        # Validate spend data
         spend_required_columns = {'date', 'geo', 'cost'}
-
-
-
+        missing_in_spend = spend_required_columns - set(spend_df.columns)
+        if missing_in_spend:
+            raise ValueError(
+                f"The spend data is missing the following required columns: {missing_in_spend}"
+            )
+
+        # -----------------------
+        # 2. Clean and prepare spend data
+        # -----------------------
         # Convert cost column to numeric after stripping currency symbols and commas
-        spend_df['cost'] =
-
-
-
-
-
-
-
-
-
-
+        spend_df['cost'] = (
+            spend_df['cost']
+            .replace('[^\\d.]', '', regex=True)
+            .astype(float)
+        )
+
+        # -----------------------
+        # 3. Prepare raw data
+        # -----------------------
+        # Rename 'city' to 'geo' for consistency
+        raw_df = raw_df.rename(columns={'city': 'geo'})
+
+        # Filter only the relevant geos
+        filtered_df = raw_df[raw_df['geo'].isin(test_group + control_group)].copy()
+
+        # -----------------------
+        # 4. Group and aggregate
+        # -----------------------
+        # Create a dictionary of {col: agg_function}
+        if len(columns_to_aggregate) != len(aggregator_list):
+            raise ValueError(
+                "columns_to_aggregate and aggregator_list must have the same length."
+            )
+        aggregation_dict = dict(zip(columns_to_aggregate, aggregator_list))
+
+        # Perform groupby using the aggregator dictionary
+        grouped_df = filtered_df.groupby(['date', 'geo'], as_index=False).agg(aggregation_dict)
+
+        # -----------------------
+        # 5. Map groups (Test vs. Control)
+        # -----------------------
+        assignment_map = {city: "Test Group" for city in test_group}
+        assignment_map.update({city: "Control Group" for city in control_group})
         grouped_df['assignment'] = grouped_df['geo'].map(assignment_map)
 
-        #
-
+        # -----------------------
+        # 6. Merge with spend data
+        # -----------------------
+        merged_df = pd.merge(
+            grouped_df,
+            spend_df,  # has date, geo, cost
+            on=['date', 'geo'],
+            how='left'
+        )
+
+        # Fill missing cost with 0
         merged_df['cost'] = merged_df['cost'].fillna(0)
 
-        #
+        # -----------------------
+        # 7. Write out results
+        # -----------------------
         write_file(merged_df, output_path)
 
-        return merged_df
+        return merged_df
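The aggregation-dictionary pattern and the cost-cleaning expression above are worth seeing in isolation; a standalone sketch on toy data (the column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2023-10-15', '2023-10-15', '2023-10-16'],
    'geo': ['Barnsley', 'Barnsley', 'Aberdeen'],
    'newUsers': [10, 5, 8],
    'transactions': [2, 1, 3],
})

# Two parallel lists zip into {'newUsers': 'sum', 'transactions': 'mean'},
# which groupby.agg then applies column-by-column.
aggregation_dict = dict(zip(['newUsers', 'transactions'], ['sum', 'mean']))
grouped = df.groupby(['date', 'geo'], as_index=False).agg(aggregation_dict)
print(grouped)

# The same regex used in step 2: strip everything except digits and dots,
# then cast. '£1,234.50' -> 1234.5, '$99' -> 99.0
cost = pd.Series(['£1,234.50', '$99'])
print(cost.replace('[^\\d.]', '', regex=True).astype(float).tolist())
```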

{imsciences-0.9.5.5 → imsciences-0.9.5.6}/imsciences.egg-info/PKG-INFO

The changes to this file are identical to the PKG-INFO diff above: the version bump to 0.9.5.6, the added `Requires-Dist: geopy`, and the same embedded README updates (function documentation and the new Roadmap section).
{imsciences-0.9.5.5 → imsciences-0.9.5.6}/setup.py

@@ -8,7 +8,7 @@ def read_md(file_name):
         return f.read()
     return ''
 
-VERSION = '0.9.5.5'
+VERSION = '0.9.5.6'
 DESCRIPTION = 'IMS Data Processing Package'
 LONG_DESCRIPTION = read_md('README.md')
 
@@ -24,7 +24,7 @@ setup(
     packages=find_packages(),
     install_requires=[
         "pandas", "plotly", "numpy", "fredapi", "xgboost", "scikit-learn",
-        "bs4", "yfinance", "holidays", "google-analytics-data", "geopandas",
+        "bs4", "yfinance", "holidays", "google-analytics-data", "geopandas", "geopy"
     ],
     keywords=['data processing', 'apis', 'data analysis', 'data visualization', 'machine learning'],
     classifiers=[
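Since `geopy` is a new install-time dependency in this release, a quick post-upgrade sanity check might look like this (a hypothetical session; version numbers will vary):

```python
# pip install --upgrade imsciences==0.9.5.6
import geopandas
import geopy

import imsciences

# Both geo dependencies resolve alongside the package itself.
print(geopandas.__version__, geopy.__version__)
```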
imsciences.egg-info/requires.txt gains one line (presumably the new geopy requirement; the viewer does not render that diff here). The remaining files listed above are unchanged.