aws-sharepoint-connector 1.3.1__tar.gz

@@ -0,0 +1,331 @@
1
+ Metadata-Version: 2.4
2
+ Name: aws-sharepoint-connector
3
+ Version: 1.3.1
4
+ Summary: Provides a connector via Graph API between AWS and SharePoint
5
+ Author: Philip Sinfield, Guy Wheeler
6
+ License-Expression: MIT
7
+ Requires-Dist: azure-identity>=1.24.0
8
+ Requires-Dist: boto3>=1.42.90
9
+ Requires-Dist: pydantic-settings>=2.11.0
10
+ Requires-Dist: pydantic>=2.12.3
11
+ Requires-Dist: requests>=2.32.4
12
+ Requires-Dist: types-requests>=2.33.0.20260503
13
+ Requires-Python: >=3.13
14
+ Project-URL: Repository, https://github.com/ministryofjustice/aws-sharepoint-connector
15
+ Description-Content-Type: text/markdown
16
+
17
+ # AWS - SharePoint Connector
18
+
19
+ [![GitHub release](https://img.shields.io/github/v/release/ministryofjustice/aws-sharepoint-connector)](https://github.com/ministryofjustice/aws-sharepoint-connector/releases)
20
+ [![Python 3.13+](https://img.shields.io/badge/python-3.13%2B-blue)](pyproject.toml)
21
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
22
+ [![Changelog](https://img.shields.io/badge/changelog-CHANGELOG.md-blue)](CHANGELOG.md)
23
+
24
+ [![Python Unit Test](https://github.com/ministryofjustice/aws-sharepoint-connector/actions/workflows/python-unit-test.yml/badge.svg)](https://github.com/ministryofjustice/aws-sharepoint-connector/actions/workflows/python-unit-test.yml)
25
+ [![Python Linting](https://github.com/ministryofjustice/aws-sharepoint-connector/actions/workflows/python-lint.yml/badge.svg)](https://github.com/ministryofjustice/aws-sharepoint-connector/actions/workflows/python-lint.yml)
26
+ [![Python Type Check](https://github.com/ministryofjustice/aws-sharepoint-connector/actions/workflows/python-type.yml/badge.svg)](https://github.com/ministryofjustice/aws-sharepoint-connector/actions/workflows/python-type.yml)
27
+ [![Release Container](https://github.com/ministryofjustice/aws-sharepoint-connector/actions/workflows/release-container.yml/badge.svg)](https://github.com/ministryofjustice/aws-sharepoint-connector/actions/workflows/release-container.yml)
28
+
29
+ Provides a simple connector for moving files between AWS S3 and Microsoft SharePoint (via Microsoft Graph API).
30
+
31
+ Operates in two modes: `write_to_s3` moves files from SharePoint to AWS S3, and `write_to_sharepoint` does the reverse. Instantiate an engine via `create_engine`, then call its `run` method for each source file you want to move.
32
+
33
+ ## Table of contents
34
+
35
+ - [Architecture and flow](#architecture-and-flow)
36
+ - [Configuration](#configuration)
37
+ - [Prerequisites](#prerequisites)
38
+ - [Installation](#installation)
39
+ - [How to run](#how-to-run)
40
+ - [Programmatic API](#programmatic-api)
41
+ - [Error handling and retries](#error-handling-and-retries)
42
+ - [How to modify or extend](#how-to-modify-or-extend)
43
+ - [Troubleshooting](#troubleshooting)
44
+ - [Security considerations](#security-considerations)
45
+ - [Changelog](CHANGELOG.md)
46
+ - [License](#license)
47
+
48
+ ## Architecture and flow
49
+
50
+ ### High level process flow
51
+
52
+ 1. Create an engine with `create_engine(mode, sp_site, sp_library, s3_bucket)`
53
+ 2. Identify the files you want to move, and their destination
54
+ 3. For each file, call `engine.run(source, destination)`, which:
55
+    - Calls `engine.validate_plans` to validate the plan before any transfer begins.
56
+    - Downloads from the source system (SharePoint or S3).
57
+    - Uploads to the destination system (S3 or SharePoint).
58
+    - Verifies the transfer succeeded by comparing byte sizes.
59
+    - Optionally deletes the source file.
60
+ 4. Handle errors per file in your calling code.
61
+
62
+ ### Transfer Modes
63
+
64
+ - **`write_to_s3`**: Download from SharePoint → Upload to S3
65
+ - **`write_to_sharepoint`**: Download from S3 → Upload to SharePoint
66
+
67
+ ### Core Components
68
+
69
+ - `src/connector/main.py`: Public API — `create_engine()`
70
+ - `src/connector/config.py`: Pydantic models for validated configuration.
71
+ - `src/connector/engine.py`: Abstract transfer logic.
72
+ - `src/connector/sharepoint.py`: SharePoint connector.
73
+ - `src/connector/s3.py`: AWS S3 connector.
74
+ - `src/connector/auth.py`: Azure authentication and Graph utilities.
75
+ - `src/connector/utils.py`: Logger and HTTP retry logic.
76
+
77
+ ## Configuration
78
+
79
+ Configuration is parsed by the config classes in `src/connector/config.py`, which store
80
+ the S3 bucket, SharePoint site and library, and Azure app secrets.
81
+
82
+ ### Required Environment Variables
83
+
84
+ These MUST be provided as secrets via Airflow from AWS Secrets Manager.
85
+ DO NOT store them as plain text.
86
+
87
+ | Variable | Type | Description |
88
+ | --- | --- | --- |
89
+ | `SECRET_AZURE_TENANT_ID` | string | Azure tenant UUID for Graph API authentication |
90
+ | `SECRET_AZURE_CLIENT_ID` | string | Azure app registration client ID |
91
+ | `SECRET_AZURE_CLIENT_SECRET` | string | Azure app registration client secret (store in secret manager) |
92
+
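A minimal sketch of a pre-flight check that the three variables above are set before an engine is created. The helper name `missing_secrets` is hypothetical and not part of the package; the connector's own config classes may report missing values differently.

```python
import os
from collections.abc import Mapping

# Variable names taken from the table above.
REQUIRED_SECRETS = (
    "SECRET_AZURE_TENANT_ID",
    "SECRET_AZURE_CLIENT_ID",
    "SECRET_AZURE_CLIENT_SECRET",
)


def missing_secrets(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration only.
    """
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_secrets()
    if missing:
        raise SystemExit(f"Missing required secrets: {', '.join(missing)}")
```

Reporting every missing name at once avoids fixing one variable per failed run.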
93
+ ### Required configuration variables
94
+
95
+ These are passed directly to `create_engine()` from your calling code.
96
+
97
+ **`create_engine(mode, sp_site, sp_library, s3_bucket)`**
98
+
99
+ | Argument | Type | Description |
100
+ | --- | --- | --- |
101
+ | `mode` | string | Transfer direction: `write_to_s3` or `write_to_sharepoint` |
102
+ | `sp_site` | string | SharePoint site name (without URL prefix, e.g. `analytics-site`) |
103
+ | `sp_library` | string | SharePoint document library name (e.g. `Documents`) |
104
+ | `s3_bucket` | string | S3 bucket name (without `s3://` prefix) |
105
+
106
+ These are passed to the engine's `run` method to identify a specific file to move.
107
+
108
+ **`run(source, destination, delete)`**
109
+
110
+ | Argument | Type | Description |
111
+ | --- | --- | --- |
112
+ | `source` | string | Source file path (SharePoint path or S3 key) |
113
+ | `destination` | string | Destination file path (S3 key or SharePoint path) |
114
+ | `delete` | bool | Flag for whether to delete source file after successful transfer |
115
+
116
+ ### Example: SharePoint → S3 (single file)
117
+
118
+ For a SharePoint file at:
119
+ `https://justiceuk.sharepoint.com/sites/analytics-site/Documents/reports/2026/daily_report.csv`
120
+
121
+ To copy to `s3://my-bucket/path/to/daily_report.csv`:
122
+
123
+ ```python
124
+ engine = create_engine(
125
+     mode="write_to_s3",
126
+     sp_site="analytics-site",
127
+     sp_library="Documents",
128
+     s3_bucket="my-bucket",
129
+ )
130
+ plans = [
131
+     {
132
+         "source": "reports/2026/daily_report.csv",
133
+         "destination": "path/to/daily_report.csv",
134
+     }
135
+ ]
136
+ for plan in plans:
137
+     engine.run(plan["source"], plan["destination"])
138
+ ```
139
+
140
+ ### Example: S3 → SharePoint (single file)
141
+
142
+ To move the same file in the other direction:
143
+
144
+ ```python
145
+ engine = create_engine(
145
+     mode="write_to_sharepoint",
146
+     sp_site="analytics-site",
147
+     sp_library="Documents",
148
+     s3_bucket="my-bucket",
149
+ )
150
+ plans = [
151
+     {
152
+         "source": "path/to/daily_report.csv",
153
+         "destination": "reports/2026/daily_report.csv",
154
+     }
155
+ ]
156
+ for plan in plans:
157
+     engine.run(plan["source"], plan["destination"])
159
+ ```
160
+
161
+ ## Prerequisites
162
+
163
+ ### SharePoint site
164
+
165
+ You will need a SharePoint site to serve as the source or destination for files. This can be a pre-existing SharePoint site, though you should be mindful of who will have access to the data.
166
+
167
+ ### Azure app registration
168
+
169
+ An Azure app must be registered in Entra ID. The app is specific to your project: it provides the connection to the SharePoint site, and the connector authenticates as this app using its client secret. To request a new Azure app and have it connected to your SharePoint site, raise a demand request by following the [instructions here](https://user-guide.staff-identity.service.justice.gov.uk/documentation/guidance/appreg.html#application-registrations-sso). You can do this in Terraform against the staff infrastructure authentication services repo (see [EM setup](https://github.com/ministryofjustice/staff-identity-idam-entra-infra/tree/main/terraform/envs/live/hmpps-electronic-monitoring-data) for an example), then post to [#staff-identity-authentication-services](https://moj.enterprise.slack.com/archives/C04AFS7TV7S).
170
+
171
+ The app will require these permissions:
172
+
173
+ - `Sites.Selected`
174
+
175
+ You will then need to speak to the File and Data Management team, who will grant your app access to the specific SharePoint sites you need.
176
+
177
+ ### Azure app details & secret
178
+
179
+ You can view your [app registrations here](https://portal.azure.com/#view/Microsoft_AAD_RegisteredApps/ApplicationsListBlade).
180
+
181
+ Open up the app registration and the tenant ID will be available as `Directory (tenant) ID`.
182
+
183
+ The client ID is available as `Application (client) ID`.
184
+
185
+ The client secret is available from `Manage` -> `Certificates & secrets`. You may not be able to view it there, in which case it may be sent to you when the app is created.
186
+
187
+ ### AWS access
188
+
189
+ If running via Airflow, or from within another repo that runs via Airflow, standard AP credentials and access management apply and will grant access to S3.
190
+
191
+ ### Technical requirements
192
+
193
+ - Python `3.13+`
194
+ - [uv](https://docs.astral.sh/uv/) for dependency management
195
+
196
+ ## Installation
197
+
198
+ ### Local install with `uv`
199
+
200
+ ```bash
201
+ uv sync --all-groups --all-extras
202
+ ```
203
+
204
+ ### Package install in another project
205
+
206
+ If your project uses `uv`, add this package from the public GitHub repo directly and pin to a specific commit SHA:
207
+
208
+ ```bash
209
+ uv add "git+https://github.com/ministryofjustice/aws-sharepoint-connector.git@<commit_sha>"
210
+ ```
211
+
212
+ ### Running tests
213
+
214
+ ```bash
215
+ uv run python -m pytest # all tests with coverage
216
+ uv run python -m pytest tests/unit # unit tests only
217
+ uv run python -m pytest tests/e2e # E2E tests (no real API calls)
218
+ ```
219
+
220
+ ## How to run
221
+
222
+ ### Programmatic API
223
+
224
+ Import `create_engine` from the `connector` package.
225
+ The Azure secret values must be present as environment variables.
226
+
227
+ ```python
228
+ from aws_sharepoint_connector import create_engine
229
+
230
+ engine = create_engine(
230
+     mode="write_to_s3",
231
+     sp_site="analytics-site",
232
+     sp_library="Documents",
233
+     s3_bucket="my-bucket",
234
+ )
235
+
236
+ plans = [
237
+     {
238
+         "source": "reports/2026/daily_report.csv",
239
+         "destination": "path/to/daily_report.csv",
240
+     },
241
+     {
242
+         "source": "reports/2026/summary.csv",
243
+         "destination": "path/to/summary.csv",
244
+     },
245
+ ]
246
+
247
+ for plan in plans:
248
+     # plans are plain dicts, so use key access rather than attribute access
249
+     engine.run(plan["source"], plan["destination"])
250
+ ```
251
+
252
+ You can optionally use the `list_source_files` methods on the engines to obtain a list
253
+ of all files in the S3 bucket or SharePoint library. This can be used to programmatically
254
+ build the list of plans to iterate over.
255
+
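A minimal sketch of turning a file listing into plans. It assumes the listing is an iterable of path strings, which the text above does not confirm, so check the actual return type of `list_source_files` before relying on it. `build_plans` and its parameters are hypothetical names.

```python
import posixpath
from collections.abc import Iterable


def build_plans(
    source_paths: Iterable[str],
    destination_prefix: str,
    suffix: str = ".csv",
) -> list[dict[str, str]]:
    """Build one plan per source path that ends with `suffix`.

    Hypothetical helper: keeps the file name and places it under
    `destination_prefix`.
    """
    prefix = destination_prefix.rstrip("/")
    return [
        {"source": path, "destination": f"{prefix}/{posixpath.basename(path)}"}
        for path in source_paths
        if path.endswith(suffix)
    ]
```

The resulting list has the same shape as the hand-written `plans` in the example above, so it can be passed to the same `for plan in plans` loop.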
256
+ ## Error handling and retries
257
+
258
+ The connector retries transient failures at two levels: individual upload chunks and individual HTTP requests.
259
+
260
+ ### Chunk Upload Strategy
261
+
262
+ For large files, uploads are split into **10 MB chunks**:
263
+
264
+ - **Max 5 consecutive failures** per chunk before aborting the entire transfer
265
+ - **Transient errors** (429 Too Many Requests, 5xx): retried with exponential backoff
266
+ - **Permanent errors** (4xx excluding 429): immediately raised as `ProcessingError` without retry
267
+ - **File pointer reset** on every retry to ensure data consistency
268
+
269
+ Example: if chunk 3 of 5 in a 50 MB file hits a permanent error or fails 5 consecutive times, the transfer aborts and raises `ProcessingError`.
270
+
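The chunk arithmetic can be sketched as follows. This is an illustration, not the connector's code: `chunk_ranges` is a hypothetical name, and treating 10 MB as 10 × 1024 × 1024 bytes is an assumption.

```python
# Assumption: "10 MB" means 10 * 1024 * 1024 bytes.
CHUNK_SIZE = 10 * 1024 * 1024


def chunk_ranges(total_size: int, chunk_size: int = CHUNK_SIZE) -> list[tuple[int, int]]:
    """Return inclusive (start, end) byte offsets for each chunk of a file."""
    return [
        (start, min(start + chunk_size, total_size) - 1)
        for start in range(0, total_size, chunk_size)
    ]
```

Under that assumption, a 50 MB file yields five ranges, and the last range is shorter whenever the size is not an exact multiple of the chunk size.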
271
+ ### HTTP Request Retries
272
+
273
+ All Graph API and HTTP calls use `request_with_retry()`:
274
+
275
+ - **Max 3 attempts** per request
276
+ - **Retryable errors**: 429 Too Many Requests, 5xx Server Errors
277
+ - **Non-retryable errors**: 4xx Client Errors (except 429)
278
+ - **Exponential backoff** between retries
279
+
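The policy above can be sketched generically. This is not the package's `request_with_retry()`: `call_with_retry`, the 1-second base delay, and the doubling factor are assumptions chosen for illustration, and `send` stands in for any function that performs one request and returns its status code.

```python
import time
from collections.abc import Callable


def is_retryable(status: int) -> bool:
    """429 and 5xx are treated as transient; other 4xx are permanent."""
    return status == 429 or 500 <= status <= 599


def call_with_retry(
    send: Callable[[], int],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], object] = time.sleep,
) -> int:
    """Call `send` until it succeeds, fails permanently, or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        status = send()
        if status < 400:
            return status
        if not is_retryable(status):
            raise RuntimeError(f"permanent error {status}, not retrying")
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"giving up after {max_attempts} attempts (last status {status})")
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.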
280
+ ### Batch Processing Behavior
281
+
282
+ Batch iteration is handled by the calling code. The engine processes one file per `engine.run()` call and raises `ProcessingError` on failure. It is the caller's responsibility to decide whether to abort or continue processing remaining files.
283
+
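One way the calling code might make that decision is sketched below. `run_all` and `stop_on_error` are hypothetical names, and the `ProcessingError` class here is a local stand-in so the sketch is self-contained. In real use, import the connector's own exception instead.

```python
class ProcessingError(Exception):
    """Stand-in for the connector's exception (assumption for this sketch)."""


def run_all(engine, plans, stop_on_error=False):
    """Run every plan and return a list of (source, error message) failures.

    With stop_on_error=True the loop aborts at the first failure;
    otherwise it records the failure and continues with the next file.
    """
    failures = []
    for plan in plans:
        try:
            engine.run(plan["source"], plan["destination"])
        except ProcessingError as exc:
            failures.append((plan["source"], str(exc)))
            if stop_on_error:
                break
    return failures
```

Returning the failures rather than raising lets the caller log them all and decide afterwards whether the batch as a whole should be marked as failed.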
284
+ ## How to modify or extend
285
+
286
+ ### 1) Add a new transfer mode
287
+
288
+ 1. Create a new engine class in `src/connector/engine.py` implementing:
289
+    - `download_file(self, source: str) -> bytes`
290
+    - `upload_file(self, content: bytes, destination: str) -> None`
291
+    - `validate_plans(self, plans: list[dict[str, str]]) -> None`
292
+ 2. Register the engine in `MODE_MAP` in `src/connector/main.py`.
293
+ 3. Expand the `Literal` type for `mode` in `create_engine()` in `src/connector/main.py`.
294
+ 4. Add unit tests for success and failure paths, including `validate_plans`.
295
+
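As a rough illustration of the three methods in step 1, here is an in-memory engine. `InMemoryEngine` is a hypothetical class that does not inherit from the project's abstract engine, whose name and constructor are not shown here. A real implementation would subclass it and be registered in `MODE_MAP`.

```python
class InMemoryEngine:
    """Hypothetical engine that moves bytes between two dictionaries."""

    def __init__(self, source_files: dict[str, bytes]) -> None:
        self.source_files = source_files
        self.uploaded: dict[str, bytes] = {}

    def download_file(self, source: str) -> bytes:
        return self.source_files[source]

    def upload_file(self, content: bytes, destination: str) -> None:
        self.uploaded[destination] = content

    def validate_plans(self, plans: list[dict[str, str]]) -> None:
        # Collect every problem before raising, so all are reported at once.
        missing = [p["source"] for p in plans if p["source"] not in self.source_files]
        if missing:
            raise ValueError(
                f"Pre-flight validation failed with {len(missing)} error(s): {missing}"
            )
```

An engine like this is also convenient in unit tests for step 4, since it exercises the success and failure paths without network access.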
296
+ ### 2) Add additional configuration
297
+
298
+ 1. Add a field in `SecretConfig` (`src/connector/config.py`).
299
+ 2. Add validation if needed with a `field_validator`.
300
+ 3. Update `.env` docs and this README.
301
+ 4. Use the field in connector or engine logic.
302
+
303
+ ## Troubleshooting
304
+
305
+ ### Common errors and solutions
306
+
307
+ - **`Pre-flight validation failed with N error(s)`**: One or more sources or destinations could not be verified before transfers started. The error message lists every problem — fix all of them before retrying.
308
+ - **`Library 'X' not found on site`**: Verify `sp_library` spelling and that the app has SharePoint access via Graph API permissions (`Sites.Read.All`, `Files.ReadWrite.All`)
309
+ - **`Source file not found in SharePoint`**: Verify the file exists at the exact path supplied as `source`; check case sensitivity
310
+ - **`Destination folder not found in SharePoint`**: The parent directory of the destination path does not exist in SharePoint; create it before running the connector
311
+ - **`S3 bucket does not exist`** or **`S3 object does not exist`**: Verify bucket name is correct, bucket exists in eu-west-2, and IAM principal has access
312
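+ - **`Access denied to S3 bucket/object`**: Check the IAM policy grants `s3:GetObject` and `s3:PutObject` on the objects and `s3:ListBucket` on the bucket. `HeadObject` and `HeadBucket` are API operations, not IAM actions: they are authorised by `s3:GetObject` and `s3:ListBucket` respectively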
313
+ - **`AADSTS65001` or Graph auth failures**: Verify app permissions (`Sites.Read.All`, `Files.ReadWrite.All`) are granted in Azure; may need admin consent
314
+ - **`File transfer failed: Max retries exceeded`**: File chunk upload exceeded 5 consecutive failures; check network stability, S3/SharePoint availability, and file size
315
+
316
+ ## Security considerations
317
+
318
+ - **Never commit `.env` files or secrets**: Add `.env` to `.gitignore`
319
+ - **Prefer managed identity**: Use workload identity or managed identity in AWS/Azure instead of storing static credentials
320
+ - **Scope permissions tightly**:
321
+   - Azure: Limit app permissions to `Sites.Read.All` and `Files.ReadWrite.All` only
322
+   - AWS: Restrict IAM policy to specific bucket and prefix (e.g., `arn:aws:s3:::bucket/prefix/*`)
323
+ - **Rotate secrets**: Change Azure client secrets every 90 days and update secret manager
324
+ - **Store secrets securely**: Use AWS Secrets Manager, Azure Key Vault, or Kubernetes secrets (never hardcode in env vars)
325
+ - **Audit access**: Monitor S3 CloudTrail and SharePoint audit logs for sensitive data access
326
+ - **Network isolation**: Consider running connector in private network with appropriate egress controls
327
+ - **Data residency**: Ensure S3 bucket and SharePoint site comply with data residency requirements
328
+
329
+ ## License
330
+
331
+ MIT License. See `LICENSE`.