aws-sharepoint-connector 1.3.1__tar.gz
- aws_sharepoint_connector-1.3.1/PKG-INFO +331 -0
- aws_sharepoint_connector-1.3.1/README.md +315 -0
- aws_sharepoint_connector-1.3.1/pyproject.toml +164 -0
- aws_sharepoint_connector-1.3.1/src/aws_sharepoint_connector/__init__.py +6 -0
- aws_sharepoint_connector-1.3.1/src/aws_sharepoint_connector/auth.py +55 -0
- aws_sharepoint_connector-1.3.1/src/aws_sharepoint_connector/config.py +112 -0
- aws_sharepoint_connector-1.3.1/src/aws_sharepoint_connector/constants.py +12 -0
- aws_sharepoint_connector-1.3.1/src/aws_sharepoint_connector/engine.py +416 -0
- aws_sharepoint_connector-1.3.1/src/aws_sharepoint_connector/exceptions.py +29 -0
- aws_sharepoint_connector-1.3.1/src/aws_sharepoint_connector/main.py +95 -0
- aws_sharepoint_connector-1.3.1/src/aws_sharepoint_connector/s3.py +207 -0
- aws_sharepoint_connector-1.3.1/src/aws_sharepoint_connector/sharepoint.py +564 -0
- aws_sharepoint_connector-1.3.1/src/aws_sharepoint_connector/utils.py +113 -0
|
@@ -0,0 +1,331 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: aws-sharepoint-connector
|
|
3
|
+
Version: 1.3.1
|
|
4
|
+
Summary: Provides a connector via Graph API between AWS and SharePoint
|
|
5
|
+
Author: Philip Sinfield, Guy Wheeler
|
|
6
|
+
License-Expression: MIT
|
|
7
|
+
Requires-Dist: azure-identity>=1.24.0
|
|
8
|
+
Requires-Dist: boto3>=1.42.90
|
|
9
|
+
Requires-Dist: pydantic-settings>=2.11.0
|
|
10
|
+
Requires-Dist: pydantic>=2.12.3
|
|
11
|
+
Requires-Dist: requests>=2.32.4
|
|
12
|
+
Requires-Dist: types-requests>=2.33.0.20260503
|
|
13
|
+
Requires-Python: >=3.13
|
|
14
|
+
Project-URL: Repository, https://github.com/ministryofjustice/aws-sharepoint-connector
|
|
15
|
+
Description-Content-Type: text/markdown
|
|
16
|
+
|
|
17
|
+
# AWS - SharePoint Connector
|
|
18
|
+
|
|
19
|
+
[](https://github.com/ministryofjustice/aws-sharepoint-connector/releases)
|
|
20
|
+
[](pyproject.toml)
|
|
21
|
+
[](LICENSE)
|
|
22
|
+
[](CHANGELOG.md)
|
|
23
|
+
|
|
24
|
+
[](https://github.com/ministryofjustice/aws-sharepoint-connector/actions/workflows/python-unit-test.yml)
|
|
25
|
+
[](https://github.com/ministryofjustice/aws-sharepoint-connector/actions/workflows/python-lint.yml)
|
|
26
|
+
[](https://github.com/ministryofjustice/aws-sharepoint-connector/actions/workflows/python-type.yml)
|
|
27
|
+
[](https://github.com/ministryofjustice/aws-sharepoint-connector/actions/workflows/release-container.yml)
|
|
28
|
+
|
|
29
|
+
Provides a simple connector for moving files between AWS S3 and Microsoft SharePoint (via Microsoft Graph API).
|
|
30
|
+
|
|
31
|
+
Operates in two modes: `write_to_s3` moves files from SharePoint to AWS S3, and `write_to_sharepoint` does the reverse. Instantiate an engine via `create_engine`, then call its `run` method for each source file you want to move.
|
|
32
|
+
|
|
33
|
+
## Table of contents
|
|
34
|
+
|
|
35
|
+
- [Architecture and flow](#architecture-and-flow)
|
|
36
|
+
- [Configuration](#configuration)
|
|
37
|
+
- [Prerequisites](#prerequisites)
|
|
38
|
+
- [Installation](#installation)
|
|
39
|
+
- [How to run](#how-to-run)
|
|
40
|
+
- [Programmatic API](#programmatic-api)
|
|
41
|
+
- [Error handling and retries](#error-handling-and-retries)
|
|
42
|
+
- [How to modify or extend](#how-to-modify-or-extend)
|
|
43
|
+
- [Troubleshooting](#troubleshooting)
|
|
44
|
+
- [Security considerations](#security-considerations)
|
|
45
|
+
- [Changelog](CHANGELOG.md)
|
|
46
|
+
- [License](#license)
|
|
47
|
+
|
|
48
|
+
## Architecture and flow
|
|
49
|
+
|
|
50
|
+
### High level process flow
|
|
51
|
+
|
|
52
|
+
1. Create an engine with `create_engine(mode, sp_site, sp_library, s3_bucket)`
|
|
53
|
+
2. Identify the files you want to move, and their destination
|
|
54
|
+
3. For each file, call `engine.run(source, destination)`:
|
|
55
|
+
- Calls `engine.validate_plans` to validate the plan before any transfer begins.
|
|
56
|
+
- Download from the source system (SharePoint or S3).
|
|
57
|
+
- Upload to the destination system (S3 or SharePoint).
|
|
58
|
+
- Verify transfer was successful by comparing byte size.
|
|
59
|
+
- Optionally delete the source file
|
|
60
|
+
4. Handle errors per file in your calling code.
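
The per-file steps above can be sketched as follows. This is a minimal in-memory illustration, not the package's own `engine.py`: the `transfer` function and the two dict-backed stores are hypothetical stand-ins for SharePoint and S3.

```python
def transfer(source_store: dict[str, bytes], dest_store: dict[str, bytes],
             source: str, destination: str, delete: bool = False) -> int:
    """Copy one file between two stores and return the number of bytes moved."""
    # Validate before transferring anything.
    if source not in source_store:
        raise FileNotFoundError(f"Source file not found: {source}")

    content = source_store[source]      # download from the source system
    dest_store[destination] = content   # upload to the destination system

    # Verify the transfer by comparing byte sizes.
    if len(dest_store[destination]) != len(content):
        raise RuntimeError(f"Size mismatch for {destination}")

    if delete:                          # optionally delete the source file
        del source_store[source]
    return len(content)


# Hypothetical stores standing in for a SharePoint library and an S3 bucket.
sharepoint = {"reports/2026/daily_report.csv": b"id,value\n1,42\n"}
s3: dict[str, bytes] = {}
moved = transfer(sharepoint, s3, "reports/2026/daily_report.csv",
                 "path/to/daily_report.csv", delete=True)
```

With `delete=True` the source entry is removed only after the size check has passed, which mirrors the order of the steps listed above.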
|
|
61
|
+
|
|
62
|
+
### Transfer Modes
|
|
63
|
+
|
|
64
|
+
- **`write_to_s3`**: Download from SharePoint → Upload to S3
|
|
65
|
+
- **`write_to_sharepoint`**: Download from S3 → Upload to SharePoint
|
|
66
|
+
|
|
67
|
+
### Core Components
|
|
68
|
+
|
|
69
|
+
- `src/aws_sharepoint_connector/main.py`: Public API (`create_engine()`).
|
|
70
|
+
- `src/aws_sharepoint_connector/config.py`: Pydantic models for validated configuration.
|
|
71
|
+
- `src/aws_sharepoint_connector/engine.py`: Abstract transfer logic.
|
|
72
|
+
- `src/aws_sharepoint_connector/sharepoint.py`: SharePoint connector.
|
|
73
|
+
- `src/aws_sharepoint_connector/s3.py`: AWS S3 connector.
|
|
74
|
+
- `src/aws_sharepoint_connector/auth.py`: Azure authentication and Graph utilities.
|
|
75
|
+
- `src/aws_sharepoint_connector/utils.py`: Logger and HTTP retry logic.
|
|
76
|
+
|
|
77
|
+
## Configuration
|
|
78
|
+
|
|
79
|
+
Configuration is parsed by the config classes in `src/aws_sharepoint_connector/config.py`, which store
|
|
80
|
+
the S3 bucket, the SharePoint site and library, and the Azure app secrets.
|
|
81
|
+
|
|
82
|
+
### Required Environment Variables
|
|
83
|
+
|
|
84
|
+
These MUST be provided as secrets via Airflow from AWS Secrets Manager.
|
|
85
|
+
DO NOT store them as plain text.
|
|
86
|
+
|
|
87
|
+
| Variable | Type | Description |
|
|
88
|
+
| --- | --- | --- |
|
|
89
|
+
| `SECRET_AZURE_TENANT_ID` | string | Azure tenant UUID for Graph API authentication |
|
|
90
|
+
| `SECRET_AZURE_CLIENT_ID` | string | Azure app registration client ID |
|
|
91
|
+
| `SECRET_AZURE_CLIENT_SECRET` | string | Azure app registration client secret (store in secret manager) |
|
|
92
|
+
|
|
93
|
+
### Required configuration variables
|
|
94
|
+
|
|
95
|
+
Passed directly to `create_engine()` from your calling code
|
|
96
|
+
|
|
97
|
+
**`create_engine(mode, sp_site, sp_library, s3_bucket)`**
|
|
98
|
+
|
|
99
|
+
| Argument | Type | Description |
|
|
100
|
+
| --- | --- | --- |
|
|
101
|
+
| `mode` | string | Transfer direction: `write_to_s3` or `write_to_sharepoint` |
|
|
102
|
+
| `sp_site` | string | SharePoint site name (without URL prefix, e.g. `analytics-site`) |
|
|
103
|
+
| `sp_library` | string | SharePoint document library name (e.g. `Documents`) |
|
|
104
|
+
| `s3_bucket` | string | S3 bucket name (without `s3://` prefix) |
|
|
105
|
+
|
|
106
|
+
Passed to the engine's `run` method to identify a specific file to move
|
|
107
|
+
|
|
108
|
+
**`run(source, destination, delete)`**
|
|
109
|
+
|
|
110
|
+
| Key | Type | Description |
|
|
111
|
+
| --- | --- | --- |
|
|
112
|
+
| `source` | string | Source file path (SharePoint path or S3 key) |
|
|
113
|
+
| `destination` | string | Destination file path (S3 key or SharePoint path) |
|
|
114
|
+
| `delete` | bool | Flag for whether to delete source file after successful transfer |
|
|
115
|
+
|
|
116
|
+
### Example: SharePoint → S3 (single file)
|
|
117
|
+
|
|
118
|
+
For a SharePoint file at:
|
|
119
|
+
`https://justiceuk.sharepoint.com/sites/analytics-site/Documents/reports/2026/daily_report.csv`
|
|
120
|
+
|
|
121
|
+
To copy to `s3://my-bucket/path/to/daily_report.csv`:
|
|
122
|
+
|
|
123
|
+
```python
|
|
124
|
+
engine = create_engine(
|
|
125
|
+
mode="write_to_s3",
|
|
126
|
+
sp_site="analytics-site",
|
|
127
|
+
sp_library="Documents",
|
|
128
|
+
s3_bucket="my-bucket",
|
|
129
|
+
)
|
|
130
|
+
plans = [
|
|
131
|
+
{
|
|
132
|
+
"source": "reports/2026/daily_report.csv",
|
|
133
|
+
"destination": "path/to/daily_report.csv",
|
|
134
|
+
}
|
|
135
|
+
]
|
|
136
|
+
for plan in plans:
|
|
137
|
+
    engine.run(plan["source"], plan["destination"])
|
|
138
|
+
```
|
|
139
|
+
|
|
140
|
+
### Example: S3 → SharePoint (single file)
|
|
141
|
+
|
|
142
|
+
To move the same file in the other direction:
|
|
143
|
+
|
|
144
|
+
```python
|
|
145
|
+
engine = create_engine(
|
|
146
|
+
mode="write_to_sharepoint",
|
|
147
|
+
sp_site="analytics-site",
|
|
148
|
+
sp_library="Documents",
|
|
149
|
+
s3_bucket="my-bucket",
|
|
150
|
+
)
|
|
151
|
+
plans = [
|
|
152
|
+
{
|
|
153
|
+
"source": "path/to/daily_report.csv",
|
|
154
|
+
"destination": "reports/2026/daily_report.csv",
|
|
155
|
+
}
|
|
156
|
+
]
|
|
157
|
+
for plan in plans:
|
|
158
|
+
    engine.run(plan["source"], plan["destination"])
|
|
159
|
+
```
|
|
160
|
+
|
|
161
|
+
## Prerequisites
|
|
162
|
+
|
|
163
|
+
### SharePoint site
|
|
164
|
+
|
|
165
|
+
You will need a SharePoint site to serve as the source or destination for files. This can be a pre-existing SharePoint site, though you should be mindful of who will have access to the data.
|
|
166
|
+
|
|
167
|
+
### Azure app registration
|
|
168
|
+
|
|
169
|
+
An Azure app must be registered in Entra ID. It is bespoke to your project: it provides the connection to the SharePoint site, and the connector authenticates as this app using its client secret. To request a new Azure app and have it connected to your SharePoint site, raise a demand request by following the [instructions here](https://user-guide.staff-identity.service.justice.gov.uk/documentation/guidance/appreg.html#application-registrations-sso). You can do this in Terraform against the staff infrastructure authentication services repo (see [EM setup](https://github.com/ministryofjustice/staff-identity-idam-entra-infra/tree/main/terraform/envs/live/hmpps-electronic-monitoring-data) for an example), then post to [#staff-identity-authentication-services](https://moj.enterprise.slack.com/archives/C04AFS7TV7S).
|
|
170
|
+
|
|
171
|
+
The app will require these permissions:
|
|
172
|
+
|
|
173
|
+
- `Sites.Selected`
|
|
174
|
+
|
|
175
|
+
You will then need to speak to the File and Data Management team, who will grant your app access to the specific SharePoint sites you need.
|
|
176
|
+
|
|
177
|
+
### Azure app details & secret
|
|
178
|
+
|
|
179
|
+
You can view your [app registrations here](https://portal.azure.com/#view/Microsoft_AAD_RegisteredApps/ApplicationsListBlade).
|
|
180
|
+
|
|
181
|
+
Open up the app registration and the tenant ID will be available as `Directory (tenant) ID`.
|
|
182
|
+
|
|
183
|
+
The client ID is available as `Application (client) ID`.
|
|
184
|
+
|
|
185
|
+
The client secret is available under `Manage` -> `Certificates & secrets`. You may not be able to view it there; it may instead be sent to you when the app is created.
|
|
186
|
+
|
|
187
|
+
### AWS access
|
|
188
|
+
|
|
189
|
+
If running via Airflow, or from within another repo that runs via Airflow, standard AP credentials and access management apply and will grant access to S3.
|
|
190
|
+
|
|
191
|
+
### Technical requirements
|
|
192
|
+
|
|
193
|
+
- Python `3.13+`
|
|
194
|
+
- [uv](https://docs.astral.sh/uv/) for dependency management
|
|
195
|
+
|
|
196
|
+
## Installation
|
|
197
|
+
|
|
198
|
+
### Local install with `uv`
|
|
199
|
+
|
|
200
|
+
```bash
|
|
201
|
+
uv sync --all-groups --all-extras
|
|
202
|
+
```
|
|
203
|
+
|
|
204
|
+
### Package install in another project
|
|
205
|
+
|
|
206
|
+
If your project uses `uv`, add this package from the public GitHub repo directly and pin to a specific commit SHA:
|
|
207
|
+
|
|
208
|
+
```bash
|
|
209
|
+
uv add "git+https://github.com/ministryofjustice/aws-sharepoint-connector.git@<commit_sha>"
|
|
210
|
+
```
|
|
211
|
+
|
|
212
|
+
### Running tests
|
|
213
|
+
|
|
214
|
+
```bash
|
|
215
|
+
uv run python -m pytest # all tests with coverage
|
|
216
|
+
uv run python -m pytest tests/unit # unit tests only
|
|
217
|
+
uv run python -m pytest tests/e2e # E2E tests (no real API calls)
|
|
218
|
+
```
|
|
219
|
+
|
|
220
|
+
## How to run
|
|
221
|
+
|
|
222
|
+
### Programmatic API
|
|
223
|
+
|
|
224
|
+
Import `create_engine` from the `aws_sharepoint_connector` package.
|
|
225
|
+
The Azure secret values must be present as environment variables.
|
|
226
|
+
|
|
227
|
+
```python
|
|
228
|
+
from aws_sharepoint_connector import create_engine
|
|
229
|
+
|
|
230
|
+
engine = create_engine(
|
|
231
|
+
mode="write_to_s3",
|
|
232
|
+
sp_site="analytics-site",
|
|
233
|
+
sp_library="Documents",
|
|
234
|
+
s3_bucket="my-bucket",
|
|
235
|
+
)
|
|
236
|
+
|
|
237
|
+
plans = [
|
|
238
|
+
{
|
|
239
|
+
"source": "reports/2026/daily_report.csv",
|
|
240
|
+
"destination": "path/to/daily_report.csv",
|
|
241
|
+
},
|
|
242
|
+
{
|
|
243
|
+
"source": "reports/2026/summary.csv",
|
|
244
|
+
"destination": "path/to/summary.csv",
|
|
245
|
+
},
|
|
246
|
+
]
|
|
247
|
+
|
|
248
|
+
for plan in plans:
|
|
249
|
+
    engine.run(plan["source"], plan["destination"])
|
|
250
|
+
```
|
|
251
|
+
|
|
252
|
+
You can optionally use the `list_source_files` methods on the engines to obtain a list
|
|
253
|
+
of all files in the S3 bucket or SharePoint library. This can be used to programmatically
|
|
254
|
+
build the list of plans to iterate over.
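
Building the plan list from a file listing could look like the sketch below. It assumes `list_source_files` yields paths as plain strings, which is not documented here; the `build_plans` helper and the CSV filter are hypothetical and only illustrate the idea.

```python
import posixpath


def build_plans(source_files: list[str], dest_prefix: str) -> list[dict[str, str]]:
    """Map each source path to a destination under dest_prefix, keeping the file name."""
    return [
        {
            "source": path,
            "destination": posixpath.join(dest_prefix, posixpath.basename(path)),
        }
        for path in source_files
        if path.endswith(".csv")  # example filter: only move CSV files
    ]


# Stand-in for the result of a listing call on the engine.
files = ["reports/2026/daily_report.csv", "reports/2026/notes.txt"]
plans = build_plans(files, "path/to")
```

Each resulting dict has the same `source` and `destination` keys used in the examples above, so it can be passed straight to `engine.run`.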
|
|
255
|
+
|
|
256
|
+
## Error handling and retries
|
|
257
|
+
|
|
258
|
+
The connector retries transient failures at two levels: per upload chunk and per HTTP request.
|
|
259
|
+
|
|
260
|
+
### Chunk Upload Strategy
|
|
261
|
+
|
|
262
|
+
For large files, uploads are split into **10 MB chunks**:
|
|
263
|
+
|
|
264
|
+
- **Max 5 consecutive failures** per chunk before aborting the entire transfer
|
|
265
|
+
- **Transient errors** (429 Too Many Requests, 5xx): retried with exponential backoff
|
|
266
|
+
- **Permanent errors** (4xx excluding 429): immediately raised as `ProcessingError` without retry
|
|
267
|
+
- **File pointer reset** on every retry to ensure data consistency
|
|
268
|
+
|
|
269
|
+
Example: if chunk 3 of 5 of a 50 MB file hits a permanent error, or fails 5 times in a row, the transfer aborts and raises `ProcessingError`.
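
To make the chunk arithmetic concrete, the sketch below computes the byte range of each chunk. It assumes "10 MB" means 10 × 1024 × 1024 bytes; `chunk_ranges` is a hypothetical helper and not the package's upload code.

```python
CHUNK_SIZE = 10 * 1024 * 1024  # assumed meaning of "10 MB"


def chunk_ranges(total_size: int, chunk_size: int = CHUNK_SIZE) -> list[tuple[int, int]]:
    """Return inclusive (start, end) byte offsets for each chunk."""
    return [
        (start, min(start + chunk_size, total_size) - 1)
        for start in range(0, total_size, chunk_size)
    ]


total = 50 * 1024 * 1024
ranges = chunk_ranges(total)
# Content-Range header values of the kind used by ranged HTTP uploads.
headers = [f"bytes {start}-{end}/{total}" for start, end in ranges]
```

A 50 MB file yields five ranges, so "chunk 3 of 5" in the example above is the third `(start, end)` pair.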
|
|
270
|
+
|
|
271
|
+
### HTTP Request Retries
|
|
272
|
+
|
|
273
|
+
All Graph API and HTTP calls use `request_with_retry()`:
|
|
274
|
+
|
|
275
|
+
- **Max 3 attempts** per request
|
|
276
|
+
- **Retryable errors**: 429 Too Many Requests, 5xx Server Errors
|
|
277
|
+
- **Non-retryable errors**: 4xx Client Errors (except 429)
|
|
278
|
+
- **Exponential backoff** between retries
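
The retry rules above can be sketched as follows. This is not the package's `request_with_retry()`: `call_with_retry`, the one-second base delay, and the use of `RuntimeError` are assumptions made so the example runs on its own.

```python
import time
from typing import Callable


def is_retryable(status: int) -> bool:
    """429 and 5xx are treated as transient; other 4xx codes are permanent."""
    return status == 429 or 500 <= status <= 599


def call_with_retry(send: Callable[[], int], max_attempts: int = 3,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send() until it succeeds, fails permanently, or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        status = send()
        if status < 400:
            return status
        if not is_retryable(status):
            raise RuntimeError(f"Permanent error: HTTP {status}")
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))  # delay doubles each retry
    raise RuntimeError(f"Gave up after {max_attempts} attempts (last HTTP {status})")


# Simulated responses: two transient failures, then success.
responses = iter([503, 429, 200])
delays: list[float] = []
result = call_with_retry(lambda: next(responses), sleep=delays.append)
```

Passing `sleep=delays.append` records the backoff delays instead of waiting, which keeps the example fast and makes the doubling visible.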
|
|
279
|
+
|
|
280
|
+
### Batch Processing Behavior
|
|
281
|
+
|
|
282
|
+
Batch iteration is handled by the calling code. The engine processes one file per `engine.run()` call and raises `ProcessingError` on failure. It is the caller's responsibility to decide whether to abort or continue processing remaining files.
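
One way for calling code to continue past a failed file is sketched below. The `ProcessingError` class and `FakeEngine` here are local stand-ins so the example is self-contained; in real code the exception would come from the package, and its import path is not shown in this README.

```python
class ProcessingError(Exception):
    """Stand-in for the package's ProcessingError."""


class FakeEngine:
    """Hypothetical engine whose run() fails for one known-bad source."""

    def run(self, source: str, destination: str, delete: bool = False) -> None:
        if source.endswith("missing.csv"):
            raise ProcessingError(f"Source file not found: {source}")


def run_batch(engine, plans: list[dict[str, str]]) -> list[tuple[str, str]]:
    """Run every plan, collecting failures instead of stopping at the first one."""
    failures = []
    for plan in plans:
        try:
            engine.run(plan["source"], plan["destination"])
        except ProcessingError as exc:
            failures.append((plan["source"], str(exc)))
    return failures


plans = [
    {"source": "reports/daily_report.csv", "destination": "path/to/daily_report.csv"},
    {"source": "reports/missing.csv", "destination": "path/to/missing.csv"},
    {"source": "reports/summary.csv", "destination": "path/to/summary.csv"},
]
failures = run_batch(FakeEngine(), plans)
```

To abort on the first failure instead, let the exception propagate rather than appending it to `failures`.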
|
|
283
|
+
|
|
284
|
+
## How to modify or extend
|
|
285
|
+
|
|
286
|
+
### 1) Add a new transfer mode
|
|
287
|
+
|
|
288
|
+
1. Create a new engine class in `src/aws_sharepoint_connector/engine.py` implementing:
|
|
289
|
+
- `download_file(self, source: str) -> bytes`
|
|
290
|
+
- `upload_file(self, content: bytes, destination: str) -> None`
|
|
291
|
+
- `validate_plans(self, plans: list[dict[str, str]]) -> None`
|
|
292
|
+
2. Register the engine in `MODE_MAP` in `src/aws_sharepoint_connector/main.py`.
|
|
293
|
+
3. Expand the `Literal` type for `mode` in `create_engine()` in `src/aws_sharepoint_connector/main.py`.
|
|
294
|
+
4. Add unit tests for success and failure paths, including `validate_plans`.
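
The shape of such an engine might look like the sketch below. The three method signatures come from the list above; everything else is assumed, including the `EngineBase` stand-in for the package's abstract class, the in-memory stores, and the use of `ValueError`.

```python
from abc import ABC, abstractmethod


class EngineBase(ABC):
    """Hypothetical stand-in for the package's abstract engine."""

    @abstractmethod
    def download_file(self, source: str) -> bytes: ...

    @abstractmethod
    def upload_file(self, content: bytes, destination: str) -> None: ...

    @abstractmethod
    def validate_plans(self, plans: list[dict[str, str]]) -> None: ...


class InMemoryEngine(EngineBase):
    """Toy engine that moves bytes between two dicts."""

    def __init__(self, source_store: dict[str, bytes], dest_store: dict[str, bytes]) -> None:
        self.source_store = source_store
        self.dest_store = dest_store

    def download_file(self, source: str) -> bytes:
        return self.source_store[source]

    def upload_file(self, content: bytes, destination: str) -> None:
        self.dest_store[destination] = content

    def validate_plans(self, plans: list[dict[str, str]]) -> None:
        # Collect every problem so the caller sees them all at once.
        errors = [f"Source not found: {p['source']}" for p in plans
                  if p["source"] not in self.source_store]
        if errors:
            raise ValueError(
                f"Pre-flight validation failed with {len(errors)} error(s): "
                + "; ".join(errors)
            )


engine = InMemoryEngine({"a.csv": b"1,2\n"}, {})
engine.validate_plans([{"source": "a.csv", "destination": "out/a.csv"}])
engine.upload_file(engine.download_file("a.csv"), "out/a.csv")
```

Collecting all validation errors before raising matches the pre-flight message described under Troubleshooting, which lists every problem in one go.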
|
|
295
|
+
|
|
296
|
+
### 2) Add additional configuration
|
|
297
|
+
|
|
298
|
+
1. Add a field in `SecretConfig` (`src/aws_sharepoint_connector/config.py`).
|
|
299
|
+
2. Add validation if needed with a `field_validator`.
|
|
300
|
+
3. Update `.env` docs and this README.
|
|
301
|
+
4. Use the field in connector or engine logic.
|
|
302
|
+
|
|
303
|
+
## Troubleshooting
|
|
304
|
+
|
|
305
|
+
### Common errors and solutions
|
|
306
|
+
|
|
307
|
+
- **`Pre-flight validation failed with N error(s)`**: One or more sources or destinations could not be verified before transfers started. The error message lists every problem — fix all of them before retrying.
|
|
308
|
+
- **`Library 'X' not found on site`**: Verify `sp_library` spelling and that the app has SharePoint access via Graph API permissions (`Sites.Read.All`, `Files.ReadWrite.All`)
|
|
309
|
+
- **`Source file not found in SharePoint`**: Verify the file exists at the exact path supplied as `source`; check case sensitivity
|
|
310
|
+
- **`Destination folder not found in SharePoint`**: The parent directory of the destination path does not exist in SharePoint; create it before running the connector
|
|
311
|
+
- **`S3 bucket does not exist`** or **`S3 object does not exist`**: Verify bucket name is correct, bucket exists in eu-west-2, and IAM principal has access
|
|
312
|
+
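- **`Access denied to S3 bucket/object`**: Check the IAM policy grants `s3:GetObject` and `s3:PutObject` on the objects and `s3:ListBucket` on the bucket. The `HeadObject` and `HeadBucket` API calls have no IAM actions of their own; they are authorized by `s3:GetObject` and `s3:ListBucket` respectively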
|
|
313
|
+
- **`AADSTS65001` or Graph auth failures**: Verify app permissions (`Sites.Read.All`, `Files.ReadWrite.All`) are granted in Azure; may need admin consent
|
|
314
|
+
- **`File transfer failed: Max retries exceeded`**: File chunk upload exceeded 5 consecutive failures; check network stability, S3/SharePoint availability, and file size
|
|
315
|
+
|
|
316
|
+
## Security considerations
|
|
317
|
+
|
|
318
|
+
- **Never commit `.env` files or secrets**: Add `.env` to `.gitignore`
|
|
319
|
+
- **Prefer managed identity**: Use workload identity or managed identity in AWS/Azure instead of storing static credentials
|
|
320
|
+
- **Scope permissions tightly**:
|
|
321
|
+
- Azure: Limit app permissions to `Sites.Read.All` and `Files.ReadWrite.All` only
|
|
322
|
+
- AWS: Restrict IAM policy to specific bucket and prefix (e.g., `arn:aws:s3:::bucket/prefix/*`)
|
|
323
|
+
- **Rotate secrets**: Change Azure client secrets every 90 days and update secret manager
|
|
324
|
+
- **Store secrets securely**: Use AWS Secrets Manager, Azure Key Vault, or Kubernetes secrets, and inject the values as environment variables at runtime (never hardcode them in code or in committed files)
|
|
325
|
+
- **Audit access**: Monitor AWS CloudTrail events for S3, and SharePoint audit logs, for access to sensitive data
|
|
326
|
+
- **Network isolation**: Consider running connector in private network with appropriate egress controls
|
|
327
|
+
- **Data residency**: Ensure S3 bucket and SharePoint site comply with data residency requirements
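
As an illustration of tight AWS scoping, the sketch below builds an IAM policy document limited to one bucket and prefix. The bucket name, prefix, and `bucket_prefix_policy` helper are placeholders, and the action list is an assumption about what a read/write transfer needs rather than a statement of the connector's exact requirements.

```python
import json


def bucket_prefix_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy document limited to one bucket and key prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object-level access, restricted to keys under the prefix.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {
                # Bucket-level access, needed for listing and bucket checks.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }


policy = bucket_prefix_policy("my-bucket", "path/to")
policy_json = json.dumps(policy, indent=2)
```

If source files are deleted after transfer, `s3:DeleteObject` would presumably also be needed on the object resource.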
|
|
328
|
+
|
|
329
|
+
## License
|
|
330
|
+
|
|
331
|
+
MIT License. See `LICENSE`.
|
|
@@ -0,0 +1,315 @@
|
|
|
1
|
+
# AWS - SharePoint Connector
|
|
2
|
+
|
|
3
|
+
[](https://github.com/ministryofjustice/aws-sharepoint-connector/releases)
|
|
4
|
+
[](pyproject.toml)
|
|
5
|
+
[](LICENSE)
|
|
6
|
+
[](CHANGELOG.md)
|
|
7
|
+
|
|
8
|
+
[](https://github.com/ministryofjustice/aws-sharepoint-connector/actions/workflows/python-unit-test.yml)
|
|
9
|
+
[](https://github.com/ministryofjustice/aws-sharepoint-connector/actions/workflows/python-lint.yml)
|
|
10
|
+
[](https://github.com/ministryofjustice/aws-sharepoint-connector/actions/workflows/python-type.yml)
|
|
11
|
+
[](https://github.com/ministryofjustice/aws-sharepoint-connector/actions/workflows/release-container.yml)
|
|
12
|
+
|
|
13
|
+
Provides a simple connector for moving files between AWS S3 and Microsoft SharePoint (via Microsoft Graph API).
|
|
14
|
+
|
|
15
|
+
Operates in two modes: `write_to_s3` moves files from SharePoint to AWS S3, and `write_to_sharepoint` does the reverse. Instantiate an engine via `create_engine`, then call its `run` method for each source file you want to move.
|
|
16
|
+
|
|
17
|
+
## Table of contents
|
|
18
|
+
|
|
19
|
+
- [Architecture and flow](#architecture-and-flow)
|
|
20
|
+
- [Configuration](#configuration)
|
|
21
|
+
- [Prerequisites](#prerequisites)
|
|
22
|
+
- [Installation](#installation)
|
|
23
|
+
- [How to run](#how-to-run)
|
|
24
|
+
- [Programmatic API](#programmatic-api)
|
|
25
|
+
- [Error handling and retries](#error-handling-and-retries)
|
|
26
|
+
- [How to modify or extend](#how-to-modify-or-extend)
|
|
27
|
+
- [Troubleshooting](#troubleshooting)
|
|
28
|
+
- [Security considerations](#security-considerations)
|
|
29
|
+
- [Changelog](CHANGELOG.md)
|
|
30
|
+
- [License](#license)
|
|
31
|
+
|
|
32
|
+
## Architecture and flow
|
|
33
|
+
|
|
34
|
+
### High level process flow
|
|
35
|
+
|
|
36
|
+
1. Create an engine with `create_engine(mode, sp_site, sp_library, s3_bucket)`
|
|
37
|
+
2. Identify the files you want to move, and their destination
|
|
38
|
+
3. For each file, call `engine.run(source, destination)`:
|
|
39
|
+
- Calls `engine.validate_plans` to validate the plan before any transfer begins.
|
|
40
|
+
- Download from the source system (SharePoint or S3).
|
|
41
|
+
- Upload to the destination system (S3 or SharePoint).
|
|
42
|
+
- Verify transfer was successful by comparing byte size.
|
|
43
|
+
- Optionally delete the source file
|
|
44
|
+
4. Handle errors per file in your calling code.
|
|
45
|
+
|
|
46
|
+
### Transfer Modes
|
|
47
|
+
|
|
48
|
+
- **`write_to_s3`**: Download from SharePoint → Upload to S3
|
|
49
|
+
- **`write_to_sharepoint`**: Download from S3 → Upload to SharePoint
|
|
50
|
+
|
|
51
|
+
### Core Components
|
|
52
|
+
|
|
53
|
+
- `src/aws_sharepoint_connector/main.py`: Public API (`create_engine()`).
|
|
54
|
+
- `src/aws_sharepoint_connector/config.py`: Pydantic models for validated configuration.
|
|
55
|
+
- `src/aws_sharepoint_connector/engine.py`: Abstract transfer logic.
|
|
56
|
+
- `src/aws_sharepoint_connector/sharepoint.py`: SharePoint connector.
|
|
57
|
+
- `src/aws_sharepoint_connector/s3.py`: AWS S3 connector.
|
|
58
|
+
- `src/aws_sharepoint_connector/auth.py`: Azure authentication and Graph utilities.
|
|
59
|
+
- `src/aws_sharepoint_connector/utils.py`: Logger and HTTP retry logic.
|
|
60
|
+
|
|
61
|
+
## Configuration
|
|
62
|
+
|
|
63
|
+
Configuration is parsed by the config classes in `src/aws_sharepoint_connector/config.py`, which store
|
|
64
|
+
the S3 bucket, the SharePoint site and library, and the Azure app secrets.
|
|
65
|
+
|
|
66
|
+
### Required Environment Variables
|
|
67
|
+
|
|
68
|
+
These MUST be provided as secrets via Airflow from AWS Secrets Manager.
|
|
69
|
+
DO NOT store them as plain text.
|
|
70
|
+
|
|
71
|
+
| Variable | Type | Description |
|
|
72
|
+
| --- | --- | --- |
|
|
73
|
+
| `SECRET_AZURE_TENANT_ID` | string | Azure tenant UUID for Graph API authentication |
|
|
74
|
+
| `SECRET_AZURE_CLIENT_ID` | string | Azure app registration client ID |
|
|
75
|
+
| `SECRET_AZURE_CLIENT_SECRET` | string | Azure app registration client secret (store in secret manager) |
|
|
76
|
+
|
|
77
|
+
### Required configuration variables
|
|
78
|
+
|
|
79
|
+
Passed directly to `create_engine()` from your calling code
|
|
80
|
+
|
|
81
|
+
**`create_engine(mode, sp_site, sp_library, s3_bucket)`**
|
|
82
|
+
|
|
83
|
+
| Argument | Type | Description |
|
|
84
|
+
| --- | --- | --- |
|
|
85
|
+
| `mode` | string | Transfer direction: `write_to_s3` or `write_to_sharepoint` |
|
|
86
|
+
| `sp_site` | string | SharePoint site name (without URL prefix, e.g. `analytics-site`) |
|
|
87
|
+
| `sp_library` | string | SharePoint document library name (e.g. `Documents`) |
|
|
88
|
+
| `s3_bucket` | string | S3 bucket name (without `s3://` prefix) |
|
|
89
|
+
|
|
90
|
+
Passed to the engine's `run` method to identify a specific file to move
|
|
91
|
+
|
|
92
|
+
**`run(source, destination, delete)`**
|
|
93
|
+
|
|
94
|
+
| Key | Type | Description |
|
|
95
|
+
| --- | --- | --- |
|
|
96
|
+
| `source` | string | Source file path (SharePoint path or S3 key) |
|
|
97
|
+
| `destination` | string | Destination file path (S3 key or SharePoint path) |
|
|
98
|
+
| `delete` | bool | Flag for whether to delete source file after successful transfer |
|
|
99
|
+
|
|
100
|
+
### Example: SharePoint → S3 (single file)
|
|
101
|
+
|
|
102
|
+
For a SharePoint file at:
|
|
103
|
+
`https://justiceuk.sharepoint.com/sites/analytics-site/Documents/reports/2026/daily_report.csv`
|
|
104
|
+
|
|
105
|
+
To copy to `s3://my-bucket/path/to/daily_report.csv`:
|
|
106
|
+
|
|
107
|
+
```python
|
|
108
|
+
engine = create_engine(
|
|
109
|
+
mode="write_to_s3",
|
|
110
|
+
sp_site="analytics-site",
|
|
111
|
+
sp_library="Documents",
|
|
112
|
+
s3_bucket="my-bucket",
|
|
113
|
+
)
|
|
114
|
+
plans = [
|
|
115
|
+
{
|
|
116
|
+
"source": "reports/2026/daily_report.csv",
|
|
117
|
+
"destination": "path/to/daily_report.csv",
|
|
118
|
+
}
|
|
119
|
+
]
|
|
120
|
+
for plan in plans:
|
|
121
|
+
    engine.run(plan["source"], plan["destination"])
|
|
122
|
+
```
|
|
123
|
+
|
|
124
|
+
### Example: S3 → SharePoint (single file)
|
|
125
|
+
|
|
126
|
+
To move the same file in the other direction:
|
|
127
|
+
|
|
128
|
+
```python
|
|
129
|
+
engine = create_engine(
|
|
130
|
+
mode="write_to_sharepoint",
|
|
131
|
+
sp_site="analytics-site",
|
|
132
|
+
sp_library="Documents",
|
|
133
|
+
s3_bucket="my-bucket",
|
|
134
|
+
)
|
|
135
|
+
plans = [
|
|
136
|
+
{
|
|
137
|
+
"source": "path/to/daily_report.csv",
|
|
138
|
+
"destination": "reports/2026/daily_report.csv",
|
|
139
|
+
}
|
|
140
|
+
]
|
|
141
|
+
for plan in plans:
|
|
142
|
+
    engine.run(plan["source"], plan["destination"])
|
|
143
|
+
```
|
|
144
|
+
|
|
145
|
+
## Prerequisites
|
|
146
|
+
|
|
147
|
+
### SharePoint site
|
|
148
|
+
|
|
149
|
+
You will need a SharePoint site to serve as the source or destination for files. This can be a pre-existing SharePoint site, though you should be mindful of who will have access to the data.
|
|
150
|
+
|
|
151
|
+
### Azure app registration
|
|
152
|
+
|
|
153
|
+
An Azure app must be registered in Entra ID. It is bespoke to your project: it provides the connection to the SharePoint site, and the connector authenticates as this app using its client secret. To request a new Azure app and have it connected to your SharePoint site, raise a demand request by following the [instructions here](https://user-guide.staff-identity.service.justice.gov.uk/documentation/guidance/appreg.html#application-registrations-sso). You can do this in Terraform against the staff infrastructure authentication services repo (see [EM setup](https://github.com/ministryofjustice/staff-identity-idam-entra-infra/tree/main/terraform/envs/live/hmpps-electronic-monitoring-data) for an example), then post to [#staff-identity-authentication-services](https://moj.enterprise.slack.com/archives/C04AFS7TV7S).
|
|
154
|
+
|
|
155
|
+
The app will require these permissions:
|
|
156
|
+
|
|
157
|
+
- `Sites.Selected`
|
|
158
|
+
|
|
159
|
+
You will then need to speak to the File and Data Management team, who will grant your app access to the specific SharePoint sites you need.
|
|
160
|
+
|
|
161
|
+
### Azure app details & secret
|
|
162
|
+
|
|
163
|
+
You can view your [app registrations here](https://portal.azure.com/#view/Microsoft_AAD_RegisteredApps/ApplicationsListBlade).
|
|
164
|
+
|
|
165
|
+
Open up the app registration and the tenant ID will be available as `Directory (tenant) ID`.
|
|
166
|
+
|
|
167
|
+
The client ID is available as `Application (client) ID`.
|
|
168
|
+
|
|
169
|
+
The client secret is available under `Manage` -> `Certificates & secrets`. You may not be able to view it there; it may instead be sent to you when the app is created.
|
|
170
|
+
|
|
171
|
+
### AWS access
|
|
172
|
+
|
|
173
|
+
If running via Airflow, or from within another repo that runs via Airflow, standard AP credentials and access management apply and will grant access to S3.
|
|
174
|
+
|
|
175
|
+
### Technical requirements
|
|
176
|
+
|
|
177
|
+
- Python `3.13+`
|
|
178
|
+
- [uv](https://docs.astral.sh/uv/) for dependency management
|
|
179
|
+
|
|
180
|
+
## Installation
|
|
181
|
+
|
|
182
|
+
### Local install with `uv`
|
|
183
|
+
|
|
184
|
+
```bash
|
|
185
|
+
uv sync --all-groups --all-extras
|
|
186
|
+
```
|
|
187
|
+
|
|
188
|
+
### Package install in another project
|
|
189
|
+
|
|
190
|
+
If your project uses `uv`, add this package from the public GitHub repo directly and pin to a specific commit SHA:
|
|
191
|
+
|
|
192
|
+
```bash
|
|
193
|
+
uv add "git+https://github.com/ministryofjustice/aws-sharepoint-connector.git@<commit_sha>"
|
|
194
|
+
```
|
|
195
|
+
|
|
196
|
+
### Running tests
|
|
197
|
+
|
|
198
|
+
```bash
|
|
199
|
+
uv run python -m pytest # all tests with coverage
|
|
200
|
+
uv run python -m pytest tests/unit # unit tests only
|
|
201
|
+
uv run python -m pytest tests/e2e # E2E tests (no real API calls)
|
|
202
|
+
```
|
|
203
|
+
|
|
204
|
+
## How to run
|
|
205
|
+
|
|
206
|
+
### Programmatic API
|
|
207
|
+
|
|
208
|
+
Import `create_engine` from the `aws_sharepoint_connector` package.
|
|
209
|
+
The Azure secret values must be present as environment variables.
|
|
210
|
+
|
|
211
|
+
```python
|
|
212
|
+
from aws_sharepoint_connector import create_engine
|
|
213
|
+
|
|
214
|
+
engine = create_engine(
|
|
215
|
+
mode="write_to_s3",
|
|
216
|
+
sp_site="analytics-site",
|
|
217
|
+
sp_library="Documents",
|
|
218
|
+
s3_bucket="my-bucket",
|
|
219
|
+
)
|
|
220
|
+
|
|
221
|
+
plans = [
|
|
222
|
+
{
|
|
223
|
+
"source": "reports/2026/daily_report.csv",
|
|
224
|
+
"destination": "path/to/daily_report.csv",
|
|
225
|
+
},
|
|
226
|
+
{
|
|
227
|
+
"source": "reports/2026/summary.csv",
|
|
228
|
+
"destination": "path/to/summary.csv",
|
|
229
|
+
},
|
|
230
|
+
]
|
|
231
|
+
|
|
232
|
+
for plan in plans:
|
|
233
|
+
    engine.run(plan["source"], plan["destination"])
|
|
234
|
+
```
|
|
235
|
+
|
|
236
|
+
You can optionally use the `list_source_files` methods on the engines to obtain a list
|
|
237
|
+
of all files in the S3 bucket or SharePoint library. This can be used to programmatically
|
|
238
|
+
build the list of plans to iterate over.
|
|
239
|
+
|
|
240
|
+
## Error handling and retries
|
|
241
|
+
|
|
242
|
+
The connector retries transient failures at two levels: per upload chunk and per HTTP request.
|
|
243
|
+
|
|
244
|
+
### Chunk Upload Strategy
|
|
245
|
+
|
|
246
|
+
For large files, uploads are split into **10 MB chunks**:
|
|
247
|
+
|
|
248
|
+
- **Max 5 consecutive failures** per chunk before aborting the entire transfer
|
|
249
|
+
- **Transient errors** (429 Too Many Requests, 5xx): retried with exponential backoff
|
|
250
|
+
- **Permanent errors** (4xx excluding 429): immediately raised as `ProcessingError` without retry
|
|
251
|
+
- **File pointer reset** on every retry to ensure data consistency
|
|
252
|
+
|
|
253
|
+
Example: if chunk 3 of 5 of a 50 MB file hits a permanent error, or fails 5 times in a row, the transfer aborts and raises `ProcessingError`.
|
|
254
|
+
|
|
255
|
+
### HTTP Request Retries
|
|
256
|
+
|
|
257
|
+
All Graph API and HTTP calls use `request_with_retry()`:
|
|
258
|
+
|
|
259
|
+
- **Max 3 attempts** per request
|
|
260
|
+
- **Retryable errors**: 429 Too Many Requests, 5xx Server Errors
|
|
261
|
+
- **Non-retryable errors**: 4xx Client Errors (except 429)
|
|
262
|
+
- **Exponential backoff** between retries
|
|
263
|
+
|
|
264
|
+
### Batch Processing Behavior
|
|
265
|
+
|
|
266
|
+
Batch iteration is handled by the calling code. The engine processes one file per `engine.run()` call and raises `ProcessingError` on failure. It is the caller's responsibility to decide whether to abort or continue processing remaining files.
|
|
267
|
+
|
|
268
|
+
## How to modify or extend
|
|
269
|
+
|
|
270
|
+
### 1) Add a new transfer mode
|
|
271
|
+
|
|
272
|
+
1. Create a new engine class in `src/aws_sharepoint_connector/engine.py` implementing:
|
|
273
|
+
- `download_file(self, source: str) -> bytes`
|
|
274
|
+
- `upload_file(self, content: bytes, destination: str) -> None`
|
|
275
|
+
- `validate_plans(self, plans: list[dict[str, str]]) -> None`
|
|
276
|
+
2. Register the engine in `MODE_MAP` in `src/aws_sharepoint_connector/main.py`.
|
|
277
|
+
3. Expand the `Literal` type for `mode` in `create_engine()` in `src/aws_sharepoint_connector/main.py`.
|
|
278
|
+
4. Add unit tests for success and failure paths, including `validate_plans`.
|
|
279
|
+
|
|
280
|
+
### 2) Add additional configuration
|
|
281
|
+
|
|
282
|
+
1. Add a field in `SecretConfig` (`src/aws_sharepoint_connector/config.py`).
|
|
283
|
+
2. Add validation if needed with a `field_validator`.
|
|
284
|
+
3. Update `.env` docs and this README.
|
|
285
|
+
4. Use the field in connector or engine logic.
|
|
286
|
+
|
|
287
|
+
## Troubleshooting
|
|
288
|
+
|
|
289
|
+
### Common errors and solutions
|
|
290
|
+
|
|
291
|
+
- **`Pre-flight validation failed with N error(s)`**: One or more sources or destinations could not be verified before transfers started. The error message lists every problem — fix all of them before retrying.
|
|
292
|
+
- **`Library 'X' not found on site`**: Verify `sp_library` spelling and that the app has SharePoint access via Graph API permissions (`Sites.Read.All`, `Files.ReadWrite.All`)
|
|
293
|
+
- **`Source file not found in SharePoint`**: Verify the file exists at the exact path supplied as `source`; check case sensitivity
|
|
294
|
+
- **`Destination folder not found in SharePoint`**: The parent directory of the destination path does not exist in SharePoint; create it before running the connector
|
|
295
|
+
- **`S3 bucket does not exist`** or **`S3 object does not exist`**: Verify bucket name is correct, bucket exists in eu-west-2, and IAM principal has access
|
|
296
|
+
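- **`Access denied to S3 bucket/object`**: Check the IAM policy grants `s3:GetObject` and `s3:PutObject` on the objects and `s3:ListBucket` on the bucket. The `HeadObject` and `HeadBucket` API calls have no IAM actions of their own; they are authorized by `s3:GetObject` and `s3:ListBucket` respectively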
|
|
297
|
+
- **`AADSTS65001` or Graph auth failures**: Verify app permissions (`Sites.Read.All`, `Files.ReadWrite.All`) are granted in Azure; may need admin consent
|
|
298
|
+
- **`File transfer failed: Max retries exceeded`**: File chunk upload exceeded 5 consecutive failures; check network stability, S3/SharePoint availability, and file size
|
|
299
|
+
|
|
300
|
+
## Security considerations
|
|
301
|
+
|
|
302
|
+
- **Never commit `.env` files or secrets**: Add `.env` to `.gitignore`
|
|
303
|
+
- **Prefer managed identity**: Use workload identity or managed identity in AWS/Azure instead of storing static credentials
|
|
304
|
+
- **Scope permissions tightly**:
|
|
305
|
+
- Azure: Limit app permissions to `Sites.Read.All` and `Files.ReadWrite.All` only
|
|
306
|
+
- AWS: Restrict IAM policy to specific bucket and prefix (e.g., `arn:aws:s3:::bucket/prefix/*`)
|
|
307
|
+
- **Rotate secrets**: Change Azure client secrets every 90 days and update secret manager
|
|
308
|
+
- **Store secrets securely**: Use AWS Secrets Manager, Azure Key Vault, or Kubernetes secrets, and inject the values as environment variables at runtime (never hardcode them in code or in committed files)
|
|
309
|
+
- **Audit access**: Monitor AWS CloudTrail events for S3, and SharePoint audit logs, for access to sensitive data
|
|
310
|
+
- **Network isolation**: Consider running connector in private network with appropriate egress controls
|
|
311
|
+
- **Data residency**: Ensure S3 bucket and SharePoint site comply with data residency requirements
|
|
312
|
+
|
|
313
|
+
## License
|
|
314
|
+
|
|
315
|
+
MIT License. See `LICENSE`.
|