airbyte-source-google-drive 0.0.8__py3-none-any.whl

Potentially problematic release.


@@ -0,0 +1,102 @@
1
+ Metadata-Version: 2.1
2
+ Name: airbyte-source-google-drive
3
+ Version: 0.0.8
4
+ Summary: Source implementation for Google Drive.
5
+ Author: Airbyte
6
+ Author-email: contact@airbyte.io
7
+ Description-Content-Type: text/markdown
8
+ Requires-Dist: airbyte-cdk[file-based] >=0.60.1
9
+ Requires-Dist: google-api-python-client ==2.104.0
10
+ Requires-Dist: google-auth-httplib2 ==0.1.1
11
+ Requires-Dist: google-auth-oauthlib ==1.1.0
12
+ Requires-Dist: google-api-python-client-stubs ==1.18.0
13
+ Provides-Extra: tests
14
+ Requires-Dist: pytest-mock ~=3.6.1 ; extra == 'tests'
15
+ Requires-Dist: pytest ~=6.1 ; extra == 'tests'
16
+
17
+ # Google Drive Source
18
+
19
+ This is the repository for the Google Drive source connector, written in Python.
20
+ For information about how to use this connector within Airbyte, see [the documentation](https://docs.airbyte.io/integrations/sources/google-drive).
21
+
22
+
23
+ **To iterate on this connector, make sure to complete this prerequisites section.**
24
+
25
+
26
+ From this connector directory, create a virtual environment:
27
+ ```
28
+ python -m venv .venv
29
+ ```
30
+
31
+ This will generate a virtualenv for this module in `.venv/`. Make sure this venv is active in your
32
+ development environment of choice. To activate it from the terminal, run:
33
+ ```
34
+ source .venv/bin/activate
35
+ pip install -r requirements.txt
36
+ pip install '.[tests]'
37
+ ```
38
+ If you are in an IDE, follow your IDE's instructions to activate the virtualenv.
39
+
40
+ Note that while we are installing dependencies from `requirements.txt`, you should only edit `setup.py` for your dependencies. `requirements.txt` is
41
+ used for editable installs (`pip install -e`) to pull in Python dependencies from the monorepo and will call `setup.py`.
42
+ If this is mumbo jumbo to you, don't worry about it, just put your deps in `setup.py` but install using `pip install -r requirements.txt` and everything
43
+ should work as you expect.
44
+
45
+ **If you are a community contributor**, follow the instructions in the [documentation](https://docs.airbyte.io/integrations/sources/google-drive)
46
+ to generate the necessary credentials. Then create a file `secrets/config.json` conforming to the `source_google_drive/spec.json` file.
47
+ Note that any directory named `secrets` is gitignored across the entire Airbyte repo, so there is no danger of accidentally checking in sensitive information.
48
+ See `integration_tests/sample_config.json` for a sample config file.
49
+
50
+ **If you are an Airbyte core member**, copy the credentials in LastPass under the secret name `source google-drive test creds`
51
+ and place them into `secrets/config.json`.
52
+
53
+ ```
54
+ python main.py spec
55
+ python main.py check --config secrets/config.json
56
+ python main.py discover --config secrets/config.json
57
+ python main.py read --config secrets/config.json --catalog integration_tests/configured_catalog.json
58
+ ```
59
+
60
+
61
+
62
+ **Via [`airbyte-ci`](https://github.com/airbytehq/airbyte/blob/master/airbyte-ci/connectors/pipelines/README.md) (recommended):**
63
+ ```bash
64
+ airbyte-ci connectors --name=source-google-drive build
65
+ ```
66
+
67
+ An image will be built with the tag `airbyte/source-google-drive:dev`.
68
+
69
+ **Via `docker build`:**
70
+ ```bash
71
+ docker build -t airbyte/source-google-drive:dev .
72
+ ```
73
+
74
+ Then run any of the connector commands as follows:
75
+ ```
76
+ docker run --rm airbyte/source-google-drive:dev spec
77
+ docker run --rm -v $(pwd)/secrets:/secrets airbyte/source-google-drive:dev check --config /secrets/config.json
78
+ docker run --rm -v $(pwd)/secrets:/secrets airbyte/source-google-drive:dev discover --config /secrets/config.json
79
+ docker run --rm -v $(pwd)/secrets:/secrets -v $(pwd)/integration_tests:/integration_tests airbyte/source-google-drive:dev read --config /secrets/config.json --catalog /integration_tests/configured_catalog.json
80
+ ```
81
+
82
+ You can run our full test suite locally using [`airbyte-ci`](https://github.com/airbytehq/airbyte/blob/master/airbyte-ci/connectors/pipelines/README.md):
83
+ ```bash
84
+ airbyte-ci connectors --name=source-google-drive test
85
+ ```
86
+
87
+ Customize the `acceptance-test-config.yml` file to configure tests. See [Connector Acceptance Tests](https://docs.airbyte.com/connector-development/testing-connectors/connector-acceptance-tests-reference) for more information.
88
+ If your connector needs to create or destroy resources for use during acceptance tests, create fixtures for them and place them in `integration_tests/acceptance.py`.
89
+
90
+ All of your dependencies should go in `setup.py`, NOT `requirements.txt`. The requirements file is only used to connect internal Airbyte dependencies in the monorepo for local development.
91
+ We split dependencies between two groups, dependencies that are:
92
+ * required for your connector to work need to go in the `MAIN_REQUIREMENTS` list.
93
+ * required for testing need to go in the `TEST_REQUIREMENTS` list.
94
+
95
+ You've checked out the repo, implemented a million dollar feature, and you're ready to share your changes with the world. Now what?
96
+ 1. Make sure your changes are passing our test suite: `airbyte-ci connectors --name=source-google-drive test`
97
+ 2. Bump the connector version in `metadata.yaml`: increment the `dockerImageTag` value. Please follow [semantic versioning for connectors](https://docs.airbyte.com/contributing-to-airbyte/resources/pull-requests-handbook/#semantic-versioning-for-connectors).
98
+ 3. Make sure the `metadata.yaml` content is up to date.
99
+ 4. Make sure the connector documentation and its changelog are up to date (`docs/integrations/sources/google-drive.md`).
100
+ 5. Create a Pull Request: use [our PR naming conventions](https://docs.airbyte.com/contributing-to-airbyte/resources/pull-requests-handbook/#pull-request-title-convention).
101
+ 6. Pat yourself on the back for being an awesome contributor.
102
+ 7. Someone from Airbyte will take a look at your PR and iterate with you to merge it into master.
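The header block at the top of METADATA uses RFC 822-style `Key: value` fields, so Python's standard `email` parser can read it. Below is a minimal sketch that separates the runtime requirements from those tied to the `tests` extra. The `METADATA_HEADERS` constant and the `parse_metadata` helper are names made up for this illustration, and the README body is left out of the sample string.

```python
from email.parser import Parser

# Header fields copied from the METADATA hunk above (README body omitted).
METADATA_HEADERS = """\
Metadata-Version: 2.1
Name: airbyte-source-google-drive
Version: 0.0.8
Requires-Dist: airbyte-cdk[file-based] >=0.60.1
Requires-Dist: google-api-python-client ==2.104.0
Requires-Dist: google-auth-httplib2 ==0.1.1
Requires-Dist: google-auth-oauthlib ==1.1.0
Requires-Dist: google-api-python-client-stubs ==1.18.0
Provides-Extra: tests
Requires-Dist: pytest-mock ~=3.6.1 ; extra == 'tests'
Requires-Dist: pytest ~=6.1 ; extra == 'tests'
"""


def parse_metadata(text: str) -> dict:
    """Hypothetical helper: split requirements into runtime and extras-only."""
    msg = Parser().parsestr(text)
    requires = msg.get_all("Requires-Dist") or []
    return {
        "name": msg["Name"],
        "version": msg["Version"],
        # Requirements carrying an `extra ==` marker install only with that extra.
        "runtime": [r for r in requires if "extra ==" not in r],
        "extras_only": [r for r in requires if "extra ==" in r],
    }


info = parse_metadata(METADATA_HEADERS)
print(info["name"], info["version"])
for requirement in info["runtime"]:
    print("runtime:", requirement)
```

Four of the five runtime requirements are pinned with `==`, while `airbyte-cdk[file-based]` only sets a lower bound.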
@@ -0,0 +1,17 @@
1
+ integration_tests/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
2
+ integration_tests/abnormal_state.json,sha256=OjtTybZAAaINLABJKWytuOVczeJV_F00pGa7cH8M3vU,986
3
+ integration_tests/acceptance.py,sha256=q5rKj8jSnWg7fFZu2y7tsBwrbKsoxbAgKP43-FQ8Vc4,361
4
+ integration_tests/configured_catalog.json,sha256=WaPXCRHx2Pv96Uyci_4eKA07VLQ-JxEYi24l8KqahEI,1600
5
+ integration_tests/invalid_config.json,sha256=36BpdsPLMC-89lYg-eZL0SVi_zPWKx9WJnfmix_g2w0,403
6
+ integration_tests/spec.json,sha256=a1WQD1cKsyZQQ3eKK_cJym0ftop05JUaIpfCVvqpe8o,21010
7
+ source_google_drive/__init__.py,sha256=5eaSfjduFy2tZFKWshVu1H7Evu-pjwOdti4MGDANCj4,132
8
+ source_google_drive/run.py,sha256=R5AQ5KyVwlhCDqVeNq9zRZa9cROB1PwxDBz9fW0d7RE,711
9
+ source_google_drive/source.py,sha256=PAEoNKumyLWQpSk3zNQ0B9jdlltZCfLLghcmQY8bjGk,2827
10
+ source_google_drive/spec.py,sha256=A25glT8IV4hngtv7goVK3xIV0r0uwy_qOvJER9Q6VKI,3401
11
+ source_google_drive/stream_reader.py,sha256=bRgGeSKOQLLASwZGli9juzdpSTwBzureAziKs42j4MQ,8533
12
+ source_google_drive/utils.py,sha256=ZIYqXEQfJDpFZgNW94MOeYPG2xKba-LoGgsYEllhVmY,941
13
+ airbyte_source_google_drive-0.0.8.dist-info/METADATA,sha256=Gx7pEK1Id8olLctCFE8Sd4GXREHxzAMkQqU4k0jWiIU,5639
14
+ airbyte_source_google_drive-0.0.8.dist-info/WHEEL,sha256=oiQVh_5PnQM0E3gPdiz09WCNmwiHDMaGer_elqB3coM,92
15
+ airbyte_source_google_drive-0.0.8.dist-info/entry_points.txt,sha256=rZDJ3rREvJ7j4Oew33J5eOBHfJkdjFaidKSmwVBrS3A,68
16
+ airbyte_source_google_drive-0.0.8.dist-info/top_level.txt,sha256=Tj-bdDutVStfs3X15D0Z8bx2thMV9sAaYr9KgguAFDc,38
17
+ airbyte_source_google_drive-0.0.8.dist-info/RECORD,,
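Each RECORD row above has the form `path,sha256=<digest>,<size in bytes>`, where the digest is the URL-safe base64 encoding of the file's SHA-256 hash with the trailing `=` padding removed. The sketch below recomputes such a row with the standard library only. `record_digest` and `record_line` are illustrative names, not part of any packaging tool.

```python
import base64
import hashlib


def record_digest(data: bytes) -> str:
    """URL-safe base64 of the SHA-256 digest, with '=' padding stripped."""
    raw = hashlib.sha256(data).digest()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def record_line(path: str, data: bytes) -> str:
    """Format one RECORD row: path, hash, and size in bytes."""
    return f"{path},sha256={record_digest(data)},{len(data)}"


# integration_tests/__init__.py is listed with size 0, so it is an empty file.
print(record_line("integration_tests/__init__.py", b""))
# → integration_tests/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
```

The same function can be pointed at the bytes of any unpacked file from the wheel to check it against its RECORD row.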
@@ -0,0 +1,5 @@
1
+ Wheel-Version: 1.0
2
+ Generator: bdist_wheel (0.42.0)
3
+ Root-Is-Purelib: true
4
+ Tag: py3-none-any
5
+
@@ -0,0 +1,2 @@
1
+ [console_scripts]
2
+ source-google-drive = source_google_drive.run:run
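The `entry_points.txt` file above is INI-formatted: each `console_scripts` entry maps a command name to a `module:callable` target. Here is a small sketch that reads the file with `configparser` and splits the target. `ENTRY_POINTS_TXT` and `parse_console_scripts` are hypothetical names used only for this example.

```python
import configparser

# Content copied from the entry_points.txt hunk above.
ENTRY_POINTS_TXT = """\
[console_scripts]
source-google-drive = source_google_drive.run:run
"""


def parse_console_scripts(text: str) -> dict:
    """Map each script name to a (module, attribute) pair."""
    # Only '=' is a delimiter, so the ':' inside the target is left alone.
    parser = configparser.ConfigParser(delimiters=("=",))
    parser.optionxform = str  # keep script names case-sensitive
    parser.read_string(text)
    scripts = {}
    for name, target in parser["console_scripts"].items():
        module, _, attr = target.partition(":")
        scripts[name] = (module.strip(), attr.strip())
    return scripts


print(parse_console_scripts(ENTRY_POINTS_TXT))
```

After installation, running `source-google-drive` therefore calls the `run` function defined in `source_google_drive/run.py`.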
@@ -0,0 +1,2 @@
1
+ integration_tests
2
+ source_google_drive
File without changes
@@ -0,0 +1,35 @@
1
+ [
2
+ {
3
+ "type": "STREAM",
4
+ "stream": {
5
+ "stream_descriptor": {
6
+ "name": "test"
7
+ },
8
+ "stream_state": {
9
+ "history": {
10
+ "test.jsonl": "2023-10-16T06:16:06.000000Z",
11
+ "subfolder/test2.jsonl": "2023-10-19T01:43:56.000000Z"
12
+ },
13
+ "_ab_source_file_last_modified": "2023-10-19T01:43:56.000000Z_subfolder/test2.jsonl"
14
+ }
15
+ }
16
+ },
17
+ {
18
+ "type": "STREAM",
19
+ "stream": {
20
+ "stream_descriptor": {
21
+ "name": "test_unstructured"
22
+ },
23
+ "stream_state": {
24
+ "history": {
25
+ "testdoc_docx.docx": "2023-10-27T00:45:54.000000Z",
26
+ "testdoc_pdf.pdf": "2023-10-27T00:45:58.000000Z",
27
+ "testdoc_ocr_pdf.pdf": "2023-10-27T00:46:04.000000Z",
28
+ "testdoc_google": "2023-11-10T13:46:18.551000Z",
29
+ "testdoc_presentation": "2023-11-10T13:49:06.640000Z"
30
+ },
31
+ "_ab_source_file_last_modified": "2023-11-10T13:49:06.640000Z_testdoc_presentation"
32
+ }
33
+ }
34
+ }
35
+ ]
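In both state entries above, the `_ab_source_file_last_modified` value is the latest timestamp in `history` joined to that file's name with an underscore. The sketch below reproduces the pattern from the fixture data alone. It is an observation about this fixture, not a copy of the CDK's cursor logic, and `cursor_from_history` is a made-up name.

```python
def cursor_from_history(history: dict) -> str:
    """Build '<latest timestamp>_<file name>' from a file -> timestamp map.

    The timestamps share one fixed-width UTC format, so plain string
    comparison orders them chronologically.
    """
    name, timestamp = max(history.items(), key=lambda item: (item[1], item[0]))
    return f"{timestamp}_{name}"


# History of the "test" stream, copied from the fixture.
test_history = {
    "test.jsonl": "2023-10-16T06:16:06.000000Z",
    "subfolder/test2.jsonl": "2023-10-19T01:43:56.000000Z",
}
print(cursor_from_history(test_history))
# → 2023-10-19T01:43:56.000000Z_subfolder/test2.jsonl
```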
@@ -0,0 +1,16 @@
1
+ #
2
+ # Copyright (c) 2023 Airbyte, Inc., all rights reserved.
3
+ #
4
+
5
+
6
+ from typing import Iterable
7
+
8
+ import pytest
9
+
10
+ pytest_plugins = ("connector_acceptance_test.plugin",)
11
+
12
+
13
+ @pytest.fixture(scope="session", autouse=True)
14
+ def connector_setup() -> Iterable[None]:
15
+ """This fixture is a placeholder for external resources that acceptance test might require."""
16
+ yield
@@ -0,0 +1,60 @@
1
+ {
2
+ "streams": [
3
+ {
4
+ "stream": {
5
+ "name": "test",
6
+ "json_schema": {
7
+ "type": "object",
8
+ "properties": {
9
+ "y": {
10
+ "type": ["null", "integer"]
11
+ },
12
+ "x": {
13
+ "type": ["null", "integer"]
14
+ },
15
+ "_ab_source_file_last_modified": {
16
+ "type": "string",
17
+ "format": "date-time"
18
+ },
19
+ "_ab_source_file_url": {
20
+ "type": "string"
21
+ }
22
+ }
23
+ },
24
+ "supported_sync_modes": ["full_refresh", "incremental"],
25
+ "source_defined_cursor": true,
26
+ "default_cursor_field": ["_ab_source_file_last_modified"]
27
+ },
28
+ "sync_mode": "incremental",
29
+ "destination_sync_mode": "append"
30
+ },
31
+ {
32
+ "stream": {
33
+ "name": "test_unstructured",
34
+ "json_schema": {
35
+ "type": "object",
36
+ "properties": {
37
+ "document_key": {
38
+ "type": ["null", "integer"]
39
+ },
40
+ "content": {
41
+ "type": ["null", "integer"]
42
+ },
43
+ "_ab_source_file_last_modified": {
44
+ "type": "string",
45
+ "format": "date-time"
46
+ },
47
+ "_ab_source_file_url": {
48
+ "type": "string"
49
+ }
50
+ }
51
+ },
52
+ "supported_sync_modes": ["full_refresh", "incremental"],
53
+ "source_defined_cursor": true,
54
+ "default_cursor_field": ["_ab_source_file_last_modified"]
55
+ },
56
+ "sync_mode": "incremental",
57
+ "destination_sync_mode": "append"
58
+ }
59
+ ]
60
+ }
@@ -0,0 +1,19 @@
1
+ {
2
+ "folder_url": "https://drive.google.com/drive/folders/yyy",
3
+ "credentials": {
4
+ "auth_type": "Service",
5
+ "service_account_info": "abc"
6
+ },
7
+ "streams": [
8
+ {
9
+ "name": "test",
10
+ "globs": ["**/*.jsonl"],
11
+ "format": {
12
+ "filetype": "jsonl"
13
+ },
14
+ "schemaless": false,
15
+ "validation_policy": "Emit Record",
16
+ "days_to_sync_if_history_is_full": 3
17
+ }
18
+ ]
19
+ }
@@ -0,0 +1,456 @@
1
+ {
2
+ "documentationUrl": "https://docs.airbyte.com/integrations/sources/google-drive",
3
+ "connectionSpecification": {
4
+ "title": "Google Drive Source Spec",
5
+ "description": "Used during spec; allows the developer to configure the cloud provider specific options\nthat are needed when users configure a file-based source.",
6
+ "type": "object",
7
+ "properties": {
8
+ "start_date": {
9
+ "title": "Start Date",
10
+ "description": "UTC date and time in the format 2017-01-25T00:00:00.000000Z. Any file modified before this date will not be replicated.",
11
+ "examples": ["2021-01-01T00:00:00.000000Z"],
12
+ "format": "date-time",
13
+ "pattern": "^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}.[0-9]{6}Z$",
14
+ "pattern_descriptor": "YYYY-MM-DDTHH:mm:ss.SSSSSSZ",
15
+ "order": 1,
16
+ "type": "string"
17
+ },
18
+ "streams": {
19
+ "title": "The list of streams to sync",
20
+ "description": "Each instance of this configuration defines a <a href=\"https://docs.airbyte.com/cloud/core-concepts#stream\">stream</a>. Use this to define which files belong in the stream, their format, and how they should be parsed and validated. When sending data to warehouse destination such as Snowflake or BigQuery, each stream is a separate table.",
21
+ "order": 10,
22
+ "type": "array",
23
+ "items": {
24
+ "title": "FileBasedStreamConfig",
25
+ "type": "object",
26
+ "properties": {
27
+ "name": {
28
+ "title": "Name",
29
+ "description": "The name of the stream.",
30
+ "type": "string"
31
+ },
32
+ "globs": {
33
+ "title": "Globs",
34
+ "default": ["**"],
35
+ "order": 1,
36
+ "description": "The pattern used to specify which files should be selected from the file system. For more information on glob pattern matching look <a href=\"https://en.wikipedia.org/wiki/Glob_(programming)\">here</a>.",
37
+ "type": "array",
38
+ "items": {
39
+ "type": "string"
40
+ }
41
+ },
42
+ "validation_policy": {
43
+ "title": "Validation Policy",
44
+ "description": "The name of the validation policy that dictates sync behavior when a record does not adhere to the stream schema.",
45
+ "default": "Emit Record",
46
+ "enum": ["Emit Record", "Skip Record", "Wait for Discover"]
47
+ },
48
+ "input_schema": {
49
+ "title": "Input Schema",
50
+ "description": "The schema that will be used to validate records extracted from the file. This will override the stream schema that is auto-detected from incoming files.",
51
+ "type": "string"
52
+ },
53
+ "primary_key": {
54
+ "title": "Primary Key",
55
+ "description": "The column or columns (for a composite key) that serves as the unique identifier of a record. If empty, the primary key will default to the parser's default primary key.",
56
+ "type": "string",
57
+ "airbyte_hidden": true
58
+ },
59
+ "days_to_sync_if_history_is_full": {
60
+ "title": "Days To Sync If History Is Full",
61
+ "description": "When the state history of the file store is full, syncs will only read files that were last modified in the provided day range.",
62
+ "default": 3,
63
+ "type": "integer"
64
+ },
65
+ "format": {
66
+ "title": "Format",
67
+ "description": "The configuration options that are used to alter how to read incoming files that deviate from the standard formatting.",
68
+ "type": "object",
69
+ "oneOf": [
70
+ {
71
+ "title": "Avro Format",
72
+ "type": "object",
73
+ "properties": {
74
+ "filetype": {
75
+ "title": "Filetype",
76
+ "default": "avro",
77
+ "const": "avro",
78
+ "type": "string"
79
+ },
80
+ "double_as_string": {
81
+ "title": "Convert Double Fields to Strings",
82
+ "description": "Whether to convert double fields to strings. This is recommended if you have decimal numbers with a high degree of precision because there can be a loss of precision when handling floating point numbers.",
83
+ "default": false,
84
+ "type": "boolean"
85
+ }
86
+ },
87
+ "required": ["filetype"]
88
+ },
89
+ {
90
+ "title": "CSV Format",
91
+ "type": "object",
92
+ "properties": {
93
+ "filetype": {
94
+ "title": "Filetype",
95
+ "default": "csv",
96
+ "const": "csv",
97
+ "type": "string"
98
+ },
99
+ "delimiter": {
100
+ "title": "Delimiter",
101
+ "description": "The character delimiting individual cells in the CSV data. This may only be a 1-character string. For tab-delimited data enter '\\t'.",
102
+ "default": ",",
103
+ "type": "string"
104
+ },
105
+ "quote_char": {
106
+ "title": "Quote Character",
107
+ "description": "The character used for quoting CSV values. To disallow quoting, make this field blank.",
108
+ "default": "\"",
109
+ "type": "string"
110
+ },
111
+ "escape_char": {
112
+ "title": "Escape Character",
113
+ "description": "The character used for escaping special characters. To disallow escaping, leave this field blank.",
114
+ "type": "string"
115
+ },
116
+ "encoding": {
117
+ "title": "Encoding",
118
+ "description": "The character encoding of the CSV data. Leave blank to default to <strong>UTF8</strong>. See <a href=\"https://docs.python.org/3/library/codecs.html#standard-encodings\" target=\"_blank\">list of python encodings</a> for allowable options.",
119
+ "default": "utf8",
120
+ "type": "string"
121
+ },
122
+ "double_quote": {
123
+ "title": "Double Quote",
124
+ "description": "Whether two quotes in a quoted CSV value denote a single quote in the data.",
125
+ "default": true,
126
+ "type": "boolean"
127
+ },
128
+ "null_values": {
129
+ "title": "Null Values",
130
+ "description": "A set of case-sensitive strings that should be interpreted as null values. For example, if the value 'NA' should be interpreted as null, enter 'NA' in this field.",
131
+ "default": [],
132
+ "type": "array",
133
+ "items": {
134
+ "type": "string"
135
+ },
136
+ "uniqueItems": true
137
+ },
138
+ "strings_can_be_null": {
139
+ "title": "Strings Can Be Null",
140
+ "description": "Whether strings can be interpreted as null values. If true, strings that match the null_values set will be interpreted as null. If false, strings that match the null_values set will be interpreted as the string itself.",
141
+ "default": true,
142
+ "type": "boolean"
143
+ },
144
+ "skip_rows_before_header": {
145
+ "title": "Skip Rows Before Header",
146
+ "description": "The number of rows to skip before the header row. For example, if the header row is on the 3rd row, enter 2 in this field.",
147
+ "default": 0,
148
+ "type": "integer"
149
+ },
150
+ "skip_rows_after_header": {
151
+ "title": "Skip Rows After Header",
152
+ "description": "The number of rows to skip after the header row.",
153
+ "default": 0,
154
+ "type": "integer"
155
+ },
156
+ "header_definition": {
157
+ "title": "CSV Header Definition",
158
+ "description": "How headers will be defined. `User Provided` assumes the CSV does not have a header row and uses the headers provided and `Autogenerated` assumes the CSV does not have a header row and the CDK will generate headers using `f{i}` where `i` is the index starting from 0. Else, the default behavior is to use the header from the CSV file. If a user wants to autogenerate or provide column names for a CSV having headers, they can skip rows.",
159
+ "default": {
160
+ "header_definition_type": "From CSV"
161
+ },
162
+ "oneOf": [
163
+ {
164
+ "title": "From CSV",
165
+ "type": "object",
166
+ "properties": {
167
+ "header_definition_type": {
168
+ "title": "Header Definition Type",
169
+ "default": "From CSV",
170
+ "const": "From CSV",
171
+ "type": "string"
172
+ }
173
+ },
174
+ "required": ["header_definition_type"]
175
+ },
176
+ {
177
+ "title": "Autogenerated",
178
+ "type": "object",
179
+ "properties": {
180
+ "header_definition_type": {
181
+ "title": "Header Definition Type",
182
+ "default": "Autogenerated",
183
+ "const": "Autogenerated",
184
+ "type": "string"
185
+ }
186
+ },
187
+ "required": ["header_definition_type"]
188
+ },
189
+ {
190
+ "title": "User Provided",
191
+ "type": "object",
192
+ "properties": {
193
+ "header_definition_type": {
194
+ "title": "Header Definition Type",
195
+ "default": "User Provided",
196
+ "const": "User Provided",
197
+ "type": "string"
198
+ },
199
+ "column_names": {
200
+ "title": "Column Names",
201
+ "description": "The column names that will be used while emitting the CSV records",
202
+ "type": "array",
203
+ "items": {
204
+ "type": "string"
205
+ }
206
+ }
207
+ },
208
+ "required": ["column_names", "header_definition_type"]
209
+ }
210
+ ],
211
+ "type": "object"
212
+ },
213
+ "true_values": {
214
+ "title": "True Values",
215
+ "description": "A set of case-sensitive strings that should be interpreted as true values.",
216
+ "default": ["y", "yes", "t", "true", "on", "1"],
217
+ "type": "array",
218
+ "items": {
219
+ "type": "string"
220
+ },
221
+ "uniqueItems": true
222
+ },
223
+ "false_values": {
224
+ "title": "False Values",
225
+ "description": "A set of case-sensitive strings that should be interpreted as false values.",
226
+ "default": ["n", "no", "f", "false", "off", "0"],
227
+ "type": "array",
228
+ "items": {
229
+ "type": "string"
230
+ },
231
+ "uniqueItems": true
232
+ }
233
+ },
234
+ "required": ["filetype"]
235
+ },
236
+ {
237
+ "title": "Jsonl Format",
238
+ "type": "object",
239
+ "properties": {
240
+ "filetype": {
241
+ "title": "Filetype",
242
+ "default": "jsonl",
243
+ "const": "jsonl",
244
+ "type": "string"
245
+ }
246
+ },
247
+ "required": ["filetype"]
248
+ },
249
+ {
250
+ "title": "Parquet Format",
251
+ "type": "object",
252
+ "properties": {
253
+ "filetype": {
254
+ "title": "Filetype",
255
+ "default": "parquet",
256
+ "const": "parquet",
257
+ "type": "string"
258
+ },
259
+ "decimal_as_float": {
260
+ "title": "Convert Decimal Fields to Floats",
261
+ "description": "Whether to convert decimal fields to floats. There is a loss of precision when converting decimals to floats, so this is not recommended.",
262
+ "default": false,
263
+ "type": "boolean"
264
+ }
265
+ },
266
+ "required": ["filetype"]
267
+ },
268
+ {
269
+ "title": "Document File Type Format (Experimental)",
270
+ "type": "object",
271
+ "properties": {
272
+ "filetype": {
273
+ "title": "Filetype",
274
+ "default": "unstructured",
275
+ "const": "unstructured",
276
+ "type": "string"
277
+ },
278
+ "skip_unprocessable_files": {
279
+ "type": "boolean",
280
+ "default": true,
281
+ "title": "Skip Unprocessable Files",
282
+ "description": "If true, skip files that cannot be parsed and pass the error message along as the _ab_source_file_parse_error field. If false, fail the sync.",
283
+ "always_show": true
284
+ },
285
+ "strategy": {
286
+ "type": "string",
287
+ "always_show": true,
288
+ "order": 0,
289
+ "default": "auto",
290
+ "title": "Parsing Strategy",
291
+ "enum": ["auto", "fast", "ocr_only", "hi_res"],
292
+ "description": "The strategy used to parse documents. `fast` extracts text directly from the document which doesn't work for all files. `ocr_only` is more reliable, but slower. `hi_res` is the most reliable, but requires an API key and a hosted instance of unstructured and can't be used with local mode. See the unstructured.io documentation for more details: https://unstructured-io.github.io/unstructured/core/partition.html#partition-pdf"
293
+ },
294
+ "processing": {
295
+ "title": "Processing",
296
+ "description": "Processing configuration",
297
+ "default": {
298
+ "mode": "local"
299
+ },
300
+ "type": "object",
301
+ "oneOf": [
302
+ {
303
+ "title": "Local",
304
+ "type": "object",
305
+ "properties": {
306
+ "mode": {
307
+ "title": "Mode",
308
+ "default": "local",
309
+ "const": "local",
310
+ "enum": ["local"],
311
+ "type": "string"
312
+ }
313
+ },
314
+ "description": "Process files locally, supporting `fast` and `ocr` modes. This is the default option.",
315
+ "required": ["mode"]
316
+ }
317
+ ]
318
+ }
319
+ },
320
+ "description": "Extract text from document formats (.pdf, .docx, .md, .pptx) and emit as one record per file.",
321
+ "required": ["filetype"]
322
+ }
323
+ ]
324
+ },
325
+ "schemaless": {
326
+ "title": "Schemaless",
327
+ "description": "When enabled, syncs will not validate or structure records against the stream's schema.",
328
+ "default": false,
329
+ "type": "boolean"
330
+ }
331
+ },
332
+ "required": ["name", "format"]
333
+ }
334
+ },
335
+ "folder_url": {
336
+ "title": "Folder Url",
337
+ "description": "URL for the folder you want to sync. Using individual streams and glob patterns, it's possible to only sync a subset of all files located in the folder.",
338
+ "examples": [
339
+ "https://drive.google.com/drive/folders/1Xaz0vXXXX2enKnNYU5qSt9NS70gvMyYn"
340
+ ],
341
+ "order": 0,
342
+ "pattern": "^https://drive.google.com/.+",
343
+ "pattern_descriptor": "https://drive.google.com/drive/folders/MY-FOLDER-ID",
344
+ "type": "string"
345
+ },
346
+ "credentials": {
347
+ "title": "Authentication",
348
+ "description": "Credentials for connecting to the Google Drive API",
349
+ "type": "object",
350
+ "oneOf": [
351
+ {
352
+ "title": "Authenticate via Google (OAuth)",
353
+ "type": "object",
354
+ "properties": {
355
+ "auth_type": {
356
+ "title": "Auth Type",
357
+ "default": "Client",
358
+ "const": "Client",
359
+ "enum": ["Client"],
360
+ "type": "string"
361
+ },
362
+ "client_id": {
363
+ "title": "Client ID",
364
+ "description": "Client ID for the Google Drive API",
365
+ "airbyte_secret": true,
366
+ "type": "string"
367
+ },
368
+ "client_secret": {
369
+ "title": "Client Secret",
370
+ "description": "Client Secret for the Google Drive API",
371
+ "airbyte_secret": true,
372
+ "type": "string"
373
+ },
374
+ "refresh_token": {
375
+ "title": "Refresh Token",
376
+ "description": "Refresh Token for the Google Drive API",
377
+ "airbyte_secret": true,
378
+ "type": "string"
379
+ }
380
+ },
381
+ "required": [
382
+ "client_id",
383
+ "client_secret",
384
+ "refresh_token",
385
+ "auth_type"
386
+ ]
387
+ },
388
+ {
389
+ "title": "Service Account Key Authentication",
390
+ "type": "object",
391
+ "properties": {
392
+ "auth_type": {
393
+ "title": "Auth Type",
394
+ "default": "Service",
395
+ "const": "Service",
396
+ "enum": ["Service"],
397
+ "type": "string"
398
+ },
399
+ "service_account_info": {
400
+ "title": "Service Account Information",
401
+ "description": "The JSON key of the service account to use for authorization. Read more <a href=\"https://cloud.google.com/iam/docs/creating-managing-service-account-keys#creating_service_account_keys\">here</a>.",
402
+ "airbyte_secret": true,
403
+ "type": "string"
404
+ }
405
+ },
406
+ "required": ["service_account_info", "auth_type"]
407
+ }
408
+ ]
409
+ }
410
+ },
411
+ "required": ["streams", "folder_url", "credentials"]
412
+ },
413
+ "advanced_auth": {
414
+ "auth_flow_type": "oauth2.0",
415
+ "predicate_key": ["credentials", "auth_type"],
416
+ "predicate_value": "Client",
417
+ "oauth_config_specification": {
418
+ "complete_oauth_output_specification": {
419
+ "type": "object",
420
+ "additionalProperties": false,
421
+ "properties": {
422
+ "refresh_token": {
423
+ "type": "string",
424
+ "path_in_connector_config": ["credentials", "refresh_token"]
425
+ }
426
+ }
427
+ },
428
+ "complete_oauth_server_input_specification": {
429
+ "type": "object",
430
+ "additionalProperties": false,
431
+ "properties": {
432
+ "client_id": {
433
+ "type": "string"
434
+ },
435
+ "client_secret": {
436
+ "type": "string"
437
+ }
438
+ }
439
+ },
440
+ "complete_oauth_server_output_specification": {
441
+ "type": "object",
442
+ "additionalProperties": false,
443
+ "properties": {
444
+ "client_id": {
445
+ "type": "string",
446
+ "path_in_connector_config": ["credentials", "client_id"]
447
+ },
448
+ "client_secret": {
449
+ "type": "string",
450
+ "path_in_connector_config": ["credentials", "client_secret"]
451
+ }
452
+ }
453
+ }
454
+ }
455
+ }
456
+ }
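The `pattern` values for `start_date` and `folder_url` in the spec above are ordinary regular expressions, so they can be tried out directly. The sketch below uses Python's `re` module as a stand-in for a JSON Schema validator. `matches` is a hypothetical helper, and it uses `re.search` because JSON Schema patterns are not implicitly anchored.

```python
import re

# Patterns copied verbatim from the spec above.
START_DATE_PATTERN = r"^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}.[0-9]{6}Z$"
FOLDER_URL_PATTERN = r"^https://drive.google.com/.+"


def matches(pattern: str, value: str) -> bool:
    """Return True when the pattern is found in the value."""
    return re.search(pattern, value) is not None


print(matches(START_DATE_PATTERN, "2021-01-01T00:00:00.000000Z"))  # microseconds present
print(matches(START_DATE_PATTERN, "2021-01-01T00:00:00Z"))  # microseconds missing
print(matches(FOLDER_URL_PATTERN, "https://drive.google.com/drive/folders/yyy"))
```

The `.` characters in both patterns are unescaped, so they match any character. A value such as `2021-01-01T00:00:00x000000Z` would also pass the `start_date` pattern.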
@@ -0,0 +1,6 @@
+ #
+ # Copyright (c) 2023 Airbyte, Inc., all rights reserved.
+ #
+ from .source import SourceGoogleDrive
+
+ __all__ = ["SourceGoogleDrive"]
@@ -0,0 +1,23 @@
+ #
+ # Copyright (c) 2023 Airbyte, Inc., all rights reserved.
+ #
+
+
+ import sys
+
+ from airbyte_cdk import AirbyteEntrypoint
+ from airbyte_cdk.entrypoint import launch
+ from source_google_drive import SourceGoogleDrive
+
+
+ def run():
+     args = sys.argv[1:]
+     catalog_path = AirbyteEntrypoint.extract_catalog(args)
+     config_path = AirbyteEntrypoint.extract_config(args)
+     state_path = AirbyteEntrypoint.extract_state(args)
+     source = SourceGoogleDrive(
+         SourceGoogleDrive.read_catalog(catalog_path) if catalog_path else None,
+         SourceGoogleDrive.read_config(config_path) if config_path else None,
+         SourceGoogleDrive.read_state(state_path) if state_path else None,
+     )
+     launch(source, args)
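`run()` above pulls the catalog, config, and state paths out of the command line before constructing the source, and passes `None` for any that are absent. The sketch below illustrates that idea with a plain-Python stand-in. `extract_path` is a hypothetical helper written for this example, not the CDK's `AirbyteEntrypoint` implementation, and it handles only the `--flag value` form.

```python
from typing import List, Optional


def extract_path(args: List[str], flag: str) -> Optional[str]:
    """Return the value following `flag`, or None when the flag is absent."""
    for index, arg in enumerate(args[:-1]):
        if arg == flag:
            return args[index + 1]
    return None


# Arguments as they would appear for the `read` command shown in the README.
read_args = [
    "read",
    "--config", "secrets/config.json",
    "--catalog", "integration_tests/configured_catalog.json",
]
print(extract_path(read_args, "--config"))
print(extract_path(read_args, "--catalog"))
print(extract_path(read_args, "--state"))  # no state passed on a first sync
```

For `spec`, none of the three flags is present, which is why the constructor in `source.py` declares all three arguments as `Optional`.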
@@ -0,0 +1,58 @@
+ #
+ # Copyright (c) 2023 Airbyte, Inc., all rights reserved.
+ #
+ from typing import Any, Mapping, Optional
+
+ from airbyte_cdk.models import AdvancedAuth, ConfiguredAirbyteCatalog, ConnectorSpecification, OAuthConfigSpecification
+ from airbyte_cdk.sources.file_based.file_based_source import FileBasedSource
+ from airbyte_cdk.sources.file_based.stream.cursor.default_file_based_cursor import DefaultFileBasedCursor
+ from airbyte_cdk.sources.source import TState
+ from source_google_drive.spec import SourceGoogleDriveSpec
+ from source_google_drive.stream_reader import SourceGoogleDriveStreamReader
+
+
+ class SourceGoogleDrive(FileBasedSource):
+     def __init__(self, catalog: Optional[ConfiguredAirbyteCatalog], config: Optional[Mapping[str, Any]], state: Optional[TState]):
+         super().__init__(
+             stream_reader=SourceGoogleDriveStreamReader(),
+             spec_class=SourceGoogleDriveSpec,
+             catalog=catalog,
+             config=config,
+             state=state,
+             cursor_cls=DefaultFileBasedCursor,
+         )
+
+     def spec(self, *args: Any, **kwargs: Any) -> ConnectorSpecification:
+         """
+         Returns the specification describing what fields can be configured by a user when setting up a file-based source.
+         """
+
+         return ConnectorSpecification(
+             documentationUrl=self.spec_class.documentation_url(),
+             connectionSpecification=self.spec_class.schema(),
+             advanced_auth=AdvancedAuth(
+                 auth_flow_type="oauth2.0",
+                 predicate_key=["credentials", "auth_type"],
+                 predicate_value="Client",
+                 oauth_config_specification=OAuthConfigSpecification(
+                     complete_oauth_output_specification={
+                         "type": "object",
+                         "additionalProperties": False,
+                         "properties": {"refresh_token": {"type": "string", "path_in_connector_config": ["credentials", "refresh_token"]}},
+                     },
+                     complete_oauth_server_input_specification={
+                         "type": "object",
+                         "additionalProperties": False,
+                         "properties": {"client_id": {"type": "string"}, "client_secret": {"type": "string"}},
+                     },
+                     complete_oauth_server_output_specification={
+                         "type": "object",
+                         "additionalProperties": False,
+                         "properties": {
+                             "client_id": {"type": "string", "path_in_connector_config": ["credentials", "client_id"]},
+                             "client_secret": {"type": "string", "path_in_connector_config": ["credentials", "client_secret"]},
+                         },
+                     },
+                 ),
+             ),
+         )
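Each `path_in_connector_config` list in the `spec()` method above names a location in the connector config, for example `["credentials", "refresh_token"]`. The sketch below shows how a value returned by an OAuth flow could be written to such a path. It assumes that the field name means what it says. The actual handling is done by the Airbyte platform and is not part of this package, and `place_at_path` and `apply_oauth_output` are hypothetical names.

```python
from typing import Any, Dict, List


def place_at_path(config: Dict[str, Any], path: List[str], value: Any) -> Dict[str, Any]:
    """Write `value` into the nested dict at `path`, creating parents as needed."""
    node = config
    for key in path[:-1]:
        node = node.setdefault(key, {})
    node[path[-1]] = value
    return config


def apply_oauth_output(config: Dict[str, Any], properties: Dict[str, Any], oauth_output: Dict[str, Any]) -> Dict[str, Any]:
    """Copy each OAuth output field to the config location its spec entry names."""
    for field, schema in properties.items():
        path = schema.get("path_in_connector_config")
        if path and field in oauth_output:
            place_at_path(config, path, oauth_output[field])
    return config


# `properties` mirrors complete_oauth_output_specification; the token is a placeholder.
properties = {"refresh_token": {"type": "string", "path_in_connector_config": ["credentials", "refresh_token"]}}
config = {"credentials": {"auth_type": "Client"}}
apply_oauth_output(config, properties, {"refresh_token": "example-refresh-token"})
print(config)
```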
@@ -0,0 +1,85 @@
1
+ #
2
+ # Copyright (c) 2023 Airbyte, Inc., all rights reserved.
3
+ #
4
+
5
+
6
+ from typing import Any, Dict, Literal, Union
7
+
8
+ import dpath.util
9
+ from airbyte_cdk.sources.file_based.config.abstract_file_based_spec import AbstractFileBasedSpec
10
+ from airbyte_cdk.utils.oneof_option_config import OneOfOptionConfig
11
+ from pydantic import BaseModel, Field
12
+
13
+
14
+ class OAuthCredentials(BaseModel):
15
+ class Config(OneOfOptionConfig):
16
+ title = "Authenticate via Google (OAuth)"
17
+ discriminator = "auth_type"
18
+
19
+ auth_type: Literal["Client"] = Field("Client", const=True)
20
+ client_id: str = Field(
21
+ title="Client ID",
22
+ description="Client ID for the Google Drive API",
23
+ airbyte_secret=True,
24
+ )
25
+ client_secret: str = Field(
26
+ title="Client Secret",
27
+ description="Client Secret for the Google Drive API",
28
+ airbyte_secret=True,
29
+ )
30
+ refresh_token: str = Field(
31
+ title="Refresh Token",
32
+ description="Refresh Token for the Google Drive API",
33
+ airbyte_secret=True,
34
+ )
35
+
36
+
37
+ class ServiceAccountCredentials(BaseModel):
38
+ class Config(OneOfOptionConfig):
39
+ title = "Service Account Key Authentication"
40
+ discriminator = "auth_type"
41
+
42
+ auth_type: Literal["Service"] = Field("Service", const=True)
43
+ service_account_info: str = Field(
44
+ title="Service Account Information",
45
+ description='The JSON key of the service account to use for authorization. Read more <a href="https://cloud.google.com/iam/docs/creating-managing-service-account-keys#creating_service_account_keys">here</a>.',
46
+ airbyte_secret=True,
47
+ )
48
+
49
+
50
+ class SourceGoogleDriveSpec(AbstractFileBasedSpec, BaseModel):
51
+ class Config:
52
+ title = "Google Drive Source Spec"
53
+
54
+ folder_url: str = Field(
55
+ description="URL for the folder you want to sync. Using individual streams and glob patterns, it's possible to only sync a subset of all files located in the folder.",
56
+ examples=["https://drive.google.com/drive/folders/1Xaz0vXXXX2enKnNYU5qSt9NS70gvMyYn"],
57
+ order=0,
58
+ pattern="^https://drive.google.com/.+",
59
+ pattern_descriptor="https://drive.google.com/drive/folders/MY-FOLDER-ID",
60
+ )
61
+
62
+ credentials: Union[OAuthCredentials, ServiceAccountCredentials] = Field(
63
+ title="Authentication", description="Credentials for connecting to the Google Drive API", discriminator="auth_type", type="object"
64
+ )
65
+
66
+ @classmethod
67
+ def documentation_url(cls) -> str:
68
+ return "https://docs.airbyte.com/integrations/sources/google-drive"
69
+
70
+ @classmethod
71
+ def schema(cls, *args: Any, **kwargs: Any) -> Dict[str, Any]:
72
+ """
73
+ Generates the JSON schema mapping for the config fields, with legacy and hidden options removed.
74
+ """
75
+ schema = super().schema(*args, **kwargs)
76
+
77
+ # Remove legacy settings
78
+ dpath.util.delete(schema, "properties/streams/items/properties/legacy_prefix")
79
+ dpath.util.delete(schema, "properties/streams/items/properties/format/oneOf/*/properties/inference_type")
80
+
81
+ # Hide API processing option until https://github.com/airbytehq/airbyte-platform-internal/issues/10354 is fixed
82
+ processing_options = dpath.util.get(schema, "properties/streams/items/properties/format/oneOf/4/properties/processing/oneOf")
83
+ dpath.util.set(schema, "properties/streams/items/properties/format/oneOf/4/properties/processing/oneOf", processing_options[:1])
84
+
85
+ return schema
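Review note: the `schema` override above removes legacy keys and keeps only the first processing option. A minimal sketch of the same kind of pruning on a toy schema follows, using hand-written helpers instead of `dpath`. The helpers `get_path` and `delete_path`, the option titles, and the `oneOf` index `0` (the source uses index `4`) are assumptions made for illustration only.

```python
from typing import Any


def get_path(node: Any, path: str) -> Any:
    """Follow a '/'-separated path; numeric segments index into lists."""
    for key in path.split("/"):
        node = node[int(key)] if isinstance(node, list) else node[key]
    return node


def delete_path(root: Any, path: str) -> None:
    """Delete the dict key addressed by path (no wildcard support in this sketch)."""
    parent_path, _, leaf = path.rpartition("/")
    parent = get_path(root, parent_path) if parent_path else root
    del parent[leaf]


# Toy schema: one format option with two made-up processing options.
schema = {
    "properties": {
        "streams": {
            "items": {
                "properties": {
                    "legacy_prefix": {"type": "string"},
                    "format": {
                        "oneOf": [
                            {"properties": {"processing": {"oneOf": [{"title": "Local"}, {"title": "via API"}]}}}
                        ]
                    },
                }
            }
        }
    }
}

# Remove the legacy setting.
delete_path(schema, "properties/streams/items/properties/legacy_prefix")

# Keep only the first processing option, hiding the rest.
processing = get_path(schema, "properties/streams/items/properties/format/oneOf/0/properties/processing")
processing["oneOf"] = processing["oneOf"][:1]
```

Slicing with `[:1]` rather than indexing keeps `oneOf` a list, so the schema stays valid for consumers that expect an array there.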
@@ -0,0 +1,185 @@
1
+ #
2
+ # Copyright (c) 2023 Airbyte, Inc., all rights reserved.
3
+ #
4
+
5
+ import io
6
+ import json
7
+ import logging
8
+ import re
9
+ from datetime import datetime
10
+ from io import IOBase
11
+ from typing import Iterable, List, Optional, Set
12
+
13
+ from airbyte_cdk.sources.file_based.file_based_stream_reader import AbstractFileBasedStreamReader, FileReadMode
14
+ from airbyte_cdk.sources.file_based.remote_file import RemoteFile
15
+ from airbyte_cdk.utils.traced_exception import AirbyteTracedException, FailureType
16
+ from google.oauth2 import credentials, service_account
17
+ from googleapiclient.discovery import build
18
+ from googleapiclient.http import MediaIoBaseDownload
19
+ from source_google_drive.utils import get_folder_id
20
+
21
+ from .spec import SourceGoogleDriveSpec
22
+
23
+ FOLDER_MIME_TYPE = "application/vnd.google-apps.folder"
24
+ GOOGLE_DOC_MIME_TYPE = "application/vnd.google-apps.document"
25
+ EXPORTABLE_DOCUMENTS_MIME_TYPES = [
26
+ GOOGLE_DOC_MIME_TYPE,
27
+ "application/vnd.google-apps.presentation",
28
+ "application/vnd.google-apps.drawing",
29
+ ]
30
+
31
+
32
+ class GoogleDriveRemoteFile(RemoteFile):
33
+ id: str
34
+ # The mime type of the file as returned by the Google Drive API
35
+ # This is not the same as the mime type when opened by the parser (e.g. google docs is exported as docx)
36
+ original_mime_type: str
37
+
38
+
39
+ class SourceGoogleDriveStreamReader(AbstractFileBasedStreamReader):
40
+ def __init__(self):
41
+ super().__init__()
42
+ self._drive_service = None
43
+
44
+ @property
45
+ def config(self) -> SourceGoogleDriveSpec:
46
+ return self._config
47
+
48
+ @config.setter
49
+ def config(self, value: SourceGoogleDriveSpec):
50
+ """
51
+ FileBasedSource reads the config from disk and parses it, and once parsed, the source sets the config on its StreamReader.
52
+
53
+ Note: FileBasedSource only requires the keys defined in the abstract config, whereas concrete implementations of StreamReader
54
+ will require keys that (for example) allow it to authenticate with the 3rd party.
55
+
56
+ Therefore, concrete implementations of AbstractFileBasedStreamReader's config setter should assert that `value` is of the correct
57
+ config type for that type of StreamReader.
58
+ """
59
+ assert isinstance(value, SourceGoogleDriveSpec)
60
+ self._config = value
61
+
62
+ @property
63
+ def google_drive_service(self):
64
+ if self.config is None:
65
+ # We shouldn't hit this; config should always get set before attempting to
66
+ # list or read files.
67
+ raise ValueError("Source config is missing; cannot create the Google Drive client.")
68
+ try:
69
+ if self._drive_service is None:
70
+ if self.config.credentials.auth_type == "Client":
71
+ creds = credentials.Credentials.from_authorized_user_info(self.config.credentials.dict())
72
+ else:
73
+ creds = service_account.Credentials.from_service_account_info(json.loads(self.config.credentials.service_account_info))
74
+ self._drive_service = build("drive", "v3", credentials=creds)
75
+ except Exception as e:
76
+ raise AirbyteTracedException(
77
+ internal_message=str(e),
78
+ message="Could not authenticate with Google Drive. Please check your credentials.",
79
+ failure_type=FailureType.config_error,
80
+ exception=e,
81
+ )
82
+
83
+ return self._drive_service
84
+
85
+ def get_matching_files(self, globs: List[str], prefix: Optional[str], logger: logging.Logger) -> Iterable[RemoteFile]:
86
+ """
87
+ Get all files matching the specified glob patterns.
88
+ """
89
+ service = self.google_drive_service
90
+ root_folder_id = get_folder_id(self.config.folder_url)
91
+ # ignore prefix argument as it's legacy only and this is a new connector
92
+ prefixes = self.get_prefixes_from_globs(globs)
93
+
94
+ folder_id_queue = [("", root_folder_id)]
95
+ seen: Set[str] = set()
96
+ while len(folder_id_queue) > 0:
97
+ (path, folder_id) = folder_id_queue.pop()
98
+ # fetch all files in this folder (1000 is the max page size)
99
+ # supportsAllDrives and includeItemsFromAllDrives are required to access files in shared drives
100
+ request = service.files().list(
101
+ q=f"'{folder_id}' in parents",
102
+ pageSize=1000,
103
+ fields="nextPageToken, files(id, name, modifiedTime, mimeType)",
104
+ supportsAllDrives=True,
105
+ includeItemsFromAllDrives=True,
106
+ )
107
+ while True:
108
+ results = request.execute()
109
+ new_files = results.get("files", [])
110
+ for new_file in new_files:
111
+ # Files and folders can be linked up multiple times; tracking seen IDs prevents us from getting stuck in a loop
112
+ if new_file["id"] in seen:
113
+ continue
114
+ seen.add(new_file["id"])
115
+ file_name = path + new_file["name"]
116
+ if new_file["mimeType"] == FOLDER_MIME_TYPE:
117
+ folder_name = f"{file_name}/"
118
+ # check prefix matching in both directions: a prefix may point deeper than this folder, or this folder may already lie inside a prefix
119
+ prefix_matches_folder_name = any(prefix.startswith(folder_name) for prefix in prefixes)
120
+ folder_name_matches_prefix = any(folder_name.startswith(prefix) for prefix in prefixes)
121
+ if prefix_matches_folder_name or folder_name_matches_prefix or len(prefixes) == 0:
122
+ folder_id_queue.append((folder_name, new_file["id"]))
123
+ continue
124
+ else:
125
+ last_modified = datetime.strptime(new_file["modifiedTime"], "%Y-%m-%dT%H:%M:%S.%fZ")
126
+ original_mime_type = new_file["mimeType"]
127
+ mime_type = (
128
+ self._get_export_mime_type(original_mime_type)
129
+ if self._is_exportable_document(original_mime_type)
130
+ else original_mime_type
131
+ )
132
+ remote_file = GoogleDriveRemoteFile(
133
+ uri=file_name,
134
+ last_modified=last_modified,
135
+ id=new_file["id"],
136
+ original_mime_type=original_mime_type,
137
+ mime_type=mime_type,
138
+ )
139
+ if self.file_matches_globs(remote_file, globs):
140
+ yield remote_file
141
+ request = service.files().list_next(request, results)
142
+ if request is None:
143
+ break
144
+
145
+ def _is_exportable_document(self, mime_type: str):
146
+ """
147
+ Returns true if the given file is a Google App document that can be exported.
148
+ """
149
+ return mime_type in EXPORTABLE_DOCUMENTS_MIME_TYPES
150
+
151
+ def open_file(self, file: GoogleDriveRemoteFile, mode: FileReadMode, encoding: Optional[str], logger: logging.Logger) -> IOBase:
152
+ if self._is_exportable_document(file.original_mime_type):
153
+ if mode == FileReadMode.READ:
154
+ raise ValueError(
155
+ "Google Docs/Drawings/Presentations can only be processed using the document file type format. Please set the format accordingly or adjust the glob pattern."
156
+ )
157
+ request = self.google_drive_service.files().export_media(fileId=file.id, mimeType=file.mime_type)
158
+ else:
159
+ request = self.google_drive_service.files().get_media(fileId=file.id)
160
+ handle = io.BytesIO()
161
+ downloader = MediaIoBaseDownload(handle, request)
162
+ done = False
163
+ while done is False:
164
+ _, done = downloader.next_chunk()
165
+
166
+ handle.seek(0)
167
+
168
+ if mode == FileReadMode.READ_BINARY:
169
+ return handle
170
+ else:
171
+ # repack the bytes into a string with the right encoding
172
+ text_handle = io.StringIO(handle.read().decode(encoding or "utf-8"))
173
+ handle.close()
174
+ return text_handle
175
+
176
+ def _get_export_mime_type(self, original_mime_type: str):
177
+ """
178
+ Returns the mime type to export Google App documents as.
179
+
180
+ Google Docs are exported as Docx to preserve as much formatting as possible; everything else is exported as PDF.
181
+ """
182
+ if original_mime_type.startswith(GOOGLE_DOC_MIME_TYPE):
183
+ return "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
184
+ else:
185
+ return "application/pdf"
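Review note: `get_matching_files` walks the folder tree with an explicit stack and only descends into folders that can still lead to a match. The sketch below reproduces that pruning on an in-memory tree instead of the Drive API. The tree layout, `should_descend`, and `walk` are assumptions for illustration, and the prefixes are assumed to be the literal leading part of each glob.

```python
from typing import Any, Dict, Iterator, List


def should_descend(folder_name: str, prefixes: List[str]) -> bool:
    """Descend if there are no prefixes, or the folder is above or inside some prefix."""
    if not prefixes:
        return True
    return any(p.startswith(folder_name) or folder_name.startswith(p) for p in prefixes)


def walk(tree: Dict[str, Any], prefixes: List[str]) -> Iterator[str]:
    """Yield file paths; dict values are folders, None values are files."""
    stack = [("", tree)]
    while stack:
        path, folder = stack.pop()
        for name, child in folder.items():
            full_name = path + name
            if isinstance(child, dict):
                folder_name = f"{full_name}/"
                if should_descend(folder_name, prefixes):
                    stack.append((folder_name, child))
            else:
                yield full_name


tree = {
    "reports": {"2023": {"q1.csv": None}, "draft.txt": None},
    "images": {"logo.png": None},
    "readme.md": None,
}
found = sorted(walk(tree, ["reports/2023/"]))
```

Pruning only skips folders: `images/` is never listed, but files in visited folders are still yielded here. In the source, each file is additionally checked with `file_matches_globs` before it is returned.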
@@ -0,0 +1,21 @@
1
+ # Copyright (c) 2023 Airbyte, Inc., all rights reserved.
2
+
3
+ from urllib.parse import urlparse
4
+
5
+
6
+ def get_folder_id(url_string: str) -> str:
7
+ """
8
+ Extract the folder ID from a Google Drive folder URL.
9
+
10
+ Takes the last path segment of the URL, which is the folder ID (ignoring trailing slashes and query parameters).
11
+ """
12
+ try:
13
+ parsed_url = urlparse(url_string)
14
+ if parsed_url.scheme != "https" or parsed_url.netloc != "drive.google.com":
15
+ raise ValueError("Folder URL has to be of the form https://drive.google.com/drive/folders/<folder_id>")
16
+ path_segments = list(filter(None, parsed_url.path.split("/")))
17
+ if len(path_segments) < 3 or path_segments[-2] != "folders":
18
+ raise ValueError("Folder URL has to be of the form https://drive.google.com/drive/folders/<folder_id>")
19
+ return path_segments[-1]
20
+ except Exception:
21
+ raise ValueError("Folder URL is invalid")
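Review note: a minimal standalone sketch of the folder ID parsing follows, with the length check placed before the index access so short paths are rejected explicitly rather than through an `IndexError`. The constant name `EXPECTED_FORM` and the sample URLs are assumptions for illustration. Unlike the source, this sketch does not wrap the body in a catch-all `except`.

```python
from urllib.parse import urlparse

EXPECTED_FORM = "Folder URL has to be of the form https://drive.google.com/drive/folders/<folder_id>"


def get_folder_id(url_string: str) -> str:
    """Return the last path segment of a Drive folder URL, ignoring trailing slashes and query parameters."""
    parsed_url = urlparse(url_string)
    if parsed_url.scheme != "https" or parsed_url.netloc != "drive.google.com":
        raise ValueError(EXPECTED_FORM)
    # Drop empty segments produced by leading, trailing, or doubled slashes.
    path_segments = [segment for segment in parsed_url.path.split("/") if segment]
    # Check the length first so the [-2] access below cannot fail.
    if len(path_segments) < 3 or path_segments[-2] != "folders":
        raise ValueError(EXPECTED_FORM)
    return path_segments[-1]


folder_id = get_folder_id("https://drive.google.com/drive/folders/abc123/?usp=sharing")
```

Because the query string and the trailing slash are not part of the non-empty path segments, both are ignored when the ID is extracted.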