spark-nlp 5.5.2__py2.py3-none-any.whl → 5.5.3__py2.py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of spark-nlp has been flagged as possibly problematic.

@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: spark-nlp
- Version: 5.5.2
+ Version: 5.5.3
  Summary: John Snow Labs Spark NLP is a natural language processing library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines, that scale easily in a distributed environment.
  Home-page: https://github.com/JohnSnowLabs/spark-nlp
  Author: John Snow Labs
@@ -95,7 +95,7 @@ $ java -version
  $ conda create -n sparknlp python=3.7 -y
  $ conda activate sparknlp
  # spark-nlp by default is based on pyspark 3.x
- $ pip install spark-nlp==5.5.2 pyspark==3.3.1
+ $ pip install spark-nlp==5.5.3 pyspark==3.3.1
  ```

  In Python console or Jupyter `Python3` kernel:
@@ -161,7 +161,7 @@ For a quick example of using pipelines and models take a look at our official [d

  ### Apache Spark Support

- Spark NLP *5.5.2* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
+ Spark NLP *5.5.3* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x

  | Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x |
  |-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
@@ -189,7 +189,7 @@ Find out more about 4.x `SparkNLP` versions in our official [documentation](http

  ### Databricks Support

- Spark NLP 5.5.2 has been tested and is compatible with the following runtimes:
+ Spark NLP 5.5.3 has been tested and is compatible with the following runtimes:

  | **CPU** | **GPU** |
  |--------------------|--------------------|
@@ -206,7 +206,7 @@ We are compatible with older runtimes. For a full list check databricks support

  ### EMR Support

- Spark NLP 5.5.2 has been tested and is compatible with the following EMR releases:
+ Spark NLP 5.5.3 has been tested and is compatible with the following EMR releases:

  | **EMR Release** |
  |--------------------|
@@ -237,7 +237,7 @@ deployed to Maven central. To add any of our packages as a dependency in your ap
  from our official documentation.

  If you are interested, there is a simple SBT project for Spark NLP to guide you on how to use it in your
- projects [Spark NLP SBT S5.5.2r](https://github.com/maziyarpanahi/spark-nlp-starter)
+ projects [Spark NLP SBT S5.5.3r](https://github.com/maziyarpanahi/spark-nlp-starter)

  ### Python

@@ -282,7 +282,7 @@ In Spark NLP we can define S3 locations to:

  Please check [these instructions](https://sparknlp.org/docs/en/install#s3-integration) from our official documentation.

- ## Document5.5.2
+ ## Document5.5.3

  ### Examples

@@ -315,7 +315,7 @@ the Spark NLP library:
  keywords = {Spark, Natural language processing, Deep learning, Tensorflow, Cluster},
  abstract = {Spark NLP is a Natural Language Processing (NLP) library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment. Spark NLP comes with 1100+ pretrained pipelines and models in more than 192+ languages. It supports nearly all the NLP tasks and modules that can be used seamlessly in a cluster. Downloaded more than 2.7 million times and experiencing 9x growth since January 2020, Spark NLP is used by 54% of healthcare organizations as the world’s most widely used NLP library in the enterprise.}
  }
- }5.5.2
+ }5.5.3
  ```

  ## Community support
@@ -3,7 +3,7 @@ com/johnsnowlabs/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,
  com/johnsnowlabs/ml/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
  com/johnsnowlabs/ml/ai/__init__.py,sha256=YQiK2M7U4d8y5irPy_HB8ae0mSpqS9583MH44pnKJXc,295
  com/johnsnowlabs/nlp/__init__.py,sha256=DPIVXtONO5xXyOk-HB0-sNiHAcco17NN13zPS_6Uw8c,294
- sparknlp/__init__.py,sha256=cdvKsW7Cb_LLCaot-GsMcb8n0RUXqr9NRpsallJamq0,13783
+ sparknlp/__init__.py,sha256=Wmw9AZuFatQEjZ0WucHWPO4yF4HTsEZOVZ27IaEAbok,13783
  sparknlp/annotation.py,sha256=I5zOxG5vV2RfPZfqN9enT1i4mo6oBcn3Lrzs37QiOiA,5635
  sparknlp/annotation_audio.py,sha256=iRV_InSVhgvAwSRe9NTbUH9v6OGvTM-FPCpSAKVu0mE,1917
  sparknlp/annotation_image.py,sha256=xhCe8Ko-77XqWVuuYHFrjKqF6zPd8Z-RY_rmZXNwCXU,2547
@@ -90,7 +90,7 @@ sparknlp/annotator/embeddings/albert_embeddings.py,sha256=6Rd1LIn8oFIpq_ALcJh-RU
  sparknlp/annotator/embeddings/auto_gguf_embeddings.py,sha256=ngqjiXUqkL3xOrmt42bY8pp7azgbIWqXGfbKud1CijM,19981
  sparknlp/annotator/embeddings/bert_embeddings.py,sha256=HVUjkg56kBcpGZCo-fmPG5uatMDF3swW_lnbpy1SgSI,8463
  sparknlp/annotator/embeddings/bert_sentence_embeddings.py,sha256=NQy9KuXT9aKsTpYCR5RAeoFWI2YqEGorbdYrf_0KKmw,9148
- sparknlp/annotator/embeddings/bge_embeddings.py,sha256=hXFFd9HOru1w2L9N5YGSZlaKyxqMsZccpaI4Z8-bNUE,7919
+ sparknlp/annotator/embeddings/bge_embeddings.py,sha256=Y4b6QzRJGc_Z9_R6SYq-P5NxcvI9XzJlBzwCLLHJpRo,8103
  sparknlp/annotator/embeddings/camembert_embeddings.py,sha256=dBTXas-2Tas_JUR9Xt_GtHLcyqi_cdvT5EHRnyVrSSQ,8817
  sparknlp/annotator/embeddings/chunk_embeddings.py,sha256=WUmkJimSuFkdcLJnvcxOV0QlCLgGlhub29ZTrZb70WE,6052
  sparknlp/annotator/embeddings/deberta_embeddings.py,sha256=_b5nzLb7heFQNN-uT2oBNO6-YmM8bHmAdnGXg47HOWw,8649
@@ -199,7 +199,7 @@ sparknlp/common/annotator_properties.py,sha256=7B1os7pBUfHo6b7IPQAXQ-nir0u3tQLzD
  sparknlp/common/annotator_type.py,sha256=ash2Ip1IOOiJamPVyy_XQj8Ja_DRHm0b9Vj4Ni75oKM,1225
  sparknlp/common/coverage_result.py,sha256=No4PSh1HSs3PyRI1zC47x65tWgfirqPI290icHQoXEI,823
  sparknlp/common/match_strategy.py,sha256=kt1MUPqU1wCwk5qCdYk6jubHbU-5yfAYxb9jjAOrdnY,1678
- sparknlp/common/properties.py,sha256=454BAfebYhg_l7lfjXSCKPWzmCgmU3IT-r2yLGG22DI,22912
+ sparknlp/common/properties.py,sha256=TMUpY0EQ3b-GXO9iuctkKrunLhRYePqu2fbmHfocr2w,23870
  sparknlp/common/read_as.py,sha256=imxPGwV7jr4Li_acbo0OAHHRGCBbYv-akzEGaBWEfcY,1226
  sparknlp/common/recursive_annotator_approach.py,sha256=vqugBw22cE3Ff7PIpRlnYFuOlchgL0nM26D8j-NdpqU,1449
  sparknlp/common/storage.py,sha256=D91H3p8EIjNspjqAYu6ephRpCUtdcAir4_PrAbkIQWE,4842
@@ -217,7 +217,7 @@ sparknlp/pretrained/pretrained_pipeline.py,sha256=lquxiaABuA68Rmu7csamJPqBoRJqMU
  sparknlp/pretrained/resource_downloader.py,sha256=8_-rpvO2LsX_Lq4wMPif2ca3RlJZWEabt8pDm2xymiI,7806
  sparknlp/pretrained/utils.py,sha256=T1MrvW_DaWk_jcOjVLOea0NMFE9w8fe0ZT_5urZ_nEY,1099
  sparknlp/reader/__init__.py,sha256=-Toj3AIBki-zXPpV8ezFTI2LX1yP_rK2bhpoa8nBkTw,685
- sparknlp/reader/sparknlp_reader.py,sha256=SLQ5KCWbHnR4S0DwdjRQw_NvaUTchrE0gVCHs__xAy8,17054
+ sparknlp/reader/sparknlp_reader.py,sha256=cMliB2zDcmhxp44mu8aRcm5nFK2BXeFCuGgVUkhI8YQ,3825
  sparknlp/training/__init__.py,sha256=qREi9u-5Vc2VjpL6-XZsyvu5jSEIdIhowW7_kKaqMqo,852
  sparknlp/training/conll.py,sha256=wKBiSTrjc6mjsl7Nyt6B8f4yXsDJkZb-sn8iOjix9cE,6961
  sparknlp/training/conllu.py,sha256=8r3i-tmyrLsyk1DtZ9uo2mMDCWb1yw2Y5W6UsV13MkY,4953
@@ -248,8 +248,8 @@ sparknlp/training/_tf_graph_builders_1x/ner_dl/dataset_encoder.py,sha256=R4yHFN3
  sparknlp/training/_tf_graph_builders_1x/ner_dl/ner_model.py,sha256=EoCSdcIjqQ3wv13MAuuWrKV8wyVBP0SbOEW41omHlR0,23189
  sparknlp/training/_tf_graph_builders_1x/ner_dl/ner_model_saver.py,sha256=k5CQ7gKV6HZbZMB8cKLUJuZxoZWlP_DFWdZ--aIDwsc,2356
  sparknlp/training/_tf_graph_builders_1x/ner_dl/sentence_grouper.py,sha256=pAxjWhjazSX8Vg0MFqJiuRVw1IbnQNSs-8Xp26L4nko,870
- spark_nlp-5.5.2.dist-info/.uuid,sha256=1f6hF51aIuv9yCvh31NU9lOpS34NE-h3a0Et7R9yR6A,36
- spark_nlp-5.5.2.dist-info/METADATA,sha256=iFDm_OdynA95nwoWm1vbJcF3i7uuRSnn7S9eU1t5_3c,19156
- spark_nlp-5.5.2.dist-info/WHEEL,sha256=bb2Ot9scclHKMOLDEHY6B2sicWOgugjFKaJsT7vwMQo,110
- spark_nlp-5.5.2.dist-info/top_level.txt,sha256=uuytur4pyMRw2H_txNY2ZkaucZHUs22QF8-R03ch_-E,13
- spark_nlp-5.5.2.dist-info/RECORD,,
+ spark_nlp-5.5.3.dist-info/.uuid,sha256=1f6hF51aIuv9yCvh31NU9lOpS34NE-h3a0Et7R9yR6A,36
+ spark_nlp-5.5.3.dist-info/METADATA,sha256=rZJcS1xIcl3Vota-hC2wHauvrHO45e9c8Y86MjVt4go,19156
+ spark_nlp-5.5.3.dist-info/WHEEL,sha256=bb2Ot9scclHKMOLDEHY6B2sicWOgugjFKaJsT7vwMQo,110
+ spark_nlp-5.5.3.dist-info/top_level.txt,sha256=uuytur4pyMRw2H_txNY2ZkaucZHUs22QF8-R03ch_-E,13
+ spark_nlp-5.5.3.dist-info/RECORD,,
sparknlp/__init__.py CHANGED
@@ -132,7 +132,7 @@ def start(gpu=False,
  The initiated Spark session.

  """
- current_version = "5.5.2"
+ current_version = "5.5.3"

  if params is None:
  params = {}
@@ -316,4 +316,4 @@ def version():
  str
  The current Spark NLP version.
  """
- return '5.5.2'
+ return '5.5.3'
sparknlp/annotator/embeddings/bge_embeddings.py CHANGED
@@ -21,7 +21,8 @@ class BGEEmbeddings(AnnotatorModel,
  HasCaseSensitiveProperties,
  HasStorageRef,
  HasBatchedAnnotate,
- HasMaxSentenceLengthLimit):
+ HasMaxSentenceLengthLimit,
+ HasClsTokenProperties):
  """Sentence embeddings using BGE.

  BGE, or BAAI General Embeddings, a model that can map any text to a low-dimensional dense
@@ -60,6 +61,8 @@ class BGEEmbeddings(AnnotatorModel,
  Max sentence length to process, by default 512
  configProtoBytes
  ConfigProto from tensorflow, serialized into byte array.
+ useCLSToken
+ Whether to use the CLS token for sentence embeddings, by default True

  References
  ----------
@@ -148,6 +151,7 @@ class BGEEmbeddings(AnnotatorModel,
  batchSize=8,
  maxSentenceLength=512,
  caseSensitive=False,
+ useCLSToken=True
  )

  @staticmethod
@@ -171,13 +175,13 @@ class BGEEmbeddings(AnnotatorModel,
  return BGEEmbeddings(java_model=jModel)

  @staticmethod
- def pretrained(name="bge_base", lang="en", remote_loc=None):
+ def pretrained(name="bge_small_en_v1.5", lang="en", remote_loc=None):
  """Downloads and loads a pretrained model.

  Parameters
  ----------
  name : str, optional
- Name of the pretrained model, by default "bge_base"
+ Name of the pretrained model, by default "bge_small_en_v1.5"
  lang : str, optional
  Language of the pretrained model, by default "en"
  remote_loc : str, optional
sparknlp/common/properties.py CHANGED
@@ -67,6 +67,33 @@ class HasCaseSensitiveProperties:
  return self.getOrDefault(self.caseSensitive)


+ class HasClsTokenProperties:
+ useCLSToken = Param(Params._dummy(),
+ "useCLSToken",
+ "Whether to use CLS token for pooling (true) or attention-based average pooling (false)",
+ typeConverter=TypeConverters.toBoolean)
+
+ def setUseCLSToken(self, value):
+ """Sets whether to use CLS token for pooling (true) or attention-based average pooling (false).
+
+ Parameters
+ ----------
+ value : bool
+ Whether to use CLS token for pooling (true) or attention-based average pooling (false)
+ """
+ return self._set(useCLSToken=value)
+
+ def getUseCLSToken(self):
+ """Gets whether to use CLS token for pooling (true) or attention-based average pooling (false)
+
+ Returns
+ -------
+ bool
+ Whether to use CLS token for pooling (true) or attention-based average pooling (false)
+ """
+ return self.getOrDefault(self.useCLSToken)
+
+
  class HasClassifierActivationProperties:
  activation = Param(Params._dummy(),
  "activation",
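The new `useCLSToken` flag selects between the two pooling strategies named in its docstring. As a rough, framework-free sketch of what that choice means (plain Python with hypothetical token embeddings; this is an illustration, not Spark NLP's internal implementation):

```python
# Sketch: CLS-token pooling vs. attention-masked average pooling.
# token_embs holds one vector per token; attention_mask is 1 for real
# tokens and 0 for padding. Illustrative only, not spark-nlp code.

def cls_pooling(token_embs):
    # useCLSToken=True: the first ([CLS]) token vector represents the sentence.
    return token_embs[0]

def masked_mean_pooling(token_embs, attention_mask):
    # useCLSToken=False: average only the real-token vectors, skipping padding.
    dim = len(token_embs[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embs, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / count for t in total]

embs = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # [CLS], one token, one pad
mask = [1, 1, 0]
print(cls_pooling(embs))                # [1.0, 2.0]
print(masked_mean_pooling(embs, mask))  # [2.0, 3.0]
```

The default (`useCLSToken=True`) matches how BGE-style models are usually pooled; the flag makes the alternative explicit rather than baked in.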
sparknlp/reader/sparknlp_reader.py CHANGED
@@ -15,82 +15,19 @@ from sparknlp.internal import ExtendedJavaWrapper


  class SparkNLPReader(ExtendedJavaWrapper):
- """Instantiates class to read HTML files.
+ """Instantiates class to read HTML, email, and document files.

- Two types of input paths are supported,
+ Two types of input paths are supported:

- htmlPath: this is a path to a directory of HTML files or a path to an HTML file
- E.g. "path/html/files"
-
- url: this is the URL or set of URLs of a website . E.g., "https://www.wikipedia.org"
+ - `htmlPath`: A path to a directory of HTML files or a single HTML file (e.g., `"path/html/files"`).
+ - `url`: A single URL or a set of URLs (e.g., `"https://www.wikipedia.org"`).

  Parameters
  ----------
- params : spark
- Spark session
+ spark : SparkSession
+ The active Spark session.
  params : dict, optional
- Parameter with custom configuration
-
- Examples
- --------
- >>> from sparknlp.reader import SparkNLPReader
- >>> html_df = SparkNLPReader().html(spark, "https://www.wikipedia.org")
-
- You can use SparkNLP for one line of code
- >>> import sparknlp
- >>> html_df = sparknlp.read().html("https://www.wikipedia.org")
- >>> html_df.show(truncate=False)
-
- +--------------------+--------------------------------------------------------------------------------------------------------------------------------+
- |url |html |
- +--------------------+--------------------------------------------------------------------------------------------------------------------------------+
- |https://example.com/|[{Title, Example Domain, {pageNumber -> 1}}, {NarrativeText, 0, This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission., {pageNumber -> 1}}, {NarrativeText, 0, More information... More information..., {pageNumber -> 1}}] |
- +--------------------+--------------------------------------------------------------------------------------------------------------------------------+
- >>> html_df.printSchema()
-
- root
- |-- url: string (nullable = true)
- |-- html: array (nullable = true)
- | |-- element: struct (containsNull = true)
- | | |-- elementType: string (nullable = true)
- | | |-- content: string (nullable = true)
- | | |-- metadata: map (nullable = true)
- | | | |-- key: string
- | | | |-- value: string (valueContainsNull = true)
-
-
-
- Instantiates class to read email files.
-
- emailPath: this is a path to a directory of HTML files or a path to an HTML file E.g.
- "path/html/emails"
-
- Examples
- --------
- >>> from sparknlp.reader import SparkNLPReader
- >>> email_df = SparkNLPReader().email(spark, "home/user/emails-directory")
-
- You can use SparkNLP for one line of code
- >>> import sparknlp
- >>> email_df = sparknlp.read().email("home/user/emails-directory")
- >>> email_df.show(truncate=False)
- +----------------------------------------------------------------------------------------------------------------------------------------------------+
- |email |
- +----------------------------------------------------------------------------------------------------------------------------------------------------+
- |[{Title, Email Text Attachments, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>}}, {NarrativeText, Email test with two text attachments\r\n\r\nCheers,\r\n\r\n, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>, mimeType -> text/plain}}, {NarrativeText, <html>\r\n<head>\r\n<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">\r\n<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>\r\n</head>\r\n<body dir="ltr">\r\n<span style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">Email&nbsp; test with two text attachments</span>\r\n<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">\r\n<br>\r\n</div>\r\n<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">\r\nCheers,</div>\r\n<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">\r\n<br>\r\n</div>\r\n</body>\r\n</html>\r\n, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>, mimeType -> text/html}}, {Attachment, filename.txt, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>, contentType -> text/plain; name="filename.txt"}}, {NarrativeText, This is the content of the file.\n, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>, mimeType -> text/plain}}, {Attachment, filename2.txt, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>, contentType -> text/plain; name="filename2.txt"}}, {NarrativeText, This is an additional content file.\n, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>, mimeType -> text/plain}}]|
- +----------------------------------------------------------------------------------------------------------------------------------------------------+
- email_df.printSchema()
- root
- |-- path: string (nullable = true)
- |-- content: array (nullable = true)
- |-- email: array (nullable = true)
- | |-- element: struct (containsNull = true)
- | | |-- elementType: string (nullable = true)
- | | |-- content: string (nullable = true)
- | | |-- metadata: map (nullable = true)
- | | | |-- key: string
- | | | |-- value: string (valueContainsNull = true)
-
+ A dictionary with custom configurations.
  """

  def __init__(self, spark, params=None):
@@ -100,22 +37,77 @@ class SparkNLPReader(ExtendedJavaWrapper):
  self.spark = spark

  def html(self, htmlPath):
+ """Reads HTML files or URLs and returns a Spark DataFrame.
+
+ Parameters
+ ----------
+ htmlPath : str or list of str
+ Path(s) to HTML file(s) or a list of URLs.
+
+ Returns
+ -------
+ pyspark.sql.DataFrame
+ A DataFrame containing the parsed HTML content.
+
+ Examples
+ --------
+ >>> from sparknlp.reader import SparkNLPReader
+ >>> html_df = SparkNLPReader(spark).html("https://www.wikipedia.org")
+
+ You can also use SparkNLP to simplify the process:
+
+ >>> import sparknlp
+ >>> html_df = sparknlp.read().html("https://www.wikipedia.org")
+ >>> html_df.show(truncate=False)
+ """
  if not isinstance(htmlPath, (str, list)) or (isinstance(htmlPath, list) and not all(isinstance(item, str) for item in htmlPath)):
  raise TypeError("htmlPath must be a string or a list of strings")
  jdf = self._java_obj.html(htmlPath)
- dataframe = self.getDataFrame(self.spark, jdf)
- return dataframe
+ return self.getDataFrame(self.spark, jdf)

  def email(self, filePath):
+ """Reads email files and returns a Spark DataFrame.
+
+ Parameters
+ ----------
+ filePath : str
+ Path to an email file or a directory containing emails.
+
+ Returns
+ -------
+ pyspark.sql.DataFrame
+ A DataFrame containing parsed email data.
+
+ Examples
+ --------
+ >>> from sparknlp.reader import SparkNLPReader
+ >>> email_df = SparkNLPReader(spark).email("home/user/emails-directory")
+
+ Using SparkNLP:
+
+ >>> import sparknlp
+ >>> email_df = sparknlp.read().email("home/user/emails-directory")
+ >>> email_df.show(truncate=False)
+ """
  if not isinstance(filePath, str):
  raise TypeError("filePath must be a string")
  jdf = self._java_obj.email(filePath)
- dataframe = self.getDataFrame(self.spark, jdf)
- return dataframe
+ return self.getDataFrame(self.spark, jdf)

  def doc(self, docPath):
+ """Reads document files and returns a Spark DataFrame.
+
+ Parameters
+ ----------
+ docPath : str
+ Path to a document file.
+
+ Returns
+ -------
+ pyspark.sql.DataFrame
+ A DataFrame containing parsed document content.
+ """
  if not isinstance(docPath, str):
  raise TypeError("docPath must be a string")
  jdf = self._java_obj.doc(docPath)
- dataframe = self.getDataFrame(self.spark, jdf)
- return dataframe
+ return self.getDataFrame(self.spark, jdf)
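The three reader methods above share one input-validation pattern: accept a string, or (for `html` only) a list of strings, and raise `TypeError` otherwise. A standalone sketch of that check, factored into a helper (the helper name is ours for illustration and is not part of the spark-nlp API):

```python
def validate_path_arg(value, allow_list=False, name="path"):
    """Raise TypeError unless value is a str (or, optionally, a list of str).

    Mirrors the checks in SparkNLPReader.html/email/doc; the helper itself
    is illustrative, not a spark-nlp function.
    """
    if isinstance(value, str):
        return value
    if allow_list and isinstance(value, list) and all(isinstance(item, str) for item in value):
        return value
    kind = "a string or a list of strings" if allow_list else "a string"
    raise TypeError(f"{name} must be {kind}")

# Accepted inputs, matching what html() allows:
validate_path_arg("path/html/files", allow_list=True, name="htmlPath")
validate_path_arg(["https://a.org", "https://b.org"], allow_list=True, name="htmlPath")

# A mixed list fails, as in the real html() check:
try:
    validate_path_arg(["https://a.org", 42], allow_list=True, name="htmlPath")
except TypeError as e:
    print(e)  # htmlPath must be a string or a list of strings
```

Raising early with a `TypeError` keeps the failure on the Python side instead of surfacing later as an opaque Py4J error from the underlying Java object.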