spark-nlp 6.1.4__py2.py3-none-any.whl → 6.2.0__py2.py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.


@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: spark-nlp
-Version: 6.1.4
+Version: 6.2.0
 Summary: John Snow Labs Spark NLP is a natural language processing library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines, that scale easily in a distributed environment.
 Home-page: https://github.com/JohnSnowLabs/spark-nlp
 Author: John Snow Labs
@@ -102,7 +102,7 @@ $ java -version
 $ conda create -n sparknlp python=3.7 -y
 $ conda activate sparknlp
 # spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==6.1.4 pyspark==3.3.1
+$ pip install spark-nlp==6.2.0 pyspark==3.3.1
 ```
 
 In Python console or Jupyter `Python3` kernel:
@@ -168,7 +168,7 @@ For a quick example of using pipelines and models take a look at our official [d
 
 ### Apache Spark Support
 
-Spark NLP *6.1.4* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
+Spark NLP *6.2.0* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
 
 | Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x |
 |-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
@@ -198,7 +198,7 @@ Find out more about 4.x `SparkNLP` versions in our official [documentation](http
 
 ### Databricks Support
 
-Spark NLP 6.1.4 has been tested and is compatible with the following runtimes:
+Spark NLP 6.2.0 has been tested and is compatible with the following runtimes:
 
 | **CPU** | **GPU** |
 |--------------------|--------------------|
@@ -216,7 +216,7 @@ We are compatible with older runtimes. For a full list check databricks support
 
 ### EMR Support
 
-Spark NLP 6.1.4 has been tested and is compatible with the following EMR releases:
+Spark NLP 6.2.0 has been tested and is compatible with the following EMR releases:
 
 | **EMR Release** |
 |--------------------|
@@ -306,7 +306,7 @@ Please check [these instructions](https://sparknlp.org/docs/en/install#s3-integr
 Need more **examples**? Check out our dedicated [Spark NLP Examples](https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples)
 repository to showcase all Spark NLP use cases!
 
-Also, don't forget to check [Spark NLP in Action](https://sparknlp.org/demo) built by Streamlit.
+Also, don't forget to check [Spark NLP in Action](https://sparknlp.org/demos) built by Streamlit.
 
 #### All examples: [spark-nlp/examples](https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples)
 
@@ -3,7 +3,7 @@ com/johnsnowlabs/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,
 com/johnsnowlabs/ml/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
 com/johnsnowlabs/ml/ai/__init__.py,sha256=YQiK2M7U4d8y5irPy_HB8ae0mSpqS9583MH44pnKJXc,295
 com/johnsnowlabs/nlp/__init__.py,sha256=DPIVXtONO5xXyOk-HB0-sNiHAcco17NN13zPS_6Uw8c,294
-sparknlp/__init__.py,sha256=LcfC7bWeae5XgjWbNbWH94LlJkBon5dA8fYnb_2NyGc,13814
+sparknlp/__init__.py,sha256=6cuRDo27cGHCq7oJzF7sAB4sxm8jd9e8ciB_UH1dRT0,13814
 sparknlp/annotation.py,sha256=I5zOxG5vV2RfPZfqN9enT1i4mo6oBcn3Lrzs37QiOiA,5635
 sparknlp/annotation_audio.py,sha256=iRV_InSVhgvAwSRe9NTbUH9v6OGvTM-FPCpSAKVu0mE,1917
 sparknlp/annotation_image.py,sha256=xhCe8Ko-77XqWVuuYHFrjKqF6zPd8Z-RY_rmZXNwCXU,2547
@@ -16,7 +16,7 @@ sparknlp/annotator/chunker.py,sha256=8nz9B7R_mxKxcfJRfKvz2x_T29W3u4izE9k0wfYPzgE
 sparknlp/annotator/dataframe_optimizer.py,sha256=P4GySLzz1lRCZX0UBRF9_IDuXlRS1XvRWz-B2L0zqMA,7771
 sparknlp/annotator/date2_chunk.py,sha256=tW3m_LExmhx8LMFWOGXqMyfNRXSr2dnoEHD-6DrnpXI,3153
 sparknlp/annotator/document_character_text_splitter.py,sha256=oNrOKJAKO2h1wr0bEuSqYrrltIU_Y6J6cTHy70yKy6s,9877
-sparknlp/annotator/document_normalizer.py,sha256=hU2fG6vaPfdngQapoeSu-_zS_LiBZNp2tcVBGl6eTpk,10973
+sparknlp/annotator/document_normalizer.py,sha256=OOqPd6zp7FbtmlLHn1zAxPg9oxDzYRPKLYKr5k0Y5ck,12155
 sparknlp/annotator/document_token_splitter.py,sha256=-9xbQ9pVAjcKHQQrSk6Cb7f8W1cblCLwWXTNR8kFptA,7499
 sparknlp/annotator/document_token_splitter_test.py,sha256=NWO9mwhAIUJFuxPofB3c39iUm_6vKp4pteDsBOTH8ng,2684
 sparknlp/annotator/graph_extraction.py,sha256=b4SB3B_hFgCJT4e5Jcscyxdzfbvw3ujKTa6UNgX5Lhc,14471
@@ -105,7 +105,7 @@ sparknlp/annotator/dependency/dependency_parser.py,sha256=SxyvHPp8Hs1Xnm5X1nLTMi
 sparknlp/annotator/dependency/typed_dependency_parser.py,sha256=60vPdYkbFk9MPGegg3m9Uik9cMXpMZd8tBvXG39gNww,12456
 sparknlp/annotator/embeddings/__init__.py,sha256=Aw1oaP5DI0OS6259c0TEZZ6j3VFSvYFEerah5a-udVw,2528
 sparknlp/annotator/embeddings/albert_embeddings.py,sha256=6Rd1LIn8oFIpq_ALcJh-RUjPEO7Ht8wsHY6JHSFyMkw,9995
-sparknlp/annotator/embeddings/auto_gguf_embeddings.py,sha256=TRAYbhGS4K8uSpsScvDr6uD3lYdxMpCUjwDMhV_74rM,19977
+sparknlp/annotator/embeddings/auto_gguf_embeddings.py,sha256=-64uQKkvWsE2By3LEP9Hv10Eox10QAyVz0vSc_BduvY,20146
 sparknlp/annotator/embeddings/bert_embeddings.py,sha256=HVUjkg56kBcpGZCo-fmPG5uatMDF3swW_lnbpy1SgSI,8463
 sparknlp/annotator/embeddings/bert_sentence_embeddings.py,sha256=NQy9KuXT9aKsTpYCR5RAeoFWI2YqEGorbdYrf_0KKmw,9148
 sparknlp/annotator/embeddings/bge_embeddings.py,sha256=ZGbxssjJFaSfbcgqAPV5hsu81SnC0obgCVNOoJkArDA,8105
@@ -135,7 +135,7 @@ sparknlp/annotator/embeddings/xlm_roberta_embeddings.py,sha256=S2HHXOrSFXMAyloZU
 sparknlp/annotator/embeddings/xlm_roberta_sentence_embeddings.py,sha256=ojxD3H2VgDEn-RzDdCz0X485pojHBAFrlzsNemI05bY,8602
 sparknlp/annotator/embeddings/xlnet_embeddings.py,sha256=hJrlsJeO3D7uz54xiEiqqXEbq24YGuWz8U652PV9fNE,9336
 sparknlp/annotator/er/__init__.py,sha256=eF9Z-PanVfZWSVN2HSFbE7QjCDb6NYV5ESn6geYKlek,692
-sparknlp/annotator/er/entity_ruler.py,sha256=7eZtAwoixkl88jTyKEqTKf9Wzo459VXQkYmFBozUY6A,8784
+sparknlp/annotator/er/entity_ruler.py,sha256=eg9-I9yWQ_vjaKI5g5T4s575VZEjN1Sq7WJJpCImSVg,10007
 sparknlp/annotator/keyword_extraction/__init__.py,sha256=KotCR238x7LgisinsRGaARgPygWUIwC624FmH-sHacE,720
 sparknlp/annotator/keyword_extraction/yake_keyword_extraction.py,sha256=oeB-8qdMoljG-mgFOCsfnpxyK5jFBZnX7jAUQwsnHTc,13215
 sparknlp/annotator/ld_dl/__init__.py,sha256=gWNGOaozABT83J4Mn7JmNQsXzm27s3PHpMQmlXl-5L8,704
@@ -169,9 +169,9 @@ sparknlp/annotator/sentiment/__init__.py,sha256=Lq3vKaZS1YATLMg0VNXSVtkWL5q5G9ta
 sparknlp/annotator/sentiment/sentiment_detector.py,sha256=m545NGU0Xzg_PO6_qIfpli1uZj7JQcyFgqe9R6wAPFI,8154
 sparknlp/annotator/sentiment/vivekn_sentiment.py,sha256=4rpXWDgzU6ddnbrSCp9VdLb2epCc9oZ3c6XcqxEw8nk,9655
 sparknlp/annotator/seq2seq/__init__.py,sha256=aDiph00Hyq7L8uDY0frtyuHtqFodBqTMbixx_nq4z1I,1841
-sparknlp/annotator/seq2seq/auto_gguf_model.py,sha256=yhZQHMHfp88rQvLHTWyS-8imZrwqp-8RQQwnw6PmHfc,11749
-sparknlp/annotator/seq2seq/auto_gguf_reranker.py,sha256=MS4wCm2A2YiQfkB4HVVZKuN-3A1yGzqSCF69nu7J2rQ,12640
-sparknlp/annotator/seq2seq/auto_gguf_vision_model.py,sha256=swBek2026dW6BOX5O9P8Uq41X2GC71VGW0ADFeUIvs0,15299
+sparknlp/annotator/seq2seq/auto_gguf_model.py,sha256=FaKxJaF7BdlQcf3T-nPZWnXRClF8dcYa71QHIaXFigI,11912
+sparknlp/annotator/seq2seq/auto_gguf_reranker.py,sha256=a_70sNooY_9N6KHXVeuM4cDEbHVDlHa1KUWwu0A-l9s,12809
+sparknlp/annotator/seq2seq/auto_gguf_vision_model.py,sha256=59UZKJbI6oYnSNkk2qqf1nhHtB8h3upGRcjZJyl9bGQ,15494
 sparknlp/annotator/seq2seq/bart_transformer.py,sha256=I1flM4yeCzEAKOdQllBC30XuedxVJ7ferkFhZ6gwEbE,18481
 sparknlp/annotator/seq2seq/cohere_transformer.py,sha256=43LZBVazZMgJRCsN7HaYjVYfJ5hRMV95QZyxMtXq-m4,13496
 sparknlp/annotator/seq2seq/cpm_transformer.py,sha256=0CnBFMlxMu0pD2QZMHyoGtIYgXqfUQm68vr6zEAa6Eg,13290
@@ -219,11 +219,12 @@ sparknlp/base/prompt_assembler.py,sha256=_C_9MdHqsxUjSOa3TqCV-6sSfSiRyhfHBQG5m7R
 sparknlp/base/recursive_pipeline.py,sha256=V9rTnu8KMwgjoceykN9pF1mKGtOkkuiC_n9v8dE3LDk,4279
 sparknlp/base/table_assembler.py,sha256=Kxu3R2fY6JgCxEc07ibsMsjip6dgcPDHLiWAZ8gC_d8,5102
 sparknlp/base/token_assembler.py,sha256=qiHry07L7mVCqeHSH6hHxLygv1AsfZIE4jy1L75L3Do,5075
-sparknlp/common/__init__.py,sha256=MJuE__T1YS8f3As7X5sgzHibGjDeiFkQ5vc2bEEf0Ww,1148
+sparknlp/common/__init__.py,sha256=bdnDseYWsKnsBk4KdO_NbPJshF_CeqhO2NFXV1Vu_Ts,1205
 sparknlp/common/annotator_approach.py,sha256=CbkyaWl6rRX_VaXz2xJCjofijRGJGeJCsqQTDQgNTAw,1765
 sparknlp/common/annotator_model.py,sha256=l1vDFi2m_WbWg47Jq0F8DygjndUQhv9Ftfcc8Iceb8s,1880
 sparknlp/common/annotator_properties.py,sha256=7B1os7pBUfHo6b7IPQAXQ-nir0u3tQLzDpAg83h_iqQ,4332
 sparknlp/common/annotator_type.py,sha256=ash2Ip1IOOiJamPVyy_XQj8Ja_DRHm0b9Vj4Ni75oKM,1225
+sparknlp/common/completion_post_processing.py,sha256=sqcjewfrpIBZ4KFQ1XPYJI7luHIStnv6PovkehFxeOg,1423
 sparknlp/common/coverage_result.py,sha256=No4PSh1HSs3PyRI1zC47x65tWgfirqPI290icHQoXEI,823
 sparknlp/common/match_strategy.py,sha256=kt1MUPqU1wCwk5qCdYk6jubHbU-5yfAYxb9jjAOrdnY,1678
 sparknlp/common/properties.py,sha256=7eBxODxKmFQAgOtrxUH9ly4LugUlkNRVXNQcM60AUK4,53025
@@ -241,7 +242,7 @@ sparknlp/logging/__init__.py,sha256=DoROFF5KLZe4t4Q-OHxqk1nhqbw9NQ-wb64y8icNwgw,
 sparknlp/logging/comet.py,sha256=_ZBi9-hlilCAnd4lvdYMWiq4Vqsppv8kow3k0cf-NG4,15958
 sparknlp/partition/__init__.py,sha256=L0w-yv_HnnvoKlSX5MzI2GKHW3RLLfGyq8bgWYVeKjU,749
 sparknlp/partition/partition.py,sha256=GXEAUvOea04Vc_JK0z112cAKFrJ4AEpjLJ8xlzZt6Kw,8551
-sparknlp/partition/partition_properties.py,sha256=2tGdIv1NaJNaux_TTskKQHnARAwBkFctaqCcNw21Wr8,19920
+sparknlp/partition/partition_properties.py,sha256=QPqh5p3gvBSofZpPbyd18Zchvls0QP3S9Rsiy9Vko34,21862
 sparknlp/partition/partition_transformer.py,sha256=lRR1h-IMlHR8M0VeB50SbU39GHHF5PgMaJ42qOriS6A,6855
 sparknlp/pretrained/__init__.py,sha256=GV-x9UBK8F2_IR6zYatrzFcVJtkSUIMbxqWsxRUePmQ,793
 sparknlp/pretrained/pretrained_pipeline.py,sha256=lquxiaABuA68Rmu7csamJPqBoRJqMUO0oNHsmEZDAIs,5740
@@ -250,9 +251,10 @@ sparknlp/pretrained/utils.py,sha256=T1MrvW_DaWk_jcOjVLOea0NMFE9w8fe0ZT_5urZ_nEY,
 sparknlp/reader/__init__.py,sha256=-Toj3AIBki-zXPpV8ezFTI2LX1yP_rK2bhpoa8nBkTw,685
 sparknlp/reader/enums.py,sha256=MNGug9oJ1BBLM1Pbske13kAabalDzHa2kucF5xzFpHs,770
 sparknlp/reader/pdf_to_text.py,sha256=eWw-cwjosmcSZ9eHso0F5QQoeGBBnwsOhzhCXXvMjZA,7169
-sparknlp/reader/reader2doc.py,sha256=87aMk8-_1NHd3bB1rxw56BQMJc6mGgtnYGXwKw2uCmU,5916
+sparknlp/reader/reader2doc.py,sha256=lQHwxUwrBOScDryNpQJAdyXIqCDIHEt4-kDf-17ZZds,4287
 sparknlp/reader/reader2image.py,sha256=k3gb4LEiqDV-pnD-HEaA1KHoAxXmoYys2Y817i1yvP0,4557
-sparknlp/reader/reader2table.py,sha256=pIR9r6NapUV4xdsFecadWlKTSJmRMAm36eqM9aXf13k,2416
+sparknlp/reader/reader2table.py,sha256=VINfUzi_tdZN3tCjLmhF9CQjHKUhVYTzBBSRSnTXlr8,1370
+sparknlp/reader/reader_assembler.py,sha256=AgkA3BaZ_00Eor4D84lZLxx04n2pDE_uatO535RAs9M,5655
 sparknlp/reader/sparknlp_reader.py,sha256=MJs8v_ECYaV1SOabI1L_2MkVYEDVImtwgbYypO7DJSY,20623
 sparknlp/training/__init__.py,sha256=qREi9u-5Vc2VjpL6-XZsyvu5jSEIdIhowW7_kKaqMqo,852
 sparknlp/training/conll.py,sha256=wKBiSTrjc6mjsl7Nyt6B8f4yXsDJkZb-sn8iOjix9cE,6961
@@ -284,7 +286,7 @@ sparknlp/training/_tf_graph_builders_1x/ner_dl/dataset_encoder.py,sha256=R4yHFN3
 sparknlp/training/_tf_graph_builders_1x/ner_dl/ner_model.py,sha256=EoCSdcIjqQ3wv13MAuuWrKV8wyVBP0SbOEW41omHlR0,23189
 sparknlp/training/_tf_graph_builders_1x/ner_dl/ner_model_saver.py,sha256=k5CQ7gKV6HZbZMB8cKLUJuZxoZWlP_DFWdZ--aIDwsc,2356
 sparknlp/training/_tf_graph_builders_1x/ner_dl/sentence_grouper.py,sha256=pAxjWhjazSX8Vg0MFqJiuRVw1IbnQNSs-8Xp26L4nko,870
-spark_nlp-6.1.4.dist-info/METADATA,sha256=CqRyNEZCA_8F_J5vHG4GUZXRiavXyfb3tPMTStidr4c,19774
-spark_nlp-6.1.4.dist-info/WHEEL,sha256=JNWh1Fm1UdwIQV075glCn4MVuCRs0sotJIq-J6rbxCU,109
-spark_nlp-6.1.4.dist-info/top_level.txt,sha256=uuytur4pyMRw2H_txNY2ZkaucZHUs22QF8-R03ch_-E,13
-spark_nlp-6.1.4.dist-info/RECORD,,
+spark_nlp-6.2.0.dist-info/METADATA,sha256=8UP-KdKAwIzGuwXPTaPgk3ytBpsjpSDWQI4kvfxrD7E,19775
+spark_nlp-6.2.0.dist-info/WHEEL,sha256=JNWh1Fm1UdwIQV075glCn4MVuCRs0sotJIq-J6rbxCU,109
+spark_nlp-6.2.0.dist-info/top_level.txt,sha256=uuytur4pyMRw2H_txNY2ZkaucZHUs22QF8-R03ch_-E,13
+spark_nlp-6.2.0.dist-info/RECORD,,
sparknlp/__init__.py CHANGED
@@ -66,7 +66,7 @@ sys.modules['com.johnsnowlabs.ml.ai'] = annotator
 annotators = annotator
 embeddings = annotator
 
-__version__ = "6.1.4"
+__version__ = "6.2.0"
 
 
 def start(gpu=False,
@@ -122,6 +122,21 @@ class DocumentNormalizer(AnnotatorModel):
 "file encoding to apply on normalized documents",
 typeConverter=TypeConverters.toString)
 
+presetPattern = Param(
+Params._dummy(),
+"presetPattern",
+"Selects a single text cleaning function from the functional presets (e.g., 'CLEAN_BULLETS', 'CLEAN_DASHES', etc.).",
+typeConverter=TypeConverters.toString
+)
+
+autoMode = Param(
+Params._dummy(),
+"autoMode",
+"Enables a predefined cleaning mode combining multiple text cleaner functions (e.g., 'light_clean', 'document_clean', 'html_clean', 'full_auto').",
+typeConverter=TypeConverters.toString
+)
+
+
 @keyword_only
 def __init__(self):
 super(DocumentNormalizer, self).__init__(classname="com.johnsnowlabs.nlp.annotators.DocumentNormalizer")
@@ -197,3 +212,24 @@ class DocumentNormalizer(AnnotatorModel):
 File encoding to apply on normalized documents, by default "UTF-8"
 """
 return self._set(encoding=value)
+
+def setPresetPattern(self, value):
+"""Sets a single text cleaning preset pattern.
+
+Parameters
+----------
+value : str
+Preset cleaning pattern name, e.g., 'CLEAN_BULLETS', 'CLEAN_DASHES'.
+"""
+return self._set(presetPattern=value)
+
+
+def setAutoMode(self, value):
+"""Sets an automatic text cleaning mode using predefined groups of cleaning functions.
+
+Parameters
+----------
+value : str
+Auto cleaning mode, e.g., 'light_clean', 'document_clean', 'social_clean', 'html_clean', 'full_auto'.
+"""
+return self._set(autoMode=value)
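The two hunks above wire `presetPattern` and `autoMode` into `DocumentNormalizer`. The preset names ('CLEAN_BULLETS', 'CLEAN_DASHES', ...) are resolved in the Scala backend, so as a rough, hypothetical plain-Python sketch of what a single bullet-cleaning preset might amount to:

```python
import re

# Hypothetical stand-in for a preset like 'CLEAN_BULLETS'; the shipped
# implementation lives in the Scala backend and may behave differently.
BULLET_RE = re.compile(r"^\s*[\u2022\u25cf\u2023\-\*]+\s*", re.MULTILINE)

def clean_bullets(text: str) -> str:
    """Strip leading bullet glyphs (and following spaces) from every line."""
    return BULLET_RE.sub("", text)
```

With `setAutoMode`, several such functions would be applied together under one name (e.g. 'document_clean'), per the parameter docstrings in the diff.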
@@ -532,3 +532,8 @@ class AutoGGUFEmbeddings(AnnotatorModel, HasBatchedAnnotate):
 return ResourceDownloader.downloadModel(
 AutoGGUFEmbeddings, name, lang, remote_loc
 )
+
+def close(self):
+"""Closes the llama.cpp model backend freeing resources. The model is reloaded when used again.
+"""
+self._java_obj.close()
@@ -215,6 +215,20 @@ class EntityRulerModel(AnnotatorModel, HasStorageModel):
 
 outputAnnotatorType = AnnotatorType.CHUNK
 
+autoMode = Param(
+Params._dummy(),
+"autoMode",
+"Enable built-in regex presets that combine related entity patterns (e.g., 'communication_entities', 'network_entities', 'media_entities', etc.).",
+typeConverter=TypeConverters.toString
+)
+
+extractEntities = Param(
+Params._dummy(),
+"extractEntities",
+"List of entity types to extract. If not set, all entities in the active autoMode or from regexPatterns are used.",
+typeConverter=TypeConverters.toListString
+)
+
 def __init__(self, classname="com.johnsnowlabs.nlp.annotators.er.EntityRulerModel", java_model=None):
 super(EntityRulerModel, self).__init__(
 classname=classname,
@@ -230,3 +244,24 @@ class EntityRulerModel(AnnotatorModel, HasStorageModel):
 def loadStorage(path, spark, storage_ref):
 HasStorageModel.loadStorages(path, spark, storage_ref, EntityRulerModel.database)
 
+
+def setAutoMode(self, value):
+"""Sets the auto mode for predefined regex entity groups.
+
+Parameters
+----------
+value : str
+Name of the auto mode to activate (e.g., 'communication_entities', 'network_entities', etc.)
+"""
+return self._set(autoMode=value)
+
+
+def setExtractEntities(self, value):
+"""Sets specific entities to extract, filtering only those defined in regexPatterns or autoMode.
+
+Parameters
+----------
+value : list[str]
+List of entity names to extract, e.g., ['EMAIL_ADDRESS_PATTERN', 'IPV4_PATTERN'].
+"""
+return self._set(extractEntities=value)
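The new `autoMode` and `extractEntities` parameters name built-in presets such as 'EMAIL_ADDRESS_PATTERN' and 'IPV4_PATTERN'. A plain-Python sketch of the filtering behavior that `setExtractEntities` describes (the regexes below are illustrative stand-ins, not the shipped patterns):

```python
import re

# Hypothetical stand-ins for the presets the diff names; the actual
# regexes ship with the Scala backend and may differ.
PATTERNS = {
    "EMAIL_ADDRESS_PATTERN": r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "IPV4_PATTERN": r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
}

def extract_entities(text, extract=None):
    """Return (entity_name, match) pairs; `extract` filters the active
    patterns the way setExtractEntities filters autoMode/regexPatterns."""
    active = {k: v for k, v in PATTERNS.items() if extract is None or k in extract}
    return [(name, m.group()) for name, rx in active.items()
            for m in re.finditer(rx, text)]
```

Passing `extract=["IPV4_PATTERN"]` returns only IPv4 hits even when email addresses are present, mirroring the documented filtering semantics.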
@@ -12,12 +12,10 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """Contains classes for the AutoGGUFModel."""
-from typing import List, Dict
-
 from sparknlp.common import *
 
 
-class AutoGGUFModel(AnnotatorModel, HasBatchedAnnotate, HasLlamaCppProperties):
+class AutoGGUFModel(AnnotatorModel, HasBatchedAnnotate, HasLlamaCppProperties, CompletionPostProcessing):
 """
 Annotator that uses the llama.cpp library to generate text completions with large language
 models.
@@ -243,7 +241,6 @@ class AutoGGUFModel(AnnotatorModel, HasBatchedAnnotate, HasLlamaCppProperties):
 inputAnnotatorTypes = [AnnotatorType.DOCUMENT]
 outputAnnotatorType = AnnotatorType.DOCUMENT
 
-
 @keyword_only
 def __init__(self, classname="com.johnsnowlabs.nlp.annotators.seq2seq.AutoGGUFModel", java_model=None):
 super(AutoGGUFModel, self).__init__(
@@ -300,3 +297,8 @@ class AutoGGUFModel(AnnotatorModel, HasBatchedAnnotate, HasLlamaCppProperties):
 """
 from sparknlp.pretrained import ResourceDownloader
 return ResourceDownloader.downloadModel(AutoGGUFModel, name, lang, remote_loc)
+
+def close(self):
+"""Closes the llama.cpp model backend freeing resources. The model is reloaded when used again.
+"""
+self._java_obj.close()
@@ -327,3 +327,8 @@ class AutoGGUFReranker(AnnotatorModel, HasBatchedAnnotate, HasLlamaCppProperties
 """
 from sparknlp.pretrained import ResourceDownloader
 return ResourceDownloader.downloadModel(AutoGGUFReranker, name, lang, remote_loc)
+
+def close(self):
+"""Closes the llama.cpp model backend freeing resources. The model is reloaded when used again.
+"""
+self._java_obj.close()
@@ -15,7 +15,7 @@
 from sparknlp.common import *
 
 
-class AutoGGUFVisionModel(AnnotatorModel, HasBatchedAnnotate, HasLlamaCppProperties):
+class AutoGGUFVisionModel(AnnotatorModel, HasBatchedAnnotate, HasLlamaCppProperties, CompletionPostProcessing):
 """Multimodal annotator that uses the llama.cpp library to generate text completions with large
 language models. It supports ingesting images for captioning.
 
@@ -329,3 +329,8 @@ class AutoGGUFVisionModel(AnnotatorModel, HasBatchedAnnotate, HasLlamaCppPropert
 """
 from sparknlp.pretrained import ResourceDownloader
 return ResourceDownloader.downloadModel(AutoGGUFVisionModel, name, lang, remote_loc)
+
+def close(self):
+"""Closes the llama.cpp model backend freeing resources. The model is reloaded when used again.
+"""
+self._java_obj.close()
@@ -23,3 +23,4 @@ from sparknlp.common.storage import *
 from sparknlp.common.utils import *
 from sparknlp.common.annotator_type import *
 from sparknlp.common.match_strategy import *
+from sparknlp.common.completion_post_processing import *
@@ -0,0 +1,37 @@
+# Copyright 2017-2025 John Snow Labs
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from pyspark.ml.param import Param, Params, TypeConverters
+
+
+class CompletionPostProcessing:
+removeThinkingTag = Param(
+Params._dummy(),
+"removeThinkingTag",
+"Set a thinking tag (e.g. think) to be removed from output. Will match <TAG>...</TAG>",
+typeConverter=TypeConverters.toString,
+)
+
+def setRemoveThinkingTag(self, value: str):
+"""Set a thinking tag (e.g. `think`) to be removed from output.
+Will produce the regex: `(?s)<$TAG>.+?</$TAG>`
+"""
+self._set(removeThinkingTag=value)
+return self
+
+def getRemoveThinkingTag(self):
+"""Get the thinking tag to be removed from output."""
+value = None
+if self.removeThinkingTag in self._paramMap:
+value = self._paramMap[self.removeThinkingTag]
+return value
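Per the `setRemoveThinkingTag` docstring above, the mixin builds the regex `(?s)<$TAG>.+?</$TAG>` and strips matching spans from completions. The equivalent post-processing step in plain Python:

```python
import re

def strip_thinking(text: str, tag: str = "think") -> str:
    """Remove <tag>...</tag> spans, mirroring the documented
    (?s)<$TAG>.+?</$TAG> regex of CompletionPostProcessing."""
    return re.sub(rf"(?s)<{tag}>.+?</{tag}>", "", text)
```

On a reasoning model's output this leaves only the final answer, which is what the new `close`/post-processing hunks in the GGUF annotators hook into on the JVM side.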
@@ -18,6 +18,23 @@ from pyspark.ml.param import Param, Params, TypeConverters
 
 class HasReaderProperties(Params):
 
+inputCol = Param(
+Params._dummy(),
+"inputCol",
+"input column name",
+typeConverter=TypeConverters.toString
+)
+
+def setInputCol(self, value):
+"""Sets input column name.
+
+Parameters
+----------
+value : str
+Name of the Input Column
+"""
+return self._set(inputCol=value)
+
 outputCol = Param(
 Params._dummy(),
 "outputCol",
@@ -25,6 +42,16 @@ class HasReaderProperties(Params):
 typeConverter=TypeConverters.toString
 )
 
+def setOutputCol(self, value):
+"""Sets output column name.
+
+Parameters
+----------
+value : str
+Name of the Output Column
+"""
+return self._set(outputCol=value)
+
 contentPath = Param(
 Params._dummy(),
 "contentPath",
@@ -167,6 +194,56 @@ class HasReaderProperties(Params):
 """
 return self._set(explodeDocs=value)
 
+flattenOutput = Param(
+Params._dummy(),
+"flattenOutput",
+"If true, output is flattened to plain text with minimal metadata",
+typeConverter=TypeConverters.toBoolean
+)
+
+def setFlattenOutput(self, value):
+"""Sets whether to flatten the output to plain text with minimal metadata.
+
+ParametersF
+----------
+value : bool
+If true, output is flattened to plain text with minimal metadata
+"""
+return self._set(flattenOutput=value)
+
+titleThreshold = Param(
+Params._dummy(),
+"titleThreshold",
+"Minimum font size threshold for title detection in PDF docs",
+typeConverter=TypeConverters.toFloat
+)
+
+def setTitleThreshold(self, value):
+"""Sets the minimum font size threshold for title detection in PDF documents.
+
+Parameters
+----------
+value : float
+Minimum font size threshold for title detection in PDF docs
+"""
+return self._set(titleThreshold=value)
+
+outputAsDocument = Param(
+Params._dummy(),
+"outputAsDocument",
+"Whether to return all sentences joined into a single document",
+typeConverter=TypeConverters.toBoolean
+)
+
+def setOutputAsDocument(self, value):
+"""Sets whether to return all sentences joined into a single document.
+
+Parameters
+----------
+value : bool
+Whether to return all sentences joined into a single document
+"""
+return self._set(outputAsDocument=value)
 
 class HasEmailReaderProperties(Params):
 
@@ -683,13 +760,3 @@ class HasPdfProperties(Params):
 True to read as images, False otherwise.
 """
 return self._set(readAsImage=value)
-
-def setOutputCol(self, value):
-"""Sets output column name.
-
-Parameters
-----------
-value : str
-Name of the Output Column
-"""
-return self._set(outputCol=value)
@@ -12,7 +12,6 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 from pyspark import keyword_only
-from pyspark.ml.param import TypeConverters, Params, Param
 
 from sparknlp.common import AnnotatorType
 from sparknlp.internal import AnnotatorTransformer
@@ -69,32 +68,11 @@ class Reader2Doc(
 |[{'document', 15, 38, 'This is a narrative text', {'pageNumber': 1, 'elementType': 'NarrativeText', 'fileName': 'pdf-title.pdf'}, []}]|
 |[{'document', 39, 68, 'This is another narrative text', {'pageNumber': 1, 'elementType': 'NarrativeText', 'fileName': 'pdf-title.pdf'}, []}]|
 +------------------------------------------------------------------------------------------------------------------------------------+
-"""
+"""
 
 name = "Reader2Doc"
-outputAnnotatorType = AnnotatorType.DOCUMENT
-
-
-flattenOutput = Param(
-Params._dummy(),
-"flattenOutput",
-"If true, output is flattened to plain text with minimal metadata",
-typeConverter=TypeConverters.toBoolean
-)
 
-titleThreshold = Param(
-Params._dummy(),
-"titleThreshold",
-"Minimum font size threshold for title detection in PDF docs",
-typeConverter=TypeConverters.toFloat
-)
-
-outputAsDocument = Param(
-Params._dummy(),
-"outputAsDocument",
-"Whether to return all sentences joined into a single document",
-typeConverter=TypeConverters.toBoolean
-)
+outputAnnotatorType = AnnotatorType.DOCUMENT
 
 excludeNonText = Param(
 Params._dummy(),
@@ -103,6 +81,16 @@ class Reader2Doc(
 typeConverter=TypeConverters.toBoolean
 )
 
+def setExcludeNonText(self, value):
+"""Sets whether to exclude non-text content from the output.
+
+Parameters
+----------
+value : bool
+Whether to exclude non-text content from the output. Default is False.
+"""
+return self._set(excludeNonText=value)
+
 @keyword_only
 def __init__(self):
 super(Reader2Doc, self).__init__(classname="com.johnsnowlabs.reader.Reader2Doc")
@@ -117,44 +105,3 @@ class Reader2Doc(
 def setParams(self):
 kwargs = self._input_kwargs
 return self._set(**kwargs)
-
-
-def setFlattenOutput(self, value):
-"""Sets whether to flatten the output to plain text with minimal metadata.
-
-ParametersF
-----------
-value : bool
-If true, output is flattened to plain text with minimal metadata
-"""
-return self._set(flattenOutput=value)
-
-def setTitleThreshold(self, value):
-"""Sets the minimum font size threshold for title detection in PDF documents.
-
-Parameters
-----------
-value : float
-Minimum font size threshold for title detection in PDF docs
-"""
-return self._set(titleThreshold=value)
-
-def setOutputAsDocument(self, value):
-"""Sets whether to return all sentences joined into a single document.
-
-Parameters
-----------
-value : bool
-Whether to return all sentences joined into a single document
-"""
-return self._set(outputAsDocument=value)
-
-def setExcludeNonText(self, value):
-"""Sets whether to exclude non-text content from the output.
-
-Parameters
-----------
-value : bool
-Whether to exclude non-text content from the output. Default is False.
-"""
-return self._set(excludeNonText=value)
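The deletions above (and the matching ones in the Reader2Table hunks that follow) move the shared `flattenOutput`/`titleThreshold`/`outputAsDocument` setters into `HasReaderProperties` in partition_properties.py, so each reader inherits them once instead of duplicating them. The consolidation pattern, reduced to a minimal plain-Python sketch (class names reused for illustration only; the real classes are pyspark `Params` mixins):

```python
class HasReaderProperties:
    """Shared parameter holder; concrete readers inherit the setters once."""

    def __init__(self):
        self._params = {}

    def setFlattenOutput(self, value: bool):
        # Previously duplicated in Reader2Doc and Reader2Table.
        self._params["flattenOutput"] = value
        return self

    def setTitleThreshold(self, value: float):
        self._params["titleThreshold"] = value
        return self


class Reader2Doc(HasReaderProperties):
    pass


class Reader2Table(HasReaderProperties):
    pass
```

Returning `self` from each setter keeps the fluent chaining style (`Reader2Doc().setFlattenOutput(True).setTitleThreshold(18.0)`) that the real annotators use.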
@@ -32,20 +32,6 @@ class Reader2Table(
 
 outputAnnotatorType = AnnotatorType.DOCUMENT
 
-flattenOutput = Param(
-Params._dummy(),
-"flattenOutput",
-"If true, output is flattened to plain text with minimal metadata",
-typeConverter=TypeConverters.toBoolean
-)
-
-titleThreshold = Param(
-Params._dummy(),
-"titleThreshold",
-"Minimum font size threshold for title detection in PDF docs",
-typeConverter=TypeConverters.toFloat
-)
-
 @keyword_only
 def __init__(self):
 super(Reader2Table, self).__init__(classname="com.johnsnowlabs.reader.Reader2Table")
@@ -55,23 +41,3 @@ class Reader2Table(
 def setParams(self):
 kwargs = self._input_kwargs
 return self._set(**kwargs)
-
-def setFlattenOutput(self, value):
-"""Sets whether to flatten the output to plain text with minimal metadata.
-
-Parameters
-----------
-value : bool
-If true, output is flattened to plain text with minimal metadata
-"""
-return self._set(flattenOutput=value)
-
-def setTitleThreshold(self, value):
-"""Sets the minimum font size threshold for title detection in PDF documents.
-
-Parameters
-----------
-value : float
-Minimum font size threshold for title detection in PDF docs
-"""
-return self._set(titleThreshold=value)
@@ -0,0 +1,159 @@
+# Copyright 2017-2025 John Snow Labs
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from pyspark import keyword_only
+
+from sparknlp.common import AnnotatorType
+from sparknlp.internal import AnnotatorTransformer
+from sparknlp.partition.partition_properties import *
+
+class ReaderAssembler(
+AnnotatorTransformer,
+HasReaderProperties,
+HasHTMLReaderProperties,
+HasEmailReaderProperties,
+HasExcelReaderProperties,
+HasPowerPointProperties,
+HasTextReaderProperties,
+HasPdfProperties
+):
+"""
+The ReaderAssembler annotator provides a unified interface for combining multiple Spark NLP
+readers (such as Reader2Doc, Reader2Table, and Reader2Image) into a single, configurable
+component. It automatically orchestrates the execution of different readers based on input type,
+configured priorities, and fallback strategies allowing you to handle diverse content formats
+without manually chaining multiple readers in your pipeline.
+
+ReaderAssembler simplifies the process of building flexible pipelines capable of ingesting and
+processing documents, tables, and images in a consistent way. It handles reader selection,
+ordering, and fault-tolerance internally, ensuring that pipelines remain concise, robust, and
+easy to maintain.
+
+Examples
+--------
+>>> from johnsnowlabs.reader import ReaderAssembler
+>>> from pyspark.ml import Pipeline
+>>>
+>>> reader_assembler = ReaderAssembler() \\
+...     .setContentType("text/html") \\
+...     .setContentPath("/table-image.html") \\
+...     .setOutputCol("document")
+>>>
+>>> pipeline = Pipeline(stages=[reader_assembler])
+>>> pipeline_model = pipeline.fit(empty_data_set)
+>>> result_df = pipeline_model.transform(empty_data_set)
+>>>
+>>> result_df.show()
++--------+--------------------+--------------------+--------------------+---------+
+|fileName| document_text| document_table| document_image|exception|
++--------+--------------------+--------------------+--------------------+---------+
+| null|[{'document', 0, 26...|[{'document', 0, 50...|[{'image', , 5, 5, ...| null|
++--------+--------------------+--------------------+--------------------+---------+
+
+This annotator is especially useful when working with heterogeneous input data — for example,
+when a dataset includes PDFs, spreadsheets, and images — allowing Spark NLP to automatically
+invoke the appropriate reader for each file type while preserving a unified schema in the output.
+"""
+
+
+name = 'ReaderAssembler'
+
+outputAnnotatorType = AnnotatorType.DOCUMENT
+
+excludeNonText = Param(
+Params._dummy(),
+"excludeNonText",
+"Whether to exclude non-text content from the output. Default is False.",
+typeConverter=TypeConverters.toBoolean
+)
+
+userMessage = Param(
+Params._dummy(),
+"userMessage",
+"Custom user message.",
+typeConverter=TypeConverters.toString
+)
+
+promptTemplate = Param(
+Params._dummy(),
+"promptTemplate",
+"Format of the output prompt.",
+typeConverter=TypeConverters.toString
+)
+
+customPromptTemplate = Param(
+Params._dummy(),
+"customPromptTemplate",
+"Custom prompt template for image models.",
+typeConverter=TypeConverters.toString
+)
+
+@keyword_only
+def __init__(self):
+super(ReaderAssembler, self).__init__(classname="com.johnsnowlabs.reader.ReaderAssembler")
+self._setDefault(contentType="",
+explodeDocs=False,
+userMessage="Describe this image",
+promptTemplate="qwen2vl-chat",
+readAsImage=True,
+customPromptTemplate="",
+ignoreExceptions=True,
+flattenOutput=False,
+titleThreshold=18)
+
+
+@keyword_only
+def setParams(self):
+kwargs = self._input_kwargs
+return self._set(**kwargs)
+
+def setExcludeNonText(self, value):
+"""Sets whether to exclude non-text content from the output.
+
+Parameters
+----------
+value : bool
+Whether to exclude non-text content from the output. Default is False.
+"""
+return self._set(excludeNonText=value)
+
+def setUserMessage(self, value: str):
+"""Sets custom user message.
+
+Parameters
+----------
+value : str
+Custom user message to include.
+"""
+return self._set(userMessage=value)
+
+def setPromptTemplate(self, value: str):
+"""Sets format of the output prompt.
+
+Parameters
+----------
+value : str
+Prompt template format.
+"""
+return self._set(promptTemplate=value)
+
+def setCustomPromptTemplate(self, value: str):
+"""Sets custom prompt template for image models.
+
+Parameters
+----------
+value : str
+Custom prompt template string.
+"""
+return self._set(customPromptTemplate=value)