PyPI - spark-nlp - Versions diffs - 6.1.4__py2.py3-none-any.whl → 6.2.0__py2.py3-none-any.whl - Mend

spark-nlp 6.1.4 (py2.py3-none-any.whl) → 6.2.0 (py2.py3-none-any.whl)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of spark-nlp might be problematic.

Files changed (17)

{spark_nlp-6.1.4.dist-info → spark_nlp-6.2.0.dist-info}/METADATA +6 -6
{spark_nlp-6.1.4.dist-info → spark_nlp-6.2.0.dist-info}/RECORD +17 -15
sparknlp/__init__.py +1 -1
sparknlp/annotator/document_normalizer.py +36 -0
sparknlp/annotator/embeddings/auto_gguf_embeddings.py +5 -0
sparknlp/annotator/er/entity_ruler.py +35 -0
sparknlp/annotator/seq2seq/auto_gguf_model.py +6 -4
sparknlp/annotator/seq2seq/auto_gguf_reranker.py +5 -0
sparknlp/annotator/seq2seq/auto_gguf_vision_model.py +6 -1
sparknlp/common/__init__.py +1 -0
sparknlp/common/completion_post_processing.py +37 -0
sparknlp/partition/partition_properties.py +77 -10
sparknlp/reader/reader2doc.py +12 -65
sparknlp/reader/reader2table.py +0 -34
sparknlp/reader/reader_assembler.py +159 -0
{spark_nlp-6.1.4.dist-info → spark_nlp-6.2.0.dist-info}/WHEEL +0 -0
{spark_nlp-6.1.4.dist-info → spark_nlp-6.2.0.dist-info}/top_level.txt +0 -0

{spark_nlp-6.1.4.dist-info → spark_nlp-6.2.0.dist-info}/METADATA RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: spark-nlp
-Version: 6.1.4
+Version: 6.2.0
 Summary: John Snow Labs Spark NLP is a natural language processing library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines, that scale easily in a distributed environment.
 Home-page: https://github.com/JohnSnowLabs/spark-nlp
 Author: John Snow Labs
@@ -102,7 +102,7 @@ $ java -version
 $ conda create -n sparknlp python=3.7 -y
 $ conda activate sparknlp
 # spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==6.1.4 pyspark==3.3.1
+$ pip install spark-nlp==6.2.0 pyspark==3.3.1
 ```
 In Python console or Jupyter `Python3` kernel:
@@ -168,7 +168,7 @@ For a quick example of using pipelines and models take a look at our official [d
 ### Apache Spark Support
-Spark NLP *6.1.4* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
+Spark NLP *6.2.0* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
 | Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x |
 |-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
@@ -198,7 +198,7 @@ Find out more about 4.x `SparkNLP` versions in our official [documentation](http
 ### Databricks Support
-Spark NLP 6.1.4 has been tested and is compatible with the following runtimes:
+Spark NLP 6.2.0 has been tested and is compatible with the following runtimes:
 | **CPU**            | **GPU**            |
 |--------------------|--------------------|
@@ -216,7 +216,7 @@ We are compatible with older runtimes. For a full list check databricks support
 ### EMR Support
-Spark NLP 6.1.4 has been tested and is compatible with the following EMR releases:
+Spark NLP 6.2.0 has been tested and is compatible with the following EMR releases:
 | **EMR Release**    |
 |--------------------|
@@ -306,7 +306,7 @@ Please check [these instructions](https://sparknlp.org/docs/en/install#s3-integr
 Need more **examples**? Check out our dedicated [Spark NLP Examples](https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples)
 repository to showcase all Spark NLP use cases!
-Also, don't forget to check [Spark NLP in Action](https://sparknlp.org/demo) built by Streamlit.
+Also, don't forget to check [Spark NLP in Action](https://sparknlp.org/demos) built by Streamlit.
 #### All examples: [spark-nlp/examples](https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples)

{spark_nlp-6.1.4.dist-info → spark_nlp-6.2.0.dist-info}/RECORD RENAMED

@@ -3,7 +3,7 @@ com/johnsnowlabs/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,
 com/johnsnowlabs/ml/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
 com/johnsnowlabs/ml/ai/__init__.py,sha256=YQiK2M7U4d8y5irPy_HB8ae0mSpqS9583MH44pnKJXc,295
 com/johnsnowlabs/nlp/__init__.py,sha256=DPIVXtONO5xXyOk-HB0-sNiHAcco17NN13zPS_6Uw8c,294
-sparknlp/__init__.py,sha256=LcfC7bWeae5XgjWbNbWH94LlJkBon5dA8fYnb_2NyGc,13814
+sparknlp/__init__.py,sha256=6cuRDo27cGHCq7oJzF7sAB4sxm8jd9e8ciB_UH1dRT0,13814
 sparknlp/annotation.py,sha256=I5zOxG5vV2RfPZfqN9enT1i4mo6oBcn3Lrzs37QiOiA,5635
 sparknlp/annotation_audio.py,sha256=iRV_InSVhgvAwSRe9NTbUH9v6OGvTM-FPCpSAKVu0mE,1917
 sparknlp/annotation_image.py,sha256=xhCe8Ko-77XqWVuuYHFrjKqF6zPd8Z-RY_rmZXNwCXU,2547
@@ -16,7 +16,7 @@ sparknlp/annotator/chunker.py,sha256=8nz9B7R_mxKxcfJRfKvz2x_T29W3u4izE9k0wfYPzgE
 sparknlp/annotator/dataframe_optimizer.py,sha256=P4GySLzz1lRCZX0UBRF9_IDuXlRS1XvRWz-B2L0zqMA,7771
 sparknlp/annotator/date2_chunk.py,sha256=tW3m_LExmhx8LMFWOGXqMyfNRXSr2dnoEHD-6DrnpXI,3153
 sparknlp/annotator/document_character_text_splitter.py,sha256=oNrOKJAKO2h1wr0bEuSqYrrltIU_Y6J6cTHy70yKy6s,9877
-sparknlp/annotator/document_normalizer.py,sha256=hU2fG6vaPfdngQapoeSu-_zS_LiBZNp2tcVBGl6eTpk,10973
+sparknlp/annotator/document_normalizer.py,sha256=OOqPd6zp7FbtmlLHn1zAxPg9oxDzYRPKLYKr5k0Y5ck,12155
 sparknlp/annotator/document_token_splitter.py,sha256=-9xbQ9pVAjcKHQQrSk6Cb7f8W1cblCLwWXTNR8kFptA,7499
 sparknlp/annotator/document_token_splitter_test.py,sha256=NWO9mwhAIUJFuxPofB3c39iUm_6vKp4pteDsBOTH8ng,2684
 sparknlp/annotator/graph_extraction.py,sha256=b4SB3B_hFgCJT4e5Jcscyxdzfbvw3ujKTa6UNgX5Lhc,14471
@@ -105,7 +105,7 @@ sparknlp/annotator/dependency/dependency_parser.py,sha256=SxyvHPp8Hs1Xnm5X1nLTMi
 sparknlp/annotator/dependency/typed_dependency_parser.py,sha256=60vPdYkbFk9MPGegg3m9Uik9cMXpMZd8tBvXG39gNww,12456
 sparknlp/annotator/embeddings/__init__.py,sha256=Aw1oaP5DI0OS6259c0TEZZ6j3VFSvYFEerah5a-udVw,2528
 sparknlp/annotator/embeddings/albert_embeddings.py,sha256=6Rd1LIn8oFIpq_ALcJh-RUjPEO7Ht8wsHY6JHSFyMkw,9995
-sparknlp/annotator/embeddings/auto_gguf_embeddings.py,sha256=TRAYbhGS4K8uSpsScvDr6uD3lYdxMpCUjwDMhV_74rM,19977
+sparknlp/annotator/embeddings/auto_gguf_embeddings.py,sha256=-64uQKkvWsE2By3LEP9Hv10Eox10QAyVz0vSc_BduvY,20146
 sparknlp/annotator/embeddings/bert_embeddings.py,sha256=HVUjkg56kBcpGZCo-fmPG5uatMDF3swW_lnbpy1SgSI,8463
 sparknlp/annotator/embeddings/bert_sentence_embeddings.py,sha256=NQy9KuXT9aKsTpYCR5RAeoFWI2YqEGorbdYrf_0KKmw,9148
 sparknlp/annotator/embeddings/bge_embeddings.py,sha256=ZGbxssjJFaSfbcgqAPV5hsu81SnC0obgCVNOoJkArDA,8105
@@ -135,7 +135,7 @@ sparknlp/annotator/embeddings/xlm_roberta_embeddings.py,sha256=S2HHXOrSFXMAyloZU
 sparknlp/annotator/embeddings/xlm_roberta_sentence_embeddings.py,sha256=ojxD3H2VgDEn-RzDdCz0X485pojHBAFrlzsNemI05bY,8602
 sparknlp/annotator/embeddings/xlnet_embeddings.py,sha256=hJrlsJeO3D7uz54xiEiqqXEbq24YGuWz8U652PV9fNE,9336
 sparknlp/annotator/er/__init__.py,sha256=eF9Z-PanVfZWSVN2HSFbE7QjCDb6NYV5ESn6geYKlek,692
-sparknlp/annotator/er/entity_ruler.py,sha256=7eZtAwoixkl88jTyKEqTKf9Wzo459VXQkYmFBozUY6A,8784
+sparknlp/annotator/er/entity_ruler.py,sha256=eg9-I9yWQ_vjaKI5g5T4s575VZEjN1Sq7WJJpCImSVg,10007
 sparknlp/annotator/keyword_extraction/__init__.py,sha256=KotCR238x7LgisinsRGaARgPygWUIwC624FmH-sHacE,720
 sparknlp/annotator/keyword_extraction/yake_keyword_extraction.py,sha256=oeB-8qdMoljG-mgFOCsfnpxyK5jFBZnX7jAUQwsnHTc,13215
 sparknlp/annotator/ld_dl/__init__.py,sha256=gWNGOaozABT83J4Mn7JmNQsXzm27s3PHpMQmlXl-5L8,704
@@ -169,9 +169,9 @@ sparknlp/annotator/sentiment/__init__.py,sha256=Lq3vKaZS1YATLMg0VNXSVtkWL5q5G9ta
 sparknlp/annotator/sentiment/sentiment_detector.py,sha256=m545NGU0Xzg_PO6_qIfpli1uZj7JQcyFgqe9R6wAPFI,8154
 sparknlp/annotator/sentiment/vivekn_sentiment.py,sha256=4rpXWDgzU6ddnbrSCp9VdLb2epCc9oZ3c6XcqxEw8nk,9655
 sparknlp/annotator/seq2seq/__init__.py,sha256=aDiph00Hyq7L8uDY0frtyuHtqFodBqTMbixx_nq4z1I,1841
-sparknlp/annotator/seq2seq/auto_gguf_model.py,sha256=yhZQHMHfp88rQvLHTWyS-8imZrwqp-8RQQwnw6PmHfc,11749
-sparknlp/annotator/seq2seq/auto_gguf_reranker.py,sha256=MS4wCm2A2YiQfkB4HVVZKuN-3A1yGzqSCF69nu7J2rQ,12640
-sparknlp/annotator/seq2seq/auto_gguf_vision_model.py,sha256=swBek2026dW6BOX5O9P8Uq41X2GC71VGW0ADFeUIvs0,15299
+sparknlp/annotator/seq2seq/auto_gguf_model.py,sha256=FaKxJaF7BdlQcf3T-nPZWnXRClF8dcYa71QHIaXFigI,11912
+sparknlp/annotator/seq2seq/auto_gguf_reranker.py,sha256=a_70sNooY_9N6KHXVeuM4cDEbHVDlHa1KUWwu0A-l9s,12809
+sparknlp/annotator/seq2seq/auto_gguf_vision_model.py,sha256=59UZKJbI6oYnSNkk2qqf1nhHtB8h3upGRcjZJyl9bGQ,15494
 sparknlp/annotator/seq2seq/bart_transformer.py,sha256=I1flM4yeCzEAKOdQllBC30XuedxVJ7ferkFhZ6gwEbE,18481
 sparknlp/annotator/seq2seq/cohere_transformer.py,sha256=43LZBVazZMgJRCsN7HaYjVYfJ5hRMV95QZyxMtXq-m4,13496
 sparknlp/annotator/seq2seq/cpm_transformer.py,sha256=0CnBFMlxMu0pD2QZMHyoGtIYgXqfUQm68vr6zEAa6Eg,13290
@@ -219,11 +219,12 @@ sparknlp/base/prompt_assembler.py,sha256=_C_9MdHqsxUjSOa3TqCV-6sSfSiRyhfHBQG5m7R
 sparknlp/base/recursive_pipeline.py,sha256=V9rTnu8KMwgjoceykN9pF1mKGtOkkuiC_n9v8dE3LDk,4279
 sparknlp/base/table_assembler.py,sha256=Kxu3R2fY6JgCxEc07ibsMsjip6dgcPDHLiWAZ8gC_d8,5102
 sparknlp/base/token_assembler.py,sha256=qiHry07L7mVCqeHSH6hHxLygv1AsfZIE4jy1L75L3Do,5075
-sparknlp/common/__init__.py,sha256=MJuE__T1YS8f3As7X5sgzHibGjDeiFkQ5vc2bEEf0Ww,1148
+sparknlp/common/__init__.py,sha256=bdnDseYWsKnsBk4KdO_NbPJshF_CeqhO2NFXV1Vu_Ts,1205
 sparknlp/common/annotator_approach.py,sha256=CbkyaWl6rRX_VaXz2xJCjofijRGJGeJCsqQTDQgNTAw,1765
 sparknlp/common/annotator_model.py,sha256=l1vDFi2m_WbWg47Jq0F8DygjndUQhv9Ftfcc8Iceb8s,1880
 sparknlp/common/annotator_properties.py,sha256=7B1os7pBUfHo6b7IPQAXQ-nir0u3tQLzDpAg83h_iqQ,4332
 sparknlp/common/annotator_type.py,sha256=ash2Ip1IOOiJamPVyy_XQj8Ja_DRHm0b9Vj4Ni75oKM,1225
+sparknlp/common/completion_post_processing.py,sha256=sqcjewfrpIBZ4KFQ1XPYJI7luHIStnv6PovkehFxeOg,1423
 sparknlp/common/coverage_result.py,sha256=No4PSh1HSs3PyRI1zC47x65tWgfirqPI290icHQoXEI,823
 sparknlp/common/match_strategy.py,sha256=kt1MUPqU1wCwk5qCdYk6jubHbU-5yfAYxb9jjAOrdnY,1678
 sparknlp/common/properties.py,sha256=7eBxODxKmFQAgOtrxUH9ly4LugUlkNRVXNQcM60AUK4,53025
@@ -241,7 +242,7 @@ sparknlp/logging/__init__.py,sha256=DoROFF5KLZe4t4Q-OHxqk1nhqbw9NQ-wb64y8icNwgw,
 sparknlp/logging/comet.py,sha256=_ZBi9-hlilCAnd4lvdYMWiq4Vqsppv8kow3k0cf-NG4,15958
 sparknlp/partition/__init__.py,sha256=L0w-yv_HnnvoKlSX5MzI2GKHW3RLLfGyq8bgWYVeKjU,749
 sparknlp/partition/partition.py,sha256=GXEAUvOea04Vc_JK0z112cAKFrJ4AEpjLJ8xlzZt6Kw,8551
-sparknlp/partition/partition_properties.py,sha256=2tGdIv1NaJNaux_TTskKQHnARAwBkFctaqCcNw21Wr8,19920
+sparknlp/partition/partition_properties.py,sha256=QPqh5p3gvBSofZpPbyd18Zchvls0QP3S9Rsiy9Vko34,21862
 sparknlp/partition/partition_transformer.py,sha256=lRR1h-IMlHR8M0VeB50SbU39GHHF5PgMaJ42qOriS6A,6855
 sparknlp/pretrained/__init__.py,sha256=GV-x9UBK8F2_IR6zYatrzFcVJtkSUIMbxqWsxRUePmQ,793
 sparknlp/pretrained/pretrained_pipeline.py,sha256=lquxiaABuA68Rmu7csamJPqBoRJqMUO0oNHsmEZDAIs,5740
@@ -250,9 +251,10 @@ sparknlp/pretrained/utils.py,sha256=T1MrvW_DaWk_jcOjVLOea0NMFE9w8fe0ZT_5urZ_nEY,
 sparknlp/reader/__init__.py,sha256=-Toj3AIBki-zXPpV8ezFTI2LX1yP_rK2bhpoa8nBkTw,685
 sparknlp/reader/enums.py,sha256=MNGug9oJ1BBLM1Pbske13kAabalDzHa2kucF5xzFpHs,770
 sparknlp/reader/pdf_to_text.py,sha256=eWw-cwjosmcSZ9eHso0F5QQoeGBBnwsOhzhCXXvMjZA,7169
-sparknlp/reader/reader2doc.py,sha256=87aMk8-_1NHd3bB1rxw56BQMJc6mGgtnYGXwKw2uCmU,5916
+sparknlp/reader/reader2doc.py,sha256=lQHwxUwrBOScDryNpQJAdyXIqCDIHEt4-kDf-17ZZds,4287
 sparknlp/reader/reader2image.py,sha256=k3gb4LEiqDV-pnD-HEaA1KHoAxXmoYys2Y817i1yvP0,4557
-sparknlp/reader/reader2table.py,sha256=pIR9r6NapUV4xdsFecadWlKTSJmRMAm36eqM9aXf13k,2416
+sparknlp/reader/reader2table.py,sha256=VINfUzi_tdZN3tCjLmhF9CQjHKUhVYTzBBSRSnTXlr8,1370
+sparknlp/reader/reader_assembler.py,sha256=AgkA3BaZ_00Eor4D84lZLxx04n2pDE_uatO535RAs9M,5655
 sparknlp/reader/sparknlp_reader.py,sha256=MJs8v_ECYaV1SOabI1L_2MkVYEDVImtwgbYypO7DJSY,20623
 sparknlp/training/__init__.py,sha256=qREi9u-5Vc2VjpL6-XZsyvu5jSEIdIhowW7_kKaqMqo,852
 sparknlp/training/conll.py,sha256=wKBiSTrjc6mjsl7Nyt6B8f4yXsDJkZb-sn8iOjix9cE,6961
@@ -284,7 +286,7 @@ sparknlp/training/_tf_graph_builders_1x/ner_dl/dataset_encoder.py,sha256=R4yHFN3
 sparknlp/training/_tf_graph_builders_1x/ner_dl/ner_model.py,sha256=EoCSdcIjqQ3wv13MAuuWrKV8wyVBP0SbOEW41omHlR0,23189
 sparknlp/training/_tf_graph_builders_1x/ner_dl/ner_model_saver.py,sha256=k5CQ7gKV6HZbZMB8cKLUJuZxoZWlP_DFWdZ--aIDwsc,2356
 sparknlp/training/_tf_graph_builders_1x/ner_dl/sentence_grouper.py,sha256=pAxjWhjazSX8Vg0MFqJiuRVw1IbnQNSs-8Xp26L4nko,870
-spark_nlp-6.1.4.dist-info/METADATA,sha256=CqRyNEZCA_8F_J5vHG4GUZXRiavXyfb3tPMTStidr4c,19774
-spark_nlp-6.1.4.dist-info/WHEEL,sha256=JNWh1Fm1UdwIQV075glCn4MVuCRs0sotJIq-J6rbxCU,109
-spark_nlp-6.1.4.dist-info/top_level.txt,sha256=uuytur4pyMRw2H_txNY2ZkaucZHUs22QF8-R03ch_-E,13
-spark_nlp-6.1.4.dist-info/RECORD,,
+spark_nlp-6.2.0.dist-info/METADATA,sha256=8UP-KdKAwIzGuwXPTaPgk3ytBpsjpSDWQI4kvfxrD7E,19775
+spark_nlp-6.2.0.dist-info/WHEEL,sha256=JNWh1Fm1UdwIQV075glCn4MVuCRs0sotJIq-J6rbxCU,109
+spark_nlp-6.2.0.dist-info/top_level.txt,sha256=uuytur4pyMRw2H_txNY2ZkaucZHUs22QF8-R03ch_-E,13
+spark_nlp-6.2.0.dist-info/RECORD,,
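
Each RECORD entry above has the form `path,sha256=<digest>,<size>`, where the digest is the file's SHA-256 hash in URL-safe base64 with the trailing `=` padding removed. That is why `sparknlp/__init__.py` shows a new hash but the same size (13814): the version string changed without changing the byte count. The following is a minimal sketch of checking file contents against an entry; the helper names (`record_digest`, `parse_record_line`, `matches`) are illustrative assumptions, not part of spark-nlp or pip.

```python
import base64
import hashlib


def record_digest(data: bytes) -> str:
    """Return the SHA-256 digest encoded the way wheel RECORD files store it."""
    raw = hashlib.sha256(data).digest()
    # RECORD uses URL-safe base64 with the trailing '=' padding stripped.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def parse_record_line(line: str):
    """Split 'path,sha256=<digest>,<size>' into (path, algo, digest, size)."""
    # Split from the right so a comma inside the path cannot break parsing.
    path, hash_field, size = line.rsplit(",", 2)
    algo, _, digest = hash_field.partition("=")
    # The RECORD file's own entry has empty hash and size fields.
    return path, algo, digest, (int(size) if size else None)


def matches(line: str, data: bytes) -> bool:
    """Check file contents against a RECORD entry (hash and byte size)."""
    _, algo, digest, size = parse_record_line(line)
    return algo == "sha256" and digest == record_digest(data) and size == len(data)


# The empty __init__.py entries in the listing above all share this digest.
empty_entry = "com/johnsnowlabs/ml/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0"
print(matches(empty_entry, b""))
```

To audit an installed wheel, the same check could be run over every line of RECORD, reading each listed path as bytes and skipping the RECORD entry itself.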

sparknlp/__init__.py CHANGED

@@ -66,7 +66,7 @@ sys.modules['com.johnsnowlabs.ml.ai'] = annotator
 annotators = annotator
 embeddings = annotator
-__version__ = "6.1.4"
+__version__ = "6.2.0"
 def start(gpu=False,

sparknlp/annotator/document_normalizer.py CHANGED

@@ -122,6 +122,21 @@ class DocumentNormalizer(AnnotatorModel):
                      "file encoding to apply on normalized documents",
                      typeConverter=TypeConverters.toString)
+    presetPattern = Param(
+        Params._dummy(),
+        "presetPattern",
+        "Selects a single text cleaning function from the functional presets (e.g., 'CLEAN_BULLETS', 'CLEAN_DASHES', etc.).",
+        typeConverter=TypeConverters.toString
+    )
+    autoMode = Param(
+        Params._dummy(),
+        "autoMode",
+        "Enables a predefined cleaning mode combining multiple text cleaner functions (e.g., 'light_clean', 'document_clean', 'html_clean', 'full_auto').",
+        typeConverter=TypeConverters.toString
+    )
     @keyword_only
     def __init__(self):
         super(DocumentNormalizer, self).__init__(classname="com.johnsnowlabs.nlp.annotators.DocumentNormalizer")
@@ -197,3 +212,24 @@ class DocumentNormalizer(AnnotatorModel):
             File encoding to apply on normalized documents, by default "UTF-8"
         """
         return self._set(encoding=value)
+    def setPresetPattern(self, value):
+        """Sets a single text cleaning preset pattern.
+        Parameters
+        ----------
+        value : str
+            Preset cleaning pattern name, e.g., 'CLEAN_BULLETS', 'CLEAN_DASHES'.
+        """
+        return self._set(presetPattern=value)
+    def setAutoMode(self, value):
+        """Sets an automatic text cleaning mode using predefined groups of cleaning functions.
+        Parameters
+        ----------
+        value : str
+            Auto cleaning mode, e.g., 'light_clean', 'document_clean', 'social_clean', 'html_clean', 'full_auto'.
+        """
+        return self._set(autoMode=value)

sparknlp/annotator/embeddings/auto_gguf_embeddings.py CHANGED

@@ -532,3 +532,8 @@ class AutoGGUFEmbeddings(AnnotatorModel, HasBatchedAnnotate):
         return ResourceDownloader.downloadModel(
             AutoGGUFEmbeddings, name, lang, remote_loc
         )
+    def close(self):
+        """Closes the llama.cpp model backend freeing resources. The model is reloaded when used again.
+        """
+        self._java_obj.close()
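
The new `close()` docstring says the backend is freed and "the model is reloaded when used again". The diff only shows the Python wrapper delegating to the JVM object, so the following is a plain-Python sketch of that close-then-lazy-reload behaviour under stated assumptions. `LazyBackend`, its `complete` method, and the `loader` callable are hypothetical stand-ins and do not reflect how Spark NLP or llama.cpp implement it.

```python
class LazyBackend:
    """Hypothetical stand-in for an expensive native model handle."""

    def __init__(self, loader):
        self._loader = loader   # callable that builds the expensive resource
        self._handle = None     # nothing is loaded until first use
        self.load_count = 0     # how many times the resource was (re)built

    def _ensure_loaded(self):
        # Load on first use, and again after close() dropped the handle.
        if self._handle is None:
            self._handle = self._loader()
            self.load_count += 1
        return self._handle

    def complete(self, prompt: str) -> str:
        return self._ensure_loaded()(prompt)

    def close(self):
        """Drop the handle so its resources can be reclaimed; safe to call twice."""
        self._handle = None


# str.upper stands in for a real model call in this sketch.
backend = LazyBackend(lambda: str.upper)
first = backend.complete("hello")
backend.close()
second = backend.complete("again")  # triggers a second load
```

In a long-running session, the appeal of this pattern is that memory held by an idle model can be released between stages without discarding the configured annotator object.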

sparknlp/annotator/er/entity_ruler.py CHANGED

@@ -215,6 +215,20 @@ class EntityRulerModel(AnnotatorModel, HasStorageModel):
     outputAnnotatorType = AnnotatorType.CHUNK
+    autoMode = Param(
+        Params._dummy(),
+        "autoMode",
+        "Enable built-in regex presets that combine related entity patterns (e.g., 'communication_entities', 'network_entities', 'media_entities', etc.).",
+        typeConverter=TypeConverters.toString
+    )
+    extractEntities = Param(
+        Params._dummy(),
+        "extractEntities",
+        "List of entity types to extract. If not set, all entities in the active autoMode or from regexPatterns are used.",
+        typeConverter=TypeConverters.toListString
+    )
     def __init__(self, classname="com.johnsnowlabs.nlp.annotators.er.EntityRulerModel", java_model=None):
         super(EntityRulerModel, self).__init__(
             classname=classname,
@@ -230,3 +244,24 @@ class EntityRulerModel(AnnotatorModel, HasStorageModel):
     def loadStorage(path, spark, storage_ref):
         HasStorageModel.loadStorages(path, spark, storage_ref, EntityRulerModel.database)
+    def setAutoMode(self, value):
+        """Sets the auto mode for predefined regex entity groups.
+        Parameters
+        ----------
+        value : str
+            Name of the auto mode to activate (e.g., 'communication_entities', 'network_entities', etc.)
+        """
+        return self._set(autoMode=value)
+    def setExtractEntities(self, value):
+        """Sets specific entities to extract, filtering only those defined in regexPatterns or autoMode.
+        Parameters
+        ----------
+        value : list[str]
+            List of entity names to extract, e.g., ['EMAIL_ADDRESS_PATTERN', 'IPV4_PATTERN'].
+        """
+        return self._set(extractEntities=value)

sparknlp/annotator/seq2seq/auto_gguf_model.py CHANGED

@@ -12,12 +12,10 @@
 #  See the License for the specific language governing permissions and
 #  limitations under the License.
 """Contains classes for the AutoGGUFModel."""
-from typing import List, Dict
 from sparknlp.common import *
-class AutoGGUFModel(AnnotatorModel, HasBatchedAnnotate, HasLlamaCppProperties):
+class AutoGGUFModel(AnnotatorModel, HasBatchedAnnotate, HasLlamaCppProperties, CompletionPostProcessing):
     """
     Annotator that uses the llama.cpp library to generate text completions with large language
     models.
@@ -243,7 +241,6 @@ class AutoGGUFModel(AnnotatorModel, HasBatchedAnnotate, HasLlamaCppProperties):
     inputAnnotatorTypes = [AnnotatorType.DOCUMENT]
     outputAnnotatorType = AnnotatorType.DOCUMENT
     @keyword_only
     def __init__(self, classname="com.johnsnowlabs.nlp.annotators.seq2seq.AutoGGUFModel", java_model=None):
         super(AutoGGUFModel, self).__init__(
@@ -300,3 +297,8 @@ class AutoGGUFModel(AnnotatorModel, HasBatchedAnnotate, HasLlamaCppProperties):
         """
         from sparknlp.pretrained import ResourceDownloader
         return ResourceDownloader.downloadModel(AutoGGUFModel, name, lang, remote_loc)
+    def close(self):
+        """Closes the llama.cpp model backend freeing resources. The model is reloaded when used again.
+        """
+        self._java_obj.close()

sparknlp/annotator/seq2seq/auto_gguf_reranker.py CHANGED

@@ -327,3 +327,8 @@ class AutoGGUFReranker(AnnotatorModel, HasBatchedAnnotate, HasLlamaCppProperties
         """
         from sparknlp.pretrained import ResourceDownloader
         return ResourceDownloader.downloadModel(AutoGGUFReranker, name, lang, remote_loc)
+    def close(self):
+        """Closes the llama.cpp model backend freeing resources. The model is reloaded when used again.
+        """
+        self._java_obj.close()

sparknlp/annotator/seq2seq/auto_gguf_vision_model.py CHANGED

@@ -15,7 +15,7 @@
 from sparknlp.common import *
-class AutoGGUFVisionModel(AnnotatorModel, HasBatchedAnnotate, HasLlamaCppProperties):
+class AutoGGUFVisionModel(AnnotatorModel, HasBatchedAnnotate, HasLlamaCppProperties, CompletionPostProcessing):
     """Multimodal annotator that uses the llama.cpp library to generate text completions with large
     language models. It supports ingesting images for captioning.
@@ -329,3 +329,8 @@ class AutoGGUFVisionModel(AnnotatorModel, HasBatchedAnnotate, HasLlamaCppPropert
         """
         from sparknlp.pretrained import ResourceDownloader
         return ResourceDownloader.downloadModel(AutoGGUFVisionModel, name, lang, remote_loc)
+    def close(self):
+        """Closes the llama.cpp model backend freeing resources. The model is reloaded when used again.
+        """
+        self._java_obj.close()

sparknlp/common/__init__.py CHANGED

@@ -23,3 +23,4 @@ from sparknlp.common.storage import *
 from sparknlp.common.utils import *
 from sparknlp.common.annotator_type import *
 from sparknlp.common.match_strategy import *
+from sparknlp.common.completion_post_processing import *

sparknlp/common/completion_post_processing.py ADDED

@@ -0,0 +1,37 @@
+#  Copyright 2017-2025 John Snow Labs
+#
+#  Licensed under the Apache License, Version 2.0 (the "License");
+#  you may not use this file except in compliance with the License.
+#  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+#  limitations under the License.
+from pyspark.ml.param import Param, Params, TypeConverters
+class CompletionPostProcessing:
+    removeThinkingTag = Param(
+        Params._dummy(),
+        "removeThinkingTag",
+        "Set a thinking tag (e.g. think) to be removed from output. Will match <TAG>...</TAG>",
+        typeConverter=TypeConverters.toString,
+    )
+    def setRemoveThinkingTag(self, value: str):
+        """Set a thinking tag (e.g. `think`) to be removed from output.
+        Will produce the regex: `(?s)<$TAG>.+?</$TAG>`
+        """
+        self._set(removeThinkingTag=value)
+        return self
+    def getRemoveThinkingTag(self):
+        """Get the thinking tag to be removed from output."""
+        value = None
+        if self.removeThinkingTag in self._paramMap:
+            value = self._paramMap[self.removeThinkingTag]
+        return value
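
The `setRemoveThinkingTag` docstring states that the tag is expanded into the regex `(?s)<$TAG>.+?</$TAG>`. The substitution itself happens outside this Python file, so the sketch below only reproduces that documented pattern with Python's `re` module to show what it matches. The function name `remove_thinking_tag` and the use of `re.escape` on the tag are assumptions made for illustration.

```python
import re


def remove_thinking_tag(text: str, tag: str) -> str:
    """Strip every <tag>...</tag> block using the pattern from the docstring."""
    # (?s) lets '.' match newlines, so multi-line reasoning blocks are covered.
    # '.+?' is non-greedy: each match stops at the first closing tag it meets.
    safe_tag = re.escape(tag)  # assumption: treat the tag name as a literal
    pattern = re.compile(f"(?s)<{safe_tag}>.+?</{safe_tag}>")
    return pattern.sub("", text)


completion = "<think>step 1\nstep 2</think>Final answer"
cleaned = remove_thinking_tag(completion, "think")
```

Two properties follow directly from the pattern as written: separate thinking blocks are removed individually because the match is non-greedy, and a pair with nothing between the tags (`<think></think>`) is left untouched because `.+?` needs at least one character. Whether the library also trims surrounding whitespace is not visible in this diff.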

sparknlp/partition/partition_properties.py CHANGED

@@ -18,6 +18,23 @@ from pyspark.ml.param import Param, Params, TypeConverters
 class HasReaderProperties(Params):
+    inputCol = Param(
+        Params._dummy(),
+        "inputCol",
+        "input column name",
+        typeConverter=TypeConverters.toString
+    )
+    def setInputCol(self, value):
+        """Sets input column name.
+        Parameters
+        ----------
+        value : str
+            Name of the Input Column
+        """
+        return self._set(inputCol=value)
     outputCol = Param(
         Params._dummy(),
         "outputCol",
@@ -25,6 +42,16 @@ class HasReaderProperties(Params):
         typeConverter=TypeConverters.toString
     )
+    def setOutputCol(self, value):
+        """Sets output column name.
+        Parameters
+        ----------
+        value : str
+            Name of the Output Column
+        """
+        return self._set(outputCol=value)
     contentPath = Param(
         Params._dummy(),
         "contentPath",
@@ -167,6 +194,56 @@ class HasReaderProperties(Params):
         """
         return self._set(explodeDocs=value)
+    flattenOutput = Param(
+        Params._dummy(),
+        "flattenOutput",
+        "If true, output is flattened to plain text with minimal metadata",
+        typeConverter=TypeConverters.toBoolean
+    )
+    def setFlattenOutput(self, value):
+        """Sets whether to flatten the output to plain text with minimal metadata.
+        ParametersF
+        ----------
+        value : bool
+            If true, output is flattened to plain text with minimal metadata
+        """
+        return self._set(flattenOutput=value)
+    titleThreshold = Param(
+        Params._dummy(),
+        "titleThreshold",
+        "Minimum font size threshold for title detection in PDF docs",
+        typeConverter=TypeConverters.toFloat
+    )
+    def setTitleThreshold(self, value):
+        """Sets the minimum font size threshold for title detection in PDF documents.
+        Parameters
+        ----------
+        value : float
+            Minimum font size threshold for title detection in PDF docs
+        """
+        return self._set(titleThreshold=value)
+    outputAsDocument = Param(
+        Params._dummy(),
+        "outputAsDocument",
+        "Whether to return all sentences joined into a single document",
+        typeConverter=TypeConverters.toBoolean
+    )
+    def setOutputAsDocument(self, value):
+        """Sets whether to return all sentences joined into a single document.
+        Parameters
+        ----------
+        value : bool
+            Whether to return all sentences joined into a single document
+        """
+        return self._set(outputAsDocument=value)
 class HasEmailReaderProperties(Params):
@@ -683,13 +760,3 @@ class HasPdfProperties(Params):
             True to read as images, False otherwise.
         """
         return self._set(readAsImage=value)
-    def setOutputCol(self, value):
-        """Sets output column name.
-        Parameters
-        ----------
-        value : str
-            Name of the Output Column
-        """
-        return self._set(outputCol=value)
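
This hunk moves `setOutputCol` out of `HasPdfProperties` and adds `inputCol`, `flattenOutput`, `titleThreshold` and `outputAsDocument` to `HasReaderProperties`, so any class that mixes in `HasReaderProperties` gets the same setters. The sketch below illustrates that consolidation in plain Python, without PySpark. The class names ending in `Sketch` and the dictionary-backed `_set` are hypothetical simplifications and not the real `Params` machinery.

```python
class ReaderPropertiesSketch:
    """Hypothetical mixin holding settings shared by every reader."""

    def _set(self, **kwargs):
        # Store the values on the instance and return self so calls can chain,
        # mirroring the 'return self._set(...)' style used in the diff.
        self.__dict__.setdefault("_values", {}).update(kwargs)
        return self

    def setOutputCol(self, value):
        return self._set(outputCol=value)

    def setFlattenOutput(self, value):
        return self._set(flattenOutput=value)

    def setTitleThreshold(self, value):
        return self._set(titleThreshold=value)


class DocReaderSketch(ReaderPropertiesSketch):
    """Stands in for a document reader; defines no setters of its own."""


class TableReaderSketch(ReaderPropertiesSketch):
    """Stands in for a table reader; inherits the same setters."""


doc_reader = DocReaderSketch().setOutputCol("document").setFlattenOutput(True)
table_reader = TableReaderSketch().setOutputCol("table").setTitleThreshold(18.0)
```

Defining each setter once on the shared mixin keeps the readers from drifting apart, which is the kind of duplication the removed copies in the reader classes represented.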

sparknlp/reader/reader2doc.py CHANGED

@@ -12,7 +12,6 @@
 #  See the License for the specific language governing permissions and
 #  limitations under the License.
 from pyspark import keyword_only
-from pyspark.ml.param import TypeConverters, Params, Param
 from sparknlp.common import AnnotatorType
 from sparknlp.internal import AnnotatorTransformer
@@ -69,32 +68,11 @@ class Reader2Doc(
     |[{'document', 15, 38, 'This is a narrative text', {'pageNumber': 1, 'elementType': 'NarrativeText', 'fileName': 'pdf-title.pdf'}, []}]|
     |[{'document', 39, 68, 'This is another narrative text', {'pageNumber': 1, 'elementType': 'NarrativeText', 'fileName': 'pdf-title.pdf'}, []}]|
     +------------------------------------------------------------------------------------------------------------------------------------+
-"""
+    """
     name = "Reader2Doc"
-    outputAnnotatorType = AnnotatorType.DOCUMENT
-    flattenOutput = Param(
-        Params._dummy(),
-        "flattenOutput",
-        "If true, output is flattened to plain text with minimal metadata",
-        typeConverter=TypeConverters.toBoolean
-    )
-    titleThreshold = Param(
-        Params._dummy(),
-        "titleThreshold",
-        "Minimum font size threshold for title detection in PDF docs",
-        typeConverter=TypeConverters.toFloat
-    )
-    outputAsDocument = Param(
-        Params._dummy(),
-        "outputAsDocument",
-        "Whether to return all sentences joined into a single document",
-        typeConverter=TypeConverters.toBoolean
-    )
+    outputAnnotatorType = AnnotatorType.DOCUMENT
     excludeNonText = Param(
         Params._dummy(),
@@ -103,6 +81,16 @@ class Reader2Doc(
         typeConverter=TypeConverters.toBoolean
     )
+    def setExcludeNonText(self, value):
+        """Sets whether to exclude non-text content from the output.
+        Parameters
+        ----------
+        value : bool
+            Whether to exclude non-text content from the output. Default is False.
+        """
+        return self._set(excludeNonText=value)
     @keyword_only
     def __init__(self):
         super(Reader2Doc, self).__init__(classname="com.johnsnowlabs.reader.Reader2Doc")
@@ -117,44 +105,3 @@ class Reader2Doc(
     def setParams(self):
         kwargs = self._input_kwargs
         return self._set(**kwargs)
-    def setFlattenOutput(self, value):
-        """Sets whether to flatten the output to plain text with minimal metadata.
-        ParametersF
-        ----------
-        value : bool
-            If true, output is flattened to plain text with minimal metadata
-        """
-        return self._set(flattenOutput=value)
-    def setTitleThreshold(self, value):
-        """Sets the minimum font size threshold for title detection in PDF documents.
-        Parameters
-        ----------
-        value : float
-            Minimum font size threshold for title detection in PDF docs
-        """
-        return self._set(titleThreshold=value)
-    def setOutputAsDocument(self, value):
-        """Sets whether to return all sentences joined into a single document.
-        Parameters
-        ----------
-        value : bool
-            Whether to return all sentences joined into a single document
-        """
-        return self._set(outputAsDocument=value)
-    def setExcludeNonText(self, value):
-        """Sets whether to exclude non-text content from the output.
-        Parameters
-        ----------
-        value : bool
-            Whether to exclude non-text content from the output. Default is False.
-        """
-        return self._set(excludeNonText=value)

sparknlp/reader/reader2table.py CHANGED

@@ -32,20 +32,6 @@ class Reader2Table(
     outputAnnotatorType = AnnotatorType.DOCUMENT
-    flattenOutput = Param(
-        Params._dummy(),
-        "flattenOutput",
-        "If true, output is flattened to plain text with minimal metadata",
-        typeConverter=TypeConverters.toBoolean
-    )
-    titleThreshold = Param(
-        Params._dummy(),
-        "titleThreshold",
-        "Minimum font size threshold for title detection in PDF docs",
-        typeConverter=TypeConverters.toFloat
-    )
     @keyword_only
     def __init__(self):
         super(Reader2Table, self).__init__(classname="com.johnsnowlabs.reader.Reader2Table")
@@ -55,23 +41,3 @@ class Reader2Table(
     def setParams(self):
         kwargs = self._input_kwargs
         return self._set(**kwargs)
-    def setFlattenOutput(self, value):
-        """Sets whether to flatten the output to plain text with minimal metadata.
-        Parameters
-        ----------
-        value : bool
-            If true, output is flattened to plain text with minimal metadata
-        """
-        return self._set(flattenOutput=value)
-    def setTitleThreshold(self, value):
-        """Sets the minimum font size threshold for title detection in PDF documents.
-        Parameters
-        ----------
-        value : float
-            Minimum font size threshold for title detection in PDF docs
-        """
-        return self._set(titleThreshold=value)

sparknlp/reader/reader_assembler.py ADDED

@@ -0,0 +1,159 @@
+#  Copyright 2017-2025 John Snow Labs
+#
+#  Licensed under the Apache License, Version 2.0 (the "License");
+#  you may not use this file except in compliance with the License.
+#  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+#  limitations under the License.
+from pyspark import keyword_only
+from sparknlp.common import AnnotatorType
+from sparknlp.internal import AnnotatorTransformer
+from sparknlp.partition.partition_properties import *
+class ReaderAssembler(
+    AnnotatorTransformer,
+    HasReaderProperties,
+    HasHTMLReaderProperties,
+    HasEmailReaderProperties,
+    HasExcelReaderProperties,
+    HasPowerPointProperties,
+    HasTextReaderProperties,
+    HasPdfProperties
+):
+    """
+    The ReaderAssembler annotator provides a unified interface for combining multiple Spark NLP
+    readers (such as Reader2Doc, Reader2Table, and Reader2Image) into a single, configurable
+    component. It automatically orchestrates the execution of different readers based on input type,
+    configured priorities, and fallback strategies allowing you to handle diverse content formats
+    without manually chaining multiple readers in your pipeline.
+    ReaderAssembler simplifies the process of building flexible pipelines capable of ingesting and
+    processing documents, tables, and images in a consistent way. It handles reader selection,
+    ordering, and fault-tolerance internally, ensuring that pipelines remain concise, robust, and
+    easy to maintain.
+    Examples
+    --------
+    >>> from johnsnowlabs.reader import ReaderAssembler
+    >>> from pyspark.ml import Pipeline
+    >>>
+    >>> reader_assembler = ReaderAssembler() \\
+    ...     .setContentType("text/html") \\
+    ...     .setContentPath("/table-image.html") \\
+    ...     .setOutputCol("document")
+    >>>
+    >>> pipeline = Pipeline(stages=[reader_assembler])
+    >>> pipeline_model = pipeline.fit(empty_data_set)
+    >>> result_df = pipeline_model.transform(empty_data_set)
+    >>>
+    >>> result_df.show()
+    +--------+--------------------+--------------------+--------------------+---------+
+    |fileName|       document_text|      document_table|      document_image|exception|
+    +--------+--------------------+--------------------+--------------------+---------+
+    |    null|[{'document', 0, 26...|[{'document', 0, 50...|[{'image', , 5, 5, ...|     null|
+    +--------+--------------------+--------------------+--------------------+---------+
+    This annotator is especially useful when working with heterogeneous input data — for example,
+    when a dataset includes PDFs, spreadsheets, and images — allowing Spark NLP to automatically
+    invoke the appropriate reader for each file type while preserving a unified schema in the output.
+"""
+    name = 'ReaderAssembler'
+    outputAnnotatorType = AnnotatorType.DOCUMENT
+    excludeNonText = Param(
+        Params._dummy(),
+        "excludeNonText",
+        "Whether to exclude non-text content from the output. Default is False.",
+        typeConverter=TypeConverters.toBoolean
+    )
+    userMessage = Param(
+        Params._dummy(),
+        "userMessage",
+        "Custom user message.",
+        typeConverter=TypeConverters.toString
+    )
+    promptTemplate = Param(
+        Params._dummy(),
+        "promptTemplate",
+        "Format of the output prompt.",
+        typeConverter=TypeConverters.toString
+    )
+    customPromptTemplate = Param(
+        Params._dummy(),
+        "customPromptTemplate",
+        "Custom prompt template for image models.",
+        typeConverter=TypeConverters.toString
+    )
+    @keyword_only
+    def __init__(self):
+        super(ReaderAssembler, self).__init__(classname="com.johnsnowlabs.reader.ReaderAssembler")
+        self._setDefault(contentType="",
+                         explodeDocs=False,
+                         userMessage="Describe this image",
+                         promptTemplate="qwen2vl-chat",
+                         readAsImage=True,
+                         customPromptTemplate="",
+                         ignoreExceptions=True,
+                         flattenOutput=False,
+                         titleThreshold=18)
+    @keyword_only
+    def setParams(self):
+        kwargs = self._input_kwargs
+        return self._set(**kwargs)
+    def setExcludeNonText(self, value):
+        """Sets whether to exclude non-text content from the output.
+        Parameters
+        ----------
+        value : bool
+            Whether to exclude non-text content from the output. Default is False.
+        """
+        return self._set(excludeNonText=value)
+    def setUserMessage(self, value: str):
+        """Sets custom user message.
+        Parameters
+        ----------
+        value : str
+            Custom user message to include.
+        """
+        return self._set(userMessage=value)
+    def setPromptTemplate(self, value: str):
+        """Sets format of the output prompt.
+        Parameters
+        ----------
+        value : str
+            Prompt template format.
+        """
+        return self._set(promptTemplate=value)
+    def setCustomPromptTemplate(self, value: str):
+        """Sets custom prompt template for image models.
+        Parameters
+        ----------
+        value : str
+            Custom prompt template string.
+        """
+        return self._set(customPromptTemplate=value)
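
Note: the docstring example above imports `ReaderAssembler` from `johnsnowlabs.reader`, while the file added here lives at `sparknlp/reader/reader_assembler.py`. The sketch below assumes the latter module path is the one to import from in Python. It records the defaults set in `__init__` above and chains only setters that appear in this diff (`setContentType`, `setContentPath`, `setOutputCol`, `setExcludeNonText`, `setUserMessage`). `build_reader_assembler` is a hypothetical helper, and calling it requires a running Spark session with the Spark NLP JAR on the classpath.

```python
# Defaults passed to _setDefault in ReaderAssembler.__init__ (6.2.0),
# copied from the diff above.
READER_ASSEMBLER_DEFAULTS = {
    "contentType": "",
    "explodeDocs": False,
    "userMessage": "Describe this image",
    "promptTemplate": "qwen2vl-chat",
    "readAsImage": True,
    "customPromptTemplate": "",
    "ignoreExceptions": True,
    "flattenOutput": False,
    "titleThreshold": 18,
}


def build_reader_assembler(content_path, content_type="text/html"):
    """Hypothetical helper: configure a ReaderAssembler for one content path."""
    # Imported lazily so this module loads without pyspark installed.
    # Assumption: the class is importable from the module path in the diff.
    from sparknlp.reader.reader_assembler import ReaderAssembler

    return (
        ReaderAssembler()
        .setContentType(content_type)
        .setContentPath(content_path)
        .setOutputCol("document")
        .setExcludeNonText(True)  # overrides the documented default of False
        .setUserMessage(READER_ASSEMBLER_DEFAULTS["userMessage"])
    )
```

The returned stage can then be placed in a `pyspark.ml.Pipeline`, as in the docstring example.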

{spark_nlp-6.1.4.dist-info → spark_nlp-6.2.0.dist-info}/WHEEL RENAMED

File without changes

{spark_nlp-6.1.4.dist-info → spark_nlp-6.2.0.dist-info}/top_level.txt RENAMED

File without changes
