spark-nlp 5.2.1__py2.py3-none-any.whl → 5.2.3__py2.py3-none-any.whl

This diff represents the content of publicly available package versions released to a supported registry. It is provided for informational purposes only and reflects the changes between the two versions as they appear in that registry.

Potentially problematic release: this version of spark-nlp might be problematic.

spark_nlp-5.2.1.dist-info/METADATA → spark_nlp-5.2.3.dist-info/METADATA

@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: spark-nlp
- Version: 5.2.1
+ Version: 5.2.3
  Summary: John Snow Labs Spark NLP is a natural language processing library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines, that scale easily in a distributed environment.
  Home-page: https://github.com/JohnSnowLabs/spark-nlp
  Author: John Snow Labs
@@ -51,10 +51,10 @@ Description-Content-Type: text/markdown

  Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. It provides **simple**, **performant** & **accurate** NLP annotations for machine learning pipelines that **scale** easily in a distributed
  environment.
- Spark NLP comes with **30000+** pretrained **pipelines** and **models** in more than **200+** languages.
+ Spark NLP comes with **36000+** pretrained **pipelines** and **models** in more than **200+** languages.
  It also offers tasks such as **Tokenization**, **Word Segmentation**, **Part-of-Speech Tagging**, Word and Sentence **Embeddings**, **Named Entity Recognition**, **Dependency Parsing**, **Spell Checking**, **Text Classification**, **Sentiment Analysis**, **Token Classification**, **Machine Translation** (+180 languages), **Summarization**, **Question Answering**, **Table Question Answering**, **Text Generation**, **Image Classification**, **Image to Text (captioning)**, **Automatic Speech Recognition**, **Zero-Shot Learning**, and many more [NLP tasks](#features).

- **Spark NLP** is the only open-source NLP library in **production** that offers state-of-the-art transformers such as **BERT**, **CamemBERT**, **ALBERT**, **ELECTRA**, **XLNet**, **DistilBERT**, **RoBERTa**, **DeBERTa**, **XLM-RoBERTa**, **Longformer**, **ELMO**, **Universal Sentence Encoder**, **Facebook BART**, **Instructor**, **E5**, **Google T5**, **MarianMT**, **OpenAI GPT2**, and **Vision Transformers (ViT)** not only to **Python** and **R**, but also to **JVM** ecosystem (**Java**, **Scala**, and **Kotlin**) at **scale** by extending **Apache Spark** natively.
+ **Spark NLP** is the only open-source NLP library in **production** that offers state-of-the-art transformers such as **BERT**, **CamemBERT**, **ALBERT**, **ELECTRA**, **XLNet**, **DistilBERT**, **RoBERTa**, **DeBERTa**, **XLM-RoBERTa**, **Longformer**, **ELMO**, **Universal Sentence Encoder**, **Facebook BART**, **Instructor**, **E5**, **Google T5**, **MarianMT**, **OpenAI GPT2**, **Vision Transformers (ViT)**, **OpenAI Whisper**, and many more not only to **Python** and **R**, but also to **JVM** ecosystem (**Java**, **Scala**, and **Kotlin**) at **scale** by extending **Apache Spark** natively.

  ## Project's website

@@ -191,7 +191,7 @@ documentation and examples
  - Easy ONNX and TensorFlow integrations
  - GPU Support
  - Full integration with Spark ML functions
- - +24000 pre-trained models in +200 languages!
+ - +30000 pre-trained models in +200 languages!
  - +6000 pre-trained pipelines in +200 languages!
  - Multi-lingual NER models: Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hebrew, Italian,
  Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Urdu, and more.
@@ -205,7 +205,7 @@ To use Spark NLP you need the following requirements:

  **GPU (optional):**

- Spark NLP 5.2.1 is built with ONNX 1.16.3 and TensorFlow 2.7.1 deep learning engines. The minimum following NVIDIA® software are only required for GPU support:
+ Spark NLP 5.2.3 is built with ONNX 1.16.3 and TensorFlow 2.7.1 deep learning engines. The minimum following NVIDIA® software are only required for GPU support:

  - NVIDIA® GPU drivers version 450.80.02 or higher
  - CUDA® Toolkit 11.2
@@ -221,7 +221,7 @@ $ java -version
  $ conda create -n sparknlp python=3.7 -y
  $ conda activate sparknlp
  # spark-nlp by default is based on pyspark 3.x
- $ pip install spark-nlp==5.2.1 pyspark==3.3.1
+ $ pip install spark-nlp==5.2.3 pyspark==3.3.1
  ```

  In Python console or Jupyter `Python3` kernel:
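
The quick-start snippet that follows this line in the README is unchanged, so it does not appear in the diff. A minimal sketch of that first session on 5.2.3, using only `sparknlp.start()` and `sparknlp.version()` (both visible in the `sparknlp/__init__.py` hunks near the end of this diff):

```python
import sparknlp

# Start a Spark session preconfigured for Spark NLP; on first use this
# resolves the spark-nlp JAR matching the installed Python package.
spark = sparknlp.start()

print(sparknlp.version())  # expected to print: 5.2.3
print(spark.version)       # the underlying Apache Spark version
```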
@@ -266,11 +266,11 @@ For more examples, you can visit our dedicated [examples](https://github.com/Joh

  ## Apache Spark Support

- Spark NLP *5.2.1* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
+ Spark NLP *5.2.3* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x

  | Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x |
  |-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
- | 5.2.x | Partially | YES | YES | YES | YES | YES | NO | NO |
+ | 5.2.x | YES | YES | YES | YES | YES | YES | NO | NO |
  | 5.1.x | Partially | YES | YES | YES | YES | YES | NO | NO |
  | 5.0.x | YES | YES | YES | YES | YES | YES | NO | NO |
  | 4.4.x | YES | YES | YES | YES | YES | YES | NO | NO |
@@ -308,7 +308,7 @@ Find out more about `Spark NLP` versions from our [release notes](https://github

  ## Databricks Support

- Spark NLP 5.2.1 has been tested and is compatible with the following runtimes:
+ Spark NLP 5.2.3 has been tested and is compatible with the following runtimes:

  **CPU:**

@@ -375,7 +375,7 @@ Spark NLP 5.2.1 has been tested and is compatible with the following runtimes:

  ## EMR Support

- Spark NLP 5.2.1 has been tested and is compatible with the following EMR releases:
+ Spark NLP 5.2.3 has been tested and is compatible with the following EMR releases:

  - emr-6.2.0
  - emr-6.3.0
@@ -422,11 +422,11 @@ Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x,
  ```sh
  # CPU

- spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
+ spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3

- pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
+ pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3

- spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
+ spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3
  ```

  The `spark-nlp` has been published to
@@ -435,11 +435,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
  ```sh
  # GPU

- spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.1
+ spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.3

- pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.1
+ pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.3

- spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.1
+ spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.3

  ```

@@ -449,11 +449,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
  ```sh
  # AArch64

- spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.1
+ spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.3

- pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.1
+ pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.3

- spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.1
+ spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.3

  ```

@@ -463,11 +463,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
  ```sh
  # M1/M2 (Apple Silicon)

- spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.1
+ spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.3

- pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.1
+ pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.3

- spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.1
+ spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.3

  ```

@@ -481,7 +481,7 @@ set in your SparkSession:
  spark-shell \
  --driver-memory 16g \
  --conf spark.kryoserializer.buffer.max=2000M \
- --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
+ --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3
  ```

  ## Scala
@@ -499,7 +499,7 @@ coordinates:
  <dependency>
  <groupId>com.johnsnowlabs.nlp</groupId>
  <artifactId>spark-nlp_2.12</artifactId>
- <version>5.2.1</version>
+ <version>5.2.3</version>
  </dependency>
  ```

@@ -510,7 +510,7 @@ coordinates:
  <dependency>
  <groupId>com.johnsnowlabs.nlp</groupId>
  <artifactId>spark-nlp-gpu_2.12</artifactId>
- <version>5.2.1</version>
+ <version>5.2.3</version>
  </dependency>
  ```

@@ -521,7 +521,7 @@ coordinates:
  <dependency>
  <groupId>com.johnsnowlabs.nlp</groupId>
  <artifactId>spark-nlp-aarch64_2.12</artifactId>
- <version>5.2.1</version>
+ <version>5.2.3</version>
  </dependency>
  ```

@@ -532,7 +532,7 @@ coordinates:
  <dependency>
  <groupId>com.johnsnowlabs.nlp</groupId>
  <artifactId>spark-nlp-silicon_2.12</artifactId>
- <version>5.2.1</version>
+ <version>5.2.3</version>
  </dependency>
  ```

@@ -542,28 +542,28 @@ coordinates:

  ```sbtshell
  // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp
- libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.2.1"
+ libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.2.3"
  ```

  **spark-nlp-gpu:**

  ```sbtshell
  // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu
- libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.2.1"
+ libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.2.3"
  ```

  **spark-nlp-aarch64:**

  ```sbtshell
  // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64
- libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.2.1"
+ libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.2.3"
  ```

  **spark-nlp-silicon:**

  ```sbtshell
  // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon
- libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.2.1"
+ libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.2.3"
  ```

  Maven
@@ -585,7 +585,7 @@ If you installed pyspark through pip/conda, you can install `spark-nlp` through
  Pip:

  ```bash
- pip install spark-nlp==5.2.1
+ pip install spark-nlp==5.2.3
  ```

  Conda:
@@ -614,7 +614,7 @@ spark = SparkSession.builder
  .config("spark.driver.memory", "16G")
  .config("spark.driver.maxResultSize", "0")
  .config("spark.kryoserializer.buffer.max", "2000M")
- .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1")
+ .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3")
  .getOrCreate()
  ```

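The same configuration can also go through the library's own session helper; a minimal sketch, assuming the default behavior of `sparknlp.start()` (its `params` argument appears in the `sparknlp/__init__.py` hunk below):

```python
import sparknlp

# Extra Spark configs are passed as a dict through `params`; the helper
# resolves the matching spark-nlp 5.2.3 package coordinate on its own.
spark = sparknlp.start(params={
    "spark.driver.memory": "16G",
    "spark.kryoserializer.buffer.max": "2000M",
})
```
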
@@ -685,7 +685,7 @@ Use either one of the following options
  - Add the following Maven Coordinates to the interpreter's library list

  ```bash
- com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
+ com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3
  ```

  - Add a path to pre-built jar from [here](#compiled-jars) in the interpreter's library list making sure the jar is
@@ -696,7 +696,7 @@ com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
  Apart from the previous step, install the python module through pip

  ```bash
- pip install spark-nlp==5.2.1
+ pip install spark-nlp==5.2.3
  ```

  Or you can install `spark-nlp` from inside Zeppelin by using Conda:
@@ -724,7 +724,7 @@ launch the Jupyter from the same Python environment:
  $ conda create -n sparknlp python=3.8 -y
  $ conda activate sparknlp
  # spark-nlp by default is based on pyspark 3.x
- $ pip install spark-nlp==5.2.1 pyspark==3.3.1 jupyter
+ $ pip install spark-nlp==5.2.3 pyspark==3.3.1 jupyter
  $ jupyter notebook
  ```

@@ -741,7 +741,7 @@ export PYSPARK_PYTHON=python3
  export PYSPARK_DRIVER_PYTHON=jupyter
  export PYSPARK_DRIVER_PYTHON_OPTS=notebook

- pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
+ pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3
  ```

  Alternatively, you can mix in using `--jars` option for pyspark + `pip install spark-nlp`
@@ -768,7 +768,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
  # -s is for spark-nlp
  # -g will enable upgrading libcudnn8 to 8.1.0 on Google Colab for GPU usage
  # by default they are set to the latest
- !wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.2.1
+ !wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.2.3
  ```

  [Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb)
@@ -791,7 +791,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
  # -s is for spark-nlp
  # -g will enable upgrading libcudnn8 to 8.1.0 on Kaggle for GPU usage
  # by default they are set to the latest
- !wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.2.1
+ !wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.2.3
  ```

  [Spark NLP quick start on Kaggle Kernel](https://www.kaggle.com/mozzie/spark-nlp-named-entity-recognition) is a live
@@ -810,9 +810,9 @@ demo on Kaggle Kernel that performs named entity recognitions by using Spark NLP

  3. In `Libraries` tab inside your cluster you need to follow these steps:

- 3.1. Install New -> PyPI -> `spark-nlp==5.2.1` -> Install
+ 3.1. Install New -> PyPI -> `spark-nlp==5.2.3` -> Install

- 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1` -> Install
+ 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3` -> Install

  4. Now you can attach your notebook to the cluster and use Spark NLP!

@@ -863,7 +863,7 @@ A sample of your software configuration in JSON on S3 (must be public access):
  "spark.kryoserializer.buffer.max": "2000M",
  "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
  "spark.driver.maxResultSize": "0",
- "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1"
+ "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3"
  }
  }]
  ```
@@ -872,7 +872,7 @@ A sample of AWS CLI to launch EMR cluster:

  ```.sh
  aws emr create-cluster \
- --name "Spark NLP 5.2.1" \
+ --name "Spark NLP 5.2.3" \
  --release-label emr-6.2.0 \
  --applications Name=Hadoop Name=Spark Name=Hive \
  --instance-type m4.4xlarge \
@@ -936,7 +936,7 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \
  --enable-component-gateway \
  --metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \
  --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \
- --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
+ --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3
  ```

  2. On an existing one, you need to install spark-nlp and spark-nlp-display packages from PyPI.
@@ -947,16 +947,20 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \

  You can change the following Spark NLP configurations via Spark Configuration:

  | Property Name | Default | Meaning |
  |---------------|---------|---------|
  | `spark.jsl.settings.pretrained.cache_folder` | `~/cache_pretrained` | The location to download and extract pretrained `Models` and `Pipelines`. By default, it will be in User's Home directory under `cache_pretrained` directory |
  | `spark.jsl.settings.storage.cluster_tmp_dir` | `hadoop.tmp.dir` | The location to use on a cluster for temporarily files such as unpacking indexes for WordEmbeddings. By default, this locations is the location of `hadoop.tmp.dir` set via Hadoop configuration for Apache Spark. NOTE: `S3` is not supported and it must be local, HDFS, or DBFS |
  | `spark.jsl.settings.annotator.log_folder` | `~/annotator_logs` | The location to save logs from annotators during training such as `NerDLApproach`, `ClassifierDLApproach`, `SentimentDLApproach`, `MultiClassifierDLApproach`, etc. By default, it will be in User's Home directory under `annotator_logs` directory |
  | `spark.jsl.settings.aws.credentials.access_key_id` | `None` | Your AWS access key to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` |
  | `spark.jsl.settings.aws.credentials.secret_access_key` | `None` | Your AWS secret access key to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` |
  | `spark.jsl.settings.aws.credentials.session_token` | `None` | Your AWS MFA session token to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` |
  | `spark.jsl.settings.aws.s3_bucket` | `None` | Your AWS S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` |
  | `spark.jsl.settings.aws.region` | `None` | Your AWS region to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` |
+ | `spark.jsl.settings.onnx.gpuDeviceId` | `0` | Constructs CUDA execution provider options for the specified non-negative device id. |
+ | `spark.jsl.settings.onnx.intraOpNumThreads` | `6` | Sets the size of the CPU thread pool used for executing a single graph, if executing on a CPU. |
+ | `spark.jsl.settings.onnx.optimizationLevel` | `ALL_OPT` | Sets the optimization level of this options object, overriding the old setting. |
+ | `spark.jsl.settings.onnx.executionMode` | `SEQUENTIAL` | Sets the execution mode of this options object, overriding the old setting. |

  ### How to set Spark NLP Configuration

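The four `spark.jsl.settings.onnx.*` properties are the rows added in this release's table. A hedged sketch of setting them at session start; the property names come from the table above, and the values shown are just the documented defaults:

```python
from pyspark.sql import SparkSession

# Values mirror the documented defaults; tune them per deployment.
spark = (SparkSession.builder
         .appName("Spark NLP with ONNX settings")
         .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3")
         .config("spark.jsl.settings.onnx.gpuDeviceId", "0")
         .config("spark.jsl.settings.onnx.intraOpNumThreads", "6")
         .config("spark.jsl.settings.onnx.optimizationLevel", "ALL_OPT")
         .config("spark.jsl.settings.onnx.executionMode", "SEQUENTIAL")
         .getOrCreate())
```
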
@@ -975,7 +979,7 @@ spark = SparkSession.builder
  .config("spark.kryoserializer.buffer.max", "2000m")
  .config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained")
  .config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage")
- .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1")
+ .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3")
  .getOrCreate()
  ```

@@ -989,7 +993,7 @@ spark-shell \
  --conf spark.kryoserializer.buffer.max=2000M \
  --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
  --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
- --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
+ --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3
  ```

  **pyspark:**
@@ -1002,7 +1006,7 @@ pyspark \
  --conf spark.kryoserializer.buffer.max=2000M \
  --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
  --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
- --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
+ --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3
  ```

  **Databricks:**
@@ -1274,7 +1278,7 @@ spark = SparkSession.builder
  .config("spark.driver.memory", "16G")
  .config("spark.driver.maxResultSize", "0")
  .config("spark.kryoserializer.buffer.max", "2000M")
- .config("spark.jars", "/tmp/spark-nlp-assembly-5.2.1.jar")
+ .config("spark.jars", "/tmp/spark-nlp-assembly-5.2.3.jar")
  .getOrCreate()
  ```

@@ -1283,7 +1287,7 @@ spark = SparkSession.builder
  version (3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x)
  - If you are local, you can load the Fat JAR from your local FileSystem, however, if you are in a cluster setup you need
  to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. (
- i.e., `hdfs:///tmp/spark-nlp-assembly-5.2.1.jar`)
+ i.e., `hdfs:///tmp/spark-nlp-assembly-5.2.3.jar`)

  Example of using pretrained Models and Pipelines in offline:

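The offline example referenced here is unchanged in the package, so it is not part of the diff. A minimal sketch of the pattern, assuming a model archive already downloaded and extracted locally (the folder name is purely illustrative):

```python
from sparknlp.annotator import PerceptronModel

# Online, this model would come from PerceptronModel.pretrained("pos_ud_gsd", lang="fr");
# offline, point .load() at the extracted model directory instead.
french_pos = PerceptronModel.load("/tmp/pos_ud_gsd_fr_2.0.2_2.4_1556531457346/") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")
```
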
spark_nlp-5.2.1.dist-info/RECORD → spark_nlp-5.2.3.dist-info/RECORD

@@ -1,7 +1,7 @@
  com/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
  com/johnsnowlabs/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
  com/johnsnowlabs/nlp/__init__.py,sha256=DPIVXtONO5xXyOk-HB0-sNiHAcco17NN13zPS_6Uw8c,294
- sparknlp/__init__.py,sha256=IZdioR6c5AgxmY8nN2B9viL02n_EvpT8EdV8rXGOM1Y,13588
+ sparknlp/__init__.py,sha256=qzDxFYDRyF2Jw1kVlbunQjoL6qtiJ5EA9td1vsm1J5w,13588
  sparknlp/annotation.py,sha256=I5zOxG5vV2RfPZfqN9enT1i4mo6oBcn3Lrzs37QiOiA,5635
  sparknlp/annotation_audio.py,sha256=iRV_InSVhgvAwSRe9NTbUH9v6OGvTM-FPCpSAKVu0mE,1917
  sparknlp/annotation_image.py,sha256=xhCe8Ko-77XqWVuuYHFrjKqF6zPd8Z-RY_rmZXNwCXU,2547
@@ -75,11 +75,11 @@ sparknlp/annotator/cv/vit_for_image_classification.py,sha256=D2V3pxAd3rBi1817lxV
  sparknlp/annotator/dependency/__init__.py,sha256=eV43oXAGaYl2N1XKIEAAZJLNP8gpHm8VxuXDeDlQzR4,774
  sparknlp/annotator/dependency/dependency_parser.py,sha256=SxyvHPp8Hs1Xnm5X1nLTMi095XoQMtfL8pbys15mYAI,11212
  sparknlp/annotator/dependency/typed_dependency_parser.py,sha256=60vPdYkbFk9MPGegg3m9Uik9cMXpMZd8tBvXG39gNww,12456
- sparknlp/annotator/embeddings/__init__.py,sha256=IpLXw4LMrpw8muZ_NPorcpo4zS2hIvu8XW9ya_rMFcs,2038
+ sparknlp/annotator/embeddings/__init__.py,sha256=od9aVMywyLf0KYBueoTeUjFbbCnh4UIuIGbsXwGtOAQ,2097
  sparknlp/annotator/embeddings/albert_embeddings.py,sha256=6Rd1LIn8oFIpq_ALcJh-RUjPEO7Ht8wsHY6JHSFyMkw,9995
  sparknlp/annotator/embeddings/bert_embeddings.py,sha256=uExpIlJNkQpuoZ3J_Zc2b2dV0hDNCRCAujNR4Lckly4,8369
  sparknlp/annotator/embeddings/bert_sentence_embeddings.py,sha256=XHls9qOkurwg9o6nDuwk77KSMNJmv1n4L5pcU22alWA,9054
- sparknlp/annotator/embeddings/bge_embeddings.py,sha256=2zvbLNSKMoEFkobRZAjWBb2GOhJYOEnHxptQrd8hXqw,7878
+ sparknlp/annotator/embeddings/bge_embeddings.py,sha256=FNmYxcynM1iLJvg5ZNmrZKkyIF0Gtr7G-CgZ72mrVyU,7842
  sparknlp/annotator/embeddings/camembert_embeddings.py,sha256=dBTXas-2Tas_JUR9Xt_GtHLcyqi_cdvT5EHRnyVrSSQ,8817
  sparknlp/annotator/embeddings/chunk_embeddings.py,sha256=WUmkJimSuFkdcLJnvcxOV0QlCLgGlhub29ZTrZb70WE,6052
  sparknlp/annotator/embeddings/deberta_embeddings.py,sha256=_b5nzLb7heFQNN-uT2oBNO6-YmM8bHmAdnGXg47HOWw,8649
@@ -219,7 +219,7 @@ sparknlp/training/_tf_graph_builders_1x/ner_dl/dataset_encoder.py,sha256=R4yHFN3
  sparknlp/training/_tf_graph_builders_1x/ner_dl/ner_model.py,sha256=EoCSdcIjqQ3wv13MAuuWrKV8wyVBP0SbOEW41omHlR0,23189
  sparknlp/training/_tf_graph_builders_1x/ner_dl/ner_model_saver.py,sha256=k5CQ7gKV6HZbZMB8cKLUJuZxoZWlP_DFWdZ--aIDwsc,2356
  sparknlp/training/_tf_graph_builders_1x/ner_dl/sentence_grouper.py,sha256=pAxjWhjazSX8Vg0MFqJiuRVw1IbnQNSs-8Xp26L4nko,870
- spark_nlp-5.2.1.dist-info/METADATA,sha256=FYr8My66DEYXAqKvk-x3VJWpDlRI_2Jyi6wpurVKOYg,55114
- spark_nlp-5.2.1.dist-info/WHEEL,sha256=bb2Ot9scclHKMOLDEHY6B2sicWOgugjFKaJsT7vwMQo,110
- spark_nlp-5.2.1.dist-info/top_level.txt,sha256=uuytur4pyMRw2H_txNY2ZkaucZHUs22QF8-R03ch_-E,13
- spark_nlp-5.2.1.dist-info/RECORD,,
+ spark_nlp-5.2.3.dist-info/METADATA,sha256=QXMxdjxt8d8HEmdpys1UOmdWUvb1KfIdwYhfQ8pnSU0,56589
+ spark_nlp-5.2.3.dist-info/WHEEL,sha256=bb2Ot9scclHKMOLDEHY6B2sicWOgugjFKaJsT7vwMQo,110
+ spark_nlp-5.2.3.dist-info/top_level.txt,sha256=uuytur4pyMRw2H_txNY2ZkaucZHUs22QF8-R03ch_-E,13
+ spark_nlp-5.2.3.dist-info/RECORD,,
sparknlp/__init__.py CHANGED
@@ -128,7 +128,7 @@ def start(gpu=False,
  The initiated Spark session.

  """
- current_version = "5.2.1"
+ current_version = "5.2.3"

  if params is None:
  params = {}
@@ -309,4 +309,4 @@ def version():
  str
  The current Spark NLP version.
  """
- return '5.2.1'
+ return '5.2.3'
sparknlp/annotator/embeddings/__init__.py CHANGED

@@ -35,3 +35,4 @@ from sparknlp.annotator.embeddings.word_embeddings import *
  from sparknlp.annotator.embeddings.xlm_roberta_embeddings import *
  from sparknlp.annotator.embeddings.xlm_roberta_sentence_embeddings import *
  from sparknlp.annotator.embeddings.xlnet_embeddings import *
+ from sparknlp.annotator.embeddings.bge_embeddings import *
sparknlp/annotator/embeddings/bge_embeddings.py CHANGED

@@ -17,11 +17,11 @@ from sparknlp.common import *


  class BGEEmbeddings(AnnotatorModel,
-                      HasEmbeddingsProperties,
-                      HasCaseSensitiveProperties,
-                      HasStorageRef,
-                      HasBatchedAnnotate,
-                      HasMaxSentenceLengthLimit):
+                     HasEmbeddingsProperties,
+                     HasCaseSensitiveProperties,
+                     HasStorageRef,
+                     HasBatchedAnnotate,
+                     HasMaxSentenceLengthLimit):
  """Sentence embeddings using BGE.

  BGE, or BAAI General Embeddings, a model that can map any text to a low-dimensional dense
@@ -125,7 +125,6 @@ class BGEEmbeddings(AnnotatorModel,
  "ConfigProto from tensorflow, serialized into byte array. Get with config_proto.SerializeToString()",
  TypeConverters.toListInt)

-
  def setConfigProtoBytes(self, b):
  """Sets configProto from tensorflow, serialized into byte array.
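
With `bge_embeddings` now exported from `sparknlp/annotator/embeddings/__init__.py` (the hunk above), `BGEEmbeddings` is importable directly from `sparknlp.annotator`. A hedged usage sketch; `pretrained()` with no arguments is assumed to resolve a default BGE checkpoint from the Models Hub, so treat the model choice as illustrative:

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import BGEEmbeddings
from pyspark.ml import Pipeline

spark = sparknlp.start()

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Downloads a default BGE checkpoint; pass an explicit name/lang to pin one.
embeddings = BGEEmbeddings.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("bge_embeddings")

pipeline = Pipeline(stages=[document_assembler, embeddings])

data = spark.createDataFrame([["Spark NLP maps text to dense vectors."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("bge_embeddings.embeddings").show(truncate=80)
```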