spark-nlp 5.3.1__py2.py3-none-any.whl → 5.3.3__py2.py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.



@@ -0,0 +1 @@
1
+ 90f78083-0ee0-43e9-8240-7263731b6707
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.1
2
2
  Name: spark-nlp
3
- Version: 5.3.1
3
+ Version: 5.3.3
4
4
  Summary: John Snow Labs Spark NLP is a natural language processing library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines, that scale easily in a distributed environment.
5
5
  Home-page: https://github.com/JohnSnowLabs/spark-nlp
6
6
  Author: John Snow Labs
@@ -197,7 +197,7 @@ To use Spark NLP you need the following requirements:
197
197
 
198
198
  **GPU (optional):**
199
199
 
200
- Spark NLP 5.3.1 is built with ONNX 1.17.0 and TensorFlow 2.7.1 deep learning engines. The following NVIDIA® software versions are the minimum required, and only for GPU support:
200
+ Spark NLP 5.3.3 is built with ONNX 1.17.0 and TensorFlow 2.7.1 deep learning engines. The following NVIDIA® software versions are the minimum required, and only for GPU support:
201
201
 
202
202
  - NVIDIA® GPU drivers version 450.80.02 or higher
203
203
  - CUDA® Toolkit 11.2
@@ -213,7 +213,7 @@ $ java -version
213
213
  $ conda create -n sparknlp python=3.7 -y
214
214
  $ conda activate sparknlp
215
215
  # spark-nlp by default is based on pyspark 3.x
216
- $ pip install spark-nlp==5.3.1 pyspark==3.3.1
216
+ $ pip install spark-nlp==5.3.3 pyspark==3.3.1
217
217
  ```
218
218
 
219
219
  In Python console or Jupyter `Python3` kernel:
@@ -258,7 +258,7 @@ For more examples, you can visit our dedicated [examples](https://github.com/Joh
258
258
 
259
259
  ## Apache Spark Support
260
260
 
261
- Spark NLP *5.3.1* is built on top of Apache Spark 3.4 and fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
261
+ Spark NLP *5.3.3* is built on top of Apache Spark 3.4 and fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
262
262
 
263
263
  | Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x |
264
264
  |-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
@@ -302,7 +302,7 @@ Find out more about `Spark NLP` versions from our [release notes](https://github
302
302
 
303
303
  ## Databricks Support
304
304
 
305
- Spark NLP 5.3.1 has been tested and is compatible with the following runtimes:
305
+ Spark NLP 5.3.3 has been tested and is compatible with the following runtimes:
306
306
 
307
307
  **CPU:**
308
308
 
@@ -375,7 +375,7 @@ Spark NLP 5.3.1 has been tested and is compatible with the following runtimes:
375
375
 
376
376
  ## EMR Support
377
377
 
378
- Spark NLP 5.3.1 has been tested and is compatible with the following EMR releases:
378
+ Spark NLP 5.3.3 has been tested and is compatible with the following EMR releases:
379
379
 
380
380
  - emr-6.2.0
381
381
  - emr-6.3.0
@@ -425,11 +425,11 @@ Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x,
425
425
  ```sh
426
426
  # CPU
427
427
 
428
- spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.1
428
+ spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
429
429
 
430
- pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.1
430
+ pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
431
431
 
432
- spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.1
432
+ spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
433
433
  ```
434
434
 
435
435
  The `spark-nlp` has been published to
@@ -438,11 +438,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
438
438
  ```sh
439
439
  # GPU
440
440
 
441
- spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.1
441
+ spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.3
442
442
 
443
- pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.1
443
+ pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.3
444
444
 
445
- spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.1
445
+ spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.3
446
446
 
447
447
  ```
448
448
 
@@ -452,11 +452,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
452
452
  ```sh
453
453
  # AArch64
454
454
 
455
- spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.1
455
+ spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.3
456
456
 
457
- pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.1
457
+ pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.3
458
458
 
459
- spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.1
459
+ spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.3
460
460
 
461
461
  ```
462
462
 
@@ -466,11 +466,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
466
466
  ```sh
467
467
  # M1/M2 (Apple Silicon)
468
468
 
469
- spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.1
469
+ spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.3
470
470
 
471
- pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.1
471
+ pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.3
472
472
 
473
- spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.1
473
+ spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.3
474
474
 
475
475
  ```
476
476
 
@@ -484,7 +484,7 @@ set in your SparkSession:
484
484
  spark-shell \
485
485
  --driver-memory 16g \
486
486
  --conf spark.kryoserializer.buffer.max=2000M \
487
- --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.1
487
+ --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
488
488
  ```
489
489
 
490
490
  ## Scala
@@ -502,7 +502,7 @@ coordinates:
502
502
  <dependency>
503
503
  <groupId>com.johnsnowlabs.nlp</groupId>
504
504
  <artifactId>spark-nlp_2.12</artifactId>
505
- <version>5.3.1</version>
505
+ <version>5.3.3</version>
506
506
  </dependency>
507
507
  ```
508
508
 
@@ -513,7 +513,7 @@ coordinates:
513
513
  <dependency>
514
514
  <groupId>com.johnsnowlabs.nlp</groupId>
515
515
  <artifactId>spark-nlp-gpu_2.12</artifactId>
516
- <version>5.3.1</version>
516
+ <version>5.3.3</version>
517
517
  </dependency>
518
518
  ```
519
519
 
@@ -524,7 +524,7 @@ coordinates:
524
524
  <dependency>
525
525
  <groupId>com.johnsnowlabs.nlp</groupId>
526
526
  <artifactId>spark-nlp-aarch64_2.12</artifactId>
527
- <version>5.3.1</version>
527
+ <version>5.3.3</version>
528
528
  </dependency>
529
529
  ```
530
530
 
@@ -535,7 +535,7 @@ coordinates:
535
535
  <dependency>
536
536
  <groupId>com.johnsnowlabs.nlp</groupId>
537
537
  <artifactId>spark-nlp-silicon_2.12</artifactId>
538
- <version>5.3.1</version>
538
+ <version>5.3.3</version>
539
539
  </dependency>
540
540
  ```
541
541
 
@@ -545,28 +545,28 @@ coordinates:
545
545
 
546
546
  ```sbtshell
547
547
  // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp
548
- libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.3.1"
548
+ libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.3.3"
549
549
  ```
550
550
 
551
551
  **spark-nlp-gpu:**
552
552
 
553
553
  ```sbtshell
554
554
  // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu
555
- libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.3.1"
555
+ libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.3.3"
556
556
  ```
557
557
 
558
558
  **spark-nlp-aarch64:**
559
559
 
560
560
  ```sbtshell
561
561
  // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64
562
- libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.3.1"
562
+ libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.3.3"
563
563
  ```
564
564
 
565
565
  **spark-nlp-silicon:**
566
566
 
567
567
  ```sbtshell
568
568
  // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon
569
- libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.3.1"
569
+ libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.3.3"
570
570
  ```
571
571
 
572
572
  Maven
@@ -588,7 +588,7 @@ If you installed pyspark through pip/conda, you can install `spark-nlp` through
588
588
  Pip:
589
589
 
590
590
  ```bash
591
- pip install spark-nlp==5.3.1
591
+ pip install spark-nlp==5.3.3
592
592
  ```
593
593
 
594
594
  Conda:
@@ -617,7 +617,7 @@ spark = SparkSession.builder
617
617
  .config("spark.driver.memory", "16G")
618
618
  .config("spark.driver.maxResultSize", "0")
619
619
  .config("spark.kryoserializer.buffer.max", "2000M")
620
- .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.1")
620
+ .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3")
621
621
  .getOrCreate()
622
622
  ```
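The value passed to `spark.jars.packages` follows the Maven pattern `group:artifact_scalaVersion:version`, and the README uses the same pattern for the CPU, GPU, AArch64, and Apple Silicon artifacts. A minimal sketch of that pattern is shown here; the helper name `maven_coordinate` and the `flavor` keys are hypothetical and not part of Spark NLP, while the group ID, artifact names, and the `2.12` suffix are taken from the README.

```python
# Hypothetical helper (not part of Spark NLP): builds the Maven coordinate
# strings that the README passes to --packages / spark.jars.packages.
GROUP_ID = "com.johnsnowlabs.nlp"

# Artifact names as they appear in the README; the dictionary keys are
# illustrative labels chosen for this sketch.
ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "aarch64": "spark-nlp-aarch64",
    "silicon": "spark-nlp-silicon",
}


def maven_coordinate(version, flavor="cpu", scala_version="2.12"):
    """Return 'group:artifact_scalaVersion:version' for the given flavor."""
    if flavor not in ARTIFACTS:
        raise ValueError(f"Unknown flavor: {flavor!r}")
    return f"{GROUP_ID}:{ARTIFACTS[flavor]}_{scala_version}:{version}"


print(maven_coordinate("5.3.3"))
# com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
print(maven_coordinate("5.3.3", flavor="gpu"))
# com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.3
```

Keeping the version in one place like this is one way to avoid the pip package and the JAR drifting apart when a release is bumped.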
623
623
 
@@ -688,7 +688,7 @@ Use either one of the following options
688
688
  - Add the following Maven Coordinates to the interpreter's library list
689
689
 
690
690
  ```bash
691
- com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.1
691
+ com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
692
692
  ```
693
693
 
694
694
  - Add a path to pre-built jar from [here](#compiled-jars) in the interpreter's library list making sure the jar is
@@ -699,7 +699,7 @@ com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.1
699
699
  Apart from the previous step, install the python module through pip
700
700
 
701
701
  ```bash
702
- pip install spark-nlp==5.3.1
702
+ pip install spark-nlp==5.3.3
703
703
  ```
704
704
 
705
705
  Or you can install `spark-nlp` from inside Zeppelin by using Conda:
@@ -727,7 +727,7 @@ launch the Jupyter from the same Python environment:
727
727
  $ conda create -n sparknlp python=3.8 -y
728
728
  $ conda activate sparknlp
729
729
  # spark-nlp by default is based on pyspark 3.x
730
- $ pip install spark-nlp==5.3.1 pyspark==3.3.1 jupyter
730
+ $ pip install spark-nlp==5.3.3 pyspark==3.3.1 jupyter
731
731
  $ jupyter notebook
732
732
  ```
733
733
 
@@ -744,7 +744,7 @@ export PYSPARK_PYTHON=python3
744
744
  export PYSPARK_DRIVER_PYTHON=jupyter
745
745
  export PYSPARK_DRIVER_PYTHON_OPTS=notebook
746
746
 
747
- pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.1
747
+ pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
748
748
  ```
749
749
 
750
750
  Alternatively, you can mix in using `--jars` option for pyspark + `pip install spark-nlp`
@@ -771,7 +771,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
771
771
  # -s is for spark-nlp
772
772
  # -g will enable upgrading libcudnn8 to 8.1.0 on Google Colab for GPU usage
773
773
  # by default they are set to the latest
774
- !wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.3.1
774
+ !wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.3.3
775
775
  ```
776
776
 
777
777
  [Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb)
@@ -794,7 +794,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
794
794
  # -s is for spark-nlp
795
795
  # -g will enable upgrading libcudnn8 to 8.1.0 on Kaggle for GPU usage
796
796
  # by default they are set to the latest
797
- !wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.3.1
797
+ !wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.3.3
798
798
  ```
799
799
 
800
800
  [Spark NLP quick start on Kaggle Kernel](https://www.kaggle.com/mozzie/spark-nlp-named-entity-recognition) is a live
@@ -813,9 +813,9 @@ demo on Kaggle Kernel that performs named entity recognitions by using Spark NLP
813
813
 
814
814
  3. In `Libraries` tab inside your cluster you need to follow these steps:
815
815
 
816
- 3.1. Install New -> PyPI -> `spark-nlp==5.3.1` -> Install
816
+ 3.1. Install New -> PyPI -> `spark-nlp==5.3.3` -> Install
817
817
 
818
- 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.1` -> Install
818
+ 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3` -> Install
819
819
 
820
820
  4. Now you can attach your notebook to the cluster and use Spark NLP!
821
821
 
@@ -866,7 +866,7 @@ A sample of your software configuration in JSON on S3 (must be public access):
866
866
  "spark.kryoserializer.buffer.max": "2000M",
867
867
  "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
868
868
  "spark.driver.maxResultSize": "0",
869
- "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.1"
869
+ "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3"
870
870
  }
871
871
  }]
872
872
  ```
@@ -875,7 +875,7 @@ A sample of AWS CLI to launch EMR cluster:
875
875
 
876
876
  ```.sh
877
877
  aws emr create-cluster \
878
- --name "Spark NLP 5.3.1" \
878
+ --name "Spark NLP 5.3.3" \
879
879
  --release-label emr-6.2.0 \
880
880
  --applications Name=Hadoop Name=Spark Name=Hive \
881
881
  --instance-type m4.4xlarge \
@@ -939,7 +939,7 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \
939
939
  --enable-component-gateway \
940
940
  --metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \
941
941
  --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \
942
- --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.1
942
+ --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
943
943
  ```
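The `--properties` argument above is a comma-separated list of `spark:<key>=<value>` pairs. A small sketch of assembling that string from a dictionary follows; the function name `dataproc_properties` is hypothetical, and the sketch assumes that no value contains a comma (such values would need gcloud's alternate-delimiter escaping, which is not handled here).

```python
# Hypothetical helper: joins Spark settings into the single string expected
# by `gcloud dataproc clusters create --properties`.
def dataproc_properties(conf, prefix="spark"):
    """Render {'key': 'value'} as 'prefix:key=value,prefix:key=value,...'."""
    return ",".join(f"{prefix}:{key}={value}" for key, value in conf.items())


# Settings copied from the README command; dicts keep insertion order.
conf = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.driver.maxResultSize": "0",
    "spark.kryoserializer.buffer.max": "2000M",
    "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3",
}

print(dataproc_properties(conf))
```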
944
944
 
945
945
  2. On an existing one, you need to install spark-nlp and spark-nlp-display packages from PyPI.
@@ -982,7 +982,7 @@ spark = SparkSession.builder
982
982
  .config("spark.kryoserializer.buffer.max", "2000m")
983
983
  .config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained")
984
984
  .config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage")
985
- .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.1")
985
+ .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3")
986
986
  .getOrCreate()
987
987
  ```
988
988
 
@@ -996,7 +996,7 @@ spark-shell \
996
996
  --conf spark.kryoserializer.buffer.max=2000M \
997
997
  --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
998
998
  --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
999
- --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.1
999
+ --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
1000
1000
  ```
1001
1001
 
1002
1002
  **pyspark:**
@@ -1009,7 +1009,7 @@ pyspark \
1009
1009
  --conf spark.kryoserializer.buffer.max=2000M \
1010
1010
  --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
1011
1011
  --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
1012
- --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.1
1012
+ --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
1013
1013
  ```
1014
1014
 
1015
1015
  **Databricks:**
@@ -1281,7 +1281,7 @@ spark = SparkSession.builder
1281
1281
  .config("spark.driver.memory", "16G")
1282
1282
  .config("spark.driver.maxResultSize", "0")
1283
1283
  .config("spark.kryoserializer.buffer.max", "2000M")
1284
- .config("spark.jars", "/tmp/spark-nlp-assembly-5.3.1.jar")
1284
+ .config("spark.jars", "/tmp/spark-nlp-assembly-5.3.3.jar")
1285
1285
  .getOrCreate()
1286
1286
  ```
1287
1287
 
@@ -1290,7 +1290,7 @@ spark = SparkSession.builder
1290
1290
  version (3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x)
1291
1291
  - If you are local, you can load the Fat JAR from your local FileSystem, however, if you are in a cluster setup you need
1292
1292
  to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. (
1293
- i.e., `hdfs:///tmp/spark-nlp-assembly-5.3.1.jar`)
1293
+ i.e., `hdfs:///tmp/spark-nlp-assembly-5.3.3.jar`)
1294
1294
 
1295
1295
  Example of using pretrained Models and Pipelines in offline:
1296
1296
 
@@ -1,7 +1,7 @@
1
1
  com/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
2
2
  com/johnsnowlabs/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
3
3
  com/johnsnowlabs/nlp/__init__.py,sha256=DPIVXtONO5xXyOk-HB0-sNiHAcco17NN13zPS_6Uw8c,294
4
- sparknlp/__init__.py,sha256=fP6mNHdeh0JvNOydT4WGsALtBOM5HLWx5Kz9MplSS8s,13588
4
+ sparknlp/__init__.py,sha256=ZUkW_iY3tWQwa5XvLKprnbvY0_hTCOHJSYWb-KNrvmE,13588
5
5
  sparknlp/annotation.py,sha256=I5zOxG5vV2RfPZfqN9enT1i4mo6oBcn3Lrzs37QiOiA,5635
6
6
  sparknlp/annotation_audio.py,sha256=iRV_InSVhgvAwSRe9NTbUH9v6OGvTM-FPCpSAKVu0mE,1917
7
7
  sparknlp/annotation_image.py,sha256=xhCe8Ko-77XqWVuuYHFrjKqF6zPd8Z-RY_rmZXNwCXU,2547
@@ -78,7 +78,7 @@ sparknlp/annotator/cv/vit_for_image_classification.py,sha256=D2V3pxAd3rBi1817lxV
78
78
  sparknlp/annotator/dependency/__init__.py,sha256=eV43oXAGaYl2N1XKIEAAZJLNP8gpHm8VxuXDeDlQzR4,774
79
79
  sparknlp/annotator/dependency/dependency_parser.py,sha256=SxyvHPp8Hs1Xnm5X1nLTMi095XoQMtfL8pbys15mYAI,11212
80
80
  sparknlp/annotator/dependency/typed_dependency_parser.py,sha256=60vPdYkbFk9MPGegg3m9Uik9cMXpMZd8tBvXG39gNww,12456
81
- sparknlp/annotator/embeddings/__init__.py,sha256=od9aVMywyLf0KYBueoTeUjFbbCnh4UIuIGbsXwGtOAQ,2097
81
+ sparknlp/annotator/embeddings/__init__.py,sha256=XQ6-UMsfvH54u3f0yceKiM8XJOAugIT3jwHE3ExoppI,2156
82
82
  sparknlp/annotator/embeddings/albert_embeddings.py,sha256=6Rd1LIn8oFIpq_ALcJh-RUjPEO7Ht8wsHY6JHSFyMkw,9995
83
83
  sparknlp/annotator/embeddings/bert_embeddings.py,sha256=uExpIlJNkQpuoZ3J_Zc2b2dV0hDNCRCAujNR4Lckly4,8369
84
84
  sparknlp/annotator/embeddings/bert_sentence_embeddings.py,sha256=XHls9qOkurwg9o6nDuwk77KSMNJmv1n4L5pcU22alWA,9054
@@ -96,6 +96,7 @@ sparknlp/annotator/embeddings/mpnet_embeddings.py,sha256=2sabImn5spYGzfNwBSH2zUU
96
96
  sparknlp/annotator/embeddings/roberta_embeddings.py,sha256=V4HGDUK2YBHhAZd1ygJEGUmxDgul0MrpKDm1UQcNqTs,9135
97
97
  sparknlp/annotator/embeddings/roberta_sentence_embeddings.py,sha256=KVrD4z_tIU-sphK6dmbbnHBBt8-Y89C_BFQAkN99kZo,8181
98
98
  sparknlp/annotator/embeddings/sentence_embeddings.py,sha256=azuA1FKMtTJ9suwJqTEHeWHumT6kYdfURTe_1fsqcB8,5402
99
+ sparknlp/annotator/embeddings/uae_embeddings.py,sha256=sqTT67vcegVxcyoATISLPJSmOnA6J_otB6iREKOb6e4,8794
99
100
  sparknlp/annotator/embeddings/universal_sentence_encoder.py,sha256=_fTo-K78RjxiIKptpsI32mpW87RFCdXM16epHv4RVQY,8571
100
101
  sparknlp/annotator/embeddings/word2vec.py,sha256=UBhA4qUczQOx1t82Eu51lxx1-wJ_RLnCb__ncowSNhk,13229
101
102
  sparknlp/annotator/embeddings/word_embeddings.py,sha256=CQxjx2yDdmSM9s8D-bzsbUQhT8t1cqC4ynxlf9INpMU,15388
@@ -182,7 +183,7 @@ sparknlp/common/read_as.py,sha256=imxPGwV7jr4Li_acbo0OAHHRGCBbYv-akzEGaBWEfcY,12
182
183
  sparknlp/common/recursive_annotator_approach.py,sha256=vqugBw22cE3Ff7PIpRlnYFuOlchgL0nM26D8j-NdpqU,1449
183
184
  sparknlp/common/storage.py,sha256=D91H3p8EIjNspjqAYu6ephRpCUtdcAir4_PrAbkIQWE,4842
184
185
  sparknlp/common/utils.py,sha256=Yne6yYcwKxhOZC-U4qfYoDhWUP_6BIaAjI5X_P_df1E,1306
185
- sparknlp/internal/__init__.py,sha256=g4REY_0X2Sr05szDb9681oiPqRWlT4KaOpcAOj3q32A,26496
186
+ sparknlp/internal/__init__.py,sha256=ymZxTXlIf6e_wWEBCVI727zq2EP4nD5z97BWmJDuKlo,26725
186
187
  sparknlp/internal/annotator_java_ml.py,sha256=UGPoThG0rGXUOXGSQnDzEDW81Mu1s5RPF29v7DFyE3c,1187
187
188
  sparknlp/internal/annotator_transformer.py,sha256=fXmc2IWXGybqZpbEU9obmbdBYPc798y42zvSB4tqV9U,1448
188
189
  sparknlp/internal/extended_java_wrapper.py,sha256=hwP0133-hDiDf5sBF-P3MtUsuuDj1PpQbtGZQIRwzfk,2240
@@ -192,7 +193,7 @@ sparknlp/logging/__init__.py,sha256=DoROFF5KLZe4t4Q-OHxqk1nhqbw9NQ-wb64y8icNwgw,
192
193
  sparknlp/logging/comet.py,sha256=_ZBi9-hlilCAnd4lvdYMWiq4Vqsppv8kow3k0cf-NG4,15958
193
194
  sparknlp/pretrained/__init__.py,sha256=GV-x9UBK8F2_IR6zYatrzFcVJtkSUIMbxqWsxRUePmQ,793
194
195
  sparknlp/pretrained/pretrained_pipeline.py,sha256=lquxiaABuA68Rmu7csamJPqBoRJqMUO0oNHsmEZDAIs,5740
195
- sparknlp/pretrained/resource_downloader.py,sha256=XKnx9Mu_K3R7Quj2X1EHVUzY5fJ6rvVnK-JChrWPaRY,7820
196
+ sparknlp/pretrained/resource_downloader.py,sha256=8_-rpvO2LsX_Lq4wMPif2ca3RlJZWEabt8pDm2xymiI,7806
196
197
  sparknlp/pretrained/utils.py,sha256=T1MrvW_DaWk_jcOjVLOea0NMFE9w8fe0ZT_5urZ_nEY,1099
197
198
  sparknlp/training/__init__.py,sha256=qREi9u-5Vc2VjpL6-XZsyvu5jSEIdIhowW7_kKaqMqo,852
198
199
  sparknlp/training/conll.py,sha256=wKBiSTrjc6mjsl7Nyt6B8f4yXsDJkZb-sn8iOjix9cE,6961
@@ -224,7 +225,8 @@ sparknlp/training/_tf_graph_builders_1x/ner_dl/dataset_encoder.py,sha256=R4yHFN3
224
225
  sparknlp/training/_tf_graph_builders_1x/ner_dl/ner_model.py,sha256=EoCSdcIjqQ3wv13MAuuWrKV8wyVBP0SbOEW41omHlR0,23189
225
226
  sparknlp/training/_tf_graph_builders_1x/ner_dl/ner_model_saver.py,sha256=k5CQ7gKV6HZbZMB8cKLUJuZxoZWlP_DFWdZ--aIDwsc,2356
226
227
  sparknlp/training/_tf_graph_builders_1x/ner_dl/sentence_grouper.py,sha256=pAxjWhjazSX8Vg0MFqJiuRVw1IbnQNSs-8Xp26L4nko,870
227
- spark_nlp-5.3.1.dist-info/METADATA,sha256=cfK1KW9iG7FnwuiQH9bBTakLsWE7H_1zHTnMPOICjE8,57087
228
- spark_nlp-5.3.1.dist-info/WHEEL,sha256=bb2Ot9scclHKMOLDEHY6B2sicWOgugjFKaJsT7vwMQo,110
229
- spark_nlp-5.3.1.dist-info/top_level.txt,sha256=uuytur4pyMRw2H_txNY2ZkaucZHUs22QF8-R03ch_-E,13
230
- spark_nlp-5.3.1.dist-info/RECORD,,
228
+ spark_nlp-5.3.3.dist-info/.uuid,sha256=1f6hF51aIuv9yCvh31NU9lOpS34NE-h3a0Et7R9yR6A,36
229
+ spark_nlp-5.3.3.dist-info/METADATA,sha256=YSJq8MiAoRizhOjb8zUeMBqNzNAL1rDEVW5MWy_Q37c,57087
230
+ spark_nlp-5.3.3.dist-info/WHEEL,sha256=bb2Ot9scclHKMOLDEHY6B2sicWOgugjFKaJsT7vwMQo,110
231
+ spark_nlp-5.3.3.dist-info/top_level.txt,sha256=uuytur4pyMRw2H_txNY2ZkaucZHUs22QF8-R03ch_-E,13
232
+ spark_nlp-5.3.3.dist-info/RECORD,,
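Each RECORD entry above has the form `path,sha256=<digest>,<size>`, where the digest is the URL-safe Base64 encoding of the file's SHA-256 hash with the trailing `=` padding removed. The sketch below reproduces that encoding so an entry can be checked against a file's bytes; the function names are hypothetical, and the only value verified here is the empty-file entry for `com/__init__.py` that appears in the listing.

```python
import base64
import hashlib


def record_digest(data: bytes) -> str:
    """Return the 'sha256=...' field used in a wheel RECORD entry."""
    raw = hashlib.sha256(data).digest()
    # URL-safe Base64 without '=' padding, as used in wheel RECORD files.
    encoded = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    return f"sha256={encoded}"


def record_line(path: str, data: bytes) -> str:
    """Return a full RECORD line: path, digest, and size in bytes."""
    return f"{path},{record_digest(data)},{len(data)}"


# An empty file, matching the com/__init__.py entry in the RECORD listing.
print(record_line("com/__init__.py", b""))
# com/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
```

A changed digest with an unchanged size, as for `sparknlp/__init__.py`, is what a same-length edit such as `5.3.1` to `5.3.3` would be expected to produce.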
sparknlp/__init__.py CHANGED
@@ -128,7 +128,7 @@ def start(gpu=False,
128
128
  The initiated Spark session.
129
129
 
130
130
  """
131
- current_version = "5.3.1"
131
+ current_version = "5.3.3"
132
132
 
133
133
  if params is None:
134
134
  params = {}
@@ -309,4 +309,4 @@ def version():
309
309
  str
310
310
  The current Spark NLP version.
311
311
  """
312
- return '5.3.1'
312
+ return '5.3.3'
sparknlp/annotator/embeddings/__init__.py CHANGED
@@ -36,3 +36,4 @@ from sparknlp.annotator.embeddings.xlm_roberta_embeddings import *
36
36
  from sparknlp.annotator.embeddings.xlm_roberta_sentence_embeddings import *
37
37
  from sparknlp.annotator.embeddings.xlnet_embeddings import *
38
38
  from sparknlp.annotator.embeddings.bge_embeddings import *
39
+ from sparknlp.annotator.embeddings.uae_embeddings import *
sparknlp/annotator/embeddings/uae_embeddings.py ADDED
@@ -0,0 +1,211 @@
1
+ # Copyright 2017-2022 John Snow Labs
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ """Contains classes for UAEEmbeddings."""
15
+
16
+ from sparknlp.common import *
17
+
18
+
19
+ class UAEEmbeddings(AnnotatorModel,
20
+ HasEmbeddingsProperties,
21
+ HasCaseSensitiveProperties,
22
+ HasStorageRef,
23
+ HasBatchedAnnotate,
24
+ HasMaxSentenceLengthLimit):
25
+ """Sentence embeddings using Universal AnglE Embedding (UAE).
26
+
27
+ UAE is a novel angle-optimized text embedding model, designed to improve semantic textual
28
+ similarity tasks, which are crucial for Large Language Model (LLM) applications. By
29
+ introducing angle optimization in a complex space, AnglE effectively mitigates saturation of
30
+ the cosine similarity function.
31
+
32
+ Pretrained models can be loaded with :meth:`.pretrained` of the companion
33
+ object:
34
+
35
+ >>> embeddings = UAEEmbeddings.pretrained() \\
36
+ ... .setInputCols(["document"]) \\
37
+ ... .setOutputCol("UAE_embeddings")
38
+
39
+
40
+ The default model is ``"uae_large_v1"``, if no name is provided.
41
+
42
+ For available pretrained models please see the
43
+ `Models Hub <https://sparknlp.org/models?q=UAE>`__.
44
+
45
+
46
+ ====================== ======================
47
+ Input Annotation types Output Annotation type
48
+ ====================== ======================
49
+ ``DOCUMENT`` ``SENTENCE_EMBEDDINGS``
50
+ ====================== ======================
51
+
52
+ Parameters
53
+ ----------
54
+ batchSize
55
+ Size of every batch, by default 8
56
+ dimension
57
+ Number of embedding dimensions, by default 1024
58
+ caseSensitive
59
+ Whether to ignore case in tokens for embeddings matching, by default False
60
+ maxSentenceLength
61
+ Max sentence length to process, by default 512
62
+ configProtoBytes
63
+ ConfigProto from tensorflow, serialized into byte array.
64
+
65
+ References
66
+ ----------
67
+
68
+ `AnglE-optimized Text Embeddings <https://arxiv.org/abs/2309.12871>`__
69
+ `UAE Github Repository <https://github.com/baochi0212/uae-embedding>`__
70
+
71
+ **Paper abstract**
72
+
73
+ *High-quality text embedding is pivotal in improving semantic textual similarity (STS) tasks,
74
+ which are crucial components in Large Language Model (LLM) applications. However, a common
75
+ challenge existing text embedding models face is the problem of vanishing gradients, primarily
76
+ due to their reliance on the cosine function in the optimization objective, which has
77
+ saturation zones. To address this issue, this paper proposes a novel angle-optimized text
78
+ embedding model called AnglE. The core idea of AnglE is to introduce angle optimization in a
79
+ complex space. This novel approach effectively mitigates the adverse effects of the saturation
80
+ zone in the cosine function, which can impede gradient and hinder optimization processes. To
81
+ set up a comprehensive STS evaluation, we experimented on existing short-text STS datasets and
82
+ a newly collected long-text STS dataset from GitHub Issues. Furthermore, we examine
83
+ domain-specific STS scenarios with limited labeled data and explore how AnglE works with
84
+ LLM-annotated data. Extensive experiments were conducted on various tasks including short-text
85
+ STS, long-text STS, and domain-specific STS tasks. The results show that AnglE outperforms the
86
+ state-of-the-art (SOTA) STS models that ignore the cosine saturation zone. These findings
87
+ demonstrate the ability of AnglE to generate high-quality text embeddings and the usefulness
88
+ of angle optimization in STS.*
89
+
90
+ Examples
91
+ --------
92
+ >>> import sparknlp
93
+ >>> from sparknlp.base import *
94
+ >>> from sparknlp.annotator import *
95
+ >>> from pyspark.ml import Pipeline
96
+ >>> documentAssembler = DocumentAssembler() \\
97
+ ... .setInputCol("text") \\
98
+ ... .setOutputCol("document")
99
+ >>> embeddings = UAEEmbeddings.pretrained() \\
100
+ ... .setInputCols(["document"]) \\
101
+ ... .setOutputCol("embeddings")
102
+ >>> embeddingsFinisher = EmbeddingsFinisher() \\
103
+ ... .setInputCols("embeddings") \\
104
+ ... .setOutputCols("finished_embeddings") \\
105
+ ... .setOutputAsVector(True)
106
+ >>> pipeline = Pipeline().setStages([
107
+ ... documentAssembler,
108
+ ... embeddings,
109
+ ... embeddingsFinisher
110
+ ... ])
111
+ >>> data = spark.createDataFrame([["hello world"], ["hello moon"]]).toDF("text")
112
+ >>> result = pipeline.fit(data).transform(data)
113
+ >>> result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
114
+ +--------------------------------------------------------------------------------+
115
+ | result|
116
+ +--------------------------------------------------------------------------------+
117
+ |[0.50387806, 0.5861606, 0.35129607, -0.76046336, -0.32446072, -0.117674336, 0...|
118
+ |[0.6660665, 0.961762, 0.24854276, -0.1018044, -0.6569202, 0.027635604, 0.1915...|
119
+ +--------------------------------------------------------------------------------+
120
+ """
121
+
122
+ name = "UAEEmbeddings"
123
+
124
+ inputAnnotatorTypes = [AnnotatorType.DOCUMENT]
125
+
126
+ outputAnnotatorType = AnnotatorType.SENTENCE_EMBEDDINGS
127
+ poolingStrategy = Param(Params._dummy(),
128
+ "poolingStrategy",
129
+ "Pooling strategy to use for sentence embeddings",
130
+ TypeConverters.toString)
131
+
132
+ def setPoolingStrategy(self, value):
133
+ """Pooling strategy to use for sentence embeddings.
134
+
135
+ Available pooling strategies for sentence embeddings are:
136
+ - `"cls"`: leading `[CLS]` token
137
+ - `"cls_avg"`: leading `[CLS]` token + mean of all other tokens
138
+ - `"last"`: embeddings of the last token in the sequence
139
+ - `"avg"`: mean of all tokens
140
+ - `"max"`: max of all embedding features of the entire token sequence
141
+ - `"int"`: An integer number, which represents the index of the token to use as the
142
+ embedding
143
+
144
+ Parameters
145
+ ----------
146
+ value : str
147
+ Pooling strategy to use for sentence embeddings
148
+ """
149
+
150
+ valid_strategies = {"cls", "cls_avg", "last", "avg", "max"}
151
+ if value in valid_strategies or value.isdigit():
152
+ return self._set(poolingStrategy=value)
153
+ else:
154
+ raise ValueError(f"Invalid pooling strategy: {value}. "
155
+ f"Valid strategies are: {', '.join(sorted(valid_strategies))} or an integer.")
156
+
157
+ @keyword_only
158
+ def __init__(self, classname="com.johnsnowlabs.nlp.embeddings.UAEEmbeddings", java_model=None):
159
+ super(UAEEmbeddings, self).__init__(
160
+ classname=classname,
161
+ java_model=java_model
162
+ )
163
+ self._setDefault(
164
+ dimension=1024,
165
+ batchSize=8,
166
+ maxSentenceLength=512,
167
+ caseSensitive=False,
168
+ poolingStrategy="cls"
169
+ )
170
+
171
+ @staticmethod
172
+ def loadSavedModel(folder, spark_session):
173
+ """Loads a locally saved model.
174
+
175
+ Parameters
176
+ ----------
177
+ folder : str
178
+ Folder of the saved model
179
+ spark_session : pyspark.sql.SparkSession
180
+ The current SparkSession
181
+
182
+ Returns
183
+ -------
184
+ UAEEmbeddings
185
+ The restored model
186
+ """
187
+ from sparknlp.internal import _UAEEmbeddingsLoader
188
+ jModel = _UAEEmbeddingsLoader(folder, spark_session._jsparkSession)._java_obj
189
+ return UAEEmbeddings(java_model=jModel)
190
+
191
+ @staticmethod
192
+ def pretrained(name="uae_large_v1", lang="en", remote_loc=None):
193
+ """Downloads and loads a pretrained model.
194
+
195
+ Parameters
196
+ ----------
197
+ name : str, optional
198
+ Name of the pretrained model, by default "uae_large_v1"
199
+ lang : str, optional
200
+ Language of the pretrained model, by default "en"
201
+ remote_loc : str, optional
202
+ Optional remote address of the resource, by default None. Will use
203
+ Spark NLP's repositories otherwise.
204
+
205
+ Returns
206
+ -------
207
+ UAEEmbeddings
208
+ The restored model
209
+ """
210
+ from sparknlp.pretrained import ResourceDownloader
211
+ return ResourceDownloader.downloadModel(UAEEmbeddings, name, lang, remote_loc)
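The check performed by `setPoolingStrategy` in the new file can be sketched as a standalone function: a value is accepted if it is one of the five named strategies or a string of digits (a token index), and rejected with a `ValueError` otherwise. The function name `validate_pooling_strategy` and the module-level constant are illustrative and not part of Spark NLP; in the package, the accepted value is stored on the annotator through `_set`.

```python
# Named strategies listed in the setPoolingStrategy docstring.
VALID_STRATEGIES = ("cls", "cls_avg", "last", "avg", "max")


def validate_pooling_strategy(value: str) -> str:
    """Return value unchanged if it is a named strategy or a digit string.

    Illustrative stand-in for the check in UAEEmbeddings.setPoolingStrategy;
    it only validates and does not touch any Spark objects.
    """
    if value in VALID_STRATEGIES or value.isdigit():
        return value
    raise ValueError(
        f"Invalid pooling strategy: {value}. "
        f"Valid strategies are: {', '.join(VALID_STRATEGIES)} or an integer."
    )


print(validate_pooling_strategy("cls"))  # named strategy
print(validate_pooling_strategy("3"))    # token index given as a digit string
try:
    validate_pooling_strategy("median")
except ValueError as err:
    print(err)
```

Because the check relies on `str.isdigit`, the index has to be passed as a string such as `"3"`; a negative index such as `"-1"` is rejected.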
sparknlp/internal/__init__.py CHANGED
@@ -158,11 +158,13 @@ class _GPT2Loader(ExtendedJavaWrapper):
158
158
  super(_GPT2Loader, self).__init__(
159
159
  "com.johnsnowlabs.nlp.annotators.seq2seq.GPT2Transformer.loadSavedModel", path, jspark)
160
160
 
161
+
161
162
  class _LLAMA2Loader(ExtendedJavaWrapper):
162
163
  def __init__(self, path, jspark):
163
164
  super(_LLAMA2Loader, self).__init__(
164
165
  "com.johnsnowlabs.nlp.annotators.seq2seq.LLAMA2Transformer.loadSavedModel", path, jspark)
165
166
 
167
+
166
168
  class _LongformerLoader(ExtendedJavaWrapper):
167
169
  def __init__(self, path, jspark):
168
170
  super(_LongformerLoader, self).__init__("com.johnsnowlabs.nlp.embeddings.LongformerEmbeddings.loadSavedModel",
@@ -601,8 +603,8 @@ class _DeBertaForZeroShotClassification(ExtendedJavaWrapper):
601
603
  super(_DeBertaForZeroShotClassification, self).__init__(
602
604
  "com.johnsnowlabs.nlp.annotators.classifier.dl.DeBertaForZeroShotClassification.loadSavedModel", path,
603
605
  jspark)
604
-
605
-
606
+
607
+
606
608
  class _MPNetForSequenceClassificationLoader(ExtendedJavaWrapper):
607
609
  def __init__(self, path, jspark):
608
610
  super(_MPNetForSequenceClassificationLoader, self).__init__(
@@ -615,3 +617,10 @@ class _MPNetForQuestionAnsweringLoader(ExtendedJavaWrapper):
615
617
  super(_MPNetForQuestionAnsweringLoader, self).__init__(
616
618
  "com.johnsnowlabs.nlp.annotators.classifier.dl.MPNetForQuestionAnswering.loadSavedModel", path,
617
619
  jspark)
620
+
621
+
622
+ class _UAEEmbeddingsLoader(ExtendedJavaWrapper):
623
+ def __init__(self, path, jspark):
624
+ super(_UAEEmbeddingsLoader, self).__init__(
625
+ "com.johnsnowlabs.nlp.embeddings.UAEEmbeddings.loadSavedModel", path,
626
+ jspark)
sparknlp/pretrained/resource_downloader.py CHANGED
@@ -58,7 +58,6 @@ class ResourceDownloader(object):
58
58
 
59
59
  """
60
60
 
61
-
62
61
  @staticmethod
63
62
  def downloadModel(reader, name, language, remote_loc=None, j_dwn='PythonResourceDownloader'):
64
63
  """Downloads and loads a model with the default downloader. Usually this method
@@ -67,8 +66,8 @@ class ResourceDownloader(object):
67
66
 
68
67
  Parameters
69
68
  ----------
70
- reader : str
71
- Name of the class to read the model for
69
+ reader : obj
70
+ Class to read the model for
72
71
  name : str
73
72
  Name of the pretrained model
74
73
  language : str