spark-nlp 5.2.1__py2.py3-none-any.whl → 5.2.3__py2.py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release: the registry flags this version of spark-nlp as possibly problematic.
- {spark_nlp-5.2.1.dist-info → spark_nlp-5.2.3.dist-info}/METADATA +63 -59
- {spark_nlp-5.2.1.dist-info → spark_nlp-5.2.3.dist-info}/RECORD +7 -7
- sparknlp/__init__.py +2 -2
- sparknlp/annotator/embeddings/__init__.py +1 -0
- sparknlp/annotator/embeddings/bge_embeddings.py +5 -6
- {spark_nlp-5.2.1.dist-info → spark_nlp-5.2.3.dist-info}/WHEEL +0 -0
- {spark_nlp-5.2.1.dist-info → spark_nlp-5.2.3.dist-info}/top_level.txt +0 -0
{spark_nlp-5.2.1.dist-info → spark_nlp-5.2.3.dist-info}/METADATA CHANGED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: spark-nlp
-Version: 5.2.1
+Version: 5.2.3
 Summary: John Snow Labs Spark NLP is a natural language processing library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines, that scale easily in a distributed environment.
 Home-page: https://github.com/JohnSnowLabs/spark-nlp
 Author: John Snow Labs
@@ -51,10 +51,10 @@ Description-Content-Type: text/markdown

 Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. It provides **simple**, **performant** & **accurate** NLP annotations for machine learning pipelines that **scale** easily in a distributed
 environment.
-Spark NLP comes with **
+Spark NLP comes with **36000+** pretrained **pipelines** and **models** in more than **200+** languages.
 It also offers tasks such as **Tokenization**, **Word Segmentation**, **Part-of-Speech Tagging**, Word and Sentence **Embeddings**, **Named Entity Recognition**, **Dependency Parsing**, **Spell Checking**, **Text Classification**, **Sentiment Analysis**, **Token Classification**, **Machine Translation** (+180 languages), **Summarization**, **Question Answering**, **Table Question Answering**, **Text Generation**, **Image Classification**, **Image to Text (captioning)**, **Automatic Speech Recognition**, **Zero-Shot Learning**, and many more [NLP tasks](#features).

-**Spark NLP** is the only open-source NLP library in **production** that offers state-of-the-art transformers such as **BERT**, **CamemBERT**, **ALBERT**, **ELECTRA**, **XLNet**, **DistilBERT**, **RoBERTa**, **DeBERTa**, **XLM-RoBERTa**, **Longformer**, **ELMO**, **Universal Sentence Encoder**, **Facebook BART**, **Instructor**, **E5**, **Google T5**, **MarianMT**, **OpenAI GPT2**,
+**Spark NLP** is the only open-source NLP library in **production** that offers state-of-the-art transformers such as **BERT**, **CamemBERT**, **ALBERT**, **ELECTRA**, **XLNet**, **DistilBERT**, **RoBERTa**, **DeBERTa**, **XLM-RoBERTa**, **Longformer**, **ELMO**, **Universal Sentence Encoder**, **Facebook BART**, **Instructor**, **E5**, **Google T5**, **MarianMT**, **OpenAI GPT2**, **Vision Transformers (ViT)**, **OpenAI Whisper**, and many more not only to **Python** and **R**, but also to **JVM** ecosystem (**Java**, **Scala**, and **Kotlin**) at **scale** by extending **Apache Spark** natively.

 ## Project's website

@@ -191,7 +191,7 @@ documentation and examples
 - Easy ONNX and TensorFlow integrations
 - GPU Support
 - Full integration with Spark ML functions
-- +
+- +30000 pre-trained models in +200 languages!
 - +6000 pre-trained pipelines in +200 languages!
 - Multi-lingual NER models: Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hebrew, Italian,
   Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Urdu, and more.

@@ -205,7 +205,7 @@ To use Spark NLP you need the following requirements:

 **GPU (optional):**

-Spark NLP 5.2.1 is built with ONNX 1.16.3 and TensorFlow 2.7.1 deep learning engines. The minimum following NVIDIA® software are only required for GPU support:
+Spark NLP 5.2.3 is built with ONNX 1.16.3 and TensorFlow 2.7.1 deep learning engines. The minimum following NVIDIA® software are only required for GPU support:

 - NVIDIA® GPU drivers version 450.80.02 or higher
 - CUDA® Toolkit 11.2

@@ -221,7 +221,7 @@ $ java -version
 $ conda create -n sparknlp python=3.7 -y
 $ conda activate sparknlp
 # spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==5.2.1 pyspark==3.3.1
+$ pip install spark-nlp==5.2.3 pyspark==3.3.1
 ```

 In Python console or Jupyter `Python3` kernel:
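The quick start referenced by that context line is unchanged in this release. For orientation, a minimal sketch of using the bumped version; the pipeline name is one of the stock English pipelines and is downloaded and cached on first use:

```python
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# Builds a local SparkSession with the matching spark-nlp 5.2.3 jars attached
spark = sparknlp.start()

# A stock pretrained pipeline; fetched from the Models Hub and cached locally
pipeline = PretrainedPipeline("explain_document_dl", lang="en")
annotations = pipeline.annotate("Spark NLP 5.2.3 adds the BGEEmbeddings annotator.")
print(annotations["entities"])
```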
@@ -266,11 +266,11 @@ For more examples, you can visit our dedicated [examples](https://github.com/Joh

 ## Apache Spark Support

-Spark NLP *5.2.1* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
+Spark NLP *5.2.3* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x

 | Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x |
 |-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
-| 5.2.x | Partially | YES | YES | YES | YES | YES | NO | NO |
+| 5.2.x | YES | YES | YES | YES | YES | YES | NO | NO |
 | 5.1.x | Partially | YES | YES | YES | YES | YES | NO | NO |
 | 5.0.x | YES | YES | YES | YES | YES | YES | NO | NO |
 | 4.4.x | YES | YES | YES | YES | YES | YES | NO | NO |
@@ -308,7 +308,7 @@ Find out more about `Spark NLP` versions from our [release notes](https://github

 ## Databricks Support

-Spark NLP 5.2.1 has been tested and is compatible with the following runtimes:
+Spark NLP 5.2.3 has been tested and is compatible with the following runtimes:

 **CPU:**

@@ -375,7 +375,7 @@ Spark NLP 5.2.1 has been tested and is compatible with the following runtimes:

 ## EMR Support

-Spark NLP 5.2.1 has been tested and is compatible with the following EMR releases:
+Spark NLP 5.2.3 has been tested and is compatible with the following EMR releases:

 - emr-6.2.0
 - emr-6.3.0
@@ -422,11 +422,11 @@ Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x,
 ```sh
 # CPU

-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3

-pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3

-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3
 ```

 The `spark-nlp` has been published to

@@ -435,11 +435,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
 ```sh
 # GPU

-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.1
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.3

-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.1
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.3

-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.1
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.3
 ```

@@ -449,11 +449,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
 ```sh
 # AArch64

-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.1
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.3

-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.1
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.3

-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.1
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.3
 ```

@@ -463,11 +463,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
 ```sh
 # M1/M2 (Apple Silicon)

-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.1
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.3

-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.1
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.3

-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.1
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.3
 ```
@@ -481,7 +481,7 @@ set in your SparkSession:
 spark-shell \
   --driver-memory 16g \
   --conf spark.kryoserializer.buffer.max=2000M \
-  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
+  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3
 ```

 ## Scala
@@ -499,7 +499,7 @@ coordinates:
 <dependency>
     <groupId>com.johnsnowlabs.nlp</groupId>
     <artifactId>spark-nlp_2.12</artifactId>
-    <version>5.2.1</version>
+    <version>5.2.3</version>
 </dependency>
 ```

@@ -510,7 +510,7 @@ coordinates:
 <dependency>
     <groupId>com.johnsnowlabs.nlp</groupId>
     <artifactId>spark-nlp-gpu_2.12</artifactId>
-    <version>5.2.1</version>
+    <version>5.2.3</version>
 </dependency>
 ```

@@ -521,7 +521,7 @@ coordinates:
 <dependency>
     <groupId>com.johnsnowlabs.nlp</groupId>
     <artifactId>spark-nlp-aarch64_2.12</artifactId>
-    <version>5.2.1</version>
+    <version>5.2.3</version>
 </dependency>
 ```

@@ -532,7 +532,7 @@ coordinates:
 <dependency>
     <groupId>com.johnsnowlabs.nlp</groupId>
     <artifactId>spark-nlp-silicon_2.12</artifactId>
-    <version>5.2.1</version>
+    <version>5.2.3</version>
 </dependency>
 ```

@@ -542,28 +542,28 @@ coordinates:

 ```sbtshell
 // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.2.1"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.2.3"
 ```

 **spark-nlp-gpu:**

 ```sbtshell
 // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.2.1"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.2.3"
 ```

 **spark-nlp-aarch64:**

 ```sbtshell
 // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.2.1"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.2.3"
 ```

 **spark-nlp-silicon:**

 ```sbtshell
 // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.2.1"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.2.3"
 ```

 Maven
@@ -585,7 +585,7 @@ If you installed pyspark through pip/conda, you can install `spark-nlp` through
 Pip:

 ```bash
-pip install spark-nlp==5.2.1
+pip install spark-nlp==5.2.3
 ```

 Conda:
@@ -614,7 +614,7 @@ spark = SparkSession.builder
     .config("spark.driver.memory", "16G")
     .config("spark.driver.maxResultSize", "0")
     .config("spark.kryoserializer.buffer.max", "2000M")
-    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1")
+    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3")
     .getOrCreate()
 ```
@@ -685,7 +685,7 @@ Use either one of the following options
 - Add the following Maven Coordinates to the interpreter's library list

 ```bash
-com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
+com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3
 ```

 - Add a path to pre-built jar from [here](#compiled-jars) in the interpreter's library list making sure the jar is

@@ -696,7 +696,7 @@ com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
 Apart from the previous step, install the python module through pip

 ```bash
-pip install spark-nlp==5.2.1
+pip install spark-nlp==5.2.3
 ```

 Or you can install `spark-nlp` from inside Zeppelin by using Conda:
@@ -724,7 +724,7 @@ launch the Jupyter from the same Python environment:
 $ conda create -n sparknlp python=3.8 -y
 $ conda activate sparknlp
 # spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==5.2.1 pyspark==3.3.1 jupyter
+$ pip install spark-nlp==5.2.3 pyspark==3.3.1 jupyter
 $ jupyter notebook
 ```

@@ -741,7 +741,7 @@ export PYSPARK_PYTHON=python3
 export PYSPARK_DRIVER_PYTHON=jupyter
 export PYSPARK_DRIVER_PYTHON_OPTS=notebook

-pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3
 ```

 Alternatively, you can mix in using `--jars` option for pyspark + `pip install spark-nlp`
@@ -768,7 +768,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
 # -s is for spark-nlp
 # -g will enable upgrading libcudnn8 to 8.1.0 on Google Colab for GPU usage
 # by default they are set to the latest
-!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.2.1
+!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.2.3
 ```

 [Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb)

@@ -791,7 +791,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
 # -s is for spark-nlp
 # -g will enable upgrading libcudnn8 to 8.1.0 on Kaggle for GPU usage
 # by default they are set to the latest
-!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.2.1
+!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.2.3
 ```

 [Spark NLP quick start on Kaggle Kernel](https://www.kaggle.com/mozzie/spark-nlp-named-entity-recognition) is a live
@@ -810,9 +810,9 @@ demo on Kaggle Kernel that performs named entity recognitions by using Spark NLP

 3. In `Libraries` tab inside your cluster you need to follow these steps:

-    3.1. Install New -> PyPI -> `spark-nlp==5.2.1` -> Install
+    3.1. Install New -> PyPI -> `spark-nlp==5.2.3` -> Install

-    3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1` -> Install
+    3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3` -> Install

 4. Now you can attach your notebook to the cluster and use Spark NLP!
|
|
|
863
863
|
"spark.kryoserializer.buffer.max": "2000M",
|
|
864
864
|
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
|
|
865
865
|
"spark.driver.maxResultSize": "0",
|
|
866
|
-
"spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.
|
|
866
|
+
"spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3"
|
|
867
867
|
}
|
|
868
868
|
}]
|
|
869
869
|
```
|
|
@@ -872,7 +872,7 @@ A sample of AWS CLI to launch EMR cluster:
|
|
|
872
872
|
|
|
873
873
|
```.sh
|
|
874
874
|
aws emr create-cluster \
|
|
875
|
-
--name "Spark NLP 5.2.
|
|
875
|
+
--name "Spark NLP 5.2.3" \
|
|
876
876
|
--release-label emr-6.2.0 \
|
|
877
877
|
--applications Name=Hadoop Name=Spark Name=Hive \
|
|
878
878
|
--instance-type m4.4xlarge \
|
|
@@ -936,7 +936,7 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \
|
|
|
936
936
|
--enable-component-gateway \
|
|
937
937
|
--metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \
|
|
938
938
|
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \
|
|
939
|
-
--properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.
|
|
939
|
+
--properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3
|
|
940
940
|
```
|
|
941
941
|
|
|
942
942
|
2. On an existing one, you need to install spark-nlp and spark-nlp-display packages from PyPI.
|
|
@@ -947,16 +947,20 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \

 You can change the following Spark NLP configurations via Spark Configuration:

-| Property Name
-|
-| `spark.jsl.settings.pretrained.cache_folder`
-| `spark.jsl.settings.storage.cluster_tmp_dir`
-| `spark.jsl.settings.annotator.log_folder`
-| `spark.jsl.settings.aws.credentials.access_key_id`
-| `spark.jsl.settings.aws.credentials.secret_access_key`
-| `spark.jsl.settings.aws.credentials.session_token`
-| `spark.jsl.settings.aws.s3_bucket`
-| `spark.jsl.settings.aws.region`
+| Property Name | Default | Meaning |
+|---|---|---|
+| `spark.jsl.settings.pretrained.cache_folder` | `~/cache_pretrained` | The location to download and extract pretrained `Models` and `Pipelines`. By default, it will be in User's Home directory under `cache_pretrained` directory |
+| `spark.jsl.settings.storage.cluster_tmp_dir` | `hadoop.tmp.dir` | The location to use on a cluster for temporarily files such as unpacking indexes for WordEmbeddings. By default, this locations is the location of `hadoop.tmp.dir` set via Hadoop configuration for Apache Spark. NOTE: `S3` is not supported and it must be local, HDFS, or DBFS |
+| `spark.jsl.settings.annotator.log_folder` | `~/annotator_logs` | The location to save logs from annotators during training such as `NerDLApproach`, `ClassifierDLApproach`, `SentimentDLApproach`, `MultiClassifierDLApproach`, etc. By default, it will be in User's Home directory under `annotator_logs` directory |
+| `spark.jsl.settings.aws.credentials.access_key_id` | `None` | Your AWS access key to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` |
+| `spark.jsl.settings.aws.credentials.secret_access_key` | `None` | Your AWS secret access key to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` |
+| `spark.jsl.settings.aws.credentials.session_token` | `None` | Your AWS MFA session token to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` |
+| `spark.jsl.settings.aws.s3_bucket` | `None` | Your AWS S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` |
+| `spark.jsl.settings.aws.region` | `None` | Your AWS region to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` |
+| `spark.jsl.settings.onnx.gpuDeviceId` | `0` | Constructs CUDA execution provider options for the specified non-negative device id. |
+| `spark.jsl.settings.onnx.intraOpNumThreads` | `6` | Sets the size of the CPU thread pool used for executing a single graph, if executing on a CPU. |
+| `spark.jsl.settings.onnx.optimizationLevel` | `ALL_OPT` | Sets the optimization level of this options object, overriding the old setting. |
+| `spark.jsl.settings.onnx.executionMode` | `SEQUENTIAL` | Sets the execution mode of this options object, overriding the old setting. |

 ### How to set Spark NLP Configuration
@@ -975,7 +979,7 @@ spark = SparkSession.builder
     .config("spark.kryoserializer.buffer.max", "2000m")
     .config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained")
     .config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage")
-    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1")
+    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3")
     .getOrCreate()
 ```
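The same builder pattern also accepts the new `spark.jsl.settings.onnx.*` keys added to the table above. A minimal sketch; the app name and `local[*]` master are illustrative, and the values mirror the documented defaults:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Spark NLP with ONNX settings")  # illustrative
    .master("local[*]")                       # illustrative
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3")
    .config("spark.jsl.settings.onnx.gpuDeviceId", "0")        # table default
    .config("spark.jsl.settings.onnx.intraOpNumThreads", "6")  # table default
    .config("spark.jsl.settings.onnx.optimizationLevel", "ALL_OPT")
    .config("spark.jsl.settings.onnx.executionMode", "SEQUENTIAL")
    .getOrCreate()
)
```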
@@ -989,7 +993,7 @@ spark-shell \
   --conf spark.kryoserializer.buffer.max=2000M \
   --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
   --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
-  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
+  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3
 ```

 **pyspark:**

@@ -1002,7 +1006,7 @@ pyspark \
   --conf spark.kryoserializer.buffer.max=2000M \
   --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
   --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
-  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
+  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3
 ```

 **Databricks:**
@@ -1274,7 +1278,7 @@ spark = SparkSession.builder
     .config("spark.driver.memory", "16G")
     .config("spark.driver.maxResultSize", "0")
     .config("spark.kryoserializer.buffer.max", "2000M")
-    .config("spark.jars", "/tmp/spark-nlp-assembly-5.2.1.jar")
+    .config("spark.jars", "/tmp/spark-nlp-assembly-5.2.3.jar")
     .getOrCreate()
 ```

@@ -1283,7 +1287,7 @@ spark = SparkSession.builder
 version (3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x)
 - If you are local, you can load the Fat JAR from your local FileSystem, however, if you are in a cluster setup you need
   to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. (
-  i.e., `hdfs:///tmp/spark-nlp-assembly-5.2.1.jar`)
+  i.e., `hdfs:///tmp/spark-nlp-assembly-5.2.3.jar`)

 Example of using pretrained Models and Pipelines in offline:
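The README's offline example follows that line. As a hedged sketch of the pattern (the local path and model folder are illustrative; any pretrained annotator loads the same way through Spark ML's `load`):

```python
from sparknlp.annotator import WordEmbeddingsModel

# Load a model that was downloaded ahead of time and copied onto the cluster's
# filesystem, so no call to the Models Hub happens at runtime
glove = WordEmbeddingsModel.load("/tmp/glove_100d_en/") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")
```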
{spark_nlp-5.2.1.dist-info → spark_nlp-5.2.3.dist-info}/RECORD CHANGED

@@ -1,7 +1,7 @@
 com/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
 com/johnsnowlabs/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
 com/johnsnowlabs/nlp/__init__.py,sha256=DPIVXtONO5xXyOk-HB0-sNiHAcco17NN13zPS_6Uw8c,294
-sparknlp/__init__.py,sha256=
+sparknlp/__init__.py,sha256=qzDxFYDRyF2Jw1kVlbunQjoL6qtiJ5EA9td1vsm1J5w,13588
 sparknlp/annotation.py,sha256=I5zOxG5vV2RfPZfqN9enT1i4mo6oBcn3Lrzs37QiOiA,5635
 sparknlp/annotation_audio.py,sha256=iRV_InSVhgvAwSRe9NTbUH9v6OGvTM-FPCpSAKVu0mE,1917
 sparknlp/annotation_image.py,sha256=xhCe8Ko-77XqWVuuYHFrjKqF6zPd8Z-RY_rmZXNwCXU,2547

@@ -75,11 +75,11 @@ sparknlp/annotator/cv/vit_for_image_classification.py,sha256=D2V3pxAd3rBi1817lxV
 sparknlp/annotator/dependency/__init__.py,sha256=eV43oXAGaYl2N1XKIEAAZJLNP8gpHm8VxuXDeDlQzR4,774
 sparknlp/annotator/dependency/dependency_parser.py,sha256=SxyvHPp8Hs1Xnm5X1nLTMi095XoQMtfL8pbys15mYAI,11212
 sparknlp/annotator/dependency/typed_dependency_parser.py,sha256=60vPdYkbFk9MPGegg3m9Uik9cMXpMZd8tBvXG39gNww,12456
-sparknlp/annotator/embeddings/__init__.py,sha256=
+sparknlp/annotator/embeddings/__init__.py,sha256=od9aVMywyLf0KYBueoTeUjFbbCnh4UIuIGbsXwGtOAQ,2097
 sparknlp/annotator/embeddings/albert_embeddings.py,sha256=6Rd1LIn8oFIpq_ALcJh-RUjPEO7Ht8wsHY6JHSFyMkw,9995
 sparknlp/annotator/embeddings/bert_embeddings.py,sha256=uExpIlJNkQpuoZ3J_Zc2b2dV0hDNCRCAujNR4Lckly4,8369
 sparknlp/annotator/embeddings/bert_sentence_embeddings.py,sha256=XHls9qOkurwg9o6nDuwk77KSMNJmv1n4L5pcU22alWA,9054
-sparknlp/annotator/embeddings/bge_embeddings.py,sha256=
+sparknlp/annotator/embeddings/bge_embeddings.py,sha256=FNmYxcynM1iLJvg5ZNmrZKkyIF0Gtr7G-CgZ72mrVyU,7842
 sparknlp/annotator/embeddings/camembert_embeddings.py,sha256=dBTXas-2Tas_JUR9Xt_GtHLcyqi_cdvT5EHRnyVrSSQ,8817
 sparknlp/annotator/embeddings/chunk_embeddings.py,sha256=WUmkJimSuFkdcLJnvcxOV0QlCLgGlhub29ZTrZb70WE,6052
 sparknlp/annotator/embeddings/deberta_embeddings.py,sha256=_b5nzLb7heFQNN-uT2oBNO6-YmM8bHmAdnGXg47HOWw,8649

@@ -219,7 +219,7 @@ sparknlp/training/_tf_graph_builders_1x/ner_dl/dataset_encoder.py,sha256=R4yHFN3
 sparknlp/training/_tf_graph_builders_1x/ner_dl/ner_model.py,sha256=EoCSdcIjqQ3wv13MAuuWrKV8wyVBP0SbOEW41omHlR0,23189
 sparknlp/training/_tf_graph_builders_1x/ner_dl/ner_model_saver.py,sha256=k5CQ7gKV6HZbZMB8cKLUJuZxoZWlP_DFWdZ--aIDwsc,2356
 sparknlp/training/_tf_graph_builders_1x/ner_dl/sentence_grouper.py,sha256=pAxjWhjazSX8Vg0MFqJiuRVw1IbnQNSs-8Xp26L4nko,870
-spark_nlp-5.2.1.dist-info/METADATA,sha256=
-spark_nlp-5.2.1.dist-info/WHEEL,sha256=bb2Ot9scclHKMOLDEHY6B2sicWOgugjFKaJsT7vwMQo,110
-spark_nlp-5.2.1.dist-info/top_level.txt,sha256=uuytur4pyMRw2H_txNY2ZkaucZHUs22QF8-R03ch_-E,13
-spark_nlp-5.2.1.dist-info/RECORD,,
+spark_nlp-5.2.3.dist-info/METADATA,sha256=QXMxdjxt8d8HEmdpys1UOmdWUvb1KfIdwYhfQ8pnSU0,56589
+spark_nlp-5.2.3.dist-info/WHEEL,sha256=bb2Ot9scclHKMOLDEHY6B2sicWOgugjFKaJsT7vwMQo,110
+spark_nlp-5.2.3.dist-info/top_level.txt,sha256=uuytur4pyMRw2H_txNY2ZkaucZHUs22QF8-R03ch_-E,13
+spark_nlp-5.2.3.dist-info/RECORD,,
sparknlp/__init__.py CHANGED

@@ -128,7 +128,7 @@ def start(gpu=False,
         The initiated Spark session.

     """
-    current_version = "5.2.1"
+    current_version = "5.2.3"

     if params is None:
         params = {}

@@ -309,4 +309,4 @@ def version():
     str
         The current Spark NLP version.
     """
-    return '5.2.1'
+    return '5.2.3'
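Both bumped strings are user-visible, so a quick post-upgrade sanity check is possible. A sketch; note that `gpu=True`, visible in the `start(gpu=False, ...)` signature above, would resolve the `spark-nlp-gpu` artifact instead:

```python
import sparknlp

spark = sparknlp.start()  # start(gpu=True) pulls spark-nlp-gpu_2.12:5.2.3 instead
assert sparknlp.version() == "5.2.3"
```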
sparknlp/annotator/embeddings/__init__.py CHANGED

@@ -35,3 +35,4 @@ from sparknlp.annotator.embeddings.word_embeddings import *
 from sparknlp.annotator.embeddings.xlm_roberta_embeddings import *
 from sparknlp.annotator.embeddings.xlm_roberta_sentence_embeddings import *
 from sparknlp.annotator.embeddings.xlnet_embeddings import *
+from sparknlp.annotator.embeddings.bge_embeddings import *
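With the wildcard import added above, `BGEEmbeddings` becomes importable like the other embeddings annotators. A small sketch; the shorter `sparknlp.annotator` path assumes that package re-exports its `embeddings` subpackage, as released wheels do:

```python
# Enabled directly by the new import line
from sparknlp.annotator.embeddings import BGEEmbeddings

# Shorter path, via sparknlp.annotator's own wildcard re-exports
from sparknlp.annotator import BGEEmbeddings
```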
sparknlp/annotator/embeddings/bge_embeddings.py CHANGED

@@ -17,11 +17,11 @@ from sparknlp.common import *


 class BGEEmbeddings(AnnotatorModel,
-[five removed lines; their content was not captured in the extracted diff]
+                    HasEmbeddingsProperties,
+                    HasCaseSensitiveProperties,
+                    HasStorageRef,
+                    HasBatchedAnnotate,
+                    HasMaxSentenceLengthLimit):
     """Sentence embeddings using BGE.

     BGE, or BAAI General Embeddings, a model that can map any text to a low-dimensional dense
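The mixin list above is what gives the annotator its standard embeddings, storage-ref, batching, and max-sentence-length behavior. A hedged end-to-end sketch: `pretrained()` with no arguments is assumed to fetch the annotator's default English BGE model, the column names are illustrative, and the 512 cap mirrors typical BGE limits:

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import BGEEmbeddings
from pyspark.ml import Pipeline

spark = sparknlp.start()

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = BGEEmbeddings.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("embeddings") \
    .setMaxSentenceLength(512)  # from the HasMaxSentenceLengthLimit mixin above

pipeline = Pipeline(stages=[document_assembler, embeddings])
data = spark.createDataFrame([("BGE maps any text to a dense vector.",)], ["text"])
pipeline.fit(data).transform(data).select("embeddings.embeddings").show()
```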
@@ -125,7 +125,6 @@ class BGEEmbeddings(AnnotatorModel,
         "ConfigProto from tensorflow, serialized into byte array. Get with config_proto.SerializeToString()",
         TypeConverters.toListInt)

-
     def setConfigProtoBytes(self, b):
         """Sets configProto from tensorflow, serialized into byte array.


{spark_nlp-5.2.1.dist-info → spark_nlp-5.2.3.dist-info}/WHEEL
File without changes

{spark_nlp-5.2.1.dist-info → spark_nlp-5.2.3.dist-info}/top_level.txt
File without changes