PyPI - spark-nlp - Versions diffs - 6.2.2__tar.gz → 6.2.2.dev2__tar.gz - Mend

spark-nlp 6.2.2 tar.gz → 6.2.2.dev2 tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (296)

{spark_nlp-6.2.2 → spark_nlp-6.2.2.dev2}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: spark-nlp
-Version: 6.2.2
+Version: 6.2.2.dev2
 Summary: John Snow Labs Spark NLP is a natural language processing library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines, that scale easily in a distributed environment.
 Home-page: https://github.com/JohnSnowLabs/spark-nlp
 Author: John Snow Labs
@@ -102,7 +102,7 @@ $ java -version
 $ conda create -n sparknlp python=3.7 -y
 $ conda activate sparknlp
 # spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==6.2.2 pyspark==3.3.1
+$ pip install spark-nlp==6.2.0 pyspark==3.3.1
 ```
 In Python console or Jupyter `Python3` kernel:
@@ -168,7 +168,7 @@ For a quick example of using pipelines and models take a look at our official [d
 ### Apache Spark Support
-Spark NLP *6.2.2* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
+Spark NLP *6.2.0* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
 | Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x |
 |-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
@@ -198,7 +198,7 @@ Find out more about 4.x `SparkNLP` versions in our official [documentation](http
 ### Databricks Support
-Spark NLP 6.2.2 has been tested and is compatible with the following runtimes:
+Spark NLP 6.2.0 has been tested and is compatible with the following runtimes:
 | **CPU**            | **GPU**            |
 |--------------------|--------------------|
@@ -216,7 +216,7 @@ We are compatible with older runtimes. For a full list check databricks support
 ### EMR Support
-Spark NLP 6.2.2 has been tested and is compatible with the following EMR releases:
+Spark NLP 6.2.0 has been tested and is compatible with the following EMR releases:
 | **EMR Release**    |
 |--------------------|

{spark_nlp-6.2.2 → spark_nlp-6.2.2.dev2}/README.md RENAMED

@@ -63,7 +63,7 @@ $ java -version
 $ conda create -n sparknlp python=3.7 -y
 $ conda activate sparknlp
 # spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==6.2.2 pyspark==3.3.1
+$ pip install spark-nlp==6.2.0 pyspark==3.3.1
 ```
 In Python console or Jupyter `Python3` kernel:
@@ -129,7 +129,7 @@ For a quick example of using pipelines and models take a look at our official [d
 ### Apache Spark Support
-Spark NLP *6.2.2* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
+Spark NLP *6.2.0* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
 | Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x |
 |-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
@@ -159,7 +159,7 @@ Find out more about 4.x `SparkNLP` versions in our official [documentation](http
 ### Databricks Support
-Spark NLP 6.2.2 has been tested and is compatible with the following runtimes:
+Spark NLP 6.2.0 has been tested and is compatible with the following runtimes:
 | **CPU**            | **GPU**            |
 |--------------------|--------------------|
@@ -177,7 +177,7 @@ We are compatible with older runtimes. For a full list check databricks support
 ### EMR Support
-Spark NLP 6.2.2 has been tested and is compatible with the following EMR releases:
+Spark NLP 6.2.0 has been tested and is compatible with the following EMR releases:
 | **EMR Release**    |
 |--------------------|

{spark_nlp-6.2.2 → spark_nlp-6.2.2.dev2}/setup.py RENAMED

@@ -41,7 +41,7 @@ setup(
     # project code, see
     # https://packaging.python.org/en/latest/single_source_version.html
-    version='6.2.2',  # Required
+    version='6.2.2dev2',  # Required
     # This is a one-line description or tagline of what your project does. This
     # corresponds to the 'Summary' metadata field:
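
The version strings in this diff are spelled three ways: PKG-INFO records `6.2.2.dev2`, setup.py passes `6.2.2dev2`, and `sparknlp/__init__.py` sets `__version__ = "6.2.2-dev2"`. PEP 440 treats the `.`, `-`, `_`, and no-separator spellings of a dev release as equivalent and normalizes them to the `.devN` form. The sketch below illustrates only that one rule; the helper name `normalize_dev_version` is hypothetical and it is not a full PEP 440 parser.

```python
import re

# Matches "<release>[sep]dev[N]" where sep is ".", "-", "_" or absent.
# Only the dev-release part of PEP 440 is handled here (assumption: no
# epochs, pre-releases, post-releases, or local version labels).
_DEV_VERSION = re.compile(
    r"^(?P<release>\d+(?:\.\d+)*)[._-]?dev(?P<number>\d*)$",
    re.IGNORECASE,
)


def normalize_dev_version(raw: str) -> str:
    """Return the '.devN' spelling of a dev release, or the input unchanged."""
    text = raw.strip()
    match = _DEV_VERSION.match(text)
    if match is None:
        # Not a dev release in the simplified form this sketch understands.
        return text
    # PEP 440: a missing dev number is implicitly 0.
    number = int(match.group("number") or "0")
    return f"{match.group('release')}.dev{number}"


if __name__ == "__main__":
    for spelling in ("6.2.2.dev2", "6.2.2dev2", "6.2.2-dev2", "6.2.2"):
        print(spelling, "->", normalize_dev_version(spelling))
```

Tools that compare versions by PEP 440 rules should see the three dev spellings as the same release, while a plain string comparison against `__version__` would not.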

{spark_nlp-6.2.2 → spark_nlp-6.2.2.dev2}/spark_nlp.egg-info/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: spark-nlp
-Version: 6.2.2
+Version: 6.2.2.dev2
 Summary: John Snow Labs Spark NLP is a natural language processing library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines, that scale easily in a distributed environment.
 Home-page: https://github.com/JohnSnowLabs/spark-nlp
 Author: John Snow Labs
@@ -102,7 +102,7 @@ $ java -version
 $ conda create -n sparknlp python=3.7 -y
 $ conda activate sparknlp
 # spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==6.2.2 pyspark==3.3.1
+$ pip install spark-nlp==6.2.0 pyspark==3.3.1
 ```
 In Python console or Jupyter `Python3` kernel:
@@ -168,7 +168,7 @@ For a quick example of using pipelines and models take a look at our official [d
 ### Apache Spark Support
-Spark NLP *6.2.2* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
+Spark NLP *6.2.0* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
 | Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x |
 |-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
@@ -198,7 +198,7 @@ Find out more about 4.x `SparkNLP` versions in our official [documentation](http
 ### Databricks Support
-Spark NLP 6.2.2 has been tested and is compatible with the following runtimes:
+Spark NLP 6.2.0 has been tested and is compatible with the following runtimes:
 | **CPU**            | **GPU**            |
 |--------------------|--------------------|
@@ -216,7 +216,7 @@ We are compatible with older runtimes. For a full list check databricks support
 ### EMR Support
-Spark NLP 6.2.2 has been tested and is compatible with the following EMR releases:
+Spark NLP 6.2.0 has been tested and is compatible with the following EMR releases:
 | **EMR Release**    |
 |--------------------|

{spark_nlp-6.2.2 → spark_nlp-6.2.2.dev2}/sparknlp/__init__.py RENAMED

@@ -66,7 +66,7 @@ sys.modules['com.johnsnowlabs.ml.ai'] = annotator
 annotators = annotator
 embeddings = annotator
-__version__ = "6.2.2"
+__version__ = "6.2.2-dev2"
 def start(gpu=False,
@@ -78,7 +78,8 @@ def start(gpu=False,
           cluster_tmp_dir="",
           params=None,
           real_time_output=False,
-          output_level=1):
+          output_level=1,
+          scala213=False):
     """Starts a PySpark instance with default parameters for Spark NLP.
     The default parameters would result in the equivalent of:
@@ -122,6 +123,8 @@ def start(gpu=False,
         Whether to read and print JVM output in real time, by default False
     output_level : int, optional
         Output level for logs, by default 1
+    scala213 : bool, optional
+        Whether to use Scala 2.13 build of Spark NLP, by default False (Scala 2.12)
     Notes
     -----
@@ -159,12 +162,13 @@ def start(gpu=False,
             self.serializer, self.serializer_max_buffer = "org.apache.spark.serializer.KryoSerializer", "2000M"
             self.driver_max_result_size = "0"
             # Spark NLP on CPU or GPU
-            self.maven_spark3 = "com.johnsnowlabs.nlp:spark-nlp_2.12:{}".format(current_version)
-            self.maven_gpu_spark3 = "com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:{}".format(current_version)
+            scala_version = "2.13" if scala213 else "2.12"
+            self.maven_spark3 = f"com.johnsnowlabs.nlp:spark-nlp_{scala_version}:{current_version}"
+            self.maven_gpu_spark3 = f"com.johnsnowlabs.nlp:spark-nlp-gpu_{scala_version}:{current_version}"
             # Spark NLP on Apple Silicon
-            self.maven_silicon = "com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:{}".format(current_version)
+            self.maven_silicon = f"com.johnsnowlabs.nlp:spark-nlp-silicon_{scala_version}:{current_version}"
             # Spark NLP on Linux Aarch64
-            self.maven_aarch64 = "com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:{}".format(current_version)
+            self.maven_aarch64 = f"com.johnsnowlabs.nlp:spark-nlp-aarch64_{scala_version}:{current_version}"
     def start_without_realtime_output():
         builder = SparkSession.builder \
@@ -318,4 +322,5 @@ def version():
     str
         The current Spark NLP version.
     """
     return __version__
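
The `scala213` flag added to `start()` in this file only changes the Scala suffix in the Maven coordinates. A standalone sketch of that selection logic follows; the function name `maven_coordinates` and the dictionary keys are hypothetical, and the coordinate templates are copied from the added lines of the hunk above.

```python
def maven_coordinates(current_version: str, scala213: bool = False) -> dict:
    """Build Spark NLP Maven coordinates for the chosen Scala build.

    Mirrors the f-strings in the diff: Scala 2.12 is the default and
    Scala 2.13 is used only when scala213 is True.
    """
    scala_version = "2.13" if scala213 else "2.12"
    group = "com.johnsnowlabs.nlp"
    # Artifact base names as they appear in the diff (CPU, GPU,
    # Apple Silicon, Linux Aarch64).
    artifacts = {
        "cpu": "spark-nlp",
        "gpu": "spark-nlp-gpu",
        "silicon": "spark-nlp-silicon",
        "aarch64": "spark-nlp-aarch64",
    }
    return {
        target: f"{group}:{artifact}_{scala_version}:{current_version}"
        for target, artifact in artifacts.items()
    }


if __name__ == "__main__":
    # The version string here is only an illustrative input.
    for target, coordinate in maven_coordinates("6.2.2", scala213=True).items():
        print(target, coordinate)
```

Because the default stays `False`, existing callers that omit the argument keep resolving the `_2.12` artifacts.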

{spark_nlp-6.2.2 → spark_nlp-6.2.2.dev2}/sparknlp/annotator/ner/ner_dl.py RENAMED

@@ -41,11 +41,6 @@ class NerDLApproach(AnnotatorApproach, NerApproach, EvaluationDLParams):
     - a WordEmbeddingsModel (any embeddings can be chosen, e.g. BertEmbeddings
       for BERT based embeddings).
-    By default, collects all data points into memory for training. For larger datasets, use
-    ``setEnableMemoryOptimizer(true)``. This will optimize memory usage during training at the cost
-    of speed. Note that this annotator will use as much memory as the largest partition of the
-    input dataset, so we recommend repartitioning to batch sizes.
     Setting a test dataset to monitor model metrics can be done with
     ``.setTestDataset``. The method expects a path to a parquet file containing a
     dataframe that has the same required columns as the training dataframe. The

{spark_nlp-6.2.2 → spark_nlp-6.2.2.dev2}/sparknlp/annotator/ner/ner_dl_graph_checker.py RENAMED

@@ -13,10 +13,10 @@
 #  limitations under the License.
 """Contains classes for NerDL."""
-from pyspark.ml.util import JavaMLReadable
-import sparknlp.internal as _internal
 from sparknlp.common import *
+import sparknlp.internal as _internal
+from pyspark.ml.util import JavaMLWritable
+from pyspark.ml.wrapper import JavaEstimator
 class NerDLGraphChecker(
@@ -28,9 +28,6 @@ class NerDLGraphChecker(
     computations/training is done. This annotator is useful for custom training cases, where
     specialized graphs are needed.
-    This annotator will fill graph hyperparameters as metadata in the label column, which will be
-    available for NerDLApproach, saving computations.
     Important: This annotator should be used or positioned before any embedding or NerDLApproach
     annotators in the pipeline and will process the whole dataset to extract the required graph parameters.
@@ -205,18 +202,17 @@ class NerDLGraphChecker(
         # self._setDefault()
     def _create_model(self, java_model):
-        return NerDLGraphCheckerModel(java_model=java_model)
+        return NerDLGraphCheckerModel()
 class NerDLGraphCheckerModel(
     JavaModel,
     JavaMLWritable,
-    JavaMLReadable,
     _internal.ParamsGettersSetters,
 ):
-    """Resulting model from `NerDLGraphChecker`, that updates dataframe metadata (label column)
-    with NerDLGraph parameters. It does not perform any actual data transformations, as the
-    checks/computations are done during the `fit` phase.
+    """
+    Resulting model from NerDLGraphChecker, that does not perform any transformations, as the
+    checks are done during the ``fit`` phase. It acts as the identity.
     This annotator should never be used directly.
     """
@@ -228,66 +224,14 @@ class NerDLGraphCheckerModel(
     @keyword_only
     def __init__(
-            self,
-            classname="com.johnsnowlabs.nlp.annotators.ner.dl.NerDLGraphCheckerModel",
-            java_model=None,
+        self,
+        classname="com.johnsnowlabs.nlp.annotators.ner.dl.NerDLGraphCheckerModel",
+        java_model=None,
     ):
-        # Custom init, different from AnnotatorModel
-        # We don't have a output annotation column, so we inherit directly from JavaModel
+        super(NerDLGraphCheckerModel, self).__init__(java_model=java_model)
+        if classname and not java_model:
+            self.__class__._java_class_name = classname
+            self._java_obj = self._new_java_obj(classname, self.uid)
         if java_model is not None:
-            super(NerDLGraphCheckerModel, self).__init__(java_model=java_model)
-            self._java_obj = java_model
             self._transfer_params_from_java()
-        elif classname:
-            super(NerDLGraphCheckerModel, self).__init__()
-            self.__class__._java_class_name = classname
-            self._java_obj = self._new_java_obj(classname)
-    # Metadata keys for graph parameters
-    graphParamsMetadataKey = "NerDLGraphCheckerParams"
-    embeddingsDimKey = "embeddingsDim"
-    labelsKey = "labels"
-    charsKey = "chars"
-    dsLenKey = "dsLen"
-    labelColumn = Param(
-        Params._dummy(),
-        "labelColumn",
-        "Column with label per each token",
-        typeConverter=TypeConverters.toString,
-    )
-    embeddingsDim = Param(
-        Params._dummy(),
-        "embeddingsDim",
-        "Dimensionality of embeddings",
-        typeConverter=TypeConverters.toInt,
-    )
-    labels = Param(
-        Params._dummy(),
-        "labels",
-        "Labels in the dataset",
-        typeConverter=TypeConverters.toListString,
-    )
-    chars = Param(
-        Params._dummy(),
-        "chars",
-        "Set of characters in the dataset",
-        typeConverter=TypeConverters.toListString,
-    )
-    graphFolder = Param(
-        Params._dummy(),
-        "graphFolder",
-        "Folder path that contain external graph files",
-        typeConverter=TypeConverters.toString,
-    )
-    dsLen = Param(
-        Params._dummy(),
-        "dsLen",
-        "Length of the training dataset.",
-        typeConverter=TypeConverters.toInt,
-    )
+        # self._setDefault(lazyAnnotator=False)

{spark_nlp-6.2.2 → spark_nlp-6.2.2.dev2}/sparknlp/partition/partition_properties.py RENAMED

@@ -17,6 +17,7 @@ from pyspark.ml.param import Param, Params, TypeConverters
 class HasReaderProperties(Params):
     inputCol = Param(
         Params._dummy(),
         "inputCol",
@@ -244,8 +245,8 @@ class HasReaderProperties(Params):
         """
         return self._set(outputAsDocument=value)
 class HasEmailReaderProperties(Params):
     addAttachmentContent = Param(
         Params._dummy(),
         "addAttachmentContent",
@@ -277,6 +278,7 @@ class HasEmailReaderProperties(Params):
 class HasExcelReaderProperties(Params):
     cellSeparator = Param(
         Params._dummy(),
         "cellSeparator",
@@ -335,8 +337,8 @@ class HasExcelReaderProperties(Params):
         """
         return self.getOrDefault(self.appendCells)
 class HasHTMLReaderProperties(Params):
     timeout = Param(
         Params._dummy(),
         "timeout",
@@ -393,8 +395,8 @@ class HasHTMLReaderProperties(Params):
         """
         return self._set(outputFormat=value)
 class HasPowerPointProperties(Params):
     includeSlideNotes = Param(
         Params._dummy(),
         "includeSlideNotes",
@@ -424,8 +426,8 @@ class HasPowerPointProperties(Params):
         """
         return self.getOrDefault(self.includeSlideNotes)
 class HasTextReaderProperties(Params):
     titleLengthSize = Param(
         Params._dummy(),
         "titleLengthSize",
@@ -434,28 +436,9 @@ class HasTextReaderProperties(Params):
     )
     def setTitleLengthSize(self, value):
-        """Set the maximum character length used to identify title blocks.
-        Parameters
-        ----------
-        value : int
-            Maximum number of characters a text block can have to be considered a title.
-        Returns
-        -------
-        self
-            The instance with updated `titleLengthSize` parameter.
-        """
         return self._set(titleLengthSize=value)
     def getTitleLengthSize(self):
-        """Get the configured maximum title length.
-        Returns
-        -------
-        int
-            The maximum character length used to detect title blocks.
-        """
         return self.getOrDefault(self.titleLengthSize)
     groupBrokenParagraphs = Param(
@@ -466,28 +449,9 @@ class HasTextReaderProperties(Params):
     )
     def setGroupBrokenParagraphs(self, value):
-        """Enable or disable grouping of broken paragraphs.
-        Parameters
-        ----------
-        value : bool
-            True to merge fragmented lines into paragraphs, False to leave lines as-is.
-        Returns
-        -------
-        self
-            The instance with updated `groupBrokenParagraphs` parameter.
-        """
         return self._set(groupBrokenParagraphs=value)
     def getGroupBrokenParagraphs(self):
-        """Get whether broken paragraph grouping is enabled.
-        Returns
-        -------
-        bool
-            True if grouping of broken paragraphs is enabled, False otherwise.
-        """
         return self.getOrDefault(self.groupBrokenParagraphs)
     paragraphSplit = Param(
@@ -498,28 +462,9 @@ class HasTextReaderProperties(Params):
     )
     def setParagraphSplit(self, value):
-        """Set the regex pattern used to split paragraphs when grouping broken paragraphs.
-        Parameters
-        ----------
-        value : str
-            Regular expression string used to detect paragraph boundaries.
-        Returns
-        -------
-        self
-            The instance with updated `paragraphSplit` parameter.
-        """
         return self._set(paragraphSplit=value)
     def getParagraphSplit(self):
-        """Get the paragraph-splitting regex pattern.
-        Returns
-        -------
-        str
-            The regex pattern used to detect paragraph boundaries.
-        """
         return self.getOrDefault(self.paragraphSplit)
     shortLineWordThreshold = Param(
@@ -530,28 +475,9 @@ class HasTextReaderProperties(Params):
     )
     def setShortLineWordThreshold(self, value):
-        """Set the maximum word count for a line to be considered short.
-        Parameters
-        ----------
-        value : int
-            Number of words under which a line is considered 'short'.
-        Returns
-        -------
-        self
-            The instance with updated `shortLineWordThreshold` parameter.
-        """
         return self._set(shortLineWordThreshold=value)
     def getShortLineWordThreshold(self):
-        """Get the short line word threshold.
-        Returns
-        -------
-        int
-            Word count threshold for short lines used in paragraph grouping.
-        """
         return self.getOrDefault(self.shortLineWordThreshold)
     maxLineCount = Param(
@@ -562,28 +488,9 @@ class HasTextReaderProperties(Params):
     )
     def setMaxLineCount(self, value):
-        """Set the maximum number of lines to inspect when estimating paragraph layout.
-        Parameters
-        ----------
-        value : int
-            Maximum number of lines to evaluate for layout heuristics.
-        Returns
-        -------
-        self
-            The instance with updated `maxLineCount` parameter.
-        """
         return self._set(maxLineCount=value)
     def getMaxLineCount(self):
-        """Get the maximum number of lines used for layout heuristics.
-        Returns
-        -------
-        int
-            The configured maximum number of lines to consider.
-        """
         return self.getOrDefault(self.maxLineCount)
     threshold = Param(
@@ -594,58 +501,11 @@ class HasTextReaderProperties(Params):
     )
     def setThreshold(self, value):
-        """Set the empty-line ratio threshold for paragraph grouping decision.
-        Parameters
-        ----------
-        value : float
-            Ratio (0.0-1.0) of empty lines used to switch grouping strategies.
-        Returns
-        -------
-        self
-            The instance with updated `threshold` parameter.
-        """
         return self._set(threshold=value)
     def getThreshold(self):
-        """Get the configured empty-line threshold ratio.
-        Returns
-        -------
-        float
-            The ratio used to decide paragraph grouping strategy.
-        """
         return self.getOrDefault(self.threshold)
-    extractTagAttributes = Param(
-        Params._dummy(),
-        "extractTagAttributes",
-        "Extract attribute values into separate lines when parsing tag-based formats (e.g., HTML or XML).",
-        typeConverter=TypeConverters.toListString
-    )
-    def setExtractTagAttributes(self, attributes: list[str]):
-        """
-        Specify which tag attributes should have their values extracted as text when parsing
-        tag-based formats (e.g., HTML or XML).
-        :param attributes: list of attribute names to extract
-        :return: this instance with the updated `extractTagAttributes` parameter
-        """
-        return self._set(extractTagAttributes=attributes)
-    def getExtractTagAttributes(self):
-        """Get the list of tag attribute names configured to be extracted.
-        Returns
-        -------
-        list[str]
-            The attribute names whose values will be extracted as text.
-        """
-        return self.getOrDefault(self.extractTagAttributes)
 class HasChunkerProperties(Params):
     chunkingStrategy = Param(

{spark_nlp-6.2.2 → spark_nlp-6.2.2.dev2}/sparknlp/reader/reader2doc.py RENAMED

@@ -91,19 +91,6 @@ class Reader2Doc(
         """
         return self._set(excludeNonText=value)
-    joinString = Param(
-        Params._dummy(),
-        "joinString",
-        "If outputAsDocument is true, specifies the string used to join elements into a single document.",
-        typeConverter=TypeConverters.toString
-    )
-    def setJoinString(self, value):
-        """
-        If outputAsDocument is true, specifies the string used to join elements into a single
-        """
-        return self._set(joinString=value)
     @keyword_only
     def __init__(self):
         super(Reader2Doc, self).__init__(classname="com.johnsnowlabs.reader.Reader2Doc")
@@ -112,12 +99,8 @@ class Reader2Doc(
             explodeDocs=False,
             contentType="",
             flattenOutput=False,
-            outputAsDocument=True,
-            outputFormat="plain-text",
-            excludeNonText=False,
-            joinString="\n"
+            titleThreshold=18
         )
     @keyword_only
     def setParams(self):
         kwargs = self._input_kwargs

{spark_nlp-6.2.2 → spark_nlp-6.2.2.dev2}/sparknlp/reader/reader2table.py RENAMED

@@ -35,8 +35,7 @@ class Reader2Table(
     @keyword_only
     def __init__(self):
         super(Reader2Table, self).__init__(classname="com.johnsnowlabs.reader.Reader2Table")
-        self._setDefault(outputCol="document", outputFormat="json-table", inferTableStructure=True,
-                         outputAsDocument=False)
+        self._setDefault(outputCol="document")
     @keyword_only
     def setParams(self):