rapid-textrank 0.0.1__tar.gz → 0.1.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- rapid_textrank-0.1.1/.beads/issues.jsonl +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/.gitignore +3 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/Cargo.lock +1 -1
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/Cargo.toml +1 -1
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/PKG-INFO +38 -15
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/README.md +37 -14
- rapid_textrank-0.1.1/notebooks/01_quickstart.ipynb +482 -0
- rapid_textrank-0.1.1/notebooks/02_algorithm_variants.ipynb +710 -0
- rapid_textrank-0.1.1/notebooks/03_explain_algorithm.ipynb +725 -0
- rapid_textrank-0.1.1/notebooks/04_benchmarks.ipynb +981 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/pyproject.toml +1 -1
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/python/rapid_textrank/__init__.py +2 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/python/rapid_textrank/spacy_component.py +29 -2
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/src/graph/builder.rs +32 -15
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/src/nlp/stopwords.rs +35 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/src/nlp/tokenizer.rs +28 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/src/phrase/chunker.rs +11 -3
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/src/phrase/extraction.rs +75 -34
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/src/python/json.rs +71 -15
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/src/python/mod.rs +1 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/src/python/native.rs +35 -6
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/src/types.rs +70 -2
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/src/variants/biased_textrank.rs +32 -11
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/src/variants/position_rank.rs +2 -1
- rapid_textrank-0.0.1/.beads/issues.jsonl +0 -15
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/.beads/.gitignore +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/.beads/README.md +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/.beads/config.yaml +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/.beads/interactions.jsonl +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/.beads/metadata.json +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/.gitattributes +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/.github/workflows/CI.yml +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/.github/workflows/publish-pypi.yml +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/.github/workflows/publish-testpypi.yml +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/AGENTS.md +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/CLAUDE.md +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/LICENSE +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/benches/benchmark.rs +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/python/tests/test_api.py +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/src/errors.rs +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/src/graph/csr.rs +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/src/graph/mod.rs +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/src/lib.rs +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/src/nlp/mod.rs +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/src/pagerank/mod.rs +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/src/pagerank/personalized.rs +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/src/pagerank/standard.rs +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/src/phrase/dedup.rs +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/src/phrase/mod.rs +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/src/summarizer/mod.rs +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/src/summarizer/selector.rs +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/src/summarizer/unit_vector.rs +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/src/variants/mod.rs +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/src/variants/topic_rank.rs +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/tests/integration_tests.rs +0 -0
- {rapid_textrank-0.0.1 → rapid_textrank-0.1.1}/tests/property_tests.rs +0 -0
--- rapid_textrank-0.0.1/PKG-INFO
+++ rapid_textrank-0.1.1/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: rapid_textrank
-Version: 0.0.1
+Version: 0.1.1
 Classifier: Development Status :: 4 - Beta
 Classifier: Intended Audience :: Developers
 Classifier: Intended Audience :: Science/Research
@@ -41,16 +41,16 @@ Project-URL: Repository, https://github.com/xang1234/rapid-textrank
 
 **High-performance TextRank implementation in Rust with Python bindings.**
 
-Extract keywords and key phrases from text 10-100x faster than pure Python implementations, with support for multiple algorithm variants and 18 languages.
+Extract keywords and key phrases from text up to 10-100x faster than pure Python implementations (depending on document size and tokenization), with support for multiple algorithm variants and 18 languages.
 
 ## Features
 
-- **Fast**: 10-100x faster than pure Python implementations
+- **Fast**: Up to 10-100x faster than pure Python implementations (see benchmarks)
 - **Multiple algorithms**: TextRank, PositionRank, and BiasedTextRank variants
-- **Unicode-aware**: Proper handling of CJK
+- **Unicode-aware**: Proper handling of CJK and other scripts (emoji are ignored by the built-in tokenizer)
 - **Multi-language**: Stopword support for 18 languages
 - **Dual API**: Native Python classes + JSON interface for batch processing
-- **
+- **Rust core**: Computation happens in Rust (the Python GIL is currently held during extraction)
 
 ## Quick Start
 
@@ -91,7 +91,7 @@ TextRank is a graph-based ranking algorithm for keyword extraction, inspired by
 
 2. **Run PageRank**: The algorithm iteratively distributes "importance" through the graph. Words connected to many important words become important themselves.
 
-3. **Extract phrases**:
+3. **Extract phrases**: High-scoring words are grouped into noun chunks (POS-filtered) to form key phrases. Scores are aggregated (sum, mean, or max).
 
 ```
 Text: "Machine learning enables systems to learn from data"
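The line added in this hunk says phrase scores are aggregated from word scores ("sum, mean, or max", plus "rms" in the config below). A minimal sketch of what those modes mean, using a hypothetical helper rather than the crate's own code:

```python
import math

# Hypothetical helper illustrating the score_aggregation modes named in the
# README; a phrase's score is derived from its member words' PageRank scores.
def aggregate(word_scores: list[float], method: str = "sum") -> float:
    if method == "sum":
        return sum(word_scores)
    if method == "mean":
        return sum(word_scores) / len(word_scores)
    if method == "max":
        return max(word_scores)
    if method == "rms":
        return math.sqrt(sum(s * s for s in word_scores) / len(word_scores))
    raise ValueError(f"unknown aggregation method: {method}")

# e.g. a two-word phrase with word scores 0.12 and 0.09
print(aggregate([0.12, 0.09], "sum"))   # 0.21
print(aggregate([0.12, 0.09], "mean"))  # 0.105
```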
@@ -217,12 +217,16 @@ config = TextRankConfig(
     damping=0.85,              # PageRank damping factor (0-1)
     max_iterations=100,        # Maximum PageRank iterations
     convergence_threshold=1e-6,# Convergence threshold
-    window_size=
+    window_size=3,             # Co-occurrence window size
     top_n=10,                  # Number of results
     min_phrase_length=1,       # Minimum words in a phrase
     max_phrase_length=4,       # Maximum words in a phrase
     score_aggregation="sum",   # How to combine word scores: "sum", "mean", "max", "rms"
-    language="en"
+    language="en",             # Language for stopwords
+    include_pos=["NOUN","ADJ","PROPN","VERB"], # POS tags to include in the graph
+    use_pos_in_nodes=True,     # If True, graph nodes are lemma+POS
+    phrase_grouping="scrubbed_text", # "lemma" or "scrubbed_text"
+    stopwords=["custom", "terms"] # Additional stopwords (extends built-in list)
 )
 
 extractor = BaseTextRank(config=config)
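For orientation, an end-to-end sketch of how the configured extractor is used. `TextRankConfig` and `BaseTextRank` appear in the diff; the `extract` method name is an assumption, since the surrounding hunks only show construction and `result.as_tuples()`:

```python
from rapid_textrank import BaseTextRank, TextRankConfig

config = TextRankConfig(top_n=5, window_size=3, language="en")
extractor = BaseTextRank(config=config)

# `extract` is assumed here; the diff only shows that a result object with
# .as_tuples() -> [(text, score), ...] comes back.
result = extractor.extract("Machine learning enables systems to learn from data.")
for text, score in result.as_tuples():
    print(f"{text}: {score:.4f}")
```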
@@ -252,7 +256,7 @@ tuples = result.as_tuples() # [(text, score), ...]
 
 ### JSON Interface
 
-For processing large documents or integrating with spaCy, use the JSON interface. This accepts pre-tokenized data to avoid re-tokenizing in Rust.
+For processing large documents or integrating with spaCy, use the JSON interface. This accepts pre-tokenized data to avoid re-tokenizing in Rust. Stopword handling can use each token's `is_stopword` field and/or a `config.language` plus `config.stopwords` (additional words that extend the built-in list). Language codes follow the Supported Languages table below.
 
 ```python
 from rapid_textrank import extract_from_json, extract_batch_from_json
@@ -273,13 +277,13 @@ doc = {
         },
         # ... more tokens
     ],
-    "config": {"top_n": 10}
+    "config": {"top_n": 10, "language": "en", "stopwords": ["nlp", "transformers"]}
 }
 
 result_json = extract_from_json(json.dumps(doc))
 result = json.loads(result_json)
 
-# Batch processing (
+# Batch processing (Rust core; per-document processing is sequential)
 docs = [doc1, doc2, doc3]
 results_json = extract_batch_from_json(json.dumps(docs))
 results = json.loads(results_json)
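The hunk above elides the token schema ("# ... more tokens"). A fuller, runnable sketch follows; only `is_stopword` is named by the new README text, so the `tokens` key and the `text`/`lemma`/`pos` field names are assumptions for illustration:

```python
import json
from rapid_textrank import extract_from_json

doc = {
    # "tokens" and the per-token fields other than is_stopword are assumed.
    "tokens": [
        {"text": "Machine",  "lemma": "machine",  "pos": "NOUN", "is_stopword": False},
        {"text": "learning", "lemma": "learning", "pos": "NOUN", "is_stopword": False},
        {"text": "is",       "lemma": "be",       "pos": "AUX",  "is_stopword": True},
        {"text": "powerful", "lemma": "powerful", "pos": "ADJ",  "is_stopword": False},
    ],
    "config": {"top_n": 5, "language": "en"},
}

result = json.loads(extract_from_json(json.dumps(doc)))
print(result)
```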
@@ -287,7 +291,7 @@ results = json.loads(results_json)
 
 ## Supported Languages
 
-Stopword filtering is available for 18 languages:
+Stopword filtering is available for 18 languages. Use these codes for the `language` parameter in all APIs (including JSON config):
 
 | Code | Language | Code | Language | Code | Language |
 |------|----------|------|----------|------|----------|
@@ -298,6 +302,13 @@ Stopword filtering is available for 18 languages:
 | `hu` | Hungarian | `tr` | Turkish | `pl` | Polish |
 | `ar` | Arabic | `zh` | Chinese | `ja` | Japanese |
 
+You can inspect the built-in stopword list with:
+
+```python
+import rapid_textrank as rt
+rt.get_stopwords("en")
+```
+
 ## Performance
 
 rapid_textrank achieves significant speedups through Rust's performance characteristics and careful algorithm implementation.
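A short follow-on to the `get_stopwords` helper added above, pairing it with the new `stopwords` config field (which extends, not replaces, the built-in list). The membership check is illustrative:

```python
import rapid_textrank as rt
from rapid_textrank import BaseTextRank, TextRankConfig

builtin = rt.get_stopwords("en")      # built-in English stopword list
print(len(builtin), "is" in builtin)  # inspect size and membership

# Per the config comment above, these extend the built-in list for "en".
config = TextRankConfig(language="en", stopwords=["dataset", "baseline"])
extractor = BaseTextRank(config=config)
```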
@@ -474,11 +485,11 @@ The performance advantage comes from several factors:
 
 2. **String Interning**: Repeated words share a single allocation via `StringPool`, reducing memory usage 10-100x for typical documents.
 
-3. **Parallel Processing**: Rayon provides data parallelism
+3. **Parallel Processing**: Rayon provides data parallelism in internal graph construction without explicit thread management.
 
 4. **Link-Time Optimization (LTO)**: Release builds use full LTO with single codegen unit for maximum inlining.
 
-5. **
+5. **Rust core**: Most computation happens in Rust, minimizing Python-level overhead.
 
 6. **FxHash**: Fast non-cryptographic hashing for internal hash maps.
 
@@ -498,12 +509,24 @@ Import name is `rapid_textrank`.
 pip install rapid_textrank[spacy]
 ```
 
+```python
+import spacy
+import rapid_textrank.spacy_component  # registers the pipeline factory
+
+nlp = spacy.load("en_core_web_sm")
+nlp.add_pipe("rapid_textrank")
+
+doc = nlp("Machine learning is a subset of artificial intelligence.")
+for phrase in doc._.phrases[:5]:
+    print(f"{phrase.text}: {phrase.score:.4f}")
+```
+
 ### From Source
 
 Requirements: Rust 1.70+, Python 3.9+
 
 ```bash
-git clone https://github.com/
+git clone https://github.com/xang1234/rapid-textrank
 cd rapid_textrank
 pip install maturin
 maturin develop --release
--- rapid_textrank-0.0.1/README.md
+++ rapid_textrank-0.1.1/README.md
@@ -6,16 +6,16 @@
 
 **High-performance TextRank implementation in Rust with Python bindings.**
 
-Extract keywords and key phrases from text 10-100x faster than pure Python implementations, with support for multiple algorithm variants and 18 languages.
+Extract keywords and key phrases from text up to 10-100x faster than pure Python implementations (depending on document size and tokenization), with support for multiple algorithm variants and 18 languages.
 
 ## Features
 
-- **Fast**: 10-100x faster than pure Python implementations
+- **Fast**: Up to 10-100x faster than pure Python implementations (see benchmarks)
 - **Multiple algorithms**: TextRank, PositionRank, and BiasedTextRank variants
-- **Unicode-aware**: Proper handling of CJK
+- **Unicode-aware**: Proper handling of CJK and other scripts (emoji are ignored by the built-in tokenizer)
 - **Multi-language**: Stopword support for 18 languages
 - **Dual API**: Native Python classes + JSON interface for batch processing
-- **
+- **Rust core**: Computation happens in Rust (the Python GIL is currently held during extraction)
 
 ## Quick Start
 
@@ -56,7 +56,7 @@ TextRank is a graph-based ranking algorithm for keyword extraction, inspired by
 
 2. **Run PageRank**: The algorithm iteratively distributes "importance" through the graph. Words connected to many important words become important themselves.
 
-3. **Extract phrases**:
+3. **Extract phrases**: High-scoring words are grouped into noun chunks (POS-filtered) to form key phrases. Scores are aggregated (sum, mean, or max).
 
 ```
 Text: "Machine learning enables systems to learn from data"
@@ -182,12 +182,16 @@ config = TextRankConfig(
     damping=0.85,              # PageRank damping factor (0-1)
     max_iterations=100,        # Maximum PageRank iterations
     convergence_threshold=1e-6,# Convergence threshold
-    window_size=
+    window_size=3,             # Co-occurrence window size
     top_n=10,                  # Number of results
     min_phrase_length=1,       # Minimum words in a phrase
     max_phrase_length=4,       # Maximum words in a phrase
     score_aggregation="sum",   # How to combine word scores: "sum", "mean", "max", "rms"
-    language="en"
+    language="en",             # Language for stopwords
+    include_pos=["NOUN","ADJ","PROPN","VERB"], # POS tags to include in the graph
+    use_pos_in_nodes=True,     # If True, graph nodes are lemma+POS
+    phrase_grouping="scrubbed_text", # "lemma" or "scrubbed_text"
+    stopwords=["custom", "terms"] # Additional stopwords (extends built-in list)
 )
 
 extractor = BaseTextRank(config=config)
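To make the two POS-related options above concrete: a hedged sketch of what `include_pos` filtering and lemma+POS node keys mean. This mirrors the inline comments in the diff, not the crate's internal graph code:

```python
# Illustrative only: how include_pos and use_pos_in_nodes shape graph nodes.
tokens = [
    ("machine", "NOUN"), ("learning", "NOUN"), ("enable", "VERB"),
    ("system", "NOUN"), ("to", "PART"), ("learn", "VERB"), ("from", "ADP"),
    ("data", "NOUN"),
]
include_pos = {"NOUN", "ADJ", "PROPN", "VERB"}

# With use_pos_in_nodes=True a node key is lemma+POS, so "learn/VERB" and a
# hypothetical "learn/NOUN" would be distinct vertices in the co-occurrence graph.
nodes = {f"{lemma}/{pos}" for lemma, pos in tokens if pos in include_pos}
print(sorted(nodes))
# ['data/NOUN', 'enable/VERB', 'learn/VERB', 'learning/NOUN', 'machine/NOUN', 'system/NOUN']
```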
@@ -217,7 +221,7 @@ tuples = result.as_tuples() # [(text, score), ...]
 
 ### JSON Interface
 
-For processing large documents or integrating with spaCy, use the JSON interface. This accepts pre-tokenized data to avoid re-tokenizing in Rust.
+For processing large documents or integrating with spaCy, use the JSON interface. This accepts pre-tokenized data to avoid re-tokenizing in Rust. Stopword handling can use each token's `is_stopword` field and/or a `config.language` plus `config.stopwords` (additional words that extend the built-in list). Language codes follow the Supported Languages table below.
 
 ```python
 from rapid_textrank import extract_from_json, extract_batch_from_json
@@ -238,13 +242,13 @@ doc = {
         },
         # ... more tokens
     ],
-    "config": {"top_n": 10}
+    "config": {"top_n": 10, "language": "en", "stopwords": ["nlp", "transformers"]}
 }
 
 result_json = extract_from_json(json.dumps(doc))
 result = json.loads(result_json)
 
-# Batch processing (
+# Batch processing (Rust core; per-document processing is sequential)
 docs = [doc1, doc2, doc3]
 results_json = extract_batch_from_json(json.dumps(docs))
 results = json.loads(results_json)
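A runnable batch sketch to go with the clarified comment above. `extract_batch_from_json` comes from the diff; the token layout reuses the same assumed field names as the single-document example earlier:

```python
import json
from rapid_textrank import extract_batch_from_json

def make_doc(words):
    # Only is_stopword is named by the README; the other fields are assumed.
    return {
        "tokens": [
            {"text": w, "lemma": w.lower(), "pos": "NOUN", "is_stopword": False}
            for w in words
        ],
        "config": {"top_n": 3, "language": "en"},
    }

docs = [make_doc(["graph", "ranking"]), make_doc(["keyword", "extraction"])]
results = json.loads(extract_batch_from_json(json.dumps(docs)))
print(len(results))  # one result per input document, in order
```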
@@ -252,7 +256,7 @@ results = json.loads(results_json)
 
 ## Supported Languages
 
-Stopword filtering is available for 18 languages:
+Stopword filtering is available for 18 languages. Use these codes for the `language` parameter in all APIs (including JSON config):
 
 | Code | Language | Code | Language | Code | Language |
 |------|----------|------|----------|------|----------|
@@ -263,6 +267,13 @@ Stopword filtering is available for 18 languages:
 | `hu` | Hungarian | `tr` | Turkish | `pl` | Polish |
 | `ar` | Arabic | `zh` | Chinese | `ja` | Japanese |
 
+You can inspect the built-in stopword list with:
+
+```python
+import rapid_textrank as rt
+rt.get_stopwords("en")
+```
+
 ## Performance
 
 rapid_textrank achieves significant speedups through Rust's performance characteristics and careful algorithm implementation.
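Since this release softens the headline claim to "up to 10-100x", a quick way to measure throughput on your own corpus; the `extract` call is the same assumption as in the earlier sketches, and timings will vary by document size and hardware:

```python
import time
from rapid_textrank import BaseTextRank, TextRankConfig

text = "Machine learning enables systems to learn from data. " * 200
extractor = BaseTextRank(config=TextRankConfig(top_n=10, language="en"))

start = time.perf_counter()
runs = 50
for _ in range(runs):
    extractor.extract(text)  # assumed entry point, as above
elapsed = time.perf_counter() - start
print(f"{elapsed / runs * 1000:.2f} ms per document")
```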
@@ -439,11 +450,11 @@ The performance advantage comes from several factors:
 
 2. **String Interning**: Repeated words share a single allocation via `StringPool`, reducing memory usage 10-100x for typical documents.
 
-3. **Parallel Processing**: Rayon provides data parallelism
+3. **Parallel Processing**: Rayon provides data parallelism in internal graph construction without explicit thread management.
 
 4. **Link-Time Optimization (LTO)**: Release builds use full LTO with single codegen unit for maximum inlining.
 
-5. **
+5. **Rust core**: Most computation happens in Rust, minimizing Python-level overhead.
 
 6. **FxHash**: Fast non-cryptographic hashing for internal hash maps.
 
@@ -463,12 +474,24 @@ Import name is `rapid_textrank`.
 pip install rapid_textrank[spacy]
 ```
 
+```python
+import spacy
+import rapid_textrank.spacy_component  # registers the pipeline factory
+
+nlp = spacy.load("en_core_web_sm")
+nlp.add_pipe("rapid_textrank")
+
+doc = nlp("Machine learning is a subset of artificial intelligence.")
+for phrase in doc._.phrases[:5]:
+    print(f"{phrase.text}: {phrase.score:.4f}")
+```
+
 ### From Source
 
 Requirements: Rust 1.70+, Python 3.9+
 
 ```bash
-git clone https://github.com/
+git clone https://github.com/xang1234/rapid-textrank
 cd rapid_textrank
 pip install maturin
 maturin develop --release
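After a from-source build, a minimal smoke test; `as_tuples()` is documented in the README, while the `extract` method name remains an assumption carried over from the sketches above:

```python
# Quick check after `maturin develop --release`.
from rapid_textrank import BaseTextRank, TextRankConfig

extractor = BaseTextRank(config=TextRankConfig(top_n=3))
result = extractor.extract("Rust bindings make keyword extraction fast.")  # assumed
print(result.as_tuples())  # [(text, score), ...]
```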