PyPI - langextract - Versions diffs - 0.1.0__tar.gz - Mend

langextract 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (23)

langextract-0.1.0/LICENSE +202 -0
langextract-0.1.0/PKG-INFO +347 -0
langextract-0.1.0/README.md +310 -0
langextract-0.1.0/langextract/__init__.py +244 -0
langextract-0.1.0/langextract/annotation.py +542 -0
langextract-0.1.0/langextract/chunking.py +490 -0
langextract-0.1.0/langextract/data.py +236 -0
langextract-0.1.0/langextract/data_lib.py +123 -0
langextract-0.1.0/langextract/inference.py +441 -0
langextract-0.1.0/langextract/io.py +318 -0
langextract-0.1.0/langextract/progress.py +351 -0
langextract-0.1.0/langextract/prompting.py +165 -0
langextract-0.1.0/langextract/resolver.py +912 -0
langextract-0.1.0/langextract/schema.py +159 -0
langextract-0.1.0/langextract/tokenizer.py +357 -0
langextract-0.1.0/langextract/visualization.py +608 -0
langextract-0.1.0/langextract.egg-info/PKG-INFO +347 -0
langextract-0.1.0/langextract.egg-info/SOURCES.txt +21 -0
langextract-0.1.0/langextract.egg-info/dependency_links.txt +1 -0
langextract-0.1.0/langextract.egg-info/requires.txt +28 -0
langextract-0.1.0/langextract.egg-info/top_level.txt +1 -0
langextract-0.1.0/pyproject.toml +77 -0
langextract-0.1.0/setup.cfg +4 -0

langextract-0.1.0/LICENSE ADDED

@@ -0,0 +1,202 @@
+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+   1. Definitions.
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+   END OF TERMS AND CONDITIONS
+   APPENDIX: How to apply the Apache License to your work.
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "[]"
+      replaced with your own identifying information. (Don't include
+      the brackets!)  The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+   Copyright [yyyy] [name of copyright owner]
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+       http://www.apache.org/licenses/LICENSE-2.0
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.

langextract-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,347 @@
+Metadata-Version: 2.4
+Name: langextract
+Version: 0.1.0
+Summary: LangExtract: A library for extracting structured data from language models
+Author-email: Akshay Goel <goelak@google.com>
+License-Expression: Apache-2.0
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: absl-py>=1.0.0
+Requires-Dist: aiohttp>=3.8.0
+Requires-Dist: async_timeout>=4.0.0
+Requires-Dist: exceptiongroup>=1.1.0
+Requires-Dist: google-genai>=0.1.0
+Requires-Dist: langfun>=0.1.0
+Requires-Dist: ml-collections>=0.1.0
+Requires-Dist: more-itertools>=8.0.0
+Requires-Dist: numpy>=1.20.0
+Requires-Dist: openai>=0.27.0
+Requires-Dist: pandas>=1.3.0
+Requires-Dist: pydantic>=1.8.0
+Requires-Dist: python-dotenv>=0.19.0
+Requires-Dist: python-magic>=0.4.27
+Requires-Dist: requests>=2.25.0
+Requires-Dist: tqdm>=4.64.0
+Requires-Dist: typing-extensions>=4.0.0
+Provides-Extra: dev
+Requires-Dist: black>=23.7.0; extra == "dev"
+Requires-Dist: pylint>=2.17.5; extra == "dev"
+Requires-Dist: pytest>=7.4.0; extra == "dev"
+Requires-Dist: pytype>=2024.10.11; extra == "dev"
+Requires-Dist: tox>=4.0.0; extra == "dev"
+Provides-Extra: test
+Requires-Dist: pytest>=7.4.0; extra == "test"
+Requires-Dist: tomli>=2.0.0; extra == "test"
+Dynamic: license-file
+<p align="center">
+  <a href="https://github.com/google/langextract">
+    <img src="docs/_static/logo.svg" alt="LangExtract Logo" width="128" />
+  </a>
+</p>
+# LangExtract
+[![PyPI version](https://badge.fury.io/py/langextract.svg)](https://badge.fury.io/py/langextract)
+[![GitHub stars](https://img.shields.io/github/stars/google/langextract.svg?style=social&label=Star)](https://github.com/google/langextract)
+![Tests](https://github.com/google/langextract/actions/workflows/ci.yaml/badge.svg)
+## Table of Contents
+- [Introduction](#introduction)
+- [Why LangExtract?](#why-langextract)
+- [Quick Start](#quick-start)
+- [Installation](#installation)
+- [API Key Setup for Cloud Models](#api-key-setup-for-cloud-models)
+- [More Examples](#more-examples)
+  - [*Romeo and Juliet* Full Text Extraction](#romeo-and-juliet-full-text-extraction)
+  - [Medication Extraction](#medication-extraction)
+  - [Radiology Report Structuring: RadExtract](#radiology-report-structuring-radextract)
+- [Contributing](#contributing)
+- [Testing](#testing)
+- [Disclaimer](#disclaimer)
+## Introduction
+LangExtract is a Python library that uses LLMs to extract structured information from unstructured text documents based on user-defined instructions. It processes materials such as clinical notes or reports, identifying and organizing key details while ensuring the extracted data corresponds to the source text.
+## Why LangExtract?
+1.  **Precise Source Grounding:** Maps every extraction to its exact location in the source text, enabling visual highlighting for easy traceability and verification.
+2.  **Reliable Structured Outputs:** Enforces a consistent output schema based on your few-shot examples, leveraging controlled generation in supported models like Gemini to guarantee robust, structured results.
+3.  **Optimized for Long Documents:** Overcomes the "needle-in-a-haystack" challenge of large document extraction by using an optimized strategy of text chunking, parallel processing, and multiple passes for higher recall.
+4.  **Interactive Visualization:** Instantly generates a self-contained, interactive HTML file to visualize and review thousands of extracted entities in their original context.
+5.  **Flexible LLM Support:** Supports your preferred models, from cloud-based LLMs like the Google Gemini family to local open-source models via the built-in Ollama interface.
+6.  **Adaptable to Any Domain:** Define extraction tasks for any domain using just a few examples. LangExtract adapts to your needs without requiring any model fine-tuning.
+7.  **Leverages LLM World Knowledge:** Use precise prompt wording and few-shot examples to control how much the extraction task draws on the LLM's own knowledge. The accuracy of any inferred information, and how closely it follows the task specification, depends on the selected LLM, the complexity of the task, the clarity of the prompt instructions, and the nature of the prompt examples.
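The source-grounding idea in item 1 can be illustrated with a small standalone sketch. This is not LangExtract's alignment code: `CharSpan` and `ground_extraction` are hypothetical names, and plain exact substring matching is assumed purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CharSpan:
    """Half-open character interval [start, end) in the source text."""
    start: int
    end: int


def ground_extraction(source: str, extraction_text: str,
                      search_from: int = 0) -> Optional[CharSpan]:
    """Return the first exact match of extraction_text at or after search_from."""
    if not extraction_text:
        return None
    start = source.find(extraction_text, search_from)
    if start == -1:
        # Paraphrased or invented text has no location in the source.
        return None
    return CharSpan(start, start + len(extraction_text))


source = "ROMEO. But soft! What light through yonder window breaks?"
span = ground_extraction(source, "But soft!")
missing = ground_extraction(source, "gentle awe")
```

Slicing `source[span.start:span.end]` gives back the extracted text, which is what makes highlighting and verification possible. A paraphrase such as "gentle awe" returns `None`, which is one reason the Quick Start prompt asks for exact text.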
+## Quick Start
+> **Note:** Using cloud-hosted models like Gemini requires an API key. See the [API Key Setup](#api-key-setup-for-cloud-models) section for instructions on how to get and configure your key.
+Extract structured information with just a few lines of code.
+### 1. Define Your Extraction Task
+First, create a prompt that clearly describes what you want to extract. Then, provide a high-quality example to guide the model.
+```python
+import langextract as lx
+import textwrap
+# 1. Define the prompt and extraction rules
+prompt = textwrap.dedent("""\
+    Extract characters, emotions, and relationships in order of appearance.
+    Use exact text for extractions. Do not paraphrase or overlap entities.
+    Provide meaningful attributes for each entity to add context.""")
+# 2. Provide a high-quality example to guide the model
+examples = [
+    lx.data.ExampleData(
+        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
+        extractions=[
+            lx.data.Extraction(
+                extraction_class="character",
+                extraction_text="ROMEO",
+                attributes={"emotional_state": "wonder"}
+            ),
+            lx.data.Extraction(
+                extraction_class="emotion",
+                extraction_text="But soft!",
+                attributes={"feeling": "gentle awe"}
+            ),
+            lx.data.Extraction(
+                extraction_class="relationship",
+                extraction_text="Juliet is the sun",
+                attributes={"type": "metaphor"}
+            ),
+        ]
+    )
+]
+```
+### 2. Run the Extraction
+Provide your input text and the prompt materials to the `lx.extract` function.
+```python
+# The input text to be processed
+input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"
+# Run the extraction
+result = lx.extract(
+    text_or_documents=input_text,
+    prompt_description=prompt,
+    examples=examples,
+    model_id="gemini-2.5-flash",
+)
+```
+> **Model Selection**: `gemini-2.5-flash` is the recommended default, offering an excellent balance of speed, cost, and quality. For highly complex tasks requiring deeper reasoning, `gemini-2.5-pro` may provide superior results. For large-scale or production use, a Tier 2 Gemini quota is suggested to increase throughput and avoid rate limits. See the [rate-limit documentation](https://ai.google.dev/gemini-api/docs/rate-limits#tier-2) for details.
+>
+> **Model Lifecycle**: Note that Gemini models have a lifecycle with defined retirement dates. Users should consult the [official model version documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/model-versions) to stay informed about the latest stable and legacy versions.
+### 3. Visualize the Results
+The extractions can be saved to a `.jsonl` file, a popular format for working with language model data. LangExtract can then generate an interactive HTML visualization from this file to review the entities in context.
+```python
+# Save the results to a JSONL file
+lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl")
+# Generate the visualization from the file
+html_content = lx.visualize("extraction_results.jsonl")
+with open("visualization.html", "w", encoding="utf-8") as f:
+    f.write(html_content)
+```
+This creates an animated and interactive HTML file:
+![Romeo and Juliet Basic Visualization ](docs/_static/romeo_juliet_basic.gif)
+> **Note on LLM Knowledge Utilization:** This example demonstrates extractions that stay close to the text evidence - extracting "longing" for Lady Juliet's emotional state and identifying "yearning" from "gazed longingly at the stars." The task could be modified to generate attributes that draw more heavily from the LLM's world knowledge (e.g., adding `"identity": "Capulet family daughter"` or `"literary_context": "tragic heroine"`). The balance between text-evidence and knowledge-inference is controlled by your prompt instructions and example attributes.
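For readers unfamiliar with the `.jsonl` format used in step 3, the sketch below shows the general convention of one JSON object per line, using only the standard library. The record fields (`text`, `extractions`) and the helper names are hypothetical; they are not the schema that `lx.io.save_annotated_documents` writes.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    """Write each record as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical records, only to demonstrate the round trip.
records = [
    {"text": "Lady Juliet gazed longingly at the stars", "extractions": ["Lady Juliet"]},
    {"text": "her heart aching for Romeo", "extractions": ["Romeo"]},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.jsonl")
    write_jsonl(path, records)
    loaded = read_jsonl(path)
```

Because every line is an independent JSON document, large result sets can be appended to or streamed without loading the whole file.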
+### Scaling to Longer Documents
+For larger texts, you can process entire documents directly from URLs with parallel processing and enhanced sensitivity:
+```python
+# Process Romeo & Juliet directly from Project Gutenberg
+result = lx.extract(
+    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
+    prompt_description=prompt,
+    examples=examples,
+    model_id="gemini-2.5-flash",
+    extraction_passes=3,    # Improves recall through multiple passes
+    max_workers=20,         # Parallel processing for speed
+    max_char_buffer=1000    # Smaller contexts for better accuracy
+)
+```
+This approach can extract hundreds of entities from full novels while maintaining high accuracy. The interactive visualization seamlessly handles large result sets, making it easy to explore hundreds of entities from the output JSONL file. **[See the full *Romeo and Juliet* extraction example →](docs/examples/longer_text_example.md)** for detailed results and performance insights.
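To make the roles of `max_char_buffer` and `max_workers` more concrete, here is a minimal sketch of size-bounded chunking followed by parallel processing. It is an illustration under stated assumptions, not the library's chunking algorithm: `chunk_text` and `process_chunks` are hypothetical helpers, splitting happens on spaces only, and `len` stands in for a model call.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text, max_char_buffer):
    """Split text into pieces of at most max_char_buffer characters,
    breaking at a space when one is available inside the window."""
    if max_char_buffer <= 0:
        raise ValueError("max_char_buffer must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_char_buffer, len(text))
        if end < len(text):
            # Back up to the last space inside the window, if there is one.
            split = text.rfind(" ", start, end)
            if split > start:
                end = split + 1  # keep the space with the left-hand chunk
        chunks.append(text[start:end])
        start = end
    return chunks


def process_chunks(chunks, worker, max_workers=4):
    """Apply worker to every chunk in parallel, preserving chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, chunks))


text = "Two households, both alike in dignity, in fair Verona, where we lay our scene."
chunks = chunk_text(text, max_char_buffer=30)
lengths = process_chunks(chunks, len)  # len is a stand-in for a model call
```

Concatenating the chunks reproduces the input exactly, so character offsets found inside a chunk can be mapped back to the full document by adding the chunk's starting offset.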
+## Installation
+### From PyPI
+```bash
+pip install langextract
+```
+*Recommended for most users. For isolated environments, consider using a virtual environment:*
+```bash
+python -m venv langextract_env
+source langextract_env/bin/activate  # On Windows: langextract_env\Scripts\activate
+pip install langextract
+```
+### From Source
+LangExtract uses modern Python packaging with `pyproject.toml` for dependency management.
+*Installing with `-e` puts the package in development mode, allowing you to modify the code without reinstalling.*
+```bash
+git clone https://github.com/google/langextract.git
+cd langextract
+# For basic installation:
+pip install -e .
+# For development (includes linting tools):
+pip install -e ".[dev]"
+# For testing (includes pytest):
+pip install -e ".[test]"
+```
+## API Key Setup for Cloud Models
+When using LangExtract with cloud-hosted models (like Gemini), you'll need to
+set up an API key. On-device models don't require an API key. For developers
+using local LLMs, LangExtract offers built-in support for Ollama and can be
+extended to other third-party APIs by updating the inference endpoints.
+### API Key Sources
+Get API keys from:
+*   [AI Studio](https://aistudio.google.com/app/apikey) for Gemini models
+*   [Vertex AI](https://cloud.google.com/vertex-ai/generative-ai/docs/sdks/overview) for enterprise use
+### Setting up API key in your environment
+**Option 1: Environment Variable**
+```bash
+export LANGEXTRACT_API_KEY="your-api-key-here"
+```
+**Option 2: .env File (Recommended)**
+Add your API key to a `.env` file:
+```bash
+# Add API key to .env file
+cat >> .env << 'EOF'
+LANGEXTRACT_API_KEY=your-api-key-here
+EOF
+# Keep your API key secure
+echo '.env' >> .gitignore
+```
+In your Python code:
+```python
+import langextract as lx
+result = lx.extract(
+    text_or_documents=input_text,
+    prompt_description="Extract information...",
+    examples=[...],
+    model_id="gemini-2.5-flash"
+)
+```
+**Option 3: Direct API Key (Not Recommended for Production)**
+You can also provide the API key directly in your code, though this is not recommended for production use:
+```python
+result = lx.extract(
+    text_or_documents=input_text,
+    prompt_description="Extract information...",
+    examples=[...],
+    model_id="gemini-2.5-flash",
+    api_key="your-api-key-here"  # Only use this for testing/development
+)
+```
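The relationship between Options 1 and 2 can be sketched as a simple lookup: check the process environment first, then fall back to the contents of a `.env` file. This is a simplified illustration and an assumption about precedence, not LangExtract's own key-loading logic. `parse_dotenv` and `resolve_api_key` are hypothetical helpers, and the parser handles only plain `KEY=VALUE` lines.

```python
def parse_dotenv(content):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in content.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_api_key(environ, dotenv_content="", name="LANGEXTRACT_API_KEY"):
    """Prefer the process environment, then fall back to .env content."""
    key = environ.get(name) or parse_dotenv(dotenv_content).get(name)
    if not key:
        raise KeyError(f"{name} is not set")
    return key


dotenv_text = "# local secrets\nLANGEXTRACT_API_KEY=your-api-key-here\n"
from_file = resolve_api_key({}, dotenv_text)
from_env = resolve_api_key({"LANGEXTRACT_API_KEY": "env-key"}, dotenv_text)
```

In real use the first argument would be `os.environ`. Failing loudly when no key is found surfaces a configuration problem before any request is sent.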
+## More Examples
+Additional examples of LangExtract in action:
+### *Romeo and Juliet* Full Text Extraction
+LangExtract can process complete documents directly from URLs. This example demonstrates extraction from the full text of *Romeo and Juliet* from Project Gutenberg (147,843 characters), showing parallel processing, sequential extraction passes, and performance optimization for long document processing.
+**[View *Romeo and Juliet* Full Text Example →](docs/examples/longer_text_example.md)**
+### Medication Extraction
+> **Disclaimer:** This demonstration is for illustrative purposes of LangExtract's baseline capability only. It does not represent a finished or approved product, is not intended to diagnose or suggest treatment of any disease or condition, and should not be used for medical advice.
+LangExtract excels at extracting structured medical information from clinical text. These examples demonstrate both basic entity recognition (medication names, dosages, routes) and relationship extraction (connecting medications to their attributes), showing LangExtract's effectiveness for healthcare applications.
+**[View Medication Examples →](docs/examples/medication_examples.md)**
+### Radiology Report Structuring: RadExtract
+Explore RadExtract, a live interactive demo on HuggingFace Spaces that shows how LangExtract can automatically structure radiology reports. Try it directly in your browser with no setup required.
+**[View RadExtract Demo →](https://huggingface.co/spaces/google/radextract)**
+## Contributing
+Contributions are welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) to get started
+with development, testing, and pull requests. You must sign a
+[Contributor License Agreement](https://cla.developers.google.com/about)
+before submitting patches.
+## Testing
+To run tests locally from the source:
+```bash
+# Clone the repository
+git clone https://github.com/google/langextract.git
+cd langextract
+# Install with test dependencies
+pip install -e ".[test]"
+# Run all tests
+pytest tests
+```
+Or reproduce the full CI matrix locally with tox:
+```bash
+tox  # runs pylint + pytest on Python 3.10 and 3.11
+```
+## Disclaimer
+This is not an officially supported Google product. If you use
+LangExtract in production or publications, please cite accordingly and
+acknowledge usage. Use is subject to the [Apache 2.0 License](LICENSE).
+For health-related applications, use of LangExtract is also subject to the
+[Health AI Developer Foundations Terms of Use](https://developers.google.com/health-ai-developer-foundations/terms).
+---
+**Happy Extracting!**