PyPI - flowcept - Versions diffs - 0.9.7__tar.gz → 0.9.8__tar.gz - Mend

flowcept 0.9.7.tar.gz → 0.9.8.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (208)

{flowcept-0.9.7 → flowcept-0.9.8}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: flowcept
-Version: 0.9.7
+Version: 0.9.8
 Summary: Capture and query workflow provenance data using data observability
 Author: Oak Ridge National Laboratory
 License-Expression: MIT
@@ -27,6 +27,7 @@ Requires-Dist: jupyterlab; extra == 'all'
 Requires-Dist: langchain-community; extra == 'all'
 Requires-Dist: langchain-openai; extra == 'all'
 Requires-Dist: lmdb; extra == 'all'
+Requires-Dist: matplotlib; extra == 'all'
 Requires-Dist: mcp[cli]; extra == 'all'
 Requires-Dist: mlflow-skinny; extra == 'all'
 Requires-Dist: nbmake; extra == 'all'
@@ -88,6 +89,7 @@ Requires-Dist: confluent-kafka<=2.8.0; extra == 'kafka'
 Provides-Extra: llm-agent
 Requires-Dist: langchain-community; extra == 'llm-agent'
 Requires-Dist: langchain-openai; extra == 'llm-agent'
+Requires-Dist: matplotlib; extra == 'llm-agent'
 Requires-Dist: mcp[cli]; extra == 'llm-agent'
 Requires-Dist: pymupdf; extra == 'llm-agent'
 Requires-Dist: streamlit; extra == 'llm-agent'
@@ -95,6 +97,7 @@ Provides-Extra: llm-agent-audio
 Requires-Dist: gtts; extra == 'llm-agent-audio'
 Requires-Dist: langchain-community; extra == 'llm-agent-audio'
 Requires-Dist: langchain-openai; extra == 'llm-agent-audio'
+Requires-Dist: matplotlib; extra == 'llm-agent-audio'
 Requires-Dist: mcp[cli]; extra == 'llm-agent-audio'
 Requires-Dist: pydub; extra == 'llm-agent-audio'
 Requires-Dist: pymupdf; extra == 'llm-agent-audio'
@@ -105,6 +108,7 @@ Provides-Extra: llm-google
 Requires-Dist: google-genai; extra == 'llm-google'
 Requires-Dist: langchain-community; extra == 'llm-google'
 Requires-Dist: langchain-openai; extra == 'llm-google'
+Requires-Dist: matplotlib; extra == 'llm-google'
 Requires-Dist: mcp[cli]; extra == 'llm-google'
 Requires-Dist: pymupdf; extra == 'llm-google'
 Requires-Dist: streamlit; extra == 'llm-google'

{flowcept-0.9.7 → flowcept-0.9.8}/docs/index.rst RENAMED

@@ -22,6 +22,7 @@ Flowcept
    prov_capture
    telemetry_capture
    prov_storage
+   prov_query
    schemas
    contributing
    cli-reference

flowcept-0.9.8/docs/prov_query.rst ADDED

@@ -0,0 +1,197 @@
+Provenance Querying
+====================
+Flowcept captures detailed provenance about workflows, tasks, and artifacts. Once captured, there are multiple ways to query this provenance depending on your needs. This guide summarizes the main mechanisms available for querying Flowcept data.
+.. note::
+    Persistence is optional in Flowcept. You can configure Flowcept to use LMDB, MongoDB or both. LMDB is a lightweight file‑based database ideal for simple tests; MongoDB provides advanced query support and is required for Flowcept’s database API and CLI queries. Flowcept can write to both backends if both are enabled.
+Querying with the Command‑Line Interface
+----------------------------------------
+Flowcept provides a small CLI for quick database queries. The CLI requires MongoDB to be enabled and accessible from your environment. After installing Flowcept, run `flowcept --help` to see all available commands and options. The usage pattern is:
+.. code-block:: console
+    flowcept --<function-name-with-dashes> [--<arg-name-with-dashes>=<value>]
+Important query‑oriented commands include:
+* ``workflow-count`` – count tasks, workflows and objects for a given workflow ID.
+* ``query`` – run a MongoDB query against the tasks collection, with optional projection, sorting and limit.
+* ``get-task`` – fetch a single task document by its ID.
+Here’s an example session:
+.. code-block:: console
+    # count the number of tasks, workflows and objects for a workflow
+    flowcept --workflow-count --workflow-id=123e4567-e89b-12d3-a456-426614174000
+    # query tasks where status is COMPLETED and only return `activity_id` and `status`
+    flowcept --query --filter='{"status": "COMPLETED"}' \
+            --project='{"activity_id": 1, "status": 1, "_id": 0}' \
+            --sort='[["started_at", -1]]' --limit=10
+    # fetch a task by ID
+    flowcept --get-task --task-id=24aa4e52-9aec-4ef6-8cb7-cbd7c72d436e
+The CLI prints JSON results to stdout. For full usage details see the official CLI reference.
+Querying via the Python API (`Flowcept.db`)
+-------------------------------------------
+For programmatic access inside scripts and notebooks, Flowcept exposes a database API via the ``Flowcept.db`` property. When MongoDB is enabled this property returns an instance of the internal `DBAPI` class. You can call any of the following methods:
+* ``task_query(filter, projection=None, limit=0, sort=None)`` – query the `tasks` collection with an optional projection, sort and limit.
+* ``workflow_query(filter)`` – query the `workflows` collection.
+* ``get_workflow_object(workflow_id)`` – fetch a workflow and return a `WorkflowObject`.
+* ``insert_or_update_task(task_object)`` – insert or update a task.
+* ``save_or_update_object(object, type, custom_metadata, …)`` – persist binary objects such as models or large artifacts.
+Below is a typical usage pattern:
+.. code-block:: python
+    from flowcept import Flowcept
+    # query tasks for the current workflow
+    tasks = Flowcept.db.get_tasks_from_current_workflow()
+    print(f"Tasks captured in current workflow: {len(tasks)}")
+    # find all tasks marked with a "math" tag
+    math_tasks = Flowcept.db.task_query(filter={"tags": "math"})
+    for t in math_tasks:
+        print(f"{t['task_id']} – {t['activity_id']}: {t['status']}")
+    # fetch a workflow object and inspect its arguments
+    wf = Flowcept.db.get_workflow_object(workflow_id="123e4567-e89b-12d3-a456-426614174000")
+    print(wf.workflow_args)
+The `DBAPI` exposes many other methods, such as `get_tasks_recursive` to retrieve all descendants of a task, or `dump_tasks_to_file_recursive` to export tasks to Parquet. See the API reference for details.
+Accessing the In‑Memory Buffer
+------------------------------
+During runtime Flowcept stores captured messages in an in‑memory buffer (`Flowcept.buffer`). This buffer is useful for debugging or lightweight scripts because it provides immediate access to the latest tasks and workflows without any additional services. In the example below we create two tasks that attach binary data and then inspect the buffer:
+.. code-block:: python
+    from pathlib import Path
+    from flowcept import Flowcept
+    from flowcept.instrumentation.task import FlowceptTask
+    with Flowcept() as f:
+        used_args = {"a": 1}
+        # first task – attach a PDF
+        with FlowceptTask(used=used_args) as t:
+            img_path = Path("docs/img/architecture.pdf")
+            with open(img_path, "rb") as fp:
+                img_data = fp.read()
+            t.end(generated={"b": 2},
+                  data=img_data,
+                  custom_metadata={
+                      "mime_type": "application/pdf",
+                      "file_name": "architecture.pdf",
+                      "file_extension": "pdf"})
+            t.send()
+        # second task – attach a PNG
+        with FlowceptTask(used=used_args) as t:
+            img_path = Path("docs/img/flowcept-logo.png")
+            with open(img_path, "rb") as fp:
+                img_data = fp.read()
+            t.end(generated={"c": 2},
+                  data=img_data,
+                  custom_metadata={
+                      "mime_type": "image/png",
+                      "file_name": "flowcept-logo.png",
+                      "file_extension": "png"})
+            t.send()
+        # inspect the buffer
+        assert len(Flowcept.buffer) == 3  # includes the workflow message
+        assert Flowcept.buffer[1]["data"]  # binary data is captured as bytes
+At any point inside the running workflow you can access `Flowcept.buffer` to retrieve a list of dictionaries representing messages. Each element contains the original JSON payload plus any binary `data` field. Because the buffer lives in memory, it reflects the most recent state of the workflow and is cleared when the process ends.
+Working Offline: Reading a Messages File
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+When persistence is enabled in offline mode, Flowcept dumps the buffer to a JSONL file. Use :func:`Flowcept.read_messages_file` to load these messages later. If you pass `return_df=True` Flowcept will normalise nested fields into dot‑separated columns and return a pandas DataFrame. This is handy for ad‑hoc analysis with pandas.
+.. code-block:: python
+    from flowcept import Flowcept
+    # read JSON into a list of dicts
+    msgs = Flowcept.read_messages_file("offline_buffer.jsonl")
+    print(f"{len(msgs)} messages")
+    # read JSON into a pandas DataFrame
+    df = Flowcept.read_messages_file("offline_buffer.jsonl", return_df=True)
+    # dot‑notation columns allow easy selection; e.g., outputs of attention layers
+    print("generated.attention" in df.columns)
+Keep in mind that the JSONL file is only created when using fully offline mode. The path is configured in the settings file under ``DUMP_BUFFER_PATH``. If the file doesn’t exist, `read_messages_file` will raise an error.
+Working Directly with MongoDB
+-----------------------------
+If MongoDB is enabled in your settings you may prefer to query the database directly, especially for complex aggregation pipelines. Flowcept stores tasks in the ``tasks`` collection, workflows in ``workflows``, and binary objects in ``objects``. You can use any MongoDB tool or client library, such as:
+* **PyMongo** – Python driver for MongoDB; perfect for custom scripts.
+* **MongoDB Compass** – graphical UI for ad‑hoc queries and visualisation.
+* **mongo shell** or **mongosh** – CLI for interactive queries.
+For example, using PyMongo:
+.. code-block:: python
+    import pymongo
+    client = pymongo.MongoClient("mongodb://localhost:27017")
+    db = client["flowcept"]
+    # find the 20 most recent tasks for a workflow
+    tasks = db.tasks.find(
+        {"workflow_id": "123e4567-e89b-12d3-a456-426614174000"},
+        {"_id": 0, "activity_id": 1, "status": 1}
+    ).sort("started_at", pymongo.DESCENDING).limit(20)
+    for t in tasks:
+        print(t)
+The connection string, database name and authentication credentials are configured in the Flowcept settings file.
+Working with LMDB
+-----------------
+If LMDB is enabled instead of MongoDB, Flowcept stores data in a directory (default: ``flowcept_lmdb``). LMDB is a file‑based key–value store; it does not support ad‑hoc queries out of the box, but you can read the data programmatically. Flowcept’s `DBAPI` can export LMDB data into pandas DataFrames, allowing you to analyse offline runs without MongoDB:
+.. code-block:: python
+    from flowcept import Flowcept
+    # export LMDB tasks to a DataFrame
+    df = Flowcept.db.to_df(collection="tasks")
+    print(df.head())
+Alternatively, you can use the `lmdb` Python library to iterate over raw key–value pairs. The LMDB environment is located under the directory configured in your settings file (commonly named ``flowcept_lmdb``). Because LMDB stores binary values, you’ll need to serialise and deserialise JSON messages yourself.
+Monitoring Provenance with Grafana
+----------------------------------
+Flowcept supports streaming provenance into monitoring dashboards. A sample Docker compose file (`deployment/compose-grafana.yml`) runs Grafana along with MongoDB and Redis. Grafana is configured with a pre‑built MongoDB‑Grafana image and exposes a port (3000) for the dashboard. To configure Grafana to query Flowcept’s MongoDB, create a new data source with the URL `mongodb://flowcept_mongo:27017` and specify the database name (usually `flowcept`). The compose file sets environment variables for the admin user and password so you can log in and create your own panels.
+Grafana can also connect directly to Redis or Kafka for near‑real‑time streaming. See the Grafana documentation for instructions on configuring those plugins.
+Querying via the LLM‑based Flowcept Agent
+-----------------------------------------
+Flowcept’s agentic querying (powered by language models) is under active development. The agent will allow natural‑language queries over provenance data, with interactive guidance and summarisation. Documentation will be released in a future version. In the meantime, use the CLI or Python API for querying tasks and workflows.
+Conclusion
+----------
+Flowcept offers several ways to query provenance data depending on your environment and requirements. For quick inspection, use the in‑memory buffer or offline message files. For interactive scripts or notebooks, `Flowcept.db` provides a high‑level API to MongoDB or LMDB. For more sophisticated queries, connect directly to MongoDB using the CLI or standard MongoDB tools. Grafana integration lets you build dashboards on live data. As Flowcept evolves, additional capabilities—such as LLM‑based query agents—will expand the ways you can explore your provenance.
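
Editor's sketch (not part of the diff): the CLI section of the added `prov_query.rst` passes the filter, projection and sort as JSON strings on the command line, which is easy to get wrong with shell quoting. The snippet below is a minimal illustration of building such a command from Python values. It assumes only the flags that appear in the documented example (`--query`, `--filter`, `--project`, `--sort`, `--limit`); the helper name `build_query_command` is hypothetical and is not a Flowcept API.

```python
import json
import shlex
from typing import Any, Dict, List, Optional


def build_query_command(
    filter_: Dict[str, Any],
    project: Optional[Dict[str, int]] = None,
    sort: Optional[List[List[Any]]] = None,
    limit: Optional[int] = None,
) -> str:
    """Return a shell-safe `flowcept --query ...` command string (hypothetical helper)."""
    # Each JSON argument is serialized once, then quoted for the shell below.
    parts = ["flowcept", "--query", f"--filter={json.dumps(filter_)}"]
    if project is not None:
        parts.append(f"--project={json.dumps(project)}")
    if sort is not None:
        parts.append(f"--sort={json.dumps(sort)}")
    if limit is not None:
        parts.append(f"--limit={limit}")
    # shlex.quote wraps any argument containing spaces, braces or quotes.
    return " ".join(shlex.quote(p) for p in parts)


cmd = build_query_command(
    {"status": "COMPLETED"},
    project={"activity_id": 1, "status": 1, "_id": 0},
    sort=[["started_at", -1]],
    limit=10,
)
print(cmd)
```

Splitting the resulting string with `shlex.split` gives back the original argument list, which is a quick way to confirm the quoting before running the command.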

{flowcept-0.9.7 → flowcept-0.9.8}/pyproject.toml RENAMED

@@ -65,7 +65,7 @@ mlflow = ["mlflow-skinny", "SQLAlchemy", "alembic", "watchdog", "cryptography"]
 nvidia = ["nvidia-ml-py"]
 mqtt = ["paho-mqtt"]
 tensorboard = ["tensorboard", "tensorflow", "tbparse"]
-llm_agent = ["mcp[cli]", "langchain_community", "langchain_openai", "streamlit", "PyMuPDF"]
+llm_agent = ["mcp[cli]", "langchain_community", "langchain_openai", "streamlit", "PyMuPDF", "matplotlib"]
 llm_google = ["flowcept[llm_agent]", "google-genai"]
 llm_agent_audio = ["flowcept[llm_agent]", "streamlit-mic-recorder", "SpeechRecognition", "pydub", "gTTS"]
 # System dependency (required for pydub)

{flowcept-0.9.7 → flowcept-0.9.8}/resources/sample_settings.yaml RENAMED

@@ -1,4 +1,4 @@
-flowcept_version: 0.9.7 # Version of the Flowcept package. This setting file is compatible with this version.
+flowcept_version: 0.9.8 # Version of the Flowcept package. This setting file is compatible with this version.
 project:
   debug: true # Toggle debug mode. This will add a property `debug: true` to all saved data, making it easier to retrieve/delete them later.

{flowcept-0.9.7 → flowcept-0.9.8}/src/flowcept/agents/prompts/in_memory_query_prompts.py RENAMED

@@ -176,6 +176,7 @@ QUERY_GUIDELINES = """
       -To select the first (or earliest) N workflow executions, use or adapt the following: `df.groupby('workflow_id', as_index=False).agg({{"started_at": 'min'}}).sort_values(by='started_at', ascending=True).head(N)['workflow_id']` - utilize `started_at` to sort!
       -To select the last (or latest or most recent) N workflow executions, use or adapt the following: `df.groupby('workflow_id', as_index=False).agg({{"ended_at": 'max'}}).sort_values(by='ended_at', ascending=False).head(N)['workflow_id']` - utilize `ended_at` to sort!
+      -If the user does not ask for a specific workflow run, do not use `workflow_id` in your query.
       -To select the first or earliest or initial tasks, use or adapt the following: `df.sort_values(by='started_at', ascending=True)`
       -To select the last or final or most recent tasks, use or adapt the following: `df.sort_values(by='ended_at', ascending=False)`

{flowcept-0.9.7 → flowcept-0.9.8}/src/flowcept/commons/daos/mq_dao/mq_dao_redis.py RENAMED

@@ -70,6 +70,7 @@ class MQDaoRedis(MQDao):
                     except Exception as e:
                         self.logger.error(f"Failed to process message {message}")
                         self.logger.exception(e)
+                        continue
                     current_trials = 0
             except (redis.exceptions.ConnectionError, redis.exceptions.TimeoutError) as e:
@@ -78,7 +79,7 @@ class MQDaoRedis(MQDao):
                 sleep(3)
             except Exception as e:
                 self.logger.exception(e)
-                break
+                continue
     def send_message(self, message: dict, channel=MQ_CHANNEL, serializer=msgpack.dumps):
         """Send the message."""
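
Editor's sketch (not part of the diff): the two hunks above add a `continue` after a per-message failure and replace a `break` with `continue`, so the listener appears to keep consuming after an unexpected error instead of exiting. The standalone example below illustrates that general pattern under simplified assumptions. It iterates over an in-memory list rather than a Redis subscription, and `consume` and `handler` are hypothetical names, not Flowcept code.

```python
import logging
from typing import Callable, Iterable, List

logger = logging.getLogger("mq_sketch")


def consume(messages: Iterable[dict], handler: Callable[[dict], None]) -> List[dict]:
    """Process every message, logging failures instead of stopping the loop."""
    processed = []
    for message in messages:
        try:
            handler(message)
        except Exception as e:
            logger.error(f"Failed to process message {message}")
            logger.exception(e)
            continue  # skip the bad message and keep consuming
        processed.append(message)
    return processed


def handler(message: dict) -> None:
    # Hypothetical handler that rejects messages without a task_id.
    if "task_id" not in message:
        raise ValueError("missing task_id")


result = consume([{"task_id": "a"}, {"bad": True}, {"task_id": "b"}], handler)
print(result)
```

With `break` in place of `continue`, the third message would never be handled. The trade-off is that a persistently failing source is retried rather than stopping the loop.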

{flowcept-0.9.7 → flowcept-0.9.8}/src/flowcept/flowcept_api/flowcept_controller.py RENAMED

@@ -1,7 +1,6 @@
 """Controller module."""
-import os.path
-from typing import List, Dict
+from typing import List, Dict, Any
 from uuid import uuid4
 from flowcept.commons.autoflush_buffer import AutoflushBuffer
@@ -175,25 +174,31 @@ class Flowcept(object):
         self._interceptor_instances[0]._mq_dao.bulk_publish(self.buffer)
     @staticmethod
-    def read_messages_file(file_path: str = None) -> List[Dict]:
+    def read_messages_file(file_path: str | None = None, return_df: bool = False):
         """
         Read a JSON Lines (JSONL) file containing captured Flowcept messages.
         This function loads a file where each line is a serialized JSON object.
         It joins the lines into a single JSON array and parses them efficiently
-        with ``orjson``.
+        with ``orjson``. If ``return_df`` is True, it returns a pandas DataFrame
+        created via ``pandas.json_normalize(..., sep='.')`` so nested fields become
+        dot-separated columns (for example, ``generated.attention``).
         Parameters
         ----------
         file_path : str, optional
-            Path to the messages file. If not provided, defaults to the
-            value of ``DUMP_BUFFER_PATH`` from the configuration.
-            If neither is provided, an assertion error is raised.
+            Path to the messages file. If not provided, defaults to the value of
+            ``DUMP_BUFFER_PATH`` from the configuration. If neither is provided,
+            an assertion error is raised.
+        return_df : bool, default False
+            If True, return a normalized pandas DataFrame. If False, return the
+            parsed list of dictionaries.
         Returns
         -------
-        List[dict]
-            A list of message objects (dictionaries) parsed from the file.
+        list of dict or pandas.DataFrame
+            A list of message objects when ``return_df`` is False,
+            otherwise a normalized DataFrame with dot-separated columns.
         Raises
         ------
@@ -203,35 +208,45 @@ class Flowcept(object):
             If the specified file does not exist.
         orjson.JSONDecodeError
             If the file contents cannot be parsed as valid JSON.
+        ModuleNotFoundError
+            If ``return_df`` is True but pandas is not installed.
         Examples
         --------
-        Read messages from a file explicitly:
+        Read messages as a list:
         >>> msgs = read_messages_file("offline_buffer.jsonl")
-        >>> print(len(msgs))
-        128
+        >>> len(msgs) > 0
+        True
-        Use the default dump buffer path from config:
+        Read messages as a normalized DataFrame:
-        >>> msgs = read_messages_file()
-        >>> for m in msgs[:2]:
-        ...     print(m["type"], m.get("workflow_id"))
-        task_start wf_123
-        task_end wf_123
+        >>> df = read_messages_file("offline_buffer.jsonl", return_df=True)
+        >>> "generated.attention" in df.columns
+        True
         """
+        import os
         import orjson
-        _buffer = []
         if file_path is None:
             file_path = DUMP_BUFFER_PATH
         assert file_path is not None, "Please indicate file_path either in the argument or in the config file."
         if not os.path.exists(file_path):
-            raise f"File {file_path} has not been created. It will only be created if you run in fully offline mode."
+            raise FileNotFoundError(f"File '{file_path}' was not found. It is created only in fully offline mode.")
         with open(file_path, "rb") as f:
             lines = [ln for ln in f.read().splitlines() if ln]
-        _buffer = orjson.loads(b"[" + b",".join(lines) + b"]")
-        return _buffer
+        buffer: List[Dict[str, Any]] = orjson.loads(b"[" + b",".join(lines) + b"]")
+        if return_df:
+            try:
+                import pandas as pd
+            except ModuleNotFoundError as e:
+                raise ModuleNotFoundError("pandas is required when return_df=True. Please install pandas.") from e
+            return pd.json_normalize(buffer, sep=".")
+        return buffer
     def save_workflow(self, interceptor: str, interceptor_instance: BaseInterceptor):
         """
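
Editor's sketch (not part of the diff): the updated `read_messages_file` joins the non-empty lines of a JSONL file into one JSON array, parses it in a single call, and optionally normalizes nested fields into dot-separated columns. The example below mimics those two steps with the standard library only, as an illustration rather than the package's implementation. It assumes `json` as a stand-in for `orjson` and a hand-written `flatten` helper as a rough stand-in for `pandas.json_normalize(..., sep=".")`; the sample messages are invented.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def read_jsonl(path: str) -> List[Dict[str, Any]]:
    """Parse a JSON Lines file, skipping blank lines (stdlib stand-in for orjson)."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"File '{path}' was not found.")
    with open(path, "rb") as f:
        lines = [ln for ln in f.read().splitlines() if ln]
    # Join the lines into one JSON array and parse it in a single call.
    return json.loads(b"[" + b",".join(lines) + b"]")


def flatten(record: Dict[str, Any], sep: str = ".", prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into sep-joined keys (rough stand-in for json_normalize)."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, sep=sep, prefix=name))
        else:
            flat[name] = value
    return flat


# Invented sample messages, written to a temporary file for the demo.
sample = [
    {"task_id": "t1", "generated": {"attention": 0.5}},
    {"task_id": "t2", "generated": {"loss": 0.1}},
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
    for msg in sample:
        tmp.write(json.dumps(msg) + "\n")
    tmp.write("\n")  # trailing blank line, which the reader ignores

messages = read_jsonl(tmp.name)
rows = [flatten(m) for m in messages]
columns = sorted({k for row in rows for k in row})
os.remove(tmp.name)
print(columns)
```

The flattened keys such as `generated.attention` correspond to the column names the docstring describes for `return_df=True`.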

{flowcept-0.9.7 → flowcept-0.9.8}/src/flowcept/version.py RENAMED

@@ -4,4 +4,4 @@
 # The expected format is: <Major>.<Minor>.<Patch>
 # This file is supposed to be automatically modified by the CI Bot.
 # See .github/workflows/version_bumper.py
-__version__ = "0.9.7"
+__version__ = "0.9.8"