mostlyai-mock 0.0.5__tar.gz → 0.0.7__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,201 @@
1
+ Metadata-Version: 2.4
2
+ Name: mostlyai-mock
3
+ Version: 0.0.7
4
+ Summary: Synthetic Mock Data
5
+ Project-URL: homepage, https://github.com/mostly-ai/mostlyai-mock
6
+ Project-URL: repository, https://github.com/mostly-ai/mostlyai-mock
7
+ Project-URL: documentation, https://mostly-ai.github.io/mostlyai-mock/
8
+ Author-email: MOSTLY AI <dev@mostly.ai>
9
+ License-Expression: Apache-2.0
10
+ License-File: LICENSE
11
+ Classifier: Development Status :: 4 - Beta
12
+ Classifier: Intended Audience :: Developers
13
+ Classifier: Intended Audience :: Financial and Insurance Industry
14
+ Classifier: Intended Audience :: Healthcare Industry
15
+ Classifier: Intended Audience :: Information Technology
16
+ Classifier: Intended Audience :: Science/Research
17
+ Classifier: Intended Audience :: Telecommunications Industry
18
+ Classifier: License :: OSI Approved :: Apache Software License
19
+ Classifier: Operating System :: OS Independent
20
+ Classifier: Programming Language :: Python :: 3.10
21
+ Classifier: Programming Language :: Python :: 3.11
22
+ Classifier: Programming Language :: Python :: 3.12
23
+ Classifier: Programming Language :: Python :: 3.13
24
+ Classifier: Topic :: Software Development :: Libraries
25
+ Classifier: Typing :: Typed
26
+ Requires-Python: >=3.10
27
+ Requires-Dist: litellm>=1.67.0
28
+ Requires-Dist: numpy>=1.26.3
29
+ Requires-Dist: pandas>=2.0.0
30
+ Requires-Dist: pyarrow>=14.0.0
31
+ Requires-Dist: pydantic<3.0.0,>=2.0.0
32
+ Description-Content-Type: text/markdown
33
+
34
+ # Synthetic Mock Data 🔮
35
+
36
+ [![Documentation](https://img.shields.io/badge/docs-latest-green)](https://mostly-ai.github.io/mostlyai-mock/) [![stats](https://pepy.tech/badge/mostlyai-mock)](https://pypi.org/project/mostlyai-mock/) ![license](https://img.shields.io/github/license/mostly-ai/mostlyai-mock) ![GitHub Release](https://img.shields.io/github/v/release/mostly-ai/mostlyai-mock)
37
+
38
+ Create data out of nothing. Prompt LLMs for Tabular Data.
39
+
40
+ ## Key Features
41
+
42
+ * A light-weight python client for prompting LLMs for mixed-type tabular data
43
+ * Select from a range of LLM endpoints, that provide structured output
44
+ * Supports single-table as well as multi-table scenarios.
45
+ * Supports variety of data types: `string`, `categorical`, `integer`, `float`, `boolean`, `date`, and `datetime`.
46
+ * Specify context, distributions and rules via dataset-, table- or column-level prompts.
47
+ * Tailor the diversity and realism of your generated data via temperature and top_p.
48
+
49
+ ## Getting Started
50
+
51
+ 1. Install the latest version of the `mostlyai-mock` python package.
52
+
53
+ ```bash
54
+ pip install -U mostlyai-mock
55
+ ```
56
+
57
+ 2. Set the API key of your LLM endpoint (if not done yet)
58
+
59
+ ```python
60
+ import os
61
+ os.environ["OPENAI_API_KEY"] = "your-api-key"
62
+ # os.environ["GEMINI_API_KEY"] = "your-api-key"
63
+ # os.environ["GROQ_API_KEY"] = "your-api-key"
64
+ ```
65
+
66
+ Note: You will need to obtain your API key directly from the LLM service provider (e.g. for Open AI from [here](https://platform.openai.com/api-keys)). The LLM endpoint will be determined by the chosen `model` when making calls to `mock.sample`.
67
+
68
+ 3. Create your first basic synthetic table from scratch
69
+
70
+ ```python
71
+ from mostlyai import mock
72
+
73
+ tables = {
74
+ "guests": {
75
+ "description": "Guests of an Alpine ski hotel in Austria",
76
+ "columns": {
77
+ "nationality": {"prompt": "2-letter code for the nationality", "dtype": "string"},
78
+ "name": {"prompt": "first name and last name of the guest", "dtype": "string"},
79
+ "gender": {"dtype": "category", "values": ["male", "female"]},
80
+ "age": {"prompt": "age in years; min: 18, max: 80; avg: 25", "dtype": "integer"},
81
+ "date_of_birth": {"prompt": "date of birth", "dtype": "date"},
82
+ "checkin_time": {"prompt": "the check in timestamp of the guest; may 2025", "dtype": "datetime"},
83
+ "is_vip": {"prompt": "is the guest a VIP", "dtype": "boolean"},
84
+ "price_per_night": {"prompt": "price paid per night, in EUR", "dtype": "float"},
85
+ "room_number": {"prompt": "room number", "dtype": "integer", "values": [101, 102, 103, 201, 202, 203, 204]}
86
+ },
87
+ }
88
+ }
89
+ df = mock.sample(
90
+ tables=tables, # provide table and column definitions
91
+ sample_size=10, # generate 10 records
92
+ model="openai/gpt-4.1-nano", # select the LLM model (optional)
93
+ )
94
+ print(df)
95
+ # nationality name gender age date_of_birth checkin_time is_vip price_per_night room_number
96
+ # 0 AT Anna Müller female 29 1994-09-15 2025-01-05 14:30:00 True 350.0 101
97
+ # 1 DE Johann Schmidt male 45 1978-11-20 2025-01-06 16:45:00 False 250.0 102
98
+ # 2 CH Lara Meier female 32 1991-04-12 2025-01-05 12:00:00 True 400.0 103
99
+ # 3 IT Marco Rossi male 38 1985-02-25 2025-01-07 09:15:00 False 280.0 201
100
+ # 4 FR Claire Dupont female 24 2000-07-08 2025-01-07 11:20:00 False 220.0 202
101
+ # 5 AT Felix Gruber male 52 1972-01-10 2025-01-06 17:50:00 True 375.0 203
102
+ # 6 DE Sophie Becker female 27 1996-03-30 2025-01-08 08:30:00 False 230.0 204
103
+ # 7 CH Max Keller male 31 1992-05-16 2025-01-09 14:10:00 False 290.0 101
104
+ # 8 IT Giulia Bianchi female 36 1988-08-19 2025-01-05 15:55:00 True 410.0 102
105
+ # 9 FR Louis Martin male 44 1980-12-05 2025-01-07 10:40:00 False 270.0 103
106
+ ```
107
+
108
+ 4. Create your first multi-table synthetic dataset
109
+
110
+ ```python
111
+ from mostlyai import mock
112
+
113
+ tables = {
114
+ "customers": {
115
+ "description": "Customers of a hardware store",
116
+ "columns": {
117
+ "customer_id": {"prompt": "the unique id of the customer", "dtype": "integer"},
118
+ "name": {"prompt": "first name and last name of the customer", "dtype": "string"},
119
+ },
120
+ "primary_key": "customer_id",
121
+ },
122
+ "warehouses": {
123
+ "description": "Warehouses of a hardware store",
124
+ "columns": {
125
+ "warehouse_id": {"prompt": "the unique id of the warehouse", "dtype": "integer"},
126
+ "name": {"prompt": "the name of the warehouse", "dtype": "string"},
127
+ },
128
+ "primary_key": "warehouse_id",
129
+ },
130
+ "orders": {
131
+ "description": "Orders of a Customer",
132
+ "columns": {
133
+ "customer_id": {"prompt": "the customer id for that order", "dtype": "integer"},
134
+ "warehouse_id": {"prompt": "the warehouse id for that order", "dtype": "integer"},
135
+ "order_id": {"prompt": "the unique id of the order", "dtype": "string"},
136
+ "text": {"prompt": "order text description", "dtype": "string"},
137
+ "amount": {"prompt": "order amount in USD", "dtype": "float"},
138
+ },
139
+ "primary_key": "order_id",
140
+ "foreign_keys": [
141
+ {
142
+ "column": "customer_id",
143
+ "referenced_table": "customers",
144
+ "description": "each customer has anywhere between 2 and 3 orders",
145
+ },
146
+ {
147
+ "column": "warehouse_id",
148
+ "referenced_table": "warehouses",
149
+ },
150
+ ],
151
+ },
152
+ "items": {
153
+ "description": "Items in an Order",
154
+ "columns": {
155
+ "item_id": {"prompt": "the unique id of the item", "dtype": "string"},
156
+ "order_id": {"prompt": "the order id for that item", "dtype": "string"},
157
+ "name": {"prompt": "the name of the item", "dtype": "string"},
158
+ "price": {"prompt": "the price of the item in USD", "dtype": "float"},
159
+ },
160
+ "foreign_keys": [
161
+ {
162
+ "column": "order_id",
163
+ "referenced_table": "orders",
164
+ "description": "each order has between 1 and 2 items",
165
+ }
166
+ ],
167
+ },
168
+ }
169
+ data = mock.sample(
170
+ tables=tables,
171
+ sample_size=2,
172
+ model="openai/gpt-4.1"
173
+ )
174
+ print(data["customers"])
175
+ # customer_id name
176
+ # 0 1 Matthew Carlson
177
+ # 1 2 Priya Shah
178
+ print(data["warehouses"])
179
+ # warehouse_id name
180
+ # 0 1 Central Distribution Hub
181
+ # 1 2 Northgate Storage Facility
182
+ print(data["orders"])
183
+ # customer_id warehouse_id order_id text amount
184
+ # 0 1 2 ORD-10294 3-tier glass shelving units, expedited deliver... 649.25
185
+ # 1 1 1 ORD-10541 Office desk chairs, set of 6, with assembly se... 824.9
186
+ # 2 1 1 ORD-10802 Executive standing desk, walnut finish, standa... 519.0
187
+ # 3 2 1 ORD-11017 Maple conference table, cable management inclu... 1225.5
188
+ # 4 2 2 ORD-11385 Set of ergonomic task chairs, black mesh, stan... 767.75
189
+ print(data["items"])
190
+ # item_id order_id name price
191
+ # 0 ITM-80265 ORD-10294 3-Tier Tempered Glass Shelving Unit 409.0
192
+ # 1 ITM-80266 ORD-10294 Brushed Aluminum Shelf Brackets (Set of 4) 240.25
193
+ # 2 ITM-81324 ORD-10541 Ergonomic Mesh-Back Desk Chair 132.5
194
+ # 3 ITM-81325 ORD-10541 Professional Office Chair Assembly Service 45.0
195
+ # 4 ITM-82101 ORD-10802 Executive Standing Desk, Walnut Finish 469.0
196
+ # 5 ITM-82102 ORD-10802 Desk Installation and Setup Service 50.0
197
+ # 6 ITM-83391 ORD-11017 Maple Conference Table, 10-Seat 1125.5
198
+ # 7 ITM-83392 ORD-11017 Integrated Table Cable Management Kit 100.0
199
+ # 8 ITM-84311 ORD-11385 Ergonomic Task Chair, Black Mesh 359.25
200
+ # 9 ITM-84312 ORD-11385 Standard Delivery Service 48.5
201
+ ```
@@ -0,0 +1,168 @@
1
+ # Synthetic Mock Data 🔮
2
+
3
+ [![Documentation](https://img.shields.io/badge/docs-latest-green)](https://mostly-ai.github.io/mostlyai-mock/) [![stats](https://pepy.tech/badge/mostlyai-mock)](https://pypi.org/project/mostlyai-mock/) ![license](https://img.shields.io/github/license/mostly-ai/mostlyai-mock) ![GitHub Release](https://img.shields.io/github/v/release/mostly-ai/mostlyai-mock)
4
+
5
+ Create data out of nothing. Prompt LLMs for Tabular Data.
6
+
7
+ ## Key Features
8
+
9
+ * A light-weight python client for prompting LLMs for mixed-type tabular data
10
+ * Select from a range of LLM endpoints, that provide structured output
11
+ * Supports single-table as well as multi-table scenarios.
12
+ * Supports variety of data types: `string`, `categorical`, `integer`, `float`, `boolean`, `date`, and `datetime`.
13
+ * Specify context, distributions and rules via dataset-, table- or column-level prompts.
14
+ * Tailor the diversity and realism of your generated data via temperature and top_p.
15
+
16
+ ## Getting Started
17
+
18
+ 1. Install the latest version of the `mostlyai-mock` python package.
19
+
20
+ ```bash
21
+ pip install -U mostlyai-mock
22
+ ```
23
+
24
+ 2. Set the API key of your LLM endpoint (if not done yet)
25
+
26
+ ```python
27
+ import os
28
+ os.environ["OPENAI_API_KEY"] = "your-api-key"
29
+ # os.environ["GEMINI_API_KEY"] = "your-api-key"
30
+ # os.environ["GROQ_API_KEY"] = "your-api-key"
31
+ ```
32
+
33
+ Note: You will need to obtain your API key directly from the LLM service provider (e.g. for Open AI from [here](https://platform.openai.com/api-keys)). The LLM endpoint will be determined by the chosen `model` when making calls to `mock.sample`.
34
+
35
+ 3. Create your first basic synthetic table from scratch
36
+
37
+ ```python
38
+ from mostlyai import mock
39
+
40
+ tables = {
41
+ "guests": {
42
+ "description": "Guests of an Alpine ski hotel in Austria",
43
+ "columns": {
44
+ "nationality": {"prompt": "2-letter code for the nationality", "dtype": "string"},
45
+ "name": {"prompt": "first name and last name of the guest", "dtype": "string"},
46
+ "gender": {"dtype": "category", "values": ["male", "female"]},
47
+ "age": {"prompt": "age in years; min: 18, max: 80; avg: 25", "dtype": "integer"},
48
+ "date_of_birth": {"prompt": "date of birth", "dtype": "date"},
49
+ "checkin_time": {"prompt": "the check in timestamp of the guest; may 2025", "dtype": "datetime"},
50
+ "is_vip": {"prompt": "is the guest a VIP", "dtype": "boolean"},
51
+ "price_per_night": {"prompt": "price paid per night, in EUR", "dtype": "float"},
52
+ "room_number": {"prompt": "room number", "dtype": "integer", "values": [101, 102, 103, 201, 202, 203, 204]}
53
+ },
54
+ }
55
+ }
56
+ df = mock.sample(
57
+ tables=tables, # provide table and column definitions
58
+ sample_size=10, # generate 10 records
59
+ model="openai/gpt-4.1-nano", # select the LLM model (optional)
60
+ )
61
+ print(df)
62
+ # nationality name gender age date_of_birth checkin_time is_vip price_per_night room_number
63
+ # 0 AT Anna Müller female 29 1994-09-15 2025-01-05 14:30:00 True 350.0 101
64
+ # 1 DE Johann Schmidt male 45 1978-11-20 2025-01-06 16:45:00 False 250.0 102
65
+ # 2 CH Lara Meier female 32 1991-04-12 2025-01-05 12:00:00 True 400.0 103
66
+ # 3 IT Marco Rossi male 38 1985-02-25 2025-01-07 09:15:00 False 280.0 201
67
+ # 4 FR Claire Dupont female 24 2000-07-08 2025-01-07 11:20:00 False 220.0 202
68
+ # 5 AT Felix Gruber male 52 1972-01-10 2025-01-06 17:50:00 True 375.0 203
69
+ # 6 DE Sophie Becker female 27 1996-03-30 2025-01-08 08:30:00 False 230.0 204
70
+ # 7 CH Max Keller male 31 1992-05-16 2025-01-09 14:10:00 False 290.0 101
71
+ # 8 IT Giulia Bianchi female 36 1988-08-19 2025-01-05 15:55:00 True 410.0 102
72
+ # 9 FR Louis Martin male 44 1980-12-05 2025-01-07 10:40:00 False 270.0 103
73
+ ```
74
+
75
+ 4. Create your first multi-table synthetic dataset
76
+
77
+ ```python
78
+ from mostlyai import mock
79
+
80
+ tables = {
81
+ "customers": {
82
+ "description": "Customers of a hardware store",
83
+ "columns": {
84
+ "customer_id": {"prompt": "the unique id of the customer", "dtype": "integer"},
85
+ "name": {"prompt": "first name and last name of the customer", "dtype": "string"},
86
+ },
87
+ "primary_key": "customer_id",
88
+ },
89
+ "warehouses": {
90
+ "description": "Warehouses of a hardware store",
91
+ "columns": {
92
+ "warehouse_id": {"prompt": "the unique id of the warehouse", "dtype": "integer"},
93
+ "name": {"prompt": "the name of the warehouse", "dtype": "string"},
94
+ },
95
+ "primary_key": "warehouse_id",
96
+ },
97
+ "orders": {
98
+ "description": "Orders of a Customer",
99
+ "columns": {
100
+ "customer_id": {"prompt": "the customer id for that order", "dtype": "integer"},
101
+ "warehouse_id": {"prompt": "the warehouse id for that order", "dtype": "integer"},
102
+ "order_id": {"prompt": "the unique id of the order", "dtype": "string"},
103
+ "text": {"prompt": "order text description", "dtype": "string"},
104
+ "amount": {"prompt": "order amount in USD", "dtype": "float"},
105
+ },
106
+ "primary_key": "order_id",
107
+ "foreign_keys": [
108
+ {
109
+ "column": "customer_id",
110
+ "referenced_table": "customers",
111
+ "description": "each customer has anywhere between 2 and 3 orders",
112
+ },
113
+ {
114
+ "column": "warehouse_id",
115
+ "referenced_table": "warehouses",
116
+ },
117
+ ],
118
+ },
119
+ "items": {
120
+ "description": "Items in an Order",
121
+ "columns": {
122
+ "item_id": {"prompt": "the unique id of the item", "dtype": "string"},
123
+ "order_id": {"prompt": "the order id for that item", "dtype": "string"},
124
+ "name": {"prompt": "the name of the item", "dtype": "string"},
125
+ "price": {"prompt": "the price of the item in USD", "dtype": "float"},
126
+ },
127
+ "foreign_keys": [
128
+ {
129
+ "column": "order_id",
130
+ "referenced_table": "orders",
131
+ "description": "each order has between 1 and 2 items",
132
+ }
133
+ ],
134
+ },
135
+ }
136
+ data = mock.sample(
137
+ tables=tables,
138
+ sample_size=2,
139
+ model="openai/gpt-4.1"
140
+ )
141
+ print(data["customers"])
142
+ # customer_id name
143
+ # 0 1 Matthew Carlson
144
+ # 1 2 Priya Shah
145
+ print(data["warehouses"])
146
+ # warehouse_id name
147
+ # 0 1 Central Distribution Hub
148
+ # 1 2 Northgate Storage Facility
149
+ print(data["orders"])
150
+ # customer_id warehouse_id order_id text amount
151
+ # 0 1 2 ORD-10294 3-tier glass shelving units, expedited deliver... 649.25
152
+ # 1 1 1 ORD-10541 Office desk chairs, set of 6, with assembly se... 824.9
153
+ # 2 1 1 ORD-10802 Executive standing desk, walnut finish, standa... 519.0
154
+ # 3 2 1 ORD-11017 Maple conference table, cable management inclu... 1225.5
155
+ # 4 2 2 ORD-11385 Set of ergonomic task chairs, black mesh, stan... 767.75
156
+ print(data["items"])
157
+ # item_id order_id name price
158
+ # 0 ITM-80265 ORD-10294 3-Tier Tempered Glass Shelving Unit 409.0
159
+ # 1 ITM-80266 ORD-10294 Brushed Aluminum Shelf Brackets (Set of 4) 240.25
160
+ # 2 ITM-81324 ORD-10541 Ergonomic Mesh-Back Desk Chair 132.5
161
+ # 3 ITM-81325 ORD-10541 Professional Office Chair Assembly Service 45.0
162
+ # 4 ITM-82101 ORD-10802 Executive Standing Desk, Walnut Finish 469.0
163
+ # 5 ITM-82102 ORD-10802 Desk Installation and Setup Service 50.0
164
+ # 6 ITM-83391 ORD-11017 Maple Conference Table, 10-Seat 1125.5
165
+ # 7 ITM-83392 ORD-11017 Integrated Table Cable Management Kit 100.0
166
+ # 8 ITM-84311 ORD-11385 Ergonomic Task Chair, Black Mesh 359.25
167
+ # 9 ITM-84312 ORD-11385 Standard Delivery Service 48.5
168
+ ```
@@ -15,4 +15,4 @@
15
15
  from mostlyai.mock.core import sample
16
16
 
17
17
  __all__ = ["sample"]
18
- __version__ = "0.0.5" # Do not set this manually. Use poetry version [params].
18
+ __version__ = "0.0.7" # Do not set this manually. Use poetry version [params].
@@ -100,7 +100,10 @@ class MockConfig(RootModel[dict[str, "TableConfig"]]):
100
100
  if table_name in path:
101
101
  cycle_start = path.index(table_name)
102
102
  cycle = path[cycle_start:] + [table_name]
103
- raise ValueError(f"Circular dependency detected: {' -> '.join(cycle)}")
103
+ msg = f"Circular dependency detected: {' -> '.join(cycle)}."
104
+ if len(cycle) == 2:
105
+ msg += " Self-referencing tables are not yet supported."
106
+ raise ValueError(msg)
104
107
  if table_name in visited:
105
108
  return
106
109
  visited.add(table_name)
@@ -119,7 +122,7 @@ class TableConfig(BaseModel):
119
122
  description: str = ""
120
123
  columns: dict[str, ColumnConfig] = Field(..., min_items=1)
121
124
  primary_key: str | None = None
122
- foreign_keys: list[ForeignKeyConfig] = Field(default_factory=list, min_length=0, max_length=1)
125
+ foreign_keys: list[ForeignKeyConfig] = Field(default_factory=list)
123
126
 
124
127
 
125
128
  class ColumnConfig(BaseModel):
@@ -163,7 +166,7 @@ class ColumnConfig(BaseModel):
163
166
  DType.DATETIME: (str, "strings"),
164
167
  }[self.dtype]
165
168
  try:
166
- self.values = [cast_fn(c) for c in self.values]
169
+ self.values = [cast_fn(c) if pd.notna(c) else None for c in self.values]
167
170
  except ValueError:
168
171
  raise ValueError(
169
172
  f"All values must be convertible to {convertible_to} when dtype is '{self.dtype.value}'"
@@ -193,28 +196,25 @@ def _sample_table(
193
196
  table_config: TableConfig,
194
197
  primary_keys: dict[str, str] | None,
195
198
  sample_size: int | None,
196
- context_data: pd.DataFrame | None,
199
+ generated_data: dict[str, pd.DataFrame] | None,
197
200
  temperature: float,
198
201
  top_p: float,
199
202
  batch_size: int,
200
203
  previous_rows_size: int,
204
+ non_context_size: int | None,
201
205
  llm_config: LLMConfig,
202
206
  ) -> pd.DataFrame:
203
- assert (sample_size is None) != (context_data is None), (
204
- "Exactly one of sample_size or context_data must be provided"
205
- )
206
- if sample_size is None:
207
- sample_size = len(context_data)
208
207
  table_rows_generator = _create_table_rows_generator(
209
208
  table_name=table_name,
210
209
  table_config=table_config,
211
210
  primary_keys=primary_keys,
212
211
  sample_size=sample_size,
213
- context_data=context_data,
212
+ generated_data=generated_data,
214
213
  temperature=temperature,
215
214
  top_p=top_p,
216
215
  batch_size=batch_size,
217
216
  previous_rows_size=previous_rows_size,
217
+ non_context_size=non_context_size,
218
218
  llm_config=llm_config,
219
219
  )
220
220
  table_rows_generator = tqdm(table_rows_generator, desc=f"Generating rows for table `{table_name}`".ljust(45))
@@ -231,6 +231,7 @@ def _create_table_prompt(
231
231
  batch_size: int | None,
232
232
  foreign_keys: list[ForeignKeyConfig] | None,
233
233
  context_data: pd.DataFrame | None,
234
+ non_context_data: dict[str, pd.DataFrame],
234
235
  previous_rows: list[dict],
235
236
  ) -> str:
236
237
  if batch_size is not None:
@@ -271,16 +272,29 @@ def _create_table_prompt(
271
272
  prompt += f"## Context Table Data:\n\n"
272
273
  prompt += f"{context_data.to_json(orient='records', indent=2)}\n\n"
273
274
 
275
+ # add non-context table names, primary keys and data
276
+ if non_context_data:
277
+ for fk in foreign_keys[1:]:
278
+ prompt += f"## Non-Context Table: `{fk.referenced_table}`\n\n"
279
+
280
+ prompt += f"## Non-Context Table Primary Key: `{primary_keys[fk.referenced_table]}`\n\n"
281
+
282
+ prompt += f"## Non-Context Table Data:\n\n"
283
+ prompt += f"{non_context_data[fk.referenced_table].to_json(orient='records', indent=2)}\n\n"
284
+
274
285
  # add instructions
275
286
  prompt += "\n## Instructions:\n\n"
276
287
  if batch_size is not None:
277
288
  prompt += f"Generate {batch_size} rows for the `{table_name}` table.\n\n"
278
- else:
289
+
290
+ if context_data is not None:
279
291
  prompt += (
280
292
  f"Generate data for the `{table_name}` table. "
281
- f"The Foreign Key column may only contain values from Context Table Data. "
293
+ f"The first Foreign Key column from Foreign Keys section may only contain values from Context Table Data. "
294
+ f"The second Foreign Key column from Foreign Keys section (if exists) may only contain values from Non-Context Table Data. "
282
295
  f"Pay attention to description of the Foreign Key column to understand the relationship.\n\n"
283
296
  )
297
+
284
298
  if previous_rows:
285
299
  prompt += (
286
300
  "Generate new rows that maintain consistency with the previous rows where appropriate. "
@@ -298,12 +312,13 @@ def _create_table_rows_generator(
298
312
  table_name: str,
299
313
  table_config: TableConfig,
300
314
  primary_keys: dict[str, str] | None,
301
- sample_size: int,
315
+ sample_size: int | None,
316
+ generated_data: dict[str, pd.DataFrame] | None,
302
317
  temperature: float,
303
318
  top_p: float,
304
- context_data: pd.DataFrame | None,
305
319
  batch_size: int,
306
320
  previous_rows_size: int,
321
+ non_context_size: int | None,
307
322
  llm_config: LLMConfig,
308
323
  ) -> Generator[dict]:
309
324
  def create_table_response_format(columns: dict[str, ColumnConfig]) -> BaseModel:
@@ -311,14 +326,14 @@ def _create_table_rows_generator(
311
326
  if column_config.values or column_config.dtype is DType.CATEGORY:
312
327
  return Literal[tuple(column_config.values)]
313
328
  return {
314
- DType.INTEGER: int,
315
- DType.FLOAT: float,
316
- DType.STRING: str,
317
- DType.BOOLEAN: bool,
329
+ DType.INTEGER: int | None,
330
+ DType.FLOAT: float | None,
331
+ DType.STRING: str | None,
332
+ DType.BOOLEAN: bool | None,
318
333
  # response_format has limited support for JSON Schema features
319
334
  # thus we represent dates and datetimes as strings
320
- DType.DATE: str,
321
- DType.DATETIME: str,
335
+ DType.DATE: str | None,
336
+ DType.DATETIME: str | None,
322
337
  }[column_config.dtype]
323
338
 
324
339
  fields = {}
@@ -368,6 +383,26 @@ def _create_table_rows_generator(
368
383
  for i in range(0, len(data), batch_size):
369
384
  yield data.iloc[i : i + batch_size]
370
385
 
386
+ # derive context data (if first foreign key is present) and harmonize sample size accordingly
387
+ context_data: pd.DataFrame | None = None
388
+ if table_config.foreign_keys:
389
+ context_table_name = table_config.foreign_keys[0].referenced_table
390
+ assert generated_data is not None
391
+ assert context_table_name in generated_data
392
+ context_data = generated_data[context_table_name]
393
+ sample_size = len(context_data)
394
+ assert sample_size is not None
395
+
396
+ # derive non-context data (if more than one foreign key is present)
397
+ non_context_data: dict[str, pd.DataFrame] = {}
398
+ if table_config.foreign_keys and len(table_config.foreign_keys) > 1:
399
+ assert generated_data is not None
400
+ assert non_context_size is not None
401
+ for fk in table_config.foreign_keys[1:]:
402
+ non_context_table_name = fk.referenced_table
403
+ assert non_context_table_name in generated_data
404
+ non_context_data[non_context_table_name] = generated_data[non_context_table_name]
405
+
371
406
  # ensure model supports response_format and json schema
372
407
  supported_params = litellm.get_supported_openai_params(model=llm_config.model)
373
408
  assert "response_format" in supported_params
@@ -387,6 +422,11 @@ def _create_table_rows_generator(
387
422
  yielded_sequences = 0
388
423
  previous_rows = deque(maxlen=previous_rows_size)
389
424
  for context_batch in batch_infinitely(context_data):
425
+ non_context_batch = (
426
+ {table_name: df.sample(frac=1.0).head(non_context_size) for table_name, df in non_context_data.items()}
427
+ if non_context_data
428
+ else None
429
+ )
390
430
  prompt_kwargs = {
391
431
  "table_name": table_name,
392
432
  "table_description": table_config.description,
@@ -395,6 +435,7 @@ def _create_table_rows_generator(
395
435
  "batch_size": batch_size if context_batch is None else None,
396
436
  "foreign_keys": table_config.foreign_keys if context_batch is not None else None,
397
437
  "context_data": context_batch if context_batch is not None else None,
438
+ "non_context_data": non_context_batch if non_context_batch else None,
398
439
  "previous_rows": list(previous_rows),
399
440
  }
400
441
  prompt = _create_table_prompt(**prompt_kwargs)
@@ -429,10 +470,14 @@ def _convert_table_rows_generator_to_df(
429
470
  for column_name, column_config in columns.items():
430
471
  if column_config.dtype in [DType.DATE, DType.DATETIME]:
431
472
  df[column_name] = pd.to_datetime(df[column_name], errors="coerce")
432
- elif column_config.dtype in [DType.INTEGER, DType.FLOAT]:
433
- df[column_name] = pd.to_numeric(df[column_name], errors="coerce", dtype_backend="pyarrow")
473
+ elif column_config.dtype is DType.INTEGER:
474
+ df[column_name] = pd.to_numeric(df[column_name], errors="coerce", downcast="integer").astype(
475
+ "int64[pyarrow]"
476
+ )
477
+ elif column_config.dtype is DType.FLOAT:
478
+ df[column_name] = pd.to_numeric(df[column_name], errors="coerce").astype("double[pyarrow]")
434
479
  elif column_config.dtype is DType.BOOLEAN:
435
- df[column_name] = df[column_name].astype(bool)
480
+ df[column_name] = pd.to_numeric(df[column_name], errors="coerce").astype("boolean[pyarrow]")
436
481
  elif column_config.dtype is DType.CATEGORY:
437
482
  df[column_name] = pd.Categorical(df[column_name], categories=column_config.values)
438
483
  else:
@@ -472,7 +517,9 @@ def _build_dependency_graph(config: MockConfig) -> tuple[dict[str, list[str]], d
472
517
  return child_to_parents, parent_to_children, subject_tables
473
518
 
474
519
 
475
- def _build_execution_plan(parent_to_children: dict[str, list[str]], subject_tables: list[str]) -> list[str]:
520
+ def _build_execution_plan(
521
+ parent_to_children: dict[str, list[str]], child_to_parents: dict[str, list[str]], subject_tables: list[str]
522
+ ) -> list[str]:
476
523
  execution_plan = []
477
524
  bfs_queue = list(subject_tables)
478
525
  processed = set()
@@ -482,6 +529,13 @@ def _build_execution_plan(parent_to_children: dict[str, list[str]], subject_tabl
482
529
  if table_name in processed:
483
530
  continue
484
531
 
532
+ # ensure all parents are processed before processing this table
533
+ unprocessed_parents = [p for p in child_to_parents[table_name] if p not in processed]
534
+ if unprocessed_parents:
535
+ bfs_queue.extend(unprocessed_parents)
536
+ bfs_queue.append(table_name)
537
+ continue
538
+
485
539
  execution_plan.append(table_name)
486
540
  processed.add(table_name)
487
541
 
@@ -564,10 +618,19 @@ def sample(
564
618
  },
565
619
  "primary_key": "customer_id",
566
620
  },
621
+ "warehouses": {
622
+ "description": "Warehouses of a hardware store",
623
+ "columns": {
624
+ "warehouse_id": {"prompt": "the unique id of the warehouse", "dtype": "integer"},
625
+ "name": {"prompt": "the name of the warehouse", "dtype": "string"},
626
+ },
627
+ "primary_key": "warehouse_id",
628
+ },
567
629
  "orders": {
568
630
  "description": "Orders of a Customer",
569
631
  "columns": {
570
632
  "customer_id": {"prompt": "the customer id for that order", "dtype": "integer"},
633
+ "warehouse_id": {"prompt": "the warehouse id for that order", "dtype": "integer"},
571
634
  "order_id": {"prompt": "the unique id of the order", "dtype": "string"},
572
635
  "text": {"prompt": "order text description", "dtype": "string"},
573
636
  "amount": {"prompt": "order amount in USD", "dtype": "float"},
@@ -577,8 +640,12 @@ def sample(
577
640
  {
578
641
  "column": "customer_id",
579
642
  "referenced_table": "customers",
580
- "description": "each customer has anywhere between 1 and 3 orders",
581
- }
643
+ "description": "each customer has anywhere between 2 and 3 orders",
644
+ },
645
+ {
646
+ "column": "warehouse_id",
647
+ "referenced_table": "warehouses",
648
+ },
582
649
  ],
583
650
  },
584
651
  "items": {
@@ -593,13 +660,14 @@ def sample(
593
660
  {
594
661
  "column": "order_id",
595
662
  "referenced_table": "orders",
596
- "description": "each order has between 2 and 5 items",
663
+ "description": "each order has between 1 and 2 items",
597
664
  }
598
665
  ],
599
666
  },
600
667
  }
601
668
  data = mock.sample(tables=tables, sample_size=2, model="openai/gpt-4.1")
602
669
  df_customers = data["customers"]
670
+ df_warehouses = data["warehouses"]
603
671
  df_orders = data["orders"]
604
672
  df_items = data["items"]
605
673
  ```
@@ -611,7 +679,7 @@ def sample(
611
679
  primary_keys = {table_name: table_config.primary_key for table_name, table_config in config.root.items()}
612
680
 
613
681
  child_to_parents, parent_to_children, subject_tables = _build_dependency_graph(config)
614
- execution_plan: list[str] = _build_execution_plan(parent_to_children, subject_tables)
682
+ execution_plan: list[str] = _build_execution_plan(parent_to_children, child_to_parents, subject_tables)
615
683
 
616
684
  results: dict[str, pd.DataFrame] = {}
617
685
 
@@ -624,26 +692,27 @@ def sample(
624
692
  table_config=table_config,
625
693
  primary_keys=None,
626
694
  sample_size=sample_size[table_name],
627
- context_data=None,
695
+ generated_data=None,
628
696
  temperature=temperature,
629
697
  top_p=top_p,
630
- batch_size=20, # generate 20 subjects at a time
631
- previous_rows_size=5,
698
+ batch_size=30, # generate 30 subjects at a time
699
+ previous_rows_size=10, # present 10 previously generated rows to the LLM
700
+ non_context_size=None,
632
701
  llm_config=LLMConfig(model=model, api_key=api_key),
633
702
  )
634
703
  else:
635
704
  # sequencial table
636
- referenced_table = table_config.foreign_keys[0].referenced_table
637
705
  df = _sample_table(
638
706
  table_name=table_name,
639
707
  table_config=table_config,
640
708
  primary_keys=primary_keys,
641
709
  sample_size=None,
642
- context_data=results[referenced_table],
710
+ generated_data=results,
643
711
  temperature=temperature,
644
712
  top_p=top_p,
645
713
  batch_size=1, # generate one sequence at a time
646
- previous_rows_size=5,
714
+ previous_rows_size=10, # present 10 previously generated rows to the LLM
715
+ non_context_size=10, # pick 10 rows to choose from for each non-context foreign key
647
716
  llm_config=LLMConfig(model=model, api_key=api_key),
648
717
  )
649
718
  results[table_name] = df
@@ -1,11 +1,28 @@
1
1
  [project]
2
2
  name = "mostlyai-mock"
3
- version = "0.0.5"
3
+ version = "0.0.7"
4
4
  description = "Synthetic Mock Data"
5
5
  authors = [{ name = "MOSTLY AI", email = "dev@mostly.ai" }]
6
6
  requires-python = ">=3.10"
7
7
  readme = "README.md"
8
8
  license = "Apache-2.0"
9
+ classifiers = [
10
+ "Development Status :: 4 - Beta",
11
+ "Intended Audience :: Developers",
12
+ "Intended Audience :: Science/Research",
13
+ "Intended Audience :: Information Technology",
14
+ "Intended Audience :: Financial and Insurance Industry",
15
+ "Intended Audience :: Healthcare Industry",
16
+ "Intended Audience :: Telecommunications Industry",
17
+ "Programming Language :: Python :: 3.10",
18
+ "Programming Language :: Python :: 3.11",
19
+ "Programming Language :: Python :: 3.12",
20
+ "Programming Language :: Python :: 3.13",
21
+ "License :: OSI Approved :: Apache Software License",
22
+ "Operating System :: OS Independent",
23
+ "Topic :: Software Development :: Libraries",
24
+ "Typing :: Typed",
25
+ ]
9
26
  dependencies = [
10
27
  "pydantic>=2.0.0,<3.0.0",
11
28
  "numpy>=1.26.3",
@@ -1,117 +0,0 @@
1
- Metadata-Version: 2.4
2
- Name: mostlyai-mock
3
- Version: 0.0.5
4
- Summary: Synthetic Mock Data
5
- Project-URL: homepage, https://github.com/mostly-ai/mostlyai-mock
6
- Project-URL: repository, https://github.com/mostly-ai/mostlyai-mock
7
- Project-URL: documentation, https://mostly-ai.github.io/mostlyai-mock/
8
- Author-email: MOSTLY AI <dev@mostly.ai>
9
- License-Expression: Apache-2.0
10
- License-File: LICENSE
11
- Requires-Python: >=3.10
12
- Requires-Dist: litellm>=1.67.0
13
- Requires-Dist: numpy>=1.26.3
14
- Requires-Dist: pandas>=2.0.0
15
- Requires-Dist: pyarrow>=14.0.0
16
- Requires-Dist: pydantic<3.0.0,>=2.0.0
17
- Description-Content-Type: text/markdown
18
-
19
- # Synthetic Mock Data 🔮
20
-
21
- [![Documentation](https://img.shields.io/badge/docs-latest-green)](https://mostly-ai.github.io/mostlyai-mock/) [![stats](https://pepy.tech/badge/mostlyai-mock)](https://pypi.org/project/mostlyai-mock/) ![license](https://img.shields.io/github/license/mostly-ai/mostlyai-mock) ![GitHub Release](https://img.shields.io/github/v/release/mostly-ai/mostlyai-mock) ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/mostlyai-mock)
22
-
23
- Create data out of nothing. Prompt LLMs for Tabular Data.
24
-
25
- ## Installation
26
-
27
- The latest release of `mostlyai-mock` can be installed via pip:
28
-
29
- ```bash
30
- pip install -U mostlyai-mock
31
- ```
32
-
33
- Note: An API key to an LLM endpoint, with structured response, is required. It is recommended to set such a key as an environment variable (e.g. `OPENAI_API_KEY`, `GEMINI_API_KEY`, etc.). Alternatively, the key needs to be passed to every call to the library itself via the parameter `api_key`.
34
-
35
- ## Quick Start
36
-
37
- ### Single Table
38
-
39
- ```python
40
- from mostlyai import mock
41
-
42
- tables = {
43
- "guests": {
44
- "description": "Guests of an Alpine ski hotel in Austria",
45
- "columns": {
46
- "nationality": {"prompt": "2-letter code for the nationality", "dtype": "string"},
47
- "name": {"prompt": "first name and last name of the guest", "dtype": "string"},
48
- "gender": {"dtype": "category", "values": ["male", "female"]},
49
- "age": {"prompt": "age in years; min: 18, max: 80; avg: 25", "dtype": "integer"},
50
- "date_of_birth": {"prompt": "date of birth", "dtype": "date"},
51
- "checkin_time": {"prompt": "the check in timestamp of the guest; may 2025", "dtype": "datetime"},
52
- "is_vip": {"prompt": "is the guest a VIP", "dtype": "boolean"},
53
- "price_per_night": {"prompt": "price paid per night, in EUR", "dtype": "float"},
54
- "room_number": {"prompt": "room number", "dtype": "integer", "values": [101, 102, 103, 201, 202, 203, 204]}
55
- },
56
- }
57
- }
58
- df = mock.sample(tables=tables, sample_size=10, model="openai/gpt-4.1-nano")
59
- print(df)
60
- ```
61
-
62
- ### Multiple Tables
63
-
64
- ```python
65
- from mostlyai import mock
66
-
67
- tables = {
68
- "customers": {
69
- "description": "Customers of a hardware store",
70
- "columns": {
71
- "customer_id": {"prompt": "the unique id of the customer", "dtype": "integer"},
72
- "name": {"prompt": "first name and last name of the customer", "dtype": "string"},
73
- },
74
- "primary_key": "customer_id",
75
- },
76
- "orders": {
77
- "description": "Orders of a Customer",
78
- "columns": {
79
- "customer_id": {"prompt": "the customer id for that order", "dtype": "integer"},
80
- "order_id": {"prompt": "the unique id of the order", "dtype": "string"},
81
- "text": {"prompt": "order text description", "dtype": "string"},
82
- "amount": {"prompt": "order amount in USD", "dtype": "float"},
83
- },
84
- "primary_key": "order_id",
85
- "foreign_keys": [
86
- {
87
- "column": "customer_id",
88
- "referenced_table": "customers",
89
- "description": "each customer has anywhere between 1 and 3 orders",
90
- }
91
- ],
92
- },
93
- "items": {
94
- "description": "Items in an Order",
95
- "columns": {
96
- "item_id": {"prompt": "the unique id of the item", "dtype": "string"},
97
- "order_id": {"prompt": "the order id for that item", "dtype": "string"},
98
- "name": {"prompt": "the name of the item", "dtype": "string"},
99
- "price": {"prompt": "the price of the item in USD", "dtype": "float"},
100
- },
101
- "foreign_keys": [
102
- {
103
- "column": "order_id",
104
- "referenced_table": "orders",
105
- "description": "each order has between 2 and 5 items",
106
- }
107
- ],
108
- },
109
- }
110
- data = mock.sample(tables=tables, sample_size=2, model="openai/gpt-4.1")
111
- df_customers = data["customers"]
112
- df_orders = data["orders"]
113
- df_items = data["items"]
114
- print(df_customers)
115
- print(df_orders)
116
- print(df_items)
117
- ```
@@ -1,99 +0,0 @@
1
- # Synthetic Mock Data 🔮
2
-
3
- [![Documentation](https://img.shields.io/badge/docs-latest-green)](https://mostly-ai.github.io/mostlyai-mock/) [![stats](https://pepy.tech/badge/mostlyai-mock)](https://pypi.org/project/mostlyai-mock/) ![license](https://img.shields.io/github/license/mostly-ai/mostlyai-mock) ![GitHub Release](https://img.shields.io/github/v/release/mostly-ai/mostlyai-mock) ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/mostlyai-mock)
4
-
5
- Create data out of nothing. Prompt LLMs for Tabular Data.
6
-
7
- ## Installation
8
-
9
- The latest release of `mostlyai-mock` can be installed via pip:
10
-
11
- ```bash
12
- pip install -U mostlyai-mock
13
- ```
14
-
15
- Note: An API key to an LLM endpoint, with structured response, is required. It is recommended to set such a key as an environment variable (e.g. `OPENAI_API_KEY`, `GEMINI_API_KEY`, etc.). Alternatively, the key needs to be passed to every call to the library itself via the parameter `api_key`.
16
-
17
- ## Quick Start
18
-
19
- ### Single Table
20
-
21
- ```python
22
- from mostlyai import mock
23
-
24
- tables = {
25
- "guests": {
26
- "description": "Guests of an Alpine ski hotel in Austria",
27
- "columns": {
28
- "nationality": {"prompt": "2-letter code for the nationality", "dtype": "string"},
29
- "name": {"prompt": "first name and last name of the guest", "dtype": "string"},
30
- "gender": {"dtype": "category", "values": ["male", "female"]},
31
- "age": {"prompt": "age in years; min: 18, max: 80; avg: 25", "dtype": "integer"},
32
- "date_of_birth": {"prompt": "date of birth", "dtype": "date"},
33
- "checkin_time": {"prompt": "the check in timestamp of the guest; may 2025", "dtype": "datetime"},
34
- "is_vip": {"prompt": "is the guest a VIP", "dtype": "boolean"},
35
- "price_per_night": {"prompt": "price paid per night, in EUR", "dtype": "float"},
36
- "room_number": {"prompt": "room number", "dtype": "integer", "values": [101, 102, 103, 201, 202, 203, 204]}
37
- },
38
- }
39
- }
40
- df = mock.sample(tables=tables, sample_size=10, model="openai/gpt-4.1-nano")
41
- print(df)
42
- ```
43
-
44
- ### Multiple Tables
45
-
46
- ```python
47
- from mostlyai import mock
48
-
49
- tables = {
50
- "customers": {
51
- "description": "Customers of a hardware store",
52
- "columns": {
53
- "customer_id": {"prompt": "the unique id of the customer", "dtype": "integer"},
54
- "name": {"prompt": "first name and last name of the customer", "dtype": "string"},
55
- },
56
- "primary_key": "customer_id",
57
- },
58
- "orders": {
59
- "description": "Orders of a Customer",
60
- "columns": {
61
- "customer_id": {"prompt": "the customer id for that order", "dtype": "integer"},
62
- "order_id": {"prompt": "the unique id of the order", "dtype": "string"},
63
- "text": {"prompt": "order text description", "dtype": "string"},
64
- "amount": {"prompt": "order amount in USD", "dtype": "float"},
65
- },
66
- "primary_key": "order_id",
67
- "foreign_keys": [
68
- {
69
- "column": "customer_id",
70
- "referenced_table": "customers",
71
- "description": "each customer has anywhere between 1 and 3 orders",
72
- }
73
- ],
74
- },
75
- "items": {
76
- "description": "Items in an Order",
77
- "columns": {
78
- "item_id": {"prompt": "the unique id of the item", "dtype": "string"},
79
- "order_id": {"prompt": "the order id for that item", "dtype": "string"},
80
- "name": {"prompt": "the name of the item", "dtype": "string"},
81
- "price": {"prompt": "the price of the item in USD", "dtype": "float"},
82
- },
83
- "foreign_keys": [
84
- {
85
- "column": "order_id",
86
- "referenced_table": "orders",
87
- "description": "each order has between 2 and 5 items",
88
- }
89
- ],
90
- },
91
- }
92
- data = mock.sample(tables=tables, sample_size=2, model="openai/gpt-4.1")
93
- df_customers = data["customers"]
94
- df_orders = data["orders"]
95
- df_items = data["items"]
96
- print(df_customers)
97
- print(df_orders)
98
- print(df_items)
99
- ```
File without changes
File without changes