tumblrbot 1.8.0__tar.gz → 1.9.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: tumblrbot
-Version: 1.8.0
+Version: 1.9.0
 Summary: An updated bot that posts to Tumblr, based on your very own blog!
 Requires-Python: >= 3.13
 Description-Content-Type: text/markdown
@@ -66,6 +66,7 @@ Features:
 1. [Creates examples][Examples] to fine-tune the model from your posts.
    - Filters out posts that contain more than just text data.
    - Filters out posts that contain [configured][config] regular expressions.
+   - Only uses the most recent posts from each blog as [configured][config].
    - Adds custom user messages and assistant responses to the dataset from the [configured][config] file.
 1. Filters out any posts flagged by the [OpenAI Moderation API].
 1. [Uploads examples][Fine-Tune] to [OpenAI] and begins the fine-tuning process.
@@ -81,10 +82,6 @@ Features:
 - Colorful output, progress bars, and post previews using [rich].
 - Automatically keeps the [config] file up-to-date and recreates it if missing.
 
-**To-Do:**
-
-- Create training data from a sample of posts (possible).
-
 **Known Issues:**
 
 - Sometimes, you will get an error about the training file not being found when starting fine-tuning. We do not currently have a fix or workaround for this. You should instead use the online portal for fine-tuning if this continues to happen. Read more in [fine-tuning].
@@ -143,6 +140,13 @@ All file options can include directories that will be created when the program i
 
 All config options that involve *blog identifiers* expect any version of a blog URL, which is explained in more detail in the [Tumblr API documentation on blog identifiers].
 
+A valid post:
+
+- Contains any content.
+- Only has text.
+- Is not an ask.
+- Is not a reblog.
+
 Specific Options:
 
 - `custom_prompts_file` This file should follow the following file format:
@@ -155,6 +159,7 @@ Specific Options:
 
   To be specific, it should follow the [JSON Lines] file format with one collection of name/value pairs (a dictionary) per line. You can validate your file using the [JSON Lines Validator].
 
+- **`post_limit`** - At most, this many valid posts will be included in the training data. This is effectively a filter that selects the `N` most recent valid posts from each blog. `0` will use every available valid post.
 - **`filtered_words`** - During training data generation, any posts with the specified words will be removed. Word boundaries are not checked by default, so "the" will also filter out posts with "them" or "thematic". This setting supports regular expressions, so you can explicitly look for word boundaries by surrounding an entry with "\\\b", i.e. "\\\bthe\\\b". Regular expressions have to be escaped like so due to how JSON data is read in. If you are familiar with regular expressions, it could be useful for you to know that every entry is joined with a "|" which is then used to search the post content for any matches.
 - **`developer_message`** - This message is used for fine-tuning the AI as well as for generating prompts. If you change this, you will need to run the fine-tuning again with the new value before generating posts.
 - **`user_message`** - This setting is used and works in the same way as `developer_message`.
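The `0` sentinel described for `post_limit` works because of Python's negative-slice semantics; a minimal sketch with a made-up list of post IDs (not tumblrbot's real data structures):

```python
# Posts ordered oldest to newest, as downloaded per blog.
posts = ["p1", "p2", "p3", "p4", "p5"]

def most_recent(posts: list[str], post_limit: int) -> list[str]:
    # posts[-post_limit:] keeps the last post_limit items; since -0 == 0,
    # a limit of 0 degrades to posts[0:], i.e. every available post.
    return posts[-post_limit:]

print(most_recent(posts, 2))  # ['p4', 'p5']
print(most_recent(posts, 0))  # ['p1', 'p2', 'p3', 'p4', 'p5']
```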
@@ -48,6 +48,7 @@ Features:
 1. [Creates examples][Examples] to fine-tune the model from your posts.
    - Filters out posts that contain more than just text data.
    - Filters out posts that contain [configured][config] regular expressions.
+   - Only uses the most recent posts from each blog as [configured][config].
    - Adds custom user messages and assistant responses to the dataset from the [configured][config] file.
 1. Filters out any posts flagged by the [OpenAI Moderation API].
 1. [Uploads examples][Fine-Tune] to [OpenAI] and begins the fine-tuning process.
@@ -63,10 +64,6 @@ Features:
 - Colorful output, progress bars, and post previews using [rich].
 - Automatically keeps the [config] file up-to-date and recreates it if missing.
 
-**To-Do:**
-
-- Create training data from a sample of posts (possible).
-
 **Known Issues:**
 
 - Sometimes, you will get an error about the training file not being found when starting fine-tuning. We do not currently have a fix or workaround for this. You should instead use the online portal for fine-tuning if this continues to happen. Read more in [fine-tuning].
@@ -125,6 +122,13 @@ All file options can include directories that will be created when the program i
 
 All config options that involve *blog identifiers* expect any version of a blog URL, which is explained in more detail in the [Tumblr API documentation on blog identifiers].
 
+A valid post:
+
+- Contains any content.
+- Only has text.
+- Is not an ask.
+- Is not a reblog.
+
 Specific Options:
 
 - `custom_prompts_file` This file should follow the following file format:
@@ -137,6 +141,7 @@ Specific Options:
 
   To be specific, it should follow the [JSON Lines] file format with one collection of name/value pairs (a dictionary) per line. You can validate your file using the [JSON Lines Validator].
 
+- **`post_limit`** - At most, this many valid posts will be included in the training data. This is effectively a filter that selects the `N` most recent valid posts from each blog. `0` will use every available valid post.
 - **`filtered_words`** - During training data generation, any posts with the specified words will be removed. Word boundaries are not checked by default, so "the" will also filter out posts with "them" or "thematic". This setting supports regular expressions, so you can explicitly look for word boundaries by surrounding an entry with "\\\b", i.e. "\\\bthe\\\b". Regular expressions have to be escaped like so due to how JSON data is read in. If you are familiar with regular expressions, it could be useful for you to know that every entry is joined with a "|" which is then used to search the post content for any matches.
 - **`developer_message`** - This message is used for fine-tuning the AI as well as for generating prompts. If you change this, you will need to run the fine-tuning again with the new value before generating posts.
 - **`user_message`** - This setting is used and works in the same way as `developer_message`.
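The escaping rules for `filtered_words` can be demonstrated end to end: in the JSON text each regex backslash is itself escaped, the parsed entries are joined with `|`, and the resulting pattern is case-insensitive. The word list here is made up for illustration:

```python
import json
import re

# In the JSON config file, a regex backslash must itself be escaped,
# so a word-boundary entry appears in the file as "\\bthe\\b".
config_text = '{"filtered_words": ["\\\\bthe\\\\b", "spoiler"]}'
filtered_words = json.loads(config_text)["filtered_words"]

# Every entry is joined with "|" into a single case-insensitive pattern.
pattern = re.compile("|".join(filtered_words), re.IGNORECASE)

print(bool(pattern.search("THE end")))      # True: boundary-checked match
print(bool(pattern.search("thematic")))     # False: \b rules out "thematic"
print(bool(pattern.search("No Spoilers")))  # True: no boundaries required
```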
@@ -1,6 +1,6 @@
 [project]
 name = "tumblrbot"
-version = "1.8.0"
+version = "1.9.0"
 description = "An updated bot that posts to Tumblr, based on your very own blog!"
 readme = "README.md"
 requires-python = ">= 3.13"
@@ -3,6 +3,7 @@ from collections.abc import Generator
 from itertools import batched
 from json import loads
 from math import ceil
+from pathlib import Path
 from re import search
 from typing import IO, override
 
@@ -55,13 +56,17 @@ class ExamplesWriter(FlowClass):
         yield from data.items()
 
     def get_valid_posts(self) -> Generator[Post]:
+        for path in self.get_data_paths():
+            posts = list(self.get_valid_posts_from_path(path))
+            yield from posts[-self.config.post_limit :]
+
+    def get_valid_posts_from_path(self, path: Path) -> Generator[Post]:
         pattern = re.compile("|".join(self.config.filtered_words), re.IGNORECASE)
-        for data_path in self.get_data_paths():
-            with data_path.open("rb") as fp:
-                for line in fp:
-                    post = Post.model_validate_json(line)
-                    if post.valid_text_post() and not (self.config.filtered_words and pattern.search(post.get_content_text())):
-                        yield post
+        with path.open("rb") as fp:
+            for line in fp:
+                post = Post.model_validate_json(line)
+                if post.valid_text_post() and not (self.config.filtered_words and pattern.search(post.get_content_text())):
+                    yield post
 
     def filter_examples(self) -> None:
         examples = self.config.examples_file.read_text("utf_8").splitlines()
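The refactor above splits the old single generator so that each file's posts can be collected and sliced per blog; the trailing `posts[-limit:]` slice needs a list, since generators cannot be sliced. A stand-in sketch using plain lists in place of JSONL files and `Post` objects:

```python
from collections.abc import Generator

# Stand-ins: each inner list plays the role of one blog's data file.
blogs = [["a1", "a2", "a3"], ["b1", "b2"]]
post_limit = 2  # plays the role of config.post_limit

def valid_posts_from_blog(blog: list[str]) -> Generator[str, None, None]:
    # Stand-in for get_valid_posts_from_path: yield posts passing filters.
    yield from blog

def valid_posts() -> Generator[str, None, None]:
    for blog in blogs:
        # Materialize before slicing: generators don't support [-limit:].
        posts = list(valid_posts_from_blog(blog))
        yield from posts[-post_limit:]

print(list(valid_posts()))  # ['a2', 'a3', 'b1', 'b2']
```

The limit applies per blog, not across the combined stream, which matches the "from each blog" wording in the README change.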
@@ -9,7 +9,7 @@ import tomlkit
 from keyring import get_password, set_password
 from openai.types import ChatModel
 from pwinput import pwinput
-from pydantic import BaseModel, ConfigDict, Field, NonNegativeFloat, PlainSerializer, PositiveFloat, PositiveInt, model_validator
+from pydantic import BaseModel, ConfigDict, Field, NonNegativeFloat, NonNegativeInt, PlainSerializer, PositiveFloat, PositiveInt, model_validator
 from pydantic.json_schema import SkipJsonSchema
 from requests_oauthlib import OAuth1Session
 from rich.panel import Panel
@@ -50,7 +50,8 @@ class Config(FileSyncSettings):
     data_directory: Path = Field(Path("data"), description="Where to store downloaded post data.")
 
     # Writing Examples
-    max_moderation_batch_size: PositiveInt = Field(100, description="How many posts, at most, to submit to the OpenAI moderation API. This is also capped by the API.")
+    post_limit: NonNegativeInt = Field(0, description="The number of the most recent posts from each blog that should be included in the training data.")
+    max_moderation_batch_size: PositiveInt = Field(100, description="The number of posts, at most, to submit to the OpenAI moderation API. This is also capped by the API.")
     custom_prompts_file: Path = Field(Path("custom_prompts.jsonl"), description="Where to read in custom prompts from.")
     filtered_words: list[str] = Field([], description="A case-insensitive list of disallowed words used to filter out training data. Regular expressions are allowed, but must be escaped.")
 
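The new field uses pydantic's `NonNegativeInt` rather than `PositiveInt` because `0` is a meaningful sentinel for `post_limit` ("use every available valid post"), while negative limits should be rejected. A stdlib-only stand-in for that constraint, without pydantic:

```python
# Sketch of the NonNegativeInt constraint applied to post_limit;
# pydantic performs the equivalent check at model-validation time.
def validate_post_limit(value: int) -> int:
    # 0 is legal and means "use every available valid post",
    # which is why the field is NonNegativeInt, not PositiveInt.
    if value < 0:
        raise ValueError("post_limit must be >= 0")
    return value

print(validate_post_limit(0))   # 0
print(validate_post_limit(25))  # 25
```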
File without changes