tumblrbot 1.8.0__tar.gz → 1.9.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: tumblrbot
-Version: 1.8.0
+Version: 1.9.0
 Summary: An updated bot that posts to Tumblr, based on your very own blog!
 Requires-Python: >= 3.13
 Description-Content-Type: text/markdown
@@ -66,6 +66,7 @@ Features:
 1. [Creates examples][Examples] to fine-tune the model from your posts.
    - Filters out posts that contain more than just text data.
    - Filters out posts that contain [configured][config] regular expressions.
+   - Only uses the most recent posts from each blog as [configured][config].
    - Adds custom user messages and assistant responses to the dataset from the [configured][config] file.
 1. Filters out any posts flagged by the [OpenAI Moderation API].
 1. [Uploads examples][Fine-Tune] to [OpenAI] and begins the fine-tuning process.
@@ -81,10 +82,6 @@ Features:
 - Colorful output, progress bars, and post previews using [rich].
 - Automatically keeps the [config] file up-to-date and recreates it if missing.
 
-**To-Do:**
-
-- Create training data from a sample of posts (possible).
-
 **Known Issues:**
 
 - Sometimes, you will get an error about the training file not being found when starting fine-tuning. We do not currently have a fix or workaround for this. You should instead use the online portal for fine-tuning if this continues to happen. Read more in [fine-tuning].
@@ -143,6 +140,13 @@ All file options can include directories that will be created when the program i
 
 All config options that involve *blog identifiers* expect any version of a blog URL, which is explained in more detail in the [Tumblr API documentation on blog identifiers].
 
+A valid post:
+
+- Contains any content.
+- Only has text.
+- Is not an ask.
+- Is not a reblog.
+
 Specific Options:
 
 - `custom_prompts_file` This file should follow the following file format:
@@ -155,6 +159,7 @@ Specific Options:
 
   To be specific, it should follow the [JSON Lines] file format with one collection of name/value pairs (a dictionary) per line. You can validate your file using the [JSON Lines Validator].
 
+- **`post_limit`** - At most, this many valid posts will be included in the training data. This is effectively a filter that selects the `N` most recent valid posts from each blog. `0` will use every available valid post.
 - **`filtered_words`** - During training data generation, any posts with the specified words will be removed. Word boundaries are not checked by default, so "the" will also filter out posts with "them" or "thematic". This setting supports regular expressions, so you can explicitly look for word boundaries by surrounding an entry with "\\\b", i.e. "\\\bthe\\\b". Regular expressions have to be escaped like so due to how JSON data is read in. If you are familiar with regular expressions, it could be useful for you to know that every entry is joined with a "|" which is then used to search the post content for any matches.
 - **`developer_message`** - This message is used for fine-tuning the AI as well as for generating prompts. If you change this, you will need to run the fine-tuning again with the new value before generating posts.
 - **`user_message`** - This setting is used and works in the same way as `developer_message`.
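The `0` sentinel described for `post_limit` works because of Python's negative-slice semantics; a minimal sketch with a made-up list of post IDs (not tumblrbot's real data structures):

```python
# Posts ordered oldest to newest, as downloaded per blog.
posts = ["p1", "p2", "p3", "p4", "p5"]

def most_recent(posts: list[str], post_limit: int) -> list[str]:
    # posts[-post_limit:] keeps the last post_limit items; since -0 == 0,
    # a limit of 0 degrades to posts[0:], i.e. every available post.
    return posts[-post_limit:]

print(most_recent(posts, 2))  # ['p4', 'p5']
print(most_recent(posts, 0))  # ['p1', 'p2', 'p3', 'p4', 'p5']
```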
@@ -48,6 +48,7 @@ Features:
 1. [Creates examples][Examples] to fine-tune the model from your posts.
    - Filters out posts that contain more than just text data.
    - Filters out posts that contain [configured][config] regular expressions.
+   - Only uses the most recent posts from each blog as [configured][config].
    - Adds custom user messages and assistant responses to the dataset from the [configured][config] file.
 1. Filters out any posts flagged by the [OpenAI Moderation API].
 1. [Uploads examples][Fine-Tune] to [OpenAI] and begins the fine-tuning process.
@@ -63,10 +64,6 @@ Features:
 - Colorful output, progress bars, and post previews using [rich].
 - Automatically keeps the [config] file up-to-date and recreates it if missing.
 
-**To-Do:**
-
-- Create training data from a sample of posts (possible).
-
 **Known Issues:**
 
 - Sometimes, you will get an error about the training file not being found when starting fine-tuning. We do not currently have a fix or workaround for this. You should instead use the online portal for fine-tuning if this continues to happen. Read more in [fine-tuning].
@@ -125,6 +122,13 @@ All file options can include directories that will be created when the program i
 
 All config options that involve *blog identifiers* expect any version of a blog URL, which is explained in more detail in the [Tumblr API documentation on blog identifiers].
 
+A valid post:
+
+- Contains any content.
+- Only has text.
+- Is not an ask.
+- Is not a reblog.
+
 Specific Options:
 
 - `custom_prompts_file` This file should follow the following file format:
@@ -137,6 +141,7 @@ Specific Options:
 
   To be specific, it should follow the [JSON Lines] file format with one collection of name/value pairs (a dictionary) per line. You can validate your file using the [JSON Lines Validator].
 
+- **`post_limit`** - At most, this many valid posts will be included in the training data. This is effectively a filter that selects the `N` most recent valid posts from each blog. `0` will use every available valid post.
 - **`filtered_words`** - During training data generation, any posts with the specified words will be removed. Word boundaries are not checked by default, so "the" will also filter out posts with "them" or "thematic". This setting supports regular expressions, so you can explicitly look for word boundaries by surrounding an entry with "\\\b", i.e. "\\\bthe\\\b". Regular expressions have to be escaped like so due to how JSON data is read in. If you are familiar with regular expressions, it could be useful for you to know that every entry is joined with a "|" which is then used to search the post content for any matches.
 - **`developer_message`** - This message is used for fine-tuning the AI as well as for generating prompts. If you change this, you will need to run the fine-tuning again with the new value before generating posts.
 - **`user_message`** - This setting is used and works in the same way as `developer_message`.
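The escaping rules for `filtered_words` can be demonstrated end to end: in the JSON text each regex backslash is itself escaped, the parsed entries are joined with `|`, and the resulting pattern is case-insensitive. The word list here is made up for illustration:

```python
import json
import re

# In the JSON config file, a regex backslash must itself be escaped,
# so a word-boundary entry appears in the file as "\\bthe\\b".
config_text = '{"filtered_words": ["\\\\bthe\\\\b", "spoiler"]}'
filtered_words = json.loads(config_text)["filtered_words"]

# Every entry is joined with "|" into a single case-insensitive pattern.
pattern = re.compile("|".join(filtered_words), re.IGNORECASE)

print(bool(pattern.search("THE end")))      # True: boundary-checked match
print(bool(pattern.search("thematic")))     # False: \b rules out "thematic"
print(bool(pattern.search("No Spoilers")))  # True: no boundaries required
```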
@@ -1,6 +1,6 @@
 [project]
 name = "tumblrbot"
-version = "1.8.0"
+version = "1.9.0"
 description = "An updated bot that posts to Tumblr, based on your very own blog!"
 readme = "README.md"
 requires-python = ">= 3.13"
@@ -3,6 +3,7 @@ from collections.abc import Generator
 from itertools import batched
 from json import loads
 from math import ceil
+from pathlib import Path
 from re import search
 from typing import IO, override
 
@@ -55,13 +56,17 @@ class ExamplesWriter(FlowClass):
         yield from data.items()
 
     def get_valid_posts(self) -> Generator[Post]:
+        for path in self.get_data_paths():
+            posts = list(self.get_valid_posts_from_path(path))
+            yield from posts[-self.config.post_limit :]
+
+    def get_valid_posts_from_path(self, path: Path) -> Generator[Post]:
         pattern = re.compile("|".join(self.config.filtered_words), re.IGNORECASE)
-        for data_path in self.get_data_paths():
-            with data_path.open("rb") as fp:
-                for line in fp:
-                    post = Post.model_validate_json(line)
-                    if post.valid_text_post() and not (self.config.filtered_words and pattern.search(post.get_content_text())):
-                        yield post
+        with path.open("rb") as fp:
+            for line in fp:
+                post = Post.model_validate_json(line)
+                if post.valid_text_post() and not (self.config.filtered_words and pattern.search(post.get_content_text())):
+                    yield post
 
     def filter_examples(self) -> None:
         examples = self.config.examples_file.read_text("utf_8").splitlines()
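The refactor above splits the old single generator so that each file's posts can be collected and sliced per blog; the trailing `posts[-limit:]` slice needs a list, since generators cannot be sliced. A stand-in sketch using plain lists in place of JSONL files and `Post` objects:

```python
from collections.abc import Generator

# Stand-ins: each inner list plays the role of one blog's data file.
blogs = [["a1", "a2", "a3"], ["b1", "b2"]]
post_limit = 2  # plays the role of config.post_limit

def valid_posts_from_blog(blog: list[str]) -> Generator[str, None, None]:
    # Stand-in for get_valid_posts_from_path: yield posts passing filters.
    yield from blog

def valid_posts() -> Generator[str, None, None]:
    for blog in blogs:
        # Materialize before slicing: generators don't support [-limit:].
        posts = list(valid_posts_from_blog(blog))
        yield from posts[-post_limit:]

print(list(valid_posts()))  # ['a2', 'a3', 'b1', 'b2']
```

The limit applies per blog, not across the combined stream, which matches the "from each blog" wording in the README change.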
@@ -9,7 +9,7 @@ import tomlkit
 from keyring import get_password, set_password
 from openai.types import ChatModel
 from pwinput import pwinput
-from pydantic import BaseModel, ConfigDict, Field, NonNegativeFloat, PlainSerializer, PositiveFloat, PositiveInt, model_validator
+from pydantic import BaseModel, ConfigDict, Field, NonNegativeFloat, NonNegativeInt, PlainSerializer, PositiveFloat, PositiveInt, model_validator
 from pydantic.json_schema import SkipJsonSchema
 from requests_oauthlib import OAuth1Session
 from rich.panel import Panel
@@ -50,7 +50,8 @@ class Config(FileSyncSettings):
     data_directory: Path = Field(Path("data"), description="Where to store downloaded post data.")
 
     # Writing Examples
-    max_moderation_batch_size: PositiveInt = Field(100, description="How many posts, at most, to submit to the OpenAI moderation API. This is also capped by the API.")
+    post_limit: NonNegativeInt = Field(0, description="The number of the most recent posts from each blog that should be included in the training data.")
+    max_moderation_batch_size: PositiveInt = Field(100, description="The number of posts, at most, to submit to the OpenAI moderation API. This is also capped by the API.")
     custom_prompts_file: Path = Field(Path("custom_prompts.jsonl"), description="Where to read in custom prompts from.")
     filtered_words: list[str] = Field([], description="A case-insensitive list of disallowed words used to filter out training data. Regular expressions are allowed, but must be escaped.")
 
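The new field uses pydantic's `NonNegativeInt` rather than `PositiveInt` because `0` is a meaningful sentinel for `post_limit` ("use every available valid post"), while negative limits should be rejected. A stdlib-only stand-in for that constraint, without pydantic:

```python
# Sketch of the NonNegativeInt constraint applied to post_limit;
# pydantic performs the equivalent check at model-validation time.
def validate_post_limit(value: int) -> int:
    # 0 is legal and means "use every available valid post",
    # which is why the field is NonNegativeInt, not PositiveInt.
    if value < 0:
        raise ValueError("post_limit must be >= 0")
    return value

print(validate_post_limit(0))   # 0
print(validate_post_limit(25))  # 25
```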
File without changes