tumblrbot 1.8.0__tar.gz → 1.9.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {tumblrbot-1.8.0 → tumblrbot-1.9.0}/PKG-INFO +10 -5
- {tumblrbot-1.8.0 → tumblrbot-1.9.0}/README.md +9 -4
- {tumblrbot-1.8.0 → tumblrbot-1.9.0}/pyproject.toml +1 -1
- {tumblrbot-1.8.0 → tumblrbot-1.9.0}/src/tumblrbot/flow/examples.py +11 -6
- {tumblrbot-1.8.0 → tumblrbot-1.9.0}/src/tumblrbot/utils/models.py +3 -2
- {tumblrbot-1.8.0 → tumblrbot-1.9.0}/.github/FUNDING.yml +0 -0
- {tumblrbot-1.8.0 → tumblrbot-1.9.0}/.github/dependabot.yml +0 -0
- {tumblrbot-1.8.0 → tumblrbot-1.9.0}/.gitignore +0 -0
- {tumblrbot-1.8.0 → tumblrbot-1.9.0}/UNLICENSE +0 -0
- {tumblrbot-1.8.0 → tumblrbot-1.9.0}/sample_custom_prompts.jsonl +0 -0
- {tumblrbot-1.8.0 → tumblrbot-1.9.0}/src/tumblrbot/__init__.py +0 -0
- {tumblrbot-1.8.0 → tumblrbot-1.9.0}/src/tumblrbot/__main__.py +0 -0
- {tumblrbot-1.8.0 → tumblrbot-1.9.0}/src/tumblrbot/flow/__init__.py +0 -0
- {tumblrbot-1.8.0 → tumblrbot-1.9.0}/src/tumblrbot/flow/download.py +0 -0
- {tumblrbot-1.8.0 → tumblrbot-1.9.0}/src/tumblrbot/flow/fine_tune.py +0 -0
- {tumblrbot-1.8.0 → tumblrbot-1.9.0}/src/tumblrbot/flow/generate.py +0 -0
- {tumblrbot-1.8.0 → tumblrbot-1.9.0}/src/tumblrbot/utils/__init__.py +0 -0
- {tumblrbot-1.8.0 → tumblrbot-1.9.0}/src/tumblrbot/utils/common.py +0 -0
- {tumblrbot-1.8.0 → tumblrbot-1.9.0}/src/tumblrbot/utils/tumblr.py +0 -0
{tumblrbot-1.8.0 → tumblrbot-1.9.0}/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: tumblrbot
-Version: 1.
+Version: 1.9.0
 Summary: An updated bot that posts to Tumblr, based on your very own blog!
 Requires-Python: >= 3.13
 Description-Content-Type: text/markdown
@@ -66,6 +66,7 @@ Features:
 1. [Creates examples][Examples] to fine-tune the model from your posts.
 - Filters out posts that contain more than just text data.
 - Filters out posts that contain [configured][config] regular expressions.
+- Only uses the most recent posts from each blog as [configured][config].
 - Adds custom user messages and assistant responses to the dataset from the [configured][config] file.
 1. Filters out any posts flagged by the [OpenAI Moderation API].
 1. [Uploads examples][Fine-Tune] to [OpenAI] and begins the fine-tuning process.
@@ -81,10 +82,6 @@ Features:
 - Colorful output, progress bars, and post previews using [rich].
 - Automatically keeps the [config] file up-to-date and recreates it if missing.
 
-**To-Do:**
-
-- Create training data from a sample of posts (possible).
-
 **Known Issues:**
 
 - Sometimes, you will get an error about the training file not being found when starting fine-tuning. We do not currently have a fix or workaround for this. You should instead use the online portal for fine-tuning if this continues to happen. Read more in [fine-tuning].
@@ -143,6 +140,13 @@ All file options can include directories that will be created when the program i
 
 All config options that involve *blog identifiers* expect any version of a blog URL, which is explained in more detail in the [Tumblr API documentation on blog identifiers].
 
+A valid post:
+
+- Contains any content.
+- Only has text.
+- Is not an ask.
+- Is not a reblog.
+
 Specific Options:
 
 - `custom_prompts_file` This file should follow the following file format:
@@ -155,6 +159,7 @@ Specific Options:
 
 To be specific, it should follow the [JSON Lines] file format with one collection of name/value pairs (a dictionary) per line. You can validate your file using the [JSON Lines Validator].
 
+- **`post_limit`** - At most, this many valid posts will be included in the training data. This effectively is a filter to select the `N` most recent valid posts from each blog. `0` will use every available valid post.
 - **`filtered_words`** - During training data generation, any posts with the specified words will be removed. Word boundaries are not checked by default, so "the" will also filter out posts with "them" or "thematic". This setting supports regular expressions, so you can explicitly look for word boundaries by surrounding an entry with "\\\b", i.e. "\\\bthe\\\b". Regular expressions have to be escaped like so due to how JSON data is read in. If you are familiar with regular expressions, it could be useful for you to know that every entry is joined with a "|" which is then used to search the post content for any matches.
 - **`developer_message`** - This message is used in for fine-tuning the AI as well as generating prompts. If you change this, you will need to run the fine-tuning again with the new value before generating posts.
 - **`user_message`** - This setting is used and works in the same way as `developer_message`.
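The `filtered_words` description above (entries joined with "|" and searched case-insensitively) can be illustrated with a small sketch. This is not the package's code: `build_pattern` and `is_filtered` are hypothetical helper names, and the sample entries are made up for illustration.

```python
import re


def build_pattern(filtered_words: list[str]) -> re.Pattern[str]:
    # Join every entry with "|" so that a match on any single entry counts.
    return re.compile("|".join(filtered_words), re.IGNORECASE)


def is_filtered(pattern: re.Pattern[str], text: str) -> bool:
    # search() looks for a match anywhere in the text, not only at the start.
    return pattern.search(text) is not None


# Without word boundaries, "the" also matches inside longer words.
loose = build_pattern(["the"])
# With \b on both sides, only the standalone word matches.
strict = build_pattern([r"\bthe\b"])

print(is_filtered(loose, "A thematic post"))
print(is_filtered(strict, "A thematic post"))
print(is_filtered(strict, "Read The Docs"))
```

One detail worth knowing: joining an empty list yields an empty pattern, and an empty pattern matches every string. That is presumably why the `examples.py` hunk in this diff checks `self.config.filtered_words` before calling `pattern.search`.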
{tumblrbot-1.8.0 → tumblrbot-1.9.0}/README.md
@@ -48,6 +48,7 @@ Features:
 1. [Creates examples][Examples] to fine-tune the model from your posts.
 - Filters out posts that contain more than just text data.
 - Filters out posts that contain [configured][config] regular expressions.
+- Only uses the most recent posts from each blog as [configured][config].
 - Adds custom user messages and assistant responses to the dataset from the [configured][config] file.
 1. Filters out any posts flagged by the [OpenAI Moderation API].
 1. [Uploads examples][Fine-Tune] to [OpenAI] and begins the fine-tuning process.
@@ -63,10 +64,6 @@ Features:
 - Colorful output, progress bars, and post previews using [rich].
 - Automatically keeps the [config] file up-to-date and recreates it if missing.
 
-**To-Do:**
-
-- Create training data from a sample of posts (possible).
-
 **Known Issues:**
 
 - Sometimes, you will get an error about the training file not being found when starting fine-tuning. We do not currently have a fix or workaround for this. You should instead use the online portal for fine-tuning if this continues to happen. Read more in [fine-tuning].
@@ -125,6 +122,13 @@ All file options can include directories that will be created when the program i
 
 All config options that involve *blog identifiers* expect any version of a blog URL, which is explained in more detail in the [Tumblr API documentation on blog identifiers].
 
+A valid post:
+
+- Contains any content.
+- Only has text.
+- Is not an ask.
+- Is not a reblog.
+
 Specific Options:
 
 - `custom_prompts_file` This file should follow the following file format:
@@ -137,6 +141,7 @@ Specific Options:
 
 To be specific, it should follow the [JSON Lines] file format with one collection of name/value pairs (a dictionary) per line. You can validate your file using the [JSON Lines Validator].
 
+- **`post_limit`** - At most, this many valid posts will be included in the training data. This effectively is a filter to select the `N` most recent valid posts from each blog. `0` will use every available valid post.
 - **`filtered_words`** - During training data generation, any posts with the specified words will be removed. Word boundaries are not checked by default, so "the" will also filter out posts with "them" or "thematic". This setting supports regular expressions, so you can explicitly look for word boundaries by surrounding an entry with "\\\b", i.e. "\\\bthe\\\b". Regular expressions have to be escaped like so due to how JSON data is read in. If you are familiar with regular expressions, it could be useful for you to know that every entry is joined with a "|" which is then used to search the post content for any matches.
 - **`developer_message`** - This message is used in for fine-tuning the AI as well as generating prompts. If you change this, you will need to run the fine-tuning again with the new value before generating posts.
 - **`user_message`** - This setting is used and works in the same way as `developer_message`.
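The JSON Lines requirement for `custom_prompts_file` described above (one dictionary of name/value pairs per line) can be sketched as follows. The exact file layout is not visible in this diff, so the sample content is hypothetical, as is the reading that each name is a user message and each value an assistant response. `read_prompts` is an illustrative helper, not a function from the package.

```python
from json import loads

# Hypothetical file content: one JSON object per line.
sample = (
    '{"What is your favorite color?": "Blue, obviously."}\n'
    '{"Tell me a joke.": "No.", "Say hi.": "hi"}\n'
)


def read_prompts(text: str) -> list[tuple[str, str]]:
    """Parse JSON Lines text into (name, value) pairs, one object per line."""
    pairs: list[tuple[str, str]] = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines in this sketch
        data = loads(line)
        if not isinstance(data, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        pairs.extend(data.items())
    return pairs


for name, value in read_prompts(sample):
    print(f"{name!r} -> {value!r}")
```

A line that holds a list or a bare string is still valid JSON but not a dictionary, which is why the sketch checks the parsed type explicitly.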
{tumblrbot-1.8.0 → tumblrbot-1.9.0}/src/tumblrbot/flow/examples.py
@@ -3,6 +3,7 @@ from collections.abc import Generator
 from itertools import batched
 from json import loads
 from math import ceil
+from pathlib import Path
 from re import search
 from typing import IO, override
 
@@ -55,13 +56,17 @@ class ExamplesWriter(FlowClass):
         yield from data.items()
 
     def get_valid_posts(self) -> Generator[Post]:
+        for path in self.get_data_paths():
+            posts = list(self.get_valid_posts_from_path(path))
+            yield from posts[-self.config.post_limit :]
+
+    def get_valid_posts_from_path(self, path: Path) -> Generator[Post]:
         pattern = re.compile("|".join(self.config.filtered_words), re.IGNORECASE)
-
-
-
-
-
-                        yield post
+        with path.open("rb") as fp:
+            for line in fp:
+                post = Post.model_validate_json(line)
+                if post.valid_text_post() and not (self.config.filtered_words and pattern.search(post.get_content_text())):
+                    yield post
 
     def filter_examples(self) -> None:
         examples = self.config.examples_file.read_text("utf_8").splitlines()
{tumblrbot-1.8.0 → tumblrbot-1.9.0}/src/tumblrbot/utils/models.py
@@ -9,7 +9,7 @@ import tomlkit
 from keyring import get_password, set_password
 from openai.types import ChatModel
 from pwinput import pwinput
-from pydantic import BaseModel, ConfigDict, Field, NonNegativeFloat, PlainSerializer, PositiveFloat, PositiveInt, model_validator
+from pydantic import BaseModel, ConfigDict, Field, NonNegativeFloat, NonNegativeInt, PlainSerializer, PositiveFloat, PositiveInt, model_validator
 from pydantic.json_schema import SkipJsonSchema
 from requests_oauthlib import OAuth1Session
 from rich.panel import Panel
@@ -50,7 +50,8 @@ class Config(FileSyncSettings):
     data_directory: Path = Field(Path("data"), description="Where to store downloaded post data.")
 
     # Writing Examples
-
+    post_limit: NonNegativeInt = Field(0, description="The number of the most recent posts from each blog that should be included in the training data.")
+    max_moderation_batch_size: PositiveInt = Field(100, description="The number of posts, at most, to submit to the OpenAI moderation API. This is also capped by the API.")
     custom_prompts_file: Path = Field(Path("custom_prompts.jsonl"), description="Where to read in custom prompts from.")
     filtered_words: list[str] = Field([], description="A case-insensitive list of disallowed words used to filter out training data. Regular expressions are allowed, but must be escaped.")
 
File without changes