tumblrbot 1.7.0.tar.gz → 1.8.0.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {tumblrbot-1.7.0 → tumblrbot-1.8.0}/PKG-INFO +5 -3
- {tumblrbot-1.7.0 → tumblrbot-1.8.0}/README.md +4 -2
- {tumblrbot-1.7.0 → tumblrbot-1.8.0}/pyproject.toml +1 -1
- {tumblrbot-1.7.0 → tumblrbot-1.8.0}/src/tumblrbot/flow/examples.py +3 -1
- {tumblrbot-1.7.0 → tumblrbot-1.8.0}/src/tumblrbot/flow/generate.py +28 -16
- {tumblrbot-1.7.0 → tumblrbot-1.8.0}/src/tumblrbot/utils/models.py +3 -2
- {tumblrbot-1.7.0 → tumblrbot-1.8.0}/src/tumblrbot/utils/tumblr.py +1 -3
- {tumblrbot-1.7.0 → tumblrbot-1.8.0}/.github/FUNDING.yml +0 -0
- {tumblrbot-1.7.0 → tumblrbot-1.8.0}/.github/dependabot.yml +0 -0
- {tumblrbot-1.7.0 → tumblrbot-1.8.0}/.gitignore +0 -0
- {tumblrbot-1.7.0 → tumblrbot-1.8.0}/UNLICENSE +0 -0
- {tumblrbot-1.7.0 → tumblrbot-1.8.0}/sample_custom_prompts.jsonl +0 -0
- {tumblrbot-1.7.0 → tumblrbot-1.8.0}/src/tumblrbot/__init__.py +0 -0
- {tumblrbot-1.7.0 → tumblrbot-1.8.0}/src/tumblrbot/__main__.py +0 -0
- {tumblrbot-1.7.0 → tumblrbot-1.8.0}/src/tumblrbot/flow/__init__.py +0 -0
- {tumblrbot-1.7.0 → tumblrbot-1.8.0}/src/tumblrbot/flow/download.py +0 -0
- {tumblrbot-1.7.0 → tumblrbot-1.8.0}/src/tumblrbot/flow/fine_tune.py +0 -0
- {tumblrbot-1.7.0 → tumblrbot-1.8.0}/src/tumblrbot/utils/__init__.py +0 -0
- {tumblrbot-1.7.0 → tumblrbot-1.8.0}/src/tumblrbot/utils/common.py +0 -0
{tumblrbot-1.7.0 → tumblrbot-1.8.0}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: tumblrbot
-Version: 1.7.0
+Version: 1.8.0
 Summary: An updated bot that posts to Tumblr, based on your very own blog!
 Requires-Python: >= 3.13
 Description-Content-Type: text/markdown
@@ -65,6 +65,7 @@ Features:
 - Shows progress and previews the current post.
 1. [Creates examples][Examples] to fine-tune the model from your posts.
    - Filters out posts that contain more than just text data.
+   - Filters out posts that contain [configured][config] regular expressions.
    - Adds custom user messages and assistant responses to the dataset from the [configured][config] file.
 1. Filters out any posts flagged by the [OpenAI Moderation API].
 1. [Uploads examples][Fine-Tune] to [OpenAI] and begins the fine-tuning process.
@@ -83,7 +84,6 @@ Features:
 **To-Do:**
 
 - Create training data from a sample of posts (possible).
-- User-specified list of words that will filter out posts.
 
 **Known Issues:**
 
@@ -131,7 +131,7 @@ API tokens can be created here: [Tumblr Tokens].
 1. You now have access to your `consumer key` next to `Oauth Consumer Key`.
 1. Press `Show secret key` to see your `Consumer Secret`.
 
-When running this program, you will be prompted to enter all of these tokens.
+When running this program, you will be prompted to enter all of these tokens. If something goes wrong while entering the tokens, you can always reset them by running the program again and answering `y` to the relevant prompt.
 
 After inputting the [Tumblr] tokens, you will be given a URL that you need to open in your browser. Press `Allow`, then copy and paste the URL of the page you are redirected to into the console.
 
@@ -155,6 +155,7 @@ Specific Options:
 
 To be specific, it should follow the [JSON Lines] file format with one collection of name/value pairs (a dictionary) per line. You can validate your file using the [JSON Lines Validator].
 
+- **`filtered_words`** - During training data generation, any posts with the specified words will be removed. Word boundaries are not checked by default, so "the" will also filter out posts with "them" or "thematic". This setting supports regular expressions, so you can explicitly look for word boundaries by surrounding an entry with "\\\b", i.e. "\\\bthe\\\b". Regular expressions have to be escaped like so due to how JSON data is read in. If you are familiar with regular expressions, it could be useful for you to know that every entry is joined with a "|" which is then used to search the post content for any matches.
 - **`developer_message`** - This message is used in for fine-tuning the AI as well as generating prompts. If you change this, you will need to run the fine-tuning again with the new value before generating posts.
 - **`user_message`** - This setting is used and works in the same way as `developer_message`.
 - **`expected_epochs`** - The default value here is the default number of epochs for `base_model`. You may have to change this value if you change `base_model`. After running fine-tuning once, you will see the number of epochs used in the [fine-tuning portal] under *Hyperparameters*. This value will also be updated automatically if you run fine-tuning through this program.
@@ -163,6 +164,7 @@ Specific Options:
 - **`base_model`** - This value is used to choose the tokenizer for estimating fine-tuning costs. It is also the base model that will be fine-tuned and the model that is used to generate tags. You can find a list of options in the [fine-tuning portal] by pressing `+ Create` and opening the drop-down list for `Base Model`. Be sure to update `token_price` if you change this value.
 - **`fine_tuned_model`** - Set automatically after monitoring fine-tuning if the job has succeeded. You can read more in [fine-tuning].
 - **`tags_chance`** - This should be between 0 and 1. Setting it to 0 corresponds to a 0% chance (never) to add tags to a post. 1 corresponds to a 100% chance (always) to add tags to a post. Adding tags incurs a very small token cost.
+- **`reblog_blog_identifiers`** - Whenever a reblog is attempted, a random blog from this list will be chosen to be reblogged from.
 - **`reblog_chance`** - This setting works the same way as `tags_chance`.
 - **`reblog_user_message`** - This setting is a prefix that is directly prepended to the contents of the post being reblogged.
{tumblrbot-1.7.0 → tumblrbot-1.8.0}/README.md

@@ -47,6 +47,7 @@ Features:
 - Shows progress and previews the current post.
 1. [Creates examples][Examples] to fine-tune the model from your posts.
    - Filters out posts that contain more than just text data.
+   - Filters out posts that contain [configured][config] regular expressions.
    - Adds custom user messages and assistant responses to the dataset from the [configured][config] file.
 1. Filters out any posts flagged by the [OpenAI Moderation API].
 1. [Uploads examples][Fine-Tune] to [OpenAI] and begins the fine-tuning process.
@@ -65,7 +66,6 @@ Features:
 **To-Do:**
 
 - Create training data from a sample of posts (possible).
-- User-specified list of words that will filter out posts.
 
 **Known Issues:**
 
@@ -113,7 +113,7 @@ API tokens can be created here: [Tumblr Tokens].
 1. You now have access to your `consumer key` next to `Oauth Consumer Key`.
 1. Press `Show secret key` to see your `Consumer Secret`.
 
-When running this program, you will be prompted to enter all of these tokens.
+When running this program, you will be prompted to enter all of these tokens. If something goes wrong while entering the tokens, you can always reset them by running the program again and answering `y` to the relevant prompt.
 
 After inputting the [Tumblr] tokens, you will be given a URL that you need to open in your browser. Press `Allow`, then copy and paste the URL of the page you are redirected to into the console.
 
@@ -137,6 +137,7 @@ Specific Options:
 
 To be specific, it should follow the [JSON Lines] file format with one collection of name/value pairs (a dictionary) per line. You can validate your file using the [JSON Lines Validator].
 
+- **`filtered_words`** - During training data generation, any posts with the specified words will be removed. Word boundaries are not checked by default, so "the" will also filter out posts with "them" or "thematic". This setting supports regular expressions, so you can explicitly look for word boundaries by surrounding an entry with "\\\b", i.e. "\\\bthe\\\b". Regular expressions have to be escaped like so due to how JSON data is read in. If you are familiar with regular expressions, it could be useful for you to know that every entry is joined with a "|" which is then used to search the post content for any matches.
 - **`developer_message`** - This message is used in for fine-tuning the AI as well as generating prompts. If you change this, you will need to run the fine-tuning again with the new value before generating posts.
 - **`user_message`** - This setting is used and works in the same way as `developer_message`.
 - **`expected_epochs`** - The default value here is the default number of epochs for `base_model`. You may have to change this value if you change `base_model`. After running fine-tuning once, you will see the number of epochs used in the [fine-tuning portal] under *Hyperparameters*. This value will also be updated automatically if you run fine-tuning through this program.
@@ -145,6 +146,7 @@ Specific Options:
 - **`base_model`** - This value is used to choose the tokenizer for estimating fine-tuning costs. It is also the base model that will be fine-tuned and the model that is used to generate tags. You can find a list of options in the [fine-tuning portal] by pressing `+ Create` and opening the drop-down list for `Base Model`. Be sure to update `token_price` if you change this value.
 - **`fine_tuned_model`** - Set automatically after monitoring fine-tuning if the job has succeeded. You can read more in [fine-tuning].
 - **`tags_chance`** - This should be between 0 and 1. Setting it to 0 corresponds to a 0% chance (never) to add tags to a post. 1 corresponds to a 100% chance (always) to add tags to a post. Adding tags incurs a very small token cost.
+- **`reblog_blog_identifiers`** - Whenever a reblog is attempted, a random blog from this list will be chosen to be reblogged from.
 - **`reblog_chance`** - This setting works the same way as `tags_chance`.
 - **`reblog_user_message`** - This setting is a prefix that is directly prepended to the contents of the post being reblogged.
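The `filtered_words` behavior described above (backslashes doubled for JSON, every entry joined with "|", matched case-insensitively) can be sketched as follows; the config fragment and post texts here are hypothetical, not from the package:

```python
import json
import re

# Hypothetical config fragment. JSON string escaping doubles the backslashes,
# so "\\bthe\\b" in the file parses into the regex \bthe\b.
raw = '{"filtered_words": ["\\\\bthe\\\\b", "spoilers"]}'
filtered_words = json.loads(raw)["filtered_words"]

# Every entry is joined with "|" into one case-insensitive pattern.
pattern = re.compile("|".join(filtered_words), re.IGNORECASE)

posts = ["The cat sat.", "them and us", "No Spoilers here!", "a clean post"]
kept = [post for post in posts if not pattern.search(post)]
# "The cat sat." is dropped by the \bthe\b entry, while "them and us" survives
# because \b requires a word boundary and "them" has none after "the".
```

Without the `\b` anchors, a bare `"the"` entry would also drop "them and us", which is the default behavior the README warns about.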
{tumblrbot-1.7.0 → tumblrbot-1.8.0}/src/tumblrbot/flow/examples.py

@@ -1,3 +1,4 @@
+import re
 from collections.abc import Generator
 from itertools import batched
 from json import loads
@@ -54,11 +55,12 @@ class ExamplesWriter(FlowClass):
         yield from data.items()
 
     def get_valid_posts(self) -> Generator[Post]:
+        pattern = re.compile("|".join(self.config.filtered_words), re.IGNORECASE)
         for data_path in self.get_data_paths():
             with data_path.open("rb") as fp:
                 for line in fp:
                     post = Post.model_validate_json(line)
-                    if post.valid_text_post():
+                    if post.valid_text_post() and not (self.config.filtered_words and pattern.search(post.get_content_text())):
                         yield post
 
     def filter_examples(self) -> None:
{tumblrbot-1.7.0 → tumblrbot-1.8.0}/src/tumblrbot/flow/generate.py

@@ -1,7 +1,10 @@
-from
+from collections.abc import Iterable
+from functools import cache
+from random import choice, random, sample
 from typing import override
 
 import rich
+from pydantic import ConfigDict
 from rich.prompt import IntPrompt
 
 from tumblrbot.utils.common import FlowClass, PreviewLive
@@ -9,6 +12,8 @@ from tumblrbot.utils.models import Post
 
 
 class DraftGenerator(FlowClass):
+    model_config = ConfigDict(frozen=True)  # Makes this class hashable.
+
     @override
     def main(self) -> None:
         self.config.draft_count = IntPrompt.ask("How many drafts should be generated?", default=self.config.draft_count)
@@ -28,16 +33,16 @@ class DraftGenerator(FlowClass):
         rich.print(f":chart_increasing: [bold green]Generated {self.config.draft_count} draft(s).[/] {message}")
 
     def generate_post(self) -> Post:
-        if self.
-            original = self.get_random_post()
+        if original := self.get_random_post():
             user_message = f"{self.config.reblog_user_message}\n\n{original.get_content_text()}"
         else:
             original = Post()
             user_message = self.config.user_message
-
         text = self.generate_text(user_message)
+
         if tags := self.generate_tags(text):
             tags = tags.tags
+
         return Post(
             content=[Post.Block(type="text", text=text)],
             tags=tags or [],
@@ -65,15 +70,22 @@ class DraftGenerator(FlowClass):
 
         return None
 
-    def get_random_post(self) -> Post:
-
-
-
-
-
-
-
-
-
-
+    def get_random_post(self) -> Post | None:
+        if self.config.reblog_blog_identifiers and random() < self.config.reblog_chance:  # noqa: S311
+            blog_identifier = choice(self.config.reblog_blog_identifiers)  # noqa: S311
+            for offset in self.get_offsets(blog_identifier):
+                for raw_post in self.tumblr.retrieve_published_posts(
+                    blog_identifier,
+                    offset,
+                ).response.posts:
+                    post = Post.model_validate(raw_post)
+                    if post.valid_text_post():
+                        return post
+
+        return None
+
+    @cache  # noqa: B019 # This creates a memory leak, but it doesn't matter since this class isn't discarded until the end of the program anyways.
+    def get_offsets(self, blog_identifier: str) -> Iterable[int]:
+        total = self.tumblr.retrieve_blog_info(blog_identifier).response.blog.posts
+        # The same Iterable object is cached, so reading an element will effectively discard it. This prevents checking the same offsets twice.
+        return iter(sample(range(total), total))
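The new `get_offsets` relies on a subtle trick: `@cache` memoizes the *iterator* itself, so every element any caller reads is consumed permanently and no offset is visited twice. A standalone sketch of the same idea (a module-level function over a plain `total`, rather than the package's cached method on a frozen model — those names are assumptions for illustration):

```python
from collections.abc import Iterable
from functools import cache
from itertools import islice
from random import sample

@cache
def get_offsets(total: int) -> Iterable[int]:
    # sample(range(total), total) yields a full shuffled permutation of
    # 0..total-1. Caching the iterator (not a list) means each element read
    # by a caller is discarded for good.
    return iter(sample(range(total), total))

first = list(islice(get_offsets(10), 4))
second = list(islice(get_offsets(10), 6))
# Both reads hit the same cached iterator, so together they cover all ten
# offsets exactly once, with no repeats between calls.
```

This is why the real method carries the `B019` suppression: caching on an instance method pins the instance for the program's lifetime, which is acceptable here since `DraftGenerator` lives until exit anyway.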
{tumblrbot-1.7.0 → tumblrbot-1.8.0}/src/tumblrbot/utils/models.py

@@ -52,6 +52,7 @@ class Config(FileSyncSettings):
     # Writing Examples
     max_moderation_batch_size: PositiveInt = Field(100, description="How many posts, at most, to submit to the OpenAI moderation API. This is also capped by the API.")
     custom_prompts_file: Path = Field(Path("custom_prompts.jsonl"), description="Where to read in custom prompts from.")
+    filtered_words: list[str] = Field([], description="A case-insensitive list of disallowed words used to filter out training data. Regular expressions are allowed, but must be escaped.")
 
     # Writing Examples & Fine-Tuning
     examples_file: Path = Field(Path("examples.jsonl"), description="Where to output the examples that will be used to fine-tune the model.")
@@ -75,8 +76,8 @@ class Config(FileSyncSettings):
     tags_chance: NonNegativeFloat = Field(0.1, description="The chance to generate tags for any given post. This will use more OpenAI tokens.")
     tags_developer_message: str = Field("You will be provided with a block of text, and your task is to extract a very short list of the most important subjects from it.", description="The developer message used to generate tags.")
     reblog_blog_identifiers: list[str] = Field([], description="The identifiers of blogs that can be reblogged from when generating drafts.")
-    reblog_chance: NonNegativeFloat = Field(0.
-    reblog_user_message: str = Field("Please write a comical Tumblr post in response to the following post
+    reblog_chance: NonNegativeFloat = Field(0.1, description="The chance to generate a reblog of a random post. This will use more OpenAI tokens.")
+    reblog_user_message: str = Field("Please write a comical Tumblr post in response to the following post:", description="The prefix for the user message used to reblog posts.")
 
     @classmethod
     @override
{tumblrbot-1.7.0 → tumblrbot-1.8.0}/src/tumblrbot/utils/tumblr.py

@@ -1,4 +1,4 @@
-from typing import
+from typing import Self
 
 from requests import HTTPError, Response
 from requests_oauthlib import OAuth1Session
@@ -29,14 +29,12 @@ class TumblrSession(OAuth1Session):
     def retrieve_published_posts(
         self,
         blog_identifier: str,
-        type_: Literal["text", "quote", "link", "answer", "video", "audio", "photo", "chat"] | None = None,
         offset: int | None = None,
         after: int | None = None,
     ) -> ResponseModel:
         response = self.get(
             f"https://api.tumblr.com/v2/blog/{blog_identifier}/posts",
             params={
-                "type": type_,
                 "offset": offset,
                 "after": after,
                 "sort": "asc",