tumblrbot 1.6.0__tar.gz → 1.8.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: tumblrbot
-Version: 1.6.0
+Version: 1.8.0
 Summary: An updated bot that posts to Tumblr, based on your very own blog!
 Requires-Python: >= 3.13
 Description-Content-Type: text/markdown
@@ -36,6 +36,8 @@ Project-URL: Source, https://github.com/MaidScientistIzutsumiMarin/tumblrbot
 
 [Tumblr]: https://tumblr.com
 [Tumblr Tokens]: https://tumblr.com/oauth/apps
+[Tumblr API Documentation on Blog Identifiers]: https://tumblr.com/docs/en/api/v2#blog-identifiers
+[Tumblr API Documentation on Rate Limits]: https://tumblr.com/docs/en/api/v2#rate-limits
 
 [Download]: src/tumblrbot/flow/download.py
 [Examples]: src/tumblrbot/flow/examples.py
@@ -58,11 +60,12 @@ Features:
 1. Asks for [OpenAI] and [Tumblr] tokens.
    - Stores API tokens using [keyring].
 1. Retrieves [Tumblr] [OAuth] tokens.
-1. [Downloads posts][Download] from the [configured][config] [Tumblr] blogs.
+1. [Downloads posts][Download] from the [configured][config] blogs.
    - Skips redownloading already downloaded posts.
    - Shows progress and previews the current post.
 1. [Creates examples][Examples] to fine-tune the model from your posts.
    - Filters out posts that contain more than just text data.
+   - Filters out posts that contain [configured][config] regular expressions.
    - Adds custom user messages and assistant responses to the dataset from the [configured][config] file.
 1. Filters out any posts flagged by the [OpenAI Moderation API].
 1. [Uploads examples][Fine-Tune] to [OpenAI] and begins the fine-tuning process.
@@ -70,22 +73,23 @@ Features:
    - Resumes monitoring the same fine-tuning process when restarted.
    - Deletes the uploaded examples file if fine-tuning does not succeed (optional).
    - Stores the output model automatically when fine-tuning is completed.
-1. [Generates and uploads posts][Generate] to the [configured][config] [Tumblr] blog using the [configured][config] fine-tuned model.
+1. [Generates and uploads posts][Generate] to the [configured][config] blog using the [configured][config] fine-tuned model.
    - Creates tags by extracting keywords at the [configured][config] frequency using the [configured][config] model.
-   - Uploads posts as drafts to the [configured][config] [Tumblr] blog.
-   - Reblog posts at the [configured][config] frequency.
+   - Uploads posts as drafts to the [configured][config] blog.
+   - Reblogs posts from the [configured][config] blogs at the [configured][config] frequency.
    - Shows progress and previews the current post.
 - Colorful output, progress bars, and post previews using [rich].
 - Automatically keeps the [config] file up-to-date and recreates it if missing.
 
 **To-Do:**
 
-- ...
+- Create training data from a sample of posts (possible).
 
 **Known Issues:**
 
 - Sometimes, you will get an error about the training file not being found when starting fine-tuning. We do not currently have a fix or workaround for this. You should instead use the online portal for fine-tuning if this continues to happen. Read more in [fine-tuning].
 - Post counts are incorrect when downloading posts. We are not certain what the cause of this is, but our tests suggest this is a [Tumblr] API problem that is giving inaccurate numbers.
+- During post downloading or post generation, you may receive a "Limit Exceeded" error message from the [Tumblr] API. This is caused by server-side rate-limiting by [Tumblr]. The only workaround is trying again or waiting for a period of time before retrying. In most cases, you either have to wait for a minute or an hour for the limits to reset. You can read more about the limits in the [Tumblr API documentation on rate limits].
 
 **Please submit an issue or contact us for features you want added/reimplemented.**
 
@@ -127,7 +131,7 @@ API tokens can be created here: [Tumblr Tokens].
 1. You now have access to your `consumer key` next to `Oauth Consumer Key`.
 1. Press `Show secret key` to see your `Consumer Secret`.
 
-When running this program, you will be prompted to enter all of these tokens. **The fields are password-protected, so there will be no output to the console.** If something goes wrong while entering the tokens, you can always reset them by running the program again and answering `y` to the relevant prompt.
+When running this program, you will be prompted to enter all of these tokens. If something goes wrong while entering the tokens, you can always reset them by running the program again and answering `y` to the relevant prompt.
 
 After inputting the [Tumblr] tokens, you will be given a URL that you need to open in your browser. Press `Allow`, then copy and paste the URL of the page you are redirected to into the console.
 
@@ -137,6 +141,10 @@ All config options can be found in `config.toml` after running the program once.
 
 All file options can include directories that will be created when the program is run.
 
+All config options that involve *blog identifiers* expect any version of a blog URL, which is explained in more detail in the [Tumblr API documentation on blog identifiers].
+
+Specific Options:
+
 - `custom_prompts_file` This file should follow the following file format:
 
   ```json
@@ -147,14 +155,18 @@ All file options can include directories that will be created when the program i
 
 To be specific, it should follow the [JSON Lines] file format with one collection of name/value pairs (a dictionary) per line. You can validate your file using the [JSON Lines Validator].
 
+- **`filtered_words`** - During training data generation, any posts with the specified words will be removed. Word boundaries are not checked by default, so "the" will also filter out posts with "them" or "thematic". This setting supports regular expressions, so you can explicitly look for word boundaries by surrounding an entry with "\\\b", i.e. "\\\bthe\\\b". Regular expressions have to be escaped like so due to how JSON data is read in. If you are familiar with regular expressions, it could be useful for you to know that every entry is joined with a "|" which is then used to search the post content for any matches.
 - **`developer_message`** - This message is used in for fine-tuning the AI as well as generating prompts. If you change this, you will need to run the fine-tuning again with the new value before generating posts.
-- **`user_message`** - This message is used in the same way as `developer_message` and should be treated the same.
+- **`user_message`** - This setting is used and works in the same way as `developer_message`.
 - **`expected_epochs`** - The default value here is the default number of epochs for `base_model`. You may have to change this value if you change `base_model`. After running fine-tuning once, you will see the number of epochs used in the [fine-tuning portal] under *Hyperparameters*. This value will also be updated automatically if you run fine-tuning through this program.
 - **`token_price`** - The default value here is the default token price for `base_model`. You can find the up-to-date value in [OpenAI Pricing], in the *Training* column.
 - **`job_id`** - If there is any value here, this program will resume monitoring the corresponding job, instead of starting a new one. This gets set when starting the fine-tuning and is cleared when it is completed. You can read more in [fine-tuning].
 - **`base_model`** - This value is used to choose the tokenizer for estimating fine-tuning costs. It is also the base model that will be fine-tuned and the model that is used to generate tags. You can find a list of options in the [fine-tuning portal] by pressing `+ Create` and opening the drop-down list for `Base Model`. Be sure to update `token_price` if you change this value.
 - **`fine_tuned_model`** - Set automatically after monitoring fine-tuning if the job has succeeded. You can read more in [fine-tuning].
 - **`tags_chance`** - This should be between 0 and 1. Setting it to 0 corresponds to a 0% chance (never) to add tags to a post. 1 corresponds to a 100% chance (always) to add tags to a post. Adding tags incurs a very small token cost.
+- **`reblog_blog_identifiers`** - Whenever a reblog is attempted, a random blog from this list will be chosen to be reblogged from.
+- **`reblog_chance`** - This setting works the same way as `tags_chance`.
+- **`reblog_user_message`** - This setting is a prefix that is directly prepended to the contents of the post being reblogged.
 
 ## Manual Fine-Tuning
 
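The `filtered_words` behavior described above (entries joined with `|`, searched case-insensitively against post content) can be sketched in a few lines of Python. The word list here is purely illustrative, not taken from the package:

```python
import re

# Hypothetical entries standing in for the user's configured list. In the
# JSON config the backslashes must be escaped; in Python source a raw
# string does the same job.
filtered_words = [r"\bthe\b", "spoiler"]

# As described above: entries are joined with "|" and the combined pattern
# is searched against the post content, case-insensitively.
pattern = re.compile("|".join(filtered_words), re.IGNORECASE)

posts = ["Them again!", "THE end", "Spoiler alert", "all clear"]
kept = [p for p in posts if not pattern.search(p)]
print(kept)  # ['Them again!', 'all clear']
```

Note that `\bthe\b` does not filter "Them again!", because the `m` after "the" prevents a word boundary, while the bare entry "spoiler" filters any post containing that substring.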
@@ -18,6 +18,8 @@
 
 [Tumblr]: https://tumblr.com
 [Tumblr Tokens]: https://tumblr.com/oauth/apps
+[Tumblr API Documentation on Blog Identifiers]: https://tumblr.com/docs/en/api/v2#blog-identifiers
+[Tumblr API Documentation on Rate Limits]: https://tumblr.com/docs/en/api/v2#rate-limits
 
 [Download]: src/tumblrbot/flow/download.py
 [Examples]: src/tumblrbot/flow/examples.py
@@ -40,11 +42,12 @@ Features:
 1. Asks for [OpenAI] and [Tumblr] tokens.
    - Stores API tokens using [keyring].
 1. Retrieves [Tumblr] [OAuth] tokens.
-1. [Downloads posts][Download] from the [configured][config] [Tumblr] blogs.
+1. [Downloads posts][Download] from the [configured][config] blogs.
    - Skips redownloading already downloaded posts.
    - Shows progress and previews the current post.
 1. [Creates examples][Examples] to fine-tune the model from your posts.
    - Filters out posts that contain more than just text data.
+   - Filters out posts that contain [configured][config] regular expressions.
    - Adds custom user messages and assistant responses to the dataset from the [configured][config] file.
 1. Filters out any posts flagged by the [OpenAI Moderation API].
 1. [Uploads examples][Fine-Tune] to [OpenAI] and begins the fine-tuning process.
@@ -52,22 +55,23 @@ Features:
    - Resumes monitoring the same fine-tuning process when restarted.
    - Deletes the uploaded examples file if fine-tuning does not succeed (optional).
    - Stores the output model automatically when fine-tuning is completed.
-1. [Generates and uploads posts][Generate] to the [configured][config] [Tumblr] blog using the [configured][config] fine-tuned model.
+1. [Generates and uploads posts][Generate] to the [configured][config] blog using the [configured][config] fine-tuned model.
    - Creates tags by extracting keywords at the [configured][config] frequency using the [configured][config] model.
-   - Uploads posts as drafts to the [configured][config] [Tumblr] blog.
-   - Reblog posts at the [configured][config] frequency.
+   - Uploads posts as drafts to the [configured][config] blog.
+   - Reblogs posts from the [configured][config] blogs at the [configured][config] frequency.
    - Shows progress and previews the current post.
 - Colorful output, progress bars, and post previews using [rich].
 - Automatically keeps the [config] file up-to-date and recreates it if missing.
 
 **To-Do:**
 
-- ...
+- Create training data from a sample of posts (possible).
 
 **Known Issues:**
 
 - Sometimes, you will get an error about the training file not being found when starting fine-tuning. We do not currently have a fix or workaround for this. You should instead use the online portal for fine-tuning if this continues to happen. Read more in [fine-tuning].
 - Post counts are incorrect when downloading posts. We are not certain what the cause of this is, but our tests suggest this is a [Tumblr] API problem that is giving inaccurate numbers.
+- During post downloading or post generation, you may receive a "Limit Exceeded" error message from the [Tumblr] API. This is caused by server-side rate-limiting by [Tumblr]. The only workaround is trying again or waiting for a period of time before retrying. In most cases, you either have to wait for a minute or an hour for the limits to reset. You can read more about the limits in the [Tumblr API documentation on rate limits].
 
 **Please submit an issue or contact us for features you want added/reimplemented.**
 
@@ -109,7 +113,7 @@ API tokens can be created here: [Tumblr Tokens].
 1. You now have access to your `consumer key` next to `Oauth Consumer Key`.
 1. Press `Show secret key` to see your `Consumer Secret`.
 
-When running this program, you will be prompted to enter all of these tokens. **The fields are password-protected, so there will be no output to the console.** If something goes wrong while entering the tokens, you can always reset them by running the program again and answering `y` to the relevant prompt.
+When running this program, you will be prompted to enter all of these tokens. If something goes wrong while entering the tokens, you can always reset them by running the program again and answering `y` to the relevant prompt.
 
 After inputting the [Tumblr] tokens, you will be given a URL that you need to open in your browser. Press `Allow`, then copy and paste the URL of the page you are redirected to into the console.
 
@@ -119,6 +123,10 @@ All config options can be found in `config.toml` after running the program once.
 
 All file options can include directories that will be created when the program is run.
 
+All config options that involve *blog identifiers* expect any version of a blog URL, which is explained in more detail in the [Tumblr API documentation on blog identifiers].
+
+Specific Options:
+
 - `custom_prompts_file` This file should follow the following file format:
 
   ```json
@@ -129,14 +137,18 @@ All file options can include directories that will be created when the program i
 
 To be specific, it should follow the [JSON Lines] file format with one collection of name/value pairs (a dictionary) per line. You can validate your file using the [JSON Lines Validator].
 
+- **`filtered_words`** - During training data generation, any posts with the specified words will be removed. Word boundaries are not checked by default, so "the" will also filter out posts with "them" or "thematic". This setting supports regular expressions, so you can explicitly look for word boundaries by surrounding an entry with "\\\b", i.e. "\\\bthe\\\b". Regular expressions have to be escaped like so due to how JSON data is read in. If you are familiar with regular expressions, it could be useful for you to know that every entry is joined with a "|" which is then used to search the post content for any matches.
 - **`developer_message`** - This message is used in for fine-tuning the AI as well as generating prompts. If you change this, you will need to run the fine-tuning again with the new value before generating posts.
-- **`user_message`** - This message is used in the same way as `developer_message` and should be treated the same.
+- **`user_message`** - This setting is used and works in the same way as `developer_message`.
 - **`expected_epochs`** - The default value here is the default number of epochs for `base_model`. You may have to change this value if you change `base_model`. After running fine-tuning once, you will see the number of epochs used in the [fine-tuning portal] under *Hyperparameters*. This value will also be updated automatically if you run fine-tuning through this program.
 - **`token_price`** - The default value here is the default token price for `base_model`. You can find the up-to-date value in [OpenAI Pricing], in the *Training* column.
 - **`job_id`** - If there is any value here, this program will resume monitoring the corresponding job, instead of starting a new one. This gets set when starting the fine-tuning and is cleared when it is completed. You can read more in [fine-tuning].
 - **`base_model`** - This value is used to choose the tokenizer for estimating fine-tuning costs. It is also the base model that will be fine-tuned and the model that is used to generate tags. You can find a list of options in the [fine-tuning portal] by pressing `+ Create` and opening the drop-down list for `Base Model`. Be sure to update `token_price` if you change this value.
 - **`fine_tuned_model`** - Set automatically after monitoring fine-tuning if the job has succeeded. You can read more in [fine-tuning].
 - **`tags_chance`** - This should be between 0 and 1. Setting it to 0 corresponds to a 0% chance (never) to add tags to a post. 1 corresponds to a 100% chance (always) to add tags to a post. Adding tags incurs a very small token cost.
+- **`reblog_blog_identifiers`** - Whenever a reblog is attempted, a random blog from this list will be chosen to be reblogged from.
+- **`reblog_chance`** - This setting works the same way as `tags_chance`.
+- **`reblog_user_message`** - This setting is a prefix that is directly prepended to the contents of the post being reblogged.
 
 ## Manual Fine-Tuning
 
@@ -1,6 +1,6 @@
 [project]
 name = "tumblrbot"
-version = "1.6.0"
+version = "1.8.0"
 description = "An updated bot that posts to Tumblr, based on your very own blog!"
 readme = "README.md"
 requires-python = ">= 3.13"
@@ -1,3 +1,4 @@
+import re
 from collections.abc import Generator
 from itertools import batched
 from json import loads
@@ -54,11 +55,12 @@ class ExamplesWriter(FlowClass):
         yield from data.items()
 
     def get_valid_posts(self) -> Generator[Post]:
+        pattern = re.compile("|".join(self.config.filtered_words), re.IGNORECASE)
         for data_path in self.get_data_paths():
             with data_path.open("rb") as fp:
                 for line in fp:
                     post = Post.model_validate_json(line)
-                    if post.valid_text_post():
+                    if post.valid_text_post() and not (self.config.filtered_words and pattern.search(post.get_content_text())):
                         yield post
 
     def filter_examples(self) -> None:
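One subtlety in the new `get_valid_posts` condition: the extra `self.config.filtered_words and` guard matters because joining an empty list yields the empty pattern `""`, which matches every string; without the guard, an unset word list would filter out every post. A standalone sketch (variable names here are illustrative, not from the package):

```python
import re

# Joining an empty list produces the empty pattern, which matches anything.
empty = re.compile("|".join([]), re.IGNORECASE)
print(bool(empty.search("any post at all")))  # True

filtered_words: list[str] = []
pattern = re.compile("|".join(filtered_words), re.IGNORECASE)

def keep(text: str) -> bool:
    # Mirrors the guarded condition above: only consult the pattern when
    # the configured list is non-empty.
    return not (filtered_words and pattern.search(text))

print(keep("hello world"))  # True, despite the empty pattern matching everything
```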
@@ -1,7 +1,10 @@
-from random import random, randrange
+from collections.abc import Iterable
+from functools import cache
+from random import choice, random, sample
 from typing import override
 
 import rich
+from pydantic import ConfigDict
 from rich.prompt import IntPrompt
 
 from tumblrbot.utils.common import FlowClass, PreviewLive
@@ -9,6 +12,8 @@ from tumblrbot.utils.models import Post
 
 
 class DraftGenerator(FlowClass):
+    model_config = ConfigDict(frozen=True)  # Makes this class hashable.
+
     @override
     def main(self) -> None:
         self.config.draft_count = IntPrompt.ask("How many drafts should be generated?", default=self.config.draft_count)
@@ -28,16 +33,16 @@ class DraftGenerator(FlowClass):
         rich.print(f":chart_increasing: [bold green]Generated {self.config.draft_count} draft(s).[/] {message}")
 
     def generate_post(self) -> Post:
-        if random() < self.config.reblog_chance:  # noqa: S311
-            original = self.get_random_post()
+        if original := self.get_random_post():
             user_message = f"{self.config.reblog_user_message}\n\n{original.get_content_text()}"
         else:
             original = Post()
             user_message = self.config.user_message
-
         text = self.generate_text(user_message)
+
         if tags := self.generate_tags(text):
             tags = tags.tags
+
         return Post(
             content=[Post.Block(type="text", text=text)],
             tags=tags or [],
@@ -65,10 +70,22 @@ class DraftGenerator(FlowClass):
 
         return None
 
-    def get_random_post(self) -> Post:
-        total = self.tumblr.retrieve_blog_info(self.config.upload_blog_identifier).response.blog.posts
-        post = self.tumblr.retrieve_published_posts(
-            self.config.upload_blog_identifier,
-            offset=randrange(total),  # noqa: S311
-        ).response.posts[0]
-        return Post.model_validate(post)
+    def get_random_post(self) -> Post | None:
+        if self.config.reblog_blog_identifiers and random() < self.config.reblog_chance:  # noqa: S311
+            blog_identifier = choice(self.config.reblog_blog_identifiers)  # noqa: S311
+            for offset in self.get_offsets(blog_identifier):
+                for raw_post in self.tumblr.retrieve_published_posts(
+                    blog_identifier,
+                    offset,
+                ).response.posts:
+                    post = Post.model_validate(raw_post)
+                    if post.valid_text_post():
+                        return post
+
+        return None
+
+    @cache  # noqa: B019 # This creates a memory leak, but it doesn't matter since this class isn't discarded until the end of the program anyways.
+    def get_offsets(self, blog_identifier: str) -> Iterable[int]:
+        total = self.tumblr.retrieve_blog_info(blog_identifier).response.blog.posts
+        # The same Iterable object is cached, so reading an element will effectively discard it. This prevents checking the same offsets twice.
+        return iter(sample(range(total), total))
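The `@cache`-on-an-iterator trick in `get_offsets` can be seen in isolation below. `OffsetSource` is a hypothetical stand-in for `DraftGenerator`, and the fixed `total` stands in for the live `retrieve_blog_info` call; everything else mirrors the pattern in the diff:

```python
from functools import cache
from random import sample


class OffsetSource:
    # @cache keys on (self, blog_identifier), so it needs a hashable
    # instance; the real class gains hashability via ConfigDict(frozen=True).
    @cache
    def get_offsets(self, blog_identifier: str):
        total = 5  # stand-in for retrieve_blog_info(...).response.blog.posts
        # sample(range(total), total) is a random permutation of all offsets.
        # Because the *same iterator object* is cached and returned on every
        # call, each offset read from it is consumed once and never revisited.
        return iter(sample(range(total), total))


src = OffsetSource()
first = next(src.get_offsets("example-blog"))  # consumes one offset
rest = list(src.get_offsets("example-blog"))   # same iterator: the other four
print(sorted([first, *rest]))  # [0, 1, 2, 3, 4]
```

A third call to `src.get_offsets("example-blog")` would yield nothing, since the cached iterator is exhausted, which is exactly why repeated reblog attempts against the same blog never recheck an offset.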
@@ -46,12 +46,13 @@ class Config(FileSyncSettings):
     toml_file: ClassVar = Path("config.toml")
 
     # Downloading Posts & Writing Examples
-    download_blog_identifiers: list[str] = Field([], description="The identifiers of the blogs which post data will be downloaded from. These must be blogs associated with the same account as the configured Tumblr secret tokens.")
+    download_blog_identifiers: list[str] = Field([], description="The identifiers of the blogs which post data will be downloaded from.")
     data_directory: Path = Field(Path("data"), description="Where to store downloaded post data.")
 
     # Writing Examples
     max_moderation_batch_size: PositiveInt = Field(100, description="How many posts, at most, to submit to the OpenAI moderation API. This is also capped by the API.")
     custom_prompts_file: Path = Field(Path("custom_prompts.jsonl"), description="Where to read in custom prompts from.")
+    filtered_words: list[str] = Field([], description="A case-insensitive list of disallowed words used to filter out training data. Regular expressions are allowed, but must be escaped.")
 
     # Writing Examples & Fine-Tuning
     examples_file: Path = Field(Path("examples.jsonl"), description="Where to output the examples that will be used to fine-tune the model.")
@@ -72,10 +73,11 @@ class Config(FileSyncSettings):
     # Generating
     upload_blog_identifier: str = Field("", description="The identifier of the blog which generated drafts will be uploaded to. This must be a blog associated with the same account as the configured Tumblr secret tokens.")
     draft_count: PositiveInt = Field(150, description="The number of drafts to process. This will affect the number of tokens used with OpenAI")
-    tags_chance: NonNegativeFloat = Field(0.1, description="The chance to generate tags for any given post. This will incur extra calls to OpenAI.")
+    tags_chance: NonNegativeFloat = Field(0.1, description="The chance to generate tags for any given post. This will use more OpenAI tokens.")
     tags_developer_message: str = Field("You will be provided with a block of text, and your task is to extract a very short list of the most important subjects from it.", description="The developer message used to generate tags.")
-    reblog_chance: NonNegativeFloat = Field(0.05, description="The chance to generate a reblog of a random post.")
-    reblog_user_message: str = Field("Please write a comical Tumblr post in response to the following Tumblr post:", description="The prefix for the user message used to reblog posts.")
+    reblog_blog_identifiers: list[str] = Field([], description="The identifiers of blogs that can be reblogged from when generating drafts.")
+    reblog_chance: NonNegativeFloat = Field(0.1, description="The chance to generate a reblog of a random post. This will use more OpenAI tokens.")
+    reblog_user_message: str = Field("Please write a comical Tumblr post in response to the following post:", description="The prefix for the user message used to reblog posts.")
 
     @classmethod
     @override
@@ -26,7 +26,12 @@ class TumblrSession(OAuth1Session):
         response = self.get(f"https://api.tumblr.com/v2/blog/{blog_identifier}/info")
         return ResponseModel.model_validate_json(response.text)
 
-    def retrieve_published_posts(self, blog_identifier: str, offset: int | None = None, after: int | None = None) -> ResponseModel:
+    def retrieve_published_posts(
+        self,
+        blog_identifier: str,
+        offset: int | None = None,
+        after: int | None = None,
+    ) -> ResponseModel:
         response = self.get(
             f"https://api.tumblr.com/v2/blog/{blog_identifier}/posts",
             params={