tumblrbot 1.4.4__tar.gz → 1.4.5__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,7 +1,8 @@
  # Custom
+ .vscode
  data
  *.toml
- *.json*
+ *.jsonl
 
  # Byte-compiled / optimized / DLL files
  __pycache__/
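A note on the ignore-pattern change above: the old glob `*.json*` also matched plain `.json` files, while `*.jsonl` only matches the JSON Lines data files the project now uses. A quick approximation with Python's `fnmatch` (the filenames are hypothetical, and gitignore semantics differ from `fnmatch` in some edge cases):

```python
# Compare the old and new ignore patterns against two hypothetical filenames.
from fnmatch import fnmatch

for name in ("custom_prompts.json", "examples.jsonl"):
    print(name, fnmatch(name, "*.json*"), fnmatch(name, "*.jsonl"))
# custom_prompts.json True False
# examples.jsonl True True
```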
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: tumblrbot
- Version: 1.4.4
+ Version: 1.4.5
  Summary: An updated bot that posts to Tumblr, based on your very own blog!
  Requires-Python: >= 3.13
  Description-Content-Type: text/markdown
@@ -18,9 +18,14 @@ Requires-Dist: tiktoken
  Requires-Dist: tomlkit
  Project-URL: Source, https://github.com/MaidThatPrograms/tumblrbot
 
+ # tumblrbot
+
  [OAuth]: https://oauth.net/1
  [Python]: https://python.org/download
 
+ [JSON Lines]: https://jsonlines.org
+ [JSON Lines Validator]: https://jsonlines.org/validator
+
  [pip]: https://pypi.org
  [keyring]: https://pypi.org/project/keyring
  [Rich]: https://pypi.org/project/rich
@@ -42,8 +47,6 @@ Project-URL: Source, https://github.com/MaidThatPrograms/tumblrbot
 
  [Config]: #configuration
  [Fine-Tuning]: #manual-fine-tuning
-
- # tumblrbot
  [![PyPI - Version](https://img.shields.io/pypi/v/tumblrbot)](https://python.org/pypi/tumblrbot)
 
  Description of original project:
@@ -52,6 +55,7 @@ Description of original project:
  This fork is largely a rewrite of the source code with similarities in its structure and process.
 
  Features:
+
  - An [interactive console][Main] for all steps of generating posts for the blog:
    1. Asks for [OpenAI] and [Tumblr] tokens.
       - Stores API tokens using [keyring].
@@ -78,16 +82,18 @@ Features:
  - Automatically keeps the [config] file up-to-date and recreates it if missing.
 
  **To-Do:**
+
  - Add code documentation.
 
  **Known Issues:**
+
  - Sometimes, you will get an error about the training file not being found when starting fine-tuning. We do not currently have a fix or workaround for this. You should instead use the online portal for fine-tuning if this continues to happen. Read more in [fine-tuning].
  - Post counts are incorrect when downloading posts. We are not certain what the cause of this is, but our tests suggest this is a [Tumblr] API problem that is giving inaccurate numbers.
 
-
  **Please submit an issue or contact us for features you want added/reimplemented.**
 
  ## Installation
+
  1. Install the latest version of [Python]:
     - Windows: `winget install python3`
     - Linux (apt): `apt install python-pip`
@@ -98,17 +104,23 @@ Features:
     - See [keyring] for additional requirements if you are not on Windows.
 
  ## Usage
+
  Run `tumblrbot` from anywhere. Run `tumblrbot --help` for command-line options. Every command-line option corresponds to a value from the [config].
 
  ## Obtaining Tokens
+
  ### OpenAI
- API token can be created [here][OpenAI Tokens].
+
+ API token can be created here: [OpenAI Tokens].
+
  1. Leave everything at the defaults and set `Project` to `Default Project`.
  1. Press `Create secret key`.
  1. Press `Copy` to copy the API token to your clipboard.
 
  ### Tumblr
- API tokens can be created [here][Tumblr Tokens].
+
+ API tokens can be created here: [Tumblr Tokens].
+
  1. Press `+ Register Application`.
  1. Enter anything for `Application Name` and `Application Description`.
  1. Enter any URL for `Application Website` and `Default callback URL`, like `https://example.com`.
@@ -123,26 +135,34 @@ When running this program, you will be prompted to enter all of these tokens. **
  After inputting the [Tumblr] tokens, you will be given a URL that you need to open in your browser. Press `Allow`, then copy and paste the URL of the page you are redirected to into the console.
 
  ## Configuration
+
  All config options can be found in `config.toml` after running the program once. This will be kept up-to-date if there are changes to the config's format in a future update. This also means it may be worthwhile to double-check the config file after an update. Any changes to the config should be in the changelog for a given version.
 
  All file options can include directories that will be created when the program is run.
 
- - `custom_prompts_file` You will have to create this file yourself. It should follow the following format:
+ - `custom_prompts_file` - This file should follow the format below:
+
    ```json
-   {"user message 1": "assistant response 1",
-   "user message 2": "assistant response 2"}
+   {"user message 1": "assistant response 1"}
+   {"user message 1": "assistant response 1"}
+   {"user message 2": "assistant response 2", "user message 3": "assistant response 3"}
    ```
+
+   To be specific, it should follow the [JSON Lines] file format with one collection of name/value pairs (a dictionary) per line. You can validate your file using the [JSON Lines Validator].
+
  - **`developer_message`** - This message is used for fine-tuning the AI as well as for generating prompts. If you change this, you will need to run the fine-tuning again with the new value before generating posts.
  - **`user_message`** - This message is used in the same way as `developer_message` and should be treated the same.
  - **`expected_epochs`** - The default value here is the default number of epochs for `base_model`. You may have to change this value if you change `base_model`. After running fine-tuning once, you will see the number of epochs used in the [fine-tuning portal] under *Hyperparameters*. This value will also be updated automatically if you run fine-tuning through this program.
- - **`token_price`** - The default value here is the default token price for `base_model`. You can find the up-to-date value [here][OpenAI Pricing], in the *Training* column.
+ - **`token_price`** - The default value here is the default token price for `base_model`. You can find the up-to-date value in [OpenAI Pricing], in the *Training* column.
  - **`job_id`** - If there is any value here, this program will resume monitoring the corresponding job, instead of starting a new one. This gets set when starting the fine-tuning and is cleared when it is completed. You can read more in [fine-tuning].
  - **`base_model`** - This value is used to choose the tokenizer for estimating fine-tuning costs. It is also the base model that will be fine-tuned and the model that is used to generate tags. You can find a list of options in the [fine-tuning portal] by pressing `+ Create` and opening the drop-down list for `Base Model`. Be sure to update `token_price` if you change this value.
  - **`fine_tuned_model`** - Set automatically after monitoring fine-tuning if the job has succeeded. You can read more in [fine-tuning].
  - **`tags_chance`** - This should be between 0 and 1. Setting it to 0 corresponds to a 0% chance (never) to add tags to a post. 1 corresponds to a 100% chance (always) to add tags to a post. Adding tags incurs a very small token cost.
 
  ## Manual Fine-Tuning
- You can manually upload the examples file to [OpenAI] and start the fine-tuning [here][fine-tuning portal].
+
+ You can manually upload the examples file to [OpenAI] and start the fine-tuning here: [fine-tuning portal].
+
  1. Press `+ Create`.
  1. Select the desired `Base Model` from the dropdown. This should ideally match the model set in the [config].
  1. Upload the generated examples file to the section under `Training data`. You can find the path for this in the [config].
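The README above points users to the online JSON Lines validator; a local spot-check of the prompts file takes only a few lines of Python. This is a sketch, not part of the package; the filename is the config default:

```python
# Validate that every line of the prompts file is a standalone JSON dictionary.
from json import loads
from pathlib import Path

lines = Path("custom_prompts.jsonl").read_text(encoding="utf_8").splitlines()
for number, line in enumerate(lines, start=1):
    entry = loads(line)  # raises json.JSONDecodeError if the line is not valid JSON
    if not isinstance(entry, dict):
        raise ValueError(f"line {number} is not a collection of name/value pairs")
```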
@@ -1,6 +1,11 @@
+ # tumblrbot
+
  [OAuth]: https://oauth.net/1
  [Python]: https://python.org/download
 
+ [JSON Lines]: https://jsonlines.org
+ [JSON Lines Validator]: https://jsonlines.org/validator
+
  [pip]: https://pypi.org
  [keyring]: https://pypi.org/project/keyring
  [Rich]: https://pypi.org/project/rich
@@ -22,8 +27,6 @@
 
  [Config]: #configuration
  [Fine-Tuning]: #manual-fine-tuning
-
- # tumblrbot
  [![PyPI - Version](https://img.shields.io/pypi/v/tumblrbot)](https://python.org/pypi/tumblrbot)
 
  Description of original project:
@@ -32,6 +35,7 @@ Description of original project:
  This fork is largely a rewrite of the source code with similarities in its structure and process.
 
  Features:
+
  - An [interactive console][Main] for all steps of generating posts for the blog:
    1. Asks for [OpenAI] and [Tumblr] tokens.
       - Stores API tokens using [keyring].
@@ -58,16 +62,18 @@ Features:
  - Automatically keeps the [config] file up-to-date and recreates it if missing.
 
  **To-Do:**
+
  - Add code documentation.
 
  **Known Issues:**
+
  - Sometimes, you will get an error about the training file not being found when starting fine-tuning. We do not currently have a fix or workaround for this. You should instead use the online portal for fine-tuning if this continues to happen. Read more in [fine-tuning].
  - Post counts are incorrect when downloading posts. We are not certain what the cause of this is, but our tests suggest this is a [Tumblr] API problem that is giving inaccurate numbers.
 
-
  **Please submit an issue or contact us for features you want added/reimplemented.**
 
  ## Installation
+
  1. Install the latest version of [Python]:
     - Windows: `winget install python3`
     - Linux (apt): `apt install python-pip`
@@ -78,17 +84,23 @@ Features:
     - See [keyring] for additional requirements if you are not on Windows.
 
  ## Usage
+
  Run `tumblrbot` from anywhere. Run `tumblrbot --help` for command-line options. Every command-line option corresponds to a value from the [config].
 
  ## Obtaining Tokens
+
  ### OpenAI
- API token can be created [here][OpenAI Tokens].
+
+ API token can be created here: [OpenAI Tokens].
+
  1. Leave everything at the defaults and set `Project` to `Default Project`.
  1. Press `Create secret key`.
  1. Press `Copy` to copy the API token to your clipboard.
 
  ### Tumblr
- API tokens can be created [here][Tumblr Tokens].
+
+ API tokens can be created here: [Tumblr Tokens].
+
  1. Press `+ Register Application`.
  1. Enter anything for `Application Name` and `Application Description`.
  1. Enter any URL for `Application Website` and `Default callback URL`, like `https://example.com`.
@@ -103,26 +115,34 @@ When running this program, you will be prompted to enter all of these tokens. **
  After inputting the [Tumblr] tokens, you will be given a URL that you need to open in your browser. Press `Allow`, then copy and paste the URL of the page you are redirected to into the console.
 
  ## Configuration
+
  All config options can be found in `config.toml` after running the program once. This will be kept up-to-date if there are changes to the config's format in a future update. This also means it may be worthwhile to double-check the config file after an update. Any changes to the config should be in the changelog for a given version.
 
  All file options can include directories that will be created when the program is run.
 
- - `custom_prompts_file` You will have to create this file yourself. It should follow the following format:
+ - `custom_prompts_file` - This file should follow the format below:
+
    ```json
-   {"user message 1": "assistant response 1",
-   "user message 2": "assistant response 2"}
+   {"user message 1": "assistant response 1"}
+   {"user message 1": "assistant response 1"}
+   {"user message 2": "assistant response 2", "user message 3": "assistant response 3"}
    ```
+
+   To be specific, it should follow the [JSON Lines] file format with one collection of name/value pairs (a dictionary) per line. You can validate your file using the [JSON Lines Validator].
+
  - **`developer_message`** - This message is used for fine-tuning the AI as well as for generating prompts. If you change this, you will need to run the fine-tuning again with the new value before generating posts.
  - **`user_message`** - This message is used in the same way as `developer_message` and should be treated the same.
  - **`expected_epochs`** - The default value here is the default number of epochs for `base_model`. You may have to change this value if you change `base_model`. After running fine-tuning once, you will see the number of epochs used in the [fine-tuning portal] under *Hyperparameters*. This value will also be updated automatically if you run fine-tuning through this program.
- - **`token_price`** - The default value here is the default token price for `base_model`. You can find the up-to-date value [here][OpenAI Pricing], in the *Training* column.
+ - **`token_price`** - The default value here is the default token price for `base_model`. You can find the up-to-date value in [OpenAI Pricing], in the *Training* column.
  - **`job_id`** - If there is any value here, this program will resume monitoring the corresponding job, instead of starting a new one. This gets set when starting the fine-tuning and is cleared when it is completed. You can read more in [fine-tuning].
  - **`base_model`** - This value is used to choose the tokenizer for estimating fine-tuning costs. It is also the base model that will be fine-tuned and the model that is used to generate tags. You can find a list of options in the [fine-tuning portal] by pressing `+ Create` and opening the drop-down list for `Base Model`. Be sure to update `token_price` if you change this value.
  - **`fine_tuned_model`** - Set automatically after monitoring fine-tuning if the job has succeeded. You can read more in [fine-tuning].
  - **`tags_chance`** - This should be between 0 and 1. Setting it to 0 corresponds to a 0% chance (never) to add tags to a post. 1 corresponds to a 100% chance (always) to add tags to a post. Adding tags incurs a very small token cost.
 
  ## Manual Fine-Tuning
- You can manually upload the examples file to [OpenAI] and start the fine-tuning [here][fine-tuning portal].
+
+ You can manually upload the examples file to [OpenAI] and start the fine-tuning here: [fine-tuning portal].
+
  1. Press `+ Create`.
  1. Select the desired `Base Model` from the dropdown. This should ideally match the model set in the [config].
  1. Upload the generated examples file to the section under `Training data`. You can find the path for this in the [config].
@@ -1,6 +1,6 @@
  [project]
  name = "tumblrbot"
- version = "1.4.4"
+ version = "1.4.5"
  description = "An updated bot that posts to Tumblr, based on your very own blog!"
  readme = "README.md"
  requires-python = ">= 3.13"
@@ -19,22 +19,18 @@ def main() -> None:
          OpenAI(api_key=tokens.openai_api_key.get_secret_value(), http_client=DefaultHttpxClient(http2=True)) as openai,
          TumblrSession(tokens=tokens) as tumblr,
      ):
-         post_downloader = PostDownloader(openai, tumblr)
          if Confirm.ask("Download latest posts?", default=False):
-             post_downloader.download()
-             download_paths = post_downloader.get_data_paths()
+             PostDownloader(openai=openai, tumblr=tumblr).main()
 
-         examples_writer = ExamplesWriter(openai, tumblr, download_paths)
          if Confirm.ask("Create training data?", default=False):
-             examples_writer.write_examples()
-             estimated_tokens = sum(examples_writer.count_tokens())
+             ExamplesWriter(openai=openai, tumblr=tumblr).main()
 
-         fine_tuner = FineTuner(openai, tumblr, estimated_tokens)
+         fine_tuner = FineTuner(openai=openai, tumblr=tumblr)
          fine_tuner.print_estimates()
 
          message = "Resume monitoring the previous fine-tuning process?" if FlowClass.config.job_id else "Upload data to OpenAI for fine-tuning?"
          if Confirm.ask(f"{message} [bold]You must do this to set the model to generate drafts from. Alternatively, manually enter a model into the config", default=False):
-             fine_tuner.fine_tune()
+             fine_tuner.main()
 
          if Confirm.ask("Generate drafts?", default=False):
-             DraftGenerator(openai, tumblr).create_drafts()
+             DraftGenerator(openai=openai, tumblr=tumblr).main()
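The refactored `main()` above gives every step the same shape: construct the flow object with keyword arguments, then call `.main()` behind a Rich confirmation prompt. A minimal, self-contained sketch of that gating pattern (the helper name and step body are ours, not the package's):

```python
# Each pipeline step only runs when the user confirms; Confirm.ask returns a bool.
from collections.abc import Callable

from rich.prompt import Confirm


def run_step(prompt: str, step: Callable[[], None]) -> None:
    if Confirm.ask(prompt, default=False):
        step()


run_step("Download latest posts?", lambda: print("downloading..."))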
@@ -1,13 +1,14 @@
  from io import TextIOBase
  from json import dump
- from pathlib import Path
+ from typing import override
 
  from tumblrbot.utils.common import FlowClass, PreviewLive
  from tumblrbot.utils.models import Post
 
 
  class PostDownloader(FlowClass):
-     def download(self) -> None:
+     @override
+     def main(self) -> None:
          self.config.data_directory.mkdir(parents=True, exist_ok=True)
 
          with PreviewLive() as live:
@@ -50,9 +51,3 @@ class PostDownloader(FlowClass):
                  completed += len(posts)
              else:
                  return
-
-     def get_data_paths(self) -> list[Path]:
-         return list(map(self.get_data_path, self.config.download_blog_identifiers))
-
-     def get_data_path(self, blog_identifier: str) -> Path:
-         return (self.config.data_directory / blog_identifier).with_suffix(".jsonl")
@@ -1,27 +1,21 @@
  from collections.abc import Generator
- from dataclasses import dataclass
  from json import loads
  from math import ceil
- from pathlib import Path
  from re import search
- from typing import IO
+ from typing import IO, override
 
  import rich
  from more_itertools import chunked
  from openai import BadRequestError
- from rich.console import Console
  from rich.prompt import Confirm
- from tiktoken import encoding_for_model, get_encoding
 
  from tumblrbot.utils.common import FlowClass, PreviewLive
  from tumblrbot.utils.models import Example, Post
 
 
- @dataclass
  class ExamplesWriter(FlowClass):
-     data_paths: list[Path]
-
-     def write_examples(self) -> None:
+     @override
+     def main(self) -> None:
          self.config.examples_file.parent.mkdir(parents=True, exist_ok=True)
 
          with self.config.examples_file.open("w", encoding="utf_8") as fp:
@@ -52,16 +46,22 @@ class ExamplesWriter(FlowClass):
              fp.write(f"{example.model_dump_json()}\n")
 
      def get_custom_prompts(self) -> Generator[tuple[str, str]]:
-         if self.config.custom_prompts_file.exists():
-             text = self.config.custom_prompts_file.read_text(encoding="utf_8")
-             yield from loads(text).items()
+         self.config.custom_prompts_file.parent.mkdir(parents=True, exist_ok=True)
+         self.config.custom_prompts_file.touch(exist_ok=True)
+
+         with self.config.custom_prompts_file.open("r", encoding="utf_8") as fp:
+             for line in fp:
+                 data: dict[str, str] = loads(line)
+                 yield from data.items()
 
      def get_filtered_posts(self) -> Generator[Post]:
-         posts = list(self.get_valid_posts())
+         posts = self.get_valid_posts()
 
          if Confirm.ask("[gray62]Remove posts flagged by the OpenAI moderation? This can sometimes resolve errors with fine-tuning validation, but is slow.", default=False):
-             removed = 0
              chunk_size = self.get_moderation_chunk_limit()
+             posts = list(posts)
+             removed = 0
+
              with PreviewLive() as live:
                  for chunk in live.progress.track(
                      chunked(posts, chunk_size),
@@ -80,7 +80,7 @@ class ExamplesWriter(FlowClass):
          yield from posts
 
      def get_valid_posts(self) -> Generator[Post]:
-         for data_path in self.data_paths:
+         for data_path in self.get_data_paths():
              with data_path.open(encoding="utf_8") as fp:
                  for line in fp:
                      post = Post.model_validate_json(line)
@@ -96,19 +96,3 @@ class ExamplesWriter(FlowClass):
          if match := search(r"(\d+)\.", message):
              return int(match.group(1))
          return test_n
-
-     def count_tokens(self) -> Generator[int]:
-         # Based on https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken
-         # and https://cookbook.openai.com/examples/chat_finetuning_data_prep
-         try:
-             encoding = encoding_for_model(self.config.base_model)
-         except KeyError as error:
-             encoding = get_encoding("o200k_base")
-             Console(stderr=True, style="logging.level.warning").print(f"[Warning] Using encoding '{encoding.name}': {''.join(error.args)}\n")
-
-         with self.config.examples_file.open(encoding="utf_8") as fp:
-             for line in fp:
-                 example = Example.model_validate_json(line)
-                 yield len(encoding.encode("assistant"))  # every reply is primed with <|start|>assistant<|message|>
-                 for message in example.messages:
-                     yield 4 + len(encoding.encode(message.content))
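The new `get_custom_prompts` above reads the prompts file as JSON Lines, one dictionary per line, rather than one JSON document for the whole file. A self-contained sketch of the same pattern (the function name and file path are ours, for illustration):

```python
# Yield (user message, assistant response) pairs from a JSON Lines file,
# creating an empty file on first run so iteration simply yields nothing.
from collections.abc import Generator
from json import loads
from pathlib import Path


def iter_prompts(path: Path) -> Generator[tuple[str, str]]:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch(exist_ok=True)
    with path.open("r", encoding="utf_8") as fp:
        for line in fp:
            data: dict[str, str] = loads(line)
            yield from data.items()


print(list(iter_prompts(Path("custom_prompts.jsonl"))))
```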
@@ -1,25 +1,27 @@
- from dataclasses import dataclass
+ from collections.abc import Generator
  from datetime import datetime
  from textwrap import dedent
- from time import sleep, time
+ from time import sleep
+ from typing import override
 
  import rich
  from openai.types.fine_tuning import FineTuningJob
  from rich import progress
+ from rich.console import Console
  from rich.prompt import Confirm
+ from tiktoken import encoding_for_model, get_encoding
 
  from tumblrbot.utils.common import FlowClass, PreviewLive
+ from tumblrbot.utils.models import Example
 
 
- @dataclass
  class FineTuner(FlowClass):
-     estimated_tokens: int
-
      @staticmethod
      def dedent_print(text: str) -> None:
          rich.print(dedent(text).lstrip())
 
-     def fine_tune(self) -> None:
+     @override
+     def main(self) -> None:
          job = self.create_job()
 
          self.dedent_print(f"""
@@ -39,8 +41,6 @@ class FineTuner(FlowClass):
 
              live.progress.update(
                  task_id,
-                 total=job.estimated_finish - job.created_at if job.estimated_finish else None,
-                 completed=time() - job.created_at,
                  description=f"Fine-tuning: [italic]{job.status.replace('_', ' ').title()}[/]...",
             )
 
@@ -102,16 +102,33 @@
          self.config.fine_tuned_model = job.fine_tuned_model or ""
 
      def print_estimates(self) -> None:
-         total_tokens = self.config.expected_epochs * self.estimated_tokens
+         estimated_tokens = sum(self.count_tokens())
+         total_tokens = self.config.expected_epochs * estimated_tokens
          cost_string = self.get_cost_string(total_tokens)
 
          self.dedent_print(f"""
-             Tokens {self.estimated_tokens:,}:
+             Tokens {estimated_tokens:,}:
              Total tokens for [bold orange1]{self.config.expected_epochs}[/] epoch(s): {total_tokens:,}
              Expected cost when trained with [bold purple]{self.config.base_model}[/]: {cost_string}
              NOTE: Token values are approximate and may not be 100% accurate, please be aware of this when using the data.
              [italic red]Amelia, Mutsumi, and Marin are not responsible for any inaccuracies in the token count or estimated price.[/]
          """)
 
+     def count_tokens(self) -> Generator[int]:
+         # Based on https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken
+         # and https://cookbook.openai.com/examples/chat_finetuning_data_prep
+         try:
+             encoding = encoding_for_model(self.config.base_model)
+         except KeyError as error:
+             encoding = get_encoding("o200k_base")
+             Console(stderr=True, style="logging.level.warning").print(f"[Warning] Using encoding '{encoding.name}': {''.join(error.args)}\n")
+
+         with self.config.examples_file.open(encoding="utf_8") as fp:
+             for line in fp:
+                 example = Example.model_validate_json(line)
+                 yield len(encoding.encode("assistant"))  # every reply is primed with <|start|>assistant<|message|>
+                 for message in example.messages:
+                     yield 4 + len(encoding.encode(message.content))
+
      def get_cost_string(self, total_tokens: int) -> str:
          return f"${self.config.token_price / 1000000 * total_tokens:.2f}"
@@ -1,13 +1,18 @@
  from random import random
+ from typing import override
 
  import rich
+ from rich.prompt import IntPrompt
 
  from tumblrbot.utils.common import FlowClass, PreviewLive
  from tumblrbot.utils.models import Post
 
 
  class DraftGenerator(FlowClass):
-     def create_drafts(self) -> None:
+     @override
+     def main(self) -> None:
+         self.config.draft_count = IntPrompt.ask("How many drafts should be generated?", default=self.config.draft_count)
+
          message = f"View drafts here: https://tumblr.com/blog/{self.config.upload_blog_identifier}/drafts"
 
          with PreviewLive() as live:
@@ -24,10 +29,7 @@ class DraftGenerator(FlowClass):
 
      def generate_post(self) -> Post:
          content = self.generate_content()
-         post = Post(
-             content=[content],
-             state="draft",
-         )
+         post = Post(content=[content])
          if tags := self.generate_tags(content):
              post.tags = tags.tags
          return post
@@ -39,16 +41,15 @@ class DraftGenerator(FlowClass):
              model=self.config.fine_tuned_model,
          ).output_text
 
-         return Post.Block(type="text", text=content)
+         return Post.Block(text=content)
 
      def generate_tags(self, content: Post.Block) -> Post | None:
          if random() < self.config.tags_chance:  # noqa: S311
              return self.openai.responses.parse(
                  text_format=Post,
-                 input=f"Extract the most important subjects from the following text:\n\n{content.text}",
-                 instructions="You are an advanced text summarization tool. You return the requested data to the user as a list of comma-separated strings.",
+                 input=content.text,
+                 instructions=self.config.tags_developer_message,
                  model=self.config.base_model,
-                 temperature=0.5,
              ).output_parsed
 
          return None
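`generate_tags` is probabilistic: `random() < tags_chance` is true with probability `tags_chance`, so the default of 0.1 tags roughly one post in ten. A quick, self-contained sanity check of that idiom (trial count is arbitrary):

```python
# Empirically confirm that random() < chance fires at about the configured rate.
from random import random

chance = 0.1  # the config default for tags_chance
trials = 100_000
hits = sum(random() < chance for _ in range(trials))
print(hits / trials)  # approximately 0.1
```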
@@ -1,25 +1,37 @@
- from dataclasses import dataclass
+ from abc import abstractmethod
  from random import choice
  from typing import ClassVar, Self, override
 
  from openai import OpenAI
+ from pydantic import ConfigDict
  from rich._spinners import SPINNERS
  from rich.console import RenderableType
  from rich.live import Live
  from rich.progress import MofNCompleteColumn, Progress, SpinnerColumn, TimeElapsedColumn
  from rich.table import Table
 
- from tumblrbot.utils.config import Config
+ from tumblrbot.utils.config import Config, Path
+ from tumblrbot.utils.models import FullyValidatedModel
  from tumblrbot.utils.tumblr import TumblrSession
 
 
- @dataclass
- class FlowClass:
+ class FlowClass(FullyValidatedModel):
+     model_config = ConfigDict(arbitrary_types_allowed=True)
+
      config: ClassVar = Config()  # pyright: ignore[reportCallIssue]
 
      openai: OpenAI
      tumblr: TumblrSession
 
+     @abstractmethod
+     def main(self) -> None: ...
+
+     def get_data_paths(self) -> list[Path]:
+         return list(map(self.get_data_path, self.config.download_blog_identifiers))
+
+     def get_data_path(self, blog_identifier: str) -> Path:
+         return (self.config.data_directory / blog_identifier).with_suffix(".jsonl")
+
 
  class PreviewLive(Live):
      def __init__(self) -> None:
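`FlowClass` now doubles as the shared contract: every step holds the two API clients, implements an abstract `main()`, and inherits the `.jsonl` path helpers. A stripped-down, self-contained sketch of the pattern (no pydantic or API clients, and the class and attribute names are ours, just to illustrate the control flow):

```python
# Minimal stand-in for the FlowClass pattern: subclasses implement main(),
# and shared path helpers live on the base class.
from abc import ABC, abstractmethod
from pathlib import Path


class FlowStep(ABC):
    data_directory = Path("data")
    download_blog_identifiers = ("example-blog",)  # hypothetical config values

    @abstractmethod
    def main(self) -> None: ...

    def get_data_path(self, blog_identifier: str) -> Path:
        # Each blog's downloaded posts live in data/<blog>.jsonl.
        return (self.data_directory / blog_identifier).with_suffix(".jsonl")


class PrintPaths(FlowStep):
    def main(self) -> None:
        for blog in self.download_blog_identifiers:
            print(self.get_data_path(blog))


PrintPaths().main()  # data/example-blog.jsonl
```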
@@ -31,7 +31,7 @@ class Config(BaseSettings):
      data_directory: Path = Field(Path("data"), description="Where to store downloaded post data.")
 
      # Writing Examples
-     custom_prompts_file: Path = Field(Path("custom_prompts.json"), description="Where to read in custom prompts from.")
+     custom_prompts_file: Path = Field(Path("custom_prompts.jsonl"), description="Where to read in custom prompts from.")
 
      # Writing Examples & Fine-Tuning
      examples_file: Path = Field(Path("examples.jsonl"), description="Where to output the examples that will be used to fine-tune the model.")
@@ -53,6 +53,7 @@ class Config(BaseSettings):
      upload_blog_identifier: str = Field("", description="The identifier of the blog which generated drafts will be uploaded to. This must be a blog associated with the same account as the configured Tumblr secret tokens.")
      draft_count: PositiveInt = Field(150, description="The number of drafts to process. This will affect the number of tokens used with OpenAI")
      tags_chance: NonNegativeFloat = Field(0.1, description="The chance to generate tags for any given post. This will incur extra calls to OpenAI.")
+     tags_developer_message: str = Field("You will be provided with a block of text, and your task is to extract a very short list of the most important subjects from it.", description="The developer message used to generate tags.")
 
      @override
      @classmethod
@@ -98,13 +98,13 @@ class Tokens(FullyValidatedModel):
 
  class Post(FullyValidatedModel):
      class Block(FullyValidatedModel):
-         type: str = ""
+         type: str = "text"
          text: str = ""
          blocks: list[int] = []  # noqa: RUF012
 
      timestamp: SkipJsonSchema[int] = 0
      tags: Annotated[list[str], PlainSerializer(",".join)] = []  # noqa: RUF012
-     state: SkipJsonSchema[Literal["published", "queued", "draft", "private", "unapproved"]] = "published"
+     state: SkipJsonSchema[Literal["published", "queued", "draft", "private", "unapproved"]] = "draft"
 
      content: SkipJsonSchema[list[Block]] = []  # noqa: RUF012
      layout: SkipJsonSchema[list[Block]] = []  # noqa: RUF012
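The model changes above shift defaults rather than behavior: new `Block`s default to `type="text"` and new `Post`s to `state="draft"`, matching how `DraftGenerator` now constructs them, while `tags` continues to serialize to a comma-separated string. A self-contained pydantic sketch of that serializer (this trimmed-down `Post` is a stand-in for the package's model):

```python
# Show how PlainSerializer(",".join) flattens the tags list on JSON dump.
from typing import Annotated

from pydantic import BaseModel, PlainSerializer


class Post(BaseModel):
    state: str = "draft"
    tags: Annotated[list[str], PlainSerializer(",".join)] = []


print(Post(tags=["cats", "generated"]).model_dump_json())
# {"state":"draft","tags":"cats,generated"}
```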