token_estimator 0.1.0 → 0.1.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: c4f8fd7ec54ffb7145d793b94eada2a34aa331a562cbb7b52eee0f74fcd6c181
-   data.tar.gz: b61a2e7c7ce96b60267bcde6d67fedfe5330ad9f4d021b2d9a4b30fbccbaf6e9
+   metadata.gz: e76555160bf7038e96625963f99f434979da61bab11f70c0e7293d83db8c1588
+   data.tar.gz: 46d04371c15c88c39d96f87223c7a0713f9ceae367c52a48b50ca59b57d417e5
  SHA512:
-   metadata.gz: 19333d9dce63923d2490c525b723a1585de5107eb16d5021c05ea1f461d0ad7659435b0d07df92d8e4b209f79501b41386132b604158dc20c60b2db2f571680d
-   data.tar.gz: 2790194ec3c05d191dbbdc0037e01567a9e80236dc872eaf335c93738d010b5ec414bb04872d0fa22cc2f1df11cfa2c49888d002d0692460c766e8d34c48abaa
+   metadata.gz: 110fb3caff83e609a5b4fa9e28ff92c1b0a866d9f30b669ea72b071e36f80e96377c451581dd94fa1155280311c9fd2fbb9ab5e9516c66c6c6b507aaba4972c7
+   data.tar.gz: 339ca2a1c63812c424cc2d1771d4dd665151ca069965511e469dd17587764a31957f463a2c5f5222e677413c9b1117f5331c1bd2762bd4d7c96260496e586f94
data/CHANGELOG.md CHANGED
@@ -1,5 +1,9 @@
  ## [Unreleased]
 
+ ## [0.1.1] - 2024-07-15
+
+ - Added the `TokenEstimator::Estimator::SUPPORTED_FILE_TYPES` constant to specify supported file types.
+
  ## [0.1.0] - 2024-07-11
 
  - Initial release
data/README.md CHANGED
@@ -13,6 +13,68 @@ And then execute:
  bundle install
  ```
 
+ ## Methods
+
+ #### `count_tokens_from_text`
+ Counts tokens from a given text.
+
+ ```rb
+ require "token_estimator"
+
+ tokenizer_name = "gpt2"
+ estimator = TokenEstimator::Estimator.new(tokenizer_name)
+
+ text = "Your sample text here."
+ token_estimation = estimator.count_tokens_from_text(text)
+
+ puts "Token estimation: #{token_estimation}"
+ ```
+
+ #### `count_tokens_from_file`
+ Counts tokens from a file. The file type is determined by the file extension.
+
+ ```rb
+ require "token_estimator"
+
+ file_path = "spec/fixtures/files/lorem.pdf"
+ tokenizer_name = "gpt2"
+ estimator = TokenEstimator::Estimator.new(tokenizer_name)
+
+ token_estimation = estimator.count_tokens_from_file(file_path)
+
+ puts "Token estimation: #{token_estimation}"
+ ```
+
+ #### `count_tokens_from_excel_file`
+ Counts tokens from an Excel (.xlsx) file.
+
+ #### `count_tokens_from_csv_file`
+ Counts tokens from a CSV file.
+
+ #### `count_tokens_from_pdf_file`
+ Counts tokens from a PDF file.
+
+ #### `count_tokens_from_txt_file`
+ Counts tokens from a plain text (.txt) file.
+
+ #### `count_tokens_from_markdown_file`
+ Counts tokens from a Markdown (.md) file.
+
+ #### `count_tokens_from_json_file`
+ Counts tokens from a JSON file.
+
+ #### `count_tokens_from_html_file`
+ Counts tokens from an HTML file.
+
+ #### `count_tokens_from_json`
+ Counts tokens from a JSON object.
+
+ #### `count_tokens_from_html`
+ Counts tokens from an HTML string.
+
+ #### `TokenEstimator::Estimator::SUPPORTED_FILE_TYPES`
+ Lists the supported file extensions.
+
  ## Roadmap
  Here is a checklist of the formats we currently support for token counting and those we plan to support in the future:
 
@@ -29,6 +91,17 @@ Here is a checklist of the formats we currently support for token counting and t
  - [ ] PNG
  - [ ] JPG
 
+ ## Error Handling
+ If you try to count tokens from an unsupported file type, the gem will raise an `UnsupportedFileTypeError`.
+
+ ```rb
+ begin
+   token_count = estimator.count_tokens_from_file("path/to/your/file.unsupported")
+ rescue TokenEstimator::UnsupportedFileTypeError => e
+   puts e.message
+ end
+ ```
+
  ## Contributing
  Contribution directions go here. You can fork the repository, create a new branch, and submit a pull request for review. Please make sure to write tests for your contributions and follow the coding standards set in the project.
 
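The README text in this diff says `count_tokens_from_file` chooses a handler from the file extension and raises `UnsupportedFileTypeError` for anything else. The gem's dispatch code is not part of this diff, so the following is only a sketch of one way such a lookup could work, not the gem's implementation. The `Sketch` module, `HANDLERS`, and `handler_for` are hypothetical names, and lowercasing the extension is an assumption. The method names and extensions are taken from the README and the `SUPPORTED_FILE_TYPES` array.

```ruby
# Hypothetical sketch; none of these names are defined by the gem itself.
module Sketch
  # Stand-in for TokenEstimator::UnsupportedFileTypeError.
  class UnsupportedFileTypeError < StandardError; end

  # Assumed mapping from file extension to the per-format methods
  # documented in the README.
  HANDLERS = {
    ".txt"  => :count_tokens_from_txt_file,
    ".csv"  => :count_tokens_from_csv_file,
    ".pdf"  => :count_tokens_from_pdf_file,
    ".json" => :count_tokens_from_json_file,
    ".md"   => :count_tokens_from_markdown_file,
    ".html" => :count_tokens_from_html_file,
    ".xlsx" => :count_tokens_from_excel_file
  }.freeze

  # Returns the method name that would handle `path`, or raises
  # UnsupportedFileTypeError when the extension is not in HANDLERS.
  def self.handler_for(path)
    ext = File.extname(path).downcase # case-insensitive match is an assumption
    HANDLERS.fetch(ext) do
      raise UnsupportedFileTypeError, "Unsupported file type: #{ext.inspect}"
    end
  end
end

puts Sketch.handler_for("spec/fixtures/files/lorem.pdf")
```

With a real estimator instance, the returned symbol could be passed to `public_send` along with the path, assuming each per-format method takes a single path argument as the README headings suggest.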
@@ -1,5 +1,5 @@
  # frozen_string_literal: true
 
  module TokenEstimator
-   VERSION = "0.1.0"
+   VERSION = "0.1.2"
  end
@@ -13,6 +13,8 @@ module TokenEstimator
    class UnsupportedFileTypeError < StandardError; end
 
    class Estimator
+     SUPPORTED_FILE_TYPES = [".txt", ".csv", ".pdf", ".json", ".md", ".html", ".xlsx"]
+
      def initialize(tokenizer_name)
        @tokenizer = Tokenizers.from_pretrained(tokenizer_name)
      end
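The array added in the hunk above can be used to filter paths before calling `count_tokens_from_file`, so unsupported files are skipped rather than rescued one by one. The snippet below is a minimal sketch of that idea. It copies the array locally so it runs without the gem installed; `supported_file?` is a hypothetical helper that the gem does not provide, and the case-insensitive comparison is an assumption.

```ruby
# Copied from the hunk above (frozen here for safety). With the gem loaded you
# would reference TokenEstimator::Estimator::SUPPORTED_FILE_TYPES instead.
SUPPORTED_FILE_TYPES = [".txt", ".csv", ".pdf", ".json", ".md", ".html", ".xlsx"].freeze

# Hypothetical helper, not part of the gem: true when the path's extension
# appears in the supported list.
def supported_file?(path)
  SUPPORTED_FILE_TYPES.include?(File.extname(path).downcase)
end

paths = ["notes.md", "scan.png", "data/report.xlsx"]

# Split the paths into those worth passing to the estimator and the rest.
supported, skipped = paths.partition { |path| supported_file?(path) }

puts "Supported: #{supported.inspect}"
puts "Skipped:   #{skipped.inspect}"
```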
Binary file
Binary file
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: token_estimator
  version: !ruby/object:Gem::Version
-   version: 0.1.0
+   version: 0.1.2
  platform: ruby
  authors:
  - aemabit
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2024-07-11 00:00:00.000000000 Z
+ date: 2024-07-15 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: rails
@@ -138,6 +138,8 @@ files:
  - lib/token_estimator.rb
  - lib/token_estimator/version.rb
  - sig/token_estimator.rbs
+ - token_estimator-0.1.0.gem
+ - token_estimator-0.1.1.gem
  homepage: https://github.com/aemabit/token_estimator
  licenses:
  - MIT