tokensplit 0.1.0__tar.gz

@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) [year] [fullname]
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,154 @@
1
+ Metadata-Version: 2.4
2
+ Name: tokensplit
3
+ Version: 0.1.0
4
+ Summary: String-separated values with user-defined multi-character delimiters
5
+ Author-email: lost_0 <l05t_0@proton.me>
6
+ License-Expression: MIT
7
+ Classifier: Programming Language :: Python :: 3
8
+ Classifier: Operating System :: OS Independent
9
+ Classifier: Topic :: File Formats
10
+ Requires-Python: >=3.7
11
+ Description-Content-Type: text/markdown
12
+ License-File: LICENSE.txt
13
+ Dynamic: license-file
14
+
15
+ # TokenSplit — Token-Separated Values
16
+
17
+ A lightweight Python package for reading and writing `.toks` files: a plain-text tabular format where you choose your own multi-character delimiter string.
18
+
19
+ ---
20
+
21
+ ## Why?
22
+
23
+ CSV uses a single character (`,`) as a separator, which means commas in your data need escaping or quoting.
24
+ Toks lets you pick any string — `/---/`, `:::`, `<<SEP>>` — that you know won't appear in your data, keeping files simple and unambiguous without any escape sequences.
25
+
26
+ ---
27
+
28
+ ## File format
29
+
30
+ ```
31
+ /---/
32
+ Alice/---/30/---/Engineer/---/
33
+ Bob/---/25/---/Designer/---/
34
+ ```
35
+
36
+ - **Line 1** — the delimiter string (written automatically by the writer)
37
+ - **Every other line** — values separated by the delimiter, with the line ending on `<delimiter><newline>`
38
+
39
+ Newlines *inside* a value are preserved because rows end only on the `<delimiter><newline>` sequence, not on bare newlines.
40
+
41
+ ---
42
+
43
+ ## Installation
44
+
45
+ ```bash
46
+ pip install tokensplit # once published to PyPI
47
+ # or, from source:
48
+ pip install .
49
+ ```
50
+
51
+ ---
52
+
53
+ ## Quick start
54
+
55
+ ### Writing
56
+
57
+ ```python
58
+ import tokensplit
59
+
60
+ # Convenience function
61
+ tokensplit.write("people.toks", [
62
+ ["name", "age", "role"],
63
+ ["Alice", "30", "Engineer"],
64
+ ["Bob", "25", "Designer"],
65
+ ], delimiter="/---/")
66
+ ```
67
+
68
+ ```python
69
+ # Streaming writer — useful for large files
70
+ with open("people.toks", "w") as f:
71
+     writer = tokensplit.ToksWriter(f, delimiter="/---/")
72
+     writer.writerow(["name", "age", "role"])  # header
73
+     writer.writerow(["Alice", "30", "Engineer"])
74
+     writer.writerow(["Bob", "25", "Designer"])
75
+ ```
76
+
77
+ ### Reading
78
+
79
+ ```python
80
+ import tokensplit
81
+
82
+ # Convenience function — returns list of rows
83
+ rows = tokensplit.read("people.toks")
84
+ # [["name", "age", "role"], ["Alice", "30", "Engineer"], ...]
85
+
86
+ # Reader object: rows are iterated one at a time, but the whole file is parsed on construction
87
+ with open("people.toks") as f:
88
+     reader = tokensplit.ToksReader(f)
89
+     print("delimiter:", reader.delimiter)  # "/---/"
90
+     for row in reader:
91
+         print(row)
92
+ ```
93
+
94
+ ---
95
+
96
+ ## Choosing a delimiter
97
+
98
+ Any non-empty string without a newline character works. Good choices:
99
+
100
+ | Delimiter | Good when data contains… |
101
+ |-----------|--------------------------|
102
+ | `/---/` | General text |
103
+ | `\|\|\|` | Paths, URLs |
104
+ | `<<<>>>` | Code snippets |
105
+ | `,,,,` | Numeric CSVs being converted |
106
+ | `:::` | Short labels / IDs |
107
+
108
+ **Two rules enforced by the writer:**
109
+
110
+ 1. A value must not *contain* the delimiter string.
111
+ 2. A value must not end with a prefix of the delimiter in a way that creates an ambiguous sequence when written (e.g. value `"aa"` with delimiter `"aaa"` would produce `"aaaaa"` which embeds an extra delimiter). A `ValueError` is raised in both cases.
112
+
113
+ ---
114
+
115
+ ## API reference
116
+
117
+ ### `tokensplit.write(filepath, rows, delimiter)`
118
+ Write `rows` (list of lists of strings) to `filepath`.
119
+
120
+ ### `tokensplit.read(filepath) → List[List[str]]`
121
+ Read all rows from `filepath`. Returns a list of lists of strings.
122
+
123
+ ### `tokensplit.ToksWriter(file_obj, delimiter)`
124
+ Streaming writer. Call `.writerow(row)` or `.writerows(rows)`.
125
+ The delimiter is written to line 1 of the file on construction.
126
+
127
+ ### `tokensplit.ToksReader(file_obj)`
128
+ Reader. The whole file is read and parsed on construction; iterate with `for row in reader`.
129
+ `.delimiter` attribute exposes the delimiter read from line 1.
130
+
131
+ ---
132
+
133
+ ## Reading algorithm
134
+
135
+ The reader uses a **forward-only sliding window** of exactly `len(delimiter)` characters:
136
+
137
+ ```
138
+ content: h e l l o / - - - / w o r l d / - - - / \n
139
+ window: [ 5 ]
140
+ → slides one character at a time
141
+ match! → emit token, jump window past delimiter
142
+ ```
143
+
144
+ - **Time:** O(n) for a fixed delimiter length; the window advances one character at a time and compares `len(delimiter)` characters per step
145
+ - **Extra space:** O(d) — only the current window lives in memory beyond the content string
146
+ - No regex, no `str.split`, no backtracking
147
+
148
+ ---
149
+
150
+ ## Running tests
151
+
152
+ ```bash
153
+ python -m pytest tests/
154
+ ```
@@ -0,0 +1,140 @@
1
+ # TokenSplit — Token-Separated Values
2
+
3
+ A lightweight Python package for reading and writing `.toks` files: a plain-text tabular format where you choose your own multi-character delimiter string.
4
+
5
+ ---
6
+
7
+ ## Why?
8
+
9
+ CSV uses a single character (`,`) as a separator, which means commas in your data need escaping or quoting.
10
+ Toks lets you pick any string — `/---/`, `:::`, `<<SEP>>` — that you know won't appear in your data, keeping files simple and unambiguous without any escape sequences.
11
+
12
+ ---
13
+
14
+ ## File format
15
+
16
+ ```
17
+ /---/
18
+ Alice/---/30/---/Engineer/---/
19
+ Bob/---/25/---/Designer/---/
20
+ ```
21
+
22
+ - **Line 1** — the delimiter string (written automatically by the writer)
23
+ - **Every other line** — values separated by the delimiter, with the line ending on `<delimiter><newline>`
24
+
25
+ Newlines *inside* a value are preserved because rows end only on the `<delimiter><newline>` sequence, not on bare newlines.
26
+
27
+ ---
28
+
29
+ ## Installation
30
+
31
+ ```bash
32
+ pip install tokensplit # once published to PyPI
33
+ # or, from source:
34
+ pip install .
35
+ ```
36
+
37
+ ---
38
+
39
+ ## Quick start
40
+
41
+ ### Writing
42
+
43
+ ```python
44
+ import tokensplit
45
+
46
+ # Convenience function
47
+ tokensplit.write("people.toks", [
48
+ ["name", "age", "role"],
49
+ ["Alice", "30", "Engineer"],
50
+ ["Bob", "25", "Designer"],
51
+ ], delimiter="/---/")
52
+ ```
53
+
54
+ ```python
55
+ # Streaming writer — useful for large files
56
+ with open("people.toks", "w") as f:
57
+     writer = tokensplit.ToksWriter(f, delimiter="/---/")
58
+     writer.writerow(["name", "age", "role"])  # header
59
+     writer.writerow(["Alice", "30", "Engineer"])
60
+     writer.writerow(["Bob", "25", "Designer"])
61
+ ```
62
+
63
+ ### Reading
64
+
65
+ ```python
66
+ import tokensplit
67
+
68
+ # Convenience function — returns list of rows
69
+ rows = tokensplit.read("people.toks")
70
+ # [["name", "age", "role"], ["Alice", "30", "Engineer"], ...]
71
+
72
+ # Reader object: rows are iterated one at a time, but the whole file is parsed on construction
73
+ with open("people.toks") as f:
74
+     reader = tokensplit.ToksReader(f)
75
+     print("delimiter:", reader.delimiter)  # "/---/"
76
+     for row in reader:
77
+         print(row)
78
+ ```
79
+
80
+ ---
81
+
82
+ ## Choosing a delimiter
83
+
84
+ Any non-empty string without a newline character works. Good choices:
85
+
86
+ | Delimiter | Good when data contains… |
87
+ |-----------|--------------------------|
88
+ | `/---/` | General text |
89
+ | `\|\|\|` | Paths, URLs |
90
+ | `<<<>>>` | Code snippets |
91
+ | `,,,,` | Numeric CSVs being converted |
92
+ | `:::` | Short labels / IDs |
93
+
94
+ **Two rules enforced by the writer:**
95
+
96
+ 1. A value must not *contain* the delimiter string.
97
+ 2. A value must not end with a prefix of the delimiter in a way that creates an ambiguous sequence when written (e.g. value `"aa"` with delimiter `"aaa"` would produce `"aaaaa"` which embeds an extra delimiter). A `ValueError` is raised in both cases.
98
+
99
+ ---
100
+
101
+ ## API reference
102
+
103
+ ### `tokensplit.write(filepath, rows, delimiter)`
104
+ Write `rows` (list of lists of strings) to `filepath`.
105
+
106
+ ### `tokensplit.read(filepath) → List[List[str]]`
107
+ Read all rows from `filepath`. Returns a list of lists of strings.
108
+
109
+ ### `tokensplit.ToksWriter(file_obj, delimiter)`
110
+ Streaming writer. Call `.writerow(row)` or `.writerows(rows)`.
111
+ The delimiter is written to line 1 of the file on construction.
112
+
113
+ ### `tokensplit.ToksReader(file_obj)`
114
+ Reader. The whole file is read and parsed on construction; iterate with `for row in reader`.
115
+ `.delimiter` attribute exposes the delimiter read from line 1.
116
+
117
+ ---
118
+
119
+ ## Reading algorithm
120
+
121
+ The reader uses a **forward-only sliding window** of exactly `len(delimiter)` characters:
122
+
123
+ ```
124
+ content: h e l l o / - - - / w o r l d / - - - / \n
125
+ window: [ 5 ]
126
+ → slides one character at a time
127
+ match! → emit token, jump window past delimiter
128
+ ```
129
+
130
+ - **Time:** O(n) for a fixed delimiter length; the window advances one character at a time and compares `len(delimiter)` characters per step
131
+ - **Extra space:** O(d) — only the current window lives in memory beyond the content string
132
+ - No regex, no `str.split`, no backtracking
133
+
134
+ ---
135
+
136
+ ## Running tests
137
+
138
+ ```bash
139
+ python -m pytest tests/
140
+ ```
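The `.toks` layout described in the README's "File format" section can be illustrated without the package itself. The sketch below is a minimal encoder under stated assumptions: `encode_toks` is a hypothetical helper name that is not part of tokensplit, and it skips the collision checks the real writer performs.

```python
from typing import List


def encode_toks(rows: List[List[str]], delimiter: str) -> str:
    """Hypothetical helper: render rows as .toks text (no validation)."""
    lines = [delimiter + "\n"]  # line 1 carries the delimiter itself
    for row in rows:
        # Values are joined by the delimiter; the row ends on <delimiter><newline>.
        lines.append(delimiter.join(row) + delimiter + "\n")
    return "".join(lines)


text = encode_toks([["Alice", "30", "Engineer"], ["Bob", "25", "Designer"]], "/---/")
print(text, end="")
```

Because a row ends only on the delimiter followed by a newline, a value such as `"a\nb"` is written as-is and the embedded newline stays inside the value.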
@@ -0,0 +1,18 @@
1
+ [build-system]
2
+ requires = ["setuptools>=77"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "tokensplit"
7
+ version = "0.1.0"
8
+ description = "String-separated values with user-defined multi-character delimiters"
9
+ readme = "README.md"
10
+ requires-python = ">=3.7"
11
+ authors = [{name = "lost_0", email = "l05t_0@proton.me"}]
12
+ license = "MIT"
13
+ license-files = ["LICENSE.txt"]
14
+ classifiers = [
15
+ "Programming Language :: Python :: 3",
16
+ "Operating System :: OS Independent",
17
+ "Topic :: File Formats",
18
+ ]
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+
@@ -0,0 +1,168 @@
1
+ """
2
+ tests/test_tokensplit.py — tests for the tokensplit package.
3
+
4
+ Run with: python -m pytest tests/
5
+ """
6
+
7
+ import io
8
+ import os
9
+ import sys
10
+ import pytest
11
+
12
+ sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
13
+
14
+ from tokensplit.reader import ToksReader, read
15
+ from tokensplit.writer import ToksWriter, write
16
+
17
+
18
+ # ---------------------------------------------------------------------------
19
+ # Helpers
20
+ # ---------------------------------------------------------------------------
21
+
22
+ def roundtrip(rows, delimiter):
23
+     """Write rows to an in-memory buffer, read them back, return result."""
24
+     buf = io.StringIO()
25
+     ToksWriter(buf, delimiter).writerows(rows)
26
+     buf.seek(0)
27
+     return list(ToksReader(buf))
28
+
29
+
30
+ # ---------------------------------------------------------------------------
31
+ # Basic round-trip tests
32
+ # ---------------------------------------------------------------------------
33
+
34
+ class TestRoundTrip:
35
+     def test_simple_three_char_delimiter(self):
36
+         rows = [["hello", "world"], ["foo", "bar", "baz"]]
37
+         assert roundtrip(rows, ",,,") == rows
38
+
39
+     def test_slash_delimiter(self):
40
+         rows = [["alpha", "beta"], ["gamma", "delta", "epsilon"]]
41
+         assert roundtrip(rows, "/---/") == rows
42
+
43
+     def test_single_char_delimiter(self):
44
+         rows = [["a", "b", "c"], ["d", "e"]]
45
+         assert roundtrip(rows, "|") == rows
46
+
47
+     def test_long_delimiter(self):
48
+         rows = [["x", "y"], ["z"]]
49
+         assert roundtrip(rows, "<<SPLIT>>") == rows
50
+
51
+     def test_single_row_single_value(self):
52
+         rows = [["only"]]
53
+         assert roundtrip(rows, "/---/") == rows
54
+
55
+     def test_empty_values(self):
56
+         rows = [["", "b", ""], ["", ""]]
57
+         assert roundtrip(rows, "/---/") == rows
58
+
59
+     def test_numeric_strings(self):
60
+         rows = [["1", "2", "3"], ["100", "200"]]
61
+         assert roundtrip(rows, "|||") == rows
62
+
63
+     def test_whitespace_values(self):
64
+         rows = [[" leading", "trailing ", " both "]]
65
+         assert roundtrip(rows, "/---/") == rows
66
+
67
+     def test_newlines_inside_values(self):
68
+         # Newlines inside a value are preserved; rows end on <delim><newline>
69
+         rows = [["line1\nline2", "normal"]]
70
+         assert roundtrip(rows, "/---/") == rows
71
+
72
+     def test_many_rows(self):
73
+         rows = [[str(i), str(i * 2)] for i in range(200)]
74
+         assert roundtrip(rows, "---") == rows
75
+
76
+
77
+ # ---------------------------------------------------------------------------
78
+ # Delimiter detection edge cases
79
+ # ---------------------------------------------------------------------------
80
+
81
+ class TestDelimiterEdgeCases:
82
+     def test_value_starts_with_partial_delimiter(self):
83
+         # Value starts with part of the delimiter; it must not be split early.
84
+         rows = [["/--hello", "world"]]  # delimiter is /---/
85
+         assert roundtrip(rows, "/---/") == rows
86
+
87
+     def test_value_ends_with_safe_partial_delimiter(self):
88
+         # "hello/--" ends with "/--" (3 chars). Delimiter is "/---/" (5 chars).
89
+         # "hello/--" + "/---/" = "hello/--/---/", whose first "/---/" starts
90
+         # exactly at the intended boundary, so it is safe to write.
91
+         rows = [["hello/--", "world"]]
92
+         assert roundtrip(rows, "/---/") == rows
93
+
94
+
95
+ # ---------------------------------------------------------------------------
96
+ # Writer safety / validation
97
+ # ---------------------------------------------------------------------------
98
+
99
+ class TestWriterValidation:
100
+     def test_value_contains_delimiter_raises(self):
101
+         buf = io.StringIO()
102
+         writer = ToksWriter(buf, "/---/")
103
+         with pytest.raises(ValueError, match="delimiter"):
104
+             writer.writerow(["safe", "un/---/safe"])
105
+
106
+     def test_adjacency_collision_raises(self):
107
+         # "aa" + "aaa" = "aaaaa" which embeds "aaa" before the intended split.
108
+         buf = io.StringIO()
109
+         writer = ToksWriter(buf, "aaa")
110
+         with pytest.raises(ValueError):
111
+             writer.writerow(["aa", "b"])
112
+
113
+     def test_empty_delimiter_raises(self):
114
+         with pytest.raises(ValueError, match="empty"):
115
+             ToksWriter(io.StringIO(), "")
116
+
117
+     def test_delimiter_with_newline_raises(self):
118
+         with pytest.raises(ValueError, match="newline"):
119
+             ToksWriter(io.StringIO(), "/--\n/")
120
+
121
+
122
+ # ---------------------------------------------------------------------------
123
+ # Reader validation
124
+ # ---------------------------------------------------------------------------
125
+
126
+ class TestReaderValidation:
127
+     def test_empty_file_raises(self):
128
+         with pytest.raises(ValueError, match="empty"):
129
+             ToksReader(io.StringIO(""))
130
+
131
+
132
+ # ---------------------------------------------------------------------------
133
+ # File I/O convenience functions
134
+ # ---------------------------------------------------------------------------
135
+
136
+ class TestFileIO:
137
+     def test_write_and_read_file(self, tmp_path):
138
+         path = str(tmp_path / "data.toks")
139
+         rows = [["name", "age"], ["Alice", "30"], ["Bob", "25"]]
140
+         write(path, rows, delimiter="/---/")
141
+         assert read(path) == rows
142
+
143
+     def test_file_first_line_is_delimiter(self, tmp_path):
144
+         path = str(tmp_path / "data.toks")
145
+         write(path, [["a", "b"]], delimiter="###")
146
+         with open(path) as f:
147
+             first_line = f.readline().rstrip("\n")
148
+         assert first_line == "###"
149
+
150
+     def test_delimiter_accessible_on_reader(self, tmp_path):
151
+         path = str(tmp_path / "data.toks")
152
+         write(path, [["x"]], delimiter="::::")
153
+         with open(path) as f:
154
+             reader = ToksReader(f)
155
+         assert reader.delimiter == "::::"
156
+
157
+
158
+ # ---------------------------------------------------------------------------
159
+ # Streaming / large data
160
+ # ---------------------------------------------------------------------------
161
+
162
+ class TestStreaming:
163
+     def test_fifty_rows(self):
164
+         rows = [[f"r{i}c0", f"r{i}c1"] for i in range(50)]
165
+         result = roundtrip(rows, ",,,")
166
+         assert len(result) == 50
167
+         assert result[0] == ["r0c0", "r0c1"]
168
+         assert result[49] == ["r49c0", "r49c1"]
@@ -0,0 +1,5 @@
1
+ from .reader import ToksReader, read
2
+ from .writer import ToksWriter, write
3
+
4
+ __all__ = ["ToksReader", "ToksWriter", "read", "write"]
5
+ __version__ = "0.1.0"
@@ -0,0 +1,138 @@
1
+ """
2
+ tokensplit/reader.py — Token-Separated Values reader.
3
+
4
+ File format
5
+ -----------
6
+ Line 1 : the delimiter string (e.g. /---/ or ,,,)
7
+ Line 2+ : rows of values, each value separated by the delimiter string,
8
+ each row terminated by <delimiter><newline>
9
+
10
+ Example with delimiter /---/ :
11
+
12
+ /---/
13
+ hello/---/world/---/
14
+ foo/---/bar/---/baz/---/
15
+
16
+ Reading algorithm — sliding window, O(n) time, O(d) extra space
17
+ ---------------------------------------------------------------
18
+ We read the post-header content as a single string, then scan it with a
19
+ two-pointer window of exactly len(delimiter) characters.
20
+
21
+ end advances one character per iteration.
22
+ start marks the beginning of the current token.
23
+ window_start = end - d is the left edge of the current d-wide window.
24
+
25
+ When the window matches the delimiter we:
26
+ 1. Emit text[start : end-d] as the next value.
27
+ 2. Set start = end (skip past the delimiter).
28
+ 3. If content[start] == newline -> row terminator; close row, skip newline.
29
+
30
+ Because we move forward-only, the scan makes at most n window comparisons of
31
+ d characters each, i.e. O(n) for a short, fixed delimiter. Partial overlaps
32
+ (delimiter="ab", value="aa") are handled by the character-at-a-time slide.
33
+ """
34
+
35
+ from typing import Iterator, List
36
+
37
+
38
+ # ---------------------------------------------------------------------------
39
+ # Internal helpers
40
+ # ---------------------------------------------------------------------------
41
+
42
+ def _read_delimiter(file_obj) -> str:
43
+     """Read line 1 and return the delimiter (without its trailing newline)."""
44
+     line = file_obj.readline()
45
+     if not line:
46
+         raise ValueError("File is empty; cannot read delimiter from first line.")
47
+     if line.endswith("\n"):
48
+         line = line[:-1]
49
+     if not line:
50
+         raise ValueError("Delimiter string on line 1 must not be empty.")
51
+     return line
52
+
53
+
54
+ def _parse(content: str, delimiter: str) -> List[List[str]]:
55
+     """
56
+     Parse *content* (everything after the delimiter line) into a list of rows.
57
+
58
+     Row terminator  : <delimiter><newline>
59
+     Value separator : <delimiter>  (followed by more values on the same row)
60
+     """
61
+     d = len(delimiter)
62
+     n = len(content)
63
+     rows: List[List[str]] = []
64
+     current_row: List[str] = []
65
+
66
+     start = 0  # start of the current token
67
+     end = d    # right edge of the sliding window (exclusive)
68
+
69
+     if n < d:
70
+         # Content shorter than one delimiter: nothing to split.
71
+         if content:
72
+             current_row.append(content)
73
+             rows.append(current_row)
74
+         return rows
75
+
76
+     while end <= n:
77
+         if content[end - d : end] == delimiter:
78
+             # Emit the token that ends just before this window.
79
+             current_row.append(content[start : end - d])
80
+             start = end  # jump past the delimiter
81
+
82
+             # Peek: is the next character a newline (row terminator)?
83
+             if start < n and content[start] == "\n":
84
+                 rows.append(current_row)
85
+                 current_row = []
86
+                 start += 1  # consume the newline
87
+
88
+             end = start + d  # position window at start of next potential match
89
+         else:
90
+             end += 1
91
+
92
+     # Flush anything not closed by a row-end delimiter (e.g. file with no final newline).
93
+     tail = content[start:]
94
+     if tail or current_row:
95
+         current_row.append(tail)
96
+         rows.append(current_row)
97
+
98
+     return rows
99
+
100
+
101
+ # ---------------------------------------------------------------------------
102
+ # Public API
103
+ # ---------------------------------------------------------------------------
104
+
105
+ class ToksReader:
106
+     """
107
+     Read a .toks file row by row.
108
+
109
+     Usage
110
+     -----
111
+         with open("data.toks") as f:
112
+             reader = ToksReader(f)
113
+             for row in reader:
114
+                 print(row)  # ['val1', 'val2', ...]
115
+
116
+     The delimiter is read automatically from line 1 of the file.
117
+     Inspect it via reader.delimiter after construction.
118
+     """
119
+
120
+     def __init__(self, file_obj):
121
+         self.delimiter: str = _read_delimiter(file_obj)
122
+         self._rows: List[List[str]] = _parse(file_obj.read(), self.delimiter)
123
+
124
+     def __iter__(self) -> Iterator[List[str]]:
125
+         return iter(self._rows)
126
+
127
+
128
+ def read(filepath: str) -> List[List[str]]:
129
+     """
130
+     Convenience function: read an entire .toks file and return all rows.
131
+
132
+         rows = tokensplit.read("data.toks")
133
+
134
+     Returns a list of rows; each row is a list of string values.
135
+     """
136
+     with open(filepath, "r", encoding="utf-8", newline="") as f:
137
+         reader = ToksReader(f)
138
+         return list(reader)
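`ToksReader` above parses the entire file when it is constructed, so iteration itself does not save memory. For inputs too large to hold at once, a generator that reads in chunks is one possible alternative. The sketch below is only an illustration under stated assumptions: `iter_rows` and `chunk_size` are hypothetical names that do not exist in tokensplit, and it locates delimiters with `str.find` rather than the package's sliding window.

```python
import io
from typing import Iterator, List, TextIO


def iter_rows(file_obj: TextIO, chunk_size: int = 4096) -> Iterator[List[str]]:
    """Hypothetical lazy reader: yield one row at a time from a .toks stream."""
    header = file_obj.readline()
    delimiter = header[:-1] if header.endswith("\n") else header
    if not delimiter:
        raise ValueError("missing delimiter on line 1")
    d = len(delimiter)
    buf = ""               # unread text carried over between chunks
    row: List[str] = []
    eof = False
    while True:
        idx = buf.find(delimiter)
        # A match is usable once the character after it is known (or at EOF),
        # because that character decides whether the row ends here.
        if idx != -1 and (eof or len(buf) > idx + d):
            row.append(buf[:idx])
            rest = idx + d
            if buf[rest:rest + 1] == "\n":
                yield row
                row = []
                rest += 1  # consume the row-terminating newline
            buf = buf[rest:]
            continue
        if eof:
            break
        chunk = file_obj.read(chunk_size)
        if chunk:
            buf += chunk
        else:
            eof = True
    # Flush data not closed by <delimiter><newline>, mirroring _parse.
    if buf or row:
        row.append(buf)
        yield row


sample = "/---/\nname/---/age/---/\nline1\nline2/---/x/---/\n"
for r in iter_rows(io.StringIO(sample), chunk_size=3):
    print(r)
```

The buffer is rescanned from its start after each chunk, which keeps the sketch short; a production version would likely remember how far it has already searched.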
@@ -0,0 +1,115 @@
1
+ """
2
+ tokensplit/writer.py — Token-Separated Values writer.
3
+
4
+ File format
5
+ -----------
6
+ Line 1 : the delimiter string followed by a newline
7
+ Line 2+ : rows of values, each value separated by the delimiter,
8
+ each row terminated by <delimiter><newline>
9
+
10
+ Example with delimiter /---/ :
11
+
12
+ /---/
13
+ hello/---/world/---/
14
+ foo/---/bar/---/baz/---/
15
+
16
+ Safety
17
+ ------
18
+ Two kinds of collision are detected and rejected with a ValueError:
19
+
20
+ 1. A value *contains* the delimiter ("hel/---/lo" with delim "/---/")
21
+ 2. A value's suffix + delimiter prefix would create a new delimiter when
22
+ written adjacently (value "aa" + delim "aaa" = "aaaaa" which embeds
23
+ "aaa").
24
+
25
+ In both cases the caller should choose a different delimiter.
26
+ """
27
+
28
+ from typing import List
29
+
30
+
31
+ # ---------------------------------------------------------------------------
32
+ # Internal helpers
33
+ # ---------------------------------------------------------------------------
34
+
35
+ def _validate_value(value: str, delimiter: str):
36
+     """
37
+     Raise ValueError if writing *value* followed by *delimiter* would
38
+     produce a character sequence that embeds the delimiter at an unexpected position.
39
+
40
+     Two checks:
41
+     1. value itself contains the delimiter string.
42
+     2. (value + delimiter) contains the delimiter *before* the intended
43
+        boundary at index len(value), meaning the suffix of value and the
44
+        prefix of delimiter combine to form a spurious delimiter earlier.
45
+     """
46
+     if delimiter in value:
47
+         raise ValueError(
48
+             f"Value {value!r} contains the delimiter {delimiter!r}. "
49
+             "Choose a different delimiter or sanitise your data first."
50
+         )
51
+     combined = value + delimiter
52
+     idx = combined.find(delimiter)
53
+     if idx < len(value):
54
+         raise ValueError(
55
+             f"Value {value!r} ends with a prefix of the delimiter "
56
+             f"{delimiter!r}, creating an ambiguous sequence when written. "
57
+             "Choose a different delimiter."
58
+         )
59
+
60
+
61
+ # ---------------------------------------------------------------------------
62
+ # Public API
63
+ # ---------------------------------------------------------------------------
64
+
65
+ class ToksWriter:
66
+     """
67
+     Write rows to a .toks file.
68
+
69
+     Usage
70
+     -----
71
+         with open("data.toks", "w") as f:
72
+             writer = ToksWriter(f, delimiter="/---/")
73
+             writer.writerow(["hello", "world"])
74
+             writer.writerow(["foo", "bar", "baz"])
75
+
76
+     The delimiter is written automatically to line 1 on construction.
77
+     """
78
+
79
+     def __init__(self, file_obj, delimiter: str):
80
+         if not delimiter:
81
+             raise ValueError("Delimiter must not be empty.")
82
+         if "\n" in delimiter:
83
+             raise ValueError("Delimiter must not contain a newline character.")
84
+         self.delimiter: str = delimiter
85
+         self._file = file_obj
86
+         # Write the delimiter as the very first line.
87
+         file_obj.write(delimiter + "\n")
88
+
89
+     def writerow(self, row: List[str]):
90
+         """
91
+         Write a single row of values.
92
+
93
+         Raises ValueError if any value would corrupt the file (see module
94
+         docstring for details).
95
+         """
96
+         for value in row:
97
+             _validate_value(str(value), self.delimiter)
98
+         self._file.write(self.delimiter.join(str(v) for v in row))
99
+         self._file.write(self.delimiter + "\n")
100
+
101
+     def writerows(self, rows: List[List[str]]):
102
+         """Write multiple rows at once."""
103
+         for row in rows:
104
+             self.writerow(row)
105
+
106
+
107
+ def write(filepath: str, rows: List[List[str]], delimiter: str):
108
+     """
109
+     Convenience function: write all rows to a .toks file in one call.
110
+
111
+         tokensplit.write("data.toks", [["a", "b"], ["c", "d"]], delimiter="/---/")
112
+     """
113
+     with open(filepath, "w", encoding="utf-8", newline="") as f:
114
+         writer = ToksWriter(f, delimiter=delimiter)
115
+         writer.writerows(rows)
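The two collision rules in `_validate_value` can be read as a single condition: the first occurrence of the delimiter in `value + delimiter` must begin exactly where the value ends. The sketch below expresses that as a boolean predicate; `is_safe_value` is a hypothetical name used for illustration and is not part of tokensplit.

```python
def is_safe_value(value: str, delimiter: str) -> bool:
    """Hypothetical predicate: True if value can be followed by delimiter unambiguously."""
    # An earlier hit means the value either contains the delimiter outright
    # or its tail overlaps with the head of the delimiter.
    return (value + delimiter).find(delimiter) == len(value)


print(is_safe_value("hello/--", "/---/"))     # True
print(is_safe_value("un/---/safe", "/---/"))  # False
print(is_safe_value("aa", "aaa"))             # False
```

This predicate covers only the two documented rules and looks at one value in isolation. Reading the `_parse` logic suggests one case it would not catch: a non-first value that begins with a newline lands directly after a delimiter, which the reader treats as a row terminator.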
@@ -0,0 +1,154 @@
1
+ Metadata-Version: 2.4
2
+ Name: tokensplit
3
+ Version: 0.1.0
4
+ Summary: String-separated values with user-defined multi-character delimiters
5
+ Author-email: lost_0 <l05t_0@proton.me>
6
+ License-Expression: MIT
7
+ Classifier: Programming Language :: Python :: 3
8
+ Classifier: Operating System :: OS Independent
9
+ Classifier: Topic :: File Formats
10
+ Requires-Python: >=3.7
11
+ Description-Content-Type: text/markdown
12
+ License-File: LICENSE.txt
13
+ Dynamic: license-file
14
+
15
+ # TokenSplit — Token-Separated Values
16
+
17
+ A lightweight Python package for reading and writing `.toks` files: a plain-text tabular format where you choose your own multi-character delimiter string.
18
+
19
+ ---
20
+
21
+ ## Why?
22
+
23
+ CSV uses a single character (`,`) as a separator, which means commas in your data need escaping or quoting.
24
+ Toks lets you pick any string — `/---/`, `:::`, `<<SEP>>` — that you know won't appear in your data, keeping files simple and unambiguous without any escape sequences.
25
+
26
+ ---
27
+
28
+ ## File format
29
+
30
+ ```
31
+ /---/
32
+ Alice/---/30/---/Engineer/---/
33
+ Bob/---/25/---/Designer/---/
34
+ ```
35
+
36
+ - **Line 1** — the delimiter string (written automatically by the writer)
37
+ - **Every other line** — values separated by the delimiter, with the line ending on `<delimiter><newline>`
38
+
39
+ Newlines *inside* a value are preserved because rows end only on the `<delimiter><newline>` sequence, not on bare newlines.
40
+
41
+ ---
42
+
43
+ ## Installation
44
+
45
+ ```bash
46
+ pip install tokensplit # once published to PyPI
47
+ # or, from source:
48
+ pip install .
49
+ ```
50
+
51
+ ---
52
+
53
+ ## Quick start
54
+
55
+ ### Writing
56
+
57
+ ```python
58
+ import tokensplit
59
+
60
+ # Convenience function
61
+ tokensplit.write("people.toks", [
62
+ ["name", "age", "role"],
63
+ ["Alice", "30", "Engineer"],
64
+ ["Bob", "25", "Designer"],
65
+ ], delimiter="/---/")
66
+ ```
67
+
68
+ ```python
69
+ # Streaming writer — useful for large files
70
+ with open("people.toks", "w") as f:
71
+     writer = tokensplit.ToksWriter(f, delimiter="/---/")
72
+     writer.writerow(["name", "age", "role"])  # header
73
+     writer.writerow(["Alice", "30", "Engineer"])
74
+     writer.writerow(["Bob", "25", "Designer"])
75
+ ```
76
+
77
+ ### Reading
78
+
79
+ ```python
80
+ import tokensplit
81
+
82
+ # Convenience function — returns list of rows
83
+ rows = tokensplit.read("people.toks")
84
+ # [["name", "age", "role"], ["Alice", "30", "Engineer"], ...]
85
+
86
+ # Reader object: rows are iterated one at a time, but the whole file is parsed on construction
87
+ with open("people.toks") as f:
88
+     reader = tokensplit.ToksReader(f)
89
+     print("delimiter:", reader.delimiter)  # "/---/"
90
+     for row in reader:
91
+         print(row)
92
+ ```
93
+
94
+ ---
95
+
96
+ ## Choosing a delimiter
97
+
98
+ Any non-empty string without a newline character works. Good choices:
99
+
100
+ | Delimiter | Good when data contains… |
101
+ |-----------|--------------------------|
102
+ | `/---/` | General text |
103
+ | `\|\|\|` | Paths, URLs |
104
+ | `<<<>>>` | Code snippets |
105
+ | `,,,,` | Numeric CSVs being converted |
106
+ | `:::` | Short labels / IDs |
107
+
108
+ **Two rules enforced by the writer:**
109
+
110
+ 1. A value must not *contain* the delimiter string.
111
+ 2. A value must not end with a prefix of the delimiter in a way that creates an ambiguous sequence when written (e.g. value `"aa"` with delimiter `"aaa"` would produce `"aaaaa"` which embeds an extra delimiter). A `ValueError` is raised in both cases.
112
+
113
+ ---
114
+
115
+ ## API reference
116
+
117
+ ### `tokensplit.write(filepath, rows, delimiter)`
118
+ Write `rows` (list of lists of strings) to `filepath`.
119
+
120
+ ### `tokensplit.read(filepath) → List[List[str]]`
121
+ Read all rows from `filepath`. Returns a list of lists of strings.
122
+
123
+ ### `tokensplit.ToksWriter(file_obj, delimiter)`
124
+ Streaming writer. Call `.writerow(row)` or `.writerows(rows)`.
125
+ The delimiter is written to line 1 of the file on construction.
126
+
127
+ ### `tokensplit.ToksReader(file_obj)`
128
+ Reader. The whole file is read and parsed on construction; iterate with `for row in reader`.
129
+ `.delimiter` attribute exposes the delimiter read from line 1.
130
+
131
+ ---
132
+
133
+ ## Reading algorithm
134
+
135
+ The reader uses a **forward-only sliding window** of exactly `len(delimiter)` characters:
136
+
137
+ ```
138
+ content: h e l l o / - - - / w o r l d / - - - / \n
139
+ window: [ 5 ]
140
+ → slides one character at a time
141
+ match! → emit token, jump window past delimiter
142
+ ```
143
+
144
+ - **Time:** O(n) for a fixed delimiter length; the window advances one character at a time and compares `len(delimiter)` characters per step
145
+ - **Extra space:** O(d) — only the current window lives in memory beyond the content string
146
+ - No regex, no `str.split`, no backtracking
147
+
148
+ ---
149
+
150
+ ## Running tests
151
+
152
+ ```bash
153
+ python -m pytest tests/
154
+ ```
@@ -0,0 +1,11 @@
1
+ LICENSE.txt
2
+ README.md
3
+ pyproject.toml
4
+ tests/test_tokensplit.py
5
+ tokensplit/__init__.py
6
+ tokensplit/reader.py
7
+ tokensplit/writer.py
8
+ tokensplit.egg-info/PKG-INFO
9
+ tokensplit.egg-info/SOURCES.txt
10
+ tokensplit.egg-info/dependency_links.txt
11
+ tokensplit.egg-info/top_level.txt
@@ -0,0 +1 @@
1
+ tokensplit