semantic-compressor 1.2__py3-none-any.whl → 1.3__py3-none-any.whl

Sign up to get free protection for your applications and to get access to all the features.
compressor/semantic.py CHANGED
@@ -213,8 +213,6 @@ def compress_text(text, *, target_token_count=None, compression_rate=0.7, refere
213
213
  try:
214
214
  if target_token_count is None:
215
215
  compression_rate = 1 - compression_rate
216
- original_token_count = count_tokens(text)
217
- target_token_count = int(original_token_count * compression_rate)
218
216
  else:
219
217
  original_token_count = count_tokens(text)
220
218
  if original_token_count <= target_token_count:
@@ -233,8 +231,9 @@ def compress_text(text, *, target_token_count=None, compression_rate=0.7, refere
233
231
  return text
234
232
 
235
233
  def find_needle_in_haystack(
236
- *, haystack: str, needle: str, block_size = 350,
237
- semantic_embeddings_weight: float = 0.3, textual_embeddings_weight: float = 0.7
234
+ *, haystack: str, needle: str, block_size = 300,
235
+ semantic_embeddings_weight: float = 0.3,
236
+ textual_embeddings_weight: float = 0.7
238
237
  ):
239
238
  """
240
239
  Finds the string block in the haystack that contains the needle.
@@ -275,7 +274,9 @@ def find_needle_in_haystack(
275
274
  # Find the index of the needle in all the blocks
276
275
  most_similar_block_index = blocks.index(most_similar_block)
277
276
 
278
- needle_region = blocks[most_similar_block_index-1:most_similar_block_index+2]
277
+ start_index = most_similar_block_index-1 if most_similar_block_index > 0 else 0
278
+
279
+ needle_region = blocks[start_index:most_similar_block_index+2]
279
280
 
280
281
  return ''.join(needle_region).strip()
281
282
  except Exception:
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.1
2
2
  Name: semantic_compressor
3
- Version: 1.2
3
+ Version: 1.3
4
4
  Author: Carlo Moro
5
5
  Author-email: Carlo Moro <cnmoro@gmail.com>
6
6
  Classifier: Programming Language :: Python :: 3
@@ -17,7 +17,7 @@ Requires-Dist: onnxruntime
17
17
  Requires-Dist: onnxruntime-extensions
18
18
 
19
19
  ```python
20
- from compressor.semantic import compress_text
20
+ from compressor.semantic import compress_text, find_needle_in_haystack
21
21
 
22
22
  text = """
23
23
  Akin to France's heartier, spicier, richer boeuf bourguignon, "alcatra" is synonymous with a single island in the remote Azores archipelago. The Azores, an archipelago of nine islands belonging to Portugal and located roughly between Europe and the US, are cow country. They're said to be home to more cattle than people, and despite being home to less than 3% of Portugal's population, the islands produce 30% of Portugal's dairy products and 13% of its beef. Beef is part of everyday life in the Azores, and come spring on one particular island, the ingredient even crosses paths with religion. In the days following Easter, Azorean people kick off a series of religious celebrations called Festas do Espírito Santo (Festivals of the Holy Spirit). During the 13th Century, a Catholic sect called the Cult of the Holy Spirit predicted a utopian era on Earth. This fringe faith was discouraged in mainland Europe but lived on in these remote islands in the middle of the Atlantic Ocean. The sect was also promoted by Portuguese queen Elizabeth of Aragon (also known as Elizabeth of Portugal), who was known for her charity. Over the subsequent centuries, a series of festivals emerged on the Azores that blended these utopian aspirations with the queen's alleged generosity. Between Easter and the week following Whitsunday, a total of eight weeks, the islands host a series of parades and other cultural and religious festivals that revolve around brightly coloured community houses called impérios. During this time, the community houses also collect donations from locals, which is then redistributed to people in the form of bread, beef and wine. These three elements generally come together in the form of a soup, called sopa do Espírito Santo, that's served at the impérios during the festivals. But on the island of Terceira, locals combine these ingredients in a different and delicious way, one that's become synonymous with the island's culinary identity. Austin Bush The Festas do Espírito Santo revolve around community houses called impérios (Credit: Austin Bush)Austin Bush The Festas do Espírito Santo revolve around community houses called impérios (Credit: Austin Bush) "People eat alcatra year round, but especially during the celebrations in spring and summer," explains Duarte Fournier. He is the Grand Master of the Brotherhood of Alcatra, a culinary fraternity on Terceira, and is telling me about the island's signature dish: cuts of beef braised in local wine, smoked pork fat and dried spices, resulting in something of a heartier, spicier, richer version of France's famed boeuf bourguignon. We're sitting at a cafe in Angra do Heroísmo, Terceira's largest city, and as we chat, children race to and from a nearby império delivering massive trays of raw beef to neighbours. Fournier tells me that alcatra likely has its origins in northern Portugal, where there's a tradition of baking goat in wine. "We don't know why it's called alcatra," he says. "We suppose it's from Arabic. Al catar means 'small pieces of meat'." According to Fournier, alcatra differs from mainland Portugal's baked meat dishes in that it includes dried spices, generally allspice and black peppercorns, but also sometimes clove or cinnamon.
@@ -30,4 +30,15 @@ print(compressed_text_90_percent)
30
30
  compressed_text_to_100_tokens = compress_text(text, target_token_count=100)
31
31
  print(compressed_text_to_100_tokens)
32
32
  # 'Akin to France\'s heartier, spicier, richer boeuf bourguignon, "alcatra" is synonymous with a single island in the remote Azores archipelago.'
33
+
34
+ text_reference = "Archipelago Islands"
35
+ compressed_text_with_steering = compress_text(text, compression_rate=0.7, reference_text_steering=text_reference)
36
+ # 'Akin to france\'s heartier, spicier, richer boeuf bourguignon, "alcatra" is synonymous with a single island in the remote azores archipelago. The azores, an archipelago of nine islands belonging to portugal and located roughly between europe and the us, are cow country. Beef is part of everyday life in the azores, and come spring on one particular island, the ingredient even crosses paths with religion. But on the island of terceira, locals combine these ingredients in a different and delicious way, one that\'s become synonymous with the island\'s culinary identity. He is the grand master of the brotherhood of alcatra, a culinary fraternity on terceira, and is telling me about the island\'s signature dish: cuts of beef braised in local wine, smoked pork fat and dried spices, resulting in something of a heartier, spicier, richer version of france\'s famed boeuf bourguignon.'
37
+
38
+ needle_in_haystack = find_needle_in_haystack(
39
+ haystack = text,
40
+ needle = "Archipelago Islands",
41
+ block_size = 200
42
+ )
43
+ # 'Akin to France\'s heartier, spicier, richer boeuf bourguignon, "alcatra" is synonymous with a single island in the remote Azores archipelago. The Azores, an archipelago of nine islands belonging to Portugal and located roughly between Europe and the US, are cow country. They\'re said to be home to more cattle than people, and despite being home to less than 3% of Portugal\'s population, the islands'
33
44
  ```
@@ -1,5 +1,5 @@
1
1
  compressor/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
2
- compressor/semantic.py,sha256=UUskzKy3Uj90UMDC_zRbPgCr30IxANCUuO1h0nWkKHU,12429
2
+ compressor/semantic.py,sha256=8MQdV-ZmTBMC3sEIQr565hafxS6v5_A_-dNiqb0R5Xg,12379
3
3
  compressor/minbpe/__init__.py,sha256=wZ1z2QKkncvGgiZDBc91AP5m7-M-MVenPStKbS6xylE,95
4
4
  compressor/minbpe/base.py,sha256=tTKag04RRFnc4ppoieBbDV0V6thzi_ZvZTlhOYIoY7Q,6881
5
5
  compressor/minbpe/basic.py,sha256=0kD4tU8l2MZegfPaHMfDo5CnaSzb9i1v9tDBy6GwMbg,2883
@@ -8,8 +8,8 @@ compressor/resources/embedding_model.onnx,sha256=uLBbAfCGEJTwR1yyiK0bMDroruLr6W5
8
8
  compressor/resources/en_stopwords.pkl,sha256=Q2PyGQnphPUs_jxN9NMSqp2EQjYv4b4oMJY2aMYvbSY,1310
9
9
  compressor/resources/lid.176.ftz,sha256=jzRyz-hzintgmejpmcPL-uDc0VaWqsfXc4qAOdtgPoM,938013
10
10
  compressor/resources/pt_stopwords.pkl,sha256=-9bJaxJWjeOFxLHLT9D-rI3XTzGC0iLJfMiwBDnkCYI,1716
11
- semantic_compressor-1.2.dist-info/LICENSE,sha256=DFRihXonZ3qVRaTrzuXNaDI_-h2jyT2SqWqjtTDHfqI,1067
12
- semantic_compressor-1.2.dist-info/METADATA,sha256=s9pltj6AtpXW6OEcZE1h3W8OPYks_PbhdCbJDR9e5b0,4545
13
- semantic_compressor-1.2.dist-info/WHEEL,sha256=R0nc6qTxuoLk7ShA2_Y-UWkN8ZdfDBG2B6Eqpz2WXbs,91
14
- semantic_compressor-1.2.dist-info/top_level.txt,sha256=qb2SlKrEmMrQDVrhwxu3Wr7U6JupPXtDGrJpIQr8xSc,11
15
- semantic_compressor-1.2.dist-info/RECORD,,
11
+ semantic_compressor-1.3.dist-info/LICENSE,sha256=DFRihXonZ3qVRaTrzuXNaDI_-h2jyT2SqWqjtTDHfqI,1067
12
+ semantic_compressor-1.3.dist-info/METADATA,sha256=baw_1lughU6R-9nQ_23COy7DP70ZI6H_DlH3YvrYBRU,6148
13
+ semantic_compressor-1.3.dist-info/WHEEL,sha256=R0nc6qTxuoLk7ShA2_Y-UWkN8ZdfDBG2B6Eqpz2WXbs,91
14
+ semantic_compressor-1.3.dist-info/top_level.txt,sha256=qb2SlKrEmMrQDVrhwxu3Wr7U6JupPXtDGrJpIQr8xSc,11
15
+ semantic_compressor-1.3.dist-info/RECORD,,