structify 0.1.0 → 0.3.0

data/README.md CHANGED
@@ -2,220 +2,382 @@
2
2
 
3
3
  [![Gem Version](https://badge.fury.io/rb/structify.svg)](https://badge.fury.io/rb/structify)
4
4
 
5
- Structify is a Ruby gem that provides a simple DSL to define extraction schemas for LLM-powered models. It integrates seamlessly with Rails models, allowing you to specify versioning, assistant prompts, and field definitions, all in a clean, declarative syntax.
5
+ A Ruby gem for extracting structured data from content using LLMs in Rails applications
6
6
 
7
- ## Features
7
+ ## What is Structify?
8
8
 
9
- - 🎯 Simple DSL for defining LLM extraction schemas
10
- - 🔄 Built-in versioning for schema evolution
11
- - 📝 Support for custom assistant prompts
12
- - 🏗️ JSON Schema generation for LLM validation
13
- - 🔌 Seamless Rails/ActiveRecord integration
14
- - 💾 Automatic JSON attribute handling
9
+ Structify helps you extract structured data from unstructured content in your Rails apps:
15
10
 
16
- ## Installation
11
+ - **Define extraction schemas** directly in your ActiveRecord models
12
+ - **Generate JSON schemas** to use with OpenAI, Anthropic, or other LLM providers
13
+ - **Store and validate** extracted data in your models
14
+ - **Access structured data** through typed model attributes
17
15
 
18
- Add this line to your application's Gemfile:
16
+ ## Use Cases
17
+
18
+ - Extract metadata, topics, and sentiment from articles or blog posts
19
+ - Pull structured information from user-generated content
20
+ - Organize unstructured feedback or reviews into categorized data
21
+ - Convert emails or messages into actionable, structured formats
22
+ - Extract entities and relationships from documents
19
23
 
20
24
  ```ruby
21
- gem 'structify'
25
+ # 1. Define extraction schema in your model
26
+ class Article < ApplicationRecord
27
+ include Structify::Model
28
+
29
+ schema_definition do
30
+ field :title, :string
31
+ field :summary, :text
32
+ field :category, :string, enum: ["tech", "business", "science"]
33
+ field :topics, :array, items: { type: "string" }
34
+ end
35
+ end
36
+
37
+ # 2. Get schema for your LLM API
38
+ schema = Article.json_schema
39
+
40
+ # 3. Store LLM response in your model
41
+ article = Article.find(123)
42
+ article.update(llm_response)
43
+
44
+ # 4. Access extracted data
45
+ article.title # => "AI Advances in 2023"
46
+ article.summary # => "Recent developments in artificial intelligence..."
47
+ article.topics # => ["machine learning", "neural networks", "computer vision"]
22
48
  ```
23
49
 
24
- And then execute:
50
+ ## Install
51
+
52
+ ```ruby
53
+ # Add to Gemfile
54
+ gem 'structify'
55
+ ```
25
56
 
57
+ Then:
26
58
  ```bash
27
- $ bundle install
59
+ bundle install
28
60
  ```
29
61
 
30
- Or install it yourself as:
62
+ ## Database Setup
31
63
 
32
- ```bash
33
- $ gem install structify
64
+ Add a JSON column to store extracted data:
65
+
66
+ ```ruby
67
+ add_column :articles, :json_attributes, :jsonb # PostgreSQL (default column name)
68
+ # or
69
+ add_column :articles, :json_attributes, :json # MySQL (default column name)
70
+
71
+ # Or if you configure a custom column name:
72
+ add_column :articles, :custom_json_column, :jsonb # PostgreSQL
34
73
  ```
35
74
 
36
- ## Usage
75
+ ## Configuration
76
+
77
+ Structify can be configured in an initializer:
78
+
79
+ ```ruby
80
+ # config/initializers/structify.rb
81
+ Structify.configure do |config|
82
+ # Configure the default JSON container attribute (default: :json_attributes)
83
+ config.default_container_attribute = :custom_json_column
84
+ end
85
+ ```
37
86
 
38
- ### Basic Example
87
+ ## Usage
39
88
 
40
- Here's a simple example of using Structify in a Rails model:
89
+ ### Define Your Schema
41
90
 
42
91
  ```ruby
43
92
  class Article < ApplicationRecord
44
93
  include Structify::Model
45
94
 
46
95
  schema_definition do
47
- title "Article Extraction"
48
- description "Extract key information from articles"
49
96
  version 1
50
-
51
- assistant_prompt "Extract the following fields from the article content"
52
- llm_model "gpt-4"
53
-
97
+ title "Article Extraction"
98
+
54
99
  field :title, :string, required: true
55
- field :summary, :text, description: "A brief summary of the article"
100
+ field :summary, :text
56
101
  field :category, :string, enum: ["tech", "business", "science"]
102
+ field :topics, :array, items: { type: "string" }
103
+ field :metadata, :object, properties: {
104
+ "author" => { type: "string" },
105
+ "published_at" => { type: "string" }
106
+ }
57
107
  end
58
108
  end
59
109
  ```
60
110
 
61
- ### Advanced Example
111
+ ### Get Schema for LLM API
62
112
 
63
- Here's a more complex example showing all available features:
113
+ Structify generates the JSON schema that you'll need to send to your LLM provider:
64
114
 
65
115
  ```ruby
66
- class EmailSummary < ApplicationRecord
67
- include Structify::Model
68
-
69
- schema_definition do
70
- version 2 # Increment this when making breaking changes
71
- title "Email Thread Extraction"
72
- description "Extracts key information from email threads"
116
+ # Get JSON Schema to send to OpenAI, Anthropic, etc.
117
+ schema = Article.json_schema
118
+ ```
73
119
 
74
- assistant_prompt <<~PROMPT
75
- You are an assistant that extracts concise metadata from email threads.
76
- Focus on producing a clear summary, action items, and sentiment analysis.
77
- If there are multiple participants, include their roles in the conversation.
78
- PROMPT
120
+ ### Integration with LLM Services
79
121
 
80
- llm_model "gpt-4" # Supports any LLM model
122
+ You need to implement the actual LLM integration. Here's how you can integrate with popular services:
81
123
 
82
- # Required fields
83
- field :subject, :string,
84
- required: true,
85
- description: "The main topic or subject of the email thread"
124
+ #### OpenAI Integration Example
86
125
 
87
- field :summary, :text,
88
- required: true,
89
- description: "A concise summary of the entire thread"
126
+ ```ruby
127
+ require "openai"
90
128
 
91
- # Optional fields with enums
92
- field :sentiment, :string,
93
- enum: ["positive", "neutral", "negative"],
94
- description: "The overall sentiment of the conversation"
129
+ class OpenAiExtractor
130
+ def initialize(api_key = ENV["OPENAI_API_KEY"])
131
+ @client = OpenAI::Client.new(access_token: api_key)
132
+ end
133
+
134
+ def extract(content, model_class)
135
+ # Get schema from Structify model
136
+ schema = model_class.json_schema
137
+
138
+ # Call OpenAI with structured outputs
139
+ response = @client.chat(
140
+ parameters: {
141
+ model: "gpt-4o",
142
+ response_format: { type: "json_object", schema: schema },
143
+ messages: [
144
+ { role: "system", content: "Extract structured information from the provided content." },
145
+ { role: "user", content: content }
146
+ ]
147
+ }
148
+ )
149
+
150
+ # Parse and return the structured data
151
+ JSON.parse(response.dig("choices", 0, "message", "content"), symbolize_names: true)
152
+ end
153
+ end
95
154
 
96
- field :priority, :string,
97
- enum: ["high", "medium", "low"],
98
- description: "The priority level based on content and tone"
155
+ # Usage
156
+ extractor = OpenAiExtractor.new
157
+ article = Article.find(123)
158
+ extracted_data = extractor.extract(article.content, Article)
159
+ article.update(extracted_data)
160
+ ```
99
161
 
100
- # Complex fields
101
- field :participants, :json,
102
- description: "List of participants and their roles"
162
+ #### Anthropic Integration Example
103
163
 
104
- field :action_items, :json,
105
- description: "Array of action items extracted from the thread"
164
+ ```ruby
165
+ require "anthropic"
106
166
 
107
- field :next_steps, :string,
108
- description: "Recommended next steps based on the thread"
167
+ class AnthropicExtractor
168
+ def initialize(api_key = ENV["ANTHROPIC_API_KEY"])
169
+ @client = Anthropic::Client.new(api_key: api_key)
170
+ end
171
+
172
+ def extract(content, model_class)
173
+ # Get schema from Structify model
174
+ schema = model_class.json_schema
175
+
176
+ # Call Claude with tool use
177
+ response = @client.messages.create(
178
+ model: "claude-3-opus-20240229",
179
+ max_tokens: 1000,
180
+ system: "Extract structured data based on the provided schema.",
181
+ messages: [{ role: "user", content: content }],
182
+ tools: [{
183
+ type: "function",
184
+ function: {
185
+ name: "extract_data",
186
+ description: "Extract structured data from content",
187
+ parameters: schema
188
+ }
189
+ }],
190
+ tool_choice: { type: "function", function: { name: "extract_data" } }
191
+ )
192
+
193
+ # Parse and return structured data
194
+ JSON.parse(response.content[0].tools[0].function.arguments, symbolize_names: true)
109
195
  end
110
-
111
- # You can still use regular ActiveRecord features
112
- validates :subject, presence: true
113
- validates :summary, length: { minimum: 10 }
114
196
  end
115
197
  ```
116
198
 
117
- ### Accessing Schema Information
118
-
119
- Structify provides several helper methods to access schema information:
199
+ ### Store & Access Extracted Data
120
200
 
121
201
  ```ruby
122
- # Get the JSON Schema
123
- EmailSummary.json_schema
124
- # => {
125
- # name: "Email Thread Extraction",
126
- # description: "Extracts key information from email threads",
127
- # parameters: {
128
- # type: "object",
129
- # required: ["subject", "summary"],
130
- # properties: {
131
- # subject: { type: "string" },
132
- # summary: { type: "text" },
133
- # sentiment: {
134
- # type: "string",
135
- # enum: ["positive", "neutral", "negative"]
136
- # },
137
- # # ...
138
- # }
139
- # }
140
- # }
202
+ # Store LLM response in your model
203
+ article.update(response)
141
204
 
142
- # Get the current version
143
- EmailSummary.extraction_version # => 2
205
+ # Access via model attributes
206
+ article.title # => "How AI is Changing Healthcare"
207
+ article.category # => "tech"
208
+ article.topics # => ["machine learning", "healthcare"]
144
209
 
145
- # Get the assistant prompt
146
- EmailSummary.extraction_assistant_prompt
147
- # => "You are an assistant that extracts concise metadata..."
148
-
149
- # Get the LLM model
150
- EmailSummary.extraction_llm_model # => "gpt-4"
210
+ # All data is in the JSON column (default column name: json_attributes)
211
+ article.json_attributes # => The complete JSON
151
212
  ```
152
213
 
153
- ### Working with Extracted Data
214
+ ## Field Types
154
215
 
155
- Structify uses the `attr_json` gem to handle JSON attributes. All fields are stored in the `extracted_data` JSON column:
216
+ Structify supports all standard JSON Schema types:
156
217
 
157
218
  ```ruby
158
- # Create a new record with extracted data
159
- summary = EmailSummary.create(
160
- subject: "Project Update",
161
- summary: "Team discussed Q2 goals",
162
- sentiment: "positive",
163
- priority: "high",
164
- participants: [
165
- { name: "Alice", role: "presenter" },
166
- { name: "Bob", role: "reviewer" }
167
- ]
168
- )
219
+ field :name, :string # String values
220
+ field :count, :integer # Integer values
221
+ field :price, :number # Numeric values (float/int)
222
+ field :active, :boolean # Boolean values
223
+ field :metadata, :object # JSON objects
224
+ field :tags, :array # Arrays
225
+ ```
169
226
 
170
- # Access fields directly
171
- summary.subject # => "Project Update"
172
- summary.sentiment # => "positive"
173
- summary.participants # => [{ name: "Alice", ... }]
227
+ ## Field Options
174
228
 
175
- # Validate enum values
176
- summary.sentiment = "invalid"
177
- summary.valid? # => false
229
+ ```ruby
230
+ # Required fields
231
+ field :title, :string, required: true
232
+
233
+ # Enum values
234
+ field :status, :string, enum: ["draft", "published", "archived"]
235
+
236
+ # Array constraints
237
+ field :tags, :array,
238
+ items: { type: "string" },
239
+ min_items: 1,
240
+ max_items: 5,
241
+ unique_items: true
242
+
243
+ # Nested objects
244
+ field :author, :object, properties: {
245
+ "name" => { type: "string", required: true },
246
+ "email" => { type: "string" }
247
+ }
178
248
  ```
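The field options above correspond closely to standard JSON Schema keywords. As a rough illustration only, the plain-Ruby sketch below shows one way such options could be mapped onto a schema hash. It is not Structify's implementation: `build_schema` and the hash shape used for each field are hypothetical, and the gem's real `json_schema` output may differ.

```ruby
# Hypothetical helper (not part of Structify): turns a list of field
# definitions into a JSON Schema hash, to illustrate how options such as
# required:, enum:, and items: could map onto JSON Schema keywords.
def build_schema(fields)
  properties = {}
  required = []

  fields.each do |field|
    prop = { "type" => field[:type].to_s }
    prop["enum"] = field[:enum] if field[:enum]
    prop["items"] = field[:items] if field[:items]
    prop["minItems"] = field[:min_items] if field[:min_items]
    prop["maxItems"] = field[:max_items] if field[:max_items]
    prop["uniqueItems"] = true if field[:unique_items]

    properties[field[:name].to_s] = prop
    required << field[:name].to_s if field[:required]
  end

  { "type" => "object", "properties" => properties, "required" => required }
end

schema = build_schema([
  { name: :title, type: :string, required: true },
  { name: :status, type: :string, enum: %w[draft published archived] },
  { name: :tags, type: :array, items: { "type" => "string" },
    min_items: 1, max_items: 5, unique_items: true }
])
```

Here `schema["required"]` lists only `title`, and the `tags` property carries the array constraints under their JSON Schema names (`minItems`, `maxItems`, `uniqueItems`).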
179
249
 
180
- ## Database Setup
250
+ ## Chain of Thought Mode
181
251
 
182
- Ensure your model has a JSON column named `extracted_data`:
252
+ Structify supports a "thinking" mode that automatically requests chain of thought reasoning from the LLM:
183
253
 
184
254
  ```ruby
185
- class CreateEmailSummaries < ActiveRecord::Migration[7.1]
186
- def change
187
- create_table :email_summaries do |t|
188
- t.json :extracted_data # Required by Structify
189
- t.timestamps
190
- end
191
- end
255
+ schema_definition do
256
+ version 1
257
+ thinking true # Enable chain of thought reasoning
258
+
259
+ field :title, :string, required: true
260
+ # other fields...
192
261
  end
193
262
  ```
194
263
 
195
- ## Development
264
+ Chain of thought (COT) reasoning is beneficial because it:
265
+ - Adds more context to the extraction process
266
+ - Helps the LLM think through problems more systematically
267
+ - Improves accuracy for complex extractions
268
+ - Makes the reasoning process transparent and explainable
269
+ - Reduces hallucinations by forcing step-by-step thinking
196
270
 
197
- After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
271
+ This is especially useful when:
272
+ - Answers need more detailed information
273
+ - Questions require multi-step reasoning
274
+ - Extractions involve complex decision-making
275
+ - You need to understand how the LLM reached its conclusions
198
276
 
199
- To install this gem onto your local machine, run `bundle exec rake install`.
277
+ For best results, include instructions for COT in your base system prompt:
200
278
 
201
- ## Contributing
279
+ ```ruby
280
+ system_prompt = "Extract structured data from the content.
281
+ For each field, think step by step before determining the value."
282
+ ```
202
283
 
203
- 1. Fork it
204
- 2. Create your feature branch (`git checkout -b feature/my-new-feature`)
205
- 3. Commit your changes (`git commit -am 'Add some feature'`)
206
- 4. Push to the branch (`git push origin feature/my-new-feature`)
207
- 5. Create a new Pull Request
284
+ You can generate effective chain of thought prompts using tools like the [Claude Prompt Designer](https://console.anthropic.com/dashboard).
208
285
 
209
- Bug reports and pull requests are welcome on GitHub at https://github.com/kieranklaassen/structify.
286
+ ## Schema Versioning and Field Lifecycle
210
287
 
211
- ## License
288
+ Structify provides a simple field lifecycle management system using a `versions` parameter:
289
+
290
+ ```ruby
291
+ schema_definition do
292
+ version 3
293
+
294
+ # Fields for specific version ranges
295
+ field :title, :string # Available in all versions (default behavior)
296
+ field :legacy, :string, versions: 1...3 # Only in versions 1-2 (removed in v3)
297
+ field :summary, :text, versions: 2 # Added in version 2 onwards
298
+ field :content, :text, versions: 2.. # Added in version 2 onwards (endless range)
299
+ field :temp_field, :string, versions: 2..3 # Only in versions 2-3
300
+ field :special, :string, versions: [1, 3, 5] # Only in versions 1, 3, and 5
301
+ end
302
+ ```
303
+
304
+ ### Version Range Syntax
212
305
 
213
- The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
306
+ Structify supports several ways to specify which versions a field is available in:
214
307
 
215
- ## Code of Conduct
308
+ | Syntax | Example | Meaning |
309
+ |--------|---------|---------|
310
+ | No version specified | `field :title, :string` | Available in all versions (default) |
311
+ | Single integer | `versions: 2` | Available from version 2 onwards |
312
+ | Range (inclusive) | `versions: 1..3` | Available in versions 1, 2, and 3 |
313
+ | Range (exclusive) | `versions: 1...3` | Available in versions 1 and 2 (not 3) |
314
+ | Endless range | `versions: 2..` | Available from version 2 onwards |
315
+ | Array | `versions: [1, 4, 7]` | Only available in versions 1, 4, and 7 |
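The rules in the table can be expressed as a small predicate. The sketch below is a plain-Ruby reading of those rules under the assumption that a single integer means "from that version onwards", as the table states. `available_in?` is a hypothetical name and is not part of Structify's API.

```ruby
# Hypothetical helper mirroring the table above; not Structify's own code.
def available_in?(versions, version)
  case versions
  when nil     then true                      # no versions: option, always available
  when Integer then version >= versions       # single integer: that version onwards
  when Range   then versions.cover?(version)  # inclusive, exclusive, and endless ranges
  when Array   then versions.include?(version)
  else raise ArgumentError, "unsupported versions spec: #{versions.inspect}"
  end
end

available_in?(2, 3)          # single integer
available_in?(1...3, 3)      # exclusive range
available_in?((2..), 10)     # endless range
available_in?([1, 4, 7], 4)  # array
```

`Range#cover?` respects the exclusive end of `1...3`, so version 3 is rejected there while `1..3` accepts it.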
216
316
 
217
- Everyone interacting in the Structify project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](CODE_OF_CONDUCT.md).
317
+ ### Handling Records with Different Versions
318
+
319
+ ```ruby
320
+ # Create a record with version 1 schema
321
+ article_v1 = Article.create(title: "Original Article")
218
322
 
323
+ # Access with version 3 schema
324
+ article_v3 = Article.find(article_v1.id)
325
+
326
+ # Fields from v1 are still accessible
327
+ article_v3.title # => "Original Article"
328
+
329
+ # Fields not in v1 raise errors
330
+ article_v3.summary # => VersionRangeError: Field 'summary' is not available in version 1.
331
+ # This field is only available in versions: 2 to 999.
332
+
333
+ # Check version compatibility
334
+ article_v3.version_compatible_with?(3) # => false
335
+ article_v3.version_compatible_with?(1) # => true
336
+
337
+ # Upgrade record to version 3
338
+ article_v3.summary = "Added in v3"
339
+ article_v3.save! # Record version is automatically updated to 3
219
340
  ```
220
341
 
342
+ ### Accessing the Container Attribute
343
+
344
+ The JSON container attribute can be accessed directly:
345
+
346
+ ```ruby
347
+ # Using the default container attribute :json_attributes
348
+ article.json_attributes # => { "title" => "My Title", "version" => 1, ... }
349
+
350
+ # If you've configured a custom container attribute
351
+ article.custom_json_column # => { "title" => "My Title", "version" => 1, ... }
221
352
  ```
353
+
354
+
355
+ ## Understanding Structify's Role
356
+
357
+ Structify is designed as a **bridge** between your Rails models and LLM extraction services:
358
+
359
+ ### What Structify Does For You
360
+
361
+ - ✅ **Define extraction schemas** directly in your ActiveRecord models
362
+ - ✅ **Generate compatible JSON schemas** for OpenAI, Anthropic, and other LLM providers
363
+ - ✅ **Store and validate** extracted data against your schema
364
+ - ✅ **Provide typed access** to extracted fields through your models
365
+ - ✅ **Handle schema versioning** and backward compatibility
366
+ - ✅ **Support chain of thought reasoning** with the thinking mode option
367
+
368
+ ### What You Need To Implement
369
+
370
+ - 🔧 **API integration** with your chosen LLM provider (see examples above)
371
+ - 🔧 **Processing logic** for when and how to extract data
372
+ - 🔧 **Authentication** and API key management
373
+ - 🔧 **Error handling and retries** for API calls
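Since error handling and retries are left to the application, here is a minimal sketch of one common approach: retrying a block with exponential backoff. The helper name `with_retries`, its parameters, and the default delays are all hypothetical, and the failing block stands in for a real LLM API call.

```ruby
# Hypothetical retry helper; the name, attempt count, and delays are illustrative.
def with_retries(max_attempts: 3, base_delay: 0.5, retry_on: [StandardError])
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue *retry_on
    raise if attempt >= max_attempts        # give up and re-raise the last error
    sleep(base_delay * (2**(attempt - 1)))  # exponential backoff between attempts
    retry
  end
end

# Usage: the block fails twice, then succeeds on the third attempt.
calls = 0
result = with_retries(max_attempts: 3, base_delay: 0) do |attempt|
  calls += 1
  raise IOError, "transient failure" if attempt < 3 # stand-in for a flaky API call
  { title: "ok" }
end
```

In a real application you would likely narrow `retry_on` to the timeout and rate-limit errors raised by your HTTP or provider client rather than retrying every `StandardError`.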
374
+
375
+ This separation of concerns allows you to:
376
+ 1. Use any LLM provider and model you prefer
377
+ 2. Implement extraction logic specific to your application
378
+ 3. Handle API access in a way that fits your application architecture
379
+ 4. Change LLM providers without changing your data model
380
+
381
+ ## License
382
+
383
+ [MIT License](https://opensource.org/licenses/MIT)