structify 0.1.0 → 0.3.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +22 -0
- data/CLAUDE.md +27 -0
- data/Gemfile +2 -2
- data/Gemfile.lock +26 -25
- data/README.md +301 -139
- data/lib/structify/model.rb +304 -60
- data/lib/structify/schema_serializer.rb +165 -0
- data/lib/structify/version.rb +1 -1
- data/lib/structify.rb +90 -4
- data/structify.gemspec +2 -2
- metadata +7 -4
data/README.md
CHANGED
@@ -2,220 +2,382 @@
 
 [](https://badge.fury.io/rb/structify)
 
-
+A Ruby gem for extracting structured data from content using LLMs in Rails applications
 
-##
+## What is Structify?
 
-
-- Built-in versioning for schema evolution
-- Support for custom assistant prompts
-- JSON Schema generation for LLM validation
-- Seamless Rails/ActiveRecord integration
-- Automatic JSON attribute handling
+Structify helps you extract structured data from unstructured content in your Rails apps:
 
-
+- **Define extraction schemas** directly in your ActiveRecord models
+- **Generate JSON schemas** to use with OpenAI, Anthropic, or other LLM providers
+- **Store and validate** extracted data in your models
+- **Access structured data** through typed model attributes
 
-
+## Use Cases
+
+- Extract metadata, topics, and sentiment from articles or blog posts
+- Pull structured information from user-generated content
+- Organize unstructured feedback or reviews into categorized data
+- Convert emails or messages into actionable, structured formats
+- Extract entities and relationships from documents
 
 ```ruby
-
+# 1. Define extraction schema in your model
+class Article < ApplicationRecord
+  include Structify::Model
+
+  schema_definition do
+    field :title, :string
+    field :summary, :text
+    field :category, :string, enum: ["tech", "business", "science"]
+    field :topics, :array, items: { type: "string" }
+  end
+end
+
+# 2. Get schema for your LLM API
+schema = Article.json_schema
+
+# 3. Store LLM response in your model
+article = Article.find(123)
+article.update(llm_response)
+
+# 4. Access extracted data
+article.title # => "AI Advances in 2023"
+article.summary # => "Recent developments in artificial intelligence..."
+article.topics # => ["machine learning", "neural networks", "computer vision"]
 ```
 
-
+## Install
+
+```ruby
+# Add to Gemfile
+gem 'structify'
+```
 
+Then:
 ```bash
-
+bundle install
 ```
 
-
+## Database Setup
 
-
-
+Add a JSON column to store extracted data:
+
+```ruby
+add_column :articles, :json_attributes, :jsonb # PostgreSQL (default column name)
+# or
+add_column :articles, :json_attributes, :json # MySQL (default column name)
+
+# Or if you configure a custom column name:
+add_column :articles, :custom_json_column, :jsonb # PostgreSQL
 ```
 
-##
+## Configuration
+
+Structify can be configured in an initializer:
+
+```ruby
+# config/initializers/structify.rb
+Structify.configure do |config|
+  # Configure the default JSON container attribute (default: :json_attributes)
+  config.default_container_attribute = :custom_json_column
+end
+```
 
-
+## Usage
 
-
+### Define Your Schema
 
 ```ruby
 class Article < ApplicationRecord
   include Structify::Model
 
   schema_definition do
-    title "Article Extraction"
-    description "Extract key information from articles"
     version 1
-
-
-    llm_model "gpt-4"
-
+    title "Article Extraction"
+
     field :title, :string, required: true
-    field :summary, :text
+    field :summary, :text
     field :category, :string, enum: ["tech", "business", "science"]
+    field :topics, :array, items: { type: "string" }
+    field :metadata, :object, properties: {
+      "author" => { type: "string" },
+      "published_at" => { type: "string" }
+    }
   end
 end
 ```
 
-###
+### Get Schema for LLM API
 
-
+Structify generates the JSON schema that you'll need to send to your LLM provider:
 
 ```ruby
-
-
-
-  schema_definition do
-    version 2 # Increment this when making breaking changes
-    title "Email Thread Extraction"
-    description "Extracts key information from email threads"
+# Get JSON Schema to send to OpenAI, Anthropic, etc.
+schema = Article.json_schema
+```
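
To make the shape of such a schema concrete, the sketch below builds a JSON-Schema-style hash by hand from a list of field definitions in plain Ruby. It is not Structify's implementation: the `build_schema` helper and the `FIELDS` list are hypothetical, and the real `Article.json_schema` output may be structured differently (for example, wrapped with a name and description).

```ruby
require "json"

# Hypothetical field list mirroring the Article schema in the README.
# JSON Schema has no "text" type, so this sketch maps :summary to "string".
FIELDS = [
  { name: :title,    type: "string", required: true },
  { name: :summary,  type: "string" },
  { name: :category, type: "string", enum: %w[tech business science] },
  { name: :topics,   type: "array",  items: { type: "string" } }
].freeze

# Build a JSON-Schema-style object description from the field list.
def build_schema(fields)
  properties = fields.to_h do |field|
    prop = { type: field[:type] }
    prop[:enum]  = field[:enum]  if field[:enum]
    prop[:items] = field[:items] if field[:items]
    [field[:name].to_s, prop]
  end

  {
    type: "object",
    properties: properties,
    required: fields.select { |f| f[:required] }.map { |f| f[:name].to_s }
  }
end

puts JSON.pretty_generate(build_schema(FIELDS))
```

Keeping the schema as plain data like this is what lets the same definition be passed to different providers, each of which expects it under a different request key.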
 
-
-      You are an assistant that extracts concise metadata from email threads.
-      Focus on producing a clear summary, action items, and sentiment analysis.
-      If there are multiple participants, include their roles in the conversation.
-    PROMPT
+### Integration with LLM Services
 
-
+You need to implement the actual LLM integration. Here's how you can integrate with popular services:
 
-
-    field :subject, :string,
-      required: true,
-      description: "The main topic or subject of the email thread"
+#### OpenAI Integration Example
 
-
-
-      description: "A concise summary of the entire thread"
+```ruby
+require "openai"
 
-
-
-
-
+class OpenAiExtractor
+  def initialize(api_key = ENV["OPENAI_API_KEY"])
+    @client = OpenAI::Client.new(access_token: api_key)
+  end
+
+  def extract(content, model_class)
+    # Get schema from Structify model
+    schema = model_class.json_schema
+
+    # Call OpenAI with structured outputs
+    response = @client.chat(
+      parameters: {
+        model: "gpt-4o",
+        response_format: { type: "json_object", schema: schema },
+        messages: [
+          { role: "system", content: "Extract structured information from the provided content." },
+          { role: "user", content: content }
+        ]
+      }
+    )
+
+    # Parse and return the structured data
+    JSON.parse(response.dig("choices", 0, "message", "content"), symbolize_names: true)
+  end
+end
 
-
-
-
+# Usage
+extractor = OpenAiExtractor.new
+article = Article.find(123)
+extracted_data = extractor.extract(article.content, Article)
+article.update(extracted_data)
+```
 
-
-    field :participants, :json,
-      description: "List of participants and their roles"
+#### Anthropic Integration Example
 
-
-
+```ruby
+require "anthropic"
 
-
-
+class AnthropicExtractor
+  def initialize(api_key = ENV["ANTHROPIC_API_KEY"])
+    @client = Anthropic::Client.new(api_key: api_key)
+  end
+
+  def extract(content, model_class)
+    # Get schema from Structify model
+    schema = model_class.json_schema
+
+    # Call Claude with tool use
+    response = @client.messages.create(
+      model: "claude-3-opus-20240229",
+      max_tokens: 1000,
+      system: "Extract structured data based on the provided schema.",
+      messages: [{ role: "user", content: content }],
+      tools: [{
+        type: "function",
+        function: {
+          name: "extract_data",
+          description: "Extract structured data from content",
+          parameters: schema
+        }
+      }],
+      tool_choice: { type: "function", function: { name: "extract_data" } }
+    )
+
+    # Parse and return structured data
+    JSON.parse(response.content[0].tools[0].function.arguments, symbolize_names: true)
   end
-
-  # You can still use regular ActiveRecord features
-  validates :subject, presence: true
-  validates :summary, length: { minimum: 10 }
 end
 ```
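
Anthropic's Messages API returns forced tool calls as `tool_use` content blocks whose `input` field already holds the parsed arguments, which differs from the OpenAI-style `function.arguments` path used in the Anthropic example above. The following is a minimal sketch of reading such a block from a response that has already been parsed into a plain Hash. It uses a hand-written sample response instead of a live call; `extract_tool_input` is a hypothetical helper, and client gems may wrap the response in their own objects.

```ruby
require "json"

# Find the first tool_use block for the given tool and return its input hash.
def extract_tool_input(response, tool_name)
  block = response.fetch("content", []).find do |b|
    b["type"] == "tool_use" && b["name"] == tool_name
  end
  raise KeyError, "no tool_use block for #{tool_name}" unless block

  # Symbolize top-level keys so the hash can be passed to `update`.
  block.fetch("input").transform_keys(&:to_sym)
end

# Hand-written sample shaped like a Messages API response body.
sample = JSON.parse(<<~JSON)
  {
    "content": [
      { "type": "text", "text": "Extracting fields." },
      { "type": "tool_use", "id": "toolu_example", "name": "extract_data",
        "input": { "title": "AI Advances", "topics": ["machine learning"] } }
    ]
  }
JSON

p extract_tool_input(sample, "extract_data")
```

Searching by block type and tool name, instead of indexing `content[0]`, avoids breaking when the model emits a text block before the tool call.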
 
-###
-
-Structify provides several helper methods to access schema information:
+### Store & Access Extracted Data
 
 ```ruby
-#
-
-# => {
-#   name: "Email Thread Extraction",
-#   description: "Extracts key information from email threads",
-#   parameters: {
-#     type: "object",
-#     required: ["subject", "summary"],
-#     properties: {
-#       subject: { type: "string" },
-#       summary: { type: "text" },
-#       sentiment: {
-#         type: "string",
-#         enum: ["positive", "neutral", "negative"]
-#       },
-#       # ...
-#     }
-#   }
-# }
+# Store LLM response in your model
+article.update(response)
 
-#
-
+# Access via model attributes
+article.title # => "How AI is Changing Healthcare"
+article.category # => "tech"
+article.topics # => ["machine learning", "healthcare"]
 
-#
-
-# => "You are an assistant that extracts concise metadata..."
-
-# Get the LLM model
-EmailSummary.extraction_llm_model # => "gpt-4"
+# All data is in the JSON column (default column name: json_attributes)
+article.json_attributes # => The complete JSON
 ```
 
-
+## Field Types
 
-Structify
+Structify supports all standard JSON Schema types:
 
 ```ruby
-
-
-
-
-
-
-
-    { name: "Alice", role: "presenter" },
-    { name: "Bob", role: "reviewer" }
-  ]
-)
+field :name, :string # String values
+field :count, :integer # Integer values
+field :price, :number # Numeric values (float/int)
+field :active, :boolean # Boolean values
+field :metadata, :object # JSON objects
+field :tags, :array # Arrays
+```
 
-
-summary.subject # => "Project Update"
-summary.sentiment # => "positive"
-summary.participants # => [{ name: "Alice", ... }]
+## Field Options
 
-
-
-
+```ruby
+# Required fields
+field :title, :string, required: true
+
+# Enum values
+field :status, :string, enum: ["draft", "published", "archived"]
+
+# Array constraints
+field :tags, :array,
+  items: { type: "string" },
+  min_items: 1,
+  max_items: 5,
+  unique_items: true
+
+# Nested objects
+field :author, :object, properties: {
+  "name" => { type: "string", required: true },
+  "email" => { type: "string" }
+}
 ```
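
The array options above read like JSON Schema's `minItems`, `maxItems`, and `uniqueItems` keywords. As a rough illustration of what checking a value against them involves, here is a plain-Ruby sketch. `array_errors` is a hypothetical helper, not part of Structify, and the README does not say whether or how Structify enforces these options on stored data.

```ruby
# Return a list of human-readable violations for an array value.
def array_errors(value, min_items: nil, max_items: nil, unique_items: false)
  return ["must be an array"] unless value.is_a?(Array)

  errors = []
  errors << "needs at least #{min_items} item(s)" if min_items && value.size < min_items
  errors << "allows at most #{max_items} item(s)" if max_items && value.size > max_items
  errors << "items must be unique" if unique_items && value.uniq.size != value.size
  errors
end

# A valid tag list, then one that is too long and contains a duplicate.
p array_errors(%w[ruby rails], min_items: 1, max_items: 5, unique_items: true)
p array_errors(%w[a a b c d e], min_items: 1, max_items: 5, unique_items: true)
```

Running a check like this on the parsed LLM response before calling `update` gives the application a chance to retry or reject the extraction instead of storing out-of-range data.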
 
-##
+## Chain of Thought Mode
 
-
+Structify supports a "thinking" mode that automatically requests chain of thought reasoning from the LLM:
 
 ```ruby
-
-
-
-
-
-
-  end
+schema_definition do
+  version 1
+  thinking true # Enable chain of thought reasoning
+
+  field :title, :string, required: true
+  # other fields...
 end
 ```
 
-
+Chain of thought (COT) reasoning is beneficial because it:
+- Adds more context to the extraction process
+- Helps the LLM think through problems more systematically
+- Improves accuracy for complex extractions
+- Makes the reasoning process transparent and explainable
+- Reduces hallucinations by forcing step-by-step thinking
 
-
+This is especially useful when:
+- Answers need more detailed information
+- Questions require multi-step reasoning
+- Extractions involve complex decision-making
+- You need to understand how the LLM reached its conclusions
 
-
+For best results, include instructions for COT in your base system prompt:
 
-
+```ruby
+system_prompt = "Extract structured data from the content.
+For each field, think step by step before determining the value."
+```
 
-
-2. Create your feature branch (`git checkout -b feature/my-new-feature`)
-3. Commit your changes (`git commit -am 'Add some feature'`)
-4. Push to the branch (`git push origin feature/my-new-feature`)
-5. Create a new Pull Request
+You can generate effective chain of thought prompts using tools like the [Claude Prompt Designer](https://console.anthropic.com/dashboard).
 
-
+## Schema Versioning and Field Lifecycle
 
-
+Structify provides a simple field lifecycle management system using a `versions` parameter:
+
+```ruby
+schema_definition do
+  version 3
+
+  # Fields for specific version ranges
+  field :title, :string # Available in all versions (default behavior)
+  field :legacy, :string, versions: 1...3 # Only in versions 1-2 (removed in v3)
+  field :summary, :text, versions: 2 # Added in version 2 onwards
+  field :content, :text, versions: 2.. # Added in version 2 onwards (endless range)
+  field :temp_field, :string, versions: 2..3 # Only in versions 2-3
+  field :special, :string, versions: [1, 3, 5] # Only in versions 1, 3, and 5
+end
+```
+
+### Version Range Syntax
 
-
+Structify supports several ways to specify which versions a field is available in:
 
-
+| Syntax | Example | Meaning |
+|--------|---------|---------|
+| No version specified | `field :title, :string` | Available in all versions (default) |
+| Single integer | `versions: 2` | Available from version 2 onwards |
+| Range (inclusive) | `versions: 1..3` | Available in versions 1, 2, and 3 |
+| Range (exclusive) | `versions: 1...3` | Available in versions 1 and 2 (not 3) |
+| Endless range | `versions: 2..` | Available from version 2 onwards |
+| Array | `versions: [1, 4, 7]` | Only available in versions 1, 4, and 7 |
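
The table above can be read as a small matching rule. The sketch below implements that rule in plain Ruby so each row can be tried out. `field_available?` is a hypothetical name used for illustration and is not taken from Structify's source, which may resolve versions differently internally.

```ruby
# Decide whether a field declared with `versions` exists in `schema_version`.
def field_available?(versions, schema_version)
  case versions
  when nil     then true                               # no restriction: all versions
  when Integer then schema_version >= versions         # single integer: from N onwards
  when Range   then versions.cover?(schema_version)    # inclusive, exclusive, or endless
  when Array   then versions.include?(schema_version)  # explicit list of versions
  else raise ArgumentError, "unsupported versions: #{versions.inspect}"
  end
end

# Print which of versions 1 to 5 each declaration matches.
{
  "1...3"     => 1...3,
  "2"         => 2,
  "2.."       => (2..),
  "[1, 3, 5]" => [1, 3, 5]
}.each do |label, versions|
  available = (1..5).select { |v| field_available?(versions, v) }
  puts "#{label}: #{available.inspect}"
end
```

`Range#cover?` is used because it compares against the endpoints without iterating, so it works for endless ranges such as `2..`.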
 
-
+### Handling Records with Different Versions
+
+```ruby
+# Create a record with version 1 schema
+article_v1 = Article.create(title: "Original Article")
 
+# Access with version 3 schema
+article_v3 = Article.find(article_v1.id)
+
+# Fields from v1 are still accessible
+article_v3.title # => "Original Article"
+
+# Fields not in v1 raise errors
+article_v3.summary # => VersionRangeError: Field 'summary' is not available in version 1.
+# This field is only available in versions: 2 to 999.
+
+# Check version compatibility
+article_v3.version_compatible_with?(3) # => false
+article_v3.version_compatible_with?(1) # => true
+
+# Upgrade record to version 3
+article_v3.summary = "Added in v3"
+article_v3.save! # Record version is automatically updated to 3
 ```
 
+### Accessing the Container Attribute
+
+The JSON container attribute can be accessed directly:
+
+```ruby
+# Using the default container attribute :json_attributes
+article.json_attributes # => { "title" => "My Title", "version" => 1, ... }
+
+# If you've configured a custom container attribute
+article.custom_json_column # => { "title" => "My Title", "version" => 1, ... }
 ```
+
+
+## Understanding Structify's Role
+
+Structify is designed as a **bridge** between your Rails models and LLM extraction services:
+
+### What Structify Does For You
+
+- ✅ **Define extraction schemas** directly in your ActiveRecord models
+- ✅ **Generate compatible JSON schemas** for OpenAI, Anthropic, and other LLM providers
+- ✅ **Store and validate** extracted data against your schema
+- ✅ **Provide typed access** to extracted fields through your models
+- ✅ **Handle schema versioning** and backward compatibility
+- ✅ **Support chain of thought reasoning** with the thinking mode option
+
+### What You Need To Implement
+
+- 🔧 **API integration** with your chosen LLM provider (see examples above)
+- 🔧 **Processing logic** for when and how to extract data
+- 🔧 **Authentication** and API key management
+- 🔧 **Error handling and retries** for API calls
+
+This separation of concerns allows you to:
+1. Use any LLM provider and model you prefer
+2. Implement extraction logic specific to your application
+3. Handle API access in a way that fits your application architecture
+4. Change LLM providers without changing your data model
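
Since error handling and retries are left to the application, one common approach is to wrap the provider call in a retry loop with exponential backoff. The following is a minimal sketch under that assumption. `with_retries`, `TransientError`, and the delay values are hypothetical; a real integration would rescue the error classes raised by its HTTP or provider client instead.

```ruby
# Stand-in for whatever transient error the chosen client raises.
class TransientError < StandardError; end

# Run the block, retrying on TransientError with exponential backoff.
# `sleeper` is injectable so the delay can be skipped in tests.
def with_retries(max_attempts: 3, base_delay: 0.5, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue TransientError
    raise if attempt >= max_attempts

    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Simulated provider call that fails twice, then returns extracted data.
result = with_retries(sleeper: ->(_seconds) {}) do |attempt|
  raise TransientError, "rate limited" if attempt < 3

  { title: "Example" }
end
p result
```

Because the retry logic lives outside the model, the same wrapper can surround an OpenAI or an Anthropic extractor without touching the schema definition.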
+
+## License
+
+[MIT License](https://opensource.org/licenses/MIT)