nitfr 1.0.0 → 1.1.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 397b4e4f6de3e985e4d8bfbfe18ad663ab6bdd350536365a826728f87704c571
4
- data.tar.gz: 775aab503543f0f678abbca1bb55e8420ec1ad829f7e2470a0f609dd2924798f
3
+ metadata.gz: 601b081002076baf704497b19c0aa35d29b0201d13eb65bc43026e57eb7645d1
4
+ data.tar.gz: f28461883556e66c71d3282c9a4ebc3e40577ecb9fb87e83a7e783a0066b9362
5
5
  SHA512:
6
- metadata.gz: 482f2856dd1c1e1854f2d6f463dcdcf46ed644b417934b46847bf96597c3b1cf6012a2a9001de5bdcdd0573e53c2cf34ab0177a74d3a645fd2d8a6d3a074aa35
7
- data.tar.gz: 59ac46fea40da9ace71f9ce9663ac6c6275d5c268b589d4f3f583261e34d78fe74dc027105fb3a27d4389650a34b9585fd6df7369ebdbd012fb11c8c60a85b70
6
+ metadata.gz: 92663f9fb2de633c74d68c6092afadcae2b8b65fb7f357d185c68fa8e64b488b0e3a844adc172dd2ab82d8b0bfbd8eb28696431fc15e667c0cdab9e9b1899243
7
+ data.tar.gz: a501b8c8bde10794a02080444a63216874d3583dfdc6b99607befeee9c025c23fc82cfdf98aae0b7ad555a2e67bd1d1fe30a1b6438db7f5e83967cc712df6a60
data/CHANGELOG.md ADDED
@@ -0,0 +1,232 @@
1
+ # Changelog
2
+
3
+ All notable changes to NITFr will be documented in this file.
4
+
5
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
6
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7
+
8
+ ## [1.1.0] - 2025-12-15
9
+
10
+ ### Added
11
+
12
+ #### Serialization
13
+ - `Document#to_h` - Hash representation of entire document
14
+ - `Document#to_json` - JSON serialization
15
+ - `Head#to_h` - Hash representation
16
+ - `Body#to_h` - Hash representation
17
+ - `Headline#to_h` - Hash representation
18
+ - `Byline#to_h` - Hash representation
19
+ - `Paragraph#to_h` - Hash representation
20
+ - `Media#to_h` - Hash representation
21
+ - `Docdata#to_h` - Hash representation
22
+ - `Footnote#to_h` - Hash representation
23
+
24
+ #### Reading Statistics
25
+ - `Document#word_count` - Total word count across all paragraphs (memoized)
26
+ - `Document#reading_time(words_per_minute:)` - Estimated reading time (e.g., "3 min read")
27
+
28
+ #### Search & Query
29
+ - `Document#search(query, case_sensitive:)` - Full-text search with match positions and context
30
+ - `Document#contains?(query, case_sensitive:)` - Check if text exists in document
31
+ - `Document#paragraphs_containing(query, case_sensitive:)` - Find paragraphs by text
32
+ - `Document#paragraphs_mentioning(person:, org:, location:, match_all:)` - Find paragraphs by entity
33
+ - `Document#paragraphs_where(&block)` - Custom predicate filtering
34
+ - `Document#find_paragraph(&block)` - Find first matching paragraph
35
+ - `Document#find_media(type:)` - Filter media by type
36
+ - `Document#images` / `#videos` / `#audio` - Media type convenience accessors
37
+ - `Document#all_people` - All unique person names (memoized)
38
+ - `Document#all_organizations` - All unique organization names (memoized)
39
+ - `Document#all_locations` - All unique location names (memoized)
40
+ - `Document#all_entities` - Hash of all entity types (memoized, single-pass)
41
+ - `Document#count_occurrences(query, case_sensitive:)` - Count matches
42
+ - `Document#excerpt(query, context_chars:, case_sensitive:)` - Context snippet around match
43
+
44
+ #### Paragraph Search Helpers
45
+ - `Paragraph#contains?(query, case_sensitive:)` - Text search within paragraph
46
+ - `Paragraph#mentions_person?(name, exact:)` - Check for person reference
47
+ - `Paragraph#mentions_org?(name, exact:)` - Check for organization reference
48
+ - `Paragraph#mentions_location?(name, exact:)` - Check for location reference
49
+ - `Paragraph#mentions?(person:, org:, location:)` - Multi-entity check
50
+ - `Paragraph#has_links?` - Check if paragraph contains links
51
+ - `Paragraph#has_emphasis?` - Check if paragraph contains emphasis
52
+ - `Paragraph#has_strong?` - Check if paragraph contains strong text
53
+ - `Paragraph#has_entities?` - Check if paragraph contains any entities
54
+
55
+ #### Extended Headline Levels
56
+ - `Headline#tertiary` / `#hl3` - Tertiary headline
57
+ - `Headline#quaternary` / `#hl4` - Quaternary headline
58
+ - `Headline#quinary` / `#hl5` - Quinary headline
59
+ - Updated `Headline#all` and `Headline#to_h` to include all five levels
60
+
61
+ #### Strong/Bold Text
62
+ - `Paragraph#strong` - Extract `<strong>` elements (alongside existing `<em>` support)
63
+ - `Paragraph#has_strong?` - Check for strong text
64
+ - Included in `Paragraph#to_h` serialization
65
+
66
+ #### Slugline Support
67
+ - `Document#slugline` - Section/category identifier
68
+ - `Body#slugline` - Slugline from body.head
69
+ - Included in `Body#to_h` serialization
70
+
71
+ #### Footnotes
72
+ - `Footnote` class for parsing `<fn>` elements with label and value
73
+ - `Document#footnotes` - Array of Footnote objects
74
+ - `Body#footnotes` - Footnotes from body.content and body.end
75
+ - `Footnote#id` - Footnote ID attribute
76
+ - `Footnote#label` - Reference marker (e.g., "1", "*")
77
+ - `Footnote#value` / `#text` / `#content` - Footnote content
78
+ - `Footnote#present?` - Check if has content
79
+ - Included in `Body#to_h` serialization
80
+
81
+ #### Line Break Preservation
82
+ - `<br/>` elements now converted to newline characters in text extraction
83
+ - Preserves intended line breaks within paragraph content
84
+
85
+ #### Export Formats
86
+ - `Document#to_markdown` - Markdown export with headers, emphasis, blockquotes, footnotes
87
+ - `Document#to_text` - Plain text export with underlined headlines
88
+ - `Document#to_html(include_wrapper:)` - Semantic HTML with article/header/section structure
89
+ - `Exporter` module for export functionality
90
+
91
+ ### Notes
92
+
93
+ - 337 tests with comprehensive coverage (173 new tests)
94
+ - Memoization added for frequently accessed computed values
95
+ - `SearchPattern` module for consistent pattern building across classes
96
+
97
+ ---
98
+
99
+ ## [1.0.0] - 2025-12-14
100
+
101
+ ### Added
102
+
103
+ #### Core Parsing
104
+ - `NITFr.parse(xml)` - Parse NITF XML string into a Document
105
+ - `NITFr.parse_file(path)` - Parse NITF file with encoding support
106
+ - `Document` class as main entry point for NITF content
107
+ - `Head` class for document head section (title, meta, pubdata, docdata)
108
+ - `Body` class for document body section
109
+ - `Headline` class with primary (hl1) and secondary (hl2) headline support
110
+ - `Byline` class for author/contributor information
111
+ - `Paragraph` class for body content paragraphs
112
+ - `Media` class for embedded media (images, video, audio)
113
+ - `Docdata` class for document metadata
114
+
115
+ #### Document Attributes
116
+ - `Document#title` - Document title from head
117
+ - `Document#headline` - Primary headline text
118
+ - `Document#headlines` - Full Headline object
119
+ - `Document#byline` - Byline object
120
+ - `Document#paragraphs` - Array of Paragraph objects
121
+ - `Document#text` - Full concatenated article text
122
+ - `Document#media` - Array of Media objects
123
+ - `Document#docdata` - Document metadata
124
+ - `Document#doc_id` - Document identifier
125
+ - `Document#issue_date` - Issue date
126
+ - `Document#version` - NITF version
127
+ - `Document#change_date` - Last change date
128
+ - `Document#change_time` - Last change time
129
+ - `Document#valid?` - Check if valid NITF document
130
+ - `Document#to_xml` - Original XML string
131
+
132
+ #### Headline Support
133
+ - `Headline#primary` / `#hl1` - Primary headline
134
+ - `Headline#secondary` / `#hl2` - Secondary headline
135
+ - `Headline#all` - Array of headline levels
136
+ - `Headline#to_s` - Combined headline text
137
+ - `Headline#present?` - Check if headline exists
138
+
139
+ #### Body Content
140
+ - `Body#headline` - Headline object
141
+ - `Body#byline` - Byline object
142
+ - `Body#dateline` - Dateline text
143
+ - `Body#abstract` - Article abstract/summary
144
+ - `Body#distributor` - Wire service/distributor
145
+ - `Body#series` - Series information
146
+ - `Body#paragraphs` - Paragraph array
147
+ - `Body#media` - Media array
148
+ - `Body#block_quotes` - Block quote texts
149
+ - `Body#lists` - List structures (ul, ol, dl)
150
+ - `Body#tables` - Raw table elements
151
+ - `Body#tagline` - Tagline from body.end
152
+ - `Body#notes` - Editorial notes
153
+
154
+ #### Byline Features
155
+ - `Byline#person` - Author name
156
+ - `Byline#title` - Author title/role
157
+ - `Byline#org` - Organization
158
+ - `Byline#location` - Location
159
+ - `Byline#text` - Full byline text
160
+ - `Byline#present?` - Check if byline has content
161
+
162
+ #### Paragraph Features
163
+ - `Paragraph#text` - Plain text content
164
+ - `Paragraph#id` - Paragraph ID attribute
165
+ - `Paragraph#lede` - Lede attribute value
166
+ - `Paragraph#lead?` - Check if lead paragraph
167
+ - `Paragraph#word_count` - Word count
168
+ - `Paragraph#inner_html` - Raw XML content
169
+ - `Paragraph#present?` - Check if has content
170
+
171
+ #### Entity Extraction (Lazy Batch)
172
+ - `Paragraph#people` - Person references (`<person>`)
173
+ - `Paragraph#organizations` - Organization references (`<org>`)
174
+ - `Paragraph#locations` - Location references (`<location>`)
175
+ - `Paragraph#emphasis` - Emphasized text (`<em>`)
176
+ - `Paragraph#links` - Link information (text and href)
177
+ - Efficient single-pass DOM traversal on first access
178
+
179
+ #### Media Support
180
+ - `Media#type` - Media type (image, video, audio)
181
+ - `Media#image?` / `#video?` / `#audio?` - Type checks
182
+ - `Media#caption` - Media caption
183
+ - `Media#producer` / `#credit` - Credit information
184
+ - `Media#source` / `#src` / `#url` - Source URL
185
+ - `Media#mime_type` - MIME type
186
+ - `Media#alt_text` - Alternate text
187
+ - `Media#width` / `#height` - Dimensions
188
+ - `Media#references` - All media references
189
+ - `Media#primary_reference` - First/main reference
190
+ - `Media#metadata` - Additional metadata
191
+
192
+ #### Docdata Features
193
+ - `Docdata#doc_id` - Document identifier
194
+ - `Docdata#issue_date` - Issue date (parsed as Date)
195
+ - `Docdata#release_date` - Release date
196
+ - `Docdata#expire_date` - Expiration date
197
+ - `Docdata#urgency` - Urgency level (1-8)
198
+ - `Docdata#fixture` - Fixture identifier
199
+ - `Docdata#doc_scope` - Document scope
200
+ - `Docdata#ed_msg` - Editorial message
201
+ - `Docdata#series` - Series information
202
+ - `Docdata#copyright` - Copyright holder and year
203
+ - `Docdata#subjects` - Subject classifiers
204
+ - `Docdata#people` - Identified people
205
+ - `Docdata#organizations` - Identified organizations
206
+ - `Docdata#locations` - Identified locations
207
+
208
+ #### Head Features
209
+ - `Head#title` - Document title
210
+ - `Head#meta` - Meta tags as hash
211
+ - `Head#pubdata` - Publication data
212
+ - `Head#docdata` - Docdata object
213
+ - `Head#revision_history` - Revision entries
214
+
215
+ #### Text Processing
216
+ - `TextExtractor` module for recursive text extraction from nested elements
217
+
218
+ #### Security
219
+ - XXE (XML External Entity) attack protection
220
+ - Entity expansion limits configured at load time
221
+ - REXML security settings (100 entity limit, 10KB text limit)
222
+ - No external entity processing
223
+
224
+ #### Error Handling
225
+ - `NITFr::ParseError` - XML parsing errors
226
+ - `NITFr::InvalidDocumentError` - Invalid NITF structure (missing `<nitf>` root)
227
+
228
+ ### Notes
229
+
230
+ - Pure Ruby implementation using REXML (no native dependencies)
231
+ - Lazy batch extraction for efficient entity parsing
232
+ - 164 tests with comprehensive coverage
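As a rough illustration of the Search & Query entries in the 1.1.0 changelog section, the sketch below mimics the scan-based matching that `Document#search` performs over paragraph text. It is not the gem's code. `build_pattern` is a hypothetical stand-in for the `SearchPattern` helper, whose implementation does not appear in this diff, and paragraphs are modelled as plain strings rather than `Paragraph` objects.

```ruby
# Hypothetical stand-in for the gem's SearchPattern helper (not shown in the diff).
# Strings are escaped so they match literally; regexps are passed through and
# gain the ignore-case flag when case_sensitive is false.
def build_pattern(query, case_sensitive)
  if query.is_a?(Regexp)
    return query if case_sensitive

    Regexp.new(query.source, query.options | Regexp::IGNORECASE)
  else
    Regexp.new(Regexp.escape(query.to_s), case_sensitive ? 0 : Regexp::IGNORECASE)
  end
end

# Paragraphs are plain strings here; the gem iterates over Paragraph objects.
def search(paragraphs, query, case_sensitive: false)
  pattern = build_pattern(query, case_sensitive)
  results = []

  paragraphs.each_with_index do |text, index|
    # String#scan with a block exposes each match through Regexp.last_match,
    # which gives access to the match offset as well as the matched text.
    text.scan(pattern) do
      match = Regexp.last_match
      results << { paragraph_index: index, match: match[0], position: match.begin(0) }
    end
  end

  results
end

paragraphs = ["Markets rose on Monday.", "Analysts said markets may cool."]
hits = search(paragraphs, "markets")
hits.each { |hit| puts hit.inspect }
```

Each result records the paragraph index and the character offset within that paragraph, so a caller can highlight the match without searching again.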
data/lib/nitfr/body.rb CHANGED
@@ -39,6 +39,13 @@ module NITFr
39
39
  @dateline ||= (body_head && xpath_first(body_head, "dateline"))&.text&.strip
40
40
  end
41
41
 
42
+ # Get the slugline (section/category identifier)
43
+ #
44
+ # @return [String, nil] the slugline text
45
+ def slugline
46
+ @slugline ||= (body_head && xpath_first(body_head, "slugline"))&.text&.strip
47
+ end
48
+
42
49
  # Get the abstract/summary
43
50
  #
44
51
  # @return [String, nil] the abstract text
@@ -111,6 +118,19 @@ module NITFr
111
118
  end
112
119
  end
113
120
 
121
+ # Get all footnotes from the document
122
+ #
123
+ # Footnotes can appear in body.content or body.end
124
+ #
125
+ # @return [Array<Footnote>] array of footnote objects
126
+ def footnotes
127
+ @footnotes ||= begin
128
+ content_fns = body_content ? xpath_match(body_content, ".//fn") : []
129
+ end_fns = body_end ? xpath_match(body_end, ".//fn") : []
130
+ (content_fns + end_fns).map { |fn| Footnote.new(fn) }
131
+ end
132
+ end
133
+
114
134
  # Get the body.end content (tagline, bibliography)
115
135
  #
116
136
  # @return [Hash] body end content
@@ -132,6 +152,28 @@ module NITFr
132
152
  body_end_content[:notes] || []
133
153
  end
134
154
 
155
+ # Convert body to a Hash representation
156
+ #
157
+ # @return [Hash] the body as a hash
158
+ def to_h
159
+ {
160
+ headline: headline&.to_h,
161
+ byline: byline&.to_h,
162
+ dateline: dateline,
163
+ slugline: slugline,
164
+ abstract: abstract,
165
+ distributor: distributor,
166
+ series: series,
167
+ paragraphs: paragraphs.map(&:to_h),
168
+ media: media.empty? ? nil : media.map(&:to_h),
169
+ block_quotes: block_quotes.empty? ? nil : block_quotes,
170
+ lists: lists.empty? ? nil : lists,
171
+ footnotes: footnotes.empty? ? nil : footnotes.map(&:to_h),
172
+ tagline: tagline,
173
+ notes: notes.empty? ? nil : notes
174
+ }.compact
175
+ end
176
+
135
177
  private
136
178
 
137
179
  def xpath_first(context, path)
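The `Body#to_h` added above maps empty collections to `nil` and then calls `compact`, so absent values never appear in the serialized hash. A minimal standalone sketch of that pattern follows. `body_hash` and its keyword arguments are hypothetical and stand in for the real `Body` accessors.

```ruby
# Hypothetical helper mirroring the "empty? ? nil : value" + compact pattern.
# In the gem the values come from Body accessors rather than keyword arguments.
def body_hash(dateline:, paragraphs:, media:, notes:)
  {
    dateline: dateline,
    # As in the diff, paragraphs are not checked for emptiness, so the key
    # is kept even when the array is empty.
    paragraphs: paragraphs,
    media: media.empty? ? nil : media,
    notes: notes.empty? ? nil : notes
  }.compact # drops every key whose value is nil
end

hash = body_hash(dateline: nil, paragraphs: [], media: [], notes: ["Embargoed"])
puts hash.keys.inspect
```

One consequence worth noting is that consumers cannot distinguish "no media" from "media key unsupported" by looking at the hash alone, since both cases omit the key.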
data/lib/nitfr/byline.rb CHANGED
@@ -57,6 +57,19 @@ module NITFr
57
57
  !text.empty?
58
58
  end
59
59
 
60
+ # Convert byline to a Hash representation
61
+ #
62
+ # @return [Hash] the byline as a hash
63
+ def to_h
64
+ {
65
+ text: text,
66
+ person: person,
67
+ title: title,
68
+ location: location,
69
+ org: org
70
+ }.compact
71
+ end
72
+
60
73
  private
61
74
 
62
75
  def xpath_first(path)
data/lib/nitfr/docdata.rb CHANGED
@@ -133,6 +133,28 @@ module NITFr
133
133
  identified_content[:people] || []
134
134
  end
135
135
 
136
+ # Convert docdata to a Hash representation
137
+ #
138
+ # @return [Hash] the docdata as a hash
139
+ def to_h
140
+ {
141
+ doc_id: doc_id,
142
+ issue_date: issue_date&.to_s,
143
+ release_date: release_date&.to_s,
144
+ expire_date: expire_date&.to_s,
145
+ urgency: urgency,
146
+ copyright: copyright.empty? ? nil : copyright,
147
+ doc_scope: doc_scope,
148
+ fixture: fixture,
149
+ series: series.empty? ? nil : series,
150
+ management_status: management_status.empty? ? nil : management_status,
151
+ subjects: subjects.empty? ? nil : subjects,
152
+ locations: locations.empty? ? nil : locations,
153
+ organizations: organizations.empty? ? nil : organizations,
154
+ people: people.empty? ? nil : people
155
+ }.compact
156
+ end
157
+
136
158
  private
137
159
 
138
160
  def xpath_first(path)
data/lib/nitfr/document.rb CHANGED
@@ -9,6 +9,9 @@ module NITFr
9
9
  # @note This parser does not process external entities (DTD references) for security.
10
10
  # REXML by default does not expand external entities, which protects against XXE attacks.
11
11
  class Document
12
+ include SearchPattern
13
+ include Exporter
14
+
12
15
  attr_reader :xml_doc, :head, :body
13
16
 
14
17
  # Create a new Document from an NITF XML string
@@ -50,6 +53,13 @@ module NITFr
50
53
  body&.byline
51
54
  end
52
55
 
56
+ # Get the slugline (section/category identifier)
57
+ #
58
+ # @return [String, nil] the slugline text
59
+ def slugline
60
+ body&.slugline
61
+ end
62
+
53
63
  # Get all paragraphs from the body content
54
64
  #
55
65
  # @return [Array<Paragraph>] array of paragraph objects
@@ -64,6 +74,28 @@ module NITFr
64
74
  @text ||= paragraphs.map(&:text).join("\n\n")
65
75
  end
66
76
 
77
+ # Get the total word count of the document
78
+ #
79
+ # @return [Integer] total word count across all paragraphs
80
+ def word_count
81
+ @word_count ||= paragraphs.sum(&:word_count)
82
+ end
83
+
84
+ # Get the estimated reading time
85
+ #
86
+ # @param words_per_minute [Integer] reading speed (default: 200)
87
+ # @return [String] human-readable reading time (e.g., "3 min read")
88
+ def reading_time(words_per_minute: 200)
89
+ minutes = (word_count / words_per_minute.to_f).ceil
90
+ if minutes < 1
91
+ "Less than 1 min read"
92
+ elsif minutes == 1
93
+ "1 min read"
94
+ else
95
+ "#{minutes} min read"
96
+ end
97
+ end
98
+
67
99
  # Get all media objects (images, etc.) from the document
68
100
  #
69
101
  # @return [Array<Media>] array of media objects
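The `reading_time` method added in the hunk above can be exercised in isolation. The sketch below copies its arithmetic into a free function that takes the word count as an argument (an assumption made for illustration, since the gem reads it from `Document#word_count`). Because the quotient is rounded up with `ceil`, any non-zero word count yields at least 1 minute, so the "Less than 1 min read" branch is reached only for an empty document.

```ruby
# Standalone copy of the reading-time arithmetic; word_count is passed in
# explicitly here, whereas the gem derives it from the document's paragraphs.
def reading_time(word_count, words_per_minute: 200)
  minutes = (word_count / words_per_minute.to_f).ceil
  if minutes < 1
    "Less than 1 min read"
  elsif minutes == 1
    "1 min read"
  else
    "#{minutes} min read"
  end
end

[0, 1, 200, 450].each do |count|
  puts "#{count} words: #{reading_time(count)}"
end
```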
@@ -71,6 +103,13 @@ module NITFr
71
103
  body&.media || []
72
104
  end
73
105
 
106
+ # Get all footnotes from the document
107
+ #
108
+ # @return [Array<Footnote>] array of footnote objects
109
+ def footnotes
110
+ body&.footnotes || []
111
+ end
112
+
74
113
  # Get document metadata from docdata
75
114
  #
76
115
  # @return [Docdata, nil] the docdata object
@@ -127,8 +166,243 @@ module NITFr
127
166
  @xml_doc.to_s
128
167
  end
129
168
 
169
+ # Convert document to a Hash representation
170
+ #
171
+ # @return [Hash] the document as a hash
172
+ def to_h
173
+ {
174
+ version: version,
175
+ change_date: change_date,
176
+ change_time: change_time,
177
+ title: title,
178
+ doc_id: doc_id,
179
+ issue_date: issue_date&.to_s,
180
+ head: head&.to_h,
181
+ body: body&.to_h
182
+ }.compact
183
+ end
184
+
185
+ # Convert document to JSON string
186
+ #
187
+ # @param args [Array] arguments passed to JSON.generate
188
+ # @return [String] JSON representation of the document
189
+ def to_json(*args)
190
+ require "json"
191
+ to_h.to_json(*args)
192
+ end
193
+
194
+ # =========================================================================
195
+ # Search Methods
196
+ # =========================================================================
197
+
198
+ # Search the full document text for a query string or pattern
199
+ #
200
+ # @param query [String, Regexp] the search query (string or regex)
201
+ # @param case_sensitive [Boolean] whether search is case-sensitive (default: false)
202
+ # @return [Array<Hash>] array of match results with context
203
+ def search(query, case_sensitive: false)
204
+ pattern = build_search_pattern(query, case_sensitive)
205
+ results = []
206
+
207
+ paragraphs.each_with_index do |para, index|
208
+ para.text.scan(pattern) do
209
+ match = Regexp.last_match
210
+ results << {
211
+ paragraph_index: index,
212
+ paragraph: para,
213
+ match: match[0],
214
+ position: match.begin(0)
215
+ }
216
+ end
217
+ end
218
+
219
+ results
220
+ end
221
+
222
+ # Check if document contains the given text
223
+ #
224
+ # @param query [String, Regexp] the search query
225
+ # @param case_sensitive [Boolean] whether search is case-sensitive (default: false)
226
+ # @return [Boolean] true if text is found
227
+ def contains?(query, case_sensitive: false)
228
+ pattern = build_search_pattern(query, case_sensitive)
229
+ text.match?(pattern)
230
+ end
231
+
232
+ # Find paragraphs containing the given text
233
+ #
234
+ # @param query [String, Regexp] the search query
235
+ # @param case_sensitive [Boolean] whether search is case-sensitive (default: false)
236
+ # @return [Array<Paragraph>] matching paragraphs
237
+ def paragraphs_containing(query, case_sensitive: false)
238
+ pattern = build_search_pattern(query, case_sensitive)
239
+ paragraphs.select { |p| p.text.match?(pattern) }
240
+ end
241
+
242
+ # Find paragraphs mentioning specific entities
243
+ #
244
+ # @param person [String, nil] person name to search for
245
+ # @param org [String, nil] organization name to search for
246
+ # @param location [String, nil] location name to search for
247
+ # @param match_all [Boolean] if true, paragraph must contain ALL specified entities (default: false)
248
+ # @return [Array<Paragraph>] matching paragraphs
249
+ def paragraphs_mentioning(person: nil, org: nil, location: nil, match_all: false)
250
+ return paragraphs if person.nil? && org.nil? && location.nil?
251
+
252
+ paragraphs.select do |para|
253
+ matches = []
254
+ matches << para.mentions_person?(person) if person
255
+ matches << para.mentions_org?(org) if org
256
+ matches << para.mentions_location?(location) if location
257
+
258
+ match_all ? matches.all? : matches.any?
259
+ end
260
+ end
261
+
262
+ # Find paragraphs using a custom block
263
+ #
264
+ # @yield [Paragraph] block to evaluate each paragraph
265
+ # @return [Array<Paragraph>] paragraphs where block returns true
266
+ # @example Find long paragraphs
267
+ # doc.paragraphs_where { |p| p.word_count > 50 }
268
+ # @example Find lead paragraphs with links
269
+ # doc.paragraphs_where { |p| p.lead? && p.links.any? }
270
+ def paragraphs_where(&block)
271
+ return paragraphs unless block_given?
272
+
273
+ paragraphs.select(&block)
274
+ end
275
+
276
+ # Find the first paragraph matching criteria
277
+ #
278
+ # @yield [Paragraph] block to evaluate each paragraph
279
+ # @return [Paragraph, nil] first matching paragraph or nil
280
+ def find_paragraph(&block)
281
+ return nil unless block_given?
282
+
283
+ paragraphs.find(&block)
284
+ end
285
+
286
+ # Find media by type
287
+ #
288
+ # @param type [String, Symbol, nil] media type ('image', 'video', 'audio')
289
+ # @return [Array<Media>] matching media objects
290
+ def find_media(type: nil)
291
+ return media if type.nil?
292
+
293
+ type_str = type.to_s
294
+ media.select { |m| m.type == type_str }
295
+ end
296
+
297
+ # Get all images from the document
298
+ #
299
+ # @return [Array<Media>] image media objects
300
+ def images
301
+ media.select(&:image?)
302
+ end
303
+
304
+ # Get all videos from the document
305
+ #
306
+ # @return [Array<Media>] video media objects
307
+ def videos
308
+ media.select(&:video?)
309
+ end
310
+
311
+ # Get all audio from the document
312
+ #
313
+ # @return [Array<Media>] audio media objects
314
+ def audio
315
+ media.select(&:audio?)
316
+ end
317
+
318
+ # Get all unique people mentioned in the document
319
+ #
320
+ # @return [Array<String>] unique person names from paragraphs and docdata
321
+ def all_people
322
+ all_entities[:people]
323
+ end
324
+
325
+ # Get all unique organizations mentioned in the document
326
+ #
327
+ # @return [Array<String>] unique organization names from paragraphs and docdata
328
+ def all_organizations
329
+ all_entities[:organizations]
330
+ end
331
+
332
+ # Get all unique locations mentioned in the document
333
+ #
334
+ # @return [Array<String>] unique location names from paragraphs and docdata
335
+ def all_locations
336
+ all_entities[:locations]
337
+ end
338
+
339
+ # Get all unique entities (people, organizations, locations) mentioned
340
+ #
341
+ # Uses single-pass aggregation for efficiency when multiple entity
342
+ # methods are called.
343
+ #
344
+ # @return [Hash] hash with :people, :organizations, :locations arrays
345
+ def all_entities
346
+ @all_entities ||= aggregate_entities
347
+ end
348
+
349
+ # Count occurrences of a term in the document
350
+ #
351
+ # @param query [String, Regexp] the search query
352
+ # @param case_sensitive [Boolean] whether search is case-sensitive (default: false)
353
+ # @return [Integer] number of occurrences
354
+ def count_occurrences(query, case_sensitive: false)
355
+ pattern = build_search_pattern(query, case_sensitive)
356
+ text.scan(pattern).size
357
+ end
358
+
359
+ # Get excerpt around first match of query
360
+ #
361
+ # @param query [String, Regexp] the search query
362
+ # @param context_chars [Integer] characters of context on each side (default: 50)
363
+ # @param case_sensitive [Boolean] whether search is case-sensitive (default: false)
364
+ # @return [String, nil] excerpt with surrounding context and ellipses, or nil if not found
365
+ def excerpt(query, context_chars: 50, case_sensitive: false)
366
+ pattern = build_search_pattern(query, case_sensitive)
367
+ match = text.match(pattern)
368
+ return nil unless match
369
+
370
+ start_pos = [match.begin(0) - context_chars, 0].max
371
+ end_pos = [match.end(0) + context_chars, text.length].min
372
+
373
+ prefix = start_pos > 0 ? "..." : ""
374
+ suffix = end_pos < text.length ? "..." : ""
375
+
376
+ excerpt_text = text[start_pos...end_pos]
377
+ "#{prefix}#{excerpt_text}#{suffix}"
378
+ end
379
+
130
380
  private
131
381
 
382
+ # Aggregate all entities in a single pass through paragraphs
383
+ #
384
+ # @return [Hash] hash with :people, :organizations, :locations arrays
385
+ def aggregate_entities
386
+ result = { people: [], organizations: [], locations: [] }
387
+
388
+ paragraphs.each do |para|
389
+ result[:people].concat(para.people)
390
+ result[:organizations].concat(para.organizations)
391
+ result[:locations].concat(para.locations)
392
+ end
393
+
394
+ # Add docdata entities if available
395
+ if docdata
396
+ result[:people].concat(docdata.people || [])
397
+ result[:organizations].concat(docdata.organizations || [])
398
+ result[:locations].concat(docdata.locations || [])
399
+ end
400
+
401
+ # Remove duplicates
402
+ result.transform_values!(&:uniq)
403
+ result
404
+ end
405
+
132
406
  # Parse XML string into REXML document
133
407
  #
134
408
  # REXML does not expand external entities by default, which protects against: