nitfr 1.0.0 → 1.1.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +232 -0
- data/lib/nitfr/body.rb +42 -0
- data/lib/nitfr/byline.rb +13 -0
- data/lib/nitfr/docdata.rb +22 -0
- data/lib/nitfr/document.rb +274 -0
- data/lib/nitfr/exporter.rb +257 -0
- data/lib/nitfr/footnote.rb +63 -0
- data/lib/nitfr/head.rb +14 -0
- data/lib/nitfr/headline.rb +40 -3
- data/lib/nitfr/media.rb +18 -0
- data/lib/nitfr/paragraph.rb +130 -1
- data/lib/nitfr/search_pattern.rb +29 -0
- data/lib/nitfr/text_extractor.rb +11 -1
- data/lib/nitfr/version.rb +1 -1
- data/lib/nitfr.rb +3 -0
- metadata +5 -1
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 601b081002076baf704497b19c0aa35d29b0201d13eb65bc43026e57eb7645d1
|
|
4
|
+
data.tar.gz: f28461883556e66c71d3282c9a4ebc3e40577ecb9fb87e83a7e783a0066b9362
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 92663f9fb2de633c74d68c6092afadcae2b8b65fb7f357d185c68fa8e64b488b0e3a844adc172dd2ab82d8b0bfbd8eb28696431fc15e667c0cdab9e9b1899243
|
|
7
|
+
data.tar.gz: a501b8c8bde10794a02080444a63216874d3583dfdc6b99607befeee9c025c23fc82cfdf98aae0b7ad555a2e67bd1d1fe30a1b6438db7f5e83967cc712df6a60
|
data/CHANGELOG.md
ADDED
|
@@ -0,0 +1,232 @@
|
|
|
1
|
+
# Changelog
|
|
2
|
+
|
|
3
|
+
All notable changes to NITFr will be documented in this file.
|
|
4
|
+
|
|
5
|
+
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
|
6
|
+
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
|
7
|
+
|
|
8
|
+
## [1.1.0] - 2025-12-15
|
|
9
|
+
|
|
10
|
+
### Added
|
|
11
|
+
|
|
12
|
+
#### Serialization
|
|
13
|
+
- `Document#to_h` - Hash representation of entire document
|
|
14
|
+
- `Document#to_json` - JSON serialization
|
|
15
|
+
- `Head#to_h` - Hash representation
|
|
16
|
+
- `Body#to_h` - Hash representation
|
|
17
|
+
- `Headline#to_h` - Hash representation
|
|
18
|
+
- `Byline#to_h` - Hash representation
|
|
19
|
+
- `Paragraph#to_h` - Hash representation
|
|
20
|
+
- `Media#to_h` - Hash representation
|
|
21
|
+
- `Docdata#to_h` - Hash representation
|
|
22
|
+
- `Footnote#to_h` - Hash representation
|
|
23
|
+
|
|
24
|
+
#### Reading Statistics
|
|
25
|
+
- `Document#word_count` - Total word count across all paragraphs (memoized)
|
|
26
|
+
- `Document#reading_time(words_per_minute:)` - Estimated reading time (e.g., "3 min read")
|
|
27
|
+
|
|
28
|
+
#### Search & Query
|
|
29
|
+
- `Document#search(query, case_sensitive:)` - Full-text search with match positions and context
|
|
30
|
+
- `Document#contains?(query, case_sensitive:)` - Check if text exists in document
|
|
31
|
+
- `Document#paragraphs_containing(query, case_sensitive:)` - Find paragraphs by text
|
|
32
|
+
- `Document#paragraphs_mentioning(person:, org:, location:, match_all:)` - Find paragraphs by entity
|
|
33
|
+
- `Document#paragraphs_where(&block)` - Custom predicate filtering
|
|
34
|
+
- `Document#find_paragraph(&block)` - Find first matching paragraph
|
|
35
|
+
- `Document#find_media(type:)` - Filter media by type
|
|
36
|
+
- `Document#images` / `#videos` / `#audio` - Media type convenience accessors
|
|
37
|
+
- `Document#all_people` - All unique person names (memoized)
|
|
38
|
+
- `Document#all_organizations` - All unique organization names (memoized)
|
|
39
|
+
- `Document#all_locations` - All unique location names (memoized)
|
|
40
|
+
- `Document#all_entities` - Hash of all entity types (memoized, single-pass)
|
|
41
|
+
- `Document#count_occurrences(query, case_sensitive:)` - Count matches
|
|
42
|
+
- `Document#excerpt(query, context_chars:, case_sensitive:)` - Context snippet around match
|
|
43
|
+
|
|
44
|
+
#### Paragraph Search Helpers
|
|
45
|
+
- `Paragraph#contains?(query, case_sensitive:)` - Text search within paragraph
|
|
46
|
+
- `Paragraph#mentions_person?(name, exact:)` - Check for person reference
|
|
47
|
+
- `Paragraph#mentions_org?(name, exact:)` - Check for organization reference
|
|
48
|
+
- `Paragraph#mentions_location?(name, exact:)` - Check for location reference
|
|
49
|
+
- `Paragraph#mentions?(person:, org:, location:)` - Multi-entity check
|
|
50
|
+
- `Paragraph#has_links?` - Check if paragraph contains links
|
|
51
|
+
- `Paragraph#has_emphasis?` - Check if paragraph contains emphasis
|
|
52
|
+
- `Paragraph#has_strong?` - Check if paragraph contains strong text
|
|
53
|
+
- `Paragraph#has_entities?` - Check if paragraph contains any entities
|
|
54
|
+
|
|
55
|
+
#### Extended Headline Levels
|
|
56
|
+
- `Headline#tertiary` / `#hl3` - Tertiary headline
|
|
57
|
+
- `Headline#quaternary` / `#hl4` - Quaternary headline
|
|
58
|
+
- `Headline#quinary` / `#hl5` - Quinary headline
|
|
59
|
+
- Updated `Headline#all` and `Headline#to_h` to include all five levels
|
|
60
|
+
|
|
61
|
+
#### Strong/Bold Text
|
|
62
|
+
- `Paragraph#strong` - Extract `<strong>` elements (alongside existing `<em>` support)
|
|
63
|
+
- `Paragraph#has_strong?` - Check for strong text
|
|
64
|
+
- Included in `Paragraph#to_h` serialization
|
|
65
|
+
|
|
66
|
+
#### Slugline Support
|
|
67
|
+
- `Document#slugline` - Section/category identifier
|
|
68
|
+
- `Body#slugline` - Slugline from body.head
|
|
69
|
+
- Included in `Body#to_h` serialization
|
|
70
|
+
|
|
71
|
+
#### Footnotes
|
|
72
|
+
- `Footnote` class for parsing `<fn>` elements with label and value
|
|
73
|
+
- `Document#footnotes` - Array of Footnote objects
|
|
74
|
+
- `Body#footnotes` - Footnotes from body.content and body.end
|
|
75
|
+
- `Footnote#id` - Footnote ID attribute
|
|
76
|
+
- `Footnote#label` - Reference marker (e.g., "1", "*")
|
|
77
|
+
- `Footnote#value` / `#text` / `#content` - Footnote content
|
|
78
|
+
- `Footnote#present?` - Check if has content
|
|
79
|
+
- Included in `Body#to_h` serialization
|
|
80
|
+
|
|
81
|
+
#### Line Break Preservation
|
|
82
|
+
- `<br/>` elements now converted to newline characters in text extraction
|
|
83
|
+
- Preserves intended line breaks within paragraph content
|
|
84
|
+
|
|
85
|
+
#### Export Formats
|
|
86
|
+
- `Document#to_markdown` - Markdown export with headers, emphasis, blockquotes, footnotes
|
|
87
|
+
- `Document#to_text` - Plain text export with underlined headlines
|
|
88
|
+
- `Document#to_html(include_wrapper:)` - Semantic HTML with article/header/section structure
|
|
89
|
+
- `Exporter` module for export functionality
|
|
90
|
+
|
|
91
|
+
### Notes
|
|
92
|
+
|
|
93
|
+
- 337 tests with comprehensive coverage (173 new tests)
|
|
94
|
+
- Memoization added for frequently accessed computed values
|
|
95
|
+
- `SearchPattern` module for consistent pattern building across classes
|
|
96
|
+
|
|
97
|
+
---
|
|
98
|
+
|
|
99
|
+
## [1.0.0] - 2025-12-14
|
|
100
|
+
|
|
101
|
+
### Added
|
|
102
|
+
|
|
103
|
+
#### Core Parsing
|
|
104
|
+
- `NITFr.parse(xml)` - Parse NITF XML string into a Document
|
|
105
|
+
- `NITFr.parse_file(path)` - Parse NITF file with encoding support
|
|
106
|
+
- `Document` class as main entry point for NITF content
|
|
107
|
+
- `Head` class for document head section (title, meta, pubdata, docdata)
|
|
108
|
+
- `Body` class for document body section
|
|
109
|
+
- `Headline` class with primary (hl1) and secondary (hl2) headline support
|
|
110
|
+
- `Byline` class for author/contributor information
|
|
111
|
+
- `Paragraph` class for body content paragraphs
|
|
112
|
+
- `Media` class for embedded media (images, video, audio)
|
|
113
|
+
- `Docdata` class for document metadata
|
|
114
|
+
|
|
115
|
+
#### Document Attributes
|
|
116
|
+
- `Document#title` - Document title from head
|
|
117
|
+
- `Document#headline` - Primary headline text
|
|
118
|
+
- `Document#headlines` - Full Headline object
|
|
119
|
+
- `Document#byline` - Byline object
|
|
120
|
+
- `Document#paragraphs` - Array of Paragraph objects
|
|
121
|
+
- `Document#text` - Full concatenated article text
|
|
122
|
+
- `Document#media` - Array of Media objects
|
|
123
|
+
- `Document#docdata` - Document metadata
|
|
124
|
+
- `Document#doc_id` - Document identifier
|
|
125
|
+
- `Document#issue_date` - Issue date
|
|
126
|
+
- `Document#version` - NITF version
|
|
127
|
+
- `Document#change_date` - Last change date
|
|
128
|
+
- `Document#change_time` - Last change time
|
|
129
|
+
- `Document#valid?` - Check if valid NITF document
|
|
130
|
+
- `Document#to_xml` - Original XML string
|
|
131
|
+
|
|
132
|
+
#### Headline Support
|
|
133
|
+
- `Headline#primary` / `#hl1` - Primary headline
|
|
134
|
+
- `Headline#secondary` / `#hl2` - Secondary headline
|
|
135
|
+
- `Headline#all` - Array of headline levels
|
|
136
|
+
- `Headline#to_s` - Combined headline text
|
|
137
|
+
- `Headline#present?` - Check if headline exists
|
|
138
|
+
|
|
139
|
+
#### Body Content
|
|
140
|
+
- `Body#headline` - Headline object
|
|
141
|
+
- `Body#byline` - Byline object
|
|
142
|
+
- `Body#dateline` - Dateline text
|
|
143
|
+
- `Body#abstract` - Article abstract/summary
|
|
144
|
+
- `Body#distributor` - Wire service/distributor
|
|
145
|
+
- `Body#series` - Series information
|
|
146
|
+
- `Body#paragraphs` - Paragraph array
|
|
147
|
+
- `Body#media` - Media array
|
|
148
|
+
- `Body#block_quotes` - Block quote texts
|
|
149
|
+
- `Body#lists` - List structures (ul, ol, dl)
|
|
150
|
+
- `Body#tables` - Raw table elements
|
|
151
|
+
- `Body#tagline` - Tagline from body.end
|
|
152
|
+
- `Body#notes` - Editorial notes
|
|
153
|
+
|
|
154
|
+
#### Byline Features
|
|
155
|
+
- `Byline#person` - Author name
|
|
156
|
+
- `Byline#title` - Author title/role
|
|
157
|
+
- `Byline#org` - Organization
|
|
158
|
+
- `Byline#location` - Location
|
|
159
|
+
- `Byline#text` - Full byline text
|
|
160
|
+
- `Byline#present?` - Check if byline has content
|
|
161
|
+
|
|
162
|
+
#### Paragraph Features
|
|
163
|
+
- `Paragraph#text` - Plain text content
|
|
164
|
+
- `Paragraph#id` - Paragraph ID attribute
|
|
165
|
+
- `Paragraph#lede` - Lede attribute value
|
|
166
|
+
- `Paragraph#lead?` - Check if lead paragraph
|
|
167
|
+
- `Paragraph#word_count` - Word count
|
|
168
|
+
- `Paragraph#inner_html` - Raw XML content
|
|
169
|
+
- `Paragraph#present?` - Check if has content
|
|
170
|
+
|
|
171
|
+
#### Entity Extraction (Lazy Batch)
|
|
172
|
+
- `Paragraph#people` - Person references (`<person>`)
|
|
173
|
+
- `Paragraph#organizations` - Organization references (`<org>`)
|
|
174
|
+
- `Paragraph#locations` - Location references (`<location>`)
|
|
175
|
+
- `Paragraph#emphasis` - Emphasized text (`<em>`)
|
|
176
|
+
- `Paragraph#links` - Link information (text and href)
|
|
177
|
+
- Efficient single-pass DOM traversal on first access
|
|
178
|
+
|
|
179
|
+
#### Media Support
|
|
180
|
+
- `Media#type` - Media type (image, video, audio)
|
|
181
|
+
- `Media#image?` / `#video?` / `#audio?` - Type checks
|
|
182
|
+
- `Media#caption` - Media caption
|
|
183
|
+
- `Media#producer` / `#credit` - Credit information
|
|
184
|
+
- `Media#source` / `#src` / `#url` - Source URL
|
|
185
|
+
- `Media#mime_type` - MIME type
|
|
186
|
+
- `Media#alt_text` - Alternate text
|
|
187
|
+
- `Media#width` / `#height` - Dimensions
|
|
188
|
+
- `Media#references` - All media references
|
|
189
|
+
- `Media#primary_reference` - First/main reference
|
|
190
|
+
- `Media#metadata` - Additional metadata
|
|
191
|
+
|
|
192
|
+
#### Docdata Features
|
|
193
|
+
- `Docdata#doc_id` - Document identifier
|
|
194
|
+
- `Docdata#issue_date` - Issue date (parsed as Date)
|
|
195
|
+
- `Docdata#release_date` - Release date
|
|
196
|
+
- `Docdata#expire_date` - Expiration date
|
|
197
|
+
- `Docdata#urgency` - Urgency level (1-8)
|
|
198
|
+
- `Docdata#fixture` - Fixture identifier
|
|
199
|
+
- `Docdata#doc_scope` - Document scope
|
|
200
|
+
- `Docdata#ed_msg` - Editorial message
|
|
201
|
+
- `Docdata#series` - Series information
|
|
202
|
+
- `Docdata#copyright` - Copyright holder and year
|
|
203
|
+
- `Docdata#subjects` - Subject classifiers
|
|
204
|
+
- `Docdata#people` - Identified people
|
|
205
|
+
- `Docdata#organizations` - Identified organizations
|
|
206
|
+
- `Docdata#locations` - Identified locations
|
|
207
|
+
|
|
208
|
+
#### Head Features
|
|
209
|
+
- `Head#title` - Document title
|
|
210
|
+
- `Head#meta` - Meta tags as hash
|
|
211
|
+
- `Head#pubdata` - Publication data
|
|
212
|
+
- `Head#docdata` - Docdata object
|
|
213
|
+
- `Head#revision_history` - Revision entries
|
|
214
|
+
|
|
215
|
+
#### Text Processing
|
|
216
|
+
- `TextExtractor` module for recursive text extraction from nested elements
|
|
217
|
+
|
|
218
|
+
#### Security
|
|
219
|
+
- XXE (XML External Entity) attack protection
|
|
220
|
+
- Entity expansion limits configured at load time
|
|
221
|
+
- REXML security settings (100 entity limit, 10KB text limit)
|
|
222
|
+
- No external entity processing
|
|
223
|
+
|
|
224
|
+
#### Error Handling
|
|
225
|
+
- `NITFr::ParseError` - XML parsing errors
|
|
226
|
+
- `NITFr::InvalidDocumentError` - Invalid NITF structure (missing `<nitf>` root)
|
|
227
|
+
|
|
228
|
+
### Notes
|
|
229
|
+
|
|
230
|
+
- Pure Ruby implementation using REXML (no native dependencies)
|
|
231
|
+
- Lazy batch extraction for efficient entity parsing
|
|
232
|
+
- 164 tests with comprehensive coverage
|
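The search entries in the changelog (`Document#count_occurrences`, `Document#excerpt`, and the `SearchPattern` module) can be illustrated with a small standalone sketch. The contents of `search_pattern.rb` are not shown at this point, so `SearchPatternSketch` and `TextSearcher` below are hypothetical stand-ins. The sketch assumes that string queries are escaped and matched literally, and that Regexp queries are passed through unchanged.

```ruby
# Hypothetical stand-in for the gem's SearchPattern helper.
# The real implementation may differ.
module SearchPatternSketch
  # Assumption: strings are escaped so they match literally,
  # and Regexp queries are used exactly as given.
  def build_search_pattern(query, case_sensitive)
    return query if query.is_a?(Regexp)

    flags = case_sensitive ? 0 : Regexp::IGNORECASE
    Regexp.new(Regexp.escape(query.to_s), flags)
  end
end

# Hypothetical wrapper around a plain string, standing in for Document#text.
class TextSearcher
  include SearchPatternSketch

  def initialize(text)
    @text = text
  end

  # Count every match of the query in the text.
  def count_occurrences(query, case_sensitive: false)
    @text.scan(build_search_pattern(query, case_sensitive)).size
  end

  # Return the first match with context_chars of context on each side,
  # adding "..." wherever the excerpt was cut from the full text.
  def excerpt(query, context_chars: 50, case_sensitive: false)
    match = @text.match(build_search_pattern(query, case_sensitive))
    return nil unless match

    start_pos = [match.begin(0) - context_chars, 0].max
    end_pos = [match.end(0) + context_chars, @text.length].min
    prefix = start_pos.positive? ? "..." : ""
    suffix = end_pos < @text.length ? "..." : ""
    "#{prefix}#{@text[start_pos...end_pos]}#{suffix}"
  end
end

searcher = TextSearcher.new("The Senate voted. The senate adjourned at 5 p.m.")
puts searcher.count_occurrences("senate")
puts searcher.excerpt("voted", context_chars: 4)
```

Escaping string queries means that a query such as `"p.m."` only matches those literal characters; callers who want pattern semantics would pass a Regexp instead.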
data/lib/nitfr/body.rb
CHANGED
|
@@ -39,6 +39,13 @@ module NITFr
|
|
|
39
39
|
@dateline ||= (body_head && xpath_first(body_head, "dateline"))&.text&.strip
|
|
40
40
|
end
|
|
41
41
|
|
|
42
|
+
# Get the slugline (section/category identifier)
|
|
43
|
+
#
|
|
44
|
+
# @return [String, nil] the slugline text
|
|
45
|
+
def slugline
|
|
46
|
+
@slugline ||= (body_head && xpath_first(body_head, "slugline"))&.text&.strip
|
|
47
|
+
end
|
|
48
|
+
|
|
42
49
|
# Get the abstract/summary
|
|
43
50
|
#
|
|
44
51
|
# @return [String, nil] the abstract text
|
|
@@ -111,6 +118,19 @@ module NITFr
|
|
|
111
118
|
end
|
|
112
119
|
end
|
|
113
120
|
|
|
121
|
+
# Get all footnotes from the document
|
|
122
|
+
#
|
|
123
|
+
# Footnotes can appear in body.content or body.end
|
|
124
|
+
#
|
|
125
|
+
# @return [Array<Footnote>] array of footnote objects
|
|
126
|
+
def footnotes
|
|
127
|
+
@footnotes ||= begin
|
|
128
|
+
content_fns = body_content ? xpath_match(body_content, ".//fn") : []
|
|
129
|
+
end_fns = body_end ? xpath_match(body_end, ".//fn") : []
|
|
130
|
+
(content_fns + end_fns).map { |fn| Footnote.new(fn) }
|
|
131
|
+
end
|
|
132
|
+
end
|
|
133
|
+
|
|
114
134
|
# Get the body.end content (tagline, bibliography)
|
|
115
135
|
#
|
|
116
136
|
# @return [Hash] body end content
|
|
@@ -132,6 +152,28 @@ module NITFr
|
|
|
132
152
|
body_end_content[:notes] || []
|
|
133
153
|
end
|
|
134
154
|
|
|
155
|
+
# Convert body to a Hash representation
|
|
156
|
+
#
|
|
157
|
+
# @return [Hash] the body as a hash
|
|
158
|
+
def to_h
|
|
159
|
+
{
|
|
160
|
+
headline: headline&.to_h,
|
|
161
|
+
byline: byline&.to_h,
|
|
162
|
+
dateline: dateline,
|
|
163
|
+
slugline: slugline,
|
|
164
|
+
abstract: abstract,
|
|
165
|
+
distributor: distributor,
|
|
166
|
+
series: series,
|
|
167
|
+
paragraphs: paragraphs.map(&:to_h),
|
|
168
|
+
media: media.empty? ? nil : media.map(&:to_h),
|
|
169
|
+
block_quotes: block_quotes.empty? ? nil : block_quotes,
|
|
170
|
+
lists: lists.empty? ? nil : lists,
|
|
171
|
+
footnotes: footnotes.empty? ? nil : footnotes.map(&:to_h),
|
|
172
|
+
tagline: tagline,
|
|
173
|
+
notes: notes.empty? ? nil : notes
|
|
174
|
+
}.compact
|
|
175
|
+
end
|
|
176
|
+
|
|
135
177
|
private
|
|
136
178
|
|
|
137
179
|
def xpath_first(context, path)
|
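The `slugline` and `footnotes` additions in `body.rb` both follow a "look up the container, then query inside it" pattern. Below is a minimal standalone sketch of that pattern using REXML directly. The sample XML, the `SAMPLE_NITF` constant, and the helper names `slugline_of` and `footnote_texts` are assumptions made for illustration and are not part of the gem.

```ruby
require "rexml/document"

# Hypothetical sample document; the element names follow the NITF tags
# named in the diff, but the layout itself is an assumption.
SAMPLE_NITF = <<~XML
  <nitf>
    <body>
      <body.head><slugline> POLITICS </slugline></body.head>
      <body.content>
        <p>Turnout rose sharply.<fn id="fn1">County figures.</fn></p>
      </body.content>
      <body.end><fn id="fn2">Updated Tuesday.</fn></body.end>
    </body>
  </nitf>
XML

# Return the stripped slugline text, or nil when body.head or slugline is absent.
def slugline_of(xml)
  body = REXML::Document.new(xml).root.elements["body"]
  head = body&.elements&.[]("body.head")
  (head && REXML::XPath.first(head, "slugline"))&.text&.strip
end

# Collect <fn> elements from body.content first, then body.end,
# mirroring the order used in Body#footnotes.
def footnote_texts(xml)
  body = REXML::Document.new(xml).root.elements["body"]
  return [] unless body

  content = body.elements["body.content"]
  ending = body.elements["body.end"]
  content_fns = content ? REXML::XPath.match(content, ".//fn") : []
  end_fns = ending ? REXML::XPath.match(ending, ".//fn") : []
  (content_fns + end_fns).map do |fn|
    { id: fn.attributes["id"], text: fn.text.to_s.strip }
  end
end

puts slugline_of(SAMPLE_NITF)
puts footnote_texts(SAMPLE_NITF).inspect
```

Guarding each container with a nil check keeps a document without `body.end` (or without `body.content`) from raising; it simply contributes no footnotes.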
data/lib/nitfr/byline.rb
CHANGED
|
@@ -57,6 +57,19 @@ module NITFr
|
|
|
57
57
|
!text.empty?
|
|
58
58
|
end
|
|
59
59
|
|
|
60
|
+
# Convert byline to a Hash representation
|
|
61
|
+
#
|
|
62
|
+
# @return [Hash] the byline as a hash
|
|
63
|
+
def to_h
|
|
64
|
+
{
|
|
65
|
+
text: text,
|
|
66
|
+
person: person,
|
|
67
|
+
title: title,
|
|
68
|
+
location: location,
|
|
69
|
+
org: org
|
|
70
|
+
}.compact
|
|
71
|
+
end
|
|
72
|
+
|
|
60
73
|
private
|
|
61
74
|
|
|
62
75
|
def xpath_first(path)
|
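`Byline#to_h` relies on `Hash#compact` to leave out fields that were not present in the source XML. The sketch below shows that behaviour with a hypothetical `BylineSketch` class that takes plain values instead of parsing XML; it is not the gem's `Byline`.

```ruby
# Hypothetical stand-in for NITFr::Byline that takes plain values.
class BylineSketch
  attr_reader :text, :person, :title, :location, :org

  def initialize(text:, person: nil, title: nil, location: nil, org: nil)
    @text = text
    @person = person
    @title = title
    @location = location
    @org = org
  end

  # Same shape as the to_h in the diff: build the full hash,
  # then drop every key whose value is nil.
  def to_h
    {
      text: text,
      person: person,
      title: title,
      location: location,
      org: org
    }.compact
  end
end

puts BylineSketch.new(text: "By Jane Doe", person: "Jane Doe").to_h.inspect
```

Note that `compact` removes only `nil` values. An empty string or empty array would stay in the hash, which is presumably why the `Body#to_h` and `Docdata#to_h` methods in this diff convert empty collections to `nil` before compacting.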
data/lib/nitfr/docdata.rb
CHANGED
|
@@ -133,6 +133,28 @@ module NITFr
|
|
|
133
133
|
identified_content[:people] || []
|
|
134
134
|
end
|
|
135
135
|
|
|
136
|
+
# Convert docdata to a Hash representation
|
|
137
|
+
#
|
|
138
|
+
# @return [Hash] the docdata as a hash
|
|
139
|
+
def to_h
|
|
140
|
+
{
|
|
141
|
+
doc_id: doc_id,
|
|
142
|
+
issue_date: issue_date&.to_s,
|
|
143
|
+
release_date: release_date&.to_s,
|
|
144
|
+
expire_date: expire_date&.to_s,
|
|
145
|
+
urgency: urgency,
|
|
146
|
+
copyright: copyright.empty? ? nil : copyright,
|
|
147
|
+
doc_scope: doc_scope,
|
|
148
|
+
fixture: fixture,
|
|
149
|
+
series: series.empty? ? nil : series,
|
|
150
|
+
management_status: management_status.empty? ? nil : management_status,
|
|
151
|
+
subjects: subjects.empty? ? nil : subjects,
|
|
152
|
+
locations: locations.empty? ? nil : locations,
|
|
153
|
+
organizations: organizations.empty? ? nil : organizations,
|
|
154
|
+
people: people.empty? ? nil : people
|
|
155
|
+
}.compact
|
|
156
|
+
end
|
|
157
|
+
|
|
136
158
|
private
|
|
137
159
|
|
|
138
160
|
def xpath_first(path)
|
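`Docdata#to_h` converts dates with `&.to_s` and maps empty collections to `nil` so that they disappear after `compact`. The following sketch shows why this produces a hash that serializes cleanly to JSON. The `docdata_hash` helper and its three fields are a hypothetical reduction of the real method, not the gem's code.

```ruby
require "date"
require "json"

# Hypothetical reduction of Docdata#to_h to three representative fields.
def docdata_hash(doc_id:, issue_date: nil, subjects: [])
  {
    doc_id: doc_id,
    # Date#to_s yields an ISO 8601 string; &. leaves a missing date as nil.
    issue_date: issue_date&.to_s,
    # An empty collection becomes nil so that compact removes the key.
    subjects: subjects.empty? ? nil : subjects
  }.compact
end

full = docdata_hash(doc_id: "abc-1", issue_date: Date.new(2025, 12, 15), subjects: ["politics"])
bare = docdata_hash(doc_id: "abc-2")

puts JSON.generate(full)
puts JSON.generate(bare)
```

Converting dates to strings inside `to_h` keeps the hash made of plain strings, numbers, arrays, and hashes, so the output does not depend on how the JSON library chooses to encode `Date` objects.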
data/lib/nitfr/document.rb
CHANGED
|
@@ -9,6 +9,9 @@ module NITFr
|
|
|
9
9
|
# @note This parser does not process external entities (DTD references) for security.
|
|
10
10
|
# REXML by default does not expand external entities, which protects against XXE attacks.
|
|
11
11
|
class Document
|
|
12
|
+
include SearchPattern
|
|
13
|
+
include Exporter
|
|
14
|
+
|
|
12
15
|
attr_reader :xml_doc, :head, :body
|
|
13
16
|
|
|
14
17
|
# Create a new Document from an NITF XML string
|
|
@@ -50,6 +53,13 @@ module NITFr
|
|
|
50
53
|
body&.byline
|
|
51
54
|
end
|
|
52
55
|
|
|
56
|
+
# Get the slugline (section/category identifier)
|
|
57
|
+
#
|
|
58
|
+
# @return [String, nil] the slugline text
|
|
59
|
+
def slugline
|
|
60
|
+
body&.slugline
|
|
61
|
+
end
|
|
62
|
+
|
|
53
63
|
# Get all paragraphs from the body content
|
|
54
64
|
#
|
|
55
65
|
# @return [Array<Paragraph>] array of paragraph objects
|
|
@@ -64,6 +74,28 @@ module NITFr
|
|
|
64
74
|
@text ||= paragraphs.map(&:text).join("\n\n")
|
|
65
75
|
end
|
|
66
76
|
|
|
77
|
+
# Get the total word count of the document
|
|
78
|
+
#
|
|
79
|
+
# @return [Integer] total word count across all paragraphs
|
|
80
|
+
def word_count
|
|
81
|
+
@word_count ||= paragraphs.sum(&:word_count)
|
|
82
|
+
end
|
|
83
|
+
|
|
84
|
+
# Get the estimated reading time
|
|
85
|
+
#
|
|
86
|
+
# @param words_per_minute [Integer] reading speed (default: 200)
|
|
87
|
+
# @return [String] human-readable reading time (e.g., "3 min read")
|
|
88
|
+
def reading_time(words_per_minute: 200)
|
|
89
|
+
minutes = (word_count / words_per_minute.to_f).ceil
|
|
90
|
+
if minutes < 1
|
|
91
|
+
"Less than 1 min read"
|
|
92
|
+
elsif minutes == 1
|
|
93
|
+
"1 min read"
|
|
94
|
+
else
|
|
95
|
+
"#{minutes} min read"
|
|
96
|
+
end
|
|
97
|
+
end
|
|
98
|
+
|
|
67
99
|
# Get all media objects (images, etc.) from the document
|
|
68
100
|
#
|
|
69
101
|
# @return [Array<Media>] array of media objects
|
|
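The rounding in `reading_time` is easier to follow with a few concrete values. The standalone function below copies the arithmetic from the hunk above but takes the word count as an argument, which is an adaptation for illustration rather than the gem's signature.

```ruby
# Standalone version of the reading-time arithmetic shown in the diff.
# Taking word_count as a parameter is an assumption made for this sketch;
# in the gem it comes from Document#word_count.
def reading_time(word_count, words_per_minute: 200)
  # Float division followed by ceil rounds any partial minute up.
  minutes = (word_count / words_per_minute.to_f).ceil
  if minutes < 1
    "Less than 1 min read"
  elsif minutes == 1
    "1 min read"
  else
    "#{minutes} min read"
  end
end

[0, 1, 200, 201, 600].each do |count|
  puts "#{count} words: #{reading_time(count)}"
end
```

Because of the `ceil`, any non-zero word count yields at least one minute, so the "Less than 1 min read" branch is reached only for an empty document. A `words_per_minute` of zero is not guarded against and would raise `FloatDomainError` when the infinite (or NaN) quotient is rounded.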
@@ -71,6 +103,13 @@ module NITFr
|
|
|
71
103
|
body&.media || []
|
|
72
104
|
end
|
|
73
105
|
|
|
106
|
+
# Get all footnotes from the document
|
|
107
|
+
#
|
|
108
|
+
# @return [Array<Footnote>] array of footnote objects
|
|
109
|
+
def footnotes
|
|
110
|
+
body&.footnotes || []
|
|
111
|
+
end
|
|
112
|
+
|
|
74
113
|
# Get document metadata from docdata
|
|
75
114
|
#
|
|
76
115
|
# @return [Docdata, nil] the docdata object
|
|
@@ -127,8 +166,243 @@ module NITFr
|
|
|
127
166
|
@xml_doc.to_s
|
|
128
167
|
end
|
|
129
168
|
|
|
169
|
+
# Convert document to a Hash representation
|
|
170
|
+
#
|
|
171
|
+
# @return [Hash] the document as a hash
|
|
172
|
+
def to_h
|
|
173
|
+
{
|
|
174
|
+
version: version,
|
|
175
|
+
change_date: change_date,
|
|
176
|
+
change_time: change_time,
|
|
177
|
+
title: title,
|
|
178
|
+
doc_id: doc_id,
|
|
179
|
+
issue_date: issue_date&.to_s,
|
|
180
|
+
head: head&.to_h,
|
|
181
|
+
body: body&.to_h
|
|
182
|
+
}.compact
|
|
183
|
+
end
|
|
184
|
+
|
|
185
|
+
# Convert document to JSON string
|
|
186
|
+
#
|
|
187
|
+
# @param args [Array] arguments passed to JSON.generate
|
|
188
|
+
# @return [String] JSON representation of the document
|
|
189
|
+
def to_json(*args)
|
|
190
|
+
require "json"
|
|
191
|
+
to_h.to_json(*args)
|
|
192
|
+
end
|
|
193
|
+
|
|
194
|
+
# =========================================================================
|
|
195
|
+
# Search Methods
|
|
196
|
+
# =========================================================================
|
|
197
|
+
|
|
198
|
+
# Search the full document text for a query string or pattern
|
|
199
|
+
#
|
|
200
|
+
# @param query [String, Regexp] the search query (string or regex)
|
|
201
|
+
# @param case_sensitive [Boolean] whether search is case-sensitive (default: false)
|
|
202
|
+
# @return [Array<Hash>] array of match results with context
|
|
203
|
+
def search(query, case_sensitive: false)
|
|
204
|
+
pattern = build_search_pattern(query, case_sensitive)
|
|
205
|
+
results = []
|
|
206
|
+
|
|
207
|
+
paragraphs.each_with_index do |para, index|
|
|
208
|
+
para.text.scan(pattern) do
|
|
209
|
+
match = Regexp.last_match
|
|
210
|
+
results << {
|
|
211
|
+
paragraph_index: index,
|
|
212
|
+
paragraph: para,
|
|
213
|
+
match: match[0],
|
|
214
|
+
position: match.begin(0)
|
|
215
|
+
}
|
|
216
|
+
end
|
|
217
|
+
end
|
|
218
|
+
|
|
219
|
+
results
|
|
220
|
+
end
|
|
221
|
+
|
|
222
|
+
# Check if document contains the given text
|
|
223
|
+
#
|
|
224
|
+
# @param query [String, Regexp] the search query
|
|
225
|
+
# @param case_sensitive [Boolean] whether search is case-sensitive (default: false)
|
|
226
|
+
# @return [Boolean] true if text is found
|
|
227
|
+
def contains?(query, case_sensitive: false)
|
|
228
|
+
pattern = build_search_pattern(query, case_sensitive)
|
|
229
|
+
text.match?(pattern)
|
|
230
|
+
end
|
|
231
|
+
|
|
232
|
+
# Find paragraphs containing the given text
|
|
233
|
+
#
|
|
234
|
+
# @param query [String, Regexp] the search query
|
|
235
|
+
# @param case_sensitive [Boolean] whether search is case-sensitive (default: false)
|
|
236
|
+
# @return [Array<Paragraph>] matching paragraphs
|
|
237
|
+
def paragraphs_containing(query, case_sensitive: false)
|
|
238
|
+
pattern = build_search_pattern(query, case_sensitive)
|
|
239
|
+
paragraphs.select { |p| p.text.match?(pattern) }
|
|
240
|
+
end
|
|
241
|
+
|
|
242
|
+
# Find paragraphs mentioning specific entities
|
|
243
|
+
#
|
|
244
|
+
# @param person [String, nil] person name to search for
|
|
245
|
+
# @param org [String, nil] organization name to search for
|
|
246
|
+
# @param location [String, nil] location name to search for
|
|
247
|
+
# @param match_all [Boolean] if true, paragraph must contain ALL specified entities (default: false)
|
|
248
|
+
# @return [Array<Paragraph>] matching paragraphs
|
|
249
|
+
def paragraphs_mentioning(person: nil, org: nil, location: nil, match_all: false)
|
|
250
|
+
return paragraphs if person.nil? && org.nil? && location.nil?
|
|
251
|
+
|
|
252
|
+
paragraphs.select do |para|
|
|
253
|
+
matches = []
|
|
254
|
+
matches << para.mentions_person?(person) if person
|
|
255
|
+
matches << para.mentions_org?(org) if org
|
|
256
|
+
matches << para.mentions_location?(location) if location
|
|
257
|
+
|
|
258
|
+
match_all ? matches.all? : matches.any?
|
|
259
|
+
end
|
|
260
|
+
end
|
|
261
|
+
|
|
262
|
+
# Find paragraphs using a custom block
|
|
263
|
+
#
|
|
264
|
+
# @yield [Paragraph] block to evaluate each paragraph
|
|
265
|
+
# @return [Array<Paragraph>] paragraphs where block returns true
|
|
266
|
+
# @example Find long paragraphs
|
|
267
|
+
# doc.paragraphs_where { |p| p.word_count > 50 }
|
|
268
|
+
# @example Find lead paragraphs with links
|
|
269
|
+
# doc.paragraphs_where { |p| p.lead? && p.links.any? }
|
|
270
|
+
def paragraphs_where(&block)
|
|
271
|
+
return paragraphs unless block_given?
|
|
272
|
+
|
|
273
|
+
paragraphs.select(&block)
|
|
274
|
+
end
|
|
275
|
+
|
|
276
|
+
# Find the first paragraph matching criteria
|
|
277
|
+
#
|
|
278
|
+
# @yield [Paragraph] block to evaluate each paragraph
|
|
279
|
+
# @return [Paragraph, nil] first matching paragraph or nil
|
|
280
|
+
def find_paragraph(&block)
|
|
281
|
+
return nil unless block_given?
|
|
282
|
+
|
|
283
|
+
paragraphs.find(&block)
|
|
284
|
+
end
|
|
285
|
+
|
|
286
|
+
# Find media by type
|
|
287
|
+
#
|
|
288
|
+
# @param type [String, Symbol, nil] media type ('image', 'video', 'audio')
|
|
289
|
+
# @return [Array<Media>] matching media objects
|
|
290
|
+
def find_media(type: nil)
|
|
291
|
+
return media if type.nil?
|
|
292
|
+
|
|
293
|
+
type_str = type.to_s
|
|
294
|
+
media.select { |m| m.type == type_str }
|
|
295
|
+
end
|
|
296
|
+
|
|
297
|
+
# Get all images from the document
|
|
298
|
+
#
|
|
299
|
+
# @return [Array<Media>] image media objects
|
|
300
|
+
def images
|
|
301
|
+
media.select(&:image?)
|
|
302
|
+
end
|
|
303
|
+
|
|
304
|
+
# Get all videos from the document
|
|
305
|
+
#
|
|
306
|
+
# @return [Array<Media>] video media objects
|
|
307
|
+
def videos
|
|
308
|
+
media.select(&:video?)
|
|
309
|
+
end
|
|
310
|
+
|
|
311
|
+
# Get all audio from the document
|
|
312
|
+
#
|
|
313
|
+
# @return [Array<Media>] audio media objects
|
|
314
|
+
def audio
|
|
315
|
+
media.select(&:audio?)
|
|
316
|
+
end
|
|
317
|
+
|
|
318
|
+
# Get all unique people mentioned in the document
|
|
319
|
+
#
|
|
320
|
+
# @return [Array<String>] unique person names from paragraphs and docdata
|
|
321
|
+
def all_people
|
|
322
|
+
all_entities[:people]
|
|
323
|
+
end
|
|
324
|
+
|
|
325
|
+
# Get all unique organizations mentioned in the document
|
|
326
|
+
#
|
|
327
|
+
# @return [Array<String>] unique organization names from paragraphs and docdata
|
|
328
|
+
def all_organizations
|
|
329
|
+
all_entities[:organizations]
|
|
330
|
+
end
|
|
331
|
+
|
|
332
|
+
# Get all unique locations mentioned in the document
|
|
333
|
+
#
|
|
334
|
+
# @return [Array<String>] unique location names from paragraphs and docdata
|
|
335
|
+
def all_locations
|
|
336
|
+
all_entities[:locations]
|
|
337
|
+
end
|
|
338
|
+
|
|
339
|
+
# Get all unique entities (people, organizations, locations) mentioned
|
|
340
|
+
#
|
|
341
|
+
# Uses single-pass aggregation for efficiency when multiple entity
|
|
342
|
+
# methods are called.
|
|
343
|
+
#
|
|
344
|
+
# @return [Hash] hash with :people, :organizations, :locations arrays
|
|
345
|
+
def all_entities
|
|
346
|
+
@all_entities ||= aggregate_entities
|
|
347
|
+
end
|
|
348
|
+
|
|
349
|
+
# Count occurrences of a term in the document
|
|
350
|
+
#
|
|
351
|
+
# @param query [String, Regexp] the search query
|
|
352
|
+
# @param case_sensitive [Boolean] whether search is case-sensitive (default: false)
|
|
353
|
+
# @return [Integer] number of occurrences
|
|
354
|
+
def count_occurrences(query, case_sensitive: false)
|
|
355
|
+
pattern = build_search_pattern(query, case_sensitive)
|
|
356
|
+
text.scan(pattern).size
|
|
357
|
+
end
|
|
358
|
+
|
|
359
|
+
# Get excerpt around first match of query
|
|
360
|
+
#
|
|
361
|
+
# @param query [String, Regexp] the search query
|
|
362
|
+
# @param context_chars [Integer] characters of context on each side (default: 50)
|
|
363
|
+
# @param case_sensitive [Boolean] whether search is case-sensitive (default: false)
|
|
364
|
+
# @return [String, nil] excerpt with surrounding context and ellipses, or nil if not found
|
|
365
|
+
def excerpt(query, context_chars: 50, case_sensitive: false)
|
|
366
|
+
pattern = build_search_pattern(query, case_sensitive)
|
|
367
|
+
match = text.match(pattern)
|
|
368
|
+
return nil unless match
|
|
369
|
+
|
|
370
|
+
start_pos = [match.begin(0) - context_chars, 0].max
|
|
371
|
+
end_pos = [match.end(0) + context_chars, text.length].min
|
|
372
|
+
|
|
373
|
+
prefix = start_pos > 0 ? "..." : ""
|
|
374
|
+
suffix = end_pos < text.length ? "..." : ""
|
|
375
|
+
|
|
376
|
+
excerpt_text = text[start_pos...end_pos]
|
|
377
|
+
"#{prefix}#{excerpt_text}#{suffix}"
|
|
378
|
+
end
|
|
379
|
+
|
|
130
380
|
private
|
|
131
381
|
|
|
382
|
+
# Aggregate all entities in a single pass through paragraphs
|
|
383
|
+
#
|
|
384
|
+
# @return [Hash] hash with :people, :organizations, :locations arrays
|
|
385
|
+
def aggregate_entities
|
|
386
|
+
result = { people: [], organizations: [], locations: [] }
|
|
387
|
+
|
|
388
|
+
paragraphs.each do |para|
|
|
389
|
+
result[:people].concat(para.people)
|
|
390
|
+
result[:organizations].concat(para.organizations)
|
|
391
|
+
result[:locations].concat(para.locations)
|
|
392
|
+
end
|
|
393
|
+
|
|
394
|
+
# Add docdata entities if available
|
|
395
|
+
if docdata
|
|
396
|
+
result[:people].concat(docdata.people || [])
|
|
397
|
+
result[:organizations].concat(docdata.organizations || [])
|
|
398
|
+
result[:locations].concat(docdata.locations || [])
|
|
399
|
+
end
|
|
400
|
+
|
|
401
|
+
# Remove duplicates
|
|
402
|
+
result.transform_values!(&:uniq)
|
|
403
|
+
result
|
|
404
|
+
end
|
|
405
|
+
|
|
132
406
|
# Parse XML string into REXML document
|
|
133
407
|
#
|
|
134
408
|
# REXML does not expand external entities by default, which protects against:
|