RubyGems - nitfr - Versions diffs - 1.0.0 → 1.1.0 - Mend

nitfr 1.0.0 → 1.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (17)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 397b4e4f6de3e985e4d8bfbfe18ad663ab6bdd350536365a826728f87704c571
-  data.tar.gz: 775aab503543f0f678abbca1bb55e8420ec1ad829f7e2470a0f609dd2924798f
+  metadata.gz: 601b081002076baf704497b19c0aa35d29b0201d13eb65bc43026e57eb7645d1
+  data.tar.gz: f28461883556e66c71d3282c9a4ebc3e40577ecb9fb87e83a7e783a0066b9362
 SHA512:
-  metadata.gz: 482f2856dd1c1e1854f2d6f463dcdcf46ed644b417934b46847bf96597c3b1cf6012a2a9001de5bdcdd0573e53c2cf34ab0177a74d3a645fd2d8a6d3a074aa35
-  data.tar.gz: 59ac46fea40da9ace71f9ce9663ac6c6275d5c268b589d4f3f583261e34d78fe74dc027105fb3a27d4389650a34b9585fd6df7369ebdbd012fb11c8c60a85b70
+  metadata.gz: 92663f9fb2de633c74d68c6092afadcae2b8b65fb7f357d185c68fa8e64b488b0e3a844adc172dd2ab82d8b0bfbd8eb28696431fc15e667c0cdab9e9b1899243
+  data.tar.gz: a501b8c8bde10794a02080444a63216874d3583dfdc6b99607befeee9c025c23fc82cfdf98aae0b7ad555a2e67bd1d1fe30a1b6438db7f5e83967cc712df6a60
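
The digests above cover the two archives packed inside the .gem file, metadata.gz and data.tar.gz. As a rough sketch of how such a digest could be checked with Ruby's standard `digest` library (the helper name `digest_matches?` and the sample bytes are illustrative assumptions, not part of nitfr or RubyGems):

```ruby
require "digest"

# Hypothetical helper: compare raw archive bytes against an expected hex digest.
# Supports the two algorithms listed in checksums.yaml.
def digest_matches?(bytes, expected_hex, algorithm: :sha256)
  klass = algorithm == :sha512 ? Digest::SHA512 : Digest::SHA256
  klass.hexdigest(bytes) == expected_hex.downcase
end

# Stand-in data; in practice the bytes would be read from the extracted archive,
# e.g. File.binread("data.tar.gz").
sample = "example archive bytes"
expected = Digest::SHA256.hexdigest(sample)

puts digest_matches?(sample, expected)        # same bytes
puts digest_matches?(sample + "x", expected)  # modified bytes
```

Any change to an archive's bytes changes its digest, which is why both the SHA256 and SHA512 entries differ between 1.0.0 and 1.1.0.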

data/CHANGELOG.md ADDED

@@ -0,0 +1,232 @@
+# Changelog
+All notable changes to NITFr will be documented in this file.
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [1.1.0] - 2025-12-15
+### Added
+#### Serialization
+- `Document#to_h` - Hash representation of entire document
+- `Document#to_json` - JSON serialization
+- `Head#to_h` - Hash representation
+- `Body#to_h` - Hash representation
+- `Headline#to_h` - Hash representation
+- `Byline#to_h` - Hash representation
+- `Paragraph#to_h` - Hash representation
+- `Media#to_h` - Hash representation
+- `Docdata#to_h` - Hash representation
+- `Footnote#to_h` - Hash representation
+#### Reading Statistics
+- `Document#word_count` - Total word count across all paragraphs (memoized)
+- `Document#reading_time(words_per_minute:)` - Estimated reading time (e.g., "3 min read")
+#### Search & Query
+- `Document#search(query, case_sensitive:)` - Full-text search with match positions and context
+- `Document#contains?(query, case_sensitive:)` - Check if text exists in document
+- `Document#paragraphs_containing(query, case_sensitive:)` - Find paragraphs by text
+- `Document#paragraphs_mentioning(person:, org:, location:, match_all:)` - Find paragraphs by entity
+- `Document#paragraphs_where(&block)` - Custom predicate filtering
+- `Document#find_paragraph(&block)` - Find first matching paragraph
+- `Document#find_media(type:)` - Filter media by type
+- `Document#images` / `#videos` / `#audio` - Media type convenience accessors
+- `Document#all_people` - All unique person names (memoized)
+- `Document#all_organizations` - All unique organization names (memoized)
+- `Document#all_locations` - All unique location names (memoized)
+- `Document#all_entities` - Hash of all entity types (memoized, single-pass)
+- `Document#count_occurrences(query, case_sensitive:)` - Count matches
+- `Document#excerpt(query, context_chars:, case_sensitive:)` - Context snippet around match
+#### Paragraph Search Helpers
+- `Paragraph#contains?(query, case_sensitive:)` - Text search within paragraph
+- `Paragraph#mentions_person?(name, exact:)` - Check for person reference
+- `Paragraph#mentions_org?(name, exact:)` - Check for organization reference
+- `Paragraph#mentions_location?(name, exact:)` - Check for location reference
+- `Paragraph#mentions?(person:, org:, location:)` - Multi-entity check
+- `Paragraph#has_links?` - Check if paragraph contains links
+- `Paragraph#has_emphasis?` - Check if paragraph contains emphasis
+- `Paragraph#has_strong?` - Check if paragraph contains strong text
+- `Paragraph#has_entities?` - Check if paragraph contains any entities
+#### Extended Headline Levels
+- `Headline#tertiary` / `#hl3` - Tertiary headline
+- `Headline#quaternary` / `#hl4` - Quaternary headline
+- `Headline#quinary` / `#hl5` - Quinary headline
+- Updated `Headline#all` and `Headline#to_h` to include all five levels
+#### Strong/Bold Text
+- `Paragraph#strong` - Extract `<strong>` elements (alongside existing `<em>` support)
+- `Paragraph#has_strong?` - Check for strong text
+- Included in `Paragraph#to_h` serialization
+#### Slugline Support
+- `Document#slugline` - Section/category identifier
+- `Body#slugline` - Slugline from body.head
+- Included in `Body#to_h` serialization
+#### Footnotes
+- `Footnote` class for parsing `<fn>` elements with label and value
+- `Document#footnotes` - Array of Footnote objects
+- `Body#footnotes` - Footnotes from body.content and body.end
+- `Footnote#id` - Footnote ID attribute
+- `Footnote#label` - Reference marker (e.g., "1", "*")
+- `Footnote#value` / `#text` / `#content` - Footnote content
+- `Footnote#present?` - Check if has content
+- Included in `Body#to_h` serialization
+#### Line Break Preservation
+- `<br/>` elements now converted to newline characters in text extraction
+- Preserves intended line breaks within paragraph content
+#### Export Formats
+- `Document#to_markdown` - Markdown export with headers, emphasis, blockquotes, footnotes
+- `Document#to_text` - Plain text export with underlined headlines
+- `Document#to_html(include_wrapper:)` - Semantic HTML with article/header/section structure
+- `Exporter` module for export functionality
+### Notes
+- 337 tests with comprehensive coverage (173 new tests)
+- Memoization added for frequently accessed computed values
+- `SearchPattern` module for consistent pattern building across classes
+---
+## [1.0.0] - 2025-12-14
+### Added
+#### Core Parsing
+- `NITFr.parse(xml)` - Parse NITF XML string into a Document
+- `NITFr.parse_file(path)` - Parse NITF file with encoding support
+- `Document` class as main entry point for NITF content
+- `Head` class for document head section (title, meta, pubdata, docdata)
+- `Body` class for document body section
+- `Headline` class with primary (hl1) and secondary (hl2) headline support
+- `Byline` class for author/contributor information
+- `Paragraph` class for body content paragraphs
+- `Media` class for embedded media (images, video, audio)
+- `Docdata` class for document metadata
+#### Document Attributes
+- `Document#title` - Document title from head
+- `Document#headline` - Primary headline text
+- `Document#headlines` - Full Headline object
+- `Document#byline` - Byline object
+- `Document#paragraphs` - Array of Paragraph objects
+- `Document#text` - Full concatenated article text
+- `Document#media` - Array of Media objects
+- `Document#docdata` - Document metadata
+- `Document#doc_id` - Document identifier
+- `Document#issue_date` - Issue date
+- `Document#version` - NITF version
+- `Document#change_date` - Last change date
+- `Document#change_time` - Last change time
+- `Document#valid?` - Check if valid NITF document
+- `Document#to_xml` - Original XML string
+#### Headline Support
+- `Headline#primary` / `#hl1` - Primary headline
+- `Headline#secondary` / `#hl2` - Secondary headline
+- `Headline#all` - Array of headline levels
+- `Headline#to_s` - Combined headline text
+- `Headline#present?` - Check if headline exists
+#### Body Content
+- `Body#headline` - Headline object
+- `Body#byline` - Byline object
+- `Body#dateline` - Dateline text
+- `Body#abstract` - Article abstract/summary
+- `Body#distributor` - Wire service/distributor
+- `Body#series` - Series information
+- `Body#paragraphs` - Paragraph array
+- `Body#media` - Media array
+- `Body#block_quotes` - Block quote texts
+- `Body#lists` - List structures (ul, ol, dl)
+- `Body#tables` - Raw table elements
+- `Body#tagline` - Tagline from body.end
+- `Body#notes` - Editorial notes
+#### Byline Features
+- `Byline#person` - Author name
+- `Byline#title` - Author title/role
+- `Byline#org` - Organization
+- `Byline#location` - Location
+- `Byline#text` - Full byline text
+- `Byline#present?` - Check if byline has content
+#### Paragraph Features
+- `Paragraph#text` - Plain text content
+- `Paragraph#id` - Paragraph ID attribute
+- `Paragraph#lede` - Lede attribute value
+- `Paragraph#lead?` - Check if lead paragraph
+- `Paragraph#word_count` - Word count
+- `Paragraph#inner_html` - Raw XML content
+- `Paragraph#present?` - Check if has content
+#### Entity Extraction (Lazy Batch)
+- `Paragraph#people` - Person references (`<person>`)
+- `Paragraph#organizations` - Organization references (`<org>`)
+- `Paragraph#locations` - Location references (`<location>`)
+- `Paragraph#emphasis` - Emphasized text (`<em>`)
+- `Paragraph#links` - Link information (text and href)
+- Efficient single-pass DOM traversal on first access
+#### Media Support
+- `Media#type` - Media type (image, video, audio)
+- `Media#image?` / `#video?` / `#audio?` - Type checks
+- `Media#caption` - Media caption
+- `Media#producer` / `#credit` - Credit information
+- `Media#source` / `#src` / `#url` - Source URL
+- `Media#mime_type` - MIME type
+- `Media#alt_text` - Alternate text
+- `Media#width` / `#height` - Dimensions
+- `Media#references` - All media references
+- `Media#primary_reference` - First/main reference
+- `Media#metadata` - Additional metadata
+#### Docdata Features
+- `Docdata#doc_id` - Document identifier
+- `Docdata#issue_date` - Issue date (parsed as Date)
+- `Docdata#release_date` - Release date
+- `Docdata#expire_date` - Expiration date
+- `Docdata#urgency` - Urgency level (1-8)
+- `Docdata#fixture` - Fixture identifier
+- `Docdata#doc_scope` - Document scope
+- `Docdata#ed_msg` - Editorial message
+- `Docdata#series` - Series information
+- `Docdata#copyright` - Copyright holder and year
+- `Docdata#subjects` - Subject classifiers
+- `Docdata#people` - Identified people
+- `Docdata#organizations` - Identified organizations
+- `Docdata#locations` - Identified locations
+#### Head Features
+- `Head#title` - Document title
+- `Head#meta` - Meta tags as hash
+- `Head#pubdata` - Publication data
+- `Head#docdata` - Docdata object
+- `Head#revision_history` - Revision entries
+#### Text Processing
+- `TextExtractor` module for recursive text extraction from nested elements
+#### Security
+- XXE (XML External Entity) attack protection
+- Entity expansion limits configured at load time
+- REXML security settings (100 entity limit, 10KB text limit)
+- No external entity processing
+#### Error Handling
+- `NITFr::ParseError` - XML parsing errors
+- `NITFr::InvalidDocumentError` - Invalid NITF structure (missing `<nitf>` root)
+### Notes
+- Pure Ruby implementation using REXML (no native dependencies)
+- Lazy batch extraction for efficient entity parsing
+- 164 tests with comprehensive coverage
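
The 1.1.0 Search & Query entries in this changelog take a query plus a `case_sensitive:` flag, and the notes mention a `SearchPattern` module for building patterns. A minimal sketch of one way such a helper could behave, operating on a plain string rather than a parsed document; the module name `SearchSketch`, the method `build_pattern`, and the choice to pass a Regexp through unchanged are assumptions for illustration, not nitfr's API:

```ruby
# Illustrative only: a plain-string approximation of the search helpers.
module SearchSketch
  module_function

  # Turn a String or Regexp query into a Regexp.
  # Assumption: a Regexp is used as given, so case_sensitive only affects Strings.
  def build_pattern(query, case_sensitive)
    return query if query.is_a?(Regexp)

    Regexp.new(Regexp.escape(query.to_s), case_sensitive ? 0 : Regexp::IGNORECASE)
  end

  # Count non-overlapping matches of the query in the text.
  def count_occurrences(text, query, case_sensitive: false)
    text.scan(build_pattern(query, case_sensitive)).size
  end

  # Return the first match with up to context_chars on each side,
  # adding "..." wherever the excerpt was cut short of the text's edge.
  def excerpt(text, query, context_chars: 50, case_sensitive: false)
    match = text.match(build_pattern(query, case_sensitive))
    return nil unless match

    start_pos = [match.begin(0) - context_chars, 0].max
    end_pos = [match.end(0) + context_chars, text.length].min
    prefix = start_pos.positive? ? "..." : ""
    suffix = end_pos < text.length ? "..." : ""
    "#{prefix}#{text[start_pos...end_pos]}#{suffix}"
  end
end

text = "The Senate passed the bill. The senate vote was close."
puts SearchSketch.count_occurrences(text, "senate")
puts SearchSketch.count_occurrences(text, "senate", case_sensitive: true)
puts SearchSketch.excerpt(text, "bill", context_chars: 5)
```

Escaping the string with `Regexp.escape` means characters such as `.` or `?` in a plain-text query are matched literally rather than treated as pattern syntax.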

data/lib/nitfr/body.rb CHANGED

@@ -39,6 +39,13 @@ module NITFr
       @dateline ||= (body_head && xpath_first(body_head, "dateline"))&.text&.strip
     end
+    # Get the slugline (section/category identifier)
+    #
+    # @return [String, nil] the slugline text
+    def slugline
+      @slugline ||= (body_head && xpath_first(body_head, "slugline"))&.text&.strip
+    end
     # Get the abstract/summary
     #
     # @return [String, nil] the abstract text
@@ -111,6 +118,19 @@ module NITFr
       end
     end
+    # Get all footnotes from the document
+    #
+    # Footnotes can appear in body.content or body.end
+    #
+    # @return [Array<Footnote>] array of footnote objects
+    def footnotes
+      @footnotes ||= begin
+        content_fns = body_content ? xpath_match(body_content, ".//fn") : []
+        end_fns = body_end ? xpath_match(body_end, ".//fn") : []
+        (content_fns + end_fns).map { |fn| Footnote.new(fn) }
+      end
+    end
     # Get the body.end content (tagline, bibliography)
     #
     # @return [Hash] body end content
@@ -132,6 +152,28 @@ module NITFr
       body_end_content[:notes] || []
     end
+    # Convert body to a Hash representation
+    #
+    # @return [Hash] the body as a hash
+    def to_h
+      {
+        headline: headline&.to_h,
+        byline: byline&.to_h,
+        dateline: dateline,
+        slugline: slugline,
+        abstract: abstract,
+        distributor: distributor,
+        series: series,
+        paragraphs: paragraphs.map(&:to_h),
+        media: media.empty? ? nil : media.map(&:to_h),
+        block_quotes: block_quotes.empty? ? nil : block_quotes,
+        lists: lists.empty? ? nil : lists,
+        footnotes: footnotes.empty? ? nil : footnotes.map(&:to_h),
+        tagline: tagline,
+        notes: notes.empty? ? nil : notes
+      }.compact
+    end
     private
     def xpath_first(context, path)
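
The `to_h` added above maps empty collections to `nil` and then calls `compact`, so absent fields drop out of the hash instead of appearing as `nil` or `[]`. A self-contained sketch of that pattern, using a hypothetical `BodySketch` struct in place of the real `Body` class (the field selection and sample values are assumptions):

```ruby
# Hypothetical stand-in for a parsed body; not a nitfr class.
BodySketch = Struct.new(:dateline, :slugline, :media, :notes, keyword_init: true) do
  # Build a hash that omits nil scalars and empty collections.
  def to_h
    {
      dateline: dateline,
      slugline: slugline,
      media: media.empty? ? nil : media,   # empty array becomes nil ...
      notes: notes.empty? ? nil : notes
    }.compact                              # ... and compact drops every nil value
  end
end

body = BodySketch.new(dateline: "LONDON", slugline: nil, media: [], notes: ["Embargoed"])
p body.to_h
```

Here only the `dateline` and `notes` keys survive. `Hash#compact` removes `nil` values only, which is why empty arrays are converted to `nil` first.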

data/lib/nitfr/byline.rb CHANGED

@@ -57,6 +57,19 @@ module NITFr
       !text.empty?
     end
+    # Convert byline to a Hash representation
+    #
+    # @return [Hash] the byline as a hash
+    def to_h
+      {
+        text: text,
+        person: person,
+        title: title,
+        location: location,
+        org: org
+      }.compact
+    end
     private
     def xpath_first(path)

data/lib/nitfr/docdata.rb CHANGED

@@ -133,6 +133,28 @@ module NITFr
       identified_content[:people] || []
     end
+    # Convert docdata to a Hash representation
+    #
+    # @return [Hash] the docdata as a hash
+    def to_h
+      {
+        doc_id: doc_id,
+        issue_date: issue_date&.to_s,
+        release_date: release_date&.to_s,
+        expire_date: expire_date&.to_s,
+        urgency: urgency,
+        copyright: copyright.empty? ? nil : copyright,
+        doc_scope: doc_scope,
+        fixture: fixture,
+        series: series.empty? ? nil : series,
+        management_status: management_status.empty? ? nil : management_status,
+        subjects: subjects.empty? ? nil : subjects,
+        locations: locations.empty? ? nil : locations,
+        organizations: organizations.empty? ? nil : organizations,
+        people: people.empty? ? nil : people
+      }.compact
+    end
     private
     def xpath_first(path)

data/lib/nitfr/document.rb CHANGED

@@ -9,6 +9,9 @@ module NITFr
   # @note This parser does not process external entities (DTD references) for security.
   #   REXML by default does not expand external entities, which protects against XXE attacks.
   class Document
+    include SearchPattern
+    include Exporter
     attr_reader :xml_doc, :head, :body
     # Create a new Document from an NITF XML string
@@ -50,6 +53,13 @@ module NITFr
       body&.byline
     end
+    # Get the slugline (section/category identifier)
+    #
+    # @return [String, nil] the slugline text
+    def slugline
+      body&.slugline
+    end
     # Get all paragraphs from the body content
     #
     # @return [Array<Paragraph>] array of paragraph objects
@@ -64,6 +74,28 @@ module NITFr
       @text ||= paragraphs.map(&:text).join("\n\n")
     end
+    # Get the total word count of the document
+    #
+    # @return [Integer] total word count across all paragraphs
+    def word_count
+      @word_count ||= paragraphs.sum(&:word_count)
+    end
+    # Get the estimated reading time
+    #
+    # @param words_per_minute [Integer] reading speed (default: 200)
+    # @return [String] human-readable reading time (e.g., "3 min read")
+    def reading_time(words_per_minute: 200)
+      minutes = (word_count / words_per_minute.to_f).ceil
+      if minutes < 1
+        "Less than 1 min read"
+      elsif minutes == 1
+        "1 min read"
+      else
+        "#{minutes} min read"
+      end
+    end
     # Get all media objects (images, etc.) from the document
     #
     # @return [Array<Media>] array of media objects
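
A note on the rounding in `reading_time` from the hunk above: because the minute count is rounded up with `ceil`, any non-zero word count yields at least 1, so the "Less than 1 min read" branch appears to be reachable only for a document with no words. The standalone sketch below copies the same arithmetic into a hypothetical function, `reading_time_for`, that takes the word count as an argument so the behaviour can be tried without the gem:

```ruby
# Standalone copy of the rounding logic; reading_time_for is an illustrative
# name, not a nitfr method.
def reading_time_for(word_count, words_per_minute: 200)
  # Float division, then round up to whole minutes.
  minutes = (word_count / words_per_minute.to_f).ceil
  if minutes < 1
    "Less than 1 min read"
  elsif minutes == 1
    "1 min read"
  else
    "#{minutes} min read"
  end
end

[0, 1, 200, 201, 650].each do |words|
  puts "#{words} words: #{reading_time_for(words)}"
end
```

With the default of 200 words per minute, 201 words already rounds up to 2 minutes and 650 words to 4.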
@@ -71,6 +103,13 @@ module NITFr
       body&.media || []
     end
+    # Get all footnotes from the document
+    #
+    # @return [Array<Footnote>] array of footnote objects
+    def footnotes
+      body&.footnotes || []
+    end
     # Get document metadata from docdata
     #
     # @return [Docdata, nil] the docdata object
@@ -127,8 +166,243 @@ module NITFr
       @xml_doc.to_s
     end
+    # Convert document to a Hash representation
+    #
+    # @return [Hash] the document as a hash
+    def to_h
+      {
+        version: version,
+        change_date: change_date,
+        change_time: change_time,
+        title: title,
+        doc_id: doc_id,
+        issue_date: issue_date&.to_s,
+        head: head&.to_h,
+        body: body&.to_h
+      }.compact
+    end
+    # Convert document to JSON string
+    #
+    # @param args [Array] arguments passed to JSON.generate
+    # @return [String] JSON representation of the document
+    def to_json(*args)
+      require "json"
+      to_h.to_json(*args)
+    end
+    # =========================================================================
+    # Search Methods
+    # =========================================================================
+    # Search the full document text for a query string or pattern
+    #
+    # @param query [String, Regexp] the search query (string or regex)
+    # @param case_sensitive [Boolean] whether search is case-sensitive (default: false)
+    # @return [Array<Hash>] array of match results with context
+    def search(query, case_sensitive: false)
+      pattern = build_search_pattern(query, case_sensitive)
+      results = []
+      paragraphs.each_with_index do |para, index|
+        para.text.scan(pattern) do
+          match = Regexp.last_match
+          results << {
+            paragraph_index: index,
+            paragraph: para,
+            match: match[0],
+            position: match.begin(0)
+          }
+        end
+      end
+      results
+    end
+    # Check if document contains the given text
+    #
+    # @param query [String, Regexp] the search query
+    # @param case_sensitive [Boolean] whether search is case-sensitive (default: false)
+    # @return [Boolean] true if text is found
+    def contains?(query, case_sensitive: false)
+      pattern = build_search_pattern(query, case_sensitive)
+      text.match?(pattern)
+    end
+    # Find paragraphs containing the given text
+    #
+    # @param query [String, Regexp] the search query
+    # @param case_sensitive [Boolean] whether search is case-sensitive (default: false)
+    # @return [Array<Paragraph>] matching paragraphs
+    def paragraphs_containing(query, case_sensitive: false)
+      pattern = build_search_pattern(query, case_sensitive)
+      paragraphs.select { |p| p.text.match?(pattern) }
+    end
+    # Find paragraphs mentioning specific entities
+    #
+    # @param person [String, nil] person name to search for
+    # @param org [String, nil] organization name to search for
+    # @param location [String, nil] location name to search for
+    # @param match_all [Boolean] if true, paragraph must contain ALL specified entities (default: false)
+    # @return [Array<Paragraph>] matching paragraphs
+    def paragraphs_mentioning(person: nil, org: nil, location: nil, match_all: false)
+      return paragraphs if person.nil? && org.nil? && location.nil?
+      paragraphs.select do |para|
+        matches = []
+        matches << para.mentions_person?(person) if person
+        matches << para.mentions_org?(org) if org
+        matches << para.mentions_location?(location) if location
+        match_all ? matches.all? : matches.any?
+      end
+    end
+    # Find paragraphs using a custom block
+    #
+    # @yield [Paragraph] block to evaluate each paragraph
+    # @return [Array<Paragraph>] paragraphs where block returns true
+    # @example Find long paragraphs
+    #   doc.paragraphs_where { |p| p.word_count > 50 }
+    # @example Find lead paragraphs with links
+    #   doc.paragraphs_where { |p| p.lead? && p.links.any? }
+    def paragraphs_where(&block)
+      return paragraphs unless block_given?
+      paragraphs.select(&block)
+    end
+    # Find the first paragraph matching criteria
+    #
+    # @yield [Paragraph] block to evaluate each paragraph
+    # @return [Paragraph, nil] first matching paragraph or nil
+    def find_paragraph(&block)
+      return nil unless block_given?
+      paragraphs.find(&block)
+    end
+    # Find media by type
+    #
+    # @param type [String, Symbol, nil] media type ('image', 'video', 'audio')
+    # @return [Array<Media>] matching media objects
+    def find_media(type: nil)
+      return media if type.nil?
+      type_str = type.to_s
+      media.select { |m| m.type == type_str }
+    end
+    # Get all images from the document
+    #
+    # @return [Array<Media>] image media objects
+    def images
+      media.select(&:image?)
+    end
+    # Get all videos from the document
+    #
+    # @return [Array<Media>] video media objects
+    def videos
+      media.select(&:video?)
+    end
+    # Get all audio from the document
+    #
+    # @return [Array<Media>] audio media objects
+    def audio
+      media.select(&:audio?)
+    end
+    # Get all unique people mentioned in the document
+    #
+    # @return [Array<String>] unique person names from paragraphs and docdata
+    def all_people
+      all_entities[:people]
+    end
+    # Get all unique organizations mentioned in the document
+    #
+    # @return [Array<String>] unique organization names from paragraphs and docdata
+    def all_organizations
+      all_entities[:organizations]
+    end
+    # Get all unique locations mentioned in the document
+    #
+    # @return [Array<String>] unique location names from paragraphs and docdata
+    def all_locations
+      all_entities[:locations]
+    end
+    # Get all unique entities (people, organizations, locations) mentioned
+    #
+    # Uses single-pass aggregation for efficiency when multiple entity
+    # methods are called.
+    #
+    # @return [Hash] hash with :people, :organizations, :locations arrays
+    def all_entities
+      @all_entities ||= aggregate_entities
+    end
+    # Count occurrences of a term in the document
+    #
+    # @param query [String, Regexp] the search query
+    # @param case_sensitive [Boolean] whether search is case-sensitive (default: false)
+    # @return [Integer] number of occurrences
+    def count_occurrences(query, case_sensitive: false)
+      pattern = build_search_pattern(query, case_sensitive)
+      text.scan(pattern).size
+    end
+    # Get excerpt around first match of query
+    #
+    # @param query [String, Regexp] the search query
+    # @param context_chars [Integer] characters of context on each side (default: 50)
+    # @param case_sensitive [Boolean] whether search is case-sensitive (default: false)
+    # @return [String, nil] excerpt with surrounding context and ellipses, or nil if not found
+    def excerpt(query, context_chars: 50, case_sensitive: false)
+      pattern = build_search_pattern(query, case_sensitive)
+      match = text.match(pattern)
+      return nil unless match
+      start_pos = [match.begin(0) - context_chars, 0].max
+      end_pos = [match.end(0) + context_chars, text.length].min
+      prefix = start_pos > 0 ? "..." : ""
+      suffix = end_pos < text.length ? "..." : ""
+      excerpt_text = text[start_pos...end_pos]
+      "#{prefix}#{excerpt_text}#{suffix}"
+    end
     private
+    # Aggregate all entities in a single pass through paragraphs
+    #
+    # @return [Hash] hash with :people, :organizations, :locations arrays
+    def aggregate_entities
+      result = { people: [], organizations: [], locations: [] }
+      paragraphs.each do |para|
+        result[:people].concat(para.people)
+        result[:organizations].concat(para.organizations)
+        result[:locations].concat(para.locations)
+      end
+      # Add docdata entities if available
+      if docdata
+        result[:people].concat(docdata.people || [])
+        result[:organizations].concat(docdata.organizations || [])
+        result[:locations].concat(docdata.locations || [])
+      end
+      # Remove duplicates
+      result.transform_values!(&:uniq)
+      result
+    end
     # Parse XML string into REXML document
     #
     # REXML does not expand external entities by default, which protects against: