cv-parser 0.1.1 → 0.1.2

This diff shows the contents of publicly released versions of this package as they appear in its public registry. It is provided for informational purposes only and reflects the changes between the two published versions.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: e81f88aa6b7a332110ce00a4280c55ba844b270022c875eeb0369c45564ef31b
- data.tar.gz: ca28c678b098c51a8c223d7f8ce5d1e6eef779ddf5e14d0f10adee32b85066e3
+ metadata.gz: f19aafe2fe36f105d7c2d307d2717697e193326b9505e14d7f5bfe7be227ff9b
+ data.tar.gz: 1d82e243b702db581b1fdf68e2ff55ab03d1417d3058b4649e012f9210c05874
  SHA512:
- metadata.gz: da3933d924d9bbb7aae50a0e773c9e651b37f249d265a28e0ae78516dad83052e8abc4952bd94dd2536d7826f74ec41eebbe53f479d523d1db70565ddb8a86f7
- data.tar.gz: 17e4706faeb54916bfe85305fd170572a62231203489c7d394830c90b260ea0a8a8502a40c7c3f1a7e7de07446ee317f1a2ddba8faf1738d1548056d9e1e13aa
+ metadata.gz: 97e6de02e6543085f46d03f77b5a64ee712c4da2bb30b3ac3d923d0aa9d820f9824a3a838226904901e6d5abfbadf51b28b576b08f437d1b346aac37321d8444
+ data.tar.gz: ea5efd136b8cbbde4eb96694d164ae60295807017129434886fdc57708ec5ebd72a8a4a807d72c4732289467207b315564c97f51bc9b2939ce407ef7fac176b4
data/README.md CHANGED
@@ -3,12 +3,14 @@
  A Ruby gem for parsing and extracting structured information from CVs/resumes using LLM providers.
 
  ## Features
- - Convert DOCX to PDF before uploading to LLM providers
- - Extract structured data from CVs by directly uploading files to LLM providers
- - Configure different LLM providers (OpenAI, Anthropic, and Faker for testing)
- - Customizable output schema to match your data requirements (JSON Schema format)
- - Command-line interface for quick parsing and analysis
- - Robust error handling and validation
+ - **Multiple file format support**: PDF, DOCX, TXT, and Markdown files
+ - **Smart file processing**: Converts DOCX to PDF, processes text files directly (no upload required)
+ - **Extract structured data** from CVs using leading LLM providers
+ - **Multiple LLM providers**: OpenAI, Anthropic, and Faker (for testing)
+ - **Customizable output schema** using JSON Schema format
+ - **Command-line interface** for quick parsing and analysis
+ - **Performance optimized**: Text files bypass upload for faster processing
+ - **Robust error handling** and validation
 
  ## Installation
 
@@ -191,10 +193,22 @@ extractor.extract(
 
  ```ruby
  extractor = CvParser::Extractor.new
+
+ # Extract from PDF (uploaded to LLM)
  result = extractor.extract(
    file_path: "path/to/resume.pdf"
  )
 
+ # Extract from text file (fast, no upload)
+ result = extractor.extract(
+   file_path: "path/to/resume.txt"
+ )
+
+ # Extract from markdown file (fast, no upload)
+ result = extractor.extract(
+   file_path: "path/to/resume.md"
+ )
+
  puts "Name: #{result['personal_info']['name']}"
  puts "Email: #{result['personal_info']['email']}"
  result['skills'].each { |skill| puts "- #{skill}" }
@@ -205,10 +219,15 @@ result['skills'].each { |skill| puts "- #{skill}" }
 
  ```ruby
  begin
    result = extractor.extract(
-     file_path: "path/to/resume.pdf"
+     file_path: "path/to/resume.txt" # Works with any supported format
    )
  rescue CvParser::FileNotFoundError, CvParser::FileNotReadableError => e
    puts "File error: #{e.message}"
+ rescue CvParser::EmptyTextFileError => e
+   puts "Text file is empty: #{e.message}"
+ rescue CvParser::TextFileEncodingError => e
+   puts "Text file encoding error: #{e.message}"
  rescue CvParser::ParseError => e
    puts "Error parsing the response: #{e.message}"
  rescue CvParser::APIError => e
@@ -225,10 +244,19 @@ end
 
  CV Parser also provides a CLI for quick analysis:
 
  ```bash
+ # Process different file formats
  cv-parser path/to/resume.pdf
+ cv-parser path/to/resume.docx
+ cv-parser path/to/resume.txt
+ cv-parser path/to/resume.md
+
+ # Use different providers
  cv-parser --provider anthropic path/to/resume.pdf
- cv-parser --format yaml --output result.yaml path/to/resume.pdf
- cv-parser --schema custom-schema.json path/to/resume.pdf
+ cv-parser --provider openai path/to/resume.txt
+
+ # Output options
+ cv-parser --format yaml --output result.yaml path/to/resume.md
+ cv-parser --schema custom-schema.json path/to/resume.txt
  cv-parser --help
  ```
 
@@ -242,6 +270,45 @@ export CV_PARSER_API_KEY=your-api-key
  cv-parser resume.pdf
  ```
 
+ ## Supported File Formats
+
+ CV Parser supports multiple file formats with optimized processing:
+
+ ### File Format Support
+
+ | Format | Extension | Processing Method | Upload Required | Performance |
+ |--------|-----------|-------------------|-----------------|-------------|
+ | PDF | `.pdf` | Direct upload | Yes | Standard |
+ | DOCX | `.docx` | Convert to PDF → Upload | Yes | Standard |
+ | Text | `.txt` | Direct text processing | **No** | **Fast** |
+ | Markdown | `.md` | Direct text processing | **No** | **Fast** |
+
+ ### Performance Benefits of Text Files
+
+ Text files (`.txt` and `.md`) offer significant performance advantages:
+
+ - **No file upload overhead**: Content is included directly in the API request
+ - **Faster processing**: Eliminates the upload → reference workflow
+ - **Reduced API calls**: Single request instead of upload + process
+ - **Lower bandwidth usage**: Direct text inclusion vs binary file transfer
+ - **Better for automation**: Simpler integration in automated workflows
+
+ ### File Size Limits
+
+ - **PDF/DOCX files**: Limited by LLM provider (typically 20MB)
+ - **Text files**: No explicit size limits (limited only by LLM provider)
+
+ ### File Processing Examples
+
+ ```ruby
+ # Fast text processing (no upload)
+ extractor.extract(file_path: "resume.txt", output_schema: schema)
+ extractor.extract(file_path: "resume.md", output_schema: schema)
+
+ # Standard file processing (with upload)
+ extractor.extract(file_path: "resume.pdf", output_schema: schema)
+ extractor.extract(file_path: "resume.docx", output_schema: schema)
+ ```
 
  ## Advanced Configuration
 
@@ -1,6 +1,7 @@
  # frozen_string_literal: true
 
  module CvParser
+   # Configuration settings for CV parser including LLM provider, API credentials, and extraction options
    class Configuration
      attr_accessor :provider, :model, :api_key, :timeout, :max_retries, :prompt, :system_prompt,
                    :output_schema, :max_tokens, :temperature
@@ -11,4 +11,7 @@ module CvParser
    class InvalidRequestError < APIError; end
    class FileNotFoundError < Error; end
    class FileNotReadableError < Error; end
+   class TextFileError < Error; end
+   class TextFileEncodingError < TextFileError; end
+   class EmptyTextFileError < TextFileError; end
  end
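The two new errors both subclass the new `TextFileError`, so callers can rescue each failure mode individually or the whole text-file family at once. A minimal standalone sketch of that hierarchy (the classes are re-declared here for experimentation; this does not load the released gem):

```ruby
# Re-declaration of the error hierarchy added in 0.1.2 (mirrors the diff above).
module CvParser
  class Error < StandardError; end
  class TextFileError < Error; end
  class TextFileEncodingError < TextFileError; end
  class EmptyTextFileError < TextFileError; end
end

# Because both new errors inherit from TextFileError, one rescue clause
# catches either of them.
def classify(error_class)
  raise error_class, "demo"
rescue CvParser::TextFileError => e
  e.class.name
end

puts classify(CvParser::EmptyTextFileError)    # => CvParser::EmptyTextFileError
puts classify(CvParser::TextFileEncodingError) # => CvParser::TextFileEncodingError
```

Rescuing `CvParser::TextFileError` is a reasonable catch-all when the caller does not care whether the text file was empty or mis-encoded.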
@@ -1,6 +1,7 @@
  # frozen_string_literal: true
 
  module CvParser
+   # Extracts structured data from CV/resume files using configured LLM providers
    class Extractor
      def initialize(config = CvParser.configuration)
        @config = config
@@ -5,6 +5,7 @@ require "rexml/document"
  require "rexml/xpath"
 
  module CvParser
+   # Converts DOCX files to PDF format by extracting text content and rendering it as PDF pages
    class PdfConverter
      # Constants modules for better organization
      module PageConstants
@@ -30,13 +30,7 @@ module CvParser
    def extract_data(output_schema:, file_path: nil)
      validate_inputs!(output_schema, file_path)
 
-     processed_file_path = prepare_file(file_path)
-     base64_content = encode_file_to_base64(processed_file_path)
-
-     response = make_api_request(output_schema, base64_content)
-
-     cleanup_temp_file(processed_file_path, file_path)
-
+     response = process_file_and_get_response(file_path, output_schema)
      handle_tool_response(response, output_schema)
    rescue Faraday::Error => e
      raise APIError, "Anthropic API connection error: #{e.message}"
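On the new text path, the Anthropic provider skips upload entirely and inlines the CV into a single tool-use request (see the `make_api_request_with_text` / `build_text_request_body` additions later in this diff). A standalone sketch of that payload shape in plain Ruby — the model name, tool definition, prompt, and CV text below are hypothetical stand-ins for the gem's config:

```ruby
require "json"

# Hypothetical stand-ins for @config values and the gem's extraction tool.
model = "claude-3-5-sonnet"  # illustrative model name, not from the diff
extraction_tool = { name: "extract_cv", input_schema: { type: "object" } }
prompt = "Extract the CV fields."
cv_text = "Jane Doe\njane@example.com"

payload = {
  model: model,
  max_tokens: 1024,
  tools: [extraction_tool],
  tool_choice: { type: "tool", name: extraction_tool[:name] },
  messages: [
    {
      role: "user",
      content: [
        # Prompt and raw CV text travel together in one text block --
        # no upload/reference round trip is involved.
        { type: "text", text: "#{prompt}\n\nCV Content:\n#{cv_text}" }
      ]
    }
  ]
}

puts JSON.pretty_generate(payload)
```

The one-request shape is what makes `.txt`/`.md` inputs faster than the upload → reference workflow used for PDFs.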
@@ -44,6 +38,21 @@
 
    private
 
+   def process_file_and_get_response(file_path, output_schema)
+     if text_file?(file_path)
+       # Handle text files without base64 encoding
+       text_content = read_text_file_content(file_path)
+       make_api_request_with_text(output_schema, text_content)
+     else
+       # Existing file processing logic
+       processed_file_path = prepare_file(file_path)
+       base64_content = encode_file_to_base64(processed_file_path)
+       response = make_api_request(output_schema, base64_content)
+       cleanup_temp_file(processed_file_path, file_path)
+       response
+     end
+   end
+
    def validate_inputs!(output_schema, file_path)
      raise ArgumentError, "File_path must be provided" unless file_path
 
@@ -122,6 +131,39 @@ module CvParser
      }
    end
 
+   def make_api_request_with_text(output_schema, text_content)
+     extraction_tool = build_extraction_tool(output_schema)
+
+     @client.post do |req|
+       req.headers["Content-Type"] = "application/json"
+       req.body = build_text_request_body(output_schema, extraction_tool, text_content).to_json
+     end
+   end
+
+   def build_text_request_body(output_schema, extraction_tool, text_content)
+     {
+       model: @config.model || DEFAULT_MODEL,
+       max_tokens: @config.max_tokens,
+       temperature: @config.temperature,
+       system: build_system_prompt,
+       tools: [extraction_tool],
+       tool_choice: { type: "tool", name: TOOL_NAME },
+       messages: [build_text_message(output_schema, text_content)]
+     }
+   end
+
+   def build_text_message(output_schema, text_content)
+     {
+       role: "user",
+       content: [
+         {
+           type: "text",
+           text: "#{build_extraction_prompt(output_schema)}\n\nCV Content:\n#{text_content}"
+         }
+       ]
+     }
+   end
+
    def build_extraction_tool(output_schema)
      json_schema = normalize_schema_to_json_schema(output_schema)
 
@@ -47,15 +47,16 @@ module CvParser
        # Convert DOCX to PDF
        @pdf_converter.convert(file_path, temp_pdf_path)
        temp_pdf_path
-     when ".pdf"
-       # Already a PDF, return as-is
+     when ".pdf", ".txt", ".md"
+       # PDF and text files - return as-is
+       # Text files will be handled as text content by providers
        file_path
      else
        # For other file types, let the provider handle them directly
        file_path
      end
    rescue StandardError => e
-     raise APIError, "Failed to convert DOCX to PDF: #{e.message}"
+     raise APIError, "Failed to convert file: #{e.message}"
    end
 
    def cleanup_temp_file(processed_file_path, original_file_path)
@@ -114,6 +115,21 @@ module CvParser
 
      raise FileNotReadableError, "File not readable: #{file_path}"
    end
+
+   def text_file?(file_path)
+     [".txt", ".md"].include?(File.extname(file_path).downcase)
+   end
+
+   def read_text_file_content(file_path)
+     content = File.read(file_path, encoding: "UTF-8")
+
+     # Validate content is not empty
+     raise EmptyTextFileError, "Text file is empty: #{file_path}" if content.strip.empty?
+
+     content
+   rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError => e
+     raise TextFileEncodingError, "Invalid text encoding in file #{file_path}: #{e.message}"
+   end
  end
  end
  end
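The extension check and empty-file validation above are easy to try out in isolation. A standalone re-implementation sketch (a plain `RuntimeError` stands in for the gem's `EmptyTextFileError`, and the helper names are local, not the gem's API):

```ruby
require "tempfile"

# Standalone sketch of the two helpers added in this diff.
TEXT_EXTENSIONS = [".txt", ".md"].freeze

def text_file?(file_path)
  # Case-insensitive extension check, as in the diff above.
  TEXT_EXTENSIONS.include?(File.extname(file_path).downcase)
end

def read_text_content(file_path)
  content = File.read(file_path, encoding: "UTF-8")
  # Whitespace-only files count as empty.
  raise "Text file is empty: #{file_path}" if content.strip.empty?
  content
end

Tempfile.create(["resume", ".md"]) do |f|
  f.write("# Jane Doe\n\nRuby engineer")
  f.flush
  puts text_file?(f.path)             # true -- .md is processed as inline text
  puts read_text_content(f.path).lines.first
end

puts text_file?("resume.pdf")         # false -- PDFs still go through upload
```

Note that `File.read(..., encoding: "UTF-8")` sets the external encoding; the gem's rescue of `Encoding::InvalidByteSequenceError` / `Encoding::UndefinedConversionError` covers the cases where later string operations trip over invalid bytes.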
@@ -27,17 +27,32 @@ module CvParser
    JSON_SCHEMA_TYPE = "json_schema"
 
    def extract_data(output_schema:, file_path: nil)
-     validate_schema_format!(output_schema)
+     validate_inputs!(output_schema, file_path)
      generate_fake_data(output_schema)
    end
 
-   def upload_file(file_path)
+   def upload_file(_file_path)
      # No-op for faker provider
      { id: "fake-file-#{SecureRandom.hex(8)}" }
    end
 
    private
 
+   def validate_inputs!(output_schema, file_path)
+     validate_schema_format!(output_schema)
+
+     # Validate file if provided
+     return unless file_path
+
+     validate_file_exists!(file_path)
+     validate_file_readable!(file_path)
+
+     # For text files, validate content
+     return unless text_file?(file_path)
+
+     read_text_file_content(file_path) # Just for validation
+   end
+
    def validate_schema_format!(output_schema)
      return if valid_json_schema_format?(output_schema)
 
@@ -129,7 +144,7 @@ module CvParser
      end
    end
 
-   def generate_string_value(key, description = nil)
+   def generate_string_value(key, _description = nil)
      key_string = key.to_s.downcase
 
      case key_string
@@ -42,12 +42,7 @@ module CvParser
    def extract_data(output_schema:, file_path: nil)
      validate_inputs!(output_schema, file_path)
 
-     processed_file_path = prepare_file(file_path)
-     file_id = upload_file(processed_file_path)
-     response = create_response_with_file(file_id, output_schema)
-
-     cleanup_temp_file(processed_file_path, file_path)
-
+     response = process_file_and_get_response(file_path, output_schema)
      parse_response_output(response)
    rescue Timeout::Error => e
      raise APIError, "OpenAI API timeout: #{e.message}"
@@ -74,6 +69,21 @@
 
    private
 
+   def process_file_and_get_response(file_path, output_schema)
+     if text_file?(file_path)
+       # Handle text files without upload
+       text_content = read_text_file_content(file_path)
+       create_response_with_text(text_content, output_schema)
+     else
+       # Existing file upload logic
+       processed_file_path = prepare_file(file_path)
+       file_id = upload_file(processed_file_path)
+       response = create_response_with_file(file_id, output_schema)
+       cleanup_temp_file(processed_file_path, file_path)
+       response
+     end
+   end
+
    def validate_inputs!(output_schema, file_path)
      raise ArgumentError, "File_path must be provided" unless file_path
 
@@ -277,10 +287,44 @@ module CvParser
      ]
    end
 
+   def create_response_with_text(text_content, schema)
+     uri = URI(API_RESPONSES_URL)
+     payload = build_text_response_payload(text_content, schema)
+     make_responses_api_request(uri, payload)
+   end
+
+   def build_text_response_payload(text_content, schema)
+     {
+       model: @config.model || DEFAULT_MODEL,
+       input: build_text_input_for_responses_api(text_content),
+       text: {
+         format: {
+           type: "json_schema",
+           name: SCHEMA_NAME,
+           schema: schema_to_json_schema(schema)
+         }
+       }
+     }
+   end
+
+   def build_text_input_for_responses_api(text_content)
+     [
+       {
+         role: "user",
+         content: [
+           {
+             type: "input_text",
+             text: "#{build_extraction_prompt}\n\nCV Content:\n#{text_content}"
+           }
+         ]
+       }
+     ]
+   end
+
    def parse_response_output(response)
      # Extract content from Responses API format
      output = response["output"]
-     return nil unless output&.is_a?(Array) && !output.empty?
+     return nil unless output.is_a?(Array) && !output.empty?
 
      # Look for message with text content
      text_content = nil
@@ -289,14 +333,9 @@ module CvParser
      if item.is_a?(Hash)
        if item["type"] == "message" && item["content"]
          item["content"].each do |content_item|
-           if content_item.is_a?(Hash)
-             if content_item["type"] == "text"
-               text_content = content_item["text"]
-               break
-             elsif content_item["type"] == "output_text"
-               text_content = content_item["text"]
-               break
-             end
+           if content_item.is_a?(Hash) && %w[text output_text].include?(content_item["type"])
+             text_content = content_item["text"]
+             break
            end
          end
        elsif item["type"] == "text" && item["text"]
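The refactor above collapses the duplicated `text` / `output_text` branches into one membership check. A standalone sketch of the simplified matching logic, run against a hypothetical Responses-API-shaped hash (the helper name and sample payload are illustrative, not the gem's API):

```ruby
# Returns the first text block found in a Responses-API-style output array;
# "text" and "output_text" items are handled by a single branch, as in the diff.
def first_text(response)
  output = response["output"]
  return nil unless output.is_a?(Array) && !output.empty?

  output.each do |item|
    next unless item.is_a?(Hash) && item["type"] == "message" && item["content"]

    item["content"].each do |content_item|
      if content_item.is_a?(Hash) && %w[text output_text].include?(content_item["type"])
        return content_item["text"]
      end
    end
  end
  nil
end

# Hypothetical response payload for illustration only.
sample = {
  "output" => [
    { "type" => "message",
      "content" => [{ "type" => "output_text", "text" => "{\"name\":\"Jane\"}" }] }
  ]
}

puts first_text(sample)
```

Note the behavior change alongside the cleanup: `output&.is_a?(Array)` became `output.is_a?(Array)`, which is equivalent here since `nil.is_a?(Array)` is already `false`.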
@@ -1,5 +1,5 @@
  # frozen_string_literal: true
 
  module CvParser
-   VERSION = "0.1.1"
+   VERSION = "0.1.2"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: cv-parser
  version: !ruby/object:Gem::Version
-   version: 0.1.1
+   version: 0.1.2
  platform: ruby
  authors:
  - Gys Muller
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2025-07-17 00:00:00.000000000 Z
+ date: 2025-08-05 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: base64