ligamagic-scraper 0.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+ metadata.gz: e5e915123285e7c8e182658c4e4579eaf0ced97e548d0f63bce91916c2a892e4
+ data.tar.gz: 4eb14fcb363ffcf64c6c8633efb9a70c050f03e9255811a539444c3689553f14
+ SHA512:
+ metadata.gz: 940f79ede397bc0189193925c78a535289190d96b4396064ff84ef60d346f8a9930d3e05ba22c54c34b70a0fe2837f520b4d4e745d1739d5ab08978accba8428
+ data.tar.gz: c753dc394b4032a1bda76b8bd301085e6938451b80290d0cdd5ff9faa2b79ba42a23ffb6cc2c18e6367c303d24dad36f2f852993d78ddd8eb1841adde04a2b0f
data/CHANGELOG.md ADDED
@@ -0,0 +1,318 @@
+ # Changelog
+
+ All notable changes to this project will be documented in this file.
+
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+ ## [Unreleased]
+
+ ## [0.6.0] - 2025-10-29
+
+ ### Changed
+ - **Major Performance Optimization for StoreScraper**: Improved extraction speed by 20% (1.7s faster)
+   - Captures HTML incrementally while pages load (browser still open)
+   - Replaced Capybara DOM queries with Nokogiri in-memory HTML parsing
+   - Browser closes immediately after loading, before the extraction phase
+   - Extraction is now near-instant (~0.01s with Nokogiri), whereas it was previously interleaved with page loading
+   - Added automatic deduplication by card ID
+   - Added comprehensive performance benchmarking with timing breakdowns
+
+ ### Added
+ - **Performance Summary Logs for StoreScraper**: Detailed timing information
+   - Phase 1: Loading + Capture (incremental HTML capture during page loads)
+   - Phase 2: Extraction (data parsing with Nokogiri, browser closed)
+   - Total time and product counts displayed
+
+ ## [0.5.0] - 2025-10-29
+
+ ### Changed
+ - **Major Performance Optimization for GlobalScraper**: Improved extraction speed by 34% (7.5s faster)
+   - Changed from incremental HTML capture to batch capture after all pages loaded
+   - Replaced Capybara DOM queries with Nokogiri in-memory HTML parsing
+   - Browser now closes immediately after page loading (37% faster browser release)
+   - Extraction is ~250x faster: 7.41s → 0.03s (using Nokogiri instead of Capybara)
+   - Reduced memory usage: 75% fewer HTML snapshots (160 vs 400) for the same results
+   - Added automatic deduplication by product ID (the website shows overlapping products)
+   - Added comprehensive performance benchmarking with timing breakdowns
+
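The deduplication step mentioned above can be done with `Enumerable#uniq` keyed by ID, which keeps the first occurrence of each product. A minimal sketch, with field names assumed for illustration:

```ruby
# Hypothetical sketch: drop duplicate products that appear in
# overlapping HTML snapshots, keeping the first occurrence per ID.
def deduplicate(products)
  products.uniq { |product| product[:id] }
end

products = [
  { id: 10, name: 'Sol Ring' },
  { id: 11, name: 'Mana Crypt' },
  { id: 10, name: 'Sol Ring' } # duplicate from an overlapping snapshot
]
deduplicate(products)
# => [{ id: 10, name: "Sol Ring" }, { id: 11, name: "Mana Crypt" }]
```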
+ ### Added
+ - **Performance Summary Logs**: Detailed timing information for each scraping phase
+   - Phase 1: Loading (page navigation and clicks)
+   - Phase 2: Capture (HTML snapshot collection)
+   - Phase 3: Extraction (data parsing with Nokogiri)
+   - Total time and product counts displayed
+ - **Benchmark Support**: Added `Benchmark` and `Set` dependencies for performance tracking
+ - **Product Deduplication**: Automatically removes duplicate products by ID
+
+ ## [0.4.0] - 2025-10-29
+
+ ### Added
+ - **Store URL Filtering**: Automatic filtering and ordering for store searches
+   - Orders results by price (most expensive to cheapest): `txt_order=6`
+   - Filters to in-stock items only: `txt_estoque=1`
+   - Applied to both store listing and store search modes
+ - **Store Search with Search Term**: Search within specific stores using `-u STORE -s TERM`
+   - Scrapes all pages until no more results (unlimited pagination)
+   - URL automatically includes the search parameter: `&busca=<term>`
+   - Search term included in the filename and JSON output
+ - **Pagination Support for Store Scraper**: Automatically scrapes multiple pages
+   - Detects the next-page button using the `.ecomresp-paginacao` class
+   - Scrapes all available pages for search results
+   - Configurable max-pages limit for store listings
+   - Sleeps 1 second between page requests
+ - **Max Pages Limit (`-p/--pages N`)**: Required page limit for store listings
+   - Maximum: 5 pages (values above 5 are automatically capped)
+   - Required when using `-u` without `-s`
+   - Prevents accidental full-inventory scraping
+   - Safety feature to control scraping scope
+   - Example: `ligamagic-scraper -u test-store -p 3` (scrapes 3 pages)
+   - Example: `ligamagic-scraper -u test-store -p 10` (capped to 5 pages)
+ - **Three Search Modes**:
+   1. Global search (`-s TERM`): Search across all of Liga Magic, unlimited pages
+   2. Store listing (`-u STORE -p N`): List products from a store, max N pages
+   3. Store search (`-u STORE -s TERM`): Search within a store, all pages
+ - **Automatic Pagination Detection**: Uses the next-page button to continue scraping
+ - **Search Term in Filenames**: Store searches include the search slug in the filename
+   - Format: `YYYYMMDD_HHMMSS__search_term.json`
+
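A sketch of how the documented query parameters could be combined into a store URL. Only `txt_order=6`, `txt_estoque=1`, `busca=<term>`, `tcg=1`, and `page=N` come from the changelog; the helper itself and the exact parameter ordering are illustrative assumptions.

```ruby
require 'erb' # for ERB::Util.url_encode

# Hypothetical URL builder combining the filters documented above:
# price ordering (txt_order=6), in-stock only (txt_estoque=1), and
# either a search term (busca=) or the plain listing mode (tcg=1).
def store_url(domain, search_term: nil, page: 1)
  base = "https://www.#{domain}.com.br/?view=ecom/itens"
  mode = if search_term
           "&busca=#{ERB::Util.url_encode(search_term)}"
         else
           '&tcg=1'
         end
  "#{base}#{mode}&page=#{page}&txt_order=6&txt_estoque=1"
end

store_url('kamm-store', page: 2)
# => "https://www.kamm-store.com.br/?view=ecom/itens&tcg=1&page=2&txt_order=6&txt_estoque=1"
```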
+ ### Changed
+ - **StoreScraper Parameters**: Now accepts `search_term` and `max_pages` parameters
+   - `search_term`: Optional search term to filter products within a store
+   - `max_pages`: Required when no search term is provided (capped at 5)
+ - **CLI Flag Behavior**: `-s` and `-u` can now be used together
+   - Previously mutually exclusive
+   - Enables store search with a search term
+   - The `-s` flag works for both global and store searches
+ - **Store URLs**: Automatically adapt based on search term and page number
+   - With search: `?view=ecom/itens&busca=<term>&page=N`
+   - Without search: `?view=ecom/itens&tcg=1&page=N`
+ - **Validation Logic**: Updated to require and cap `max_pages`
+   - Requires the `-p` flag when using `-u` without `-s`
+   - Automatically caps `max_pages` at 5 (e.g., a user-provided 10 becomes 5)
+   - Clear error messages guide users to correct usage
+ - **Store Scraper Behavior**: Now scrapes multiple pages instead of just the first page
+   - Pagination loop visits each page sequentially
+   - Logs progress for each page
+   - Accumulates all products across pages
+
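The require-and-cap validation described above is a few lines of Ruby. The constant name, helper name, and error message below are illustrative assumptions; only the limit of 5 and the "required for store listings" rule come from the changelog.

```ruby
MAX_STORE_PAGES = 5

# Hypothetical sketch of the documented rule: -p is mandatory for
# store listings, and any value above 5 is silently clamped to 5.
def cap_pages(requested)
  raise ArgumentError, 'max_pages (-p) is required for store listings' if requested.nil?

  [requested, MAX_STORE_PAGES].min
end

cap_pages(3)  # => 3
cap_pages(10) # => 5
```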
+ ### Security
+ - **Max Pages Cap**: Prevents accidental full-inventory scraping
+   - Hard limit of 5 pages for store listings
+   - Values above 5 are automatically capped
+   - Protects against unintentional resource-intensive operations
+   - Cannot be overridden (use `-s` for unlimited pagination)
+
+ ### Known Limitations
+ - **Store Search Price/Qty Extraction**: When using store search with a search term (`-u STORE -s TERM`), price and quantity data cannot be extracted
+   - Liga Magic uses sophisticated CSS-sprite obfuscation as anti-scraping protection
+   - Both CSS class names and sprite background positions rotate per session
+   - Digits are rendered visually only, with no text content in the DOM
+   - Store listings without a search term (`-u STORE -p N`) work normally with full data
+   - See the code documentation in `store_scraper.rb` for technical details
+
+ ### Fixed
+ - **Alert System Comparison Bug**: Fixed incorrect change detection when comparing different search terms
+   - Previously compared against the most recent file chronologically, regardless of search term
+   - Now only compares files with matching search-term slugs
+   - Prevents false "new products" alerts when different searches are interleaved
+   - Example: Running a "booster box" search after a "volcanic" search now correctly compares against the previous "booster box" results
+ - Removed verbose debug logs for individual card details in the store scraper
+ - Fixed pagination hanging on single-page results (reduced wait time to 0)
+
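The slug-matching fix above amounts to filtering candidate files by suffix before picking the newest. A minimal sketch, assuming the `YYYYMMDD_HHMMSS__slug.json` filename format documented in this changelog (the helper name is illustrative):

```ruby
# Hypothetical sketch of the fix: compare only against the most recent
# file whose search-term slug matches, not the most recent file overall.
# Because filenames start with a sortable timestamp, lexicographic max
# is also the chronological max.
def previous_scrape(files, slug)
  files.select { |f| f.end_with?("__#{slug}.json") }.max
end

files = [
  '20251028_100000__volcanic.json',
  '20251028_110000__booster_box.json',
  '20251029_090000__volcanic.json'
]
previous_scrape(files, 'booster_box') # => "20251028_110000__booster_box.json"
```

Without the `select`, a "booster box" run would have been diffed against the newer "volcanic" file, producing the false "new products" alerts described above.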
+ ## [0.3.0] - 2025-10-27
+
+ ### Added
+ - **Structured Logging System**: Introduced a `Loggable` module for all classes
+   - All execution logs are now collected in an accessible `@logs` array
+   - Log entries include timestamp, level, message, and source
+   - Log levels: `:info`, `:debug`, `:warning`, `:error`
+   - Methods: `log_info`, `log_debug`, `log_warning`, `log_error`
+   - `formatted_logs` method returns an array of message strings
+   - `clear_logs` method to reset logs
+   - Programmatic access to execution logs via `scraper.logs` or `scraper.formatted_logs`
+
+ - **RSpec Test Suite**: Comprehensive test coverage with 175+ tests
+   - Unit tests for all scrapers (base, global, store)
+   - Unit tests for all alerts (system, base, file, telegram)
+   - Unit tests for CLI parsing and validation
+   - Integration tests using static HTML examples
+   - Loggable module tests
+   - Version validation tests
+   - Test structure mirrors the `lib/` structure
+
+ - **CLI Class**: Extracted command-line parsing into a dedicated `CLI` class
+   - Cleaner separation of concerns
+   - Easier to test
+   - Reduced `bin/ligamagic-scraper` from 120 to 28 lines
+
+ - **Organized Code Structure**: Reorganized `lib/ligamagic_scraper/` into subdirectories
+   - `scrapers/` - All scraper classes (base, global, store)
+   - `alerts/` - All alert-related classes (system, handlers)
+   - Core files at the root (`version.rb`, `cli.rb`)
+
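A `Loggable` mixin matching the API described above (an `@logs` array of timestamped entries, per-level helper methods, `formatted_logs`, `clear_logs`) could look like this. This is a sketch reconstructed from the changelog's description, not the gem's actual source:

```ruby
# Hypothetical sketch of the Loggable module described above.
module Loggable
  def logs
    @logs ||= []
  end

  # Generate log_info, log_debug, log_warning, log_error.
  %i[info debug warning error].each do |level|
    define_method("log_#{level}") do |message|
      logs << { timestamp: Time.now, level: level,
                message: message, source: self.class.name }
    end
  end

  # Plain message strings, for display or assertions.
  def formatted_logs
    logs.map { |entry| entry[:message] }
  end

  def clear_logs
    @logs = []
  end
end

class Demo
  include Loggable
end

demo = Demo.new
demo.log_info('scrape started')
demo.formatted_logs # => ["scrape started"]
```

Collecting entries instead of calling `puts` is what lets library callers stay silent by default and inspect logs programmatically, as the Changed section below notes.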
+ ### Changed
+ - **Logging Behavior**: All `puts` statements replaced with structured logging
+   - CLI displays logs after scraping completes
+   - Library usage is silent by default; logs are accessible programmatically
+   - More control over log output
+
+ - **Simplified API**: Removed noisy immediate output when used as a library
+
+ ### Removed
+ - **`verbose` parameter**: Removed from all scraper classes
+   - No longer needed with structured logging
+   - Use `scraper.formatted_logs` to access all messages
+   - CLI always shows logs (equivalent to the old verbose mode)
+
+ - **`-v` / `--verbose` CLI flag**: Removed from the command-line interface
+   - Logs are always collected and displayed
+   - More detailed logs available via log levels
+
+ ### Breaking Changes
+ - **`verbose:` parameter removed** from:
+   - `BaseScraper.new`
+   - `GlobalScraper.new`
+   - `StoreScraper.new`
+   - Solution: Remove the parameter from your code
+
+ - **CLI `-v` / `--verbose` flag removed**
+   - Logs are always displayed in the CLI
+   - For library usage, access logs via `scraper.formatted_logs`
+
+ ## [0.2.0] - 2025-10-27
+
+ ### Added
+ - **Alert System**: Complete infrastructure for detecting changes between scrapes
+   - Automatic comparison with previous scrape data
+   - Detects new products, removed products, price changes, quantity changes, and availability changes
+   - Change detection with percentage calculations for price changes
+   - Support for multiple alert types: file (implemented), telegram (stub)
+   - CLI flags: `-a`/`--alerts` to enable, `--alert-types` to configure
+   - Alert configuration via the library interface
+   - Automatic discovery of the most recent previous scrape file
+   - Formatted change summaries with emojis and statistics
+
+ - **Alert Handler Classes**:
+   - `BaseAlert`: Base class for all alert handlers, with formatting utilities
+   - `FileAlert`: File-based alert handler - fully implemented, saves to organized directories
+   - `TelegramAlert`: Telegram bot notification handler (stub - not implemented)
+
+ - **Organized Directory Structure**: Files now organized by type and store
+   - Global scrapes: `scrapped/global/YYYYMMDD_HHMMSS__slug.json`
+   - Store scrapes: `scrapped/stores/{store_domain}/YYYYMMDD_HHMMSS.json`
+   - Global alerts: `alerts_json/global/YYYYMMDD_HHMMSS.json`
+   - Store alerts: `alerts_json/stores/{store_domain}/YYYYMMDD_HHMMSS.json`
+
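The price-change detection with percentages mentioned above can be sketched by indexing the previous scrape by ID and comparing prices. Field names and the helper are illustrative assumptions, not the gem's actual alert code:

```ruby
# Hypothetical sketch: diff two scrapes and report price changes
# with a percentage, as the alert system described above does.
def price_changes(old_products, new_products)
  old_by_id = old_products.to_h { |p| [p[:id], p] }
  new_products.filter_map do |product|
    previous = old_by_id[product[:id]]
    next unless previous && previous[:price] != product[:price]

    change_pct = ((product[:price] - previous[:price]) /
                  previous[:price].to_f * 100).round(2)
    { id: product[:id], from: previous[:price],
      to: product[:price], change_pct: change_pct }
  end
end

old_scrape = [{ id: 1, price: 100.0 }]
new_scrape = [{ id: 1, price: 125.0 }]
price_changes(old_scrape, new_scrape)
# => [{ id: 1, from: 100.0, to: 125.0, change_pct: 25.0 }]
```

New and removed products would fall out of the same ID index: IDs only in `new_scrape` are additions, IDs only in `old_scrape` are removals.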
+ ### Changed
+ - **Filename Format**: Now includes datetime instead of just the date
+   - Allows multiple scrapes per day for better tracking
+   - Uses `__` (double underscore) to separate the datetime from the slug (global only)
+ - **BaseScraper**: Added an `alert_config` parameter to the constructor
+ - **GlobalScraper**: Added `alert_config` parameter support
+ - **StoreScraper**: Added `alert_config` parameter support
+ - **`BaseScraper#save_to_json`**: Now processes alerts before saving if enabled
+ - **`BaseScraper#find_previous_scrape`**: Simplified to find files in the same directory
+
+ ### Removed
+ - **EmailAlert**: Removed in favor of a simpler alert system
+ - **WebhookAlert**: Removed in favor of a simpler alert system
+ - Reduced alert types to file (implemented) and telegram (stub)
+
+ ### Breaking Changes
+ None - this is purely additive functionality. If alerts are not enabled, behavior is unchanged.
+
+ ## [0.1.7] - 2025-10-27
+
+ ### Changed
+ - **Store Scraper Simplified**: Now accepts only the store domain name instead of a full URL
+   - Changed parameter from `store_url:` to `store_domain:`
+   - Example: `ligamagic-scraper -u kamm-store` (instead of the full URL)
+   - Automatically builds the URL: `https://www.<domain>.com.br/?view=ecom/itens&tcg=1`
+ - **Removed Complexity**: Eliminated unnecessary store ID and name extraction logic
+ - **CLI Updated**: Flag `-u` now accepts a domain name instead of a full URL
+
+ ### Added
+ - **Card ID Extraction**: Store scraper now extracts a unique card ID from product links
+   - Extracts from the link parameter: `card=16149`
+   - Added a `card_id` field to the product output
+   - Useful for tracking specific cards across scrapes
+
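Pulling the card ID out of a product link's `card=` query parameter, per the `card=16149` example above, is a one-line regex. The helper name and the sample path are illustrative:

```ruby
# Hypothetical sketch: extract the card ID from a product link's
# "card=" query parameter (e.g. card=16149). Returns nil if absent.
def extract_card_id(href)
  match = href.match(/[?&]card=(\d+)/)
  match && match[1].to_i
end

extract_card_id('/?view=cards/card&card=16149') # => 16149
extract_card_id('/no-card-param')               # => nil
```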
+ ### Removed
+ - Store ID extraction from the page
+ - Store name extraction from the page
+ - Complex URL-rebuilding logic
+ - Empty-parameter filtering
+
+ ### Breaking Changes
+ ⚠️ **API Change**: Store scraper parameter changed from `store_url:` to `store_domain:`
+ - Old: `StoreScraper.new(store_url: "https://www.kamm-store.com.br/...")`
+ - New: `StoreScraper.new(store_domain: "kamm-store")`
+
+ ## [0.1.6] - 2025-10-27
+
+ ### Changed
+ - Applied Ruby 3.1+ hash value shorthand syntax across all `.rb` files (e.g., `{name:}` instead of `{name: name}`)
+ - Cleaner, more modern Ruby code style
+
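For reference, the Ruby 3.1+ hash value shorthand adopted in 0.1.6: when a key and the local variable supplying its value share a name, the value may be omitted.

```ruby
# Ruby >= 3.1 hash value shorthand (the example values are arbitrary).
name  = 'Black Lotus'
price = 999.0

long_form = { name: name, price: price }
shorthand = { name:, price: } # value omitted; taken from the local variable

shorthand == long_form # => true
```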
+ ### Removed
+ - Removed the `display_results` method from both `GlobalScraper` and `StoreScraper`
+ - Simplified output to just essential progress logs
+
+ ## [0.1.5] - 2025-10-27
+
+ ### Changed
+ - **Major Refactoring**: Improved code organization and reusability
+   - Created a `BaseScraper` class with shared functionality (browser config, slug generation, price parsing, JSON saving)
+   - Renamed `Scraper` to `GlobalScraper` and moved it to `lib/ligamagic_scraper/global_scraper.rb`
+   - Refactored `StoreScraper` to extend `BaseScraper` in `lib/ligamagic_scraper/store_scraper.rb`
+   - Both scrapers now inherit common functionality from `BaseScraper`
+   - Maintained backward compatibility with a legacy `Scraper` alias
+ - Changed the `qty` field to `qtd` (Portuguese) in `StoreScraper` output
+
+ ### Fixed
+ - Eliminated code duplication between the global and store scrapers
+ - Improved maintainability and extensibility for future scraper types
+
+ ## [0.1.4] - 2025-10-26
+
+ ### Changed
+ - Refactored the `global` boolean parameter into an enum-like `search_type` constant
+ - Added `SEARCH_TYPE_GLOBAL` and `SEARCH_TYPE_STORE` constants
+ - Improved slug generation with better character transliteration using the `tr` method
+ - Added `search_type` to JSON output for better data tracking
+ - Maintained backward compatibility with boolean values for the search type
+
+ ### Fixed
+ - Fixed syntax errors in the `save_to_json` method
+
+ ## [0.1.3] - 2025-10-26
+
+ ### Added
+ - Global search flag with the `-g` or `--global` option (optional, enabled by default)
+ - Search type indicator in output (Global vs Store-specific)
+
+ ### Removed
+ - Removed the standalone `scraper.rb` file in favor of using `bin/ligamagic-scraper` for both development and production
+
+ ## [0.1.2] - 2025-10-26
+
+ ### Added
+ - Browser mode support with the `-b` or `--browser-mode` flag
+   - `headed` mode (default): Visible browser window
+   - `headless` mode: No UI, ideal for servers and automation
+ - Browser mode parameter in the library interface
+
+ ## [0.1.0] - 2025-10-26
+
+ ### Added
+ - Initial release of the Liga Magic Scraper gem
+ - Command-line interface with the `ligamagic-scraper` command
+ - Library interface for use in other Ruby projects
+ - Automatic pagination through the "Load More" button
+ - Product extraction with ID, slug, name, and pricing
+ - JSON export to the `scrapped/` directory
+ - Verbose mode for detailed output
+ - Smart stopping when unavailable products are detected
+ - Chrome browser automation with Selenium WebDriver
+
data/Gemfile ADDED
@@ -0,0 +1,4 @@
+ source 'https://rubygems.org'
+
+ # Specify your gem's dependencies in ligamagic-scraper.gemspec
+ gemspec
data/LICENSE ADDED
@@ -0,0 +1,22 @@
+ MIT License
+
+ Copyright (c) 2025
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
+