ligamagic-scraper 0.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+ metadata.gz: e5e915123285e7c8e182658c4e4579eaf0ced97e548d0f63bce91916c2a892e4
+ data.tar.gz: 4eb14fcb363ffcf64c6c8633efb9a70c050f03e9255811a539444c3689553f14
+ SHA512:
+ metadata.gz: 940f79ede397bc0189193925c78a535289190d96b4396064ff84ef60d346f8a9930d3e05ba22c54c34b70a0fe2837f520b4d4e745d1739d5ab08978accba8428
+ data.tar.gz: c753dc394b4032a1bda76b8bd301085e6938451b80290d0cdd5ff9faa2b79ba42a23ffb6cc2c18e6367c303d24dad36f2f852993d78ddd8eb1841adde04a2b0f
data/CHANGELOG.md ADDED
@@ -0,0 +1,318 @@
+ # Changelog
+
+ All notable changes to this project will be documented in this file.
+
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+ ## [Unreleased]
+
+ ## [0.6.0] - 2025-10-29
+
+ ### Changed
+ - **Major Performance Optimization for StoreScraper**: Improved extraction speed by 20% (1.7s faster)
+   - Captures HTML incrementally while pages load (browser still open)
+   - Replaced Capybara DOM queries with Nokogiri in-memory HTML parsing
+   - Browser closes immediately after loading, before the extraction phase
+   - Extraction is now near-instant (~0.01s with Nokogiri), whereas it was previously interleaved with page loading
+   - Added automatic deduplication by card ID
+   - Added comprehensive performance benchmarking with timing breakdowns
+
+ ### Added
+ - **Performance Summary Logs for StoreScraper**: Detailed timing information
+   - Phase 1: Loading + Capture (incremental HTML capture during page loads)
+   - Phase 2: Extraction (data parsing with Nokogiri, browser closed)
+   - Total time and product counts displayed
+
+ ## [0.5.0] - 2025-10-29
+
+ ### Changed
+ - **Major Performance Optimization for GlobalScraper**: Improved extraction speed by 34% (7.5s faster)
+   - Changed from incremental HTML capture to batch capture after all pages loaded
+   - Replaced Capybara DOM queries with Nokogiri in-memory HTML parsing
+   - Browser now closes immediately after page loading (37% faster browser release)
+   - Extraction is ~250x faster: 7.41s → 0.03s (using Nokogiri instead of Capybara)
+   - Reduced memory usage: 75% fewer HTML snapshots (160 vs 400) for the same results
+   - Added automatic deduplication by product ID (the website shows overlapping products)
+   - Added comprehensive performance benchmarking with timing breakdowns
+
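The deduplication step mentioned above can be done with `Enumerable#uniq` keyed by ID, which keeps the first occurrence of each product. A minimal sketch, with field names assumed for illustration:

```ruby
# Hypothetical sketch: drop duplicate products that appear in
# overlapping HTML snapshots, keeping the first occurrence per ID.
def deduplicate(products)
  products.uniq { |product| product[:id] }
end

products = [
  { id: 10, name: 'Sol Ring' },
  { id: 11, name: 'Mana Crypt' },
  { id: 10, name: 'Sol Ring' } # duplicate from an overlapping snapshot
]
deduplicate(products)
# => [{ id: 10, name: "Sol Ring" }, { id: 11, name: "Mana Crypt" }]
```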
+ ### Added
+ - **Performance Summary Logs**: Detailed timing information for each scraping phase
+   - Phase 1: Loading (page navigation and clicks)
+   - Phase 2: Capture (HTML snapshot collection)
+   - Phase 3: Extraction (data parsing with Nokogiri)
+   - Total time and product counts displayed
+ - **Benchmark Support**: Added `Benchmark` and `Set` dependencies for performance tracking
+ - **Product Deduplication**: Automatically removes duplicate products by ID
+
+ ## [0.4.0] - 2025-10-29
+
+ ### Added
+ - **Store URL Filtering**: Automatic filtering and ordering for store searches
+   - Orders results by price (most expensive to cheapest): `txt_order=6`
+   - Filters to in-stock items only: `txt_estoque=1`
+   - Applied to both store listing and store search modes
+ - **Store Search with Search Term**: Search within specific stores using `-u STORE -s TERM`
+   - Scrapes all pages until no more results (unlimited pagination)
+   - URL automatically includes the search parameter: `&busca=<term>`
+   - Search term included in the filename and JSON output
+ - **Pagination Support for Store Scraper**: Automatically scrapes multiple pages
+   - Detects the next-page button using the `.ecomresp-paginacao` class
+   - Scrapes all available pages for search results
+   - Configurable max-pages limit for store listings
+   - Sleeps 1 second between page requests
+ - **Max Pages Limit (`-p/--pages N`)**: Required page limit for store listings
+   - Maximum: 5 pages (values above 5 are automatically capped)
+   - Required when using `-u` without `-s`
+   - Prevents accidental full-inventory scraping
+   - Safety feature to control scraping scope
+   - Example: `ligamagic-scraper -u test-store -p 3` (scrapes 3 pages)
+   - Example: `ligamagic-scraper -u test-store -p 10` (capped to 5 pages)
+ - **Three Search Modes**:
+   1. Global search (`-s TERM`): Search across all of Liga Magic, unlimited pages
+   2. Store listing (`-u STORE -p N`): List products from a store, max N pages
+   3. Store search (`-u STORE -s TERM`): Search within a store, all pages
+ - **Automatic Pagination Detection**: Uses the next-page button to continue scraping
+ - **Search Term in Filenames**: Store searches include the search slug in the filename
+   - Format: `YYYYMMDD_HHMMSS__search_term.json`
+
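A sketch of how the documented query parameters could be combined into a store URL. Only `txt_order=6`, `txt_estoque=1`, `busca=<term>`, `tcg=1`, and `page=N` come from the changelog; the helper itself and the exact parameter ordering are illustrative assumptions.

```ruby
require 'erb' # for ERB::Util.url_encode

# Hypothetical URL builder combining the filters documented above:
# price ordering (txt_order=6), in-stock only (txt_estoque=1), and
# either a search term (busca=) or the plain listing mode (tcg=1).
def store_url(domain, search_term: nil, page: 1)
  base = "https://www.#{domain}.com.br/?view=ecom/itens"
  mode = if search_term
           "&busca=#{ERB::Util.url_encode(search_term)}"
         else
           '&tcg=1'
         end
  "#{base}#{mode}&page=#{page}&txt_order=6&txt_estoque=1"
end

store_url('kamm-store', page: 2)
# => "https://www.kamm-store.com.br/?view=ecom/itens&tcg=1&page=2&txt_order=6&txt_estoque=1"
```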
+ ### Changed
+ - **StoreScraper Parameters**: Now accepts `search_term` and `max_pages` parameters
+   - `search_term`: Optional search term to filter products within a store
+   - `max_pages`: Required when no search term is provided (capped at 5)
+ - **CLI Flag Behavior**: `-s` and `-u` can now be used together
+   - Previously mutually exclusive
+   - Enables store search with a search term
+   - The `-s` flag works for both global and store searches
+ - **Store URLs**: Automatically adapt based on search term and page number
+   - With search: `?view=ecom/itens&busca=<term>&page=N`
+   - Without search: `?view=ecom/itens&tcg=1&page=N`
+ - **Validation Logic**: Updated to require and cap `max_pages`
+   - Requires the `-p` flag when using `-u` without `-s`
+   - Automatically caps `max_pages` at 5 (e.g., a user-provided 10 becomes 5)
+   - Clear error messages guide users to correct usage
+ - **Store Scraper Behavior**: Now scrapes multiple pages instead of just the first page
+   - Pagination loop visits each page sequentially
+   - Logs progress for each page
+   - Accumulates all products across pages
+
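The require-and-cap validation described above is a few lines of Ruby. The constant name, helper name, and error message below are illustrative assumptions; only the limit of 5 and the "required for store listings" rule come from the changelog.

```ruby
MAX_STORE_PAGES = 5

# Hypothetical sketch of the documented rule: -p is mandatory for
# store listings, and any value above 5 is silently clamped to 5.
def cap_pages(requested)
  raise ArgumentError, 'max_pages (-p) is required for store listings' if requested.nil?

  [requested, MAX_STORE_PAGES].min
end

cap_pages(3)  # => 3
cap_pages(10) # => 5
```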
+ ### Security
+ - **Max Pages Cap**: Prevents accidental full-inventory scraping
+   - Hard limit of 5 pages for store listings
+   - Values above 5 are automatically capped
+   - Protects against unintentional resource-intensive operations
+   - Cannot be overridden (use `-s` for unlimited pagination)
+
+ ### Known Limitations
+ - **Store Search Price/Qty Extraction**: When using store search with a search term (`-u STORE -s TERM`), price and quantity data cannot be extracted
+   - Liga Magic uses sophisticated CSS-sprite obfuscation as anti-scraping protection
+   - Both CSS class names and sprite background positions rotate per session
+   - Digits are rendered visually only, with no text content in the DOM
+   - Store listings without a search term (`-u STORE -p N`) work normally with full data
+   - See the code documentation in `store_scraper.rb` for technical details
+
+ ### Fixed
+ - **Alert System Comparison Bug**: Fixed incorrect change detection when comparing different search terms
+   - Previously compared against the most recent file chronologically, regardless of search term
+   - Now only compares files with matching search-term slugs
+   - Prevents false "new products" alerts when different searches are interleaved
+   - Example: Running a "booster box" search after a "volcanic" search now correctly compares against the previous "booster box" results
+ - Removed verbose debug logs for individual card details in the store scraper
+ - Fixed pagination hanging on single-page results (reduced wait time to 0)
+
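The slug-matching fix above amounts to filtering candidate files by suffix before picking the newest. A minimal sketch, assuming the `YYYYMMDD_HHMMSS__slug.json` filename format documented in this changelog (the helper name is illustrative):

```ruby
# Hypothetical sketch of the fix: compare only against the most recent
# file whose search-term slug matches, not the most recent file overall.
# Because filenames start with a sortable timestamp, lexicographic max
# is also the chronological max.
def previous_scrape(files, slug)
  files.select { |f| f.end_with?("__#{slug}.json") }.max
end

files = [
  '20251028_100000__volcanic.json',
  '20251028_110000__booster_box.json',
  '20251029_090000__volcanic.json'
]
previous_scrape(files, 'booster_box') # => "20251028_110000__booster_box.json"
```

Without the `select`, a "booster box" run would have been diffed against the newer "volcanic" file, producing the false "new products" alerts described above.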
+ ## [0.3.0] - 2025-10-27
+
+ ### Added
+ - **Structured Logging System**: Introduced a `Loggable` module for all classes
+   - All execution logs are now collected in an accessible `@logs` array
+   - Log entries include timestamp, level, message, and source
+   - Log levels: `:info`, `:debug`, `:warning`, `:error`
+   - Methods: `log_info`, `log_debug`, `log_warning`, `log_error`
+   - `formatted_logs` method returns an array of message strings
+   - `clear_logs` method to reset logs
+   - Programmatic access to execution logs via `scraper.logs` or `scraper.formatted_logs`
+
+ - **RSpec Test Suite**: Comprehensive test coverage with 175+ tests
+   - Unit tests for all scrapers (base, global, store)
+   - Unit tests for all alerts (system, base, file, telegram)
+   - Unit tests for CLI parsing and validation
+   - Integration tests using static HTML examples
+   - Loggable module tests
+   - Version validation tests
+   - Test structure mirrors the `lib/` structure
+
+ - **CLI Class**: Extracted command-line parsing into a dedicated `CLI` class
+   - Cleaner separation of concerns
+   - Easier to test
+   - Reduced `bin/ligamagic-scraper` from 120 to 28 lines
+
+ - **Organized Code Structure**: Reorganized `lib/ligamagic_scraper/` into subdirectories
+   - `scrapers/` - All scraper classes (base, global, store)
+   - `alerts/` - All alert-related classes (system, handlers)
+   - Core files at the root (`version.rb`, `cli.rb`)
+
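A `Loggable` mixin matching the API described above (an `@logs` array of timestamped entries, per-level helper methods, `formatted_logs`, `clear_logs`) could look like this. This is a sketch reconstructed from the changelog's description, not the gem's actual source:

```ruby
# Hypothetical sketch of the Loggable module described above.
module Loggable
  def logs
    @logs ||= []
  end

  # Generate log_info, log_debug, log_warning, log_error.
  %i[info debug warning error].each do |level|
    define_method("log_#{level}") do |message|
      logs << { timestamp: Time.now, level: level,
                message: message, source: self.class.name }
    end
  end

  # Plain message strings, for display or assertions.
  def formatted_logs
    logs.map { |entry| entry[:message] }
  end

  def clear_logs
    @logs = []
  end
end

class Demo
  include Loggable
end

demo = Demo.new
demo.log_info('scrape started')
demo.formatted_logs # => ["scrape started"]
```

Collecting entries instead of calling `puts` is what lets library callers stay silent by default and inspect logs programmatically, as the Changed section below notes.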
+ ### Changed
+ - **Logging Behavior**: All `puts` statements replaced with structured logging
+   - CLI displays logs after scraping completes
+   - Library usage is silent by default; logs are accessible programmatically
+   - More control over log output
+
+ - **Simplified API**: Removed noisy immediate output when used as a library
+
+ ### Removed
+ - **`verbose` parameter**: Removed from all scraper classes
+   - No longer needed with structured logging
+   - Use `scraper.formatted_logs` to access all messages
+   - CLI always shows logs (equivalent to the old verbose mode)
+
+ - **`-v` / `--verbose` CLI flag**: Removed from the command-line interface
+   - Logs are always collected and displayed
+   - More detailed logs available via log levels
+
+ ### Breaking Changes
+ - **`verbose:` parameter removed** from:
+   - `BaseScraper.new`
+   - `GlobalScraper.new`
+   - `StoreScraper.new`
+   - Solution: Remove the parameter from your code
+
+ - **CLI `-v` / `--verbose` flag removed**
+   - Logs are always displayed in the CLI
+   - For library usage, access logs via `scraper.formatted_logs`
+
+ ## [0.2.0] - 2025-10-27
+
+ ### Added
+ - **Alert System**: Complete infrastructure for detecting changes between scrapes
+   - Automatic comparison with previous scrape data
+   - Detects new products, removed products, price changes, quantity changes, and availability changes
+   - Change detection with percentage calculations for price changes
+   - Support for multiple alert types: file (implemented), telegram (stub)
+   - CLI flags: `-a`/`--alerts` to enable, `--alert-types` to configure
+   - Alert configuration via the library interface
+   - Automatic discovery of the most recent previous scrape file
+   - Formatted change summaries with emojis and statistics
+
+ - **Alert Handler Classes**:
+   - `BaseAlert`: Base class for all alert handlers, with formatting utilities
+   - `FileAlert`: File-based alert handler - fully implemented, saves to organized directories
+   - `TelegramAlert`: Telegram bot notification handler (stub - not implemented)
+
+ - **Organized Directory Structure**: Files now organized by type and store
+   - Global scrapes: `scrapped/global/YYYYMMDD_HHMMSS__slug.json`
+   - Store scrapes: `scrapped/stores/{store_domain}/YYYYMMDD_HHMMSS.json`
+   - Global alerts: `alerts_json/global/YYYYMMDD_HHMMSS.json`
+   - Store alerts: `alerts_json/stores/{store_domain}/YYYYMMDD_HHMMSS.json`
+
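The price-change detection with percentages mentioned above can be sketched by indexing the previous scrape by ID and comparing prices. Field names and the helper are illustrative assumptions, not the gem's actual alert code:

```ruby
# Hypothetical sketch: diff two scrapes and report price changes
# with a percentage, as the alert system described above does.
def price_changes(old_products, new_products)
  old_by_id = old_products.to_h { |p| [p[:id], p] }
  new_products.filter_map do |product|
    previous = old_by_id[product[:id]]
    next unless previous && previous[:price] != product[:price]

    change_pct = ((product[:price] - previous[:price]) /
                  previous[:price].to_f * 100).round(2)
    { id: product[:id], from: previous[:price],
      to: product[:price], change_pct: change_pct }
  end
end

old_scrape = [{ id: 1, price: 100.0 }]
new_scrape = [{ id: 1, price: 125.0 }]
price_changes(old_scrape, new_scrape)
# => [{ id: 1, from: 100.0, to: 125.0, change_pct: 25.0 }]
```

New and removed products would fall out of the same ID index: IDs only in `new_scrape` are additions, IDs only in `old_scrape` are removals.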
+ ### Changed
+ - **Filename Format**: Now includes datetime instead of just the date
+   - Allows multiple scrapes per day for better tracking
+   - Uses `__` (double underscore) to separate the datetime from the slug (global only)
+ - **BaseScraper**: Added an `alert_config` parameter to the constructor
+ - **GlobalScraper**: Added `alert_config` parameter support
+ - **StoreScraper**: Added `alert_config` parameter support
+ - **`BaseScraper#save_to_json`**: Now processes alerts before saving if enabled
+ - **`BaseScraper#find_previous_scrape`**: Simplified to find files in the same directory
+
+ ### Removed
+ - **EmailAlert**: Removed in favor of a simpler alert system
+ - **WebhookAlert**: Removed in favor of a simpler alert system
+ - Reduced alert types to file (implemented) and telegram (stub)
+
+ ### Breaking Changes
+ None - this is purely additive functionality. If alerts are not enabled, behavior is unchanged.
+
+ ## [0.1.7] - 2025-10-27
+
+ ### Changed
+ - **Store Scraper Simplified**: Now accepts only the store domain name instead of a full URL
+   - Changed parameter from `store_url:` to `store_domain:`
+   - Example: `ligamagic-scraper -u kamm-store` (instead of the full URL)
+   - Automatically builds the URL: `https://www.<domain>.com.br/?view=ecom/itens&tcg=1`
+ - **Removed Complexity**: Eliminated unnecessary store ID and name extraction logic
+ - **CLI Updated**: Flag `-u` now accepts a domain name instead of a full URL
+
+ ### Added
+ - **Card ID Extraction**: Store scraper now extracts a unique card ID from product links
+   - Extracts from the link parameter: `card=16149`
+   - Added a `card_id` field to the product output
+   - Useful for tracking specific cards across scrapes
+
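Pulling the card ID out of a product link's `card=` query parameter, per the `card=16149` example above, is a one-line regex. The helper name and the sample path are illustrative:

```ruby
# Hypothetical sketch: extract the card ID from a product link's
# "card=" query parameter (e.g. card=16149). Returns nil if absent.
def extract_card_id(href)
  match = href.match(/[?&]card=(\d+)/)
  match && match[1].to_i
end

extract_card_id('/?view=cards/card&card=16149') # => 16149
extract_card_id('/no-card-param')               # => nil
```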
+ ### Removed
+ - Store ID extraction from the page
+ - Store name extraction from the page
+ - Complex URL-rebuilding logic
+ - Empty-parameter filtering
+
+ ### Breaking Changes
+ ⚠️ **API Change**: Store scraper parameter changed from `store_url:` to `store_domain:`
+ - Old: `StoreScraper.new(store_url: "https://www.kamm-store.com.br/...")`
+ - New: `StoreScraper.new(store_domain: "kamm-store")`
+
+ ## [0.1.6] - 2025-10-27
+
+ ### Changed
+ - Applied Ruby 3.1+ hash value shorthand syntax across all `.rb` files (e.g., `{name:}` instead of `{name: name}`)
+ - Cleaner, more modern Ruby code style
+
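For reference, the Ruby 3.1+ hash value shorthand adopted in 0.1.6: when a key and the local variable supplying its value share a name, the value may be omitted.

```ruby
# Ruby >= 3.1 hash value shorthand (the example values are arbitrary).
name  = 'Black Lotus'
price = 999.0

long_form = { name: name, price: price }
shorthand = { name:, price: } # value omitted; taken from the local variable

shorthand == long_form # => true
```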
+ ### Removed
+ - Removed the `display_results` method from both `GlobalScraper` and `StoreScraper`
+ - Simplified output to just essential progress logs
+
+ ## [0.1.5] - 2025-10-27
+
+ ### Changed
+ - **Major Refactoring**: Improved code organization and reusability
+   - Created a `BaseScraper` class with shared functionality (browser config, slug generation, price parsing, JSON saving)
+   - Renamed `Scraper` to `GlobalScraper` and moved it to `lib/ligamagic_scraper/global_scraper.rb`
+   - Refactored `StoreScraper` to extend `BaseScraper` in `lib/ligamagic_scraper/store_scraper.rb`
+   - Both scrapers now inherit common functionality from `BaseScraper`
+   - Maintained backward compatibility with a legacy `Scraper` alias
+ - Changed the `qty` field to `qtd` (Portuguese) in `StoreScraper` output
+
+ ### Fixed
+ - Eliminated code duplication between the global and store scrapers
+ - Improved maintainability and extensibility for future scraper types
+
+ ## [0.1.4] - 2025-10-26
+
+ ### Changed
+ - Refactored the `global` boolean parameter into an enum-like `search_type` constant
+ - Added `SEARCH_TYPE_GLOBAL` and `SEARCH_TYPE_STORE` constants
+ - Improved slug generation with better character transliteration using the `tr` method
+ - Added `search_type` to JSON output for better data tracking
+ - Maintained backward compatibility with boolean values for the search type
+
+ ### Fixed
+ - Fixed syntax errors in the `save_to_json` method
+
+ ## [0.1.3] - 2025-10-26
+
+ ### Added
+ - Global search flag with the `-g` or `--global` option (optional, enabled by default)
+ - Search type indicator in output (Global vs Store-specific)
+
+ ### Removed
+ - Removed the standalone `scraper.rb` file in favor of using `bin/ligamagic-scraper` for both development and production
+
+ ## [0.1.2] - 2025-10-26
+
+ ### Added
+ - Browser mode support with the `-b` or `--browser-mode` flag
+   - `headed` mode (default): Visible browser window
+   - `headless` mode: No UI, ideal for servers and automation
+ - Browser mode parameter in the library interface
+
+ ## [0.1.0] - 2025-10-26
+
+ ### Added
+ - Initial release of the Liga Magic Scraper gem
+ - Command-line interface with the `ligamagic-scraper` command
+ - Library interface for use in other Ruby projects
+ - Automatic pagination through the "Load More" button
+ - Product extraction with ID, slug, name, and pricing
+ - JSON export to the `scrapped/` directory
+ - Verbose mode for detailed output
+ - Smart stopping when unavailable products are detected
+ - Chrome browser automation with Selenium WebDriver
+
data/Gemfile ADDED
@@ -0,0 +1,4 @@
+ source 'https://rubygems.org'
+
+ # Specify your gem's dependencies in ligamagic-scraper.gemspec
+ gemspec
data/LICENSE ADDED
@@ -0,0 +1,22 @@
+ MIT License
+
+ Copyright (c) 2025
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
+