RubyGems - datanorm - Versions diffs - 0.0.1 - Mend

datanorm 0.0.1


Files changed (37)

checksums.yaml +7 -0
data/README.md +154 -0
data/lib/datanorm/document.rb +55 -0
data/lib/datanorm/documents/assemble.rb +43 -0
data/lib/datanorm/documents/assembles/price.rb +85 -0
data/lib/datanorm/documents/assembles/product.rb +176 -0
data/lib/datanorm/documents/preprocess.rb +36 -0
data/lib/datanorm/documents/preprocesses/cache.rb +76 -0
data/lib/datanorm/documents/preprocesses/process.rb +80 -0
data/lib/datanorm/file.rb +65 -0
data/lib/datanorm/header.rb +46 -0
data/lib/datanorm/headers/v4/date.rb +39 -0
data/lib/datanorm/headers/v4/version.rb +36 -0
data/lib/datanorm/headers/v5/date.rb +25 -0
data/lib/datanorm/headers/v5/version.rb +36 -0
data/lib/datanorm/helpers/filename.rb +20 -0
data/lib/datanorm/helpers/utf8.rb +20 -0
data/lib/datanorm/lines/base.rb +67 -0
data/lib/datanorm/lines/parse.rb +33 -0
data/lib/datanorm/lines/v4/dimension.rb +44 -0
data/lib/datanorm/lines/v4/extra.rb +55 -0
data/lib/datanorm/lines/v4/parse.rb +42 -0
data/lib/datanorm/lines/v4/price.rb +120 -0
data/lib/datanorm/lines/v4/priceset.rb +42 -0
data/lib/datanorm/lines/v4/product.rb +90 -0
data/lib/datanorm/lines/v4/text.rb +31 -0
data/lib/datanorm/lines/v5/dimension.rb +22 -0
data/lib/datanorm/lines/v5/parse.rb +29 -0
data/lib/datanorm/lines/v5/price.rb +27 -0
data/lib/datanorm/lines/v5/product.rb +42 -0
data/lib/datanorm/lines/v5/text.rb +30 -0
data/lib/datanorm/logger.rb +15 -0
data/lib/datanorm/logging.rb +27 -0
data/lib/datanorm/progress.rb +26 -0
data/lib/datanorm/version.rb +5 -0
data/lib/datanorm.rb +49 -0
metadata +158 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: f44ab02c4041e106b7e120a1eae889c7d3e1384c4dbaedd6adea4bc86a0d1895
+  data.tar.gz: 338c389dded4de592d34d9730080bbf800052c32359a2294fea1aaa45d25b99b
+SHA512:
+  metadata.gz: 342576af02ba6ae033e97090908944045896e2bd2df1e496ec83e08835945e737051a62d139300dd10f1d9711ac1e88901e94715e0d060024390975ed2e10e39
+  data.tar.gz: 535491f2403475fb0227e6e4e81db3b4c43bb512b70640d04852a110359c74816d1ec3b02ab959a7d83a9fed62269a54a21401efb7db1a24025e629d3c7f271b

data/README.md ADDED

@@ -0,0 +1,154 @@
+# About Datanorm
+Datanorm is a German legacy file format for serialization of B2B product data (stock lists) optimized for floppy disks and monospace needle printers. Version 4 was published in 1994 and version 5 in 1999; the format has not changed since. Earlier versions date back to 1986 and are practically extinct.
+30 years later, it is still the de-facto standard used by German product suppliers (especially electricity and plumbing) to communicate to their business clients which products can be bought at which price.
+Disadvantages:
+* One line of text (e.g. a product name or description) is limited to 40 characters. This is because of monospace fonts where a quotation or an invoice would have column limitations. There is no reliable way to convert this into flowing text (mostly because of bullet lists).
+* One product can only have one price.
+  So the suppliers will typically export multiple Datanorm files to their clients: one with list prices, one with discounted buying prices and one with recommended selling prices. Or, they'll provide a second file that only contains various prices.
+* The file encoding is not UTF-8 but `CP850`, common in DOS in Western Europe.
+* There are cross-references between lines (and even parts of lines) within a single Datanorm file, which make parsing very complicated and inefficient (Datanorm files are commonly between 1 MB and 3 GB in size).
+* Version 4, still most commonly used, has special quirks, such as only allowing a product quantity unit of 1, 10, 100 or 1000. So you cannot have a product sold in packs of, say, 25.
+* One line in the file for prices "P" may actually be a set of prices for up to three same or different products. Thus, data that belongs together is separated by newlines at seemingly random places.
+* Over the years, people started working around the standard to overcome its limitations. In other words, every company that exports Datanorm uses data fields in different ways to communicate different things, and many structure the content of the file differently. For example, one file for creating products, one for amending existing products and one for doing both. Another example is using product descriptions and product category descriptions interchangeably, and using free text fields to override normalized values that are wrong.
+* To my knowledge, there is no documentation publicly available on the Internet. Maybe old books in some library have it.
+If you were ever wondering why Germany has so much bureaucracy, it's because they like to cling on to things.
+### File types
+There are various files with various extensions, but the main ones are those that are called `DATANORM.001`, `DATANORM.002`, etc. They have this convention because those files used to be on one floppy disk each. Nowadays you only have `DATANORM.001`, which can be several GB in size.
+Other file types are `.RAB` (for discount groups) and `.WRG` (for product categories), but we don't support them yet (not difficult to implement, though).
+### Main Datanorm file format
+In Datanorm, *one line* in the file represents one record. The most common ones are:
+* The very first line in the file is the header that identifies the Datanorm version (4 or 5) and a date (indicating when the prices are in effect). In V4 the fields are fixed-length and in V5 they are separated by semicolons.
+* Sometimes there is a `V` record as the second line of the file. It holds additional information, such as the number of a printed catalogue.
+* One `A` record represents one product. It is identified by a product ID that is usually unique per supplier (and thus unique within one Datanorm file of a supplier). So there is only one `A` record with a given ID in one file. This is the most important kind of record and holds product name, short description and price. It can be located anywhere in the file and can reference multiple other records (that is, lines anywhere else in the file), such as records with long text descriptions.
+* The `B` record is mostly relevant in V4 and holds additional product data, such as an EAN code. In V5 it is used as a DELETE statement for a product (so that the supplier can tell the business client that this product is now, or will soon be, deprecated), but we don't really support that feature. The `B` record has the same ID as the `A` record, so you know that they belong together. There is only one `B` record per product and within a Datanorm file, the `B` record is usually located one line below the `A` record.
+* The `D` record represents one line of the description of one product. So you have multiple `D` records where each of them has the same ID as the product. Additionally, each record holds one line number, so you know which of those records represents which line of the product description. To complicate things further, each `D` record has two text fields (of course each limited to 40 characters), so most people put two lines of text into one `D` record (say, records for line 1, 3, and 5, which in reality represent the product description lines 1-6). The `D` records are usually located below or above the `A` record.
+* `T` records were originally meant to be long texts that similar products could reference. So, in addition to the description `D` for one single product, each product could additionally reference one `T` text that it might share with other products. But in practice they are not shared and `T` is often used instead of `D`. In other words, sometimes every single product has one single set of `T` records just for that one product. What makes `T` records complicated, in terms of file processing, is that each set of `T` records has a unique ID that is not related to any product. Rather, a product will reference that (made-up) text record set ID to indicate that those `T` records hold the long text description for a product. But the `T` records could be anywhere else in the file, so it's hard to parse.
+* Usually, one `A` record holds one price for that product. But there also exist so-called `P` records (one for *multiple* products) that hold up to three product IDs and corresponding prices for those products. All these `P` price records are delivered in a separate file called `DATPREIS.001`, because you need fewer floppy disks if you only regularly update the prices rather than the entirety of all product details (given you already have imported them before). Sometimes you're dependent on those `P` records from another file, because the `A` record may specify a recommended selling price whereas the corresponding `P` may have discount details that tell you the purchase price.
+* Only V5 has a `C` record, which specifies additional data for one product, such as public tender descriptions and how much work time it normally takes to physically install a product.
+# About this Rubygem
+What you find here is the minimal Ruby code required to parse and loop over the content of a Datanorm file. It works predominantly with version 4; version 5 support is not fully implemented.
+It does not cover all features that Datanorm has, but it supports everything to get you started with the most common data attributes.
+If you encounter a parsing or price calculation bug, it's likely that I didn't encounter your Datanorm file yet. I appreciate your pull request in that case.
+### Parsing technique
+As explained above, there are cross-references within the Datanorm file, where one line may reference another, which is located at an arbitrary line position.
+If you're lucky, the file only contains `A` and `D` records that are located close to one another, which makes linear file parsing somewhat possible.
+If you're out of luck, there are a bunch of `T` records at the beginning of the file, then some `A` and `B` records, or the other way around.
+If you have `P` records, it's really complex, because one `P` record can represent one price of one product, or two different prices for the same product, or three different prices for three entirely different products.
+You, as a Ruby developer, would most likely do something like this:
+```ruby
+Datanorm::Document.new(path:).each do |item|
+  # Here you would have *one* Object that represents
+  # *one* product and *all* its attributes.
+end
+```
+That's what we're doing. For that to work, however, we need to
+* parse the entire file (may take many minutes for large files)
+* remember, categorize and cache the data (on disk, because you don't want 3 GB in RAM)
+* then enumerate over every product
+* while doing so, gather the referenced attributes that belong to each product (sometimes referencing one part of one line, as in "P" record sets)
+* wrap it all in a Ruby object and yield it as UTF-8 to you
+I know there are smart ways to make this faster. But "being smart" was probably the reason that the file format grew in complexity in the first place.
+I went for a parsing mechanism that works every time, with every file, at the expense of running a little bit longer than needed.
+## Usage
+If you have a `DATANORM.001` and also a `DATPREIS.001`, you must concatenate those two files into one file first (their versions need to be the same). The resulting, merged file is what you provide to this Rubygem.
+If you want one product at a time, without having to deal with the complexities of Datanorm, you can use this:
+```ruby
+document = Datanorm::Document.new(path: 'datanorm.001')
+puts document.header
+puts document.version
+document.each do |product, progress|
+  # Once pre-processing is complete, you'll start to get products here
+  puts product # <- can be nil in the beginning
+  # You can always look at the progress to see what's going on.
+  puts progress if progress.significant? # Throttling, so your STDOUT doesn't get spammed.
+end
+```
+In case you only want the raw Datanorm file one line at a time as Ruby Objects, you can use this:
+```ruby
+file = Datanorm::File.new(path: 'datanorm.001')
+puts file.header
+puts file.version
+puts file.lines_count
+file.each do |record, line_number|
+  puts record
+end
+```
+**Debugging**
+You can set the ENV variable `DEBUG_DATANORM=1` for verbose logging output.
+You can also inspect the denormalization cache located at `/tmp/datanorm_ruby`
+(it won't be automatically deleted if you set the `DEBUG_DATANORM` flag).
+## Development
+Throughout the code, the following terms are used:
+* `line` is one line of a Datanorm file in its raw format.
+* `record` is one Ruby Object representing one of those `line`s.
+* `product` is a product (article) and all its (immediate and referenced) attributes.
+Run unit tests with `bin/tests`.
+To get you started, you can run `bin/demo path/to/your/datanorm.001` to show its contents.
+There are a few example Datanorm files in the test folder, but their characters don't always have the standard Datanorm CP850 encoding (through my tooling I often accidentally convert to UTF-8 or ASCII). Fĭx iẗ iƒ ȳou çaƞ.
+## Open Source Maintenance
+This software is released under the MIT license (see LICENSE.md).
+I already anticipate people sending me their various Datanorm files, thinking that I can fix their problems, but I really don't want to 😂.
+Let me be clear: Nobody should use this data format. It's from the digital stone age. If you have to parse it in Ruby and need more features, I'll gladly welcome a pull request with proper test coverage.
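
The README above notes that Datanorm files are encoded in `CP850` rather than UTF-8. As a minimal sketch of what that conversion looks like in plain Ruby (independent of this gem's own helpers, which are not shown here; the sample bytes and variable names are made up for illustration):

```ruby
# Raw bytes as they might appear in a CP850-encoded file.
# In CP850, 0x81 is "ü" and 0xE1 is "ß". The sample text is hypothetical.
raw = "M\x81ller Gro\xE1handel".b

# Tag the bytes with their actual encoding, then transcode to UTF-8.
utf8 = raw.force_encoding(Encoding::CP850).encode(Encoding::UTF_8)

puts utf8
puts utf8.encoding
```

When reading a whole file, Ruby can also transcode on the fly if the encoding pair is passed to the reader, for example `File.foreach(path, encoding: 'CP850:UTF-8')`. Whether that fits depends on how the rest of the pipeline expects its input.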

data/lib/datanorm/document.rb ADDED

@@ -0,0 +1,55 @@
+# frozen_string_literal: true
+module Datanorm
+  # Loads and parses a datanorm file product by product.
+  class Document
+    include Datanorm::Logging
+    include Enumerable
+    attr_reader :path
+    def initialize(path:, timestamp: nil)
+      @path = path
+      if timestamp
+        # Re-use an existing workdir in case the preprocessing was already done earlier.
+        @timestamp = timestamp
+        @preprocessed = true
+      else
+        @timestamp = (Time.now.to_f * 1_000_000_000).to_i.to_s # Timestamp with nanoseconds
+      end
+    end
+    def header
+      file.header
+    end
+    def version
+      file.version
+    end
+    def each(&)
+      unless @preprocessed
+        ::Datanorm::Documents::Preprocess.call(file:, workdir:, &)
+        @preprocessed = true
+      end
+      ::Datanorm::Documents::Assemble.call(workdir:, &)
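+    ensure
+      # At this point all yields have gone through and we can clean up.
+      unless ENV['DEBUG_DATANORM']
+        workdir.rmtree if workdir.exist?
+        # The cache is gone, so a later call to #each has to preprocess again.
+        @preprocessed = false
+      end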
+    end
+    def workdir
+      @workdir ||= Pathname.new('/tmp/datanorm_ruby').join(@timestamp)
+    end
+    private
+    def file
+      return @file if defined?(@file)
+      @file = ::Datanorm::File.new(path:)
+    end
+  end
+end

data/lib/datanorm/documents/assemble.rb ADDED

@@ -0,0 +1,43 @@
+# frozen_string_literal: true
+module Datanorm
+  module Documents
+    # Yields every product found in the text files that the preprocessing generated.
+    class Assemble
+      include Calls
+      include ::Datanorm::Logging
+      option :workdir
+      def call
+        return unless products_file.file?
+        ::File.foreach(products_file) do |json|
+          progress.increment!
+          yield ::Datanorm::Documents::Assembles::Product.new(json:, workdir:), progress
+        end
+      end
+      private
+      def products_file
+        workdir.join('A.txt')
+      end
+      def products_count
+        return @products_count if defined?(@products_count)
+        @products_count = 0
+        ::File.foreach(products_file) { @products_count += 1 }
+        @products_count
+      end
+      def progress
+        @progress ||= ::Datanorm::Progress.new.tap do |progress|
+          progress.current = 0
+          progress.total = products_count
+        end
+      end
+    end
+  end
+end

data/lib/datanorm/documents/assembles/price.rb ADDED

@@ -0,0 +1,85 @@
+# frozen_string_literal: true
+module Datanorm
+  module Documents
+    module Assembles
+      # Object wrapper for a single Price (that belongs to a Priceset).
+      class Price
+        attr_reader :as_json
+        def initialize(json:)
+          @as_json = JSON.parse(json, symbolize_names: true)
+        end
+        # -----------------
+        # Native Attributes
+        # -----------------
+        def wholesale?
+          as_json[:is_wholesale]
+        end
+        def retail?
+          as_json[:is_retail]
+        end
+        def no_discount?
+          as_json[:is_no_discount]
+        end
+        def percentage_discount?
+          as_json[:is_percentage_discount]
+        end
+        def discount_percentage_integer
+          as_json[:discount_percentage]
+        end
+        def cents
+          as_json[:cents].to_i
+        end
+        # ---------------------
+        # Calculated Attributes
+        # ---------------------
+        def price
+          BigDecimal(cents) / 100
+        end
+        def discount_percentage
+          return unless percentage_discount?
+          # 3700 == 37% == 0.37
+          BigDecimal(discount_percentage_integer) / 100 / 100
+        end
+        # What is the final price after the discount?
+        def price_after_discount
+          return price if no_discount?
+          return unless discount_percentage
+          price * (1 - discount_percentage)
+        end
+        # Helpers
+        def <=>(other)
+          precedence <=> other.precedence
+        end
+        # So we can distinguish between multiple conflicting prices.
+        def precedence
+          return 2 if no_discount?
+          return 1 if percentage_discount?
+          0
+        end
+        def to_s
+          "<Price #{as_json}>"
+        end
+      end
+    end
+  end
+end
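
The calculated attributes of the `Price` wrapper above boil down to a few lines of `BigDecimal` arithmetic. The following is a worked sketch of that arithmetic only, using hypothetical sample values (a list price of 123.45 and a 37 % discount) rather than data from a real Datanorm file:

```ruby
require 'bigdecimal'

# Hypothetical sample values, stored as integers the way the
# wrapper above reads them from the cached JSON.
cents = 12_345
discount_percentage_integer = 3_700 # 3700 == 37% == 0.37

price = BigDecimal(cents) / 100
discount = BigDecimal(discount_percentage_integer) / 100 / 100
price_after_discount = price * (1 - discount)

puts price.to_s('F')
puts discount.to_s('F')
puts price_after_discount.to_s('F')

# Rounding to whole cents is an assumption of this sketch,
# the wrapper above returns the unrounded value.
puts price_after_discount.round(2).to_s('F')
```

Using `BigDecimal` instead of `Float` keeps the intermediate values exact, so any rounding happens once, at a point the caller chooses.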

data/lib/datanorm/documents/assembles/product.rb ADDED

@@ -0,0 +1,176 @@
+# frozen_string_literal: true
+module Datanorm
+  module Documents
+    module Assembles
+      # Object wrapper for a single Product with all its attributes.
+      class Product
+        include ::Datanorm::Logging
+        def initialize(json:, workdir:)
+          @json = JSON.parse(json, symbolize_names: true)
+          @workdir = workdir
+          load_files!
+        end
+        # ------
+        # Basics
+        # ------
+        def id
+          json[:id]
+        end
+        def quantity
+          json[:quantity]
+        end
+        def quantity_unit
+          json[:quantity_unit]
+        end
+        def discount_group
+          json[:discount_group]
+        end
+        # -------
+        # Textual
+        # -------
+        def title
+          json[:title]
+        end
+        def text_id
+          json[:text_id]
+        end
+        def description
+          # In theory, the dimension is for this product only
+          # and the text shared by several products of the same kind.
+          # In practice, those two are not intended for stacking.
+          # Instead, we choose one or the other.
+          return dimension_content if dimension_content && !dimension_content.strip.empty?
+          text_content
+        end
+        # -----------------------
+        # Immediate Price details
+        # -----------------------
+        def retail_price?
+          json[:is_retail_price]
+        end
+        def wholesale_price?
+          json[:is_wholesale_price]
+        end
+        def cents
+          json[:cents]
+        end
+        # Convenience shortcut.
+        def price
+          BigDecimal(cents) / 100
+        end
+        # ------------------------
+        # Referenced Price details
+        # ------------------------
+        def prices
+          return @prices if defined?(@prices)
+          @prices = prices_content&.split("\n")&.map do |json|
+            ::Datanorm::Documents::Assembles::Price.new(json:)
+          end || []
+        end
+        # -----------------
+        # Referenced Extras
+        # -----------------
+        def matchcode
+          extra_json[:matchcode]
+        end
+        def alternative_id
+          extra_json[:alternative_id]
+        end
+        def ean
+          extra_json[:ean]
+        end
+        def category_id
+          extra_json[:category_id]
+        end
+        # -------
+        # Helpers
+        # -------
+        def to_s
+          "<Product #{as_json}>"
+        end
+        def as_json
+          # Adding referenced attributes that were cached to disk during preprocessing.
+          json.merge(description:, prices: prices.map(&:as_json)).merge(extra_json)
+        end
+        def to_json(...)
+          as_json.to_json(...)
+        end
+        private
+        attr_reader :json, :workdir
+        # The temporary cached files may be deleted quickly, so let's fetch what we need.
+        # effectively populates all data we need
+        alias load_files! as_json
+        def dimension_content
+          return @dimension_content if defined?(@dimension_content)
+          @dimension_content = begin
+            path = workdir.join('D', ::Datanorm::Helpers::Filename.call(id))
+            path.read if path.file?
+          end
+        end
+        def text_content
+          return unless text_id
+          return @text_content if defined?(@text_content)
+          @text_content = begin
+            path = workdir.join('T', ::Datanorm::Helpers::Filename.call(text_id))
+            path.read if path.file?
+          end
+        end
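+        def extra_json
+          return @extra_json if defined?(@extra_json)
+          @extra_json = begin
+            path = workdir.join('B', ::Datanorm::Helpers::Filename.call(id))
+            # Fall back to an empty Hash so that products without a cached `B` record
+            # still respond to the extras readers and to `as_json`.
+            path.file? ? JSON.parse(path.read, symbolize_names: true) : {}
+          end
+        end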
+        def prices_content
+          return @prices_content if defined?(@prices_content)
+          @prices_content = begin
+            path = workdir.join('P', ::Datanorm::Helpers::Filename.call(id))
+            path.read if path.file?
+          end
+        end
+      end
+    end
+  end
+end

data/lib/datanorm/documents/preprocess.rb ADDED

@@ -0,0 +1,36 @@
+# frozen_string_literal: true
+module Datanorm
+  module Documents
+    # Takes an entire Datanorm file and writes many small text files from it
+    # so that the content can later be iterated over in an efficient way.
+    class Preprocess
+      include Calls
+      include ::Datanorm::Logging
+      option :file
+      option :workdir
+      def call
+        FileUtils.mkdir_p(workdir)
+        file.each do |record|
+          ::Datanorm::Documents::Preprocesses::Process.call(workdir:, record:)
+          progress.increment!
+          yield nil, progress # No items to yield during preprocess.
+        end
+      end
+      private
+      def progress
+        @progress ||= ::Datanorm::Progress.new.tap do |progress|
+          progress.title = 'Preprocessing'
+          progress.current = 0
+          progress.total = file.lines_count
+        end
+      end
+    end
+  end
+end

data/lib/datanorm/documents/preprocesses/cache.rb ADDED

@@ -0,0 +1,76 @@
+# frozen_string_literal: true
+module Datanorm
+  module Documents
+    module Preprocesses
+      # Writes a Datanorm record to disk for later retrieval.
+      class Cache
+        include Calls
+        include ::Datanorm::Logging
+        option :workdir, as: :parent_workdir
+        option :namespace
+        option :id
+        option :target_line_number, default: -> { 1 }
+        option :content, as: :raw_content # Encoding::CP850
+        def call
+          do_ensure_workdir
+            .on_success { do_read_currently_cached_lines }
+            .on_success { do_amend_content }
+            .on_success { do_write_to_file }
+        end
+        def do_ensure_workdir
+          return Tron.success :workdir_exists if workdir.directory?
+          log { "Creating working dir `#{workdir}`" }
+          FileUtils.mkdir_p(workdir)
+          Tron.success(:workdir_created)
+        end
+        def do_read_currently_cached_lines
+          if filepath.exist?
+            @lines = filepath.readlines(chomp: true)
+            Tron.success(:loaded_current_file_content)
+          else
+            @lines = []
+            Tron.success(:nothing_cached_yet)
+          end
+        end
+        def do_amend_content
+          # Assigning past the end of the Array pads the gap with nil,
+          # which `join` later renders as empty lines.
+          @lines[target_line_number - 1] = content
+          Tron.success(:inserted_content)
+        end
+        def do_write_to_file
+          log { "Writing line(s) at position #{target_line_number} to #{filepath}" }
+          filepath.write @lines.join("\n")
+          Tron.success :wrote_to_cache
+        end
+        private
+        def content
+          raw_content.encode 'UTF-8'
+        end
+        def filepath
+          workdir.join(::Datanorm::Helpers::Filename.call(id))
+        end
+        def workdir
+          namespace_parts = Array(namespace)
+          parent_workdir.join(*namespace_parts)
+        end
+      end
+    end
+  end
+end
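
The `Cache` step above places content at a given line number, even when earlier lines have not arrived yet. As a minimal in-memory sketch of that idea (the fragments are hypothetical, and unlike the class above this does not read from or write to disk):

```ruby
# Hypothetical description fragments arriving out of order,
# each tagged with a 1-based target line number.
fragments = [
  [3, 'third line'],
  [1, 'first line']
]

lines = []
fragments.each do |target_line_number, content|
  # Assigning past the end of an Array pads the gap with nil.
  lines[target_line_number - 1] = content
end

# Array#join renders nil as an empty string, so gaps become blank lines.
text = lines.join("\n")
puts text
```

Line 2 stays blank until a fragment for it shows up, which lets records that belong together be stored in any order and still read back in sequence.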