RubyGems - csv-import-analyzer - Versions diffs - 0.0.5 → 0.0.6 - Mend

csv-import-analyzer 0.0.5 → 0.0.6

Files changed (6)

checksums.yaml +4 -4
data/README.md +100 -9
data/lib/csv-import-analyzer/csv_sanitizer.rb +3 -3
data/lib/csv-import-analyzer/export/metadata_analysis.rb +2 -2
data/lib/csv-import-analyzer/version.rb +1 -1
metadata +1 -1

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 7b92797b9053b622641edc450507aa53c0242a4e
-  data.tar.gz: e35be7b229587ac6cef590949ec09aeff59a0403
+  metadata.gz: 40f1ef2bfdbd829eaa64dfa88f360ee722d60228
+  data.tar.gz: c0beeb4085de093f7d41b79e3688cfe36a2b6811
 SHA512:
-  metadata.gz: 927882e009a8b0fc49094f3580e23bcbdf38b9fbc0406f974abef5730bbba93a3986c33ceed704a8da9c053f85c018cd22f0ea309c3aa401936eaf56846bd85a
-  data.tar.gz: 7ed89d2847e9df80541942042300f3409230f6c33f4eb6a5478741d9e3a2ecac40e40bb893c43b0b9456a4fe79101afc47a025deb31cc14f9020dbdcd0dc0804
+  metadata.gz: 94a4839a40f22301b36776b6155855f5bc49a162f3f6c3d9706c46b1105ecf4598c65969d76bc4fade7f1a2d9250fc32fe4fb3ce121657f18fd661bf71cf9ca5
+  data.tar.gz: e359dd8f516b2a96d3f799477975c2006639229ee40a9a9cccf18cca8893e16217c6f7ef1349df4ddd448dbbb7d2964c57854a5d7094808ab81b14fdebe8a4cf
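
The new digests above can be checked locally against the unpacked gem archives. The following is a minimal sketch, assuming the `metadata.gz` and `data.tar.gz` files have already been extracted from the `.gem` archive; `checksums_match?` is a hypothetical helper, not part of RubyGems or of this gem.

```ruby
require "digest/sha1"
require "digest/sha2"

# Hypothetical helper: returns true when every expected digest matches the
# digest computed from the file at `path`.
# `expected` maps an algorithm name ("SHA1" or "SHA512") to a hex string,
# mirroring the layout of checksums.yaml.
def checksums_match?(path, expected)
  actual = {
    "SHA1"   => Digest::SHA1.file(path).hexdigest,
    "SHA512" => Digest::SHA512.file(path).hexdigest
  }
  expected.all? { |algorithm, hex| actual.fetch(algorithm) == hex }
end

# Example call (the path is an assumption about where the archive was unpacked):
# checksums_match?("data.tar.gz", "SHA1" => "c0beeb4085de093f7d41b79e3688cfe36a2b6811")
```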

data/README.md CHANGED

@@ -1,10 +1,8 @@
 # Csv::Import::Analyzer
-Perform datatype analysis on desired chunk
-Calculate min-max bounds for each column
-Determine which columns are nullable in the csv file
+CsvImportAnalyzer helps analyze csv (comma-separated), tsv (tab-separated), and ssv (semicolon-separated) files. It can process large datasets in chunks of a chosen size (200 rows by default) and reports, for each column, the likely datatype, the minimum and maximum bounds, and whether the column can be nullable.
-Note: This gem expects the first line to be a definitive header, i.e. the column names to use if the csv file is imported into a database.
+<b>Note</b>: This gem expects the first line to be a definitive header, i.e. the column names to use if the csv file is imported into a database.
 ## Installation
@@ -24,25 +22,117 @@ Or install it yourself as:
 ## Usage
-Calling process on a filename would generate a metadata_output.json which has the Delimiter, Datatype Analysis and SQL (create and import) statements for both PostgreSQL and MySQL
+Calling `process` on a filename generates metadata for the file and returns it as a JSON object. The metadata contains the following:
+<ul>
+    <li> High-level stats for the given file (e.g. filename, file size, number of rows, number of columns).</li>
+    <li> Data manipulations applied while pre-processing the file.</li>
+    <li> Data analysis for each column as key-value pairs.</li>
+    <li> By default, the MySQL queries needed to import the file into a database.</li>
+</ul>
+```ruby
+  CsvImportAnalyzer.process(filename)
+```
+## Demo
+Below is a sample test.csv file
+```
+Year ID,Make ID,Model ID,Description ID,Price ID
+1997,Ford,E350,"ac, abs, moon","3000.00"
+1999,Chevy,"Venture ""Extended Edition""",,4900.00
+1999,"Chevy","Venture ""Extended Edition, Very Large""","",5000.00
+1996,Jeep,Grand Che'rokee,"MUST SELL!air, moon roof, loaded",4799.00
+```
+To analyze the file above, pass it to CsvImportAnalyzer.process.
 ```ruby
+metadata = CsvImportAnalyzer.process("test.csv", {:distinct => 2})
+```
+### Result
+The metadata now holds the JSON object with the full analysis. Below is the metadata for the sample csv file:
+```ruby
+puts metadata
+```
+```json
+{
+  "csv_file": {
+    "filename": "sampleTab.csv",
+    "file_size": 276,
+    "record_delimiter": ",",
+    "rows": 6,
+    "columns": 5,
+    "processed_filename": "processed_sampleTab.csv",
+    "processed_file_path": "/tmp/processed_sampleTab.csv",
+    "processed_file_size": 279,
+    "error_report": "/tmp/error_report_sampleTab.csv"
+  },
+  "data_manipulations": {
+    "replace_nulls": true,
+    "replace_quotes": true
+  },
+  "csv_headers": {
+    "year_id": {
+      "datatype": "int",
+      "datatype_analysis": {
+        "int": 4
+      },
+      "distinct_values": "2+"
+    },
+    "make_id": {
+      "datatype": "string",
+      "datatype_analysis": {
+        "string": 4
+      },
+      "distinct_values": "2+"
+    },
+    "model_id": {
+      "datatype": "string",
+      "datatype_analysis": {
+        "string": 4
+      },
+      "distinct_values": "2+"
+    },
+    "description_id": {
+      "datatype": "string",
+      "datatype_analysis": {
+        "string": 2
+      },
+      "distinct_values": [
+        "ac, abs, moon",
+        "MUST SELL!air, moon roof, loaded"
+      ],
+      "nullable": true
+    },
+    "price_id": {
+      "datatype": "float",
+      "datatype_analysis": {
+        "float": 4
+      },
+      "distinct_values": "2+"
+    }
+  },
+  "sql": {
+    "mysql": {
+      "create_query": "create table processed_sampletab.csv ( year_id int not null, make_id varchar(255) not null, model_id varchar(255) not null, description_id varchar(255), price_id float not null);",
+      "import_query": "COPY processed_sampletab.csv FROM '/tmp/processed_sampleTab.csv' HEADER DELIMITER ',' CSV NULL AS 'NULL';"
+    }
+  }
+}
-  CsvImportAnalyzer.process(filename)
 ```
 ## TODO:
   <ul>
-    <li> Handle control of processed input file to user </li>
-    <li> Return the analysis as Json object.</li>
     <li> Better - Structuring the analysis outputted to csv</li>
     <li> Add support to convert and import xlsx files to csv </li>
+    <li> Handle control of processed input file to user </li>
   </ul>
 ## Additional Information
 ### Dependencies
   <ul><li><a href="https://github.com/tilo/smarter_csv">smarter_csv</a> - For processing the csv in chunks</li></ul>
 ## Contributing
@@ -52,3 +142,4 @@ Calling process on a filename would generate a metadata_output.json which has th
 3. Commit your changes (`git commit -am 'Add some feature'`)
 4. Push to the branch (`git push origin my-new-feature`)
 5. Create a new Pull Request
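
The metadata shape shown in the README's Result section can be consumed with Ruby's standard `json` library. This is a rough sketch, assuming the returned value is a JSON string shaped like the README sample; the diff does not show whether `process` returns a string or a Hash. The sample below is a trimmed copy of two columns from the README output.

```ruby
require "json"

# Trimmed copy of the README sample (two of the five columns).
sample = <<~JSON
  {
    "csv_headers": {
      "year_id": {
        "datatype": "int",
        "datatype_analysis": { "int": 4 },
        "distinct_values": "2+"
      },
      "description_id": {
        "datatype": "string",
        "datatype_analysis": { "string": 2 },
        "distinct_values": ["ac, abs, moon", "MUST SELL!air, moon roof, loaded"],
        "nullable": true
      }
    }
  }
JSON

metadata = JSON.parse(sample)
headers  = metadata.fetch("csv_headers")

# Columns flagged as nullable by the analysis.
nullable_columns = headers.select { |_name, info| info["nullable"] }.keys

# Column name => inferred datatype.
datatypes = headers.transform_values { |info| info["datatype"] }
```

Note that `distinct_values` is either a string such as `"2+"` or an array of the observed values, so callers need to handle both types.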

data/lib/csv-import-analyzer/csv_sanitizer.rb CHANGED

@@ -69,12 +69,12 @@ module CsvImportAnalyzer
       {
         :metadata_output => nil,      # To be set if metadata needs to be printed to a file
         :processed_input => nil,      # To be set if processed input is needed
-        :unique => 10,                # Threshold for the number of default values that need to be identified
+        :unique => 2,                 # Threshold for the number of default values that need to be identified
         :check_bounds => true,        # Option to check for min - max bounds for each column [true => find the bounds]
         :datatype_analysis => 200,    # Number of rows to be sampled for datatype analysis
         :chunk => 200,                # Chunk size (number of rows) to be processed in memory [Important: do not load the entire file into memory]
-        :database => [:pg, :mysql],   # Databases for which schema needs to be generated
-        :quote_convert => true,       # Convert any single quotes to double quotes
+        :database => [:mysql],        # Databases for which schema needs to be generated
+        :quote_convert => true,       # Convert single quotes to double quotes
         :replace_nulls => true,       # Replace nulls, empty's, nils, Null's with NULL
         :out_format => :json          # Set what type of output do you need as analysis
       }
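
The practical effect of this hunk is that a caller who wants PostgreSQL output or a higher threshold must now pass those options explicitly. The hunk does not show how the gem combines caller options with these defaults; the sketch below assumes a plain `Hash#merge`, and both `DEFAULTS` and `resolve_options` are hypothetical names used only for illustration.

```ruby
# Defaults as they appear on the 0.0.6 side of the hunk above.
DEFAULTS = {
  :metadata_output   => nil,
  :processed_input   => nil,
  :unique            => 2,
  :check_bounds      => true,
  :datatype_analysis => 200,
  :chunk             => 200,
  :database          => [:mysql],
  :quote_convert     => true,
  :replace_nulls     => true,
  :out_format        => :json
}.freeze

# Hypothetical helper (an assumption, not the gem's code):
# caller-supplied keys win over the defaults, everything else is inherited.
def resolve_options(user_options = {})
  DEFAULTS.merge(user_options)
end

# Restore the 0.0.5 database list and use a larger chunk size.
options = resolve_options(:chunk => 500, :database => [:pg, :mysql])
```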

data/lib/csv-import-analyzer/export/metadata_analysis.rb CHANGED

@@ -178,8 +178,8 @@ module CsvImportAnalyzer
           columns[column_name] = {}
           columns[column_name][:datatype] = header_datatypes[column_name]
           columns[column_name][:datatype_analysis] = header_datatype_analysis[column_name]
-          if unique_values[column_name].size > max_distinct_values
-            columns[column_name][:distinct_values] = "#{max_distinct_values}+"
+          if unique_values[column_name].size > max_distinct_values - 1
+            columns[column_name][:distinct_values] = "#{max_distinct_values - 1}+"
           else
             columns[column_name][:distinct_values] = unique_values[column_name]
           end
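
The off-by-one change in this hunk is easier to see in isolation. The sketch below lifts the branch into two standalone methods, one per version; the method names are hypothetical and `values` is assumed to be an Array, since the type of `unique_values[column_name]` and the way `max_distinct_values` is derived from the caller's option are not visible in the hunk.

```ruby
# 0.0.5 behaviour: summarise only when the count exceeds the maximum.
def summarize_distinct_v005(values, max_distinct_values)
  if values.size > max_distinct_values
    "#{max_distinct_values}+"
  else
    values
  end
end

# 0.0.6 behaviour: the cut-off and the label are both one lower.
def summarize_distinct_v006(values, max_distinct_values)
  limit = max_distinct_values - 1
  if values.size > limit
    "#{limit}+"
  else
    values
  end
end

values = ["Ford", "Chevy", "Jeep"]
old_result = summarize_distinct_v005(values, 3) # the full list is kept
new_result = summarize_distinct_v006(values, 3) # collapsed to "2+"
```

With the same inputs, a column holding exactly `max_distinct_values` distinct values was listed in full under 0.0.5 and is collapsed to a `"N+"` label under 0.0.6.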

data/lib/csv-import-analyzer/version.rb CHANGED

@@ -1,5 +1,5 @@
 module CsvImportAnalyzer
   module Version
-    VERSION = "0.0.5"
+    VERSION = "0.0.6"
   end
 end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: csv-import-analyzer
 version: !ruby/object:Gem::Version
-  version: 0.0.5
+  version: 0.0.6
 platform: ruby
 authors:
 - Avinash Vallabhaneni