simdjson 0.1.0

Files changed (132)
  1. checksums.yaml +7 -0
  2. data/.clang-format +5 -0
  3. data/.gitignore +14 -0
  4. data/.gitmodules +3 -0
  5. data/.rubocop.yml +9 -0
  6. data/.travis.yml +7 -0
  7. data/Gemfile +4 -0
  8. data/LICENSE.txt +21 -0
  9. data/README.md +39 -0
  10. data/Rakefile +32 -0
  11. data/benchmark/apache_builds.json +4421 -0
  12. data/benchmark/demo.json +15 -0
  13. data/benchmark/github_events.json +1390 -0
  14. data/benchmark/run_benchmark.rb +30 -0
  15. data/ext/simdjson/extconf.rb +22 -0
  16. data/ext/simdjson/simdjson.cpp +76 -0
  17. data/ext/simdjson/simdjson.hpp +6 -0
  18. data/lib/simdjson/version.rb +3 -0
  19. data/lib/simdjson.rb +2 -0
  20. data/simdjson.gemspec +35 -0
  21. data/vendor/.gitkeep +0 -0
  22. data/vendor/simdjson/AUTHORS +3 -0
  23. data/vendor/simdjson/CMakeLists.txt +63 -0
  24. data/vendor/simdjson/CONTRIBUTORS +27 -0
  25. data/vendor/simdjson/Dockerfile +10 -0
  26. data/vendor/simdjson/LICENSE +201 -0
  27. data/vendor/simdjson/Makefile +203 -0
  28. data/vendor/simdjson/Notes.md +85 -0
  29. data/vendor/simdjson/README.md +581 -0
  30. data/vendor/simdjson/amalgamation.sh +158 -0
  31. data/vendor/simdjson/benchmark/CMakeLists.txt +8 -0
  32. data/vendor/simdjson/benchmark/benchmark.h +223 -0
  33. data/vendor/simdjson/benchmark/distinctuseridcompetition.cpp +347 -0
  34. data/vendor/simdjson/benchmark/linux/linux-perf-events.h +93 -0
  35. data/vendor/simdjson/benchmark/minifiercompetition.cpp +181 -0
  36. data/vendor/simdjson/benchmark/parse.cpp +393 -0
  37. data/vendor/simdjson/benchmark/parseandstatcompetition.cpp +305 -0
  38. data/vendor/simdjson/benchmark/parsingcompetition.cpp +298 -0
  39. data/vendor/simdjson/benchmark/statisticalmodel.cpp +208 -0
  40. data/vendor/simdjson/dependencies/jsoncppdist/json/json-forwards.h +344 -0
  41. data/vendor/simdjson/dependencies/jsoncppdist/json/json.h +2366 -0
  42. data/vendor/simdjson/dependencies/jsoncppdist/jsoncpp.cpp +5418 -0
  43. data/vendor/simdjson/doc/apache_builds.jsonparseandstat.png +0 -0
  44. data/vendor/simdjson/doc/gbps.png +0 -0
  45. data/vendor/simdjson/doc/github_events.jsonparseandstat.png +0 -0
  46. data/vendor/simdjson/doc/twitter.jsonparseandstat.png +0 -0
  47. data/vendor/simdjson/doc/update-center.jsonparseandstat.png +0 -0
  48. data/vendor/simdjson/images/halvarflake.png +0 -0
  49. data/vendor/simdjson/images/logo.png +0 -0
  50. data/vendor/simdjson/include/simdjson/common_defs.h +102 -0
  51. data/vendor/simdjson/include/simdjson/isadetection.h +152 -0
  52. data/vendor/simdjson/include/simdjson/jsoncharutils.h +301 -0
  53. data/vendor/simdjson/include/simdjson/jsonformatutils.h +202 -0
  54. data/vendor/simdjson/include/simdjson/jsonioutil.h +32 -0
  55. data/vendor/simdjson/include/simdjson/jsonminifier.h +30 -0
  56. data/vendor/simdjson/include/simdjson/jsonparser.h +250 -0
  57. data/vendor/simdjson/include/simdjson/numberparsing.h +587 -0
  58. data/vendor/simdjson/include/simdjson/padded_string.h +70 -0
  59. data/vendor/simdjson/include/simdjson/parsedjson.h +544 -0
  60. data/vendor/simdjson/include/simdjson/portability.h +172 -0
  61. data/vendor/simdjson/include/simdjson/simdjson.h +44 -0
  62. data/vendor/simdjson/include/simdjson/simdjson_version.h +13 -0
  63. data/vendor/simdjson/include/simdjson/simdprune_tables.h +35074 -0
  64. data/vendor/simdjson/include/simdjson/simdutf8check_arm64.h +180 -0
  65. data/vendor/simdjson/include/simdjson/simdutf8check_haswell.h +198 -0
  66. data/vendor/simdjson/include/simdjson/simdutf8check_westmere.h +169 -0
  67. data/vendor/simdjson/include/simdjson/stage1_find_marks.h +121 -0
  68. data/vendor/simdjson/include/simdjson/stage1_find_marks_arm64.h +210 -0
  69. data/vendor/simdjson/include/simdjson/stage1_find_marks_flatten.h +93 -0
  70. data/vendor/simdjson/include/simdjson/stage1_find_marks_flatten_haswell.h +95 -0
  71. data/vendor/simdjson/include/simdjson/stage1_find_marks_haswell.h +210 -0
  72. data/vendor/simdjson/include/simdjson/stage1_find_marks_macros.h +239 -0
  73. data/vendor/simdjson/include/simdjson/stage1_find_marks_westmere.h +194 -0
  74. data/vendor/simdjson/include/simdjson/stage2_build_tape.h +85 -0
  75. data/vendor/simdjson/include/simdjson/stringparsing.h +105 -0
  76. data/vendor/simdjson/include/simdjson/stringparsing_arm64.h +56 -0
  77. data/vendor/simdjson/include/simdjson/stringparsing_haswell.h +43 -0
  78. data/vendor/simdjson/include/simdjson/stringparsing_macros.h +88 -0
  79. data/vendor/simdjson/include/simdjson/stringparsing_westmere.h +41 -0
  80. data/vendor/simdjson/jsonexamples/small/jsoniter_scala/README.md +4 -0
  81. data/vendor/simdjson/scripts/dumpsimplestats.sh +11 -0
  82. data/vendor/simdjson/scripts/issue150.sh +14 -0
  83. data/vendor/simdjson/scripts/javascript/README.md +3 -0
  84. data/vendor/simdjson/scripts/javascript/generatelargejson.js +19 -0
  85. data/vendor/simdjson/scripts/minifier.sh +11 -0
  86. data/vendor/simdjson/scripts/parseandstat.sh +24 -0
  87. data/vendor/simdjson/scripts/parser.sh +11 -0
  88. data/vendor/simdjson/scripts/parsingcompdata.sh +26 -0
  89. data/vendor/simdjson/scripts/plotparse.sh +98 -0
  90. data/vendor/simdjson/scripts/selectparser.sh +11 -0
  91. data/vendor/simdjson/scripts/setupfortesting/disablehyperthreading.sh +15 -0
  92. data/vendor/simdjson/scripts/setupfortesting/powerpolicy.sh +32 -0
  93. data/vendor/simdjson/scripts/setupfortesting/setupfortesting.sh +6 -0
  94. data/vendor/simdjson/scripts/setupfortesting/turboboost.sh +51 -0
  95. data/vendor/simdjson/scripts/testjson2json.sh +99 -0
  96. data/vendor/simdjson/scripts/transitions/Makefile +10 -0
  97. data/vendor/simdjson/scripts/transitions/generatetransitions.cpp +20 -0
  98. data/vendor/simdjson/singleheader/README.md +1 -0
  99. data/vendor/simdjson/singleheader/amalgamation_demo.cpp +20 -0
  100. data/vendor/simdjson/singleheader/simdjson.cpp +1652 -0
  101. data/vendor/simdjson/singleheader/simdjson.h +39692 -0
  102. data/vendor/simdjson/src/CMakeLists.txt +67 -0
  103. data/vendor/simdjson/src/jsonioutil.cpp +35 -0
  104. data/vendor/simdjson/src/jsonminifier.cpp +285 -0
  105. data/vendor/simdjson/src/jsonparser.cpp +91 -0
  106. data/vendor/simdjson/src/parsedjson.cpp +323 -0
  107. data/vendor/simdjson/src/parsedjsoniterator.cpp +272 -0
  108. data/vendor/simdjson/src/simdjson.cpp +30 -0
  109. data/vendor/simdjson/src/stage1_find_marks.cpp +41 -0
  110. data/vendor/simdjson/src/stage2_build_tape.cpp +567 -0
  111. data/vendor/simdjson/style/clang-format-check.sh +25 -0
  112. data/vendor/simdjson/style/clang-format.sh +25 -0
  113. data/vendor/simdjson/style/run-clang-format.py +326 -0
  114. data/vendor/simdjson/tape.md +134 -0
  115. data/vendor/simdjson/tests/CMakeLists.txt +25 -0
  116. data/vendor/simdjson/tests/allparserscheckfile.cpp +192 -0
  117. data/vendor/simdjson/tests/basictests.cpp +75 -0
  118. data/vendor/simdjson/tests/jsoncheck.cpp +136 -0
  119. data/vendor/simdjson/tests/numberparsingcheck.cpp +224 -0
  120. data/vendor/simdjson/tests/pointercheck.cpp +38 -0
  121. data/vendor/simdjson/tests/singleheadertest.cpp +22 -0
  122. data/vendor/simdjson/tests/stringparsingcheck.cpp +408 -0
  123. data/vendor/simdjson/tools/CMakeLists.txt +3 -0
  124. data/vendor/simdjson/tools/cmake/FindCTargets.cmake +15 -0
  125. data/vendor/simdjson/tools/cmake/FindOptions.cmake +52 -0
  126. data/vendor/simdjson/tools/json2json.cpp +112 -0
  127. data/vendor/simdjson/tools/jsonpointer.cpp +93 -0
  128. data/vendor/simdjson/tools/jsonstats.cpp +143 -0
  129. data/vendor/simdjson/tools/minify.cpp +21 -0
  130. data/vendor/simdjson/tools/release.py +125 -0
  131. data/vendor/simdjson/windows/dirent_portable.h +1043 -0
  132. metadata +273 -0
@@ -0,0 +1,581 @@
1
+ # simdjson : Parsing gigabytes of JSON per second
2
+ [![Build Status](https://cloud.drone.io/api/badges/lemire/simdjson/status.svg)](https://cloud.drone.io/lemire/simdjson/)
3
+ [![CircleCI](https://circleci.com/gh/lemire/simdjson.svg?style=svg)](https://circleci.com/gh/lemire/simdjson)
4
+ [![Build Status](https://img.shields.io/appveyor/ci/lemire/simdjson.svg)](https://ci.appveyor.com/project/lemire/simdjson)
5
+ [![][license img]][license]
6
+ [![Code Quality: Cpp](https://img.shields.io/lgtm/grade/cpp/g/lemire/simdjson.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/lemire/simdjson/context:cpp)
7
+
8
+
9
+ ## A C++ library to see how fast we can parse JSON with complete validation.
10
+
11
+ JSON documents are everywhere on the Internet. Servers spend a lot of time parsing these documents. We want to accelerate the parsing of JSON per se using commonly available SIMD instructions as much as possible while doing full validation (including character encoding).
12
+
13
+ <img src="images/logo.png" width="10%">
14
+
15
+
16
+ ## Real-world usage
17
+
18
+ - [Microsoft FishStore](https://github.com/microsoft/FishStore)
19
+ - [Yandex ClickHouse](https://github.com/yandex/ClickHouse)
20
+
21
+ ## Paper
22
+
23
+ A description of the design and implementation of simdjson appears at https://arxiv.org/abs/1902.08318 and an informal blog post providing some background and context is at https://branchfree.org/2019/02/25/paper-parsing-gigabytes-of-json-per-second/.
24
+
25
+ Some people [enjoy reading our paper](https://arxiv.org/abs/1902.08318):
26
+
27
+ [<img src="images/halvarflake.png" width="50%">](https://twitter.com/halvarflake/status/1118459536686362625)
28
+
29
+
30
+ ## Performance results
31
+
32
+ simdjson uses three-quarters fewer instructions than the state-of-the-art parser RapidJSON and fifty percent fewer than sajson. To our knowledge, simdjson is the first fully-validating JSON parser to run at gigabytes per second on commodity processors.
33
+
34
+ <img src="doc/gbps.png" width="90%">
35
+
36
+ On a Skylake processor, the parsing speeds (in GB/s) of various parsers on the twitter.json file are as follows.
37
+
38
+ | parser | GB/s |
39
+ | ------------------------------------- | ---- |
40
+ | simdjson | 2.2 |
41
+ | RapidJSON encoding-validation | 0.51 |
42
+ | RapidJSON encoding-validation, insitu | 0.71 |
43
+ | sajson (insitu, dynamic) | 0.70 |
44
+ | sajson (insitu, static) | 0.97 |
45
+ | dropbox | 0.14 |
46
+ | fastjson | 0.26 |
47
+ | gason | 0.85 |
48
+ | ultrajson | 0.42 |
49
+ | jsmn | 0.28 |
50
+ | cJSON | 0.34 |
51
+ | JSON for Modern C++ (nlohmann/json) | 0.10 |
52
+
53
+ ## Requirements
54
+
55
+ - We support platforms like Linux or macOS, as well as Windows through Visual Studio 2017 or later.
56
+ - A processor with
57
+ - AVX2 (i.e., Intel processors starting with the Haswell microarchitecture released 2013 and AMD processors starting with the Zen microarchitecture released 2017),
58
+ - or SSE 4.2 and CLMUL (i.e., Intel processors going back to Westmere released in 2010 or AMD processors starting with the Jaguar used in the PS4 and XBox One)
59
+ - or a 64-bit ARM processor (ARMv8-A): this covers a wide range of mobile processors, including all Apple processors currently available for sale, going back as far back as the iPhone 5s (2013).
60
+ - A recent C++ compiler with C++17 support (e.g., GNU GCC 7 or better, LLVM clang 6 or better, or Visual Studio 2017).
61
+ - Some benchmark scripts assume bash and other common utilities, but they are optional.
62
+
63
+ ## License
64
+
65
+ This code is made available under the Apache License 2.0.
66
+
67
+ Under Windows, we build some tools using the windows/dirent_portable.h file (which is outside our library code): it is under the liberal (business-friendly) MIT license.
68
+
69
+ ## Code usage and example
70
+
71
+ The main API involves populating a `ParsedJson` object which hosts a fully navigable document-object-model (DOM) view of the JSON document. The DOM can be accessed using [JSON Pointer](https://tools.ietf.org/html/rfc6901) paths, for example. The main function is `json_parse`, which takes a string containing the JSON document as well as a reference to a pre-allocated `ParsedJson` object (which can be reused multiple times). Once you have populated the `ParsedJson` object, you can navigate through the DOM with an iterator (e.g., created by `ParsedJson::Iterator pjh(pj)`, see 'Navigating the parsed document').
72
+
73
+ ```C
74
+ #include "simdjson/jsonparser.h"
75
+ using namespace simdjson;
76
+
77
+ // ...
78
+
79
+ const char * filename = ... //
80
+
81
+ // use whatever means you want to get a string (UTF-8) of your JSON document
82
+ padded_string p = get_corpus(filename);
83
+ ParsedJson pj;
84
+ pj.allocate_capacity(p.size()); // allocate memory for parsing up to p.size() bytes
85
+ const int res = json_parse(p, pj); // do the parsing, return 0 on success
86
+ // parsing is done!
87
+ if (res != 0) {
88
+ // You can use the "simdjson/simdjson.h" header to access the error message
89
+ std::cout << "Error parsing:" << simdjson::error_message(res) << std::endl;
90
+ }
91
+ // the ParsedJson document can be used here
92
+ // pj can be reused with other json_parse calls.
93
+ ```
94
+
95
+ It is also possible to use a simpler API if you do not mind having the overhead
96
+ of memory allocation with each new JSON document:
97
+
98
+ ```C
99
+ #include "simdjson/jsonparser.h"
100
+ using namespace simdjson;
101
+
102
+ // ...
103
+
104
+ const char * filename = ... //
105
+ padded_string p = get_corpus(filename);
106
+ ParsedJson pj = build_parsed_json(p); // do the parsing
107
+ if( ! pj.is_valid() ) {
108
+ // something went wrong
109
+ std::cout << pj.get_error_message() << std::endl;
110
+ }
111
+ ```
112
+
113
+ Though the `padded_string` class is recommended for best performance, you can call `json_parse` and `build_parsed_json`, passing a standard `std::string` object.
114
+
115
+
116
+ ```C
117
+ #include "simdjson/jsonparser.h"
118
+ using namespace simdjson;
119
+
120
+ // ...
121
+ std::string mystring = ... //
122
+ ParsedJson pj;
123
+ pj.allocate_capacity(mystring.size()); // allocate memory for parsing up to mystring.size() bytes
124
+ // std::string may not overallocate so a copy will be needed
125
+ const int res = json_parse(mystring, pj); // do the parsing, return 0 on success
126
+ // parsing is done!
127
+ if (res != 0) {
128
+ // You can use the "simdjson/simdjson.h" header to access the error message
129
+ std::cout << "Error parsing:" << simdjson::error_message(res) << std::endl;
130
+ }
131
+ // pj can be reused with other json_parse calls.
132
+ ```
133
+
134
+ or
135
+
136
+ ```C
137
+ #include "simdjson/jsonparser.h"
138
+ using namespace simdjson;
139
+
140
+ // ...
141
+
142
+ std::string mystring = ... //
143
+ // std::string may not overallocate so a copy will be needed
144
+ ParsedJson pj = build_parsed_json(mystring); // do the parsing
145
+ if( ! pj.is_valid() ) {
146
+ // something went wrong
147
+ std::cout << pj.get_error_message() << std::endl;
148
+ }
149
+ ```
150
+
151
+ As needed, the `json_parse` and `build_parsed_json` functions copy the input data to a temporary buffer readable up to SIMDJSON_PADDING bytes beyond the end of the data.
152
+
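The padding requirement described above can be illustrated with a minimal, library-independent sketch. The constant `kExamplePadding` and the function `copy_with_padding` are hypothetical names chosen for this example; the real value of `SIMDJSON_PADDING` and the library's internal copy routine are not shown here and may differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical padding size for this sketch only; the library defines its
// own SIMDJSON_PADDING constant.
constexpr std::size_t kExamplePadding = 32;

// Copy `input` into a buffer that remains readable (and zero-filled) for
// kExamplePadding bytes past the end of the JSON data, so that wide SIMD
// loads near the end of the document never touch unowned memory.
std::vector<char> copy_with_padding(const std::string &input) {
  std::vector<char> buffer(input.size() + kExamplePadding, '\0');
  std::copy(input.begin(), input.end(), buffer.begin());
  return buffer;
}
```

This is the kind of copy that a `padded_string` lets you avoid, which is presumably why it is recommended for best performance.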
153
+ ## Usage: easy single-header version
154
+
155
+ See the "singleheader" directory for a single-header version. See the included
156
+ file "amalgamation_demo.cpp" for usage. This requires no specific build system: just
157
+ copy the files into your project, within your include path. You can then include them quite simply:
158
+
159
+ ```C
160
+ #include <iostream>
161
+ #include "simdjson.h"
162
+ #include "simdjson.cpp"
163
+ using namespace simdjson;
164
+ int main(int argc, char *argv[]) {
165
+ const char * filename = argv[1];
166
+ padded_string p = get_corpus(filename);
167
+ ParsedJson pj = build_parsed_json(p); // do the parsing
168
+ if( ! pj.is_valid() ) {
169
+ std::cout << "not valid" << std::endl;
170
+ std::cout << pj.get_error_message() << std::endl;
171
+ } else {
172
+ std::cout << "valid" << std::endl;
173
+ }
174
+ return EXIT_SUCCESS;
175
+ }
176
+ ```
177
+
178
+
179
+ Note: In some settings, it might be desirable to precompile `simdjson.cpp` instead of including it.
180
+
181
+ ## Runtime dispatch
182
+
183
+ On Intel and AMD processors, we get best performance by using the hardware support for AVX2 instructions. However, simdjson also
184
+ runs on older Intel and AMD processors. We require a minimum feature support of SSE 4.2 and CLMUL (2010 Intel Westmere or better).
185
+ The code automatically detects the feature set of your processor and switches to the right function at runtime (a technique
186
+ sometimes called runtime dispatch).
187
+
188
+
189
+ We also support 64-bit ARM. We assume NEON support, and if the cryptographic extension is available, we leverage it, at compile-time.
190
+ There is no runtime dispatch on ARM.
191
+
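As a rough illustration of runtime dispatch on x64, the sketch below picks a function pointer once, based on CPU feature detection. It is not the library's implementation: the three `parse_*` functions are hypothetical stand-ins, and the detection relies on the GCC/clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports`, which is an assumption about the toolchain (other compilers take the fallback branch).

```cpp
// Hypothetical stand-ins for architecture-specific parsing kernels.
static const char *parse_avx2() { return "avx2"; }
static const char *parse_sse42() { return "sse4.2+clmul"; }
static const char *parse_unsupported() { return "unsupported"; }

using parse_fn = const char *(*)();

// Inspect the processor once and return the best available kernel.
parse_fn select_implementation() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx2")) {
    return parse_avx2;
  }
  if (__builtin_cpu_supports("sse4.2") && __builtin_cpu_supports("pclmul")) {
    return parse_sse42;
  }
  return parse_unsupported;
#else
  // No runtime detection in this sketch outside GCC/clang on x86.
  return parse_unsupported;
#endif
}

// The selection runs on the first call only; later calls reuse the pointer.
const char *dispatch_parse() {
  static const parse_fn chosen = select_implementation();
  return chosen();
}
```

The sketch uses a function-local static, whose initialization is thread-safe since C++11. That is a design choice of this example and says nothing about how simdjson itself stores the selected code path.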
192
+ ## Thread safety
193
+
194
+ The simdjson library is single-threaded and thread safety is the responsibility of the caller. If you are on an x64 processor, the runtime dispatching assigns the right code path the first time that parsing is attempted. For safety, you should always call `json_parse` at least once in a single-threaded context.
195
+
196
+
197
+ ## Usage (old-school Makefile on platforms like Linux or macOS)
198
+
199
+ Requirements: recent clang or gcc, and make. We recommend at least GNU GCC/G++ 7 or LLVM clang 6. A system like Linux or macOS is expected.
200
+
201
+ To test:
202
+
203
+ ```
204
+ make
205
+ make test
206
+ ```
207
+
208
+ To run benchmarks:
209
+
210
+ ```
211
+ make parse
212
+ ./parse jsonexamples/twitter.json
213
+ ```
214
+
215
+ Under Linux, the `parse` command gives a detailed analysis of the performance counters.
216
+
217
+ To run comparative benchmarks (with other parsers):
218
+
219
+ ```
220
+ make benchmark
221
+ ```
222
+
223
+ ## Usage (CMake on platforms like Linux or macOS)
224
+
225
+ Requirements: We require a recent version of cmake. On macOS, the easiest way to install cmake might be to use [brew](https://brew.sh) and then type
226
+
227
+ ```
228
+ brew install cmake
229
+ ```
230
+
231
+ There is an [equivalent brew on Linux which works the same way as well](https://linuxbrew.sh).
232
+
233
+ You need a recent compiler like clang or gcc. We recommend at least GNU GCC/G++ 7 or LLVM clang 6. For example, you can install a recent compiler with brew:
234
+
235
+ ```
236
+ brew install gcc@8
237
+ ```
238
+
239
+ Optional: You can tell cmake which compiler you wish to use by setting the CC and CXX variables. Under bash, you can do so with commands such as `export CC=gcc-7` and `export CXX=g++-7`.
240
+
241
+ Building: While in the project repository, do the following:
242
+
243
+ ```
244
+ mkdir build
245
+ cd build
246
+ cmake ..
247
+ make
248
+ make test
249
+ ```
250
+
251
+ CMake will build a library. By default, it builds a shared library (e.g., libsimdjson.so on Linux).
252
+
253
+ You can build a static library:
254
+
255
+ ```
256
+ mkdir buildstatic
257
+ cd buildstatic
258
+ cmake -DSIMDJSON_BUILD_STATIC=ON ..
259
+ make
260
+ make test
261
+ ```
262
+
263
+ In some cases, you may want to specify your compiler, especially if the default compiler on your system is too old. You may proceed as follows:
264
+
265
+ ```
266
+ brew install gcc@8
267
+ mkdir build
268
+ cd build
269
+ export CXX=g++-8 CC=gcc-8
270
+ cmake ..
271
+ make
272
+ make test
273
+ ```
274
+
275
+ ## Usage (CMake on Windows using Visual Studio)
276
+
277
+ We assume you have a common Windows PC with at least Visual Studio 2017 and an x64 processor with AVX2 support (2013 Intel Haswell or later) or SSE 4.2 + CLMUL (2010 Westmere or later).
278
+
279
+ - Grab the simdjson code from GitHub, e.g., by cloning it using [GitHub Desktop](https://desktop.github.com/).
280
+ - Install [CMake](https://cmake.org/download/). When you install it, make sure to ask that `cmake` be made available from the command line. Please choose a recent version of cmake.
281
+ - Create a subdirectory within simdjson, such as `VisualStudio`.
282
+ - Using a shell, go to this newly created directory.
283
+ - Type `cmake -DCMAKE_GENERATOR_PLATFORM=x64 ..` in the shell while in the `VisualStudio` repository. (Alternatively, if you want to build a DLL, you may use the command line `cmake -DCMAKE_GENERATOR_PLATFORM=x64 -DSIMDJSON_BUILD_STATIC=OFF ..`.)
284
+ - This last command (`cmake ...`) creates a Visual Studio solution file in the newly created directory (e.g., `simdjson.sln`). Open this file in Visual Studio. You should now be able to build the project and run the tests. For example, in the `Solution Explorer` window (available from the `View` menu), right-click `ALL_BUILD` and select `Build`. To test the code, still in the `Solution Explorer` window, select `RUN_TESTS` and select `Build`.
285
+
286
+
287
+
288
+ ## Usage (Using `vcpkg` on Windows, Linux and MacOS)
289
+
290
+ [vcpkg](https://github.com/Microsoft/vcpkg) users on Windows, Linux and MacOS can download and install `simdjson` with one single command from their favorite shell.
291
+
292
+ On Linux and MacOS:
293
+
294
+ ```
295
+ $ ./vcpkg install simdjson
296
+ ```
297
+
298
+ will build and install `simdjson` as a static library.
299
+
300
+ On Windows (64-bit):
301
+
302
+ ```
303
+ .\vcpkg.exe install simdjson:x64-windows
304
+ ```
305
+
306
+ will build and install `simdjson` as a shared library.
307
+
308
+ ```
309
+ .\vcpkg.exe install simdjson:x64-windows-static
310
+ ```
311
+
312
+ will build and install `simdjson` as a static library.
313
+
314
+ These commands will also print out instructions on how to use the library from MSBuild or CMake-based projects.
315
+
316
+ If you find that the version of `simdjson` shipped with `vcpkg` is out-of-date, feel free to report it to the `vcpkg` community either by submitting an issue or by creating a PR.
317
+
318
+
319
+ ## Tools
320
+
321
+ - `json2json mydoc.json` parses the document, constructs a model and then dumps back the result to standard output.
322
+ - `json2json -d mydoc.json` parses the document, constructs a model and then dumps model (as a tape) to standard output. The tape format is described in the accompanying file `tape.md`.
323
+ - `minify mydoc.json` minifies the JSON document, outputting the result to standard output. Minifying means to remove the unneeded white space characters.
324
+ - `jsonpointer mydoc.json <jsonpath> <jsonpath> ... <jsonpath>` parses the document, constructs a model and then processes a series of [JSON Pointer paths](https://tools.ietf.org/html/rfc6901). The result is itself a JSON document.
325
+
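To make the notion of minifying concrete, here is a minimal scalar sketch that drops whitespace outside of strings. It is not the SIMD implementation used by the `minify` tool; the function name `minify_sketch` is hypothetical, and the input is assumed to be valid JSON.

```cpp
#include <string>

// Remove JSON whitespace (space, tab, line feed, carriage return) that
// appears outside of string literals. Assumes the input is valid JSON.
std::string minify_sketch(const std::string &json) {
  std::string out;
  out.reserve(json.size());
  bool in_string = false;  // currently between two unescaped quotes
  bool escaped = false;    // previous character was a backslash
  for (char c : json) {
    if (in_string) {
      out.push_back(c);  // string content is copied verbatim
      if (escaped) {
        escaped = false;
      } else if (c == '\\') {
        escaped = true;
      } else if (c == '"') {
        in_string = false;
      }
    } else if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
      // unneeded whitespace: skip it
    } else {
      if (c == '"') {
        in_string = true;
      }
      out.push_back(c);
    }
  }
  return out;
}
```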
326
+ ## Scope
327
+
328
+ We provide a fast parser that fully validates an input according to various specifications.
329
+ The parser builds a useful immutable (read-only) DOM (document-object model) which can be later accessed.
330
+
331
+ To simplify the engineering, we make some assumptions.
332
+
333
+ - We support UTF-8 (and thus ASCII), nothing else (no Latin, no UTF-16). We do not believe this is a genuine limitation, because we do not think there is any serious application that needs to process JSON data without an ASCII or UTF-8 encoding. If the UTF-8 contains a leading BOM, it should be omitted: the user is responsible for detecting and skipping the BOM; UTF-8 BOMs are discouraged.
334
+ - All strings in the JSON document may have up to 4294967295 bytes in UTF-8 (4GB). To enforce this constraint, we refuse to parse a document that contains more than 4294967295 bytes (4GB). This should accommodate most JSON documents.
335
+ - As allowed by the specification, we allow repeated keys within an object (other parsers like sajson do the same).
336
+ - Performance is optimized for JSON documents spanning at least tens of kilobytes up to many megabytes: the performance issues with having to parse many tiny JSON documents or one truly enormous JSON document are different.
337
+
338
+ _We do not aim to provide a general-purpose JSON library._ A library like RapidJSON offers much more than just parsing: it helps you generate JSON and offers various other convenient functions. We merely parse the document.
339
+
340
+ ## Features
341
+
342
+ - The input string is unmodified. (Parsers like sajson and RapidJSON use the input string as a buffer.)
343
+ - We parse integers and floating-point numbers as separate types which allows us to support large 64-bit integers in [-9223372036854775808,9223372036854775808), like a Java `long` or a C/C++ `long long`. Among the parsers that differentiate between integers and floating-point numbers, not all support 64-bit integers. (For example, sajson rejects JSON files with integers larger than or equal to 2147483648. RapidJSON will parse a file containing an overly long integer like 18446744073709551616 as a floating-point number.) When we cannot represent exactly an integer as a signed 64-bit value, we reject the JSON document.
344
+ - We support the full range of 64-bit floating-point numbers (binary64). The values range from `std::numeric_limits<double>::lowest()` to `std::numeric_limits<double>::max()`, so from -1.7976e308 all the way to 1.7976e308. Extreme values (less than or equal to -1e308, greater than or equal to 1e308) are rejected: we refuse to parse the input document.
345
+ - We test for accurate float parsing with a bound on the [unit of least precision (ULP)](https://en.wikipedia.org/wiki/Unit_in_the_last_place) of one. Practically speaking, this implies 15 digits of accuracy or better.
346
+ - We do full UTF-8 validation as part of the parsing. (Parsers like fastjson, gason and dropbox json11 do not do UTF-8 validation. The sajson parser does incomplete UTF-8 validation, accepting code point
347
+ sequences like 0xb1 0x87.)
348
+ - We fully validate the numbers. (Parsers like gason and ultrajson will accept `[0e+]` as valid JSON.)
349
+ - We validate string content for unescaped characters. (Parsers like fastjson and ultrajson accept unescaped line breaks and tabs in strings.)
350
+ - We fully validate the white-space characters outside of the strings. Parsers like RapidJSON will accept JSON documents with null characters outside of strings.
351
+
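The integer rule in the list above (an integer that cannot be represented exactly as a signed 64-bit value causes rejection) can be sketched as follows. This is a plain scalar illustration, not the library's number parser; the function name `parse_int64_strict` is hypothetical, and it only handles the integer part of the JSON number grammar (no fraction or exponent).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Parse a JSON-style integer (optional '-', digits, no leading zeros).
// Returns false if the text is malformed or does not fit in int64_t.
bool parse_int64_strict(const std::string &text, int64_t &out) {
  std::size_t i = 0;
  bool negative = false;
  if (i < text.size() && text[i] == '-') {
    negative = true;
    ++i;
  }
  if (i == text.size()) {
    return false;  // no digits at all
  }
  if (text[i] == '0' && i + 1 != text.size()) {
    return false;  // leading zeros are not valid JSON
  }
  // Largest allowed magnitude: 2^63 for negatives, 2^63 - 1 otherwise.
  const uint64_t limit =
      negative ? (uint64_t{1} << 63) : (uint64_t{1} << 63) - 1;
  uint64_t magnitude = 0;
  for (; i < text.size(); ++i) {
    if (text[i] < '0' || text[i] > '9') {
      return false;
    }
    const uint64_t digit = static_cast<uint64_t>(text[i] - '0');
    if (magnitude > (limit - digit) / 10) {
      return false;  // magnitude * 10 + digit would exceed the limit
    }
    magnitude = magnitude * 10 + digit;
  }
  if (!negative) {
    out = static_cast<int64_t>(magnitude);
  } else if (magnitude == 0) {
    out = 0;
  } else {
    // Avoids negating a value that does not fit in int64_t.
    out = -static_cast<int64_t>(magnitude - 1) - 1;
  }
  return true;
}
```

With this sketch, `9223372036854775808` and `18446744073709551616` are both rejected rather than silently converted to floating point.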
352
+ ## Architecture
353
+
354
+ The parser works in two stages:
355
+
356
+ - Stage 1. (Find marks) Quickly identifies structural elements, strings, and so forth. We validate UTF-8 encoding at that stage.
357
+ - Stage 2. (Structure building) Involves constructing a "tree" of sorts (materialized as a tape) to navigate through the data. Strings and numbers are parsed at this stage.
358
+
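Stage 1 validates UTF-8 with SIMD instructions. For readers who want to see what full validation has to check, the following scalar sketch applies the same rules one byte at a time. It is an illustration only (the name `is_valid_utf8` is hypothetical) and makes no attempt to match the speed or structure of the vectorized code.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Scalar UTF-8 validation: rejects stray continuation bytes, truncated
// sequences, overlong encodings, surrogates and code points > U+10FFFF.
bool is_valid_utf8(const std::string &s) {
  const unsigned char *p = reinterpret_cast<const unsigned char *>(s.data());
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char lead = p[i];
    if (lead < 0x80) {  // ASCII
      ++i;
      continue;
    }
    std::size_t len = 0;
    uint32_t code_point = 0;
    uint32_t minimum = 0;  // smallest code point allowed for this length
    if ((lead & 0xE0) == 0xC0) {
      len = 2; code_point = lead & 0x1F; minimum = 0x80;
    } else if ((lead & 0xF0) == 0xE0) {
      len = 3; code_point = lead & 0x0F; minimum = 0x800;
    } else if ((lead & 0xF8) == 0xF0) {
      len = 4; code_point = lead & 0x07; minimum = 0x10000;
    } else {
      return false;  // continuation byte in lead position, or invalid lead
    }
    if (i + len > n) {
      return false;  // truncated sequence
    }
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char next = p[i + k];
      if ((next & 0xC0) != 0x80) {
        return false;  // expected a continuation byte
      }
      code_point = (code_point << 6) | (next & 0x3F);
    }
    if (code_point < minimum || code_point > 0x10FFFF ||
        (code_point >= 0xD800 && code_point <= 0xDFFF)) {
      return false;  // overlong, out of range, or surrogate
    }
    i += len;
  }
  return true;
}
```

The sequence `0xb1 0x87` mentioned in the feature list is rejected here because `0xb1` is a continuation byte appearing where a lead byte is required.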
359
+ ## JSON Pointer
360
+
361
+ We can navigate the parsed JSON using JSON Pointers as per the [RFC6901 standard](https://tools.ietf.org/html/rfc6901).
362
+
363
+ You can build a tool (jsonpointer) to parse a JSON document and then issue an array of JSON Pointer queries:
364
+
365
+ ```
366
+ make jsonpointer
367
+ ./jsonpointer jsonexamples/small/demo.json /Image/Width /Image/Height /Image/IDs/2
368
+ ./jsonpointer jsonexamples/twitter.json /statuses/0/id /statuses/1/id /statuses/2/id /statuses/3/id /statuses/4/id /statuses/5/id
369
+ ```
370
+
371
+ In C++, given a `ParsedJson`, we can move to a node with the `move_to` method, passing a `std::string` representing the JSON Pointer query.
372
+
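To show what a JSON Pointer query such as `/Image/IDs/2` decomposes into, the sketch below splits a pointer into its reference tokens and applies the RFC 6901 escapes (`~1` for `/`, `~0` for `~`). It is independent of simdjson: `split_json_pointer` is a hypothetical helper, not part of the library, and `move_to` may process the pointer differently.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split an RFC 6901 JSON Pointer into unescaped reference tokens.
// Returns false if the pointer is malformed (missing leading '/', or a
// '~' that is not followed by '0' or '1').
bool split_json_pointer(const std::string &pointer,
                        std::vector<std::string> &tokens) {
  tokens.clear();
  if (pointer.empty()) {
    return true;  // the empty pointer refers to the whole document
  }
  if (pointer[0] != '/') {
    return false;
  }
  std::string current;
  for (std::size_t i = 1; i < pointer.size(); ++i) {
    const char c = pointer[i];
    if (c == '/') {
      tokens.push_back(current);  // token boundary
      current.clear();
    } else if (c == '~') {
      if (i + 1 >= pointer.size()) {
        return false;
      }
      const char escape = pointer[++i];
      if (escape == '0') {
        current.push_back('~');
      } else if (escape == '1') {
        current.push_back('/');
      } else {
        return false;
      }
    } else {
      current.push_back(c);
    }
  }
  tokens.push_back(current);
  return true;
}
```

For `/Image/IDs/2`, the tokens are `Image`, `IDs` and `2`; a navigator would then interpret `2` as an array index when the current node is an array.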
373
+ ## Navigating the parsed document
374
+
375
+ Here is a code sample to dump back the parsed JSON to a string:
376
+
377
+ ```c
378
+ ParsedJson::Iterator pjh(pj);
379
+ if (!pjh.is_ok()) {
380
+ std::cerr << " Could not iterate parsed result. " << std::endl;
381
+ return EXIT_FAILURE;
382
+ }
383
+ compute_dump(pjh);
384
+ //
385
+ // where compute_dump is :
386
+
387
+ void compute_dump(ParsedJson::Iterator &pjh) {
388
+ if (pjh.is_object()) {
389
+ std::cout << "{";
390
+ if (pjh.down()) {
391
+ pjh.print(std::cout); // must be a string
392
+ std::cout << ":";
393
+ pjh.next();
394
+ compute_dump(pjh); // let us recurse
395
+ while (pjh.next()) {
396
+ std::cout << ",";
397
+ pjh.print(std::cout);
398
+ std::cout << ":";
399
+ pjh.next();
400
+ compute_dump(pjh); // let us recurse
401
+ }
402
+ pjh.up();
403
+ }
404
+ std::cout << "}";
405
+ } else if (pjh.is_array()) {
406
+ std::cout << "[";
407
+ if (pjh.down()) {
408
+ compute_dump(pjh); // let us recurse
409
+ while (pjh.next()) {
410
+ std::cout << ",";
411
+ compute_dump(pjh); // let us recurse
412
+ }
413
+ pjh.up();
414
+ }
415
+ std::cout << "]";
416
+ } else {
417
+ pjh.print(std::cout); // just print the lone value
418
+ }
419
+ }
420
+ ```
421
+
422
+ The following function will find all user.id integers:
423
+
424
+ ```C
425
+ void simdjson_scan(std::vector<int64_t> &answer, ParsedJson::Iterator &i) {
426
+ while(i.move_forward()) {
427
+ if(i.get_scope_type() == '{') {
428
+ bool found_user = (i.get_string_length() == 4) && (memcmp(i.get_string(), "user", 4) == 0);
429
+ i.move_to_value();
430
+ if(found_user) {
431
+ if(i.is_object() && i.move_to_key("id",2)) {
432
+ if (i.is_integer()) {
433
+ answer.push_back(i.get_integer());
434
+ }
435
+ i.up();
436
+ }
437
+ }
438
+ }
439
+ }
440
+ }
441
+ ```
442
+
443
+ ## In-depth comparisons
444
+
445
+ If you want to see how a wide range of parsers validate a given JSON file:
446
+
447
+ ```
448
+ make allparserscheckfile
449
+ ./allparserscheckfile myfile.json
450
+ ```
451
+
452
+ For performance comparisons:
453
+
454
+ ```
455
+ make parsingcompetition
456
+ ./parsingcompetition myfile.json
457
+ ```
458
+
459
+ For broader comparisons:
460
+
461
+ ```
462
+ make allparsingcompetition
463
+ ./allparsingcompetition myfile.json
464
+ ```
465
+
466
+ Both the `parsingcompetition` and `allparsingcompetition` tools take a `-t` flag which produces
467
+ a table-oriented output that can be conveniently parsed by other tools.
468
+
469
+
470
+ ## Docker
471
+
472
+ One can run tests and benchmarks using docker. It especially makes sense under Linux. Privileged access may be needed to get performance counters.
473
+
474
+ ```
475
+ git clone https://github.com/lemire/simdjson.git
476
+ cd simdjson
477
+ docker build -t simdjson .
478
+ docker run --privileged -t simdjson
479
+ ```
480
+
481
+ ## Other programming languages
482
+
483
+ We distinguish between "bindings" (which just wrap the C++ code) and a port to another programming language (which reimplements everything).
484
+
485
+ - [pysimdjson](https://github.com/TkTech/pysimdjson): Python bindings for the simdjson project.
486
+ - [simdjson-rs](https://github.com/Licenser/simdjson-rs): Rust port
487
+ - [simdjson-rust](https://github.com/SunDoge/simdjson-rust): Rust wrapper (bindings)
488
+ - [SimdJsonSharp](https://github.com/EgorBo/SimdJsonSharp): C# version for .NET Core (bindings and full port)
489
+ - [simdjson_nodejs](https://github.com/luizperes/simdjson_nodejs): Node.js bindings for the simdjson project.
490
+ - [simdjson_php](https://github.com/crazyxman/simdjson_php): PHP bindings for the simdjson project.
491
+
492
+ ## Various References
493
+
494
+ - [Google double-conv](https://github.com/google/double-conversion/)
495
+ - [How to implement atoi using SIMD?](https://stackoverflow.com/questions/35127060/how-to-implement-atoi-using-simd)
496
+ - [Parsing JSON is a Minefield 💣](http://seriot.ch/parsing_json.php)
497
+ - https://tools.ietf.org/html/rfc7159
498
+ - The Mison implementation in rust https://github.com/pikkr/pikkr
499
+ - http://rapidjson.org/md_doc_sax.html
500
+ - https://github.com/Geal/parser_benchmarks/tree/master/json
501
+ - Gron: A command line tool that makes JSON greppable https://news.ycombinator.com/item?id=16727665
502
+ - GoogleGson https://github.com/google/gson
503
+ - Jackson https://github.com/FasterXML/jackson
504
+ - https://www.yelp.com/dataset_challenge
505
+ - RapidJSON. http://rapidjson.org/
506
+
507
+ Inspiring links:
508
+
509
+ - https://auth0.com/blog/beating-json-performance-with-protobuf/
510
+ - https://gist.github.com/shijuvar/25ad7de9505232c87034b8359543404a
511
+ - https://github.com/frankmcsherry/blog/blob/master/posts/2018-02-11.md
512
+
513
+ Validating UTF-8 can take as little as 0.7 cycles per byte:
514
+
515
+ - https://github.com/lemire/fastvalidate-utf-8 https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/
516
+
517
+ ## Remarks on JSON parsing
518
+
519
+ - The JSON spec defines what a JSON parser is:
520
+ > A JSON parser transforms a JSON text into another representation. A JSON parser MUST accept all texts that conform to the JSON grammar. A JSON parser MAY accept non-JSON forms or extensions. An implementation may set limits on the size of texts that it accepts. An implementation may set limits on the maximum depth of nesting. An implementation may set limits on the range and precision of numbers. An implementation may set limits on the length and character contents of strings.
521
+
522
+ - JSON is not JavaScript:
523
+
524
+ > All JSON is Javascript but NOT all Javascript is JSON. So {property:1} is invalid because property does not have double quotes around it. {'property':1} is also invalid, because it's single quoted while the only thing that can placate the JSON specification is double quoting. JSON is even fussy enough that {"property":.1} is invalid too, because you should have of course written {"property":0.1}. Also, don't even think about having comments or semicolons, you guessed it: they're invalid. (credit: https://github.com/elzr/vim-json)
525
+
526
+ - The structural characters are:
527
+
528
+
529
+     begin-array     = [ left square bracket
530
+     begin-object    = { left curly bracket
531
+     end-array       = ] right square bracket
532
+     end-object      = } right curly bracket
533
+     name-separator  = : colon
534
+     value-separator = , comma
535
+
536
+ ### Pseudo-structural elements
537
+
538
+ A character is pseudo-structural if and only if:
539
+
540
+ 1. It is not enclosed in quotes, AND
541
+ 2. It is a non-whitespace character, AND
542
+ 3. Its preceding character is either:
543
+    (a) a structural character, OR
544
+    (b) whitespace.
545
+
546
+ Under this definition, additional characters count as pseudo-structural, such as the characters 1, G, and n in the following:
547
+
548
+ > { "foo" : 1.5, "bar" : 1.5 GEOFF_IS_A_DUMMY bla bla , "baz", null }
549
+
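The three conditions above can be sketched as a plain character-by-character scan. This is a scalar illustration of the definition only, not the library's code. The function names (`is_structural`, `is_json_whitespace`, `pseudo_structural_positions`) are hypothetical, and the sketch makes three assumptions the text does not state: the start of the input behaves as if preceded by whitespace, an opening quote counts as outside the string while the closing quote counts as inside, and a backslash escapes the next character within a string.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// The six structural characters listed in the remarks above.
bool is_structural(char c) {
  return c == '[' || c == '{' || c == ']' || c == '}' || c == ':' || c == ',';
}

// JSON whitespace: space, horizontal tab, line feed, carriage return.
bool is_json_whitespace(char c) {
  return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

// Hypothetical helper: returns the byte offsets of every character that
// satisfies the three conditions of the definition.
std::vector<std::size_t> pseudo_structural_positions(const std::string &json) {
  std::vector<std::size_t> positions;
  bool in_string = false;  // true after an opening quote, until the closing quote
  bool escaped = false;    // true when the previous string character was a backslash
  char prev = ' ';         // assumption: start of input acts like whitespace
  for (std::size_t i = 0; i < json.size(); ++i) {
    const char c = json[i];
    if (in_string) {
      // Condition 1 fails here; only track where the string ends.
      if (escaped) {
        escaped = false;
      } else if (c == '\\') {
        escaped = true;
      } else if (c == '"') {
        in_string = false;
      }
    } else {
      // Conditions 2 and 3: non-whitespace, preceded by structural or whitespace.
      if (!is_json_whitespace(c) &&
          (is_structural(prev) || is_json_whitespace(prev))) {
        positions.push_back(i);
      }
      if (c == '"') {
        in_string = true;
      }
    }
    prev = c;
  }
  return positions;
}
```

Calling `pseudo_structural_positions` on the example line above reports, among others, the offsets of the `1`, `G`, and `n` characters. Because the definition is applied literally, structural characters and opening quotes that follow whitespace are reported as well.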
550
+ ## Academic References
551
+
552
+ - T. Mühlbauer, W. Rödiger, R. Seilbeck, A. Reiser, A. Kemper, and T. Neumann. Instant loading for main memory databases. PVLDB, 6(14):1702–1713, 2013. (SIMD-based CSV parsing)
553
+ - Mytkowicz, Todd, Madanlal Musuvathi, and Wolfram Schulte. "Data-parallel finite-state machines." ACM SIGARCH Computer Architecture News. Vol. 42. No. 1. ACM, 2014.
554
+ - Lu, Yifan, et al. "Tree structured data processing on GPUs." Cloud Computing, Data Science & Engineering-Confluence, 2017 7th International Conference on. IEEE, 2017.
555
+ - Sidhu, Reetinder. "High throughput, tree automata based XML processing using FPGAs." Field-Programmable Technology (FPT), 2013 International Conference on. IEEE, 2013.
556
+ - Dai, Zefu, Nick Ni, and Jianwen Zhu. "A 1 cycle-per-byte XML parsing accelerator." Proceedings of the 18th annual ACM/SIGDA international symposium on Field programmable gate arrays. ACM, 2010.
557
+ - Lin, Dan, et al. "Parabix: Boosting the efficiency of text processing on commodity processors." High Performance Computer Architecture (HPCA), 2012 IEEE 18th International Symposium on. IEEE, 2012. http://parabix.costar.sfu.ca/export/1783/docs/HPCA2012/final_ieee/final.pdf
558
+ - Deshmukh, V. M., and G. R. Bamnote. "An empirical evaluation of optimization parameters in XML parsing for performance enhancement." Computer, Communication and Control (IC4), 2015 International Conference on. IEEE, 2015.
559
+ - Moussalli, Roger, et al. "Efficient XML Path Filtering Using GPUs." ADMS@ VLDB. 2011.
560
+ - Jianliang, Ma, et al. "Parallel speculative dom-based XML parser." High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), 2012 IEEE 14th International Conference on. IEEE, 2012.
561
+ - Li, Y., Katsipoulakis, N.R., Chandramouli, B., Goldstein, J. and Kossmann, D., 2017. Mison: a fast JSON parser for data analytics. Proceedings of the VLDB Endowment, 10(10), pp.1118-1129. http://www.vldb.org/pvldb/vol10/p1118-li.pdf
562
+ - Cameron, Robert D., et al. "Parallel scanning with bitstream addition: An xml case study." European Conference on Parallel Processing. Springer, Berlin, Heidelberg, 2011.
563
+ - Cameron, Robert D., Kenneth S. Herdy, and Dan Lin. "High performance XML parsing using parallel bit stream technology." Proceedings of the 2008 conference of the center for advanced studies on collaborative research: meeting of minds. ACM, 2008.
564
+ - Shah, Bhavik, et al. "A data parallel algorithm for XML DOM parsing." International XML Database Symposium. Springer, Berlin, Heidelberg, 2009.
565
+ - Cameron, Robert D., and Dan Lin. "Architectural support for SWAR text processing with parallel bit streams: the inductive doubling principle." ACM Sigplan Notices. Vol. 44. No. 3. ACM, 2009.
566
+ - Amagasa, Toshiyuki, Mana Seino, and Hiroyuki Kitagawa. "Energy-Efficient XML Stream Processing through Element-Skipping Parsing." Database and Expert Systems Applications (DEXA), 2013 24th International Workshop on. IEEE, 2013.
567
+ - Medforth, Nigel Woodland. "icXML: Accelerating Xerces-C 3.1.1 using the Parabix Framework." (2013).
568
+ - Zhang, Qiang Scott. Embedding Parallel Bit Stream Technology Into Expat. Diss. Simon Fraser University, 2010.
569
+ - Cameron, Robert D., et al. "Fast Regular Expression Matching with Bit-parallel Data Streams."
570
+ - Lin, Dan. Bits filter: a high-performance multiple string pattern matching algorithm for malware detection. Diss. School of Computing Science-Simon Fraser University, 2010.
571
+ - Yang, Shiyang. Validation of XML Document Based on Parallel Bit Stream Technology. Diss. Applied Sciences: School of Computing Science, 2013.
572
+ - N. Nakasato, "Implementation of a parallel tree method on a GPU", Journal of Computational Science, vol. 3, no. 3, pp. 132-141, 2012.
573
+
574
+
575
+ ## Funding
576
+
577
+ The work is supported by the Natural Sciences and Engineering Research Council of Canada under grant number RGPIN-2017-03910.
578
+
579
+
580
+ [license]: LICENSE
581
+ [license img]: https://img.shields.io/badge/License-Apache%202-blue.svg