simdjson 0.1.0

Files changed (132)
  1. checksums.yaml +7 -0
  2. data/.clang-format +5 -0
  3. data/.gitignore +14 -0
  4. data/.gitmodules +3 -0
  5. data/.rubocop.yml +9 -0
  6. data/.travis.yml +7 -0
  7. data/Gemfile +4 -0
  8. data/LICENSE.txt +21 -0
  9. data/README.md +39 -0
  10. data/Rakefile +32 -0
  11. data/benchmark/apache_builds.json +4421 -0
  12. data/benchmark/demo.json +15 -0
  13. data/benchmark/github_events.json +1390 -0
  14. data/benchmark/run_benchmark.rb +30 -0
  15. data/ext/simdjson/extconf.rb +22 -0
  16. data/ext/simdjson/simdjson.cpp +76 -0
  17. data/ext/simdjson/simdjson.hpp +6 -0
  18. data/lib/simdjson/version.rb +3 -0
  19. data/lib/simdjson.rb +2 -0
  20. data/simdjson.gemspec +35 -0
  21. data/vendor/.gitkeep +0 -0
  22. data/vendor/simdjson/AUTHORS +3 -0
  23. data/vendor/simdjson/CMakeLists.txt +63 -0
  24. data/vendor/simdjson/CONTRIBUTORS +27 -0
  25. data/vendor/simdjson/Dockerfile +10 -0
  26. data/vendor/simdjson/LICENSE +201 -0
  27. data/vendor/simdjson/Makefile +203 -0
  28. data/vendor/simdjson/Notes.md +85 -0
  29. data/vendor/simdjson/README.md +581 -0
  30. data/vendor/simdjson/amalgamation.sh +158 -0
  31. data/vendor/simdjson/benchmark/CMakeLists.txt +8 -0
  32. data/vendor/simdjson/benchmark/benchmark.h +223 -0
  33. data/vendor/simdjson/benchmark/distinctuseridcompetition.cpp +347 -0
  34. data/vendor/simdjson/benchmark/linux/linux-perf-events.h +93 -0
  35. data/vendor/simdjson/benchmark/minifiercompetition.cpp +181 -0
  36. data/vendor/simdjson/benchmark/parse.cpp +393 -0
  37. data/vendor/simdjson/benchmark/parseandstatcompetition.cpp +305 -0
  38. data/vendor/simdjson/benchmark/parsingcompetition.cpp +298 -0
  39. data/vendor/simdjson/benchmark/statisticalmodel.cpp +208 -0
  40. data/vendor/simdjson/dependencies/jsoncppdist/json/json-forwards.h +344 -0
  41. data/vendor/simdjson/dependencies/jsoncppdist/json/json.h +2366 -0
  42. data/vendor/simdjson/dependencies/jsoncppdist/jsoncpp.cpp +5418 -0
  43. data/vendor/simdjson/doc/apache_builds.jsonparseandstat.png +0 -0
  44. data/vendor/simdjson/doc/gbps.png +0 -0
  45. data/vendor/simdjson/doc/github_events.jsonparseandstat.png +0 -0
  46. data/vendor/simdjson/doc/twitter.jsonparseandstat.png +0 -0
  47. data/vendor/simdjson/doc/update-center.jsonparseandstat.png +0 -0
  48. data/vendor/simdjson/images/halvarflake.png +0 -0
  49. data/vendor/simdjson/images/logo.png +0 -0
  50. data/vendor/simdjson/include/simdjson/common_defs.h +102 -0
  51. data/vendor/simdjson/include/simdjson/isadetection.h +152 -0
  52. data/vendor/simdjson/include/simdjson/jsoncharutils.h +301 -0
  53. data/vendor/simdjson/include/simdjson/jsonformatutils.h +202 -0
  54. data/vendor/simdjson/include/simdjson/jsonioutil.h +32 -0
  55. data/vendor/simdjson/include/simdjson/jsonminifier.h +30 -0
  56. data/vendor/simdjson/include/simdjson/jsonparser.h +250 -0
  57. data/vendor/simdjson/include/simdjson/numberparsing.h +587 -0
  58. data/vendor/simdjson/include/simdjson/padded_string.h +70 -0
  59. data/vendor/simdjson/include/simdjson/parsedjson.h +544 -0
  60. data/vendor/simdjson/include/simdjson/portability.h +172 -0
  61. data/vendor/simdjson/include/simdjson/simdjson.h +44 -0
  62. data/vendor/simdjson/include/simdjson/simdjson_version.h +13 -0
  63. data/vendor/simdjson/include/simdjson/simdprune_tables.h +35074 -0
  64. data/vendor/simdjson/include/simdjson/simdutf8check_arm64.h +180 -0
  65. data/vendor/simdjson/include/simdjson/simdutf8check_haswell.h +198 -0
  66. data/vendor/simdjson/include/simdjson/simdutf8check_westmere.h +169 -0
  67. data/vendor/simdjson/include/simdjson/stage1_find_marks.h +121 -0
  68. data/vendor/simdjson/include/simdjson/stage1_find_marks_arm64.h +210 -0
  69. data/vendor/simdjson/include/simdjson/stage1_find_marks_flatten.h +93 -0
  70. data/vendor/simdjson/include/simdjson/stage1_find_marks_flatten_haswell.h +95 -0
  71. data/vendor/simdjson/include/simdjson/stage1_find_marks_haswell.h +210 -0
  72. data/vendor/simdjson/include/simdjson/stage1_find_marks_macros.h +239 -0
  73. data/vendor/simdjson/include/simdjson/stage1_find_marks_westmere.h +194 -0
  74. data/vendor/simdjson/include/simdjson/stage2_build_tape.h +85 -0
  75. data/vendor/simdjson/include/simdjson/stringparsing.h +105 -0
  76. data/vendor/simdjson/include/simdjson/stringparsing_arm64.h +56 -0
  77. data/vendor/simdjson/include/simdjson/stringparsing_haswell.h +43 -0
  78. data/vendor/simdjson/include/simdjson/stringparsing_macros.h +88 -0
  79. data/vendor/simdjson/include/simdjson/stringparsing_westmere.h +41 -0
  80. data/vendor/simdjson/jsonexamples/small/jsoniter_scala/README.md +4 -0
  81. data/vendor/simdjson/scripts/dumpsimplestats.sh +11 -0
  82. data/vendor/simdjson/scripts/issue150.sh +14 -0
  83. data/vendor/simdjson/scripts/javascript/README.md +3 -0
  84. data/vendor/simdjson/scripts/javascript/generatelargejson.js +19 -0
  85. data/vendor/simdjson/scripts/minifier.sh +11 -0
  86. data/vendor/simdjson/scripts/parseandstat.sh +24 -0
  87. data/vendor/simdjson/scripts/parser.sh +11 -0
  88. data/vendor/simdjson/scripts/parsingcompdata.sh +26 -0
  89. data/vendor/simdjson/scripts/plotparse.sh +98 -0
  90. data/vendor/simdjson/scripts/selectparser.sh +11 -0
  91. data/vendor/simdjson/scripts/setupfortesting/disablehyperthreading.sh +15 -0
  92. data/vendor/simdjson/scripts/setupfortesting/powerpolicy.sh +32 -0
  93. data/vendor/simdjson/scripts/setupfortesting/setupfortesting.sh +6 -0
  94. data/vendor/simdjson/scripts/setupfortesting/turboboost.sh +51 -0
  95. data/vendor/simdjson/scripts/testjson2json.sh +99 -0
  96. data/vendor/simdjson/scripts/transitions/Makefile +10 -0
  97. data/vendor/simdjson/scripts/transitions/generatetransitions.cpp +20 -0
  98. data/vendor/simdjson/singleheader/README.md +1 -0
  99. data/vendor/simdjson/singleheader/amalgamation_demo.cpp +20 -0
  100. data/vendor/simdjson/singleheader/simdjson.cpp +1652 -0
  101. data/vendor/simdjson/singleheader/simdjson.h +39692 -0
  102. data/vendor/simdjson/src/CMakeLists.txt +67 -0
  103. data/vendor/simdjson/src/jsonioutil.cpp +35 -0
  104. data/vendor/simdjson/src/jsonminifier.cpp +285 -0
  105. data/vendor/simdjson/src/jsonparser.cpp +91 -0
  106. data/vendor/simdjson/src/parsedjson.cpp +323 -0
  107. data/vendor/simdjson/src/parsedjsoniterator.cpp +272 -0
  108. data/vendor/simdjson/src/simdjson.cpp +30 -0
  109. data/vendor/simdjson/src/stage1_find_marks.cpp +41 -0
  110. data/vendor/simdjson/src/stage2_build_tape.cpp +567 -0
  111. data/vendor/simdjson/style/clang-format-check.sh +25 -0
  112. data/vendor/simdjson/style/clang-format.sh +25 -0
  113. data/vendor/simdjson/style/run-clang-format.py +326 -0
  114. data/vendor/simdjson/tape.md +134 -0
  115. data/vendor/simdjson/tests/CMakeLists.txt +25 -0
  116. data/vendor/simdjson/tests/allparserscheckfile.cpp +192 -0
  117. data/vendor/simdjson/tests/basictests.cpp +75 -0
  118. data/vendor/simdjson/tests/jsoncheck.cpp +136 -0
  119. data/vendor/simdjson/tests/numberparsingcheck.cpp +224 -0
  120. data/vendor/simdjson/tests/pointercheck.cpp +38 -0
  121. data/vendor/simdjson/tests/singleheadertest.cpp +22 -0
  122. data/vendor/simdjson/tests/stringparsingcheck.cpp +408 -0
  123. data/vendor/simdjson/tools/CMakeLists.txt +3 -0
  124. data/vendor/simdjson/tools/cmake/FindCTargets.cmake +15 -0
  125. data/vendor/simdjson/tools/cmake/FindOptions.cmake +52 -0
  126. data/vendor/simdjson/tools/json2json.cpp +112 -0
  127. data/vendor/simdjson/tools/jsonpointer.cpp +93 -0
  128. data/vendor/simdjson/tools/jsonstats.cpp +143 -0
  129. data/vendor/simdjson/tools/minify.cpp +21 -0
  130. data/vendor/simdjson/tools/release.py +125 -0
  131. data/vendor/simdjson/windows/dirent_portable.h +1043 -0
  132. metadata +273 -0
@@ -0,0 +1,203 @@
1
+
2
+ .SUFFIXES:
3
+ #
4
+ .SUFFIXES: .cpp .o .c .h
5
+
6
+
7
+ .PHONY: clean cleandist
8
+ COREDEPSINCLUDE = -Idependencies/json/single_include -Idependencies/rapidjson/include -Idependencies/sajson/include -Idependencies/cJSON -Idependencies/jsmn
9
+ EXTRADEPSINCLUDE = -Idependencies/jsoncppdist -Idependencies/json11 -Idependencies/fastjson/src -Idependencies/fastjson/include -Idependencies/gason/src -Idependencies/ujson4c/3rdparty -Idependencies/ujson4c/src
10
+ # users can provide their own additional flags with make EXTRAFLAGS=something
11
+ architecture:=$(shell arch)
12
+
13
+ ####
14
+ # If you want to specify your own target architecture,
15
+ # then define ARCHFLAGS. Otherwise, we set a good default.
16
+ # E.g., type ' ARCHFLAGS="-march=nehalem" make parse '
17
+ ###
18
+ ifeq ($(architecture),aarch64)
19
+ ARCHFLAGS ?= -march=armv8-a+crc+crypto
20
+ else
21
+ ARCHFLAGS ?= -msse4.2 -mpclmul # lowest supported feature set?
22
+ endif
23
+
24
+ CXXFLAGS = $(ARCHFLAGS) -std=c++17 -Wall -Wextra -Wshadow -Iinclude -Ibenchmark/linux $(EXTRAFLAGS)
25
+ CFLAGS = $(ARCHFLAGS) -Idependencies/ujson4c/3rdparty -Idependencies/ujson4c/src $(EXTRAFLAGS)
26
+
27
+
28
+ # This is a convenience flag
29
+ ifdef SANITIZEGOLD
30
+ SANITIZE = 1
31
+ LINKER = gold
32
+ endif
33
+
34
+ ifdef LINKER
35
+ CXXFLAGS += -fuse-ld=$(LINKER)
36
+ CFLAGS += -fuse-ld=$(LINKER)
37
+ endif
38
+
39
+
40
+ # SANITIZE *implies* DEBUG
41
+ ifeq ($(MEMSANITIZE),1)
42
+ CXXFLAGS += -g3 -O0 -fsanitize=memory -fno-omit-frame-pointer -fsanitize=undefined
43
+ CFLAGS += -g3 -O0 -fsanitize=memory -fno-omit-frame-pointer -fsanitize=undefined
44
+ else
45
+ ifeq ($(SANITIZE),1)
46
+ CXXFLAGS += -g3 -O0 -fsanitize=address -fno-omit-frame-pointer -fsanitize=undefined
47
+ CFLAGS += -g3 -O0 -fsanitize=address -fno-omit-frame-pointer -fsanitize=undefined
48
+ else
49
+ ifeq ($(DEBUG),1)
50
+ CXXFLAGS += -g3 -O0
51
+ CFLAGS += -g3 -O0
52
+ else
53
+ # we opt for -O3 for regular builds
54
+ CXXFLAGS += -O3
55
+ CFLAGS += -O3
56
+ endif # ifeq ($(DEBUG),1)
57
+ endif # ifeq ($(SANITIZE),1)
58
+ endif # ifeq ($(MEMSANITIZE),1)
59
+
60
+ MAINEXECUTABLES=parse minify json2json jsonstats statisticalmodel jsonpointer
61
+ TESTEXECUTABLES=jsoncheck numberparsingcheck stringparsingcheck pointercheck
62
+ COMPARISONEXECUTABLES=minifiercompetition parsingcompetition parseandstatcompetition distinctuseridcompetition allparserscheckfile allparsingcompetition
63
+ SUPPLEMENTARYEXECUTABLES=parse_noutf8validation parse_nonumberparsing parse_nostringparsing
64
+
65
+ HEADERS= include/simdjson/simdutf8check_haswell.h include/simdjson/simdutf8check_westmere.h include/simdjson/simdutf8check_arm64.h include/simdjson/stringparsing.h include/simdjson/stringparsing_arm64.h include/simdjson/stringparsing_haswell.h include/simdjson/stringparsing_macros.h include/simdjson/stringparsing_westmere.h include/simdjson/numberparsing.h include/simdjson/jsonparser.h include/simdjson/common_defs.h include/simdjson/jsonioutil.h benchmark/benchmark.h benchmark/linux/linux-perf-events.h include/simdjson/parsedjson.h include/simdjson/stage1_find_marks.h include/simdjson/stage1_find_marks_arm64.h include/simdjson/stage1_find_marks_haswell.h include/simdjson/stage1_find_marks_westmere.h include/simdjson/stage1_find_marks_macros.h include/simdjson/stage2_build_tape.h include/simdjson/jsoncharutils.h include/simdjson/jsonformatutils.h include/simdjson/stage1_find_marks_flatten.h include/simdjson/stage1_find_marks_flatten_haswell.h
66
+ LIBFILES=src/jsonioutil.cpp src/jsonparser.cpp src/simdjson.cpp src/stage1_find_marks.cpp src/stage2_build_tape.cpp src/parsedjson.cpp src/parsedjsoniterator.cpp
67
+ MINIFIERHEADERS=include/simdjson/jsonminifier.h include/simdjson/simdprune_tables.h
68
+ MINIFIERLIBFILES=src/jsonminifier.cpp
69
+
70
+
71
+ RAPIDJSON_INCLUDE:=dependencies/rapidjson/include
72
+ SAJSON_INCLUDE:=dependencies/sajson/include
73
+ JSON11_INCLUDE:=dependencies/json11/json11.hpp
74
+ FASTJSON_INCLUDE:=dependencies/fastjson/include/fastjson/fastjson.h
75
+ GASON_INCLUDE:=dependencies/gason/src/gason.h
76
+ UJSON4C_INCLUDE:=dependencies/ujson4c/src/ujdecode.c
77
+ CJSON_INCLUDE:=dependencies/cJSON/cJSON.h
78
+ JSMN_INCLUDE:=dependencies/jsmn/jsmn.h
79
+ JSON_INCLUDE:=dependencies/json/single_include/nlohmann/json.hpp
80
+
81
+ LIBS=$(RAPIDJSON_INCLUDE) $(JSON_INCLUDE) $(SAJSON_INCLUDE) $(JSON11_INCLUDE) $(FASTJSON_INCLUDE) $(GASON_INCLUDE) $(UJSON4C_INCLUDE) $(CJSON_INCLUDE) $(JSMN_INCLUDE)
82
+
83
+ EXTRAOBJECTS=ujdecode.o
84
+ all: $(MAINEXECUTABLES)
85
+
86
+ competition: $(COMPARISONEXECUTABLES)
87
+
88
+ .PHONY: benchmark test
89
+
90
+ benchmark:
91
+ bash ./scripts/parser.sh
92
+ bash ./scripts/parseandstat.sh
93
+
94
+ test: jsoncheck numberparsingcheck stringparsingcheck basictests allparserscheckfile minify json2json pointercheck
95
+ ./basictests
96
+ ./numberparsingcheck
97
+ ./stringparsingcheck
98
+ ./jsoncheck
99
+ ./pointercheck
100
+ ./scripts/testjson2json.sh
101
+ ./scripts/issue150.sh
102
+ @echo "It looks like the code is good!"
103
+
104
+ quiettest: jsoncheck numberparsingcheck stringparsingcheck basictests allparserscheckfile minify json2json pointercheck
105
+ ./basictests
106
+ ./numberparsingcheck
107
+ ./stringparsingcheck
108
+ ./jsoncheck
109
+ ./pointercheck
110
+ ./scripts/testjson2json.sh
111
+ ./scripts/issue150.sh
112
+
113
+ amalgamate:
114
+ ./amalgamation.sh
115
+ $(CXX) $(CXXFLAGS) -o singleheader/demo ./singleheader/amalgamation_demo.cpp -Isingleheader
116
+
117
+ submodules:
118
+ -git submodule update --init --recursive
119
+ -touch submodules
120
+
121
+ $(JSON_INCLUDE) $(SAJSON_INCLUDE) $(RAPIDJSON_INCLUDE) $(JSON11_INCLUDE) $(FASTJSON_INCLUDE) $(GASON_INCLUDE) $(UJSON4C_INCLUDE) $(CJSON_INCLUDE) $(JSMN_INCLUDE) : submodules
122
+
123
+ parse: benchmark/parse.cpp $(HEADERS) $(LIBFILES)
124
+ $(CXX) $(CXXFLAGS) -o parse $(LIBFILES) benchmark/parse.cpp $(LIBFLAGS)
125
+
126
+ statisticalmodel: benchmark/statisticalmodel.cpp $(HEADERS) $(LIBFILES)
127
+ $(CXX) $(CXXFLAGS) -o statisticalmodel $(LIBFILES) benchmark/statisticalmodel.cpp $(LIBFLAGS)
128
+
129
+
130
+ parse_noutf8validation: benchmark/parse.cpp $(HEADERS) $(LIBFILES)
131
+ $(CXX) $(CXXFLAGS) -o parse_noutf8validation -DSIMDJSON_SKIPUTF8VALIDATION $(LIBFILES) benchmark/parse.cpp $(LIBFLAGS)
132
+
133
+ parse_nonumberparsing: benchmark/parse.cpp $(HEADERS) $(LIBFILES)
134
+ $(CXX) $(CXXFLAGS) -o parse_nonumberparsing -DSIMDJSON_SKIPNUMBERPARSING $(LIBFILES) benchmark/parse.cpp $(LIBFLAGS)
135
+
136
+ parse_nostringparsing: benchmark/parse.cpp $(HEADERS) $(LIBFILES)
137
+ $(CXX) $(CXXFLAGS) -o parse_nostringparsing -DSIMDJSON_SKIPSTRINGPARSING $(LIBFILES) benchmark/parse.cpp $(LIBFLAGS)
138
+
139
+
140
+ jsoncheck:tests/jsoncheck.cpp $(HEADERS) $(LIBFILES)
141
+ $(CXX) $(CXXFLAGS) -o jsoncheck $(LIBFILES) tests/jsoncheck.cpp -I. $(LIBFLAGS)
142
+
143
+ basictests:tests/basictests.cpp $(HEADERS) $(LIBFILES)
144
+ $(CXX) $(CXXFLAGS) -o basictests $(LIBFILES) tests/basictests.cpp -I. $(LIBFLAGS)
145
+
146
+
147
+ numberparsingcheck:tests/numberparsingcheck.cpp $(HEADERS) $(LIBFILES)
148
+ $(CXX) $(CXXFLAGS) -o numberparsingcheck tests/numberparsingcheck.cpp src/jsonioutil.cpp src/jsonparser.cpp src/simdjson.cpp src/stage1_find_marks.cpp src/parsedjson.cpp -I. $(LIBFLAGS) -DJSON_TEST_NUMBERS
149
+
150
+
151
+ stringparsingcheck:tests/stringparsingcheck.cpp $(HEADERS) $(LIBFILES)
152
+ $(CXX) $(CXXFLAGS) -o stringparsingcheck tests/stringparsingcheck.cpp src/jsonioutil.cpp src/jsonparser.cpp src/simdjson.cpp src/stage1_find_marks.cpp src/parsedjson.cpp -I. $(LIBFLAGS) -DJSON_TEST_STRINGS
153
+
154
+ pointercheck:tests/pointercheck.cpp $(HEADERS) $(LIBFILES)
155
+ $(CXX) $(CXXFLAGS) -o pointercheck tests/pointercheck.cpp src/stage2_build_tape.cpp src/jsonioutil.cpp src/jsonparser.cpp src/simdjson.cpp src/stage1_find_marks.cpp src/parsedjson.cpp src/parsedjsoniterator.cpp -I. $(LIBFLAGS)
156
+
157
+ minifiercompetition: benchmark/minifiercompetition.cpp $(HEADERS) submodules $(MINIFIERHEADERS) $(LIBFILES) $(MINIFIERLIBFILES)
158
+ $(CXX) $(CXXFLAGS) -o minifiercompetition $(LIBFILES) $(MINIFIERLIBFILES) benchmark/minifiercompetition.cpp -I. $(LIBFLAGS) $(COREDEPSINCLUDE)
159
+
160
+ minify: tools/minify.cpp $(HEADERS) $(MINIFIERHEADERS) $(LIBFILES) $(MINIFIERLIBFILES)
161
+ $(CXX) $(CXXFLAGS) -o minify $(MINIFIERLIBFILES) $(LIBFILES) tools/minify.cpp -I.
162
+
163
+ json2json: tools/json2json.cpp $(HEADERS) $(LIBFILES)
164
+ $(CXX) $(CXXFLAGS) -o json2json tools/json2json.cpp $(LIBFILES) -I.
165
+
166
+ jsonpointer: tools/jsonpointer.cpp $(HEADERS) $(LIBFILES)
167
+ $(CXX) $(CXXFLAGS) -o jsonpointer tools/jsonpointer.cpp $(LIBFILES) -I.
168
+
169
+ jsonstats: tools/jsonstats.cpp $(HEADERS) $(LIBFILES)
170
+ $(CXX) $(CXXFLAGS) -o jsonstats tools/jsonstats.cpp $(LIBFILES) -I.
171
+
172
+ ujdecode.o: $(UJSON4C_INCLUDE)
173
+ $(CC) $(CFLAGS) -c dependencies/ujson4c/src/ujdecode.c
174
+
175
+ parseandstatcompetition: benchmark/parseandstatcompetition.cpp $(HEADERS) $(LIBFILES) submodules
176
+ $(CXX) $(CXXFLAGS) -o parseandstatcompetition $(LIBFILES) benchmark/parseandstatcompetition.cpp -I. $(LIBFLAGS) $(COREDEPSINCLUDE)
177
+
178
+ distinctuseridcompetition: benchmark/distinctuseridcompetition.cpp $(HEADERS) $(LIBFILES) submodules
179
+ $(CXX) $(CXXFLAGS) -o distinctuseridcompetition $(LIBFILES) benchmark/distinctuseridcompetition.cpp -I. $(LIBFLAGS) $(COREDEPSINCLUDE)
180
+
181
+ parsingcompetition: benchmark/parsingcompetition.cpp $(HEADERS) $(LIBFILES) submodules
182
+ @echo "In case of build error due to missing files, try 'make clean'"
183
+ $(CXX) $(CXXFLAGS) -o parsingcompetition $(LIBFILES) benchmark/parsingcompetition.cpp -I. $(LIBFLAGS) $(COREDEPSINCLUDE)
184
+
185
+ allparsingcompetition: benchmark/parsingcompetition.cpp $(HEADERS) $(LIBFILES) $(EXTRAOBJECTS) submodules
186
+ $(CXX) $(CXXFLAGS) -o allparsingcompetition $(LIBFILES) benchmark/parsingcompetition.cpp $(EXTRAOBJECTS) -I. $(LIBFLAGS) $(COREDEPSINCLUDE) $(EXTRADEPSINCLUDE) -DALLPARSER
187
+
188
+
189
+ allparserscheckfile: tests/allparserscheckfile.cpp $(HEADERS) $(LIBFILES) $(EXTRAOBJECTS) submodules
190
+ $(CXX) $(CXXFLAGS) -o allparserscheckfile $(LIBFILES) tests/allparserscheckfile.cpp $(EXTRAOBJECTS) -I. $(LIBFLAGS) $(COREDEPSINCLUDE) $(EXTRADEPSINCLUDE)
191
+
192
+ .PHONY: clean cppcheck cleandist
193
+
194
+ cppcheck:
195
+ cppcheck --enable=all src/*.cpp benchmark/*.cpp tests/*.cpp -Iinclude -I. -Ibenchmark/linux
196
+
197
+ everything: $(MAINEXECUTABLES) $(EXTRA_EXECUTABLES) $(TESTEXECUTABLES) $(COMPARISONEXECUTABLES) $(SUPPLEMENTARYEXECUTABLES)
198
+
199
+ clean:
200
+ rm -f submodules $(EXTRAOBJECTS) $(MAINEXECUTABLES) $(EXTRA_EXECUTABLES) $(TESTEXECUTABLES) $(COMPARISONEXECUTABLES) $(SUPPLEMENTARYEXECUTABLES)
201
+
202
+ cleandist:
203
+ rm -f submodules $(EXTRAOBJECTS) $(MAINEXECUTABLES) $(EXTRA_EXECUTABLES) $(TESTEXECUTABLES) $(COMPARISONEXECUTABLES) $(SUPPLEMENTARYEXECUTABLES)
@@ -0,0 +1,85 @@
1
+ # Notes on simdjson
2
+
3
+ ## Rationale:
4
+
5
+ The simdjson project serves two purposes:
6
+
7
+ 1. It creates a useful library for parsing JSON data quickly.
8
+
9
+ 2. It is a demonstration of the use of SIMD and pipelined programming techniques to perform a complex and irregular task.
10
+ These techniques include the use of large registers and SIMD instructions to process large amounts of input data at once,
11
+ to hold larger entities than can typically be held in a single General Purpose Register (GPR), and to perform operations
12
+ that are not cheap to perform without use of a SIMD unit (for example table lookup using permute instructions).
13
+
14
+ The other key technique is that the system is designed to minimize the number of unpredictable branches that must be taken
15
+ to perform the task. Modern architectures are both wide and deep (4-wide pipelines with ~14 stages are commonplace). A
16
+ recent Intel Architecture processor, for example, can perform 3 256-bit SIMD operations or 2 512-bit SIMD operations per
17
+ cycle as well as other operations on general purpose registers or with the load/store unit. An incorrectly predicted branch
18
+ will clear this pipeline. While it is rare that a programmer can achieve the maximum throughput on a machine, a developer
19
+ may be missing the opportunity to carry out 56 operations (4 per cycle across ~14 pipeline stages) for each branch miss.
20
+
21
+ Many code-bases make use of SIMD and deeply pipelined, "non-branchy", processing for regular tasks. Numerical problems
22
+ (e.g. "matrix multiply") or simple 'bulk search' tasks (e.g. "count all the occurrences of a given character in a text",
23
+ "find the first occurrence of the string 'foo' in a text") frequently use this class of techniques. We are demonstrating
24
+ that these techniques can be applied to much more complex and less regular tasks.
25
+
26
+ ## Design:
27
+
28
+ ### Stage 1: SIMD over bytes; bit vector processing over bytes.
29
+
30
+ The first stage of our processing must identify key points in our input: the 'structural characters' of JSON (curly and
31
+ square braces, colon, and comma), the start and end of strings as delineated by double quote characters, other JSON 'atoms'
32
+ that are not distinguishable by simple characters (constructs such as "true", "false", "null" and numbers), as well as
33
+ discovering these characters and atoms in the presence of both quoting conventions and backslash escaping conventions.
34
+
35
+ As such we follow the broad outline of the construction of a structural index as set forth in the Mison paper [XXX]; first,
36
+ the discovery of odd-length sequences of backslash characters (which will cause quote characters immediately following to
37
+ be escaped and not serve their quoting role but instead be literal characters), second, the discovery of quote pairs (which
38
+ cause structural characters within the quote pairs to also be merely literal characters and have no function as structural
39
+ characters), then finally the discovery of structural characters not contained within the quote pairs.
40
+
41
+ We depart from the Mison paper in terms of method and overall design. In terms of method, the Mison paper uses iteration
42
+ over bit vectors to discover backslash sequences and quote pairs; we introduce branch-free techniques to discover both of
43
+ these properties.
44
+
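The branch-free discovery of escaped quotes and quote pairs described above can be sketched for a single block of up to 64 bytes. This is a minimal illustration under stated assumptions, not the library's code: the names `byte_mask`, `follows_odd_backslash_run`, `prefix_xor`, `StringMasks` and `find_strings` are hypothetical, `byte_mask` is a scalar stand-in for a SIMD compare, no state is carried between blocks, and `prefix_xor` uses portable shifts where the `-mpclmul` flag in the Makefile suggests a carry-less multiply could serve the same purpose.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Scalar stand-in for a SIMD compare: bit i is set when block[i] == c.
// Only the first 64 bytes of the block are examined.
inline uint64_t byte_mask(std::string_view block, char c) {
  uint64_t mask = 0;
  const std::size_t n = block.size() < 64 ? block.size() : 64;
  for (std::size_t i = 0; i < n; i++) {
    mask |= uint64_t(block[i] == c) << i;
  }
  return mask;
}

// Given the backslash bitmask, returns the positions that immediately
// follow an odd-length run of backslashes. No loop over the runs is needed.
inline uint64_t follows_odd_backslash_run(uint64_t bs) {
  const uint64_t even_bits = 0x5555555555555555ULL;
  const uint64_t odd_bits = ~even_bits;
  const uint64_t starts = bs & ~(bs << 1);        // first backslash of each run
  const uint64_t even_starts = starts & even_bits;
  const uint64_t odd_starts = starts & odd_bits;
  // Adding a run's start bit to the run ripples a carry to the first
  // position after the run (overflow is discarded: single block only).
  const uint64_t even_carry_ends = (bs + even_starts) & ~bs;
  const uint64_t odd_carry_ends = (bs + odd_starts) & ~bs;
  // A run has odd length when its start and end positions differ in parity.
  return (even_carry_ends & odd_bits) | (odd_carry_ends & even_bits);
}

// Bit i of the result is the XOR of bits 0..i of x.
inline uint64_t prefix_xor(uint64_t x) {
  x ^= x << 1;
  x ^= x << 2;
  x ^= x << 4;
  x ^= x << 8;
  x ^= x << 16;
  x ^= x << 32;
  return x;
}

struct StringMasks {
  uint64_t quotes;     // unescaped quote characters
  uint64_t in_string;  // opening quote up to (not including) the closing quote
};

inline StringMasks find_strings(std::string_view block) {
  const uint64_t escaped = follows_odd_backslash_run(byte_mask(block, '\\'));
  const uint64_t quotes = byte_mask(block, '"') & ~escaped;
  return {quotes, prefix_xor(quotes)};
}
```

For the six bytes `"a\"b"`, `find_strings` reports quotes at positions 0 and 5 and marks positions 0 through 4 as inside the string; the quote at position 3 is dropped because it follows a single backslash.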
45
+ We also make use of our ability to quickly detect whitespace in this early stage. We can use another bit-vector based
46
+ transformation to discover locations in our data that follow a structural character or quote or whitespace and are not whitespace. Excluding locations within strings, and the structural characters we have already discovered,
47
+ these locations are the only place that we can expect to see the starts of the JSON 'atoms'. These locations are thus
48
+ treated as 'structural' ('pseudo-structural characters').
49
+
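The pseudo-structural transformation above amounts to a shift and a few mask operations. The following is a sketch only, with hypothetical names (`mask_of`, `atom_start_candidates`). It assumes a single block of at most 64 bytes, takes the in-string mask as an input computed elsewhere, uses a scalar loop in place of SIMD classification, and treats position 0 as if it followed whitespace.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Scalar stand-in for SIMD classification: bit i is set when block[i]
// is one of the characters in `chars`. Only the first 64 bytes are examined.
inline uint64_t mask_of(std::string_view block, std::string_view chars) {
  uint64_t mask = 0;
  const std::size_t n = block.size() < 64 ? block.size() : 64;
  for (std::size_t i = 0; i < n; i++) {
    mask |= uint64_t(chars.find(block[i]) != std::string_view::npos) << i;
  }
  return mask;
}

// Returns the positions where a JSON atom (true, false, null, number) may start.
// `in_string` marks bytes inside strings (opening quote up to, but not
// including, the closing quote).
inline uint64_t atom_start_candidates(std::string_view block,
                                      uint64_t in_string = 0) {
  const std::size_t n = block.size() < 64 ? block.size() : 64;
  const uint64_t valid = n == 64 ? ~0ULL : ((1ULL << n) - 1);
  const uint64_t structurals = mask_of(block, "{}[]:,") & ~in_string;
  const uint64_t quotes = mask_of(block, "\"");
  const uint64_t whitespace = mask_of(block, " \t\n\r");
  // Anything that may legally precede the start of an atom.
  const uint64_t predecessors = structurals | quotes | whitespace;
  // Shift by one so each bit marks the position after a predecessor;
  // the low bit assumes the block starts a document.
  const uint64_t follows = (predecessors << 1) | 1ULL;
  return follows & ~whitespace & ~in_string & ~structurals & ~quotes & valid;
}
```

For the input `[1, true]` the result has bits 1 and 4 set, the positions of `1` and `t`.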
50
+ This stage involves either SIMD processing over bytes or the manipulation of bit arrays that have 1 bit corresponding
51
+ to 1 byte of input. As such, it can be quite inefficient for some inputs - it is possible to observe dozens of operations
52
+ taking place to discover that there are in fact no odd-length sequences of backslashes or quotes in a given block of
53
+ input. However, this inefficiency on such inputs is balanced by the fact that it costs no more to run this code over
54
+ complex structured input, and the alternatives would generally involve running a number of unpredictable branches (for
55
+ example, the loop branches in Mison that iterate over bit vectors).
56
+
57
+ ### Stage 2: The transition from "SIMD over bytes" to "indices"
58
+
59
+ Our structural, pseudo-structural and other 'interesting' characters are relatively rare (TODO: quantify in detail -
60
+ it's typically about 1 in 10). As such, continuing to process them as bit vectors will involve manipulating data structures
61
+ that are relatively large as well as being fairly unpredictably spaced. We must transform these bitvectors of "interesting"
62
+ locations into offsets.
63
+
64
+ Note that we can examine the character at the offset to discover what the original function of the item in the bitvector
65
+ was. While the JSON structural characters and quotes are relatively self-explanatory (although working only with one offset
66
+ at a time, we have lost the distinction between opening quotes and closing quotes, something that was available in Stage 1),
67
+ it is a quirk of JSON that the legal atoms can all be distinguished from each other by their first character - 't' for
68
+ 'true', 'f' for 'false', 'n' for 'null' and the character class [0-9-] for numerical values.
69
+
70
+ Thus, the offset suffices, as long as we retain our original input.
71
+
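Recovering the role of an offset from the retained input can be sketched as a switch on the first character. The enum `TokenKind` and the function `kind_at` are illustrative names, not part of the library, and the first character only suggests the kind: the atom itself still has to be validated afterwards.

```cpp
#include <cstddef>
#include <string_view>

enum class TokenKind { Structural, String, True, False, Null, Number, Invalid };

// Classifies the item that starts at `offset` by looking at one character.
inline TokenKind kind_at(std::string_view input, std::size_t offset) {
  if (offset >= input.size()) return TokenKind::Invalid;
  switch (input[offset]) {
    case '{': case '}': case '[': case ']': case ':': case ',':
      return TokenKind::Structural;
    case '"':
      return TokenKind::String;  // opening vs. closing is not known here
    case 't':
      return TokenKind::True;
    case 'f':
      return TokenKind::False;
    case 'n':
      return TokenKind::Null;
    case '-':
    case '0': case '1': case '2': case '3': case '4':
    case '5': case '6': case '7': case '8': case '9':
      return TokenKind::Number;
    default:
      return TokenKind::Invalid;
  }
}
```

With the input `[1, true]`, offset 1 classifies as `Number` and offset 4 as `True`.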
72
+ Our current implementation involves a straightforward transformation of bitmaps to indices by use of the 'count trailing
73
+ zeros' operation and the well-known operation to clear the lowest set bit. Note that this implementation introduces an
74
+ unpredictable branch; unless there is a regular pattern in our bitmaps, we would expect to have at least one branch miss
75
+ for each bitmap.
76
+
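The bitmap-to-index transformation described above can be sketched as follows. The function name `flatten_bits` and its signature are assumptions made for illustration, and `__builtin_ctzll` is the GCC/Clang builtin for 'count trailing zeros' (other compilers need a different intrinsic).

```cpp
#include <cstdint>
#include <vector>

// Appends to `out` the absolute offset of every set bit in `bits`,
// where bit 0 corresponds to input offset `base`.
inline void flatten_bits(uint64_t bits, uint32_t base,
                         std::vector<uint32_t>& out) {
  // The loop exit is the unpredictable branch: the number of iterations
  // equals the number of set bits, which varies from bitmap to bitmap.
  while (bits != 0) {
    out.push_back(base + static_cast<uint32_t>(__builtin_ctzll(bits)));
    bits &= bits - 1;  // clear the lowest set bit
  }
}
```

Calling `flatten_bits(0x12, 64, out)` on an empty vector leaves `out` holding the offsets 65 and 68.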
77
+ ### Stage 3: Operation over indices
78
+
79
+ This now works over a dual structure.
80
+
81
+ 1. The "state machine", whose role it is to validate the sequence of structural characters and ensure that the input is at least generally structured like valid JSON (after this stage, the only errors permissible should be malformed atoms and numbers). If and only if the "state machine" reached all accept states, then,
82
+
83
+ 2. The "tape machine" will have produced valid output. The tape machine works blindly over characters writing records to tapes. These records create a lean but somewhat traversable linked structure that, for valid inputs, should represent what we need to know about the JSON input.
84
+
85
+ FIXME: a lot more detail is required on the operation of both these machines.