laser 0.7.0.pre1
- data/.document +5 -0
- data/.rspec +1 -0
- data/Gemfile +14 -0
- data/LICENSE +661 -0
- data/README.md +158 -0
- data/Rakefile +104 -0
- data/VERSION +1 -0
- data/bin/laser +7 -0
- data/design_docs/goals.md +57 -0
- data/design_docs/object_regex.md +426 -0
- data/design_docs/type_annotations.md +80 -0
- data/ext/laser/BasicBlock.cpp +572 -0
- data/ext/laser/BasicBlock.h +118 -0
- data/ext/laser/extconf.rb +3 -0
- data/features/laser.feature +25 -0
- data/features/step_definitions/laser_steps.rb +39 -0
- data/features/support/env.rb +14 -0
- data/features/support/testdata/1_input +1 -0
- data/features/support/testdata/1_output +1 -0
- data/features/support/testdata/2_input +4 -0
- data/features/support/testdata/2_output +4 -0
- data/features/support/testdata/3_input +8 -0
- data/features/support/testdata/3_output +11 -0
- data/features/support/testdata/4_input +5 -0
- data/features/support/testdata/4_output +5 -0
- data/features/support/testdata/5_input +13 -0
- data/laser.gemspec +382 -0
- data/lib/laser.rb +98 -0
- data/lib/laser/analysis/annotations.rb +95 -0
- data/lib/laser/analysis/annotations/annotation_config.yaml +3 -0
- data/lib/laser/analysis/annotations/comment_attachment_annotation.rb +66 -0
- data/lib/laser/analysis/annotations/node_pointers_annotation.rb +36 -0
- data/lib/laser/analysis/annotations/runtime_annotation.rb +55 -0
- data/lib/laser/analysis/argument_expansion.rb +132 -0
- data/lib/laser/analysis/arity.rb +34 -0
- data/lib/laser/analysis/bindings.rb +144 -0
- data/lib/laser/analysis/bootstrap/bootstrap.rb +298 -0
- data/lib/laser/analysis/bootstrap/laser_class.rb +106 -0
- data/lib/laser/analysis/bootstrap/laser_method.rb +255 -0
- data/lib/laser/analysis/bootstrap/laser_module.rb +403 -0
- data/lib/laser/analysis/bootstrap/laser_module_copy.rb +74 -0
- data/lib/laser/analysis/bootstrap/laser_object.rb +69 -0
- data/lib/laser/analysis/bootstrap/laser_proc.rb +150 -0
- data/lib/laser/analysis/bootstrap/laser_singleton_class.rb +44 -0
- data/lib/laser/analysis/comments.rb +35 -0
- data/lib/laser/analysis/control_flow.rb +28 -0
- data/lib/laser/analysis/control_flow/alias_analysis.rb +31 -0
- data/lib/laser/analysis/control_flow/basic_block.rb +105 -0
- data/lib/laser/analysis/control_flow/cfg_builder.rb +2505 -0
- data/lib/laser/analysis/control_flow/cfg_instruction.rb +190 -0
- data/lib/laser/analysis/control_flow/constant_propagation.rb +742 -0
- data/lib/laser/analysis/control_flow/control_flow_graph.rb +370 -0
- data/lib/laser/analysis/control_flow/lifetime_analysis.rb +91 -0
- data/lib/laser/analysis/control_flow/method_call_search.rb +26 -0
- data/lib/laser/analysis/control_flow/raise_properties.rb +25 -0
- data/lib/laser/analysis/control_flow/simulation.rb +385 -0
- data/lib/laser/analysis/control_flow/static_single_assignment.rb +185 -0
- data/lib/laser/analysis/control_flow/unreachability_analysis.rb +57 -0
- data/lib/laser/analysis/control_flow/unused_variables.rb +91 -0
- data/lib/laser/analysis/control_flow/yield_properties.rb +103 -0
- data/lib/laser/analysis/errors.rb +131 -0
- data/lib/laser/analysis/laser_utils.rb +18 -0
- data/lib/laser/analysis/lexical_analysis.rb +172 -0
- data/lib/laser/analysis/method_call.rb +68 -0
- data/lib/laser/analysis/protocol_registry.rb +30 -0
- data/lib/laser/analysis/scope.rb +118 -0
- data/lib/laser/analysis/sexp.rb +159 -0
- data/lib/laser/analysis/sexp_analysis.rb +40 -0
- data/lib/laser/analysis/sexp_extensions/constant_extraction.rb +115 -0
- data/lib/laser/analysis/sexp_extensions/source_location.rb +164 -0
- data/lib/laser/analysis/sexp_extensions/type_inference.rb +47 -0
- data/lib/laser/analysis/signature.rb +76 -0
- data/lib/laser/analysis/special_methods/send.rb +67 -0
- data/lib/laser/analysis/unused_methods.rb +21 -0
- data/lib/laser/analysis/visitor.rb +141 -0
- data/lib/laser/annotation_parser/annotations.treetop +126 -0
- data/lib/laser/annotation_parser/annotations_parser.rb +748 -0
- data/lib/laser/annotation_parser/class_annotations.treetop +82 -0
- data/lib/laser/annotation_parser/class_annotations_parser.rb +654 -0
- data/lib/laser/annotation_parser/overload.treetop +24 -0
- data/lib/laser/annotation_parser/overload_parser.rb +167 -0
- data/lib/laser/annotation_parser/parsers.rb +6 -0
- data/lib/laser/annotation_parser/structural.treetop +37 -0
- data/lib/laser/annotation_parser/structural_parser.rb +406 -0
- data/lib/laser/annotation_parser/useful_parsers.treetop +47 -0
- data/lib/laser/annotation_parser/useful_parsers_parser.rb +674 -0
- data/lib/laser/rake/task.rb +46 -0
- data/lib/laser/runner.rb +189 -0
- data/lib/laser/scanner.rb +169 -0
- data/lib/laser/standard_library/_thread.rb +110 -0
- data/lib/laser/standard_library/abbrev.rb +103 -0
- data/lib/laser/standard_library/array.rb +418 -0
- data/lib/laser/standard_library/base64.rb +91 -0
- data/lib/laser/standard_library/basic_object.rb +55 -0
- data/lib/laser/standard_library/benchmark.rb +556 -0
- data/lib/laser/standard_library/bignum.rb +185 -0
- data/lib/laser/standard_library/cgi.rb +275 -0
- data/lib/laser/standard_library/cgi/cookie.rb +147 -0
- data/lib/laser/standard_library/cgi/core.rb +791 -0
- data/lib/laser/standard_library/cgi/html.rb +1021 -0
- data/lib/laser/standard_library/cgi/session.rb +537 -0
- data/lib/laser/standard_library/cgi/session/pstore.rb +111 -0
- data/lib/laser/standard_library/cgi/util.rb +188 -0
- data/lib/laser/standard_library/class_definitions.rb +333 -0
- data/lib/laser/standard_library/comparable.rb +125 -0
- data/lib/laser/standard_library/complex.rb +162 -0
- data/lib/laser/standard_library/enumerable.rb +178 -0
- data/lib/laser/standard_library/exceptions.rb +135 -0
- data/lib/laser/standard_library/fixnum.rb +188 -0
- data/lib/laser/standard_library/float.rb +180 -0
- data/lib/laser/standard_library/hash.rb +237 -0
- data/lib/laser/standard_library/integer.rb +123 -0
- data/lib/laser/standard_library/laser_magic.rb +7 -0
- data/lib/laser/standard_library/nil_false_true.rb +113 -0
- data/lib/laser/standard_library/numbers.rb +192 -0
- data/lib/laser/standard_library/proc.rb +31 -0
- data/lib/laser/standard_library/set.rb +1348 -0
- data/lib/laser/standard_library/string.rb +666 -0
- data/lib/laser/standard_library/stringio.rb +2 -0
- data/lib/laser/standard_library/symbol.rb +125 -0
- data/lib/laser/standard_library/tsort.rb +242 -0
- data/lib/laser/support/acts_as_struct.rb +66 -0
- data/lib/laser/support/frequency.rb +55 -0
- data/lib/laser/support/inheritable_attributes.rb +145 -0
- data/lib/laser/support/module_extensions.rb +94 -0
- data/lib/laser/support/placeholder_object.rb +13 -0
- data/lib/laser/third_party/rgl/adjacency.rb +221 -0
- data/lib/laser/third_party/rgl/base.rb +228 -0
- data/lib/laser/third_party/rgl/bidirectional.rb +39 -0
- data/lib/laser/third_party/rgl/condensation.rb +47 -0
- data/lib/laser/third_party/rgl/connected_components.rb +138 -0
- data/lib/laser/third_party/rgl/control_flow.rb +170 -0
- data/lib/laser/third_party/rgl/depth_first_spanning_tree.rb +37 -0
- data/lib/laser/third_party/rgl/dominators.rb +124 -0
- data/lib/laser/third_party/rgl/dot.rb +93 -0
- data/lib/laser/third_party/rgl/graphxml.rb +51 -0
- data/lib/laser/third_party/rgl/implicit.rb +174 -0
- data/lib/laser/third_party/rgl/mutable.rb +117 -0
- data/lib/laser/third_party/rgl/rdot.rb +445 -0
- data/lib/laser/third_party/rgl/topsort.rb +72 -0
- data/lib/laser/third_party/rgl/transitivity.rb +180 -0
- data/lib/laser/third_party/rgl/traversal.rb +348 -0
- data/lib/laser/types/types.rb +433 -0
- data/lib/laser/version.rb +14 -0
- data/lib/laser/warning.rb +149 -0
- data/lib/laser/warning_sets/default.yml +13 -0
- data/lib/laser/warnings/assignment_in_condition.rb +20 -0
- data/lib/laser/warnings/comment_spacing.rb +31 -0
- data/lib/laser/warnings/extra_blank_lines.rb +30 -0
- data/lib/laser/warnings/extra_whitespace.rb +16 -0
- data/lib/laser/warnings/hash_symbol_18_warning.rb +63 -0
- data/lib/laser/warnings/hash_symbol_19_warning.rb +29 -0
- data/lib/laser/warnings/line_length.rb +115 -0
- data/lib/laser/warnings/misaligned_unindentation.rb +17 -0
- data/lib/laser/warnings/operator_spacing.rb +68 -0
- data/lib/laser/warnings/parens_on_declaration.rb +30 -0
- data/lib/laser/warnings/rescue_exception.rb +42 -0
- data/lib/laser/warnings/semicolon.rb +25 -0
- data/lib/laser/warnings/sexp_errors.rb +24 -0
- data/lib/laser/warnings/uncalled_method_warning.rb +7 -0
- data/lib/laser/warnings/useless_double_quotes.rb +38 -0
- data/spec/analysis_specs/annotations_spec.rb +47 -0
- data/spec/analysis_specs/annotations_specs/comment_attachment_spec.rb +68 -0
- data/spec/analysis_specs/annotations_specs/node_pointers_annotation_spec.rb +90 -0
- data/spec/analysis_specs/annotations_specs/runtime_annotation_spec.rb +135 -0
- data/spec/analysis_specs/annotations_specs/spec_helper.rb +33 -0
- data/spec/analysis_specs/argument_expansion_spec.rb +113 -0
- data/spec/analysis_specs/bindings_spec.rb +36 -0
- data/spec/analysis_specs/comment_spec.rb +93 -0
- data/spec/analysis_specs/control_flow_specs/cfg_instruction_spec.rb +111 -0
- data/spec/analysis_specs/control_flow_specs/constant_propagation_spec.rb +560 -0
- data/spec/analysis_specs/control_flow_specs/control_flow_graph_spec.rb +5 -0
- data/spec/analysis_specs/control_flow_specs/raise_properties_spec.rb +310 -0
- data/spec/analysis_specs/control_flow_specs/raise_type_inference_spec.rb +301 -0
- data/spec/analysis_specs/control_flow_specs/return_type_inference_spec.rb +431 -0
- data/spec/analysis_specs/control_flow_specs/simulation_spec.rb +158 -0
- data/spec/analysis_specs/control_flow_specs/spec_helper.rb +110 -0
- data/spec/analysis_specs/control_flow_specs/tuple_misuse_inference_spec.rb +125 -0
- data/spec/analysis_specs/control_flow_specs/unreachability_analysis_spec.rb +76 -0
- data/spec/analysis_specs/control_flow_specs/unused_variable_spec.rb +99 -0
- data/spec/analysis_specs/control_flow_specs/yield_properties_spec.rb +372 -0
- data/spec/analysis_specs/error_spec.rb +30 -0
- data/spec/analysis_specs/laser_class_spec.rb +322 -0
- data/spec/analysis_specs/lexical_analysis_spec.rb +184 -0
- data/spec/analysis_specs/protocol_registry_spec.rb +63 -0
- data/spec/analysis_specs/scope_annotation_spec.rb +1013 -0
- data/spec/analysis_specs/scope_spec.rb +126 -0
- data/spec/analysis_specs/sexp_analysis_spec.rb +30 -0
- data/spec/analysis_specs/sexp_extension_specs/constant_extraction_spec.rb +309 -0
- data/spec/analysis_specs/sexp_extension_specs/source_location_spec.rb +231 -0
- data/spec/analysis_specs/sexp_extension_specs/spec_helper.rb +1 -0
- data/spec/analysis_specs/sexp_extension_specs/type_inference_spec.rb +252 -0
- data/spec/analysis_specs/sexp_spec.rb +167 -0
- data/spec/analysis_specs/spec_helper.rb +27 -0
- data/spec/analysis_specs/unused_methods_spec.rb +65 -0
- data/spec/analysis_specs/visitor_spec.rb +64 -0
- data/spec/annotation_parser_specs/annotations_parser_spec.rb +89 -0
- data/spec/annotation_parser_specs/class_annotation_parser_spec.rb +120 -0
- data/spec/annotation_parser_specs/overload_parser_spec.rb +39 -0
- data/spec/annotation_parser_specs/parsers_spec.rb +14 -0
- data/spec/annotation_parser_specs/spec_helper.rb +1 -0
- data/spec/annotation_parser_specs/structural_parser_spec.rb +67 -0
- data/spec/laser_spec.rb +14 -0
- data/spec/rake_specs/spec_helper.rb +1 -0
- data/spec/rake_specs/task_spec.rb +67 -0
- data/spec/runner_spec.rb +207 -0
- data/spec/scanner_spec.rb +75 -0
- data/spec/spec_helper.rb +121 -0
- data/spec/standard_library/exceptions_spec.rb +19 -0
- data/spec/standard_library/globals_spec.rb +14 -0
- data/spec/standard_library/set_spec.rb +31 -0
- data/spec/standard_library/spec_helper.rb +1 -0
- data/spec/standard_library/standard_library_spec.rb +302 -0
- data/spec/support_specs/acts_as_struct_spec.rb +94 -0
- data/spec/support_specs/frequency_spec.rb +23 -0
- data/spec/support_specs/module_extensions_spec.rb +117 -0
- data/spec/support_specs/spec_helper.rb +1 -0
- data/spec/type_specs/spec_helper.rb +1 -0
- data/spec/type_specs/types_spec.rb +133 -0
- data/spec/warning_spec.rb +95 -0
- data/spec/warning_specs/assignment_in_condition_spec.rb +68 -0
- data/spec/warning_specs/comment_spacing_spec.rb +65 -0
- data/spec/warning_specs/extra_blank_lines_spec.rb +70 -0
- data/spec/warning_specs/extra_whitespace_spec.rb +33 -0
- data/spec/warning_specs/hash_symbol_18_warning_spec.rb +89 -0
- data/spec/warning_specs/hash_symbol_19_warning_spec.rb +63 -0
- data/spec/warning_specs/line_length_spec.rb +173 -0
- data/spec/warning_specs/misaligned_unindentation_spec.rb +35 -0
- data/spec/warning_specs/operator_spacing_spec.rb +104 -0
- data/spec/warning_specs/parens_on_declaration_spec.rb +57 -0
- data/spec/warning_specs/rescue_exception_spec.rb +105 -0
- data/spec/warning_specs/semicolon_spec.rb +58 -0
- data/spec/warning_specs/spec_helper.rb +1 -0
- data/spec/warning_specs/useless_double_quotes_spec.rb +74 -0
- data/status_reports/2010/12/2010-12-14.md +163 -0
- data/status_reports/2010/12/2010-12-23.md +298 -0
- data/status_reports/2010/12/2010-12-24.md +6 -0
- data/test/third_party_tests/rgl_tests/TestComponents.rb +65 -0
- data/test/third_party_tests/rgl_tests/TestCycles.rb +61 -0
- data/test/third_party_tests/rgl_tests/TestDirectedGraph.rb +125 -0
- data/test/third_party_tests/rgl_tests/TestDot.rb +18 -0
- data/test/third_party_tests/rgl_tests/TestEdge.rb +34 -0
- data/test/third_party_tests/rgl_tests/TestGraph.rb +71 -0
- data/test/third_party_tests/rgl_tests/TestGraphXML.rb +57 -0
- data/test/third_party_tests/rgl_tests/TestImplicit.rb +52 -0
- data/test/third_party_tests/rgl_tests/TestRdot.rb +863 -0
- data/test/third_party_tests/rgl_tests/TestTransitivity.rb +129 -0
- data/test/third_party_tests/rgl_tests/TestTraversal.rb +220 -0
- data/test/third_party_tests/rgl_tests/TestUnDirectedGraph.rb +102 -0
- data/test/third_party_tests/rgl_tests/examples/north/Graph.log +128 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.0.graphml +28 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.1.graphml +28 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.11.graphml +31 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.12.graphml +27 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.13.graphml +27 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.14.graphml +27 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.15.graphml +26 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.16.graphml +26 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.17.graphml +26 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.19.graphml +37 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.2.graphml +28 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.20.graphml +38 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.22.graphml +43 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.24.graphml +30 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.25.graphml +45 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.27.graphml +38 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.28.graphml +30 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.29.graphml +38 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.3.graphml +26 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.30.graphml +34 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.31.graphml +42 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.34.graphml +42 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.37.graphml +28 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.38.graphml +38 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.39.graphml +36 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.4.graphml +26 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.40.graphml +37 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.41.graphml +37 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.42.graphml +26 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.45.graphml +28 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.46.graphml +32 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.5.graphml +31 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.50.graphml +30 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.56.graphml +29 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.57.graphml +32 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.58.graphml +32 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.6.graphml +26 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.60.graphml +32 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.61.graphml +34 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.62.graphml +34 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.68.graphml +30 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.69.graphml +32 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.7.graphml +29 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.70.graphml +26 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.71.graphml +27 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.72.graphml +28 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.74.graphml +29 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.75.graphml +29 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.78.graphml +27 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.79.graphml +34 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.8.graphml +29 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.80.graphml +34 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.82.graphml +35 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.83.graphml +32 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.85.graphml +34 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.86.graphml +34 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.88.graphml +37 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.89.graphml +29 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.9.graphml +26 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.90.graphml +32 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.91.graphml +31 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.92.graphml +26 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.93.graphml +32 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.10.94.graphml +34 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.12.8.graphml +40 -0
- data/test/third_party_tests/rgl_tests/examples/north/g.14.9.graphml +36 -0
- data/test/third_party_tests/rgl_tests/test_helper.rb +7 -0
- data/test/third_party_tests/test_inheritable_attributes.rb +187 -0
- metadata +470 -0
data/README.md
ADDED
@@ -0,0 +1,158 @@
LASER: Lexically- and Semantically-Enriched Ruby
================================================

**Homepage**: [http://carboni.ca/projects/p/laser](http://carboni.ca/)
**IRC**: [irc.freenode.net / #laser](irc://irc.freenode.net/laser)
**Git**: [http://github.com/michaeledgar/laser](http://github.com/michaeledgar/laser)
**Author**: Michael Edgar
**Copyright**: 2011
**License**: AGPL v3.0 w/ Commercial Exceptions. See below.
**Latest Version**: 0.7.0pre1
**Release Date**: None, yet.

Synopsis
--------

LASER is a tool to analyze the lexical structure and semantic meaning of your
Ruby programs. It will be able to discover bugs that Ruby only encounters at
run-time, and it can discover properties about your code that no pre-existing tools
can, such as whether a given block of code raises, which methods are private,
whether a method call could require a block, and so on. It provides warnings
as well as errors for *potentially* error-prone code, such as:

    if x = 5
      # well, x is 5 *now*
    end

Naturally, all warnings can be ignored on a case-by-case basis with inline comments
and turned off completely via command-line switches.

Feature List
------------

Details are always forthcoming, but:

**1. Optional Type System** - taking some cues from Gilad Bracha's Strongtalk.
**2. Style Fixing** - There are many style no-nos in Ruby. LASER can find them *and* fix them,
like similar linting tools for other languages.
**3. Common Semantic Analyses** - dead-code discovery, yield-ability, raise-ability,
unused variables/arguments, and so on.
**4. Documentation Generation** - By this, I mean inserting comments in your code documenting
it. I don't want to try to replace YARD, which has already done tons of work in parsing docs
and generating beautiful output as a result. But LASER can definitely, say, insert a
`@raise [SystemExitError]` when it detects a call to `Kernel#exit`!
**5. Pluggable Annotation Parsers** - to get the most out of LASER, you may wish to
annotate your code with types or arbitrary properties (such as method purity/impurity,
visibility, etc.). This requires an annotation syntax, which of course will lead to religious
wars. So I'll be including the syntax *I* would like, as well as a parser for YARD-style
annotations.
**6. Ruby 1.9+ only** - Yep, LASER will only run on Ruby 1.9, and it'll expect its target
code to be Ruby 1.9. Of course, since any 1.8 code will still parse just fine, the only issues
that will come up are API differences (looking at you, `String`).
**7. Reusable Semantic Information** - I don't want a new AST format. I don't like the one
provided by RubyParser and co. So I'm sticking with Ripper's AST format. It has quirks, but
I prefer it, and it's part of the standard library. LASER works by creating an Array subclass
called `Sexp` that wraps the results of a Ripper parse and *does not modify its contents*. So anyone
expecting a typical Ripper AST can use the results of LASER's analysis. The `Sexp` subclass then
has a variety of accessor methods created on it that contain the results of static analysis
(a rough sketch of the idea follows below).

More to come here.
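To make point 7 a bit more concrete, here is a rough, illustrative sketch of the idea. This is *not* LASER's actual `Sexp` class, just a toy `Array` subclass showing how Ripper output can be wrapped without modifying its contents:

```
require 'ripper'

# Toy stand-in for LASER's Sexp: wrap Ripper's nested arrays verbatim,
# then layer accessors on top of the untouched structure.
class WrappedSexp < Array
  def initialize(ripper_node)
    super()
    ripper_node.each do |child|
      push(child.is_a?(Array) ? WrappedSexp.new(child) : child)
    end
  end

  # Example accessor layered on top; analysis results would hang off
  # methods like this one.
  def node_type
    first
  end
end

tree = WrappedSexp.new(Ripper.sexp('x = 5'))
tree.node_type  # => :program
```

Since the wrapper is still an `Array` with the same nesting and the same contents, code written against plain Ripper sexps keeps working on it.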

Installing
----------

To install LASER, use the following command:

    $ gem install laser --prerelease

(Add `sudo` if you're installing to a directory requiring root privileges to write)

Usage
-----

There are a couple of ways to use LASER. It has a command-line implementation,
and a Rake task.

The command-line implementation is still having its flags worked out for usability -
right now, there's some flexibility, but they're a huge pain to use. Also, the style-related
analyses are handled slightly differently from semantic analyses. So bear with me.

When analyzing for semantic issues, `require`s and `load`s are *always* followed. This
may become a command-line flag in the future, but it isn't now.

When analyzing for style issues, the file in question must be listed on the command line.

Example runs:

```
$ cat temp.rb
class Foo
  def initialize(x, *args)
    a, b = args[1..2]
  end
end
Foo.new(gets, gets)

$ laser temp.rb
4 warnings found. 0 are fixable.
================================
(stdin):3 Error (4) - Variable defined but not used: x
(stdin):3 Error (6) - LHS never assigned - defaults to nil
(stdin):3 Error (4) - Variable defined but not used: a
(stdin):3 Error (4) - Variable defined but not used: b
```

Cool! If you want to specify a set of warnings to consider, you can use the `--only` flag. And
if you want style errors to be fixed, use `--fix`. For example:

```
$ cat tempstyle.rb
x = 0
x+=10 # extra space at the end of this line
# blank lines following


$ laser --only OperatorSpacing,ExtraBlankLinesWarning,InlineCommentSpaceWarning,ExtraWhitespaceWarning --fix tempstyle.rb
4 warnings found. 4 are fixable.
================================
tempstyle.rb:0 Extra blank lines (1) - This file has 3 blank lines at the end of it.
tempstyle.rb:2 Inline comment spacing error () - Inline comments must be exactly 2 spaces from code.
tempstyle.rb:2 Extra Whitespace (2) - The line has trailing whitespace.
tempstyle.rb:2 No operator spacing (5) - Insufficient spacing around +=

$ cat tempstyle.rb
x = 0
x += 10 # extra space at the end of this line
# blank lines following$ (prompt)
```

What happened there is:

1. Inline comments were set to 2 spaces away from their line of code. This will be configurable in the future.
2. The `+=` operator was properly spaced.
3. The extra spaces at the end of line 2 were removed.
4. The blank lines at the end of the file were removed.

Cool! Of course, all those would have happened if you just ran `laser --fix tempstyle.rb`, but I wanted to demonstrate
how to specify individual warnings. Again, that's going to have to be made a lot easier - I've experimented with giving
each warning a "short name" that gets emitted alongside the warning, but that has some discoverability issues. We'll see
where that goes.
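If you'd rather drive LASER from Rake, the gem ships a `Laser::Rake::LaserTask`; the snippet below is trimmed from this gem's own Rakefile, so treat the specific options as an illustration rather than the full story:

```
require 'laser'

Laser::Rake::LaserTask.new(:laser) do |laser|
  laser.libs << 'lib' << 'spec'
  laser.using << :all
  laser.options = '--debug --fix'
end
```

With a task defined like that, `rake laser` should run the analysis over the listed directories (the task name is whatever symbol you pass in).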

Changelog
---------

- **Jan.26.11**: Not publicizing LASER yet, but I figure I need a first entry in
  the changelog.
- **Jun.15.11**: [Thesis](http://www.cs.dartmouth.edu/reports/abstracts/TR2011-686/) published
  based on Laser. License officially switching to AGPLv3 with commercial exceptions.
- **Aug.12.11**: First prerelease gem published, version 0.7.0pre1. Expect several iterations
  before 0.7 is finalized, and please report all bugs! Not all Ruby code will work!

Copyright
---------

LASER © 2011 by [Michael Edgar](mailto:adgar@carboni.ca).
By default, LASER is licensed under the AGPLv3;
see {file:LICENSE} for licensing details.
Alternative licensing arrangements are also possible;
contact [Michael Edgar](mailto:adgar@carboni.ca) to discuss your needs.
data/Rakefile
ADDED
@@ -0,0 +1,104 @@
require 'rubygems'
require 'rake'

# switch to false if the gem can't load
if true
  begin
    require 'laser'
    Laser::Rake::LaserTask.new(:laser) do |laser|
      laser.libs << 'lib' << 'spec'
      laser.using << :all << Laser::LineLengthMaximum(100) << Laser::LineLengthWarning(80)
      laser.options = '--debug --fix'
      laser.fix << Laser::ExtraBlankLinesWarning << Laser::ExtraWhitespaceWarning << Laser::LineLengthWarning(80)
    end
  rescue LoadError, Exception => err
    $:.unshift(File.dirname(__FILE__))
    require 'lib/laser/version'
    task :laser do
      abort 'Laser is not available. In order to run laser, you must: sudo gem install laser'
    end
  end
end

begin
  require 'jeweler'
  Jeweler::Tasks.new do |gem|
    gem.name = 'laser'
    gem.summary = %Q{Analysis and linting tool for Ruby.}
    gem.description = %Q{Laser is an advanced static analysis tool for Ruby.}
    gem.email = 'michael.j.edgar@dartmouth.edu'
    gem.homepage = 'http://github.com/michaeledgar/laser'
    gem.authors = ['Michael Edgar']
    gem.extensions = ['ext/laser/extconf.rb']
    gem.version = Laser::Version::STRING
    # gem is a Gem::Specification... see http://www.rubygems.org/read/chapter/20 for additional settings
  end
  Jeweler::GemcutterTasks.new
rescue LoadError
  puts 'Jeweler (or a dependency) not available. Install it with: gem install jeweler'
end

require 'rspec/core/rake_task'
RSpec::Core::RakeTask.new(:spec) do |spec|
  spec.pattern = FileList['spec/**/*_spec.rb']
end

task rcov: :default

require 'rake/testtask'
Rake::TestTask.new do |t|
  t.libs << 'test'
  t.test_files = FileList['test/**/test*.rb', 'test/**/Test*.rb']
  t.verbose = true
end

require 'cucumber'
require 'cucumber/rake/task'

Cucumber::Rake::Task.new(:features) do |t|
  t.cucumber_opts = "features --format pretty"
end

task rebuild: [:gemspec, :build, :install] do
  %x(rake laser)
end

#################### Parser rake tasks #####################

SRC = FileList['lib/laser/annotation_parser/*.treetop']
OBJ = SRC.sub(/.treetop$/, '_parser.rb')

SRC.each do |source|
  result = source.sub(/.treetop$/, '_parser.rb')
  file result => [source] do |t|
    sh "tt #{source} -o #{result}"
  end
end

task build_parsers: OBJ

#################### C++ Extension Rake Tasks ##############

require 'rbconfig'
file 'ext/laser/Makefile' => ['ext/laser/extconf.rb'] do |t|
  Dir.chdir('ext/laser') do
    ruby 'extconf.rb'
  end
end

dylib_name = "ext/laser/BasicBlock.#{RbConfig::CONFIG['DLEXT']}"
file dylib_name => ['ext/laser/Makefile'] do
  Dir.chdir('ext/laser') do
    sh 'make'
  end
end

desc 'Guarantees build-readiness'
task build_components: [:build_parsers, dylib_name]

# Alias for script/console from rails world lawlz
task sc: :build_components do
  system("irb -r./lib/laser")
end

task default: [:spec, :test]
data/VERSION
ADDED
@@ -0,0 +1 @@
0.6.0
data/design_docs/goals.md
ADDED
@@ -0,0 +1,57 @@
# LASER Goals

At some point, I need to specify what my actual goals are for this project.
Since the overall point is "static analysis," that's going to involve a wide
variety of things, and since I like being anal about style, I want to include
that too. So I figure: break things down into categories, assign them priorities,
check them off when I'm done.

## Annotations
1. Types of methods/variables
2. Purity/Mutation
3. Generated methods
4. Catch-all annotation of method_missing?

## Style
1. √ Assignment in Conditional (P0)
2. √ raise Exception (P0)
3. √ Inline-Comment Spacing (P0)
4. √ Line-length (P0)
5. √ Useless Whitespace (P0)
6. √ Parens in method declarations (P0)
7. √ Useless Double Quotes/%Q (P0)
8. √ Operator Spacing (P1)
9. Indentation (P2)
10. Require ! for methods that mutate state (P2)
11. Require ? for methods that always return booleans
12. MyStruct = Struct.new, not MyStruct < Struct.new

## General-Use Information
1. √ Private Method (P0)
2. √ Raisability (P0)
3. √ Yielding (P0)
4. √ Yield-necessity (P1)
5. Yield-count (P1)
6. Method Purity (P1)
7. Mutation Detection (P1)
8. √ Types of arguments/return types/variables

## Error Detection
1. √ NoSuchMethod detection
2. Incorrect # of arguments
3. √ including already-included module
4. √ extending already-extended module
5. √ super with wrong number of args
6. √ super when no super method exists
7. √ re-open class as module (and vice-versa)
8. Superclass mismatch
9. No block provided to method requiring one
10. Shadowing of really important methods (private = :xyz)
11. Type conflicts
12. √ Constants in for loops (P0)
13. Useless lhs or rhs in mlhs/mrhs

## Optimization
1. √ Dead Code Detection
2. √ Useless variable writes/reads detection
3. √ Constant Folding
data/design_docs/object_regex.md
ADDED
@@ -0,0 +1,426 @@
## Introduction

I present a small Ruby class which provides full Ruby Regexp matching on sequences of (potentially) heterogeneous objects, conditioned on those objects implementing a single, no-argument method returning a String. I propose it should be used to implement the desired behavior in the Ruby standard library.

## Motivation

So I'm hammering away at [Laser](http://github.com/michaeledgar/laser/), and I come across a situation: I need to parse out comments using Ripper's output.

I decided a while ago I wouldn't use [YARD](http://yardoc.org/)'s Ripper-based parser, as it returns [its own AST format](https://github.com/lsegal/yard/blob/master/lib/yard/parser/ruby/ruby_parser.rb). YARD has its own goals, so it's not surprising the standard output from Ripper was insufficient. However, I don't want to define a new AST format - we already have Ripper's, YARD's, and of course, the venerable [RubyParser/ParseTree format](http://parsetree.rubyforge.org/). I'm rambling: the point is, I'm using exact Ripper output, and there's no existing code to annotate a Ripper node with the comments immediately preceding it.

## Extracting Comments from Ruby

Since Ripper strips the comments out when you use `Ripper.sexp`, and I'm not going to switch to the SAX-model of parsing just for comments, I had to use `Ripper.lex` to grab the comments. I immediately found this would prove annoying:

{{{
pp Ripper.lex(" # some comment\n # another comment\n def abc; end")
}}}

gives

{{{
[[[1, 0], :on_sp, " "],
 [[1, 2], :on_comment, "# some comment\n"],
 [[2, 0], :on_sp, " "],
 [[2, 2], :on_comment, "# another comment\n"],
 [[3, 0], :on_sp, " "],
 [[3, 1], :on_kw, "def"],
 [[3, 4], :on_sp, " "],
 [[3, 5], :on_ident, "abc"],
 [[3, 8], :on_semicolon, ";"],
 [[3, 9], :on_sp, " "],
 [[3, 10], :on_kw, "end"]]
}}}

Naturally, Ripper is separating each line-comment into its own token, even those that follow on subsequent lines. I'd have to combine those comment tokens to get what a typical programmer considers one logical comment.

I didn't want to write an ugly, imperative algorithm to do this: part of the beauty of writing Ruby is you don't often have to actually write a `while` loop. I described my frustration to my roommate, and he quickly observed the obvious connection to regular expressions. That's when I remembered [Ripper.slice and Ripper.token_match](http://ruby-doc.org/ruby-1.9/classes/Ripper.html#M001274) (token_match is undocumented), which provide almost exactly what I needed:

{{{
Ripper.slice(" # some comment\n # another comment\n def abc; end",
             'comment (sp? comment)*')
# => "# some comment\n # another comment\n"
}}}

A few problems: `Ripper.slice` lexes its input on each invocation and then searches it from the start for one match. I need *all* matches. `Ripper.slice` also returns the exact string, and not the location in the source text of the match, which I need - how else will I know where the comments are? The lexer output includes line and column locations, so it should be easy to retrieve.

All this means an O(N) solution was not in sight using the built-in library functions. I was about to start doing some subclassing hacks, until I peeked at the source for `Ripper.slice` and saw it was too cool not to generalize.

## Formal Origins of `Ripper.slice`

The core of regular expressions - the [actually "regular" kind](http://en.wikipedia.org/wiki/Regular_expression#Definition) - corresponds directly to a [DFA](http://en.wikipedia.org/wiki/Deterministic_finite_automata) with an [alphabet](http://en.wikipedia.org/wiki/Alphabet_\(computer_science\)) equal to the character set being searched. Naturally, Ruby's `Regexp` engine offers many features that cannot be directly described by a DFA. Anyway, what I wanted was a way to perform the same searches, only with an alphabet of token types instead of characters.

We could construct a separate DFA engine for searching sequences of our new alphabet, but we'd much rather piggyback on an existing (and more-featured) implementation. Since the set of token types is countable, one can create a one-to-one mapping from token types to finite strings of an alphabet that Ruby's `Regexp` class can search, namely regular old characters. If we replace each occurrence of a member of our alphabet with a member of the target Regexp alphabet, then we should be able to use Regexp to do regex searching on our token sequence. That transformation on the token sequence is easy: just map each token's type onto some string using a 1-to-1 function. However, one important bit that remains is how the search pattern is specified. As you saw above, we used:

{{{
'comment (sp? comment)*'
}}}

to specify a search for "a comment token, followed by zero or more groups, where each group is an optional space token followed by a comment token." This departs from traditional Regexp syntax, because our alphabet is no longer composed of individual characters; it is composed of tokens. For this implementation's sake, we can observe that whitespace in the pattern is insignificant, and that the `?` and `*` operators apply to tokens, not to characters. We could specify this input however we like, as long as we can generate the correct string-searching pattern from it.

One last observation allows us to use Regexp to search our tokens: we must be able to specify a one-to-one function from a token name to the set of tokens that it should match. In other words, no two tokens that we consider "different" can have the same token type. For a normal Regexp, this is a trivial condition, as a character matches only that character. However, 'comment' must match the infinite set of all comment tokens. If we satisfy that condition, then there exists a function from a regex on token types to a regex on strings. This is still pretty trivial to show for tokens, but later, when we generalize this approach further, it becomes even more important to do correctly.

## Implementation

So, we get to Ripper's implementation:

1. Each token type is mapped to a single character in the set [a-zA-Z0-9].
2. The sequence of tokens to be searched is transformed into the sequence of characters corresponding to the token types.
3. The search pattern is transformed into a pattern that can search this mapped representation of the token sequence. Each token found in the search pattern is replaced by its corresponding single character, and whitespace is removed.
4. The new pattern runs on the mapped sequence. The result, if successful, is the start and end locations of the match in the mapped sequence.
5. Since each character in the mapped sequence corresponds to a single token, we can index into the original token sequence using the exact boundaries of the match result.

## An Example

Let's run through the previous example:

### Each token type is mapped to a single character in the set [a-zA-Z0-9]

Ripper runs this code at load-time:

{{{
seed = ('a'..'z').to_a + ('A'..'Z').to_a + ('0'..'9').to_a
SCANNER_EVENT_TABLE.each do |ev, |
  raise CompileError, "[RIPPER FATAL] too many system token" if seed.empty?
  MAP[ev.to_s.sub(/\Aon_/,'')] = seed.shift
end
}}}

I fired up an `irb` instance and checked the result:

{{{
Ripper::TokenPattern::MAP
# => {"CHAR"=>"a", "__end__"=>"b", "backref"=>"c", "backtick"=>"d",
"comma"=>"e", "comment"=>"f", "const"=>"g", "cvar"=>"h", "embdoc"=>"i",
"embdoc_beg"=>"j", "embdoc_end"=>"k", "embexpr_beg"=>"l",
"embexpr_end"=>"m", "embvar"=>"n", "float"=>"o", "gvar"=>"p",
"heredoc_beg"=>"q", "heredoc_end"=>"r", "ident"=>"s", "ignored_nl"=>"t",
"int"=>"u", "ivar"=>"v", "kw"=>"w", "label"=>"x", "lbrace"=>"y",
"lbracket"=>"z", "lparen"=>"A", "nl"=>"B", "op"=>"C", "period"=>"D",
"qwords_beg"=>"E", "rbrace"=>"F", "rbracket"=>"G", "regexp_beg"=>"H",
"regexp_end"=>"I", "rparen"=>"J", "semicolon"=>"K", "sp"=>"L",
"symbeg"=>"M", "tlambda"=>"N", "tlambeg"=>"O", "tstring_beg"=>"P",
"tstring_content"=>"Q", "tstring_end"=>"R", "words_beg"=>"S",
"words_sep"=>"T"}
}}}

This is completely implementation-dependent, but these characters are an implementation detail for the algorithm anyway.

### The sequence of tokens to be searched is transformed into the sequence of characters corresponding to the token types.

Ripper implements this as follows:

{{{
def map_tokens(tokens)
  tokens.map {|pos,type,str| map_token(type.to_s.sub(/\Aon_/,'')) }.join
end
}}}

Running this on our token stream from before (markdown doesn't support anchors, so scroll up if necessary), we get this:

{{{
"LfLfLwLsKLw"
}}}

This is what we will eventually run our modified Regexp against.

### The search pattern is transformed into a pattern that can search this mapped representation of the token sequence. Each token found in the search pattern is replaced by its corresponding single character, and whitespace is removed.

What we want is `comment (sp? comment)*`. In this mapped representation, a quick look at the table above shows the regex we need is

{{{
/f(L?f)*/
}}}

Ripper implements this in a somewhat roundabout fashion, as it seems they wanted to experiment with slightly different syntax. Since my implementation (which I'll present shortly) does not retain these syntax changes, I choose not to list the Ripper version here.

### The new pattern runs on the mapped sequence. The result, if successful, is the start and end locations of the match in the mapped sequence.

We run `/f(L?f)*/` on `"LfLfLwLsKLw"`. It matches `fLf` at position 1.

As expected, the implementation is quite simple for Ripper:

{{{
def match_list(tokens)
  if m = @re.match(map_tokens(tokens))
  then MatchData.new(tokens, m)
  else nil
  end
end
}}}

### Since each character in the mapped sequence corresponds to a single token, we can index into the original token sequence using the exact boundaries of the match result.

The boundaries returned were `[1, 4)` in mathematical notation, or `(1...4)`/`(1..3)` as Ruby ranges. We then use this range on the original sequence, which returns:

{{{
[[[1, 2], :on_comment, "# some comment\n"],
 [[2, 0], :on_sp, " "],
 [[2, 2], :on_comment, "# another comment\n"]]
}}}

The implementation is again quite simple in Ripper, yet for some reason it immediately extracts the token contents:

{{{
def match(n = 0)
  return [] unless @match
  @tokens[@match.begin(n)...@match.end(n)].map {|pos,type,str| str }
end
}}}

## Generalization

My only complaint with Ripper's implementation, for what it intends to do, is that it lacks an API to get more than just the source code corresponding to the matched tokens. That's an API problem, and could easily be worked around.

What has been provided can be generalized, however, to work not just on tokens but on sequences of arbitrary, even heterogeneous objects. There are a couple of properties we'll need to preserve to extend this to arbitrary sequences.

1. Alphabet Size: The alphabet for Ruby tokens is smaller than 62 elements, so we could use a single character from [A-Za-z0-9] to represent a token. If your alphabet is larger than that, we'll likely need to use a larger string for each element in the alphabet. Also, with Ruby tokens, we knew the entire alphabet ahead of time. We don't necessarily know the whole alphabet for arbitrary sequences.
2. No two elements of the sequence which should match differently can have the same string representation. We used token types for this before, but our sequence was homogeneous.

One observation makes the alphabet size issue less important: we actually only need to define a string mapping for elements in the alphabet that appear in the search pattern, not all those in the searched sequence. We can use the same string mapping for all elements in the searched sequence that don't appear in the regex pattern. If we recall that Regexp features like character classes (`\w`, `\s`) and ranges (`[A-Za-z]`) are just syntactic sugar for repeated `|` operators, we'll see that in a normal regex we also only need to consider the elements of the alphabet appearing in the search pattern. All this means that if we use the same 62 characters that Ripper does, only an input pattern with 62 or more distinct element types will require more than one character per element.

That said, we'll implement support for large alphabets anyway.

## General Implementation

For lack of a better name, we'll call this an `ObjectRegex`.

The full listing follows. You'll quickly notice that I haven't yet implemented the API that I actually need for Laser. Keeping focused seems incompatible with curiosity in my case, unfortunately.

{{{
class ObjectRegex
  def initialize(pattern)
    @map = generate_map(pattern)
    @pattern = generate_pattern(pattern)
  end

  def mapped_value(reg_desc)
    @map[reg_desc] || @map[:FAILBOAT]
  end

  MAPPING_CHARS = ('a'..'z').to_a + ('A'..'Z').to_a + ('0'..'9').to_a
  def generate_map(pattern)
    alphabet = pattern.scan(/[A-Za-z]+/).uniq
    repr_size = Math.log(alphabet.size + 1, MAPPING_CHARS.size).ceil
    @item_size = repr_size + 1

    map = Hash[alphabet.map.with_index do |symbol, idx|
      [symbol, mapping_for_idx(repr_size, idx)]
    end]
    map.merge!(FAILBOAT: mapping_for_idx(repr_size, map.size))
  end

  def mapping_for_idx(repr_size, idx)
    convert_to_mapping_radix(repr_size, idx).map do |char|
      MAPPING_CHARS[char]
    end.join + ';'
  end

  def convert_to_mapping_radix(repr_size, num)
    result = []
    repr_size.times do
      result.unshift(num % MAPPING_CHARS.size)
      num /= MAPPING_CHARS.size
    end
    result
  end

  def generate_pattern(pattern)
    replace_tokens(fix_dots(remove_ranges(pattern)))
  end

  def remove_ranges(pattern)
    pattern.gsub(/\[([A-Za-z ]*)\]/) do |match|
      '(?:' + match[1..-2].split(/\s+/).join('|') + ')'
    end
  end

  def fix_dots(pattern)
    pattern.gsub('.', '.' * (@item_size - 1) + ';')
  end

  def replace_tokens(pattern)
    pattern.gsub(/[A-Za-z]+/) do |match|
      '(?:' + mapped_value(match) + ')'
    end.gsub(/\s/, '')
  end

  def match(input)
    new_input = input.map { |object| object.reg_desc }.
                      map { |desc| mapped_value(desc) }.join
    if (match = new_input.match(@pattern))
      start, stop = match.begin(0) / @item_size, match.end(0) / @item_size
      input[start...stop]
    end
  end
end
}}}

## Generalized Map Generation

Generating the map is the primary interest here, so I'll start there.

First, we discover the alphabet by extracting all matches for `/[A-Za-z]+/` from the input pattern.

{{{
alphabet = pattern.scan(/[A-Za-z]+/).uniq
}}}

We figure out how many characters we need to represent that many elements, and save that for later:

{{{
# alphabet.size + 1 because of the catch-all, "not-in-pattern" mapping
repr_size = Math.log(alphabet.size + 1, MAPPING_CHARS.size).ceil
# repr_size + 1 because we will be inserting a terminator in a moment
@item_size = repr_size + 1
}}}
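
To make the arithmetic concrete (the alphabet sizes here are hypothetical, not Ripper's): a pattern that mentions two element types still fits in one character per element, while one that mentions a hundred needs two.

{{{
Math.log(2 + 1, 62).ceil     # => 1   (2 symbols plus the catch-all fit in one base-62 digit)
Math.log(100 + 1, 62).ceil   # => 2   (101 > 62, so each element needs two characters)
}}}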

Now, we just calculate the [symbol, mapped\_symbol] pairs for each symbol in the input alphabet:

{{{
map = Hash[alphabet.map.with_index do |symbol, idx|
  [symbol, mapping_for_idx(repr_size, idx)]
end]
}}}

We'll come back to how this works, but we must add the catch-all map entry: the entry that is triggered if we see a token in the searched sequence that didn't appear in the search pattern:

{{{
map.merge!(FAILBOAT: mapping_for_idx(repr_size, map.size))
}}}

Note that we avoid the use of the `inject({})` idiom common for constructing Hashes, since the computation of each tuple is independent from the others. `mapping_for_idx` is responsible for finding the mapped string for the given element. In Ripper, this was just an index into an array. However, if we want more than 62 possible elements in our alphabet, we instead need to convert the index into a base-62 number first. `convert_to_mapping_radix` does this, using the size of the `MAPPING_CHARS` constant as the new radix:

{{{
# Standard radix conversion.
def convert_to_mapping_radix(repr_size, num)
  result = []
  repr_size.times do
    result.unshift(num % MAPPING_CHARS.size)
    num /= MAPPING_CHARS.size
  end
  result
end
}}}

If MAPPING\_CHARS.size = 62, then:

{{{
convert_to_mapping_radix(3, 12498)
# => [3, 15, 36]
}}}

After we convert each number into the necessary radix, we can then convert that array of place-value integers into a string by mapping each place value to its corresponding character in the MAPPING\_CHARS array:

{{{
def mapping_for_idx(repr_size, idx)
  convert_to_mapping_radix(repr_size, idx).map { |char| MAPPING_CHARS[char] }.join + ';'
end
}}}

Notice that we added a semicolon at the end there. The choice of semicolon was arbitrary - it could be any valid character that isn't in MAPPING\_CHARS. Why'd I add that?

Imagine we were searching a long input sequence that needed 2 characters per element of the alphabet. Perhaps the Ruby grammar has expanded and now has well over 62 token types, and `comment` tokens are represented as `ba`, while `sp` tokens are `aa`. If we search for `:sp` in the input `[:comment, :sp]`, we'll search for `aa` in the string `"baaa"`. It will match halfway through the `comment` token at index 1, instead of at index 2, where the `:sp` actually lies. Thus, to avoid this, we simply pad each mapping with a semicolon. We could choose to only add the semicolon when `repr_size > 1` as an optimization, if we'd like.
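
Here's that failure mode, and the fix, replayed as plain string searches (the two-character mapping `comment => "ba"`, `sp => "aa"` is hypothetical, not Ripper's real table):

{{{
unpadded = "ba" + "aa"     # [:comment, :sp] maps to "baaa"
unpadded =~ /aa/           # => 1, a false hit straddling the comment token

padded = "ba;" + "aa;"     # with the ';' terminator: "ba;aa;"
padded =~ /aa;/            # => 3, and 3 / @item_size (3) is element 1, the real :sp
}}}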

## Generalized Pattern Transformation

After building the new map, constructing the corresponding search pattern is quite simple:

{{{
def generate_pattern(pattern)
  replace_tokens(fix_dots(remove_ranges(pattern)))
end

def remove_ranges(pattern)
  pattern.gsub(/\[([A-Za-z ]*)\]/) do |match|
    '(?:' + match[1..-2].split(/\s+/).join('|') + ')'
  end
end

def fix_dots(pattern)
  pattern.gsub('.', '.' * (@item_size - 1) + ';')
end

def replace_tokens(pattern)
  pattern.gsub(/[A-Za-z]+/) do |match|
    '(?:' + mapped_value(match) + ')'
  end.gsub(/\s/, '')
end
}}}

First, we have to account for this regex syntax:

{{{
[comment embdoc_beg int]
}}}

which we assume to mean "comment or embdoc_beg or int", much like `[Acf]` means "A or c or f". Since constructs such as `A-Z` don't make sense with an arbitrary alphabet, we don't need to concern ourselves with that syntax. However, if we simply replace "comment" with its mapped string, and do the same with embdoc_beg and int, we get something like this:

{{{
[f;j;u;]
}}}

which won't work: it'll match any semicolon! So we manually replace all instances of `[tok1 tok2 ... tokn]` with `tok1|tok2|...|tokn`. A simple gsub does the trick, since nested ranges don't really make much sense. This is implemented in #remove\_ranges:

{{{
def remove_ranges(pattern)
  pattern.gsub(/\[([A-Za-z ]*)\]/) do |match|
    '(?:' + match[1..-2].split(/\s+/).join('|') + ')'
  end
end
}}}
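
Treating the method as a free-standing helper for a moment, the range example above comes out as an alternation whose members are still token names, ready for `replace_tokens` to map:

{{{
remove_ranges('comment [comment embdoc_beg int]')
# => "comment (?:comment|embdoc_beg|int)"
}}}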

Next, we replace the '.' matcher with a sequence of dots equal to the size of our token mapping, followed by a semicolon: this is how we properly match "any alphabet element" in our mapped form.

{{{
def fix_dots(pattern)
  pattern.gsub('.', '.' * (@item_size - 1) + ';')
end
}}}

Then, we simply replace each alphabet element with its mapped value. Since those mapped values could be more than one character, we must group them for other Regexp features such as `+` or `*` to work properly; since we may want to extract subexpressions, we must make the group we introduce here non-capturing. Then we just strip whitespace.

{{{
def replace_tokens(pattern)
  pattern.gsub(/[A-Za-z]+/) do |match|
    '(?:' + mapped_value(match) + ')'
  end.gsub(/\s/, '')
end
}}}

## Generalized Matching

Lastly, we have a simple #match method:

{{{
def match(input)
  new_input = input.map { |object| object.reg_desc }.map { |desc| mapped_value(desc) }.join
  if (match = new_input.match(@pattern))
    start, stop = match.begin(0) / @item_size, match.end(0) / @item_size
    input[start...stop]
  end
end
}}}

While there are many ways of extracting results from a Regexp match, here we do the simplest: return the subsequence of the original sequence that matches first (using the usual leftmost-longest rule, of course). Here comes the one part where you have to modify the objects that are in the sequence: in the first line, you'll see:

{{{
input.map { |object| object.reg_desc }.map { |desc| mapped_value(desc) }
}}}

This interrogates each object for its string representation: the string you would type into your search pattern to find it. The method name (`reg_desc` in this case) is arbitrary, and this could also be implemented by providing a `Proc` to the ObjectRegex at initialization, and having the Proc be responsible for determining string representations.

We also see on the 3rd and 4th lines of the method why we stored @item\_size earlier: for boundary calculations:

{{{
start, stop = match.begin(0) / @item_size, match.end(0) / @item_size
input[start...stop]
}}}

Sometimes I wish `begin` and `end` could be local variable names in Ruby. Alas.
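
Putting the pieces together, here's a quick end-to-end run of the full listing above. `Token` and its `reg_desc` method are hypothetical stand-ins for whatever objects you want to search; the only contract is that single, no-argument method returning a String:

{{{
Token = Struct.new(:type, :text) do
  def reg_desc
    type.to_s
  end
end

tokens = [Token.new(:sp, ' '),
          Token.new(:comment, "# some comment\n"),
          Token.new(:sp, ' '),
          Token.new(:comment, "# another comment\n"),
          Token.new(:kw, 'def')]

matcher = ObjectRegex.new('comment (sp? comment)*')
matcher.match(tokens).map(&:text)
# => ["# some comment\n", " ", "# another comment\n"]
}}}

The `:kw` token never appears in the pattern, so it falls through to the catch-all mapping and simply never matches.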

## Conclusion

Firstly, I won't suggest this idea is new, since DFAs with arbitrary alphabets have been around for, well, a while. Additionally, I've found a [Python library, RXPY](http://www.acooke.org/rxpy/), with a similar capability, though it's part of a larger Regex testbed library.

I've tested this both with tokens and integers (in word form) as the alphabets, with 1- and 2-character mappings. I think this technique could see use in other areas, so I'll be packaging it up as a small gem. I also think this implementation is fine for use in Ripper to achieve the tasks the existing, experimental code seeks to implement without dependence on the number of tokens in the language. A bit of optimization for the exceedingly common 1-character use-case could further support this goal.