RubyGems - oryx - Versions diffs - 0.2.1 → 0.3.1 - Mend

oryx 0.2.1 → 0.3.1

Files changed (46)

data/.gitignore +0 -1
data/.travis.yml +15 -2
data/README.md +13 -1
data/doc/ast.md +110 -0
data/doc/cflat.md +71 -0
data/doc/conclusion.md +0 -0
data/doc/fibonacci.md +15 -0
data/doc/img/fib_parse.jpg +0 -0
data/doc/intermediate_lang.md +0 -0
data/doc/intro.md +89 -0
data/doc/lexer.md +33 -0
data/doc/parser.md +11 -0
data/doc/symbol_table.md +23 -0
data/doc/tools.md +9 -0
data/doc/x86_translation.md +0 -0
data/lib/oryx/ast.rb +34 -8
data/lib/oryx/contractor.rb +185 -0
data/lib/oryx/error.rb +8 -0
data/lib/oryx/parser.rb +31 -7
data/lib/oryx/runner.rb +30 -1
data/lib/oryx/symbol_table.rb +48 -0
data/lib/oryx/version.rb +1 -1
data/lib/oryx.rb +3 -0
data/test/data/add.c +3 -0
data/test/data/div.c +3 -0
data/test/data/eq.c +6 -0
data/test/data/fib.c +8 -0
data/test/data/fun_1.c +7 -0
data/test/data/ge.c +6 -0
data/test/data/geq.c +6 -0
data/test/data/gvar_1.c +5 -0
data/test/data/gvar_2.c +6 -0
data/test/data/gvar_3.c +6 -0
data/test/data/if.c +15 -0
data/test/data/le.c +6 -0
data/test/data/leq.c +6 -0
data/test/data/mul.c +3 -0
data/test/data/neq.c +6 -0
data/test/data/return.c +3 -0
data/test/data/sub.c +3 -0
data/test/lib/oryx/error_test.rb +19 -0
data/test/lib/oryx/runner_test.rb +25 -0
data/test/lib/oryx/symbol_table_test.rb +147 -0
data/test/shoulda_macros/runner.rb +58 -0
data/test/test_helper.rb +1 -0
metadata +61 -4

data/.gitignore CHANGED

@@ -7,7 +7,6 @@ Gemfile.lock
 InstalledFiles
 _yardoc
 coverage
-doc/
 lib/bundler/man
 pkg
 rdoc

data/.travis.yml CHANGED

@@ -1,5 +1,18 @@
+---
 language: ruby
 rvm:
-  - "1.9.3"
-  - "2.0.0"
+- 1.9.3
+- 2.0.0
+before_script:
+- "wget https://s3.amazonaws.com/oryx-artifacts/artifacts/40/40.1/llvm-cmpl-3.0.tar.gz"
+- tar -xzf llvm-cmpl-3.0.tar.gz
+- "export LD_LIBRARY_PATH=/home/travis/build/rampantmonkey/oryx/llvm-shared/lib:$LD_LIBRARY_PATH"
+- "export PATH=/home/travis/build/rampantmonkey/oryx/llvm-shared/bin:$PATH"
 script: bundle exec rake
+env:
+  global:
+  - "ARTIFACTS_AWS_REGION=us-east-1"
+  - "ARTIFACTS_S3_BUCKET=oryx-artifacts"
+  - secure: "lIqFPSOg8pq/x/I55RRBD749/2ASnnBrf7trG+y1q2wu5SOEcYWHPGn+QvyzFL0q67c/Qnz1MYtn6GJOZXPnAWZrZOtlw9QgIkrk6NquiyUMfR6qQA79lMo79okyFxeyK7uedvVbCBxPgJK3CA2IcKIMmWkOU+LLvnTCl2veqh8="
+  - secure: "HN9A0olMrWO8TTmAEN7fg3qtJp08JqabIKVgujLWw+JHVRGqh9MBF9j5ra1WhdV3n7myi+A0v38Fe7T7Mi0gRd+OAp/eCLg2prO2/ZpobWduY5k22XwR0uujsPmv9W4EQalhauWx4kmAu/9rjPcO5Voe5mcvHoFK5xun1571vjM="

data/README.md CHANGED

@@ -6,6 +6,16 @@ C-Flat to x86 compiler.
 - 20 Feb 13
     + Oryx 0.1.1
     + Lexer complete
+- 1 April 13
+    + Oryx 0.2.1
+    + Parse Tree successfully generated
+- 3 April 13
+    + Oryx 0.2.7
+    + Symbol Table implemented
+- 29 April 13
+    + Oryx 0.3.1
+    + Abstract Syntax Tree defined and generated
+    + Fixed Travis errors by pre-compiling LLVM shared library
 ## C-Flat
 C-Flat is a working subset of C designed for use in compilers courses. C-flat includes expressions, basic control flow, recursive functions, and strict type checking. It is object-compatible with ordinary C and thus can take advantage of the standard C library, at least when using limited types.
@@ -17,6 +27,7 @@ C-Flat is a working subset of C designed for use in compilers courses. C-flat in
 ## Requirements
 - Ruby >= 1.9.3
+- [LLVM 3.0](http://llvm.org/releases/) shared library
 ## Usage
@@ -27,7 +38,8 @@ For more details try
     $ oryx -h
 ## See Also
-`bin/oryx.markdown`
+`doc/intro.md`
 ## Contributing

data/doc/ast.md ADDED

@@ -0,0 +1,110 @@
+# Abstract Syntax Tree
+## Design
+The abstract syntax tree serves as the data structure which bridges parsing and code generation. As such, there should be a simple mapping from parser to AST and from AST to intermediate code.  The base class provided by RLTK is `ASTNode`. This class holds the functionality required for a tree: traversal and references to parents and children.
+There are two methods for distinguishing between similar classes (e.g. `Oryx::Add` and `Oryx::Sub`): inheritance or additional parameter. Inheritance was chosen for this project because it is consistent with the traversal method in code generation and it is simpler to create additional objects with the parser.
+## Construction
+This section describes the creation of an AST from the token stream. Each grammar rule in the parser finds sequences of tokens which match the CFlat language. When the proper match is determined (see [parser documentation](parser.md) for details) Oryx generates a node in the tree. A simple example is the `Oryx::Number` class. This node has one attribute, the integer which the lexeme represents.
+## Generation
+This section introduces the code generation from the AST. During code generation, `Oryx` traverses the tree and emits intermediate code corresponding to the class type. For example consider `Oryx::Add`. This class has two values (inherited from `Oryx::Binary`) `left` and `right` which correspond directly to the LLVM instruction `add`.
+## Class Hierarchy
+                   |- Expression -|- Assign          |-  Add
+                   |- ParamList   |- Binary ---------|-  Div
+                   |              |                  |-  EQ
+                   |              |                  |-  GE
+                   |              |                  |-  GEQ
+                   |              |                  |-  LE
+                   |              |                  |-  LEQ
+                   |              |                  |-  Mul
+                   |              |                  |-  NEQ
+                   |              |
+    RLTK::ASTNode -|- ArgList     |
+                   |              |- Call
+                   |- Function    |
+                                  |- CodeBlock
+                                  |- Declaration ------ GDeclaration
+                                  |- If
+                                  |- Initialization --- GInitialization
+                                  |- Number
+                                  |- Return
+                                  |                  |- Boolean
+                                  |- Type -----------|- Char
+                                  |                  |- Int
+                                  |                  |- Str
+                                  |- Variable
+                                  |- While
+## Class Descriptions
+Here are the details of each class of the syntax tree defined in Oryx.
+### Add
+Subclass of `Binary` for addition.
+### ArgList
+List of expressions to be evaluated and passed to a function. Only used as a direct descendant of `Call`.
+### Assign
+Assign the result of an expression to a pre-existing variable.
+### Binary
+An abstraction designed to represent expressions which have two arguments - `left` and `right`.
+### Boolean
+Node representing the `boolean` data type.
+### Call
+Call a function with the given parameters.
+### Char
+Node representing the `character` data type.
+### CodeBlock
+A list of expressions surrounded by curly braces.
+### Declaration
+Declare a new variable with no value.
+### Div
+Subclass of `Binary` for division.
+### EQ
+Subclass of `Binary` for equality comparison.
+### Expression
+Generic base class, inherited from `RLTK::ASTNode`, to simplify the interface with the external library.
+### Function
+Create a new function which contains a list of expressions and optional parameters.
+### GDeclaration
+Subclass of `Declaration` for variables declared in a global scope.
+### GE
+Subclass of `Binary` for greater-than comparison.
+### GEQ
+Subclass of `Binary` for greater-than-or-equal comparison.
+### GInitialization
+Subclass of `Initialization` for variables initialized in a global scope.
+### If
+Stores a conditional and two `CodeBlocks` (or `Expressions`), one of which is evaluated based on the result of the conditional.
+### Initialization
+Create a new variable with a default value.
+### Int
+Node representing the `integer` data type.
+### LE
+Subclass of `Binary` for less-than comparison.
+### LEQ
+Subclass of `Binary` for less-than-or-equal comparison.
+### Mul
+Subclass of `Binary` for multiplication.
+### NEQ
+Subclass of `Binary` for not-equal comparison.
+### Number
+Contains the numerical value of a lexeme; used for numerical constants.
+### ParamList
+List of formal parameters to a `Function`; used in a `Function` definition.
+### Return
+Return the value of the child of this node from a function.
+### Str
+Node representing the `string` data type.
+### Type
+Abstraction designed to represent data types.
+### Variable
+Variable lookup (or reference).
+### While
+While loop construct with a `CodeBlock` and an `Expression`.
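
The inheritance-based design described in `doc/ast.md` can be illustrated with a minimal sketch. It uses plain Ruby classes as stand-ins rather than `RLTK::ASTNode`, and the `emit` walker with its string pseudo-instructions is hypothetical; it is not how Oryx emits LLVM code, only an illustration of dispatching on the node class during traversal.

```ruby
# Plain-Ruby stand-ins for the node classes; RLTK::ASTNode is NOT used here.
class Expression; end

# A numeric constant with a single attribute.
class Number < Expression
  attr_reader :value

  def initialize(value)
    @value = value
  end
end

# Two-argument expressions share `left` and `right` through inheritance.
class Binary < Expression
  attr_reader :left, :right

  def initialize(left, right)
    @left = left
    @right = right
  end
end

class Add < Binary; end
class Sub < Binary; end
class Mul < Binary; end

# Hypothetical walker: visits children first, then emits one pseudo-instruction
# named after the node's class. Real code generation would emit LLVM IR instead.
def emit(node, out = [])
  case node
  when Number
    out << "push #{node.value}"
  when Binary
    emit(node.left, out)
    emit(node.right, out)
    out << node.class.name.split("::").last.downcase
  else
    raise ArgumentError, "unknown node: #{node.class}"
  end
  out
end
```

Calling `emit(Add.new(Number.new(1), Mul.new(Number.new(2), Number.new(3))))` walks the tree depth-first. Because each operator is its own subclass, adding a new binary operator needs only a new one-line class.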

data/doc/cflat.md ADDED

@@ -0,0 +1,71 @@
+# C-Flat 2013
+C-Flat is a working subset of C designed for use in compilers courses. C-Flat includes expressions, control flow, recursion, and strict type checking. This document will outline an **informal** description of the language, its features, and design decisions.
+## Keywords
+ - `boolean`
+ - `char`
+ - `else`
+ - `if`
+ - `int`
+ - `print`
+ - `return`
+ - `string`
+ - `true`
+ - `void`
+ - `while`
+## Identifiers
+Identifiers in C-Flat are identical to C. Identifiers may contain letters, numbers, and underscores and must begin with a letter or an underscore. Keywords are not valid identifiers.
+## Whitespace
+C-Flat ignores whitespace (beyond separating identifiers).
+## Operators
+C-Flat includes many of the arithmetic operators found in C. Here they are enumerated with precedence.
+    ( )                  grouping            ^   (highest precedence)
+    * /                  multiplication      |
+    + -                  addition            |
+    < <= >= == !=        comparison          |
+    &&                   logical and         |
+    ||                   logical or          |
+    =                    assignment         ---  (lowest precedence)
+## Types
+C-Flat is strictly typed: there is no type casting or type promotion, so an error occurs when the operand types differ. C-Flat supports four or five types (depending on how you count). All types can be used as function return types, and all except `void` can be used as the type of a variable.
+### Integer
+    int x = 123;
+`integers` are always signed 32-bit values.
+### Character
+    char c = 'q';
+`character` represents a single ASCII character.
+### String
+    string s = "hello world\n";
+A `string` is an immutable, doubly quoted list of `character`s.
+### Boolean
+    boolean b = false;
+`boolean` represents the literal values *true* and *false*.
+### Void
+Void is slightly different than all the other types. It can only be used in one context, the return type of a function. In this context, `void` indicates that the function does not return a value.
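
The identifier and keyword rules in `doc/cflat.md` can be sketched as a small check. The helper name `valid_identifier?` and both constants are hypothetical and not part of Oryx. The pattern assumes the C rule (first character a letter or underscore), which is also what the identifier pattern shown in `doc/lexer.md` accepts. The keyword list is copied as given in the document.

```ruby
# Keywords copied from the list in doc/cflat.md.
CFLAT_KEYWORDS = %w[boolean char else if int print return string true void while].freeze

# Assumed C-style shape: a letter or underscore, then letters, digits, or underscores.
IDENTIFIER_PATTERN = /\A[^\d\W]\w*\z/

# Hypothetical helper: true when `name` has identifier shape and is not a keyword.
def valid_identifier?(name)
  IDENTIFIER_PATTERN.match?(name) && !CFLAT_KEYWORDS.include?(name)
end
```

For instance, `valid_identifier?("fib")` and `valid_identifier?("_tmp1")` hold, while `"1x"` fails the shape test and `"while"` is rejected as a keyword.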

data/doc/conclusion.md ADDED

File without changes

data/doc/fibonacci.md ADDED

@@ -0,0 +1,15 @@
+# Fibonacci Example
+Source Code (`test/data/fib.c`)
+    int fib(int x) {
+      if (x < 2) return 1;
+      else return fib(x-1)+fib(x-2);
+    }
+    int main() {
+      return fib(5);
+    }
+Parse Tree
+![Parse Tree](img/fib_parse.jpg)

data/doc/img/fib_parse.jpg ADDED

Binary file

data/doc/intermediate_lang.md ADDED

File without changes

data/doc/intro.md ADDED

@@ -0,0 +1,89 @@
+# Project Description
+[Oryx](https://github.com/rampantmonkey/oryx) is a compiler for the C-Flat language written in [Ruby](http://www.ruby-lang.org/en/).
+# Goals
+- Produce tokens from source code (Oryx 0.1.1)
+- Create parse tree from token stream (Oryx 0.2.1)
+- Symbol Table working (Oryx 0.2.7)
+- Translate parse tree into AST (Oryx ?.?.?)
+- Emit LLVM intermediate code by walking AST (Oryx ?.?.?)
+- Implement Semantic Analysis (Oryx ?.?.?)
+- Use LLVM optimizations (Oryx ?.?.?)
+- Improved error handling (Oryx ?.?.?)
+# Implementation Overview
+Ruby was chosen as the implementation language since it supports numerous programming styles simultaneously. This feature allows Oryx to use the [programming paradigm](http://en.wikipedia.org/wiki/Programming_paradigm) which most directly models each piece of the compiler. The alternative being choosing one paradigm and forcing each piece of the compiler to fit that model. Ruby elegantly [combines three programming paradigms](http://en.wikipedia.org/wiki/Ruby_(programming_language)) which can produce powerful code that is easy to understand. The three paradigms -- [imperative](http://en.wikipedia.org/wiki/Imperative_programming), [functional](http://en.wikipedia.org/wiki/Functional_programming), and [object oriented](http://en.wikipedia.org/wiki/Object-oriented_programming) -- each shine with different types of problems.
+Imperative techniques are a direct abstraction of the capabilities of the execution environment and therefore provide the least friction when interfacing with the machine (I/O in this program). The functional paradigm is the programming language incarnation of mathematical theory. Due to the strong link to theory, it is simple to implement compiler design (also based strongly in theory). The object oriented paradigm comes from observations about biological systems. Under this paradigm programs are constructed as independent units which communicate, and therefore get work done, by sending messages to other objects. Object oriented techniques tend to be useful for code organization due to their compartmentalization.
+Beyond the theoretical reasoning, Ruby has many practical advantages. Ruby's main focus is developer happiness (reference). This goal has led to many development tools, libraries, and distribution mechanisms. Unit testing (with automation) and static [code analysis](http://codeclimate.org) are two of the development tools I used to assist development of Oryx. Ruby uses gems as the main mechanism for sharing and distributing libraries. The main library used is [RLTK](https://github.com/chriswailes/RLTK). The Ruby Language Tool Kit provides a lexer generator, parser generator, abstract syntax tree nodes, and LLVM bindings.
+My familiarity with Ruby reinforced the choice of language.
+# Program Flow
+    #####################
+    #                   #
+    #      source       #
+    #                   #
+    #####################
+             |
+             | Character by character
+             |
+             V
+    #####################
+    #                   #
+    #      lexer        #
+    #                   #
+    #####################
+             |
+             | Token Stream
+             |
+             V
+    #####################
+    #                   #
+    #      parser       #
+    #                   #
+    #####################
+             |
+             | Abstract Syntax Tree
+             |
+             V
+    #####################
+    #                   #
+    #     semantic      #
+    #     analysis      #
+    #                   #
+    #####################
+             |
+             | Verified Syntax Tree
+             |
+             V
+    #####################
+    #                   #
+    #       code        #
+    #    generation     #
+    #                   #
+    #####################
+             |
+             | Executable (via LLVM)
+             |
+             V
+# Table of Contents
+- [Introduction](intro.md)
+- [C-Flat](cflat.md)
+- [Tools Available](tools.md)
+- [Lexer](lexer.md)
+- [Parser](parser.md)
+- [Abstract Syntax Tree](ast.md)
+- [Symbol Table](symbol_table.md)
+- [Intermediate Representation](intermediate_lang.md)
+- [Translation into x86 Assembly](x86_translation.md)
+- [Conclusion](conclusion.md)

data/doc/lexer.md ADDED

@@ -0,0 +1,33 @@
+# Lexer
+The lexer is the first step in the compilation of C-Flat. The input for the lexer is the raw source code and the output is a stream of tokens. The lexer processes the source file one character at a time looking for valid *tokens*. Some examples of valid tokens are identifiers, punctuation (commas, semicolons, parentheses, etc.), and string constants. The lexer also removes comments and whitespace.
+The main mechanism behind the lexer is a finite automaton. The states are defined by a series of regular expressions. The regular expressions are joined together to form an automaton which represents all of the valid tokens in C-Flat (as well as some error tokens).
+## Lexer Rules
+Each regular expression is designed to find one type of lexeme and return the corresponding token when complete. Ruby's [block semantics](http://c2.com/cgi/wiki?BlocksInRuby) provide a convenient method for expressing this idiom. Blocks are essentially anonymous functions which can be stored and executed later. Here is an example from the lexer source code.
+    rule(/^[^\d\W]\w*/)      { |t| [:IDENT, t] }
+There are two important things to know in order to truly appreciate this line of code. The code is divided into two pieces, the rule definition and the function to execute on a successful match. The rule method adds the regular expression to the state machine as a valid pattern. The block has one parameter `t` which represents the matched lexeme and returns a token of type `IDENT` and containing the matched lexeme.
+## Lexer States
+Some rules do not require a lexeme, e.g. `IF` or `WHILE`. And some are more complicated requiring an additional lexer feature. This additional feature is states and is required for processing string constants and comments.
+The term *states* is overloaded in this context so I will use the term *lexer state* when referring to the lexer feature, otherwise I am referring to the 'states' of the finite automaton.
+Without *lexer states*, each of the accepting states returns to the initial start state. A *lexer state* can be viewed as an additional start state to which additional rules may be attached, or as a new [namespace](http://en.wikipedia.org/wiki/Namespace_(computer_science)) in which to define new rules. The separation created by *lexer states* is useful in processing strings and comments since the only important character is the comment/string terminator. When this token is encountered the *lexer state* is exited and lexing returns to the default state.
+## Lexer Flags
+One final feature which improves the robustness of the lexer is *flags*. Flags represent certain conditions which should result in an error upon exit but are not currently an error. One such case is comments. A source code file with an unclosed comment is most likely a mistake and is defined as an error in the language specification. Thus Oryx needs to detect this issue.
+Flags are boolean values. Oryx uses one flag for comment processing, which is `true` during comment processing and `false` otherwise. When the `EOF` character is read, the lexer checks the status of each flag and prints an error message if the flag is set.

data/doc/parser.md ADDED

@@ -0,0 +1,11 @@
+# Parser
+The parser defines the grammar (sentence structure) of CFlat. The grammar is built by combining many small rules together. The rules are defined in a similar fashion to the [lexer](lexer.md). Each rule consists of a pattern and a function to construct the correct syntax node.
+The structure of parse rules differs from the lexer in two minor ways. First, since the input to the parser is a stream of `Token`s, a plain string can define the pattern instead of a regular expression, which reflects the simpler input. Second, the functions which construct syntax tree nodes are more complicated than those in the lexer. The lexer only returns one class, `RLTK::Token`, while the parser selects a more descriptive class (see [ast](ast.md)).
+A few developer niceties are provided by RLTK. Rules can be grouped and named to reduce redundancy. Rules also have an optional precedence hierarchy and directional associativity (left or right). There is also an option to output the parse tree in [`dot`](http://graphviz.org/) syntax to create a visual representation of parsing. A final feature is the ability to produce a state diagram of the parser.
+## Parsing Algorithm
+RLTK uses the [GLR](http://en.wikipedia.org/wiki/GLR_parser) (generalized LR) parsing algorithm. GLR, originally published in 1984, is an extension of the LR algorithm designed to handle ambiguities and conflicts between rules. Typical LR parsers only allow one transition per state per token. This creates the potential for shift/reduce and reduce/reduce conflicts. GLR allows multiple transitions, effectively eliminating these types of conflicts. This feature is implemented by forking the parse stack and continuing parsing along both paths. To limit the resulting growth in memory usage, common prefixes and suffixes are shared among the stacks.

data/doc/symbol_table.md ADDED

@@ -0,0 +1,23 @@
+# Symbol Table
+The symbol table is a key data structure used during the semantic analysis and code generation phases of compilation. Thus a correct and efficient implementation is vital to the success of this project. Here we will discuss the implementation of symbol tables used in Oryx from the basic operations necessary to the design decisions made during development.
+## Basic Operations
+A symbol table for C-Flat has to support 5 basic operations -- `insert`, `update`, `lookup`, `enter` scope, and `exit` scope. When a variable is declared, regardless of the context, `insert` is used to store the fact that a variable exists. Assigning a value to a variable, whether during initialization or as the result of an expression, requires `update`. It is only possible to `update` a variable that has already been inserted into the table. `lookup` returns the value associated with a given variable. `enter` and `exit` scope are called when entering/exiting a code block which allows previously defined variables to be redefined temporarily.
+## Data Structures
+The 5 basic operations can be divided into two groups. `insert`, `update`, and `lookup` form the first group while `enter`, `exit` form the second. When looking at these two groupings it is apparent that there is a different data structure which supports each set of instructions. The first group directly maps to a hash table and the second to a stack. These conceptual data structures are represented differently in various programming languages. First we will look at how Ruby provides these structures, then we will see how to combine them into a symbol table.
+### Hash Tables in Ruby
+As part of the language, Ruby provides a [`Hash` class](http://ruby-doc.org/core-2.0/Hash.html) which implements a hash table. `Hash` stores and retrieves objects by label. A label can be any object which implements the `hash` and `eql?` methods, but the typical objects used for labels are [`String`](http://ruby-doc.org/core-2.0/String.html) and [`Symbol`](http://ruby-doc.org/core-2.0/Symbol.html). `String`s are, as expected, a sequence of characters. `Symbol`s can be viewed as strings with two important distinctions. `Symbol`s are __immutable__ and __globally__ scoped. `Symbol`s are implemented as a pointer to a memory object containing the name of the symbol, thus comparisons are extremely fast. All of these properties make `Symbol`s an ideal choice for labeling values in a hash table.
+### Stacks in Ruby
+Ruby does not provide a specific `stack` class, however the `push` and `pop` methods are implemented for the [`Array` class](http://ruby-doc.org/core-2.0/Array.html).
+## Bringing It Together
+Combining these data structures into a symbol table is straightforward. The symbol table is implemented as an `Array` of `Hash`es. The first `Hash` in the `Array` represents the global program scope while successive elements represent more localized scopes. When a function exits, the last entry is removed from the `Array`, effectively destroying the values created inside of the scope.
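
The `Array`-of-`Hash`es design described in `doc/symbol_table.md` can be sketched as follows. This is an illustration of the five operations only, not the code in `lib/oryx/symbol_table.rb`; the class name `ScopedSymbolTable`, the method names, and the choice to raise on an undeclared name are all assumptions.

```ruby
# Hypothetical sketch of a scoped symbol table: a stack (Array) of Hashes.
class ScopedSymbolTable
  class UndefinedSymbol < StandardError; end

  def initialize
    @scopes = [{}] # index 0 is the global scope
  end

  # Push a fresh, empty scope (e.g. on entering a code block).
  def enter_scope
    @scopes.push({})
  end

  # Pop the innermost scope, discarding everything declared inside it.
  def exit_scope
    raise "cannot exit the global scope" if @scopes.size == 1
    @scopes.pop
  end

  # Declare a name in the innermost scope, with no value yet.
  def insert(name)
    @scopes.last[name.to_sym] = nil
  end

  # Assign to the nearest enclosing declaration of the name.
  def update(name, value)
    scope_of(name)[name.to_sym] = value
  end

  # Read the value from the nearest enclosing declaration of the name.
  def lookup(name)
    scope_of(name)[name.to_sym]
  end

  private

  # Search from the innermost scope outward for a declaration of the name.
  def scope_of(name)
    scope = @scopes.reverse_each.find { |s| s.key?(name.to_sym) }
    scope || raise(UndefinedSymbol, "#{name} is not declared")
  end
end
```

Searching from the last `Hash` backwards is what lets an inner scope temporarily redefine a name: after `exit_scope`, a `lookup` finds the outer declaration again with its earlier value.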

data/doc/tools.md ADDED

@@ -0,0 +1,9 @@
+# Tools Available
+## RLTK
+[The Ruby Language Toolkit](https://github.com/chriswailes/RLTK) (or RLTK) by [Chris Wailes](http://chris.wailes.name/) provides a large collection of tools useful for translating languages: a lexer generator, a parser generator, an abstract syntax tree node base class, and LLVM bindings. RLTK is also well documented and includes tutorials on constructing compilers. The LLVM bindings are of particular interest since LLVM is rapidly evolving into an important part of the development toolchain.
+## Others
+Ruby provides a [foreign function interface](https://github.com/ffi/ffi) for interfacing with C functions and libraries. This means that any options from C, such as `yacc`, are available in Ruby. There are also options to interface with Java libraries so jcup and jflex would also be possibilities. The downside to all of these options is that they require a translation layer between Ruby and the library, which reduces the usefulness of Ruby.

data/doc/x86_translation.md ADDED

File without changes

data/lib/oryx/ast.rb CHANGED

@@ -7,8 +7,13 @@ module Oryx
     value :value, Integer
   end
+  class Type < Expression
+    value :t, String
+  end
   class Variable < Expression
     value :name, String
+    child :type, Type
   end
   class Binary < Expression
@@ -16,11 +21,39 @@ module Oryx
     child :right, Expression
   end
+  class Initialization < Expression
+    value :name, String
+    child :right, Expression
+    child :type, Type
+  end
+  class GInitialization < Initialization; end
+  class Declaration < Expression
+    value :name, String
+    child :type, Type
+  end
+  class GDeclaration < Declaration; end
   class Assign < Expression
     value :name, String
     child :right, Expression
   end
+  class ParamList < RLTK::ASTNode
+    child :params, [Declaration]
+  end
+  class ArgList < RLTK::ASTNode
+    child :args, [Expression]
+  end
+  class Call < Expression
+    value :name, String
+    child :args, ArgList
+  end
   class Add < Binary; end
   class Sub < Binary; end
   class Mul < Binary; end
@@ -47,14 +80,11 @@ module Oryx
     child :statements, [Expression]
   end
-  class ParamList < RLTK::ASTNode
-    child :params, [Expression]
-  end
   class Function < RLTK::ASTNode
     value :i, String
     child :params, ParamList
     child :body, CodeBlock
+    child :return_type, Type
   end
   class While < Expression
@@ -62,10 +92,6 @@ module Oryx
     child :body, CodeBlock
   end
-  class Type < Expression
-    value :t, String
-  end
   class Boolean < Type; end
   class Char < Type; end
   class Int < Type; end