Writing a C Compiler: Part 1, Lexer

November 20, 2021

As a programmer I enjoy building things. I also like knowing how things work, I figure the two are related. There are a few things that I feel I’ve missed out on from not attending a CS program (nothing practical of course), one being a deeper understanding of how programming languages are written. To this end I’ve decided to spend some time learning how to write compilers.

I decided that writing a C compiler would be a good place to start. I don’t know C. In spite of this, I think it will be a good choice because of it’s simplicity. Not simplicity in terms of writing correct programs with it, but simplicity in terms of being “lower level” so probably less work to turn it into assembly.

I did a Google search for something like “write C compiler tutorial” and found this post series which looks promising. I’ll probably try to do posts that match, 10 parts.

Nora said examples will be given in x86 instruction sets, but I’ll be attempting to rewrite using AMD64. Based on AMD64 docs this should be fine because it should support all x86 instructions.

Also, Nora started writing in python and eventually switched to OCamel, despite not knowing the language at all. I’m going to start in Ruby and see how it goes!

The first step is to write a lexer. It seems all a lexer does is turn a code file into a list of tokens. Parenthes, keywords, identifiers, etc. It does not add any structure or attempt to check the code as valid. Later steps use the list of tokens to create the structure of the program.

The program to be, uh, lexed is this:

int main() {
    return 500;
}

Here’s the lexer I wrote:

module Token
    BLOCK = :BLOCK
    END_BLOCK = :END_BLOCK
    PAREN = :PAREN
    END_PAREN = :END_PAREN
    KEYWORD = :KEYWORD
    CONST = :CONST
    TERM = :TERM
    ID = :ID

    def self.keywords
        [
            "int",
            "return"
        ]
    end
end

tokens = []
curconst = ""

File.open(ARGV.last).each_char do |char|
    if /[a-zA-Z0-9]/ =~ char
        curconst << char
        next
    end

    if curconst.length > 0
        if Token.keywords.include? curconst
            tokens << [Token::KEYWORD, curconst]
        elsif curconst =~ /^[0-9]+$/
            tokens << [Token::CONST, curconst]
        else
            tokens << [Token::ID, curconst]
        end

        curconst = ""
    end

    token = case char
    when '{' then [Token::BLOCK, char]
    when '}' then [Token::END_BLOCK, char]
    when '(' then [Token::PAREN, char]
    when ')' then [Token::END_PAREN, char]
    when ';' then [Token::TERM, char]
    when /\s/ then next
    else
        raise "unknown token %s" % char.inspect
    end

    tokens << token
end

neat-o, it lexes! cool. The next step is to create the AST. I’ll be covering that in my next post, unless I completely lose interest and never work on this again!