More on leftshift and heredoc
A few weeks ago I wrote about distinguishing between leftshift and heredoc in a Ruby lexer. Even though I knew how things work in theory, during the implementation I ran into quite a lot of complicated cases, and handling all of them seemed to require a complete symbol table, which is something I am trying to avoid right now.
Today I spent a few hours trying examples and reading Ruby's source code, and finally decoded the rules (they turn out to be much simpler than I thought).
Now let me show you how it works. Suppose the lexer is going to tokenize the following line:
x<<blahblah
Obviously, "x" is an IDENTIFIER, but what about "<<"? Is it a SHIFT operator or the start of a HEREDOC? Well, sometimes you can decide at the syntax level. That is rule 1: in the above case, since there is no whitespace between "x" and "<<", Ruby always treats "<<" as a SHIFT operator.
Then let's move on to another example:
x <<blahblah
As you can see, the only difference is that we put a space between "x" and "<<". Now comes rule 2: if "x" is a variable, "<<" becomes a SHIFT operator; if "x" is a method, "<<" and what follows are treated as a HEREDOC. In other words, Ruby chooses the reading that makes the most sense. But how do we know whether "x" is a method or not? Well, sometimes it is easy: if "x" looks like "a.b.c" (notice the DOT), then it is always a method. Otherwise, we need a symbol table to look it up.
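Here is a minimal sketch of the two rules in Ruby. The names classify_lshift, prev_text, space_before and locals are made up for illustration and are not taken from Ruby's source. The sketch also assumes that whatever follows "<<" is a valid heredoc identifier.

```ruby
# Hypothetical helper: decide how to tokenize a "<<" that follows an identifier.
#   prev_text    - source text just before "<<", e.g. "x" or "a.b.c"
#   space_before - true if whitespace separates that text from "<<"
#   locals       - names currently known to be local variables
def classify_lshift(prev_text, space_before, locals)
  # Rule 1: no whitespace before "<<" always means SHIFT.
  return :SHIFT unless space_before

  # A dotted chain such as "a.b.c" is always a method call.
  return :HEREDOC if prev_text.include?(".")

  # Rule 2: a known local variable takes SHIFT,
  # anything else is assumed to be a method and starts a HEREDOC.
  locals.include?(prev_text) ? :SHIFT : :HEREDOC
end

classify_lshift("x", false, [])        # x<<blahblah, no space
classify_lshift("x", true, ["x"])      # x <<blahblah, x is a local variable
classify_lshift("x", true, [])         # x <<blahblah, x is not a local variable
```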
And there is a trick when implementing the symbol table: you cannot just put methods in it and later check whether a simple IDENTIFIER is indeed a method. That does not work for two reasons: 1) the method you call may not be defined yet; 2) the method may be defined in an included module, and you do not want to parse that module here. So how do we do it? Simply store the local variables in the symbol table: if a simple IDENTIFIER is not a local variable, then it must be a method.
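A table of local variables can be very small. The sketch below is only an illustration: LocalTable and its method names are hypothetical, and it simplifies scoping by treating every scope like a "def" body, so blocks that can see outer locals are not modelled.

```ruby
require "set"

# Hypothetical symbol table that records only local variables.
class LocalTable
  def initialize
    @scopes = [Set.new]
  end

  # Call when entering a new scope, such as a "def" body.
  def push_scope
    @scopes.push(Set.new)
  end

  # Call when leaving a scope; the top-level scope is never removed.
  def pop_scope
    @scopes.pop if @scopes.size > 1
  end

  # Call when the lexer sees an assignment such as "x = ...".
  def declare(name)
    @scopes.last.add(name)
  end

  # True only if the name was assigned in the current scope.
  def local?(name)
    @scopes.last.include?(name)
  end
end

table = LocalTable.new
table.declare("x")
table.local?("x")     # x was assigned, so "x <<blahblah" is a SHIFT
table.local?("puts")  # never assigned, so it is assumed to be a method
```

Anything that fails local? is treated as a method, which is why the table never needs to know about method definitions at all.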
The above rules make the lexer implementation easy. Originally I thought I would need to check the number of parameters of a method to get a more accurate prediction, but that is too much for a lexer.