Tuesday, November 29, 2005

More on leftshift and heredoc

A few weeks ago I wrote about distinguishing between left shift and heredoc in a Ruby lexer. Even though I knew how things work in theory, during the implementation I found quite a lot of complicated cases, and handling all of them requires a complete symbol table, which is something I am trying to avoid right now.

Today I spent a few hours trying examples and reading Ruby's source code, and finally decoded the rules (they turn out to be much simpler than I thought).

Now let me show you how it works. Suppose the lexer is going to tokenize the following line:

x<<blahblah

Obviously, "x" is an IDENTIFIER, but what about "<<"? Is it a SHIFT operator or the start of a HEREDOC? Well, sometimes you can decide at the syntax level. In the above case, since there is no whitespace between "x" and "<<", Ruby always treats "<<" as the shift operator.
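Here is a small sketch of this first rule in plain Ruby. The names "y", "from_variable" and "from_method" are made up for the illustration; the point is only that without whitespace before "<<", both lines should be read as a shift, whether the identifier is a variable or a method.

```ruby
# x is a local variable here.
x = 1
blahblah = 3
from_variable = x<<blahblah   # read as x << blahblah, i.e. Integer#<<

# y is a method (hypothetical, defined only for this illustration).
def y
  [1]
end
from_method = y<<blahblah     # read as y() << blahblah, i.e. Array#<<

p from_variable
p from_method
```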

Then let's move on to another example:

x  <<blahblah

As you can see, the only difference is that we put a space between "x" and "<<". Now comes rule 2: if "x" is a variable, "<<" becomes the SHIFT operator; if "x" is a method, "<<" and what follows are treated as a HEREDOC. Ruby chooses the reading that makes the most sense. But how do we know whether "x" is a method or not? Well, sometimes it is easy: if "x" looks like "a.b.c" (notice the DOT), then it is always a method. Otherwise, we need a symbol table to look it up.
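The second rule can be tried out the same way. In this sketch the method "show" and the terminator "TEXT" are invented for the example; "x" is a local variable, so the first line should be a shift, while the call to "show" should start a here document.

```ruby
# x is a local variable, so "x <<amount" is read as a shift.
x = 1
amount = 3
as_shift = x <<amount

# show is a method (hypothetical), so "show <<TEXT" starts a here document
# that is passed to it as an argument.
def show(text)
  text
end

as_heredoc = show <<TEXT
inside the here document
TEXT

p as_shift
p as_heredoc
```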

And there is a trick when implementing the symbol table: you cannot put only methods in it and later check whether a simple IDENTIFIER is indeed a method. That does not work for two reasons: 1) the method you call may not be defined yet; 2) the method may be defined in an included module, and you do not want to parse it here. So how do we do it? Simply store the local variables in the symbol table: if a simple IDENTIFIER is not a local variable, then it must be a method.
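To make the idea concrete, here is a minimal sketch of how a lexer could apply the two rules with a symbol table that holds only local variables. The class name "ShiftOrHeredoc" and its methods are hypothetical, and looking for a dot inside the identifier string is a simplification (a real lexer would more likely remember that the previous token was a DOT). It models only the rules described in this post, not what follows "<<".

```ruby
require 'set'

# Hypothetical helper: decides how "<<" after an identifier should be lexed.
class ShiftOrHeredoc
  def initialize
    @locals = Set.new   # the symbol table stores local variables only
  end

  # Called when the lexer sees an assignment such as "x = ...".
  def declare_local(name)
    @locals.add(name)
  end

  # identifier:   text immediately before "<<", e.g. "x" or "a.b.c"
  # space_before: true if whitespace separates the identifier from "<<"
  def classify(identifier, space_before:)
    # Rule 1: no whitespace before "<<" always means shift.
    return :shift unless space_before
    # Something like "a.b.c" is always a method call.
    return :heredoc if identifier.include?('.')
    # Rule 2: a known local variable means shift; anything else is assumed
    # to be a method, so "<<" starts a here document.
    @locals.include?(identifier) ? :shift : :heredoc
  end
end

table = ShiftOrHeredoc.new
table.declare_local('x')

p table.classify('x', space_before: false)
p table.classify('x', space_before: true)
p table.classify('puts', space_before: true)
p table.classify('a.b.c', space_before: true)
```

Because only locals are stored, a method that is not defined yet, or that lives in an included module, still falls through to the "method" branch without any extra parsing.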

The above rules make the lexer implementation easy. Originally I thought I needed to check the number of parameters of a method to get a more accurate prediction, but that is too much for a lexer.

3 Comments:

Anonymous said...

Ok, this is just a minor little additional point, and maybe you're already aware of it, but it's good to document it here. In ambiguous cases, << is treated as a here document if it is preceded but not followed by whitespace. So, for example:

p<<1 #left shift
p <<1 #here doc
p << 1 #left shift
p<< 1 #left shift

The same 'preceded but not followed by whitespace' rule applies to the other funny characters that can be ambiguous in Ruby:
%
:
/

1:33 PM  
Anonymous said...

xyz's statement that if x isn't a variable it must be a method is wrong. The following shows the problem.

b = 23

def b var
end

b <<HERE
Double quoted \
here document.
HERE

b will appear in the symbol table as a variable, but this code is a Here Doc.

caleb's example is also incorrect. If you use an identifier other than the letter p these assumptions fail. I don't know what is special about p.

3:14 PM  
Blogger xue.yong.zhi said...

Hello neville, the code in your example does not work:
uninitialized constant HERE (NameError)

3:22 PM  
