X.Y.Z: November 2005

Tuesday, November 29, 2005

More on leftshift and heredoc

A few weeks ago I wrote something on distinguishing between leftshift and heredoc in ruby lexer. Even though I knew how things work in theory, during the implementation I found out quite a lot complicated cases, and to handle all of them requires a complete symbol table, which is something I try to avoid right now.

Today I spent a few hours on trying examples and reading ruby's source code, and finally decoded the rules (it turns out to be much simpler than I thought).

Now let me show you how it works. Suppose the lexer is going to tokenize the following line:


x<<blahblah

Obviously, "x" is an IDENTIFIER, but what about "<<"? Is it a SHIFT operator or the start of HEREDOC? Well, sometime you can do it at syntax level. In the above case, since there is no whitespace between "x" and "<<", ruby always treated "<<" as shift operator.

Then let's move on to another example:


x  <<blahblah

As you can see the only difference is that we put space between "x" and "<<". Now comes rule 2: if "x" is a variable, "<<" becomes SHIFT operator, if "x" is a method, "<<" and the following will be treated as HEREDOC. As you can see, ruby will choose the way that makes most sense. But how do we know whether "x" is a method or not? Well, sometime it is easy: if "x" looks like "a.b.c" (notice the DOT), then it is always a method. Otherwise, we need a symbol table to lookup.

And there is a trick when implementing the symbol table: you can not put only methods in it and later check if a simple IDENTIFIER is indeed a method. It does not work for two reasons: 1) The method you called may not be defined yet; 2) The method may be defined in an included module and you do not want to parse them here. So how do we do it? Simply store the local variables in the symbol table: if a simple IDENTIFIER is not a local variable, then it must be a method.

The above rules make lexer implementation easy. Originally I thought I need to check the number of parameters of a method to get more accurate predict, but that is too much for a lexer.

Monday, November 28, 2005

Entertainment Reward Scam

Purchasing a ticket on TicketMaster in last November made me a member of so called "Entertainment Reward" program, and they started to charge $9 from my credit card each month. They deliberately make the process unnoticable during the purchase. If my coworker did not break the news, I might no even know I was in the trap.

A quick search on google shows there are lots of people who are scamed, and now there is a website for sharing the stoies: www.entertainmentrewardscam.com. So be careful, better don't do business with those unethical companies.

Shame on TicketMaster and Entertainment Reward!

Wednesday, November 09, 2005

Does pure OO matter?

Ruby is a pure OO language, in other words, it has no functions. Even if a "def" statement outside of a class definition looks and behaves like a function, it is actually a private methods of class Object.

You may ask, is this just an implementation details? Well, most of the time it doesn't matter, people can ignore the difference and use those methods as functions -- no problem at all. It only matters when you use introspect(reflection). For example, the following code:


def my_global_function
end

puts Object.private_methods

will print:


...
my_global_function
exit
putc
system
...

Reference:
[0] Ruby FAQ: How Does Ruby Compare With Python?

Thursday, November 03, 2005

Linear approximate lookahead

Even though we talk antlr as a LL(k) parser generator for simplicity, it actually does linear approximate lookahead instead of full LL(k).
Most of the time this is a good thing, because it reduces the complexity of storing and testing lookahead (from exponential to linear reduction). But occasionally the weaker lookahead is insufficient and results in nondeterministic parser/lexer.
Here is an example to illustrate the problem:


RULE1
   : "ab" | "ba"
   ;

RULE2
   : "ac" | "ca"
   ;

The above is prefect LL(2) grammar, but antlr will give you a warning:


[antlr] test.g: warning:lexical nondeterminism between rules RULE1 and RULE2 upon
[antlr] test.g:     k==1:'a'
[antlr] test.g:     k==2:'a'

If you look at the generated code, soon you will realized this is because of linear approximate lookahead:


else if ((LA(1)=='a'||LA(1)=='b') && (LA(2)=='a'||LA(2)=='b')) {
   ...//match RULE1
}
else if ((LA(1)=='a'||LA(1)=='c') && (LA(2)=='a'||LA(2)=='c')) {
   ...//match RULE2
}

If the input is "aa", even though it does not match any of the two rules, both tests will be true, and that is what the nondeterminism warning trying to tell you.

A full LL(k) parser generator should generate the following code instead:


else if ((LA(1)=='a'&&LA(2)=='b') || (LA(1)=='b'&&LA(2)=='a')) {
   ...//match RULE1
}
else if ((LA(1)=='a'&&LA(2)=='c') || (LA(1)=='c'&&LA(2)=='a')) {
   ...//match RULE2
}

Tuesday, November 01, 2005

Distinguish leftshift and heredoc

I am working on a ruby parser with antlr. Last weekend I tested my ruby lexer on lots of realworld ruby scripts, and found an interesting problem. That is, how to distinguish left shift operator and "here document(heredoc)"?

As we know, "<<" is left shift operator in lots of computer languages. In perl and ruby, it is also the start of heredoc. For those of you who do not have unix shell script/perl/ruby background, heredoc acts like a string, but with flexibility. Here is an example in ruby:


str = <<EOF
hello, world!
EOF

Since perl/ruby choose not to introduce a new symbol for the start of heredoc, their parser developers have to deal with the ambiguity. I tried to find out a simple solution, but later found out the ambiguity can not be solved at syntax level.

For example, in the following code, "<<" should be parsed as the start of heredoc:


def x(var)
  puts var
end

x <<1
test
1

But in another case, "<<" should be parsed as left shift operator:


x = 1

x <<1
test
1

The only difference is: in the first case, x is a method, while in the second case x is a variable. So there is no other choice, we have to look up symbol table to decide lexer state.

Today I checked the source code of ruby 1.8.3. No surprise, it uses the same approach.

X.Y.Z