Wednesday, September 28, 2005

Ruby and Ada do look alike

I once thought Ruby was mostly inspired by Perl, since Yukihiro Matsumoto mentioned that Ruby inherited the Perl philosophy. But that alone could not explain where a few "weird" pieces of syntax come from (considering my C/C++/C#/Java background), for example:


Ruby          C/C++/C#/Java
-----------   -------------
begin/end     {}
rescue        try/catch
raise         throw
elsif         else if
case/when     switch/case
nil           NULL/null


Actually it all comes from Ada. This is definitely not news; I simply failed to pay attention because I knew nothing about Ada.

You can check out the "syntax across languages per category" page to compare the two.

Dynamic typing and error checking

For people who are used to Eclipse or other great IDEs, coding in Ruby has some dark sides. For example, there is no compilation step, so error checking only happens at runtime. Normally, after you write a program, you run it, find a bug, fix it, and repeat the process again and again. Since the program always stops at the first error, this cycle is very inefficient. Most people would prefer the IDE to detect errors (like typos) as they type.

Some people believe this is the fault of "dynamic typing". Actually it is not; the poor IDE is the one to blame. As long as the language is strongly typed, a compiler (or part of an IDE) can do a lot of type inference and then validation, so you do not have to wait until runtime. Static typing makes type checking easier, and in some cases better, but a dynamically typed language does not by itself prevent you from doing it. Unfortunately, no Ruby IDE does that yet.

UPDATE (10/11/2005): Eric Lippert has a new post on this topic today [2]. MS's script engines (JScript and VBScript) can check for syntax errors without actually running the code, but unfortunately they always stop at the first syntax error.

References:
[0] Compiling in Vain
[1] Typing: Strong vs. Weak, Static vs. Dynamic by Aahz
[2] Checking For Script Syntax Errors by Eric Lippert

Saturday, September 10, 2005

How text editor stores text

Text editors need to store text efficiently, so that common operations like insert/delete/replace have minimal overhead. How do they do it?

Well, the simplest way is to put everything into one array (a string), but clearly this is not very efficient. If a user is inserting/deleting text at the beginning of a file, each keystroke results in copying the entire buffer.

There is an obvious trick to improve performance: since most editing activity tends to occur in a local area, we can allocate extra space (often called a "gap buffer") in advance, so that further insertions/replacements do not involve copying. Now the text is represented by two contiguous blocks of data with a gap between them. Whenever an editing action takes place, we first move the gap to that location. The design becomes a little more complex (the gap has to be kept invisible to the user), but it is much more efficient (less copying, while still enjoying the caching benefits of locality). The gap buffer is used in SharpDevelop (see SharpDevelop\src\Libraries\ICSharpCode.TextEditor\src\Document\TextBufferStrategy\GapTextBufferStrategy.cs) and Eclipse (see eclipse\plugins\org.eclipse.rcp.source.win32.win32.x86_3.1.0\src\org.eclipse.swt.win32.win32.x86_3.1.0\src.zip\org\eclipse\swt\custom\DefaultContent.java).
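To make the idea concrete, here is a minimal gap buffer sketch in Java. This is not the SharpDevelop or Eclipse code, just a hypothetical illustration of the technique (the class name and API are made up):

// Minimal gap buffer sketch: the text lives in one array with a movable
// "gap" of free space in the middle. Edits happen at the gap, so only
// relocating the gap costs a copy.
public class GapBuffer {
    private char[] buf = new char[16];
    private int gapStart = 0;            // first free slot
    private int gapEnd = buf.length;     // one past the last free slot

    public int length() { return buf.length - (gapEnd - gapStart); }

    // Move the gap so that it starts at text position pos.
    private void moveGap(int pos) {
        if (pos < gapStart) {            // shift text right, gap moves left
            int n = gapStart - pos;
            System.arraycopy(buf, pos, buf, gapEnd - n, n);
            gapStart = pos;
            gapEnd -= n;
        } else if (pos > gapStart) {     // shift text left, gap moves right
            int n = pos - gapStart;
            System.arraycopy(buf, gapEnd, buf, gapStart, n);
            gapStart = pos;
            gapEnd += n;
        }
    }

    public void insert(int pos, char c) {
        if (gapStart == gapEnd) grow();  // gap exhausted, reallocate
        moveGap(pos);
        buf[gapStart++] = c;
    }

    public void delete(int pos) {        // delete the char at pos (pos < length())
        moveGap(pos);
        gapEnd++;                        // absorb the char into the gap
    }

    private void grow() {                // double the array; new space becomes the gap
        char[] bigger = new char[buf.length * 2];
        System.arraycopy(buf, 0, bigger, 0, gapStart);
        int tail = buf.length - gapEnd;
        System.arraycopy(buf, gapEnd, bigger, bigger.length - tail, tail);
        gapEnd = bigger.length - tail;
        buf = bigger;
    }

    public String toString() {
        return new String(buf, 0, gapStart) + new String(buf, gapEnd, buf.length - gapEnd);
    }
}

Note that moveGap is the only place text gets copied, and consecutive edits at the same spot do not move the gap at all, which is exactly the locality win described above.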

While a gap buffer might be good enough for small documents, the copying overhead soon becomes unacceptable for larger files. Professional text editors (like MS Word, AbiWord, etc.) use a method called a "piece table". If you know the Composite design pattern, it is very easy to figure out all by yourself. In this approach, you start with a buffer containing the original text read from the file (this buffer is never touched again). New text is kept in a separate append-only buffer, and during editing the simple buffer turns into a composite structure. You can think of the structure as a linked list in which each item is a "piece" (a span of either the original buffer or the append buffer). A delete is handled by splitting a piece into two pieces, and an insert is handled by splitting a piece into three pieces (the middle one describing the inserted content).
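Again, a minimal, hypothetical Java sketch of the idea (this is not the Word or AbiWord code; real piece tables use trees and cache the most recently edited piece, but the splitting logic is the same). Only insert is shown; delete splits a piece into two in the same way:

import java.util.LinkedList;
import java.util.ListIterator;

// Minimal piece table sketch: the document is a list of "pieces", each
// pointing into either the read-only original buffer or the append-only
// add buffer. Edits never copy text; they only split pieces.
public class PieceTable {
    private final String original;                            // read from file, never modified
    private final StringBuilder added = new StringBuilder();  // append-only
    private final LinkedList<Piece> pieces = new LinkedList<Piece>();

    private static class Piece {
        boolean inAdded;      // which buffer this piece points into
        int start, length;
        Piece(boolean inAdded, int start, int length) {
            this.inAdded = inAdded; this.start = start; this.length = length;
        }
    }

    public PieceTable(String original) {
        this.original = original;
        if (!original.isEmpty())
            pieces.add(new Piece(false, 0, original.length()));
    }

    // Insert text at pos: split the covering piece into (left, new, right).
    public void insert(int pos, String text) {
        Piece newPiece = new Piece(true, added.length(), text.length());
        added.append(text);
        ListIterator<Piece> it = pieces.listIterator();
        int offset = 0;
        while (it.hasNext()) {
            Piece p = it.next();
            if (pos <= offset + p.length) {
                int cut = pos - offset;
                it.remove();
                if (cut > 0) it.add(new Piece(p.inAdded, p.start, cut));
                it.add(newPiece);
                if (cut < p.length)
                    it.add(new Piece(p.inAdded, p.start + cut, p.length - cut));
                return;
            }
            offset += p.length;
        }
        pieces.add(newPiece);   // insert at the very end of the document
    }

    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (Piece p : pieces) {
            CharSequence src = p.inAdded ? added : original;
            sb.append(src, p.start, p.start + p.length);
        }
        return sb.toString();
    }
}

Because neither buffer is ever modified in place, a piece table also makes undo cheap: an older version of the document is just an older list of pieces.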

References:

[1] Data Structures for Text Sequences by Charles Crowley

[2] The Craft of Text Editing by Craig A. Finseth

[3] Dissecting a C# Application: Inside SharpDevelop by SharpDevelop Team

[4] MS Word Piece Table by Tomas Frydrych

[5] Performance Improvements by Msevior

[6] Programmer Myopia by Wesner Moise

On Hungarian Notation

Recently Micah Martin complained about Hungarian Notation on his blog. He is definitely not the first person to bring this issue up; the criticism probably started as soon as the notation was born. The complaint is valid: if you need to change a type, in Micah Martin's case from an interface (ICommand) to an abstract class (Command), Hungarian Notation gets in your way. You have to delete the capital "I" prefix in lots of places, which is not very "agile" for some people.

But I do not want to ignore the advantages of Hungarian Notation: with its help, you can quickly tell the nature of a type without consulting the documentation. If you are using a third-party library/framework with stable APIs, you are very likely to appreciate it if the vendor uses Hungarian Notation. And for vendors still doing Big Upfront Design (like MS), changing a type does not happen very often.

In fact, Micah Martin's example is not very good either: he would have been better off creating an abstract class that simply implements the interface, and letting the subclasses extend it (to enjoy the benefits of the template method pattern). The clients of the interface then do not require any change; they still consume the interface.
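Something like the following Java sketch (the names are hypothetical, chosen to match the post):

// Clients keep depending on the interface...
interface ICommand {
    void execute();
}

// ...while the abstract class implements it and hosts the shared skeleton;
// subclasses only fill in the varying step (template method).
abstract class Command implements ICommand {
    public final void execute() {
        setUp();         // shared skeleton
        doExecute();     // the step subclasses customize
        tearDown();
    }
    protected void setUp() {}
    protected void tearDown() {}
    protected abstract void doExecute();
}

class SaveCommand extends Command {
    protected void doExecute() {
        System.out.println("saving...");
    }
}

public class Client {
    public static void main(String[] args) {
        ICommand cmd = new SaveCommand();  // clients never see Command
        cmd.execute();
    }
}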

Tuesday, September 06, 2005

Thoughts on parser generators

Recently I have been evaluating several parser generators and eventually picked antlr for my project. Among all the parser generators in the world, yacc, Boost.Spirit, antlr, and Perl 6 represent the current state of the art, and each has unique characteristics worth talking about.

Since most languages do not have parsing features built in (Perl 6 will be the pioneer), almost all current parser generators rely on code generation. A common problem is that the generated code may be very foreign to you and look totally cryptic in your favorite debugger.

Yacc suffers the most in this respect, since it uses LALR parsing. Lots of people like the LL(k)-based antlr because it produces human-readable code, which I find extremely useful when working on a new/unfamiliar grammar. For a more detailed comparison, you can read Ian Kaplan's article.
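To see why that matters, here is a hand-written Java sketch, not actual antlr output, showing the recursive-descent shape that LL(k) generators emit, for a toy grammar of integer sums like 1+2+3:

// Recursive descent over the grammar:  expr : NUM ('+' NUM)* ;
// Each rule becomes an ordinary method, so a debugger's call stack
// mirrors the grammar, which is why LL-generated code stays readable.
public class SumParser {
    private final String input;
    private int pos = 0;

    public SumParser(String input) { this.input = input; }

    public int expr() {                 // expr : num ('+' num)*
        int value = num();
        while (pos < input.length() && input.charAt(pos) == '+') {
            pos++;                      // consume '+'
            value += num();
        }
        return value;
    }

    private int num() {                 // num : DIGIT+
        int start = pos;
        while (pos < input.length() && Character.isDigit(input.charAt(pos)))
            pos++;
        if (pos == start)
            throw new RuntimeException("expected a number at position " + pos);
        return Integer.parseInt(input.substring(start, pos));
    }

    public static void main(String[] args) {
        System.out.println(new SumParser("1+2+3").expr());  // prints 6
    }
}

Set a breakpoint inside num() and the call stack reads like the grammar itself; LALR table-driven code gives you nothing of the sort.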

Boost.Spirit is quite unique (and amazing!). With the help of powerful C++ templates and operator overloading, it mimics EBNF in C++. Although it looks like Perl 6's approach, it actually generates code through templates (at compile time), and the generated code is not as readable as antlr's output. Even worse, since the C++ compiler takes no special care of Boost.Spirit, a simple grammar error may show up as multiple pages of impenetrable template garbage. Dave Handley has a nice introduction to Boost.Spirit on CodeProject.

Comega (Cω) made an interesting move by integrating popular "foreign languages" (XML, SQL) into the core of the host language (C#). Perl 6's built-in parsing support does a similar thing (it brings EBNF into Perl). This helps provide earlier and friendlier error reporting, which is a major problem for Boost.Spirit.

So far, antlr is sufficient for me. It can generate parsers in several languages (Java, C#, C++, Python), reports errors nicely, and its output is readable and debugger-friendly.