Saturday, September 10, 2005

How text editor stores text

Text editors need to store text in an efficient way, so that common operations like insert/delete/replace have minimized overhead. How do they do it?

Well, the simplest way is to put anything into an array(string), but clearly this is not very efficient. If an user is inserting/deleting text at beginning of a file, each keystroke will result in copying the entire buffer.

There is an obvious trick to improve the performance : since most editing activities tend to occur frequently in a local area , we can allocate more space (often called "Gap Buffer") in advance so that further insertions/replacements do not involve copying. Now the text is represented by two contiguous blocks of data, and a gap buffer between them. Whenever an editing action takes place, we first move the gap to the location. Now the design becomes a little more complex (have to make the gap invisible to the user), but it is much more efficient (less copying and still enjoy caching benefits of locality). Gap Buffer is used in SharpDevelop(see SharpDevelop\src\Libraries\ICSharpCode.TextEditor\src\Document\TextBufferStrategy\GapTextBufferStrategy.cs) and Eclipse(see eclipse\plugins\org.eclipse.rcp.source.win32.win32.x86_3.1.0\src\org.eclipse.swt.win32.win32.x86_3.1.0\src.zip\org\eclipse\swt\custom\DefaultContent.java).

While gap buffer might be good enough for small size documents, the copying overhead will soon become unacceptable for larger files. Professional text editors (like MS Word, AbiWord etc) use a method called "piece table". If you know Composite Design Pattern, it is very easy to figure it out all by yourself. In this approach, you start with a buffer containing the original text read from the file (this buffer won't be touched anymore). Instead, the new text is kept in a new append-only block, and during editing, the simple buffer changes into a composite structure. You can think the structure as a linked list, and each item is the "piece". A delete is handled by splitting a piece into two pieces, and an insert is handled by splitting the piece into three pieces (the middle one is the inserted content).

References:

[1] Data Structures for Text Sequences by Charles Crowley

[2] The Craft of Text Editing by Craig A. Finseth

[3] Dissecting a C# Application: Inside SharpDevelop by SharpDevelop Team

[4] MS Word Piece Table by Tomas Frydrych

[5] Performance Improvements by Msevior

[6] Programmer Myopia by Wesner Moise

0 Comments:

Post a Comment

<< Home