03 text editor

5/25/2018 03 Text Editor

1/91

TEXT EDITOR, MAMCET, 2008 IT

1

M.A.M. COLLEGE OF ENGINEERING AND TECHNOLOGYTRICHY -621105

DEPARTMENT OF INFORMATION TECHNOLOGY

ANNA UNIVERSITY PRACTICAL EXAMINATIONS, OCT2011

RECORD NOTE BOOK

CS1403 - SOFTWARE DEVELOPMENT LABORATORY

Student Name :

Register Number :

Class & Semester : 2008-2012, B.Tech., IT, VII - Semester

Month & Year : OCT - 2011


2/91


2


3/91


3

M.A.M. COLLEGE OF ENGINEERING AND TECHNOLOGY

TRICHY -621105

DEPARTMENT OF INFORMATION TECHNOLOGY

ANNA UNIVERSITY PRACTICAL EXAMINATIONS, OCT2011

BONAFIDE CERTIFICATE

This is to certify that this practical work titled CS1403 Software Development

Laboratory is the bonafide work of ________________________________

Register Number ______________________ during the academic year 2011-2012.

Faculty Incharge Head of the Department

Submitted for Anna University of Technology, Tiruchirappalli practical

examination held on ____/___/___ at M.A.M. College of Engineering and

Technology, Trichy 621105.

Internal Examiner External Examiner


4/91


4

ABSTRACT

The data structure used at maintains the sequence of characters is an

important part of a text editor. This paper investigates and evaluates the range of

possible data structures for text sequences. The ADT interface to the text sequence

component of a text editor is examined. Six common sequence data structures

(array, gap, list, line pointers, fixed size buffers and piece tables) are examined and

then a general model of sequence data structures that encompasses all six

structures is presented. The piece table method is explained in detail and its

advantages are presented.

The design space of sequence data structures is examined and several

variations on the ones listed above are presented. These sequence data structures

are compared experimentally and evaluated based on a number of criteria. The

experimental comparison is done by implementing each data structure in an editing

simulator and testing it using a synthetic load of many thousands of edits. We also

report on experiments on the sensitivity of the results to variations in the

parameters used to generate the synthetic editing load.


5/91


5

TABLE OF CONTENTS

Description Page No

BONAFIDE CERTIFICATE 03

ABSTRACT 04

Chapter 1 INTRODUCTION 07

Chapter 2 SEQUENCE INTERFACES 11

2.1 The Sequence Abstract Data Type 11

2.2 The C Interface 13

Chapter 3 COMPARING SEQUENCE DATA STRUCTURES 16

Chapter 4 DEFINITIONS 19

Chapter 5 BASIC SEQUENCE DATA STRUCTURES 215.1 The array method (one span in one buffer) 21

5.2 The gap method (two spans in one buffer) 22

5.3 The linked list method 24

Chapter 6 RECURSIVE SEQUENCE DATA STRUCTURES 26

6.1 Determination of span and buffer sizes 27

6.2 The line span method (one span per line) 29

6.3 Fixed size buffers 31

Chapter 7 A GENERAL MODEL OF SEQUENCE DATA 37


6/91


6

STRUCTURES

7.1 The design space for sequence data structures 40

Chapter 8 EXPERIMENTAL COMPARISON OF SEQUENCE DATA

STRUCTURES

43

8.1 Discussion of the timing results 44

Chapter 9 COMPARISON OF SEQUENCE DATA STRUCTURES 46

9.1 Basic sequence data structures 46

9.2 Recursive sequence data structures 47Chapter 10 CONCLUSIONS AND RECOMMENDATIONS 49

REFERENCES 51

APPENDICES 52

1 Sample Screen Shots 52

2 Sample Source Code 57

3 ASCII Chart 90


7/91


7

Chapter 1

INTRODUCTION

The central data structure in a text editor is the one that manages the

sequence of characters that represents the current state of the file that is being

edited. Every text editor requires such a data structure but books on data structures

do not cover data structures for text sequences. Articles on the design of texteditors often discuss the data structure they use but they do not cover the area in a

general way. This article is concerned with such data structures.

Figure 1: Ordered sets


8/91


8

Figure 1 show where sequence data structures fit in with other data

structures. Some ordered sets are ordered by something intrinsic in the items in the

sets (e.g., the value of an integer, the lexicographic position of a string) and the

position of an inserted item depends on its value and the values of the items

already in the set. Such data structures are mainly concerned with fast searching.

Data structures for this type of ordered set have been studied extensively.

The other possibility is for the order to be determined by where the items are

placed when they are inserted into the set. If insert and delete is restricted to thetwo ends of the ordering then you have a deque. For deques, the two basic data

structures are an array (used circularly) and a linked list. Nothing beyond this is

necessary due to the simplicity of the ADT interface to deques. If you can insert

and delete items from anywhere in the ordering you have a sequence. An important

subclass is sequences where reading an item in the sequence (by position number)

are extremely localized.

This is the case for text editors and it is this subclass that is examined in this

paper. A linked list and an array are the two obvious data structures for a sequence.

Neither is suitable for a general purpose text editor (a linked list takes up too much

memory and an array is too slow because it requires too much data movement) but

they provide useful base cases on which to build more complex sequence data

structures. The gap method is a simple extension of an array; it is simply an array

with a gap in the middle where characters are inserted and deleted. Many text

editors use the gap method since it is simple and quite efficient but the demands on

a modern text editor (multiple files, very large files, structured text, sophisticated


9/91


9

undo, virtual memory, etc.) encourage the investigation of more complicated data

structures which might handle these things more effectively.

The more sophisticated sequence data structures keep the sequence as a

recursive sequence of spans of text. The line span method keeps each line together

and keeps an array or a linked list of line pointers. The fixed buffer method keeps a

linked list of fixed size buffers each of which is partially full of text from the

sequence. Both the line span method and the fixed buffer method have been used

for many text editors.

A less commonly used method is the piece table method which keeps the

text as a sequence of \pieces" of text from either the original file and an \added

text" file. This method has many advantages and these will become clear as the

methods are presented in detail and analyzed. A major purpose of this paper is to

describe the piece table method and explain why it is a good data structure for text

sequences.

Looking at methods in detail suggests a general model of sequence data

structures that subsumes them all. Based on examination of this general model I

will propose several new sequence data structures that do not appear to have been

tried before.

It is difficult to analyze these algorithms mathematically so an experimental

approach was taken. I have implemented a text editor simulator and a number of

sequence data structures. Using this as an experimental text bed I have compared

the performance of each of these sequence data structures under a variety of

conditions. Based on these experiments and other considerations I conclude with


10/91


10

recommendations on which sequence data structures are best in various situations.

In almost all cases, either the gap method or the piece table method is the best data

structure.


11/91


11

Chapter 2

SEQUENCE INTERFACES

It is useful to start out with a definition of a sequence and the interface to it.

Since the more sophisticated text sequence data structures are recursive in that they

require a component data structure that maintains a sequence of pointers, I will

formulate the sequence of a general sequence of \items" rather than as a sequenceof characters. This supports discussion of the recursive sequence data structures

better.

2.1 The Sequence Abstract Data Type

A text editor maintains a sequence of characters, by implementing some

variant of the abstract date type Sequence. One definition of the Sequence abstract

data type is:

Domains:

Item | the data type that this is a sequence of (in most cases it will be anASCII character).

Sequence | an ordered set of Items. Position | an index into the Sequence which identifies the Items (in this case

it will be a natural number from 0 to the length of the Sequence minus one).

Syntax:


12/91


12

Empty :Sequence Insert :Sequence x Position x ItemSequence Delete :Sequence x PositionSequence ItemAt :Sequence x PositionItem U {EndOfFile}

Types:

s : Sequence; i : Item; p, p1, p2 : Position;

Axioms:

1. Delete(Empty, p) = Empty

2. Delete(Insert(s; p1; i); p2) =

if p1 < p2 then Insert(Delete(s, p2-1),p1,i)if p1 = p2 then s

if p1 > p2 then Insert(Delete(s, p2); p1-1,i)

3. ItemAt(Empty, p) = EndOfFile

4. ItemAt(Insert(s, p1, i); p2) =

if p1 < p2 then ItemAt(s, p2 -1)

if p1 = p2 then i

if p1 > p2 then ItemAt(s, p2)

The definition of a Sequence is relatively simple. Axiom 1 says that deleting

from an Empty Sequence is a no-op. This could be considered an error. Axiom 2


13/91


13

allows the reduction of a Sequence of Inserts and Deletes to a Sequence containing

only Inserts. This defines a canonical form of a Sequence which is a Sequence of

Inserts on a initial Empty Sequence. Axiom 3 implies that reading outside the

Sequence returns a special EndOfFile item. This also could have been an error.

Axiom 4 defines the semantics of a Sequence by defining what is at each position

of a canonical Sequence.

2.2 The C Interface

How would this translate into C? First some type definitions:typedef ReturnCode int; /* 1 for success, zero or negative for failure */

typedef Position int; /* a position in the sequence */

/* the first item in the sequence is at position 0 */

typedef Item unsigned char; /* they are sequences of eight bit bytes */

typedef struct {

/* To be determined */

/* whatever information we need for the data structures we choose */

} Sequence;

In this interface the only operations that change the Sequence are Insert and

Delete. I am ignoring the error of inserting beyond the end of the existing

sequence.

Sequence Empty( ); ReturnCode Insert( Sequence *sequence, Position position, Item ch ); ReturnCode Delete( Sequence *sequence, Position position ); Item ItemAt( Sequence *sequence, Position position ); - This does not

actually require a pointer to a Sequence since no change to the sequence is


14/91


14

being made but we expect that they will be large structures and should not

be passing them around. I am ignoring error returns (e.g., position out of

range) for the purposes of this discussion. These are easily added if desired.

ReturnCode Close( Sequence *sequence );Many variations are possible. The next few paragraphs discuss some of them.

Any practical interface would allow the sequence to be initialized with the

contexts of a file. In theory this is just the Empty operation followed by an Insert

operation for each character in the initializing file. Of course, this is too in efficientfor a real text editor.2 Instead we would have a NewSequence operation:

Sequence NewSequence( char * file name ); - The sequence is initializedwith the contents of the file whose name is contained in `file name'. Usually

the Delete operation will delete any logically contiguous subsequence

ReturnCode Delete( Sequence *sequence, Position beginPosition, PositionendPosition ); Sometimes the Insert operation will insert a subsequence

instead of just a single character.

ReturnCode Insert( Sequence *sequence, Position position, SequencesequenceToInsert ); Sometimes Copy and Move are separate operations

(instead of being composed of Inserts and Deletes).

ReturnCode Copy( Sequence *sequence, Position fromBegin, PositionfromEnd, Position toPosition );

ReturnCode Move( Sequence *sequence, Position fromBegin, PositionfromEnd, Position toPosition ); A Replace operation that subsumes Insert

and Delete in another possibility.


15/91


15

ReturnCode Replace( Sequence *sequence, Position fromBegin, PositionfromEnd, Sequence sequenceToReplaceItWith ); Finally the ItemAt

procedure could retrieve a subsequence. Although this is the method I use in

my text editor simulator described later.

ReturnCode SequenceAt( Sequence *sequence, Position fromBegin,Position fromEnd, Sequence *returnedSequence );

These variations will not a ect the basic character of the data structure used to

implement the sequence or the comparisons between them that follow. Therefore I

will assume the first interface (Empty, Insert, Delete, IntemAt, and Close).


16/91


16

Chapter 3

COMPARING SEQUENCE DATA STRUCTURES

In order to compare sequence data structures it is necessary to know the relative

frequency of the five basic operations in typical editing.

The NewSequence operation is infrequent. Most sequence data structures

require the NewSequence operation to scan the entire file and convert it to someinternal form. This can be a problem with very large files. To look through a very

large file, it must be read in any case but if the sequence data structure does not

require sequence preprocessing it can interleave the file reading with text editor

use rather than requiring the user to wait for the entire file to be read before editing

can begin. This is an advantage and means the user does not have to worry about

how many files are loaded or how large they are, since the cost of reading the large

file is amortized over its use.

The Close operation is also infrequent and not normally an important factor

in comparing sequence data structures.

The ItemAt operation will, of course, be very frequent, but it will also be

very localized as well because characters in a text editor are almost always

accessed sequentially. When the screen is drawn the characters on the screen are

accessed sequentially beginning with the first character on the screen. If the user

pages forward or backwards the characters are accessed sequentially. Searches

normally begin at the current position and proceed forward or backward


17/91


17

sequentially. Overall, there is almost no random access in a text editor. This means

that, although the ItemAt operation is very frequent, its efficiency in the general

case is not significant. Only the case where a character is requested that is

sequentially before or after the last character accessed needs to be optimized.

To test this hypothesis I instrumented a text editor and looked at the ItemAt

operations in a variety of text editing situations. The number of ItemAt calls that

were sequential (one away from the last ItemAt) was always above 97% and

usually above 99%. The number of ItemAts that were within 20 items of the

previous ItemAt was always over 99%. The average number of characters fromone ItemAt to the next was typically around 2 but sometimes would go up to as

high as 10 or 15.

Overall these experiments verify that the positions of ItemAt calls are

extremely localized.

Thus sequence data structures should be optimized for edits (inserts and

deletes) and sequential character access. Since caching can be used with most data

structures to make sequential access fast, this means that the eiency of inserts and

deletes is paramount. With regard to Inserts and Deletes one would assume that

there would be more Inserts than Deletes but that there would be a mix between

them in typical text editing.

In all sequence data structures, ItemAt is much faster than Inserts and

Deletes. If we consider typical text editing, the display is changed after each Insert

and Delete and this requires some number of ItemAts, often just a few characters


18/91


18

around the edit but possibly the whole rest of the window (if a newline in inserted

or deleted).

The criteria used for comparing sequence data structures are:

The time taken by each operation The paging behavior of each operation The amount of space used by the sequence data structure How easily it fits in with typical file and IO systems The complexity (and space taken by) the implementation

Later I will present timings comparing the basic operations for a range of

sequence data structures. These timings will be taken from example

implementations of the data data structures and a text editor simulator that calls

these implementations.


19/91


19

Chapter 4

DEFINITIONS

An item is the basic element. Usually it will be a character. A sequence is an

ordered set of items. Sequential items in a sequence are said to be logically

contiguous. The sequence data structure will keep the items of the sequence in bu

ers. A buer is a set of sequentially addressed memory locations. A buer containsitems from the sequence but not necessarily in the same order as they appear

logically in the sequence. Sequentially addressed items in a buffer are physically

contiguous.

When a string of items is physically contiguous in a buffer and is also

logically contiguous in the sequence we call them a span. A descriptor is a pointer

to a span. In some cases the buffer is actually part of the descriptor and so no

pointer is necessary. This variation is not important to the design of the data

structures but is more a memory management issue.

Sequence data structures keep spans in buffers and keep enough information

(in terms of descriptors and sequences of descriptors) to piece together the spans to

form the sequence. Buffers can be kept in memory but most sequence data

structures allow buffers to get as large as necessary or allow an unlimited number

of buffers. Thus it is necessary to keep the buffers on disk in disk files. Many

sequence data structures use buffers of unlimited size, that is, their size is


20/91


20

determined by the file contents. This requires the buffer to be a disk file. With

enough disk block caching this can be made as fast as necessary.

The concepts of buffers, spans and descriptors can be found in almost every

sequence data structure. Sequence data structures vary in terms of how these

concepts are used.

If a sequence data structures uses a variable number of descriptors it requires

a recursive sequence data structure to keep track of the sequence of descriptors. In

section 5 we will look at three sequence data structures that use a fixed number of

descriptors and in section 6 we will look at three sequence data structures that use avariable number of descriptors. Section 7 will present a general model of a

sequence data structure that encompasses all these data structures.


21/91


21

Chapter 5

BASIC SEQUENCE DATA STRUCTURES

All three of the sequence data structures in this section use a fixed number (either

one, two or three) of external descriptors.

5.1 The array method (one span in one buffer)

This is the most obvious sequence data structure. In this method there is one

span and one buffer. Two descriptors are needed: one for the span and one for the

buffer. (See Figure 24) The buffer contains the items of the sequence in physically

contiguous order. Deletes are handled by moving all the items following the

deleted item to fill in the hole left by the deleted item. Inserts are handled by

moving all the items that will follow the item to be inserted in order to make room

for the new item. ItemAt is an array reference. The buffer can be extended as much

as necessary to hold the data.


22/91


22

Figure 2: The array method

Clearly this would not be an efficient data structure if a lot of editing was to

be done are large files. It is a useful base case and is a reasonable choice in

situations where few inserts and deletes are made (e.g., a read-only sequence) or

the sequences are relatively small (e.g., a one-line text editor).

This data structure is sometimes used to hold the sequence of descriptors in

the more complex sequence data structure, for example, an array of line pointers

(see section 6).

5.2 The gap method (two spans in one buffer)

The gap method is only a little more complex than the array method but it is more

much efficient. In this method you have one large buffer that holds all the text but


23/91


23

there is a gap in the middle of the buffer that does not contain valid text. (See

Figure 3) Three descriptors are needed: one for each span and one for the gap. The

gap will be at the text \cursor" or \point" where all text

Figure 3: The gap (or two span) method

editing operations take place. Inserts are handled by using up one of the places in

the gap and incrementing the length of the first descriptor (or decrementing the

begin pointer of the second descriptor). Deletes are handled by decrementing the

length of the first descriptor (or incrementing the begin pointer of the second

descriptor). ItemAt is a test (first or second span?) and an array reference.

When the cursor moves the gap is also moved so if the cursor moves 100

items forward then 100 items have to bemoved from the second span to the first

span (and the descriptors adjusted). Since most cursor moves are local, not that

many items have to be moved in most cases.

Actually the gap does not need to move every time the cursor is moved.

When an editing operation is requested then the gap is moved. This way moving


24/91


24

the cursor around the file while paging or searching will not cause unnecessary gap

moves.

If the gap fills up, the second span is moved in the buffer to create a new

gap. There must be an algorithm to determine the new gap size. As in the array

case, the buffer can be extended to any length. In practice, it is usually increased in

size by some fixed increment or by some fixed percentage of the current buffer

size and this becomes the size of the new gap. With virtual memory we can make

the buffer so large that it is unlikely to fill up. And with some help from the

operating system, we can expand the gap without actually moving any data. Thismethod does use up large quantities of virtual address space, however.

This method is simple and surprisingly efficient. The gap method is also called the

split buffer method and is discussed in [9].

5.3 The linked list method

At the other extreme is the linked list method which has a span for each item

in the sequence and the span is kept in the descriptor. This requires a large (and file

content dependent) number of descriptors which must be sequenced. If we link the

descriptors together into a linked list then we really only have to remember one

descriptor, the head of the list. (See Figure 4)

Inserts and Deletes are fast and easy (just move some pointers) but ItemAt is

more expensive than the array or gap methods since several pointers may have to

be followed.


25/91


25

Figure 4: The linked list method

The linked list method uses a lot of extra space and so is not appropriate for

a large sequence but is frequently used as a way of implementing a sequence of

descriptors required in the more complex sequence data structures. In fact, it is the

most common method for that. One could think of the array method as a special

case of the linked list method. An array is really a linked list with the links

implicit, that is, the link is computed to be the next physically sequential address

after the current item. In this view, the linked list method the only primitive

sequence data type. The array method is a special case if linked list method and the

gap method is a variation on the array method.


26/91


26

Chapter 6

RECURSIVE SEQUENCE DATA STRUCTURES

In this section three more sequence data structures are presented. Each of

these methods requires a variable number of descriptors and so must recursively

use a (usually simpler) sequence data structure to implement the sequence of

descriptors.

6.1 Determination of span and buffer sizes

The basic cases either used a fixed number of spans (one for the array

method and two for the gap method) or spans of size 1 (the linked list method).

The recursive methods described in this section use a variable number of spans.

How is the span size determined? There are two possibilities: either the spans are

determined by the contents of the file or by the text editor independent of the

sequence contents. The main example of content determined spans is to have one

span per logical line of

text.

If the editor is allowed to determine span sizes to its own best advantage the

second issue is the size of the buffers. Again there are two possibilities: fixed size

or variable size. Fixed size buffers are easier to manage and are often chosen to be

some (low) multiple of the page size or the disk block size. A fixed size buffer

implies a maximum span size (that is, the buffer size).


27/91


27

This table show the possibilities:

Fixed size buffers Variable size buffers

Content determined spans Line size is limited Line spans

Editor determined spans Fixed size buffers Piece tables

These are the three methods that will be examined in this section.

6.2 The line span method (one span per line)

Since most text editors do many operations on lines it seems reasonable to use a

data structure that is line oriented so line operations can be handled efficiently. The

line span method uses a descriptor for each line. Often one large buffer is used to

hold all the line spans. New lines are appended to the end of the used part of the

buffer. See Figure 5.

Figure 5: The line spans method


28/91


28

Line deletes are handled by deleting the line descriptor. Deleting characters

within a line involves moving the rest of the characters in the line up to fill the gap.

Since any one line is probably not that long this is reasonably efficient.

Line inserts are handled by adding a line descriptor. Inserting characters in a

line involves copying the initial part (before the insert) of the line buffer to new

space allocated for the line, adding the characters to be inserted, copying the rest of

the line and pointing the descriptor at the new line. Multiple line inserts and deletes

are combinations of these operations. Caching can make this all fairly efficient.

Usually new space is allocated at the end of the buffer and the space occupied by

deleted or changed lines is not reused since the effort of dynamic memory

allocation is not worth the trouble. A disk file the continues to grow at the end can

be handled quite efficiently by most file systems.

This method uses as many descriptors as there are lines in the file, that is, a

variable number of descriptors hence there is a recursive problem of keeping these

descriptors in a sequence. Typically one of the basic methods described in section

5 is used to maintain the sequence of line descriptors.

These simpler methods are acceptable since the number of line descriptors is

much smaller than the number of characters in the file being edited. The linked list

method allows efficient insertions and deletions but requires the management of

list nodes. This method is acceptable for a line oriented text editor but is not as

common these days since strict line orientation is seen as too restrictive. It does

require preprocessing of the entire buffer before you can begin editing since the

line descriptors have to be set up.


29/91


29

6.3 Fixed size buffers

In the line spans method the partitioning of the sequence into spans is

determined by the contents of the text file, in particular the line structure. Another

approach is for the text editor to decide the partitioning of the sequence into spans

based on the efficiency of buffer use. Since fixed size blocks are more efficient to

deal with the file can be divided into a sequence of fixed size buffers. If the buffers

were required to be always full, a great deal of time would be spent rearranging

data in the buffers, therefore, each block has a maximum size but will usuallycontain fewer actual items from the sequence so there is room for inserts and

deletes, which will usually affect only one buffer.(See Figure 6.)

Figure 6: Fixed size buffers

The disk block size (or some multiple of the disk block size) is usually the

most efficient choice for the fixed size buffers since then the editor can do its own


30/91


30

disk management more easily and not depend on the virtual memory system or the

file system for efficient use of the disk.

Usually a lower bound on the number of items in a buffer is set (half the

buffer size is a common choice). This requires moving items between buffers and

occasionally merging two buffers to prevent the accumulation of large numbers of

buffers. There are four problems with letting too many buffers accumulate:

wasted space in the buffers, the recursive sequence of descriptors gets too large, the probability that an edit will be confined to one buffer is reduced, and ItemAt caching is less effective.

As an example of fixed size buffers, suppose disk blocks are 4K bytes long.

Each buffer will be 4K bytes long and will contain a span of length from2K to 4K

bytes. Each buffer is handled using the array method, that is, inserts and deletes are

done by moving the items up or down in the buffer. Typically an edit will only

affect one buffer but if a buffer fills up it is split into two buffers and if the span in

a buffer falls below 2K bytes then items are moved into it from an adjacent buffer

or it is coalesced with an adjacent buffer.

Each descriptor points to a buffer and contains the length of the span in the

buffer. The fixed size buffer method also requires a recursive sequence for the

descriptors. This could be any of the basic methods but most examples from the

literature use a linked list of descriptors. The loose packing allows small changes

to be made within the buffers and the fact that the buffers are linked makes it easy

to add and delete buffers.


31/91


31

6.4 The piece table method

In the piece table method the sizes of the spans are as large as possible but

are split as a result of the editing that is done on the sequence. The sequence starts

out as one big span and that gets divided up as insertions and deletions are made.

We call each span a piece (of the sequence) and its descriptor is called a piece

descriptor. The sequence of piece descriptors is called the piece table.

The piece table method uses two buffers. The first (the file buffer) contains

the original contents of the file being edited. It is of fixed size and is read-only.The second (the add buffer) is another buffer that can grow without bound and is

append-only. All new items are placed in this buffer.

Each piece descriptor points to a span in the file buffer or in the add buffer.

Thus a descriptor must contain three pieces of information: which buffer (a

boolean), an offset into that buffer (a non-negative integer) and a length (a positive

integer5). Figure 7 shows a piece table structure. The file consists of five pieces.

Initially there is only one piece descriptor which points to the entire file

buffer. A delete is handled by splitting a piece into two pieces. One of these pieces

points to the items in the old piece before the deleted item and the other piece

points to the items after the deleted item. A special case occurs when the deleted

item is at the beginning or end of the piece in which case we simply adjust the

pointer or the piece length.

An insert is handled by splitting the piece into three pieces. The first piece

points to the items of the old piece before the inserted item. The third piece points


32/91


32

to the items of the old piece after the inserted item. The inserted item is appended

to the end of the add file and the second piece points to this item. Again there are

special cases for in insertion at the beginning or end of a piece.

If several items are inserted in a row, the inserted items are combined into a

single piece rather than using a separate piece for each item inserted. Figures 8, 9

and 10 show the effect of a delete and an insert in a piece table. Figure 8 shows the

5Normally zero length pieces are eliminated.

Figure 7: The piece table method

piece table after the file is read in initially. This is a very short file containing only

20 characters. Figure 9 shows the piece table after the word \large " has been

deleted. Figure 10 shows the piece table after the word \English " has been added.

Notice that, in general, a delete increases the number of pieces in the piece table byone and an insert increases the number of pieces in the piece table by two.


33/91


33

Figure 8: A piece table before any edits

Figure 9: A piece table after a delete

Let us look at another example. Suppose we start with a new file that is 1000 bytes

long and make the following edits.


34/91


34

Figure 10: A piece table after a delete and an insert

1. Six characters inserted (typed in) after character 900.

2. Character 600 deleted.

3. Five characters inserted (typed in) after character 500.

The piece table after these edits will look like this:

file start length logical offset

orig 0 500 0

add 6 5 500

Orig 500 100 505

Orig 601 300 605

add 0 6 905

Orig 901 100 911


35/91


35

The \logical offset" column does not actually exist in the piece table but can be

computed from it (it is the running total of the lengths). These logical offsets are

not kept in the piece table because they would all have to be updated after each

edit.

The piece table method has several advantages.

The original file is never changed so it can be a read-only file. This isadvantageous for caching systems since the data never changes.

The add file is append-only and so, once written, it never changes either. Items never move once they have been written into a bu

er so they can be pointed to by other data structures working together with

the piece table.

Undo is made much easier by the fact that items are never written over. It isnever necessary to save deleted or changed items. Undo is just a matter of

keeping the right piece descriptors around. Unlimited undoes can be easily

supported.

No file preprocessing is required. The initial piece can be set up onlyknowing the length of the original file, information that can be quickly and

easily obtained from the file system. Thus the size of the file does not affect

the startup time.

The amount of memory used is a function of the number of edits not the sizeof the file. Thus edits on very large files will be quite efficient.

The above description implies that a sequence must start out as one big piece

and only inserts and deleted can add pieces. Following this rule keeps the number

of pieces at a minimum and the fewer pieces there are the more efficient the


36/91


36

ItemAt operations are. But the text editor is free to split pieces at other times to suit

its purposes.

For example, a word processor needs to keep formatting information as well

as the text sequence itself. This formatting information can be kept as a tree where

the leaves of the tree are pieces. A word in bold face would be kept as a separate

piece so it could be pointed to by a \bold" format node in the format tree. The text

editor Lara [8] uses piece tables this way.

As another example, suppose the text editor implements hypertext linksbetween any two spans of text in the sequence. The span at each end of the link can

be isolated in a separate piece and the link data structure would point to these two

pieces.6 This technique is used in the Pastiche text editor [5] for fine-grained

hypertext.

These techniques work because piece table descriptors and items do not

move when edits occur and so these tree structures will be maintained with little

extra work even if the sequence is edited heavily.

Overall, the piece table is an excellent data structure for sequences and is

normally the data structure of choice. Caching can be used to speed up this data

structure so it is competitive with other data structures for sequences.


37/91


37

Chapter 7

A GENERAL MODEL OF SEQUENCE DATA STRUCTURES

The discussions in the previous two sections suggest that it is possible to

characterize all sequence data structures given the following assumptions:

1. The computer has main memory with sequentially addressed cells and has disk

memory with sequentially addressed blocks of sequentially addressed cells.2. Items have a fixed size in memory and on disk.

3. Items are stored directly in the memory or on disk, that is, they are not encoded.

Hence every item must exist somewhere in memory or on disk.

4. The main memory is of limited size and hence cannot hold all the items in large

sequences. Even if virtual memory is provided there will be an upper bound on it

in any actual system configuration. In addition, most sophisticated sequence data

structures do not rely on virtual memory to efficiently shuttle sequence data

between main memory and the disk. Usually the program can do better since it

understands exactly how the data is accessed.

5. The environment provides dynamic memory allocation (although some sequence

data structures will do their own and not use the environment's dynamic memory

allocation).

6. The environment provides a reasonably efficient file system for item storage that

provides files of (for all practical purposes) unlimited size.

The following concepts are used in the model.


38/91


38

An item is the basic component of our sequences. In most cases it will be acharacter or byte (but it might be a descriptor in a recursive sequence data

structure).

A sequence is an ordered set of items. During editing, items will be insertedinto and deleted from the sequence. The items in the sequence are logically

contiguous.

A buffer is some contiguous space in main memory or on disk that cancontain items (Assumption 4). All items in the sequence are kept in buffers

(Assumption 3). Consecutive items in a buffer are physically contiguous.

A span is one or more items that are logically contiguous in the sequenceand are also physically contiguous in the buffer. (Assumption 1)

A descriptor is a data structure that represents a span. Usually the descriptorcontains a pointer to the span but it is also possible for the descriptor to

contain the buffer that contains the span.

A sequence data structure is either

* A basic sequence data structure which is one of:

- An array.

- An array with a gap.

- A linked list of items.

- A more complex linked structure of items.

* A recursive sequence data structure which comprises:

- Zero or more buffers each of which contains zero or more spans.

- A (recursive) sequence data structure of zero or more descriptors to spans

in the buffers.


39/91


39

This model is recursive in that to implement a sequence of items it is

necessary to implement a sequence of descriptors. This recursion is usually only

one step, that is, the sequence of descriptors in implemented with a basic sequence

data structure. The deficiencies of the basic sequence data structures for

implementing character sequences are less critical for sequences of descriptors

since there are usually far fewer descriptors and so sophisticated methods are not

required.


40/91


40

7.1 The design space for sequence data structures

This model can be used to examine the design space for sequence data

structures. The four basic methods seem to cover most of the useful cases. A basic

method is one where the number of descriptors is fixed. The array method uses two

descriptors and the gap method uses three. We could generalize this to a two gap

method using four descriptors and so on but it is not clear that there is any

advantage in doing that. The linked list method is basic even though it uses a

descriptor for every item in the sequence. The items are linked together so one

descriptor serves to define the sequence. As soon as we try to put two or moreitems into a descriptor it becomes an instance of the recursive fixed size buffer

method.

Using a more complex linked structure than a linked list will make reading

items (ItemAt) more efficient but makes Insert and Delete less efficient. Since

ItemAt access is nearly always sequential this tradeoff is not advantageous.

The recursive methods divide into two types. The first uses fixed size buffers

and the second uses variable sized buffers. There are two issues in the fixed size

buffer method The first is the method used in maintaining the items in each fixed

size buffer and the second is the method of maintaining the sequence of buffers.

The items in a single (fixed size) buffer are a sequence. The typical

implementation keeps them as an array at the beginning of the buffer. In general

the gap method is superior to the array method, thus it might be useful to keep

characters in a single buffer using the gap method where the gap is kept at the last

edit point. This should halve the expected number of bytes moved for one edit


41/91


41

at little additional program cost. If the edits exhibit locality (as they typically do)

the advantage will be greater.

A linked list is usually used to implement the sequence of descriptors (buffer

pointers) and this is generally a good method because of the ease of inserting and

deleting descriptors (which are frequent operations). One problem is the extra

space used by the links but this is only a problem if there are lots of descriptors.

With 1K blocks and editing 5M of files, this is still only 5K descriptors with 10K

links. Thus this problem does not seem to be significant in practical cases.

Another issue is virtual memory performance. Linked lists do not exhibitgood locality. If the descriptors were kept using the gap method, locality would be

improved considerably. Assuming a descriptor takes four words (one for the

pointer to the span, one for the length of the span and two for the links) the 5K

descriptors would consume 20K words or 80K bytes (assuming four bytes per

word). Again this is small enough that virtual memory performance would

probably not be a significant problem, but if it were, the gap method could improve

things.

The piece table method uses fewer descriptors than the fixed buffer method

initially (before any editing) but heavy editing can create numerous pieces. There

are advantages to maintaining the pieces since it allows easy implementation of

unlimited undo. In addition, pieces can be used to record structures in the text (as

described in section 7.1). As a consequence there might be many pieces. This

means the problems presented above in the discussion of fixed size bu

ers (space consumed by links and virtual memory performance) might be

significant here. That is, the gap method of keeping the piece table might be

preferred.


42/91


42

Another generalization would be to use two levels of recursion, that is, to

use one of the recursive sequence data structures to implement the sequence of

descriptors. The recursive methods are beneficial when the sequences are quite

large so we might use a two-level recursive method if the number of descriptors

was quite large. As we mentioned above, this might be the case with a piece table.

So there are four new variations that we have uncovered in this analysis.

The fixed size buffers method using the gap method inside each buffer.

The fixed size buffers method using the gap method for descriptors. Thismight be better if virtual memory performance was a consideration.

The piece table method using the gap method for descriptors. This might bebetter if virtual memory performance was a consideration.

A two level recursive method that uses a recursive method to maintain thesequence of descriptors. This would suitable if there is a very large number

of descriptors.


43/91


43

Chapter 8

EXPERIMENTAL COMPARISON OF SEQUENCE DATA STRUCTURES

In order to compare the performance of these data structures I implemented

each of them and a simulator program that would simulate typical editing behavior.

The simulator has a number of parameters that I will discuss in presenting the

results. I ran these tests on a several machines, under two compilers (cc and gcc)

and with maximum optimization. The results for the different architectures andcompilers were all similar.

The measurements were made with the following parameters (except where

one of these parameters is being experimentally varied):

Sequence length of 8000 characters Block size of 1024 characters Fixed buffer methods keep buffers at least half full (from 512 to 1024

characters)

The location of 98% of the edits is normally distributed around the locationof the previous edit with a standard deviation of 25

The location of 2% of the edits is uniformly distributed over the entiresequence

After each edit, 25 characters on each side of the edit location are accessed Every 250 edits the entire file is scanned sequentially with ItemAts

The sequence data structures measured (and their abbreviated names) is:


44/91


44

Null | the null method that does nothing. This is for comparison since itmeasures the overhead of the procedure calls.

Arr | The array method. List | The list method. Gap | The gap method. FsbA - The fixed size buffer method with the array method used to maintain

the sequence inside each buffer.

FsbAOpt - The fixed size buffer method with the array method used tomaintain the sequence inside each buffer and with ItemAt optimized.

FsbG - The fixed size buffer method with the gap method used to maintainthe sequence inside each buffer.

Piece - The piece table method. PieceOpt - The piece table method with ItemAt optimized

8.1 Discussion of the timing results

Only time will be considered in the following discussion. In the next section

these sequence data structures will be compared on a range of criteria. Remember

that the ItemAt times assume a very high locality of reference. If the references

were not local the ItemAt times would be much higher and the ratios would be

different. The Arr method has the fastest ItemAt but is terrible for Insert and

Delete. It is not a practical method.

The Gap method is nearly as fast for ItemAt but is also quite fast for Insert

and Delete. The Gap method is a generally fast method but some other problems as


45/91


45

will be seen in the next section. The List method has the fastest Inserts and Deletes

by far but a fairly slow ItemAt. The List method uses a lot of extra space (two

pointers per item). It is useful for in-memory sequences with large items.

The FsbA method is slow for Inserts and Deletes but its ItemAt can be made

quite fast with some simple caching. It is possible to reduce the ItemAt time even

further by making it an inline operation. The FsbG method reduces the Insert and

Delete times radically but the ItemAt time is a bit higher. The equivalent ItemAt

caching would be a little more complicated and a little slower.

The Piece method has excellent Insert and Delete times (only slightly slower

than the linked list method) but its ItemAt time is quite slow even with simple

caching. More complex ItemAt caching that avoids procedure calls is necessary

when using the Piece method. The idea of the caching is simple. Instead of

requesting an item you request the address and length of the longest span starting

at a particular position. Then all the items in the span can be accessed with a

pointer dereference (and increment). This will bring the ItemAt time down to the

level of the array and gap methods.


46/91


46

Chapter 9

COMPARISON OF SEQUENCE DATA STRUCTURES

9.1 Basic sequence data structures

The array method is just too slow for Inserts and Deletes. In addition its

paging behavior is very bad (it touches lots of pages). It might be useful for a one

line text editor or a text editor where few edits are expected.

The linked list method takes far too much space for long sequences. It is

useful for short in-memory sequences where the edits and Item Ats are not

localized. The linked list method has one great advantage and that is that the items

never move in memory. This makes it easy to embed a list sequence in another

data structures (such as a tree to provide fast searching). Of the basic sequence data

structures, only the gap method can be seriously considered for a general purpose

text editor. The gap method has one major problem and that is when the gap fills

up. This will require lots of item movement to reestablish the gap. REDO: be more

positive about the gap method.

Array Gap Linked List

Time Slow Fast Fast

Space Low Low Very high

Ease of programming Easy Easy Easy

Size of code Low (39 lines) Low (59 lines) Medium (79 lines)

The lines of code measure was taken from the sample implementations.


47/91


47

9.2 Recursive sequence data structures

The line span method is an older method that has little to recommend it in

modern text editors. The Fsb methods and the Piece method are both good choices

for professional-quality text editors.

Both methods:

are acceptably fast if caching is used handle large files without slowing down handle many of files without slowing down are efficient in their use of space allow efficient buffer management provide excellent locality of buffer use

Overall however the piece table method seems to be better. It has a number of

advantages:

All buffers except one are read-only and the other buffer is append-only.(Thus the buffers are easy to cache and work well over a network.)

The code is quite simple. (The code for Fsb is complicated by the need tobalance buffers.)

Huge files load as quickly as tiny files. (No preprocessing is required forlarge files so they load quickly.)

Disk buffers are always full of data (rather than 3=4 full|on the average|asthey are in the Fixed buffer method). Thus disk caching is more efficient.

Items never move once they are placed.


48/91


48

The last point is important. If the piece sequence is kept as a list then the

pieces never move either. This allows the sequence to be pointed to by other data

structures that record additional information about the sequence. For example it is

fairly easy to implement \sticky" pointers (that is, pointers that point to the content

of the sequence rather than relative sequence positions). For example we might

want to attach a sticky pointer to the beginning of a procedure definition.

As another example, the text editor Lara [8] also formats its text. It keeps the

formatting state in a tree structure where pieces are the leaves of the tree. Inserts

and deletes require very little bookkeeping because the items and pieces nevermove around when you use a piece table.

FSB-Array FSB-Gap Piece

Time Fairly fast Fast

(with caching)

Fairly fast

(with caching)

Space Low Low Low

Ease of

programming

Hard Hard Medium

Size of code Medium to large

(218 lines)

Large

(301 lines)

Medium

(162 lines)


49/91


49

Chapter 10

CONCLUSIONS AND RECOMMENDATIONS

This paper has two purposes:

to examine systematically data structures for sequences and to present the piece table method and describe its advantages.

A review of the literature shows that there are only a few different data

structures for text sequences that have been used in text editors. A careful

examination of the design space showed that there really are not that many

fundamental types of sequence data structures.

Sequence data structures are divided into two categories. Basic sequence

data structures (array, gap and linked list) and recursive sequence data structures

(line spans, fixed size buffers and piece tables).

The array method is the obvious one: keep the text in an array of characters.

The gap method is similar but it keeps a gap in the middle of the array at the text

editor insertion point. The linked list method keeps the characters on a linked list.

The recursive methods keep the text in a number of separate \spans" and

keeps track of a sequence of pointers to these spans. The line span method uses a

span for each line. The fixed size buffer method keeps a sequence of fixed size


50/91


50

buffers each of which contains one span. The piece table method uses spans of any

size either in the original file or in a file for added characters.

In examining these data structures a general model of sequence data

structures was formulated and used this model and the examples were used to

discover several new variations for sequence data structures that might improve

performance in some situations.

A series of experiments was performed on these data structures to determine

their relative performance and they were compared on a variety of criteriaincluding time. The main conclusion is that the piece table structure is the best data

structure for text sequences although some of the other methods might be useful in

certain cases. The piece table method has a number of advantages and is an

especially good method for text with additional structure. Thus it would be the best

choice for a word processor or a editor with hypertext facilities.


51/91


51

REFERENCES

1. C. C. Charlton and P. H. Leng. Editors: two for the price of one.Software|Practice and Experience, 11:195{202, 1981.

2. Computer System Research Group, EECS, University of California,Berkeley, CA 94720. UNIX User's Reference Manual (4.3 Berkeley

Software Distribution), April 1986.

3. Computer System Research Group, EECS, University of California,Berkeley, CA 94720. UNIX User's Supplementary Documents (4.3 Berkeley

Software Distribution), April 1986.4. C. Crowley. The Point text editor for X. Technical Report CS91-3,

University of New Mexico, 1991.

5. C. Crowley. Using fine-grained hypertext for recording and viewingprogram structures. Technical Report CS91-2, University of New Mexico,

1991.

6. B. Elliot. Design of a simple screen editor. Software|Practice andExperience, 12:375{384, 1982.

7. C. W. Fraser and B. Krishnamurthy. Live text. Software|Practice andExperience, 20(8):851{ 858, August 1990.

8. J. Gutknecht. Concepts of the text editor Lara. Communications of theACM, 28(9):942{960, September 1985.

9. J.Kyle. Split buers, patched links, and half-transpositions. ComputerLanguage, pages 67{70, December 1989.

10.B. W. Lampson. Bravo Manual in the Alto User's Handbook. Xerox PaloAlto Research Center, Palo Alto, CA, 1976.


52/91


52

Appendix 1

SAMPLE SCREEN SHOTS


53/91


53


54/91


54


55/91


55


56/91


56


57/91


57

Appendix 2

SOURCE CODE

Cursor.c/ * posi t i ons cur sor at t he begi nni ng of t he f i l e */st ar t _ f i l e( ){

i nt di spl ay = YES ;

/ * i f f i r st page of t he f i l e i s bei ng cur r ent l y di spl ayed */i f ( cur scr == st art l oc && ski p == 0 )

di spl ay = NO ;

/ * reset var i abl es */cur r = 2 ;curc = 1 ;l ogr = 1 ;l ogc = 1 ;ski p = 0 ;cur scr = st ar t l oc ;cur r ow = st art l oc ;

/ * di spl ay f i rs t page of f i l e, i f necessary * /i f ( di spl ay == YES ){

menubox ( 2, 1, 22, 78, 27, NO_SHADOW ) ;di spl ayscr een ( cur scr ) ;

}

/ * di spl ay cur r ent cur sor l ocat i on */wr i t er ow( ) ;wr i tecol ( ) ;

}

/ * posi t i ons cur sor at t he end of t he f i l e */end_f i l e( ){

char *t emp ;

i nt i , status ;s i ze ( 32, 0 ) ; / * hi de cur sor */

/ * count t he t ot al number of l i nes i n t he f i l e */l ogr = 1 ;cur r = 2 ;t emp = st art l oc ;whi l e ( t emp != endl oc ){


58/91


58

i f ( *t emp == ' \ n' )cur r ++ ;

t emp++ ;}

/ * set up cur r ent scr een and cur r ent r ow poi nt er s */cur scr = endl oc ;cur r ow = endl oc ;

/ * di spl ay t he l ast page of t he f i l e */page_up ( 0 ) ;

/ * posi t i on cur sor 20 l i nes af t er cur r ent r ow */f or ( i = 0 ; i < 20 ; i ++ ){

st at us = down_l i ne ( 0 ) ;

/ * i f end of f i l e i s reached */i f ( st at us == 1 )

br eak ;

}

/ * posi t i on cur sor at t he end of t he l ast l i ne */end_l i ne( ) ;

si ze ( 5, 7 ) ; / * show cur sor */

/ * di spl ay cur r ent cur sor r ow */wr i t er ow( ) ;

}

/ * posi t i ons cur sor i n t he f i r st r ow of cur r ent scr een */top_screen()

{ s i ze ( 32, 0 ) ;

/ * go up unt i l t he f i r st r ow i s encount er ed */whi l e ( l ogr ! = 1 )

up_l i ne ( 0 ) ;

/ * di spl ay cur r ent cur sor r ow */wr i t er ow( ) ;

s i ze ( 5, 7 ) ;}

/ * posi t i ons cur sor i n t he l ast r ow of cur r ent screen */

bot t om_scr een( ){

i nt st at us ;

s i ze ( 32, 0 ) ;

/ * go down unt i l t he l ast r ow or end of f i l e i s encount er ed */whi l e ( l ogr ! = 21 ){

st at us = down_l i ne ( 0 ) ;


59/91


59

/ * i f end of f i l e i s reached */i f ( st at us == 1 )

br eak ;}

wr i t er ow( ) ;s i ze ( 5, 7 ) ;

}

/ * posi t i ons cur sor at t he begi nni ng of cur r ent r ow */star t _ l i ne( ){

/ * i f t her e exi st char acter s t o t he l ef t of cur r ent l y di spl ayed l i ne */i f ( ski p ! = 0 ){

ski p = 0 ;menubox ( 2, 1, 22, 78, 27, NO_SHADOW ) ;di spl ayscr een ( cur scr ) ;

}

l ogc = 1 ;curc = 1 ;

/ * di spl ay cur r ent cur sor col umn */wr i tecol ( ) ;

}

/ * posi t i ons cur sor at t he end of cur r ent r ow */end_l i ne( ){

char *t emp ;i nt count , di spl ay = YES ;

t emp = cur r ow ;count = 1 ;

/ * count t he number of char act er s i n cur r ent l i ne */whi l e ( *t emp ! = ' \ n' ){

/ * i f end of f i l e i s encount er ed */i f ( t emp >= endl oc )

br eak ;

i f ( *t emp == ' \ t ' )count += 8 ;

el se

count ++ ;

t emp++ ;}

/ * backt r ace across t he \ t ' s and spaces whi ch may be pr esent at t he endof t he l i ne * /

whi l e ( * ( t emp - 1 ) == ' \ t ' | | * ( t emp - 1 ) == ' ' ){

i f ( * ( t emp - 1 ) == ' \ t ' )


60/91


60

count - = 8 ;el se

count ++ ;

t emp- - ;}

/ * i f t he number of char act er s i n t he l i ne i s l ess t han 78 */i f ( count


61/91


61

}

t emp++ ;}


t emp- - ;

/ * i f char acters at cur r ent cur sor l ocat i on and t o i t s l ef t ar eal phanumer i c */

condi t i on1 = i sal num ( *t emp ) && i sal num ( * ( t emp - 1 ) ) ;

/ * i f char act er at cur r ent cur sor l ocat i on i s al phanumer i c and t hepr evi ous char act er i s not al phanumer i c */

condi t i on2 = i sal num ( *t emp ) && ! i sal num ( * ( t emp - 1 ) ) ;

/ * i f char act er at cur r ent cur sor l ocat i on i s not al phanumer i c */condi t i on3 = ! i sal num ( *t emp ) ;

i f ( *t emp == ' \ n' )t emp- - ;

i f ( condi t i on2 )t emp- - ;

i f ( condi t i on1 ){

/ * move l ef t so l ong as al phanumer i c charact er s are f ound */whi l e ( i sal num ( *t emp ) ){

i f ( t emp == st ar t l oc )br eak ;

t emp- - ;count ++ ;

}}

i f ( condi t i on2 | | condi t i on3 ){

/ * move l ef t t i l l an al phanumer i c char act er i s f ound */whi l e ( ! ( i sal num ( *t emp ) ) ){

i f ( t emp


62/91


62

r et ur n ;}


}

/ * move l ef t t i l l a non- al phanumer i c char act er i s f ound */whi l e ( i sal num ( *t emp ) ){

i f ( t emp == st ar t l oc )br eak ;


}}

/ * i f begi nni ng of f i l e i s encount er ed */i f ( t emp == st art l oc )

{l ogc = 1 ;curc = 1 ;

}el se{

l ogc - = count ;cur c - = count ;

/ * i f screen needs t o be scrol l ed hor i zont al l y */i f ( cur c > 78 ){

l ogc = 78 ;

ski p = cur c - 78 ;menubox ( 2, 1, 22, 78, 27, NO_SHADOW ) ;di spl ayscr een ( cur scr ) ;

}}

wr i tecol ( ) ;wr i t er ow( ) ;

}

/ * posi t i ons cur sor one wor d t o t he r i ght */wor d_r i ght ( ){

char *t emp ;

i nt col , count = 0 ;

/ * i ncrement `t emp' t o poi nt t o char act er at cur r ent cur sor l ocat i on */t emp = cur r ow ;f or ( col = 1 ; col < cur c ; col ++ ){


r et ur n ;


63/91


63

i f ( *t emp == ' \ t ' )col += 7 ;

/ * i f end of l i ne i s encount er ed bef or e cur r ent cur sor col umn */i f ( *t emp == ' \ n' )

br eak ;

t emp++ ;}

i f ( t emp >= endl oc )r et ur n ;

/ * i f cur sor i s at t he end of cur r ent l i ne */i f ( *t emp == ' \ n' ){

/ * cont i nue t i l l an al phanumer i c char act er i s f ound */whi l e ( ! ( i sal num ( *t emp ) ) ){

/ * i f end of f i l e i s encount er ed */

i f ( t emp >= endl oc )br eak ;

i f ( *t emp == ' \ t ' )count += 7 ;

/ * i f end of l i ne i s encount er ed */i f ( *t emp == ' \ n' ){

/ * posi t i on cur sor i n t he next l i ne */down_l i ne ( 0 ) ;

/ * posi t i on cur sor at t he begi nni ng of t he l i ne */

s tar t _ l i ne( ) ;t emp = cur r ow ;count = 0 ;cont i nue ;

}

t emp++ ;count ++ ;

}}el se{

/ * t her e exi st s a wor d t o t he r i ght of cur sor */

count = 0 ;

/ * move r i ght so l ong as al phanumeri c charact ers are f ound */whi l e ( i sal num ( *t emp ) ){


t emp++ ;count ++ ;


64/91


64

}

/ * move r i ght t i l l a non- al phanumer i c char act er or end of l i ne i smet */

whi l e ( ! ( i sal num ( *t emp ) | | *t emp == ' \ n' ) ){


i f ( *t emp == ' \ t ' )count += 7 ;

t emp++ ;count ++ ;

}}

l ogc += count ;cur c += count ;

/ * i f scr een needs t o be scr ol l ed hor i zont al l y */i f ( cur c > 78 ){

l ogc = 78 ;ski p = cur c - 78 ;menubox ( 2, 1, 22, 78, 27, NO_SHADOW ) ;di spl ayscr een ( cur scr ) ;

}


}

/ * di spl ays pr evi ous f i l e page */page_up ( i nt di spl ay ){

i nt r ow ;

/ * i f f i r st page i s cur r ent l y di spl ayed */i f ( cur scr == st ar t l oc )

r et ur n ;

/ * posi t i on t he `cur scr ' poi nt er 20 l i nes bef or e */f or ( r ow = 1 ; r ow


65/91


65

curc = 1 ;

menubox ( 2, 1, 22, 78, 27, NO_SHADOW ) ;di spl ayscr een ( cur scr ) ;

i f ( di spl ay ){


}

r et ur n ;}

/ * go t o t he begi nni ng of pr evi ous l i ne */whi l e ( *cur scr ! = ' \ n' ){

i f ( cur scr


66/91


66

}

/ * di spl ays next f i l e page */page_down( ){

char *p ;i nt r ow = 1, i , col ;

/ * posi t i on t he `cur scr' poi nt er 20 l i nes hence */p = cur scr ;f or ( r ow = 1 ; r ow = endl oc ){

cur scr = p ;r et ur n ;

}

curscr ++ ;}

i f ( cur scr >= endl oc ){

cur scr = p ;r et ur n ;

}

/ * go t o t he begi nni ng of next l i ne */curscr ++ ;

}/ * di spl ay t he next screen */menubox ( 2, 1, 22, 78, 27, NO_SHADOW ) ;di spl ayscreen ( cur scr ) ;

/ * posi t i on cur sor 20 l i nes hence */

s i ze ( 32, 0 ) ;

/ * cont i nue t i l l f i r st r ow on t he screen i s r eached */r ow = 1 ;whi l e ( cur r ow ! = cur scr ){

i f ( *cur r ow == ' \ n' ){

cur r ++ ;r ow++ ;

}cur r ow++ ;

}

l ogr = 1 ;col = cur c ;


67/91


67

f or ( i = r ow ; i = 249 ){

curc = 249 ;pr i nt f ( " \ a" ) ;r et ur n ;

}


i f ( *t emp == ' \ t ' )col += 7 ;

i f ( *t emp == ' \ n' )br eak ;

t emp++ ;}

/ * i f next char acter i s a t ab */i f ( *t emp == ' \ t ' ){

l ogc += 7 ;cur c += 7 ;

}

cur c++ ;

/ * i f cur sor i s i n t he l ast col umn, scr ol l scr een hor i zont al l y */i f ( l ogc >= 78 ){

ski p = cur c - 78 ;menubox ( 2, 1, 22, 78, 27, NO_SHADOW ) ;di spl ayscr een ( cur scr ) ;


68/91


68

l ogc = 78 ;}el se

l ogc++ ;

wr i tecol ( ) ;}

/ * posi t i ons cur sor one col umn t o the l ef t */l e f t ( ){

i nt col = 1 ;char *t emp ;

/ * i f cur sor i s i n t he f i rs t col umn */i f ( cur c == 1 )

r et ur n ;

/ * i ncrement `t emp' t o poi nt t o char act er at cur r ent cur sor l ocat i on */t emp = cur r ow ;

f or ( col = 1 ; col < cur c ; col ++ ){

i f ( *t emp == ' \ t ' )col += 7 ;

i f ( *t emp == ' \ n' )br eak ;

t emp++ ;}

/ * i f pr evi ous char acter i s a t ab */i f ( * ( t emp - 1 ) == ' \ t ' )

{ l ogc - = 7 ;cur c - = 7 ;

}

/ * i f cur sor i s i n t he f i r st col umn and i f t her e exi st char acters t ot he l ef t of cur r ent l y di spl ayed l i ne, scr ol l scr een hor i zont al l y */

i f ( l ogc


69/91


69

/ * posi t i ons cur sor one l i ne up */up_l i ne ( i nt di spl ay ){

/ * i f cursor i s i n the f i r s t l i ne of the f i l e * /i f ( cur r == 2 )

r et ur n ;

/ * r emove spaces and tabs at t he end of t he l i ne */del _whi t espace( ) ;

l ogr - - ;c u r r - - ;

/ * go to t he begi nni ng of pr evi ous l i ne */curr ow - = 2 ;i f ( curscr < start l oc )

cur scr = st ar t l oc ;whi l e ( *cur r ow ! = ' \ n' ){

i f ( cur r ow


70/91


70

whi l e ( *cur r ow ! = ' \ n' ){

/ * i f end of f i l e i s encount er ed */i f ( cur r ow >= endl oc ){

cur r ow = p ;r et ur n ( 1 ) ;

}

cur r ow++ ;}i f ( cur r ow == endl oc ){

cur r ow = p ;r et ur n ( 1 ) ;

}cur r ow++ ;l ogr ++ ;cur r ++ ;

/ * i f vert i cal scrol l i ng i s requi red * /i f ( l ogr >= 22 ){

l ogr = 21 ;scr ol l up ( 2, 1, 22, 78 ) ;di spl ayl i ne ( cur r ow, 22 ) ;

/ * posi t i on `cur scr' poi nt er at t he begi nni ng of cur r ent screen*/

whi l e ( *cur scr ! = ' \ n' )curscr ++ ;

curscr ++ ;}

/ * posi t i on cur sor i n appr opr i at e col umn */got ocol ( ) ;

/ * i f cur r ent cur sor r ow and col umn i s t o be di spl ayed */i f ( di spl ay ){

wr i t ecol ( ) ;wr i t er ow( ) ;

}

r et ur n ( 0 ) ;}

/ * scr ol l s t he scr een cont ent s down */scrol l down ( i nt sr , i nt sc, i nt er, i nt ec ){

uni on REGS i i , oo ;

i i . h. ah = 7 ; / * ser vi ce number */i i . h. al = 1 ; / * number of l i nes t o scr ol l */i i . h. ch = sr ; / * s tart i ng row */i i . h. cl = sc ; / * start i ng col umn */


71/91


71

i i . h. dh = er ; / * endi ng r ow */i i . h. dl = ec ; / * endi ng col umn */i i . h. bh = 27 ; / * di spl ay at t r i but e of bl ank l i ne creat ed at t op */i nt 86 ( 16, &i i , &oo ) ; / * i ssue i nt er r upt */

}

/ * scr ol l s t he screen cont ent s up */scrol l up ( i nt s r , i nt sc, i nt er , i nt ec ){

uni on REGS i i , oo ;

i i . h. ah = 6 ; / * ser vi ce number */i i . h. al = 1 ; / * number of l i nes t o scr ol l */i i . h. ch = sr ; / * s tart i ng row */i i . h. cl = sc ; / * start i ng col umn */i i . h. dh = er ; / * endi ng r ow */i i . h. dl = ec ; / * endi ng col umn */i i . h. bh = 27 ; / * di spl ay at t r i but e of bl ank l i ne creat ed at bot t om */i nt 86 ( 16, &i i , &oo ) ; / * i ssue i nt er r upt */

}

/ * posi t i ons cur sor i n appr opr i at e col umn */got ocol ( ){

char *t emp ;i nt c ol ;


i f ( *t emp == ' \ t ' )col += 7 ;

i f ( *t emp == ' \ n' )br eak ;

t emp++ ;}

/ * i f t he char acter at cur r ent cur sor l ocat i on i s a t ab */i f ( col > cur c ){

/ * go t o t he end of t ab */l ogc += ( col - cur c ) ;cur c = col ;

/ * i f screen needs t o be scr ol l ed hor i zont al l y */i f ( cur c > 78 ){

l ogc = 78 ;ski p = cur c - 78 ;menubox ( 2, 1, 22, 78, 27, NO_SHADOW ) ;di spl ayscr een ( cur scr ) ;

}}

}


72/91


72

Util.c# def i ne CGA 1# def i ne ESC 27# def i ne YES 1# def i ne NO 0# def i ne NO_SHADOW 0# def i ne HALF_SHADOW - 1

/ * var i ous menu def i ni t i ons *// * char act er f ol l owi ng ` ' i s a hot key */char *mai nmenu[ ] = {

" Cursor Movement" ," Fi l e" ," Search"," Del et e" ," Exi t "

} ;

char *cursormenu[ ] = {" St ar t of F i l e","End of Fi l e"," Top of scr een" ," Bot t omof scr een","St ar t of L i ne"," End of Li ne" ," Word Lef t " ,"Word Ri ght " ,"Page Up","Page Down","Ret ur N"

} ;

char *f i l emenu[ ] = {" Load" ," Pi ck"," New" ," Save" ,"Save As" ," Merge" ," Change di r " ," Out put t o pr i nt er ","Re Turn"

} ;

char *sear chmenu[ ] = {" Fi nd" ,"Fi nd & Repl ace" ,"Repeat Last f i nd"," Abor t oper at i on" ," Go t o l i ne no","Re Turn"

} ;

char *del etemenu[ ] = {


73/91


73

" Del et e l i ne","To End of l i ne","To Begi nni ng of Li ne","Word Ri ght " ,"Re Turn"

} ;

char *exi t menu[ ] = {" Exi t "," Shel l ","Re Turn"

} ;

/ * most r ecent f i l es edi t ed */char *pi ckf i l e[ 5] = {

" " ," " ," " ," " ," "

} ;

/ * buf f er i n whi ch f i l es are l oaded and mani pul at ed */char *buf ;

unsi gned maxsi ze ;char *st ar t l oc, *cur scr, *cur r ow, *endl oc ;char sear chst r [ 31] , r epl acest r [ 31] , f i l espec[30] , f i l ename[ 17] ;

voi d i nt er r upt ( *ol d1b) ( ) ;voi d i nt er r upt handl er ( ) ;voi d i nt er r upt ( *ol d23) ( ) ;char *sear ch ( char * ) ;

i nt asci i , scan, pi ckf i l eno, no_tab ;i nt cur r = 2, cur c = 1, l ogc = 1, l ogr = 1 ;i nt ski p, f i ndf l ag, f r f l ag, saved = YES, ctr l _c_f l ag = 0 ;

char f ar *vi d_mem ;char f ar *i ns = ( char f ar * ) 0x417 ;

/ * wr i t es a char acter i n speci f i ed at t r i but e */wr i t echar ( i nt r , i nt c, char ch, i nt at t b ){

char f ar *v ;

v = vi d_mem + r * 160 + c * 2 ; / * cal cul at e addr ess cor r espondi ng t o

r ow r and col umn c */*v = ch ; / * store asci i val ue of char act er */v++ ;*v = at t b ; / * st or e at t r i but e of char acter */

}

/ * wr i t es a str i ng i n speci f i ed at t r i but e */wr i tes t r i ng ( char *s, i nt r , i nt c, i nt at t b ){

whi l e ( *s != ' \ 0' )


74/91


74

{/ * i f next char act er i s t he hot key of menu i t em */i f ( *s == ' ' ){

s++ ;

/ * i f hot key of hi ghl i ght ed bar */i f ( att b == 15 )

wri t echar ( r , c, *s , 15 ) ;el se

wr i t echar ( r , c, *s, 113 ) ;}el se{

/ * i f next char acter i s hot key of "Yes", "No", "Cancel ",etc . * /

i f ( *s == ' $' ){

s++ ;wri t echar ( r , c, *s , 47 ) ;

}el se

wr i t echar ( r , c, *s, at t b ) ; / * nor mal char acter*/

}

c++ ;s++ ;

}}

/ * saves scr een cont ent s i nt o al l ocated memory */savevi deo ( i nt sr, i nt sc, i nt er , i nt ec, char *buf f er )

{ char f ar *v ;i nt i , j ;

f or ( i = sr ; i


75/91


75

f or ( i = sr ; i


76/91


76

}}

}

/ * draws a doubl e- l i ned box */dr awbox ( i nt sr , i nt sc, i nt er, i nt ec, i nt at t r ){

i nt i ;

/ * dr aw hor i zont al l i nes */f or ( i = sc + 1 ; i < ec ; i ++ ){

wri t echar ( sr , i , 205, at t r ) ;wri t echar ( er , i , 205, at t r ) ;

}

/ * dr aw ver t i cal l i nes */f or ( i = sr + 1 ; i < er ; i ++ ){

wri t echar ( i , sc, 186, at t r ) ;

wri t echar ( i , ec, 186, at t r ) ;}

/ * di spl ay cor ner char act er s */wri t echar ( sr , sc, 201, at t r ) ;wr i t echar ( sr, ec, 187, at t r ) ;wr i t echar ( er , sc, 200, at t r ) ;wr i t echar ( er , ec, 188, at t r ) ;

}

/ * di spl ays or hi des cur sor */si ze ( i nt s sl , i nt esl ){

uni on REGS i , o ;i . h. ah = 1 ; / * ser vi ce number * /i . h. ch = ssl ; / * start i ng scan l i ne */i . h. cl = esl ; / * endi ng scan l i ne */i . h. bh = 0 ; / * vi deo page number */

/ * i ssue i nt er r upt f or changi ng t he si ze of t he cur sor */i nt 86 ( 16, &i , &o ) ;

}

/ * get s asci i and scan code of key pr essed */get key( ){

uni on REGS i , o ;

/ * wai t t i l l a key i s hi t * /whi l e ( ! kbhi t ( ) ){

/ * i f Ct r l - C has been hi t */i f ( ct r l _ c_ f l ag ){

/ * erase t he charact er s C */di spl ayl i ne ( cur r ow, l ogr + 1 ) ;


77/91


77

gotoxy ( l ogc + 1, l ogr + 2 ) ;ctr l _c_f l ag = 0 ;

}}

i . h. ah = 0 ; / * ser vi ce number * /

/ * i ssue i nt err upt * /i nt 86 ( 22, &i , &o ) ;

asci i = o. h. al ;scan = o. h. ah ;

}

/ * pops up a menu ver t i cal l y */popupmenuv ( char **menu, i nt count , i nt sr , i nt sc, char *hot keys, i nthel pnumber ){

i nt er, i , ec, l , l en = 0, area, choi ce ;char *p ;

s i ze ( 32, 0 ) ; / * hi de cur sor */

/ * cal cul ate endi ng r ow f or menu */er = sr + count + 2 ;

/ * f i nd l ongest menu i t em */f or ( i = 0 ; i < count ; i ++ ){

l = st r l en ( menu[ i ] ) ;i f ( l > l en )

l en = l ;}

/ * cal cul ate endi ng col umn f or menu */ec = sc + l en + 5 ;

/ * cal cul ate area r equi r ed t o save scr een cont ent s where menu i s t o bepopped up */

area = ( er - sr + 1 ) * ( ec - sc + 1 ) * 2 ;

p = mal l oc ( area ) ; / * al l ocate memory */

/ * i f al l ocat i on f ai l s */i f ( p == NULL )

err or_exi t ( ) ;

/ * save screen cont ent s i nt o al l ocated memory */savevi deo ( sr, sc, er , ec, p ) ;

/ * draw f i l l ed box wi t h shadow */menubox ( sr , sc + 1, er , ec, 112, 15 ) ;

/ * draw a doubl e l i ned box */dr awbox ( sr , sc + 1, er - 1, ec - 2, 112 ) ;

/ * di spl ay the menu i n t he f i l l ed box */


78/91


78

di spl aymenuv ( menu, count , sr + 1, sc + 3 ) ;

/ * r ecei ve user ' s choi ce */choi ce = getr esponsev ( menu, hot keys, sr , sc + 2, count , hel pnumber )

;

/ * r est or e or i gi nal scr een cont ent s */r estorevi deo ( sr, sc, er , ec, p ) ;

/ * f r ee al l ocat ed memory */f ree ( p ) ;

si ze ( 5, 7 ) ; / * set cur sor t o nor mal si ze */r et ur n ( choi ce ) ;

}

/ * di spl ays menu ver t i cal l y */di spl aymenuv ( char **menu, i nt count , i nt sr , i nt sc ){

i nt i ;

f or ( i = 0 ; i < count ; i ++ ){

wr i t estr i ng ( menu[ i ] , sr, sc, 112 ) ;sr ++ ;

}}

/ * r ecei ves user ' s choi ce f or t he ver t i cal menu di spl ayed */get r esponsev ( char **menu, char *hot keys, i nt sr , i nt sc, i nt count , i nthel pnumber ){

i nt choi ce = 1, l en, hot keychoi ce ;

/ * cal cul ate number of hot keys f or t he menu */l en = st r l en ( hot keys ) ;

/ * hi ghl i ght t he f i r st menu i t em */wr i t est r i ng ( menu[ choi ce - 1] , sr + choi ce, sc + 1, 15 ) ;

whi l e ( 1 ){

/ * r ecei ve key */get key( ) ;

/ * i f speci al key i s hi t * /i f ( asci i == 0 )

{swi t ch ( scan ){

case 80 : / * down ar r ow key */

/ * make hi ghl i ght ed i t em normal */wr i t est r i ng ( menu[ choi ce - 1] , sr + choi ce, sc

+ 1, 112 ) ;

choi ce++ ;


79/91


79

br eak ;

case 72 : / * up arr ow key */

/ * make hi ghl i ght ed i t em normal */wr i t est r i ng ( menu[ choi ce - 1] , sr + choi ce, sc

+ 1, 112 ) ;

choi ce- - ;br eak ;

case 77 : / * ri ght ar r ow key */r et ur n ( 77 ) ;

case 75 : / * l ef t arr ow key */r et ur n ( 75 ) ;

case 59 : / * f unct i on key F1 f or hel p */

/ * i f cur r ent menu i s f i l e menu */

i f ( hel pnumber == 3 ){

/ * i f hi ghl i ght ed bar i s not on r et ur n */i f ( choi ce ! = 9 ){

/ * cal l wi t h appr opr i at e hel pscr een number */

di spl ayhel p ( 8 + choi ce - 1 ) ;}br eak ;

}

/ * i f cur r ent menu i s exi t menu */

i f ( hel pnumber == 6 ){/ * i f hi ghl i ght ed bar i s not on Ret ur n */i f ( choi ce ! = 3 ){

/ * cal l wi t h appr opr i at e hel pscr een number */

di spl ayhel p ( 16 + choi ce - 1 ) ;}br eak ;

}

/ * i f cur r ent menu i s ot her t han f i l e menu orexi t menu */

di spl ayhel p ( hel pnumber ) ;}

/ * i f hi ghl i ght ed bar i s on f i r st i t em and up ar r ow key i shi t * /

i f ( choi ce == 0 )choi ce = count ;

/ * i f hi ghl i ght ed bar i s on l ast i t em and down ar r ow key i shi t * /


80/91


80

i f ( choi ce > count )choi ce = 1 ;

/ * hi ghl i ght t he appr opr i at e menu i t em */wr i t est r i ng ( menu[ choi ce - 1] , sr + choi ce, sc + 1, 15 ) ;

}el se{

i f ( asci i == 13 ) / * Ent er key */r et ur n ( choi ce ) ;

i f ( asci i == ESC ){

di spl aymenuh ( mai nmenu, 5 ) ;r et ur n ( ESC ) ;

}

hot keychoi ce = 1 ;asci i = t oupper ( asci i ) ;

/ * check whet her hot key has been pressed */whi l e ( *hot keys ! = ' \ 0' ){

i f ( *hot keys == asci i )r et ur n ( hot keychoi ce ) ;

el se{

hotkeys++ ;hot keychoi ce++ ;

}}

/ * r eset var i abl e t o poi nt t o t he f i r st hot key char acter

*/ hotkeys = hotkeys - l en ;}

}}

/ * di spl ays menu hor i zont al l y */di spl aymenuh ( char ** menu, i nt count ){

i nt col = 2, i ;

s i ze ( 32, 0 ) ;menubox ( 0, 0, 0, 79, 112, NO_SHADOW ) ;

f or ( i = 0 ; i < count ; i ++ ){

wr i t est r i ng ( menu[ i ] , 0, col , 112 ) ;col = col + ( st r l en ( menu[ i ] ) ) + 7 ;

}

s i ze ( 5, 7 ) ;}

/ * r ecei ves user ' s choi ce f or t he hori zont al menu di spl ayed */


81/91


81

getr esponseh ( char **menu, char *hot keys, i nt count ){

i nt choi ce = 1, hot keychoi ce, l en, col ;

s i ze ( 32, 0 ) ;col = 2 ;

/ * cal cul ate number of hot keys f or t he menu */l en = st r l en ( hot keys ) ;

/ * hi ghl i ght t he f i r st menu i t em */wr i t est r i ng ( menu[ choi ce - 1] , 0, col , 15 ) ;

whi l e ( 1 ){

/ * r ecei ve key */get key( ) ;

/ * i f speci

03 text editor

Documents

text editor

sequence data structuresare

basic sequence data

text sequences

text sequencecomponent

sequence abstract data

sequence interfaces

sequence of characters