travesty in byte

25
THE SMALL SYSTEMS JOLH' New Chips Data General/One NOVEMBER 1984 VOL. 9, NO. 12 $3.50 IN UNITED STATES $ 4.25 IN CANADA / £2.10 IN U.K. A MCGRAW-HILL PUBLICATION 0360-5280

Upload: leonardo-flores

Post on 27-Oct-2014

192 views

Category:

Documents


10 download

TRANSCRIPT

Page 1: Travesty in Byte

THE SMALL SYSTEMS JOLH'

New Chips

Data General/One

NOVEMBER 1984 VOL. 9, NO. 12

$3.50 IN UNITED STATES$4.25 IN CANADA / £2.10 IN U.K.

A MCGRAW-HILL PUBLICATION0360-5280

Page 2: Travesty in Byte

Nonsense imitationcan be disconcertingly recognizable

English letter-combination frequencies can be used togenerate random text that mimics the frequencies foundin a sample. Though nonsensical, these pseudo-texts havea haunting plausibility, preserving as they do manyrecognizable mannerisms of the texts from which they are

derived. For example, the following text was generated by the firstsentences of this article:

English letter-combination frequencies from text was generived. Forexample. Though nonsentencies from text was the text wasgenerated to generisms of that mimics the first sentencies from textthe texts have a have a sample, they article:

The nature of such texts has been little explored, in part because it's beendifficult to get samples. Claude Shannon generated "approximations toEnglish" by hand in 1948, but the laborious calculation it involvedprevented extensive study. This is clearly a task for a computer, but pro-grams have been hampered by the need for impractical amounts ofmemory.

We offer a Pascal program, Travesty, to fabricate Hugh Kenner andpseudo-text quickly from any input text. Students of Joseph o'Roarke teach

style and linguistics will see possibilities . So may pro- English and computer

grammers since Travest contains a feature th aty cascience , respectively, at

n The Johns Hopkinsgreatly speed up general pattern-matching proce- university, Baltimore,dures. We add a special-case version that is (continued) MD 21218.

NOVEMBER 1984 • BYTE 129

Page 3: Travesty in Byte

130 B Y T E • NOVEMBER 1984 ILLUSTRATED BY JOAN HALL. WITH APOLOGIES TO JAMES JOYCE AND HENRY JAMES

Page 4: Travesty in Byte

Each of these writershad his own way

with trigrams, tetragrams, pentagrams,matters to which

he surely gave no thought.

even speedier. To make clear what Travesty does, we'll first discuss lan-guage statistics and what they imply.

LANGUAGE STATISTICSFinish typing a page of English prose, and the key you hit most oftenwill have been the space bar. Either "e" or "t" will rank second. You didnot make those decisions, the language did. In fact, the language makesthree-quarters of your writing decisions for you. Not only do the lettersobserve preferred frequencies, they keep preferred company. A familiarexample: write "Q", and (unless you are drafting a QANTAS ad or somecomments on Iraq) the next character is almost sure to be "u".

If probability coerces the successor to a single letter, what follows aletter pair is even more tightly bound. Write "th", and the probability isvery high that what follows will be "e". If it is, then the character after"e" is most likely to be either a space or an "r". Pairs like "th" are called

digrams; triplets like "the" are trigrams. They have frequencies, like let-ters. The most common English digram is "he"; you will find it three timesin the sentence you are reading now, 15 times in this paragraph. Andyou will guess correctly that as we move up from single letters to diagramsand trigrams, the probabilities that govern the next character grow evermore rigorous. By the time we've reached, say, pentagrams, has the author

any choice at all?Yes, he has; otherwise Henry James could have had no way to be Henry

James , or James Joyce to be James Joyce. At a fairly low level, the statisticsof English would have taken over from both of them, and neither wouldhave been distinguishable from The New York Times.

But that is not what happens. True, even with a James or a Joyce holdingthe pen, the statistics do not lie dormant. However, they no longer derivefrom the undifferentiated language , i.e., from a large sample of everything

we can find. The significant statistics derive from the personal habits ofJames , or Joyce, or Jack London, or J. D. Salinger. Each of these writers,

amazingly, had his own way with trigrams, tetragrams, pentagrams, mat-ters to which he surely gave no thought. (continued on page 449)

NOVEMBER 1984 • BYTE 131

Page 5: Travesty in Byte

TRAVESTY

(continued from page 131)

This line of reasoning brings us tothe unexpected fact that essentiallyrandom nonsense can preserve many"personal" characteristics of a sourcetext. Travesty (listing 1), a programsuitable for small systems, will scan asample text and generate, from thesample's n-gram statistics, a "non-sense" imitation through which theoriginal text, and even its authorship,is disconcertingly recognizable.

For example, we provided Travestywith 29 names of towns taken from agazetteer of England and called forthird-order (trigram) analysis. Itpromptly churned out a couple thou-sand characters. These letter groupsincluded (1) many input words re-gurgitated; (2) some uninteresting let-ter strings that we agreed to call "gar-bage" (on the principle that a weedis a flower you don't want); and (3)some wondrously plausible names forEnglish towns that don't exist butought to. They included Bambudge,Nettlewett, Gidge, Hample, Bognor-ton, Chire, Clop, Tootinton, Bleweth,and Eastle. (If any of these is a realname, that's by accident; none was onour input list.) And fancy being Mayorof Clop!

The connection of the output to thesource can be stated exactly: for anorder-n scan , every n-character sequence in theoutput occurs somewhere in the input, and atabout the same frequency . That is all, yetit is enough to account for an eeriesimilarity. Every string of three lettersin our pseudo-place-names, "ttl" or"dge", for instance, was lifted out ofa string of characters and spaces thatconsisted simply of the 29 inputwords typed one after another withone space after each.

Figure 1 shows one of the thou-sands of machine-generated deriva-tions Travesty can extract from a75-word sample of James Joyce'sUlysses. This passage is an order-4scan; every four-character sequence inthe output comes from somewhere inthe input.

language and literature to investigate.To what degree can personal "style"be described as a manifestation of let-ter frequencies? Such a question,though not new, was merely tantaliz-ing before the modern computer;

even more so before procedures werediscovered-quite recently-thatdidn't demand impossible amounts ofmachine memory.

Brian P. Hayes, associate editor of

(continued)

FREQUENCY ARRAYS

There is a lot of fun to be had here.There is also much for the student of

REMARKS ON THETRAVESTY LISTING

P ascal input/output (1/O) conven-tions are, to say the least, poorly

standardized. We have three Pascal sys-tems available: Turbo Pascal for CP/Mand MS-DOS, Lucidata Pascal for CP/Mand HDOS, and Berkeley Pascal run-ning under UNIX-and we haven't beenable to write a version of Travesty thatwill run on all three unmodified. judg-ing that Turbo is the rising youngcomer, we list the Turbo Pascal version,with notes on such problem areas aswe know about. This version might runon UCSD Pascal too, but we've notbeen able to try it. Since it avoidsfeatures unique to Turbo and UCSD, itought to be transportable to any de-cent Pascal system at the cost of a lit-tle attention to input and output.

Line numbers are, of course, forreference only: don't type them intoyour Pascal listing.

23 This value is safe and may even beincreased, but remember that you'llhave two arrays this size. How big youcan make ArraySize depends on yoursystem's memory requirements. TurboPascal, when compiled to disk to getthe compiler itself out of the way, per-mitted ArraySize = 14,000 on a 64K-byte CP/M system. That's about 2300words of input text. On an MS-DOS sys-tem with 196K bytes, maximum Array-Size increased to 21,000, or 3500words of text. independent of whethercompilation was to memory or to disk.33 If your Pascal doesn't know aboutthe TEXT type, change this line to ffile of char.40 If your Pascal system has a RAN-DOM function, you can drop lines 40to 44 altogether. Then change line 239to read toss : = random(total) + 1;. Youshould also delete lines 38. 52, and 53.49 Many versions of Pascal don'trecognize STRING types unless they

have been declared:

Type STRING = PACKED ARRAY[1 .. 12]OF CHAR;

Then change line 49 to InFileSTRING.62 Some Pascals will require you todeclare a variable i and say, FOR i1 TO 12 DO READ InFile[i};.63 Berkeley Pascal doesn't use theASSIGN command. You'd omit this lineand change line 64 to reset (f, infile);.

Also, you will probably want outputto a disk file, and you'll have to set thatup yourself. Add a second TEXT vari-able, g, to line 33 and a secondSTRING variable, OutFile, to line 49.Then insert after line 64 a request forthe name of the Outfile, and ASSIGNit to g in whatever way your system pro-vides. And if your system requires filesto be explicitly closed, add a statementline, CLOSE (g), just before the finalEND. (Don't forget the semicolon atthe end of the line above it.)

NOTES ON HELLBAT

To change Travesty into Hellbat, pro-cedures InitSkip and Match are re-placed by the versions given in listing2, and numerous lines are deleted asshown below. Note that WriteCharacternow receives its characters from Matchand has only formatting duties to per-form. If your Pascal has its own RAN-DOM function, make the deletionslisted in the section on Travesty for line40; and the major change-appliedabove to the WriteCharacter proce-dure-should instead be made to theline in the new Match procedure thatinvokes Random.

Lines to delete for Hellbat include 28,72 to 80, 269, 273 (all references toFregArray), and 232 to 245 (process forgetting a character).

NOVEMBER 1984 • B Y T E 449

Page 6: Travesty in Byte

Circle 445 on inquiry card.

CAMBRIDGEGRAPHICSYSTEMS

V M 1480RGB, TTL Input High Resolution ; 16 Color,14" Display; IBM®, Apple® Compatible

VM12501

VM1 2101High Resolution MonochromeLow DistortionTilt and Swivel BaseFully IBM® and Apple® CompatibleGreen and Amber Displays

IMMEDIATE AVAILABILITYEXCELLENT PERFORMANCE

COMPETITIVE PRICE PACKAGEDealer/DlsMbutlen Inquiries Invited

40 - 50% margin built-in.Sales territories available.

CMCambridge Graphic Systems

11020 East Rush StreetSo. El Monte, CA 91733

800-228-3320 / 818-448-6173See us at Comdex / Booth M832

°Apple is a registered trademark of the Apple Corp° IBM is a registered trademark of the International

Business Machines Corp

TRAVESTY

Listing 1: Travesty, a program for generating pseudo-text. The program will scana sample text and generate a "nonsense" imitation. For an order-n scan, everyn-character sequence in the output occurs somewhere in the input.

3 PROGRAM travesty (input, output); (Kenner / O'Rourke, 5/9/84}

This is based on Brian Hayes's article in ScientificAmerican, November 1983. It scans a text and generatesan n-order simulation of its letter combinations. Fororder n, the relation of output to input is exactly:

"Any pattern n characters long in the outputhas occurred somewhere in the input,

and at about the same frequency."Input should be ready on disk. Program asks how manycharacters of output you want. It next asks for the"Order"-i.e. how long a string of characters will becloned to output when found, You are asked for thename of the input file, and offered a "Verse" option.If you select this, and if the input has a " I " char-acter at the end of each line, words that end lines inthe original will terminate output lines. Otherwise,output lines will average 50 characters in length.

22 CONST23 ArraySize = 3000; {maximum number of text chars}24 MaxPat = 9; {maximum Pattern length}

26 VAR27 BigArray : PACKED ARRAY [1..ArraySize] of CHAR;28 FreqArray, StartSkip : ARRAY[' '..';] of INTEGER;29 Pattern : PACKED ARRAY [1..MaxPat] of CHAR;30 SkipArray : ARRAY [1..ArraySize] of INTEGER;31 OutChars : INTEGER; {number of characters to be output)32 PatLength : INTEGER;33 f : TEXT;34 CharCount : INTEGER; (characters so far output)35 Verse, NearEnd : BOOLEAN;36 NewChar : CHAR;37 TotalChars : INTEGER; {total chars input, + wraparound}38 Seed : INTEGER;

40 FUNCTION Random (VAR Randlnt : INTEGER) : REAL;41 BEGIN42 Random : = Randlnt / 1009;43 Randlnt (31 * Randlnt + 11 ) MOD 100944 END;

46 PROCEDURE InParams;47 (* Obtains user's instructions48 VAR49 InFile : STRING (12];50 Response : CHAR;51 BEGIN52 WRITELN ('Enter a Seed (1..1000) for the randomizer');53 READLN (Seed);54 WRITELN ('Number of characters to be output?');55 READLN (OutChars);56 REPEAT57 WRITELN ('What order? <2-', MaxPat,'>');58 READLN (PatLength)59 UNTIL (PatLength IN (2..MaxPat]);60 PatLength := PatLength - 1;61 WRITELN ('Name of input file?');

(continued)

450 BYT E • NOVEMBER 1984

Page 7: Travesty in Byte

e

'a`s ly` pgribl o `d tiona ,High resolution color graphics and interla eMonochrome graphics and interlace

O One year warrantyq Priced competitively at

M is a Trade'74a4 f I}# ° ess Mac..

INTELLIGENT 14932 G wenchris Ct.DATA SYSTEM INC. Paramount , CA 90723

8009325.2455(213) 633.5504TLX: 509098

Page 8: Travesty in Byte

DeSmetC

8086/8088Development $^ O^Package-FULL DEVELOPMENT PACKAGE

• Full K&R C Compiler• Assembler, Linker & Librarian• Full-Screen Editor• Execution Profiler• Complete STDID Library (>120 Func)

Automatic DOS I.X/2.X SUPPORT

BOTH 8087 ANDSOFTWARE FLOATING POINT

OUTSTANDING PERFORMANCE• First and Second in AUG '83 BYTE

benchmarks

SYMBOLIC DEBUGGER $50

• Examine & change variables byname using C expressions

• Flip between debug and displayscreen

• Display C source during execution• Set multiple breakpoints by function

or line number

DOS LINK SUPPORT $35

• Uses DOS OBJ Format• LINKS with DOS ASM• Uses Lattice naming conventions

Check: ( )E] Debugger (50)q DOS Link Supt (35)

SHIP TO:

]ip

dft-WARECORPORATION

P.O. BOX CSunnyvale , CA 94087

(408) 720-9696All orders shipped UPS surface on IBM format disksShipping included in price California residents addsales tax. Canada shipping add $5, elsewhere add$15 Checks must be on US Bank and in US Dollars,Call 9 am , - 1 p,m to CHARGE by VISA/MC/AMEX

TRAVESTY

62 READLN (InFile);63 ASSIGN (f, InFile);64 RESET (f);65 WRITELN ('Prose or Verse? <p/v>');66 READLN (Response);67 IF (Response = 'V') OR (Response = 'v') THEN68 Verse : = true69 ELSE Verse : = false70 END; {Procedure InParams}

72 PROCEDURE ClearFreq;73 (* FreqArray is indexed by 93 probable ASCII characters,74 (* from to ". Its elements are all set to zero.75 VAR76 ch : CHAR;77 BEGIN78 FOR ch TO ' I' DO79 FregArray[ch] : = 080 END; (Procedure ClearFreq}

82 PROCEDURE NullArrays;83 (* Fill BigArray and Pattern with nulls84 VAR85 j : INTEGER;86 BEGIN87 FOR j 1 TO ArraySize DO88 BigArray [j] : = CHR(0);89 FOR j : = 1 TO MaxPat DO90 Pattern [j] : = CHR(0)91 END ; { Procedure NullArrays}

93 PROCEDURE FillArray;94 (* Moves textfile from disk into BigArray , cleaning it95 (* up and reducing any run of blanks to one blank.96 (* Then copies to end of array a string of its opening97 (* characters as long as the Pattern , in effect wrapping98 (* the end to the beginning.99 VAR

100 Blank : BOOLEAN;101 ch: CHAR;102 j : INTEGER;

104 PROCEDURE Cleanup;105 (* Clears Carriage Returns, Linefeeds, and Tabs out of106 (* input stream. All are changed to blanks.107 BEGIN108 IF (( ch = CHR(13)) {CR}109 OR (ch = CHR(10)) {LF}110 OR (ch = CHR(9))) {TAB}111 THEN ch112 END;

114 BEGIN {Procedure FillArray}115 j := 1;116 Blank : = false;117 WHILE (NOT EOF(f)) AND (j < = (ArraySize- MaxPat)) DO118 BEGIN (While Not EOF}119 READ (f, ch);120 Cleanup;121 BigArray[j] : = ch;122 IF ch = ' ' THEN Blank : = true;123 j := j + 1;124 WHILE (Blank AND (NOT EOF(f))

{Place character in BigArray}

(continued)

452 BYTE • NOVEMBER 1984

Page 9: Travesty in Byte

Why people choose an IBM PC in the first placeis why people want IBM service ...in the first place .

After all, who knows your IBM PersonalComputer better than we do?

That's why we offer an IBM maintenanceagreement for every member of the PersonalComputer family. It's just another exampleof blue chip service from IBM.

An IBM maintenance agreement for yourPC components comes with the choice of serviceplan that's best for you-at the price that'sbest for you.

Many customers enjoy the convenience andlow cost of our carry-in service. That's wherewe exchange a PC display, for example. at anyof our Service/ Exchange Centers.

And for those customers who prefer it, weoffer IBM on-site service, where a service repre-sentative comes when you call.

No matter which you choose for your PC,an IBM maintenance agreement offers you fast,effective service.

Quality. Speed. Commitment. That's whyan IBM maintenance agreement means bluechip service. To find out more about thespecific service offerings available for your PC,call 1800 IBM-2468. Ext. 104and ask for PC Maintenance.Circle 205 on inquiry card.

Blue chip service from ='_

Page 10: Travesty in Byte

Circle 128 on inquiry card.

Own your owncomputer supply

business.

DISK WORLD!will show you how.

You probably know who DISKWORLD! is: our ads are scatteredthroughout this and every other majorcomputer magazine.

We're one of the largest computersupply marketers in the country.

And we want you!But, no matter how much we ad-

vertise, we still can't reach every com-puter user... but you can.

We're looking for people who want torun their own part- or full-time comput-er supply business.

You'll have our help.You won't be alone.You'll have the accumulated experi-

ence, buying power and merchandis-ing skills of DISK WORLD! workingwith you. (And, if you don't think that'simportant, just remember this:eighteen months ago DISK WORLD!didn't exist.. .and now we're one of thelargest distributors in the nation.)

$24.95 gets you started.We'll send you a complete business

plan that tells you everything you needto know.

It'll cost you $24.95 + $3.00 ship-ping.

But it's risk-free. Read it for fifteen(15) days and if you decide this isn't foryou, send it back. We'll refund yourmoney.

If it is for you, you'll know what to donext.

'--------------------------

DISK WORLD!Suite 4806

1 30 East Huron StreetChicago, Illinois 60611

YES, I'm interested in the details of theDISK WORLD! independent resellers pro-gram . Please send me my manual.

I understand that if I don' t like it, I canreturn it within 15 days for a full refund.

q My check or money order for $27.95 isenclosed.

q Charge my VISA or MASTERCARD#

Exp. /

Signature:PLEASE PRINT LEGIBLYI

Name:Address:City: State: Zip:Phone: ( ) rL-------------------------a454 B YTE • NOVEMBER 1984

TRAVESTY

125 AND (j < _ (ArraySize- MaxPat))) DO126 BEGIN (While Blank) (When a blank has just been)127 READ (f, ch); (printed, Blank is true,)128 Cleanup; (so succeeding blanks are skipped,)129 IF & <> ''THEN (thus stopping runs.)130 BEGIN (If)131 Blank : = false;132 BigArray[j] := ch;133 1: =j+1134 END {If)135 END (While Blank}136 END; (While Not EOF}137 TotalChars : = j - 1;138 IF BigArray[TotalChars] <> THEN

(To BigArray if not a Blank)

139 BEGIN (If no Blank at end of text, append one)140 TotalChars : - TotalChars + 1;141 BigArray[TotalChars]142 END;143 (Copy front of array to back to simulate wraparound.)144 FOR j : = 1 TO PatLength DO145 BigArray[TotalChars+j] :- BigArray[j];146 TotalChars : = TotalChars + PatLength;147 WRITELN('Characters read, plus wraparound - ',TotalChars:4)148 END; (Procedure FillArray)

150

151

152

153

154

155

156

157

158

159

160161

162

163164

165

166

PROCEDURE FirstPattern;User selects "order" of operation, an integer , n, in therange 1 .. 9. The input text will henceforth be scannedin n-sized chunks. The first n-1 characters of the inputfile are placed in the "Pattern" Array. The Pattern iswritten at the head of output.

j : INTEGER;BEGIN

FOR j : = 1 TO PatLength DOPattern[j] : = BigArray[j];

CharCount PatLength;NearEnd : = false;IF Verse THEN (' ');FOR j : - 1 TO PatLength DO

WRITE (Pattern[j])END; (Procedure FirstPattern)

(Put opening chars into Pattern}

(Align first line)

168 PROCEDURE InitSkip;169 (* The i-th entry of SkipArray contains the smallest index170 (* j > i such that BigArray[j] - BigArray[i]. Thus SkipArray171 (* links together all identical characters in BigArray.172 (* StartSkip contains the index of the first occurrence of173 (* each character, These two arrays are used to skip the174 (* matching routine through the text , stopping only at175 (* locations whose character matches the first character176 (* in Pattern.177 VAR178 ch : CHAR;179 j : INTEGER;180 BEGIN181 FOR ch :- ''TO ';' DO182 StartSkip[ch] : - TotalChars + 1;183 FOR j : - TotalChars DOWNTO 1 DO184 BEGIN185 ch :- BigArray[j];186 SkipArray[j] StartSkip(ch);187 StartSkip(ch] : = j

Page 11: Travesty in Byte

TRAVESTY ERG/68000MIN I-SYSTEMS

q Full IEEE 696/S100 Compatibility

188 END189 END; {Procedure InitSkip)

191 PROCEDURE Match;192 (* Checks BigArray for strings that match Pattern; for each193 (* match found, notes following character and increments its194 (- count in FreqArray. Position for first trial comes from195 (- StartSkip; thereafter positions are taken from SkipArray.196 (- Thus no sequence is checked unless its first character is197 (- already known to match first character of Pattern.198 VAR199 i : INTEGER; {one location before start of the match in200 j INTEGER; {index into Pattern)201 Found : BOOLEAN; {true if there is a match from i + 1 to i +j - 1 }202 chi CHAR; (the first character in Pattern; used for skipping)203 NxtCh : CHAR;204 BEGIN {Procedure Match)205 chi = Pattern[1];206 i := StartSkip[chl] - 1; {i is 1 to left of the Match start)207 WHILE (i < = TotalChars-PatLength -1) DO208 BEGIN {While)209 j:= 1;210 Found true;211 WHILE (Found AND (j < = PatLength)) DO212 IF BigArray[i+j) <> Pattern[j]213 THEN Found : = false214 ELSE j := j + 1;215 IF Found THEN

{Go thru Pattern till Match fails)

216 BEGIN {Note next char and increment FregArray}217 NxtCh BigArray[i + PatLength + 1 ];218 FreqArray[NxtCh] : = FregArray[NxtCh] + 1219 END;220 i : = SkipArray[i + 1 ] - 1221 END {While)222 END; {Procedure Match)

(Skip to next matching

224 PROCEDURE WriteCharacter;225 (* The next character is written. It is chosen at Random226 (* from characters accumulated in FreqArray during last227 (* scan of input. Output lines will average 50 characters228 (* n length. If "Verse" option has been selected, a new229 (* line will commence after any word that ends with "." in230 (* input file. Thereafter lines will be indented until231 (* the 50-character average has been made up.232 VAR233 Counter, Total, Toss : INTEGER,234 ch CHAR;235 BEGIN236 Total : = 0;237 FOR ch : = ' ' TO ' DO238 Total = Total + FreqArray[ch]; {Sum counts in FregArray}239 Toss = TRUNC (Total * Random(Seed)) + 1;240 Counter := 31;241 REPEAT242 Counter := Counter + 1;243 Toss : = Toss - FreqArray[CHR(Counter)]

position)

{We begin with ' '}

244 until Toss < = 0; {Char chosen by}245 NewChar : = CHR(Counter); { successive subtractions)246 IF NewChar <> THEN247 WRITE (NewChar);248 CharCount : = CharCount + 1;249 IF CharCount MOD 50 = 0 THEN NearEnd true;250 IF ((Verse) AND (NewChar = 'I')) THEN WRITELN;

(continued)

HARDWARE OPTIONS

q 8MHz, 10 MHz, or 12 MHz68000/68010 CPU

q 68451 Memory Managementq Hardware Floating Pointq Multiple Port Intelligent I/Oq 64K/128K Static RAM (70 nsec)q 256K/512K/1MB Dynamic RAM (150

nsec)q Graphics-Digital Graphics

CAT-1600q DMA Disk Interfaceq SMD Disk Interfaceq 'h" or 1/2" Tape Backupq 514" or 8" Floppy Disk Drives

q 5MB-474MB Hard Disk Drives

q 7/10/20 Slot Back Plane

q 20 or 30A Power Supplyq Desk Top or Rack Mount Encl.

SOFTWARE OPTIONS

q 68KFORTH ' Systems Languageq CP/M-68K20 /S with C , 68K-BASIC',

68KFORTH +, FORTRAN 77, EM8OEmulator , Whitesmiths ' C, PASCAL

q IDRIS3 O/S with C , PASCAL,FORTRAN 77,68K -BASIC', CISCOBOL4, INFORMIX' RelationalDBMS

q UNIX- SYS V O/S with C , PASCAL,FORTRAN 77, BASIC , RM COBOL7,ADA8, INFORMIXS, Relational DBMS

q VED 68K Screen Editorq Motorola 's MACSBUG and FFP

Package

Trademark 'ERG, 'Digital Research,'Whitesmiths, 'Micro Focus, SRDS,

Inc., 'Bell Labs, 'Ryan McFarland,"U.S. DoD

30 Day Delivery - OEM Discounts

since 1974

Empirical Research Group, Inc.P.O. Box 1176

Milton, WA 98354(206) 872-7665

NOVEMBER 1984 • BYTE 455

Page 12: Travesty in Byte

CLEO TM makes the 3270Mainframe ConnectionThe communications features of the CLEO-3270 Software packageallow your computer to emulate a remote cluster of IBM terminaldevices.You'll be up and running fast. No changes are necessary on yourmainframe. And CLEO on your computer will install anywhere anIBM 3276 cluster might be used.

Once CLEO is running, it maintains communications with your hostcomputer, while allowing your computer keyboard operators to runDOS or UNIX tasks in addition to 3278 emulation.

3780Plus TM provides powerfulbatch communications toyour mainframe or other 3780's.If your IBM mainframe doesn't support remote 3270 clusters, youmay need remote batch communications. 3780Plus is your answerfor high speed computer-to-computer file transfer.

378OPlus is full featured and supports the IBM 3780 and 2780 BSCprotocols. If you need to transmit or receive text or binary files athigh speed, 378OPIus has no match. It'll run synchronous modemsup to 19.2K baud.378OPlus maintains a '•job file.' It'll run "unattended" and keep a logfile of all communications activity. Before you buy a 3780 emulatormake sure you have the features you'll need: transparent mode.space compression, device selection, printer forms control, spool-ing, configurable line parameters and line trace for diagnostics.3780Plus has them all!

it FLC ^ ! L { . A L ^ i - w.1 2 3 L L L 7 t 9 ^ ^ I iw^

i1L

Simultaneously usealternate key.Simultaneously useshift key.No simultaneous keyrequired .

O W F A U O I 9I I

S D F

p z ( e H M Q i

Page 13: Travesty in Byte

IBM PC CLEO on the IBM PC Compatibles . The CLEOsoftware program loads from a PC-DOS disk-

COMPATIBLESette and contains simultaneous emulation for3276 line protocol and up to six devices

emulating 3278 crts or 3287 printers. When CLEO runs on the PC, sixdevices are supported (through three interface cards supplied withthe CLEO software). The six supported devices include: 3278 supportfor the PC's console; 3287 support for the PC's printer; and. 3278support for up to four other PC's which may be serially attached tothe PC which is running CLEO.

CLEO is a full 3276 emulator and supports all of the standardfeatures of IBM's 3276; you won't need to make any modifications toyour 3270 host computer. Your installation considerations for CLEOon the PC will be identical to those experienced in installing an IBM3276. In fact, CLEO on your PC will readily install at any site where a3276 currently operates.

Keyboard decals are supplied with CLEO. These adhere to the frontface of the keys on your PC keyboard and serve as a quick reminder ofthe location of the 3278 keys.

BSC and SNA versions of CLEO are available in two models . The firstversion is PC2 which provides 3276 emulation and support for twodevices, a 3278 CRT and a 3287 printer . PC2 includes one hardwaresupport card (PC-SIC).

The expanded or "clustered " model is called PC6. PC6 requires twoadditional interface cards (PC-AIC) and four 25 ' cables for attachingfour PC's to a central PC running PC6. Physically these four PC's areattached through their asynchronous serial port (COMM1 ) using a 25'cable included in the PC6 package.

When you ' re operating one of the four attached PC's and you wantyour PC to become a 3278 you execute a software package, PC3278,included with the PC6 package.

HP -150 CLEO on the HP-150 . The CLEO software pro-gram loads from a MsDos diskette and con-

tains simultaneous emulation for 3276/2 BSC line protocol and sup-port for two devices, a 3278 crt and 3287 printer.

CLEO is a full 3276/2 emulator and supports all of the standardfeatures of IBM's 3276/2; you won't need to make any modifica-tions to your 3270 host computer. Your installation considerationsfor CLEO on the HP-150 will be identical to those experienced ininstalling an IBM 3276/2. In fact. CLEO on your 150 will readilyinstall at any site where a 3276/2 currently operates.

Unlike a 3270 coax product, CLEO needs no additional hardware onthe HP-150 and an IBM cluster controller to support a coax connec-tion is not required. CLEO is a 3276/2 cluster controller and hooks toa synchronous modem in the same fashion as an IBM 3276/2. TheHP-150's port 1 is used for the modem connection.

Keyboard decals are supplied with CLEO. These adhere to the frontface of the keys on your 150 keyboard and serve as a quick reminderof the location of the 3278 keys.

CLEO is ported for many more computers.Contact Altos. Honeywell, IMS. Micromation,Molecular. Nohalt, Olivetti, Tandy, orZenith Data Systems for CLEO on their machines.

MsDOS is a Trademark of Microsoft. Inc.Unix is a Trademark of Bell Labs

For further information or to order, call:

For enhanced 3278 display, the IBM color graphics card is recom-mended for use with CLEO.3780Plus on the IBM PC Compatibles . 3780Plus Software is self-contained on one floppy disk and menu driven with simple commandsso that you need not be an "expert" in 3780 or 2780 communicationsto use the package. 3780Plus for the PC includes an interface card,SIC, for interfacing to your synchronous modem.

3780Plus supports IBM's 2780 and 3780 BSC protocols forcomputer-to-host or computer-to-computer high speed communica-tions. Additionally, 378OPlus has special new features for talking toother computers which are running 3780Plus. For example, files areautomatically named by the receiving computer to their originalname on the transmitting computer.

3780Plus allows maximum flexibility with a "System" commandwhich allows you to execute many PC-DOS commands from within3780PIus.

Software/Hardware

Model Description

PC2-3276/2SHM 3276 BSCfor the PC

PC2-3276/12SHM 3276 SNAfor the PC

PC6-3276/2SHM 3276 BSCcluster for PC

PC6-3276/12SHM 3276 SNAcluster for PC

PC-3780PlusSHM 3780/2780emulationfor PC

3270Devices

SupportedInterface

CardsRetailPrice

2 1 $ 795.

2 1 $ 895.

6 2 $1,349.

6 2 $1,499.

1 $ 795.

3780PIus on the HP -150. 378OPlus Software is self-contained on onefloppy disk and menu driven with simple commands so that you neednot be an "expert" in 3780 or 2780 communications to use thepackage.

378OPlus supports IBM's 2780 and 3780 BSC protocols forcomputer-to-host or computer-to-computer high speed communica-tions. Additionally, 3780Plus has special new features for talking toother computers which are running 378OPlus. For example, files areautomatically named by the receiving computer to their originalname on the transmitting computer.

3780Plus allows maximum flexibility with a "System" commandwhich allows you to execute many MSDos commands from within378OPlus.

The HP-150's port 1 is used to connect to the synchronous modem.

SoftwareModel Description

RetailPrice

HP-3276/2SM 3270 BSC for HP-150 $500.HP-3780PlusSM 3780/2780 emulation for HP-150 $500.

• Color coding on configuration diagrams identifies components supplied in software packages.

Circle 327 on inquiry card.

CLEOCLEO Software A division of Phone 1, Inc., 461 North Mulford Rd., Rockford, IL 61107 1(815) 397-8110. Telex 703639

Page 14: Travesty in Byte

TRAVESTY

251 IF ((NearEnd) AND (NewChar = ' ')) THEN252 BEGIN {If NearEnd}253 WRITELN;254 IF Verse THEN WRITE255 NearEnd : = false256 END {If NearEnd}257 END; {Procedure WriteCharacter)

259 PROCEDURE NewPattern;260 (• This removes the first character of the Pattern and261 (• appends the character just printed. FregArray is262 (- zeroed in preparation for a new scan.263 VAR264 j : INTEGER;265 BEGIN266 FOR j 1 to PatLength - 1 DO267 Pattern[j] := Pattern[j+1]; {Move all chars leftward}268 Pattern[PatLength] NewChar; {Append NewChar}269 ClearFreq270 END; {Procedure NewPattern}

272 BEGIN {Main Program}273 ClearFreq;274 NullArrays;275 InParams;276 FillArray;277 FirstPattern;278 InitSkip;279 REPEAT280 Match;281 WriteCharacter;282 NewPattern283 UNTIL CharCount > = OutChars;284 END. {Main Program}

Scientific American, explained in theNovember 1983 issue of that publica-tion ("Computer Recreations," pages18-28) how the obvious approach toan order-n scan used n-dimensionalarrays. Let Array, count all occur-rences of all characters. Let Array2

keep track of the character thatfollows each character. Let Array3keep track of the character thatfollows each pair. Now generate a ran-dom printout that reflects the prob-abilities recorded in the arrays. Theresult of this order-3 scan will be what

,$2395 DEVELOPMENT SYSTEM

Up to 128K bytes of ROM EMULATION (8Kstandard ) allows you to make program patchesinstantly . Since the target ROM socket con-nects data and address lines to both theanalyzer and the emulator , no expensive adap-tors or personality modules are needed.

we saw in the place-names, a scram-bled impression that preserves, to asurprising degree, many idiosyn-crasies of the input text.

A higher order would be even moreinteresting, especially if the input sam-ple were long. But until quite recent-ly order-4 was an exceptional achieve-ment and no one had ever seen anorder-5. Getting even as far as order-3(trigrams) gobbles up memory if youuse arrays of arrays. For example, evenif you restrict the character set to amere 28-the uppercase letters, aspace, and an apostrophe-you haveto store 21,952 characters to createthree arrays. Most of the charactercombinations would actually beblanks, because most trigram possi-bilities don't occur: think of "cnx". Afourth array would entail over half amillion places of storage. A fifth isnearly unthinkable (and almostempty).

Sparse arrays are generally a signthat a new approach is needed. It wasHayes himself who took the next step,which was to discard multiplied arraysaltogether. He perceived that youcould get the same result by scanningthe text for patterns and simply re-cording the character that followseach pattern. Consider this brief text:ALL IN ALL THE CHANCES MAYWELL BE ENHANCED.

To run a trigram scan, take any twoletters and see what comes next. 'AL'is followed only by "L'; "LL' is fol-lowed only by a space, which we'll

(continued)

Turns any personal computer into a complete micro-computer DEVELOPMENT SYSTEM. Our integrated con-trol/display program runs under PC/MS-DOS, CP/M, orTRS-DOS. and controls the UDL via an RS-232 port.

The powerful BUS STATE ANALYZER features The PROM PROGRAMMER alsofour-step sequential triggering, selective trace, doubles as a STIMULUS GENERATOR.and pass and delay counters. $99 symbolic trace For a brochure and list of crossdisassemblers are available for Z-80, 8048, 6500, assemblers call or write:6800, 8031, 8085, 3870, Z-8, 1802, 8088, & 8086. 172 Otis Ave., Woodside, CA 94082 1111=13111011,10-

(415) 851 -1172 Instruments

458 BYT E • NOVEMBER 1984 Circle 313 on inquiry card.

Page 15: Travesty in Byte

The best of both worlds

Now, software developers can expand their markets and increase theirproductivity with Co-ldrisTM, the newest UNIX-like operating system fromWhitesmiths, Ltd.

Co-Idris is a professional, sophisticated tool enabling users to develop programsin a powerful and flexible UNIX-like environment, then easily port theseapplications to a wide range of PC/MS-DOS machines, including the IBM PC,DEC Rainbow, Wang PC, DG Desktop, and Olympia PC. With the Co-ldrispackage, you can construct C, Pascal, or assembler programs for operationunder Co-Idris, DOS or CP/M-86.

Able to work in as little as 128 KB of total main memory, Co-Idris allowsconcurrent access to both Idris-based programs and PC- or MS-DOS basedapplication programs. You get the multi-user, multi-tasking features of a UNIXenvironment as well as the rich selection of DOS applications. And there is noneed to reboot DOS, ever.

Co-Idris works on most all PC/MS-DOS based configurations with hard disks,and it's available now!

Dealer Inquiries Invited.

Whitesmiths, Ltd.97 Lowell Road Concord, MA 01742 (617) 369-8499TLX 750246 SOFTWARE cNcM

DISTRIBUTORS: Australia , Fawnray Ply. Ltd., Huratville, (612) 570.8100; Japan, Advanced Data Controls Corp., Chiyoda-ku, Tokyo (03) 63-0383; United Kingdom,Real Time Systems, Douglas, Isle of Man 0624-28021; Sweden , Uniaoft A. B., Goteborg, 31.125810. Rainbow is a trademark of Digital Equipment Corp. UNIX Is atrademark of Bell Laboratories; MS-DOS is a trademark of Microsoft Corp. PC-DOS is a trademark of International Business Machines Corporation. Idris is atrademark of Whitesmitha, Ltd.

Circle 423 on inquiry card.NOVEMBER 1984 • BYTE 459

Page 16: Travesty in Byte

TRAVESTY

Gleaming harnesses, petticoats on slim ass rain. Had to adore. Gleaming silks, spicyfruits, pettled. Perfume all. Had to go back. Had to back. Had to back. His bracesall him ass to adore. Gleaming harness rain yielded. A warm silver, rays of themutely craved Born on slim ass rays of the woman plumpnesses. Useleshobscurely, he mutely craved down on him assailed. A warm hungered down on himassailed to adore, Gleaming silks, silver, rich from Jaffa. A warm hungered down onslim braces all. Hig

Figure 1: An order-4 scan taken from a 75-word sample of James Joyce's Ulysses.

represent as "_": so the pair 'AL' canonly lead to the string ' ALL_". Butthe next pair, "L_", may be followedby "I", "T", or "B". We list these char-acters as we come to them, thenchoose one from the list at random.Let's say we choose "B" so that wehave 'ALL-B". The next pattern is"_B", and the only thing that canfollow is "E". Continue in this way, andby the time you've convinced yourselfthat one possible result is ALL BEENHANCES MAY WELL THE EN-HANCED, you'll have understood the

method. You'll also see how it pro-duced a word-ENHANCES-that wasnot in the input.

One further principle isn't illustratedby an example this short. A long textmay yield a fairly long list of charac-ters from which to make the randomchoice, and many of these will haveturned up over and over. In such acase, frequent appearance on the listshould improve a character's chanceof being chosen. What we want ouroutput to reflect is n-gram frequency,not just n-gram presence. Hayes's

method provides for this too.His method has several advantages.

It is applicable to programs that fit asmall computer. It makes feasibleorder-4, order-5, order-6 scans-infact scans of any order. Nor need itconserve memory by restricting thecharacter set; it can accommodate up-percase and lowercase letters,numerals, and all punctuation. But itdoes have one disadvantage. Sincewith his method you have to scan theentire input text to generate eachcharacter, the process can be veryslow. We'll describe a partial remedyfor that.

IMPLEMENTATIONIn our first attempts to implement theHayes algorithm, we left the sourcetext on disk to be read through overand over. Though it's tempting to letsource length be limited only by disk

(continued)

(IEback Issues for sale1976 1977 1978 1979 1980 1981 1982 1983 1984

Jan. $2.75 $3.25 $3.25 $3.70 $4.25

Feb. $2.75 $2.75 $3.25 $3.25 $3.70 $4.25

March $2.75 $3.25 $3.70 $3.70 $4.25

April $2.75 $2.75 $3.25 $3.25 $3.70 $3.70 $4.25

May $2.00 $2.75 $2.75 $3.25 $3.70 $3.70 S4.25

June $2.00 $2.75 $2.75 $3.25 $3.70 $3.70 $4.25

July 2.00 $2.00 $2.75 $2.75 $3.25 $3.70 $4.25 $4.25

Aug. $2.00 $2.75 $2.75 S3.25 $3.70 $4.25 $4.25

Sept. $2.75 $2.75 $2.75 $3.25 $3.70 $4.25 $4.25

Oct. $2.75 $2.75 $3.25 $3.25 $3.70 $4.25 $4.25

Nov. $3.25 $3.25 $3.70 $4.25

Dec. $2.75 $2.75 $3.25 $3.25 $3.25 $3.70 $4.25

Special BYTE Guide to IBM PC's - S4.75

Circle and send requests with payments to:BYTE Back IssuesP.O. Box 328Hancock, NH 03449

Prices include postage in the US. Please add 5.50 percopy for Canada and Mexico; and 52.00 per copy toforeign countries (surface delivery).

q Check enclosedPayments from foreign countries must be made in

US funds payable at a US bank.

q VISA q Master Card

Card #

Exp.

Signature

Please allow 4 weeks for domestic delivery and 12weeks for foreign delivery.

NAME

ADDRESS

CITY

STATE ZIP

460 BYTE • NOVEMBER 1984

Page 17: Travesty in Byte

Free Storage:

A radical V 3 offer In extra lengthHigh Density Data Cartr idges,

•iii ® now offers you the perfect formulafor data storage with our new nine packconfiguration. And to make this an abso-lute value; with the purchase of eachnine cartridges we'll include a durableand practical storage container. First,you'll have the finest high-density cartridge thatmoney can buy. Second,our extra length 5 fifty 5'cartridges have 23% more

high-density storage capacity at 6400and 10,000 ftpi. And finally, our threeseparate shrink wrap packages will allowyou to maximize your department inven-tory and provide flexibility when workingwith small quantities. This is, of course, a

limited offer, so call your

DATA ELECTRONICS, INC. .a..ua m=

nearest listed DEl distribu-tor today. When it comesto 14" technology... U thas the answers.

10150 Sorrento Valley Road n San Diego, CA 92121- 699

Authorized DRI Resale Locations:American Computer Supply, TX (214) 243-3232 • Automated Computer Systems, FL (305) 5943619

Business Data Products , Inc., TX (512) 453-5129 • Challenge Computer, CA (415) 785-1300Computer Media Products , CA(619) 565-7802 • Computer Ribbons , CO (303) 295-0851

Crown Computer Products, NY (518) 438-0600 • Data Documents , CA (818) 965-3323 • EDP Supply, NH (603) 893.6118Emerald Systems , CA (619) 270-1994 • Global Computer Supplies , NY (516) 485-1000 • i.D.E.A., CA (408) 745-1911

Indel-Davis, OK (918) 587-2151 • International Memory Products , CA (213) 450-0132Manchester Equipment , NY (516) 435-1199 • MISCO, NJ (201) 284.8200 • Nashua Plus, NH (803) 880-2323

Paragrem Sales, VA (703) 356-0808 • Stannard Computer, TX (214) 423-7553 • Step Computer Media , TX (512) 822-2376Western Paper Co., WA (208) 2515300

a J and DEl are registered trademarks of Data Electronics, Inc.

See us at

COMOIN `/Fall '84Booth #8122

Circle 109 on inquiry card. NOVEMBER 1984 BY T E 461

Page 18: Travesty in Byte

TRAVESTY

capacity, the amount of disk readingcan be altogether unreasonable: onefull read per output character. Notonly is the process slow but , if you tryto run the program to get a meaning-ful amount of output from a long text,it might consume an appreciable frac-tion of a disk drive ' s service life. Wesettled for a limited input sample, to

be read from the disk just once andstored in an array.

In the program in listing 1, this isBigArray, indexed by a range of inte-gers from I to ArraySize. As thesource text is read in, linefeeds, car-riage returns, and tabs are strippedout, and runs of blanks are con-densed to a single blank. The cleaned-

for under $1000• Custom vocabulary you control• Supports SP 1000 synthesizer• IBM PC compatible• Includes hardware & softwareWith Adisa's VX2 The VX2 lets youSystem , you can say digitize voice, convert"Goodbye" to robotic it to LPC codes, andtiming and inflection. conveniently edit theCreate synthesized synthetic speech.voice the way youwant it pronounced, onyour own schedule.Experiment to evaluateits impact on your pro-duct. Tweak thevocabulary, tailoring itto your market.

Manual $25includes

cassette tape

A P P L I E D D I G I T A L -S I G N A L A N A L Y S I S

P.O. BOX 1364, PALO ALTO, CA(415) 326-7303

TELEX 750626

462 BYTE • NOVEMBER 1984

up result is stored, character by char-acter, in BigArray. Next, for an order-n scan, the first n-1 characters of thesource are put into a small arraycalled Pattern. They are also printed,to get the output started. All this maytake a second or a few.

The rest is simple in principle. AMatch procedure runs through Big-Array, checking for matches with thecontents of Pattern. Each time amatch is found, the next character isstored in a way we'll explain in a mo-ment. At the end of the scan, one ofthe stored characters is randomlychosen for printing, the more fre-quent ones being the more likelychoices. We now make a new Pattern,by dropping the old Pattern's firstcharacter and appending the charac-ter just printed. And we keep this upuntil we have generated as many char-acters of output as we want.

Tb store the characters from whichto choose at random, we used amethod suggested by Hayes. Freq-Array is an array of integers, indexedby 93 ASCII (American StandardCode for Information Interchange)characters, from " " (space, ASCII 32)to " " (ASCII 124). (We might havestopped with "z" JASCII 1221, but wehad a special use for " 1 ", as will ap-pear.) Before each scan, all of Freq-Array's elements are zeroed. Aftereach match, the element indexed bythe found character is incremented.Thus, after four "e"s have been found,FregArrayj'e'j contains four. At theend of the scan, the contents of allFreqArray elements are totaled, anda random number in the range "I ...Total" is generated. The contents ofthe individual FreqArray elements(most of them zero) are then sub-tracted, one by one, from thisnumber; the subtraction that dropsthe remainder to zero or less choosesthe character that will be printed. Thatway, characters that index some fair-ly large FreqArray number stand abetter chance of being chosen.

You decide how much output youwant by setting the variable MaxCharswhen the starting menu prompts you.You also set the Order, which deter-

(continued)

Circle 28 on inquiry card. -y

Page 19: Travesty in Byte

What ifyoulcould get morpersonal with a

personal computer?Even computer veterans,

who have seen and done it all,software-wise, are amazedwhen they see Framework" run.j

Framework makes your PCtruly personal, so it can thinkand work the way you do.

Framework is the logical stepbeyond spreadsheet-based,integrated software like 1-2-3?"Only it's more of a leap than astep. With Framework you canplay with ideas, words andnumbers as deftly as last gener-ation software let you handlenumbers alone.

Framework goes beyondintegration. It unites a uniqueoutlining function, word pro-cessing, spreadsheet, graphics,data management and telecom-munications with a commoncommand structure. It runsother PC DOS compatibleprograms within itself. Macrosare easily written and applied.And it contains a full applicationslanguage, Fred;" which hasalready made it a hit withthird-party developers.

Framework lets you getpersonal with your personalcomputer.

For more information call(800) 437-4329 extension 222or in Colorado (303) 799-4900extension 222.

ASHTON •TATEi

Framework. Vramev.ork. For Ihinkerti. Fred and Ashton "late arc traderark, of Athtun "late.

12 is a trademark it ,tuti I)c^elupment Cure x,1984 Ashton late. All rights re erred.

Framework. For Thinkers:

Page 20: Travesty in Byte

TRAVESTY

The language makesthree-fourths of yourwriting decisions.mines the pattern length. As theOrder increases, so does the likeli-hood of every pattern being unique, inwhich case the output would be sim-ply the unaltered input. We set theconstant MaxPat to 9, but that's anoption.

The output is formatted by count-ing the characters printed (TotalChars)and watching for a chance to breakthe line whenever this total passes amultiple of 50. The next space afterthat will trigger a new line. To makeverse look more like verse, we type abar ("; ") at the end of each line ofinput text. No bar is ever printed, butif the user has selected the Verse op-tion from the starting menu, a newline will commence whenever a bar isencountered. On top of that, any linebreaks created by the "TotalCharsMOD 50" test will be followed by a5-character indent. Figure 2 shows

We are not appearSight kingdomThese do not appear:

This very longBetween the wind behaving alone

And the KingdomRemember us, if at all, not

appearSightless , unlessIn a fading as lost kingdom

what the result can look like. The in-put text was the whole of T. S. Eliot'sThe Hollow Men.

The smooth running of this systemdepends on the fact that a match willbe found, and a new character re-turned, somewhere on every scan.There is one potential bug-when thenew character isn't valid because thematching process stepped right offthe end of BigArray. Suppose the lastfour characters of the input text are"it._". Now suppose an order-4 scanhas come clear to the end, seekingthe pattern "t._"; though it finds amatch, the character after the matchis undefined. If that character hap-pens to get selected for printing,something unauthorized will turn upin the output. Worse, something un-determined will go into the next pat-tern. Whether the Match procedurecan find the new pattern in the inputtext is now doubtful. If it can't, theprogram will lock into an endlessloop.

Our solution was to wrap the endof BigArray back to the beginning, byappending a space (if there wasn't

Remember us, if at all , not me also wearAnd

voices are thine imagesAre raised, here, is that

five o'clock in dreamsIn the Kingdom.This is

the world endsThere, the stuffed meaning places

Are quiet and more solent souls , but on a bangbut a whimper.

Let me also wearPrickly pear:

Figure 2 : A pseudo-text version , in verse form , of T S. Eliot's The Hollow Men.

464 BYT E • NOVEMBER 1984

one) plus as many of the openingcharacters as there are in the searchpattern. Thus the input text is, in ef-fect, a closed loop and has no endthat the matching routine can step off.

A FASTER VERSIONAs first set up, with a simple matchingalgorithm, all this worked perfectlybut slowly, new characters tricklingonto the screen like drops of molas-ses. Then a way to get a dramaticspeedup presented itself. Considerthat the Match procedure is spendingmost of its time checking out casesthat don't match at even the first char-acter. They are the majority of thecases, and they are a total waste oftime. Suppose we are looking for thestring "rep"; is it possible to skipthrough BigArray, investigating only se-quences that begin with "r"?

Yes, it is, at the cost of a second ar-ray the size of BigArray. Called Skip-Array, it works as if it were manylinked lists braided together. In our ex-ample, we have only to follow Skip-Array's linked skein of "r"s. With itsguidance, the Match procedure canskip swiftly through BigArray fromone "r" to the next, ignoring all othercheckpoints. If all letters were equal-ly frequent, that would hasten thematching process by a factor of 26.In the program run that generatedfigure 2 , the number of searches wascut by a factor of 15. If Match werethe only determinant, we'd speed upexecution by that much. However,other overhead stays constant andhogs time. In practice, the overallspeedup approaches a limit of about7-for big jobs, the difference be-tween five minutes and half an hour.

SkipArray is accompanied by amuch smaller array called StartSkip,which is indexed by characters andsimply records the first location ofeach character in BigArray. The twoof them are set up, once and for alland very quickly, by the procedureInitSkip. Once they are in place, westart each search from the informationin StartSkip, and then use SkipArrayas follows.

Consider our example, a pattern(continued)

Page 21: Travesty in Byte

ENGINEERS

Some Of The Most ProvocativeTelecommunications Engineering

Of The 80 's Will Be DoneRight Here With Hayes In Atlanta .

There's an energy level here at Hayes thatfuels our confidence. An enthusiasm fewengineering environments encourage or support.A unique blend of engineering and technologicaltalents drawn together to move telecommunica-tions technology further along. The projects ...The programs and the confidence to roll oursleeves up - ask the questions that must beasked - and search out the answers.

If you're taking a closer look at your career,perhaps it's time to take an in-depth look atHayes. There's a future in it.

•HARDWARE/MICROPROCESSOR DESIGNENGINEERS

•SOFTWARE DEVELOPMENTPROGRAMMERS/ANALYSTS

•VLSI/DSP DEVELOPMENT ENGINEERS•PRODUCT ENGINEERS

Interested, qualified candidates should forwarda confidential resume to: Hayes MicrocomputerProducts , Inc., Dept . TJC-84, 5923 PeachtreeIndustrial Blvd., Norcross , GA 30092 . An EqualOpportunity Employer M/F.

Opportunity forthe here and now .

Hayes®Hayes Microcomputer Products Inc.

NOVEMBER 1984 • BYTE 465

Page 22: Travesty in Byte

TRAVESTY

that begins with -r". Let's suppose wehave learned from StartSkip that thefirst "r" is at BigArravl3J. Then the "r"entries in SkipArray might begin likethis:

Without a Computer-Mate"case on this sales trip , Willie's entire

demonstration fell apart.The pitch meant millions to

the firm. A bundle of dreamsto poor Willie. But they neversaid he would have to carrythe demo equipment withouta Computer-Mate TM' case.

Mr. Prospect met him at thedoor. "So what have you gotfor us, Willie?" Willie'sface crumpled like awadded sales contractas he stammered,"The equipment...it's b-b-broken"

Willie lost thesale of a lifetimebecause hedidn't transporthis sensitiveequipment in aComputer-Mate'case. Casesthat are hand-

poor Willie. Get aComputer-Mate'

case before it'stoo late. After all,

the life of yournext presentationmay depend onit. Right, Willie?

Con DUter-Mate TM Cases.bilitth d DT d f St urareng an y.W-le Sumo- oresteW

For more information contact : Computer Mate, Inc.1006 Hampshire Lane, Richardson, Texas 75080

Dallas (214) 669-9370 Texas Residents (800) 442-4006Out of State (800) 527-3643 Dealer Inquiries Welcome.

466 BYTE • NOVEMBER IvBd

assembled for each orderto ensure a safe, snug fit.Specially-designed and con-structed from quality Americanmaterials to make them lighter,yet stronger than steel. Andeach case withstands the moststringent air transport stan-

Circle 87 on inquiry card.

dards, is stamped witha toll-free lost and

found phone number,and is backed by

Computer-Mate's'100% guarantee.Don't end up like

3 4 5 6 7 8 9 10 11 12 13...7 13 22

At 3, we learn that the next "r" is at7; at 7, that the next "r" is at 13; at13, that the next "r" is at 22: and soon. Of course, the other positions inSkipArray would be filled with infor-mation about the other characters.That information is ignored this timethrough but stays available for otherscans.

Inspection of the listing will showhow the Match procedure consultsStartSkip to see where, in BigArray,the first character of the Pattern to bematched first appears. Thereafter,guided by SkipArray, it can safelyleave most locations of BigArray un-visited. In a count that includes thespaces, "r" occurs in English aboutonce every 18 characters. So if Big-Array were 2000 characters long, onlyabout 110 of its locations would bechecked. "P", with a frequency ofabout 2 percent, would trigger a mere40 visits. The improvement in speedover our preceding versions is dra-matic. Scanning a small input file ona fast system, we've seen new charac-ters patter onto the screen faster thanwe could read them. The analogy isno longer with molasses but with rain-drops.

TIMING ANALYSISSome statistics yielded by the "time"function on the UNIX system suggesttwo timing considerations. First, thetime consumed seems independentof pattern length. (Most matches failwell before the pattern length isreached.) Second, time is largelydetermined by the product of two fac-tors: the number of characters in theinput file and the number of outputcharacters requested. For large inputsand outputs, doubling either doublesthe time, and doubling both quadru-ples the time. But on a small scale, theprogram works more slowly than that.

(continued)

Page 23: Travesty in Byte

Run INTEL softand full-screen symbolic debugger

on your IBM-PC• Get our complete development solution: compilers, assembler, linker, debuggers,

and emulators for the entire 8086 family: 8086, 8087, 8088, 80186, .. .• Run under MS-DOS with our GenePak®PLM package, containing Intel's PLM86,

ASM86, LINK86, LOC86, LIB86, and OH86. Options include PASCAL86, FORTRAN86,and C86. Or use the Intel software you already own.

• It takes less than 6 minutes to recompile a 1000-line PLM86 module and relink it with34 others to regenerate a 40K program.

• Use GeneScope® our interactive, fully symbolic debugger and emulator line. Itfeatures macros, help, on-line assembly, automatic disassembly and trace withscrolling, simultaneously supporting separate screens for debug and source-fileviewing during one debug, session. GeneScope is easier to use and faster thanPSCOPE, ICE-86, and I 'ICE, loading 100K programs in under 15 seconds. Its powerfuluser interface runs in your PC, either connected to your target system withGenesis is a licensed OEM of Intel Corp. are Trademarks of Intel Corp.

GeneProbe® our full in-circuit emulator, or as asoftware-only debugger for programs in your PCor your target system.

• Use our PC •--► MDS datalink program to bringover your existing libraries or store back your newmodules. It transfers files of any kind, in eitherdirection.

Call us today.

GenesisTMMicrosystems

196 Castro Street, Mountain View, CA 94041 (415) 964-9001

INTERNATIONAL DISTRIBUTORS • Instrumatic: Deutschland -Munchen, Tel. 8985 20 63 • Eapalla-Madrid , Tel. 1250 25 77-Malaga , Tel. 5@21 38 98 • Sehwaiz-Rueschlikon, Tel. 1724 14 10-Geneve, Tel. 022 36 08 30• United Kingdom-High Wycombe, Tel. 494 450 336 • Israel-Savyon: Micro-Bit. Tel. 03- 380098 • Japan-Tokyo: Asahi Business Consultant Co., Tel (03) 543-3161; Shows System Laboratory. Tel. (03) 361.7131.

Circle 179 on inquiry card. NOVEMBER 1984 • BY T E 467

Page 24: Travesty in Byte

TRAVESTY

In the short run Hellbatcan be disablinglyinput-sensitive.

More exactly, the time in seconc.sfollows quite closely an equation ofthe form

T=K(i/l0+o+io)

where i is thousands of characterssupplied as input, o is thousandswanted in the output, and K is a sys-tem constant, obtained by trial. Thus,1000 characters both in and out (i ando both = 1) took 21 seconds on onesystem, a VAX 750 running BerkeleyPascal. The same job took 130 sec-onds with a different compiler on a2-MHz Heath H-89. System constantswere thus 10 and 62, respectively.

For large inputs and outputs, the

last term-the product of i and o-pre-dominates. On a smaller scale, the in-creasing weight of the first two termswill slow things down.

It is possible to obtain a measure ofthe speed contributed by the Skip-Array process. With SkipArray deac-tivated so that the Match proceduremust check every character, the finalterm gets multiplied by about 7. Thisfactor represents an empiricalweighted average of the frequency ofthe characters that get checked, andit means that for big jobs, where thefinal term predominates, SkipArrayspeeds up the matching of Englishtext by 7 times. For smaller samples,the improvement factor may be closerto 4.

SHANNON'S ALGORITHMAt this point, we wondered if therewasn't still more speed to be gained.In 1948, working without a computer,

Free Software!We're so sure you're going to want all our Pop-Ups that we'll giveyou a free Pop-Up Alarm Clock to use - a $19.95 value! Pop-UpAlarm Clock gives you the time of day, sets alarms , and lets you

run applications when you're away from your computer.

Imagine ': Pop up these handy tools at a keystroke, even^ while you're running other programs ...

#*4F

T

t% -4o

' PQ1

hf ^ dr̂ dIV_tb

vl0 `% (206) 828-7282

a're .e . Nn C- L.

Pop Up Calculator. $39.95Pop Up Notepad . $39.95

Pop-Up Calendar. $19.95Pop-Up TeleComm . $79.95

Pop-Up PopDOS . $39.95Pop-Up Alarm Clock . $19.95/free

r0%Discover the power andconvenience of the

f^^ `

oA a^ C^ Ptr^ + k.H 9,'+A_-.O,O

N` Pop-Ups from

A. ♦ BELL SOFT

` vet.(.lA 0r- 6/P O

y 1b-1142011"IV

Claude Shannon had not bothered tobuild up frequency tables at all. Heopened a book at random, selecteda letter, and recorded it. He thenopened the book again, read till hefound that letter, and recorded its suc-cessor. On another random page, hefound the first successor to that letter... and so on.

This process amounts to hoppinginto BigArray at random and lettingthe first occurrence of Pattern you en-counter be the one to define thefollowing character. Why build a table,only to select one element from it atrandom? Why not acknowledge thatthe text itself is a frequency table andmake random entries into it? So weimplemented a new method and gota further speedup factor of betterthan 3 (listing 2). After molasses andraindrops, this was a torrent! The newversion flew like a bat out of hell andgot nicknamed Hellbat.

Theoretically, in the long run, Traves-ty and Hellbat give equivalent genera-tion results, with Hellbat a lot faster.But in the long run, as John MaynardKeynes said, we are all dead. Theworld's results are obtained in theshort run, and in the short run Hellbatcan be disablingly input-sensitive.

Its problem is this. Let's imagine aninput that contains, midway, the se-quence "... silk stockings and silkhats ...", and no other occurrencesof "silk". Suppose the pattern we arematching is "silk-". The chances aregood that a random pounce will landeither before that sequence or afterit. If before, then "silk" will be followedby "stockings". If after, then wrap-around will carry us around to before,and "silk" will again be followed by"stockings". Only if the jump chancesto land in the short interval betweenthe two occurrences of "silk" can thefollowing word ever be "hats". So theoutput can settle into a tedious loop,"silk stockings and silk stockings andsilk stockings and ...". Nothing but arather unlikely jump can enable Hell-bat to break out of that loop.

But note how this case would behandled by the frequency-tablemethod of Travesty. "Silk-s" and

(continued)

468 BYTE • NOVEMBER 1984

Page 25: Travesty in Byte

TRAVESTY

Listing 2: These routines turn Travesty into Hellbat, a faster, but finicky, version.

PROCEDURE InitSkip;VAR

HeadSkip, TailSkip : ARRAY of INTEGER;ch : CHAR;

INTEGER;BEGIN

(Initialize HeadSkip and TailSkip to indicate that){no occurrence of any character has yet been Found.}

FORch:=''TO'I'DOBEGIN

HeadSkip[ch] := TotalChars + 1;TailSkip[ch] : = 0

END;{Link SkipArray by reverse pass through BigArray.}

FOR i (TotalChars - PatLength) DOWNTO 1 DOBEGIN

ch : = BigArray[i];IF TailSkip[ch] = 0THEN {1st occurrence}

BEGINTailSkip(chl : = i;HeadSkip[ch] := i

ENDELSE

BEGINSkipArray[i] HeadSkip[ch];HeadSkip[ch] i

ENDEND;

{Close links from tail back to head.)FORch:='' TO':' DOIF TailSkip[ch] <> 0 THEN

SkipArray [TailSkip[ch]] HeadSkip[ch]END;

PROCEDURE Match;VAR

INTEGER; (one location BEFORE start of Match in BigArray)INTEGER; (index into pattern)

Found : BOOLEAN; {true if there is a match from i + 1 to i +j -1 }chi : CHAR; (first character in Pattern; used for skipping)

BEGINchi := Pattern[1];

(Hop into BigArray at a random location)= TRUNC((TotalChars-PatLength) * Random(Seed));

{Search for an instance of ch1 at location i+1}WHILE BigArray[i+1) <> chi DO

i : _ (i + 1) MOD (TotalChars - PatLength);Found : = false;WHILE (NOT Found) DO

BEGINj := 1;Found true;WHILE (Found AND 0 < = PatLength)) DO

IF BigArray[i +j] < > Pattern[j]THEN Found false

ELSE j := j + 1;IF Found THEN

NewChar := BigArray[i + PatLength + 1]ELSE i := SkipArray[i+1] - 1

ENDEND;

silk-h- would be considered equal-ly likely; the letters "s" and "h" wouldring up the same increments in Freq-Array; and the upshot, all else beingequal, would be truly 50-50. Traves-ty, in short, makes probability in-dependent of the spacing of items.Though slower than Hellbat, it is lessparticular about peculiar inputstructures.

Hellbat, on the other hand, is tempt-ingly fast; the io term disappears al-together from its timing equation,which takes the form

T = K (i/5 + o)

For order-5, and input and output of1,000 characters each, the twomachines that took 21 and 130 sec-onds to run Travesty ran Hellbat in 4and 50 seconds, respectively.

Because a long pattern is apt to en-tail more tries before a match isfound, Hellbat's time, unlike Travesty's,is order-dependent. The dependencycould be incorporated into K, whichcould then stand for both the com-puter system and the order.

So, if the input text is fairly free ofrepeated words clumped together, orif it is long and varied enough for suchwords to be scattered elsewhere also,then Hellbat (jump in with closedeyes) is the algorithm of choice.Otherwise, opt for the inefficiencies ofTravesty and its frequency table. Weprint herewith (listing 1) the Pascalcode for Travesty (modified Hayes)and append instructions for convert-ing it into random-jump Hellbat(Shannon) (listing 2). The nature ofyour input must decide which versionis for you. n

BIBLIOGRAPHY1. Bennett , William R. Jr. "How Artificialis Intelligence ?" American Scientist.November-December 1977, page 694.2. Hayes, Brian . "Computer Recreations."Scientific American . November 1983, page18.3. Shannon, Claude E. "The MathematicalTheory of Communication:' Bell SystemTechnical Journal . July and October 1948,pages 379, 623.4. Shannon, Claude E. "Prediction andEntropy of Printed English:' Bell SystemTechnical Journal . January 1951, page 50.

NOVEMBER 1984 • B Y T E 469