STATISTICS QUIZ
by Leland Wilkinson

Statistical Package Test Problems

Here are some problems which reviewers should consider in evaluating statistical packages. Now that comparisons of microcomputer programs are being made, it is important to go beyond running the "Longley" data and tallying the number of statistical procedures a package contains.

It is easy to select a set of problems to "show off" a particular package. Just pick several esoteric statistics contained in only that package. The following problems are different. They involve basic, widely encountered statistical issues. A package which cannot solve one of the problems, or solves it incorrectly, doesn't "lack a feature." It has a serious defect. You might want to try these problems on mainframe packages as well. You may be surprised by the results.

I. READING AN ASCII FILE

Many programs claim to import data files directly. The problem is that many other programs do not create "ASCII" files in the same way. Here is a file called ASCII.DAT which contains various formats for character and numeric data from programs like dBase, Lotus, Supercalc, FORTRAN, and BASIC. If a program cannot read every record in this file without preprocessing, then there are some ASCII files on microcomputers and mainframes which it cannot read.

    1 2 3 4 5 "ONE"
    1 2 3
    4 5 "TWO"
    1,2,3,4,5,'THREE'
    1 2 3 4 . FOUR
    1.0E0 2.E0 .3E1 4.00000000E+0 5D-0 FIVE
    1 2 3 4 5 SIX

The case labeled ONE is the most common form of ASCII data: numerals separated by blanks and character strings surrounded by quotes. Most programs can handle this case.

The case labeled TWO spans two records (there is a carriage return and linefeed after the 3). Some spreadsheets and word processors do this when they impose margins on a page. A statistical package ought to be able to read these two records as a single case without special instructions. If, on the other hand, this is considered an error, then the program should be able to flag it.
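The behavior described for case TWO can be sketched in a few lines of Python. This is only a minimal illustration, assuming every case has exactly six blank-separated fields; `read_cases` is a hypothetical helper, not a routine from any package discussed here. It treats the file as a stream of tokens, so a case may continue onto the next record, and it reports whether each case spanned more than one record so that a caller could flag it instead. It does not attempt the comma, missing-value, or exponent forms found in the other records.

```python
from typing import Iterable, Iterator, List, Tuple


def read_cases(records: Iterable[str],
               n_fields: int = 6) -> Iterator[Tuple[List[str], bool]]:
    """Yield (case, spanned) pairs, one case per n_fields tokens.

    `spanned` is True when the case drew tokens from more than one
    record, so the caller can accept it or flag it as an error.
    """
    pending: List[str] = []   # tokens read but not yet emitted as a case
    records_used = 0          # records that contributed to `pending`
    for record in records:
        tokens = record.split()          # split on runs of blanks
        if not tokens:
            continue                     # skip empty records
        pending.extend(tokens)
        records_used += 1
        while len(pending) >= n_fields:
            # Strip surrounding quotes or apostrophes from each token.
            case = [t.strip("\"'") for t in pending[:n_fields]]
            pending = pending[n_fields:]
            yield case, records_used > 1
            records_used = 1 if pending else 0
    if pending:
        # The file ended in the middle of a case: report it, do not guess.
        raise ValueError(f"incomplete final case: {pending}")


if __name__ == "__main__":
    sample = ['1 2 3 4 5 "ONE"', "1 2 3", '4 5 "TWO"']
    for case, spanned in read_cases(sample):
        note = " (spans records)" if spanned else ""
        print(case, note)
```

Reading tokens rather than lines is what lets the two records of case TWO come back as one case; a package that prefers to treat the split as an error can test the `spanned` flag instead of silently joining the records.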
The case labeled THREE has comma delimiters (as in BASIC). It also uses apostrophes rather than quotes for character strings (PL/I and other mainframe packages do this).

The case labeled FOUR has a missing value (on variable E). SAS, SPSS, and other packages put missing values in a file this way. It also lacks quotes around the character variable, a common occurrence.

The case labeled FIVE has various forms of exponential notation. FORTRAN uses a "D" instead of E for double precision exponents. The other forms were taken from various microcomputer and mainframe packages.

The case labeled SIX contains tab characters, which do not belong in an ASCII file. They are so common, however, that statistical packages should not be bothered by them.

You can create this file on a word processor (be sure to use the ASCII, or "non-document" mode) or with an editor. For the last line, use the tab key to separate the numbers (the way most secretaries do when told to enter data this way) and be sure to hit the Return (Enter) key after typing SIX so that the last record ends in a carriage return.

Since you can use an editor to create the file, you could correct some of the problems with an editor as well before using a statistical program. What would you do with a file containing 30,000 records on 100 variables?

A. Read this file into 6 variables: A, B, C, D, E, and NAME$ (or whatever the package uses to name a character variable). Print the file so that 6 cases appear, with A=1, B=2, C=3, D=4, E=5, and NAME$=