name: $RCSfile: custom_sed.spec,v $ process: Outline of the proposed custom sed, to be based on the existing GNU sed 2.05 author: Simon Taylor revision: $Id: custom_sed.spec,v 1.2 1998/05/01 11:27:29 simon Exp simon $ ------------------------------------------------------------------------------- 1.0 Overview 1.1 General From the outset our main aim should be to produce a custom sed for our own purposes that is always capable of running existing sed scripts portably and efficiently. The custom sed will be freely available in source form to anyone who wants it, subject to the terms of the typical GNU GPL (General Public License). Changes should be to enhance the sort of program sed already is, (ie: small, fast and powerful), and not to turn it into an awk or a perl. 1.2 Portability The changes should be portable to all contemporary Unixes and if the underlying GNU code permits, to the DOS/Windows environments. 1.3 How the document is structured The proposed changes are broken up into two groups. The first of these groups, section 2.0, covers items that have been requested by a number of seders, and which seem likely to be implemented in a realistic timeframe. The second group, covered in section 3.0, is a list of the balance of items that may be addressed at a later date. Section 4 covers testing. 2.0 Implementation Wish List - To do now 2.01 'r' command, read file into pattern space (Al Aab) A form of r (super r ?), to get a file into the pattern space (or the hold space). As is, r is 1 of the least used/useful sed commands, but if you could implement a super-r then you could merge several input files. (Greg Ubben) I think it was suggested to have the r command read into the pattern space where the text can be manipulated. Rather than this (slurping a whole file into the pattern space), maybe add an f (file) or e (edit) command that will essentially switch the input stream to the named file. Then you can process each line separately. Might push this file (like .so in nroff), returning to the original stream when the end is reached. This one needs to be thought out a lot more -- I just thought of it. 2.02 Case insensitivity (Al Aab and others) An new 'i' flag to the s command and on the command line s/RE/replacement/i # meaning match RE regardless of case. /match/i # as above sed -i ... # meaning everything should be # case-insensitive till the next -e 2.03 Debugger 2.03.01 -d0 print parameters and exit (Doug McClure) An option to print the parameters passed to sed. This would help debug scripts that are mangled by the shell process. The -d0 flag would output the parameters passed to sed and then exit. 2.03.02 -d1 print a trace of sed commands encountered (Simon Taylor) An option to output a simple indicator of each sed command (and flags) processed by sed. For instance, the processing of sed 's/abc/def/g' would result in the debugging output: Command: s Flag(s): g being output on the stderr stream. 2.03.03 Print better error messages. (Doug McClure) sed often complains about extra garbage, but it would be helpful if it would print where it detected the garbage as it began parsing. (Simon Taylor) The error messages output by the GNU gawk program are an outstanding example of how user-friendly error recovery can be. However, this is often not a trivial programming task. 2.04 Extending the 'y' command (y/a-z/A-Z/) (Greg Ubben/Al Aab)) Modify the 'y' to enable character ranges y/a-z/A-Z/ 2.05 White space = space/s, tab/s, nl/s (Al Aab) Suggest "\s" to mean any of \t, space, \r, \f or \n 2.06 Use of \n and \t (Al Aab and others) Enhance sed to accept \n and \t in the replacement part of a s/... /.../ (See 2.09 also) 2.07 New command-line switches: (Eric Pement) --traditional --compat -c Compatibility mode. Disables new features to permit compatible operation with traditional sed syntax. -E Within regular expressions ONLY, support for Extended regexps (like egrep), so that braces {..} and parens (..) do not need to be preceded by the backslash as normally. Extended regexp syntax should be consisted with GNU grep. -H If hold space is empty, do not append leading newline to hold space when using the H command. --help --usage -h Brief message explaining syntax of sed commands and switches. 2.08 New regexp operators: (Eric Pement) + one or more of the previous atom. Equivalent to \{1,\} ? zero or one of the previous atom. Equivalent to \{0,\1} with Perl-style minimal pattern matching: *? minimal match of * +? minimal match of + ?? minimal match of ? | matches the regexp on either side of the bar \(...\) - in standard GNU sed, or (...) - with the -E switch should support grouping as well as backreferences. I.e., sed -E "/foo(bar){5}/d" should be valid syntax to delete lines matching "foobarbarbarbarbar". 2.09 New regexp metacharacters: (Eric Pement) \\ - literal backslash \a - alert (^G) \e - escape \f - formfeed \r - carriage return \t - tab \v - vertical tab \xhh - the ASCII character corresponding to 2 hex digits hh. \ooo - the ASCII character corresponding to 3 octal digits ooo | - match the regexp on either side of the vertical bar Regexp metacharacters should be usable in pattern matching (/match/) and in the LHS and RHS of a substitution (s/LHS/RHS/)! Especially, GNU sed should support hex and octal notation. 2.10 New sed commands: (Eric Pement) ~ Print current line number without a trailing newline E Erase first line of pattern space (like D), but return to loop at top of the current brace level instead of at top of script q NUM Optional numeric argument, NUM, sets exit code when quitting E.g, 'q5' or 'q 5' exits script with errorlevel 5 Q NUM Quit applying sed script to input file, but continue printing the rest of the file to stdout (unless -n is used). Accepts an optional numeric argument, NUM, to set exit code. v NUM Requires GNU sed version NUM or higher to run. If NUM is omitted, requires GNU sed to run (if run on normal seds, 'v' will generate an error anyway). z Zero out hold space. Same as "{x;s/.*//;x}" 2.11 Other wishes (features already in HHsed): (Eric Pement) * a, c and i commands don't insist on a leading backslash '\n' in the text. * r and w commands do not insist on whitespace before the filename. * The g, P, p and 'n' options on s commands may be given in any order. On the s command, an option P is allowed: Print the first line of the current text buffer if a replacement was made. * Escape sequences are valid in all contexts except file names and labels. * The full range of characters are allowed all 256 values. * The W command (write first line of pattern space to file). W [wfile] Write the first line of the current text buffer to 'wfile'. If no 'wfile' is given standard output is used. * The T command (branch on last substitution failed). T [label] Branch to the ':' command with the given 'label' if no s commands have succeeded since the last input line or t or T command. Branch to the end of the script if no 'label' is given. * The second address [of a range address] may be in the form of '+number'. This means that the command will stay selected for number lines after the first address is satisfied. * The empty RE "//" is allowed as a first address if a previous RE has been compiled. 2.12 Several pattern/hold spaces With attendant commands to use those spaces/buffers (Greg Ubben) Allow h/H/g/G/x commands to be followed by a single digit to allow up to 10 named buffers (besides the normal hold space). Sure you could allow more, but allowing arbitrary names like awk/Perl variables gets out of the spirit of sed. or, as a variation, (Al Aab) a stack of pattern/spaces with commands: push pop swap 2.13 Allow addresses to specify file 'n' of a multiple file input stream. (Greg Ubben) A feature which I suggested years ago to the GNU sed maintainer would overcome a sed limitation when processing multiple files. Currently sed has no way of telling where one file ends and another starts. It just looks like one long stream of text. So you can't write a script to extract the Subject: header out of a series of news articles, for instance, because it wouldn't know if it was in the body of an article or the header of the next. I think the syntax I suggested involved using a period in line number addressing to denote a line number relative to the current file rather than to the entire stream. So: 5 line 5 of entire stream (occurs no more than once) .5 5th line of each file $ end of entire stream (occurs exactly once) .$ last line of current file .1,/^$/ extract headers of a series of messages I also suggested a number followed by a period to denote a file number, so: 3.18 means the 18th line of the 3rd file 3.$ last line of 3rd file 3. could be short for 3.1,3.$ when used as only address, or 3.1 when used as address1, or 3.$ when used as address2 Or could be considered illegal syntax. $.18 18th line of last file 2.13,5.3 from file 2, line 13 to file 5, line 3 This should be easy to implement and seems to fit into the spirit of sed. 2.14 Determining r and w file names at run time (Greg Ubben and others) Another common thing you have to use awk/Perl for, is writing files whose names are determined at run-time from the contents of the input. Such as splitting C functions in one big program out into separate .c files named for each function. Just can't be done in sed. One way to implement this would be to have an RE syntax that remembers the text as the filename to use (e.g., \(\(...\)\), though that's pretty ugly), which could occur in a /pattern/ address or in a s/// command, and if you use a w command without a filename given, it uses the last text matched by this RE syntax as the filename. For consistency, r command should allow this too, though not real useful there. The r/w commands would be dynamic in this case -- opening/closing at runtime rather than staying open as it works now. So you could write 1000 lines at the beginning of a file, and read them back later in the file. If no previous RE has set the filename, might default it to a temporary scratch filename. 2.15 A BEGIN address Add the ability to specify a sed action at the position in the stream before the first line is read. ie: 0{ blah; blah; } 2.16 Enhanced y command (Greg Ubben) Add a Y command, in the spirit of P and D, that works like y///, except it only transforms up to the first newline. Allows you to move text to be transformed up to the front to transform only part of a line, rather than having to move stuff you don't want changed out to the hold space like we do now. May not be useful enough to warrant adding it -- just an idea. Could do same thing with S///g command. Actually, what we *really* need are allowing \u, \U, \l, \L etc. on the RHS of a s/// for changing case -- then you can leave the y/// command as it is. 3.0 Implementation Wish List - To consider later 3.01 Ability to split files into several, f0000, f0001, ... , fnnnn sed -o "anyheader" myfile should chop myfile into subfiles myfile.000, myfile.001, ... each subfile has line 1 staring with the substring "anyheader" 3.02 Near search/replace Al suggests: "RE1 and RE2, if they are within n lines of each other, a la Boolean near of altavista power search" Comments anyone? 3.03 Paragraph processing Various suggestions to enable sed to support different kinds of record separators. 3.04 String back-reference, dynamically Extending the back-reference \1,\2, ... ,\9 * NEED INPUT * 3.05 The ability to deal with the null character (hex/octal/decimal zero). Extend sed so that is can read input that includes (and is not terminated by) the null character. Other suggestions to extend sed to operate against true binary data. 3.06 Self-modification * NEED INPUT * 3.07 Nor, nand, exclusive or * NEED INPUT * 4.0 Testing It is vital that we start with a set of test scripts that broadly define the agreed behavior of the notional 'standard' sed. Each of the sed commands should be represented by one or more test scripts and a corresponding results file. ie: for the 'p' command, we might have a test case as simple as: sed '3,6p' some_file And a test result file that contains the expected output. This is a trivial example, but for regression testing it is vital that we can run each iteration of the custom sed through as large a set of automated tests as possible. The sed one liners might be a good place to start. I already have a mechanism to run the tests, all we need is the group to supply test cases and results that we all agree are 'standard'. 5.0 Distribution Initially via the seders mailing list and associated WWW sites. 6.0 Input This specification is based on ideas submitted by members of the seders mailing list, particularly the following, (more submissions are always welcome). Al Aab Edgar Allen Otavio Exel Brendan Macmillan Dennis Marti Doug McClure Eric Pement Greg Ubben ------------------------------------------------------------------------------ $Log: custom_sed.spec,v $ Revision 1.2 1998/05/01 11:27:29 simon Created section 2 for items more likely to be implemented in the near future, and section 3 for those to be considered later. Expanded on "r" command changes, added paragraphs for multiple pattern and hold spaces, distinguishing multiple files on input, dynamically determining file names for the "r" and "w" commands, a proposed BEGIN address and an enhanced "y" command. Also attempted to attribute the authors of suggestions where possible. Revision 1.1 1998/04/22 12:12:36 simon Modified description of 'r' wish list item, also added new item re multiple pattern/hold spaces. Revision 1.0 1998/04/21 12:19:27 simon Initial revision