Seder's grab bag

Script Archive

Some sed implementations do not accept comments (text preceded by #), so it is sometimes necessary to remove them from a script before running it.

I tried to classify these scripts according to their cleverness and their interest for someone learning how to use sed. Clever scripts are not always faster; more complicated techniques are sometimes slower. Longer and more complex scripts are sometimes mostly boilerplate, and thus less interesting for the sed `student' and rated with fewer stars. Readability and documentation also influenced the rating.

      (*) = very basic, little to learn
     (**) = does some simple processing
    (***) = shows some nice techniques
   (****) = shows advanced techniques, such as lookup tables
  (*****) = that's extreme sed!

Filename manipulation

Lowercase filenames (filter) (***)
Uppercase filenames (filter) (***): Lowercase/uppercase list of filenames supplied from STDIN. Makes a list of mv commands.
Example: find /mnt/zeus/docs | tolower.sed | sh -x
Lowercase filenames (application) (***)
Uppercase filenames (application) (***): Lowercase/uppercase list of filenames supplied as command line arguments. Again, makes a list of mv commands. This version operates on files in the current directory only.
Example: down *.HTM *.INC *.sed
Print basename of files (**): Remove the directory prefix from a file path, and print remaining element. Like Unix basename, but reads data from a file or stdin. Could easily be adapted for DOS conventions.
Print path of files (**): Remove the filename from a file path, and print remaining elements. Like Unix dirname, but reads data from a file or stdin. Easily adapted to DOS conventions.
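
The two path filters above each come down to a single substitution. The following is a minimal sketch of that idea, not the archived scripts themselves; the function names `sed_basename` and `sed_dirname` are hypothetical, and paths without a slash or with a trailing slash are assumed not to need special treatment.

```shell
# Hypothetical helper names; each reads one path per line on stdin.

# Like basename: delete everything up to and including the last slash.
sed_basename() {
    sed 's|.*/||'
}

# Like dirname: delete the last slash and whatever follows it.
# Assumption: paths with no slash or a trailing slash get no special handling.
sed_dirname() {
    sed 's|/[^/]*$||'
}

printf '%s\n' /usr/local/bin/sed docs/readme.txt | sed_basename
printf '%s\n' /usr/local/bin/sed docs/readme.txt | sed_dirname
```

Adapting this to DOS conventions would presumably mean replacing the slash in both expressions with a backslash, written `\\` inside the regular expression.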

File conversion

Convert DOS files for UNIX and vice versa (*): Changes DOS end-of-lines to UNIX end-of-lines (to be run under UNIX). Provided in a single gzipped tar file so that the server does not mangle the control characters.
Split digest (**): Recreates original email messages from a list digest. The author says this should work `at least for digests generated by Majordomo and #listserv, and FAQs using minimal digest format.'
rot13 (*): The simplest symmetric cypher in the world...
TeX to XML converter (*****): Changes TeX-like tags (abc{...}) to XML-like tags (<abc>...</abc>). An interesting proof of concept script by Tilmann Bitterberg, supporting nested tags and much more.
Expand quoted strings (*****): This script takes a complex configuration file format (supporting almost every quoting style in the Bourne shell) and encodes each value that the script defines with "dangerous" characters properly escaped; full documentation is contained in the download. This script by Nathan D. Ryan shows how to do complex conversions with sed.
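
The two simplest conversions in this section can be sketched in a few lines. This is an illustration under stated assumptions rather than the archived scripts: the function names `rot13` and `strip_cr` are hypothetical, and the carriage return is produced with printf so that the script itself contains no raw control character.

```shell
# Hypothetical helper names; both filter stdin to stdout.

# rot13: y/// maps each letter to the one 13 places further on, wrapping around.
# Applying the filter twice gives back the original text.
rot13() {
    sed 'y/abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ/nopqrstuvwxyzabcdefghijklmNOPQRSTUVWXYZABCDEFGHIJKLM/'
}

# DOS -> UNIX: delete the carriage return at the end of each line.
# printf generates the CR, so no control character is stored in this file.
strip_cr() {
    cr=$(printf '\r')
    sed "s/$cr\$//"
}

printf 'Hello, sed!\n' | rot13
printf 'Hello, sed!\n' | rot13 | rot13
printf 'dos line\r\n' | strip_cr
```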

HTML utilities

Text -> HTML (*): Converts preformatted text to HTML ready for viewing.
Insert boldface/italic tags (***): Takes input files with two different "toggle switches" such as the _underscore_ and *asterisk*, and converts them into something like <i>italic</i> and <b>boldface</b> in the output. A nice exercise would be to merge this with untroff.sed and obtain a nice troff-to-HTML sed script.
ISO8859-1 -> HTML (*): Convert ISO Latin 1 characters (eg: é, £, ¥, ½) to their equivalent HTML character entities.
HTML -> ISO8859-1 (*): Convert HTML character entities to their ISO Latin 1 equivalent.
Lowercase HTML tags (****)
Uppercase HTML tags (****): Change case of HTML tags, preserving attributes.
Index HTML links (****): This script, by Tilmann Bitterberg, adds an index of links to an HTML file: similar to `lynx -dump', but preserving the HTML tags in the file.
Strip HTML comments (**): Remove all commented material from HTML.
Extract URLs from HTML (***): Print all URLs (even commented ones) and associated ALT comments found in an HTML file, formatted as: URL|comment.
Extract title from HTML (***): Print the TITLE (or the first H[0-7] heading located) of an HTML document.
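
The "toggle switch" conversion described for the boldface/italic script can be approximated with two substitutions. This is a simplified sketch, not the archived script: the function name `emphasis2html` is hypothetical, and it assumes that each marker opens and closes on the same line and that underscores and asterisks appear only as markers.

```shell
# Hypothetical helper name; filters stdin to stdout.
# Assumptions: a toggle opens and closes on the same line, and the two
# marker characters are never used for anything else.
emphasis2html() {
    sed -e 's|_\([^_]*\)_|<i>\1</i>|g' \
        -e 's|\*\([^*]*\)\*|<b>\1</b>|g'
}

printf 'an _old_ but *bold* move\n' | emphasis2html
```

A script that tracks toggles across line boundaries would need to remember the current state between lines, for instance in the hold space.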

Text formatting

Capitalise words 1/5 (**): Capitalises the first letter of each word.
Capitalise words 2/5 (**): A first approach to doing it faster.
Capitalise words 3/5 (***): A cleaner implementation of the idea in cflword2.sed
Capitalise words 4/5 (***): This gets weirdo!
Capitalise words 5/5 (****): And finally, we capitalise words with lookup tables.

Formats text lines (***): Formats text so that each line is shorter than 40 characters.
Expand tabs to spaces (****): Another masterpiece by Greg Ubben. Two versions are available: one works with all sed implementations, while the other only works with GNU sed 3.02.80 or ssed, but is more readable because it does not contain control characters.
Reverse text (**): Reverses the order of characters on each line of input.
Reverse text (**): A faster version.
Reverse file (***): Reverses the line order of a file, subject to the size of the hold buffer.
Join lines (*): Joins all input on a single line.
Un-double-space lines (*): Change double-spaced lines to single-spaced.
Centre lines 1/2 (**): Centres lines for an 80-column device. Easily adapted to different widths.
Centre lines 2/2 (*): A different and more CPU-intensive approach.
Squeeze blank lines (***): Replace consecutive blank lines with one line, so that at most one empty line separates two non-empty lines. Emulates cat -s.
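
Two of the tasks in this section have well-known short solutions, sketched below. They are not necessarily the archived scripts, and the function names `reverse_lines` and `squeeze_blank` are hypothetical. The reversal accumulates the whole input in hold space, which is why it is subject to the size of the hold buffer.

```shell
# Hypothetical helper names; both filter stdin to stdout.

# Reverse line order: append the hold space to every line but the first (1!G),
# save the growing result back (h), and print it once at the last line ($p).
reverse_lines() {
    sed -n '1!G;h;$p'
}

# Squeeze blank lines: keep each range running from a non-empty line to the
# next empty one, and delete every line outside such a range.
# Note: unlike cat -s, this also drops blank lines at the top of the input.
squeeze_blank() {
    sed '/./,/^$/!d'
}

printf '%s\n' one two three | reverse_lines
printf 'a\n\n\n\nb\n' | squeeze_blank
```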

Beautifiers

Intel assembler -> UNIX assembler (**): Converts Intel 386 assembly (nasm) code to Unix 386 assembly (gas) code.
Strip C comments (1/4) (**): The first in a series of scripts that do the same task in more and more sophisticated ways. This one handles multiline comments, but not multiple comments in a line.
Strip C comments (2/4) (***): Unlike the previous one, this script by Stewart Ravenhall handles comments surrounded by code.
Strip C/C++ comments (3/4) (****): This script, by Brian Hiles, handles C and C++ (//) comments and, unlike the previous ones, correctly skips comments inside strings. It shows a very interesting trick to build a line piecewise in hold space, which eases more complicated parsing tasks.
Beautify directory listing (UNIX) (***): Indents the output of ls -lR according to the depth of each directory. Makes output far easier to read.
Directory tree (UNIX) (**): Indents the output of find -type d into a nice tree format. Thanks to Stewart Ravenhall.
Commify numbers 1/3 (**): Formats numbers by inserting a comma before each group of 3 digits, counting from the right (eg: 1,200,573).
Commify numbers 2/3 (**): A more compact script for versions of sed which recognise Extended RE's.
Commify numbers 3/3 (**): Compare with #1. This script expects 100% numeric input.
File polisher (troff): Very comprehensive suite of filters by Robert Marks which perform a large number of beautifying operations on text files prior to processing by troff. These scripts were used to produce camera-ready output for the Australian School of Management between 1985 and 1995. You can download a gzipped tar archive of the scripts, or individual scripts: polish0.sed, polish1.sed, polish2.sed, polish3.sed, polish4.sed, polish5.sed, polish6.sed, polish7.sed, polish8.sed, polish9.sed, or visit Robert's Web site.
Horizontal banner (*): Rotates the vertical output of banner to produce horizontal output. The script assumes a screen size of 80x60. This could be overcome.
Remove troff overstrikes (***): A script to convert troff output to pure text, replacing boldfaces with "*...*" and underlines with "_..._". Also shows how to justify text using sed.
Number lines (*): A short script to display output lines preceded by line numbers. This is similar to the UNIX nl command, or cat -n.
Number lines (**): This version demonstrates a technique for manually calculating numbers.
Number non-empty lines (*): A short script to display output lines, preceding non-empty lines with a line number. Empty lines affect the count. This is not the same as cat -bn, which does not count empty lines.
Number non-empty lines (**): This version demonstrates a technique for manually calculating numbers; it emulates cat -bn exactly.
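
The comma insertion described for the commify scripts can be sketched as a loop around one substitution. This is an illustration, not one of the three archived versions; the function name `commify` is hypothetical, and like version 3/3 it assumes purely numeric input (one unsigned integer per line).

```shell
# Hypothetical helper name; expects one unsigned integer per line.
# Each pass inserts a comma before the rightmost run of three digits that
# still has a digit directly in front of it; "ta" jumps back to label "a"
# for as long as a substitution was made.
commify() {
    sed -e ':a' -e 's/\(.*[0-9]\)\([0-9]\{3\}\)/\1,\2/' -e 'ta'
}

printf '%s\n' 1200573 999 | commify
```

The label and the branch are passed as separate -e fragments because some sed implementations read a label up to the end of the line.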

Information extraction / tabulation

Find subwords (**): Search for dictionary words in a string.
Extract regular expressions and print the context - by Greg Ubben (****): Extract from a file the lines that contain a regular expression, printing the lines containing the pattern and those that surround them.
Extract regular expressions and print the context - thanks to Hartmut Schaefer (***): Print all the occurrences of a regular expression in a file. Each occurrence is printed on a separate line, isolated from the non-matching text (for example, the regex \<[A-Za-z]*\> will yield all the words in the file, one per line).
Find anagrams (****): Search for anagrams in a list of words (one word per line).
Indexer (****): This script collates a list of references to produce an index suitable for a book or magazine. A detailed description of the way it works, along with alternative versions of the script, is available on the tutorials page. The script was used by the Cornerstone magazine to create an index for a book after typesetting.
Show make targets (***): Extracts targets for a file from a makefile.
Sort/delimit/number a list of names (*****): Sort, partition and number a list of names. This script is not exactly the one described on the tutorials page, but it resembles it very closely. A thorough analysis of the techniques used in this script is given by the author of the original script in Using lookup tables with s/// and A lookup-table counter.
Display beginning of file (*): Display first 10 lines of a file. Like head.
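
Displaying the beginning of a file needs only the q command. A minimal sketch follows; the function name `sed_head` and its optional line-count argument are hypothetical additions for illustration.

```shell
# Hypothetical helper name; the line count defaults to 10.
# "Nq" prints lines as usual and quits on reaching line N, so the rest of
# the input is never read.
sed_head() {
    sed "${1:-10}q"
}

printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 | sed_head
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 | sed_head 3
```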

Miscellaneous

Desktop calculator (*****): This script from sed guru Greg Ubben is a full implementation of the Unix desktop calculator dc. dc is an arbitrary precision, multi-base, stacking calculator. Here is a quick guide to it, which Greg posted to seders, and an overview of how it works.
Add decimals (****): This impressively short script adds a list of decimal numbers. It pulls this off by transforming and concatenating units in each number into an analogue format, where a=1, aa=2, aaa=3, etc, transforming the result back to decimal, and proceeding with the next digit. Using lookup tables makes it possible to do this with only 9 commands, with a 3-command inner loop; to understand the idea better you might want to peek at an implementation of the same algorithm without lookup tables.

Note that this is not the script explained in Greg Ubben's Adding a list of decimal numbers (on the tutorials page); Greg's script is found directly in the tutorial.
Sierpinski triangles 1/3 (***)
Sierpinski triangles 2/3 (***): These scripts generate Sierpinski's triangle. Pass them a line made of many underscores and a single X, something like ______X.
Sierpinski triangles 3/3, slow and portable (***)
Sierpinski triangles 3/3, fast and less portable (***): To get below 10 commands to do the same, I had to find out the real rule behind Sierpinski's triangle (the other two attempts were somewhat empirical). It turns out that Sierpinski's triangle is actually Wolfram's rule-90 cellular automaton. :-)
Increment a number (****): Interesting script to increment numbers. This algorithm is the fastest I know of that does not use both buffers.
Turing Machine in sed (****): A totally useless but quite funny script by Christophe Blaess: a Turing Machine is able to execute any computable task (albeit slowly and painfully)... so sed can perform any computable task!!! Here is a description of the input file format, including a sample automaton to increment binary numbers.
sed sokoban by Aurelio Marinho Jargas (*****): Yes, this is a full-featured 90-level sokoban game with color and animation! Play with the arrow keys or with the classic vi keys hjkl (left, down, up, right).
sed arkanoid by Aurelio Marinho Jargas (*****): And yet another masterpiece from the author of the sed sokoban game.

You might like the shell script playsed which makes the ball move automatically.
sed naughts and crosses (*****): And now, here's a naughts and crosses game too.
Brainf**k to C compiler (**): This script converts the Brainf**k programming language to C, ready to be compiled to machine code.
Display month calendar (*****): Display a simple calendar for the current month, à la the UNIX command cal. Only the date command is required; the math is done directly in sed.
Display year calendar (****): Display a simple calendar for the current year, very roughly based on the above script. This time date computation is done with dc rather than date.
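
Incrementing a number using the pattern space alone can be done by temporarily marking the digits that roll over. The sketch below follows a widely circulated approach and may differ from the archived script; the function name `increment` is hypothetical, and the input is assumed to be one non-negative decimal integer per non-empty line.

```shell
# Hypothetical helper name; expects one non-negative decimal integer per line.
# Steps performed by the sed script, in order:
#   1. delete lines containing anything but digits;
#   2. loop at :d, turning each trailing 9 into an underscore (it rolls over);
#   3. bump the last remaining digit, or prepend 1 if only underscores remain,
#      then branch to :n;
#   4. at :n, turn the underscores into zeros.
increment() {
    sed '
/[^0-9]/d
:d
s/9\(_*\)$/_\1/
td
s/^\(_*\)$/1\1/
tn
s/8\(_*\)$/9\1/
tn
s/7\(_*\)$/8\1/
tn
s/6\(_*\)$/7\1/
tn
s/5\(_*\)$/6\1/
tn
s/4\(_*\)$/5\1/
tn
s/3\(_*\)$/4\1/
tn
s/2\(_*\)$/3\1/
tn
s/1\(_*\)$/2\1/
tn
s/0\(_*\)$/1\1/
:n
y/_/0/
'
}

printf '%s\n' 41 199 999 | increment
```

The comments are kept outside the sed script itself, since not every sed accepts them.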

sed debuggers

Python sed debugger by Aurelio Marinho Jargas: A Python script that reads a sed script from a file and generates another sed script, this one with debug commands added. So it is NOT a sed interpreter: it generates a sed debug file that is itself written in sed! You can debug your sed files with your own version of sed (DOS, Linux, HP-UX, ...). The debug file is saved with a .sedd extension.

You can also use it as a script beautifier (to insert and standardize spacing) with the --indent option, which writes the beautified script to standard output. It can also be used as an expert command analyzer with the --tokenize option, which gives all the command information you need.
Korn shell debugger by Brian Hiles: This also instruments the debugged sed script. It implements spypoints on conditional and unconditional criteria that can involve lines, regular expressions, or a combination of both. A man page is embedded in the script. You can also download an older version which runs in the Bourne Shell.

Updated 17 Nov 2003
