Readers Journal weblog/wEssays | home | |
In Praise of Awk
(Protagoras, December 10, 2007)

Awk (yes, I will explain in a minute what it is) is a bit like that knife you've had forever in your kitchen drawer, the one you seem to reach for every time you cook. Whatever you need to do, it does it easily and naturally. Awk is like some old tool you have in your toolbox, and almost every time you do any carpentry, you find yourself looking for it. It could be one of those Japanese saws with a crosscut and a ripping edge, or perhaps a screw bit holder, but the kind CSK make, with a friction drive not a ratchet.

For me Awk also has an old-fashioned feel. It is like that washing machine which your parents had for 20 years, with two programs and a mechanical timer, or an old tape recorder that you just set to either record or play. If you've always used Windows, if you're under 40, if you've never written programs, you may never have heard of Awk. But it's like that, it's like a lot of truly excellent computer tools: relatively hard to learn, but super easy to use.

So what is it? It's a program that extracts data from files. The hard thing to understand when you are learning Awk is just that. It does not operate on files to change them. It's not meant to be a general purpose programming language. Awk wakes up and the first thing it says is, 'Where is my file?' The second thing it says is, 'What do you want out of it?' And the third thing it asks is, 'Where do you want to put it?' You use Awk to do almost any kind of reformatting and restructuring of a file.

Here's an example. In the UK recently, one Government office put a couple of CDs in the post to another. The CDs contained a file of half the population of the UK: names, addresses, bank account details, social security numbers. That is around 25 million people. The CDs got lost in the post, as they do. This created something of a fuss, and in some areas of London you will find queues of people outside banks, not to take their money out, but to change their PINs.
Now here is what is funny about this. The receiving office said that they did not want all this data. Just names and cities, or something similar, would do. They thought it was a security risk to mail the complete file around the country. How right they were! But the problem the sending office had was that it was going to cost several thousand pounds to strip out some of the data. It would take a real professional software house to do something as complicated as that.

In Awk, this is a one liner. For the curious, assuming the data you want is in columns one and two of a tab separated file, this is what you do to get it out:

    awk -F'\t' -v OFS='\t' '{ print $1, $2 }' sourcefile > destinationfile

The -F'\t' tells Awk that the fields are separated by tabs rather than by any white space, so a name with a space in it stays in one piece, and the OFS setting keeps the output tab separated too. Come back later, because 25 million lines will take a while, and you will have a file with just columns one and two in it.

Now, maybe they thought it was so complicated because they only had proper modern software around the office. They had a copy of MS Office, and for a fleeting moment they thought of using that, but then they realized that Excel will only allow 64 thousand rows, and we had 25 million here. Or maybe they had some modern database package, and they realized that they were going to have to write some SQL to get at a subset, and they had no idea how to do it, or whether it was even possible. Maybe it was as if, in a fit of irritation, you decide that rather than navigate through all those menus to get your video recorder to tape the program at 8.30 next Thursday, you will just write a big note to yourself and put it on the door so you'll see it every day and not forget. So, in the mail it went. It was probably copied over using drag and drop.

Here is another example. You have a database package, and you have data in it, and the people who supplied it tell you it doesn't do exports. Or not in any form any other package can use.
They explain a lot of stuff which makes no sense to you, about how you have to write XSLT transforms, how tab separated files are not a well defined format. The bottom line is no, you are stuck with us. You look into this more deeply and discover your data is in a text file in a format like this:
RECORD+++
and you want it to go
data TAB data TAB

That is, each record should be on one line, with the different fields separated by tabs, and the RECORD+++ should appear nowhere. Now, you can reach for a proper modern programming language with a proper graphical interface. Or you can take half an hour and write yourself a little script in Awk. If you need to, you can do loops, arrays, conditionals, all kinds of stuff. If you don't need them, you just write one liners. You can write real programs and save them in files and run them whenever you want. Or you can just type in commands and tell it what to do this time.

No, you did not need that food processor with 25 different settings and all those attachments to choose from, and a huge pile of washing up afterwards. No, you did not need to learn the mysteries of XML and writing transforms. What you needed was something shaped to the hand, with a nice patina on it, that you know how to use, that as soon as you got hold of it said to you, yes, what would you like to do with that onion? And a few minutes later, you're done, there's no mess, and without thinking too much about it, you move on to the peppers and tomatoes.

Awk comes with Linux. But you can get it free for Windows, just google for GNU Awk. And if you want a tutorial, the best one is in the O'Reilly book 'sed & awk'. Forget sed, Awk is what you want. If you want an online tutorial, there is an excellent one in three parts at https://www.gentoo.org/doc/en/articles/l-awk1.xml.
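For anyone who would like to see the second example spelled out before opening a tutorial, here is a minimal sketch of that little script. It rests on assumptions the essay does not settle: that each record starts with a line reading exactly RECORD+++, and that the fields follow one per line. The file names export.txt and flat.tsv and the sample names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample input. Assumed layout: a RECORD+++ marker line,
# then one field per line until the next marker.
workdir=$(mktemp -d)
cat > "$workdir/export.txt" <<'EOF'
RECORD+++
Alice Smith
London
RECORD+++
Bob Jones
Leeds
EOF

# Join the fields of each record with tabs, one record per output line.
awk '
    $0 == "RECORD+++" {              # marker line: emit the finished record
        if (n > 0) print line
        line = ""; n = 0
        next                         # the marker itself is never printed
    }
    {                                # data line: append it as the next field
        line = (n++ == 0) ? $0 : (line "\t" $0)
    }
    END { if (n > 0) print line }    # flush the last record
' "$workdir/export.txt" > "$workdir/flat.tsv"

cat "$workdir/flat.tsv"
```

Run as it stands, this should print two lines, one per record, with a tab between the name and the city. If your real export looks different (fields on the same line as the marker, say), the pattern in the first rule is the part to change.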
It may take some persistence to learn. But it's a tool to have and keep. You will never again sit looking at a file, baffled by some infuriating, pastel colored graphical interface, and think wearily that yes, probably, with 25 different search and replaces, done in just the right order, I might be able to get what I want out of it, but might it not be easier just to type the whole thing in again...
HTML, format and art copyright © 2007 Charles Hugh Smith, copyright to text and all other content in the above work is held by the author of the essay as of the publication date listed above. All rights reserved in all media. The views of the contributor authors are their own, and do not reflect the views of Charles Hugh Smith. All errors and errors of omission in the above essay are the sole responsibility of the essay's author. The writer(s) would be honored if you linked this Readers Journal essay to your site, or printed a copy for your own use.