On Using Awk Instead of Python
2019-08-29
A Software Niche
Python is widely used as a language for one-off scripts that deal with text files. As an interpreted, dynamically typed language, Python has extremely low development and deployment overhead, making it perfect for automating tasks that would otherwise be done manually, or which are done as part of a manual process. This is perhaps part of the reason why Python is so popular in data science workflows. While the upfront complexity of transforming an input into a usable data table can be high, the number of "transactions", or processing batches, is relatively low. Therefore, data science pre-processing problems benefit from a language that offers high productivity up front, with a low amortized performance penalty.
While Python, and to a lesser extent Ruby, have justly taken much of the mind share for this kind of problem, I want to promote an older, perhaps cooler alternative. Many developers likely know awk as the inscrutable answer to StackOverflow questions such as "Shell command to sum integers, one per line?" and "Find and kill a process in one line using bash and regex". If you're anything like me, you've probably copied and pasted more than a few awk one-liners without understanding the tool in any detail. I recently took the time to learn awk in some more detail, and found that it is not only powerful, but extremely approachable.
My goal with this post is not to promote awk over Python, even for the very specific kind of workload that we'll consider. Rather, I want to introduce people to some of awk's power, and show that most of what comes naturally about Python is just as natural, or even more so, in awk.
A Motivating Example
Let's use the problem I was solving as an example. I replicate my org file TODOs to a server, from which I want to run a daily cron job summarizing the task status changes. For example, if yesterday a task was marked as TODO, but today it is DONE, I want to have that task listed in an email summarizing the things I got done yesterday. In order to access historical information on the status of the org files, rather than put them in version control and inspect the history, I materialize the current status with a separate daily cron job, which writes rows to a .tsv file. So the architecture we have is two separate cron jobs, each using an awk script.
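As a sketch, the crontab for this architecture might look something like the following (the script names, paths, and times are all hypothetical):

# snapshot the current TODO state every night...
55 23 * * * snapshot-todos.sh >> $HOME/todo-status.tsv
# ...and mail out the summary each morning
05 07 * * * daily-report.sh < $HOME/todo-status.tsv | mail -s "TODO report" me@example.com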
Let's look at each problem, and compare a strawman Python solution with my awk script.
Dealing with row-based data
The first script needs to take org-mode headlines, and transform them into structured data. An org-mode headline might look like this:
** TODO separate books for donation :moving:
   |--| |----------------------------------|
  keyword             heading
We want to extract the heading, and treat that as a stable identifier. The TODO state keyword, as well as the file and the current date, will be stored in separate columns. The first task is to grep for all of the headings. I think even the most dedicated Python jockey would tend to reach for the command line tool here:
find ~/org -name '[^.]*.org' -type f \
    -exec egrep '\* (TODO|DONE|CANCELLED|NEXT|WAIT|HOLD)' {} + | \
This will produce a line of output per match, prepended with the name of the file. We can now pipe this stream through a processing script.
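Each line of output looks something like this (the path is illustrative):

/home/user/org/personal.org:** TODO separate books for donation :moving: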
Python version
In Python, you might write something like this:
import sys
import datetime

now = datetime.datetime.now()
# strftime zero-pads the month and day, matching the awk version's date format
date = now.strftime("%Y-%m-%d")

for line in sys.stdin.readlines():
    words = line.strip().split(" ")
    # turn "file.org:**" into "file.org\t**" (first colon only, like awk's sub)
    words[0] = words[0].replace(":", "\t", 1)
    heading = " ".join(words[2:])
    output = "{}\t{}\t{}\t{}".format(heading, words[0], words[1], date)
    print(output)
and use it by piping into the python interpreter.
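Concretely, the whole first cron job becomes one pipeline (process_headlines.py and the output path are hypothetical names):

find ~/org -name '[^.]*.org' -type f \
    -exec egrep '\* (TODO|DONE|CANCELLED|NEXT|WAIT|HOLD)' {} + | \
    python3 process_headlines.py >> $HOME/todo-status.tsv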
Awk version
BEGIN {
    "date \"+%Y-%m-%d\"" | getline date
}
The first part is the BEGIN block, which is optional. If specified, it will run before the rest of the script. Conveniently, any variables defined here will be in scope for the per-line portion of the script. In this case we are using the cmd | getline form to precompute today's date. This stores the output of cmd in the variable named by the argument of getline.
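The same cmd | getline form works with any shell command. A minimal sketch (the command here is arbitrary):

BEGIN {
    "hostname" | getline host    # host now holds the first line of the command's output
    close("hostname")
}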
Next is the main body of the script. This will run once per line. awk has already split the line by the field separator (by default, any whitespace), and stored the parts in the numbered variables $1, $2, $3, and so on.
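For example:

echo "one two three" | awk '{ print $2, NF }'    # prints "two 3"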
{
    sub(":", "\t", $1);
    for (i = 3; i <= NF; i++) {
        printf("%s ", $i)
    }
    printf("\t%s\t%s\t%s\n", $1, $2, date)
}
The first thing we do is an in-place substitution, using the sub function. This replaces the first instance of ":" with a tab, in the string $1, so that the file name and the stars land in separate columns. In our case, $1 contains a string like work.org:**. Next, we write out the whole headline after the TODO keyword, by iterating over every field from $3 onward. The variable NF contains the number of fields in the current row. Finally, we use the date variable we defined in the BEGIN clause to write the rest of the row. We end up with a row like this:
separate books for donation\tpersonal.org\t**\tTODO\t2019-07-21
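Deploying this version just swaps the interpreter in the pipeline above (again, the script name is hypothetical):

find ~/org -name '[^.]*.org' -type f \
    -exec egrep '\* (TODO|DONE|CANCELLED|NEXT|WAIT|HOLD)' {} + | \
    awk -f process_headlines.awk >> $HOME/todo-status.tsv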
Comparison
The Python version of the script uses the expressive standard library, and comes out very succinct. The higher character count is mainly down to having better readability, and performing string copies rather than in-place modification. List slicing in particular makes it nice to write---" ".join(words[2:]) looks better than the for loop that awk requires.
On the other hand, writing that for loop really isn't that much trouble. What strikes me most is that the awk version, while it reads differently from the Python, has most of the same semantics that make Python appealing to write: easy use of global variables, solid string manipulation primitives, and simple access to system resources like the current time.
So far we are still in the realm of what I always knew awk was good for---munging rows of text into other rows of text. Next we'll take a look at a task for which I definitely would have used Python in the past.
Stringly- and dictly-typed programming
Now that we have per-day snapshots of what my org headings look like, it's time to process the data into a daily report. There are two requirements for the report:
- It needs to show the number of tasks in the previous day that were marked DONE from a non-DONE state, and list each one.
- It should list the five most stale tasks that remain undone---that is, the tasks whose TODO status has not changed for the longest time.
These scripts are more involved, so I'll discuss them with the corresponding sections interleaved, to highlight the similarities.
Preamble
In both scripts, we precompute the current day's string representation. Awk uses a pair of built-in functions to format the date string:
BEGIN {
    FS = "\t";    # the snapshot rows are tab-separated, so don't split on spaces
    now = systime();
    today = strftime("%Y-%m-%d", now);
}
In Python, we also precompute the date. We have to set up our container data structures as well: a bunch of dictionaries and one set.
import sys
import datetime
now = datetime.datetime.now()
date = now.strftime("%Y-%m-%d")
headlines = set([])
first_appearance = {}
latest_appearance = {}
previous_status = {}
current_file = {}
latest_status = {}
days_in_state = {}
Per-line processing
We can do everything we want in a single pass of the input data. For each line, we need to do the following:
- Make a note of the latest date we have seen it. Since the input is sorted by day, this is just the date of the current row. We'll use this to filter out TODOs that are no longer current.
- Store the current status and the previous day's status.
- If the current status is equal to the previous day's, increment the number of days it has been in that status.
- Store the file where the TODO is found on that day.
In Python, these steps look like so:
for line in sys.stdin.readlines():
    words = line.strip().split("\t")
    h = words[0]
    headlines.add(h)
    latest_appearance[h] = words[4]
    status = words[3]
    previous_status[h] = latest_status[h] if h in latest_status else ""
    latest_status[h] = status
    if h in days_in_state and status == previous_status[h]:
        days_in_state[h] = days_in_state[h] + 1
    else:
        days_in_state[h] = 1
    current_file[h] = words[1]
The awk version is strikingly similar. The only data structure on offer is the versatile associative array. These take the place of both dictionaries and lists. They also allow us to model a set by simply using the number 1 (or any other value) as the value for each key we want to store.
{
    h = $1;
    headlines[h] = 1;
    latest_appearance[h] = $5;
    status = $4;
    previous_status[h] = latest_status[h];
    latest_status[h] = status;
    if (status == previous_status[h]) {
        days_in_state[h]++;
    } else {
        days_in_state[h] = 1;
    }
    current_file[h] = $2;
}
Even though we didn't declare any of these arrays in the preamble, they do "the right thing" for missing values. For example, in the first instance of a headline, latest_status[h] is the empty string. awk has written the Python version's if expression for us. Similarly, days_in_state[h]++ increments the missing value to 1, as desired.
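You can see the default-value behavior in isolation with a one-liner:

awk 'BEGIN { counts["x"]++; print counts["x"] }'    # prints 1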
Aggregations
After looping over all the per-day records, we need to materialize a few aggregates of the data.
- For each headline, if its latest status is DONE and its previous status was NEXT or TODO, it is added to the list of tasks completed today.
- The incomplete headlines are sorted by the number of days they have been in their current state, and the top 5 are added to the list of most stale TODOs.
Once again, this is code that Python is perfectly suited for. In particular, the list manipulation functions filter and sorted make it very easy to express the calculation for the five most stale TODOs.
donelist = []
for h in headlines:
    if latest_appearance[h] == date:
        if previous_status[h] in ["NEXT", "TODO"] and latest_status[h] == "DONE":
            donelist.append(h)

print("{} tasks done:\n\t".format(len(donelist)), end="")
print("\n\t".join(donelist))

todos = filter(lambda h: latest_status[h] in ["NEXT", "TODO"], headlines)
stalest_todos = sorted(todos, key=lambda h: days_in_state[h], reverse=True)[:5]
print("Top 5 stalest tasks:")
for h in stalest_todos:
    print("\t- {}".format(h))
The awk version is somewhat more ungainly. However, it still reads broadly similarly to the Python.
END {
    delete done_yesterday;
    for (h in headlines) {
        if (latest_appearance[h] == today) {
            prev = previous_status[h];
            curr = latest_status[h];
            if ((prev == "TODO" || prev == "NEXT") && curr == "DONE") {
                done_yesterday[h] = 1;
            }
            if (curr == "TODO" || curr == "NEXT") {
                todos[h] = days_in_state[h];
            }
        }
    }

    printf("Tasks completed yesterday: %s\n", length(done_yesterday));
    for (h in done_yesterday) {
        printf("\t%s", h);
    }
    printf("\n\n");

    asorti(todos, stalest, "@val_num_desc");
    printf("Top 5 most stale TODO tasks:\n");
    for (i = 1; i <= 5; i++) {
        h = stalest[i];
        printf("\t[[file:%s::*%s][%s]]\n", current_file[h], h, h);
    }
}
The ugliest wart here is the line delete done_yesterday. This is required for length(done_yesterday) to return 0 if we never evaluate done_yesterday[h] = 1. awk doesn't require you to declare your variables, but it also doesn't have a facility for assigning a variable to an empty array! This is a fairly serious gotcha, in my opinion.
We also see an example of sorting arrays. While terse, sorting in awk is rather unnatural compared to the Python. The asorti function sorts the indices of its first argument, the (associative) array todos. The second argument tells asorti to copy the result into stalest, initializing it. The third argument, @val_num_desc, is a sigil describing how to sort the indices. asorti can take a variety of traversal order specifiers in this argument, or a user-provided function. We populate todos with the value of days_in_state in the loop, so this function call sorts the headlines by their staleness, numerically, in descending order.
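Here is the same pattern in miniature (this assumes gawk, since asorti and the traversal-order sigils are gawk extensions):

gawk 'BEGIN {
    days["file taxes"] = 40; days["water plants"] = 3;
    n = asorti(days, stalest, "@val_num_desc");
    for (i = 1; i <= n; i++) print stalest[i]    # "file taxes", then "water plants"
}'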
Comparison and Conclusion
Some of the more obvious differences between awk and Python are the most relevant to their applicability to this sort of problem. Python is a general purpose programming language, which means that one has to set up the boilerplate to read from STDIN for every script. On the other hand, its standard library offers basic data processing functions that are substantially more ergonomic than awk's. Sorting, filtering, and joining strings are all more readable and less error-prone in Python.
Nevertheless, awk is competitive with Python when it comes to its bread and butter: munging data in and out of associative data structures. It's frankly embarrassing how many problems are most naturally solved by an awkward series of intermediate hash maps. For my money, awk beats Python when it comes to this under-appreciated and under-discussed programming paradigm. Its handling of default values is extremely nice to have. How many times have you tried to solve a problem in Python with just standard dictionaries, only to find yourself going back and adding from collections import defaultdict to the top of your file?
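For reference, once that import is in place the counting looks just like the awk version:

from collections import defaultdict

days_in_state = defaultdict(int)    # missing keys default to 0
days_in_state["write blog post"] += 1    # no KeyError, just like awk's days_in_state[h]++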
Ease of deployment is a small point in awk's favor as well. While dependency hell is almost never an issue for Python scripts this short, I don't think I've ever logged onto a server where awk was not available.
Finally, I have barely touched on performance at all. I hope to do a followup post comparing awk's performance with associative arrays to a variety of Python data structures. For what it's worth, the two scripts described in this post were within 50% of each other's (perfectly adequate) performance.