Thursday, December 29, 2011

Bash sort... what the hell is up with tab separators?

UPDATE: I have updated the sort Wiki page to include an example of tab-separated sorting.

In the world of Unix shells, Bash is one of the most common. From a Bash prompt or script you have a whole host of commands at your disposal (most of them standalone utilities rather than parts of Bash itself). These commands run the gamut in what they can do, but there is a small subset that most people find their daily work centered around when programming or hacking out solutions, such as sed, cut, and sort.

Fairly common inputs to these commands are delimiter-separated values such as CSVs and TSVs. Thankfully, almost every command that you would use for these formats allows you to specify the delimiter.

For instance, say we have a TSV file called phonebook that contains the name and number for each contact, with a single tab character between the two:
$ cat phonebook 
Smith, Brett 555-4321
Doe, John 555-1234
Doe, Jane 555-3214
Avery, Cory 555-4132
Fogarty, Suzie 555-2314
With cut you could get just the names if you wanted:
$ cut -f1 phonebook 
Smith, Brett
Doe, John
Doe, Jane
Avery, Cory
Fogarty, Suzie
How did it know what the delimiter was? Luckily, cut's default delimiter is the tab character. What if your data uses a different delimiter, though? Looking at the man page for cut gives us:
$ man cut
...
     -d delim
             Use delim as the field delimiter character instead of the tab character.
...
So, say you wanted just the last name for everyone. Well, you can pipe the output from the first command to a second command to do just that! The only difference is now we are going to specify the delimiter to be a comma for the second command.
$ cut -f1 phonebook | cut -f1 -d ','
Smith
Doe
Doe
Avery
Fogarty
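As a side note, because the comma comes before the tab on every line of this particular file, a single cut with a comma delimiter should give the same result as the two-step pipeline. A minimal sketch, assuming a scratch file (the `phonebook` temp file and the variable names are made up for illustration):

```sh
# Hypothetical scratch copy of part of the phonebook, with a real tab before each number.
phonebook=$(mktemp)
printf '%s\t%s\n' 'Smith, Brett' '555-4321' 'Doe, John' '555-1234' > "$phonebook"

# The two-step pipeline from above: name column first, then last name.
two_step=$(cut -f1 "$phonebook" | cut -f1 -d ',')

# One step: everything before the first comma on each line.
one_step=$(cut -f1 -d ',' "$phonebook")

printf '%s\n' "$one_step"
rm -f "$phonebook"
```

The shortcut only holds while no field ahead of the last name contains a comma, so the two-step version is the safer habit for general TSV data.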
Nice! Now, let's check out sort. First, let's sort our phonebook by last name:
$ sort -k1,1 phonebook 
Avery, Cory 555-4132
Doe, Jane 555-3214
Doe, John 555-1234
Fogarty, Suzie 555-2314
Smith, Brett 555-4321
That works well enough. Now let's sort by phone number:
$ sort -k2,2 phonebook 
Smith, Brett 555-4321
Avery, Cory 555-4132
Doe, Jane 555-3214
Doe, John 555-1234
Fogarty, Suzie 555-2314
Well... that's not right. It sorted by first name instead. Hmmmmm. Looking at the man page we see:
$ man sort
...
       -t, --field-separator=SEP
              use SEP instead of non-blank to blank transition
...
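The default rule in that excerpt can be checked by counting fields on one phonebook line. This is only a rough sketch: it uses awk as a stand-in, since awk's default splitting on runs of blanks approximates (but is not identical to) how sort finds fields, and the variable names are made up for illustration.

```sh
# One phonebook line, rebuilt with a real tab between name and number.
line=$(printf 'Smith, Brett\t555-4321')

# Default awk splitting: any run of spaces or tabs separates fields.
blank_fields=$(printf '%s\n' "$line" | awk '{ print NF }')

# Tab-only splitting: the space after the comma no longer counts.
tab_fields=$(printf '%s\n' "$line" | awk -F '\t' '{ print NF }')

printf 'split on blanks: %s fields, split on tabs: %s fields\n' "$blank_fields" "$tab_fields"
```

With blank splitting the phone number ends up in the third field, which is why asking for the second field picks up the first name.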
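So by default sort starts a new field at every transition from non-blank to blank, and the space after each comma counts. That makes the first name field 2 and the phone number field 3. OK, let's add our trusty tab character: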
$ sort -k2,2 -t '\t' phonebook 
sort: multi-character tab `\\t'
Uhhhhh... multi-character? Inside single quotes the shell passes '\t' through untouched, and sort doesn't interpret it as a tab character either, so it sees a literal '\' followed by 't'. To see it another way:
$ echo -n '\t' | hexdump -c
0000000   \   t                                                        
0000002
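One caveat on that check: `echo -n` and echo's backslash handling differ between shells (some echo built-ins expand `\t` on their own), so the output above is what Bash gives. A more portable sketch of the same check uses printf and counts bytes; the variable names here are made up for illustration.

```sh
# With '%s', printf prints its argument verbatim: a backslash and a 't'.
literal_len=$(printf '%s' '\t' | wc -c | tr -d ' ')

# In the format string itself, printf turns \t into a single tab byte.
tab_len=$(printf '\t' | wc -c | tr -d ' ')

printf 'single-quoted \\t: %s bytes, real tab: %s byte\n' "$literal_len" "$tab_len"
```

The `tr -d ' '` is only there because some wc implementations pad the count with spaces.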
So, how do we set the separator to be a tab character? The Bash Guide for Beginners provides some guidance on that:
3.3.5. ANSI-C quoting

Words in the form "$'STRING'" are treated in a special way. The word expands to a string, with backslash-escaped characters replaced as specified by the ANSI-C standard. Backslash escape sequences can be found in the Bash documentation.
Using our echo example again:
$ echo -n $'\t' | hexdump -c
0000000  \t                                                            
0000001
Yup, one character. Trying out phone number sort again:
$ sort -k2,2 -t $'\t' phonebook 
Doe, John 555-1234
Fogarty, Suzie 555-2314
Doe, Jane 555-3214
Avery, Cory 555-4132
Smith, Brett 555-4321
BINGO! Now we are truly sorting on the second column.

tl;dr

Turns out that inside ordinary quotes the shell hands '\t' to the command as two literal characters, a backslash and a 't', and sort, like echo by default, does not translate that into a tab. echo -e provides a way to interpret such escapes, but sort has no equivalent option. So we must use ANSI-C quoting, $'\t' (or some other means), to give sort a real tab character.
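As one example of "some other means", printf can produce the tab, which avoids relying on ANSI-C quoting in shells that lack it. A minimal sketch, assuming a scratch copy of the phonebook (the temp file and the `tab` and `sorted` variable names are made up for illustration):

```sh
# Hypothetical scratch copy of the phonebook with real tabs between name and number.
phonebook=$(mktemp)
printf '%s\t%s\n' \
    'Smith, Brett' '555-4321' \
    'Doe, John' '555-1234' \
    'Doe, Jane' '555-3214' \
    'Avery, Cory' '555-4132' \
    'Fogarty, Suzie' '555-2314' > "$phonebook"

# printf interprets \t in its format string, so $tab holds one tab character.
tab=$(printf '\t')

# Sort on the second tab-separated field (the phone number).
sorted=$(sort -t "$tab" -k2,2 "$phonebook")
printf '%s\n' "$sorted"

rm -f "$phonebook"
```

Keeping the tab in a variable also makes it easy to reuse the same separator with cut or other tools in the same script.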