Perl / Unix One-liner Cage Match, Part 2



  • In Part 1, I compared Perl’s regexp features with sed and Awk. In this concluding part, I’ll cover examples that make use of Perl’s extensive built-in features and third-party modules.

    Bigger library

    Perl has a much bigger collection of built-in functions than Awk. For command-line usage, I often need tr, join, map and grep. I like that arrays and hashes are distinct data types in Perl, and sorting them is much simpler than in Awk.
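
    For example, counting words and printing them by descending count takes just a hash and a sort block. A quick illustration of my own (not from the article):

    $ s='apple banana apple cherry apple banana'
    $ echo "$s" | perl -lanE '$h{$_}++ for @F;
                    END { say "$_ $h{$_}" for sort { $h{$b} <=> $h{$a} } keys %h }'
    apple 3
    banana 2
    cherry 1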

    Append items to a list

    This problem wants to append columns to rows that have too few, like the b, c and d rows:

    a,10,12,13
    b,20,22
    c,30
    d,33

    This appends zeros to the list by using the /e flag again. This time, the Perl code in the replacement counts the number of commas and subtracts that from 3 to find out how many more columns it needs. The x is the string repetition operator:

    $ perl -pe 's|$|",0" x (3 - tr/,//)|e' ip.txt
    a,10,12,13
    b,20,22,0
    c,30,0,0
    d,33,0,0
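
    A variation of my own (not from the article) does the same with -a field splitting, padding @F up to four fields with list repetition:

    $ perl -F, -lane 'print join ",", @F, ("0") x (4 - @F)' ip.txt
    a,10,12,13
    b,20,22,0
    c,30,0,0
    d,33,0,0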

    Reversing things

    To reverse complement a DNA sequence in a specific field, I need to select part of the string, complement it, and turn it around. Here, I want to work on the third column:

    ABC DEF GATTAG GHK
    ABC DEF GGCGTC GHK
    ABC DEF AATTCC GHK

    I use tr and reverse on the replacement side (with /e again):

    $ perl -pe 's/^(\H+\h+){2}\K\H+/reverse $&=~tr|ATGC|TACG|r/e' test.txt
    ABC DEF CTAATC GHK
    ABC DEF GACGCC GHK
    ABC DEF GGAATT GHK

    Alternatively, I can use -a, which automatically splits on whitespace and puts the result in @F. I work on the third element then output @F again:

    $ perl -lane '$F[2]=reverse $F[2]=~tr/ATGC/TACG/r; print "@F"' test.txt
    ABC DEF CTAATC GHK
    ABC DEF GACGCC GHK
    ABC DEF GGAATT GHK
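
    The /r modifier is what makes this work inside an expression: tr returns a transliterated copy instead of modifying the variable in place. Here’s a standalone check of the complement-and-reverse step (my own snippet, not from the article):

    $ perl -E '$seq = "GGCGTC"; say scalar reverse $seq =~ tr/ATGC/TACG/r'
    GACGCC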

    Sort a CSV row

    How about sorting each row of a CSV file, leaving the header row and the first column untouched? Here are some simple comma-separated values:

    id,h1,h2,h3,h4,h5,h6,h7
    101,zebra,1,papa,4,dog,3,apple
    102,2,yahoo,5,kangaroo,7,ape

    I use -a again, but also -F, to set the comma as the field separator:

    $ perl -F, -lane 'print join ",", $.==1 ? @F : (shift @F, sort @F)' ip.txt
    id,h1,h2,h3,h4,h5,h6,h7
    101,1,3,4,apple,dog,papa,zebra
    102,2,5,7,ape,kangaroo,yahoo

    The $. variable keeps track of the input line number. I use this to skip the first line (the header). On all other lines, I make a list of the first element of @F followed by the sorted list of the remaining elements. Note that the numbers in this example all have the same number of digits; sort compares strings by default, so multi-digit numbers wouldn’t be ordered correctly (see the sketch below).
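
    A numeric-aware variation of my own (not from the article) gives sort a comparison block that compares purely numeric fields with <=> and everything else with cmp. For this particular input the result is the same as above, since all the numbers are single digits:

    $ perl -F, -lane 'print join ",", $.==1 ? @F : (shift @F,
            sort { $a =~ /^\d+$/ && $b =~ /^\d+$/ ? $a <=> $b : $a cmp $b } @F)' ip.txt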

    Insert incremental row and column labels

    Inserting a row and a column of labels into a matrix means generating numbers that increase by a fixed interval:

    2 3 4 1 2 3
    3 4 5 2 4 6
    2 4 0 5 0 7
    0 0 5 6 3 8

    Here, I use map to generate the header:

    $ perl -lane 'print join "\t", "", map {20.00+$_*0.33} 0..$#F if $.==1;
                  print join "\t", 100+(0.33*$i++), @F' matrix.txt
            20      20.33   20.66   20.99   21.32   21.65
    100     2       3       4       1       2       3
    100.33  3       4       5       2       4       6
    100.66  2       4       0       5       0       7
    100.99  0       0       5       6       3       8
    
    # with formatting and alternate way to join print arguments
    $ perl -lane 'BEGIN{$,="\t"; $st=0.33}
                  print "", map {sprintf "%.2f", 20+$_*$st} 0..$#F if $.==1;
                  print sprintf("%.2f", 100+($st*$i++)), @F' matrix.txt
            20.00   20.33   20.66   20.99   21.32   21.65
    100.00  2       3       4       1       2       3
    100.33  3       4       5       2       4       6
    100.66  2       4       0       5       0       7
    100.99  0       0       5       6       3       8

    Using Perl modules

    Apart from built-in functions, standard and CPAN modules come in handy too. Load them with -M and put the import list after an = sign:

    # randomize word list after filtering
    $ s='floor bat to dubious four pact feed'
    $ echo "$s" | perl -MList::Util=shuffle -lanE '
                        say join ":", shuffle grep {/[au]/} @F'
    bat:four:pact:dubious
    
    # remove duplicate elements while retaining input order
    $ s='3,b,a,3,c,d,1,d,c,2,2,2,3,1,b'
    $ echo "$s" | perl -MList::Util=uniq -F, -lanE 'say join ",", uniq @F'
    3,b,a,c,d,1,2
    
    # apply base64 decoding only for a portion of the string
    $ s='123 aGVsbG8gd29ybGQK'
    $ echo "$s" | perl -MMIME::Base64 -ane 'print decode_base64 $F[1]'
    hello world
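
    List::Util has other helpers that fit one-liners well, for instance sum. An extra sketch of my own:

    # sum the whitespace-separated numbers on each line
    $ echo '3 1 4 1 5 9' | perl -MList::Util=sum -lanE 'say sum @F'
    23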

    CPAN

    The Comprehensive Perl Archive Network (CPAN) has a huge collection of modules for various use cases. Here are some examples.

    Extract IPv4 addresses

    The Regexp::Common module has recipes for common things you want to match. Here’s some text with dotted-decimal IP addresses:

    3.5.52.243 555.4.3.1 34242534.23.42.42
    foo 234.233.54.123 baz 4.4.4.3123

    It’s easy to extract the IPv4 addresses:

    $ perl -MRegexp::Common=net -nE 'say $& while /$RE{net}{IPv4}/g' ipv4.txt
    3.5.52.243
    55.4.3.1
    34.23.42.42
    234.233.54.123
    4.4.4.31

    I can match only if the IPv4 address isn’t surrounded by digit characters, so I don’t match in the middle of 34242534.23.42.42:

    $ perl -MRegexp::Common=net -nE '
            say $& while /(?<!\d)$RE{net}{IPv4}(?!\d)/g' ipv4.txt
    3.5.52.243
    234.233.54.123
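
    Regexp::Common covers much more than network patterns. As an extra sketch of my own, the num patterns pull out numbers, signs and all (loading the module without an import list makes every pattern set available):

    $ echo 'speed 4.5 temp -2 height 183' | perl -MRegexp::Common -nE 'say $& while /$RE{num}{real}/g'
    4.5
    -2
    183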

    Real CSV processing

    Earlier I did some simple CSV processing, but to do it properly I can use the Text::CSV_XS module, which takes care of quoting and embedded separators. This one extracts the second field of each row, including the quoted fox,42:

    $ s='eagle,"fox,42",bee,frog\n1,2,3,4'
    
    # note that neither -n nor -p are used here
    $ printf '%b' "$s" | perl -MText::CSV_XS -E 'say $row->[1]
                         while $row = Text::CSV_XS->new->getline(*ARGV)'
    fox,42
    2
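
    Text::CSV_XS can also build rows, re-quoting fields as needed, with combine and string. A small sketch of my own (not from the article) that uppercases the second field of every row:

    $ printf '%b' "$s" | perl -MText::CSV_XS -E '
          $csv = Text::CSV_XS->new;
          while ($row = $csv->getline(*ARGV)) {
              $row->[1] = uc $row->[1];                    # change the field
              $csv->combine(@$row) and say $csv->string;   # re-emit it as CSV
          }'
    eagle,"FOX,42",bee,frog
    1,2,3,4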

    Processing XML

    XML is another format that’s easy to mess up. Many people try to process it with regexps, but that can easily go wrong. Here’s an example file:

    <doc>
        <greeting type="ask">Hi there. How are you?</greeting>
        <greeting type="reply">I am good.</greeting>
        <color>
            <blue>flower</blue>
            <blue>sand stone</blue>
            <light-blue>sky</light-blue>
            <light-blue>water</light-blue>
        </color>
    </doc>

    The xpath command (a Perl program) and xmllint can be used to process XML files:

    $ xpath -e '//blue/text()' sample.xml
    Found 2 nodes in sample.xml:
    -- NODE --
    flower
    -- NODE --
    sand stone
    $ xpath -q -e '//blue/text()' sample.xml
    flower
    sand stone
    
    $ xmllint --xpath '//blue/text()' sample.xml
    flower
    sand stone

    Using the XML::LibXML module will help if you need Perl’s power:

    $ perl -MXML::LibXML -E '
        $ip = XML::LibXML->load_xml(location => $ARGV[0]);
        say $_->to_literal() for $ip->findnodes("//blue")' sample.xml
    flower
    sand stone
    
    $ perl -MXML::LibXML -E '
        $ip = XML::LibXML->load_xml(location => $ARGV[0]);
        say uc $_->to_literal() for $ip->findnodes("//blue")' sample.xml
    FLOWER
    SAND STONE
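
    XPath predicates can also select nodes by attribute. Here’s a variation of my own on the same sample file, picking only the greeting whose type is ask:

    $ perl -MXML::LibXML -E '
        $ip = XML::LibXML->load_xml(location => $ARGV[0]);
        say $_->to_literal() for $ip->findnodes(q{//greeting[@type="ask"]})' sample.xml
    Hi there. How are you?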

    Processing JSON

    JSON files have the same issue. You don’t want to do regexes on this:

    $ s='{"greeting":"hi","marks":[78,62,93],"fruit":"apple"}'

    Various JSON modules, such as Cpanel::JSON::XS, can handle this. For example, pretty printing with the cpanel_json_xs command:

    $ echo "$s" | cpanel_json_xs
    {
       "fruit" : "apple",
       "greeting" : "hi",
       "marks" : [
          78,
          62,
          93
       ]
    }

    And here’s a particular selection:

    $ echo "$s" | perl -MCpanel::JSON::XS -0777 -E '$ip=decode_json <>;
                  say join ":", @{$ip->{marks}}'
    78:62:93
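
    Going the other way works too: encode_json turns the modified data structure back into JSON. A sketch of my own (output key order isn’t guaranteed, since Perl hashes are unordered):

    $ echo "$s" | perl -MCpanel::JSON::XS -0777 -E '$ip=decode_json <>;
                  push @{$ip->{marks}}, 100; say encode_json $ip'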

    Sometimes it’s easier to put the decoding boilerplate in a script (although that’s not really a one-liner anymore). I use a Bash function as a shortcut:

    $ pj() { perl -MCpanel::JSON::XS -0777 -E '$ip=decode_json <>;'"$@" ; }
    
    $ echo "$s" | pj 'say $ip->{fruit}'
    apple
    $ echo "$s" | pj 'say join ":", @{$ip->{marks}}'
    78:62:93

    A non-Perl example of the same thing is jq, but that’s something you have to install separately and might not be available:

    $ echo "$s" | jq '.fruit'
    "apple"
    $ echo "$s" | jq '.marks | join(":")'
    "78:62:93"

    Speed

    Perl is usually slower than specialized tools, but its regexp engine performs better for certain cases of backreferences and quantifiers.

    $ time LC_ALL=C grep -xE '([a-z]..)\1' /usr/share/dict/words > f1
    real    0m0.074s
    
    $ time perl -ne 'print if /^([a-z]..)\1$/' /usr/share/dict/words > f2
    real    0m0.024s
    
    $ time LC_ALL=C grep -xP '([a-z]..)\1' /usr/share/dict/words > f3
    real    0m0.010s

    Perl’s hash implementation performs better than Awk’s associative arrays for a large number of keys. The SCOWL-wl.txt file used below was created using app.aspell.net. words.txt is from /usr/share/dict/words. Mawk is usually faster, but GNU Awk does better in this particular case.

    $ wc -l words.txt SCOWL-wl.txt
      99171 words.txt
     662349 SCOWL-wl.txt
     761520 total
    
    # finding common lines between two files
    # Case 1: shorter file passed as the first argument
    $ time mawk 'NR==FNR{a[$0]; next} $0 in a' words.txt SCOWL-wl.txt > t1
    real    0m0.296s
    $ time gawk 'NR==FNR{a[$0]; next} $0 in a' words.txt SCOWL-wl.txt > t2
    real    0m0.210s
    $ time perl -ne 'if(!$#ARGV){$h{$_}=1; next}
                     print if exists $h{$_}' words.txt SCOWL-wl.txt > t3
    real    0m0.189s
    
    # Case 2: longer file passed as the first argument
    $ time mawk 'NR==FNR{a[$0]; next} $0 in a' SCOWL-wl.txt words.txt > f1
    real    0m0.539s
    $ time gawk 'NR==FNR{a[$0]; next} $0 in a' SCOWL-wl.txt words.txt > f2
    real    0m0.380s
    $ time perl -ne 'if(!$#ARGV){$h{$_}=1; next}
                     print if exists $h{$_}' SCOWL-wl.txt words.txt > f3
    real    0m0.351s

    Other things to read





    https://perl.com/article/perl-one-liners-part-2/
