Perl for Sed-ers


I've generally avoided Perl like the plague: its terse syntax makes for a steep initial learning curve, and the GNU utilities have generally covered my needs anyway. Today, though, I needed to run a multi-line substitution, which anyone who's ever used Sed knows is normally not worth the trouble. I was pleasantly surprised at how easy the task was in Perl, without having to really engage with the language itself!

I had a bunch of PowerShell runbooks, for which I wanted to count the number of lines of code (real code). I had been using my usual faithful tool cloc, but the number of lines it was listing in my runbooks (1,572) seemed lower than I expected. The runbooks did include large, multi-line strings - which were actually code in a different language that my runbooks would be submitting to an API endpoint - and I thought that perhaps cloc was counting each of those as a single line. So I decided to do it the old-fashioned way.

In PowerShell, there are two ways of entering comments. In-line comments start at a # character and continue to the end of the line:

$MyVariable = 5 # assign a value

# this whole line is a comment

Block comments are tagged with <# and continue until the closing tag #>.

<#
   This is a block comment

   I can write anything I want

   until the closing tag
#>
$MyVariable = 6

Okay, sweet. So I need to:

  1. Exclude all block comment...blocks

  2. Exclude all comment-only lines

  3. Exclude all empty lines

The first one is the tough one because the match has to span multiple lines. I would say sed is a poor tool for this job: it works on one line at a time by default, and its regex flavour has no support for lookaheads, which in my opinion are necessary for most multi-line regex matching - i.e., our comment blocks.

What I need to match is <#, then anything except #>, up to the first #> that follows - i.e., a non-greedy match. I recently had to do this for HTML <table> tags, so I still have the example regex saved at regex101.com. We're not matching <table> tags here, though, so our regex in this case is:

<#(?:(?!#>)[\s\S])*?#>

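Again, this doesn't work with Sed because of the lookahead - but it works in Perl just fine. We just need to tell Perl to use \0 (the null character) as the record separator with the -0 switch; since a text file normally contains no null characters, Perl then reads the whole file as a single record, which effectively enables multi-line regexes. The -p switch makes Perl act like sed, and -e lets us pass the substitution on the command line, so we can skip writing an actual Perl script.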

$ perl --help

Usage: perl [switches] [--] [programfile] [arguments]
  -0                specify record separator (\0, if no argument)
  [...]
  -n                assume "while (<>) { ... }" loop around program
  -p                assume loop like -n but print line also, like sed

We'll use the s/<regex>//g command, which replaces every match of our regex with nothing - i.e., deletes it:

$ perl -0pe 's/<#(?:(?!#>)[\s\S])*?#>//g' MyFile.ps1

That gives the result we wanted! Easy.

Note that using the -0 switch means that we can't easily filter the comment-only lines (2, above) and empty lines (3) without writing some very ugly regex, so I'm going to be lazy and just add Sed to the pipeline, to do the job it can do easily:

# Remove comment-only lines
/^\s*#/d

# Remove empty lines
/^\s*$/d

# BONUS: we could just combine the above two - but Sed then needs the -E (extended regexp) switch
/^\s*(#.*)?$/d
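
One caveat worth knowing: \s is a GNU extension in sed, so the expressions above assume GNU sed. As a hedged sketch, the combined filter can be written with the POSIX class [[:space:]] instead and tried on some made-up input:

```shell
# Made-up input: a comment-only line, an empty line, a whitespace-only line,
# and one line of real code with a trailing comment
filtered=$(printf '%s\n' '# this whole line is a comment' '' '   ' '$MyVariable = 5 # assign a value' \
  | sed -E '/^[[:space:]]*(#.*)?$/d')

# Only the line containing real code should survive the filter
printf '%s\n' "$filtered"
```

The trailing comment on the code line is kept, which is fine here because we are counting lines, not characters.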

So, finally, putting that all together, our Perl+Sed filter that keeps only the non-comment, non-blank lines is:

$ perl -0pe 's/<#(?:(?!#>)[\s\S])*?#>//g' MyFile.ps1 | sed -E '/^\s*(#.*)?$/d'

Then we need to add wc to give us the line count:

$ perl -0pe 's/<#(?:(?!#>)[\s\S])*?#>//g' MyFile.ps1 | sed -E '/^\s*(#.*)?$/d' | wc -l

And then finally finally we need to put that into a loop:

for file in *.ps1; do
  perl -0pe 's/<#(?:(?!#>)[\s\S])*?#>//g' "$file" | sed -E '/^\s*(#.*)?$/d' | wc -l
done

Which spits out the number of lines per file. I added up the numbers and the total was 1,572...which was what cloc was telling me in the first place!