With so many programming languages available, it's no longer straightforward to find a language that simultaneously satisfies the needs for personal growth, project utility, and longevity, while still adhering closely to common programming paradigms and styles. It becomes very easy to fall into the trap of bemoaning the failures of Language X without actually providing alternatives. A single claim that "Language Y is better for task A" fails to establish that Language Y is better overall, so a more thorough overview is needed.
I set out to provide a list of language constructs and patterns that makes Perl straightforward, easy, fun, extensible, maintainable, and powerful for a broad range of personal and enterprise projects. Instead, I quickly realized that the individual items would be easier to digest if broken into categories. With the number of examples in each section, this document has increasingly taken on the feel of a tutorial or general introduction.
It is not my intent to write a tutorial because one already exists. Instead, I shall try to relate why learning and using Perl has been fun and painless over the years.
It is not my intent to produce a language comparison chart. See Wikipedia and others for comparisons of features, support, and so forth.
One observation before we get started: Languages grow and evolve (hopefully), and Perl has had decades of use in which to do so. Many newer languages have incorporated some of the methodologies, idioms, and efficiencies below. In times past, many of these approaches were available only in Perl. (I'm not going to track down feature history in dozens of scripting and other programming languages.)
✗ A word of warning: Many articles titled "Why you should learn X" focus on aspects of software development that are entirely unrelated to the language itself or are otherwise not significantly correlated with the strengths or weaknesses of the language and implementation. If you see a ✗ below, it signals a dubious consideration. The item might be unrelated to the language chosen, a language comparison may be irrelevant, or it may be too complicated to make a comparison. In any case, you may ask yourself, "Does this even matter?", and I have provided the information only to rule out any confusion. A ✗ is unlikely to be a reason that I chose Perl, but it may be a reason that I continue to choose Perl.
With that, let us begin...
Perl is a great choice for a scripting and programming language because it is ubiquitous. Perl was first released in 1987, with Perl 5 being ported to many platforms and in solid use since 1994. Like many languages from that era, there have been continual modifications, improvements, and feature additions, but also some stagnation regarding adoption of new major versions. Perl 6 is considered by many to be a "different language" and, given Perl 5's longevity of use in production environments, the standard pressures against rewrites apply. Perl 5 continues to be under active development and is still a good choice for a new programmer.
Perl is compiled. When you write code, you want to know if you've made simple mistakes. Rapid development cannot afford to wait for code to fail in production because you typoed a variable name. Even simple tools like sed will tell you about syntax mistakes, so you should expect the same from your programming language.
1 #!/usr/bin/perl
2
3 $c=5
4 $d++

Let's run it:
$ perl ./gettingstarted.pl
Scalar found where operator expected at ./gettingstarted.pl line 4, near "$d"
(Missing semicolon on previous line?)
syntax error at ./gettingstarted.pl line 4, near "$d"
Execution of ./gettingstarted.pl aborted due to compilation errors.
Note what has happened here: "Execution... aborted due to compilation errors". In this case, the syntax is not parseable; most scripting languages (not all) will catch this type of mistake at startup (if you can't parse the script, you can't run it). Here, however, we have the added benefit of clear instructions to fix the problem, which happen to be correct: "Missing semicolon on previous line?". Yep. Easy fix.
1 #!/usr/bin/perl
2
3 $c=5;
4 $d++;

You can easily force just a syntax check without execution:
$ perl -c ./gettingstarted.pl
./gettingstarted.pl syntax OK
Perl knows your dumb mistakes. To prevent silly typos and other, more complicated expressions that could misbehave, Perl provides strict mode. For quick commandline work, we typically don't worry about strict syntax concerns, but when we develop code for reuse and larger applications, we want to know when we've messed up.
1 #!/usr/bin/perl
2
3 use strict;
4
5 $c=5;
6 $d++;

Perl lets us know that something untoward could happen.
$ perl -c ./gettingstarted.pl
Global symbol "$c" requires explicit package name at ./gettingstarted.pl line 5.
Global symbol "$d" requires explicit package name at ./gettingstarted.pl line 6.
./gettingstarted.pl had compilation errors.
This may seem a little bit strange, but consider copying those two lines of code into a subroutine/function definition and the usefulness of this error will be more clear: Perl knows that these variables lack a defined scope. In a subroutine, it wouldn't know whether to create local copies, create an as-yet-undefined global variable, or overwrite an existing global variable. That's bad code. Perl wants you to write maintainable code without surprises.
1 #!/usr/bin/perl
2
3 use strict;
4
5 my $c=5;
6 my $d;
7 $d++;
Perl tells you about itself. One of the challenges with modern languages is the large number of builtin and extensible features. This makes it very easy to create confusing code:
1 #!/usr/bin/perl
2
3 use strict;
4
5 sub rand { return 1 }
6
7 my $c=rand();
Despite my very poor random number generator, this code is strictly syntactically correct:
$ perl -c ./gettingstarted.pl
./gettingstarted.pl syntax OK
On the other hand, Perl gives us warnings about dubious constructions and, as most programmers know, there are tons. Again, we generally don't bother including this in our one-liners, but it's essential for our maintainable code.
1 #!/usr/bin/perl
2
3 use strict; # always use this
4 use warnings; # always use this
5
6 sub rand { return 1 }
7
8 my $c=rand();
Perl tells us we're a doofus.
Ambiguous call resolved as CORE::rand(), qualify as such or use & at ... line 8.
Silly us, there was already a function called "rand". Now we know! It's a good thing it told us now, instead of three months later when someone on our team decided to use the built-in rand() and spent hours trying to ascertain why their numbers weren't random.
The usefulness of this for long-term maintainability cannot be overstated. Many scripting languages will happily let you declare conflicting and ambiguous names. Myself, I prefer not to rely on unit tests and code reviews just to discover that I've chosen a name that conflicts with something buried deep inside the language specification that no one ever uses anyway (but one person will join the team later and use it to "show off their skills"). Perl helps us prevent future mistakes. (If you have no choice, there's a way to distinguish between our rand() and Perl's rand(), but that's an exercise for the reader.)
✗Perl uses common syntax. Like many languages, Perl looks a lot like C and Java: blocks are denoted with curly braces; semicolons terminate statements; function calls look the same; single and double quotes are used for strings (though all languages have caveats here); and there are the print/sprintf operators and the arrow and backslash operators for de/referencing. Loops, conditionals, return statements... all the same. Things that don't look like C: the comment character, function prototypes, and the declaration of complex data structures.
Perl syntax helps you. In compiled, strongly typed languages, the compiler catches mistakes you make using more complicated data structures. Your hash nested in an array nested in a hash will complain when you try to treat it like a hash of arrays of scalars. In dynamically typed languages, no such limitations apply and you're left to fend for yourself; you have to use the right interfaces for your objects or you'll find yourself in a mess of runtime errors. This is not a new problem. Hungarian notation has been used for decades to provide reminders in code and database schemæ. In many languages, coders find no recourse except to use excessively lengthy variable names as reminders of proper object use.
But let's be honest here: a great deal of code gets written by copying and pasting. (You can admit it.) We look at what's currently written, take the existing code that uses the objects, and modify it to access what we need. Unfortunately, that's where the trouble begins; even if we have working source in front of us, the type of a variable may be unclear from context. Moreover, complicated logic will often use variables to access an object hierarchy, making the types in the hierarchy harder to decipher.
Take something simple like "record". A record could be a string representing a document, such as the body of email. It could be an array of string items, such as a record of items on a service quote. It could also be a collection of fields and values, like a patient overview or shipping information. How are we to know without digging through the code, investigating unit tests, reading the class documentation (if it exists) or definitions? One method is somewhere between tab-expansion heaven and hell, where we hope our editor can fill in the blanks, but this often fails or is missing outside statically-typed languages. Another is to expand Hungarian notation to something "more descriptive" (that means 'unbounded' in practice): email_as_string, itemsList, or (well how do you say hash or dictionary without putting that in the name...?) shipping_key_values/shippingFields(but that's just the fields)/shippingForm(but that's an object)... and Babbage spins in his grave.
Let's look at Perl.
1 my $record="Boring email";
2 my @record=("Item 1", "Item 2");
3 my %record=(name=>'Brian', phone=>"Free Saturday night?");
...
951 print $record;
952 foreach my $item (@record) { ... };
953 print Dumper(\%record);
Those little squiggly things about which you'll find much complaining? They help you write and extend code more quickly. Instead of digging around determining the basic data type of a variable, you can use a sigil (literally a seal or "true name") to communicate meaning to yourself. Not only that, but the compiler will complain if you try to refer to nonexistent variables (if you declared @record but no scalar $record, an attempt to use $record on its own will fail under strict).
Nothing prevents you from additional description; these all work: @sortedItems, @shippingRecords, %shippingRecord, $singleMessage, $decryptedMessage. You are free to be as verbose as necessary while you benefit from this simple way to help clarify data type inside a dynamic language.
(Feel free to skip this paragraph). Even in the Perl community there is some "anti-sigil sentiment". Of particular note, there is no fixed style requirement for handling "double sigils", though increasingly tooling has been added to better support arrow-dereferencing syntax. Because certain functions operate on arrays or hashes, not on references to those objects, it has traditionally been required to use a prefixed sigil approach in some cases. For example, push(@$arrayref,...) and in cases of constructing nested data structures. Later versions of Perl have added postfix dereferencing to handle such a case with the arrow operator, via push($arrayref->@*,...) but this may be experimental on your version of Perl. Having written and maintained Perl projects over the years, I have increasingly moved away from the arrow-dereferencing style except in very rare cases; it's much harder to read because it leaves the coder "waiting until the end" to figure out what data type is intended. This is, of course, an opinion, and I hope to include an example in Perl::Critic below with more details. In any case, in most examples below sigils are used in a circumfix manner, namely via @$arrayref or @$_ or @{...} and similar constructs.
Perl syntax is consistent. To further support you in understanding variables, the way you use scalars, arrays, and hashes, is distinguishable. Nothing prevents you from creating a hash where the keys are integers, but you'll be able to tell at a glance that the variable is not an array.
12 my @A=(0,1,2,3);
13 my %h=(1=>'one',2=>'two',3=>'three');
...
95 foreach my $x (qw/1 3/) {
96 print $A[$x],' versus ',$h{$x},"\n";
97 }
I'm trying to get a scalar value for my print statement, so I start with a dollar sign, the name of the variable, and either an array index via [index] or a hash lookup via {key}. Arrays are always brackety and hashes are always bracey. In the above example, the initializations both use parentheses, and there's a reason for that, but if you want to nest objects you'll see that the [] and {} are used in declarations as well:
my @twodim=(
[1,2],
[3,4],
);
my @records=(
{name=>'Brian', phone=>'Call me!'},
{name=>'Sally', phone=>'Call me too!'},
);
When you return to your code five months later and see $twodim[$i][$j] you'll know it's meant to be an array of arrays. When you see $records[$i]{$k} you'll know it's an array containing hashes. You'll also know that $i and $j are probably integers and $k could be a string, and you won't even have to track down where those are declared or where the objects themselves are initialized. But of course you can still break your own contracts if you try:
print $records[$i][$k],"\n";
Not an ARRAY reference at ... line 733.
I won't name any names, but having variable[thing] is confusing in many languages because you're relying on the good style of someone (maybe even yourself) to name variable and thing in such a way as to signal what manner of object dereference is intended. That's... fairly substandard for maintainability.
Perl can handle your style. Here are some styles that work in Perl: Tabs, spaces, visual indents, operators before line continuation, operators after (implied) line continuation, preconditions, postconditions, verbal Booleans, symbol Booleans, terminal commas, no terminal commas, .... "More than one way to do it" isn't about creating confusion; it's primarily about allowing you to establish the style for your application.
This says Hi six times:
1 my $a=5;
2 if( $a == 5 ) { print "Hi A\n" }
3 if($a == 5) {
4 print "Hi B\n"; # tab, versus spaces in above examples
5 }
6 print "Hi C\n" if($a==5);
7 $a==5 and print "Hi D\n";
8 $a==5 && print "Hi E\n";
9 print "Hi F\n" unless $a != 5;
Line continuation works the way you want. You've managed to nest so deep you only have 33 characters...
1 my $cost=complicatedThing(1,2,3)   # Ran out of room, and
2   +moreComplicatedThing(3,4,5);    # this line isn't just some function hanging around; that + operator tells me something is happening.
3 my $url="$protocol://$server/".    # Or I prefer to put my operator at the end of the line and
4   "$target?q=$query";              # visually indent.
Review the @twodim and @records examples above. See those tricky little commas at the very end? They make copy-pasting and adding a new record very easy, and you don't have to worry about an extra line change in your commit. Some languages support this; others not so much. (It's one of my biggest peeves in writing static JSON.)
Perl knows you need useful documentation. The Perl documentation is, in all honesty, the best of any programming language I've ever used. Of course, that statement alone doesn't tell you anything (like many articles online), so I'll try to tell you why: it is available, broad, deep, and searchable; it provides examples, sets the example, and includes tutorials.
Perl documentation is available with no waiting for Internet delays and search engine failures. A simple man warnings or perldoc Date::Calc (man Date::Calc also works) gives you what you want. If you've forgotten the outputs of a built-in function, it's just as easy:
$ perldoc -f localtime
localtime EXPR
localtime
Converts a time as returned by the time function to a 9-element
list with the time analyzed for the local time zone. Typically
used as follows:
# 0 1 2 3 4 5 6 7 8
($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) =
localtime(time);
All list elements are numeric and...
You haven't used the Base64 module in forever and have forgotten...
$ man MIME::Base64
MIME::Base64(3p) Perl Programmers Reference Guide MIME::Base64(3p)
NAME
MIME::Base64 - Encoding and decoding of base64 strings
SYNOPSIS
use MIME::Base64;
$encoded = encode_base64('Aladdin:open sesame');
$decoded = decode_base64($encoded);
...
You don't know where to start to turn off buffered output...
$ perldoc -q buffer
Found in /usr/libdata/perl5/pod/perlfaq5.pod
    How do I flush/unbuffer an output filehandle?  Why must I do this?
...

That option accepts a regular expression, so searching the FAQ is fairly powerful. Standard search for man pages applies, and there are other tools available for searching module documentation. It's certainly online, but, to be transparent, I rarely use the online documentation (and any associated web formatting issues) because everything I need is available locally through `man`.
There are tutorials for more involved aspects of Perl programming available in man pages and at perldoc.
You can perform your own comparisons and I cannot anticipate the type of documentation you usually need. Here's array splicing for C++, Javascript, and Perl. Here's JSON de/serialization for Java (gson), Perl, and Python.
Perl wants you to share solutions. The Comprehensive Perl Archive Network has been available online since 1995 (compare to your favorite programming language's package repository), and many of the most useful, frequently-used modules are summarized in perlmodlib (there are so many things in there that I have an endless supply of potential improvements for my applications). The `cpan` command lets you install packages system-wide (got root?) or in a user-local library path. Most modules that I've installed have been through OS packages, as the modules are ported to most operating systems. Module details can be found at CPAN and can be searched through metacpan.
Online Q&A environments for Perl users have existed for decades: Perl Mongers (1998), PerlMonks (since at least 1999), and, of course, all those traditional, stable technologies like mailing lists and newsgroups.
✗✗✗Perl allows iterative and agile development. Well of course it does; it's a programming language. In case you missed it, the Agile Manifesto says nothing about properties of programming languages that make them "agile" or "not agile". Programming "agility" is about how you structure and organize your code, so you're only limited by something that lacks basic structural concepts or otherwise prevents you from making your own choice about the overall organization of your project. (For example, if you're required to put a structure in a separate file, that may be a waste of time). But if you like B.S., I'll give it a try.
✗ Individuals and interactions. Perl is easy to write, as seen above and below. It's easy to read, and the language removes ambiguity which helps you pick up where you left off. The syntax is fairly consistent and you are free to simplify with functions, modules, representative variable names, and descriptive data structures. This improves code sharing and accelerates code reviews, and long-term maintainability remains high.
✗ Working software. Creating a prototype is fast; you don't need a lot of structure. Extending to a script that reads from standard input and writes to standard output is very, very simple (TODO to be added below). When you're ready to divide to multiple output files, that is easy. Reading from input files follows immediately (with equivalent syntax). Modules allow you to quickly build networked/databased/web prototypes, and you can move to more powerful modules as needed later. It's straightforward to separate functions into modules, to move modules into separate files, to include helper scripts directly or to include full-fledged libraries as modules. Testing is provided by a standard library, and you can easily get coverage and style checking. Performance can be checked with benchmarking tools or, as needed, with profilers.
✗ Responding to change. When "there's more than one way to do things", individuals and teams can quickly respond to needs for change because a "better way" is more likely to already exist. Your hacky little script can easily turn into a production application, often with the addition of an existing framework and testing. If you can write well-structured code in Language X, then you can write well-structured code in Perl.
The basics are easy. Okay, you're here...
1 use strict;
2 use warnings;
3
4 my $stringa="Newlines go in double quotes.\n";
5 my $stringb='Single quotes are literal.\n';
6 my %hash=(
7 scalars=>"Scalars work in double quotes: $stringb $stringa",
8 concat =>'Or you can concatenate: '.$stringb.' '.$stringa,
9 );
10
11 foreach my $k (keys %hash) {
12 print $hash{$k};
13 print 'You can also do this: ',$hash{$k};
14 print "And this: $hash{$k}\n";
15 }
If you don't see it:
$ perl ./thingswork.pl
Or you can concatenate: Single quotes are literal.\n Newlines go in double quotes.
You can also do this: Or you can concatenate: Single quotes are literal.\n Newlines go in double quotes.
And this: Or you can concatenate: Single quotes are literal.\n Newlines go in double quotes.
Scalars work in double quotes: Single quotes are literal.\n Newlines go in double quotes.
You can also do this: Scalars work in double quotes: Single quotes are literal.\n Newlines go in double quotes.
And this: Scalars work in double quotes: Single quotes are literal.\n Newlines go in double quotes.
Honestly, though, go read perlintro, perlsyn, and perltrap. You'll be up in a few hours.
✗ Truthiness is related to existence and definition.
# These things are false:
# undef
# 0
# empty arrays
# empty hashes
# the empty string ''
# results from a return;
# results from a return undef;
#
# These things are true:
# initialized variables
# 1
# nonzero numbers
# nonempty arrays
# nonempty hashes
# nonempty string, including ' '
# results from a return $var; when $var is not undef
Types behave. This is important for a dynamically typed language because it improves readability and maintainability of your code. Extra handlers and casting calls have a tendency to obfuscate the algorithmic steps being applied to your data, reducing extensibility and increasing the risk of subsequent runtime failures. (You are not freed from the requirements of input validation and security). Here's a simple example; when you read a decimal representation (as a string), it's a decimal (this is not true in all scripting languages, including some very popular ones):
1 my $x=" 5.3 ";
2 my $y=$x+2.7; # Yes, 5.3+2.7 is 8, so y is 8.
3 my $z=$x%3; # z is the integer 2
4 my $zz=$z.$z; # zz is the string "22"
5 if($x>4){$z=5} # z is now the integer 5
That is to say: "If it looks like a number and behaves like a number...". "If it looks like a string and behaves like a string...". Nothing prevents you from explicit casting and input validation and checking, nor from throwing errors on validation failures, but neither are you stuck dealing with a language that can't add a few numbers simply because you read them from a file (as strings). Safe? Not in all circumstances. Fast? You bet your class.
When you find yourself with a collection of logfiles and you want to parse the latencies relative to the time of day because your server is suddenly slow, having numbers behave saves you a tonne of time.
Avoid initialization clutter. Speaking of reading files and input, it's fairly common to see loops producing new values in your data structures. A very, very, very common use case is to count things.
1 #!/usr/bin/perl
2
3 use strict;
4 use warnings;
5 use Data::Dumper;
6
7 # Count the number of occurrences of each character. Also store the offsets to each.
8 # (You can get the count from the number of offsets observed, but this is an example.)
9
10 my %hist=();
11 my $off=0;
12 while(<>) {
13 foreach my $c (split(//,$_)) {
14 $hist{char}{$c}++;
15 push @{$hist{offset}{$c}},$off;
16 $hist{count}++;
17 $off++;
18 }
19 }
20
21 print Dumper(\%hist);
This takes about 15sec on my slow system for 235k words from the dictionary:
$ cat /usr/share/dict/words | perl ./countcharacters.pl
$VAR1 = {
'offset' => {
'h' => [
105,
456,
556,
...
],
'o' => [
40,
47,
53,
...
],
...
},
'count' => 2487232,
'char' => {
"\n" => 234979,
'S' => 2274,
'W' => 321,
...
}
};
This demonstrates a number of very powerful things, and you should have a few questions: Where did the nested hashes and arrays come from? Who set the counts to zero before the first ++? Perl autovivifies the intermediate structures and treats the undefined starting value as zero, so no initialization clutter is needed.
Here's what it looks like when you have to check manually. It becomes so complicated, you might as well create a helper function for each character addition. The runtime goes up 16%. (In some other popular scripting languages, a get-or-default is 60% slower than a +=, and an if-defined-else is 30% slower.)
1 #!/usr/bin/perl
2
3 use strict;
4 use warnings;
5 use Data::Dumper;
6
7 # Count the number of occurrences of each character. Also store the offsets to each.
8 # (You can get the count from the number of offsets observed, but this is an example.)
9
10 sub addToHist {
11 my ($histref,$char,$off)=@_;
12 if(!defined($$histref{count})) { $$histref{count}=0 }
13 if(!defined($$histref{char})) { %{$$histref{char}}=() }
14 if(!defined($$histref{char}{$char})) { $$histref{char}{$char}=0 }
15 if(!defined($$histref{offset})) { %{$$histref{offset}}=() }
16 if(!defined($$histref{offset}{$char})) { @{$$histref{offset}{$char}}=() }
17 $$histref{char}{$char}++;
18 push @{$$histref{offset}{$char}},$off;
19 $$histref{count}++;
20 }
21
22 my %hist=();
23 my $off=0;
24 while(<>) {
25 foreach my $c (split(//,$_)) {
26 addToHist(\%hist,$c,$off);
27 $off++;
28 }
29 }
30
31 print Dumper(\%hist);
Haha. Yeah, that's "better". When your eyes have glazed over, try doing it "the right way" instead, i.e., "the Perl way". Which would you rather code review? Which would you rather maintain?
Here's a case that, sadly, isn't so seamless. It's not enough to just read $hash{key}, since autovivification of the final level requires an assignment (or other lvalue use). Without one, the manual defined-check is needed, but it's a small price to pay.
1 #!/usr/bin/perl
2
3 use strict;
4 use warnings;
5 use Data::Dumper;
6
7 # Construct a "spell checking" tree.
8 # If the terminal node has "\n" as a key, it's a valid word.
9
10 my %dict=();
11 while(<>) {
12 my $node=\%dict;
13 foreach my $c (split(//,$_)) {
14 if(!defined($$node{$c})) { $$node{$c}={} }
15 $node=$$node{$c};
16 }
17 }
18
19 my $dumper=Data::Dumper->new([\%dict],['dictionary'])->Indent(1);
20 print $dumper->Dump();
That's right, it takes eight lines to build a hash-based spell-checking structure from a list of input words. Sample results follow. Examine the nesting levels carefully and you will find the three words. It's left as an exercise to split a trial word and validate that the hash contains the word.
$ head -n69579 /usr/share/dict/words | tail -n3 | perl ./spellcheqr.pl
$dictionary = {
'f' => {
'i' => {
'g' => {
'a' => {
'r' => {
'o' => {
"\n" => {}
}
}
},
"\n" => {}
},
'f' => {
't' => {
'y' => {
'f' => {
'o' => {
'l' => {
'd' => {
"\n" => {}
}
}
}
}
}
}
}
}
}
};
None means none. Particularly useful for clean code is avoiding exceptional cases that need specialized handlers. Every occurrence of an "if(exceptional case) { prevent crashing }" followed by "else { do what we actually want to do }" is an admission that the language requires you to sprinkle your code with random safety points simply because the language doesn't handle them for you. There are certainly cases in programming logic where you want "if(exceptional case) { do something else }", but having to insert that bit of disclaimer with every common operation clutters your code, obfuscates the logical intent, and makes it harder to read.
One real-world phenomenon is the logical concept of "nothing". In Perl, this value is undef. Considering a function that returns "the list of remaining things", it is a reasonable request to "add the list of remaining things to the array". Such a list could be empty, however, with the logically-acceptable meaning "there are no things remaining". Here's the wrong way:
1 sub remainingThings { return }
2
3 my @A=();
4 my ($x,$y);
5
6 # The WRONG WAY
7 my @things=remainingThings();
8 if(@things) {
9 foreach my $thing (@things) {
10 push @A,$thing;
11 }
12 }
13
14 print "|A|=",1+$#A,"\n";
The structure of the code itself, lines 7 through 12, should scream out "this is way too complicated": a temporary variable (with no subsequent use) has been created simply to extend an array. The code also requires multiple levels of indentation. All that's being done is adding some elements to an array! ARGH.
In Perl, "none" is interpreted automatically. Namely, "for each thing in 'none'" means "do nothing":
1 sub remainingThings { return }
2
3 my @A=();
4 my ($x,$y);
5
6 # The Right Way
7 foreach my $thing (remainingThings()) {
8 push @A,$thing;
9 }
10
11 print "|A|=",1+$#A,"\n";
Unsurprisingly, the array shows size zero on line 11, but there's no crashing nor unhandled confusion. Of course the function can return (1,2,3) and the same code works.
Let's consider a more involved example, where a purchasing contract is handled in multiple steps: reservation, packing, back-ordering, quality control. Any step could result in a failure or partial failure: it may not be possible to reserve the requested quantity; packing may find that the warehouse actually doesn't have the reserved quantity (oops); back-ordering and quality control are dependent on failures in the preceding steps. With handler functions that mostly just spit out messages for demonstration, and the QC report saved until the end, we might expect this type of output:
vegetables
Backordered 3
pizza
Packed 1 of 4000in^2 into crate.
Invoiced 1.
flyer
Packed 1 of 0in^3 into crate.
Invoiced 1.
QC
 [ ] inspect 1 of 4000in^2
 [ ] inspect 1 of 0in^3
Let's make that happen the Wrong Way:
01 my @contract=(
02 {item=>'vegetables', quantity=>3},
03 {item=>'pizza', quantity=>1},
04 {item=>'flyer', quantity=>1},
05 );
06
07 # The WRONG WAY
08 foreach my $item (@contract) {
09 print $$item{item},"\n";
10 my @reservation=reserve($item);
11 if(@reservation) {
12 my ($qty,$value,$unit)=@reservation;
13 my $missing=handlePacking(\%crate,$qty,$value,$unit);
14 if($missing) {
15 if($missing<$$item{quantity}) {
16 handleInvoice($packing{items},$$item{quantity}-$missing);
17 }
18 handleBackorder($packing{backorder},$missing);
19 }
20 else { # nothing missing
21 handleInvoice($packing{items},$$item{quantity}-$missing);
22 handleQC(\@qcreport,$qty,$value,$unit);
23 }
24 } # reservation
25 else { # reservation failed
26 handleBackorder($packing{backorder},$$item{quantity});
27 }
28 } # item
29 print "QC\n";
30 print join("\n",map {" $_"} @qcreport),"\n";
As a brief walkthrough, reservations may return null on failures (line 10), which must be checked; reservation failures are handled (way down) on line 25. Packing may find missing items in the warehouse (line 13), which must be checked (line 14); if nothing is missing (line 20), update the invoice (line 21) and the QC report (line 22). If something is missing, but not everything (line 15), update the invoice (line 16). In either case, if something was missing, create a backorder (line 18).
Lost? Unsurprising. Imagine additional logic to handle missing items in the warehouse, sending external messages about events, and so forth. That code should make you glaze over fairly quickly, even though it reflects the "correct" order of logical thinking that goes into the process. If you can't handle null values in an elegant fashion, the above code will ensure you spend a good deal of time refactoring for future extensions and introducing a number of helper functions to "try to make the program easier to follow", while still staying "close to the logic of the process".
Try this instead. (Yes, same result/behavior/output):
01 my @contract=(
02 {item=>'vegetables', quantity=>3},
03 {item=>'pizza', quantity=>1},
04 {item=>'flyer', quantity=>1},
05 );
06
07 # The right way
08 foreach my $item (@contract) {
09 print $$item{item},"\n";
10 my ($qty,$value,$unit)=reserve($item);
11 my $missing=handlePacking(\%crate,$qty,$value,$unit) // $$item{quantity};
12 handleInvoice($packing{items},$$item{quantity}-$missing);
13 handleBackorder($packing{backorder},$missing);
14 handleQC(\@qcreport,$qty,$value,$unit);
15 } # item
16 print "QC\n";
17 print join("\n",map {" $_"} @qcreport),"\n";
As a brief walkthrough, reservations may return null (line 10), but both handlePacking and handleQC ignore null values; that is, they validate their own inputs. Reservation failures or warehouse-location failures return null from handlePacking, in which case the number missing defaults to the quantity requested in the contract (line 11, the // quantity construct). Thanks to that default, $missing can never be null on line 12, where the subtraction would otherwise fail. The remaining functions validate their inputs and ignore null/zero quantities. Incidentally, I used the exact same handle* functions in both examples, because they should always validate their inputs; there's no "hidden savings" in the "wrong way" above.
Friendly reader, you are certainly entitled to your own opinions. Myself, I find the "right way" much more logical, much easier to read, hence much easier to maintain.
The complex is easy. Since Perl wants you to share solutions, most "complicated" things are fairly easy, as the following topics show.
A list is a list. Perl uses context to establish behavior, much like a natural language where 'number' is indicated by word choice and pairing (goose:geese::fish:fish). Operations on lists apply equally to anything that looks like a list: a static list, a real array, the keys of a hash, a slice of an array, a slice of a hash, an operator that returns a list, or a function that returns a list. This drastically reduces the variety of method calls you would have to remember if each of those cases were handled by a different function or a different object. These all work:
1 # Explicit declarations
2 my @e=(1,2,3);
3 my %f=(one=>1,two=>2,four=>4,eight=>8);
4
5 # Static list
6 foreach my $c ('a','b') { print "c=$c\n" }
7
8 # Quoted words (spaces are separators)
9 foreach my $d (qw/one two three/,"forty five") { print "d=$d\n" }
10
11 # Arrays and keys
12 foreach my $e (@e) { print "e=$e\n" }
13 foreach my $k (keys %f) { print "f{$k}=$f{$k}\n" }
14
15 # Array slice
16 foreach my $e (@e[0,2]) { print "e=$e\n" }
17
18 # Hash values slice
19 foreach my $v (@f{qw/one eight/}) { print "v=$v\n" }
20
21 # Hash keyvalue slice
22 my %fnew=%f{qw/one eight/};
23 foreach my $k (keys %fnew) { print "fnew{$k}=$fnew{$k}\n" }
And the results
c=a
c=b
d=one
d=two
d=three
d=forty five
e=1
e=2
e=3
f{two}=2
f{eight}=8
f{one}=1
f{four}=4
e=1
e=3
v=1
v=8
fnew{eight}=8
fnew{one}=1
Regular expressions can also return lists of matches in certain contexts (see below) and assignments to lists behave:
1 ($x,$y,$z)=($h{i},$h{j},$h{k})=qw/one two three/;
2 @e=(); @e[1,3,4]=(1,2,3);
3 print "x=$x, y=$y, z=$z\n";
4 print $dumph->Dump(),"\n";
5 print "e=@e (note the undef at 0 and 2)\n";
Which gives, as expected,
x=one, y=two, z=three
%h = ('k' => 'three','j' => 'two','i' => 'one');
e= 1 2 3 (note the undef at 0 and 2)
Of utmost importance, it's possible to pass an array as a list of items to a function. Reread carefully; an array need not be passed as an array reference:
1 sub minReferences {
2 my ($arrayref)=@_;
3 my $min;
4 foreach my $x (@$arrayref) {
5 $min//=$x;
6 if($x<$min) { $min=$x } }
7 return $min;
8 }
9
10 sub minItems {
11 my (@array)=@_;
12 my $min;
13 foreach my $x (@array) {
14 $min//=$x;
15 if($x<$min) { $min=$x } }
16 return $min;
17 }
18
19 my @a=(5,4,3,2,1);
20 print minReferences(\@a),"\n";
21 print minItems(@a),"\n";
22
23 my ($m,$n)=(5,7);
24 @a=($m,$n);
25 print minReferences(\@a),"\n";
26 print minReferences([$m,$n]),"\n";
27 print minItems($m,$n),"\n";
28 print minItems($m,$n,12,3),"\n";
The first form, minReferences, requires a reference to an array; the array is, of course, modifiable inside the function. The second form permits passing an arbitrary number of items, and list assignment automatically gathers all remaining items into the final array named in the assignment (@array in this case). When working with individual items, it's convenient to pass them individually without first creating a temporary array, so the minItems approach is preferred. This is particularly important in cases where you need to "unwrap" two or more arrays:
1 my %prices=(
2 drinks=>[0.99,1.50,4],
3 candy =>[0.25,1.25],
4 # ...
5 );
6 print minItems(@{$prices{drinks}},@{$prices{candy}}),"\n";
In some scripting languages, this can only be facilitated by passing each array (as reference) sequentially and tracking the overall minimum.
Ranges are intuitive. When counting whole numbers of items, such as days of the month or the number of people in the room, we want our language to illuminate reality. January has days one through thirty-one. A dozen doughnuts can be counted as they enter the box, one to twelve. So also should you easily be able to read code:
1 foreach my $i (1..6) { print "$i " }
1 2 3 4 5 6
Developers are adamant about code readability. Well, let me tell you: I don't go into the shoppe and ask for "thirteen minus one doughnuts", and the remaining days in the month are ($today..$daysInMonth), which is specifically "say everything in this list in turn until there's nothing left to say". That is much more reasonable than a language that handles ranges as "start with today's number, say it aloud, increase to the next number, ask yourself if you now have a number that you don't want and, if so, stop, otherwise say it aloud and repeat the process". That approach is very much a "Whoops, I fell off a cliff".
For arrays, this makes code Very Obvious.
1 foreach my $i (0..$#A) { ... }
We're familiar with arrays being indexed at zero, in almost all languages at that. The right-hand side is where it gets less obvious: Do we put the last index for the array? The length of the array minus one? What if the array isn't zero-indexed? Does it become the length of the array minus something else? Logically I want one item in my (index) list for each item in the array, and it shouldn't take range operators and funky math to get there. If you still don't believe it, think about the meaning of (1..4,5..9) and basic principles of subsets of the natural numbers (these are not indicators for ranges of real numbers, after all).
Of course, an intuitive range operator should behave intuitively.
1 foreach my $c ('a'..'l') { print "$c " }
gives the result
a b c d e f g h i j k l
"Great!", I hear you say, "since characters are just integers in disguise anyway". Indeed, so this should work too, right?
1 foreach my $c ("ef".."gh") { print "$c " }; print "\n";
2 foreach my $c ("EF".."GH") { print "$c " }; print "\n";
Guess what? It does!
ef eg eh ei ej ek el em en eo ep eq er es et eu ev ew ex ey ez fa fb fc fd fe ff fg fh fi fj fk fl fm fn fo fp fq fr fs ft fu fv fw fx fy fz ga gb gc gd ge gf gg gh EF EG EH EI EJ EK EL EM EN EO EP EQ ER ES ET EU EV EW EX EY EZ FA FB FC FD FE FF FG FH FI FJ FK FL FM FN FO FP FQ FR FS FT FU FV FW FX FY FZ GA GB GC GD GE GF GG GH
As an exercise to the reader, this behaves:
1 foreach my $c ("a".."bc") { print "$c " }
There's... only one way to do it? A common statement about Perl is "there's more than one way to do it". (✗) Of course, this is true of every language if you're willing to implement the "other ways" yourself, but languages often claim features that they ultimately don't deliver, and a claim of "one true way" may not hold up in practice. For example, here's a fairly common problem in theoretical computer science: sorting. If you care about the theory, Perl now uses a stable merge sort; nothing particularly special there. In many strongly typed, object-oriented languages you're expected to implement an object comparator, which is subsequently passed to a general sort utility. Some languages let you pass a function that calculates the value of the items being sorted (but still returns the original items). Many achieve a Schwartzian transform only in several lines using temporary variables.
In Perl, no matter how complex the comparison needed or the objects being compared, there's only one thing you need: sort:
01 @a=(5,4,3,2);
02
03 @a=sort @a;
04 print "3. @a\n";
05
06 @a=sort {$b<=>$a} @a;
07 print "6. @a\n";
08
09 @a=sort {abs(3-$a)<=>abs(3-$b) || $a<=>$b} @a;
10 print "9. @a\n";
11
12 @a=(
13 [8,"eight"],
14 [1,"one"],
15 [5,"five"],
16 );
17
18 @a=sort {$$a[0]<=>$$b[0]} @a;
19 print '18. ';
20 foreach my $i (0..$#a) { print $a[$i][1],' '};
21 print "\n";
22
23 @a=sort {length($$b[1])<=>length($$a[1])} @a;
24 print '23. ';
25 foreach my $i (0..$#a) { print $a[$i][1],' '};
26 print "\n";
27
28 %h=(two=>2,six=>6,nine=>9);
29 print '30. ';
30 foreach my $k (sort keys %h) { print $k,' ' }
31 print "\n";
32
33 print '34. ';
34 foreach my $k (sort {$b cmp $a} keys %h) { print $k,' ' }
35 print "\n";
36
37 print '38. ';
38 foreach my $k (sort {$h{$a}<=>$h{$b}} keys %h) { print $k,' ' }
39 print "\n";
That's quite a bit; consider the output and the comments that follow. The printed line numbers refer to the line containing the sort command executed.
3. 2 3 4 5
6. 5 4 3 2
9. 3 2 4 5
18. one five eight
23. eight five one
30. nine six two
34. two six nine
38. two six nine
Line 3, sort values; the default comparison is string-based (which happens to coincide with numeric order for these single digits). Line 6, sort descending. This demonstrates the variables $a and $b used for comparison. The <=> operator compares the left and right numerically, returning the common -1/0/1 result for lesser/equal/greater. The sort function runs its merge sort against the values of the list, using the comparison in the execution block as needed to determine which elements to reorder. Any execution block works, though the really fun examples have been omitted here. (In favor of honesty, I'll point out the hidden issue that those $a,$b variables are globals, so in rare cases you get unexpected results. See the man page.)
Line 9, sort based on distance from 3; if two numbers are the same distance from 3, sort them ascending. Line 18, sort an array of arrays by the first element of each row, which could also be done via (0..$#a). Line 23, sort by the string lengths of the second columns of each row, descending.
Line 30, sort the keys of a hash, default lexicographic (string) order. Line 34, sort by keys descending. Line 38, sort by the hash values, but output the related keys.
In summary, there's one way to do it. Sort always behaves in the same fashion, independent of the source of the inputs. Very complex objects are accessible in the comparison block (and can actually be modified, so be careful). Comparisons can be straightforward operators, chained comparisons, or function calls. You rarely need temporary lists to perform your sorting.
See also List::Util and related modules for some other list operations.
List functions are not restricted. A common pattern that develops in languages is a failure to maintain encapsulation as a result of over-specialization. class Dog has a comparator. A set of dogs then provides an in-place reordering method, Kennel.sort. The next day it's discovered that a dog checklist only needs the names, without resorting the kennel, so a Kennel.sortedDogNames is added. As the business grows, Kennel.sortedDogNamesForBreed appears.
Eventually the language develops a number of wrappers to "facilitate easier sorting for different types of objects". ArrayHelpers.sort(array,comparator). The Kennel is updated to use ArrayHelpers everywhere, and this works for a while, but a refactoring of some underlying data structures results in the need to sort something that isn't a direct subclass of an array. A quick hacking session yields a few dozen lines of code to calculate the index values needed for sorting, make the quick call to sort the indexes in place, and then translate that iterable into an output object. Then more functions are added for filtering "fields" of the results.
Next someone realizes that to increase the number of visits for each dog present in the kennel, it's necessary to walk the list and modify some Dogs. But! A few different filter methods are added to handle those events. Other methods are added to filter the dogs by age, visits, breed, weight, aversion to dry food, ... Gaahhh!
Perl. In Perl, you don't have to completely refactor your code when you realize, "Not only do I want these sorted, I actually want to assign some aggregate statistics along the way". You don't have to worry about your list function suddenly becoming impotent just because you want to do something more when you process a list. Nothing prevents you from adding commonly-used helper functions, but from a code maintenance perspective you won't suddenly have to refactor because a specialized list function has to be replaced with another specialized function.
Enough words. Code.
01 @a=qw/three two one/;
02 print '2. ',join(',',grep(/o/,@a)),"\n";
03 print '3. ',join(',',sort grep(/o/,@a)),"\n";
04 print '4. ',join(',',sort grep {/o/} @a),"\n";
05
06 $cnt=0;
07 print '7. ',join(',',sort {$b cmp $a} grep {/o/ && ++$cnt} @a)," (cnt=$cnt)\n";
08
09 $cnt=0;
10 print '10. ',join(',',sort {$b cmp $a} grep {/o/ || !++$cnt} @a)," (cnt=$cnt)\n";
11
12 %h=(one=>1,two=>2,fortytwo=>42,eight=>8);
13 print '13. ',join(',',grep(/o/,keys %h)),"\n";
14
15 %h=(one=>1,two=>2,fortytwo=>42,eight=>8);
16 print '16. ',join(',',grep($h{$_}=~/2/,keys %h)),"\n";
17
18 %h=(one=>1,two=>2,fortytwo=>42,eight=>8);
19 print '19. ',join(',',sort {$h{$a}<=>$h{$b}} grep($h{$_}<10,keys %h)),"\n";
20
21 @a=qw/three two one/;
22 print '22. ',join(',',grep {length($_)>3} @a),"\n";
23 print '23. ',join(',',grep {defined($h{$_})} @a),"\n";
And the output followed by some comments.
2. two,one
3. one,two
4. one,two
7. two,one (cnt=2)
10. two,one (cnt=1)
13. fortytwo,two,one
16. two,fortytwo
19. one,two,eight
22. three
23. two,one
Line 2, find all elements of the one-dimensional array that contain a letter 'o'. Line 3, sort those (lexicographically). Line 4, same thing but showing the block syntax to grep. Namely, if the block evaluates to true, the list element will be matched.
Line 7 shows two additions: we wanted to count the number of matches and sort them descending. This example is slightly contrived (for reasons), but demonstrates that it's not necessary to do a complete refactoring using temporary storage for the filtered list, which is then passed through a loop for counting, and then passed elsewhere for output (and then still in scope as a temporary variable even though you don't want it). Note this is a rare case where you need the prefix ++ operator: a postfix $cnt++ would return zero on the first match and make the grep condition false. Line 10 is the same as line 7 but counts the non-matches.
Line 13, find the hash keys that contain an 'o'. Line 16, find the hash keys where the values contain the digit '2'. Line 19, find the keys where the hash value is less than ten, sorted ascending by the value.
Line 22, find elements of the array with string length greater than three. Line 23, find the elements of the array that represent defined keys of the hash.
There is one hidden trick:
1 @a=qw/three two one/;
2 foreach my $w (grep(/o/,@a)) { $w=uc($w) }
3 print "@a\n";
which also comes with a warning:
three TWO ONE
I'll steal the explanation directly from grep: "That is, modifying an element of a list returned by grep... actually modifies the element in the original list."
Perl provides a safer mechanism to modify the values that would result from grep/sort:
01 @a=qw/three two one/;
02 print '2. ',join(',',map {uc($_)} @a),"\n";
03 print '3. ',join(',',map {uc($_)} grep {/o/} @a),"\n";
04 print '4. ',join(',',map {/o/?uc($_):$_} @a),"\n";
05 print '5. ',join(',',map {length($_)} @a),"\n";
06 print '6. ',join(',',map {defined($h{$_})?$h{$_}:"undefined at $_"} @a),"\n";
07 print '7. ',join(',',map {defined($h{$_})?$h{$_}:()} @a),"\n";
08
09 $cnt=0;
10 print '10. ',join(',',sort map {$cnt++;$_} grep {/o/} @a)," (cnt=$cnt)\n";
11
12 $cnt=0;
13 print '13. ',join(',',sort map {$cnt++;$_} grep {/o/} @a)," (cnt=",1+$#a-$cnt,")\n";
14
15 $cnt=0;
16 print '16. ',join(',',sort map {!/t/ && $cnt++; /t/?$_:()} @a)," (cnt=$cnt)\n";
17
18 $cnt=0;
19 print '19. ',join(',',sort map {/t/ && $_ || (++$cnt && ())} @a)," (cnt=$cnt)\n";
20
21 %h=map {$_=>length($_)} @a;
22 print '21. ',join(',',map {sprintf("%s=>%s",$_,$h{$_})} keys %h),"\n";
The same function can be used for "reformatting a stream" and assignments. It's not necessary to refactor to temporaries/loops.
2. THREE,TWO,ONE
3. TWO,ONE
4. three,TWO,ONE
5. 5,3,3
6. undefined at three,2,1
7. 2,1
10. one,two (cnt=2)
13. one,two (cnt=1)
16. three,two (cnt=1)
19. three,two (cnt=1)
21. three=>5,two=>3,one=>3
Line 2, for each element from the array (not "in the array"), display an uppercased version. This does not change the array. Line 3, filter+uppercase. Line 4, all array elements are displayed, but uppercased when they contain an 'o'. Line 5, the lengths of the elements of the array. Line 6, if the element from the array is a key of the hash, show the associated value from the hash, otherwise the "undefined" message. Line 7, same as line 6, but a map-based approach to select only those elements defined as keys of the hash.
Line 10, a more sane approach to counting the grepped matches. Line 13, a cheater approach to counting the non-matches in the case that the array is available as an object. Line 16, a different approach to counting non-matches, with no requirement that the array be an object that persists outside the map statement. (Note the changed pattern.) Line 19, a different way to write line 16.
Line 21, using map to create entries in a hash. Line 22, using sprintf to format key-value pairs from the hash.
The <> operator. As per a theme in the previous section, input can be handled with "the one way": Use the <> operator as mentioned above.
1 $c=0;
2 while(<>) { $c++ }
3 print "$c\n";
This permits data to come from the standard input stream or from named files:
$ cat /usr/share/dict/words | perl ./io.pl
234979
$ perl ./io.pl /usr/share/dict/words
234979
$ perl ./io.pl /usr/share/dict/words /usr/share/dict/words
469958
Reading from an open file handle, say it's called $fh, is equivalent: while(<$fh>).... Note that input lines are not automatically truncated; we get to control that as we see fit (see chomp).
1 while(<>) { print length($_),"\n" }
which will include the newline character in the string, hence in the character count:
$ echo 'hi' | perl ./io.pl
3
There are functions for system-level character reads, but for most data processing and filtering tools it's very easy, fast, and consistent to read in data this way. It's also not necessary to handle commandline argument processing for a simple list of files. (Think of a tool such as `touch file1 file2 file3...`.)
Output. There's one way to print: print. There are tonnes of examples above. Printing to an open file handle: print $fh "Stuff\n". Formatting can be handled with sprintf. One caution: learn the difference between a comma and a period when separating the arguments to print.
Files are simple. The syntax to open files mirrors Unix redirection.
1 open(my $fhA,'>', 'output.log');
2 open(my $fhB,'>>','append.txt');
3 open(my $fhC,'<', 'source.dat');
An existing "output.log" is overwritten, whereas output to "append.txt" will append. "source.dat" is an input data source. Output is via print $fhA "stuff"; likewise for $fhB. Input is via while(<$fhC>).
Run commands and inspect results quickly. To be small, fast, and lightweight, support tooling and applications must be able to communicate with other processes, call external commands, pass in data, and collect the results. This can be a simple and powerful tool, grow into something very complex, or become intractable, depending on circumstances. Perl lets us do what we need to do quickly without imposing more complicated structures upon us. Getting output from a command mirrors shell backtick syntax:
1 my $helloa=qx/echo Hello./;
2 my $hellob=`echo Hello.`;
3 my $df=qx{df -h /home};
4 print $helloa;
5 print $hellob;
6 print $df;
7
8 my $dfshort=qx/echo "$df" | grep home/;
9 print "Short: ",$dfshort;
10
11 my $path='/';
12 my $cnt=0;
13 foreach my $fn (split(/\n/,qx/ls $path/)) { $cnt++ }
14 print $cnt,"\n";
The operator to quote words (qw//) was demonstrated above. This shows the use of the quoted-execution (qx//) operator; backticks also work (line 2), as do curly braces. Newlines are not prematurely removed. As per lines 8 and 13, the command first undergoes variable expansion. Some output:
Hello.
Hello.
Filesystem     Size    Used   Avail Capacity  Mounted on
/dev/sd4g    134.5G    8.3G  124.6G     6%    /home
Short: /dev/sd4g    134.5G    8.3G  124.6G     6%    /home
21
Pipelining works (line 8), so you can directly apply external filters without invoking multiple qx// calls, or you can gather all results and process them inside the script (line 13). Note that the simple file counting after line 11 can also be achieved with glob.
"Larger" external program execution is facilitated with system and exec, equivalent to the C functions. One of the simplest forms of "interprocess communication" is through files, so you can prepare the data to pass to the external, invoke qx/mktemp/, open/print/close to store your data in the file, and then initiate the external command, similar to a shell script. As this is a common idiom, Perl provides it directly through open ("everything is a file"):
1 open(my $ofh,'|-','mail -s "Greetings" root@localhost');
2 print $ofh "Have a great day friendly sysadmin!\n";
3 close($ofh);
To process large amounts of buffered input, pipe in:
1 open(my $ifh,'-|','find /etc -type f | head -n10') or die "open: $!";
2 while(<$ifh>) {
3 chomp;
4 print length($_),"\n";
5 }
6 close($ifh);
I included only the first ten lines of output because I didn't want all of it for this demo, and it demonstrates that the external command can be any viable shell command sequence. It works, of course:
$ perl ./io.pl
find: /etc/......./private: Permission denied
22
10
12
23
...
✗ Handling records. There are many scenarios where input is divided into records instead of lines, or where a "line" is not "a sequence of characters up to and including the first newline". For the latter, one obvious example is a character sequence terminated by a carriage return, or perhaps a carriage return and newline. Before discussing the ease with which Perl can handle the former (records), one caveat.
If you disbelieve the definition I've given of a "line", in Unix land note the following:
$ echo -n "hi" | wc -l
0
$ echo -n "hi\r" | wc -l
0
$ echo -n "hi\n" | wc -l
1
$ echo -n "hi\r\n" | wc -l
1
Namely, a line ends with a newline and starts with some other character; a newline does not signal the start of a line. Ergo, for portability, your files containing text, such as configurations, code, and so forth, should always terminate with a newline. If your editor mislabels the number of lines in the file because it contains a proper last line (terminated with a newline), then you need to find an editor that works. (vi/vim/emacs have been around for 43/27/43 years, respectively).
So, the caveat: the <> operator also returns whatever falls between the final newline and the end-of-file marker, so a while(<>) loop will "count" one more line than wc -l does. chomp() remains safe, since that final fragment has no trailing newline to remove.
$ echo -n "hi" | perl -ne 'print length($_),"\n"'
2
$ echo -n "hi\nthere" | perl -ne 'print length($_),"\n"'
3
5
$ echo -n "hi\nthere" | perl -ne 'chomp;print length($_),"\n"'
2
5
How about processing something other than Unix lines? First, the 'wrong' results.
$ echo -n "hi\r\n" | perl -ne 'print length($_),"\n"'
4
$ echo -n "hi\r\n" | perl -ne 'chomp;print length($_),"\n"'
3
In a mixed environment, you'll want to quickly support different platforms, either through a configuration or custom detection mechanism. If you receive or have to deal with teletype/DEC/(CP/M)/MSDOS/Windows, Perl permits you to do so with no modifications to the meat of your code:
1 $/="\r\n";
2 while(<>) {
3 print length($_),' ';
4 chomp;
5 print length($_),"\n";
6 }
Let's see the length of the line before and after the chomp:
$ echo -n "hi\r\n" | perl ./io.pl
4 2
Perl correctly splits on the CR+LF and chomp behaves. The $/ is the input record separator (character sequence).
Input a stream of records separated by semicolons:
1 $/=";";
2 while(<>) {
3 print length($_),"\n";
4 }
No funky, complicated parser logic is needed! No confusing string manipulation; no endless updates to string splitting and line-continuation logic. Simply change the definition of a "record" and use the Exact Same Code.
$ echo -n "one;two;three,four;five" | perl ./io.pl
4
4
11
4
If you set the input record separator to undef, it reads everything. Setting the separator from the commandline with -0:
$ echo -n "one;two;three,four;five" | perl -0 -ne 'print length($_),"\n"'
23
This is known as "slurp mode" (strictly, -0 sets the separator to the NUL character, which slurps any input containing no NULs; -0777 is the canonical slurp switch). It can be achieved within the script with local $/;
There is also an output record separator. Guess what it does?
(Dangerous stuff here; feel free to skip ahead). The input record separator can be localized (local $/), so if you want to get really tricky you can use a bit of open() magic. This is definitely one of Perl's "more than one way to do things" and, in all honesty, it is likely more confusing and less efficient than other methods (see below). I'll stuff the "item handler" into a subroutine to suggest that it might be called separately.
1 sub getitems {
2 my ($line)=@_;
3 local $/=',';
4 open(my $fh,'<',\$line);
5 while(<$fh>) {
6 print ' ',length($_),"\n";
7 }
8 close($fh);
9 }
10
11 $/=";";
12 while(<>) {
13 chomp;
14 print "Record\n";
15 getitems($_);
16 }
And some records and items:
$ echo -n "one;two;three,four;five" | perl ./io.pl
Record
 3
Record
 3
Record
 6
 4
Record
 4
Quoting operators. Perl provides generic quoting operators to denote various literals and to support expansion of variables and commands. These permit faster development and a more concise signal of intent. Most of the quote operators allow the choice of an arbitrary quoting character, making it easier to illuminate places in your code with special needs or unexpected behavior, and allowing you to avoid the dreaded endless collection of double and triple backslashes and/or dollar signs that might otherwise be needed.
The word list operator is one such example. Compare the following:
1 foreach my $w (qw/one two three $a @s %t/) { print "$w\n" }
2 foreach my $w ('one','two','three','$a','@s','%t') { print "$w\n" }
3 foreach my $w ("one","two","three","\$a","\@s","\%t") { print "$w\n" }
4
5 foreach my $w (qw(one two three)) { print "$w\n" }
6 foreach my $w (qw{one two three}) { print "$w\n" }
7 foreach my $w (qw[one two three]) { print "$w\n" }
8 foreach my $w (qw\one two three\) { print "$w\n" }
9 foreach my $w (qw#one two three#) { print "$w\n" }
10
11 foreach my $w (qw{pathA/fileA pathB/fileB path{C}/file{C}}) { print "$w\n" }
Line 1 uses the word list operator; note that variables are not expanded. Line 2 gives the same result using single quotes (also non-expanded). Line 3 is what you'd need if you have an arbitrary style expectation against single quotes ("because they're confusing"?) and prefer to use double quotes for everything. Lines 5 through 9 demonstrate using alternate quote characters. Line 11 shows why you might want to change from the default slash delimiter, namely to permit using a slash inside the operator, and also what happens when you nest the delimiter characters.
The next most useful operators (imho) are mentioned elsewhere, namely the qx operator to execute an external script and capture the output (see above), and the regular expression quoting operator, qr, to declare a variable as a regular expression (see below).
The other two quoting operators of interest, though I must admit to not using them often, are the generic single- and double-quote operators:
1 my $name="Brian";
2 print qq{Hello $name.\n};
3 print qq/Hello $name.\n/;
4 print qq:Hello $name.\n:;
5 print qq[Hello $name.\n];
6 print qq-Hello $name.\n-;
7 print q{Hello $name.\n};
8 print q/Hello $name.\n/;
9 print q:Hello $name.\n:;
10 print q[Hello $name.\n];
11 print q-Hello $name.\n-;
And the results
Hello Brian.
Hello Brian.
Hello Brian.
Hello Brian.
Hello Brian.
Hello $name.\nHello $name.\nHello $name.\nHello $name.\nHello $name.\n
Default parameters. Classic programming nightmares: you create a function that accepts input parameters but want to ensure some default, non-null value is used when the function is executed. Or, you are well within a loop or other subroutine and want to provide a default value in the case that a variable is null. The nightmare begins when you realize you'll need to litter your code with if(meinVariable==null) { meinVariable=5 }. There are two tricks that help you along in Perl.
1 sub defaults {
2 my ($label,$a,$b)=@_;
3 $a||=5;
4 $b//=6;
5 print "$label: a=$a. b=$b.\n";
6 }
7
8 defaults(' 8',);
9 defaults(' 9',0,0);
10 defaults('10',1,1);
11 defaults('11','','');
You'll likely recognize line 3 as the Boolean OR operator, used via standard operator-assignment in this example. The operator on line 4 is the "defined-OR" operator. Compare the subtleties in the results:
 8: a=5. b=6.
 9: a=5. b=0.
10: a=1. b=1.
11: a=5. b=.
Line 8 shows that an undefined value is false for both operators. Lines 9 and 11, however, show that the number zero and the empty string are defined, hence are kept by the // operator. If zero or the empty string are invalid inputs, the standard Boolean OR operator provides an easy default-value specification. If any defined value is acceptable for the parameter, use // to provide the default.
If you want to defer defaults, handle situation-dependent defaults, or otherwise can't check at the start of the function, the operators work just as well inline. Showing several different approaches:
1 $task{length}=$config{user}{defaultTaskLength} || $config{global}{defaultTaskLength} || 30;
2 %{$task{keywords}}=map {$_=>1} @{$config{user}{defaultKeywords} // $config{global}{defaultKeywords} // [qw/key1 key2/]};
3 print encode_json(\%task),"\n";
4
5 $config{global}{defaultKeywords}=[qw/g1 g2/];
6 %{$task{keywords}}=map {$_=>1} map {@$_} ($config{user}{defaultKeywords} // $config{global}{defaultKeywords} // [qw/key1 key2/]);
7 print encode_json(\%task),"\n";
8
9 $config{user}{defaultKeywords}=[qw/u1 u2/];
10 %{$task{keywords}}=();
11 my $kwref=$config{user}{defaultKeywords} // $config{global}{defaultKeywords} // [qw/key1 key2/];
12 foreach my $kw (@$kwref) { $task{keywords}{$kw}=1 }
13 print encode_json(\%task),"\n";
14
15 print encode_json(\%config),"\n";
Assuming that %config is already declared, it can be checked for configuration keys. Autovivification creates 'user' and 'global', but not the subkeys, as can be seen from the check output on line 15. As to the other behaviors, consider them in turn:
{"length":30,"keywords":{"key2":1,"key1":1}}
{"length":30,"keywords":{"g1":1,"g2":1}}
{"length":30,"keywords":{"u1":1,"u2":1}}
{"global":{"defaultKeywords":["g1","g2"]},"user":{"defaultKeywords":["u1","u2"]}}
Line 1 shows numeric defaults, namely "declare the task length as the user default if it is non-zero, or the global default if it is non-zero, or thirty". Being numeric values, the Boolean OR operator has been chosen.
Line 2 demonstrates the defined-OR operator being used to construct a hash/set of keywords. At the innermost level, inside the @{...}, there are three array references; the first defined of the user and global defaults is used, and otherwise a static fallback array reference is given. Once a defined array reference has been found, it is wrapped with @{...} to convert it to an array so it can be passed to 'map', which then creates a key in the hash for each entry in the array. This approach constructs the entire hash/set in a single call. It avoids the need for a temporary and conditionals such as if(defined($config{user}{defaultKeywords})) { @keywords=@{$config{user}{defaultKeywords}} } elsif(defined(...)) .... This allows a straightforward construction of clean code, namely that one logical operation occurs on a single line; specifically, the single-sentence pattern: "I want my task keywords to be the user defaults, if they exist, or the global defaults otherwise, if they exist, or key1/key2". That should not take ten lines of code.
Moving to line 5, global defaults have been provided, and a different style of construction is demonstrated. In this case, the array reference is passed into 'map', which does nothing but dereference it into an array. Passing this array to a second 'map' creates the hash keys. Note the output is correct. Incidentally, $task{keywords} is entirely overwritten by line 6, and the previous key1/key2 entries are gone.
Line 9 shows a more verbose construction, namely emptying the keywords hash, declaring a standalone $kwref, and using a standard foreach loop to populate the hash/set. This approach may be useful if subsequent calls are to be made against the keywords. Though using keys %{$task{keywords}} should suffice for such cases, it's not as concise and may not be readable depending on the circumstances.
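To close the loop on the hash/set idea, membership checks and listing are plain hash operations. A minimal sketch, with the %task contents assumed from the example above:

```perl
use strict;
use warnings;

# The %task contents here are carried over from the earlier example.
my %task=(keywords=>{u1=>1,u2=>1});

# Membership in the set is a plain hash lookup.
print "u1 is a task keyword\n" if $task{keywords}{u1};

# The full keyword list, sorted for stable output.
print join(',',sort keys %{$task{keywords}}),"\n";
```

Using exists $task{keywords}{$kw} also works, and additionally distinguishes "present with a false value" from "absent".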
Regular expressions. I'll start with a few bold statements: Anyone claiming to know, like, and excel at Perl, but claiming a dislike for regular expressions, doesn't know, like, and excel at Perl. Anyone claiming to know, like, and excel at regular expressions, but claiming a dislike for Perl, doesn't know, like, and excel at regular expressions.
Indeed, what conversation about Perl wouldn't mention regular expressions? Perl provides what is doubtless the best regular expression support of any programming language (that I've ever seen at least). For all languages supporting regular expressions, it's possible to track down benchmark comparisons, regexp atom/feature support, and language resources such as cookbooks and tutorials. Perl's default regular expression engine doesn't employ the fastest matching algorithm known to exist, yet it is still faster than most commonly-used language implementations. There are deviations with PCRE. There are other, more subtle issues. Instead of reproducing all existing resources, let me talk about the two things that make Perl regular expressions the best for a programmer, coder, and software developer.
First and foremost, Perl regular expressions are a type. Not a collection of imported helper functions. Not a strange class or two (or more) that must be loaded. Not relegated to the confusing realm that is populated with functions taking multiple string inputs ("Was that pattern, replacement, input? Or input->replaceWith(replacement)->when(pattern)->global()?"). Not demoted to a generic string and coupled with endless backslashes to deal with nested quoting confusion. The syntax is simple, straightforward, and elegant because regular expressions are just variables of type Regexp in Perl. "But wait?!", you say, "Perl doesn't have types!".
1 my $str="[A-Z][a-z]";
2 my $re=qr/[A-Z][a-z]/;
3 print ref($str),"...\n";
4 print ref($re),"...\n";
Take a look at ref and the results.
...
Regexp...
The empty result on the first line (nothing printed before the "...") signifies that $str is not a reference, whereas you can see that the variable created with qr// is a regular expression!
Put simply, this helps you create performant (precompiled) regular expressions, store them inside structures easily, and use them correctly in complicated scenarios. For example, here's a configuration for an argument parser:
1 my %handler=(
2 qr/[Hh]/ => sub { ... },
3 qr/[Qq]/ => sub { ... },
4 qr/e/ => sub { ... },
5 qr/E/ => sub { ... },
6 );
7 foreach my $re (keys %handler) {
8 if($arg=~/^$re( .*)$/) { ... }
9 }
The configuration is simple: "This pattern induces this behavior". It is not hiding behind an excessive call list used to construct precompiled patterns; you don't have to write a "precompile phase" that converts stringy-pattern hash keys into compiled regular expression / pattern objects. Use in the foreach loop directly correlates with every other variable type, and the variable can be used directly inside the pattern matching construct. Note also that the conditional on line 8 enforces starting at the beginning of the line; you can't forget to add that anchor to each individual pattern.
This ability to quickly merge regular expressions without calling several different string formatting and concatenation mechanisms is very useful. As an exercise, ask yourself how to use a regular expression to verify that a string "yyyy-mm-dd" is a valid date. I won't give away the answer, but observe a few pieces:
...
my $reRegularMonths=qr/(?:$reDaysToThirty|$reDaysThirtyOne|$reDaysTwentyEight)/;
my $reValidDate=qr/(?:$reYrAny-$reRegularMonths|...things...)/;
Date calculations are known to be tricky, and I don't claim that this is an "ideal solution" to the problem stated, but it does demonstrate the ease with which complex patterns can be built safely. Consider how a URL string might be validated for correctness; such validation will benefit from patterns such as $protocol, $server, $port, and so forth.
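As a hedged sketch of the URL idea (the component names and patterns below are my own simplifications, not a complete validator for the URL specification):

```perl
use strict;
use warnings;

# Hypothetical URL components, each independently testable.
my $reProtocol=qr{https?};
my $reServer=qr{[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+};
my $rePort=qr{:\d{1,5}};
my $rePath=qr{/\S*};

# Compose the pieces into one anchored pattern.
my $reUrl=qr{^$reProtocol://$reServer$rePort?$rePath?$};

print "match\n"    if "https://example.com:8080/index.html"=~$reUrl;
print "no match\n" unless "ftp://example.com/"=~$reUrl;
```

Because each piece is a Regexp value, each can be unit-tested on its own before being composed into the full pattern.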
Let's extend the above handler to support statements with input parameters. (Again, this is a demonstration; there are minor issues):
1 %handler=(
2 qr/add (?<termA>\d+) (?<termB>\d+)/ => sub { return $+{termA} + $+{termB} },
3 );
4
5 my $call="add 5 7";
6 foreach my $re (keys %handler) {
7 if($call=~/^$re$/) { print "Result: ",&{$handler{$re}}(),"\n" }
8 }
The handler accepts two named integers (denoted by the (?<name>...) atom). I've added extra space around the addition on line 2 to distinguish it from the $+{name} hash lookup; the hash is %+, so its keys are referenced via $+{key}, just like every other named hash.
As a short pause, I'll note this is a mere grain of ice on the tip of the iceberg. The efficiency of Perl's regular expressions makes them common solutions to basic string tasks (chomp/trim/ltrim/rtrim/uppercase-before-parse/lowercase-before-parse/removing unprintable characters). They are useful for parsing (split and process, split and compare,...). For example, the matching operator works as a list to facilitate easier parsing without the split-trim-filter idiom:
1 my $sentence="See Spot run.";
2 foreach my $word ($sentence=~m/\w+/g) { print $word,"\n" }
3 print join(',',$sentence=~m/\w+/g),"\n";
Line 2 shows list output used in a foreach loop. Line 3 is a separate demonstration that the list result can be used in subsequent operators, such as map/join/....
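A few of the basic string tasks mentioned above, sketched as one-line substitutions (the variable names are illustrative):

```perl
use strict;
use warnings;

# Trim both ends: copy the string, then modify the copy in place.
my $s="  Hello, World!\t\n";
(my $trimmed=$s)=~s/^\s+|\s+$//g;
print "[$trimmed]\n";                 # [Hello, World!]

# Delete non-printable ASCII: complement the printable range and delete.
my $noisy="ab\x01cd\x7Fef";
(my $clean=$noisy)=~tr/\x20-\x7E//cd;
print "$clean\n";                     # abcdef
```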
Before moving to a different topic, there's a second thing that makes Perl regular expressions a joy to use: The regular expression engine performs analysis and optimization, preventing wasted development effort on finding "equivalent patterns that do the same thing with greater sanity".
1 my $str='ab'x10**7;
2 my $re=qr/(a|b)*z/;
3
4 print time(),'->';
5 if($str=~/$re/) { print '(Found)->' }
6 print time(),"\n";
7
8 $str.='zababab';
9
10 print time(),'->';
11 if($str=~/$re/) { print '(Found)->' }
12 print time(),"\n";
A simple engine will start at index 0, match the first atom, then use the asterisk to search forward until the completion of the atom at index 10M (-1). Then it will move to the next atom and discover that the 'z' cannot be found. Scanning forward through all [ab] characters in the string takes considerable time, as the group is extended, only to find that no match is possible. In the second test, a match for the 'z' is found, and the regular expression takes longer to run!
1557594927.75->1557594927.79
1557594927.81->(Found)->1557594929.03
Perl optimizes matches in this case by searching first for a "floating z". In the first case, the entire string is scanned and no 'z' is found, so the pattern is guaranteed to not match. Note that other languages don't perform this optimization, and may require exponential runtimes for this type of pattern. (But further note that you can trick out Perl if you're not careful.)
Here's another example where the Perl regular expression engine does "the right thing":
0 $str='a'x2000; $str.='b';
1 my $reA=qr/(?:a*)*b/;
2 my $reB=qr/(?:a*)b/;
3 $str=~/$reA/;
4 $str=~/$reB/;
Depending on the appearance of the terminal 'b' in the string, some languages will enter deep recursion trying to evaluate the first. In particular, (a)(a)(a)... is viewed as a different match than (aa)(a)(a)... — where the parentheses here signify the match groups created, not actual characters — so a language may enter a loop looking for all permutations of groupings of sequences of 'a'. Since the Kleene Star is greedy in Perl, both the first and second regular expressions will match quickly. Indeed, the second is still 10–15% faster than the first, but neither will hang indefinitely.
✗ Every viable language should provide some mechanisms to inspect data, handle errors, and facilitate debugging. These may be very basic or complete overkill, and most languages develop tooling based on their own particular development idioms. Every note in this section could be seen as labeled with a ✗ because, in all fairness, it's relatively straightforward to dump data structures for investigation, to provide function exit codes for handling error behaviors, and to add debugging messages to most code. I'll try to comment in each section how the Perl approach is convenient.
Data::Dumper. Obtaining a human-readable form of a large data structure can be simple or complicated depending on the language. In some cases, the only choice is a serialization library, whereas in others a simple "print Thing" will succeed. Perl provides a built-in, unified mechanism to emit data structures that can likewise be reloaded directly from that output. JSON wasn't around until the early aughts, so having a mechanism to display structures to arbitrary depth automatically was nice.
1 use Data::Dumper;
2 my %h=(
3 scalar=>"A small poem\nA large home.",
4 array=>[1,2,3],
5 hash=>{key=>'value'},
6 aoh=>[
7 {name=>'zero'},
8 {name=>'one'},
9 {name=>'two'},
10 ],
11 );
12 print Dumper(\%h);
13
14 my $dump=Data::Dumper->new([\%h],['*h'])
15 ->Indent(0)
16 ->Useqq(1)
17 ->Pair('=>')
18 ->Sortkeys(1);
19 print $dump->Dump(),"\n";
Other examples of Data::Dumper were provided above. The output in this case:
$VAR1 = {
'array' => [
1,
2,
3
],
'hash' => {
'key' => 'value'
},
'scalar' => 'A small poem
A large home.',
'aoh' => [
{
'name' => 'zero'
},
{
'name' => 'one'
},
{
'name' => 'two'
}
]
};
%h = ("aoh"=>[{"name"=>"zero"},{"name"=>"one"},{"name"=>"two"}],"array"=>[1,2,3],"hash"=>{"key"=>"value"},"scalar"=>"A small poem\nA large home.");
The default output uses visual indents, a default variable name of "$VAR1" (a hash reference in this case), includes certain characters as literals (note the newline after "poem"), and is output in Perl's internal hash order. The second dumper has been configured to reference "%h" and specifically name the variable "h" (the "*" is explained in the manual). Indent(0) removes the visual indents and newlines. The Useqq call results in double-quoted string values with escape sequences (notice the "poem\n"). The Sortkeys ensures that the output key order is canonical (or you can pass a key sorting function).
Dumper is convenient because it "always works", even with references.
1 %h=(one=>[1],two=>{key=>'value'});
2 $h{One}=$h{one};
3 $h{Two}=$h{two};
4 %{$h{Three}}=(%{$h{Two}}); # materialized
5
6 print Dumper(\%h);
And the output
$VAR1 = {
'One' => [
1
],
'two' => {
'key' => 'value'
},
'Three' => {
'key' => 'value'
},
'one' => $VAR1->{'One'},
'Two' => $VAR1->{'two'}
};
Note that the nested values are referenced correctly. While this is easy to read, reloading the structure is more difficult. More specifically, if $VAR1 is initially undefined, or even an empty hash, the values for keys 'One' and 'two' don't exist, so the assignments to 'one' and 'Two' will load as 'undef'. It's possible to handle references correctly, including circular references, as follows:
1 %h=(one=>[1],two=>{key=>'value'});
2 $h{One}=$h{one};
3 $h{Two}=$h{two};
4 %{$h{Three}}=(%{$h{Two}}); # materialized
5
6 %{$h{step1}}=(value=>'step1',next=>undef);
7 %{$h{start}}=(value=>'start',next=>$h{step1});
8 $h{step1}{next}=$h{start};
9
10 print Data::Dumper->new([\%h],['*h'])->Indent(1)->Purity(1)->Dump(),"\n";
Look carefully at the output:
%h = (
'one' => [
1
],
'two' => {},
'One' => [],
'Two' => {
'key' => 'value'
},
'Three' => {
'key' => 'value'
},
'start' => {
'value' => 'start',
'next' => {
'value' => 'step1',
'next' => {}
}
},
'step1' => {},
);
$h{'One'} = $h{'one'};
$h{'two'} = $h{'Two'};
$h{'start'}{'next'}{'next'} = $h{'start'};
$h{'step1'} = $h{'start'}{'next'};
For complex structures during active development and debugging, particularly when you're writing test cases or building up helper functions, Data::Dumper is often the fastest way to ascertain the structure of your input parameters. For larger structures, you can print the output of Dumper to a file, and later reload it with eval, so it's a good rapid-prototyping tool when working with state tables or other structures that are built incrementally/successively over time.
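A minimal sketch of that dump-to-file/reload-with-eval round trip (using a temporary file; the structure contents and variable names are mine):

```perl
use strict;
use warnings;
use Data::Dumper;
use File::Temp qw/tempfile/;

# Some state worth persisting; invented for the demonstration.
my %state=(step=>3,seen=>{a=>1,b=>2});

# Naming the variable '*reloaded' makes the dump a ready-to-eval assignment.
my ($fh,$file)=tempfile();
print $fh Data::Dumper->new([\%state],['*reloaded'])->Sortkeys(1)->Dump();
close $fh;

# Reload: slurp the file and eval the assignment back into a lexical.
my %reloaded;
my $code=do { local $/; open my $in,'<',$file or die $!; <$in> };
eval $code; die $@ if $@;

print $reloaded{step},"\n";    # prints 3
```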
It's not perfect, of course: It will dump code references as sub{"DUMMY"}, so they won't reload correctly, but neither will they break reloads. Tracking references is slow. The output must be built in place, which takes more memory. (Once I had memory troubles because of all the leading spaces Dumper uses; I dropped to Indent(1) and my memory issues went away.)
Exceptions. Raising arbitrary exception messages is easy. (✗ This is obviously not something unique to Perl, but I do feel that Perl has an approach that strikes a nice balance between robustness and simplicity.)
1 sub divide {
2 my ($numerator,$denominator)=@_;
3 if($denominator==0) { die "Attempted division by zero" }
4 return $numerator/$denominator;
5 }
6
7 print divide(4,0),"\n";
As per die(), omitting a newline from the error message will append the line number automatically, making debugging easier.
$ perl ./die.pl
Attempted division by zero at ./die.pl line 3.
In some cases, die statements are hiding inside libraries or subroutines. It's easy to prevent them cascading and blowing up your own program.
1 sub divide {
2 my ($numerator,$denominator)=@_;
3 if($denominator==0) { die "Attempted division by zero" }
4 return $numerator/$denominator;
5 }
6
7 my $result;
8
9 eval { $result=divide(4,0) };
10 if(my $msg=$@) {
11 print "Inner failure '$msg'";
12 $result=123321; # or whatever
13 }
14 print "Program continues (result=$result)...\n";
The eval traps errors and makes them available in the $@ variable. Giving this a try:
$ perl ./die.pl
Inner failure 'Attempted division by zero at ./die.pl line 3.
'Program continues (result=123321)...
Exceptions propagate upwards as expected. If no inner handler catches an exception, you can catch it at a higher level.
...
7 sub improperize {
8 my ($numerator,$denominator)=@_;
9 my %ret=();
10 $ret{whole}=divide($numerator,$denominator);
11 my $num=$numerator-$ret{whole}*$denominator;
12 ($ret{numerator},$ret{denominator})=reduce($num,$denominator);
13 return %ret;
14 }
15
16 my %frac;
17
18 eval { %frac=improperize(4,0) };
19 if(my $msg=$@) {
20 print "Inner failure '$msg'";
21 $result=123321; # or whatever
22 }
23 print "Program continues (result=$result)...\n";
The division by zero will occur when line 10 calls divide(). The exception isn't trapped by improperize(), but one level higher (line 18), even though the exception occurs two levels down.
Inner failure 'Attempted division by zero at ./die.pl line 3.
'Program continues (result=123321)...
The above is nothing special. Languages with exception handling generally work in this fashion. Notice, however, that the die approach is focused on returning error messages. Similar to encoded return values, this permits an error handler to trap only the errors whose messages it recognizes, whether customized or library-related, and to handle those situations or die() again.
1 sub diewith {
2 my ($msg,$x)=@_;
3 if($x==5) { die "die $msg (x=$x)" }
4 }
5
6 sub test {
7 my ($msg)=@_;
8 eval { diewith($msg,5) };
9 if(my $err=$@) {
10 if($err=~/problem/) { print "We saw some problems\n" }
11 elsif($err=~/error/) { print "We saw some errors\n" }
12 else { die "Unknown issue" }
13 }
14 print "test($msg) reached the end\n\n";
15 }
16
17 test("Some problems");
18 test("At least one error");
19 test("Entropy");
And the result
We saw some problems
test(Some problems) reached the end

We saw some errors
test(At least one error) reached the end

Unknown issue at ./die.pl line 12.
In the absence of exceptions, subroutines and libraries would be required to return exit codes or structures, and each caller in the stack would need to check for those output states. That approach has its benefits, but is harder to maintain over time. die() permits returning exceptions as "stringy messages", which is a natural approach for raising and trapping exceptions. In particular, interprocess communication, socket communication, and networking APIs all facilitate error handling through messages. Some protocols have specific codes, but in general there are no complex exception types to consider. When developing in Perl, you naturally use descriptive exception messages in your functions and libraries and handle those exceptions based on such messages. Later, when you're refactoring your code into microservices and standalone tools, you can use the same handler logic.
If you need more complicated types of exceptions that pass detailed information up the call stack, you can pass an object reference to die(). This is arguably more powerful than the exception handling offered in most programming languages.
1 sub dieobj {
2 my ($msg,$x)=@_;
3 my %obj=(request=>{msg=>$msg,x=>$x});
4 if($x==5) {
5 die({
6 err=>'Invalid parameter',
7 details=>"The value of x requested is not supported: $x",
8 state=>\%obj,
9 });
10 }
11 die 'dieobj shall not finish';
12 }
13
14 eval { dieobj("Go lucky five!",7) };
15 if(my $err=$@) {
16 if(!ref($err)) { die }
17 if(ref($err) eq 'HASH') { die sprintf("ERROR %s in file %s at line %d: %s.\n",$$err{err},__FILE__,__LINE__,$$err{details}); }
18 else { die 'Unhandled exception' }
19 }
20
21 print "Concluded\n";
As this is slightly overkill, what should happen with that input value of seven on line 14? The call into dieobj() will reach line 11 and die with a plain string message. Back on line 16, $err will just be a string (ref($err) returns the empty string), so the die on line 16 will reraise and "Concluded" will never be reached:
dieobj shall not finish at ./die.pl line 11.
...propagated at ./die.pl line 16.
Convenient! Perl automatically reports that die propagated.
Let's suppose instead that the input on line 14 is the value five. Line 4 will see the "invalid input" and die with a hash reference, which will be handled on line 17. The subsequently constructed message contains the error, the details, and the location, and is again raised through a call to die():
ERROR Invalid parameter in file ./die.pl at line 17: The value of x requested is not supported: 5.
This provides considerable power in handling exceptions, particularly if aspects of the internal data state are needed to appropriately handle next steps in the callers.
A few last notes: Read the manual for die/eval/$@ to avoid common pitfalls and strange behaviors. The hash constructed in the previous example will work for more complex exception handling, but it's better to use proper, inherited classes to maintain a consistent hierarchy of exceptions. Additional information will be mentioned below. Note that, similar to most languages, using eval/if/try/catch comes with a performance penalty. See the discussion about Benchmark below.
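As a hedged sketch of the class-based approach (the class names and fields here are invented for illustration):

```perl
use strict;
use warnings;

# A minimal exception hierarchy; the class names are illustrative.
package MyApp::Error;
sub new { my ($class,%args)=@_; return bless({%args},$class) }
sub message { my ($self)=@_; return $$self{message} // 'unknown error' }

package MyApp::Error::BadInput;
our @ISA=('MyApp::Error');

package main;

my $caught='';
eval { die MyApp::Error::BadInput->new(message=>'x must not be 5') };
if(my $err=$@) {
    # Catch anything in the hierarchy; rethrow everything else.
    if(ref($err) && $err->isa('MyApp::Error')) { $caught=$err->message }
    else { die }
}
print "Caught: $caught\n";
```

Because isa() walks the inheritance chain, a handler can trap a whole family of exceptions at once rather than pattern-matching on message text.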
✗Carp. Libraries and deeply nested handlers can provide additional stack/trace information when they encounter exceptions by using Carp.
1 use Carp qw/confess/;
2
3 sub fib {
4 my ($n)=@_;
5 if($n<1) { confess "Fibonacci is undefined for $n" }
6 return fib($n-2)+fib($n-1);
7 }
8
9 print fib(5),"\n";
The call to confess should produce a stack trace.
Fibonacci is undefined for -1 at ./die.pl line 5.
main::fib(-1) called at ./die.pl line 6
main::fib(1) called at ./die.pl line 6
main::fib(3) called at ./die.pl line 6
main::fib(5) called at ./die.pl line 9
In this example, confess behaves like die with a stack trace appended; croak would report the error from the caller's perspective. When writing modules (see below), the behaviors differ as described in the documentation, and are very useful for debugging misbehaviors within modules.
✗A debugger. My pattern for debugging most applications is to increase logging output to find mismatches between input data and function assumptions. This has served me well when coupled with isolated test cases. On the other hand, there have been half a dozen times that I should have used the Perl debugger. Maybe I'll try this the next time I get stuck.
Encapsulation. There's a class of popular languages amenable to object oriented programming that, in fact, don't satisfy all the standard attributes of true object oriented design. It's easy to find a scripting language that claims inheritance, but lacks multiple inheritance; that claims polymorphism, but only supports it for direct descendants; that claims encapsulation of values, but doesn't support protected attributes; that claims encapsulation of actions, but doesn't support first class functions. On the other hand, it's easy to find "true" object oriented languages with their own nightmares, limitations, and idiosyncrasies. Perl is no exception to this rule-of-exceptions, and it's commonly noted that support for OO in Perl was "added on later". This is not entirely true, but the OO intro notes that Perl's minimalistic approach to OO made sense in 1994, but leaves the programmer to implement a great deal of OO as needed.
✗To get started with objects in Perl, it's helpful to clarify goals: "Put a bunch of things inside an 'object'". Yes, you can do that. "Attach functions to an object that serve as the object's actions". Yep, that works too. "Convert my C++/Java class hierarchy into a Perl class hierarchy with equivalent behavior". No, that's just silly. For many "OO features" and patterns, you'll need to include helper modules, and there are several choices for helper modules you can use. On the other hand, if you're willing to work with the core concepts of OO design incrementally, you'll see that Perl helps you out.
In my very early days with Perl, I was creating objects with encapsulation of values. It turns out to be very, very easy to do. Here's an object that encapsulates all the file status properties for a file. We call it a "hash":
1 my %stat=();
2 @stat{qw/dev ino mode nlink uid gid rdev size atime mtime ctime blksize blocks/}=stat('/etc/passwd');
3 print $stat{size},"\n";
Congratulations, encapsulation is easy in Perl! In almost all cases, these are sufficient objects to build complex behavior and interactions in applications. Functions can be passed references to hashes and can perform actions based on the content of those hashes. This permits building an application that adheres to a message-based API model, namely that everything should be de/serializable, and interactions should be via transmission and updates to those messages. (In a sense, this is similar to Ruby's model of messaging in classes.)
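A minimal sketch of that message-style API: the function receives a hash reference and acts only on its contents (the function and field names are illustrative):

```perl
use strict;
use warnings;

# Message-style API sketch: everything the function needs arrives in
# one serializable hash reference.
sub describe_file {
    my ($msg)=@_;
    return sprintf("%s is %d bytes",$$msg{path},$$msg{size});
}

my %stat=(path=>'/etc/passwd',size=>2048);
print describe_file(\%stat),"\n";    # /etc/passwd is 2048 bytes
```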
Attaching functions to hashes is likewise easy (and everyone who talks about object oriented programming must include geometry or animals!):
1 my %dog=(
2 name=>'Fido',
3 speak=>sub{print "Bark!\n"},
4 );
5
6 &{$dog{speak}};
The ampersand sigil on line 6 says "this is a function, treat it as such". Without the sigil, $dog{speak} is just a code reference, which means it can be assigned to other variables. The sigil dereferences to the subroutine itself, which allows its execution. (It can also be invoked with $dog{speak}->() but ugh.)
Unfortunately I've lied. That function does work, but it actually doesn't know that it's being called for the dog named Fido. It's a simple function with no access to $dog{name}. Of course, you can create a variable for the dog's name and create a function closure to make it speak, but that's not how we want "objects" to work. To get functions tied to objects, they have to be declared differently.
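For comparison, here is the closure workaround just mentioned; it works, but the function is tied to a lexical variable rather than to the object itself:

```perl
use strict;
use warnings;

# The closure workaround: the anonymous sub captures the lexical $name,
# so it "knows" which dog it belongs to.
my $name='Fido';
my %dog=(
    name=>$name,
    speak=>sub { return "Bark! (from $name)" },
);

my $said=&{$dog{speak}};
print "$said\n";    # Bark! (from Fido)
```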
1 package Dog;
2
3 use strict;
4 use warnings;
5
6 sub new {
7 my ($ref,$name)=@_;
8 my $class=ref($ref)||$ref;
9 my $self={
10 name=>$name,
11 };
12 return bless($self,$class);
13 }
14
15 sub speak {
16 my ($self)=@_;
17 print "Bark! (from $$self{name})\n";
18 }
19
20 1;
21
22 package main;
23
24 my $dog=Dog->new('Fido');
25 $dog->speak();
Declaring a package creates a new "type", in this case having a reference name of 'Dog'. The included modules (lines 3 and 4) are included within the class, which concludes on line 20. (If lines 1--20 are placed in a separate file called Dog.pm, the main program can use Dog; to include the module, and line 22 won't be needed at all. Much cleaner.) A new(name) function, which can be called anything, is created. There's a bit of magic hiding in there, but the function expects to be given the $name, which it then places inside the $self hash. The last bit of magic on line 12 tells $self that it is a Dog, and a reference to it is returned from new().
The class function, speak, will be passed a reference to itself (line 16), which can then be used to access member variables (line 17). Since a variable initialized with class Dog is a reference (line 24), the function can be accessed with a dereference arrow (line 25). The dog's name is directly accessible as $$dog{name} (but there is no such thing as $$dog{speak}).
There are no "private variables", only tricks to hide away variables that shouldn't be accessed from outside. There's no "immutability" as a result. Everything is "public" in the sense that it can be accessed from the outside, dereferenced from the outside, or otherwise inspected. In practice, this is beneficial because it makes writing unit tests very easy. In theory, since a running application has access to all of its own memory (everything in a process is "advisory" anyway), the only way to truly dissociate an object from the caller is to use a thread or separate process. ✗ In practice, this is more about having the right paradigm when creating your libraries than it is about building an application that utilizes every theoretical feature of OO design.
As mentioned previously, classes are a good place for Carp. Let's add a few things:
5 use Carp qw/confess/;
# ...
21 sub area {
22 my ($self,$unit)=@_;
23 confess("I'm too fuzzy for a surface area calculation");
24 }
# ...
32 $dog->area('in^2');
Then it will be easy to see the call stack into the object.
I'm too fuzzy for a surface area calculation at ./oo.pl line 23.
Dog::area(Dog=HASH(0x1fd8f5041688), "in^2") called at ./oo.pl line 32
Moving the above into Dog.pm and including it via use Dog; makes it a simple module/library. (See also use lib './';). Most of encapsulation, composition, and polymorphism, behave as expected; so also does inheritance, but why would you do that to yourself?
You are free to define the syntax you want for your library. In the example above, new was used as an object constructor, but this is by no means required. It is common to return $self; from initialization functions, copy constructors, and even in-place modification methods, so that you can chain together initialization and execution easily. (I've been using "stream syntax" in my objects for over a decade, haha). See the examples for Data::Dumper above. For initialization, it's more natural to pass in a set of initialization parameters as a hash:
1 $dog=Dog->new({
2 name=>'Fido',
3 height=>30,
4 weight=>25,
5 breed=>'Ovcharka',
6 });
(It's of course necessary to change new() to support that syntax.)
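One possible reworking of new() to accept a hash reference (a sketch; the default name is my own invention):

```perl
use strict;
use warnings;

package Dog;

# new() reworked to take a hash reference of named parameters.
sub new {
    my ($ref,$args)=@_;
    my $class=ref($ref)||$ref;
    my $self={
        name=>$$args{name} // 'Rex',   # illustrative default when omitted
        height=>$$args{height},
        weight=>$$args{weight},
        breed=>$$args{breed},
    };
    return bless($self,$class);
}

package main;

my $dog=Dog->new({name=>'Fido',height=>30,weight=>25,breed=>'Ovcharka'});
print $$dog{name},"\n";    # Fido
```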
There's lots more. See the tutorials.
✗POD. I've found POD (plain old documentation) to provide a good balance between detail and simplicity of use, meaning that I'm more likely to actually include reasonable documentation when I create my own libraries. Many of the references above from meta/CPAN include the output of perlpod. Here's an example of a standalone section that could be added at the end of Dog.pm.
01 =pod
02
03 =head1 Dog
04
05 Dog - Canine things
06
07 =head1 SYNOPSIS
08
09 use Dog;
10 my $dog=Dog->new({
11 name=>'Fido',
12 height=>30,
13 weight=>25,
14 breed=>'Ovcharka',
15 });
16 $dog->speak();
17
18 =head1 Description
19
20 This object stores information about a dog.
21
22 =cut
If this looks like a man page or the online Perl documentation, indeed that is often what is used for those libraries. Here are some options for "seeing" the documentation:
pod2text ./oo.pl
pod2man ./oo.pl | nroff -man | less -R
pod2usage ./oo.pl # just the Usage section
pod2html ./oo.pl
That last one provides the following (I've stripped out all but the <body> contents here):
Dog - Canine things
use Dog;
my $dog=Dog->new({
name=>'Fido',
height=>30,
weight=>25,
breed=>'Ovcharka',
});
$dog->speak();
This object stores information about a dog.
✗It is easy to include inline pod, such as providing function descriptions at the point of declaration, and the tooling will collect all the pieces when creating the documentation. Being a part-time user of literate programming, I understand this approach, but don't use it much with perlpod because I'm used to having the full features of weave with named section support. In Perl, I prefer to include my documentation at the end of libraries. The most important point, however, is that I use it. I've used it successfully in production applications touched by multiple developers, and typically invoke pod2html in the Makefile to populate a doc/ directory at build time.
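A sketch of what inline POD at the point of declaration might look like for the Dog module (the section text is illustrative):

```perl
package Dog;

=head2 speak

Print the dog's bark, including its name.

=cut

# The POD block above travels with the sub; pod2text/pod2html will
# collect it along with the rest of the documentation.
sub speak {
    my ($self)=@_;
    print "Bark! (from $$self{name})\n";
}

1;
```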
Test::More. Testing is easy.
1 #!/usr/bin/perl
2
3 use strict;
4 use warnings;
5 use Test::More tests=>5;
6
7 ok(1,'vacuous');
8 is(4+1,5,'silly math');
9 ok("hi"=~/h/,'silly regexp');
10 is_deeply({two=>2,one=>1},{one=>1,two=>2},'silly hash');
And the result
1..5
ok 1 - vacuous
ok 2 - silly math
ok 3 - silly regexp
ok 4 - silly hash
# Looks like you planned 5 tests but ran 4.
For most of my library and application testing, I use a common helper script that invokes the tests and handles general test naming/labeling. This permits better test organization, for example per class function, or per test category. Since the only way to "hide" information is via certain lexically-scoped variables, most tests that require introspection or class-instance-overrides are fairly straightforward. Here's a silly example.
1 use Data::Dumper;
2 package Data::Dumper;
3 no warnings qw/redefine/;
4 sub Dumpperl { return "Not implemented" }
5 package main;
6
7 my %h=(one=>[undef],two=>[undef,undef]);
8 my $dumper=Data::Dumper->new([\%h],['*h'])->Indent(1)->Useperl(1);
9 is($dumper->Dump(),"Not implemented",'Useperl(1) works');
Here we override the Dumpperl method of a class within our testing framework, which allows us to ensure that a higher method correctly calls into this supporting function. In this case, the test we're performing is rather silly. Typically tests will override instance values, and this is easier to do when you build proper classes with standard $self={} hashes.
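A minimal sketch of such an instance override (the class here is a stripped-down stand-in for a real module under test):

```perl
use strict;
use warnings;
use Test::More;

package Dog;
sub new   { my ($class,$name)=@_; return bless({name=>$name},$class) }
sub speak { my ($self)=@_; return "Bark! (from $$self{name})" }

package main;

# The object is a plain (blessed) hash reference, so a test can reach in
# and override instance state directly before exercising a method.
my $dog=Dog->new('Fido');
$$dog{name}='TestDummy';
is($dog->speak(),'Bark! (from TestDummy)','instance override works');
done_testing();
```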
Benchmark. ✗"Premature optimization is... evil", but writing flat-out crappy code and releasing it into production wastes a great deal more time than would have been spent on getting a little bit more confidence up front. Many languages require you to write a bunch of timing routines; in Perl, those wrappers already exist in the form of Benchmark.
Let's look at a mildly-practical example, the case of determining the number of entries in an array:
1 use Benchmark qw/cmpthese/;
2 sub gen { my ($N)=@_; return (0..$N+int(0.10*$N*rand())) }
3 sub idx { my ($A)=@_; return 1+$#$A }
4 sub scal { my ($A)=@_; return scalar(@$A) }
5 sub cast { my ($A)=@_; return 1*@$A }
6
7 sub compare {
8 my ($N,$cnt)=@_;
9 print "N=$N\n";
10 cmpthese($cnt,{
11 'gen' =>sub { my @A=gen($N) },
12 'index' =>sub { my @A=gen($N); idx(\@A) },
13 'scalar'=>sub { my @A=gen($N); scal(\@A) },
14 'cast' =>sub { my @A=gen($N); cast(\@A) },
15 });
16 }
17
18 compare(50000, 500);
19 compare( 5000, 5000);
20 compare( 500,50000);
This demonstrates, for different sizes of arrays, the average runtime to generate and (optionally) count the number of elements in the generated array. Note that the results depend on the size:
N=50000
           Rate scalar  index   cast    gen
scalar    177/s     --    -3%    -4%    -6%
index     182/s     3%     --    -1%    -3%
cast      185/s     4%     1%     --    -1%
gen       188/s     6%     3%     2%     --
N=5000
           Rate scalar  index    gen   cast
scalar   1497/s     --    -4%    -6%   -14%
index    1553/s     4%     --    -2%   -11%
gen      1587/s     6%     2%     --    -9%
cast     1748/s    17%    13%    10%     --
N=500
           Rate    gen   cast scalar  index
gen     15773/s     --    -0%    -1%    -1%
cast    15823/s     0%     --    -1%    -1%
scalar  15924/s     1%     1%     --    -0%
index   15974/s     1%     1%     0%     --
The results for the 50k(ish) array are sensible: Generating the array is fastest, adding the size-check step is a minor change, but using the last-index approach ($#A) or implicit cast (1*@array) are a few percent faster than a call to scalar(). Unfortunately there's no existing mechanism to provide a baseline (as far as I know), so the percent comparisons actually under-report the differences in this case. (It's easy enough to extend cmpthese using the builtin timethese.) In particular, for implicit casting, the 185/s rate equals 5.405ms of runtime per cycle, but 5.319ms of that is for the array generation itself. The actual rate to "execute the casting step" is therefore 11593/s. By comparison, the scalar rate (177/s) is actually 3025/s, and using the index approach is 88% faster! (Note, by contrast, that the results for the small array are so mixed that "generate and do something" takes less time than just "generate", which should be a clear warning that more iterations are needed during benchmarking to obtain an accurate result.)
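Baseline subtraction can be sketched along these lines; the helper name net_rate is mine, and it relies on Benchmark's documented timediff function and the cpu_p/iters accessor methods:

```perl
use strict;
use warnings;
use Benchmark qw/timethese timediff/;

sub gen { my ($N)=@_; return (0..$N+int(0.10*$N*rand())) }

my $N   = 50000;
my $cnt = 500;

# timethese() returns a hashref of Benchmark result objects, one per label.
my $results = timethese($cnt, {
    'gen'  => sub { my @A = gen($N) },
    'cast' => sub { my @A = gen($N); my $n = 1*@A },
});

# Subtract the baseline: the timediff() object holds the CPU time
# spent on just the extra (counting) step across all iterations.
sub net_rate {
    my ($timed, $baseline) = @_;
    my $diff = timediff($timed, $baseline);
    my $cpu  = $diff->cpu_p;    # parent user+sys seconds
    return $cpu > 0 ? $timed->iters / $cpu : undef;
}

my $rate = net_rate($results->{cast}, $results->{gen});
printf "net cast rate: %.0f/s\n", $rate if defined $rate;
```

On a fast machine the subtracted CPU time can round to zero at these iteration counts, which is why the helper returns undef rather than dividing by zero; bump $cnt up in that case.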
Conclusion: scalar() is the clear loser (except in situations where it might make some code easier to read), and the index approach is the clear winner (unless you want a bunch of people asking why you're multiplying an array by the number one).
Devel::Cover. When I've wanted code coverage, I've found Devel::Cover to be quick and easy to use. The cover command-line tool can be used to produce different report formats, so it's simple to add a coverage target to your usual tests.
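Assuming Devel::Cover is installed, a typical invocation looks like the following (the lib/ and t/ paths are illustrative):

```shell
# Clear any previous coverage database
cover -delete

# Run the test suite with coverage instrumentation enabled
HARNESS_PERL_SWITCHES=-MDevel::Cover prove -l t/

# Summarize, and write an HTML report under cover_db/
cover -report html
```

Wiring those three lines into a `cover` target in a Makefile (or a dzil/minil equivalent) is all it takes to make coverage a routine part of the test run.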
Perl::Critic. Less obvious is Perl::Critic, which can be invoked with the perlcritic tool and flags constructs it deems problematic at five levels of severity. Rules can be included/excluded as command-line options, and you can create custom profiles stored in files within your project repository and load those rule files at runtime.
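A few representative invocations, assuming Perl::Critic is installed (the lib/ path is illustrative; .perlcriticrc is the default profile name):

```shell
# Gentlest pass: only the most severe (level 5) policies
perlcritic --severity 5 lib/

# Stricter pass, excluding one policy by name pattern
perlcritic --severity 3 --exclude RequireUseStrict lib/

# Load project-local rules from a profile checked into the repository
perlcritic --profile .perlcriticrc lib/
```

Checking the profile into the repository means every engineer (and the CI job) critiques the code against the same project-local rule set.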
Adding new style rules is a bit more involved, I must admit, but since this is a matter of understanding the Perl tokenization and parse tree, this is unsurprising. Indeed, I suspect it would be equally complicated for other languages as well. I've had to do this on several occasions to help new engineers understand some project-local needs, particularly within new products.
More on the way...