Optimize reading and writing lines with Perl
By Pierrick, Monday 3 July 2006 at 16:32 / category: Perl
What is the fastest Perl code to read lines from a file, extract fields from each line, modify the fields, and write the lines to a new file? I've been working on this question, and here are the results of my tests. For my benchmark, I used a reference input file of 1,700,000 lines, each containing 6 fields. Example input line: 0000482005032030708847018-000000764930-0000079000.
Reference code
    use FileHandle;    # required for "new FileHandle"

    my $inputFileHandle = new FileHandle;
    open($inputFileHandle, '<', $input_filename)
        or die 'cannot open file "' . $input_filename . '"';

    my $outputFileHandle = new FileHandle;
    open($outputFileHandle, '>', $output_filename)
        or die 'cannot open file "' . $output_filename . '"';

    my $sp = ';';

    while (<$inputFileHandle>) {
        chomp;

        my (
            $field0, $field1, $field2, $field3,
            $field4, $field5, $field6,
        ) = unpack('a6 a6 a13 a1 a12 a1 a*', $_);

        # output everything but $field1
        print {$outputFileHandle} (
            $field0, $sp, $field2, $sp, $field3, $sp,
            $field4, $sp, $field5, $sp, $field6, $sp, "\n"
        );
    }

    close $inputFileHandle;
    close $outputFileHandle;
This is my reference code, the fastest variant. Each line is unpacked into several scalars, and the scalars are printed to the output file. No modification is applied to the $fieldN variables because I don't want to benchmark that for the moment.
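To make the unpack template concrete, here is a minimal sketch that applies it to the sample line from the introduction and rebuilds the output line the way the reference code does. The names `$line`, `@fields` and `$out` are mine and do not come from the benchmark script.

```perl
use strict;
use warnings;

# Sample input line quoted in the introduction.
my $line = '0000482005032030708847018-000000764930-0000079000';

# 'aN' takes N bytes as-is; 'a*' takes whatever is left.
my @fields = unpack('a6 a6 a13 a1 a12 a1 a*', $line);

# Same layout as the reference code: every value except the second,
# each one followed by the ';' separator.
my $sp  = ';';
my $out = join('', map { $_ . $sp } @fields[0, 2 .. 6]) . "\n";

print "$_: $fields[$_]\n" for 0 .. $#fields;
print $out;
```

The last line printed is `000048;2030708847018;-;000000764930;-;0000079000;`, including the trailing separator that the reference code emits.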
Extract fields in scalar, array or hash?
My first question is: "how much would it cost to output the line in a separate sub?" The first obvious additional cost is storing the fields in a data structure such as an array or a hash.
An array
    my @row = unpack('a6 a6 a13 a1 a12 a1 a*', $_);

    print {$outputFileHandle} (
        $row[0], $sp, $row[2], $sp, $row[3], $sp,
        $row[4], $sp, $row[5], $sp, $row[6], "\n"
    );
A hash
    my @columns = qw/field0 field1 field2 field3 field4 field5 field6/;

    while (<$inputFileHandle>) {
        # [...]
        my %row = ();
        @row{@columns} = unpack('a6 a6 a13 a1 a12 a1 a*', $_);

        print {$outputFileHandle} (
            $row{field0}, $sp, $row{field2}, $sp, $row{field3}, $sp,
            $row{field4}, $sp, $row{field5}, $sp, $row{field6}, "\n"
        );
    }
Results
hash : 24.1
array : 13.1
scalar : 10.9
Times are in seconds: each figure is the average over 5 runs of a script reading and writing 1,700,000 lines. The hash is clearly the slowest. The array also costs some time, but the loss of performance is much more acceptable.
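The post does not say how the timings were taken. As one possible way to reproduce the comparison, here is a sketch using the core Benchmark module. It measures only the in-memory unpack and join work on a single line, not file I/O, so its numbers will not match the figures above. The sub names `via_scalars`, `via_array` and `via_hash` are hypothetical.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $line     = '0000482005032030708847018-000000764930-0000079000';
my $template = 'a6 a6 a13 a1 a12 a1 a*';
my $sp       = ';';
my @columns  = qw/field0 field1 field2 field3 field4 field5 field6/;

# Each sub unpacks one line and rebuilds the output string, skipping field1.
sub via_scalars {
    my ($f0, $f1, $f2, $f3, $f4, $f5, $f6) = unpack($template, $line);
    return join($sp, $f0, $f2, $f3, $f4, $f5, $f6) . "\n";
}

sub via_array {
    my @row = unpack($template, $line);
    return join($sp, $row[0], $row[2], $row[3], $row[4], $row[5], $row[6]) . "\n";
}

sub via_hash {
    my %row;
    @row{@columns} = unpack($template, $line);
    return join($sp, $row{field0}, $row{field2}, $row{field3},
                     $row{field4}, $row{field5}, $row{field6}) . "\n";
}

# Run each variant 200,000 times and print a comparison table.
cmpthese(200_000, {
    scalars => \&via_scalars,
    array   => \&via_array,
    hash    => \&via_hash,
});
```

Because every variant here pays for a sub call, the sketch is only useful for comparing the three storage choices with each other.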
The problem with the array is the loss of readability: plain numeric indexes are harder to read than clear field names. A solution (inspired by Pequel) is to use constants for the field names. Using simple scalars is not a solution for my problem, because I want a "generic" coding template: I don't know in advance the number of fields or their names.
    use constant FIELD0 => 0;
    print $row[FIELD0];
[bench results]
array with constant : 13.5
array : 13.1
We lose a little performance, but an acceptable amount.
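For a generic template where the column list is not known in advance, the constants could be generated from a list of names instead of being written by hand. The following is only a sketch of one way to do that. The `@columns` list and the `%index_of` hash are hypothetical, and the post itself only shows a single hand-written constant.

```perl
use strict;
use warnings;

my @columns;
BEGIN {
    # Hypothetical column list; a real script would describe its own input format.
    @columns = qw/field0 field1 field2 field3 field4 field5 field6/;

    # Build FIELD0 => 0, FIELD1 => 1, ... and declare them all in one call.
    # This runs at compile time, so the names are known to the code below.
    my %index_of;
    @index_of{ map { uc } @columns } = 0 .. $#columns;
    require constant;
    constant->import(\%index_of);
}

my @row = unpack('a6 a6 a13 a1 a12 a1 a*',
                 '0000482005032030708847018-000000764930-0000079000');

# Named indexes read like hash keys but are plain array lookups.
print $row[FIELD0], ';', $row[FIELD6], "\n";
```

The `BEGIN` block matters here: under `use strict`, the constant names must exist before Perl compiles the lines that use them.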
Print output in an external sub?
So, what about using an external sub to print the lines (after storing the fields extracted from the input line in an array)?
    while (<$inputFileHandle>) {
        chomp;
        my @row = unpack('a6 a6 a13 a1 a12 a1 a*', $_);
        writeOutputFile( \@row, $outputFileHandle );
    }

    sub writeOutputFile {
        my ($row, $fh) = @_;
        my $sp = ';';

        # $row is an array reference, so elements are read with ->
        print {$fh} (
            $row->[0], $sp, $row->[2], $sp, $row->[3], $sp,
            $row->[4], $sp, $row->[5], $sp, $row->[6], "\n"
        );
    }
Results:
with sub : 17.3
without : 11.0
The test without a sub is the fastest variant (using only scalars). We can see that calling a sub is very time consuming. In the general case you won't see the difference, but in a loop of 1,700,000 iterations it adds up.
In conclusion: store the fields in an array and flatten your code. Avoid calling an external sub millions of times.
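Put together, the recommended shape might look like the sketch below. To keep it self-contained it reads from and writes to in-memory strings instead of real files. The `$input` and `$output` variables are stand-ins I introduced, not part of the benchmark.

```perl
use strict;
use warnings;

# In-memory stand-ins for the real input and output files.
my $input  = "0000482005032030708847018-000000764930-0000079000\n" x 3;
my $output = '';

open(my $in,  '<', \$input)  or die "cannot open in-memory input: $!";
open(my $out, '>', \$output) or die "cannot open in-memory output: $!";

my $sp = ';';

while (my $line = <$in>) {
    chomp $line;

    # Fields go into an array rather than a hash.
    my @row = unpack('a6 a6 a13 a1 a12 a1 a*', $line);

    # Printing stays inline: no sub call per line.
    print {$out} (
        $row[0], $sp, $row[2], $sp, $row[3], $sp,
        $row[4], $sp, $row[5], $sp, $row[6], "\n"
    );
}

close $in;
close $out;

print $output;
```

Swapping the two in-memory `open` calls for opens on real file names gives the file-to-file version.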