regex - Perl - Split string on comma. Ignore whitespace


Keywords:regex 


Question: 

I have this string:

$str="     a, b,    c>d:e,  f,    g ";

In this string there might be spaces and/or tabs

I split the string in perl:

my (@COLUMNS) = split(/[\s\t,]+/, $str);

But this creates an empty leading element at position [0].

@COLUMNS=[

    a
    b
    c>d:e
    f
    g
]

I want this:

@COLUMNS=[
    a
    b
    c>d:e
    f
    g
]

2 Answers: 

I suggest that you use a global regex match to find all subsequences of characters that are neither commas nor whitespace.

It produces the same fields as your split(/[\s\t,]+/, $str), but without any empty elements. (Note that the \t is superfluous because \s also matches tabs.)

use strict;
use warnings 'all';

my $str = "     a, b,    c>d:e,  f,    g ";

my @columns = $str =~ /[^\s,]+/g;

use Data::Dump;
dd \@columns;

output

["a", "b", "c>d:e", "f", "g"]
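To see where the extra element comes from, here is a small sketch that runs both approaches on the string from the question. The variable names @from_split and @from_match are only for illustration. split returns an empty leading field because the separator pattern matches at the very start of the string, while the global match returns only the runs between separators.

```perl
use strict;
use warnings;

my $str = "     a, b,    c>d:e,  f,    g ";

# split keeps the empty string that precedes the leading whitespace
my @from_split = split /[\s,]+/, $str;

# a global match returns only the runs of non-separator characters
my @from_match = $str =~ /[^\s,]+/g;

printf "split: %d fields, first is '%s'\n", scalar @from_split, $from_split[0];
printf "match: %d fields, first is '%s'\n", scalar @from_match, $from_match[0];
```

This should report 6 fields for split (the first one empty) and 5 for the match.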

Note that, just like your split, this method will ignore any empty fields: something like a,,,b will return [ 'a', 'b' ] instead of [ 'a', '', '', 'b' ]. Also, columns that contain whitespace will be split, so a,two words,b will produce [ 'a', 'two', 'words', 'b' ] instead of [ 'a', 'two words', 'b' ]. Only you can tell whether these situations are likely to arise.
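The two edge cases just described can be checked with a short sketch. The helper name fields_by_match is hypothetical and simply wraps the global match shown earlier.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of characters that are
# neither commas nor whitespace
sub fields_by_match {
    my ($str) = @_;
    return $str =~ /[^\s,]+/g;
}

my @empty_fields = fields_by_match('a,,,b');          # empty fields vanish
my @inner_space  = fields_by_match('a,two words,b');  # "two words" is split

print join('|', @empty_fields), "\n";   # a|b
print join('|', @inner_space),  "\n";   # a|two|words|b
```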

If there is any chance that this method will produce the wrong results, then it is better to simply split on commas and write a subroutine to trim the resulting fields.

use strict; 
use warnings 'all';

sub trim(;$);

my $str="     a  ,, ,two words ,,, b";
my @columns = map trim, split /,/, $str;

use Data::Dump;
dd \@columns;


sub trim(;$) {
    (my $trimmed = $_[0] // $_) =~ s/\A\s+|\s+\z//g;
    $trimmed;
}

output

["a", "", "", "two words", "", "", "b"]
 

A pretty common solution to this is to transform the values returned from split. In this case you want to remove any leading or trailing space, normally called a trim operation. Using this approach you don't have to worry about spaces at all in the split operation:

use strict; 
use warnings; 

my $str="     a, b,    c>d:e,  f,    g ";
my @columns = map { s/^\s*|\s*$//gr } split(/,/, $str);
print join(',', @columns), "\n";

Another solution, as @toolic mentions above, is to remove all whitespace beforehand. Note that this also removes any whitespace inside a field:

use strict; 
use warnings; 

my $str="     a, b,    c>d:e,  f,    g ";
$str =~ s/\s+//g; # remove all whitespace (spaces, tabs, newlines)
my @columns = split(/,/, $str);
print join(',', @columns), "\n";

Both of the above solutions print this output:

a,b,c>d:e,f,g
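The two approaches only agree when no field contains internal whitespace. As a rough illustration, here is a sketch using a made-up input string (not from the question) with a two-word field. The variable names are placeholders.

```perl
use strict;
use warnings;

my $str = " a , two words ,b ";

# Approach 1: split on commas, then trim each field (needs Perl 5.14+ for /r)
my @trimmed = map { s/^\s+|\s+$//gr } split /,/, $str;

# Approach 2: strip all whitespace from a copy first, then split
(my $squeezed = $str) =~ s/\s+//g;
my @stripped = split /,/, $squeezed;

print join('|', @trimmed),  "\n";   # a|two words|b
print join('|', @stripped), "\n";   # a|twowords|b
```

If fields may legitimately contain spaces, the trim approach is the safer choice of the two.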

More information about the /r modifier:

/r is a non-destructive modifier for substitutions: the original string is not modified; instead, a copy is created, modified, and returned. This is useful because without /r the s/// operator returns the number of substitutions that occurred (or a false value if there were none) rather than the modified string. /r is only available in Perl 5.14 and later. An equivalent statement for earlier versions would be:

my $original = "some_string";
(my $copy = $original) =~ s/$search_pattern/$replace_pattern/;

And to use it in a map:

map { 
   (my $temp = $_) =~ s/$search_pattern/$replace_pattern/; $temp 
} split /$delimiter/, $original;
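As a concrete version of that template, here is a sketch with the placeholders filled in for the string from the question. It assumes a comma delimiter and a trim pattern, and it avoids /r so it should also run on Perl versions before 5.14.

```perl
use strict;
use warnings;

my $original = "     a, b,    c>d:e,  f,    g ";

my @columns = map {
    (my $temp = $_) =~ s/^\s+|\s+$//g;   # trim a copy, leave $_ untouched
    $temp;                               # return the trimmed copy
} split /,/, $original;

print join(',', @columns), "\n";   # a,b,c>d:e,f,g
```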

For example:

my $string = 'abc';
my $num_substitutions = $string =~ s/a/d/; # 1, and $string is now 'dbc'

$string = 'abc';
my $new_string = $string =~ s/a/d/r; # 'dbc', and $string is still 'abc'