regex - Split the data with 3 delimiters and store it in 2 separate arrays at a time


Keywords:regex 


Question: 

I have data with three delimiters (colon, comma and semicolon). In each line, : and ; appear at most once.

__DATA__

1:X,Y,X,A,B;C,D,E,F 
2:A,C,B,D
3:W,R,T,E;E

Step 1:

Split by : and build a hash

Step 2:

Split by , and store every comma-separated value in an array until we find ;

Step 3:

Everything that follows ; would be in another array

From the above data, I am trying to store all the values before ; in array A and everything to its right in array B.

Output
A = [X,Y,X,A,B,A,C,B,D,W,R,T,E]  B = [C,D,E,F,E]

Below is the code I tried:

my (@A,@B);
sub Compare_results  
{
  my %result_hash = map { chomp; split ':', $_ } <DATA> ; # split by colon and further split by , and ; if any (doing it in insert_array)
 foreach my $key ( sort { $a <=> $b } (keys %result_hash) )
 {

   @A = split ",", (/([^;]+)/)[0], $result_hash{$key};
   @B = split ",", (/;([^;]+)/)[0], $result_hash{$key};
   print Dumper \@A,\@B;
 }    

}

But this is not producing any results; the output arrays are empty. What's the right approach to split the data by , and ; at the same time and store the pieces in separate arrays? Is there also a way to split the data by all three delimiters (including the split that builds the hash) in one shot?

Thanks


2 Answers: 

Many problems: open needs a file name, not filehandle contents (unless DATA contains the file name, which it doesn't). To keep the values in the arrays, use push, not assignment - you can't assign to two arrays at the same time, anyway, as the first one eats everything. Also, doing everything in one command might be possible, but definitely not readable and maintainable.

#!/usr/bin/perl
use warnings;
use strict;

use Data::Dumper;

my $fh = *DATA{IO};
my (@A, @B);

# Step 1: split each line on ':' to build a hash of line number => rest of line.
my %result_hash = map { chomp; split /:/ } <$fh>;

for my $key (sort { $a <=> $b } keys %result_hash) {
    # Steps 2 and 3: everything before ';' goes left, everything after goes right.
    my ($left, $right) = split /;/, $result_hash{$key};
    push @A, split /,/, $left;
    # $right is undefined when the line has no ';'.
    push @B, split /,/, $right // q();
}

print Dumper(\@A, \@B, \%result_hash);

__DATA__
1:X,Y,X,A,B;C,D,E,F
2:A,C,B,D
3:W,R,T,E;E
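
As for the "one shot" part of the question: because : and ; each appear at most once per line, a single split on the character class [:;] can return the key, the left list and the (possibly missing) right list together. What follows is only a minimal sketch under that assumption. The sub name split_record is made up for illustration, and the // operator needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: split one record on ':' and ';' in a single call.
# Assumes ':' and ';' each appear at most once, as in the question's data.
sub split_record {
    my ($line) = @_;
    chomp $line;
    my ($key, $left, $right) = split /[:;]/, $line, 3;
    return (
        $key,
        [ split /,/, $left  // '' ],   # values before ';'
        [ split /,/, $right // '' ],   # values after ';' (empty if no ';')
    );
}

my (@A, @B);
for my $line ("1:X,Y,X,A,B;C,D,E,F\n", "2:A,C,B,D\n", "3:W,R,T,E;E\n") {
    my ($key, $left, $right) = split_record($line);
    push @A, @$left;
    push @B, @$right;
}

print "A = [", join(',', @A), "]\n";   # A = [X,Y,X,A,B,A,C,B,D,W,R,T,E]
print "B = [", join(',', @B), "]\n";   # B = [C,D,E,F,E]
```

This is compact, but it stops working as soon as a ; can legitimately appear before the :, or either delimiter can repeat, which is one reason to prefer the step-by-step version.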
 

The trick is to do each step separately, and in a different order. The order is tokenize then parse. Basically, break it up into pieces, then do something with those pieces.

The next trick is to tokenize recursively. That is, rather than trying to tokenize everything in one go, break big tokens up into smaller tokens, and those into smaller tokens, and so on until you hit bottom. First the line, then the CSVs.

Looking at it this way, the first layer of the grammar looks something like this (whitespace is ignored).

LINE = LINENUM : CSV [ ; CSV ]

The second CSV is optional because a line such as 2:A,C,B,D has no ;. Note that at this point we don't care what's in the CSV. We'll assume we don't have to deal with quoting and escaping, otherwise things get complicated.

There are a few ways to deal with this. One is to use a regex to tokenize the whole thing in one shot.

chomp $line;
my($linenum, @csvs) = $line =~ /^(.*?) : ([^;]*) (?: ; (.*) )? $/x;

Now that you have the @csvs separated from everything else they need to be tokenized. You can turn them into more tokens by splitting on commas.

push @left,  split /,/, $csvs[0];
push @right, split /,/, $csvs[1] // '';

And there you go. By tokenizing each layer you avoid the complexity of trying to parse everything in one go.


As to your function, there are many things which can be done to improve it. Mostly, have it do one thing: parse the file. Something else opens the file.

Also, everything it needs should be passed in and returned; no using globals (yes, a my variable declared outside the function counts as a global).

use strict;
use warnings;
use v5.10;  # for say()

my($left, $right) = parse_whatever_this_format_is_called(*DATA);

say "Left:  ". join ", ", @$left;
say "Right: ". join ", ", @$right;

sub parse_whatever_this_format_is_called {
    # Take the filehandle to read as input
    my $fh = shift;

    # Declare our outputs
    my(@left, @right);

    # Parse each line
    while( my $line = <$fh>) {
        chomp $line;

        # Tokenize LINE = LINENUM : CSV [ ; CSV ]
        # The second CSV is optional: some lines have no ';'.
        my($linenum, @csvs) = $line =~ /^(.*?) : ([^;]*) (?: ; (.*) )? $/x;

        # Skip lines that didn't match
        next if !defined $linenum;

        # Split the CSVs
        push @left,  split /,/, $csvs[0];
        push @right, split /,/, $csvs[1] // '';
    }

    # Return our outputs as references.
    # It's the only way to return multiple lists.
    # Also it avoids the expense of a copy.
    return( \@left, \@right );
}

__DATA__
1:X,Y,X,A,B;C,D,E,F
2:A,C,B,D
3:W,R,T,E;E
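
The first step in the question also asks for a hash keyed by the number before the colon, which the function above throws away. Here is a sketch of one way to keep it, under the same assumptions about the format (no quoting or escaping, at most one ; per line). The name parse_records and the left/right hash keys are illustrative only, and the data is read from an in-memory filehandle so the example stands on its own.

```perl
use strict;
use warnings;

# Hypothetical variant: keep each record under its line number.
sub parse_records {
    my ($fh) = @_;
    my %records;

    while ( my $line = <$fh> ) {
        chomp $line;

        # LINENUM : CSV, optionally followed by ; CSV
        my ($linenum, $left, $right) =
            $line =~ /^ ([^:]*) : ([^;]*) (?: ; (.*) )? $/x;

        # Skip lines that didn't match
        next unless defined $linenum;

        $records{$linenum} = {
            left  => [ split /,/, $left ],
            right => [ split /,/, $right // '' ],
        };
    }

    return \%records;
}

# Same sample data as above, read through an in-memory filehandle.
my $data = "1:X,Y,X,A,B;C,D,E,F\n2:A,C,B,D\n3:W,R,T,E;E\n";
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
my $records = parse_records($fh);

for my $linenum ( sort { $a <=> $b } keys %$records ) {
    printf "%s: left=[%s] right=[%s]\n", $linenum,
        join( ',', @{ $records->{$linenum}{left} } ),
        join( ',', @{ $records->{$linenum}{right} } );
}
```

Keeping the lists per line number means the flat left and right arrays can still be built afterwards by walking the sorted keys, while the per-line grouping stays available if it is needed later.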