Perl: how to find items between any same two arbitrary regular expressions?


Keywords:regex 


Question: 

I am trying to find and extract the text between any two identical but arbitrary keywords. For example, given the string:

"aa bb aa cc 11 dd bb 11 cc"

...I want to get:

"aa bb aa"

"bb aa cc 11 dd bb"

"cc 11 dd bb 11 cc"

"11 dd bb 11"

When I try m/(\w+).+?($1)/, or a lookahead, nothing seems to work: I can't find all of the matches.


3 Answers: 

I prefer a non-regex solution to this. The program below does what is required.

It first splits the string into items and stores them in the array @items. The hash %indexes maps each distinct item to the list of indexes in @items where it appears, and the array @keys holds the keys of that hash in the order they first appear in @items. (There's no need for @keys if the order of the output is immaterial.)

Slices of the @items array are printed for each value of %indexes that has two or more indexes. while is used with splice in case an item appears four or more times, in which case the output will run from the first appearance to the second, from the third to the fourth, and so on. If this isn't required then the program can be simplified further.

use strict;
use warnings 'all';

my $str = "aa bb aa cc 11 dd bb 11 cc";
my @items = split ' ', $str;
my ( %indexes, @keys);

for my $i ( 0 .. $#items ) {
    my $key = $items[$i];
    push @keys, $key unless $indexes{$key};
    push @{ $indexes{$key} }, $i;
}

for my $key ( @keys ) {
    my @indexes = @{ $indexes{$key} };
    while ( @indexes >= 2 ) {
        my ( $beg, $end ) = splice @indexes, 0, 2;
        print "@items[$beg .. $end]\n";
    }
}

output

aa bb aa
bb aa cc 11 dd bb
cc 11 dd bb 11 cc
11 dd bb 11
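
If pairing every occurrence with the next one is acceptable (first to second, second to third, and so on), the bookkeeping can be reduced to remembering the last index at which each item was seen. The following is only a sketch under that assumption: the sub name spans_between_repeats is hypothetical, and its pairing differs from the splice version above whenever an item appears three or more times.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original answer): returns the text
# from each occurrence of an item to its NEXT occurrence.
sub spans_between_repeats {
    my ($str) = @_;
    my @items = split ' ', $str;
    my ( %last_seen, @spans );

    for my $i ( 0 .. $#items ) {
        my $key = $items[$i];

        # If this item was seen before, record [previous index, current index]
        if ( defined( my $prev = $last_seen{$key} ) ) {
            push @spans, [ $prev, $i ];
        }
        $last_seen{$key} = $i;
    }

    # Order the spans by where they start, then turn each into a string
    return map { join ' ', @items[ $_->[0] .. $_->[1] ] }
           sort { $a->[0] <=> $b->[0] } @spans;
}

print "$_\n" for spans_between_repeats("aa bb aa cc 11 dd bb 11 cc");
```

For the string in the question this should print the same four lines as the program above, because no item appears more than twice.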


Three problems:

  • You're using $1, but inside the regex itself a backreference is written with a backslash, not a dollar sign: \1.

  • You're trying to match whole words, but your regex lacks word boundaries.

  • You say you've tried lookaheads, but you don't say how.
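
To illustrate the first two points, here is a small sketch comparing a backreference with and without word boundaries. The sub names first_span_loose and first_span_strict and the sample string "cat concat cat" are made up for this illustration; they are not from the question.

```perl
use strict;
use warnings;

# Note: a $1 written inside a pattern is interpolated before the pattern is
# compiled, so it holds whatever an EARLIER match captured (often nothing).
# \1, \2, ... refer to groups of the match currently in progress.

# Hypothetical helper: no word boundaries, so the backreference may match
# inside a longer word.
sub first_span_loose {
    my ($text) = @_;
    return $text =~ /((\w+).+?\2)/ ? $1 : undef;
}

# Hypothetical helper: \b on both sides forces whole-word matches.
sub first_span_strict {
    my ($text) = @_;
    return $text =~ /(\b(\w+)\b.+?\b\2\b)/ ? $1 : undef;
}

my $sample = "cat concat cat";
print "loose:  ", first_span_loose($sample),  "\n";
print "strict: ", first_span_strict($sample), "\n";
```

The loose version should stop at the "cat" inside "concat", while the strict version continues to the next standalone "cat".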

The regex you want is:

(?=((\b\w+\b).+?(\b\2\b)))

You also need to add the /g flag and do the matching in a while loop to get all the results:

my $subject = "aa bb aa cc 11 dd bb 11 cc";
while ($subject =~ m/(?=((\b\w+\b).+?(\b\2\b)))/g) {
    print "$1\n";
}

The matches will be in $1 because the whole match occurs inside a lookahead, meaning $& will be empty.
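
To see why the lookahead matters, the sketch below runs the same pattern with and without it and collects the results into arrays. The sub names consuming_matches and overlapping_matches are hypothetical. Without the lookahead, each match consumes its text, so any span that starts inside an earlier match is skipped.

```perl
use strict;
use warnings;

# Hypothetical helper: ordinary /g matching. Each match consumes its text,
# so the next search resumes after the end of the previous match.
sub consuming_matches {
    my ($text) = @_;
    my @found;
    while ( $text =~ /(\b(\w+)\b.+?\b\2\b)/g ) {
        push @found, $1;
    }
    return @found;
}

# Hypothetical helper: the pattern sits inside (?=...), so every match is
# zero-length and the search moves forward one position at a time.
sub overlapping_matches {
    my ($text) = @_;
    my @found;
    while ( $text =~ /(?=(\b(\w+)\b.+?\b\2\b))/g ) {
        push @found, $1;
    }
    return @found;
}

my $subject = "aa bb aa cc 11 dd bb 11 cc";
print "consuming:   $_\n" for consuming_matches($subject);
print "overlapping: $_\n" for overlapping_matches($subject);
```

On the string from the question, the consuming version should find only "aa bb aa" and "cc 11 dd bb 11 cc", while the lookahead version finds all four spans.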

Here's a regex demo (Regex101.com) and a code demo (Ideone.com).



If I understand your question correctly, you can use a (?{ code }) block together with (*FAIL):

#!/usr/bin/perl 

use strict;
use warnings;

my $s = 'aa bb aa cc 11 dd bb 11 cc';
$s =~ /((\b\w+\b).*\2)(?{print "$&\n"})(*FAIL)/g;

The result seems to be what you expected:

$ perl test.pl
aa bb aa
bb aa cc 11 dd bb
cc 11 dd bb 11 cc
11 dd bb 11
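
If the spans are needed as data rather than printed, the same (*FAIL) idea can push each candidate onto an array. This is a sketch, not the answer's exact code: the sub name all_spans is hypothetical, it uses $1 instead of $&, it adds \b around the backreference, and it stores results in a package variable as a cautious choice for code blocks inside a sub.

```perl
use strict;
use warnings;

our @found;    # package variable filled by the (?{ ... }) block below

# Hypothetical helper: records every span the regex engine can reach.
# (*FAIL) rejects each candidate after it is recorded, which forces the
# engine to backtrack and try the next possibility.
sub all_spans {
    my ($text) = @_;
    local @found = ();
    $text =~ /(\b(\w+)\b.*\b\2\b)(?{ push @found, $1 })(*FAIL)/;
    return @found;
}

print "$_\n" for all_spans('aa bb aa cc 11 dd bb 11 cc');
```

Because every backtracking path is explored, a word that appears three times should yield every pairing of its occurrences (longest first for each starting point), not just adjacent ones. That differs from the lookahead approach, which reports at most one span per starting position.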