regex - How to generate a list of repeating patterns from a string in TCL?


Keywords:regex 


Question: 

set s1 "dir1/dir2/some_word_g3_ger_another_word_g1_ger_TEMP2"

How can I get the list {some_word_g3_ger_ another_word_g1_ger_} from s1?

I tried this:

regexp -inline -all {[^/]+_ger_} $s1

But it fails to split the tokens and returns a single element:

some_word_g3_ger_another_word_g1_ger_


2 Answers: 

You need to make the match non-greedy, so that it ends as soon as a minimal match has been found rather than after consuming as much text as possible. This is done with the +? quantifier, the non-greedy counterpart of +. Here the quantifier is applied to a non-capturing group ((?:...)) that wraps the whole pattern, so that the match as a whole prefers the shortest text.

% regexp -inline -all {(?:[^/]+_ger_)+?} $s1
some_word_g3_ger_ another_word_g1_ger_
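
To see the difference side by side, the following sketch runs the greedy pattern from the question and the non-greedy one on the same input and counts the elements each returns. The variable names greedy and lazy are only illustrative.

```tcl
set s1 "dir1/dir2/some_word_g3_ger_another_word_g1_ger_TEMP2"

# Greedy: [^/]+ runs on to the last _ger_, so both tokens end up in one match.
set greedy [regexp -inline -all {[^/]+_ger_} $s1]

# Non-greedy: each match stops at the first _ger_ it reaches.
set lazy [regexp -inline -all {(?:[^/]+_ger_)+?} $s1]

puts [llength $greedy]   ;# 1
puts [llength $lazy]     ;# 2

# The result is an ordinary Tcl list and can be iterated directly.
foreach token $lazy {
    puts $token
}
```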

ETA:

A regular expression is helpful here since it can deal with both skipping the unwanted text and chopping up the tokens. If it is practicable to remove the unwanted text in a first step, several other methods become at least as useful. For example:

set s1 some_word_g3_ger_another_word_g1_ger_
string map {_ger_ {_ger_ }} $s1

(This results in the string "some_word_g3_ger_ another_word_g1_ger_ " with a trailing blank, but it is still functionally equivalent to the list of those two tokens.)
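
The first step is not shown above. One possible way to do it, assuming the unwanted text is always a directory prefix plus whatever follows the last _ger_, is sketched below. The variable names tail, last, trimmed and tokens are illustrative, and split is used only to turn the mapped string into a proper list without the trailing blank.

```tcl
set s1 "dir1/dir2/some_word_g3_ger_another_word_g1_ger_TEMP2"

# Step 1: drop the directory part, then everything after the last _ger_.
set tail [file tail $s1]
set last [string last _ger_ $tail]
if {$last < 0} {
    # No marker at all: nothing to extract.
    set tokens {}
} else {
    # _ger_ is 5 characters long, so the kept text ends at $last + 4.
    set trimmed [string range $tail 0 [expr {$last + 4}]]

    # Step 2: put a blank after every _ger_, trim the trailing one, and split.
    set tokens [split [string trimright [string map {_ger_ {_ger_ }} $trimmed]]]
}

puts $tokens
# => some_word_g3_ger_ another_word_g1_ger_
```

Splitting on whitespace is safe here only because the tokens themselves contain no blanks.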

Documentation: regexp, Syntax of Tcl regular expressions

 

Here's another technique, using string commands:

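# Drop the directory part so that only the last path component is searched.
set base [file tail $s1]
# Initialise the result so that "set bits" also works when nothing is found.
set bits {}
set start 0
while {1} {
    # Find the next _ger_ at or after $start.
    set idx [string first _ger_ $base $start]
    if {$idx == -1} break
    # _ger_ is 5 characters long, so the token ends at $idx+4.
    lappend bits [string range $base $start $idx+4]
    set start [expr {$idx + 5}]
}
set bits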
# => some_word_g3_ger_ another_word_g1_ger_
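
If the same loop is needed for other markers, it could be wrapped in a procedure. This is only a sketch: the name splitAfter and its parameters are hypothetical, and the marker length is computed rather than hard-coded as 5.

```tcl
# Hypothetical helper: return the pieces of str that each end with marker.
# Any text after the last marker is discarded.
proc splitAfter {str marker} {
    set pieces {}
    set len [string length $marker]
    # An empty marker would never advance the search, so bail out early.
    if {$len == 0} {
        return $pieces
    }
    set start 0
    while {[set idx [string first $marker $str $start]] != -1} {
        # Index of the last character of this occurrence of the marker.
        set stop [expr {$idx + $len - 1}]
        lappend pieces [string range $str $start $stop]
        set start [expr {$stop + 1}]
    }
    return $pieces
}

set s1 "dir1/dir2/some_word_g3_ger_another_word_g1_ger_TEMP2"
puts [splitAfter [file tail $s1] _ger_]
# => some_word_g3_ger_ another_word_g1_ger_
```

As in the loop above, file tail is applied first so that the directory names do not end up in the first piece.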