regex - c# Split string using another string as delimiter and include delimiter as part of the split string


Keywords:c# 


Question: 

I need to split an input string using C# regex, and I need to know how to include the delimiter content in the output, as shown below.

input:

string content="heading1: contents with respect to heading1 heading2: heading2 contents heading3: heading 3 related contents sample strings";

string[] delimters = new string[] {"heading1:","heading2:","heading3:"};

Expected output:

outputArray[0] = heading1: contents with respect to heading1
outputArray[1] = heading2: heading2 contents
outputArray[2] = heading3: heading 3 related contents sample strings

What I tried:

var result = content.Split(delimters,StringSplitOptions.RemoveEmptyEntries);

Output I got:

result [0]: " contents with respect to heading1 "
result [1]: " heading2 contents "
result [2]: " heading 3 related contents sample strings"

I couldn't find an API in String.Split or in Regex that splits the string into the expected result.


3 Answers: 

You may use a positive lookahead based solution:

var result = Regex.Split(content, $@"(?={string.Join("|", delimiters.Select(m => Regex.Escape(m)))})")
                  .Where(x => !string.IsNullOrEmpty(x));

See the C# demo:

var content="heading1: contents with respect to heading1 heading2: heading2 contents heading3: heading 3 related contents sample strings";
var delimiters = new string[] {"heading1:","heading2:","heading3:"};
Console.WriteLine(
    string.Join("\n", 
        Regex.Split(content, $@"(?={string.Join("|", delimiters.Select(m => Regex.Escape(m)))})")
             .Where(x => !string.IsNullOrEmpty(x))
    )
);

Output:

heading1: contents with respect to heading1 
heading2: heading2 contents 
heading3: heading 3 related contents sample strings

The (?={string.Join("|", delimiters.Select(m => Regex.Escape(m)))}) part constructs the regex dynamically. For the delimiters above, it looks like this:

(?=heading1:|heading2:|heading3:)

See the regex demo. The pattern matches any position in the string that is followed by either heading1:, heading2: or heading3: without consuming these substrings, so they land in the output.

Note that delimiters.Select(m => Regex.Escape(m)) is there to make sure all special regex metacharacters that might be in the delimiters are escaped and treated as literal chars by the regex engine.
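
To see why the escaping matters, here is a minimal sketch that builds the same lookahead pattern with and without Regex.Escape. The EscapeDemo class, the BuildPattern and Split helpers, and the bracketed sample delimiters are made up for illustration; they are not part of the original question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class EscapeDemo
{
    // Hypothetical helper: builds the lookahead pattern from the delimiters.
    // 'escape' toggles Regex.Escape so both variants can be compared.
    public static string BuildPattern(string[] delimiters, bool escape) =>
        "(?=" + string.Join("|", delimiters.Select(d => escape ? Regex.Escape(d) : d)) + ")";

    // Same split-and-filter approach as in the answer above.
    public static string[] Split(string content, string[] delimiters, bool escape) =>
        Regex.Split(content, BuildPattern(delimiters, escape))
             .Where(x => !string.IsNullOrEmpty(x))
             .ToArray();

    static void Main()
    {
        // Assumed sample data: '[' is a regex metacharacter.
        var content = "[a] first part [b] second part";
        var delimiters = new[] { "[a]", "[b]" };

        // Escaped: the delimiters are matched literally, giving two parts.
        foreach (var part in Split(content, delimiters, escape: true))
            Console.WriteLine("escaped:   '" + part + "'");

        // Unescaped: "[a]" and "[b]" are character classes, so the string
        // is split before every 'a' or 'b' character instead.
        foreach (var part in Split(content, delimiters, escape: false))
            Console.WriteLine("unescaped: '" + part + "'");
    }
}
```

With escaping, the two parts start with "[a]" and "[b]" as intended. Without it, the split also happens inside words such as "part", which is probably not what anyone wants.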

 

Instead of splitting, I suggest finding all the delimiter matches and ordering them by position:

private static IEnumerable<string> Solution(string source, string[] delimiters) {
  int from = 0;
  int length = 0;

  // Points at which we can split: every match of every delimiter pattern
  var points = delimiters
      .SelectMany(delimiter => Regex
        .Matches(source, delimiter)
        .OfType<Match>()
        .Select(match => new {
          index = match.Index,
          length = match.Length, // length of the matched text, not of the pattern
          delimiter = delimiter,
        }))
      .OrderBy(item => item.index)
      .ThenBy(item => Array.IndexOf(delimiters, item.delimiter)); // tie break

  foreach (var point in points) {
    // Skip matches that overlap the previously accepted delimiter
    if (point.index >= from + length) {
      // Condition: we don't want the very first empty part
      if (from != 0 || point.index - from != 0)
        yield return source.Substring(from, point.index - from);

      from = point.index;
      length = point.length;
    }
  }

  yield return source.Substring(from);
}

Test:

string content = 
  "heading1: contents with respect to heading1 heading2: heading2 contents heading3: heading 3 related contents sample strings";

string[] delimiters = new string[] { 
  "heading1:", "heading2:", "heading3:" };

Console.WriteLine(string.Join(Environment.NewLine, Solution(content, delimiters)));

Outcome:

heading1: contents with respect to heading1 
heading2: heading2 contents 
heading3: heading 3 related contents sample strings

If we split by digits instead (2nd test; each delimiter is treated as a regex pattern):

Console.WriteLine(string.Join(Environment.NewLine, Solution(content, new string[] {"[0-9]+"})));

we'll get:

heading
1: contents with respect to heading
1 heading
2: heading
2 contents heading
3: heading 
3 related contents sample strings
 
string content = "heading1: contents with respect to heading1 heading2: heading2 contents heading3: heading 3 related contents sample strings";
string[] delimters = new string[] { "heading1:", "heading2:", "heading3:" };

var dels = string.Join("|", delimters);
var pattern = "(" + dels + ").*?(?=" + dels + "|\\Z)";

var outputArray = Regex.Matches(content, pattern);

foreach (Match match in outputArray)
    Console.WriteLine(match);

The pattern is the following:

(heading1:|heading2:|heading3:).*?(?=heading1:|heading2:|heading3:|\Z)

This is similar to the answer of Wiktor Stribiżew.
And of course the delimiters should be passed through Regex.Escape before they are joined into the pattern, as he has shown.
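
For completeness, a sketch of the same matching approach with the escaping applied. The MatchDemo class and the MatchSections helper are illustrative names only. The Trim() call is an extra assumption on my side, added because the expected output in the question has no trailing spaces.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class MatchDemo
{
    // Hypothetical helper: returns each delimiter together with the text
    // that follows it, up to the next delimiter or the end of the input.
    public static List<string> MatchSections(string content, string[] delimiters)
    {
        // Escape each delimiter so it is matched literally.
        var dels = string.Join("|", delimiters.Select(d => Regex.Escape(d)));

        // Same pattern shape as above: delimiter, lazy body, lookahead stop.
        var pattern = "(" + dels + ").*?(?=" + dels + "|\\Z)";

        return Regex.Matches(content, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value.Trim()) // assumption: surrounding spaces are unwanted
                    .ToList();
    }

    static void Main()
    {
        var content = "heading1: contents with respect to heading1 heading2: heading2 contents heading3: heading 3 related contents sample strings";
        var delimiters = new[] { "heading1:", "heading2:", "heading3:" };

        foreach (var section in MatchSections(content, delimiters))
            Console.WriteLine(section);
    }
}
```

One difference from the lookahead split is worth knowing: any text before the first delimiter is not part of a match, so it is silently dropped here, while Regex.Split would return it as the first element.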