c# - Regex.Split returning whitespaces


Keywords:c# 


Question: 

I want to export a View as a HTML-Document to the User on my ASP.NET page. I want to give the option to only get a part of the view.

Because of that I want to split the output with Regex.Split(). I wrote a Regex that matches the part I want to cut out. After splitting I put the 2 output parts together again.

The problem is that I get a list of 3 parts, of which the second contains " ". How can I change the code so that the output contains only 2 strings?

My Code:

textParts = Regex.Split(text, @"<!--Graphic2-->(.|\n)*<!--EndDiscarded-->");
text = textParts[0] + textParts[1];

text contains HTML, CSS and jQuery Code. I wrote comments like <!--Graphic2--> around the blocks I want to cut out.

EDIT

I got it working now by using the Regex.Replace() Method. But I still don't know why Split isn't working how I expected.


1 Answer: 

You should consider parsing HTML with the proper tools, like HtmlAgilityPack.

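The current question is about why Regex.Split returned 3 values. That is due to the presence of a capturing group in your pattern. Regex.Split returns the substrings that are not matched (from the start of the string to the first match, between consecutive matches, and from the last match to the end of the string), plus the text captured by each capturing group. From the documentation: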

If capturing parentheses are used in a Regex.Split expression, any captured text is included in the resulting string array. For example, if you split the string "plum-pear" on a hyphen placed within capturing parentheses, the returned array includes a string element that contains the hyphen.
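
A quick way to check the documented behavior is to split the same "plum-pear" string with and without capturing parentheses. This is only a minimal console sketch; the class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CapturingSplitDemo
{
    // Plain hyphen: only the text between the matches is returned.
    public static string[] SplitPlain(string input) =>
        Regex.Split(input, "-");

    // Hyphen inside capturing parentheses: the captured hyphen
    // is added to the result array as well.
    public static string[] SplitCapturing(string input) =>
        Regex.Split(input, "(-)");

    public static void Main()
    {
        Console.WriteLine(string.Join("|", SplitPlain("plum-pear")));     // plum|pear
        Console.WriteLine(string.Join("|", SplitCapturing("plum-pear"))); // plum|-|pear
    }
}
```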

So, Regex.Split(text, @"<!--Graphic2-->(.|\n)*<!--EndDiscarded-->") matches the <!--Graphic2--> substring, then matches any 0+ chars, as many as possible, capturing each of them into Group 1, and then matches <!--EndDiscarded-->. The matched text is removed and the substrings that are not matched are returned. However, a repeated capturing group keeps the value of its last iteration, and that value is returned, too. Here, it is the last char before <!--EndDiscarded-->, which explains the " " you get as the second item.
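
The effect can be reproduced on a small input. The sketch below uses a made-up HTML string (an assumption standing in for the real view, with a single space before <!--EndDiscarded-->) and compares the pattern from the question with the same pattern using a non-capturing group, (?:...), which is the smallest change that gets rid of the extra item. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuestionSplitDemo
{
    // Hypothetical stand-in for the exported view.
    // Note the single space right before <!--EndDiscarded-->.
    public const string Sample =
        "<p>keep</p><!--Graphic2--><div>drop</div> <!--EndDiscarded--><p>tail</p>";

    // Pattern from the question: (.|\n) is a capturing group.
    public static string[] SplitWithCapture(string text) =>
        Regex.Split(text, @"<!--Graphic2-->(.|\n)*<!--EndDiscarded-->");

    // Same pattern with a non-capturing group: no captured text is returned.
    public static string[] SplitWithoutCapture(string text) =>
        Regex.Split(text, @"<!--Graphic2-->(?:.|\n)*<!--EndDiscarded-->");

    public static void Main()
    {
        // Three items: the middle one is the last char captured by Group 1.
        foreach (var part in SplitWithCapture(Sample))
            Console.WriteLine("[" + part + "]");

        // Two items: only the text before and after the match.
        foreach (var part in SplitWithoutCapture(Sample))
            Console.WriteLine("[" + part + "]");
    }
}
```

With the non-capturing version, textParts[0] + textParts[1] from the question gives the text without the block, as long as the pattern matches exactly once.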

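So, if you plan to use regex for this task, you should consider rewriting the pattern without the capturing group. One option is @"(?s)<!--Graphic2-->.*?<!--EndDiscarded-->", where (?s) lets . match line breaks, too, and the lazy .*? stops at the first <!--EndDiscarded--> rather than the last one. Another is the unrolled @"<!--Graphic2-->[^<]*(?:<(?!!--EndDiscarded)[^<]*)*<!--EndDiscarded-->" that will be much more efficient, or even @"<!--Graphic2-->[^<]*(?:<(?!!--(?:EndDiscarded|Graphic2))[^<]*)*<!--EndDiscarded-->" that will ensure no nested Graphic2 comments are matched.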
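
As a rough illustration, here is how the first two patterns might be used, both with Regex.Replace (the approach mentioned in the EDIT) and with Regex.Split followed by string.Concat. The sample input, the class name and the method names are assumptions made for this sketch only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripBlockDemo
{
    // Lazy version: (?s) makes . match line breaks as well.
    public const string LazyPattern =
        @"(?s)<!--Graphic2-->.*?<!--EndDiscarded-->";

    // Unrolled version: consumes non-'<' chars in bulk and only stops
    // at a '<' that starts the <!--EndDiscarded comment.
    public const string UnrolledPattern =
        @"<!--Graphic2-->[^<]*(?:<(?!!--EndDiscarded)[^<]*)*<!--EndDiscarded-->";

    // Removes every matched block in one call.
    public static string StripWithReplace(string text, string pattern) =>
        Regex.Replace(text, pattern, string.Empty);

    // Splits on the block and glues all remaining parts back together.
    public static string StripWithSplit(string text, string pattern) =>
        string.Concat(Regex.Split(text, pattern));

    public static void Main()
    {
        // Hypothetical input with a line break inside the discarded block.
        string text = "<h1>Report</h1><!--Graphic2--><div id=\"g2\">chart</div>\n"
                    + "<!--EndDiscarded--><p>Summary</p>";

        Console.WriteLine(StripWithReplace(text, LazyPattern));
        Console.WriteLine(StripWithSplit(text, UnrolledPattern));
        // Both lines print: <h1>Report</h1><p>Summary</p>
    }
}
```

Joining with string.Concat rather than textParts[0] + textParts[1] keeps the code working if the block occurs more than once, since neither pattern contains a capturing group that could add extra items.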

As you can see, the complexity of the patterns rises when you want them to work more efficiently and more safely. However, even these longer versions do not guarantee 100% safety.