c# - Regex split by non-alphanumeric characters with special treatment of words with apostrophes (contractions)



I am trying to split a string using Regex in C#. I want to split it on all non-alphanumeric characters, but I would like to treat a word with an apostrophe as a whole word when it contains a contraction such as 'd, 's, or 't.
An example should clarify what I would like to achieve. Given a sentence such as:

"Steve's dog is mine 'not yours' I know you'd like'it"

I would like to obtain the following tokens:

steve's, dog, is, mine, not, yours, i, know, you'd, like, it

At the moment I am using:

Regex.Split(str.ToLower(), @"[^a-zA-Z0-9_']").Where(s => s != String.Empty).ToArray<string>();

It returns:

steve's, dog, is, mine, 'not, yours', i, know, you'd, like'it

4 Answers: 

Here is a half-regex-half-LINQ solution:

string s = "Steve's dog is mine 'not yours' I know you'd like'it";
string[] result = Regex.Matches(s, "\\w+('(s|d|t|ve|m))?")
    .Cast<Match>().Select(x => x.Value).ToArray();

I match everything you want to keep, instead of the separators you want to split on. Then I select the matched values and turn them into an array.
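A minimal sketch of the same match-instead-of-split idea, adjusted to produce the lower-case tokens the question asks for. The class name `ContractionTokenizer`, the lower-casing step, and the `\b` after the suffix are assumptions added here for illustration; they are not part of the answer above.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class ContractionTokenizer
{
    // Hypothetical helper: lower-cases the input, then matches whole words,
    // keeping an apostrophe only when it introduces a known suffix.
    public static string[] Tokenize(string text)
    {
        // The \b after the suffix is an added assumption, so that
        // "like'to" is not read as "like't" followed by "o".
        const string pattern = @"\w+(?:'(?:s|d|t|ve|m)\b)?";

        return Regex.Matches(text.ToLowerInvariant(), pattern)
            .Cast<Match>()              // MatchCollection is non-generic on older frameworks
            .Select(m => m.Value)
            .ToArray();
    }
}
```

Calling `ContractionTokenizer.Tokenize("Steve's dog is mine 'not yours' I know you'd like'it")` should give steve's, dog, is, mine, not, yours, i, know, you'd, like, it. Suffixes outside the list (for example 'll or 're) would be cut off at the apostrophe unless they are added to the alternation.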



\w+         // one or more word characters
(?:         // start of optional non-capturing group
'           // apostrophe
(?![aeiou]) // negative lookahead: the next character must not be a vowel
\w+         // one or more word characters
)?          // end of optional group
  • Catches: should've, i'm, 'tis
  • Doesn't catch: rock 'n' roll
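As a rough illustration, the pattern above can be applied in C# like this. The class name `LookaheadTokenizer` and the lower-casing step are assumptions made for this sketch; lower-casing first also means the `[aeiou]` lookahead covers upper-case vowels.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class LookaheadTokenizer
{
    // A word, optionally followed by an apostrophe and more word characters,
    // unless the character right after the apostrophe is a vowel.
    private const string Pattern = @"\w+(?:'(?![aeiou])\w+)?";

    // Hypothetical helper: returns the lower-cased tokens of the input.
    public static string[] Tokenize(string text)
    {
        return Regex.Matches(text.ToLowerInvariant(), Pattern)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }
}
```

On the question's sentence this should yield steve's, dog, is, mine, not, yours, i, know, you'd, like, it: the apostrophe in like'it is followed by a vowel, so the optional group is skipped and the two halves are matched separately.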



The solution I can think of is something like this:

var txt = "Steve's dog is mine 'not yours' I know you'd like'it, the Hundred Years' War, I'm - they're - don't - o'clock - we've 'the Hundred Years' War of yours'";

// Find the valid `'`s and temporarily replace them with something like `_replaceMe_`
// Then replace the remaining `'`s with a blank space ` `
var osTxt = Regex.Replace(txt.ToLower(), 
    .Replace("\'"," ");

// Now, extract words from sentence and replace `_replaceMe_` back to `'`
var words = Regex.Matches(osTxt, @"\w+")
    .Cast<Match>()
    .Select(c => c.Value.Replace("_replaceMe_", "\'"));

But this will not keep the ' of Years' in a phrase like the Hundred Years' War.
There are also some other valid cases that it ignores ;).
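The placeholder technique described in this answer could be sketched as follows. The pattern that decides which apostrophes are "valid" is an assumption chosen for this sketch (an apostrophe between a word character and a common contraction suffix); it is not the answer's own expression, and the class name `PlaceholderTokenizer` is likewise hypothetical.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class PlaceholderTokenizer
{
    // The placeholder consists only of word characters, so \w+ keeps it inside the word.
    private const string Placeholder = "_replaceMe_";

    // Assumed pattern (not the answer's): an apostrophe that follows a word
    // character and precedes one of a few common contraction suffixes.
    private const string ValidApostrophe = @"(?<=\w)'(?=(?:s|d|t|m|ll|ve|re)\b)";

    public static string[] Tokenize(string text)
    {
        // 1. Protect the apostrophes to keep; 2. blank out all the others.
        string protectedText = Regex
            .Replace(text.ToLowerInvariant(), ValidApostrophe, Placeholder)
            .Replace("'", " ");

        // 3. Extract the words and restore the protected apostrophes.
        return Regex.Matches(protectedText, @"\w+")
            .Cast<Match>()
            .Select(m => m.Value.Replace(Placeholder, "'"))
            .ToArray();
    }
}
```

With this assumed pattern, the question's sentence should tokenize to steve's, dog, is, mine, not, yours, i, know, you'd, like, it. As the answer notes, a trailing possessive such as Years' is still reduced to years, and the sketch would misbehave on input that already contains the literal text `_replaceMe_`.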

// Also covers: I've, I'm, She'll, you're, you've

        var sen = "Steve's dog is mine 'not yours' I know you'd like'it";

        StringBuilder builder = new StringBuilder();

        foreach (Match m in Regex.Matches(sen, @"[^' ]+\w+\'([dstm]|ll|ve|re)|\w+"))