regex - R - How to split text and punctuation with an exception?


Keywords:r 


Question: 

I am analysing Facebook comments in R for sentiment analysis. Emojis are encoded in the text as codes between <> symbols.

Example:

"Jesus te ama!!! <U+2764>  Ou não...?<U+1F628> (fé em stand by)"

<U+2764> and <U+1F628> are emojis (heavy black heart and fearful face, respectively).

So, I need to split words/numbers from punctuation/symbols, except inside emoji codes. Using the gsub function, I did this:

a1  <- "([[:alpha:]])([[:punct:]])"
a2 <- "([[:punct:]])([[:alpha:]])"
b <- "\\1 \\2"
gsub(a1, b, gsub(a2, b, "Jesus te ama!!! <U+2764>  Ou não...?<U+1F628> (fé em stand by)"))

...but the result, logically, also affects the emoji codes:

[1] "Jesus te ama !!! < U +2764>  Ou não ...?< U +1F628> ( fé em stand by )"

The objective is to create an exception for the text between <>: split around it, but don't split inside it, i.e.:

[1] "Jesus te ama !!! <U+2764>  Ou não ...? <U+1F628> ( fé em stand by )"

Note that:

  1. sometimes there is no space between the sentence/word/punctuation and an emoji code (one needs to be created)
  2. a punctuation sequence must stay joined (e.g. "!!!", "...?")

How can I do it?


2 Answers: 

You may use the following regex solution:

a1  <- "(?<=<)U\\+\\w+>(*SKIP)(*F)|(?<=\\S)(?=<U\\+\\w+>)|(?<=[[:alpha:]])(?=[[:punct:]])|(?<=[[:punct:]])(?=[[:alpha:]])"
gsub(a1, " ", "Jesus te ama!!! <U+2764>  Ou não...?<U+1F628> (fé em stand by)", perl=TRUE)
# => [1] "Jesus te ama !!! <U+2764>  Ou não ...? <U+1F628> ( fé em stand by )"


This PCRE regex (see perl=TRUE argument in the call to gsub) matches:

  • (?<=<)U\\+\\w+>(*SKIP)(*F) - U+, then 1+ word chars, then >, but only if immediately preceded by <; the PCRE verbs (*SKIP)(*F) then discard this match and resume the search from the end of it, so nothing inside an emoji code is ever split
  • | - or
  • (?<=\\S)(?=<U\\+\\w+>) - a non-whitespace char must be present immediately to the left of the current location, and a <U+, 1+ word chars and > must be present immediately to the right of the current location
  • | - or
  • (?<=[[:alpha:]])(?=[[:punct:]]) - a letter must be present immediately to the left of the current location, and a punctuation must be present immediately to the right of the current location
  • | - or
  • (?<=[[:punct:]])(?=[[:alpha:]]) - a punctuation must be present immediately to the left of the current location, and a letter must be present immediately to the right of the current location
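
The question also mentions numbers, while the pattern above only treats letters ([[:alpha:]]) as word characters. A possible variant, assuming digits should be separated from punctuation the same way letters are, swaps in [[:alnum:]]. The name a1_num and the sample string are made up for illustration:

```r
# Variant of the pattern above with [[:alnum:]] in place of [[:alpha:]].
# Assumption: digits should be split from punctuation just like letters.
a1_num <- "(?<=<)U\\+\\w+>(*SKIP)(*F)|(?<=\\S)(?=<U\\+\\w+>)|(?<=[[:alnum:]])(?=[[:punct:]])|(?<=[[:punct:]])(?=[[:alnum:]])"

x <- "Top10!!!<U+2764> (nota 10)"   # made-up sample comment
res <- gsub(a1_num, " ", x, perl = TRUE)
res
# => [1] "Top10 !!! <U+2764> ( nota 10 )"
```

Note that this variant would also pull decimal or thousands separators apart (e.g. "3.5" would become "3 . 5"), so it may need adjusting if such numbers matter.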
 
> str <- "Jesus te ama!!! <U+2764>  Ou não...?<U+1F628> (fé em stand by)"
> strsplit(str,"[[:space:]]|(?=[.!?])",perl=TRUE)
[[1]]
 [1] "Jesus"     "te"        "ama"       "!"         "!"         "!"        
 [7] ""          "<U+2764>"  ""          "Ou"        "não"       "."        
[13] "."         "."         "?"         "<U+1F628>" "(fé"       "em"       
[19] "stand"     "by)"
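
In the strsplit() output above, the punctuation runs are broken into single characters, empty strings appear, and "(fé" and "by)" stay glued together. If a token vector is what is needed, one possible sketch is to apply the gsub() pattern from the first answer and then split on whitespace. The variable names spaced and tokens are only illustrative:

```r
# Pattern copied from the first answer.
a1 <- "(?<=<)U\\+\\w+>(*SKIP)(*F)|(?<=\\S)(?=<U\\+\\w+>)|(?<=[[:alpha:]])(?=[[:punct:]])|(?<=[[:punct:]])(?=[[:alpha:]])"
str <- "Jesus te ama!!! <U+2764>  Ou não...?<U+1F628> (fé em stand by)"

# Step 1: insert the missing spaces, leaving emoji codes intact.
spaced <- gsub(a1, " ", str, perl = TRUE)

# Step 2: split on runs of whitespace, so double spaces give no empty tokens.
tokens <- strsplit(trimws(spaced), "[[:space:]]+")[[1]]
tokens
# => "Jesus" "te" "ama" "!!!" "<U+2764>" "Ou" "não" "...?" "<U+1F628>"
#    "(" "fé" "em" "stand" "by" ")"
```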