regex - Extract variable names using stringr in R


Keywords:r 


Question: 

I am trying to extract some variable names and numbers from the following vector and store them into two new variables:

unique_strings <- c("PM_1_PMS5003_S_Avg", "PM_2_5_PMS5003_S_Avg", "PM_10_PMS5003_S_Avg", 
  "PM_1_PMS5003_A_Avg", "PM_2_5_PMS5003_A_Avg", "PM_10_PMS5003_A_Avg", 
  "PNC_0_3_PMS5003_Avg", "PNC_0_5_PMS5003_Avg", "PNC_1_0_PMS5003_Avg", 
  "PNC_2_5_PMS5003_Avg", "PNC_5_0_PMS5003_Avg", "PNC_10_0_PMS5003_Avg", 
  "PM_1_PMS7003_S_Avg", "PM_2_5_PMS7003_S_Avg", "PM_10_PMS7003_S_Avg", 
  "PM_1_PMS7003_A_Avg", "PM_2_5_PMS7003_A_Avg", "PM_10_PMS7003_A_Avg", 
  "PNC_0_3_PMS7003_Avg", "PNC_0_5_PMS7003_Avg", "PNC_1_0_PMS7003_Avg", 
  "PNC_2_5_PMS7003_Avg", "PNC_5_0_PMS7003_Avg", "PNC_10_0_PMS7003_Avg"
)

I would like to extract every character before the PMS for the first variable. This includes the strings that begin with PM or PNC, as well as the underscores and digits. I would like to store these results in a variable called pollutant.

Desired output:

unique(pollutant)
[1] "PM_1" "PM_2_5" "PM_10" "PNC_0_3" "PNC_0_5" "PNC_1_0" "PNC_2_5" "PNC_5_0" "PNC_10"

I would like to extract everything after the PMS for the second variable.

For this, I first tried extracting just the model numbers (four-digit numbers ending in 003) from each string. However, it would be useful to include the A_Avg or S_Avg in the extraction as well.

Here's my first attempt:

library(stringr)
model_id <- str_extract(unique_strings, "[0-9]{4,}")

unique(model_id)
[1] "5003" "7003"

I have not used regex before and am having a difficult time navigating existing docs / stack posts. Your input is appreciated!


2 Answers: 

We can use str_split to split each string on "PMS", then use str_replace to remove the trailing "_" from the first column. The result is a character matrix, m: the first variable is in the first column and the second variable is in the second column.

library(stringr)
m <- str_split(unique_strings, pattern = "PMS", simplify = TRUE)
m[, 1] <- str_replace(m[, 1], "_$", "")
m
#       [,1]       [,2]        
#  [1,] "PM_1"     "5003_S_Avg"
#  [2,] "PM_2_5"   "5003_S_Avg"
#  [3,] "PM_10"    "5003_S_Avg"
#  [4,] "PM_1"     "5003_A_Avg"
#  [5,] "PM_2_5"   "5003_A_Avg"
#  [6,] "PM_10"    "5003_A_Avg"
#  [7,] "PNC_0_3"  "5003_Avg"  
#  [8,] "PNC_0_5"  "5003_Avg"  
#  [9,] "PNC_1_0"  "5003_Avg"  
# [10,] "PNC_2_5"  "5003_Avg"  
# [11,] "PNC_5_0"  "5003_Avg"  
# [12,] "PNC_10_0" "5003_Avg"  
# [13,] "PM_1"     "7003_S_Avg"
# [14,] "PM_2_5"   "7003_S_Avg"
# [15,] "PM_10"    "7003_S_Avg"
# [16,] "PM_1"     "7003_A_Avg"
# [17,] "PM_2_5"   "7003_A_Avg"
# [18,] "PM_10"    "7003_A_Avg"
# [19,] "PNC_0_3"  "7003_Avg"  
# [20,] "PNC_0_5"  "7003_Avg"  
# [21,] "PNC_1_0"  "7003_Avg"  
# [22,] "PNC_2_5"  "7003_Avg"  
# [23,] "PNC_5_0"  "7003_Avg"  
# [24,] "PNC_10_0" "7003_Avg"
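
As a possible variation on the split approach (a sketch, not necessarily what the answerer intended), splitting on the literal "_PMS" avoids the separate str_replace step, and the two columns can then be assigned to named vectors. The vector x is a shortened sample of the question's data, and the name model_info is only illustrative.

```r
library(stringr)

# Shortened sample of the vector from the question
x <- c("PM_1_PMS5003_S_Avg", "PM_2_5_PMS5003_A_Avg", "PNC_10_0_PMS7003_Avg")

# Split on the literal "_PMS" so no trailing underscore is left in column 1
m <- str_split(x, pattern = fixed("_PMS"), simplify = TRUE)

pollutant  <- m[, 1]  # everything before "_PMS"
model_info <- m[, 2]  # everything after "PMS" (hypothetical name)

unique(pollutant)
unique(model_info)
```

This assumes "_PMS" appears exactly once in every string; a string without it would come back unsplit in the first column with an empty second column.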
 

We can use str_extract to match either 'PM' or 'PNC' at the start (^) of the string (^(PM|PNC)), followed by a _ and one or more digits (\\d+), optionally followed by further groups of a _ and a single digit. For the optional part we specify zero or more repetitions ((_\\d)*).

library(stringr)
out <- str_extract(unique_strings, "^(PM|PNC)_\\d+(_\\d)*")

This will give NA for those elements that don't have a match. If we need to remove those:

na.omit(out)
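
If the prefix might not always follow the PM/PNC-plus-digits shape, one alternative is to take everything up to the first "_PMS" with a lookahead. This is only a sketch: x is a shortened sample of the question's data, and "no_match_here" is a made-up element added to show the NA behaviour.

```r
library(stringr)

# Shortened sample plus one made-up element that contains no "_PMS"
x <- c("PM_1_PMS5003_S_Avg", "PM_2_5_PMS7003_A_Avg",
       "PNC_10_0_PMS5003_Avg", "no_match_here")

# Lazily match from the start of the string up to (not including) "_PMS"
pollutant <- str_extract(x, "^.*?(?=_PMS)")

# Elements without "_PMS" come back as NA; drop them if they are not wanted
pollutant_clean <- pollutant[!is.na(pollutant)]
```

Unlike na.omit, subsetting with !is.na() returns a plain character vector without an "na.action" attribute, which can be a little tidier to pass along.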

For the second case, the desired output is not clear. If we need to extract everything after the PMS, we can use a regex lookbehind ((?<=PMS)) and match all the characters that follow (.*):

str_extract(unique_strings, "(?<=PMS).*")
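
If it turns out that the model number and the A_Avg / S_Avg part are wanted separately, str_match with capture groups can pull all the pieces out in one pass. This is a sketch under the assumption that every string has the form <pollutant>_PMS<digits>_<suffix>; x is a shortened sample, and the names parts and suffix are illustrative.

```r
library(stringr)

# Shortened sample of the vector from the question
x <- c("PM_1_PMS5003_S_Avg", "PNC_0_3_PMS7003_Avg")

# Column 1 of the result is the full match; columns 2-4 are the capture groups
parts <- str_match(x, "^(.*)_PMS(\\d+)_(.*)$")

pollutant <- parts[, 2]  # text before "_PMS"
model_id  <- parts[, 3]  # digits right after "PMS"
suffix    <- parts[, 4]  # remainder after the model number (hypothetical name)
```

Strings that do not fit the assumed form produce a row of NA values, so it may be worth checking anyNA(parts) before relying on the result.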