regex - Extract variable names using stringr in R



I am trying to extract some variable names and numbers from the following vector and store them into two new variables:

library(stringr)

unique_strings <- c("PM_1_PMS5003_S_Avg", "PM_2_5_PMS5003_S_Avg", "PM_10_PMS5003_S_Avg", 
  "PM_1_PMS5003_A_Avg", "PM_2_5_PMS5003_A_Avg", "PM_10_PMS5003_A_Avg", 
  "PNC_0_3_PMS5003_Avg", "PNC_0_5_PMS5003_Avg", "PNC_1_0_PMS5003_Avg", 
  "PNC_2_5_PMS5003_Avg", "PNC_5_0_PMS5003_Avg", "PNC_10_0_PMS5003_Avg", 
  "PM_1_PMS7003_S_Avg", "PM_2_5_PMS7003_S_Avg", "PM_10_PMS7003_S_Avg", 
  "PM_1_PMS7003_A_Avg", "PM_2_5_PMS7003_A_Avg", "PM_10_PMS7003_A_Avg", 
  "PNC_0_3_PMS7003_Avg", "PNC_0_5_PMS7003_Avg", "PNC_1_0_PMS7003_Avg", 
  "PNC_2_5_PMS7003_Avg", "PNC_5_0_PMS7003_Avg", "PNC_10_0_PMS7003_Avg")

For the first variable, I would like to extract every character before the PMS. This covers the strings that begin with PM or PNC, along with the underscores and digits that follow. I would like to store these results in a variable called pollutant.

Desired output:

[1] "PM_1" "PM_2_5" "PM_10" "PNC_0_3" "PNC_0_5" "PNC_1_0" "PNC_2_5" "PNC_5_0" "PNC_10"

I would like to extract everything after the PMS for the second variable.

For this, I first tried extracting just the model numbers (four-digit numbers ending in 003) from each string. However, it would be useful to include the A_Avg or S_Avg in the extraction as well.

Here's my first attempt:

model_id <- str_extract(unique_strings, "[0-9]{4,}")

[1] "5003" "7003"

I have not used regex before and am having a difficult time navigating existing docs / stack posts. Your input is appreciated!

2 Answers: 

We can use str_split to split each string on "PMS", then str_replace to remove the trailing "_" from the first column. In the resulting matrix m, the first variable is in the first column and the second variable is in the second column.

m <- str_split(unique_strings, pattern = "PMS", simplify = TRUE)
m[, 1] <- str_replace(m[, 1], "_$", "")
#       [,1]       [,2]        
#  [1,] "PM_1"     "5003_S_Avg"
#  [2,] "PM_2_5"   "5003_S_Avg"
#  [3,] "PM_10"    "5003_S_Avg"
#  [4,] "PM_1"     "5003_A_Avg"
#  [5,] "PM_2_5"   "5003_A_Avg"
#  [6,] "PM_10"    "5003_A_Avg"
#  [7,] "PNC_0_3"  "5003_Avg"  
#  [8,] "PNC_0_5"  "5003_Avg"  
#  [9,] "PNC_1_0"  "5003_Avg"  
# [10,] "PNC_2_5"  "5003_Avg"  
# [11,] "PNC_5_0"  "5003_Avg"  
# [12,] "PNC_10_0" "5003_Avg"  
# [13,] "PM_1"     "7003_S_Avg"
# [14,] "PM_2_5"   "7003_S_Avg"
# [15,] "PM_10"    "7003_S_Avg"
# [16,] "PM_1"     "7003_A_Avg"
# [17,] "PM_2_5"   "7003_A_Avg"
# [18,] "PM_10"    "7003_A_Avg"
# [19,] "PNC_0_3"  "7003_Avg"  
# [20,] "PNC_0_5"  "7003_Avg"  
# [21,] "PNC_1_0"  "7003_Avg"  
# [22,] "PNC_2_5"  "7003_Avg"  
# [23,] "PNC_5_0"  "7003_Avg"  
# [24,] "PNC_10_0" "7003_Avg"
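
To get the two separate variables the question asks for, the matrix columns can be pulled out by index. This is a minimal sketch on a three-element subset of the vector; `x` is a hypothetical name for that subset, and `pollutant` and `model_id` follow the names used in the question.

```r
library(stringr)

# Small subset of unique_strings, for illustration only
x <- c("PM_1_PMS5003_S_Avg", "PM_2_5_PMS5003_A_Avg", "PNC_10_0_PMS7003_Avg")

# Split on the literal "PMS"; simplify = TRUE returns a character matrix
m <- str_split(x, pattern = "PMS", simplify = TRUE)

# Column 1 still ends with "_", so strip it; column 2 is used as-is
pollutant <- str_replace(m[, 1], "_$", "")
model_id  <- m[, 2]

pollutant
# [1] "PM_1"     "PM_2_5"   "PNC_10_0"
model_id
# [1] "5003_S_Avg" "5003_A_Avg" "7003_Avg"
```

On the full vector, wrapping either result in unique() should collapse the repeats, which is closer to the deduplicated output shown in the question.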

We can use str_extract to match either 'PM' or 'PNC' at the start (^) of the string (^(PM|PNC)), followed by a _ and one or more digits (\\d+), optionally followed by further groups of _ and a digit. For the optional groups we specify zero or more repetitions ((_\\d)*).

out <- str_extract(unique_strings, "^(PM|PNC)_\\d+(_\\d)*")

This will give NA for those elements that don't have a match. Those can be removed afterwards if they are not needed.
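
As a sketch of the NA behaviour, here is the same pattern applied to a short vector. The element "Temp_Avg" is made up purely to force a non-match, and dropping NAs with is.na() is one common base R approach, not necessarily the one the answer had in mind.

```r
library(stringr)

# "Temp_Avg" is a hypothetical non-matching element, added only to produce an NA
x <- c("PM_2_5_PMS5003_S_Avg", "Temp_Avg", "PNC_10_0_PMS7003_Avg")

out <- str_extract(x, "^(PM|PNC)_\\d+(_\\d)*")
out
# [1] "PM_2_5"   NA         "PNC_10_0"

# Keep only the elements that matched
matched <- out[!is.na(out)]
matched
# [1] "PM_2_5"   "PNC_10_0"
```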


For the second case, the desired output is not clear. If we need to extract everything after the PMS, we can use a regex lookbehind ((?<=PMS)) and match all the characters that follow (.*)

str_extract(unique_strings, "(?<=PMS).*")
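
A short sketch of what the lookbehind returns on two of the strings, with a narrower variant that keeps only the model number. The second pattern is an illustrative assumption about what might be wanted, since the question leaves the exact output open.

```r
library(stringr)

# Two elements of unique_strings, for illustration only
x <- c("PM_1_PMS5003_S_Avg", "PNC_0_3_PMS7003_Avg")

# (?<=PMS) asserts that "PMS" sits immediately before the match
# without including it; .* then takes the rest of the string
after_pms <- str_extract(x, "(?<=PMS).*")
after_pms
# [1] "5003_S_Avg" "7003_Avg"

# Variant: only the four digits that directly follow "PMS"
model_num <- str_extract(x, "(?<=PMS)\\d{4}")
model_num
# [1] "5003" "7003"
```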