r - Split a column into separate columns using regex


Keywords:r 


Question: 

I have got a single column that includes the information as follows:

1 x=abc1000000\ty=pqr2000000\tz=olk78fgzu_zuii8999_ikooo
2 x=oljhh88999\ty=lop9876666
3 x=frdt876544\ty=ztr6u76532\ty=uzrt899963\tz=wertttts_765342_ioooosww\tz=tzuuuee_66554422_88uuiiid

So, none of the id types has a certain number of occurences in a line. They are all separated by tab. I am searching for a way to obtain the IDs for each line as separate columns and have found tstrsplit within data.table but cannot figure out how I can use it with multiple split arguments. Any ideas?

Edit: The expected format is:

     x1          x2        y1    y2                          z1                  z2
1  abc1000000   NA pqr2000000    NA         olk78fgzu_zuii8999_ikooo            NA    
2  oljhh88999   NA lop9876666    NA                               NA            NA
3  frdt876544   NA ztr6u76532   uzrt899963  wertttts_765342_ioooosww  zuuuee_66554422_88uuiiid

Note that the id's do not include 'id' in their names. So I updated the example accordingly. The id's may appear a number of times in each line. The format given above is just an example to make the question clearer. In reality, the X type of IDs for example, can have 20 occurences in one line. The number of columns for X will be then the maximum number of appearance of this particular type of ID in a line given the whole dataset. The data is pretty large. We are talking about some 30m lines.


5 Answers: 

New answer:

For the updated example you could approach the problem as follows:

dt2 <- dt[, rn := .I
          ][, .(V1 = unlist(tstrsplit(V1, '\t'))), by = rn
            ][, c('id','value') := tstrsplit(V1, '=')
              ][, idn := 1:.N, by = .(rn, id)]

dcast(dt2, rn ~ id + idn, value.var = 'value', sep = '')

which results in:

   rn         x1         y1         y2                       z1                        z2
1:  1 abc1000000 pqr2000000         NA olk78fgzu_zuii8999_ikooo                        NA
2:  2 oljhh88999 lop9876666         NA                       NA                        NA
3:  3 frdt876544 ztr6u76532 uzrt899963 wertttts_765342_ioooosww tzuuuee_66554422_88uuiiid

To get the exact output (thus also including a x2 column), you can do:

dcast(dt2[CJ(rn = rn, id = id, idn = idn, unique = TRUE), on = .(rn, id, idn)], 
      rn ~ id + idn, value.var = 'value', sep = '')

which results in:

   rn         x1 x2         y1         y2                       z1                        z2
1:  1 abc1000000 NA pqr2000000         NA olk78fgzu_zuii8999_ikooo                        NA
2:  2 oljhh88999 NA lop9876666         NA                       NA                        NA
3:  3 frdt876544 NA ztr6u76532 uzrt899963 wertttts_765342_ioooosww tzuuuee_66554422_88uuiiid

Used data:

dt <- fread('"x=abc1000000\ty=pqr2000000\tz=olk78fgzu_zuii8999_ikooo"
             "x=oljhh88999\ty=lop9876666"
             "x=frdt876544\ty=ztr6u76532\ty=uzrt899963\tz=wertttts_765342_ioooosww\tz=tzuuuee_66554422_88uuiiid"',
            header=FALSE)

Answer for the original question:

If you want to use tstrsplit, you could approach it as follows:

dt[, rn := .I
   ][, .(V1 = unlist(tstrsplit(V1, '\t'))), by = rn
     ][, .(rn, id = gsub('([a-z0-9]+)(=.*$)','\\1',V1))]

which results in:

   rn   id
1:  1 xid1
2:  1 yid2
3:  1 zid3
4:  2 xid4
5:  2 yid5
6:  3 xid6
7:  3 yid7
8:  3 yid8
9:  3 zid9

An alternative which results in a wide format output:

dt[, tstrsplit(V1, '\t'),
   ][, lapply(.SD, gsub, pattern = '([a-z0-9]+)(=.*$)', replacement = '\\1')]

which results in:

     V1   V2   V3   V4
1: xid1 yid2 zid3   NA
2: xid4 yid5   NA   NA
3: xid6 yid7 yid8 zid9

And if you want to extract all id's as @UweBlock mentioned, you could also do (though a bit less straightforward then @UweBlock's approach):

l <- regmatches(dt$V1, gregexpr('([a-z]{1}id[0-9]{1})',dt$V1))
l <- lapply(l, as.data.frame.list)
l <- lapply(l, function(x) {names(x) <- paste0('v',seq_along(x)); as.data.table(x)})

rbindlist(l, fill = TRUE)

which results in:

     v1   v2   v3   v4   v5
1: xid1 yid2 zid3   NA   NA
2: xid4 yid5   NA   NA   NA
3: xid6 yid7 yid8 zid8 zid9

Used data:

dt <- fread('"xid1=abc1000000\tyid2=pqr2000000\tzid3=olk78fgzu_zuii8999_ikooo"
"xid4=oljhh88999\tyid5=lop9876666"
"xid6=frdt876544\tyid7=ztr6u76532\tyid8=uzrt899963tzid8=wertttts_765342_ioooosww\tzid9=tzuuuee_66554422_88uuiiid"',header=FALSE)
 

You didn't specify what the output should look like. In order to beat akrun to the answer, here's a list where elements of the list represent your rows.

In this solution, you can split each row by the tab and find the pattern of [xyz]id[integer].

x <- c("xid1=abc1000000\tyid2=pqr2000000\tzid3=olk78fgzu_zuii8999_ikooo",
       "xid4=oljhh88999\tyid5=lop9876666",
       "xid6=frdt876544\tyid7=ztr6u76532\tyid8=uzrt899963tzid8=wertttts_765342_ioooosww\tzid9=tzuuuee_66554422_88uuiiid")

res <- sapply(x, FUN = function(m) {
  m <- strsplit(m, "\t")
  out <- sapply(m, FUN = function(o) gsub(pattern = "(^[[:alpha:]]id\\d+)(=.*$)", replacement = "\\1", x = o), 
         simplify = FALSE)
  out
  }, simplify = FALSE)

res <- unname(res)
res

[[1]]
[[1]][[1]]
[1] "xid1" "yid2" "zid3"


[[2]]
[[2]][[1]]
[1] "xid4" "yid5"


[[3]]
[[3]][[1]]
[1] "xid6" "yid7" "yid8" "zid9"

If you omit simplify = FALSE and do not unname the result, you can get

$`xid1=abc1000000\tyid2=pqr2000000\tzid3=olk78fgzu_zuii8999_ikooo`
     [,1]  
[1,] "xid1"
[2,] "yid2"
[3,] "zid3"

$`xid4=oljhh88999\tyid5=lop9876666`
     [,1]  
[1,] "xid4"
[2,] "yid5"

$`xid6=frdt876544\tyid7=ztr6u76532\tyid8=uzrt899963tzid8=wertttts_765342_ioooosww\tzid9=tzuuuee_66554422_88uuiiid`
     [,1]  
[1,] "xid6"
[2,] "yid7"
[3,] "yid8"
[4,] "zid9"

If you don't care about where each element came from, you can do

rapply(as.list(x), f = function(m){
  m <- strsplit(m, "\t")
  out <- sapply(m, FUN = function(o) gsub(pattern = "(^[[:alpha:]]id\\d+)(=.*$)", replacement = "\\1", x = o), 
                simplify = FALSE)
})

[1] "xid1" "yid2" "zid3" "xid4" "yid5" "xid6" "yid7" "yid8" "zid9"

but even this can be reconstructed using the first solution (counting number of elements in each list).

 

New answer to the updated question

The OP has now specified the expected result and also has updated the sample data set. So, the answer has to be updated accordingly.

The approach below uses data.table as in Jaap's answer but differs as follows

  1. tstrsplit() is only called once to split up all columns,
  2. melt() is used to melt multiple columns resulting in two value columns, one for the ids, one for the pay load values,
  3. for each row, the occurence of each id is numbered consecutively,
  4. the final dcast() is used to rename the columns appropriately.

The code below should work with an arbitrary number of id-value-pairs in each row as it has been requested by the OP:

library(data.table)   # CRAN version 1.10.4 used
# split input data at "=" and "\t"
splitted <- DT[, tstrsplit(V1, "=|\t")]
# odd numbered columns contain the ids
cols_odd  <- names(splitted)[c(TRUE, FALSE)] # or: seq(1L, ncol(splitted), by = 2L)
# even numbered columns contain the pay load values
cols_even <- names(splitted)[c(FALSE, TRUE)] # or: seq(2L, ncol(splitted), by = 2L)
# add row number before melting, melt multiple columns, remove NA
molten <- melt(splitted[, rn := .I], 
               measure.vars = list(cols_odd, cols_even),
               value.name = c("id", "value"), na.rm = TRUE)
# in each row, number the occurences of each id consecutively
molten[, var.count := rowid(id), rn]
# final reshape from long to wide, 
# create column names from id and id count per row
# remove rn because no longer needed
dcast(molten, rn ~ id + var.count, sep = "")[, rn := NULL][]

results in

           x1         y1         y2                       z1                        z2
1: abc1000000 pqr2000000         NA olk78fgzu_zuii8999_ikooo                        NA
2: oljhh88999 lop9876666         NA                       NA                        NA
3: frdt876544 ztr6u76532 uzrt899963 wertttts_765342_ioooosww tzuuuee_66554422_88uuiiid

In contrast to above result, the result expected by the OP includes an empty x2 column. In case this is really intended by the OP, the missing rows need to be added before the final dcast(). This is achieved by joining a data.table containing all possible combinations of row numbers, ids and id counts per row as follows:

   dcast(molten[CJ(rn, id, var.count, unique = TRUE), 
             on = .(rn = V1, id = V2, var.count = V3)], 
      rn ~ id + var.count, sep = "")[, rn := NULL][]

which yields

           x1 x2         y1         y2                       z1                        z2
1: abc1000000 NA pqr2000000         NA olk78fgzu_zuii8999_ikooo                        NA
2: oljhh88999 NA lop9876666         NA                       NA                        NA
3: frdt876544 NA ztr6u76532 uzrt899963 wertttts_765342_ioooosww tzuuuee_66554422_88uuiiid

Data

library(data.table)
DT <- fread('"x=abc1000000\ty=pqr2000000\tz=olk78fgzu_zuii8999_ikooo"
  "x=oljhh88999\ty=lop9876666"
  "x=frdt876544\ty=ztr6u76532\ty=uzrt899963\tz=wertttts_765342_ioooosww\tz=tzuuuee_66554422_88uuiiid"',
  header=FALSE)

Answer to the original question

As the question is entitled Split a column into separate columns using regex and the OP explicitely is asking for ideas on how to use tstrsplit with multiple split arguments all answers focused on splitting the columns.

However, if the OP is only interested to obtain the IDs for each line as separate columns there is a simple solution available:

stringr::str_extract_all(DT$V1, "\\wid\\d", simplify = TRUE)
#     [,1]   [,2]   [,3]   [,4]   [,5]  
#[1,] "xid1" "yid2" "zid3" ""     ""    
#[2,] "xid4" "yid5" ""     ""     ""    
#[3,] "xid6" "yid7" "yid8" "zid8" "zid9"

or in data.table context:

DT[, data.table(stringr::str_extract_all(V1, "\\wid\\d", simplify = TRUE))]
#     V1   V2   V3   V4   V5
#1: xid1 yid2 zid3          
#2: xid4 yid5               
#3: xid6 yid7 yid8 zid8 zid9

Note that in the third line an ID zid8 has been found which is missing in all other answers. The other answers are looking to split the strings at \t (tab character). It might be that there is a \ missing in row three of the given data set, so the section reads tzid8= instead of \tzid8=.

Data

As provided in the question (including the potential error).

DT <- fread('"xid1=abc1000000\tyid2=pqr2000000\tzid3=olk78fgzu_zuii8999_ikooo"
"xid4=oljhh88999\tyid5=lop9876666"
"xid6=frdt876544\tyid7=ztr6u76532\tyid8=uzrt899963tzid8=wertttts_765342_ioooosww\tzid9=tzuuuee_66554422_88uuiiid"',
header=FALSE)
 

For the updated example and desired result:

library(tidyverse)

df <- data_frame(x = c("x=abc1000000\ty=pqr2000000\tz=olk78fgzu_zuii8999_ikooo", 
                       "x=oljhh88999\ty=lop9876666", 
                       "x=frdt876544\ty=ztr6u76532\ty=uzrt899963\tz=wertttts_765342_ioooosww\tz=tzuuuee_66554422_88uuiiid"))

df_tidy <- df %>% 
    mutate(row = row_number()) %>% 
    separate_rows(x, sep = '\t') %>% 
    separate(x, c('var', 'val'), sep = '=') %>% 
    group_by(row, var) %>% 
    mutate(n = row_number()) %>% 
    unite(var, var, n, sep = '') %>% 
    spread(var, val)

df_tidy
#> # A tibble: 3 x 6
#> # Groups:   row [3]
#>     row         x1         y1         y2                       z1
#> * <int>      <chr>      <chr>      <chr>                    <chr>
#> 1     1 abc1000000 pqr2000000       <NA> olk78fgzu_zuii8999_ikooo
#> 2     2 oljhh88999 lop9876666       <NA>                     <NA>
#> 3     3 frdt876544 ztr6u76532 uzrt899963 wertttts_765342_ioooosww
#> # ... with 1 more variables: z2 <chr>

For the original example, if you want to capture all your data in a tidy form,

library(tidyverse)

df <- data_frame(x = c("xid1=abc1000000\tyid2=pqr2000000\tzid3=olk78fgzu_zuii8999_ikooo", 
                       "xid4=oljhh88999\tyid5=lop9876666", 
                       "xid6=frdt876544\tyid7=ztr6u76532\tyid8=uzrt899963tzid8=wertttts_765342_ioooosww\tzid9=tzuuuee_66554422_88uuiiid"))


df_tidy <- df %>% 
    mutate(row_id = row_number()) %>% 
    separate_rows(x, sep = '\t') %>% 
    separate(x, c('id', 'value'), extra = 'merge')

df_tidy
#> # A tibble: 9 x 3
#>   row_id    id                                    value
#> *  <int> <chr>                                    <chr>
#> 1      1  xid1                               abc1000000
#> 2      1  yid2                               pqr2000000
#> 3      1  zid3                 olk78fgzu_zuii8999_ikooo
#> 4      2  xid4                               oljhh88999
#> 5      2  yid5                               lop9876666
#> 6      3  xid6                               frdt876544
#> 7      3  yid7                               ztr6u76532
#> 8      3  yid8 uzrt899963tzid8=wertttts_765342_ioooosww
#> 9      3  zid9                tzuuuee_66554422_88uuiiid

Add more separate_rows calls if it makes sense to split further by = or _.

 

We can do this elegantly with tidyverse

library(tidyverse)
xl <- x %>%
      data_frame(id = .) %>% 
      rownames_to_column(., 'rn') %>% 
      separate_rows(id, sep = '\t') %>% 
      mutate(id = str_extract(id, "[[:alnum:]]+(?=\\=)"))
xl
# A tibble: 9 x 2
#     rn    id
#  <chr> <chr>
#1     1  xid1
#2     1  yid2
#3     1  zid3
#4     2  xid4
#5     2  yid5
#6     3  xid6
#7     3  yid7
#8     3  yid8
#9     3  zid9

If we need a wide format

xl %>%
   group_by(rn) %>%
   mutate(Seq = paste0("V", row_number())) %>% 
   spread(Seq, id)
# A tibble: 3 x 5
# Groups: rn [3]
#     rn    V1    V2    V3    V4
#* <chr> <chr> <chr> <chr> <chr>
#1     1  xid1  yid2  zid3  <NA>
#2     2  xid4  yid5  <NA>  <NA>
#3     3  xid6  yid7  yid8  zid9

data

x <- c("xid1=abc1000000\tyid2=pqr2000000\tzid3=olk78fgzu_zuii8999_ikooo",
   "xid4=oljhh88999\tyid5=lop9876666",
   "xid6=frdt876544\tyid7=ztr6u76532\tyid8=uzrt899963tzid8=wertttts_765342_ioooosww\tzid9=tzuuuee_66554422_88uuiiid")