New answer to the updated question
The OP has now specified the expected result and also has updated the sample data set. So, the answer has to be updated accordingly.
The approach below uses
data.table as in Jaap's answer but differs as follows
tstrsplit() is only called once to split up all columns,
melt() is used to melt multiple columns resulting in two value columns, one for the ids, one for the pay load values,
- for each row, the occurence of each id is numbered consecutively,
- the final
dcast() is used to rename the columns appropriately.
The code below should work with an arbitrary number of id-value-pairs in each row as it has been requested by the OP:
library(data.table) # CRAN version 1.10.4 used
# split input data at "=" and "\t"
splitted <- DT[, tstrsplit(V1, "=|\t")]
# odd numbered columns contain the ids
cols_odd <- names(splitted)[c(TRUE, FALSE)] # or: seq(1L, ncol(splitted), by = 2L)
# even numbered columns contain the pay load values
cols_even <- names(splitted)[c(FALSE, TRUE)] # or: seq(2L, ncol(splitted), by = 2L)
# add row number before melting, melt multiple columns, remove NA
molten <- melt(splitted[, rn := .I],
measure.vars = list(cols_odd, cols_even),
value.name = c("id", "value"), na.rm = TRUE)
# in each row, number the occurences of each id consecutively
molten[, var.count := rowid(id), rn]
# final reshape from long to wide,
# create column names from id and id count per row
# remove rn because no longer needed
dcast(molten, rn ~ id + var.count, sep = "")[, rn := NULL]
x1 y1 y2 z1 z2
1: abc1000000 pqr2000000 NA olk78fgzu_zuii8999_ikooo NA
2: oljhh88999 lop9876666 NA NA NA
3: frdt876544 ztr6u76532 uzrt899963 wertttts_765342_ioooosww tzuuuee_66554422_88uuiiid
In contrast to above result, the result expected by the OP includes an empty
x2 column. In case this is really intended by the OP, the missing rows need to be added before the final
dcast(). This is achieved by joining a
data.table containing all possible combinations of row numbers, ids and id counts per row as follows:
dcast(molten[CJ(rn, id, var.count, unique = TRUE),
on = .(rn = V1, id = V2, var.count = V3)],
rn ~ id + var.count, sep = "")[, rn := NULL]
x1 x2 y1 y2 z1 z2
1: abc1000000 NA pqr2000000 NA olk78fgzu_zuii8999_ikooo NA
2: oljhh88999 NA lop9876666 NA NA NA
3: frdt876544 NA ztr6u76532 uzrt899963 wertttts_765342_ioooosww tzuuuee_66554422_88uuiiid
DT <- fread('"x=abc1000000\ty=pqr2000000\tz=olk78fgzu_zuii8999_ikooo"
Answer to the original question
As the question is entitled Split a column into separate columns using regex and the OP explicitely is asking for ideas on how to use
tstrsplit with multiple split arguments all answers focused on splitting the columns.
However, if the OP is only interested to obtain the IDs for each line as separate columns there is a simple solution available:
stringr::str_extract_all(DT$V1, "\\wid\\d", simplify = TRUE)
# [,1] [,2] [,3] [,4] [,5]
#[1,] "xid1" "yid2" "zid3" "" ""
#[2,] "xid4" "yid5" "" "" ""
#[3,] "xid6" "yid7" "yid8" "zid8" "zid9"
DT[, data.table(stringr::str_extract_all(V1, "\\wid\\d", simplify = TRUE))]
# V1 V2 V3 V4 V5
#1: xid1 yid2 zid3
#2: xid4 yid5
#3: xid6 yid7 yid8 zid8 zid9
Note that in the third line an ID
zid8 has been found which is missing in all other answers. The other answers are looking to split the strings at
\t (tab character). It might be that there is a
\ missing in row three of the given data set, so the section reads
tzid8= instead of
As provided in the question (including the potential error).
DT <- fread('"xid1=abc1000000\tyid2=pqr2000000\tzid3=olk78fgzu_zuii8999_ikooo"