r - regex to split on anything not a digit


Keywords:regex 


Question: 

I would like to split strings on anything not a digit. In this particular case the strings were dates and times read in from an external .csv file and are not currently in as.POSIXct format.

Ideally I would like to split the strings using regex, but if there is a simpler way to convert them to six columns of numbers using a date / time function that would be of interest as well.

I have already succeeded in creating a regex that splits the strings into six columns, but this regex is not general.

Here are the data:

my.data <- read.csv(text = '
          Date_Time
    18/05/2011 07:32:40
    19/05/2011 13:26:02
    19/05/2011 13:32:47
    19/05/2011 13:45:24
    19/05/2011 14:57:27
    19/05/2011 15:03:18
', header=TRUE, stringsAsFactors = FALSE, na.strings = 'NA', strip.white = TRUE)

Here is a regex statement that splits the strings into six columns:

my.date.time <- data.frame(do.call(rbind, strsplit(my.data$Date_Time,"[/|:|[:space:]]+") ))

The above statement is not general. Here is an unsuccessful attempt at making the regex general by specifying a split on anything that is not a digit:

data.frame(do.call(rbind, strsplit(my.data$Date_Time,"[^\\d]+") ))

After I split the strings into six columns I still need what seems like an excessive number of statements to convert the columns into numeric format:

colnames(my.date.time) <- c('my.day', 'my.month', 'my.year', 'my.hour', 'my.minute', 'my.second')

revised.data <- data.frame(my.data, my.date.time, stringsAsFactors = FALSE)

revised.data$my.day    <- as.numeric(as.character(revised.data$my.day))
revised.data$my.month  <- as.numeric(as.character(revised.data$my.month))
revised.data$my.year   <- as.numeric(as.character(revised.data$my.year))
revised.data$my.hour   <- as.numeric(as.character(revised.data$my.hour))
revised.data$my.minute <- as.numeric(as.character(revised.data$my.minute))
revised.data$my.second <- as.numeric(as.character(revised.data$my.second))
revised.data

str(revised.data)

Thank you for any assistance in generalizing the above regex (or streamlining the procedure using date / time functions). The apply function probably can eliminate most of the as.numeric(as.character) statements, although that is a relatively minor issue.


4 Answers: 

Maybe I missed something but here is my solution:

lisda <- apply(my.data, 1, strsplit, "[^[:digit:]]")
my.data2 <- t(data.frame(lisda))
my.data2
            [,1] [,2] [,3]   [,4] [,5] [,6]
Date_Time   "18" "05" "2011" "07" "32" "40"
Date_Time.1 "19" "05" "2011" "13" "26" "02"
Date_Time.2 "19" "05" "2011" "13" "32" "47"
Date_Time.3 "19" "05" "2011" "13" "45" "24"
Date_Time.4 "19" "05" "2011" "14" "57" "27"
Date_Time.5 "19" "05" "2011" "15" "03" "18"

Just in case you want to convert them all to numeric.

apply(my.data2, 2, function(x) as.numeric(as.character(x)))
 

Give a try to \\D+

> x <- "18/05/2011 07:32:40"
> strsplit(x, "\\D+")
[[1]]
[1] "18"   "05"   "2011" "07"   "32"   "40" 

or

> strsplit(x, "[^0-9]+")
[[1]]
[1] "18"   "05"   "2011" "07"   "32"   "40" 
 

Using cSplit

library(splitstackshape)
tmp = cSplit(my.data, "Date_Time", "/")
out = cSplit(tmp, "Date_Time_3", ":")

if you read your data like this

my.data <- read.csv(text = 'Date Time
18/05/2011 07:32:40
19/05/2011 13:26:02
19/05/2011 13:32:47
19/05/2011 13:45:24
19/05/2011 14:57:27
19/05/2011 15:03:18', header=TRUE, sep =' ' ,stringsAsFactors = FALSE, na.strings = 'NA', strip.white = TRUE)

you could do

library(splitstackshape)
out = cSplit(my.data, splitCols = c("Date", "Time"), sep = c("/", ":"))

#> out
#   Date_1 Date_2 Date_3 Time_1 Time_2 Time_3
#1:     18      5   2011      7     32     40
#2:     19      5   2011     13     26      2
#3:     19      5   2011     13     32     47
#4:     19      5   2011     13     45     24
#5:     19      5   2011     14     57     27
#6:     19      5   2011     15      3     18
 

You might consider using read.pattern from the gsubfn package for this:

library(gsubfn)
read.pattern(text = my.data$Date_Time, pattern = "\\d+")

#   V1 V2   V3 V4 V5 V6
# 1 18  5 2011  7 32 40
# 2 19  5 2011 13 26  2
# 3 19  5 2011 13 32 47
# 4 19  5 2011 13 45 24
# 5 19  5 2011 14 57 27
# 6 19  5 2011 15  3 18

Then you can simply assign the column names as you desire.