ruby - Parse multiline text with pattern



Here is a small example:

02-09-17 1:01 PM - Some User (Add comments)

How are you?


02-09-17 3:29 PM - Another User (Add comments)

Thanks, all is fine.

Some another text here.

02-09-17 4:30 AM - Just a User (Add comments)
some text

I want to parse and process these three comments. What is the best way to do this?

I tried a regex like this, but had problems with the /m flag. Without the multiline flag, the regex can't capture the "body" of a comment.

I also tried splitting by a regex:

text_above.split(/^(\d{1,2}-\d{1,2}-\d{2} \d{1,2}:\d{1,2} [AP]M - .+ \(Add comments\))/)
=> ["",
"02-09-17 1:01 PM - Some User (Add comments)",
"\n" + "Hello,\n" + "\n" + "How are you?\n" + "\n" + "Regards,\n" + "\n",
"02-09-17 3:29 PM - Another User (Add comments)",
"\n" + "Hey,\n" + "\n" + "Thanks, all is fine.\n" + "\n" + "Some another text     here.\n" + "\n",
"02-09-17 4:30 AM - Just a User (Add comments)",
"\n" + "some text\n" + "with\n" + "multiline\n" + "\n",
"02-09-17 5:29 PM - Another User (Add comments)",
"\n" + "Hey,\n" + "\n" + "Thanks, all is fine.\n" + "\n" + "Some another text here.\n" + "\n",
"02-09-17 6:30 AM - Just a User (Add comments)",
"\n" + "some text\n" + "with\n" + "multiline\n"]

But this is not a convenient solution.

Ideally, I want regex captures with two or three groups per comment, for example:

1. 02-09-17 1:01 PM
2. Some User (Add comments)
3. Hello,

How are you?


for each comment, or an Array of comments:

[['02-09-17 1:01 PM - Some User (Add comments) Hello,

How are you?


Any ideas? Thanks.

3 Answers: 

You can keep it simple using two splits (one for the whole string and one for each block):

text.split(/\n\n(?=\d\d-)/).map { |m| m.split(/ - |\n/, 3) }
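
As a rough, self-contained sketch of the two-split idea, run against a shortened sample: SAMPLE and split_comments are illustrative names, not part of the original answer, and the final strip is an addition, because the body keeps a leading newline whenever a blank line follows the header. It assumes that neither the timestamp nor the user name contains " - ".

```ruby
# Shortened sample in the same layout as the question's text (illustrative).
SAMPLE = <<~TEXT
  02-09-17 1:01 PM - Some User (Add comments)

  Hello,

  How are you?

  02-09-17 3:29 PM - Another User (Add comments)
  some text
  with
  multiline
TEXT

# Hypothetical helper: returns [time, user, body] triples.
def split_comments(text)
  text
    .split(/\n\n(?=\d\d-)/)                   # split on a blank line that is followed by a date
    .map { |block| block.split(/ - |\n/, 3) } # time, user, and the rest as the body
    .map { |time, user, body| [time, user, body.to_s.strip] }
end

split_comments(SAMPLE).each { |row| p row }
```

Note that the outer split only cuts where a blank line precedes the next header, so a header that directly follows the previous body stays inside that body.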

You can also use the scan method, but the pattern is a little more fiddly:

text.scan(/([\d-]+[^-]+) - (.*)\n(.*(?>\n.*)*?(?=\n\n\d\d-|\z))/)
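
A possible way to use the scan pattern, shown as a sketch rather than a definitive version: the regex is the one above, while SAMPLE, COMMENT_PATTERN, scan_comments, and the hash keys are names made up for illustration. The body is stripped because the capture can start or end with a newline.

```ruby
# Shortened sample in the same layout as the question's text (illustrative).
SAMPLE = <<~TEXT
  02-09-17 1:01 PM - Some User (Add comments)

  Hello,

  How are you?

  02-09-17 3:29 PM - Another User (Add comments)
  some text
  with
  multiline
TEXT

# Same pattern as in the answer: time, user, then the body up to the next
# blank line + date or the end of the string.
COMMENT_PATTERN = /([\d-]+[^-]+) - (.*)\n(.*(?>\n.*)*?(?=\n\n\d\d-|\z))/

# Hypothetical helper: one hash per comment.
def scan_comments(text)
  text.scan(COMMENT_PATTERN).map do |time, user, body|
    { time: time, user: user, body: body.strip }
  end
end

scan_comments(SAMPLE).each { |comment| p comment }
```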

slice_before might be easier to understand than a huge scan, and it has the advantage of keeping the pattern (split removes it):

data = text.each_line.slice_before(/^\d\d\-\d\d\-\d\d/).map do |block|
  # The first line of each block is the header; the rest is the comment body.
  time, user = block.shift.strip.split(' - ')
  [time, user, block.join.strip]
end

p data
# [["02-09-17 1:01 PM",
#   "Some User (Add comments)",
#   "Hello,\n\nHow are you?\n\nRegards,"],
#  ["02-09-17 3:29 PM",
#   "Another User (Add comments)",
#   "Hey,\n\nThanks, all is fine.\n\nSome another text here."],
#  ["02-09-17 4:30 AM",
#   "Just a User (Add comments)",
#   "some text\nwith\nmultiline"]]

You can use this regular expression:

(\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM)) - (.*?)\r?\n((?:.|\r?\n)+?)(?=\r?\n\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM) - |$)
  • (\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM)) is the first group, the date and time: three two-digit numbers separated by dashes, followed by the time with AM/PM.
  • (.*?)\r?\n((?:.|\r?\n)+?) matches the username up to the first line break (\r?\n) as the second group. After that, anything, including line breaks, is matched and forms the third group, the comment.
  • On its own, this won't work, because it would treat everything from the beginning of the comment up to the end of the file as a comment. Therefore, the match has to stop at the next date/time header. You could do this by repeating the date/time format after the comment and matching non-greedily, but that would include the next date/time in the current match and therefore exclude it from the next match (so every second comment would be skipped). To circumvent this, you can use a positive lookahead: (?=\r?\n\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM) - |$). This asserts that the next date/time header follows, but does not include it in the match. The last comment then ends at the end of the string, $.
  • You need to use the global flag /g but mustn't use the multi-line flag /m: with it, $ would also match at the end of every line, so the comment, which spans multiple lines, would be cut off at its first line break.
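
Since the question is about Ruby, here is a hedged sketch of how this pattern might be adapted there. Two things differ from the description above: Ruby has no /g flag (String#scan returns all matches), and in Ruby $ always matches at the end of each line, so the sketch swaps it for \z, which matches only at the end of the string. SAMPLE, HEADER, COMMENT, and parse_comments are illustrative names, and the body is stripped as an extra step.

```ruby
# Shortened sample in the same layout as the question's text (illustrative).
SAMPLE = <<~TEXT
  02-09-17 1:01 PM - Some User (Add comments)

  Hello,

  How are you?

  02-09-17 3:29 PM - Another User (Add comments)
  some text
  with
  multiline
TEXT

# Date/time part of a header line, reused in the lookahead below.
HEADER = /\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM)/

# The answer's pattern with \z instead of $ for "end of string".
COMMENT = /(#{HEADER}) - (.*?)\r?\n((?:.|\r?\n)+?)(?=\r?\n#{HEADER} - |\z)/

# Hypothetical helper: one hash per comment.
def parse_comments(text)
  text.scan(COMMENT).map do |time, user, body|
    { time: time, user: user, body: body.strip }
  end
end

parse_comments(SAMPLE).each { |comment| p comment }
```

Because the lookahead checks for a full header rather than a blank line, this variant also separates comments whose header directly follows the previous body.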

Here is a live example: