python - Using regex to split text content into dictionary


Keywords:python 


Question: 

I have a text file that follows this format.

LESTER HOLT (00:00:01): Breaking News Tonight: A deadly mass shooting at the airport. A gunman opens fire at baggage claim in Fort Lauderdale, witnesses describing scenes of sheer horror. A silent killer shooting people in the head as they tried to run and hide. Tonight, a storm of questions. Why did he do it? The suspect, a passenger with a firearm in his checked bag. New concerns about airport security before the checkpoint.

(00:00:25): Also breaking tonight the new report from U.S. intelligence: Vladimir Putin himself ordered the effort to influence the election, aimed at hurting Clinton and helping Trump win. What the President-elect is saying after his top-secret briefing.

(00:00:39): And States of Emergency: Millions from coast to coast paralyzed by a massive winter storm.

(00:00:45): NIGHTLY NEWS begins right now.

I am trying to parse this into a Python dictionary of dictionaries: each speaker maps to a dictionary whose keys are timecodes and whose values are the content. I can't split consistently, because there may be text before the timecode (i.e. the speaker name, as in the first paragraph), and because the split character : also appears inside the timecode itself (00:00:00).

I tried splitting each line on a timecode regex:

for line in msg.get_payload().split('\n'):
    regex = r'\d{2}:\d{2}:\d{2}'
    test = re.split(regex, line)
    print(test)
    sleep(1)

This appears to split the text properly, but I lose the value I am splitting on (the timecode), which I intend to use as a key. How can I split the content above so that I get the speaker, the timecode as a key, and the content as a value? The same speaker may appear again later in the text, in which case the new timecodes should be added to that speaker's existing entries.

The output format I am targeting is something along the lines of

{speakers:{'Lester Holt': {'00:00:01':content..., '00:00:25': content...},
'speaker2':{etc,etc,etc} }}

I've tried using the split as mentioned above, but it removes the timecode.

Any thoughts and guidance are appreciated.


2 Answers: 

Don't bother with split. You're trying to get 2-3 pieces of information out of each line, so try the following:

import re

for line in msg.get_payload().split('\n'):
    match = re.search(r'^\s*([^(]*?)\s*\((\d{2}:\d{2}:\d{2})\):\s*(.*)$', line)
    if match:
        # speaker is '' when the line starts directly with a timecode
        speaker, time, message = match.groups()

Speaker will be an empty string if none was present on that line.

Regex explanation:

^                    # Start of line
\s*                  # Drop leading whitespace
([^(]*?)             # Capture the speaker if present (non-paren characters)
\s*                  # Drop whitespace between name and time
\(                   # Drop literal open paren
(\d{2}:\d{2}:\d{2})  # Capture time
\):\s*               # Drop literal close paren, colon and whitespace
(.*)                 # Capture the rest of the line
$                    # End of line
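
To get from those three captured values to the nested dictionary the question asks for, the loop has to remember the most recent speaker and reuse it for lines that start directly with a timecode. The following is only a sketch under a few assumptions: `parse_transcript` and `LINE_RE` are names made up for illustration, the input is a plain string rather than `msg.get_payload()`, each timecode entry sits on a single line, and names are converted with `str.title()` to match the 'Lester Holt' style in the question.

```python
import re
from collections import defaultdict

# Same pattern as above, compiled once.
LINE_RE = re.compile(r'^\s*([^(]*?)\s*\((\d{2}:\d{2}:\d{2})\):\s*(.*)$')

def parse_transcript(text):
    """Return {speaker: {timecode: content}} for a transcript string.

    Hypothetical helper: lines without a speaker are attributed to the
    most recent speaker seen (None if no speaker has appeared yet).
    """
    speakers = defaultdict(dict)
    current = None
    for line in text.split('\n'):
        match = LINE_RE.search(line)
        if not match:
            continue  # blank line, or a line without a timecode
        speaker, time, message = match.groups()
        if speaker:
            # Assumption: normalise "LESTER HOLT" to "Lester Holt"
            current = speaker.title()
        speakers[current][time] = message
    return dict(speakers)
```

Because the inner dictionaries are keyed by speaker, a speaker who returns later in the text simply gets more timecodes added to the existing entry.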
 

Splitting the message into lines is wasteful when what you need is time-stamped paragraphs. re.split keeps the text it split on if the pattern contains a capturing group (see the re.split documentation). Here's my solution:

import re

toks = re.split(r"\((\d\d:\d\d:\d\d)\):", msg.get_payload())[1:]
answer = dict(zip(toks[::2], toks[1::2]))

This creates a dictionary of timestamps and paragraphs. Feel free to use the same approach to split by speaker as well.

Result: { '00:00:01': ' Breaking News Tonight: A .....', '00:00:25': ' Also breaking tonight ......', .... }
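
Extending the same idea to speakers could look roughly like the sketch below. It adds a second, optional capturing group for the name, so re.split returns tokens in groups of three (speaker or None, timestamp, paragraph). The assumptions here are mine, not from the question: speaker names are written in capitals at the start of a line, and `TOKEN_RE` and `split_by_speaker` are hypothetical names.

```python
import re

# Optional ALL-CAPS speaker at the start of a line, then "(HH:MM:SS):".
# Both groups are captured, so re.split keeps them in its result.
TOKEN_RE = re.compile(
    r"^[ \t]*([A-Z][A-Z .'-]*[A-Z])?[ \t]*\((\d\d:\d\d:\d\d)\):",
    re.MULTILINE,
)

def split_by_speaker(text):
    """Return {speaker: {timestamp: paragraph}} using a single re.split."""
    # Drop whatever precedes the first timestamp, as in the code above.
    toks = TOKEN_RE.split(text)[1:]
    speakers = {}
    current = None  # most recent speaker seen
    for name, time, content in zip(toks[0::3], toks[1::3], toks[2::3]):
        if name:  # None when the paragraph has no speaker prefix
            current = name
        speakers.setdefault(current, {})[time] = content.strip()
    return speakers
```

Since this splits the whole payload rather than individual lines, a paragraph that spans several lines stays together under its timestamp.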