python - split a string on any of several substrings, and create a dictionary with {substring used to split: that chunk of text}


Keywords:python 


Question: 

Consider the following text:

"Mr. McCONNELL. yadda yadda jon stewart is mean to me. The PRESIDING OFFICER. Suck it up. Mr. McCONNELL. but noooo. Mr. REID. Really dude?" 

And a list of words to split on:

["McCONNELL", "PRESIDING OFFICER", "REID"]

I want to have the output be the dictionary

{"McCONNELL": "yadda yadda jon stewart is mean to me. but noooo.",
"PRESIDING OFFICER": "Suck it up.",
"REID": "Really dude?"}

So I need a way to split on any of the names in the list, keep track of which name each split happened on, and map that name to the chunk of text that follows it. If more than one chunk of text has the same speaker ("McCONNELL" in the example), the chunks should simply be concatenated.

Edit: Here is the function I have been using. It works on the example, but it is not robust when I run it on much larger inputs, and it isn't clear to me why it goes wrong.

import re

def split_by_speaker(txt, seps):
    '''
    Given raw text and a list of separators (generally possible speaker names), splits based 
    on those names and returns a dictionary of text attributable to that name 
    '''
    speakers = []
    default_sep = seps[0]
    rv = {}

    for sep in seps:
        if sep in txt: 
            all_occurences = [m.start() for m in re.finditer(sep, txt)]
            for occ in all_occurences: 
                speakers.append((sep, occ))

            txt = txt.replace(sep, default_sep)
    temp_t = [i.strip() for i in txt.split(default_sep)][1:]
    speakers.sort(key = lambda x: x[1])
    for i in range(len(temp_t)): 
        if speakers[i][0] in rv: 
            rv[speakers[i][0]] = rv[speakers[i][0]] + " " + temp_t[i]
        else: 
            rv[speakers[i][0]] = temp_t[i]
    return rv 

1 Answer: 

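Use the re module from the standard library to do the splitting. Hint: the separator passed to re.split is a regular expression, so it can be an alternation of the form (WORD1|WORD2|WORD3). Because the alternation is wrapped in a capturing group, re.split keeps the matched separators in the result list, so you can tell which speaker each chunk belongs to.

The following example shows what re.split returns for your text.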

import re

text = "Mr. McCONNELL. yadda yadda jon stewart is mean to me. The PRESIDING OFFICER. Suck it up. Mr. McCONNELL. but noooo. Mr. REID. Really dude?"

speakers = ["McCONNELL", "PRESIDING OFFICER", "REID"]

speakers_re = re.compile('(' + '|'.join([re.escape(s) for s in speakers]) + ')')

print(speakers_re.split(text))

Result:

['Mr. ', 'McCONNELL', 
 '. yadda yadda jon stewart is mean to me. The ', 
 'PRESIDING OFFICER', '. Suck it up. Mr. ', 
 'McCONNELL', '. but noooo. Mr. ', 'REID', '. Really dude?']
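
The speaker names show up in that list only because the pattern is wrapped in parentheses. As a small illustration (the sample string and variable names here are made up for the demo, not taken from the question), compare the same alternation with and without the capturing group:

```python
import re

sample = "a REID b McCONNELL c"

# With a capturing group, re.split keeps each matched separator in the list.
with_group = re.split(r"(REID|McCONNELL)", sample)

# Without the group, the separators are dropped and only the chunks remain.
without_group = re.split(r"REID|McCONNELL", sample)

print(with_group)     # ['a ', 'REID', ' b ', 'McCONNELL', ' c']
print(without_group)  # ['a ', ' b ', ' c']
```

With the group in place, the odd-indexed items are the separators and each even-indexed item after one is the text that follows it.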

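The leftover punctuation and whitespace around each chunk can be removed with another regular expression, or with the string methods .strip(), .lstrip() and .rstrip() (for example chunk.lstrip('. ')). Note that each chunk also ends with the title of the next speaker ("Mr. " or "The "); one way to get rid of it is to make the title part of the separator pattern itself.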
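
Putting the pieces together, here is a minimal sketch of one way to get from the split list to the dictionary asked for in the question. It assumes that every speaker name is preceded by a title and followed by a period, as in the example text; the `titles` parameter and its default values are my own assumption, not something given in the question, and would need adjusting for real data.

```python
import re
from collections import defaultdict

def split_by_speaker(text, speakers, titles=("Mr.", "Mrs.", "Ms.", "The")):
    # Try longer names first so a name that is a prefix of another cannot win.
    names = "|".join(re.escape(s) for s in sorted(speakers, key=len, reverse=True))
    # Assumed format: "<title> <NAME>. "; only the name is captured.
    prefixes = "|".join(re.escape(t) for t in titles)
    pattern = re.compile(r"(?:" + prefixes + r")\s+(" + names + r")\.\s*")

    # parts looks like [preamble, speaker, chunk, speaker, chunk, ...]
    parts = pattern.split(text)

    chunks = defaultdict(list)
    for speaker, chunk in zip(parts[1::2], parts[2::2]):
        chunk = chunk.strip()
        if chunk:
            chunks[speaker].append(chunk)

    # Concatenate all chunks belonging to the same speaker.
    return {speaker: " ".join(pieces) for speaker, pieces in chunks.items()}

text = ("Mr. McCONNELL. yadda yadda jon stewart is mean to me. "
        "The PRESIDING OFFICER. Suck it up. Mr. McCONNELL. but noooo. "
        "Mr. REID. Really dude?")
speakers = ["McCONNELL", "PRESIDING OFFICER", "REID"]

result = split_by_speaker(text, speakers)
print(result)
```

Because the split is done in a single pass over the untouched text, the speakers come out in the order they occur and there are no character offsets to keep in sync. Any text before the first speaker ends up in parts[0] and is ignored here.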