python - Joining a varying number of list elements


Keywords:python 


Question: 

I have the following list of strings:

data = ['1 General Electric (GE)   24581660 $18.19 0.04 0.22 ',
        '2 Qudian ADR (QD)   24227349 12.22 -3.93 -24.33 ',
        '3 Square Cl A (SQ)   16233308 48.86 0.05 0.10 ',
        '4 Teva Pharmaceutical Industries ADR (TEVA)   15830425 13.70 0.22 1.63 ',
        '5 Vale ADR (VALE)   14768221 10.98 0.21 1.95 ',
        '6 Bank of America (BAC)   13938799 26.59 -0.07 -0.26 ',
        '7 Entercom Communications Cl A (ETM)   13087209 12.00 0.10 0.84 ',
        '8 Chesapeake Energy (CHK)   12948648 3.92 -0.05 -1.26 ',
        "9 Macy's (M)   12684478 21.07 0.44 2.13 "]

The format of every string is: count, stock name, volume, then a few more numeric values.

I need to split these strings into a list where each element is one of the items in the string format above, and this is how I attempted to do that:

for i in range(1, len(data)-1):
    split = data[i].split()
    temp = "{} {} {}".format(split[1], split[2], split[3])
    del split[2 : 4]
    split[1] = temp
    print(split)

However, I believe this is inefficient, and it breaks when the name has more or fewer than two words. How would I handle this? Would I have to change how I generate the list of strings (data) in the first place?

EDIT:

final_data = [
    re.split('(?<=\))\s+|(?<=[\d\$-])\s(?=[\d\$-])|(?<=\d)\s(?=[a-zA-Z])', i)
        for i in data[1]]
final_data = [i[:-1]+[i[-1][:-1]] for i in final_data]
print(final_data)

Output:

~/workspace $ python extract.py 2017-11-27-04-26-51-ss.xhtml
[[''],
 [''],
 [''],

 ...,

 [''],
 [''],
 ['']]

2 Answers: 

You can use re.split:

import re
data = ['1 General Electric (GE)   24581660 $18.19 0.04 0.22 ', '2 Qudian ADR (QD)   24227349 12.22 -3.93 -24.33 ', '3 Square Cl A (SQ)   16233308 48.86 0.05 0.10 ', '4 Teva Pharmaceutical Industries ADR (TEVA)   15830425 13.70 0.22 1.63 ', '5 Vale ADR (VALE)   14768221 10.98 0.21 1.95 ', '6 Bank of America (BAC)   13938799 26.59 -0.07 -0.26 ', '7 Entercom Communications Cl A (ETM)   13087209 12.00 0.10 0.84 ', '8 Chesapeake Energy (CHK)   12948648 3.92 -0.05 -1.26 ', "9 Macy's (M)   12684478 21.07 0.44 2.13 "]
final_data = [re.split(r'(?<=[a-zA-Z])\s+(?=\()|(?<=\))\s+|(?<=[\d\$-])\s+(?=[\d\$-])|(?<=\d)\s+(?=[a-zA-Z])', i) for i in data]

Output:

[['1', 'General Electric', '(GE)', '24581660', '$18.19', '0.04', '0.22 '], ['2', 'Qudian ADR', '(QD)', '24227349', '12.22', '-3.93', '-24.33 '], ['3', 'Square Cl A', '(SQ)', '16233308', '48.86', '0.05', '0.10 '], ['4', 'Teva Pharmaceutical Industries ADR', '(TEVA)', '15830425', '13.70', '0.22', '1.63 '], ['5', 'Vale ADR', '(VALE)', '14768221', '10.98', '0.21', '1.95 '], ['6', 'Bank of America', '(BAC)', '13938799', '26.59', '-0.07', '-0.26 '], ['7', 'Entercom Communications Cl A', '(ETM)', '13087209', '12.00', '0.10', '0.84 '], ['8', 'Chesapeake Energy', '(CHK)', '12948648', '3.92', '-0.05', '-1.26 '], ['9', "Macy's", '(M)', '12684478', '21.07', '0.44', '2.13 ']]

With the parentheses removed:

final_data = [[b[1:-1] if b.startswith('(') and b.endswith(')') else b for b in i] for i in final_data]

Output:

[['1', 'General Electric', 'GE', '24581660', '$18.19', '0.04', '0.22 '], ['2', 'Qudian ADR', 'QD', '24227349', '12.22', '-3.93', '-24.33 '], ['3', 'Square Cl A', 'SQ', '16233308', '48.86', '0.05', '0.10 '], ['4', 'Teva Pharmaceutical Industries ADR', 'TEVA', '15830425', '13.70', '0.22', '1.63 '], ['5', 'Vale ADR', 'VALE', '14768221', '10.98', '0.21', '1.95 '], ['6', 'Bank of America', 'BAC', '13938799', '26.59', '-0.07', '-0.26 '], ['7', 'Entercom Communications Cl A', 'ETM', '13087209', '12.00', '0.10', '0.84 '], ['8', 'Chesapeake Energy', 'CHK', '12948648', '3.92', '-0.05', '-1.26 '], ['9', "Macy's", 'M', '12684478', '21.07', '0.44', '2.13 ']]
 

You can split strings on characters

Every string in your original data list has two sections: the stock name, followed by the numeric values. If you split each string on the closing parenthesis, you get a two-element list: one string holding the stock name and one holding the numbers. The numbers are consistently separated by a single space, so you can then split that second string on the space character.
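A minimal sketch of that idea, using a shortened copy of the data list from the question. The helper name split_row is hypothetical, and the sketch assumes every row contains exactly one "(TICKER)" group sitting between the name and the numbers. It also goes one step further than described above by peeling the leading count and the ticker off the name section, which may or may not be what you want:

```python
data = [
    '1 General Electric (GE)   24581660 $18.19 0.04 0.22 ',
    '4 Teva Pharmaceutical Industries ADR (TEVA)   15830425 13.70 0.22 1.63 ',
    "9 Macy's (M)   12684478 21.07 0.44 2.13 ",
]

def split_row(line):
    # Hypothetical helper: assumes the last ")" in the line closes the ticker.
    # Everything before it is "count name (TICKER"; everything after it is
    # the whitespace-separated numeric fields.
    head, _, numbers = line.rpartition(')')
    # The first space-delimited token of the head is the row count.
    count, _, rest = head.strip().partition(' ')
    # The last "(" separates the (possibly multi-word) name from the ticker.
    name, _, ticker = rest.rpartition('(')
    # str.split() with no argument splits on any run of whitespace, so the
    # three spaces after ")" and the trailing space produce no empty strings.
    return [count, name.strip(), ticker] + numbers.split()

final_data = [split_row(line) for line in data]
print(final_data)
```

Because the name is taken as "whatever lies between the count and the last opening parenthesis", it does not matter how many words the name has. If a row could ever lack the parenthesized ticker, this sketch would misparse it, so that case would need its own check.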