regex - Python: removing duplicate letters from tweets



Hello dear fellow programmers, social media comments contain a lot of casual language, often with letters repeated many times, for example "Helloooooo!". For my analysis I want to cut every run of more than 2 identical letters down to exactly 2, so the example becomes "Helloo!". I found a regex that does this, but it also reduces the number of lines in my file from 500,000 to 450,000: some lines now contain multiple tweets instead of just one.

Example of a broken line (the following text should be split into 3 lines in the output file, not 1):

z .. :)"

"USERNAME Am Wochenende gabs das halt für 10 und das DLC für 2,50. Und da das Guthaben hier rumfliegt.. hab ich zugeschlagen :D"

"Wenn das keine #Leseempfehlung ist! Vielen Dank. :) #krimi #sauerland #lesen #lesetipp #rezension URL

Code for processing:

# Runs of a repeated character are limited to 2.
# Error: the output file loses 50,000 lines. Why?
import re
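with open("C:/Users/M/PycharmProjects/Bachelor_Thesis/test/data_feat2.csv", "r", encoding="utf-8") as oldfile1, open("data_feat3.csv", "w", encoding="utf-8") as newfile1:
    for line in oldfile1:
        # (.) matches any character except a newline, so punctuation, digits and quotes are collapsed too
        line = re.sub(r'(.)\1+', r'\1\1', line)
        newfile1.write(line)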

1 Answer: 

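There could be repeated commas, for example from consecutive empty fields. Are they escaped (quoted)? The regex would collapse a run such as ",,," to ",," and change the number of columns in that row. Search your CSV for that.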
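
To see what the pattern from the question does to CSV syntax characters, here is a minimal sketch. The sample rows and the helper names collapse_any and collapse_letters are made up for illustration, and the letters-only pattern is only one possible alternative, not something taken from the question. Note the last case: a field that ends in an escaped quote loses its closing quote, so a CSV parser may then read the following lines as part of the same field. That could be one reason for "missing" lines, but it is a guess without seeing the real file.

```python
import re

# Pattern from the question: ANY character (except newline) repeated is cut to 2.
ANY_CHAR = re.compile(r'(.)\1+')
# Hypothetical alternative: only letters are collapsed ([^\W\d_] = word char, no digit, no underscore).
LETTERS_ONLY = re.compile(r'([^\W\d_])\1+')

def collapse_any(text):
    return ANY_CHAR.sub(r'\1\1', text)

def collapse_letters(text):
    return LETTERS_ONLY.sub(r'\1\1', text)

# Made-up row with empty fields, a stretched word and a number.
row = 'id1,,,,"Helloooooo!!!",1000\n'
print(repr(collapse_any(row)))      # 'id1,,"Helloo!!",100\n'      (commas and digits changed)
print(repr(collapse_letters(row)))  # 'id1,,,,"Helloo!!!",1000\n'  (only letters changed)

# Made-up quoted field ending in an escaped quote ("" inside a quoted field).
quoted = '"He said ""hiiii"""\n'
print(repr(collapse_any(quoted)))   # '"He said ""hii""\n'  (closing quote is gone)
```

If only letters should be shortened, the letters-only variant leaves commas, quotes, digits and punctuation alone, at the cost of no longer shortening things like "!!!".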

Another thing to try is to read the file with the csv module and run the regex on each column individually. This would be far slower, but it would help you debug.
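
A rough sketch of the csv-module approach, assuming the file is a standard comma-separated CSV with double-quote quoting (the question does not say which dialect it uses). The names limit_repeats and clean_csv and the sample data are hypothetical. The function takes already opened file objects, so it can be tried on an in-memory string first.

```python
import csv
import io
import re

REPEATS = re.compile(r'(.)\1+')

def limit_repeats(text):
    """Cut every run of a repeated character down to 2."""
    return REPEATS.sub(r'\1\1', text)

def clean_csv(src, dst):
    """Apply limit_repeats to each field; return the number of rows written."""
    reader = csv.reader(src)
    writer = csv.writer(dst)
    rows = 0
    for row in reader:
        # The regex only sees field contents, never the delimiters or quoting.
        writer.writerow([limit_repeats(field) for field in row])
        rows += 1
    return rows

# Made-up sample: one header row and two data rows.
sample = 'id,text\n1,"Helloooooo, world!!!"\n2,"He said ""hiiii"""\n'
out = io.StringIO()
count = clean_csv(io.StringIO(sample), out)
print(count)  # 3, the same number of rows as the input
print(out.getvalue())
```

For real files, open both with newline="" as the csv documentation recommends, e.g. open("data_feat2.csv", "r", encoding="utf-8", newline=""). Comparing the returned row count with the expected number of tweets shows whether the input itself already contains multi-line fields.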