Hello dear fellow programmers, social media comments contain a lot of casual language in which letters are repeated many times, for example "Helloooooo!". For my analysis I want to reduce every run of more than 2 identical letters to exactly 2, so the example becomes "Helloo!". I found a regex that does this, but it also reduces the number of lines in my file from 500,000 to 450,000. Some lines now contain multiple tweets instead of just one.
Example of a broken line (the following text should be split into 3 lines in the output file, not 1):
z .. :)" "USERNAME Am Wochenende gabs das halt für 10 und das DLC für 2,50. Und da das Guthaben hier rumfliegt.. hab ich zugeschlagen :D" "Wenn das keine #Leseempfehlung ist! Vielen Dank. :) #krimi #sauerland #lesen #lesetipp #rezension URL
Code for processing:
# Repeating letters are limited to 2.
# Error: the output file loses 50,000 lines. Why?
import re

with open("C:/Users/M/PycharmProjects/Bachelor_Thesis/test/data_feat2.csv", "r", encoding="utf-8") as oldfile1, \
        open('data_feat3.csv', 'w', encoding="utf-8") as newfile1:
    for line in oldfile1:
        # Replace any run of 2 or more identical characters with exactly 2
        line = re.sub(r'(.)\1+', r'\1\1', line)
        newfile1.write(line)
# No explicit close() is needed: the with block closes both files.
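One thing I have not ruled out (an assumption, not a confirmed diagnosis): `.` matches every character except a newline, so the pattern also shortens runs of quote characters, turning `"""` into `""`. In a quoted CSV file that could change how a reader pairs up the quotes and make it merge several rows into one. Below is a minimal sketch of a variant that only touches letters; the names `REPEATED_LETTER` and `squeeze_letters` are made up for illustration.

```python
import re

# Assumption: only runs of letters should be shortened. Digits, punctuation,
# quote characters and whitespace are left untouched.
# [^\W\d_] means "a word character that is neither a digit nor an underscore",
# which for str patterns in Python 3 covers Unicode letters such as "ü".
REPEATED_LETTER = re.compile(r'([^\W\d_])\1{2,}')


def squeeze_letters(line):
    """Reduce every run of three or more identical letters to exactly two."""
    return REPEATED_LETTER.sub(r'\1\1', line)


if __name__ == "__main__":
    print(squeeze_letters('Helloooooo!'))
    print(squeeze_letters('"a ""quoted"" word"""'))
```

Using `{2,}` after the backreference means the pattern only fires on three or more identical letters, so ordinary double letters such as the "ll" in "Hello" are not matched at all.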