python - Identifying sections tabbed in from raw text



Consider the text on this page. If you look at the source code, you'll see that the main text is presented exactly as it appears on the page -- there are no HTML divisions or any other obvious way to find paragraphs or tabbed-in sections.

Is there a way to automatically identify and remove sections that are tabbed in from the raw text?

One thing I notice is that when I encode the text with text = unicode(raw_text).encode("utf-8"), I can see a bunch of \n's for line breaks, but no \t's. (This might not be a useful direction to think in, but it's just an idea.)
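One way to check what the indentation actually consists of is to tally the leading whitespace of each line. This is only a minimal sketch, assuming Python 3; the helper name leading_whitespace_counts and the sample string are hypothetical stand-ins for the real page text.

```python
from collections import Counter


def leading_whitespace_counts(text):
    """Tally the exact leading-whitespace prefix of every non-blank line."""
    counts = Counter()
    for line in text.split("\n"):
        if not line.strip():
            continue  # skip blank lines
        # Everything before the first character that is not a space or tab.
        prefix = line[:len(line) - len(line.lstrip(" \t"))]
        counts[prefix] += 1
    return counts


# Hypothetical sample standing in for the page's raw text.
sample = "Main paragraph.\n     Indented quote.\n     More quote.\nBack to main text."
print(leading_whitespace_counts(sample))
```

If the indented sections show up under a prefix made only of spaces, searching for \t will not find them.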

Edit: The following works:

text = unicode(raw_text).encode("utf-8")
y = [x for x in text.split("\n") if "     " not in x]
final = " ".join(y)

1 Answer: 

Well, after looking at the page, the sections are 'tabbed' in with spaces rather than the tab character, so looking for tabs would not be useful. It looks like each section is tabbed in with 5 spaces.

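raw_text = raw_text.replace('     ', '')

That replaces all occurrences of 5 spaces, but it leaves the indented text itself in place. To remove the indented lines entirely, use a regular expression: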

from re import sub

# Remove everything from a run of 5 spaces up to and including the end of that line
raw_text = sub(r'     .*\n', '', raw_text)
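
Note that the pattern above matches 5 spaces anywhere in a line, not only at the start, and it will miss an indented final line that has no trailing newline. A stricter variant might look like the sketch below. It assumes Python 3 and that indented sections always begin with at least 5 spaces at the start of a line; the name strip_indented_lines and the sample text are hypothetical.

```python
import re

# Assumption: an indented section line starts with at least five spaces.
# re.MULTILINE makes ^ match at the start of every line, and \n? lets the
# pattern also match a final line that has no trailing newline.
INDENTED_LINE = re.compile(r"^ {5}.*\n?", re.MULTILINE)


def strip_indented_lines(text):
    """Return text with every line that starts with five spaces removed."""
    return INDENTED_LINE.sub("", text)


# Hypothetical sample standing in for the page's raw text.
sample = (
    "Main paragraph.\n"
    "     Indented quote.\n"
    "Back to main text.\n"
    "     Trailing quote"
)
cleaned = strip_indented_lines(sample)
print(repr(cleaned))
```

Lines that merely contain a run of 5 spaces somewhere in the middle are left alone, which the unanchored pattern would have truncated.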