python - Identifying sections tabbed in from raw text


Keywords:python 


Question: 

Consider the text on this page. If you look at the source code, you'll see that the main text is presented exactly as in the page -- no HTML divisions or any other way to obviously find paragraphs/tabbed in sections.

Is there a way to automatically identify and remove sections that are tabbed in from the raw text?

One thing I notice is that when I encode the text as text = unicode(raw_text).encode("utf-8"), I can see a bunch of \n's for line breaks, but no \t's. (This might not be a useful direction, but it's an idea.)

Edit: The following works

text = unicode(raw_text).encode("utf-8")
y = [x for x in text.split("\n") if "     " not in x]
final = " ".join(y)
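For readers on Python 3, where str is already Unicode and the unicode(...) call is not needed, the same filter can be sketched as below. The function name drop_indented_lines and the sample string are made up for illustration, and the sketch assumes the indent is exactly five leading spaces. It checks only the start of each line, so a line that merely contains five spaces somewhere in the middle is kept.

```python
def drop_indented_lines(raw_text, indent="     "):
    """Join the lines of raw_text with spaces, skipping lines that start with `indent`."""
    # Hypothetical helper: `indent` defaults to the five spaces seen on the page.
    kept = [line for line in raw_text.split("\n") if not line.startswith(indent)]
    return " ".join(kept)


sample = "First paragraph.\n     An indented quote.\nSecond paragraph."
print(drop_indented_lines(sample))  # First paragraph. Second paragraph.
```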

1 Answer: 

Well, after looking at the page, they are 'tabbed' in with spaces rather than the tab character; looking for tabs would not be useful. It looks like the section is tabbed in with 5 spaces.

raw_text = raw_text.replace('     ', '')

That replaces all occurrences of 5 spaces but keeps the indented text. To remove the indented lines themselves, use a regular expression:

from re import sub

...

raw_text = sub(r'     .*\n', '', raw_text)
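One caveat with the pattern above: it matches five spaces anywhere in a line, so a line such as "a     b" would lose everything from the spaces onward, and an indented final line without a trailing newline would not be matched. A possible variant, assuming the indent always sits at the start of a line, anchors the match with re.MULTILINE. The names INDENTED_LINE and strip_indented are illustrative only.

```python
import re

# Assumed pattern: a line that begins with at least five spaces,
# together with its trailing newline when there is one.
INDENTED_LINE = re.compile(r"^ {5}.*\n?", re.MULTILINE)


def strip_indented(raw_text):
    """Return raw_text with every line that starts with five spaces removed."""
    return INDENTED_LINE.sub("", raw_text)


sample = "Intro\n     quoted\nOutro\n     last"
print(repr(strip_indented(sample)))  # 'Intro\nOutro\n'
```

Unlike the split-and-join approach, this keeps the remaining line breaks intact.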