Consider the text on this page. If you look at the page source, you'll see that the main text appears exactly as it does on the rendered page: there are no HTML divisions or any other markup that would make it easy to find paragraphs or indented ("tabbed-in") sections.
Is there a way to automatically identify the indented sections in the raw text and remove them?
One thing I notice is that when I encode the text as

    text = unicode(raw_text).encode("utf-8")

I can see a lot of \n characters for the line breaks, but no \t characters. (This might not be a useful direction, but it's just an idea.)
Edit: The following works:

    text = unicode(raw_text).encode("utf-8")
    y = [x for x in text.split("\n") if " " not in x]
    final = " ".join(y)
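A slightly more targeted variant of the same idea, as a minimal sketch rather than a definitive answer: since there are no \t characters, the indentation is presumably made of leading spaces, so one could look only at the whitespace at the start of each line instead of anywhere in it. The function name `drop_indented_lines` and the `min_indent=4` threshold are my own assumptions, not something taken from the page, and the sketch is written for Python 3, so the `unicode(...)` call is not needed.

```python
def drop_indented_lines(raw_text, min_indent=4):
    """Return raw_text without lines indented by at least min_indent columns.

    Assumption: indented sections are marked by leading spaces (or tabs),
    and min_indent=4 is a guess that should be tuned to the actual page.
    """
    kept = []
    for line in raw_text.split("\n"):
        # Treat a tab as 4 columns so tabs and spaces are measured alike.
        expanded = line.expandtabs(4)
        indent = len(expanded) - len(expanded.lstrip(" "))
        # Blank or whitespace-only lines are kept; only indented text is dropped.
        if expanded.strip() and indent >= min_indent:
            continue
        kept.append(line)
    return "\n".join(kept)


sample = "First paragraph.\n    An indented quotation.\nSecond paragraph."
print(drop_indented_lines(sample))
```

If the page indents with non-breaking spaces or some other character, the `lstrip(" ")` call would need to be adjusted accordingly.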