Skip to content Skip to sidebar Skip to footer

Html To Readable Text

I'm writing a program to run on Google App Engine. Which simply get an URL and return the text by removing markups, scripts and any other non-readable things from its HTML source(s

Solution 1:

That's because there's a mix between unicode and bytestrings.

If you use in the module with HtmlTool

from __future__ import unicode_literals

To ensure every " "-like blocks are unicode, and

text = HtmlTool.to_nice_text(urllib.urlopen(url).read().decode("utf-8"))

To send a UTF-8 string to your method, that solves your problem.

Please read this for more information about Unicode: http://www.joelonsoftware.com/articles/Unicode.html

Post a Comment for "Html To Readable Text"