How To Deal With Unknown Encoding When Scraping Webpages?
I'm scraping news articles from various sites using GAE and Python. The code that scrapes one article URL at a time raises the following error: UnicodeDecodeError: 'ascii' codec can't decode byte …
Solution 1:
I had the same problem some time ago, and nothing is 100% accurate. What I did was (a sketch follows the list):
- Get the encoding from the Content-Type header
- Get the encoding from the HTML meta tags
- Detect the encoding with the chardet Python module
- Decode the text to Unicode using the first encoding found (falling back to the most common one)
- Process the text/HTML
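Putting those steps together, here is a minimal sketch of that fallback chain, assuming the page is fetched with urllib.request and that the chardet package is installed; fetch_unicode is just a made-up helper name:

```python
# Sketch of the fallback chain: header charset -> meta charset -> chardet guess.
import re
import urllib.request

import chardet  # pip install chardet


def fetch_unicode(url):
    resp = urllib.request.urlopen(url)
    raw = resp.read()

    # 1. Charset declared in the Content-Type response header.
    content_type = resp.headers.get("Content-Type", "")
    match = re.search(r"charset=([\w-]+)", content_type, re.I)
    if match:
        return raw.decode(match.group(1), errors="replace")

    # 2. Charset declared in the HTML meta tags (HTML5 and older http-equiv syntax).
    head = raw[:4096].decode("ascii", errors="ignore")
    match = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', head, re.I)
    if match:
        return raw.decode(match.group(1), errors="replace")

    # 3. Guess with chardet, falling back to UTF-8 as the most common encoding.
    # Note: a bogus declared charset above would raise LookupError; handle as needed.
    guess = chardet.detect(raw).get("encoding") or "utf-8"
    return raw.decode(guess, errors="replace")
```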
Solution 2:
It is better to simply read the Content-Type from the meta tags or the HTTP headers. Note that Chrome (unlike Opera) does not guess the encoding: if UTF-8 (or anything else) is not declared in either of those places, it treats the site as using the Windows default encoding. So only really badly built sites fail to declare it.
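As a variation on the header side of this, the third-party requests library (an assumption here, not something the question uses) exposes both the declared and the detected encoding directly:

```python
# resp.encoding comes from the Content-Type header (requests may default it to
# ISO-8859-1 when no charset is declared); resp.apparent_encoding is a
# chardet/charset_normalizer guess over the raw body bytes.
import requests

resp = requests.get("https://example.com/article")  # hypothetical URL
text = resp.content.decode(resp.encoding or resp.apparent_encoding, errors="replace")
```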