How To Deal With Unknown Encoding When Scraping Webpages?
I'm scraping news articles from various sites using GAE and Python. The code that scrapes one article URL at a time raises the following error: UnicodeDecodeError: 'ascii' codec can't decode byte …
Solution 1:
I had the same problem some time ago, and nothing is 100% accurate. What I did was (a sketch follows the list):
- Get the encoding from the Content-Type header
- Get the encoding from the HTML meta tags
- Detect the encoding with the chardet Python module
- Decode the text to Unicode using the first encoding found (falling back to the most common one)
- Process the text/HTML
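Putting those steps together, here is a minimal sketch of that fallback chain, assuming the page is fetched with urllib.request and that the chardet package is installed; fetch_unicode is just a made-up helper name:

```python
# Sketch of the fallback chain: header charset -> meta charset -> chardet guess.
import re
import urllib.request

import chardet  # pip install chardet


def fetch_unicode(url):
    resp = urllib.request.urlopen(url)
    raw = resp.read()

    # 1. Charset declared in the Content-Type response header.
    content_type = resp.headers.get("Content-Type", "")
    match = re.search(r"charset=([\w-]+)", content_type, re.I)
    if match:
        return raw.decode(match.group(1), errors="replace")

    # 2. Charset declared in the HTML meta tags (HTML5 and older http-equiv syntax).
    head = raw[:4096].decode("ascii", errors="ignore")
    match = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', head, re.I)
    if match:
        return raw.decode(match.group(1), errors="replace")

    # 3. Guess with chardet, falling back to UTF-8 as the most common encoding.
    # Note: a bogus declared charset above would raise LookupError; handle as needed.
    guess = chardet.detect(raw).get("encoding") or "utf-8"
    return raw.decode(guess, errors="replace")
```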
Solution 2:
It is better to simply read the Content-Type from the meta tags or the HTTP headers. Note that Chrome (unlike Opera) does not guess the encoding: if UTF-8 (or anything else) is not declared in either of those places, it treats the site as using the Windows default encoding. So only really badly built sites fail to declare it.
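As a variation on the header side of this, the third-party requests library (an assumption here, not something the question uses) exposes both the declared and the detected encoding directly:

```python
# resp.encoding comes from the Content-Type header (requests may default it to
# ISO-8859-1 when no charset is declared); resp.apparent_encoding is a
# chardet/charset_normalizer guess over the raw body bytes.
import requests

resp = requests.get("https://example.com/article")  # hypothetical URL
text = resp.content.decode(resp.encoding or resp.apparent_encoding, errors="replace")
```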