
C# Webclient Strange Characters

I am trying to download this webpage using the C# WebClient. It works perfectly with Python's urllib2, but with the C# WebClient I get strange characters in the output file.

Solution 1:

You're looking at a compressed byte stream. You can tell by inspecting the headers of the HTTP response, for example with curl:

curl -I http://bet.hkjc.com/

but the Developer Console of your browser will reveal the same:

HTTP/1.1 200 OK
Cache-Control: public, max-age=120, must-revalidate
Content-Length: 3615
Content-Type: text/html; charset=utf-8
Content-Encoding: gzip
Expires: Wed, 29 Jun 2016 08:01:06 GMT
Vary: Accept-Encoding
Server: Microsoft-IIS/7.0
X-AspNet-Version: 2.0.50727
X-Powered-By: ASP.NET
Date: Wed, 29 Jun 2016 08:00:14 GMT
Via: 1.1 stjbwbwa52
Accept-Ranges: bytes

Notice that the Content-Encoding header says gzip. This means the result you just got is compressed with the gzip algorithm. The standard WebClient doesn't decompress it by default, but with a simple subclass the WebClient can do new tricks:

using System;
using System.Net;
using System.Text;

public class DecompressWebClient : WebClient
{
    // moved common logic here
    public DecompressWebClient()
    {
        this.Encoding = Encoding.UTF8;
    }

    // This is the factory to create the web request
    protected override WebRequest GetWebRequest(Uri address)
    {
        // get the default one
        var request = base.GetWebRequest(address);
        // see if it is an HttpWebRequest
        var httpReq = request as HttpWebRequest;
        if (httpReq != null)
        {
            // add extra capabilities, like decompression
            httpReq.AutomaticDecompression = DecompressionMethods.GZip;
        }
        return request;
    }
}

HttpWebRequest has an AutomaticDecompression property that takes a DecompressionMethods value. When it is set to DecompressionMethods.GZip, the request advertises gzip support and the response is decompressed for us.
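To illustrate what that decompression step amounts to, here is a minimal sketch that works on an in-memory byte array instead of a real response. The class GzipSketch and its Compress and Decompress methods are hypothetical helpers, not part of WebClient or of the answer above. Compress only stands in for the server producing a gzip-encoded body.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper class for illustration only.
public static class GzipSketch
{
    // Stand-in for the server: turn text into a gzip-compressed body.
    public static byte[] Compress(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            // The GZipStream is disposed first so the gzip trailer is written.
            return output.ToArray();
        }
    }

    // Roughly the step the framework performs when decompression is enabled:
    // unwrap the gzip stream, then decode the bytes as UTF-8 text.
    public static string Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        byte[] body = Compress("<title>香港賽馬會</title>");
        // Gzip data starts with the magic bytes 1F 8B.
        Console.WriteLine("{0:X2} {1:X2}", body[0], body[1]);
        Console.WriteLine(Decompress(body));
    }
}
```

Decoding the compressed bytes directly as text, without the Decompress step, is what produces the strange characters from the question.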

When you put the subclassed WebClient to use, your code will look like this:

string url = "http://bet.hkjc.com";
using(WebClient webClient = new DecompressWebClient())
{
    string html = webClient.DownloadString(url);
    File.WriteAllText("page.html", html);
}

The UTF-8 encoding is correct, as the charset in the Content-Type header confirms.

The top of the html file will look like this:

<html><head>
<meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7; IE=EmulateIE10" />
<meta name="application-name" content="香港賽馬會" />
<title>香港賽馬會</title>
