
Scrapy Crawlspider Retry Scrape

For a page that I'm trying to scrape, I sometimes get a 'placeholder' page back in my response that contains some JavaScript that autoreloads until it gets the real page. I can detect when this happens, and I want to retry downloading and scraping the page.

Solution 1:

I would think about using a custom retry middleware instead, similar to the built-in RetryMiddleware.

Sample implementation (not tested):

import logging

logger = logging.getLogger(__name__)


class RetryMiddleware(object):
    def process_response(self, request, response, spider):
        # The placeholder page embeds this JavaScript flag; retry until the real page arrives.
        # response.body is bytes, so compare against a bytes literal.
        if b'var PageIsLoaded = false;' in response.body:
            logger.warning('parse_page encountered an incomplete rendering of %s', response.url)
            return self._retry(request) or response

        return response

    def _retry(self, request):
        logger.debug("Retrying %(request)s", {'request': request})

        retryreq = request.copy()
        retryreq.dont_filter = True
        return retryreq
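Note that, as written, this will retry a stubborn placeholder page indefinitely. If that's a concern, here is a minimal sketch of a capped variant of _retry that tracks the attempt count in request.meta; the meta key placeholder_retry_times and the limit of 3 are assumptions, not part of the original answer:

    MAX_PLACEHOLDER_RETRIES = 3  # assumed cap; tune for your site

    def _retry(self, request):
        # Count attempts in request.meta; after the cap, return None so that
        # process_response falls back to returning the placeholder response.
        retries = request.meta.get('placeholder_retry_times', 0) + 1
        if retries > self.MAX_PLACEHOLDER_RETRIES:
            logger.warning("Gave up retrying %(request)s", {'request': request})
            return None

        logger.debug("Retrying %(request)s (attempt %(n)d)", {'request': request, 'n': retries})
        retryreq = request.copy()
        retryreq.meta['placeholder_retry_times'] = retries
        retryreq.dont_filter = True
        return retryreq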

And don't forget to activate it in the DOWNLOADER_MIDDLEWARES setting.
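For example, a minimal sketch of the settings entry, assuming the class lives in myproject/middlewares.py (the module path is hypothetical; use your own project's path):

    # settings.py
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.RetryMiddleware': 543,  # priority is a typical mid-range value
    }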
