Scrapy CrawlSpider Retry Scrape
For a page that I'm trying to scrape, I sometimes get a 'placeholder' page back in my response that contains some javascript that autoreloads until it gets the real page. I can detect when this happens, and I want to retry downloading and scraping the page.
Solution 1:
I would consider writing a custom retry middleware instead, similar to the built-in RetryMiddleware.
Sample implementation (not tested):
import logging

logger = logging.getLogger(__name__)


class RetryMiddleware(object):

    def process_response(self, request, response, spider):
        # The placeholder page can be recognized by its javascript snippet;
        # response.body is bytes, so compare against a bytes literal.
        if b'var PageIsLoaded = false;' in response.body:
            logger.warning('parse_page encountered an incomplete rendering of {}'.format(response.url))
            # _retry() returns a fresh request to re-download; if it ever
            # returns None, fall back to passing the placeholder response on.
            return self._retry(request) or response
        return response

    def _retry(self, request):
        logger.debug("Retrying %(request)s", {'request': request})
        retryreq = request.copy()
        # dont_filter keeps the duplicate filter from dropping the retry.
        retryreq.dont_filter = True
        return retryreq
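One caveat in this sample: _retry() always returns a new request, so a page that never finishes rendering would be retried forever. Below is a minimal sketch of a retry cap that counts attempts in request.meta; the placeholder_retries key and the limit of 3 are assumptions for illustration, not part of the original answer:

MAX_PLACEHOLDER_RETRIES = 3

def _retry(self, request):
    # Count attempts via request.meta; request.copy() gives the new
    # request its own meta dict, so the counter is re-set on the copy.
    retries = request.meta.get('placeholder_retries', 0) + 1  # assumed key
    if retries > MAX_PLACEHOLDER_RETRIES:
        logger.warning('Gave up retrying %(request)s', {'request': request})
        # Returning None makes the `or response` fallback in
        # process_response deliver the placeholder page as-is.
        return None
    logger.debug('Retrying %(request)s (attempt %(n)d)', {'request': request, 'n': retries})
    retryreq = request.copy()
    retryreq.meta['placeholder_retries'] = retries
    retryreq.dont_filter = True
    return retryreq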
And don't forget to activate it.
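Downloader middlewares are enabled through the DOWNLOADER_MIDDLEWARES setting in settings.py. A minimal sketch, assuming the class lives in a hypothetical myproject.middlewares module (adjust the path to wherever you actually define it; 550 mirrors the priority Scrapy assigns to its built-in RetryMiddleware):

DOWNLOADER_MIDDLEWARES = {
    # Hypothetical module path; point this at your own middleware module.
    'myproject.middlewares.RetryMiddleware': 550,
}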