I've been running a Scrapy crawler against a large site I'd rather not name. I used the tutorial spider as a template, then created a series of starting requests and let it crawl from there, using something like this:
    def start_requests(self):
        f = open('zipcodes.csv', 'r')
        lines = f.readlines()
        for line in lines:
            zipcode = int(line)
            yield self.make_requests_from_url(
                "http://www.example.com/directory/%05d" % zipcode)
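(In case it's relevant: I realize readlines() pulls the whole file into memory at once. A streaming version like the sketch below would avoid that, though with only ~10,000 zip codes I doubt this particular part is the culprit.)

    def start_requests(self):
        # read one zip code per line without loading the whole file
        with open('zipcodes.csv') as f:
            for line in f:
                zipcode = int(line)
                yield self.make_requests_from_url(
                    "http://www.example.com/directory/%05d" % zipcode)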
To start, there are over 10,000 such pages; each of those queues up a fairly large directory, from which several more pages get queued, and so on. Scrapy appears to prefer staying "shallow," accumulating pending requests in memory rather than delving down through a branch and then working back up.
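From skimming the settings docs, DEPTH_PRIORITY looks related: the docs say a negative value raises the priority of deeper requests (depth-first), which sounds like the delve-down-and-back-up behavior I want. This is just my reading, untested:

    # settings.py -- my reading of the DEPTH_PRIORITY docs, untested:
    # a negative value should process deeper requests sooner (depth-first),
    # so existing branches get drained before new start pages pile on
    DEPTH_PRIORITY = -1

Even if that changes the ordering, though, I assume the pending requests themselves still sit in memory.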
Either way, what I actually get is the same large exception, over and over, ending like this:
File "C:\Python27\lib\site-packages\scrapy\utils\defer.py", line 57, in <genexpr> work = (callable(elem, *args, **named) for elem in iterable) --- <exception caught here> --- File "C:\Python27\lib\site-packages\scrapy\utils\defer.py", line 96, in iter_errback yield next(it)
..... (Many more lines) .....
File "C:\Python27\lib\site-packages\scrapy\selector\lxmldocument.py", line 13, in _factory body = response.body_as_unicode().strip().encode('utf8') or '<html/>' exceptions.MemoryError:
Within an hour or so of starting a crawl that should take several days, the python.exe process balloons to about 1.8 GB and Scrapy stops functioning (while continuing to run up my proxy usage fees!).
Is there any way to get Scrapy to dequeue, externalize, or otherwise iterate over (I don't even know the right words) its stored requests, to prevent this kind of memory problem?
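For instance, the docs on pausing and resuming crawls describe a JOBDIR setting that, as far as I can tell, makes the scheduler keep its pending request queue on disk instead of in memory. Is something like this (the spider and directory names are just placeholders) the sort of thing I'm after?

    scrapy crawl myspider -s JOBDIR=crawls/zipcodes-1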
(I'm not very proficient at programming beyond piecing together what I see here or in the docs, so I'm not equipped to troubleshoot under the hood, so to speak. I was also unable to install the full Python/Django/Scrapy stack as 64-bit on Windows 7, after days of trying and reading.)