I've written a basic Scrapy spider to crawl a website. It seems to run fine, except that it never stops: it keeps revisiting the same URLs and returning the same content, so I always end up having to kill it manually. Is there a rule that will stop this? Or is there something else I have to do, maybe middleware?
The spider is as follows:
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.contrib.loader.processor import Join
    from scrapy.selector import Selector

    from lsbu.items import LsbuItem


    class LsbuSpider(CrawlSpider):
        name = "lsbu6"
        allowed_domains = ["lsbu.ac.uk"]
        start_urls = ["http://www.lsbu.ac.uk"]

        rules = [
            Rule(SgmlLinkExtractor(allow=['lsbu.ac.uk/business-and-partners/.+']),
                 callback='parse_item', follow=True),
        ]

        def parse_item(self, response):
            join = Join()
            sel = Selector(response)
            bits = sel.xpath('//*')
            scraped_bits = []
            for bit in bits:
                scraped_bit = LsbuItem()
                scraped_bit['title'] = bit.xpath('//title/text()').extract()
                scraped_bit['desc'] = join(bit.xpath('//*[@id="main_content_main_column"]//text()').extract()).strip()
                scraped_bits.append(scraped_bit)
            return scraped_bits
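In case a concrete example helps show what I mean by "a rule that will stop this": I wondered whether a process_links hook on the Rule could filter out URLs the spider has already queued. This is just a guess on my part, and dedupe_links is a name I made up:

    def dedupe_links(self, links):
        # Hypothetical helper (my own naming): keep only links whose URL
        # hasn't been seen before during this crawl.
        seen = getattr(self, '_seen_urls', set())
        fresh = [link for link in links if link.url not in seen]
        seen.update(link.url for link in fresh)
        self._seen_urls = seen
        return fresh

wired in with Rule(..., process_links='dedupe_links', follow=True) - but I don't know whether that would just be fighting the dupefilter instead of fixing the real problem.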
The settings.py file looks like this:
    BOT_NAME = 'lsbu6'

    DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'
    DUPEFILTER_DEBUG = True

    SPIDER_MODULES = ['lsbu.spiders']
    NEWSPIDER_MODULE = 'lsbu.spiders'
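I also wondered whether capping the crawl in settings.py would at least make it stop, even if it doesn't address the cause. These are guesses on my part (DEPTH_LIMIT and CLOSESPIDER_PAGECOUNT are real Scrapy settings, but I've picked the values arbitrarily):

    # Guesses, not a known fix: stop following links more than 5 hops
    # from the start URL, and close the spider after 1000 pages.
    DEPTH_LIMIT = 5
    CLOSESPIDER_PAGECOUNT = 1000

I believe CLOSESPIDER_PAGECOUNT relies on the CloseSpider extension, which should be enabled by default.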
Any help, guidance, or instruction on stopping it from running continuously would be greatly appreciated.
As I'm a newbie to this, any comments on tidying up the code would also be helpful (or links to good tutorials).