I want to extract all external links from a given website using Scrapy. With the following code, however, the spider crawls the external links as well:
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors import LinkExtractor
    from myproject.items import someItem

    class someSpider(CrawlSpider):
        name = 'crawltest'
        allowed_domains = ['someurl.com']
        start_urls = ['http://www.someurl.com/']

        rules = (Rule(LinkExtractor(), callback="parse_obj", follow=True),)

        def parse_obj(self, response):
            item = someItem()
            item['url'] = response.url
            return item
What am I missing? Doesn't "allowed_domains" prevent the external links from being crawled? If I set "allow_domains" on the LinkExtractor, it does not extract the external links at all. Just to clarify: I want to crawl internal links but extract external links. Any help appreciated!
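For reference, here is roughly what I have been experimenting with: the rule's LinkExtractor restricted with allow_domains so that only internal links are followed, plus a second LinkExtractor with deny_domains that I hoped would pick up only the off-site links. I am not sure this is the intended approach, so treat it as a sketch:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors import LinkExtractor
    from myproject.items import someItem

    class someSpider(CrawlSpider):
        name = 'crawltest'
        allowed_domains = ['someurl.com']
        start_urls = ['http://www.someurl.com/']

        # Only extract (and therefore follow) links on the allowed domain,
        # so the crawl stays on the site itself.
        rules = (
            Rule(LinkExtractor(allow_domains=['someurl.com']),
                 callback='parse_obj', follow=True),
        )

        # A separate extractor, intended to pick up only off-site links
        # on each crawled page.
        external_links = LinkExtractor(deny_domains=['someurl.com'])

        def parse_obj(self, response):
            # Yield one item per external link found on this internal page.
            for link in self.external_links.extract_links(response):
                item = someItem()
                item['url'] = link.url
                yield item

My understanding is that parse_obj would then run on each internal page and yield one item per external link found there, but I may be misunderstanding how allowed_domains and the rules interact.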