I was going through all the Scrapy examples and tutorials I could find, and I couldn't find one that collects all the URLs of the images, CSS, and JS files being served by a site.
Is there a way to do that with scrapy? If not with scrapy, then is there a way to do it with something else?
I basically want to go through my website, collect all the resource URLs, and write them to a log file.
Best How To:
You can use a link extractor (more specifically, I've found that LxmlParserLinkExtractor works better for this kind of thing), customizing the tags and attributes it scans like this:
from scrapy.contrib.linkextractors.lxmlhtml import LxmlParserLinkExtractor
# note: in Scrapy 1.0+ this module moved to scrapy.linkextractors.lxmlhtml

# tags that reference external resources, and the attributes holding the URL
tags = ['img', 'embed', 'link', 'script']
attrs = ['src', 'href']

# both arguments accept a callable deciding which tags/attributes to scan
extractor = LxmlParserLinkExtractor(lambda x: x in tags, lambda x: x in attrs)
resource_urls = [l.url for l in extractor.extract_links(response)]
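Since you asked about non-Scrapy options too: for a single page, the same extraction can be done with just the standard library's html.parser. A minimal sketch (the tag/attribute lists mirror the extractor above; the sample HTML and class name are made up for illustration):

```python
from html.parser import HTMLParser

class ResourceLinkParser(HTMLParser):
    """Collect URLs from the src/href attributes of resource tags."""
    TAGS = {'img', 'embed', 'link', 'script'}
    ATTRS = {'src', 'href'}

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag in self.TAGS:
            for name, value in attrs:
                if name in self.ATTRS and value:
                    self.urls.append(value)

html = """<html><head>
<link rel="stylesheet" href="/static/site.css">
<script src="/static/app.js"></script>
</head><body><img src="/images/logo.png"></body></html>"""

parser = ResourceLinkParser()
parser.feed(html)
print(parser.urls)  # ['/static/site.css', '/static/app.js', '/images/logo.png']
```

You'd still need to fetch each page yourself (e.g. with urllib.request) and resolve relative URLs with urllib.parse.urljoin, which is the part Scrapy handles for you.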