I'm currently working on a crawler written in Python, using a combination of gevent, requests, and lxml to crawl a defined set of pages. I use Redis as the database to hold lists such as the pending queue, the URLs currently being fetched, and the sites that have been crawled. For each URL I have a key prefixed with url_, and I use a SETNX command to ensure the URL has not already been crawled before putting it into the queue.
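To make that concrete, the dedup step looks roughly like this (a minimal sketch; the function and key names are illustrative, and the tiny in-memory class stands in for a real Redis client so the snippet is self-contained):

```python
import hashlib


class RedisStandIn:
    """In-memory stand-in for a real Redis client (illustration only)."""

    def __init__(self):
        self.store = {}

    def setnx(self, key, value):
        # SETNX semantics: set only if the key does not exist;
        # return True if the key was set, False if it already existed.
        if key in self.store:
            return False
        self.store[key] = value
        return True

    def rpush(self, key, value):
        # RPUSH semantics: append to a list stored at key.
        self.store.setdefault(key, []).append(value)


def enqueue_if_new(r, url):
    """Mark the URL as seen via SETNX; push it onto the pending
    queue only if it has not been seen before."""
    key = "url_" + hashlib.sha1(url.encode("utf-8")).hexdigest()
    if r.setnx(key, 1):
        r.rpush("pending", url)
        return True
    return False
```

Because SETNX is atomic, this check is safe even with several crawler processes on different machines hitting the same Redis instance, but it costs one key per URL, which is where the memory growth comes from.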
One of the problems I'm starting to face is that the set of url_ keys is growing very quickly, and since Redis keeps almost all of its data in memory this will soon become an issue. The crawled URLs don't have an expiration time, because I need to visit each one only once and its content will not change in the future, so I still want to keep every visited URL. (There are a lot of duplicate URLs that I'm filtering out.) Is it possible to use a data structure like a cuckoo hash table or a Bloom filter in Redis, so that I can keep the set of visited URLs from growing so fast while still benefiting from fast queries against the queue?
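This is the kind of structure I have in mind (a minimal pure-Python Bloom filter sketch; the sizing parameters are illustrative, and I assume the bit array could instead live in a single Redis string manipulated with SETBIT/GETBIT so that all crawler machines share it):

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: each item sets k bit positions in an
    m-bit array. It can report false positives (a URL wrongly
    considered visited) but never false negatives (a URL that was
    added is always reported as seen)."""

    def __init__(self, num_bits=2**23, num_hashes=7):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive k bit positions via double hashing of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

With Redis, each `pos` would map onto SETBIT/GETBIT calls against one shared key instead of a local bytearray; by the usual Bloom filter math, a bit array of roughly 1 MB can track on the order of a million URLs at around a 1% false-positive rate, compared with one full key per URL today. The open question for me is whether the false positives (occasionally skipping a URL that was never actually crawled) are an acceptable trade-off.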
Is there an alternative approach I can use to determine whether a URL has already been visited? The solution should be scalable and distributed, as the crawlers are currently running on more than one machine. Thanks!