I am making a Python web-crawler program to play The Wiki game.
If you're unfamiliar with this game:
- Start from some article on Wikipedia
- Pick a goal article
- Try to get to the goal article from the start article just by clicking wiki/ links
My process for doing this is:
- Take a start article and a goal article as input
- Get a list of articles that link to the goal article
- Perform a breadth-first search over the links found, starting from the start article and skipping pages that have already been visited
- Check if the goal article is on the current page: if it is, return the path
- Check if any of the articles that link to the goal are on the current page: if one of them is, return the path through that article
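The breadth-first search described above could be sketched like this (a minimal sketch: the `get_links` callback and the toy graph are stand-ins for the real link scraper):

```python
from collections import deque

def bfs_path(start, goal, get_links):
    """Breadth-first search over wiki links, avoiding revisits.

    get_links(page) returns the pages linked from page
    (in the real crawler this would scrape /wiki/ links).
    """
    visited = {start}
    queue = deque([[start]])          # queue holds whole paths, not single pages
    while queue:
        path = queue.popleft()
        page = path[-1]
        if page == goal:
            return path               # first path found is a shortest one
        for link in get_links(page):
            if link not in visited:
                visited.add(link)
                queue.append(path + [link])
    return None                       # goal unreachable from start

# Toy link graph standing in for real Wikipedia pages:
graph = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': ['Goal'], 'Goal': []}
print(bfs_path('A', 'Goal', lambda p: graph.get(p, [])))
# → ['A', 'B', 'D', 'Goal']
```

Queuing whole paths (rather than single pages) makes it trivial to return the click sequence once the goal is found.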
I was having a problem where the program would return a path, but the path didn't actually lead to the goal.
    from re import findall

    def get_all_links(source):
        # Cut off everything from the References / See also sections onward
        source = source[:source.find('Edit section: References')]
        source = source[:source.find('id="See_also"')]
        links = findall(r'/wiki/[^\(?:/|"|\#)]+', source)
        # is_good is a filter defined elsewhere in my program
        return list(set('http://en.wikipedia.org' + link
                        for link in links
                        if link and is_good(link)))

    links_to_goal = get_all_links(goal)
I realized that I was getting the links to the goal by scraping all of the links off of the goal page, but wiki/ links are unidirectional: just because the goal links to a page doesn't mean that page links back to the goal.
How can I get a list of articles that link to the goal?