python,function,while-loop,web-crawler,scrapy
I would use a for loop, like this:

    class MySpider(BaseSpider):
        stock = ["SCMP", "APPL", "GOOG"]
        name = "dozen"
        allowed_domains = ["yahoo.com"]

        def stock_list(stock):
            start_urls = []
            for i in stock:
                start_urls.append("http://finance.yahoo.com/q/is?s={}".format(i))
            return start_urls

        start_urls = stock_list(stock)

Then assign the function call as I have at the bottom. UPDATE Using Scrapy...
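For comparison only, a minimal sketch of the same idea in more recent Scrapy style, using start_requests() instead of a precomputed start_urls list (the spider body below is illustrative, not the OP's code):

    import scrapy

    class DozenSpider(scrapy.Spider):
        name = "dozen"
        allowed_domains = ["yahoo.com"]
        stock = ["SCMP", "APPL", "GOOG"]

        def start_requests(self):
            # Build one request per ticker symbol on the fly.
            for symbol in self.stock:
                url = "http://finance.yahoo.com/q/is?s={}".format(symbol)
                yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            # Illustrative callback: just record which page was fetched.
            yield {"url": response.url}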
javascript,html,web-scraping,firefox-addon,web-crawler
Basically you might set up Google Spreadsheet to scrape pages' parts thru IMPORTXML function (here with an example) using xpath. Then you set up notifications in a spreadsheet: Tools -> Notification Rules Now each time the scraping function (IMPORTXML) gets content that is different to previous one, spreadsheet should trigger...
python,selenium,data-structures,web-crawler,scrapy
The problem with the code above was a mutable object (list, dict): all the callbacks were changing that same object in each loop, hence the first and second levels of data were being overwritten in the last, third loop (mp3_son_url) ...(this was my failed attempt). The solution was to...
The URL to that forum has changed. Two modifications to your code: 1. changed the forum url ("https://www.thenewboston.com/forum/recent_activity.php?page=" + str(page)); 2. allow_redirects=False (to disable redirects, if any). import requests from bs4 import BeautifulSoup def trade_spider(max_pages): page = 1 while page <= max_pages: url = "https://www.thenewboston.com/forum/recent_activity.php?page=" + str(page) print url source_code = requests.get(url, allow_redirects=False)...
python,web-scraping,web-crawler,scrapy,duplication
The DupeFilter is enabled by default: http://doc.scrapy.org/en/latest/topics/settings.html#dupefilter-class and it's based on the request url. I tried a simplified version of your spider on a new vanilla scrapy project without any custom configuration. The dupefilter worked and the crawl stopped after a few requests. I'd say you have something wrong on...
It looks like you might be able to accomplish this through the use of signals. Specifically, the item_scraped signal which allows you to register an event after an item is scraped. For each received response, check if the response.url is in the start_url list. scrapy.signals.item_scraped(item, response, spider) More info on...
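For illustration only (not from the original answer), a minimal sketch of wiring up that signal inside a spider; the spider name and URL are placeholders, and the pattern assumes a reasonably recent Scrapy where from_crawler and crawler.signals.connect work as shown:

    import scrapy
    from scrapy import signals

    class MySpider(scrapy.Spider):
        name = "myspider"                      # placeholder name
        start_urls = ["http://example.com/"]   # placeholder URL

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
            # Register a handler that fires after every scraped item.
            crawler.signals.connect(spider.item_scraped, signal=signals.item_scraped)
            return spider

        def item_scraped(self, item, response, spider):
            # Check whether the item came from one of the start URLs.
            if response.url in self.start_urls:
                spider.logger.info("Item scraped from a start URL: %s", response.url)

        def parse(self, response):
            yield {"url": response.url}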
architecture,web-crawler,message-queue
what it does next? How it separates found links into processed, discovered and the new ones? You would set up separate queues for these, which would stream back to your database. The idea is that you could have multiple workers going, and a feedback loop to send the newly...
My best guess is that you are using integer division on the results of the .size() methods, which will result in 0 if the answer is less than 1. Cast the results of .size() to float or double.
php,curl,cookies,header,web-crawler
CURLOPT_COOKIESESSION is not the option that sets your request's Cookie: header; CURLOPT_COOKIE is. That said, path=/; HttpOnly should never be a part of it. Since that second URI expects the session cookie to be present in the request, and it isn't (because you failed to set it), it redirects you to...
php,curl,encryption,web-crawler
Looks like their web server is rejecting requests based on HTTP headers. Or it might be on the application level as well. Try this <?php // Get cURL resource $curl = curl_init(); // Set some options - we are passing in a useragent too here curl_setopt_array($curl, array( CURLOPT_RETURNTRANSFER => 1,...
ExtractorHTML parses the page using the following regex: static final String RELEVANT_TAG_EXTRACTOR = "(?is)<(?:((script[^>]*+)>.*?</script)" + // 1, 2 "|((style[^>]*+)>.*?</style)" + // 3, 4 "|(((meta)|(?:\\w{1,"+MAX_ELEMENT_REPLACE+"}))\\s+[^>]*+)" + // 5, 6, 7 "|(!--(?!\\[if).*?--))>"; // 8 Basically, cases 1 .. 7 match any interesting tags for link extractions, and case 8 matches HTML comments...
python,listbox,wxpython,web-crawler
You can just grab it from the listbox, you don't need it from the event. See below: import wx # for gui class MyFrame(wx.Frame): def __init__(self, parent, id): wx.Frame.__init__(self, parent, id, 'Title', size=(300,200)) tournLevel = ['$10,000', '$15,000', '$20,000', '$50,000','$75,000','$100,000'] self.levelBox = wx.ListBox(panel, -1, (40, 50), (90, 90), tournLevel) self.levelBox.SetSelection(1) #...
javascript,python,pagination,web-crawler,scrapy
Visiting the site with a web browser and activated web developer tools (the following screenshots were made with Firefox and the Firebug add-on), you should be able to analyze the network requests and responses. It will show you that the site's pagination buttons send requests like the following: So the URL seems to be:...
python,beautifulsoup,web-crawler,python-requests,url-redirection
How about a regex? You just check response.text for a redirect occurrence (Python):

    import re
    match = re.search(r'window\.location\s*=\s*"([^"]+)"', response.text)
    if match:
        print(match.group(1))

See the demo....
facebook,facebook-graph-api,web-crawler
Since you didn't mention which language you are using, I'll use PHP since it's easier and quicker to learn (my opinion). If you have the ID then this link will take you to the user profile: www.facebook.com/profile.php?id=<user id digits only> If you have the username then you can use: www.facebook.com/<user name>...
python-2.7,web-scraping,web-crawler,scrapy,scrapy-spider
from scrapy.spider import BaseSpider from scrapy.selector import HtmlXPathSelector from scrapy.exceptions import CloseSpider from scrapy.http import Request from test.items import CraigslistSampleItem from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor URL = "http://example.com/subpage/%d" class MySpider(BaseSpider): name = "craig" allowed_domains = ["xyz.com"] def start_requests(self): for i in range(10): yield Request(URL % i,...
selenium,web-crawler,mechanize
Selenium is a browser automation tool. You can basically automate everything you can do in your browser. Start with going through the Getting Started section of the documentation. Example: from selenium import webdriver driver = webdriver.Firefox() driver.get("http://www.python.org") print driver.title driver.close() Besides automating common browsers, like Chrome, Firefox, Safari or Internet...
node.js,character-encoding,web-crawler,phantomjs
I found the cause. The problem was simply an MS console encoding issue. I just typed the 'chcp' command and the console then showed me a clean screen. That's all.
It looks like regular expressions would be helpful here. You could say, for instance: /* if $input_url contains a 4 digit year, slash, number(s), slash, number(s) */ if (preg_match("/\/20\d\d\/\d+\/\d+\/",$input_url)) { echo $input_url . "<br>"; } ...
This sounds like a resource problem. Did you close all your resources in finally statements? Did you start threads which never finish and keep going on and on?
javascript,web-crawler,phantomjs,casperjs
try to change your selector to this.mouse.click('li[class="item1"] > a') because the li[class="item1"] is not clickable, but the a element inside of it is.
authentication,python-3.x,web-crawler,authorization
In my case the problem was the 'Referer' parameter in the headers, which is required but wasn't specified.
ruby,cron,google-api,web-crawler,google-api-client
Solved thanks to Ruby google_drive gem oAuth2 saving I needed to get a refresh token and make my code use it like below. CLIENT_ID = '!!!' CLIENT_SECRET = '!!!' OAUTH_SCOPE = 'https://www.googleapis.com/auth/drive' REDIRECT_URI = 'urn:ietf:wg:oauth:2.0:oob' REFRESH_TOKEN = '!!!' client = Google::APIClient.new client.authorization.client_id = CLIENT_ID client.authorization.client_secret = CLIENT_SECRET client.authorization.scope = OAUTH_SCOPE...
python,web-crawler,scrapy,scrapy-spider
You need to add scheme for the URL: ftp://ftp.site.co.uk The FTP URL syntax is defined as: ftp://[<user>[:<password>]@]<host>[:<port>]/<url-path> Basically, you do this: yield Request('ftp://ftp.site.co.uk/feed.xml', ...) Read more about schemas at Wikipedia: http://en.wikipedia.org/wiki/URI_scheme...
python,web-scraping,web-crawler,scrapy,scrapy-spider
The key problem with your code is that you have not set the rules for the CrawlSpider. Other improvements I would suggest: there is no need to instantiate HtmlXPathSelector, you can use response directly; select() is deprecated now, use xpath(); get the text() of the title element in order to...
python,web-crawler,scrapy,scrapy-spider
Wrap your image url in a list like so: item['image_urls'] = [self.page_name + imageStr[3:-2]] ...
website,web-crawler,scrapy,extract
So, a couple things first: 1) the rules attribute only works if you're extending the CrawlSpider class, they won't work if you extend the simpler scrapy.Spider. 2) if you go the rules and CrawlSpider route, you should not override the default parse callback, because the default implementation is what actually...
ruby,web-crawler,nokogiri,net-http,mechanize-ruby
Thanks to the help of @theTinMan, @MarkThomas and a colleague, I've managed to log into Jenkins and collect the page's XML, through Mechanize and Nokogiri: require 'rubygems' require 'nokogiri' require 'net/https' require 'openssl' require 'mechanize' # JenkinsXML logs into Jenkins and gets an...
python,web-scraping,web-crawler,mechanize
If you look at the request being sent to that site in developer tools, you'll see that a POST is sent as soon as you select a state. The response that is sent back has the form with the values in the city dropdown populated. So, to replicate this in...
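As a rough sketch of that idea with the requests library (the URL and form field names below are placeholders, not the real site's; inspect the actual POST in the browser's developer tools first):

    import requests
    from bs4 import BeautifulSoup

    FORM_URL = "http://example.com/locator"        # placeholder URL

    session = requests.Session()
    # Send the same POST the browser sends when a state is selected.
    resp = session.post(FORM_URL, data={"state": "CA"})   # placeholder field name

    # The response contains the form with the city dropdown populated.
    soup = BeautifulSoup(resp.text, "html.parser")
    cities = [opt.get_text(strip=True)
              for opt in soup.select("select[name=city] option")]
    print(cities)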
java,parsing,web-crawler,jsoup
I have to translate \x signs and add that site to my "toVisit" sites...I don't have any other idea, how to parse something like this... The \xAA is hexadecimal encoded ascii. For instance \x3d is =, and \x26 is &. These values can be converted using Integer.parseInt with radix...
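The answer above is about Java; purely as an illustration of the same decoding idea, here is a small Python sketch (the sample string is made up):

    import re

    def decode_hex_escapes(s):
        # Replace each \xAB escape with the ASCII character it encodes.
        return re.sub(r'\\x([0-9A-Fa-f]{2})',
                      lambda m: chr(int(m.group(1), 16)), s)

    print(decode_hex_escapes(r'a\x3db\x26c'))  # prints: a=b&c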
python,web-scraping,web-crawler,scrapy
With this setup, you need to move middlewares.py one level up into craiglist package.
This is a lie: getResultCounter :: String -> Integer The type signature above is promising that the resulting integer only depends on the input string, when this is not the case: Google can add/remove results from one call to the other, affecting the output. Making the type more honest, we...
javascript,jquery,python,web-crawler,window.open
I suggest you use an iframe for loading pages. For example: $.each($yourLinks, function(index, link) { var href = $(link).attr("href"); // your link preprocess logic ... var $iframe = $("<iframe />").appendTo($("body")); $iframe.attr("src", href).on("load", function() { var $bodyContent = $iframe.contents().find("body"); // check iframe content and remove iframe $iframe.remove(); }); }); But, I...
python,web-scraping,web-crawler,urllib2
Rather than working around redirects with urlopen, you're probably better off using a more robust requests library: http://docs.python-requests.org/en/latest/user/quickstart/#redirection-and-history r = requests.get('website', allow_redirects=True) print r.text ...
If the extracted URLs can be differentiated by a regular expression, you can do that with current Nutch by adding the specific regex filter. But if you are going to classify URLs according to some metadata features related to the page, you have to implement a customized HTMLParseFilter to filter Outlink[] during...
python,web-crawler,scrapy,scrapy-spider,sgml
The problem is in the restrict_xpaths - it should point to a block where a link extractor should look for links. Don't specify allow at all: rules = [ Rule(SgmlLinkExtractor(restrict_xpaths='//div[@class="more"]'), callback="parse_me", follow=True), ] And you need to fix your allowed_domains: allowed_domains = ["www.allgigs.co.uk"] Also note that the print items in...
python,web-crawler,beautifulsoup,urllib2
I modified your function so it doesn't write to file, it just prints the urls, and this is what I got: http://www.nytimes.com/ http://cn.nytimes.com http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/ http://international.nytimes.com http://cn.nytimes.com http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/ http://international.nytimes.com http://cn.nytimes.com http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/ http://international.nytimes.com http://cn.nytimes.com...
ruby,url,hyperlink,web-crawler,net-http
First of all you need a function that accepts a link and returns the body output. Then parse all the links out of the body and keep a list of links. Check that list to see whether you have already visited a link. Remove the visited links from the new links list...
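The question is about Ruby, but as a language-neutral illustration of that loop, here is a minimal Python sketch (requests and BeautifulSoup stand in for Net::HTTP and a parser; the start URL and limit are placeholders):

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    def fetch_body(url):
        # Accepts a link and returns the body output.
        return requests.get(url, timeout=10).text

    def crawl(start_url, limit=50):
        visited, to_visit = set(), [start_url]
        while to_visit and len(visited) < limit:
            url = to_visit.pop()
            if url in visited:          # skip links we already visited
                continue
            visited.add(url)
            soup = BeautifulSoup(fetch_body(url), "html.parser")
            # Parse all links out of the body and keep only new ones.
            for a in soup.find_all("a", href=True):
                link = urljoin(url, a["href"])
                if link not in visited:
                    to_visit.append(link)
        return visited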
You need to set DOWNLOAD_DELAY in settings.py of your project. Note that you may also need to limit concurrency. By default concurrency is 8 so you are hitting website with 8 simultaneous requests. # settings.py DOWNLOAD_DELAY = 1 CONCURRENT_REQUESTS_PER_DOMAIN = 2 Starting with Scrapy 1.0 you can also place custom...
python,arrays,web-crawler,scrapy
You should instantiate a new item to yield each time. Assuming TV() is an item class: class TV(Item): .... Then you should have a separate item = TV() for each episode. If you want to pass data from top levels - pass the data itself and create an Item only...
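A minimal sketch of that pattern (the field names, selectors and URL are illustrative, not from the original question; .get() needs Scrapy 1.8+, use extract_first() on older versions):

    import scrapy

    class TV(scrapy.Item):
        show = scrapy.Field()
        episode = scrapy.Field()

    class ShowSpider(scrapy.Spider):
        name = "shows"
        start_urls = ["http://example.com/show"]   # placeholder URL

        def parse(self, response):
            show_name = response.css("h1::text").get()
            for ep in response.css("li.episode"):
                item = TV()                # a fresh item for every episode
                item["show"] = show_name   # pass plain data, not a shared item
                item["episode"] = ep.css("a::text").get()
                yield item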
python,xpath,web-scraping,web-crawler,scrapy
Seems like it's an XPath problem: on this site they might have omitted tbody during development, but the browser automatically inserts it when the page is viewed. You can get more info about this from here. So if you need the county's value (WELD #123) on the given page, then the...
python,html,xpath,web-crawler,scrapy
'id("div_a1")/div[3]' works for me. See this sample scrapy shell session: $ scrapy shell http://tool.httpcn.com/Html/Zi/28/PWMETBAZTBTBBDTB.shtml ... 2015-02-10 12:56:13+0100 [default] DEBUG: Crawled (200) <GET http://tool.httpcn.com/Html/Zi/28/PWMETBAZTBTBBDTB.shtml> (referer: None) [s] Available Scrapy objects: [s] crawler <scrapy.crawler.Crawler object at 0x7fbd4f5a7a50> [s] item {} [s] request <GET http://tool.httpcn.com/Html/Zi/28/PWMETBAZTBTBBDTB.shtml> [s] response <200...
Use PHP cURL and PHP DOMDocument for this: libxml_use_internal_errors(true); for ($y = 1; $y <= 5; $y++) { $ch = curl_init(); curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); curl_setopt($ch, CURLOPT_URL, 'http://www.pakwheels.com/used-cars/search/-/?page=' . $y); $searchResults = curl_exec($ch); // save $searchResults here to a file or use DOMDocument to filter what you need $doc = new...
Use capturing group or lookarounds. >>> pattern=re.compile(r'\bdata-id="(\d+)"') >>> s = 'data-id="48859672"' >>> pattern.search(s).group(1) '48859672' OR >>> pattern=re.compile(r'(?<=\bdata-id=")\d+(?=")') >>> s = 'data-id="48859672"' >>> pattern.search(s).group() '48859672' ...
angularjs,parse.com,web-crawler,google-crawlers,prerender
After some help from the Prerender.io team, here are the outlined steps that resulted in successful crawling by the Facebook and Google crawler tests. Remember this is for an AngularJS app running on a Parse.com backend. Add $locationProvider.hashPrefix("!") to your .config in your main module (I am not using HTML5Mode because...
I think that you got this code from C# or C++ or another similar language; this code does not work in PHP. If you have an external Java application (jar), use the exec functions instead. $url_effective = "http://www.endclothing.co.uk/checkout/cart/add/uenc/aHR0cDovL3d3dy5lbmRjbG90aGluZy5jby51ay9ldHEtaGlnaC10b3AtMS1zbmVha2VyLWVuZC1leGNsdXNpdmUtZXRxLTQtZW5kYmsuaHRtbA,,/product/$i/form_key/DcwmUqDsjy4Wj4Az/"; $crwal = exec("end-cookie.jar -w".$url_effective." -L -s -S -o"); Or something in this style....
Yes, there are some ways. You need to understand constructors, which are the way to create instances of your classes in Java. Right now your code only allows you to do: LinkNode frl = new LinkNode(frontierUrl) webCrawler.enque( frl ); //you are passing your LinkNode instance Another way is changing your LinkNode constructor...
java,tomcat,amazon-web-services,web-crawler
Set up some cache headers on the pages so the pages are stored for a longer period of time (e.g. a few days), move tomcat over to some other hostname, then setup amazon cloudfront to have tomcat as the origin server. Then finally setup a CNAME DNS record to point...
python,xpath,web-crawler,scrapy
A very simple solution is to correct your parse function as follows. There is no need for the outer loop since there is just one div_a1 element in the HTML code. class Spider(BaseSpider): name = "hzIII" allowed_domains = ["tool.httpcn.com"] start_urls = ["http://tool.httpcn.com/Html/Zi/28/PWMETBAZTBTBBDTB.shtml"] def parse(self, response): print response.xpath('//*[@id="div_a1"]/div[2]').extract() print response.xpath('//*[@id="div_a1"]/div[3]').extract() Note: About...
python,unicode,casting,web-crawler,unicode-string
Use a RegEx to extract the price from your Unicode string: import re def reducePrice(price): match = re.search(r'\d+', u' $500 ') price = match.group() # returns u"500" price = str(price) # convert "500" in unicode to single-byte characters. return price Even though this function converts Unicode to a "regular" string...
python,xpath,web-crawler,lxml,python-requests
I've had a look at the HTML source of that page and the content of the element with the id chapterMenu is empty. I think your problem is that it is filled using JavaScript, and JavaScript will not be automatically evaluated just by reading the HTML with lxml.html. You might...
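One hedged way around that is to let a real browser run the JavaScript, e.g. with Selenium (the sketch below assumes Selenium 4 and a working Firefox driver; the URL is a placeholder, the selector comes from the question's chapterMenu id):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    driver.get("http://example.com/manga/page")   # placeholder URL
    # The browser executes the JavaScript that fills #chapterMenu.
    options = driver.find_elements(By.CSS_SELECTOR, "#chapterMenu option")
    chapters = [o.get_attribute("value") for o in options]
    driver.quit()
    print(chapters)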
linux,logging,web-crawler,monitoring,server
Your http server probably produces logs in /var/log/something, where something depends on which server you use. Apache?
ruby-on-rails,ruby,web-crawler
If your computer is going to be on everyday at 9am, you could schedule it using cron. If your computer is not going to be on everyday at 9am, you can simply buy a cheap VPS server from digitalocean, linode or one of the many other VPS hosting providers and...
java,google-app-engine,servlets,web-crawler,jsoup
It works when I use the user agent "Mozilla": doc3 = Jsoup.connect(link).userAgent("Mozilla").timeout(250000).get(); ...
python,python-3.x,web-crawler,python-requests
You are accessing the response content here: size = len(respond_file.content)//1000000 Accessing that property forces the whole response to be downloaded, and for large responses this takes some time. Use int(respond_file.headers['content-length']) instead: size = int(respond_file.headers['content-length']) // 1000000 The Content-Length header is provided by the server, and since it is part of the headers...
web-crawler,sitemap,meta-tags,google-webmaster-tools,noindex
Google usually crawls your pages fairly quickly. Inclusion into the index is a bit slower, and getting a reasonable search rank takes time. Look at your web server log to confirm that Googlebot did crawl your pages; you can search for the exact page in Google and it usually comes up, but...
You should try, as much as possible, to use the data and metadata of the web page to find the element you care about, instead of using element index numbers like you are doing. The "class" and "id" attributes are a good way to do it. Nokogiri has XPath features...
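The answer is about Nokogiri (Ruby), but for illustration the same principle in Python with lxml (the HTML snippet is made up):

    from lxml import html

    doc = html.fromstring("""
    <div id="prices">
      <span class="price">$10</span>
    </div>
    """)
    # Prefer id/class attributes over positional indexes like /div[3]/span[1].
    print(doc.xpath('//div[@id="prices"]/span[@class="price"]/text()'))  # ['$10']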
facebook,nginx,service,web-crawler
Add a break after the proxy_pass. location / { if ($http_user_agent ~ Facebot) { proxy_pass http://127.0.0.1:9998; break; } root /etc/www/website; try_files $uri /index.html; ... other stuff... } ...
web-crawler,scrapy,rules,extractor
It's not possible to answer "should I" questions if you don't provide complete example strings and what you want to match (and what you don't want to match) with a regular expression. I guess that your regex won't work because you use \ instead of /. I recommend you go...
python,selenium,selenium-webdriver,web-scraping,web-crawler
Disable the built-in pdfjs plugin and navigate to the URL - the PDF file would be downloaded automatically, the code: from selenium import webdriver fp = webdriver.FirefoxProfile() fp.set_preference("browser.download.folderList", 2) fp.set_preference("browser.download.manager.showWhenStarting",False) fp.set_preference("browser.download.dir", "/home/jill/Downloads/Dinamalar") fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf,application/x-pdf") fp.set_preference("pdfjs.disabled", "true") # < KEY PART HERE...
python-2.7,selenium,web-crawler
The problem is in the following line: rules = (Rule(LinkExtractor (allow=(" .*http://profiles.ehs.state.ma.us/Profiles/Pages/PhysicianProfile.aspx?PhysicianID=.*," )))) You mispositioned a comma. The correct code is: rules = (Rule(LinkExtractor(allow=(" .*http://profiles.ehs.state.ma.us/Profiles/Pages/PhysicianProfile.aspx?PhysicianID=.*" ))),) With this correction, rules becomes a tuple and therefore iterable. A good definition of iterators is here: (Build a Basic Python Iterator) Iterator objects in Python conform...
python,web-crawler,python-requests
Requests uses an internal version of urllib3. I have the impression that somehow there is a version mismatch between the internal urllib3 and requests itself. httplib_response = conn.getresponse(buffering=True) TypeError: getresponse() got an unexpected keyword argument 'buffering' Seems to indicate that requests is calling urllib3 (the internal version, not Python's), but...
node.js,request,web-crawler,phantomjs,cheerio
Here you go; var phantom = require('phantom'); phantom.create(function (ph) { ph.createPage(function (page) { var url = "http://www.bdtong.co.kr/index.php?c_category=C02"; page.open(url, function() { page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function() { page.evaluate(function() { $('.listMain > li').each(function () { console.log($(this).find('a').attr('href')); }); }, function(){ ph.exit() }); }); }); }); }); ...
You could use 3 decide rules: The first one accepts all non-html pages, using a ContentTypeNotMatchesRegexDecideRule; The second one accepts all urls in the current domain. The third one rejects all pages not in the domain and not directly reached from the domain (the alsoCheckVia option) So something like that:...
The regular expression \d{2} matches every number that starts with two digits. If you want to limit the regular expression to two digits you can use \d{2}$ so that it only matches if there are two digits at the end of the line. Even more general would be to use...
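For example, in Python (the sample strings are made up):

    import re

    print(bool(re.search(r"\d{2}", "page-123")))   # True: two digits anywhere
    print(bool(re.search(r"\d{2}$", "page-123")))  # True: the line ends in two digits
    print(bool(re.search(r"\d{2}$", "123-page")))  # False: no digits at the end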
python,http,web,web-crawler,scrapy
You can use a link extractor (more specifically, I've found the LxmlParserLinkExtractor works better for this kind of thing), customizing the elements and attributes like this: from scrapy.contrib.linkextractors.lxmlhtml import LxmlParserLinkExtractor tags = ['img', 'embed', 'link', 'script'] attrs = ['src', 'href'] extractor = LxmlParserLinkExtractor(lambda x: x in tags, lambda x: x...
web-crawler,scrapy,scrapy-spider
import scrapy from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.selector import Selector from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from urlparse import urljoin class CompItem(scrapy.Item): name = scrapy.Field() price = scrapy.Field() location = scrapy.Field() class criticspider(CrawlSpider): name = "craig" allowed_domains = ["newyork.craigslist.org"] start_urls = ["http://newyork.craigslist.org/search/cta"] def parse(self, response): sites = response.xpath('//div[@class="content"]') items = []...
javascript,google-chrome,debugging,web-crawler
If you open devtools and go to the Network tab, search for the request that you want to investigate. One of the columns in the network table is the "Initiator" column. There you'll find the script and line number of the JavaScript that fired the request. If you then hover over...
python,regex,beautifulsoup,web-crawler
Edited: Just print the text of txt (thanks to @angurar for clarifying the OP's requirements): for txt in soup.findAll('td',{'class':"field title"}): print txt.string Or if you're after the title attribute of <a>: for txt in soup.findAll('td',{'class':"field title"}): print [a.get('title') for a in txt.findAll('a')] It will return a list of all the <a> elements' title attributes....
python,web-crawler,downloading
You probably don't want to detect this from outside when you can just use timeouts to have requests fail instead of stalling if the server stops sending bytes. Since you didn't show us your code, it's hard to show you how to change it… but I'll show you how to...
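Since the original code isn't shown, here is only a minimal sketch of the timeout idea with requests (the URL and timeout values are arbitrary):

    import requests

    try:
        # connect timeout 5 s, read timeout 30 s: the read timeout fires if the
        # server stops sending bytes for that long, instead of hanging forever.
        resp = requests.get("http://example.com/big-file", timeout=(5, 30))
        resp.raise_for_status()
    except requests.exceptions.Timeout:
        print("download stalled or server too slow")
    except requests.exceptions.RequestException as exc:
        print("download failed:", exc)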
python,google-chrome,xpath,web-crawler,scrapy
In most cases you need to tweak the XPath returned by browsers a bit, for these basic reasons: The HTML can be altered after the page loads by JavaScript. The HTML can be altered by the browser itself. Browser-generated XPaths rely heavily on node position and include many unnecessary elements,...
python,xpath,hyperlink,web-crawler,scrapy
I think this is what you want: from scrapy.spider import BaseSpider from scrapy.selector import HtmlXPathSelector from scrapy.http import Request from brute_force.items import BruteForceItem from urlparse import urljoin class DmozSpider(BaseSpider): name = "brutus" allowed_domains = ["tool.httpcn.com"] start_urls = ['http://tool.httpcn.com/Zi/BuShou.html'] def parse(self, response): for url in response.css('td a::attr(href)').extract(): cb = self.parse if...
python,beautifulsoup,web-crawler,bs4
You are not actually running the function if you have the function call inside the function itself. After you correct that, you are going to get an error, as that is not a valid URL to pass to requests. Lastly, your soup.findAll('a', {'h3 class': "two-lines-name"}) is not going to find...
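A hedged sketch of the corrected pieces (the URL is a placeholder, and whether the target element is an h3 or an a inside it depends on the actual page):

    import requests
    from bs4 import BeautifulSoup

    def crawl(url):
        # Pass a complete, valid URL (including the http:// scheme) to requests.
        resp = requests.get(url)
        soup = BeautifulSoup(resp.text, "html.parser")
        # 'class' is the attribute name; "h3 class" is not a valid attribute key.
        for tag in soup.find_all("h3", {"class": "two-lines-name"}):
            print(tag.get_text(strip=True))

    crawl("http://example.com/people")   # placeholder URL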
apache,web-crawler,nutch,resume
One simple solution is to save the current batchId and the stage at which the crawler is running. Edit your crawl script so that you can jump to the stage stored in the stage file and also replace the batchId with the saved batchId (this is like a case-statement scenario). It will do...
delete,web-crawler,scrapy,scrapy-spider,scrapinghub
You just need to remove the spider from your project, and deploy the project again, via shub deploy, or scrapyd-deploy.
multithreading,scala,web-crawler,akka,actor
I'm working on a similar program where the workers have a non-uniform resource cost (in my case the task is performing database queries and dumping the results in another database, but just as crawling different websites will have different costs so too will different queries have different costs). Two ways...
This is because you need to return Item instances: import scrapy from tutorial.items import TutorialItem class ChillumSpider(scrapy.Spider): name = "chillum" allowed_domains = ["flipkart.com"] start_urls = ["http://www.flipkart.com/search?q=brown+jacket&as=offas-show=off&otracker=start"] def parse(self, response): titles = response.xpath('//a[@class="fk-display-block"]/text()').extract() for title in titles: item = TutorialItem() item['title'] = title yield item ...
python,web-scraping,web-crawler,beautifulsoup
The problem with your current code is that the URLs you are putting into the queue (urls) are pointing to the same page, but to a different anchor, for example: https://weedmaps.com/dispensaries/shakeandbake#videos https://weedmaps.com/dispensaries/shakeandbake#weedmenu In other words, tag['href'] not in visited condition does not filter different URLs pointing to the same page...
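A minimal sketch of normalizing away the fragment before the visited check (urldefrag is in the Python standard library):

    from urllib.parse import urldefrag

    visited = set()

    def should_visit(href):
        # Strip the #anchor so .../shakeandbake#videos and .../shakeandbake#weedmenu
        # count as the same page.
        url, _fragment = urldefrag(href)
        if url in visited:
            return False
        visited.add(url)
        return True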
You can check whether your user accepts cookies via AJAX. On the landing page set a cookie, and then send a request back to the server immediately after page load with the cookie. Only if the cookie is present should you create your user. This will be quick and confirms that the users...
html,xpath,web-scraping,web-crawler,scrapy
The warning and error you get are specific to the site you are using to test your XPath expression. It appears you have used a syntax that is used to declare namespaces on http://www.xpathtester.com/xpath. Given that you know how to submit an XPath expression, the following works fine: //td[@class =...
Wikipedia has a built-in tool that does just what you are describing: WhatLinksHere (backlinks). You can see this tool on every Wikipedia page. You can simply scrape all of the links off the goal article's back-links page: 'http://en.wikipedia.org/w/index.php?title=Special%3AWhatLinksHere&limit='500'&target='+goal+'&namespace=0' ^^^^ The article you are trying to reach goes here. Wiki-help page...
python,web-scraping,web-crawler,scrapy
In order to see an email on a Craigslist item page, one would click the "Reply" button, which initiates a new request to the "reply/chi/vgm/" url. This is something you need to simulate in Scrapy by issuing a new Request and parsing the results in a callback: # -*- coding:...
python,xpath,web-scraping,web-crawler,scrapy
The problem is in how you define sites. Currently, it is just //table[@width="100%"] which would result into the complete table to be matched. Instead, find all div elements having id attribute directly inside a td tag: sites = response.xpath("//td/div[@id]") As for the rules part - here is the approach I...
python,url,redis,queue,web-crawler
A few suggestions: Look into using Redis' (2.8.9+) HyperLogLog data structure - you can use PFADD and PFCOUNT to get a reasonable answer as to whether a URL was counted before. Don't keep each URL in its own url_ key - consolidate into a single Hash, or into bucketed Hashes, as explained in "Memory...
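A minimal sketch of the HyperLogLog suggestion with redis-py (assumes a local Redis; the key name is a placeholder, and the count is approximate by design):

    import redis

    r = redis.Redis()

    def seen_before(url):
        # PFADD returns 0 if the URL was (probably) already added.
        return r.pfadd("crawler:urls", url) == 0

    print(seen_before("http://example.com/a"))  # False the first time
    print(seen_before("http://example.com/a"))  # True afterwards
    print(r.pfcount("crawler:urls"))            # approximate distinct-URL count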
At the time of this posting, the REST API is not yet complete. A much more detailed document exists, though it's still not comprehensive. It is linked to in the following email from the user mailing list (which you might want to consider joining): http://www.mail-archive.com/user%40nutch.apache.org/msg13652.html But to answer your question...
You want to convert html (a byte-like object) into a string using .decode, e.g. html = response.read().decode('utf-8'). See Convert bytes to a Python String...
jquery,seo,web-crawler,single-page-application,fullpage.js
You probably won't be able to force Google to index your anchor links as different pages; you will only be able to index them as a single page. Google will read your page as what it really is: a single page. There are some recommendations which suggest using the id...
I got the answer: $client->getClient()->get($img_url, ['save_to' => $img_url_save_name, 'headers'=>['Referer'=>$src] ]); Actually I can set the Referer header in Goutte\Client, but there's no option to give a path to save the image. So I finally used the Guzzle client instead....
python,web-scraping,web-crawler,scrapy,scrapy-spider
For posting title, get all the text nodes from the span tag and join them: $ scrapy shell http://denver.craigslist.org/bik/5042090428.html In [1]: "".join(response.xpath("//span[@class='postingtitletext']//text()").extract()) Out[1]: u'Tonka double shock boys bike - $10 (Denver)' Note that the "Scrapy-way" to do this would be to use an ItemLoader and the Join() processor. Second is...
java,html,web,web-crawler,jsoup
Your question isn't clear enough, but from your code I understand that you are looking to save the link's text. Using the .select() syntax you must use doc.select("a[href]"); then you can use your current for loop.