Google still indexing unique URLs

indexing,robots.txt,google-webmaster-tools

Robots.txt tells search engines not to crawl the page, but it does not stop them from indexing the page, especially if there are links to the page from other sites. If your main goal is to guarantee that these pages never wind up in search results, you should use robots...
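
The truncated recommendation presumably continues with the robots meta tag; as a hedged illustration, a page you want kept out of search results (while still reachable by the crawler so the tag can be read) would carry this in its <head>:

  <!-- page-level directive: do not index this page -->
  <meta name="robots" content="noindex">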

What is a 'disallowed entry' when nmap scans through the Robots.txt file?

robots.txt,nmap

It is an entry in the robots.txt file of the form Disallow: X. This means that at least some User-Agent has been instructed not to request these URIs. A list of them should appear below the line you showed. You can read more about the robots.txt standard and format...

How to block a spider if it's disobeying the rules of robots.txt

php,robots.txt

There are ways to prevent most bots from spidering your site. Aside from filtering by user agent and known IP addresses, you should also implement behaviour-driven blocking. That means: if it acts like a crawler, block it. You can find multiple lists of search engine bots here. But...
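
As a sketch of the user-agent filtering the answer mentions, done at the Apache level rather than in PHP, a hypothetical .htaccess fragment could look like this (the bot names are placeholders, not part of the original answer):

  RewriteEngine On
  # return 403 Forbidden to clients whose User-Agent matches known bad bots
  RewriteCond %{HTTP_USER_AGENT} (badbot|evilcrawler) [NC]
  RewriteRule .* - [F,L]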

Block 1 out of 2 domains only from search engines

regex,apache,.htaccess,mod-rewrite,robots.txt

Create a file called robots2.txt with this code: User-agent: * Disallow: / Then put this rule in your DOCUMENT_ROOT/.htaccess file: RewriteEngine On RewriteCond %{HTTP_HOST} ^(www\.)?site2\.com$ [NC] RewriteRule ^robots\.txt$ /robots2.txt [L,NC] This will serve /robots2.txt for all /robots.txt requests for site2. Regular /robots.txt will be used for site1. Only site2 will...
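
Laid out more readably, the two pieces from the excerpt are:

  # robots2.txt
  User-agent: *
  Disallow: /

  # DOCUMENT_ROOT/.htaccess
  RewriteEngine On
  RewriteCond %{HTTP_HOST} ^(www\.)?site2\.com$ [NC]
  RewriteRule ^robots\.txt$ /robots2.txt [L,NC]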

What do you do with Django SuspiciousOperations?

django,spam,robots.txt

First, this is OK. I wouldn't call such emails 'false positives' (someone is probably actually scanning you for vulnerabilities), but on the public Internet such scanning happens all the time, in which case these error reports are just noise. Noise is still an issue, though, since amid it you may not notice more legitimate...

Prevent Google indexing an AngularJS route

javascript,html,angularjs,seo,robots.txt

If your root module is placed on the <html> tag (<html ng-app="myApp">), you can modify all properties in the <head>. That allows you to dynamically set the robots <meta> for each page. You can do that with the $routeChangeSuccess event in your root module. If you are using ui-router, you...
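
A minimal sketch of that approach with ngRoute, assuming each route definition carries a custom robots property (the module name and the property are hypothetical, not from the original answer):

  // run block in the root module: update the robots <meta> on every route change
  angular.module('myApp').run(['$rootScope', function ($rootScope) {
    $rootScope.$on('$routeChangeSuccess', function (event, current) {
      var content = (current && current.robots) || 'index, follow';
      var meta = document.querySelector('meta[name="robots"]');
      if (!meta) {
        meta = document.createElement('meta');
        meta.setAttribute('name', 'robots');
        document.head.appendChild(meta);
      }
      meta.setAttribute('content', content);
    });
  }]);

A route that should stay out of the index could then declare robots: 'noindex' in its $routeProvider.when() definition.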

Robots.txt: disallow a folder's name, regardless of the depth at which it may show up

web-services,robots.txt

The robots.txt specification doesn't say anything about wildcards, but Google (see the Google Robots.txt Specifications) and Bing allow the use of wildcards in robots.txt files. Disallow: */ajax/* Your Disallow is valid for all /ajax/ URLs, no matter the nesting level of /ajax/....
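
As a complete record (for parsers that support wildcards, such as Google's and Bing's), the excerpt's rule would read as below; the trailing * is redundant but harmless:

  User-agent: *
  Disallow: */ajax/*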

Magento CE 1.9.0.1 robots.txt not showing when called

magento,windows-ce,robots.txt

I can see your robots.txt. Clear your browser's cache and remove all cookies.

Wordpress - customized pages with blocks - prohibit google seo index of blocks

wordpress,seo,woocommerce,robots.txt,google-sitemap

Noindex tags would be useful. https://support.google.com/webmasters/answer/93710?hl=en

robots.txt allow all except few sub-directories

seo,search-engine,cpanel,robots.txt,shared-hosting

No, this is wrong. You can’t have a robots.txt in a sub-directory. Your robots.txt must be placed in the document root of your host. If you want to disallow crawling of URLs whose paths begin with /foo, use this record in your robots.txt (http://example.com/robots.txt): User-agent: * Disallow: /foo This allows...
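
For instance, an allow-everything-except-a-few-sub-directories robots.txt at http://example.com/robots.txt could look like this (the directory names beyond /foo are hypothetical):

  User-agent: *
  Disallow: /foo
  Disallow: /bar
  Disallow: /baz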

Robots.txt for WordPress when not in root directory

wordpress,robots.txt

For example: if your site is reachable at www.example.com from the directory /var/www/ and your WordPress blog is under /var/www/newsite/, then put the robots.txt in /var/www/ and adjust the folders in there: Disallow: /newsite/wp-admin/ Disallow: /newsite/wp-includes/ Disallow: /newsite/wp-content/plugins/ Why /newsite/wp-admin/ and not /var/www/newsite/wp-admin/, for example? The directory is relative to the URI...
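
As a complete file (with a User-agent line added, which the excerpt omits), the robots.txt in /var/www/ would be:

  User-agent: *
  Disallow: /newsite/wp-admin/
  Disallow: /newsite/wp-includes/
  Disallow: /newsite/wp-content/plugins/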

Sitemaps for country domains which are directories

sitemap,robots.txt

You should gather all your sitemaps into a single .XML file and then submit it to Google Webmaster Tools. If you want your sitemap to be indexed by other search engines (without manual submission), name your XML sitemap sitemap.xml and place it in the site's root, e.g.: site.com/sitemap.xml
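
"A single .XML file" here is presumably a sitemap index; a minimal sketch of one (the URLs are placeholders):

  <?xml version="1.0" encoding="UTF-8"?>
  <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
      <loc>http://site.com/en/sitemap.xml</loc>
    </sitemap>
    <sitemap>
      <loc>http://site.com/de/sitemap.xml</loc>
    </sitemap>
  </sitemapindex>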

Mule : No receiver found on secondary lookup of receiver on connector: HTTP_HTTPS with URI key: https://myhost:443/robots.txt

mule,robots.txt

The issue you're having can be reproduced through the following steps: add an inbound endpoint that has the URL http://localhost:8081/test, start your app, and call http://localhost:8081/something.txt. The explanation: there is no inbound endpoint that matches the beginning of the URL you're calling. The easiest solution is to have...

Robots.txt: Disallow repeated subdirectories but allow main directories

robots.txt

You were almost there in your question! # User agent that should be disallowed, '*' is for 'all' User-agent: * Disallow: /*/media # A less restrictive rule that would also work: # Disallow: /dir*/media In general search engines do want to see every resource that might be referenced from your...

Prevent search engines from spidering my site in .NET MVC?

.net,asp.net-mvc,robots.txt,asp.net-mvc-5.1

Yes, and place it in the root directory of your application. Here is some more info: https://support.microsoft.com/en-us/kb/217103...

robots.txt: Site still not showing up in Google

robots.txt

Your robots.txt is not doing what you want (but that’s not related to the problem you mention). If you want to disallow crawling for every bot except "googlebot", you want to use this robots.txt: User-agent: googlebot Disallow: User-agent: * Disallow: / Disallow: / means: disallow every URL Disallow: means: disallow...
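
Laid out, the robots.txt from the excerpt:

  # googlebot may crawl everything (an empty Disallow disallows nothing)
  User-agent: googlebot
  Disallow:

  # every other bot may crawl nothing
  User-agent: *
  Disallow: /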

How to attach Sitecore context for a controller action mapped to the route robots.txt?

sitecore,robots.txt,sitecore-mvc,sitecore8

OK, I found the issue. I was correct in assuming that txt needed to be added to the allowed extensions for the Sitecore.Pipelines.PreprocessRequest.FilterUrlExtensions setting. However, robots.txt was listed under the IgnoreUrlPrefixes setting in the config file. That was causing Sitecore to ignore that request. I removed it from that list...

mod_rewrite http/https for robots.txt

.htaccess,mod-rewrite,robots.txt

You can use this rule to serve an SSL-specific robots.txt: RewriteEngine On RewriteCond %{HTTPS} on RewriteRule ^robots\.txt$ robots_ssl.txt [L,NC] ...
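
Laid out, the rule from the excerpt:

  RewriteEngine On
  # when the request came in over HTTPS, serve robots_ssl.txt instead of robots.txt
  RewriteCond %{HTTPS} on
  RewriteRule ^robots\.txt$ robots_ssl.txt [L,NC]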

Wordpress - Robots.txt allows admin login?

wordpress,seo,robots.txt

You should create your own robots.txt file and upload it to the website root directory. Follow these steps to create the file and upload it into the root folder: open Notepad and add the following text (remember to also add your website's sitemap path): sitemap: http://www.yoursite.com/sitemap.xml User-agent: * Disallow: /cgi-bin/ Disallow: /wp-admin/ Disallow:...

Noindex or disallow in robots symfony

indexing,symfony-1.4,robots.txt

Just have a look here User-agent: * Disallow: / ...

Umbraco imagegen.ashx disallowed in robots.txt causes images to be blocked from search

seo,umbraco,robots.txt

You can safely remove imagegen.ashx from the robots.txt. As far as I know it is not disallowed by default in robots.txt. Umbraco and ImageGen do not create a robots.txt by default (not in recent ImageGen versions). Images generated with Umbraco and ImageGen can just be found and displayed in Google. There is no...

Django - Loading Robots.txt through generic views

django,django-templates,django-views,robots.txt

Finally got it. I had to add a '/' in ^robots.txt$: (r'^robots\.txt/$', TemplateView.as_view(template_name='robots.txt', content_type='text/plain')), That's elementary! I presumed that APPEND_SLASH is True by default; however, on the production server it didn't work. Let me know if anyone can provide some insights on it. ...

How to verify robots.txt rules

seo,robots.txt

Your rule Disallow: /classifieds/search*/ does not do what you want it to do. First, note that the * character has no special meaning in the original robots.txt specification. But some parsers, like Google’s, use it as a wildcard for pattern matching. Assuming that you have this rule for those parsers...
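
To illustrate under Google's wildcard semantics (the example URLs are hypothetical): the trailing / in the pattern requires a slash somewhere after "search", so it matches the first URL below but not the second:

  # rule under discussion
  # Disallow: /classifieds/search*/
  #
  # matched (blocked):   /classifieds/search-results/page2
  # not matched:         /classifieds/search?q=cars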

Can disallowing the entire website in robots.txt have consequences after removal?

robots.txt

The "critical problem" occurs because Google cannot index pages on your site with your robots.txt configuration. If you're still developing the site, it is standard procedure to have this robots.txt configuration. Webmaster tools treats your site as if it was in production however it sounds like you are still developing...

Couple of questions about robots and content blocking

php,seo,bots,robots.txt

Bots don’t care about your internal server-side system (well, they can’t see it to begin with). They visit your website just like a human visitor: by following links (from your own site, from external sites, from your sitemap etc.), and some might possibly also "guess" URLs. So what matters are...

Is the beginning of a path enough in robots.txt?

robots.txt

In short, yes. If you have: User-agent: * Disallow: /abc It will block anything that starts with /abc, including: /abc /abc.html /abc/def/ghi /abcdefghi /abc?x=123 This is part of the original robots.txt standard, and it applies to all robots that obey robots.txt. The thing to remember about robots.txt is that it's...

Disallow google robot from robots.txt and list sitemap instead

html,sitemap,robots.txt

Google indexes only by crawling... The best thing for you to do is to disable the geolocation script when you detect a Google robot (or another bot). You can recognize them in various ways: HTTP_USER_AGENT, HTTP_FROM, or IP...

allowing certain urls and deny the rest with robots.txt

robots.txt

Your snippet looks OK; just don't forget to add a User-agent line at the top. The order of the allow/disallow keywords doesn't matter currently, but it's up to the client to make the correct choice. See the "Order of precedence for group-member records" section in our robots.txt documentation. [...] for allow and...

Defaults for robots meta tag

html,seo,robots.txt,meta,robot

Yes: by specifying only noindex, follow still applies. More information can be found here: https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag
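
In other words, a page carrying only this tag is treated as "noindex, follow":

  <meta name="robots" content="noindex">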

Ban robots from website [closed]

bots,robots.txt,web-crawler

Based on these: https://www.projecthoneypot.org/ip_46.229.164.98 https://www.projecthoneypot.org/ip_46.229.164.100 https://www.projecthoneypot.org/ip_46.229.164.101 it looks like the bot is http://www.semrush.com/bot.html. If that's actually the robot, their page says: To remove our bot from crawling your site simply insert the following lines to your "robots.txt" file: User-agent: SemrushBot Disallow: / Of course that does not guarantee...

Disallow specific folders in robots.txt with wildcards

seo,search-engine,robots.txt,google-crawlers

You don't need wildcards at all for this. Your example will work, but it would work just as well without the wildcard. Trailing wildcards do not do anything useful. For example, this: Disallow: /x means: "Block any path that starts with '/x', followed by zero or more characters." And this:...

How to customize DNN robots.txt to allow a module specific sitemap to be crawled by search engines?

seo,dotnetnuke,robots.txt,googlebot

The proper way to do this would be to use the DNN Sitemap provider, something that is pretty darn easy to do as a module developer. I don't have a blog post/tutorial on it, but I do have sample code which can be found in http://dnnsimplearticle.codeplex.com/SourceControl/latest#cs/Providers/Sitemap/Sitemap.cs This will allow custom...

robots.txt URL patterns with @@

robots.txt

@@ has no reserved meaning in the robots.txt specification. So a line like Disallow: /@@example will disallow crawling of URLs whose path literally starts with /@@example, e.g.: http://example.com/@@example http://example.com/@@example.html http://example.com/@@example/foo If you want to disallow crawling of URLs whose path starts with /book-search, then you should use: Disallow: /book-search (without...

Robots.txt file in MVC.NET 4

asp.net,asp.net-mvc-4,seo,robots.txt

"Do robots crawl the controllers which have the [Authorization] attribute, like Administration?" If they find a link to it, they are likely to try to crawl it, but they will fail just like anyone with a web browser who does not log in. Robots have no special ability to access...

Disallow pages that end with a number only in robots.txt

robots.txt

Following the original robots.txt specification, this would work (for all conforming bots, including Google’s): User-agent: * Disallow: /blog/pages/0 Disallow: /blog/pages/1 Disallow: /blog/pages/2 Disallow: /blog/pages/3 Disallow: /blog/pages/4 Disallow: /blog/pages/5 Disallow: /blog/pages/6 Disallow: /blog/pages/7 Disallow: /blog/pages/8 Disallow: /blog/pages/9 This blocks all URLs whose path begins with /blog/pages/ followed by any number (/blog/pages/9129831823,...
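
Laid out, the record from the excerpt:

  User-agent: *
  Disallow: /blog/pages/0
  Disallow: /blog/pages/1
  Disallow: /blog/pages/2
  Disallow: /blog/pages/3
  Disallow: /blog/pages/4
  Disallow: /blog/pages/5
  Disallow: /blog/pages/6
  Disallow: /blog/pages/7
  Disallow: /blog/pages/8
  Disallow: /blog/pages/9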

What is the “unique” keyword in robots.txt?

robots.txt

It's not a keyword; it's a directory on your server that shouldn't be visited by a web crawler.

What does /*.php$ mean in robots.txt?

robots.txt

So what does it do? By spec it means "URLs starting with /*.php$", which isn't very useful. There might be engines out there which support some custom syntax for it. I know some support wildcards, but that looks like regular-expression syntax and I've not heard of anything...
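
For parsers that do support wildcards (Google's extension; an assumption about what the file's author intended), * matches any sequence of characters and $ anchors the end of the URL, so the line would behave like this (example URLs are hypothetical):

  # Disallow: /*.php$
  #
  # matched (blocked):   /index.php
  # matched (blocked):   /foo/bar.php
  # not matched:         /index.php?x=1
  # not matched:         /index.phtml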

Block bingbot from crawling my site

asp.net-mvc,bots,robots.txt,bing

This WILL definitely affect your SEO/search ranking and will cause pages to drop from the index, so please use it with care. You can block requests based on the user-agent string if you have the IIS rewrite module installed (if not, go here), and then add a rule to your web.config...
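
A sketch of such a rule for web.config, assuming the IIS URL Rewrite module is installed; the rule name and response text are illustrative, not from the original answer:

  <system.webServer>
    <rewrite>
      <rules>
        <!-- return 403 to requests whose User-Agent contains "bingbot" -->
        <rule name="BlockBingbot" stopProcessing="true">
          <match url=".*" />
          <conditions>
            <add input="{HTTP_USER_AGENT}" pattern="bingbot" />
          </conditions>
          <action type="CustomResponse" statusCode="403"
                  statusReason="Forbidden" statusDescription="Bot not allowed" />
        </rule>
      </rules>
    </rewrite>
  </system.webServer>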

How to prevent search engines from indexing a span of text?

html,web-crawler,robots.txt,googlebot,noindex

There is no way to force crawlers not to index something; it's up to their authors to decide what the crawlers do. The rule-obeying ones, like Yahoo Slurp, Googlebot, etc., each have their own rules, as you've already discovered, but it's still up to them whether to completely obey...

Robots.txt Google search results

sitemap,google-search,robots.txt

You can use the Allow keyword to give access to a URL in a Disallowed directory. Allow: /nl/sitemap.xml Disallow: /nl/ ...

where to put robots.txt for a CodeIgniter

codeigniter,robots.txt

The robots.txt file MUST be placed in the document root of the host. It will not work in other locations. If your host is example.com, it needs to be accessible at http://example.com/robots.txt....

Random IP Address Accessing my Website [closed]

http,robots.txt

Looks like somebody visited your Web page using their iPad or iPhone, and in addition to loading your home page, their browser tried to load various sizes and formats of favicons, including high-res ones. Seems pretty normal. Those apple-touch-icons are not "only on apache web servers off of Macs" and...

robots.txt disallow all with crawl-delay

robots.txt

Yes, this record would mean the same if it were reduced to this: User-agent: * Disallow: / A bot matched by this record is not allowed to crawl anything on this host (having an unneeded Crawl-delay doesn’t change this)....

Search robots pressing my button…? Can I prevent that?

asp.net,search,web,robots.txt,search-engine-bots

When bots find a POST request to some URL, they like to send a GET request there to peek around. If they like what they see, the link can get cached and you can get additional GET requests for that URL from time to time. Nasty bots don't follow robots.txt; the only way...

Ignore urls that have no parameters in robots.txt

mod-rewrite,robots.txt

To block http://example.com/category/ without blocking http://example.com/category/whatever, you can use the $ operator: User-agent: * Disallow: /category/$ The $ means "end of URL". Note that the $ operator is not supported by all web robots. It is a common extension that works for all of the major search engines, but it...

Is it fine with respect to SEO if we add the same product to 5 different categories?

magento,seo,robots.txt

As you have already implemented it, the product URL does not contain the category path, so every product has a unique URL and there are no two URLs containing the same product info and content, which is good for SEO. Like other CMSs such as WordPress, Magento also takes care of...

Different domains, different languages, same content, 1 robots.txt

robots.txt

Write a different robots.txt for each domain and use .htaccess to rewrite robots.txt requests based on the host the request came from: RewriteCond %{HTTP_HOST} ^(.*)\.com$ [NC] RewriteCond %{HTTPS}s ^on(s)| RewriteRule ^robots\.txt$ /robots-com.txt [L] RewriteCond %{HTTP_HOST} ^(.*)\.it$ [NC] RewriteCond %{HTTPS}s ^on(s)| RewriteRule ^robots\.txt$ /robots-it.txt [L] Make sure that RewriteEngine On is placed...
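
Laid out, the .htaccess from the excerpt (the RewriteEngine On line is the one the excerpt says must be present):

  RewriteEngine On

  # requests for robots.txt on the .com domain get robots-com.txt
  RewriteCond %{HTTP_HOST} ^(.*)\.com$ [NC]
  RewriteCond %{HTTPS}s ^on(s)|
  RewriteRule ^robots\.txt$ /robots-com.txt [L]

  # requests for robots.txt on the .it domain get robots-it.txt
  RewriteCond %{HTTP_HOST} ^(.*)\.it$ [NC]
  RewriteCond %{HTTPS}s ^on(s)|
  RewriteRule ^robots\.txt$ /robots-it.txt [L]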

How to disallow a list of URLs from being crawled by the Google crawler using robots.txt

url,robots.txt,googlebot

The value of the Disallow field is always the beginning of the URL path. So if your robots.txt is accessible from http://example.com/robots.txt, and it contains this line Disallow: http://example.com/admin/feedback.htm then URLs like these would be disallowed: http://example.com/http://example.com/admin/feedback.htm http://example.com/http://example.com/admin/feedback.html http://example.com/http://example.com/admin/feedback.htm_foo http://example.com/http://example.com/admin/feedback.htm/bar … So if you want to disallow the URL...

Use `robots.txt` in a multilingual site

seo,sitemap,robots.txt

You need to put one robots.txt at the top level. The robots.txt file must be in the top-level directory of the host, accessible through the appropriate protocol and port number. https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt...

Someone using our site on robots.txt

magento,website,directory,robots.txt

In a robots.txt (a simple text file) you can specify which URLs of your site should not be crawled by bots (like search engine crawlers). The location of this file is fixed so that bots always know where to find the rules: the file named robots.txt has to be placed...

How do I tell Google not to crawl a domain completely

seo,opencart,robots.txt,multistore

If you are using Apache and mod_rewrite you can add a rewrite rule to serve a different robots.txt file for xyz.com: RewriteCond %{HTTP_HOST} xyz.com$ [NC] RewriteRule ^robots.txt robots_xyz.txt [L] Then create robots_xyz.txt: User-agent: * Disallow: / ...
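
Laid out, the two pieces from the excerpt:

  # .htaccess: serve robots_xyz.txt when the host is xyz.com
  RewriteCond %{HTTP_HOST} xyz.com$ [NC]
  RewriteRule ^robots.txt robots_xyz.txt [L]

  # robots_xyz.txt: block everything
  User-agent: *
  Disallow: /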