Using robots.txt to Prevent Search Engines From Indexing Your Files

Updated on March 3, 2018
By Ronald E Franklin (RonElFran)

Ron is a retired engineer and manager for IBM and other high tech companies. He specialized in both hardware and software design.


As a writer who contributes articles to various sites around the internet, I wanted to set up an online archive for my work. This would be a repository to which I could give others access as needed - for example, to establish authorship in copyright infringement (DMCA) cases. At the same time, to avoid having duplicate files with the same content show up in search results, I needed to prevent the files in my archive from being indexed by search engines such as Google or Bing.

A little research showed that by using a robots.txt file, I could inform search engines that they should not index certain items on my website. It’s a simple and easy solution that does exactly what I need it to do. But in setting up my robots.txt file, I ran into some issues that were not addressed in the documentation I had read, and which required some time and head-scratching to figure out through trial and error.

That’s why I thought it might be useful to provide a simple guide that might save someone else from having to struggle with the issues I did.

What is robots.txt?

Search engines use applications called “robots” to “crawl” the entire internet, searching out online files and adding them to a database. When a user enters a search term into Google, for example, that query is matched against Google’s database of websites it has crawled. It is from that internal database that a list of search results is produced for the user.

The robots.txt file is used to essentially put up a KEEP OUT sign for files on your website that you don’t want search engine robots to see. Since the robot skips those files, they won’t be indexed in the search engine’s database, and they won’t show up in search results.

Reputable search engines all program their robots to look for a robots.txt file on every website they visit. If that file exists, the robot will follow its instructions regarding any files or folders it should skip.

(Take note that this is all entirely voluntary on the search engine’s part. Rogue crawlers can and do ignore the instructions in robots.txt. In fact, some bad actors may actually be attracted to the parts of your website robots.txt says to avoid, on the theory that if you want to hide something, there might be something there they can exploit.)

How to Set Up a robots.txt File

I’m going to describe how I set up my robots.txt file to address my specific need; more general descriptions of the various ways robots.txt can be used are widely available online.

Note that to use this method you must have your own website with its own domain name.

Using robots.txt to restrict access to your files only works if you have your own website with its own domain name. That's because the robots.txt file can only reside in the top level directory of your web site, and you'll only be able to make changes to that directory if you own the site.

For example, if your web site is

http://www.myownwebsite.com

then the robots.txt file must be located at

http://www.myownwebsite.com/robots.txt

If you put your robots.txt file anywhere else on the site, it won’t be recognized. For example, if you put your robots.txt into a folder called mygoodstuff,

http://www.myownwebsite.com/mygoodstuff/robots.txt

or into a subdomain such as

http://mygoodstuff.myownwebsite.com/robots.txt

web crawling robots will not apply it to your main site. A robots.txt file governs only the host it sits on.
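The root-location rule above can be sketched in a few lines of Python (using the same placeholder domain as the examples): crawlers derive the robots.txt location from the scheme and host alone, discarding any path.

```python
from urllib.parse import urlsplit

def robots_txt_location(url: str) -> str:
    """Return the only URL where crawlers will look for robots.txt:
    the root of the scheme + host, with any path discarded."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"

# No matter how deep the page, the robots.txt location is the site root:
print(robots_txt_location("http://www.myownwebsite.com/mygoodstuff/page.html"))
# http://www.myownwebsite.com/robots.txt
```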

Because of that restriction, you can't do this with a free WordPress site such as https://myfreewebsite.wordpress.com. You can see the robots.txt file for wordpress.com (https://wordpress.com/robots.txt), but you can't change it.

(If you'd like to view the wordpress.com robots.txt file, just enter https://wordpress.com/robots.txt into the URL field of your browser and press Enter. You'll be able to see the contents of the file, but you won't be able to modify it).

Also note that capitalization matters! The file name must be robots.txt and nothing else. ROBOTS.TXT or Robots.Txt won’t work.


The Contents of a robots.txt File

Here’s what the contents of a typical robots.txt file might look like:

User-agent: *

Disallow: /folder-to-ignore/

The User-agent line specifies which crawlers the rules that follow apply to. The * in the above example signifies that they apply to all crawlers. If you only want your instructions to apply to Google, for example, you would address its crawler by name, Googlebot:

User-agent: Googlebot

Disallow: /folder-to-ignore/

This would restrict only Google's crawler, and not any other search engine's, from accessing the folders or files you list.

The Disallow term specifies which folders or files the robot should not crawl or index. In the example above, I don't want the contents of a folder called folder-to-ignore to be indexed by search engines. So my Disallow statement instructs web crawlers to ignore the following URL:

http://www.myownwebsite.com/folder-to-ignore/

Multiple folders or files can be specified:

User-agent: *

Disallow: /folder-to-ignore/

Disallow: /another-folder/

Disallow: /third-folder/subfolder/

Disallow: /some-folder/myfile.html
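Rules like these can be checked offline with Python's standard-library robots.txt parser. This is just a sketch using the hypothetical folder names from the example above, and it follows the original robots exclusion standard rather than any one search engine's extensions:

```python
import urllib.robotparser

# The example rules from above, as a string instead of a live file.
rules = """\
User-agent: *
Disallow: /folder-to-ignore/
Disallow: /another-folder/
Disallow: /third-folder/subfolder/
Disallow: /some-folder/myfile.html
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# Listed paths are disallowed for every user agent...
print(parser.can_fetch("*", "http://www.myownwebsite.com/folder-to-ignore/doc.html"))  # False
# ...while everything else remains crawlable.
print(parser.can_fetch("*", "http://www.myownwebsite.com/public/doc.html"))  # True
```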

Creating a robots.txt File

Any text editor, such as Notepad in Windows, may be used to create a robots.txt file. Note that if a word processor, such as Microsoft Word, is used, the output must be saved as a plain .txt file. Otherwise, the file may contain hidden formatting codes that will invalidate its contents.

Once saved as text, the file must be uploaded to the top level directory of your website. On most servers, that will be the public_html folder.

Upload robots.txt in exactly the same way you normally upload files to the site. In most cases that will involve using an FTP app such as the free, open-source FileZilla client. Make sure the file is placed in the proper folder.

VIDEO: How to Create a robots.txt File

Testing Your robots.txt File

It’s very important to test your robots.txt setup to ensure that it’s working as you intend. Otherwise you may find that the folders you wanted blocked are still accessible to crawlers and are showing up in search results. Once that happens, it could take weeks or even months to get them removed from the search engine’s database.

Several free robots.txt testers are available on the web. Here are the ones I used:

Google’s Webmaster Tools robots.txt tester (requires a Google account)

http://www.searchenginepromotionhelp.com/m/robots-text-tester/robots-checker.php

The GOTCHAs That Got Me!

Google was unable to see my robots.txt file

I set up my robots.txt file to block a folder called /YCN Archive/. I created that folder on my website and verified that it could be accessed as expected.

I then created a robots.txt file with the following contents:

User-agent: *

Disallow: /YCN Archive/

After uploading this file to my top-level directory, I tested it using the robots.txt tester in Google’s Webmaster Tools. Although I carefully followed the directions given via the Webmaster Tools link above, I immediately ran into a problem. Here’s the totally unexpected error message I got:

[Screenshot: the robots.txt tester's error message]

But the robots.txt file was there! I could see it in my website’s file listing, exactly where it was supposed to be. Why couldn’t Google see it? Eventually I saw something on the tester page I hadn’t noticed before:

[Screenshot: the tester page showing “Latest version seen on 7/26/14”]

The key was in the line that says, “Latest version seen on 7/26/14 …” (I was doing the test several days after 7/26). When I initiated the test, it seems that Google didn’t go out and look at the state of the website at that moment, but apparently relied on its internal picture of what the website looked like the last time it crawled it.

I needed Google to have a current picture of what was on my website. I caused that to happen by using the Fetch as Google function:

[Screenshot: the Fetch as Google function in Webmaster Tools]

Once the Fetch as Google function was performed, Google was able to find the robots.txt file.

Here’s another point to be careful of. In the robots.txt tester, Google listed my website two different ways:

myownwebsite.org

http://myownwebsite.org

Of course both those entries refer to exactly the same URL. But I had to do individual Google fetches for each to have the robots.txt file recognized. I also did separate tests on each in order to make sure my blocking instructions would be carried out no matter which URL was used to access the site.

My robots.txt File Didn’t Work!

Now that Google could see my robots.txt file, I ran the test, confident of success. It still didn’t work. This time, the test reported that although my robots.txt was now recognized, it was not blocking access to the /YCN Archive/ folder. Web crawler access to that folder was still "ALLOWED."

[Screenshot: the tester reporting crawler access to the folder as ALLOWED]

No Spaces Allowed in the Disallowed Folder or File’s Name

I knew my robots.txt was set up correctly, so it baffled me why it was not blocking access to the specified folder. It took me some time to figure out what was going on. My folder had a space in the name! When I renamed the folder to remove the space, the Google robots.txt tester showed the folder as blocked.
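A possible alternative, which I did not try: Google's robots.txt documentation says paths are matched in percent-encoded (RFC 3986) form, so percent-encoding the space instead of removing it may also work:

```
User-agent: *
Disallow: /YCN%20Archive/
```

Renaming the folder to remove the space, as I did, remains the surest fix.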

[Screenshot: the tester showing the renamed folder as blocked]

robots.txt Does Its Job

Since I put my robots.txt file in place, it’s done its job silently and efficiently. My files are safely archived online and can be accessed by anyone to whom I give the URL, but none of them show up in search engine results.

    © 2014 Ronald E Franklin

    Comments


      • Ronald E Franklin (author), 3 years ago from Mechanicsburg, PA

        Hi, kislanyk. It's interesting about Wordpress. I looked at my Wordpress robots.txt, but didn't realize it's virtual. For me that's fine; I wasn't wanting to change it anyway. Thanks for reading and sharing.

      • Marika (kislanyk), 3 years ago from Cyprus

        I use this trick all the time, and on some of my sites I tweak it as I go along (in my cPanel). Great tip!

        Btw something I've noticed recently - if you have a WordPress blog, the robots.txt file is virtual, meaning it doesn't actually exist. Took me quite some time to figure it out once when I needed to change...kind of sucked, but oh well...

      • Ronald E Franklin (author), 3 years ago from Mechanicsburg, PA

        Thanks, Mel. Actually this was prompted by the Yahoo Voices shutdown, plus the need to have proof of authorship of many of my Yahoo articles that had been pirated. So, a writer may suddenly need this kind of info at any time.

      • Mel Carriere, 3 years ago from San Diego, California

        I don't think I see the need to do this at this point. I'm busy at this juncture trying to get search engines to find me more. Nonetheless this is great info to store for future use and it was very competently presented!

      • Ronald E Franklin (author), 3 years ago from Mechanicsburg, PA

        Thanks, Rachael. Interestingly, while setting up my website's robots.txt, I also looked in Google Webmaster Tools at the robots.txt of my HubPages subdomain. It's there, though of course we don't have access to change it.

      • Rachael O'Halloran, 3 years ago from United States

        This is great information. I can't use it here on HubPages, but I can use it on my blogs. Thanks!

