Table of contents
- 1 Why it’s important to customise and add robots.txt
- 2 How to create a robots.txt file
- 3 Which directives should be used in the robots.txt files
- 4 Common mistakes when compiling a robots.txt file
- 5 What characters to use in the robots.txt file
- 6 How you can check and test the robots.txt file
- 7 Frequently Asked Questions:
Robots.txt is a text file that contains instructions for search engine robots. Its purpose is to tell the robots which sections and pages of a site can or cannot be indexed. Setting up a robots.txt file is included in the process of promoting a website of any topic or niche.
Without this file, search engines will scan and index everything: duplicates, sensitive data, test pages, etc.
A proper robots.txt guides search engine robots by telling them what can be indexed and what should be omitted.
Why it’s important to customise and add robots.txt
Setting up and adding a robots.txt file to a website is extremely important for several reasons:
- Indexation control: robots.txt allows you to control which pages or sections of your site are indexed by search engines. Without the file, search engine robots will crawl every available page, which can lead to unwanted content and many junk pages appearing in search results.
- Crawl optimisation: a robots.txt file helps optimise the crawling of a website. Search engine robots operate with a limited crawl budget; excluding low-value URLs lets them spend it on important pages, reduces server load, and makes crawling more efficient.
- Indexing errors: the site may contain dynamically generated pages that change based on user requests or URL parameters.
Without a proper robots.txt file, search engine robots can index every possible combination of parameters, which creates duplicate content and many junk pages in the search engine's index.
It is important to note that robots.txt does not provide absolute protection from the indexing of unwanted content. Still, a properly compiled file helps improve site indexing control immediately after website development and throughout its existence.
How to create a robots.txt file
Creating a robots.txt file is quite simple. You will need a text editor and access to the website hosting server.
Basic steps for creating a robots.txt file:
- Open a text editor (Notepad for Windows or TextEdit for Mac will do).
- Enter rules for search engine robots as required by the site.
- Save the file as robots.txt (all lower case).
- Upload the robots.txt file to the root folder of your hosted site.
Example: for example.com, the file path would be: https://www.example.com/robots.txt.
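Once uploaded, the file might look like this minimal sketch (the paths and sitemap URL here are hypothetical placeholders, not rules you should copy as-is):

```
User-agent: *
Disallow: /admin/
Sitemap: https://www.example.com/sitemap.xml
```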
Which directives should be used in the robots.txt files
The robots.txt file contains directives that robots read to understand the access rules for a website.
Directives in the robots.txt file are instructions that set access rules for search engine robots across different sections and pages of a website.
Before indexing a site, a search robot scans the robots.txt file, observes the directives specified in it, and determines which sections or files of the site can be indexed and which should be excluded.
Here are some basic directives:
- User-agent: this directive specifies which robot or group of robots the following rules apply to. You can target a single bot (e.g. User-agent: Googlebot) or apply the directives to all bots with an asterisk (User-agent: *).
- Disallow: this directive specifies the sections of the site that should not be indexed.
- Allow: shows that the robot is allowed access to the page/section of the site, and it can be indexed and displayed in search results (even if there is a general Disallow directive).
- Sitemap: With this directive, you can specify the path to the Sitemap file, which helps robots understand the site’s structure.
Important: when matching rules are of equal length, the Allow directive takes precedence.
Example:
User-agent: *
Disallow: /images/ # deny access
Allow: /images/ # cancel the ban
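Directives like these can be sanity-checked programmatically. Python's built-in urllib.robotparser parses a rule set and answers "may this URL be fetched?" (note that its tie-breaking for conflicting rules can differ from Google's longest-match logic, so the sample below uses non-overlapping paths):

```python
from urllib.robotparser import RobotFileParser

# Sample rules (hypothetical paths): block /private/, leave /public/ open.
rules = """\
User-agent: *
Disallow: /private/
Allow: /public/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "https://www.example.com/private/page.html"))  # False
print(rp.can_fetch("Googlebot", "https://www.example.com/public/page.html"))   # True
```

Because the only group is User-agent: *, the same answers apply to every bot name you pass to can_fetch.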
What should not be in the robots.txt file
- Personal data and sensitive information:
Never list URLs in the robots.txt file that you want to hide from search engine robots and unauthorised persons: the file itself is publicly accessible, so it reveals every path it mentions.
All pages that collect personal information should be hidden from indexing by an alternative method (for example, a noindex meta tag or authentication).
- Do not place links in the robots.txt file to pages or sections of the site that should not be available for public viewing or indexing, for example pages with restricted access or test sections.
Common mistakes when compiling a robots.txt file
Compiling a robots.txt file requires care: mistakes can negatively affect the indexation and visibility of the site in search engines.
Common mistakes to watch out for:
- Syntax errors: if directives and characters are not spelt correctly, the search engine may misunderstand them.
Errors can be related to incorrect directives, missing characters, missing blank lines between directives, spaces in the wrong places, etc.
Example:
Incorrect:
User-agent: Googlebot
Allow /public/
Correct:
User-agent: Googlebot
Allow: /public/
*the colon after the Allow directive is missing in the incorrect version.
- Duplicate rules: duplicated rules can confuse robots and create indexing problems.
- Incorrect path specification: such an error can block necessary content or, on the contrary, open access to unnecessary sections.
For example, suppose every /images folder on the site needs to be blocked.
Example of an incorrect rule:
User-agent: *
Disallow: /images
Example as needed:
User-agent: *
Disallow: */images
The first rule only matches URLs that begin with /images at the site root; the second matches an /images folder anywhere in the URL.
- Site-wide indexing ban: an incorrectly specified file can ban the entire site from indexing and, as a consequence, completely exclude it from search results.
Example:
User-agent: *
Disallow: /
In this example, User-agent: * indicates that the rule applies to all search engine robots, and Disallow: / denies access to every section of the site, as the slash (/) denotes the root directory.
As a result, robots that see this rule will not scan or index any page of the site.
- Listing multiple directories in one directive: you cannot use commas or spaces to list several directories in a single directive.
Examples of an invalid rule:
User-agent: *
Disallow: /private/, /admin/
Or:
User-agent: *
Disallow: /private/ /admin/
In both cases, the rule is incorrect. Each directory must be listed on a separate line, without commas or spaces between them.
The correct way is to list them on different lines:
User-agent: *
Disallow: /private/
Disallow: /admin/
- Incorrect file name: the file must be called robots.txt, not Robots.txt, ROBOTS.TXT, or anything else.
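Since each directory must occupy its own Disallow line, a small helper can generate the block and rule out the comma/space mistake shown above (the function name here is a hypothetical illustration, not part of any standard library):

```python
# Hypothetical helper: emit one Disallow line per directory, since robots.txt
# does not allow listing several paths in a single directive.
def build_disallow_block(user_agent: str, directories: list[str]) -> str:
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in directories]
    return "\n".join(lines) + "\n"

print(build_disallow_block("*", ["/private/", "/admin/"]))
```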
What characters to use in the robots.txt file
In the robots.txt file, you can use certain characters to set access rules for search engine robots.
Example of basic symbols:
- The * symbol stands for any sequence of characters. It can be used to block or allow access to specific sections or URLs of the site.
- The $ symbol indicates the end of a URL, which lets you write rules for specific URLs more precisely.
Example: Disallow: /images/$
This rule blocks only the exact URL /images/ itself; longer URLs such as /images/photo.jpg or /images/subfolder/ are not matched by it.
- The # symbol starts a comment: everything after it on the same line is ignored.
It should be noted that major search engines such as Google do not apply rules in file order: the most specific (longest) matching rule wins, and with rules of equal length Allow takes precedence over Disallow. Listing specific rules before general ones still keeps the file easier to read.
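To see how * and $ behave, here is a simplified sketch that converts a robots.txt pattern into a regular expression. It is an illustration of the wildcard semantics only, not a full implementation of any search engine's matching rules:

```python
import re

def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    # Simplified sketch: '*' matches any character sequence,
    # a trailing '$' anchors the pattern at the end of the URL path.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "^" + re.escape(body).replace(r"\*", ".*")
    if anchored:
        regex += "$"
    return re.compile(regex)

rule = robots_pattern_to_regex("/images/$")
print(bool(rule.match("/images/")))           # True: the exact URL is matched
print(bool(rule.match("/images/photo.jpg")))  # False: longer URLs are not
```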
Example of how to write robots.txt for the WordPress CMS
It is important to understand that when setting up robots for a website, you need to consider its peculiarities.
Algorithm for writing robots.txt for WordPress:
- Specify User-agent
- Close from indexing – Disallow:
- admin files;
- user account pages, registration and login forms;
- order-processing tools (cart, checkout forms, etc.);
- internal search results pages;
- duplicate pages;
- filter parameters;
- comparison and sorting pages;
- service pages;
- URLs with UTM tags;
- Allow those files and documents that should be indexed but are inside already closed categories (e.g. JavaScript, images);
- Add Sitemap
An example of a robots.txt file for WordPress:
User-Agent: *
Disallow: /wp-login.php
Disallow: /wp-register.php
Disallow: /xmlrpc.php
Disallow: /template.html
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content
Disallow: /tag
Disallow: /category
Disallow: /archive
Disallow: */trackback/
Disallow: */feed/
Disallow: */comments/
Disallow: /?feed=
Disallow: /?s=
Allow: /*/*.js
Allow: /*/*.css
Allow: */uploads
Allow: /wp-*.png
Allow: /wp-*.jpg
Allow: /wp-*.jpeg
Allow: /wp-*.gif
Sitemap: http://site.com/sitemap.xml
Promoting a website on WordPress has its own features and ready-made solutions, but robots.txt is worth setting up manually.
How you can check and test the robots.txt file
Before publishing the robots.txt file on the site, be sure to test it for errors.
Ways to check:
- Using a robots.txt tester: Google Search Console includes a robots.txt tester that shows how Googlebot will interpret the file.
- Checking the file with Screaming Frog (crawl the site, then open Configuration → robots.txt → Custom).
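Alongside those tools, a draft file can be sanity-checked locally before uploading, for example with Python's built-in urllib.robotparser (the rules and URLs below are just a sample):

```python
from pathlib import Path
from urllib.robotparser import RobotFileParser

# Write a draft file, then verify key URLs before uploading (sample rules).
draft = Path("robots.txt")
draft.write_text("User-agent: *\nDisallow: /wp-admin\nDisallow: /?s=\n")

rp = RobotFileParser()
rp.parse(draft.read_text().splitlines())

# The admin area must be blocked, ordinary content must stay open.
print(rp.can_fetch("*", "https://site.com/wp-admin/options.php"))  # False
print(rp.can_fetch("*", "https://site.com/blog/post-1/"))          # True
```

A check like this catches gross mistakes (e.g. an accidental Disallow: /) before the file ever reaches the live server.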
Conclusion
Properly compiling and customising the file helps improve the site’s indexing and increase its visibility in search results.
To summarise the robots.txt algorithm:
- Create and place the file in the root folder of the website on the hosting;
- Add the required User-agent and Sitemap directives;
- Add typical rubbish pages (such as filters and parameter pages) to the file;
- Test robots.txt and scan the site with a crawler (such as Screaming Frog or Netpeak Spider) to check the overall picture after creating the file: verify that what you closed is really closed, and you may spot some more junk pages.
Frequently Asked Questions:
What to write in Robots.txt?
Robots.txt specifies instructions (directives) about which pages/folders are allowed or denied for indexing and crawling by robots.
How do I read Robots.txt file?
To read the robots.txt file of a website, enter its URL in the browser's address bar: example.com/robots.txt.
Where is the Robots.txt file in WordPress?
- If you have a plugin for WordPress, for example: “Yoast SEO” or “All in One SEO Pack”, you can edit the robots.txt file in the site admin.
- In your hosting’s file system, you can create/set up a robots.txt file in the root folder of your website (usually the public_html or www folder).