
The correct robots.txt file. Yandex robots

Each blog gives its own answer to this question. As a result, newcomers to search promotion often get confused and ask something like:

What is this "robots tee-ex-tee" thing?

The robots.txt file (also called the index file) is a plain text document in UTF-8 encoding that applies to the HTTP, HTTPS, and FTP protocols. The file gives search robots recommendations about which pages and files to crawl. If the file contains characters in an encoding other than UTF-8, search robots may process them incorrectly. The rules listed in the robots.txt file apply only to the host, protocol, and port number where the file is located.

The file must be located in the root directory as a plain text document and be available at: https://site.com.ua/robots.txt.

Other files often start with a BOM (Byte Order Mark). This is a Unicode character used to determine the byte order when reading information; its code is U+FEFF. At the beginning of the robots.txt file, the byte order mark is ignored.

Google has set a limit on the size of the robots.txt file: it must not exceed 500 KB.

If you are interested in the technical details: the robots.txt format is described in Backus-Naur Form (BNF), using the conventions of RFC 822.

When processing rules in the robots.txt file, search robots receive one of three instructions:

  • partial access: scanning of individual elements of the site is available;
  • full access: you can scan everything;
  • complete ban: the robot cannot scan anything.

When scanning the robots.txt file, robots receive the following responses:

  • 2xx - the scan was successful;
  • 3xx - the crawler follows the redirect until it receives another response. Most often, the robot makes up to five attempts to get a non-3xx response, after which a 404 error is recorded;
  • 4xx - the search robot assumes that it may crawl all the content of the site;
  • 5xx - these are treated as temporary server errors, and crawling is completely disabled. The robot will keep requesting the file until it receives another response. Google's search robot can also detect whether the site is configured to return a 5xx response instead of a 404 error for missing pages; in that case the page will be processed as if it returned a 404 response code.

It is not defined how a robots.txt file that is unavailable because of server or network problems will be processed.

Why you need a robots.txt file

For example, sometimes robots should not visit:

  • pages with personal information of users on the site;
  • pages with various forms of sending information;
  • mirror sites;
  • search results pages.

Important: even if a page is listed in the robots.txt file, there is a chance it will appear in the search results if a link to it is found inside the site or somewhere on an external resource.

This is how search engine robots see a site with and without a robots.txt file:

Without robots.txt, information that should be hidden from prying eyes can get into the search results, and both you and the site will suffer because of this.

This is how the search engine robot sees the robots.txt file:

Google found the robots.txt file on the site and found the rules by which site pages should be crawled

How to create a robots.txt file

With Notepad, Sublime, or any other text editor.

User-agent - business card for robots

User-agent is a rule that specifies which robots should follow the instructions described in the robots.txt file. At the moment, 302 search robots are known.
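For example, the entry that addresses every robot at once begins like this:

User-agent: *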

This entry says that we specify the rules in robots.txt for all search robots.

For Google, the main robot is Googlebot. If we want to address only it, the entry in the file will look like this:
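A sketch of such a block (the Disallow rule here is only a placeholder):

User-agent: Googlebot
Disallow: /admin/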

In this case, all other robots will crawl the content as if the robots.txt file were empty, following their default directives.

For Yandex, the main robot is... Yandex:
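The corresponding first line of such a block would be:

User-agent: Yandex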

Other special robots:

  • Googlebot news- to search for news;
  • Mediapartners-Google- for the AdSense service;
  • AdsBot-Google— to check the quality of the landing page;
  • YandexImages— Yandex.Pictures indexer;
  • Googlebot Image- for pictures;
  • YandexMetrika— Yandex.Metrica robot;
  • YandexMedia- a robot that indexes multimedia data;
  • YaDirectFetcher— Yandex.Direct robot;
  • Googlebot Video- for video;
  • Googlebot mobile- for mobile version;
  • YandexDirectDyn— dynamic banner generation robot;
  • YandexBlogs- a blog search robot that indexes posts and comments;
  • YandexMarket— Yandex.Market robot;
  • YandexNews— Yandex.News robot;
  • YandexDirect— downloads information about the content of the partner sites of the Advertising Network in order to clarify their subject matter for the selection of relevant advertising;
  • YandexPagechecker— microdata validator;
  • YandexCalendar— Yandex.Calendar robot.

Disallow - we place "bricks"

Disallow should be used if the site is in the process of being improved and you do not want it to appear in the search results in its current state.
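A sketch of such a blanket ban:

User-agent: *
Disallow: /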

It is important to remove this rule as soon as the site is ready for users to see it. Unfortunately, this is forgotten by many webmasters.

Example. How to write a Disallow rule to advise robots not to view the contents of a folder /folder/:
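A sketch of such a rule:

User-agent: *
Disallow: /folder/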

The following line prevents crawling of all files with the .gif extension:
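A sketch of such a rule (the $ sign, explained below, marks the end of the address):

Disallow: /*.gif$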

Allow - direct robots

Allow permits crawling of a specific file, directory, or page. Let's say robots should be able to view only pages that begin with /catalog, and all other content should be closed. In this case, the following combination is prescribed:
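A sketch of this combination (using the /catalog prefix from the example):

User-agent: *
Allow: /catalog
Disallow: /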

The Allow and Disallow rules are sorted by URL prefix length (lowest to longest) and applied sequentially. If more than one rule matches a page, the robot selects the last rule in the sorted list.

Host - choose a site mirror

Host is one of the mandatory rules for robots.txt; it tells the Yandex robot which of the site mirrors should be taken into account for indexing.

Site mirror - an exact or almost exact copy of the site, available at different addresses.

The robot will not get confused when finding site mirrors and will understand that the main mirror is specified in the robots.txt file. The site address is specified without the “http://” prefix, but if the site works on HTTPS, the “https://” prefix must be specified.

How to write this rule:
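A sketch of the Host entry inside a Yandex section (site.com.ua is the hypothetical domain used earlier):

User-agent: Yandex
Disallow:
Host: site.com.ua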

An example of a robots.txt file if the site works on the HTTPS protocol:
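A minimal sketch (hypothetical domain):

User-agent: Yandex
Disallow:
Host: https://site.com.ua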

Sitemap - the site's medical record

Sitemap tells robots that all site URLs required for indexing are located at http://site.ua/sitemap.xml. With each crawl, the robot will look at what changes were made to this file and quickly refresh information about the site in the search engine databases.
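The corresponding entry looks like this:

Sitemap: http://site.ua/sitemap.xml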

Crawl-delay - stopwatch for weak servers

Crawl-delay is a parameter that sets the minimum period between page downloads by the robot. This rule is relevant if you have a weak server: in that case, large delays are possible when search robots request site pages. The parameter is measured in seconds.
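A sketch of such an entry (the 2-second value is arbitrary):

User-agent: Yandex
Disallow:
Crawl-delay: 2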

Clean-param - Duplicate Content Hunter

Clean-param helps deal with get-parameters to avoid duplicate content that might be available at different dynamic URLs (with question marks). Such addresses appear if the site has different sortings, session ids, and so on.

Let's say the page is available at the addresses:

www.site.com/catalog/get_phone.ua?ref=page_1&phone_id=1

www.site.com/catalog/get_phone.ua?ref=page_2&phone_id=1

www.site.com/catalog/get_phone.ua?ref=page_3&phone_id=1

In this case, the robots.txt file will look like this:
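A sketch based on the ref parameter from the addresses above:

User-agent: Yandex
Disallow:
Clean-param: ref /catalog/get_phone.ua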

Here ref indicates where the link comes from, so it is written at the very beginning, and only then the rest of the address is indicated.

But before moving on to the reference file, there are a few more signs you need to know about when writing a robots.txt file.

Symbols in robots.txt

The main characters of the file are "/, *, $, #".

Using the slash "/" we show what we want to hide from robots. For example, a single slash in a Disallow rule prohibits crawling the entire site, while a path enclosed between two slashes, for example /catalog/, disables crawling of that particular directory.
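For example, a sketch of such an entry:

Disallow: /catalog/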

Such an entry says that we forbid scanning the entire contents of the catalog folder, but if we write /catalog, we forbid all links on the site that begin with /catalog.

The asterisk "*" means any sequence of characters (including none). It is implied at the end of every rule by default.
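For example, a sketch of such a rule:

User-agent: *
Disallow: /catalog/*.gif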

This entry says that all robots should not index any .gif files in the /catalog/ folder

The dollar sign "$" limits the scope of the asterisk. If you need to block /catalog itself, but URLs that merely contain /catalog (for example /catalog.html) must stay open, the entry in the index file would be:
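A sketch of such an entry:

User-agent: *
Disallow: /catalog$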

The hash "#" is used for comments that the webmaster leaves for himself or other webmasters. The robot will not take them into account when crawling the site.

For example:
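A sketch with a hypothetical rule and comment:

User-agent: *
Disallow: /catalog/ # temporarily hiding the catalog section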

What does the ideal robots.txt look like?

The file opens the content of the site for indexing, the host is registered and the site map is specified, which will allow search engines to always see the addresses that should be indexed. The rules for Yandex are written separately, since not all robots understand the Host instruction.
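A minimal sketch matching this description (site.ua is a placeholder; your own rules will differ):

User-agent: *
Disallow:

User-agent: Yandex
Disallow:
Host: site.ua

Sitemap: http://site.ua/sitemap.xml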

But do not rush to copy this file for yourself: unique rules must be written for each site, depending on the type of site and its CMS. So it is worth keeping all of the rules in mind when filling out your robots.txt file.

How to check the robots.txt file

If you want to know whether you filled out the robots.txt file correctly, check it in the Google and Yandex webmaster tools. Just enter the source code of the robots.txt file into the form and specify the site to be checked.

How not to fill out the robots.txt file

Annoying mistakes are often made when filling out the index file, usually because of simple inattention or haste. Below is a list of the errors I have met in practice.

2. Writing multiple folders/directories in one Disallow statement:
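For example, an entry like this:

Disallow: /css/ /cgi-bin/ /images/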

Such an entry can confuse search robots: they may not understand what exactly they should not index, the first folder or the last one, so write each rule separately.

3. The file itself must be called only robots.txt, not Robots.txt, ROBOTS.TXT or otherwise.

4. You cannot leave the User-agent rule empty - you need to say which robot should take into account the rules specified in the file.

5. Extra characters in the file (slashes, asterisks).

6. Adding pages to the file that should not be in the index.

Non-standard use of robots.txt

In addition to direct functions, an index file can become a platform for creativity and a way to find new employees.

Here is a site where robots.txt itself is a small site with work elements and even an ad unit.

As a platform for searching for specialists, the file is mainly used by SEO agencies. And who else can know about its existence? :)

And Google has a special humans.txt file, so that no one even suspects discrimination against flesh-and-blood specialists.

Conclusions

With the help of robots.txt you can give instructions to search robots, advertise yourself and your brand, and look for specialists. It is a great field for experimentation. The main thing is to remember to fill out the file correctly and to avoid the typical mistakes.

The rules (also called directives or instructions) of the robots.txt file:

  1. User-agent - a rule about which robots need to view the instructions described in robots.txt.
  2. Disallow makes a recommendation about what kind of information should not be scanned.
  3. Sitemap informs robots that all site URLs required for indexing are located at http://site.ua/sitemap.xml.
  4. Host tells the Yandex robot which of the site's mirrors should be taken into account for indexing.
  5. Allow permits crawling of a specific file, directory, or page.

Signs when compiling robots.txt:

  1. The dollar sign "$" limits the scope of the asterisk sign.
  2. With the help of a slash "/" we indicate what we want to hide from robots.
  3. The asterisk "*" means any sequence of characters in the file. It is implied at the end of every rule by default.
  4. The hash mark "#" is used to denote comments that the webmaster writes for himself or other webmasters.

Use the index file wisely - and the site will always be in the search results.


If you make a mistake when creating the robots.txt file, it may become useless for search robots: there is a risk that the necessary commands will be transmitted incorrectly, which will lead to a lower ranking and changes in the site's user metrics. Even if the site works well and is complete, checking robots.txt will not hurt it; it will only make it work better.


Why check robots.txt

Sometimes the system includes unnecessary pages of your Internet resource in the search results. It may seem that there is nothing wrong with a large number of pages in the search engine index, but this is not so:

  • On extra pages, the user will not find any useful information for himself. With a greater degree of probability, he will not visit these pages at all or will not stay on them for long;
  • The search engine results contain the same pages, the addresses of which are different (that is, the content is duplicated);
  • Search robots have to spend a lot of time to index completely unnecessary pages. Instead of indexing useful content, they will wander around the site uselessly. Since the robot cannot index the entire resource and does it page by page (since there are a lot of sites), the necessary information that you would like to receive after running the query may not be found very quickly;
  • The server is under heavy load.

In this regard, it is advisable to close access to search robots to some pages of web resources.

What files and folders can be prohibited from indexing:

  1. search pages. This is a controversial point. Sometimes using an internal search on a site is necessary in order to generate relevant pages. But this is not always done. Often the result of the search is the appearance of a large number of duplicate pages. Therefore, it is recommended to close search pages for indexing.
  2. Cart and page where the order is made/confirmed. Their closure is recommended for online trading sites and other commercial resources using the order form. Getting these pages into the index of search engines is highly undesirable.
  3. pagination pages. As a rule, they are characterized by automatic prescribing of the same meta tags. In addition, they are used to place dynamic content, so duplicates appear in the search results. In this regard, pagination should be closed for indexing.
  4. Filters and comparison of products. They need to be closed by online stores and catalog sites.
  5. Registration and authorization pages. They need to be closed due to the confidentiality of the data users enter during registration or authorization. Google will appreciate that these pages are unavailable for indexing.
  6. System directories and files. Each resource on the Internet consists of a lot of data (scripts, CSS tables, administrative part) that should not be viewed by robots.

The robots.txt file will help close files and pages for indexing.

robots.txt is a plain text file containing instructions for search robots. When the search robot is on the site, it first searches for the robots.txt file. If it is missing (or empty), then the robot will go to all pages and directories of the resource (including system ones) that are in the public domain and try to index them. At the same time, there is no guarantee that the page you need will be indexed, since it may not get to it.

robots.txt allows you to direct search robots to the necessary pages and not to let them in on those that should not be indexed. The file can instruct both all robots at once, and each one individually. If the site page is closed from indexing, then it will never appear in the search engine results. Creating a robots.txt file is essential.

The robots.txt file should be located on the server, in the root of your resource. The robots.txt file of any site is available for viewing on the web: to see it, just add /robots.txt after the resource address.

As a rule, the robots.txt files of different resources differ from each other. If you mindlessly copy the file of someone else's site, then search robots will have problems indexing your site. Therefore, it is so important to know what the robots.txt file is for and the instructions (directives) used to create it.



How Yandex checks robots.txt

  • A special service of Yandex.Webmaster "Analysis of robots.txt" will help you check the file. You can find it at the link: http://webmaster.yandex.ru/robots.xml
  • In the proposed form, you need to enter the contents of the robots.txt file, which you need to check for errors. There are two ways to enter data:
    1. Go to the site using the link http://your-site.ru/robots.txt , copy the content to the empty field of the service (if there is no robots.txt file, you definitely need to create it!);
    2. Insert a link to the file to be checked in the "Host name" field, click "Download robots.txt from the site" or Enter.
  • The check is started by pressing the "Check" command.
  • After the test is started, you can analyze the results.

After the start of the check, the analyzer parses each line of the contents of the "Text robots.txt" field and analyzes the directives it contains. In addition, you will know if the robot will crawl pages from the "List of URLs" field.

You can create a robots.txt file suitable for your resource by editing the rules. Keep in mind that the resource file itself remains unchanged. For the changes to take effect, you will need to independently upload the new version of the file to the site.

When checking directives for sections intended for the Yandex robot (User-agent: Yandex or User-agent:*), the analyzer is guided by the rules for using robots.txt. The remaining sections are checked in accordance with the requirements of the standard. When the analyzer parses the file, it displays a message about found errors, warns if there are inaccuracies in writing the rules, lists which parts of the file are intended for the Yandex robot.

The parser can send two types of messages: errors and warnings.

An error message is displayed if any line, section or the entire file cannot be processed by the parser due to the presence of serious syntax errors that were made when compiling directives.

As a rule, a warning informs about a deviation from the rules that cannot be corrected by the analyzer, or about the presence of a potential problem (it may not be), the cause of which is an accidental typo or inaccurately composed rules.

The error message "This URL does not belong to your domain" indicates that the URL list contains the address of one of the mirrors of your resource, for example, http://example.com instead of http://www.example.com (formally, these URLs are different). The URLs to be checked must be related to the site whose robots.txt file is being parsed.

How Google checks robots.txt

The Google Search Console tool allows you to check whether the robots.txt file contains a prohibition against Googlebot crawling certain URLs on your property. For example, you have an image that you don't want to appear in the Google image search results. The tool will tell you if Googlebot-Image has access to that image.

To do this, specify the URL of interest. After that, the robots.txt file is processed by the inspection tool, similar to the Googlebot inspection. This makes it possible to determine if the address is reachable.

Checking procedure:

  • After selecting your property in Google Search Console, go to the verification tool, which will show you the contents of the robots.txt file. Syntax and logic errors are highlighted in the text, and their number is indicated under the editing window.
  • At the bottom of the interface page, you will see a special window in which you need to enter the URL.
  • A menu will appear on the right, from which you need to select a robot.
  • Click on the "Check" button.
  • If the check results in a message with the text "available", it means that Googlebots are allowed to visit the specified page. The status "unavailable" indicates that access to it to robots is closed.
  • If necessary, you can change the menu and perform a new check. Attention! There will be no automatic changes to the robots.txt file on your resource.
  • Copy the changes and make them to the robots.txt file on your web server.

What you need to pay attention to:

  1. Changes made in the editor are not saved to the web server. You will need to copy the resulting code and paste it into the robots.txt file.
  2. Only Google user agents and robots related to Google (for example, Googlebot) can receive the results of the robots.txt file check by the tool. At the same time, there is no guarantee that the interpretation of the contents of your file by the robots of other search engines will be similar.

15 errors when checking the robots.txt file

Mistake 1. Confused Instructions

The most common mistake in the robots.txt file is messed up instructions. For instance:

  • user-agent: /
  • Disallow: Yandex

The correct option is this:

  • User-agent: Yandex
  • Disallow: /

Mistake 2: Specifying multiple directories in a single Disallow statement

Often Internet resource owners try to list all the directories they want to disable indexing in a single Disallow statement.

Disallow: /css/ /cgi-bin/ /images/

Such a record does not meet the requirements of the standard; it is impossible to predict how it will be processed by different robots. Some of them may ignore spaces. Their interpretation of the entry would be "Disallow: /css/cgi-bin/images/". Others may only use the first or last folder. Still others may even discard the instruction without understanding it.

There is a chance that this construction will be processed exactly the way the webmaster was counting on, but it is still better to write it correctly:

  • Disallow: /css/
  • Disallow: /cgi-bin/
  • Disallow: /images/

Error 3. The file name contains capital letters

The correct file name is robots.txt, not Robots.txt or ROBOTS.TXT.

Mistake 4: Writing the filename as robot.txt instead of robots.txt

Remember to correctly name the robots.txt file.

Mistake 5. Leaving the User-agent string empty

Wrong option:

  • User-agent:
  • Disallow:
  • User-agent: *
  • Disallow:

Mistake 6. Writing Url in the Host Directive

The URL must be specified without using the Hypertext Transfer Protocol abbreviation (http://) and the trailing slash (/).

Invalid entry:
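For example (hypothetical domain):

Host: http://www.site.ru/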

Correct option:
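The same domain, written the way the directive expects:

Host: www.site.ru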

The correct use of the host directive is only for the Yandex robot.

Mistake 7: Using Wildcards in a Disallow Statement

Sometimes, to list all files file1.html, file2.html, file3.html, etc., the webmaster might write:

  • User-agent: *
  • Disallow: file*.html

But this cannot be done, because some robots do not have support for wildcards.

Mistake 8. Using one line for writing comments and instructions

The standard allows entries like this:

Disallow: /cgi-bin/ #prohibit robots from indexing cgi-bin

Previously, the processing of such strings by some robots was impossible. Maybe no search engine will have a problem with this at the moment, but is it worth the risk? It is better to place comments on a separate line.

Error 9. Redirecting to a 404 page

Often, if the site does not have a robots.txt file, then when it is requested, the search engine will redirect to another page. Sometimes this does not return a 404 Not Found status. The robot has to figure out what it got - robots.txt or a regular html file. This is not a problem, but it is better if an empty robots.txt file is placed in the root of the site.

Mistake 10. Using capital letters is a sign of bad style

USER-AGENT: GOOGLEBOT

Although the standard does not regulate case sensitivity for robots.txt, file and directory names are often case sensitive. In addition, a robots.txt file written entirely in capital letters is considered bad style.

User-agent: googlebot

Mistake 11. Listing all files

It would be incorrect to list each file in a directory individually:

  • User-agent: *
  • Disallow: /AL/Alabama.html
  • Disallow: /AL/AR.html
  • Disallow: /Az/AZ.html
  • Disallow: /Az/bali.html
  • Disallow: /Az/bed-breakfast.html

It will be correct to close the entire directory from indexing:

  • User-agent: *
  • Disallow: /AL/
  • Disallow: /az/

Mistake 12. Using additional directives in section *

Some robots may react incorrectly to the use of additional directives. Therefore, their use in the "*" section is undesirable.

If the directive is not standard (like "Host" for example), then it is better to create a special section for it.

Invalid option:
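A sketch of such an entry (the domain and folder are hypothetical):

User-agent: *
Disallow: /css/
Host: www.example.com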

It would be correct to write:
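A sketch of the corrected version, with a dedicated section for the non-standard directive:

User-agent: *
Disallow: /css/

User-agent: Yandex
Disallow: /css/
Host: www.example.com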

Mistake 13. Missing a Disallow Instruction

Even if you want to use an additional directive and not set any prohibition, it is recommended to specify an empty Disallow. The standard states that the Disallow instruction is mandatory; if it is absent, the robot may "misunderstand you".

Not right:
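A sketch (hypothetical domain):

User-agent: Yandex
Host: www.example.com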

Right:
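The same section with an explicit, empty Disallow:

User-agent: Yandex
Disallow:
Host: www.example.com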

Error 14. Not using slashes when specifying a directory

What will be the actions of the robot in this case?

  • User-agent: Yandex
  • Disallow: john

According to the standard, neither the file nor the directory named "john" will be indexed. To specify only a directory, you need to write:

  • User-agent: Yandex
  • Disallow: /john/

Mistake 15: Wrong spelling of the HTTP header

The server should return "Content-Type: text/plain" in the HTTP header for robots.txt and, for example, not "Content-Type: text/html". If the header is written incorrectly, some robots will not be able to process the file.

How to compose the file correctly so that the robots.txt check does not reveal errors

What should be the correct robots.txt file for an Internet resource? Consider its structure:

1. User-agent

This directive is the main one, it determines for which robots the rules are written.

If for any robot, we write:
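User-agent: *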

If for a specific bot:

User-agent: GoogleBot

It's worth noting that character case doesn't matter in robots.txt. For example, a user agent for Google can be written like this:

user-agent: googlebot

Here is a table of the main user agents of various search engines.

Googlebot - Google's main indexing robot
Googlebot news - Google News
Googlebot Image - Google Pictures
Mediapartners-Google - Google AdSense, Google Mobile AdSense
AdsBot-Google - landing page quality check
AdsBot-Google-Mobile-Apps - Google robot for apps
YandexBot - Yandex's main indexing robot
YandexImages - Yandex.Images
YandexVideo - Yandex.Video
YandexMedia - multimedia data
YandexBlogs - blog search robot
YandexAddurl - robot accessing the page when it is added via the "Add URL" form
YandexFavicons - robot that indexes site icons (favicons)
YandexDirect - Yandex.Direct
YandexMetrika - Yandex.Metrica
YandexCatalog - Yandex.Catalog
YandexNews - Yandex.News
YandexImageResizer - mobile services robot
Bingbot - the main indexing robot of Bing
Slurp - the main indexing robot of Yahoo!
Mail.Ru - the main indexing robot of Mail.Ru

2. Disallow and Allow

Disallow allows you to disable indexing of pages and sections of the Internet resource.

Allow is used to force them to be opened for indexing.

But using them correctly is not so simple.

First, you need to familiarize yourself with additional operators and the rules for their use. These include: *, $ and #.

  • * - any number of characters, even their absence. It is not necessary to put this operator at the end of the line, it is assumed that it is there by default;
  • $ - indicates that the character before it must be the last;
  • # - this operator is used to designate a comment, any information after it is not taken into account by the robot.

How to use these operators:

  • Disallow: *?s=
  • Disallow: /category/$

Secondly, you need to understand how the rules nested in the robots.txt file are executed.

It doesn't matter in what order the directives are written. Determining the inheritance of rules (what to open or close from indexing) is carried out according to the specified directories. Let's take an example.

Allow: *.css

Disallow: /template/
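With these two rules, the result is as follows (site.ru is used as an illustrative domain):

http://site.ru/template/ - closed from indexing
http://site.ru/template/style.css - closed from indexing
http://site.ru/style.css - open for indexing
http://site.ru/theme/style.css - open for indexing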

If you need to open all .css files for indexing, then you will need to additionally specify this for each folder, access to which is closed. In our case:

  • Allow: *.css
  • Allow: /template/*.css
  • Disallow: /template/

Recall again: it does not matter in what order the directives are written.

3. Sitemap

This directive specifies the path to the Sitemap XML file. The URL has the same form as in the address bar.

The Sitemap directive can be specified anywhere in the robots.txt file, and it does not need to be tied to a specific user-agent. Multiple sitemap rules are allowed.
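For example:

Sitemap: http://site.ru/sitemap.xml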

4. Host

This directive specifies the main mirror of the resource (usually with or without www). Remember: the main mirror is specified without http://, but with https:// if the site runs over HTTPS. If necessary, the port is also specified.

This directive can only be supported by Yandex and Mail.Ru bots. Other robots, including GoogleBot, do not take this command into account. You can register host only once!
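Example 1:
Host: site.ru

Example 2:
Host: https://site.ru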

5. Crawl delay

Allows you to set the period of time after which the robot needs to download resource pages. The directive is supported by the robots of Yandex, Mail.Ru, Bing, Yahoo. When setting the interval, you can use both integer and fractional values, using a dot as a separator. The unit of measurement is seconds.

Crawl-delay: 0.5

If the load on the site is small, then there is no need to set this rule. But if the result of indexing pages by the robot is exceeding the limits or a serious increase in load, leading to server outages, then using this directive is reasonable: it allows you to reduce the load.

The longer the interval you set, the less will be the number of downloads during one session. The optimal value for each resource is different. At first, it is recommended to set small values ​​(0.1, 0.2, 0.5), then gradually increase them. For search engine robots that are not particularly important for promotion results (for example, Mail.Ru, Bing and Yahoo), you can immediately set values ​​that are greater than for Yandex robots.

6. Clean-param

This directive is needed to inform the crawler (search robot) about the uselessness of indexing URLs with the specified parameters. The rule is given two arguments: a parameter and a section URL. Yandex supports the directive.
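For example:

Clean-param: author_id http://site.ru/articles/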

http://site.ru/articles/?author_id=267539 - will not be indexed

http://site.ru/articles/?author_id=267539&sid=0995823627 - will not be indexed

Clean-Param: utm_source&utm_medium&utm_campaign

7. Other options

The extended robots.txt specification also contains the Request-rate and Visit-time parameters, but the leading search engines currently do not support them.

Directives are needed for the following:

  • Request-rate: 1/5 - allows loading no more than 1 page in 5 seconds
  • visit-time: 0600-0845 - Allows page loading from 6am to 8:45am GMT only

To properly configure the robots.txt file, we recommend using the following algorithm:

1) Close the site admin panel from indexing;

2) Close access for robots to your personal account, authorization and registration pages;

3) Close the cart, order forms, delivery and order data from indexing;

4) Close ajax and json scripts from indexing;

5) Close the cgi folder from indexing;

6) Prohibit indexing of plugins, themes, js and css for robots of all search engines except Yandex and Google;

7) Close access for robots to the search functionality;

8) Prohibit indexing of service sections that have no value for the resource in search (the 404 error page, the list of authors);

9) Close from indexing technical duplicates of pages and pages whose content to some extent duplicates the content of other pages (calendars, archives, RSS);

10) Close from indexing pages with filter, sorting and comparison parameters;

11) Close from indexing pages with UTM tags and session parameters;

12) Use the "site:" parameter to check what Yandex and Google have indexed. To do this, enter "site:site.ru" in the search bar. If there are pages in the SERP that do not need to be indexed, add them to robots.txt;

13) Write down the Sitemap and Host rules;

14) If necessary, specify Crawl-Delay and Clean-Param;

15) Check the correctness of the robots.txt file using the Google and Yandex tools;

16) After 14 days, check again to make sure there are no pages in the search engine results that should not be indexed. If there are any, repeat all the above points.

Checking the robots.txt file only makes sense if your site is fine. A site audit conducted by qualified specialists will help determine this.



The first thing a search bot does when it comes to your site is look for and read the robots.txt file. What is this file? It is a set of instructions for search engines.

It is a text file with the extension txt, which is located in the root directory of the site. This set of instructions tells the search robot which pages and site files to index and which not. It also indicates the main mirror of the site and where to look for the site map.

What is the robots.txt file for? For proper indexing of your site: it keeps duplicate pages, service pages and documents out of the search results. Once you correctly set up the directives in robots.txt, you will save your site from many problems with indexing and site mirroring.

How to compose the correct robots.txt

Compiling robots.txt is easy enough: create a text document in standard Windows Notepad, write the directives for search engines in this file, then save it with the name "robots" and the "txt" extension. Now it can be uploaded to the hosting, into the root folder of the site. Note that only one robots document can be created per site. If this file is missing from the site, the bot automatically "decides" that everything may be indexed.

Since there is only one file, it contains instructions for all search engines. Moreover, you can write both separate instructions for each search engine and a general block for all of them at once. Instructions for different search bots are separated using the User-agent directive. We'll talk more about this below.

robots.txt directives

The "robot" file may contain the following indexing directives: User-agent, Disallow, Allow, Sitemap, Host, Crawl-delay, Clean-param. Let's look at each instruction in more detail.

User-agent directive

The User-agent directive indicates which search engine (more precisely, which specific bot) the instructions are intended for. If it is "*", the instructions are for all robots. If a specific bot is listed, such as Googlebot, the instructions apply only to Google's main indexing bot. Moreover, if there are separate instructions for Googlebot and for all other search engines, Google will read only its own instructions and ignore the general ones. The Yandex bot will do the same. Let's look at an example of a directive entry.

User-agent: YandexBot - instructions only for the main Yandex indexing bot
User-agent: Yandex - instructions for all Yandex bots
User-agent: * - instructions for all bots

Disallow and Allow directives

The Disallow and Allow directives give commands about what to index and what not. Disallow tells robots not to index a page or an entire section of the site. Allow, on the contrary, indicates what needs to be indexed.

Disallow: / - prohibits indexing the entire site
Disallow: /papka/ - prohibits indexing the entire contents of the folder
Disallow: /files.php - prohibits indexing the file files.php

Allow: /cgi-bin - allows indexing cgi-bin pages

It is possible and often necessary to use special characters in the Disallow and Allow directives. They are needed to define regular expressions.

The special character * replaces any sequence of characters. It is appended to the end of each rule by default; even if you did not specify it, the search engine will add it itself. Usage example:

Disallow: /cgi-bin/*.aspx - prohibits indexing of all files with the .aspx extension
Disallow: /*foto - prohibits indexing of files and folders containing the word foto

The special character $ - cancels the effect of the special character "*" at the end of the rule. For example:

Disallow: /example$ - prohibits indexing '/example', but does not prohibit '/example.html'

And if you write without the $ special character, then the instruction will work differently:

Disallow: /example - disallows both '/example' and '/example.html'

Sitemap Directive

The Sitemap directive is designed to tell the search robot where the sitemap is located on the hosting. The sitemap must be an XML file (sitemap.xml). A sitemap is needed for faster and more complete site indexing. Moreover, a sitemap does not have to be a single file; there may be several. Directive entry format:

Sitemap: http://site/sitemaps1.xml
Sitemap: http://site/sitemaps2.xml

Host Directive

The Host directive indicates to the robot the main mirror of the site. Whatever mirrors of the site end up in the index, you must always specify this directive. If it is not specified, the Yandex robot will index at least two versions of the site, with and without www, until the mirror robot glues them together. Recording example:

Host: www.site
Host: site

In the first case, the robot will index the version with www, in the second case without. Only one Host directive is allowed in the robots.txt file. If you write several of them, the bot will process and take into account only the first one.

A valid host directive should have the following data:
— indicate the connection protocol (HTTP or HTTPS);
- a correctly written domain name (you cannot write an IP address);
- port number, if necessary (for example, Host: site.com:8080).

Incorrectly made directives will simply be ignored.

Crawl-delay directive

Crawl-delay directive allows you to reduce the load on the server. It is needed in case your site starts to fall under the onslaught of various bots. The Crawl-delay directive tells the search bot to wait between the end of downloading one page and the start of downloading another page of the site. The directive must come immediately after the "Disallow" and/or "Allow" directive entries. The Yandex search robot can read fractional values. For example: 1.5 (one and a half seconds).
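A sketch of the placement (the Disallow value is hypothetical):

User-agent: Yandex
Disallow: /admin/
Crawl-delay: 1.5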

Clean-param Directive

The Clean-param directive is needed by sites whose pages contain dynamic parameters that do not affect the content. This is various service information: session identifiers, users, referrers, and so on. To avoid duplicates of such pages, this directive is used. It tells the search engine not to repeatedly download duplicated information, which also reduces the server load and the time it takes the robot to crawl the site.

Clean-param: s /forum/showthread.php

This entry tells the PS that the s parameter will be considered insignificant for all urls that start with /forum/showthread.php. The maximum record length is 500 characters.

We figured out the directives, let's move on to setting up our robots.

Setting robots.txt

We proceed directly to setting up the robots.txt file. It must contain at least two entries:

User-agent: - indicates which search engine the instructions below are intended for.
Disallow: - specifies which part of the site should not be indexed. It can close from indexing both a single page of the site and entire sections.

Moreover, you can specify whether these directives are intended for all search engines or for one in particular. This is set in the User-agent directive. If you want all bots to read the instructions, put an asterisk:
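User-agent: *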

If you want to write instructions for a specific robot, you must specify its name.

User-agent: YandexBot

A simplified example of a properly composed robots file would be:

User-agent: *
Disallow: /files.php
Disallow: /foto/
Host: site

Where * says that the instructions are intended for all search engines;
Disallow: /files.php - prohibits indexing of the files.php file;
Disallow: /foto/ - prohibits indexing the entire "foto" section with all attached files;
Host: site - tells the robots which mirror to index.

If your site does not have pages that need to be closed from indexing, then your robots.txt file should be like this:

User-agent: *
Disallow:
Host: site

Robots.txt for Yandex

To indicate that these instructions are intended for the Yandex search engine, you must specify in the User-agent directive: Yandex. Moreover, if we write “Yandex”, then the site will be indexed by all Yandex robots, and if we specify “YandexBot”, then this will be a command only for the main indexing robot.

It is also necessary to register the "Host" directive, where to specify the main mirror of the site. As I wrote above, this is done to prevent duplicate pages. Your correct robots.txt for Yandex will be like this:

User-agent: Yandex
Disallow: /cgi-bin
Disallow: /adminka
Host: site

Until now, one often hears the question of what is better to specify in the Host directive: the site with or without www. In fact, there is no difference; it is simply a matter of how you want the site to look in the SERP. The main thing is not to forget to specify it at all, so as not to create duplicates.

Robots.txt for Google

The Google search engine supports all common robots.txt file entry formats. True, it does not take into account the Host directive. Therefore, there will actually be no differences from Yandex. Robots.txt for Google will look like this:

User-agent: Googlebot
Disallow: /cgi-bin
Disallow: /adminka
Sitemap: http://site/sitemaps.xml

I hope that the information presented here will be enough for you to compile a high-quality and, most importantly, correct robots.txt file. If you use one of the popular CMSs, then in the next article I have prepared a selection of robots.txt files for popular CMSs.


Hello dear friends! Checking robots.txt is just as important as writing it right.

Checking the robots.txt file in the Yandex and Google Webmasters panels.

Checking robots.txt: why is it important?

Sooner or later, every self-respecting site author remembers the robots file. A lot has been written on the Internet about this file placed in the root of the site; almost every webmaster has a post about the relevance and correctness of its compilation. In this article, I will remind novice bloggers how to check it using the tools in the webmaster panels provided by Yandex and Google.

First, a little about it. The robots.txt file (sometimes erroneously called robot.txt, in the singular; note the English letter s at the end) is created by webmasters to mark or prohibit certain files and folders of a website for search spiders (as well as other types of robots), that is, the files the search engine robot should not have access to.

Checking robots.txt is a mandatory step for the site author when creating a blog on WordPress and promoting it further. Many webmasters also make a point of reviewing the pages of the project. The analysis tells you whether the file uses correct syntax and is in a valid format. The fact is that there is an established Robot Exclusion Standard. It will not be superfluous to find out the opinion of the search engines themselves and read their documentation, where they describe in detail how they treat this file.

All of this will help you protect your site from errors during indexing. I know of cases when, because of an incorrectly compiled file, a site was effectively signalled as prohibited from appearing in search. After correcting it, you can wait a long time for the situation around the site to change.

I will not dwell on the correct compilation of the file itself in this article. There are many examples on the net: you can go to the blog of any popular blogger and add /robots.txt at the end of their domain to have a look. The browser will show their version, which you can use as a basis. However, every site has its own exceptions, so you need to check for compliance specifically for your site. A description and an example of the correct text for a WordPress blog is shown below:

Sitemap: http://your site/sitemap.xml

User-agent: Googlebot-Image

#Google Adsense

User-agent: Mediapartners-Google*

User-agent: duggmirror

Disallow: /cgi-bin/

Disallow: /wp-admin/

Disallow: /wp-includes/

Disallow: /wp-content/plugins/

Disallow: /wp-content/cache/

Disallow: /wp-content/themes/

Disallow: /trackback/

Disallow: /feed/

Disallow: /comments/

Disallow: /category/*/*

Disallow: */trackback/

Disallow: */feed/

Disallow: */comments/

Allow: /wp-content/uploads/

There are some differences in the compilation and further verification of the robots.txt file for the main search engines of Runet. Below I will give examples of how to check in the Yandex Webmaster and Google panels.

After you have compiled the file and uploaded it to the root of your site via FTP, you need to check it for compliance, for example, with the Yandex search engine. This way we will find out whether we have accidentally closed the pages that bring visitors to you.

Checking robots.txt in the Yandex Webmaster panel

You must have an account in the Yandex Webmaster panel. Go to the tools, specify your site, and on the right you will see a list of available features. Go to the "Check robots.txt" tab.

Specify your domain and click "Download robots.txt from the site". If you have compiled a file with separate sections for each search engine, you need to select the lines for Yandex and copy them into the field below. I remind you that the Host: directive is relevant for Yandex, so do not forget to enter it in the field for verification. It remains to click the "Check robots.txt" button on the right.

You will almost immediately see Yandex's analysis of your robots.txt. Below it will be the lines that Yandex accepted for consideration. Then look at the test results: directives are shown to the left of the URL, and the result itself on the right. As you can see in the screenshot, a red "prohibited by the rule" note together with the rule itself means the URL is blocked; if a directive allows indexing, we will see green - it is allowed.

After checking robots.txt, you will be able to correct your file. I also recommend checking the pages of the site. Paste the url address of a single entry into the /List of URLs/ field. And at the output we get the result - allowed. So we can separately check the bans on archives, categories, and so on.

Do not forget to subscribe; in the next article I plan to show how to register for free in the Mail.ru catalog. Do not miss it.


Check robots.txt in Google Webmasters panel

We go into your account and look on the left /Status/ - /Blocked URLs/

Here we will see its presence and the ability to edit it. If you need to check the entire site for compliance, specify the address of the main page in the field below. It is possible to check how different Google robots see your site, taking into account the check of the robots.txt file

In addition to the main Google bot, we also choose a robot specializing in different types of content (2). Screenshot below.

  1. Googlebot
  2. Googlebot Image
  3. Googlebot mobile
  4. Mediapartners-Google - Metrics for AdSense
  5. AdsBot-Google - Landing page quality check

I did not find indicators for other Google robots:

  • Googlebot Video
  • Googlebot news

By analogy with checking the robots.txt file in the Yandex panel, there is also the opportunity to analyze a separate page of the site. After checking, you will see the result separately for each search bot.

If the results of the check do not suit you, you just have to continue editing and check again.

Analyze robots.txt online

In addition to these features, you can also analyze the robots.txt file using online services. The ones I found are mostly in English. I liked this service: after the analysis, it gives recommendations for correcting the file.

tool.motoricerca.info/robots-checker.phtml

That's all. I hope that checking the robots.txt file through the eyes of Yandex and Google did not upset you. If you see a mismatch with your intentions, you can always edit it and then re-analyze. Thank you for your tweet on Twitter and like on Facebook!

The robots.txt file is one of the most important parts of optimizing any site. Its absence can lead to a high load on the site from search robots and to slow indexing and re-indexing, while an incorrect setup can make the site disappear from search completely or simply never get indexed. In that case the site will not be found in Yandex, Google, and other search engines. Let's take a look at all the nuances of properly setting up robots.txt.

First, a short video that will give you a general idea of ​​what a robots.txt file is.

How robots.txt affects site indexing

Search robots will index your site regardless of the presence of a robots.txt file. If such a file exists, then the robots can be guided by the rules that are written in this file. At the same time, some robots may ignore certain rules, or some rules may be specific only to some bots. In particular, GoogleBot does not use the Host and Crawl-Delay directives, YandexNews has recently begun to ignore the Crawl-Delay directive, and YandexDirect and YandexVideoParser ignore more general robots directives (but are guided by those specified specifically for them).

More about exceptions:
Yandex exceptions
Robot Exception Standard (Wikipedia)

The maximum load on the site is created by robots that download content from your site. By specifying what to index and what to ignore, as well as at what time intervals to download, you can, on the one hand, significantly reduce the load on the site from robots, and on the other hand, speed up crawling by prohibiting the crawling of unnecessary pages.

Such unnecessary pages include ajax, json scripts responsible for pop-up forms, banners, captcha output, etc., order forms and a shopping cart with all the steps of making a purchase, search functionality, personal account, admin panel.

For most robots, it is also desirable to disable indexing of all JS and CSS. But for GoogleBot and Yandex, such files must be left for indexing, as they are used by search engines to analyze the convenience of the site and its ranking (Google proof, Yandex proof).

robots.txt directives

Directives are rules for robots. There is a general specification from January 30, 1994, and an extended standard from 1996. However, not all search engines and robots support every directive. Because of this, it is more useful for us to know not the standard itself, but how the main robots are guided by particular directives.

Let's look at it in order.

user-agent

This is the most important directive that determines for which robots the rules follow.

For all robots:
User-agent: *

For a specific bot:
User-agent: GoogleBot

Note that robots.txt is not case sensitive, i.e. the user agent for Google can just as well be written like this:
user-agent: googlebot

Below is a table of the main user agents of various search engines.

Bot - Function

Google:
Googlebot - Google's main indexing robot
Googlebot news - Google News
Googlebot Image - Google Pictures
Googlebot Video - video
Mediapartners-Google - Google AdSense, Google Mobile AdSense
AdsBot-Google - landing page quality check
AdsBot-Google-Mobile-Apps - Google robot for apps

Yandex:
YandexBot - Yandex's main indexing robot
YandexImages - Yandex.Images
YandexVideo - Yandex.Video
YandexMedia - multimedia data
YandexBlogs - blog search robot
YandexAddurl - robot accessing the page when it is added via the "Add URL" form
YandexFavicons - robot that indexes site icons (favicons)
YandexDirect - Yandex.Direct
YandexMetrika - Yandex.Metrica
YandexCatalog - Yandex.Catalog
YandexNews - Yandex.News
YandexImageResizer - mobile services robot

Bing:
Bingbot - the main indexing robot of Bing

Yahoo!:
Slurp - the main indexing robot of Yahoo!

Mail.Ru:
Mail.Ru - the main indexing robot of Mail.Ru

Rambler:
StackRambler - formerly the main indexing robot of Rambler. However, as of June 23, 2011, Rambler stopped supporting its own search engine and now uses Yandex technology on its services. No longer relevant.

Disallow and allow

Disallow closes pages and sections of the site from indexing.
Allow forcefully opens pages and sections of the site for indexing.

But everything is not so simple here.

First, you need to know additional operators and understand how they are used - these are *, $ and #.

* is any number of characters, including none. You do not need to put an asterisk at the end of the line; it is implied by default.
$ - indicates that the character before it must be the last one.
# - comment, everything after this character in the line is not taken into account by the robot.

Examples of using:

Disallow: *?s=
Disallow: /category/$

Second, you need to understand how nested rules are executed.
Remember that the order in which the directives are written is not important. The rule inheritance of what to open or close from indexing is determined by which directories are specified. Let's take an example.

Allow: *.css
Disallow: /template/

http://site.ru/template/ - closed from indexing
http://site.ru/template/style.css - closed from indexing
http://site.ru/style.css - open for indexing
http://site.ru/theme/style.css - open for indexing

If you want all .css files to be open for indexing, you will have to additionally register this for each of the closed folders. In our case:

Allow: *.css
Allow: /template/*.css
Disallow: /template/

Again, the order of the directives is not important.

Sitemap

Directive for specifying the path to the Sitemap XML file. The URL is written in the same way as in the address bar.

For example,

Sitemap: http://site.ru/sitemap.xml

The Sitemap directive is specified anywhere in the robots.txt file without being tied to a specific user-agent. You can specify multiple sitemap rules.

Host

Directive for specifying the main mirror of the site (in most cases: with www or without www). Please note that the main mirror is indicated WITHOUT http://, but WITH https://. Also, if necessary, the port is specified.
The directive is only supported by Yandex and Mail.Ru bots. Other robots, in particular GoogleBot, will not take the command into account. Host is registered only once!

Example 1:
Host: site.ru

Example 2:
Host: https://site.ru

Crawl-delay

Directive for setting the time interval between downloading the site pages by the robot. Supported by Yandex robots, Mail.Ru, Bing, Yahoo. The value can be set in integer or fractional units (separator - dot), time in seconds.

Example 1:
Crawl-delay: 3

Example 2:
Crawl-delay: 0.5

If the site has a small load, then there is no need to set such a rule. However, if the indexing of pages by a robot leads to the fact that the site exceeds the limits or experiences significant loads, up to server outages, then this directive will help reduce the load.

The higher the value, the fewer pages the robot will download in one session. The optimal value is determined individually for each site. It is better to start with not very large values ​​- 0.1, 0.2, 0.5 - and gradually increase them. For search engine robots that are less important for promotion results, such as Mail.Ru, Bing and Yahoo, you can initially set higher values ​​than for Yandex robots.

Clean-param

This rule tells the crawler that URLs with the specified parameters should not be indexed. The rule is given two arguments: a parameter and a section URL. The directive is supported by Yandex.

Clean-param: author_id http://site.ru/articles/

Clean-param: author_id&sid http://site.ru/articles/

Clean-Param: utm_source&utm_medium&utm_campaign

Other Options

In the extended robots.txt specification, you can also find the Request-rate and Visit-time parameters. However, they are currently not supported by the leading search engines.

Meaning of directives:
Request-rate: 1/5 - load no more than one page in five seconds
Visit-time: 0600-0845 - Load pages only between 6 am and 8:45 GMT.

Closing robots.txt

If you need to configure your site to NOT be indexed by search robots, then you need to write the following directives:

User-agent: *
Disallow: /

Make sure that these directives are written on the test versions of your site, so that staging copies do not end up in the index.

Proper setting of robots.txt

For Russia and the CIS countries, where Yandex's share is significant, directives should be written for all robots, and separately for Yandex and for Google.

To properly configure robots.txt, use the following algorithm:

  1. Close the site admin panel from indexing
  2. Close personal account, authorization, registration from indexing
  3. Close cart, order forms, shipping and order data from indexing
  4. Close from ajax indexing, json scripts
  5. Close cgi folder from indexing
  6. Close plugins, themes, js, css from indexing for all robots except Yandex and Google
  7. Close search functionality from indexing
  8. Close service sections from indexing that do not carry any value for the site in the search (error 404, list of authors)
  9. Close technical duplicates of pages from indexing, as well as pages on which all content is duplicated in one form or another from other pages (calendars, archives, RSS)
  10. Close from indexing pages with filter, sort, compare options
  11. Stop indexing pages with UTM tags and sessions parameters
  12. Check what is indexed by Yandex and Google using the “site:” parameter (type “site:site.ru” in the search bar). If there are pages in the search that also need to be closed from indexing, add them to robots.txt
  13. Specify Sitemap and Host
  14. If necessary, write Crawl-Delay and Clean-Param
  15. Check the correctness of robots.txt using Google and Yandex tools (described below)
  16. After 2 weeks, check again if there are new pages in the SERP that should not be indexed. If necessary, repeat the above steps.

robots.txt example

# An example of a robots.txt file for setting up a hypothetical site https://site.ru

User-agent: *
Disallow: /admin/
Disallow: /plugins/
Disallow: /search/
Disallow: /cart/
Disallow: */?s=
Disallow: *sort=
Disallow: *view=
Disallow: *utm=
Crawl-Delay: 5

User-agent: GoogleBot
Disallow: /admin/
Disallow: /plugins/
Disallow: /search/
Disallow: /cart/
Disallow: */?s=
Disallow: *sort=
Disallow: *view=
Disallow: *utm=
Allow: /plugins/*.css
Allow: /plugins/*.js
Allow: /plugins/*.png
Allow: /plugins/*.jpg
Allow: /plugins/*.gif

User-agent: Yandex
Disallow: /admin/
Disallow: /plugins/
Disallow: /search/
Disallow: /cart/
Disallow: */?s=
Disallow: *sort=
Disallow: *view=
Allow: /plugins/*.css
Allow: /plugins/*.js
Allow: /plugins/*.png
Allow: /plugins/*.jpg
Allow: /plugins/*.gif
Clean-Param: utm_source&utm_medium&utm_campaign
Crawl-Delay: 0.5

Sitemap: https://site.ru/sitemap.xml
Host: https://site.ru

How to add and where is robots.txt

After you have created the robots.txt file, it must be placed on your site at site.ru/robots.txt - i.e. in the root directory. The crawler always accesses the file at the URL /robots.txt

How to check robots.txt

Checking robots.txt is carried out at the following links:

  • In Yandex.Webmaster — on the Tools>Robots.txt analysis tab
  • In Google Search Console - on the Crawl tab > robots.txt file inspection tool

Common mistakes in robots.txt

At the end of the article, I will give some typical robots.txt file errors.

  • robots.txt is missing
  • in robots.txt the site is closed from indexing (Disallow: /)
  • the file contains only the most basic directives, there is no detailed study of the file
  • pages with UTM tags and session IDs are not blocked from indexing in the file
  • the file contains only directives
    Allow: *.css
    Allow: *.js
    Allow: *.png
    Allow: *.jpg
    Allow: *.gif
    while css, js, png, jpg, gif files are closed by other directives in a number of directories
  • Host directive is written multiple times
  • Host does not specify https protocol
  • the path to the Sitemap is incorrect, or the wrong protocol or site mirror is specified


Useful video from Yandex (Attention! Some recommendations are only suitable for Yandex).
