
Yandex robots: how to edit the robots.txt file and what it should be


Robots.txt is a text file containing information for search robots that helps them index the portal's pages.



Imagine that you went to an island for treasure. You have a map. The route is indicated there: “Approach a large stump. From there, take 10 steps east, then reach the cliff. Turn right, find a cave.”

These are instructions. By following them, you walk the route and find the treasure. A search bot works in much the same way when it starts indexing a site or page: it finds the robots.txt file, reads which pages need to be indexed and which do not, and, following these commands, crawls the portal and adds its pages to the index.

What is robots.txt for?

Search robots start visiting sites and indexing pages as soon as the site is uploaded to hosting and its DNS records are registered. They do their job whether or not you have any technical files. Robots.txt tells search engines that, when crawling the website, they should take into account the parameters it contains.

The absence of a robots.txt file can lead to problems with the crawl speed of the site and to garbage in the index. An incorrectly configured file can result in important parts of the resource being excluded from the index and unnecessary pages appearing in search results.

All this, as a result, leads to problems with promotion.

Let's take a closer look at what instructions are contained in this file and how they affect the behavior of the bot on your site.

How to make robots.txt

First, check if you have this file.

Enter the site address in the browser address bar, followed by a slash and the file name, for example https://www.xxxxx.ru/robots.txt

If the file is present, a list of its parameters will appear on the screen.

If there is no file:

  1. The file is created in a regular text editor such as Notepad or Notepad++.
  2. Name it robots, with the .txt extension. Enter the data following the accepted formatting rules.
  3. You can check it for errors using services such as Yandex Webmaster: select the “Robots.txt analysis” item in the “Tools” section and follow the prompts.
  4. When the file is ready, upload it to the root directory of the site (a minimal example of the contents is shown below).
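
As a rough sketch, the simplest valid robots.txt consists of a single record that allows full indexing (an empty Disallow means nothing is prohibited):

User-agent: *
Disallow: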

Setting rules

Search engines have more than one robot. Some bots only index text content, some only graphic content. And even among search engines themselves, the way crawlers work can be different. This must be taken into account when compiling the file.

Some of them may ignore certain rules: for example, Googlebot does not respond to the directive indicating which site mirror is considered the main one. But in general, they read the file and follow it.

File Syntax

The parameters of the document are the name of the robot (bot) in “User-agent” and the directives: the allowing “Allow” and the prohibiting “Disallow”.

Today there are two key search engines, Yandex and Google, so it is important to take the requirements of both into account when creating a website.

The format for creating entries is as follows; note the required spaces and empty lines.
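
As an illustration, a single record might look like this (the paths are hypothetical; each record starts with User-agent, and records for different robots are separated by an empty line):

User-agent: *
Disallow: /admin/
Allow: /admin/public.html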

User-agent directive

The robot looks for records that begin with User-agent; the record should contain the name of a specific search robot. If no name is specified, the bot's access is considered unrestricted.

Disallow and Allow directives

If you need to prevent indexing in robots.txt, use Disallow. It limits the bot's access to the site or to certain sections.

If robots.txt does not contain a single prohibiting “Disallow” directive, indexing of the entire site is considered allowed. Prohibitions are usually specified separately for each bot.

All information that appears after the # sign is a comment and is not machine readable.

Allow is used to allow access.

The asterisk symbol serves as an indication of what applies to everyone: User-agent: *.

The opposite option means a complete ban on indexing for everyone.
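
In its standard form (also shown in the examples later in this article), such a record looks like this:

User-agent: *
Disallow: /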

You can also prevent crawling of the entire contents of a specific directory (folder).
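
A sketch, where /folder/ stands in for the directory you want to close:

User-agent: *
Disallow: /folder/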

To block a single file, you need to specify its absolute path.
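
A sketch with an illustrative path:

User-agent: *
Disallow: /folder/page.html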


Sitemap, Host directives

For Yandex, it is customary to indicate which mirror you want to designate as the main one. Google, as we remember, ignores this. If there are no mirrors, simply note which spelling of your website name you consider correct: with or without www.
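
A sketch of how these directives are usually written (site.ru is a placeholder for your own domain, and the sitemap path is illustrative); keep in mind that, as discussed later in this article, Yandex deprecated Host in 2018 in favor of a 301 redirect:

User-agent: Yandex
Disallow:
Host: site.ru

Sitemap: https://site.ru/sitemap.xml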

Clean-param directive

It can be used if the URLs of site pages contain changeable parameters that do not affect their content (these could be user IDs, referrers, and so on).

For example, if the “ref” parameter in a page address indicates the traffic source, i.e. where the visitor came to the site from, the page itself will be the same for all users.

You can point this out to the robot and it won't download duplicate information. This will reduce server load.
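
A sketch for the ref example above (the path /some_dir/get_book.pl is borrowed from the Yandex documentation excerpt later in this article):

User-agent: Yandex
Clean-param: ref /some_dir/get_book.pl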

Crawl-delay directive

Using it, you can set how often the bot will request pages for analysis. This command is used when the server is overloaded and indicates that the crawling process should be slowed down.
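
A sketch; the value is illustrative and means a pause of at least 2 seconds between page downloads:

User-agent: *
Crawl-delay: 2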

Robots.txt errors

  1. The file is not in the root directory. The robot will not look for it deeper and will not take it into account.
  2. The file name must be in lowercase Latin letters. A common mistake is to drop the letter s at the end and name the file robot.txt.
  3. You cannot use Cyrillic characters in the robots.txt file. If you need to specify a domain in Russian, use the special Punycode encoding: a method of converting domain names into a sequence of ASCII characters. Special converters can be used for this.

This encoding looks like this:
сайт.рф (site.rf) = xn--80aswg.xn--p1ai

Additional information on what to close in robots.txt and on configuring it in accordance with the requirements of the Google and Yandex search engines can be found in their help documentation. Different CMSs may also have their own specifics; this should be taken into account.

Robots.txt is a text file that contains special instructions for the search engine robots exploring your site on the Internet. These instructions, called directives, may prohibit certain pages of the site from being indexed, indicate the correct “mirroring” of the domain, and so on.

For sites running on the Nubex platform, a file with directives is created automatically and is located at domen.ru/robots.txt, where domen.ru is the domain name of the site.

You can change robots.txt and add additional directives for search engines in the site admin area. To do this, open the “Settings” section in the control panel and select the “SEO” item.

Find the field "Text of the robots.txt file" and write down the necessary directives in it. It is advisable to activate the checkbox “Add a link to an automatically generated sitemap.xml file in robots.txt”: this way the search bot will be able to load the site map and find all the necessary pages for indexing.

Basic directives for the robots txt file

When loading robots.txt, the search robot first looks for a record starting with User-agent: the value of this field must be the name of the robot whose access rights are being set in this record. In other words, the User-agent directive is a kind of address to a robot.

1. If the value of the User-agent field contains the symbol " * ", then the access rights specified in this entry apply to any search robots that request the /robots.txt file.

2. If more than one robot name is specified in an entry, then access rights apply to all specified names.

3. Uppercase or lowercase characters do not matter.

4. If the string User-agent: BotName is found, the directives for User-agent: * are not taken into account (this is the case when you make several records for different robots). That is, the robot will first scan the text for a User-agent record with its own name, and if it finds one, it will follow those instructions; if not, it will act according to the instructions of the User-agent: * record (for all bots). A short sketch illustrating this is given after this list.

By the way, it is recommended to insert an empty line feed (Enter) before each new User-agent directive.

5. If the lines User-agent: BotName and User-agent: * are missing, it is considered that the robot’s access is not limited.
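
A minimal sketch of such a setup (the paths are illustrative): the Yandex robot follows only its own block and ignores the rules under User-agent: *, while all other bots use the general block.

User-agent: Yandex
Disallow: /private/

User-agent: *
Disallow: /tmp/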

Prohibiting and allowing site indexing: Disallow and Allow directives

To deny or allow search bots access to certain pages of the site, the Disallow and Allow directives are used, respectively.

The value of these directives is the full or partial path to the section:

  • Disallow: /admin/ — prohibits indexing of all pages located inside the admin section;
  • Disallow: /help — prohibits indexing of both /help.html and /help/index.html;
  • Disallow: /help/ — closes only the contents of the /help/ directory, for example /help/index.html, but not /help.html;
  • Disallow: / — blocks access to the entire site.

If the Disallow value is not specified, access is not limited:

  • Disallow: — indexing of all pages of the site is allowed.

You can use the Allow directive to configure exceptions. For example, an entry like the one sketched below will prohibit robots from indexing all sections of the site except those whose path begins with /search.
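
One way such an entry could be written (a sketch based on the description above):

User-agent: *
Allow: /search
Disallow: /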

It does not matter in which order the prohibiting and allowing directives are listed. When reading them, the robot will still sort them by the length of the URL prefix (from shortest to longest) and apply them sequentially. That is, from the bot's point of view, the example above will look like this:
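
A sketch of the same rules after sorting by prefix length:

User-agent: *
Disallow: /
Allow: /search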

Only pages whose path begins with /search are allowed to be indexed. Thus, the order of the directives does not affect the result in any way.

Host directive: how to specify the main domain of the site

If several domain names are associated with your site (technical addresses, mirrors, etc.), the search engine may decide that these are all different sites with the same content. The solution? A ban! And only the bot knows which domain will be “punished”: the main one or a technical one.

To avoid this trouble, you need to tell the search robot at which address your site is participating in the search. This address will be designated as the main one, and the rest will form a group of mirrors of your site.

You can do this using the Host directive. It must be added to the record starting with User-agent, immediately after the Disallow and Allow directives. In the value of the Host directive, specify the main domain with the port number (80 by default). For example:

Host: test-o-la-la.ru

Such an entry means that the site will be displayed in search results with a link to the domain test-o-la-la.ru, and not to www.test-o-la-la.ru or the technical address s10364…

In the Nubex constructor, the Host directive is added to the text of the robots.txt file automatically when you specify in the admin panel which domain is the main one.

In the text of robots.txt, the host directive can only be used once. If you write it several times, the robot will only accept the first entry in order.

Crawl-delay directive: how to set the page loading interval

To indicate to the robot the minimum interval between finishing loading one page and starting to load the next, use the Crawl-delay directive. It must be added to the record starting with User-agent, immediately after the Disallow and Allow directives. In the value of the directive, specify the time in seconds.

Using such a delay when processing pages will be convenient for overloaded servers.

There are also other directives for search robots, but the five described above - User-agent, Disallow, Allow, Host and Crawl-delay - are usually enough to compose the text of a robots.txt file.

Most robots are well designed and do not cause any problems for website owners. But if a bot was written by an amateur or “something went wrong,” it can create a significant load on the site it crawls. By the way, spiders do not penetrate the server the way viruses do: they simply request the pages they need remotely (in essence, they are analogues of browsers, but without the page-viewing function).

Robots.txt - user-agent directive and search engine bots

Robots.txt has a very simple syntax, which is described in great detail in, for example, the Yandex help and the Google help. It usually indicates which search bot the directives that follow are intended for: the bot name (“User-agent”), allowing (“Allow”) and prohibiting (“Disallow”) directives; “Sitemap” is also actively used to tell search engines exactly where the sitemap file is located.

The standard was created quite a long time ago and something was added later. There are directives and design rules that will only be understood by robots of certain search engines. In RuNet, only Yandex and Google are of interest, which means that you should familiarize yourself with their help for compiling robots.txt in particular detail (I provided the links in the previous paragraph).

For example, it was previously useful to tell the Yandex search engine which mirror of your web project is the main one in the special “Host” directive, which only this search engine understands (well, also Mail.ru, because their search is powered by Yandex). True, at the beginning of 2018 Yandex cancelled Host, and its functions, like those for the other search engines, are now performed by a 301 redirect.

Even if your resource does not have mirrors, it will be useful to indicate which spelling option is the main one: with www or without it.

Now let's talk a little about the syntax of this file. Directives in robots.txt look like this:

<field>:<space><value><space>

The correct code should contain at least one “Disallow” directive after each “User-agent” entry. An empty file assumes permission to index the entire site.

User-agent

"User-agent" directive must contain the name of the search bot. Using it, you can set up rules of behavior for each specific search engine (for example, create a ban on indexing a separate folder only for Yandex). An example of writing “User-agent” addressed to all bots visiting your resource looks like this:

User-agent: *

If you want to set certain conditions in the “User-agent” only for one bot, for example, Yandex, then you need to write this:

User-agent: Yandex

Name of search engine robots and their role in the robots.txt file

Every search engine's bot has its own name (for example, Rambler's is StackRambler). Here I will list the most famous of them:

  • Google (http://www.google.com): Googlebot
  • Yandex (http://www.ya.ru): Yandex
  • Bing (http://www.bing.com/): bingbot

In addition to their main bots, major search engines sometimes also have separate instances for indexing blogs, news, images, and so on. You can find a lot of information on the types of bots (for Yandex) and (for Google).

What should you do in this case? If you need to write a prohibiting rule that all types of Google robots must follow, use the name Googlebot, and all the other spiders of this search engine will obey it too. However, you can also ban only the indexing of pictures, for example, by specifying the Googlebot-Image bot as the User-agent. This may not be very clear right now, but with examples I think it will be easier.
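
A quick sketch of the second case: this blocks only Google's image robot, leaving the other Google bots unaffected.

User-agent: Googlebot-Image
Disallow: /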

Examples of using the Disallow and Allow directives in robots.txt

I'll give a few simple examples of how the directives are used, with an explanation of their effect.

  1. The code below allows all bots (indicated by the asterisk in User-agent) to index all content without any exceptions. This is set by the empty Disallow directive:

     User-agent: *
     Disallow:

  2. The following code, on the contrary, completely prohibits all search engines from adding pages of this resource to the index. This is set by Disallow with “/” in the value field:

     User-agent: *
     Disallow: /

  3. In this case, all bots will be prohibited from viewing the contents of the /image/ directory (http://mysite.ru/image/ is the absolute path to this directory):

     User-agent: *
     Disallow: /image/

  4. To block a single file, it is enough to specify its absolute path:

     User-agent: *
     Disallow: /katalog1/katalog2/private_file.html

    Looking ahead a little, I’ll say that it’s easier to use the asterisk (*) symbol so as not to write the full path:

    Disallow: /*private_file.html

  5. In the example below, the directory “image” will be prohibited, as well as all files and directories whose names begin with the characters “image”, i.e. the files “image.htm” and “images.htm” and the directories “image”, “images1”, “image34”, etc.:

     User-agent: *
     Disallow: /image

     The point is that, by default, an asterisk is implied at the end of the entry, which stands for any characters, including their absence. Read about it below.
  6. Using the Allow directive, we allow access. It complements Disallow well. For example, with this condition we prohibit the Yandex search robot from downloading (indexing) everything except web pages whose address begins with /cgi-bin:

     User-agent: Yandex
     Allow: /cgi-bin
     Disallow: /

    Well, or this obvious example of using the Allow and Disallow combination:

     User-agent: *
     Disallow: /catalog
     Allow: /catalog/auto

  7. When describing paths for the Allow and Disallow directives, you can use the symbols “*” and “$”, thus defining certain logical expressions.
    1. The symbol “*” (asterisk) means any (including empty) sequence of characters. The following example prohibits all search engines from indexing files with the “.php” extension:

       User-agent: *
       Disallow: *.php$

    2. Why is the “$” sign needed at the end? The point is that, according to the logic of robots.txt, an implied asterisk is added at the end of each directive (it is not written, but it is assumed). For example, we write:

       Disallow: /images

      Implying that this is the same as:

      Disallow: /images*

      That is, this rule prohibits the indexing of all files (web pages, pictures and other file types) whose address begins with /images, followed by anything at all (see the example above). So the “$” symbol simply cancels that implied asterisk at the end. For example:

      Disallow: /images$

      Only prevents indexing of the /images file, but not /images.html or /images/primer.html. Well, in the first example, we prohibited indexing only files ending in .php (having such an extension), so as not to catch anything unnecessary:

      Disallow: *.php$

  • Many engines provide human-readable URLs for users, while system-generated URLs contain a question mark “?” in the address. You can take advantage of this and write the following rule in robots.txt:

    User-agent: *
    Disallow: /*?

    The asterisk after the question mark suggests itself, but, as we found out just above, it is already implied at the end. Thus, we will prohibit the indexing of search pages and other service pages created by the engine, which the search robot can reach. It won’t be superfluous, because the question mark is most often used by CMS as a session identifier, which can lead to duplicate pages being included in the index.

    Sitemap and Host directives (for Yandex) in Robots.txt

    To avoid unpleasant problems with site mirrors, it was previously recommended to add a Host directive to robots.txt, which pointed the Yandex bot to the main mirror.

    Host directive - indicates the main mirror of the site for Yandex

    For example, earlier, if you had not yet switched to the secure protocol, you had to indicate in Host not the full URL but only the domain name (without http://). If you have already switched to https, then you need to indicate the full URL (such as https://myhost.ru).

    Canonical is a wonderful tool for combating duplicate content: the search engine simply will not index a page if a different URL is registered in its Canonical. For example, for such a page of my blog (a page with pagination), Canonical points to https://site and there should be no problems with duplicated titles.

    But I digress...

    If your project is created on the basis of some engine, duplicate content will occur with high probability, which means you need to fight it, including with a ban in robots.txt and, especially, with the meta tag, because in the first case Google may ignore the ban, but it cannot ignore the meta tag (it was brought up that way).

    For example, in WordPress, pages with very similar content can be indexed by search engines if indexing of category content, tag-archive content and temporary-archive content is all allowed. But if, using the Robots meta tag described above, you create a ban on the tag archive and the temporary archive (you can keep the tags and prohibit indexing of category content instead), then duplication of content will not occur. How to do this is described at the link given just above (to the All in One SEO Pack plugin).

    To summarize, I will say that the Robots file is intended for setting global rules for denying access to entire site directories, or to files and folders whose names contain specified characters (by mask). You can see examples of setting such prohibitions just above.

    Now let's look at specific examples of robots designed for different engines - Joomla, WordPress and SMF. Naturally, all three options created for different CMS will differ significantly (if not radically) from each other. True, they will all have one thing in common, and this moment is connected with the Yandex search engine.

    Because Yandex carries a lot of weight in RuNet, we need to take into account all the nuances of its work, and here the Host directive will help us. It explicitly indicates to this search engine the main mirror of your site.

    For this, it is recommended to use a separate User-agent block intended only for Yandex (User-agent: Yandex). This is because other search engines may not understand Host and, accordingly, including it in the User-agent record intended for all search engines (User-agent: *) may lead to negative consequences and incorrect indexing.

    It is hard to say how things really stand, because search algorithms are a thing in themselves, so it is better to do as advised. But in that case you have to duplicate in the User-agent: Yandex block all the rules that you set for User-agent: *. If you leave User-agent: Yandex with an empty Disallow:, you will thereby allow Yandex to go anywhere and drag everything into the index.

    Robots for WordPress

    I will not give an example of the file that the developers recommend; you can look it up yourself. Many bloggers do not limit the Yandex and Google bots at all in their walks through the content of the WordPress engine. Most often on blogs you can find a robots file automatically filled in by a plugin.

    But, in my opinion, we should still help the search in the difficult task of sifting the wheat from the chaff. Firstly, it will take a lot of time for Yandex and Google bots to index this garbage, and there may not be any time left to add web pages with your new articles to the index. Secondly, bots crawling through garbage engine files will create additional load on your host’s server, which is not good.

    You can see my version of this file for yourself. It’s old and hasn’t been changed for a long time, but I try to follow the principle “don’t fix what isn’t broken,” and it’s up to you to decide: use it, make your own, or steal from someone else. I also had a ban on indexing pages with pagination until recently (Disallow: */page/), but recently I removed it, relying on Canonical, which I wrote about above.

    But in general, the only correct file for WordPress probably doesn't exist. You can, of course, implement any prerequisites in it, but who said that they will be correct. There are many options for ideal robots.txt on the Internet.

    I will give two extremes:

    1. You can find a megafile with detailed explanations (the # symbol separates comments that would be better deleted in a real file):

       User-agent: *        # general rules for robots, except Yandex and Google, because the rules for them are below
       Disallow: /cgi-bin   # folder on hosting
       Disallow: /?         # all request parameters on the main page
       Disallow: /wp-       # all WP files: /wp-json/, /wp-includes, /wp-content/plugins
       Disallow: /wp/       # if there is a subdirectory /wp/ where the CMS is installed (if not, the rule can be deleted)
       Disallow: *?s=       # search
       Disallow: *&s=       # search
       Disallow: /search/   # search
       Disallow: /author/   # author archive
       Disallow: /users/    # author archive
       Disallow: */trackback # trackbacks, notifications in comments about the appearance of an open link to an article
       Disallow: */feed     # all feeds
       Disallow: */rss      # rss feed
       Disallow: */embed    # all embeds
       Disallow: */wlwmanifest.xml # Windows Live Writer manifest xml file (if you don't use it, the rule can be deleted)
       Disallow: /xmlrpc.php # WordPress API file
       Disallow: *utm=      # links with utm tags
       Disallow: *openstat= # links with openstat tags
       Allow: */uploads     # open the uploads folder with files

       User-agent: GoogleBot # rules for Google (I do not duplicate comments)
       Disallow: /cgi-bin
       Disallow: /?
       Disallow: /wp-
       Disallow: /wp/
       Disallow: *?s=
       Disallow: *&s=
       Disallow: /search/
       Disallow: /author/
       Disallow: /users/
       Disallow: */trackback
       Disallow: */feed
       Disallow: */rss
       Disallow: */embed
       Disallow: */wlwmanifest.xml
       Disallow: /xmlrpc.php
       Disallow: *utm=
       Disallow: *openstat=
       Allow: */uploads
       Allow: /*/*.js       # open js scripts inside /wp- (/*/ for priority)
       Allow: /*/*.css      # open css files inside /wp- (/*/ for priority)
       Allow: /wp-*.png     # images in plugins, cache folder, etc.
       Allow: /wp-*.jpg     # images in plugins, cache folder, etc.
       Allow: /wp-*.jpeg    # images in plugins, cache folder, etc.
       Allow: /wp-*.gif     # images in plugins, cache folder, etc.
       Allow: /wp-admin/admin-ajax.php # used by plugins so as not to block JS and CSS

       User-agent: Yandex   # rules for Yandex (I do not duplicate comments)
       Disallow: /cgi-bin
       Disallow: /?
       Disallow: /wp-
       Disallow: /wp/
       Disallow: *?s=
       Disallow: *&s=
       Disallow: /search/
       Disallow: /author/
       Disallow: /users/
       Disallow: */trackback
       Disallow: */feed
       Disallow: */rss
       Disallow: */embed
       Disallow: */wlwmanifest.xml
       Disallow: /xmlrpc.php
       Allow: */uploads
       Allow: /*/*.js
       Allow: /*/*.css
       Allow: /wp-*.png
       Allow: /wp-*.jpg
       Allow: /wp-*.jpeg
       Allow: /wp-*.gif
       Allow: /wp-admin/admin-ajax.php
       Clean-Param: utm_source&utm_medium&utm_campaign # Yandex recommends not blocking such pages from indexing, but deleting the tag parameters; Google does not support such rules
       Clean-Param: openstat # similar

       # Specify one or more Sitemap files (no need to duplicate them for each User-agent).
       # Google XML Sitemap creates 2 sitemaps, as in the example below.
       Sitemap: http://site.ru/sitemap.xml
       Sitemap: http://site.ru/sitemap.xml.gz

       # Specify the main mirror of the site, as in the example below (with WWW / without WWW; if HTTPS,
       # then write the protocol; if you need to specify a port, indicate it). The Host command is understood
       # by Yandex and Mail.RU, Google does not take it into account.
       Host: www.site.ru
    2. But you can use an example of minimalism:

       User-agent: *
       Disallow: /wp-admin/
       Allow: /wp-admin/admin-ajax.php
       Host: https://site.ru
       Sitemap: https://site.ru/sitemap.xml

    The truth probably lies somewhere in the middle. Also, don’t forget to add the Robots meta tag for “extra” pages, for example, using the wonderful plugin - . It will also help you set up Canonical.

    Correct robots.txt for Joomla

    User-agent: *
    Disallow: /administrator/
    Disallow: /bin/
    Disallow: /cache/
    Disallow: /cli/
    Disallow: /components/
    Disallow: /includes/
    Disallow: /installation/
    Disallow: /language/
    Disallow: /layouts/
    Disallow: /libraries/
    Disallow: /logs/
    Disallow: /modules/
    Disallow: /plugins/
    Disallow: /tmp/

    In principle, almost everything is taken into account here and it works well. The only thing is that you should add a separate User-agent: Yandex rule to insert the Host directive, which defines the main mirror for Yandex, and also specify the path to the Sitemap file.

    Therefore, in its final form, the correct robots for Joomla, in my opinion, should look like this:

    User-agent: Yandex
    Disallow: /administrator/
    Disallow: /cache/
    Disallow: /includes/
    Disallow: /installation/
    Disallow: /language/
    Disallow: /libraries/
    Disallow: /modules/
    Disallow: /plugins/
    Disallow: /tmp/
    Disallow: /layouts/
    Disallow: /cli/
    Disallow: /bin/
    Disallow: /logs/
    Disallow: /components/
    Disallow: /component/
    Disallow: /component/tags*
    Disallow: /*mailto/
    Disallow: /*.pdf
    Disallow: /*%
    Disallow: /index.php
    Host: vash_sait.ru (or www.vash_sait.ru)

    User-agent: *
    Allow: /*.css?*$
    Allow: /*.js?*$
    Allow: /*.jpg?*$
    Allow: /*.png?*$
    Disallow: /administrator/
    Disallow: /cache/
    Disallow: /includes/
    Disallow: /installation/
    Disallow: /language/
    Disallow: /libraries/
    Disallow: /modules/
    Disallow: /plugins/
    Disallow: /tmp/
    Disallow: /layouts/
    Disallow: /cli/
    Disallow: /bin/
    Disallow: /logs/
    Disallow: /components/
    Disallow: /component/
    Disallow: /*mailto/
    Disallow: /*.pdf
    Disallow: /*%
    Disallow: /index.php

    Sitemap: http://path to your XML format map

    Yes, also note that the second option contains Allow directives that permit indexing of styles, scripts and images. This was written specifically for Google, because its Googlebot sometimes complains that indexing of these files, for example in the folder with the active theme, is prohibited in robots.txt. It even threatens to lower rankings for this.

    Therefore, we allow this whole thing to be indexed in advance using Allow. By the way, the same thing happened in the example file for WordPress.

    Good luck to you! See you soon on the pages of the blog site


    Robots.txt is a text file that contains site indexing parameters for search engine robots.

    Yandex supports the following directives:

    • User-agent *: indicates the robot to which the rules listed in robots.txt apply.
    • Disallow: prohibits indexing of sections or individual pages of the site.
    • Sitemap: specifies the path to the Sitemap file located on the site.
    • Clean-param: indicates to the robot that the page URL contains parameters (for example, UTM tags) that do not need to be taken into account when indexing.
    • Allow: allows indexing of sections or individual pages of the site.
    • Crawl-delay: sets the minimum time period (in seconds) for the robot between finishing loading one page and starting to load the next.

    * Mandatory directive.

    The most common directives you may need are Disallow, Sitemap and Clean-param. For example:

    User-agent: * # specify for which robots the directives are set
    Disallow: /bin/ # prohibits links from the "Shopping Cart"
    Disallow: /search/ # prohibits links to the search pages built into the site
    Disallow: /admin/ # prohibits links from the admin panel
    Sitemap: http://example.com/sitemap # point the robot to the sitemap file for the site
    Clean-param: ref /some_dir/get_book.pl

    Robots of other search engines and services may interpret directives differently.

    Note. The robot takes case into account when writing substrings (name or path to the file, robot name) and does not take case into account in the names of directives.

    Using the Cyrillic alphabet

    The use of Cyrillic is prohibited in the robots.txt file and server HTTP headers.

    Robots.txt is a text file that contains site indexing parameters for the search engine robots.

    Recommendations on the content of the file

    Yandex supports the following directives:

    • User-agent *: indicates the robot to which the rules listed in robots.txt apply.
    • Disallow: prohibits indexing site sections or individual pages.
    • Sitemap: specifies the path to the Sitemap file that is posted on the site.
    • Clean-param: indicates to the robot that the page URL contains parameters (like UTM tags) that should be ignored when indexing it.
    • Allow: allows indexing site sections or individual pages.
    • Crawl-delay: specifies the minimum interval (in seconds) for the search robot to wait after loading one page, before starting to load another.

    * Mandatory directive.

    You'll most often need the Disallow, Sitemap, and Clean-param directives. For example:

    User-agent: * # specify the robots that the directives are set for
    Disallow: /bin/ # disables links from the Shopping Cart
    Disallow: /search/ # disables page links of the search embedded on the site
    Disallow: /admin/ # disables links from the admin panel
    Sitemap: http://example.com/sitemap # specify for the robot the sitemap file of the site
    Clean-param: ref /some_dir/get_book.pl

    Robots from other search engines and services may interpret the directives in a different way. For the robots.txt file to be taken into account by the robot, it must be located in the root directory of the site and respond with HTTP 200 code. The indexing robot doesn't support files hosted on other sites.

    You can check the server's response and the accessibility of robots.txt to the robot using the tool.

    If your robots.txt file redirects to another robots.txt file (for example, when moving a site), add the redirect target site to Yandex.Webmaster and verify the rights to manage this site.
