
Search engines. Finding information on the Web

A physician working on a postgraduate degree can find on the Internet scientific articles for the literature review of a medical Ph.D. thesis, articles in a foreign language for preparing for the candidate-minimum exam, descriptions of modern research methods, and much more.

This article discusses how to search for information on the Internet using search engines.

For those who are not yet familiar with concepts such as site and server, here is some basic information about the Internet.

The Internet is a set of sites hosted on servers and united by communication channels (telephone, fiber-optic, and satellite lines).

A site is a collection of documents in HTML format (site pages) linked by hyperlinks.

A large site may consist of tens of thousands of pages. For example, the medical thematic directory "Medlink" (http://www.medlinks.ru) consists of 30,000 pages and occupies about 400 MB of disk space on the server.
A small site consists of a few dozen to a few hundred pages and occupies 1-10 MB; for example, on July 25, 2004 my site "Doctor-graduate student" consisted of 280 .htm pages and occupied 6 MB on the server.

A server is a computer connected to the Internet that works around the clock. A single server can host from several hundred to several thousand sites.

Sites hosted on a server computer can be viewed and copied by Internet users.

To ensure uninterrupted access to the sites, servers are powered through uninterruptible power supplies, and the room where they operate (the data center) is equipped with an automatic fire-extinguishing system and staffed around the clock by technical personnel.

Over more than 10 years of its existence, the Runet (the Russian-speaking Internet) has become an orderly structure, and searching for information on the Web has become more predictable.

The main tool for finding information on the Internet is the search engine.

A search engine consists of a spider program that crawls Internet sites and a database (the index) that stores information about the sites visited.

At a webmaster's request, the spider robot visits the site, looks through its pages, and enters information about them into the search engine's index. A search engine can also find a site by itself, even if its webmaster has not applied for registration: if a link to the site turns up anywhere on the search engine's path (on another site, for example), it will index the site.

The spider does not copy the site's pages into the search engine's index; it saves information about the structure of each page: for example, which words occur in the document and in what order, the addresses of the hyperlinks on the page, the size of the document in kilobytes, the date of its creation, and much more. Therefore, the search engine's index is several times smaller than the volume of the indexed information.
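
To make this concrete, here is a minimal Python sketch of what a spider might do for a single page: fetch it, note its size and scan date, and record the words in document order together with the hyperlinks to follow next. The function name, the seed URL, and the exact set of stored fields are illustrative assumptions, not the internals of any real engine.

```python
# A toy "spider" pass over one page (assumed names, standard library only).
from datetime import datetime
from html.parser import HTMLParser
from urllib.request import urlopen

class PageScanner(HTMLParser):
    """Collects hyperlink targets and visible words from an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(data.split())

def fetch_page_summary(url):
    html = urlopen(url).read().decode("utf-8", errors="replace")
    scanner = PageScanner()
    scanner.feed(html)
    return {
        "url": url,
        "size_kb": len(html) / 1024,            # document size in kilobytes
        "scanned": datetime.now().isoformat(),  # when the spider visited
        "words": scanner.words,                 # which words occur, in order
        "links": scanner.links,                 # addresses to crawl next
    }

print(fetch_page_summary("http://example.com")["links"])
```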

What and how does a search engine search on the Internet?

Search engines were invented by people to help them find information. What is information in our human understanding and visual representation? Not smells or sounds, not sensations or images: just words, text. When we search for something on the Internet, we enter words (a search query), and in response we hope to receive a text containing exactly those words. We know the search engine will look for exactly the words we requested in its array of information, because that is how it was designed: to search for words.

The search engine does not look for words on the Internet itself, but in its index. The index contains information about only a small fraction of Internet sites. There are search engines that index only sites in English, and there are search engines that list only Russian-language sites in their index.

Global search engines (the index contains sites in English, German, and other European languages).

Runet search engines (the index contains sites in Russian).

Features of some search engines on the Runet

The Google search engine does not take into account the morphology of the Russian language. For example, Google treats different grammatical forms of the same word, such as "диссертация" and "диссертации" ("dissertation"), as different words.

View not only the first page of the search results but also the following ones, because sites containing the information the user really needs are often located on pages 4-10 of the results.

Why does this happen? First, many site builders do not optimize their pages for search engines; for example, they do not include meta tags in their pages.

Meta tags are service elements of a web document that are not visible on the screen but are important when search engines index your site. Meta tags make searching easier for search engines: they do not have to go deep into the document and analyze its entire text to form a picture of it. The most important meta tag is meta NAME="keywords", the keywords of the site page. If a word from the main text of the document is not regarded as "search spam" and is among the first 50 words in "keywords", the weight of that word in a query increases, that is, the document gets higher relevance.
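
As an illustration, here is a small Python sketch of how an indexer might read that meta tag. The sample page and the 50-keyword cutoff follow the description above; both the page and the parser are assumptions for the example, not real search engine code.

```python
# Extracting the "keywords" meta tag from a (made-up) page.
from html.parser import HTMLParser

SAMPLE_PAGE = """<html><head>
<meta name="keywords" content="dissertation, postgraduate, medicine">
<title>Doctor-graduate student</title>
</head><body>How to write a dissertation...</body></html>"""

class MetaKeywords(HTMLParser):
    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "keywords":
            # keep only the first 50 keywords, as described above
            words = [w.strip() for w in a.get("content", "").split(",")]
            self.keywords = words[:50]

parser = MetaKeywords()
parser.feed(SAMPLE_PAGE)
print(parser.keywords)  # ['dissertation', 'postgraduate', 'medicine']
```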

Second, there is fierce competition among webmasters for the first positions in search results.

According to statistics, 80% of visitors to the site come from search engines. Sooner or later, webmasters realize this and begin to adapt their sites to the laws of search engines.

Unfortunately, some site creators use a dishonest method of promotion through search engines, so-called "search spam". To create an apparent match between the content of the meta tags and the rest of the site's text, they place hidden words on the pages, typed in the background color so that they do not interfere with visitors. However, the creators of search engines track such tricks, and the site of a "search spammer" falls from the heights it reached to the very bottom.

On the Internet, metaphors and figurative comparisons are of little use. They distort the truth and lead users away from accurate, unambiguous information. The less artistry and the more precision in a site author's style, the higher the site's position in search results.

In turn, if you want a search engine to find articles for you on the Internet, think like a machine, become a machine. At least for a while. At the time of the search.

Search engines

Search engines allow you to find WWW documents related to a given subject or containing given keywords or combinations of them. Search engines support two search methods:

· By the hierarchy of concepts;

· By keywords.

Search engine databases are filled automatically or manually. A search engine usually also has links to other search engines, to which it can forward a user's search request.

There are two types of search engines.

1. "Full text" search engines that index every word on a web page, excluding stop words.

2. "Abstract" search engines that create an abstract of each page.

For webmasters, full-text engines are more useful, since any word that appears on a web page is analyzed to determine its relevance to user queries. However, an abstract engine may sometimes index pages better than a full-text one; this depends on the extraction algorithm, for example, on how it treats the frequency of repeated words.

The main characteristics of search engines.

1. The size of a search engine is determined by the number of indexed pages. However, at any given time, the links returned in response to user queries may be of different ages. Reasons why this happens:

· Some search engines immediately index the page at the request of the user, and then continue to index the pages that have not yet been indexed.

· Others are more likely to index the most popular web pages.

2. Date of indexing. Some search engines show the date when the document was indexed. This helps the user to determine when the document appeared on the Web.

3. The depth of indexing shows how many pages after the specified one the search engine will index. Most machines have no indexing depth limits. Reasons why not all pages can be indexed:

· Incorrect use of frame structures.

· Use of image maps without duplicating them with regular links.

4. Working with frames. If the search robot does not know how to work with frame structures, then many structures with frames will be missed during indexing.

5. Frequency of links. Major search engines can determine a document's popularity by how often it is referenced. Some machines, on the basis of such data, "conclude" whether or not it is worth indexing a document.

6. Server update frequency. If the server is updated frequently, the search engine will re-index it more often.

7. Control of indexing. Shows by what means the search engine can be controlled.

8. Redirection. Some sites redirect visitors from one server to another; this parameter shows which URL will be associated with the documents found.

9. Stop words. Some search engines do not include certain words in their indexes or may not include them in user queries. Such words are usually prepositions or other very commonly used words.

10. Spam penalties. The ability to block spam.

11. Deleting old data. A parameter that determines the actions of the webmaster when the server is closed or moved to another address.

Examples of search engines.

1. AltaVista. The system was opened in December 1995. It is owned by DEC. Since 1996 it has been working with Yahoo. AltaVista is the best choice for custom searches. However, sorting of results by category is not performed, and you have to review the information provided manually. AltaVista does not provide a way to retrieve lists of popular nodes, news, or other content-search capabilities.

2. Excite Search. Launched at the end of 1995; in September 1996 it acquired WebCrawler. This node has a powerful search mechanism, the ability to automatically customize the information provided, and descriptions of sites compiled by qualified personnel. Excite differs from other search sites in that it allows you to search news services and publish reviews of Web pages. The engine combines standard keyword search with heuristic content-search methods; thanks to this combination, you can find pages that match the meaning of the query even when they do not contain the user-specified keywords. The disadvantage of Excite is a somewhat chaotic interface.

3. HotBot. Launched in May 1996. Owned by Wired. Based on the Berkeley Inktomi search engine technology. HotBot is a database of full-text indexed documents and one of the most comprehensive search engines on the Web. Its Boolean search and the ability to limit a search to any region or Web site help the user find the needed information while filtering out the unnecessary. HotBot lets you select the required search parameters from drop-down lists.

4. InfoSeek. Launched before 1995, it is easily accessible. It currently contains about 50 million URLs. InfoSeek has a well-designed interface and excellent search tools. Most responses to queries are accompanied by "related topics" links, and each response is followed by "similar pages" links. The engine's database of pages is indexed by full text. Responses are sorted by two indicators: the frequency of occurrence of words or phrases on a page, and the location of those words or phrases on the page. There is a Web Directory subdivided into 12 categories with hundreds of subcategories that can be searched. Each page of the catalog contains a list of recommended nodes.

5. Lycos. Has been operating since May 1994. It is widely known and used. It includes a directory with a huge number of URLs and the Point search engine, which uses statistical analysis of page content rather than full-text indexing. Lycos contains news, site reviews, links to popular sites, city maps, and tools for finding addresses, images, and sound and video clips. Lycos orders answers by how closely they correspond to the query using several criteria: the number of search terms found in the document's annotation, the interval between words in a specific phrase of the document, and the location of the terms in the document.

6. WebCrawler. Opened April 20, 1994 as a University of Washington project. WebCrawler provides a simple syntax for specifying queries, as well as a large selection of annotated nodes, all with an uncomplicated interface.

Next to each response WebCrawler places a small icon with an approximate estimate of how well it matches the query. It also displays a page with a short summary for each answer, its full URL, and an exact match score, and can use a response as the keywords of a new query ("query by example"). WebCrawler has no graphical interface for customizing queries. Wildcards are not allowed, weights cannot be assigned to keywords, and there is no way to limit the search to a specific field.

7. Yahoo. The oldest directory, Yahoo, was launched in early 1994. It is widely known, frequently used, and most respected. In March 1996 the Yahooligans catalog for children was launched; regional and top Yahoo directories keep appearing. Yahoo is filled by user submissions. It can serve as a starting point for any Web search, since its classification system helps locate sites with well-organized information. Web content is divided into 14 general categories listed on the Yahoo! home page. Depending on the specifics of the request, the user can either work through these categories, getting acquainted with subcategories and lists of nodes, or search for specific words and terms across the entire database. The user can also limit the search to any section or subsection of Yahoo!. Because the classification of nodes is carried out by humans rather than a computer, the quality of links is usually very high. However, refining a search after a failure is difficult. The AltaVista search engine is integrated with Yahoo!, so if a search on Yahoo! fails, it is automatically repeated using AltaVista, and the results are then transferred to Yahoo!. Yahoo! also makes it possible to send queries to search Usenet and Four11 for email addresses.

Russian search engines include:

1. Rambler. This is a Russian-language search engine. The sections listed on the Rambler home page cover Russian-language Web resources. There is an information classifier. A convenient feature is the list of the most visited sites offered for each topic.

2. Aport Search. Aport ranks among the leading search engines and is certified by Microsoft as the local search engine for the Russian version of Microsoft Internet Explorer. One of the advantages of Aport is English-Russian and Russian-English online translation of queries and search results, which makes it possible to search Russian Internet resources without even knowing the Russian language. Moreover, you can search using expressions, even whole sentences. Among the main properties of the Aport search engine are the following:

Translation of the query and search results from Russian into English and vice versa;

Automatic checking of spelling errors of the request;

Informative display of search results for found sites;

The ability to search in any grammatical form;


Advanced query language for professional users.

Other search properties include support for the five main code pages (of different operating systems) for the Russian language, search restrictions by URL and document date, search by titles, comments, and picture captions, saving of search parameters and of a limited number of previous user queries, and merging of copies of a document located on different servers.

3. List.ru (http://www.list.ru). In its implementation, this server has much in common with the English-language system Yahoo!. The server's home page contains links to the most popular search categories.


The list of links to the main catalog categories occupies the central part of the page. The catalog search is implemented so that a query can return both individual sites and categories. If the search is successful, the URL, title, description, and keywords are displayed. The Yandex query language may be used. The "Catalog structure" link opens the full category tree of the catalog in a separate window. It is possible to move from the rubricator to any selected subcategory. A more detailed thematic division of the current heading is presented as a list of links. The catalog is organized so that all sites contained in the lower levels of the structure are also presented in the higher headings. The displayed list of resources is sorted alphabetically, but you can choose sorting by time of addition, by number of transitions, by order of addition to the catalog, or by popularity among directory visitors.

4. Yandex. The software products of the Yandex series are a set of tools for full-text indexing and searching of text data that takes the morphology of the Russian language into account. Yandex includes modules for morphological analysis and synthesis, indexing, and search, as well as a set of auxiliary modules such as a document analyzer, markup tools, format converters, and a spider.

Algorithms for morphological analysis and synthesis, based on a basic dictionary, are able to normalize words, that is, find their initial form, as well as build hypotheses for words not contained in the basic dictionary. The full-text indexing system allows you to create a compact index and quickly perform searches based on logical operators.
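
Yandex's own morphology modules are proprietary, but as a rough illustration of such normalization, the freely available Snowball stemmer for Russian (from the NLTK package, an assumption of this example) reduces different grammatical forms of a word to a single stem:

```python
# pip install nltk  -- NLTK's Snowball stemmer, used here only as a stand-in
# for a real morphological analyzer such as Yandex's.
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("russian")
for form in ["диссертация", "диссертации", "диссертацию"]:
    print(form, "->", stemmer.stem(form))  # all three forms yield "диссертац"
```

A real analyzer goes further than stemming: it finds the true initial form of a word and, as described above, can build hypotheses for words missing from its dictionary.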

Yandex is designed to work with texts in the local and global network, and can also be connected as a module to other systems.

The main element of the modern Internet is the search engine: Yandex, Rambler, Google, and others. There is a sea of varied information on the Internet, and it is search engines that help the user quickly find the information he needs.

Textbooks and scientific books include a list of important terms: an alphabetical subject index. The index lists the most important terms of the book (keywords) and the page numbers on which they appear.

Search engines work on a similar principle. Essentially, when a user enters a search term (keyword), he or she consults a subject index of the Internet: a list of all the keywords on the Internet with an indication of the pages where they occur.

A search engine is a program that compiles and stores this subject index of the Internet (the index) and finds specified keywords in it.

Stages of index compilation and search:

Collecting addresses of web pages on the Internet

An initial list of web page addresses is loaded into the search engine. Then the search engine, or rather its component, the search robot, collects from each of these pages all hypertext links to other pages and adds every address found to its original list. Thus the original list grows rapidly.

Downloading pages

The search robot, or spider, crawls the pages, downloads their text material, stores it on the disks of its computers, and then passes it to the indexing robot.

Index compilation

First, the text of the indexed page is cleared of all non-textual elements (graphics, HTML markup, etc.). Then the words selected from the text are reduced to their stems or to the nominative case. The collected word stems are arranged in alphabetical order, with the numbers of the pages where each stem was found and the positions at which it occurred on each page.

Search

When a user enters a word into the query string, the search engine looks it up in the index, finds all the page numbers associated with that word, and shows the user the result: a list of pages.
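
The four stages can be compressed into a toy Python sketch: a few made-up "pages" are cleared of non-word characters, each word is recorded in an index with its page number and positions, and a one-word query is answered from that index. This illustrates the principle only; real engines are far more elaborate.

```python
# Build a tiny inverted index (word -> {page: [positions]}) and search it.
import re

pages = {  # made-up pages standing in for downloaded documents
    1: "Search engines help find information on the Internet.",
    2: "The index stores words and the pages where they occur.",
    3: "A spider downloads pages; an index robot indexes them.",
}

index = {}
for page_no, text in pages.items():
    words = re.findall(r"[a-z]+", text.lower())  # markup/punctuation removed
    for position, word in enumerate(words):
        index.setdefault(word, {}).setdefault(page_no, []).append(position)

def search(word):
    """Return the numbers of all pages containing the word."""
    return sorted(index.get(word.lower(), {}))

print(search("index"))   # [2, 3]
print(search("spider"))  # [3]
```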

Search engine quality

Search quality is synonymous with relevance. For search engines, the word "relevant" is almost the main term. Relevant search results are those that contain pages corresponding to the meaning of the search query. Relevance, or search quality, is a subtle thing.

Another important criterion for the quality of the search engine's work is accuracy.

Accuracy is a measure of the quality of the returned results: it is calculated as the share of relevant pages in the total number of pages displayed in the search results. However, not only the accuracy of the search is important, but also the ranking of the results.

Ranking is the ordering of search results by relevance.
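
For example, accuracy (precision) can be computed as follows; the relevance judgments in this example are hypothetical.

```python
# Precision: the share of relevant pages among all pages returned.
def precision(returned_pages, relevant_pages):
    returned = set(returned_pages)
    return len(returned & set(relevant_pages)) / len(returned)

# 10 results shown, 7 of them judged relevant -> precision 0.7
print(precision(range(1, 11), [1, 2, 3, 5, 6, 8, 9]))
```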

It is impossible to say which search engine is better in the abstract. For the user, the best engine is the one that returns the most relevant and accurate results. For a site owner, the best engine is the one in which the site is clearly visible and which brings the largest number of targeted visitors.

1. DuckDuckGo

What is it

DuckDuckGo is a fairly well-known open-source search engine. Its servers are located in the USA. In addition to its own robot, it uses results from other sources: Yahoo, Bing, Wikipedia.

How it is better

DuckDuckGo positions itself as a search engine that provides maximum privacy and confidentiality. The system does not collect any user data, does not store logs (no search history), the use of cookies is as limited as possible.

DuckDuckGo does not collect or share personal information from users. This is our privacy policy.

Gabriel Weinberg, founder of DuckDuckGo

Why do you need it

All major search engines try to personalize results based on data about the person in front of the monitor. This phenomenon is called the "filter bubble": the user sees only results that agree with his preferences, or that the system considers such.

DuckDuckGo builds an objective picture that does not depend on your past behavior on the Web, and it gets rid of the Google and Yandex thematic ads based on your queries. DuckDuckGo also makes it easy to search for information in foreign languages: Google and Yandex by default give preference to Russian-language sites even if the query is entered in another language.


2. not Evil

What is it

not Evil is a search engine for the anonymous Tor network. To use it, you need to enter that network, for example by running the specialized browser of the same name.

not Evil is not the only search engine of its kind. There are LOOK (the default search in the Tor browser, accessible from the regular Internet), TORCH (one of the oldest search engines on the Tor network), and others. We settled on not Evil because of its unambiguous allusion to Google (just look at its start page).

How it is better

It searches where Google, Yandex, and other search engines are closed off in principle.

Why do you need it

There are many resources on the Tor network that cannot be found on the law-abiding Internet. And their number will grow as the government tightens its control over the content of the Web. Tor is a kind of network within the Network with its own social networks, torrent trackers, media, marketplaces, blogs, libraries, and so on.

3. YaCy

What is it

YaCy is a decentralized search engine based on P2P networks. Each computer on which the main software module is installed scans the Internet independently, that is, it is an analogue of a search robot. The results obtained are collected in a common database, which is used by all participants in YaCy.

How it is better

It is difficult to say whether it is better or worse here, since YaCy is a completely different approach to organizing search. The absence of a single server and company-owner makes the results completely independent of someone's preferences. The autonomy of each node excludes censorship. YaCy is capable of searching the deep web and non-indexed public networks.

Why do you need it

If you are a supporter of open source and the free Internet, which is not influenced by government agencies and large corporations, then YaCy is your choice. It can also be used to organize searches within a corporate or other autonomous network. And while YaCy is not very useful in everyday life, it is a worthy alternative to Google in terms of the search process.

4. Pipl

What is it

Pipl is a system designed to search for information about a specific person.

How it is better

The authors of Pipl claim that their specialized algorithms search more efficiently than "regular" search engines. In particular, the priority sources of information are social media profiles, comments, member lists and various databases where information about people is published, such as databases of court decisions. Pipl's leadership in this area has been validated by Lifehacker.com, TechCrunch and others.

Why do you need it

If you need to find information about a person living in the United States, Pipl will be much more effective than Google. The databases of Russian courts are apparently inaccessible to the engine, so it does not cope as well with citizens of Russia.

5. FindSounds

What is it

FindSounds is another specialized search engine. It searches open sources for various sounds: household noises, nature, cars, people, and so on. The service does not support queries in Russian, but there is an impressive list of Russian-language tags you can search by.

How it is better

The results are only sounds and nothing more. In the search settings, you can set the desired format and sound quality. All found sounds are available for download. There is a search for sounds by pattern.

Why do you need it

If you need to quickly find the sound of a musket shot, the drumming of a sapsucker woodpecker, or the scream of Homer Simpson, this service is for you. And we chose these examples only from the available Russian-language queries; in English the spectrum is even wider.

But seriously, a specialized service assumes a specialized audience. But what if it comes in handy?

6. Wolfram|Alpha

What is it

Wolfram|Alpha is a computational search engine. Instead of links to articles containing keywords, it provides a ready-made answer to the user's query. For example, if you enter "compare the populations of New York and San Francisco" in English into the search form, Wolfram|Alpha immediately displays tables and graphs with the comparison.

How it is better

This service is better than others at finding facts and calculating data. Wolfram|Alpha collects and organizes knowledge available on the Web from a variety of fields, including science, culture, and entertainment. If this database contains a ready-made answer to a query, the system shows it; if not, it calculates and displays the result. In either case, the user sees only the necessary information and nothing superfluous.

Why do you need it

If you are, for example, a student, analyst, journalist, or research scientist, you can use Wolfram|Alpha to find and calculate data related to your work. The service does not understand all queries, but it is constantly evolving and becoming smarter.

7. Dogpile

What is it

The Dogpile metasearch engine displays a combined list of results from Google, Yahoo, and other popular search engines.

How it is better

First, Dogpile displays fewer ads. Second, the service uses a special algorithm to find and show the best results from the different search engines. According to Dogpile's developers, their system generates the most complete search results on the entire Internet.

Why do you need it

If you cannot find information in Google or another standard search engine, search for it in several search engines at once using Dogpile.

8. BoardReader

What is it

BoardReader is a system for text search in forums, Q&A services and other communities.

How it is better

The service allows you to narrow the search field to social platforms. Thanks to special filters, you can quickly find posts and user comments that match your criteria: language, publication date and site name.

Why do you need it

BoardReader can be useful for PR specialists and other media professionals who are interested in the opinion of a mass audience on certain issues.

Finally

The life of alternative search engines is often fleeting. Lifehacker asked Sergei Petrenko, the former general director of the Ukrainian branch of Yandex, about the long-term prospects of such projects.


Sergey Petrenko

Former CEO of Yandex.Ukraine.

As for the fate of alternative search engines, it is simple: to be very niche projects with a small audience and therefore without clear commercial prospects, or rather with complete clarity about their absence.

If you look at the examples in the article, you can see that such search engines either specialize in a narrow but in-demand niche that, perhaps only for now, has not grown enough to be noticeable on the radars of Google or Yandex, or they are testing an original ranking hypothesis that is not yet applicable in regular search.

For example, if search on Tor suddenly turns out to be in demand, that is, if results from there are needed by even a percent of Google's audience, then ordinary search engines will of course begin to solve the problem of how to find them and show them to the user. And if audience behavior shows that, for a noticeable share of users on a noticeable number of queries, results computed without user-dependent factors seem more relevant, then Yandex or Google will begin to return such results.

“To be better” in the context of this article does not mean “to be better at everything”. Yes, in many aspects our heroes are far from Google and Yandex (even Bing is far away). But on the other hand, each of these services gives the user something that the giants of the search industry cannot offer. Surely you also know similar projects. Share with us - we will discuss.

Search engines are one of the main ways to find information on the Internet. Search engines crawl the Web every day: they visit web pages and enter them into giant databases. This allows a user to type in some keywords, hit "submit", and see which pages satisfy the request.

Understanding how search engines work is essential for webmasters. For them, the correct structure of documents and of the entire server or site, from the point of view of search engines, is vitally important. Without this, documents will not appear often enough in responses to user queries, or may not be indexed at all.

Webmasters want to increase the ranking of their pages, and this is understandable: after all, for any request to a search engine, hundreds and thousands of links to documents that correspond to it can be issued. In most cases, only the first 10 links have sufficient relevance to the query.

Naturally, you want the document to be in the top ten, since most users rarely look at the links following the top ten. In other words, if the link to the document is eleventh, then it is just as bad as if it did not exist at all.

Major search engines

Which of the hundreds of search engines are really important for a webmaster? Well, of course, widely known and frequently used. But at the same time, you should take into account the audience for which your server is designed. For example, if your server contains highly specialized information about the latest methods of milking cows, then you probably shouldn't rely on general search engines. In this case, I would advise you to exchange links with your colleagues who are engaged in similar issues 🙂 So, first, let's define the terminology.

There are two types of informational databases about web pages: search engines and directories.

Search engines (spiders, crawlers) constantly explore the Web in order to replenish their databases of documents. This usually requires no effort on a person's part. An example is the AltaVista search engine.

For search engines, the construction of each document is quite important. Title, meta-tags and page content are of great importance.

Directories: unlike search engines, information is entered into a directory on a person's initiative. An added page must be rigidly tied to the categories accepted in the directory. An example of a directory is Yahoo. The construction of its pages does not matter. Below we will focus mainly on search engines.

Altavista

The system was opened in December 1995. It is owned by DEC. Since 1996 it has been working with Yahoo.

Excite Search

Launched in late 1995, the system has evolved rapidly. It purchased Magellan in July 1996 and acquired WebCrawler in September 1996. However, the two are used separately from each other. Perhaps in the future they will work together.

There is also a catalog in this system - Excite Reviews. Getting into this directory is a stroke of luck, since not all sites are listed there. However, the information from this directory is not used by the search engine by default, but it is possible to check it after viewing the search results.

HotBot

Launched May 1996. Owned by Wired. Based on Berkeley Inktomi search engine technology.

InfoSeek

Launched a little before 1995, it is well known, looks good, and is easily accessible. Ultrasmart/Ultraseek currently contains about 50 million URLs.

The default search option is Ultrasmart, in which case both directories are searched. With the Ultraseek option, query results are returned without additional information. A genuinely new search technology also allows for easier searches and a host of other features, which you can read about at InfoSeek. There is an InfoSeek Select directory separate from the search engine.

Lycos

One of the oldest search engines, Lycos, has been operating since about May 1994. It is widely known and frequently used. It includes the Point search engine (operating since 1995) and the A2Z catalog (operating since February 1996).

OpenText

The OpenText system appeared a little before 1995. In June 1996 it began to partner with Yahoo. It is gradually losing its position and will soon cease to be counted among the major search engines.

Webcrawler

Opened April 20, 1994 as a research project at the University of Washington. Acquired by America Online in March 1995. There is a WebCrawler Select catalog.

Yahoo

The oldest directory, Yahoo, was launched in early 1994. It is widely known, frequently used, and most respected. In March 1996 another Yahoo directory, Yahooligans for kids, was launched. More and more regional and top-level Yahoo directories are appearing.

Because Yahoo is filled by user submissions, some sites may not be listed. If a Yahoo search does not return suitable results, users can turn to a search engine. This is very easy to do: when a request is made to Yahoo, the directory can forward it to any of the major search engines. The first links in the list of matching addresses come from the directory, followed by the addresses obtained from search engines, in particular from AltaVista.

Features of search engines

Each search engine has a number of features. These features should be taken into account when making your pages.

Search engine type

"Full-text" search engines index every word on a web page, excluding a few stop words. "Abstract" search engines create an extract of each page.

For webmasters, full-text engines are more useful, since any word appearing on a web page is analyzed to determine its relevance to user queries. However, abstract engines may sometimes index pages better than full-text ones; this depends on the extraction algorithm, for example, on how it treats the frequency of the same words on a page.

The size

The size of a search engine is determined by the number of indexed pages. In a large search engine almost all of your pages may be indexed; in a medium-sized one your server may be indexed only partially; and in a small one your pages may not appear at all.

Update period

  • some search engines immediately index the page at the request of the user, and then continue to index the pages not yet indexed
  • others are more likely to "crawl" on the most popular web pages than on others

Date the document was indexed

Some search engines show the date when a particular document was indexed. This helps the user understand how "fresh" a link returned by the search engine is. Others leave users guessing about it.

Submitted pages

Ideally, search engines should find any page on any server by following links. The real picture is different: pages appear in search engine indexes much earlier if you submit them directly (Add URL).

Non-submitted pages

If at least one page of a server has been submitted, search engines will eventually find the other pages by following links from it. However, this takes longer. Some engines index the entire server at once, but most, having written the submitted page into the index, leave the rest of the server for later.

Indexing depth

This parameter applies only to pages that were not submitted. It shows how many pages after the specified one the search engine will index.

Most large machines have no indexing depth limits. In practice, this is not entirely true. Here are some reasons why not all pages may be indexed:

  • careless use of frame structures (no duplicate links in the controlling frameset file)
  • use of image maps without duplicating them with regular links

Frame support

If the search robot does not know how to work with frame structures, then many structures with frames will be missed during indexing.

ImageMap support

This is roughly the same problem as with server frame structures.

Password protected directories and servers

Some search engines can index such servers if they are given a username and password. Why is this needed? So that users can see what is on your server: it at least lets them know that such information exists, and perhaps they will then subscribe to it.

Link frequency

Major search engines can determine a document's popularity by how often it is linked from elsewhere on the web. Some machines, based on such data, "make a conclusion" whether or not it is worth spending time indexing such a document.
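
A hypothetical sketch of such popularity counting: given a small made-up link graph, count the inbound references to each document.

```python
# Count inbound links per page over an invented link graph.
from collections import Counter

link_graph = {  # page -> pages it links to (illustrative data)
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "d.html": ["c.html", "b.html"],
}

inbound = Counter(target for targets in link_graph.values() for target in targets)
print(inbound.most_common())  # [('c.html', 3), ('b.html', 2)]
# An engine might postpone indexing pages whose inbound count is very low.
```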

Update frequency

If the server is updated frequently, the search engine will re-index it more often; if rarely, less often.

Indexing control

Shows what means exist for controlling a particular search engine. All major search engines follow the guidelines of the robots.txt file. Some also support control via META tags in the indexed documents themselves.
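
From the robot's side, following robots.txt might look like the sketch below, which uses Python's standard urllib.robotparser to apply rules from a hypothetical file. (The per-page equivalent is a robots META tag such as <meta name="robots" content="noindex">.)

```python
# Apply (made-up) robots.txt rules the way a polite robot would.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())
print(parser.can_fetch("MyBot", "http://example.com/private/page.html"))  # False
print(parser.can_fetch("MyBot", "http://example.com/public/page.html"))   # True
```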

Redirect

Some sites redirect visitors from one server to another, and this parameter indicates which URL will be associated with your documents. This is important because if the search engine does not process the redirection, then problems with non-existent files may arise.

Stop words

Some search engines do not include certain words in their indexes or may not include them in user queries. Such words are usually prepositions or other very frequently used words, excluded to save space on the storage media. For example, AltaVista ignores the word "web", so for the query "web developer" only the second word would actually be searched. There are ways to avoid this.
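
One way such filtering might look at query time (the stop list here is purely illustrative):

```python
# Drop stop words from a query before searching the index.
STOP_WORDS = {"the", "a", "of", "and", "web"}  # "web" ignored, as noted above

def effective_query(query):
    return [w for w in query.lower().split() if w not in STOP_WORDS]

print(effective_query("web developer"))  # ['developer'] is all that gets searched
```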

Influence on the algorithm for determining relevance

Search engines invariably use the location and frequency of keywords in a document. However, the additional mechanisms for increasing relevance differ from engine to engine. This parameter shows what mechanisms a particular engine uses.

Spam fines

All major search engines "don't like" when a site tries to increase its ranking by, for example, repeatedly specifying itself through the Add URL or mentioning the same keyword multiple times, etc. In most cases, such actions (spamming, stacking) are punished, and the site's rating, on the contrary, falls.
