The case of the search engine. Internet search engine - Yahoo

The architecture of a search engine typically includes a search robot (crawler), an indexer, and a query engine; these components are examined in detail below.

History

Chronology
Year   System                 Event
1993   W3Catalog              Launch
       Aliweb                 Launch
       JumpStation            Launch
1994   WebCrawler             Launch
       Infoseek               Launch
       Lycos                  Launch
1995   AltaVista              Launch
       Daum                   Founded
       Open Text Web Index    Launch
       Magellan               Launch
       Excite                 Launch
       SAPO                   Launch
       Yahoo!                 Launch
1996   Dogpile                Launch
       Inktomi                Founded
       Rambler                Founded
       HotBot                 Founded
       Ask Jeeves             Founded
1997   Northern Light         Launch
       Yandex                 Launch
1998   Google                 Launch
1999   AlltheWeb              Launch
       GenieKnows             Founded
       Naver                  Launch
       Teoma                  Founded
       Vivisimo               Founded
2000   Baidu                  Founded
       Exalead                Founded
2003   Info.com               Launch
2004   Yahoo! Search          Final launch
       A9.com                 Launch
       Sogou                  Launch
2005   MSN Search             Final launch
       Ask.com                Launch
       Nigma                  Launch
       GoodSearch             Launch
       SearchMe               Founded
2006   Wikiseek               Founded
       Quaero                 Founded
       Live Search            Launch
       ChaCha                 Launch (beta)
       Guruji.com             Launch (beta)
2007   Wikiseek               Launch
       Sproose                Launch
       Wikia Search           Launch
       Blackle.com            Launch
2008   DuckDuckGo             Launch
       Tooby                  Launch
       Picollator             Launch
       Viewzi                 Launch
       Cuil                   Launch
       Boogami                Launch
       LeapFish               Launch (beta)
       Forestle               Launch
       VADLO                  Launch
       Powerset               Launch
2009   Bing                   Launch
       KAZ.KZ                 Launch
       Yebol                  Launch (beta)
       Mugurdy                Closure
       Scout                  Launch
2010   Cuil                   Closure
       Blekko                 Launch (beta)
       Viewzi                 Closure
2012   WAZZUB                 Launch
2014   Sputnik                Launch (beta)

At an early stage in the development of the Internet, Tim Berners-Lee maintained a list of web servers posted on the CERN website. As sites multiplied, maintaining such a list by hand became harder and harder. The NCSA website had a dedicated "What's New!" section, where links to new sites were published.

The first computer program for searching the Internet was Archie (the word "archive" without the "v"). It was created in 1990 by Alan Emtage, Bill Heelan, and J. Peter Deutsch, computer science students at McGill University in Montreal. The program downloaded lists of all files from all available anonymous FTP servers and built a database that could be searched by file name. However, Archie did not index the contents of these files, as the amount of data was so small that everything could easily be found by hand.

The development and spread of the Gopher network protocol, created in 1991 by Mark McCahill at the University of Minnesota, led to the creation of two new search programs, Veronica and Jughead. Like Archie, they searched the file names and headers stored in Gopher index systems. Veronica (Very Easy Rodent-Oriented Net-wide Index to Computerized Archives) allowed keyword searches of most Gopher menu titles across all Gopher listings. Jughead (Jonzy's Universal Gopher Hierarchy Excavation And Display) retrieved menu information from specific Gopher servers. Although the name of the Archie search engine had no connection to the "Archie" comic book series, Veronica and Jughead are nevertheless characters from those comics.

By the summer of 1993, there was still no search engine for the web, although numerous specialized directories were maintained by hand. Oscar Nierstrasz at the University of Geneva wrote a series of Perl scripts that periodically copied these pages and rewrote them into a standard format. This became the basis for W3Catalog, the first primitive web search engine, launched on September 2, 1993.

Probably the first search robot written in Perl was the "World Wide Web Wanderer," a bot created by Matthew Gray in June 1993. This robot built the "Wandex" search index. The purpose of the Wanderer was to measure the size of the World Wide Web and find all web pages containing the words from a query. In 1993 the second search engine, Aliweb, also appeared. Aliweb did not use a crawler; instead, it waited for notifications from website administrators that their sites contained an index file in a certain format.

JumpStation, created in December 1993 by Jonathan Fletcher, searched and indexed web pages with a crawler and used a web form as the interface for entering search queries. It was the first Internet search tool to combine the three essential functions of a search engine (crawling, indexing, and searching). Because of the limited computer resources of the time, indexing, and therefore search, was limited to the titles and headings of the pages the crawler found.

Search engines took part in the dot-com bubble of the late 1990s. Several companies entered the market spectacularly, generating record profits during their IPOs. Some later abandoned the public search market and now work only with the corporate sector, for example Northern Light.

The idea of selling keywords was pioneered in 1998 by goto.com, then a small company running its own search engine; Google adopted the model. The move marked a shift for search engines from competing with each other to becoming one of the most profitable businesses on the Internet: search engines began to sell the top places in search results to individual companies.

The Google search engine has held a prominent position since the early 2000s. The company achieved this thanks to good search results produced by the PageRank algorithm. The algorithm was presented to the public in the paper "The Anatomy of a Large-Scale Hypertextual Web Search Engine" by Sergey Brin and Larry Page, the founders of Google. This iterative algorithm ranks web pages based on the number of hyperlinks pointing to a page, on the assumption that "good" and "important" pages collect more links than others. Google's interface is designed in a spartan style with nothing superfluous, unlike many of its competitors, who built their search engines into web portals. The Google search engine became so popular that imitators appeared, for example Mystery Seeker (the "secret search engine").
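The idea behind PageRank fits in a few lines of code. Below is a minimal sketch of the iterative link-based ranking on a made-up three-page graph; it illustrates the published idea, not Google's actual implementation:

    # A toy PageRank iteration (hypothetical 3-page link graph).
    links = {
        "a": ["b", "c"],   # page "a" links to "b" and "c"
        "b": ["c"],
        "c": ["a"],
    }
    n = len(links)
    damping = 0.85                       # standard damping factor from the paper
    ranks = {page: 1.0 / n for page in links}

    for _ in range(50):                  # iterate until the ranks stabilize
        new_ranks = {}
        for page in links:
            # Rank flowing in from every page that links to this one.
            incoming = sum(ranks[src] / len(out) for src, out in links.items()
                           if page in out)
            new_ranks[page] = (1 - damping) / n + damping * incoming
        ranks = new_ranks

    # "Good" pages, linked to by many or by important pages, end up with more rank.
    print(sorted(ranks.items(), key=lambda kv: -kv[1]))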

Search for information in Russian

In 1996, search with Russian morphology was implemented on the AltaVista search engine, and the original Russian search engines Rambler and Aport were launched. On September 23, 1997, the Yandex search engine opened. On May 22, 2014, Rostelecom launched the national search engine Sputnik, which as of 2015 was in beta testing. On April 22, 2015, the new Sputnik.Children service, designed specifically for children with enhanced safety, was launched.

Cluster analysis and metasearch methods have gained great popularity. Of the international engines of this kind, the best known was Clusty, by Vivisimo. In 2005, with the support of Moscow State University, the Nigma search engine, which supports automatic clustering, was launched in Russia. In 2006 the Russian metasearch engine Quintura opened, offering visual clustering in the form of a tag cloud. Nigma also experimented with visual clustering.

How the search engine works

The main components of a search engine are the search robot (crawler), the indexer, and the search server.

As a rule, these systems work in stages. First the crawler fetches content, then the indexer builds a searchable index, and finally the search server provides the functionality for searching the indexed data. To keep the search engine up to date, this indexing cycle is repeated.

Search engines work by storing information about many web pages, which they obtain from HTML pages. The search robot, or "crawler," is a program that automatically follows all the links found on a page and extracts them. Based on those links, or on a predefined list of addresses, the crawler looks for new documents not yet known to the search engine. A site owner can exclude certain pages using robots.txt, which can be used to prevent the indexing of files, pages, or directories of the site.
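As an illustration, a well-behaved crawler can honor robots.txt with Python's standard library; the domain and user-agent below are placeholders:

    # Sketch: checking robots.txt before crawling (example.com is a placeholder).
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()                        # fetch and parse the site's robots.txt

    page = "https://example.com/private/report.html"
    if rp.can_fetch("MyCrawler", page):      # "MyCrawler" is a made-up user agent
        print("crawling allowed:", page)
    else:
        print("disallowed by robots.txt:", page)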

The search engine analyzes the content of each page for later indexing. Words can be extracted from titles, page text, or special fields (meta tags). The indexer is a module that breaks a page into parts and analyzes them using its own lexical and morphological algorithms. All elements of a web page are isolated and analyzed separately. Data about web pages is stored in an index database for use in subsequent queries; the index makes it possible to find information quickly in response to a user's query. A number of search engines, like Google, store all or part of the original page (the so-called cache) as well as various information about the web page. Other systems, like AltaVista, store every word of every page found. Using the cache speeds up the retrieval of information from already visited pages. A cached page contains the text as it was when the page was indexed, which can be useful when the web page has since been updated and no longer contains the text of the user's query while the cached copy is still the old one. This situation is related to link rot and to Google's usability-oriented approach of returning short fragments of cached text containing the query terms. The principle of least surprise applies: the user usually expects to see the search words in the texts of the retrieved pages (user expectations). Besides speeding up searches, cached pages may contain information that is no longer available anywhere else.
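As a toy illustration of such an index (a sketch only: production indexers also normalize word forms, store positions, and much more), a mapping from each word to the documents containing it can be built like this:

    # Sketch: building an inverted index from a few hypothetical pages.
    from collections import defaultdict

    docs = {
        1: "how to choose a car",
        2: "how to choose a car radio",
        3: "weather forecast for tomorrow",
    }

    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():   # real indexers apply morphology too
            index[word].add(doc_id)

    # The index answers "which documents contain this word?" without rescanning pages.
    print(index["car"])      # {1, 2}
    print(index["weather"])  # {3}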

The search server works with the output files received from the indexer. It accepts user queries, processes them using the index, and returns the search results.

When a user enters a query into a search engine (usually a set of keywords), the system checks its index and returns a list of the most relevant web pages (sorted by some criterion), usually with a brief annotation containing the document's title and sometimes fragments of its text. The search index is built using a special technique based on information extracted from web pages. Since 2007, Google has allowed searching by the time documents were created (open the "Search tools" menu and specify a time range). Most search engines support boolean AND, OR, and NOT operators in queries, which let you refine or expand the list of search keywords. Many also support exact-phrase search, in which case the system looks for the words or phrase exactly as entered. Some search engines allow proximity search, where the user widens the search by specifying the allowed distance between keywords. There is also concept-based search, which relies on statistical analysis of how the query words and phrases are used in the texts of web pages; such systems allow queries in natural language. An example of such a search engine is the Ask.com site.
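Over an inverted index like the toy one above, the boolean operators reduce to simple set operations; a sketch:

    # Sketch: AND / OR / NOT queries as set operations over a tiny inverted index.
    index = {
        "choose":  {1, 2},
        "car":     {1, 2},
        "radio":   {2},
        "weather": {3},
    }
    all_docs = {1, 2, 3}

    def query_and(*words):          # documents containing every word
        return set.intersection(*(index.get(w, set()) for w in words))

    def query_or(*words):           # documents containing at least one word
        return set.union(*(index.get(w, set()) for w in words))

    def query_not(word):            # documents not containing the word
        return all_docs - index.get(word, set())

    print(query_and("choose", "car"))        # {1, 2}
    print(query_or("radio", "weather"))      # {2, 3}
    print(query_not("radio"))                # {1, 3}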

The usefulness of a search engine depends on the relevance of the pages it finds. While millions of web pages may include a word or phrase, some may be more relevant, popular, or authoritative than others. Most search engines use ranking methods to bring the "best" results to the top of the list. Search engines decide in different ways which pages are more relevant and in what order results should be shown. Search methods, like the Internet itself, change over time. Thus, two main types of search engines have emerged: systems of predefined and hierarchically ordered keywords, and systems in which an inverted index is generated by text analysis.

Most search engines are commercial enterprises that profit from advertising; in some search engines, top positions in the results for given keywords can be bought. Search engines that do not charge for the order of results earn money from contextual advertising, in which the advertising messages match the user's query. Such ads are displayed on the results page, and the search engine earns money every time a user clicks on one.

Search Engine Types

There are four types of search engines: robot-driven, human-curated, hybrid, and metasearch systems.

  • systems using search robots
They consist of three parts: a crawler ("bot," "robot," or "spider"), an index, and the search software. The crawler traverses the network and builds lists of web pages. The index is a large archive of copies of web pages. The purpose of the software is to evaluate search results. Because the crawler in this scheme constantly explores the network, the information is more up to date. Most modern search engines are systems of this type.
  • human-controlled systems (resource catalogs)
These search engines obtain their lists of web pages from people. A directory entry contains the address, title, and a brief description of the site, and the resource catalog searches only the page descriptions submitted to it by webmasters. The advantage of directories is that all resources are checked manually, so the quality of the content is better than in results obtained automatically by systems of the first type. But there is also a drawback: these directories are updated by hand and can lag significantly behind the real state of affairs, and page rankings cannot change instantly. Examples of such systems are the Yahoo directory, DMOZ, and Galaxy.
  • hybrid systems
Search engines such as Yahoo, Google, MSN combine the functions of systems using search robots and human-controlled systems.
  • meta-systems
Metasearch engines combine and rank the results of several search engines at once. These engines were useful when each search engine had a unique index and search engines were less "smart." Since search has improved greatly, the need for them has decreased. Examples: MetaCrawler and MSN Search.

Search engine market

Google is the most popular search engine in the world, with a market share of 69.24% as of September 2015. Bing occupies the second position, with a 12.26% share.

The most popular search engines in the world:

Search engine   July 2014   October 2014   September 2015
Google          68.69%      58.01%         69.24%
Baidu           17.17%      29.06%         6.48%
Bing            6.22%       8.01%          12.26%
Yahoo!          6.74%       4.01%          9.19%
AOL             0.13%       0.21%          1.11%
Excite          0.22%       0.00%          0.00%
Ask             0.13%       0.10%          0.24%

Asia

In the countries of East Asia and in Russia, Google is not the most popular search engine. In China, for example, the Soso search engine is more popular.

In South Korea, the homegrown search portal Naver handles about 70% of searches; Yahoo! Japan and Yahoo! Taiwan are the most popular search engines in Japan and Taiwan, respectively.

Russia and Russian-language search engines

According to LiveInternet data from June 2015 on the coverage of Russian-language search queries:

  • All-language:
    • Yahoo! (0.1%) and the search engines owned by this company: Inktomi, AltaVista, AlltheWeb
  • English-speaking and international:
    • AskJeeves (Teoma engine)
  • Russian-language: most "Russian-language" search engines index and search texts in many languages (Ukrainian, Belarusian, English, Tatar, and others). They differ from "all-language" systems, which index all documents in a row, in that they mainly index resources located in domain zones where the Russian language dominates, or otherwise restrict their robots to Russian-language sites.

Some of the search engines use external search algorithms.

Quantitative Google Search Engine Data

The number of Internet users, the number of search engines, and user demands on these systems grow constantly. To speed up the search for needed information, large search engines run a large number of servers, usually grouped into server centers (data centers). Popular search engines have data centers scattered all over the world.

In October 2012, Google launched the Where the Internet Lives project, where users are given the opportunity to get acquainted with the company's data centers.

The following is known about the operation of Google's data centers:

  • The total capacity of all Google data centers as of 2011 was estimated at 220 MW.
  • When Google planned to open a new 6.5 million m² three-building complex in Oregon in 2008, Harper's Magazine estimated that such a large complex would consume over 100 MW of electricity, comparable to the energy consumption of a city of 300,000 people.
  • The estimated number of Google servers in 2012 is 1,000,000.
  • Google's spending on data centers was $1.9 billion in 2006 and $2.4 billion in 2007.

The size of the World Wide Web indexed by Google as of December 2014 is approximately 4.36 billion pages.

Search engines that take into account religious prohibitions

The global spread of the Internet and the growing popularity of electronic devices in the Arab and Muslim world, in particular in the countries of the Middle East and the Indian subcontinent, encouraged the development of local search engines that take Islamic traditions into account. Such search engines contain special filters that help users avoid prohibited sites, such as pornography sites, and allow them to use only sites whose content does not contradict the Islamic faith. Shortly before the Muslim month of Ramadan, in July 2013, Halal Googling was introduced to the world: a system that gives users only halal, "correct" links by filtering the search results received from other search engines such as Google and Bing. Two years earlier, in September 2011, the I'mHalal search engine had been launched to serve users in the Middle East. However, this search service soon had to close, according to its owner, for lack of funding.

A lack of investment and the slow pace of technology adoption in the Muslim world have hindered progress and prevented the success of a serious Islamic search engine. Huge investments in Muslim-lifestyle web projects have failed; one such project was Muxlim, which received millions of dollars from investors such as Rite Internet Ventures and which, according to the last post from I'mHalal before it shut down, promoted the dubious idea that "the next Facebook or Google can appear only in the Middle East, if you support our brilliant youth." Nevertheless, Islamic internet experts have for years been defining what does or does not conform to Sharia and classifying websites as "halal" or "haram." All past and present Islamic search engines are either just specially indexed data sets or major search engines such as Google, Yahoo, and Bing with some kind of filtering system that keeps users away from haram sites, such as sites about nudity, LGBT topics, gambling, and anything else considered anti-Islamic.

Other religion-oriented search engines include Jewogle, a Jewish version of Google, and SeekFind.org, a Christian site with filters meant to keep users away from content that could undermine or weaken their faith.

Personal results and filter bubbles

Many search engines, such as Google and Bing, use algorithms to selectively guess what information a user would like to see, based on the user's past activity in the system. As a result, websites show only information consistent with the user's past interests. This effect is called the "filter bubble."

All this leads to users receiving far less information that contradicts their point of view and becoming intellectually isolated inside their own "information bubble." The bubble effect can thus have negative consequences for the formation of civic opinion.

Search engine bias

Although search engines are programmed to rank websites by some combination of popularity and relevance, experimental research indicates that various political, economic, and social factors influence the results.

This bias can be a direct result of economic and commercial processes: companies that advertise on a search engine can become more popular in its organic search results. The removal of search results that do not comply with local laws is an example of the influence of political processes: for instance, Google does not display some neo-Nazi websites in France and Germany, where Holocaust denial is illegal.

Bias can also be a consequence of social processes, since search engine algorithms are often designed to exclude non-mainstream points of view in favor of more "popular" results. The indexing algorithms of the major search engines give priority to American sites.

Search bombing is one example of attempts to manipulate search results for political, social, or commercial reasons.

See also

  • Qwika
  • Electronic library: lists of libraries and search systems
  • Web developer toolbar


Professional search on the Internet requires specialized software as well as specialized search engines and search services.

PROGRAMS

http://dr-watson.wix.com/home - a program designed to study arrays of textual information in order to identify entities and the relationships between them. The result of its work is a report on the object under study.

http://www.fmsasg.com/ - Sentinel Visualizer is among the world's best software for visualizing links and relationships. The company has fully localized its products into Russian and runs a Russian-language hotline.

http://www.newprosoft.com/ - "Web Content Extractor" is powerful, easy-to-use software for extracting data from web sites. It also includes an efficient visual web spider.

SiteSputnik - a software package, without analogues elsewhere, for searching the visible and invisible Internet with all the search engines the user needs and for processing the results.

WebSite-Watcher - monitors web pages, including password-protected ones, as well as forums, RSS feeds, newsgroups, and local files. It has a powerful filter system. Monitoring is automatic, with results delivered in a convenient form. The version with advanced features costs 50 euros. Constantly updated.

http://www.scribd.com/ - the world's most popular platform, increasingly used in Russia as well, for hosting documents, books, and the like in free access, with a very convenient search by title, topic, and more.

http://www.atlasti.com/ - the most powerful and effective tool available to individual users and small and even medium-sized businesses for qualitative information analysis. It combines a single working environment for text, spreadsheet, audio, and video files with tools for qualitative analysis and visualization.

Ashampoo ClipFinder HD - a growing share of the information flow is video, so competitive-intelligence scouts need tools for this format too. One such product is this free utility. It searches video hosting sites such as YouTube by given criteria and shows all results on one page with detailed information: title, duration, upload time, and so on. There is a Russian interface.

http://www.advego.ru/plagiatus/ - made by SEO optimizers, but quite suitable as an Internet-intelligence tool. Plagiatus shows the degree of uniqueness of a text, the sources of the text, and the percentage of matching text. It also checks the uniqueness of a given URL. The program is free.

http://neiron.ru/toolbar/ - includes an add-on that combines Google and Yandex search and also supports competitive analysis based on evaluating the effectiveness of sites and contextual advertising. Implemented as a plugin for Firefox and Google Chrome.

http://web-data-extractor.net/ - a universal solution for obtaining any data available on the Internet. Extracting data from any page is configured in a few mouse clicks: just select the data area you want to save, and Datacol selects the formula for cutting out that block.

CaptureSaver - a professional Internet research tool and an indispensable working program that captures, stores, and exports any online information, including not only web pages and blogs but also RSS news, email, images, and much more. It has very wide functionality, an intuitive interface, and a very low price.

http://www.orbiscope.net/en/software.html - web monitoring system at more than affordable prices.

http://www.kbcrawl.co.uk/ - software for working with, among other things, the "invisible Internet."

http://www.copernic.com/en/products/agent/index.html - searches with more than 90 search engines and more than 10 parameters. It can merge results, eliminate duplicates, block broken links, and show the most relevant results. Available in free, personal, and professional versions. Used by more than 20 million users.

Maltego - fundamentally new software for establishing the relationships between people, events, and objects both in real life and on the Internet.

SERVICES

https://hunter.io/ - an efficient service for finding and verifying email addresses.

https://www.whatruns.com/ - an easy-to-use yet effective scanner that shows what a website runs on, what does not work, and where its security holes are. Also implemented as a plugin for Chrome.

https://www.crayon.co/ - an American low-cost platform for market and competitive intelligence on the Internet.

http://www.cs.cornell.edu/~bwong/octant/ - host locator.

https://iplogger.ru/ - a simple and convenient service for determining someone else's IP.

http://linkurio.us/ is a powerful new product for economic security workers and corruption investigators. Processes and visualizes huge arrays of unstructured information from financial sources.

http://www.intelsuite.com/en is an English-language online platform for competitive intelligence and monitoring.

http://yewno.com/about/ is the first operating system for translating information into knowledge and visualizing unstructured information. Currently supports English, French, German, Spanish and Portuguese.

https://start.avalancheonline.ru/landing/?next=%2F - forecasting and analytical services of Andrey Masalovich.

https://www.outwit.com/products/hub/ - a complete set of standalone programs for professional work on the web.

https://github.com/search?q=user%3Acmlh+maltego - extensions for Maltego.

http://www.whoishostingthis.com/ - search engine for hosting, IP addresses, etc.

http://appfollow.ru/ - analysis of applications based on reviews, ASO optimization, positions in tops and search results for the App Store, Google Play and Windows Phone Store.

http://spiraldb.com/ - a service, implemented as a plugin for Chrome, that provides a lot of valuable information about any electronic resource.

https://millie.northernlight.com/dashboard.php?id=93 - a free service that collects and structures key information on industries and companies. It is possible to use information panels based on text analysis.

http://byratino.info/ - collection of factual data from publicly available sources on the Internet.

http://www.datafox.co/ - CI platform that collects and analyzes information on companies of interest to customers. There is a demo.

https://unwiredlabs.com/home - a specialized application with an API for searching by geolocation of any device connected to the Internet.

http://visualping.io/ - a service for monitoring sites and, above all, the photos and images on them. Even if a photo appeared for only a second, it will be in the subscriber's email. Has a plugin for Google Chrome.

http://spyonweb.com/ is a research tool that allows you to carry out a deep analysis of any Internet resource.

http://bigvisor.ru/ - the service allows you to track advertising campaigns for certain segments of goods and services, or for specific organizations.

http://www.itsec.pro/2013/09/microsoft-word.html - Artem Ageev's instructions on using Windows programs for the needs of competitive intelligence.

http://granoproject.org/ is an open source tool for researchers who trace networks of connections between persons and organizations in politics, economics, crime, and more. Allows you to connect, analyze and visualize information obtained from various sources, as well as show significant relationships.

http://imgops.com/ is a service for extracting metadata from graphic files and working with them.

http://sergeybelove.ru/tools/one-button-scan/ - a small online scanner for checking security holes in websites and other resources.

http://isce-library.net/epi.aspx - a service for finding primary sources from a fragment of English text.

https://www.rivaliq.com/ is an effective tool for conducting competitive intelligence in Western, primarily European and American markets for goods and services.

http://watchthatpage.com/ is a service that allows you to automatically collect new information from monitored resources on the Internet. Service services are free.

http://falcon.io/ - a kind of Rapportive for the Web. It is not a replacement for Rapportive but provides additional tools: unlike Rapportive, it gives a general profile of a person, glued together from data from social networks and web mentions.

https://addons.mozilla.org/en/firefox/addon/update-scanner/ is an addon for Firefox. Keeps track of web page updates. Useful for websites that don't have news feeds (Atom or RSS).

http://agregator.pro/ is an aggregator of news and media portals. Used by marketers, analysts, etc. to analyze news flows on certain topics.

http://price.apishops.com/ is an automated web service for monitoring prices for selected product groups, specific online stores and other parameters.

http://www.la0.ru/ is a convenient and relevant service for analyzing links and backlinks to an Internet resource.

www.recordedfuture.com is a powerful data analysis and visualization tool implemented as an online service based on cloud computing.

http://advse.ru/ - a service under the slogan "Learn everything about your competitors." It lets you find competitors' websites by search queries and analyze their advertising campaigns in Google and Yandex.

http://spyonweb.com/ - the service allows you to identify sites with the same characteristics, including those using the same Google Analytics statistics service identifiers, IP addresses, etc.

http://www.connotate.com/solutions - a line of products for competitive intelligence, information flow management and transformation of information into information assets. It includes both complex platforms and simple cheap services that allow you to effectively monitor along with information compression and getting only the results you need.

http://www.clearci.com/ - a competitive intelligence platform for businesses of all sizes, from startups and small companies to Fortune 500 companies. Designed as SaaS.

http://startingpage.com/ - a front end to Google that lets you search without your IP address being recorded. Fully supports all Google search features, including Russian.

http://newspapermap.com/ - a unique service, very useful for a competitive intelligence officer, that connects geolocation with a search engine for online media. That is, you choose a region, a city, or a language, see on the map the online versions of newspapers and magazines for that place, click the appropriate button, and read. Supports Russian; very user-friendly interface.

http://infostream.com.ua/ - the very convenient Infostream news-monitoring system from D.V. Lande, one of the classics of Internet search; it is distinguished by first-class selection at prices affordable for any wallet.

http://www.instapaper.com/ is a very simple and effective tool for saving the necessary web pages. Can be used on computers, iPhones, iPads, etc.

http://screen-scraper.com/ - automatically extracts information from web pages, downloads the vast majority of file formats, and automatically enters data into various forms. Downloaded files and pages are stored in databases, and many other extremely useful functions are performed. Works on all major platforms; has a fully functional free version and a very powerful professional version.

http://www.mozenda.com/ - a web service, with several pricing plans accessible even to small businesses, for multifunctional web monitoring and delivery of the information the user needs from selected sites.

http://www.recipdonor.com/ - the service allows you to automatically monitor everything that happens on the sites of competitors.

http://www.spyfu.com/ - for when your competitors are foreign.

www.webground.su - a service for monitoring the Runet, created by Internet search professionals; it covers all the main providers of information and news and supports individual monitoring settings for the user's needs.

SEARCH ENGINES

https://www.idmarch.org/ - the best search engine, in terms of quality, over the world archive of pdf documents. More than 18 million pdf documents are currently indexed, ranging from books to secret reports.

http://www.marketvisual.com/ is a unique search engine that allows you to search for owners and top management by full name, company name, position, or a combination of them. The search results contain not only the desired objects, but also their relationships. Designed primarily for English-speaking countries.

http://worldc.am/ is a free-access photo search engine with reference to geolocation.

https://app.echosec.net/ - a publicly accessible search engine that describes itself as the most advanced analytics tool for law enforcement, security, and intelligence professionals. It finds photos posted on various sites, social platforms, and social networks by specific geolocation coordinates. Seven data sources are currently connected; by the end of the year there will be more than 450. Thanks to Dementy for the tip.

http://www.quandl.com/ is a search engine for seven million financial, economic and social databases.

http://bitzakaz.ru/ - a search engine for tenders and government orders, with additional paid features.

Website-Finder - makes it possible to find sites that are poorly indexed by Google. The only limitation is that it searches only 30 websites for each keyword. The program is easy to use.

http://www.dtsearch.com/ - a very powerful search engine that can process terabytes of text. Works on the desktop, the web, and intranets. Supports both static and dynamic data, and can search all MS Office formats. The search covers phrases, words, tags, indexes, and much more. Federated search is available. It has both paid and free versions.

http://www.strategator.com/ - searches, filters, and aggregates company information from tens of thousands of web sources. Covers the USA, Great Britain, and the main EEC countries. Highly relevant and user-friendly, with free and paid ($14 per month) options.

http://www.shodanhq.com/ - an unusual search engine that was nicknamed "Google for hackers" immediately after it appeared. It does not look for pages; it identifies IP addresses and the types of routers, computers, servers, and workstations at a particular address, traces chains of DNS servers, and offers many other functions interesting for competitive intelligence.

http://search.usa.gov/ is a search engine for websites and open databases of all US government agencies. The databases contain a lot of practical useful information, including for use in our country.

http://visual.ly/ - visualization is increasingly used to present data, and this is the first infographics search engine on the web. Along with the search engine, the portal offers powerful data visualization tools that require no programming skills.

http://go.mail.ru/realtime - searches discussions of topics, events, objects, and subjects in real or user-defined time. The previously much-criticized Mail.ru search now works very efficiently and gives interesting, relevant results.

Zanran - the first and only search engine for data; it has only just started but already works well, extracting data from PDF files, Excel spreadsheets, and HTML pages.

http://www.ciradar.com/Competitive-Analysis.aspx is one of the world's best search engines for competitive intelligence on the deep web. Extracts almost all kinds of files in all formats on the topic of interest. Implemented as a web service. The prices are more than reasonable.

http://public.ru/ - Effective search and professional analysis of information, media archive since 1990. The online media library offers a wide range of information services: from access to electronic archives of Russian-language media publications and ready-made thematic press reviews to individual monitoring and exclusive analytical studies based on press materials.

Cluuz - a young search engine with ample opportunities for competitive intelligence, especially on the English-language Internet. It not only finds but also visualizes and establishes links between people, companies, domains, e-mail addresses, physical addresses, and so on.

www.wolframalpha.com is the search engine of tomorrow. For a search query, it issues statistical and factual information available on the request object, including visualized information.

www.ist-budget.ru - universal search in databases of public procurement, tenders, auctions, etc.

A search engine is a database of specific information from the Internet. Many users believe that as soon as they enter a query, the entire Internet is crawled on the spot, but that is not the case at all. The Internet is scanned constantly by many programs; data about sites is entered into a database, where, by certain criteria, all sites and all their pages are sorted into various lists and databases. The search therefore runs not over the Internet itself but over this data store.

Popular search engines

Yandex is the largest search engine in Runet.

In addition to the search engine, Yandex offers 77 services, the most popular being Yandex Mail, Yandex Browser, Yandex Disk, traffic and weather information, Yandex Money, and much more. The search engine takes your location into account when returning results, and the search program is constantly upgraded to provide more relevant results with the greatest informational value for the user.

Google is the most popular search engine in the world.

In addition to the search engine, Google offers many services, software, and hardware, including a mail service, the Google Chrome browser, the largest video library YouTube, and many other projects. Google confidently buys up many projects that bring large profits. Most of its services are aimed not at the end user directly but at making money on the Internet, and they are integrated with a focus on the interests of European and American users.

Mail.ru is a search engine popular mainly because of its mail service.

There are many additional services, the key one being Mail.ru mail. At the moment Mail.ru owns the Odnoklassniki social network, its own My World network, the Money.Mail.ru service, many online games, and three nearly identical browsers under different names. All its applications and services carry a lot of advertising content. The VKontakte social network blocks direct links to Mail.ru services, citing a large number of viruses.

Wikipedia.

Wikipedia is a searchable reference system.

A non-profit project that exists on private donations and therefore does not fill its pages with advertising. A multilingual project whose goal is to create a complete reference encyclopedia in all the languages of the world. It has no specific authors; it is filled in and managed by volunteers from all over the world. Each user can both write and edit articles.

The official page is www.wikipedia.org.

Youtube is the largest video library.

Video hosting with elements of a social network, where each user can add a video. Since the acquisition by Google Inc., separate registration on YouTube is not required; a Google account is enough.

The official page is youtube.com.

Yahoo! is the second most important search engine in the world.

There are additional services, the best known of which is Yahoo Mail. To improve search quality, Yahoo passes data about users and their queries to Microsoft; from these data a picture of user interests is formed, as well as a market for advertising content. Yahoo, too, has absorbed other companies; for example, it owns the AltaVista search service and a stake in the Alibaba e-commerce company.

The official page is www.yahoo.com.

WDL is a digital library.

The library collects books of cultural value in digital form. The main goal is to increase the level of cultural content of the Internet. Access to the library is free.

The official page is www.wdl.org/ru/.

Bing is a search engine from Microsoft.

The official website is www.bing.com.

Search engines in Russia

Rambler is a "pro-American" search engine.

It was originally created as a media Internet portal. Like many other search engines, it offers image and video search, maps, a weather forecast, a news section, and much more. The developers also offer a free browser, Rambler-Nichrome.

The official page is www.rambler.ru.

Nigma is an intelligent search engine.

A more convenient search engine thanks to its many filters and settings. The interface lets you include or exclude suggested similar values in the search to get better results. When returning results, it can also draw on information from other major search engines.

The official page is www.nigma.ru.

Aport - online catalog of goods.

Formerly a search engine, Aport quickly lost ground once development and innovation stopped. At the moment, Aport is a trading platform presenting goods from more than 1,500 companies.

The official page is www.aport.ru.

Sputnik is a national search engine and Internet portal.

Created by Rostelecom. It is currently in the testing phase.

The official website is www.sputnik.ru.

Metabot is a developing search engine.

Metabot's task is to create a search engine over all other search engines, ranking results using data from the entire list of engines. That is, it is a search engine for search engines.

The official page is www.metabot.ru.

Turtle - the search engine's operation has been suspended.

The official page is www.turtle.ru.

KM - multiportal.

Initially, the site was a multi-portal with the subsequent introduction of a search engine. The search can be carried out both within the site and on all tracked Runet sites.

The official page is www.km.ru.

Gogo - no longer works; it redirects to another search engine.

The official page is www.gogo.ru.

Zoneru.org - a Russian multiportal, not very popular and in need of improvement. It includes news, TV, games, and a map.

The official page is www.zoneru.org.

The search engine no longer works; the developers suggest using another search engine instead.

The official page is www.au.ru.

Search engines (PS) have long been an indispensable part of the Internet. Today they are huge and complex mechanisms: not only a tool for finding any necessary information but also quite an exciting area for business.


Many search users have never thought about how these systems work, how user queries are processed, or how the systems are built and function. This material will help people engaged in optimization, and anyone curious, to understand the structure and main functions of search engines.

Functions and concept of PS

A search engine is a hardware and software complex designed to provide search over the Internet: it responds to a user's request, usually given as a text phrase (a search query), with a list of links to information sources, ranked by relevance. The most common and largest search engines: Google, Bing, Yahoo, Baidu. In the Runet: Yandex, Mail.Ru, Rambler.

Let's take a closer look at the meaning of the search query itself, taking the Yandex system as an example.

The query must be formulated by the user in full accordance with the subject of the search, as simply and concisely as possible. For example, we want to find information in this search engine about how to choose a car. To do this, we open the main page and enter the query "how to choose a car." Then our role is reduced to following the provided links to information sources on the network.




But even acting this way, we may not get the information we need. If we get such a negative result, we just need to reformulate the query, or there really is no useful information in the search base for this kind of query (which is quite possible with "narrow" query parameters, such as "how to choose a car in Anadyr").

The most basic task of every search engine is to deliver exactly the kind of information people need. And teaching users to write the "correct" type of query, that is, phrases matching the engine's principles of operation, is practically impossible.

That is why search engine developers design principles and algorithms that allow users to find the information they are interested in. The system must "think" the way a person thinks when searching for information on the Internet.

When a user enters a query into a search engine, he wants to find what he needs as easily and quickly as possible. Having received the result, he evaluates the system by several criteria. Did he manage to find the information he needed? If not, how many times did he have to reformulate the query to find it? How up to date was the information received? How quickly did the search engine process the request? How conveniently were the search results presented? Was the desired result first, or thirtieth? How much "garbage" (unneeded information) came up along with the useful information? Will the information still be relevant a week or a month later?




To get the right answers to such questions, search developers constantly improve ranking principles and algorithms, add new features and functions, and try by any means to make the system work faster.

Main characteristics of search engines

Let's outline the main characteristics of search:

Completeness.

Completeness (recall) is one of the most important characteristics of search: the ratio of the number of documents found for a query to the total number of documents on the Internet relevant to that query. For example, if there are 100 pages on the network with the phrase "how to choose a car" and only 60 of them were retrieved for this query, recall is 0.6. Clearly, the more complete the search, the more likely the user is to find exactly the document he needs, provided it exists at all.

Accuracy.

Another major characteristic of search is accuracy (precision): the degree to which the pages found match the user's query. For example, if there are a hundred documents for the key phrase "how to choose a car," half of which contain the phrase while the rest merely contain the individual words ("how to correctly choose a car radio and install it in a car"), then the precision is 50/100 = 0.5.

The more precise the search, the sooner the user finds the information he needs, the less "garbage" appears among the results, and the fewer found documents fail to match the meaning of the request.
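Both measures can be computed directly. A small sketch using the figures from the examples above:

    # Recall and precision with the numbers used in the examples above.
    relevant_on_web = 100    # pages on the network that actually match the query
    found_relevant = 60      # matching pages the engine managed to return
    recall = found_relevant / relevant_on_web
    print("recall:", recall)           # 0.6

    returned = 100           # documents the engine returned for the query
    truly_matching = 50      # of those, the ones that really fit the request
    precision = truly_matching / returned
    print("precision:", precision)     # 0.5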

Freshness.

This is a significant component of search, characterized by the time elapsed between the publication of information on the Internet and its entry into the search engine's index database.

For example, the day after the release of the new iPad, many users turned to search with the corresponding queries. In most cases, information about this news was already available in search, although very little time had passed since its appearance. This is because the large search engines have a "fast database" that is updated several times a day.

Search speed.

Search speed is closely related to so-called "load tolerance." Every second a huge number of people query the engine, and such a workload requires a significant reduction in the time needed to process a single request. Here the interests of the search engine and the user fully coincide: the visitor wants results as quickly as possible, and the engine must process the query as quickly as possible so as not to slow down the processing of subsequent requests.

Visibility.

Visual presentation of results is an essential element of convenient search. For many queries the engine finds thousands, and in some cases millions, of documents. Because key phrases are often vague or imprecise, even the very first results do not always contain only the needed information.

This means that a person often has to do his own search within the provided results. The various components of the PS results pages help in navigating them.

History of search engines

When the Internet first began to develop, the number of its regular users was small and the amount of accessible information relatively modest. Mostly, only specialists in research fields had access to the network. Back then, the task of finding information was not as pressing as it is now.

One of the first ways of organizing broad access to information resources was the creation of site directories, with links grouped by topic. The first such project was Yahoo.com, which opened in the spring of 1994. Later, when the number of sites in the Yahoo directory had grown substantially, a search option over the directory was added. It was not yet a full search engine, since its scope was limited to the sites in the directory rather than all resources on the Internet. Link directories were widely used in the past but have now almost completely lost their popularity.

After all, even today's huge directories hold information on only a small fraction of the sites on the Internet. The most famous and largest directory in the world has information on five million sites, while the Google database contains more than 25 billion pages.




The very first real search engine was WebCrawler, which appeared back in 1994.

AltaVista and Lycos appeared the following year, and AltaVista held the lead in information search for a very long time.




In 1997, Sergey Brin and Larry Page created the Google search engine as a research project at Stanford University. Today Google is the most popular search engine in the world.




In September 1997, the Yandex PS was officially announced; it is currently the most popular search engine on the Runet.




As of September 2015, search engine shares worldwide were distributed as follows:
  • Google - 69.24%;
  • Bing - 12.26%;
  • Yahoo! - 9.19%;
  • Baidu - 6.48%;
  • AOL - 1.11%;
  • Ask - 0.23%;
  • Excite - 0.00%


As of December 2016, search engine shares in the Runet:

  • Yandex - 48.40%
  • Google - 45.10%
  • Search.Mail.ru - 5.70%
  • Rambler - 0.40%
  • Bing - 0.30%
  • Yahoo - 0.10%

Search engine principles

In Russia, the main search engine is Yandex, then Google, and then Search@Mail.ru. All large search engines have their own structure, quite different from the others. Still, it is possible to single out the main elements common to all search engines.

Indexing module.

This component consists of three robots:

Spider - a program designed to download web pages. The spider downloads a specific page while extracting all the links from it at the same time. The HTML code is downloaded from almost every page. Robots use the HTTP protocol for this.




"Spider" functions as follows. The robot sends a request to the server “get/path/document” and other HTTP request commands. In response, the robot program receives a text stream that contains information of a service type and, of course, the document itself.
  • URL of the downloaded page;
  • the date the page was downloaded;
  • server http response header;
  • html code, "body" of the page.
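A minimal sketch of one such fetch, using Python's standard library (example.com stands in for a real address):

    # Sketch: one spider fetch and the record it keeps (placeholder URL).
    from datetime import datetime, timezone
    from urllib.request import urlopen

    url = "https://example.com/"
    with urlopen(url) as response:            # issues the HTTP GET request
        record = {
            "url": url,                                          # page address
            "downloaded": datetime.now(timezone.utc),            # download date
            "headers": dict(response.headers.items()),           # HTTP response header
            "html": response.read().decode("utf-8", "replace"),  # page "body"
        }

    print(record["headers"].get("Content-Type"))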
Crawler("traveling" spider). This program automatically goes to all the links that are found on the page, and also highlights them. Its task is to determine where the spider should go in the future, based on these links or based on a given list of addresses.

Indexer (indexing robot) - a program that analyzes the pages downloaded by the spiders.



The indexer fully parses a page into its constituent elements and analyzes them using its own morphological and lexical algorithms.

The analysis covers various parts of the page, such as headings, text, links, style and structural features, HTML tags, and so on.

Thus, the indexing module makes it possible to follow the links of a given set of resources, download pages, extract the links to new pages from the received documents, and analyze them in detail.

Database

Database (or search engine index) - a data storage complex, an array of information in which the parameters of every document downloaded and processed by the indexing module are stored in a certain way.

Search server

This is the most important element of the entire system, since the speed and, of course, the quality of search depend directly on the algorithms underlying its functionality.

The search server works like this (a simplified code sketch follows the list):

  • The query coming from the user is subjected to morphological analysis, and the information environment of every document available in the database is generated (it will later be displayed as a snippet, i.e. a fragment of text corresponding to the query).
  • The received data are passed as input parameters to a specialized ranking module. They are processed for all documents, and as a result a rating is computed for each document, characterizing its relevance to the user's query, among other components.
  • Depending on conditions set by the user, this rating may be adjusted by additional conditions.
  • Then the snippet itself is generated: for each found document, the title, the annotation that best matches the query, and a link to the document are extracted from the corresponding table, with the matched words and word forms highlighted.
  • The search results are returned to the user in the form of a search engine results page (SERP).
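A heavily simplified sketch of these steps (toy word-count scoring and a crude snippet; real ranking modules use far more signals than this):

    # Sketch: rank two hypothetical documents for a query and build crude snippets.
    docs = {
        1: ("Choosing a car", "advice on how to choose a car and not regret it"),
        2: ("Car radios", "how to choose a car radio and install it in a car"),
    }

    def score(query_words, text):
        # Toy rating: count how often the query words occur in the text.
        words = text.lower().split()
        return sum(words.count(w) for w in query_words)

    def snippet(query_words, text, width=40):
        # Crude snippet: a slice of text around the first matched query word.
        lower = text.lower()
        hits = [lower.find(w) for w in query_words if w in lower]
        start = max(0, min(hits, default=0) - width // 4)
        return text[start:start + width]

    query = "choose car".split()
    ranked = sorted(docs, key=lambda d: score(query, docs[d][1]), reverse=True)
    for doc_id in ranked:                  # the SERP: best-scoring result first
        title, text = docs[doc_id]
        print(title, "->", snippet(query, text))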
All these elements are closely interconnected and, interacting with one another, form a distinct but rather complicated mechanism by which the search engine operates, one that requires an enormous expenditure of resources.
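The whole flow above can be condensed into a toy sketch: normalize the query, rate each document (plain term frequency stands in for a real ranking module), and cut a snippet around the first match. The in-memory document store, the scoring formula, and the snippet logic are all illustrative assumptions:

    # A toy "search server" following the steps above: normalize the query,
    # rate every document, and cut a snippet around the first matching word.
    import re

    documents = {
        "https://example.com/a": "Search engines index web pages and rank them.",
        "https://example.com/b": "Web crawlers download pages for the index.",
    }

    def search(query):
        terms = re.findall(r"\w+", query.lower())        # simplified morphological analysis
        results = []
        for url, text in documents.items():
            words = re.findall(r"\w+", text.lower())
            score = sum(words.count(t) for t in terms)   # toy relevance rating
            if score == 0:
                continue
            positions = [text.lower().find(t) for t in terms if t in text.lower()]
            start = max(0, min(positions) - 20)
            snippet = text[start:min(positions) + 40]    # text around the first match
            results.append((score, url, snippet))
        return sorted(results, reverse=True)             # best-rated documents first (the SERP)

    for score, url, snippet in search("index pages"):
        print(score, url, "..." + snippet + "...")

A real search server would, of course, consult the inverted index instead of scanning every document, and would combine hundreds of ranking factors rather than one.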


If you are going to understand something, understand it thoroughly. And if you read our blog, you probably want to become a strong specialist or simply to know more about web search. To achieve that, tricks and life hacks are not enough; you need to broaden your horizons.

A search engine is a large and complex program designed to search for information on the Internet.

Have you ever wondered how the tools we use every day came to be, what search engines exist on the Internet, and why almost all studios work only with Yandex and Google? Don't put such questions on the back burner: just 10 minutes, and you have another topic of conversation you can easily hold up.

How Search Engines Came to Be

A long time ago, when the internet was young and green...

Users, of whom, it must be said, there were very few, got by with their own bookmarks. But this did not last long: soon it became difficult to navigate the variety that appeared on the network in a short time.

And in order to somehow streamline the chaos, Yahoo, DMOZ and other directories were invented (some still exist), whose editors added new sites and sorted them into categories. For a while, life got easier.

But the Internet continued to expand, and soon the directories grew to a mind-boggling size. Developers first thought about search within the directories, and only then about an automated system for indexing everything on the Internet to make life easier for users.

This is how the first search robots appeared.

What was the first search engine

The first search engine was Wandex (not to be confused with Yandex!). This and other early services were, of course, far from perfect. For a search query they returned something quite different from what we are used to seeing now: not the most relevant pages, but everything in a row, with no ranking at all. On January 1, 2012, Wandex was relaunched.

So the first search engine began its work. What search engines exist on the modern Internet? Here is a list.

What search engines are there: the kings of the dance floor

Surprisingly, there are those who argue about which search engine is the best. I would not do this, simply because they are all different, and in the end it depends on your goal and what kind of user you are.

Yandex

The most popular search engine in Russia. According to LiveInternet, Yandex's share is 50.9%, while Google accounts for 40.6% (data from June 2015).

There is a myth that Yandex receives many times more commercial queries than its nearest competitor. A couple of times I have come across the idea that, thanks to regional targeting honed over the years, the type or size of the audience may differ, and that this is the reason for Yandex's superiority in commercial queries. Don't believe it. It's a lie.

Google

Google is the most popular search engine everywhere except Russia :) It offers a great many features in different areas and is, in general, the undisputed world leader among search engines.

Google itself appeared at roughly the same time as Yandex, but came to Russia only in 2004, by which time Yandex had already strengthened its position.

"Googling" has become a household word for many earthlings. But when I tell my mom to "google it," she still goes to look for the information she needs in Yandex :) She has no idea what search engines exist on the Internet.

What search engines are there: a list of little-known ones

Most Internet users are not even aware of what search engines exist besides Yandex and Google. So here they are ;) Meet them!

Search.Mail.ru

The search share of this engine can hardly be called large, but the figures are slowly growing. Bear in mind, though, that these numbers depend directly on Odnoklassniki, Mail.ru mail and the other properties of the Mail.ru corporation.

Rambler

This is real old school. Just imagine: when this search engine appeared, some SEO specialists were just learning to walk. Rambler once had a chance to rule the roost in Runet, but for a number of reasons that did not happen. Today it is no longer quite a search engine but rather a set of services that uses the Yandex engine for search. Traffic, by the way, is quite decent: a little over a million users visit Rambler's main page each day.

Rambler also has a Rambler Lite version (everything the same, only without the weather, news, advertising, and so on) and XRambler, which combines 15 search engines at once.

Bing

How many names this search engine has changed! In 8 years it managed to bear the name MSN Search, then Windows Live Search, then shortened that to Live Search, and finally arrived at Bing. Many argue that its search quality is close to the standard set by Google.

Yahoo

It is now difficult to call Yahoo a search engine, since, under its agreement with Microsoft, all Yahoo-owned sites use the Bing search engine. The latest news about the agreement can be found at searchengines.

Webalta

Surely this so-called search engine is familiar to you. Have you ever had to pick it out of your browser like a tick? Everyone has long known about its shady practices. Alas, no one is interested in this search engine itself; users only look for articles on how to remove the rubbish from their computer.

Nigma

This search engine is very different from the rest. While an index base will not surprise anyone these days, the ability to solve chemistry and mathematics problems sets Nigma apart from other search engines. Nigma also offers search for music, books, games, and torrents.

Sputnik

Created by order of the Russian government, it is considered the world's first state search engine. It offers a separate medical search (pharmacies, medicines, and articles about diseases). A very handy feature is "Convenient Country", which gathers in one place all the recommendations that help a citizen; here, for example, is the "Documents" section.

DuckDuckGo

This search engine is very different from the others on this list. DuckDuckGo is an open-source search engine with an interesting policy of avoiding the "filter bubble". For those who don't know: a "filter bubble" is when a search engine shows only those results it considers appropriate for a particular user, while the user's own opinion interests no one. DuckDuckGo promises that by using its search you will get all the information the engine has.

DuckDuckGo is gaining momentum: this summer (2015) its creator reported three billion queries a year.

While writing this article I had a few questions. In such cases I don't rely on the search results alone; and why would I, when a person who knows everything about the Internet is sitting next to me? A mini-interview with Igor Ivanov.

Igor Ivanov

Head of SEMANTICA studio

If my site ranks well in Google and Yandex, will it be at the top of the results in other, smaller search engines?

There is a very high probability that it will. Yandex and Google are developing their algorithms in the right direction, and other search engines follow their example. There was even a case when Google noticed that Bing was copying not only its algorithms but its search results as well.

Why a probability and not complete certainty? Because other search engines do not always have time to adjust their ranking algorithms to the standard set by their more successful competitors.

Is it worth promoting a site in Sputnik, Mail.ru and other "domestic" search engines? Which search engine is better?
