A search engine is a central element of the modern internet, and Google currently dominates the market globally.
Google usually discovers a website sooner or later, as long as it is publicly available: either a link to it appears on some other site, or the webmaster registers it with Google directly. Webmasters want traffic from Google, so they have a vested interest in telling Google about their sites. They typically do this by registering at Google Search Console (https://search.google.com/) and speed up crawling by submitting a sitemap.
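A sitemap is just an XML file listing the site's URLs, optionally with hints such as when each page last changed. A minimal example (the domain, paths and dates are hypothetical):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>daily</changefreq>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2023-11-02</lastmod>
  </url>
</urlset>
```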
As soon as Google learns about a website, it looks for its sitemap and crawls the site's contents page by page. In this way Google maintains a continuously updated copy of the entire publicly available web.
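To illustrate the first step, here is a minimal Python sketch of how a crawler might read a sitemap to get its list of pages to scan. The sitemap content is an inline hypothetical example; a real crawler would fetch it over HTTP.

```python
import xml.etree.ElementTree as ET

# XML namespace used by the sitemap protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def urls_from_sitemap(xml_text: str) -> list[str]:
    """Extract the <loc> value of every <url> entry in a sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.iter(SITEMAP_NS + "loc")]

# Hypothetical sitemap content.
sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/news</loc></url>
</urlset>"""

print(urls_from_sitemap(sitemap))
# ['https://example.com/', 'https://example.com/news']
```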
When a user types a phrase into the search bar, Google looks up the most relevant web pages in its index and shows the results.
What factors influence what search engines index?
- Language. For example, Baidu targets a Chinese audience, so its main purpose is to index content in Chinese.
- Document formats. Initially, search engines indexed only HTML documents, but later they started indexing PDF, MS Word, MS Excel and many other formats. Some formats are indexed better than others, so a webmaster who wants their site indexed well usually avoids being overenthusiastic about exotic formats.
- Page “depth”. In theory a search engine can reach any depth, but in practice crawlers are reluctant to dive deep into a website. When a search engine's robot visits a site, it has a limited budget of how many pages it will scan in one session; all other pages are postponed until the next visit. A webmaster should therefore organise the site sensibly: a page that can only be reached through a chain of ten links is highly unlikely to be indexed well.
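As a rough sketch of why depth matters, assuming a toy in-memory link graph, a breadth-first crawl with a fixed page budget reaches shallow pages first and may never get to deep ones:

```python
from collections import deque

def crawl(links: dict[str, list[str]], start: str, budget: int) -> list[str]:
    """Breadth-first crawl of a link graph, stopping after `budget` pages."""
    seen, order = {start}, []
    queue = deque([start])
    while queue and len(order) < budget:
        page = queue.popleft()
        order.append(page)          # "scan" the page
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

# Hypothetical site: "/deep" is three clicks away from the home page.
site = {
    "/": ["/news", "/about"],
    "/news": ["/news/1"],
    "/news/1": ["/deep"],
}
print(crawl(site, "/", budget=3))
# ['/', '/news', '/about']  -- the budget runs out before "/deep" is reached
```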
- Junk pages. Since the number of pages a search engine will scan is limited, don't waste that budget on pages with no search value: outdated news, legal information, contact pages and so on. From an SEO point of view this is all junk, so prevent such pages from being indexed via robots.txt.
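For example, a robots.txt like the following (the paths are hypothetical) asks crawlers to skip such pages and points them at the sitemap instead:

```
User-agent: *
Disallow: /legal/
Disallow: /contacts
Disallow: /news/archive/
Sitemap: https://example.com/sitemap.xml
```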
- Malware. If a search engine detects malware on a site, traffic immediately drops by half, if not to zero.
- Blocked crawling. Something may prevent search engines from scanning the site: crawling is forbidden in robots.txt, by the web server's settings, or by the hosting provider.
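Python's standard library can check whether a given robots.txt blocks a URL. A minimal sketch with a hypothetical rule set:

```python
from urllib import robotparser

# Hypothetical robots.txt content.
rules = """
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/index.html"))    # True
```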
Search engine algorithms are trade secrets; we can only guess what benefits SEO and what does not. The most sophisticated and skillful marketing specialists study Google's patents: Google patents some of the ideas that it later implements as working algorithms.
Nowadays artificial intelligence and neural networks analyse web pages; the networks learn to give users more relevant search results.
How often does a search engine's robot scan your website?
We now live in the era of real-time search, and search engines try to be as fast as possible. This applies particularly to websites that are updated daily: news sites, popular forums, social networks. If your site contains many pages and is updated regularly, it is likely to be crawled quickly.
Google Search Console also provides a tool for notifying Google about changes on your website.