Sunday 26 March 2017

How Search Engines Work Using a 3 Step Process


All search engines use the same three-step process to manage, rank, and return search results.

Crawling

Imagine the World Wide Web as a network of stops in a big city subway system.
Each stop is a unique document (usually a web page, but sometimes a PDF, JPG, or other file). The search engines need a way to “crawl” the entire city and find all the stops along the way, so they use the best path available—links.


Links allow the search engines' automated robots, called "crawlers" or "spiders" (Googlebot, in Google's case), to reach the many billions of interconnected documents on the web.

Crawling is where it all begins – the acquisition of data about a website. This involves scanning the site and getting a complete list of everything on it – the page title, images, keywords it contains, and any other pages it links to – at a bare minimum. Modern crawlers may also cache a copy of the whole page and collect additional information such as the page layout, where the advertising units are, and where the links sit on the page.
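
To make that concrete, here is a small Python sketch of the idea – not how any real search engine's crawler is built, just an illustration. It assumes the requests and beautifulsoup4 packages are available, and the seed URL is a hypothetical placeholder.

    # Illustrative crawler sketch only, not production code.
    # Assumes the `requests` and `beautifulsoup4` packages are installed;
    # the seed URL is a made-up placeholder.
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup


    def crawl(seed_url, max_pages=10):
        """Fetch pages starting from seed_url, following links like a spider."""
        to_visit = [seed_url]
        seen = set()
        pages = {}

        while to_visit and len(pages) < max_pages:
            url = to_visit.pop(0)
            if url in seen:
                continue
            seen.add(url)

            response = requests.get(url, timeout=5)
            soup = BeautifulSoup(response.text, "html.parser")

            # Record the kind of data a crawler gathers: title, text, outgoing links.
            links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]
            pages[url] = {
                "title": soup.title.string if soup.title else "",
                "text": soup.get_text(" ", strip=True),
                "links": links,
            }
            to_visit.extend(links)

        return pages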


Indexing

You’d be forgiven for thinking this is an easy step – indexing is the process of taking all of the data you have from a crawl and placing it in a big database from where it can later be retrieved. Imagine trying to make a list of all the books you own, their authors and the number of pages. Going through each book is the crawl and writing the list is the index. But now imagine it’s not just a room full of books, but every library in the world. That’s pretty much a small-scale version of what Google does.
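
A toy sketch of that "list" is an inverted index: for every word, record which documents contain it, so lookups later are fast. The documents below are made up for illustration; a real index also stores positions, frequencies and far more.

    # A toy inverted index: map each word to the documents that contain it.
    from collections import defaultdict


    def build_index(documents):
        """documents: dict of {doc_id: text}. Returns {word: set(doc_ids)}."""
        index = defaultdict(set)
        for doc_id, text in documents.items():
            for word in text.lower().split():
                index[word].add(doc_id)
        return index


    docs = {
        "page1": "How search engines crawl the web",
        "page2": "Indexing places crawled data in a database",
    }
    index = build_index(docs)
    print(index["search"])   # {'page1'}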

Algorithm/Ranking

The last step is what you see – you type in a search query, and the search engine attempts to display the most relevant documents it finds that match your query.

The ranking algorithm checks your search query against billions of pages to determine how relevant each one is. The algorithm is a very complex and lengthy calculation that produces a value for any given page in relation to a search term.
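
At its crudest, ranking is just scoring and sorting. The sketch below scores each document by how many times the query words appear in it – a drastic simplification, since real engines weigh hundreds of signals, but it shows the shape of the step.

    # A drastically simplified ranking sketch: score each document by how many
    # times the query words appear in it, then sort by that score.
    def rank(query, documents):
        """documents: dict of {doc_id: text}. Returns doc_ids sorted by relevance."""
        query_words = query.lower().split()
        scores = {}
        for doc_id, text in documents.items():
            words = text.lower().split()
            scores[doc_id] = sum(words.count(w) for w in query_words)
        return sorted(scores, key=scores.get, reverse=True)


    docs = {
        "page1": "how search engines work",
        "page2": "a recipe for banana bread",
    }
    print(rank("search engines", docs))  # ['page1', 'page2']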

Exploiting the ranking algorithm has in fact been commonplace since search engines began, but in the last three years or so Google has made that much more difficult. Originally, sites were ranked based on how many times a particular keyword was mentioned. This led to “keyword stuffing”, where pages were filled with mostly nonsense as long as the keyword appeared everywhere.


Then the concept of importance based on linking was introduced – more popular sites naturally attract more links – but this led to a proliferation of spammed links all over the web. Now each link carries a different value, depending on the “authority” of the site in question. If a high-level government agency links to you, that link is worth far more than one found in a free-for-all “link directory”.
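
The sketch below is a toy version of link-based authority scoring, in the spirit of algorithms like PageRank (not Google's actual formula): each page passes a share of its own score to the pages it links to, so a link from a well-regarded page is worth more than one from a spammy directory. The link graph is entirely made up.

    # Toy link-authority scoring: repeatedly let each page share its score
    # with the pages it links to. The graph below is fictional.
    def rank_by_links(links, iterations=20, damping=0.85):
        """links: dict of {page: [pages it links to]}. Returns {page: score}."""
        pages = list(links)
        score = {p: 1.0 / len(pages) for p in pages}

        for _ in range(iterations):
            new_score = {p: (1 - damping) / len(pages) for p in pages}
            for page, outgoing in links.items():
                if not outgoing:
                    continue
                share = damping * score[page] / len(outgoing)
                for target in outgoing:
                    if target in new_score:
                        new_score[target] += share
            score = new_score
        return score


    graph = {
        "gov-agency": ["your-site"],
        "link-directory": ["your-site", "spam-site"],
        "your-site": ["gov-agency"],
        "spam-site": [],
    }
    print(rank_by_links(graph))  # "your-site" ends up with the highest score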
