Friday, 8 March 2013

Basics of Web Crawling

In this post I'll discuss the concept of web crawling: what it means and how it's achieved. Whenever you use a search engine such as Google, the results you see are built on a task called web crawling, amongst other things.

The idea of web crawling is that a search engine wants a collection of as many web pages as possible for users to query. To do this, search engines start from a small set of web pages known as seed pages and use them to discover even more web pages. A typical search engine will use multiple servers working in parallel to crawl the web and accumulate pages.

To start with, the search engine will extract the HTML from its seed pages and search the code of those pages for links to other web pages (you can picture this like a spider web). If a good seed page is used, the crawler can go on almost forever and end up with a large variety of different websites. If it starts from a bad seed, it may only find a handful of links, most of which point back to other pages on the same site. A rough sketch of this link-extraction step is shown below.
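To make this concrete, here's a minimal sketch of pulling links out of a page's HTML using only Python's standard library. The seed URL is just a placeholder, and real crawlers handle far more edge cases than this.

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag it encounters."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def get_links(url):
    """Download a page and return the links found in its HTML."""
    html = urlopen(url).read().decode("utf-8", errors="ignore")
    parser = LinkExtractor(url)
    parser.feed(html)
    return parser.links


print(get_links("https://example.com"))  # placeholder seed page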

Additionally, it's important for a web crawler to keep track of the web pages it has already crawled, so it doesn't waste resources getting stuck in a loop, re-scanning pages it has already visited and spamming other servers with requests. To get around this, the crawler adds each page it crawls to a list of visited pages so it knows not to check them again.
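Here's a rough sketch of that crawl loop, reusing the get_links() helper from the previous snippet. I've kept the "already crawled" record as a set rather than a plain list, since checking whether a page has been seen is much cheaper that way; the overall idea is the same.

from collections import deque


def crawl(seed_url, max_pages=50):
    frontier = deque([seed_url])   # pages waiting to be crawled
    visited = set()                # pages we have already crawled

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue               # skip pages we've seen, avoiding loops
        visited.add(url)
        try:
            for link in get_links(url):
                if link not in visited:
                    frontier.append(link)
        except OSError:
            pass                   # unreachable page; move on
    return visited


pages = crawl("https://example.com")  # placeholder seed page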

There are some ethics involved in web crawling, as you can probably imagine. Websites typically publish a file called robots.txt telling web crawlers which pages they may request, and how frequently they can send requests to their servers (if they allow crawling at all).
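Python's standard library can read that file for you. Below is a small sketch using urllib.robotparser; the crawler name and URLs are placeholders.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

agent = "MyCrawler"  # hypothetical crawler name
if rp.can_fetch(agent, "https://example.com/some/page"):
    delay = rp.crawl_delay(agent)  # seconds between requests, if the site specifies one
    print("Allowed to fetch; crawl delay:", delay)
else:
    print("robots.txt disallows this page for", agent)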

So there you have it! A brief summary of web crawling.

Until next time,
 Duane