Friday 8 March 2013

Basics of Web Crawling

In this post I'll discuss the concept of web crawling, what it means and how it's achieved. Every time you use a search engine such as Google, it performs a task called web crawling amongst other things.

The idea of web crawling is that a search engine wants a collection of as many web pages as possible for users to query. To do this search engines use a few web pages known as seed pages to search for even more web pages. Typical search engine will probably use multiple servers to crawl the web and accumulate web pages.

To start with the search engine will extract the HTML from it's seed pages and search the code of those pages for links to other web pages (you can picture this like a spider web). If a good seed page is used the crawler can go on almost forever and end up with a large variety of different websites. If the crawler started from a bad seed, it may only end up with a few links which probably all refer to different pages on its own site.

Additionally, it's important for a web crawler to keep track of the web pages it's already crawled so it doesn't waste resources getting stuck in a loop scanning pages it's already visited and spamming other servers with requests. To get around this the crawler will add pages it's already crawled to a list so it knows not to check them again.

There are some ethics involved in web crawling as you can probably imagine and websites typically contain a file telling web crawlers  how many times and how frequently a crawler can send requests to their servers (if they allow it at all).

So there you have it! A brief summary of web crawling.

Until next time,
 Duane

Tuesday 26 February 2013

Bitwise Multiplication

This is just a small way to perform multiplication in Java that is close to what your computer actually does. Since computers can only understand 0 (off) and 1 (on), computer scientists had to develop smart ways to perform calculations.

Computers can only add numbers without performing tricks, so for fun I tried to program methods for multiplying, dividing, modulus and calculating square root. The results were slow to run methods. By using bitwise operators I was able to perform the calculations significantly faster.

Let's take a look at this method for performing bitwise multiplication in Java (keeping it simple not using decimal numbers):

Now examining the code we'll start at the first if statement as the while statement is self explanatory. By writing "if ((b & 1) != 0)" the computer will convert 'b' to a binary number and check the first bit if it's 0 or 1 (all odd numbers will be 1). So everyone loop where 'b' is an odd number the if statement will return true and we'll add the current number in 'a' to our result.

Next we have 2 statements outside the loop "a <<= 1;" and "b>>= 1;". These two statements convert our numbers to binary and then shift them 1 place to the left or right. Essentially this will double the number on a left shift and halve the number on a right shift. It's important to note that if a number halved gives a decimal result such as 5 / 2 = 2.5, the computer will then round that number down to 2.

A good way to see how the program works is by printing out the variables involved each loop. Let's examine the state of the variables multiplying 5 * 5:

Result: 5
Variable A: 10
Variable B: 2

---------------Next Loop------------

Result: 5
Variable A: 20
Variable B: 1

---------------Next Loop------------

Result: 25
Variable A: 40
Variable B: 0

---------------Next Loop------------


5 is added to the result the first loop because we check binary 5 (1001) for a starting 1 bit. The variable 'a' is then doubled and 'b' halved (rounded down to 2).

Next loop the result stays the same because we check binary 2 (0010) for a starting 1 bit (it fails the if statement). Once again the other variables are doubled and halved. The 3rd loop 'b' is an odd number again '0001' in binary meanining we add 'a' to result again giving us the final result.

The loop now ends since 'b' has now hit 0.

We could have just looped adding 5 every loop and decrementing 'b' until it hits 0 but this would be very ineffective and with large calculations take a long time to find the answer. The method using bitwise operators will perform almost any size calculation very quickly since we're doubling the number with every loop.

Thanks for your time and feel free to comment. Duane