Memories on Searching, Hashing and Sorting Algorithms

elevysi · 05 April 2017

Searching

Searching is the process of finding an item in a collection of items. We may search using a sequential search algorithm or a binary search algorithm.

The sequential search starts from the first element of the collection and compares each element against the search condition, moving on to the next element until a match is found or the collection is exhausted.
For a collection of size n, there will be at most n comparisons; the search hence runs with a time complexity of O(n).
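As a minimal sketch in Java (the method name sequentialSearch is illustrative, not from any library):

	//Returns the index of target in collection, or -1 when it is absent
	public static int sequentialSearch(int[] collection, int target){
		for(int i = 0; i < collection.length; i++){
			if(collection[i] == target) return i; //at most n comparisons
		}
		return -1;
	}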

The binary search assumes that the collection is ordered and applies a divide and conquer approach to search for the item. It starts off by dividing the collection in two and comparing the middle term with the item to look for; since the collection is sorted, it can conclude from the ordering that one half of the collection does not contain the term. It then divides the remaining half in two and compares the middle term again, and so on. This algorithm requires an extra step, one that pre-sorts the collection before starting the search operation.
A collection of size n is left with half of its items (i.e. n/2) after the first comparison; the second comparison leaves n/4 items in the collection, the third n/8 and so on. In general, the i-th comparison leaves n/2^i items in the collection.
The splitting of the collection in two eventually stops when the collection cannot be further divided, i.e. when there is only one element left. We therefore need to solve, for a collection of size n, for the number of steps i required to arrive at a collection of size 1:
n/2^i = 1
i = log2(n)

The time complexity of our search is hence O(log n).
Although the time complexity of the binary search is better than that of the sequential search, its implementation requires a sorted collection; the pre-sorting step has a cost of its own, which may not be worth paying for a small collection. It can hence be more advantageous to use the Binary Search only when the size n of the collection is significant.
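For illustration, a minimal iterative sketch of the binary search over an int array (assuming the array is already sorted in ascending order):

	//Assumes collection is sorted in ascending order; returns the index of target or -1
	public static int binarySearch(int[] collection, int target){
		int low = 0;
		int high = collection.length - 1;
		while(low <= high){
			int middle = (low + high) / 2; //compare the middle term with the item to look for
			if(collection[middle] == target) return middle;
			else if(collection[middle] < target) low = middle + 1; //discard the lower half
			else high = middle - 1; //discard the upper half
		}
		return -1; //the collection could not be divided any further
	}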
Moreover, there is another approach that makes searching a collection even faster: the use of hashing algorithms. Here the search time can even be constant, i.e. O(1). A hash is a function that takes an input and returns an output by applying an algorithm to the input. We wish to produce a hash table, a collection that stores items in such a way that they can easily be found later. To this effect, items are stored at the index given by their hash value in the hash table.

For a collection of size n, we may use the remainder method as the hashing function: the remainder with respect to n (%n) can produce n different values, one for each slot of the table. An element with value x is thus stored at position k = hash(x) = x % n, so that searching for x reduces to the lookup S(x) = Collection[k]. In other words, searching for element x in the collection can be executed as follows:

  • compute the hash k of the value to search for, x (i.e. k = hash(x))
  • return whether the value at position k of the collection is the value searched for (i.e. Collection[k] == x)
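Under the (strong) assumption of a collision-free remainder hash, this lookup could be sketched as follows (method names are illustrative):

	//Remainder method: the hash of x for a table of size n
	public static int hash(int x, int n){
		return x % n;
	}

	//Returns whether x sits at its hash position, i.e. Collection[k] == x (assumes no collisions)
	public static boolean containsByHash(int[] hashTable, int x){
		int k = hash(x, hashTable.length); //compute the hash k of the value to search for
		return hashTable[k] == x;
	}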

However, while designing a hash function, one might consider the following issues:

  • Collisions: when two or more elements hash to the same value. In our remainder hashing example, some elements may produce the same hash value. Say we have 14 and 21 in a table of size 7: hash(14) = 14 % 7 = 0 and hash(21) = 21 % 7 = 0; both 14 and 21 map to the same hash value, hence producing a collision;
  • We do not always know beforehand the number of items to store

While those problems can be solved by a perfect hash function, one that maps each element to a unique slot, such a function may require assumptions such as a known collection size. The load factor is the filling rate of the hash table; a hash table of size 7 holding 3 elements has a load factor of 3/7.
The basic remainder method can be extended by known methods such as the folding method: divide the item into groups of two digits and sum up the resulting pairs to obtain the value to which the remainder function is applied. For instance, with a table of size 7, the item 010230 can have its hash value computed as follows:

01 + 02 + 30 = 33, to which we apply the remainder function: Hash = 33 % 7 = 5
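A possible sketch of this folding method, assuming the item is handled as a string of digits:

	//Folding method: split the digits into pairs, sum the pairs, then take the remainder
	public static int foldingHash(String item, int tableSize){
		int sum = 0;
		for(int i = 0; i < item.length(); i += 2){
			sum += Integer.parseInt(item.substring(i, Math.min(i + 2, item.length())));
		}
		return sum % tableSize; //e.g. foldingHash("010230", 7) = 33 % 7 = 5
	}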

There are known ways of avoiding collisions through collision resolution, for example by open addressing: if an element's hash value position is already taken up in the hash table, one traverses the table incrementally until a free slot is found. This is referred to as linear probing. One drawback, however, is clustering, whereby several identical hash values cause the slots surrounding a given slot to be taken up by collision resolutions. The risk of clustering can be mitigated by distributing the collision resolutions more evenly over the table. The skip value is the number of slots we skip while looking for the next free slot; increasing it to more than 1 can reduce the clustering effect. Instead of taking up the first free slot, one uses a skip value of x slots, with x > 1 and x a prime number. The process of looking for the next slot is called rehashing.

Rehashing(previousHashValue) = (previousHashValue + skip) % n

All stored items must eventually be found (if present in the collection), even when their hash does not compute to the position they are actually stored at in the hash table. To search, we first look at the hash position and, if the item is not there, traverse the hash table by incrementing the position by the skip value until the item is found or an empty slot is reached. It is also advised to choose a hash table size that is a prime number, in order to mitigate the risk of collisions that might arise from a badly designed hash function.
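A minimal sketch of open addressing with a configurable skip value; it uses 0 as the empty-slot marker, and therefore assumes 0 is never stored and that the table never fills up (both method names are illustrative):

	//Insert x by probing every `skip` slots until a free slot (or x itself) is found
	public static void put(int[] hashTable, int x, int skip){
		int n = hashTable.length;
		int k = x % n;
		while(hashTable[k] != 0 && hashTable[k] != x){
			k = (k + skip) % n; //rehashing: move to the next candidate slot
		}
		hashTable[k] = x;
	}

	//Search for x by following the same probe sequence used on insertion
	public static boolean probeSearch(int[] hashTable, int x, int skip){
		int n = hashTable.length;
		int k = x % n;
		while(hashTable[k] != 0){
			if(hashTable[k] == x) return true;
			k = (k + skip) % n;
		}
		return false; //an empty slot means x was never stored
	}

Note that the probe sequence only visits every slot when the skip value and the table size n share no common factor, which is yet another argument for a prime table size.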
Suppose that the hash function produces values such as {x, 2x, 3x, 4x, ...}, to which we have to apply the remainder function to assign their index in the hash table. Applying the remainder function to this set of values is likely to produce more collisions if the table size n is not prime. This is due to the fact that these hashed values will have their remainders with respect to n fall into m buckets (a bucket being the slot in which all items with the same remainder with respect to n fall), where:
m = n / GCF(x, n)
Here GCF(x, n) is the Greatest Common Factor of x (the common factor of the hash values) and n (the size of the hash table). To reduce collisions, we wish to have the maximum number of buckets, hence the lowest possible GCF(x, n), which in the most convenient case is 1; this justifies the use of a prime size n (since the greatest common factor with a prime n is 1 whenever x is not a multiple of n). Similarly, if x happens to be a multiple of the skip value used in schemes like linear probing and its variations, the choice of a prime number as the skip value will eventually help in the overall reduction of collisions.
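For instance, with hash values {4, 8, 12, 16, ...} (x = 4) and a table of size n = 12, GCF(4, 12) = 4, so the remainders fall into only m = 12 / 4 = 3 buckets ({0, 4, 8}); with a prime size n = 11 instead, GCF(4, 11) = 1 and the remainders spread over all m = 11 slots.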
The case above assumes a hash function that produces evenly distributed hash values, which is not always true. Nevertheless, in most cases we do not know beforehand what hash values we will be receiving, so the most robust approach is to choose a prime hash table size in case we get hash values similar to the ones presented above.

Sorting

Sorting is the process of arranging the items of a collection in either ascending or descending order.
The first and most straightforward approach one can use is the Bubble Sort; it starts at the first element and compares it to all the other items in the collection.
For a given collection of n elements, compare each item x to the n-1 other elements; each time one of them, say y, satisfies the sort condition (is greater or smaller than x, depending on the order), exchange the positions of x and y in the collection. The algorithm runs in O(n^2) time, and the repeated exchange of positions can be pretty expensive in terms of processing power.

You might improve the performance of this algorithm by using the Insertion Sort; this scheme compares a collection item to its predecessors only (or the items on its left).
The first element of the collection, the leftmost one, needs no comparison iterations over the other collection items; the second element is compared to the first, the third to the second and then the first, and so on.
Each collection item is thus iteratively compared to its predecessors, the comparisons moving in one direction only, from the item's current position towards the preceding positions (i.e. for an item at position k, compare it first to the element at position k-1, then k-2, ...) until one of the following conditions is satisfied:

  • We reach the first position of the collection, in which case the current value is the smallest/biggest of the collection so far
  • We reach the first element that satisfies the sort condition (i.e. if we are sorting in ascending order, we stop the comparison iterations when we reach the first item smaller than the value being compared, and conversely for descending order). In this case, the element has reached its temporary position for this round of comparisons. For instance, if we are sorting in ascending order, for the element x at position k, we compare it to the elements occupying the k-1 preceding positions of the already sorted subarray. As soon as we encounter an element y that is smaller than x, we can be sure that all the elements before y are smaller still, since we have already sorted them (the comparisons started from the leftmost item).
    We then move to the next element in the collection (incrementing the position of the current value to compare, from left to right), which we again compare only to the items before it (decrementing the comparison position pointer, from right to left).

As we traverse, each element that does not satisfy the sort condition is shifted from its position k to position k+1; once we encounter the first element y that does satisfy it, x takes up the freed position right after y. This shifting is cheaper in processing power than the exchange of positions described for the bubble sort. However, since each element is potentially compared to the other n-1 elements, the algorithm also runs with a time complexity of O(n^2), although it has a better best case than the Bubble Sort: on an already sorted collection, each item is compared only to the item that immediately precedes it.
For an array of n elements, while the last collection item might need n-1 comparisons (if it is the smallest/biggest element of the collection), the element at position k needs at most k-1 comparisons; items at early positions require fewer iterations.

Another improvement on the sorting algorithms presented above is one that uses a Divide and Conquer approach. As the name indicates, the strategy consists of two steps:

  • Dividing the bigger problem into smaller problems of the same nature
  • Conquering or merging the results of the sub-problems together

For sorting a collection, one can use the Merge Sort algorithm, which takes advantage of the paradigm presented above. For a collection of size n, we divide the collection into two arrays of size n/2 each. We then solve the sorting problem for each sub-collection and later merge the results of the two sub-problems, comparing the two sorted sub-collections as described below:

  • We keep three different counters: one for each of the two sub-collections and one for the containing final collection
  • Step 1: Starting from the first positions of the two sub-collections, compare the elements their counters point to; the sub-collection holding the smaller/bigger element gives its element to the final sorted collection at the position indicated by the final collection counter, and then increments its own counter.
  • Step 2: The next round of comparisons takes place between the element at position k+1 of the sub-collection that just gave its element and the item at position k of the other sub-collection. Repeat Step 1 and go on to the next iterations.
  • The stop condition, or base case, is that one of the sub-collections becomes empty because all of its elements have been moved to the final collection; in this case, the items of the remaining sub-collection are appended, already sorted, to the final collection after the last position occupied by the other collection's elements

The solution described above makes use of recursion in dividing and merge-sorting the collection. The initial collection is divided in two; the halves are themselves divided in two each, and so on until we reach sub-collections that cannot be further divided, i.e. collections with 1 item (the dividing operation only proceeds while the sub-collection length is greater than one). Each sub-collection resulting from the recursion is merged, as described above, with its counterpart, and the resulting collection is in turn merged with the counterpart collection of the preceding recursion step.
The Merge Sort algorithm takes up two different processes:

  • The split of the collection, which produces O(log n) levels of division, as described for the Binary Search algorithm
  • The merge of the n items of the collection back into the original collection; a collection of size n requires n operations, i.e. each item is put back into the collection once, hence inducing n operations per level

Each of the O(log n) split levels requires n merge operations, hence the overall time complexity of the Merge Sort algorithm is O(n log n).

This naturally introduces us to the concept of recursion. All recursive algorithms must fulfill the following conditions:

  • A recursive algorithm must have a base case
  • A recursive algorithm must change its state and move towards the base case
  • A recursive algorithm must call itself, recursively

Some popular examples of computational problems to which recursion can be applied are the Fibonacci and factorial computations. The following snippet shows basic Java implementations of an integer's factorial computation, of finding which integer occupies a given position in the Fibonacci sequence, and of the bubble sort, insertion sort and merge sort algorithms.

   /**
    * In mathematics, the factorial of a non-negative integer n, denoted by n!, is the product of all positive integers less than or equal to n
    * E.g. 5! = 5 * 4 * 3 * 2 * 1
    * @param n The number we wish to compute the factorial for
    * @return The factorial value
    */
    public static int factorialComputation(int n){
		 if(n == 0) return 1; //base case: 0! = 1
		 return n * factorialComputation(n - 1); //recursive step towards the base case
	 }
	
   /**
    * 
    * In mathematics, the Fibonacci numbers are the numbers in the following integer sequence, called the Fibonacci sequence
	* and characterized by the fact that every number after the first two is the sum of the two preceding ones
    * 0 , 1 , 1 , 2 , 3 , 5 , 8 , 13 , 21 , 34 , 55 , 89 , 144 , …
    * @param n --> the position of the number we are seeking for
    * @return the number that occupies the position n in the Fibonacci Sequence 
    */
	public int fibonacciElement(int n){
		if(n == 0 ) return 0;
		else if(n == 1) return 1;
		else return fibonacciElement(n -1) + fibonacciElement(n -2); //element at nth position is the sum of the 2 preceding it
	}
	
	/**
	 * 
	 * @param collectionToSort The unsorted collection that we wish to sort
	 * @return The sorted collection taken as the input
	 */
	public int[] sortAndMerge(int[] collectionToSort){
		
		int length = collectionToSort.length;
		if(length > 1){
			
			//We want first to divide into two
			int middle = length / 2; // integer division truncates, e.g. 7 / 2 = 3
			int[] leftHalf = new int[middle];
			int[] rightHalf = new int[length - middle];
			
			//System.arraycopy(Object[] src, int srcStartIndex, Object[] dest, int dstStartIndex, int lengthOfCopiedIndices);
			
			System.arraycopy(collectionToSort, 0, leftHalf, 0, middle);
			System.arraycopy(collectionToSort, middle, rightHalf, 0, rightHalf.length);
			
			//Recursively sort each of the two halves
			sortAndMerge(leftHalf);
			sortAndMerge(rightHalf);
			
			int i = 0; //index for the collectionToSort
			int j = 0; //index for leftHalf
			int k = 0; //index for rightHalf
			
			//This is the merge step
			//The loop executes until one of the halves has emptied all its elements, at which
			//point the remaining elements of the other half are pushed into collectionToSort
			while(j < leftHalf.length && k < rightHalf.length){
				if(leftHalf[j] <= rightHalf[k]){ //taking from the left on ties keeps the loop from stalling on equal values
					collectionToSort[i] = leftHalf[j];
					j++;
				}else{
					collectionToSort[i] = rightHalf[k];
					k++;
				}
				
				i++;
			}
			
			//Handle whichever half is left with elements
			if(leftHalf.length != j){
				System.arraycopy(leftHalf, j, collectionToSort, i, leftHalf.length - j);
			}else if(rightHalf.length != k){
				System.arraycopy(rightHalf, k, collectionToSort, i, rightHalf.length - k);
			}
		}
		
		return collectionToSort;
	}
	
	//Ascending order
	public int[] bubbleSort(int[] collection){
		int tmp;
		for(int outer = 0; outer < collection.length; outer++){
			for(int inner  = 0; inner < collection.length; inner ++){
				if(collection[outer] < collection[inner]){ //exchange the pair whenever it is out of order
					tmp = collection[inner];
					collection[inner] = collection[outer];
					collection[outer] = tmp;
				}
			}
			
		}
		
		return collection;
	}
	
	//Ascending order
	public int[] insertSort(int[] data) {
		int length = data.length;
		
		for(int i = 1; i < length; i++) {
			int value = data[i]; //the element to insert into the already sorted left part
			int k = i - 1;
			while(k >=0 && value < data[k]) {
				//Shift element at k to k+1
				data[k+1] = data[k];
				k--;
			}
			data[k+1] = value;
		}
		return data;
	}

 
