Wednesday, July 24, 2013

Binary search aha! moment

Although I've used database table indexes for eons, and understand how and when to use them, I recently had a real computer science-y aha! moment that gave me a new appreciation for indexes.

I was re-reading Kochan's Programming in C, and wrote a little program to calculate binary search worst scenario step counts for collections of various sizes.  So, for instance, given a collection of N sorted linked list items, empirically calculate the maximum number of steps needed to search the list to find an arbitrarily selected item.

I was floored when I saw that binary searching can find an item in a collection as large as one million items in 21 steps or less!  I have a newfound appreciation for how much more desirable index seeks are in comparison to (sequential) table scans!

In computer science lingo, the worst case binary search has a time complexity of O(log n) which means the worst case scales up as the log of the collection size.  This is a good thing since the log of a number x increases much slower than x increases.  O(log n) time vs. O(n) time (a.k.a. linear time).  The output of the program (shown below) below clearly shows this to be the case.  

#include <stdio.h>
#include <math.h>

/*
Calculate binary search worst case step counts for a range of collection
sizes and compare to worst case step counts with log2(collection size).  
Do this for a wide range of node sizes.  It's remarkable that in only 21
steps (or fewer, obviously), any entry can be found in a collection of a
million sorted items.  Especially compared to a sequential search of the
entire collection.
*/

unsigned long int worst_case(unsigned long int collectionSize)
{
    /*
    This pretends to step through a binary search process that assumes
    the target item is the last item in the collection which requires
    the maximum number of steps.  Binary search starts with an upper and
    lower bound and sees if the target item is above or below the midpoint
    and adjusts the lower and upper bounds of the search's next step to
    be the half of the distribution containing the target item.  Halving
    of the next step's search area continues until the item is found or
    the item is determined not to be present in the collection.  Again, this
    does not actually do a binary search, it just counts the number of steps
    required to actually do one.
    */
    unsigned long int low=0;
    unsigned long int high=collectionSize-1;
    unsigned long int mid;
    unsigned long int steps=0;
    unsigned long int target=collectionSize-1; //Last number in virtual collection
    while(low<=high)
    {
        mid=(low+high)/2; //Midpoint of current range of focus
        steps++;
        if (mid==target)
            break;           //Simulates item found, so exit loop
        else if (target<mid) //Simulates target is in lower half of range
            high=mid;
        else //Target>mid  --  Simulates target is in upper half of range
            low=mid+1;
    }
    return steps;
}

int main()
{
    unsigned long int maxCollectionSize = pow(2,24);
    int i;
    unsigned long int collectionSize;
    float log2ofSize;
    unsigned long int wc;

    printf("Collection Size  Worse Case  Log2(Collection Size)\n");
    printf("---------------  ----------  ---------------------\n");

    for(i=0; pow(2,i)<=maxCollectionSize; i++)
    {
        collectionSize=pow(2,i);
        log2ofSize = log2(collectionSize);
        wc = worst_case(collectionSize);
        printf("%15i  %10i  %21.2f\n", collectionSize, wc, log2ofSize);
    }

    return 0;
}

/*
Output:

Collection Size  Worse Case  Log2(Collection Size)
---------------  ----------  ---------------------
              1           1                   0.00
              2           2                   1.00
              4           3                   2.00
              8           4                   3.00
             16           5                   4.00
             32           6                   5.00
             64           7                   6.00
            128           8                   7.00
            256           9                   8.00
            512          10                   9.00
           1024          11                  10.00
           2048          12                  11.00
           4096          13                  12.00
           8192          14                  13.00
          16384          15                  14.00
          32768          16                  15.00
          65536          17                  16.00
         131072          18                  17.00
         262144          19                  18.00
         524288          20                  19.00
        1048576          21                  20.00
        2097152          22                  21.00
        4194304          23                  22.00
        8388608          24                  23.00
       16777216          25                  24.00
*/

No comments:

Post a Comment