3

Could anyone explain why an array will always be faster than a hash object described in SAS Prog3 course note?

flag

2 Answers

3

If you have the memory for an array you don't need a hash object. The point of a hash table is to provide access to a particular keyed item almost as fast as you'd get with a simple array.

Consider a file with 10 million SSNs as keys. Now, there are 1 billion possible SSNs (minus those covered by special rules); but you probably don't have the memory for slots representing all 1 billion. If you did, you'd put the info in memory and access it as lastname(ssn) or address(ssn) or whatever. And that sort of access is easy, because calculating the memory location is simple: start + index*length.

But you do have memory for the 10 million in your database. A hash table trades computation time for compactness: it costs more to find an item in the hash table but--you can do it without investing in a new motherboard.

link|flag
both hash tables AND arrays are stored in memory and would require approximately the same amount of space. hash tables do not get written to disk unless you run out of memory. i'm not sure how investing in a new hardware applies here - can you please try to clarify – Robert Penridge Jan 15 at 23:57
An array that allows you to find data associated with any possible SSN will have a size of 1 billion * however many columns you wish to include, assuming you wish to use the SSN as the index for finding the rest of the data. That's true whether you have 1000 rows or 10 million or 1 billion, if you're going to access data like this: Phone_no = data_array(ssn,2); A hash table makes the lookup calculation a little more complicated but requires less space, because the index is not the SSN itself but a function of same. – RogerLustig Feb 10 at 22:29
2

My understanding is that there are additional overheads when working with a hash object. These can be broken down into 3 main areas:

  1. Creating the hash object.
  2. Determining where to store data in the hash object.
  3. Determining where to retrieve data from in the hash object.

The hash object takes its name from the hashing algorithm that is used to calculate where an observation will be stored in the hash table (ie. in memory). The hashing algorithm is extremely fast though and the result of applying the algorithm to a key value is the exact location in memory it is stored. This means that the hashing algorithm needs to be applied to a key value any time you want to store or retrieve information from the hash table.

When dealing with an array the hash algorithm is not required because you explicitly specify the index from the array that you want to store or retrieve information from. The location in memory can be almost instantly determined once the index is supplied. This assumes you already know the array index that contains the data you require.

Both hash tables and arrays are loaded into memory. If you don't have sufficient memory the performance of a hash table can severely suffer. (http://support.sas.com/kb/34/193.html) If you wanted to create hash tables in earlier versions of SAS prior to the existence of the hash object then you had to create it manually using arrays.

link|flag

Your Answer

Not the answer you're looking for? Browse other questions tagged or ask your own question.