When focusing on the performance of hexinspector, I realised that one of the performance bottlenecks was probably the loading single bytes individually and comparing them, instead of working a word at a time.
For performance reasons the C versions of memcmp in glibc performes based on word at a time, although usually the assembly based versions are used.
Complicating matters is the inability of some architectures to fetch words non-aligned with the platform word size. In glibc when both pointers are not word aligned, this is overcome by aligning one of the pointers to word aligned and shifting the other pointer data (see memcmp_not_common_alignment).
As hexinspector is currently/unfortunatly aimed at x86 architectures which supports fetching words non-aligned, it is not a high priority of mine to implement the non-aligned version.
By moving from byte based to word based (see commit id b0049cc3363fb8de88fd5f26debf9701633d2e87), the simple diff algorithm performance improved from 600mb/sec to 800mb/sec (when running on an 2 core Intel and the files are both in the cache).
Word-based finding two bytes the same
We realised that the while loop while the data is different could be implemented word-based instead of byte-based by using the Zero in Word technique discussed in Bit hacks by Sean Eron Anderson, which is a document well worth a read for any programmer.
The commit bb6c8298b95bd5650794c1790fc17e4751df358d implements this, basically as below
value = (*srcword) ^ (*dstptr);
if (((value - 0x01010101UL) & (~value) & 0x80808080UL) != 0) printf("Different");
Using this the performance of the finding two bytes the same increased on a dual core from 600mb/sec to 800mb/sec effectively the same as the compare performance.
One problem I noted, was that adding more CPU’s to the diff algorithm when the file was not in cache slowed it down. Question is, what’s the optimal way to access the same file using two or more cores?





