A pure xor
solution will not work as there is information lost during the process (this problem is likely to exist in other forms of lossy calculation as well, such as hashing). The information lost in this case is the actual characters being used for comparison.
By way of example, consider the two strings ae
and bf
(in ASCII):
a: 0110 0001 b: 0110 0010
e: 0110 0101 f: 0110 0110
---- ---- ---- ----
xor: 0000 0100 0000 0100
You can see that the result of the xor
is identical for both string despite the fact they are totally different.
This may become even more obvious once you realise that any value xor
-ed with itself is zero, meaning that all strings like aa
, bb
, cc
, xx
, and so on, would be considered anagrams under your scheme.
So, now you've established that method as unsuitable, there are a couple of options that spring to mind.
The first is to simply sort both strings and compare them. Once sorted, they will be identical on a character-by-character basis. This will work but it's unlikely to deliver your requested O(n)
time complexity since you'll almost certainly be using a comparison style sort.
The second still allows you to meet that requirement by using the usual "trick" of trading space for time. You simply set up a count of each character (all initially zero) then, for each character in the first string, increase its count.
After that, for each character in the second string, decrease its count.
That's linear time complexity and the strings can be deemed to be anagrams if every character count is set to zero after the process. Any non-zero count will only be there if a character occurred more times in one string than the other.
This is effectively a counting sort, a non-comparison sort meaning it's not subject to the normal minimum O(n log n)
time complexity for those sorts.
The pseudo-code for such a beast would be:
def isAnagram(str1, str2):
if len(str1) != len(str2): # Can also handle different lengths.
return false
dim count[0..255] = {0} # Init all counts to zero.
for each code in str1: # Increase for each char in string 1.
count[code]++
for each code in str2: # Decrease for each char in string 2.
count[code]--
for each code in 0..255:
if count[code] != 0: # Any non-zero means non-anagram.
return false
return true # All zero means anagram.
Here, by the way, is a complete C test program which illustrates this concept, able to handle 8-bit character widths though more widths can be added with a simple change to the #if
section:
#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <stdbool.h>
#if CHAR_BIT == 8
#define ARRSZ 256
#else
#error Need to adjust for unexpected CHAR_BIT.
#endif
static bool isAnagram(unsigned char *str1, unsigned char *str2) {
// Ensure strings are same size.
size_t len = strlen(str1);
if (len != strlen(str2))
return false;
// Initialise all counts to zero.
int count[ARRSZ];
for (size_t i = 0; i < sizeof(count) / sizeof(*count); ++i)
count[i] = 0;
// Increment for string 1, decrement for string 2.
for (size_t i = 0; i < len; ++i) {
count[str1[i]]++;
count[str2[i]]--;
}
// Any count non-zero means non-anagram.
for (size_t i = 0; i < sizeof(count) / sizeof(*count); ++i)
if (count[i] != 0)
return false;
// All counts zero means anagram.
return true;
}
int main(int argc, char *argv[]) {
if ((argc - 1) % 2 != 0) {
puts("Usage: check_anagrams [<string1> <string2>] ...");
return 1;
}
for (size_t i = 1; i < argc; i += 2) {
printf("%s: '%s' '%s'
",
isAnagram(argv[i], argv[i + 1]) ? "Yes" : " No",
argv[i], argv[i + 1]);
}
return 0;
}
Running this on some suitable test data shows it in action:
pax$ ./check_anagrams ' paxdiablo ' 'a plaid box' paxdiablo PaxDiablo
one two aa bb aa aa '' '' paxdiablo pax.diablo
Yes: ' paxdiablo ' 'a plaid box'
No: 'paxdiablo' 'PaxDiablo'
No: 'one' 'two'
No: 'aa' 'bb'
Yes: 'aa' 'aa'
Yes: '' ''
No: 'paxdiablo' 'pax.diablo'