c - Resolving equal XOR values for different strings for anagram detection

Question

Welcome To Ask or Share your Answers For Others

c - Resolving equal XOR values for different strings for anagram detection

asked Jan 31, 2022 in Technique[技术] by 深蓝 (71.8m points)

I recently had an interview question where I had to write a function that takes two strings, and it would return 1 if they are anagrams of each other or else return 0. To simplify things, both strings are of the same length, non-empty, and only contain lower-case alphabetical and numeric characters.

What I implemented a function that accumulates the XOR value of each character of each string independently then compared the final XOR values of each string to see if they are equal. If they are, I would return 1, else return 0.

My function:

int isAnagram(char* str1, char* str2){
    int xor_acc_1 = 0;
    int xor_acc_2 = 0;
    for(int i = 0; i<strlen(str1); i++){
        xor_acc_1 ^= str1[i] - '0';
        xor_acc_2 ^= str2[i] - '0';
    }
    return xor_acc_1 == xor_acc_2;
}

My function worked for every case except for one test case.

char* str1 = "123";
char* str2 = "303";

To my surprise, even though these two strings are not anagrams of each other, they both returned 48 as their XOR value.

My question is: Can this be resolve still with XOR in linear time, without the usage of a data structure e.g. a Map, through modification on the mathematics behind XOR?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

133 views

1 Answer

深蓝 · Answer 1 · 2022-01-31T07:13:43+0000

A pure xor solution will not work as there is information lost during the process (this problem is likely to exist in other forms of lossy calculation as well, such as hashing). The information lost in this case is the actual characters being used for comparison.

By way of example, consider the two strings ae and bf (in ASCII):

  a: 0110 0001    b: 0110 0010
  e: 0110 0101    f: 0110 0110
     ---- ----       ---- ----
xor: 0000 0100       0000 0100

You can see that the result of the xor is identical for both string despite the fact they are totally different.

This may become even more obvious once you realise that any value xor-ed with itself is zero, meaning that all strings like aa, bb, cc, xx, and so on, would be considered anagrams under your scheme.

So, now you've established that method as unsuitable, there are a couple of options that spring to mind.

The first is to simply sort both strings and compare them. Once sorted, they will be identical on a character-by-character basis. This will work but it's unlikely to deliver your requested O(n) time complexity since you'll almost certainly be using a comparison style sort.

The second still allows you to meet that requirement by using the usual "trick" of trading space for time. You simply set up a count of each character (all initially zero) then, for each character in the first string, increase its count.

After that, for each character in the second string, decrease its count.

That's linear time complexity and the strings can be deemed to be anagrams if every character count is set to zero after the process. Any non-zero count will only be there if a character occurred more times in one string than the other.

This is effectively a counting sort, a non-comparison sort meaning it's not subject to the normal minimum O(n log n) time complexity for those sorts.

The pseudo-code for such a beast would be:

def isAnagram(str1, str2):
    if len(str1) != len(str2):    # Can also handle different lengths.
        return false

    dim count[0..255] = {0}       # Init all counts to zero.

    for each code in str1:        # Increase for each char in string 1.
        count[code]++

    for each code in str2:        # Decrease for each char in string 2.
        count[code]--

    for each code in 0..255:
        if count[code] != 0:      # Any non-zero means non-anagram.
            return false    

    return true                   # All zero means anagram.

Here, by the way, is a complete C test program which illustrates this concept, able to handle 8-bit character widths though more widths can be added with a simple change to the #if section:

#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <stdbool.h>

#if CHAR_BIT == 8
    #define ARRSZ 256
#else
    #error Need to adjust for unexpected CHAR_BIT.
#endif

static bool isAnagram(unsigned char *str1, unsigned char *str2) {
    // Ensure strings are same size.

    size_t len = strlen(str1);
    if (len != strlen(str2))
        return false;

    // Initialise all counts to zero.

    int count[ARRSZ];
    for (size_t i = 0; i < sizeof(count) / sizeof(*count); ++i)
        count[i] = 0;

    // Increment for string 1, decrement for string 2.

    for (size_t i = 0; i < len; ++i) {
        count[str1[i]]++;
        count[str2[i]]--;
    }

    // Any count non-zero means non-anagram.

    for (size_t i = 0; i < sizeof(count) / sizeof(*count); ++i)
        if (count[i] != 0)
            return false;

    // All counts zero means anagram.

    return true;
}

int main(int argc, char *argv[]) {
    if ((argc - 1) % 2 != 0) {
        puts("Usage: check_anagrams [<string1> <string2>] ...");
        return 1;
    }

    for (size_t i = 1; i < argc; i += 2) {
        printf("%s: '%s' '%s'
",
            isAnagram(argv[i], argv[i + 1]) ? "Yes" : " No",
            argv[i], argv[i + 1]);
    }

    return 0;
}

Running this on some suitable test data shows it in action:

pax$ ./check_anagrams ' paxdiablo ' 'a plaid box' paxdiablo PaxDiablo 
         one two aa bb aa aa '' '' paxdiablo pax.diablo

Yes: ' paxdiablo ' 'a plaid box'
 No: 'paxdiablo' 'PaxDiablo'
 No: 'one' 'two'
 No: 'aa' 'bb'
Yes: 'aa' 'aa'
Yes: '' ''
 No: 'paxdiablo' 'pax.diablo'

Categories

c - Resolving equal XOR values for different strings for anagram detection

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags