regex - Seeking Unicode-savvy function for searching text in binary data

asked Jan 31, 2022 in Technique by 深蓝 (71.8m points)

I need to find Unicode text inside binary data (files).

I'm seeking any C or C++ code or library that I can use on macOS. Since this would presumably be useful on other platforms as well, I'd rather not make this question specific to macOS.

On macOS, the NSString functions would meet my Unicode needs, but they can't be used here because they do not work on binary data.

As an alternative, I've tried the POSIX-compliant regex functions provided on macOS, but they have some limitations:

They are not normalization-savvy, i.e. if I search for a precomposed (NFC) character, it won't find the character if it occurs in decomposed (NFD) form in the target data.
Case-insensitive search does not work for precomposed (NFC) Latin characters (searching for Ü does not find ü).

Example code showing these results is below.

What code or library is out there that fulfills these needs?

I do not need regex capabilities, but if there's a regex lib that can handle these requirements, I'm fine with that, too.

Basically, I need Unicode text search with these options:

case-insensitive
normalization-insensitive
diacritics-insensitive
works on arbitrary binary data, finding matching UTF-8 text fragments
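
To make the diacritics and binary-data requirements more concrete, here is a minimal sketch of one possible building block. The function name strip_combining_marks is hypothetical, and the sketch assumes the text is already in decomposed (NFD) form and only handles the Combining Diacritical Marks block (U+0300 to U+036F). It works on explicit lengths rather than NUL-terminated strings, so embedded zero bytes are preserved. Precomposed characters such as ä (C3 A4) pass through unchanged, so this alone does not cover normalization.

```c
#include <assert.h>
#include <stddef.h>

/* Copies in[0..in_len) to out, dropping the UTF-8 encodings of
   U+0300..U+036F (Combining Diacritical Marks).
   out must have room for at least in_len bytes.
   Returns the number of bytes written to out. */
size_t strip_combining_marks(const unsigned char *in, size_t in_len,
                             unsigned char *out) {
    assert(in != NULL && out != NULL);
    size_t n = 0;
    for (size_t i = 0; i < in_len; i++) {
        if (i + 1 < in_len) {
            unsigned char b0 = in[i], b1 = in[i + 1];
            /* CC 80..CC BF encodes U+0300..U+033F,
               CD 80..CD AF encodes U+0340..U+036F */
            int is_mark = (b0 == 0xCC && b1 >= 0x80 && b1 <= 0xBF)
                       || (b0 == 0xCD && b1 >= 0x80 && b1 <= 0xAF);
            if (is_mark) {
                i++;        /* skip the second byte of the mark as well */
                continue;
            }
        }
        out[n++] = in[i];   /* everything else is copied verbatim */
    }
    return n;
}
```

Running both the search text and the data through such a filter before comparing would make a decomposed ä match a plain a. In truly arbitrary binary data the same two-byte patterns can also occur by chance, so this is only an approximation.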

Here's the test code showing the results from using the TRE regex implementation on macOS:

#include <stdio.h>
#include <regex.h>

void findIn (const char *what, const char *data, int whatPre, int dataPre) {
    regex_t re;
    // REG_LITERAL (a non-POSIX extension) treats the pattern as a plain string
    regcomp (&re, what, REG_ICASE | REG_LITERAL);
    int found = regexec(&re, data, 0, NULL, 0) == 0;
    regfree (&re);  // release the compiled pattern
    printf ("Found %s (%s) in %s (%s): %s\n", what, whatPre?"pre":"dec", data, dataPre?"pre":"dec", found?"yes":"no");
}

void findInBoth (const char *what, int whatPre) {
    char dataPre[] = { '<', 0xC3, 0xA4, '>', 0};        // precomposed
    char dataDec[] = { '<', 0x61, 0xCC, 0x88, '>', 0};  // decomposed
    findIn (what, dataPre, whatPre, 1);
    findIn (what, dataDec, whatPre, 0);
}

int main(int argc, const char * argv[]) {
    char a_pre[] = { 0xC3, 0xA4, 0};        // precomposed ä (U+00E4)
    char a_dec[] = { 0x61, 0xCC, 0x88, 0};  // decomposed ä (a + U+0308)
    char A_pre[] = { 0xC3, 0x84, 0};        // precomposed Ä (U+00C4)
    char A_dec[] = { 0x41, 0xCC, 0x88, 0};  // decomposed Ä (A + U+0308)

    findInBoth (a_pre, 1);
    findInBoth (a_dec, 0);
    findInBoth (A_pre, 1);
    findInBoth (A_dec, 0);

    return 0;
}

Output is:

Found ä (pre) in <ä> (pre): yes
Found ä (pre) in <ä> (dec): no
Found ä (dec) in <ä> (pre): no
Found ä (dec) in <ä> (dec): yes
Found Ä (pre) in <ä> (pre): no
Found Ä (pre) in <ä> (dec): no
Found Ä (dec) in <ä> (pre): no
Found Ä (dec) in <ä> (dec): yes

Desired output: All cases should give "yes"

1 Answer

深蓝 · Answer 1 · 2022-01-31T07:15:52+0000

I've solved the issue by writing my own preprocessor, which generates a regular expression that combines all the alternatives (case and normalization, but not diacritics) and passes that to the regex function.

The complete solution is documented here.
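
As a rough illustration of the idea (not the documented solution itself), the sketch below builds such an alternation for a single letter and hands it to the POSIX regex functions. The names A_UMLAUT_VARIANTS, build_alternation and find_any_variant are hypothetical, and the variant table is written out by hand here. A real preprocessor would have to derive the variants for each character of the search text and escape any regex metacharacters.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table: every byte spelling that should count as "ä". */
const char *const A_UMLAUT_VARIANTS[] = {
    "\xC3\xA4",   /* ä precomposed, U+00E4     */
    "a\xCC\x88",  /* ä decomposed,  a + U+0308 */
    "\xC3\x84",   /* Ä precomposed, U+00C4     */
    "A\xCC\x88",  /* Ä decomposed,  A + U+0308 */
};

/* Writes "(v0|v1|...)" into out.
   Returns 0 on success, -1 if out is too small.
   Assumes the variants contain no regex metacharacters. */
int build_alternation(const char *const *variants, size_t count,
                      char *out, size_t out_size) {
    assert(variants != NULL && out != NULL);
    size_t used = 0;
    if (out_size < 3) return -1;            /* room for "()" plus NUL */
    out[used++] = '(';
    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(variants[i]);
        /* need room for an optional '|', the variant, ')' and NUL */
        if (used + (i > 0) + len + 2 > out_size) return -1;
        if (i > 0) out[used++] = '|';
        memcpy(out + used, variants[i], len);
        used += len;
    }
    out[used++] = ')';
    out[used] = '\0';
    return 0;
}

/* Returns 1 if any variant occurs in the NUL-terminated data,
   0 if none does, -1 on error. */
int find_any_variant(const char *data) {
    char pattern[64];
    regex_t re;
    if (build_alternation(A_UMLAUT_VARIANTS, 4, pattern, sizeof pattern) != 0)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int found = regexec(&re, data, 0, NULL, 0) == 0;
    regfree(&re);
    return found;
}
```

With this table, all four spellings of the letter are found regardless of which form appears in the data, which is the "all yes" result asked for in the question. Note that regexec stops at the first zero byte, so real binary data would need to be scanned piece by piece, or with an extension such as REG_STARTEND where the platform provides it.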
