Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

I am splitting the lines as they are red from the input:

char **split (char *in, size_t *n, char *delimiter){
    if (delimiter == NULL) delimiter=""; // default
    
    char **tokens = malloc(20 * sizeof(char*));

    char *token = strtok(in, delimiter);
    while (token != NULL) {
        tokens[*n] = malloc(30 * sizeof(char*)); // <- why not *tokens[0]
        strcpy(tokens[*n], token);
        (*n)++;
        token = strtok(NULL, "");
    }
    return tokens;
}

I have the (probably wrong but please explain it to me) idea that allocating (and releasing) different slots of memory on every line is not efficient.

Is it safe (and possible?) to allocate the memory OUTSIDE of the split function and pass a pointer to the array of strings to split function? How (then I would use memset before next line)?

Thank you!


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
347 views
Welcome To Ask or Share your Answers For Others

1 Answer

Is it safe (and possible?) to allocate the memory OUTSIDE of the split function and pass a pointer to the array of strings to split function?

No, because you don't know how large each token will be until you read it nor how many tokens there are. For this reason your code as written is unsafe. It assumes each token is no more than 29 characters and there are no more than 20 tokens.


Allocating just the right amount of memory for each token can be solved with strdup, as Chris Dodd explains. But that still leaves you allocating a bunch of little pieces of memory.

To minimize allocations you would store all the tokens in a single block of memory. That's what strtok does; it modifies the input string replacing the delimiters with nulls. Rather than copying each token, tokens would simply point to the start of each token within the original string.

This means tokens is pointing to in and your function is modifying in. I'd leave the choice of whether to duplicate the input string up to the caller.

// Don't modify the original, tokens points at a copy.
char **tokens = split(strdup(input), &num_tokens, "");

// Modify the original, tokens points at input.
char **tokens = split(input, &num_tokens, "");

That leaves the problem of allocating tokens. To do that you either need to know how many tokens there are, or reallocate on the fly.

To minimize CPU you would realloc to grow tokens as needed.

    size_t tokens_size = 1;
    char **tokens = malloc(tokens_size * sizeof(char*));

    *num_tokens = 0;
    for(
        char *token = strtok(in, delimiter);
        token;
        token = strtok(NULL, delimiter)
    ) {
        if( *num_tokens >= tokens_size ) {
            tokens_size *= 2;
            tokens = realloc(tokens, tokens_size * sizeof(char*));
        }

        tokens[*num_tokens] = token;

        (*num_tokens)++;
    }

As a compromise between efficiency and memory use, rather than reallocating for each token I've doubled the size of tokens each realloc. This will use extra memory, but minimizes reallocations.


To minimize allocations, you'd tokenize the string and remember how many tokens there are. Then allocate enough space. Then read the string again storing the pointers to the tokens.

    // tokenize in and remember how many there are.
    *num_tokens = 0;
    for(
        char *token = strtok(in, delimiter);
        token;
        token = strtok(NULL, delimiter)
    ) {
        (*num_tokens)++;
    }

    // Allocate exactly enough space.
    char **tokens = malloc(*num_tokens * sizeof(char*));
    
    // Iterate through the tokens and store them.
    char *token = in;
    for(size_t i = 0; i < *num_tokens; i++) {
        // Store a pointer to the token
        tokens[i] = token;
        // Jump ahead the length of the token plus the null.
        token += strlen(token) + 1;
    }

For the cost of scanning the input twice you can do just a single malloc.


Note: this code would be easier and safer with a struct to contain all the information.

typedef struct {
    char *original;
    char **tokens;
    size_t num_tokens;
    size_t tokens_size;
} Tokens;

This keeps everything together and you can break it up into functions like Tokens_init and Tokens_free.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...