Is it safe (and possible?) to allocate the memory OUTSIDE of the split function and pass a pointer to the array of strings to split function?
No, because you don't know how large each token will be until you read it nor how many tokens there are. For this reason your code as written is unsafe. It assumes each token is no more than 29 characters and there are no more than 20 tokens.
Allocating just the right amount of memory for each token can be solved with strdup
, as Chris Dodd explains. But that still leaves you allocating a bunch of little pieces of memory.
To minimize allocations you would store all the tokens in a single block of memory. That's what strtok
does; it modifies the input string replacing the delimiters with nulls. Rather than copying each token, tokens
would simply point to the start of each token within the original string.
This means tokens
is pointing to in
and your function is modifying in
. I'd leave the choice of whether to duplicate the input string up to the caller.
// Don't modify the original, tokens points at a copy.
char **tokens = split(strdup(input), &num_tokens, "");
// Modify the original, tokens points at input.
char **tokens = split(input, &num_tokens, "");
That leaves the problem of allocating tokens
. To do that you either need to know how many tokens there are, or reallocate on the fly.
To minimize CPU you would realloc
to grow tokens
as needed.
size_t tokens_size = 1;
char **tokens = malloc(tokens_size * sizeof(char*));
*num_tokens = 0;
for(
char *token = strtok(in, delimiter);
token;
token = strtok(NULL, delimiter)
) {
if( *num_tokens >= tokens_size ) {
tokens_size *= 2;
tokens = realloc(tokens, tokens_size * sizeof(char*));
}
tokens[*num_tokens] = token;
(*num_tokens)++;
}
As a compromise between efficiency and memory use, rather than reallocating for each token I've doubled the size of tokens each realloc. This will use extra memory, but minimizes reallocations.
To minimize allocations, you'd tokenize the string and remember how many tokens there are. Then allocate enough space. Then read the string again storing the pointers to the tokens.
// tokenize in and remember how many there are.
*num_tokens = 0;
for(
char *token = strtok(in, delimiter);
token;
token = strtok(NULL, delimiter)
) {
(*num_tokens)++;
}
// Allocate exactly enough space.
char **tokens = malloc(*num_tokens * sizeof(char*));
// Iterate through the tokens and store them.
char *token = in;
for(size_t i = 0; i < *num_tokens; i++) {
// Store a pointer to the token
tokens[i] = token;
// Jump ahead the length of the token plus the null.
token += strlen(token) + 1;
}
For the cost of scanning the input twice you can do just a single malloc.
Note: this code would be easier and safer with a struct to contain all the information.
typedef struct {
char *original;
char **tokens;
size_t num_tokens;
size_t tokens_size;
} Tokens;
This keeps everything together and you can break it up into functions like Tokens_init
and Tokens_free
.