Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
I'm trying to fill a file of enormous size (>1 GB) with random data.

I've written a simple "thread safe random" that generates strings (the solution was suggested at https://devblogs.microsoft.com/pfxteam/getting-random-numbers-in-a-thread-safe-way/), and reworking it to produce random strings is trivial.

I'm trying to write the strings to a file with this code:

String rp;

Parallel.For(1, numlines -1, i => 
{
    rp = ThreadSafeRandom.Next();
    outputFile.WriteLineAsync(rp.ToString()).Wait();
});

When the number of lines is small, the file is generated perfectly.

When I enter a bigger number of lines (say 30,000), the following happens:

  • some strings are corrupted (Notepad++ shows them prepended with lots of NUL characters)

  • at some point I get an InvalidOperationException("Thread is used by previous thread operation").

I tried changing the lambda to Parallel.For(1, numlines - 1, async i => with await outputFile.WriteLineAsync(rp.ToString());

and also tried doing

lock (outputFile) {
    outputFile.WriteLineAsync(rp.ToString());
}

I can always use a single-threaded approach with a simple for loop and a simple WriteLine(), but as I've said I want to generate big files. Even a simple loop that generates > 10,000 records can take some time (a big file will contain 1e6 or even 1e9 records, which is more than 20 GB), and I cannot think of an optimal approach.

Can someone suggest how to optimize this?


1 Answer

Your limiting factor is probably the speed of your hard disk. Nevertheless, you may gain some performance by splitting the work in two: one thread (the producer) produces the random lines, and another thread (the consumer) writes the produced lines to the file. The code below writes 1,000,000 random lines (about 10 MB) to a file on my SSD in less than a second.

var buffer = new BlockingCollection<string>(boundedCapacity: 10);
var producer = Task.Factory.StartNew(() =>
{
    var random = new Random();
    var sb = new StringBuilder();
    for (int i = 0; i < 10000; i++) // 10,000 chunks
    {
        sb.Clear();
        for (int j = 0; j < 100; j++) // 100 lines each chunk
        {
            sb.AppendLine(random.Next().ToString());
        }
        buffer.Add(sb.ToString());
    }
    buffer.CompleteAdding();
}, TaskCreationOptions.LongRunning);
var consumer = Task.Factory.StartNew(() =>
{
    using (var outputFile = new StreamWriter(@".....Huge.txt"))
        foreach (var chunk in buffer.GetConsumingEnumerable())
        {
            outputFile.Write(chunk);
        }
}, TaskCreationOptions.LongRunning);
Task.WaitAll(producer, consumer);

This way you don't even need thread safety in the production of the random lines, because the production happens in a single thread.
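On newer runtimes (.NET Core 3.0+), the same producer/consumer pattern can be expressed with System.Threading.Channels instead of BlockingCollection. This is a minimal sketch, not the answer's original code: the file name "Huge.txt" is a placeholder, and the counts are scaled down to 1,000 chunks for brevity; a bounded channel provides the same back-pressure as boundedCapacity did above.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading.Channels;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        // Bounded channel plays the role of the BlockingCollection:
        // the producer blocks (asynchronously) when 10 chunks are pending.
        var channel = Channel.CreateBounded<string>(capacity: 10);

        var producer = Task.Run(async () =>
        {
            var random = new Random();
            var sb = new StringBuilder();
            for (int i = 0; i < 1000; i++) // 1,000 chunks (scaled down)
            {
                sb.Clear();
                for (int j = 0; j < 100; j++) // 100 lines per chunk
                    sb.AppendLine(random.Next().ToString());
                await channel.Writer.WriteAsync(sb.ToString());
            }
            channel.Writer.Complete(); // signals the consumer to finish
        });

        var consumer = Task.Run(async () =>
        {
            using var outputFile = new StreamWriter("Huge.txt");
            // ReadAllAsync completes when the writer calls Complete()
            await foreach (var chunk in channel.Reader.ReadAllAsync())
                outputFile.Write(chunk);
        });

        await Task.WhenAll(producer, consumer);
        Console.WriteLine(new FileInfo("Huge.txt").Length > 0);
    }
}
```

As with the BlockingCollection version, production stays single-threaded per producer, so no thread-safe Random is needed.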


Update: In case writing to disk is not the bottleneck and the producer is slower than the consumer, more producers can be added. Below is a version with three producers and one consumer.

var buffer = new BlockingCollection<string>(boundedCapacity: 10);
var producers = Enumerable.Range(0, 3)
.Select(n => Task.Factory.StartNew(() =>
{
    var random = new Random(n); // Non-random seed, same data on every run
    var sb = new StringBuilder();
    for (int i = 0; i < 10000; i++)
    {
        sb.Clear();
        for (int j = 0; j < 100; j++)
        {
            sb.AppendLine(random.Next().ToString());
        }
        buffer.Add(sb.ToString());
    }
}, TaskCreationOptions.LongRunning))
.ToArray();
var allProducers = Task.WhenAll(producers).ContinueWith(_ =>
{
    buffer.CompleteAdding();
});
// Consumer is the same as previously (omitted)
Task.WaitAll(allProducers, consumer);
