I've done some research and found that the most efficient way for me to read and write multi-gig (5+ GB) files is to use something like the following code:

using (FileStream fs = File.Open(file, FileMode.Open, FileAccess.Read, FileShare.Read))
using (BufferedStream bs = new BufferedStream(fs, 256 * 1024))
using (StreamReader sr = new StreamReader(bs, Encoding.ASCII, false, 256 * 1024))
using (StreamWriter sw = new StreamWriter(outputFile, true, Encoding.Unicode, 256 * 1024))
{
    string line = "";

    while (sr.BaseStream != null && (line = sr.ReadLine()) != null)
    {
        //Try to clean csv then split
        line = Regex.Replace(line, @"[\s\dA-Za-z][""][\s\dA-Za-z]", "");
        string[] fields = Regex.Split(line, @",(?=(?:[^""]*""[^""]*"")*[^""]*$)");
        //I know there are libraries for this that I will switch out 
        //when I have time to create the classes as it seems they all
        //require a mapping class

        //Remap 90-250 properties
        object myObj = ObjectMapper(fields);

        //Write line
        bool success = ObjectWriter(myObj);
    }
}

CPU usage is averaging around 33% for each of the 3 instances on an Intel Xeon at 2.67 GHz. While running 3 instances of the process, I was able to output 2 files that were just under 7 GB in ~26 hrs, using:

Parallel.Invoke(
    () => new Worker().DoWork(args[0]),
    () => new Worker().DoWork(args[1]),
    () => new Worker().DoWork(args[2])
);

The third instance is generating a MUCH larger file: over 34 GB so far, and I'm coming up on day 3, ~67 hrs in.

From what I've read, I think performance could improve slightly by lowering the buffer size to a sweet spot.
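To find that sweet spot, I plan to time a read-only pass over one of the inputs at a few candidate buffer sizes, something like the sketch below (sampleFile is just a placeholder path):

// Requires using System; using System.Diagnostics; using System.IO; using System.Text;
const string sampleFile = @"C:\data\sample.csv"; // placeholder for one of the real inputs

foreach (int bufferSize in new[] { 4 * 1024, 64 * 1024, 256 * 1024, 1024 * 1024 })
{
    var timer = Stopwatch.StartNew();
    long lines = 0;

    using (FileStream fs = File.Open(sampleFile, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (BufferedStream bs = new BufferedStream(fs, bufferSize))
    using (StreamReader sr = new StreamReader(bs, Encoding.ASCII, false, bufferSize))
    {
        // Count lines only, so the timing isolates read throughput from parsing/mapping cost.
        while (sr.ReadLine() != null)
            lines++;
    }

    timer.Stop();
    Console.WriteLine($"{bufferSize,9} bytes: {lines} lines in {timer.Elapsed.TotalSeconds:F1}s");
}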

My questions are:

  1. Based on what is stated, is this typical performance?
  2. Besides what I mentioned above, are there any other improvements you can see?
  3. Are the CSV mapping and reading libraries much faster than regex?

1 Answer

So, first of all, you should profile your code to identify bottlenecks.

Visual Studio comes with a built-in profiler for this purpose, which can clearly identify hot-spots in your code.

Given that your process is CPU bound, this is likely to prove very effective.
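
If you can't attach the profiler on the machine where this job runs, even a crude Stopwatch breakdown of the loop will show whether the regex work, the mapping, or the writing dominates. A rough sketch (mine, not your code), reusing sr, ObjectMapper and ObjectWriter from your snippet:

// Requires using System.Diagnostics; accumulate time per stage across the whole file.
var cleanTime = new Stopwatch();
var mapTime = new Stopwatch();
var writeTime = new Stopwatch();
string line;

while ((line = sr.ReadLine()) != null)
{
    cleanTime.Start();
    line = Regex.Replace(line, @"[\s\dA-Za-z][""][\s\dA-Za-z]", "");
    string[] fields = Regex.Split(line, @",(?=(?:[^""]*""[^""]*"")*[^""]*$)");
    cleanTime.Stop();

    mapTime.Start();
    object myObj = ObjectMapper(fields);
    mapTime.Stop();

    writeTime.Start();
    bool success = ObjectWriter(myObj);
    writeTime.Stop();
}

Console.WriteLine($"clean/split: {cleanTime.Elapsed}  map: {mapTime.Elapsed}  write: {writeTime.Elapsed}");

Whichever bucket dominates is where optimisation effort will pay off.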

However, if I had to guess at why it's slow, I would imagine it's because you are not re-using your regexes. A regex is relatively expensive to construct, so re-using it can yield massive performance improvements.

var regex1 = new Regex(@"[\s\dA-Za-z][""][\s\dA-Za-z]", RegexOptions.Compiled);
var regex2 = new Regex(@",(?=(?:[^""]*""[^""]*"")*[^""]*$)", RegexOptions.Compiled);
while (sr.BaseStream != null && (line = sr.ReadLine()) != null)
{
    //Try to clean csv then split
    line = regex1.Replace(line, ""); 
    string[] fields = regex2.Split(line);
    //I know there are libraries for this that I will switch out 
    //when I have time to create the classes as it seems they all
    //require a mapping class

    //Remap 90-250 properties
    object myObj = ObjectMapper(fields);

    //Write line
    bool success = ObjectWriter(myObj);
}

However, I would strongly encourage you to use a library like Linq2Csv; it will likely be more performant, as it has had several rounds of performance tuning, and it will handle edge cases that your code doesn't.
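
The mapping class you mentioned is mostly just attributes on a POCO. Roughly, it looks like the sketch below (written from memory, so double-check the attribute and property names against the library's documentation; CsvRow stands in for your real 90-250 property class):

// Requires the LINQtoCSV NuGet package: using LINQtoCSV; using System.Collections.Generic;
class CsvRow
{
    [CsvColumn(FieldIndex = 1)]
    public string Field1 { get; set; }

    [CsvColumn(FieldIndex = 2)]
    public int Field2 { get; set; }
}

// Rows are streamed as you enumerate them, so a multi-gig file is not loaded into memory at once.
IEnumerable<CsvRow> rows = new CsvContext().Read<CsvRow>(file, new CsvFileDescription
{
    SeparatorChar = ',',
    FirstLineHasColumnNames = false,
    EnforceCsvColumnAttribute = true
});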

