I'm using LINQ to compare two DataSets with each other to create new rows and update existing. I've noticed that the complete comparison lasts ~1,5 hours and only one of the two cores is busy(Task-Manager is 50-52% CPU Usage). I must admit that I'm completely new to parallel LINQ, but I assume that it could increase performance significantly.
So my question is, how and what should I parallelize?
These are the original queries(reduced to the essentials):
'check for new data
Dim srcUnique = From row In src.Email_Total
Select Ticket_ID = row.ticket_id, Interaction = row.interaction, ModifiedAt = row.modified_time
Dim destUnique = From row In dest.ContactDetail
Where row.ContactRow.fiContactType = emailContactType.idContactType
Select row.ContactRow.Ticket_ID, row.Interaction, row.ModifiedAt
'get all emails(contactdetails) that are in source but not in destination
Dim diffRows = srcUnique.Except(destUnique).ToList
'get all new emails(according to ticket_id) for calculating contact columns
Dim newRowsTickets = (From row In src.Email_Total
Join d In diffRows
On row.ticket_id Equals d.Ticket_ID _
And row.interaction Equals d.Interaction _
And row.modified_time Equals d.ModifiedAt
Group row By Ticket_ID = row.ticket_id Into NewTicketRows = Group).ToList
For Each ticket In newRowsTickets
Dim contact = dest.Contact.FindByTicket_IDfiContactType(ticket.Ticket_ID, emailContactType.idContactType)
If contact Is Nothing Then
' Create new Contact with many sub-queries on this ticket(omitted) ****'
Dim newContact = Me.dest.Contact.NewContactRow
dest.Contact.AddContactRow(newContact)
contact = newContact
Else
' Update Contact with many sub-queries on this ticket(omitted) '
End If
daContact.Update(dest.Contact)
' Add new ContactDetail-Rows from this Ticket(this is the counterpart of the src.Email_Total-Rows, details omitted) '
For Each newRow In ticket.NewTicketRows
Dim newContactDetail = dest.ContactDetail.NewContactDetailRow
newContactDetail.ContactRow = contact
dest.ContactDetail.AddContactDetailRow(newContactDetail)
Next
daContactDetails.Update(dest.ContactDetail)
Next
Note: daContact
and daContactDetails
are SqlDataAdapters
, source
and dest
are DataSets
and Contact
and ContactDetail
are DataTables
, where every ContactDetail belongs to a Contact.
Even if not both cores would use 100% CPU, I assume that it would increase performance significantly if I would parallelize the queries, because the second core is nearly idle. The for each
might also be a good place to optimize since the tickets are not related to each other. So I assume that I could loop with multiple threads and create/update records parallel. But how to do it with PLINQ?
Side Note: As I've mentioned in the comments, performance is not a key factor for me so far, since the server's only purpose is to synchronize the MySQL Database(on another server) with a MS SQL-Server(on the same server as this Windows-Service). It acts as a source for reports that are generated by another service. But these reports are only generated once a day. But apart from that I was interested in learning PLINQ because I thought that this could be an excellent exercise. It takes the mentioned 1,5h only if destination DB is empty and all records must be created. If both databases are nearly in sync, this method takes only ~1 minute yet. In future performance will become more important since email is only one of several contact-types(chat+calls will exceed 1mil.records). I think that I'll anyway need some kind of (LINQ) Data-Paging then.
If something is unclear I'll update my answer accordingly. Thanks in advance.
Edit: Here is the result of my investigations and attempts:
Question: How to "PLINQ" an existing LINQ query with joins?
Answer: Note that some LINQ operators are binary—they take two IEnumerables as input. Join is a perfect example of such an operator. In these cases, the type of the left-most data source determines whether LINQ or PLINQ is used. Thus you need only call AsParallel on the first data source for your query to run in parallel:
IEnumerable<T> leftData = ..., rightData = ...;
var q = from x in leftData.AsParallel()
join y in rightData on x.a == y.b
select f(x, y);
But if I change my query in the following way(note the AsParallel
):
Dim newRowsTickets = (From row In src.Email_Total.AsParallel()
Join d In diffRows
On row.ticket_id Equals d.Ticket_ID _
And row.interaction Equals d.Interaction _
And row.modified_time Equals d.ModifiedAt
Group row By Ticket_ID = row.ticket_id Into NewTicketRows = Group).ToList
The compiler will complain that I need to add AsParallel
to the right datasource as well. So this seem to be a VB.NET issue or a lack of documentation(article is from 2007). I assume the latter because the(apart from that recommendable) article also says that you need to add System.Concurrency.dll
manually but actually it is part of .NET 4.0 Framework and in Namespace Sytem.Threading.Tasks
.
I realized that I won't profit from a parallelized Except
since the query is fast enough in sequential mode(even with nearly the same number of rows in both collection which results in the maximum number of comparisons, I got the result in less than 30 seconds). But I will add it for the sake of completeness later.
So I decided to parallelize the for-each
what is as easy as with LINQ-Queries, you simply need to add AsParallel()
at the end.
But I realized that I need to force the parallelism with WithExecutionMode(ParallelExecutionMode.ForceParallelism)
, otherwise .NET decides to use only one core for this loop. I also wanted to tell .NET that I wish to use as many Threads as possible but not more than 8: WithDegreeOfParallelism(8).
Now both cores are working at the same time, but the CPU usage stays on 54%.
So this is the PLINQ version so far:
Dim diffRows = srcUnique.AsParallel.Except(destUnique.AsParallel).ToList
Dim newRowsTickets = (From row In src.Email_Total.AsParallel()
Join d In diffRows.AsParallel()
On row.ticket_id Equals d.Ticket_ID _
And row.interaction Equals d.Interaction _
And row.modified_time Equals d.ModifiedAt
Group row By Ticket_ID = row.ticket_id Into NewTicketRows = Group).ToList
For Each ticket In newRowsTickets.
AsParallel().
WithDegreeOfParallelism(8).
WithExecutionMode(ParallelExecutionMode.ForceParallelism)
' blah,blah ... '
'add new ContactDetails for this Ticket(only new rows)
For Each newRow In ticket.NewTicketRows.
AsParallel().
WithExecutionMode(ParallelExecutionMode.Default)
' blah,blah ... '
Next
daContactDetails.Update(dest.ContactDetail)
Next
Unfortunately I don't see any performance benefits from using AsParallel
in comparison with sequential mode:
The for each
with AsParallel
(hh:mm:ss.mm):
09/29/2011 18:54:36: Contacts/ContactDetails created or modified. Duration: 01:21:34.40
And without:
09/29/2011 16:02:55: Contacts/ContactDetails created or modified. Duration: 01:21:24.50
Can somebody explain me this result? Is the database' write access in the for each
responsible for the similar time?
Following are recommendable readings:
- http://msdn.microsoft.com/en-us/magazine/cc163329.aspx (not up-to-date)
- Overview of videos/articles according this topic
- PDF: When To Use ParallelForEach Or PLINQ
- When PLINQ will speedup performance