Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I've been struggling to get deduplication to work in SolrCloud (version 8.6). My solrconfig.xml contains:

<updateRequestProcessorChain name="dedupeOn">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">dedupeId</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">journal_doi,internal_pmid</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.DistributedUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

and

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">dedupeOn</str>
  </lst>
</requestHandler>

My managed-schema contains:

<field name="dedupeId" type="string" indexed="true" stored="true" multiValued="false" />

In my test, I add 1000 documents and commit manually. I can see that the "dedupeId" field is populated with the hash.
I then add 10 more documents that I know are duplicates and commit manually again. These 10 documents are added as new rows, and the original documents with the matching dedupeId are not overwritten. For example:

  "response":{"numFound":2,"start":0,"maxScore":2.1554677,"numFoundExact":true,"docs":[
      {
        "internal_pmid":"13367837",
        "dedupeId":"7f0306ecd909a68e",
        "journal_doi":"10.1097/00005053-195603000-00006"},
      {
        "internal_pmid":"13367837",
        "dedupeId":"7f0306ecd909a68e",
        "journal_doi":"10.1097/00005053-195603000-00006"}]
  }}
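
To gauge how many signatures are affected, the `docs` array of a query response can be grouped by signature. The following is only a minimal Python sketch: the helper name `find_duplicate_signatures` is hypothetical, and it assumes the response has already been parsed into a list of dicts (here, the two documents shown above).

```python
from collections import Counter


def find_duplicate_signatures(docs, signature_field="dedupeId"):
    """Return {signature: count} for signatures shared by more than one document."""
    counts = Counter(
        doc[signature_field] for doc in docs if signature_field in doc
    )
    return {sig: n for sig, n in counts.items() if n > 1}


# The two documents from the response above.
docs = [
    {"internal_pmid": "13367837", "dedupeId": "7f0306ecd909a68e",
     "journal_doi": "10.1097/00005053-195603000-00006"},
    {"internal_pmid": "13367837", "dedupeId": "7f0306ecd909a68e",
     "journal_doi": "10.1097/00005053-195603000-00006"},
]

print(find_duplicate_signatures(docs))  # {'7f0306ecd909a68e': 2}
```

If overwriting worked as intended, the returned dict would be empty for any result set.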

I'm not sure if it's significant, but in the Solr logs I see some "add" entries that contain, in part:

webapp=/solr path=/update params={update.distrib=TOLEADER&update.chain=dedupeOn&distrib.from=*(shard path)*/&wt=javabin&version=2}{add=[00001hLxMb (1690871781072568320)]} 0 2

but other "add" entries do not contain the update.chain parameter, e.g.:

webapp=/solr path=/update params={wt=javabin&version=2}{add=[00000sta0n (1690871780667817984)]} 0 2
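
Sorting the logged adds by whether they carried the chain parameter can be scripted. This is a rough Python sketch, assuming the log lines keep the `params={...}` layout shown in the two excerpts above; the function name `update_chain_of` is hypothetical, and the elided shard path is left exactly as it appears in the excerpt.

```python
import re
from urllib.parse import parse_qs

# Captures the query-string-like text inside the first params={...} group.
PARAMS_RE = re.compile(r"params=\{(.*?)\}")


def update_chain_of(log_line):
    """Return the update.chain value logged for a request, or None if absent."""
    match = PARAMS_RE.search(log_line)
    if match is None:
        return None
    params = parse_qs(match.group(1), keep_blank_values=True)
    values = params.get("update.chain")
    return values[0] if values else None


with_chain = (
    "webapp=/solr path=/update params={update.distrib=TOLEADER"
    "&update.chain=dedupeOn&distrib.from=*(shard path)*/&wt=javabin&version=2}"
    "{add=[00001hLxMb (1690871781072568320)]} 0 2"
)
without_chain = (
    "webapp=/solr path=/update params={wt=javabin&version=2}"
    "{add=[00000sta0n (1690871780667817984)]} 0 2"
)

print(update_chain_of(with_chain))     # dedupeOn
print(update_chain_of(without_chain))  # None
```

Counting how many adds fall into each group, and on which node they were logged, may help narrow down where the chain is and is not being applied.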

Any help would be greatly appreciated.

Question from: https://stackoverflow.com/questions/66067082/solrcloud-deduplication-overwrite-isnt-working
