Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

I have already asked a similar question, "Calculating Word Proximity in an inverted Index". However, I felt that question was too general and not refined enough, so here goes.

I have a List which contains the locations of a token in a document. For each token it is declared as

public List<int> hitLocation;

Let's say the document is

Java programming language has a name similar to java island in Indonesia however
local language in java bears no resemblance to the programming language called java.

and the query is

java island language
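
As a concrete illustration of the hit lists involved, here is a minimal sketch of how they might be built for the document above. The class and method names (`HitListDemo`, `BuildHitLists`) are hypothetical, and the tokenization (splitting on whitespace and basic punctuation, lowercasing, 0-based positions) is an assumption rather than anything stated in the question.

```csharp
using System;
using System.Collections.Generic;

public static class HitListDemo
{
    // Maps each lowercased token to the list of its 0-based positions in the text.
    // Assumption: tokens are separated by whitespace, periods, or commas.
    public static Dictionary<string, List<int>> BuildHitLists(string text)
    {
        var hits = new Dictionary<string, List<int>>();
        string[] tokens = text.Split(
            new[] { ' ', '\n', '\r', '\t', '.', ',' },
            StringSplitOptions.RemoveEmptyEntries);

        for (int position = 0; position < tokens.Length; position++)
        {
            string token = tokens[position].ToLowerInvariant();
            if (!hits.TryGetValue(token, out List<int> hitLocation))
            {
                hitLocation = new List<int>();
                hits[token] = hitLocation;
            }
            // Positions are appended in increasing order, so each list stays sorted.
            hitLocation.Add(position);
        }
        return hits;
    }

    public static void Main()
    {
        string document =
            "Java programming language has a name similar to java island in Indonesia however " +
            "local language in java bears no resemblance to the programming language called java.";

        Dictionary<string, List<int>> hits = BuildHitLists(document);
        foreach (string term in new[] { "java", "island", "language" })
        {
            Console.WriteLine(term + ": " + string.Join(", ", hits[term]));
        }
    }
}
```

Under these assumptions the lists come out as java: 0, 8, 16, 25; island: 9; language: 2, 14, 23.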

Say I lock on to the "java" hit list and attempt to directly calculate the distances between the "java", "island", and "language" hit lists.

The first problem is that there are 4 occurrences of the token "java" in the document. Which one do I select? Assume I select the first one.

I move on to the "island" hit list and, after comparing, find that it is adjacent to the second occurrence of "java". So I change my selection and lock on to the second occurrence of "java".

Proceeding to the third token, "language", I find that it is situated at quite a distance from my selection; however, it is quite near the first occurrence of "java".

So you see the dilemma: if I revert to the original selection, i.e. the first occurrence of "java", the distance to the second token, "island", increases; and if I stay with my current selection, the sheer distance to the second occurrence of the token "language" will ruin the relevance score.

Previously there was a suggestion to use the dot product; however, I am at a loss as to how to proceed with that option.

Any other solution would also be welcome.

I understand that this question is quite detailed. However, I have searched long and hard and haven't found any question like this on the topic.

I feel that if this question is answered, it will be a great addition to the community and will make anybody who is designing anything related to relevancy quite happy.

Thank you.


1 Answer

You seem to be using the hit lists a little differently than how they are intended to be used (at least given my understanding).

Typically, people compare the hit lists returned for different documents. This is how they rank one document as being "more relevant" than another.

That said, if you want to find all locations of some multi-word phrase like "java island", given the locations of the words "java" and "island", you would:

  • Get a list of locations for "java"
  • Get a list of locations for "island"
  • Sort both lists
  • Iterate through both lists at the same time. You start by getting the first entry of each list. Now test this pair of entries: if the two positions are "off by one", you have found one instance of "java island" (or perhaps "island java"). Then advance the list whose current entry is the smaller of the two, test the new pair of entries, and repeat until either list is exhausted.
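
The steps above can be sketched roughly as follows. This is only an illustration, not a definitive implementation: the names `PhraseMatcher` and `FindAdjacent` are hypothetical, and the method assumes both hit lists are already sorted in ascending order and belong to different words (so no position appears in both lists).

```csharp
using System;
using System.Collections.Generic;

public static class PhraseMatcher
{
    // Returns every (first, second) position pair that is exactly one token apart,
    // in either order. Both lists must be sorted ascending.
    public static List<(int First, int Second)> FindAdjacent(List<int> first, List<int> second)
    {
        var matches = new List<(int First, int Second)>();
        int i = 0, j = 0;

        while (i < first.Count && j < second.Count)
        {
            int a = first[i];
            int b = second[j];

            // "Off by one" test: the two words sit next to each other.
            if (Math.Abs(a - b) == 1)
            {
                matches.Add((a, b));
            }

            // Advance the list whose current entry is the smaller one.
            if (a < b) i++; else j++;
        }
        return matches;
    }
}
```

For example, calling `PhraseMatcher.FindAdjacent` with the "java" positions {0, 8, 16, 25} and the "island" positions {9} yields the single pair (8, 9). To accept only "java island" and not "island java", the test could be narrowed to `b - a == 1`.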

BTW, the dot product is more useful when comparing two different documents.
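
To make the dot-product idea a little more concrete, here is a minimal sketch that treats each document as a sparse vector of term weights. Everything here is an assumption for illustration: the names `TermVectors`, `Dot`, and `Cosine` are hypothetical, the weights could be something as simple as the length of each term's hit list, and the cosine normalization is an extra step that is not part of the suggestion above.

```csharp
using System;
using System.Collections.Generic;

public static class TermVectors
{
    // Dot product of two sparse term-weight vectors keyed by term.
    // Terms missing from either vector contribute nothing.
    public static double Dot(Dictionary<string, double> a, Dictionary<string, double> b)
    {
        double sum = 0.0;
        foreach (KeyValuePair<string, double> entry in a)
        {
            if (b.TryGetValue(entry.Key, out double weight))
            {
                sum += entry.Value * weight;
            }
        }
        return sum;
    }

    // Cosine similarity: the dot product divided by the product of the vector lengths.
    // Returns 0 when either vector is empty or all zeros.
    public static double Cosine(Dictionary<string, double> a, Dictionary<string, double> b)
    {
        double norm = Math.Sqrt(Dot(a, a)) * Math.Sqrt(Dot(b, b));
        return norm == 0.0 ? 0.0 : Dot(a, b) / norm;
    }
}
```

With one document weighted as java = 4, island = 1, language = 3 and another as java = 1, language = 1, `Dot` gives 4 * 1 + 3 * 1 = 7. Note that this measures how similar two term distributions are; it says nothing about how close the terms are to each other within a single document.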

