Log similarity

+7 votes

Can someone explain the principle behind the "log similarity" algorithm?

I log in 4 different languages depending on the country, and I vary my log length, wording, story and signature a lot. Yet my log similarity is still quite high (lower is better).

I try to make my log messages good and interesting. Log length is one thing, but log similarity is also very important.
asked May 27, 2017 in Support and help by Arnaudd (1,560 points)
I tried to keep an eye on that over the past months. Maintaining long logs helped me increase my wording variety, but nothing really changed. I still have a pretty bad ranking on log similarity, which I may have to accept.
Same here. No change in the result, although I edited many logs and my shortest log now has 32 words.

4 Answers

+2 votes
I guess the algorithm counts the number of occurrences of each word you put in your logs. So words like "the" in English, which are used very often, will significantly increase the log similarity.

For my part, I also love writing long and (I hope) interesting logs. But when I log a series of caches, I always add a "header" (identical for all caches of the day) followed by a part dedicated to the cache itself. So my log similarity is very high (45%).
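To illustrate the guess above: the following is a minimal sketch of a word-count comparison, assuming a bag-of-words model with cosine similarity. This is an assumption for illustration only, not Project-GC's actual algorithm, which is not documented in this thread. The function names `word_counts` and `cosine_similarity` are hypothetical.

```python
import math
import re
from collections import Counter


def word_counts(log: str) -> Counter:
    """Lowercase the log and count how often each word occurs."""
    return Counter(re.findall(r"\w+", log.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine between two word-count vectors: 0 means no shared words,
    1 means the words occur in identical proportions."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty log shares nothing with any other log
    return dot / (norm_a * norm_b)


# Two logs about entirely different caches (made-up example texts).
log_1 = "The cache was hidden under the bridge near the river"
log_2 = "The walk to the summit was long but the view was great"

# The only shared words are "the" and "was", yet the score is well above zero.
print(round(cosine_similarity(word_counts(log_1), word_counts(log_2)), 2))
```

Under this assumed model, frequent function words alone push the score up, which is why real text-similarity tools often discard such "stop words" or down-weight them before comparing.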
answered May 31, 2017 by Squall_Leonhart (790 points)
This is the way we log too (by now). The logs are mostly in German, while on holidays usually in German and English, and sometimes even a bit of Danish.
The stats say we have an average of 244 words. Overall I would say it "feels" like a similarity of 35%, but the stats say 64%. This is a bit funny, as it makes us look like copy-and-paste loggers, but we really try to say something unique about every cache (even on a PT, if possible, and if we go for one at all, which is not very likely) ;)
Mine is also around 65%, which is why I don't understand.

I agree that the principle is probably a histogram of the words, but then how is the percentage calculated? Entropy? Kurtosis?
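As a sketch of how a word histogram could be turned into a percentage, here is one possibility based on normalized Shannon entropy. It is purely a guess for illustration: the formula (100% minus entropy relative to its maximum) and the name `entropy_similarity_percent` are assumptions, and nothing in this thread confirms that Project-GC computes the value this way.

```python
import math
import re
from collections import Counter
from typing import List


def entropy_similarity_percent(logs: List[str]) -> float:
    """Hypothetical score: 0% when every distinct word is used equally often,
    approaching 100% as a few words dominate the histogram."""
    words = [w for log in logs for w in re.findall(r"\w+", log.lower())]
    counts = Counter(words)
    if len(counts) < 2:
        return 100.0  # zero or one distinct word: there is no variety at all
    total = len(words)
    # Shannon entropy of the word histogram, in bits.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # The maximum entropy is reached when all distinct words are equally frequent.
    max_entropy = math.log2(len(counts))
    return 100.0 * (1.0 - entropy / max_entropy)


print(entropy_similarity_percent(["quick find here", "lovely walk today"]))
print(entropy_similarity_percent(["tftc tftc tftc", "tftc found"]))
```

With this assumed formula, the first pair of logs scores 0% because no word repeats, while the second scores higher because one word dominates.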
+1 vote
I would not be surprised if details about the algorithm are withheld on purpose. Publishing them might make it too easy to fool the algorithm and achieve a good ranking.

I fully agree it is worth trying to write good and interesting logs. However, this is not an easy task, and it is very difficult for an algorithm to judge. My own logs (mostly written in German, plus the local language in most cases) also have a high similarity, which may be due to the fact that a few words appear frequently in any normal text. I also have to admit that I use a 'day header' or 'trip header', which may add to the similarity. I would therefore assume the check may also include a 'text similarity' test (looking for identical word sets or sentences), which is easy to program.
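The effect of a shared 'day header' on a sentence-level check can be sketched as follows. This assumes a simple Jaccard overlap of sentence sets, averaged over all pairs of logs. It is an illustration of the 'text similarity' idea only, not the site's confirmed method, and the helper names and example logs are made up.

```python
import re
from itertools import combinations


def sentences(log: str) -> set:
    """Split a log into a set of normalized sentences (lowercase, single spaces)."""
    parts = re.split(r"[.!?\n]+", log.lower())
    return {" ".join(p.split()) for p in parts if p.strip()}


def jaccard(a: set, b: set) -> float:
    """Shared sentences divided by all distinct sentences of both logs."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def mean_pairwise_similarity(logs) -> float:
    """Average Jaccard overlap over every pair of logs."""
    sets = [sentences(log) for log in logs]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical day header reused in every log, plus one unique sentence each.
header = "Today we walked the lake trail with friends. The weather was perfect."
logs = [
    header + " This one was under a bench.",
    header + " Nice hide in the old oak tree!",
    header + " Took us three attempts to spot it.",
]
print(mean_pairwise_similarity(logs))  # prints 0.5
```

In this toy case each pair of logs shares 2 of 4 distinct sentences, so a two-sentence header alone produces 50% under the assumed measure, even though every log ends with a unique remark.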
answered May 31, 2017 by Domino_67 (6,410 points)
I agree on the complexity and secrecy.

But I feel that my value of 65% does not reflect the effort I put into the logs. I'd like to know at least how I can improve it.
+1 vote
Perhaps shorter logs give more variance. I have an average log length of 41 words and a similarity score of 31%. Although I sometimes write full-length descriptions, I often post logs such as "Another quick find here - TFTC" (I'm not saying I should, and I will write more for a good cache, or when finding it took more effort than usual). Personally I'd rather have a higher log length and lose out on the similarity score, since I get a better badge for the log length, whereas the similarity score doesn't seem to be used in any other statistics. However, with over 2000 logs, bringing the average length up is very slow!
answered Jun 2, 2017 by Optimist on the run (Expert) (17,790 points)
0 votes
I'm now at 68%, although I still write long and varied logs in multiple languages.

My guess is that the algorithm is based on histogram entropy and the number of different words used, but why does it give such a high value for me?
answered Aug 12, 2018 by Arnaudd (1,560 points)
I also wonder about it. Since this algorithm was introduced, my log similarity has gone up from 44% to 50%, although I try to vary my logs and write them in different languages.