Log similarity

From Project-GC
Revision as of 22:35, 15 February 2020 by magma1447 (3305483) (talk | contribs) (Create page)

The following is a copy/paste of an answer on Facebook. It should be cleaned up, but it's pasted here for now because it contains a lot of related facts.


It's the content of the logs that is similar, not the logs themselves. 57% doesn't mean that 57% of your logs are identical. It means that each log, compared with the log you wrote just before it, has 57% similar content on average.

Making the code language aware would be close to impossible, so it isn't language aware. It compares characters, nothing else. If you write "ab" in one log and "cd" in the next, those will have 0% similarity. Also, since it's not word aware, it doesn't matter if some words are more commonly used in some languages. "Rabbit" and "Tea" will also have a similarity, 22% actually, because they share one character.
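To make the character-level behaviour concrete, here is a minimal Python sketch modelled on how PHP's similar_text is understood to work: find the longest common substring, repeat on the pieces to its left and right, and express the matched character count as a percentage of the combined length. This is an illustrative port under those assumptions, not the code Project-GC runs. The function names are hypothetical, and it compares Python characters where PHP compares bytes, so results may differ for non-ASCII text.

```python
def similar_chars(a: str, b: str) -> int:
    """Count matching characters via repeated longest-common-substring splits."""
    if not a or not b:
        return 0
    best_len = best_a = best_b = 0
    # Find the first longest common substring of a and b.
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > best_len:
                best_len, best_a, best_b = k, i, j
    if best_len == 0:
        return 0
    # Recurse on what lies left and right of the common substring.
    left = similar_chars(a[:best_a], b[:best_b])
    right = similar_chars(a[best_a + best_len:], b[best_b + best_len:])
    return best_len + left + right


def similarity_percent(a: str, b: str) -> float:
    """Matched characters, counted in both strings, over the combined length."""
    total = len(a) + len(b)
    if total == 0:
        return 0.0
    return similar_chars(a, b) * 2 * 100 / total
```

With this sketch, similarity_percent("ab", "cd") gives 0 and similarity_percent("Rabbit", "Tea") gives roughly 22, since the only shared character is the lowercase "a" (the comparison is case sensitive).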

Another example: writing these three logs in a row would give 50% log similarity: "ab", "cd", "cd" ( (0 + 100) / 2 ). Add another "ab" at the end and you would have 33% ( (0 + 100 + 0) / 3 ).
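The averaging in that example can be sketched in Python as follows. The pairwise comparison here uses difflib.SequenceMatcher from the standard library purely as a stand-in so the block runs on its own. It is not the PHP function and can give different percentages for real logs. The names pair_similarity and log_similarity are hypothetical.

```python
from difflib import SequenceMatcher


def pair_similarity(a: str, b: str) -> float:
    # Stand-in comparison (assumption): not PHP's similar_text.
    return SequenceMatcher(None, a, b, autojunk=False).ratio() * 100


def log_similarity(logs: list[str]) -> float:
    """Average similarity between each log and the one written just before it."""
    pairs = list(zip(logs, logs[1:]))
    if not pairs:
        return 0.0  # fewer than two logs: nothing to compare
    return sum(pair_similarity(prev, cur) for prev, cur in pairs) / len(pairs)
```

log_similarity(["ab", "cd", "cd"]) gives 50, and appending another "ab" brings it down to about 33, because three logs produce two comparisons and four logs produce three.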

In our opinion, the number itself doesn't really have to be understandable. Its value lies in keeping it low. If yours is lower than most others', then you are doing a better job of writing varied logs than the majority.

I don't think it has to be taken with a pinch of salt at all. As a number it's quite a hard fact, it's just hard to understand the number. But the lower it is, the more varied the logs are. High word count and low log similarity is definitely possible. In theory one could reach 0%, but that would definitely require aiming for it, and I can agree that it won't happen by chance.

You say that the numbers here say it can't; I don't agree. 213 words with 17% and 308 words with 22% are quite long logs with low similarity. Maybe you just have too high expectations of what "low similarity" is? I would personally say that everything below 50% is fairly low.

We rely on an open source function. The exact math can be learned by studying documentation and source code. https://www.php.net/manual/en/function.similar-text.php

In an ideal world we would rather use https://www.php.net/manual/en/function.levenshtein.php , which we actually used at first. But it's several times slower. It also scales worse: with long logs it's ridiculously slow. In our experience the similar_text function scales more linearly.
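For readers unfamiliar with Levenshtein distance, here is a minimal Python sketch of the textbook dynamic-programming version. It is not PHP's implementation, only an illustration of why the cost grows quickly with long logs: the inner loop runs once for every pair of characters, so comparing two logs of 2,000 characters each means about four million cell updates.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]
```

Here levenshtein("ab", "cd") is 2, since both characters have to be substituted.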