"Log Length, words: Total words" numbers have gone down. Why?

+2 votes
2,407 views
I have noticed that the "Log Length, words: Total words" numbers have gone down in the last few days, even though the number of logs has gone up. What is happening here?
asked Jan 28, 2016 in Support and help by huhugrub (270 points)

1 Answer

+3 votes
 
Best answer

This is correct. Has it been a big change for you?

We are currently implementing a new algorithm for calculating words, which we believe to be more correct. The process of recalculating the data of 500M logs will probably take a few weeks.

So, why have we changed it then ...

Calculating the number of words in a log might sound very simple, but it's not. I'll give some examples, and you (whoever reads this, not specifically the one who asked the question) can think about how many words each one is:

  • X X X X X X X
  • --------------
  • fri3nd
  • 1.0
  • 100
  • 5/5
  • foo-bar
  • friend(name)
  • måste

And we haven't even looked at Chinese or similar languages yet. Solving this in a fair and good way with code is quite hard. I am quite sure that if we posted an example log and let users count the words, we would get quite a few different results. Something with a dash in it might be considered one word in some cases, two in others, and none in yet others.
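To make the ambiguity concrete, here is a minimal Python sketch that applies three plausible definitions of a "word" to the examples above. The three rules and their function names are illustrative assumptions only; none of them is Project-GC's old or new algorithm.

```python
import re

# The example strings from the list above ("\u00e5" is the letter å).
EXAMPLES = [
    "X X X X X X X", "--------------", "fri3nd", "1.0", "100",
    "5/5", "foo-bar", "friend(name)", "m\u00e5ste",
]

def count_whitespace(text: str) -> int:
    """Rule 1: every run of non-space characters is one word."""
    return len(text.split())

def count_word_chars(text: str) -> int:
    """Rule 2: every run of letters, digits or underscores is one word."""
    return len(re.findall(r"\w+", text))

def count_ascii_letters(text: str) -> int:
    """Rule 3: only runs of unaccented A-Z / a-z letters are words."""
    return len(re.findall(r"[A-Za-z]+", text))

RULES = (count_whitespace, count_word_chars, count_ascii_letters)

def compare(text: str) -> tuple:
    """Return the count under each rule, in the order of RULES."""
    return tuple(rule(text) for rule in RULES)

for example in EXAMPLES:
    print(f"{example!r:18} {compare(example)}")
```

Under these assumed rules, "fri3nd" comes out as 1, 1 and 2 words, "1.0" as 1, 2 and 0, and "måste" as 1, 1 and 2, so even very simple definitions disagree on most of the examples.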

answered Jan 28, 2016 by ganja1447 (Admin) (188,090 points)
selected Jan 30, 2016 by huhugrub
As a side note: my own average number of words has changed from 161.1480 to 159.7495. A very slight change. The only logs that have a significant difference for me are logs on challenge caches where a "proof" has been posted.
Good to know, and good job making this more fair. My average went from almost 80 to 75, if I remember correctly.
But! It seems weird to me, at least in some cases. My shortest log (1 word) contains at least three valid ones: http://coord.info/GL16XNQV
That might be some bug?
Yes, the word count might vary depending on exactly how a "word" is defined. I did a little testing using the following 6 tools:
* Microsoft Word
* wordcounter.net
* wordcountertool.com
* wordcounttool.com
* wordcounttools.com
* Microsoft Excel

The count results from these tools on the examples are, respectively:
    X X X X X X X  => 7, 7, 7, 7, 7, 7 => (no variation)
    --------------  => 1, 0, 1, 1, 0, 1 => (0 or 1)
    fri3nd  => 1, 1, 1, 2, 1, 1 => (1 or 2)
    1.0  => 1, 2, 1, 0, 1, 1 => (0, 1 or 2)
    100  => 1, 1, 1, 0, 1, 1 => (0 or 1)
    5/5  => 1, 2, 1, 0, 2, 1 => (0, 1 or 2)
    foo-bar  => 1, 1, 1, 1, 1, 1 => (no variation)
    friend(name)  => 1, 1, 1, 2, 2, 1 => (1 or 2)
    måste  => 1, 1, 1, 2, 1, 1 => (1 or 2)

Total counts for the 9 examples together  => 15, 16, 15, 15, 16, 15 => (15 or 16)
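The totals can be re-derived from the rows above. A small Python check, with the counts copied from the table (the variable and function names are just for illustration):

```python
# Per-tool counts copied from the table above, in tool order:
# Microsoft Word, wordcounter.net, wordcountertool.com,
# wordcounttool.com, wordcounttools.com, Microsoft Excel.
COUNTS = {
    "X X X X X X X": (7, 7, 7, 7, 7, 7),
    "--------------": (1, 0, 1, 1, 0, 1),
    "fri3nd": (1, 1, 1, 2, 1, 1),
    "1.0": (1, 2, 1, 0, 1, 1),
    "100": (1, 1, 1, 0, 1, 1),
    "5/5": (1, 2, 1, 0, 2, 1),
    "foo-bar": (1, 1, 1, 1, 1, 1),
    "friend(name)": (1, 1, 1, 2, 2, 1),
    "m\u00e5ste": (1, 1, 1, 2, 1, 1),  # "\u00e5" is the letter å
}

def totals_per_tool(counts: dict) -> list:
    """Sum each tool's column over all examples."""
    return [sum(column) for column in zip(*counts.values())]

def disputed(counts: dict) -> list:
    """Examples on which at least two tools disagree."""
    return [example for example, row in counts.items() if len(set(row)) > 1]

print(totals_per_tool(COUNTS))  # [15, 16, 15, 15, 16, 15]
print(len(disputed(COUNTS)))    # 7
```

So the tools disagree on 7 of the 9 examples, yet the totals differ by only one word, because the differences largely cancel out.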

Could you advise how Project-GC has counted the above using the previous algorithm and the new algorithm?  It will be useful to understand the rules used.

At the end of the day, it probably doesn't matter a great deal as long as it is consistently applied.  However, it is useful for the Project-GC algorithm to be compatible with that used for Kyle's BadgeGen.  Do you know how the old and new algorithms compare with BadgeGen's?
Thanks for the explanation... I have the same problem. My total word count has gone down from 53 000 to 50 600 in the last two weeks. I hope that the reason is the new algorithm and that my current number is now correct :)
Comment on Jakuje's post:
It seems that in your case single-character words, like "s" and "w" in Slavic languages, "i" in Scandinavian languages, or "a" in English, are not counted as words.
You would need to change your orthography to get the correct word count. ;-)
As far as I know, Micros* very simply counts everything between spaces as a word.
What do you base your statement upon?

<?php
// GC_LogLength() is not a PHP built-in; its definition is not shown in this thread.
$log = 'this is a banana';
list($words, $characters) = GC_LogLength($log);
echo $words . PHP_EOL;

$ php -f test.php
4
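
For comparison, a counter that really did skip single-character words would give 3 for this log, not 4. A minimal Python sketch with hypothetical function names (this is not Project-GC's code, and both rules are assumptions made for illustration):

```python
def count_all(log: str) -> int:
    """Count every whitespace-separated token as a word."""
    return len(log.split())

def count_skipping_single_chars(log: str) -> int:
    """Hypothetical rule: ignore tokens that are only one character long."""
    return sum(1 for token in log.split() if len(token) > 1)

log = "this is a banana"
print(count_all(log))                    # 4
print(count_skipping_single_chars(log))  # 3
```

The result of 4 in the test above is therefore consistent with "a" being counted as a word.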
...