+4 votes
2.0k views
During the last week, my average log length fell from 98 to 90 words. I understand you have a new algorithm for counting words, but I did not expect such a significant difference, so I am afraid that this new algorithm gives wrong results. Please consider these two examples. First: Cache GLJ22CCW: my longest log for this cache has 625 words according to Microsoft Word, but Project-GC gives only 495 words.

Second: last week I wrote at least 10 logs, all around 450 words long, but my average log length only moved up from 90 to 92. How is that possible? Thank you for the answer. I understand you will probably not count smileys or numbers.
in Feature requests by PEJATEKL (160 points)
I have no idea how the algorithm works, but I tried to make a guess. My guess is that Word counts the number of blocks of characters separated by spaces, because there are 624 spaces in your log.
Removing spaces leaves 604 spaces, removing numbers leaves 593, and removing GC codes and other non-letters leaves 589. Another 94 words have to be removed to reach the 495 words.
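
The guess above can be tried out with a small sketch. It assumes, as the comment does, that Word counts whitespace-separated blocks, and it uses a made-up filter (drop any token with a digit or without a letter) that only approximates "numbers, GC codes and other non-letters". The sample sentence and the code GC12345 are hypothetical, and none of this is Project-GC's actual algorithm.

```python
def count_whitespace_tokens(text: str) -> int:
    """Count blocks of characters separated by whitespace."""
    return len(text.split())


def count_letter_tokens(text: str) -> int:
    """Count only tokens that contain a letter and no digit.

    Assumption for illustration: this drops numbers, dates, GC codes
    and smileys, which is only a guess at what a filter might do.
    """
    count = 0
    for token in text.split():
        has_letter = any(ch.isalpha() for ch in token)
        has_digit = any(ch.isdigit() for ch in token)
        if has_letter and not has_digit:
            count += 1
    return count


# Hypothetical sample mixing a GC code, a date, a smiley and Czech words.
sample = "Found GC12345 on 2015-06-01 :-) přemýšlel a hledal v mapě"
print(count_whitespace_tokens(sample))  # 10
print(count_letter_tokens(sample))      # 7
```

The one-letter Czech words "a" and "v" survive this filter, so a filter of this kind alone would not explain a large drop for a Czech log.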

My guess is that the code created to remove text with extra spaces, like TFTC written as T F T C, is the problem, because Czech has a lot of one- and two-letter words.
There are 48 with only one letter and 80 with two, which gives a sum of 128 words. If some of them are counted, the result might be 495 words.
But I might be totally wrong.

I also looked at my longest log, which has 677 words according to PGC and 702 words by space count, a difference of 25 words. 15 of those could be accounted for by the Groundspeak edit string.
Not all numbers can have been removed, because I have 69 number matches in the text, with 17 in GC codes and 20 in dates.
The reduction in word count is quite reasonable on my log and quite large on yours. One big difference is that one is in Czech and one in Swedish. It is not reasonable that the algorithm works well in Swedish but not in Czech.

I also wanted to check whether Czech is a wordy language, but your original log came to 651 words in Swedish and 653 in English. That is with Google Translate, which might skew the result as well.
Thank you so much for your view. I tried to follow this idea and tested my log :-) I deleted every space between a word and a one-letter word, then deleted all figures and smileys, and the number fell from 625 to 539. When I deleted the spaces between a word and a two-letter word, the number fell under 450. I do not understand :-) Nevertheless, thank you for your interest.
Thanks for clarifying this issue.  I also had a change in my average words per log although it wasn't as drastic.  I looked at it again today and it looks like something got changed again and that my "issue" has resolved.  Keep up the many good things y'all are doing here at Project-GC.

1 Answer

+3 votes

There seems to be some form of issue actually.

"přemýšlel a hledal v mapě" has become one word for example. We will look into it.

UPDATE:

A fix has been made and we will deploy it soon. The new result for this log will be 606.

It was a UTF-8 issue. Since your language has so many accented characters, the algorithm got a bit confused and thought the text was garbage. Not because of the characters themselves, but because it sometimes broke one character into two much weirder characters. The reason is a bit techy, but these characters are represented by more than one byte. If the text is split between those bytes, the result is two completely different characters.
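
To illustrate the general mechanism (not Project-GC's actual code, which is not shown in this thread), here is a minimal Python sketch. It takes the word "přemýšlel" from the example above, shows that it occupies more bytes than characters in UTF-8, and shows what happens when the two bytes of "ř" are treated as separate single-byte characters. The helper `safe_cut` is a hypothetical name introduced only for this example.

```python
word = "přemýšlel"
raw = word.encode("utf-8")

# Three of the nine letters (ř, ý, š) need two bytes each in UTF-8.
print(len(word), len(raw))  # 9 12

# The two bytes of "ř", read one byte at a time (here as Latin-1),
# turn into two unrelated characters.
print(repr(raw[1:3].decode("latin-1")))

# Cutting the byte string in the middle of "ř" leaves invalid UTF-8.
try:
    raw[:2].decode("utf-8")
except UnicodeDecodeError as exc:
    print("cannot decode a half character:", exc.reason)


def safe_cut(data: bytes, index: int) -> int:
    """Move a cut position back to the nearest character boundary.

    UTF-8 continuation bytes always look like 10xxxxxx, so a cut is
    only safe when the byte at the cut position is not one of them.
    """
    while 0 < index < len(data) and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


cut = safe_cut(raw, 2)
print(raw[:cut].decode("utf-8"))  # p
```

Any step that slices, truncates or pattern-matches on bytes rather than on decoded characters can hit this, which is why text with many accented letters is affected far more than plain ASCII text.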

Next problem then. We will have to start over with calculating log lengths. It will take over a week to catch up to where we are now. We had done about 30% so far, and now it's back to zero.

Regardless of that, I am very happy that you could post a good example log where it actually went very wrong.

by magma1447 (Admin) (241k points)
edited by magma1447 (Admin)
Ah, the dreaded curse of multi-byte UTF-8 sequences, one of the more obscure parts of Unicode that few people know about until they have to start dealing with problems like these!
A small update. We have improved the job that recalculates the log lengths, made it faster and parallelized it. The current pace is almost 7 million logs per hour. Doing the math, recalculating all of them will then take 500/7/24 ≈ 3 days.
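
The estimate can be spelled out as a short calculation. The figure of 500 is read here as roughly 500 million logs in total, which is an inference from the units of the pace (millions per hour) rather than something stated outright.

```python
total_logs_millions = 500   # inferred: about 500 million logs in total
pace_millions_per_hour = 7  # "almost 7 million per hour"

hours = total_logs_millions / pace_millions_per_hour
days = hours / 24

print(f"{hours:.1f} hours, {days:.2f} days")  # 71.4 hours, 2.98 days
```

Since the real pace is slightly under 7 million per hour, the actual time would be slightly over this figure.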
Thank you so much for your response. Yes, I understand every language has its particularities. The Czech language has some one-letter words, such as the conjunctions "a" = and, "i" = and, and prepositions such as "k" = to, "v" = into, "o" = about, "s" = with, "z" = from, "u" = next to. These are really words :-) But it is only a statistic :-) Thanks for your consideration.
That I did not know! But it wasn't exactly the issue here. It's very hard to explain the UTF-8 issues we were having, but at least they are fixed.

But regarding those one-letter words: normally, one-letter words are counted by the new algorithm we have implemented. The algorithm will definitely have a lot of odd effects that people will find if they count every word in their logs, though. The aim is to get as close as possible to reality while getting rid of data that really isn't words.
I fully understand. Some cachers might "simplify" their logs to reach better statistics (or reach them faster). It must be difficult to devise an algorithm that recognizes fake words in a log, such as "t h a n k  y o u  f  o r  t he  c a ch e" :-) That is not my case :-) No need to explain UTF-8 to me, I am only a user :-) Thanks a lot.
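
One naive way to approach the spaced-out letters problem, sketched purely as an illustration: merge runs of single-letter tokens into one word, but only when the run is long enough that it is unlikely to be ordinary text. The function name `collapse_spaced_letters` and the threshold `min_run=3` are assumptions made up for this example; the thread does not reveal how Project-GC actually handles this.

```python
def collapse_spaced_letters(text: str, min_run: int = 3) -> list[str]:
    """Merge runs of at least `min_run` single-letter tokens into one word.

    Shorter runs are kept as they are, so isolated one-letter words
    (such as Czech "a" or "v") still count as words.
    """
    words: list[str] = []
    run: list[str] = []

    def flush() -> None:
        # Emit the pending run either merged or token by token.
        if len(run) >= min_run:
            words.append("".join(run))
        else:
            words.extend(run)
        run.clear()

    for token in text.split():
        if len(token) == 1 and token.isalpha():
            run.append(token)
        else:
            flush()
            words.append(token)
    flush()
    return words


print(collapse_spaced_letters("T F T C"))                    # ['TFTC']
print(collapse_spaced_letters("přemýšlel a hledal v mapě"))  # 5 words, unchanged
print(len(collapse_spaced_letters("t h a n k  y o u  f  o r  t he  c a ch e")))  # 6
```

The weaknesses are easy to see: three genuine one-letter words in a row would be merged by mistake, and irregular spacing such as "t he" or "c a ch e" partly escapes the merge. That fits the point made in the thread that any such heuristic will have odd effects.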
...