Return to Project-GC

Question

Unclear change in the average length of the logs.

I have no idea how the algorithm works, but I tried to make a guess. My guess is that Word counts the number of blocks of characters separated by spaces, because there are 624 spaces in your log.
Removing extra spaces gives 604 spaces, removing numbers gives 593, and removing GC codes and other non-letter tokens gives 589 spaces. Another 94 words have to be removed to reach the 495 words.
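The counting steps above can be sketched roughly as follows. This is only an illustration of the guess, not Project-GC's actual algorithm: the function name `count_steps`, the two regular expressions, and the sample text are all assumptions made up for the example.

```python
import re

# Assumed shape of a geocache code: "GC" followed by letters and digits.
GC_CODE = re.compile(r"^GC[0-9A-Z]+$")
# Plain numbers plus simple dates and times such as 2016-01-31 or 12:30.
NUMBER = re.compile(r"^\d+([.,:/-]\d+)*$")

def count_steps(text):
    """Return the token count after each filtering step."""
    tokens = text.split()  # split on any run of whitespace
    no_numbers = [t for t in tokens if not NUMBER.match(t)]
    no_codes = [t for t in no_numbers
                if not GC_CODE.match(t.strip(".,;:!?()"))]
    # Drop tokens that contain no letter at all, e.g. "-" or "!".
    letters_only = [t for t in no_codes if any(ch.isalpha() for ch in t)]
    return len(tokens), len(no_numbers), len(no_codes), len(letters_only)

sample = "Found GC12345 on 2016-01-31 with 3 friends  -  TFTC !"
print(count_steps(sample))  # (10, 8, 7, 5)
```

Each step only ever lowers the count a little, which is why a drop from 589 to 495 needs some other explanation.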

My guess is that the code written to remove text padded with extra spaces, like TFTC written as T F T C, is the problem, because Czech has a lot of one- and two-letter words.
There are 48 words with only one letter and 80 with two, which gives a sum of 128 words. If some of them are counted, the result might be 495 words.
But I might be totally wrong.
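To make this guess concrete, here is a purely hypothetical filter of the kind described. Nothing in this thread shows that Project-GC works this way; the function `collapse_spaced_letters` and its `min_run` parameter are invented for the example. It merges runs of one-letter tokens, which repairs "T F T C" but also swallows adjacent one-letter Czech words.

```python
def collapse_spaced_letters(tokens, min_run=2):
    """Merge each run of at least min_run one-letter tokens into one token."""
    result, run = [], []
    for token in tokens + [None]:  # the trailing None flushes the last run
        if token is not None and len(token) == 1 and token.isalpha():
            run.append(token)
            continue
        if len(run) >= min_run:
            result.append("".join(run))  # e.g. T, F, T, C becomes TFTC
        else:
            result.extend(run)  # a lone one-letter word is kept as is
        run = []
        if token is not None:
            result.append(token)
    return result

print(collapse_spaced_letters("T F T C".split()))  # ['TFTC']

czech = "Byl jsem u lesa a v parku".split()  # 7 words
print(len(collapse_spaced_letters(czech)))   # 6, because "a v" became "av"
```

A filter like this would only explain part of the gap, since it never touches two-letter words.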

I also looked at my longest log, which has 677 words according to PGC and 702 words by space count, a difference of 25 words. 15 of those could be accounted for by the Groundspeak edit string.
Not all of the numbers can have been removed, because I have 69 number matches in the text, with 17 in GC codes and 20 in dates.
The word count reduction is quite reasonable on my log and quite large on yours. One big difference is that one is in Czech and one in Swedish. It is not reasonable that the algorithm works well in Swedish but not in Czech.

I also had to check whether Czech is a wordy language, but your original log came to 651 words in Swedish and 653 in English. Those are Google Translate versions, and that might skew the result too.

commented Feb 1, 2016 by Target. (Expert) (104k points)

1 Answer

Answer 1 · 2016-02-01T02:03:43+0000

There seems to be some form of issue actually.

"přemýšlel a hledal v mapě" has become one word for example. We will look into it.

UPDATE:

A fix has been made and we will deploy it soon. The new result for this log will be 606.

It was a UTF-8 issue. Since your language has so many accented characters, it got a bit confused and thought it was garbage. Not because of the characters themselves, but because it sometimes broke one character into two much weirder characters. The reason is a bit techy, but these characters are represented by more than one byte. If one then splits between them, it becomes two completely different characters.
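A small sketch of that effect, using a word from the example above. The helper names `split_bytes` and `decode_lossy` are made up for illustration, and how the real code decoded the pieces is not stated here; this only shows what happens when a byte string is cut in the middle of a multi-byte character.

```python
def split_bytes(text, cut):
    """Encode text as UTF-8 and cut the byte string at position cut."""
    data = text.encode("utf-8")
    return data[:cut], data[cut:]

def decode_lossy(chunk):
    """Decode UTF-8, replacing undecodable bytes with U+FFFD."""
    return chunk.decode("utf-8", errors="replace")

word = "přemýšlel"
print(len(word), len(word.encode("utf-8")))  # 9 characters, 12 bytes

# "p" is one byte and "ř" is two (0xC5 0x99), so cutting after
# byte 2 lands in the middle of "ř".
left, right = split_bytes(word, 2)
print(repr(decode_lossy(left)))   # 'p\ufffd'
print(repr(decode_lossy(right)))  # '\ufffdemýšlel'
```

Cutting only on character boundaries, that is, slicing the decoded string rather than the raw bytes, avoids the problem.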

Next problem then. We will have to start over with calculating log lengths. It will take over a week to catch up to where we are now. We had done about 30% so far, and now it's back to zero.

Regardless of that, I am very happy that you could post a good example log of where it actually went very wrong.
