Data synchronization
Revision as of 12:49, 12 November 2020
Origin of data
Project-GC's statistics are based on data from Geocaching.com. The data is fetched via the Geocaching LIVE API. Project-GC uses a combination of the official API available to general Geocaching Partners and a private Enterprise API.
The data is still owned by Geocaching HQ, and Project-GC pays royalties for the right to fetch and use it in this way.
Fetching of data
Geocaching data is continuously updated in Project-GC's databases via the API mentioned above. This is done using several parallel methods:
- Daily fetch of all newly published Geocaches.
- Continuous updating of finds for Geocachers. See Refreshing profiles for more details.
- Regular updates of Geocache information. See Refreshing geocache data for more details.
Refreshing profiles
Geocaching profiles are continuously updated in the background based on a rule set.
When a Geocaching profile is refreshed, all new and updated logs are fetched from the API, along with metadata about the profile itself, such as the name of the Geocacher.
Rules
- Geocachers who haven't been refreshed in 24 hours and are paying members are refreshed.
- Geocachers who haven't been refreshed in 2 days but have used Project-GC's website in the last week are refreshed.
- Geocachers who haven't been refreshed in 7 days but have used Project-GC's website in the last 3 months are refreshed.
- Geocachers whose Profile stats have been viewed recently are refreshed.
- ... and so on; there are ~10 rules like this in place.
In addition, whenever a user visits Project-GC's website, the system checks when that user was last updated. If the user is a paying member, another update is triggered if the data is more than 1 hour old. However, this job is queued in the background, and it might take a few minutes before it starts.
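The time-based rules above can be sketched in code. This is only an illustration of the logic as described on this page, not Project-GC's implementation: the `Geocacher` record, its field names, and both function names are hypothetical, "3 months" is approximated as 90 days, and only three of the roughly ten rules are covered.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Geocacher:
    # Hypothetical record; field names are illustrative, not Project-GC's schema.
    last_refreshed: datetime
    is_paying: bool
    last_site_visit: Optional[datetime] = None


def needs_background_refresh(g: Geocacher, now: datetime) -> bool:
    """Apply the three time-based rules from the list (the real rule set is larger)."""
    age = now - g.last_refreshed
    # Paying members: refresh when the data is older than 24 hours.
    if g.is_paying and age > timedelta(hours=24):
        return True
    if g.last_site_visit is not None:
        since_visit = now - g.last_site_visit
        # Visited within the last week: refresh when older than 2 days.
        if age > timedelta(days=2) and since_visit <= timedelta(days=7):
            return True
        # Visited within the last 3 months (assumed 90 days): refresh when older than 7 days.
        if age > timedelta(days=7) and since_visit <= timedelta(days=90):
            return True
    return False


def needs_refresh_on_visit(g: Geocacher, now: datetime) -> bool:
    """Extra check on a site visit: paying members with data older than 1 hour."""
    return g.is_paying and (now - g.last_refreshed) > timedelta(hours=1)
```

In this sketch a positive result would only mean that a refresh job gets queued; as noted above, the actual update may start a few minutes later.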
A Geocache with new logs will also get scheduled to have its metadata (difficulty, terrain, size, country ...) updated.
Note that a Geocacher who is a paying member is more likely to have up-to-date data at Project-GC when returning to the site, while freemium users might notice that their data isn't as fresh.
Refreshing Geocache data
The refreshing of Geocaching profiles described above makes sure that Project-GC's users have frequently updated data, but some Geocaches might be left out because they have only been logged by Geocachers who are not very active with Project-GC. Therefore Project-GC also refreshes Geocaches directly.
How often a Geocache is refreshed is based on a mix of several variables, for example:
- Last found
- Hidden date
- Disabled/Archived state
- Number of logs
When a Geocache is refreshed, its metadata is updated in the database, and all new or updated logs are stored in the system as well.
Log data
The log entries fetched are the same regardless of whether they are fetched via Geocaching profiles or via Geocache data. These are just two different angles for retrieving the same information. If Geocaching profile X has found Geocache GCX, then GCX automatically also has a log from Geocaching profile X.
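As a rough illustration of this idea, the sketch below stores each log once and exposes it from both angles. The `LogEntry` fields, the `log_id` used for de-duplication, and the `LogStore` class are all assumptions made for the example; the page does not describe how Project-GC actually stores logs.

```python
from typing import Dict, List, NamedTuple


class LogEntry(NamedTuple):
    log_id: int    # hypothetical unique id, used here to de-duplicate
    profile: str   # the Geocaching profile that wrote the log
    gc_code: str   # the Geocache the log belongs to
    log_type: str


class LogStore:
    """Keeps one copy of each log, whichever fetch path delivered it."""

    def __init__(self) -> None:
        self._logs: Dict[int, LogEntry] = {}

    def upsert(self, entry: LogEntry) -> None:
        # A log fetched via a profile refresh and again via a geocache
        # refresh lands on the same key, so it is stored only once.
        self._logs[entry.log_id] = entry

    def by_profile(self, profile: str) -> List[LogEntry]:
        return [e for e in self._logs.values() if e.profile == profile]

    def by_geocache(self, gc_code: str) -> List[LogEntry]:
        return [e for e in self._logs.values() if e.gc_code == gc_code]
```

Inserting the same log twice, once per fetch path, leaves a single entry that is visible both under profile X and under GCX.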
Databases
At this point, most Geocache data and logs exist in the primary database cluster and are fairly up-to-date. Most of the data in the database will only be hours old, but a fair share is expected to be 24-36 hours old.
However, Project-GC has more than one database cluster. Most statistics on the website are created from data in another database cluster. The primary database cluster replicates its data into the secondary cluster every fourth hour, adding ~4 hours of additional latency. For example, most top lists are based on this secondary cluster, while Profile stats are not.
As a technical note, the primary database cluster is a row-based relational database. The secondary cluster is built for data harvesting and is a column-oriented DBMS.
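The effect of the replication interval on data age can be worked through with a small calculation. This is a simplified sketch: the function name is made up for the example, and it assumes the copy itself takes no time, which the page does not state.

```python
from typing import Tuple

REPLICATION_INTERVAL_H = 4  # the primary cluster is replicated every fourth hour


def secondary_age_bounds(primary_age_h: float) -> Tuple[float, float]:
    """Return (best case, worst case) age in hours of data in the secondary cluster.

    Best case: the replica was just taken, so the data is as old as in the primary.
    Worst case: the next replica is about to be taken, adding one full interval.
    """
    return (primary_age_h, primary_age_h + REPLICATION_INTERVAL_H)
```

For data that is 36 hours old in the primary cluster, this gives a range of 36 to 40 hours in the secondary cluster.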
Statistics
As mentioned in Databases, most statistics are based on a secondary database cluster. Even if Project-GC itself has up-to-date profile data, the secondary database cluster might still have the old data. This is because a full replica of the data set is copied only every fourth hour.
Most top lists use this secondary cluster. Basically everything that is heavily computed (data harvesting) uses the secondary cluster, while more raw fetches use the primary one. Profile stats is an exception.
It is also worth mentioning that Lab caches generally aren't included in statistics. They aren't technically compatible, and including them would be very complex. Again, Profile stats can be an exception.
Finally, all generated statistics are cached. The cache period varies, but anything between 5 minutes and 1 hour is very common. So if two people ask for the same statistic, it is only computed for the first request; the second person receives the cached version. The downside is that caching can add more latency and show older data in some cases.
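The caching behaviour described here is a time-to-live (TTL) cache. The following is a minimal sketch of that general technique, not Project-GC's code; the `TTLCache` class and its method names are hypothetical, and the clock is injectable only so the behaviour is easy to demonstrate.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Compute a value once per key and reuse it until it is older than the TTL."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            # Still fresh: the second requester gets the cached version.
            return hit[1]
        # Missing or expired: compute again and remember when.
        value = compute()
        self._store[key] = (now, value)
        return value
```

With a TTL of 300 seconds (5 minutes), a statistic requested twice within that window is computed only once, and the second requester may see data up to 5 minutes older than a fresh computation would give.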
Profile stats
- Generated from the current state at the date written in the header.
- Lab caches are included depending on a setting and on paying/freemium status.
- Cached for 7 days for freemium users and 24 hours for paying members.
- The cache is not shared between foreign and domestic viewers.
- Based on the primary database cluster.
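The caching rules in this list could be expressed roughly as follows. The sketch rests on two assumptions that the page does not confirm: that "domestic" means the profile owner viewing their own Profile stats and "foreign" means anyone else, and that the cache period depends on a single paying/freemium flag. Both function names are hypothetical.

```python
from datetime import timedelta
from typing import Tuple


def profile_stats_ttl(is_paying: bool) -> timedelta:
    """Cache period: 24 hours for paying members, 7 days for freemium users."""
    return timedelta(hours=24) if is_paying else timedelta(days=7)


def profile_stats_cache_key(profile_id: int, viewer_id: int) -> Tuple[int, str]:
    """Build a cache key that keeps domestic and foreign views apart.

    Assumption: a viewer looking at their own profile is "domestic".
    """
    view = "domestic" if viewer_id == profile_id else "foreign"
    return (profile_id, view)
```

Under these assumptions, the owner's view and a visitor's view of the same profile get separate cache entries, so refreshing one does not refresh the other.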
Challenge checkers
Challenge checkers use a mix of the different database clusters. If the Checker script fetches the user's finds from Project-GC, it uses the primary database cluster. Some other, more advanced API methods might use the secondary database cluster instead, for performance reasons.
Special numbers
Some statistics are special because they are based on pre-calculated values, usually because they would be extremely hard to calculate in real time. Project-GC calculates the following data for every Geocacher in the background on a regular basis. This normally happens daily, but there are exceptions; usually these are users whose values are calculated more often.
- D/T loops
- streaks
- calendar loops
This does not affect Profile stats. If numbers like these are needed, Profile stats calculates them itself instead of using pre-calculated data. This is also necessary because Profile stats may have Lab cache data merged in.