Data synchronization

From Project-GC
Revision as of 12:46, 12 November 2020 by magma1447 (3305483) (talk | contribs) (Fixed link to Paid membership)



This page is a Work in progress and needs severe fixes.

Feel free to contribute by editing the page. When it contains the needed information in a readable and decently formatted form, remove the FIXME template-tag.

Reason: Work in progress



Origin of data

Project-GC's statistics are based on data from Geocaching.com. The data is fetched via the Geocaching LIVE API. Project-GC uses a combination of the official API available to general Geocaching Partners and a private Enterprise API.

Geocaching HQ still owns the data, and Project-GC pays royalties for the right to fetch and use it the way it does.

Fetching of data

Geocaching data is continuously updated in Project-GC's databases via the API mentioned above. This is done using several parallel methods:

  • Daily fetch of all newly published Geocaches.
  • Continuous updating of finds for Geocachers. See Refreshing profiles for more details.
  • Regular updates of Geocache information. See Refreshing Geocache data for more details.

Refreshing profiles

Geocaching profiles are continuously updated in the background according to a rule set.

When a Geocaching profile is refreshed, all new and updated logs are fetched from the API, along with metadata about the profile itself, such as the name of the Geocacher.

Rules

Whenever a user visits the Project-GC website, the system also checks when that user was last updated. For a paying member, another update is triggered if the data is more than 1 hour old. This update is a job that gets queued in the background, so it might take a few minutes before it starts.
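The visit-triggered rule can be sketched roughly as follows. This is only an illustration of the rule as described here, not Project-GC's actual code: the function name and parameters are hypothetical, and only the 1 hour threshold comes from this page.

```python
from datetime import datetime, timedelta

# From the text: a paying member's data is refreshed on visit
# when it is more than 1 hour old.
PAYING_MAX_AGE = timedelta(hours=1)


def should_queue_refresh(is_paying_member: bool,
                         last_updated: datetime,
                         now: datetime) -> bool:
    """Return True if a website visit should queue a background refresh.

    Hypothetical helper: this page only describes the visit-triggered
    rule for paying members, so other users return False here.
    """
    if not is_paying_member:
        return False
    return now - last_updated > PAYING_MAX_AGE


visit = datetime(2020, 11, 12, 12, 46)
print(should_queue_refresh(True, visit - timedelta(hours=2), visit))
print(should_queue_refresh(True, visit - timedelta(minutes=30), visit))
```

A True result only means that a job is queued. As noted above, the refresh itself may start a few minutes later.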

A Geocache with new logs will also get scheduled to have its metadata (difficulty, terrain, size, country ...) updated.

Refreshing Geocache data

The refreshing of Geocaching profiles described above makes sure that Project-GC's users have frequently updated data, but some Geocaches might be left out because they have only been logged by Geocachers who are not very active on Project-GC. Therefore Project-GC also refreshes Geocaches directly.

How often a Geocache is refreshed is based on a mix of several variables, for example:

  • Last found
  • Hidden date
  • Disabled/Archived state
  • Number of logs

When a Geocache is refreshed, its metadata is updated in the database. All new and updated logs are also updated in the system.

Log data

The log entries fetched are the same regardless of whether they are fetched from Geocaching profiles or from Geocache data. These are just two different angles for retrieving the same information. If Geocaching profile X has found Geocache GCX, then GCX automatically also has a log from Geocaching profile X.
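The idea that both angles lead to the same log can be illustrated with a small sketch. It assumes each log carries a unique identifier (called `log_id` here, a hypothetical field name), so a log fetched once via a profile and once via a Geocache is stored only once. All types and function names below are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class Log(NamedTuple):
    log_id: str   # hypothetical unique identifier of a log entry
    profile: str  # Geocaching profile that wrote the log
    gc_code: str  # Geocache the log belongs to


def merge_logs(from_profiles: Iterable[Log],
               from_caches: Iterable[Log]) -> Dict[str, Log]:
    """Merge logs fetched from both angles, keeping one copy per log_id."""
    merged: Dict[str, Log] = {}
    for log in list(from_profiles) + list(from_caches):
        merged[log.log_id] = log
    return merged


def finders_by_cache(logs: Iterable[Log]) -> Dict[str, Set[str]]:
    """Group the logging profiles by Geocache code."""
    result: Dict[str, Set[str]] = defaultdict(set)
    for log in logs:
        result[log.gc_code].add(log.profile)
    return dict(result)


# Log L1 is seen from both angles, L2 only via the Geocache refresh.
via_profile = [Log("L1", "X", "GCX")]
via_cache = [Log("L1", "X", "GCX"), Log("L2", "Y", "GCX")]
merged = merge_logs(via_profile, via_cache)
print(len(merged))  # 2
print(finders_by_cache(merged.values()))
```

In this example the log from profile Y is only picked up because GCX itself was refreshed, which is the gap that Geocache refreshing is meant to close.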

Databases

At this point, most Geocache data and logs exist in the primary database cluster and are fairly up to date. Most of the data will only be hours old in the database, but a fair share is expected to be 24-36 hours old.

However, Project-GC has more than one database cluster. Most statistics on the website are created from data in another database cluster. The primary database cluster replicates its data into the secondary cluster every fourth hour, adding roughly 4 hours of additional latency. As an example, most top lists are based on this secondary cluster, while Profile stats are not.

As a technical note, the primary database cluster is a row-based relational database. The secondary cluster is built for data harvesting and is a column-oriented DBMS.
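To get a feeling for how the replication delay adds up, the arithmetic can be sketched as below. This is a simplified model under stated assumptions: the 4 hour interval is taken from this page, the time the copy itself takes is ignored, and the function name is hypothetical.

```python
from datetime import timedelta
from typing import Tuple

# From the text: the primary cluster is replicated every fourth hour.
REPLICATION_INTERVAL = timedelta(hours=4)


def secondary_age_range(primary_age: timedelta) -> Tuple[timedelta, timedelta]:
    """Best and worst case age of data as seen in the secondary cluster.

    Best case: a replication has just finished, so no delay is added.
    Worst case: the next replication is almost a full interval away.
    """
    return primary_age, primary_age + REPLICATION_INTERVAL


# Data that is 2 hours old in the primary cluster.
best, worst = secondary_age_range(timedelta(hours=2))
print(best, worst)
```

With the 24-36 hour figure mentioned above, the same calculation gives up to roughly 40 hours for the oldest data in the secondary cluster.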

Statistics

As mentioned in Databases, most statistics are based on a secondary database cluster. Even if Project-GC itself has up-to-date profile data, the secondary database cluster might still have the old data. This is because a full replica of the data set is copied every fourth hour.

Most top lists are using this secondary cluster. Basically everything that gets heavily computed (data harvesting) uses the secondary cluster, while more raw fetches use the primary one. Profile stats is an exception.

It is also worth mentioning that Lab caches generally aren't included in statistics. They aren't technically compatible, and including them would be very complex. Again, Profile stats can be an exception.

Finally, all generated statistics are cached. The caching period may vary, but anything between 5 minutes and 1 hour is very common. So if two people ask for the same statistic, it is only computed the first time, and the second person receives a cached version. The downside is that this can add more latency and show older data in some cases.
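The caching behaviour described above corresponds to a time-to-live (TTL) cache. The sketch below is a minimal in-memory version for illustration only. The class, its method names and the key format are hypothetical, and nothing here reflects how Project-GC actually stores its cache.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the example can control time
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: reuse the cached value
        value = compute()    # missing or expired: compute and store
        self._entries[key] = (now, value)
        return value


computations = []


def expensive_statistic() -> int:
    computations.append(1)
    return len(computations)


fake_time = [0.0]
cache = TTLCache(ttl_seconds=300, clock=lambda: fake_time[0])  # 5 minutes
first = cache.get_or_compute("toplist", expensive_statistic)
second = cache.get_or_compute("toplist", expensive_statistic)  # cached
fake_time[0] = 301.0                                           # TTL passed
third = cache.get_or_compute("toplist", expensive_statistic)   # recomputed
print(first, second, third)  # 1 1 2
```

The second request returns the value computed for the first one, which is exactly why a viewer can see data that is up to one caching period older than what the database holds.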

Profile stats

  • Generated from the current state at the date written in the header.
  • Lab caches are included depending on the setting and on paying/freemium status.
  • Cached for 7 days for freemium members, 24 hours for paying members.
  • The cache is not shared between foreign and domestic viewers.
  • Based on the primary database cluster.

Challenge checkers

Challenge checkers use a mix of the different database clusters. If the checker script fetches the user's finds from Project-GC, it uses the primary database cluster. Some other, more advanced API methods might use the secondary database cluster instead, for performance reasons.

Special numbers

Some statistics are special because they are based on pre-calculated values, usually because calculating them in real time would be extremely hard. Project-GC calculates the following data for every Geocacher in the background on a regular basis. This normally happens daily, with some exceptions: usually, some users get calculated more often.

  • D/T loops
  • Streaks
  • Calendar loops

This does not affect Profile stats. If numbers like these are needed, Profile stats calculates them itself instead of using pre-calculated data. This is also necessary because Profile stats may have Lab cache data merged in.