Data synchronization

Latest revision as of 17:36, 15 March 2021






Origin of data

Project-GC's statistics are based on data from Geocaching.com. The data is fetched via the Geocaching LIVE API. Project-GC uses a combination of the official API available to general Geocaching Partners and a private Enterprise API.

The data is still owned by Geocaching HQ, and Project-GC pays royalties to be allowed to fetch and use it the way it does.

Fetching of data

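Geocaching data is continuously being updated in Project-GC's databases via the mentioned API. This is done using several parallel methods:

  • Continuously fetching newly published Geocaches. Normal latency 0-35 seconds.
  • Continuously updating finds for Geocachers. See Refreshing profiles for more details.
  • Regular updates of Geocache information. See Refreshing geocache data for more details.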

Refreshing profiles

Geocaching profiles are continuously updated in the background according to a set of rules.

When a Geocaching profile is refreshed, all new and updated logs are fetched from the API, along with metadata about the profile itself, for example the name of the Geocacher.

Rules

  • Geocachers who are paying members and haven't been refreshed in 24 hours are refreshed.
  • Geocachers who haven't been refreshed in 2 days but used Project-GC's website in the last week are refreshed.
  • Geocachers who haven't been refreshed in 7 days but used Project-GC's website in the last 3 months are refreshed.
  • Geocachers whose Profile stats page has been viewed recently are refreshed.
  • ... and so on, there are ~10 rules like this in place.
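The first three rules above can be sketched as a single predicate. This is only an illustration under stated assumptions: the `Profile` class and its field names are hypothetical, "3 months" is approximated as 90 days, and the remaining rules (including the "viewed recently" one, which has no stated threshold) are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Profile:
    # All field names are hypothetical; the real schema is not documented here.
    last_refreshed: datetime
    is_paying: bool
    last_site_visit: Optional[datetime] = None


def is_due_for_refresh(profile: Profile, now: datetime) -> bool:
    """Return True if any of the three documented age rules matches."""
    age = now - profile.last_refreshed
    since_visit = None if profile.last_site_visit is None else now - profile.last_site_visit

    # Rule 1: paying members not refreshed in 24 hours.
    if profile.is_paying and age >= timedelta(hours=24):
        return True
    # Rule 2: not refreshed in 2 days, used the website within the last week.
    if since_visit is not None and since_visit <= timedelta(days=7) and age >= timedelta(days=2):
        return True
    # Rule 3: not refreshed in 7 days, used the website within the last 3 months (~90 days).
    if since_visit is not None and since_visit <= timedelta(days=90) and age >= timedelta(days=7):
        return True
    return False
```

A background job could run such a predicate over all known profiles and queue those for which it returns True.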

Also, whenever a user visits Project-GC's website, the system checks when that user was last updated. If the user is a paying member and the data is more than 1 hour old, another update is triggered. However, this is a job that gets queued in the background, and it might take a few minutes until it kicks in.
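The visit-triggered update can be pictured roughly as follows. This is a minimal sketch, not the actual implementation: the in-memory `refresh_queue` stands in for whatever background job queue is really used, and the function and parameter names are made up for illustration.

```python
from collections import deque
from datetime import datetime, timedelta

# Hypothetical in-memory stand-in for the background job queue.
refresh_queue = deque()


def on_site_visit(user: str, is_paying: bool, last_refreshed: datetime, now: datetime) -> bool:
    """Queue a refresh for a paying member whose data is more than 1 hour old.

    Returns True if a refresh is (or already was) queued for this user.
    """
    if is_paying and now - last_refreshed > timedelta(hours=1):
        if user not in refresh_queue:  # avoid queueing the same user twice
            refresh_queue.append(user)
        return True
    return False
```

Because the job is only queued here, the refreshed data shows up once a worker has picked it up, which matches the delay of a few minutes described above.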

A Geocache with new logs will also get scheduled to have its metadata (difficulty, terrain, size, country ...) updated.

As a result, a Geocacher who is a paying member is more likely to have up-to-date data at Project-GC when coming back to the site, while freemium users might notice that their data isn't as fresh.

Refreshing geocache data

The refreshing of Geocaching profiles mentioned above makes sure that Project-GC's users have frequently updated data, but some Geocaches might be left out because they have only been logged by Geocachers who are not very active on Project-GC. Therefore Project-GC also refreshes Geocaches directly.

How often a Geocache is refreshed is based on a mix of several variables, for example:

  • Last found
  • Hidden date
  • Disabled/Archived state
  • Number of logs

When a Geocache gets refreshed, its metadata is updated in the database, and all new or updated logs are stored as well.
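How the listed variables are actually combined is not documented. Purely as an illustration, a scheduler could map them to a refresh interval like this; every threshold and interval below is invented for the example and should not be read as Project-GC's real rule set.

```python
from datetime import date, timedelta


def refresh_interval(last_found: date, hidden: date, inactive: bool,
                     log_count: int, today: date) -> timedelta:
    """Pick a time between refreshes; all numbers here are illustrative guesses."""
    if inactive:
        # Disabled/archived state: such caches rarely receive new logs.
        return timedelta(days=30)
    if (today - hidden).days <= 30 or (today - last_found).days <= 7:
        # Recently hidden or recently found caches change often.
        return timedelta(days=1)
    if log_count >= 500:
        # Caches with many logs tend to keep collecting them.
        return timedelta(days=3)
    return timedelta(days=14)
```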

Log data

The log entries fetched are the same regardless of whether they are fetched from Geocaching profiles or from Geocache data. These are just two different approaches to retrieving the same information. If Geocaching profile X has found Geocache GCX, then GCX automatically also has a log from Geocaching profile X.
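One way to picture this is a single log record that is reachable from both angles. The sketch below uses two in-memory indexes; the `Log` fields and the index names are assumptions made for the example, not the real storage layout.

```python
from collections import defaultdict
from typing import NamedTuple


class Log(NamedTuple):
    profile: str   # Geocaching profile that wrote the log, e.g. "X"
    gc_code: str   # Geocache the log belongs to, e.g. "GCX"
    log_type: str  # e.g. "Found it"


logs_by_profile = defaultdict(list)
logs_by_cache = defaultdict(list)


def store_log(log: Log) -> None:
    """Index one log record under both its profile and its geocache."""
    logs_by_profile[log.profile].append(log)
    logs_by_cache[log.gc_code].append(log)


store_log(Log("X", "GCX", "Found it"))
```

After the call, looking up profile X and looking up geocache GCX both return the same record, no matter which of the two refresh paths fetched it.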

Databases

Most Geocache data and logs exist in the primary database cluster and are fairly up-to-date. Most of the data will only be hours old in the database, but a fair share is expected to be 24-36 hours old.

However, Project-GC has more than one database cluster. Most statistics on the web are created based on data from another database cluster. The primary database cluster replicates its data every 4 hours to the secondary cluster, adding an additional ~4 hours of latency. As an example, most top lists are based on this secondary cluster, while Profile stats are not.

As a technical note, the primary database cluster is a row-based relational database. The secondary cluster is built for data harvesting and is a column-oriented DBMS.
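The effect of the replication interval on data age can be worked through with a little arithmetic. The sketch treats the ~4 hours as an upper bound on the added latency and assumes the copy itself takes no time; the function name is made up for the example.

```python
from typing import Tuple

REPLICATION_INTERVAL_H = 4  # from the text: replication runs every 4 hours


def secondary_age_range(primary_age_h: float) -> Tuple[float, float]:
    """Return (best, worst) age in hours of the same data on the secondary cluster.

    Best case: replication ran just after the primary was updated.
    Worst case: the primary was updated just after a replication run.
    """
    return (primary_age_h, primary_age_h + REPLICATION_INTERVAL_H)
```

For data that is 36 hours old on the primary cluster, `secondary_age_range(36)` gives a range of 36 to 40 hours on the secondary cluster.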

Statistics

As mentioned in Databases, most statistics are based on a secondary database cluster. Even if Project-GC itself has up-to-date profile data, the secondary database cluster might still have the old data. This is because a full replica of the data set is copied only every four hours.

Most top lists use this secondary cluster. Basically everything that gets heavily computed (data harvesting) uses the secondary cluster, while more raw fetches use the primary one. Profile stats is an exception.

It's also worth mentioning that Lab caches generally aren't included in statistics. They aren't technically compatible, and including them would be very complex. Again, Profile stats can be an exception.

Finally, all generated statistics are cached. The period for which a statistic is cached varies, but anything between 5 minutes and 1 hour is very common. So if two people ask for the same statistic, it will only be computed the first time, and the second person will receive a cached version. This has the downside that it potentially adds more latency and shows older data in some cases.
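The caching behaviour described above is essentially a time-to-live (TTL) cache. The following is a minimal sketch of that idea, not Project-GC's actual caching layer; the class and its `clock` parameter are illustrative.

```python
import time


class StatCache:
    """Tiny time-to-live cache; names and structure are illustrative only."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get(self, key, compute):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]      # still fresh: reuse the cached result
        value = compute()      # missing or expired: compute again
        self._store[key] = (now + self.ttl, value)
        return value
```

With a TTL of 300 seconds, a second request for the same key within 5 minutes gets the stored value without recomputation, which is also why the result can be up to 5 minutes older than the underlying data.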

Profile stats

  • Generated from current state at the date written in the header.
  • Labs included depending on setting and paying/freemium.
  • Cached for 7 days for freemium, 24 hours for paying.
  • Caching is not shared between the user the data is about and other viewers.
  • Based on the primary database cluster.
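The two caching points in the list above can be sketched as a cache lifetime plus a cache key. This is a hedged illustration: the function names are hypothetical, and the text does not say whose membership status (the profile owner's or the viewer's) decides the lifetime, so the sketch simply takes it as a parameter.

```python
from datetime import timedelta


def profile_stats_ttl(is_paying: bool) -> timedelta:
    """Cache lifetime from the list above: 24 hours for paying, 7 days for freemium."""
    return timedelta(hours=24) if is_paying else timedelta(days=7)


def profile_stats_cache_key(profile: str, viewer: str) -> tuple:
    """Keep separate cache entries for the profile's own user and for other viewers."""
    audience = "owner" if viewer == profile else "other"
    return (profile, audience)
```

Under this scheme the user the data is about never shares a cached copy with other viewers, while all other viewers share one entry.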

Challenge checkers

Challenge checkers use a mix of the different database clusters. If the Checker script fetches the user's finds from Project-GC it will use the primary database cluster. But some other more advanced API methods might use the secondary database cluster instead. This is for performance reasons.

Special numbers

Some statistics are special because they are based on pre-calculated values, usually because they would be extremely hard to calculate in real time. Project-GC calculates the following data for every Geocacher in the background on a regular basis. This normally happens daily, with some exceptions (usually that some users get calculated more often).

  • D/T loops
  • streaks
  • calendar loops

This does not affect Profile stats. If numbers like these are needed, Profile stats calculates them itself instead of using pre-calculated data. This is also needed since it may have Lab cache data merged.
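As an example of what such a pre-calculated number involves, the sketch below computes a longest streak, assuming a streak means consecutive calendar days with at least one find. That definition and the function name are assumptions for illustration; the real calculation may differ in detail.

```python
from datetime import date, timedelta


def longest_streak(find_dates):
    """Length of the longest run of consecutive calendar days with at least one find."""
    days = sorted(set(find_dates))  # several finds on one day count once
    best = current = 0
    previous = None
    for day in days:
        if previous is not None and day - previous == timedelta(days=1):
            current += 1   # the run continues
        else:
            current = 1    # a gap (or the first day) starts a new run
        best = max(best, current)
        previous = day
    return best
```

Doing this for every Geocacher on every page view would mean scanning each user's full find history, which is a plausible reason to run it as a background job instead.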