This task tracks the work to reduce how long the parser cache retains data.
Requirements
- The Data-Persistence team, in conjunction with the Editing-team and Performance-Team, defines how often the parser cache should be refreshed/cleared
- The Data-Persistence team writes and deploys the patch(es) required to put the above requirement into effect.
Details and strategy
The parser cache databases have been growing for several months, to the point of exceeding the operational thresholds set by the DBAs. This is believed to be due (at least in part) to the DiscussionTools roll-out. Refer to T280599 for more context on that.
In the past decade we have occasionally exceeded these thresholds through organic growth. In those cases, if new capacity/hardware did not arrive in time, we would temporarily sacrifice page view performance by reducing the retention time of the ParserCache. This is something we'd like to avoid, and failing that, to do for as little time as possible.
The alternative plan proposed by @Krinkle is to take this opportunity to see whether we can adopt a pattern of controlling and localising these costs more granularly, instead of having only a single lever. As such, the plan is to reduce the parser cache expiry of talk pages specifically.
Cost: As with any change in retention, such a configuration change only affects how new entries are stored. It does not apply retroactively, and it would take at least a month of churn before we can see whether it worked. To resolve the current operational risk we must do something sooner, namely also retroactively evict blobs older than a certain date. This retroactive action requires 3-4 days of laborious work in order to keep the database servers responsive to live traffic. During these 3-4 days, the spare parser cache servers will be employed, which has the side effect that they will be partially unable to serve data more than a day old. This has a direct impact on page view performance.
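The retroactive eviction must be throttled so that replicas keep up and live traffic is not starved. The following is an illustrative sketch only: the table name, columns, and timestamp format are hypothetical, not the actual parser cache schema, and in production this kind of purge runs via a maintenance script rather than an ad-hoc loop.

```php
<?php
// Hypothetical batched-eviction loop: delete old blobs in small chunks,
// pausing between batches so the server remains responsive to live traffic.
// Table/column names ("objectcache", "exptime") are illustrative only.
$pdo = new PDO( 'mysql:host=pc-host;dbname=parsercache', 'user', 'pass' );
// Evict blobs older than 22 days (14-digit timestamp, illustrative format).
$cutoff = gmdate( 'YmdHis', time() - 22 * 24 * 3600 );
do {
    $stmt = $pdo->prepare(
        'DELETE FROM objectcache WHERE exptime < :cutoff LIMIT 1000'
    );
    $stmt->execute( [ 'cutoff' => $cutoff ] );
    $deleted = $stmt->rowCount();
    usleep( 500000 ); // throttle between batches so replication catches up
} while ( $deleted > 0 );
```

Small batches plus a sleep keep individual transactions short, which is what allows the servers to keep serving reads during the multi-day reclaim.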
Unknowns: We don't know the distribution of talk pages in the parser cache. We don't know what percentage of talk pages get frequently modified and rotated within a few days, versus staying the same and benefitting from longer caching. For example, if the vast majority of talk page views happen within a week of the page being modified (and thus added to the cache), then caching entries for longer does not gain much, and shortening their retention would reclaim space. On the other hand, if people regularly view stale discussion pages without then changing them, shortening the expiry would just lead to the same stale pages being re-cached as new entries, without significantly reducing the space used.
Risk: There is a chance that the new DiscussionTools entries would (still) take up more space than we can reclaim by reducing their retention. As such, if all we did was reduce the DT cache expiry, we might have to do the 3-4 days of reclaim work a second time, and pay the end-user penalty again.
Strategy: We will avoid this risk by performing the DT mitigation and our fallback mitigation at the same time. The fallback mitigation is to also reduce the cache expiry of Wikipedia articles. This ensures we reclaim the amount of space we need. Then, once we have found our new equilibrium, we will determine how far we can ramp the Wikipedia article cache expiry back up (either in full if the DT strategy worked out as hoped, or only in part if it didn't).
Done
- Draft strategy for interim mitigation with Performance-Team and DBA.
- Update scheduled job in Puppet. – @Marostegui https://gerrit.wikimedia.org/r/c/operations/puppet/+/685222
- Reduce overall ParserCache expiry in wmf-config (reduce from 30 days to 22 days). Prior art: T167784, T210992. – @Krinkle
- Add ParserCache expiry override for talk pages in DiscussionTools extension (reduce from 30 days to 10 days). – @Krinkle, with code review by Editing-team.
- Perform the reclaim sequence on the parser cache database servers. – DBAs
- The Requirements above are met. – @ppelberg
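The talk-page override in the checklist above can be sketched as follows. ParserOutput::updateCacheExpiry() lowers the cache lifetime for a single parse without touching the site-wide setting; the function name and hook wiring shown here are illustrative, not necessarily how the actual DiscussionTools patch is structured.

```php
<?php
// Illustrative sketch: cap parser cache retention at 10 days for talk
// pages only. updateCacheExpiry() can only lower the expiry, so the
// site-wide value continues to apply to all other pages.
// The handler name and its registration are hypothetical.
function onTalkPageParsed( Title $title, ParserOutput $output ) {
    if ( $title->isTalkPage() ) {
        $output->updateCacheExpiry( 10 * 24 * 3600 ); // seconds
    }
}
```

Because the override lives in the extension rather than in global configuration, the cost of DiscussionTools is localised: it can be tuned or reverted independently of the site-wide expiry.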
30 days after this task is resolved, follow-up task T280604 kicks in, which covers ramping the overall expiry back up.
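For reference, the overall expiry lever that is reduced now and later ramped back up by T280604 corresponds to MediaWiki's $wgParserCacheExpireTime setting. A minimal sketch of what the wmf-config change amounts to (the actual change may be structured per-wiki or expressed differently):

```php
// wmf-config sketch: reduce overall parser cache retention
// from 30 days to 22 days. The value is in seconds.
$wgParserCacheExpireTime = 22 * 24 * 3600;
```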