Post-deployment: evaluate impact on site performance
Closed, ResolvedPublic

Description

This task represents the work involved with the Performance-Team and DBA evaluating the impact changing the parser cache data retention time has had on the site's performance.

Requirements

Site performance:

The Performance Team and DBAs will monitor and evaluate the outcome, based primarily on the following (a rough monitoring sketch follows the list):

  1. "Parser cache disk space available", should remain above 20%. Right now it is at 21% on average, with one or two servers below it. Measured via Grafana: Parser Cache and Grafana: Host overview
  1. "Parser cache hit ratio", has been stable around ~80% for article page views. Measured via Grafana: Parser Cache (contenttype; wikitext)
  1. "Backend pageview response time (p75)", has been stable around ~250ms for the past two years. Measured via https://grafana.wikimedia.org/d/QLtC93rMz/backend-pageview-response-time.
  1. Monitor overall appserver load and internal latencies, via https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard.
  1. Daily purge of parser cache MUST take less than an actual day to run. (Runs automatically at midnight). P16418. Monitored by DBAs.
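
For illustration only: a minimal sketch of how two of the above thresholds could be checked automatically against the Graphite render API behind Grafana. The endpoint usage is standard Graphite, but the metric paths and thresholds below are placeholders restated from the list, not the actual series names behind the linked panels.

```python
"""Minimal monitoring sketch. Assumptions: graphite.wikimedia.org/render backs
the linked panels, and the metric paths below are placeholders for whatever
series the dashboards actually query."""
import json
import urllib.request

GRAPHITE = "https://graphite.wikimedia.org/render"


def latest(target, window="-1h"):
    """Return the most recent non-null datapoint for a Graphite target."""
    url = f"{GRAPHITE}?target={target}&from={window}&format=json"
    with urllib.request.urlopen(url) as resp:
        series = json.load(resp)
    # The render API returns [{"target": ..., "datapoints": [[value, ts], ...]}].
    values = [v for v, _ts in series[0]["datapoints"] if v is not None]
    return values[-1] if values else None


# (label, placeholder metric path, threshold, direction)
CHECKS = [
    ("parser cache hit ratio (%)", "MediaWiki.pcache.hit_ratio.wikitext", 75, "min"),
    ("backend pageview p75 (ms)", "MediaWiki.backend_pageview_time.p75", 300, "max"),
]

for label, target, threshold, direction in CHECKS:
    value = latest(target)
    if value is None:
        print(f"{label}: no data")
    elif (direction == "min" and value < threshold) or (direction == "max" and value > threshold):
        print(f"WARNING: {label} = {value:.1f}, outside target ({direction} {threshold})")
    else:
        print(f"OK: {label} = {value:.1f}")
```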

Path forward

Done

  • The metrics to monitor are documented.
  • The metrics to monitor have been evaluated two weeks post-deployment.

Event Timeline

Krinkle subscribed.

The Performance Team and DBAs will monitor and evaluate the outcome, based primarily on the following:

  • "Parser cache hit ratio", has been stable around ~80% for article page views. Measured via Grafana: Parser Cache (contenttype; wikitext)

As we are proposing applying a shorter expiry time to talk pages, our analysis should probably evaluate talk pages separately (if that is not already planned) to ensure there is no significant regression.

META
@Krinkle: based on the steps outlined in T280606#7058580, I'm assigning this task over to you.

Of course, if you think there is someone better to assign this task to, please adjust the task accordingly.

As we are proposing applying a shorter expiry time to talk pages, our analysis should probably evaluate talk pages separately (if that is not already planned) to ensure there is no significant regression.

Agreed. We need new instrumentation for that, however, and I suppose it's up to you (plural) to decide what counts as success there. But some ideas:

  • Add "discussion page" as pseudo content-type for parsercache metrics (e.g. distinct from wikitext). This doesn't need to be perfect given its an aggregate (e.g. we can do wikitext-in-talk-namespace, or perhaps include other discussion pages based on the "newsection" heuristic etc.). That will give you a dedicated cache-hit ratio to monitor.
  • Page view load time (as proxy for how impactful a reduced hit rate actually is) for the talk namespace of real users. Our RUM navtiming data includes a namespace factor, however we don't load these into Graphite that way currently, but we can show you and/or help you with doing these as one-off queries in Hadoop.
  • @dpifke and @Peter will also be looking into adding talk pages to our continous synthetic/lab speed tests. These are unlikely to help us for this particular mitigation since these are not going to beneefit parser caching much, but these will help with other work in your team in the future, e.g with regards to cost and impact of added HTML, CSS, and JavaScript payloads.
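
To make the first bullet's heuristic concrete, here is a purely illustrative sketch (outside MediaWiki itself): talk namespaces in MediaWiki have odd IDs, so a "discussion" pseudo content-type for the stats could be derived from the content model plus namespace parity. The function name and labels are made up for the example.

```python
def parsercache_metric_label(content_model: str, namespace_id: int) -> str:
    """Map a page to the content-type label used when bucketing parser cache stats."""
    # In MediaWiki, subject namespaces have even IDs and their talk namespaces
    # have the following odd ID (Talk: = 1, User talk: = 3, Project talk: = 5, ...).
    is_talk_namespace = namespace_id > 0 and namespace_id % 2 == 1
    if content_model == "wikitext" and is_talk_namespace:
        return "discussion"   # pseudo content-type, distinct from plain wikitext
    return content_model      # existing behaviour: bucket by content model


assert parsercache_metric_label("wikitext", 1) == "discussion"   # Talk:
assert parsercache_metric_label("wikitext", 0) == "wikitext"     # article
assert parsercache_metric_label("json", 2) == "json"             # e.g. a user JSON page
```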

Sounds good. Do you need any more input from the Editing team at this point?

LSobanski moved this task from Triage to Blocked on the DBA board.

Sounds good. Do you need any more input from the Editing team at this point?

@Krinkle friendly nudge about the question Ed posed above. Absent a response, we're going to continue assuming no additional input is needed from the Editing Team about this task.

No further input is needed at this time.

No further input is needed at this time.

Noted. Thank you.

Late arrival, one additional operational metric to monitor:

  • Daily purge of parser cache MUST take less than an actual day to run. (Runs automatically at midnight). P16418. Monitored by DBAs.
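
For completeness, a rough sketch of how that "must finish within a day" constraint could be checked around the script's run. purgeParserCache.php is the maintenance script referenced in P16418, but the invocation and reporting here are illustrative, not how the production job is actually wrapped or scheduled.

```python
"""Illustrative timing wrapper around the daily purge (assumed invocation)."""
import subprocess
import time

DAY_SECONDS = 24 * 60 * 60

start = time.monotonic()
# Hypothetical invocation; in production the job is scheduled and monitored by the DBAs.
subprocess.run(["php", "maintenance/purgeParserCache.php"], check=True)
elapsed = time.monotonic() - start

if elapsed >= DAY_SECONDS:
    print(f"ALERT: purge took {elapsed / 3600:.1f}h (>= 24h); it will overlap the next run")
else:
    print(f"Purge finished in {elapsed / 3600:.1f}h")
```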
Krinkle triaged this task as High priority. Jun 12 2021, 5:31 PM
Krinkle lowered the priority of this task from High to Medium.
Krinkle edited projects, added Performance-Team; removed Performance-Team (Radar).

Let us add three talk pages to our synthetic testing at the beginning (we used three in the past to rule out problems on a single page), and then if we feel that one is enough we can remove two. @ppelberg do you have any input on which ones we should add?

Thank you for the ping, @Peter. Before offering suggestions for which three talk pages y'all might use in synthetic testing, can you share what qualities you're seeking in said pages? E.g. large pages (in terms of file size), pages with a variety of content on them (tables, images, templates, etc.), pages in certain languages, etc.

cc @matmarex

My proposals:

They're all fairly large and fairly active pages, and they represent all three of the talk page "kinds" (article talk page, user talk page, and a "general" talk page). Also, "Village pump (technical)" always has all kinds of strange markup in it (which usually is fun for testing, but you might want to pick a different Village pump subpage if that's a problem).

Note that people might not expect or appreciate having their comments preserved somewhere else than where they wrote them (even though they are public, and particularly for user talk pages), so if you include those in testing, it'd be best if the contents were not stored as part of whatever results you gather.

Also note that the page size may fluctuate wildly as bots archive the sections.

  • Disk space is still increasing quite rapidly despite the shortened retention and daily purging, which suggests we're not going to stay stable for long, given that more data will mean longer purge times.
  • As part of restoring retention, purge time is expected to go up even further.
LSobanski subscribed.

"Parser cache disk space available" should remain above 20%

This target is no longer a good measure since we migrated to new hardware. Should we take a snapshot at the point where the hosts were replaced and call this task done or revise the expectations and continue monitoring?

Reviewing the 5 targets from the task description:

  1. "Parser cache disk space available" should remain above 20%

This target is no longer a good measure since we migrated to new hardware. […]

I think 20% could continue to be a general threshold for the new hardware as well. But that's for your team to decide later. I understand what you mean, though: the new servers have more space, so it's not enough to say that we have more than 20% space when judging whether we have succeeded in reducing space used.

We can, however, verify it by looking at the per-server metrics, since the old hardware is still online and keeps up with incoming writes afaik:

Screenshot 2021-09-01 at 04.17.38.png (126 KB)

Grafana: Parser Cache

Success. The first panel is the average, which is skewed by the new hardware. The second panel is per-server, with the newer servers muted via the legend. Each of the old servers today has more absolute space available than when the pre-new-hardware average was around 20%, so in the spirit of the original target, I think we can still consider this a verified "success".
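
To make that per-server reading reproducible, a rough sketch under the same caveats as the earlier one: the host list and metric path are placeholders, since I'm not asserting the exact series names behind the Host overview dashboard.

```python
"""Per-server variant: compare each old parser cache host against the 20% target,
instead of the fleet average that the new hardware now skews."""
import json
import urllib.request

GRAPHITE = "https://graphite.wikimedia.org/render"
OLD_HOSTS = ["pc1007", "pc1008", "pc1009"]  # placeholder list of pre-migration hosts


def percent_free(host):
    target = f"servers.{host}.disk.space.percent_free"  # placeholder metric path
    url = f"{GRAPHITE}?target={target}&from=-1h&format=json"
    with urllib.request.urlopen(url) as resp:
        datapoints = json.load(resp)[0]["datapoints"]
    values = [v for v, _ts in datapoints if v is not None]
    return values[-1] if values else None


for host in OLD_HOSTS:
    free = percent_free(host)
    status = "BELOW 20% TARGET" if free is not None and free < 20 else "OK"
    print(f"{host}: {free}% free ({status})")
```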

  1. "Parser cache hit ratio", has been stable around ~80% for article page views.

Screenshot 2021-09-01 at 04.30.54.png (290 KB)

Grafana: Parser Cache

Meh. We are continuing to take a hit here, as expected, since the lowered retention on articles (30 to 20 days) is still in effect. I actually wanted to merge this task with T280604, since the two should happen in parallel, but kept them separate as Peter filed them that way originally. In any event, this isn't a "success" right now, but I hope/expect that when we ramp this back up under T280604 we will regain this loss. We have some additional margin today even on the old hardware, so I suppose we could start ramping up one day at a time now, or we could wait until your team is comfortable taking the old hardware out of rotation. I'll leave that to you.

  1. "Backend pageview response time (p75)", has been stable around ~250ms

Screenshot 2021-09-01 at 05.15.28.png (302 KB)

Grafana: Backend pageview time

Success. It's quite a long span of time over which to evaluate a high-level metric like this, and there have been many skin and core performance changes during it. But for our purposes here I'm mainly looking for a correlation between the post-May 25 reduction in hit rate and a negative impact on performance, and there isn't a strong correlation of that kind that I can see.

On the contrary, I do see a significant and consistent improvement in pageview response times more recently, from 19 August onward. This aligns suspiciously well with @Kormat's rollout of the new hardware that day. But, as great as that would be, I'm unsure we can attribute this major backend time reduction to the new hardware, as on 19 Aug she rolled it out in Eqiad, and we're currently primary in Codfw, where she rolled it out a few days earlier, on Aug 13 and Aug 17. In any event, no bad news here.

  4. Monitor overall appserver load and internal latencies via the "Application Servers RED Dashboard".

Success. We had our annual switchover in the middle here, and again it's a long time range to say anything definitive about at a glance. I'd say the numbers have been remarkably stable over this entire period, with the same 20% @ <100ms and 75% @ <250ms as before. If anything, the Codfw numbers are less variable day-to-day and slightly better, but there is virtually no lasting correlation with any day during this period, including the days where we made changes in hardware or software around ParserCache.

  5. Daily purge of parser cache MUST take less than an actual day to run. (Runs automatically at midnight).

Success, per T282761: purgeParserCache.php should not take over 24 hours for its daily run.

"Parser cache disk space available" should remain above 20%

Should we […] call this task done or revise the expectations and continue monitoring?

Shorter answer: yes, let's close this task. My comment above does point out a consistently lower parser cache hit rate, but let's treat this task as the green light for the DiscussionTools rollout and continue working together at T280604: Post-deployment: (partly) ramp parser cache retention back up.

  1. "Parser cache hit ratio", has been stable around ~80% for article page views.

Screenshot 2021-09-01 at 04.30.54.png (290 KB)

Grafana: Parser Cache

Meh. We are continuing to take a hit here, as expected, since the lowered retention on articles (30 to 20 days) is still in effect. I actually wanted to merge this task with T280604, since the two should happen in parallel, but kept them separate as Peter filed them that way originally. In any event, this isn't a "success" right now, but I hope/expect that when we ramp this back up under T280604 we will regain this loss. We have some additional margin today even on the old hardware, so I suppose we could start ramping up one day at a time now, or we could wait until your team is comfortable taking the old hardware out of rotation. I'll leave that to you.

The graph window you've chosen seems to be rather misleading. Here's the graph for 2021-03, with a hit average of 78.1%:

image.png (83 KB)

https://grafana-rw.wikimedia.org/d/000000106/parser-cache?viewPanel=7&orgId=1&from=1614556800000&to=1617235200000

And here's the one for the most recent month, 2021-08, with a hit average of 77.5%:

image.png (72 KB)

https://grafana-rw.wikimedia.org/d/000000106/parser-cache?viewPanel=7&orgId=1&from=1627776000000&to=1630454400000

(Months chosen to be before and after the major overhaul of parsercache purging that happened April through July.)

2021-07 was slightly lower, at 76%. I don't know of a specific reason for that, or at least I can't think of why the hardware swaps would have an impact there.
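
For anyone re-running this comparison later, a sketch of how those monthly averages could be reproduced from the Graphite render API, using the same epoch-millisecond windows as the Grafana links above. The metric path is a placeholder for whatever series panel 7 actually plots.

```python
"""Reproduce the before/after monthly hit-ratio averages (placeholder metric path)."""
import json
import urllib.request

GRAPHITE = "https://graphite.wikimedia.org/render"
HIT_RATIO = "MediaWiki.pcache.hit_ratio.wikitext"  # placeholder series name


def average_over_window(target, from_ms, to_ms):
    # Graphite accepts epoch *seconds* for from/until, so convert the Grafana
    # millisecond timestamps before querying.
    url = (f"{GRAPHITE}?target={target}"
           f"&from={from_ms // 1000}&until={to_ms // 1000}&format=json")
    with urllib.request.urlopen(url) as resp:
        datapoints = json.load(resp)[0]["datapoints"]
    values = [v for v, _ts in datapoints if v is not None]
    return sum(values) / len(values) if values else float("nan")


# 2021-03 vs 2021-08: the same windows as the two dashboard links above.
march = average_over_window(HIT_RATIO, 1614556800000, 1617235200000)
august = average_over_window(HIT_RATIO, 1627776000000, 1630454400000)
print(f"2021-03 average: {march:.1f}%  2021-08 average: {august:.1f}%")
```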