[SPIKE] Estimate growth in DiscussionTools' demand for ParserCache storage
Closed, Resolved (Public)

Description

This task covers the work of estimating how DiscussionTools' demand for Parser Cache storage is likely to grow over time.

@DAbad and the platform team(s) will use this information in their planning as they converge on an approach for making Parser Cache, or something like it, more reliable and performant in the long run (read: on the order of years).

Open question(s)

  • 1. In a future where all talk page project features [i] are available to all users (logged in + out), at all Wikimedia projects by default, how much demand will DiscussionTools have for parser cache's storage?
    • In the long-term, the Editing Team estimates DiscussionTools' uncompressed parser cache usage to grow by somewhere between 10% (1.1x) and 90% (1.9x). See more in T285995#7234495.
  • 2. Are the Performance and Data Persistence Teams [ii] comfortable with the Editing Team offering the Reply Tool as an on-by-default feature to all Wikipedias?
  • 3. Is the Editing Engineering Team confident enough in DiscussionTools' code at this point to offer these features as on-by-default at all Wikipedias and suppressing particular features (e.g. using CSS) where applicable?

Done

  • Answers to all "Open question(s)" are documented in the task description

Event Timeline

Once it’s enabled by default on a given wiki, DiscussionTools will just supplant the existing usage of the cache by talk pages, approximately evening out.

Technically there will be a small amount of extra storage used because we do add new html to the markup that’s being cached. I haven’t actually measured that, and the ratio will change per-page based on usage, but I’d be surprised if it was more than 2-3% or so. (This is the actual html for the reply links, and later the section headers.)

The absolute worst storage case for DiscussionTools is if it lingers on a wiki as a widely-used beta. That’d result in most talk pages being split, approximately doubling the parser cache storage devoted to talk pages.
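
To make the cases above concrete, here is a rough sketch of the arithmetic, not a measurement. The function names, the `split_share` parameter and any sample values are illustrative assumptions; only the 2-3% guess and the "approximately doubling" worst case come from the comment above.

```python
def relative_storage_in_beta(split_share: float, markup_overhead: float) -> float:
    """Talk page parser cache storage relative to a wiki without DiscussionTools.

    split_share: assumed fraction of talk pages cached in two variants.
    markup_overhead: assumed extra size of the enhanced variant (0.03 means 3%).
    """
    if not 0.0 <= split_share <= 1.0:
        raise ValueError("split_share must be between 0 and 1")
    # Every page keeps its plain variant; split pages also store an enhanced one.
    return 1.0 + split_share * (1.0 + markup_overhead)


def relative_storage_when_default(markup_overhead: float) -> float:
    """On-by-default case: a single enhanced variant per page, no split."""
    return 1.0 + markup_overhead
```

With every talk page split, the beta case comes out at roughly twice the baseline, while the on-by-default case stays close to the baseline plus the markup overhead.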

@DLynch, the context you shared in T285995#7191760 is helpful, a resulting question for you...

Once it’s enabled by default on a given wiki, DiscussionTools will just supplant the existing usage of the cache by talk pages, approximately evening out.

What does "DiscussionTools" mean in this context?

Asked another way: are the performance gains I understand you to be describing in T285995#7191760 dependent upon all DiscussionTools-based features being available by default at a given wiki [i]? Are they dependent upon at least one DiscussionTools-based feature being available by default on a given wiki [ii]? Something else?


i. Reply Tool, Topic Subscriptions, and Visual Enhancements are all available by default
ii. Reply Tool is available by default; Topic Subscriptions and Visual Enhancements are all available as opt-in beta features

What does "DiscussionTools" mean in this context?

Any DiscussionTools feature that requires the comment formatter to run. Which means, at the moment, the reply tool or topic subscriptions. You could enable the new topic tool without those, though I don't think we've actually considered doing that anywhere? (Guiding principle: are there any visible changes inside the text of the talk page? The comment formatter will run.)

As soon as any of those features is enabled by default, the split goes away. So, e.g., once the reply tool is enabled by default, it won't matter if topic subscription is still in beta.
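
The rule described above can be sketched as a small predicate. This is only an illustration of the logic: the feature names and the "default" / "beta" / "off" states are made up for the example and are not DiscussionTools configuration keys.

```python
# Features that need the comment formatter to run, per the comment above.
FORMATTER_FEATURES = ("reply_tool", "topic_subscriptions")


def parser_cache_is_split(feature_states: dict[str, str]) -> bool:
    """Return True if talk pages would be cached in two variants.

    feature_states maps a (hypothetical) feature name to "default", "beta" or "off".
    Features missing from the mapping are treated as "off".
    """
    states = [feature_states.get(name, "off") for name in FORMATTER_FEATURES]
    if "default" in states:
        # The formatter runs for everyone, so only one variant is cached.
        return False
    # Only opted-in users get the formatted variant, so the cache splits.
    return "beta" in states
```

For example, `parser_cache_is_split({"reply_tool": "default", "topic_subscriptions": "beta"})` is `False`, matching the point that topic subscriptions staying in beta no longer matters once the reply tool is on by default.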

@DLynch so based on your comments, I want to make sure I am understanding them correctly. Please validate the below:

  • Currently, more storage is being used because DiscussionTools is not enabled by default and is thus using additional storage on top of what we use for talk pages
  • Once we enable DiscussionTools by default, we should expect that storage of talk pages is no longer needed and instead replaced by DiscussionTools. Thus, if we make it the default on wikis, the amount of storage required will increase only marginally, by 2-3%, from what talk pages use today.

Given that, it seems that inflated storage needs are driven by whether or not DiscussionTools stays in beta or becomes the default. @ppelberg Is there a plan for how long we plan to keep each wiki in Beta before we set the default?

@DAbad that sounds mostly correct. I'd phrase "storage of talk pages is no longer needed and instead replaced by DiscussionTools" as "DiscussionTools will enhance the default stored talk page rather than storing its own split version", but that's probably just me quibbling.

As soon as any of those features is enabled by default, the split goes away. So, e.g., once the reply tool is enabled by default, it won't matter if topic subscription is still in beta.

Understood; thank you for clarifying, David.

Given that, it seems that inflated storage needs are driven by whether or not DiscussionTools stays in beta or becomes the default. @ppelberg Is there a plan for how long we plan to keep each wiki in Beta before we set the default?

@DAbad: once we answer the two questions below, the Editing Team is ready to begin scaling the Reply Tool as an on-by-default feature to all Wikipedias [i]:

  1. "Are the Performance and Data Persistence Teams [ii] comfortable with the Editing Team offering the Reply Tool as an on-by-default feature to all Wikipedias?"
  2. "Is the Editing Engineering Team confident enough in DiscussionTools' code at this point to offer these features as on-by-default at all Wikipedias and suppressing particular features (e.g. using CSS) where applicable?" [iii]

Next steps

  • @ppelberg to answer "1." and "2." above and document the answers on this ticket.
  • While I do the above, @DAbad is there any additional information you need in order to understand the Editing Team's needs of ParserCache, or an alternative to it?

i. For context: I did not directly answer "...how long we plan to keep each wiki in Beta before we set the default..." considering offering the Reply Tool by default will – based on what David shared in T285995#7195218 – make this question moot, I think.
ii. When we last discussed this on 1 July 2021, the consensus was "Not yet."
iii. This question was raised during the meeting we had on 29 April 2021

@ppelberg - please document answers to question 2 above based on conversation with DLynch and Matmarex.

  1. "Is the Editing Engineering Team confident enough in DiscussionTools' code at this point to offer these features as on-by-default at all Wikipedias and suppressing particular features (e.g. using CSS) where applicable?"

As @DLynch + @matmarex confirmed during the team's 21 July meeting, the Editing Engineering Team is confident enough in DT's code to offer its constituent features (e.g. Reply Tool, New Discussion Tool, Topic Subscriptions) as on-by-default at all Wikipedias and to suppress particular ones using CSS where applicable.

The work to do the above will happen in T287098, though it will not be deployed until we have spoken with the Performance and Data Persistence Teams (see "Next steps" below).

Next steps

  • @ppelberg to answer: "Are the Performance and Data Persistence Teams [ii] comfortable with the Editing Team offering the Reply Tool as an on-by-default feature to all Wikipedias?"

Technically there will be a small amount of extra storage used because we do add new html to the markup that’s being cached. I haven’t actually measured that, and the ratio will change per-page based on usage, but I’d be surprised if it was more than 2-3% or so. (This is the actual html for the reply links, and later the section headers.)

I tried to estimate it, and it's actually a lot more than that, more like 90% in the worst cases, but also about 0% on talk pages that don't contain any comments, which is probably the majority.

I did it by comparing the size of the action=render output while logged in (DiscussionTools enabled) and logged out (DiscussionTools disabled). I'm assuming that parser cache isn't compressed; gzip makes the results less dramatic, but still way more than 2-3%.
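
The comparison described above can be sketched roughly as follows. This is an assumption-laden illustration, not the script that was actually used: `size_overhead` is a made-up helper, and the two HTML strings are expected to come from separate `action=render` fetches (with and without DiscussionTools), which are not shown here.

```python
import gzip


def size_overhead(plain_html: str, dt_html: str) -> dict[str, float]:
    """Relative size increase of dt_html over plain_html, raw and gzip-compressed."""
    plain = plain_html.encode("utf-8")
    enhanced = dt_html.encode("utf-8")
    if not plain:
        raise ValueError("plain_html must not be empty")
    # 0.9 means the enhanced output is 90% larger than the plain output.
    raw = len(enhanced) / len(plain) - 1.0
    compressed = len(gzip.compress(enhanced)) / len(gzip.compress(plain)) - 1.0
    return {"raw": raw, "gzip": compressed}
```

The "raw" figure corresponds to the uncompressed assumption; the "gzip" figure is the less dramatic one.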

For example, we turn HTML like this:

<h2><span class="mw-headline" id="Partial_blocks">Partial blocks</span></h2>
<p>Not sure if this or the Help desk is the best place to ask this question. An IP range is partially blocked. It expires in February. A single IP that is part of the range is sitewide-blocked. It expires in January. When the single IP's block expires, does the partial block remain in effect and include the single IP?--<a href="//en.wikipedia.org/wiki/User:Bbb23" title="User:Bbb23">Bbb23</a> (<a href="//en.wikipedia.org/wiki/User_talk:Bbb23" title="User talk:Bbb23">talk</a>) 18:56, 22 July 2021 (UTC)
</p>
<dl><dd><span class="template-ping">@<a href="//en.wikipedia.org/wiki/User:Bbb23" title="User:Bbb23">Bbb23</a>:</span> Yes. -- <a href="//en.wikipedia.org/wiki/User:Zzuuzz" title="User:Zzuuzz">zzuuzz</a> <sup><a href="//en.wikipedia.org/wiki/User_talk:Zzuuzz" title="User talk:Zzuuzz">(talk)</a></sup> 20:12, 22 July 2021 (UTC)
<dl><dd><span class="template-ping">@<a href="//en.wikipedia.org/wiki/User:Zzuuzz" title="User:Zzuuzz">Zzuuzz</a>:</span> You'd be a great witness in a courtroom. Thanks.--<a href="//en.wikipedia.org/wiki/User:Bbb23" title="User:Bbb23">Bbb23</a> (<a href="//en.wikipedia.org/wiki/User_talk:Bbb23" title="User talk:Bbb23">talk</a>) 21:54, 22 July 2021 (UTC)</dd></dl></dd></dl>

into this:

<h2 class="ext-discussiontools-init-section"><span class="mw-headline" id="Partial_blocks" data-mw-comment="{&quot;type&quot;:&quot;heading&quot;,&quot;level&quot;:0,&quot;id&quot;:&quot;h-Partial_blocks-2021-07-22T18:56:00.000Z&quot;,&quot;replies&quot;:[&quot;c-Bbb23-2021-07-22T18:56:00.000Z-Partial_blocks&quot;],&quot;headingLevel&quot;:2,&quot;placeholderHeading&quot;:false}"><span data-mw-comment-start="" id="h-Partial_blocks-2021-07-22T18:56:00.000Z"></span>Partial blocks<span data-mw-comment-end="h-Partial_blocks-2021-07-22T18:56:00.000Z"></span></span><!--__DTSUBSCRIBE__h-Bbb23-2021-07-22T18:56:00.000Z--></h2>
<p><span data-mw-comment-start="" id="c-Bbb23-2021-07-22T18:56:00.000Z-Partial_blocks"></span>Not sure if this or the Help desk is the best place to ask this question. An IP range is partially blocked. It expires in February. A single IP that is part of the range is sitewide-blocked. It expires in January. When the single IP's block expires, does the partial block remain in effect and include the single IP?--<a href="//en.wikipedia.org/wiki/User:Bbb23" title="User:Bbb23">Bbb23</a> (<a href="//en.wikipedia.org/wiki/User_talk:Bbb23" title="User talk:Bbb23">talk</a>) 18:56, 22 July 2021 (UTC)<span class="ext-discussiontools-init-replylink-buttons"><span class="ext-discussiontools-init-replylink-bracket">[</span><a class="ext-discussiontools-init-replylink-reply" role="button" tabindex="0" data-mw-comment="{&quot;type&quot;:&quot;comment&quot;,&quot;level&quot;:1,&quot;id&quot;:&quot;c-Bbb23-2021-07-22T18:56:00.000Z-Partial_blocks&quot;,&quot;replies&quot;:[&quot;c-Zzuuzz-2021-07-22T20:12:00.000Z-Bbb23-2021-07-22T18:56:00.000Z&quot;],&quot;timestamp&quot;:&quot;2021-07-22T18:56:00.000Z&quot;,&quot;author&quot;:&quot;Bbb23&quot;}" href=""><!--__DTREPLY__--></a><span class="ext-discussiontools-init-replylink-bracket">]</span></span><span data-mw-comment-end="c-Bbb23-2021-07-22T18:56:00.000Z-Partial_blocks"></span>
</p>
<dl><dd><span class="template-ping"><span data-mw-comment-start="" id="c-Zzuuzz-2021-07-22T20:12:00.000Z-Bbb23-2021-07-22T18:56:00.000Z"></span>@<a href="//en.wikipedia.org/wiki/User:Bbb23" title="User:Bbb23">Bbb23</a>:</span> Yes. -- <a href="//en.wikipedia.org/wiki/User:Zzuuzz" title="User:Zzuuzz">zzuuzz</a> <sup><a href="//en.wikipedia.org/wiki/User_talk:Zzuuzz" title="User talk:Zzuuzz">(talk)</a></sup> 20:12, 22 July 2021 (UTC)<span class="ext-discussiontools-init-replylink-buttons"><span class="ext-discussiontools-init-replylink-bracket">[</span><a class="ext-discussiontools-init-replylink-reply" role="button" tabindex="0" data-mw-comment="{&quot;type&quot;:&quot;comment&quot;,&quot;level&quot;:2,&quot;id&quot;:&quot;c-Zzuuzz-2021-07-22T20:12:00.000Z-Bbb23-2021-07-22T18:56:00.000Z&quot;,&quot;replies&quot;:[&quot;c-Bbb23-2021-07-22T21:54:00.000Z-Zzuuzz-2021-07-22T20:12:00.000Z&quot;],&quot;timestamp&quot;:&quot;2021-07-22T20:12:00.000Z&quot;,&quot;author&quot;:&quot;Zzuuzz&quot;}" href=""><!--__DTREPLY__--></a><span class="ext-discussiontools-init-replylink-bracket">]</span></span><span data-mw-comment-end="c-Zzuuzz-2021-07-22T20:12:00.000Z-Bbb23-2021-07-22T18:56:00.000Z"></span>
<dl><dd><span class="template-ping"><span data-mw-comment-start="" id="c-Bbb23-2021-07-22T21:54:00.000Z-Zzuuzz-2021-07-22T20:12:00.000Z"></span>@<a href="//en.wikipedia.org/wiki/User:Zzuuzz" title="User:Zzuuzz">Zzuuzz</a>:</span> You'd be a great witness in a courtroom. Thanks.--<a href="//en.wikipedia.org/wiki/User:Bbb23" title="User:Bbb23">Bbb23</a> (<a href="//en.wikipedia.org/wiki/User_talk:Bbb23" title="User talk:Bbb23">talk</a>) 21:54, 22 July 2021 (UTC)<span class="ext-discussiontools-init-replylink-buttons"><span class="ext-discussiontools-init-replylink-bracket">[</span><a class="ext-discussiontools-init-replylink-reply" role="button" tabindex="0" data-mw-comment="{&quot;type&quot;:&quot;comment&quot;,&quot;level&quot;:3,&quot;id&quot;:&quot;c-Bbb23-2021-07-22T21:54:00.000Z-Zzuuzz-2021-07-22T20:12:00.000Z&quot;,&quot;replies&quot;:[],&quot;timestamp&quot;:&quot;2021-07-22T21:54:00.000Z&quot;,&quot;author&quot;:&quot;Bbb23&quot;}" href=""><!--__DTREPLY__--></a><span class="ext-discussiontools-init-replylink-bracket">]</span></span><span data-mw-comment-end="c-Bbb23-2021-07-22T21:54:00.000Z-Zzuuzz-2021-07-22T20:12:00.000Z"></span></dd></dl></dd></dl>

If it's worth the effort, we could probably easily improve this a lot by just using shorter CSS class names, JSON property names, etc. It would be less easy to get rid of the &quot; entities everywhere (according to C. Scott in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/663997, it's a convention in MediaWiki and changing the quoting could affect security-critical code in the sanitizer).
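
As a rough illustration of where the bytes go, the sketch below rebuilds the `data-mw-comment` value for the last comment in the example and measures the cost of the `&quot;` entities. It uses Python's `html.escape` as a stand-in for MediaWiki's attribute escaping, which is an assumption, and the short property names are purely hypothetical.

```python
import html
import json

# Metadata for the last comment in the example above.
comment = {
    "type": "comment",
    "level": 3,
    "id": "c-Bbb23-2021-07-22T21:54:00.000Z-Zzuuzz-2021-07-22T20:12:00.000Z",
    "replies": [],
    "timestamp": "2021-07-22T21:54:00.000Z",
    "author": "Bbb23",
}


def attribute_value(data: dict) -> str:
    """Compact JSON, escaped for use inside a double-quoted HTML attribute."""
    raw = json.dumps(data, separators=(",", ":"))
    return html.escape(raw, quote=True)


raw_json = json.dumps(comment, separators=(",", ":"))
escaped = attribute_value(comment)
# Each '"' becomes '&quot;', which adds 5 characters per quote.
entity_cost = len(escaped) - len(raw_json)

# Hypothetical shorter property names, only to gauge the possible saving.
SHORT_KEYS = {"type": "t", "level": "l", "id": "i",
              "replies": "r", "timestamp": "ts", "author": "a"}
shortened = attribute_value({SHORT_KEYS[k]: v for k, v in comment.items()})

print(len(raw_json), len(escaped), len(shortened))
```

Shortening the property names trims some bytes, but the entity cost is paid for every quoted key and string value regardless of how the keys are named.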

I tried to estimate it, and it's actually a lot more than that, more like 90% in the worst cases, but also about 0% on talk pages that don't contain any comments, which is probably the majority.

Huh, that's way more than I anticipated. I guess I wasn't thinking about how bulky some of those encoded entities would wind up being, or rather was in a mental model which featured much longer comments compared to the data we store.

I think my point that removing the split is still better than the current status quo remains valid. Long term, the question is whether we're okay with the pre-DiscussionTools (uncompressed) parser cache usage for talk pages effectively growing by somewhere between 1.1x and 1.9x, depending on estimates of how many talk pages actually contain any comments for this metadata to be added to. (With it all being less severe if parser cache turns out to be compressed, of course.)
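
One simple way to relate the 1.1x to 1.9x range to the share of talk pages with comments is sketched below. The model and its parameter names are assumptions for illustration: it treats pages without comments as unchanged and applies a single overhead figure to the rest, weighting by cached bytes rather than page counts.

```python
def talk_page_growth(share_with_comments: float, overhead_on_commented_pages: float) -> float:
    """Multiplier on pre-DiscussionTools talk page storage once the split is gone.

    share_with_comments: assumed share of cached talk page bytes on pages with comments.
    overhead_on_commented_pages: relative size increase on those pages (0.9 = 90%).
    """
    if not 0.0 <= share_with_comments <= 1.0:
        raise ValueError("share_with_comments must be between 0 and 1")
    if overhead_on_commented_pages < 0.0:
        raise ValueError("overhead_on_commented_pages must be non-negative")
    # Pages without comments are assumed to stay the same size (about 0% overhead).
    return 1.0 + share_with_comments * overhead_on_commented_pages
```

Under this model, with the worst-case 90% overhead, a share of about 11% gives roughly 1.1x and a share of 100% gives 1.9x.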

ppelberg added a subscriber: Krinkle.

@Krinkle: a question for you about extending the work the Editing Team did in T280295 and T27964...

Question: what – if any – additional information would you need to give the Editing Team the "go ahead" to converge on a single, DiscussionTools-transformed version of talk page HTML in the parser cache, on all wikis?

For context: in order to reduce DiscussionTools' demand on the parser cache, we are exploring how quickly we can transition to a single version of the stored talk page HTML and use CSS to hide features on wikis where their on-by-default availability has not yet been approved (T287098).

I'm resolving this ticket; we can revisit the question above once T280599 is resolved and we prioritize work on T273072.