Wikipedia:Bots/Requests for approval/Legobot 41
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at Wikipedia:Bots/Noticeboard. The result of the discussion was Approved.
Operator: Legoktm (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 00:45, Sunday, January 22, 2023 (UTC)
Function overview: Automatically fix (low priority) obsolete-tag lint errors
Automatic, Supervised, or Manual: Automatic
Programming language(s): Rust
Source code available: yes
Links to relevant discussions (where appropriate): Wikipedia:Bots/Noticeboard/Archive_17#MalnadachBot_and_watchlists
Edit period(s): One time run (well, possibly multiple runs, but at some point it will be done)
Estimated number of pages affected: There are 4 million errors, but many pages have multiple lint errors, so I'd estimate less than 1 million pages.
Namespace(s): All
Exclusion compliant (Yes/No): Yes
Function details:
The main difference from other attempts is that Legobot will attempt to fix all obsolete-tag lint errors at once, and if it is unable to fix everything, it will not edit the page. This should ensure that Legobot does not edit a page more than once, which was the main issue in the above-linked BOTN discussion.
For each page that is reporting lint errors, the bot pulls the Parsoid HTML for the page, and:
- Any <font>...</font> tags are turned into <span>...</span> (or <div>...</div> if it contains block elements) with the appropriate inline styles. The color, face, and size attributes are parsed according to the HTML spec.
  - If the <font>...</font> tag specified a color (color attribute or inline style) and it contains links, then another <span>...</span> will be added inside the link, wrapping the link text.
- Any <strike>...</strike> are turned into <s>...</s>
- Any <tt><nowiki>...</nowiki></tt> are turned into <code><nowiki>...</nowiki></code>
- Any <center>...</center> are turned into <div class="center">...</div>
  - If any of the descendants are tables or contain an inline style with a "margin" rule, it will be skipped.
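The skip condition for <center> described in the list above could be sketched roughly as follows. This is an illustration in Python using the standard library's html.parser, not the bot's actual Rust code; the names CenterSkipCheck and should_skip_center are hypothetical, and the real bot works on Parsoid HTML rather than a bare fragment.

```python
from html.parser import HTMLParser


class CenterSkipCheck(HTMLParser):
    """Flags <center> content that holds a table or an inline margin style."""

    def __init__(self):
        super().__init__()
        self.center_depth = 0  # how many <center> elements are currently open
        self.skip = False

    def handle_starttag(self, tag, attrs):
        if tag == "center":
            self.center_depth += 1
            return
        if self.center_depth == 0:
            return  # only descendants of <center> matter
        if tag == "table":
            self.skip = True
        # Look for any margin-related declaration in the inline style.
        style = dict(attrs).get("style") or ""
        for declaration in style.split(";"):
            prop = declaration.split(":", 1)[0].strip().lower()
            if prop.startswith("margin"):
                self.skip = True

    def handle_endtag(self, tag):
        if tag == "center" and self.center_depth > 0:
            self.center_depth -= 1


def should_skip_center(html):
    """Return True if the fragment matches the skip condition (assumed reading)."""
    checker = CenterSkipCheck()
    checker.feed(html)
    checker.close()
    return checker.skip
```

For example, should_skip_center("<center>foo</center>") is False, while a fragment whose <center> wraps a table or a <div style="margin-left: 2em"> gives True.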
The HTML is then converted back to wikitext, and the bot checks with the Linter API whether any lint errors are left. If there are remaining obsolete-tag errors, it does nothing. If all obsolete-tag errors have been resolved, it saves the page.
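The save-or-skip decision described above can be sketched as a small predicate. This is a hypothetical Python illustration, not the bot's Rust source: should_save and its strict flag are made-up names, and the input is assumed to be the list of lint categories still reported for the converted wikitext. The strict mode models the stricter behaviour the operator describes later in the discussion (bail out if any lint issue remains).

```python
def should_save(remaining_errors, strict=False):
    """Decide whether to save, given lint categories left after the fix.

    remaining_errors: iterable of category names such as "obsolete-tag".
    strict: if True, any remaining lint error at all blocks the save.
    """
    remaining = set(remaining_errors)
    if "obsolete-tag" in remaining:
        return False  # not everything was fixed, so do not edit the page
    if strict and remaining:
        return False  # defensive mode: other errors might confuse the fix
    return True
```

With strict=True, a page that still reports missing-end-tag or stripped-tag is skipped even though every obsolete-tag error was resolved.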
One known limitation is that the bot is unable to fix things inside template parameters, e.g. {{1x|<center>foo</center>}}. I haven't decided whether this is worth fixing; the lack of support isn't an issue because it simply won't edit those pages.
I've prepared a little over 1,000 edits as a demo where you can see the wikitext change and a side-by-side comparison of the rendered HTML.
Discussion
Approved for trial (100 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. If it's possible to also link the pages that were not edited, please list some of them (at the very least as proof-of-concept that the bot skips pages appropriately). Primefac (talk) 12:50, 22 January 2023 (UTC)
- Comment: I support this task in general, but the above specification appears incomplete, and the bot would replace tags that should not be replaced. The bot needs to be more selective in selecting tags to replace. I looked through the first 30 or so demo edits and found some errors.
  - Turning all <tt>...</tt> tags into <code>...</code> tags is not appropriate; conversion to {{mono}} or <kbd>...</kbd> or other formatting is sometimes the correct change. See mw:Help:Lint errors/obsolete-tag for examples of these conversions and those for other obsolete tags, based on context.
  - Similarly, converting <center>...</center> into a div tag is not always the right change; when it wraps tables, galleries, or other block content, markup like |class=center or wrapping the whole block in table markup is sometimes needed. See this diff and this diff for examples of a div tag that doesn't work properly.
  - Converting font tags into span tags when they contain a color specification and wrap a link is not the correct fix; the span tags should be placed inside the link. See this diff for an example (the font tag wrapping [[/Lobby|•]] is replaced incorrectly). See also this diff, where the tags around the link to User:Damërung/Secret are replaced incorrectly. Also this diff; the original page contains a "font tag wrapping link" error that is not detected (see T294720).
  - Wrapping a table in a span tag is invalid markup and causes a new Linter error. See this diff for an example.
  - This edit appears to have deleted a necessary and valid (although misnested) </big> tag. It's a bit of GIGO, but this sort of thing is all over the place; let's not introduce new errors.
  - This edit appears to have created a new misnested bold tag error.
- How will the bot know when to apply the above proposed fixes and when to do something different or leave the code for human editors to fix? – Jonesey95 (talk) 15:49, 22 January 2023 (UTC)
- Thank you for the detailed feedback @Jonesey95!
  - Would you recommend just skipping <tt>...</tt> entirely then? My initial thought was that while there are often better replacements for <tt>...</tt> as you and the wiki page suggest, <code>...</code> isn't absolutely wrong, and the value gained by using the correct one isn't really worth the human time. But if that disagrees with the current consensus, I can axe it.
  - Right. I'm confused by The bellman's userpage, why the "I agree to multi-license..." text isn't center-aligned despite being inside <center>...</center>. Based on [1], it seems like <div class="center">...</div> would actually be a better replacement than a raw inline text-align style as the class appears to handle centering of block content?
  - I mostly implemented the wrapping of link contents, the remaining todo is when <font style="color:...;">...</font> is used instead of the color attribute.
  - In this case it should be a div rather than a span, right? I can add something to detect if a block element is being used, and to switch the tag based on that.
  - Regarding the final two issues, I switched it to bail out if there are any lint issues, so it will not make things worse. In theory it should've fixed the misnesting bug, but I think there is also a separate Parsoid bug that I raised on IRC, I'll see what they have to say...
- In general I'd like the bot to be as defensive as possible to begin with: if it is not confident in the fix, we skip. And then we can iterate on specific cases it can't handle, and go back over the skipped pages. For example, if we don't have a good solution on the center issue (2nd bullet), then I could have the bot skip fixing if there are any block elements nested in the center tag.
- Either tomorrow or the next day (presumably after we figure out the center issue) I'll generate a new set of demo edits and "categorize" them so it's easy to see specific types of fixes (e.g. link inside of font) and also places where it skipped. Legoktm (talk) 07:48, 23 January 2023 (UTC)
- I recommend:
  - Replacing <tt>...</tt> with <code>...</code> when it wraps <nowiki>...</nowiki>. In my experience, that is the only safe tt replacement to do semi-automatically.
  - It may be that div class="center" is a better replacement in some cases; I haven't played with it. In my experience, no method of centering applies correctly to every situation. Even class="center" within gallery tags doesn't work sometimes (I have a phab ticket lying around somewhere). It's exhausting.
  - I haven't tried wrapping a whole table in a div tag to specify the font, but since span tags appear to work (even though they are invalid HTML), a div tag around a block element is probably correct.
- I think you're on the right track. Even with a conservative set of patterns, you will be able to fix a lot of pages. There are plenty of pages with just one center tag, or a few easy font tags. You might try starting in the User talk and Project spaces, since there are a lot of short pages with just a few signatures in those. Getting the AFD pages cleaned up, for example, would be pretty nice, since each one is transcluded in a larger page.
- I hesitate to expose my poor regex skills, but feel free to look at User:Jonesey95/AutoEd/doi.js and User:Jonesey95/AutoEd/coauthors.js for patterns that I use. They almost never result in errors, but I still preview every edit before saving, just in case. The proposed code at User:SheepLinterBot/1 may also have some value. – Jonesey95 (talk) 14:15, 23 January 2023 (UTC)
- Still going through your regexes...did people really try <font colour="">...</font>??? Did that ever work? Legoktm (talk) 02:03, 25 January 2023 (UTC)
  - If there is one thing I have learned in ten years as a gnome on this site, it is that Wikipedia editors are endlessly creative (currently 481 hits for "font colour" tags) in the way that they make errors. I'm pretty sure that "font colour" has never worked. – Jonesey95 (talk) 03:54, 25 January 2023 (UTC)
- @SSastry (WMF) and I discussed the center tag today, he pointed out it was documented that it only centers tables rather than their content. So in theory if we wanted to get as close to identical as possible, we'd have to mark up each child with class="center" and then tables would get style="align-left: auto; align-right: auto;". That would probably get very messy and not always possible if the children are templates. It's also unlikely someone used a center tag knowing that it would center the table but not the contents...
- So I think the best option, though not purely identical, is to swap center with div class="center" (as I suggested earlier), and then if any of the children are tables, mark those with class="center", if possible.
- I published a new set of demo edits and tried to group them by edge case. I will spend a while tomorrow reviewing them, and if you don't mind peeking at some that would be appreciated. And if all looks good, I'll kick off the trial! Legoktm (talk) 07:17, 26 January 2023 (UTC)
- Re: "It's also unlikely someone used a center tag knowing that it would center the table but not the contents": If I understand that sentence correctly, I think it is incorrect. I see tables wrapped in center tags all the time, and it is clear that the editors wanted the table to appear normally, without centering any of the interior content, but as a block in the horizontal middle of the page. Maybe I misunderstand your sentence. As for centering tables with |class="center" or style="align-left: auto; align-right: auto;", I wonder why that would be the recommendation when Wikipedia:HTML 5#Tables has shown "margin:1em auto" as the recommended styling for many years. I have found that tables wrapped by center tags are easily updated by adding that "margin:1em auto" style, as recommended. – Jonesey95 (talk) 16:12, 26 January 2023 (UTC)
- Comments on this batch of demo edits:
- Exclude pages in Wikipedia space that contain the string "log/" or "Log/" in the title. Those are compilation pages that do not contain any actual errors; the transcluded pages need to be fixed.
- I would be wary of including User pages in your initial batches. People don't usually like having their sandboxes and article drafts messed with, and those do not always live at /sandbox. I would stick to pages with discussions on them for a while.
- This diff wrapped a multi-line bit of content (i.e. content with a hard line break, not just p or br tags) in the "Tools" section using a span tag. LintHint does not detect an error, but wrapping multi-line content with span tags instead of div tags can introduce new errors.
- This color replacement appears to be very nicely done, preserving the color both inside and outside the wikilinks. This similar situation appears to have missed additional wikilinks that needed interior styling.
- This edit added nowiki tags.
- This edit appears to have changed the position of Template:Historical (probably for the better, but just noting it).
- This edit de-centered the block content in the Invite section.
- This edit de-centered the block in the International section.
- This edit replaced font size=-1 with font-size:x-small, which is one size too small, according to the help page.
- This edit looks promising. You could do a targeted search for this newsletter and hit them all.
- Likewise, a targeted search for the envelope icon would probably yield a high fix rate. You can look at User:MalnadachBot/Signature submissions for additional common patterns.
- That's probably enough from me. Don't get discouraged. As you can see from the regex pages of people who have gone before you, this stuff is hard to get right, especially as you broaden the scope of potential fixes. – Jonesey95 (talk) 17:19, 26 January 2023 (UTC)
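The font size point in the comments above (size=-1 should not become x-small) follows from how the HTML Standard parses legacy font sizes: a leading + or - makes the value relative to 3, and the result is clamped to the range 1 to 7 before being mapped to a CSS keyword. The following Python sketch shows that mapping as an illustration only; legacy_font_size is a hypothetical name and this is not the bot's Rust code.

```python
import string

# CSS keywords for the seven legacy <font size> values.
KEYWORDS = {
    1: "x-small", 2: "small", 3: "medium", 4: "large",
    5: "x-large", 6: "xx-large", 7: "xxx-large",
}
ASCII_WHITESPACE = "\t\n\f\r "


def legacy_font_size(value):
    """Map a <font size="..."> value to a CSS font-size keyword, or None."""
    s = value.lstrip(ASCII_WHITESPACE)
    if not s:
        return None
    mode = "absolute"
    if s[0] == "+":
        mode, s = "plus", s[1:]
    elif s[0] == "-":
        mode, s = "minus", s[1:]
    # Collect the leading run of ASCII digits; anything after it is ignored.
    digits = ""
    for ch in s:
        if ch not in string.digits:
            break
        digits += ch
    if not digits:
        return None
    n = int(digits)
    if mode == "plus":
        n = 3 + n
    elif mode == "minus":
        n = 3 - n
    n = max(1, min(7, n))  # clamp to the valid range
    return KEYWORDS[n]
```

Under these rules legacy_font_size("-1") gives "small" (3 - 1 = 2), which is one step larger than "x-small".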
- I'm still figuring out what to do with the centering issues, but in the meantime I think I addressed everything else, fixed the link color wrapping to handle nested cases, skip if nowikis are added (and reported as a Parsoid bug), and fixed font-size calculation.
- And just to make sure, when you say "including User pages in your initial batches", you mean just the User namespace and not User talk as well? Legoktm (talk) 06:57, 3 February 2023 (UTC)
- Editing User talk space is fine (and desirable). – Jonesey95 (talk) 14:52, 3 February 2023 (UTC)
- (People put a lot of signed barnstars on their user pages, if any primary motivation was needed....) Izno (talk) 18:27, 3 February 2023 (UTC)
- In 1 and 2 where the center fix didn't work, it's because those blocks contain inline styles that override class=center's margin-left: auto; margin-right: auto;. I have a few ideas on how to account for this by parsing the inline styles, but I'm going to defer that for later and just make this another skip condition for now.
- Re: why not margin:1em auto, that's shorthand for margin-top: 1em; margin-bottom: 1em; margin-left: auto; margin-right: auto;. Since we're just trying to horizontally center, setting margin-top/bottom seems unnecessary.
- If editors intended for center to just center the table and not the text (I buy that) and we think that is an important property to preserve (seems reasonable), then I think we should set text-align: left (if text-align isn't set yet) to undo the impact of class=center having text-align:center. Setting text-align won't always be possible if the table is via a template, so it'll have to be best effort. Legoktm (talk) 04:34, 4 February 2023 (UTC)
  - Yes, centering is not trivial. That is why there are so many options and suggestions at mw:Help:Lint errors/obsolete-tag. It is above my pay grade to understand why a given fix will work in one case, but not in another. There could be bugs in there, but it is more likely that I just don't understand the nuance. Anyway, if you stick to known bullet-proof situations, you'll still be able to do a lot of replacements. Once those are done, we can look for patterns in what is left and address those in batches. There is still TONS of low-hanging fruit left for bots to fix. – Jonesey95 (talk) 04:57, 4 February 2023 (UTC)
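The margin shorthand mentioned above expands by the usual CSS rule: one value applies to all four sides, two values mean top/bottom then left/right, three mean top, left/right, bottom, and four mean top, right, bottom, left. A small Python sketch of that expansion follows, purely as an illustration; expand_margin is a hypothetical helper and it does no validation of the individual values.

```python
def expand_margin(value):
    """Expand a CSS margin shorthand into its four longhand properties."""
    parts = value.split()
    if len(parts) == 1:
        top = right = bottom = left = parts[0]
    elif len(parts) == 2:
        top = bottom = parts[0]
        right = left = parts[1]
    elif len(parts) == 3:
        top, right, bottom = parts
        left = right
    elif len(parts) == 4:
        top, right, bottom, left = parts
    else:
        raise ValueError("margin takes between one and four values")
    return {
        "margin-top": top,
        "margin-right": right,
        "margin-bottom": bottom,
        "margin-left": left,
    }
```

Calling expand_margin("1em auto") gives 1em for the top and bottom margins and auto for left and right, which matches the expansion given in the comment above; only the two auto values contribute to horizontal centering.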
Comment: It appears that you are filing a BRFA for obsolete HTML tags including font tags. If you are going to include font tags, please consider the following:
- Will there be any errors when fixing font tags, i.e. span tags with color outside a wikilink which is an error?
- Will your code for font tags be stronger than my regexes? My regexes' strength is currently in a ratio of about 1:1.55 - 1:1.6 pages (1:1.9 - 1:2 using these safe regexes).
Please consider the following, as I already filed a BRFA in October for font tags. In case you don't know, the ratio (which I call the edit-to-page ratio) is the percentage of edits made to a number of pages checked. Sheep (talk • he/him) 13:57, 24 January 2023 (UTC)
- It is valuable to have more than one bot in development for this particular task, since it is so large. – Jonesey95 (talk) 15:43, 24 January 2023 (UTC)
- There is also a relatively important difference between how the proposed LegoBot task and the proposed SheepLinterBot task handle the issue of "MalnadachBot makes way too many edits": LegoBot forces a "ratio" of 1:1 by only making edits to pages it can fix in one go, while SheepLinterBot is a finer-combed tool that reduces the number of edits by narrowing the task. There is room for both of them on-wiki, I think. casualdejekyll 20:05, 24 January 2023 (UTC)
- My understanding from both BRFAs is that each bot will abandon an edit if it is unable to fix all of the Linter errors that it sets out to fix, so each bot will have a ratio of less than 1:1 (pages fixed : pages examined). This BRFA's description says "if it is unable to fix everything, it will not edit the page". – Jonesey95 (talk) 21:10, 24 January 2023 (UTC)
- On a reread, I think me, you, and Sheep all used completely different ideas of what the ratio was supposed to be. Maybe not the greatest measurement, then. casualdejekyll 12:28, 25 January 2023 (UTC)
- @Sheep8144402: hi!
  - Yes, it should correctly mark up the inside of links with the correct color if necessary.
  - My font code (and all the other tags) uses an HTML parser. In theory it should cover all possible invocations and use cases because it looks at the structure of the tag rather than how it is laid out in text. The use of HTML increases the confidence in the fixes, but that also means it will miss e.g. commented out wikitext that a regex-based bot would catch (of course, commented out tags don't trigger lint errors in the first place!).
  - In the first 5,000 lint errors (not pages) I pulled, my bot prepared a little over 1,000 edits. That number will go down even further as we've changed the task description and guardrails to be more restrictive. Legoktm (talk) 02:21, 25 January 2023 (UTC)
- While MalnadachBot was busy with Task 13 in the last 6 months, I have substantially improved it. It is back to fixing Lint errors and no longer has issues with font tags raised in the BOTN discussion. That said, I support this and Sheep's bot task since all 3 of our bots work in different ways and have their own roles in bringing down Lint errors. I can run MalnadachBot on more complex patterns that are difficult to handle programmatically by the other 2 bots, in addition to overlapped scopes. I have fixed 11 million errors with MalnadachBot, putting a large dent in the backlog which was at 22 million when I started. Now the number is at 8.784 million, we can bring this down to less than a million by working together! ಮಲ್ನಾಡಾಚ್ ಕೊಂಕ್ಣೊ (talk) 13:06, 28 January 2023 (UTC)
- Thanks for doing this! I'd just like to note that according to MDN, <strike> may not be always equivalent to <s>; deleted content should be represented with <del> instead. As such, a wide-ranging replacement may not be appropriate: see also this previous task declined due to this issue. Best, EpicPupper (talk) 03:02, 4 February 2023 (UTC)
  - Ack, didn't know it had been previously declined. I'll strike (hah) it from the task description for now. (My rationale for including it in the first place was roughly the same as what I explained for tt.) Legoktm (talk) 02:21, 7 February 2023 (UTC)
Trial complete
Trial complete. Made 100 edits (takes a while to load). The console log is at /Trial log, showing that it did skip a bunch of pages it couldn't fix properly for whatever reason. After the first 3 edits I noticed it wasn't marking them as minor, which was fixed for the rest. It was editing at 6epm (10s between edits). Legoktm (talk) 07:44, 7 February 2023 (UTC)
- Initial note: I'm seeing log entries like "User talk:KGirlTrucker81/Signpost/Archive 3 still has some lint errors (missing-end-tag, stripped-tag), will be skipped". If the bot fixed every error that it came to fix, consider saving the edit. It won't be possible to fix all of the misnested tags, the stripped italics, and the missing end tags with a set of general regexes. That said, if you are concerned that the bot's edits may be introducing new Linter errors because of extremely broken syntax in the original text, skipping the save if there are any errors remaining is a safe move. It's up to you. I know that the other Lint-fixing bots do not hold themselves to this standard. – Jonesey95 (talk) 14:37, 7 February 2023 (UTC)
- I inspected all 100 edits and did not find any errors. I saw replacements of a few center and tt tags that worked properly, and many complex replacements of font tags that appeared to work perfectly. I recommend approval of this bot task. – Jonesey95 (talk) 15:41, 7 February 2023 (UTC)
- As far as I can see, this bot appears to be editing at a ratio of 1:2.91 or an edit % of 34.4%. Are you willing to make the code stronger so it can try to get all obsolete tag errors at once and therefore increase these statistics, or is that all you can do (the codes are at their strongest)? Sheep (talk • he/him) 17:11, 7 February 2023 (UTC)
- If the bot edits six times per minute, the ratio of skipped pages does not matter (I know that this runs somewhat contrary to what I said above). Once the bot has exhausted all possible edits, the remaining pages can be examined to see if the bot can and should apply fixes to those pages. – Jonesey95 (talk) 17:15, 7 February 2023 (UTC)
- @Sheep8144402: I don't think the concept of "stronger" really makes sense here. My philosophy here is to set restrictive guardrails on the bot so each edit it makes is worthwhile, aggressively skipping when that's not the case. The number or ratio of skips seems irrelevant.
- I'll add that the 6epm is the artificial rate limiter I imposed, it could easily edit much faster but I don't think that's a good idea. Legoktm (talk) 06:45, 8 February 2023 (UTC)
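The figures in this exchange are related by simple arithmetic, sketched below in Python. Both function names are hypothetical, and the 1,000,000 in the usage note is only the upper-bound page estimate from the filing, used here as an assumption rather than a measured number of edits.

```python
def edit_percentage(pages_checked_per_edit):
    """Convert a 1:N edit-to-page ratio into the percentage of pages edited."""
    return 100.0 / pages_checked_per_edit


def days_to_finish(total_edits, edits_per_minute):
    """Wall-clock days needed for a given number of edits at a fixed rate."""
    return total_edits / edits_per_minute / 60 / 24
```

A ratio of 1:2.91 corresponds to edit_percentage(2.91), about 34.4%. At 6 edits per minute, days_to_finish(1_000_000, 6) comes to roughly 116 days of continuous running, which suggests that the rate limiter, not the skip ratio, dominates how long the run takes.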
- @Jonesey95: The main reason it doesn't edit if there are remaining lint errors is that sometimes those issues cause bugs in the fixes (e.g. 1, 2 that you found earlier) so I'm not comfortable letting that run automated yet. In some cases the edit is totally fine because the misnested tag is after the obsolete tags or whatever (I looked at the proposed diff for the page you mentioned and it looks fine). I have some vague ideas for storing those edits for human review in a web dashboard or something, but that's for another BRFA. Legoktm (talk) 06:33, 8 February 2023 (UTC)
- I figured that is what you might be doing. Sounds safe. There will be plenty of pages edited in this first round. – Jonesey95 (talk) 06:41, 8 February 2023 (UTC)
The operator should probably modify the bot's edit summary to link to this BRFA. – Jonesey95 (talk) 17:17, 7 February 2023 (UTC)
- Good idea, once approved, I'll create a wiki page that describes the function details since I assume the specifics will change over time and this page will be archived. Legoktm (talk) 07:28, 13 February 2023 (UTC)
The elephant in the room is that there is an open RfC regarding this task. I assume that's why we're still waiting a week after the trial was completed (not that I was going to start this task while the RfC was outstanding). If that RfC changes the community's desires around these edits, my plan is to adjust the logic to only make edits when a visual change will happen (currently tt and link inside font). Legoktm (talk) 03:53, 14 February 2023 (UTC)
- Approved. Thank you for your considerations towards that elephant, though in looking through the discussion I suspect that nothing will change so you can make whatever modifications you personally feel are necessary but as far as the requested task goes I see no reason not to approve. Primefac (talk) 10:40, 8 March 2023 (UTC)
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at Wikipedia:Bots/Noticeboard.