Wikipedia:Reference desk/Archives/Computing/2023 June 20

From Wikipedia, the free encyclopedia
Computing desk


June 20


Locating a string from between identifiable characters


I've been using regular expressions to match certain strings only when they sit between two identifiable parts of a text. For example, the matched string in {{WPDENMARK|class=}} would be WPDENMARK, and the replacement would give {{WikiProject Denmark|class=}} if and only if there are two curly brackets to the left of WPDENMARK and either two curly brackets or a pipe to the right. I've been trying to write a regex that does this regardless of whether the curly brackets or the pipe are on completely different lines, or whether there are characters (or even whitespace) between them, and this has been a problem. Is there a commonly used pattern for accomplishing this? Nythar (💬-🍀) 11:42, 20 June 2023 (UTC)

Whether the parts are on separate lines matters only because of the regex implementation you are using: either it treats a newline as a stop or it reads to the end of the text. I will ignore that as it is not pertinent and assume that your regex implementation handles newline and return characters without trouble. So, you want {{. That is easy. Now, it might be followed by whitespace. That is {{\s*. Depending on your implementation, the whitespace class might be written \s, \\s, or the POSIX class [[:space:]], etc. Now, you want text. I am going to assume this text cannot contain a | or }. I would use {{\s*[^|}]+. But it looks like you are really looking for WP at the beginning, to replace it with WikiProject. So, I would use {{\s*WP[^|}]+. Now, it can end with | or }}. So, you give it the two options: {{\s*WP[^|}]+(\||}}). You can see that I had to escape the | when it isn't inside [ ]. But what if there is a space there? Then {{\s*WP[^|}]+\s*(\||}}). Now it has been matched, but you need parentheses to capture the text that follows WP for your replacement: {{\s*WP([^|}]+)\s*(\||}}). The first group is what you want to keep. The second group is the ending, either | or }}. In my implementation, I use \1 for the first group and \2 for the second, so I replace the match with {{WikiProject \1\2. Hopefully the editor doesn't mangle the characters. I tried to place everything in nowiki tags. 12.116.29.106 (talk) 13:01, 20 June 2023 (UTC)
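The pattern built up in the reply above can be sketched in Python as follows. This is only an illustration under assumptions: the thread never says which regex engine is in use, so Python's re module is a stand-in, the braces are escaped for safety, and the function name expand_banner is made up for the example.

```python
import re

# "{{", optional whitespace, "WP", the rest of the name (group 1),
# optional whitespace, then either "|" or "}}" (group 2).
PATTERN = re.compile(r"\{\{\s*WP([^|}]+)\s*(\||\}\})")


def expand_banner(text: str) -> str:
    """Rewrite {{WPNAME| or {{WPNAME}} as {{WikiProject NAME plus the same ending."""
    # \1 is the text after "WP"; \2 is the ending that was matched.
    return PATTERN.sub(r"{{WikiProject \1\2", text)


if __name__ == "__main__":
    print(expand_banner("{{WPDENMARK|class=}}"))
```

Note that the replacement reuses the captured text as written, so WPDENMARK becomes WikiProject DENMARK rather than WikiProject Denmark; fixing the capitalisation would need a separate step that the thread does not cover.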
Your suggestion is working well, but it also matches the pipe and the curly brackets. I was thinking of lookbehinds when I said I was writing a "regex that does this regardless of whether the curly brackets or the pipe are on completely different lines." My idea is to match only WPDENMARK and nothing else, using positive lookbehinds and lookaheads. I assumed this was possible. Do you happen to know of a way? Nythar (💬-🍀) 15:58, 20 June 2023 (UTC)
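One way the lookbehind and lookahead idea might look, again assuming Python's re module rather than whatever engine is actually in use. In that module a lookbehind must have a fixed width, so this sketch checks only a literal {{ on the left with no optional whitespace; the names NAME_ONLY and rename_banner are hypothetical.

```python
import re

# Match only "WPDENMARK":
#   (?<=\{\{)            must be preceded by "{{" (fixed width, as Python requires)
#   (?=\s*(?:\||\}\}))   must be followed by optional whitespace, then "|" or "}}"
NAME_ONLY = re.compile(r"(?<=\{\{)WPDENMARK(?=\s*(?:\||\}\}))")


def rename_banner(text: str) -> str:
    """Replace the banner name only; the surrounding brackets and pipe are untouched."""
    return NAME_ONLY.sub("WikiProject Denmark", text)


if __name__ == "__main__":
    print(rename_banner("{{WPDENMARK\n|class=}}"))
```

Because \s also matches a newline, the lookahead still succeeds when the pipe sits on the next line, provided the whole text is searched as one string.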
If you only want one or the other, you can use [|}] to match either a pipe or the first } of the pair of }s. That matches one and only one character. Place it in parentheses if you want to know what it was for later use. If you give an example of input that doesn't work and what you expected, it is easier to identify exactly what you want. 12.116.29.106 (talk) 16:38, 20 June 2023 (UTC)
Talk:Marija Karan is an example. WikiProject Biography is on one line, while the other parameters are all on separate lines (so neither | nor }} appears on the first line). In this case, I think a lookahead or lookbehind is needed, but for some reason it's quite difficult to write such a regex. I've been trying to take characters from both sides into account (using \n), but my attempts are failing certain tests. Nythar (💬-🍀) 17:07, 20 June 2023 (UTC)
It appears that the issue is that the regex engine is not parsing the page as one text, but line by line. If it parsed it as one text, \s would capture the newline character as whitespace and then | would follow. That is an issue with the parser, not with the regular expression. Is there an option in your parser to do two things? First, treat the text as one string, not as an array of lines. Second, match all occurrences, because there are two on that page. 12.116.29.106 (talk) 17:42, 20 June 2023 (UTC)
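The difference between searching the whole text and searching line by line can be illustrated with a small Python sketch. The sample text is hypothetical (it is not copied from Talk:Marija Karan), Python's re module stands in for the unnamed engine, and the group is made lazy here so that it does not swallow the trailing newline.

```python
import re

# Hypothetical talk-page text: two banners, each with parameters on separate lines.
TALK_PAGE = (
    "{{WPBiography\n"
    "|living=yes\n"
    "|class=Start\n"
    "}}\n"
    "{{WPSerbia\n"
    "|class=Start\n"
    "}}\n"
)

BANNER = re.compile(r"\{\{\s*WP([^|}]+?)\s*(\||\}\})")

# Whole text as one string: \s matches the newline, so the "|" on the next
# line is reached, and finditer reports every occurrence.
whole_text = [m.group(1) for m in BANNER.finditer(TALK_PAGE)]

# Line by line: the "|" is never on the same line as "{{WP...", so nothing matches.
per_line = [
    m.group(1)
    for line in TALK_PAGE.splitlines()
    for m in BANNER.finditer(line)
]

if __name__ == "__main__":
    print(whole_text)
    print(per_line)
```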
Dotall seems to be what I was missing. Regex is matching successfully now; hopefully it'll remain that way. Thank you. Nythar (💬-🍀) 11:36, 21 June 2023 (UTC)
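The thread does not show the pattern that needed dotall. As a guess at the kind of case involved, the sketch below assumes a pattern that relies on . to reach the pipe, and uses Python's re.DOTALL flag as the equivalent option. Both the sample text and the pattern are hypothetical.

```python
import re

# Hypothetical two-line banner; the real page content is not quoted in the thread.
TEXT = "{{WPBiography\n|living=yes}}"

# Assumed pattern that uses "." to get from the name to the pipe.
DOT_PATTERN = r"\{\{WP.*?\|"

# Without the flag, "." does not match "\n", so the search fails.
without_flag = re.search(DOT_PATTERN, TEXT)

# With re.DOTALL, "." also matches "\n", so the match can cross the line break.
with_flag = re.search(DOT_PATTERN, TEXT, re.DOTALL)

if __name__ == "__main__":
    print(without_flag)
    print(repr(with_flag.group(0)) if with_flag else None)
```

In Python the flag changes only the behaviour of the dot; \s and negated classes such as [^|}] match newlines with or without it.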