User:GreenC/testcases/autourl
Document describing auto-generated URLs, problems they cause for bots and suggested solutions.
For example, given this template source:
{{cite journal |title=The Discodermia calyx Toxin Calyculin A |last1=Edelson |first1=Jessica R. |last2=Brautigan |first2=David L. |date=24 January 2011 |journal=Toxins |volume=3 |issue=1 |pages=105–119 |doi=10.3390/toxins3010105 |doi-access=free |pmid=22069692 |pmc=3210456}}
Renders as:
- Edelson, Jessica R.; Brautigan, David L. (24 January 2011). "The Discodermia calyx Toxin Calyculin A". Toxins. 3 (1): 105–119. doi:10.3390/toxins3010105. PMC 3210456. PMID 22069692.
Note the title is linked to https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3210456 even though the URL is not in the template source, i.e. it is an auto-generated URL. Auto-generated URLs are generated by certain templates (such as {{cite journal}}) when certain conditions are met. For example, in this case the |url= field is empty, and a |doi= is specified along with |doi-access=free. There are other conditions, and the conditions may be in flux as community consensus changes. Additionally, the conditions may differ on each wiki language site, as local communities control how templates work on their wiki.
Generally, auto-generated URLs should be respected by bots and not overwritten by the inclusion of a |url=, since a bot cannot know which URL is better. Thus bots should detect the existence of an auto-generated URL before adding a hard-coded URL.
There are three possible ways to detect an auto-generated URL:
- 1. Program the bot to match the template behavior (conditions) as described above, e.g. if there is no |url= and it has a |doi= and |doi-access=free.
- 2. Web-scrape the HTML of the Wikipedia page where the template is located and check whether an auto-generated URL was rendered.
- 3. Use the MediaWiki API "parse" endpoint to convert the template into HTML and see if the title field has a URL attached.
The first is difficult and error-prone, as conditions may change at any time without documentation, and each of the 300+ language sites may have different conditions. The second is slow to load and hard to parse. The third is the most universal and stable, though the parsing is a little messy.
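To illustrate why the first approach is fragile, here is a minimal Python sketch that hard-codes only the single example condition mentioned above (empty |url=, a |doi=, and |doi-access=free). The function names `template_params` and `probably_auto_linked` are hypothetical, the parameter splitting is deliberately naive, and the real template logic has further conditions that are not modeled here.

```python
def template_params(template: str) -> dict:
    """Naively split a flat template into {name: value}.

    Assumption: the template has no nested templates or piped
    wikilinks, since either would break the split on "|".
    """
    inner = template.strip()[2:-2]  # drop the surrounding {{ and }}
    params = {}
    for part in inner.split("|")[1:]:  # first piece is the template name
        name, sep, value = part.partition("=")
        if sep:
            params[name.strip()] = value.strip()
    return params


def probably_auto_linked(template: str) -> bool:
    """Guess whether the title gets an auto-generated URL.

    Encodes only the one example condition from the text; any change
    to the template's rules silently invalidates this check.
    """
    p = template_params(template)
    return (
        not p.get("url")
        and bool(p.get("doi"))
        and p.get("doi-access") == "free"
    )
```

Every wiki and every rule change would need its own version of `probably_auto_linked`, which is the maintenance burden described above.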
The MediaWiki API "parse" command for this template is: https://en.wikipedia.org/w/api.php?action=parse&text=%7B%7Bcite%20journal%20%7Ctitle%3DThe%20Discodermia%20calyx%20Toxin%20Calyculin%20A%20%7Clast1%3DEdelson%20%7Cfirst1%3DJessica%20R.%20%7Clast2%3DBrautigan%20%7Cfirst2%3DDavid%20L.%20%7Cdate%3D24%20January%202011%20%7Cjournal%3DToxins%20%7Cvolume%3D3%20%7Cissue%3D1%20%7Cpages%3D105%E2%80%93119%20%7Cdoi%3D10.3390%2Ftoxins3010105%20%7Cdoi-access%3Dfree%20%7Cpmid%3D22069692%20%7Cpmc%3D3210456%7D%7D&contentmodel=wikitext
The &text= is a urlencoded copy of the template with any newlines removed. In practice one would also add &format=json to get a pure JSON result (the above is an HTML rendering of the JSON for visual debugging purposes). The API can be run on other wiki sites by changing the domain, for example tr.wikipedia.org for Turkish.
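A request URL of this shape could be built as in the following Python sketch. The function name `build_parse_url` is hypothetical, and replacing newlines with spaces (rather than deleting them outright) is an assumption made here so that adjacent parameters do not run together.

```python
from urllib.parse import quote, urlencode


def build_parse_url(template: str, domain: str = "en.wikipedia.org") -> str:
    """Return a MediaWiki API "parse" URL for a single template."""
    # Flatten the template to one line. Using a space instead of an
    # empty string is an assumption of this sketch.
    one_line = template.replace("\r", "").replace("\n", " ")
    params = {
        "action": "parse",
        "text": one_line,
        "contentmodel": "wikitext",
        "format": "json",
    }
    # quote_via=quote encodes spaces as %20 rather than "+",
    # matching the example URL in the text.
    return f"https://{domain}/w/api.php?" + urlencode(params, quote_via=quote)
```

Passing `domain="tr.wikipedia.org"` targets the Turkish site; the rest of the URL is unchanged.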
The JSON can be parsed to see if it contains an auto-generated URL.
Parsing is a two-step process.
- Extract the portion rft.atitle=The+Discodermia+calyx+Toxin+Calyculin+A and convert it to "The Discodermia calyx Toxin Calyculin A", i.e. the title text. This string can be identified as beginning with rft.atitle= and ending at &. Remove the leading portion rft.atitle= and the trailing &. It is urlencoded, so next urldecode it. Finally, HTML-encode any "&" to &amp;. Do not HTML-encode the whole string, just that character, though there may be others that need it (todo).
- Search if there is an <a href=""></a> associated with the title string, for example <a rel=\"nofollow\" class=\"external text\" href=\"//www.ncbi.nlm.nih.gov/pmc/articles/PMC3210456\">\"The Discodermia calyx Toxin Calyculin A\"</a>. If so, it is known the |title= is already linked by an auto-generated URL.
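The two steps can be sketched in Python as follows. This is an illustration rather than a tested bot implementation: the function name `find_title_link` is hypothetical, and it assumes the rendered HTML has already been taken out of the decoded JSON response (with the default JSON format, under `data["parse"]["text"]["*"]`), so the backslash-escaped quotes shown above have become plain quotes.

```python
import re
from urllib.parse import unquote_plus


def find_title_link(html: str):
    """Return the href wrapped around the citation title, or None."""
    # Step 1: take the text after "rft.atitle=" up to the next "&"
    # (or the closing quote of the attribute).
    m = re.search(r'rft\.atitle=([^&"]*)', html)
    if m is None:
        return None
    # urldecode ("+" becomes a space), then HTML-encode only "&".
    title = unquote_plus(m.group(1)).replace("&", "&amp;")
    if not title:
        return None
    # Step 2: look for an <a href="..."> whose text contains the title.
    link = re.search(
        r'<a\b[^>]*\bhref="([^"]*)"[^>]*>[^<]*'
        + re.escape(title)
        + r'[^<]*</a>',
        html,
    )
    return link.group(1) if link else None
```

A return value other than `None` means the title is already linked, so the bot should not add a |url=. Titles that render with inner markup such as italics are not handled by this sketch.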
Because API requests can slow a bot, this check can be done late in the process, after other checks have passed.
This method should work universally, in any language or wiki site, and remain stable regardless of changes to conditions that auto-generate a URL.