User:GreenC/testcases/bigtenorg

Steps to process bigten.org

This is a real request that was recently made. The steps below are exactly what I did to process it. They assume everything goes smoothly and there are no problems that require changes to the core code of the bot - which can happen frequently. This request had no URL transformations (step 5), which would normally require some additional code.

1. Request was created by a user:

	https://en.wikipedia.org/wiki/Wikipedia:Link_rot/URL_change_requests#bigten.org

2. Create a list of articles containing the domain and, at the same time, coin a new project name ('bigtenorg'):

	wikiget -a "insource:bigten insource:/bigten[.]org/" | shuf > bigtenorg.auth
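
   wikiget -a runs a MediaWiki "insource:" search and prints the matching article titles. As a rough sketch of the equivalent raw API call (the searchTitles helper, the user agent string and the srlimit of 50 are illustrative; wikiget itself handles continuation, throttling, etc.):

        import std/[httpclient, json, uri]

        # Sketch only: fetch the first batch of article titles matching an
        # insource: search via the MediaWiki search API.
        proc searchTitles(query: string): seq[string] =
          let client = newHttpClient(userAgent = "bigtenorg-example/0.1")
          defer: client.close()
          let url = "https://en.wikipedia.org/w/api.php?action=query&list=search&format=json" &
                    "&srnamespace=0&srlimit=50&srsearch=" & encodeUrl(query)
          let data = parseJson(client.getContent(url))
          for hit in data["query"]["search"]:
            result.add(hit["title"].getStr())

        when isMainModule:
          for title in searchTitles("insource:bigten insource:/bigten[.]org/"):
            echo title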

3. Create a skeleton source file that contains domain-specific changes:

	cp urlchanger_SKELETON_HARD.nim urlchanger_bigtenorg.nim

4. Edit urlchanger_bigtenorg.nim and modify basic domain information:

# --------- CONFIG START

Runme.urlchangerSum          = "[[WP:URLREQ#bigten.org]]"    # Edit summary

Runme.urlchangerDRe          = "bigten[.]org"                # Old name: hostname/domain/path - regex
                                                             #   Used to parse URLs from wikitext
Runme.urlchangerDDRe         = "bigten[.]org"                # Same as ^ - hostname/domain only - no path
Runme.urlchangerDPRe         = "bigten.org"                  # Same as ^ - no regex and no path
Runme.urlchangerNRe          = "bigten[.]org"                # New name - hostname/domain - regex
                                                             #   Used to identify when it's been switched to new URL
                                                             #   If DRe and NRe have the same values use the same entry for each
Runme.urlchangerNPRe         = "bigten.org"                  # Same as ^ - no regex
Runme.urlchangerNPPRe        = "[[Big Ten Conference]]"      # Wikitext to replace with when it finds NRe in metadata fields - [[]] OK
Runme.urlchangerNRPRe        = "Big Ten Conference"          # Plain text string to replace named refs - [[]] NOT OK
Runme.urlchangerTCRe         = &"(?i){mypipe}[^$]*[^$]*"
Runme.skipapicheckexception  = "bigten[.]org"

# --------- CONFIG END
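
   As a rough illustration of how a domain regex like urlchangerDRe can be used to pull candidate URLs out of wikitext (the bot's real parser is more involved; the candidateUrls helper and the sample links below are made up for this sketch):

        import std/re

        # Illustrative only - not the bot's own parser. Shows how a domain regex
        # such as urlchangerDRe ("bigten[.]org") can pick candidate URLs out of
        # a chunk of wikitext.
        let urlchangerDRe = "bigten[.]org"

        proc candidateUrls(wikitext, domainRe: string): seq[string] =
          # Wrap the configured domain regex in a full URL pattern and return
          # every match found in the wikitext.
          let pattern = re("https?://(?:[a-zA-Z0-9.-]*[.])?" & domainRe & r"[^ \]|<>]*")
          result = wikitext.findAll(pattern)

        when isMainModule:
          # Made-up example links, for illustration only.
          let sample = "<ref>[http://www.bigten.org/sports/m-footbl/ Football]</ref> " &
                       "and <ref>[https://bigten.org/news/ Release]</ref>"
          for u in candidateUrls(sample, urlchangerDRe):
            echo u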

5. Add code to do URL transformations.

        None required for this project.
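
   When a project does need one, a transformation is usually a regex rewrite from the old URL layout to the new one. A minimal sketch of the general shape - the transformUrl helper and the example.org path change are invented for illustration, not anything bigten.org required:

        import std/re

        # Hypothetical example only - bigten.org needed no transformation.
        # Rewrites "example.org/sports/<page>.html" to "example.org/news/<page>".
        proc transformUrl(url: string): string =
          result = url.replacef(re"https?://(?:www[.])?example[.]org/sports/([^ ]+)[.]html",
                                "https://example.org/news/$1")

        when isMainModule:
          echo transformUrl("http://www.example.org/sports/big10-sched.html")
          # --> https://example.org/news/big10-sched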

6. Compile the medic binary:

	lx -n bigtenorg 

7. Create project directories and files. The project name (-p) encodes the range of articles to process, i.e. run the bot on articles 1 to 1326 as listed in bigtenorg.auth created in step #2:

        wc bigtenorg.auth 
          1326
	projectm -c -p bigtenorg.0001-1326

8. Run the bot on 1,326 articles:

	runbot -n bigtenorg.0001-1326 -v medic-bigtenorg -r 8 -f auth

9. As the bot runs, check the logs for known trouble areas it discovers along the way, such as soft-404s.

10. Cancel the bot and add code to handle the discovered soft-404s, i.e. edit urlchanger_bigtenorg.nim and add the following code:

            # Soft-404 traps here:
            if newloc ~ ("^https?://" & GX.hostname & Runme.urlchangerDRe & "/?$") and newurl !~ ("^https?://" & GX.hostname & Runme.urlchangerDRe & "/?$"):
              sendlog(Project.syslog, CL.name, url & " ---- " & newloc & " ---- Redirect to home found ---- urlchanger7.1.3")
              return "DEADLINK"
            if newloc ~ ("^https?://" & GX.hostname & Runme.urlchangerDRe & "/mbb/?$") and newurl !~ ("^https?://" & GX.hostname & Runme.urlchangerDRe & "/mbb/?$"):
              sendlog(Project.syslog, CL.name, url & " ---- " & newloc & " ---- Redirect to mbb ---- urlchanger7.1.4")
              return "DEADLINK"

   The above code says that if a redirected URL lands on the site's homepage or on "/mbb", this indicates a soft-404 and the link should be treated as dead.
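
   To sanity-check patterns like these outside the bot, a small standalone test helps. A sketch of the same two redirect rules, with plain std/re matching standing in for the bot's "~" operator; the isSoft404 helper and the hard-coded hostname regex are assumptions, and the bot's additional comparison against newurl is omitted:

        import std/re

        # Standalone check of the soft-404 patterns, outside the bot.
        let domainRe = "(?:[a-zA-Z0-9.-]*[.])?bigten[.]org"   # assumed hostname regex

        proc isSoft404(newloc: string): bool =
          # Redirect landed on the homepage or on /mbb -> treat the link as dead.
          result = newloc.match(re("^https?://" & domainRe & "/?$")) or
                   newloc.match(re("^https?://" & domainRe & "/mbb/?$"))

        when isMainModule:
          for loc in ["https://bigten.org/",
                      "https://bigten.org/mbb/",
                      "https://bigten.org/sports/m-footbl/archive"]:
            echo loc, " soft-404: ", isSoft404(loc)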

11. Kill the original project and recreate it and re-run the bot:

	projectm -x -p bigtenorg.0001-1326
	projectm -c -p bigtenorg.0001-1326
	runbot -n bigtenorg.0001-1326 -v medic-bigtenorg -r 8 -f auth

12. Repeat steps #8-10 until it is running clear, then run to completion.

13. After completion, follow a lengthy manual process of checking for known problems that show up in the logs. Sample steps:

     (meta) if(-e logembway) cat logembway            
            # Check these - something went wrong
     (meta) grep fixcommentarchive syslog                
            # look at diffs for problems / see also the first "error" step why those didn't get fixed
     (meta) if(-e logradicalurl) cat logradicalurl | awk -F"----" '{print $3}'
            # check for legit archive URLs and add to logradicalurl() in medic.nim
     (meta) grep removearchive2 cbignore
            # check for embedded templates that should be added to encodeWiki()
     etc.

     Modify the bot code as needed and re-run any affected articles. To re-run a single article:

	bugm -n "Feudalism" -r 
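
   The grep/awk steps above key off the " ---- " separators in the bot's log lines (the same separators the sendlog() calls in step #10 write). The same field extraction in Nim - the field helper and the field index are assumptions about the log layout, not something documented here:

        import std/strutils

        # Pull one " ---- "-delimited field out of a log line, like the
        # awk -F"----" '{print $3}' step above.
        proc field(line: string, idx: int): string =
          let parts = line.split(" ---- ")
          result = (if idx < parts.len: parts[idx].strip() else: "")

        when isMainModule:
          let sample = "http://bigten.org/old-page ---- https://bigten.org/ ---- " &
                       "Redirect to home found ---- urlchanger7.1.3"
          echo field(sample, 2)   # --> Redirect to home found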

14. For new archive.today links, manually verify that each one is working, per the process outlined in the docs.
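
   The docs' verification process is not reproduced here. As a rough first pass, a sketch that merely checks a snapshot URL responds at all (the snapshotResponds helper and the snapshot URL are hypothetical, and archive.today may still need a real browser check, so this is not a substitute for the manual process):

        import std/[httpclient, httpcore]

        # Rough first-pass check only: does the snapshot URL respond with 200?
        # The manual process goes further (checking the rendered page content).
        proc snapshotResponds(url: string): bool =
          let client = newHttpClient(timeout = 15_000)
          defer: client.close()
          try:
            result = client.get(url).code == Http200
          except CatchableError:
            result = false

        when isMainModule:
          # Hypothetical snapshot URL, for illustration only.
          echo snapshotResponds("https://archive.today/newest/http://bigten.org/")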

15. Push 5 diffs up to Wikipedia

	push2wiki -s5

16. Manually verify that the diffs on Wikipedia look good and there are no problems.

17. Push the remaining diffs

	push2wiki -s0

18. For any articles with intervening edits by other users (edit conflicts), reprocess them and upload:

	push

19. Generate statistics and copy-paste them into the request from step #1. Add a {{done}} flag to the page:

	stats bigten.org