Wikipedia:Scripts/mwlink
This Ruby program has two modes: it can run as an HTTP daemon or as a text processor. Daemon mode is preferred, since it's more efficient.
In text-scanning mode, it interprets its command line (or stdin if no command line is given) as text possibly containing [[wikilinks]]. It preserves the original text and appends a plain-text hyperlink after each link (the http: address enclosed in <> angle brackets).
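The expansion step can be pictured with a much-reduced sketch. It assumes every link points at the English Wikipedia and ignores namespaces, language qualifiers and character-set conversion; the helper names simple_url and annotate are hypothetical and do not appear in the script.

```ruby
require 'cgi'

# Hypothetical helper (not part of mwlink): builds a URL on the English
# Wikipedia only, ignoring namespaces, languages and other wikis.
def simple_url(title)
  # Drop any |display text, then turn runs of spaces into one underscore.
  name = title.sub(/\|.*\z/, '').strip.squeeze(' ').tr(' ', '_')
  # Capitalize the first character, as MediaWiki does for page titles.
  name = name[0, 1].upcase + name[1..-1].to_s
  # CGI.escape also escapes ':' and '/', so put those back.
  'http://en.wikipedia.org/wiki/' +
    CGI.escape(name).gsub(/%3A/i, ':').gsub(/%2F/i, '/')
end

# Append " <url>" after every [[wikilink]], leaving the text itself untouched.
def annotate(text)
  text.gsub(/\[\[([^\]]+)\]\]/) { |link| "#{link} <#{simple_url($1)}>" }
end

puts annotate('See [[Ashland (disambiguation)]] for details.')
```

The real script does the same gsub-with-a-block pass, but delegates URL construction to its Canon class.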
In daemon mode, it receives HTTP requests like http://localhost:4242/mwlink?page=wiki-page-name and redirects to the appropriate Wikimedia page. It's convenient for scripts to just use that URL rather than constructing one themselves; all they have to do is URL-escape the text between [[ and ]].
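From a calling script's point of view, that amounts to one escape call. The sketch below assumes a daemon listening on localhost at the default port 4242; the constant MWLINK_BASE and the method mwlink_url are hypothetical names chosen for illustration.

```ruby
require 'cgi'

# Assumed location of a running mwlink daemon (4242 is its default port).
MWLINK_BASE = 'http://localhost:4242/mwlink'

# Turn the text between [[ and ]] into a URL that the daemon will redirect.
def mwlink_url(link_text, base = MWLINK_BASE)
  "#{base}?page=#{CGI.escape(link_text)}"
end

puts mwlink_url('Ashland (disambiguation)')
# prints http://localhost:4242/mwlink?page=Ashland+%28disambiguation%29
```

The caller never needs to know about namespaces, languages or wiki abbreviations; the daemon resolves them when it issues the redirect.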
#!/usr/bin/ruby
# This script is dual-licensed under the GPL version 2 or any later
# version, at your option. See http://www.gnu.org/licenses/gpl.txt for more
# details.
=begin
= NAME
mwlink - Linkify mediawiki-style wikilinks in plain text
= SYNOPSIS
mwlink [options] [text-to-wikilink]
--daemon[=port] Run as HTTP daemon
--encoding Default character set encoding (utf-8)
--default-wiki Default wiki (wikipedia)
--default-language Default language (en)
= DESCRIPTION
In text-scanning mode (without the --daemon argument), the mwlink program scans
its arguments (or its standard input, in the event of no arguments) for
wikilinks of the form [[link]]. It expands such links into URLs and inserts
them into the original text after the [[link]] in angle brackets ((({<})) and
(({>}))). Options are provided for specifying a default wiki (the wiki to link
to if no qualifier is given in the link) and a default language (the language
to assume if no qualifier is given) as well as the character set encoding in
use. The built-in defaults are ((*wikipedia*)), ((*en*)) and ((*utf-8*)),
respectively.
In daemon mode (now preferred), it receives HTTP requests of the form
"http://.../mwlink?page=((*wikipedia page*))" (the ((*wikipedia page*)) name is
what would appear within a [[wikilink]]). URL-escaping is required but no other
processing, making it convenient to use from scripts.
== Initialization File
The names of namespaces vary between wikis in different languages. For
example, "User:" in English is "Benutzer:" in German. You can
specify lists of namespaces to use for particular languages in an
initialization file (({~/.mwlinkrc})). Each entry is simply a line with the
language, a colon, and a space-separated list of namespaces in that
language. When interpreting links for that language (either because
((*--default-language*)) was specified or there is a language qualifier in
the link), mwlink will recognize those names as namespaces. All the
namespaces must appear on one line; line continuation is not supported.
Lines introduced with (({#})) (pound sign) are comments and
are ignored, along with blank lines.
Here is an example configuration containing (only) some namespaces from the
German Wikipedia. ((*Note*)): To be kind to the wiki when this script is
uploaded, I have broken the line, but it ((*must not be broken*)) in order
to work with mwlink.
de: Spezial Spezial_diskussion Diskussion Benutzer Benutzer_diskussion
Bild Bild_diskussion Einordnung Einordnung_diskussion Wikipedia
Wikipedia_talk WP Hilf Hilf_diskussion
= WARNINGS
* The program (like mediawiki) assumes links are not broken across line
boundaries.
* The mechanism for providing an alternate list of namespaces only works
per-language; other wikis could have different namespaces, too.
* The list of wikis and their abbreviations is doubtlessly incomplete.
* The initialization file mechanism is not that useful for a shared daemon.
* In command-line mode, it's very difficult to process ASCII em-dashes (--)
correctly and still honor command-line options. mwlink gets it wrong, and
that's one reason daemon mode is preferred.
= AUTHOR
Demi @ Wikipedia - http://en.wikipedia.org/wiki/User:Demi
=end
require 'cgi'
require 'iconv'
require 'getoptlong'
require 'webrick'
include WEBrick
$opt = {
'default-wiki' => 'wikipedia',
'default-language' => 'en',
'encoding' => 'utf-8'
}
class String
def initcap()
new = self.dup
# An empty string has no first character to capitalize
return new if new.empty?
# Okay, I consider it dumb that a string subscripted produces an
# integer --Demi
new[0] = new[0].chr.upcase
return new
end
def initcap!()
return self if empty?
self[0] = self[0].chr.upcase
return self
end
end
class Canon
def initialize()
@ns = { }
@ns_array = %w(Media Special Talk User User_talk Project Project_talk
Image Image_talk MediaWiki MediaWiki_talk Template Template_talk Help
Help_talk Category Category_talk Wikipedia Wikipedia_talk WP)
@ns['default'] = { }
@ns_array.each { |nspc| @ns['default'][nspc] = nspc }
if File::readable?(ENV['HOME'] + '/.mwlinkrc')
IO::foreach(ENV['HOME'] + '/.mwlinkrc') { |line|
next if line =~ /^\s*\#/
next if line =~ /^\s*$/
line.chomp!
if m = line.match(/^(\w+)\:(.*)$/)
lang = m[1]
nslist = m[2].split
@ns[lang] = { }
nslist.each { |nspc| @ns[lang][nspc] = nspc }
end
}
end
@wiki = {
'Wiktionary' => 'wiktionary',
'Wikt' => 'wiktionary',
'W' => 'wikipedia',
'M' => 'meta',
'N' => 'news',
'Q' => 'quote',
'B' => 'books',
'Meta' => 'meta',
'Wikibooks' => 'books',
'Commons' => 'commons',
'Wikisource' => 'source'
}
@wikispec = {
'wikipedia' => { 'domain' => 'wikipedia.org', 'lang' => 1 },
'wiktionary' => { 'domain' => 'wiktionary.org', 'lang' => 1 },
'meta' => { 'domain' => 'meta.wikimedia.org', 'lang' => 0 },
'books' => { 'domain' => 'wikibooks.org', 'lang' => 1 },
'commons' => { 'domain' => 'commons.wikimedia.org', 'lang' => 0 },
'source' => { 'domain' => 'sources.wikimedia.org', 'lang' => 0 },
'news' => { 'domain' => 'wikinews.org', 'lang' => 1 },
}
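# The converter is created lazily in canonword: Canon is instantiated
# before the command line is parsed, so creating it here would always
# use the default encoding and ignore --encoding.
@cs = nil
end
#TODO The % part of the # section of the URL should become a dot.
def urlencode(s)
CGI::escape(s).gsub(/%3[Aa]/, ':').gsub(/%2[Ff]/, '/').gsub(/%23/, '#')
end
def canonword(word)
s = word.strip.squeeze(' ').tr(' ', '_').initcap
begin
@cs ||= Iconv.new("iso-8859-1", $opt['encoding'])
@cs.iconv(s)
rescue Iconv::IllegalSequence
s
end
end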
def parselink(link)
l = {
'namespace' => '',
'language' => $opt['default-language'],
'wiki' => $opt['default-wiki'],
'title' => ''
}
terms = link.split(':')
l['title'] = canonword(terms.pop)
terms.each { |term|
next if term.nil? or term.empty?
t = canonword(term)
# Use the namespace list for the current language, else the default list
ns = @ns[l['language']] || @ns['default']
if ns.key?(t)
l['namespace'] = ns[t]
elsif @wiki.key?(t)
l['wiki'] = @wiki[t]
else
l['language'] = t.downcase
end
}
l
end
def canonicalize(link)
linkdesc = parselink(link.sub(/\|.*$/, ''))
if @wikispec.key?(linkdesc['wiki'])
ws = @wikispec[linkdesc['wiki']]
host = ws['domain']
if ws['lang'] != 0
host = linkdesc['language'] + '.' + host
end
else
host = linkdesc['wiki'] + '.' + 'wikimedia.org'
end
uri =
if linkdesc['namespace'].length > 0
linkdesc['namespace'] + ':' + linkdesc['title']
else
linkdesc['title']
end
r = urlencode('http://' + host + '/wiki/' + uri)
r
end
def to_s()
"Namespace sets: " + @ns.keys.join(', ') +
"; Wikis: " + @wiki.to_a.join(', ')
end
end
def linkexpand(c, bracketlink)
linktext =
if m = /\[\[([^\]]+)\]\]/.match(bracketlink)
m[1]
else
bracketlink
end
bracketlink +
" <" + c.canonicalize(linktext) + ">"
end
c = Canon.new()
re = /\[\[\s*[^\s\\][^\]]+\]\]/
class MwlinkServlet < HTTPServlet::AbstractServlet
def initialize(server, canonicalizer)
super(server)
@c = canonicalizer
end
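def do_GET(rq, rs)
p = CGI.parse(rq.query_string.to_s)
page = p['page'][0]
# Reject requests without a page parameter instead of raising
if page.nil? or page.empty?
rs.status = 400
rs.body = "Missing page parameter\n"
return
end
l = @c.canonicalize(page)
rs.status = 302
rs['Location'] = l
# Escape the page name so it cannot inject markup into the body
rs.body = "<html><body>\n" +
"<a href=\"#{l}\">#{CGI.escapeHTML(page)}</a>\n" +
"</body></html>\n"
end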
end
begin
GetoptLong::new(
['--default-wiki', GetoptLong::REQUIRED_ARGUMENT],
['--default-language', GetoptLong::REQUIRED_ARGUMENT],
['--encoding', GetoptLong::REQUIRED_ARGUMENT],
['--daemon', GetoptLong::OPTIONAL_ARGUMENT]
).each do |k, v|
k = k.sub(/^--/,'')
case k
when 'default-wiki', 'default-language', 'encoding'
$opt[k] = v
when 'daemon'
$opt['daemon'] = true
if v.empty?
$opt['port'] = 4242
else
$opt['port'] = v
end
end
end
rescue GetoptLong::InvalidOption
true
end
if $opt['daemon']
port = $opt['port'].to_i
puts "Starting daemon on port #{port}"
s = HTTPServer.new(:Port => port)
s.mount("/mwlink", MwlinkServlet, c)
trap('INT') { s.shutdown }
s.start
else
# Note, there are various combinations of -- appearing in normal text that
# will break this. --daemon is the recommended method.
if ARGV.empty?
STDIN.each_line { |line|
puts line.chomp.gsub(re) { |expr| linkexpand(c, expr) }
}
else
puts ARGV.join(' ').gsub(re) { |expr| linkexpand(c, expr) }
end
end
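The initialization file format described in the script's documentation is simple enough to parse in a few lines. The following is a minimal sketch that mirrors the loop in Canon#initialize but reads from a string rather than from ~/.mwlinkrc, and returns plain arrays instead of the script's hashes. The method name parse_mwlinkrc and the sample text are hypothetical.

```ruby
# Parse initialization-file text into { language => [namespace, ...] }.
# parse_mwlinkrc is a hypothetical name; mwlink does this inline.
def parse_mwlinkrc(text)
  namespaces = {}
  text.each_line do |line|
    next if line =~ /^\s*#/   # comment line
    next if line =~ /^\s*$/   # blank line
    if (m = line.chomp.match(/^(\w+):(.*)$/))
      namespaces[m[1]] = m[2].split
    end
  end
  namespaces
end

rc = <<~RC
  # German namespaces (must be on a single line)
  de: Spezial Diskussion Benutzer Benutzer_diskussion

RC

p parse_mwlinkrc(rc)
```

Because each language is matched by a single regular expression per line, a namespace list that wraps onto a second line is silently dropped, which is why the documentation insists on one line per language.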
Example output:
[[Ashland (disambiguation)]] is an example of a [[Wikipedia:Disambiguation]] page.
[[Ashland (disambiguation)]] <http://en.wikipedia.org/wiki/Ashland_%28disambiguation%29> is an example of a [[Wikipedia:Disambiguation]] <http://en.wikipedia.org/wiki/Wikipedia:Disambiguation> page.
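The second link in the example carries a namespace qualifier. As a rough sketch of how parselink sorts the colon-separated parts of a link: the last part is the title, and each earlier part is tried as a namespace, then as a wiki abbreviation, and is otherwise taken to be a language code. The tables below are deliberately tiny, hypothetical subsets of the script's lists, the method name classify is made up, and the title is left unnormalized.

```ruby
# Tiny illustrative subsets of mwlink's tables (assumptions, not the full lists).
NAMESPACES = %w(Talk User Wikipedia Category)
WIKIS = { 'Wikt' => 'wiktionary', 'M' => 'meta' }

def classify(link)
  parts = link.split(':')
  result = { 'namespace' => '', 'language' => 'en', 'wiki' => 'wikipedia',
             'title' => parts.pop.to_s.strip }
  parts.each do |part|
    word = part.strip
    next if word.empty?
    # Qualifiers are matched with their first letter capitalized.
    word = word[0, 1].upcase + word[1..-1].to_s
    if NAMESPACES.include?(word)
      result['namespace'] = word
    elsif WIKIS.key?(word)
      result['wiki'] = WIKIS[word]
    else
      # Anything unrecognized is assumed to be a language code.
      result['language'] = word.downcase
    end
  end
  result
end

p classify('Wikipedia:Disambiguation')
p classify('wikt:de:Haus')
```

One consequence of this fallback is that a misspelled namespace is treated as a language, so it ends up in the host name rather than raising an error.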
GET http://localhost:4242/mwlink?page=Ashland+%28disambiguation%29
GET http://localhost:4242/mwlink?page=Ashland+%28disambiguation%29 --> 302 Found
GET http://en.wikipedia.org/wiki/Ashland_%28disambiguation%29 --> ...(page content)
The GET program is a utility distributed with Perl's libwww-perl (LWP). Note, however, that Wikimedia servers forbid scripts based on the LWP Perl module.