Jump to content

User:AnomieBOT/source/d/Nowiki.pm

From Wikipedia, the free encyclopedia
package d::Nowiki;

use utf8;
use strict;
use Carp;
use Digest::SHA qw/sha256_base64/;
use AnomieBOT::API;

=pod

=head1 NAME

d::Nowiki - AnomieBOT decorator for nowiki (and other content) stripping

=head1 SYNOPSIS

 use AnomieBOT::API;

 $api = new AnomieBOT::API('conf.ini', 1);
 $api->decorators(qw/d::Nowiki/);

=head1 DESCRIPTION

C<d::Nowiki> contains utility functions for manipulating nowiki tags in the
page, for use by an AnomieBOT task. When "d::Nowiki" is used as a decorator on
the API object, the following methods are available.

=head1 METHODS PROVIDED

=over

=item $api->get_token_for( $content )

Returns the opaque token to be used for the specified content. Modulo hash
collisions, the same token will not be returned for different contents; the
same token will be returned for identical values of content.

=cut

sub get_token_for {
    my $api=shift;
    my $content=shift;
    utf8::encode( $content ) if utf8::is_utf8( $content );
    my $tag="\x02".sha256_base64($content)."\x03";
    $tag=~tr!+/=!-_!d;
    return $tag;
}

=pod

=item $api->get_token_regex()

Returns a regular expression that will match tokens.

=cut

sub get_token_regex {
    return qr/\x02[a-zA-Z0-9_-]+\x03/;
}

=pod

=item $api->strip_regex( $re, $wikitext )

=item $api->strip_regex( $re, $wikitext, \%mapping )

Given some wikitext and a regular expression, replaces all text matching the
regex with an opaque token. If the optional C<\%mapping> hashref is provided,
the mappings of the tokens back to the removed text will be added to that
hashref (and that hashref will be returned as the second value).

Returns a list of two values: The wikitext with matches replaced and a hashref 
mapping tokens back to the removed text. In a scalar context, returns just the
text.

Notes:

=over

=item *

The returned text may contain unusual characters such as C<\x02> and C<\x03>.
MediaWiki doesn't like to save these, so this should help prevent your saving a
broken page.

=item *

If you apply this function multiple times, it would be best to apply 
C<< $api->replace_stripped >> in the reverse order.

=back

=cut

sub strip_regex {
    my $api=shift;
    my $re=shift;
    my $wikitext=shift;
    my $mapping=shift // {};

    if(defined($wikitext)){
        my $func=sub {
            my $x=shift;
            $x=$api->replace_stripped($x,$mapping);
            my $tag=$api->get_token_for($x);
            $mapping->{$tag}=$x;
            return $tag;
        };
        $wikitext=~s{$re}{ &$func(substr($wikitext,$-[0],$+[0]-$-[0])) }ge;
    } else {
        carp "\$wikitext is undefined";
    }
    return wantarray ? ($wikitext, $mapping) : $wikitext;
}

=pod

=item $api->replace_stripped( $wikitext, \%mapping )

=item $api->replace_stripped( $wikitext, \%mapping, ... )

Searches C<$wikitext> for tokens defined in the mapping hashrefs, and replaces
them with the original text. If multiple mapping hashrefs are given, they are
processed in the order given. In other words, this is correct:

 ($text,$mapping1) = $api->strip_regex($re1, $text);
 ($text,$mapping2) = $api->strip_regex($re2, $text);
 $text = $api->replace_stripped($text, $mapping2, $mapping1);

Returns the corrected wikitext.

=cut

sub replace_stripped {
    my $api=shift;
    my $wikitext=shift;

    if(defined($wikitext)){
        my $re=$api->get_token_regex();
        while(@_){
            my $mapping=shift;
            $wikitext=~s!($re)! $mapping->{$1} // $1 !ge;
        }
    } else {
        carp "\$wikitext is undefined";
    }
    return $wikitext;
}

=pod

=item $api->strip_tags( \@tags, $wikitext )

=item $api->strip_tags( \@tags, $wikitext, \%mapping )

Given some wikitext, replaces the specified XML-like tags with an opaque
token. To also strip comments, include the special tag "!--". You can then
process the wikitext without worrying about affecting things you shouldn't and
call C<< $api->replace_stripped >> when done.

Returns the same as C<strip_regex>, which in fact it uses internally.

Note that this doesn't handle L<#tag|mw:Help:Magic_words#Miscellaneous>s. Use
C<strip_templates> from A<d::Templates> for that.

=cut

sub strip_tags {
    my $api=shift;
    my $tags=shift;

    my $comment=grep $_ eq '!--', @$tags;
    my @tags=grep $_ ne '!--', @$tags;
    my @re=();
    push @re, '<('.join("|", @tags).')(?:\s[^>]*)?(?:/>|(?<!/)>.*?(?:</\g{-1}\s*>))' if @tags;
    push @re, '<!--.*?(?:-->|$)' if $comment;
    push @re, '(?!)' unless @re;
    my $re=join('|',@re);

    return $api->strip_regex(qr($re)si, @_);
}

=pod

=item $api->extension_tags( )

=item $api->extension_tags( $reload )

Return the current list of XML-like extension tags, plus "!--".

=cut

sub extension_tags {
    my $api=shift;
    my $reload=shift//0;
    my $memc = $api->cache;

    $memc->delete('$d::Nowiki::extension_tags') if $reload;
    my $ret = $memc->get('$d::Nowiki::extension_tags');
    return @$ret if $ret;

    my $res=$api->query(meta=>'siteinfo',siprop=>'extensiontags');
    if($res->{'code'} ne 'success'){
        # Crap. Guess.
        return qw/!-- pre nowiki gallery indicator langconvert graph timeline heiro source syntaxhighlight score templatestyles templatedata charinsert ref references inputbox imagemap poem categorytree section math ce chem maplink mapframe/;
    }

    my @ret=('!--');
    push @ret, map { s/[<>]//g; $_; } @{$res->{'query'}{'extensiontags'}};
    $memc->set('$d::Nowiki::extension_tags', [@ret], 86400);
    return @ret;
}

=pod

=item $api->strip_nowiki( $wikitext )

=item $api->strip_nowiki( $wikitext, \%mapping )

Given some wikitext, replaces nowikis, pres, and comments with an opaque
token. You can then process the wikitext without worrying about affecting
things you shouldn't and call C<< $api->replace_nowiki >> when done.

Returns the same as C<strip_tags>, which in fact it uses internally.

=item $api->replace_nowiki( $wikitext, $nowiki_hashref )

Exists as an alias for C<replace_stripped> for backwards compatibility.

=cut

sub strip_nowiki {
    my $api=shift;

    return $api->strip_tags([qw/!-- nowiki pre source syntaxhighlight/], @_);
}

sub replace_nowiki {
    return replace_stripped(@_);
}

1;

=pod

=back

=head1 COPYRIGHT

Copyright 20082013 Anomie

This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.