注册
登录

您现在的位置是:首页 > 学无止境

How to avoid Unicode pitfalls

木木彡82 2014-02-18 10:31:56 599人围观
转载自:http://showmetheco.de/articles/2010/10/how-to-avoid-unicode-pitfalls-in-mojolicious.html

Unicode is hard.Unicode in Perl is even harder,becaus...

转载自:http://showmetheco.de/articles/2010/10/how-to-avoid-unicode-pitfalls-in-mojolicious.html

Unicode is hard.Unicode in Perl is even harder,because sometimes Perl is just too smart.While Mojolicious is a web framework,no wonder why it should support Unicode really well.But even if Mojolicious tries hard to make things easy for a developer,one must really understand what's going on behind the scene.

There is a really good documentation on using Unicode in Perl.Raise your hand if you haven't read perluniintro,perlunicode,perlunifaq or perlunitut.I knew that.

A byte and a character is not the same thing.Sometimes they have the same length (all ASCII characters),but most of the time (other Unicode characters) their length is different.For example,the dollar character $ is 1 byte long,but euro is 3.In Perl we want to work with characters.

For decoding bytes to characters and encoding characters to bytes we use Encode module.I kept always forgetting what is encoding and what is decoding.But if you think for a minute and imagine this evil world that lives in bytes,and you want to take something from it and bring it to Perl,you have to decode it,so you can understand it.And otherwise when you have to send something you have to encode your stuff.

There is also utf8 pragma.Everybody keeps thinking that if you use it,you don't have any problems,that Perl will do everything for you.But the problem is that utf8 pragma must be used ONLY when you have unicode characters in your module or perl file.It doesn't take care of file handles or databases.

Back to the evil world.It lives in bytes.When using Mojolicious client's request comes in bytes and Mojolicious answers in bytes too.To make things easy for a developer everything inside is automatically decoded into Perl characters.This way you have parameters,captures and stash values all in Perl characters.You can run regexes on them,sort them,count their length etc.

When a developer bypasses the Mojolicious mechanism of decoding,like opens a file or opens a database connection,he has to be sure to decode octets to Perl characters (i.e.,using ':utf8' for file handles,Mojo::ByteStream for strings,or database modules internal methods).

Below are some examples that show what is right and what is wrong to do in Mojolicious.

EXAMPLES

Mojo::JSON

decode accepts a sequence of octets (not characters!),calculates a correct encoding (UTF-8,UTF-16 etc) and returns a Perl arrayref or hashref with correctly decoded values.

use utf8;

# Wrong
$json->decode('{"foo":"ü"}');

# Right
$json->decode(b('{"foo":"ü"}')->encode('UTF-8'));
encode accepts a Perl structure with correctly decoded values and returns octets (not bytes) in UTF-8.
use utf8;

# Wrong
$json->encode(foo => 'ü') . 'ü';

# Right
b($json->encode(foo => 'ü'))->decode('UTF-8') . 'ü';

Mojo::DOM

parse accepts Perl characters OR octets, detecting and decoding them automatically.

use utf8;

# Right
$dom->parse(b('ü')->encode('UTF-8'));

# Right too
$dom->parse('ü');
And returns Perl characters and thus must be encoded before printing it to the console for example.
# Wrong
print $dom->at('a')->text;

# Right
print b($dom->at('a')->text)->encode('UTF-8');

ojo

When crawling a web site, don't forget to encode the result before printing it to console. Mojo::ByteStream's say method not only adds a newline character, but also automatically encodes the data.

# Wrong
perl -Mojo -e 'print g("mojolicio.us")->dom->at("title")->text'

# Right
perl -Mojo -e 'b(g("mojolicio.us")->dom->at("title")->text)->say'

SEE ALSO

http://www.slideshare.net/Penfold/perl-and-unicode


文章评论

  • 登录后评论

点击排行