Why does modern Perl avoid UTF-8 by default?

I wonder why most modern solutions built using Perl don't enable UTF-8 by default.

I understand there are many legacy problems for core Perl scripts, where it may break things. But, from my point of view, in the 21st century, big new projects (or projects with a long-term perspective) should make their software UTF-8 proof from scratch. Still, I don't see it happening. For example, Moose enables strict and warnings, but not Unicode. Modern::Perl reduces boilerplate too, but does no UTF-8 handling.

Why? Are there some reasons to avoid UTF-8 in modern Perl projects in the year 2011?

My comment to @tchrist got too long, so I'm adding it here.

It seems that I did not make myself clear. Let me try to add some things.

tchrist and I see the situation pretty similarly, but our conclusions are at completely opposite ends. I agree, the situation with Unicode is complicated, but this is why we (Perl users and coders) need some layer (or pragma) which makes UTF-8 handling as easy as it must be nowadays.

tchrist pointed to many aspects to cover, and I will read and think about them for days or even weeks. Still, this is not my point. tchrist tries to prove that there is not one single way "to enable UTF-8". I don't have enough knowledge to argue with that. So, I stick to live examples.

I played around with Rakudo and UTF-8 was just there as I needed. I didn't have any problems, it just worked. Maybe there are some limitations somewhere deeper, but at the start, everything I tested worked as I expected.

Shouldn't that be a goal in modern Perl 5 too? To stress the point: I'm not suggesting UTF-8 as the default character set for core Perl, I'm suggesting the possibility to trigger it with a snap for those who develop new projects.

Another example, but with a more negative tone. Frameworks should make development easier. Some years ago, I tried web frameworks, but just threw them away because "enabling UTF-8" was so obscure. I could not find how and where to hook in Unicode support. It was so time-consuming that I found it easier to go the old way. Now I see there was a bounty here to deal with the same problem in Mason 2: "How to make Mason2 UTF-8 clean?". So, it is a pretty new framework, but using it with UTF-8 needs deep knowledge of its internals. It is like a big red sign: STOP, don't use me!

I really like Perl. But dealing with Unicode is painful. I still find myself running into walls. In a way tchrist is right and answers my question: new projects don't adopt UTF-8 because it is too complicated in Perl 5.

Reposted from: https://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default

There's a truly horrifying amount of ancient code out there in the wild, much of it in the form of common CPAN modules. I've found I have to be fairly careful enabling Unicode if I use external modules that might be affected by it, and am still trying to identify and fix some Unicode-related failures in several Perl scripts I use regularly (in particular, iTiVo fails badly on anything that's not 7-bit ASCII due to transcoding issues).

I think you misunderstand Unicode and its relationship to Perl. No matter which way you store data, Unicode, ISO-8859-1, or many other things, your program has to know how to interpret the bytes it gets as input (decoding) and how to represent the information it wants to output (encoding). Get that interpretation wrong and you garble the data. There isn't some magic default setup inside your program that's going to tell the stuff outside your program how to act.
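To make the decoding/encoding point concrete, here is a minimal sketch using the core Encode module. The sample bytes and variable names are only illustrative assumptions; the point is that the same octets give different text depending on how the program is told to interpret them.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Octets as they might arrive from a file or socket: "cafe" with an
# accented e, encoded as UTF-8 (the accented e takes two octets).
my $utf8_bytes = "caf\xC3\xA9";

# Right interpretation: decode the octets as UTF-8 to get 4 characters.
my $right = decode('UTF-8', $utf8_bytes);

# Wrong interpretation: the same octets read as ISO-8859-1 become
# 5 characters, which is the familiar garbled output.
my $wrong = decode('ISO-8859-1', $utf8_bytes);

# Encoding is the reverse step, chosen per destination.
my $latin1_out = encode('ISO-8859-1', $right);   # 4 octets
my $utf8_out   = encode('UTF-8', $right);        # 5 octets

printf "%d %d %d %d\n",
    length($right), length($wrong), length($latin1_out), length($utf8_out);
# prints "4 5 4 5"
```

Nothing in the octets themselves says which interpretation is right; that knowledge has to come from the programmer.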

You think it's hard, most likely, because you are used to everything being ASCII. Everything you should have been thinking about was simply ignored by the programming language and all of the things it had to interact with. If everything used nothing but UTF-8 and you had no choice, then UTF-8 would be just as easy. But not everything does use UTF-8. For instance, you don't want your input handle to think that it's getting UTF-8 octets unless it actually is, and you don't want your output handles to be UTF-8 if the thing reading from them can't handle UTF-8. Perl has no way to know those things. That's why you are the programmer.
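As a sketch of what telling each handle its encoding looks like, the example below uses in-memory filehandles so it runs without any files. The sample data and variable names are made up for illustration; real code would open actual files or sockets with the same layer syntax.

```perl
use strict;
use warnings;

# The same word stored under two different encodings.
my $latin1_data = "na\xEFve\n";        # ISO-8859-1 octets
my $utf8_data   = "na\xC3\xAFve\n";    # UTF-8 octets

# Each input handle is told what its source actually contains.
open(my $in_latin1, '<:encoding(ISO-8859-1)', \$latin1_data) or die $!;
open(my $in_utf8,   '<:encoding(UTF-8)',      \$utf8_data)   or die $!;
chomp(my $from_latin1 = <$in_latin1>);
chomp(my $from_utf8   = <$in_utf8>);
close $in_latin1;
close $in_utf8;

# Once decoded, both are the same 5-character string.
print "same text\n" if $from_latin1 eq $from_utf8;

# Each output handle is told what its consumer expects.
my ($utf8_buf, $latin1_buf) = ('', '');
open(my $out_utf8,   '>:encoding(UTF-8)',      \$utf8_buf)   or die $!;
open(my $out_latin1, '>:encoding(ISO-8859-1)', \$latin1_buf) or die $!;
print {$out_utf8}   $from_utf8;
print {$out_latin1} $from_utf8;
close $out_utf8;
close $out_latin1;

# The text is identical; the octets written are not.
printf "%d octets as UTF-8, %d octets as ISO-8859-1\n",
    length($utf8_buf), length($latin1_buf);
```

The layers are set per handle because the right answer differs per handle, which is exactly why a single global switch cannot get it right for you.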

I don't think Unicode in Perl 5 is too complicated. I think it's scary and people avoid it. There's a difference. To that end, I've put Unicode in Learning Perl, 6th Edition, and there's a lot of Unicode stuff in Effective Perl Programming. You have to spend the time to learn and understand Unicode and how it works. You're not going to be able to use it effectively otherwise.

We're all in agreement that it is a difficult problem for many reasons, but that's precisely the reason to try to make it easier on everybody.

There is a recent module on CPAN, utf8::all, that attempts to "turn on Unicode. All of it".
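For comparison, this is roughly the kind of boilerplate one writes by hand with core pragmas when such a module is not used. It is only a sketch and not a description of what utf8::all actually does internally; the `decode_args` helper is hypothetical, and it assumes the environment passes UTF-8.

```perl
use strict;
use warnings;
use utf8;                               # string literals in this file are UTF-8
use open qw(:std :encoding(UTF-8));     # STDIN/STDOUT/STDERR and default open() layers
use feature 'unicode_strings';          # Unicode semantics for all strings
use Encode qw(decode);

# Hypothetical helper: command-line arguments arrive as octets, so they
# must be decoded explicitly (assuming the shell passes UTF-8).
sub decode_args {
    my @octets = @_;
    return map { decode('UTF-8', $_) } @octets;
}

# Stand-in for @ARGV so the sketch runs on its own.
my @args = decode_args("caf\xC3\xA9");
print length($args[0]), "\n";   # counts characters, not octets
```

Every line here is a separate decision about a separate boundary (source code, standard streams, arguments), which is why bundling them behind one pragma is attractive for new projects.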

As has been pointed out, you can't magically make the entire system (outside programs, external web requests, etc.) use Unicode as well, but we can work together to make sensible tools that make solving common problems easier. That's the reason that we're programmers.

If utf8::all doesn't do something you think it should, let's improve it. Or let's make additional tools that together can suit people's varying needs as well as possible.

While reading this thread, I often get the impression that people are using "UTF-8" as a synonym for "Unicode". Please make a distinction between Unicode's code points, which are an enlarged relative of the ASCII code, and Unicode's various encodings. There are several of those: UTF-8, UTF-16 and UTF-32 are the current ones, and a few more are obsolete.

Please remember that UTF-8 (like all other encodings) exists and has meaning only on input or output. Internally, since Perl 5.8.1, all strings are kept as Unicode code points. True, you have to enable some features, as admirably covered previously.
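A small sketch of that distinction: one code point, three encodings, three different octet sequences. The `hex_octets` helper is a hypothetical name introduced just for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character, one code point: U+20AC EURO SIGN.
my $euro = "\x{20AC}";

# Hypothetical helper: hex dump of the octets a given encoding produces.
sub hex_octets {
    my ($encoding, $string) = @_;
    my $octets = encode($encoding, $string);
    return join ' ', map { sprintf '%02X', $_ } unpack('C*', $octets);
}

printf "code point: U+%04X\n", ord($euro);
for my $encoding ('UTF-8', 'UTF-16BE', 'UTF-32BE') {
    printf "%-8s => %s\n", $encoding, hex_octets($encoding, $euro);
}
# code point: U+20AC
# UTF-8    => E2 82 AC
# UTF-16BE => 20 AC
# UTF-32BE => 00 00 20 AC
```

Inside the program `$euro` is a single character with `length` 1; the octets only come into existence when the string is encoded for output.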

You should enable the unicode_strings feature, which is enabled implicitly if you say use v5.14;
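Here is a short sketch of what that feature changes, assuming Perl 5.14 or later and no locale pragma in effect. It compares the same one-byte string inside and outside the feature's lexical scope; the variable names are only illustrative.

```perl
use strict;
use warnings;

# U+00E9 (e with acute) fits in one byte, so Perl keeps it in its
# native 8-bit form unless something upgrades the string.
my $e_acute = "\xE9";

my ($uc_without, $word_without, $uc_with, $word_with);
{
    no feature 'unicode_strings';
    # Native 8-bit semantics: only ASCII letters are cased or match \w.
    $uc_without   = uc $e_acute;
    $word_without = ($e_acute =~ /\w/) ? 1 : 0;
}
{
    use feature 'unicode_strings';
    # Unicode semantics regardless of how the string is stored.
    $uc_with   = uc $e_acute;
    $word_with = ($e_acute =~ /\w/) ? 1 : 0;
}

printf "without: uc=%02X word=%d\n", ord($uc_without), $word_without;
printf "with:    uc=%02X word=%d\n", ord($uc_with),    $word_with;
# without: uc=E9 word=0
# with:    uc=C9 word=1
```

This is the behaviour perlunicode calls "The Unicode Bug": without the feature, the result depends on the string's internal representation rather than on its content.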

You should not really use Unicode identifiers via utf8, especially for foreign code, as they are insecure in perl5; only cperl got that right. See e.g. http://perl11.org/blog/unicode-identifiers.html

Regarding utf8 for your filehandles/streams: you need to decide the encoding of your external data yourself. A library cannot know that, and since not even libc supports utf8, proper utf8 data is rare. There's more wtf8, the Windows aberration of utf8, around.

BTW: Moose is not really "Modern Perl", they just hijacked the name. Moose is perfect Larry Wall-style postmodern perl mixed with Bjarne Stroustrup-style anything goes, with an eclectic aberration of proper perl6 syntax, e.g. using strings for variable names, a horrible fields syntax, and a very immature, naive implementation which is 10x slower than a proper implementation. cperl and perl6 are the true modern perls, where form follows function, and the implementation is reduced and optimized.