Just some random musings on Unicode, from Joel's blog:
Back in the semi-olden days, everything was very simple. EBCDIC was on its way out. The only
characters that mattered were good old unaccented English letters, and
we had a code for them called ASCII which was able to represent every character using a number between 32 and 127. Space was 32, the letter "A" was 65, etc.
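To make those numbers concrete, here is a minimal Python sketch (my own illustration, not from Joel's post) that looks up the two codes mentioned above and lists the printable range. The variable names are made up for the example, and I stop at 126 because code 127 is the non-printing DEL control character.

```python
# ASCII assigns each character a small integer code.
space_code = ord(" ")   # code for the space character
a_code = ord("A")       # code for the capital letter A
print(space_code, a_code)

# Going the other way: every code from 32 through 126 maps back to one
# printable character (127 is DEL, a control code, so it is skipped here).
printable = "".join(chr(n) for n in range(32, 127))
print(printable)

# All of these characters survive a round trip through the ASCII codec.
assert printable.encode("ascii").decode("ascii") == printable
```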
The IBM-PC had something that came to be known as the OEM character set
which provided some accented characters for European languages and a bunch of line drawing characters,
horizontal bars, vertical bars, horizontal bars with little
dingle-dangles dangling off the right side, etc., and you could use
these line drawing characters to make spiffy boxes and lines on the
screen, which you can still see running on the 8088 computer at your dry
cleaners'.
In fact, as soon as people started buying PCs outside of America, all kinds of
different OEM character sets were dreamed up, which all used the top 128
characters for their own purposes. For example, on some PCs the
character code 130 would display as é, but on computers sold in Israel
it was the Hebrew letter Gimel (ג), so when Americans would send their
résumés to Israel they would arrive as rגsumגs. (Love this analogy!!)
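The résumé mix-up is easy to reproduce. The sketch below is my own and rests on one assumption: that the "some PCs" in the example used code page 437 (the original IBM PC set), while the Israeli machines used code page 862. Both happen to ship as codecs in Python's standard library.

```python
# One byte, two readings: the value 130 means different things
# depending on which OEM code page the machine uses.
raw = bytes([130])
as_us = raw.decode("cp437")      # assumed US code page: gives é
as_hebrew = raw.decode("cp862")  # Hebrew DOS code page: gives Gimel
print(as_us, as_hebrew)

# The sender writes the word out as cp437 bytes...
resume_bytes = "résumés".encode("cp437")

# ...and the receiver interprets the very same bytes as cp862.
garbled = resume_bytes.decode("cp862")
print(garbled)
```

Nothing was corrupted in transit here. The bytes are identical on both ends, and only the interpretation changed.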
Eventually this OEM free-for-all got codified in the ANSI standard. In
the ANSI standard, everybody agreed on what to do below 128, which was
pretty much the same as ASCII, but there were lots of different ways to
handle the characters from 128 and on up, depending on where you lived.
These different systems were called code pages.
So for example in Israel DOS used a code page called 862, while Greek
users used 737. They were the same below 128 but different from 128 up,
where all the funny letters resided.
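You can check the "same below 128, different from 128 up" claim yourself. This is a rough sketch assuming Python's built-in `cp862` and `cp737` codecs are faithful to the DOS code pages; `decode_byte` is a helper name I invented for the example.

```python
def decode_byte(value, codec):
    """Decode a single byte value with the given codec (hypothetical helper)."""
    return bytes([value]).decode(codec, errors="replace")

# Below 128 the two code pages should agree on every byte.
same_below_128 = all(
    decode_byte(b, "cp862") == decode_byte(b, "cp737") for b in range(128)
)

# From 128 up, collect the byte values the two code pages read differently.
differing = [
    b for b in range(128, 256)
    if decode_byte(b, "cp862") != decode_byte(b, "cp737")
]

print(same_below_128)
print(len(differing), "bytes differ from 128 up")
print(decode_byte(128, "cp862"), decode_byte(128, "cp737"))
```

Not every high byte has to differ (the two pages can share things like box-drawing characters), but the letters are where they part ways.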
Almost every stupid "my website looks like gibberish" or "she can't read
my emails when I use accents" problem comes down to one naive
programmer who didn't understand the simple fact that if you don't tell
me whether a particular string is encoded using UTF-8 or ASCII or ISO
8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot
display it correctly or even figure out where it ends. There are over a
hundred encodings and above code point 127, all bets are off.
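Here is a small sketch of why the encoding has to be stated. It is my own example, using only Python's standard codecs: the same two bytes are read under several of the encodings named above, and each one gives a different answer (or none at all).

```python
# The two bytes that UTF-8 uses for a single é.
data = "é".encode("utf-8")
print(data)

# Read with the right encoding, we get the original character back.
as_utf8 = data.decode("utf-8")

# Read as Latin-1 or Windows-1252, each byte becomes its own character.
as_latin1 = data.decode("latin-1")
as_cp1252 = data.decode("cp1252")
print(as_utf8, as_latin1, as_cp1252)

# Read as ASCII, the bytes are simply invalid: both are above 127.
try:
    data.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)

# Even Latin-1 and Windows-1252 disagree on some bytes, such as 0x80.
euro_cp1252 = b"\x80".decode("cp1252")
ctrl_latin1 = b"\x80".decode("latin-1")
print(repr(euro_cp1252), repr(ctrl_latin1))
```

The bytes never tell you which of these readings was intended, which is the whole point of the paragraph above.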
About the author.
I’m Joel Spolsky, co-founder of Fog Creek Software, a New York company that
proves that you can treat programmers well and still be highly profitable.
Programmers get private offices, free lunch, and work 40
hours a week. Customers only pay for software if they’re delighted.
We make Trello, easy web-based collaboration software, FogBugz, an enlightened
bug tracking and software development tool, and Kiln, a distributed
source control system that will blow your socks off.
I’m also the co-founder and CEO of Stack Exchange.