RubyLearning

Helping Ruby Programmers become Awesome!

International Encodings

By Satish Talim

By default, the Ruby 1.9 interpreter assumes that programs are encoded in ASCII. The author of a script can specify a different source encoding by placing a special "coding comment" at the start of the file. For example:

# coding: utf-8

The above comment must be written entirely in ASCII, and must include the string "coding" followed by a colon or equals sign and the name of the desired encoding (which cannot include spaces or punctuation other than hyphen and underscore). Whitespace is allowed on either side of the colon or equals sign, and the string "coding" may have any prefix, such as "en" to spell "encoding". The entire comment, including "coding" and the encoding name, is case-insensitive and can be written with upper- or lowercase letters.

An encoding comment like the one above is normally valid only on the first line of the file. It may appear on the second line, however, if the first line is a shebang comment (which makes a script executable on *nix operating systems):

#!/usr/bin/ruby -w
# coding: utf-8
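
The rules above can be approximated with a short regular expression. The sketch below is only an illustration: declared_encoding and CODING_COMMENT are hypothetical names, and the pattern is a simplification of the rules described here, not the interpreter's actual parser.

```ruby
# Hypothetical helper that approximates the coding-comment rules.
# Assumption: this regex is a simplification, not Ruby's real parser.
CODING_COMMENT = /\A#.*coding\s*[:=]\s*([\w-]+)/i

def declared_encoding(source)
  lines = source.lines.first(2)
  # Skip a shebang line, if present, so the comment may sit on line 2.
  candidate = lines.first.to_s.start_with?("#!") ? lines[1] : lines.first
  match = CODING_COMMENT.match(candidate.to_s)
  match && match[1]
end

puts declared_encoding("# coding: utf-8\nputs 1\n")
puts declared_encoding("#!/usr/bin/ruby -w\n# -*- encoding: UTF-8 -*-\n")
p declared_encoding("puts 'no comment here'\n")
```

The helper returns the encoding name exactly as written in the comment, or nil when the first line (or the second line after a shebang) carries no coding comment.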

Encoding names are not case-sensitive and may be written in uppercase, lowercase or a mix. Ruby 1.9 supports the following source encodings: ASCII-8BIT, US-ASCII, ISO-8859-1 through ISO-8859-15, UTF-8, SHIFT_JIS and EUC-JP.
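
As a quick check of the case-insensitive names, the following sketch looks up a few encodings with Encoding.find, which is part of Ruby 1.9 and later. The name no-such-encoding is made up to show what happens with an unknown name.

```ruby
# Encoding.find accepts names in any mix of upper- and lowercase.
utf8 = Encoding.find("utf-8")
puts utf8 == Encoding.find("UTF-8")   # same Encoding object either way

puts Encoding.find("shift_jis").name
puts Encoding.find("euc-jp").name
puts Encoding.name_list.include?("ISO-8859-15")

# An unknown name raises ArgumentError.
begin
  Encoding.find("no-such-encoding")   # made-up name, for illustration only
rescue ArgumentError => e
  puts "unknown encoding: #{e.message}"
end
```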

In Ruby 1.9, strings are true sequences of characters, and those characters are not confined to the ASCII character set. The individual elements of a string are characters, represented as strings of length 1, rather than integer character codes. The String class can properly handle multibyte characters. If a string contains multibyte characters, then the number of bytes does not correspond to the number of characters: the length and size methods return the number of characters in a string, and the new bytesize method returns the number of bytes. Refer to the following program (typed using UniPad, a free Unicode text editor for Windows):

# coding: utf-8
# utf8p1.rb
# in Ruby 1.9 only
# A string literal containing a multibyte character
s = "José"
# The string contains 5 bytes which encode 4 characters
puts s.length    # => 4
puts s.bytesize # => 5
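
To make the difference between characters and bytes more concrete, here is a small sketch for Ruby 1.9 or later. It writes the é as the escape \u00e9 so that the example does not depend on how the file itself is saved; the variable names are only illustrative.

```ruby
# "\u00e9" is the character é; a \u escape always yields a UTF-8 string.
s = "Jos\u00e9"

puts s.encoding    # the encoding tagged on the string
p s[3]             # indexing returns a one-character string, not an integer
p s.chars.to_a     # the string split into its 4 characters
p s.bytes.to_a     # the 5 underlying bytes; é occupies two of them

# Transcoding to ISO-8859-1 keeps 4 characters, but é now needs only 1 byte.
latin1 = s.encode("ISO-8859-1")
puts latin1.length
puts latin1.bytesize
```

The same text therefore has the same length in both encodings but a different bytesize, which is why the two methods are kept separate.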

The above information is adapted from the book The Ruby Programming Language.

Dave Thomas has an interesting example on his blog.

Using both Ruby 1.8 and 1.9 on Windows

You can use both Ruby 1.8 and 1.9 on Windows. This is what you need to do to use Ruby 1.9:

  • Download Ruby 1.9 for Windows, from http://www.ruby-lang.org/en/downloads/
  • Then, unzip the downloaded file to a folder, say c:/ruby1.9
  • Mind you, this download does not have additional tools like SciTE, so you may have to use something like Wordpad or Textpad for writing your .rb files.
  • Open a command window and switch folder to: c:/ruby1.9/bin
  • In this command window, type:
    set path=c:/ruby1.9/bin;
    (Remember: This path setting is valid as long as this command window is open.)
  • Switch to the folder where your Ruby programs are located, say c:/rubyprograms, and then you can run any Ruby program by typing:
    ruby program.rb
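
To confirm that the command window is really picking up Ruby 1.9 after the steps above, a tiny script such as the following can be run first. The file name whichruby.rb is only a suggestion, and the message text is illustrative.

```ruby
# whichruby.rb (hypothetical file name)
# Reports which Ruby interpreter is running this script.
require "rbconfig"

puts "Ruby version: #{RUBY_VERSION}"
puts "Installed in: #{RbConfig::CONFIG['bindir']}"

# A simple string comparison is enough to tell 1.8 from 1.9 and later.
if RUBY_VERSION < "1.9"
  puts "This is Ruby 1.8: check the path setting in this command window."
else
  puts "Ruby 1.9 or later is active."
end
```

Run it with ruby whichruby.rb from the same command window in which the path was set.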

This topic covers the encoding features of Ruby 1.9. Ruby 2.0 and later use UTF-8 as the default source encoding, so a coding comment is needed only when a file uses some other encoding.