CS 231: Introduction to Programming
Lecture #3: Data representation

Representing text
• Counting symbols
We can represent most "plain" text documents as a sequence of symbols (that is, ignoring mark-up such as boldface and page layout).

The symbols themselves can be represented as sequences of bits. The norm is 7 or 8 bits per symbol, which allows 128 (2^7) or 256 (2^8) possible symbols (also called "characters").
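The counting argument can be checked with a short sketch. The function name `symbol_count` and the choice of the character "A" are illustrative assumptions, not part of the lecture.

```python
def symbol_count(bits: int) -> int:
    """Number of distinct symbols that a code of the given width can represent."""
    return 2 ** bits


for bits in (7, 8):
    print(bits, "bits per symbol ->", symbol_count(bits), "possible symbols")

# One symbol written out as an 8-bit pattern ("A" is an arbitrary example).
pattern = format(ord("A"), "08b")
print("A ->", pattern)
```

Each extra bit doubles the number of available symbols, which is why moving from 7 to 8 bits goes from 128 to 256.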

• The ASCII code
The American Standard Code for Information Interchange (ASCII) uses 7 bits per symbol to represent English letters, the digits 0 through 9, punctuation and a few control characters (e.g., "end of transmission", "carriage return", "bell"). Several other codes were used historically (e.g., EBCDIC on IBM computers) but are rarely used today.
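As a small illustration, Python's built-in `ord` and `chr` expose the numeric code of each character, which for these symbols is the ASCII code. The sample string "Hi!" is an arbitrary choice made for this sketch.

```python
text = "Hi!"  # arbitrary sample text

# Numeric ASCII code of each character, and the same code as a 7-bit pattern.
codes = [ord(ch) for ch in text]
bits = [format(code, "07b") for code in codes]

for ch, code, pattern in zip(text, codes, bits):
    print(repr(ch), code, pattern)

# Control characters have codes too: "bell" is 7 and "carriage return" is 13.
bell = chr(7)
carriage_return = chr(13)
print(repr(bell), repr(carriage_return))
```

Every code here is below 128, so each one fits in 7 bits.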


• 8-bit codes: the use of parity
Most computers handle 8 bits more naturally than 7 (8 is a power of two, and memory is usually organized in 8-bit bytes). When a 7-bit code is stored in 8 bits, the "extra bit" is sometimes used as a safeguard against data corruption introduced by transmission errors and other "noise".

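For example, we can set the eighth bit to 0 if the number of 1 bits in the 7-bit code is even, or to 1 if it is odd, so that every valid 8-bit pattern contains an even number of 1 bits (this is called even parity). The receiver can then detect any transmission error that corrupts a single bit, although it cannot tell which bit was corrupted, and an error that flips two bits goes unnoticed.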
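A minimal sketch of this parity scheme follows. The function names are hypothetical, and placing the parity bit in the most significant position is an assumption made for the example; the lecture only says "the eighth bit".

```python
def add_even_parity(code: int) -> int:
    """Attach a parity bit to a 7-bit code so the 8-bit result has an even number of 1 bits.

    Assumption for this sketch: the parity bit is stored as the most significant bit.
    """
    if not 0 <= code < 128:
        raise ValueError("expected a 7-bit code (0..127)")
    parity = bin(code).count("1") % 2  # 0 if the count of 1 bits is even, 1 if odd
    return (parity << 7) | code


def parity_ok(byte: int) -> bool:
    """Return True when the received 8-bit value still has an even number of 1 bits."""
    return bin(byte).count("1") % 2 == 0


sent = add_even_parity(ord("C"))   # "C" is 1000011: three 1 bits, so the parity bit is 1
one_bit_error = sent ^ 0b00000100  # noise flips a single bit
two_bit_error = sent ^ 0b00000110  # noise flips two bits

print(format(sent, "08b"), parity_ok(sent))
print(format(one_bit_error, "08b"), parity_ok(one_bit_error))
print(format(two_bit_error, "08b"), parity_ok(two_bit_error))
```

The single-bit error fails the check, while the two-bit error passes it, which shows the limit of a single parity bit.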

• 8-bit codes: modern symbol sets
In many modern computer systems (Macintosh, Windows), all eight bits are used to extend the symbol set to include, e.g., accented letters and the British pound symbol. This approach may cause problems when transmitting between newer and older systems, and it also conflicts with using the eighth bit for parity checking.
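One way to see the transmission problem is to encode the same accented letter under different 8-bit symbol sets. The sketch below uses Python's `latin-1` and `mac_roman` codecs as stand-ins for two such sets; the choice of these two codecs and of the letter "é" is an assumption made for illustration.

```python
letter = "é"  # an accented letter that 7-bit ASCII cannot represent

latin1_bytes = letter.encode("latin-1")      # one common 8-bit extension
mac_bytes = letter.encode("mac_roman")       # the classic Macintosh 8-bit set

print(latin1_bytes.hex(), mac_bytes.hex())

# Both codes use the eighth bit (they are 128 or above), so no bit is left for parity.
print(latin1_bytes[0] >= 128, mac_bytes[0] >= 128)

# A receiver that assumes the wrong symbol set shows a different character.
misread = latin1_bytes.decode("mac_roman")
print(misread == letter)

# A strict 7-bit ASCII system cannot encode the letter at all.
try:
    letter.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```

The same letter gets a different code in each set, so the sender and receiver must agree on which 8-bit set is in use.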

• Unicode: a new 16-bit standard
A recent international standard called Unicode allows for 16 bits per symbol, thus greatly extending the range of symbols that can be accommodated (2^16 = 65,536), including those used by almost every written language on Earth.
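The following sketch shows a few symbols from different scripts and their 16-bit codes. Python's `utf-16-be` codec is used here as a stand-in for the 16-bit scheme described above, and the sample characters are arbitrary choices; both are assumptions of this example.

```python
print(2 ** 16)  # number of distinct 16-bit patterns

samples = ["A", "é", "€", "中"]  # arbitrary symbols from different scripts

rows = []
for ch in samples:
    code_point = ord(ch)              # the symbol's Unicode number
    encoded = ch.encode("utf-16-be")  # big-endian 16-bit encoding
    rows.append((ch, code_point, encoded))
    print(ch, hex(code_point), encoded.hex(), len(encoded), "bytes")
```

Each of these symbols fits in exactly two bytes, including "A", whose 16-bit code has the same value as its ASCII code.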


• Problems with Unicode
Unfortunately, not every language could be completely accommodated in 16-bit Unicode, so some older symbols used in Chinese, for example, had to be left out. This led to legitimate and serious concerns about fair representation for some cultures.

Later revisions of the Unicode standard have tried to address these concerns by adding capacity for over a million characters (1,114,112 code points) if necessary. Scripts proposed for inclusion have included Tengwar, Klingon and many others.
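The extended capacity can be observed from Python, whose strings cover the full Unicode range. In this sketch the sample character U+20000 (a Chinese character encoded outside the original 16-bit range) is an arbitrary choice, and `utf-16-be` is again used as an assumed stand-in for the 16-bit encoding.

```python
import sys

# The largest code point Python accepts, and the total number of code points.
capacity = sys.maxunicode + 1
print(hex(sys.maxunicode), capacity)

# A character whose number does not fit in 16 bits (arbitrary example).
rare = "\U00020000"
print(hex(ord(rare)), ord(rare) > 0xFFFF)

# In a 16-bit encoding it is stored as a pair of 16-bit units (four bytes).
encoded = rare.encode("utf-16-be")
print(encoded.hex(), len(encoded), "bytes")
```

Symbols beyond the first 65,536 are stored as two 16-bit units, so text that uses them no longer takes a fixed 16 bits per symbol.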
