CS 231 Lab 5: Digital Representation of Data

Lab #5: Digital Representation of Data

Assigned: Mon 14 Feb 2000
Due: Mon 21 Feb 2000

In this lab we will explore how the arithmetic of various number bases and the methods of data representation described in lectures can be used to determine the best or most concise way to store various forms of data. Questions like these are faced by graphic designers, web page designers, sound engineers and communications technicians on a daily basis. An understanding of some of these issues will help you understand the needs of these disciplines in your programming; it will also help you to develop intuitions about more general data storage and complexity analysis issues.

As with Lab #2, this one is intended to be largely a "paper and pencil" exercise. You may still use any calculators or other computer tools that you find useful, but you should make sure that you understand the underlying concepts and can do these kinds of calculations by hand if necessary (e.g., on an exam).

Instead of a demo, you should hand in a paper version of your answers by the due date.

A reminder on terminology
Exercise #1: Representing text
Exercise #2: Fonts and texts
Exercise #3: Simple graphics
Exercise #4: Color graphics
Exercise #5: Run-length encodings
Exercise #6: Sound sampling
Exercise #7: Digital representation of omelettes

A reminder on terminology

A bit is the smallest unit of binary data representation, i.e., a single "zero" or "one", representing a distinction between two different things (or choices, or alternatives). Eight bits gathered together constitute a byte; this is usually enough to represent a single letter or symbol (the ASCII code really uses only 7 bits). Since a byte a byte is eight bits, it can be represented by a pair of hexadecimal digits, since each hex digit is equivalent to four bits. Sometimes the four-bit unit is known as a nibble, but only when we really want to push the limits of cuteness .

Exercise #1: Representing text

Translate the following streams of hexadecimal characters into both ASCII and binary. Assume that each of the ASCII characters is represented by seven bits, with a leading zero bit to pad to an "even" eight bits (in the binary translation, show the leading zeros explicitly). You might want to open another Netscape window to this ASCII chart from lecture while you work.
Note: in this chart, the hex value for a given character is to the immediate right of that character.
```
	50 65 6E 67 75 69 6E 20 70 6F 77 65 72 21   
```
Modify the above hexadecimal message to include a parity bit in place of the leading zero. In particular, use a leading zero if there are an even number of ones in the remainder of that byte (a byte is a series of eight bits) and a leading one of there are an odd number of ones in the remainder of the byte. (Your answer should be another series of hex numbers, of the same length as above).
Show how the message would read if the parity-bit version from the second part of the question were read on a Windows system and the leading bit were misinterpreted as part of the standard Windows eight-bit character code. Use this chart from lecture to determine the interpretation of the eight-bit codes.

Exercise #2: Fonts and texts

On a certain computer system texts are stored as eight-bit ASCII codes (with a leading zero) and the standard system font consists of a 5 x 7 fixed-width grid for each character. For the sake of uniformity, the system stores a single 5 x 7 grid for all of the 128 ASCII characters, even though some control codes and the like are not printable. The system font includes a built-in pixel's worth of vertical spacing between lines of text, i.e., when a text is rendered on-screen as a bit-mapped graphic, no extra space need be left between lines.

How many different "symbols" can be rendered in a 5 x 7 grid? In other words, how many distinct patterns of on-and-off pixels within such a rectangular arrangement are there? Is this number more or less than the number of symbols which can be coded in a sixteen-bit code such as Unicode (ignore extensions such as in Unicode 2.0 for this question)?
Assume we wish to transmit a message in the smallest possible space and can choose between either of the following methods: using the first method, we transmit a description of the entire bit-mapped system font, plus the text of the message encoded as symbols. In this case, the receiver will use the bit-mapped font and the symbol encodings to render a version of the message on a screen to be read. Using the second method, we will render the message as a bit-mapped graphic in advance, and send only the bit-mapped result to be displayed without any "translation" or rendering on the receiving end.
For longer messages, the first method will clearly result in a more compact transmission, since sending the font information separately will save space. But for shorter messages, the "raw" bit-mapped graphic may actually be smaller than the symbolically-encoded form. What, if any, is the "break-even" points, i.e., how many symbols can we send before it is worth switching to the font+symbol form?

Exercise #3: Simple graphics

The chart below represents a scaled-up version of a black-and-white bit-map graphic. The following table of hex digits describes the content of the graphic as a series of bit values, zero for white and one for black. The lower half of the graphic has already been filled in for you, using the last four rows of hex values. Fill in the upper half of the graphic using the first four rows of hex values.

Exercise #4: Color graphics

Assume that we have a graphic represented as a color bit-map of size 30 x 40 pixels and with a color depth of 16 bits. Answer the following questions about file sizes assuming no overhead for file headers and the like (i.e., just count pixel data or, where relevant, color pallette tables).

How many bytes does the pixel information take up in the "raw" form described above?
Say that only 75 colors out of the possible 64K range are actually used in the picture (why 64K? Because 2¹⁶ = 64 * 1024 or 64K). We can create a color pallette containing these 75 sixteen-bit colors and then send this pallette along with modified color data, now using just enough bits per pixel to index the table. How many bytes will the whole graphic take now, including the size of the table and the pixel data? What percentage savings does this represent over the "raw" form above?
Clearly the use of a color palette will eventually not pay off in that space will not be saved. What is the "break-even" point, i.e., how many sixteen-bit colors can be in use in the picture before the use of the pallette will end up making the total size of the image larger than in the "raw" form?

Exercise #5: Run-length encodings

Recall from lecture that a run-length encoding is a technique used to compress data by representing repeated values (in a "run") as a pair of values, the first of which is a run length (the number of repetitions) and the second of which is the repeated element or value. Note that a run-length encoding can be used in any context where a series of (possibly-repeated) values is used (e.g., graphics, sound, text, etc.).

Write out a run-length encoding of the following data stream using hex digits. You should allow up to four bits (one hex digit) of repetition count and one byte (two hex digits) per value (in other words, your answer should be in the form of a series of hex digits, alternating between single-digit counts and double-digit values).
```
00 00 00 00 00 FF FF 00 A1 A1 A1 A1 00 00 00 00 00 A1 A1 FF FF FF 00 00   
```
Expand the following run-length encoding, which is in the same form as your answer above (your answer here should be simply a series of double-digit values). What do you notice about the relative lengths of the "compressed" and "expanded" versions? How do you account for this discrepancy?
```
1 00 1 A1 1 00 1 B2 2 C0 1 FF 1 A1 2 32   
```

Exercise #6: Sound sampling

Recall from lecture that a digitally-sampled sound consists of a series of samples (or measurements) of the amplitude of a sound wave, taken at regular intervals (the sampling rate) and represented with a specific number of bits.

If we sample sound at CD-quality (44.1KHz and 16 bits per sample), in stereo (i.e., two separate channels of sound, one left and one right), how many bytes will it take to represent 73 minutes of sound? Show how you computed your answer. You may assume that the "K" in "KHz" is exactly 1000Hz, but use K = 1024 to calculate your answer in mega-bytes (i.e., 1M-byte = 1024 * 1024 bytes).
Telephone answering messages don't need to be recorded in stereo and they don't need to be CD-quality. If we have 64K (K = 1024) bytes available, how many minutes of monophonic (single channel) sound can we record digitally at 10KHz (K = 1000) using 8-bit samples?

Exercise #7: Digital representation of omelettes

Last year I ate breakfast with my family at a local restaurant that specializes in omelettes, offering a wide variety of ingredients (in case you like omelletes, it's called Sybil's, and it's down State Street at about 24th). Although the omelette was good, I found their menu a bit disorganized and daunting: it was broken down into sub-sections based on main ingredients (e.g., tomatoes or sausages). Each section was further sub-divided into a series of omelettes featuring different combinations of optional ingredients (e.g., tomatoes + asparagus + cheese), along with an order number and a price for each combination.

Unfortunately, there seemed to be no rhyme or reason to the combinations offered, and some combinations I might have wanted weren't listed, even though several dozen were. I couldn't help but think there must be a better way to organize the menu: why couldn't I have any combination of ingredients I wanted, with priced based on the specific ingredients?

Assume for the moment that omelettes always have a fixed number of eggs and that the only variation is in the other ingredients added in. Furthermore, assume that at some restaurant, the list of such ingredients is:

asparagus;
bacon;
cauliflower;
eggplant;
feta cheese;
green peppers;
ham;
Italian sausage;
jack cheese;
kraut (i.e., sauerkraut);
mushrooms;
onions;
pea pods;
red potatoes;
spinach;
tomatoes;

Consider now three different methods for representing omelletes digitally:

letter-based: we can represent the presence of any ingredient by including a one-byte ASCII letter for that ingredient (fortunately, the initial letters of the ingredients are unique).
For example, an asparagus-jack cheese-tomato omellete would be represented by the code "AJT", since A J and T are the initials of those ingredients. A "plain omelette" would be represented by the "empty string" of letters, "", of length 0 (presumably the surrounding context of an order would make this clear and distinguishable from other possibilities).
hex-based: we can use a single hex digit to represent the presence of any ingredient, based on the ordinal number of that ingredient (fortunately, there are exactly sixteen ingredients on the menu).
For example, an asparagus-jack cheese-tomato omellete would be represented by the code "08F", since 0 8 and F are the numbers of those ingredients in hex.
binary: we can represent the omellete as a whole by a sixteen-bit value (written as four hex digits) where the presence or absence of any ingredient in order is given by a corresponding one or zero.
For example, an asparagus-jack cheese-tomato omellete would be represented by the code "8081".

Now answer the following questions:

What are the total sizes, in bits, of each of the representations given above for the asparagus-jack cheese-tomato omellete? Note that the use of ASCII letters and hex digits can be deceptive when measuring the size.
The chef points out that most people only order omelettes with one or two ingredients. In fact, a thorough study shows that 10% of omelettes are ordered plain (just eggs), 30% are ordered with just one ingredient, 40% with two, and 10% each with three and four ingredients (after four ingredients they get pretty expensive and pretty unwieldy). Given these percentages, what choice of representation will tend to yield the smallest overall "bandwidth" of data transmitted to the kitchen for omelette orders? Give explicitly the expected number of bits per omelette for each representation type (e.g., the weighted average based on the percentages given).
Hint: make a matrix indexed by number of ingredients and representation type. Compute the size (in bits) of each entry in the matrix. Then weight the entries by their percentage probability and sum according to each representation type.
Explain why the best representation might be different if most omelettes were ordered with larger numbers of ingredients (say, most ordered with 5 to 8 ingredients).

Contents