Token Searches

Jed Rembold

November 22, 2024

Announcements

  • Project 5!
    • Get a start!
    • I’m fine with you partnering with anyone from your small section or this class
    • Back things up to GitHub after finishing every Milestone!
  • Final
    • I will be posting final study materials and an old test a week from tomorrow so that you can start studying
    • Will be more weighted toward content from the latter half of the semester, but everything we’ve been doing this semester builds on earlier material
    • Adventure largely marks the end of testable material, though content we cover between now and the final may show up in some extra credit contexts
  • I will be giving you time on the last day of class to complete our course evals (the SAIs)
  • Polling: polling.jedrembold.prof

Review Question!

The two classes to the right mimic a bit of what might occur in the course of the Adventure Project. What Python data type is ANS?

  1. A String
  2. An AdvObject
  3. An AdvLocation
  4. A List
class AdvObject:
    def __init__(self, name, loc):
        self.name = name
        self.loc = loc
    def get_loc(self):
        return self.loc

class AdvLocation:
    def __init__(self, name):
        self.name = name
        self.stuff = []
    def store(self, item):
        self.stuff.append(item)
    def retrieve_top(self):
        return self.stuff.pop()

A = AdvObject("Hammer", "RA")
B = AdvObject("Torch", "RA")
RA = AdvLocation("Room A")
RA.store(A)
RA.store(B)
ANS = RA.retrieve_top().get_loc()

Back to Scanning

A Token Scanner

  • A class that plucked out individual tokens might be called a token scanner
  • What would a client want from a token scanner?
    • A way to pass in the necessary input
    • A way to retrieve the next individual token
    • A way to know when there are no more tokens
    • Maybe a way to tailor what tokens are desired
  • These requirements help inform what methods should be incorporated into a token scanner class!
    • Still need to determine what internal attributes might be needed

Token Scanner Design

  • Frequently, specific wants or objectives make for good methods to include in the token scanner
  • Chapter 12 includes an example of a common implementation
  • Exports 4 main methods:
    1. scanner.set_input(str)
      • Sets the input of the token scanner to the specified string or input stream
    2. scanner.next_token()
      • Returns the next token from the scanner text, or "" at the end
    3. scanner.has_more_tokens()
      • Returns True if more tokens exist, False otherwise
    4. scanner.ignore_whitespace()
      • Customization option which tells the scanner to ignore whitespace characters

Token Scanner Code

# File: tokenscanner.py

"""
This file implements a simple version of a token scanner class.
"""

# A token scanner is an abstract data type that divides a string into
# individual tokens, which are strings of consecutive characters that
# form logical units.  This simplified version recognizes two token types:
#
#   1. A string of consecutive letters and digits
#   2. A single character string
#
# To use this class, you must first create a TokenScanner instance by
# calling its constructor:
#
#     scanner = TokenScanner()
#
# The next step is to call the set_input method to specify the string
# from which tokens are read, as follows:
#
#     scanner.set_input(s)
#
# Once you have initialized the scanner, you can retrieve the next token
# by calling
#
#    token = scanner.next_token()
#
# To determine whether any tokens remain to be read, you can either
# call the predicate method scanner.has_more_tokens() or check to see
# whether next_token returns the empty string.
#
# The following code fragment serves as a pattern for processing each
# token in the string stored in the variable source:
#
#     scanner = TokenScanner(source)
#     while scanner.has_more_tokens():
#         token = scanner.next_token()
#         . . . code to process the token . . .
#
# By default, the TokenScanner class treats whitespace characters
# as operators and returns them as single-character tokens.  You
# can set the token scanner to ignore whitespace characters by
# making the following call:
#
#     scanner.ignore_whitespace()

class TokenScanner:

    """This class implements a simple token scanner."""

# Constructor

    def __init__(self, source=""):
        """
        Creates a new TokenScanner object that scans the specified string.
        """
        self.set_input(source)
        self._ignore_whitespace_flag = False

# Public methods

    def set_input(self, source):
        """
        Resets the input so that it comes from source.
        """
        self._source = source
        self._nch = len(source)
        self._cp = 0

    def next_token(self):
        """
        Returns the next token from this scanner.  If called when no
        tokens are available, next_token returns the empty string.
        """
        if self._ignore_whitespace_flag:
            self._skip_whitespace()
        if self._cp == self._nch:
            return ""
        token = self._source[self._cp]
        self._cp += 1
        if token.isalnum():
            # Check bounds first, then extend the token while characters are alphanumeric
            while self._cp < self._nch and self._source[self._cp].isalnum():
                token += self._source[self._cp]
                self._cp += 1
        return token

    def has_more_tokens(self):
        """
        Returns True if there are more tokens for this scanner to read.
        """
        if self._ignore_whitespace_flag:
            self._skip_whitespace()
        return self._cp < self._nch

    def ignore_whitespace(self):
        """
        Tells the scanner to ignore whitespace characters.
        """
        self._ignore_whitespace_flag = True

# Private methods

    def _skip_whitespace(self):
        """
        Skips over any whitespace characters before the next token.
        """
        while self._cp < self._nch and self._source[self._cp].isspace():
            self._cp += 1

Using TokenScanner

  • Need to initialize the token scanner object
    • You need to create the machine before you can use it
  • Feed the machine the text you want to grab tokens from
  • Generally, keep looping as long as there are still tokens
    • Each iteration, get the latest token and then do something with it
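
To see what tokens a loop like this receives, here is a minimal sketch that applies the same two token rules as tokenscanner.py inside a single generator function. The name iter_tokens and its ignore_whitespace parameter are hypothetical stand-ins for illustration; they are not part of the TokenScanner class.

```python
def iter_tokens(source, ignore_whitespace=False):
    """Yield tokens: runs of letters/digits, or single characters."""
    cp = 0                      # index of the next unread character
    nch = len(source)
    while cp < nch:
        token = source[cp]
        cp += 1
        if ignore_whitespace and token.isspace():
            continue            # skip whitespace instead of yielding it
        if token.isalnum():
            # Extend the token while the following characters are alphanumeric
            while cp < nch and source[cp].isalnum():
                token += source[cp]
                cp += 1
        yield token


print(list(iter_tokens("x1 = 42;")))
# ['x1', ' ', '=', ' ', '42', ';']
print(list(iter_tokens("x1 = 42;", ignore_whitespace=True)))
# ['x1', '=', '42', ';']
```

Notice that whitespace comes back as its own single-character token unless it is explicitly ignored, which matches the default behavior described for the class.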

Using TokenScanner in PigLatin

from tokenscanner import TokenScanner

def to_pig_latin(text):
    translation = ""
    scanner = TokenScanner()
    scanner.set_input(text)
    while scanner.has_more_tokens():
        token = scanner.next_token()
        if token.isalpha():
            token = word_to_pig_latin(token)
        translation += token
    return translation
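
The function above calls word_to_pig_latin, which is not shown on this slide. Below is a minimal sketch of what it might look like, assuming the common Pig Latin rules: a word starting with a vowel gets "way" appended, otherwise the leading consonants move to the end followed by "ay", and a word with no vowels is left alone. The helper find_first_vowel is a hypothetical name introduced here.

```python
def find_first_vowel(word):
    """Return the index of the first vowel in word, or -1 if there is none."""
    for i, ch in enumerate(word):
        if ch.lower() in "aeiou":
            return i
    return -1


def word_to_pig_latin(word):
    """Translate a single word under the assumed Pig Latin rules."""
    vp = find_first_vowel(word)
    if vp == -1:
        return word                      # no vowels: leave unchanged
    if vp == 0:
        return word + "way"              # starts with a vowel
    return word[vp:] + word[:vp] + "ay"  # move leading consonants to the end


print(word_to_pig_latin("scram"))   # amscray
print(word_to_pig_latin("apple"))   # appleway
```

Because the scanner hands back punctuation and spaces as separate tokens, only the alphabetic tokens pass through this helper and everything else is copied into the translation untouched.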

Which Way is Better?

Searching for Efficiency

  • Our last chapter is less about introducing new programming machinery and more about better understanding what we already have
  • Hopefully you have realized by now that there can be many approaches to solving a problem computationally
    • So far, the first way you figure out has likely been the “best”, in that it gets the job done.
    • Sometimes, though, there is a difference between an approach that is technically correct and one that is practically correct.
    • How can we make informed choices about the algorithms we use?
  • Want to look at algorithm efficiency in this chapter
  • Will focus mainly on searching and sorting as our examples to better understand how an algorithm’s efficiency can be quantified

Searching for Area Codes

  • To illustrate the efficiency of linear search, it can be helpful to work with a larger dataset
  • We’ll look here at searching through potential US area codes to find that of Salem: 503
  • Linear search examines each value in order to find the matching value.
    • As the arrays get larger, the number of steps required also grows
  • As you watch linear search do its thing on the next slide, see if you can beat the computer at finding 503.
    • What approach did you take?

Linear Search in Action

How did you do?

  • Frequently, many people can “beat the animation” in finding 503
  • Approaches vary, but you may well have done something along the lines of:
    • Look at some number in the middle
    • Depending on how close it was to 503, jump ahead some in that direction and check again
  • Requires some special conditions though, so let’s try again

Racing Linear Search Again

Binary Search in Action

Linear Search Efficiency

  • The running time of the linear search depends on the size of the array
    • That in itself is not particularly surprising. The running time of most algorithms will depend on the size of the problem to which the algorithm is applied.
  • For many applications, it is easy to come up with a numeric value that describes the problem size, commonly called \(N\).
    • For a list, \(N\) is simply the length of the list
  • In the worst case, when the target value is the last element of the list or does not appear at all, the linear search requires \(N\) steps
    • On average, it takes about half that, or \(\frac{N}{2}\)
    • Computer scientists are apparently pessimists though, and will generally use the worst-case scenario to compare
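
The step counts above can be checked with a small sketch. The function name linear_search, the choice to return a step count alongside the index, and the short list of area codes are all assumptions made for illustration.

```python
def linear_search(values, target):
    """Return (index, steps); index is -1 if target is not present."""
    steps = 0
    for i, value in enumerate(values):
        steps += 1                  # one step per element examined
        if value == target:
            return i, steps
    return -1, steps


# A small, hypothetical sample of area codes
codes = [206, 253, 360, 425, 503, 509, 541, 971]

print(linear_search(codes, 503))   # (4, 5)
print(linear_search(codes, 971))   # (7, 8)   last element: N steps
print(linear_search(codes, 999))   # (-1, 8)  missing: also N steps
```

The last two calls show the worst case: with \(N = 8\) values, both a target in the final position and a missing target cost 8 steps.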

Binary Search Efficiency

  • The running time of binary search also depends on the size of the array, but in a very different way
  • Each step of the process, the binary search rules out half the remaining options
    • The worst case (which we had earlier!) requires a number of steps equal to however many times we can divide the array in half until we have only a single number left.
    • Mathematically, this looks like \[1 = N / \underbrace{2 / 2 / 2 / 2 \cdots / 2}_{k\text{ times}} = \frac{N}{2^k}\]
  • We really want to know the number of steps, \(k\), so solving for \(k\): \[2^k = N \quad\Rightarrow\quad k = \log_2(N)\]
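
The halving process can be sketched as follows. This is one common way to write binary search, not necessarily the version used in the animation; the function name and the returned step count are assumptions. It counts one step per value examined, so its worst case is \(\lfloor \log_2 N \rfloor + 1\), which is within one step of the \(\log_2 N\) estimate above.

```python
def binary_search(values, target):
    """Search a sorted list; return (index, steps), index -1 if absent."""
    lo, hi = 0, len(values) - 1
    steps = 0
    while lo <= hi:
        steps += 1                  # one step per value examined
        mid = (lo + hi) // 2
        if values[mid] == target:
            return mid, steps
        if values[mid] < target:
            lo = mid + 1            # target can only be in the upper half
        else:
            hi = mid - 1            # target can only be in the lower half
    return -1, steps


# A small, hypothetical sample of area codes, already in order
codes = [206, 253, 360, 425, 503, 509, 541, 971]

print(binary_search(codes, 503))   # (4, 3)
print(binary_search(codes, 999))   # (-1, 4)
```

With 8 values the missing target costs 4 steps, compared with 8 for a linear scan of the same list.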

Comparing Efficiencies

  • The table below illustrates the differences in the number of required steps for the two search algorithms
    Problem Size     Linear (\(N\))     Binary (\(\log_2 N\))
    10               10                 3
    100              100                7
    1,000            1,000              10
    1,000,000        1,000,000          20
    1,000,000,000    1,000,000,000      30
  • Clearly, for large values, the difference in the number of steps is enormous
  • At 1 million elements, binary search needs 50,000 times fewer steps (20 instead of 1,000,000)!
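
The table values can be reproduced with a few lines of Python. This sketch assumes the binary column is \(\log_2 N\) rounded to the nearest whole number, which matches every row shown; the function name step_counts is hypothetical.

```python
import math


def step_counts(n):
    """Return (linear steps, binary steps, ratio) for a problem of size n."""
    binary_steps = round(math.log2(n))   # assumed rounding, matches the table
    return n, binary_steps, n / binary_steps


for n in (10, 100, 1_000, 1_000_000, 1_000_000_000):
    linear, binary, ratio = step_counts(n)
    print(f"{linear:>13,}  {binary:>2}  {ratio:>13,.0f}x")
```

The final column printed is the ratio of linear steps to binary steps. For one million elements it is exactly 50,000, and it keeps growing as the problem size increases.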

Sorting

  • Binary search only works on arrays in which the elements are ordered.
    • The process of putting the elements into order is called sorting.
  • Lots of different sorting algorithms, which can vary substantially in their efficiency.
  • From an algorithms view, sorting is probably the most applicable algorithm we’ll discuss in this course
    • Organizing data makes it easier to digest that data, whether the data is being digested by other machines or by humans
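
As a small illustration of why ordering matters, the sketch below sorts a list with Python's built-in sorted and then searches it with the standard library's bisect_left, which performs a binary search. The helper name contains and the sample area codes are assumptions for illustration.

```python
from bisect import bisect_left


def contains(sorted_values, target):
    """Binary-search a sorted list for target using bisect_left."""
    i = bisect_left(sorted_values, target)   # leftmost position where target fits
    return i < len(sorted_values) and sorted_values[i] == target


codes = [971, 503, 206, 541, 360]   # hypothetical, unordered
ordered = sorted(codes)             # sorted() returns a new, ordered list

print(ordered)                  # [206, 360, 503, 541, 971]
print(contains(ordered, 503))   # True
print(contains(ordered, 504))   # False
```

Calling contains on the unsorted codes list would not raise an error, but its answer could not be trusted, since each halving step assumes the values are in order.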