Token Searches

Jed Rembold

November 22, 2024

Announcements

Project 5!
- Get a start!
- I’m fine with you partnering with anyone from your small section or this class
- Back things up to GitHub after finishing every Milestone!
Final
- I will be posting final study materials and an old test a week from tomorrow so that you can start studying
- Will be more weighted toward content from the latter half of the semester, but everything we’ve been doing this semester builds on earlier material
- Adventure largely marks the end of testable material, though content we cover between now and the final may show up in some extra credit contexts
I will be giving you time on the last day of class to complete our course evals (the SAIs)
Polling: polling.jedrembold.prof

Review Question!

The two classes below mimic a bit of what might occur in the course of the Adventure Project. What Python data type is ANS?

A String
An AdvObject
An AdvLocation
A List

class AdvObject:
    def __init__(self, name, loc):
        self.name = name
        self.loc = loc
    def get_loc(self):
        return self.loc

class AdvLocation:
    def __init__(self, name):
        self.name = name
        self.stuff = []
    def store(self, item):
        self.stuff.append(item)
    def retrieve_top(self):
        return self.stuff.pop()

A = AdvObject("Hammer", "RA")
B = AdvObject("Torch", "RA")
RA = AdvLocation("Room A")
RA.store(A)
RA.store(B)
ANS = RA.retrieve_top().get_loc()

Back to Scanning

A Token Scanner

A class that plucks individual tokens out of a string might be called a token scanner
What would a client want from a token scanner?
- A way to pass in the necessary input
- A way to retrieve the next individual token
- A way to know when there are no more tokens
- Maybe a way to tailor what tokens are desired
These requirements help inform what methods should be incorporated into a token scanner class!
- Still need to determine what internal attributes might be needed

Token Scanner Design

Frequently, specific wants or objectives make for good methods to include in the token scanner
Chapter 12 includes an example of a common implementation
Exports 4 main methods:
1. scanner.set_input(str)
  - Sets the input of the token scanner to the specified string or input stream
2. scanner.next_token()
  - Returns the next token from the scanner text, or "" at the end
3. scanner.has_more_tokens()
  - Returns True if more tokens exist, False otherwise
4. scanner.ignore_whitespace()
  - Customization option which tells the scanner to ignore whitespace characters

Token Scanner Code

# File: tokenscanner.py

"""
This file implements a simple version of a token scanner class.
"""

# A token scanner is an abstract data type that divides a string into
# individual tokens, which are strings of consecutive characters that
# form logical units.  This simplified version recognizes two token types:
#
#   1. A string of consecutive letters and digits
#   2. A single character string
#
# To use this class, you must first create a TokenScanner instance by
# calling its constructor:
#
#     scanner = TokenScanner()
#
# The next step is to call the set_input method to specify the string
# from which tokens are read, as follows:
#
#     scanner.set_input(s)
#
# Once you have initialized the scanner, you can retrieve the next token
# by calling
#
#    token = scanner.next_token()
#
# To determine whether any tokens remain to be read, you can either
# call the predicate method scanner.has_more_tokens() or check to see
# whether next_token returns the empty string.
#
# The following code fragment serves as a pattern for processing each
# token in the string stored in the variable source:
#
#     scanner = TokenScanner(source)
#     while scanner.has_more_tokens():
#         token = scanner.next_token()
#         . . . code to process the token . . .
#
# By default, the TokenScanner class treats whitespace characters
# as operators and returns them as single-character tokens.  You
# can set the token scanner to ignore whitespace characters by
# making the following call:
#
#     scanner.ignore_whitespace()

class TokenScanner:

    """This class implements a simple token scanner."""

# Constructor

    def __init__(self, source=""):
        """
        Creates a new TokenScanner object that scans the specified string.
        """
        self.set_input(source)
        self._ignore_whitespace_flag = False

# Public methods

    def set_input(self, source):
        """
        Resets the input so that it comes from source.
        """
        self._source = source
        self._nch = len(source)
        self._cp = 0

    def next_token(self):
        """
        Returns the next token from this scanner.  If called when no
        tokens are available, next_token returns the empty string.
        """
        if self._ignore_whitespace_flag:
            self._skip_whitespace()
        if self._cp == self._nch:
            return ""
        token = self._source[self._cp]
        self._cp += 1
        if token.isalnum():
            while (self._cp < self._nch
                   and self._source[self._cp].isalnum()):
                token += self._source[self._cp]
                self._cp += 1
        return token

    def has_more_tokens(self):
        """
        Returns True if there are more tokens for this scanner to read.
        """
        if self._ignore_whitespace_flag:
            self._skip_whitespace()
        return self._cp < self._nch

    def ignore_whitespace(self):
        """
        Tells the scanner to ignore whitespace characters.
        """
        self._ignore_whitespace_flag = True

# Private methods

    def _skip_whitespace(self):
        """
        Skips over any whitespace characters before the next token.
        """
        while self._cp < self._nch and self._source[self._cp].isspace():
            self._cp += 1

Using `TokenScanner`

Need to initialize the token scanner object
- You need to create the machine before you can use it
Feed the machine the text you want to grab tokens from
Generally, keep looping as long as there are still tokens
- Each iteration, get the latest token and then do something with it
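
The steps above can be sketched as a short loop. So that the snippet runs on its own, it uses MiniScanner, a hypothetical condensed stand-in that mirrors the four methods of the TokenScanner class; collect_tokens is likewise an illustrative name, not part of tokenscanner.py.

```python
class MiniScanner:
    """Hypothetical condensed stand-in for TokenScanner (same four methods)."""

    def __init__(self, source=""):
        self._ignore_whitespace_flag = False
        self.set_input(source)

    def set_input(self, source):
        self._source = source
        self._cp = 0  # index of the next unread character

    def ignore_whitespace(self):
        self._ignore_whitespace_flag = True

    def _skip_whitespace(self):
        while (self._cp < len(self._source)
               and self._source[self._cp].isspace()):
            self._cp += 1

    def has_more_tokens(self):
        if self._ignore_whitespace_flag:
            self._skip_whitespace()
        return self._cp < len(self._source)

    def next_token(self):
        if self._ignore_whitespace_flag:
            self._skip_whitespace()
        if self._cp == len(self._source):
            return ""
        start = self._cp
        self._cp += 1
        if self._source[start].isalnum():
            # Extend the token across consecutive letters and digits
            while (self._cp < len(self._source)
                   and self._source[self._cp].isalnum()):
                self._cp += 1
        return self._source[start:self._cp]


def collect_tokens(text, skip_spaces=False):
    scanner = MiniScanner()            # 1. create the machine
    scanner.set_input(text)            # 2. feed it the text
    if skip_spaces:
        scanner.ignore_whitespace()    # optional customization
    tokens = []
    while scanner.has_more_tokens():   # 3. loop while tokens remain
        tokens.append(scanner.next_token())
    return tokens


print(collect_tokens("x = 42+y"))
print(collect_tokens("x = 42+y", skip_spaces=True))
```

Without the whitespace option each space comes back as its own single-character token; with it, only x, =, 42, + and y are returned.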

Using `TokenScanner` in `PigLatin`

from tokenscanner import TokenScanner

def to_pig_latin(text):
    translation = ""
    scanner = TokenScanner()
    scanner.set_input(text)
    while scanner.has_more_tokens():
        token = scanner.next_token()
        if token.isalpha():
            token = word_to_pig_latin(token)
        translation += token
    return translation
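
The function above calls word_to_pig_latin, which is not shown on this slide. A minimal sketch follows, assuming one common set of Pig Latin rules (a word starting with a vowel gets "way" appended; otherwise the leading consonants move to the end, followed by "ay"). The rules and the handling of vowel-free words are assumptions, not necessarily the course's version.

```python
def word_to_pig_latin(word):
    """Hypothetical helper: translate a single alphabetic word.

    Assumed rules: vowel-initial words get "way" appended; otherwise
    the leading consonants move to the end, followed by "ay".
    """
    vowels = "aeiouAEIOU"
    for i, ch in enumerate(word):
        if ch in vowels:
            if i == 0:
                return word + "way"
            return word[i:] + word[:i] + "ay"
    return word  # no vowel found: leave the word unchanged (assumption)


print(word_to_pig_latin("scram"))
print(word_to_pig_latin("apple"))
```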

Which Way is Better?

Searching for Efficiency

Our last chapter is less about introducing new programming machinery and more about better understanding what we already have
Hopefully you have realized by now that there can be many approaches to solving a problem computationally
- So far, the first way you figure out has likely been the “best”, in that it gets the job done.
- Sometimes there is a difference though in an approach that is technically correct and one that is practically correct.
- How can we make informed choices about the algorithms we use?
Want to look at algorithm efficiency in this chapter
Will focus mainly on searching and sorting as our examples to better understand how an algorithm’s efficiency can be quantified

A Linear Search

Suppose you needed to determine if a particular element was in a list, and didn’t have any of the built-in methods available to you
The easiest method (which many of you have indeed used!) is to just search through the list element by element and check it to see if it is the one you desire
- This approach is called a linear search

Easy to understand and implement:

def linear_search(target, array):
    for i in range(len(array)):
        if array[i] == target:
            return i
    return -1
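
As a quick illustration, here is an equivalent formulation using enumerate, followed by a couple of calls. The name linear_search_enumerate and the sample values are made up for this sketch; note that the list does not need to be in any particular order.

```python
def linear_search_enumerate(target, array):
    """Same idea as linear_search, written with enumerate."""
    for i, value in enumerate(array):
        if value == target:
            return i      # index of the first match
    return -1             # target never appeared


codes = [907, 212, 503, 415]   # arbitrary sample values, unsorted
print(linear_search_enumerate(503, codes))   # found at index 2
print(linear_search_enumerate(999, codes))   # not present, so -1
```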

Searching for Area Codes

To illustrate the efficiency of linear search, it can be helpful to work with a larger dataset
We’ll look here at searching through potential US area codes to find that of Salem: 503
Linear search examines each value in order to find the matching value.
- As the arrays get larger, the number of steps required also grows
As you watch linear search do its thing on the next slide, see if you can beat the computer at finding 503.
- What approach did you take?

Linear Search in Action

How did you do?

Frequently, many people can “beat the animation” in finding 503
Approaches vary, but you may well have done something along the lines of:
- Look at some number in the middle
- Depending on how close it was to 503, jump ahead some in that direction and check again
Requires some special conditions though, so let’s try again

Racing Linear Search Again

Idea of a Binary Search

If your data is ordered, then you might try an alternative search strategy
Look at the center element in the array, it is either:
- The value you want. Excellent! Return it.
- A value larger than what you want. Throw away that value and everything bigger.
- A value smaller than what you want. Throw away that value and everything smaller.
Then you can repeat the process with the remaining elements until you find your value
Since the number of elements left to search is halved each time, this is called a binary search

Binary Search in Action

Implementing Binary Search

def binary_search(target, array):
    lh = 0
    rh = len(array) - 1
    while lh <= rh:
        middle = (lh + rh) // 2
        if array[middle] == target:
            return middle
        elif array[middle] < target:
            lh = middle + 1
        else:
            rh = middle - 1
    return -1
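
To see the halving happen, the sketch below repeats the same loop but also records each middle value it inspects. The name binary_search_trace and the sample data are illustrative assumptions; the list must already be sorted for the search to work.

```python
def binary_search_trace(target, array):
    """Binary search that also returns the list of values it probed."""
    probes = []
    lh = 0
    rh = len(array) - 1
    while lh <= rh:
        middle = (lh + rh) // 2
        probes.append(array[middle])
        if array[middle] == target:
            return middle, probes
        elif array[middle] < target:
            lh = middle + 1    # discard the middle and everything smaller
        else:
            rh = middle - 1    # discard the middle and everything larger
    return -1, probes


# Arbitrary sorted sample values
codes = [201, 212, 305, 404, 415, 503, 617, 702, 808, 907]
index, probes = binary_search_trace(503, codes)
print(index, probes)
```

For this list, the search inspects 415, then 702, then 503, so it finds the target at index 5 after three probes instead of six.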

Linear Search Efficiency

The running time of the linear search depends on the size of the array
- That in itself is not particularly surprising. The running time of most algorithms will depend on the size of the problem to which the algorithm is applied.
For many applications, it is easy to come up with a numeric value that describes the problem size, commonly called \(N\).
- For most lists, \(N\) is simply the length of the array
In the worst case, when the target value is the last element of the list or does not appear at all, the linear search requires \(N\) steps
- On average, it takes about half that, or \(\frac{N}{2}\)
- Computer scientists are apparently pessimists though, and will generally use the worst-case scenario to compare
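
These counts can be checked empirically. The sketch below uses a hypothetical helper, count_linear_steps, that reports how many elements a linear search inspects before it stops.

```python
def count_linear_steps(target, array):
    """Return how many elements a linear search inspects before stopping."""
    steps = 0
    for value in array:
        steps += 1
        if value == target:
            break
    return steps


data = list(range(1000))                       # N = 1000
print(count_linear_steps(999, data))           # target is last: N steps
print(count_linear_steps(-1, data))            # target missing: N steps
average = sum(count_linear_steps(t, data) for t in data) / len(data)
print(average)                                 # about N / 2
```

Averaged over every value that is actually in the list, the count comes out to (N + 1) / 2, which is the "about half" figure above.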

Binary Search Efficiency

The running time of binary search also depends on the size of the array, but in a very different way
Each step of the process, the binary search rules out half the remaining options
- The worst case (which we had earlier!) requires a number of steps equal to however many times we can divide the array in half until we have only a single number left.
- Mathematically, this looks like \[1 = N / \underbrace{2 / 2 / 2 / 2 \cdots / 2}_{k\text{ times}} = \frac{N}{2^k}\]
We really want to know the number of steps, \(k\), so solving for \(k\): \[2^k = N \quad\Rightarrow\quad k = \log_2(N)\]
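
A rough empirical check of this result: the sketch below counts loop iterations of a binary search (count_binary_steps is a hypothetical name) when the target is larger than every element. In this implementation the count comes out to floor(log2 N) + 1, since the final single element still needs one probe, so it grows like log2 N.

```python
import math


def count_binary_steps(target, array):
    """Return how many middle elements a binary search inspects."""
    steps = 0
    lh, rh = 0, len(array) - 1
    while lh <= rh:
        steps += 1
        middle = (lh + rh) // 2
        if array[middle] == target:
            break
        elif array[middle] < target:
            lh = middle + 1
        else:
            rh = middle - 1
    return steps


for n in (10, 1_000, 1_000_000):
    data = list(range(n))
    steps = count_binary_steps(n, data)   # n is larger than every element
    print(n, steps, math.floor(math.log2(n)) + 1)
```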

Comparing Efficiencies

The below table illustrates the differences in the number of required steps for the two search algorithms

Problem Size	Linear (\(N\))	Binary (\(\log_2 N\))
10	10	3
100	100	7
1,000	1,000	10
1,000,000	1,000,000	20
1,000,000,000	1,000,000,000	30

Clearly, for large values, the difference in the number of steps is enormous
At 1 million elements, the binary search needs 50,000 times fewer steps (20 instead of 1,000,000)!
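
The table values can be regenerated with a few lines of Python. This sketch assumes the binary column is log2(N) rounded to the nearest whole number, which matches the figures shown; step_table is an illustrative name.

```python
import math


def step_table(sizes):
    """Return (size, linear steps, binary steps) rows for each problem size."""
    rows = []
    for n in sizes:
        linear_steps = n                      # worst case: look at everything
        binary_steps = round(math.log2(n))    # assumption: rounded log2(N)
        rows.append((n, linear_steps, binary_steps))
    return rows


for n, linear_steps, binary_steps in step_table(
        [10, 100, 1_000, 1_000_000, 1_000_000_000]):
    print(f"{n:>13,} {linear_steps:>13,} {binary_steps:>3}")

print(1_000_000 // 20)   # ratio of steps at one million elements
```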

Sorting

Binary search only works on arrays in which the elements are ordered.
- The process of putting the elements into order is called sorting.
Lots of different sorting algorithms, which can vary substantially in their efficiency.
From an algorithms view, sorting is probably the most applicable algorithm we’ll discuss in this course
- Organizing data makes it easier to digest that data, whether the data is being digested by other machines or by humans
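
As a small preview of why sorted data matters, the sketch below sorts a list with Python's built-in sorted and then looks a value up with bisect_left from the standard bisect module, which performs a binary search. The helper name sort_then_search and the sample values are assumptions for illustration.

```python
from bisect import bisect_left


def sort_then_search(target, values):
    """Sort values, then binary-search the sorted copy for target.

    Returns (sorted_list, index), where index is -1 if target is absent.
    """
    ordered = sorted(values)             # new list; the original is untouched
    i = bisect_left(ordered, target)     # leftmost position where target fits
    if i < len(ordered) and ordered[i] == target:
        return ordered, i
    return ordered, -1


codes = [907, 212, 503, 415]             # arbitrary unsorted sample values
print(sort_then_search(503, codes))
print(sort_then_search(999, codes))
```

Sorting has its own cost, so this pays off mainly when the same data will be searched many times.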
