Automata and languages

Published

2023-08-26

Throughout history, and long, long before the invention of modern computing, people have been fascinated by automata. Essentially, an automaton is a machine that somehow operates on its own. For hundreds of years, the construction of automata was restricted to clever gadgets (mechanical toys, automatic door openers, bell ringers, etc.). This subject in itself has a fascinating history.

At some point, people began constructing machines to perform calculations. These don’t always qualify as automata. For example, an abacus is used for calculation, but requires a human operator at each step in the calculation. You can’t ask an abacus “What’s 14 + 23?” and expect an answer.

Automata are machines that can work independently (to a greater or lesser degree).

In this course, the automata we’ll see are more abstract. We’ll investigate automata ranging from deterministic finite automata to Turing machines. These automata don’t require any physical manifestation to be of interest to us. Instead, the automata we’ll investigate serve as models of computation. We’ll ask questions like this: If we create such-and-such an automaton, with these rules and restrictions, what can it do? What can it compute?

We’ll see that different types of automata correspond to different classes of languages, and that each automaton we define recognizes one specific language. What does this mean? To answer this question, we need to understand what we mean by a “language.”

For our purposes, a language is a set of strings. This is very different from a natural language, e.g., English, Korean, Hindi, or Igbo. What we’re considering are sets of strings. This may seem trivial, but rest assured it is not!

Some very mundane examples of languages as sets of strings include:

  • the set of all valid US ZIP codes,
  • the set of all valid international telephone numbers, and
  • the set of all valid 16-digit credit card numbers.
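To make "a language is a set of strings" concrete, here is a minimal Python sketch. A finite language can be written out as a literal set, while a larger one is more conveniently described by a membership test. The names `TINY_LANGUAGE` and `in_zip_shaped_language` are hypothetical, and the five-digit rule is a simplifying assumption: it checks only the shape of a ZIP code, not whether that code is actually assigned.

```python
# A finite language can be listed explicitly as a set of strings.
TINY_LANGUAGE = {"a", "ab", "abb"}

DIGITS = "0123456789"


def in_zip_shaped_language(s: str) -> bool:
    """Membership test for the language of five-digit strings.

    Assumption for illustration: "ZIP-shaped" means exactly five decimal
    digits. Real validity would also depend on which codes are assigned.
    """
    return len(s) == 5 and all(ch in DIGITS for ch in s)


print("ab" in TINY_LANGUAGE)            # membership in a finite language
print(in_zip_shaped_language("05405"))  # five digits
print(in_zip_shaped_language("5405"))   # only four digits
```

Asking whether a string belongs to a language is the basic question every automaton in this course will answer.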

The programs we write (regardless of the language—C++, Java, Haskell, Python, Rust, whatever) are also strings.

Consider:

  • the set of all syntactically valid programs in Python, or
  • the set of all syntactically valid programs in C which calculate the orbital periods of bodies orbiting the sun.

There are many other possible languages:

  • the set of all prime numbers encoded as binary strings,
  • the set of all valid MP3 encodings,
  • the set of all screenplays for Oscar-winning films, and
  • the set of all utterances you’ve made in your lifetime encoded in the International Phonetic Alphabet.

So you see, considering languages as sets of strings covers many possibilities!

We’ll start with what are called regular languages. There are infinitely many such languages, and they share a number of useful properties. Regular languages are recognized by deterministic finite automata, or DFAs. DFAs are the simplest type of automaton we’ll see, but they’re pretty powerful nonetheless. If you’ve ever used regular expressions for pattern matching, guess what? In their basic form, regular expressions describe exactly the regular languages, and many regex engines are built on finite automata.

Examples of regular languages:

  • the set of all valid ISO dates (e.g., 2023-08-28, 1989-06-04, 1918-11-11), and
  • the set of all valid Vermont license plate numbers.
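As a rough illustration of how a DFA recognizes a language, the sketch below simulates a two-state machine in Python. The example language (binary strings containing an even number of 1s) and the names `START`, `ACCEPTING`, `TRANSITIONS`, and `dfa_accepts` are assumptions chosen for brevity; they do not come from the examples above.

```python
# A DFA is described by a start state, a set of accepting states, and a
# transition table mapping (state, symbol) to the next state.
# Hypothetical example: accept binary strings with an even number of 1s.
START = "even"
ACCEPTING = {"even"}
TRANSITIONS = {
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}


def dfa_accepts(s: str) -> bool:
    """Run the DFA on s and report whether it ends in an accepting state."""
    state = START
    for symbol in s:
        key = (state, symbol)
        if key not in TRANSITIONS:
            return False  # symbol is outside the alphabet {0, 1}
        state = TRANSITIONS[key]
    return state in ACCEPTING


print(dfa_accepts("1001"))  # two 1s
print(dfa_accepts("10"))    # one 1
```

The machine reads its input once, left to right, and remembers nothing except its current state. That restriction is what makes DFAs simple, and it is also what limits the languages they can recognize.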

We’ll see another class of languages called context-free languages. These are a superset of regular languages, meaning that all regular languages are context-free, but not all context-free languages are regular. Context-free languages are recognized by a different, more powerful type of automaton called a push-down automaton. (We’ll also see how context-free languages can be generated by context-free grammars or CFGs.)

Examples of context-free languages:

  • the set of all syntactically valid programs in a given programming language,
  • the set of all valid HTML documents,
  • the set of all palindromic strings (strings which are the same forward and backward), and
  • the set of all valid arithmetic or algebraic expressions.
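What a push-down automaton adds to a finite automaton is a stack. The Python sketch below is not a formal push-down automaton, only an illustration of how a stack lets a recognizer match nested brackets, which is the kind of nesting that appears in arithmetic expressions. The language (balanced strings over `(`, `)`, `[`, `]`) and the name `is_balanced` are assumptions for this example.

```python
# Map each closing bracket to the opening bracket it must match.
PAIRS = {")": "(", "]": "["}
OPENERS = set(PAIRS.values())


def is_balanced(s: str) -> bool:
    """Recognize balanced bracket strings using an explicit stack."""
    stack = []
    for ch in s:
        if ch in OPENERS:
            stack.append(ch)  # remember the opener until it is closed
        elif ch in PAIRS:
            # A closer must match the most recent unmatched opener.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
        else:
            return False  # symbol is outside the bracket alphabet
    return not stack  # every opener must have been closed


print(is_balanced("([])()"))
print(is_balanced("(]"))
```

Because brackets can nest to any depth, the recognizer needs unbounded memory to track the unmatched openers. A DFA has only a fixed number of states, so it cannot do this job, which is why this language is context-free but not regular.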

Beyond this we’ll see recursively enumerable languages, also known as Turing-recognizable languages. These are a superset of context-free languages: all context-free languages are Turing-recognizable, but not all Turing-recognizable languages are context-free. These languages are recognized (or generated) by the most powerful type of automaton, the Turing machine, named after its inventor, Alan Turing.

One example of a Turing-recognizable language is the set of all binary encodings of prime numbers (we’ll see others later).
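A membership test for the language of primes in binary can be sketched in a few lines of Python. This is an ordinary program, not a Turing machine, and it uses simple trial division. The name `is_prime_binary` is hypothetical, and accepting leading zeros (so that "011" counts as an encoding of 3) is an assumption about the encoding.

```python
def is_prime_binary(s: str) -> bool:
    """Return True if s is a binary encoding of a prime number.

    Assumption: leading zeros are allowed, so "011" encodes 3.
    """
    if not s or any(ch not in "01" for ch in s):
        return False  # not a binary string at all
    n = int(s, 2)
    if n < 2:
        return False  # 0 and 1 are not prime
    d = 2
    while d * d <= n:  # trial division up to the square root of n
        if n % d == 0:
            return False
        d += 1
    return True


print(is_prime_binary("1101"))  # 13
print(is_prime_binary("1111"))  # 15
```

Note that this test always halts with a yes or no answer, so this particular language is not merely Turing-recognizable but decidable.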

These classes of languages, and the automata that recognize them, form a hierarchy, sometimes called the Chomsky hierarchy.

Chomsky hierarchy

(You may notice that I’ve left out context-sensitive languages. They too are part of this hierarchy, but aren’t particularly interesting in and of themselves, so we won’t address these in this course.)

Context-free languages are a strict subset of Turing-recognizable languages. Regular languages are a strict subset of context-free languages.

Copyright © 2023 Clayton Cafiero. All rights reserved.