Paste output from the CODEX codebook cipher app and watch how a cryptanalyst would attack it. The tool runs a seven-step pipeline: it detects the format, fingerprints the cipher layer, attempts to strip it, performs group frequency analysis, builds a frequency-based mapping, demonstrates a context attack using word segmentation and bigram analysis, then reconstructs the plaintext using all available evidence. Each step shows key statistics with reference ranges, and longer explanations are available via “read more” toggles.
Panels 1–5 above use simple frequency counting — ranking code groups by how often they appear and matching the distribution to expected English. This works against simple substitution ciphers, but codebook ciphers (especially with homophones) are specifically designed to defeat it.
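As a rough illustration of what "simple frequency counting" means here, the sketch below pairs the Nth most frequent code group with the Nth entry of a reference list. The group format, the function name `rank_map`, and the tiny reference ranking are assumptions made for the example, not CODEX internals.

```python
from collections import Counter

# Assumed reference ranking; a real attack would use a large English corpus.
ENGLISH_BY_RANK = ["[space]", "the", "of", "and", "to"]

def rank_map(groups, reference=ENGLISH_BY_RANK):
    """Pair the Nth most frequent group with the Nth reference token."""
    ranked = [group for group, _ in Counter(groups).most_common()]
    return dict(zip(ranked, reference))

# Toy codetext (hypothetical five-digit groups).
codetext = "11111 20417 11111 88302 11111 20417 11111 55190 11111 20417".split()
guess = rank_map(codetext)
print(guess["11111"], guess["20417"])
```

The guesses for the rare groups are arbitrary, which is exactly the weakness that homophones and large codebooks exploit.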
Panel 6 demonstrates the first steps beyond frequency counting. Without homophones, a context attack identifies [space], segments the text into words, and uses bigram patterns to recover dozens of word mappings. With homophones, a distributional clustering attack groups code groups by context similarity to identify which codes are homophones of the same token, then combines their frequencies. Panel 7 lets you toggle between frequency-only and context-enhanced reconstruction to see the difference.
But even context analysis is just the beginning. Historically, every major codebook system was eventually broken. Here is the full toolkit a modern cryptanalyst would bring to bear:
The single most powerful technique. If the analyst knows (or can guess) even a small fragment of the original message — a greeting, a date, a signature, a formulaic opening — they can match those words against the code groups at each position. Every confirmed match recovers codebook entries, and those entries can then be recognized wherever they appear in the rest of the message (or in other messages using the same codebook). This bootstrapping effect means a single good crib can unravel a large portion of the codebook.
Cribs were central to how Bletchley Park attacked Enigma in WWII, and the British Room 40 used known and guessed plaintext, alongside captured codebooks, to recover German naval and diplomatic codes in WWI. The principle is the same for codebooks: find something you know, and use it as a lever.
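A minimal sketch of crib placement, assuming a one-to-one codebook (no homophones) and a codetext with space codes already removed. The function name `crib_positions` and the toy groups are hypothetical. It slides the guessed words along the message and keeps every offset where repeated words line up with repeated groups.

```python
def crib_positions(groups, crib):
    """Return (offset, mapping) for each offset where the crib fits consistently.

    Assumes equal words map to equal groups and different words to
    different groups, which only holds without homophones.
    """
    hits = []
    for start in range(len(groups) - len(crib) + 1):
        word_to_group, group_to_word = {}, {}
        for word, group in zip(crib, groups[start:start + len(crib)]):
            if word_to_group.setdefault(word, group) != group:
                break  # same word would need two different groups
            if group_to_word.setdefault(group, word) != word:
                break  # same group would need two different words
        else:
            hits.append((start, word_to_group))
    return hits

groups = "700 101 202 303 404 101 202 800".split()
hits = crib_positions(groups, ["to", "be", "or", "not", "to", "be"])
print(hits)
```

A crib with repeated words is far more useful than one without, because the repetition pattern rules out most offsets.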
Instead of looking at each code group in isolation, analyze which groups appear next to which other groups. In English, certain word sequences are extremely common: “of the”, “in the”, “to the”, “it is”. If you have identified the code for “the” (in a pure word-level codebook, usually the most frequent group after the space code), then whatever groups appear immediately before it most often are likely “of”, “in”, and “to”. Each confirmed mapping opens up more context-based deductions.
A computer can build a co-occurrence matrix — for every pair of code groups, how often do they appear adjacent? — and compare this matrix against expected English word-pair frequencies. This is far more powerful than single-group frequency counting.
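A small sketch of the adjacency counting described above. It assumes space codes have already been stripped, so neighboring groups are neighboring words. The names and the toy groups are illustrative only.

```python
from collections import Counter

def bigram_counts(groups):
    """Sparse co-occurrence matrix: how often group a is directly followed by group b."""
    return Counter(zip(groups, groups[1:]))

def predecessor_counts(groups, target):
    """Count the groups that appear immediately before `target`."""
    return Counter(a for a, b in zip(groups, groups[1:]) if b == target)

# Suppose "07" has already been identified as the code for "the".
groups = "31 07 55 31 07 62 44 07 31 07 90".split()
before_the = predecessor_counts(groups, "07")
print(before_the.most_common(1))  # the best candidate for "of"
```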
The hardest part of attacking a homophonic codebook is figuring out which code groups are homophones of each other (i.e., different codes for the same plaintext token). Panel 6 above demonstrates a basic version of this when homophones are detected. Modern approaches use distributional similarity: two groups that consistently appear in the same contexts (same neighbors, same positions in sentences) are likely homophones. Techniques include:
Once homophones are clustered, their combined frequency matches the original token frequency, and standard frequency analysis works again.
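One simple way to measure distributional similarity is cosine similarity between neighbor-count vectors. The sketch below is an assumption-laden illustration, not the method Panel 6 uses: the feature choice (immediate left and right neighbor) and the toy groups are made up for the example.

```python
from collections import Counter, defaultdict
from math import sqrt

def context_vectors(groups):
    """Map each group to counts of its (side, neighbor) contexts."""
    vectors = defaultdict(Counter)
    for i, group in enumerate(groups):
        if i > 0:
            vectors[group][("L", groups[i - 1])] += 1
        if i + 1 < len(groups):
            vectors[group][("R", groups[i + 1])] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# A1 and A2 stand in for two homophones of the same token.
groups = "of A1 cat in A2 cat of A2 dog in A1 dog".split()
vectors = context_vectors(groups)
print(cosine(vectors["A1"], vectors["A2"]), cosine(vectors["A1"], vectors["cat"]))
```

In real codetext the neighbors are often homophones themselves, so contexts fragment and clustering usually has to be repeated as groups are merged.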
English has rigid grammatical structure. Once a few words are known, the possibilities for neighboring words are heavily constrained. “the ___ of” is almost always a noun. “___ is” is likely a pronoun or short noun. A computer can use language models (even simple n-gram models) to score candidate decodings and prune implausible ones. Modern large language models make this even more powerful: they can evaluate whether a partial decoding “sounds like English” with high accuracy.
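To make the idea of scoring candidate decodings concrete, here is a minimal bigram model with add-one smoothing. The training text is a toy stand-in for a real corpus, and `train_bigram` is a hypothetical helper name.

```python
from collections import Counter
from math import log

def train_bigram(corpus_words):
    """Return a scoring function based on add-one smoothed bigram counts."""
    unigrams = Counter(corpus_words)
    bigrams = Counter(zip(corpus_words, corpus_words[1:]))
    vocab_size = len(unigrams)

    def score(words):
        """Log-probability of the word sequence; higher means more English-like."""
        return sum(
            log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
            for a, b in zip(words, words[1:])
        )

    return score

score = train_bigram("the king of the north rode to the city of the king".split())
plausible = "the city of the north".split()
scrambled = "the of city north the".split()
print(score(plausible) > score(scrambled))
```

An analyst would generate many candidate decodings, score each one this way, and keep only the best few for the next round of guesses.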
Real codebooks are used for more than one message. Each additional message using the same codebook provides more data for frequency analysis, more context for n-gram analysis, and more opportunities for cribs. Historically, codebook security depended on changing the codebook frequently — but producing and distributing new codebooks was expensive and slow, so in practice books were used for months or years, giving analysts ample material.
If null (dummy) code groups are present, they appear in the frequency distribution as a cluster of low-frequency groups with nearly identical counts — because each null code is inserted with equal probability. An analyst can identify this cluster statistically: look for a plateau in the frequency tail where 10+ groups all appear the same number of times. Once identified, filtering them out restores the original distribution and all standard techniques apply. Nulls add noise but do not change the fundamental structure of the real token frequencies.
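The plateau test can be sketched as bucketing groups by count and reporting any count shared by many groups. The `min_count` cutoff is an assumption added for this example: without it, the ordinary tail of real words that occur only once would also look like a plateau. All group names are toy values.

```python
from collections import Counter, defaultdict

def plateau_candidates(groups, min_size=10, min_count=2):
    """Return {count: groups} for counts shared by at least `min_size` groups."""
    by_count = defaultdict(list)
    for group, n in Counter(groups).items():
        by_count[n].append(group)
    return {
        n: sorted(members)
        for n, members in by_count.items()
        if len(members) >= min_size and n >= min_count
    }

real = ["R1"] * 40 + ["R2"] * 17 + ["R3"] * 9 + ["R4"] * 5
rare = [f"H{i:02d}" for i in range(15)]      # real words seen once
nulls = [f"N{i:02d}" for i in range(12)] * 3  # 12 nulls, 3 uses each
groups = real + rare + nulls

plateau = plateau_candidates(groups)
suspected = {g for members in plateau.values() for g in members}
cleaned = [g for g in groups if g not in suspected]
print(plateau)
```

Real counts are only nearly identical, so a practical version would accept a small range of counts rather than an exact match.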
If the analyst knows the codebook structure (vocabulary size, group length, whether homophones are used), they can use computers to search the space of possible codebooks. For CODEX specifically, the codebook is generated from a seed string and a set of configuration options. If the configuration is known (or can be guessed), the search space reduces to just the seed — and modern computers can test millions of candidate seeds per second.
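A sketch of a seed search, under loud assumptions: `make_codebook` below is a stand-in (a seeded random assignment of five-digit codes) and is NOT how CODEX derives its codebook. The point is only the shape of the attack: regenerate the book for each candidate seed and keep the seeds that reproduce known word-to-group pairs.

```python
import random

VOCAB = ["[space]", "the", "of", "and", "attack", "dawn", "at"]

def make_codebook(seed, vocab=VOCAB):
    """Hypothetical generator: NOT the CODEX algorithm, just a seeded draw of codes."""
    rng = random.Random(seed)
    codes = rng.sample(range(10000, 100000), len(vocab))
    return {word: str(code) for word, code in zip(vocab, codes)}

def search_seeds(candidates, known_pairs):
    """Return the seeds whose codebook reproduces every known (word, group) pair."""
    matches = []
    for seed in candidates:
        book = make_codebook(seed)
        if all(book.get(word) == group for word, group in known_pairs):
            matches.append(seed)
    return matches

secret_book = make_codebook("rosebud")
known = [("attack", secret_book["attack"]), ("dawn", secret_book["dawn"])]
found = search_seeds(["password", "letmein", "rosebud", "codex"], known)
print(found)
```

The known pairs would come from a crib; a dictionary of likely seed strings takes the place of the short candidate list.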
The reconstruction you see in Panel 7 will look surprisingly weak even at 60,000+ groups with no cipher layer and no homophones — lots of unknowns and short runs of guessed letters where you might expect whole words. That is not a bug; it is the structural limit of pure frequency analysis against this codebook design, and it is worth understanding.
The CODEX default codebook only has a few thousand word entries. Every English word not in that book gets spelled out by the encoder — trigram, then digram, then single letter as fallback. Even a literary passage has many such words, so a typical codetext is roughly: ~22% space codes, ~30% individual-letter codes, ~25% digram and trigram codes, ~15% real word codes, ~8% punctuation and control codes. The actual word codes are sparse: even “the” ends up at < 1% of all groups because it is competing with thousands of letter and chunk codes for ranking.
That breaks the textbook “Nth most frequent code = Nth most frequent English word” rank-mapping completely from rank 2 onward. The Decryptor handles this by context-classifying every code first: does the code usually sit alone between two space codes (likely a word code), or does it usually sit inside a run of consecutive non-space codes (likely a chunk of a spelled-out OOV word)? Word codes get rank-mapped against English word frequencies; chunk codes get rank-mapped against English letter frequencies. Chunk runs are wrapped in the reconstruction as one visual unit so they read as a spelled-out word, not as several short adjacent words.
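The context classification can be approximated as follows. This is an illustrative sketch, not the Decryptor's actual code: the 0.5 threshold, the function name, and the choice to treat the start and end of the message as spaces are all assumptions.

```python
from collections import Counter

def classify_codes(groups, space_code, threshold=0.5):
    """Label each non-space code 'word' or 'chunk'.

    A code that usually stands alone between two space codes is called a
    word code; one that usually sits in a run of non-space codes is a chunk.
    """
    alone, total = Counter(), Counter()
    padded = [space_code] + list(groups) + [space_code]
    for prev, cur, nxt in zip(padded, padded[1:], padded[2:]):
        if cur == space_code:
            continue
        total[cur] += 1
        if prev == space_code and nxt == space_code:
            alone[cur] += 1
    return {
        code: "word" if alone[code] / total[code] > threshold else "chunk"
        for code in total
    }

groups = "W1 00 C1 C2 C3 00 W1 00 C2 C1 00 W2".split()
labels = classify_codes(groups, space_code="00")
print(labels)
```

After this step, the word codes and chunk codes can be ranked separately against word and letter frequencies.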
This is much more honest than the rank-only approach, but it is still limited: the codebook's trigram and digram coverage means many chunks are NOT single letters, so the letter-frequency mapping is approximate. Getting past this requires the other techniques described on this page: cribs, n-gram context analysis, and language models.
Against CODEX specifically, a modern analyst with standard tools would likely proceed as follows:
The fundamental weakness of any codebook system is that the mapping is fixed: the same plaintext token always produces codes from the same set. No matter how many homophones or cipher layers are added, this structural regularity leaks information with every message. This is why codebook ciphers were progressively abandoned in the 20th century in favor of stream and block ciphers, which, when used with a fresh nonce or initialization vector, produce different output for every encryption even with the same key.
Disclaimer: these pages are educational demos provided as-is, with no warranty of any kind. The author is not responsible for any consequences arising from their use.
Send comments and bug reports to chris@chrisspackman.com.
Version 0.10 — Last updated: 2026-05-25
This page is Copyright © 2025 – 2026 Chris Spackman <chris@chrisspackman.com>.
This web site developed entirely on GNU/Linux with Free / Open Source Software.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.