Text Processing and Pattern Matching
In classic pattern matching, we are given a text string \(T\) of length \(n\) and a pattern string \(P\) of length \(m\), and want to find whether \(P\) is a substring of \(T\). If so, we may want to find the lowest index \(j\) within \(T\) at which \(P\) begins, such that \(T[j:j+m]\) equals \(P\), or perhaps find all indices of \(T\) at which pattern \(P\) begins.
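For instance, Python's built-in `str.find` computes exactly this lowest index (returning \(-1\) when \(P\) does not occur); the strings below are just an illustrative example:

```python
T = "abacadabra"
P = "cad"

j = T.find(P)   # lowest index at which P begins: j == 3, and T[3:3+len(P)] == "cad"

# All indices of T at which P begins:
hits = [i for i in range(len(T) - len(P) + 1) if T.startswith(P, i)]
# hits == [3]
```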
Brute Force
Time complexity: \(O(nm)\)

Compare \(P\) against every length-\(m\) substring of \(T\), character by character, over all \(n - m + 1\) possible placements.
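A minimal Python sketch of this approach (the function name `find_brute` is illustrative):

```python
def find_brute(T: str, P: str) -> int:
    """Return the lowest index of T at which P begins, or -1 if P is absent."""
    n, m = len(T), len(P)
    for j in range(n - m + 1):                 # every candidate placement of P
        k = 0
        while k < m and T[j + k] == P[k]:      # compare T[j:j+m] with P
            k += 1
        if k == m:                             # all m characters matched
            return j
    return -1

# find_brute("ABABDABACDABABCABAB", "ABABCAB") == 10
```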
Boyer-Moore Algorithm
Introduces two time-saving heuristics:
- Looking Glass Heuristic: when testing a possible placement of \(P\) against \(T\), begin the comparisons from the end of \(P\) and move backward to the front of \(P\).
- Character Jump Heuristic: during the testing of a possible placement of \(P\) within \(T\), a mismatch of text character \(T[i] = c\) with the corresponding pattern character \(P[k]\) is handled as follows: if \(c\) is not contained anywhere in \(P\), then shift \(P\) completely past \(T[i]\) (for it cannot match any character in \(P\)); otherwise, shift \(P\) until an occurrence of character \(c\) in \(P\) gets aligned with \(T[i]\).
The time complexity of the Boyer-Moore algorithm depends on the characteristics of the input text and pattern.
- Best Case: when mismatched text characters rarely occur in the pattern, each mismatch lets the pattern jump completely past the offending character, and the algorithm can achieve sublinear time, \(O(n/m)\), where \(n\) is the length of the text and \(m\) is the length of the pattern.
- Worst Case: the algorithm may exhibit quadratic time complexity, \(O(nm)\).
Despite the worst-case time complexity, Boyer-Moore is often faster than other string searching algorithms in practice, especially for longer patterns and texts, due to its ability to skip large portions of the text efficiently based on precomputed rules.
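A Python sketch of both heuristics, using only the character-jump (last-occurrence) rule; full Boyer-Moore implementations typically add a good-suffix rule on top of this:

```python
def find_boyer_moore(T: str, P: str) -> int:
    """Return the lowest index of T at which P begins, or -1 (character-jump rule only)."""
    n, m = len(T), len(P)
    if m == 0:
        return 0
    last = {c: k for k, c in enumerate(P)}     # rightmost index of each character in P
    i = m - 1                                  # index into T
    k = m - 1                                  # index into P
    while i < n:
        if T[i] == P[k]:                       # looking-glass: compare right to left
            if k == 0:
                return i                       # the pattern begins at index i
            i -= 1
            k -= 1
        else:
            j = last.get(T[i], -1)             # -1 when T[i] does not occur in P
            i += m - min(k, j + 1)             # character jump
            k = m - 1                          # restart at the end of P
    return -1

# find_boyer_moore("ABABDABACDABABCABAB", "ABABCAB") == 10
```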
Knuth-Morris-Pratt Algorithm
The Knuth-Morris-Pratt (KMP) algorithm works by utilizing the information about previous matches to avoid unnecessary comparisons.
Key Components:
- Failure Function (Prefix Function): The algorithm precomputes a "failure function" (or "prefix function"), which stores, for each prefix of the pattern, the length of the longest proper prefix of that prefix that is also its suffix. This information determines how far the pattern can shift when a mismatch occurs.
- Partial Match Table (LPS Array): The failure function is often implemented as a partial match table, also known as the "Longest Proper Prefix which is also Suffix" (LPS) array. This table makes the shift amount available in constant time during the search phase.
Search Phase:
- Comparison of Pattern and Text: The algorithm compares characters of the pattern with the corresponding characters in the text from left to right. When a mismatch occurs, instead of shifting the pattern by one position, KMP uses the failure function to determine the maximum possible shift without missing potential matches.
- Utilizing the LPS Array: The LPS array avoids redundant comparisons by skipping over characters that are already known to match. When a mismatch occurs, the algorithm uses the LPS value of the previous character to determine the next comparison position in the pattern.
Knuth-Morris-Pratt Algorithm: Example
Consider searching for the pattern "ABABCAB" in the text "ABABDABACDABABCABAB". Here's a brief overview of how the KMP algorithm works:
- Precompute the LPS array for the pattern "ABABCAB":
    - LPS array: \([0, 0, 1, 2, 0, 1, 2]\)
- Start the search phase:
    - Begin comparing characters of the pattern and the text from left to right.
    - When a mismatch occurs, use the LPS value of the previous character to determine the next comparison position.
- Progress through the text:
    - When a match occurs, continue comparing subsequent characters.
    - When a mismatch occurs, use the LPS value to determine the next position to start comparing from.
- Continue until the end of the text is reached or a complete match of the pattern is found. In this example, the first complete match begins at index 10 of the text.
Python Implementation
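A compact sketch (the helper names `compute_lps` and `find_kmp` are illustrative):

```python
def compute_lps(P: str) -> list:
    """lps[k] = length of the longest proper prefix of P[:k+1] that is also its suffix."""
    m = len(P)
    lps = [0] * m
    length = 0                         # length of the current candidate prefix
    k = 1
    while k < m:
        if P[k] == P[length]:
            length += 1
            lps[k] = length
            k += 1
        elif length > 0:
            length = lps[length - 1]   # fall back within the pattern itself
        else:
            lps[k] = 0
            k += 1
    return lps


def find_kmp(T: str, P: str) -> int:
    """Return the lowest index of T at which P begins, or -1."""
    n, m = len(T), len(P)
    if m == 0:
        return 0
    lps = compute_lps(P)
    j = 0                              # index into P
    for i in range(n):                 # index into T; it never moves backward
        while j > 0 and T[i] != P[j]:
            j = lps[j - 1]             # shift the pattern via the failure function
        if T[i] == P[j]:
            j += 1
        if j == m:
            return i - m + 1           # the match ends at position i
    return -1


# compute_lps("ABABCAB") == [0, 0, 1, 2, 0, 1, 2]
# find_kmp("ABABDABACDABABCABAB", "ABABCAB") == 10
```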
Knuth-Morris-Pratt Algorithm: Time Complexity
The time complexity of the Knuth-Morris-Pratt algorithm is \(O(n + m)\), where \(n\) is the length of the text and \(m\) is the length of the pattern. The text index never moves backward during the search, so the total number of character comparisons is at most \(2n\), and the precomputation of the LPS array takes \(O(m)\) time.
Longest Common Subsequence Algorithm
Find the longest string \(S\) that is a subsequence of both character strings \(X\) and \(Y\).
The dynamic programming solution for the LCS problem involves building a table to store the lengths of the longest common subsequences of prefixes of the given sequences. The table is filled iteratively, starting from the shortest prefixes and gradually building up to the entire sequences.
Key Steps:
- Initialization: Initialize a table \(dp\) with dimensions \((n+1) \times (m+1)\), where \(n\) and \(m\) are the lengths of the two sequences, respectively. Initialize the first row and first column of the table to 0.
- Filling the Table: Iterate over the sequences and fill the table according to the following rules (written symbolically after this list):
    - If the current characters match, set the entry to one more than the entry for both prefixes shortened by one character.
    - If the current characters do not match, take the maximum of the two entries obtained by dropping the last character of one sequence or the other.
- Backtracking: Once the table is filled, backtrack from the bottom-right corner to reconstruct the longest common subsequence.
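In symbols, with the sequences \(X\) and \(Y\) taken as 1-indexed (a notational choice):

\[
dp[i][j] =
\begin{cases}
dp[i-1][j-1] + 1 & \text{if } X[i] = Y[j], \\
\max\big(dp[i-1][j],\; dp[i][j-1]\big) & \text{otherwise.}
\end{cases}
\]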
Example:
Consider two sequences "ABCBDAB" and "BDCAB". Here's a step-by-step explanation of the dynamic programming solution:
- Initialize the first row and column to 0, then fill the table (final values shown; columns follow "ABCBDAB", rows follow "BDCAB"):

|   |   | A | B | C | B | D | A | B |
|---|---|---|---|---|---|---|---|---|
|   | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| B | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| D | 0 | 0 | 1 | 1 | 1 | 2 | 2 | 2 |
| C | 0 | 0 | 1 | 2 | 2 | 2 | 2 | 2 |
| A | 0 | 1 | 1 | 2 | 2 | 2 | 3 | 3 |
| B | 0 | 1 | 2 | 2 | 3 | 3 | 3 | 4 |
- Backtrack from the bottom-right corner to reconstruct the LCS: "BCAB"
Time Complexity:
The time complexity of the dynamic programming solution for the LCS problem is \(O(nm)\), where \(n\) and \(m\) are the lengths of the two sequences, respectively. This is because we fill an \(n \times m\) table, and each cell requires constant time to compute based on the values of its neighbors.
Python Implementation
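A sketch of the table-filling and backtracking steps (the function name `lcs` is illustrative):

```python
def lcs(X: str, Y: str) -> str:
    """Return one longest common subsequence of X and Y."""
    n, m = len(X), len(Y)
    # dp[i][j] = length of the LCS of X[:i] and Y[:j]; row and column 0 stay 0.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if X[i - 1] == Y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack from the bottom-right corner to reconstruct one LCS.
    out, i, j = [], n, m
    while i > 0 and j > 0:
        if X[i - 1] == Y[j - 1]:
            out.append(X[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


# lcs("ABCBDAB", "BDCAB") == "BCAB"
```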
Text Compression
The text compression problem involves efficiently encoding a string \(x\) into a smaller binary string \(y\) using only the characters 0 and 1. The goal is to reduce the size of the encoded string while preserving its essential information.
Huffman Coding
Huffman coding is a popular algorithm used for lossless data compression. It works by assigning variable-length codes to input characters, with shorter codes assigned to more frequent characters and longer codes assigned to less frequent characters. This ensures that the most common characters are represented by shorter codes, resulting in efficient compression.
Key Concepts:
- Character Frequencies: Huffman coding takes advantage of the frequency distribution of characters in the input text. Characters that occur more frequently are assigned shorter codes, while characters that occur less frequently are assigned longer codes.
- Prefix Codes: Huffman codes are prefix-free, meaning that no code word is a prefix of another code word. This property makes the encoding uniquely decodable.
- Binary Tree Representation: Huffman codes can be represented using a binary tree, where each leaf node corresponds to a character and each internal node holds the sum of the frequencies of its children. The tree is constructed in a way that minimizes the total encoding length.
Huffman Coding Algorithm
The Huffman coding algorithm follows these steps:
- Frequency Counting: Count the frequency of each character in the input text.
- Build Huffman Tree: Build a Huffman tree using a priority queue (min-heap) or similar data structure. The tree is constructed by repeatedly merging the two nodes with the lowest frequencies until only one node remains, which becomes the root of the Huffman tree.
- Generate Codes: Traverse the Huffman tree to generate Huffman codes for each character. The code for each character is determined by the path from the root of the tree to the leaf node representing that character.
- Encode Data: Encode the input text using the generated Huffman codes.
- Decode Data: Decode the encoded text using the same Huffman tree.
Python Implementation
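A sketch of the five steps above, using `heapq` as the min-heap; the tree representation (leaves as characters, internal nodes as pairs) is one convenient choice, and the exact bit patterns produced depend on tie-breaking:

```python
import heapq
from collections import Counter


def huffman_codes(text: str) -> dict:
    """Build a code table {character: bit string} for text."""
    freq = Counter(text)                           # 1. frequency counting
    # Heap entries are (frequency, tie-breaker, tree); a tree is either a
    # character (leaf) or a (left, right) pair (internal node).
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if not heap:
        return {}
    count = len(heap)
    while len(heap) > 1:                           # 2. build the Huffman tree by
        f1, _, t1 = heapq.heappop(heap)            #    repeatedly merging the two
        f2, _, t2 = heapq.heappop(heap)            #    lowest-frequency nodes
        heapq.heappush(heap, (f1 + f2, count, (t1, t2)))
        count += 1
    root = heap[0][2]

    codes = {}
    def walk(node, path):                          # 3. generate codes by DFS
        if isinstance(node, str):                  # leaf: a single character
            codes[node] = path or "0"              # degenerate one-symbol input
        else:
            walk(node[0], path + "0")
            walk(node[1], path + "1")
    walk(root, "")
    return codes


def encode(text: str, codes: dict) -> str:         # 4. encode
    return "".join(codes[ch] for ch in text)


def decode(bits: str, codes: dict) -> str:         # 5. decode
    inv = {v: k for k, v in codes.items()}         # safe because codes are prefix-free
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in inv:
            out.append(inv[cur])
            cur = ""
    return "".join(out)
```

Round-tripping holds: for any string `s`, `decode(encode(s, codes), codes) == s` with `codes = huffman_codes(s)`.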
Complexity Analysis
- Building Huffman Tree: The time complexity of building the Huffman tree is \(O(n \log n)\), where \(n\) is the number of unique characters in the input text. This is because each insertion and deletion operation in the priority queue takes \(O(\log n)\) time, and there are \(n\) characters to process.
- Generating Codes: The time complexity of generating Huffman codes is \(O(n)\), where \(n\) is the number of unique characters in the input text. This is because a single traversal of the Huffman tree visits each of its \(O(n)\) nodes once, emitting each character's code as the path from the root to its leaf.
- Encoding and Decoding: The time complexity of encoding and decoding the text using Huffman codes is \(O(m)\), where \(m\) is the length of the input text. This is because each character in the input text is replaced by its corresponding Huffman code, which takes constant time.