Rabin-Karp Algorithm
@rabinkarp_algorithmPseudo-code
function rabin_karp(T, P):
n <- length of T
m <- length of P
d <- base
q <- prime number
h <- 1
for i <- 0 to m - 2:
h <- (h * d) mod q
p_hash <- 0
t_hash <- 0
for i <- 0 to m - 1:
p_hash <- (d * p_hash + value(P[i])) mod q
t_hash <- (d * t_hash + value(T[i])) mod q
for i <- 0 to n - m:
if p_hash == t_hash then
for j <- 0 to m - 1:
if T[i + j] != P[j] then
break
if j == m then
report match at index i
if i < n - m then
t_hash <- (d * (t_hash - value(T[i]) * h) + value(T[i + m])) mod q
if t_hash < 0 then
t_hash <- t_hash + q
- T T is the full text where we search, like a long sentence or document.
- P P is the pattern we are trying to find inside the text.
- n n is the length of the text T.
- m m is the length of the pattern P.
- d d is the base used for hashing, often the number of possible characters.
- q q is a prime number used to keep hash values small and reduce collisions.
- h h is the hash value of the pattern.
- t t is the current hash value of a substring in the text.
- i i is the current starting index in the text.
- j j is used when checking characters one by one after a hash match.
- 0 0 is used to initialize hash values.
- 26 26 represents the number of lowercase letters (can vary depending on alphabet).
- d d is reused as the base for building hash values.
- q q is the modulus to prevent overflow and reduce collisions.
Recipe Steps
Think of each word as having a unique number (hash). Instead of comparing full words, we compare numbers first.
Compute the hash of the pattern and the first part of the text with the same length.
Compare the two hash values. If they are different, we know the strings are different without checking every character.
If the hashes match, then check character by character to be sure (because different strings can have same hash).
Move the window one step forward and update the hash using a rolling formula instead of recomputing everything.
Repeat this process until you reach the end of the text.
Questions & Answers
- Rolling hash is the key optimization that makes updates fast.
- Choosing a good prime number reduces hash collisions.
- It is especially useful when searching for multiple patterns.
- Hash collisions are rare but must always be handled.
- Works well for large texts where naive matching would be slow.
Related Algorithms
Z-Algorithm
Z-Algorithm is a string searching algorithm that uses a preprocessed array to find all occurrences of a pattern within a main string. It is particularly useful for finding all occurrences of a pattern in a long string, as it allows for efficient searching and multiple match detection.
View