Rabin-Karp Algorithm

@rabinkarp_algorithm

Rabin-Karp is a string searching algorithm that uses hashing to quickly compare a pattern with parts of a text using a rolling hash technique.

Pseudo-code

function rabin_karp(T, P):
    n <- length of T
    m <- length of P
    d <- base
    q <- prime number

    h <- 1
    for i <- 0 to m - 2:
        h <- (h * d) mod q

    p_hash <- 0
    t_hash <- 0

    for i <- 0 to m - 1:
        p_hash <- (d * p_hash + value(P[i])) mod q
        t_hash <- (d * t_hash + value(T[i])) mod q

    for i <- 0 to n - m:
        if p_hash == t_hash then
            for j <- 0 to m - 1:
                if T[i + j] != P[j] then
                    break
            if j == m then
                report match at index i

        if i < n - m then
            t_hash <- (d * (t_hash - value(T[i]) * h) + value(T[i + m])) mod q

            if t_hash < 0 then
                t_hash <- t_hash + q

T T is the full text where we search, like a long sentence or document.
P P is the pattern we are trying to find inside the text.
n n is the length of the text T.
m m is the length of the pattern P.
d d is the base used for hashing, often the number of possible characters.
q q is a prime number used to keep hash values small and reduce collisions.
h h is the hash value of the pattern.
t t is the current hash value of a substring in the text.
i i is the current starting index in the text.
j j is used when checking characters one by one after a hash match.

0 0 is used to initialize hash values.
26 26 represents the number of lowercase letters (can vary depending on alphabet).
d d is reused as the base for building hash values.
q q is the modulus to prevent overflow and reduce collisions.

Recipe Steps

Think of each word as having a unique number (hash). Instead of comparing full words, we compare numbers first.

Compute the hash of the pattern and the first part of the text with the same length.

Compare the two hash values. If they are different, we know the strings are different without checking every character.

If the hashes match, then check character by character to be sure (because different strings can have same hash).

Move the window one step forward and update the hash using a rolling formula instead of recomputing everything.

Repeat this process until you reach the end of the text.

Questions & Answers

Hashing allows fast comparison of strings by comparing numbers instead of characters.

It is when two different strings have the same hash value.

Because hashes can collide, so we must confirm the match.

The rolling hash lets us update the hash in constant time instead of recomputing it fully.

- The algorithm avoids comparing full strings unless necessary.
- Rolling hash is the key optimization that makes updates fast.
- Choosing a good prime number reduces hash collisions.
- It is especially useful when searching for multiple patterns.
- Hash collisions are rare but must always be handled.
- Works well for large texts where naive matching would be slow.

Related Algorithms

Z-Algorithm

Z-Algorithm is a string searching algorithm that uses a preprocessed array to find all occurrences of a pattern within a main string. It is particularly useful for finding all occurrences of a pattern in a long string, as it allows for efficient searching and multiple match detection.

View

Rabin-Karp Algorithm

Pseudo-code

Recipe Steps

Questions & Answers

Why use hashing in Rabin-Karp?

What is a collision?

Why do we still compare characters after hash match?

What makes it efficient?

Related Algorithms

Z-Algorithm