Rolling Hash

Summary

Rolling hash is a common technique to compare if two substrings are equal or not. Given O(n) preprocessing time, the comparison becomes O(1) at the best case.

OJ

 TrickDifficulty
LC 1316. Distinct Echo SubstringsRolling Hash + string_view or DP in Linear Structure + string_view6 points

Details

We use two large coprime numbers b for base and m for mod. We use a polynomial to calculate the hash value. Typical values are b=1e9+7 and m=1e9+9

    \[f(s) = \sum s[i] \times b^i \pmod m\]

Given a substring s[i, j], it is not difficult to get that the hash value of it is,

    \[f(s[i, j]) = \frac{(f(s[0, j]) - f(s[0, i - 1]) + m)}{b^i} \pmod m\]

We could preprocess tables for all the f[0, i] and b^i in O(n). After that, the complexity of getting a hash value for any substring is O(1).

We then have the following truth,

    \[f(str0)==f(str1) \centernot\implies str0==str1\ (hash\ collision)\]

    \[f(str0)\centernot=f(str1) \implies str0\centernot=sub1\]

int b = 31;
int m = 1e9 + 9;

vector<int> hash;
vector<int> bases;

void preprocess(string_view text) {
    ll sum = 0;
    ll base = 1;
    for (auto ch : text) {
        sum += (ch - '0') * base; 
        sum %= m;
        hash.push_back(sum);
        bases.push_back(base);
        base = (base * b) % m;
    }
}

ll get_hash(int left, int right) {
    return (hash[right] - ((left >= 1) ? hash[left - 1] : 0) + m) % m;
}

// s[i, i + len - 1], s[j, j + len - 1]
bool equal(int i, int j, int len, string_view const &text) {
    int hash0 = (get_hash(i, i + len - 1) * bases[j]) % m, hash1 = (get_hash(j, j + len - 1) * bases[i]) % m;
    if (hash0 != hash1) {
        return false;
    } else {
        return text.substr(i, len) == text.substr(j, len);
    }
}

Two Hash Functions

In LC 1044. Longest Duplicate Substring, you might find out one hash rolling hash function causes collisions. Another technique for rolling hash is to use Two hash functions (two bases and two mods).

    \[f_i(s) = \sum s[i] \times b_i^i \pmod {m_i}\]

    \[f_1(s[i, j]) = \frac{(f_1(s[0, j]) - f_1(s[0, i - 1]) + m_1)}{b_1^i} \pmod {m_1}\]

    \[f_2(s[i, j]) = \frac{(f_2(s[0, j]) - f_2(s[0, i - 1]) + m_2)}{b_2^i} \pmod {m_2}\]

Use f_1 ^ f_2 directly here will cause collisions. Instead, we use a hash set of pairs to store these two values. HashSet in the standard library helps us handle hash collisions, which is sufficient to pass the OJ.

struct PairHash {
    size_t operator() (pair<ll, ll> const &p) const {
        return hash<int>()(p.first) ^ hash<int>()(p.second);
    }
};

unordered_set<pair<int, int>, PairHash> visited;

Another technique you might use here is the Modular Multiplicative Inverse. Please refer to this post.

Leave a Reply

Your email address will not be published. Required fields are marked *