2019年11月9日 星期六

String Replace


String Replace

  • Lintcode : 841 / Leetcode
  • Level : Hard

Problem

Given two identical-sized string array A, B and a string S. All substrings A appearing in S are replaced by B.(Notice: From left to right, it must be replaced if it can be replaced. If there are multiple alternatives, replace longer priorities. After the replacement of the characters can't be replaced again.)

Example
Example 1

Input:
A = ["ab","aba"]
B = ["cc","ccc"]
S = "ababa"

Output: "cccba"
Explanation: In accordance with the rules, the substring that can be replaced is "ab" or "aba". Since "aba" is longer, we replace "aba" with "ccc". 
Example 2

Input:
A = ["ab","aba"]
B = ["cc","ccc"]
S = "aaaaa"

Output: "aaaaa"
Explanation: S does not contain strings in A, so no replacement is done.
Example 3

Input:
A = ["cd","dab","ab"]
B = ["cc","aaa","dd"]
S = "cdab"

Output: "ccdd"
Explanation: From left to right, you can find the "cd" can be replaced at first, so after the replacement becomes "ccab", then you can find "ab" can be replaced, so the string after the replacement is "ccdd".
Notice
The size of each string array does not exceed 100, the total string length does not exceed 50000.
The lengths of A [i] and B [i] are equal.
The length of S does not exceed 50000.
All characters are lowercase letters.
We guarantee that the A array does not have the same string

Concept & Algorithm

Time Complexity & Space Complexity

time : O(s.size * a.size)
space: O(n)

Answer

class Solution {
public:
    string stringReplace(vector<string> &a, vector<string> &b, string &s) {
        vector<long long> aHash, base, sHash;
        const int BASE = 10000007;
        const int seed = 37;
        int maxLen = 0;
        // constuct aHash
        for (auto l : a) {
            long long ans = 0;
            maxLen = max(maxLen, (int)l.size());
            for (int i = 0; i < l.size(); i++) {
                ans = (ans * seed + l[i]) % BASE;
            }
            aHash.push_back(ans);
        }

        // construct sHash
        sHash.push_back(0);
        long long ans = 0;
        for (int i = 0; i < s.size(); i++) {
            ans = (ans * seed + s[i]) % BASE;
            sHash.push_back(ans);
        }

        // construct base hashtable
        base.push_back(1);
        ans = 1;
        for (int i = 0; i < maxLen; i++) {
            ans = (ans * seed) % BASE;
            base.push_back(ans);
        }

        int sHashValue = 0;
        for (int i = 0; i < s.size(); i++) {
            maxLen = 0;
            int index;
            for (int j = 0; j < a.size(); j++) {
                int len = a[j].size();
                if (i + len > s.size()) continue;
                int l = i + 1;
                int r = i + len;
                sHashValue = (sHash[r] - base[len] * sHash[l - 1] % BASE + BASE) % BASE;
                int aHashValue = aHash[j];
                if (aHashValue == sHashValue && len > maxLen) {
                    maxLen = len;
                    index = j;
                }
            }
            if (maxLen != 0) {
                for (int k = i; k < i + maxLen; k++) {
                    s[k] = b[index][k - i];
                }
                i = i + maxLen - 1;
            }
        }
        return s;
    }
};

/*
首先要知道滚动哈希是怎么样的思路,可以看九章微课堂或者维基百科了解一下。其实就是一个对字符串的hash函数,例如对字符串"abc",它的hash值计算为(('a')*seed+'b')*seed+'c',而对于字符串"abcd",它的hash值为((('a')*seed+'b')*seed+'c')*seed+'d'(这里的seed也可以叫作base,但为了和代码中保持一致我使用了seed)。可以发现对于每一个字符串,它的hash值都能由它的上一个前缀递推过来,这也是它称之为滚动的原因。

现在将"abcd"的哈希值展开,得到:'a'*seed^3+'b'*seed^2+'c'*seed^1+'d'*seed^0。然后看"ab"的哈希值展开,为'a'*seed^1+'b'*seed^0。假设这时候我要求"cd"的哈希值,我会发现"cd"的哈希值为'c'*seed^1+'d'*seed^0,不就是"abcd"的哈希值表达式中次数较低的两项吗?因此我只要将最高次数的两项减掉就可以了,然后我又发现最高次数的两项不就是"ab"的哈希值表达式乘以seed^2吗,因此我们得到hash("cd")=hash("abcd")-hash("ab")*seed^2,类似于34=1234-12*100。
现在将它推广,就可以使用字符串前缀的hash值在常数时间内计算出任意的子串的hash值:考虑字符串s在[l,r]中的子串(为了和代码的下标统一起来,我假定这个下标从1开始),我可以用s[l,r]表示这个子串,那么就有hash(s[l,r]) = hash(s[1,r])-hash(s[1,l-1])*seed^len(s[l,r]) = hash(s[1,r])-hash(s[1,l-1])*seed^(r-l+1)。

在理解了这些的基础上,这个代码就很好理解了。首先预处理出a中字符串的hash值放到aHash中,然后预处理出s的所有前缀的hash值放到sHash中,然后将seed从0开始的所有幂次seed^0,seed^1,seed^2...放到base中。因为上面说的hash函数滚动的特性,hash(s[1,n])能够从hash(s[1,n-1])中递推过来,因此这一步花费的时间正比于s和a中所有字符串的长度总和。要注意的是代码中的sHash和aHash和上面说的有些不同。可能是为了方便迭代计算,它规定了sHash[0]=1,因此之后的每一项都会多一个1*seed^...的项,但两个前缀相减之后这一项会消去,因此不会影响上面那个表达式的计算。然后有了这些之后,我知道对于s的任意一个子串s[l,r],我都能在O(1)时间内计算出它的哈希值了,因为有hash(s[l,r]) = hash(s[1,r])-hash(s[1,l-1])*seed^(r-l+1),而hash(s[1,r])和hash(s[1,l-1])和seed^(r-l+1)这些都已经被我们预处理完了。

下面的思路就很简单了,对于s的每一个位置,判断是否有a中的字符串可以匹配,如果有的话取最长的那个进行替换,并跳过这些被替换的字符;如果没有则将当前位置向前移动一个字符。最后返回被替换的字符串就好了。
*/
tags: string

沒有留言:

張貼留言

Last Position of Target

Last Position of Target Lintcode : 458 / Leetcode : 34 Level : Easy Problem Find the last position of a target number in a sorted ...