[Algorithm] Implement Split Function with KMP Algorithm

Splitting a string per the given seperator/delimiter similar to split() funtion in Python.
Implement with Knuth–Morris–Pratt string searching algorithm (or KMP algorithm) in C++ language.

Source code here
Any feedbacks are welcome!

This task can be divide into two part. Firstly, we need to find out the position of matched delimiter in string. Secondly, we need to return a list of the words septerated by the delimiter considering special cases.

Pattern search

Although there are many methods(set 1-8) for pattern search, I focus on KMP algorithm which is clean and fast.

Example:
Search sep “AABCAABA” in string “AABCAA BAABCAABA”

Naive Pattern Searching

Well, my first thought was sliding the delimiter over string one by one for a match:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
vector<string> split(const string &s, const string &sep) {
vector<string> res;
int last = 0, i = 0;
int m = s.length(), n = sep.length();
while(i < m) {
int find_sep = 0;
if (s[i] == sep[0] && i + n <= m) {
if (s.substr(i, n) == sep) {
res.push_back(s.substr(last, i-last));
last = i + n;
i += n;
find_sep = 1;
}
}
if (find_sep == 0) ++i;
}
res.push_back(s.substr(last, i - last));
return res;
}

Notice: this function doesn’t split string by space in default. For example, split("a b c", "") = "a b c", which I think is also resonable.

Please read image from top to bottom, left to right.

Best case: First character of the delimiter is not present in string.
split("AAAAAA", "BC")
Time complexity: O(m), m is string’s length.
Worst case: Only last character of the delimiter is different.
split("AAAAAA", "AAB")
Time complexity: O((m-n+1)n), m is string’s length and n is delimiter’s length.

Knuth–Morris–Pratt(KMP) string searching algorithm

Time complexity: O(m+n) in worst case, m is string’s length and n is delimiter’s length.

In fact, we can directly jump to a potential position if the previous matched substring has same prefix and suffix. That is to say, we can make use of previous info and skip comparing some partial matched substring.

Please read image from top to bottom, left to right.

See also: this post(字符串匹配的KMP算法) provides explict diagrams too.

Then the only problem is how many steps we should jump? We should jump to the longest matched suffix:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
ABCDAB ABCDABCDABDE // " " mismatch
ABCDABD
=> // "AB" matched in substring "ABCDAB", jump four steps
ABCDAB ABCDABCDABDE // " " mismatch
ABCDABD
=> // no match in substring "AB", forward one step
ABCDAB ABCDABCDABDE // "B" mismatched
ABCDABD
=> // ......
ABCDAB ABCDABCDABDE // "D" mismatched
ABCDABD
=> // "AB" matched in substring "ABCDAB", jump four steps
ABCDAB ABCDABCDABDE // matched!
ABCDABD
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
vector<int> table = partial_match_table(delimiter);
int i = 0, j = 0; // point to string and delimiter respectively
int last = 0; // to split position
int m = str.size(), n = delimiter.size();
while (i < m) {
if (str[i] == delimiter[j]) {
++i;
++j;
}
else if (j == 0) {
++i;
}
else { // skip matched string of prefix of sep and suffix of s
j = table[j - 1];
}
if (j == n) { // find a seperator
res.push_back(str.substr(last, i - n - last));
last = i;
j = 0; // start to match again
}
}
res.push_back(str.substr(last, i - last));

For example, for delimiter AABCAABA, consider substring:
AA: one character A matched
AAB: zero character matched
AABC: zero character matched
AABCA: one character A matched
AABCAA: two character AA matched
AABCAAB: three character AAB matched
AABCAABA: one character A matched

I recommend read codes by calculating the above example first.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
/* Return a table indicates the length of partial
* matched string for each substring */
vector<int> partial_match_table(string delimiter) {
vector<int> res;
res.push_back(0);
unsigned int i = 1; // iterate over delimiter, the end of prefix
unsigned int len = 0; // the end of suffix, the length of matched substr
while (i < delimiter.size()) {
if (delimiter[i] == delimiter[len]) {
++len;
res.push_back(len);
++i;
}
else {
if(len) len = res[len - 1]; // trick
else {
++i;
res.push_back(0);
}
}
}
return res;
}

Here’s my draft(L indicate len in code):

Matched prefix is the substring of [0, L] and suffix is the substring of [i-L, i]. They will become longer when matching (delimiter[i] == delimiter[len]), in which case i and L step forward simultaneously. If mismatch, L will step backward for a shorter potential matched string.
Their’s a trick when strings mismatch and L is not at the begining, we set len = res[len - 1]. Try to understand this trick by counterexample.

From the above codes, we can see their’s only one loop for each function, that’s why KMP has linear time complexity in worst case.

Split

We can find the split function definition in Python Standard Library

Return a list of the words in the string, using sep as the delimiter string.
If sep is given, consecutive delimiters are not grouped together and are deemed to delimit empty strings (for example, ‘1,,2’.split(‘,’) returns [‘1’, ‘’, ‘2’]). The sep argument may consist of multiple characters (for example, ‘1<>2<>3’.split(‘<>’) returns [‘1’, ‘2’, ‘3’]). Splitting an empty string with a specified separator returns [‘’].
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].

Here I design various tests to consider as much categories as possible cases. All expected_outs are generated by python output. Some of the test cases come from Python documents.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
void test(const int test_num, const string &in,
const string &expected_out, const string &sep = "");
int main() {
/* ************************** sep is given *************************** */
// mormal test
test(0, "1<>2<>3", "['1', '2', '3']", "<>");
test(1, "1,,2", "['1', '', '2']", ",");
test(2, "Line1-abcdef \nLine2-abc \nLine4-abcd", "['Line1-abcdef', '\nLine2-abc', '\nLine4-abcd']", " ");
test(3, "topeka,kansas city,wichita,olathe", "['topeka', 'kansas city', 'wichita', 'olathe']", ",");
test(4, "I am a student.", "['I ', 'm ', ' student.']", "a");
test(5, "I am\n a \tstudent.", "['I ', 'm\n ', ' \tstudent.']", "a");
// edge test
test(6, "a*", "['a', '']", "*"); // sepertor in the end
test(7, "**abc**abc**", "['', 'abc', 'abc', '']", "**"); // seperator in the start and end of string
test(8, "", "['']", "a"); // null string
test(9, " a", "['', 'a']", " "); // leading whitespace
test(10, "a ", "['a', '']", " "); // trailing whitespace
test(11, " a ", "['', 'a', '']", " "); // leading and trailing whitespaces
test(12, " ", "['', '', '', '']", " "); // whitespaces
test(13, "***", "['', '', '', '']", "*"); // string consist of several seperators
test(14, "*", "['', '']", "*"); // string equals seperator
test(15, "aaa", "['', 'a']", "aa");
/* *************** sep is not specified or is null("") ******************** */
test(16, "I am a student.", "['I', 'am', 'a', 'student.']");
test(17, "Hello , world\t!\n", "['Hello', ',', 'world', '!']");
test(18, "", "[]"); // empty string
test(19, "", "[]", ""); // empty string
test(20, "Hello , world\t!\n", "['Hello', ',', 'world', '!']", ""); // string with spaces
test(21, "a", "['a']", ""); // string without space
test(22, " ", "[]"); // leading or trailing whitespace
test(23, " 1 2 3 ", "['1', '2', '3']");
system("pause");
return 0;
}