Investigating filter matching algorithms · 2006-08-22 20:02 by Wladimir Palant
Warning: dense computer science talk ahead. Proceed at your own risk!
This is the story behind improving the performance of Adblock’s most important component — filter matching. The problem is the following: we have a set F consisting of f filters, we have a string (an address) S of length n, and we need to know whether this string matches any of the filters. If it does, we need to block the element associated with this address. And we should decide really soon because there are a few hundred more of those addresses to be tested and the user is waiting impatiently.
What does Adblock do, and what did Adblock Plus 0.5/0.6 do? They all used the trivial algorithm:
function getMatchingFilter1(S, F)
  for each filter in F
    if (filter matches S)
      return filter
    end if
  end for
  return null
end function
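In JavaScript this is just a linear scan. A minimal sketch, assuming each filter object carries a precompiled regular expression in a `regexp` property (the property name and the sample filter list are illustrative, not Adblock's actual data structures):

```javascript
// Trivial algorithm: test the address against every filter in turn.
// Each filter is assumed to hold a precompiled RegExp in "regexp".
function getMatchingFilter1(S, F) {
  for (const filter of F) {
    if (filter.regexp.test(S))
      return filter;
  }
  return null; // no filter matched
}

// Hypothetical filter list for illustration:
const filters = [
  {text: "/banners/", regexp: /\/banners\//},
  {text: "/adserver/", regexp: /\/adserver\//}
];
```

Every address pays for every filter here, which is exactly the O(n ⋅ f) behavior discussed below.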
The complexity of a single regexp match is roughly O(n), so the overall complexity of the algorithm is O(n ⋅ f). That is bad, because performance degrades with the number of filters. Consequently, users try to improve performance by compressing their filter lists at any cost. Among other things, that means large, totally incomprehensible regular expressions as well as overly general filters.
Can we do better? As long as we deal with regular expressions, which are pretty much a black box for us — no, we really have to test all the filters, since any of them might match. And we don't want to get into the business of analyzing the semantics of regular expressions; that's a science of its own. Fortunately, most filters are not (or rather: should not be) specified as regular expressions, they are merely converted into them. And from those simple filters we can extract a characteristic substring ("shortcut") that we know must be present in any string matched by the filter.
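A sketch of what such shortcut extraction could look like, assuming "*" is the only wildcard in a simple filter and that we want shortcuts of a fixed length m (the function and the fixed-length convention are assumptions for illustration, not Adblock Plus's actual implementation):

```javascript
// Extract a fixed-length "shortcut" from a simple filter's text.
// Assumes "*" is the only wildcard; any literal run of at least m
// characters can donate a shortcut. Returns null when the filter has
// no sufficiently long literal part (such filters stay on the slow path).
function extractShortcut(filterText, m) {
  const literalParts = filterText.split("*");
  for (const part of literalParts) {
    if (part.length >= m)
      return part.substring(0, m).toLowerCase();
  }
  return null;
}
```

For example, `extractShortcut("*/adbanner/*", 8)` yields the first eight literal characters, while a filter like `*ads*` is too short to produce a shortcut at all.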
Extracting shortcuts changes our problem quite significantly. Instead of a set of filters we have a set C of shortcuts, each of length m. And we look up these shortcuts in S — if we find one, we have to test the string against the corresponding filter. Since we only need to test against filters whose shortcuts we actually find, that test will happen rarely, and only the effectiveness of the shortcut search matters for the overall performance.
Note: we still might have a number of filters that have been specified as regular expressions or that don’t have enough text to extract a shortcut from. For those we are stuck with the trivial algorithm which doesn’t mean that we cannot process the other filters separately and more efficiently.
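The resulting two-tier dispatch could be sketched like this — a fast shortcut-based matcher over most filters, with the trivial scan reserved for the few filters that have no shortcut (the function names and filter shape here are assumptions for illustration):

```javascript
// Two-tier matching: run the fast shortcut-based matcher first, then
// fall back to a trivial scan over the (hopefully short) list of
// filters that could not provide a shortcut.
function getMatchingFilter(S, shortcutMatcher, slowFilters) {
  const hit = shortcutMatcher(S); // fast path over most filters
  if (hit)
    return hit;
  for (const filter of slowFilters) { // slow path: trivial algorithm
    if (filter.regexp.test(S))
      return filter;
  }
  return null;
}
```

The point is simply that the slow path's cost now depends only on the number of shortcut-less filters, not on the whole list.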
Some might recognize the problem as string search using a finite set of patterns. My first approach to solving it simply stored all shortcuts in a hash table H and looked up all substrings of S with length m in this hash table:
function initializeHashTable2(H, C)
  for each c in C
    H(c) := filter(c)
  end for
end function

function getMatchingFilter2(S, H, m)
  for each substring in (substrings of S with length m)
    if (substring in H and H(substring) matches S)
      return H(substring)
    end if
  end for
  return null
end function
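A runnable JavaScript sketch of this hash-table approach, using a `Map` keyed by shortcut (the `shortcut`/`regexp` filter properties are illustrative assumptions):

```javascript
// Build a hash table mapping each filter's shortcut (length m) to the
// filter object itself.
function initializeHashTable2(C) {
  const H = new Map();
  for (const filter of C)
    H.set(filter.shortcut, filter);
  return H;
}

// Look up every substring of S with length m; on a shortcut hit, test
// the full filter against S.
function getMatchingFilter2(S, H, m) {
  for (let i = 0; i + m <= S.length; i++) {
    const candidate = H.get(S.substring(i, i + m));
    if (candidate && candidate.regexp.test(S))
      return candidate;
  }
  return null;
}
```

The expected cost is one hash lookup per position of S, independent of the number of filters.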
This gives us an expected complexity of O(n): we look up at most n − m + 1 substrings, and each hash lookup takes constant time on average. Now that we got the theoretical complexity right, we need to worry about the practical execution time. Substring extraction isn't exactly well-known as the most efficient operation, and consequently Adblock Plus has to switch to the trivial algorithm whenever we have fewer than 100 filters — below that, the trivial algorithm simply runs faster. Maybe we should find a way to avoid working with substrings?
My next candidate was the Rabin-Karp algorithm — yet in my tests it turned out to be slower than the hash-table approach above. The explanation is a rather interesting implementation detail: String.charCodeAt() requires almost the same time as String.substring(). Once we notice that the Rabin-Karp algorithm calls String.charCodeAt() twice for every character in the string to compute its hash, whereas the algorithm above calls String.substring() roughly once per character, the reason for the strange result becomes clear.
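The call pattern can be made concrete with a sketch. For simplicity this uses an additive sliding-window hash rather than Rabin-Karp's polynomial rolling hash, but the character accesses are the same: apart from the first window, every step must both add the incoming character and remove the outgoing one, so each character is fetched twice (this counter function is purely illustrative):

```javascript
// Count the String.charCodeAt() calls a rolling hash over windows of
// length m makes. A deliberately simplified additive hash stands in
// for Rabin-Karp's polynomial hash; the access pattern is identical.
function countCharCodeCalls(S, m) {
  let calls = 0, hash = 0;
  for (let i = 0; i < m; i++) {          // hash of the first window
    hash += S.charCodeAt(i);
    calls++;
  }
  for (let i = m; i < S.length; i++) {   // roll the window forward
    hash += S.charCodeAt(i) - S.charCodeAt(i - m);
    calls += 2;                          // one add, one remove
  }
  return calls;
}
```

That is m + 2(n − m) calls, close to 2n for long strings — versus roughly n − m + 1 String.substring() calls in the hash-table approach.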
Since we cannot make the individual calls to String.substring() cheaper, we need to focus on reducing their number. We need an algorithm that skips several characters before testing again — like the Boyer-Moore algorithm. Now, Boyer-Moore is a single-pattern search algorithm that also works on the character level. I thought about ways to adapt it and found that instead of building a jump table based on characters, we can build a jump table J for pattern substrings of length m/2 (assuming m is an even number):
function initializeHashTable3(H, J, C, m)
  for each c in C
    H(c) := filter(c)
    for i in (0 .. m/2)
      substring := c.substring(i, m/2)
      jump := m/2 - i
      if substring not in J or J(substring) > jump
        J(substring) := jump
      end if
    end for
  end for
end function

function getMatchingFilter3(S, H, J, m)
  endPtr := m
  while endPtr <= n
    substring := S.substring(endPtr - m/2, m/2)
    if substring in J and J(substring) > 0
      endPtr := endPtr + J(substring)
    else if substring in J and J(substring) = 0
      substring := S.substring(endPtr - m, m)
      if (substring in H and H(substring) matches S)
        return H(substring)
      else
        endPtr := endPtr + 1
      end if
    else
      endPtr := endPtr + m/2 + 1
    end if
  end while
  return null
end function
We move through the string the same way the Boyer-Moore algorithm does. First we test a substring of length m/2: if the jump table tells us that some pattern contains this substring, we adjust our current position accordingly. If the jump table tells us that no adjustment is needed, the substring is potentially the ending of a pattern, and we have to check whether that is the case and whether the corresponding filter matches the string. If this test fails we can only advance our position by 1 (a fallback table might help, but so far it doesn't look like this situation happens all too often). And finally, if the substring isn't in the jump table at all, we can jump by m/2 + 1 characters — we know that no pattern contains this particular substring.
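The algorithm above can be sketched in JavaScript as follows — m is assumed even, and the `shortcut`/`regexp` filter properties are again illustrative:

```javascript
// Build the shortcut table H and the jump table J. J maps every
// length-(m/2) substring of each shortcut to the smallest distance
// from that substring's end to the shortcut's end (0 = suffix).
function initializeTables3(C, m) {
  const H = new Map(), J = new Map();
  const half = m / 2;
  for (const filter of C) {
    H.set(filter.shortcut, filter);
    for (let i = 0; i <= half; i++) {
      const sub = filter.shortcut.substring(i, i + half);
      const jump = half - i;
      if (!J.has(sub) || J.get(sub) > jump)
        J.set(sub, jump);
    }
  }
  return {H, J};
}

function getMatchingFilter3(S, H, J, m) {
  const half = m / 2;
  let endPtr = m;
  while (endPtr <= S.length) {
    const sub = S.substring(endPtr - half, endPtr);
    if (!J.has(sub)) {
      endPtr += half + 1;       // no pattern contains this substring
    } else if (J.get(sub) > 0) {
      endPtr += J.get(sub);     // align with a potential pattern ending
    } else {
      // Jump of 0: the last m characters may be a complete shortcut.
      const candidate = H.get(S.substring(endPtr - m, endPtr));
      if (candidate && candidate.regexp.test(S))
        return candidate;
      endPtr += 1;              // shortcut test failed, move on slowly
    }
  }
  return null;
}
```

With the shortcut "/banner/" and m = 8, the matcher skips in steps of up to five characters and only extracts a full-length substring when the jump table reports a possible suffix.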
The best-case complexity here is O(n/m), the worst case O(n) — and actually a factor of two slower than the previous algorithm in that case, since it calls String.substring() at most twice for every position. The tests indicate that the actual execution time is pretty close to the best case but gets slightly worse as the jump table fills up. Here are the test results for all three algorithms, with two different filter lists but the same set of addresses:
Note: I cheated a little here. What is called “Adblock Plus 0.7” in these graphs isn’t really the implementation used in Adblock Plus 0.7 — I changed an implementation detail which gave it a ~30% speedup. Yet the new algorithm is still considerably faster.
Well, it looks like I will use this algorithm in Adblock Plus 0.7.2 unless something better comes up. It already breaks even with the trivial algorithm at 15-20 filters, meaning that I can stop switching to the trivial algorithm when we have too few filters (the 10 ms difference is not worth it). We get a 20-30% increase in initialization time, which is O(f ⋅ m) now; this should be tolerable. We need additional memory for the jump table, but it seems to contain on average only 1.5 entries per filter, which won't consume a noteworthy amount of memory. And, to round up, here is how we perform with really many filters: