Recognizing third-party content · 2007-01-31 16:07 by Wladimir Palant
I have done all the preparation work, so now I can finally implement the $third-party filter option, which restricts a filter to third-party or same-party content. This would be used for filters like
*/banners/*$third-party — if some webmaster is crazy enough to call the directory with site logos “banners” those still won’t be blocked. This filter will only block something coming from the directory “banners” on a different server.
That’s the theory at least. However, recognizing what is third-party and what isn’t turned out to be a difficult task, and efficiency concerns (the third-party check will have to be done for every address) don’t make it easier. Usually something is considered third-party if it comes from a different second-level domain, e.g. bugzilla.mozilla.org and addons.mozilla.org are same-party while mozilla.org and adblockplus.org are not. For recognizing the second-level domain part, Firefox (Gecko) usually follows the one dot rule: the second-level domain is the ending of the server name that contains at most one dot. Unfortunately, this rule treats “co.uk” as a second-level domain, so all sites under co.uk would count as the same party.
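To make the one dot rule concrete, here is a minimal sketch in Python (used purely for illustration, this is not the Gecko or Adblock Plus code). The function names `one_dot_domain` and `is_third_party` are made up for this example, and IP addresses are not handled.

```python
def one_dot_domain(host: str) -> str:
    """Return the ending of the server name that contains at most one dot."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def is_third_party(request_host: str, document_host: str) -> bool:
    """Two hosts are third-party to each other if their endings differ."""
    return one_dot_domain(request_host) != one_dot_domain(document_host)


print(is_third_party("bugzilla.mozilla.org", "addons.mozilla.org"))  # False
print(is_third_party("mozilla.org", "adblockplus.org"))              # True
# The weakness of the rule: both hosts reduce to "co.uk",
# so two unrelated sites are treated as the same party.
print(is_third_party("ads.example.co.uk", "news.another.co.uk"))     # False
```

The last call shows the problem: the check answers "same party" for two sites that merely share the “co.uk” ending.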
Now Gecko 1.9 has a new mechanism for recognizing top-level domains: the Effective TLD Service. Its database of top-level domains isn’t complete yet but it is already good enough. So one could use it to find the top-level domain and then continue to the next dot before it, which marks where the second-level domain begins. Of course this would only work in Firefox 3 and other browsers based on Gecko 1.9. In older browsers the one dot rule will have to do.
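The idea can be sketched as follows, again in Python and only as an illustration. The set `EFFECTIVE_TLDS` is a tiny hypothetical stand-in for the service's database, and `base_domain` is an invented name that does not mirror the real Gecko interface. When no known top-level domain matches, the sketch falls back to the one dot rule.

```python
# Hypothetical, deliberately tiny stand-in for the Effective TLD database.
EFFECTIVE_TLDS = {"com", "org", "uk", "co.uk"}


def base_domain(host: str, effective_tlds=EFFECTIVE_TLDS) -> str:
    """Return the effective TLD plus one more label to its left."""
    labels = host.lower().rstrip(".").split(".")
    # Try the longest candidate ending first so "co.uk" wins over "uk".
    for i in range(len(labels)):
        if ".".join(labels[i:]) in effective_tlds:
            return ".".join(labels[max(i - 1, 0):])
    # Unknown TLD: fall back to the one dot rule.
    return ".".join(labels[-2:])


def is_third_party(request_host: str, document_host: str) -> bool:
    return base_domain(request_host) != base_domain(document_host)


print(base_domain("bugzilla.mozilla.org"))                         # mozilla.org
print(base_domain("ads.example.co.uk"))                            # example.co.uk
print(is_third_party("ads.example.co.uk", "news.another.co.uk"))   # True
```

With “co.uk” known as an effective top-level domain, the two unrelated sites are now correctly recognized as different parties.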
Yet there are more issues. For reasons I don’t know, the Effective TLD Service requires the server name to be encoded in UTF-8, whereas Adblock Plus has it in UTF-16. So to use the service properly the server name would need to be converted into UTF-8, a fun way to waste CPU time. One can skip the conversion of course, but that might cause wrong results with some international domain names (fortunately there are no international TLDs yet). Finally, the third alternative would be to look for non-ASCII characters in the server name and fall back to the one dot rule if any are found. Right now I am a little undecided about which solution I should choose.
Update: after looking into this some more, this last issue isn’t as critical as I first thought. However, bug 368989 is a showstopper at the moment.
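The third alternative mentioned above (scan for non-ASCII characters, then decide) could look roughly like this. It is a Python sketch with invented names (`has_non_ascii`, `pick_lookup`) and only returns a label for the chosen strategy rather than calling any real service.

```python
def has_non_ascii(host: str) -> bool:
    """True if any character lies outside the 7-bit ASCII range."""
    return any(ord(ch) > 0x7F for ch in host)


def pick_lookup(host: str) -> str:
    """Choose a strategy without converting the name's encoding."""
    return "one-dot-rule" if has_non_ascii(host) else "effective-tld"


print(pick_lookup("addons.mozilla.org"))       # effective-tld
print(pick_lookup("b\u00fccher.example"))      # one-dot-rule

# For pure ASCII names the ASCII and UTF-8 byte sequences are identical,
# which is why such names need no real re-encoding.
name = "addons.mozilla.org"
print(name.encode("utf-8") == name.encode("ascii"))  # True
```

The scan is a single pass over the server name, so it should be cheaper than a full conversion for the common case of plain ASCII hosts.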