Suppose there is a text file containing a list of URLs (millions of them), and another text file containing a list of blacklisted words. What is an algorithm to analyze the URLs against the blacklisted words and produce the list of blacklisted URLs?
I plan to do the following processing on the URL list:
- Parse the URLs and store them in some data structure (DS).
- Process the URLs and blacklist those URLs which contain at least one of the
blacklisted words.
- If there exists a URL containing 50% or more blacklisted words, add the other
words of that URL to the list of blacklisted words.
- Since the blacklisted-words list has now been modified, it is possible
that URLs which were not blacklisted earlier can become blacklisted now. So
the algorithm should handle this case as well and mark the earlier whitelisted
URLs as blacklisted if they contain any of the newly added blacklisted words.
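The steps above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the tokenization rule (splitting on non-alphanumeric characters) and all function names are assumptions, not part of the question. A hash set gives O(1) word lookups, and iterating the 50% rule to a fixed point automatically handles the re-check described in the last step.

```python
import re

def tokenize(url):
    """Split a URL into lowercase words on non-alphanumeric boundaries
    (an assumed definition of a URL's 'words')."""
    return [w for w in re.split(r"[^a-z0-9]+", url.lower()) if w]

def classify(urls, blacklist_words):
    """Return (blacklisted_urls, final_blacklist_words).

    First grows the blacklist to a fixed point using the 50% rule, then
    classifies every URL against the final word set, so URLs that looked
    clean in an early pass are re-checked automatically.
    """
    blacklist = set(blacklist_words)
    changed = True
    while changed:
        changed = False
        for url in urls:
            words = tokenize(url)
            hits = sum(1 for w in words if w in blacklist)
            # 50%-or-more rule: absorb the URL's remaining words.
            if hits and 2 * hits >= len(words):
                new_words = set(words) - blacklist
                if new_words:
                    blacklist |= new_words
                    changed = True
    # A URL is blacklisted if it contains at least one blacklisted word.
    blacklisted = {u for u in urls
                   if any(w in blacklist for w in tokenize(u))}
    return blacklisted, blacklist
```

Note that in the worst case the fixed-point loop can make many passes over the URL list, so for millions of URLs an inverted index from word to the URLs containing it would avoid rescanning everything on each new word.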
In the end, the remaining URLs should be whitelisted.
Any suggestions on the best algorithms and data structures that could be used to implement this with the most efficient time and space complexity?
I strongly advise against "If there exists a URL containing 50% or more blacklisted words, add the other words of that URL to the list of blacklisted words." You will most likely end up banning words like "a", "that", "the", and end up with an empty set as your "whitelisted" URLs. – amit
Be careful with this approach. Suppose you have a site "http://www.theblacklistedwordblog.com". After one pass, the words "blog" and "the" would also be blacklisted. I hope you don't intend to be that restrictive. – Erik
How do you define the words of a URL? –
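One possible answer to that last question, sketched under the assumption that a URL's "words" are the maximal alphanumeric runs in its hostname and path (the query string and fragment are ignored here; this is only one of several reasonable definitions):

```python
import re
from urllib.parse import urlparse

def url_words(url):
    """Extract 'words' from a URL: lowercase alphanumeric runs taken
    from the hostname and path components only."""
    parsed = urlparse(url)
    text = f"{parsed.hostname or ''} {parsed.path}"
    return [w for w in re.split(r"[^a-z0-9]+", text.lower()) if w]

print(url_words("http://www.theblacklistedwordblog.com/some-page"))
```

Note that under this definition "theblacklistedwordblog" is a single word, so substrings like "the" or "blog" inside it would not match unless the matching step deliberately checks substrings.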