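Filter a list of URLs against a list of domains using Bash
I have a list of articles and want to filter it so that the new list contains only articles from a given set of domains.
Right now I have a list of articles (~500) and a list of domains (~3,000).
How do I remove the articles whose domain is not in my domain list?
Both are text files. How would I do this in Bash?
I have a feeling that you would have to take the list of articles, extract their domains, put both in an array, and then compare each domain in the array with the domains in the list. If they match, keep the article; if not, remove it and move on to the next one.
Here is what I have so far: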
readarray a < ./articles
# I know "${b[@]}" is incorrect, but I don't know how to write what I'm trying to do.
awk -F/ '{print $3}' "${a[@]}" > "${b[@]}"
echo "${b[@]}"
# I'm lost after this
Here is the input:
articles.txt:
http://www.cbsnews.com/videos/white-house-knows-options-are-limited-in-ukraine/&ct=ga&cd=CAIyAA&usg=AFQjCNFeY2uVQrvvDAMHeT-0nK2ZLNH7-g
http://www.huffingtonpost.com/2014/03/01/ukraine-russia-crimea_n_4879935.html&ct=ga&cd=CAIyAA&usg=AFQjCNFH7GY3B6swce3qiK49xGt-CwDvMA
http://www.nybooks.com/blogs/nyrblog/2014/mar/01/ukraine-haze-propaganda/&ct=ga&cd=CAIyAA&usg=AFQjCNFCcWadUJiAzaxg3OSO67gVIPVxww
http://ktla.com/2014/03/01/russian-upper-house-approves-use-of-military-force-in-ukraine-as-protests-continue/&ct=ga&cd=CAIyAA&usg=AFQjCNGTkxvvAo1zSYLlA5ET54OcBsS-PA
http://deadlinelive.info/2014/03/01/you-quit-falling-for-the-war-on-terror-ukraine-coup-spawns-cold-war-redux-2014/&ct=ga&cd=CAIyAA&usg=AFQjCNE3Fa_h7xoESBkcOzXVZCQnfBfxNA
http://www.ctvnews.ca/world/russian-parliament-oks-putin-s-request-to-use-military-force-in-ukraine-1.1709506&ct=ga&cd=CAIyAA&usg=AFQjCNGnGeo4LWoLF5Qbq2UvL58ymlNFkA
http://www.vanguardngr.com/2014/03/un-security-council-hold-emergency-talks-ukraine/&ct=ga&cd=CAIyAA&usg=AFQjCNFN7YRo037au4RfxSQoeVUCcL9hhA
http://www.reddit.com/r/AdviceAnimals/comments/1z82rt/russian_troops_cross_the_border_in_ukraine/&ct=ga&cd=CAIyAA&usg=AFQjCNFHkmelnoRy2TCW-eYDpIt_t-N1iA
http://criticallegalthinking.com/2014/03/01/knot-politics-thoughts-ukraine-protest/&ct=ga&cd=CAIyAA&usg=AFQjCNFLMuZzbuvzpLf7a9U8MtbhCE5lJQ
http://nypost.com/2014/03/01/russia-parliament-approves-military-action-in-ukraine/&ct=ga&cd=CAIyAA&usg=AFQjCNFpdyelZDEMUk39LmfC1tTDcQ6_FA
domains.txt:
cbsnews.com
huffingtonpost.com
Could you add some representative input for both the articles and the domains to your question? – n0741337
Added both text files – MoneyBag
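One possible way to do what the question describes, as a minimal sketch rather than a definitive answer: let awk load the domain list into a lookup table and then print only the article lines whose host is in it. The function name `filter_articles` is made up for illustration. The sketch assumes every URL has the form `http://host/path` (so the host is the third `/`-separated field), that the domain file has one bare domain per line, and that a leading `www.` should be ignored when matching. Other subdomains are not handled.

```shell
#!/usr/bin/env bash
# filter_articles is a hypothetical helper name, not something from the question.
# Usage: filter_articles DOMAINS_FILE ARTICLES_FILE
# Prints the lines of ARTICLES_FILE whose host appears in DOMAINS_FILE.
filter_articles() {
    local domains_file=$1 articles_file=$2
    awk -F/ '
        # While reading the first file (the domain list), record each domain.
        NR == FNR { if ($0 != "") allowed[$0] = 1; next }
        # While reading the second file, field 3 of "http://host/path" is the host.
        {
            host = $3
            sub(/^www\./, "", host)   # assumption: treat www.example.com as example.com
            if (host in allowed) print
        }
    ' "$domains_file" "$articles_file"
}
```

With the files from the question this could be run as `filter_articles domains.txt articles.txt > filtered.txt`. A single awk pass avoids looping over 500 x 3,000 comparisons in Bash arrays, since the `in` test is a hash lookup.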