Python regex doesn't match all strings
So I'm trying to match currency strings in Amazon item pages using Python and regular expressions.
My current code, as it stands:
import csv
import requests as rq
import re
import lxml
from bs4 import BeautifulSoup as bs

i = 0
urls = csv.reader(open('/Users/Fuck/Documents/Amazon/HTML_Parsetest/urls.csv'))
for url in urls:
    r = rq.get(url[0], stream=True)
    for chunk in r.iter_content(chunk_size=2048):
        if chunk:
            data = chunk
            soup = bs(data, "lxml")
            elem = soup.find_all('td', attrs={'class': 'a-text-right dp-used-col'})
            print(elem)
            if elem != []:
                i = i + 1
                s = re.findall('(\£\d+\.\d+)+', str(elem[0]))
                print(i, "Price:", s[0].split()[0])
For the first URL, this currently prints:
[<td class="a-text-right dp-used-col">
<a class="a-link-normal" href="/gp/offer-listing/019859660X/ref=tmm_hrd_used_olp_0?ie=UTF8&condition=used&qid=&sr=">
<span>£51.70</span>
</a>
</td>]
1 Price: £51.70
[<td class="a-text-right dp-used-col">
<a class="a-link-normal" href="/gp/offer-listing/0198596790/ref=tmm_pap_used_olp_sr?ie=UTF8&condition=used&qid=&sr=">
<span>£35.15</span>
</a>
</td>]
2 Price: £35.15
For the second URL, it prints:
[<td class="a-text-right dp-used-col">
<a class="a-link-normal" href="/gp/offer-listing/0521254167/ref=tmm_hrd_used_olp_0?ie=UTF8&condition=used&qid=&sr=">
<span>£355.37</span>
</a>
</td>, <td class="a-text-right dp-used-col">
<a class="a-link-normal" href="/gp/offer-listing/0521274249/ref=tmm_pap_used_olp_sr?ie=UTF8&condition=used&qid=&sr=">
<span>£29.93</span>
</a>
</td>]
3 Price: £355.37
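On the second URL, it finds both td blocks together in a single result, whereas on the first URL it finds them as separate blocks, and I don't know why. So it seems my regex only finds one instance of the string in each block.
How can I find both strings, £355.37 and £29.93, on the second URL?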
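A minimal sketch of one possible approach, not a confirmed fix: the code in the question only passes `elem[0]` to `re.findall`, so any later cell in the same result list is never searched. The helper `extract_prices` and the sample `cells` list below are hypothetical names introduced for illustration; the cells are trimmed stand-ins for `str(td)` of the two cells printed for the second URL.

```python
import re

# A pound sign followed by digits, a dot, and more digits.
# There is no capturing group, so findall returns each whole match.
PRICE_RE = re.compile(r'£\d+\.\d+')


def extract_prices(fragments):
    """Return every price found in each HTML fragment, in order."""
    prices = []
    for fragment in fragments:
        prices.extend(PRICE_RE.findall(fragment))
    return prices


# Hypothetical stand-ins for str(td) of the two cells from the second URL.
cells = [
    '<td class="a-text-right dp-used-col"><span>£355.37</span></td>',
    '<td class="a-text-right dp-used-col"><span>£29.93</span></td>',
]

print(extract_prices(cells))  # ['£355.37', '£29.93']
```

In the question's loop this would correspond to calling `extract_prices(str(td) for td in elem)` instead of searching only `elem[0]`. The difference between the two URLs is probably a side effect of parsing each 2048-byte chunk on its own: on the first page the two cells likely landed in different chunks, while on the second they fell into the same one. Parsing the complete response body once would avoid that, and would also avoid missing a price that straddles a chunk boundary.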
I find [online regex testers](https://regex101.com/) are usually helpful – miraculixx
@miraculixx The regex seems to be fine. – taleinat
Are the prices always in '£'? –