2017-04-15 30 views
0

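Given a list containing several junk links, how can I extract all the links that end in .pdf? I have a pandas DataFrame column with several links in each cell: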

Name|COL 
San Diego|'https://foo.com/energy_docs/tyv/2004/019787_S30_gasTOC.cfm https://foo.com/energy_docs/tyv/99/19787s022_gas.pdf https://foo.com/energy_docs/tyv/2000/19787s021_gas.pdf https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf https://foo.com/energy_docs/tyv/99/293-_9302SDFS 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/99/19787-s018_gas.pdf https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf https://foo.com/energy_docs/tyv/98/019787-S16_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/97/019787-S15_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/97/019787-S14_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/97/19787-S013_gas.pdf https://foo.com/energy_docs/tyv/96/019787-S12_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/96/019787-S11_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/96/019787-S10_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/pre96/019787-S9_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/pre96/019787-S8_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/96/19-787s007_Amlodipine.cfm https://foo.com/energy_docs/tyv/pre96/019787-S6_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/pre96/019787-S5_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/pre96/019787-S4_gas GAS_TPC.cfm https://foo.com/energy_docs/tyv/pre96/019787-S3_gas_toc.cfm https://foo.com/energy_docs/tyv/pre96/019787-S2_gas GAS_TPC.cfm' 
Washington|'https://foo.com/energy_docs/a32/2007/022136.cfm' 
Texas|'https://foo.com/energy/29380/no_ant/USA/2/2007.pdf' 

How can I extract all the links that end in .pdf, in the following form:

Name|COL 
San Diego|https://foo.com/energy_docs/tyv/99/19787s022_gas.pdf 
San Diego|https://foo.com/energy_docs/tyv/2000/19787s021_gas.pdf 
San Diego|https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf 
San Diego|https://foo.com/energy_docs/tyv/99/19787-s018_gas.pdf 
San Diego|https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf 
San Diego|https://foo.com/energy_docs/tyv/97/19787-S013_gas.pdf 
Washington|NaN 
Texas|https://foo.com/energy/29380/no_ant/USA/2/2007.pdf 

I tried:

import re

def url_extractor(row):
    url = str(row)
    # raw string so that \s and \. reach the regex engine unchanged
    r = re.compile(r'(http[^\s]+\.pdf)')
    urls = r.findall(url)
    if len(urls) == 0:
        return 'NaN'
    else:
        return ' '.join(urls)

In:

df4['COL'] = df4['COL'].apply(url_extractor) 
df4 

Out:

Name COL 
0 San Diego https://foo.com/energy_docs/tyv/99/19787s022_g... 
1 Washington NaN 
2 Texas https://foo.com/energy/29380/no_ant/USA/2/2007... 

But I don't know how to stack/split the rows so that each row holds a single link/URL. For example, let's look at the first row:

In:

df4['COL'][0] 

Out:

'https://foo.com/energy_docs/tyv/99/19787s022_gas.pdf https://foo.com/energy_docs/tyv/2000/19787s021_gas.pdf https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf https://foo.com/energy_docs/tyv/99/19787-s018_gas.pdf https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf https://foo.com/energy_docs/tyv/97/19787-S013_gas.pdf'

Each link should be "mapped" to its name, San Diego.

+1

Is this your actual data? If so, why is '(?<=href=").*?(?=")' the regex you tried? It is miles away from working. – Vallentin

+0

Oops, sorry... I was trying several things... I've updated the question... @Vallentin –

Answers

1

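If the data is already loaded into a pandas DataFrame, you can use pandas' built-in string methods to split each COL string into a list, keep only the elements you want from each list, convert the column of lists into long format, and then merge the result with the original DataFrame: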

import pandas as pd

# split COL on whitespace and keep only the strings that end with '.pdf'
COL_series = df.COL.str.split().apply(lambda x: [y for y in x if y.endswith('.pdf')])
# create a long-format series from the lists (one link per row, original index kept)
COL_series = COL_series.apply(pd.Series).stack().reset_index(level=1, drop=True)
COL_series.name = 'COL'

# merge with df; the outer join keeps rows that have no .pdf link (COL becomes NaN)
pd.merge(df.Name.reset_index(),
         COL_series.reset_index(),
         how='outer',
         on='index').drop('index', axis=1)

# returns:
   Name        COL
0  San Diego   https://foo.com/energy_docs/tyv/99/19787s022_gas.pdf
1  San Diego   https://foo.com/energy_docs/tyv/2000/19787s021_gas.pdf
2  San Diego   https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf
3  San Diego   https://foo.com/energy_docs/tyv/99/19787-s018_gas.pdf
4  San Diego   https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf
5  San Diego   https://foo.com/energy_docs/tyv/97/19787-S013_gas.pdf
6  Washington  NaN
7  Texas       https://foo.com/energy/29380/no_ant/USA/2/2007.pdf
+0

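Thanks for the help. However, my DataFrame has other columns (3 more), and when I apply this it drops them... How can I do this without dropping the other 3 columns? I only showed two columns in the question for space/readability reasons. I tried removing Name in pd.merge(), but that added COL_x and COL_y –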
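
On keeping the other columns: one possible approach, sketched here under assumptions rather than taken from the answer above, is to skip the merge entirely and use `Series.str.findall` together with `DataFrame.explode` (available in pandas 0.25 and later). The sample frame, the shortened URLs, and the `Extra` column are made up for illustration; `Extra` stands in for the additional columns mentioned in the comment.

```python
import pandas as pd

# Hypothetical frame: 'Extra' is a placeholder for the other columns,
# and the URLs are shortened stand-ins for the real data.
df = pd.DataFrame({
    'Name': ['San Diego', 'Washington', 'Texas'],
    'Extra': [1, 2, 3],
    'COL': [
        'https://foo.com/a.cfm https://foo.com/b.pdf https://foo.com/c.pdf',
        'https://foo.com/energy_docs/a32/2007/022136.cfm',
        'https://foo.com/energy/29380/no_ant/USA/2/2007.pdf',
    ],
})

# One list of .pdf links per row; rows without a match get an empty list.
links = df['COL'].str.findall(r'http\S+\.pdf')

# explode() emits one row per list element and turns an empty list into NaN,
# repeating every other column of that row unchanged.
result = df.assign(COL=links).explode('COL').reset_index(drop=True)
print(result)
```

Because `explode` works on the whole frame, nothing has to be listed in a merge, so no `COL_x`/`COL_y` suffixes appear.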

2

Instead of [^<] you should use [^\s] or, shorter, \S. Then append \.pdf:

(http\S+\.pdf) 


Edit:

Yes, you can also use word boundaries if you want.

(\bhttp.*?\.pdf\b) 
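
One thing that may be worth checking before choosing between the two patterns: `.` also matches spaces, so the lazy `.*?` variant can start at a junk link and run on to the next `.pdf`. A minimal sketch with made-up sample text (the shortened URLs are assumptions, not the real data):

```python
import re

# Made-up sample text: one junk link followed by two .pdf links.
text = "https://foo.com/a.cfm https://foo.com/b.pdf https://foo.com/c.pdf"

# \S+ cannot cross whitespace, so each match stays inside a single link.
non_space = re.findall(r'http\S+\.pdf', text)

# .*? can cross whitespace, so the first match starts at the junk link
# and runs up to the first ".pdf" it finds.
lazy = re.findall(r'\bhttp.*?\.pdf\b', text)

print(non_space)  # ['https://foo.com/b.pdf', 'https://foo.com/c.pdf']
print(lazy)       # ['https://foo.com/a.cfm https://foo.com/b.pdf', 'https://foo.com/c.pdf']
```

For data like the sample in the question, where junk links precede the .pdf links, the `\S+` form looks like the safer choice.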


+0

Thanks, I thought I could do it with '\b'... Any idea how to split/stack the rows so that each row is left with only one link?... –

+1

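Yes, you can also use '\b' (I updated the answer and added an example). How would the splitting work? If there are several links, how would you decide which one to keep? – Vallentin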

+0

OK, thanks!... Check my update! –
