Python 3.6.3 urlopen strips the server name from the URI for HTML files stored on a remote server

I need to parse hundreds of HTML files that are archived on a server. The files are accessed via UNC paths, and I use pathlib's as_uri() method to convert each UNC path to a URI.

Example below.

Full UNC path: \\dmsupportfs\~images\sandbox\test.html

from urllib.request import urlopen 
from bs4 import BeautifulSoup 
import os, pathlib 

source_path = os.path.normpath('//dmsupportfs/~images/sandbox/') + os.sep 
filename = 'test.html' 

full_path = source_path + filename 
url = pathlib.Path(full_path).as_uri() 
print('URL -> ' + url) 
url_html = urlopen(url).read()

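So the URI (URL) I pass to urlopen is: file://dmsupportfs/%7Eimages/sandbox/test.html

I can paste this into any web browser and the page loads. However, when urlopen goes to read the page, it ignores/strips the server name (dmsupportfs) from the URI, so the read fails because the file cannot be found. I assume this has to do with how urlopen handles the URI, but I'm stumped (probably something quick and easy to fix... sorry, I'm somewhat new to Python). If I map the UNC location to a drive letter and use the mapped drive letter instead of the UNC path, this works without any problem. I'd like to not have to rely on a mapped drive for this. Any suggestions?

Here is the error output from the code above: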

Traceback (most recent call last): 
    File "C:\Anaconda3\lib\urllib\request.py", line 1474, in open_local_file 
    stats = os.stat(localfile) 
FileNotFoundError: [WinError 3] The system cannot find the path specified: '\\~images\\sandbox\\test.html' 

During handling of the above exception, another exception occurred: 

Traceback (most recent call last): 
    File "url_test.py", line 10, in <module> 
    url_html = urlopen(url).read() 
    File "C:\Anaconda3\lib\urllib\request.py", line 223, in urlopen 
    return opener.open(url, data, timeout) 
    File "C:\Anaconda3\lib\urllib\request.py", line 526, in open 
    response = self._open(req, data) 
    File "C:\Anaconda3\lib\urllib\request.py", line 544, in _open 
    '_open', req) 
    File "C:\Anaconda3\lib\urllib\request.py", line 504, in _call_chain 
    result = func(*args) 
    File "C:\Anaconda3\lib\urllib\request.py", line 1452, in file_open 
    return self.open_local_file(req) 
    File "C:\Anaconda3\lib\urllib\request.py", line 1491, in open_local_file 
    raise URLError(exp) 
urllib.error.URLError: <urlopen error [WinError 3] The system cannot find the path specified: '\\~images\\sandbox\\test.html'>
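
The path in the error message is missing the server name. As a minimal sketch of why that might be, the snippet below builds a urllib.request.Request from the same URI and prints how it is split: the server name ends up in host, while selector holds only the path portion, which matches the path that appears in the error. The variable names are illustrative only.

```python
from urllib.request import Request

# The URI from the question, as produced by pathlib's as_uri().
uri = "file://dmsupportfs/%7Eimages/sandbox/test.html"

req = Request(uri)

# urllib keeps the server name and the path in separate attributes.
host = req.host          # the server part of the URI
selector = req.selector  # everything after the server part

print("host     ->", host)
print("selector ->", selector)
```

Only the selector appears in the failing path, which is consistent with the traceback above.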

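UPDATE: So, digging through the traceback above and the actual methods, I found this, which pretty much tells me that what I'm trying to do with a file:// URI will not work for a remote server.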

def file_open(self, req):
    url = req.selector
    if url[:2] == '//' and url[2:3] != '/' and (req.host and
            req.host != 'localhost'):
        if not req.host in self.get_names():
            raise URLError("file:// scheme is supported only on localhost")

Any ideas, then, on how to get this working without mapping a drive?
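
One possible approach, sketched under assumptions: if a file:// URI already exists, convert it back to a UNC path and open that path directly instead of going through urlopen. The helper name file_uri_to_unc is hypothetical and not part of any library; it only uses urlsplit and unquote from urllib.parse.

```python
from urllib.parse import urlsplit, unquote


def file_uri_to_unc(uri):
    # Hypothetical helper: rebuild a UNC path from a file:// URI
    # that carries a server name.
    parts = urlsplit(uri)
    if parts.scheme != "file" or not parts.netloc:
        raise ValueError("expected a file:// URI with a server name: %r" % uri)
    # Decode percent-escapes (e.g. %7E -> ~) and switch to backslashes.
    path = unquote(parts.path).replace("/", "\\")
    return "\\\\" + parts.netloc + path


unc = file_uri_to_unc("file://dmsupportfs/%7Eimages/sandbox/test.html")
print(unc)
```

On Windows, the resulting string can be handed to open() like any other path, so no mapped drive is involved.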


2017-12-27 Lava Viperidae

So I replaced this:

url = pathlib.Path(full_path).as_uri()  
url_html = urlopen(url).read()

with this:

with open(full_path) as url_html:

and was able to pass it to BeautifulSoup and parse as needed...
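
Since the question mentions hundreds of files, here is a minimal sketch of how the open() approach could be extended to a whole folder. The function name read_html_files, the "*.html" pattern, and the UTF-8 default are assumptions for illustration; the real files may need a different encoding.

```python
from pathlib import Path


def read_html_files(folder, encoding="utf-8"):
    """Yield (path, html_text) for every .html file directly inside folder."""
    for path in sorted(Path(folder).glob("*.html")):
        # errors="replace" keeps one badly encoded file from stopping the run.
        with open(path, encoding=encoding, errors="replace") as handle:
            yield path, handle.read()


# Possible usage with the UNC folder from the question (not executed here):
# for path, html in read_html_files(r"\\dmsupportfs\~images\sandbox"):
#     soup = BeautifulSoup(html, "html.parser")
```

Each yielded string can be passed to BeautifulSoup in the same way as the single file above.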


2017-12-27 21:12:23
