Find all instances of a substring in a string

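In my last question I asked about parsing the links out of an HTML page. Since I haven't found a solution yet, I thought I'd try something else in the meantime: search for every <a href= and copy everything until I hit </a>.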

Now, my C is a bit rusty, but I remember that I can use strstr() to get the first instance of that string. How do I get the rest?

Any help is appreciated.

PS: No, this is not school homework or anything like that. Just so you know.

2011-03-02 Mr Aleph

Bad, bad idea, doomed to fail. What happens when you hit a '' tag? Use an XML parser. – meagar 2011-03-02 15:24:23

Thanks. I know it's a bad idea, but I haven't found an XML parser that isn't uber complicated and that comes with a good example of how to do this. If you know of one (plus example code), please do send it my way. – 2011-03-02 15:35:50

You can use a loop:

char *ptr = haystack;
size_t nlen = strlen(needle);

while (ptr != NULL) {
    ptr = strstr(ptr, needle);
    if (ptr != NULL) {
        // do whatever with ptr
        ptr += nlen; // hat tip to @larsman
    }
}

2011-03-02 15:22:55 chrisaycock

Loops infinitely if 'needle' is found at least once. You have to move past the match in every iteration. Also, you have to check for 'NULL' *after* 'strstr'. – 2011-03-02 15:25:12

@larsman Ah, thanks. Corrected. – chrisaycock 2011-03-02 15:30:15

Given the OP's pattern, I'd do 'ptr += strlen(needle)' (or better, 'size_t nlen = strlen(needle)' before the loop). – 2011-03-02 15:31:13

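A C string is just a pointer to its first character; to get the next match, simply call strstr() again, passing a pointer to the end of the previous match.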
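
As a minimal sketch of that idea, the function below records the offset of every non-overlapping match. The name find_all and its output-array interface are assumptions made for illustration; only strstr() and strlen() come from the standard library.

```c
#include <stddef.h>
#include <string.h>

/* Stores the offset of each non-overlapping occurrence of needle in
 * haystack into out[0..max-1] and returns the total number of matches
 * found (which may exceed max). find_all is a hypothetical helper name. */
size_t find_all(const char *haystack, const char *needle,
                size_t *out, size_t max)
{
    size_t nlen = strlen(needle);
    size_t count = 0;
    const char *p = haystack;

    if (nlen == 0)
        return 0;               /* an empty needle would never advance */

    while ((p = strstr(p, needle)) != NULL) {
        if (count < max)
            out[count] = (size_t)(p - haystack);
        count++;
        p += nlen;              /* resume just past the previous match */
    }
    return count;
}
```

Advancing by nlen skips overlapping matches; advance by 1 instead if overlaps (such as "aa" inside "aaa") should also be reported.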

2011-03-02 15:21:11 Arkku

Why not use libxml, which has a very good HTML parser built in?

2011-03-02 15:22:37

I'm trying not to use external libs, especially if they are GPL, but I did already check that lib. However, I cannot find a good example of how to do this. If you have a good example of how to parse links out of an HTML page using libxml, I am willing to use it. Thanks. – 2011-03-02 15:34:47

Here are examples: http://xmlsoft.org/tutorial/index.html What I would do personally is use libxml's XPath, because it is the easiest way to get an array of ALL the A elements in the document with one query. I am a bit rusty on XPath, but I think the query was simply "//a" or something like that, to find all A elements in the document. I would consider all the strstr examples as 19th century. This is not how things should be done nowadays. – Gnudiff 2011-03-02 15:37:54

@Mr Aleph: If you don't want GPL, try [Apache Xerces](http://xerces.apache.org/). – chrisaycock 2011-03-02 15:40:47

Here is what I would do (untested, just off the top of my head):

const char *hRef_start = "<a href=";
const char *hRef_end = "</a>";

Assuming your text is in

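char text[1000];    /* assumed to hold the NUL-terminated HTML */
char *first = strstr(text, hRef_start);
if (first)
{
    char *last = strstr(first, hRef_end);
    if (last)
    {
        /* room for everything from "<a href=" up to "</a>", plus the NUL */
        char *link = malloc((size_t)(last - first) + 1);
        if (link)
        {
            copy_link(link, first, last);
            /* ... use link, then free(link) ... */
        }
    }
    else
    {
        /* Error here: no closing </a>. */
    }
}

/* Copies the characters in [first, last) into link and NUL-terminates it. */
void copy_link(char *link, const char *first, const char *last)
{
    while (first < last)
    {
        *link = *first;
        ++link;     /* the destination must advance too */
        ++first;
    }
    *link = 0;
}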

You should check that malloc() succeeded, make sure you call free() on the result, and verify that none of the arguments to copy_link() is NULL.
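
A fuller sketch of the same idea, repeated for every anchor rather than just the first one, might look like the following. The name next_anchor and its cursor-style interface are assumptions made for illustration, and the function inherits every limitation of a literal substring search (exact case, spacing, and attribute order).

```c
#include <stdlib.h>
#include <string.h>

/* Returns a heap-allocated copy of the next "<a href=" ... "</a>" span at
 * or after *cursor and advances *cursor past it. Returns NULL when no
 * complete span remains or malloc() fails. The caller must free() the
 * result. next_anchor is a hypothetical helper name for this sketch. */
char *next_anchor(const char **cursor)
{
    static const char open_tag[] = "<a href=";
    static const char close_tag[] = "</a>";
    const char *start, *end;
    size_t len;
    char *copy;

    if (cursor == NULL || *cursor == NULL)
        return NULL;

    start = strstr(*cursor, open_tag);
    if (start == NULL)
        return NULL;
    end = strstr(start, close_tag);
    if (end == NULL)
        return NULL;                    /* unterminated anchor */
    end += sizeof close_tag - 1;        /* keep the closing tag in the copy */

    len = (size_t)(end - start);
    copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, start, len);
    copy[len] = '\0';

    *cursor = end;                      /* next call resumes after this span */
    return copy;
}
```

Calling it in a loop, for example while ((a = next_anchor(&p)) != NULL) { puts(a); free(a); }, visits each span in document order.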

2011-03-02 15:28:12 Muggen

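OK, the original answer and my comments seem to need more information than fits comfortably in the comment section, so I decided to write a new answer.

First, what you are trying to do IS a programming task already, and it WILL require a certain programming aptitude, depending on your exact needs. Second, some of the answers suggest loops of character searches and regular expressions. Those are horribly wrong ways to do this, as discussed, for example, here.

The normal way to parse HTML/XML nowadays is to use an external library designed for that purpose. In fact, such libraries are now standard and come built into many programming languages.

For your particular need, I am rusty in both C and XPath, but it should work roughly like this: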

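Start an XML/HTML parser.
Load the HTML document into it as a string.
Tell the parser to find all instances of the A tag (using XPath).
It will return a "node set" to you.
Process the node set in a loop, doing whatever you need with each node.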

I found some other examples; maybe this is a better one to follow: http://xmlsoft.org/example.html

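As you can see there, the example works on an XML document (that does not matter much: HTML and XML are close relatives and libxml has an HTML parser built in, so your HTML document should work too).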

In Python or a similar language this would be very easy; in some pseudocode it would look like this:

p=new HTMLParser 
p->load(my html document) 
resultset=p->XPath_Search("//a") # this will find all A elements in the HTML document 
for each result of resultset: 
    write(result.href) 
end for

This would generally print the HREF part of every A element in the document. A decent tutorial with examples of what you can do with XPath is here.

I am afraid this will be more complicated in C, but the idea is the same, and it is a programming task.

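If this is some quick and dirty job, you can use the suggested strstr() or regexp search without an external library. However, keep in mind that, depending on your exact task, you will quite likely miss many outgoing links or misread their contents.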
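
To make that caveat concrete, here is a small illustration. The helper count_literal and the sample_html string are both made up for this example; the sample contains three perfectly valid links, yet a literal search for "<a href=" sees only the one spelled exactly that way.

```c
#include <stddef.h>
#include <string.h>

/* Counts non-overlapping literal occurrences of needle in text.
 * count_literal is a hypothetical helper used only for this illustration. */
size_t count_literal(const char *text, const char *needle)
{
    size_t n = 0;
    size_t nlen = strlen(needle);

    if (nlen == 0)
        return 0;
    while ((text = strstr(text, needle)) != NULL) {
        n++;
        text += nlen;
    }
    return n;
}

/* Three links: lowercase, uppercase, and one with an attribute placed
 * before href. Only the first matches the literal pattern "<a href=". */
const char sample_html[] =
    "<a href=\"one.html\">one</a>"
    "<A HREF=\"two.html\">two</A>"
    "<a class=\"nav\" href=\"three.html\">three</a>";
```

Here count_literal(sample_html, "<a href=") returns 1, so two of the three links are silently missed.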

2011-03-02 16:28:16 Gnudiff
