Parsing noun phrases from parsed text using C/C++

I want to parse noun phrases (NN, NNP, NNS, NNPS) from parsed text. For example:

Input sentence - 
John/NNP 
works/VBZ 
in/IN 
oil/NN 
industry/NN 
./. 
Output: John Oil Industry

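I am confused about the logic, because I need to search for strings such as /NN, /NNP, /NNS and /NNPS and print the word that precedes each of them. What is the logic for parsing noun phrases in C or C++?

Here is what I tried myself: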

char* SplitString(char* str, char sep) 
{ 
    return str; 
} 
main() 
{ 
    char* input = "John/NNP works/VBZ in/IN oil/NN industry/NN ./."; 
    char *output, *temp; 
    char * field; 
    char sep = '/NNP'; 
    int cnt = 1; 
    output = SplitString(input, sep); 

    field = output; 
    for(temp = field; *temp; ++temp){ 
     if (*temp == sep){ 
      printf(" %.*s\n", temp-field, field); 
      field = temp+1; 
     } 
    } 
    printf("%.*s\n", temp-field, field); 
}

My revised attempt:

#include <regex> 
#include <iostream> 

int main() 
{ 
    const std::string s = "John/NNP works/VBZ in/IN oil/NNS industry/NNPS ./."; 
    std::regex rgx("(\\w+)\/NN[P-S]{0,2}"); 
    std::smatch match; 

    if (std::regex_search(s.begin(), s.end(), match, rgx)) 
     std::cout << " " << match[1] << '\n'; 
}

The only output I get is "John". The words with the other /NNS-style tags do not show up.

My second approach:

#include <stdio.h> 
#include <stdlib.h> 
#include <string.h> 
#include <assert.h> 

char** str_split(char* a_str, const char a_delim) 
{ 
    char** result = 0; 
    size_t count = 0; 
    char* tmp = a_str; 
    char* last_comma = 0; 
    char delim[2]; 
    delim[0] = a_delim; 
    delim[1] = 0; 

    /* Count how many elements will be extracted. */ 
    while (*tmp) 
    { 
     if (a_delim == *tmp) 
     { 
      count++; 
      last_comma = tmp; 
     } 
     tmp++; 
    } 

    /* Add space for trailing token. */ 
    count += last_comma < (a_str + strlen(a_str) - 1); 

    /* Add space for terminating null string so caller 
     knows where the list of returned strings ends. */ 
    count++; 

    result = malloc(sizeof(char*) * count); 

    if (result) 
    { 
     size_t idx = 0; 
     char* token = strtok(a_str, delim); 

     while (token) 
     { 
      assert(idx < count); 
      *(result + idx++) = strdup(token); 
      token = strtok(0, delim); 
     } 
     assert(idx == count - 1); 
     *(result + idx) = 0; 
    } 

    return result; 
} 

int main() 
{ 
    char text[] = "John/NNP works/VBZ in/IN oil/NN industry/NN ./."; 
    char** tokens; 

    //printf("INPUT SENTENCE=[%s]\n\n", text); 

    tokens = str_split(text, ''); 

    if (tokens) 
    { 
     int i; 
     for (i = 0; *(tokens + i); i++) 
     { 
      printf("[%s]\n", *(tokens + i)); 
      free(*(tokens + i)); 
     } 
     printf("\n"); 
     free(tokens); 
    } 

    return 0; 
}

The output is:

[John/NNP] 
[works/VBZ] 
[in/IN] 
[oil/NN] 
[industry/NN] 
[./.]

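I only want the data tagged /NNP and /NN, i.e. John, oil and industry. How can I get this? Would a regular expression help? How can I use regular expressions in C the same way as in C++?

Source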

2015-11-06 New_Programmer

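I am confused about the logic. I am trying to search for strings such as /NN, /NNP, /NNS and /NNPS, and then print all characters before the "/" back to the preceding space.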

@New_Programmer That should more or less work. – Magisch

@Haris No, this is not called natural language processing. It is a simple parsing problem. – Identity1

Your line "Add space for trailing token" is unnecessary, because strtok stops automatically at the terminating zero.

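Also, tokens = str_split(text, ''); cannot be right, because your str_split expects a single character for a_delim and you pass '', for which my compiler (clang) reports

error: empty character constant

Presumably you meant to pass a space ' ' to split on, but I did not test whether the function works as such (even if you did get some form of output anyway).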

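Your code returns [John/NNP] (etc.) because you do nothing further to split off the tag names, and you do not test them against your list of wanted tags either. A C program only does what you tell it to; that is what programming is about.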

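I suggest a straightforward solution in plain C, using only the string tokenization function strtok, the single-character lookup strchr, and the string comparison strcmp.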

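My routine tokenizes the input string on spaces, splitting off one word at a time (note: for this to work, strtok must be able to modify the input string), locates the slash in that token, compares the text after the slash with the list of wanted tags, and prints the word before the slash if the tag is in the list.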

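1. After each call to strtok, the pointer token points to the start of the next word, which is already zero-terminated. So the first token will be John/NNP.
2. Then strchr tries to find the slash and, if found, stores its position in slash.
3. If that succeeds, slash points at the slash itself, so the tag to test starts at slash+1.
4. A simple loop compares it with each tag name in the wanted list. If a match is found, *slash is set to 0, overwriting the slash, so the current token string ends right before it. The current token is then printed.
5. Whether a match was found or not, strtok is called again in the loop until it fails. If it finds a next token, the loop goes back to #2; otherwise it exits.

The program: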

#include <stdio.h> 
#include <string.h> 

int main() 
{ 
    /* input */ 
    char text[] = "John/NNP works/VBZ in/IN oil/NN industry/NN ./."; 
    char *wanted[] = { "NN", "NNP", "NNS", "NNPS" }; 

    /* helper variables */ 
    size_t i; 
    char *token, *slash; 

    token = strtok(text, " "); 
    while (token) 
    { 
     slash = strchr (token, '/'); 
     if (slash && slash[1]) 
     { 
      for (i=0; i<sizeof(wanted)/sizeof(wanted[0]); i++) 
      { 
       if (!strcmp (slash+1, wanted[i])) 
       { 
        *slash = 0; 
        printf ("%s\n", token); 
        break; 
       } 
      } 
     } 
     token = strtok(NULL, " "); 
    } 

    return 0; 
}

Output:

John 
oil 
industry

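I did not bother to capitalize the words as in your desired output. That is a trivial addition, and you should be able to work it out yourself.

Source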
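The capitalization step mentioned above can be sketched as follows. This is a minimal C++ sketch, not part of the original answer; the helper name capitalize is an assumption made for illustration. In the C program the same idea would apply toupper from <ctype.h> to token[0] before printing.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not from the thread): return a copy of `word`
// with its first character converted to uppercase.
std::string capitalize(std::string word)
{
    if (!word.empty()) {
        // Cast to unsigned char first: passing a negative char value
        // to std::toupper is undefined behavior.
        word[0] = static_cast<char>(
            std::toupper(static_cast<unsigned char>(word[0])));
    }
    return word;
}
```

Calling capitalize on each extracted word before printing it would turn oil into Oil and leave John unchanged.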

2015-11-07 00:26:02 usr2564301

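If it is all about printing, then try this approach. It uses a regular expression in a search function to look for the pattern \/NN[A-Z]{0,3}, i.e. /NN followed by 0 to 3 uppercase letters, and captures, with (), the \\w+ word in front of it.

This is untested, but: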

#include <regex> 
#include <iostream> 

int main() 
{ 
    const std::string s = "John/NNP works/VBZ in/IN oil/NN industry/NN ./."; 
    std::regex rgx("(\\w+)\/NN[A-Z]{0,3}"); 
    std::smatch match; 

    while (std::regex_search(s, match, rgx)) 
     std::cout << "match: " << match[1] << '\n'; 
}

Source

2015-11-06 09:12:04 Identity1

It only shows "John" as output. I am trying to parse all four noun tag types: /NNP, /NN, /NNS and /NNPS. Thanks for the code snippet anyway. I get the idea. Let me keep trying. Thanks.

The while loop prints "John" over and over.

Yes, I have not done regular expressions in C++, so I am not sure how to make the match global. – Identity1
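As the comments point out, calling std::regex_search repeatedly on the same unchanged string finds the same first match every time, so the loop never advances. One way to visit every match is std::sregex_iterator. The following is a minimal sketch under that assumption; the function name extract_nouns is hypothetical and does not come from the thread.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): collect every word whose
// tag starts with /NN by iterating over all matches, not just the first.
std::vector<std::string> extract_nouns(const std::string& text)
{
    // "/" needs no escaping here; [A-Z]{0,3} follows the answer's pattern.
    static const std::regex rgx("(\\w+)/NN[A-Z]{0,3}");
    std::vector<std::string> words;

    // A default-constructed sregex_iterator acts as the end marker.
    for (std::sregex_iterator it(text.begin(), text.end(), rgx), end;
         it != end; ++it) {
        words.push_back((*it)[1].str()); // capture group 1 is the word
    }
    return words;
}
```

For the sentence in the question, this should yield John, oil and industry, which can then be printed one per line.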

regex_token_iterator may help:

#include <algorithm> 
#include <iostream> 
#include <iterator> 
#include <regex> 
#include <string> 

int main() 
{ 
    std::string input = "John/NNP works/VBZ in/IN oil/NN industry/NN ABC/NNPS ./."; 

    // This regex has a capture group () that looks for a sequence of word characters 
    // followed by /NN, which is matched but not captured 
    std::regex nouns_re("(\\w+)\\/NN"); 

    // We pass 1 as the final argument to the token iterator 
    // because we just want to print the captured word and not the /NN part 
    std::copy(std::sregex_token_iterator(input.begin(), input.end(), nouns_re, 1), 
              std::sregex_token_iterator(), 
              std::ostream_iterator<std::string>(std::cout, "\n")); 
}
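Note that a pattern ending in /NN accepts any tag that merely starts with NN. If only the four tags NN, NNP, NNS and NNPS should be accepted, the pattern can be tightened. The following is a sketch under that assumption, not part of the original answer; the function name extract_exact_nouns and the stricter pattern are illustrative choices.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): return the words tagged
// exactly NN, NNP, NNS or NNPS.
std::vector<std::string> extract_exact_nouns(const std::string& input)
{
    // (?:PS|P|S)? allows the optional suffixes; the negative lookahead
    // (?!\w) rejects longer tags such as /NNX.
    static const std::regex nouns_re("(\\w+)/NN(?:PS|P|S)?(?!\\w)");
    std::vector<std::string> words;

    // Submatch index 1 selects only the captured word for each match.
    std::sregex_token_iterator it(input.begin(), input.end(), nouns_re, 1);
    const std::sregex_token_iterator end;
    for (; it != end; ++it)
        words.push_back(it->str());

    return words;
}
```

With this pattern a token such as x/NNX is skipped, whereas the shorter pattern would have reported x as a noun.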

Source

2015-11-06 23:52:45
