2017-10-17 31 views
1

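I have a TSV file produced by a program, but it packs several pieces of information into a single cell, separated by the pipe symbol (|). I want to iterate through the tabs and then through the pipes using C++.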

XP_017347145.1 GO:0003676|GO:0005524|GO:0006139|GO:0008026|GO:0016818 
XP_017347145.1 GO:0003677|GO:0004003|GO:0005524 
XP_017347145.1 GO:0005524 
XP_017347145.1 GO:0004003|GO:0016818 
XP_017347145.1 GO:0003676|GO:0005524|GO:0006139|GO:0008026|GO:0016818 
XP_017350967.1 GO:0005515 

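I would like to convert it into just two columns, as shown below, but it seems I don't understand how to use the getline() function in C++.

I don't have much experience, but the output should look like this: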

XP_017347145.1 = GO:0003676 
XP_017347145.1 = GO:0005524 
XP_017347145.1 = GO:0006139 
XP_017347145.1 = GO:0008026 
XP_017347145.1 = GO:0016818 
XP_017347145.1 = GO:0003677 
XP_017347145.1 = GO:0004003 
XP_017347145.1 = GO:0005524 
XP_017347145.1 = GO:0005524 
XP_017347145.1 = GO:0004003 
XP_017347145.1 = GO:0016818 
XP_017347145.1 = GO:0003676 
XP_017347145.1 = GO:0005524 
XP_017347145.1 = GO:0006139 
XP_017347145.1 = GO:0008026 
XP_017347145.1 = GO:0016818 
XP_017350967.1 = GO:0005515 

My current C++ code fails: in some places it drops the equals sign and emits a tab instead.

#include <fstream> 
#include <iostream> 
#include <sstream> 
#include <string> 

int main() { 

    using namespace std; 
    string stringIn; 
    string stringOut; 
    string value; 
    string value2; 

    cout << "Input the name of the file: " << endl; 
    getline(cin, stringIn); 
    cout << "The output file name is " << endl; 
    getline(cin, stringOut); 

    ifstream inputFile(stringIn); 
    ofstream outputFile(stringOut); 

    // Let the user know if the file exists 
    if (!inputFile) { 
        cout << "Cannot open input file" << endl; 
    } 

    if (!outputFile) { 
        cout << "Can not save output file" << endl; 
    } 

    // It should iterate through the values using column 
    // and column2 delimited by the pipe sign. 
    // For example, GO:0005524|GO:0008026 and this could be of unknown length. 
    while (getline(inputFile, value, '\t')) { 
        while (getline(inputFile, value2, '|')) { 
            outputFile << value + " = " + value2 << endl; 
        } 
    } 

    outputFile.close(); 
    inputFile.close(); 
    cin.get(); 

    return 0; 
} 

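My current code returns the output shown below. Any suggestions would be appreciated.

This happens because getline(inputFile, value2, '|') stops only at a pipe, so it captures the following:

GO:0016818\nXP_017347145.1\tGO:0003677 
          ^
          | 
          | 
   newline captured 

It then prints the whole next record without the equals sign, because that record is part of the previously captured value2: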

XP_017347145.1 = GO:0003676 
XP_017347145.1 = GO:0005524 
XP_017347145.1 = GO:0006139 
XP_017347145.1 = GO:0008026 
XP_017347145.1 = GO:0016818 
XP_017347145.1 GO:0003677 
XP_017347145.1 = GO:0004003 
XP_017347145.1 = GO:0005524 
XP_017347145.1 GO:0005524 
XP_017347145.1 GO:0004003 
XP_017347145.1 = GO:0016818 
XP_017347145.1 GO:0003676 
XP_017347145.1 = GO:0005524 
XP_017347145.1 = GO:0006139 
XP_017347145.1 = GO:0008026 
XP_017347145.1 = GO:0016818 
XP_017350967.1 GO:0005515 

What is the question?

Answers

1

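The problem is that the '|' delimiter does not stop at the end of a line.

It would be better to read each line with getline(inputFile, line), which uses the default '\n' newline delimiter. Then create a std::stringstream ss{line} from that line, read the first column from it with getline(ss, value, '\t'), and finally loop over getline(ss, value2, '|').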
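
The line-by-line approach described above might look like the following. This is a minimal sketch under a few assumptions: the columns are separated by a real tab character, the lines carry no trailing whitespace or '\r', and `splitRecords` is a hypothetical helper name rather than anything from the original program.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: turns "ID<TAB>GO:a|GO:b" records into "ID = GO:a" lines.
void splitRecords(std::istream& in, std::ostream& out) {
    std::string line;
    while (std::getline(in, line)) {           // one record per line ('\n' is the default delimiter)
        std::stringstream ss{line};
        std::string id;
        if (!std::getline(ss, id, '\t')) {     // first column, up to the tab
            continue;                          // nothing to read on an empty line
        }
        std::string term;
        while (std::getline(ss, term, '|')) {  // second column, split on pipes
            out << id << " = " << term << '\n';
        }
    }
}
```

In the question's program, a call such as `splitRecords(inputFile, outputFile);` would take the place of the two nested while loops. Because each stringstream ends where its line ends, the pipe loop can no longer run into the next record.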


By the way, I played around with regular expressions, and I think the following might be a more elegant and general solution:

#include <iostream> 
#include <regex> 
#include <sstream> 
#include <string> 
#include <algorithm> 
#include <vector> 

std::stringstream input{R"(XP_017347145.1 GO:0003676|GO:0005524|GO:0006139|GO:0008026|GO:0016818 
XP_017347145.1 GO:0003677|GO:0004003|GO:0005524 
XP_017347145.1 GO:0005524 
XP_017347145.1 GO:0004003|GO:0016818 
XP_017347145.1 GO:0003676|GO:0005524|GO:0006139|GO:0008026|GO:0016818 
XP_017350967.1 GO:0005515)"}; 

struct Record{ 
    std::string xp; 
    std::string go; 
}; 

std::ostream& operator<<(std::ostream& os, const Record& r) 
{ 
    return os << "XP_" << r.xp << " = GO:" << r.go << '\n'; 
} 

int main() 
{ 
    const std::regex xp_r{R"(^XP_(\d*\.\d))"}; // match xp 
    const std::regex go_r{R"(GO:(\d*)\|?)"};   // match go 
    std::vector<Record> records; 
    for (std::string line; getline(input, line);) { 
        std::smatch m; 
        if (std::regex_search(line, m, xp_r)) { 
            auto xp = m[1].str(); 
            auto begin = std::sregex_iterator{line.begin(), line.end(), go_r}; 
            auto end = std::sregex_iterator{}; 
            std::for_each(begin, end, [&records, &xp](const auto& i) { 
                records.emplace_back(Record{xp, i[1].str()}); 
            }); 
        } 
    } 
    for (const auto& i : records) 
        std::cout << i; 
} 

Output:

XP_017347145.1 = GO:0003676 
XP_017347145.1 = GO:0005524 
XP_017347145.1 = GO:0006139 
XP_017347145.1 = GO:0008026 
XP_017347145.1 = GO:0016818 
XP_017347145.1 = GO:0003677 
XP_017347145.1 = GO:0004003 
XP_017347145.1 = GO:0005524 
XP_017347145.1 = GO:0005524 
XP_017347145.1 = GO:0004003 
XP_017347145.1 = GO:0016818 
XP_017347145.1 = GO:0003676 
XP_017347145.1 = GO:0005524 
XP_017347145.1 = GO:0006139 
XP_017347145.1 = GO:0008026 
XP_017347145.1 = GO:0016818 
XP_017350967.1 = GO:0005515 

Thanks for your help – user1238097

2

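You can solve the problem by using std::sregex_token_iterator, like this:

// Requires #include <regex>; inputFile, outputFile, and value are the variables from the question. 
std::regex re("\\s+|\\|"); 
std::sregex_token_iterator reg_end; 
while (getline(inputFile, value)) { 
    // -1 selects the text between matches, i.e. the fields themselves 
    std::sregex_token_iterator it(value.begin(), value.end(), re, -1); 
    std::string p1 = (it++)->str(); // first field: the XP identifier 
    for (; it != reg_end; ++it) { 
        outputFile << p1 << " = " << it->str() << endl; 
    } 
} 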

A question: the regular expression "\s" should match whitespace, right? But what does the extra "\" mean? – user1238097