2010-06-13

How do I read a text file into MATLAB and make it a list?

I have a text file in the following format:

gene   complement(22995..24539) 
       /gene="ppp" 
       /locus_tag="MRA_0020" 
CDS    complement(22995..24539) 
       /gene="ppp" 
       /locus_tag="MRA_0020" 
       /codon_start=1 
       /transl_table=11 
       /product="putative serine/threonine phosphatase Ppp" 
       /protein_id="ABQ71738.1" 
       /db_xref="GI:148503929" 
gene   complement(24628..25095) 
       /locus_tag="MRA_0021" 
CDS    complement(24628..25095) 
       /locus_tag="MRA_0021" 
       /codon_start=1 
       /transl_table=11 
       /product="hypothetical protein" 
       /protein_id="ABQ71739.1" 
       /db_xref="GI:148503930" 
gene   complement(25219..26802) 
       /locus_tag="MRA_0022" 
CDS    complement(25219..26802) 
       /locus_tag="MRA_0022" 
       /codon_start=1 
       /transl_table=11 
       /product="hypothetical protein" 
       /protein_id="ABQ71740.1" 
       /db_xref="GI:148503931" 

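I would like to read the text file into MATLAB and build a list in which each item starts at a 'gene' line and holds the information from the text file for that record. For this example, the list would have 3 items. I have tried a few things and cannot get this to work. Does anyone have an idea of what I can do?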

Answers


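Here is a quick suggestion for an algorithm:

  1. Open the file with fopen.
  2. Start reading lines with fgetl until you find a line that starts with 'CDS'.
  3. Keep reading lines until you get another line that starts with 'gene'.
  4. For all the lines between (2) and (3):
    • Find the string between '/' and '='. This is the field name.
    • Find the string between the quotes. This is the field value.
  5. Increase a counter by one and start again from step 2, until you have finished reading the file.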

These commands may be helpful:

  • To find a string surrounded by specific characters, use e.g. regexp(lineThatHasBeenRead,'/(.+)=','tokens','once')
  • To create the output structure, use dynamic field names, e.g. output(ct).(fieldname) = value;
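
To make the two commands concrete, here is a minimal sketch that applies them to a single line. The sample line is copied from the example file; the second pattern, '"(.+)"', is an assumption added for illustration and only handles quoted values.

```matlab
% One line from the example file (hard-coded here for illustration)
lineThatHasBeenRead = '       /locus_tag="MRA_0020"';

% With 'tokens' and 'once', regexp returns a 1x1 cell holding the captured text
fieldname = regexp(lineThatHasBeenRead, '/(.+)=', 'tokens', 'once');
value     = regexp(lineThatHasBeenRead, '"(.+)"', 'tokens', 'once');

% Dynamic field name: the field is created from the string in fieldname{1}
ct = 1;
output = struct();
output(ct).(fieldname{1}) = value{1};

disp(output(ct).locus_tag)
```

Because regexp returns cells here, the captured text has to be unpacked with {1} before it can be used as a field name or stored as a plain string.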

Edit

Here is some code. I saved your example as 'test.txt'.

% open file
fid = fopen('test.txt');
if fid == -1
    error('Could not open test.txt');
end

% parse the file
eof = false;
geneCt = 1;
clear output % you cannot reassign output if it exists with different fieldnames already
output(1:1000) = struct; % you may want to initialize fields here
while ~eof
    % read lines till we find one with CDS (or run out of file)
    foundCDS = false;
    while ~foundCDS && ~eof
        currentLine = fgetl(fid);
        % check for eof, then CDS. Allow whitespace at the beginning
        if ~ischar(currentLine)
            % end of file (fgetl returns -1)
            eof = true;
        elseif ~isempty(regexp(currentLine,'^\s*CDS','match','once'))
            foundCDS = true;
        end
    end % looking for CDS

    if ~eof

        % read (and remember) lines till we find 'gene'
        collectedLines = cell(1,20); % assume about 20 lines per gene; the cell array grows if there are more
        foundGene = false;
        lineCt = 1;
        while ~foundGene
            currentLine = fgetl(fid);
            % check for eof, then gene. Allow whitespace at the beginning
            if ~ischar(currentLine)
                % end of file - consider all data has been read
                eof = true;
                foundGene = true;
            elseif ~isempty(regexp(currentLine,'^\s*gene','match','once'))
                foundGene = true;
            else
                % strip leading/trailing whitespace so the '$' anchor below works
                collectedLines{lineCt} = strtrim(currentLine);
                lineCt = lineCt + 1;
            end
        end

        % loop through collectedLines and assign. Do not loop through the
        % gene line
        for iLine = 1:lineCt-1
            thisLine = collectedLines{iLine};
            fieldname = regexp(thisLine,'/(.+)=','tokens','once');
            value = regexp(thisLine,'="?([^"]+)"?$','tokens','once');
            if isempty(fieldname) || isempty(value)
                continue % skip blank lines and lines that are not /name=value
            end
            % try converting value to number
            numValue = str2double(value{1});
            if isfinite(numValue)
                value = numValue;
            else
                value = value{1};
            end
            output(geneCt).(fieldname{1}) = value;
        end
        geneCt = geneCt + 1;
    end
end % while ~eof

% cleanup
fclose(fid);
output(geneCt:end) = [];
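
As an alternative to the line-by-line loop, the whole file could be read at once and split at every line that starts with 'gene', which matches the "one list item per gene" request more directly. The following is only a sketch under assumptions: the variable names (txt, records, genes, item) are made up for illustration, the sample text is an abbreviated copy of the example and stands in for txt = fileread('test.txt'), and all values are kept as text without numeric conversion.

```matlab
% Abbreviated sample text; with a real file, use txt = fileread('test.txt')
txt = sprintf('%s\n', ...
    'gene   complement(22995..24539)', ...
    '       /gene="ppp"', ...
    '       /locus_tag="MRA_0020"', ...
    'CDS    complement(22995..24539)', ...
    '       /codon_start=1', ...
    '       /product="putative serine/threonine phosphatase Ppp"', ...
    'gene   complement(24628..25095)', ...
    '       /locus_tag="MRA_0021"', ...
    'CDS    complement(24628..25095)', ...
    '       /product="hypothetical protein"');

% Split at every line that starts with 'gene' (optional leading whitespace).
% 'lineanchors' makes ^ match at the start of each line, not only the text.
records = regexp(txt, '^\s*gene\s', 'split', 'lineanchors');
records(1) = []; % drop whatever precedes the first 'gene' line

% Build one struct per record; a cell array is used because the records
% do not all have the same fields
genes = cell(1, numel(records));
for k = 1:numel(records)
    % each token pair is {name, value}; the quotes around the value are optional
    tok = regexp(records{k}, '/(\w+)="?([^"\n]*)"?', 'tokens');
    item = struct();
    for t = 1:numel(tok)
        item.(tok{t}{1}) = tok{t}{2};
    end
    genes{k} = item;
end
```

Each element of genes then holds the qualifiers of one record, e.g. genes{1}.gene or genes{2}.locus_tag. Qualifiers that appear in both the gene and the CDS block (such as locus_tag) are simply overwritten by the later occurrence.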