2012-09-04 207 views
1

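Extracting text from an XPS document

I need to extract the text of a specific page from an XPS document. The extracted text should be written to a string, because I need to read it aloud using Microsoft's SpeechLib. Please give examples in C# only.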

Thanks.

+0

Since you have tagged the question C#, nearly all the answers will be in C# anyway, but why C# only? Are you allergic to other languages? –

+0

No, but my company develops in C#, so I have to as well. –

+0

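So what? Write it in any other language, then use any online converter (such as http://www.developerfusion.com/tools/convert/csharp-to-vb/#convert-again) to change it into the language you need. At my last company I coded in C#, and now I write VB. The syntax was only a problem for the first two days. –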

Answers

9

Add references to ReachFramework and WindowsBase, and the following using directives:

using System.Text;
using System.Windows.Xps.Packaging;

Then use this code:

XpsDocument _xpsDocument = new XpsDocument("/path", System.IO.FileAccess.Read);
IXpsFixedDocumentSequenceReader fixedDocSeqReader
    = _xpsDocument.FixedDocumentSequenceReader;
IXpsFixedDocumentReader _document = fixedDocSeqReader.FixedDocuments[0];
// FixedPages is indexed from zero.
IXpsFixedPageReader _page
    = _document.FixedPages[documentViewerElement.MasterPageNumber];
StringBuilder _currentText = new StringBuilder();
System.Xml.XmlReader _pageContentReader = _page.XmlReader;
if (_pageContentReader != null)
{
    while (_pageContentReader.Read())
    {
        // Text runs are stored on Glyphs elements.
        if (_pageContentReader.Name == "Glyphs"
            && _pageContentReader.HasAttributes)
        {
            string unicodeString = _pageContentReader.GetAttribute("UnicodeString");
            if (unicodeString != null)
            {
                _currentText.Append(unicodeString);
            }
        }
    }
}
string _fullPageText = _currentText.ToString();
// Release the underlying package once the text has been read.
_xpsDocument.Close();

The text is stored in the UnicodeString attribute of each Glyphs element. You have to read it through the fixed page's XmlReader.
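To illustrate the Glyphs/UnicodeString point without needing a real XPS file, here is a minimal sketch that runs the same loop over a hand-written FixedPage fragment. The class name GlyphTextSketch and the sample markup are made up for illustration; the handling of a leading "{}" reflects my understanding of how XPS escapes strings that start with an opening brace, so treat that part as an assumption to verify against your own files.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class GlyphTextSketch
{
    // Hypothetical sample markup, for illustration only.
    public const string SamplePage =
        "<FixedPage xmlns=\"http://schemas.microsoft.com/xps/2005/06\" Width=\"816\" Height=\"1056\">" +
        "<Glyphs UnicodeString=\"Hello\" />" +
        "<Path Data=\"M 0,0 L 10,10\" />" +
        "<Glyphs UnicodeString=\"{}{braces}\" />" +
        "</FixedPage>";

    // Collects the UnicodeString attribute of every Glyphs element,
    // one text run per line.
    public static string ExtractText(XmlReader reader)
    {
        var text = new StringBuilder();
        while (reader.Read())
        {
            if (reader.NodeType != XmlNodeType.Element || reader.LocalName != "Glyphs")
                continue;

            string value = reader.GetAttribute("UnicodeString");
            if (string.IsNullOrEmpty(value))
                continue;

            // Assumption: a leading "{}" is an escape prefix, not part of the text.
            if (value.StartsWith("{}", StringComparison.Ordinal))
                value = value.Substring(2);

            text.Append(value).Append('\n');
        }
        return text.ToString();
    }

    // Runs the extraction over the sample markup above.
    public static string ExtractSample()
    {
        using (var reader = XmlReader.Create(new StringReader(SamplePage)))
        {
            return ExtractText(reader);
        }
    }
}
```

With a real document you would pass the page's XmlReader to ExtractText instead of the sample reader. Separating runs with a newline is just one choice; a space may suit text-to-speech better.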

+2

@Tim Trabold: Some feedback on the answer would be helpful. – Sanjay

+0

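I get the following error: The type 'System.IO.Packaging.Package' is defined in an assembly that is not referenced. You must add a reference to assembly 'WindowsBase, Version=3.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'. – 2013-09-26 05:33:17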

+0

Cleared it up. Great work. – 2013-09-26 06:30:00

0

Full code of the class:

using System.Collections.Generic;
using System.Drawing;
using System.Windows.Forms;
using System.Windows.Xps.Packaging;

namespace XPS_Data_Transfer
{
    internal static class XpsDataReader
    {
        public static List<string> ReadXps(string address, int pageNumber)
        {
            var xpsDocument = new XpsDocument(address, System.IO.FileAccess.Read);
            var fixedDocSeqReader = xpsDocument.FixedDocumentSequenceReader;
            if (fixedDocSeqReader == null) return null;

            const string uniStr = "UnicodeString";
            const string glyphs = "Glyphs";
            // Note: pageNumber (one-based) selects a FixedDocument here, and only its
            // first page is read. This suits files that store one page per FixedDocument;
            // for a single multi-page document, index FixedPages instead.
            var document = fixedDocSeqReader.FixedDocuments[pageNumber - 1];
            var page = document.FixedPages[0];
            var currentText = new List<string>();
            var pageContentReader = page.XmlReader;

            if (pageContentReader == null) return null;
            while (pageContentReader.Read())
            {
                if (pageContentReader.Name != glyphs) continue;
                if (!pageContentReader.HasAttributes) continue;
                // Dashboard.CleanReversedPersianText is a helper from the author's project (not shown).
                if (pageContentReader.GetAttribute(uniStr) != null)
                    currentText.Add(Dashboard.CleanReversedPersianText(pageContentReader.GetAttribute(uniStr)));
            }
            return currentText;
        }
    }
}

It returns a list of strings from a given page of a given file.

+0

Dashboard.CleanReversedPersianText is missing. – salle55

0

private string ReadXpsFile(string fileName)
{
    XpsDocument _xpsDocument = new XpsDocument(fileName, System.IO.FileAccess.Read);
    IXpsFixedDocumentSequenceReader fixedDocSeqReader
        = _xpsDocument.FixedDocumentSequenceReader;
    // Only the first FixedDocument in the sequence is read.
    IXpsFixedDocumentReader _document = fixedDocSeqReader.FixedDocuments[0];
    // FixedDocumentSequence lives in System.Windows.Documents.
    FixedDocumentSequence sequence = _xpsDocument.GetFixedDocumentSequence();
    string _fullPageText = "";
    for (int pageCount = 0; pageCount < sequence.DocumentPaginator.PageCount; ++pageCount)
    {
        IXpsFixedPageReader _page
            = _document.FixedPages[pageCount];
        StringBuilder _currentText = new StringBuilder();
        System.Xml.XmlReader _pageContentReader = _page.XmlReader;
        if (_pageContentReader != null)
        {
            while (_pageContentReader.Read())
            {
                // Text runs are stored on Glyphs elements.
                if (_pageContentReader.Name == "Glyphs"
                    && _pageContentReader.HasAttributes
                    && _pageContentReader.GetAttribute("UnicodeString") != null)
                {
                    _currentText.Append(
                        _pageContentReader.GetAttribute("UnicodeString"));
                }
            }
        }
        _fullPageText += _currentText.ToString();
    }
    return _fullPageText;
}

+0

I get an ArgumentOutOfRangeException with this code: _document.FixedPages contains only a single element, even though the XPS file contains multiple pages. See: http://i.imgur.com/gpcKxCX.png – salle55

0

A method that returns the text of all pages (modified from Amir's code, I hope that's OK):

// Requires: using System.Collections.Generic; using System.IO;
//           using System.Windows.Xps.Packaging;

/// <summary>
/// Get all text strings from an XPS file.
/// Returns a list of lists (one for each page) containing the text strings.
/// </summary>
private static List<List<string>> ExtractTextFromXps(string xpsFilePath)
{
    var xpsDocument = new XpsDocument(xpsFilePath, FileAccess.Read);
    var fixedDocSeqReader = xpsDocument.FixedDocumentSequenceReader;
    if (fixedDocSeqReader == null)
    {
        xpsDocument.Close();
        return null;
    }

    const string UnicodeString = "UnicodeString";
    const string GlyphsString = "Glyphs";

    var textLists = new List<List<string>>();
    // Walk every FixedDocument in the sequence, then every page in each document.
    foreach (IXpsFixedDocumentReader fixedDocumentReader in fixedDocSeqReader.FixedDocuments)
    {
        foreach (IXpsFixedPageReader pageReader in fixedDocumentReader.FixedPages)
        {
            var pageContentReader = pageReader.XmlReader;
            if (pageContentReader == null)
                continue;

            var texts = new List<string>();
            while (pageContentReader.Read())
            {
                if (pageContentReader.Name != GlyphsString)
                    continue;
                if (!pageContentReader.HasAttributes)
                    continue;
                if (pageContentReader.GetAttribute(UnicodeString) != null)
                    texts.Add(pageContentReader.GetAttribute(UnicodeString));
            }
            textLists.Add(texts);
        }
    }
    xpsDocument.Close();
    return textLists;
}

Usage:

var txtLists = ExtractTextFromXps(@"C:\myfile.xps"); 

int pageIdx = 0; 
foreach (List<string> txtList in txtLists) 
{ 
    pageIdx++; 
    Console.WriteLine("== Page {0} ==", pageIdx); 
    foreach (string txt in txtList) 
     Console.WriteLine(" "+txt); 
    Console.WriteLine(); 
}
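
Since the question asked for the text of one specific page as a single string, here is a rough sketch of how the list-of-lists result could be narrowed down to that. PageTextSketch and GetPageText are names invented for this example, and joining the text runs with a space is an assumption; a different separator may read better depending on how the document was produced.

```csharp
using System;
using System.Collections.Generic;

public static class PageTextSketch
{
    // Returns the text of one page (pageNumber is one-based) as a single string,
    // or null when the page does not exist.
    public static string GetPageText(List<List<string>> pages, int pageNumber)
    {
        if (pages == null || pageNumber < 1 || pageNumber > pages.Count)
            return null;

        // Join the individual text runs of the page with a space.
        return string.Join(" ", pages[pageNumber - 1].ToArray());
    }
}
```

For example, PageTextSketch.GetPageText(txtLists, 2) would give the second page's text, which can then be handed to whatever speech API you use.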