Extracting text from an XPS document

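I need to extract the text of a specific page from an XPS document. The extracted text should be written to a string, which I then need to read aloud using Microsoft's SpeechLib. Please use C# examples only.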

Thanks

2012-09-04 Tim Trabold

Since you have tagged the question C#, almost all of the answers will be in C# anyway, but why only C#? Are you allergic to other languages?

No, but my company develops in C#, so I have to as well.

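So what? Write it in any other language, then use an online converter (such as http://www.developerfusion.com/tools/convert/csharp-to-vb/#convert-again) to change it to the language you need. At my last company I coded in C#, and now I write VB. The syntax was only a problem for the first two days.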

Add references to ReachFramework and WindowsBase, and the following using declarations:

using System.Text;
using System.Windows.Xps.Packaging;

Then use this code:

// "/path" is a placeholder for the path of the XPS file.
XpsDocument _xpsDocument = new XpsDocument("/path", System.IO.FileAccess.Read);
IXpsFixedDocumentSequenceReader fixedDocSeqReader
    = _xpsDocument.FixedDocumentSequenceReader;
IXpsFixedDocumentReader _document = fixedDocSeqReader.FixedDocuments[0];
// documentViewerElement comes from the surrounding UI code (not shown);
// FixedPages itself is indexed from zero.
IXpsFixedPageReader _page
    = _document.FixedPages[documentViewerElement.MasterPageNumber];
StringBuilder _currentText = new StringBuilder();
System.Xml.XmlReader _pageContentReader = _page.XmlReader;
if (_pageContentReader != null)
{
    while (_pageContentReader.Read())
    {
        // The text is stored in the UnicodeString attribute of Glyphs elements.
        if (_pageContentReader.Name == "Glyphs" && _pageContentReader.HasAttributes)
        {
            string text = _pageContentReader.GetAttribute("UnicodeString");
            if (text != null)
            {
                _currentText.Append(text);
            }
        }
    }
}
string _fullPageText = _currentText.ToString();
// Release the underlying package once the text has been read.
_xpsDocument.Close();

The text is stored in the UnicodeString attribute of each Glyphs element. You have to read it through the XmlReader of the fixed page.
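To make the Glyphs/UnicodeString idea concrete without needing an actual XPS file, here is a minimal sketch that runs the same XmlReader loop over a hand-written fragment of fixed-page markup. The names GlyphsTextSketch, ExtractUnicodeStrings and SampleMarkup are hypothetical, and the sample markup is trimmed to the one attribute that matters here; real Glyphs elements carry more attributes (font, position, size).

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Sketch only: GlyphsTextSketch and its members are hypothetical names.
public static class GlyphsTextSketch
{
    // Hand-written sample markup, reduced to the attribute relevant here.
    public const string SampleMarkup =
        "<FixedPage xmlns=\"http://schemas.microsoft.com/xps/2005/06\">" +
        "<Glyphs UnicodeString=\"Hello\" />" +
        "<Path />" +
        "<Glyphs UnicodeString=\"XPS\" />" +
        "</FixedPage>";

    // Collects the UnicodeString attribute of every Glyphs element in the markup.
    public static List<string> ExtractUnicodeStrings(string fixedPageMarkup)
    {
        var texts = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(fixedPageMarkup)))
        {
            while (reader.Read())
            {
                // Only element start tags named Glyphs can carry the text.
                if (reader.NodeType != XmlNodeType.Element || reader.Name != "Glyphs")
                    continue;

                string text = reader.GetAttribute("UnicodeString");
                if (text != null)
                    texts.Add(text);
            }
        }
        return texts;
    }
}
```

Calling GlyphsTextSketch.ExtractUnicodeStrings(GlyphsTextSketch.SampleMarkup) should yield the two strings in document order. With a real document, the XmlReader would come from IXpsFixedPageReader.XmlReader instead of a StringReader.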


2012-09-05 12:53:42 Sanjay

@Tim Trabold: Feedback on the answer would be helpful. – Sanjay

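I get the following error: The type 'System.IO.Packaging.Package' is defined in an assembly that is not referenced. You must add a reference to assembly 'WindowsBase, Version=3.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'. – 2013-09-26 05:33:17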

+1. That cleared it up. Great work. – 2013-09-26 06:30:00

Full code of the class:

using System.Collections.Generic;
using System.Windows.Xps.Packaging;

namespace XPS_Data_Transfer
{
    internal static class XpsDataReader
    {
        // Returns the text strings of one page, or null if the document cannot be read.
        public static List<string> ReadXps(string address, int pageNumber)
        {
            var xpsDocument = new XpsDocument(address, System.IO.FileAccess.Read);
            try
            {
                var fixedDocSeqReader = xpsDocument.FixedDocumentSequenceReader;
                if (fixedDocSeqReader == null) return null;

                const string uniStr = "UnicodeString";
                const string glyphs = "Glyphs";
                // Assumes every page is stored as its own fixed document, so the
                // 1-based pageNumber selects a document and its first page is read.
                var document = fixedDocSeqReader.FixedDocuments[pageNumber - 1];
                var page = document.FixedPages[0];
                var currentText = new List<string>();
                var pageContentReader = page.XmlReader;

                if (pageContentReader == null) return null;
                while (pageContentReader.Read())
                {
                    if (pageContentReader.Name != glyphs) continue;
                    if (!pageContentReader.HasAttributes) continue;
                    var text = pageContentReader.GetAttribute(uniStr);
                    if (text != null)
                    {
                        // Dashboard.CleanReversedPersianText is a helper from the
                        // author's own project and is not shown here.
                        currentText.Add(Dashboard.CleanReversedPersianText(text));
                    }
                }
                return currentText;
            }
            finally
            {
                // Release the underlying package on every exit path.
                xpsDocument.Close();
            }
        }
    }
}

This returns the list of text strings from the given page of the given file.


2014-08-09 16:35:34 Amir

Dashboard.CleanReversedPersianText is missing – salle55

private string ReadXpsFile(string fileName)
{
    XpsDocument _xpsDocument = new XpsDocument(fileName, System.IO.FileAccess.Read);
    IXpsFixedDocumentSequenceReader fixedDocSeqReader
        = _xpsDocument.FixedDocumentSequenceReader;
    IXpsFixedDocumentReader _document = fixedDocSeqReader.FixedDocuments[0];
    FixedDocumentSequence sequence = _xpsDocument.GetFixedDocumentSequence();
    StringBuilder _fullPageText = new StringBuilder();
    // Caution: the page count is taken from the whole sequence, while FixedPages
    // belongs to the first fixed document only, so this assumes that all pages
    // live in a single fixed document.
    for (int pageCount = 0; pageCount < sequence.DocumentPaginator.PageCount; ++pageCount)
    {
        IXpsFixedPageReader _page = _document.FixedPages[pageCount];
        System.Xml.XmlReader _pageContentReader = _page.XmlReader;
        if (_pageContentReader == null) continue;

        while (_pageContentReader.Read())
        {
            // The text is stored in the UnicodeString attribute of Glyphs elements.
            if (_pageContentReader.Name == "Glyphs" && _pageContentReader.HasAttributes)
            {
                string text = _pageContentReader.GetAttribute("UnicodeString");
                if (text != null)
                {
                    _fullPageText.Append(text);
                }
            }
        }
    }
    _xpsDocument.Close();
    return _fullPageText.ToString();
}


2014-08-11 05:03:28 Nurkhan

I get an ArgumentOutOfRangeException with this code; _document.FixedPages contains only a single element (even though the XPS contains multiple pages). See: http://i.imgur.com/gpcKxCX.png – salle55

A method that returns the text of all pages (modified from Amir's code, hope that's OK):

// Needs: using System.Collections.Generic; using System.IO;
//        using System.Windows.Xps.Packaging;

/// <summary>
/// Get all text strings from an XPS file.
/// Returns a list of lists (one for each page) containing the text strings.
/// </summary>
private static List<List<string>> ExtractTextFromXps(string xpsFilePath)
{
    var xpsDocument = new XpsDocument(xpsFilePath, FileAccess.Read);
    var fixedDocSeqReader = xpsDocument.FixedDocumentSequenceReader;
    if (fixedDocSeqReader == null)
    {
        // Close the package before giving up so the file is not left open.
        xpsDocument.Close();
        return null;
    }

    const string UnicodeString = "UnicodeString";
    const string GlyphsString = "Glyphs";

    var textLists = new List<List<string>>();
    // Walk every fixed document, since pages may be spread across several of them.
    foreach (IXpsFixedDocumentReader fixedDocumentReader in fixedDocSeqReader.FixedDocuments)
    {
        foreach (IXpsFixedPageReader pageReader in fixedDocumentReader.FixedPages)
        {
            var pageContentReader = pageReader.XmlReader;
            if (pageContentReader == null)
                continue;

            var texts = new List<string>();
            while (pageContentReader.Read())
            {
                if (pageContentReader.Name != GlyphsString)
                    continue;
                if (!pageContentReader.HasAttributes)
                    continue;
                var text = pageContentReader.GetAttribute(UnicodeString);
                if (text != null)
                    texts.Add(text);
            }
            textLists.Add(texts);
        }
    }
    xpsDocument.Close();
    return textLists;
}

Usage:

var txtLists = ExtractTextFromXps(@"C:\myfile.xps");

int pageIdx = 0;
foreach (List<string> txtList in txtLists)
{
    pageIdx++;
    Console.WriteLine("== Page {0} ==", pageIdx);
    foreach (string txt in txtList)
        Console.WriteLine(" " + txt);
    Console.WriteLine();
}
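Since the question asks for the text of one specific page as a single string, the per-page lists returned by ExtractTextFromXps can be reduced to that with a small helper. This is only a sketch: PageTextSketch and GetPageText are hypothetical names, and joining the strings with a space is an assumption about how the Glyphs runs should be separated.

```csharp
using System;
using System.Collections.Generic;

// Sketch only: PageTextSketch and GetPageText are hypothetical names.
public static class PageTextSketch
{
    // Joins the strings of one page (1-based page number) into a single string.
    public static string GetPageText(List<List<string>> pages, int pageNumber)
    {
        if (pages == null)
            throw new ArgumentNullException(nameof(pages));
        if (pageNumber < 1 || pageNumber > pages.Count)
            throw new ArgumentOutOfRangeException(nameof(pageNumber));

        // A space keeps adjacent Glyphs runs from fusing into one word.
        return string.Join(" ", pages[pageNumber - 1]);
    }
}
```

For example, string text = PageTextSketch.GetPageText(txtLists, 2); would give the text of the second page. That string could then be handed to SpeechLib, for instance with new SpeechLib.SpVoice().Speak(text, SpeechLib.SpeechVoiceSpeakFlags.SVSFDefault), assuming the project has a COM reference to the Microsoft Speech Object Library; that part was not tried here.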


2017-01-30 16:07:05 salle55
