How to extract attribute values from XML elements with the XML Extractor in U-SQL

How can I use the XML Extractor in U-SQL to extract attribute values from XML elements in my Azure Data Lake Analytics job?

Update: more details about the problem

My XML file looks like this:

<?xml version="1.0" encoding="utf-8"?> 
<testelement testatr="xyz"> 
</testelement>

Here is my U-SQL script:

While debugging, I observed that the exception occurs when the Load method of the XPath class tries to load:

"<?xml version=1.0 encoding=utf-8?>"

Here is the exception:

Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.ScopeDebugException was unhandled 
Message: An unhandled exception of type 'Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.ScopeDebugException' occurred in Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.dll 
Additional information: {"diagnosticCode":195887111,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXPRESSIONEVALUATION","message":"Error while evaluating expression Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(log, \"testelement/attribute::testatr\").ElementAt(0)","description":"Inner exception from user expression: '1.0' is an unexpected token. The expected token is '\"' or '''. Line 1, position 15.\nCurrent row dump: \tlog:\t\"<?xml version=1.0 encoding=utf-8?>\" 
\n","resolution":"","helpLink":"","details":"==== Caught exception System.Xml.XmlException\n\n at System.Xml.XmlTextReaderImpl.Throw(Exception e) 
\n at System.Xml.XmlTextReaderImpl.ParseXmlDeclaration(Boolean isTextDecl) 
\n at System.Xml.XmlTextReaderImpl.Read() 
\n at System.Xml.XmlLoader.Load(XmlDocument doc, XmlReader reader, Boolean preserveWhitespace) 
\n at System.Xml.XmlDocument.Load(XmlReader reader) 
\n at System.Xml.XmlDocument.LoadXml(String xml) 
\n at Microsoft.Analytics.Samples.Formats.Xml.XPath.Load(String xml) 
\n at Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(String xml, String xpath) 
\n at ___Scope_Generated_Classes___.SqlFilterTransformer_2.Process(IRow row, IUpdatableRow output) in c:\\workarea\\bswbigdata\\USQLAppForLogs\\USQLAppForLogs\\bin\\Debug\\A06D46624BBA798\\ReadBlobs.usql.Debug_A54F30D359F939C7\\__ScopeCodeGen__.dll.cs:line 53","internalDiagnostics":""}

Update 2:

After using quoting:false I get a different exception:

Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.ScopeDebugException was unhandled 
Message: An unhandled exception of type 'Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.ScopeDebugException' occurred in Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.dll 
Additional information: {"diagnosticCode":195887111,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXPRESSIONEVALUATION","message":"Error while evaluating expression Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(log, \"testelement/attribute::testatr\").ElementAt(0)","description":"Inner exception from user expression: Root element is missing.\nCurrent row dump: \tlog:\t\"<?xml version=\"1.0\" encoding=\"utf-8\"?>\" 
\n","resolution":"","helpLink":"","details":"==== Caught exception System.Xml.XmlException\n\n at System.Xml.XmlTextReaderImpl.Throw(Exception e) 
\n at System.Xml.XmlTextReaderImpl.ParseDocumentContent() 
\n at System.Xml.XmlLoader.LoadDocSequence(XmlDocument parentDoc) 
\n at System.Xml.XmlDocument.Load(XmlReader reader) 
\n at System.Xml.XmlDocument.LoadXml(String xml) 
\n at Microsoft.Analytics.Samples.Formats.Xml.XPath.Load(String xml) 
\n at Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(String xml, String xpath) 
\n at ___Scope_Generated_Classes___.SqlFilterTransformer_2.Process(IRow row, IUpdatableRow output) in c:\\workarea\\bswbigdata\\USQLAppForLogs\\USQLAppForLogs\\bin\\Debug\\A06D46624BBA798\\ReadBlobs.usql.Debug_A54F30D359F939C7\\__ScopeCodeGen__.dll.cs:line 53","internalDiagnostics":""}

Source

2016-01-05 Jamil

You identify attributes with XPath value expressions. Query an attribute with @attr_name (or the full axis expression attribute::attr_name).
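As a language-neutral illustration of what the attribute query returns for the sample document in the question, here is a minimal Python sketch. It is not U-SQL and not the sample library's XPath class. Python's standard-library ElementTree implements only a subset of XPath and cannot return attribute nodes directly, so the sketch reads the attribute with `.get()` instead of evaluating `testelement/@testatr`. The helper name `attribute_value` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The sample document from the question, written on a single line.
XML_DOC = (b'<?xml version="1.0" encoding="utf-8"?>'
           b'<testelement testatr="xyz"></testelement>')


def attribute_value(xml_bytes, attr_name):
    """Return the named attribute of the root element, or None if it is absent."""
    root = ET.fromstring(xml_bytes)
    return root.get(attr_name)


print(attribute_value(XML_DOC, "testatr"))  # prints: xyz
```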

UPDATE, based on the update to the question:

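It looks like the parser gets confused by the " characters inside the XML declaration. I see that you use the built-in Tsv() extractor, which by default currently treats " as a quoting character around fields and then removes it. This is a bug that we plan to fix.

Until then, I suggest that you use Extractors.Tsv(quoting:false).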
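The effect of the stripped quotes can be reproduced outside U-SQL. The Python sketch below only imitates the behaviour described above: `strip_quotes` is a hypothetical stand-in that removes every " character, which yields the same `<?xml version=1.0 encoding=utf-8?>` text shown in the row dump of the first exception. It is not the Tsv() extractor's actual implementation.

```python
import xml.etree.ElementTree as ET

ORIGINAL = (b'<?xml version="1.0" encoding="utf-8"?>'
            b'<testelement testatr="xyz"></testelement>')


def strip_quotes(field):
    """Hypothetical stand-in for an extractor that consumes " as a quoting character."""
    return field.replace(b'"', b"")


def try_parse(xml_bytes):
    """Return (attribute value, None) on success or (None, error text) on failure."""
    try:
        return ET.fromstring(xml_bytes).get("testatr"), None
    except ET.ParseError as err:
        return None, str(err)


print(try_parse(ORIGINAL))                # parses; the attribute value is xyz
print(try_parse(strip_quotes(ORIGINAL)))  # fails at the malformed XML declaration
```

With the quotes left in place the document parses, which is the situation quoting:false is meant to restore.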

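If you use any of the built-in text extractors (Extractors.*), make sure that your XML documents do not contain any CR/LF, and no tab characters if you use .Tsv.

If your XML will contain CR and/or LF, you will have to use a custom extractor with a different row delimiter. If you need to do this, please leave me a comment, because I am currently tracking such requests to see what we can improve in the built-in extractors.

If your file contains only a single XML document (and not several rows of XML documents), I would suggest using the XML extractor, which is also part of the XML samples on GitHub.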

Source

2016-01-05 20:48:35

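Thanks Michael, I tried this approach but got an exception. Please see the updated question details. – Jamil

Thanks Jamil. I updated my answer based on your additional details. –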

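On the new error message: it looks like the XML document contains a CR, an LF, or both after the XML declaration, so the Tsv() extractor splits the XML document into several rows. See my comment in my previous answer:

If you use any of the built-in text extractors (Extractors.*), make sure that your XML documents do not contain any CR/LF, and no tab characters if you use .Tsv.

If your XML will contain CR and/or LF, you will have to use a custom extractor with a different row delimiter. If you need to do this, please leave me a comment, because I am currently tracking such requests to see what we can improve in the built-in extractors.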
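To see why a row-per-line extractor produces "Root element is missing", the Python sketch below splits the question's document at its line breaks, the way a line-oriented reader would, and tries to parse each row on its own. It is an illustration only, not the Tsv() extractor; the helper names are hypothetical, and the sample is assumed to use CR/LF line endings.

```python
import xml.etree.ElementTree as ET

# The question's document with its line breaks preserved (CR/LF assumed).
DOC = (b'<?xml version="1.0" encoding="utf-8"?>\r\n'
       b'<testelement testatr="xyz">\r\n'
       b'</testelement>')


def rows_as_line_extractor(data):
    """Split the input into rows at CR, LF or CR/LF, like a line-oriented reader."""
    return data.splitlines()


def parses(fragment):
    """Return True if the fragment is a well-formed XML document on its own."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


rows = rows_as_line_extractor(DOC)
# The first row is only the XML declaration, so it has no root element.
print([parses(row) for row in rows])

# Joining the rows removes the line breaks; the document then fits in one row.
flattened = b"".join(rows)
print(ET.fromstring(flattened).get("testatr"))
```

Removing the line breaks this way is only safe when they are insignificant whitespace, that is, when they sit between elements and not inside text content that must be preserved.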

Source

2016-01-06 22:51:26

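You are right, my XML contains CR/LF. – Jamil

So I guess that, for now, there is no way to get attribute values with the default extractors from XML that contains CR/LF, right? – Jamil

Correct. You should use the extractor provided in the samples library instead. Or remove the CR/LF (if they are what XML calls "insignificant whitespace"). –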
