2014-03-30 52 views
5

How do I parse all the names from XML using the conduit parser at http://hackage.haskell.org/package/xml-conduit-1.1.0.9/docs/Text-XML-Stream-Parse.html?

Here is what the (modified) XML looks like:

<?xml version="1.0" encoding="utf-8"?> 
<population xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://example.com"> 
    <success>true</success> 
    <row_count>2</row_count> 
    <summary> 
    <bananas>0</bananas> 
    </summary> 
    <people> 
     <person> 
      <firstname>Michael</firstname> 
      <age>25</age> 
     </person> 
     <person> 
      <firstname>Eliezer</firstname> 
      <age>2</age> 
     </person> 
    </people> 
</population> 

How do I get a list of every person's firstname and age?

My goal is to download this XML with http-conduit and then parse it, but I'm looking for a way to parse it without handling attributes (using tagNoAttr?).

Here is what I've tried so far; I've put my questions in the Haskell comments:

{-# LANGUAGE OverloadedStrings #-} 
import Control.Monad.Trans.Resource 
import Data.Conduit (($$)) 
import Data.Text (Text, unpack) 
import Text.XML.Stream.Parse 
import Control.Applicative ((<*)) 

data Person = Person Int Text 
     deriving Show 

-- Do I need to change the lambda function \age to something else to get both name and age? 
parsePerson = tagNoAttr "person" $ \age -> do 
     name <- content -- How do I get age from the content? "unpack" is for attributes 
     return $ Person age name 

parsePeople = tagNoAttr "people" $ many parsePerson 

-- This doesn't ignore the xmlns attributes 
parsePopulation = tagName "population" (optionalAttr "xmlns" <* ignoreAttrs) $ parsePeople 

main = do 
     people <- runResourceT $ 
      parseFile def "people2.xml" $$ parsePopulation 
     print people 
+1

Edited to add what I've tried so far, with comments – Lionel

Answers

8

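First: the stream-parsing combinators in xml-conduit have not been updated in a long time, and they are showing their age. I recommend that most people use the DOM or cursor interface instead. That said, let's look at your example. Your code has two problems: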

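  • It does not handle XML namespaces correctly. All the element names live in the http://example.com namespace, and your code needs to reflect that.
  • The parsing combinators require you to consume every element. They will not automatically skip elements for you.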
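
To make the first point concrete, here is a minimal sketch. It assumes the `Name` type from `Data.XML.Types` (the xml-types package that xml-conduit builds on), whose `IsString` instance reads Clark notation (`{namespace}local`). The helper `ex` is hypothetical, not part of any library; it only saves repeating the namespace.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Data.XML.Types (Name (..))

-- Hypothetical helper (not part of xml-conduit): build a Name in the
-- document's default namespace, http://example.com.
ex :: Text -> Name
ex local = Name local (Just "http://example.com") Nothing

main :: IO ()
main = do
    -- The Clark-notation literal and the helper denote the same Name.
    print (ex "person" == "{http://example.com}person")
    -- A bare literal carries no namespace, so it is a different Name.
    print (ex "person" == "person")
```

With a helper like this, `tagNoAttr (ex "person")` could stand in for `tagNoAttr "{http://example.com}person"`.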

So here is an implementation using the streaming API that produces the desired result:

{-# LANGUAGE OverloadedStrings #-} 
import   Control.Monad.Trans.Resource (runResourceT) 
import   Data.Conduit     (Consumer, ($$)) 
import   Data.Text     (Text) 
import   Data.Text.Read    (decimal) 
import   Data.XML.Types    (Event) 
import   Text.XML.Stream.Parse 

data Person = Person Int Text 
     deriving Show 

-- Parse one <person>. tagNoAttr takes no lambda argument here;
-- firstname and age are read as child elements, in document order.
parsePerson :: MonadThrow m => Consumer Event m (Maybe Person) 
parsePerson = tagNoAttr "{http://example.com}person" $ do 
     name <- force "firstname tag missing" $ tagNoAttr "{http://example.com}firstname" content 
     ageText <- force "age tag missing" $ tagNoAttr "{http://example.com}age" content 
     case decimal ageText of 
      Right (age, "") -> return $ Person age name 
      _ -> force "invalid age value" $ return Nothing 

parsePeople :: MonadThrow m => Consumer Event m [Person] 
parsePeople = force "no people tag" $ do 
    _ <- tagNoAttr "{http://example.com}success" content 
    _ <- tagNoAttr "{http://example.com}row_count" content 
    _ <- tagNoAttr "{http://example.com}summary" $ 
     tagNoAttr "{http://example.com}bananas" content 
    tagNoAttr "{http://example.com}people" $ many parsePerson 

-- ignoreAttrs accepts <population> whatever attributes it carries
parsePopulation :: MonadThrow m => Consumer Event m [Person] 
parsePopulation = force "population tag missing" $ 
    tagName "{http://example.com}population" ignoreAttrs $ \() -> parsePeople 

main :: IO() 
main = do 
     people <- runResourceT $ 
      parseFile def "people2.xml" $$ parsePopulation 
     print people 
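
The age handling above relies on `decimal` from `Data.Text.Read`, which returns the parsed number together with the unconsumed remainder of the input. Matching the remainder against `""` rejects trailing junk. A small standalone sketch of just that step (the name `parseAge` is made up for illustration):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Data.Text.Read (decimal)

-- Accept the text only if it is a decimal number with nothing left over.
parseAge :: Text -> Maybe Int
parseAge t = case decimal t of
    Right (age, "") -> Just age
    _ -> Nothing

main :: IO ()
main = mapM_ (print . parseAge) ["25", "2", "25 years", "abc"]
```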

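Below is an example using the cursor API. Note that it has different error-handling characteristics, but it should produce the same result for well-formed input.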

{-# LANGUAGE OverloadedStrings #-} 
import Text.XML 
import Text.XML.Cursor 
import Data.Text (Text) 
import Data.Text.Read (decimal) 
import Data.Monoid (mconcat) 

main :: IO() 
main = do 
    doc <- Text.XML.readFile def "people2.xml" 
    let cursor = fromDocument doc 
    print $ cursor $// element "{http://example.com}person" >=> parsePerson 

data Person = Person Int Text 
     deriving Show 

parsePerson :: Cursor -> [Person] 
parsePerson c = do 
    let name = c $/ element "{http://example.com}firstname" &/ content 
        ageText = c $/ element "{http://example.com}age" &/ content 
    case decimal $ mconcat ageText of 
     Right (age, "") -> [Person age $ mconcat name] 
     _ -> [] 
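
One way to see the difference in error handling: the cursor version above silently drops a person whose age does not parse, whereas the streaming version fails. If reporting the problem is preferred, a variant along these lines might work. This is only a sketch; `parsePersonEither` is a hypothetical name, and the inline document is made-up test input.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text as T
import Data.Text.Read (decimal)
import Text.XML (def, parseLBS_)
import Text.XML.Cursor

data Person = Person Int Text
    deriving (Show, Eq)

-- Hypothetical variant of parsePerson that reports a bad <age> instead of
-- silently dropping the person.
parsePersonEither :: Cursor -> Either String Person
parsePersonEither c =
    case decimal ageText of
        Right (age, "") -> Right (Person age name)
        _ -> Left ("invalid age for " ++ T.unpack name ++ ": " ++ show ageText)
  where
    name = T.concat (c $/ element "{http://example.com}firstname" &/ content)
    ageText = T.concat (c $/ element "{http://example.com}age" &/ content)

main :: IO ()
main = do
    -- Made-up input: the second person has a non-numeric age.
    let doc = parseLBS_ def
            "<people xmlns=\"http://example.com\">\
            \<person><firstname>Michael</firstname><age>25</age></person>\
            \<person><firstname>Eliezer</firstname><age>two</age></person>\
            \</people>"
        cursor = fromDocument doc
    mapM_ (print . parsePersonEither)
        (cursor $// element "{http://example.com}person")
```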
+0

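Thanks for showing both approaches! The cursor API looks simpler. If I use http-conduit to do a POST (which is how I get the XML), do I need to keep using xml-conduit, or can I use the cursor API? I'm using httpLbs (lazy ByteString) in http-conduit. – Lionel

+0

Well, you would still be using xml-conduit, since the cursor API is part of it. The most efficient approach is to use 'sinkDoc' together with 'http'. You can also take the simpler route and use 'httpLbs' if you prefer. –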
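
For the simpler 'httpLbs' route, the response body is a lazy ByteString, which `parseLBS` from `Text.XML` can turn into a document for the cursor API. A rough sketch, leaving out the HTTP call itself; `firstNames` and the sample bytes are made up for illustration, and `body` stands in for whatever the response body turns out to be:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Lazy as L
import Data.Text (Text)
import Text.XML (def, parseLBS)
import Text.XML.Cursor

-- Collect every <firstname> from a response body held as a lazy ByteString.
-- A parse failure is returned as Left instead of being thrown.
firstNames :: L.ByteString -> Either String [Text]
firstNames body = case parseLBS def body of
    Left err -> Left (show err)
    Right doc -> Right $
        fromDocument doc $// element "{http://example.com}firstname" &/ content

main :: IO ()
main = print (firstNames sample)
  where
    -- Stand-in for a downloaded response body.
    sample = "<population xmlns=\"http://example.com\"><people>\
             \<person><firstname>Michael</firstname><age>25</age></person>\
             \</people></population>"
```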

+0

Does the cursor API still avoid holding the entire XML structure in memory at once? Never mind, it seems so: http://stackoverflow.com/questions/29454267/how-to-use-the-xml-conduit-cursor-interface-for-information-extraction-from-a-la?rq=1 – unhammer