2011-10-24 60 views

Extracting HTML content using YQL?

Let's say I want to extract data from a web page with the following markup:

<table> 
    <tr>
        <td><a href="Link 1">Column 1 Text</a></td>
        <td>Column 2 Text</td>
        <td>Column 3 Text</td>
    </tr>
    <tr>
        <td><a href="Link 2">Column 1 Text</a></td>
        <td>Column 2 Text</td>
        <td>Column 3 Text</td>
    </tr>
    ... 
</table> 

and turn it into JSON in the following format:

[ 
    {
        link: 'Link 1',
        text: 'Column 1 Text',
        data: 'Column 3 Text'
    },
    {
        link: 'Link 2',
        text: 'Column 1 Text',
        data: 'Column 3 Text'
    }
] 

Can we do this with YQL? If so, please give me an example query.

Any help would be greatly appreciated!

Answer


Here is a query that makes a good starting point. It uses the html table together with an XPath expression (see Extracting HTML Content With XPath for more detail on this technique):

select * from html where url="http://cantoni.org/test/table.html" and xpath='//table/tr'

This produces a JSON result like the following:

{
    "query": {
        "count": 2,
        "created": "2012-01-06T20:16:46Z",
        "lang": "en-US",
        "results": {
            "tr": [
                {
                    "td": [
                        {
                            "a": {
                                "href": "Link%201",
                                "content": "Column 1 Text"
                            }
                        },
                        {
                            "p": "Column 2 Text"
                        },
                        {
                            "p": "Column 3 Text"
                        }
                    ]
                },
                {
                    "td": [
                        {
                            "a": {
                                "href": "Link%202",
                                "content": "Column 1 Text"
                            }
                        },
                        {
                            "p": "Column 2 Text"
                        },
                        {
                            "p": "Column 3 Text"
                        }
                    ]
                }
            ]
        }
    }
}
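
The result above is nested per row and cell rather than in the flat link/text/data shape the question asks for, so some post-processing on the client side is probably still needed. The following is a minimal sketch of that step in Python, not part of the original answer. The function name rows_to_records and the trimmed SAMPLE_RESPONSE string are hypothetical, and the code assumes every row has exactly three cells laid out as in the result above. The href values come back percent-encoded ("Link%201"), so they are decoded to match the requested output.

```python
import json
from urllib.parse import unquote

# Abbreviated copy of the response shown above ("created" and "lang" omitted).
SAMPLE_RESPONSE = """
{
  "query": {
    "count": 2,
    "results": {
      "tr": [
        {"td": [{"a": {"href": "Link%201", "content": "Column 1 Text"}},
                {"p": "Column 2 Text"},
                {"p": "Column 3 Text"}]},
        {"td": [{"a": {"href": "Link%202", "content": "Column 1 Text"}},
                {"p": "Column 2 Text"},
                {"p": "Column 3 Text"}]}
      ]
    }
  }
}
"""


def rows_to_records(response):
    """Flatten the nested tr/td structure into link/text/data records."""
    rows = response["query"]["results"]["tr"]
    if isinstance(rows, dict):
        # Defensive assumption: a single matching row may arrive as one
        # object instead of a one-element list.
        rows = [rows]

    records = []
    for row in rows:
        # Assumes exactly three cells per row, as in the sample markup.
        first, _second, third = row["td"]
        anchor = first["a"]
        records.append({
            "link": unquote(anchor["href"]),  # "Link%201" becomes "Link 1"
            "text": anchor["content"],
            "data": third["p"],
        })
    return records


records = rows_to_records(json.loads(SAMPLE_RESPONSE))
print(json.dumps(records, indent=2))
```

Rows with a different number of cells, or cells without an "a" or "p" key, would raise an error here, so real pages may need extra checks.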