2016-01-14 75 views
9

How do I parse an HTML table with Nokogiri?

I want to parse a table, but I don't know how to save the data from it. I want each row to be saved like this:

['Raw name 1', 2,094, 0,017, 0,098, 0,113, 0,452] 

Sample table:

html = <<EOT 
    <table class="open"> 
     <tr> 
      <th>Table name</th> 
      <th>Column name 1</th> 
      <th>Column name 2</th> 
      <th>Column name 3</th> 
      <th>Column name 4</th> 
      <th>Column name 5</th> 
     </tr> 
     <tr> 
      <th>Raw name 1</th> 
      <td>2,094</td> 
      <td>0,017</td> 
      <td>0,098</td> 
      <td>0,113</td> 
      <td>0,452</td>   
     </tr> 
     . 
     . 
     . 
     <tr> 
      <th>Raw name 5</th> 
      <td>2,094</td> 
      <td>0,017</td> 
      <td>0,098</td> 
      <td>0,113</td> 
      <td>0,452</td>   
     </tr> 
    </table> 
EOT 

My scraping code is:

doc = Nokogiri::HTML(open(html), nil, 'UTF-8') 
    tables = doc.css('div.open') 

    @tablesArray = [] 

    tables.each do |table| 
    title = table.css('tr[1] > th').text 
    cell_data = table.css('tr > td').text 
    raw_name = table.css('tr > th').text 
    @tablesArray << Table.new(cell_data, raw_name) 
    end 

    render template: 'scrape_krasecology' 
    end 
    end 

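When I try to display the data in the HTML page, it looks like all the column names are stored together in a single array element, and all the cell data is stored the same way.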

+1

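Please reduce your code to the minimum needed to demonstrate the problem. Provide a minimal HTML example *in the question itself* that also demonstrates the problem. Don't ask us to go to a page to extract the HTML, or to build the surrounding code needed to test yours. Read "[ask]", "[mcve]" and http://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/ –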

+0

@the Tin Man Thanks. I updated my code. I think it looks much better now? – verrom

+0

General information for anyone looking into this topic: http://ruby.bastardsbook.com/chapters/web-crawling/ – benjamin

Answers

11

The crux of the problem is that calling #text on multiple results returns the concatenation of the #text of each individual element.

Let's examine what each step does:

# Finds all <table>s with class open 
# I'm assuming you have only one <table> so 
# you don't actually have to loop through 
# all tables, instead you can just operate 
# on the first one. If that is not the case, 
# you can use a loop the way you did 
tables = doc.css('table.open') 

# The text of all <th>s in <tr> one in the table 
title = table.css('tr[1] > th').text 

# The text of all <td>s in all <tr>s in the table 
# You obviously wanted just the <td>s in one <tr> 
cell_data = table.css('tr > td').text 

# The text of all <th>s in all <tr>s in the table 
# You obviously wanted just the <th>s in one <tr> 
raw_name = table.css('tr > th').text 

Now that we know what is wrong, here is a possible solution:

html = <<EOT 
    <table class="open"> 
     <tr> 
      <th>Table name</th> 
      <th>Column name 1</th> 
      <th>Column name 2</th> 
      <th>Column name 3</th> 
      <th>Column name 4</th> 
      <th>Column name 5</th> 
     </tr> 
     <tr> 
      <th>Raw name 1</th> 
      <td>1001</td> 
      <td>1002</td> 
      <td>1003</td> 
      <td>1004</td> 
      <td>1005</td>   
     </tr> 
     <tr> 
      <th>Raw name 2</th> 
      <td>2001</td> 
      <td>2002</td> 
      <td>2003</td> 
      <td>2004</td> 
      <td>2005</td>   
     </tr> 
     <tr> 
      <th>Raw name 3</th> 
      <td>3001</td> 
      <td>3002</td> 
      <td>3003</td> 
      <td>3004</td> 
      <td>3005</td>   
     </tr> 
    </table> 
EOT 

require 'nokogiri'

doc = Nokogiri::HTML(html, nil, 'UTF-8')

# Fetches only the first <table>. If you have 
# more than one, you can loop the way you 
# originally did. 
table = doc.css('table.open').first 

# Fetches all rows (<tr>s) 
rows = table.css('tr') 

# The column names are the first row (shift returns 
# the first element and removes it from the array). 
# On that row we get the text of each individual <th> 
# This will be Table name, Column name 1, Column name 2... 
column_names = rows.shift.css('th').map(&:text) 

# On each of the remaining rows 
text_all_rows = rows.map do |row| 

    # We get the name (<th>) 
    # On the first row this will be Raw name 1 
    # on the second - Raw name 2, etc. 
    row_name = row.css('th').text 

    # We get the text of each individual value (<td>) 
    # On the first row this will be 1001, 1002, 1003... 
    # on the second - 2001, 2002, 2003... etc 
    row_values = row.css('td').map(&:text) 

    # We map the name, followed by all the values 
    [row_name, *row_values] 
end 

p column_names # => ["Table name", "Column name 1", "Column name 2", 
       #  "Column name 3", "Column name 4", "Column name 5"] 
p text_all_rows # => [["Raw name 1", "1001", "1002", "1003", "1004", "1005"], 
       #  ["Raw name 2", "2001", "2002", "2003", "2004", "2005"], 
       #  ["Raw name 3", "3001", "3002", "3003", "3004", "3005"]] 

# If you want to combine them 
text_all_rows.each do |row_as_text| 
    p column_names.zip(row_as_text).to_h 
end # => 
    # {"Table name"=>"Raw name 1", "Column name 1"=>"1001", "Column name 2"=>"1002", "Column name 3"=>"1003", "Column name 4"=>"1004", "Column name 5"=>"1005"} 
    # {"Table name"=>"Raw name 2", "Column name 1"=>"2001", "Column name 2"=>"2002", "Column name 3"=>"2003", "Column name 4"=>"2004", "Column name 5"=>"2005"} 
    # {"Table name"=>"Raw name 3", "Column name 1"=>"3001", "Column name 2"=>"3002", "Column name 3"=>"3003", "Column name 4"=>"3004", "Column name 5"=>"3005"} 
+0

Thanks, this helped! – verrom

2

Your desired output is not valid Ruby:

['Raw name 1', 2,094, 0,017, 0,098, 0,113, 0,452] 
# ~> -:1: Invalid octal digit 
# ~> ['Raw name 1', 2,094, 0,017, 0,098, 0,113, 0,452] 

I assume you want the numbers quoted.
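
The error comes from how Ruby reads the literal: every comma separates two array elements, and an integer literal with a leading zero is parsed as octal, so 094 is rejected because 9 is not an octal digit. A minimal sketch of the difference, with the cell values quoted as strings (the variable names are only illustrative):

```ruby
# A leading zero makes an integer literal octal: 017 is 1*8 + 7.
octal_value = 017
puts octal_value # prints 15

# Unquoted, the comma starts a new array element, so 0,017 is two integers.
unquoted = [0,017]
p unquoted # prints [0, 15]

# Quoting keeps every cell as the text that appeared in the table.
row = ['Raw name 1', '2,094', '0,017', '0,098', '0,113', '0,452']
p row.size # prints 6
```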

After stripping out the parts that keep the code from running, and reducing the HTML to a more manageable example, then running it:

require 'nokogiri' 

html = <<EOT 
    <table class="open"> 
     <tr> 
      <th>Table name</th> 
      <th>Column name 1</th> 
      <th>Column name 2</th> 
     </tr> 
     <tr> 
      <th>Raw name 1</th> 
      <td>2,094</td> 
      <td>0,017</td> 
     </tr> 
     <tr> 
      <th>Raw name 5</th> 
      <td>2,094</td> 
      <td>0,017</td> 
     </tr> 
    </table> 
EOT 


doc = Nokogiri::HTML(html) 
tables = doc.css('table.open') 

tables_data = [] 

tables.each do |table| 
    title = table.css('tr[1] > th').text # !> assigned but unused variable - title 
    cell_data = table.css('tr > td').text 
    raw_name = table.css('tr > th').text 
    tables_data << [cell_data, raw_name] 
end 

Which results in:

tables_data 
# => [["2,0940,0172,0940,017", 
#  "Table nameColumn name 1Column name 2Raw name 1Raw name 5"]] 

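The first thing to notice is that you're not using title, even though you assign to it. That can happen when you trim your code down into an example.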

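css, like search and xpath, returns a NodeSet, which is akin to an array of Nodes. When you call text or inner_text on a NodeSet, it returns the text of every node concatenated into a single string: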

Get the inner text of all contained Node objects.

This is how it behaves:

require 'nokogiri' 

doc = Nokogiri::HTML('<html><body><p>foo</p><p>bar</p></body></html>') 

doc.css('p').text # => "foobar" 

Instead, you should iterate over each node found and extract its text individually. This has been covered many times on SO:

doc.css('p').map{ |node| node.text } # => ["foo", "bar"] 

That can be simplified to:

doc.css('p').map(&:text) # => ["foo", "bar"] 

See also "How to avoid joining all text from Nodes when scraping".

The documentation says this about content, text and inner_text when called on a single Node:

Returns the content for this Node.

So you need to go after the text of the individual nodes:

require 'nokogiri' 

html = <<EOT 
    <table class="open"> 
     <tr> 
      <th>Table name</th> 
      <th>Column name 1</th> 
      <th>Column name 2</th> 
      <th>Column name 3</th> 
      <th>Column name 4</th> 
      <th>Column name 5</th> 
     </tr> 
     <tr> 
      <th>Raw name 1</th> 
      <td>2,094</td> 
      <td>0,017</td> 
      <td>0,098</td> 
      <td>0,113</td> 
      <td>0,452</td>   
     </tr> 
     <tr> 
      <th>Raw name 5</th> 
      <td>2,094</td> 
      <td>0,017</td> 
      <td>0,098</td> 
      <td>0,113</td> 
      <td>0,452</td>   
     </tr> 
    </table> 
EOT 


tables_data = [] 

doc = Nokogiri::HTML(html) 

doc.css('table.open').each do |table|

    # find all rows in the current table, then iterate from the second row through the last one...
    table.css('tr')[1..-1].each do |tr|

        # collect the raw name and the cell data from each remaining row...
        raw_name = tr.at('th').text
        cell_data = tr.css('td').map(&:text)

        # aggregate it...
        tables_data += [raw_name, cell_data]
    end
end

Which now results in:

tables_data 
# => ["Raw name 1", 
#  ["2,094", "0,017", "0,098", "0,113", "0,452"], 
#  "Raw name 5", 
#  ["2,094", "0,017", "0,098", "0,113", "0,452"]] 
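
Because += concatenates arrays, the result above is flat: row names and value arrays alternate. If looking rows up by name is more convenient, one possible follow-up is to pair them with each_slice. This sketch hard-codes the result shown above instead of re-parsing the HTML, and rows_by_name is only an illustrative name.

```ruby
# The flat result from above: name, values, name, values...
tables_data = [
  'Raw name 1', ['2,094', '0,017', '0,098', '0,113', '0,452'],
  'Raw name 5', ['2,094', '0,017', '0,098', '0,113', '0,452']
]

# each_slice(2) yields [name, values] pairs, which to_h turns into a hash.
rows_by_name = tables_data.each_slice(2).to_h

p rows_by_name.keys          # prints ["Raw name 1", "Raw name 5"]
p rows_by_name['Raw name 5'] # prints ["2,094", "0,017", "0,098", "0,113", "0,452"]
```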

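You can figure out how to coerce the quoted numbers into decimals that Ruby accepts, or how to manipulate the inner arrays however you want.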
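
For the coercion step, one option is to swap the decimal comma for a point before converting. This is a sketch that assumes the comma is always a decimal separator and never a thousands separator; to_decimal is a hypothetical helper, not a Nokogiri or Ruby built-in.

```ruby
# Hypothetical helper: converts text such as "2,094" into the Float 2.094.
# Kernel#Float raises ArgumentError for text that is not a valid number,
# which surfaces unexpected cell contents instead of silently returning 0.0.
def to_decimal(text)
  Float(text.strip.tr(',', '.'))
end

cell_data = ['2,094', '0,017', '0,098', '0,113', '0,452']
decimals = cell_data.map { |text| to_decimal(text) }

p decimals # prints [2.094, 0.017, 0.098, 0.113, 0.452]
```

Float is used here rather than String#to_f because "2,094".to_f stops at the comma and returns 2.0 without any warning.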

+0

Thank you very much for your answer and the explanation! The answer was very useful and helped me! – verrom

0

I assume you borrowed some code from here or from some other related reference (my apologies if this is the wrong reference) - http://quabr.com/34781600/ruby-nokogiri-parse-html-table

However, if you want to capture all the rows, you can change your code as follows. I hope this helps you solve your problem.

doc = Nokogiri::HTML(open(html), nil, 'UTF-8') 

# We need .open tr, because we want to capture all the columns from a specific table's row 

@tablesArray = doc.css('table.open tr').reduce([]) do |array, row| 
    # Collect the text of every cell in this row, which gives a result shaped like the one you illustrated,
    # i.e. ['Raw name 1', '2,094', '0,017', '0,098', '0,113', '0,452'] (the values are strings)
    array << row.css('th, td').map(&:text) 
end 

render template: 'scrape_krasecology' 

Best wishes