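I have a Ruby script that iterates over a list of projects. For each project, it walks an HTML table, collects the text of each row's td cells,
and adds it to an array. How do I handle blank array elements in the loop?
The problem is that when the table is empty for a particular project, an empty array gets added to my two-dimensional array, which then causes an error when I try to use that array to insert data into the SQL database. How can I prevent the empty array from being appended to my array in the first place?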
require 'nokogiri'
require 'open-uri'

projects.each do |project_id|
  url = "http://myurl.com/InventoryMaster.aspx?Qtr=%s&Client=%s" % [qtr, project_id[1]]
  page = Nokogiri::HTML(URI.open(url))
  table = page.at('my_table')
  rows = Array.new
  table.search('tr').each do |tr|
    cells = Array.new
    tr.search('td').each do |cell|
      cells.push(cell.text.gsub(/\r\n?/, "").strip)
    end
    # add the project id to the cells array, and get rid of other array elements I don't need.
    cells.insert(1, project_id[0])
    cells.slice!(11, 6)
    cells.delete_at(8)
    cells.delete_at(2)
    cells.delete_at(0)
    rows.push(cells)
  end
  # first row in the array in the html table is headers. get rid of those.
  rows.shift
  # last row in the html table is the footers. get rid of those too.
  rows.pop
  p rows
end
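One possible way to keep empty results out of the combined array, sketched with plain arrays so it runs without the page: skip a project as soon as its rows turn out to be empty. The names rows_by_project and all_rows and the sample values are hypothetical and only stand in for what the loop above produces.

```ruby
# Hypothetical per-project results: each value mimics the `rows` array built
# in the loop, after the header and footer rows have been dropped.
rows_by_project = {
  "P1" => [["P1", "a", "b"], ["P1", "c", "d"]],
  "P2" => [],                        # project whose table had no data rows
  "P3" => [["P3", "e", "f"]]
}

all_rows = []
rows_by_project.each do |_project_id, rows|
  # Skip projects with nothing to insert instead of passing [] downstream.
  next if rows.empty?

  # concat appends each row individually, so all_rows stays a flat
  # list of rows (a two-dimensional array).
  all_rows.concat(rows)
end

p all_rows
```

In the real loop the same guard would presumably be a single `next if rows.empty?` placed after `rows.pop`, before the rows are handed to the database code.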
Here is the HTML I'm parsing, as requested:
<table id="ctl00_MainContent_gvSearchResults" cellspacing="1" cellpadding="1"
border="1" style="color:Black;background-color:LightGoldenrodYellow;border-color:Tan;
border-width:1px;border-style:solid;" rules="cols">
  <caption></caption>
  <tbody>
    <tr style="background-color:Tan;font-weight:bold;">
      #I don't need the headers.
      <th scope="col"></th>
      <th scope="col"></th>
      <th scope="col"></th>
      <th scope="col"></th>
      <th scope="col"></th>
      <th scope="col"></th>
      <th scope="col"></th>
      <th scope="col"></th>
      <th scope="col"></th>
      <th scope="col"></th>
      <th scope="col"></th>
      <th scope="col"></th>
      <th scope="col"></th>
      <th scope="col"></th>
      <th scope="col"></th>
      <th scope="col"></th>
    </tr>
    <tr style="font-family:arial,tahoma;font-size:Smaller;">
      <td>not needed</td>
      <td>not needed</td>
      <td>needed</td>
      <td align="right">needed</td>
      <td>needed</td>
      <td>needed</td>
      <td>needed</td>
      <td>needed</td>
      <td>not needed</td>
      <td>needed</td>
      #I don't need any of the remaining td's in this row either.
      <td align="right"></td>
      <td align="right"></td>
      <td align="right"></td>
      <td align="right"></td>
      <td align="right"></td>
      <td></td>
    </tr>
    #this row is the footer, and it isn't needed either.
    <tr style="background-color:Tan;">
      <td></td>
      <td></td>
      <td></td>
      <td></td>
      <td></td>
      <td></td>
      <td></td>
      <td></td>
      <td></td>
      <td></td>
      <td></td>
      <td></td>
      <td></td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
  </tbody>
</table>
Once I've parsed the table, I need to add in the project ID, which is part of a key-value pair contained in the projects
array.
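A rough sketch of that last step, assuming each entry in projects is a two-element [id, client] pair and that the wanted columns are the ones marked "needed" in the sample markup. The sample values and the variable names project, cells, needed, and row are made up for illustration, and the indices would need adjusting to the real table.

```ruby
# Hypothetical stand-ins: one [project_id, client] pair as in the projects
# array, and the td texts of one data row (16 cells, as in the sample table).
project = ["1001", "ACME"]
cells = %w[skip0 skip1 c2 c3 c4 c5 c6 c7 skip8 c9 d10 d11 d12 d13 d14 d15]

# Zero-based positions of the cells marked "needed" in the sample markup
# (an assumption; adjust to the real columns).
needed = [2, 3, 4, 5, 6, 7, 9]

# values_at returns a new array and leaves cells untouched, so there is no
# index shifting to keep track of as with insert/slice!/delete_at.
row = [project[0], *cells.values_at(*needed)]

p row
```

Picking columns by position in one call may be easier to check against the table than a chain of destructive deletions, since every index refers to the original cell order.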
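Show some of the HTML so that your question is complete. With that, we can easily show you how to parse it correctly, rather than trying to clean it up afterwards. –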
After 'table = page.at('my_table')', add a check such as 'if table.children.size <= 1' (something that tests whether my_table is blank); that should skip empty tables. – bjhaid
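@Tian Man - I added my HTML table. I should have mentioned that the last 3 tds I need to parse are dates, which need to be parsed as mm-dd-yyyy. I just realized that I also have a problem with this script when the day part of the date is a single digit. – hyphen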
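For the date cells, one option is to parse the text with Date.strptime and format it back with zero padding. This is only a sketch: the helper name normalize_date is hypothetical, and the input pattern "%m/%d/%Y" is a guess at how the page prints dates, since the actual format isn't shown here.

```ruby
require 'date'

# Hypothetical helper: turns text such as "3/7/2014" into "03-07-2014".
# The default input pattern is an assumption about the page's date format.
def normalize_date(text, input_format = "%m/%d/%Y")
  text = text.to_s.strip
  return nil if text.empty?          # blank cell: nothing to convert

  # %m and %d accept one- or two-digit values when parsing; strftime
  # then zero-pads them on output.
  Date.strptime(text, input_format).strftime("%m-%d-%Y")
rescue ArgumentError
  nil                                # text did not match the pattern
end

p normalize_date("3/7/2014")
p normalize_date("12/25/2014")
p normalize_date("")
```

Returning nil for blank or unparseable text keeps the decision about what to store in the database (NULL, a default, or skipping the row) with the caller.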