I have an HTML file, bar_list.html, in the following format (only the part that needs to be parsed is shown):
<table>
<tbody>
<tr>
<th>A</th>
<th>B</th>
<th>C</th>
</tr>
<tr class="rowA">
<td><a href="www.google.com">test value for a</a></td>
<td>test value for b</td>
<td>test value for c</td>
</tr>
<tr class="rowB">
<td><a href="www.google.com">test value for a</a></td>
<td>test value for b</td>
<td>test value for c</td>
</tr>
</tbody>
</table>
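The goal is to use the Hpricot library to extract the values of the td cells under the tr tags whose class is rowA or rowB, and insert them into a database table.
In other words, I need to extract test value for a | test value for b | test value for c (three columns).
Could anyone who is experienced with regular expressions give me some pointers? Thanks!
Here is the Ruby script I wrote myself: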
require 'rubygems'
require 'open-uri'
require 'hpricot'

local_html_url = "bar_list.html"
doc = Hpricot( open(local_html_url) )
(doc/"tr[@class ~= '(rowA|rowB)']").each do |row|
  row.children.each do |column|
    puts column.inner_html
  end
end
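What you have here is not a regular expression but XPath. To get the text inside a cell you can use inner_text, so you could write it like this: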
[code="ruby"]
doc = Hpricot(open(local_html_url))
p (doc/"tr[@class='rowA' or @class='rowB']/td").map(&:inner_text)
[/code]
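
The one-liner above returns all cell texts as one flat list, while the question asks for three columns per row. Below is a minimal sketch of grouping the values row by row. It is an assumption-laden illustration, not the Hpricot way: it swaps in REXML from Ruby's standard distribution so that it runs without installing Hpricot, and that only works because the sample markup happens to be well-formed XML. The names `SAMPLE_HTML` and `extract_rows` are made up for this example.

```ruby
require 'rexml/document'

# Sample markup copied from the question. REXML is an XML parser, so this
# sketch assumes the input is well-formed; real-world HTML often is not.
SAMPLE_HTML = <<~MARKUP
  <table>
  <tbody>
  <tr>
  <th>A</th>
  <th>B</th>
  <th>C</th>
  </tr>
  <tr class="rowA">
  <td><a href="www.google.com">test value for a</a></td>
  <td>test value for b</td>
  <td>test value for c</td>
  </tr>
  <tr class="rowB">
  <td><a href="www.google.com">test value for a</a></td>
  <td>test value for b</td>
  <td>test value for c</td>
  </tr>
  </tbody>
  </table>
MARKUP

# Hypothetical helper: returns one array of cell texts per matching row.
def extract_rows(markup)
  doc = REXML::Document.new(markup)
  # Same XPath idea as the answer: rows whose class is rowA or rowB.
  rows = REXML::XPath.match(doc, "//tr[@class='rowA' or @class='rowB']")
  rows.map do |row|
    row.get_elements('td').map do |cell|
      # Join every descendant text node so the <a> wrapper is dropped,
      # roughly what inner_text does in Hpricot.
      REXML::XPath.match(cell, './/text()').map(&:value).join.strip
    end
  end
end

rows = extract_rows(SAMPLE_HTML)
rows.each { |cells| puts cells.join(' | ') }
```

Each inner array holds the three column values of one row, which is a convenient shape for building one database insert per row. The header row is skipped because its tr has no class attribute.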