Remembering Ikkyu: questions about parsing an HTML file with Hpricot

I have an HTML file, bar_list.html, in the following format (only the main part that needs parsing is shown):
<table>
<tbody>
    <tr>
      <th>A</th>
      <th>B</th>
      <th>C</th>
   </tr>
   <tr class="rowA">
      <td><a href="www.google.com">test value for a</a></td>
      <td>test value for b</td>
      <td>test value for c</td>
   </tr>
   <tr class="rowB">
      <td><a href="www.google.com">test value for a</a></td>
      <td>test value for b</td>
      <td>test value for c</td>
   </tr>
</tbody>
</table>

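The goal is to use the Hpricot library to extract the td values under the tr tags whose class is rowA or rowB, and insert them into a database table.
In other words, I need to extract test value for a | test value for b | test value for c (three columns).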

Could anyone who is experienced with regular expressions give me some pointers? Thanks!

Here is the Ruby script I wrote myself:



require 'rubygems'
require 'open-uri'
require 'hpricot'

local_html_url = "bar_list.html"
doc = Hpricot( open(local_html_url) )

(doc/"tr[@class ~= '(rowA|rowB)']").each do |row|
   row.children.each do |column|
     puts column.inner_html
   end
end

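The questions I ran into:
1. tr[@class ~= '(rowA|rowB)'] is meant to match every tr element whose class is rowA or rowB, but it matches nothing. Using tr[@class ~= 'rowA'] or tr[@class ~= 'rowB'] on its own does work.

2. The first <td> under each <tr> contains <a href="...">value</a>, and I only want the value, without the a tag.
I thought of something like '<a href="...">value</a>'.sub(/regexp/, ''), but I don't know how to write the regexp.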

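3. I need to get the three <td> values separately, so that each one can be inserted into its own column of the database table, but the script above produces the following: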
<a href="www.google.com">test value for a</a>
test value for b
test value for c
(The line breaks only appear because of puts; the three <td> values actually come back run together,
like this: <a href="www.google.com">test value for a</a>test value for btest value for c)

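What you have there is not a regular expression, it is XPath, so '(rowA|rowB)' is not treated as a pattern with alternatives. To get the inner value you can use inner_text. You could write it like this: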
[code="ruby"]
doc = Hpricot(open(local_html_url))
p (doc/"tr[@class='rowA' or @class='rowB']/td").map(&:inner_text)
[/code]
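
On doubt 3: the one-liner above gives back a single flat list of cell texts, so the values still need to be grouped per row before they can go into separate columns. Here is a minimal sketch of that grouping in plain Ruby. It assumes the selector returns the cell texts in document order, three per matched tr; the `cells` array is hard-coded as a stand-in so the snippet runs without Hpricot, and the column names and the `group_rows` / `strip_tags` helpers are made up for illustration. `strip_tags` is the kind of regexp asked about in doubt 2, but it is only a crude fallback for simple markup; `inner_text` is the cleaner route.

```ruby
# Hypothetical column names for the target table; rename to match the real schema.
COLUMNS = %w[col_a col_b col_c].freeze

# Group a flat list of cell texts into one hash per table row
# (every COLUMNS.size consecutive values form one row).
def group_rows(cells, columns = COLUMNS)
  cells.each_slice(columns.size).map { |values| columns.zip(values).to_h }
end

# Crude regexp fallback for doubt 2: remove anything that looks like a tag.
# Good enough for <a href="...">value</a>, not a general HTML parser.
def strip_tags(html)
  html.gsub(/<[^>]+>/, "")
end

# Stand-in for the flat list the selector is assumed to return.
cells = [
  "test value for a", "test value for b", "test value for c",
  "test value for a", "test value for b", "test value for c"
]

# One line per row, columns separated by " | ".
group_rows(cells).each { |row| puts row.values.join(" | ") }

puts strip_tags('<a href="www.google.com">test value for a</a>')
```

Each hash from `group_rows` maps a column name to its value, so it can be handed to whatever insert call your database library provides.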