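Parsing RSS feeds in Ruby: both RSS::Parser and hpricot have problems, looking for advice

RSS::Parser cannot parse feeds that do not conform to the spec.

When I parse an RSS feed with hpricot, I cannot read the value of link.

With the latest version, 0.8.1, I tested as follows:
[code="ruby"]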
doc = Hpricot("test")
puts doc
[/code]
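The output is test

After switching to version 0.7, the output is test

It seems only the link tag is affected; I have not found a problem with any other tag so far.

Has anyone run into this, and how did you solve it? Thanks.

Hmm, RSS is XML, so why not use Hpricot::XML("test")? My Hpricot is from the 0.6 series, so it is fairly old.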

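Handling RSS really does not call for a sledgehammer. REXML + XPath is enough and quite convenient, and it is part of the Ruby standard library.
Whether the feed is well-formed or not, the tags you want to extract are only a handful, and they can all be enumerated.
A few simple examples:
1.
[code="ruby"]
require 'rexml/document'
doc=REXML::Document.new("<a><link>test</link></a>")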
link=REXML::XPath.new(doc,"//a/link").text
puts link #test
[/code]
2. A slightly more complex one
[code="ruby"]
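# The xmlns:db URI below is an assumption; the values match the printed result.
doc=<<-RSS
<entry xmlns:db="http://www.douban.com/xmlns/">
<id>http://api.douban.com/people/2599975</id>
<title>Hooopo</title>
<link href="http://api.douban.com/people/2599975" rel="self"/>
<link href="http://www.douban.com/people/hooopo/" rel="alternate"/>
<link href="http://otho.douban.com/icon/u2599975-3.jpg" rel="icon"/>
<link href="http://://hooopo.blogspot.com/" rel="homepage"/>
<content>My Twitter：https://twitter.com/Hooopo
My Blog:http://Hooopo.blogspot.com
My Gtalk:wxz125627771@gmail.com</content>
<db:location>湖南长沙</db:location>
<db:uid>hooopo</db:uid>
</entry>
RSS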
require 'rexml/document'
require 'pp'

class People
  class << self
    # Names of the fields extracted from an entry.
    def attr_names
      [
        :id,
        :location,
        :title,
        :link,
        :content,
        :uid
      ]
    end
  end

  attr_names.each do |attr|
    attr_accessor attr
  end

  def initialize(feed)
    doc = REXML::Document.new(feed)
    # XPath.first returns nil when a tag is missing, so guard each lookup.
    id = REXML::XPath.first(doc, "//entry/id")
    @id = id.text if id
    content = REXML::XPath.first(doc, "//entry/content")
    @content = content.text if content
    title = REXML::XPath.first(doc, "//entry/title")
    @title = title.text if title
    location = REXML::XPath.first(doc, "//entry/db:location")
    @location = location.text if location
    uid = REXML::XPath.first(doc, "//entry/db:uid")
    @uid = uid.text if uid
    # Collect every link element into a hash keyed by its rel attribute.
    REXML::XPath.each(doc, "//entry/link") do |link|
      @link ||= {}
      @link[link.attributes['rel']] = link.attributes['href']
    end
  end
end

people=People.new(doc)
pp people
puts people.id
puts people.location
puts people.link["self"]

[/code]
Result:
[code="ruby"]
# @content=
"My Twitter\357\274\232https://twitter.com/Hooopo\nMy Blog:http://Hooopo.blogspot.com\nMy Gtalk:wxz125627771@gmail.com",
@id="http://api.douban.com/people/2599975",
@link=
{"self"=>"http://api.douban.com/people/2599975",
"icon"=>"http://otho.douban.com/icon/u2599975-3.jpg",
"alternate"=>"http://www.douban.com/people/hooopo/",
"homepage"=>"http://://hooopo.blogspot.com/"},
@location="\346\271\226\345\215\227\351\225\277\346\262\231",
@title="Hooopo",
@uid="hooopo">
http://api.douban.com/people/2599975
湖南长沙
http://api.douban.com/people/2599975

[/code]
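
The point above that the wanted tags are enumerable can be sketched as a loop over a fixed tag list. This is only a minimal sketch: the sample feed, the `fields` list and the `extract_item` helper are made up for illustration and are not part of the douban example.

```ruby
require 'rexml/document'

# Hypothetical sample feed, for illustration only.
feed = <<~XML
  <rss version="2.0">
    <channel>
      <title>Example</title>
      <item>
        <title>First post</title>
        <link>http://example.com/1</link>
      </item>
    </channel>
  </rss>
XML

# The tags we care about are a fixed, known list (assumed here).
fields = %w[title link description]

# Returns a hash of tag name => text; a missing tag maps to nil
# instead of raising NoMethodError.
def extract_item(item, fields)
  fields.each_with_object({}) do |name, result|
    node = REXML::XPath.first(item, name)
    result[name] = node && node.text
  end
end

doc = REXML::Document.new(feed)
items = REXML::XPath.match(doc, '//item').map { |item| extract_item(item, fields) }
p items
```

A feed that leaves out one of the tags still yields a hash, with nil for that tag, so the caller can decide what to do about the gap.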

A correction:
[quote]
require'rexml/document'
doc=REXML::Document.new("<a><link>test</link></a>")
link=REXML::XPath.new(doc,"//a/link").text
puts link #test

[/quote]
================>
[code="ruby"]
require'rexml/document'
doc=REXML::Document.new("<a><link>test</link></a>")
link=REXML::XPath.first(doc,"//a/link")
puts link.text #=>test
[/code]

I believe the OP only wanted Hpricot for convenience... and it is hardly a sledgehammer anyway.
[quote="汪兆铭"][code="ruby"]require'rexml/document'
doc=REXML::Document.new("<a><link>test</link></a>")
link=REXML::XPath.first(doc,"//a/link")
puts link.text #=>test[/code][/quote]
Look at what it becomes with Hpricot...
[code="ruby"]require 'hpricot'
doc = Hpricot::XML('<a><link>test</link></a>')
link = (doc/:link).first
puts link.inner_html #=> test[/code]
Or use the faster Nokogiri:
[code="ruby"]require 'nokogiri'
doc = Nokogiri.parse '<a><link>test</link></a>'
link = (doc/:link).first
puts link.text[/code]
Writing the queries this way is a bit more convenient than using the standard library's XPath ^ ^
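
For comparison, REXML also has a shorter spelling than `REXML::XPath.first`, which narrows the gap somewhat. A minimal sketch, assuming the same one-line sample document used elsewhere in this thread:

```ruby
require 'rexml/document'

# Assumed sample input, matching the //a/link path used in this thread.
doc = REXML::Document.new('<a><link>test</link></a>')

# elements[] accepts an XPath string and returns the first match, or nil.
link = doc.elements['//a/link']
puts link.text if link

# get_elements returns every match as an array.
links = doc.get_elements('//link')
puts links.size
```

This stays inside the standard library, so it may be worth trying before adding a gem just for the nicer syntax.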