Resurrecting feed_tools

Jul 23, 2008 by Sean Cribbs

Last week, a couple of unrelated sites I’ve worked on converged on the same problem: slow loading of data from RSS feeds. In one case, the site at times refused to load a feed at all, then at other times added a whole second to the page load time! The culprit, of course was feed_tools. While a very versatile and flexible library at parsing even ill-formed feeds, feed_tools is notorious for poor performance and huge memory usage. I was inspired by Charlie Savage’s resurrection of libxml-ruby to do the same for feed_tools, partly because it’s an old library (started in 2005) and because it would be an excuse to use the new hotness that is libxml-ruby.

To get a sense of my progression on this, follow my Twitter page. Here’s a few:

If FeedTools were to have a mental disorder, it would definitely be paranoid schizophrenia. Let’s hope I can be an anti-psychotic.

FeedTools’ largest problem is a lack of DRY (and poor Ruby style). The real intention of the code is lost in the repetition.

FeedTools uses ObjectSpace. EPIC FAIL

Confessions

If we’re going to give feed_tools a clean conscience, let’s first confess its sins. And believe me, they are many!

`ObjectSpace`

One of the things that immediately jumped out to me was the use of ObjectSpace. Any experienced Rubyist knows that ObjectSpace is a dangerous thing to play with and should not be used in performance-critical situations. feed_tools ignores that precept and uses it to find the parent feed of an individual item, on the premise that it will help with cleaning up dangling objects during garbage collection (which seems wrong to me):

# Returns the parent feed of this feed item
# Warning, this method may be slow if you have a
# large number of FeedTools::Feed objects. Can't
# use a direct reference to the parent because it plays
# havoc with the garbage collector. Could've used
# a WeakRef object, but really, if there are multiple
# parent feeds, something is going to go wrong, and the
# programmer needs to be notified. A WeakRef
# implementation can't detect this condition.
def feed
  parent_feed = nil
  ObjectSpace.each_object(FeedTools::Feed) do |feed|
    if feed.instance_variable_get("@entries").nil?
      feed.items
    end
    unsorted_items = feed.instance_variable_get("@entries")
    for item in unsorted_items
      if item.object_id == self.object_id
        if parent_feed.nil?
          parent_feed = feed
          break
        else
          raise "Multiple parent feeds found."
        end
      end
    end
  end
  return parent_feed
end

To add insult to injury, the result is not memoized, so every time the FeedItem object has to access its parent feed, you have to go through the ObjectSpace loop again! This is one place where I feel it should not be optimizing for memory over speed when the solution is very simple — set the parent feed on initialization or when adding to an existing feed.

`!object#nil?` vs. `object`

One thing that programmers new to Ruby don’t often grasp immediately is that nil and false are equivalent in conditional expressions and anything else (i.e. an object) is equivalent to true. This makes many typical scenarios where one might want to test if a method call returned a value, or returning anyone of several possible values much easier and clearer. feed_tools is littered with a paranoia about nil values. Here’s just one example:

@content = FeedTools::HtmlHelper.process_text_construct(content_node,
  self.feed_type, self.feed_version, [self.base_uri])
if self.feed_type == "atom" ||
    self.configurations[:always_strip_wrapper_elements]
  @content = FeedTools::HtmlHelper.strip_wrapper_element(@content)
end
if @content.nil?
  @content = self.media_text
end
if @content.nil?
  @content = self.itunes_summary
end
if @content.nil?
  @content = self.itunes_subtitle
end

Those last three if statements would obviously be more succinctly and clearly said like so (additionally without the ‘self’ fetish):

@content ||= media_text || itunes_summary || itunes_subtitle

Much cleaner! The original code is full of things like this.

Exception paranoia

Just like the obsession with #nil?, FeedTools is obsessed with rescue blocks. There is nothing inherently wrong with begin...rescue...end blocks, however FeedTools seems to use them willy-nilly and without specificity. An example:

begin
  if cache_object != nil && cache_object.serialized != nil
    # If we've got a cache hit, deserialize
    expired = true
    if cache_object.time_to_live == nil
      cache_object.time_to_live =
        feed_configurations[:default_ttl].to_i
      cache_object.save
    end
    if (cache_object.last_retrieved == nil)
      expired = true
    elsif (cache_object.time_to_live < 30.minutes)
      expired =
        (cache_object.last_retrieved + 30.minutes) < Time.now.gmtime
    else
      expired =
        (cache_object.last_retrieved + cache_object.time_to_live) <
          Time.now.gmtime
    end
    if !expired
      require 'yaml'
      deserialized_feed = YAML.load(cache_object.serialized)
      deserialized_feed.cache_object = cache_object
      Thread.pass
    end
  end
rescue Exception
end

The problem with the rescue this code snippet is that it reflects a lackadaisacal attitude toward what exception is thrown and by what statement.

This is just a preview of the greater architectural problems that plague FeedTools — inconsistent interfaces, broken encapsulation, misplaced responsibility, lack of division of labor, reified utility modules, an incredible amount of repetition, and a general ignorance of Ruby style and convention. If I went into all the details, this would be a much longer article than I want. So let’s pull out our Jump to Conclusions Mat and take a leap.

Going forward

As you may have found from the links above, I’ve created a github project for my refactoring of the library. You’ll want to pay attention to the ‘libxml’ branch where I’m doing most of the work. Here are my goals:

Decouple the XML-parsing framework from the feed-parsing, abstracting out the differences between libxml-ruby, Hpricot, and REXML.
Use meta-programming and good Ruby style to simplify and clarify the code.
Separate responsibilities into appropriate modules and classes.
Maintain a substantial amount of backwards-compatibility with FeedTools 0.2.x, with the exception of the internal API.
Maintain the ability to recognize and parse any feeds that FeedTools currently recognizes, using the existing test suite.
Improve the test suite by adding more focused tests on individual components.
Improve performance and reduce memory consumption using real numbers from ruby-prof and other appropriate tools.

Obviously the first three goals are the most significant on my list. I’d appreciate any feedback you can give me, either on the github project or via email.