Resurrecting feed_tools, part 2

(Follow-up on part 1)

One of the great and powerful things about Ruby is its ability to expressively extract common patterns through metaprogramming. Even the standard library in Ruby provides some good examples. One of my favorites is the generation of getter/setter methods for instance variables, the attr methods:

class Person
  attr_reader :name # adds 'name' method (getter)
  attr_writer :phone # adds 'phone=' method (setter)
  attr_accessor :address # adds both 'address' and 'address='
  attr :gender # same as attr_accessor
end

Using the metaprogramming techniques like this, we’re going to clean up a few things in feed_tools.
h2. Memoization

I don’t know if I’d go so far as to call it an anti-pattern, but what FeedTools does in many cases is definitely not DRY. Memoization is a pattern where a value is calculated only once and thereafter stored in an instance variable so it can be reused. Here’s the basic formula the original library follows:

def some_value
  if @some_value.nil?
    # do some calculation
    @some_value = value
  end
  @some_value
end

That’s an acceptable way to do it, although stylistically I would have done a few things differently (especially using unless @some_value rather than if...nil?). However, the code is littered with these blocks. It seems every single major attribute of the Feed class has one. Let’s clean that up with some metaprogramming. First, we’re going to reduce the method just to its calculation, making sure to return the value we want to store in the ivar.

def some_value
  # do some calculation
  value
end

Now that we have distilled the essence of the calculation, let’s inject some memoization at the class level. I’d like to make an attribute-like declaration like so (Incidentally, recent versions of Rails have this already):

memoize :some_value

So, I made a module like so:

require 'activesupport' # for alias_method_chain
module FeedTools
  module Memoize
    def memoize(*names)
      names.each do |name|
        class_eval %{
          def #{name}_with_memoize
            @#{name} ||= #{name}_without_memoize
          end
        }
        alias_method_chain name, :memoize
      end
    end
  end
end

Then we’ll extend the Feed class with that module:

class FeedTools::Feed
  extend FeedTools::Memoize
end

How does it work? Let’s step through it. Since we extended the Feed class with the module, memoize becomes a class/singleton method. This means we can call it directly in our class definition. Next, when we pass it any number of symbols or strings, it will iterate through each one, defining a method with class_eval and establishing a pair of aliases. So for :some_value, it will create a method that looks like this:

def some_value_with_memoize
  @some_value ||= some_value_without_memoize
end

One thing to note about this method is that I’ve simplified the if...nil? into the simpler “or-equals” or “lazy assignment” operator. The only disadvantage to this method is if you are trying to memoize something that calculates to false, it will be recalculated everytime you call the method. Since FeedTools mostly returns strings for things, this is not an issue.

The next line after the class_eval is alias_method_chain. This is a little goodie that comes with ActiveSupport, the source of much of the magic behind Rails. Essentially, it shortens these two lines into one:

alias_method "#{name}_without_memoize", name
alias_method name, "#{name}_with_memoize"

So our “with” method is injected in the place of the original method, but leaving a hook back into the original “without” method. Incidentally, this is one example of why Ruby doesn’t need the complexities of Java-like Dependency Injection solutions — it’s a no-brainer!

In the next installment, I’ll discuss more ways to clean up the FeedTools code.

Resurrecting feed_tools

Last week, a couple of unrelated sites I’ve worked on converged on the same problem: slow loading of data from RSS feeds. In one case, the site at times refused to load a feed at all, then at other times added a whole second to the page load time! The culprit, of course was feed_tools. While a very versatile and flexible library at parsing even ill-formed feeds, feed_tools is notorious for poor performance and huge memory usage. I was inspired by Charlie Savage’s resurrection of libxml-ruby to do the same for feed_tools, partly because it’s an old library (started in 2005) and because it would be an excuse to use the new hotness that is libxml-ruby.

To get a sense of my progression on this, follow my Twitter page. Here’s a few:

If FeedTools were to have a mental disorder, it would definitely be paranoid schizophrenia. Let’s hope I can be an anti-psychotic.

FeedTools’ largest problem is a lack of DRY (and poor Ruby style). The real intention of the code is lost in the repetition.

FeedTools uses ObjectSpace. EPIC FAIL
h2. Confessions

If we’re going to give feed_tools a clean conscience, let’s first confess its sins. And believe me, they are many!

ObjectSpace

One of the things that immediately jumped out to me was the use of ObjectSpace. Any experienced Rubyist knows that ObjectSpace is a dangerous thing to play with and should not be used in performance-critical situations. feed_tools ignores that precept and uses it to find the parent feed of an individual item, on the premise that it will help with cleaning up dangling objects during garbage collection (which seems wrong to me):


# Returns the parent feed of this feed item
# Warning, this method may be slow if you have a
# large number of FeedTools::Feed objects. Can't
# use a direct reference to the parent because it plays
# havoc with the garbage collector. Could've used
# a WeakRef object, but really, if there are multiple
# parent feeds, something is going to go wrong, and the
# programmer needs to be notified. A WeakRef
# implementation can't detect this condition.
def feed
  parent_feed = nil
  ObjectSpace.each_object(FeedTools::Feed) do |feed|
    if feed.instance_variable_get("@entries").nil?
      feed.items
    end
    unsorted_items = feed.instance_variable_get("@entries")
    for item in unsorted_items
      if item.object_id == self.object_id
        if parent_feed.nil?
          parent_feed = feed
          break
        else
          raise "Multiple parent feeds found."
        end
      end
    end
  end
  return parent_feed
end

To add insult to injury, the result is not memoized, so every time the FeedItem object has to access its parent feed, you have to go through the ObjectSpace loop again! This is one place where I feel it should not be optimizing for memory over speed when the solution is very simple — set the parent feed on initialization or when adding to an existing feed.

!object#nil? vs. object

One thing that programmers new to Ruby don’t often grasp immediately is that nil and false are equivalent in conditional expressions and anything else (i.e. an object) is equivalent to true. This makes many typical scenarios where one might want to test if a method call returned a value, or returning anyone of several possible values much easier and clearer. feed_tools is littered with a paranoia about nil values. Here’s just one example:


@content = FeedTools::HtmlHelper.process_text_construct(content_node,
  self.feed_type, self.feed_version, [self.base_uri])
if self.feed_type == "atom" ||
    self.configurations[:always_strip_wrapper_elements]
  @content = FeedTools::HtmlHelper.strip_wrapper_element(@content)
end
if @content.nil?
  @content = self.media_text
end
if @content.nil?
  @content = self.itunes_summary
end
if @content.nil?
  @content = self.itunes_subtitle
end

Those last three if statements would obviously be more succinctly and clearly said like so (additionally without the ‘self’ fetish):


@content ||= media_text || itunes_summary || itunes_subtitle

Much cleaner! The original code is full of things like this.

Exception paranoia

Just like the obsession with #nil?, FeedTools is obsessed with rescue blocks. There is nothing inherently wrong with begin...rescue...end blocks, however FeedTools seems to use them willy-nilly and without specificity. An example:


begin
  if cache_object != nil && cache_object.serialized != nil
    # If we've got a cache hit, deserialize
    expired = true
    if cache_object.time_to_live == nil
      cache_object.time_to_live =
        feed_configurations[:default_ttl].to_i
      cache_object.save
    end
    if (cache_object.last_retrieved == nil)
      expired = true
    elsif (cache_object.time_to_live < 30.minutes)
      expired =
        (cache_object.last_retrieved + 30.minutes) < Time.now.gmtime
    else
      expired =
        (cache_object.last_retrieved + cache_object.time_to_live) <
          Time.now.gmtime
    end
    if !expired
      require 'yaml'
      deserialized_feed = YAML.load(cache_object.serialized)
      deserialized_feed.cache_object = cache_object
      Thread.pass
    end
  end
rescue Exception
end

The problem with the rescue this code snippet is that it reflects a lackadaisacal attitude toward what exception is thrown and by what statement.

This is just a preview of the greater architectural problems that plague FeedTools — inconsistent interfaces, broken encapsulation, misplaced responsibility, lack of division of labor, reified utility modules, an incredible amount of repetition, and a general ignorance of Ruby style and convention. If I went into all the details, this would be a much longer article than I want. So let’s pull out our Jump to Conclusions Mat and take a leap.

Going forward

As you may have found from the links above, I’ve created a github project for my refactoring of the library. You’ll want to pay attention to the ‘libxml’ branch where I’m doing most of the work. Here are my goals:

  • Decouple the XML-parsing framework from the feed-parsing, abstracting out the differences between libxml-ruby, Hpricot, and REXML.
  • Use meta-programming and good Ruby style to simplify and clarify the code.
  • Separate responsibilities into appropriate modules and classes.
  • Maintain a substantial amount of backwards-compatibility with FeedTools 0.2.x, with the exception of the internal API.
  • Maintain the ability to recognize and parse any feeds that FeedTools currently recognizes, using the existing test suite.
  • Improve the test suite by adding more focused tests on individual components.
  • Improve performance and reduce memory consumption using real numbers from ruby-prof and other appropriate tools.

Obviously the first three goals are the most significant on my list. I’d appreciate any feedback you can give me, either on the github project or via email.