Thursday, August 13, 2009

debugging a memory leak in a Perl module

I needed to add a (small) wiki to the intranet web site I develop (in Perl) for work. Some customization and integration with the existing site was required, so I couldn't just drop in a standalone wiki package. But I also didn't want to roll my own from scratch. I settled on using Wiki::Toolkit from CPAN because it takes care of all the low-level details and includes an interface to SQLite, which I'm already using for a number of other purposes in the site.

A crucial requirement for this wiki is full-text searching. Wiki::Toolkit provides interfaces to three different search backends:
  • DBIx::FullTextSearch - Uses MySQL to index. I chose not to use this because I don't have a MySQL installation and because this backend doesn't provide fuzzy searching, which I would like to use.
  • Search::InvertedIndex - Can use different databases, including SQLite, but doesn't provide phrase searching, which I definitely need.
  • Plucene - A Perl interface to the Lucene search engine. Provides both fuzzy searching and phrase searching. I decided to use this.
I have since learned that this might not have been the best idea, due to performance concerns: http://perlbuzz.com/mechanix/2008/03/dont-use-plucene-for-real-work.html. Perhaps I should have tried KinoSearch, but I still would have needed to write a plugin for Wiki::Toolkit to incorporate it. And it turns out my wiki database is small enough that the Plucene performance hit isn't noticeable, so Plucene it is.

Except for one complication. Following is a condensed version of the learning process I went through in trying to resolve the complication. I've intentionally written this a bit pedantically to remind myself about the tools and concepts I learned about along the way.

Once I got the search function running, I began getting errors of the form "Too many open files". It turned out there was a filehandle leak somewhere in the Plucene modules. The filehandles were opened in Plucene::Store::InputStream and then never closed. Tracing the Plucene::Store::InputStream objects, I found that they were never being destroyed, hence the leak. Thus commenced a brute-force examination of the Plucene modules, working out which object contained which, so that I could eventually track down the one that wasn't being freed. This approach didn't get me very far.

Then, a breakthrough! Playing with the system, I realized that the leak only occurred when I was searching for multiple terms. I started overriding various Plucene library methods to produce stack traces at helpful places. This way I determined that Plucene constructs queries for single terms using Plucene::Search::TermQuery, but when there are multiple terms connected with AND / OR, it uses Plucene::Search::BooleanQuery. This narrowed down the list of suspects quite dramatically, and I confirmed the diagnosis using Adam Kennedy's Devel::Leak::Object to list the objects still alive when my test program (written to demonstrate the problem) exits:

Plucene::Index::FieldInfo 720
Plucene::Index::FieldInfos 120
Plucene::Index::FieldsReader 120
Plucene::Index::Norm 600
Plucene::Index::SegmentReader 120
Plucene::Index::SegmentTermDocs 240
Plucene::Index::SegmentTermEnum 120
Plucene::Index::SegmentsTermDocs 16
Plucene::Index::Term 3152
Plucene::Index::TermInfo 3016
Plucene::Index::TermInfosReader 120
Plucene::Search::BooleanScorer 8
Plucene::Search::BucketCollector 16
Plucene::Search::BucketTable 8
Plucene::Search::TermScorer 16
Plucene::Store::InputStream 960
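
The method-overriding trick I used to collect stack dumps looks roughly like the sketch below. It is not the code I actually ran against Plucene: My::Query, trace_calls and run_search are hypothetical names made up for illustration, and only the core Carp module is assumed.

```perl
use strict;
use warnings;
use Carp ();

# Hypothetical stand-in for a library class we cannot (or will not) edit.
package My::Query;
sub new    { my ($class, %args) = @_; return bless {%args}, $class }
sub scorer { my $self = shift; return "scorer for $self->{term}" }

package main;

our @TRACES;    # collected traces; a real script would print or log them

# Replace method $name in package $pkg with a wrapper that records a
# stack trace and then calls the original with the same arguments.
sub trace_calls {
    my ($pkg, $name) = @_;
    no strict 'refs';
    my $orig = \&{"${pkg}::${name}"};
    no warnings 'redefine';
    *{"${pkg}::${name}"} = sub {
        push @TRACES, Carp::longmess("${pkg}::${name} called");
        return $orig->(@_);
    };
    return;
}

trace_calls('My::Query', 'scorer');

sub run_search { return My::Query->new(term => 'perl')->scorer }

my $result = run_search();
print $TRACES[0];    # shows who called My::Query::scorer
```

Reading the recorded traces shows which code path reaches the wrapped method, which is how the TermQuery / BooleanQuery split became visible.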

Of particular interest is the fact that the test program executes the search in a fixed-length loop; in the case that produced this output there were 8 iterations, and there are (rather suggestively) 8 each of the Plucene::Search::{BooleanScorer,BucketTable} objects. Looking at the code, I found that the Plucene::Search::BooleanQuery object contains the BooleanScorer object, which contains the BucketTable object, which points back to the BooleanScorer object! Sure enough, the circular references are revealed using Lincoln Stein's Devel::Cycle:

Cycle (1):
    $Plucene::Search::BooleanScorer::A->{'bucket_table'} => \%Plucene::Search::BucketTable::B
    $Plucene::Search::BucketTable::B->{'scorer'} => \%Plucene::Search::BooleanScorer::A
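
To see why such a cycle keeps objects alive, here is a minimal sketch using only core Perl. My::Scorer and My::Table are hypothetical stand-ins for BooleanScorer and BucketTable, not the Plucene classes themselves. The weak-reference variant (Scalar::Util::weaken) is shown only for contrast; it is not the fix I ended up using.

```perl
use strict;
use warnings;
use Scalar::Util ();

our %LIVE;    # class name => number of objects not yet destroyed

package My::Table;     # hypothetical stand-in for BucketTable
sub new {
    my ($class, $scorer) = @_;
    $main::LIVE{$class}++;
    return bless { scorer => $scorer }, $class;
}
sub DESTROY { $main::LIVE{ ref shift }-- }

package My::Scorer;    # hypothetical stand-in for BooleanScorer
sub new {
    my ($class, %opt) = @_;
    my $self = bless {}, $class;
    $main::LIVE{$class}++;
    # The table points back at the scorer: this closes the cycle.
    $self->{bucket_table} = My::Table->new($self);
    # Optionally make the back-reference weak so it no longer counts.
    Scalar::Util::weaken($self->{bucket_table}{scorer}) if $opt{weak};
    return $self;
}
sub DESTROY { $main::LIVE{ ref shift }-- }

package main;

# Strong cycle: each scorer's reference count never reaches zero.
for (1 .. 8) { my $s = My::Scorer->new }
my $leaked_strong = $LIVE{'My::Scorer'};

# Weak back-reference: objects are freed as soon as $s goes out of scope.
for (1 .. 8) { my $s = My::Scorer->new(weak => 1) }
my $leaked_weak = $LIVE{'My::Scorer'} - $leaked_strong;

print "leaked with strong back-reference: $leaked_strong\n";    # 8
print "leaked with weak back-reference:   $leaked_weak\n";      # 0
```

Eight iterations leave eight scorers (and eight tables) behind, which matches the pattern in the Devel::Leak::Object counts.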

Knowing this, how do I fix it? I played with it a bit, attempting to figure out when the objects are supposed to be destroyed, and manually undefining them. But then I found this: http://www.perl.com/pub/a/2007/06/07/better-code-through-destruction.html. Object::Destroyer (Adam Kennedy again!) to the rescue!


# Plucene::Search::BooleanScorer has circular references that cause a
# memory leak in this persistent mod_perl setting. Wrap the constructor
# so that every scorer comes back inside an Object::Destroyer, which
# calls release() to break the cycle when the wrapper goes out of scope.
require Plucene::Search::BooleanScorer;
use Object::Destroyer;

# New method: cut both links of the BooleanScorer <-> BucketTable cycle.
*Plucene::Search::BooleanScorer::release = sub {
    my $self  = shift;
    my $table = $self->bucket_table or return;    # nothing to break
    $table->scorer(undef);
    $self->bucket_table(undef);
};

{
    # Replacing new() is deliberate, so silence "Subroutine redefined".
    no warnings 'redefine';
    my $old_PSBnew = \&Plucene::Search::BooleanScorer::new;
    *Plucene::Search::BooleanScorer::new = sub {
        my $result = $old_PSBnew->(@_);
        return Object::Destroyer->new($result, 'release');
    };
}

It is so convenient to be able to modify / insert Perl library methods this way so one doesn't have to edit the source or use local copies. Learning about Perl symbol tables has served me well.
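
For the curious, the idea behind the wrapper can be sketched in a few lines of core Perl. This is a simplified illustration of the guard-object concept, not Object::Destroyer's actual implementation; My::Node and My::Guard are hypothetical names.

```perl
use strict;
use warnings;

our $DESTROYED = 0;    # how many My::Node objects have been freed

package My::Node;      # hypothetical class holding a reference to itself
sub new {
    my $class = shift;
    my $self  = bless {}, $class;
    $self->{me} = $self;    # self-cycle: never freed on its own
    return $self;
}
sub name    { return 'node' }
sub release { my $self = shift; delete $self->{me}; return }
sub DESTROY { $main::DESTROYED++ }

package My::Guard;     # simplified stand-in for the wrapper idea
our $AUTOLOAD;
sub new {
    my ($class, $object, $method) = @_;
    return bless { object => $object, method => $method }, $class;
}
# Forward every other method call to the wrapped object.
sub AUTOLOAD {
    my $self = shift;
    (my $name = $AUTOLOAD) =~ s/.*:://;
    return $self->{object}->$name(@_);
}
# The guard has no cycle, so it is destroyed normally; on the way out
# it asks the wrapped object to break its own cycle.
sub DESTROY {
    my $self   = shift;
    my $method = $self->{method};
    $self->{object}->$method() if $self->{object};
}

package main;

{
    my $plain = My::Node->new;    # leaks when the block ends
}
my $after_plain = $DESTROYED;

my $delegated;
{
    my $guarded = My::Guard->new(My::Node->new, 'release');
    $delegated = $guarded->name;  # calls pass through to the node
}
my $after_guard = $DESTROYED;

print "freed without guard: $after_plain\n";    # 0
print "freed with guard:    $after_guard\n";    # 1
```

The calling code never has to remember to call release itself, which is what makes this approach a good fit for patching a library from the outside.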

I filed a bug report for Plucene, with an example patch to Plucene/Search/BooleanScorer.pm. I don't know how useful it will be, but I figured I should share what I've learned in case it's useful to someone.