Prefix substring matching for django haystack / solr

If you're using django haystack with Solr (both fantastic products) and you want your queries to also match word prefixes, e.g. a search for "foo" matches "foo", "foobar" and "food", then the following might be a solution.

Solr supports prefix substring matching through the EdgeNGramFilterFactory. Add it as the last filter of both the index and the query analyzer of the "text" fieldType in the generated schema.xml:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" />
  </analyzer>
</fieldType>
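To see why this gives prefix matching, it can help to look at what an edge n-gram filter emits for a single token. The following is only a rough Python sketch of the idea, not Solr's implementation: the function name edge_ngrams is made up for illustration, the defaults mirror the minGramSize="3" and maxGramSize="15" values above, and the earlier steps of the analyzer chain (lowercasing, stemming and so on) are ignored.

```python
def edge_ngrams(token, min_gram=3, max_gram=15):
    """Return the leading substrings of `token` with lengths min_gram..max_gram.

    Hypothetical helper for illustration; it only mimics the basic behaviour
    of an edge n-gram filter anchored at the front of the token.
    """
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


# Grams stored in the index for a document containing "foobar".
indexed = edge_ngrams("foobar")
print(indexed)  # ['foo', 'foob', 'fooba', 'foobar']

# The query term "foo" is one of those grams, so the document matches.
print("foo" in indexed)  # True
```

Tokens shorter than the minimum gram size produce no grams at all in this sketch, so very short search terms would not match anything. Also note that with the filter in the query analyzer as well, a longer query term such as "foobar" is itself split into "foo", "foob" and so on, which may match documents that only contain "food"; whether that is desirable depends on your use case.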

However, this manual change will be lost as soon as you regenerate your schema using "build_solr_schema". To work around this, you can customize the "solr.xml" template that haystack uses to generate schema.xml. Copy it from the django haystack source to templates/search_configuration/ in your project and add the two filter lines there. Make sure haystack is listed in INSTALLED_APPS after your own app, of course, so that your copy of the template is found first.
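As a sketch of that ordering, a settings.py fragment could look roughly like this. The app name "myproject.search" is hypothetical; it stands for whichever of your apps contains templates/search_configuration/solr.xml. The fragment also assumes templates are picked up from app directories in INSTALLED_APPS order.

```python
# settings.py (fragment)
INSTALLED_APPS = [
    "django.contrib.contenttypes",
    # Hypothetical app that holds templates/search_configuration/solr.xml.
    # It comes before haystack so its copy of the template is found first.
    "myproject.search",
    "haystack",
]

# After changing the template, regenerate the schema, for example with:
#   python manage.py build_solr_schema > schema.xml
```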

Last updated April 18, 2013, 4:35 p.m. | filed under python, solr, django | django match substring prefix haystack