Moving from FAST to Solr review

Abstract

This is an article about moving two big, complex collections of search documents from FAST to Solr, the problems encountered during the move, and how they were tackled.
These FAST FDS 4.1 collections were the company (yellow) and person (white) catalogues for Sesam.no in Norway.

Reasons for moving to Solr

We were going to move from one hosting provider to another. Moving the old FAST 4 installation over was not an option. Upgrading to FAST ESP 5 was an option, but it was also a good time to experiment with Solr. While FAST was really good for real-time search (e.g. news search with clustering), for big, static collections like our yellow and white indexes Solr looked like an ideal candidate.

Problems resolved

1. Document Processing

The new Sesat document processing framework, now released under the LGPLv3 license, is compatible with the FAST document processing framework. Porting FAST document processors is easily done by copying each file and running a script that fixes minor things like imports and some special fetching of config files. The pipeline configuration file (pipeline-config.xml) from the FAST installation can also be largely reused. It is not wholly supported in the new document processing framework, since some functionality can be taken out of the pipeline process and provided by Solr itself.
The most prominent examples of this are:

  • removing all references to proprietary FAST pipeline steps,
  • removing all tokenization, lemmatization and stemming, as these are handled by the field types in Solr (given you plan to use the default Solr tokenizers, and Snowball for lemmatization and stemming),
  • removing the default FAST-specific pipeline steps (RowsetXML, FASTXML, DocInit, FIXML), as they are no longer needed.

2. Data-import disadvantages

Solved with the Sesat document processing framework.

Not being able to use the data-import handler for various reasons, and not wanting to work out how to use it together with the homegrown document processing, we made a simple Django import handler. That took about 3-4 hours to complete. Copying the data-import handler functionality was relatively easy, and cron jobs were used to schedule the imports.
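
The import handler itself boils down to pulling rows from the database, running them through the pipeline, and posting batches to Solr's XML update handler. A minimal sketch of that idea in Python (not the actual Sesat code; the URL, batch handling and field names are assumptions):

# Minimal sketch of a homegrown import handler: build an <add> batch and POST it
# to Solr's XML update handler. URL and field handling are assumptions.
import urllib.request
from xml.sax.saxutils import escape

SOLR_UPDATE_URL = "http://localhost:8983/solr/update"  # assumed Solr core URL

def post_batch(docs):
    # docs: list of dicts mapping field name -> value or list of values
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            values = value if isinstance(value, list) else [value]
            for v in values:  # multiValued fields are simply repeated
                parts.append('<field name="%s">%s</field>' % (name, escape(str(v))))
        parts.append("</doc>")
    parts.append("</add>")
    body = "".join(parts).encode("utf-8")
    req = urllib.request.Request(SOLR_UPDATE_URL, data=body,
                                 headers={"Content-Type": "text/xml; charset=utf-8"})
    urllib.request.urlopen(req).read()

def commit():
    # issue a single commit at the end of the feed
    req = urllib.request.Request(SOLR_UPDATE_URL, data=b"<commit/>",
                                 headers={"Content-Type": "text/xml"})
    urllib.request.urlopen(req).read()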

3. Index-profile

Replicating FAST's index-profile was largely about rewriting it into Solr's schema.xml and solrconfig.xml.
The schema.xml covers a large part of what the index-profile does: field specification is done there, but so are the filter and tokenization parts of the FAST pipeline.

Most fields were easy to convert; here is a general mapping:

  • list with separator → multiValued field: the list needs to be built in the pipeline before adding to Solr.
  • int32 → sint: sortable integer field.
  • geo → no mapping: we haven't implemented geo in Solr.
  • composite fields → not needed: the dismax handler can mirror this functionality.
  • composite fields (alternative) → multiValued fields: use an AttributeCopy pipeline step to fill a multiValued field; this does not support ranking the matches on the different sub-fields inside the composite field differently.
  • rank profile → not needed: the dismax handler mirrors this functionality.
  • return specification → not needed: the dismax handler mirrors this functionality with the return fields (fl) for a qt.
  • tokenize → done in schema.xml: set the schema.xml field to a type that is tokenized the way you want.
  • lemmatize → Snowball on the field type: we did not end up using this because the lemmatizer wasn't good enough for Norwegian; it can be fixed by making a list-lookup lemmatizer for the pipeline or by configuring/coding a Snowball stemmer.
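
As an illustration of the first few mappings, the schema.xml entries could look something like this (field names are examples, not the production schema; sint is the sortable integer type from the stock example schema.xml):

<!-- FAST int32 -> sortable integer -->
<field name="iypnumemployees" type="sint" indexed="true" stored="true" />
<!-- FAST list-with-separator -> multiValued field, split into a list in the pipeline -->
<field name="iypcategories" type="string" indexed="true" stored="true" multiValued="true" />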

4. Hidden recall

To copy the FAST search functionality, the recall needs to be copied as well. In our FAST installation we made heavy use of composite-field searches where a lot of the sub-fields in the composite fields were not referenced in the rank-profile.
To solve this, you need to know all the normal fields inside the composite fields that are used in a search. These need to be added to the DismaxHandler as zero-ranked fields (so they add recall without contributing any rank). For example, for a field "mysubfield" add:

mysubfield^0

to the qf field in the DismaxHandler.
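
A minimal sketch of such a handler in solrconfig.xml; the handler name, the other fields and their boosts are illustrative, not the production configuration:

<requestHandler name="yellowsearch" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="defType">dismax</str>
        <!-- mysubfield contributes recall only; the zero boost keeps it out of the ranking -->
        <str name="qf">iypname^8 iypcategories^4 mysubfield^0</str>
    </lst>
</requestHandler>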

5. Phonetic matching (not solved)

The FAST FDS 4.1 installation used a soundex matcher configured for Norwegian phonetic matching. That functionality is not fully mirrored by Solr's phonetic matching, but a reasonably satisfactory replacement was RefinedSoundex with a special set of fields used just for the phonetic matching.

<filter class="solr.PhoneticFilterFactory" encoder="RefinedSoundex" inject="false"/>

The FAST phonetic matcher has a lot more configuration options, but I guess that this configuration could be mirrored by extending the Solr phonetic filter.

6. Geographical filtering

Geographical filtering in Solr is pretty easy. In FAST one could use a composite field with all the geographical location information as an AND clause in the search. In Solr, just add a multiValued field and copy the location information into it:

# in the fields definition add: 
<field name="iypcfgeo" type="text" indexed="true" stored="true" multiValued="true" />
...
<copyField source="iypstedsnavn" dest="iypcfgeo" />
<copyField source="iyplandsdel" dest="iypcfgeo" />
<copyField source="iypby" dest="iypcfgeo" />
<copyField source="iypbydel" dest="iypcfgeo" />
<copyField source="iypkommune" dest="iypcfgeo" />

This makes it possible to use the field as a filter in the query: "fq=iypcfgeo:Oslo" will return only hits in Oslo.
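
A full query against such a handler could then look like this (host and handler name are just examples):

http://localhost:8983/solr/select?qt=yellowsearch&q=pizza&fq=iypcfgeo:Oslo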

7. Direct match list

In the FAST installation we had composite fields that we searched by issuing ^querystring$ searches. It was difficult to find any way to do multiple-field search combined with the dismax: dismax just lists the fields to be searched, and the type of each field defines how it is searched.
By making the fields that need ^querystring$-type queries into multiValued string fields, this functionality is matched, since string fields are not tokenized and have no field transformations.

<field name="iypbransje" type="string"  indexed="true" store="true" multiValued="true"/>        

It might also be possible to solve this with the nested queries that were added in Solr 1.4, which also work with dismax.
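
Something along these lines might work (field names assumed), combining a dismax query with an exact clause on the string field:

q=_query_:"{!dismax qf='iypname iypinfo'}pizza oslo" AND iypbransje:"pizza restaurant"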

8. Adding navigators

Navigators in Solr are called facets. They need to be of type string in schema.xml, and they are defined either in the dismax configuration or in the query.
When I started moving the FAST navigators to Solr I looked in our index-profile. For each navigator, add a field in schema.xml like this:

<!-- navigator / facet fields -->
<field name="ywpoststednavigator" type="string" index="true" />
<field name="ywbydelnavigator" type="string" index="true" />

Normally the facet field is a normal string and can be added in the copyField section of schema.xml like this:

<!-- navigator / facet copying -->
<copyField source="ywpoststed" dest="ywpoststednavigator" />
<copyField source="ywbydel" dest="ywbydelnavigator" />

For a list field (a separator field in FAST), add an AttributeCopy step in the pipeline to do the copy, because I manually split all multiValued fields in the pipeline other than those populated by copyField in schema.xml, and fed them into a navigator field with type="string" indexed="true" multiValued="true".

In Solr it is possible to add facets to the dismax (they are normally connected to the dismax anyway: searching the whitepages should give a set of facets back). Some configuration can be added to the dismax to make that happen. At the bottom of the DismaxHandler, add something like this:

...
<int name="facet.mincount">1</int>
</lst>
<lst name="invariants">
   <str name="facet.field">ywpoststednavigator </str>
   <str name="facet.field">ywbydelnavigator</str>
</lst>
...

This approach makes it possible to ask Solr to return the default facets for the dismax by adding &facet=true to the query URL. You can still define your own facets at query time.
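
For example (handler name assumed), the following returns the configured navigators alongside the hits:

http://localhost:8983/solr/select?qt=whitesearch&q=hansen&facet=true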

9. Geographical searches

Searching by coordinates, i.e. mapping a coordinate area to the hits inside it.

This can be solved by mimicking the geographical coordinate search with bounding-box range ANDs.
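
A sketch of what that could look like, assuming the coordinates are indexed with the stock sortable-double (sdouble) type; the field names and coordinate values are examples:

# in the fields definition add:
<field name="iyplat" type="sdouble" indexed="true" stored="true" />
<field name="iyplng" type="sdouble" indexed="true" stored="true" />

# at query time, filter on a bounding box around the wanted area:
fq=iyplat:[59.80 TO 60.00]&fq=iyplng:[10.60 TO 10.90]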

10. Boundary matching

In FAST we had searches using the ^term$ syntax. This is not possible in the same way in Solr, but it can be approximated by making a copy of the field as a multiValued string field:

<field name="somefield" type="string" index="true" store="true" />

Strings in that field will only match if the whole string matches (no processing is done on string fields).
It's important to make all values in the list lower case and to only send in lower-case searches for matching to work correctly.

11. Name searching

Name searching is a bit hard. The solution made for our Solr index uses a dismax requestHandler that combines searches across phonetic, string, and customized text_name fields, ranking these with string match first, text_name second, and phonetic a lot lower. The text_name type uses the LetterTokenizer:

        <!-- name search field type to do real tokenization and word delimiting -->
        <fieldtype name="text_name" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.LetterTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.LetterTokenizerFactory"/>
                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            </analyzer>
        </fieldtype>
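
The request handler side is then just a dismax over the three variants of the name fields; the field names and boost values below are illustrative, not the production configuration:

<requestHandler name="namesearch" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="defType">dismax</str>
        <!-- exact string match ranks highest, tokenized name second, phonetic a lot lower -->
        <str name="qf">ywnavn_string^10 ywnavn_name^4 ywnavn_phon^0.1</str>
    </lst>
</requestHandler>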

12. Phonetic number matching

After implementing the solrconfig.xml and the dismaxes for all the types of searches we needed, a problem popped up. The dismax in question had a number of phonetic matching fields, and every time a number was searched for (with or without other alphanumeric query content) a disproportionate number of results was returned.

After debugging with the debugQuery=on option, I saw that the numbers were being searched against the phonetic fields as an empty string.

Using the LetterTokenizer for the phonetic field removes all the numbers on the way in, which solved the problem:

	<fieldtype name="text_phon" class="solr.TextField" positionIncrementGap="100">
		<analyzer type="index">
			<tokenizer class="solr.LetterTokenizerFactory"/>
	        <!-- tokenizer class="solr.WhitespaceTokenizerFactory"/-->
			<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
			<filter class="solr.LowerCaseFilterFactory"/>
			<filter class="solr.PhoneticFilterFactory" inject="false" encoder="RefinedSoundex" />
			<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
		</analyzer>
		<analyzer type="query">
			<tokenizer class="solr.LetterTokenizerFactory"/>
	                <!--tokenizer class="solr.WhitespaceTokenizerFactory"/-->
			<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
                        <filter class="solr.LowerCaseFilterFactory"/>
			<filter class="solr.PhoneticFilterFactory" encoder="RefinedSoundex" inject="false"/>
                        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
                </analyzer>		

	</fieldtype>

As the commented-out lines show, the WhitespaceTokenizer was used before, and it resulted in 30,000 results where there would otherwise have been only 3.

13. Replication

Replication was needed to make the index fault-tolerant across a number of servers in different locations. The replication scheme in Solr is very straightforward. The master has a configuration section in solrconfig.xml that describes actions related to replication. The slaves then have their own configuration sections in solrconfig.xml which describe how they should poll the master for replication.

Replication is described in detail here: http://wiki.apache.org/solr/CollectionDistribution
A sample master configuration in solrconfig.xml:

<requestHandler name="/replication" class="solr.ReplicationHandler" >
    <lst name="master">
	<str name="replicateAfter">optimize</str>
	<str name="replicateAfter">commit</str>
	<str name="replicateAfter">startup</str> <!-- for testing purposes only!! -->
        <str name="snapshot">optimize</str>
        <str name="confFiles">solrconfig_slave.xml:solrconfig.xml,schema.xml</str>
    </lst>
</requestHandler>

Here the master is set to make the index available for replication after optimize commands, commit commands, and at startup; the schema and the slave configuration are deployed to the slaves from the master.

<requestHandler name="/replication" class="solr.ReplicationHandler" >
    <lst name="slave">
        <str name="masterUrl">http://solr1.somemasterserver.no:8010/solr/replication</str>  
        <str name="pollInterval">00:02:00</str>  
        <str name="compression">internal</str>
        <str name="httpConnTimeout">5000</str>
        <str name="httpReadTimeout">10000</str>
     </lst>
</requestHandler>

Here the slave is polling the master every 2 minutes.

14. Deploy replication configuration

To deploy the replication configuration we decided to make two different solrconfig.xml files, "solrconfig_master.xml" and "solrconfig_slave.xml". On each server, solrconfig.xml is then a soft link to the appropriate one:

master:
cd conf/
ln -s solrconfig_master.xml solrconfig.xml

slave:
cd conf/
ln -s solrconfig_slave.xml solrconfig.xml

This has the problem that the soft link gets replaced on the slaves after a replication (the deployed solrconfig.xml overwrites it), which really is not such a problem since the slave ends up with the right configuration anyway.

15. Direct matching

To allow direct matching against commercially sold search words, multiValued string fields were used. The multiValued (list) string fields have no index-time or query-time transformation and either match or don't match. This allows for direct list matching on certain fields while giving broader hits on other, more generic fields.

We ended up lower-casing all queries before they hit the Solr index, and all direct-match fields are lower-cased in the pipeline before they make it into Solr.

Statistics

yellow first feed

First full run of the yellow feed to Solr using the Django+MySQL exporter and the homegrown document processing pipeline:
added batches of 10000 to Solr, with only one commit at the end of the full feed
added 1138544 docs in 11324 seconds (roughly 100 docs per second)
the commit took 10 minutes
no memory leaks, and no errors from the pipeline

white first feed

Setting up the solution to use a Django exporter with a MySQL dump of the whole white index (after adding an index to 3 columns) and configuring the pipeline took about 4 hours (I forgot to index the pk field, so troubleshooting that took another 1-2 hours).

while feeding: 188 documents per second

exporter_white.py output at the end of the first run:
added batches of 10000 to Solr and ran a loop over 10000 lazy fetches from the db
flush to Solr returned : True
commit to Solr returned : True
added 4016499 docs in 21498 seconds

So about 6 hours for the first run; it took Solr 2 minutes to commit the changes (which seemed a bit low).

Jul 2, 2009 7:57:41 PM org.apache.solr.search.SolrIndexSearcher <init>
INFO: Opening Searcher@11e96e0 main
Jul 2, 2009 7:57:41 PM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: end_commit_flush
...
Jul 2, 2009 7:57:42 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {commit=} 0 3460

Testing
