Moving from FAST to Solr
This is an article about moving two big complex collections of search documents from FAST to Solr, the problems encountered during the move, and how they were tackled.
These FAST FDS 4.1 collections were the company (yellow pages) and person (white pages) catalogues for Sesam.no in Norway.
We were going to move from one hosting provider to another. Moving the old FAST 4 installation over was not an option. Upgrading to FAST ESP 5 was an option, but it was also a good time to experiment with Solr. While FAST was really good for real-time search (e.g. news search with clustering), for big, mostly static collections like our yellow and white indexes Solr looked like an ideal candidate.
1. Document Processing
The new Sesat document processing framework, now released under the LGPLv3 license, is compatible with the FAST document processing framework. Porting FAST's document processors is easily done by copying each file and running a script to fix minor things like imports and some special fetching of config files. The pipeline configuration file (pipeline-config.xml) from the FAST installation can also be largely reused. It is not wholly supported in the new document processing framework, since some functionality can be taken out of the pipeline process and provided by Solr itself.
The most prominent examples of this are tokenization and lemmatization, which Solr can handle itself (see the field mapping table below).
2. Data-import disadvantages
These were solved with the Sesat document processing framework.
Not being able to use the DataImportHandler for various reasons, and not wanting to work out how to combine it with the homegrown document processing, we made a simple Django import handler. It took about 3-4 hours to complete. Copying the DataImportHandler functionality was relatively easy, and cron jobs were used to schedule the imports.
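The import handler itself is not shown here; as a minimal sketch (the endpoint URL and field names are assumptions, and the real handler was a Django view driven by cron), batching documents into Solr's XML update format looks something like this:

```python
# Sketch of a batch poster for Solr 1.x's XML update format.
# SOLR_UPDATE_URL and the field names are hypothetical.
import urllib.request
from xml.sax.saxutils import escape

SOLR_UPDATE_URL = "http://localhost:8983/solr/update"

def docs_to_add_xml(docs):
    """Render a list of {field: value} dicts as a Solr <add> document."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            values = value if isinstance(value, list) else [value]
            for v in values:  # multiValued fields repeat the <field> tag
                parts.append('<field name="%s">%s</field>' % (name, escape(str(v))))
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)

def post_batch(docs):
    """POST one batch (e.g. 10000 docs) to the Solr update handler."""
    req = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=docs_to_add_xml(docs).encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    return urllib.request.urlopen(req).read()
```

A separate `<commit/>` is posted once at the end of the full feed rather than per batch.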
Replicating FAST's index profile was largely a matter of rewriting it into Solr's schema.xml and solrconfig.xml.
The schema.xml covers a large amount of what the index-profile does: field specification is done there, but so are the filter and tokenization parts of the FAST pipeline.
Most fields were easy to convert; here is a general mapping:
|FAST field type||Solr field type||comment|
|list with separator||multiValued field||need to build the list in the pipeline before adding to Solr|
|int32||sint||sortable integer field|
|geo||no mapping||have not implemented geo in Solr|
|composite field||not needed||the dismax handler mirrors this functionality|
|composite field||multiValued field||use an AttributeCopy pipeline step to fill the multiValued field; does not support ranking matches in the different sub-fields of the composite field differently|
|rank profile||not needed||the dismax handler mirrors this functionality|
|return specification||not needed||the dismax handler mirrors this functionality with the return fields for a qt|
|tokenize||done in schema.xml||set the field in schema.xml to a type that is tokenized the way you want|
|lemmatize||snowball for the field type||did not end up using this because the lemmatizer was not good enough for Norwegian; could be fixed by making a list-lookup lemmatizer for the pipeline, or by configuring/extending snowball|
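As a sketch of the last two rows, a tokenized field type with the Norwegian snowball stemmer would be configured in schema.xml along these lines (the type name is an assumption; we did not end up using the stemmer):

```xml
<!-- Sketch: tokenized Norwegian text type with snowball stemming -->
<fieldType name="text_no" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Norwegian"/>
  </analyzer>
</fieldType>
```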
To copy the FAST search functionality, the recall needs to be copied. In our FAST installation we made heavy use of composite-field searches where a lot of the sub-fields in the composite fields were not referenced in the rank profile.
To solve this, you need control over all the normal fields inside the composite fields that are used in a search. These need to be added to the DismaxHandler as zero-boosted fields (so they contribute recall but no rank). For example, for a field "mysubfield" add:
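Presumably a zero-boost entry such as:

```
mysubfield^0
```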
to the qf field in the DismaxHandler.
The FAST FDS 4.1 installation used a soundex matcher configured for Norwegian phonetic matching. That functionality is not mirrored exactly by Solr's phonetic matching, but a satisfactory replacement was RefinedSoundex with a special set of fields used just for the phonetic matching.
The FAST phonetic matcher has a lot more configuration options, but I would guess that this configuration can be mirrored by extending the Soundex filter type.
Geographical filtering in Solr is pretty easy. In FAST one could use a composite field with all the geographical location information as an AND clause in the search. In Solr, just add a multiValued field and add the location information to it:
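A sketch of the field definition in schema.xml (the indexed/stored attribute values are assumptions):

```xml
<field name="iypcfgeo" type="string" indexed="true" stored="false" multiValued="true"/>
```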
This makes it possible to use the field as a filter in the query: "fq=iypcfgeo:Oslo" will return only hits in Oslo.
In the FAST installation we had composite fields that we searched by issuing ^querystring$ queries. It was difficult to find any way to do this kind of multiple-field search combined with dismax: dismax just lists the fields to be searched, and the type of each field defines how it is searched.
By making the fields that need ^querystring$-type queries into multiValued string fields, this functionality is matched, since a string field is not tokenized and has no field transformations.
It might be possible to solve this with the nested queries added in Solr 1.4, which also work with dismax.
Navigators in Solr are called facets. The facet fields need to be of type string in schema.xml, and the facets themselves are defined in the dismax configuration or in the query.
When I started moving the FAST navigators to Solr, I looked at our index-profile. For each navigator, add a field in schema.xml like this:
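For instance (the field name is hypothetical):

```xml
<field name="companytype_nav" type="string" indexed="true" stored="false" multiValued="true"/>
```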
Usually the facet field is a plain string and can be populated in the copyField section of schema.xml like this:
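For example, assuming a source field named companytype (both names hypothetical):

```xml
<copyField source="companytype" dest="companytype_nav"/>
```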
For a list field (separator field in FAST), add an AttributeCopy step in the pipeline to do the copy instead, because I manually split all multiValued fields in the pipeline other than those populated by copyField in schema.xml, and fed them to a navigator field with type="string" indexed="true" multiValued="true".
In Solr it is possible to add facets to the dismax (they are normally connected to the dismax anyway: searching the white pages should give a set of facets back). Some configuration can be added to the dismax to make that happen. At the bottom of the DismaxHandler, add something like this:
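A sketch of such a configuration (the facet field names here are hypothetical, and the other dismax defaults such as qf are omitted):

```xml
<lst name="defaults">
  <!-- existing dismax defaults (qf, fl, ...) go here too -->
  <str name="facet.mincount">1</str>
  <str name="facet.field">companytype_nav</str>
  <str name="facet.field">iypcfgeo</str>
</lst>
```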
This approach makes it possible to ask Solr to return the default facets for the dismax by adding &facet=true to the query URL. You can still define your own facets at query time.
We also needed to search by coordinates, mapping a coordinate area to the hits inside it. This can be solved by mimicking the geographical coordinate search with bounding-box ANDs.
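A sketch of such a bounding-box filter, assuming latitude and longitude are indexed in sortable numeric fields named lat and lng (hypothetical names and values):

```
fq=lat:[59.80 TO 60.00] AND lng:[10.60 TO 10.90]
```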
In FAST we also had searches using the ^term$ syntax. This is not possible in the same way in Solr, but it can be matched by making a copy of the field as a multiValued string field.
Strings in that list will only match if the whole string matches (no processing is done on string fields).
It is important to lower-case all values in the list and only send in lower-case searches to make it hit right.
Name searching is a bit hard. The solution made for our Solr index uses a dismax requestHandler that combines searches across phonetic, string, and customized text_name fields, ranking string matches first, text_name matches second, and phonetic matches a lot lower. The text_name type uses the LetterTokenizer.
After implementing the solrconfig.xml and the dismaxes for all the types of searches we needed, a problem popped up. The dismax in question had a number of phonetic matching fields, and every time a number was searched for (with or without other alphanumeric query content) a disproportionate number of results was returned.
After debugging with the debugQuery=on option, I saw that the numbers were being matched against the phonetic fields as empty strings.
Doing LetterTokenization for the phonetic fields removes all the numbers on the way in, and this solved the problem.
The WhitespaceTokenizer used before (left commented out in the field type) resulted in 30,000 results where there would otherwise be only 3.
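Reconstructed, the phonetic field type looks something like this (the type name and inject setting are assumptions), with the old whitespace tokenizer left commented out:

```xml
<fieldType name="phonetic_name" class="solr.TextField">
  <analyzer>
    <!-- <tokenizer class="solr.WhitespaceTokenizerFactory"/> let numbers through -->
    <tokenizer class="solr.LetterTokenizerFactory"/>
    <filter class="solr.PhoneticFilterFactory" encoder="RefinedSoundex" inject="false"/>
  </analyzer>
</fieldType>
```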
Replication was needed to make the index fault tolerant across a number of servers in different locations. The replication scheme in Solr is very straightforward: the master has a configuration section in solrconfig.xml that describes actions related to replication, and the slaves have their own sections in solrconfig.xml describing how they poll the master for replication.
Replication is described in detail here: http://wiki.apache.org/solr/CollectionDistribution
A sample master configuration in solrconfig.xml:
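A plausible reconstruction using Solr 1.4's ReplicationHandler (the exact file list is an assumption):

```xml
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">startup</str>
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">optimize</str>
    <str name="confFiles">schema.xml,solrconfig_slave.xml</str>
  </lst>
</requestHandler>
```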
Here replication is set to notify the slaves to copy the index on the optimize command, on commits, and at startup; the schema and the slave configuration are deployed to the slaves from the master.
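The slave side would look something like this (the master host is hypothetical):

```xml
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:02:00</str>
  </lst>
</requestHandler>
```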
Here the slave is polling the master every 2 minutes.
To deploy the replication configuration we decided to make two different solrconfig.xml files. "solrconfig_master.xml" and "solrconfig_slave.xml". These are then soft-linked on the different servers like this:
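Sketched out (demonstrated here in a scratch directory; on the real servers this runs in each core's conf/ directory):

```shell
# demo in a scratch directory instead of conf/
cd "$(mktemp -d)"
touch solrconfig_master.xml solrconfig_slave.xml   # both files are deployed everywhere
ln -s solrconfig_master.xml solrconfig.xml         # on the master; slaves link solrconfig_slave.xml
readlink solrconfig.xml                            # prints solrconfig_master.xml
```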
This has the problem that the soft-link is replaced on the slaves after a replication, which really is not such a problem since they end up with the right configuration.
To allow direct matching against commercially sold search words, the multiValued string fields were used. The multiValued (list) string fields have no index-time or query-time transformations and either match or do not. This allows direct list matching on certain fields while giving broader hits in other, more generic fields.
We ended up making all queries lower case before they hit the Solr index, and all direct-match fields are lower-cased in the pipeline before making it into Solr.
First full run of the yellow feed to Solr, using the django+mysql exporter and the homegrown document processing pipeline:
added batches of 10000 to Solr, only one commit at the end of the full feed.
added 1138544 docs in 11324 seconds
commit took 10 minutes
no memory leaks, and no errors from the pipeline.
Setting up the solution to use a django exporter and a mysql dump of the whole white index (with an index added to 3 columns), and configuring the pipeline, took about 4 hours (I forgot to index the pk field and troubleshot that for 1-2 hours).
While feeding: 188 documents per second.
exporter_white.py first run end print:
added batches of 10000 to Solr, running a loop over 10000 lazy fetches from the db
flush to Solr returned : True
commit to Solr returned : True
added 4016499 docs in 21498 seconds
So about 6 hours for the first run; it took Solr 2 minutes to commit the changes (which seemed a bit low).