I don’t know how often someone have to come across situation where one need to index data from database and files both in one index. One of our clients had this issue, they wanted to make data from files and related data from database both search able by Sphinx Search. Sphinx provides a native feature to import data from files by using xmlpipe as data source, but no feature to combine data from database and files at time.
The solutions for the problem was simple. We suggested to write a php or shell pre-indexing script. This script reads files from disk and dump the content in to a mysql table. Than we used this table in indexing as per normal. After index is complete we safely truncated this table. This was done by providing
TRUNCATE TABLE query in
The main challenge was reading so many files from disk and dumping into mysql. We found that a lot can be done to optimize and scale this process. The first thing we did was delta indexing, so only the updated and newly added content is dumped in mysql and only that is update in Sphinx index. As you may aware that delta indexing is another built-in feature in Sphinx Search. A lot can also be done to speed up dumping content into mysql. Also moving the files directory into some memory based file system can help scale further.