Sphinx Search Delta Index implementation

Delta indexing is a way of designing indexes so that only new or changed data is read from the database and indexed by Sphinx. Imagine a blog with 100 million posts. Posts are rarely updated and mostly only new ones are added, so there is no point in re-indexing everything from scratch again and again. Delta indexing allows us to index only the data which has been added, updated or deleted since the last indexing run. There are two methods by which we can implement a delta index in Sphinx.

  1. Conventional delta index and merge with main index
  2. Delta index via distributed index

Conventional delta index and merge with main index

This type of delta index is useful where old records are updated often and new ones are added. Here we have a main index which holds all the records; this is the index where searches are performed. As it will be too large to re-index very frequently, we can schedule it to re-index every week, every month or so.

The other index we will define is the "delta" index. It only indexes the recently modified or added records and is scheduled to run frequently. Every time the delta index is re-indexed it is merged with the main index, so added records become available in the main index and modified records get updated there.

To fetch only the modified records we need to record the time with each record, so we can keep track of when it was added or last updated. Also make sure to use soft deletes, so we know which records have been deleted and can remove them from the Sphinx index as well. First we need to define a main index for posts. For this index we record the indexing time in a separate table (sphinx_index_meta), in which index_name is a unique key. The recording is done via the sql_query_pre directive. The table holding this delta-index bookkeeping will look like this.
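A minimal sketch of this table, using the column names referenced elsewhere in this post (index_name, last_update, max_id); adjust types and engine to taste:

```sql
-- bookkeeping table for delta indexing; one row per index
CREATE TABLE sphinx_index_meta (
    index_name  VARCHAR(50)  NOT NULL,
    last_update INT UNSIGNED NOT NULL DEFAULT 0,  -- unix timestamp of the last main re-index
    max_id      INT UNSIGNED NOT NULL DEFAULT 0,  -- highest document id indexed (used with $maxid)
    UNIQUE KEY (index_name)
) ENGINE=InnoDB;
```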

The configuration will look like this
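A sketch of the main source and index follows. The database credentials, paths and the soft-delete column (is_deleted) are placeholders/assumptions; the wp_posts columns are the ones used throughout this post.

```
source src_posts_main
{
    type     = mysql
    sql_host = localhost
    sql_user = root
    sql_pass =
    sql_db   = blog

    # record the time indexing STARTS, so rows modified while the
    # main index is being rebuilt are still picked up by the delta
    sql_query_pre = REPLACE INTO sphinx_index_meta (index_name, last_update) \
        VALUES ('sph_idx_posts_main', UNIX_TIMESTAMP())

    sql_query = \
        SELECT ID, post_title, post_content, post_excerpt, \
            UNIX_TIMESTAMP(post_date) AS post_date, \
            UNIX_TIMESTAMP(post_modified) AS post_modified \
        FROM wp_posts \
        WHERE is_deleted = 0

    sql_attr_timestamp = post_date
    sql_attr_timestamp = post_modified
}

index sph_idx_posts_main
{
    source = src_posts_main
    path   = /var/data/sphinx/sph_idx_posts_main
}
```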

Here sql_query_pre records the time when indexing starts, so every time this index is re-indexed the time gets recorded in this table. This last_update value will be used by the delta index. The following command can be used to re-index this index.
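For example (the indexer path here matches the one used in the comments below; adjust it to your installation):

```sh
# build the main index; --rotate makes a running searchd
# pick up the new index files without a restart
/usr/local/sphinx/bin/indexer sph_idx_posts_main --rotate
```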

Now we need to define the delta index:
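A sketch of the delta source and index, inheriting from the main ones (again, the is_deleted soft-delete column and the paths are assumptions):

```
# everything not overridden here (db settings, sql_query_pre, attributes)
# is inherited from src_posts_main
source src_posts_delta : src_posts_main
{
    # fetch only rows added or modified since the last recorded run
    sql_query = \
        SELECT ID, post_title, post_content, post_excerpt, \
            UNIX_TIMESTAMP(post_date) AS post_date, \
            UNIX_TIMESTAMP(post_modified) AS post_modified \
        FROM wp_posts \
        WHERE is_deleted = 0 \
          AND UNIX_TIMESTAMP(post_modified) > \
              (SELECT last_update FROM sphinx_index_meta \
               WHERE index_name = 'sph_idx_posts_main')
}

index sph_idx_posts_delta : sph_idx_posts_main
{
    source = src_posts_delta
    path   = /var/data/sphinx/sph_idx_posts_delta
}
```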

Note that the rest of the details are inherited from the main index, including the sql_query_pre which updates last_update on each run. The following shell script re-indexes the delta and merges it with the main index.
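A sketch of that script, using the same indexer invocation shown in the comments below:

```sh
#!/bin/sh
# re-index the delta, then fold it into the main index;
# --rotate tells a running searchd to switch to the new files
/usr/local/sphinx/bin/indexer sph_idx_posts_delta --rotate && \
/usr/local/sphinx/bin/indexer --merge sph_idx_posts_main sph_idx_posts_delta --rotate
```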

That is it, the delta index is ready.

Delta index via distributed index

This is the second way to implement delta indexing in Sphinx. It is suitable when only new records are added and old records do not need to be updated (though that limitation can be worked around as well). Here a separate index (delta) is created which holds only the newly inserted records, while the other (main) index contains all the older records. Searches are performed on a distributed index which consists of these two indices (main + delta). The implementation is a bit simpler and there is no need to record ids or the time of the last indexing run. The major characteristics of this scheme are:

  • Main index gets updated once a day (you can change this to a week, a month or anything you like)
  • The delta index gets updated every hour (you can change this to a day, a week or anything you like)
  • Search queries are run against a distributed index, which makes the setup transparent to the application.
  • The delta will only contain the records from the current day (or any period you specify)
  • The main and delta indexes can have a few records in common, and that is not an issue.

The main index will look like this.
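A sketch, with placeholder database credentials and paths; a ranged fetch is used so the one-off full rebuild does not issue a single huge query against 100 million rows:

```
source src_posts_main
{
    type     = mysql
    sql_host = localhost
    sql_user = root
    sql_pass =
    sql_db   = blog

    # fetch documents in id ranges of 10000 rather than all at once
    sql_query_range = SELECT MIN(ID), MAX(ID) FROM wp_posts
    sql_range_step  = 10000

    sql_query = \
        SELECT ID, post_title, post_content, post_excerpt, \
            UNIX_TIMESTAMP(post_date) AS post_date \
        FROM wp_posts \
        WHERE ID BETWEEN $start AND $end

    sql_attr_timestamp = post_date
}

index sph_idx_posts_main
{
    source = src_posts_main
    path   = /var/data/sphinx/sph_idx_posts_main
}
```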

For simplicity, let's schedule this to be updated once a day, at the end or the start of the day. The delta index will look like this:
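A sketch of the delta, which only narrows the id range to rows created today (CURDATE()) and inherits everything else:

```
source src_posts_delta : src_posts_main
{
    # restrict the id range to posts created today; the inherited
    # sql_query (ID BETWEEN $start AND $end) then fetches only those
    sql_query_range = \
        SELECT MIN(ID), MAX(ID) FROM wp_posts \
        WHERE post_date >= CURDATE()
}

index sph_idx_posts_delta : sph_idx_posts_main
{
    source = src_posts_delta
    path   = /var/data/sphinx/sph_idx_posts_delta
}
```

Since BETWEEN is inclusive (as a commenter points out below), a boundary row may land in both indexes; that is exactly the harmless overlap mentioned in the characteristics above.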

The delta index will be scheduled to be updated every hour. It is a relatively small index, so re-indexing should be quick. Notice that all the properties are inherited from the main index; we only need to override sql_query_range, so that is what we did. Now we need a distributed index to which all search queries will be sent. Its configuration will look like this:
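A sketch of the distributed index (the name sph_idx_posts is an assumption; use whatever your application queries):

```
# the application searches this index only; searchd fans the query
# out to main and delta and merges the results
index sph_idx_posts
{
    type  = distributed
    local = sph_idx_posts_main
    local = sph_idx_posts_delta
}
```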

There is no need to schedule updates for the distributed index, as it does not need to be re-indexed; only the main and delta indexes need to be updated individually. This approach has the following caveats.

  • All the posts from the current day will be re-indexed every hour. If you have too many new posts per day you may need to reduce the frequency.
  • At times both indexes can contain common records, so the same record can exist in both indexes. This is normally not an issue, but if you always want results to come from the delta index first, use "index_weights" (look for this in the Sphinx Search manual).
  • Only new records get picked up by the delta; modified records do not. This can also be resolved, by modifying the sql_query in the delta index to fetch the posts which were added or modified that day, like so:
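    A sketch of such an override, reusing the wp_posts columns from above:

    ```sql
    -- fetch posts either created or modified today, so edits to
    -- older posts also make it into the delta
    sql_query = \
        SELECT ID, post_title, post_content, post_excerpt, \
            UNIX_TIMESTAMP(post_date) AS post_date \
        FROM wp_posts \
        WHERE post_date >= CURDATE() OR post_modified >= CURDATE()
    ```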

    In this case there is no need to have the sql_query_range. (If you need help in this regard contact us)

So this is how we can have delta index on Sphinx Search. Post comments if you have any issues implementing it and I will try to answer.

By Waseem Asif | July 27th, 2013 | 16 Comments

About the Author:

Waseem is our consultant for Sphinx Search, Elastic and Solr. He is a LAMP-stack backend developer with 14 years of development experience. His areas of interest (other than search) are NoSQL and Big Data tools.

16 Comments

  1. Chris September 24, 2014 at 5:12 pm - Reply

    I found this confusing at first – May be worth pointing out that BETWEEN…AND… is inclusive rather than exclusive: http://dev.mysql.com/doc/refman/5.5/en/comparison-operators.html#operator_between

    I believe this means that the delta index will end up re-indexing the latest table row on each run.

    If it was exclusive as I first feared, then this setup would miss rows.

    It being INCLUSIVE and re-indexing and extra row seems to me to be a fine compromise instead of making the configuration more complex!

  2. Bunty Roy January 2, 2015 at 9:50 am - Reply

    When running index I’m getting an error for the query in delta index
    “sql_query_range = SELECT max_id, (SELECT MAX(id) FROM wp_posts)
    FROM sphinx_index_meta m
    WHERE m.index_name = ‘sph_idx_posts_main’

    Unknown column ‘max_id’ ..

    Please suggest me something on this behalf.

    Thanks

    • Waseem Asif March 6, 2015 at 2:46 am - Reply

Sorry, there was a typo in the table definition and the "max_id" column was missing. Please see the sphinx_index_meta definition in the post; this will resolve your issue.
      Thanks

  3. Alex January 13, 2015 at 12:48 pm - Reply

    I created delta indexes using method #2 – one question now – how can I update the indexer when I have new posts (because I dont see new posts automatically)? Do I need to talk with my hosting company to run ‘indexer –all –rotate’ every day?

    Thank you!

    • Waseem Asif March 6, 2015 at 2:48 am - Reply

Yes, the delta index must also be rotated on a periodic basis. If you have a VPS or dedicated server you can set up a cronjob to do that. Otherwise you will need to talk to your hosting company.
      Thanks

  4. Ajeet Kumar April 27, 2015 at 5:57 am - Reply

    Hi,

    Thank you for great post. Am I new to sphinxsearch. But as I have seen your first method of implementation of delta index, in your configuration of source ‘src_posts_main’ you have only insert/update index_name and last_update in the sql_query_post_index option. Also in delta index you have only selected max_id from table sphinx_index_meta and max(id) from post table.

    My question is when you update the max_id of sphinx_index_meta table?

    Also what if I want to index the updated records along with newly added in the delta index.

    Please clarify.

    Regards
    Ajeet

    • Waseem Asif May 11, 2015 at 5:18 pm - Reply

The max_id can be updated by the following query in sql_query_post_index:
UPDATE sphinx_index_meta SET max_id = IF($maxid = 0, (SELECT MAX(id) FROM files), $maxid)
WHERE index_name = 'sph_idx_records'

      Here $maxid is set by sphinx just like $start and $end. I have fixed the post please check.
      Thanks

  5. Cris April 27, 2015 at 11:55 am - Reply

    Where do you update the max_id field on conventional step?

    • Waseem Asif May 10, 2015 at 5:11 am - Reply

      We don’t need to update max_id in conventional setup as it works on modified date. This modified date gets updated by this

      sql_query_post_index =
      REPLACE INTO sphinx_index_meta (index_name, last_update)
      VALUES ('sph_idx_posts_main', UNIX_TIMESTAMP());

The delta index will then query the records which have been updated after this date.

SELECT p.ID, p.post_title, p.post_content, p.post_excerpt,
p.post_date, p.post_modified
FROM wp_posts p
WHERE
p.post_modified > (SELECT last_update FROM sphinx_index_meta m
WHERE m.index_name = 'sph_idx_posts_main')

  6. Evgeny May 15, 2015 at 8:02 am - Reply

    I think you should update sphinx_index_meta in sql_query_pre.
    Records can be added while main index was reindexing.

    • Waseem Asif May 21, 2015 at 3:23 am - Reply

      It can be done in case we are not using $maxid. Updated the post. Thanks

  7. Vonn May 18, 2015 at 11:22 am - Reply

    Hello guys I’ved completed the guide above. I will use the 2nd solution. I would like to what index would i build?

    • Waseem Asif May 21, 2015 at 3:25 am - Reply

      You will need to have indices like sph_idx_posts_main, sph_idx_posts_delta and one distributed index.

  8. Hemant Kumar October 1, 2015 at 4:37 pm - Reply

    Hello, First of all I want you to appreciate for you artical, it’s help me alot. You explained each and every thing a very nice way. I had implemented this thing quite easy with your help but as the database has to be indexed frequently we have to add the indexer in the corn job. I did this but the merge function give me error. Can you give me an example or anything useful so that I can create cron job for my large 4 database table.
    Thanks

  9. Wouter Neuteboom June 1, 2016 at 8:06 am - Reply

    Shouldnt it be sql_query_post, else the updated time will always be the current date.

    Fromg sphinx: They are executed before the main fetch query, and they will be executed exactly in order of appearance in the configuration file. Pre-query results are ignored.

  10. isa yeter August 3, 2016 at 11:46 am - Reply

    Hello, thank you for your perfect documentation, I have a question, I’m using first method.

    I have setup a cronjob that is called every minute :

    /usr/local/sphinx/bin/indexer sph_idx_posts_delta –rotate &&
    /usr/local/sphinx/bin/indexer –merge sph_idx_posts_main sph_idx_posts_delta –rotate

    after indexing and merging , “sphinx_index_meta” table will never be updated with new unixtimestamp. So every time cronjob is called, same users will be affected again and again. I have solved it by adding this line to “source src_posts_delta : src_posts_main” section:

    sql_query_post = REPLACE INTO sphinx_index_meta (index_name, last_update) values (‘sph_idx_users_main’, unix_timestamp())

    Is it right ? You didnt mention it. Thank you.
