Sphinx Search Delta Index implementation

Delta indexing is way of designing indexes so only the updated data is read from db and indexed by Sphinx. Imagine a blog with 100 million posts. As posts are rarely updated and only new ones are added, so there is no point of indexing from scratch again and again. Delta indexing will allow us to only index data which is new, has been updated or deleted since last indexing. There are two methods by which we can implement delta index in Sphinx.

  1. Conventional delta index and merge with main index
  2. Delta index via distributed index

Conventional delta index and merge with main index

This type of delta index is useful where old records are updated often and new ones are added. In this we will have a main index which will hold all the records, this will be the index where search is performed. As it will be too large to index very frequently we can schedule it to re-index every week, every month or so. The other index which we will defined is “delta” index and will only index the recently modified or added records, and will be scheduled to run frequently. Every time delta index is re-indexed it will be merged with the main index so the added records become available in main and modified records get updated in main index. To fetch only the modified records we need to make sure that we record the time with each record to keep track when it was added or last updated. Also make sure to do soft delete for the records so we can know which records has been deleted and we get these removed from Sphinx index as well. First we need to define a main index for posts. In this index we need to record time of records which are being indexed, in a separate table (sphinx_index_meta). In this table index_name will be a unique key. This recording will be done via sql_query_pre feature. The table to hold information for delta index will look like this.

CREATE TABLE `sphinx_index_meta` (
	`index_name` VARCHAR(50) NOT NULL COMMENT 'Unique name of sphinx index',
        `max_id` INT(10) UNSIGNED NOT NULL COMMENT 'Max id of records indexed',
	`last_update` INT(10) UNSIGNED NOT NULL COMMENT 'Timestamp when last update',
	PRIMARY KEY (`index_name`)
)
COMMENT='Tracks the date and time for the table which is indexed in Sphinx. Updated by sphinx indexer.';

The configuration will look like this

# Source of main index
source src_posts_main
{
       type = mysql
       sql_host = localhost
# be sure that user here has insert rights on sphinx_index_meta
       sql_user = root
       sql_pass = dbpassword
       sql_db = datbasename

	sql_query_range	= SELECT MIN(ID),MAX(ID) FROM wp_posts
	sql_range_step	= 100000

	sql_query = \
		SELECT p.ID, p.post_title, p.post_content, p.post_excerpt, \
		p.post_date, p.post_modified \
		FROM wp_posts p \
		WHERE \
		p.ID BETWEEN $start AND $end
# this will update the time when indexing is completed
	sql_query_pre = \
		REPLACE INTO sphinx_index_meta (index_name, last_update) \
                VALUES ('sph_idx_posts_main', UNIX_TIMESTAMP());

	sql_attr_timestamp = post_date
        sql_attr_timestamp = post_modified
}

index sph_idx_posts_main
{
	source			= src_posts_main
	path			= /usr/local/sphinx/data/posts/main
}

Here sql_query_pre will record the time when indexing is starting. So now every time this index is re-indexed it will have time recorded in this table. This last_update will be used in delta index. Can use following command to re-index this index

/usr/local/sphinx/bin/indexer --config /etc/sphinx.conf sph_idx_posts_main --rotate

Now we need to define the delta index,

source src_posts_delta : src_posts_main
{
	# Select the max id of last indexed row
	sql_query = \
        SELECT p.ID, p.post_title, p.post_content, p.post_excerpt, \
        p.post_date, p.post_modified \
        FROM wp_posts p \
        WHERE \
        p.post_modified >= (select last_update FROM sphinx_index_meta m \
         WHERE m.index_name = 'sph_idx_posts_main' )

index sph_idx_posts_delta
{
	source			= src_posts_delta
	path			= /usr/local/sphinx/var/data/posts/delta
}

Note here that rest of the details will be inherited from main index, including the sql_query_pre which will update the last_update being indexed. Following shell script will be used to re-index the delta and merge with main index.

/usr/local/sphinx/bin/indexer --config /etc/sphinx.conf sph_idx_posts_delta --rotate
/usr/local/sphinx/bin/indexer --config /etc/sphinx.conf --merge sph_idx_posts_main sph_idx_posts_delta --rotate

That is it, the delta index is ready.

Delta index via distributed index

This is the 2nd way to implement the delta indexing in Sphinx. This is suitable if only new records are added and old records does not need to be updated (this issue can be solved as well). In this, a separate index (delta) is created which only have the newly inserted records and the other (main) index contains all the other older records. Search is performed on another distributed index which consist of these two indices (main + delta). The implementation is bit simpler and there is no need to record the ids and time of last index. The major characteristics of this scheme will be

  • Main index gets updated once a day (you can change to a week, month or anything you like)
  • The delta index gets updated every hour (you can change to day, week or anything you like)
  • The search query will be done on a distributed index. This will make it transparent to application.
  • The delta will only contain the records from current day (or any period you specify)
  • The main and delta can also contain few records in common and that is not an issue.

The main index will look like this.

# Source of main index
source src_posts_main
{
       type = mysql
       sql_host = localhost
       sql_user = root
       sql_pass = dbpassword
       sql_db = datbasename

	sql_query_range	= SELECT MIN(ID),MAX(ID) FROM wp_posts
	sql_range_step	= 100000

	sql_query = \
		SELECT p.ID, p.post_title, p.post_content, p.post_excerpt, \
		p.post_date, p.post_modified \
		FROM wp_posts p \
		WHERE \
		p.ID BETWEEN $start AND $end

	sql_attr_timestamp = post_date
        sql_attr_timestamp = post_modified
} 

index sph_idx_posts_main { 
       source = src_posts_main 
       path = /usr/local/sphinx/data/idx_posts/main
}

For simplicity lets schedule this to be update every day at end of the day or start of the day. The delta index will look like this

source src_posts_delta : src_posts_main
{
	# Select the max id of last indexed row
	sql_query_range	= SELECT MIN(ID),MAX(ID) FROM wp_posts p WHERE p.post_date => CURDATE()
}

index idx_posts_delta
{
	source			= src_posts_delta
	path			= /usr/local/sphinx/data/idx_posts/delta
}

The delta index will be scheduled to be updated every hour. It will be small index relatively so it should be quick. Notice here all the properties are inherited from main index. We only need to modify the sql_range_query here so we did that. Now we need to have a distributed index where all the search queries will be sent. Its configuration will look like

index idx_posts_distributed
{
	type = dist
	agent = localhost:9312:idx_posts_main
	agent = localhost:9312:idx_posts_delta
}

There is no need to schedule these update for distributed index as it don’t need to be updated. Only main and delta indexes needs to be updated individually. This approach has following cavities.

  • All the posts from the day will be re-indexed every hour. If you have too many new posts per day you may need to consider the reduce frequency.
  • At times both indexes can contain common records, so a same records can exists in both indexes. This is normally not an issue but you may always want to find records from delta index, for this use “index_weights” (look for this in Sphinx Search manual).
  • Only the new records get updated in delta and modified records don’t get updated in delta. This can be resolved as well by modifying the sql_query in delta index and fetch the posts which are added or modified in that day. Like
    SELECT p.ID, p.post_title, p.post_content, p.post_excerpt, \
    		p.post_date, p.post_modified \
    		FROM wp_posts p \
    		WHERE p.post_date => CURDATE() \
                    OR p.post_modified => CURDATE()

    In this case there is no need to have the sql_query_range. (If you need help in this regard contact us)

So this is how we can have delta index on Sphinx Search. Post comments if you have any issues implementing it and I will try to answer.

17 comments on “Sphinx Search Delta Index implementation

  1. I found this confusing at first – May be worth pointing out that BETWEEN…AND… is inclusive rather than exclusive: http://dev.mysql.com/doc/refman/5.5/en/comparison-operators.html#operator_between

    I believe this means that the delta index will end up re-indexing the latest table row on each run.

    If it was exclusive as I first feared, then this setup would miss rows.

    It being INCLUSIVE and re-indexing and extra row seems to me to be a fine compromise instead of making the configuration more complex!

  2. When running index I’m getting an error for the query in delta index
    “sql_query_range = SELECT max_id, (SELECT MAX(id) FROM wp_posts)
    FROM sphinx_index_meta m
    WHERE m.index_name = ‘sph_idx_posts_main’

    Unknown column ‘max_id’ ..

    Please suggest me something on this behalf.

    Thanks

    1. Sorry there was a typo in table definition and “max_id” column was missing. Please see the sphinx_index_meta in post. This is will resolve your issue.
      Thanks

  3. I created delta indexes using method #2 – one question now – how can I update the indexer when I have new posts (because I dont see new posts automatically)? Do I need to talk with my hosting company to run ‘indexer –all –rotate’ every day?

    Thank you!

    1. Yes, the delta index must also be rotated on periodic basis. If you have a vps or dedicated server you can setup a cronjob to do that. Other wise you will need to talk to hosting company.
      Thanks

  4. Hi,

    Thank you for great post. Am I new to sphinxsearch. But as I have seen your first method of implementation of delta index, in your configuration of source ‘src_posts_main’ you have only insert/update `index_name` and `last_update` in the `sql_query_post_index` option. Also in delta index you have only selected max_id from table `sphinx_index_meta` and max(id) from post table.

    My question is when you update the max_id of `sphinx_index_meta` table?

    Also what if I want to index the updated records along with newly added in the delta index.

    Please clarify.

    Regards
    Ajeet

    1. The max_id can be updated by following query in
      UPDATE sphinx_index_meta SET max_id = IF($maxid = 0,((SELECT MAX(id) FROM files)), $maxid)
      WHERE index_name = ‘sph_idx_records’

      Here $maxid is set by sphinx just like $start and $end. I have fixed the post please check.
      Thanks

    1. We don’t need to update max_id in conventional setup as it works on modified date. This modified date gets updated by this

      sql_query_post_index =
      REPLACE INTO sphinx_index_meta (index_name, last_update)
      VALUES ('sph_idx_posts_main', UNIX_TIMESTAMP());

      The delta index will then will query the results which have been updated after this date.

      SELECT p.ID, p.post_title, p.post_content, p.post_excerpt,
      p.post_date, p.post_modified
      FROM wp_posts p
      WHERE
      p.post_modified > (select FROM sphinx_index_meta m
      WHERE m.index_name = ‘sph_idx_posts_main’)

  5. Hello, First of all I want you to appreciate for you artical, it’s help me alot. You explained each and every thing a very nice way. I had implemented this thing quite easy with your help but as the database has to be indexed frequently we have to add the indexer in the corn job. I did this but the merge function give me error. Can you give me an example or anything useful so that I can create cron job for my large 4 database table.
    Thanks

  6. Shouldnt it be sql_query_post, else the updated time will always be the current date.

    Fromg sphinx: They are executed before the main fetch query, and they will be executed exactly in order of appearance in the configuration file. Pre-query results are ignored.

  7. Hello, thank you for your perfect documentation, I have a question, I’m using first method.

    I have setup a cronjob that is called every minute :

    /usr/local/sphinx/bin/indexer sph_idx_posts_delta –rotate &&
    /usr/local/sphinx/bin/indexer –merge sph_idx_posts_main sph_idx_posts_delta –rotate

    after indexing and merging , “sphinx_index_meta” table will never be updated with new unixtimestamp. So every time cronjob is called, same users will be affected again and again. I have solved it by adding this line to “source src_posts_delta : src_posts_main” section:

    sql_query_post = REPLACE INTO sphinx_index_meta (index_name, last_update) values (‘sph_idx_users_main’, unix_timestamp())

    Is it right ? You didnt mention it. Thank you.

  8. Hello, thanks for all of this info. Asking for your help as I am a Sphinx beginner need some direction. I currently index 25 columns in a very large mysql tbl (1bil rows). When I re-build this index I actually create index ‘chunks’ due to the size of the index. This works pretty well and I re-build this index weekly and it takes approx 24 hrs to complete. I want to implement Delta indexing as I do not want to rebuild the complete index every time. Can I add Delta index to my existing config or do I need to config an entire new index config in order to include Delta indexing? Thanks in advance.

Leave a Reply