Delta indexing is way of designing indexes so only the updated data is read from db and indexed by Sphinx. Imagine a blog with 100 million posts. As posts are rarely updated and only new ones are added, so there is no point of indexing from scratch again and again. Delta indexing will allow us to only index data which is new, has been updated or deleted since last indexing. There are two methods by which we can implement delta index in Sphinx.

  1. Conventional delta index and merge with main index
  2. Delta index via distributed index

Conventional delta index and merge with main index

This type of delta index is useful where old records are updated often and new ones are added. In this we will have a main index which will hold all the records, this will be the index where search is performed. As it will be too large to index very frequently we can schedule it to re-index every week, every month or so. The other index which we will defined is “delta” index and will only index the recently modified or added records, and will be scheduled to run frequently. Every time delta index is re-indexed it will be merged with the main index so the added records become available in main and modified records get updated in main index. To fetch only the modified records we need to make sure that we record the time with each record to keep track when it was added or last updated. Also make sure to do soft delete for the records so we can know which records has been deleted and we get these removed from Sphinx index as well. First we need to define a main index for posts. In this index we need to record time of records which are being indexed, in a separate table (sphinx_index_meta). In this table index_name will be a unique key. This recording will be done via sql_query_pre feature. The table to hold information for delta index will look like this.

CREATE TABLE `sphinx_index_meta` (
	`index_name` VARCHAR(50) NOT NULL COMMENT 'Unique name of sphinx index',
        `max_id` INT(10) UNSIGNED NOT NULL COMMENT 'Max id of records indexed',
	`last_update` INT(10) UNSIGNED NOT NULL COMMENT 'Timestamp when last update',
	PRIMARY KEY (`index_name`)
)
COMMENT='Tracks the date and time for the table which is indexed in Sphinx. Updated by sphinx indexer.';

The configuration will look like this

# Source of main index
source src_posts_main
{
       type = mysql
       sql_host = localhost
# be sure that user here has insert rights on sphinx_index_meta
       sql_user = root
       sql_pass = dbpassword
       sql_db = datbasename

	sql_query_range	= SELECT MIN(ID),MAX(ID) FROM wp_posts
	sql_range_step	= 100000

	sql_query = \
		SELECT p.ID, p.post_title, p.post_content, p.post_excerpt, \
		p.post_date, p.post_modified \
		FROM wp_posts p \
		WHERE \
		p.ID BETWEEN $start AND $end
# this will update the time when indexing is completed
	sql_query_pre = \
		REPLACE INTO sphinx_index_meta (index_name, last_update) \
                VALUES ('sph_idx_posts_main', UNIX_TIMESTAMP());

	sql_attr_timestamp = post_date
        sql_attr_timestamp = post_modified
}

index sph_idx_posts_main
{
	source			= src_posts_main
	path			= /usr/local/sphinx/data/posts/main
}

Here sql_query_pre will record the time when indexing is starting. So now every time this index is re-indexed it will have time recorded in this table. This last_update will be used in delta index. Can use following command to re-index this index

/usr/local/sphinx/bin/indexer --config /etc/sphinx.conf sph_idx_posts_main --rotate

Now we need to define the delta index,

source src_posts_delta : src_posts_main
{
	# Select the max id of last indexed row
	sql_query = \
        SELECT p.ID, p.post_title, p.post_content, p.post_excerpt, \
        p.post_date, p.post_modified \
        FROM wp_posts p \
        WHERE \
        p.post_modified >= (select last_update FROM sphinx_index_meta m \
         WHERE m.index_name = 'sph_idx_posts_main' )

index sph_idx_posts_delta
{
	source			= src_posts_delta
	path			= /usr/local/sphinx/var/data/posts/delta
}

Note here that rest of the details will be inherited from main index, including the sql_query_pre which will update the last_update being indexed. Following shell script will be used to re-index the delta and merge with main index.

/usr/local/sphinx/bin/indexer --config /etc/sphinx.conf sph_idx_posts_delta --rotate
/usr/local/sphinx/bin/indexer --config /etc/sphinx.conf --merge sph_idx_posts_main sph_idx_posts_delta --rotate

That is it, the delta index is ready.

Delta index via distributed index

This is the 2nd way to implement the delta indexing in Sphinx. This is suitable if only new records are added and old records does not need to be updated (this issue can be solved as well). In this, a separate index (delta) is created which only have the newly inserted records and the other (main) index contains all the other older records. Search is performed on another distributed index which consist of these two indices (main + delta). The implementation is bit simpler and there is no need to record the ids and time of last index. The major characteristics of this scheme will be

  • Main index gets updated once a day (you can change to a week, month or anything you like)
  • The delta index gets updated every hour (you can change to day, week or anything you like)
  • The search query will be done on a distributed index. This will make it transparent to application.
  • The delta will only contain the records from current day (or any period you specify)
  • The main and delta can also contain few records in common and that is not an issue.

The main index will look like this.

# Source of main index
source src_posts_main
{
       type = mysql
       sql_host = localhost
       sql_user = root
       sql_pass = dbpassword
       sql_db = datbasename

	sql_query_range	= SELECT MIN(ID),MAX(ID) FROM wp_posts
	sql_range_step	= 100000

	sql_query = \
		SELECT p.ID, p.post_title, p.post_content, p.post_excerpt, \
		p.post_date, p.post_modified \
		FROM wp_posts p \
		WHERE \
		p.ID BETWEEN $start AND $end

	sql_attr_timestamp = post_date
        sql_attr_timestamp = post_modified
} 

index sph_idx_posts_main { 
       source = src_posts_main 
       path = /usr/local/sphinx/data/idx_posts/main
}

For simplicity lets schedule this to be update every day at end of the day or start of the day. The delta index will look like this

source src_posts_delta : src_posts_main
{
	# Select the max id of last indexed row
	sql_query_range	= SELECT MIN(ID),MAX(ID) FROM wp_posts p WHERE p.post_date => CURDATE()
}

index idx_posts_delta
{
	source			= src_posts_delta
	path			= /usr/local/sphinx/data/idx_posts/delta
}

The delta index will be scheduled to be updated every hour. It will be small index relatively so it should be quick. Notice here all the properties are inherited from main index. We only need to modify the sql_range_query here so we did that. Now we need to have a distributed index where all the search queries will be sent. Its configuration will look like

index idx_posts_distributed
{
	type = dist
	agent = localhost:9312:idx_posts_main
	agent = localhost:9312:idx_posts_delta
}

There is no need to schedule these update for distributed index as it don’t need to be updated. Only main and delta indexes needs to be updated individually. This approach has following cavities.

  • All the posts from the day will be re-indexed every hour. If you have too many new posts per day you may need to consider the reduce frequency.
  • At times both indexes can contain common records, so a same records can exists in both indexes. This is normally not an issue but you may always want to find records from delta index, for this use “index_weights” (look for this in Sphinx Search manual).
  • Only the new records get updated in delta and modified records don’t get updated in delta. This can be resolved as well by modifying the sql_query in delta index and fetch the posts which are added or modified in that day. Like
    SELECT p.ID, p.post_title, p.post_content, p.post_excerpt, \
    		p.post_date, p.post_modified \
    		FROM wp_posts p \
    		WHERE p.post_date => CURDATE() \
                    OR p.post_modified => CURDATE()

    In this case there is no need to have the sql_query_range. (If you need help in this regard contact us)

So this is how we can have delta index on Sphinx Search. Post comments if you have any issues implementing it and I will try to answer.