Merging Two DSpace Solr-Based Data Sets Together

Have you ever messed up a DSpace upgrade and somehow ended up resetting your DSpace statistics? I did that. When we upgrade DSpace at A&M we perform a fresh install each time and then restore the data from the old instance into the new instance. This involves connecting the database, linking the asset store, and copying the DSpace log directory. We like to do it this way so that our configs are fresh each time. Our documented installation procedure lists the exact settings (about 5) that need to be touched for each production install. All other parameters in the dspace.cfg are maintained in our local SVN copy. This avoids the problem of never knowing exactly how your DSpace is configured, which is what happens if you follow the recommended upgrade procedure and modify the dspace.cfg with new parameters at each upgrade.

What happened?

Our documented upgrade procedure calls for us to copy the Solr directory from the old instance into the fresh instance. However, I mistyped the command, so the old copy of the Solr directory ended up inside the fresh copy of the Solr directory. We did not catch the error for a few weeks. As a result, the repository statistics page only showed stats from the upgrade onward; all the earlier months were zeroed out.

How to fix it?

If you find yourself in a similar predicament, you can recover. The important thing is that you have both your old statistics data and your new statistics data; you just need them combined into one data set. DSpace uses Solr (which is built upon Lucene) for storing statistics information. Because of this you have two basic approaches: one at the Solr level and the other at the Lucene level. Either way, the basic concept is that you need to merge the two indexes together. The Solr wiki describes both methods. At first I went down the Solr path, but I hit a roadblock early when I was unable to issue the Solr command to create a new core where the two merged indexes would reside. Then I tried the other option, and Lucene's IndexMergeTool worked well. Here are the steps I followed to restore statistics.
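
For reference, the Solr-level route boils down to a CoreAdmin "mergeindexes" request against a target core that already exists. The sketch below only builds that request URL. I did not get this route working myself, so treat it as an illustration of the approach described on the Solr wiki: the helper name merge_url, the base URL, and the core name "merged" are all placeholders I made up.

```shell
# Hypothetical helper: print a Solr CoreAdmin "mergeindexes" URL.
#   $1        base URL of the Solr webapp (placeholder, adjust to your install)
#   $2        name of the target core (assumed to exist already)
#   $3...     paths of the index directories to merge into that core
merge_url() {
    base=$1
    core=$2
    shift 2
    url="$base/admin/cores?action=mergeindexes&core=$core"
    # Each source index is passed as its own indexDir parameter.
    for index_dir in "$@"; do
        url="$url&indexDir=$index_dir"
    done
    printf '%s\n' "$url"
}
```

You would then request the printed URL (for example with curl) and issue a commit on the target core. Creating that target core was exactly the part I could not get past, which is why the rest of this post uses the Lucene route instead.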

Step by Step Instructions

  1. Identify DSpace’s copy of the Lucene libraries: lucene-core and lucene-misc.

    You will find these inside DSpace’s solr webapp: <solr webapp>/WEB-INF/lib/lucene-*.jar. For DSpace 1.7.x the version of these libraries was 2.9.3. It is important that you use the same Lucene version that wrote the original indexes. We’ll use both of these paths below in step 4 as <path to lucene-core jar> and <path to lucene-misc jar>.

  2. Identify both Lucene indexes.

    Typically the solr-based statistics index is stored inside the DSpace install directory: <dspace directory>/solr/statistics/data/index. Inside the directory you should see at least one “.cfs” file along with a “segments.gen” file. The other copy will likely come from a backup copy of DSpace you have stashed away somewhere. We’ll use both of these paths below as /path/to/oldindex1 and /path/to/oldindex2.

  3. Shut down your DSpace instance.

    The merge tool requires that all the indexes it is reading be closed, so while the merge is running you cannot record any new statistics.

  4. Run the Lucene merge tool.

    java -cp <path to lucene-core jar>:<path to lucene-misc jar> \
        org.apache.lucene.misc.IndexMergeTool /path/to/newindex \
        /path/to/oldindex1 /path/to/oldindex2
    

    The command above will merge both old indexes into the new index. It takes about as long as copying both indexes would.

  5. Restore the combined index.

    mv <dspace directory>/solr/statistics/data/index /path/to/someplace/safe
    cp -r /path/to/newindex   <dspace directory>/solr/statistics/data/index
    
  6. Restart DSpace and check the statistics.

    Hopefully everything works and you have a full set of statistics. Let others know that it worked in the comments.
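
If you want a safety net before running the merge in step 4, the checks from step 2 can be scripted. Here is a minimal sketch, assuming your indexes look the way step 2 describes (a “segments.gen” file plus at least one “.cfs” file). The function names are my own and not part of DSpace, Solr, or Lucene, and insisting on a new or empty target directory is simply a precaution on my part, not something the merge tool documents.

```shell
# Hypothetical pre-flight helpers for the merge in step 4.

# Succeeds if the directory looks like the index described in step 2:
# it holds a segments.gen file and at least one .cfs file.
looks_like_lucene_index() {
    dir=$1
    [ -d "$dir" ] || return 1
    [ -f "$dir/segments.gen" ] || return 1
    for f in "$dir"/*.cfs; do
        # If the glob matched nothing, $f is the literal pattern and -f fails.
        [ -f "$f" ] && return 0
    done
    return 1
}

# Succeeds if the merge target does not exist yet or is an empty directory,
# so the merge cannot clobber anything you care about.
target_is_empty() {
    dir=$1
    [ ! -e "$dir" ] && return 0
    [ -d "$dir" ] || return 1
    [ -z "$(ls -A "$dir")" ]
}

# Usage: preflight /path/to/newindex /path/to/oldindex1 /path/to/oldindex2
# Prints every problem it finds to stderr; returns non-zero if there was any.
preflight() {
    target=$1
    shift
    rc=0
    if ! target_is_empty "$target"; then
        echo "target exists and is not empty: $target" >&2
        rc=1
    fi
    for src in "$@"; do
        if ! looks_like_lucene_index "$src"; then
            echo "does not look like a Lucene index: $src" >&2
            rc=1
        fi
    done
    return $rc
}
```

Run preflight with the same three paths you plan to hand to the merge tool, and only go on to step 4 if it returns without complaints. It does not check that DSpace is shut down, so step 3 still applies.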

