Splunk Tutorial: KV Store Troubleshooting Adventures

Introduction

One of my least favorite features in Splunk is KV Store, mainly because whenever I have to deal with it as a Splunk administrator, it’s broken in some horrible new way that I need to figure out. The goal of this post is to capture one of these troubleshooting adventures that we recently encountered, in the hopes that it might help someone who runs into the same problem in the future.

Background

Beginning with Splunk Enterprise 8.1, Splunk introduced a new storage engine for KV Store (WiredTiger). When you upgrade to Splunk Enterprise 9.0 or later, you are required to migrate to the new storage engine. You can also migrate to WiredTiger before upgrading to Splunk Enterprise 9.0 if you want.

We’ve done this migration for a bunch of clients, but every once in a while, we’ve seen some issues that require additional troubleshooting, especially if there is an error or failure in the migration or upgrade process. 

While I’m not sure of the exact circumstances that led to this error, it appears that the root cause may have been a Splunk version conflict: a system was upgraded to Splunk 9.0, and then an older version of Splunk 8.x was started for some reason. The end result (and where I entered this story) was a system running Splunk 9.0 with a KV Store that wouldn’t start.

Symptoms of the issue

Based on the output of splunkd.log on the broken system, it appeared that KV Store on this host was trying to start version 4.2 with the mmapv1 (legacy KV Store) storage engine, a combination that can’t work because MongoDB 4.2 dropped support for mmapv1. Even with storageEngine = mmapv1 set in server.conf, the system kept trying to migrate to WiredTiger and failing.
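
For reference, that setting lives in the [kvstore] stanza of server.conf. This is a minimal sketch of what was in place (the etc/system/local location is an assumption; the setting could live in another config layer):

    # $SPLUNK_HOME/etc/system/local/server.conf (location is an assumption)
    [kvstore]
    storageEngine = mmapv1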

Furthermore, the kvstore files in $SPLUNK_HOME/var/lib/splunk/kvstore/mongo all ended with a .ns extension, which indicates that the storage engine was mmapv1 and not WiredTiger.  After a conversion to WiredTiger, you’ll instead see a bunch of files with .wt extensions. 
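
A quick way to check which engine the on-disk files belong to, using the default KV Store data path:

    ls $SPLUNK_HOME/var/lib/splunk/kvstore/mongo/
    # mmapv1 leaves <dbname>.ns files plus numbered extent files (.0, .1, ...);
    # after a WiredTiger conversion you'll see *.wt files instead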

For some reason, the system was convinced that it was running a more current version of KV Store, but the data files on disk disagreed. While this was happening, KV Store didn’t start or function, and nothing at all was being written to mongod.log.

Fortunately, the splunkd.log file had some more detail about what was happening.

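If you need to dig the relevant messages out of splunkd.log yourself, a simple search against the standard log location works (the path assumes a default install):

    # Pull recent KV Store / mongod related messages from splunkd.log
    grep -iE "kvstore|mongo" $SPLUNK_HOME/var/log/splunk/splunkd.log | tail -50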

The splunk show kvstore-status command confirmed that the KV Store was in a failed state.

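For reference, the status check is the standard CLI command below; the fields to watch in its output are status, storageEngine, and serverVersion:

    # Check KV Store health, storage engine, and server version
    $SPLUNK_HOME/bin/splunk show kvstore-status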

Now, we needed to figure out why the KV Store status was showing as failed and, more importantly, how to fix it.

Researching the solution

Reviewing logs on multiple Splunk environments led us to a clue in the migrate.log file. KV Store upgrades appeared to record entries in migrate.log that referenced a pair of files named versionFile40 and versionFile42.


The purpose of these files is not documented, and they contain no content at all.


Our best guess is that the presence of these files tells Splunk what version of the KV Store engine to use. We decided to try removing the versionFile40 and versionFile42 files and creating a versionFile36 in their place, to correspond to a version that used the old mmapv1 storage engine.
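
Roughly, the workaround looked like the sketch below. The directory is an assumption on our part based on where the rest of the KV Store data lives, so check the paths your migrate.log actually references, and back up the kvstore directory before touching anything:

    # The versionFile* location is an assumption; use whatever path your
    # migrate.log references. Do this while Splunk is stopped.
    cd $SPLUNK_HOME/var/lib/splunk/kvstore/mongo

    # Confirm the marker files exist (they should be empty)
    ls -l versionFile*

    # Remove the markers for the newer KV Store versions...
    rm versionFile40 versionFile42

    # ...and create an empty marker corresponding to the 3.6 / mmapv1 era
    touch versionFile36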

At this point, we crossed our fingers and restarted Splunk.  To our relief, Splunk restarted and KV Store successfully came up this time too!


At this point, we needed to run a storage engine migration to get the engine upgraded to WiredTiger while remaining on server version 3.6.17.

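The standard Splunk CLI command for this migration (and presumably what was run here; the path assumes a default install) is:

    # Migrate the KV Store storage engine from mmapv1 to WiredTiger
    $SPLUNK_HOME/bin/splunk migrate kvstore-storage-engine --target-engine wiredTiger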

After this conversion, our kvstore-status output showed that we were running WiredTiger on server version 3.6.


Next, we performed another KV Store migration to get the server version up to 4.2.17.

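The server version upgrade is a separate migrate step, along these lines; confirm the exact command against the KV Store upgrade documentation for your Splunk release, since this is a sketch rather than output captured from this system:

    # Upgrade the KV Store server version (to 4.2 on Splunk 9.0)
    $SPLUNK_HOME/bin/splunk migrate migrate-kvstore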

At this point, the server version was showing as 4.2.


Now KV Store is running correctly and on the current version.  We fixed the problem!

Conclusion 

Do I expect that you’ll ever be in a situation where you will find this information useful?  I hope not.  Did I write this so that I can have some notes in case I ever run into a similar problem in the future? Absolutely.  

This is a great example of running into a problem where you have to make some educated guesses on a possible solution with limited information to go on.  I’m glad we were able to figure this one out and hope these notes might help you if you ever see this problem in your Splunk environment.  If not, hello to my future self who is reading this months or years from now and again fighting with a broken KV Store somewhere.
