Splunk Tutorial: KV Store Troubleshooting Adventures

Introduction

One of my least favorite features in Splunk is KV Store, mainly because whenever I have to deal with it as a Splunk administrator, it’s broken in some horrible new way that I need to figure out. The goal of this post is to capture one of these troubleshooting adventures we recently encountered, in the hopes that it might help someone who runs into the same problem in the future.

Background

Beginning with Splunk Enterprise 8.1, Splunk introduced a new storage engine for KV Store (WiredTiger). When upgrading to Splunk Enterprise 9.0 or later, you are required to migrate to the new storage engine. You can also migrate to it prior to upgrading to Splunk Enterprise 9.0 if you want.
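For reference, the storage engine is controlled by the [kvstore] stanza in server.conf; a minimal fragment looks like this (wiredTiger after migration, mmapv1 for the legacy engine):

```ini
[kvstore]
storageEngine = wiredTiger
```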

We’ve done this migration for a bunch of clients, but every once in a while, we’ve seen some issues that require additional troubleshooting, especially if there is an error or failure in the migration or upgrade process. 

While I’m not sure of the exact circumstances that led to this error, the root cause appears to have been a version conflict: a system was upgraded to Splunk 9.0, and then an older version of Splunk 8.x was started for some reason. The end result (and where I entered this story) was a system running Splunk 9.0 with a KV Store that wouldn’t start.

Symptoms of the issue

Based on the output of splunkd.log on the broken system, KV Store on this host appeared to be starting version 4.2 with the mmapv1 (legacy KV Store) storage engine. Even with storageEngine = mmapv1 set in server.conf, the system was trying to migrate to WiredTiger and failing.

Furthermore, the kvstore files in $SPLUNK_HOME/var/lib/splunk/kvstore/mongo all ended with a .ns extension, which indicates that the storage engine was mmapv1 and not WiredTiger.  After a conversion to WiredTiger, you’ll instead see a bunch of files with .wt extensions. 
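That file-extension check is easy to script. The sketch below uses a throwaway directory seeded with mmapv1-style files in place of $SPLUNK_HOME/var/lib/splunk/kvstore/mongo (an assumption so the commands can be run safely anywhere; on a real system, point MONGO_DIR at the actual path):

```shell
# Stand-in for $SPLUNK_HOME/var/lib/splunk/kvstore/mongo.
MONGO_DIR=$(mktemp -d)
touch "$MONGO_DIR/local.ns" "$MONGO_DIR/local.0"   # mmapv1-style files

# .wt files mean WiredTiger; .ns files mean the legacy mmapv1 engine.
if ls "$MONGO_DIR"/*.wt >/dev/null 2>&1; then
  ENGINE=wiredTiger
elif ls "$MONGO_DIR"/*.ns >/dev/null 2>&1; then
  ENGINE=mmapv1
else
  ENGINE=unknown
fi
echo "on-disk storage engine looks like: $ENGINE"
```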

For some reason, the system was convinced it was running a more current version of KV Store, but the data files on disk disagreed. While this was occurring, KV Store didn’t start or function, and mongod.log contained no entries at all.

Fortunately, the splunkd.log file had more detail about what was failing during startup.

The splunk show kvstore-status command showed the following output:


This member:
backupRestoreStatus : Ready
disabled : 0
guid : F50190FF-36F7-486B-B34F-FDE64B4665E9
port : 8191
standalone : 1
status : failed
storageEngine : mmapv1

Now, we needed to figure out why the KV Store status was showing as failed (and more importantly) how to fix it. 

Researching the solution

Reviewing logs on multiple Splunk environments led us to a clue in the migrate.log file: KV Store upgrades record entries referencing version marker files, versionFile40 and versionFile42, in the kvstore directory. The purpose of these files is not documented, and they contain no content.

Our best guess was that the presence of these files tells Splunk which version of the KV Store engine to use. We decided to try removing the versionFile40 and versionFile42 files and creating a versionFile36 in their place, corresponding to a server version that used the old mmapv1 storage engine.
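As a sketch of that swap (the location is our best guess from migrate.log; stop Splunk and back up the kvstore directory before attempting this on a real system), here a throwaway directory stands in for $SPLUNK_HOME/var/lib/splunk/kvstore so the commands are safe to run:

```shell
# Throwaway directory standing in for $SPLUNK_HOME/var/lib/splunk/kvstore.
KVSTORE_DIR=$(mktemp -d)
touch "$KVSTORE_DIR/versionFile40" "$KVSTORE_DIR/versionFile42"  # broken state

# The fix: set the 4.x marker files aside and create an empty versionFile36,
# matching a server version that used the mmapv1 engine.
mv "$KVSTORE_DIR/versionFile40" "$KVSTORE_DIR/versionFile40.bak"
mv "$KVSTORE_DIR/versionFile42" "$KVSTORE_DIR/versionFile42.bak"
touch "$KVSTORE_DIR/versionFile36"
ls "$KVSTORE_DIR"
```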

At this point, we crossed our fingers and restarted Splunk.  To our relief, Splunk restarted and KV Store successfully came up this time too!

Running splunk show kvstore-status again confirmed it:

This member:
backupRestoreStatus : Ready
date : Mon Apr 10 12:22:12 2023
dateSec : 1681129332.136
disabled : 0
guid : F50190FF-36F7-486B-B34F-FDE64B4665E9
oplogEndTimestamp : Mon Apr 10 12:22:12 2023
oplogEndTimestampSec : 1681129332
oplogStartTimestamp : Mon Apr 10 00:17:05 2023
oplogStartTimestampSec : 1679617025
port : 8191
replicaSet : F50190FF-36F7-486B-B34F-FDE64B4665E9
replicationStatus : KV store captain
standalone : 1
status : ready
storageEngine : mmapv1

KV store members:
127.0.0.1:8191
configVersion : 1
electionDate : Mon Apr 10 12:21:49 2023
electionDateSec : 1681129309
hostAndPort : 127.0.0.1:8191
optimeDate : Mon Apr 10 12:22:12 2023
optimeDateSec : 1681129332
replicationStatus : KV store captain
uptime : 24

At this point, we needed to migrate the storage engine to WiredTiger while still on server version 3.6.17:
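The storage engine migration is done with the splunk migrate command; per Splunk's documentation, the invocation is along these lines (double-check against the docs for your Splunk version before running it):

```shell
# Migrate the KV Store storage engine from mmapv1 to WiredTiger.
$SPLUNK_HOME/bin/splunk migrate kvstore-storage-engine --target-engine wiredTiger
```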


After this conversion, kvstore-status showed that we were running WiredTiger on server version 3.6:


This member:
backupRestoreStatus : Ready
date : Mon Apr 10 12:30:54 2023
dateSec : 1681129854.993
disabled : 0
featureCompatibilityVersion : 3.6
guid : F50190FF-36F7-486B-B34F-FDE64B4665E9
oplogEndTimestamp : Mon Apr 10 12:30:54 2023
oplogEndTimestampSec : 1681129854
oplogStartTimestamp : Mon Apr 10 12:29:07 2023
oplogStartTimestampSec : 1681129747
port : 8191
replicaSet : F50190FF-36F7-486B-B34F-FDE64B4665E9
replicationStatus : KV store captain
standalone : 1
status : ready
storageEngine : wiredTiger

KV store members:
127.0.0.1:8191
configVersion : 1
electionDate : Mon Apr 10 12:29:07 2023
electionDateSec : 1681129747
hostAndPort : 127.0.0.1:8191
optimeDate : Mon Apr 10 12:30:54 2023
optimeDateSec : 1681129854
replicationStatus : KV store captain
serverVersion : 3.6.17
uptime : 110

Next, we performed another KV Store migration to get the server version up to 4.2.17:
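For this step, Splunk's documentation describes a migrate subcommand that upgrades the KV Store server version once WiredTiger is in place; again, verify the exact invocation against the docs for your Splunk version:

```shell
# Upgrade the KV Store server version to 4.2 (WiredTiger must already be in place).
$SPLUNK_HOME/bin/splunk migrate migrate-kvstore
```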


At this point, the server version was showing 4.2:


This member:
backupRestoreStatus : Ready
date : Mon Apr 10 12:38:00 2023
dateSec : 1681130280.039
disabled : 0
featureCompatibilityVersion : 4.2
guid : F50190FF-36F7-486B-B34F-FDE64B4665E9
oplogEndTimestamp : Mon Apr 10 12:37:59 2023
oplogEndTimestampSec : 1681130279
oplogStartTimestamp : Mon Apr 10 12:29:07 2023
oplogStartTimestampSec : 1681129747
port : 8191
replicaSet : F50190FF-36F7-486B-B34F-FDE64B4665E9
replicationStatus : KV store captain
standalone : 1
status : ready
storageEngine : wiredTiger

KV store members:
127.0.0.1:8191
configVersion : 1
electionDate : Mon Apr 10 12:36:29 2023
electionDateSec : 1681130189
hostAndPort : 127.0.0.1:8191
optimeDate : Mon Apr 10 12:37:59 2023
optimeDateSec : 1681130279
replicationStatus : KV store captain
serverVersion : 4.2.17

Now KV Store is running correctly and on the current version.  We fixed the problem!

Conclusion

Do I expect that you’ll ever be in a situation where you will find this information useful?  I hope not.  Did I write this so that I can have some notes in case I ever run into a similar problem in the future? Absolutely.  

This is a great example of running into a problem where you have to make some educated guesses on a possible solution with limited information to go on.  I’m glad we were able to figure this one out and hope these notes might help you if you ever see this problem in your Splunk environment.  If not, hello to my future self who is reading this months or years from now and again fighting with a broken KV Store somewhere.

The post Splunk Tutorial: KV Store Troubleshooting Adventures appeared first on Hurricane Labs.

*** This is a Security Bloggers Network syndicated blog from Hurricane Labs authored by Hurricane Labs. Read the original post at: https://hurricanelabs.com/blog/splunk-tutorial-kv-store-troubleshooting-adventures/?utm_source=rss&utm_medium=rss&utm_campaign=splunk-tutorial-kv-store-troubleshooting-adventures

