Investigating and Solving a Production Memory Spike Issue

Background

The project went live. Everything had been normal during development and testing, but within a week of going live the system was running slower and slower - it took 30 seconds just to open the admin interface. When I first heard this, I thought, "Impossible! We're still in internal testing with barely any users." But when I opened the system myself and waited forever to get in, I believed it.

Investigation

Checking Error Logs

I knew there was a problem but couldn't access the production system directly, so I requested access. The ops team took me to the server room - security was tight. I checked the error logs but found no useful information. I asked the ops team to export the error logs so I could study them, and was firmly denied...

Checking Process Status

Using:

ps -aux | grep ***

To locate the process PID, then:

top -p <PID>

To check memory usage. Although this system is fairly large, 4 GB of memory consumption was still scary.
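As an aside, if the JDK itself is installed on the server, jps can locate the Java PID directly - a small alternative sketch of the same step (not what I actually ran that day):

jps -l            # list JVM PIDs with their main class or jar
top -p <PID>      # then watch that process's memory and CPU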

Using jmap to Check Runtime Memory Information

jmap -histo <PID>

A stream of text scrolled past. Here's a screenshot of the output, reproduced locally later: [screenshot: jmap -histo class histogram]

Let me explain what's in this image:

  • num: sequence number
  • instances: number of object instances
  • bytes: total bytes occupied
  • class name: the class name, where [B is byte[], [C is char[], [I is int[], and [S would be short[]

I noticed something: "Why are there so many byte[]!!! Does the project use byte[] anywhere? Doesn't seem like it. Maybe they're produced by other objects, like file streams, read/write operations. Ah, I got it - could someone have forgotten to close a stream?"
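One check worth noting here (whether ops would have allowed it is another matter): jmap -histo:live forces a full GC before counting, so it reports only reachable objects. Comparing it with the plain histogram hints at whether the byte[] are garbage waiting to be collected or genuinely still referenced:

jmap -histo <PID> | head -n 20         # all objects, including unreachable ones
jmap -histo:live <PID> | head -n 20    # forces a full GC first, then counts only live objects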

I asked the ops team if I could take photos - since I couldn't export files, surely photos would be okay? Ops: "No! What are you thinking!"

No choice, let's look at other things.

Checking Heap Information

jmap -heap <PID>

[screenshot: jmap -heap output] I don't remember the exact numbers - I only glanced at it briefly, still thinking about the byte[] issue.
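A side note for anyone on a newer JDK: the -heap option was removed from jmap in JDK 9, and the same information comes from the jhsdb front end instead:

jhsdb jmap --heap --pid <PID>    # JDK 9+ replacement for jmap -heap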

Since I was already there, let me check more information.

Checking GC Status

jstat -gc <PID> 1000 10000

Here 1000 is the sampling interval in milliseconds and 10000 is the number of samples.

[screenshot: jstat -gc output]

In the output:

  • S0C: First survivor space size
  • S1C: Second survivor space size
  • S0U: First survivor space used
  • S1U: Second survivor space used
  • EC: Eden space size
  • EU: Eden space used
  • OC: Old generation size
  • OU: Old generation used
  • MC: Metaspace size
  • MU: Metaspace used
  • CCSC: Compressed class space size
  • YGC: Young generation GC count
  • YGCT: Young generation GC time (seconds)
  • FGC: Full GC count
  • FGCT: Full GC time (seconds)
  • GCT: Total GC time (seconds)

At the time, Full GC was triggering extremely frequently. I used my phone to hit the public URL - before I had clicked through more than a few features, a Full GC had already fired. I was mortified. Good thing the ops team didn't understand what I was doing, otherwise it would have been embarrassing.

Normally, Full GC should trigger every few hours or days, and Young GC every few minutes or hours. In this situation, I noted the features I had just clicked - that code was likely problematic.
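As a side note, for watching GC frequency like this, jstat -gcutil is often easier to eyeball than -gc, since it prints utilization percentages alongside the same YGC/FGC counters:

jstat -gc <PID> 1000 10000     # sizes and usage in KB (the invocation used above)
jstat -gcutil <PID> 1000       # utilization percentages plus GC counts and times, sampled every second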

Feeling like I'd seen enough, and with ops constantly asking if I was done (it was after hours), I decided on one last check: deadlocks.

Using jstack to Find Deadlocks

Let me also document jstack usage - good thing I brought my tablet with reference materials:

  1. Use top -p <PID> to monitor the Java process, where <PID> is the Java process ID
  2. Press H (uppercase) to switch to the per-thread view
  3. Find the thread ID with the highest CPU usage
  4. Convert that thread ID to hexadecimal, e.g., 9061 becomes 0x2365
  5. Execute jstack <PID> | grep -A 10 2365 to get the 10 lines following thread 0x2365 in the thread dump
  6. Check the corresponding stack info to find the potentially problematic code

OK, no deadlocks.
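For the record, the whole procedure condenses into a few commands - 9061 is just an illustrative thread ID, matching the 0x2365 example above. jstack also prints any Java-level deadlocks it finds at the end of the dump:

top -H -p <PID>                   # per-thread view of the Java process; note the busiest TID
printf '%x\n' 9061                # convert the decimal thread ID to hex -> 2365
jstack <PID> | grep -A 10 2365    # find that thread (nid=0x2365) in the thread dump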

Reproducing the Issue

Reproducing this locally was actually easy: if a production machine with 32 GB of RAM got into trouble within 7 days, a local JVM with a much smaller heap would show the problem far sooner - and locally I could attach real-time monitoring tools such as jvisualvm or JProfiler.

I used JProfiler - quite a powerful tool.
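One way to make the leak show up quickly on a laptop (not necessarily what I did at the time; app.jar is just a placeholder name) is to start the application with a much smaller heap and GC logging enabled:

java -Xmx1g -XX:+PrintGCDetails -jar app.jar    # JDK 8-style GC logging
java -Xmx1g -Xlog:gc* -jar app.jar              # JDK 9+ equivalent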

Using JProfiler

Project Startup

After starting the project, with no requests coming in, everything looked normal - used heap under 500 MB. [screenshot: JProfiler memory view right after startup]

Normal Requests

Then I started making requests - used memory instantly jumped to 2 GB. [screenshot: memory view under load]

Manual GC Trigger

After manually triggering a GC, whatever remained had to be objects that could not be garbage collected. [screenshot: memory view after a manual GC]

Checking Live Memory

Sure enough, byte[] was the biggest consumer. [screenshot: live memory view with byte[] at the top]

Object Analysis

Select the byte[] class in the live memory view for further analysis. [screenshot: selecting the byte[] objects]

This opens the object analysis view. The key step is the reference analysis on the selected objects:

  • outgoing references: objects that the selected object itself references
  • incoming references: objects that hold a reference to the selected object - this is what we want, since it shows who is keeping these byte[] alive

Following the incoming references, I picked one of the byte[] instances and walked up its reference chain. [screenshots: incoming reference chain for a byte[] instance]

Then I discovered something important. [screenshot: the top of the incoming reference chain]

What? These arrays were being held by Tomcat! The field was called messageBytes - reasoning by analogy with Netty, this looked like the buffer used when decoding inbound request messages.

Case Solved

I checked the other byte[] arrays and they were all the same. Could Tomcat itself really be the culprit? That was hard to believe.

A flash of insight - I remembered Tomcat configuration options. Something must be misconfigured.

Remove All Tomcat Configuration Options

Restarted and monitored: [screenshot: memory usage unchanged]

Not solved. But byte[] clearly pointed to Tomcat - did I get it wrong?

Let me think about messageBytes again - it's about request messages: every incoming request pushes memory up. How would Netty handle this?

!!! Got it - strip out all of the request-related configuration options!!!

For example:

server:
  port: 82
  max-http-header-size: 102400000

spring:
  http:
    multipart:
      max-file-size: 100MB
      max-request-size: 10MB

  servlet:
    multipart:
      max-file-size: 10MB
      max-request-size: 20MB

Restarted, made requests, monitored:

[screenshot: memory usage back to normal]

Perfect! So one of those parameters really was the problem. Now I just had to find which one by adding them back one at a time - at most five tries. I got lucky - the first one I tried was the culprit:

max-http-header-size: 102400000

This configuration was the problem. I searched online for what this parameter does: it sets the maximum size of HTTP request headers that Tomcat will accept, and Tomcat sizes its request processors' header buffers accordingly - so a value of roughly 100 MB (the default is only 8 KB) means every pooled request processor hangs on to an enormous byte[].

Sure enough, I wasn't the only one who got burned.

Checked the git commit history, found the colleague who added this parameter and asked why - that's another story involving specific business scenarios. Case closed.
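Incidentally, git's pickaxe search is a quick way to find the commit that introduced a setting like this - the file path below is just a guess at the project layout:

git log -S "max-http-header-size" --oneline -- src/main/resources/application.yml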

Summary

When copying parameters from online sources, it's best to understand what they actually mean. That way you know why you're using them and can tune them for your own business scenarios. Small optimizations add up to big improvements - but don't expect to boost performance by 20-30% just by tweaking two or three parameters; even Elasticsearch can't do that.

Understanding the underlying code also helps us find problems faster instead of guessing blindly - when this issue first appeared, I kept assuming it was caused by Ehcache. After all, a program only does what it's told; in the end it's people who solve the problems.