We see similar issues. Some days we have to restart many times; other days it keeps running. We've been tracking it here:
We're running everything under Docker, which makes things more complicated, but I've had luck in a local dev/test environment using the Linux perf tools to build flame graphs. For raisins, we haven't done this in Docker yet, though you can see how to make that work here.
However, if you're not running Docker, it's a little bit easier. Follow the instructions here:
He doesn't go into it there, but if you're on Ubuntu you'll need to install:
linux-tools-common
linux-cloud-tools-common
...and then the kernel-specific versions of those, matching the kernel you're running, in order to use perf.
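As a rough sketch of what that package list works out to, the snippet below builds the names from the running kernel release. It assumes the usual Ubuntu convention that the kernel-specific packages are named by appending the output of `uname -r` (for example `linux-tools-<release>`); the `perf_packages` helper is hypothetical, and the `linux-cloud-tools-<release>` package may not exist for every kernel flavour, so check what apt actually offers on your machine.

```python
import platform


def perf_packages(kernel_release=None):
    """Return the Ubuntu package names needed to run perf.

    Assumption: kernel-specific packages follow the pattern
    linux-tools-<release> / linux-cloud-tools-<release>, where
    <release> is what `uname -r` prints.
    """
    release = kernel_release or platform.release()
    return [
        "linux-tools-common",
        "linux-cloud-tools-common",
        f"linux-tools-{release}",
        f"linux-cloud-tools-{release}",
    ]


if __name__ == "__main__":
    # Only prints the command; run it yourself once you've checked the names.
    print("sudo apt-get install " + " ".join(perf_packages()))
```

Running it on the affected host prints an `apt-get install` line you can review before executing.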
If you get there before we do, I'm eager to hear what anyone discovers with this.