Part of our Ruby for DevOps series — practical Ruby scripts that solve real system-administration problems.

The Problem

The classic monitoring gap: full-blown observability stacks (Prometheus, Datadog, Nagios) are great, but there’s always a box that isn’t covered — the legacy VM, the build server, the client machine you can’t install agents on. Meanwhile the questions you actually need answered are simple: is nginx still running? is anything eating all the memory? are zombies piling up?

This tutorial builds a dependency-free watchdog in Ruby that reads the Linux /proc filesystem directly — no gems, no agents, nothing to install beyond Ruby itself. It checks required processes, flags memory hogs and zombies, can restart dead systemd services, and speaks JSON so other tools can consume it.

Ruby process monitor architecture diagram

Prerequisites

  • Ruby 2.7+, standard library only.
  • Linux — the script reads /proc, which is Linux-specific. (Windows equivalent: query WMI via win32ole — see Extending below.)
  • Root only if you want automatic systemctl restart; all read-only checks work as any user.

The Complete Script

Save this as proc_monitor.rb:

#!/usr/bin/env ruby
# proc_monitor.rb -- Process & service watchdog for Linux
#
# Checks that critical processes are running, watches CPU/memory usage,
# and (optionally) restarts systemd services that have died.
#
# Usage:
#   ruby proc_monitor.rb                       # one-shot report
#   ruby proc_monitor.rb --watch nginx,sshd    # check specific processes
#   ruby proc_monitor.rb --mem-limit 500       # flag procs over 500 MB RSS
#
# Requires: Ruby 2.7+ (standard library only). Reads /proc directly,
# so no external gems or commands are needed for process data.

require 'optparse'
require 'json'

options = { watch: [], mem_limit: nil, top: 5, json: false }
OptionParser.new do |opts|
  opts.banner = "Usage: ruby proc_monitor.rb [options]"
  opts.on("--watch LIST", "Comma-separated process names that MUST be running") { |v| options[:watch] = v.split(",").map(&:strip) }
  opts.on("--mem-limit MB", Integer, "Flag processes using more than MB of RSS") { |v| options[:mem_limit] = v }
  opts.on("--top N", Integer, "Show top N by memory (default 5)") { |v| options[:top] = v }
  opts.on("--json", "Emit JSON (for piping into other tools)") { options[:json] = true }
end.parse!

# ---------------------------------------------------------------
# 1. Collect processes straight from /proc.
#    Each numeric directory under /proc is a PID. /proc/PID/status
#    gives us the name and memory; /proc/PID/cmdline the full command.
# ---------------------------------------------------------------
def collect_processes
  Dir.glob("/proc/[0-9]*").filter_map do |dir|
    pid = File.basename(dir).to_i
    status = File.read("#{dir}/status")
    name   = status[/^Name:\s+(.+)$/, 1]
    rss_kb = status[/^VmRSS:\s+(\d+)\s+kB/, 1].to_i
    state  = status[/^State:\s+(\S+)/, 1]
    cmd    = File.read("#{dir}/cmdline").tr("\0", " ").strip rescue ""
    { pid: pid, name: name, rss_mb: (rss_kb / 1024.0).round(1),
      state: state, cmd: cmd.empty? ? "[#{name}]" : cmd[0, 60] }
  rescue Errno::ENOENT, Errno::EACCES
    nil  # process exited mid-scan or is not readable -- skip it
  end
end

procs = collect_processes
alerts = []

# ---------------------------------------------------------------
# 2. Watchdog: are all required processes present?
# ---------------------------------------------------------------
# NOTE: we exclude our own PID -- otherwise this script's command line
# ("--watch nginx,...") would match the search and mask a dead service!
candidates = procs.reject { |p| p[:pid] == Process.pid }

options[:watch].each do |want|
  found = candidates.select { |p| p[:name] == want || p[:cmd].include?(want) }
  if found.empty?
    alerts << { level: "CRITICAL", msg: "required process '#{want}' is NOT running" }
  end
end

# ---------------------------------------------------------------
# 3. Memory limit check
# ---------------------------------------------------------------
if options[:mem_limit]
  procs.select { |p| p[:rss_mb] > options[:mem_limit] }.each do |p|
    alerts << { level: "WARNING",
                msg: "PID #{p[:pid]} (#{p[:name]}) using #{p[:rss_mb]} MB " \
                     "(limit #{options[:mem_limit]} MB)" }
  end
end

# ---------------------------------------------------------------
# 4. Zombie detection -- Z-state processes indicate a parent that
#    is not reaping its children.
# ---------------------------------------------------------------
procs.select { |p| p[:state] == "Z" }.each do |p|
  alerts << { level: "WARNING", msg: "zombie process PID #{p[:pid]} (#{p[:name]})" }
end

# ---------------------------------------------------------------
# 5. Optional systemd integration: try to restart dead watched
#    services if systemctl exists AND we're running as root.
# ---------------------------------------------------------------
def try_restart(service)
  return "systemctl not available" unless system("which systemctl > /dev/null 2>&1")
  return "not running as root -- skipping restart" unless Process.uid.zero?
  ok = system("systemctl", "restart", service)
  ok ? "restart issued" : "restart FAILED"
end

alerts.select { |a| a[:level] == "CRITICAL" }.each do |a|
  service = a[:msg][/'(.+)'/, 1]
  a[:action] = try_restart(service)
end

# ---------------------------------------------------------------
# 6. Report
# ---------------------------------------------------------------
if options[:json]
  puts JSON.pretty_generate(
    total: procs.size, alerts: alerts,
    top_memory: procs.max_by(options[:top]) { |p| p[:rss_mb] }
  )
else
  puts "Process report -- #{Time.now.strftime('%F %T')}"
  puts "Total processes: #{procs.size}"

  puts "\nTop #{options[:top]} by memory:"
  procs.max_by(options[:top]) { |p| p[:rss_mb] }.each do |p|
    puts format("  %6d  %-20s %8.1f MB  %s", p[:pid], p[:name], p[:rss_mb], p[:state])
  end

  if alerts.empty?
    puts "\nAlerts: none -- all clear."
  else
    puts "\nAlerts (#{alerts.size}):"
    alerts.each do |a|
      line = "  [#{a[:level]}] #{a[:msg]}"
      line += " -> #{a[:action]}" if a[:action]
      puts line
    end
  end
end

# Non-zero exit if anything critical -- lets cron/CI catch failures
exit(alerts.any? { |a| a[:level] == "CRITICAL" } ? 2 : 0)

How It Works, Step by Step

1. /proc is just files

Every process on Linux appears as a numbered directory under /proc. /proc/PID/status is a plain-text key/value file with the process name, state and memory; /proc/PID/cmdline holds the full command line with NUL separators (hence the tr("\0", " ")). Reading these files is exactly what ps does under the hood — we’re just cutting out the middleman, which also makes the script faster than shelling out to ps thousands of times.

2. Processes vanish mid-scan — plan for it

Between listing /proc and reading a PID’s files, that process may exit. The rescue Errno::ENOENT inside filter_map treats that as normal (it is), skipping the entry instead of crashing. This race is the #1 bug in naive process scanners.

3. The self-match trap

Here’s a subtle bug we hit while writing this script — and that also bites people using pgrep -f: when you run ruby proc_monitor.rb --watch nginx, the monitor’s own command line contains the string “nginx”. Without excluding our own PID, the watchdog finds itself, decides nginx is running, and a dead web server sails through unnoticed. One line fixes it: procs.reject { |p| p[:pid] == Process.pid }.

Gotcha: the same failure mode exists in shell scripts: ps aux | grep nginx matches the grep itself. That’s why the old grep [n]ginx trick exists.

4. Zombie detection

A process in state Z is dead but unreaped — its parent hasn’t called wait(). One or two are harmless; a growing pile means a buggy parent leaking children and eventually exhausting the PID table. The scan makes them visible before that happens.

5. Exit codes are an API

The script exits 2 on any CRITICAL alert. That single convention makes it composable: cron emails you on non-zero exit, systemd timers record the failure, CI gates on it, and a wrapper script can chain || send_page.

Example Output

$ ruby proc_monitor.rb --watch bash,nginx

Process report -- 2026-07-02 11:15:01
Total processes: 6

Top 5 by memory:
       7  ruby                     21.9 MB  R
       4  socat                     4.5 MB  S
       3  socat                     4.5 MB  S
       2  bash                      3.4 MB  S
       5  bash                      3.4 MB  S

Alerts (1):
  [CRITICAL] required process 'nginx' is NOT running -> not running as root -- skipping restart

$ echo $?
2

bash was found, nginx wasn’t — so we get a CRITICAL alert, an attempted remediation (skipped, since we weren’t root), and exit code 2 for the automation layer.

Scheduling It

# Every 5 minutes; cron mails the report only when something is wrong
*/5 * * * * /usr/bin/ruby /opt/scripts/proc_monitor.rb --watch nginx,postgres,redis-server --mem-limit 2048 || true

Troubleshooting

  • Watched process “not running” but it is — process names in /proc/PID/status are truncated to 15 characters (redis-server fits; some-long-daemon-name doesn’t). Watch a substring of the command line instead, or compare against cmd.
  • VmRSS missing / 0 MB for some PIDs — kernel threads have no userspace memory; they legitimately lack VmRSS. The script already treats missing values as 0.
  • Restart says “restart FAILED” — the process name and the systemd unit name often differ (process postgres, unit postgresql.service). Map names to units explicitly if you use auto-restart seriously.
  • Permission denied on some /proc entries — normal for other users’ processes under hardened kernels (hidepid). Run as root for full visibility.

Extending the Script

  • Windows version — swap the /proc reader for WMI: require 'win32ole'; wmi = WIN32OLE.connect("winmgmts://"); wmi.ExecQuery("SELECT Name, ProcessId, WorkingSetSize FROM Win32_Process"). The alert/restart logic stays identical (use sc start instead of systemctl).
  • CPU usage — read /proc/PID/stat utime/stime twice with a 1s sleep and diff, divided by clock ticks.
  • Webhooks — POST the --json output to Slack/Discord/PagerDuty with Net::HTTP when alerts are present.
  • State tracking — write alerts to a state file and only notify on changes, so a flapping service doesn’t page you every 5 minutes.

Leave a Reply

Your email address will not be published. Required fields are marked *