\chapter{Profiling (ocamlprof, perf)} \label{c:profiler}
%HEVEA\cutname{profil.html}

This chapter describes different ways to profile the execution of OCaml
programs.

\section{s:ocamlprog-overview}{Profiling execution counts with ocamlprof}

This section describes how the execution of OCaml programs can be profiled,
by recording how many times functions are called, and branches of conditionals
are taken. This approach applies to OCaml programs compiled with the bytecode
and native backends.

\subsection{s:ocamlprof-compiling}{Compiling for profiling}

Before profiling an execution, the program must be compiled in
profiling mode, using the "ocamlcp" front-end to the "ocamlc" compiler
(see chapter~\ref{c:camlc}) or the "ocamloptp" front-end to the
"ocamlopt" compiler (see chapter~\ref{c:nativecomp}). When compiling
modules separately, "ocamlcp" or "ocamloptp" must be used when
compiling the modules (production of ".cmo" or ".cmx" files), and can
also be used (though this is not strictly necessary) when linking them
together.

\lparagraph{p:ocamlprof-warning}{Note} If a module (".ml" file) doesn't have a corresponding
interface (".mli" file), then compiling it with "ocamlcp" will produce
object files (".cmi" and ".cmo") that are not compatible with the ones
produced by "ocamlc", which may lead to problems (if the ".cmi" or
".cmo" is still around) when switching between profiling and
non-profiling compilations.  To avoid this problem, you should always
have a ".mli" file for each ".ml" file.  The same problem exists with
"ocamloptp".

\lparagraph{p:ocamlprof-reserved}{Note} To make sure your programs can be compiled in
profiling mode, avoid using any identifier that begins with
"__ocaml_prof".

The amount of profiling information can be controlled through the "-P"
option to "ocamlcp" or "ocamloptp", followed by one or several letters
indicating which parts of the program should be profiled:

%% description des options
\begin{options}
\item["a"] all options
\item["f"] function calls : a count point is set at the beginning of
each function body
\item["i"] {\bf if \ldots then \ldots else \ldots} : count points are set in
both {\bf then} branch and {\bf else} branch
\item["l"] {\bf while, for} loops: a count point is set at the beginning of
the loop body
\item["m"] {\bf match} branches: a count point is set at the beginning of the
body of each branch
\item["t"] {\bf try \ldots with \ldots} branches: a count point is set at the
beginning of the body of each branch
\end{options}

For instance, compiling with "ocamlcp -P film" profiles function calls,
if\ldots then\ldots else\ldots, loops and pattern matching.

Calling "ocamlcp" or "ocamloptp" without the "-P" option defaults to
"-P fm", meaning that only function calls and pattern matching are
profiled.

\paragraph{Note} For compatibility with previous releases, "ocamlcp"
also accepts the "-p" option, with the same arguments and behaviour as
"-P".

The "ocamlcp" and "ocamloptp" commands also accept all the options of
the corresponding "ocamlc" or "ocamlopt" compiler, except the "-pp"
(preprocessing) option.

\subsection{s:ocamlprof-profiling}{Profiling an execution}

Running an executable that has been compiled with "ocamlcp" or
"ocamloptp" records the execution counts for the specified parts of
the program and saves them in a file called "ocamlprof.dump" in the
current directory.

If the environment variable "OCAMLPROF_DUMP" is set when the program
exits, its value is used as the file name instead of "ocamlprof.dump".

The dump file is written only if the program terminates
normally (by calling "exit" or by falling through). It is not written
if the program terminates with an uncaught exception.

If a compatible dump file already exists in the current directory, then the
profiling information is accumulated in this dump file. This allows, for
instance, the profiling of several executions of a program on
different inputs.  Note that dump files produced by byte-code
executables (compiled with "ocamlcp") are compatible with the dump
files produced by native executables (compiled with "ocamloptp").

\subsection{s:ocamlprof-printing}{Printing profiling information}

The "ocamlprof" command produces a source listing of the program modules
where execution counts have been inserted as comments. For instance,
\begin{verbatim}
        ocamlprof foo.ml
\end{verbatim}
prints the source code for the "foo" module, with comments indicating
how many times the functions in this module have been called. Naturally,
this information is accurate only if the source file has not been modified
after it was compiled.

The following options are recognized by "ocamlprof":

\begin{options}

\item["-args" \var{filename}]
 Read additional newline-terminated command line arguments from \var{filename}.

\item["-args0" \var{filename}]
 Read additional null character terminated command line arguments from \var{filename}.

\item["-f" \var{dumpfile}]
Specifies an alternate dump file of profiling information to be read.

\item["-F" \var{string}]
Specifies an additional string to be output with profiling information.
By default, "ocamlprof" will annotate programs with comments of the form
{\tt (* \var{n} *)} where \var{n} is the counter value for a profiling
point. With option {\tt -F \var{s}}, the annotation will be
{\tt (* \var{s}\var{n} *)}.

\item["-impl" \var{filename}]
Process the file \var{filename} as an implementation file, even if its
extension is not ".ml".

\item["-intf" \var{filename}]
Process the file \var{filename} as an interface file, even if its
extension is not ".mli".

\item["-keywords" \var{version+list}]
Set keywords according to the \var{version+list}
specification.

\item["-version"]
Print version string and exit.

\item["-vnum"]
Print short version number and exit.

\item["-help" or "--help"]
Display a short usage summary and exit.
%
\end{options}

Profiling with "ocamlprof" only records execution counts, not the actual
time spent within each function. There is currently no way to perform
time profiling on bytecode programs generated by "ocamlc".

\section{s:ocamlprof-time-profiling}{Time profiling with Linux perf}

This section describes how to use \texttt{perf} to profile OCaml programs. Linux Performance Events (aka \texttt{perf}) is a suite of tools for performance observability included with the Linux kernel. The main features covered here are recording events using \texttt{perf-record(1)}, and visualising recorded events using \texttt{perf-report(1)}. \texttt{perf} has many additional profiling and visualising features, for more comprehensive documentation, see (man \texttt{perf(1)}, \href{https://perfwiki.github.io/main/}{\texttt{perf} wiki} or \href{https://www.brendangregg.com/perf.html}{Brendan Gregg's Blog}).

\subsection{s:ocamlprof-time-profiling-background}{Background}

Time profiling is typically performed by sampling the CPU call graph at frequent intervals to gather statistics on the code paths that are consuming CPU resources. To profile OCaml code, \texttt{perf} needs to understand the call graph of an OCaml program. \texttt{perf} supports multiple options for recording call graphs:
\begin{itemize}
\item Frame Pointers, which is the default.
\item DWARF's Call Frame Information (CFI).
\item Hardware Last Branch Record (LBR).
\end{itemize}

Frame pointer based call graphs use a convention where a register called the frame pointer holds the address for the beginning of the stack frame, and the previous values for the frame pointer are stored at a known offset from the current frame pointer. Using this information the call graph for the current function can be calculated purely based on the current stack, a process called unwinding. On x86_64 the register used for storing the frame pointer is \$rbp, while ARM64 uses the register x29. OCaml 5 introduced non-contiguous stacks as part of the implementation of effects, see \href{https://dl.acm.org/doi/10.1145/3453483.3454039}{Retrofitting effect handlers onto OCaml} (Section 5.5). Due to this implementation detail, frame pointers work better with the copying nature of perf.

DWARF based call graphs use the DWARF CFI information to perform unwinding. However this interacts poorly with the copying nature of perf, often leading to truncated call graphs where not enough of the stack has been copied by \texttt{perf}. It also produces larger trace files that are more costly to capture and process. Finally it requires including CFI debugging information in your program resulting in slightly larger binaries.

Hardware Last Branch Record (LBR) uses a processor provided method to record call graphs. This has the limitations of restricted availability (only on certain Intel CPUs) and a limited stack depth (16 on Haswell and 32 since Skylake).

Of these options, frame pointers are recommended for profiling OCaml code. It has the following advantages:
\begin{itemize}
\item Unwinding is faster to calculate.
\item Tracing data produced is smaller.
\item Frame pointers provide more complete call graphs, particularly when used with a Linux distribution that supports them.
\item Frame pointers work better with perf's copying nature and OCaml 5's non-contiguous stacks.
\end{itemize}

\subsection{s:ocamlprof-compiling-perf}{Compiling for Profiling}

The process for compiling an executable for time profiling depends on the OCaml version. For OCaml versions 4.14 and earlier, either frame pointers or DWARF can be used, while for OCaml 5.0 and later, enabling frame pointers is recommended.

To enable frame pointers, configure the compiler with \texttt{--enable-frame-pointers}. Alternatively, you can install an opam switch with frame pointers enabled, as follows:

\begin{verbatim}
        opam switch create <YOUR-SWITCH-NAME-HERE> ocaml-option-fp
\end{verbatim}

Frame pointer support for OCaml is available on x86_64 architecture for Linux starting with OCaml 4.12 and on macOS from OCaml 5.3. ARM64 architecture is supported on Linux and macOS from OCaml 5.4, while other Tier-1 architectures (POWER, RISC-V, and s390x) are currently unsupported.

\subsection{s:ocamlprof-profiling-perf}{Profiling an Execution}

The basic \texttt{perf} command for profiling is:
\begin{verbatim}
        perf record -F 99 --call-graph fp PROGRAM
\end{verbatim}

The \texttt{-F 99} option sets \texttt{perf} to sample at 99Hz, reducing excessive data generation during longer runs and minimising overlap with other periodic activities. The \texttt{--call-graph fp} instructs \texttt{perf} to use frame pointers to get the call graph, followed by the OCaml executable you want to profile. This command creates a \texttt{perf.data} file in the current directory. Alternatively use \texttt{--output} to choose a more descriptive filename.

The \texttt{perf record} command works by copying a segment of the call stack at each sample and recording this data into a \texttt{perf.data} file. These samples can then be processed after recording using \texttt{perf report} to reconstruct the profiled program’s call stack at every sample.

\texttt{perf} uses the symbols present in an OCaml executable, so it helps to understand OCaml's name mangling scheme to map names to OCaml source locations. OCaml 5.4 onwards uses a name mangling scheme of \texttt{caml<MODULE_NAME>\$<FUNCTION_NAME>_<NNN>}, where \texttt{NNN} is a randomly generated number. In OCaml 5.3 the separator is a dot \texttt{caml<MODULE_NAME>.<FUNCTION_NAME>_<NNN>}, and before OCaml 5.1 the separator is double underscores. All formats are supported by \texttt{perf}.

Consider the following program:

\begin{caml_example*}{verbatim}
(* fib.ml *)
module Compute = struct
  let rec fib n =
    if n = 0 then 0
    else if n = 1 then 1
    else fib (n-1) + fib (n-2)
end

let main () =
  let r = Compute.fib 20 in
  Printf.printf "fib(20) = %d" r

let _ = main ()
\end{caml_example*}

This program produces the names \texttt{camlFib\$main_274} for the \texttt{main} function and \texttt{camlFib\$fib_271} for the \texttt{fib} function in the \texttt{Compute} module.

\subsection{s:ocamlprof-printing}{Displaying profiling information}

The \texttt{perf report} command summarises data in the \texttt{perf.data} file.
The basic \texttt{perf report} command is:

\begin{verbatim}
        perf report -f --no-children -i perf.data
\end{verbatim}

This command provides an interactive interface where you can navigate through the accumulated call graphs and select functions and threads for detailed information. Alternatively \texttt{--stdio} will output similar data using a text based report written to stdout. Note that if stack traces appear broken, it may be due to software not having frame pointer support enabled.

Consider the following program which calculates the Tak function.
\begin{caml_example*}{verbatim}
let (x,y,z) =
  try
    let x = int_of_string Sys.argv.(1) in
    let y = int_of_string Sys.argv.(2) in
    let z = int_of_string Sys.argv.(3) in
    (x,y,z)
  with _ -> (18,12,6)

let rec tak x y z =
  if x > y then
    tak (tak (x-1) y z) (tak (y-1) z x) (tak (z-1) x y)
  else z

let main () =
  let r = tak x y z in
  Printf.printf "tak %d %d %d = %d\n" x y z r

let _ = main ()
\end{caml_example*}

The \texttt{perf} report for this might resemble the following.
\begin{verbatim}
Samples: 809  of event 'cycles', Event count (approx.): 24701952617
  Overhead  Command     Shared Object      Symbol
-  100.00%  tak-fp.exe  tak-fp.exe         [.] Tak\$tak_402
     Tak\$tak_402
     Tak\$tak_402
   + Tak\$tak_402
     0.00%  tak-fp.exe  [kernel.kallsyms]  [k] 0xffffb9a5ff79d854
     0.00%  perf-exec   [kernel.kallsyms]  [k] 0xffffb9a5ff719c34
\end{verbatim}

For a more interesting program implementing a Sudoku solver using Norvig's Backtracking algorithm, the output might resemble.

\begin{verbatim}
Samples: 7K of event 'cycles:P', Event count (approx.): 141874642467
  Overhead  Command          Shared Object      Symbol
-   35.90%  sudoku_norvig.e  sudoku_norvig.exe  [.] Sudoku_norvig._pow_747
   - 35.29% Sudoku_norvig$_pow_747
      + 17.70% Sudoku_norvig$eliminate_value_760
      + 15.74% Sudoku_norvig$_pow_747
      + 1.03% Sudoku_norvig$fun_1435
     0.61% Sudoku_norvig$eliminate_value_760
-   12.99%  sudoku_norvig.e  sudoku_norvig.exe  [.] BatList$findnext_1543
   - 12.00% BatList$findnext_1543
      - 11.71% BatList$find_all_1539
         + 8.07% Sudoku_norvig$fun_1421
         + 2.73% Sudoku_norvig$eliminate_948
         + 0.90% Sudoku_norvig$assign_949
-   10.63%  sudoku_norvig.e  sudoku_norvig.exe  [.] Sudoku_norvig$eliminate_value_760
   + 8.29% Sudoku_norvig$eliminate_value_760
   + 1.93% Sudoku_norvig$_pow_747
+    9.50%  sudoku_norvig.e  [kernel.kallsyms]  [k] asm_exc_nmi
+    7.24%  sudoku_norvig.e  sudoku_norvig.exe  [.] Sudoku_norvig$fun_1435
+    5.06%  sudoku_norvig.e  sudoku_norvig.exe  [.] caml_modify
+    3.10%  sudoku_norvig.e  sudoku_norvig.exe  [.] Sudoku_norvig$fun_1360
+    2.11%  sudoku_norvig.e  sudoku_norvig.exe  [.] Sudoku_norvig$eliminate_948
+    2.00%  sudoku_norvig.e  sudoku_norvig.exe  [.] Stdlib.List$iter_366
+    1.54%  sudoku_norvig.e  sudoku_norvig.exe  [.] Sudoku_norvig$fun_1421
+    1.47%  sudoku_norvig.e  sudoku_norvig.exe  [.] caml_apply2
+    1.42%  sudoku_norvig.e  sudoku_norvig.exe  [.] BatList$find_all_1539
+    1.20%  sudoku_norvig.e  sudoku_norvig.exe  [.] Sudoku_norvig$fun_1366
+    1.09%  sudoku_norvig.e  sudoku_norvig.exe  [.] Stdlib.List$fold_left_380
\end{verbatim}
Note this implementation clearly has some optimsation possible in `_pow_747` a recursive Power of 2 function, and parts of the OCaml runtime `caml_modify` are present.

Profiling data can also be visualised as Flame Graphs, which highlight the most frequent code paths in stack traces. The original scripts scripts \texttt{stackcollapse-perf.pl} and \texttt{flamegraph.pl} can be found at Brendan Gregg's \href{https://www.brendangregg.com/flamegraphs.html}{Flame Graphs} web page and work as follows:

\begin{verbatim}
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph

# Collect the results into perf.data
perf record -F 99 --call-graph fp PROGRAM
perf script -i perf.data | ./stackcollapse-perf.pl out.folded
flamegraph.pl out.folded > flamegraph.svg ## Create the FlameGraph svg
\end{verbatim}

Alternatively \href{https://github.com/jonhoo/inferno}{inferno} is a Rust port of the Flame Graphs tools which works in a similar way and is faster to process large perf files.

\begin{verbatim}
cargo install inferno

# Collect the results into perf.data
perf script -i perf.data | inferno-collapse-perf > stacks.folded
cat stacks.folded | inferno-flamegraph > flamegraph.svg
\end{verbatim}

Some \texttt{perf} tools (e.g., \texttt{perf-report(1)} and \texttt{perf-annotate(1)}) use DWARF debugging information to associate symbols with source code locations, if you need these features, the program and it's dependencies need to be compiled with \texttt{-g} to include debugging information in the executable.

Captured profile data can also be processed using \texttt{perf script} in various ways or with online tools like \href{https://www.speedscope.app}{speedscope.app} and \href{https://profiler.firefox.com/}{profiler.firefox.com}, or any other tool that supports \texttt{perf}-formatted data.

\subsection{s:ocamlperf-conclusion}{Conclusion}

For Time profiling of native code, standard tools such as \texttt{perf}, eBPF, DTrace, or Instruments (on macOS) are recommended. Compiling with frame pointers enabled is often necessary for these tools to work effectively. Profiling with \texttt{gprof} is no longer supported.

Enabling frame pointers can impact performance on certain architectures (up to 10\% performance cost on x86_64 has been measured). Users of this feature are encouraged to benchmark their own applications to assess this impact.

\section{s:ocamlprof-instruments}{Time profiling with macOS Instruments}

Instruments is an application performance analyzer and visualizer provided in XCode on macOS. It includes a CLI tool "xctrace" that can capture Time Profiling information about OCaml programs, there is also a GUI application for running and visualising profiles. Here we only cover capturing time profiles with "xctrace", for more information on Instruments consult Apple's documentation.

The same prerequisites hold for Time profiling with macOS Instruments as for Linux perf, consult "Background" (~\ref{s:ocamlprof-time-profiling-background}) from the previous section for an explanation of call graphs and follow the setup from "Compiling for Profiling" (~\ref{s:ocamlprof-compiling-perf}).

The basic \texttt{xctrace} command for time profiling is:

\begin{verbatim}
$ xcrun xctrace record --output . --template "Time Profiler" \
  --target-stdout - --launch -- PROGRAM
\end{verbatim}

The \texttt{--template "Time Profiler"} selects pre-defined template for time profiling which will create a trace file with extension `.trace` that can be opened with Instruments. Use the \texttt{--output} to select a name for the trace file, by default xctrace will create a new file each run rather than overwriting the existing file.

The resulting file can be opened with Instruments to view the profile, and it maybe be possible to generate Flame Graphs using \href{https://github.com/jonhoo/inferno}{inferno} or the scripts from \href{https://github.com/brendangregg/FlameGraph}{https://github.com/brendangregg/FlameGraph}.

\lparagraph{p:ocamlprof-instruments}{Note}
Sometimes Instruments requires adding entitlements to an OCaml binary before it is able to profile it. This typically shows as a failure to launch the process or as a permission error. To work around this issue run the following command, replacing 'PROGRAM' with the program to trace.

\begin{verbatim}
$ codesign -s - -v -f --entitlements =(echo -n
'<?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
   "https://www.apple.com/DTDs/PropertyList-1.0.dtd"\>
 <plist version="1.0">
    <dict>
        <key>com.apple.security.get-task-allow</key>
        <true/>
    </dict>
 </plist>') PROGRAM
\end{verbatim}

For more details on entitlements and codesigning consult Apple's documentation.

\section{s:ocamlprof-glossary}{Glossary}

The following terminology is used in this chapter of the manual.

\begin{itemize}
\item[{\bf Call graph}] The chain of function calls that have lead to the current function (also referred to as a call stack).
\item[{\bf Unwinding}] The process of walking up the call stack (also called stack unwinding).
\item[{\bf Stack frame}] Refers to the portion of the stack allocated to a single function call. (also called an activation frame, activation record or simply frame).
\end{itemize}