With the added TRACE_EVENT macro, the events no longer appear in
the function graph tracer. This was because the function graph
did not know how to display the entries. The graph tracer was
only aware of its own entries and the printk entries.
By using the event call back feature, the graph tracer can now display
the events.
# echo irq > /debug/tracing/set_event
Which can show:
0) | handle_IRQ_event() {
0) | /* irq_handler_entry: irq=48 handler=eth0 */
0) | e1000_intr() {
0) 0.926 us | __napi_schedule();
0) 3.888 us | }
0) | /* irq_handler_exit: irq=48 return=handled */
0) 0.655 us | runqueue_is_locked();
0) | __wake_up() {
0) 0.831 us | _spin_lock_irqsave();
The irq entry and exit events show up as comments.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
The function depth in trace_printk was to facilitate the function
graph output. Now that the function graph calculates the depth within
the trace output, we no longer need to record the depth when the
trace_printk is called.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Currently, the function graph tracer depends on the trace_printk
to record the depth. All the information is already there in the trace
to calculate function depth, with the exception of having the printk
be the first item. But as soon as a entry or exit is reached, then
we know the depth.
This patch changes the iter->private data from recording a per cpu
last_pid, to a structure that holds both the last_pid and the current
depth. This data is used to determine the function depth for the
printks.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch makes print_printk_msg_only and print_bprintk_msg_only
global for other functions to use. It also renames them by adding
a "trace_" to the beginning to avoid namespace collisions.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix warning with irqsoff tracer
The ring buffer allocates its buffers on pre-smp time (early_initcall).
It means that, at first, only the boot cpu buffer is allocated and
the ring-buffer cpumask only has the boot cpu set (cpu_online_mask).
Later, the secondary cpu will show up and the ring-buffer will be notified
about this event: the appropriate buffer will be allocated and the cpumask
will be updated.
Unfortunately, if !CONFIG_CPU_HOTPLUG, the ring-buffer will not be
notified about the secondary cpus, meaning that the cpumask will have
only the cpu boot set, and only one cpu buffer allocated.
We fix that by using cpu_possible_mask if !CONFIG_CPU_HOTPLUG.
This patch fixes the following warning with irqsoff tracer running:
[ 169.317794] WARNING: at kernel/trace/trace.c:466 update_max_tr_single+0xcc/0xf3()
[ 169.318002] Hardware name: AMILO Li 2727
[ 169.318002] Modules linked in:
[ 169.318002] Pid: 5624, comm: bash Not tainted 2.6.29-rc8-tip-02636-g6aafa6c #11
[ 169.318002] Call Trace:
[ 169.318002] [<ffffffff81036182>] warn_slowpath+0xea/0x13d
[ 169.318002] [<ffffffff8100b9d6>] ? ftrace_call+0x5/0x2b
[ 169.318002] [<ffffffff8100b9d6>] ? ftrace_call+0x5/0x2b
[ 169.318002] [<ffffffff8100b9d1>] ? ftrace_call+0x0/0x2b
[ 169.318002] [<ffffffff8101ef10>] ? ftrace_modify_code+0xa9/0x108
[ 169.318002] [<ffffffff8106e27f>] ? trace_hardirqs_off+0x25/0x27
[ 169.318002] [<ffffffff8149afe7>] ? _spin_unlock_irqrestore+0x1f/0x2d
[ 169.318002] [<ffffffff81064f52>] ? ring_buffer_reset_cpu+0xf6/0xfb
[ 169.318002] [<ffffffff8106637c>] ? ring_buffer_reset+0x36/0x48
[ 169.318002] [<ffffffff8106aeda>] update_max_tr_single+0xcc/0xf3
[ 169.318002] [<ffffffff8100bc17>] ? sysret_check+0x22/0x5d
[ 169.318002] [<ffffffff8106e3ea>] stop_critical_timing+0x142/0x204
[ 169.318002] [<ffffffff8106e4cf>] trace_hardirqs_on_caller+0x23/0x25
[ 169.318002] [<ffffffff8149ac28>] trace_hardirqs_on_thunk+0x3a/0x3c
[ 169.318002] [<ffffffff8100bc17>] ? sysret_check+0x22/0x5d
[ 169.318002] ---[ end trace db76cbf775a750cf ]---
Because this tracer may try to swap two cpu ring buffers for an
unregistered cpu on the ring buffer.
This patch might also fix a fair loss of traces due to unallocated buffers
for secondary cpus.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-b: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1237470453-5427-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up
The prologue of the function graph entry, return and comments all
start out pretty much the same. Each of these duplicate code and
do so slightly differently.
This patch consolidates the printing of the pid, absolute time,
cpu and proc (and for entry, the interrupt).
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
There is currently no easy way to clear the trace buffer. Currently
the only way is to change the current tracer.
This patch lets the user clear the trace buffer by simply writing
into the trace files.
echo > /debug/tracing/trace
or to clear a single cpu (i.e. for CPU 1):
echo > /debug/tracing/per_cpu/cpu1/trace
Requested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix command line to pid mapping
map_cmdline_to_pid[] is checked in trace_save_cmdline(), but never
updated. This results in stale pid to command line mappings and the
tracer output will associate the wrong comm string.
Signed-off-by: Carsten Emde <Carsten.Emde@osadl.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: prevent stale command line output
In case there is no valid command line mapping for a pid
trace_find_cmdline() returns without updating the comm buffer. The
trace dump keeps the previous entry which results in confusing trace
output:
<idle>-0 [000] 280.702056 ....
<idle>-23456 [000] 280.702080 ....
Update the comm buffer with "<...>" when no mapping is found.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
The command line recorder uses (unsigned) -1 to mark non mapped
entries in the pid to command line maps. The validity check is
completely unintuitive: idx >= SAVED_CMDLINES
There is no need for such casting games. Use a constant to mark
unmapped entries and check for that constant to make the code readable
and understandable.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: prevent overwrite of command line entries
When the tracer is stopped the command line recording continues to
record. The check for tracing_is_on() is not sufficient here as the
ringbuffer status is not affected by setting
debug/tracing/tracing_enabled to 0. On a non idle system this can
result in the loss of the command line information for the stopped
trace, which makes the trace harder to read and analyse.
Check tracer_enabled to allow further recording.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The start/stop methods of a tracer should be able to be executed
in all contexts. This patch converts the power tracer to do so.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
The stopping and starting of a tracer should be light weight and
be able to be called in all contexts. The sched_switch grabbed
mutexes in the start/stop functions. This patch changes it to a
simple variable, on/off.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: feature to allow better serialized clock
This patch adds an option called "global-clock" that will allow
the tracer to switch to a slower but more accurate (across CPUs)
clock.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch adds a new function called ring_buffer_set_clock that
allows a tracer to assign its own clock source to the buffer.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix memory leak
If event_format_read() exits early due to nonzero ppos, the
previous kmalloc doesn't get freed - might as well do the
check before the kmalloc and avoid the problem.
Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: =?ISO-8859-1?Q?Fr=E9d=E9ric?= Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1237270859.8033.141.camel@charm-linux>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix for losing comms in trace
The command lines of tasks are cached at sched switch to not need
to record them at every trace point. Disabling the tracing on stops
the recording of traces, but does not stop the caching of command lines.
When the tracing is off the cache may overflow and cause the tracing
to show incorrect tasks matching the PIDs.
This patch disables prevents updates to the comm cache when the ring buffer
is off.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix to one cause of incorrect comm outputs in trace
The spinlock only protected the creation of a comm <=> pid pair.
But it was possible that a reader could look up a pid, and get the
wrong comm because it had no locking.
This also required changing trace_find_cmdline to copy the comm cache
and not just send back a pointer to it.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix a dynamic tracing failure
Recently, the function and function graph tracers failed to use dynamic
tracing after the following commit:
fa9d13cf13
(ftrace: don't try to __ftrace_replace_code on !FTRACE_FL_CONVERTED rec)
The patch is right except a mistake on the check for the FTRACE_FL_CONVERTED
flag. The code patching is aborted in case of successfully nopped sites.
What we want is the opposite: ignore the callsites that haven't been nopped.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix crashes when tracing cpumasks
While ring-buffer allocation, the cpumasks are allocated too,
including the tracing cpumask and the per-cpu file mask handler.
But these cpumasks are freed accidentally just after.
Fix it.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1237164303-11476-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix possible locking imbalance
In case of ring buffer resize failure, tracing_set_tracer forgot to
release trace_types_lock. Fix it.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1237151439-6755-5-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Syscall tracing must select kallsysms.
The arch code builds a table to find the syscall metadata by syscall
number. It needs the syscalls names resolution from the symbol table
to know which name found on the syscalls metadatas match a function
pointer from the arch sys_call_table.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1237151439-6755-4-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix syscall tracer enable/disable race
The current thread flag toggling is racy as shown in the following
scenario:
- task A is the last user of syscall tracing, it releases the
TIF_SYSCALL_FTRACE on each tasks
- at the same time task B start syscall tracing. refcount == 0 so
it sets up TIF_SYSCALL_FTRACE on each tasks.
The effect of the mixup is unpredictable.
So this fix adds a mutex on {start,stop}_syscall_tracing().
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1237151439-6755-3-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix 'stuck' syscall tracer
The syscall tracer uses a refcounter to enable several users
simultaneously.
But the refcounter did not behave correctly and always restored
its value to 0 after calling start_syscall_tracing(). Therefore,
stop_syscall_tracing() couldn't release correctly the tasks from
tracing.
Also the tracer forgot to reset the buffer when it is released.
Drop the pointless refcount decrement on start_syscall_tracing()
and reset the buffer when we release the tracer.
This fixes two reported issue:
- when we switch from syscall tracer to another tracer, syscall
tracing continued.
- incorrect use of the refcount.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1237151439-6755-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new feature
This adds the generic support for syscalls tracing. This is
currently exploited through a devoted tracer but other tracing
engines can use it. (They just have to play with
{start,stop}_ftrace_syscalls() and use the display callbacks
unless they want to override them.)
The syscalls prototypes definitions are abused here to steal
some metadata informations:
- syscall name, param types, param names, number of params
The syscall addr is not directly saved during this definition
because we don't know if its prototype is available in the
namespace. But we don't really need it. The arch has just to
build a function able to resolve the syscall number to its
metadata struct.
The current tracer prints the syscall names, parameters names
and values (and their types optionally). Currently the value is
a raw hex but higher level values diplaying is on my TODO list.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1236955332-10133-2-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: save memory
The struct dyn_ftrace table is very large, this patch will save
about 50%.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <srostedt@redhat.com>
LKML-Reference: <49BA2C9F.8020009@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
VFS layer has tested the file mode, we do not need test it.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <srostedt@redhat.com>
LKML-Reference: <49BA2BAB.6010608@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Do __ftrace_replace_code for !FTRACE_FL_CONVERTED rec will always
fail, we should ignore this rec.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: "Steven Rostedt ;" <rostedt@goodmis.org>
LKML-Reference: <49BA2472.4060206@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If dyn_ftrace is freed before ftrace_release(), ftrace_release()
will free it again and make ftrace_free_records wrong.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: "Steven Rostedt ;" <rostedt@goodmis.org>
LKML-Reference: <49BA23D9.1050900@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Provide basic callbacks to do syscall tracing.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <1236401580-5758-2-git-send-email-fweisbec@gmail.com>
[ simplified it to a trace_printk() for now. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The binary_buffers directory in /debugfs/tracing held the files
to read the trace buffers in a binary format. This held one file
per CPU buffer. But we also have a per_cpu directory that holds
a way to read the pretty-print formats.
This patch moves the binary buffers into the per_cpu_directory:
# ls /debug/tracing/per_cpu/cpu1/
trace trace_pipe trace_pipe_raw
The new name is called "trace_pipe_raw". The binary buffers always
acted similar to trace_pipe, except that they produce raw data.
Requested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: documentation
The use of the double __builtin_contant_p checks in the event_trace_printk
can be confusing to developers and reviewers. This patch adds a comment
to explain why it is there.
Requested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
LKML-Reference: <20090313122235.43EB.A69D9226@jp.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
In a private email conversation I explained how the ring buffer
page worked by using silly ASCII art. Ingo suggested that I add
that to the comments of the code.
Here it is.
Requested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Unlike TRACE_FORMAT() macros, the TRACE_EVENT() macros do not show
the event name in the trace file. Knowing the event type in the trace
output is very useful.
Instead of:
task swapper:0 [140] ==> ntpd:3308 [120]
We now have:
sched_switch: task swapper:0 [140] ==> ntpd:3308 [120]
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
commit c3ffc7a40b
"Don't use tracing_record_cmdline() in workqueue tracer"
has a race window.
find_task_by_vpid() requires task_list_lock().
LKML-Reference: <20090313090042.43CD.A69D9226@jp.fujitsu.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
If the stack tracing is disabled (by default) the stack_trace file
will only contain the header:
# cat /debug/tracing/stack_trace
Depth Size Location (0 entries)
----- ---- --------
This can be frustrating to a developer that does not realize that the
stack tracer is disabled. This patch adds the following text:
# cat /debug/tracing/stack_trace
Depth Size Location (0 entries)
----- ---- --------
#
# Stack tracer disabled
#
# To enable the stack tracer, either add 'stacktrace' to the
# kernel command line
# or 'echo 1 > /proc/sys/kernel/stack_tracer_enabled'
#
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
The binary printk saves a pointer to the format string in the ring buffer.
On output, the format is processed. But if the user is reading the
ring buffer through a binary interface, the pointer is meaningless.
This patch creates a file called printk_formats that maps the pointers
to the formats.
# cat /debug/tracing/printk_formats
0xffffffff80713d40 : "irq_handler_entry: irq=%d handler=%s\n"
0xffffffff80713d48 : "lock_acquire: %s%s%s\n"
0xffffffff80713d50 : "lock_release: %s\n"
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: speed up on event tracing
The event_trace_printk is currently a wrapper function that calls
trace_vprintk. Because it uses a variable for the fmt it misses out
on the optimization of using the binary printk.
This patch makes event_trace_printk into a macro wrapper to use the
fmt as the same as the trace_printks.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix callsites with dynamic format strings
Since its new binary implementation, trace_printk() internally uses static
containers for the format strings on each callsites. But the value is
assigned once at build time, which means that it can't take dynamic
formats.
So this patch unearthes the raw trace_printk implementation for the callers
that will need trace_printk to be able to carry these dynamic format
strings. The trace_printk() macro will use the appropriate implementation
for each callsite. Most of the time however, the binary implementation will
still be used.
The other impact of this patch is that mmiotrace_printk() will use the old
implementation because it calls the low level trace_vprintk and we can't
guess here whether the format passed in it is dynamic or not.
Some parts of this patch have been written by Steven Rostedt (most notably
the part that chooses the appropriate implementation for each callsites).
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: do not confuse user on small trace buffer sizes
When the system boots up, the trace buffer is small to conserve memory.
It is only two pages per online CPU. When the tracer is used, it expands
to the default value.
This can confuse the user if they look at the buffer size and see only
7, but then later they see 1408.
# cat /debug/tracing/buffer_size_kb
7
# echo sched_switch > /debug/tracing/current_tracer
# cat /debug/tracing/buffer_size_kb
1408
This patch tries to help remove this confustion by showing that the
buffer has not been expanded.
# cat /debug/tracing/buffer_size_kb
7 (expanded: 1408)
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: speed up and remove possible races
The get_online_cpus was added to the ring buffer because the original
design would free the ring buffer on a CPU that was being taken
off line. The final design kept the ring buffer around even when the
CPU was taken off line. This is to allow a user to still read the
information on that ring buffer.
Most of the get_online_cpus are no longer needed since the ring buffer will
not disappear from the use cases.
Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
The hotplug code in the ring buffers is for use with CPU hotplug,
not generic hotplug.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: prevent races with ring_buffer_expanded
This patch places the expanding of the tracing buffer under the
protection of the trace_types_lock mutex. It is highly unlikely
that there would be any contention, but better safe than sorry.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: cleanup
Some of the comments about the trace buffer resizing is gobbledygook.
And I wonder why people question if I'm a native English speaker.
This patch makes the comments make a bit more sense.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: save on memory
Currently, a ring buffer was allocated for each "possible_cpus". On
some systems, this is the same as NR_CPUS. Thus, if a system defined
NR_CPUS = 64 but it only had 1 CPU, we could have possibly 63 useless
ring buffers taking up space. With a default buffer of 3 megs, this
could be quite drastic.
This patch changes the ring buffer code to only allocate ring buffers
for online CPUs. If a CPU goes off line, we do not free the buffer.
This is because the user may still have trace data in that buffer
that they would like to look at.
Perhaps in the future we could add code to delete a ring buffer if
the CPU is offline and the ring buffer becomes empty.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix to task live locking on reading trace_pipe on one CPU
The same code is used for both trace_pipe (all CPUS) and the per_cpu
trace_pipe file. When there is no data to read, it will check for
signals and wait on the trace wait queue.
The problem happens with the per_cpu wait. The trace_wait code checks
all CPUs. Thus, if there's data in another CPU buffer, then it will
exit the wait, without checking for signals or waiting on the wait queue.
It would then try to read the empty buffer, and since that will just
return nothing, then it will try to wait again. Unfortunately, that will
again fail due to there still being data in the other buffers. This
ends up with a live lock for the task.
This patch fixes the trace_wait to be aware that the iterator may only
be waiting on a single buffer.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
To save memory, the tracer ring buffers are set to a minimum.
The activating of a trace expands the ring buffer size. This patch
adds this expanding, when an event is activated.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: less memory impact on systems not using tracer
When the kernel boots up that has tracing configured, it allocates
the default size of the ring buffer. This currently happens to be
1.4Megs per possible CPU. This is quite a bit of wasted memory if
the system is never using the tracer.
The current solution is to keep the ring buffers to a minimum size
until the user uses them. Once a tracer is piped into the current_tracer
the ring buffer will be expanded to the default size. If the user
changes the size of the ring buffer, it will take the size given
by the user immediately.
If the user adds a "ftrace=" to the kernel command line, then the ring
buffers will be set to the default size on initialization.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: prevent locking up by lockdep tracer
The lockdep tracer uses trace_vprintk and thus trace_vprintk can not
call back into lockdep without locking up.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Using the function_graph tracer in recent kernels generates a spew of
preemption BUGs. Fix this by not requiring trace_clock_local() users
to disable preemption themselves.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up
There existed a lot of <space><tab>'s in the tracing code. This
patch removes them.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up / comments
The comments that described the ftrace macros to manipulate the
TRACE_EVENT and TRACE_FORMAT macros no longer match the code.
This patch updates them.
Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
In trying to stay consistant with the C style format in the TRACE_EVENT
macro, it makes more sense to do the printk after the assigning of
the variables.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
The event directory files type and available_types were no longer
needed with the new TRACE_EVENT_FORMAT macros, they were deleted.
But by accident the available_events file was also removed.
This patch brings it back.
Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix to prevent crash on calling NULL function pointer
The ftrace internal records have their format exported via the event
system under the ftrace subsystem. These are only for exporting the
format to allow binary readers to be able to parse them in a binary
output.
The ftrace subsystem events can only be enabled via the ftrace tracers
and do not have a registering function. The event files expect the
event record to have registering function and will call it directly.
Passing in a ftrace subsystem event will cause the kernel to crash
because it will execute a NULL pointer.
This patch prevents the ftrace subsystem from being viewable to the
event enabling files.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
The offsetof and sizeof are of type size_t, and instead of typecasting
them to unsigned int for printk formatting, one could just use %zu.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
"for (++cpu ; cpu < num_possible_cpus(); cpu++)" statement assumes
possible cpus have continuous number - but that's a wrong assumption.
Insted, cpumask_next() should be used.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20090310104437.A480.A69D9226@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up
The TRACE_EVENT_FORMAT macro is no longer used by trace points
and only the DECLARE_TRACE, TRACE_FORMAT or TRACE_EVENT macros should
be used by them. Although the TRACE_EVENT_FORMAT macro is still used
by the internal tracing utility, it should not be used in core
kernel code.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up and enhancement
The TRACE_EVENT_FORMAT macro looks quite ugly and is limited in its
ability to save data as well as to print the record out. Working with
Ingo Molnar, we came up with a new format that is much more pleasing to
the eye of C developers. This new macro is more C style than the old
macro, and is more obvious to what it does.
Here's the example. The only updated macro in this patch is the
sched_switch trace point.
The old method looked like this:
TRACE_EVENT_FORMAT(sched_switch,
TP_PROTO(struct rq *rq, struct task_struct *prev,
struct task_struct *next),
TP_ARGS(rq, prev, next),
TP_FMT("task %s:%d ==> %s:%d",
prev->comm, prev->pid, next->comm, next->pid),
TRACE_STRUCT(
TRACE_FIELD(pid_t, prev_pid, prev->pid)
TRACE_FIELD(int, prev_prio, prev->prio)
TRACE_FIELD_SPECIAL(char next_comm[TASK_COMM_LEN],
next_comm,
TP_CMD(memcpy(TRACE_ENTRY->next_comm,
next->comm,
TASK_COMM_LEN)))
TRACE_FIELD(pid_t, next_pid, next->pid)
TRACE_FIELD(int, next_prio, next->prio)
),
TP_RAW_FMT("prev %d:%d ==> next %s:%d:%d")
);
The above method is hard to read and requires two format fields.
The new method:
/*
* Tracepoint for task switches, performed by the scheduler:
*
* (NOTE: the 'rq' argument is not used by generic trace events,
* but used by the latency tracer plugin. )
*/
TRACE_EVENT(sched_switch,
TP_PROTO(struct rq *rq, struct task_struct *prev,
struct task_struct *next),
TP_ARGS(rq, prev, next),
TP_STRUCT__entry(
__array( char, prev_comm, TASK_COMM_LEN )
__field( pid_t, prev_pid )
__field( int, prev_prio )
__array( char, next_comm, TASK_COMM_LEN )
__field( pid_t, next_pid )
__field( int, next_prio )
),
TP_printk("task %s:%d [%d] ==> %s:%d [%d]",
__entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
__entry->next_comm, __entry->next_pid, __entry->next_prio),
TP_fast_assign(
memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
__entry->prev_pid = prev->pid;
__entry->prev_prio = prev->prio;
memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
__entry->next_pid = next->pid;
__entry->next_prio = next->prio;
)
);
This macro is called TRACE_EVENT, it is broken up into 5 parts:
TP_PROTO: the proto type of the trace point
TP_ARGS: the arguments of the trace point
TP_STRUCT_entry: the structure layout of the entry in the ring buffer
TP_printk: the printk format
TP_fast_assign: the method used to write the entry into the ring buffer
The structure is the definition of how the event will be saved in the
ring buffer. The printk is used by the internal tracing in case of
an oops, and the kernel needs to print out the format of the record
to the console. This the TP_printk gives a means to show the records
in a human readable format. It is also used to print out the data
from the trace file.
The TP_fast_assign is executed directly. It is basically like a C function,
where the __entry is the handle to the record.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
This removes the custom made STR(x) macros in the tracer and uses
the generic __stringify macro instead.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
The macros TPPROTO, TPARGS, TPFMT, TPRAWFMT, and TPCMD all look a bit
ugly. This patch adds an underscore to their names.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix compiler warnings
On x86_64 sizeof and offsetof are treated as long, where as on x86_32
they are int. This patch typecasts them to unsigned int to avoid
one arch giving warnings while the other does not.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: improve workqueue tracer output
Currently, /sys/kernel/debug/tracing/trace_stat/workqueues can display
wrong and strange thread names.
Why?
Currently, ftrace has tracing_record_cmdline()/trace_find_cmdline()
convenience function that implements a task->comm string cache.
This can avoid unnecessary memcpy overhead and the workqueue tracer
uses it.
However, in general, any trace statistics feature shouldn't use
tracing_record_cmdline() because trace statistics can display
very old process. Then comm cache can return wrong string because
recent process overrides the cache.
Fortunately, workqueue trace guarantees that displayed processes
are live. Thus we can search comm string from PID at display time.
<before>
% cat workqueues
# CPU INSERTED EXECUTED NAME
# | | | |
7 431913 431913 kondemand/7
7 0 0 tail
7 21 21 git
7 0 0 ls
7 9 9 cat
7 832632 832632 unix_chkpwd
7 236292 236292 ls
Note: tail, git, ls, cat unix_chkpwd are obiously not workqueue thread.
<after>
% cat workqueues
# CPU INSERTED EXECUTED NAME
# | | | |
7 510 510 kondemand/7
7 0 0 kmpathd/7
7 15 15 ata/7
7 0 0 aio/7
7 11 11 kblockd/7
7 1063 1063 work_on_cpu/7
7 167 167 events/7
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Remove a few leftovers and clean up the code a bit.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1236356510-8381-5-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: faster and lighter tracing
Now that we have trace_bprintk() which is faster and consume lesser
memory than trace_printk() and has the same purpose, we can now drop
the old implementation in favour of the binary one from trace_bprintk(),
which means we move all the implementation of trace_bprintk() to
trace_printk(), so the Api doesn't change except that we must now use
trace_seq_bprintk() to print the TRACE_PRINT entries.
Some changes result of this:
- Previously, trace_bprintk depended of a single tracer and couldn't
work without. This tracer has been dropped and the whole implementation
of trace_printk() (like the module formats management) is now integrated
in the tracing core (comes with CONFIG_TRACING), though we keep the file
trace_printk (previously trace_bprintk.c) where we can find the module
management. Thus we don't overflow trace.c
- changes some parts to use trace_seq_bprintk() to print TRACE_PRINT entries.
- change a bit trace_printk/trace_vprintk macros to support non-builtin formats
constants, and fix 'const' qualifiers warnings. But this is all transparent for
developers.
- etc...
V2:
- Rebase against last changes
- Fix mispell on the changelog
V3:
- Rebase against last changes (moving trace_printk() to kernel.h)
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1236356510-8381-5-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add a generic printk() for tracing, like trace_printk()
trace_bprintk() uses the infrastructure to record events on ring_buffer.
[ fweisbec@gmail.com: ported to latest -tip, made it work if
!CONFIG_MODULES, never free the format strings from modules
because we can't keep track of them and conditionnaly create
the ftrace format strings section (reported by Steven Rostedt) ]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1236356510-8381-4-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: save on memory for tracing
Current tracers are typically using a struct(like struct ftrace_entry,
struct ctx_switch_entry, struct special_entr etc...)to record a binary
event. These structs can only record a their own kind of events.
A new kind of tracer need a new struct and a lot of code too handle it.
So we need a generic binary record for events. This infrastructure
is for this purpose.
[fweisbec@gmail.com: rebase against latest -tip, make it safe while sched
tracing as reported by Steven Rostedt]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1236356510-8381-3-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix deadlock while using set_ftrace_pid
Reproducer:
# cd /sys/kernel/debug/tracing
# echo $$ > set_ftrace_pid
then, console becomes hung.
Details:
when writing set_ftracepid, kernel callstack is following
ftrace_pid_write()
mutex_lock(&ftrace_lock);
ftrace_update_pid_func()
mutex_lock(&ftrace_lock);
mutex_unlock(&ftrace_lock);
mutex_unlock(&ftrace_lock);
then, system always deadlocks when ftrace_pid_write() is called.
In past days, ftrace_pid_write() used ftrace_start_lock, but
commit e6ea44e9b4 consolidated
ftrace_start_lock to ftrace_lock.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <srostedt@redhat.com>
LKML-Reference: <20090306151155.0778.A69D9226@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: allow user apps to read binary format of basic ftrace entries
Currently, only defined raw events export their formats so a binary
reader can parse them. There's no reason that the default ftrace entries
can't export their formats.
This patch adds a subsystem called "ftrace" in the events directory
that includes the ftrace entries for basic ftrace recorded items.
These only have three files in the events directory:
type : printf
available_types : printf
format : format for the event entry
For example:
# cat /debug/tracing/events/ftrace/wakeup/format
name: wakeup
ID: 3
format:
field:unsigned char type; offset:0; size:1;
field:unsigned char flags; offset:1; size:1;
field:unsigned char preempt_count; offset:2; size:1;
field:int pid; offset:4; size:4;
field:int tgid; offset:8; size:4;
field:unsigned int prev_pid; offset:12; size:4;
field:unsigned char prev_prio; offset:16; size:1;
field:unsigned char prev_state; offset:17; size:1;
field:unsigned int next_pid; offset:20; size:4;
field:unsigned char next_prio; offset:24; size:1;
field:unsigned char next_state; offset:25; size:1;
field:unsigned int next_cpu; offset:28; size:4;
print fmt: "%u:%u:%u ==+ %u:%u:%u [%03u]"
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
Move the macro that creates the event format file to a separate header.
This will allow the default ftrace events to use this same macro
to create the formats to read those events.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: cleanup
All file_operations structures should be constant. No one is going to
change them.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Clean up menu structure, introduce TRACING_SUPPORT switch that signals
whether an architecture supports various instrumentation mechanisms.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: decrease hangs risks with the graph tracer on slow systems
Since the function graph tracer can spend too much time on timer
interrupts, it's better now to use the more lightweight local
clock. Anyway, the function graph traces are more reliable on a
per cpu trace.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <49af243d.06e9300a.53ad.ffff840c@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Use a more generic name - this also allows the prototype to move
to kernel.h and be generally available to kernel developers who
want to do some quick tracing.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The latency tracers (irqsoff, preemptoff, preemptirqsoff, and wakeup)
are pretty useless with the default output format. This patch makes them
automatically enable the latency format when they are selected. They
also record the state of the latency option, and if it was not enabled
when selected, they disable it on reset.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
Both print_lat_fmt and print_trace_fmt do pretty much the same thing
except for one different function call. This patch consolidates the
two functions and adds an if statement to perform the difference.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
The trace and latency_trace function pointers are identical for
every tracer but the function tracer. The differences in the function
tracer are trivial (latency output puts paranthesis around parent).
This patch removes the latency_trace pointer and all prints will
now just use the trace output function pointer.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
With the removal of the latency_trace file, we lost the ability
to see some of the finer details in a trace. Like the state of
interrupts enabled, the preempt count, need resched, and if we
are in an interrupt handler, softirq handler or not.
This patch simply creates an option to bring back the old format.
This also removes the warning about an unused variable that held
the latency_trace file operations.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
The buffer used by trace_seq was updated incorrectly. Instead
of consuming what was actually read, it consumed the rest of the
buffer on reads.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix trace read to conform to standards
Andrew Morton, Theodore Tso and H. Peter Anvin brought to my attention
that a userspace read should not return -EFAULT if it succeeded in
copying anything. It should only return -EFAULT if it failed to copy
at all.
This patch modifies the check of copy_from_user and updates the return
code appropriately.
I also used H. Peter Anvin's short cut rule to just test ret == count.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
If a partial ring_buffer_page_read happens, then some of the
incremental timestamps may be lost. This patch writes the
recent timestamp into the page that is passed back to the caller.
A partial ring_buffer_page_read is where the full page would not
be written back to the user, and instead, just part of the page
is copied to the user. A full page would be a page swap with the
ring buffer and the timestamps would be correct.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix to ftrace_dump output corruption
The commit: b04cc6b1f6
tracing/core: introduce per cpu tracing files
added a new field to the iterator called cpu_file. This was a handle
to differentiate between the per cpu trace output files and the
all cpu "trace" file. The all cpu "trace" file required setting this
to TRACE_PIPE_ALL_CPU.
The problem is that the ftrace_dump sets up its own iterator but was
not updated to handle this change. The result was only CPU 0 printing
out on crash and a lot of "<0>"'s also being printed.
Reported-by: Thomas Gleixner <tglx@linuxtronix.de>
Tested-by: Darren Hart <dvhtc@us.ibm.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: new feature
This patch creates a directory of files that correspond to the
per CPU ring buffers. These are binary files and are made to
be used with splice. This is the fastest way to extract data from
the ftrace ring buffers.
Thanks to Jiaying Zhang for pushing me to get this code fixed,
and to Eduard - Gabriel Munteanu for his splice code that helped
me debug my code.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: dont leave holes in read buffer page
The ring_buffer_read_page swaps a given page with the reader page
of the ring buffer, if certain conditions are set:
1) requested length is big enough to hold entire page data
2) a writer is not currently on the page
3) the page is not partially consumed.
Instead of swapping with the supplied page. It copies the data to
the supplied page instead. But currently the data is copied in the
same offset as the source page. This causes a hole at the start
of the reader page. This complicates the use of this function.
Instead, it should copy the data at the beginning of the function
and update the index fields accordingly.
Other small clean ups are also done in this patch.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix to possible alignment problems on some archs.
Some arch compilers include an NULL char array in the sizeof field.
Since the ring_buffer_event type includes one of these, it is better
to use the "offsetof" instead, to avoid strange bugs on these archs.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
The ring_buffer_read_page was broken if it were to only copy part
of the page. This patch fixes that up as well as adds a parameter
to allow a length field, in order to only copy part of the buffer page.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix ring_buffer_read_page
After a page is swapped into the ring buffer, the write field must
also be reset.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
The registering of events had the return value check backwards.
A zero returned is success, the check had it as a failure.
This patch also fixes a missing "\n" in the warning that the check
failed.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
To be able to identify the trace in the binary format output, the
id of the trace event (which is dynamically assigned) must also be listed.
This patch adds the name of the trace point as well as the id assigned.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch includes the ftrace header to the event formats files:
# cat /debug/tracing/events/sched/sched_switch/format
field:unsigned char type; offset:0; size:1;
field:unsigned char flags; offset:1; size:1;
field:unsigned char preempt_count; offset:2; size:1;
field:int pid; offset:4; size:4;
field:int tgid; offset:8; size:4;
field:pid_t prev_pid; offset:12; size:4;
field:int prev_prio; offset:16; size:4;
field special:char next_comm[TASK_COMM_LEN]; offset:20; size:16;
field:pid_t next_pid; offset:36; size:4;
field:int next_prio; offset:40; size:4;
A blank line is used as a deliminator between the ftrace header and the
trace point fields.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch adds the "format" file to the trace point event directory.
This is based off of work by Tom Zanussi, in which a file is exported
to be tread from user land such that a user space app may read the
binary record stored in the ring buffer.
# cat /debug/tracing/events/sched/sched_switch/format
field:pid_t prev_pid; offset:12; size:4;
field:int prev_prio; offset:16; size:4;
field special:char next_comm[TASK_COMM_LEN]; offset:20; size:16;
field:pid_t next_pid; offset:36; size:4;
field:int next_prio; offset:40; size:4;
Idea-from: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
The trace_seq functions may be used separately outside of the ftrace
iterator. The trace_seq_reset is needed for these operations.
This patch also renames trace_seq_reset to the more appropriate
trace_seq_init.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
The trace event objects are currently not proctected against
reentrancy. This patch adds a mutex around the modifications of
the trace event fields.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Tom Zanussi pointed out that the simple TRACE_FIELD was not enough to
record trace data that required memcpy. This patch addresses this issue
by adding a TRACE_FIELD_SPECIAL. The format is similar to TRACE_FIELD
but looks like so:
TRACE_FIELD_SPECIAL(type_item, item, cmd)
What TRACE_FIELD gave was:
TRACE_FIELD(type, item, assign)
The TRACE_FIELD would be used in declaring a structure:
struct {
type item;
};
And later assign it via:
entry->item = assign;
What TRACE_FIELD_SPECIAL gives us is:
In the declaration of the structure:
struct {
type_item;
};
And the assignment:
cmd;
This change log will explain the one example used in the patch:
TRACE_EVENT_FORMAT(sched_switch,
TPPROTO(struct rq *rq, struct task_struct *prev,
struct task_struct *next),
TPARGS(rq, prev, next),
TPFMT("task %s:%d ==> %s:%d",
prev->comm, prev->pid, next->comm, next->pid),
TRACE_STRUCT(
TRACE_FIELD(pid_t, prev_pid, prev->pid)
TRACE_FIELD(int, prev_prio, prev->prio)
TRACE_FIELD_SPECIAL(char next_comm[TASK_COMM_LEN],
next_comm,
TPCMD(memcpy(TRACE_ENTRY->next_comm,
next->comm,
TASK_COMM_LEN)))
TRACE_FIELD(pid_t, next_pid, next->pid)
TRACE_FIELD(int, next_prio, next->prio)
),
TPRAWFMT("prev %d:%d ==> next %s:%d:%d")
);
The struct will be create as:
struct {
pid_t prev_pid;
int prev_prio;
char next_comm[TASK_COMM_LEN];
pid_t next_pid;
int next_prio;
};
Note the TRACE_ENTRY in the cmd part of TRACE_SPECIAL. TRACE_ENTRY will
be set by the tracer to point to the structure inside the trace buffer.
entry->prev_pid = prev->pid;
entry->prev_prio = prev->prio;
memcpy(entry->next_comm, next->comm, TASK_COMM_LEN);
entry->next_pid = next->pid;
entry->next_prio = next->prio
Reported-by: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch adds the interface to enable the C style trace points.
In the directory /debugfs/tracing/events/subsystem/event
We now have three files:
enable : values 0 or 1 to enable or disable the trace event.
available_types: values 'raw' and 'printf' which indicate the tracing
types available for the trace point. If a developer does not
use the TRACE_EVENT_FORMAT macro and just uses the TRACE_FORMAT
macro, then only 'printf' will be available. This file is
read only.
type: values 'raw' or 'printf'. This indicates which type of tracing
is active for that trace point. 'printf' is the default and
if 'raw' is not available, this file is read only.
# echo raw > /debug/tracing/events/sched/sched_wakeup/type
# echo 1 > /debug/tracing/events/sched/sched_wakeup/enable
Will enable the C style tracing for the sched_wakeup trace point.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: lower overhead tracing
The current event tracer can automatically pick up trace points
that are registered with the TRACE_FORMAT macro. But it required
a printf format string and parsing. Although, this adds the ability
to get guaranteed information like task names and such, it took
a hit in overhead processing. This processing can add about 500-1000
nanoseconds overhead, but in some cases that too is considered
too much and we want to shave off as much from this overhead as
possible.
Tom Zanussi recently posted tracing patches to lkml that are based
on a nice idea about capturing the data via C structs using
STRUCT_ENTER, STRUCT_EXIT type of macros.
I liked that method very much, but did not like the implementation
that required a developer to add data/code in several disjoint
locations.
This patch extends the event_tracer macros to do a similar "raw C"
approach that Tom Zanussi did. But instead of having the developers
needing to tweak a bunch of code all over the place, they can do it
all in one macro - preferably placed near the code that it is
tracing. That makes it much more likely that tracepoints will be
maintained on an ongoing basis by the code they modify.
The new macro TRACE_EVENT_FORMAT is created for this approach. (Note,
a developer may still utilize the more low level DECLARE_TRACE macros
if they don't care about getting their traces automatically in the event
tracer.)
They can also use the existing TRACE_FORMAT if they don't need to code
the tracepoint in C, but just want to use the convenience of printf.
So if the developer wants to "hardwire" a tracepoint in the fastest
possible way, and wants to acquire their data via a user space utility
in a raw binary format, or wants to see it in the trace output but not
sacrifice any performance, then they can implement the faster but
more complex TRACE_EVENT_FORMAT macro.
Here's what usage looks like:
TRACE_EVENT_FORMAT(name,
TPPROTO(proto),
TPARGS(args),
TPFMT(fmt, fmt_args),
TRACE_STUCT(
TRACE_FIELD(type1, item1, assign1)
TRACE_FIELD(type2, item2, assign2)
[...]
),
TPRAWFMT(raw_fmt)
);
Note name, proto, args, and fmt, are all identical to what TRACE_FORMAT
uses.
name: is the unique identifier of the trace point
proto: The proto type that the trace point uses
args: the args in the proto type
fmt: printf format to use with the event printf tracer
fmt_args: the printf argments to match fmt
TRACE_STRUCT starts the ability to create a structure.
Each item in the structure is defined with a TRACE_FIELD
TRACE_FIELD(type, item, assign)
type: the C type of item.
item: the name of the item in the stucture
assign: what to assign the item in the trace point callback
raw_fmt is a way to pretty print the struct. It must match
the order of the items are added in TRACE_STUCT
An example of this would be:
TRACE_EVENT_FORMAT(sched_wakeup,
TPPROTO(struct rq *rq, struct task_struct *p, int success),
TPARGS(rq, p, success),
TPFMT("task %s:%d %s",
p->comm, p->pid, success?"succeeded":"failed"),
TRACE_STRUCT(
TRACE_FIELD(pid_t, pid, p->pid)
TRACE_FIELD(int, success, success)
),
TPRAWFMT("task %d success=%d")
);
This creates us a unique struct of:
struct {
pid_t pid;
int success;
};
And the way the call back would assign these values would be:
entry->pid = p->pid;
entry->success = success;
The nice part about this is that the creation of the assignent is done
via macro magic in the event tracer. Once the TRACE_EVENT_FORMAT is
created, the developer will then have a faster method to record
into the ring buffer. They do not need to worry about the tracer itself.
The developer would only need to touch the files in include/trace/*.h
Again, I would like to give special thanks to Tom Zanussi for this
nice idea.
Idea-from: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Right now all tracers must manage their own trace buffers. This was
to enforce tracers to be independent in case we finally decide to
allow each tracer to have their own trace buffer.
But now we are adding event tracing that writes to the current tracer's
buffer. This adds an interface to allow events to write to the current
tracer buffer without having to manage its own. Since event tracing
has no "tracer", and is just a way to hook into any other tracer.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch makes the event files, set_event and available_events
aware of the subsystem.
Now you can enable an entire subsystem with:
echo 'irq:*' > set_event
Note: the '*' is not needed.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
If a trace point header defines TRACE_SYSTEM, then it will add the
following trace points into that event system.
If include/trace/irq_event_types.h has:
#define TRACE_SYSTEM irq
at the top and
#undef TRACE_SYSTEM
at the bottom, then a directory "irq" will be created in the
/debug/tracing/events directory. Inside that directory will contain the
two trace points that are defined in include/trace/irq_event_types.h.
Only adding the above to irq and not to sched, we get:
# ls /debug/tracing/events/
irq sched_process_exit sched_signal_send sched_wakeup_new
sched_kthread_stop sched_process_fork sched_switch
sched_kthread_stop_ret sched_process_free sched_wait_task
sched_migrate_task sched_process_wait sched_wakeup
# ls /debug/tracing/events/irq
irq_handler_entry irq_handler_exit
If we add #define TRACE_SYSTEM sched to the trace/sched_event_types.h
then the rest of the trace events will be put in a sched directory
within the events directory.
I've been playing with this idea of the subsystem for a while, but
recently Tom Zanussi posted some patches to lkml that included this
method. Tom's approach was clean and got me to finally put some effort
to clean up the event trace points.
Thanks to Tom Zanussi for demonstrating how nice the subsystem
method is.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
To further facilitate the ease of adding trace points for developers, this
patch creates include/trace/trace_events.h and
include/trace/trace_event_types.h.
The former file will hold the trace/<type>.h files and the latter will hold
the trace/<type>_event_types.h files.
To create new tracepoints and to have them automatically
appear in the event tracer, a developer makes the trace/<type>.h file
which includes <linux/tracepoint.h> and the trace/<type>_event_types.h file.
The trace/<type>_event_types.h file will hold the TRACE_FORMAT
macros.
Then add the trace/<type>.h file to trace/trace_events.h,
and add the trace/<type>_event_types.h to the trace_event_types.h file.
No need to modify files elsewhere.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
kcalloc is a better approach to allocate a NULL array.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix compile warning and clean up
When I first wrote __tracing_open, instead of passing the error
code via the ERR_PTR macros, I lazily used a separate parameter
to hold the return for errors.
When Frederic Weisbecker updated that function, he used the Linux
kernel ERR_PTR for the returns. This caused the parameter return
to possibly not be initialized on error. gcc correctly pointed this
out with a warning.
This patch converts the entire function to use the Linux kernel
ERR_PTR macro methods.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix to possible race conditions
There's some uses of current_tracer that is not protected by the
trace_types_lock. There is a small chance that a sysadmin changes
the tracer while the current_tracer is being referenced.
If the race is hit, it is unlikely to cause any harm since the
tracers are constant and are not freed. But some strang side
effects may occur.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch adds the tracer dependent options dynamically to the
options directory when the tracer is activated. These options are
removed when the tracer is deactivated.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch creates an options directory in the debugfs, that contains
the available tracing options. These files contain 1 or 0, where 1
is the option is enabled and 0 it is disabled.
Simply echoing in 1 will enable the option and 0 will disable it.
This patch only contains the core options, not the tracer options.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: implement new tracing timestamp APIs
Add three trace clock variants, with differing scalability/precision
tradeoffs:
- local: CPU-local trace clock
- medium: scalable global clock with some jitter
- global: globally monotonic, serialized clock
Make the ring-buffer use the local trace clock internally.
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add new tracepoints
Add them to the generic IRQ code, that way every architecture
gets these new tracepoints, not just x86.
Using Steve's new 'TRACE_FORMAT', I can get function graph
trace as follows using the original two IRQ tracepoints:
3) | handle_IRQ_event() {
3) | /* (irq_handler_entry) irq=28 handler=eth0 */
3) | e1000_intr_msi() {
3) 2.460 us | __napi_schedule();
3) 9.416 us | }
3) | /* (irq_handler_exit) irq=28 handler=eth0 return=handled */
3) + 22.935 us | }
Signed-off-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: restructure the VFS layout of per CPU trace buffers
The per cpu trace files are all in a single directory:
/debug/tracing/per_cpu. In case of a large number of cpu, the
content of this directory becomes messy so we create now one
directory per cpu inside /debug/tracing/per_cpu which contain
each their own trace_pipe and trace files.
Ie:
/debug/tracing$ ls -R per_cpu
per_cpu:
cpu0 cpu1
per_cpu/cpu0:
trace trace_pipe
per_cpu/cpu1:
trace trace_pipe
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There's been a bit confusion to whether DEFINE/DECLARE_TRACE_FMT should
be a DEFINE or a DECLARE. Ingo Molnar suggested simply calling it
TRACE_FORMAT.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Now that several per-cpu files can be read or spliced at the
same, we want the read/splice callbacks for tracing files to be
reentrants.
Until now, a single global mutex (trace_types_lock) serialized
the access to tracing_read_pipe(), tracing_splice_read_pipe(),
and the seq helpers.
Ie: it means that if a user tries to read trace_pipe0 and
trace_pipe1 at the same time, the access to the function
tracing_read_pipe() is contended and one reader must wait for
the other to finish its read call.
The trace_type_lock mutex is mostly here to serialize the access
to the global current tracer (current_trace), which can be
changed concurrently. Although the iter struct keeps a private
pointer to this tracer, its callbacks can be changed by another
function.
The method used here is to not keep anymore private reference to
the tracer inside the iterator but to make a copy of it inside
the iterator. Then it checks on subsequents read calls if the
tracer has changed. This is not costly because the current
tracer is not expected to be changed often, so we use a branch
prediction for that.
Moreover, we add a private mutex to the iterator (there is one
iterator per file descriptor) to serialize the accesses in case
of multiple consumers per file descriptor (which would be a
silly idea from the user). Note that this is not to protect the
ring buffer, since the ring buffer already serializes the
readers accesses. This is to prevent from traces weirdness in
case of concurrent consumers. But these mutexes can be dropped
anyway, that would not result in any crash. Just tell me what
you think about it.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: split up tracing output per cpu
Currently, on the tracing debugfs directory, three files are
available to the user to let him extracting the trace output:
- trace is an iterator through the ring-buffer. It's a reader
but not a consumer It doesn't block when no more traces are
available.
- trace pretty similar to the former, except that it adds more
informations such as prempt count, irq flag, ...
- trace_pipe is a reader and a consumer, it will also block
waiting for traces if necessary (heh, yes it's a pipe).
The traces coming from different cpus are curretly mixed up
inside these files. Sometimes it messes up the informations,
sometimes it's useful, depending on what does the tracer
capture.
The tracing_cpumask file is useful to filter the output and
select only the traces captured a custom defined set of cpus.
But still it is not enough powerful to extract at the same time
one trace buffer per cpu.
So this patch creates a new directory: /debug/tracing/per_cpu/.
Inside this directory, you will now find one trace_pipe file and
one trace file per cpu.
Which means if you have two cpus, you will have:
trace0
trace1
trace_pipe0
trace_pipe1
And of course, reading these files will have the same effect
than with the usual tracing files, except that you will only see
the traces from the given cpu.
The original all-in-one cpu trace file are still available on
their original place.
Until now, only one consumer was allowed on trace_pipe to avoid
racy consuming on the ring-buffer. Now the approach changed a
bit, you can have only one consumer per cpu.
Which means you are allowed to read concurrently trace_pipe0 and
trace_pipe1 But you can't have two readers on trace_pipe0 or
trace_pipe1.
Following the same logic, if there is one reader on the common
trace_pipe, you can not have at the same time another reader on
trace_pipe0 or in trace_pipe1. Because in trace_pipe is already
a consumer in all cpu buffers in essence.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: remove old debug/tracing API
/debug/tracing/latency_trace is an old legacy format we kept from
the old latency tracer. Remove the file for now. If there's any
useful bit missing then we'll propagate any useful output bits into
the /debug/tracing/trace output.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix CPU hotplug lockup
bts_hotcpu_handler() is called with irqs disabled, so using mutex_lock()
is a no-no.
All the BTS codepaths here are atomic (they do not schedule), so using
a spinlock is the right solution.
Cc: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds the directory /debug/tracing/events/ that will contain
all the registered trace points.
# ls /debug/tracing/events/
sched_kthread_stop sched_process_fork sched_switch
sched_kthread_stop_ret sched_process_free sched_wait_task
sched_migrate_task sched_process_wait sched_wakeup
sched_process_exit sched_signal_send sched_wakeup_new
# ls /debug/tracing/events/sched_switch/
enable
# cat /debug/tracing/events/sched_switch/enable
1
# cat /debug/tracing/set_event
sched_switch
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch changes the trace/sched.h to use the DECLARE_TRACE_FMT
such that they are automatically registered with the event tracer.
And it also adds the tracing sched headers to kernel/trace/events.c
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch creates the event tracing infrastructure of ftrace.
It will create the files:
/debug/tracing/available_events
/debug/tracing/set_event
The available_events will list the trace points that have been
registered with the event tracer.
set_events will allow the user to enable or disable an event hook.
example:
# echo sched_wakeup > /debug/tracing/set_event
Will enable the sched_wakeup event (if it is registered).
# echo "!sched_wakeup" >> /debug/tracing/set_event
Will disable the sched_wakeup event (and only that event).
# echo > /debug/tracing/set_event
Will disable all events (notice the '>')
# cat /debug/tracing/available_events > /debug/tracing/set_event
Will enable all registered event hooks.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Fix an invalid memory reference problem when cpu hotplug support is
disabled and the hw-branch-tracer is set as current tracer.
Initializing the tracer calls bts_trace_init() which has already
been freed at this time.
Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: added precaution on failure detection
Break out of the modifying loop as soon as a failure is detected.
This is just an added precaution found by code review and was not
found by any bug chasing.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch creates the weak functions: ftrace_arch_code_modify_prepare
and ftrace_arch_code_modify_post_process that are called before and
after the stop machine is called to modify the kernel text.
If the arch needs to do pre or post processing, it only needs to define
these functions.
[ Update: Ingo Molnar suggested using the name ftrace_arch_code_modify_*
over using ftrace_arch_modify_* ]
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: trace only functions matching a pattern
The set_graph_function file let one to trace only one or several
chosen functions and follow all their code flow.
Currently, only a constant function name is allowed so this patch
allows the ftrace_regex functions:
- matches all functions that end with "name":
echo *name > set_graph_function
- matches all functions that begin with "name":
echo name* > set_graph_function
- matches all functions that contains "name":
echo *name* > set_graph_function
Example:
echo mutex* > set_graph_function
0) | mutex_lock_nested() {
0) 0.563 us | __might_sleep();
0) 2.072 us | }
0) | mutex_unlock() {
0) 1.036 us | __mutex_unlock_slowpath();
0) 2.433 us | }
0) | mutex_unlock() {
0) 0.691 us | __mutex_unlock_slowpath();
0) 1.787 us | }
0) | mutex_lock_interruptible_nested() {
0) 0.548 us | __might_sleep();
0) 1.945 us | }
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>