Commit Graph

70 Commits

Author SHA1 Message Date
Tony Luck 97de6ad196 [IA64] hook up new rt_tgsigqueueinfo syscall
Assign syscall #1321 for rt_tgsigqueueinfo.

Signed-off-by: Tony Luck <tony.luck@intel.com>
2009-06-16 13:13:41 -07:00
Tony Luck 8851d3712a [IA64] wire up preadv/pwritev system calls
Gerd Hoffmann added these to Linux.  Let ia64 use them.

Signed-off-by: Tony Luck <tony.luck@intel.com>
2009-04-08 13:46:14 -07:00
Linus Torvalds 714f83d5d9 Merge branch 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (413 commits)
  tracing, net: fix net tree and tracing tree merge interaction
  tracing, powerpc: fix powerpc tree and tracing tree interaction
  ring-buffer: do not remove reader page from list on ring buffer free
  function-graph: allow unregistering twice
  trace: make argument 'mem' of trace_seq_putmem() const
  tracing: add missing 'extern' keywords to trace_output.h
  tracing: provide trace_seq_reserve()
  blktrace: print out BLK_TN_MESSAGE properly
  blktrace: extract duplidate code
  blktrace: fix memory leak when freeing struct blk_io_trace
  blktrace: fix blk_probes_ref chaos
  blktrace: make classic output more classic
  blktrace: fix off-by-one bug
  blktrace: fix the original blktrace
  blktrace: fix a race when creating blk_tree_root in debugfs
  blktrace: fix timestamp in binary output
  tracing, Text Edit Lock: cleanup
  tracing: filter fix for TRACE_EVENT_FORMAT events
  ftrace: Using FTRACE_WARN_ON() to check "freed record" in ftrace_release()
  x86: kretprobe-booster interrupt emulation code fix
  ...

Fix up trivial conflicts in
 arch/parisc/include/asm/ftrace.h
 include/linux/memory.h
 kernel/extable.c
 kernel/module.c
2009-04-05 11:04:19 -07:00
Isaku Yamahata 94752a794d ia64/pv_ops: paravirtualize mov = ar.itc.
paravirtualize mov reg = ar.itc.

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2009-03-26 10:50:22 -07:00
Ingo Molnar 4092762aeb Merge branch 'tracing/ftrace'; commit 'v2.6.29-rc2' into tracing/core 2009-01-18 20:15:05 +01:00
Heiko Carstens 1134723e96 [CVE-2009-0029] Remove __attribute__((weak)) from sys_pipe/sys_pipe2
Remove __attribute__((weak)) from common code sys_pipe implemantation.
IA64, ALPHA, SUPERH (32bit) and SPARC (32bit) have own implemantations
with the same name. Just rename them.
For sys_pipe2 there is no architecture specific implementation.

Cc: Richard Henderson <rth@twiddle.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:15 +01:00
Shaohua Li a14a07b801 ftrace, ia64: IA64 dynamic ftrace support
IA64 dynamic ftrace support.
The original _mcount stub for each function is like:
	alloc r40=ar.pfs,12,8,0
	mov r43=r0;;
	mov r42=b0
	mov r41=r1
	nop.i 0x0
	br.call.sptk.many b0 = _mcount;;

The patch convert it to below for nop:
	[MII] nop.m 0x0
	mov r3=ip
	nop.i 0x0
	[MLX] nop.m 0x0
	nop.x 0x0;;
This isn't completely nop, as there is one instuction 'mov r3=ip', but
it should be light and harmless for code follow it.

And below is for call
	[MII] nop.m 0x0
	mov r3=ip
	nop.i 0x0
	[MLX] nop.m 0x0
	brl.many .;;
In this way, only one instruction is changed to convert code between nop
and call. This should meet dyn-ftrace's requirement.
But this requires CPU support brl instruction, so dyn-ftrace isn't
supported for old Itanium system. Assume there are quite few such old
system running.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-14 12:11:31 +01:00
Shaohua Li d3e75ff14b ftrace, ia64: IA64 static ftrace support
IA64 ftrace suppport. In IA64, below code will be added in each function
if -pg is enabled.

alloc r40=ar.pfs,12,8,0
mov r43=r0;;
mov r42=b0
mov r41=r1
nop.i 0x0
br.call.sptk.many b0 = _mcount;;

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-14 12:11:26 +01:00
Tony Luck b704882e70 [IA64] Rationalize kernel mode alignment checking
Itanium processors can handle some misaligned data accesses. They
also provide a mode where all such accesses are forced to trap. The
kernel was schizophrenic about use of this mode:

* Base kernel code ran in permissive mode where the only traps
  generated were from those cases that the h/w could not handle.
* Interrupt, syscall and trap code ran in strict mode where all
  unaligned accesses caused traps to the 0x5a00 unaligned reference
  vector.

Use strict alignment checking throughout the kernel, but make
sure that we continue to let user mode use more relaxed mode
as the default.

Signed-off-by: Tony Luck <tony.luck@intel.com>
2008-11-20 13:27:12 -08:00
Shaohua Li f14488ccfe [IA64] utrace use generic trace hook
Make IA64 use generic trace hook in some paths.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2008-10-06 10:43:06 -07:00
Tony Luck 3e4d0cab61 [IA64] Wire up new system calls
Six new system calls: signalfd4, eventfd2, epoll_create1,
dup3, pipe2 and inotify_init1.

Signed-off-by: Tony Luck <tony.luck@intel.com>
2008-07-25 10:10:28 -07:00
Isaku Yamahata 4df8d22bbb [IA64] pvops: paravirtualize entry.S
paravirtualize ia64_swtich_to, ia64_leave_syscall and ia64_leave_kernel.
They include sensitive or performance critical privileged instructions
so that they need paravirtualization.
To paravirtualize them by single source and multi compile
they are converted into indirect jump. And define each pv instances.

Cc: Keith Owens <kaos@ocs.com.au>
Cc: "Dong, Eddie" <eddie.dong@intel.com>
Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2008-05-27 15:08:01 -07:00
Hidetoshi Seto 2e513fe490 [IA64] trivial cleanup for entry.S
This patch does:
 - make comment at next to resched check more robust
 - move "re-check" comments to next to where change predicate regs

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2008-05-14 15:56:09 -07:00
Hidetoshi Seto 3633c73080 [IA64] fix interrupt masking for pending works on kernel leave
[Bug-fix for "[BUG?][2.6.25-mm1] sleeping during IRQ disabled"]

This patch does:
 - enable interrupts before calling schedule() as same as others, ex. x86
 - enable interrupts during ia64_do_signal() and ia64_sync_krbs()
 - do_notify_resume_user() is still called with interrupts disabled, since
   we can take short path of fsys_mode if-statement quickly.
 - pfm_handle_work() is also called with interrupts disabled, since
   it can deal interrupt mask within itself.
 - fix/add some comments/notes

Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2008-05-14 15:55:35 -07:00
Hidetoshi Seto 38477ad751 [IA64] disable interrupts on exit of ia64_trace_syscall
While testing with CONFIG_VIRT_CPU_ACCOUNTING=y, I found that
I occasionally get very huge system time in some threads.

So I dug the issue and finally noticed that it was caused
because of an interrupt which interrupt in the following window:

> [arch/ia64/kernel/entry.S: (!CONFIG_PREEMPT && CONFIG_VIRT_CPU_ACCOUNTING)]
>
> ENTRY(ia64_leave_syscall)
>    :
> (pUStk) rsm psr.i
>         cmp.eq pLvSys,p0=r0,r0          // pLvSys=1: leave from syscall
> (pUStk) cmp.eq.unc p6,p0=r0,r0          // p6 <- pUStk
> .work_processed_syscall:
>         adds r2=PT(LOADRS)+16,r12
> (pUStk) mov.m r22=ar.itc                        // fetch time at leave
>         adds r18=TI_FLAGS+IA64_TASK_SIZE,r13
>         ;;
> <<< window: from here >>>
> (p6)    ld4 r31=[r18]  // load current_thread_info()->flags
>         ld8 r19=[r2],PT(B6)-PT(LOADRS)
>         adds r3=PT(AR_BSPSTORE)+16,r12
>         ;;
>         mov r16=ar.bsp
>         ld8 r18=[r2],PT(R9)-PT(B6)
> (p6)    and r15=TIF_WORK_MASK,r31  // any work other than TIF_SYSCALL_TRACE?
>         ;;
>         ld8 r23=[r3],PT(R11)-PT(AR_BSPSTORE)
> (p6)    cmp4.ne.unc p6,p0=r15, r0               // any special work pending?
> (p6)    br.cond.spnt .work_pending_syscall
>         ;;
>         ld8 r9=[r2],PT(CR_IPSR)-PT(R9)
>         ld8 r11=[r3],PT(CR_IIP)-PT(R11)
> (pNonSys) break 0 // bug check: we shouldn't be here if pNonSys is TRUE!
>         ;;
>         invala
> <<< window: to here >>>
>         rsm psr.i | psr.ic // turn off interrupts and interruption collection

If pUStk is true, it means we are going to return user mode, hence we fetch
ar.itc to get time at leave from system.
It seems that it is not possible to interrupt the window if pUStk is true,
because interrupts are disabled early.  And also disabling interrupt makes
sense because it is safe for referring current_thread_info()->flags.

However interrupting the window while pUStk is true was possible.
The route was:
ia64_trace_syscall
-> .work_pending_syscall_end
-> .work_processed_syscall
Only in case entering the window from this route, interrupts are enabled
during in the window even if pUStk is true.  I suppose interrupts must be
disabled here anyway if pUStk is true.
I'm not sure but afraid that what kind of bad effect were there, other
than crazy system time which I found.

FYI, there was a commit 6f6d75825d that
points out a bug at same point(exit of ia64_trace_syscall) in 2006.
It can be said that there was an another bug.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2008-04-22 08:55:51 -07:00
Hidetoshi Seto b64f34cdfe [IA64] VIRT_CPU_ACCOUNTING (accurate cpu time accounting)
This patch implements VIRT_CPU_ACCOUNTING for ia64,
which enable us to use more accurate cpu time accounting.

The VIRT_CPU_ACCOUNTING is an item of kernel config, which s390
and powerpc arch have.  By turning this config on, these archs
change the mechanism of cpu time accounting from tick-sampling
based one to state-transition based one.

The state-transition based accounting is done by checking time
(cycle counter in processor) at every state-transition point,
such as entrance/exit of kernel, interrupt, softirq etc.
The difference between point to point is the actual time consumed
during in the state. There is no doubt about that this value is
more accurate than that of tick-sampling based accounting.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2008-02-20 12:55:37 -08:00
Tony Luck ad9e39c70f [IA64] Wire up timerfd_{create,settime,gettime} syscalls
Add ia64 hooks for the new syscalls that were added in
commit 4d672e7ac7

Signed-off-by: Tony Luck <tony.luck@intel.com>
2008-02-08 12:00:32 -08:00
Davide Libenzi 4d672e7ac7 timerfd: new timerfd API
This is the new timerfd API as it is implemented by the following patch:

int timerfd_create(int clockid, int flags);
int timerfd_settime(int ufd, int flags,
		    const struct itimerspec *utmr,
		    struct itimerspec *otmr);
int timerfd_gettime(int ufd, struct itimerspec *otmr);

The timerfd_create() API creates an un-programmed timerfd fd.  The "clockid"
parameter can be either CLOCK_MONOTONIC or CLOCK_REALTIME.

The timerfd_settime() API give new settings by the timerfd fd, by optionally
retrieving the previous expiration time (in case the "otmr" parameter is not
NULL).

The time value specified in "utmr" is absolute, if the TFD_TIMER_ABSTIME bit
is set in the "flags" parameter.  Otherwise it's a relative time.

The timerfd_gettime() API returns the next expiration time of the timer, or
{0, 0} if the timerfd has not been set yet.

Like the previous timerfd API implementation, read(2) and poll(2) are
supported (with the same interface).  Here's a simple test program I used to
exercise the new timerfd APIs:

http://www.xmailserver.org/timerfd-test2.c

[akpm@linux-foundation.org: coding-style cleanups]
[akpm@linux-foundation.org: fix ia64 build]
[akpm@linux-foundation.org: fix m68k build]
[akpm@linux-foundation.org: fix mips build]
[akpm@linux-foundation.org: fix alpha, arm, blackfin, cris, m68k, s390, sparc and sparc64 builds]
[heiko.carstens@de.ibm.com: fix s390]
[akpm@linux-foundation.org: fix powerpc build]
[akpm@linux-foundation.org: fix sparc64 more]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05 09:44:07 -08:00
David Chinner 3d7559e677 [IA64] fallocate system call
sys_fallocate for ia64. This uses an empty slot #1303 erroneously
marked as reserved for move_pages (which had already been allocated
as syscall #1276)

Signed-Off-By: Dave Chinner <dgc@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2007-07-19 13:48:00 -07:00
Tony Luck ae67e498a5 [IA64] wire up {signal,timer,event}fd syscalls
Signed-off-by: Tony Luck <tony.luck@intel.com>
2007-05-14 15:55:11 -07:00
Tony Luck 472118e63d [IA64] Wire up epoll_pwait and utimensat
Another day, another pair of new system calls.

Signed-off-by: Tony Luck <tony.luck@intel.com>
2007-05-10 09:44:42 -07:00
Alexey Kuznetsov e180583b85 [IA64] wire up pselect, ppoll
Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2007-05-08 15:57:59 -07:00
Alexey Dobriyan 4a177cbf84 [IA64] Add TIF_RESTORE_SIGMASK
Preparation for pselect and ppoll.
ia32 compat code not tested. :-(

Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2007-05-08 14:51:59 -07:00
Tony Luck b643b0fdbc Pull percpu-dtc into release branch 2007-04-30 13:56:00 -07:00
Chen, Kenneth W a0776ec8e9 [IA64] remove per-cpu ia64_phys_stacked_size_p8
It's not efficient to use a per-cpu variable just to store
how many physical stack register a cpu has.  Ever since the
incarnation of ia64 up till upcoming Montecito processor, that
variable has "glued" to 96. Having a variable in memory means
that the kernel is burning an extra cacheline access on every
syscall and kernel exit path.  Such "static" value is better
served with the instruction patching utility exists today.
Convert ia64_phys_stacked_size_p8 into dynamic insn patching.

This also has a pleasant side effect of eliminating access to
per-cpu area while psr.ic=0 in the kernel exit path. (fixable
for per-cpu DTC work, but why bother?)

There are some concerns with the default value that the instruc-
tion encoded in the kernel image.  It shouldn't be concerned.
The reasons are:

(1) cpu_init() is called at CPU initialization.  In there, we
    find out physical stack register size from PAL and patch
    two instructions in kernel exit code.  The code in question
    can not be executed before the patching is done.

(2) current implementation stores zero in ia64_phys_stacked_size_p8,
    and that's what the current kernel exit path loads the value with.
    With the new code, it is equivalent that we store reg size 96
    in ia64_phys_stacked_size_p8, thus creating a better safety net.
    Given (1) above can never fail, having (2) is just a bonus.

All in all, this patch allow one less memory reference in the kernel
exit path, thus reducing syscall and interrupt return latency; and
avoid polluting potential useful data in the CPU cache.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2007-02-06 15:04:18 -08:00
Fenghua Yu 86afa9eb88 [IA64] Hook up getcpu system call for IA64
getcpu system call returns cpu# and node# on which this system call and
its caller are running. This patch hooks up its implementation on IA64.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2007-02-05 16:56:36 -08:00
Zou Nan hai a79561134f [IA64] IA64 Kexec/kdump
Changes and updates.

1. Remove fake rendz path and related code according to discuss with Khalid Aziz.
2. fc.i offset fix in relocate_kernel.S.
3. iospic shutdown code eoi and mask race fix from Fujitsu.
4. Warm boot hook in machine_kexec to SN SAL code from Jack Steiner.
5. Send slave to SAL slave loop patch from Jay Lan.
6. Kdump on non-recoverable MCA event patch from Jay Lan
7. Use CTL_UNNUMBERED in kdump_on_init sysctl.

Signed-off-by: Zou Nan hai <nanhai.zou@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2006-12-07 09:51:35 -08:00
Uwe Zeisberger f30c226954 fix file specification in comments
Many files include the filename at the beginning, serveral used a wrong one.

Signed-off-by: Uwe Zeisberger <Uwe_Zeisberger@digi.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-10-03 23:01:26 +02:00
Arnd Bergmann 3db03b4afb [PATCH] rename the provided execve functions to kernel_execve
Some architectures provide an execve function that does not set errno, but
instead returns the result code directly.  Rename these to kernel_execve to
get the right semantics there.  Moreover, there is no reasone for any of these
architectures to still provide __KERNEL_SYSCALLS__ or _syscallN macros, so
remove these right away.

[akpm@osdl.org: build fix]
[bunk@stusta.de: build fix]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Andi Kleen <ak@muc.de>
Acked-by: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Hirokazu Takata <takata.hirokazu@renesas.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:23 -07:00
Tony Luck 5c55cd63a7 Revert "[IA64] Unwire set/get_robust_list"
This reverts commit 2636255488.

Jakub Jelinek provided the missing futex_atomic_cmpxchg_inatomic()
function, so now it should be safe to re-enable these syscalls.

Signed-off-by: Tony Luck <tony.luck@intel.com>
2006-09-26 14:04:42 -07:00
Andreas Schwab 2636255488 [IA64] Unwire set/get_robust_list
The syscalls set/get_robust_list must not be wired up until
futex_atomic_cmpxchg_inatomic is implemented.  Otherwise the kernel will
hang in handle_futex_death.

Signed-off-by: Andreas Schwab <schwab@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2006-09-08 11:03:40 -07:00
Jörn Engel 6ab3d5624e Remove obsolete #include <linux/config.h>
Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-30 19:25:36 +02:00
Christoph Lameter 742755a1d8 [PATCH] page migration: sys_move_pages(): support moving of individual pages
move_pages() is used to move individual pages of a process. The function can
be used to determine the location of pages and to move them onto the desired
node. move_pages() returns status information for each page.

long move_pages(pid, number_of_pages_to_move,
		addresses_of_pages[],
		nodes[] or NULL,
		status[],
		flags);

The addresses of pages is an array of void * pointing to the
pages to be moved.

The nodes array contains the node numbers that the pages should be moved
to. If a NULL is passed instead of an array then no pages are moved but
the status array is updated. The status request may be used to determine
the page state before issuing another move_pages() to move pages.

The status array will contain the state of all individual page migration
attempts when the function terminates. The status array is only valid if
move_pages() completed successfullly.

Possible page states in status[]:

0..MAX_NUMNODES	The page is now on the indicated node.

-ENOENT		Page is not present

-EACCES		Page is mapped by multiple processes and can only
		be moved if MPOL_MF_MOVE_ALL is specified.

-EPERM		The page has been mlocked by a process/driver and
		cannot be moved.

-EBUSY		Page is busy and cannot be moved. Try again later.

-EFAULT		Invalid address (no VMA or zero page).

-ENOMEM		Unable to allocate memory on target node.

-EIO		Unable to write back page. The page must be written
		back in order to move it since the page is dirty and the
		filesystem does not provide a migration function that
		would allow the moving of dirty pages.

-EINVAL		A dirty page cannot be moved. The filesystem does not provide
		a migration function and has no ability to write back pages.

The flags parameter indicates what types of pages to move:

MPOL_MF_MOVE	Move pages that are only mapped by the process.

MPOL_MF_MOVE_ALL Also move pages that are mapped by multiple processes.
		Requires sufficient capabilities.

Possible return codes from move_pages()

-ENOENT		No pages found that would require moving. All pages
		are either already on the target node, not present, had an
		invalid address or could not be moved because they were
		mapped by multiple processes.

-EINVAL		Flags other than MPOL_MF_MOVE(_ALL) specified or an attempt
		to migrate pages in a kernel thread.

-EPERM		MPOL_MF_MOVE_ALL specified without sufficient priviledges.
		or an attempt to move a process belonging to another user.

-EACCES		One of the target nodes is not allowed by the current cpuset.

-ENODEV		One of the target nodes is not online.

-ESRCH		Process does not exist.

-E2BIG		Too many pages to move.

-ENOMEM		Not enough memory to allocate control array.

-EFAULT		Parameters could not be accessed.

A test program for move_pages() may be found with the patches
on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm3

From: Christoph Lameter <clameter@sgi.com>

  Detailed results for sys_move_pages()

  Pass a pointer to an integer to get_new_page() that may be used to
  indicate where the completion status of a migration operation should be
  placed.  This allows sys_move_pags() to report back exactly what happened to
  each page.

  Wish there would be a better way to do this. Looks a bit hacky.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Jes Sorensen <jes@trained-monkey.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:53 -07:00
Jens Axboe 912d35f867 [PATCH] Add support for the sys_vmsplice syscall
sys_splice() moves data to/from pipes with a file input/output. sys_vmsplice()
moves data to a pipe, with the input being a user address range instead.

This uses an approach suggested by Linus, where we can hold partial ranges
inside the pages[] map. Hopefully this will be useful for network
receive support as well.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-04-26 10:59:21 +02:00
Jens Axboe 70524490ee [PATCH] splice: add support for sys_tee()
Basically an in-kernel implementation of tee, which uses splice and the
pipe buffers as an intelligent way to pass data around by reference.

Where the user space tee consumes the input and produces a stdout and
file output, this syscall merely duplicates the data inside a pipe to
another pipe. No data is copied, the output just grabs a reference to the
input pipe data.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-04-11 15:51:17 +02:00
Tony Luck b8cd2af862 [IA64] Wire up new syscalls {set,get}_robust_list
Join the dots to enable Ingo's robut futex syscalls.

Signed-off-by: Tony Luck <tony.luck@intel.com>
2006-04-06 14:20:16 -07:00
Tony Luck d905b00b3b [IA64] Wire up new syscall sync_file_range()
Also reserve syscall numbers for {set,get}_robust_list

Signed-off-by: Tony Luck <tony.luck@intel.com>
2006-04-04 14:08:11 -07:00
Jens Axboe 5274f052e7 [PATCH] Introduce sys_splice() system call
This adds support for the sys_splice system call. Using a pipe as a
transport, it can connect to files or sockets (latter as output only).

From the splice.c comments:

   "splice": joining two ropes together by interweaving their strands.

   This is the "extended pipe" functionality, where a pipe is used as
   an arbitrary in-memory buffer. Think of a pipe as a small kernel
   buffer that you can use to transfer data from one end to the other.

   The traditional unix read/write is extended with a "splice()" operation
   that transfers data buffers to or from a pipe buffer.

   Named by Larry McVoy, original implementation from Linus, extended by
   Jens to support splicing to files and fixing the initial implementation
   bugs.

Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-30 12:28:18 -08:00
Tony Luck 581249966f Pull delete-sigdelayed into release branch 2006-03-21 08:17:38 -08:00
Jack Steiner 6f6d75825d [IA64] Missing check for TIF_WORK if trace/audit enabled
It appears that if auditing is enabled, the kernel fails to
check for pending signals before returning to user mode.

Signed-off-by: Jack Steiner <steiner@sgi.com>
Acked-by: Ken Chen <kenneth.w.chen@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2006-02-16 10:20:08 -08:00
Janak Desai 9621a4ef8a [IA64] unshare system call registration for ia64
Registers system call for the ia64 architecture.

Reserves space for ppoll and pselect, and adds unshare at system
call number 1296.

Signed-off-by: Janak Desai <janak@us.ibm.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2006-02-08 15:43:38 -08:00
Chen, Kenneth W 9ed2ad8648 [IA64] add syscall entry for *at()
Wire up the ia64 syscalls for *at() functions.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2006-02-06 10:42:46 -08:00
Keith Owens b0a06623dc [IA64] Delete MCA/INIT sigdelayed code
The only user of the MCA/INIT sigdelayed code (SGI's I/O probing) has
moved from the kernel into SAL.  Delete the MCA/INIT sigdelayed code.

Signed-off-by: Keith Owens <kaos@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2006-01-26 13:23:27 -08:00
Christoph Lameter 39743889aa [PATCH] Swap Migration V5: sys_migrate_pages interface
sys_migrate_pages implementation using swap based page migration

This is the original API proposed by Ray Bryant in his posts during the first
half of 2005 on linux-mm@kvack.org and linux-kernel@vger.kernel.org.

The intent of sys_migrate is to migrate memory of a process.  A process may
have migrated to another node.  Memory was allocated optimally for the prior
context.  sys_migrate_pages allows to shift the memory to the new node.

sys_migrate_pages is also useful if the processes available memory nodes have
changed through cpuset operations to manually move the processes memory.  Paul
Jackson is working on an automated mechanism that will allow an automatic
migration if the cpuset of a process is changed.  However, a user may decide
to manually control the migration.

This implementation is put into the policy layer since it uses concepts and
functions that are also needed for mbind and friends.  The patch also provides
a do_migrate_pages function that may be useful for cpusets to automatically
move memory.  sys_migrate_pages does not modify policies in contrast to Ray's
implementation.

The current code here is based on the swap based page migration capability and
thus is not able to preserve the physical layout relative to it containing
nodeset (which may be a cpuset).  When direct page migration becomes available
then the implementation needs to be changed to do a isomorphic move of pages
between different nodesets.  The current implementation simply evicts all
pages in source nodeset that are not in the target nodeset.

Patch supports ia64, i386 and x86_64.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:42 -08:00
Peter Chubb 24b8e0cc09 [IA64] Remove warnings for gcc 4.0 IA64 compilation.
This patch removes some compilation warnings, mostly
trivially. acpi.c fix also noted by Kenji Kaneshige.

Signed-off-by; Peter Chubb <peterc@gelato.unsw.edu.au>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2005-09-16 09:45:27 -07:00
Linus Torvalds 486a153f0e Merge master.kernel.org:/pub/scm/linux/kernel/git/sam/kbuild 2005-09-09 15:46:49 -07:00
Chen, Kenneth W 383f2835eb [PATCH] Prefetch kernel stacks to speed up context switch
For architecture like ia64, the switch stack structure is fairly large
(currently 528 bytes).  For context switch intensive application, we found
that significant amount of cache misses occurs in switch_to() function.
The following patch adds a hook in the schedule() function to prefetch
switch stack structure as soon as 'next' task is determined.  This allows
maximum overlap in prefetch cache lines for that structure.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 13:57:31 -07:00
Sam Ravnborg 39e01cb874 kbuild: ia64 use generic asm-offsets.h support
Delete obsolete stuff from arch Makefile
Rename file to asm-offsets.h
The trick used in the arch Makefile to circumvent the circular
dependency is kept.

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2005-09-09 22:03:13 +02:00
Chen, Kenneth W 0232622324 [IA64] minor performance tune-up in ia64_switch_to
The reenabling of psr.ic should really belong to dtr mapping code block.
It make the fall through code fast since it doesn't need to execute the
predicated-off instruction.  Logically make more sense as well since psr.ic
was turned off in .map code block.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2005-09-07 13:56:23 -07:00
Ingo Molnar 6cb54819d7 [PATCH] remove sys_set_zone_reclaim()
This removes sys_set_zone_reclaim() for now.  While i'm sure Martin is
trying to solve a real problem, we must not hard-code an incomplete and
insufficient approach into a syscall, because syscalls are pretty much
for eternity.  I am quite strongly convinced that this syscall must not
hit v2.6.13 in its current form.

Firstly, the syscall lacks basic syscall design: e.g. it allows the
global setting of VM policy for unprivileged users. (!) [ Imagine an
Oracle installation and a SAP installation on the same NUMA box fighting
over the 'optimal' setting for this flag. What will they do? Will they
try to set the flag to their own preferred value every second or so? ]

Secondly, it was added based on a single datapoint from Martin:

 http://marc.theaimsgroup.com/?l=linux-mm&m=111763597218177&w=2

where Martin characterizes the numbers the following way:

 ' Run-to-run variability for "make -j" is huge, so these numbers aren't
   terribly useful except to see that with reclaim the benchmark still
   finishes in a reasonable amount of time. '

in other words: the fundamental problem has likely not been solved, only
a tendential move into the right direction has been observed, and a
handful of numbers were picked out of a set of hugely variable results,
without showing the variability data. How much variance is there
run-to-run?

I'd really suggest to first walk the walk and see what's needed to get
stable & predictable kernel compilation numbers on that NUMA box, before
adding random syscalls to tune a particular aspect of the VM ... which
approach might not even matter once the whole picture has been analyzed
and understood!

The third, most important point is that the syscall exposes VM tuning
internals in a completely unstructured way. What sense does it make to
have a _GLOBAL_ per-node setting for 'should we go to another node for
reclaim'? If then it might make sense to do this per-app, via numalib or
so.

The change is minimalistic in that it doesnt remove the syscall and the
underlying infrastructure changes, only the user-visible changes.  We
could perhaps add a CAP_SYS_ADMIN-only sysctl for this hack, a'ka
/proc/sys/vm/swappiness, but even that looks quite counterproductive
when the generic approach is that we are trying to reduce the number of
external factors in the VM balance picture.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-01 10:03:56 -07:00