Commit Graph

58281 Commits

Author SHA1 Message Date
Dan Williams f6dff381af md: remove raid5 compute_block and compute_parity5
replaced by raid5_run_ops

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-07-13 08:06:18 -07:00
Dan Williams 830ea01673 md: handle_stripe5 - request io processing in raid5_run_ops
I/O submission requests were already handled outside of the stripe lock in
handle_stripe.  Now that handle_stripe is only tasked with finding work,
this logic belongs in raid5_run_ops.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-07-13 08:06:17 -07:00
Dan Williams f0a50d3754 md: handle_stripe5 - add request/completion logic for async expand ops
When a stripe is being expanded bulk copying takes place to move the data
from the old stripe to the new.  Since raid5_run_ops only operates on one
stripe at a time these bulk copies are handled in-line under the stripe
lock.  In the dma offload case we poll for the completion of the operation.

After the data has been copied into the new stripe the parity needs to be
recalculated across the new disks.  We reuse the existing postxor
functionality to carry out this calculation.  By setting STRIPE_OP_POSTXOR
without setting STRIPE_OP_BIODRAIN the completion path in handle stripe
can differentiate expand operations from normal write operations.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-07-13 08:06:17 -07:00
Dan Williams b5e98d65d3 md: handle_stripe5 - add request/completion logic for async read ops
When a read bio is attached to the stripe and the corresponding block is
marked R5_UPTODATE, then a read (biofill) operation is scheduled to copy
the data from the stripe cache to the bio buffer.  handle_stripe flags the
blocks to be operated on with the R5_Wantfill flag.  If new read requests
arrive while raid5_run_ops is running they will not be handled until
handle_stripe is scheduled to run again.

Changelog:
* cleanup to_read and to_fill accounting
* do not fail reads that have reached the cache

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-07-13 08:06:17 -07:00
Dan Williams e89f89629b md: handle_stripe5 - add request/completion logic for async check ops
Check operations are scheduled when the array is being resynced or an
explicit 'check/repair' command was sent to the array.  Previously check
operations would destroy the parity block in the cache such that even if
parity turned out to be correct the parity block would be marked
!R5_UPTODATE at the completion of the check.  When the operation can be
carried out by a dma engine the assumption is that it can check parity as a
read-only operation.  If raid5_run_ops notices that the check was handled
by hardware it will preserve the R5_UPTODATE status of the parity disk.

When a check operation determines that the parity needs to be repaired we
reuse the existing compute block infrastructure to carry out the operation.
Repair operations imply an immediate write back of the data, so to
differentiate a repair from a normal compute operation the
STRIPE_OP_MOD_REPAIR_PD flag is added.

Changelog:
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-07-13 08:06:17 -07:00
Dan Williams f38e12199a md: handle_stripe5 - add request/completion logic for async compute ops
handle_stripe will compute a block when a backing disk has failed, or when
it determines it can save a disk read by computing the block from all the
other up-to-date blocks.

Previously a block would be computed under the lock and subsequent logic in
handle_stripe could use the newly up-to-date block.  With the raid5_run_ops
implementation the compute operation is carried out a later time outside
the lock.  To preserve the old functionality we take advantage of the
dependency chain feature of async_tx to flag the block as R5_Wantcompute
and then let other parts of handle_stripe operate on the block as if it
were up-to-date.  raid5_run_ops guarantees that the block will be ready
before it is used in another operation.

However, this only works in cases where the compute and the dependent
operation are scheduled at the same time.  If a previous call to
handle_stripe sets the R5_Wantcompute flag there is no facility to pass the
async_tx dependency chain across successive calls to raid5_run_ops.  The
req_compute variable protects against this case.

Changelog:
* remove the req_compute BUG_ON

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-07-13 08:06:17 -07:00
Dan Williams e33129d841 md: handle_stripe5 - add request/completion logic for async write ops
After handle_stripe5 decides whether it wants to perform a
read-modify-write, or a reconstruct write it calls
handle_write_operations5.  A read-modify-write operation will perform an
xor subtraction of the blocks marked with the R5_Wantprexor flag, copy the
new data into the stripe (biodrain) and perform a postxor operation across
all up-to-date blocks to generate the new parity.  A reconstruct write is run
when all blocks are already up-to-date in the cache so all that is needed
is a biodrain and postxor.

On the completion path STRIPE_OP_PREXOR will be set if the operation was a
read-modify-write.  The STRIPE_OP_BIODRAIN flag is used in the completion
path to differentiate write-initiated postxor operations versus
expansion-initiated postxor operations.  Completion of a write triggers i/o
to the drives.

Changelog:
* make the 'rcw' parameter to handle_write_operations5 a simple flag, Neil Brown
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-07-13 08:06:16 -07:00
Dan Williams d84e0f10d3 md: common infrastructure for running operations with raid5_run_ops
All the handle_stripe operations that are to be transitioned to use
raid5_run_ops need a method to coherently gather work under the stripe-lock
and hand that work off to raid5_run_ops.  The 'get_stripe_work' routine
runs under the lock to read all the bits in sh->ops.pending that do not
have the corresponding bit set in sh->ops.ack.  This modified 'pending'
bitmap is then passed to raid5_run_ops for processing.

The transition from 'ack' to 'completion' does not need similar protection
as the existing release_stripe infrastructure will guarantee that
handle_stripe will run again after a completion bit is set, and
handle_stripe can tolerate a sh->ops.completed bit being set while the lock
is held.

A call to async_tx_issue_pending_all() is added to raid5d to kick the
offload engines once all pending stripe operations work has been submitted.
This enables batching of the submission and completion of operations.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-07-13 08:06:16 -07:00
Dan Williams 91c0092484 md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:

1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api

The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.

To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests.  In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing.  The following flags outline the
requests that handle_stripe can make of raid5_run_ops:

STRIPE_OP_BIOFILL
 - copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
 - generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
 - subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
 - copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
 - recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
 - verify that the parity is correct
STRIPE_OP_IO
 - submit i/o to the member disks (note this was already performed outside
   the stripe lock, but it made sense to add it as an operation type

The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
   operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
   sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
   new operations that were previously blocked

Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.

Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
  ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
  handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
  was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-07-13 08:06:15 -07:00
Dan Williams 45b4233caa raid5: replace custom debug PRINTKs with standard pr_debug
Replaces PRINTK with pr_debug, and kills the RAID5_DEBUG definition in
favor of the global DEBUG definition.  To get local debug messages just add
'#define DEBUG' to the top of the file.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-07-13 08:06:15 -07:00
Dan Williams a445685647 raid5: refactor handle_stripe5 and handle_stripe6 (v3)
handle_stripe5 and handle_stripe6 have very deep logic paths handling the
various states of a stripe_head.  By introducing the 'stripe_head_state'
and 'r6_state' objects, large portions of the logic can be moved to
sub-routines.

'struct stripe_head_state' consumes all of the automatic variables that previously
stood alone in handle_stripe5,6.  'struct r6_state' contains the handle_stripe6
specific variables like p_failed and q_failed.

One of the nice side effects of the 'stripe_head_state' change is that it
allows for further reductions in code duplication between raid5 and raid6.
The following new routines are shared between raid5 and raid6:

	handle_completed_write_requests
	handle_requests_to_failed_array
	handle_stripe_expansion

Changes:
* v2: fixed 'conf->raid_disk-1' for the raid6 'handle_stripe_expansion' path
* v3: removed the unused 'dirty' field from struct stripe_head_state
* v3: coalesced open coded bi_end_io routines into return_io()

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-07-13 08:06:15 -07:00
Dan Williams 9bc89cd82d async_tx: add the async_tx api
The async_tx api provides methods for describing a chain of asynchronous
bulk memory transfers/transforms with support for inter-transactional
dependencies.  It is implemented as a dmaengine client that smooths over
the details of different hardware offload engine implementations.  Code
that is written to the api can optimize for asynchronous operation and the
api will fit the chain of operations to the available offload resources. 
 
	I imagine that any piece of ADMA hardware would register with the
	'async_*' subsystem, and a call to async_X would be routed as
	appropriate, or be run in-line. - Neil Brown

async_tx exploits the capabilities of struct dma_async_tx_descriptor to
provide an api of the following general format:

struct dma_async_tx_descriptor *
async_<operation>(..., struct dma_async_tx_descriptor *depend_tx,
			dma_async_tx_callback cb_fn, void *cb_param)
{
	struct dma_chan *chan = async_tx_find_channel(depend_tx, <operation>);
	struct dma_device *device = chan ? chan->device : NULL;
	int int_en = cb_fn ? 1 : 0;
	struct dma_async_tx_descriptor *tx = device ?
		device->device_prep_dma_<operation>(chan, len, int_en) : NULL;

	if (tx) { /* run <operation> asynchronously */
		...
		tx->tx_set_dest(addr, tx, index);
		...
		tx->tx_set_src(addr, tx, index);
		...
		async_tx_submit(chan, tx, flags, depend_tx, cb_fn, cb_param);
	} else { /* run <operation> synchronously */
		...
		<operation>
		...
		async_tx_sync_epilog(flags, depend_tx, cb_fn, cb_param);
	}

	return tx;
}

async_tx_find_channel() returns a capable channel from its pool.  The
channel pool is organized as a per-cpu array of channel pointers.  The
async_tx_rebalance() routine is tasked with managing these arrays.  In the
uniprocessor case async_tx_rebalance() tries to spread responsibility
evenly over channels of similar capabilities.  For example if there are two
copy+xor channels, one will handle copy operations and the other will
handle xor.  In the SMP case async_tx_rebalance() attempts to spread the
operations evenly over the cpus, e.g. cpu0 gets copy channel0 and xor
channel0 while cpu1 gets copy channel 1 and xor channel 1.  When a
dependency is specified async_tx_find_channel defaults to keeping the
operation on the same channel.  A xor->copy->xor chain will stay on one
channel if it supports both operation types, otherwise the transaction will
transition between a copy and a xor resource.

Currently the raid5 implementation in the MD raid456 driver has been
converted to the async_tx api.  A driver for the offload engines on the
Intel Xscale series of I/O processors, iop-adma, is provided in a later
commit.  With the iop-adma driver and async_tx, raid456 is able to offload
copy, xor, and xor-zero-sum operations to hardware engines.
 
On iop342 tiobench showed higher throughput for sequential writes (20 - 30%
improvement) and sequential reads to a degraded array (40 - 55%
improvement).  For the other cases performance was roughly equal, +/- a few
percentage points.  On a x86-smp platform the performance of the async_tx
implementation (in synchronous mode) was also +/- a few percentage points
of the original implementation.  According to 'top' on iop342 CPU
utilization drops from ~50% to ~15% during a 'resync' while the speed
according to /proc/mdstat doubles from ~25 MB/s to ~50 MB/s.
 
The tiobench command line used for testing was: tiobench --size 2048
--block 4096 --block 131072 --dir /mnt/raid --numruns 5
* iop342 had 1GB of memory available

Details:
* if CONFIG_DMA_ENGINE=n the asynchronous path is compiled away by making
  async_tx_find_channel a static inline routine that always returns NULL
* when a callback is specified for a given transaction an interrupt will
  fire at operation completion time and the callback will occur in a
  tasklet.  if the the channel does not support interrupts then a live
  polling wait will be performed
* the api is written as a dmaengine client that requests all available
  channels
* In support of dependencies the api implicitly schedules channel-switch
  interrupts.  The interrupt triggers the cleanup tasklet which causes
  pending operations to be scheduled on the next channel
* Xor engines treat an xor destination address differently than a software
  xor routine.  To the software routine the destination address is an implied
  source, whereas engines treat it as a write-only destination.  This patch
  modifies the xor_blocks routine to take a an explicit destination address
  to mirror the hardware.

Changelog:
* fixed a leftover debug print
* don't allow callbacks in async_interrupt_cond
* fixed xor_block changes
* fixed usage of ASYNC_TX_XOR_DROP_DEST
* drop dma mapping methods, suggested by Chris Leech
* printk warning fixups from Andrew Morton
* don't use inline in C files, Adrian Bunk
* select the API when MD is enabled
* BUG_ON xor source counts <= 1
* implicitly handle hardware concerns like channel switching and
  interrupts, Neil Brown
* remove the per operation type list, and distribute operation capabilities
  evenly amongst the available channels
* simplify async_tx_find_channel to optimize the fast path
* introduce the channel_table_initialized flag to prevent early calls to
  the api
* reorganize the code to mimic crypto
* include mm.h as not all archs include it in dma-mapping.h
* make the Kconfig options non-user visible, Adrian Bunk
* move async_tx under crypto since it is meant as 'core' functionality, and
  the two may share algorithms in the future
* move large inline functions into c files
* checkpatch.pl fixes
* gpl v2 only correction

Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-07-13 08:06:14 -07:00
Dan Williams 685784aaf3 xor: make 'xor_blocks' a library routine for use with async_tx
The async_tx api tries to use a dma engine for an operation, but will fall
back to an optimized software routine otherwise.  Xor support is
implemented using the raid5 xor routines.  For organizational purposes this
routine is moved to a common area.

The following fixes are also made:
* rename xor_block => xor_blocks, suggested by Adrian Bunk
* ensure that xor.o initializes before md.o in the built-in case
* checkpatch.pl fixes
* mark calibrate_xor_blocks __init, Adrian Bunk

Cc: Adrian Bunk <bunk@stusta.de>
Cc: NeilBrown <neilb@suse.de>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2007-07-13 08:06:14 -07:00
Dan Williams d379b01e90 dmaengine: make clients responsible for managing channels
The current implementation assumes that a channel will only be used by one
client at a time.  In order to enable channel sharing the dmaengine core is
changed to a model where clients subscribe to channel-available-events.
Instead of tracking how many channels a client wants and how many it has
received the core just broadcasts the available channels and lets the
clients optionally take a reference.  The core learns about the clients'
needs at dma_event_callback time.

In support of multiple operation types, clients can specify a capability
mask to only be notified of channels that satisfy a certain set of
capabilities.

Changelog:
* removed DMA_TX_ARRAY_INIT, no longer needed
* dma_client_chan_free -> dma_chan_release: switch to global reference
  counting only at device unregistration time, before it was also happening
  at client unregistration time
* clients now return dma_state_client to dmaengine (ack, dup, nak)
* checkpatch.pl fixes
* fixup merge with git-ioat

Cc: Chris Leech <christopher.leech@intel.com>
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David S. Miller <davem@davemloft.net>
2007-07-13 08:06:13 -07:00
Dan Williams 7405f74bad dmaengine: refactor dmaengine around dma_async_tx_descriptor
The current dmaengine interface defines mutliple routines per operation,
i.e. dma_async_memcpy_buf_to_buf, dma_async_memcpy_buf_to_page etc.  Adding
more operation types (xor, crc, etc) to this model would result in an
unmanageable number of method permutations.

	Are we really going to add a set of hooks for each DMA engine
	whizbang feature?
		- Jeff Garzik

The descriptor creation process is refactored using the new common
dma_async_tx_descriptor structure.  Instead of per driver
do_<operation>_<dest>_to_<src> methods, drivers integrate
dma_async_tx_descriptor into their private software descriptor and then
define a 'prep' routine per operation.  The prep routine allocates a
descriptor and ensures that the tx_set_src, tx_set_dest, tx_submit routines
are valid.  Descriptor creation and submission becomes:

struct dma_device *dev;
struct dma_chan *chan;
struct dma_async_tx_descriptor *tx;

tx = dev->device_prep_dma_<operation>(chan, len, int_flag)
tx->tx_set_src(dma_addr_t, tx, index /* for multi-source ops */)
tx->tx_set_dest(dma_addr_t, tx, index)
tx->tx_submit(tx)

In addition to the refactoring, dma_async_tx_descriptor also lays the
groundwork for definining cross-channel-operation dependencies, and a
callback facility for asynchronous notification of operation completion.

Changelog:
* drop dma mapping methods, suggested by Chris Leech
* fix ioat_dma_dependency_added, also caught by Andrew Morton
* fix dma_sync_wait, change from Andrew Morton
* uninline large functions, change from Andrew Morton
* add tx->callback = NULL to dmaengine calls to interoperate with async_tx
  calls
* hookup ioat_tx_submit
* convert channel capabilities to a 'cpumask_t like' bitmap
* removed DMA_TX_ARRAY_INIT, no longer needed
* checkpatch.pl fixes
* make set_src, set_dest, and tx_submit descriptor specific methods
* fixup git-ioat merge
* move group_list and phys to dma_async_tx_descriptor

Cc: Jeff Garzik <jeff@garzik.org>
Cc: Chris Leech <christopher.leech@intel.com>
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David S. Miller <davem@davemloft.net>
2007-07-13 08:06:11 -07:00
Dan Aloni 428ed6024f I/OAT: fix I/OAT for kexec
Under kexec, I/OAT initialization breaks over busy resources because the
previous kernel did not release them.

I'm not sure this fix can be considered a complete one but it works for me.
 I guess something similar to the *_remove method should occur there..

Signed-off-by: Dan Aloni <da-x@monatomic.org>
Signed-off-by: Chris Leech <christopher.leech@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-07-11 16:10:53 -07:00
Andrew Morton e00c5d8b4d I/OAT: warning fix
net/ipv4/tcp.c: In function 'tcp_recvmsg':
net/ipv4/tcp.c:1111: warning: unused variable 'available'

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Leech <christopher.leech@intel.com>
2007-07-11 16:10:53 -07:00
Chris Leech 2b1244a43b I/OAT: Only offload copies for TCP when there will be a context switch
The performance wins come with having the DMA copy engine doing the copies
in parallel with the context switch.  If there is enough data ready on the
socket at recv time just use a regular copy.

Signed-off-by: Chris Leech <christopher.leech@intel.com>
2007-07-11 16:10:53 -07:00
Chris Leech 72d0b7a81d I/OAT: Add documentation for the tcp_dma_copybreak sysctl
Signed-off-by: Chris Leech <christopher.leech@intel.com>
2007-07-11 15:39:05 -07:00
Chris Leech 70774b4739 ioatdma: Remove the use of writeq from the ioatdma driver
There's only one now anyway, and it's not in a performance path,
so make it behave the same on 32-bit and 64-bit CPUs.

Signed-off-by: Chris Leech <christopher.leech@intel.com>
2007-07-11 15:39:04 -07:00
Chris Leech e38288117c ioatdma: Remove the wrappers around read(bwl)/write(bwl) in ioatdma
Signed-off-by: Chris Leech <christopher.leech@intel.com>
2007-07-11 15:39:04 -07:00
Jeff Garzik ff487fb773 drivers/dma: handle sysfs errors
From: Jeff Garzik <jeff@garzik.org>

Signed-off-by: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Chris Leech <christopher.leech@intel.com>
2007-07-11 15:39:03 -07:00
Chris Leech 000725d56a ioatdma: Push pending transactions to hardware more frequently
Every 20 descriptors turns out to be to few append commands with
newer/faster CPUs.  Pushing every 4 still cuts down on MMIO writes to an
acceptable level without letting the DMA engine run out of work.

Signed-off-by: Chris Leech <christopher.leech@intel.com>
2007-07-11 15:39:03 -07:00
Linus Torvalds 7dcca30a32 Linux 2.6.22
Woo-hoo. I'm sure somebody will report a "this doesn't compile, and
I have a new root exploit" five minutes after release, but it still
feels good ;)

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-08 16:32:17 -07:00
Linus Torvalds fe5a2de17e Merge master.kernel.org:/pub/scm/linux/kernel/git/bart/ide-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/bart/ide-2.6:
  qd65xx: fix PIO mode selection
  sis5513: adding PCI-ID
2007-07-08 12:14:27 -07:00
Linus Torvalds 1e5de2837c Fix permission checking for the new utimensat() system call
Commit 1c710c896e added the utimensat()
system call, but didn't handle the case of checking for the writability
of the target right, when the target was a file descriptor, not a
filename.

We cannot use vfs_permission(MAY_WRITE) for that case, and need to
simply check whether the file descriptor is writable.  The oops from
using the wrong function was noticed and narrowed down by Markus
Trippelsdorf.

Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Al Viro <viro@ftp.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-08 12:02:55 -07:00
Peter Zijlstra 4e99325b46 mm: double mark_page_accessed() in read_cache_page_async()
Fix a post-2.6.21 regression.

read_cache_page_async() has two invocations of mark_page_accessed() which will
launch pages right onto the active list.

Remove the first one, keeping the latter one.  This avoids marking unwanted
pages active (in the retry loop).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-08 10:13:21 -07:00
Bartlomiej Zolnierkiewicz 4660897e6c qd65xx: fix PIO mode selection
PIO4 is a maximum PIO mode supported by a driver.  Using "255" as a max_mode
argument to ide_get_best_pio_mode() could result in wrong timings being used
by a driver (for "pio" equal to 5) or OOPS (for "pio" values > 5 && < 255).

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Reviewed-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
2007-07-08 15:21:58 +02:00
Uwe Koziolek 4c6c914e4c sis5513: adding PCI-ID
The SiS966 has one additional PCI-ID 1180.

If the chipset is using this PCI-ID, the primary channel is connected to the
first PATA-port. The secondary channel is connected to SATA-ports in IDE
emulation mode.  The legacy IO-ports are used.

The including of the PCI-ID into pata_sis is not sufficient, because the legacy
driver in drivers/ide is initialized before pata_sis.

Signed-off-by: Uwe Koziolek <uwe.koziolek@gmx.net>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-07-08 15:21:58 +02:00
Adrian Bunk 95511ad434 DLM must depend on SYSFS
The dependency of DLM on SYSFS got lost in
commit 6ed7257b46 resulting in the
following compile error with CONFIG_DLM=y, CONFIG_SYSFS=n:

<--  snip  -->

...
  LD      .tmp_vmlinux1
fs/built-in.o: In function `dlm_lockspace_init':
/home/bunk/linux/kernel-2.6/linux-2.6.22-rc6-mm1/fs/dlm/lockspace.c:231: undefined reference to `kernel_subsys'
fs/built-in.o: In function `configfs_init':
/home/bunk/linux/kernel-2.6/linux-2.6.22-rc6-mm1/fs/configfs/mount.c:143: undefined reference to `kernel_subsys'
make[1]: *** [.tmp_vmlinux1] Error 1

<--  snip  -->

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-07 14:17:43 -07:00
Dave Jones 38377be88a Clean up E7520/7320/7525 quirk printk.
The printk level in this printk is bogus, as the previous printk
didn't have a terminating \n resulting in ..

Intel E7520/7320/7525 detected.<6>Disabling irq balancing and affinity

It also never printed a \n at all in the case where we didn't do
the quirk.

Change it to only make noise if it actually does something useful.

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-07 13:53:13 -07:00
Adrian Bunk 40e48eed84 include/linux/kallsyms.h must #include <linux/errno.h>
This patch fixes the following 2.6.22 regression with CONFIG_KALLSYMS=n:

<--  snip  -->

...
  CC      arch/m32r/kernel/traps.o
In file included from /home/bunk/linux/kernel-2.6/linux-2.6.22-rc6-mm1/arch/m32r/kernel/traps.c:14:
/home/bunk/linux/kernel-2.6/linux-2.6.22-rc6-mm1/include/linux/kallsyms.h: In function 'lookup_symbol_name':
/home/bunk/linux/kernel-2.6/linux-2.6.22-rc6-mm1/include/linux/kallsyms.h:66: error: 'ERANGE' undeclared (first use in this function)
/home/bunk/linux/kernel-2.6/linux-2.6.22-rc6-mm1/include/linux/kallsyms.h:66: error: (Each undeclared identifier is reported only once
/home/bunk/linux/kernel-2.6/linux-2.6.22-rc6-mm1/include/linux/kallsyms.h:66: error: for each function it appears in.)
/home/bunk/linux/kernel-2.6/linux-2.6.22-rc6-mm1/include/linux/kallsyms.h: In function 'lookup_symbol_attrs':
/home/bunk/linux/kernel-2.6/linux-2.6.22-rc6-mm1/include/linux/kallsyms.h:71: error: 'ERANGE' undeclared (first use in this function)
make[2]: *** [arch/m32r/kernel/traps.o] Error 1

<--  snip  -->

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-07 13:49:51 -07:00
David Woodhouse 1c39858b5d Fix use-after-free oops in Bluetooth HID.
When cleaning up HIDP sessions, we currently close the ACL connection
before deregistering the input device. Closing the ACL connection
schedules a workqueue to remove the associated objects from sysfs, but
the input device still refers to them -- and if the workqueue happens to
run before the input device removal, the kernel will oops when trying to
look up PHYSDEVPATH for the removed input device.

Fix this by deregistering the input device before closing the
connections.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-07 12:22:37 -07:00
Christoph Lameter d23cf676d0 slub: remove useless EXPORT_SYMBOL
kmem_cache_open is static. EXPORT_SYMBOL was leftover from some earlier
time period where kmem_cache_open was usable outside of slub.

(Fixes powerpc build error)

Signed-off-by: Chrsitoph Lameter <clameter@sgi.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-06 11:45:11 -07:00
maximilian attems c3000e031c MAINTAINERS new kernel janitors ml
davem kindly moved the list from osdl to vger.

Signed-of-by: maximilian attems <max@stro.at>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-06 11:45:11 -07:00
Andres Salomon 95069f89e8 GEODE: reboot fixup for geode machines with CS5536 boards
Writing to MSR 0x51400017 forces a hard reset on CS5536-based machines,
this has the reboot fixup do just that if such a board is detected.

Acked-by: Jordan Crouse <jordan.crouse@amd.com>
Signed-off-by: Andres Salomon <dilinger@debian.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-06 11:45:11 -07:00
Linus Torvalds 1feb17e286 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
  [NETPOLL]: Fixups for 'fix soft lockup when removing module'
  [NET]: net/core/netevent.c should #include <net/netevent.h>
  [NETFILTER]: nf_conntrack_h323: add checking of out-of-range on choices' index values
  [NET] skbuff: remove export of static symbol
  SCTP: Add scope_id validation for link-local binds
  SCTP: Check to make sure file is valid before setting timeout
  SCTP: Fix thinko in sctp_copy_laddrs()
2007-07-06 10:30:12 -07:00
Linus Torvalds dadde13ad8 Merge branch 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus
* 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus:
  [MIPS] Fix scheduling latency issue on 24K, 34K and 74K cores
  [MIPS] Add macros to encode processor revisions.
  [MIPS] RM7000: Enable ICACHE_REFILLS_WORKAROUND_WAR.
  [MIPS] SMTC: Fix cut'n'paste bug in Kconfig.debug
  [MIPS] Change libgcc-style functions from lib-y to obj-y
  [MIPS] Fix timer/performance interrupt detection
  [MIPS] AP/SP: Avoid triggering the 34K E125 performance issue
  [MIPS] 64-bit TO_PHYS_MASK macro for RM9000 processors
2007-07-06 10:29:33 -07:00
Peter Zijlstra 23c1fb5296 mm: fixup /proc/vmstat output
Line up the vmstat_text with zone_stat_item

enum zone_stat_item {
	/* First 128 byte cacheline (assuming 64 bit words) */
	NR_FREE_PAGES,
	NR_INACTIVE,
	NR_ACTIVE,

We current have nr_active and nr_inactive reversed.

[ "OK with patch, though using initializers canbe handy to prevent such
   things in future:

	static const char * const vmstat_text[] = {
		[NR_FREE_PAGES] = "nr_free_pages",
		..."
							 - Alexey ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-06 10:26:50 -07:00
Yoann Padioleau 0da2f0f164 potential compiler error, irqfunc caller sites update
In 7d12e780e0 David Howells performed
this evolution:
 "IRQ: Maintain regs pointer globally rather than passing to IRQ handlers"

He correctly updated many of the function definitions that were using this
extra regs pointer parameter but forgot to update some caller sites of
those functions.  The reason the modifications was not properly done on all
drivers is that some drivers were rarely compiled because they are for
AMIGA, or that some code sites were inside #ifdefs where the option is not
set or inside #if 0.

Here is the semantic patch that found the occurences
and fixed the problem.

@ rule1 @
identifier fn;
identifier irq, dev_id;
typedef irqreturn_t;
@@

static irqreturn_t fn(int irq, void *dev_id)
{
   ...
}

@@
identifier rule1.fn;
expression E1, E2, E3;
@@

 fn(E1, E2
-   ,E3
   )

Signed-off-by: Yoann Padioleau <padator@wanadoo.fr>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-06 10:23:43 -07:00
Vivek Goyal 071922c08c i386: es7000 build breakage fix
o Commit 1833d6bc72 broke the build if
  compiled with CONFIG_ES7000=y and CONFIG_X86_GENERICARCH=n

arch/i386/kernel/built-in.o(.init.text+0x4fa9): In function `acpi_parse_madt':
: undefined reference to `acpi_madt_oem_check'
arch/i386/kernel/built-in.o(.init.text+0x7406): In function `smp_read_mpc':
: undefined reference to `mps_oem_check'
arch/i386/kernel/built-in.o(.init.text+0x8990): In function
`connect_bsp_APIC':
: undefined reference to `enable_apic_mode'
make: *** [.tmp_vmlinux1] Error 1

o Fix the build issue. Provided the definitions of missing functions.

o Don't have ES7000 machine. Only compile tested.

Cc: Len Brown <lenb@kernel.org>
Cc: Natalie Protasevich <protasnb@gmail.com>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-06 10:23:43 -07:00
Bjorn Helgaas 41a5311465 PNP SMCf010 quirk: work around Toshiba Portege 4000 ACPI issues
When we enable the SMCf010 IR device, the Toshiba Portege 4000 BIOS claims
the device is working, but it really isn't configured correctly.  The BIOS
*will* configure it, but only if we call _SRS after (1) reversing the order
of the SIR and FIR I/O port regions and (2) changing the IRQ from
active-high to active-low.

This patch addresses the 2.6.22 regression:
    "no irda0 interface (2.6.21 was OK), smsc does not find chip"

I tested this on a Portege 4000.  The smsc-ircc2 driver correctly detects
the device, and "irattach irda0 -s && irdadump" shows transmitted and
received packets.

Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Andrey Borzenkov <arvidjaar@mail.ru>
Cc: Samuel Ortiz <samuel@sortiz.org>
Cc: "Linus Walleij (LD/EAB)" <linus.walleij@ericsson.com>
Cc: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Cc: Adam Belay <ambx1@neo.rr.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-06 10:23:43 -07:00
Alexander Graf d57d973101 fix logic error in ipc compat semctl()
When calling a semctl(IPC_STAT) without IPC_64 the check if the memory is
unevaluated.  This patch fixes this.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-06 10:23:43 -07:00
David Woodhouse 0db19c412c x86_64: fix headers_install
A bug in headers_install for ARCH=x86_64 yields an asm/ directory full of
files all of which are using the same #ifdef guard, "__ASM_STUB_" with no
postfix.  So the second and later asm files #included in the same C file
(often through standard headers like ioctl.h) yields no symbols.

Strangeness with the Ubuntu 'tell me if I support something that's not
explcitly mentioned in POSIX, and I'll strip it out' shell, I believe.

We don't need the 'export' but we do need a semicolon at the end of the
FNAME line:

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Rob Landley <rob@landley.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-06 10:23:43 -07:00
Loic Prylli d25c1ba2fa MTRR: Fix race causing set_mtrr to go into infinite loop
Processors synchronization in set_mtrr requires the .gate field to be set
after .count field is properly initialized.  Without an explicit barrier,
the compiler was reordering those memory stores.  That was sometimes
causing a processor (in ipi_handler) to see the .gate change and decrement
.count before the latter is set by set_mtrr() (which then hangs in a
infinite loop with irqs disabled).

Signed-off-by: Loic Prylli <loic@myri.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-06 10:23:43 -07:00
Jason Wessel 1e2e99f0e4 i386: fix regression, endless loop in ptrace singlestep over an int80
The commit 635cf99a80 introduced a
regression.  Executing a ptrace single step after certain int80
accesses will infinitely loop and never advance the PC.

The TIF_SINGLESTEP check should be done on the return from the syscall
and not before it.

I loops on each single step on the pop right after the int80 which writes out
to the console.  At that point you can issue as many single steps as you want
and it will not advance any further.

The test case is below:

/* Test whether singlestep through an int80 syscall works.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/mman.h>
#include <asm/user.h>
#include <string.h>

static int child, status;
static struct user_regs_struct regs;

static void do_child()
{
	char str[80] = "child: int80 test\n";

	ptrace(PTRACE_TRACEME, 0, 0, 0);
	kill(getpid(), SIGUSR1);
	write(fileno(stdout),str,strlen(str));
	asm ("int $0x80" : : "a" (20)); /* getpid */
}

static void do_parent()
{
	unsigned long eip, expected = 0;
again:
	waitpid(child, &status, 0);
	if (WIFEXITED(status) || WIFSIGNALED(status))
		return;

	if (WIFSTOPPED(status)) {
		ptrace(PTRACE_GETREGS, child, 0, &regs);
		eip = regs.eip;
		if (expected)
			fprintf(stderr, "child stop @ %08lx, expected %08lx %s\n",
					eip, expected,
					eip == expected ? "" : " <== ERROR");

		if (*(unsigned short *)eip == 0x80cd) {
			fprintf(stderr, "int 0x80 at %08x\n", (unsigned int)eip);
			expected = eip + 2;
		} else
			expected = 0;

		ptrace(PTRACE_SINGLESTEP, child, NULL, NULL);
	}
	goto again;
}

int main(int argc, char * const argv[])
{
	child = fork();
	if (child)
		do_parent();
	else
		do_child();
	return 0;
}

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: <stable@kernel.org>
Cc: Chuck Ebbert <76306.1226@compuserve.com>
Acked-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-06 10:23:43 -07:00
Michael Ellerman ef7320edb1 Fix elf_core_dump() when writing arch specific notes (spu coredumps)
elf_core_dump() supports dumping arch specific ELF notes, via the #define
ELF_CORE_WRITE_EXTRA_NOTES.  Currently the only user of this is the powerpc
spu coredump code.

There is a bug in the handling of foffset WRT the arch notes, which causes
us to erroneously increment foffset by the size of the arch notes, leaving
a block of zeroes in the file, and causing all subsequent data in the file
to be at <supposed position> + <arch note size>.  eg:

  LOAD  0x050000 0x00100000 0x00000000 0x20000 0x20000 R E 0x10000

Tells us we should have a chunk of data at 0x50000.  The truth is the data
is at 0x90dbc = 0x50000 + 0x40dbc (the size of the arch notes).

This bug prevents gdb from reading the core file correctly.

The simplest fix is to simply remember the size of the arch notes, and add
it to foffset after we've written the arch notes.  The only drawback is
that if the arch code doesn't write as many bytes as it said it would, we
end up with a broken core dump again.  For now I think that's a reasonable
requirement.

Tested on a Cell blade, gdb no longer complains about the core file being
bogus.

While I'm here I should point out that the spu coredump code does not work
if we're dumping to a pipe - we'll have to wait for 23 to fix that.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-06 10:23:43 -07:00
Ralf Baechle 4b3e975e4a [MIPS] Fix scheduling latency issue on 24K, 34K and 74K cores
The idle loop goes to sleep using the WAIT instruction if !need_resched().
This has is suffering from from a race condition that if if just after
need_resched has returned 0 an interrupt might set TIF_NEED_RESCHED but
we've just completed the test so go to sleep anyway.  This would be
trivial to fix by just disabling interrupts during that sequence as in:

        local_irq_disable();
        if (!need_resched())
                __asm__("wait");
        local_irq_enable();

but the processor architecture leaves it undefined if a processor calling
WAIT with interrupts disabled will ever restart its pipeline and indeed
some processors have made use of the freedom provided by the architecture
definition.  This has been resolved and the Config7.WII bit indicates that
the use of WAIT is safe on 24K, 24KE and 34K cores.  It also is safe on
74K starting revision 2.1.0 so enable the use of WAIT with interrupts
disabled for 74K based on a c0_prid of at least that.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2007-07-06 16:17:11 +01:00
Ralf Baechle fde97822a2 [MIPS] Add macros to encode processor revisions.
Older processors used to encode processor version and revision in two
4-bit bitfields, the 4K seems to simply count up and even newer MTI cores
have switched to use the 8-bits as 3:3:2 bitfield with the last field as
the patch number.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2007-07-06 16:17:11 +01:00
Ralf Baechle 075c733e19 [MIPS] RM7000: Enable ICACHE_REFILLS_WORKAROUND_WAR.
The RM7000 processors and the E9000 cores have a bug (though PMC-Sierra
opposes it being called that) where invalid instructions in the same
I-cache line worth of instructions being fetched may case spurious
exceptions.

The workaround for this was only enabled for E9000 cores; enable it also
for all RM7000-based platforms.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2007-07-06 16:17:11 +01:00