There is little or no advantage to running a 64-bit kernel with a 32-bit userland unless you have more than 4GB of physical memory; a normal desktop machine just doesn't spend enough time CPU-bound in the kernel for it to make a difference.
As far as I can see, GCC has no optimisations specific to the Core line of processors at this point, even in x86_64. You may get slight performance benefits from building for x86_64, because the compiler can use the extra registers, but none of that is Core-specific: you only get the parts that are common to all x86_64 chips, Intel or AMD.
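To make the "extra registers" point concrete, here is a minimal sketch that reports how many general-purpose registers the compiler can name for the target it is building for. It assumes GCC-style predefined macros (`__x86_64__` and `__i386__`); the function names `visible_gprs` and `print_target` are made up for this illustration.

```c
#include <stdio.h>

/* Number of general-purpose registers the compiler can name for this
   build target. Relies on GCC-style predefined macros; returns 0 on
   non-x86 targets, where this sketch makes no claim. */
static int visible_gprs(void)
{
#if defined(__x86_64__)
    return 16;  /* rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp, r8..r15 */
#elif defined(__i386__)
    return 8;   /* eax, ebx, ecx, edx, esi, edi, ebp, esp */
#else
    return 0;
#endif
}

/* Hypothetical helper: print the result in a readable form. */
static void print_target(void)
{
    int n = visible_gprs();
    if (n == 0)
        printf("not an x86 build; no claim made\n");
    else
        printf("compiler-visible general-purpose registers: %d\n", n);
}
```

Calling `print_target()` from `main` should report 8 for a build with `gcc -m32` and 16 for a build with `gcc -m64`, which is the whole of the difference the compiler sees; nothing in it depends on the chip being a Core.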
Note that even in 32-bit mode the extra registers present on the silicon are still used, even though the compiler can't generate code that names them: they become part of the pool used for register renaming. Any modern x86 chip, 32-bit or 64-bit, has more physical registers on the silicon than the compiler can address. A register name doesn't always refer to the same physical register it did a moment earlier; the out-of-order execution hardware manages that mapping at runtime. Chips capable of running 64-bit code can exploit this to a larger degree, since they contain more registers to begin with, and this works even when the code running is 32-bit.
I have been told, though I have no idea if this is true, that there are at least 28 fully 64-bit general-purpose registers on an Athlon64 die. The x86_64 architecture only exposes 16 of them.
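If the renaming idea is unclear, here is a toy model of it. This is only a sketch of the concept, not how any real chip is built: the pool size of 28 is just the unverified figure quoted above, physical registers are never recycled, and all names (`struct renamer`, `rename_write`, `rename_read`) are invented for the example.

```c
#define ARCH_REGS 8    /* names visible to 32-bit code */
#define PHYS_REGS 28   /* illustrative pool size, not a verified figure */

struct renamer {
    int map[ARCH_REGS];  /* architectural name -> current physical register */
    int next_free;       /* next unused physical register (toy: no recycling) */
};

/* Start with each architectural name mapped to its own physical register. */
static void renamer_init(struct renamer *r)
{
    for (int i = 0; i < ARCH_REGS; i++)
        r->map[i] = i;
    r->next_free = ARCH_REGS;
}

/* A write to an architectural register is given a fresh physical
   register, so it does not have to wait for older instructions that
   still need the previous value. Returns the physical register chosen,
   or -1 if the pool is exhausted (real hardware would stall until
   older instructions retire and free their registers). */
static int rename_write(struct renamer *r, int arch)
{
    if (r->next_free >= PHYS_REGS)
        return -1;
    r->map[arch] = r->next_free++;
    return r->map[arch];
}

/* A read uses whatever physical register the name maps to right now. */
static int rename_read(const struct renamer *r, int arch)
{
    return r->map[arch];
}
```

Two back-to-back writes to architectural register 0 land in two different physical registers here, which is the sense in which a register name doesn't always mean the same piece of silicon. A larger pool simply lets more such writes be in flight before the toy model runs out, whether the code naming them is 32-bit or 64-bit.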
I'd just use 32-bit x86, though I have no idea which -march setting is likely to be best for the Core chips.