Unleash the Beast: Core i7 and X58

Overall Score

[img]IMG_0471.JPG[/img]

Intel often, if not always, selects generic code names, typically geographical locations, for unreleased products in an effort to avoid legal trademark trouble. The names typically derive, though not always, from landmarks near the location where the product was designed. A quick search yields a little info — Nehalem ( http://en.wikipedia.org/wiki/Nehalem_River ) is a river in Oregon, located just northwest of Portland and running through the Nehalem Valley — so it is a good bet that Nehalem was designed in the US, as opposed to Israel, where Merom (Core 2 Duo) found its origins. As many already know, Nehalem represents the "tock" in Intel's tick-tock strategy, a cadence by which a product refresh occurs — like clockwork, as the name implies — about every 12 months, alternating between a shrink of a given architecture on a new manufacturing process and a new architectural revision on a mature process.

Long, winding, swift and wide are adjectives commonly used to describe rivers. Fittingly, these same words can be used to describe Intel's newest processor line, now known as Core i7, a.k.a. Nehalem. Intel has spent much of the last 5 years wringing the last drops of capability out of their aging platform architecture, all the while maintaining that they would implement new features only when the timing warranted. Many would argue that they are late to the game; however, for the past 2 years we have watched Intel extract more and more performance while riding on the same aging bus technology — gains that many thought impossible. Nonetheless, after well over a decade of living on top of the front side bus (FSB), with the memory controller located on the far end of a 64-bit highway, Intel has finally ditched this tried, and tired, technology for a new interconnect scheme and, consequently, has followed in the footsteps of their long-time rival, who implemented these features five years ago.

Core i7, carrying through with the river analogy, is well described as swift and wide when analyzed from several different angles. Many changes in Core i7, and the platform supporting it, were specifically targeted at increasing bandwidth and lowering latency for shuttling data to and from the execution core(s). That is, Intel made the entire data network swifter. Intel also took an already wide core (more on this later) and widened various other architectural features, thus enabling more instructions to be in flight at any given time. The overall result — a much better utilization of computational resources.

Shortly after the first public information about Nehalem was released, numerous hardware/enthusiast web sites regurgitated most, if not all, of the key information, the majority taken directly from the Intel press kits. A few higher quality sites actually analyzed those details while prognosticating what benefits the changes would bring. In the end, the common consensus has been that Core i7 will be a mixed bag in performance, varying from as little as no real improvement over the prior generation to jaw-dropping improvements akin to the leap from the Pentium 4 to the much more efficient Core 2 Duo.

In all, the long, winding path from Core 2 to Core i7 has produced a CPU that is, like a raging river, swifter at feeding a beefed-up, wide-issue CPU core. Today we will be looking at how, specifically, some of these changes have translated into actual end user experiences and the performance gains (or lack thereof) these changes bring to the table.

 

A few notes on the Architecture

A generalized discussion of the Nehalem architecture is probably best done by peeling back the details much like the layers of an onion. On the surface, Intel has completely reworked the entire platform: as mentioned in the introduction, the front side bus is now replaced by a serial point-to-point interconnect, branded QuickPath Interconnect (QPI). At the same time, Intel has integrated the memory controller, formerly located on the chipset north bridge, so that it now resides on the same slice of silicon as the CPU. The overall block diagram for the new platform arrangement is shown below.

[img]X58_blockdiagram.gif[/img]

Overall, the basic arrangement remains much the same. The CPU connects to the chipset north bridge, which serves as the graphics hub (and formerly the memory hub); the north bridge then connects to the south bridge, which deals with all the IO components of the system (USB, SATA, Ethernet, etc.). The memory is now hardwired directly to the CPU, and the CPU/chipset communication path is point to point via QPI. Anyone who has followed computing history should immediately recognize the configuration. This is the same basic block diagram AMD has used for the past 5 years, first introduced in 2003 with the famous K8 architecture. This often leads people to conclude that Intel copied AMD, and to an extent this is true.

Though on paper each block connects to the same components as in AMD's direct connect architecture, the actual implementation details are vastly different. For instance, Intel chose to use this transition to support DDR3 exclusively, a choice almost certainly driven by technical considerations beyond anything they have made public to date. The QPI link also provides more bandwidth than what AMD supports today; however, AMD has revised their specification and provided for a faster HyperTransport speed to match QPI's total bandwidth in the coming months. A few other features of the memory controller are different as well, namely three channels that provide a 192-bit wide data bus capable of delivering a theoretical maximum memory bandwidth of up to 25.6 GB/sec. Finally, the QPI link has extra error-correcting safeguards built into the protocol, including double-bit correction and a different coherency protocol implementation. In short, while the block diagrams look the same, the two implementations are vastly different in their current form, and one could not simply interchange one for the other.
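As a quick sanity check, the 25.6 GB/sec figure falls out of simple arithmetic: channels × bus width × transfer rate. The sketch below assumes DDR3-1066 (1,066.67 MT/s), the memory speed that matches Intel's quoted maximum; the function name is our own.

```python
# Peak theoretical bandwidth = channels x bus width (bytes) x transfers/sec.
# Assumes DDR3-1066; 1 GB is taken as 10**9 bytes, as in marketing figures.

def peak_bandwidth_gb(channels, bus_width_bits, transfers_per_sec):
    """Return peak memory bandwidth in GB/sec."""
    return channels * (bus_width_bits / 8) * transfers_per_sec / 1e9

nehalem = peak_bandwidth_gb(3, 64, 1066.67e6)  # triple-channel DDR3-1066
print(round(nehalem, 1))  # -> 25.6
```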

Peeling back another layer: aside from the fact that Core i7 is a native quad core design, the cache has also received a dramatic overhaul, and again there is an apparent convergence between the AMD and Intel approaches. Intel has implemented a shared L3 cache, the same as found in AMD's native quad core processors. However, as with the memory controller and the serial link, the appearance of a three-level cache structure within a simplified block diagram is where the similarity stops, as the implementation and actual cache technology diverge.

[img]nehalem_cache.jpg[/img]

Core i7, unlike the Phenom/Opteron quad cores, utilizes an inclusive cache system, whereas AMD utilizes an exclusive (victim) cache. Both cache hierarchies accomplish the job required of them, but the computational behavior is quite different. For Intel, the 8MB L3 cache represents the most cache available to the memory subsystem, since an inclusive cache replicates the data and instructions held in the lower-level caches. That is, any data or instruction in an L1 cache is also present in the L2 and L3 caches. AMD's exclusive cache simply means that data in L1 is not guaranteed to be in L2 or L3.

The advantages and disadvantages of exclusive and inclusive cache systems are too numerous to list. Briefly, however, Intel's inclusive cache has the advantage of reducing coherency traffic between the cores. For example, as a core executes and fetches new code, it first looks for the next block of instructions in its own cache; if it is not in L1/L2, it looks in L3. If a core misses in the L3 cache (i.e. the needed instruction/data is not there), then it is guaranteed that none of the dedicated caches on any other core will have that particular data or instruction either, so the L1/L2 caches on the other cores need not be snooped (checked) for the needed information. This is not true of an exclusive arrangement: a miss in the L3 cache requires a snoop of the remaining caches before any data can be retrieved from main memory.
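The snoop-avoidance argument can be illustrated with a toy model. The code below is not a model of any real coherency protocol (MESI/MESIF details are omitted entirely); it only captures the policy difference described above, with caches reduced to plain sets of addresses.

```python
# Toy model of the lookup/snoop decision for inclusive vs. exclusive caches.

def lookup_inclusive(addr, l3):
    # Inclusion guarantees every line in any core's L1/L2 is also in L3,
    # so an L3 miss means no peer cache can hold the line: skip the snoop.
    return "hit in L3" if addr in l3 else "miss -> fetch from memory (no snoop)"

def lookup_exclusive(addr, l3, peer_caches):
    # No such guarantee here: an L3 miss still forces a snoop of the peers.
    if addr in l3:
        return "hit in L3"
    if any(addr in cache for cache in peer_caches):
        return "miss in L3, found in a peer cache (snoop hit)"
    return "miss everywhere -> fetch from memory (after snooping)"

print(lookup_inclusive(0x2000, l3={0x1000}))
print(lookup_exclusive(0x2000, l3={0x1000}, peer_caches=[{0x2000}]))
```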

The other quirk of an inclusive cache is redundancy: data is replicated. Though this sounds fine, it is less efficient. Cache takes up premium real estate, and as such, the more cache one can give the processor the better. However, since data is replicated across cache levels, the effective total cache size is smaller. Consider, by contrast, AMD's exclusive cache approach. An exclusive cache implemented across multiple levels gives the core more overall aggregate cache to store data, and the more data in cache, the less likely the need to spend cycles retrieving data from the slower main system memory. Let's take a more concrete example. AMD's upcoming Deneb core will have ~8.5 MB of total cache (6 MB L3 + 4×512 KB L2 + 4×128 KB L1). Core i7, however, replicates cache lines, so it will have at most 8 MB of cache available in total — basically, for the first time in a long time, AMD's processors will carry more effective cache. In short, Nehalem has come under some criticism for the size of its L2 and L1 caches and, certainly, there will be performance tradeoffs with such a design.
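The capacity arithmetic above is easy to verify. The sketch below simply totals the figures quoted in the text, treating the inclusive hierarchy's effective size as bounded by its L3.

```python
KB, MB = 1024, 1024 * 1024

# Deneb (exclusive): capacities add up across all three levels.
deneb_total = 6 * MB + 4 * 512 * KB + 4 * 128 * KB
print(deneb_total / MB)  # -> 8.5

# Core i7 (inclusive): L1/L2 contents are replicated inside the 8 MB L3,
# so effective capacity is bounded by the L3 size alone.
core_i7_effective = 8 * MB
print(core_i7_effective / MB)  # -> 8.0
```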

Going beyond the similarities between Nehalem and AMD's architecture, Intel has done a few things with Nehalem that provide performance enhancements. Nehalem now comes with expanded translation look-aside buffers (TLBs). TLBs service the memory handling unit and translate virtual memory address space to physical address space. TLBs are complicated, as is any memory handling technique, but to understand the general function of a TLB, think of the index of a textbook. One may find information in a textbook a couple of different ways: simply thumb through the book until you find the chapter, page, and paragraph covering the topic of interest, or turn to the index, look up the topic, and go directly to the page. A TLB, more or less, functions as the latter.
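The textbook-index analogy maps naturally onto a toy TLB. The page size, page table contents, and dictionary-based lookup below are illustrative assumptions of ours, not Nehalem's actual TLB geometry.

```python
# Toy TLB: a small dictionary caching recent virtual->physical page
# translations, with a page-table lookup standing in for the slow path.

PAGE_SIZE = 4096
page_table = {0: 7, 1: 3, 2: 9}   # virtual page -> physical frame (made up)
tlb = {}                          # the fast "index"

def translate(vaddr):
    vpage, offset = divmod(vaddr, PAGE_SIZE)
    if vpage in tlb:              # TLB hit: like using the book's index
        frame = tlb[vpage]
    else:                         # TLB miss: "thumb through the book"
        frame = page_table[vpage]
        tlb[vpage] = frame        # cache the translation for next time
    return frame * PAGE_SIZE + offset

print(translate(4100))  # virtual page 1, offset 4 -> 3*4096 + 4 = 12292
```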

Besides adding to the TLBs, Intel also deepened the execution window in Nehalem by adding more entries to the reorder buffer and enlarging the reservation station. The exact details, again, are too complex for our purposes here; in brief, Intel increased the total number of instructions that the processor logic examines when determining which instructions may run in parallel. A program or thread runs as a series of instructions, an instruction stream. Within this stream, instructions are presented to be executed in a particular order. However, not all instructions need to be executed in the exact order in which they appear: when one instruction does not depend on the outcome of upstream instructions, it can be run out of order and in parallel. The instruction window is the collection of instructions visible to the out-of-order engine, from which the engine selects the, preferably, most efficient order to execute in parallel.

Nehalem is built on the same 4-issue-wide core, logically enough, as it is based on the preceding generation's architecture at the core. Many people misinterpret what '4-issue' really means. In short, a 4-issue core can (but does not necessarily) issue 4 instructions per clock tick to the execution units. That is, up to 4 instructions may be dispatched at a time, provided that enough parallelism can be extracted to produce 4 instructions that can logically be dispatched together. When coupled with other architectural tricks (i.e. macrofusion), the 4-issue design can actually dispatch as many as 5 instructions per clock tick by fusing two instructions and sending them as one. Widening the execution window to examine more total instructions increases the probability that more parallelism can be found at any given time, thus increasing overall performance by improving the number of instructions retired per clock tick.
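A toy dispatch loop makes the "4 issue slots, up to 5 instructions with fusion" point concrete. The mnemonic names and the greedy fusion rule below are simplifications of our own; real macrofusion has much stricter pairing rules.

```python
# Toy model of a 4-issue dispatch cycle: a fused compare+branch pair
# occupies one slot, so up to five instructions can move in one tick.

def dispatch(instructions, issue_width=4):
    """Greedily fuse cmp+jcc pairs, then count instructions sent this cycle."""
    slots, sent, i = 0, 0, 0
    while i < len(instructions) and slots < issue_width:
        if (instructions[i] == "cmp" and i + 1 < len(instructions)
                and instructions[i + 1] == "jcc"):
            sent += 2          # fused pair: two instructions, one slot
            i += 2
        else:
            sent += 1
            i += 1
        slots += 1
    return sent

print(dispatch(["cmp", "jcc", "add", "mov", "sub"]))  # -> 5
print(dispatch(["add", "mov", "sub", "xor", "inc"]))  # -> 4 (no fusion)
```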

In many cases, the total amount of parallelism extractable from a group of instructions is still not enough to keep the dispatch station full and the execution core working — enter simultaneous multithreading, or SMT for short. The concept is rather simple: enable the hardware to track and sustain two distinct software threads simultaneously, so that if one thread stalls (say, waiting on a memory fetch) the other thread can continue working. Intel's variant of SMT is called Hyperthreading and was first introduced in the older Pentium 4 (P4) core. Though in many cases the Pentium 4 saw some performance gains, the fundamental architecture was not well suited to this kind of threading. The P4 core was 3-issue wide with a very deep pipeline (layman's translation: the preparation steps toward sending instructions into the execution units were many). As such, SMT often could not sustain enough instructions in flight to make a significant difference. Some applications saw good gains, others mediocre ones, and in a number of applications performance actually went down.

Core i7, building on the Core 2 lineage, is a much better candidate for SMT. The return of SMT in Nehalem also provided much more incentive to enlarge the reorder buffer and reservation stations, as these resources are now shared. Ultimately, Nehalem should show more impressive gains with SMT but, as with the P4, there will be cases where performance is actually lower, as the competing threads bite into processor resources that would be better left alone. Nonetheless, SMT can, and certainly does, provide a way to keep the execution units busy with work and, as applications become more and more threaded, increase the sum total of instructions retired — and these increases can be significant.

Last, but not least, is Turbo mode. The original concept was actually introduced with Penryn, the 45 nm Core 2 processor immediately preceding Nehalem, but that implementation was somewhat weak relative to Core i7's. Turbo mode has also been the most misreported and misunderstood feature of Nehalem, though the idea is quite simple.

[img]turbo.jpg[/img]

Turbo mode is, in the simplest explanation, the opposite of SpeedStep or Cool-n-Quiet (power saving techniques that change the CPU clock speed to conserve power). Unlike those methods of throttling down, Turbo mode dynamically throttles the processor up, and it will only do so if the thermal and power headroom allow. The key question is what parameters the processor monitors to determine when to dynamically overclock itself.

With SpeedStep, or AMD's analogous Cool-n-Quiet technology, the OS plays the key role: it determines the load on the processor and whether the processor is lightly loaded or idle enough to change the clock speed. When the situation is right, the OS instructs the processor to clock down and lower voltages to conserve power. The trigger, of course, is the processor doing next to nothing, if anything at all. Logically this is rather easy to conceptualize — is the processor idle? If so, throttle down.

Turbo mode, however, increases the clock speed under a different set of conditions. If only a few cores are loaded, and if there is thermal headroom, the processor will bump up the clock speed of the working cores. It is the latter condition that complicates the situation. While it is easy to know the load on the processor, the ability to up-clock is also limited by the total thermals of the CPU. In other words, the processor must now determine how much energy is being consumed and how much it can afford to spend in order to throttle up. To do this, Intel devised a special power control unit that monitors this activity and controls the dynamic clocking independently of the OS.
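The decision the power control unit makes can be sketched in a few lines. All of the thresholds, wattages, and bin counts below are invented for illustration; Intel has not published the actual algorithm.

```python
# Sketch of the turbo decision described above: throttle up only when few
# cores are busy AND thermal/power headroom remains. Values are made up.

def turbo_bins(active_cores, package_watts, tdp_watts=130,
               temp_c=60, temp_limit_c=100):
    """Return how many extra speed bins the busy cores may climb."""
    if package_watts >= tdp_watts or temp_c >= temp_limit_c:
        return 0        # no headroom: stay at stock clocks
    if active_cores <= 2:
        return 2        # lightly threaded load: bigger bump
    return 1            # all cores busy but still cool: small bump

print(turbo_bins(active_cores=1, package_watts=70))   # -> 2
print(turbo_bins(active_cores=4, package_watts=140))  # -> 0
```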

Turbo mode is indeed an innovative feature of Nehalem in that, as processors increase core count, and so long as software lags in taking advantage of those extra cores, Turbo mode enables a bit of the best of both worlds. Cooling and power delivery must be designed for the worst case, and as core count increases, the thermal design must accommodate all cores fully loaded. Software that uses only one or a few cores is effectively punished, since the power consumed or dissipated will always be much less than what the specification allows for. Turbo mode solves this by giving back performance, increasing the clock speed when only a few cores are operational, almost like having your cake and eating it too.

There are pitfalls, however. In environments where cooling efficiency is not optimal, Turbo mode will not be able to 'kick in', and the processor must be intelligent enough to sense this situation. As a result, there can be no real guarantee of the extra performance boost. In short, to get the full advantage of Turbo mode, use a good fan and run in a cool (room temperature) ambient environment.

Readers truly interested in understanding more details of the Core i7 architecture should read two well written articles, http://www.realworldtech.com/page.cfm?ArticleID=RWT040208182719 and http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3382.

[img]x58_chipset.jpg[/img]

Along with a new processor comes a new chipset. The rules have changed with the X58 chipset, but the layout is something you will find familiar. Like the 4-series chipsets on Socket 775 motherboards, the X58 is a dual chip solution that uses Intel's ICH10(R) southbridge. Unlike previous chipsets, the X58 northbridge no longer contains the MCH; instead, the northbridge is considered the I/O hub. Intel's mainstream platform will use a one chip solution along with a new socket, Socket 1156.

One of the most interesting and anticipated details of the X58 chipset is the inclusion of nVIDIA’s SLI technology. That’s right, SLI on an Intel chipset, without hacked drivers. Motherboard makers have the option to include SLI in the BIOS or purchase an SLI bridge chip from nVIDIA. The bridge chip adds an additional $50 to motherboards that incorporate it.

With the inclusion of SLI on Intel's X58 chipset, we, as testers, can finally compare the two competing multi-GPU technologies on the same platform. That gives us the opportunity to test which vendor has implemented the multi-GPU craze best, with "best" meaning scalability from one GPU to two, two to three, and so on. In addition to finding out how each multi-GPU setup performs, we will be able to test how each motherboard manufacturer optimizes this tech on their boards. Today we will be looking at the Asus P6T, while a future review will showcase multiple X58 motherboards.

Now that the boring stuff is out of the way … let’s see some benchmarks!

ASUS was first out of the gates with their X58 motherboard, the P6T Deluxe. Some of the P6T Deluxe features include support for CrossfireX and SLI (plus PhysX), up to 24GB of DDR3 triple channel memory, Express Gate (a proprietary SSD-based quick boot distro), EPU (energy processing unit), onboard SAS, and a 16 + 2 phase power design.

ASUS Express Gate allows you to boot into a very small and limited operating system right out of the box. In about 5 seconds, you have access to the web along with a few other nice features like Skype. The tiny OS is installed on an onboard SSD. We will have a full review with an in-depth look at the P6T shortly. For now, let's take a look at the layout of the board and a closer look at some of its features.

[img]P1040500.JPG[/img]

The overall layout of the P6T is what you would expect from ASUS. The socket area leaves plenty of room around the stock Intel fan/heatsink and should easily accommodate a larger air cooler such as the Thermalright Ultra-120 Extreme. The six memory slots are color coded orange and black, with the orange slots to be filled first. All of the SATA connectors are clear of any video card you mount, thanks to the distance between the first and second PCIe x16/x8 slots.

[img]P1040503.JPG[/img]

In addition to having room for large heatsinks, the socket area gives you plenty of space for mounting a water block like the Swiftech Apogee GTZ.

[img]s1366_gtz.JPG[/img]

A more detailed review is in the works with the P6T Deluxe getting some closer attention.

On hand for testing we have the following hardware:

Core i7 Setup

  • Intel Core i7 940 B0 ES
  • Asus P6T Deluxe
  • Corsair DDR3 1333 1GB x 3 DIMMS
  • 3 x Western Digital Raptors, 74GB each in RAID 0
  • PC Power and Cooling 1KW SR and Silencer 750W
  • 3 x eVGA GTX 280’s
  • 2 x ATI Radeon 4870X2’s
  • Cooling: Stock Intel fan and Swiftech GTZ waterblock with 1366 mounting bracket

Socket 775 Setup

  • Intel Core 2 Extreme QX9650
  • Asus Striker II Extreme
  • DFI Lanparty Jr P45 T2RS
  • OCZ DDR2 800 1GB x 2 DIMMs
  • Patriot Viper Series DDR3 1600 LL

In order to test the hardware, you need to run some benchmarks. All games tested today will be run at a screen resolution of 1680×1050. There are two main reasons for using that resolution: the first is that we are currently limited by the monitor on the testbed, and the second is that most gamers use this resolution or lower. Multi-GPU setups work great on large monitors, but they also work really well on 22″ and smaller ones. It is also a great feeling to be able to run Crysis, or any game, at maxed-out settings!

For the software side of our test setup, we will be using the following programs:

Games

  • Unreal Tournament 3 with maxed out in game settings
  • Crysis with maxed out in game settings and AA set to 8x
  • GRID with maxed out in game settings and AA set to 8x on the 4870X2’s and 16XQCSAA on the GTX 280’s
  • FarCry 2 with maxed out in game settings and AA set to 8x

Synthetic Testing

  • Sisoft Sandra Processor Arithmetic and Multimedia tests
  • wPrime 32M
  • SuperPi mod v1.5 1M
  • 3Dmark Vantage
  • 3Dmark06
  • Everest Cache and Memory Benchmark

With the release of a new processor, the biggest question invariably comes down to how it compares to the previous generation. What we are looking for is performance gains, if any, and how much of an improvement we can expect to see. To fully answer this question we need to compare results obtained from actual gameplay as well as synthetic CPU tests.

In addition to directly comparing Core i7 to Core 2 Quad, we will also see what kinds of gains you can expect out of current top-end graphics cards on an X58 system. As mentioned earlier, the inclusion of SLI on the X58 chipset will allow us to obtain the most accurate multi-GPU comparison possible. One drawback of our P6T Deluxe testbed is the lack of support for 3-way SLI. Keeping this limitation in mind, we will still be able to see how well the GTX 280’s scale on the X58 compared to nVIDIA’s 790i Ultra chipset as well as compare scaling between one 4870X2 and two 4870X2’s on the X58 and a P45-based motherboard.

Two of the other burning questions on users' minds are triple channel DDR3 memory support on the X58 chipset and the return of Hyperthreading. In a nutshell, do these technologies bring anything tangible to the table? We will investigate these new technologies to the fullest in the coming pages.

Our testing methods are quite straightforward. When testing a game, all possible in-game settings are maxed out. Maximum AA settings vary depending on the graphics card used, and have been noted above. We run each game at a resolution of 1680×1050 (sorry guys, the 2560×1600 30″ Dell did not arrive in time for this review).

Synthetic tests are run at default settings, including 3Dmark06 and Vantage. While there are many programs and games out there that offer good results for review purposes, I have chosen the ones listed in the test setup for reproducibility, ease of use, and consistency with previous reviews.

3Dmark06 and 3Dmark Vantage are great video card benchmarking programs, widely used in the overclocking world. A mixture of CPU tests and graphics tests are scored and combined to give you a total score. Vantage is Futuremark's latest test suite and is geared towards DirectX 10 and Windows Vista only.

Our first comparison is designed to look at the scalability of nVIDIA's GTX 280 cards in single card and SLI configurations, as well as ATI's 4870X2 in single card and dual 4870X2 CrossFireX configurations. Stock CPU and GPU speeds were used for all of these comparisons.

[img]3dmark06 cpu.jpg[/img]

The Core i7 940 manages to outmuscle the QX9650 by 809 3Dmarks, an 18% advantage.

[img]3dmark vantage cpu.jpg[/img]

In 3Dmark Vantage we see a difference of 5675 3Dmarks, or an increase of 45% over the QX9650.

There is no doubt that Nehalem is faster clock for clock when comparing the CPU results in 3Dmark06 and Vantage. But does Nehalem impact the GPU in both single card or multi-GPU setups? Let’s find out…

[img]3dmark06 scaling.jpg[/img]

With SLI utilizing dual GTX 280s, we can see a noticeable increase in going from one card to two on the X58 platform compared to the 790i. The increase is not earth-shattering, but nearly 2000 3Dmarks is nothing to sneeze at. The X58 also boasts an 1100 3Dmark advantage on the single card front over the 790i platform. That's an 11% advantage in SLI mode, and a 7% advantage in single card mode. Multi-GPU scaling is also better on the X58 platform: going from one to two cards on the X58 system yields an advantage of 9%, almost double the 5.5% increase seen with the 790i platform.
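For reference, the scaling percentages quoted throughout these results are simple relative gains of a two-card score over a one-card score. The sample scores in the example below are hypothetical, not our measured numbers.

```python
# Relative gain in percent, as used for the scaling figures in this review.

def pct_gain(new_score, old_score):
    return round((new_score - old_score) / old_score * 100, 1)

print(pct_gain(21800, 20000))  # hypothetical 3Dmark scores -> 9.0 (%)
```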

With the 4870X2's, you actually see a slight drop in 3Dmarks when going from one 4870X2 card to two on the X58 platform. This result is quite surprising, and we hope it represents a driver glitch in the CrossFireX multi-GPU implementation. What is nice to see is the improvement in the overall score on the X58 compared to the P45 system: a single 4870X2 sees a nice jump of 13.5% on the X58 system compared to the P45.

[img]3dmark vantage scaling.jpg[/img]

In 3Dmark Vantage, the difference in scaling is much larger. Here we can see that 3Dmark Vantage takes advantage, pun intended, of multi-GPU setups much more efficiently than 3Dmark06. On the single card side, the GTX 280 sees no improvement, but the 4870X2 sees a nice increase of 14% on the X58 platform. When we move from a single GTX 280 to two, the performance increase is 63.5%. With the 4870X2, moving from one to two cards yields a 29% improvement. Compared to the 790i and P45 scaling results of 48% and 21% respectively, the X58 gives you an additional improvement of 15.5% on the 280's and 8% on the 4870X2's.

The gains from the X58 platform range from moderate to substantial, but the most important finding is that the X58 platform out-performs the older platforms in both 3Dmark06 and 3Dmark Vantage in just about every configuration.

It's been floating around the net in various forums that Nehalem would not be able to play games any better, and possibly even worse, than current Core 2 Quad-based machines. We set out to determine whether the Core i7 could offer a substantial improvement in game performance compared to Yorkfield-based systems or whether, as we have seen so many times in the past, the GPU would be the limiting factor.

The first game we decided to take a look at was UT3. UT3 has been around for a while now, but it still offers stellar graphics and PhysX enhancements, and it is an absolute blast to play. The PhysX side will not be covered in this article; results on its impact on GPU performance will be provided in later reviews.

[img]UT3 X58 vs 790i_P45.jpg[/img]

A single 4870X2 on the X58 platform was 37% slower than on the P45. We ran the test a few times to make sure we weren't doing anything wrong, but we came up with the same results each time. As you can see, there is a 20fps advantage on the X58 with dual 4870X2's, and a significant increase of 50fps when moving from one to two 4870X2's on the X58 system. The reason for the substantial loss of performance when running a single 4870X2 on the X58 platform is currently unknown, but our suspicion is that it is driver-related. We will monitor the situation and report further developments.

The GTX 280's offer similar gains in UT3. A single GTX 280 gives you a 36fps increase over the 790i platform, and SLI GTX 280's show a similar 38fps increase over the older system.

GRID is an excellent racing game delivering beautiful graphics. For this test, all in-game settings were set to max, with AA set to 8xMSAA on the 4870X2's and 16xQ CSAA on the GTX 280's; those are the maximum AA settings available on each card.

[img]GRID X58 vs 790i_P45.jpg[/img]

With these results, you can conclude that Core i7 and the X58 motherboards have little impact on overall performance in GRID compared to both the 790i and P45 systems. Scaling on the GTX 280's is along the same lines as in UT3: moving to a dual GTX 280 solution gives you an average of 117fps, almost double the 60fps of a single GTX 280, and the same scaling can be seen on the 790i platform. A single 4870X2 offers performance similar to a dual GTX 280 setup. Moving to dual 4870X2's yields an overall framerate increase of 11fps, hardly worth the additional cost if you are planning on playing only this game and no others.

Crysis has long been a hardware hog, and even today, with the mighty 4870X2's and GTX 280's, you cannot max out AA at resolutions of 1680×1050 and above. 8x AA is playable at 1680×1050, with all other in-game settings set to Very High, and the more GPUs you can throw at it, the better. In our opinion, Crysis still offers the most visually stunning graphics to date, regardless of whether or not it was coded poorly.

[img]Crysis X58 vs 790i_P45.jpg[/img]

Looking at the results, you can see that moving from one GTX 280 to two yields the greatest gains on the X58, 790i and P45 systems alike. The X58 system offers a very small 2fps increase over the 790i in both single card and SLI setups. Moving to the 4870X2's, going from 2 GPUs to 4 GPUs offers a 15fps increase and much more playable framerates.

FarCry 2 was the long anticipated sequel to FarCry, and it has been received with mixed reviews: some reviewers and gamers have claimed it has unbelievable graphics, while others have simply not been impressed. In our experience, FarCry 2 offers beautiful graphics and stunning scenery that is playable at decent in-game settings on many video cards, whether single or multi-GPU solutions. All in-game settings were set to MAX, including AA, which was set to 8x on both the ATI and nVIDIA setups.

[img]FarCry2 X58 vs 790i_P45.jpg[/img]

Just as we observed in Crysis, moving from one GTX 280 to two yields almost double the framerate of a single GTX 280. On the 4870X2 the difference is less dramatic, but the results are still impressive: an increase of 19fps is a tremendous step up from a single 4870X2. Comparing the results of the X58 setup to the 790i and P45 systems, you can see a small but still significant increase in fps across the board. The only instance where there is no difference is on a single 4870X2.

In our synthetic testing, we will first take a look at wPrime 32M and SuperPi 1M. Both are pure number-crunching benchmarks; the key difference is that wPrime is multithreaded, while SuperPi runs a single instance on a single thread.

[img]wPrime_superPi.jpg[/img]

In wPrime 32M we see the amazing power of Nehalem and 8 threads. The Core i7 940 completed wPrime 32M 35% faster than a QX9650 clocked 70MHz higher. While the single-threaded performance shown in the SuperPi 1M results is not quite as impressive, the Core i7 940 still managed to complete the test almost 2 seconds faster than the QX9650, a full 20% faster.

Moving on to SiSoftware Sandra, we are looking at the CPU Arithmetic and Multi-Media performance tests.

[img]sisoft_arithmetic.jpg[/img]

[img]sisoft_multi media.jpg[/img]

Glancing at the Arithmetic tests, we can see an increase of 40-56% over previous generation architectures in the Dhrystone and Whetstone results. The Multi-Media tests show gains of 6.5% and 20%, respectively, for the Core i7 940 over the QX9650.

After running through some synthetic CPU tests, we can now move on to some memory benchmarks and see what performance gains, if any, Core i7 and X58 have to offer. We will also explore the bandwidth gained by moving from single to triple channel memory on the X58 platform.

Let’s look at latency on the X58 compared to the 790i and P45 systems. Please note that the P45 system uses dual channel DDR2 memory at 800MHz while the 790i uses dual channel DDR3 at 1600MHz. For these tests we used Everest.

[img]Comparison of Latency Across Platforms.jpg[/img]

With Intel moving the memory controller onto the CPU die, we are greeted with terrific latency results on the triple channel DDR3 X58 setup, results that improve further as the DDR3 frequency climbs.

Taking a closer look at latency on the X58, we can see how performance varies with the number of DIMMs used and the speed at which the memory is run.

[img]Latency on the X58.jpg[/img]

The above graph is a little busy, but patterns start to emerge as we increase the number of DIMMs used: going from single to dual to triple channel configurations results in increased memory latency.

Moving on to read, write, and copy performance, we can see the benefits of moving beyond one DIMM. What is interesting is how similar the read, write, and copy results are between the dual and triple channel configurations. Triple channel does offer more memory bandwidth, but dual channel memory will suffice for the vast majority of users out there.

[img]single_dual_triple_read.jpg[/img]

[img]single_dual_triple_write.jpg[/img]

[img]single_dual_triple_copy.jpg[/img]
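A back-of-the-envelope calculation shows why adding channels raises the theoretical ceiling even when measured gains flatten out. Peak bandwidth is simply channels times the 64-bit (8-byte) bus width times the transfer rate; the DDR3-1600 figures below are illustrative, not our measured numbers.

```python
def peak_bandwidth_gbs(channels, mt_per_s, bus_bytes=8):
    """Theoretical peak: channels x 64-bit (8-byte) bus x transfers/s, in GB/s."""
    return channels * bus_bytes * mt_per_s / 1000.0

# DDR3-1600 examples (theoretical ceilings, not benchmark results):
single = peak_bandwidth_gbs(1, 1600)   # 12.8 GB/s
dual   = peak_bandwidth_gbs(2, 1600)   # 25.6 GB/s
triple = peak_bandwidth_gbs(3, 1600)   # 38.4 GB/s
```

Real-world read/write/copy numbers fall well short of these ceilings, which is why the jump from dual to triple channel looks much smaller in our graphs than the 50% the theory promises.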

The good news is that if you have a dual channel DDR3 memory kit from a previous generation platform, you will see a noticeable difference when using it on an X58 platform. Moving to triple channel is not really necessary unless you feel inclined to squeeze every last percentage of performance out of the system, but you should know in advance that you will see diminishing returns on your investment.

Finally, we wanted to compare the old gear to the new gear.

[img]triple channel vs dual channel.jpg[/img]

Performance on a triple channel memory system is nearly double that of the aging 790i and P45 systems running dual channel configurations.

Like the title says, this is just a peek at overclocking. We are taking a quick look at the performance gains from raising the CPU from stock speed up to 3.8GHz.

[img]wPrime_superPi_overclocking.jpg[/img]

As you can see, wPrime 32M completes 1 second faster while SuperPi 1M drops by 3 seconds with an 800MHz increase in clock speed. We will delve deeper into overclocking Nehalem in a later review. For now, it's nice to see Core i7 delivering results that a QX9650 could not achieve below 4.4GHz.

[img]intel_core_i7.jpg[/img]

Intel has managed to impress again. Their tick-tock model is working right on schedule. This time around we get a new architecture with the QuickPath Interconnect, which does away with the old and dated front side bus technology. The integration of the memory controller onto the die itself allows for considerable improvements in almost all applications. The true benefit of QPI, however, won't be fully realized until Intel releases its 2P and 4P server solutions. AMD's advantage in that market segment has the potential to disappear, much like the FSB just did.

Not all applications or games will see an improvement on the new Core i7 and X58 platform. The majority of applications do, however, and those that do not are held back by the programs themselves rather than by Nehalem; not many applications can take advantage of 8 threads on a desktop platform. The largest gains are seen in heavily threaded applications such as wPrime 32M, and in games like Crysis and FarCry 2. Even single-threaded applications see an increase in performance, as Core i7 is still faster clock for clock than the QX9650.

Improvements in gaming are seen primarily in multi-GPU solutions, as the new on-die memory controller and triple channel memory provide more bandwidth to the GPUs. Some games will be able to take advantage of these new technologies right away, but as we have seen in our testing, not all games benefit. So that leaves you with the question of whether or not to upgrade. The choice is obviously yours and we are not going to try to sway you one way or another; we'll just let the data speak for itself. What should be kept in mind is that it is now possible to get the performance of what used to be one of Intel's fastest chips, the QX9650, in a CPU that costs one third as much. In our testing, moving the same hardware from the old 790i and P45 systems to the X58 and Core i7 offered a noticeable improvement not just in benchmarks but in system responsiveness as well. Obviously price is a factor, and moving to a Core i7 and X58 system isn't going to be cheap, so we do recommend that you take a good look at your wallet if you are seriously contemplating a Core i7 upgrade.

This article marks the beginning of our Core i7 coverage. In future reviews we will take an in-depth look at how Turbo mode works, as well as dissecting the differences among the various X58 motherboards available for the Core i7 lineup.

For the simple fact that this is the fastest platform available to date for the desktop, XCPUS.com is proud to give Intel’s Core i7 and the X58 chipset the GOLD Award.

[img]goldcopygh3[1].jpg[/img]

Discuss in the forums
