User:Vivrax/Cray MTA
Cray MTA (formerly known as Tera MTA, for Multithreaded Architecture) is a highly multithreaded scalar shared-memory multiprocessor supercomputer architecture developed by Tera Computer Company and later expanded by Cray. It features a barrel processor arrangement with 128 hardware threads, fine-grained communication and synchronization between threads, and high latency tolerance for irregular computations. Five generations of supercomputer systems have been developed on the MTA architecture, all bearing the same name as the underlying architecture. Its predecessors are the HEP and Horizon supercomputer architectures. The principal architect of all generations was Burton J. Smith.[1][2]
Architecture | Introduced | First System Deployed | Process | Cooling | Instruction set | CPU | Socket | CPU frequency | CPU TDP | Memory | EMS bits | Hashing granularity | Max. CPU count | Max. memory | Network Topology |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Tera | 1990 | never | ? | ? | ??-bit VLIW | ? | ? | 333 MHz[nb 1] | ? | custom 128 MB data memory units | 6 (4 trap)[5] | 64-bit? | 256 | 64 GB | 3D torus |
Tera MTA | 1996 | 1997 | 0.6 μm "H-GaAs III" MESFET | water | 64-bit VLIW[6] | 24 GaAs FX350K-based chips[4][7][nb 2] | custom 524-ball BGA[7] | 255 MHz[nb 3] | ~1,000 W[7][8] | 1 or 2 GB custom SDRAM DIMMs | 4 (2 trap) | 64-bit? | 256 | 512 GB | 3D torus |
Cray MTA-2 | 2001 | 2002 | CMOS | air | 64-bit VLIW[9] | Torrent[10] (5 CMOS ASICs[11]) | custom | 220 MHz | 50 W[8] | 1 GB custom DIMMs | 4 (2 trap) | 64-bit | 256 | 1 TB | modified Cayley[nb 4] |
Cray XMT | 2005 | 2007 | ? | air | 64-bit VLIW[nb 5] | Threadstorm3 | Socket 940[12] | 500 MHz[12] | 30 W[12] | 4 or 8 GB DDR1 | 4 (2 trap) | 512-bit | 8192 | 64 TB | 3D torus |
Cray XMT2 | 2011 | 2011 | ? | air | 64-bit VLIW[nb 5] | Threadstorm4 | Socket F | 500 MHz | 30 W? | 16 or 32 GB DDR2 | 2 | 512-bit | 8192 | 512 TB | 3D torus |
Architecture | Known prototypes | Known adopters |
---|---|---|
Tera | None | None |
Tera MTA | | San Diego Supercomputer Center[nb 6] |
Cray MTA-2 | Cray | Electronic Navigation Research Institute; U.S. Naval Research Laboratory ("Boomer") |
Cray XMT | Cray[14] | Pacific Northwest National Laboratory ("Cougar")[15]; Sandia National Laboratories ("Graphstorm"); Oak Ridge National Laboratory ("Erdos"); several 64 CPU systems to DoD and DoE by 2009[14] |
Cray XMT2 | | Swiss National Supercomputing Centre ("Matterhorn")[nb 7]; Noblis Center for Applied High Performance Computing ("Danriver")[22]; Mayo Clinic ("Grace"); Pittsburgh Supercomputing Center ("Sherlock")[nb 8]; Institute of Systems Biology and a US government organization[27] |
History
The MTA design cost Tera $6.7 million (1995), $10.5 million (1996), $13.2 million (1997), $13.7 million (1998), and $15.2 million (1999), with 50 full-time engineers in 1996 and 70 in 1999. DARPA invested $19.3 million in the MTA design from 1990 to 1999. As of 1998, Tera had been granted one software patent for its compiler optimizations and had 15 pending for software and hardware inventions.
The core CPU architecture remained the same from 1990 throughout all generations; the only changes were in the manufacturing process (GaAs to CMOS), support for various custom and off-the-shelf memory modules, faster clock speeds, and minor changes in the interconnect topology (modified Cayley before reverting to the 3D torus). Thus, software compiled for the initial Tera should run on any later system.
In 1998, Tera revealed plans to deploy the Tera MTA to European partners, though that never materialized.[28] In 2014, YarcData, a Cray division, announced that its uRiKA line of systems (GD, based on Cray XMT2, and XA, based on Intel Xeon CPUs) was being used by "CSCS, the Mayo Clinic, Noblis, ORNL, QinetiQ, Pittsburgh Supercomputing Center and Sandia National Laboratories, as well as leading government and intelligence organizations, financial services firms, life sciences companies, and telecommunications providers", though only the first three and PSC actually had an XMT2 system deployed, while ORNL and Sandia had older Cray XMT systems predating YarcData and uRiKA. The software used in uRiKA-GD later became Cray Graph Engine.
Development on Threadstorm4, and on the whole MTA architecture, ended quietly after XMT2, even though Cray never officially discontinued either XMT or XMT2. Probable reasons include the departure of Burton J. Smith to Microsoft in 2005, as well as competition from commodity processors such as Intel's Xeon[29] and Xeon Phi. A 2014 research paper from the University of Washington implied that the Cray XMT was not overcoming its narrow range of applicability, demonstrating that a system built from commodity x86 processors and specialized software could achieve performance of the same order of magnitude as Threadstorm CPUs (between 2.6x faster and 4.4x slower than the XMT) for graph-related problems.[30] In the CUG 2015 proceedings, a presentation showed a successful port of the uRiKA-GD system from the MTA architecture to XC30/40 with good preliminary performance, implying the redundancy of the Cray XMT2 infrastructure by 2015.[31] In late 2016, the uRiKA-GD (the last user of the Cray XMT2 infrastructure and Threadstorm4 CPUs) and uRiKA-XA (based on Xeon E5 v3 CPUs) appliances merged into uRiKA-GX, which utilized Xeon E5 v4 processors. As of 2020, Cray has removed all customer documentation on XMT, XMT2, and uRiKA-GD from its online catalogue. Most XMT- and XMT2-based systems had been decommissioned by 2020.
Tera Computer System
Designer | Tera Computer Company |
---|---|
Bits | 64-bit |
Introduced | 1990 |
Version | 1st generation of MTA |
Successor | Tera MTA |
Tera Computer System is the first generation of the MTA architecture, as described in a 1990 paper.[3] The system described in the paper was never commercially produced, though prototypes were probably made during its development. The architectural design choices the Tera Computer System set forth were retained by all subsequent architectures (except for the number of additional state bits and the hashing granularity), thus ensuring backwards compatibility.
- 64-bit VLIW ISA, 3 RISC-like instructions per VLIW instruction
- load-store architecture
- 64-bit single-core barrel processor with 128 streams (each mapping 1 thread)
- 3 functional units (Memory, Adder, Control)
- 128 separate register sets (32 GP, 8 target, status word with PC)
- 16 protection domains
- fair, full context-switching at each cycle (zero-cost switching)
- pipeline of 21 cycles
- no data caches
- latency-tolerant, targeted at problems with a lot of unpredictable memory accesses (graph problems, semantic databases, genome sequencing, other big data)
- "CPU shall never stall, memory lanes should be saturated"
- big unified shared content-addressable memory with scrambling at memory word granularity
- 6 additional state bits on each 64-bit memory word for tagged memory
- support for IEEE 754 single and double precision floating point numbers, quad precision 128-bit floating point numbers, 64-bit complex numbers and arbitrary length unsigned numbers
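The barrel-processor behavior in the list above can be sketched in a few lines of Python (an illustrative model, not Tera code; all names are hypothetical): one instruction issues per cycle from the next ready stream, in round-robin order, and streams blocked on memory are simply skipped.

```python
# Illustrative sketch of a barrel processor: one issue slot per cycle,
# rotated round-robin across hardware streams, skipping blocked streams.
from collections import deque

class BarrelProcessor:
    def __init__(self, n_streams=128):
        # Per-stream cycle at which its outstanding load returns (0 = ready).
        self.stalled_until = [0] * n_streams
        self.order = deque(range(n_streams))

    def step(self, cycle):
        """Issue one instruction from the first ready stream, round-robin."""
        for _ in range(len(self.order)):
            s = self.order[0]
            self.order.rotate(-1)          # move stream to the back of the line
            if self.stalled_until[s] <= cycle:
                return s                   # this stream issues this cycle
        return None                        # every stream blocked: pipeline bubble

cpu = BarrelProcessor(n_streams=4)
cpu.stalled_until[0] = 10                  # stream 0 waits on a memory reference
issued = [cpu.step(c) for c in range(6)]
# Stream 0 is skipped while blocked; the remaining streams cover for it.
```

With only 4 streams and one blocked, the other three keep the issue slot full every cycle, which is the "zero-cost switching" property in miniature.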
Tera MTA
Designer | Tera Computer Company |
---|---|
Bits | 64-bit |
Introduced | 1996 |
Version | 2nd generation of MTA |
Predecessor | Tera Computer System |
Successor | Cray MTA-2 |
Tera MTA is a scalable multithreaded shared memory supercomputer architecture by Tera Computer Company, the second generation of the MTA architecture and its first commercial implementation.
Each MTA processor (CPU) has a high-performance ALU with many independent register sets, each running an independent thread. For example, the Cray MTA-2 uses 128 register sets and thus 128 threads per CPU/ALU. All MTAs to date use a barrel processor arrangement with a thread switch on every cycle, skipping blocked (stalled) threads to avoid wasting ALU cycles. When a thread performs a memory read, execution blocks until data returns; meanwhile, other threads continue executing. With enough threads (concurrency), there are nearly always runnable threads to "cover" for blocked ones, and the ALUs stay busy. The memory system uses full/empty bits to ensure correct ordering: for example, an array A is initially written with "empty" bits, and any thread reading a value from A blocks until another thread writes a value. This ensures correct ordering while allowing fine-grained interleaving and providing a simple programming model. The memory system is also "randomized", with adjacent physical addresses going to different memory banks; thus, when two threads access memory simultaneously, they rarely conflict unless they access the same location.
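The full/empty-bit ordering described above can be sketched with a hypothetical `TaggedWord` helper (a simplification of the hardware: a real consuming read also atomically resets the bit to empty, which is omitted here):

```python
# Sketch of full/empty-bit semantics: each memory word carries a "full" tag;
# a synchronized read of an empty word blocks until a producer's write
# sets the tag, with no explicit locks in the consumer's code.
import threading

class TaggedWord:
    def __init__(self):
        self.value = None
        self.full = threading.Event()      # stands in for the full/empty bit

    def write_full(self, value):
        """Store a value and mark the word full, waking blocked readers."""
        self.value = value
        self.full.set()

    def read_when_full(self):
        """Block until the word is full, then return its value."""
        self.full.wait()
        return self.value

word = TaggedWord()
result = []
consumer = threading.Thread(target=lambda: result.append(word.read_when_full()))
consumer.start()            # blocks: the word is still "empty"
word.write_full(42)         # producer fills the word; consumer proceeds
consumer.join()
# result == [42]
```

The consumer never polls application state; ordering falls out of the tag on the word itself, which is the "simple programming model" the text refers to.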
A goal of the MTA is that porting codes from other machines should be straightforward while still giving good performance. A parallelizing FORTRAN compiler can produce high performance for some codes with little manual intervention. Where manual porting is required, the simple, fine-grained synchronization model often allows programmers to write code the "obvious" way yet achieve good performance. A further goal is that programs for the MTA be scalable – that is, when run on an MTA with twice as many CPUs, the same program should have nearly twice the performance. Both of these are challenges for many other high-performance computer systems.
An uncommon feature of the MTA is that several workloads can be interleaved with good performance. Typically, supercomputers are dedicated to one task at a time. The MTA allows idle threads to be allocated to other tasks with very little effect on the main calculations.
The MTA uses many register sets, so each register access is slow. Although concurrency (running other threads) typically hides latency, slow register file access limits performance when there are few runnable threads. In existing MTA implementations, single-thread performance is 21 cycles per instruction,[32] so performance suffers when there are fewer than 21 threads per CPU.
The MTA-1, -2, and -3 use no data caches. This reduces CPU complexity and avoids cache coherency problems. However, the lack of data caching introduces two performance problems. First, the memory system must support the full data access bandwidth of all threads, even for unshared and thus cacheable data, so good system performance requires very high memory bandwidth. Second, memory references take 150–170 cycles,[32][33] a much higher latency than even a slow cache, increasing the number of runnable threads required to keep the ALU busy.
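A back-of-the-envelope check of this concurrency requirement (my own arithmetic, not from the cited sources): by Little's law, hiding an L-cycle latency at one issue per cycle needs roughly L units of in-flight work, so 128 streams alone cannot cover 150–170 cycles at one reference per instruction — the allowance of multiple outstanding memory references per stream (up to 8 on the MTA CPU, per the section below) makes up the difference.

```python
# Little's law: in-flight work = issue rate x latency. With one instruction
# issued per cycle, an L-cycle stall needs ~L other runnable threads to hide.
def threads_to_cover(latency_cycles, issues_per_cycle=1):
    return latency_cycles * issues_per_cycle

assert threads_to_cover(21) == 21      # pipeline bound: >= 21 threads per CPU
assert threads_to_cover(150) == 150    # memory bound: more than 128 streams
# With up to 8 outstanding memory references per stream, 128 streams can keep
# 128 x 8 = 1024 references in flight, comfortably above the 150-170 needed.
assert 128 * 8 >= 170
```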
Full/empty status changes use polling, with a timeout for threads that poll too long. A timed-out thread may be descheduled and the hardware context used to run another thread; the OS scheduler sets a "trap on write" bit so the waited-for write will trap and put the descheduled thread back in the run queue.[33] Where the descheduled thread is on the critical path, performance may suffer substantially.
The MTA is latency-tolerant, including for irregular latency, giving good performance on irregular computations provided there is enough concurrency to "cover" delays. The latency-tolerance hardware may be wasted on regular calculations, including those whose latency is high but which can be scheduled easily.
The operating system of the Tera MTA was Tera MTX, a fully distributed and symmetric implementation of UNIX. The system runs stand-alone and has no dedicated service nodes or front ends, as became common in later generations.
Unisys Corporation was contracted to provide semiconductor test and packaging services (1996–1998), while Vitesse (1996–1997, 2000) and TriQuint (1996, 1998–2000) were contracted to supply GaAs wafers.[2][7]
MTA and MTA-2 systems were constructed from resource modules. Each resource module measures approximately 5 by 7 by 32 inches and contains:[10][34]
- computational processor (CP)
- I/O processor (IOP) connected to an I/O device via HIPPI
- memory units.
Model | MTA CPUs | IO CPUs | Memory | Performance | Bisection bandwidth | IO bandwidth |
---|---|---|---|---|---|---|
MTA 16 | 16 | 16 | 16-32 GB | 16 GFLOPS | 179 GB/s | 6 GB/s |
MTA 32 | 32 | 32 | 32-64 GB | 32 GFLOPS | 179 GB/s | 12 GB/s |
MTA 64 | 64 | 64 | 64-128 GB | 64 GFLOPS | 358 GB/s | 25 GB/s |
MTA 128 | 128 | 128 | 128-256 GB | 128 GFLOPS | 717 GB/s | 51 GB/s |
MTA 256 | 256 | 256 | 256-512 GB | 256 GFLOPS | 1434 GB/s | 102 GB/s |
In addition to the SDSC researchers, early users of the MTA system included Boeing, Caltech, Los Alamos National Laboratory, the National Energy Research Scientific Computing Center, and Sanders, a Lockheed Martin company.[35] The only deployed Tera MTA system in existence, at the San Diego Supercomputer Center, was retired in September 2001, roughly a year after its last upgrade to 16 CPUs.[36]
MTA CPU
The MTA CPU comprises 24 GaAs chips based on Vitesse's FX350K gate arrays, with a very high thermal dissipation of between 20 and 48 W per chip, resulting in about 1,000 W for the whole CPU package.[7][8] The custom FX350K chips were produced using Vitesse's 0.6 μm "H-GaAs III" MESFET process.[37] The CPU is packaged in a custom 524-ball BGA.[7] The MTA processor design was a joint effort between Tera and the Design Services group of Cadence Design Systems.[38]
- 8 KB level-1 and 2 MB level-2 instruction caches
- Up to 8 concurrent memory references per thread
Memory subsystem
- custom memory modules
- 4 additional state bits per 64-bit memory word (reduced from 6 bits in the Tera Computer System, removed 2 trap bits)
- 4-way associative, 512-entry TLB[39]
- 47-bit memory address space, though only 42 bits are used[39]
- segment sizes from 8 kB to 256 MB[39]
- word-granular address hashing
Interconnect system
Each node has five ports: four to neighboring nodes and one to a resource board.[7] Each port simultaneously transmits and receives an entire 164-bit packet every 3 ns clock cycle. Of the 164 bits, 64 are data, so the data bandwidth per port is 2.67 GB/s in each direction, for a peak communication bandwidth of 13.3 GB/s.[6][7]
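The quoted port figures are self-consistent, as a quick check shows (arithmetic only, no new data):

```python
# Sanity check of the per-port bandwidth: 64 data bits every 3 ns cycle.
data_bits_per_cycle = 64
cycle_time_s = 3e-9
port_gb_per_s = data_bits_per_cycle / 8 / cycle_time_s / 1e9  # GB/s per direction
assert round(port_gb_per_s, 2) == 2.67        # matches the quoted 2.67 GB/s
assert round(5 * port_gb_per_s, 1) == 13.3    # five ports: 13.3 GB/s peak
```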
Cray MTA-2
Designer | Cray |
---|---|
Bits | 64-bit |
Introduced | 2001 |
Version | 3rd generation of MTA |
Endianness | Big-endian |
Predecessor | Tera MTA |
Successor | Cray XMT |
Cray MTA-2 is a scalable multithreaded shared memory supercomputer architecture by Cray, based on the third generation of the Tera MTA architecture. Presented in 2001, MTA-2 was an attempt to produce a cheaper and more reliable implementation of the MTA architecture. The manufacturing process switched from GaAs to CMOS, dramatically lowering production and maintenance costs as well as the thermal dissipation of the processor package, while essentially retaining the same CPU architecture; the network design changed from the 3D torus topology to a less efficient but more scalable Cayley graph topology. Despite addressing major flaws of the Tera MTA, Cray MTA-2 was not a commercial success, selling only one medium-sized system of 40 CPUs.
Taiwan Semiconductor Manufacturing Company was contracted in 1999 to produce CMOS wafers. Kyocera America, Inc. was contracted for MTA-2 packaging, Samsung Semiconductor for MTA-2 custom memory, Sun Microsystems, Inc. for I/O systems of the MTA-2 and InterCon Systems, Inc. for custom interconnects. The operating system of the Cray MTA-2 was Cray MTX, Tera MTX's successor.
In February 2001, Cray announced a $5.4 million contract for a 28 CPU MTA-2 for the U.S. Naval Research Laboratory, due in Q4 2001.[40] In February 2002 Cray delivered a 16 CPU MTA-2 and upgraded it in stages to a 40 CPU MTA-2 in late 2002. The largest known prototype was a 44 CPU system at Cray. A presentation from the CUG 2006 proceedings suggested that "10 MTA processors [are] likely as fast as 32,000 BlueGene/L processors" for graph-related problems.[41]
Torrent CPU
General information | |
---|---|
Launched | 2001 |
Designed by | Cray |
Performance | |
Max. CPU clock rate | 220 MHz |
Architecture and classification | |
Instruction set | MTA ISA |
History | |
Predecessor | Tera MTA CPU |
Successor | Threadstorm3 |
The Torrent has a TDP of 50 W. Eight trap registers were added per register set (retained through XMT2).[8] The CMOS process was not disclosed, but considering TSMC's technology at the time (between the signing of the contract in 1999 and the first delivered system in 2002), the chips could have been produced on a 180 nm (1999) or 130 nm (2001) manufacturing process.
Memory subsystem
Cray XMT
[edit]Designer | Cray |
---|---|
Bits | 64-bit |
Introduced | 2005 |
Version | 4th generation of MTA |
Endianness | Big-endian |
Predecessor | Cray MTA-2 |
Successor | Cray XMT2 |
Registers | |
32 general-purpose per stream (4096 per CPU) 8 target per stream (1024 per CPU) |
Cray XMT (Cray eXtreme MultiThreading,[1] codenamed Eldorado[4]) is a scalable multithreaded shared memory supercomputer architecture by Cray, based on the fourth generation of the Tera MTA architecture and targeted at large graph problems (e.g. semantic databases, big data, pattern matching).[42][14][43] Presented in 2005, it supersedes the earlier unsuccessful Cray MTA-2. It uses Threadstorm3 CPUs inside Cray XT3 blades. Designed to make use of commodity parts and of subsystems existing for other commercial systems, it alleviated the shortcomings of the Cray MTA-2's costly, fully custom manufacture and support.[4] It brought various substantial improvements over Cray MTA-2, most notably more than doubling the clock speed, nearly tripling the peak performance, and vastly increasing the maximum CPU count to 8,192 and the maximum memory to 64 TB, with a maximal data TLB reach of 128 TB.[4][42] To provide compatibility with existing XT3 infrastructure, commodity memory, and higher network bandwidth, engineers had to sacrifice the higher network and memory speeds of the MTA-2.[41][nb 10]
Cray XMT uses a scrambled[42] content-addressable memory[44] model with 64-byte (8-word) granularity on DDR1 ECC modules to implicitly load-balance memory access across the whole shared global address space of the system.[43] There are no hardware interrupts, and hardware threads are allocated by an instruction, not by the OS.[43][45] The front end (login, I/O, and other service nodes, using AMD Opteron processors and running SLES Linux) and back end (compute nodes, using Threadstorm3 processors and running MTK, a simple BSD Unix-based microkernel[42]) communicate through the LUC (Lightweight User Communication) C++ interface, an RPC-style bidirectional client/server interface.[1][43][46]
Research and development of the Cray XMT was co-funded by the U.S. government. Though presented in 2005, the first early version of the system shipped in September 2007, with smaller shipments in 2008 and 2009. It was declared a current product in Cray's 10-K filing of 2010. The largest known prototype was a 512 CPU system at Cray.
Threadstorm3
General information | |
---|---|
Launched | 2006 |
Discontinued | 2009 |
Designed by | Cray |
Performance | |
Max. CPU clock rate | 500 MHz |
HyperTransport speeds | up to 300 GT/s |
Architecture and classification | |
Instruction set | MTA ISA |
History | |
Predecessor | Cray Torrent |
Successor | Threadstorm4 |
Threadstorm3 (referred to as the "MT processor"[4] and as Threadstorm before XMT2[47]) is a 64-bit single-core VLIW barrel processor (compatible with the 940-pin Socket 940 used by AMD Opteron processors) with 128 hardware streams, onto each of which a software thread can be mapped (effectively creating 128 hardware threads per CPU), running at 500 MHz and using the MTA instruction set or a superset of it.[45][46][nb 11] It has a 128 KB, 4-way associative data buffer. Each Threadstorm3 has 128 separate register sets and program counters (one per stream), which are fully and fairly[48] context-switched at each cycle.[43] Its estimated peak performance is 1.5 GFLOPS. It has 3 functional units (memory, fused multiply-add and control), which receive operations from the same MTA instruction and operate within the same cycle.[45] Each stream has 32 general-purpose registers, 8 target registers and a status word containing the program counter.[44] High-level control of job allocation across threads is not possible.[43][nb 12] Due to the MTA's pipeline length of 21, each stream is selected to execute its next instruction no sooner than 21 cycles later, effectively capping sequential single-thread speed at 23.8 MHz.[49] The TDP of the processor package is 30 W.[12]
Due to the thread-level context switch at each cycle, the performance of Threadstorm CPUs is not constrained by memory access time. In a simplified model, at each clock cycle an instruction from one of the threads is executed and another memory request is queued, with the understanding that by the time the next round of execution is ready, the requested data has arrived.[50] This is in contrast to many conventional architectures, which stall on memory access. The architecture excels at data-walking schemes where subsequent memory accesses cannot be easily predicted and thus would not be well served by a conventional cache model.[1]
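The 23.8 MHz single-thread cap and the 1.5 GFLOPS peak both follow from the quoted clock and pipeline figures (a consistency check, assuming the peak corresponds to all three functional units retiring one operation per cycle):

```python
# Both quoted figures follow from the 500 MHz clock and the 21-stage pipeline.
clock_hz = 500e6
pipeline_depth = 21
single_thread_hz = clock_hz / pipeline_depth       # one issue per 21 cycles
assert round(single_thread_hz / 1e6, 1) == 23.8    # quoted 23.8 MHz cap
ops_per_cycle = 3    # memory, fused multiply-add, and control functional units
assert clock_hz * ops_per_cycle == 1.5e9           # quoted ~1.5 GFLOPS peak
```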
TSMC was contracted in 2006 to produce Threadstorm3 chips and later the SeaStar2 router chip. Again, the manufacturing process for the chips was not disclosed, but given TSMC's technology at the time, a 90 nm (2004) or 65 nm (2006) process could have been used for Threadstorm3.
Memory subsystem
Cray XMT uses a shared byte-addressable memory system with 64-bit memory words, each of which has 4 additional "Extended Memory Semantics" bits (full/empty, forwarding and 2 trap bits) associated with it, enabling lightweight, fine-grained synchronization on all memory between all CPUs. Memory is implemented using commodity DDR components (with modules from 4 GB to 16 GB) and is protected against single-bit failures using ECC. Each module has a single access port and is 128 bits wide. Logical addresses are hashed to physical addresses in 8-word blocks rather than with the word-granularity hashing used in MTA-2's custom memory system. Threadstorm3 also introduced a local, non-coherent 128 kB, 4-way associative data buffer, previously absent from the architecture.[4][45]
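The hashing scheme can be illustrated with a toy model (the hash function here is hypothetical; only the 8-word/64-byte granularity is from the sources): addresses within one block map to the same home bank, while a linear sweep of blocks is spread across banks.

```python
# Illustrative sketch of block-granular address hashing. The multiplicative
# "salt" stands in for the undocumented hardware scramble function.
WORD_BYTES = 8
BLOCK_WORDS = 8            # XMT hashes 8-word blocks (MTA-2: single words)

def home_bank(byte_addr, n_banks, salt=0x9E3779B97F4A7C15):
    block = byte_addr // (WORD_BYTES * BLOCK_WORDS)
    return (block * salt) % n_banks    # hypothetical scramble of the block index

# Words inside one 64-byte block share a bank...
assert home_bank(0, 16) == home_bank(63, 16)
# ...but consecutive blocks of a linear sweep land on many different banks.
banks = {home_bank(b * 64, 16) for b in range(64)}
assert len(banks) > 1
```

Keeping the 8 words of a block together matches the 128-bit module width and DDR burst behavior, while the scramble breaks up the bank conflicts that strided access patterns would otherwise cause.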
Scorpio
After launching XMT, Cray researched a possible multi-core variant of the Threadstorm3, dubbed Scorpio. Most of Threadstorm3's features would be retained, including the multiplexing of many hardware streams onto an execution pipeline and the implementation of additional state bits for every 64-bit memory word. Cray later abandoned Scorpio, and the project yielded no manufactured chip.[42]
Cray XMT2
Designer | Cray |
---|---|
Bits | 64-bit |
Introduced | 2011 |
Version | 5th generation of MTA |
Endianness | Big-endian |
Predecessor | Cray XMT |
Registers | |
32 general-purpose per stream (4096 per CPU) 8 target per stream (1024 per CPU) 8 trap per stream (1024 per CPU) |
Cray XMT2[42] (also "next-generation XMT"[47] or simply XMT[44]) is a scalable multithreaded shared memory supercomputer architecture by Cray, based on the fifth generation of the Tera MTA architecture.[43] Presented in 2011, it supersedes Cray XMT, which had issues with memory hotspots.[47] It uses Threadstorm4 CPUs inside Cray XT5 blades and, compared to XMT, increases memory capacity eightfold to 512 TB and memory bandwidth threefold, by using twice the memory modules per node and DDR2 (at 300 MHz instead of 200 MHz).[44][47] It introduces the Node Pair Link inter-Threadstorm interconnect, as well as memory-only nodes, whose Threadstorm4 packages have their CPU and HyperTransport 1.x components disabled.[43][51] The underlying scrambled content-addressable memory model is inherited from XMT. XMT2 is the first iteration of the MTA architecture to reduce the additional state bits from 4 to 2 (full/empty and extended, removing the trap bits).
Cray's 10-K filings and product websites never mention XMT2 or "next-generation XMT" as offered products, instead listing uRiKA-GD utilizing Threadstorm4 CPUs (marketed as "Graph Accelerators") until late 2016 (when uRiKA-GD merged with uRiKA-XA into uRiKA-GX, replacing Threadstorm4 with Intel Xeon E5 v4), while making no direct mention of their memory subsystems or the XT5 infrastructure.[2][52] Pricing on uRiKA configurations has not been made public, but according to Arvind Parthasarathi, then-YarcData general manager, a low-end setup (probably the smallest, one-cabinet XMT2 configuration of 64 CPUs) cost in the low hundreds of thousands of dollars.[53] According to publicly available data, Cray sold two uRiKA-64 systems and one uRiKA-128 system, as well as two more uRiKA configurations of unknown size.
Threadstorm4
General information | |
---|---|
Launched | 2011 |
Discontinued | 2016 |
Designed by | Cray |
Performance | |
Max. CPU clock rate | 500 MHz |
HyperTransport speeds | up to 400 GT/s |
Architecture and classification | |
Instruction set | MTA ISA |
History | |
Predecessor | Threadstorm3 |
Threadstorm4 (also "Threadstorm IV"[1] and "Threadstorm 4.0"[nb 13]) is a 64-bit single-core VLIW barrel processor (compatible with 1207-pin Socket F used by AMD Opteron processors) with 128 hardware streams, very similar to its predecessor, Threadstorm3. It features an improved, DDR2-capable memory controller.
TSMC was contracted to produce Threadstorm4 chips. As with previous generations, the manufacturing process for the chips was not disclosed, but given TSMC's available technology for CPU ICs in 2011, a 40 nm, 28 nm or 20 nm process could have been used for Threadstorm4. The TDP was not publicly revealed, though it is probably the same 30 W as its predecessor's.
Memory subsystem
Cray intentionally decided against a DDR3 controller, citing the reuse of existing Cray XT5 infrastructure[nb 14] and DDR2's shorter burst length.[nb 15] Though the longer burst length could be compensated for by the higher speeds of DDR3, that would also require more power, which Cray engineers wanted to avoid.[47]
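The burst-length argument in [nb 15] can be reproduced directly (arithmetic only): with a 128-bit (16-byte) channel and a typical single 8-byte word access, each burst fetches burst length × 16 bytes, of which only 8 are useful.

```python
# Wasted bytes per typical XMT access: one 8-byte word over a 16-byte channel.
def overhead_bytes(burst_length, channel_bytes=16, useful_bytes=8):
    return burst_length * channel_bytes - useful_bytes

assert overhead_bytes(4) == 56     # DDR2, burst length 4: quoted 56-byte overhead
assert overhead_bytes(8) == 120    # DDR3, burst length 8: quoted 120-byte overhead
```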
Notes
- ^ The 1990 paper only mentions a clock period of at most 3 ns.
- ^ While a website from 2001 by Cray, as well as a 1999 interview with Jerry Loe, then-vice president of hardware engineering at Tera, imply a 24 GaAs chip/CPU setup, presentations about Cray XMT2 in 2011 and 2012 claimed the Tera MTA to have "18 GaAs chips per processor blade". According to the most definitive document on the subject, a Tera research paper from 1997, there were, however, 26 GaAs gate array chips per CPU package.
- ^ On their website in 1995, Tera claimed a nominal clock speed of 333 MHz for the Tera MTA but at deployment in 1997, only 145 MHz were achieved, though that was later improved to 255 MHz.
- ^ Also described as "3D torus with missing links" in some papers.
- ^ a b Though no definitive source has been found on the VLIW instruction width for Threadstorm3 and Threadstorm4, considering no mention of additional translation stages a wider instruction would incur and the stated backwards compatibility, it is assumed that Threadstorm3 and Threadstorm4 both use the same 64-bit wide VLIW ISA as all previous generations.
- ^ Retired in September 2001.
- ^ Retired in 2015.
- ^ Retired between 2016 and 2018.
- ^ Only one MTA with 16 CPUs has ever been produced. Other proposed configurations were never produced.
- ^ MTA-2's modified Cayley architecture provided lower latencies and higher speeds than the XT3's standard 3D torus architecture, due to lacking certain connections. Though no technical specifications for the custom Samsung memory modules in MTA-2 were given, presentations suggest a higher speed than the commercial DDR-400 modules at 200 MHz used in XT3.
- ^ The Tera MTA ISA is closed-source, and it is only due to a workshop presentation asserting backward compatibility with previous MTA systems that the ISA used on Threadstorm CPUs is known not to be a mere subset of the MTA ISA.
- ^ Though it is not known if it is possible on the instruction-level.
- ^ On physical package.
- ^ Even though the DDR3-based Cray XT6 was launched in 2009, two years prior to XMT2.
- ^ As Cray XMT mostly operates with single 8-byte word random accesses and has a 128-bit memory channel, at DDR2 burst length of 4, the usual overhead is 56 bytes. DDR3 with its burst length of 8 would increase the usual overhead to 120 bytes.
References
- ^ a b c d e "Why is uRiKA So Fast on Graph-Oriented Queries?". YarcData Blog. November 14, 2012. Archived from the original on February 14, 2015.
- ^ a b c 10-K filings from Cray, Inc. from 1996 to 2019.
- ^ a b Alverson, Robert; Callahan, David; Cummings, Daniel; Koblenz, Brian; Porterfield, Allan; Smith, Burton (1990-09-01). "The Tera computer system". ACM SIGARCH Computer Architecture News. 18 (3): 1–6. doi:10.1145/255129.255132.
- ^ a b c d e f g Feo, John; Harper, David; Kahan, Simon; Konecny, Petr (2005). "ELDORADO". Proceedings of the 2nd Conference on Computing Frontiers - CF '05. Ischia, Italy: ACM Press: 28. doi:10.1145/1062261.1062268. ISBN 978-1-59593-019-4.
- ^ Callahan, David; Smith, Burton (May 1990). "A future-based parallel language for a general-purpose highly-parallel computer". Selected Papers of the Second Workshop on Languages and Compilers for Parallel Computing. 1990: 95–113 – via ACM.
- ^ a b "Major System Characteristics of the Tera Computer". 1997-01-26. Archived from the original on 1997-01-26. Retrieved 2020-06-13.
- ^ a b c d e f g h Howard, M.; Kopser, A. (1997). "Design of the Tera MTA integrated circuits". GaAs IC Symposium. IEEE Gallium Arsenide Integrated Circuit Symposium. 19th Annual Technical Digest 1997. Anaheim, CA, USA: IEEE: 14–17. doi:10.1109/GAAS.1997.628228. ISBN 978-0-7803-4083-1. S2CID 1878241.
- ^ a b c d "Cray MTA". 2001-02-16. Archived from the original on 2001-02-16. Retrieved 2020-05-18.
- ^ Alam, Sadaf R.; Barrett, Richard F.; McCurdy, Collin B.; Roth, Philip C.; Vetter, Jeffrey F. (2006). Characterizing Applications on the Cray MTA-2 Multithreading Architecture. Oak Ridge National Laboratory, Computer Science and Mathematics Division.
- ^ a b "Cray MTA-2". Archived from the original on 2001-02-16.
- ^ Cray XMT2 - uRiKA Overview (PDF). 2012. p. 14.
- ^ a b c d Cray XMT Brochure (PDF). Cray. 2005. Archived from the original (PDF) on December 24, 2012.
- ^ "Cray Inc. MTA-2 System Accepted by Naval Research Laboratory". 2002-10-28. Archived from the original on 2002-10-28. Retrieved 2020-06-13.
- ^ a b c Mizell, David; Maschhoff, Kristyn (2009). "Early experiences with large-scale Cray XMT systems". 2009 IEEE International Symposium on Parallel Distributed Processing: 1–9. doi:10.1109/IPDPS.2009.5161108. ISBN 978-1-4244-3751-1. S2CID 1964042.
- ^ a b Bokhari, Shahid H.; Bokhari, Saniyah S. (2013). "A comparison of the Cray XMT and XMT-2: CRAY XMT AND XMT-2". Concurrency and Computation: Practice and Experience. 25 (15): 2123–2139. doi:10.1002/cpe.2909.
- ^ Chin Jr., George; Marquez, Andres; Choudhury, Sutanay; Feo, John (2012-09-27). "Scalable Triadic Analysis of Large-Scale Graphs: Multi-Core vs. Multi-Processor vs. Multi-Threaded Shared Memory Architectures". arXiv:1209.6308 [cs].
- ^ Advanced Contingency Analysis: Analyzing power grid stability using the CRAY XMT (PDF). 2009. Archived from the original (PDF) on April 2, 2011.
- ^ "November 2010 | Graph 500". Retrieved 2020-05-18.
- ^ "June 2011 | Graph 500". Retrieved 2020-05-18.
- ^ "Assembly of CSCS Matterhorn".
- ^ Schoenemeyer, Thomas (May 13, 2011). "EUREKA ‐ a new data analysis facility at CSCS" (PDF). Archived from the original (PDF) on January 24, 2012.
- ^ Bokhari, Shahid H.; Çatalyürek, Ümit V.; Gurcan, Metin N. (2014-12-25). "Massively multithreaded maxflow for image segmentation on the Cray XMT-2". Concurrency and Computation: Practice and Experience. 26 (18): 2836–2855. doi:10.1002/cpe.3181. PMC 4295505. PMID 25598745.
- ^ "Cray XMT Supercomputer Arrives at the Noblis Center for Applied High Performance Computing". www.businesswire.com. 2011-12-15. Retrieved 2020-06-13.
- ^ "June 2012 | Graph 500". Retrieved 2020-05-18.
- ^ "The Pittsburgh Supercomputing Center Presents Sherlock, a YarcData uRiKa System for Unlocking the Secrets of Big Data". finance.yahoo.com. Retrieved 2020-06-15.
- ^ "Sherlock". 2012-11-30. Archived from the original on 2012-11-30. Retrieved 2020-06-15.
- ^ "News Release | Cray's YarcData Division Launches New Big Data Graph Appliance | Cray Investors: Press Releases". 2017-03-18. Archived from the original on 2017-03-18. Retrieved 2020-06-13.
- ^ "Tera's Two Processor MTA Benchmark Stats Released". HPCwire. 1998-06-26. Retrieved 2020-06-15.
- ^ "Cray CTO Connects The Dots On Future Interconnects". The Next Platform. 8 January 2016. Retrieved 2 May 2016.
Steve Scott: You can do it just great with a Xeon. We are not planning on doing another ThreadStorm processor. But it does take some software technology that comes out of the ThreadStorm legacy.
- ^ Nelson, Jacob; Holt, Brandon; Myers, Brandon; Briggs, Preston; Ceze, Luis; Kahan, Simon; Oskin, Mark (2014). Grappa: A Latency-Tolerant Runtime for Large-Scale Irregular Applications (PDF). Department of Computer Science and Engineering, University of Washington.
- ^ Maschhoff, Kristyn; Maltby, James; Vesse, Robert (2015). Porting the Urika-GD Graph Analytic Database to the XC30/40 Platform (PDF). Chicago, Illinois: 57th Cray User Group meeting, CUG 2015.
- ^ a b "Tera MTA (Multi-Threaded Architecture)". 1999.
- ^ a b "Microbenchmarking the Tera MTA" (PDF). 1999.
- ^ a b "Tera MTA: Beyond massive parallelism". 1997-01-26. Archived from the original on 1997-01-26. Retrieved 2020-06-13.
- ^ "SDSC Accepts Tera's Four-Processor MTA System". 2002-10-19. Archived from the original on 2002-10-19. Retrieved 2020-06-13.
- ^ Ungerer, Theo; Robič, Borut; Šilc, Jurij (2003). "A survey of processors with explicit multithreading". ACM Computing Surveys (CSUR). 35 (1): 29–63. doi:10.1145/641865.641867. ISSN 0360-0300. S2CID 11434501.
- ^ Vitesse FX Family (PDF). 1997. Archived from the original (PDF) on June 27, 1997.
- ^ "Tera is ready to deliver advanced parallel-computing microprocessor". 2000-08-16. Archived from the original on 2000-08-16. Retrieved 2020-06-13.
- ^ a b c E. Riedy, Jason; Vuduc, Rich (1999). Microbenchmarking the Tera MTA (PDF).
- ^ "Cray Inc. Receives $5.4 Million Cray MTA-2 Supercomputer Order from Logicon for Naval Research Laboratory". 2002-06-03. Archived from the original on 2002-06-03. Retrieved 2020-06-13.
- ^ a b Berry, Jonathan; Hendrickson, Bruce (2006). Graph Software Development and Performance on the MTA-2 and Eldorado (PDF). 48th Cray User Group meeting, CUG 2006: Sandia National Laboratory.
- ^ a b c d e f Padua, David, ed. (2011). Encyclopedia of Parallel Computing. Boston, MA: Springer US. pp. 453–457, 2033. doi:10.1007/978-0-387-09766-4. ISBN 978-0-387-09765-7.
- ^ a b c d e f g h Maltby, James (2012). Cray XMT Multithreaded Programming Model. "Using the next-generation Cray XMT (uRiKA) for Large Scale Data Analytics". Swiss National Supercomputing Centre.
- ^ a b c d Cray XMT™ System Overview (S-2466-201) (PDF). Cray. 2011. Archived from the original (PDF) on December 3, 2012.
- ^ a b c d Konecny, Petr (2011). Introducing the Cray XMT (PDF). Cray.
- ^ a b Programming the Cray XMT (PDF). Cray. 2012. p. 14.
- ^ a b c d e Kopser A, Vollrath D (May 2011). Overview of the Next Generation Cray XMT (PDF). 53rd Cray User Group meeting, CUG 2011. Fairbanks, Alaska. Retrieved February 14, 2015.
- ^ Carter, Larry; Feo, John; Snavely, Allan (2002). Performance and Programming Experience on the Tera MTA.
- ^ Snavely, A.; Carter, L.; Boisseau, J.; Majumdar, A.; Kang Su Gatlin; Mitchell, N.; Feo, J.; Koblenz, B. (1998). "Multi-processor Performance on the Tera MTA". Proceedings of the IEEE/ACM SC98 Conference. Orlando, FL, USA: IEEE: 4. doi:10.1109/SC.1998.10049. ISBN 978-0-8186-8707-5. S2CID 8258396.
- ^ Nieplocha J, Marquez A, Petrini F, Chavarria-Miranda D (2007). "Unconventional Architectures for High-Throughput Sciences" (PDF). SciDAC Review (5, Fall 2007). Pacific Northwest National Laboratory: 46–50. Archived from the original (PDF) on February 14, 2015. Retrieved February 14, 2015.
- ^ Cray uRiKA-GD Technical Specifications (PDF). Cray Inc. 2014. Archived from the original (PDF) on July 30, 2016.
- ^ uRiKA-GD Whitepaper (PDF). Cray Inc. 2011.
- ^ Feldman, Michael (2012-03-02). "Cray Parlays Supercomputing Technology Into Big Data Appliance". Datanami. Retrieved 2020-06-15.
- ^ Alverson, Gail; Briggs, Preston; Coatney, Susan; Kahan, Simon; Korry, Richard (1997). "Tera hardware-software cooperation". Proceedings of the 1997 ACM/IEEE Conference on Supercomputing (CDROM) - Supercomputing '97. San Jose, CA: ACM Press: 1–16. doi:10.1145/509593.509631. ISBN 978-0-89791-985-2. S2CID 14843985.
- ^ Anderson, Wendell; Briggs, Preston; Hellberg, C. Stephen; Hess, Daryl W.; Khokhlov, Alexei; Lanzagorta, Marco; Rosenberg, Robert (2003). "Early Experience with Scientific Programs on the Cray MTA-2". Proceedings of the 2003 ACM/IEEE Conference on Supercomputing - SC '03. ACM Press: 46. doi:10.1145/1048935.1050196. ISBN 978-1-58113-695-1. S2CID 2971164.
- ^ Norton, A.; Melton, E. (1987). "A class of Boolean linear transformations for conflict-free power-of-two stride access". Proceedings of the International Conference on Parallel Processing. St. Charles, IL.
- ^ Bokhari, S.H.; Sauer, J.R. (2003). "Sequence alignment on the Cray MTA-2". Proceedings International Parallel and Distributed Processing Symposium. Nice, France: IEEE Comput. Soc: 8. doi:10.1109/IPDPS.2003.1213285. ISBN 978-0-7695-1926-5. S2CID 10597645.
- ^ booth (2014-10-06). "CRAY T90 vs. Tera MTA: The Old Champ Faces a New Challenger". SlideServe. Retrieved 2020-05-17.
- ^ "CUG Proceedings". cug.org. Retrieved 2020-05-17.
- ^ https://dl.acm.org/doi/fullHtml/10.5555/509058.509062
- ^ "Exploiting heterogeneous parallelism on a multithreaded multiprocessor". Proceedings of the 6th International Conference on Supercomputing. doi:10.1145/143369.143408.
- ^ Bader, D.A.; Madduri, K. (2006). "Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2". 2006 International Conference on Parallel Processing (ICPP'06). Columbus, OH, USA: IEEE: 523–530. doi:10.1109/ICPP.2006.34. ISBN 978-0-7695-2636-2. S2CID 8452392.
- ^ Bokhari, S.H.; Glaser, M.A.; Jordan, H.F.; Lansac, Y.; Sauer, J.R.; Van Zeghbroeck, B. (2002). "Parallelizing a DNA simulation code for the Cray MTA-2". Proceedings. IEEE Computer Society Bioinformatics Conference. Stanford, CA, USA: IEEE Comput. Soc: 291–302. doi:10.1109/CSB.2002.1039351. ISBN 978-0-7695-1653-0. PMID 15838145. S2CID 8777712.
- ^ McIlvaine, Bill (2000-02-03). "Tera Computer buys Cray from SGI, readies CMOS processors". EETimes.
- ^ Asanovic, Krste (2000). Multithreading and the Tera MTA (PDF).
- ^ "Introducing the Cray XMT Supercomputer" (PDF). Archived from the original (PDF) on April 2, 2011. Retrieved 2020-06-13.
- ^ Kahan, Simon (2009). Competition => Collaboration: a runtime-centric view of parallel computation (PDF).
- ^ Mellor-Crummey, John (2019). Fine-grain Multithreading: Sun Niagara, Cray MTA-2, Cray Threadstorm & Oracle T5 (PDF). Rice University.
- ^ Snavely, A.; Carter, L.; Boisseau, J.; Majumdar, A.; Kang Su Gatlin; Mitchell, N.; Feo, J.; Koblenz, B. (1998). "Multi-processor Performance on the Tera MTA". Proceedings of the IEEE/ACM SC98 Conference. Orlando, FL, USA: IEEE: 4. doi:10.1109/SC.1998.10049. ISBN 978-0-8186-8707-5. S2CID 8258396.
- ^ uRiKA Product Brief (PDF). YarcData. 2014.