eCommons@Cornell

Please use this identifier to cite or link to this item: http://hdl.handle.net/1813/6078
Title: COMPILING FOR NUMA PARALLEL MACHINES
Authors: Li, Wei
Keywords: computer science
technical report
Issue Date: Nov-1994
Publisher: Cornell University
Citation: http://techreports.library.cornell.edu:8081/Dienst/UI/1.0/Display/cul.cs/TR94-1469
Abstract: A common feature of many scalable parallel machines is non-uniform memory access (NUMA): access to local memory is much faster than access to non-local memory. In addition, when a number of remote accesses must be made, it is usually more efficient to use block transfers of data than many small messages. Almost every modern processor is designed with a memory hierarchy organized into several levels, each smaller and faster than the level below. In general, the effective use of parallel machines requires careful attention to the following issues: (1) exposing and exploiting parallelism; (2) accessing local memory instead of remote memory; (3) using block transfers for remote accesses; (4) reusing data in the cache; and (5) load balancing. We have built a system called Pnuma for programming NUMA machines. We make the following contributions. First, we propose a parallelization scheme for both parallelism and data locality. Second, we develop a framework based on non-singular matrices and integer lattice theory for the systematic development of loop transformations. Program transformations, such as loop restructuring, are critical to achieving high performance. The framework can be used in parallelizing compilers for both coarse-grain and fine-grain parallel architectures. We have implemented a loop restructuring tool-kit called Lambda based on this framework. Third, using this loop transformation framework, we develop algorithms for improving memory locality. The memory locality algorithm restructures loop nests to expose opportunities for parallel execution and for block transfers, while keeping data accesses local wherever possible. Fourth, for cache locality, we introduce a new, simple cache model based on reuse distances, which is more precise than the existing reuse vector space model. We develop a new loop transformation technique that optimizes directly on reuse distances, so that no exhaustive search is necessary. Fifth, we use our loop transformation framework to improve parallelism as well. We develop a unified algorithm for parallelism, memory locality and cache locality. System evaluations have been conducted on a multiprocessor machine without cache (BBN GP1000), a uniprocessor workstation with cache (HP 9000/720) and a multiprocessor machine with caches (KSR1), using programs from linear algebra, the NASA benchmarks and the SIMPLE hydrodynamics benchmark.
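Illustration (not part of the report): the abstract describes loop transformations as non-singular integer matrices. The Python sketch below shows one common way such a matrix can be checked for legality. It assumes the textbook test, namely that every dependence distance vector must remain lexicographically positive after the transformation. Whether Lambda uses exactly this test is an assumption. The names det, mat_vec, mat_mul, lex_positive, is_legal and the example matrices are hypothetical and do not come from the report.

```python
# Hypothetical sketch: loop transformations on a loop nest as integer matrices.
# Assumption: a transformation T is legal when det(T) != 0 and every
# dependence distance vector d stays lexicographically positive as T*d.

def det(m):
    """Integer determinant by cofactor expansion (adequate for small nests)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, a in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(minor)
    return total

def mat_vec(t, v):
    """Apply matrix t to the iteration or distance vector v."""
    return tuple(sum(a * x for a, x in zip(row, v)) for row in t)

def mat_mul(a, b):
    """Compose two transformations: (a * b) applies b first, then a."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def lex_positive(v):
    """True when the first non-zero component of v is positive."""
    for x in v:
        if x != 0:
            return x > 0
    return False

def is_legal(t, deps):
    """Non-singular and order-preserving for every dependence in deps."""
    return det(t) != 0 and all(lex_positive(mat_vec(t, d)) for d in deps)

# Example matrices for a doubly nested loop (illustrative only).
INTERCHANGE = [[0, 1], [1, 0]]   # swap the two loops, det = -1
SKEW = [[1, 0], [1, 1]]          # skew the inner loop by the outer, det = 1
SCALE = [[2, 0], [0, 1]]         # non-unimodular but non-singular, det = 2

# One dependence with distance (1, -1), e.g. a[i][j] reads a[i-1][j+1].
DEPS = [(1, -1)]
```

With the dependence (1, -1), is_legal(INTERCHANGE, DEPS) is False because the distance becomes (-1, 1), while is_legal(SKEW, DEPS) is True because it becomes (1, 0). The product mat_mul(INTERCHANGE, SKEW), which skews first and then interchanges, is also legal. SCALE is accepted as well, although its determinant is 2; generating loop bounds and steps for such non-unimodular matrices is where lattice theory would be needed, and this sketch does not attempt that step.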
URI: http://hdl.handle.net/1813/6078
Appears in Collections: Computer Science Technical Reports

Files in This Item:

File          Description   Size        Format
94-1469.pdf                 967.24 kB   Adobe PDF
94-1469.ps                  633.62 kB   Postscript

Items in eCommons are protected by copyright, with all rights reserved, unless otherwise indicated.

 

© 2014 Cornell University Library