eCommons

 

COMPILING FOR NUMA PARALLEL MACHINES

Abstract

A common feature of many scalable parallel machines is non-uniform memory access (NUMA): data access to local memory is much faster than access to non-local memories. In addition, when a number of remote accesses must be made, it is usually more efficient to use block transfers of data than to use many small messages. Almost every modern processor is also designed with a memory hierarchy organized into several levels, each smaller and faster than the level below. In general, the effective use of parallel machines requires careful attention to the following issues: (1) exposing and exploiting parallelism; (2) accessing local memory instead of remote memory; (3) using block transfers for remote accesses; (4) reusing data in the cache; and (5) load balancing. We have built a system called Pnuma for programming NUMA machines. We make the following contributions. First, we propose a parallelization scheme for both parallelism and data locality. Second, we develop a framework based on non-singular matrices and integer lattice theory for the systematic development of loop transformations. Program transformations, such as loop restructuring, are critical to achieving high performance. The framework can be used in parallelizing compilers for both coarse-grain and fine-grain parallel architectures. We have implemented a loop restructuring toolkit called Lambda based on this framework. Third, using this loop transformation framework, we develop algorithms for improving memory locality. The memory locality algorithm restructures loop nests to expose opportunities for parallel execution and for block transfers, while keeping data accesses local wherever possible. Fourth, for cache locality, we introduce a new, simple cache model based on reuse distances, which is more precise than the existing reuse vector space model. We develop a new loop transformation technique that optimizes directly on reuse distances, so that no exhaustive search is necessary. Fifth, we use our loop transformation framework to improve parallelism as well. We develop a unified algorithm for parallelism, memory locality and cache locality. System evaluations have been conducted on a multiprocessor machine without cache (BBN GP1000), a uniprocessor workstation with cache (HP 9000/720) and a multiprocessor machine with caches (KSR1), using programs from linear algebra, NASA benchmarks and the SIMPLE hydrodynamics benchmark.
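The matrix view of loop transformations mentioned in the abstract can be illustrated with a small sketch. This is not the interface of the Lambda toolkit, which the abstract does not describe. The function names (`mat_vec`, `lex_positive`, `is_legal`) and the sample dependence vector are assumptions made for illustration. The sketch relies on the standard legality condition for matrix-based loop transformations: a non-singular integer matrix T is a legal reordering of a loop nest only if it maps every dependence distance vector to a lexicographically positive vector.

```python
from typing import List, Sequence

Vector = Sequence[int]
Matrix = Sequence[Sequence[int]]


def mat_vec(t: Matrix, v: Vector) -> List[int]:
    """Multiply an integer matrix by an integer vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in t]


def lex_positive(v: Vector) -> bool:
    """Return True if the first non-zero entry of v is positive."""
    for x in v:
        if x > 0:
            return True
        if x < 0:
            return False
    return False  # the all-zero vector is not lexicographically positive


def is_legal(t: Matrix, deps: Sequence[Vector]) -> bool:
    """Check that t keeps every dependence distance lexicographically positive.

    Assumes t is already known to be a non-singular integer matrix.
    """
    return all(lex_positive(mat_vec(t, d)) for d in deps)


# Hypothetical two-deep loop nest with one dependence of distance (1, -1).
deps = [(1, -1)]

INTERCHANGE = [[0, 1], [1, 0]]  # swap the two loops
SKEW = [[1, 0], [1, 1]]         # skew the inner loop by the outer loop

print(is_legal(INTERCHANGE, deps))  # (1, -1) maps to (-1, 1): not legal
print(is_legal(SKEW, deps))         # (1, -1) maps to (1, 0): legal
```

In this example, interchanging the loops would reverse the dependence, whereas skewing preserves it and leaves the inner loop free of carried dependences.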

Date Issued

1994-11

Publisher

Cornell University

Keywords

computer science; technical report

Previously Published As

http://techreports.library.cornell.edu:8081/Dienst/UI/1.0/Display/cul.cs/TR94-1469

Types

technical report
