SUIF Overview Data Analysis - "basic" data dependence test - formulates an integer programming problem - analyzes arrays and scalars for privatization Loop nests - loop nests are critical - loops may be nested across function boundaries - inlining is possible, but inlining large blocks of code scares compilers Interprocedural Analysis - analyze functions for a set of side effects - clones functions based on call site context - only clone when necessary (based on context) Memory Problems - communication: sharing with coherent caches - capacity: large working sets - associativity: contention for a few cache lines - line size: false sharing Prefetching - hide latency of cache misses - SUIF will insert prefetching code Data Reuse - affine partitioning - focus on maximizing data reuse - reduce communication and working set size Locality - loop blocking - tries to group data into contiguous segments - improve spatial locality and reduce false sharing - uses data permutation and strip-mining to achieve its goals Data Transformations OS hints - SUIF tries to direct the OS's page allocation - knows about each processor's working set and access patterns - tries to make data physically contiguous SUIF 1 Passes - Constant Propagation (`porky -const-prop'). - Scalarize Array Accesses (`porky -scalarize'). - Forward Propagation (`porky -forward-prop'). - Normalization (`predep -normalize'). - Induction Variable Detection (`porky -ivar', `porky -know-bounds'). - Constant Folding (`porky -fold'). - Scalar Privatization Analysis (`moo -Psce'). - Reduction Recognition (`reduction'). - Dependence Preprocessing (`predep -presc'). - Parallelism and Locality Optimizer (`skweel -T'). - Parallel Code Generator (`pgen'). SUIF 1 Capabilities - preprocessing handles common nonlinear accesses - loop transformations: interchange, reversal, skewing and tiling - handles imperfect loop nests Results - use two benchmark suites - SPECfp95 - NAS SPECfp95 NAS Breakdown Baseline: - intraprocedural analysis - scalar optimizations Coarse grain: - interprocedural analysis - array privatization Memory - data partitioning and permutation Breakdown Performance - 13 of 18 programs show noticeable speedups - of 5 failing cases 3 are attributed to fine grained parallelism - only 2 are not parallelizable SUIF vs KAP - SUIF has performance results for SUIF1 vs KAP for {\tt Perfect} and {\tt NAS} benchmarks SUIF vs KAP Perfect \begin{verbatim} Perfect Parallelized Loops KAP loops Club KAP SUIF By KAP SUIF found total total Both only only by SUIF ------------------------------------------------------ ADM (APS) 141 134 132 9 2 94 % SPICE (CSS) 8 30 8 0 22 100 % QCD (LGS) 85 85 84 1 1 99 % MDG (LWS) 29 26 24 5 2 83 % TRACK (MTS) 48 48 48 0 0 100 % BDNA (NAS) 103 105 94 9 11 95 % OCEAN (OCS) 65 60 59 6 1 91 % DYFESM (SDS) 90 93 86 4 7 100 % MG3D (SMS) 38 33 33 5 0 87 % ARC2D (SRS) 136 136 132 4 4 100 % FLO52 (TFS) 64 64 64 0 0 100 % TRFD (TIS) 22 25 22 0 3 100 % SPEC77 (WSS) 202 195 193 9 2 96 % ------------------------------------------------------ Total 1031 1034 979 52 55 96 % Average 79.3 79.5 75.3 4.0 4.2 96 % \end{verbatim} SUIF vs KAP NAS \begin{verbatim} NAS Parallelized Loops KAP loops Benchmarks KAP SUIF By KAP SUIF found total total Both only only by SUIF ------------------------------------------------------ appbt 92 90 89 3 1 97 % applu 67 72 62 5 10 93 % appsp 80 87 74 6 13 93 % buk 2 4 2 0 2 100 % cgm 14 16 14 0 2 100 % embar 1 5 1 0 4 100 % fftpde 13 14 10 3 4 77 % mgrid 12 15 11 1 4 92 % ------------------------------------------------------ Total 281 303 263 18 41 94 % Average 35.1 37.9 32.9 2.2 5.1 94 % \end{verbatim} Polaris Analysis - Equality test - GCD test - Range test (nonlinear access) - Inlining and Constant Propagation - inlining is guided by heuristics - alternatively values from call sites can be propagated through the called function - functions are cloned for specific values of the arguments Privatization - scalar privatization - array privatization - similar idea to scalar case - if all accesses to elements within an iteration write first, then read the array can be privatized Reduction - can identify reductions of the form: \begin{verbatim} DO i=1,n sum = sum + A(B(i)) \end{verbatim} - selects between a locking strategy and dividing up the reduction across every processor Induction Variables - variables which can be derived from the loop iteration \begin{verbatim} j=0 DO i=1,n j = j+2 u(j) = j \end{verbatim} - analysis of the loop requires j to be in terms of the loop index - Polaris expands to handle the case where j is a function of more than one index Results - uses benchmarks from three sources - SPECfp95 - Perfect - NCSA programs PFA vs Polaris Polaris Components Improvements - remove unnecessary analysis - dependence and privatization could be determined at runtime Polaris Philosophy - portability is important - need to have compilers which make extensive analysis - example: HPF's data directives require knowledge of the system Polaris Philosophy - need to test with real programs - use analysis to direct compiler efforts Complete Automation - both compilers are fully automated - HPF was useful and successful - some recent projects suggest that full automation may not be necessary or desirable Source to Source vs Complete Compilation - SUIF generates executable code - Polaris mentions providing information through markup - both note that extensive information is critical at several stages Where They Are Now: SUIF - SUIF2 was developed - SUIF is generic enough to support research other than parallelism - some recent work (1999,2000) looked into interactive parallelism - Parascope was ahead of its time? Where They Are Now: Polaris - on a page labelled {\tt polaris-old.html} - Polaris research group launched - group is looking at OpenMP in addition to some other automated compilers - some later work seemed to be looking at runtime analysis Big Picture - Polaris claims to deal with real programs - optimizations seem very specific - SUIF covers more traditional areas: loop and data optimizations - many of these optimizations have proven themselves through HPF