Abstract
This article describes the Digital Continuous Profiling Infrastructure, a sampling-based profiling system designed to run continuously on production systems. The system supports multiprocessors, works on unmodified executables, and collects profiles for entire systems, including user programs, shared libraries, and the operating system kernel. Samples are collected at a high rate (over 5200 samples/sec. per 333MHz processor), yet with low overhead (1–3% slowdown for most workloads). Analysis tools supplied with the profiling system use the sample data to produce a precise and accurate accounting, down to the level of pipeline stalls incurred by individual instructions, of where time is being spent. When instructions incur stalls, the tools identify possible reasons, such as cache misses, branch mispredictions, and functional unit contention. The fine-grained instruction-level analysis guides users and automated optimizers to the causes of performance problems and provides important insights for fixing them.
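To put the quoted sampling rate in perspective, the following minimal sketch works out the average number of cycles between samples from the two figures in the abstract, and shows the general idea of attributing time to instructions by counting samples per program counter. The function names and the sample addresses are hypothetical and chosen only for illustration; this is not the system's actual implementation.

```python
from collections import Counter


def cycles_per_sample(clock_hz: float, samples_per_sec: float) -> float:
    """Average number of processor cycles between two consecutive samples."""
    return clock_hz / samples_per_sec


def time_share_by_pc(sample_pcs):
    """Estimate each program counter's share of time as its fraction of samples.

    Assumption: samples are spread evenly over time, so an instruction's
    sample count is proportional to the time spent at that instruction.
    """
    counts = Counter(sample_pcs)
    total = sum(counts.values())
    return {pc: n / total for pc, n in counts.items()}


# Figures from the abstract: 333 MHz processor, 5200 samples/sec.
interval = cycles_per_sample(333e6, 5200)
print(f"about {interval:.0f} cycles between samples")

# Hypothetical sampled program counters, for illustration only.
shares = time_share_by_pc([0x1000, 0x1000, 0x1004, 0x1000])
for pc, share in sorted(shares.items()):
    print(f"{pc:#x}: {share:.0%} of samples")
```

Because the abstract states a rate of "over" 5200 samples/sec., the computed interval should be read as an upper bound on the average spacing between samples.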
Index Terms
- Continuous profiling: where have all the cycles gone?