Using an extended Roofline Model to understand data and thread affinities on NUMA systems

Oscar G. Lorenzo, Tomás F. Pena, José C. Cabaleiro, Juan C. Pichel, Francisco F. Rivera


Today’s multicore microprocessors feature a diverse set of compute cores and on-board memory subsystems connected by complex communication networks and protocols. Analysing the factors that affect performance in such systems is far from easy. It is nevertheless clear that increasing data locality and affinity is one of the main challenges in reducing data access latency. As the number of cores increases, this issue has a growing influence on the performance of parallel codes, so models that characterise performance on such systems are in broad demand. This paper presents an extension of the well-known Roofline Model adapted to the main features of the memory hierarchy found in most current multicore systems. The Roofline Model was also extended to show the dynamic evolution of the execution of a given code. To reduce the overhead of gathering the information needed to build this dynamic Roofline Model, the hardware counters present in most current microprocessors are used. To illustrate its use, two simple parallel vector operations, SAXPY and SDOT, were considered. Different access strides and initial placements of the vectors in memory modules were used to show the influence of different locality and affinity scenarios. The effect of thread migration was also considered. We conclude that the proposed Roofline Model is a useful tool to understand and characterise the behaviour of parallel codes executing on multicore systems.
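For readers unfamiliar with the two kernels, the following is a minimal sketch of strided SAXPY and SDOT together with the bound given by the original Roofline Model (attainable performance is the minimum of peak compute and peak bandwidth times operational intensity). It is not the authors' implementation: the function names, the `stride` parameter, the byte counts per element, and the peak figures are illustrative assumptions, and the extended model described in the abstract is not reproduced here.

```python
# Illustrative sketch only; names and numbers below are assumptions,
# not taken from the paper.

def saxpy(a, x, y, stride=1):
    """Update y[i] = a * x[i] + y[i] in place for every stride-th index."""
    for i in range(0, len(x), stride):
        y[i] = a * x[i] + y[i]
    return y


def sdot(x, y, stride=1):
    """Return the sum of x[i] * y[i] over every stride-th index."""
    acc = 0.0
    for i in range(0, len(x), stride):
        acc += x[i] * y[i]
    return acc


def roofline_bound(peak_gflops, peak_bw_gbs, intensity):
    """Classic roofline: min(peak compute, peak bandwidth * intensity)."""
    return min(peak_gflops, peak_bw_gbs * intensity)


# Operational intensity in FLOP/byte, assuming 4-byte floats, unit stride
# and no cache reuse:
#   SAXPY: 2 FLOPs per element, 12 bytes moved (read x, read y, write y)
#   SDOT:  2 FLOPs per element,  8 bytes moved (read x, read y)
SAXPY_INTENSITY = 2 / 12
SDOT_INTENSITY = 2 / 8

# Hypothetical machine peaks, chosen only to make the example concrete.
PEAK_GFLOPS = 100.0
PEAK_BW_GBS = 40.0

saxpy_bound = roofline_bound(PEAK_GFLOPS, PEAK_BW_GBS, SAXPY_INTENSITY)
sdot_bound = roofline_bound(PEAK_GFLOPS, PEAK_BW_GBS, SDOT_INTENSITY)
```

With these assumed peaks both kernels fall under the bandwidth-limited part of the roof, which is why data placement and access stride matter so much for them. A larger stride touches the same number of cache lines for fewer useful elements, lowering the effective operational intensity.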




Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.