Assignment 1

Develop a program that takes a binary as input and determines the longest dependency chain/dataflow path through the binary. Assuming every instruction has a delay of 1 cycle before it can forward its output (via registers or memory) to subsequent instructions, how many cycles does it take to execute every instruction in a binary? You don't have to model any real architectural bottlenecks, and can assume perfect caching and branch prediction.

Starter Files

For Reference
Guidelines
  • Assume perfect address/branch prediction.
  • Assume no cache miss penalty/stalls.
  • Assume infinite Load/Store queue, i.e. ignore memory consistency(ordering of memory operations across different addresses).
  • Binaries used for testing will use one syscall - exit(). For simplicity, assume exit depends on all other instructions.
  • Programs will be relatively short, e.g. 50K instructions, so don't worry much about optimizing.
  • The “YOUR CODE HERE” markings are suggestions. Feel free to modify any of the existing code.

Installation

Download the newest version of pin at: http://www.pintool.org/downloads.html

export PIN=<PIN_DIR_PATH>

Build the pintool with the included makefile

make

If it complains about not being able to attach to process, it's likely because the OS disables inter-process attachment. You can try:

echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope

You'll need the libdwarf.so packaged with pin in your LD_LIBRARY_PATH:

export LD_LIBRARY_PATH=$PIN/intel64/runtime

Instrumenting a program

Use the included Makefile to compile sample programs. Take a look at the multiple 'hello world' programs. How many instructions do you think they'll use?

To instrument a binary with the pintool:

$PIN/intel64/bin/pinbin -t ./dataflow.so -- <PROGRAM>

or the provided run script does the same:

./run.sh <PROGRAM>

Samples

hello64.s input
.section .data
helloWorld:
        .ascii "Hello, World!", "\n"
 
.section .text
.globl _start
_start:
 
        mov $0x2000004, %rax
        mov $1, %rdi
        mov $helloWorld, %rsi
        mov $14, %rdx
        syscall
 
        mov $60, %rax
        mov $0, %rdi
        syscall
Pintool output
> ./run.sh sampleprograms/hello64
					 writes to register	19	gr	gr64
0x4000b0            mov rax, 0x2000004
					 writes to register	12	gr	gr64
0x4000b7            mov rdi, 0x1
					 writes to register	13	gr	gr64
0x4000be            mov rsi, 0x6000e0
					 writes to register	17	gr	gr64
0x4000c5            mov rdx, 0xe
					 writes to register	35
					 writes to register	34
0x4000cc            syscall 
					 writes to register	19	gr	gr64
0x4000ce            mov rax, 0x3c
					 writes to register	12	gr	gr64
0x4000d5            mov rdi, 0x0
					 writes to register	35
					 writes to register	34
0x4000dc            syscall 
Application stopped. Shutting down.
loop-slow.s input

This application loops over a string and in each iteration modifies a single character. The loop isn't unrolled.

.section .data
helloWorld:
	.ascii "Hello, World!!"
 
.section .text
.globl _start
_start:
 
	movl $helloWorld, %edi
	add $14, %edi
loop:
	movl -1(%edi), %eax
	add $1, %eax
	movl %eax, -1(%edi)
	sub $1, %edi
	cmp $helloWorld, %edi
	jne loop
 
print:
## Uncomment for debugging
#	movl $4, %eax
#	movl $1, %ebx
#	movl $helloWorld, %ecx
#	movl $14, %edx
#	int $0x80
 
	movl $1, %eax
	movl $0, %ebx
	int $0x80

Your code should output 19. Why is this?

  • The first 2 instructions take 2 cycles
  • The loop body has several operations that depend on the subtract from the previous iteration:
    • The N(number of iterations) subtracts cyclically feed one another, creating a dependency chain of N steps
      • The last subtract has a dependency chain of two - the compare and not-taken conditional jump
      • The second to last subtract has a dependency chain of three - the move→add→move chain
  • The last two moves - outside the loop - can be issued early, and are thus off the critical path.
  • The final syscall (int $0×80) is dependent on all instructions, and thus the loop dataflow.

This means the dataflow depth is 2 + N-1(subtracts) + 3(mov→add→mov) + 1(syscall) = 2 + 13 + 3 + 1 = 19. This is the number your program should print for this loop program as input. You could also get this as 2 + N(subtracts) + 2(cmp→branch) + 1(syscall) since the sub(N-1)→mov→add→move is in parallel with sub(N-1)→sub(N)→cmp→jmp.

loop.s input

This input has a loop which has been unrolled once, and iterates 6 times. Each iteration is completely independent and uses only 3 instructions. How many cycles does this take?

.section .data
helloWorld:
	.ascii "Hello, World!!"
 
.section .text
.globl _start
_start:
 
	movl $helloWorld, %edi
	movl $13, %edx
loop:
	movl 0(%edi,%edx,1), %eax
	add $1, %eax
	movl %eax, 0(%edi,%edx,1)
	movl -1(%edi,%edx,1), %eax
	add $1, %eax
	movl %eax, -1(%edi,%edx,1)
 
	sub $2, %edx
	jns loop
 
print:
## Uncomment for debugging
#	movl $4, %eax
#	movl $1, %ebx
#	movl $helloWorld, %ecx
#	movl $14, %edx
#	int $0x80
 
	movl $1, %eax
	movl $0, %ebx
	int $0x80

Personal Tools