Develop a program that takes a binary as input and determines the longest dependency chain/dataflow path through the binary. Assuming every instruction has a delay of 1 cycle before it can forward its output (via registers or memory) to subsequent instructions, how many cycles does it take to execute every instruction in a binary? You don't have to model any real architectural bottlenecks, and can assume perfect caching and branch prediction.
- Assume perfect address/branch prediction.
- Assume no cache miss penalty/stalls.
- Assume infinite Load/Store queue, i.e. ignore memory consistency(ordering of memory operations across different addresses).
- Binaries used for testing will use one syscall - exit(). For simplicity, assume exit depends on all other instructions.
- Programs will be relatively short, e.g. 50K instructions, so don't worry much about optimizing.
- The “YOUR CODE HERE” markings are suggestions. Feel free to modify any of the existing code.
Download the newest version of pin at: http://www.pintool.org/downloads.html
Build the pintool with the included makefile
If it complains about not being able to attach to process, it's likely because the OS disables inter-process attachment. You can try:
echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
You'll need the libdwarf.so packaged with pin in your LD_LIBRARY_PATH:
Use the included Makefile to compile sample programs. Take a look at the multiple 'hello world' programs. How many instructions do you think they'll use?
To instrument a binary with the pintool:
$PIN/intel64/bin/pinbin -t ./dataflow.so -- <PROGRAM>
or the provided run script does the same:
.section .data helloWorld: .ascii "Hello, World!", "\n" .section .text .globl _start _start: mov $0x2000004, %rax mov $1, %rdi mov $helloWorld, %rsi mov $14, %rdx syscall mov $60, %rax mov $0, %rdi syscall
> ./run.sh sampleprograms/hello64 writes to register 19 gr gr64 0x4000b0 mov rax, 0x2000004 writes to register 12 gr gr64 0x4000b7 mov rdi, 0x1 writes to register 13 gr gr64 0x4000be mov rsi, 0x6000e0 writes to register 17 gr gr64 0x4000c5 mov rdx, 0xe writes to register 35 writes to register 34 0x4000cc syscall writes to register 19 gr gr64 0x4000ce mov rax, 0x3c writes to register 12 gr gr64 0x4000d5 mov rdi, 0x0 writes to register 35 writes to register 34 0x4000dc syscall Application stopped. Shutting down.
This application loops over a string and in each iteration modifies a single character. The loop isn't unrolled.
.section .data helloWorld: .ascii "Hello, World!!" .section .text .globl _start _start: movl $helloWorld, %edi add $14, %edi loop: movl -1(%edi), %eax add $1, %eax movl %eax, -1(%edi) sub $1, %edi cmp $helloWorld, %edi jne loop print: ## Uncomment for debugging # movl $4, %eax # movl $1, %ebx # movl $helloWorld, %ecx # movl $14, %edx # int $0x80 movl $1, %eax movl $0, %ebx int $0x80
Your code should output 19. Why is this?
- The first 2 instructions take 2 cycles
- The loop body has several operations that depend on the subtract from the previous iteration:
- The N(number of iterations) subtracts cyclically feed one another, creating a dependency chain of N steps
- The last subtract has a dependency chain of two - the compare and not-taken conditional jump
- The second to last subtract has a dependency chain of three - the move→add→move chain
- The last two moves - outside the loop - can be issued early, and are thus off the critical path.
- The final syscall (int $0×80) is dependent on all instructions, and thus the loop dataflow.
This means the dataflow depth is 2 + N-1(subtracts) + 3(mov→add→mov) + 1(syscall) = 2 + 13 + 3 + 1 = 19. This is the number your program should print for this loop program as input. You could also get this as 2 + N(subtracts) + 2(cmp→branch) + 1(syscall) since the sub(N-1)→mov→add→move is in parallel with sub(N-1)→sub(N)→cmp→jmp.
This input has a loop which has been unrolled once, and iterates 6 times. Each iteration is completely independent and uses only 3 instructions. How many cycles does this take?
.section .data helloWorld: .ascii "Hello, World!!" .section .text .globl _start _start: movl $helloWorld, %edi movl $13, %edx loop: movl 0(%edi,%edx,1), %eax add $1, %eax movl %eax, 0(%edi,%edx,1) movl -1(%edi,%edx,1), %eax add $1, %eax movl %eax, -1(%edi,%edx,1) sub $2, %edx jns loop print: ## Uncomment for debugging # movl $4, %eax # movl $1, %ebx # movl $helloWorld, %ecx # movl $14, %edx # int $0x80 movl $1, %eax movl $0, %ebx int $0x80