A Prototyping Standard Cell Placer for 2.5D Integration
Yangdong Deng










1. Motivation and Assumption

The density and size of VLSI chips are increasing rapidly along the Moore's Law. State-of-the-art VLSI technology concentrates on the single-chip approach. Design and fabrication of system-on-a-chip (SOC) has become the focus of academic research and industrial practicing. However, meanwhile the cost of silicon is consuming a quickly increasing portion of the available profit margin. For instance, when long wires in a system-on-a-chip leads to excess delay, expensive copper wire has to be used to implement intra-chip interconnection. In fact, only part of circuit need such fast interconnection but all wires on the chip have to made in copper. In other words, in the monolithic approach, not all components can be fabricated in optimum technology ("optimum technology" means the cheapest process satisfying the performance requirements). It may lead to great lose in profit. For example, SRAM fabricated in an SRAM process may be two times denser and two times faster than fabricated in a standard logic process. However, in SOC both SRAM and other standard logic have to be fabricated in the same process. It is highly possible that system-on-a-chip may become the semiconductor dinosaur! Under such situation, new integration strategy should be developed. Wojciech P. Maly proposed a 2.5D integration scheme [1]. In this approach, a VLSI system is partitioned into several parts and each part is fabricated as a single die with an optimum technology. Then these dies are stacked together to form the whole system. The vertical interconnecting technology may be flip chip bonding or more advanced junction techniques.

To explore the applicability of 2.5D integration, I implemented a 2.5D standard cell placer based on simulated annealing algorithm. A conventional 2D placer using the same algorithm was developed as well. Thus we can make further estimation and comparison. When constructing the placer, in the most simplified version, we assume the vertical interconnection can be placed at any place on the chip. In the refined version, we deal with the vertical interconnection as cells. The interconnection cells have comparable size with standard cells and are placed in the standard cell row. Because the real characteristic of the interconnection, now we are interested in evaluating the geometric feature of 2.5D placement comparing with 2D placement.

Both placers are based on simulated annealing algorithm. The input includes aspect ratio, netlist and I/O pad location (the placer does not place pads!). After reading the netlist, the placer generates an initial placement. Then perturbations are applied to the placement. A simulated annealing engine controls the placing flow. When the annealing process terminates, there may be overlaps in the placement. Hence, the placer will do compaction/expansion. The final output specifies the exact location of each module in the netlist. For simplicity, the 2.5D placer places cells into two planes (each plane can be considered as a single die) instead of multiple planes. After placing, the resultant placement is excised by a routing area estimation program to calculate the final chip area. The detail of this program can be found at [9].

2. The Annealing Scheduling

Simulated annealing is probably the most well known method available for module placement. It is very time consuming but generates excellent results. There are already plenty of papers and books about this topic [2][3][4][5]. The general algorithm is illustrated in figure 1. Details of the algorithm applied to VLSI cell placement are given in Shahookar and Mazumder[6].


Figure 1. simulated annealing algorithm

2.1 Initial Temperature

The initial temperature must be hot enough so that the initial accept ratio is closed to 1. Generally speaking, a system is hot renough if :

sigma << T                                (1)
where sigma is the standard deviation of the cost distribution. Hence, to determine the initial temperature, the configuration space has to be explored. In my approach, all the newly generated states are accepted, i.e. the temperature is assumed to be infinite. Then the standard deviation is calculated and the initial temperature is derived as:
Tinitial = k × sigma                          (2)
A typical value is 20.

2.2 Generation of New Configuration

Two methods are used to generate new configurations from the current configuration. Either a cell is selected randomly and displaced to a random location on the two placing planes, or two cells are chosen randomly and swapped. It is reported by Sechen and Sangiovanni-Vincentelli [7] and verified by my experiments that the performance of the algorithm depends on r, the ration of displacements to swaps. Sechen and Sangiovanni-Vincentelli indicate that the algorithm performs best when 3 <= r <=8. In my program, the r is 4.

At each temperature of annealing, a certain number of moves per cell are attempted. This number is refereed to as the inner loop criterion. The larger the number of moves per cell, the smaller the final wire length is. However, the reduction speed diminishes and CPU time increases as the number of moves grows. So a trade-off must be made. A curve of recommended number of moves versus the number of cells is given by Sechen and Sangiovanni-Vincentelli [7]. For cell number between 500~3000, the curve can be approximated as:

Nm = 100 ×1000.003 Nc where Nm is number of moves per cell and Nc is number of cells (3)
Generally, Nm should be larger than 300. In the TimberWolf package, the number of moves generated at each temperature is determined according to Table 1[8]. . To reduce CPU time, I use 30 moves per cell in the "faster" version and Table 1 in the "slower" version.
No. of cells
Moves per cell
<= 200
100
<=500
200
<=1000
300
<= 1500
400
<= 2000
500
<= 2500
600
<= 3000
700
<= 3500
800
<= 4000
900
<= 4500
1000
> 4500
1200
Table 1. TimberWolf optimal number of moves per cell

2.3 Cooling Schedule

The cooling schedule is very important to the quality of result. The simplest approach is:

More advanced cooling schedules are proposed [6][7]. According to the results of experiments, I use the following formula:
deviation of the cost at this temperature

The ration Ti+1/ Ti is not allowed to be lower than a certain bound (typically 0.5) in order to prevent drastic reduction in temperature. If such a case happened, let Ti+1 = 0.9 Ti.

2.4 Termination Criterion

In my program, after each iteration, we basically just look at the last 2 temperatures, and see if the cost is not changing much and if the we have done enough temperatures and if the accept ratio fraction is small enough.

3. The Implementation of the Algorithm

3.1 Data Structure

The placing planes are represented as a 3-dimensional array. Each element is a structure representing a grid on one plane. So x and y position, overlapping numbers as well as overlap penalty is stored in the structure. The modules and nets are also stored in arrayed structure. And they are cross-referenced.

Data structure for grids and nets also have fields for temporary wire length and penalty. At each temperature, a certain number of new configurations will be generated. When a new configuration is generated, the temporary wire length and penalty is recorded but the new configuration is not accepted yet. Only when the Accept routine determines to accept is does the data structure is updated.

3.2 Reading Input

The placer reads input from a file. The first line of netlist is the ratio of width to height of the chip, i.e., if the two numbers of first line is W and L, then the aspect ration is W/L (W and L are not exact width and height, only the ration is meaningful). Following the aspect ration is the pad description, which has the format as:

index p orientation (N, S, E, W) location netindex

where orientation maybe 0 or N(north), 1 or S(south), 2 or E(east), and 3 or W(west), location is the horizontal distance from (0,0) (if orientation is N or S) or the vertical distance from (0,0) (if orientation is E or W). The remains is the netlist of standard cells. Each line specifies a module and the format is:

index g width number_of_nets [net1, net2, ...]

The input file is processed by a simple parsing routine. Modules are stored in a static array structure. The information of nets is also extracted and recorded. Finally, when all the cells have been read, total width of them are computed. The result is used with aspect ration to calculate the actual width and height of the initial placement. The pads are simply placed on the edges as specified.

3.3 Cost Function

The choice of cost function is crucial to the success of the algorithm. During the experiments, it is found that the final chip area is strongly dependent on the total wire length. Hence, the overall goal is to minimize the wire length. Other factors must be considered as well to avoid fault minimization. For instance, if overlapping of cells are allowed, there should be some mechanism to constrain the overlapping extent. Otherwise, after expansion, i.e. overlapping is removed, wire length can be several times longer.

In my program, the cost function is the sum of four components: the wire length cost, C1, the module overlap penalty, C2, the vertical interconnection cost C3 and row width penalty. C1 is found by creating bounding box for each net. Then the half perimeter is considered as the wire length. The computation of wire length is illustrated in figure 2. By adding all wire length together, we get C1. The formula for C1 is the following:

C1 = SUMfor all nets(Wx×X + Wy×Y)      (4)

Where Wx and Wy are x-axis and y-axis weights, respectively. To be more realistic, the weights for X-axis and Y-axis length should be different. For Instance, usually the height of an inverter is three to four times of its width.


Figure 2. semi-perimeter wire length = X + Y

During the annealing process, after each displacement and swap, overlap between cells may occur. To speed up the annealing, the overlap is not rejected at once. Instead, cost function is computed and the cost is passed to the accepting condition as in the case of no overlap. However, penalty must be added so that overlap will be reduced to the minimum. Because each cell is located on an integer number of grids, each grid may be occupied by several cells when overlap occurs. The overlap penalty for a grid is defined as the square of the sum of No, number of overlapping, and Poffset, a constant offset. No is the number of cells occupying the grid minus 1. For example, when a grid is occupied by three cells simultaneously, No is 2. The introduction of Poffset guides the overlap to be reduced to 0. Experiment shows that Poffset should be 1 or 2. The formula for C2 is:

C2 = SUMfor all grids( No + Poffset)2     (5)

C3 is to model the cost of the "2.5D connector", vertical interconnection. If a net connects cells located in different planes, a vertical cost is applied to it. Because the characteristic of interconnection technology is not clear, it is difficult to select this cost. The main function of this component is to adjust the number of interconnection.

C4 is to model the penalty of the unequalization of standard cell rows. To reduce the waste of area, each row of standard cells should have similar width. C4 is calculated as:

C4 = Ww × SUMfor all rows( Wi - Wdesired)     (6)
where Ww is weight for row unequalization (5 in my program). Wi is the width of row i. Wdesired is the desired row width.

The cost is computed incrementally in the annealing flow. After each move, only the nets and grids that are changed are taken into account. At each temperature, the mean cost and standard deviation of the moves are calculated.

3.4 Initial Placement

The placer starts up the annealing by setting up a dumb random placement. Each cell is placed according to its order in the netlist. No overlap is allowed at this time. The placement is on two grid planes. Then length of each wire is computed. Evidently, the initial overlap penalty is 0.

3.5 Generating New Configuration

As mentioned earlier, new configuration may be generated by displacing a cell or swapping two cells. At the beginning of each move (a new configuration is generated by a move), a random number between 1 and Nc, the number of cells is computed. Then we generate the second random number, which is uniformly distributed between 1 and Nupper_bound(Nupper_bound is larger than Nc.). If new number is larger than Nc, this move is considered as a displacing and the position of new location is generated randomly. Otherwise, the new number is index of another cell. Then the two cells are swapped. To make the ratio of displacement to swap is 4, Nupper_bound is set to 4Nc.

In the process of annealing, when temperature decreases, a limiting window is used to increase the chance of local optimization and reduce CPU time. In the case of displacement, the new location must be in the limiting window. While in the case of swapping, if the second cell is out of the limiting window, it is discarded. The limiting window is shrunken by 0.5% at each new temperature.

3.6 Insertion of Interconnection Cells

In my implementation, the inter-plane connector is dealt with as a standard cell. When a net lies in two planes, two interconnectors are needed to be placed in both planes. The two interconnectors must be aligned and the location is the geometric center (in standard cell rows) of the net. There are two possible ways to process interconnectors. One way is to place the cell right after each move. After a swap or a displacement, if a net is found crossing two planes,

3.7 Final Expansion

When the annealing is over, there may exist overlap in the placement. A simple expansion routine is applied to remove all overlaps. Cells in each row are sorted by its starting x-position. Then they are placed according to their sorted order. Note that the horizontal size of the placement may be larger than the original. In this case the pads on the eastern side must be re-placed.

3.8 Complexity

Due to the nature of the simulated annealing algorithm, the placers run raltively slow. Now I make a simple analysis on the algorithm complexity. Suppose the netlist has n cells. Note that the complexity is also related to the initial temperature, cooling schedule and moves per cell. Here I assume the annealing starts at 50o and 50 moves is applied to each cell on the average. Hence, at each temperature 50n moves is needed. According to my experiments, for an 800-cell netlist, perturbations will be introduced at about 80 different temperatures. So totally 3 *106 new configurations are attempted.

4. Experimental Results

4.1 Generating Placement for an 800-Gate circuit

An 800-gate benchmark circuit is excised by the placers to generate placement. The curve of cost versus temperature is showed in figure 3. Several intermediate 2D placements are illustrated in the appendix. The figures are ordered by the generated time but may not be consecutive to one another. In these figures, black rectangles represent cells. The red squares mean overlap on the grid. Green wires are the nets.


Figure 3. curve of cost versus temperature during the annealing process

4.2 Wire Length Distribution


Figure 4. wire length distribution of an 800-gate design (similar weights for X and Y)

Figure 5. wire length distribution of an 800-gate design (different weights for X and Y)

Figure 6. wire length distribution of an 800-gate design with different weight
for 2.5D "vertical" interconnection (different weights for X and Y)

Figure 7. reduction of wire length in 2.5D placement

The first application of the placer is to analyze the difference of wire length distribution between 2D and 2.5D placements. The wire length distribution for an 800-gate design is showed in figure 4. Note in figure 4, the weights for x-axis and y-axis distance are the same. If the height is taken into account (4 in this case), the length distribution will be like figure 5. When computing the wire length of the 2.5D placement, vertical weight has been subtracted. We can further examine the wire length distribution by change the weight of "vertical" interconnection, which is illustrated in figure 6. It can be found that when the cost of a "vertical contact" is within 15 times of a unit-length wire, considerable wire length reduction is expected. From a qualitative analysis, in 2.5D scheme, long wires can be reduced because of the usage of "vertical wire". This is illustrated in figure 7. In a big design that has many long lines, the advantage may become evident. In smaller design, we cannot be too optimistic because the extra vertical weight may offset the reduction in the planar wire length. In my experiment, it can be noted that long wires are reduced indeed, although total wire length reduction is trivial.
 

Reference

[1] Recent CEDA presentation by Wojciech P. Maly.

[2] S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi, Optimization by Simulated Annealing, Science, May 13, 1983, Vol. 220, No. 4598, pp. 45-54.

[3] E. H. L. Aarts, et al., Simulated Annealing, Local Search in Combinatorial Optimization (Edited by E. Aarts and J. K. Lenstra), John Wiley & Sons Ltd., 1997, pp. 91-120.

[4] C. Sechen, Chip-planning, Placement, and Global Routing of Macro/Custom Cell Integrated Circuit Using Simulated Annealing, Proceedings of 25th Design Automation Conference, 1988, pp. 73-80.

[5] M. D. Huang, F. Romeo, and A. Sangiovanni-Vincentelli, An Efficient General Cooling Schedule for Simulated Annealing, Proceedings of IEEE International Conference on Computer-Aided Design, 1986, pp.381-384.

[6] K. Shahookar and P. Mazumder, VLSI Cell Placement Techniques, ACM Computing Survey, Vol. 23, No. 2, June 1991, pp. 143-220.

[7] C. Sechen and A. Sangiovanni-Vincentelli, Timberwolf3.2: A New Standard Cell Placement and Global Routing Package, Proceedings of the 23rd Design Automation Conference, 1986, pp. 432-439.

[8] Huang, M. D., Romeo, F., Sangiovanni-Vincentelli, A.,  Efficient General  Cooling Schedule for Simulated Annealinf, IEEE International Conference on Computer-Aided Design, ICCAD-86 - Digest of Technical Papers. 1986, pp. 381-384.

[9] Peng, L., http://www.ece.cmu.edu/~pli/.
 

Appendix


(a) initial placement

(b) intermediate placement1

(c) intermediate placement2

(d) intermediate placement3

 (f) intermediate placement5

(g) intermediate placement6

 (h) final placement