Parallel Architecture and Programming (7): Interconnection Networks

This lecture is about networks within a single chip (on-chip interconnects). We care about performance, energy efficiency, and scalability.

General rule: latency increases with load.

  • interconnects connect:
    • processor cores with other cores
    • processors and memories
    • processor cores and caches
    • caches and caches
    • I/O devices
  • How are hardware components interconnected?
  • What is the basic cost of communication?
  • How does data traverse the network?

1. terminology

  • network node: a network endpoint connected to a router/switch
  • network interface
    • connects nodes to the network
  • switch/router
    • connects a fixed number of input links to a fixed number of output links
  • link
    • a bundle of wires carrying a signal

2. design issues

  • topology: how switches are connected via links
  • routing: how a message gets from its source to its destination in the network
  • buffering and flow control
    • what data is stored in the network? packets, partial packets?
    • how does the network manage buffer space?

3. properties of interconnect topology

  • routing distance
    • diameter: the maximum routing distance between two nodes
    • average distance: average routing distance over all valid routes
  • direct vs. indirect networks
    • direct network: nodes generate traffic from inside the network, acting as both endpoints and switches
    • indirect network: nodes generate traffic only at the edge of the network; switching is handled by separate internal switches

  • bisection bandwidth: cut network in half, sum bandwidth of all severed links
  • blocking vs. non-blocking
    • non-blocking: the network is non-blocking if any one-to-one pairing of communicating nodes can be routed simultaneously without conflicts.
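
As a concrete illustration of these properties (a minimal sketch with assumed names, not part of the original notes), diameter, average distance, and bisection bandwidth can be computed for a small bidirectional ring:

```python
# Minimal sketch (assumed illustration): routing-distance and bisection-bandwidth
# properties of an n-node bidirectional ring.

def ring_distance(a, b, n):
    """Shortest hop count between nodes a and b on an n-node ring."""
    d = abs(a - b)
    return min(d, n - d)

def diameter(n):
    """Maximum routing distance between any two nodes (n // 2 for a ring)."""
    return max(ring_distance(a, b, n) for a in range(n) for b in range(n))

def average_distance(n):
    """Average routing distance over all pairs of distinct nodes."""
    pairs = [(a, b) for a in range(n) for b in range(n) if a != b]
    return sum(ring_distance(a, b, n) for a, b in pairs) / len(pairs)

def bisection_bandwidth(link_bw=1.0):
    """Cutting a ring in half severs exactly 2 links regardless of n,
    which is why a ring's bisection bandwidth does not scale."""
    return 2 * link_bw

for n in (4, 8, 16):
    print(n, diameter(n), round(average_distance(n), 2), bisection_bandwidth())
```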

4. topology

4.1. bus interconnect

  • pros
    • simple design
    • cost effective for a small number of nodes
    • easy to implement coherence
  • cons
    • contention: all nodes contend for shared bus
    • limited bandwidth: all nodes communicate over the same wires
    • high electrical load

4.2. crossbar

  • every node is connected to every other node (non-blocking, indirect)
  • pros
    • O(1) latency and high bandwidth
  • cons
    • O(N^2) switches
    • high cost
    • difficult to arbitrate at scale

4.3. ring

  • pros
    • simple
    • cheap: O(N) cost
  • cons
    • high latency: O(N)
    • scalability issue: bisection bandwidth remains constant as nodes are added
  • ex
    • Core i7

4.4. mesh

  • direct network
  • O(N) cost
  • average latency: O(sqrt(N))
  • easy to lay out on chip: fixed-length links
  • path diversity: many ways for message to travel from one node to another
  • ex
    • Tilera processors
    • prototype Intel chips: Xeon Phi
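
As an assumed sketch (not from the notes), dimension-order (XY) routing illustrates how messages traverse a 2D mesh; the hop count is the Manhattan distance, which averages O(sqrt(N)) on a sqrt(N) x sqrt(N) mesh:

```python
# Minimal sketch of dimension-order (XY) routing on a 2D mesh.
# Coordinates and function names are illustrative assumptions.

def xy_route(src, dst):
    """Return the list of hops from src to dst on a 2D mesh:
    first walk along X until the column matches, then along Y."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:                      # route in the X dimension first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                      # then route in the Y dimension
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

# Hop count equals the Manhattan distance between the two coordinates.
print(xy_route((0, 0), (3, 2)))   # 5 hops: 3 in X, then 2 in Y
```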

4.5. torus

  • in a mesh, nodes differ depending on whether they sit near the edge or in the middle of the network; a torus adds wrap-around links so every node looks the same
  • O(N) cost
  • higher path diversity and bisection BW than mesh
  • higher complexity
    • difficult to lay out on chip
    • unequal link lengths

4.6. tree

  • latency: O(lg N)
  • “fat trees”: alleviate root bandwidth problem (higher bandwidth links near root)

4.7. hypercube

  • low latency: O(lg N)
  • radix: O(lg N)
  • number of links: O(N lg N)
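
In a hypercube with N = 2^d nodes, adjacent node IDs differ in exactly one bit, so the routing distance between two nodes is the number of differing bits (at most lg N). A minimal sketch with assumed names:

```python
# Minimal sketch: hypercube distance and a simple dimension-order route.
# Node IDs are d-bit integers; all names are illustrative assumptions.

def hypercube_distance(a, b):
    """Hop count = number of bit positions in which the IDs differ."""
    return bin(a ^ b).count("1")

def hypercube_route(a, b):
    """Flip differing bits from the lowest to the highest dimension."""
    path, cur = [a], a
    diff = a ^ b
    dim = 0
    while diff:
        if diff & 1:
            cur ^= (1 << dim)   # cross one dimension of the cube
            path.append(cur)
        diff >>= 1
        dim += 1
    return path

# 16-node (4-dimensional) hypercube: at most lg(16) = 4 hops.
print(hypercube_distance(0b0000, 0b1011))   # 3
print(hypercube_route(0b0000, 0b1011))      # [0, 1, 3, 11]
```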

4.8. multi-stage logarithmic

  • indirect network: multiple stages of switches sit between the endpoints (e.g., omega and butterfly networks)
  • cost: O(N lg N), latency: O(lg N)

5. buffering and flow control

  • circuit switching: preallocate all resources
    • higher bandwidth
    • low utilization
    • once set up, no contention
  • packet switching
    • higher utilization
    • overhead due to dynamic switching logic

5.1. communication granularity

  • message: unit of transfer between network clients; may consist of multiple packets
  • packet: unit of transfer in the network
    • header
    • payload
    • tail
  • flit (flow control digit): a unit of flow control
    • packets broken into smaller units called “flits”
    • minimum granularity of routing/buffering
  • packet format
    • header: routing and control information
    • payload/body
    • tail: additional control information, error code
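
To make the granularity hierarchy concrete, here is an assumed toy model (field names and sizes are illustrative, not a real NoC format) that splits a message's bytes into head/body/tail flits, with the destination carried by the header:

```python
# Toy sketch of the message -> packet -> flit hierarchy (assumed illustration).
from dataclasses import dataclass, field

@dataclass
class Flit:
    kind: str        # "head", "body", or "tail"
    payload: bytes   # a small slice of the packet's bytes

@dataclass
class Packet:
    dst: int                                  # routing info lives in the header
    flits: list = field(default_factory=list)

def packetize(message: bytes, dst: int, flit_size: int = 4) -> Packet:
    """Split one message into a packet of flits (flow control units)."""
    chunks = [message[i:i + flit_size] for i in range(0, len(message), flit_size)]
    pkt = Packet(dst=dst)
    for i, chunk in enumerate(chunks):
        kind = "head" if i == 0 else ("tail" if i == len(chunks) - 1 else "body")
        pkt.flits.append(Flit(kind, chunk))
    return pkt

print(packetize(b"hello, interconnect!", dst=7))
```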

5.2. handling contention

contention: when two packets need to be routed onto the same outbound link at the same time

solution:

  • buffer one packet, send it over link later
  • drop one packet
  • reroute one packet (deflection)
  • circuit switching
    • no contention due to preallocation, so no need for buffering
    • arbitrary message sizes

5.3. flow control

  • store and forward
    • buffer: channels and buffers are allocated per packet; each switch receives and stores the entire packet before forwarding it
    • pro: a packet occupies only one link at a time
    • flow control unit: packet
    • latency: because each switch must store the whole packet before forwarding it, a packet of #flits takes #flits * #links (ignoring contention)
  • cut-through flow control: transmission is pipelined over the links
    • a switch starts forwarding the packet on the next link as soon as the packet header is received
    • buffer: channels and buffers are still allocated per packet
    • reduced latency
      • best latency = #links + #flits, i.e., the time for the header to reach the destination plus the number of flits in the packet
      • worst latency: under high contention, degrades to store-and-forward behavior
  • wormhole flow control
    • buffer: buffers are allocated per flit, significantly reducing buffer requirements
    • routing information only in head flit
    • policy:
      • body flits follow the head flit and the tail flit follows the body; if the head flit blocks, the rest of the packet stalls
    • completely pipelined transmission
      • for long messages, latency is independent of network distance
      • latency = #links + #flits; for long messages, #flits >> #links (see the latency sketch below)
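
A minimal sketch (assuming one flit crosses one link per cycle, with router overhead and contention ignored) comparing the store-and-forward and cut-through/wormhole latency formulas above:

```python
# Minimal sketch of no-contention latency models (in cycles); assumptions:
# one flit crosses one link per cycle, router overheads are ignored.

def store_and_forward_latency(num_links, num_flits):
    """The whole packet is buffered at every hop: #flits * #links."""
    return num_flits * num_links

def cut_through_latency(num_links, num_flits):
    """Pipelined over links: the header takes #links cycles to reach the
    destination, then the remaining flits stream in: #links + #flits."""
    return num_links + num_flits

for links, flits in [(3, 4), (8, 16), (4, 128)]:
    print(links, flits,
          store_and_forward_latency(links, flits),
          cut_through_latency(links, flits))
```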

problem: head-of-line blocking, where packets that arrive later in the same buffer are stalled because the packet at the head of the buffer is blocked, even if those later packets are destined for idle output ports.

  • virtual channel flow control
    • divide switch’s input buffer into multiple buffers sharing a single physical channel
    • deadlock avoidance
    • prioritization of traffic classes
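
As an assumed toy model (names invented here, not from the notes), the sketch below contrasts a single FIFO input buffer, which exhibits head-of-line blocking, with the same port split into virtual channels, which lets a packet destined for an idle output proceed:

```python
# Toy sketch (assumed): a switch input port with either one FIFO buffer or
# several virtual channels sharing the same physical link.
from collections import deque

class InputPort:
    def __init__(self, num_vcs=1):
        # num_vcs=1 models a single FIFO; >1 models virtual channels.
        self.vcs = [deque() for _ in range(num_vcs)]

    def enqueue(self, pkt, vc=0):
        self.vcs[vc].append(pkt)

    def dispatch(self, output_free):
        """Forward at most one packet whose output port is free.
        With one FIFO, only that FIFO's head packet is eligible, so a
        blocked head packet blocks everything queued behind it."""
        for q in self.vcs:
            if q and output_free(q[0]["out"]):
                return q.popleft()
        return None

def free(out):
    return out == 1          # output 0 is busy; output 1 is idle

single = InputPort(num_vcs=1)
single.enqueue({"id": "A", "out": 0})   # head packet, blocked
single.enqueue({"id": "B", "out": 1})   # wants the idle output
print(single.dispatch(free))            # None: head-of-line blocking

vc = InputPort(num_vcs=2)
vc.enqueue({"id": "A", "out": 0}, vc=0)
vc.enqueue({"id": "B", "out": 1}, vc=1)
print(vc.dispatch(free))                # B proceeds via its own virtual channel
```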