A Load Balancing Algorithm Based on Replication and Movement af Data Items for Dynamic Structured P2P Systems

of 18
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
Information Report



Views: 32 | Pages: 18

Extension: PDF | Download: 0

Load balancing is one of the main challenges of every structured peer-to-peer (P2P) system that uses distributed hash tables to map and distribute data items (objects) onto the nodes of the system. In a typical P2P system with N nodes, the use of
  International Journal of Peer to Peer Networks (IJP2P) Vol.5, No.3, August 2014 DOI : 10.5121/ijp2p.2014.5302 15  A    L OAD B  ALANCING  A  LGORITHM B  ASED ON R  EPLICATION AND M OVEMENT OF D  ATA I TEMS FOR D  YNAMIC S TRUCTURED P2P   S  YSTEMS Narjes Soltani and Mohsen Sharifi Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran  A  BSTRACT     Load balancing is one of the main challenges of every structured peer-to-peer (P2P) system that uses distributed hash tables to map and distribute data items (objects) onto the nodes of the system. In a typical P2P system with N nodes, the use of random hash functions for distributing keys among peer nodes can lead to O(log N) imbalance. Most existing load balancing algorithms for structured P2P systems are not adaptable to objects’ variant loads in different system conditions, assume uniform distribution of objects in the system, and often ignore node heterogeneity. In this paper we propose a load balancing algorithm that considers the above issues by applying node movement and replication mechanisms while load balancing. Given the high overhead of replication, we postpone this mechanism as much as possible, but we use it when necessary. Simulation results show that our algorithm is able to balance the load within 85% of the optimal value.  K   EYWORDS   Structured P2P Systems, Load Balancing, Node Movement, Replication 1.   I NTRODUCTION   Distribution of objects among nodes in most structured Peer-to-Peer (P2P) systems is done through Distributed Hash Table (DHT) mechanism that use consistent hashing to map objects onto nodes [1,2,3,4]. Using this mechanism, a unique identifier is associated with each data item (object) and each node in the system. This is simply shown in Figure 1. The identifier space is partitioned among the nodes that form the P2P system and each node is responsible for storing all data items that are mapped to an identifier in its portion of the space.  International Journal of Peer to Peer Networks (IJP2P) Vol.5, No.3, August 2014 16 Figure 1. A schematic view of a structured p2p system If node identifiers are chosen at random (as in [1,2,3,4]), a random choice of item identifiers results in O(log N)  imbalance factor in the number of items stored at a node, where  N   is the total number of nodes in the system. Furthermore, the imbalance may be due to the non-uniform distribution of objects in the identifier space as well as high degree of heterogeneity in object loads and node capacities, memories, and bandwidths. Most existing load balancing algorithms in structured P2P systems, do not consider system dynamicity, nodes and objects heterogeneities, links latencies between nodes, and the popularity level of moved data items (objects) [5,6,7]. Furthermore, they are totally ignorant as to the causes of nodes overloading. In this paper we present a new load balancing algorithm which is based on our previous work [8] and differentiates two cases for node overloading: 1) when only one data item on a node is highly popular, and 2) when more than one item is popular or a high number of data items are mapped to a node but none of the items is highly popular. Our algorithm differentiates these two cases and uses two mechanisms namely object replication and node movement to balance the load between nodes considering the popularities of items. To avoid the overheads of node replication, our algorithm uses node movement for the purpose of load balancing in most cases. It only uses the replication mechanism when it detects that an alone node is not   able to handle its only popular data item. The algorithm uses capable nodes to handle load balancing. Also in order to consider system’s varying loads at different times, it introduces a new notion called valid boundary  and balances the load considering the following parameters: 1.   Non-uniform distribution of data items 2.   System heterogeneity 3.   The different popularity levels of data items 4.   Link latency between nodes The rest of the paper is organized as follows. Section 2 presents related works. Section 3 explains in more detail and formulates the load balancing problem. Section 4 presents our load balancing algorithm. Section 5 explains how system directories are stored in more capable nodes to help in system load balancing. Section 6 shows the performance of our algorithm through simulation. Section 7 concludes the paper and introduces some future directions. 2.   RELATED   WORK Generally load balancing protocols are divided into two main groups in structured P2P systems. The first group is based on uniform distribution of data items (objects) in identifier space and the  International Journal of Peer to Peer Networks (IJP2P) Vol.5, No.3, August 2014 17 second group has no such assumption [9]. Suppose that there are  N   nodes in the system, load balancing is achieved in the first group if the fraction of address space covered by each node is O(1/N) . Most algorithms have used the notion of virtual servers, first introduced in [1] to achieve this goal. A virtual server is similar to a single peer to the underlying Distributed Hash Table (DHT) and has its own routing table and successors list, but each physical node can take the responsibility of more than one virtual server. There are two main advantages in using virtual servers. The first advantage is that nodes can own noncontiguous portions of identifier space when they have multiple virtual servers. The second advantage is that virtual servers have the ability to move from any node to any other node in the system; this can easily be done by using a leave followed by a join in the underlying DHT which is supported by all DHTs. Chord [1] allows each physical node to host O(log N)  virtual servers so that each node gets a constant number of items with high probability. But this solution has some drawbacks; for example it assumes uniform load distribution on nodes, assumes all nodes have the same capacities, and it uses a constant number of virtual servers for every node, a choice which is only effective in homogenous systems. Also Chord load balancing algorithm is nonreactive and it tries to balance the load only when new nodes join the system. In other words, it has no provision to redistribute the load, making it unsuitable for dynamic structured P2P systems. CFS [10] accounts for nodes heterogeneities by allocating to each node some number of virtual servers proportional to the node capacity. In addition, CFS proposes a simple solution to shed the load from an overloaded node by having the overloaded node remove some of its virtual servers. However this scheme may result in thrashing as removing some virtual servers from an overloaded node may result in another node becoming overloaded. The reverse operation is done in the case of node underloading by creating some new virtual servers for the underloaded node, but it has its own problems again. A node with only some limited number of virtual servers may have no accurate estimation of the costs of creating new virtual servers. Also when the whole system is in an underloading status, it is quite probable that every node creates its maximum allowed number of virtual servers resulting in a huge increment in the sizes of routing tables and the search time. The main advantage of CFS is that it is completely decentralized. Rao et al. [5] have proposed three different mechanisms to balance the load using virtual servers, yet their mechanisms are static and ignore data items popularities. Using virtual servers in any algorithm leads to some common disadvantages. The first is that it leads to churn increase. It means that when a physical node wants to join the system, it should do the join operation for all of its virtual servers and when it wants to leave the system, it should remove all of its virtual servers. Because joining and leaving of objects from structured P2P systems impose some overhead, using virtual servers causes this overhead to be multiplied. Also in most structured P2P systems, searching is guaranteed to be done in O(log N)  steps, but when using virtual servers this value changes to O(log M)  where M is the total number of virtual servers in the whole system. Another disadvantage of virtual servers is that they increase the size of routing tables. Considering above problems with virtual servers, we do not use virtual servers in our proposed algorithm. Protocols which do not assume uniform object distribution use two different mechanisms to achieve load balance, namely object movement [9] and node movement [11]. Movement of objects breaks the DHT assigned addresses of objects to nodes making it hard to find objects further on. Its application is thus limited to situations where the latter is not an issue, e.g. when  International Journal of Peer to Peer Networks (IJP2P) Vol.5, No.3, August 2014 18 objects correspond to programs that do not have to be found because they transmit their results automatically. Moving nodes by letting them to choose their own addresses arbitrarily increases the threat of Byzantine attack that can prevent some items from being found in the network. This is done by simply choosing the node’s address to be the address of the item; the node then becomes responsible for storing the item and can refuse to do so. Moving nodes preserve the DHT search mechanism. Apart from the above classification, replicating data items is another way to achieve load balance. Akbariani et al. [12] have proposed a replication mechanism on Chord. They have used multiple hash functions to produce several key identifiers from a single key. Their approach has one main drawback that is they have defined no way to determine the number of hash functions for replicating a data item. Xia et al. [13] have discussed this problem and have proposed a solution for it. A major limitation about their solution is ignoring nodes heterogeneity while replicating, i.e. replication is done blindly on some nodes and without considering their ability. CAN [2] has proposed two approaches for data replication. The first approach uses multiple hash functions, but this is done statically and without considering system varying load patterns at different times. In the second approach, a node that finds it is being overloaded by requests for a particular object can replicate the object at each of its neighbouring nodes. The main problem with this approach is that system cannot control objects replication on nodes with more capabilities. In Pastry [3] replicas of an object are stored on the k Pastry nodes with keys numerically closest to the object’s key. The problem with this approach is again not having the ability to replicate objects on more capable nodes. Although some other replication mechanisms such as the one presented in [14] have been proposed for structured P2P systems, but they are mostly concerned with placing object replicas to maximize system availability with no concern for balancing the load of objects on the nodes of the system. Also most of the proposed replication algorithms use to replicate data items from the early operation of the system in specified number of nodes which seems not to be necessary especially for unpopular data items, as it leads to wasting of system resources and increase system complexity. 3.   DEFINITIONS   AND   PROBLEM   FORMULATION In this section we explain the primary definitions in context of load balancing and also present our formulas. 3.1. Primary Concepts In most structured P2P systems, the load of each node is defined as the number of items that are stored in that node [5,6,7,15]. On the other hand, there may exist high popularity items that make a node overloaded, although it stores some limited number of data items. Therefore, by taking into account the number of entered requests for a node’s data items, we can define a node’s load as the average number of bytes that are transferred by that node in each unit of time. By the same token, we can define the node capacity as the maximum number of bytes that node can transfer per time unit.  International Journal of Peer to Peer Networks (IJP2P) Vol.5, No.3, August 2014 19 When a node intends to join the system, it is given a unique key using a hash function we call the FirstHash  hereafter in this paper. For the purpose of load balancing, a set of load directories, each called  LoadDir  , is designed in the system to which nodes send their loads, capacities, locations, and downtimes periodically. A node’s downtime is defined as the summation of continuous times it is not accessible by other nodes in the system in defined intervals namely t   that can be due to links failure or high traffic. We use the successor and predecessor nodes of each node n  to periodically ping node n  and to record the current system time tt   somewhere in their local memories in case they find node n  inaccessible. If after a specified timeout interval they again find node n  as their successor (or predecessor), they subtract tt   from the current system time and report this value to node n . The downtime of node n  is periodically estimated by summing the reported values in the related t   interval. We store locations of nodes to consider locality during load balancing process and thus reduce the time it takes to balance load. Later we explain the way these directories are stored in nodes that are more capable in terms of bandwidth, uptime and memory in comparison with other nodes. Each node should be aware of a node that stores its related directory. Specifying directory store nodes however is done dynamically by considering different system states. So we define some constant nodes as pointer nodes in which the identifier of directory store nodes are saved. We use the approach proposed in [16] to prevent Byzantine attacks and to specify pointer nodes. Each node connects to a central authority once, i.e. the first time it joins the system and obtains a pointer identifier; we denote this pointer with PointerNo  that specifies to which  LoadDir   the node should send its information. In other words, each PointerNo  specifies a pointer node and each pointer node specifies the identifier of the node that stores the related  LoadDir  . The node whose identifier is equal to or follows a PointerNo  is in fact the related pointer node for this PointerNo . The number of distinct pointer identifiers is limited and determines the number of load directories in the system. Using the explained mechanism, a node cannot dishonestly report its information to take responsibility of some specified items and then refuse to respond to those items’ requests (Byzantine attack). The reverse case happens when nodes report their information falsely to prevent movement of some items to them. To stop the occurrences of such cases, incentive mechanisms such as the ones presented in [17] can be applied. We simply suppose that nodes report their locations honestly, but more secure mechanisms like the one proposed in [18] can be applied. The central authority periodically sends the pointer numbers to the related pointer nodes so that each directory can get aware of other directories. In our algorithm we group the nodes based on their load directories. In other words, nodes that send their information to the same  LoadDir  , form a group with each other. The use of these groups is explained in the following part. 3.2. Load and Cost Definition Our load balancing algorithm tries to minimize the load imbalance factor in the system while minimizing load movement. Also we consider some other important factors which are related to the destination of the load transfer. In our algorithm, the calculation of the cost of transferring load to a destination node is based on destination load, downtime and also its proximity to the overloaded node. By considering proximity, we want to show the importance of links’ latencies in the final cost. The final goal is to increase the number of successfully routed queries in the
View more...
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks

We need your sign to support Project to invent "SMART AND CONTROLLABLE REFLECTIVE BALLOONS" to cover the Sun and Save Our Earth.

More details...

Sign Now!

We are very appreciated for your Prompt Action!