I Introduction
Free space optical (FSO) communication supports high data rate applications with minimal electromagnetic interference [1]. FSO requires a point-to-point link between transmitter and receiver and is sensitive to atmospheric conditions, such as fog, dust, and sand storms, that result in serious degradation of FSO link capacity [2, 3]. To increase the reliability of terrestrial broadband links, a radio frequency (RF) link is integrated with the FSO link to form a hybrid FSO/RF system. The hybrid FSO/RF system exploits the complementary nature of the individual FSO and RF links with respect to their environmental sensitivity: the FSO link is not very sensitive to rain, while RF links are not degraded by fog, sand, or dust storms but are heavily attenuated by rain [4, 5, 6]. A hybrid FSO/RF system maintains reliable communication through efficient switching between the FSO and RF links under varying weather conditions, thereby improving the performance of the system as a whole [7, 8].
Designing an efficient hybrid FSO/RF system with high link availability in a dynamic environment characterized by fog, dust, or sand is challenging; decisions regarding switching between the FSO and RF links must be made immediately. Towards efficient FSO and RF link switching, a few works have been reported, such as a switching mechanism based on weak and strong turbulence for a one-hop FSO/RF system [9, 10], a fuzzy logic-based FSO/RF soft switching mechanism [29], a predictive link switching mechanism using long short-term memory (LSTM) [11], and machine learning classifiers such as AdaBoostClassifier, DecisionTreeClassifier, RandomForestClassifier, and GradientBoostingClassifier [12]. Other methods include a threshold-based link adaptation technique [13] to avoid frequent link switching, FSO link outage-initiated link switching [14], switching for multihop FSO/RF systems under various statistical models [15, 16, 17], and coding schemes for link switching [18, 19]. Although these techniques switch between FSO and RF links using threshold, coding, or predictive methods, they are not efficient for link switching under a time-varying dynamic environment. The dynamics of network conditions compound the problem and the switching control complexity, which are not addressed by the existing FSO/RF link switching techniques for hybrid FSO/RF systems. Fortunately, deep reinforcement learning (DRL) [20] is an emerging approach to solving such complex control problems. Recent years have witnessed the application of DRL to various areas, such as resource allocation and optimization [21, 22, 23, 24, 25], fog/edge computing and caching [26, 27], and dynamic channel assignment [28]. To the best of our knowledge, no prior work has considered DRL for FSO and RF link switching or control problems in hybrid FSO/RF systems. This work proposes DRL-based frameworks for link switching in a hybrid FSO/RF system, namely actor-critic and deep Q-network (DQN). Due to frequently changing weather conditions, such as fog, dust, and sand, integrating an FSO/RF system with efficient link switching is indeed challenging [29, 9]. The key motivation for this DRL-based framework is to support dynamic and multi-factor decision making [30, 31].
DQN has gained significant attention due to tremendous improvement in performance with the help of deep neural networks (DNN) to extract features. Recently, significant optimizations to DQN have been proposed, such as double DQN [32] to reduce overestimation of action values, prioritized experience replay [43], distributed DQN to model the distribution of action values, and the dueling DQN architecture [33] that decomposes action values into state values and advantages. To enable DQN to make link switching and control decisions efficiently, it is desirable to limit how often the DQN agent changes the deployed/target policy during training. For large-scale hybrid FSO/RF systems, updating a policy requires reconsidering the complete environment. This requirement motivates the design of DQN agents with low switching costs. The proposed agent in this work aims to reduce the number of times the deployed/target policy interacting with the environment changes. Our work proposes consensus-based feature selection criteria for DQN to reduce switching costs in contrast to standard DQN. The key contributions of this work are summarized as follows:

Conduct the first systematic study of modern deep reinforcement learning algorithms for hybrid FSO/RF systems. We implement two well-known DRL algorithms: 1) actor-critic, based on actor and critic neural networks, and 2) deep Q-network, which consists of two deep neural networks to approximate a Q-function. We show that both DRL methods find the best policy for FSO and RF switching in contrast to MyOpic. This is illustrated below:

An actor-critic DRL-based method called Actor/Critic-FSO/RF solves the link switching optimization problem by considering its non-convex and combinatorial characteristics. The objective is to investigate the optimal long-term reward/utility of FSO and RF link switching for a hybrid FSO/RF system while maximizing link availability. Specifically, the states, actions, and reward functions are defined for FSO/RF transceivers.

Adopt and design a deep Q-learning method called DQN-FSO/RF that can generate efficient FSO/RF link switching decisions in a highly dynamic environment subject to fog, rain, snow, and dust. With the help of a deep neural network, DQN-FSO/RF demonstrates a near-optimal strategy by approximating action-value functions from the current state of the hybrid FSO/RF system.


Propose a novel consensus-based representation learning approach using a deep Q-network, called DQN-Ensemble-FSO/RF, to update the deployed policy and achieve a low switching cost in a hybrid FSO/RF system, in contrast to DQN-FSO/RF's frequent policy updates. Inspired by representation learning, the switching criterion uses consensus features from an ensemble of asynchronous threads comparing the deployed Q-network and the underlying learning Q-network. The experiment results reveal that the proposed DQN-Ensemble-FSO/RF significantly reduces the policy switching cost in contrast to DQN-FSO/RF and MyOpic when implemented for a hybrid FSO/RF system.

The Actor/Critic-FSO/RF and DQN-FSO/RF strategies for FSO and RF link switching are evaluated in an environment subject to attenuation due to fog, dust, rain, and sand storms. Results exhibit the training performance under different parameters and show the influence of link switching on FSO/RF system performance in terms of reward.

Experiment results show that the proposed DQN-Ensemble-FSO/RF method demonstrates faster convergence and better learning performance than Actor/Critic-FSO/RF and DQN-FSO/RF at a significantly reduced switching cost.
The remainder of this article unfolds as follows. Section II discusses recent works relevant to link switching techniques in hybrid FSO/RF systems. We describe the system model and problem formulation in Section III and Section IV, respectively. Deep reinforcement learning methods, i.e., Actor/Critic-FSO/RF and DQN-FSO/RF for FSO/RF link switching, are presented in Section V. Section VI formulates the policy switching problem for the hybrid FSO/RF system. The novel proposed DRL method to solve the policy switching problem, called DQN-Ensemble-FSO/RF, is presented in Section VII. Performance evaluation, including the MyOpic policy switching baseline, evaluation setup, results, and analysis of all the DRL methods, is given in Section VIII. Finally, Section IX concludes the paper with possible future directions.
II Related Work
Hybrid FSO/RF systems have been widely discussed in the literature to improve the reliability of communication. The authors in [34] proposed a hybrid FSO/RF system to act as a capacity backhaul to support high traffic for 5G and beyond. In this work, FSO acts as the primary backhaul and RF as the secondary. The RF system is activated through a one-bit feedback signal from the receiver, but the system does not consider the real-time channel conditions near the transmitter.
A link switching technique called adaptive combining used a signal-to-noise ratio (SNR) threshold to keep the FSO link active [35]. It activates the RF link if the perceived SNR of the FSO link drops below the switching threshold, enabling diversity reception with simultaneous data transmission over both links combined through maximal ratio combining (MRC). The system switches back to the FSO link once the link quality becomes acceptable. The MRC-based combining technique, however, is subject to a performance and complexity tradeoff. In [36, 37], a hybrid FSO/RF system is proposed using adaptive combining techniques based on various power adaptation mechanisms. This system demonstrated better outage performance than a constant-power FSO/RF system. One serious limitation of this system is the requirement for power-adaptable RF and FSO transceivers, which are expensive compared to constant-power FSO/RF systems. The works in [9, 10, 38] proposed link switching mechanisms between FSO and RF based on atmospheric turbulence. The switching mechanism keeps one link active depending upon atmospheric conditions. The method in [9, 10] evaluated an FSO/RF switching mechanism according to weak and strong turbulence for a one-hop hybrid FSO/RF system. Abid et al. in [29] proposed a fuzzy logic-based hybrid FSO/RF soft switching mechanism to improve the reliability and capacity of FSO under atmospheric conditions, such as dust, fog, and cloud, with a particular focus on sand/dust. The system consists of fuzzy-inference-based FSO and RF subsystems controlled by a fuzzy inference switching mechanism that prioritizes the FSO subsystem. The proposed fuzzy inference switching mechanism improves the performance of the hybrid system in terms of bit error rate (BER), SNR, and system capacity. However, fuzzy logic inference is dependent on human expertise and requires extensive validation and verification.
In [39], Renat et al. proposed a hard switching method based on received signal strength indicator (RSSI) predictions. The method aimed to increase the availability of the optical link by using an RF link under atmospheric turbulence. The authors considered machine learning models, such as random forest, gradient-boosting regression, and decision trees, for RSSI prediction to increase the availability of FSO/RF systems. The proposed work used regression models to analyze the prediction of received optical power; these models can accurately determine the predicted RSSI value. The results of the proposed machine learning models, however, can fail under practical verification due to overfitting. Further, decision trees are susceptible to small changes in the data, which can produce different outputs.
A work in [11] proposed a predictive link switching mechanism for hybrid FSO/RF systems to conserve energy. The proposed method kept the FSO link continuously active and sampled the FSO signal periodically to collect a dataset, which was used to train an LSTM model. The work correlated the number of FSO signal power samples in the RF transmission with the prediction error using a predefined error threshold to improve energy efficiency. This work, however, is preliminary and does not consider the dynamics of a hybrid FSO/RF communication system, which makes modeling the environment to optimize network performance challenging.
Another work in [12] proposed a hard FSO/RF switching mechanism by predicting the RSSI value. The authors analyzed the effects of the atmospheric channel on the quality of the optical signal. Although this analysis studied both soft and hard switching between FSO and RF, primary consideration was given to hard switching. The work evaluated various machine learning classifiers, such as AdaBoostClassifier, DecisionTreeClassifier, RandomForestClassifier, and GradientBoostingClassifier, for RSSI prediction to enable efficient hard FSO/RF switching. Similarly to [11], the work is a simple evaluation of single and ensemble machine learning classifiers for hard switching between FSO and RF links. The contribution is limited and pays no attention to modeling the dynamics of the atmospheric channel and its effect on soft switching between FSO and RF links.
III System Model
In this paper, we consider the FSO and RF link switching problem for a hybrid FSO/RF system based on FSO/RF state learning. A detailed description of the system model is given below.
The system is equipped with hybrid links, i.e., RF and FSO, each with two possible states: an available state, in which the link can successfully transmit data, and an unavailable state, which indicates that switching is required, as transmission may otherwise fail. The hybrid FSO/RF system assumes that it can dynamically switch between the FSO and RF links. The FSO and RF links are correlated under dynamic weather conditions, and the switching patterns between them can be modeled as a Markov chain. At any instant of time, the link state comprises the states of the FSO and RF links in the current time slot; the links are restricted to change state only at the start of each time slot. There is a fixed probability that a link's state changes from the current state to a different state; otherwise it stays the same. The hybrid FSO/RF system assumes that the link switching pattern is unknown and must be learned from link feedback, i.e., atmospheric observations and attenuation levels. The attenuation level is observed through the signal-to-interference-plus-noise ratio (SINR) of a pilot signal, obtained by tuning to an FSO or RF link. The model assumes that the system can learn the weather conditions of the link, and hence the corresponding link state, without any explicit mechanism. The observed state of the FSO link is given in Equation 1, and that of the RF link in Equation 2. Let us define the state of the FSO and RF links selected by the system over time slots as:
(1) 
(2) 
(3) 
It is evident from Equation 3 that the state of an RF or FSO link is known only if that link is active in a time slot, and not otherwise. As can be observed from the above, the system considers a discrete action space of valid possible actions. Hence, in each time slot, the agent selects one action from the action space, i.e., accesses the corresponding RF or FSO link, and the condition of the selected link is revealed.
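The system model above can be sketched as a small simulator: each link is a two-state Markov chain, and only the state of the accessed link is revealed in each slot. The class name, state encoding, and transition probability below are illustrative assumptions, not values from the paper.

```python
import random

# Sketch of the system model: each link (FSO, RF) is a two-state Markov
# chain; only the selected link's state is observed per time slot.
GOOD, BAD = 1, 0

class HybridFsoRfEnv:
    def __init__(self, p_change=0.3, seed=0):
        self.p_change = p_change            # assumed per-slot state-flip probability
        self.rng = random.Random(seed)
        self.links = {"FSO": GOOD, "RF": GOOD}

    def step(self, action):
        """action is 'FSO' or 'RF'; returns (observed state, reward)."""
        for link in self.links:             # link states evolve at the slot boundary
            if self.rng.random() < self.p_change:
                self.links[link] = 1 - self.links[link]
        observed = self.links[action]       # only the accessed link is revealed
        reward = 1.0 if observed == GOOD else 0.0
        return observed, reward
```

A successful transmission on the accessed link yields a unit reward; the state of the other link stays hidden, which is what makes the switching decision a partially observed learning problem.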
IV FSO/RF Link Switching Problem Formulation
This section formulates the link switching problem in a hybrid FSO/RF system. The performance of the system is affected by varying weather conditions, including rain, fog, snow, and dust. This work, however, considers the deployment of the system in an environment subject to dust and dust storms, common in desert regions. Once the system switches to a link, both the FSO and RF links are associated with particular rewards. The work proposes deep reinforcement learning approaches to implement an agent that learns FSO and RF switching patterns.
Let us define the reward for the link selected in a time slot, whether FSO or RF, as given in Equation 4 below.
(4) 
The objective of the FSO/RF system is to select the appropriate RF or FSO link based on the link quality to ensure successful transmission under dusty conditions. To achieve this, the agent aims to find an optimal policy that maps the system observation space in Equation 3 to an appropriate action in the action space to maximize the expected reward given in Equation 4, as expressed in Equation 5.
(5) 
The reward accumulated over a finite number of time slots is expressed in Equation 6 as:
(6) 
In Equation 7, the decision variable indicates the link selected for each time slot. Now, we can formulate the RF/FSO link switching problem as:
(7) 
Subject to the constraint that, in each time slot, the agent selects either the FSO or RF link based on the dust conditions, with the reward as defined in Equation 4.
V Deep Reinforcement Learning-based Agent for Hybrid FSO/RF System
This section presents the framework for a hybrid FSO/RF system based on a deep reinforcement learning agent. In this work, we formulate the link switching problem as a Markov Decision Process (MDP) and employ DQN and actor-critic approaches to implement the agent that switches between RF and FSO links. Further, we also compare the performance of the actor-critic agent with the MyOpic policy under complex and dynamic weather conditions. Finally, we propose a consensus-based ensemble representation learning DQN approach for low switching costs under time-varying and dynamic weather conditions. To the best of our knowledge, this is the first study to implement DQN and actor-critic, and to propose an ensemble consensus-based representation learning DQN, for link switching in hybrid FSO/RF systems.
V-A Deep Reinforcement Learning Agent
FSO/RF link state: As discussed earlier, the FSO/RF link state is time-varying, as modeled by the MDP. The agent uses its observation space as the input to the DRL framework, accesses the selected FSO or RF link in each iteration, and updates the reward based on the state of the selected link. The FSO/RF system learns the best policy from previous experiences in the form of the observation space, which is updated from one time slot to the next up to a maximum number of iterations (MAX) for observing the FSO/RF link state.
Action space: The agent evaluates actions from the action space using the observation space to find the action with the highest reward. In the context of FSO/RF switching, an action means accessing the RF or FSO link for data transmission in a time slot.
Reward: Once the agent selects an action from the action space, it observes an instantaneous reward based on the condition of the selected link. The basic objective of the reward is to solve the optimization problem defined in Equation 7.
V-B Actor/Critic-FSO/RF DRL Algorithm
The actor-critic DRL algorithm (Algorithm 1) considers the average reward observed by the DRL agent in a time slot, as given in Equation 8.
It can be observed from Figure 2 that the actor-critic framework is based on actor and critic neural networks, and from Algorithm 1 that the actor and critic neural networks are initialized with separate parameters. The actor neural network maps an observation at a time slot to an action using an optimal policy, as given in Equation 9. The action space is discrete, and the normalized probability of each action is calculated using the Softmax function at the output layer of the actor network.
(9) 
The actor neural network, therefore, can be represented as Equation 10.
(10) 
The critic neural network is based on a value function. It receives feedback from the actor network, which records the reward obtained by executing the selected action in the environment with varying weather conditions at each time slot. The critic uses this information to calculate the temporal difference (TD) error, as given in Equation 11.
(11) 
The critic network uses an optimal value function to minimize the least-squares temporal difference (LSTD) error, as given in Equation 12.
(12) 
The actor uses the TD error given in Equation 11 to compute the policy gradient as given below in Equation 13.
(13) 
In Equation 13, the score function corresponds to the selected action, i.e., the selected FSO/RF link, under the current optimal policy. Given this, the parameters of the actor neural network can be updated with a given learning rate. The gradient for the actor network is computed using Equation 14.
(14) 
The critic network collects the most recent observations of the FSO and RF links at the beginning of each time slot. The actor network chooses the RF or FSO link based on the optimal policy. Once the selected link is used for transmission, the observed reward is recorded. The critic network computes the TD error using the reward, the current observation state, and the next observation state. The computed error updates the gradients of both the critic and actor neural networks.
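The TD-error-driven interplay between actor and critic described above can be sketched in tabular form: a softmax policy over action preferences stands in for the actor network, and a state-value table for the critic. The state encoding, step sizes, and discount factor below are illustrative assumptions.

```python
import math

# Minimal tabular actor-critic sketch: softmax policy (actor) and
# TD-error value updates (critic); all constants are assumptions.
states = [0, 1, 2, 3]                  # e.g. joint (FSO, RF) link conditions
actions = ["FSO", "RF"]
V = {s: 0.0 for s in states}           # critic: state-value estimates
pref = {(s, a): 0.0 for s in states for a in actions}  # actor preferences
gamma, alpha_actor, alpha_critic = 0.9, 0.05, 0.1

def policy(s):
    # normalised action probabilities via Softmax, as at the actor's output layer
    z = [math.exp(pref[(s, a)]) for a in actions]
    total = sum(z)
    return [p / total for p in z]

def update(s, a, r, s_next):
    # TD error (Eq. 11) drives both networks
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha_critic * delta                       # critic step (Eq. 12)
    probs = policy(s)
    # actor policy-gradient step (Eq. 13/14) with the softmax score function
    pref[(s, a)] += alpha_actor * delta * (1 - probs[actions.index(a)])
    return delta
```

In the paper both functions are neural networks trained by gradient descent; this table-based version only illustrates the direction of each update.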
V-C Deep Q-Network for the Hybrid FSO/RF System (DQN-FSO/RF)
The following section illustrates link switching using DQN for the hybrid FSO/RF system.
V-C1 Q-learning
Q-learning in a hybrid FSO/RF system has the potential to learn switching patterns between FSO and RF links directly from environment observations, without the need for an estimated system model. This makes it well suited to unknown weather dynamics, such as dust storms. Q-learning for the hybrid FSO/RF system aims to find an optimal policy, i.e., an FSO/RF switching pattern that maximizes the accumulated reward defined in Equation 6. Q-learning learns from actions outside the current policy, which is called off-policy learning. The Q-value $Q(s, a)$ represents the cumulative reward of an agent being in state $s$ and performing action $a$, and is given in Equation 15:

$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$ (15)
In Equation 15, $Q(s, a)$ denotes the Q-function, $r$ represents the instantaneous reward of a transmission on the selected FSO or RF link, and $\max_{a'} Q(s', a')$ is the maximum Q-value for the next observation/state $s'$ with discount factor $\gamma$. An agent uses a Q-table to store and look up the Q-values of future actions according to states. The expected Q-values are updated using Equation 16:

$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$ (16)

Equation 16 moves the old Q-value toward the discounted one-step look-ahead target with learning rate $\alpha$.
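The update in Equation 16 is a one-line rule; a minimal sketch follows, where the learning rate and discount factor are illustrative assumptions.

```python
from collections import defaultdict

# Tabular Q-learning update corresponding to Eq. 16; alpha and gamma
# are illustrative assumptions.
alpha, gamma = 0.1, 0.9
Q = defaultdict(float)                 # Q-table: (state, action) -> value
actions = ["FSO", "RF"]

def q_update(s, a, r, s_next):
    # one-step look-ahead target: r + gamma * max_a' Q(s', a')  (Eq. 15)
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]
```

For example, a first successful transmission on the FSO link from a given state moves that entry from 0 toward the unit reward by a factor of the learning rate.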
V-C2 DQN-FSO/RF
Q-learning can find an optimal policy if it can estimate the Q-function for each state-action pair using Equation 15. This becomes computationally expensive when the state and action spaces are large. DQN [30] uses a deep neural network, called a Q-network, to approximate the Q-function. The agent selects an action in a state and makes a transition to the next state with a reward, represented as a transition/experience tuple $(s, a, r, s')$. The tuple is saved to a replay memory buffer.
The deep Q-network is based on states, actions, rewards, and the Q-network itself. At each time slot, the Q-network uses the observation from Equation 3 to select either the FSO or RF link from the action space, as illustrated in Figure 3. The objective of the Q-network is to maximize the reward given in Equation 6. Also, an unchanged reward does not contribute to the objective and therefore receives a penalty.
A DQN agent consists of two deep neural networks to approximate the Q-function, as illustrated in Figure 3. One acts as the action-value function approximator and the second as the target action-value approximator, each with its own weights per iteration. The weights of the first (online) neural network are updated at each learning iteration using a minibatch of random samples from the replay memory buffer.
These weights are updated using stochastic gradient descent (SGD) and backpropagation, minimizing a mean-square error (MSE) loss. Referring to Equation 16, the loss is calculated as Equation 17, where $\theta$ represents the weights of the Q-network:

$L(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\right)^{2}\right]$ (17)
In Equation 17, $\theta^{-}$ represents the parameters/weights of the target neural network, which are replaced by the parameters of the training Q-network every fixed number of time steps, as can be seen in Figure 3. The deep Q-network uses minibatch data from the replay buffer to train the Q-network. Instead of using an $\epsilon$-greedy approach, the experience replay component exploits stochastic prioritization to generate the probability of actions, which helps the network converge. The steps of DQN-FSO/RF are summarized in Algorithm 2.
VI DQN-FSO/RF Policy Switching Cost Problem Formulation
This section focuses on the policy switching cost used to optimize the DQN-FSO/RF agent. The switching cost denotes the number of changes in the deployed policy of the action-value network (see Algorithm 2) over a window of episodes, as given in Equation 18:
(18) 
The objective of the DQN-FSO/RF agent is to learn an optimal policy with a small switching cost.
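The switching cost of Equation 18 reduces to counting how often the deployed policy changes across episodes; a minimal sketch, where a policy is represented by any hashable snapshot (e.g., a tuple of greedy actions), which is an assumption:

```python
# Switching cost as in Eq. 18: number of times the deployed policy
# changes between consecutive episodes.
def switching_cost(deployed_policies):
    return sum(1 for prev, cur in zip(deployed_policies, deployed_policies[1:])
               if prev != cur)
```

For instance, a deployment sequence that changes policy twice over five episodes has a switching cost of 2, regardless of how the policies themselves are encoded.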
VII Ensemble Consensus-based Representation Deep Reinforcement Learning (DQN-Ensemble-FSO/RF) for FSO/RF Link Switching
Inspired by representation learning [40], we adopt the view that DQN learns to extract informative features of the environment states, using the consensus of an ensemble of threads. Our proposed criterion switches the deployed policy according to the consensus of features. In the proposed ensemble consensus extension of DQN-FSO/RF, asynchronous threads sample batches of data from the replay buffer and then extract the features of all states using both the action-value network and the target Q-network.
For a given state, features are extracted by each thread from both networks; the similarity score between the two feature sets for each thread on that state is defined in Equation 19 as:
(19) 
The average similarity score over a batch of states for each thread is given in Equation 20, and the ensemble consensus score in Equation 21:
(20) 
(21) 
The representation-learned ensemble consensus score updates the target/deployed policy whenever the score satisfies a threshold condition scaled by the learning rate. The DQN-Ensemble-FSO/RF algorithm is illustrated in Algorithm 3.
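A hedged sketch of this consensus criterion follows: per-state feature similarity (Eq. 19) is averaged over a batch per thread (Eq. 20) and then over threads (Eq. 21). The use of cosine similarity and the "switch only on low consensus" threshold are assumptions, since the exact similarity measure and inequality are not reproduced here.

```python
import math

# Consensus of features between online and target Q-networks across threads.
def cosine(u, v):
    # assumed similarity measure between two feature vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def consensus_score(online_feats, target_feats):
    # online_feats / target_feats: per-thread lists of per-state feature vectors
    per_thread = []
    for of, tf in zip(online_feats, target_feats):
        sims = [cosine(u, v) for u, v in zip(of, tf)]   # Eq. 19 per state
        per_thread.append(sum(sims) / len(sims))        # Eq. 20 batch average
    return sum(per_thread) / len(per_thread)            # Eq. 21 ensemble consensus

def should_switch(score, threshold=0.95):
    # assumed rule: deploy a new policy only when consensus is low,
    # i.e. the online network's representation has drifted from the target's
    return score < threshold
```

Under this rule, identical features across the two networks yield a score of 1.0 and suppress the policy switch, which is how the criterion reduces switching cost relative to unconditional target updates.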
VIII Performance Evaluation
This section presents and discusses (i) the effectiveness of deep reinforcement learning algorithms, including DQN and actor-critic, and (ii) the comparison of the DQN-Ensemble-FSO/RF agent with Actor/Critic-FSO/RF, DQN-FSO/RF, and MyOpic, as discussed below.
VIII-A MyOpic Policy for FSO/RF Link Switching
The MyOpic policy only accumulates the immediate reward obtained from transceiver switching, without considering the future. The MyOpic agent always selects the transceiver that maximizes the immediate expected reward.
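The myopic rule above reduces to picking the link with the highest immediate expected reward; a one-line sketch, where the belief representation (probability that each link is in the good state) is an illustrative assumption:

```python
# Myopic baseline: choose the link with the highest immediate expected
# reward; under the 0/1 reward model this equals P(link is good).
def myopic_action(beliefs):
    # beliefs: dict mapping link name -> assumed probability of the good state
    return max(beliefs, key=beliefs.get)
```

Because this rule ignores how the current choice affects future observations, it cannot exploit the correlation between the links and the environment, which is the gap the DRL agents are designed to close.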
VIII-B Evaluation Setup
We have implemented the proposed DQN-Ensemble-FSO/RF, DQN-FSO/RF, and Actor/Critic-FSO/RF using TensorFlow, with the DRL environment in the OpenAI Gym framework (http://gym.openai.com/). The code for these DRL agents is available on GitHub (https://github.com/shenna2017/FSORFagent). For each iteration, the DRL agents are trained over a range of episodes. The DQN-FSO/RF and DQN-Ensemble-FSO/RF agents are created using Keras functions. The neural network has five layers: one input layer, three hidden layers, and one output layer. The hidden layers use ReLU activation functions, and the output layer is linear. For both the DQN-Ensemble-FSO/RF and DQN-FSO/RF agents, actions are selected using the Boltzmann policy [41]. Our evaluations use the Adam optimizer to minimize the loss function given in Equation 17. Other parameters, including minibatch size, learning rate, discount factor, and experience replay buffer size, are given in Table I.

Hyperparameter          Value
Batch size              32
Activation function     ReLU
Learning rate           -
Experience replay size  1,000,000
Discount factor         0.9

Table I: DQN hyperparameters
VIII-C Results and Analysis
Figure 4 presents the loss versus the number of episodes during the training of DQN-Ensemble-FSO/RF, DQN-FSO/RF, and Actor/Critic-FSO/RF. For all three DRL agents, the loss decreases to almost zero as the number of episodes increases. It is evident from Figure 4 that the DQN-Ensemble-FSO/RF agent shows a lower loss than DQN-FSO/RF, demonstrating the effectiveness of DQN-Ensemble-FSO/RF for FSO and RF switching. In contrast, the Actor/Critic-FSO/RF agent has the highest loss. This trend demonstrates that all the DRL agents converge within a reasonable number of training episodes, with Actor/Critic-FSO/RF appearing to converge faster than DQN-Ensemble-FSO/RF and DQN-FSO/RF.
Figure 5 presents the average reward observed by all the DRL agents with a varying number of training episodes. For DQN-Ensemble-FSO/RF and DQN-FSO/RF, the average reward is highest after 100 episodes; the Actor/Critic-FSO/RF agent achieves its maximum average reward after more episodes, indicating successful switching between the RF and FSO links. This increase in average reward with the number of episodes indicates the convergence of the DRL agents after a reasonable number of training episodes. It can be observed that DQN-Ensemble-FSO/RF converges to a higher average reward than DQN-FSO/RF and Actor/Critic-FSO/RF, although the latter two achieve higher rewards with fewer training episodes, indicating faster learning. Overall, the average reward results show that DQN-Ensemble-FSO/RF is superior to DQN-FSO/RF and Actor/Critic-FSO/RF, indicating effective learning due to a decrease in the overestimation error of the Q-value for FSO and RF switching.
Figure 6 shows the mean reward for DQN-Ensemble-FSO/RF after interacting with the environment for 250 to 2000 episodes. DQN-Ensemble-FSO/RF converges after approximately 1500 episodes. In the early episodes, the reward is low due to limited learning. As the number of training episodes increases, DQN-Ensemble-FSO/RF gradually improves and the reward increases, with a significant improvement in the average reward per episode after 1000 episodes. After about 1500 training episodes, the reward flattens out smoothly, indicating DQN-Ensemble-FSO/RF's ability to perform successful FSO and RF link switching.
Figure 7 shows the loss variation of the critic and actor neural networks for the Actor/Critic-FSO/RF agent. The average loss of both the actor and critic decreases during training, indicating that both networks reduce the error due to overestimation, which aids Q-learning. The loss curves show that the actor loss drops dramatically at the beginning of training, after which the reduction gradually becomes unstable. In contrast, the critic loss is lower from the beginning and stays stable as training continues. As a consequence, the action-value estimate of the critic neural network achieves a higher reward. This, however, does not guarantee an optimal learned policy, as Actor/Critic-FSO/RF may overfit.
For DQN-Ensemble-FSO/RF, DQN-FSO/RF, and MyOpic, we evaluate the switching cost over increasing numbers of episodes. During the training process, the target-network policy is synchronized with the action-value policy using the consensus of features across threads. As shown in Figure 8, MyOpic with a known transition probability of 0.5 switches its policy in a round-robin fashion and has the highest switching cost for all the episodes; MyOpic cannot exploit the correlation among FSO, RF, and the environment for policy switching. DQN-FSO/RF shows a significantly lower switching cost than MyOpic, as its agent can learn the FSO/RF system dynamics, including the correlation between FSO, RF, and atmospheric conditions. The learned policy switching improves the accumulated reward of DQN-FSO/RF, thereby improving FSO/RF system performance. In contrast to MyOpic and DQN-FSO/RF, DQN-Ensemble-FSO/RF drastically reduces the number of policy switches, representing a low switching cost suitable for hybrid FSO/RF systems operating under stable atmospheric conditions. DQN-Ensemble-FSO/RF's consensus criterion achieves better performance with minimal switching cost across all evaluated episode counts: the frequency with which the criterion switches the action-value policy decreases as the number of episodes increases, resulting in a significant switching cost reduction and more robust behavior than MyOpic and DQN-FSO/RF.
Figure 9 investigates the first transition of the DQN-Ensemble-FSO/RF execution with a varying number of samples. It can be observed from the figure that DQN-Ensemble-FSO/RF's mean error converges quickly between 0 and 40 steps; the plot considers the error over these initial steps to calculate the mean error. Figure 10 plots the error of DQN-Ensemble-FSO/RF for the last transition of a sample captured during the agent's execution; this error value is used to calculate the mean for analyzing DQN-Ensemble-FSO/RF's learning performance. Similarly, Figure 11 and Figure 12 show the first and last transitions of the DQN-FSO/RF agent, respectively. These samples represent the smoothest transitions during the agent's execution, demonstrating faster convergence.
Figure 13 and Figure 14 plot the behaviour of Actor/CriticFSO/RF for the first and last transitions, respectively. It is evident from the figures that Actor/CriticFSO/RF does not converge, in contrast to DQNEnsembleFSO/RF and DQNFSO/RF. Moreover, for the first transition, its mean error remains significantly higher than that of DQNEnsembleFSO/RF and DQNFSO/RF, resulting in slow learning. For the last transition, however, it achieves a significantly lower mean error, comparable to DQNEnsembleFSO/RF and DQNFSO/RF.
Figure 15 demonstrates the stability of DQNEnsembleFSO/RF, DQNFSO/RF, and Actor/CriticFSO/RF. Stability here denotes how often an agent deviates from the median error for the first and last transitions. As discussed earlier, the last transition of each agent is used to compute the median value, since it indicates the agent's best performance. It can be observed from the figure that, for the first transition, DQNEnsembleFSO/RF deviates from the median significantly less than DQNFSO/RF and Actor/CriticFSO/RF, which demonstrates its stable learning; its deviation count for the last transition is comparable to the other agents. Overall, DQNEnsembleFSO/RF exhibits the best stability for both the first and the last transition, attributed to its consensus-based representation learning for updating the deployed policy. The Actor/CriticFSO/RF agent demonstrates the least stability among all the agents and therefore the slowest learning.
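The stability measure behind Figure 15 can be sketched as counting how often an agent's per-step error strays from a reference median, where the reference is the median of the last transition (the agent's best performance). The tolerance `tol` and the function names are illustrative assumptions.

```python
import numpy as np

def deviation_count(errors, median_ref, tol=0.05):
    """Count steps whose error deviates from the reference median by
    more than tol; fewer deviations means a more stable agent."""
    errors = np.asarray(errors, dtype=float)
    return int(np.sum(np.abs(errors - median_ref) > tol))

def stability_report(first_errors, last_errors, tol=0.05):
    """Deviation counts for an agent's first and last transitions,
    both measured against the last transition's median error."""
    ref = float(np.median(last_errors))  # best-performance reference
    return {
        "first": deviation_count(first_errors, ref, tol),
        "last": deviation_count(last_errors, ref, tol),
    }
```

Under this measure, a stable agent such as DQNEnsembleFSO/RF would show low counts for both transitions, while an agent that fails to converge early would show a high count for the first transition.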
IX Conclusion and Future Works
To overcome the challenge of unknown weather dynamics, such as fog, dust, and sand storms, this work has applied ensemble consensus-based representation deep reinforcement learning to FSO/RF link switching in hybrid FSO/RF systems. The link switching optimization problem has been formulated to maximize the long-term utility of the FSO and RF links as a whole while maximizing link availability between transmitter and receiver. Given the non-convex and combinatorial nature of this optimization problem, we have applied actor-critic and DQN algorithms to a hybrid FSO/RF system under dynamic weather conditions; compared with actor-critic, the DQN algorithm achieves faster convergence. Further, to reduce frequent switching of the deployed DQN policy, we have proposed a consensus-based representation DQN, called DQNEnsembleFSO/RF, for FSO and RF link switching. Experimental results show that DQNEnsembleFSO/RF outperforms DQNFSO/RF, Actor/CriticFSO/RF, and MyOpic, converging faster while keeping the switching cost significantly low under time-varying weather scenarios. We believe this work is a first step towards applying DRL to the link switching problem with consideration of low switching cost for hybrid FSO/RF systems. One interesting direction is to design a deep Q-network algorithm with provable guarantees and generalization that accounts for switching cost over a larger state space than considered in this work.
Acknowledgement
This work is supported by the Letterkenny Institute of Technology, Co. Donegal, Ireland.
References
 [1] N.A. Androutsos, H.E. Nistazakis, A.N. Stassinakis, H.G. Sandalidis, G.S. Tombras, ”Performance of SIMO FSO Links over Mixture Composite Irradiance Channels,” Appl. Sci., vol. 9, 2072, 2019.
 [2] Z. Ghassemlooy, W.O. Popoola, S. Rajbhandari, ”Optical Wireless Communications: System and Channel Modelling with MATLAB,” CRC Press: Boca Raton, USA, 2018.
 [3] T.D. Katsilieris, G.P. Latsas, H.E. Nistazakis, G.S. Tombras, ”An Accurate Computational Tool for Performance Estimation of FSO Communication Links over Weak to Strong Atmospheric Turbulent Channels,” Computation, vol. 5, no. 18, 2017.
 [4] L. Cheng, S. Mao, Z. Li, Y. Han, H.Y. Fu, ”Grating Couplers on Silicon Photonics: Design Principles, Emerging Trends and Practical Issues,” Micromachines, vol. 11, p. 666, 2020.
 [5] T. Rakia, H.C. Yang, F. Gebali, M.S. Alouini, ”Power adaptation based on truncated channel inversion for hybrid FSO/RF transmission with adaptive combining,” IEEE Photon. J., vol. 7, no. 4, pp. 1–12, 2015.
 [6] H. Lei, H. Luo, K.H. Park, Z. Ren, G. Pan, M.S. Alouini, ”Secrecy outage analysis of mixed RF-FSO systems with channel imperfection,” IEEE Photon. J., vol. 10, no. 3, pp. 1–13, 2018.
 [7] W.M.R. Shakir, ”Performance evaluation of a selection combining scheme for the hybrid FSO/RF system,” IEEE Photon. J., vol. 10, no. 1, pp. 1–10, 2018.
 [8] A. Douik, H. Dahrouj, T. Y. Al-Naffouri, and M. S. Alouini, “Hybrid radio/free-space optical design for next generation backhaul systems,” IEEE Trans. Commun., vol. 64, no. 6, pp. 2563–2577, 2016.
 [9] A. Touati, A. Abdaoui, F. Touati, et al., ”On the effects of combined atmospheric fading and misalignment on the hybrid FSO/RF transmission,” IEEE J. Opt. Commun. Netw., vol. 8, no. 10, pp. 715–725, 2016.
 [10] S. Sharma, A. S. Madhukumar, R. Swaminathan, ”Switching-based cooperative decode-and-forward relaying for hybrid FSO/RF networks,” IEEE J. Opt. Commun. Netw., vol. 11, no. 6, pp. 267–281, 2019.
 [11] Y. Meng, Y. Liu, S. Song, Y. Yang, and L. Guo, ”Predictive Link Switching for Energy Efficient FSO/RF Communication System,” in Asia Communications and Photonics Conference (ACPC) 2019, OSA Technical Digest (Optical Society of America, 2019), paper T4C.1.
 [12] J. Toth, L. Ovsenik, J. Turan, L. Michaeli, M. Marton, ”Classification prediction analysis of RSSI parameter in hard switching process for FSO/RF systems,” Measurement, vol. 116, pp. 602–610, 2018, doi:10.1016/j.measurement.2017.11.044.
 [13] B. Bag, A. Das, I.S. Ansari, A. Prokes, C. Bose, A. Chandra, ”Performance Analysis of Hybrid FSO Systems Using FSO/RF-FSO Link Adaptation,” IEEE Photon. J., vol. 10, pp. 1–17, 2018.
 [14] M. Usman, H. Yang, and M. Alouini, “Practical switching-based hybrid FSO/RF transmission and its performance analysis,” IEEE Photon. J., vol. 6, no. 5, pp. 1–13, 2014.
 [15] L. Kong, W. Xu, L. Hanzo, H. Zhang, and C. Zhao, “Performance of a free-space-optical relay-assisted hybrid RF/FSO system in generalized distributed channels,” IEEE Photon. J., vol. 7, no. 5, pp. 1–19, 2015.
 [16] E. Zedini, I. S. Ansari, and M. S. Alouini, “Performance analysis of mixed Nakagami-m and gamma-gamma dual-hop FSO transmission systems,” IEEE Photon. J., vol. 7, no. 1, pp. 1–20, Feb. 2015.
 [17] J. Zhang, L. Dai, Y. Zhang, and Z. Wang, “Unified performance analysis of mixed radio frequency/free-space optical dual-hop transmission systems,” J. Lightw. Technol., vol. 33, no. 11, pp. 2286–2293, Jun. 2015.
 [18] H. Samimi and M. Uysal, “End-to-end performance of mixed RF/FSO transmission systems,” IEEE J. Opt. Commun. Netw., vol. 5, no. 11, pp. 1139–1144, 2013.
 [19] I. S. Ansari, F. Yilmaz, and M. S. Alouini, “Impact of pointing errors on the performance of mixed RF/FSO dual-hop transmission systems,” IEEE Wireless Commun. Lett., vol. 2, no. 3, pp. 351–354, 2013.
 [20] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529–533, Feb. 2015.
 [21] U. Challita, L. Dong, and W. Saad, “Proactive resource management for LTE in unlicensed spectrum: A deep learning perspective,” IEEE Trans. Wireless Commun., vol. 17, no. 7, pp. 4674–4689, Jul. 2018.
 [22] X. Li, J. Fang, W. Cheng, H. Duan, Z. Chen, and H. Li, “Intelligent power control for spectrum sharing in cognitive radios: A deep reinforcement learning approach,” IEEE Access, vol. 6, pp. 25463–25473, Apr. 2018.
 [23] Y. Wei, F. R. Yu, M. Song, and Z. Han, “User scheduling and resource allocation in HetNets with hybrid energy supply: An actor-critic reinforcement learning approach,” IEEE Trans. Wireless Commun., vol. 17, no. 1, pp. 680–692, Jan. 2018.
 [24] H. Ye, G. Y. Li, and B.-H. F. Juang, “Deep reinforcement learning for resource allocation in V2V communications,” IEEE Trans. Veh. Technol., vol. 68, no. 4, pp. 3163–3173, Apr. 2019.
 [25] L. Xiao, Y. Li, C. Dai, H. Dai, and H. V. Poor, “Reinforcement learning-based NOMA power allocation in the presence of smart jamming,” IEEE Trans. Veh. Technol., vol. 67, no. 4, pp. 3377–3389, Apr. 2018.
 [26] Y. Sun, M. Peng, and S. Mao, “Deep reinforcement learning-based mode selection and resource management for Green fog radio access networks,” IEEE Internet Things J., vol. 6, no. 2, pp. 1960–1971, Apr. 2019.
 [27] Y. He, F. R. Yu, N. Zhao, V. C. M. Leung, and H. Yin, “Software-defined networks with mobile edge computing and caching for smart cities: A big data deep reinforcement learning approach,” IEEE Commun. Mag., vol. 55, no. 12, pp. 31–37, Dec. 2017.
 [28] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, “Deep reinforcement learning for dynamic multichannel access in wireless networks,” IEEE Trans. Cogn. Commun. Netw., vol. 4, no. 2, pp. 257–265, Jun. 2018.
 [29] A.A. Minhas, M. S. Khan, S. Henna, M.S. Iqbal, ”Attenuation-based hybrid RF/FSO link using soft switching,” Opt. Eng., vol. 60, no. 5, pp. 1–22, 2021.
 [30] N. C. Luong, D. T. Hoang, S. Gong, D. Niyato, P. Wang, Y.C. Liang, and D. I. Kim, “Applications of deep reinforcement learning in communications and networking: A survey,” IEEE Communications Surveys & Tutorials, vol. 21, no. 4, pp. 3133–3174, May 2019.
 [31] K. Li, T. Zhang, and R. Wang, “Deep reinforcement learning for multiobjective optimization,” IEEE Transactions on Cybernetics, pp. 1–12, 2020.
 [32] H. van Hasselt, “Double Q-learning,” in Advances in Neural Information Processing Systems, vol. 23, pp. 2613–2621, 2010.
 [33] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt et al., “Dueling network architectures for deep reinforcement learning,” in Proceedings of the 33rd International Conference on Machine Learning, 2016.
 [34] M. Z. Chowdhury, M. K. Hasan, M. Shahjalal, et al.,”Optical wireless hybrid networks: trends, opportunities, challenges, and research directions,” IEEE Commun. Surv. Tutor., vol. 22, no. 2, pp. 930–966, 2020.
 [35] T. Rakia, H. Yang, M. Alouini, et al., ”Outage analysis of practical FSO/RF hybrid system with adaptive combining,” IEEE Commun. Lett., vol. 19, no. 8, pp. 1366–1369, 2015.
 [36] T. Rakia, H. Yang, F. Gebali, et al., ”Power adaptation based on truncated channel inversion for hybrid FSO/RF transmission with adaptive combining,” IEEE Photon. J., vol. 7, no. 4, pp. 1–12, 2015.
 [37] T. Rakia, H. Yang, F. Gebali, et al., ”Outage performance of hybrid FSO/RF system with low-complexity power adaptation,” in Proc. 2015 IEEE Globecom Workshops, San Diego, CA, pp. 1–6, 2015.
 [38] S. Sharma, A. S. Madhukumar, R. Swaminathan, ”Effect of pointing errors on the performance of hybrid FSO/RF networks,” IEEE Access, vol. 7, pp. 131418–131434, 2019.
 [39] R. Haluška, P. Šuľaj, L. Ovseník, et al., ”Prediction of Received Optical Power for Switching Hybrid FSO/RF System,” Electronics, vol. 9, no. 8, pp. 1261, 2020.
 [40] S. Subramanian, A. Trischler, Y. Bengio, C.J. Pal, ”Learning general purpose distributed sentence representations via large scale multi-task learning,” in Proceedings of the International Conference on Learning Representations (ICLR), 2018.
 [41] Z. Song, R.E. Parr, L. Carin, ”Revisiting the softmax bellman operator: Theoretical properties and practical benefits,” arXiv preprint arXiv:1812.00456, 2018.
 [42] S. H. A. Ahmad, M. Liu, T. Javidi, Q. Zhao, and B. Krishnamachari, “Optimality of myopic sensing in multichannel opportunistic access,” IEEE Transactions on Information Theory, vol. 55, no. 9, pp. 4040–4050, 2009.
 [43] T. Schaul, J. Quan, I. Antonoglou, D. Silver, “Prioritized experience replay,” in Proceedings of the 4th International Conference on Learning Representations (ICLR), 2016.