Reinforcement learning: Difference between revisions

Revision as of 04:24, 31 August 2019

Reinforcement learning (RL) is an area of machine learning concerned with how an agent should take actions in an environment so as to achieve the best possible outcome. Reinforcement learning is one of the three basic machine learning paradigms; the other two are supervised learning and unsupervised learning.

It differs from supervised learning in that labelled input/output pairs need not be presented, and sub-optimal actions need not be explicitly corrected. Instead, the focus is on balancing exploration (of uncharted territory) and exploitation (of current knowledge).^[১]
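
One simple way to picture the balance between exploration and exploitation is an epsilon-greedy rule. This technique is not described in the article itself; it is brought in here only as an illustration, and the names `epsilon_greedy`, `value_estimates` and `epsilon` are hypothetical.

```python
import random


def epsilon_greedy(value_estimates, epsilon, rng):
    """Choose an action index from a list of estimated action values.

    With probability `epsilon` the agent explores (uniformly random action);
    otherwise it exploits the action with the highest current estimate.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(value_estimates))  # explore
    # exploit: index of the largest estimate (first one wins on ties)
    return max(range(len(value_estimates)), key=lambda a: value_estimates[a])


rng = random.Random(0)
greedy_choice = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=rng)
```

With `epsilon=0.0` the rule always exploits; raising `epsilon` trades some immediate reward for more information about less-tried actions.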

The environment is typically formulated as a Markov decision process (MDP), and many reinforcement learning algorithms use dynamic programming techniques.^[২] The main difference between classical dynamic programming methods and reinforcement learning is that reinforcement learning does not assume knowledge of an exact mathematical model of the MDP, and it targets large MDPs where exact methods are infeasible.

Introduction

Reinforcement learning, due to its generality, is studied in many other disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, statistics and genetic algorithms. In the operations research and control literature, reinforcement learning is called approximate dynamic programming, or neuro-dynamic programming. The problems of interest in reinforcement learning have also been studied in the theory of optimal control, which is concerned mostly with the existence and characterization of optimal solutions, and algorithms for their exact computation, and less with learning or approximation, particularly in the absence of a mathematical model of the environment. In economics and game theory, reinforcement learning may be used to explain how equilibrium may arise under bounded rationality.

Basic reinforcement learning is modeled as a Markov decision process:

a set of environment and agent states, $S$;
a set of actions, $A$, of the agent;
$P_{a}(s,s')=\Pr(s_{t+1}=s'\mid s_{t}=s,a_{t}=a)$, the probability of transition from state $s$ to state $s'$ under action $a$;
$R_{a}(s,s')$, the immediate reward after transition from $s$ to $s'$ with action $a$;
rules that describe what the agent observes.
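
The components listed above can be sketched as a small data structure. This is a minimal illustration, not a reference implementation: the class name `MDP`, the dictionary layout of `P` and `R`, and the two-state `toy` example are all assumptions made for this sketch.

```python
import random
from dataclasses import dataclass


@dataclass
class MDP:
    states: list   # S
    actions: list  # A
    P: dict        # P[(s, a)] -> {s_next: probability}, i.e. P_a(s, s')
    R: dict        # R[(s, a, s_next)] -> immediate reward, i.e. R_a(s, s')

    def step(self, s, a, rng=random):
        """Sample s' from P_a(s, .) and return (s', R_a(s, s'))."""
        dist = self.P[(s, a)]
        next_states = list(dist)
        weights = [dist[n] for n in next_states]
        s_next = rng.choices(next_states, weights=weights, k=1)[0]
        # Transitions without an explicit entry are given reward 0.0.
        return s_next, self.R.get((s, a, s_next), 0.0)


# Hypothetical two-state example, invented for illustration only.
toy = MDP(
    states=["low", "high"],
    actions=["wait", "work"],
    P={
        ("low", "wait"): {"low": 1.0},
        ("low", "work"): {"low": 0.3, "high": 0.7},
        ("high", "wait"): {"high": 0.9, "low": 0.1},
        ("high", "work"): {"high": 1.0},
    },
    R={
        ("low", "work", "low"): -1.0,
        ("low", "work", "high"): 1.0,
        ("high", "work", "high"): 2.0,
    },
)

next_state, reward = toy.step("low", "work")
```

Each row of `P` is a probability distribution over next states, which is what makes the transition rule stochastic rather than deterministic.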

Rules are often stochastic. The observation typically involves the scalar, immediate reward associated with the last transition. In many works, the agent is assumed to observe the current environmental state (full observability). If not, the agent has partial observability. Sometimes the set of actions available to the agent is restricted. For example, a balance cannot be reduced below zero: if the agent's current value is 3 and a state transition would reduce it by 4, that transition is not allowed.

A reinforcement learning agent interacts with its environment in discrete time steps. At each time $t$, the agent receives an observation $o_{t}$, which typically includes the reward $r_{t}$. It then chooses an action $a_{t}$ from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state $s_{t+1}$ and the reward $r_{t+1}$ associated with the transition $(s_{t},a_{t},s_{t+1})$ is determined. The goal of a reinforcement learning agent is to collect as much reward as possible. The agent can (possibly randomly) choose any action as a function of the history.
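
The interaction described in this paragraph can be sketched as a loop. Everything named here is hypothetical and chosen for illustration: the `CoinEnv` environment, its `reset`/`step` interface, and the `run_episode` helper are assumptions, not part of the article or of any particular library.

```python
import random


class CoinEnv:
    """Hypothetical two-state environment with full observability."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state  # the observation is the state itself

    def step(self, action):
        # Action 1 tries to flip the state and succeeds with probability 0.8;
        # action 0 leaves the state unchanged.
        if action == 1 and self.rng.random() < 0.8:
            self.state = 1 - self.state
        reward = 1.0 if self.state == 1 else 0.0  # reward for the transition
        return self.state, reward


def run_episode(env, policy, horizon=10):
    """Run the observe -> act -> transition loop for `horizon` steps."""
    history = []
    obs, reward, total = env.reset(), 0.0, 0.0
    for _ in range(horizon):
        history.append((obs, reward))    # agent receives o_t and r_t
        action = policy(history)         # a_t is a function of the history
        obs, reward = env.step(action)   # environment yields s_{t+1}, r_{t+1}
        total += reward
    return total, history


# A policy that tries to reach state 1 and then stays there.
total, history = run_episode(CoinEnv(seed=1), lambda h: 1 if h[-1][0] == 0 else 0)
```

The policy receives the whole history rather than only the latest observation, matching the statement that the agent may choose any action as a function of the history.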

When the agent's performance is compared to that of an agent that acts optimally, the difference in performance gives rise to the notion of regret. In order to act near optimally, the agent must reason about the long-term consequences of its actions (i.e., maximize future income), even though the immediate reward associated with an action might be negative.
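
A small numeric sketch may make the idea of regret concrete. The two reward sequences below are invented for illustration, and the far-sighted sequence is simply assumed to be the optimal behaviour among the two; the names `myopic`, `farsighted` and `total_reward` are hypothetical.

```python
def total_reward(rewards):
    """Sum of rewards collected over an episode."""
    return sum(rewards)


# Hypothetical reward sequences over five time steps.
myopic = [1.0, 0.0, 0.0, 0.0, 0.0]       # takes the immediate reward
farsighted = [-1.0, 2.0, 2.0, 2.0, 2.0]  # accepts a negative reward first

# Assumption for this sketch: the better of the two sequences is optimal.
optimal = max(total_reward(myopic), total_reward(farsighted))

regret_myopic = optimal - total_reward(myopic)
regret_farsighted = optimal - total_reward(farsighted)
```

Here the agent that accepts the initial negative reward ends with the larger total, so its regret relative to the assumed optimum is zero, while the myopic agent's regret is positive.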

Thus, reinforcement learning is particularly well-suited to problems that include a long-term versus short-term reward trade-off. It has been applied successfully to various problems, including robot control, elevator scheduling, telecommunications, backgammon, checkers^[৩] and Go (AlphaGo).

Two elements make reinforcement learning powerful: the use of samples to optimize performance and the use of function approximation to deal with large environments. Thanks to these two key components, reinforcement learning can be used in large environments in the following situations:

A model of the environment is known, but an analytic solution is not available;
Only a simulation model of the environment is given (the subject of simulation-based optimization);^[৪]
The only way to collect information about the environment is to interact with it.

The first two of these problems could be considered planning problems (since some form of model is available), while the last one could be considered to be a genuine learning problem. However, reinforcement learning converts both planning problems to machine learning problems.

↑ Kaelbling, Leslie P.; Littman, Michael L.; Moore, Andrew W. (1996). "Reinforcement Learning: A Survey". Journal of Artificial Intelligence Research. 4: 237–285. arXiv:cs/9605103. doi:10.1613/jair.301. Archived from the original on 2001-11-20.
↑ van Otterlo, M.; Wiering, M. (2012). Reinforcement learning and markov decision processes. Reinforcement Learning. Adaptation, Learning, and Optimization. 12. pp. 3–42. ISBN 978-3-642-27644-6. doi:10.1007/978-3-642-27645-3_1.
↑ Sutton & Barto, Chapter 11.
↑ Gosavi, Abhijit (2003). Simulation-based Optimization: Parametric Optimization Techniques and Reinforcement. Operations Research/Computer Science Interfaces Series. Springer. ISBN 978-1-4020-7454-7.

@@ Line 1: / Line 1: @@
+{{কাজ চলছে/২০১৯}}
-{{For|reinforcement learning in psychology|Reinforcement|Operant conditioning}}
-{{Machine learning bar}}
+'''Reinforcement learning''' ('''RL''') is an area of machine learning concerned with how an agent should take actions in an environment so as to achieve the best possible outcome. Reinforcement learning is one of the three basic machine learning paradigms; the other two are supervised learning and unsupervised learning.
-'''Reinforcement learning''' ('''RL''') is an area of [[machine learning]] concerned with how [[software agent]]s ought to take [[Action selection|actions]] in an environment so as to maximize some notion of cumulative reward. Reinforcement learning is one of three basic machine learning paradigms, alongside [[supervised learning]] and [[unsupervised learning]].
-It differs from supervised learning in that labelled input/output pairs need not be presented, and sub-optimal actions need not be explicitly corrected. Instead the focus is finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).<ref name="kaelbling">{{cite journal|last1=Kaelbling|first1=Leslie P.|last2=Littman|first2=Michael L.|authorlink2=Michael L. Littman|last3=Moore|first3=Andrew W.|authorlink3=Andrew W. Moore|year=1996|title=Reinforcement Learning: A Survey|url=http://www.cs.washington.edu/research/jair/abstracts/kaelbling96a.html|deadurl=yes|journal=Journal of Artificial Intelligence Research|volume=4|pages=237–285|doi=10.1613/jair.301|archiveurl=http://webarchive.loc.gov/all/20011120234539/http://www.cs.washington.edu/research/jair/abstracts/kaelbling96a.html|archivedate=2001-11-20|ref=harv|authorlink1=Leslie P. Kaelbling|df=|arxiv=cs/9605103}}</ref>
+It differs from supervised learning in that labelled input/output pairs need not be presented, and sub-optimal actions need not be explicitly corrected. Instead, the focus is on balancing exploration (of uncharted territory) and exploitation (of current knowledge).<ref name="kaelbling">{{cite journal|last1=Kaelbling|first1=Leslie P.|last2=Littman|first2=Michael L.|authorlink2=Michael L. Littman|last3=Moore|first3=Andrew W.|authorlink3=Andrew W. Moore|year=1996|title=Reinforcement Learning: A Survey|url=http://www.cs.washington.edu/research/jair/abstracts/kaelbling96a.html|deadurl=yes|journal=Journal of Artificial Intelligence Research|volume=4|pages=237–285|doi=10.1613/jair.301|archiveurl=http://webarchive.loc.gov/all/20011120234539/http://www.cs.washington.edu/research/jair/abstracts/kaelbling96a.html|archivedate=2001-11-20|ref=harv|authorlink1=Leslie P. Kaelbling|df=|arxiv=cs/9605103}}</ref>
-The environment is typically formulated as a [[Markov decision process]] (MDP), as many reinforcement learning algorithms for this context utilize [[dynamic programming]] techniques.<ref>{{Cite book|title=Reinforcement learning and markov decision processes|author1=van Otterlo, M.|author2=Wiering, M.|journal=Reinforcement Learning |volume=12|pages=3–42 |year=2012 |doi=10.1007/978-3-642-27645-3_1|series=Adaptation, Learning, and Optimization|isbn=978-3-642-27644-6}}</ref> The main difference between the classical dynamic programming methods  and reinforcement learning algorithms is that the latter do not assume knowledge of an exact mathematical model of the MDP and they target large MDPs where exact methods become infeasible.{{toclimit|3}}
+The environment is typically formulated as a Markov decision process (MDP), and many reinforcement learning algorithms use dynamic programming techniques.<ref>{{Cite book|title=Reinforcement learning and markov decision processes|author1=van Otterlo, M.|author2=Wiering, M.|journal=Reinforcement Learning |volume=12|pages=3–42 |year=2012 |doi=10.1007/978-3-642-27645-3_1|series=Adaptation, Learning, and Optimization|isbn=978-3-642-27644-6}}</ref> The main difference between classical dynamic programming methods and
+reinforcement learning is that reinforcement learning does not assume knowledge of an exact mathematical model of the MDP, and it targets large MDPs where exact methods
+are infeasible.{{toclimit|3}}
+==Introduction==
+[[File:Reinforcement learning diagram.svg|thumb|right|250px| A typical view of reinforcement learning: an agent takes an action based on its environment, and the outcome later returns to it as a reward.]]
+Reinforcement learning, due to its generality, is studied in many other disciplines, such as [[game theory]], [[control theory]], [[operations research]], [[information theory]], [[simulation-based optimization]], [[multi-agent system]]s, [[swarm intelligence]], [[statistics]] and [[genetic algorithm]]s. In the operations research and control literature, reinforcement learning is called ''approximate dynamic programming,'' or ''neuro-dynamic programming.'' The problems of interest in reinforcement learning have also been studied in the [[optimal control theory|theory of optimal control]], which is concerned mostly with the existence and characterization of optimal solutions, and algorithms for their exact computation, and less with learning or approximation, particularly in the absence of a mathematical model of the environment. In [[economics]] and [[game theory]], reinforcement learning may be used to explain how equilibrium may arise under [[bounded rationality]].
+Basic reinforcement learning is modeled as a [[Markov decision process]]:
+* a set of environment and agent states, {{mvar|S}};
+* a set of actions, {{mvar|A}}, of the agent;
+* <math>P_a(s,s')=\Pr(s_{t+1}=s'\mid s_t=s, a_t=a)</math> is the probability of transition from state <math>s</math> to state <math>s'</math> under action <math>a</math>.
+* <math>R_a(s,s')</math> is the immediate reward after transition from <math>s</math> to <math>s'</math> with action <math>a</math>.
+* rules that describe what the agent observes
+Rules are often [[stochastic]]. The observation typically involves the scalar, immediate reward associated with the last transition. In many works, the agent is assumed to observe the current environmental state (''full observability''). If not, the agent has ''partial observability''. Sometimes the set of actions available to the agent is restricted. For example, a balance cannot be reduced below zero: if the agent's current value is 3 and a state transition would reduce it by 4, that transition is not allowed.
+A reinforcement learning agent interacts with its environment in discrete time steps. At each time {{mvar|t}}, the agent receives an observation <math>o_t</math>, which typically includes the reward <math>r_t</math>. It then chooses an action <math>a_t</math> from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state <math>s_{t+1}</math> and the reward <math>r_{t+1}</math> associated with the ''transition'' <math>(s_t,a_t,s_{t+1})</math> is determined. The goal of a reinforcement learning agent is to collect as much reward as possible. The [[Software agent|agent]] can (possibly randomly) choose any action as a function of the history.
+When the agent's performance is compared to that of an agent that acts optimally, the difference in performance gives rise to the notion of ''[[regret (game theory)|regret]]''. In order to act near optimally, the agent must reason about the long-term consequences of its actions (i.e., maximize future income), even though the immediate reward associated with an action might be negative.
+Thus, reinforcement learning is particularly well-suited to problems that include a long-term versus short-term reward trade-off. It has been applied successfully to various problems, including [[robot control]], elevator scheduling, [[telecommunications]], [[backgammon]], [[checkers]]{{Sfn|Sutton|Barto|p=|loc=Chapter 11}} and [[go (game)|Go]] ([[AlphaGo]]).
+Two elements make reinforcement learning powerful: the use of samples to optimize performance and the use of function approximation to deal with large environments. Thanks to these two key components, reinforcement learning can be used in large environments in the following situations:
+* A model of the environment is known, but an [[Closed-form expression|analytic solution]] is not available;
+* Only a simulation model of the environment is given (the subject of [[simulation-based optimization]]);<ref>{{cite book|url = https://www.springer.com/mathematics/applications/book/978-1-4020-7454-7|title = Simulation-based Optimization: Parametric Optimization Techniques and Reinforcement|last = Gosavi|first = Abhijit|publisher = Springer|year = 2003|isbn = 978-1-4020-7454-7|pages =|ref = harv|authorlink = Abhijit Gosavi|series = Operations Research/Computer Science Interfaces Series}}</ref>
+* The only way to collect information about the environment is to interact with it.
+The first two of these problems could be considered planning problems (since some form of model is available), while the last one could be considered to be a genuine learning problem. However, reinforcement learning converts both planning problems to [[machine learning]] problems.