Ch. 2 — Multi-Armed Bandits

%23%20%2F%2F%2F%20script%0A%23%20requires-python%20%3D%20%22%3E%3D3.13%22%0A%23%20dependencies%20%3D%20%5B%0A%23%20%20%20%20%20%22marimo%3E%3D0.23.6%22%2C%0A%23%20%20%20%20%20%22numpy%3D%3D2.4.4%22%2C%0A%23%20%20%20%20%20%22pandas%3D%3D3.0.2%22%2C%0A%23%20%20%20%20%20%22plotly%3D%3D6.7.0%22%2C%0A%23%20%20%20%20%20%22torch%3D%3D2.11.0%22%2C%0A%23%20%5D%0A%23%20%2F%2F%2F%0A%0Aimport%20marimo%0A%0A__generated_with%20%3D%20%220.23.6%22%0Aapp%20%3D%20marimo.App(%0A%20%20%20%20width%3D%22medium%22%2C%0A%20%20%20%20app_title%3D%22Ch.%202%20%E2%80%94%20Multi-Armed%20Bandits%22%2C%0A%20%20%20%20css_file%3D%22%2Fusr%2Flocal%2F_marimo%2Fcustom.css%22%2C%0A%20%20%20%20auto_download%3D%5B%22html%22%5D%2C%0A)%0A%0A%0A%40app.cell%0Adef%20_()%3A%0A%20%20%20%20import%20marimo%20as%20mo%0A%0A%20%20%20%20return%20(mo%2C)%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20%23%23%20**Multi-Armed%20Bandits**%0A%20%20%20%20%23%23%23%20**%24k%24-armed%20Bandit%20problem**%0A%0A%20%20%20%20We%20have%20to%20choose%20an%20action%20repeatedly%20from%20%24k%24%20different%20actions.%20After%20each%20action%2C%20we%20are%20receive%20a%20numerical%20reward%20chosen%20from%20a%20stationary%20probability%20distribution%20that%20depends%20on%20the%20action%20selected.%20Our%20objective%20is%20to%20_maximize%20the%20expected%20total%20reward_%20over%20some%20time%20period%2C%20or%20time%20steps.%0A%0A%20%20%20%20Each%20of%20the%20%24k%24%20actions%20has%20an%20expected%20or%20mean%20reward%20given%20that%20that%20action%20is%20selected%3B%20let%20us%20call%20this%20the%20value%20of%20that%20action.%20We%20denote%20the%20action%20selected%20on%20time%20step%20%24t%24%20as%20%24A_t%24%2C%20and%20the%20corresponding%20reward%20as%20%24R_t%24.%20The%20value%20then%20of%20an%20arbitrary%20action%20%24a%24%2C%20denoted%20%24q_*(a)%24%2C%20is%20the%20expected%20reward%20given%20that%20%24a%24%20is%20selected%3A%0A%0A%20%20%20%20%24%24q_*(a)%20%5Cdoteq%20%5Cmathbb%7BE%7D%5BR_t%20%5Cmid%20A_t%20%3D%20a%5D%24%24%0A%0A%20%20%20%20Since%20we%20do%20not%20know%20the%20true%20value%20%24q_*(a)%24%20for%20each%20action%2C%20we%20would%20like%20to%20estimate%20this%20true%20value.%0A%0A%20%20%20%20We%20denote%20the%20estimated%20value%20of%20action%20%24a%24%20at%20time%20step%20%24t%24%20as%20%24Q_t(a)%24.%20Our%20goal%20then%20is%20to%20make%20%24Q_t(a)%24%20to%20be%20close%20to%20%24q_*(a)%24.%0A%0A%20%20%20%20%23%23%23%20**Action-value%20Methods**%0A%20%20%20%20The%20true%20value%20of%20an%20action%20is%20the%20mean%20reward%20when%20the%20action%20is%20selected.%20We%20can%20estimate%20this%20value%20by%20averaging%20the%20rewards%20actually%20received%3A%0A%0A%20%20%20%20%24%24Q_t(a)%20%5Cdoteq%20%5Cfrac%7B%5Csum_%7Bi%3D1%7D%5E%7Bt-1%7D%20R_i%20%5Cmathbf%7B1%7D_%7BA_i%20%3D%20a%7D%7D%7B%5Csum_%7Bi%3D1%7D%5E%7Bt-1%7D%20%5Cmathbf%7B1%7D_%7BA_i%20%3D%20a%7D%7D%24%24%0A%0A%20%20%20%20Greedy%20action%20selection%20method%3A%0A%20%20%20%20Selecting%20the%20action%20with%20the%20highest%20estimated%20value%3A%0A%20%20%20%20%24%24A_t%20%5Cdoteq%20%5Cargmax_a%7BQ_t(a)%7D%24%24%0A%0A%20%20%20%20**%24%5Cvarepsilon%24-greedy%20methods%3A**%20with%20small%20probability%20%24%5Cvarepsilon%24%2C%20select%20randomly%20from%20among%20all%20the%20actions%20with%20equal%20probability%2C%20independently%20of%20the%20action-value%20estimates.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20%23%23%23%23Exercise%202.1%0A%20%20%20%20In%20%24%5Cvarepsilon%24-greedy%20action%20selection%2C%20for%20the%20case%20of%20two%20actions%20and%20%24%5Cvarepsilon%24%20%3D%200.5%2C%20what%20is%0A%20%20%20%20the%20probability%20that%20the%20greedy%20action%20is%20selected%3F%0A%0A%20%20%20%20Sol%3A%20P(greedy%20action%20chosen%20%7C%20not%20%24%5Cvarepsilon%24-greedy)%20%2B%20P(greedy%20action%20chosen%20%7C%20%24%5Cvarepsilon%24-greedy)%0A%20%20%20%20%3D%200.5x1%20%2B%200.5x0.5%20%3D%200.75%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20%23%23%23%20**The%2010-armed%20Testbed**%0A%20%20%20%20This%20is%20a%20set%20of%202000%20randomly%20generated%20%24k%24-armed%20bandit%20problems%2C%20with%20%24k%3D10%24.%20For%20each%20bandit%20problem%2C%20the%20action%20values%2C%20%24q_*(a)%2C%20a%20%3D%201%2C%20.%20.%20.%20%2C%2010%24%2C%20were%20selected%20according%20to%20a%20normal%20(Gaussian)%20distribution%20with%20mean%200%20and%20variance%201.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_()%3A%0A%20%20%20%20import%20plotly.express%20as%20px%0A%20%20%20%20import%20pandas%20as%20pd%0A%20%20%20%20import%20numpy%20as%20np%0A%20%20%20%20import%20torch%0A%0A%20%20%20%20return%20pd%2C%20px%2C%20torch%0A%0A%0A%40app.cell%0Adef%20_(torch)%3A%0A%20%20%20%20class%20Bandit%3A%0A%20%20%20%20%20%20%20%20def%20__init__(self%2C%20k%3D10)%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20self.k%20%3D%20k%0A%20%20%20%20%20%20%20%20%20%20%20%20self.qs%20%3D%20torch.randn(k)%0A%0A%20%20%20%20%20%20%20%20def%20act(self%2C%20k)%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20r%20%3D%20self.qs%5Bk%5D%20%2B%20torch.randn(1)%0A%20%20%20%20%20%20%20%20%20%20%20%20return%20r.item()%0A%0A%20%20%20%20return%20(Bandit%2C)%0A%0A%0A%40app.cell%0Adef%20_(Bandit%2C%20torch)%3A%0A%20%20%20%20n%20%3D%202000%0A%20%20%20%20k%20%3D%2010%0A%20%20%20%20R%20%3D%20torch.zeros((n%2Ck))%0A%20%20%20%20b%20%3D%20Bandit(k)%0A%20%20%20%20for%20i%20in%20range(n)%3A%0A%20%20%20%20%20%20%20%20for%20j%20in%20range(k)%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20R%5Bi%2Cj%5D%20%3D%20b.act(j)%0A%20%20%20%20return%20R%2C%20b%2C%20k%0A%0A%0A%40app.cell%0Adef%20_(R%2C%20b%2C%20k%2C%20mo%2C%20pd%2C%20px)%3A%0A%20%20%20%20%23%20Convert%20to%20long-form%20DataFrame%20for%20plotly%0A%20%20%20%20df%20%3D%20pd.DataFrame(R.numpy()%2C%20columns%3D%5Bf%22Action%20%7Bj%2B1%7D%22%20for%20j%20in%20range(k)%5D)%0A%20%20%20%20df_long%20%3D%20df.melt(var_name%3D%22Action%22%2C%20value_name%3D%22Reward%22)%0A%0A%20%20%20%20_fig%20%3D%20px.violin(%0A%20%20%20%20%20%20%20%20df_long%2C%0A%20%20%20%20%20%20%20%20x%3D%22Action%22%2C%0A%20%20%20%20%20%20%20%20y%3D%22Reward%22%2C%0A%20%20%20%20%20%20%20%20box%3DTrue%2C%0A%20%20%20%20%20%20%20%20points%3DFalse%2C%0A%20%20%20%20%20%20%20%20title%3D%22Reward%20Distributions%20for%20Each%20Action%20(k-Armed%20Bandit)%22%2C%0A%20%20%20%20%20%20%20%20color%3D%22Action%22%2C%0A%20%20%20%20)%0A%0A%20%20%20%20%23%20Overlay%20the%20true%20q*(a)%20values%20as%20scatter%20points%0A%20%20%20%20_fig.add_scatter(%0A%20%20%20%20%20%20%20%20x%3D%5Bf%22Action%20%7Bj%2B1%7D%22%20for%20j%20in%20range(k)%5D%2C%0A%20%20%20%20%20%20%20%20y%3Db.qs.numpy()%2C%0A%20%20%20%20%20%20%20%20mode%3D%22markers%22%2C%0A%20%20%20%20%20%20%20%20marker%3Ddict(symbol%3D%22star%22%2C%20size%3D10%2C%20color%3D%22black%22)%2C%0A%20%20%20%20%20%20%20%20name%3D%22q*(a)%20true%20value%22%2C%0A%20%20%20%20)%0A%0A%20%20%20%20plot%20%3D%20mo.ui.plotly(_fig)%0A%20%20%20%20plot%0A%20%20%20%20return%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20%23%23%23%20**Simple%20Algorithm**%0A%20%20%20%20We%20compare%20a%20greedy%20method%20with%20two%20%24%5Cvarepsilon%24-greedy%20methods%20(%24%5Cvarepsilon%24%20%3D%200.01%20and%20%24%5Cvarepsilon%24%20%3D%200.1)%20on%20the%2010-armed%20testbed.%20All%20the%20methods%20formed%20their%20action-value%20estimates%20using%20the%20sample-average%20technique.%20The%20upper%20graph%20shows%20the%20increase%20in%20expected%20reward%20with%20experience.%20These%20data%20are%20averages%20over%202000%20runs%20with%20different%20bandit%20problems%20across%201000%20time-steps.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(Bandit%2C%20torch)%3A%0A%20%20%20%20def%20simple_algorithm(b%3A%20Bandit%2C%20eps%3A%20float%2C%20n%3A%20int%20%3D%201000)%3A%0A%20%20%20%20%20%20%20%20N%20%3D%20torch.zeros(b.k)%0A%20%20%20%20%20%20%20%20S%20%3D%20torch.zeros(b.k)%0A%20%20%20%20%20%20%20%20Q%20%3D%20torch.zeros(b.k)%0A%20%20%20%20%20%20%20%20R%20%3D%20torch.zeros(n)%20%20%20%20%20%20%20%23%20raw%20per-step%20reward%2C%20not%20cumulative%20avg%0A%20%20%20%20%20%20%20%20q_a%20%3D%20torch.argmax(b.qs%2C%20dim%3D0).item()%0A%20%20%20%20%20%20%20%20Qopt%20%3D%20torch.zeros(n)%0A%20%20%20%20%20%20%20%20for%20t%20in%20range(n)%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20if%20torch.rand(1).item()%20%3C%20eps%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20A_t%20%3D%20torch.randint(0%2C%20b.k%2C%20(1%2C)).item()%0A%20%20%20%20%20%20%20%20%20%20%20%20else%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20A_t%20%3D%20torch.argmax(Q%2C%20dim%3D0).item()%0A%20%20%20%20%20%20%20%20%20%20%20%20R_t%20%3D%20b.act(A_t)%0A%20%20%20%20%20%20%20%20%20%20%20%20N%5BA_t%5D%20%2B%3D%201%0A%20%20%20%20%20%20%20%20%20%20%20%20S%5BA_t%5D%20%2B%3D%20R_t%0A%20%20%20%20%20%20%20%20%20%20%20%20Q%5BA_t%5D%20%3D%20S%5BA_t%5D%20%2F%20N%5BA_t%5D%0A%20%20%20%20%20%20%20%20%20%20%20%20R%5Bt%5D%20%3D%20R_t%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20raw%20reward%0A%20%20%20%20%20%20%20%20%20%20%20%20Qopt%5Bt%5D%20%3D%201.0%20if%20A_t%20%3D%3D%20q_a%20else%200.0%20%20%20%23%20binary%2C%20not%20cumulative%0A%20%20%20%20%20%20%20%20return%20Q%2C%20R%2C%20Qopt%0A%0A%20%20%20%20return%20(simple_algorithm%2C)%0A%0A%0A%40app.cell%0Adef%20_(Bandit%2C%20k%2C%20mo%2C%20pd%2C%20px%2C%20simple_algorithm%2C%20torch)%3A%0A%20%20%20%20n_runs%20%3D%202000%0A%20%20%20%20n_steps%20%3D%201000%0A%20%20%20%20epsilons%20%3D%20%5B0.0%2C%200.01%2C%200.1%5D%0A%0A%20%20%20%20avg_ravg%20%3D%20%7B%7D%0A%20%20%20%20avg_qopt%20%3D%20%7B%7D%0A%20%20%20%20for%20eps%20in%20epsilons%3A%0A%20%20%20%20%20%20%20%20run_ravg%20%3D%20torch.zeros((n_runs%2C%20n_steps))%0A%20%20%20%20%20%20%20%20run_qopt%20%3D%20torch.zeros((n_runs%2C%20n_steps))%0A%20%20%20%20%20%20%20%20for%20run_idx%20in%20range(n_runs)%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20bandit%20%3D%20Bandit(k)%0A%20%20%20%20%20%20%20%20%20%20%20%20_%2C%20Ravg%2C%20Qopt%20%3D%20simple_algorithm(bandit%2C%20eps%2C%20n%3Dn_steps)%0A%20%20%20%20%20%20%20%20%20%20%20%20run_ravg%5Brun_idx%5D%20%3D%20Ravg%0A%20%20%20%20%20%20%20%20%20%20%20%20run_qopt%5Brun_idx%5D%20%3D%20Qopt%0A%20%20%20%20%20%20%20%20avg_ravg%5Beps%5D%20%3D%20run_ravg.mean(dim%3D0)%0A%20%20%20%20%20%20%20%20avg_qopt%5Beps%5D%20%3D%20run_qopt.mean(dim%3D0)%0A%0A%20%20%20%20%23%20Ravg%20plot%0A%20%20%20%20df_ravg%20%3D%20pd.DataFrame(%7Bf%22%CE%B5%3D%7Beps%7D%22%3A%20v.numpy()%20for%20eps%2C%20v%20in%20avg_ravg.items()%7D)%0A%20%20%20%20df_ravg%5B%22Step%22%5D%20%3D%20range(n_steps)%0A%20%20%20%20df_ravg_melted%20%3D%20df_ravg.melt(id_vars%3D%22Step%22%2C%20var_name%3D%22Policy%22%2C%20value_name%3D%22Ravg%22)%0A%0A%20%20%20%20_fig_ravg%20%3D%20px.line(%0A%20%20%20%20%20%20%20%20df_ravg_melted%2C%20x%3D%22Step%22%2C%20y%3D%22Ravg%22%2C%20color%3D%22Policy%22%2C%0A%20%20%20%20%20%20%20%20title%3D%22Average%20Reward%20over%20Time%20(%CE%B5-Greedy%2C%202000%20runs)%22%2C%0A%20%20%20%20)%0A%0A%20%20%20%20%23%20Qopt%20plot%0A%20%20%20%20df_qopt%20%3D%20pd.DataFrame(%7Bf%22%CE%B5%3D%7Beps%7D%22%3A%20v.numpy()%20for%20eps%2C%20v%20in%20avg_qopt.items()%7D)%0A%20%20%20%20df_qopt%5B%22Step%22%5D%20%3D%20range(n_steps)%0A%20%20%20%20df_qopt_melted%20%3D%20df_qopt.melt(id_vars%3D%22Step%22%2C%20var_name%3D%22Policy%22%2C%20value_name%3D%22Optimal%20Action%20%25%22)%0A%0A%20%20%20%20_fig_qopt%20%3D%20px.line(%0A%20%20%20%20%20%20%20%20df_qopt_melted%2C%20x%3D%22Step%22%2C%20y%3D%22Optimal%20Action%20%25%22%2C%20color%3D%22Policy%22%2C%0A%20%20%20%20%20%20%20%20title%3D%22Optimal%20Action%20%25%20over%20Time%20(%CE%B5-Greedy%2C%202000%20runs)%22%2C%0A%20%20%20%20)%0A%20%20%20%20_fig_qopt.update_yaxes(tickformat%3D%22.0%25%22)%0A%0A%20%20%20%20mo.vstack(%5Bmo.ui.plotly(_fig_ravg)%2C%20mo.ui.plotly(_fig_qopt)%5D)%0A%20%20%20%20return%0A%0A%0Aif%20__name__%20%3D%3D%20%22__main__%22%3A%0A%20%20%20%20app.run()%0A