Bayesian Methods for Hackers
1. 베이지안 추론의 철학Permalink
- 서론
- 사전확률 : 사건 A에 대한 우리의 믿음의 양
P(A)P(A) - 사후확률 : 증거 X가 주어진 상황에서 A의 확률
P(A|X)P(A|X)
- 사전확률 : 사건 A에 대한 우리의 믿음의 양
- 베이지안 프레임워크
- 베이즈 정리
P(A|X)=p(A)p(X|A)p(X)P(A|X)=p(A)p(X|A)p(X)
- 베이즈 정리
- 확률분포
-
확률변수 Z가 이산적인 경우, 확률질량함수
P(Z=k)=λke−λk!P(Z=k)=λke−λk!, k=0,1,2,…k=0,1,2,…
ZZ ~ Poi(λ)Poi(λ)
E[Z|λ]=λE[Z|λ]=λ -
확률변수 Z가 연속적인 경우, 확률밀도함수
fZ(z|λ)=λe−λz,z≥0fZ(z|λ)=λe−λz,z≥0
ZZ ~ Exp(λ)Exp(λ)
E[Z|λ]=1λE[Z|λ]=1λ
-
- 컴퓨터를 사용하여 베이지안 추론하기
- PyMC를 사용한 베이지안
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersfigsize(12.5, 3.5) count_data = np.loadtxt("data/txtdata.csv") n_count_data = len(count_data) plt.bar(np.arange(n_count_data), count_data, color="#348ABD") plt.xlabel("Time (days)") plt.ylabel("count of text-msgs received") plt.title("Did the user's texting habits change over time?") plt.xlim(0, n_count_data); import pymc3 as pm with pm.Model() as model: alpha = 1.0/count_data.mean() # Recall count_data is the # variable that holds our txt counts lambda_1 = pm.Exponential("lambda_1", alpha) lambda_2 = pm.Exponential("lambda_2", alpha) tau = pm.DiscreteUniform("tau", lower=0, upper=n_count_data - 1) with model: idx = np.arange(n_count_data) # Index lambda_ = pm.math.switch(tau > idx, lambda_1, lambda_2) with model: observation = pm.Poisson("obs", lambda_, observed=count_data) with model: step = pm.Metropolis() trace = pm.sample(10000, tune=5000, step=step, return_inferencedata=False) lambda_1_samples = trace['lambda_1'] lambda_2_samples = trace['lambda_2'] tau_samples = trace['tau'] figsize(12.5, 10) #histogram of the samples: ax = plt.subplot(311) ax.set_autoscaley_on(False) plt.hist(lambda_1_samples, histtype='stepfilled', bins=30, alpha=0.85, label="posterior of λ1λ1", color="#A60628", density=True) plt.legend(loc="upper left") plt.title(r"""Posterior distributions of the variables λ1,λ2,τλ1,λ2,τ""") plt.xlim([15, 30]) plt.xlabel("λ1λ1 value") ax = plt.subplot(312) ax.set_autoscaley_on(False) plt.hist(lambda_2_samples, histtype='stepfilled', bins=30, alpha=0.85, label="posterior of λ2λ2", color="#7A68A6", density=True) plt.legend(loc="upper left") plt.xlim([15, 30]) plt.xlabel("λ2λ2 value") plt.subplot(313) w = 1.0 / tau_samples.shape[0] * np.ones_like(tau_samples) plt.hist(tau_samples, bins=n_count_data, alpha=1, label=r"posterior of ττ", color="#467821", weights=w, rwidth=2.) plt.xticks(np.arange(n_count_data)) plt.legend(loc="upper left") plt.ylim([0, .75]) plt.xlim([35, len(count_data)-20]) plt.xlabel(r"ττ (in days)") plt.ylabel("probability"); figsize(12.5, 5) # tau_samples, lambda_1_samples, lambda_2_samples contain # N samples from the corresponding posterior distribution N = tau_samples.shape[0] expected_texts_per_day = np.zeros(n_count_data) for day in range(0, n_count_data): # ix is a bool index of all tau samples corresponding to # the switchpoint occurring prior to value of 'day' ix = day < tau_samples # Each posterior sample corresponds to a value for tau. # for each day, that value of tau indicates whether we're "before" # (in the lambda1 "regime") or # "after" (in the lambda2 "regime") the switchpoint. # by taking the posterior sample of lambda1/2 accordingly, we can average # over all samples to get an expected value for lambda on that day. # As explained, the "message count" random variable is Poisson distributed, # and therefore lambda (the poisson parameter) is the expected value of # "message count". expected_texts_per_day[day] = (lambda_1_samples[ix].sum() + lambda_2_samples[~ix].sum()) / N plt.plot(range(n_count_data), expected_texts_per_day, lw=4, color="#E24A33", label="expected number of text-messages received") plt.xlim(0, n_count_data) plt.xlabel("Day") plt.ylabel("Expected # text-messages") plt.title("Expected number of text-messages received") plt.ylim(0, 60) plt.bar(np.arange(len(count_data)), count_data, color="#348ABD", alpha=0.65, label="observed texts per day") plt.legend(loc="upper left");
- PyMC를 사용한 베이지안
- 결론
- 부록
- 연습문제
- 참고자료
2. PyMC 더 알아보기Permalink
- 서론
- Stochastic 변수 : 값이 정해지지 않은 변수
- Deterministic 변수 : 변수의 부모를 모두 알고있는 경우에 랜덤하지 않은 변수
- @pymc.deterministic 파이썬 래퍼 (데코레이터)를 써서 구분
- 모델링 방법
- 관측된 빈도와 실제 빈도간에는 차이가 발생
- 베르누이 분포
XX ~ Berp(p)Berp(p) : X는 p의 확률로 1, 1-p의 확률로 0 - 이항분포
XX ~ Bin(N,p)Bin(N,p)
P(X=k)=(nk)pk(1−p)N−kP(X=k)=(nk)pk(1−p)N−k
기댓값 : NpNp - 데이터 Import 예시
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters# Data Import Option test_data = np.getfromtxt("data.csv", skip_header=1, usecols=[1, 2], missing_values="NA", delimiter=",") # Remove NA rows test_data = test_data[~np.isnan(test_data[:,1])] - Logistic Function
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersfrom IPython.core.pylabtools import figsize from matplotlib import pyplot as plt import numpy as np figsize(12, 3) def logistic(x, beta, alpha=0): return 1.0 / (1.0 + np.exp(np.dot(beta, x) + alpha)) x = np.linspace(-4, 4, 100) plt.plot(x, logistic(x, 1), label=r"β=1β=1", ls="--", lw=1) plt.plot(x, logistic(x, 3), label=r"β=3β=3", ls="--", lw=1) plt.plot(x, logistic(x, -5), label=r"β=−5β=−5", ls="--", lw=1) plt.plot(x, logistic(x, 1, 1), label=r"β=1,α=1β=1,α=1", color="#348ABD") plt.plot(x, logistic(x, 3, -2), label=r"β=3,α=−2β=3,α=−2", color="#A60628") plt.plot(x, logistic(x, -5, 7), label=r"β=−5,α=7β=−5,α=7", color="#7A68A6") plt.xlabel("xx") plt.ylabel("Logistic function") plt.title("Logistic functions based on alphaalpha, betabeta", fontsize=14) plt.legend(loc="lower left"); - 정규분포
정규확률변수 : X∼N(μ,1/τ)X∼N(μ,1/τ)
확률밀도함수 : f(x|μ,τ)=√τ2πexp(−τ2(x−μ)2)f(x|μ,τ)=√τ2πexp(−τ2(x−μ)2)
정규분포의 기댓값 : E[X|μ,τ]=μE[X|μ,τ]=μ
정규분포의 분산 : Var(X|μ,τ)=1τVar(X|μ,τ)=1τ
- 우리의 모델이 적절한가?
- 결론
- 부록
- 연습문제
- 참고자료
3. MCMC 블랙박스 열기Permalink
- 베이지안 지형
- 수렴 판정하기
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersfigsize(12.5, 4) import pymc3 as pm x_t = np.random.normal(0, 1, 200) x_t[0] = 0 y_t = np.zeros(200) for i in range(1, 200): y_t[i] = np.random.normal(y_t[i - 1], 1) plt.plot(y_t, label="ytyt", lw=3) plt.plot(x_t, label="xtxt", lw=3) plt.xlabel("time, tt") plt.legend(); def autocorr(x): # from http://tinyurl.com/afz57c4 result = np.correlate(x, x, mode='full') result = result / np.max(result) return result[result.size // 2:] colors = ["#348ABD", "#A60628", "#7A68A6"] x = np.arange(1, 200) plt.bar(x, autocorr(y_t)[1:], width=1, label="ytyt", edgecolor=colors[0], color=colors[0]) plt.bar(x, autocorr(x_t)[1:], width=1, label="xtxt", color=colors[1], edgecolor=colors[1]) plt.legend(title="Autocorrelation") plt.ylabel("measured correlation \nbetween ytyt and yt−kyt−k.") plt.xlabel("k (lag)") plt.title("Autocorrelation plot of ytyt and xtxt for differing kk lags."); max_x = 200 // 3 + 1 x = np.arange(1, max_x) plt.bar(x, autocorr(y_t)[1:max_x], edgecolor=colors[0], label="no thinning", color=colors[0], width=1) plt.bar(x, autocorr(y_t[::2])[1:max_x], edgecolor=colors[1], label="keeping every 2nd sample", color=colors[1], width=1) plt.bar(x, autocorr(y_t[::3])[1:max_x], width=1, edgecolor=colors[2], label="keeping every 3rd sample", color=colors[2]) plt.autoscale(tight=True) plt.legend(title="Autocorrelation plot for ytyt", loc="lower left") plt.ylabel("measured correlation \nbetween ytyt and yt−kyt−k.") plt.xlabel("k (lag)") plt.title("Autocorrelation of ytyt (no thinning vs. thinning) \ at differing kk lags."); pm.plots.traceplot(trace=trace, varnames=["centers"]) pm.plots.plot_posterior(trace=trace["centers"][:,0]) pm.plots.plot_posterior(trace=trace["centers"][:,1]) pm.plots.autocorrplot(trace=trace, varnames=["centers"]); - MCMC에 대한 유용한 팁
- 결론
- 참고자료
4. 아무도 알려주지 않는 위대한 이론Permalink
- 서론
- 큰 수의 법칙
1N∑Ni=1Zi→E[Z],N→∞1N∑Ni=1Zi→E[Z],N→∞
- 직관
1N∑Ni=1Zi=1N(∑Zi=c1c1+∑Zi=c2c2)1N∑Ni=1Zi=1N(∑Zi=c1c1+∑Zi=c2c2)
=c1∑Zi=c11N+c2∑Zi=c21N=c1∑Zi=c11N+c2∑Zi=c21N
=c1× (approximate frequency of c1) +c2× (approximate frequency of c2) =c1× (approximate frequency of c1) +c2× (approximate frequency of c2)
≈c1×P(Z=c1)+c2×P(Z=c2)=E[Z]≈c1×P(Z=c1)+c2×P(Z=c2)=E[Z]
D(N)=√E[(1N∑Ni=1Zi−4.5)2]D(N)=√E[(1N∑Ni=1Zi−4.5)2]
Yk=(1N∑Ni=1Zi−4.5)2
1NY∑NYk=1Yk→E[Yk]=E[(1N∑Ni=1Zi−4.5)2]
√1NY∑NYk=1Yk≈D(N)
1N∑Ni=1(Zi−μ)2→E[(Z−μ)2]=Var(Z)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters%matplotlib inline import numpy as np from IPython.core.pylabtools import figsize import matplotlib.pyplot as plt figsize( 12.5, 5 ) sample_size = 100000 expected_value = lambda_ = 4.5 poi = np.random.poisson N_samples = range(1,sample_size,100) for k in range(3): samples = poi( lambda_, sample_size ) partial_average = [ samples[:i].mean() for i in N_samples ] plt.plot( N_samples, partial_average, lw=1.5, label="average of n samples; seq. %d"%k) plt.plot( N_samples, expected_value*np.ones_like( partial_average), ls = "--", label = "true expected value", c = "k") plt.ylim( 4.35, 4.65) plt.title( "Convergence of the average of \n random variables to its expected value") plt.ylabel( "average of n samples") plt.xlabel( "# of samples, n") plt.legend(); figsize( 12.5, 4) N_Y = 250 #use this many to approximate D(N) N_array = np.arange( 1000, 50000, 2500 ) #use this many samples in the approx. to the variance. D_N_results = np.zeros( len( N_array ) ) lambda_ = 4.5 expected_value = lambda_ #for X ~ Poi(lambda) , E[ X ] = lambda def D_N( n ): """ This function approx. D_n, the average variance of using n samples. """ Z = poi( lambda_, (n, N_Y) ) average_Z = Z.mean(axis=0) return np.sqrt( ( (average_Z - expected_value)**2 ).mean() ) for i,n in enumerate(N_array): D_N_results[i] = D_N(n) plt.xlabel( "N" ) plt.ylabel( "expected squared-distance from true value" ) plt.plot(N_array, D_N_results, lw = 3, label="expected distance between\n\ expected value and \naverage of N random variables.") plt.plot( N_array, np.sqrt(expected_value)/np.sqrt(N_array), lw = 2, ls = "--", label = r"√λ√N") plt.legend() plt.title( "How 'fast' is the sample average converging?");
- 직관
- 작은 수의 혼란
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters#adding a number to the end of the %run call will get the ith top post. %run top_showerthoughts_submissions.py 2 print("Post contents: \n") print(top_post) """ contents: an array of the text from the last 100 top submissions to a subreddit votes: a 2d numpy array of upvotes, downvotes for each submission. """ n_submissions = len(votes) submissions = np.random.randint( n_submissions, size=4) print("Some Submissions (out of %d total) \n-----------"%n_submissions) for i in submissions: print('"' + contents[i] + '"') print("upvotes/downvotes: ",votes[i,:], "\n") import pymc3 as pm def posterior_upvote_ratio( upvotes, downvotes, samples = 20000): """ This function accepts the number of upvotes and downvotes a particular submission recieved, and the number of posterior samples to return to the user. Assumes a uniform prior. """ N = upvotes + downvotes with pm.Model() as model: upvote_ratio = pm.Uniform("upvote_ratio", 0, 1) observations = pm.Binomial( "obs", N, upvote_ratio, observed=upvotes) trace = pm.sample(samples, step=pm.Metropolis()) burned_trace = trace[int(samples/4):] return burned_trace["upvote_ratio"] figsize( 11., 8) posteriors = [] colours = ["#348ABD", "#A60628", "#7A68A6", "#467821", "#CF4457"] for i in range(len(submissions)): j = submissions[i] posteriors.append( posterior_upvote_ratio( votes[j, 0], votes[j,1] ) ) plt.hist( posteriors[i], bins = 10, normed = True, alpha = .9, histtype="step",color = colours[i%5], lw = 3, label = '(%d up:%d down)\n%s...'%(votes[j, 0], votes[j,1], contents[j][:50]) ) plt.hist( posteriors[i], bins = 10, normed = True, alpha = .2, histtype="stepfilled",color = colours[i], lw = 3, ) plt.legend(loc="upper left") plt.xlim( 0, 1) plt.title("Posterior distributions of upvote ratios on different submissions"); N = posteriors[0].shape[0] lower_limits = [] for i in range(len(submissions)): j = submissions[i] plt.hist( posteriors[i], bins = 20, normed = True, alpha = .9, histtype="step",color = colours[i], lw = 3, label = '(%d up:%d down)\n%s...'%(votes[j, 0], votes[j,1], contents[j][:50]) ) plt.hist( posteriors[i], bins = 20, normed = True, alpha = .2, histtype="stepfilled",color = colours[i], lw = 3, ) v = np.sort( posteriors[i] )[ int(0.05*N) ] #plt.vlines( v, 0, 15 , color = "k", alpha = 1, linewidths=3 ) plt.vlines( v, 0, 10 , color = colours[i], linestyles = "--", linewidths=3 ) lower_limits.append(v) plt.legend(loc="upper left") plt.legend(loc="upper left") plt.title("Posterior distributions of upvote ratios on different submissions"); order = np.argsort( -np.array( lower_limits ) ) print(order, lower_limits) def intervals(u,d): a = 1. + u b = 1. + d mu = a/(a+b) std_err = 1.65*np.sqrt( (a*b)/( (a+b)**2*(a+b+1.) ) ) return ( mu, std_err ) print("Approximate lower bounds:") posterior_mean, std_err = intervals(votes[:,0],votes[:,1]) lb = posterior_mean - std_err print(lb) print("\n") print("Top 40 Sorted according to approximate lower bounds:") print("\n") order = np.argsort( -lb ) ordered_contents = [] for i in order[:40]: ordered_contents.append( contents[i] ) print(votes[i,0], votes[i,1], contents[i]) print("-------------") r_order = order[::-1][-40:] plt.errorbar( posterior_mean[r_order], np.arange( len(r_order) ), xerr=std_err[r_order], capsize=0, fmt="o", color = "#7A68A6") plt.xlim( 0.3, 1) plt.yticks( np.arange( len(r_order)-1,-1,-1 ), map( lambda x: x[:30].replace("\n",""), ordered_contents) ); - 결론
- 부록
- 연습문제
- 참고자료
5. 오히려 큰 손해를 보시겠습니까?Permalink
- 서론
- 손실함수
L(θ,ˆθ)=(θ−ˆθ)2
L(θ,ˆθ)={(θ−ˆθ)2 ˆθ<θ c(θ−ˆθ)2 ˆθ≥θ,0<c<1
L(θ,ˆθ)=|θ−ˆθ|
L(θ,ˆθ)=−θlog(ˆθ)−(1−θ)log(1−ˆθ),θ∈0,1,ˆθ∈[0,1]
l(ˆθ)=Eθ[L(θ,ˆθ)]
1N∑Ni=1L(θi,ˆθ)≈Eθ[L(θ,ˆθ)]=l(ˆθ)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters%matplotlib inline import scipy.stats as stats from IPython.core.pylabtools import figsize import numpy as np import matplotlib.pyplot as plt figsize(12.5, 9) norm_pdf = stats.norm.pdf plt.subplot(311) x = np.linspace(0, 60000, 200) sp1 = plt.fill_between(x , 0, norm_pdf(x, 35000, 7500), color = "#348ABD", lw = 3, alpha = 0.6, label = "historical total prices") p1 = plt.Rectangle((0, 0), 1, 1, fc=sp1.get_facecolor()[0]) plt.legend([p1], [sp1.get_label()]) plt.subplot(312) x = np.linspace(0, 10000, 200) sp2 = plt.fill_between(x , 0, norm_pdf(x, 3000, 500), color = "#A60628", lw = 3, alpha = 0.6, label="snowblower price guess") p2 = plt.Rectangle((0, 0), 1, 1, fc=sp2.get_facecolor()[0]) plt.legend([p2], [sp2.get_label()]) plt.subplot(313) x = np.linspace(0, 25000, 200) sp3 = plt.fill_between(x , 0, norm_pdf(x, 12000, 3000), color = "#7A68A6", lw = 3, alpha = 0.6, label = "Trip price guess") plt.autoscale(tight=True) p3 = plt.Rectangle((0, 0), 1, 1, fc=sp3.get_facecolor()[0]) plt.legend([p3], [sp3.get_label()]); import pymc3 as pm data_mu = [3e3, 12e3] data_std = [5e2, 3e3] mu_prior = 35e3 std_prior = 75e2 with pm.Model() as model: true_price = pm.Normal("true_price", mu=mu_prior, sd=std_prior) prize_1 = pm.Normal("first_prize", mu=data_mu[0], sd=data_std[0]) prize_2 = pm.Normal("second_prize", mu=data_mu[1], sd=data_std[1]) price_estimate = prize_1 + prize_2 logp = pm.Normal.dist(mu=price_estimate, sd=(3e3)).logp(true_price) error = pm.Potential("error", logp) trace = pm.sample(50000, step=pm.Metropolis()) burned_trace = trace[10000:] price_trace = burned_trace["true_price"] figsize(12.5, 4) import scipy.stats as stats x = np.linspace(5000, 40000) plt.plot(x, stats.norm.pdf(x, 35000, 7500), c = "k", lw = 2, label = "prior dist. of suite price") _hist = plt.hist(price_trace, bins = 35, normed= True, histtype= "stepfilled") plt.title("Posterior of the true price estimate") plt.vlines(mu_prior, 0, 1.1*np.max(_hist[0]), label = "prior's mean", linestyles="--") plt.vlines(price_trace.mean(), 0, 1.1*np.max(_hist[0]), \ label = "posterior's mean", linestyles="-.") plt.legend(loc = "upper left"); figsize(12.5, 7) #numpy friendly showdown_loss def showdown_loss(guess, true_price, risk = 80000): loss = np.zeros_like(true_price) ix = true_price < guess loss[~ix] = np.abs(guess - true_price[~ix]) close_mask = [abs(true_price - guess) <= 250] loss[close_mask] = -2*true_price[close_mask] loss[ix] = risk return loss guesses = np.linspace(5000, 50000, 70) risks = np.linspace(30000, 150000, 6) expected_loss = lambda guess, risk: \ showdown_loss(guess, price_trace, risk).mean() for _p in risks: results = [expected_loss(_g, _p) for _g in guesses] plt.plot(guesses, results, label = "%d"%_p) plt.title("Expected loss of different guesses, \nvarious risk-levels of \ overestimating") plt.legend(loc="upper left", title="Risk parameter") plt.xlabel("price bid") plt.ylabel("expected loss") plt.xlim(5000, 30000); import scipy.optimize as sop ax = plt.subplot(111) for _p in risks: _color = next(ax._get_lines.prop_cycler) _min_results = sop.fmin(expected_loss, 15000, args=(_p,),disp = False) _results = [expected_loss(_g, _p) for _g in guesses] plt.plot(guesses, _results , color = _color['color']) plt.scatter(_min_results, 0, s = 60, \ color= _color['color'], label = "%d"%_p) plt.vlines(_min_results, 0, 120000, color = _color['color'], linestyles="--") print("minimum at risk %d: %.2f" % (_p, _min_results)) plt.title("Expected loss & Bayes actions of different guesses, \n \ various risk-levels of overestimating") plt.legend(loc="upper left", scatterpoints = 1, title = "Bayes action at risk:") plt.xlabel("price guess") plt.ylabel("expected loss") plt.xlim(7000, 30000) plt.ylim(-1000, 80000); - 베이지안 방법을 통한 기계학습
Ri(x)=αi+βix+ϵ
where ϵ∼Normal(0,σi) and i indexes our posterior samples.
argminrER(x)[L(R(x),r)]
- 예제: 금융예측
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters
figsize(12.5, 4) def stock_loss(true_return, yhat, alpha = 100.): if true_return * yhat < 0: #opposite signs, not good return alpha*yhat**2 - np.sign(true_return)*yhat \ + abs(true_return) else: return abs(true_return - yhat) true_value = .05 pred = np.linspace(-.04, .12, 75) plt.plot(pred, [stock_loss(true_value, _p) for _p in pred], \ label = "Loss associated with\n prediction if true value = 0.05", lw =3) plt.vlines(0, 0, .25, linestyles="--") plt.xlabel("prediction") plt.ylabel("loss") plt.xlim(-0.04, .12) plt.ylim(0, 0.25) true_value = -.02 plt.plot(pred, [stock_loss(true_value, _p) for _p in pred], alpha = 0.6, \ label = "Loss associated with\n prediction if true value = -0.02", lw =3) plt.legend() plt.title("Stock returns loss if true value = 0.05, -0.02"); ## Code to create artificial data N = 100 X = 0.025*np.random.randn(N) Y = 0.5*X + 0.01*np.random.randn(N) ls_coef_ = np.cov(X, Y)[0,1]/np.var(X) ls_intercept = Y.mean() - ls_coef_*X.mean() plt.scatter(X, Y, c="k") plt.xlabel("trading signal") plt.ylabel("returns") plt.title("Empirical returns vs trading signal") plt.plot(X, ls_coef_*X + ls_intercept, label = "Least-squares line") plt.xlim(X.min(), X.max()) plt.ylim(Y.min(), Y.max()) plt.legend(loc="upper left"); import pymc3 as pm with pm.Model() as model: std = pm.Uniform("std", 0, 100) beta = pm.Normal("beta", mu=0, sd=100) alpha = pm.Normal("alpha", mu=0, sd=100) mean = pm.Deterministic("mean", alpha + beta*X) obs = pm.Normal("obs", mu=mean, sd=std, observed=Y) trace = pm.sample(100000, step=pm.Metropolis()) burned_trace = trace[20000:] pm.plots.traceplot(trace=burned_trace, varnames=["std", "beta", "alpha"]) pm.plot_posterior(trace=burned_trace, varnames=["std", "beta", "alpha"], kde_plot=True); figsize(12.5, 6) from scipy.optimize import fmin def stock_loss(price, pred, coef = 500): """vectorized for numpy""" sol = np.zeros_like(price) ix = price*pred < 0 sol[ix] = coef*pred**2 - np.sign(price[ix])*pred + abs(price[ix]) sol[~ix] = abs(price[~ix] - pred) return sol std_samples = burned_trace["std"] alpha_samples = burned_trace["alpha"] beta_samples = burned_trace["beta"] N = std_samples.shape[0] noise = std_samples*np.random.randn(N) possible_outcomes = lambda signal: alpha_samples + beta_samples*signal + noise opt_predictions = np.zeros(50) trading_signals = np.linspace(X.min(), X.max(), 50) for i, _signal in enumerate(trading_signals): _possible_outcomes = possible_outcomes(_signal) tomin = lambda pred: stock_loss(_possible_outcomes, pred).mean() opt_predictions[i] = fmin(tomin, 0, disp = False) plt.xlabel("trading signal") plt.ylabel("prediction") plt.title("Least-squares prediction vs. Bayes action prediction") plt.plot(X, ls_coef_*X + ls_intercept, label ="Least-squares prediction") plt.xlim(X.min(), X.max()) plt.plot(trading_signals, opt_predictions, label ="Bayes action prediction") plt.legend(loc="upper left");
- 예제: 금융예측
- 결론
- 참고자료
6. 우선순위 바로잡기Permalink
- 서론
- 주관적인 사전확률분포 vs. 객관적인 사전확률분포
μp=1N∑Ni=0Xi - 알아두면 유용한 사전확률분포
- 감마분포
Exp(β)∼Gamma(1,β)
F(x∣α,β)=βαxα−1e−βxΓ(α) - 베타분포
fX(x|α,β)=x(α−1)(1−x)(β−1)B(α,β)
Observation : X∼Binomial(N,p)
Posterior : Beta(1+X,1+N−X)
- 감마분포
- 예제: 베이지안 MAB (Multi-Armed-Bandits)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersrand = np.random.rand class Bandits(object): """ This class represents N bandits machines. parameters: p_array: a (n,) Numpy array of probabilities >0, <1. methods: pull( i ): return the results, 0 or 1, of pulling the ith bandit. """ def __init__(self, p_array): self.p = p_array self.optimal = np.argmax(p_array) def pull(self, i): #i is which arm to pull return np.random.rand() < self.p[i] def __len__(self): return len(self.p) class BayesianStrategy(object): """ Implements a online, learning strategy to solve the Multi-Armed Bandit problem. parameters: bandits: a Bandit class with .pull method methods: sample_bandits(n): sample and train on n pulls. attributes: N: the cumulative number of samples choices: the historical choices as a (N,) array bb_score: the historical score as a (N,) array """ def __init__(self, bandits): self.bandits = bandits n_bandits = len(self.bandits) self.wins = np.zeros(n_bandits) self.trials = np.zeros(n_bandits) self.N = 0 self.choices = [] self.bb_score = [] def sample_bandits(self, n=1): bb_score = np.zeros(n) choices = np.zeros(n) for k in range(n): #sample from the bandits's priors, and select the largest sample choice = np.argmax(np.random.beta(1 + self.wins, 1 + self.trials - self.wins)) #sample the chosen bandit result = self.bandits.pull(choice) #update priors and score self.wins[choice] += result self.trials[choice] += 1 bb_score[k] = result self.N += 1 choices[k] = choice self.bb_score = np.r_[self.bb_score, bb_score] self.choices = np.r_[self.choices, choices] return figsize(11.0, 10) beta = stats.beta x = np.linspace(0.001,.999,200) def plot_priors(bayesian_strategy, prob, lw = 3, alpha = 0.2, plt_vlines = True): ## plotting function wins = bayesian_strategy.wins trials = bayesian_strategy.trials for i in range(prob.shape[0]): y = beta(1+wins[i], 1 + trials[i] - wins[i]) p = plt.plot(x, y.pdf(x), lw = lw) c = p[0].get_markeredgecolor() plt.fill_between(x,y.pdf(x),0, color = c, alpha = alpha, label="underlying probability: %.2f" % prob[i]) if plt_vlines: plt.vlines(prob[i], 0, y.pdf(prob[i]) , colors = c, linestyles = "--", lw = 2) plt.autoscale(tight = "True") plt.title("Posteriors After %d pull" % bayesian_strategy.N +\ "s"*(bayesian_strategy.N > 1)) plt.autoscale(tight=True) return # Start Main Code hidden_prob = np.array([0.85, 0.60, 0.75]) bandits = Bandits(hidden_prob) bayesian_strat = BayesianStrategy(bandits) draw_samples = [1, 1, 3, 10, 10, 25, 50, 100, 200, 600] for j,i in enumerate(draw_samples): plt.subplot(5, 2, j+1) bayesian_strat.sample_bandits(i) plot_priors(bayesian_strat, hidden_prob) #plt.legend() plt.autoscale(tight = True) plt.tight_layout() -
해당 분야 전문가로부터 사전확률분포 유도하기
wopt=maxw1N(∑Ni=0μTiw−λ2wTΣiw) - 켤레 사전확률분포
- 제프리 사전확률분포
- N이 증가할 때 사전확률분포의 효과
p(θ|X)∝p(X|θ)⏟likelihood⋅prior⏞p(θ)
log(p(θ|X))=c+L(θ;X)+log(p(θ))
- 서로 다른 사전확률분포에서 시작하더라도 표본 크기가 증가함에 따라 사후확률분포는 수렴함
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersfigsize(12.5, 15) p = 0.6 beta1_params = np.array([1.,1.]) beta2_params = np.array([2,10]) beta = stats.beta x = np.linspace(0.00, 1, 125) data = stats.bernoulli.rvs(p, size=500) plt.figure() for i,N in enumerate([0,4,8, 32,64, 128, 500]): s = data[:N].sum() plt.subplot(8,1,i+1) params1 = beta1_params + np.array([s, N-s]) params2 = beta2_params + np.array([s, N-s]) y1,y2 = beta.pdf(x, *params1), beta.pdf( x, *params2) plt.plot(x,y1, label = r"flat prior", lw =3) plt.plot(x, y2, label = "biased prior", lw= 3) plt.fill_between(x, 0, y1, color ="#348ABD", alpha = 0.15) plt.fill_between(x, 0, y2, color ="#A60628", alpha = 0.15) plt.legend(title = "N=%d" % N) plt.vlines(p, 0.0, 7.5, linestyles = "--", linewidth=1) #plt.ylim(0, 10)#
- 서로 다른 사전확률분포에서 시작하더라도 표본 크기가 증가함에 따라 사후확률분포는 수렴함
- 결론
- 부록
- 참고자료
7. 베이지안 A/B 테스트Permalink
- 서론
- 전환율 테스트 개요
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersvisitors_to_A = 1300 visitors_to_B = 1275 conversions_from_A = 120 conversions_from_B = 125 from scipy.stats import beta alpha_prior = 1 beta_prior = 1 posterior_A = beta(alpha_prior + conversions_from_A, beta_prior + visitors_to_A - conversions_from_A) posterior_B = beta(alpha_prior + conversions_from_B, beta_prior + visitors_to_B - conversions_from_B) samples = 20000 # 더 나은 근사를 위해서는 이 값이 더 커야 한다 # We want this to be large to get a better approximation. samples_posterior_A = posterior_A.rvs(samples) samples_posterior_B = posterior_B.rvs(samples) print ((samples_posterior_A > samples_posterior_B).mean()) %matplotlib inline from IPython.core.pylabtools import figsize from matplotlib import pyplot as plt import matplotlib import numpy as np matplotlib.rc('font', family='Malgun Gothic') # 그림 한글 폰트 지정, 맑은 고딕 figsize(12.5, 4) plt.rcParams['savefig.dpi'] = 300 plt.rcParams['figure.dpi'] = 300 x = np.linspace(0,1, 500) plt.plot(x, posterior_A.pdf(x), label='A의 사후확률분포') plt.plot(x, posterior_B.pdf(x), label='B의 사후확률분포') plt.xlabel('값') plt.ylabel('밀도') #plt.title("웹페이지 A 와 B의 전환율에 대한 사후확률분포") plt.legend(); plt.plot(x, posterior_A.pdf(x), label='A의 사후확률분포') plt.plot(x, posterior_B.pdf(x), label='B의 사후확률분포') plt.xlim(0.05, 0.15) plt.xlabel('값') plt.ylabel('밀도') #plt.title("웹페이지 A 와 B의 전환율에 대한 사후확률분포(확대)") plt.legend(); - 선형손실함수 추가하기
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersfrom numpy.random import dirichlet N = 1000 N_79 = 10 N_49 = 46 N_25 = 80 N_0 = N - (N_79 + N_49 + N_49) observations = np.array([N_79, N_49, N_25, N_0]) prior_parameters = np.array([1,1,1,1]) posterior_samples = dirichlet(prior_parameters + observations, size=10000) print ("Two random samples from the posterior:") print (posterior_samples[0]) print (posterior_samples[1]) for i, label in enumerate(['p_79', 'p_49', 'p_25', 'p_0']): ax = plt.hist(posterior_samples[:,i], bins=50, label=label, histtype='stepfilled') plt.xlabel('값') plt.ylabel('밀도') #plt.title("여러 가격 선택의 사후확률분포") plt.legend(); def expected_revenue(P): return 79*P[:,0] + 49*P[:,1] + 25*P[:,2] + 0*P[:,3] posterior_expected_revenue = expected_revenue(posterior_samples) plt.hist(posterior_expected_revenue, histtype='stepfilled',label='기대수익', bins=50) plt.xlabel('값') plt.ylabel('밀도') #plt.title("기대수익에 대한 사후확률분포") plt.legend(); N_A = 1000 N_A_79 = 10 N_A_49 = 46 N_A_25 = 80 N_A_0 = N_A - (N_A_79 + N_A_49 + N_A_49) observations_A = np.array([N_A_79, N_A_49, N_A_25, N_A_0]) N_B = 2000 N_B_79 = 45 N_B_49 = 84 N_B_25 = 200 N_B_0 = N_B - (N_B_79 + N_B_49 + N_B_49) observations_B = np.array([N_B_79, N_B_49, N_B_25, N_B_0]) prior_parameters = np.array([1,1,1,1]) posterior_samples_A = dirichlet(prior_parameters + observations_A, size=10000) posterior_samples_B = dirichlet(prior_parameters + observations_B, size=10000) posterior_expected_revenue_A = expected_revenue(posterior_samples_A) posterior_expected_revenue_B = expected_revenue(posterior_samples_B) plt.hist(posterior_expected_revenue_A, histtype='stepfilled', label='A의 기대수익', bins=50) plt.hist(posterior_expected_revenue_B, histtype='stepfilled', label='B의 기대수익', bins=50, alpha=0.8) plt.xlabel('값') plt.ylabel('밀도') #plt.title("A 와 B 간의 기대수익의 사후확률분포") plt.legend(); p = (posterior_expected_revenue_B > posterior_expected_revenue_A).mean() print ("Probability that page B has a higher revenue than page A: %.3f"%p) posterior_diff = posterior_expected_revenue_B - posterior_expected_revenue_A plt.hist(posterior_diff, histtype='stepfilled', color='#7A68A6', label='B 와 A의 수익 차이', bins=50) plt.vlines(0, 0, 700, linestyles='--') plt.xlabel('값') plt.ylabel('밀도') #plt.title("A 와 B의 기대수익간 델타의 사후확률분포") plt.legend(); - 전환율을 넘어서: t-검정
- 증분 추정하기
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersfigsize(12,4) visitors_to_A = 1275 visitors_to_B = 1300 conversions_from_A = 22 conversions_from_B = 12 alpha_prior = 1 beta_prior = 1 posterior_A = beta(alpha_prior + conversions_from_A, beta_prior + visitors_to_A - conversions_from_A) posterior_B = beta(alpha_prior + conversions_from_B, beta_prior + visitors_to_B - conversions_from_B) samples = 20000 samples_posterior_A = posterior_A.rvs(samples) samples_posterior_B = posterior_B.rvs(samples) _hist(samples_posterior_A, 'A') _hist(samples_posterior_B, 'B') plt.xlabel('값') plt.ylabel('밀도') #plt.title("웹페이지 A 와 B의 전환율에 대한 사후확률분포") plt.legend(); def relative_increase(a,b): return (a-b)/b posterior_rel_increase = relative_increase(samples_posterior_A, samples_posterior_B) plt.xlabel('값') plt.ylabel('밀도') #plt.title("웹페이지 B의 전환율에 대한 웹페이지 A의 전환율 상대적 증가에 대한 사후확률분포") _hist(posterior_rel_increase, 'relative increase', color='#7A68A6'); print ((posterior_rel_increase > 0.2).mean()) print ((posterior_rel_increase > 0.5).mean()) mean = posterior_rel_increase.mean() median = np.percentile(posterior_rel_increase, 50) conservative_percentile = np.percentile(posterior_rel_increase, 30) _hist(posterior_rel_increase,'', color='#7A68A6'); plt.vlines(mean, 0, 2500, linestyles='-.', label='평균') plt.vlines(median, 0, 2500, linestyles=':', label='중앙값', lw=3) plt.vlines(conservative_percentile, 0, 2500, linestyles='--', label='30% 퍼센타일') plt.xlabel('값') plt.ylabel('밀도') #plt.title("상대적 증가에 대한 사후확률분포의 여러 요약통계량") plt.legend(); - 결론
- 참고자료
부록 APermalink
- 파이썬, PyMC
- 주피터 노트북
- Reddit 실습하기
참고자료Permalink
- https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers
- https://github.com/gilbutITbook/006775
- https://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/tree/master/
- https://towardsdatascience.com/bayesian-price-optimization-with-pymc3-d1264beb38ee
Comments