Bayesian Methods for Hackers

1. The Philosophy of Bayesian Inference

  • Introduction
    • Prior probability: the strength of our belief in an event A
      $P(A)$
    • Posterior probability: the probability of A given the evidence X
      $P(A|X)$
  • The Bayesian framework
    • Bayes' theorem
      $P(A|X) = \dfrac{P(X|A)\,P(A)}{P(X)}$
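
      A quick numeric illustration of Bayes' theorem (all numbers here are made up for illustration):
      # Hypothetical prior and likelihoods, chosen only to illustrate the formula
      p_A = 0.05              # prior P(A)
      p_X_given_A = 0.90      # likelihood P(X|A)
      p_X_given_not_A = 0.10  # likelihood P(X|~A)
      # total probability: P(X) = P(X|A)P(A) + P(X|~A)P(~A)
      p_X = p_X_given_A * p_A + p_X_given_not_A * (1 - p_A)
      p_A_given_X = p_X_given_A * p_A / p_X  # Bayes' theorem
      print(p_A_given_X)  # ~0.32: the evidence X raises our belief in A from 0.05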
  • Probability distributions
    • If the random variable Z is discrete, its probability mass function is
      $P(Z=k) = \dfrac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots$
      $Z \sim \text{Poi}(\lambda)$
      $E[Z|\lambda] = \lambda$

    • If the random variable Z is continuous, its probability density function is
      $f_Z(z|\lambda) = \lambda e^{-\lambda z}, \quad z \ge 0$
      $Z \sim \text{Exp}(\lambda)$
      $E[Z|\lambda] = 1/\lambda$
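
      These formulas can be checked quickly with scipy (a small sketch; only stats.poisson and stats.expon are used):
      import numpy as np
      from scipy import stats
      from scipy.special import factorial

      lam = 1.5
      k = np.arange(4)
      # Poisson pmf matches lambda^k e^{-lambda} / k!
      print(stats.poisson.pmf(k, lam))
      print(lam**k * np.exp(-lam) / factorial(k))  # same values
      # scipy parameterizes the exponential by scale = 1/lambda; its mean is 1/lambda
      print(stats.expon.mean(scale=1/lam))  # 0.666...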

  • Bayesian inference with computers
    • Bayesian analysis with PyMC
      %matplotlib inline
      import numpy as np
      from matplotlib import pyplot as plt
      from IPython.core.pylabtools import figsize

      figsize(12.5, 3.5)
      count_data = np.loadtxt("data/txtdata.csv")
      n_count_data = len(count_data)
      plt.bar(np.arange(n_count_data), count_data, color="#348ABD")
      plt.xlabel("Time (days)")
      plt.ylabel("count of text-msgs received")
      plt.title("Did the user's texting habits change over time?")
      plt.xlim(0, n_count_data);
      import pymc3 as pm

      with pm.Model() as model:
          alpha = 1.0 / count_data.mean()  # Recall count_data is the
                                           # variable that holds our txt counts
          lambda_1 = pm.Exponential("lambda_1", alpha)
          lambda_2 = pm.Exponential("lambda_2", alpha)
          tau = pm.DiscreteUniform("tau", lower=0, upper=n_count_data - 1)

      with model:
          idx = np.arange(n_count_data)  # Index
          lambda_ = pm.math.switch(tau > idx, lambda_1, lambda_2)

      with model:
          observation = pm.Poisson("obs", lambda_, observed=count_data)

      with model:
          step = pm.Metropolis()
          trace = pm.sample(10000, tune=5000, step=step, return_inferencedata=False)

      lambda_1_samples = trace['lambda_1']
      lambda_2_samples = trace['lambda_2']
      tau_samples = trace['tau']
      figsize(12.5, 10)
      # histogram of the samples:
      ax = plt.subplot(311)
      ax.set_autoscaley_on(False)
      plt.hist(lambda_1_samples, histtype='stepfilled', bins=30, alpha=0.85,
               label=r"posterior of $\lambda_1$", color="#A60628", density=True)
      plt.legend(loc="upper left")
      plt.title(r"""Posterior distributions of the variables
          $\lambda_1,\;\lambda_2,\;\tau$""")
      plt.xlim([15, 30])
      plt.xlabel(r"$\lambda_1$ value")

      ax = plt.subplot(312)
      ax.set_autoscaley_on(False)
      plt.hist(lambda_2_samples, histtype='stepfilled', bins=30, alpha=0.85,
               label=r"posterior of $\lambda_2$", color="#7A68A6", density=True)
      plt.legend(loc="upper left")
      plt.xlim([15, 30])
      plt.xlabel(r"$\lambda_2$ value")

      plt.subplot(313)
      w = 1.0 / tau_samples.shape[0] * np.ones_like(tau_samples)
      plt.hist(tau_samples, bins=n_count_data, alpha=1,
               label=r"posterior of $\tau$",
               color="#467821", weights=w, rwidth=2.)
      plt.xticks(np.arange(n_count_data))
      plt.legend(loc="upper left")
      plt.ylim([0, .75])
      plt.xlim([35, len(count_data) - 20])
      plt.xlabel(r"$\tau$ (in days)")
      plt.ylabel("probability");
      figsize(12.5, 5)
      # tau_samples, lambda_1_samples, lambda_2_samples contain
      # N samples from the corresponding posterior distribution
      N = tau_samples.shape[0]
      expected_texts_per_day = np.zeros(n_count_data)
      for day in range(0, n_count_data):
          # ix is a bool index of all tau samples corresponding to
          # the switchpoint occurring prior to value of 'day'
          ix = day < tau_samples
          # Each posterior sample corresponds to a value for tau.
          # For each day, that value of tau indicates whether we're "before"
          # (in the lambda_1 "regime") or "after" (in the lambda_2 "regime")
          # the switchpoint. By taking the posterior sample of lambda_1/2
          # accordingly, we can average over all samples to get an expected
          # value for lambda on that day. As explained, the "message count"
          # random variable is Poisson distributed, and therefore lambda (the
          # Poisson parameter) is the expected value of "message count".
          expected_texts_per_day[day] = (lambda_1_samples[ix].sum()
                                         + lambda_2_samples[~ix].sum()) / N

      plt.plot(range(n_count_data), expected_texts_per_day, lw=4, color="#E24A33",
               label="expected number of text-messages received")
      plt.xlim(0, n_count_data)
      plt.xlabel("Day")
      plt.ylabel("Expected # text-messages")
      plt.title("Expected number of text-messages received")
      plt.ylim(0, 60)
      plt.bar(np.arange(len(count_data)), count_data, color="#348ABD", alpha=0.65,
              label="observed texts per day")
      plt.legend(loc="upper left");
  • Conclusion
  • Appendix
  • Exercises
  • References

2. A Little More on PyMC

  • Introduction
    • Stochastic variables: variables whose values are not yet determined
    • Deterministic variables: variables that are not random once all of their parents are known
      • In PyMC2 these are marked with the @pymc.deterministic Python wrapper (decorator); a PyMC3 sketch is shown below
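        A minimal PyMC3 sketch of the same distinction (assuming the pm.Deterministic API used elsewhere in these notes):
        import pymc3 as pm

        with pm.Model() as model:
            lambda_1 = pm.Exponential("lambda_1", 1.0)  # stochastic
            lambda_2 = pm.Exponential("lambda_2", 1.0)  # stochastic
            # deterministic: fixed once its parents lambda_1 and lambda_2 are known
            total = pm.Deterministic("total", lambda_1 + lambda_2)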
  • How to model
    • Observed frequencies differ from the true underlying frequency
    • Bernoulli distribution
      $X \sim \text{Ber}(p)$ : X is 1 with probability p and 0 with probability 1-p
    • Binomial distribution
      $X \sim \text{Bin}(N, p)$
      $P(X=k) = \binom{N}{k} p^k (1-p)^{N-k}$
      Expected value: $Np$
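
      A quick check of the pmf and the mean with scipy (a small sketch using stats.binom):
      from scipy import stats

      # Bin(10, 0.3): P(X = 3) and the expected value N*p
      print(stats.binom.pmf(3, n=10, p=0.3))  # ~0.2668
      print(stats.binom.mean(n=10, p=0.3))    # 3.0 == N*p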
    • Example: importing data
      # Data Import Option
      test_data = np.genfromtxt("data.csv", skip_header=1, usecols=[1, 2], missing_values="NA", delimiter=",")
      # Remove NA rows
      test_data = test_data[~np.isnan(test_data[:, 1])]
    • Logistic Function
      from IPython.core.pylabtools import figsize
      from matplotlib import pyplot as plt
      import numpy as np

      figsize(12, 3)

      def logistic(x, beta, alpha=0):
          return 1.0 / (1.0 + np.exp(np.dot(beta, x) + alpha))

      x = np.linspace(-4, 4, 100)
      plt.plot(x, logistic(x, 1), label=r"$\beta = 1$", ls="--", lw=1)
      plt.plot(x, logistic(x, 3), label=r"$\beta = 3$", ls="--", lw=1)
      plt.plot(x, logistic(x, -5), label=r"$\beta = -5$", ls="--", lw=1)
      plt.plot(x, logistic(x, 1, 1), label=r"$\beta = 1, \alpha = 1$", color="#348ABD")
      plt.plot(x, logistic(x, 3, -2), label=r"$\beta = 3, \alpha = -2$", color="#A60628")
      plt.plot(x, logistic(x, -5, 7), label=r"$\beta = -5, \alpha = 7$", color="#7A68A6")
      plt.xlabel("$x$")
      plt.ylabel("Logistic function")
      plt.title(r"Logistic functions for different $\alpha$ and $\beta$", fontsize=14)
      plt.legend(loc="lower left");
    • Normal distribution
      Normal random variable: $X \sim N(\mu, 1/\tau)$
      Probability density function: $f(x \mid \mu, \tau) = \sqrt{\dfrac{\tau}{2\pi}} \exp\left(-\dfrac{\tau}{2}(x - \mu)^2\right)$
      Expected value: $E[X \mid \mu, \tau] = \mu$
      Variance: $\text{Var}(X \mid \mu, \tau) = \dfrac{1}{\tau}$
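
      A quick check of the precision parameterization with scipy (a small sketch; note that scipy's norm takes a standard deviation, not a precision):
      import numpy as np
      from scipy import stats

      mu, tau = 2.0, 0.5                               # precision tau = 1/variance
      dist = stats.norm(loc=mu, scale=1/np.sqrt(tau))  # convert tau to a std. dev.
      print(dist.mean(), dist.var())                   # 2.0, 2.0 == mu and 1/tau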
  • Is our model appropriate?
  • Conclusion
  • Appendix
  • Exercises
  • References

3. Opening the Black Box of MCMC

  • The Bayesian landscape
  • Diagnosing convergence
    import numpy as np
    from matplotlib import pyplot as plt
    from IPython.core.pylabtools import figsize
    import pymc3 as pm

    figsize(12.5, 4)

    x_t = np.random.normal(0, 1, 200)
    x_t[0] = 0
    y_t = np.zeros(200)
    for i in range(1, 200):
        y_t[i] = np.random.normal(y_t[i - 1], 1)

    plt.plot(y_t, label=r"$y_t$", lw=3)
    plt.plot(x_t, label=r"$x_t$", lw=3)
    plt.xlabel("time, $t$")
    plt.legend();

    def autocorr(x):
        # from http://tinyurl.com/afz57c4
        result = np.correlate(x, x, mode='full')
        result = result / np.max(result)
        return result[result.size // 2:]

    colors = ["#348ABD", "#A60628", "#7A68A6"]
    x = np.arange(1, 200)
    plt.bar(x, autocorr(y_t)[1:], width=1, label=r"$y_t$",
            edgecolor=colors[0], color=colors[0])
    plt.bar(x, autocorr(x_t)[1:], width=1, label=r"$x_t$",
            color=colors[1], edgecolor=colors[1])
    plt.legend(title="Autocorrelation")
    plt.ylabel("measured correlation \nbetween $y_t$ and $y_{t-k}$.")
    plt.xlabel("k (lag)")
    plt.title("Autocorrelation plot of $y_t$ and $x_t$ for differing $k$ lags.");

    max_x = 200 // 3 + 1
    x = np.arange(1, max_x)
    plt.bar(x, autocorr(y_t)[1:max_x], edgecolor=colors[0],
            label="no thinning", color=colors[0], width=1)
    plt.bar(x, autocorr(y_t[::2])[1:max_x], edgecolor=colors[1],
            label="keeping every 2nd sample", color=colors[1], width=1)
    plt.bar(x, autocorr(y_t[::3])[1:max_x], width=1, edgecolor=colors[2],
            label="keeping every 3rd sample", color=colors[2])
    plt.autoscale(tight=True)
    plt.legend(title="Autocorrelation plot for $y_t$", loc="lower left")
    plt.ylabel("measured correlation \nbetween $y_t$ and $y_{t-k}$.")
    plt.xlabel("k (lag)")
    plt.title("Autocorrelation of $y_t$ (no thinning vs. thinning) \
    at differing $k$ lags.");

    # `trace` below comes from the chapter's mixture-model (clustering) example,
    # which is not reproduced in these notes
    pm.plots.traceplot(trace=trace, varnames=["centers"])
    pm.plots.plot_posterior(trace=trace["centers"][:, 0])
    pm.plots.plot_posterior(trace=trace["centers"][:, 1])
    pm.plots.autocorrplot(trace=trace, varnames=["centers"]);
  • Useful tips for MCMC
  • Conclusion
  • References

4. The Greatest Theorem Never Told

  • Introduction
  • The Law of Large Numbers
    $\dfrac{1}{N} \sum_{i=1}^{N} Z_i \to E[Z], \quad N \to \infty$
    • Intuition
      $\dfrac{1}{N} \sum_{i=1}^{N} Z_i = \dfrac{1}{N} \left( \sum_{Z_i = c_1} c_1 + \sum_{Z_i = c_2} c_2 \right)$
      $= c_1 \sum_{Z_i = c_1} \dfrac{1}{N} + c_2 \sum_{Z_i = c_2} \dfrac{1}{N}$
      $= c_1 \times (\text{approximate frequency of } c_1) + c_2 \times (\text{approximate frequency of } c_2)$
      $\to c_1 \times P(Z = c_1) + c_2 \times P(Z = c_2) = E[Z]$
      The expected distance from the true value (here 4.5) after N samples:
      $D(N) = \sqrt{E\left[\left(\dfrac{1}{N} \sum_{i=1}^{N} Z_i - 4.5\right)^2\right]}$
      $Y_k = \left(\dfrac{1}{N} \sum_{i=1}^{N} Z_i - 4.5\right)^2$ (the k-th realization)
      $\dfrac{1}{N_Y} \sum_{k=1}^{N_Y} Y_k \approx E[Y_k] = E\left[\left(\dfrac{1}{N} \sum_{i=1}^{N} Z_i - 4.5\right)^2\right]$, so its square root approximates $D(N)$
      $\dfrac{1}{N} \sum_{i=1}^{N} (Z_i - \mu)^2 \to E[(Z - \mu)^2] = \text{Var}(Z)$
      %matplotlib inline
      import numpy as np
      from IPython.core.pylabtools import figsize
      import matplotlib.pyplot as plt

      figsize(12.5, 5)
      sample_size = 100000
      expected_value = lambda_ = 4.5
      poi = np.random.poisson
      N_samples = range(1, sample_size, 100)

      for k in range(3):
          samples = poi(lambda_, sample_size)
          partial_average = [samples[:i].mean() for i in N_samples]
          plt.plot(N_samples, partial_average, lw=1.5,
                   label="average of n samples; seq. %d" % k)

      plt.plot(N_samples, expected_value * np.ones_like(partial_average),
               ls="--", label="true expected value", c="k")
      plt.ylim(4.35, 4.65)
      plt.title("Convergence of the average of \n random variables to its expected value")
      plt.ylabel("average of n samples")
      plt.xlabel("# of samples, n")
      plt.legend();

      figsize(12.5, 4)
      N_Y = 250  # use this many to approximate D(N)
      N_array = np.arange(1000, 50000, 2500)  # use this many samples in the approx. to the variance
      D_N_results = np.zeros(len(N_array))
      lambda_ = 4.5
      expected_value = lambda_  # for X ~ Poi(lambda), E[X] = lambda

      def D_N(n):
          """
          This function approximates D_n, the average variance of using n samples.
          """
          Z = poi(lambda_, (n, N_Y))
          average_Z = Z.mean(axis=0)
          return np.sqrt(((average_Z - expected_value)**2).mean())

      for i, n in enumerate(N_array):
          D_N_results[i] = D_N(n)

      plt.xlabel("N")
      plt.ylabel("expected squared-distance from true value")
      plt.plot(N_array, D_N_results, lw=3,
               label="expected distance between\nexpected value and \naverage of N random variables.")
      plt.plot(N_array, np.sqrt(expected_value) / np.sqrt(N_array), lw=2, ls="--",
               label=r"$\sqrt{\lambda}/\sqrt{N}$")
      plt.legend()
      plt.title("How 'fast' is the sample average converging?");
  • The disorder of small numbers
    # adding a number to the end of the %run call will get the ith top post.
    %run top_showerthoughts_submissions.py 2

    print("Post contents: \n")
    print(top_post)

    """
    contents: an array of the text from the last 100 top submissions to a subreddit
    votes: a 2d numpy array of upvotes, downvotes for each submission.
    """
    n_submissions = len(votes)
    submissions = np.random.randint(n_submissions, size=4)
    print("Some Submissions (out of %d total) \n-----------" % n_submissions)
    for i in submissions:
        print('"' + contents[i] + '"')
        print("upvotes/downvotes: ", votes[i, :], "\n")

    import pymc3 as pm

    def posterior_upvote_ratio(upvotes, downvotes, samples=20000):
        """
        This function accepts the number of upvotes and downvotes a particular
        submission received, and the number of posterior samples to return to
        the user. Assumes a uniform prior.
        """
        N = upvotes + downvotes
        with pm.Model() as model:
            upvote_ratio = pm.Uniform("upvote_ratio", 0, 1)
            observations = pm.Binomial("obs", N, upvote_ratio, observed=upvotes)
            trace = pm.sample(samples, step=pm.Metropolis())
        burned_trace = trace[int(samples/4):]
        return burned_trace["upvote_ratio"]

    figsize(11., 8)
    posteriors = []
    colours = ["#348ABD", "#A60628", "#7A68A6", "#467821", "#CF4457"]
    for i in range(len(submissions)):
        j = submissions[i]
        posteriors.append(posterior_upvote_ratio(votes[j, 0], votes[j, 1]))
        plt.hist(posteriors[i], bins=10, density=True, alpha=.9,
                 histtype="step", color=colours[i % 5], lw=3,
                 label='(%d up:%d down)\n%s...' % (votes[j, 0], votes[j, 1], contents[j][:50]))
        plt.hist(posteriors[i], bins=10, density=True, alpha=.2,
                 histtype="stepfilled", color=colours[i], lw=3)
    plt.legend(loc="upper left")
    plt.xlim(0, 1)
    plt.title("Posterior distributions of upvote ratios on different submissions");

    N = posteriors[0].shape[0]
    lower_limits = []
    for i in range(len(submissions)):
        j = submissions[i]
        plt.hist(posteriors[i], bins=20, density=True, alpha=.9,
                 histtype="step", color=colours[i], lw=3,
                 label='(%d up:%d down)\n%s...' % (votes[j, 0], votes[j, 1], contents[j][:50]))
        plt.hist(posteriors[i], bins=20, density=True, alpha=.2,
                 histtype="stepfilled", color=colours[i], lw=3)
        v = np.sort(posteriors[i])[int(0.05 * N)]
        # plt.vlines(v, 0, 15, color="k", alpha=1, linewidths=3)
        plt.vlines(v, 0, 10, color=colours[i], linestyles="--", linewidths=3)
        lower_limits.append(v)
    plt.legend(loc="upper left")
    plt.title("Posterior distributions of upvote ratios on different submissions");
    order = np.argsort(-np.array(lower_limits))
    print(order, lower_limits)

    def intervals(u, d):
        a = 1. + u
        b = 1. + d
        mu = a / (a + b)
        std_err = 1.65 * np.sqrt((a * b) / ((a + b)**2 * (a + b + 1.)))
        return (mu, std_err)

    print("Approximate lower bounds:")
    posterior_mean, std_err = intervals(votes[:, 0], votes[:, 1])
    lb = posterior_mean - std_err
    print(lb)
    print("\n")
    print("Top 40 Sorted according to approximate lower bounds:")
    print("\n")
    order = np.argsort(-lb)
    ordered_contents = []
    for i in order[:40]:
        ordered_contents.append(contents[i])
        print(votes[i, 0], votes[i, 1], contents[i])
        print("-------------")

    r_order = order[::-1][-40:]
    plt.errorbar(posterior_mean[r_order], np.arange(len(r_order)),
                 xerr=std_err[r_order], capsize=0, fmt="o",
                 color="#7A68A6")
    plt.xlim(0.3, 1)
    plt.yticks(np.arange(len(r_order) - 1, -1, -1),
               list(map(lambda x: x[:30].replace("\n", ""), ordered_contents)));
  • Conclusion
  • Appendix
  • Exercises
  • References

5. Would You Rather Lose an Arm or a Leg?

  • Introduction
  • Loss functions
    Squared-error loss: $L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2$
    Asymmetric squared-error loss: $L(\theta, \hat{\theta}) = \begin{cases} (\theta - \hat{\theta})^2 & \hat{\theta} < \theta \\ c(\theta - \hat{\theta})^2 & \hat{\theta} \ge \theta, \; 0 < c < 1 \end{cases}$
    Absolute loss: $L(\theta, \hat{\theta}) = |\theta - \hat{\theta}|$
    Log-loss: $L(\theta, \hat{\theta}) = -\theta \log(\hat{\theta}) - (1 - \theta)\log(1 - \hat{\theta}), \quad \theta \in \{0, 1\}, \; \hat{\theta} \in [0, 1]$
    Expected loss: $l(\hat{\theta}) = E_{\theta}[L(\theta, \hat{\theta})]$
    $\dfrac{1}{N} \sum_{i=1}^{N} L(\theta_i, \hat{\theta}) \approx E_{\theta}[L(\theta, \hat{\theta})] = l(\hat{\theta})$
    %matplotlib inline
    import scipy.stats as stats
    from IPython.core.pylabtools import figsize
    import numpy as np
    import matplotlib.pyplot as plt

    figsize(12.5, 9)
    norm_pdf = stats.norm.pdf

    plt.subplot(311)
    x = np.linspace(0, 60000, 200)
    sp1 = plt.fill_between(x, 0, norm_pdf(x, 35000, 7500),
                           color="#348ABD", lw=3, alpha=0.6,
                           label="historical total prices")
    p1 = plt.Rectangle((0, 0), 1, 1, fc=sp1.get_facecolor()[0])
    plt.legend([p1], [sp1.get_label()])

    plt.subplot(312)
    x = np.linspace(0, 10000, 200)
    sp2 = plt.fill_between(x, 0, norm_pdf(x, 3000, 500),
                           color="#A60628", lw=3, alpha=0.6,
                           label="snowblower price guess")
    p2 = plt.Rectangle((0, 0), 1, 1, fc=sp2.get_facecolor()[0])
    plt.legend([p2], [sp2.get_label()])

    plt.subplot(313)
    x = np.linspace(0, 25000, 200)
    sp3 = plt.fill_between(x, 0, norm_pdf(x, 12000, 3000),
                           color="#7A68A6", lw=3, alpha=0.6,
                           label="Trip price guess")
    plt.autoscale(tight=True)
    p3 = plt.Rectangle((0, 0), 1, 1, fc=sp3.get_facecolor()[0])
    plt.legend([p3], [sp3.get_label()]);

    import pymc3 as pm

    data_mu = [3e3, 12e3]
    data_std = [5e2, 3e3]
    mu_prior = 35e3
    std_prior = 75e2
    with pm.Model() as model:
        true_price = pm.Normal("true_price", mu=mu_prior, sd=std_prior)
        prize_1 = pm.Normal("first_prize", mu=data_mu[0], sd=data_std[0])
        prize_2 = pm.Normal("second_prize", mu=data_mu[1], sd=data_std[1])
        price_estimate = prize_1 + prize_2
        logp = pm.Normal.dist(mu=price_estimate, sd=(3e3)).logp(true_price)
        error = pm.Potential("error", logp)
        trace = pm.sample(50000, step=pm.Metropolis())
        burned_trace = trace[10000:]

    price_trace = burned_trace["true_price"]

    figsize(12.5, 4)
    import scipy.stats as stats

    x = np.linspace(5000, 40000)
    plt.plot(x, stats.norm.pdf(x, 35000, 7500), c="k", lw=2,
             label="prior dist. of suite price")
    _hist = plt.hist(price_trace, bins=35, density=True, histtype="stepfilled")
    plt.title("Posterior of the true price estimate")
    plt.vlines(mu_prior, 0, 1.1 * np.max(_hist[0]), label="prior's mean",
               linestyles="--")
    plt.vlines(price_trace.mean(), 0, 1.1 * np.max(_hist[0]),
               label="posterior's mean", linestyles="-.")
    plt.legend(loc="upper left");

    figsize(12.5, 7)

    # numpy-friendly showdown_loss
    def showdown_loss(guess, true_price, risk=80000):
        loss = np.zeros_like(true_price)
        ix = true_price < guess
        loss[~ix] = np.abs(guess - true_price[~ix])
        close_mask = abs(true_price - guess) <= 250
        loss[close_mask] = -2 * true_price[close_mask]
        loss[ix] = risk
        return loss

    guesses = np.linspace(5000, 50000, 70)
    risks = np.linspace(30000, 150000, 6)
    expected_loss = lambda guess, risk: showdown_loss(guess, price_trace, risk).mean()

    for _p in risks:
        results = [expected_loss(_g, _p) for _g in guesses]
        plt.plot(guesses, results, label="%d" % _p)

    plt.title("Expected loss of different guesses, \nvarious risk-levels of \
    overestimating")
    plt.legend(loc="upper left", title="Risk parameter")
    plt.xlabel("price bid")
    plt.ylabel("expected loss")
    plt.xlim(5000, 30000);

    import scipy.optimize as sop

    ax = plt.subplot(111)
    for _p in risks:
        _color = next(ax._get_lines.prop_cycler)
        _min_results = sop.fmin(expected_loss, 15000, args=(_p,), disp=False)
        _results = [expected_loss(_g, _p) for _g in guesses]
        plt.plot(guesses, _results, color=_color['color'])
        plt.scatter(_min_results, 0, s=60,
                    color=_color['color'], label="%d" % _p)
        plt.vlines(_min_results, 0, 120000, color=_color['color'], linestyles="--")
        print("minimum at risk %d: %.2f" % (_p, _min_results))

    plt.title("Expected loss & Bayes actions of different guesses, \n \
    various risk-levels of overestimating")
    plt.legend(loc="upper left", scatterpoints=1, title="Bayes action at risk:")
    plt.xlabel("price guess")
    plt.ylabel("expected loss")
    plt.xlim(7000, 30000)
    plt.ylim(-1000, 80000);
  • Machine learning via Bayesian methods
    $R_i(x) = \alpha_i + \beta_i x + \epsilon$
    where $\epsilon \sim \text{Normal}(0, \sigma_i)$ and $i$ indexes our posterior samples.
    $\arg\min_r \; E_{R(x)}[L(R(x), r)]$
    • Example: financial prediction
      figsize(12.5, 4)

      def stock_loss(true_return, yhat, alpha=100.):
          if true_return * yhat < 0:
              # opposite signs, not good
              return alpha * yhat**2 - np.sign(true_return) * yhat \
                     + abs(true_return)
          else:
              return abs(true_return - yhat)

      true_value = .05
      pred = np.linspace(-.04, .12, 75)
      plt.plot(pred, [stock_loss(true_value, _p) for _p in pred],
               label="Loss associated with\n prediction if true value = 0.05", lw=3)
      plt.vlines(0, 0, .25, linestyles="--")
      plt.xlabel("prediction")
      plt.ylabel("loss")
      plt.xlim(-0.04, .12)
      plt.ylim(0, 0.25)

      true_value = -.02
      plt.plot(pred, [stock_loss(true_value, _p) for _p in pred], alpha=0.6,
               label="Loss associated with\n prediction if true value = -0.02", lw=3)
      plt.legend()
      plt.title("Stock returns loss if true value = 0.05, -0.02");

      ## Code to create artificial data
      N = 100
      X = 0.025 * np.random.randn(N)
      Y = 0.5 * X + 0.01 * np.random.randn(N)

      ls_coef_ = np.cov(X, Y)[0, 1] / np.var(X)
      ls_intercept = Y.mean() - ls_coef_ * X.mean()

      plt.scatter(X, Y, c="k")
      plt.xlabel("trading signal")
      plt.ylabel("returns")
      plt.title("Empirical returns vs trading signal")
      plt.plot(X, ls_coef_ * X + ls_intercept, label="Least-squares line")
      plt.xlim(X.min(), X.max())
      plt.ylim(Y.min(), Y.max())
      plt.legend(loc="upper left");

      import pymc3 as pm

      with pm.Model() as model:
          std = pm.Uniform("std", 0, 100)
          beta = pm.Normal("beta", mu=0, sd=100)
          alpha = pm.Normal("alpha", mu=0, sd=100)
          mean = pm.Deterministic("mean", alpha + beta * X)
          obs = pm.Normal("obs", mu=mean, sd=std, observed=Y)
          trace = pm.sample(100000, step=pm.Metropolis())
          burned_trace = trace[20000:]

      pm.plots.traceplot(trace=burned_trace, varnames=["std", "beta", "alpha"])
      pm.plot_posterior(trace=burned_trace, varnames=["std", "beta", "alpha"], kde_plot=True);

      figsize(12.5, 6)
      from scipy.optimize import fmin

      def stock_loss(price, pred, coef=500):
          """vectorized for numpy"""
          sol = np.zeros_like(price)
          ix = price * pred < 0
          sol[ix] = coef * pred**2 - np.sign(price[ix]) * pred + abs(price[ix])
          sol[~ix] = abs(price[~ix] - pred)
          return sol

      std_samples = burned_trace["std"]
      alpha_samples = burned_trace["alpha"]
      beta_samples = burned_trace["beta"]

      N = std_samples.shape[0]
      noise = std_samples * np.random.randn(N)
      possible_outcomes = lambda signal: alpha_samples + beta_samples * signal + noise

      opt_predictions = np.zeros(50)
      trading_signals = np.linspace(X.min(), X.max(), 50)
      for i, _signal in enumerate(trading_signals):
          _possible_outcomes = possible_outcomes(_signal)
          tomin = lambda pred: stock_loss(_possible_outcomes, pred).mean()
          opt_predictions[i] = fmin(tomin, 0, disp=False)

      plt.xlabel("trading signal")
      plt.ylabel("prediction")
      plt.title("Least-squares prediction vs. Bayes action prediction")
      plt.plot(X, ls_coef_ * X + ls_intercept, label="Least-squares prediction")
      plt.xlim(X.min(), X.max())
      plt.plot(trading_signals, opt_predictions, label="Bayes action prediction")
      plt.legend(loc="upper left");
  • Conclusion
  • References

6. Getting Our Priorities Straight

  • Introduction
  • Subjective vs. objective priors
    $\mu_p = \dfrac{1}{N} \sum_{i=0}^{N} X_i$
  • Useful priors to know about
    • Gamma distribution
      $\text{Exp}(\beta) \sim \text{Gamma}(1, \beta)$
      $f(x \mid \alpha, \beta) = \dfrac{\beta^{\alpha} x^{\alpha - 1} e^{-\beta x}}{\Gamma(\alpha)}$
    • Beta distribution
      $f_X(x \mid \alpha, \beta) = \dfrac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)}$
      Observation: $X \sim \text{Binomial}(N, p)$
      Posterior: $\text{Beta}(1 + X, 1 + N - X)$
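
      This conjugacy can be verified numerically with scipy (a small sketch; the numbers are made up):
      from scipy import stats

      # Uniform Beta(1, 1) prior on p; observe X = 7 successes in N = 10 trials.
      # The conjugate posterior is Beta(1 + X, 1 + N - X) = Beta(8, 4).
      N, X = 10, 7
      posterior = stats.beta(1 + X, 1 + N - X)
      print(posterior.mean())          # ~0.667
      print(posterior.interval(0.95))  # a 95% credible interval for p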
  • Example: Bayesian Multi-Armed Bandits (MAB)
    rand = np.random.rand

    class Bandits(object):
        """
        This class represents N bandit machines.

        parameters:
            p_array: a (n,) Numpy array of probabilities >0, <1.

        methods:
            pull( i ): return the results, 0 or 1, of pulling
                       the ith bandit.
        """
        def __init__(self, p_array):
            self.p = p_array
            self.optimal = np.argmax(p_array)

        def pull(self, i):
            # i is which arm to pull
            return np.random.rand() < self.p[i]

        def __len__(self):
            return len(self.p)

    class BayesianStrategy(object):
        """
        Implements an online, learning strategy to solve
        the Multi-Armed Bandit problem.

        parameters:
            bandits: a Bandit class with .pull method

        methods:
            sample_bandits(n): sample and train on n pulls.

        attributes:
            N: the cumulative number of samples
            choices: the historical choices as a (N,) array
            bb_score: the historical score as a (N,) array
        """
        def __init__(self, bandits):
            self.bandits = bandits
            n_bandits = len(self.bandits)
            self.wins = np.zeros(n_bandits)
            self.trials = np.zeros(n_bandits)
            self.N = 0
            self.choices = []
            self.bb_score = []

        def sample_bandits(self, n=1):
            bb_score = np.zeros(n)
            choices = np.zeros(n)

            for k in range(n):
                # sample from the bandits's priors, and select the largest sample
                choice = np.argmax(np.random.beta(1 + self.wins, 1 + self.trials - self.wins))

                # sample the chosen bandit
                result = self.bandits.pull(choice)

                # update priors and score
                self.wins[choice] += result
                self.trials[choice] += 1
                bb_score[k] = result
                self.N += 1
                choices[k] = choice

            self.bb_score = np.r_[self.bb_score, bb_score]
            self.choices = np.r_[self.choices, choices]
            return

    import scipy.stats as stats

    figsize(11.0, 10)
    beta = stats.beta
    x = np.linspace(0.001, .999, 200)

    def plot_priors(bayesian_strategy, prob, lw=3, alpha=0.2, plt_vlines=True):
        ## plotting function
        wins = bayesian_strategy.wins
        trials = bayesian_strategy.trials
        for i in range(prob.shape[0]):
            y = beta(1 + wins[i], 1 + trials[i] - wins[i])
            p = plt.plot(x, y.pdf(x), lw=lw)
            c = p[0].get_markeredgecolor()
            plt.fill_between(x, y.pdf(x), 0, color=c, alpha=alpha,
                             label="underlying probability: %.2f" % prob[i])
            if plt_vlines:
                plt.vlines(prob[i], 0, y.pdf(prob[i]),
                           colors=c, linestyles="--", lw=2)
            plt.autoscale(tight=True)
            plt.title("Posteriors After %d pull" % bayesian_strategy.N +
                      "s" * (bayesian_strategy.N > 1))
            plt.autoscale(tight=True)
        return

    # Start Main Code
    hidden_prob = np.array([0.85, 0.60, 0.75])
    bandits = Bandits(hidden_prob)
    bayesian_strat = BayesianStrategy(bandits)

    draw_samples = [1, 1, 3, 10, 10, 25, 50, 100, 200, 600]
    for j, i in enumerate(draw_samples):
        plt.subplot(5, 2, j + 1)
        bayesian_strat.sample_bandits(i)
        plot_priors(bayesian_strat, hidden_prob)
        # plt.legend()
        plt.autoscale(tight=True)
    plt.tight_layout()
  • Eliciting priors from domain experts
    $w_{\text{opt}} = \arg\max_w \dfrac{1}{N} \sum_{i=0}^{N} \left( \mu_i^T w - \dfrac{\lambda}{2} w^T \Sigma_i w \right)$
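    A minimal numpy sketch of optimizing this objective over posterior samples (all names and numbers here are made up for illustration):
    import numpy as np

    rng = np.random.default_rng(0)
    n_samples, lam = 500, 1.0
    # hypothetical posterior samples of the mean-return vector and the
    # covariance matrix for two assets
    mus = rng.normal([0.01, 0.03], 0.005, size=(n_samples, 2))
    sigmas = np.array([np.diag([0.02, 0.04])] * n_samples)

    def objective(w):
        # average over posterior samples of mu_i^T w - (lambda/2) w^T Sigma_i w
        return np.mean(mus @ w - 0.5 * lam * np.einsum('j,ijk,k->i', w, sigmas, w))

    # brute-force search over long-only weights that sum to 1
    candidates = [np.array([a, 1 - a]) for a in np.linspace(0, 1, 101)]
    w_opt = max(candidates, key=objective)
    print(w_opt, objective(w_opt))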

  • Conjugate priors
  • The Jeffreys prior
  • The effect of the prior as N increases
    $p(\theta \mid X) \propto \underbrace{p(X \mid \theta)}_{\text{likelihood}} \cdot \underbrace{p(\theta)}_{\text{prior}}$
    $\log(p(\theta \mid X)) = c + L(\theta; X) + \log(p(\theta))$
    • Even when we start from different priors, the posteriors converge as the sample size grows
      figsize(12.5, 15)

      p = 0.6
      beta1_params = np.array([1., 1.])
      beta2_params = np.array([2, 10])
      beta = stats.beta

      x = np.linspace(0.00, 1, 125)
      data = stats.bernoulli.rvs(p, size=500)

      plt.figure()
      for i, N in enumerate([0, 4, 8, 32, 64, 128, 500]):
          s = data[:N].sum()
          plt.subplot(8, 1, i + 1)
          params1 = beta1_params + np.array([s, N - s])
          params2 = beta2_params + np.array([s, N - s])
          y1, y2 = beta.pdf(x, *params1), beta.pdf(x, *params2)
          plt.plot(x, y1, label=r"flat prior", lw=3)
          plt.plot(x, y2, label="biased prior", lw=3)
          plt.fill_between(x, 0, y1, color="#348ABD", alpha=0.15)
          plt.fill_between(x, 0, y2, color="#A60628", alpha=0.15)
          plt.legend(title="N=%d" % N)
          plt.vlines(p, 0.0, 7.5, linestyles="--", linewidth=1)
          # plt.ylim(0, 10)
  • Conclusion
  • Appendix
  • References

7. Bayesian A/B Testing

  • Introduction
  • Overview of conversion testing
    visitors_to_A = 1300
    visitors_to_B = 1275

    conversions_from_A = 120
    conversions_from_B = 125

    from scipy.stats import beta

    alpha_prior = 1
    beta_prior = 1

    posterior_A = beta(alpha_prior + conversions_from_A,
                       beta_prior + visitors_to_A - conversions_from_A)
    posterior_B = beta(alpha_prior + conversions_from_B,
                       beta_prior + visitors_to_B - conversions_from_B)

    samples = 20000  # We want this to be large to get a better approximation.
    samples_posterior_A = posterior_A.rvs(samples)
    samples_posterior_B = posterior_B.rvs(samples)

    print((samples_posterior_A > samples_posterior_B).mean())

    %matplotlib inline
    from IPython.core.pylabtools import figsize
    from matplotlib import pyplot as plt
    import matplotlib
    import numpy as np

    figsize(12.5, 4)
    plt.rcParams['savefig.dpi'] = 300
    plt.rcParams['figure.dpi'] = 300

    x = np.linspace(0, 1, 500)
    plt.plot(x, posterior_A.pdf(x), label='posterior of A')
    plt.plot(x, posterior_B.pdf(x), label='posterior of B')
    plt.xlabel('Value')
    plt.ylabel('Density')
    # plt.title("Posterior distributions of the conversion rates of pages A and B")
    plt.legend();

    plt.plot(x, posterior_A.pdf(x), label='posterior of A')
    plt.plot(x, posterior_B.pdf(x), label='posterior of B')
    plt.xlim(0.05, 0.15)
    plt.xlabel('Value')
    plt.ylabel('Density')
    # plt.title("Posterior distributions of the conversion rates of pages A and B (zoomed in)")
    plt.legend();
  • Adding a linear loss function
    from numpy.random import dirichlet

    N = 1000
    N_79 = 10
    N_49 = 46
    N_25 = 80
    N_0 = N - (N_79 + N_49 + N_25)  # remaining visitors who bought nothing

    observations = np.array([N_79, N_49, N_25, N_0])

    prior_parameters = np.array([1, 1, 1, 1])
    posterior_samples = dirichlet(prior_parameters + observations, size=10000)

    print("Two random samples from the posterior:")
    print(posterior_samples[0])
    print(posterior_samples[1])

    for i, label in enumerate(['p_79', 'p_49', 'p_25', 'p_0']):
        ax = plt.hist(posterior_samples[:, i], bins=50, label=label, histtype='stepfilled')
    plt.xlabel('Value')
    plt.ylabel('Density')
    # plt.title("Posterior distributions of the different price choices")
    plt.legend();

    def expected_revenue(P):
        return 79 * P[:, 0] + 49 * P[:, 1] + 25 * P[:, 2] + 0 * P[:, 3]

    posterior_expected_revenue = expected_revenue(posterior_samples)
    plt.hist(posterior_expected_revenue, histtype='stepfilled',
             label='expected revenue', bins=50)
    plt.xlabel('Value')
    plt.ylabel('Density')
    # plt.title("Posterior distribution of the expected revenue")
    plt.legend();

    N_A = 1000
    N_A_79 = 10
    N_A_49 = 46
    N_A_25 = 80
    N_A_0 = N_A - (N_A_79 + N_A_49 + N_A_25)
    observations_A = np.array([N_A_79, N_A_49, N_A_25, N_A_0])

    N_B = 2000
    N_B_79 = 45
    N_B_49 = 84
    N_B_25 = 200
    N_B_0 = N_B - (N_B_79 + N_B_49 + N_B_25)
    observations_B = np.array([N_B_79, N_B_49, N_B_25, N_B_0])

    prior_parameters = np.array([1, 1, 1, 1])

    posterior_samples_A = dirichlet(prior_parameters + observations_A, size=10000)
    posterior_samples_B = dirichlet(prior_parameters + observations_B, size=10000)

    posterior_expected_revenue_A = expected_revenue(posterior_samples_A)
    posterior_expected_revenue_B = expected_revenue(posterior_samples_B)

    plt.hist(posterior_expected_revenue_A, histtype='stepfilled',
             label='expected revenue of A', bins=50)
    plt.hist(posterior_expected_revenue_B, histtype='stepfilled',
             label='expected revenue of B', bins=50, alpha=0.8)
    plt.xlabel('Value')
    plt.ylabel('Density')
    # plt.title("Posterior distributions of the expected revenue of A and B")
    plt.legend();

    p = (posterior_expected_revenue_B > posterior_expected_revenue_A).mean()
    print("Probability that page B has a higher revenue than page A: %.3f" % p)

    posterior_diff = posterior_expected_revenue_B - posterior_expected_revenue_A
    plt.hist(posterior_diff, histtype='stepfilled', color='#7A68A6',
             label='difference in revenue between B and A', bins=50)
    plt.vlines(0, 0, 700, linestyles='--')
    plt.xlabel('Value')
    plt.ylabel('Density')
    # plt.title("Posterior distribution of the delta between the expected revenues of A and B")
    plt.legend();
  • Beyond conversion rates: t-tests (a minimal sketch below)
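    When the quantity of interest is a group mean rather than a conversion rate, the same machinery applies with a Normal likelihood. A minimal PyMC3 sketch of a Bayesian two-group comparison; the data and the priors here are made up for illustration:
    import numpy as np
    import pymc3 as pm

    # made-up data: e.g., time-on-page for two page variants
    rng = np.random.RandomState(0)
    data_A = rng.normal(30, 4, size=250)
    data_B = rng.normal(26, 7, size=250)

    with pm.Model() as model:
        mu_A = pm.Normal("mu_A", mu=data_A.mean(), sd=50)  # wide priors on the means
        mu_B = pm.Normal("mu_B", mu=data_B.mean(), sd=50)
        std_A = pm.Uniform("std_A", 0.1, 100)              # vague priors on the spreads
        std_B = pm.Uniform("std_B", 0.1, 100)
        pm.Normal("obs_A", mu=mu_A, sd=std_A, observed=data_A)
        pm.Normal("obs_B", mu=mu_B, sd=std_B, observed=data_B)
        trace = pm.sample(10000, step=pm.Metropolis())

    diff = trace["mu_A"] - trace["mu_B"]
    print((diff > 0).mean())  # posterior probability that A's mean exceeds B's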
  • Estimating the increase
    figsize(12, 4)

    visitors_to_A = 1275
    visitors_to_B = 1300

    conversions_from_A = 22
    conversions_from_B = 12

    alpha_prior = 1
    beta_prior = 1

    posterior_A = beta(alpha_prior + conversions_from_A,
                       beta_prior + visitors_to_A - conversions_from_A)
    posterior_B = beta(alpha_prior + conversions_from_B,
                       beta_prior + visitors_to_B - conversions_from_B)

    samples = 20000
    samples_posterior_A = posterior_A.rvs(samples)
    samples_posterior_B = posterior_B.rvs(samples)

    # helper assumed by these notes (not defined in the original): a histogram shortcut
    def _hist(data, label, **kwargs):
        return plt.hist(data, bins=40, label=label, alpha=0.4, **kwargs)

    _hist(samples_posterior_A, 'A')
    _hist(samples_posterior_B, 'B')
    plt.xlabel('Value')
    plt.ylabel('Density')
    # plt.title("Posterior distributions of the conversion rates of pages A and B")
    plt.legend();

    def relative_increase(a, b):
        return (a - b) / b

    posterior_rel_increase = relative_increase(samples_posterior_A,
                                               samples_posterior_B)
    plt.xlabel('Value')
    plt.ylabel('Density')
    # plt.title("Posterior distribution of the relative increase of page A's conversion rate over page B's")
    _hist(posterior_rel_increase, 'relative increase', color='#7A68A6');

    print((posterior_rel_increase > 0.2).mean())
    print((posterior_rel_increase > 0.5).mean())

    mean = posterior_rel_increase.mean()
    median = np.percentile(posterior_rel_increase, 50)
    conservative_percentile = np.percentile(posterior_rel_increase, 30)

    _hist(posterior_rel_increase, '', color='#7A68A6');
    plt.vlines(mean, 0, 2500, linestyles='-.', label='mean')
    plt.vlines(median, 0, 2500, linestyles=':', label='median', lw=3)
    plt.vlines(conservative_percentile, 0, 2500, linestyles='--',
               label='30th percentile')
    plt.xlabel('Value')
    plt.ylabel('Density')
    # plt.title("Different summary statistics of the posterior distribution of the relative increase")
    plt.legend();
  • Conclusion
  • References

Appendix A

  • Python, PyMC
  • Jupyter Notebook
  • Hands-on with Reddit

References
